| < draft-hopps-ecmp-algo-analysis-03.txt | draft-hopps-ecmp-algo-analysis-04.txt > | |||
|---|---|---|---|---|
| Internet Engineering Task Force Christian E. Hopps | Internet Engineering Task Force Christian E. Hopps | |||
| INTERNET-DRAFT Merit Network | INTERNET-DRAFT Merit Network | |||
| Expires September 1999 13 April 1999 | Expires July 2000 7 February 2000 | |||
| Analysis of an Equal-Cost Multi-Path Algorithm | Analysis of an Equal-Cost Multi-Path Algorithm | |||
| <draft-hopps-ecmp-algo-analysis-03.txt> | <draft-hopps-ecmp-algo-analysis-04.txt> | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 33 ¶ | skipping to change at page 1, line 32 ¶ | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (1999). All Rights Reserved. | Copyright (C) The Internet Society (2000). All Rights Reserved. | |||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| Abstract | Abstract | |||
| Equal-cost multi-path (ECMP) is a routing technique for routing | Equal-cost multi-path (ECMP) is a routing technique for routing | |||
| packets along multiple paths of equal cost. The forwarding engine | packets along multiple paths of equal cost. The forwarding engine | |||
| identifies paths by next-hop. When forwarding a packet the router | identifies paths by next-hop. When forwarding a packet the router | |||
| must decide which next-hop (path) to use. This document gives an | must decide which next-hop (path) to use. This document gives an | |||
| analysis of one method for making that decision. The analysis | analysis of one method for making that decision. The analysis | |||
| includes the performance of the algorithm and the disruption caused | includes the performance of the algorithm and the disruption caused | |||
| by changes to the set of next-hops. | by changes to the set of next-hops. | |||
| 1. Hash-Threshold | 1. Hash-Threshold | |||
| One method for determining which next-hop to use when routing with | One method for determining which next-hop to use when routing with | |||
| ECMP can be called hash-threshold. The router first selects a key by | ECMP can be called hash-threshold. The router first selects a key by | |||
| performing a hash (e.g., modulo-K where K is large, or CRC16) over | performing a hash (e.g., CRC16) over the packet header fields that | |||
| the packet header fields that identify a flow. The N next-hops have | identify a flow. The N next-hops have been assigned unique regions in | |||
| been assigned unique regions in the key space. The router uses the | the key space. The router uses the key to determine which region and | |||
| key to determine which region and thus which next-hop to use. | thus which next-hop to use. | |||
| As an example of hash-threshold, upon receiving a packet the router | As an example of hash-threshold, upon receiving a packet the router | |||
| performs a CRC16 on the packet's header fields that define the flow | performs a CRC16 on the packet's header fields that define the flow | |||
| (e.g., the source and destination fields of the packet), this is the | (e.g., the source and destination fields of the packet), this is the | |||
| key. Say for this destination there are 4 next-hops to choose from. | key. Say for this destination there are 4 next-hops to choose from. | |||
| Each next-hop is assigned a region in 16 bit space (the key space). | Each next-hop is assigned a region in 16 bit space (the key space). | |||
| For equal usage the router may have chosen to divide it up evenly so | For equal usage the router may have chosen to divide it up evenly so | |||
| each region is 65536/4 or 16k large. The next-hop is chosen by | each region is 65536/4 or 16k large. The next-hop is chosen by | |||
| determining which region contains the key (i.e., the CRC result). | determining which region contains the key (i.e., the CRC result). | |||
| 2. Analysis | 2. Analysis | |||
| There are a few concerns when choosing an algorithm for deciding | There are a few concerns when choosing an algorithm for deciding | |||
| which next-hop to use. One is performance, the computational | which next-hop to use. One is performance, the computational | |||
| requirements to run the algorithm. Another is disruption (i.e., the | requirements to run the algorithm. Another is disruption (i.e., the | |||
| changing of which path a flow uses). Balancing is a third concern; | changing of which path a flow uses). Balancing is a third concern; | |||
| however since the algorithm's balancing characteristics are directly | however, since the algorithm's balancing characteristics are directly | |||
| related to the chosen hash function this analysis does not treat this | related to the chosen hash function this analysis does not treat this | |||
| concern in depth. | concern in depth. | |||
| For this analysis we will assume regions of equal size. If the hash | For this analysis we will assume regions of equal size. If the | |||
| function is uniformly distributed the distribution of flows amongst | output of the hash function is uniformly distributed the distribution | |||
| paths will also be uniform. | of flows amongst paths will also be uniform, and so the algorithm | |||
| will properly implement ECMP. One can implement non-equal-cost | ||||
| multi-path routing by using regions of unequal size; however, non- | ||||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| equal-cost multi-path routing is outside the scope of this document. | ||||
| 2.1. Performance | 2.1. Performance | |||
| The performance of the hash-threshold algorithm can be broken down | The performance of the hash-threshold algorithm can be broken down | |||
| into three parts: selection of regions for the next-hops, obtaining | into three parts: selection of regions for the next-hops, obtaining | |||
| the key and comparing the key to the regions to decide which next-hop | the key and comparing the key to the regions to decide which next-hop | |||
| to use. | to use. | |||
| Since regions are restricted to be of equal size the calculation of | ||||
| region boundaries is trivial. The boundaries can be calculated as | ||||
| follows: | ||||
| i = 0; | ||||
| regionssize = keyspace.size / #{ next-hops } | ||||
| for n in { next-hops } | ||||
| n.start = i * regionsize; | ||||
| n.end = n.start + regionsize; | ||||
| i = i + 1 | ||||
| done | ||||
| This calculation is O(N). Furthermore the calculation can be done | ||||
| when the next-hops are added to or removed from the destination | ||||
| route. | ||||
| The algorithm doesn't specify the hash function used to obtain the | The algorithm doesn't specify the hash function used to obtain the | |||
| key. Its performance in this area will be exactly the performance of | key. Its performance in this area will be exactly the performance of | |||
| the hash function. It is presumed that if this calculation proves to | the hash function. It is presumed that if this calculation proves to | |||
| be a concern it can be done in hardware parallel to other operations | be a concern it can be done in hardware parallel to other operations | |||
| that need to complete before deciding which next-hop to use. | that need to complete before deciding which next-hop to use. | |||
| Finally the next-hop must be chosen. This is done by determining | Since regions are restricted to be of equal size the calculation of | |||
| which region contains the key. The time required to do this is | region boundaries is trivial. Each boundary is exactly regionsize | |||
| dependent on the way the next-hops are organized in memory. The | away from the previous boundary starting from 0 for the first region. | |||
| problem reduces to a search. For example if the next-hops are stored | As we will show, for equal sized regions, we don't need to store the | |||
| as a linked list the time is O(N) as the router must traverse the | boundary values. | |||
| list comparing each next-hop's region boundaries against the key. If | ||||
| the next-hops are stored as an ordered array a binary search can be | ||||
| used yielding O(logN). | ||||
| If the number of next-hops is limited to a fixed maximum the | To choose the next-hop we must determine which region contains the | |||
| comparison can be done in parallel in hardware yielding O(1). | key. Because the regions are of equal size determining which region | |||
| contains the key is a simple division operation. | ||||
| regionsize = keyspace.size / #{nexthops} | ||||
| region = key / regionsize; | ||||
| Thus the time required to find the next-hop is dependent on the way | ||||
| the next-hops are organized in memory. The obvious use of an array | ||||
| indexed by region yields O(1). | ||||
| 2.2. Disruption | 2.2. Disruption | |||
| Protocols such as TCP perform better if the path they flow along does | Protocols such as TCP perform better if the path they flow along does | |||
| not change while the stream is connected. Disruption is the | not change while the stream is connected. Disruption is the | |||
| measurement of how many flows have their paths changed due to some | measurement of how many flows have their paths changed due to some | |||
| change in the router. We measure disruption as the fraction of total | change in the router. We measure disruption as the fraction of total | |||
| flows whose path changes in response to some change in the router. | flows whose path changes in response to some change in the router. | |||
| This can become important if one or more of the paths is flapping. | ||||
| For a description of disruption and how it affects protocols such as | ||||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| TCP see [1]. | ||||
| Some algorithms such as round-robin (i.e., upon receiving a packet | Some algorithms such as round-robin (i.e., upon receiving a packet | |||
| the least recently used next-hop is chosen) are disruptive regardless | the least recently used next-hop is chosen) are disruptive regardless | |||
| of any change in the router. Clearly this is not the case with hash- | of any change in the router. Clearly this is not the case with hash- | |||
| threshold. As long as the region boundaries remain unchanged the | threshold. As long as the region boundaries remain unchanged the | |||
| same next-hop will be chosen for a given flow. | same next-hop will be chosen for a given flow. | |||
| Because we have required regions to be equal in size the only reason | Because we have required regions to be equal in size the only reason | |||
| for a change in region boundaries is the addition or removal of a | for a change in region boundaries is the addition or removal of a | |||
| next-hop. In this case the regions must all grow or shrink to fill | next-hop. In this case the regions must all grow or shrink to fill | |||
| skipping to change at page 5, line 5 ¶ | skipping to change at page 5, line 5 ¶ | |||
| in region 4 and 1/4 of region 4 is in region 5. Since each of the | in region 4 and 1/4 of region 4 is in region 5. Since each of the | |||
| original regions represent 1/5 of the flows, the total disruption is | original regions represent 1/5 of the flows, the total disruption is | |||
| 1/5*(1/4 + 1/2 + 1/2 + 1/4) or 3/10. | 1/5*(1/4 + 1/2 + 1/2 + 1/4) or 3/10. | |||
| Note that the disruption to flows when adding a region is equivalent | Note that the disruption to flows when adding a region is equivalent | |||
| to that of removing a region. That is, we are considering the | to that of removing a region. That is, we are considering the | |||
| fraction of total flows that changes regions when moving from N to | fraction of total flows that changes regions when moving from N to | |||
| N-1 regions, and that same fraction of flows will change when moving | N-1 regions, and that same fraction of flows will change when moving | |||
| from N-1 to N regions. | from N-1 to N regions. | |||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| 0123456701234567012345670123456701234567 | 0123456701234567012345670123456701234567 | |||
| +-------+-------+-------+-------+-------+ | +-------+-------+-------+-------+-------+ | |||
| | 1 | 2 | 3 | 4 | 5 | | | 1 | 2 | 3 | 4 | 5 | | |||
| +-------+-+-----+---+---+-----+-+-------+ | +-------+-+-----+---+---+-----+-+-------+ | |||
| | 1 | 2 | 3 | 5 | | | 1 | 2 | 3 | 5 | | |||
| +---------+---------+---------+---------+ | +---------+---------+---------+---------+ | |||
| 0123456789012345678901234567890123456789 | 0123456789012345678901234567890123456789 | |||
| Figure 2. Before and after deletion of region 4 | Figure 2. Before and after deletion of region 4 | |||
| skipping to change at page 6, line 5 ¶ | skipping to change at page 6, line 5 ¶ | |||
| K-1 N | K-1 N | |||
| --- i --- (i-K) | --- i --- (i-K) | |||
| disruption = \ --- + \ --- | disruption = \ --- + \ --- | |||
| / (N)(N-1) / (N)(N-1) | / (N)(N-1) / (N)(N-1) | |||
| --- --- | --- --- | |||
| i=1 i=K+1 | i=1 i=K+1 | |||
| We can factor 1/((N)(N-1)) out as it is constant. | We can factor 1/((N)(N-1)) out as it is constant. | |||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| / K-1 N \ | / K-1 N \ | |||
| 1 | --- --- | | 1 | --- --- | | |||
| = --- | \ i + \ (i-K) | | = --- | \ i + \ (i-K) | | |||
| (N)(N-1) | / / | | (N)(N-1) | / / | | |||
| \ --- --- / | \ --- --- / | |||
| 1 i=K+1 | 1 i=K+1 | |||
| We now use the the concrete formulas for the sum of integers. The | We now use the the concrete formulas for the sum of integers. The | |||
| first summation is (K)(K-1)/2. For the second summation notice that | first summation is (K)(K-1)/2. For the second summation notice that | |||
| we are summing the integers from 1 to N-K, thus it is (N-K)(N-K+1)/2. | we are summing the integers from 1 to N-K, thus it is (N-K)(N-K+1)/2. | |||
| skipping to change at page 7, line 5 ¶ | skipping to change at page 7, line 5 ¶ | |||
| 2(N)(N-1) | 2(N)(N-1) | |||
| K*K - K - NK N + 1 | K*K - K - NK N + 1 | |||
| = -------------- + ------- | = -------------- + ------- | |||
| (N)(N-1) 2(N-1) | (N)(N-1) 2(N-1) | |||
| Since we are minimizing for K the right side (N+1)/2(N-1) is constant | Since we are minimizing for K the right side (N+1)/2(N-1) is constant | |||
| as is the denominator (N)(N-1) so we can drop them. To minimize we | as is the denominator (N)(N-1) so we can drop them. To minimize we | |||
| take the derivative. | take the derivative. | |||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| d | d | |||
| -- (K*K - (N+1)K) | -- (K*K - (N+1)K) | |||
| dk | dk | |||
| = 2K - (N+1) | = 2K - (N+1) | |||
| Which is zero when K is (N+1)/2. | Which is zero when K is (N+1)/2. | |||
| The last thing to consider is that K must be an integer. When N is | The last thing to consider is that K must be an integer. When N is | |||
| odd (N+1)/2 will yield an integer, however when N is even (N+1)/2 | odd (N+1)/2 will yield an integer, however when N is even (N+1)/2 | |||
| skipping to change at page 7, line 40 ¶ | skipping to change at page 7, line 42 ¶ | |||
| 3. Comparison to other algorithms | 3. Comparison to other algorithms | |||
| Other algorithms exist to decide which next-hop to use. These | Other algorithms exist to decide which next-hop to use. These | |||
| algorithms all have different performance and disruptive | algorithms all have different performance and disruptive | |||
| characteristics. Of these algorithms we will only consider ones that | characteristics. Of these algorithms we will only consider ones that | |||
| are not disruptive by design (i.e., if no change to the set of next- | are not disruptive by design (i.e., if no change to the set of next- | |||
| hops occurs the path a flow takes remains the same). This will | hops occurs the path a flow takes remains the same). This will | |||
| exclude round-robin and random choice. We will look at modulo-N and | exclude round-robin and random choice. We will look at modulo-N and | |||
| highest random weight. | highest random weight. | |||
| Modulo-N is a simpler form of hash-threshold. Given N next-hops the | Modulo-N is a "simpler" form of hash-threshold. Given N next-hops | |||
| hash function includes a final modulo-N which directly maps the | the packet header fields which describe the flow are run through a | |||
| result to one of the next-hops. This operation is the fastest of the | hash function. A final modulo-N is applied to the output of the hash. | |||
| three we consider, however if a next-hop is added or removed the | This result then directly maps to one of the next-hops. Modulo-N is | |||
| disruption is (N-1)/N. | the most disruptive of the algorithms; if a next-hop is added or | |||
| removed the disruption is (N-1)/N. The performance of Modulo-N is | ||||
| Highest random weight (HRW) is another comparative method similar to | Draft Analysis of an ECMP Algorithm February 2000 | |||
| hash-threshold. The router seeds a pseudo-random number generator | ||||
| with the packet header fields which describe the flow and the next- | equivalent to hash-threshold. | |||
| hop to obtain a weight. The next-hop which receives the highest | ||||
| weight is selected. The advantage with using HRW is that it has | Highest random weight (HRW) is a comparative method similar in some | |||
| minimal disruption (i.e., disruption due to adding or removing a | ways to hash-threshold with non-fixed sized regions. For each next- | |||
| next-hop is always 1/N.) The disadvantage with HRW is an only | hop, the router seeds a pseudo-random number generator with the | |||
| slightly more complex function to choose the next-hop. A description | packet header fields which describe the flow and the next-hop to | |||
| of HRW along with comparisons to other methods can be found in [1]. | obtain a weight. The next-hop which receives the highest weight is | |||
| selected. The advantage with using HRW is that it has minimal | ||||
| disruption (i.e., disruption due to adding or removing a next-hop is | ||||
| always 1/N.) The disadvantage with HRW is that the next-hop | ||||
| selection is more expensive than hash-threshold. A description of | ||||
| HRW along with comparisons to other methods can be found in [2]. | ||||
| Although not used for next-hop calculation an example usage of HRW | Although not used for next-hop calculation an example usage of HRW | |||
| can be found in [2]. | can be found in [3]. | |||
| If the complexity of HRW's next-hop selection processes is acceptable | Since each of modulo-N, hash-threshold and HRW require a hash on the | |||
| packet header fields which define a flow, we can factor the | ||||
| performance of the hash out of the comparison. If the hash can not | ||||
| be done inexpensively (e.g., in hardware) it too must be considered | ||||
| when using any of the above methods. | ||||
| The lookup performance for hash-threshold, like modulo-N is an | ||||
| optimal O(1). HRW's lookup performance is O(N). | ||||
| Disruptive behavior is the opposite of performance. HRW is best with | ||||
| 1/N. Hash-threshold is between 1/4 and 1/2. Finally Modulo-N is | ||||
| (N-1)/N. | ||||
| If the complexity of HRW's next-hop selection process is acceptable | ||||
| we think it should be considered as an alternative to hash-threshold. | we think it should be considered as an alternative to hash-threshold. | |||
| This could be the case when, for example, per-flow state is kept and | ||||
| thus the next-hop choice is made infrequently. | ||||
| However, when HRW's next-hop selection is seen as too expensive the | ||||
| obvious choice is hash-threshold as it performs as well as modulo-N | ||||
| and is less disruptive. | ||||
| 4. Security Considerations | 4. Security Considerations | |||
| This document is an analysis of an algorithm used to implement an | This document is an analysis of an algorithm used to implement an | |||
| ECMP routing decision. This analysis does not directly effect the | ECMP routing decision. This analysis does not directly affect the | |||
| security of the Internet Infrastructure. | security of the Internet Infrastructure. | |||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| 5. References | 5. References | |||
| [1] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to | [1] Thaler, D., and Hopps, C., "Multipath Issues in Unicast and | |||
| Multicast", Work in progress, June 1999. | ||||
| [2] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to | ||||
| Increase Hit Rates", IEEE/ACM Transactions on Networking, February | Increase Hit Rates", IEEE/ACM Transactions on Networking, February | |||
| 1998. | 1998. | |||
| [2] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., | [3] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., | |||
| Handley, M., Jacobson, V., Liu, C., Sharma, P., and L. Wei, | Handley, M., Jacobson, V., Liu, C., Sharma, P., and L. Wei, | |||
| "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol | "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol | |||
| Specification", RFC 2362, June 1998. | Specification", RFC 2362, June 1998. | |||
| 6. Author's Address | 6. Author's Address | |||
| Christian E. Hopps | Christian E. Hopps | |||
| Merit Network | Merit Network | |||
| 4251 Plymouth Road, Suite C. | 4251 Plymouth Road, Suite C. | |||
| Ann Arbor, MI 48105 | Ann Arbor, MI 48105 | |||
| Phone: +1 734 936 0291 | Phone: +1 734 936 0291 | |||
| EMail: chopps@merit.edu | EMail: chopps@merit.edu | |||
| 7. Full Copyright Statement | 7. Full Copyright Statement | |||
| Copyright (C) The Internet Society (1999). All Rights Reserved. | Copyright (C) The Internet Society 2000. All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain it | others, and derivative works that comment on or otherwise explain it | |||
| or assist in its implementation may be prepared, copied, published | or assist in its implementation may be prepared, copied, published | |||
| and distributed, in whole or in part, without restriction of any | and distributed, in whole or in part, without restriction of any | |||
| kind, provided that the above copyright notice and this paragraph are | kind, provided that the above copyright notice and this paragraph are | |||
| included on all such copies and derivative works. However, this | included on all such copies and derivative works. However, this | |||
| document itself may not be modified in any way, such as by removing | document itself may not be modified in any way, such as by removing | |||
| the copyright notice or references to the Internet Society or other | the copyright notice or references to the Internet Society or other | |||
| Internet organizations, except as needed for the purpose of | Internet organizations, except as needed for the purpose of | |||
| developing Internet standards in which case the procedures for | developing Internet standards in which case the procedures for | |||
| copyrights defined in the Internet languages other than English. | copyrights defined in the Internet Standards process must be | |||
| followed, or as required to translate it into languages other than | ||||
| English. | ||||
| The limited permissions granted above are perpetual and will not be | The limited permissions granted above are perpetual and will not be | |||
| revoked by the Internet Society or its successors or assigns. | revoked by the Internet Society or its successors or assigns. | |||
| Draft Analysis of an ECMP Algorithm February 2000 | ||||
| This document and the information contained herein is provided on an | ||||
| "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | ||||
| TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | ||||
| BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | ||||
| HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | ||||
| MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | ||||
| End of changes. 27 change blocks. | ||||
| 60 lines changed or deleted | 100 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||