| < draft-thaler-multipath-04.txt | draft-thaler-multipath-05.txt > | |||
|---|---|---|---|---|
| Internet Engineering Task Force D. Thaler | Internet Engineering Task Force D. Thaler | |||
| INTERNET-DRAFT Microsoft | INTERNET-DRAFT Microsoft | |||
| Expires December 1999 C. Hopps | Expires July 2000 C. Hopps | |||
| Merit Network | Merit Network | |||
| 21 June 1999 | 7 February 2000 | |||
| Multipath Issues in Unicast and Multicast Next-Hop Selection | Multipath Issues in Unicast and Multicast Next-Hop Selection | |||
| <draft-thaler-multipath-04.txt> | <draft-thaler-multipath-05.txt> | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with all | This document is an Internet-Draft and is in full conformance with all | |||
| provisions of Section 10 of RFC2026. | provisions of Section 10 of RFC2026. | |||
| This document is an Internet Draft. Internet Drafts are working | This document is an Internet Draft. Internet Drafts are working | |||
| documents of the Internet Engineering Task Force (IETF), its Areas, and | documents of the Internet Engineering Task Force (IETF), its Areas, and | |||
| its Working Groups. Note that other groups may also distribute working | its Working Groups. Note that other groups may also distribute working | |||
| documents as Internet Drafts. | documents as Internet Drafts. | |||
| Internet Drafts are valid for a maximum of six months and may be | Internet Drafts are valid for a maximum of six months and may be | |||
| updated, replaced, or obsoleted by other documents at any time. It is | updated, replaced, or obsoleted by other documents at any time. It is | |||
| inappropriate to use Internet Drafts as reference material or to cite | inappropriate to use Internet Drafts as reference material or to cite | |||
| them other than as a "work in progress". | them other than as a "work in progress". | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (1999). All Rights Reserved. | Copyright (C) The Internet Society (2000). All Rights Reserved. | |||
| 1. Introduction | 1. Introduction | |||
| Various routing protocols, including OSPF [1] and ISIS, explicitly allow | Various routing protocols, including OSPF [1] and ISIS, explicitly allow | |||
| "Equal-Cost Multipath" routing. Some router implementations also allow | "Equal-Cost Multipath" routing. Some router implementations also allow | |||
| equal-cost multipath usage with RIP and other routing protocols. Using | equal-cost multipath usage with RIP and other routing protocols. Using | |||
| equal-cost multipath means that if multiple equal-cost routes to the | equal-cost multipath means that if multiple equal-cost routes to the | |||
| Draft Multipath Issues February 2000 | ||||
| same destination exist, they can be discovered and used to provide load | same destination exist, they can be discovered and used to provide load | |||
| balancing among redundant paths. | balancing among redundant paths. | |||
| The effect of multipath routing on a forwarder is that the forwarder | The effect of multipath routing on a forwarder is that the forwarder | |||
| potentially has several next-hops for any given destination and must use | potentially has several next-hops for any given destination and must use | |||
| some method to choose which next-hop should be used for a given data | some method to choose which next-hop should be used for a given data | |||
| packet. This memo summarizes current practices, problems, and solutions. | packet. This memo summarizes current practices, problems, and | |||
| solutions. | ||||
| 2. Concerns | 2. Concerns | |||
| Several router implementations allow multipath forwarding. This is | Several router implementations allow multipath forwarding. This is | |||
| sometimes done naively via round-robin, where each packet matching a | sometimes done naively via round-robin, where each packet matching a | |||
| given destination route is forwarded using the subsequent next-hop, in a | given destination route is forwarded using the subsequent next-hop, in a | |||
| round-robin fashion. This does provide a form of load balancing, but | round-robin fashion. This does provide a form of load balancing, but | |||
| there are several problems with approaches such as round-robin or | there are several problems with approaches such as round-robin or | |||
| random: | random: | |||
| Variable Path MTU | Variable Path MTU | |||
| Since each of the redundant paths may have a different MTU, this | Since each of the redundant paths may have a different MTU, this | |||
| means that the overall path MTU can change on a packet-by-packet | means that the overall path MTU can change on a packet-by-packet | |||
| basis, negating the usefulness of path MTU discovery. | basis, negating the usefulness of path MTU discovery. | |||
| skipping to change at page 2, line 44 ¶ | skipping to change at page 3, line 4 ¶ | |||
| one. When three or more packets are received before a "late" | one. When three or more packets are received before a "late" | |||
| packet, TCP enters a mode called "fast-retransmit" [6] which | packet, TCP enters a mode called "fast-retransmit" [6] which | |||
| consumes extra bandwidth (which could potentially cause more loss, | consumes extra bandwidth (which could potentially cause more loss, | |||
| decreasing throughput) as it attempts to unnecessarily retransmit | decreasing throughput) as it attempts to unnecessarily retransmit | |||
| the delayed packet(s). Hence, reordering can be detrimental to | the delayed packet(s). Hence, reordering can be detrimental to | |||
| network performance. | network performance. | |||
| Debugging | Debugging | |||
| Common debugging utilities such as ping and traceroute are much | Common debugging utilities such as ping and traceroute are much | |||
| less reliable in the presence of multiple paths and may even | less reliable in the presence of multiple paths and may even | |||
| Draft Multipath Issues February 2000 | ||||
| present completely wrong results. | present completely wrong results. | |||
| In multicast routing, the problem with multiple paths is that multicast | In multicast routing, the problem with multiple paths is that multicast | |||
| routing protocols prevent loops and duplicates by constructing a single | routing protocols prevent loops and duplicates by constructing a single | |||
| tree to all receivers of the same group address. Multicast routing | tree to all receivers of the same group address. Multicast routing | |||
| protocols deployed today (DVMRP, PIM-DM, PIM-SM) [2] construct | protocols deployed today (DVMRP, PIM-DM, PIM-SM) [2] construct shortest- | |||
| shortest-path trees rooted at either the source, or another router known | path trees rooted at either the source, or another router known as a | |||
| as a Core or Rendezvous Point. Hence, the way they ensure that | Core or Rendezvous Point. Hence, the way they ensure that duplicates | |||
| duplicates will not arise is that a given tree must use only a single | will not arise is that a given tree must use only a single next-hop | |||
| next-hop towards the root of the tree. | towards the root of the tree. | |||
| 3. Requirements | 3. Requirements | |||
| In the remainder of this document, we will use the term "flow" to | In the remainder of this document, we will use the term "flow" to | |||
| represent the granularity at which the router keeps state (if at all) | represent the granularity at which the router keeps state (if at all) | |||
| for classes of traffic. The exact definition of a flow may depend on | for classes of traffic. The exact definition of a flow may depend on | |||
| the actual implementation. For example, a flow might be identified | the actual implementation. For example, a flow might be identified | |||
| solely by destination address, or it might be identified by (source | solely by destination address, or it might be identified by (source | |||
| address, destination address, protocol id) triplet. Hence "flow" is not | address, destination address, protocol id) triplet. Hence "flow" is not | |||
| necessarily synonymous with the term "microflow" as used in RFC 2474 | necessarily synonymous with the term "microflow" as used in RFC 2474 | |||
| skipping to change at page 3, line 46 ¶ | skipping to change at page 4, line 4 ¶ | |||
| Two additional features are desirable: | Two additional features are desirable: | |||
| Minimal disruption | Minimal disruption | |||
| When multipath is used, meaning that multiple routes contribute | When multipath is used, meaning that multiple routes contribute | |||
| valid next-hops, the chances are higher of routes being added and | valid next-hops, the chances are higher of routes being added and | |||
| deleted from consideration than when only the "best" route is used | deleted from consideration than when only the "best" route is used | |||
| (in which case metric changes in alternate routes have no effect on | (in which case metric changes in alternate routes have no effect on | |||
| traffic paths). Since a higher number of routes may actually be | traffic paths). Since a higher number of routes may actually be | |||
| used for forwarding when multipath is in use, the potential for | used for forwarding when multipath is in use, the potential for | |||
| packet reordering and packet loss due to route flaps can be much | packet reordering and packet loss due to route flaps can be much | |||
| Draft Multipath Issues February 2000 | ||||
| greater than when not using multipath. Hence, it is desirable to | greater than when not using multipath. Hence, it is desirable to | |||
| minimize the number of active flows affected by the addition or | minimize the number of active flows affected by the addition or | |||
| deletion of another next-hop. | deletion of another next-hop. | |||
| Fast implementation | Fast implementation | |||
| The amount of additional computation required to forward a packet | The amount of additional computation required to forward a packet | |||
| should be small. For example, when doing round-robin, this | should be small. For example, when doing round-robin, this | |||
| computation might consist of incrementing (modulo the number of | computation might consist of incrementing (modulo the number of | |||
| next-hops) a next-hop index. | next-hops) a next-hop index. | |||
| skipping to change at page 4, line 26 ¶ | skipping to change at page 4, line 31 ¶ | |||
| forwarding. | forwarding. | |||
| Modulo-N Hash | Modulo-N Hash | |||
| To select a next-hop from the list of N next-hops, the router | To select a next-hop from the list of N next-hops, the router | |||
| performs a modulo-N hash over the packet header fields that | performs a modulo-N hash over the packet header fields that | |||
| identify a flow. This has the advantage of being fast, at the | identify a flow. This has the advantage of being fast, at the | |||
| expense of (N-1)/N of all flows changing paths whenever a next-hop | expense of (N-1)/N of all flows changing paths whenever a next-hop | |||
| is added or removed. | is added or removed. | |||
| Hash-Threshold | Hash-Threshold | |||
| The router first selects a key by performing a hash (e.g., modulo-K | The router first selects a key by performing a hash over the packet | |||
| where K is large, or CRC16) over the packet header fields that | header fields that identify the flow. The N next-hops have been | |||
| identify the flow. The N next-hops have been assigned unique | assigned unique regions in the hash function's output space. By | |||
| regions in the key space. By comparing the key against region | comparing the hash value against region boundaries the router can | |||
| boundaries the router can determine which region the key belongs to | determine which region the hash value belongs to and thus which | |||
| and thus which next-hop to use. This method has the advantage of | next-hop to use. This method has the advantage of only affecting | |||
| only affecting flows near the region boundaries (or thresholds) | flows near the region boundaries (or thresholds) when next-hops are | |||
| when next-hops are added or removed. Hash-threshold's lookup can | added or removed. For ECMP hash-threshold's lookup can be done | |||
| be done in software using a binary search yielding O(logN), or in | with a simple division (hash_value / fixed_region_size). When a | |||
| hardware in parallel for O(1). When a next-hop is added or | next-hop is added or removed, between 1/4 and 1/2 of all flows | |||
| removed, between 1/4 and 1/2 of all flows change paths. An analysis | change paths. An analysis of this method can be found in [3]. | |||
| of this method can be found in [3]. | ||||
| Highest Random Weight (HRW) | Highest Random Weight (HRW) | |||
| The router computes a key for EACH next-hop by performing a hash | The router computes a key for EACH next-hop by performing a hash | |||
| over the packet header fields that identify the flow, as well as | over the packet header fields that identify the flow, as well as | |||
| over the address of the next-hop. The router then chooses the | over the address of the next-hop. The router then chooses the | |||
| Draft Multipath Issues February 2000 | ||||
| next-hop with the highest resulting key value [4]. This has the | next-hop with the highest resulting key value [4]. This has the | |||
| advantage of minimizing the number of flows affected by a next-hop | advantage of minimizing the number of flows affected by a next-hop | |||
| addition or deletion (only 1/N of them), but is approximately N | addition or deletion (only 1/N of them), but is approximately N | |||
| times as expensive as a modulo-N hash. | times as expensive as a modulo-N hash. | |||
| The applicability of these three alternatives depends on (at least) two | The applicability of these three alternatives depends on (at least) two | |||
| factors: whether the forwarder maintains per-flow state, and how | factors: whether the forwarder maintains per-flow state, and how | |||
| precious CPU is to a multipath forwarder. | precious CPU is to a multipath forwarder. | |||
| Some routers may maintain per-flow state for reasons other than for | Some routers may maintain per-flow state for reasons other than for | |||
| supporting multipath. For example, routers typically keep per-flow | supporting multipath. For example, routers typically keep per-flow | |||
| state for multicast flows so that they can maintain the list of | state for multicast flows so that they can maintain the list of | |||
| interfaces to which packets in the flow should be copied. | interfaces to which packets in the flow should be copied. | |||
| If per-flow state is maintained in a multipath forwarder, then | If per-flow state is maintained in a multipath forwarder, then | |||
| computation of the next-hop can be done by the router at state creation | computation of the next-hop can be done by the router at state creation | |||
| time. This entails no additional computations at packet forwarding time | time. This entails no additional computations at packet forwarding time | |||
| compared with normal forwarding to a single next-hop, since the next-hop | compared with normal forwarding to a single next-hop, since the next-hop | |||
| is precomputed. In this case, any method can be used, including round- | is precomputed. In this case, any method can be used, including round- | |||
| robin, random, modulo-N, hash-threshold or HRW. Hash functions such as | robin, random, modulo-N, hash-threshold or HRW. Hash functions such as | |||
| modulo-N, hash-threshold and HRW are better if the forwarder state may | modulo-N, hash-threshold and HRW are better if the forwarder state may | |||
| be deleted for any reason during the lifetime of a flow since subsequent | be deleted for any reason during the lifetime of a flow since subsequent | |||
| next-hop computations by the router will always select the same path. | next-hop computations by the router will always select the same path. | |||
| This also improves the usefulness of debugging utilities such as | This also improves the usefulness of debugging utilities such as | |||
| traceroute. Finally, to maximize the stability of paths (and hence the | traceroute. Finally, to maximize the stability of paths (and hence the | |||
| usefulness of traceroute, etc.), the use of HRW is recommended over the | usefulness of traceroute, etc.), the use of HRW is recommended over the | |||
| other methods mentioned herein. | other methods mentioned herein. | |||
| skipping to change at page 6, line 5 ¶ | skipping to change at page 6, line 5 ¶ | |||
| hash-threshold is recommended over the other methods mentioned herein. | hash-threshold is recommended over the other methods mentioned herein. | |||
| 4.1. Unicast Forwarding | 4.1. Unicast Forwarding | |||
| Depending on the implementation, unicast forwarding may or may not keep | Depending on the implementation, unicast forwarding may or may not keep | |||
| per-flow state. We recommend that where forwarder implementations keep | per-flow state. We recommend that where forwarder implementations keep | |||
| flow state, routers should use HRW at state creation time (and next-hop | flow state, routers should use HRW at state creation time (and next-hop | |||
| deletion time) to select the next-hop, and that forwarders without per- | deletion time) to select the next-hop, and that forwarders without per- | |||
| flow state use hash-threshold. | flow state use hash-threshold. | |||
| Draft Multipath Issues February 2000 | ||||
| 4.2. Multicast Forwarding | 4.2. Multicast Forwarding | |||
| Today's multicast forwarding engines use a cache of forwarding entries | Today's multicast forwarding engines use a cache of forwarding entries | |||
| indexed by group (or group prefix) and source (or source prefix). This | indexed by group (or group prefix) and source (or source prefix). This | |||
| means that today's multicast forwarder's always keep per-flow state, | means that today's multicast forwarder's always keep per-flow state, | |||
| although for some multicast routing protocols, the "flow" may be fairly | although for some multicast routing protocols, the "flow" may be fairly | |||
| coarse (e.g., traffic from all sources to the same destination). Since | coarse (e.g., traffic from all sources to the same destination). Since | |||
| per-flow state is kept by the forwarder, it is recommended that the | per-flow state is kept by the forwarder, it is recommended that the | |||
| router always use HRW to select the next-hop. | router always use HRW to select the next-hop. | |||
| skipping to change at page 6, line 42 ¶ | skipping to change at page 6, line 44 ¶ | |||
| A related problem occurs when multiple parallel links are used between | A related problem occurs when multiple parallel links are used between | |||
| the same pair of routers. A common solution is to bundle the two links | the same pair of routers. A common solution is to bundle the two links | |||
| together into a "super"-link when is then used for routing. For | together into a "super"-link when is then used for routing. For | |||
| multicast forwarding, this results in the two links being reduced to a | multicast forwarding, this results in the two links being reduced to a | |||
| single next-hop (over the combined link) which can be used to prevent | single next-hop (over the combined link) which can be used to prevent | |||
| duplicates. When a unicast or multicast packet is queued to the | duplicates. When a unicast or multicast packet is queued to the | |||
| combined link, some method, such as those discussed earlier, is still | combined link, some method, such as those discussed earlier, is still | |||
| required to determine the physical link on which to transmit the packet. | required to determine the physical link on which to transmit the packet. | |||
| If the parallel links are identical, then most of the concerns discussed | If the parallel links are identical, then most of the concerns discussed | |||
| in this document are avoided with the combined link. The exception is | in this document are avoided with the combined link. The exception is | |||
| packet reordering, which can still occur with round-robin, adversely | packet reordering, which can still occur with round-robin, adversely | |||
| affecting TCP. | affecting TCP. | |||
| Draft Multipath Issues February 2000 | ||||
| 7. Security Considerations | 7. Security Considerations | |||
| This document discusses issues with various methods of choosing a next- | This document discusses issues with various methods of choosing a next- | |||
| hop from among multiple valid next-hops. As such, it does not directly | hop from among multiple valid next-hops. As such, it does not directly | |||
| impact the security of the Internet infrastructure or its applications. | impact the security of the Internet infrastructure or its applications. | |||
| One issue that is worth mentioning, however, is that when next-hop | One issue that is worth mentioning, however, is that when next-hop | |||
| selection is predictable, an attacker can synthesize traffic that will | selection is predictable, an attacker can synthesize traffic that will | |||
| all hash the same, making it possible to launch a denial-of-service | all hash the same, making it possible to launch a denial-of-service | |||
| attack that overloads a particular path. Since a special case of this | attack that overloads a particular path. Since a special case of this | |||
| skipping to change at page 7, line 29 ¶ | skipping to change at page 7, line 31 ¶ | |||
| any single link. | any single link. | |||
| 8. References | 8. References | |||
| [1] Moy, J., "OSPF Version 2", RFC 2178, July 1997. | [1] Moy, J., "OSPF Version 2", RFC 2178, July 1997. | |||
| [2] Maufer, T., "Deploying IP Multicast in the Enterprise", Prentice- | [2] Maufer, T., "Deploying IP Multicast in the Enterprise", Prentice- | |||
| Hall, 1998. | Hall, 1998. | |||
| [3] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm",, Work | [3] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm",, Work | |||
| in progress, April 1999. | in progress, January 2000. | |||
| [4] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to | [4] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to | |||
| Increase Hit Rates", IEEE/ACM Transactions on Networking, February | Increase Hit Rates", IEEE/ACM Transactions on Networking, February | |||
| 1998. | 1998. | |||
| [5] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., | [5] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., | |||
| Handley, M., Jacobson, V., Liu, C., Sharma, P., and L. Wei, | Handley, M., Jacobson, V., Liu, C., Sharma, P., and L. Wei, | |||
| "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol | "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol | |||
| Specification", RFC 2362, June 1998. | Specification", RFC 2362, June 1998. | |||
| [6] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", | [6] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", | |||
| RFC 2581, April 1999. | RFC 2581, April 1999. | |||
| [7] Nichols, K., Blake, S., Baker, F., and D. Black., "Definition of | [7] Nichols, K., Blake, S., Baker, F., and D. Black., "Definition of | |||
| the Differentiated Services Field (DS Field) in the IPv4 and IPv6 | the Differentiated Services Field (DS Field) in the IPv4 and IPv6 | |||
| Headers", RFC 2474, December 1998. | Headers", RFC 2474, December 1998. | |||
| Draft Multipath Issues February 2000 | ||||
| 9. Authors' Addresses | 9. Authors' Addresses | |||
| Dave Thaler | Dave Thaler | |||
| Microsoft | Microsoft | |||
| One Microsoft Way | One Microsoft Way | |||
| Redmond, WA 98052 | Redmond, WA 98052 | |||
| Phone: +1 425 703 8835 | Phone: +1 425 703 8835 | |||
| EMail: dthaler@dthaler.microsoft.com | EMail: dthaler@dthaler.microsoft.com | |||
| Christian E. Hopps | Christian E. Hopps | |||
| Merit Network | Merit Network | |||
| 4251 Plymouth Road, Suite C. | 4251 Plymouth Road, Suite C. | |||
| Ann Arbor, MI 48105 | Ann Arbor, MI 48105 | |||
| Phone: +1 734 936 0291 | Phone: +1 734 936 0291 | |||
| EMail: chopps@merit.edu | EMail: chopps@merit.edu | |||
| 10. Full Copyright Statement | 10. Full Copyright Statement | |||
| Copyright (C) The Internet Society (1999). All Rights Reserved. | Copyright (C) The Internet Society (2000). All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain it or | others, and derivative works that comment on or otherwise explain it or | |||
| assist in its implementation may be prepared, copied, published and | assist in its implementation may be prepared, copied, published and | |||
| distributed, in whole or in part, without restriction of any kind, | distributed, in whole or in part, without restriction of any kind, | |||
| provided that the above copyright notice and this paragraph are included | provided that the above copyright notice and this paragraph are included | |||
| on all such copies and derivative works. However, this document itself | on all such copies and derivative works. However, this document itself | |||
| may not be modified in any way, such as by removing the copyright notice | may not be modified in any way, such as by removing the copyright notice | |||
| or references to the Internet Society or other Internet organizations, | or references to the Internet Society or other Internet organizations, | |||
| except as needed for the purpose of developing Internet standards in | except as needed for the purpose of developing Internet standards in | |||
| skipping to change at page 9, line 4 ¶ | skipping to change at page 9, line 4 ¶ | |||
| languages other than English. | languages other than English. | |||
| The limited permissions granted above are perpetual and will not be | The limited permissions granted above are perpetual and will not be | |||
| revoked by the Internet Society or its successors or assigns. | revoked by the Internet Society or its successors or assigns. | |||
| This document and the information contained herein is provided on an "AS | This document and the information contained herein is provided on an "AS | |||
| IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK | IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK | |||
| FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT | FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT | |||
| LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT | LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT | |||
| INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR | INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR | |||
| Draft Multipath Issues February 2000 | ||||
| FITNESS FOR A PARTICULAR PURPOSE. | FITNESS FOR A PARTICULAR PURPOSE. | |||
| End of changes. 23 change blocks. | ||||
| 31 lines changed or deleted | 51 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||