idnits 2.17.1 draft-thaler-multipath-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity. == No 'Intended status' indicated for this document; assuming Proposed Standard == It seems as if not all pages are separated by form feeds - found 0 form feeds but 9 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an Abstract section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (21 June 1999) is 9076 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 2178 (ref. '1') (Obsoleted by RFC 2328) -- Possible downref: Non-RFC (?) normative reference: ref. '2' -- Possible downref: Non-RFC (?) normative reference: ref. '3' -- Possible downref: Non-RFC (?) normative reference: ref. '4' ** Obsolete normative reference: RFC 2362 (ref. '5') (Obsoleted by RFC 4601, RFC 5059) ** Obsolete normative reference: RFC 2581 (ref. '6') (Obsoleted by RFC 5681) Summary: 10 errors (**), 0 flaws (~~), 3 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Engineering Task Force D. Thaler 2 INTERNET-DRAFT Microsoft 3 Expires December 1999 C. Hopps 4 Merit Network 5 21 June 1999 7 Multipath Issues in Unicast and Multicast Next-Hop Selection 8 10 Status of this Memo 12 This document is an Internet-Draft and is in full conformance with all 13 provisions of Section 10 of RFC2026. 15 This document is an Internet Draft. Internet Drafts are working 16 documents of the Internet Engineering Task Force (IETF), its Areas, and 17 its Working Groups. Note that other groups may also distribute working 18 documents as Internet Drafts. 20 Internet Drafts are valid for a maximum of six months and may be 21 updated, replaced, or obsoleted by other documents at any time. It is 22 inappropriate to use Internet Drafts as reference material or to cite 23 them other than as a "work in progress". 25 The list of current Internet-Drafts can be accessed at 26 http://www.ietf.org/ietf/1id-abstracts.txt 28 The list of Internet-Draft Shadow Directories can be accessed at 29 http://www.ietf.org/shadow.html. 31 Copyright Notice 33 Copyright (C) The Internet Society (1999). All Rights Reserved. 35 1. Introduction 37 Various routing protocols, including OSPF [1] and ISIS, explicitly allow 38 "Equal-Cost Multipath" routing. Some router implementations also allow 39 equal-cost multipath usage with RIP and other routing protocols. Using 40 equal-cost multipath means that if multiple equal-cost routes to the 41 same destination exist, they can be discovered and used to provide load 42 balancing among redundant paths. 44 The effect of multipath routing on a forwarder is that the forwarder 45 potentially has several next-hops for any given destination and must use 46 some method to choose which next-hop should be used for a given data 47 packet. This memo summarizes current practices, problems, and solutions. 49 2. Concerns 51 Several router implementations allow multipath forwarding. This is 52 sometimes done naively via round-robin, where each packet matching a 53 given destination route is forwarded using the subsequent next-hop, in a 54 round-robin fashion. This does provide a form of load balancing, but 55 there are several problems with approaches such as round-robin or 56 random: 58 Variable Path MTU 59 Since each of the redundant paths may have a different MTU, this 60 means that the overall path MTU can change on a packet-by-packet 61 basis, negating the usefulness of path MTU discovery. 63 Variable Latencies 64 Since each of the redundant paths may have a different latency 65 involved, having packets take separate paths can cause packets to 66 always arrive out of order, increasing delivery latency and 67 buffering requirements. 69 Packet reordering causes TCP to believe that loss has taken place 70 when packets with higher sequence numbers arrive before an earlier 71 one. When three or more packets are received before a "late" 72 packet, TCP enters a mode called "fast-retransmit" [6] which 73 consumes extra bandwidth (which could potentially cause more loss, 74 decreasing throughput) as it attempts to unnecessarily retransmit 75 the delayed packet(s). Hence, reordering can be detrimental to 76 network performance. 78 Debugging 79 Common debugging utilities such as ping and traceroute are much 80 less reliable in the presence of multiple paths and may even 81 present completely wrong results. 83 In multicast routing, the problem with multiple paths is that multicast 84 routing protocols prevent loops and duplicates by constructing a single 85 tree to all receivers of the same group address. Multicast routing 86 protocols deployed today (DVMRP, PIM-DM, PIM-SM) [2] construct 87 shortest-path trees rooted at either the source, or another router known 88 as a Core or Rendezvous Point. Hence, the way they ensure that 89 duplicates will not arise is that a given tree must use only a single 90 next-hop towards the root of the tree. 92 3. Requirements 94 In the remainder of this document, we will use the term "flow" to 95 represent the granularity at which the router keeps state (if at all) 96 for classes of traffic. The exact definition of a flow may depend on 97 the actual implementation. For example, a flow might be identified 98 solely by destination address, or it might be identified by (source 99 address, destination address, protocol id) triplet. Hence "flow" is not 100 necessarily synonymous with the term "microflow" as used in RFC 2474 101 [7], which also includes port numbers. Indeed, including transport- 102 layer information in the next-hop selection process can actually be 103 problematic. For example, if packets are fragmented, the transport- 104 layer information may not be available in every packet. Furthermore, 105 having the choice of path depend on transport-layer fields may negate 106 the benefit of caching information such as MTU for use in subsequent 107 connections between the same endpoints. 109 All of the problems outlined in the previous section arise when packets 110 in the same unicast or multicast "flow" are split among multiple paths. 111 The natural solution is therefore to ensure that packets for the same 112 flow always use the same path. 114 Two additional features are desirable: 116 Minimal disruption 117 When multipath is used, meaning that multiple routes contribute 118 valid next-hops, the chances are higher of routes being added and 119 deleted from consideration than when only the "best" route is used 120 (in which case metric changes in alternate routes have no effect on 121 traffic paths). Since a higher number of routes may actually be 122 used for forwarding when multipath is in use, the potential for 123 packet reordering and packet loss due to route flaps can be much 124 greater than when not using multipath. Hence, it is desirable to 125 minimize the number of active flows affected by the addition or 126 deletion of another next-hop. 128 Fast implementation 129 The amount of additional computation required to forward a packet 130 should be small. For example, when doing round-robin, this 131 computation might consist of incrementing (modulo the number of 132 next-hops) a next-hop index. 134 4. Solutions 136 We now provide three possible methods for improving the performance of 137 multipath and then discuss their applicability to unicast and multicast 138 forwarding. 140 Modulo-N Hash 141 To select a next-hop from the list of N next-hops, the router 142 performs a modulo-N hash over the packet header fields that 143 identify a flow. This has the advantage of being fast, at the 144 expense of (N-1)/N of all flows changing paths whenever a next-hop 145 is added or removed. 147 Hash-Threshold 148 The router first selects a key by performing a hash (e.g., modulo-K 149 where K is large, or CRC16) over the packet header fields that 150 identify the flow. The N next-hops have been assigned unique 151 regions in the key space. By comparing the key against region 152 boundaries the router can determine which region the key belongs to 153 and thus which next-hop to use. This method has the advantage of 154 only affecting flows near the region boundaries (or thresholds) 155 when next-hops are added or removed. Hash-threshold's lookup can 156 be done in software using a binary search yielding O(logN), or in 157 hardware in parallel for O(1). When a next-hop is added or 158 removed, between 1/4 and 1/2 of all flows change paths. An analysis 159 of this method can be found in [3]. 161 Highest Random Weight (HRW) 162 The router computes a key for EACH next-hop by performing a hash 163 over the packet header fields that identify the flow, as well as 164 over the address of the next-hop. The router then chooses the 165 next-hop with the highest resulting key value [4]. This has the 166 advantage of minimizing the number of flows affected by a next-hop 167 addition or deletion (only 1/N of them), but is approximately N 168 times as expensive as a modulo-N hash. 170 The applicability of these three alternatives depends on (at least) two 171 factors: whether the forwarder maintains per-flow state, and how 172 precious CPU is to a multipath forwarder. 174 Some routers may maintain per-flow state for reasons other than for 175 supporting multipath. For example, routers typically keep per-flow 176 state for multicast flows so that they can maintain the list of 177 interfaces to which packets in the flow should be copied. 179 If per-flow state is maintained in a multipath forwarder, then 180 computation of the next-hop can be done by the router at state creation 181 time. This entails no additional computations at packet forwarding time 182 compared with normal forwarding to a single next-hop, since the next-hop 183 is precomputed. In this case, any method can be used, including round- 184 robin, random, modulo-N, hash-threshold or HRW. Hash functions such as 185 modulo-N, hash-threshold and HRW are better if the forwarder state may 186 be deleted for any reason during the lifetime of a flow since subsequent 187 next-hop computations by the router will always select the same path. 188 This also improves the usefulness of debugging utilities such as 189 traceroute. Finally, to maximize the stability of paths (and hence the 190 usefulness of traceroute, etc.), the use of HRW is recommended over the 191 other methods mentioned herein. 193 If per-flow state is not maintained by the forwarder, then using 194 multiple next-hops requires that the next-hop be calculated at packet 195 arrival time. When CPU is more precious than stability of flow paths, 196 hash-threshold is recommended over the other methods mentioned herein. 198 4.1. Unicast Forwarding 200 Depending on the implementation, unicast forwarding may or may not keep 201 per-flow state. We recommend that where forwarder implementations keep 202 flow state, routers should use HRW at state creation time (and next-hop 203 deletion time) to select the next-hop, and that forwarders without per- 204 flow state use hash-threshold. 206 4.2. Multicast Forwarding 208 Today's multicast forwarding engines use a cache of forwarding entries 209 indexed by group (or group prefix) and source (or source prefix). This 210 means that today's multicast forwarder's always keep per-flow state, 211 although for some multicast routing protocols, the "flow" may be fairly 212 coarse (e.g., traffic from all sources to the same destination). Since 213 per-flow state is kept by the forwarder, it is recommended that the 214 router always use HRW to select the next-hop. 216 Routers using explicit-joining protocols such as PIM-SM [5] should thus 217 use the multipath information when determining to which neighbor a join 218 message should be sent. For example, when multiple next-hops exist for 219 a given Rendezvous Point (RP) toward which a (*,G) Join should be sent, 220 it is recommended that HRW be used to select the next-hop to use for 221 each group. 223 5. Applicability 225 The algorithms discussed above (except round-robin) all rely on some 226 form of hash function. Equal flow distribution is achieved when the 227 hash function is uniformly distributed. Since the commonly used hash 228 functions only become uniformly distributed when the number of inputs is 229 relatively large, these algorithms are more applicable to routers used 230 to route many flows, than in, for example, a small business setting. 232 6. Redundant Parallel Links 234 A related problem occurs when multiple parallel links are used between 235 the same pair of routers. A common solution is to bundle the two links 236 together into a "super"-link when is then used for routing. For 237 multicast forwarding, this results in the two links being reduced to a 238 single next-hop (over the combined link) which can be used to prevent 239 duplicates. When a unicast or multicast packet is queued to the 240 combined link, some method, such as those discussed earlier, is still 241 required to determine the physical link on which to transmit the packet. 242 If the parallel links are identical, then most of the concerns discussed 243 in this document are avoided with the combined link. The exception is 244 packet reordering, which can still occur with round-robin, adversely 245 affecting TCP. 247 7. Security Considerations 249 This document discusses issues with various methods of choosing a next- 250 hop from among multiple valid next-hops. As such, it does not directly 251 impact the security of the Internet infrastructure or its applications. 253 One issue that is worth mentioning, however, is that when next-hop 254 selection is predictable, an attacker can synthesize traffic that will 255 all hash the same, making it possible to launch a denial-of-service 256 attack that overloads a particular path. Since a special case of this 257 is when the same (single) next-hop is always selected, such an attack is 258 easiest when multipath is not being used. Introducing multipath routing 259 can make such an attack more difficult; the more unpredictable the hash 260 is, the harder it becomes to conduct a denial-of-service attack against 261 any single link. 263 8. References 265 [1] Moy, J., "OSPF Version 2", RFC 2178, July 1997. 267 [2] Maufer, T., "Deploying IP Multicast in the Enterprise", Prentice- 268 Hall, 1998. 270 [3] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm",, Work 271 in progress, April 1999. 273 [4] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to 274 Increase Hit Rates", IEEE/ACM Transactions on Networking, February 275 1998. 277 [5] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., 278 Handley, M., Jacobson, V., Liu, C., Sharma, P., and L. Wei, 279 "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol 280 Specification", RFC 2362, June 1998. 282 [6] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", 283 RFC 2581, April 1999. 285 [7] Nichols, K., Blake, S., Baker, F., and D. Black., "Definition of 286 the Differentiated Services Field (DS Field) in the IPv4 and IPv6 287 Headers", RFC 2474, December 1998. 289 9. Authors' Addresses 291 Dave Thaler 292 Microsoft 293 One Microsoft Way 294 Redmond, WA 98052 295 Phone: +1 425 703 8835 296 EMail: dthaler@dthaler.microsoft.com 298 Christian E. Hopps 299 Merit Network 300 4251 Plymouth Road, Suite C. 301 Ann Arbor, MI 48105 302 Phone: +1 734 936 0291 303 EMail: chopps@merit.edu 305 10. Full Copyright Statement 307 Copyright (C) The Internet Society (1999). All Rights Reserved. 309 This document and translations of it may be copied and furnished to 310 others, and derivative works that comment on or otherwise explain it or 311 assist in its implementation may be prepared, copied, published and 312 distributed, in whole or in part, without restriction of any kind, 313 provided that the above copyright notice and this paragraph are included 314 on all such copies and derivative works. However, this document itself 315 may not be modified in any way, such as by removing the copyright notice 316 or references to the Internet Society or other Internet organizations, 317 except as needed for the purpose of developing Internet standards in 318 which case the procedures for copyrights defined in the Internet 319 languages other than English. 321 The limited permissions granted above are perpetual and will not be 322 revoked by the Internet Society or its successors or assigns. 324 This document and the information contained herein is provided on an "AS 325 IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK 326 FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT 327 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT 328 INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR 329 FITNESS FOR A PARTICULAR PURPOSE.