idnits 2.17.1

draft-ietf-6man-flow-ecmp-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 8 has weird spacing: '...routing and l...'

  -- The document date (July 5, 2011) is 4671 days in the past.  Is this
     intentional?

  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-07) exists of
     draft-ietf-6man-flow-3697bis-05

  ** Obsolete normative reference: RFC 2460 (Obsoleted by RFC 8200)

  ** Obsolete normative reference: RFC 3697 (Obsoleted by RFC 6437)

  -- Obsolete informational reference (is this intentional?): RFC 2629
     (Obsoleted by RFC 7749)

     Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   Network Working Group                                       B. Carpenter
3   Internet-Draft                                         Univ. of Auckland
4   Intended status: BCP                                                  S.
Amante
5   Expires: January 6, 2012                                         Level 3
6                                                               July 5, 2011

8     Using the IPv6 flow label for equal cost multipath routing and link
9                            aggregation in tunnels
10                      draft-ietf-6man-flow-ecmp-04

12  Abstract

14     The IPv6 flow label has certain restrictions on its use.  This
15     document describes how those restrictions apply when using the flow
16     label for load balancing by equal cost multipath routing, and for
17     link aggregation, particularly for IP-in-IPv6 tunneled traffic.

19  Status of this Memo

21     This Internet-Draft is submitted in full conformance with the
22     provisions of BCP 78 and BCP 79.

24     Internet-Drafts are working documents of the Internet Engineering
25     Task Force (IETF).  Note that other groups may also distribute
26     working documents as Internet-Drafts.  The list of current
27     Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

29     Internet-Drafts are draft documents valid for a maximum of six months
30     and may be updated, replaced, or obsoleted by other documents at any
31     time.  It is inappropriate to use Internet-Drafts as reference
32     material or to cite them other than as "work in progress."

34     This Internet-Draft will expire on January 6, 2012.

36  Copyright Notice

38     Copyright (c) 2011 IETF Trust and the persons identified as the
39     document authors.  All rights reserved.

41     This document is subject to BCP 78 and the IETF Trust's Legal
42     Provisions Relating to IETF Documents
43     (http://trustee.ietf.org/license-info) in effect on the date of
44     publication of this document.  Please review these documents
45     carefully, as they describe your rights and restrictions with respect
46     to this document.  Code Components extracted from this document must
47     include Simplified BSD License text as described in Section 4.e of
48     the Trust Legal Provisions and are provided without warranty as
49     described in the Simplified BSD License.

51  Table of Contents

53     1.  Introduction
54       1.1.
Choice of IP Header Fields for Hash Input
55       1.2.  Flow label rules
56     2.  Normative Notation
57     3.  Guidelines
58     4.  Security Considerations
59     5.  IANA Considerations
60     6.  Acknowledgements
61     7.  Change log [RFC Editor: please remove]
62     8.  References
63       8.1.  Normative References
64       8.2.  Informative References
65     Authors' Addresses

67  1.  Introduction

69     When several network paths between the same two nodes are known by
70     the routing system to be equally good (in terms of capacity and
71     latency), it may be desirable to share traffic among them.  Two such
72     techniques are known as equal cost multipath routing (ECMP) and link
73     aggregation (LAG) [IEEE802.1AX].  There are of course numerous
74     possible approaches to this, but certain goals need to be met:
75     o  Roughly equal share of traffic on each path.
76        (In some cases, the multiple paths might not all have the same
77        capacity and the goal might be appropriately weighted traffic
78        shares rather than equal shares.  This would affect the load
79        sharing algorithm, but would not otherwise change the argument.)
80     o  Minimize or avoid out-of-order delivery for individual traffic
81        flows.
82     o  Minimize idle time on any path when the queue is non-empty.

84     There is some conflict between these goals: for example, strictly
85     avoiding idle time could cause a small packet sent on an idle path to
86     overtake a bigger packet from the same flow, causing out-of-order
87     delivery.
89     One lightweight approach to ECMP or LAG is this: if there are N
90     equally good paths to choose from, then form a modulo(N) hash
91     [RFC2991] from a defined set of fields in each packet header that are
92     certain to have the same values throughout the duration of a flow,
93     and use the resulting output hash value to select a particular path.
94     If the hash function is chosen so that the output values have a
95     uniform statistical distribution, this method will share traffic
96     roughly equally between the N paths.  If the header fields included
97     in the hash input are consistent, all packets from a given flow will
98     generate the same hash output value, so out-of-order delivery will
99     not occur.  Assuming a large number of unique flows are involved, it
100    is also probable that the method will avoid idle time, since the
101    queue for each link will remain non-empty.

103 1.1.  Choice of IP Header Fields for Hash Input

105    In the remainder of this document, we will use the term "flow" to
106    represent a sequence of packets that may be identified by either the
107    source and destination IP addresses alone {2-tuple} or the source and
108    destination IP addresses, protocol and source and destination port
109    numbers {5-tuple}.  It should be noted that the latter is more
110    specifically referred to as a "microflow" in [RFC2474], but this term
111    is not used in connection with the flow label in [RFC3697].

113    The question is, then, which header fields are used to identify a
114    flow and to serve as input keys to a modulo(N) hash algorithm.  A
115    common choice when routing general traffic is simply to use a hash of
116    the source and destination IP addresses, i.e., the 2-tuple.  This is
117    necessary and sufficient to avoid out-of-order delivery, and with a
118    wide variety of sources and destinations, as one finds in the core of
119    the network, often statistically sufficient to distribute load
120    evenly.
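The modulo(N) selection over the 2-tuple described above can be sketched as follows. This is only an illustration and is not taken from the draft: the function name select_path, the choice of SHA-256 as the hash, and the documentation-prefix addresses are all assumptions, and real forwarding hardware would typically use a much cheaper hash function.

```python
import hashlib
import ipaddress


def select_path(src: str, dst: str, n_paths: int) -> int:
    """Return a path index in [0, n_paths) derived from the 2-tuple."""
    # Assumed key layout: packed source address followed by packed destination.
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    # SHA-256 is an assumption, used here only for its well-distributed output.
    digest = hashlib.sha256(key).digest()
    # Reduce the first 32 bits of the digest modulo the number of paths.
    return int.from_bytes(digest[:4], "big") % n_paths


# Every packet of one flow maps to the same path, so its packets stay in order.
first = select_path("2001:db8::1", "2001:db8:1::1", 4)
second = select_path("2001:db8::1", "2001:db8:1::1", 4)
print(first == second)
```

With many distinct source and destination pairs the selected indices spread over all N paths, which is the property the text relies on for roughly equal load sharing.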
In practice, many implementations use the 5-tuple {dest
121    addr, source addr, protocol, dest port, source port} as input keys to
122    the hash function, to maximize the probability of evenly sharing
123    traffic over the equal cost paths.  However, including transport
124    layer information as input keys to a hash may be a problem for IP
125    fragments [RFC2991] or for encrypted traffic.  Including the protocol
126    and port numbers, totalling 40 bits, in the hash input makes the hash
127    slightly more expensive to compute but does improve the hash
128    distribution, due to the variable nature of ephemeral ports.
129    Ephemeral port numbers are quite well distributed [Lee10] and will
130    typically contribute 16 variable bits.  However, in the case of IPv6,
131    transport layer information is inconvenient to extract, due to the
132    variable placement and variable length of next-headers; all
133    implementations must be capable of skipping over next-headers, even
134    if they are rarely present in actual traffic.  In fact, [RFC2460]
135    implies that next-headers, except hop-by-hop options, are not
136    normally inspected by intermediate nodes in the network.  This
137    situation may be challenging for some hardware implementations,
138    raising the potential that network equipment vendors might sacrifice
139    the length of the fields extracted from an IPv6 header.

141    It is worth noting that the possible presence of a Generic Routing
142    Encapsulation (GRE) header [RFC2784] and the possible presence of a
143    GRE key within that header creates a similar challenge to the
144    possible presence of IPv6 extension headers; anything that
145    complicates header analysis is undesirable.

147    The situation is different in IP-in-IP tunneled scenarios.
148    Identifying a flow inside the tunnel is more complicated,
149    particularly because nearly all hardware can only identify flows
150    based on information contained in the outermost IP header.
Assume
151    that traffic from many sources to many destinations is aggregated in
152    a single IP-in-IP tunnel from tunnel end point (TEP) A to TEP B (see
153    figure).  Then all the packets forming the tunnel have outer source
154    address A and outer destination address B.  In all probability they
155    also have the same port and protocol numbers.  If there are multiple
156    paths between routers R1 and R2, and ECMP or LAG is applied to choose
157    a particular path, the 2-tuple or 5-tuple, and its hash, will be
158    constant and no load sharing will be achieved.  If there is a high
159    proportion of traffic from one or a small number of tunnels, traffic
160    will not be distributed as intended across the paths between R1 and
161    R2.

163        _____           _____               _____           _____
164       | TEP |_________| R1  |-------------| R2  |_________| TEP |
165       |__A__|         |_____|-------------|_____|         |__B__|
166                tunnel          ECMP or LAG         tunnel
167                                   here

169    As noted above, for IPv6, the 5-tuple is in any case quite
170    inconvenient to extract due to the next-header placement.  The
171    question therefore arises whether the 20-bit flow label in IPv6
172    packets would be suitable for use as input to an ECMP or LAG hash
173    algorithm, especially in the case of tunnels where the inner packet
174    header is inaccessible.  If the flow label could be used in place of
175    the port numbers and protocol number in the 5-tuple, the
176    implementation would be simplified.

178 1.2.  Flow label rules

180    The flow label was left experimental by [RFC2460] but was better
181    defined by [RFC3697].  We quote three rules from that RFC:
182    1.  "The Flow Label value set by the source MUST be delivered
183        unchanged to the destination node(s)."
184    2.  "IPv6 nodes MUST NOT assume any mathematical or other properties
185        of the Flow Label values assigned by source nodes."
186    3.  "Router performance SHOULD NOT be dependent on the distribution
187        of the Flow Label values.  Especially, the Flow Label bits alone
188        make poor material for a hash key."
190    These rules, especially the last one, have caused designers to
191    hesitate about using the flow label in support of ECMP or LAG.  The
192    fact is today that most nodes set a zero value in the flow label, and
193    the first rule definitely forbids the routing system from changing
194    the flow label once a packet has left the source node.  Considering
195    normal IPv6 traffic, the fact that the flow label is typically zero
196    means that it would add no value to an ECMP or LAG hash.  But neither
197    would it do any harm to the distribution of the hash values.

199    However, in the case of an IP-in-IPv6 tunnel, the TEP is itself the
200    source node of the outer packets.  Therefore, a TEP may freely set a
201    flow label in the outer IPv6 header of the packets it sends into the
202    tunnel.

204    The second and third rules quoted above need to be seen in the
205    context of [RFC3697], which assumes that routers using the flow label
206    in some way will be involved in some sort of method of establishing flow
207    state: "To enable flow-specific treatment, flow state needs to be
208    established on all or a subset of the IPv6 nodes on the path from the
209    source to the destination(s)."  The RFC should perhaps have made
210    clear that a router that has participated in flow state establishment
211    can rely on properties of the resulting flow label values without
212    further signaling.  If a router knows these properties, rule 2 is
213    irrelevant, and it can choose to deviate from rule 3.

215    In the tunneling situation sketched above, routers R1 and R2 can rely
216    on the flow labels set by TEP A and TEP B being assigned by a known
217    method.  This allows an ECMP or LAG method to be based on the flow
218    label consistently with [RFC3697], regardless of whether the
219    non-tunnel traffic carries non-zero flow label values.

221    At the time of this writing, the IETF is preparing a revision of RFC
222    3697 [I-D.ietf-6man-flow-3697bis].
That revision is fully compatible
223    with the present document and obviates the concerns resulting from
224    the above three rules.  Therefore, the present specification applies
225    both to RFC 3697 and to its successor.

227 2.  Normative Notation

229    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
230    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
231    document are to be interpreted as described in [RFC2119].

233 3.  Guidelines

235    We assume that the routers supporting ECMP or LAG (R1 and R2 in the
236    above figure) are unaware that they are handling tunneled traffic.
237    If it is desired to include the IPv6 flow label in an ECMP or LAG
238    hash in the tunneled scenario shown above, the following guidelines
239    apply:
240    o  Inner packets MUST be encapsulated in an outer IPv6 packet whose
241       source and destination addresses are those of the tunnel end
242       points (TEPs).
243    o  The flow label in the outer packet SHOULD be set by the sending
244       TEP to a 20-bit value in accordance with
245       [I-D.ietf-6man-flow-3697bis].  The same flow label value MUST be
246       used for all packets in a single user flow, as determined by the
247       IP header fields of the inner packet.
248    o  To achieve this, the sending TEP MUST classify all packets into
249       flows, once it has determined that they should enter a given
250       tunnel, and then write the relevant flow label into the outer IPv6
251       header.  A user flow could be identified by the sending TEP most
252       simply by its {destination, source} address 2-tuple or by its
253       5-tuple {dest addr, source addr, protocol, dest port, source
254       port}.  At present, there would be little point in using the {dest
255       addr, source addr, flow label} 3-tuple of the inner packet, but
256       doing so would be a future-proof option.  The choice of n-tuple is
257       an implementation choice in the sending TEP.
258       *  As specified in [I-D.ietf-6man-flow-3697bis], the flow label
259          values should be chosen from a uniform distribution.  Such
260          values will be suitable as input to a load balancing hash
261          function and will be hard for a malicious third party to
262          predict.
263       *  The sending TEP MAY perform stateless flow label assignment, by
264          using a suitable 20-bit hash of the inner IP header's 2-tuple
265          or 5-tuple as the flow label value.
266       *  If the inner packet is an IPv6 packet, its flow label value
267          could also be included in this hash.
268       *  This stateless method creates a small probability of two
269          different user flows hashing to the same flow label.  Since
270          [I-D.ietf-6man-flow-3697bis] allows a source (the TEP in this
271          case) to define any set of packets that it wishes as a single
272          flow, occasionally labeling two user flows as a single flow
273          through the tunnel is acceptable.
274    o  At intermediate router(s) that perform load distribution, the hash
275       algorithm used to determine the outgoing component-link in an ECMP
276       and/or LAG toward the next-hop MUST minimally include the 3-tuple
277       {dest addr, source addr, flow label} and MAY also include the
278       remaining components of the 5-tuple.  This applies whether the
279       traffic is tunneled traffic only, or a mixture of normal traffic
280       and tunneled traffic.
281       *  Intermediate IPv6 router(s) will presumably encounter a mixture
282          of tunneled traffic and normal IPv6 traffic.  Because of this,
283          the design should also include {protocol, dest port, source
284          port} as input keys to the ECMP and/or LAG hash algorithms, to
285          provide additional entropy for flows whose flow label is set to
286          zero, including non-tunneled traffic flows.

288 4.  Security Considerations

290    The flow label is not protected in any way and can be forged by an
291    on-path attacker.
However, it is expected that tunnel end-points and
292    the ECMP or LAG paths will be part of managed infrastructure that is
293    well protected against on-path attacks.  Off-path attackers are
294    unlikely to guess a valid flow label if an apparently pseudo-random
295    and unpredictable value is used.  In either case, the worst an
296    attacker could do against ECMP or LAG is to attempt to selectively
297    overload a particular path.  For further discussion, see
298    [I-D.ietf-6man-flow-3697bis].

300 5.  IANA Considerations

302    This document requests no action by IANA.

304 6.  Acknowledgements

306    This document was suggested by corridor discussions at IETF76.  Joel
307    Halpern made crucial comments on an early version.  We are grateful
308    to Qinwen Hu for general discussion about the flow label.  Valuable
309    comments and contributions were made by Miguel Garcia, Brian
310    Haberman, Sheng Jiang, Thomas Narten, Jarno Rajahalme, Brian Weis,
311    and others.

313    This document was produced using the xml2rfc tool [RFC2629].

315 7.  Change log [RFC Editor: please remove]

317    draft-ietf-6man-flow-ecmp-04: IETF Last Call comments, 2011-06-20.

319    draft-ietf-6man-flow-ecmp-03: minor editorial fixes, AD comments,
320    2011-06-20.

322    draft-ietf-6man-flow-ecmp-02: updated after further comments,
323    2011-05-02.  Note that RFC3697bis becomes a normative reference.

325    draft-ietf-6man-flow-ecmp-01: updated after WG Last Call, 2011-02-10

327    draft-ietf-6man-flow-ecmp-00: after WG adoption at IETF 79,
328    2010-12-02

330    draft-carpenter-flow-ecmp-03: clarifications after further comments,
331    2010-10-07

333    draft-carpenter-flow-ecmp-02: updated after IETF77 discussion,
334    especially adding LAG, changed to BCP language, added second author,
335    2010-04-14

337    draft-carpenter-flow-ecmp-01: updated after comments, 2010-02-18

339    draft-carpenter-flow-ecmp-00: original version, 2010-01-19

341 8.  References
342 8.1.
Normative References

344    [I-D.ietf-6man-flow-3697bis]
345               Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme,
346               "IPv6 Flow Label Specification",
347               draft-ietf-6man-flow-3697bis-05 (work in progress),
348               June 2011.

350    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
351               Requirement Levels", BCP 14, RFC 2119, March 1997.

353    [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
354               (IPv6) Specification", RFC 2460, December 1998.

356    [RFC3697]  Rajahalme, J., Conta, A., Carpenter, B., and S. Deering,
357               "IPv6 Flow Label Specification", RFC 3697, March 2004.

359 8.2.  Informative References

361    [IEEE802.1AX]
362               Institute of Electrical and Electronics Engineers, "Link
363               Aggregation", IEEE Standard 802.1AX-2008, 2008.

365    [Lee10]    Lee, D., Carpenter, B., and N. Brownlee, "Observations of
366               UDP to TCP Ratio and Port Numbers", Fifth International
367               Conference on Internet Monitoring and Protection ICIMP
368               2010, May 2010.

371    [RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black,
372               "Definition of the Differentiated Services Field (DS
373               Field) in the IPv4 and IPv6 Headers", RFC 2474,
374               December 1998.

376    [RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
377               June 1999.

379    [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
380               Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
381               March 2000.

383    [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
384               Multicast Next-Hop Selection", RFC 2991, November 2000.

386 Authors' Addresses

388    Brian Carpenter
389    Department of Computer Science
390    University of Auckland
391    PB 92019
392    Auckland, 1142
393    New Zealand

395    Email: brian.e.carpenter@gmail.com

397    Shane Amante
398    Level 3 Communications, LLC
399    1025 Eldorado Blvd
400    Broomfield, CO 80021
401    USA

403    Email: shane@level3.net
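
As an illustration of the Section 3 guidelines, the sketch below shows a sending TEP deriving a stateless 20-bit flow label from the inner 5-tuple, and an intermediate router selecting a path from the outer {dest addr, source addr, flow label} 3-tuple. Nothing here is specified by the draft: the function names tep_flow_label and router_path, the helper _hash32, the use of SHA-256, and the example addresses are all assumptions made for the illustration.

```python
import hashlib
import ipaddress


def _hash32(*parts: bytes) -> int:
    """Hypothetical helper: first 32 bits of SHA-256 over the given fields."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part)
    return int.from_bytes(h.digest()[:4], "big")


def tep_flow_label(src: str, dst: str, proto: int, sport: int, dport: int) -> int:
    """Stateless flow label: a 20-bit hash of the inner packet's 5-tuple."""
    value = _hash32(
        ipaddress.ip_address(dst).packed,
        ipaddress.ip_address(src).packed,
        bytes([proto]),
        dport.to_bytes(2, "big"),
        sport.to_bytes(2, "big"),
    )
    return value & 0xFFFFF  # keep only the low 20 bits


def router_path(outer_src: str, outer_dst: str, flow_label: int, n_paths: int) -> int:
    """Path choice from the outer {dest addr, source addr, flow label} 3-tuple."""
    value = _hash32(
        ipaddress.ip_address(outer_dst).packed,
        ipaddress.ip_address(outer_src).packed,
        flow_label.to_bytes(3, "big"),  # a 20-bit label fits in 3 bytes
    )
    return value % n_paths


# Many inner flows share one tunnel (same outer addresses), yet the flow
# label lets the router spread them over the available paths.
labels = [tep_flow_label("192.0.2.1", "198.51.100.7", 6, 1024 + i, 443)
          for i in range(200)]
paths = {router_path("2001:db8::a", "2001:db8::b", fl, 4) for fl in labels}
print(sorted(paths))
```

If every outer packet carried a flow label of zero, router_path would return the same index for the whole tunnel, which is the lack of load sharing described in Section 1.1.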