Network Working Group                                       B. Carpenter
Internet-Draft                                         Univ. of Auckland
Intended status: BCP                                           S. Amante
Expires: June 5, 2011                                            Level 3
                                                        December 2, 2010


  Using the IPv6 flow label for equal cost multipath routing and link
                         aggregation in tunnels
                      draft-ietf-6man-flow-ecmp-00

Abstract

   The IPv6 flow label has certain restrictions on its use. This
   document describes how those restrictions apply when using the flow
   label for load balancing by equal cost multipath routing, and for
   link aggregation, particularly for tunneled traffic.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 5, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Normative Notation
   3. Guidelines
   4. Security Considerations
   5. IANA Considerations
   6. Acknowledgements
   7. Change log
   8. References
      8.1. Normative References
      8.2. Informative References
   Authors' Addresses

1. Introduction

   When several network paths between the same two nodes are known by
   the routing system to be equally good (in terms of capacity and
   latency), it may be desirable to share traffic among them. Two such
   techniques are known as equal cost multipath routing (ECMP) and link
   aggregation (LAG) [IEEE802.1AX]. There are of course numerous
   possible approaches to this, but certain goals need to be met:
   o  Roughly equal share of traffic on each path.
   o  Work-conserving method (no idle time when queue is non-empty).
   o  Minimize or avoid out-of-order delivery for individual traffic
      flows.

   There is some conflict between these goals: for example, strictly
   avoiding idle time could cause a small packet sent on an idle path to
   overtake a bigger packet from the same flow, causing out-of-order
   delivery.

   One lightweight approach to ECMP or LAG is this: if there are N
   equally good paths to choose from, then form a modulo(N) hash
   [RFC2991] from a consistent set of fields in each packet header, and
   use the resulting value to select a particular path. If the hash
   function is chosen so that the hash values have a uniform statistical
   distribution, this method will share traffic roughly equally between
   the N paths.
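   As a rough illustration of the modulo(N) approach just described,
   the following Python sketch hashes a tuple of header fields and
   reduces the result modulo N. The function name select_path, the use
   of SHA-256, and the repr()-based serialization are assumptions made
   for this example only; they are not taken from [RFC2991] or from any
   particular router implementation.

```python
import hashlib


def select_path(fields: tuple, n_paths: int) -> int:
    """Return a path index in [0, n_paths) for a tuple of header fields."""
    # Serialize the fields deterministically. repr() is only a stand-in
    # (an assumption of this sketch) for fixed-width packing of the fields.
    key = repr(fields).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Treat the first 4 bytes of the digest as an unsigned integer,
    # then reduce modulo N to pick one of the N equal-cost paths.
    value = int.from_bytes(digest[:4], "big")
    return value % n_paths
```

   Calling select_path repeatedly with the same fields always yields
   the same index, which is the property that keeps a flow on one path.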
   If the header fields included in the hash are
   consistent, all packets from a given flow will generate the same
   hash, so out-of-order delivery will not occur. Assuming a large
   number of unique flows are involved, it is also probable that the
   method will be work-conserving, since the queue for each link will
   remain non-empty.

   The question with such a method is which IP header fields are chosen
   to identify a flow and, consequently, are used as input keys to a
   modulo(N) hash algorithm.

   In the remainder of this document, we will use the term "flow" to
   represent a sequence of packets that may be identified by either the
   source and destination IP addresses alone {2-tuple} or the source and
   destination IP addresses, protocol and source and destination port
   numbers {5-tuple}. It should be noted that the latter is more
   specifically referred to as a "microflow" in [RFC2474], but this term
   is not used in connection with the flow label in [RFC3697].

   A minimal choice in the routing system is simply to use a hash of
   the source and destination IP addresses, i.e., the 2-tuple. This is
   necessary and sufficient to avoid out-of-order delivery and, with a
   wide variety of sources and destinations, as one finds in the core
   of the network, is sometimes sufficient to achieve work-conserving
   load sharing. In practice, implementations often use the 5-tuple
   {dest addr, source addr, protocol, dest port, source port} as input
   keys to the hash function, to maximize the probability of evenly
   sharing traffic over the equal cost paths. However, including
   transport layer information as input keys to a hash may be a problem
   for IPv4 fragments [RFC2991].
   In addition, including the
   protocol and destination port numbers makes the hash slightly more
   expensive to compute without particularly improving the hash
   distribution, due to the prevalence of well-known port numbers and
   popular protocol numbers. Ephemeral ports, on the other hand, are
   quite well distributed [Lee10]. In the case of IPv6, protocol
   numbers are particularly inconvenient due to the variable placement
   and variable length of next-headers. Moreover, [RFC2460] recommends
   that all next-headers, except hop-by-hop options, should not be
   inspected by intermediate nodes in the network, presumably to make
   introduction of new next-headers more straightforward.

   The situation is different in tunneled scenarios. Identifying a flow
   inside the tunnel is more complicated, particularly because nearly
   all hardware can only identify flows based on information contained
   in the outermost IP header. Assume that traffic from many sources to
   many destinations is aggregated in a single IP-in-IP tunnel from
   tunnel end point (TEP) A to TEP B (see figure). Then all the packets
   forming the tunnel have outer source address A and outer destination
   address B. In all probability they also have the same port and
   protocol numbers. If there are multiple paths between routers R1 and
   R2, and ECMP or LAG is applied to choose a particular path, the
   5-tuple and its hash will be constant and no load sharing will be
   achieved. If there is much tunnel traffic, this will result in a
   high probability of congestion on one of the paths between R1 and R2.

    _____           _____               _____           _____
   | TEP |_________| R1  |-------------| R2  |_________| TEP |
   |__A__|         |_____|-------------|_____|         |__B__|
            tunnel         ECMP or LAG          tunnel
                              here

   Also, for IPv6, the total number of bits in the 5-tuple is quite
   large (296), as well as inconvenient to extract due to the
   next-header placement.
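   The tunnel problem and the 296-bit figure above can be checked with
   a short Python sketch. The names path_for and tunnel_paths, the
   CRC-32 hash, the documentation-prefix addresses, and the choice of
   protocol number 41 (IPv6 encapsulation) with zero placeholder ports
   are all assumptions made for illustration; the text above does not
   specify them.

```python
import zlib

# Field widths in bits for an IPv6 5-tuple: two 128-bit addresses,
# an 8-bit protocol value, and two 16-bit port numbers.
FIVE_TUPLE_BITS = 128 + 128 + 8 + 16 + 16


def path_for(outer_fields: tuple, n_paths: int) -> int:
    """Illustrative modulo(N) hash; CRC-32 is an arbitrary choice here."""
    return zlib.crc32(repr(outer_fields).encode("utf-8")) % n_paths


def tunnel_paths(n_packets: int, n_paths: int) -> set:
    """Return the set of paths chosen for packets sharing one outer 5-tuple."""
    # Every tunneled packet carries the same outer header fields
    # (TEP A to TEP B), whatever the inner source and destination are.
    # Protocol 41 and the zero ports are hypothetical placeholder values.
    outer = ("2001:db8::a", "2001:db8::b", 41, 0, 0)
    return {path_for(outer, n_paths) for _ in range(n_packets)}
```

   Because the hash input never changes, tunnel_paths returns a set
   with a single element however many packets are sent: all tunnel
   traffic lands on one of the paths between R1 and R2.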
   Extracting that many bits may be challenging for some hardware
   implementations, raising the possibility that network equipment
   vendors might reduce the length of the fields extracted from an IPv6
   header. The question therefore arises whether the 20-bit flow label
   in IPv6 packets would be suitable for use as input to an ECMP or LAG
   hash algorithm. If it could be used in place of the port numbers and
   protocol number in the 5-tuple, the hash calculation would be
   simplified.

   The flow label is left experimental by [RFC2460] but is better
   defined by [RFC3697]. We quote three rules from that RFC:
   1. "The Flow Label value set by the source MUST be delivered
      unchanged to the destination node(s)."
   2. "IPv6 nodes MUST NOT assume any mathematical or other properties
      of the Flow Label values assigned by source nodes."
   3. "Router performance SHOULD NOT be dependent on the distribution
      of the Flow Label values. Especially, the Flow Label bits alone
      make poor material for a hash key."

   These rules, especially the last one, have caused designers to
   hesitate about using the flow label in support of ECMP or LAG. In
   practice, most nodes today set the flow label to zero, and the first
   rule definitely forbids the routing system from changing the flow
   label once a packet has left the source node. Considering normal
   IPv6 traffic, the fact that the flow label is typically zero means
   that it would add no value to an ECMP or LAG hash. But neither would
   it do any harm to the distribution of the hash values. If the
   community at some stage agrees to set pseudo-random flow labels in
   the majority of traffic flows, this would add to the value of the
   hash.

   However, in the case of an IP-in-IPv6 tunnel, the TEP is itself the
   source node of the outer packets. Therefore, a TEP may freely set a
   flow label in the outer IPv6 header of the packets it sends into the
   tunnel.
   In particular, it may follow the [RFC3697] suggestion to set
   a pseudo-random value.

   The second and third rules quoted above need to be seen in the
   context of [RFC3697], which assumes that routers using the flow
   label in some way will be involved in some sort of method of
   establishing flow state: "To enable flow-specific treatment, flow
   state needs to be established on all or a subset of the IPv6 nodes
   on the path from the source to the destination(s)." The RFC should
   perhaps have made clear that a router that has participated in flow
   state establishment can rely on properties of the resulting flow
   label values without further signaling. If a router knows these
   properties, rule 2 is irrelevant, and it can choose to deviate from
   rule 3.

   In the tunneling situation sketched above, routers R1 and R2 can rely
   on the flow labels set by TEP A and TEP B being assigned by a known
   method. This allows a safe ECMP or LAG method to be based on the
   flow label without breaching [RFC3697].

   At the time of this writing, the IETF is discussing a possible
   revision of the rules of RFC 3697 [I-D.ietf-6man-flow-update]. If
   adopted, that revision would be fully compatible with the present
   document and would obviate much of the above discussion.

2. Normative Notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3. Guidelines

   We assume that the routers supporting ECMP or LAG (R1 and R2 in the
   above figure) are unaware that they are handling tunneled traffic.
   If it is desired to include the IPv6 flow label in an ECMP or LAG
   hash in the tunneled scenario shown above, the following guidelines
   apply:
   o  Inner packets MUST be encapsulated in an outer IPv6 packet whose
      source and destination addresses are those of the tunnel end
      points (TEPs).
   o  The flow label in the outer packet SHOULD be set by the sending
      TEP to a pseudo-random 20-bit value in accordance with [RFC3697].
      The same flow label value MUST be used for all packets in a single
      user flow, as determined by the IP header fields of the inner
      packet.
      *  Note that this rule is a SHOULD rather than a MUST, to permit
         individual implementers to take an alternative approach if they
         wish to do so. Such an alternative MUST conform to [RFC3697].
   o  The sending TEP MUST classify all packets into flows, once it has
      determined that they should enter a given tunnel, and then write
      the relevant flow label into the outer IPv6 header. A user flow
      could be identified by the ingress TEP most simply by its
      {destination, source} address pair (coarse) or by its 5-tuple
      {dest addr, source addr, protocol, dest port, source port} (fine).
      This is an implementation detail in the sending TEP.
      *  It might be possible to make this classifier stateless, by
         using a suitable 20 bit hash of the inner IP header's 2-tuple
         or 5-tuple as the pseudo-random flow label value.
   o  At intermediate router(s) that perform load distribution of
      tunneled packets whose source address is a TEP, the hash algorithm
      used to determine the outgoing component-link in an ECMP and/or
      LAG toward the next-hop MUST minimally include the triple {dest
      addr, source addr, flow label} to meet the [RFC3697] rules.
      *  Intermediate router(s) MAY also include {protocol, dest port,
         source port} as input keys to the ECMP and/or LAG hash
         algorithms, to provide sufficient entropy in cases where the
         flow label is currently set to zero.

4. Security Considerations

   The flow label is not protected in any way and can be forged by an
   on-path attacker. Off-path attackers are unlikely to guess a valid
   flow label if a pseudo-random value is used. In either case, the
   worst an attacker could do against ECMP or LAG is to attempt to
   selectively overload a particular path. For further discussion, see
   [RFC3697].

5. IANA Considerations

   This document requests no action by IANA.

6. Acknowledgements

   This document was suggested by corridor discussions at IETF76. Joel
   Halpern made crucial comments on an early version. We are grateful
   to Qinwen Hu for general discussion about the flow label. Valuable
   comments and contributions were made by Jarno Rajahalme, Brian
   Haberman, Sheng Jiang, and others.

   This document was produced using the xml2rfc tool [RFC2629].

7. Change log

   draft-ietf-6man-flow-ecmp-00: after WG adoption at IETF 79,
   2010-12-02

   draft-carpenter-flow-ecmp-03: clarifications after further comments,
   2010-10-07

   draft-carpenter-flow-ecmp-02: updated after IETF77 discussion,
   especially adding LAG, changed to BCP language, added second author,
   2010-04-14

   draft-carpenter-flow-ecmp-01: updated after comments, 2010-02-18

   draft-carpenter-flow-ecmp-00: original version, 2010-01-19

8. References

8.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", RFC 2460, December 1998.

   [RFC3697]  Rajahalme, J., Conta, A., Carpenter, B., and S. Deering,
              "IPv6 Flow Label Specification", RFC 3697, March 2004.

8.2. Informative References

   [I-D.ietf-6man-flow-update]
              Amante, S., Carpenter, B., and S. Jiang, "Update to the
              IPv6 flow label specification", Internet-Draft
              ietf-6man-flow-update-00, December 2010.

   [IEEE802.1AX]
              Institute of Electrical and Electronics Engineers, "Link
              Aggregation", IEEE Standard 802.1AX-2008, 2008.

   [Lee10]    Lee, D., Carpenter, B., and N. Brownlee, "Observations of
              UDP to TCP Ratio and Port Numbers", Fifth International
              Conference on Internet Monitoring and Protection ICIMP
              2010, May 2010.

   [RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black,
              "Definition of the Differentiated Services Field (DS
              Field) in the IPv4 and IPv6 Headers", RFC 2474,
              December 1998.

   [RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
              June 1999.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991, November 2000.

Authors' Addresses

   Brian Carpenter
   Department of Computer Science
   University of Auckland
   PB 92019
   Auckland, 1142
   New Zealand

   Email: brian.e.carpenter@gmail.com


   Shane Amante
   Level 3 Communications, LLC
   1025 Eldorado Blvd
   Broomfield, CO 80021
   USA

   Email: shane@level3.net