OPSAWG                                                       R. Krishnan
Internet Draft                                                 S. Khanna
Intended status: Experimental                     Brocade Communications
Expires: July 2013                                               L. Yong
January 12, 2013                                              Huawei USA
                                                             A. Ghanwani
                                                                    Dell
                                                                 Ning So
                                                     Tata Communications
                                                           B. Khasnabish
                                                         ZTE Corporation

    Best Practices for Optimal LAG/ECMP Component Link Utilization in
                       Provider Backbone Networks

          draft-krishnan-opsawg-large-flow-load-balancing-02.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.
   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on July 12, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   Demands on networking infrastructure are growing exponentially; the
   drivers are bandwidth-hungry rich media applications, inter-data
   center communications, etc.  In this context, it is important to
   optimally use the bandwidth in service provider backbone networks,
   which extensively use LAG/ECMP techniques for bandwidth scaling.
   This draft describes the issues faced in service provider backbones
   in the context of LAG/ECMP and recommends some best practices for
   managing bandwidth efficiently in these networks.

Table of Contents

   1. Introduction
      1.1. Conventions
   2. Hash-based Load Distribution in LAG/ECMP
   3. Best Practices for Optimal LAG/ECMP Component Link Utilization
      3.1. Large Flow Recognition
         3.1.1. Flow Identification
         3.1.2. Sampling Techniques - sFlow/PSAMP
         3.1.3. Automatic Hardware Recognition
      3.2. Load Re-Balancing Options
         3.2.1. Alternative Placement of Large Flows
         3.2.2. Redistributing Other Flows
            3.2.2.1. Redistributing All Other Flows
            3.2.2.2. Redistributing the Other Flows on the Congested
                     Link
         3.2.3. Component Link Protection Considerations
         3.2.4. Load Re-Balancing Example
   4. Operational Considerations
   5. Data Model Considerations
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References
   Appendix A. Internet Traffic Analysis and Load Balancing Simulation

1. Introduction

   Service provider backbone networks make extensive use of LAG/ECMP
   techniques for capacity scaling.  Network traffic can be broadly
   categorized into two types: long-lived large flows, and other flows
   (long-lived small flows and short-lived small and large flows).
   Stateless hash-based techniques [ITCOM, RFC2991, RFC2992, RFC6790]
   are often used to distribute both long-lived large flows and other
   flows over the component links of a LAG/ECMP group.  However,
   depending on the traffic pattern, the traffic may not be evenly
   distributed over the component links.

   This draft describes best practices for optimal LAG/ECMP component
   link utilization while using hash-based techniques.  These best
   practices comprise the following steps: recognizing long-lived large
   flows in a router; and, when a component link on the router is
   congested, either assigning the long-lived large flows to specific
   LAG/ECMP component links or redistributing the other flows.

1.1. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   The following terms and acronyms are used in this document:

   COTS: Commercial Off-the-Shelf

   DOS: Denial of Service

   ECMP: Equal Cost Multi-Path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   Large flow(s): long-lived large flow(s)

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   Other flows: long-lived small flows and short-lived small/large
   flows

   QoS: Quality of Service

   VXLAN: Virtual Extensible LAN

2. Hash-based Load Distribution in LAG/ECMP

   Hashing techniques are often used for flow-based load distribution
   [ITCOM].  A large flow identification space, i.e. a finer
   granularity of flows, makes the spreading of flows over a set of
   component links more random.  The advantages of hash-based load
   distribution are the preservation of the packet sequence within a
   flow and real-time distribution without maintaining per-flow state.
   If the traffic flows are randomly spread in the flow identification
   space, the flow rates are much smaller than the link capacity, and
   the rate differences among flows are not dramatic, then the hashing
   algorithm generally works very well.  However, if one or more of
   these conditions are not met, hashing may result in very unbalanced
   loads on individual component links.  One example is illustrated in
   Figure 1.  There is a LAG between two routers R1 and R2.  This LAG
   has three component links (1), (2), (3).

   . Component link (1) has 2 other flows and 1 large flow and the
     link utilization is normal.

   . Component link (2) has 3 other flows and no large flow and the
     link utilization is light.

     o The absence of any large flow leaves the component link
       under-utilized.

   . Component link (3) has 2 other flows and 2 large flows and the
     link utilization is exceeded.

     o The presence of 2 large flows causes the component link to be
       congested.

   +-----------+        +-----------+
   |           | -> ->  |           |
   |           |=====>  |           |
   |        (1)|--/---/-|(1)        |
   |           |        |           |
   |           |        |           |
   |   (R1)    |-> -> ->|   (R2)    |
   |        (2)|--/---/-|(2)        |
   |           |        |           |
   |           | -> ->  |           |
   |           |=====>  |           |
   |           |=====>  |           |
   |        (3)|--/---/-|(3)        |
   |           |        |           |
   +-----------+        +-----------+

   Where: -> ->  other flows
          ===>   large flow

        Figure 1: Unevenly Utilized Component Links

   This document presents improved load distribution techniques based
   on large flow awareness.  These techniques compensate for the
   unbalanced load distribution that hashing can produce for such
   traffic patterns.

3. Best Practices for Optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft constitute a local
   optimization solution, "local" in the sense that both the measuring
   of large flows and the re-balancing of the load are done at
   individual nodes in the network.  This approach does not yield a
   globally optimal placement of large flows across multiple nodes in
   the network, which some networks may desire or require.  On the
   other hand, it may be adequate for some operators for the following
   reasons: 1) different links in the network experience different
   levels of utilization, and thus a more "targeted" solution is needed
   for those few hot-spots in the network; 2) some networks may lack
   end-to-end visibility, e.g. when a network carries traffic from
   multiple other networks.

   The various steps in achieving optimal LAG/ECMP component link
   utilization in backbone networks are detailed below:

   Step 1) Recognize large flows in routers, and maintain the mapping
   of each large flow to the component link that it uses.  The
   recognition of large flows is explained in Section 3.1.

   Step 2) Periodically scan the egress component links for link
   utilization.  If the utilization of an egress component link exceeds
   a pre-programmed threshold, generate an operator alert.  The large
   flows mapped to the congested egress component link are exported to
   a central management entity.

   Step 3) On receiving the alert about the congested component link,
   the operator, through a central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to which
   the component link belongs.

   Step 4) The operator can choose to rebalance the large flows onto
   lightly loaded component links of the LAG/ECMP group, or to
   redistribute all the other flows on the congested link to other
   component links of the group.
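   As a non-normative sketch of Steps 1) to 4) above: the classes,
   the 90% threshold, and the re-balancing policy in the following
   fragment are illustrative assumptions, not an actual router or
   management-station API.

```python
# Sketch of the four-step workflow: Step 1 keeps a large-flow-to-link
# mapping, Step 2 scans link utilization against a threshold, and
# Steps 3/4 move large flows from the congested link to lightly
# loaded members of the same LAG/ECMP group.  All names are
# hypothetical illustrations.

UTIL_THRESHOLD = 0.9  # pre-programmed utilization threshold (90%)

class ComponentLink:
    def __init__(self, name, capacity):
        self.name, self.capacity = name, capacity
        self.large_flows = {}   # Step 1: large flow -> measured rate
        self.other_load = 0.0   # aggregate rate of the other flows

    def utilization(self):
        return (sum(self.large_flows.values()) + self.other_load) / self.capacity

class LagGroup:
    def __init__(self, links):
        self.links = links

def scan(group, alerts):
    """Step 2: scan egress component links; alert on congestion."""
    for link in group.links:
        if link.utilization() > UTIL_THRESHOLD:
            alerts.append((link, dict(link.large_flows)))

def rebalance(group, link, flows):
    """Steps 3/4: move large flows from the congested link to the
    most lightly loaded member of the same LAG/ECMP group."""
    for flow, rate in sorted(flows.items(), key=lambda kv: -kv[1]):
        target = min((l for l in group.links if l is not link),
                     key=ComponentLink.utilization)
        if target.utilization() < link.utilization():
            link.large_flows.pop(flow)
            target.large_flows[flow] = rate

# Example: the situation of Figure 1 (rates in Gbps, 10G links)
l1 = ComponentLink("1", 10.0)
l1.large_flows, l1.other_load = {"F1": 5.0}, 2.0
l2 = ComponentLink("2", 10.0)
l2.other_load = 3.0
l3 = ComponentLink("3", 10.0)
l3.large_flows, l3.other_load = {"F2": 5.0, "F3": 5.0}, 2.0

group = LagGroup([l1, l2, l3])
alerts = []
scan(group, alerts)                 # link (3) exceeds the threshold
for link, flows in alerts:
    rebalance(group, link, flows)   # one large flow moves to link (2)
```

   In this sketch, one of the two large flows on the congested link (3)
   moves to the lightly loaded link (2), leaving each component link
   with one large flow and a utilization below the threshold.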
   The operator, through a central management entity, can choose one of
   the following actions:

   1) Indicate specific large flows to rebalance;

   2) Let the router decide the best large flows to rebalance;

   3) Let the router redistribute all the other flows on the congested
      link to other component links in the group.

   The central management entity conveys the above information to the
   router.  The load re-balancing options are explained in Section 3.2.

   Optionally, if desired, steps 2) to 4) could become an automated
   process.

   The techniques described above are especially useful when bundling
   links of different bandwidths, e.g. 10 Gbps and 100 Gbps, as
   described in [I-D.ietf-rtgwg-cl-requirement].

3.1. Large Flow Recognition

3.1.1. Flow Identification

   A flow (large flow or other flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained.  Flows are
   commonly identified using one of the following sets of fields in a
   packet header:

   . Layer 2: source MAC address, destination MAC address, VLAN ID

   . IP 5-tuple: IP protocol, IP source address, IP destination
     address, TCP/UDP source port, TCP/UDP destination port

   . IP 3-tuple: IP protocol, IP source address, IP destination
     address

   . MPLS labels

   . IPv6: IP source address, IP destination address, and IPv6 flow
     label [RFC6437]

   For tunneling protocols such as GRE, VXLAN, and NVGRE, flow
   identification is possible based on the inner and/or outer headers.

   The above list is not exhaustive.  The best practices described in
   this document are agnostic to the fields that are used for flow
   identification.

3.1.2. Sampling Techniques - sFlow/PSAMP

   Enable sFlow [RFC3176]/PSAMP [RFC5475] sampling on all the egress
   ports in the routers.
   Through sFlow processing in an sFlow collector, an approximate
   indication of the large flows mapping to each of the component links
   in each LAG/ECMP group is available.  The advantages and
   disadvantages of sFlow/PSAMP are detailed below.

   Advantages:

   . Supported in most routers.

   . Requires minimal router resources.

   Disadvantages:

   . Large flow recognition is not instantaneous.

   The time taken to determine a candidate large flow depends on the
   number of sFlow samples being generated and the processing power of
   the external sFlow collector.

3.1.3. Automatic Hardware Recognition

   Implementations may choose to recognize large flows automatically in
   the hardware of a router.  The characteristics of such an
   implementation would be:

   . Inline solution

   . Maintains line-rate performance

   . Performs accounting of large flows with a high degree of
     accuracy

   Using automatic hardware recognition of large flows, an accurate
   indication of the large flows mapped to each of the component links
   in a LAG/ECMP group is available.  The advantages and disadvantages
   of automatic hardware recognition are:

   Advantages:

   . Accurate and in real time

   Disadvantages:

   . Not supported in many routers

   The measurement interval for determining a large flow and the
   bandwidth threshold of a large flow would be programmable parameters
   in the router.

   The implementation of automatic hardware recognition of large flows
   is vendor dependent.  Below is a suggested technique.

   Suggested Technique for Automatic Hardware Recognition

   Step 1) If the flow already exists in a hardware table resource such
   as a TCAM, increment the counter of the flow.  Otherwise, proceed to
   Step 2.

   Step 2) There are multiple hash tables, each with a different hash
   function.  Each hash table entry has an associated counter.
   On packet arrival, a new flow is looked up in parallel in all the
   hash tables and the corresponding counters are incremented.  If, in
   a given time interval, the counters exceed a programmed threshold in
   all the hash table entries, a candidate large flow is learnt and
   programmed into a hardware table resource such as a TCAM.

   There may be some false positives due to multiple other flows
   masquerading as a large flow; the number of false positives is
   reduced by the parallel hashing with different hash functions.

3.2. Load Re-Balancing Options

   Below are suggested techniques for load re-balancing.  Equipment
   vendors should implement all of these techniques and allow the
   operator to choose one or more of them based on their applications.

3.2.1. Alternative Placement of Large Flows

   Within the LAG/ECMP group, choose the other member component links
   with the least average port utilization.  Move some large flow(s)
   from the heavily loaded component link to those member component
   links using a Policy Based Routing (PBR) rule in the ingress
   processing element(s) of the routers.  The key aspects of this
   approach are:

   . Other flows are not subjected to flow re-ordering.

   . Only certain large flows are subjected to momentary flow
     re-ordering.

   Note that perfect re-balancing of large flows may not be possible
   since flows arrive and depart at different times.

3.2.2. Redistributing Other Flows

   Some large flows may consume the entire bandwidth of the component
   link(s).  In this case, it would be desirable for the other flows to
   avoid the congested component link(s).  This can be accomplished in
   one of the following ways.

3.2.2.1. Redistributing All Other Flows

   This works on existing router hardware.  The idea is to prevent the
   other flows from hashing into the congested component link(s).

   . Modify the LAG/ECMP table to only include the non-congested
     component link(s).
     The other flows hash into this table and are thereby mapped to a
     destination component link.

   . All the other flows are subject to momentary flow re-ordering.

   . The PBR rules for large flows (refer to Section 3.2.1) have
     strict precedence over the LAG/ECMP table lookup result.

3.2.2.2. Redistributing the Other Flows on the Congested Link

   This needs a switch/router hardware change.

   . If a packet belongs to one of the other flows and is hashed to a
     congested component link, apply a second hash function to it,
     which maps the flow to one of the non-congested component links.

   . The other flows originally directed to the congested link are
     re-directed to other, non-congested component links.

   . The other flows originally directed to a congested component
     link are subject to momentary flow re-ordering.

3.2.3. Component Link Protection Considerations

   If desired, certain component links may be reserved for link
   protection.  These reserved component links are not used for any of
   the flows described in Section 3.2.  When component link(s) fail,
   all the flows on the failed component link(s) are moved to the
   reserved component link(s): both the large-flow-to-component-link
   mapping table and the LAG/ECMP hash table simply replace the
   reference pointer to the failed component link with a pointer to the
   reserved link.

3.2.4. Load Re-Balancing Example

   Optimal LAG/ECMP component link utilization for the use case in
   Figure 1 is depicted below in Figure 2.  The large flow re-balancing
   explained in Section 3.2.1 is used.  The improved link utilization
   is as follows:

   . Component link (1) has 2 other flows and 1 large flow and the
     link utilization is normal.

   . Component link (2) has 3 other flows and 1 large flow and the
     link utilization is now normal.

   . Component link (3) has 2 other flows and 1 large flow and the
     link utilization is now normal.

   +-----------+        +-----------+
   |           | -> ->  |           |
   |           |=====>  |           |
   |        (1)|--/---/-|(1)        |
   |           |        |           |
   |           |=====>  |           |
   |   (R1)    |-> -> ->|   (R2)    |
   |        (2)|--/---/-|(2)        |
   |           |        |           |
   |           |        |           |
   |           | -> ->  |           |
   |           |=====>  |           |
   |        (3)|--/---/-|(3)        |
   |           |        |           |
   +-----------+        +-----------+

   Where: -> ->  other flows
          ===>   large flow

         Figure 2: Evenly Utilized Component Links

4. Operational Considerations

   This section is for future study.  The authors would like to get
   operator input here.

5. Data Model Considerations

   For Step 2 in Section 3, the IETF could potentially consider a
   standards-based activity around, say, a data model used to move the
   long-lived large flow information from the router to the central
   management entity.

   For Step 4 in Section 3, the IETF could potentially consider a
   standards-based activity around, say, a data model used to move the
   long-lived large flow re-balancing information from the central
   management entity to the router.

6. IANA Considerations

   This memo includes no request to IANA.

7. Security Considerations

   This document does not directly impact the security of the Internet
   infrastructure or its applications.  In fact, the techniques
   described could help in the presence of a DOS attack pattern that
   causes a hash imbalance resulting in heavy overloading of certain
   LAG/ECMP component links with large flows.

8. Acknowledgements

   The authors would like to thank Shane Amante for all the support and
   valuable input.  The authors would like to thank Curtis Villamizar
   for his valuable input.  The authors would also like to thank Fred
   Baker and Wes George for their input.

9. References

9.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References

   [CAIDA]    CAIDA Internet Traffic Analysis, http://www.caida.org/

   [I-D.ietf-rtgwg-cl-requirement]
              Villamizar, C., Ed., et al., "Requirements for MPLS Over
              a Composite Link", June 2012.

   [ITCOM]    Jo, J., et al., "Internet traffic load balancing using
              dynamic hashing with flow volume", SPIE ITCOM, 2002.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991, November 2000.

   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
              Algorithm", RFC 2992, November 2000.

   [RFC3176]  Phaal, P., Panchen, S., and N. McKee, "InMon
              Corporation's sFlow: A Method for Monitoring Traffic in
              Switched and Routed Networks", RFC 3176, September 2001.

   [RFC5475]  Zseby, T., Molina, M., Duffield, N., Niccolini, S., and
              F. Raspall, "Sampling and Filtering Techniques for IP
              Packet Selection", RFC 5475, March 2009.

   [RFC6437]  Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme,
              "IPv6 Flow Label Specification", RFC 6437, November 2011.

   [RFC6790]  Kompella, K., Drake, J., Amante, S., Henderickx, W., and
              L. Yong, "The Use of Entropy Labels in MPLS Forwarding",
              RFC 6790, November 2012.

   [YONG]     Yong, L., "Enhanced ECMP and Large Flow Aware Transport",
              draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

   Internet traffic [CAIDA] has been analyzed with respect to the
   packet volume per flow.  The 5-tuple in the packet header (IP
   addresses, TCP/UDP ports, and IP protocol) is used as the flow
   identification.  The analysis indicates that the top ~2% of flows
   ranked by rate carry about ~30% of the total traffic volume, while
   the remaining ~98% of flows contribute the other ~70% [YONG].

   The simulation has shown that, given this Internet traffic pattern,
   the hash method does not evenly distribute the flows over ECMP
   paths.  Some links may be more than 90% loaded while others are less
   than 40% loaded.  The more ECMP paths there are, the more severe the
   imbalance.
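   This kind of imbalance can be reproduced with a small non-normative
   simulation.  The flow-size mix and the CRC-based placement below are
   assumptions made for illustration; this is not the simulation used
   in [YONG].

```python
# Toy simulation of hash-based ECMP flow placement with a heavy-tailed
# traffic mix: ~2% of the flows are "large" and carry a large share of
# the volume.  The distribution and hash function are assumptions.
import random
import zlib

random.seed(7)
NUM_LINKS = 8
NUM_FLOWS = 1000

flows = {}
for i in range(NUM_FLOWS):
    # first 20 flows (2%) are large; the rest are small "other" flows
    rate = random.uniform(8.0, 15.0) if i < 20 else random.uniform(0.1, 1.0)
    flows["flow-%d" % i] = rate

# Stateless hash-based placement: component link = hash(flow id) mod N
loads = [0.0] * NUM_LINKS
for flow_id, rate in flows.items():
    loads[zlib.crc32(flow_id.encode()) % NUM_LINKS] += rate

total = sum(loads)
for link, load in enumerate(loads):
    print("link %d: %6.1f  (%4.1f%% of total)" % (link, load, 100 * load / total))
print("max/min link load ratio: %.2f" % (max(loads) / min(loads)))
```

   With such a mix, a handful of large flows landing on the same
   component link is enough to skew the per-link loads, even though the
   number of flows per link is nearly uniform.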
   This implies that hash-based distribution can leave some paths
   congested while other paths are only partially filled [YONG].

   The simulation also shows the substantial improvement obtained by
   using the large flow aware distribution technique described in this
   document.  Using the same simulated traffic, the improved
   re-balancing can achieve load differences of less than 10% among the
   links.  This demonstrates that large flow aware distribution can
   effectively compensate for the uneven load balancing caused by
   hashing and the traffic pattern.

Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, CA 95134, USA
   Phone: +1-408-406-7890
   Email: ramk@brocade.com

   Sanjay Khanna
   Brocade Communications
   San Jose, CA 95134, USA
   Phone: +1-408-333-4850
   Email: skhanna@brocade.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: +1-469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134, USA
   Phone: +1-408-571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +1-972-955-0914
   Email: ning.so@tatacommunications.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +1-781-752-8003
   Email: bhumip.khasnabish@zteusa.com