idnits 2.17.1

draft-krishnan-opsawg-large-flow-load-balancing-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  -- The document has an IETF Trust Provisions (28 Dec 2009) Section 6.c(ii)
     Publication Limitation clause. If this document is intended for
     submission to the IESG for publication, this constitutes an error.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There is 1 instance of too long lines in the document, the longest one
     being 3 characters in excess of 72.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008. The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs. If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer.
     Otherwise, the disclaimer is needed and you can ignore this comment.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.
  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Unused Reference: '1' is defined on line 533, but no explicit reference
     was found in the text

  == Unused Reference: '2' is defined on line 536, but no explicit reference
     was found in the text

  == Unused Reference: 'RFC2234' is defined on line 543, but no explicit
     reference was found in the text

  == Unused Reference: 'I-D.ietf-rtgwg-cl-requirement' is defined on line
     549, but no explicit reference was found in the text

  == Unused Reference: 'I-D.ietf-mpls-entropy-label' is defined on line 552,
     but no explicit reference was found in the text

  == Unused Reference: 'I-D.kj-nvo3-pion-architecture' is defined on line
     555, but no explicit reference was found in the text

  ** Obsolete normative reference: RFC 2234 (ref. '2') (Obsoleted by RFC 4234)

  -- Duplicate reference: RFC2119, mentioned in 'RFC2119', was also mentioned
     in '1'.

  -- Duplicate reference: RFC2234, mentioned in 'RFC2234', was also mentioned
     in '2'.

  ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)

     Summary: 3 errors (**), 0 flaws (~~), 9 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    OPSAWG                                  R. Krishnan
2    Internet Draft                          S. Khanna
3    Intended status: Experimental           Brocade Communications
4    Expires: May 2013                       B. Khasnabish
5    November 4, 2012                        ZTE Corporation
6                                            A. Ghanwani
7                                            Dell

9       Best Practices for Optimal LAG/ECMP Component Link Utilization in
10                         Provider Backbone networks

12            draft-krishnan-opsawg-large-flow-load-balancing-01.txt

14   Status of this Memo

16      This Internet-Draft is submitted in full conformance with the
17      provisions of BCP 78 and BCP 79.

19      This Internet-Draft is submitted in full conformance with the
20      provisions of BCP 78 and BCP 79.
        This document may not be modified,
21      and derivative works of it may not be created, and it may not be
22      published except as an Internet-Draft.

24      This Internet-Draft is submitted in full conformance with the
25      provisions of BCP 78 and BCP 79. This document may not be modified,
26      and derivative works of it may not be created, except to publish it
27      as an RFC and to translate it into languages other than English.

29      This document may contain material from IETF Documents or IETF
30      Contributions published or made publicly available before November
31      10, 2008. The person(s) controlling the copyright in some of this
32      material may not have granted the IETF Trust the right to allow
33      modifications of such material outside the IETF Standards Process.
34      Without obtaining an adequate license from the person(s) controlling
35      the copyright in such materials, this document may not be modified
36      outside the IETF Standards Process, and derivative works of it may
37      not be created outside the IETF Standards Process, except to format
38      it for publication as an RFC or to translate it into languages other
39      than English.

41      Internet-Drafts are working documents of the Internet Engineering
42      Task Force (IETF), its areas, and its working groups. Note that
43      other groups may also distribute working documents as Internet-
44      Drafts.

46      Internet-Drafts are draft documents valid for a maximum of six months
47      and may be updated, replaced, or obsoleted by other documents at any
48      time. It is inappropriate to use Internet-Drafts as reference
49      material or to cite them other than as "work in progress."
51   Internet-Draft   Best Practices for Optimal LAG/ECMP Component Link
52   Utilization in Provider Backbone networks               November 2012

54      The list of current Internet-Drafts can be accessed at
55      http://www.ietf.org/ietf/1id-abstracts.txt

57      The list of Internet-Draft Shadow Directories can be accessed at
58      http://www.ietf.org/shadow.html

60      This Internet-Draft will expire on May 4, 2013.

62   Copyright Notice

64      Copyright (c) 2012 IETF Trust and the persons identified as the
65      document authors. All rights reserved.

67      This document is subject to BCP 78 and the IETF Trust's Legal
68      Provisions Relating to IETF Documents
69      (http://trustee.ietf.org/license-info) in effect on the date of
70      publication of this document. Please review these documents
71      carefully, as they describe your rights and restrictions with respect
72      to this document. Code Components extracted from this document must
73      include Simplified BSD License text as described in Section 4.e of
74      the Trust Legal Provisions and are provided without warranty as
75      described in the Simplified BSD License.

77      This document is subject to BCP 78 and the IETF Trust's Legal
78      Provisions Relating to IETF Documents
79      (http://trustee.ietf.org/license-info) in effect on the date of
80      publication of this document. Please review these documents
81      carefully, as they describe your rights and restrictions with respect
82      to this document.

84   Abstract

86      The demands on the networking infrastructure are growing
87      exponentially; the drivers include bandwidth-hungry rich media
88      applications and inter-data-center communications. In this context,
89      it is important to use bandwidth optimally in service
90      provider backbone networks, which extensively use LAG/ECMP techniques
91      for bandwidth scaling.
        This internet draft describes the issues faced
92      in the service provider backbone in the context of LAG/ECMP and
93      formulates best practice recommendations for managing the bandwidth
94      efficiently in the service provider backbone.

99   Table of Contents

101     1. Introduction
102        1.1. Conventions used
103     2. Sub-optimal LAG/ECMP Component Link Utilization in the current
104        framework
105     3. Best practices for optimal LAG/ECMP Component Link Utilization
106        3.1. Long-lived Large Flow Identification
107           3.1.1. sFlow/Netflow
108           3.1.2. Automatic hardware identification
109              3.1.2.1. Suggested Technique for Automatic Hardware
110              Identification
111        3.2. Long-lived Large Flow Re-balancing
112           3.2.1. No re-balancing of short-lived small flows
113           3.2.2. Other Techniques
114           3.2.3. Re-balancing of long-lived large flows and short-lived
115           small flows - an example
116     4. Acknowledgements
117     5. IANA Considerations
118     6. Security Considerations
119     7. References
120        7.1. Normative References
121        7.2. Informative References

123  1. Introduction

125     Service provider backbone networks extensively use LAG/ECMP
126     techniques for bandwidth scaling.
        Network traffic can be
127     predominantly categorized into two traffic types: long-lived large
128     flows and short-lived small flows. Hashing techniques, which perform
129     an approximate distribution of these flows across the LAG/ECMP
130     component links, typically result in a sub-optimal utilization of
131     LAG/ECMP component links. Round Robin load-balancing techniques
132     address this problem but have the side effect of causing packet re-
133     ordering. This internet draft recommends best practices for optimal
134     LAG/ECMP component link utilization while using hashing techniques.
135     These best practices comprise two steps: the first is the
136     identification of long-lived large flows in routers, and the next is
137     assigning the long-lived large flows to specific LAG/ECMP component
138     links.

143  1.1. Conventions used

145     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
146     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
147     document are to be interpreted as described in RFC 2119 [RFC2119].

152     The following acronyms are used:

154     ECMP: Equal Cost Multi-path
156     LAG: Link Aggregation Group
158     QoS: Quality of Service
160     MPLS: Multiprotocol Label Switching
162     DOS: Denial of Service

164  2. Sub-optimal LAG/ECMP Component Link Utilization in the current framework

166     Hashing techniques, which perform an approximate distribution of
167     long-lived large flows and short-lived small flows across the
168     LAG/ECMP component links, typically result in a sub-optimal
169     utilization of LAG/ECMP component links. This is depicted in Figure 1
170     with a detailed description below.

172     . There is a LAG between 2 routers R1 and R2. This LAG has 3
173       component links (1), (2), (3)

175     .
          Component link (1) has 2 short-lived small flows and 1 long-
176       lived large flow and the link capacity is optimally utilized

178     . Component link (2) has 3 short-lived small flows and no long-
179       lived large flow and the link capacity is sub-optimally utilized

181          o The absence of any long-lived large flow causes
182            under-utilization of the component link

184     . Component link (3) has 2 short-lived small flows and 2 long-
185       lived large flows and the link capacity is over-utilized.

187          o The presence of 2 long-lived large flows causes
188            over-utilization of the component link

193     |-----------|        |-----------|
195     |           | -> ->  |           |
197     |           |=====>  |           |
199     |        (1)|--/---/-|(1)        |
201     |           |        |           |
203     |           |        |           |
205     |   (R1)    |-> -> ->|   (R2)    |
207     |        (2)|--/---/-|(2)        |
209     |           |        |           |
211     |           | -> ->  |           |
213     |           |=====>  |           |
215     |           |=====>  |           |
217     |        (3)|--/---/-|(3)        |
219     |           |        |           |
221     |-----------|        |-----------|

223     Figure 1: Long-lived Large Flows - uneven distribution across
225     LAG/ECMP component links

230  3. Best practices for optimal LAG/ECMP Component Link Utilization

232     The suggested techniques in this draft for optimal LAG/ECMP component
233     link utilization are meant to put forth a locally optimized
234     solution, i.e., local in the sense of both measuring and optimizing
235     for long-lived large flows at individual nodes in the network. This
236     approach would not yield a globally optimal placement of a large,
237     long-lived flow across several nodes in the network, which some
238     networks may desire or require.
        On the other hand, this may be adequate
239     for some operators for the following reasons: 1) different links in
240     the network experience different levels of utilization and, thus, a
241     more "targeted" solution is needed for those few hot-spots in the
242     network; 2) some networks may lack end-to-end visibility.

244     The various steps in achieving optimal LAG/ECMP component link
245     utilization in backbone networks are detailed below.

247     Step 1) This involves identifying long-lived large flows in the
248     egress processing elements in routers; besides the flow parameters,
249     this also involves identifying the egress component link the flow is
250     using. The identification of long-lived large flows is explained in
251     detail in Section 3.1.

253     Step 2) The egress component links are periodically scanned for link
254     utilization. If the egress component link utilization exceeds a pre-
255     programmed threshold, an operator alert is generated. The long-lived
256     large flows mapping to the congested egress component link are
257     exported to a central management entity. The IETF could potentially
258     consider a standards-based activity around, say, a data-model used to
259     move this information from the router to the central management
260     entity.

262     Step 3) On receiving the alert about the congested component link,
263     the operator, through a central management entity, finds out the long-
264     lived large flows mapping to the component link and the LAG/ECMP
265     group to which the component link maps.

267     Step 4) The operator can choose to rebalance the long-lived large
268     flows onto lightly loaded component links of the LAG/ECMP group. The
269     operator, through a central management entity, can either 1) indicate
270     specific long-lived large flows to rebalance or 2) let the router
271     decide the best long-lived large flows to rebalance. The central
272     management entity conveys the above information to the router.
        The IETF could
277     potentially consider a standards-based activity around, say, a data-
278     model used to move this information from the central management
279     entity to the router. The re-balancing of long-lived large flows is
280     explained in detail in Section 3.2.

282     Optionally, if desired, steps 2) to 4) could be automated, resulting
283     in automatic rebalancing of long-lived large flows.

285  3.1. Long-lived Large Flow Identification

287     A flow (long-lived large flow or short-lived small flow) can be
288     defined using one of the following suggested formats, as described
289     below:

291     . IP 5 tuple: IP Protocol, IP source address, IP destination
292       address, TCP/UDP source port, TCP/UDP destination port

294     . IP 3 tuple: IP Protocol, IP source address, IP destination
295       address

297     . MPLS Labels

299     . VXLAN, NVGRE

301     . IP source address, IP destination address and IPv6 flow label
302       (RFC 6437)

304     . Other formats

306     The best practices described in this document are agnostic to the
307     format of the flow.

309  3.1.1. sFlow/Netflow

311     Enable sFlow/Netflow sampling on all the egress ports in the routers.
312     Through sFlow processing in an sFlow collector, an approximate
313     indication of large flows mapping to each of the component links in
314     each LAG/ECMP group is available. The advantages and disadvantages of
315     sFlow/Netflow are detailed below.

317     Advantages of sFlow/Netflow

319     . Supported in most routers

321     . Minimal router resources

326     Disadvantages of sFlow/Netflow

328     . Approximate identification of long-lived large flows

330     .
          Non real-time identification of long-lived large flows based on
331       historical analysis

333     The time taken to determine a candidate long-lived large flow would
334     depend on the number of sFlow samples being generated and the
335     processing power of the external sFlow collector; this is under
336     further study.

338  3.1.2. Automatic hardware identification

340     Implementations may choose to implement automatic identification of
341     long-lived large flows in hardware in egress processing elements of
342     routers. The characteristics of such an implementation would be:

344     . Inline solution

346     . Minimal system resources

348     . Maintain line-rate performance

350     . Perform accounting of long-lived large flows with a high degree
351       of accuracy

353     Using automatic hardware identification of long-lived large flows, an
354     accurate indication of large flows mapping to each of the component
355     links in a LAG/ECMP group is available. The advantages and
356     disadvantages of automatic hardware identification are detailed
357     below.

359     Advantages of Automatic Hardware Identification

361     . Accurate identification of long-lived large flows

363     . Real-time identification of long-lived large flows

365     Disadvantages of Automatic Hardware Identification

367     . Not supported in many routers

369     The measurement interval for determining a candidate long-lived large
370     flow and the minimum bandwidth of the long-lived large flow would be
371     programmable parameters in the router; this is under further study.

376     The implementation of automatic hardware identification of long-lived
377     large flows is vendor dependent. Below is a suggested technique.

379  3.1.2.1. Suggested Technique for Automatic Hardware Identification

381     There are multiple hash tables, each with a different hash function.
382     Each hash table entry has an associated counter.
        On packet arrival, a
383     new flow is looked up in parallel in all the hash tables and the
384     corresponding counter in each table is incremented. If the counter
385     exceeds a programmed threshold in a given time interval in all the
386     hash table entries, a candidate long-lived large flow is learnt and
387     programmed in a hardware table resource such as a TCAM. There may be
388     some false positives due to multiple short-lived small flows
389     masquerading as a long-lived large flow; the number of false
390     positives is reduced by parallel hashing.

392  3.2. Long-lived Large Flow Re-balancing

394     Below are suggested techniques for long-lived large flow re-
395     balancing. Our suggestion is that router vendors should implement
396     all these techniques and that the operator chooses the right technique
397     based on various application needs. Perfect re-balancing of long-
398     lived large flows may not be possible since flows can arrive and
399     depart at different times.

401  3.2.1. No re-balancing of short-lived small flows

403     In the LAG/ECMP group, choose other member component links with the
404     least average port utilization. Move the long-lived large flow(s) from
405     the heavily loaded component link to the new member component links
406     using a policy-based routing (PBR) rule in the ingress processing
407     element(s) in the routers.

409     The benefits of this algorithm are:

411     . Short-lived small flows are not subjected to flow re-ordering

413     . Only certain long-lived large flows are subjected to flow re-
414       ordering

416     The disadvantages of this algorithm are:

418     . There may be a Quality of Service (QoS) impact on the existing
419       short-lived flows

424  3.2.2. Other Techniques

426     It is possible to use other algorithms, for example, removing a member
427     component link from the LAG/ECMP group and using it only for long-
428     lived large flows.

430  3.2.3.
     Re-balancing of long-lived large flows and short-lived small
431  flows - an example

433     Optimal LAG/ECMP component link utilization for the use case in
434     Figure 1 is depicted below in Figure 2. This is achieved as follows:

436     Step 1) Long-lived large flows are identified in the egress
437     processing elements of router R1 using techniques suggested in
438     Section 3.1.

440     Step 2) An operator alert is generated indicating that egress
441     component link (3) in router R1 is congested. The long-lived large
442     flows mapping to the congested egress component link are exported
443     from the router to a central management entity.

445     Step 3) On receiving the alert about the congested component link
446     (3), the operator, through a central management entity, finds out the
447     long-lived large flows mapping to the component link and the LAG/ECMP
448     group to which the component link maps.

450     Step 4) The operator, through a central management entity, can choose
451     to rebalance the long-lived large flows onto lightly loaded component
452     links of the LAG/ECMP group using the suggested techniques in Section
453     3.2. In the router, a long-lived large flow is moved from component
454     link (3) to component link (2) by using a PBR rule in the ingress
455     processing element(s) in the routers.

457     A detailed description of Figure 2 follows:

459     . There is a LAG between 2 routers R1 and R2. This LAG has 3
460       component links (1), (2), (3)

462     . Component link (1) has 2 short-lived small flows and 1 long-
463       lived large flow and the link capacity is optimally utilized

465     . Component link (2) has 3 short-lived small flows and 1 long-
466       lived large flow and the link capacity is optimally utilized

468     .
          Component link (3) has 2 short-lived small flows and 1 long-
469       lived large flow and the link capacity is optimally utilized

474     |-----------|        |-----------|
476     |           | -> ->  |           |
478     |           |=====>  |           |
480     |        (1)|--/---/-|(1)        |
482     |           |        |           |
484     |           |=====>  |           |
486     |   (R1)    |-> -> ->|   (R2)    |
488     |        (2)|--/---/-|(2)        |
490     |           |        |           |
492     |           |        |           |
494     |           | -> ->  |           |
496     |           |=====>  |           |
498     |        (3)|--/---/-|(3)        |
500     |           |        |           |
502     |-----------|        |-----------|

504     Figure 2: Long-lived Large Flows - even distribution across
506     LAG/ECMP component links

508  4. Acknowledgements

510     The authors would like to thank Shane Amante for all the support and
511     valuable input. The authors would also like to thank Fred Baker and
512     Wes George for their input.

514  5. IANA Considerations

516     This memo includes no request to IANA.

521  6. Security Considerations

523     This document does not directly impact the security of the Internet
524     infrastructure or its applications. In fact, it could help if there
525     is a DOS attack pattern which causes a hash imbalance resulting in
526     heavy overloading of large flows to certain LAG/ECMP component
527     links.

529  7. References

531  7.1. Normative References

533     [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
534          Levels", BCP 14, RFC 2119, March 1997.

536     [2]  Crocker, D. and Overell, P.(Editors), "Augmented BNF for Syntax
537          Specifications: ABNF", RFC 2234, Internet Mail Consortium and
538          Demon Internet Ltd., November 1997.

540     [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
541               Requirement Levels", BCP 14, RFC 2119, March 1997.

543     [RFC2234] Crocker, D.
               and Overell, P.(Editors), "Augmented BNF for
544               Syntax Specifications: ABNF", RFC 2234, Internet Mail
545               Consortium and Demon Internet Ltd., November 1997.

547  7.2. Informative References

549     [I-D.ietf-rtgwg-cl-requirement] C. Villamizar et al., "Requirements
550     for MPLS Over a Composite Link", June 2012

552     [I-D.ietf-mpls-entropy-label] K. Kompella et al., "The Use of
553     Entropy Labels in MPLS Forwarding", July 2012

555     [I-D.kj-nvo3-pion-architecture] L. Jin and B. Khasnabish,
556     "Architecture of PSN Independent Overlay Network(PION)," May 2012.

558     Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
559     Multicast", RFC 2991, November 2000.

561     Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm",
566     RFC 2992, November 2000.

568     Newman, D. and T. Player, "Hash and Stuffing: Overlooked Factors in
569     Network Device Benchmarking", RFC 4814, March 2007.

571  Authors' Addresses

573     Ram Krishnan
575     Brocade Communications
577     San Jose, 95134, USA
579     Phone: +001-408-406-7890
581     Email: ramk@brocade.com

583     Sanjay Khanna
585     Brocade Communications
587     San Jose, 95134, USA
589     Phone: +001-408-333-4850
591     Email: skhanna@brocade.com

593     Anoop Ghanwani
595     Dell
596     San Jose, CA 95134
597     Phone: (408) 571-3228
598     Email: anoop@alumni.duke.edu

603     Bhumip Khasnabish
605     ZTE Corporation
607     New Jersey, 07960, USA
609     Phone: +001-781-752-8003
611     Email: bhumip.khasnabish@zteusa.com
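
Illustrative sketch for Section 3.1.2.1

The parallel hash table technique suggested in Section 3.1.2.1 can be
sketched in software as shown below. This is a minimal sketch under
stated assumptions, not a hardware design and not the authors'
implementation. The names FlowKey and LargeFlowDetector, the default
table count and size, the use of salted BLAKE2 digests to stand in for
"a different hash function" per table, the choice to count bytes rather
than packets, and the Python set standing in for a TCAM-like flow table
are all hypothetical choices that the draft does not specify.

```python
import hashlib
from collections import namedtuple

# Hypothetical flow key: the IP 5 tuple format listed in Section 3.1.
FlowKey = namedtuple("FlowKey", "proto src dst sport dport")


class LargeFlowDetector:
    """Software model of several hash tables, each with its own counters."""

    def __init__(self, num_tables=4, table_size=1024, threshold_bytes=1_000_000):
        self.num_tables = num_tables
        self.table_size = table_size
        self.threshold_bytes = threshold_bytes
        # One counter array per hash table.
        self.counters = [[0] * table_size for _ in range(num_tables)]
        # Stand-in for the hardware flow table (e.g. TCAM) of learnt flows.
        self.large_flows = set()

    def _index(self, table, key):
        # Assumption: a distinct salt per table emulates a distinct hash
        # function per table.
        digest = hashlib.blake2b(
            repr(tuple(key)).encode(), digest_size=8, salt=bytes([table])
        ).digest()
        return int.from_bytes(digest, "big") % self.table_size

    def on_packet(self, key, length):
        """Account one packet; return True if the flow is a learnt candidate."""
        if key in self.large_flows:
            return True
        exceeded_everywhere = True
        for table in range(self.num_tables):
            slot = self._index(table, key)
            self.counters[table][slot] += length
            if self.counters[table][slot] <= self.threshold_bytes:
                exceeded_everywhere = False
        if exceeded_everywhere:
            # The counter exceeded the threshold in all tables: learn the flow.
            self.large_flows.add(key)
        return exceeded_everywhere

    def end_interval(self):
        """Clear all counters at the end of a measurement interval."""
        for table in self.counters:
            for slot in range(len(table)):
                table[slot] = 0


det = LargeFlowDetector(num_tables=4, table_size=64, threshold_bytes=10_000)
small = FlowKey(17, "10.0.0.3", "10.0.0.4", 5353, 53)
big = FlowKey(6, "10.0.0.1", "10.0.0.2", 1234, 80)
small_result = det.on_packet(small, 100)
big_results = [det.on_packet(big, 1500) for _ in range(7)]
```

Requiring the threshold to be exceeded in every table is what limits
false positives: several small flows may share a counter in one table,
but they are unlikely to share counters with each other in all tables
at once. Learnt flows are kept across end_interval() in this sketch;
aging them out is left open, as in the draft.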