OPSAWG                                                       R. Krishnan
Internet Draft                                                 S. Khanna
Intended status: Informational                    Brocade Communications
Expires: November 2013                                           L. Yong
May 8, 2013                                                   Huawei USA
                                                             A. Ghanwani
                                                                    Dell
                                                                 Ning So
                                                     Tata Communications
                                                           B. Khasnabish
                                                          ZTE Corporation

      Mechanisms for Optimal LAG/ECMP Component Link Utilization in
                                 Networks

           draft-ietf-opsawg-large-flow-load-balancing-00.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on November 8, 2013.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Abstract

Demands on networking infrastructure are growing exponentially; the drivers are bandwidth-hungry rich media applications, inter-data center communications, etc. In this context, it is important to optimally use the bandwidth in wired networks that extensively use LAG/ECMP techniques for bandwidth scaling.
This draft explores some of the mechanisms useful for achieving this.

Table of Contents

1. Introduction
   1.1. Acronyms
   1.2. Terminology
2. Hash-based Load Distribution in LAG/ECMP
3. Mechanisms for Optimal LAG/ECMP Component Link Utilization
   3.1. Large Flow Recognition
      3.1.1. Flow Identification
      3.1.2. Criteria for Identifying a Large Flow
      3.1.3. Sampling Techniques
      3.1.4. Automatic Hardware Recognition
   3.2. Load Re-balancing Options
      3.2.1. Alternative Placement of Large Flows
      3.2.2. Redistributing Small Flows
      3.2.3. Component Link Protection Considerations
      3.2.4. Load Re-Balancing Example
4. Information Model for Flow Re-balancing
   4.1. Configuration Parameters
   4.2. Import of Flow Information
5. Operational Considerations
6. IANA Considerations
7. Security Considerations
8. Acknowledgements
9. References
   9.1. Normative References
   9.2. Informative References

1. Introduction

Networks extensively use LAG/ECMP techniques for capacity scaling.
Network traffic can be predominantly categorized into two traffic types: long-lived large flows and other flows (which include long-lived small flows and short-lived small/large flows). Stateless hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used to distribute both long-lived large flows and other flows over the component links in a LAG/ECMP. However, the traffic may not be evenly distributed over the component links due to the traffic pattern.

This draft describes best practices for optimal LAG/ECMP component link utilization while using hash-based techniques. These best practices comprise the following steps: recognizing long-lived large flows in a router, and assigning the long-lived large flows to specific LAG/ECMP component links or redistributing other flows when a component link on the router is congested.

It is useful to keep in mind that the typical use case is one where the long-lived large flows are those that consume a significant amount of bandwidth on a link, e.g. greater than 5% of link bandwidth. The number of such flows would necessarily be fairly small, e.g. on the order of tens or hundreds per link. In other words, the number of long-lived large flows is NOT expected to be on the order of millions of flows. Examples of such long-lived large flows would be IPsec tunnels in service provider backbones or storage backup traffic in data center networks.

1.1. Acronyms

COTS: Commercial Off-the-shelf

DOS: Denial of Service

ECMP: Equal Cost Multi-path

GRE: Generic Routing Encapsulation

LAG: Link Aggregation Group

MPLS: Multiprotocol Label Switching

NVGRE: Network Virtualization using Generic Routing Encapsulation

PBR: Policy Based Routing

QoS: Quality of Service

STT: Stateless Transport Tunneling

TCAM: Ternary Content Addressable Memory

VXLAN: Virtual Extensible LAN

1.2.
Terminology

Large flow(s): long-lived large flow(s).

Small flow(s): long-lived small flow(s) and short-lived small/large flow(s).

2. Hash-based Load Distribution in LAG/ECMP

Hashing techniques are often used for traffic load balancing to select among multiple available paths with LAG/ECMP. The advantages of hash-based load distribution are the preservation of the packet sequence in a flow and real-time distribution without maintaining per-flow state in the router. Hash-based techniques use a combination of fields in the packet's headers to identify a flow, and a hash function over these fields is used to generate a number that identifies a link/path in a LAG/ECMP group. The result of the hashing procedure is a many-to-one mapping of flows to component links.

If the traffic load constitutes flows such that the result of the hash function across these flows is fairly uniform (so that a similar number of flows is mapped to each component link), if the individual flow rates are much smaller than the link capacity, and if the rate differences are not dramatic, the hash-based algorithm produces good results with respect to utilization of the individual component links. However, if one or more of these conditions are not met, hash-based techniques may result in unbalanced loads on individual component links.

One example is illustrated in Figure 1. In the figure, there are two routers, R1 and R2, and there is a LAG between them which has 3 component links (1), (2), (3). There are a total of 10 flows that need to be distributed across the links in this LAG. The result of hashing is as follows:

. Component link (1) has 3 flows -- 2 small flows and 1 large flow -- and the link utilization is normal.

. Component link (2) has 3 flows -- 3 small flows and no large flow -- and the link utilization is light.
  o The absence of any large flow leaves this component link under-utilized.

. Component link (3) has 4 flows -- 2 small flows and 2 large flows -- and the link capacity is exceeded, resulting in congestion.

  o The presence of 2 large flows causes congestion on this component link.

   +-----------+        +-----------+
   |           | -> ->  |           |
   |           |=====>  |           |
   |        (1)|--/---/-|(1)        |
   |           |        |           |
   |           |        |           |
   |   (R1)    |-> -> ->|   (R2)    |
   |        (2)|--/---/-|(2)        |
   |           |        |           |
   |           | -> ->  |           |
   |           |=====>  |           |
   |           |=====>  |           |
   |        (3)|--/---/-|(3)        |
   |           |        |           |
   +-----------+        +-----------+

   Where: ->-> small flows
          ===> large flow

   Figure 1: Unevenly Utilized Component Links

This document presents improved load distribution techniques based on large flow awareness. The techniques compensate for the unbalanced load distribution resulting from hashing, as demonstrated in the above example.

3. Mechanisms for Optimal LAG/ECMP Component Link Utilization

The techniques suggested in this draft constitute a local optimization solution; they are local in the sense that both the identification of large flows and the re-balancing of the load can be accomplished completely within individual nodes in the network, without the need for interaction with other nodes.

This approach may not yield a globally optimal placement of large flows across multiple nodes in a network, which may be desirable in some networks. On the other hand, a local approach may be adequate for some environments for the following reasons:

1) Different links within a network experience different levels of utilization and, thus, a "targeted" solution is needed for those hot-spots in the network. An example is the utilization of a LAG between two routers that needs to be optimized.

2) Some networks may lack end-to-end visibility, e.g.
when a certain network, under the control of a given operator, is a transit network for traffic from other networks that are not under the control of the same operator.

The various steps in achieving optimal LAG/ECMP component link utilization in networks are detailed below:

Step 1) This involves large flow recognition in routers and maintaining the mapping of each large flow to the component link that it uses. The recognition of large flows is explained in Section 3.1.

Step 2) The egress component links are periodically scanned for link utilization. If the egress component link utilization exceeds a pre-programmed threshold, an operator alert is generated. The large flows mapped to the congested egress component link are exported to a central management entity.

Step 3) On receiving the alert about the congested component link, the operator, through a central management entity, finds the large flows mapped to that component link and the LAG/ECMP group to which the component link belongs.

Step 4) The operator can choose to rebalance the large flows onto lightly loaded component links of the LAG/ECMP group, or to redistribute the small flows on the congested link to other component links of the group. The operator, through a central management entity, can choose one of the following actions:

1) Indicate specific large flows to rebalance;

2) Have the router decide the best large flows to rebalance;

3) Have the router redistribute all the small flows on the congested link to other component links in the group.

The central management entity conveys the above information to the router. The load re-balancing options are explained in Section 3.2.

Steps 2) to 4) could be automated if desired.

Providing large flow information to a central management entity provides the capability to further optimize flow distribution with multi-node visibility.
Consider the following example. A router may have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple of hops downstream, P1 may be congested while P2 and P3 are under-utilized, which the local router does not have visibility into. With the help of a central management entity, the operator could redistribute some of the flows from P1 to P2 and P3, resulting in a more optimized flow of traffic.

The techniques described above are especially useful when bundling links of different bandwidths, e.g. 10 Gbps and 100 Gbps, as described in [I-D.ietf-rtgwg-cl-requirement].

3.1. Large Flow Recognition

3.1.1. Flow Identification

A flow (large flow or small flow) can be defined as a sequence of packets for which ordered delivery should be maintained. Flows are typically identified using one or more fields from the packet header, from the following list:

. Layer 2: source MAC address, destination MAC address, VLAN ID.

. IP header: IP protocol, IP source address, IP destination address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination port.

. MPLS labels.

For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow identification is possible based on inner and/or outer headers. The above list is not exhaustive. The mechanisms described in this document are agnostic to the fields that are used for flow identification.

3.1.2. Criteria for Identifying a Large Flow

From a bandwidth and time duration perspective, in order to identify large flows we define an observation interval and observe the bandwidth of the flow over that interval. A flow that exceeds a certain minimum bandwidth threshold over that observation interval would be considered a large flow.
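The recognition criterion above -- a programmable observation interval and minimum bandwidth threshold, together with the lower maintenance threshold for hysteresis described in this section -- can be sketched in a few lines of code. This is an illustrative sketch only, not a specification of any vendor implementation; the class and parameter names are invented for the example.

```python
from collections import defaultdict

class LargeFlowDetector:
    """Sketch of large flow recognition: a flow is declared large when
    its byte count over an observation interval exceeds a programmable
    fraction of link bandwidth, and it stays large until it falls below
    a lower maintenance threshold (hysteresis)."""

    def __init__(self, link_speed_bps, interval_s=1.0,
                 recognition_frac=0.10, maintenance_frac=0.08):
        # Thresholds expressed in bytes per observation interval.
        bytes_per_interval = link_speed_bps / 8.0 * interval_s
        self.recognize_bytes = recognition_frac * bytes_per_interval
        self.maintain_bytes = maintenance_frac * bytes_per_interval
        self.byte_counts = defaultdict(int)  # flow key -> bytes this interval
        self.large_flows = set()

    def record_packet(self, flow_key, length_bytes):
        # flow_key is whatever header tuple is used for flow identification.
        self.byte_counts[flow_key] += length_bytes

    def end_interval(self):
        """Run once per observation interval; returns the set of large flows."""
        # Demote large flows that fell below the maintenance threshold.
        for flow in list(self.large_flows):
            if self.byte_counts.get(flow, 0) < self.maintain_bytes:
                self.large_flows.discard(flow)
        # Promote flows that crossed the recognition threshold.
        for flow, count in self.byte_counts.items():
            if count >= self.recognize_bytes:
                self.large_flows.add(flow)
        self.byte_counts.clear()
        return set(self.large_flows)
```

With the defaults shown, a 10 Gbps link and a 1-second interval yield a recognition threshold of 125 MB per interval (10% of link bandwidth) and a maintenance threshold of 100 MB per interval (the 80%-of-threshold hysteresis example used in this section).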
The two parameters -- the observation interval and the minimum bandwidth threshold over that observation interval -- should be programmable in a router to facilitate handling of different use cases and traffic characteristics. For example, a flow which is at or above 10% of link bandwidth for a time period of at least 1 second could be declared a large flow [DevoFlow].

In order to avoid excessive churn in the rebalancing, once a flow has been recognized as a large flow, it should continue to be recognized as a large flow as long as the traffic received during an observation interval exceeds some fraction of the bandwidth threshold, for example 80% of the bandwidth threshold.

Various techniques to identify a large flow are described below.

3.1.3. Sampling Techniques

A number of routers support sampling techniques such as sFlow [sFlow-v5, sFlow-LAG], PSAMP [RFC 5475], and NetFlow Sampling [RFC 3954]. For the purpose of large flow identification, sampling must be enabled on all of the egress ports in the router where such measurements are desired.

Using sFlow as an example, processing in an sFlow collector will provide an approximate indication of the large flows mapping to each of the component links in each LAG/ECMP group. It is possible to implement this part of the collector function in the control plane of the router, reducing dependence on an external management station, assuming sufficient control plane resources are available.

If egress sampling is not available, ingress sampling can suffice, since the central management entity used by the sampling technique typically has multi-node visibility and can use the samples from an immediately downstream node to make measurements for egress traffic at the local node.
This may not be possible if the downstream device is under the control of a different operator, or if the downstream device does not support sampling. Alternatively, since sampling techniques require that each sample be annotated with the packet's egress port information, ingress sampling may suffice. However, this means that sampling would have to be enabled on all ports, rather than only on those ports where such monitoring is desired.

The advantages and disadvantages of sampling techniques are as follows.

Advantages:

. Supported in most existing routers.

. Requires minimal router resources.

Disadvantages:

. In order to minimize the error inherent in sampling, there is a minimum delay in the recognition time of large flows, and in the time that it takes to react to this information.

With sampling, the detection of large flows can be done on the order of one second [DevoFlow].

3.1.4. Automatic Hardware Recognition

Implementations may perform automatic recognition of large flows in hardware on a router. Since this is done in hardware, it is an inline solution and would be expected to operate at line rate.

Using automatic hardware recognition of large flows, a faster indication of the large flows mapped to each of the component links in a LAG/ECMP group is available (as compared to the sampling approach described above).

The advantages and disadvantages of automatic hardware recognition are:

Advantages:

. Large flow detection is offloaded to hardware, freeing up software resources and removing possible dependence on an external management station.

. As link speeds get higher, sampling rates are typically reduced to keep the number of samples manageable, which places a lower bound on the detection time.
With automatic hardware recognition, large flows can be detected in shorter windows at higher link speeds, since every packet is accounted for in hardware [NDTM].

Disadvantages:

. Not supported in many routers.

As mentioned earlier, the observation interval for determining a large flow and the bandwidth threshold for classifying a flow as a large flow should be programmable parameters in a router.

The implementation of automatic hardware recognition of large flows is vendor dependent and beyond the scope of this document.

3.2. Load Re-balancing Options

Below are suggested techniques for load re-balancing. Equipment vendors should implement all of these techniques and allow the operator to choose one or more of them based on their applications.

Note that regardless of the method used, perfect re-balancing of large flows may not be possible since flows arrive and depart at different times. Also, any flows that are moved from one component link to another may experience momentary packet reordering.

3.2.1. Alternative Placement of Large Flows

Within a LAG/ECMP group, the member component links with the least average port utilization are identified. Some large flow(s) from the heavily loaded component links are then moved to those lightly loaded member component links using a PBR rule in the ingress processing element(s) in the routers.

With this approach, only certain large flows are subjected to momentary flow re-ordering.

When a large flow is moved, this will increase the utilization of the link that it is moved to, potentially creating unbalanced utilization once again across the component links. Therefore, when moving large flows, care must be taken to account for the existing load and for what the future load will be after the large flow has been moved.
Further, the appearance of new large flows may require a rearrangement of the placement of existing flows.

Consider a case where there is a LAG comprising four 10 Gbps component links and there are four large flows, each of 1 Gbps, placed one per component link. Subsequently, a fifth large flow of 2 Gbps is recognized, and to maintain equitable load distribution it may be necessary to move one of the existing 1 Gbps flows to a different component link. Even then, this would still result in some imbalance in the utilization across the component links.

3.2.2. Redistributing Small Flows

Some large flows may consume the entire bandwidth of the component link(s). In this case, it would be desirable for the small flows to not use the congested component link(s). This can be accomplished in one of the following ways; these methods work on some existing router hardware. The idea is to prevent, or reduce the probability, that a small flow hashes into the congested component link(s).

. The LAG/ECMP table is modified to include only non-congested component link(s). Small flows hash into this table to be mapped to a destination component link. Alternatively, if certain component links are heavily loaded but not congested, the output of the hash function can be adjusted to account for large flow loading on each of the component links.

. The PBR rules for large flows (refer to Section 3.2.1) must have strict precedence over the LAG/ECMP table lookup result.

With this approach, the small flows that are moved would be subject to reordering.

3.2.3. Component Link Protection Considerations

If desired, certain component links may be reserved for link protection. These reserved component links are not used for any flows in the absence of any failures.
In the case when the component link(s) fail, all the flows on the failed component link(s) are moved to the reserved component link(s). The mapping table of large flows to component links simply replaces the failed component link with the reserved link. Likewise, the LAG/ECMP hash table replaces the failed component link with the reserved link.

3.2.4. Load Re-Balancing Example

Optimal LAG/ECMP component utilization for the use case in Figure 1 is depicted below in Figure 2. The large flow rebalancing explained in Section 3.2.1 is used. The improved link utilization is as follows:

. Component link (1) has 3 flows -- 2 small flows and 1 large flow -- and the link utilization is normal.

. Component link (2) has 4 flows -- 3 small flows and 1 large flow -- and the link utilization is now normal.

. Component link (3) has 3 flows -- 2 small flows and 1 large flow -- and the link utilization is now normal.

   +-----------+        +-----------+
   |           | -> ->  |           |
   |           |=====>  |           |
   |        (1)|--/---/-|(1)        |
   |           |        |           |
   |           |=====>  |           |
   |   (R1)    |-> -> ->|   (R2)    |
   |        (2)|--/---/-|(2)        |
   |           |        |           |
   |           |        |           |
   |           | -> ->  |           |
   |           |=====>  |           |
   |        (3)|--/---/-|(3)        |
   |           |        |           |
   +-----------+        +-----------+

   Where: ->-> small flows
          ===> large flow

   Figure 2: Evenly Utilized Component Links

Basically, the use of the mechanisms described in Section 3.2.1 resulted in a rebalancing of flows whereby one of the large flows on component link (3), which was previously congested, was moved to component link (2), which was previously under-utilized.

4. Information Model for Flow Re-balancing

4.1. Configuration Parameters

The following parameters are required for the configuration of this feature:

. Large flow recognition parameters.
  o Observation interval: The observation interval is the time period in seconds over which packet arrivals are observed for the purpose of large flow recognition.

  o Minimum bandwidth threshold: The minimum bandwidth threshold would be configured as a percentage of link speed and translated into a number of bytes over the observation interval. A flow for which the number of bytes received in a given observation interval exceeds this number would be recognized as a large flow.

  o Minimum bandwidth threshold for large flow maintenance: The minimum bandwidth threshold for large flow maintenance is used to provide hysteresis for large flow recognition. Once a flow is recognized as a large flow, it continues to be recognized as a large flow until it falls below this threshold. This is also configured as a percentage of link speed and is typically lower than the minimum bandwidth threshold defined above.

. Imbalance threshold: the difference between the utilization of the least utilized and most utilized component links, expressed as a percentage of link speed.

4.2. Import of Flow Information

In cases where large flow recognition is handled by an external management station (see Section 3.1.3), an information model for flows is required to allow the import of large flow information into the router.

The following are some of the elements of the information model for importing flows:

. Layer 2: source MAC address, destination MAC address, VLAN ID.

. Layer 3 IP: IP protocol, IP source address, IP destination address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination port.

. MPLS labels.

This list is not exhaustive. For example, with overlay protocols such as VXLAN and NVGRE, fields from the outer and/or inner headers may be specified.
In general, all fields in the packet that can be used for forwarding decisions should be available for use when importing flow information from an external management station.

5. Operational Considerations

Flows should be re-balanced only when the imbalance in the utilization across component links exceeds a certain threshold. Frequent re-balancing to achieve precise equitable utilization across component links could be counter-productive, as it may result in moving flows back and forth between the component links, impacting packet ordering and system stability. This applies regardless of whether large flows or small flows are redistributed.

The operator would have to experiment with various values of the large flow recognition parameters (minimum bandwidth threshold, observation interval) and the imbalance threshold across component links to tune the solution for their environment.

6. IANA Considerations

This memo includes no request to IANA.

7. Security Considerations

This document does not directly impact the security of the Internet infrastructure or its applications. In fact, it could help if there is a DoS attack pattern which causes a hash imbalance, resulting in heavy overloading of certain LAG/ECMP component links by large flows.

8. Acknowledgements

The authors would like to thank the following individuals for their review and valuable feedback on earlier versions of this document: Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete Moyer, Andrew Malis, Dave McDysan, and Zhen Cao.

9. References

9.1. Normative References

9.2. Informative References

[I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al., "Requirements for MPLS over a Composite Link", June 2012.

[RFC 6790] Kompella, K.
et al., "The Use of Entropy Labels in MPLS Forwarding", November 2012.

[CAIDA] CAIDA Internet Traffic Analysis, http://www.caida.org/home.

[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport", draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

[ITCOM] Jo, J., et al., "Internet traffic load balancing using dynamic hashing with flow volume", SPIE ITCOM, 2002.

[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and Multicast", November 2000.

[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm", November 2000.

[RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques for IP Packet Selection", March 2009.

[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004.

[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters structure", September 2012.

[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version 9", October 2004.

[DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow Management for High Performance Enterprise Networks", Proceedings of ACM SIGCOMM, August 2011.

[Bloom] Bloom, B. H., "Space/Time Trade-offs in Hash Coding with Allowable Errors", Communications of the ACM, July 1970.

[NDTM] Estan, C. and G. Varghese, "New directions in traffic measurement and accounting", Proceedings of ACM SIGCOMM, August 2002.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

Internet traffic [CAIDA] has been analyzed to obtain flow statistics such as the number of packets in a flow and the flow duration. The five-tuple in the packet header (IP addresses, TCP/UDP ports, and IP protocol) is used for flow identification. The analysis indicates that < ~2% of the flows take ~30% of total traffic volume while the rest of the flows (> ~98%) contribute ~70% [YONG].
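The effect of such a heavy-tailed flow mix on hash-based distribution can be illustrated with a small, self-contained simulation. This is not the [YONG] simulation itself: only the flow-size split (the largest ~2% of flows carrying ~30% of the volume) follows the statistic above, while the number of flows, the number of paths, and the use of CRC32 as a stand-in header hash are arbitrary illustrative choices.

```python
import zlib

def simulate_hash_distribution(n_paths=8, n_flows=1000):
    """Hash a heavy-tailed flow population onto ECMP paths and report
    the per-path load, normalized so that the total traffic volume is 100."""
    n_large = n_flows // 50                   # ~2% of flows...
    large_rate = 30.0 / n_large               # ...share ~30% of the volume
    small_rate = 70.0 / (n_flows - n_large)   # the rest share ~70%
    loads = [0.0] * n_paths
    for i in range(n_flows):
        rate = large_rate if i < n_large else small_rate
        # Stand-in for a header hash: CRC32 of a synthetic flow key.
        path = zlib.crc32(("flow-%d" % i).encode()) % n_paths
        loads[path] += rate
    return loads

loads = simulate_hash_distribution()
print(["%.1f" % load for load in loads])  # per-path load; note the spread around the 12.5 average
```

Because the few large flows dominate the volume, the per-path loads depend mostly on how many large flows happen to hash onto each path, so the loads spread noticeably around the mean even though the flow count per path is nearly uniform.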
The simulation has shown that, given the Internet traffic pattern, the hash-based technique does not evenly distribute the flows over ECMP paths. Some paths may be > 90% loaded while others are < 40% loaded. The more ECMP paths there are, the more severe the imbalance. This implies that hash-based distribution can cause some paths to become congested while other paths are underutilized [YONG].

The simulation also shows substantial improvement from using the large flow-aware hash-based distribution technique described in this document. Using the same simulated traffic, the improved rebalancing can achieve < 10% load differences among the paths. This demonstrates how large flow-aware hash-based distribution can effectively compensate for the uneven load balancing caused by hashing and the traffic characteristics [YONG].

Authors' Addresses

Ram Krishnan
Brocade Communications
San Jose, 95134, USA
Phone: +1-408-406-7890
Email: ramk@brocade.com

Sanjay Khanna
Brocade Communications
San Jose, 95134, USA
Phone: +1-408-333-4850
Email: skhanna@brocade.com

Lucy Yong
Huawei USA
5340 Legacy Drive
Plano, TX 75025, USA
Phone: +1-469-277-5837
Email: lucy.yong@huawei.com

Anoop Ghanwani
Dell
San Jose, CA 95134
Phone: +1-408-571-3228
Email: anoop@alumni.duke.edu

Ning So
Tata Communications
Plano, TX 75082, USA
Phone: +1-972-955-0914
Email: ning.so@tatacommunications.com

Bhumip Khasnabish
ZTE Corporation
New Jersey, 07960, USA
Phone: +1-781-752-8003
Email: bhumip.khasnabish@zteusa.com