OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Informational                  Brocade Communications
Expires: September 2013                                         L. Yong
March 31, 2013                                               Huawei USA
                                                            A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                    Tata Communications
                                                          B. Khasnabish
                                                        ZTE Corporation

       Mechanisms for Optimal LAG/ECMP Component Link Utilization in
                                Networks

          draft-krishnan-opsawg-large-flow-load-balancing-06.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on September 30, 2013.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

Demands on networking infrastructure are growing exponentially; the drivers are bandwidth-hungry rich media applications, inter-data-center communications, etc. In this context, it is important to optimally use the bandwidth in wired networks that extensively use LAG/ECMP techniques for bandwidth scaling. This draft explores some of the mechanisms useful for achieving this.
Table of Contents

   1. Introduction
      1.1. Acronyms
      1.2. Terminology
   2. Hash-based Load Distribution in LAG/ECMP
   3. Mechanisms for Optimal LAG/ECMP Component Link Utilization
      3.1. Large Flow Recognition
           3.1.1. Flow Identification
           3.1.2. Criteria for Identifying a Large Flow
           3.1.3. Sampling Techniques
           3.1.4. Automatic Hardware Recognition
      3.2. Load Re-balancing Options
           3.2.1. Alternative Placement of Large Flows
           3.2.2. Redistributing Small Flows
           3.2.3. Component Link Protection Considerations
           3.2.4. Load Re-Balancing Example
   4. Information Model for Flow Re-balancing
      4.1. Configuration Parameters
      4.2. Import of Flow Information
   5. Operational Considerations
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References

1. Introduction

Networks extensively use LAG/ECMP techniques for capacity scaling. Network traffic can be broadly categorized into two types: long-lived large flows and other flows (which include long-lived small flows and short-lived small and large flows). Stateless hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used to distribute both long-lived large flows and other flows over the component links in a LAG/ECMP group. However, depending on the traffic pattern, the traffic may not be evenly distributed over the component links.

This draft describes best practices for optimal LAG/ECMP component link utilization while using hash-based techniques. These best practices comprise the following steps: recognizing long-lived large flows in a router, and assigning the long-lived large flows to specific LAG/ECMP component links or redistributing other flows when a component link on the router is congested.

It is useful to keep in mind that in the typical use case the long-lived large flows are those that consume a significant amount of bandwidth on a link, e.g., greater than 5% of link bandwidth. The number of such flows would necessarily be fairly small, e.g., on the order of tens or hundreds per link. In other words, the number of long-lived large flows is NOT expected to be on the order of millions of flows. Examples of such long-lived large flows would be IPsec tunnels in service provider backbones or storage backup traffic in data center networks.
1.1. Acronyms

   COTS: Commercial Off-the-shelf

   DoS: Denial of Service

   ECMP: Equal Cost Multi-path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   PBR: Policy Based Routing

   QoS: Quality of Service

   STT: Stateless Transport Tunneling

   TCAM: Ternary Content Addressable Memory

   VXLAN: Virtual Extensible LAN

1.2. Terminology

   Large flow(s): long-lived large flow(s)

   Small flow(s): long-lived small flow(s) and short-lived small or
   large flow(s)

2. Hash-based Load Distribution in LAG/ECMP

Hashing techniques are often used for traffic load balancing to select among multiple available paths with LAG/ECMP. The advantages of hash-based load distribution are the preservation of the packet sequence in a flow and real-time distribution without maintaining per-flow state in the router. Hash-based techniques use a combination of fields in the packet's headers to identify a flow, and a hash function over these fields is used to generate a value that identifies a link/path in the LAG/ECMP group. The result of the hashing procedure is a many-to-one mapping of flows to component links.

If the traffic load comprises flows such that the result of the hash function across these flows is fairly uniform, so that a similar number of flows is mapped to each component link, if the individual flow rates are small compared to the link capacity, and if the rate differences between flows are not dramatic, the hash-based algorithm produces good results with respect to the utilization of the individual component links. However, if one or more of these conditions are not met, hash-based techniques may result in unbalanced loads on individual component links.

One example is illustrated in Figure 1. In the figure, there are two routers, R1 and R2, with a LAG between them that has three component links (1), (2), (3). There are a total of 10 flows that need to be distributed across the links in this LAG. The result of hashing is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 3 flows -- 3 small flows and no large
     flow -- and the link utilization is light.

     o The absence of any large flow leaves this component link
       under-utilized.

   . Component link (3) has 4 flows -- 2 small flows and 2 large
     flows -- and the link capacity is exceeded, resulting in
     congestion.

     o The presence of 2 large flows causes congestion on this
       component link.

   +-----------+          +-----------+
   |           | -> ->    |           |
   |           |=====>    |           |
   |        (1)|--/---/---|(1)        |
   |           |          |           |
   |           |          |           |
   |   (R1)    |-> -> ->  |   (R2)    |
   |        (2)|--/---/---|(2)        |
   |           |          |           |
   |           | -> ->    |           |
   |           |=====>    |           |
   |           |=====>    |           |
   |        (3)|--/---/---|(3)        |
   |           |          |           |
   +-----------+          +-----------+

   Where: ->-> small flows
          ===> large flow

          Figure 1: Unevenly Utilized Component Links

This document presents improved load distribution techniques based on large flow awareness. The techniques compensate for the unbalanced load distribution that can result from hashing, as demonstrated in the above example.
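To make the hash-based selection concrete, below is a non-normative Python sketch of the many-to-one mapping described above. All names, the choice of CRC32, and the modulo reduction are assumptions of the sketch; actual routers implement this in the forwarding plane with vendor-specific hash functions.

   import zlib

   def select_component_link(src_ip, dst_ip, proto, src_port,
                             dst_port, num_links):
       # Identify the flow by a combination of header fields and
       # hash them; CRC32 is used here only for brevity.
       key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
       # Modulo reduction yields the many-to-one mapping of flows
       # to component links described above.
       return zlib.crc32(key) % num_links

   # Every packet of a flow maps to the same link, preserving
   # packet order within the flow:
   link = select_component_link("10.0.0.1", "20.0.0.2", 6,
                                49152, 443, 3)

Because the mapping is stateless, a link is selected without any per-flow state, but nothing in the hash accounts for flow rates, which is what leads to the imbalance illustrated in Figure 1.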
3. Mechanisms for Optimal LAG/ECMP Component Link Utilization

The techniques suggested in this draft constitute a local optimization solution; they are local in the sense that both the identification of large flows and the re-balancing of the load can be accomplished entirely within individual nodes in the network, without the need for interaction with other nodes.

This approach may not yield a globally optimal placement of large flows across multiple nodes in a network, which may be desirable in some networks. On the other hand, a local approach may be adequate for some environments for the following reasons:

1) Different links within a network experience different levels of utilization and, thus, a "targeted" solution is needed for those hot-spots in the network. An example is the utilization of a LAG between two routers that needs to be optimized.

2) Some networks may lack end-to-end visibility, e.g., when a certain network, under the control of a given operator, is a transit network for traffic from other networks that are not under the control of the same operator.

The various steps in achieving optimal LAG/ECMP component link utilization in networks are detailed below:

Step 1) This involves large flow recognition in routers and maintaining a mapping of each large flow to the component link that it uses. The recognition of large flows is explained in Section 3.1.

Step 2) The egress component links are periodically scanned for link utilization. If the utilization of an egress component link exceeds a pre-programmed threshold, an operator alert is generated. The large flows mapped to the congested egress component link are exported to a central management entity.

Step 3) On receiving the alert about the congested component link, the operator, through a central management entity, finds the large flows mapped to that component link and the LAG/ECMP group to which the component link belongs.

Step 4) The operator can choose to rebalance the large flows onto lightly loaded component links of the LAG/ECMP group or to redistribute the small flows on the congested link to other component links of the group. The operator, through a central management entity, can choose one of the following actions:

   1) Indicate specific large flows to rebalance;

   2) Have the router decide the best large flows to rebalance;

   3) Have the router redistribute all the small flows on the
      congested link to other component links in the group.

The central management entity conveys the above information to the router. The load re-balancing options are explained in Section 3.2.

Steps 2) to 4) could be automated if desired; an illustrative sketch of the scan in Step 2) is given at the end of this section.

Providing large flow information to a central management entity provides the capability to further optimize flow distribution with multi-node visibility. Consider the following example. A router may have 3 ECMP nexthops that lead down paths P1, P2, and P3. A link a couple of hops downstream on P1 may be congested, while P2 and P3 may be under-utilized; the local router has no visibility into this. With the help of a central management entity, the operator could redistribute some of the flows from P1 to P2 and P3, resulting in a more optimal flow of traffic.

The techniques described above are especially useful when bundling links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as described in [I-D.ietf-rtgwg-cl-requirement].
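Below is a non-normative sketch of the periodic scan in Step 2). The data structures, the 90% threshold, and the alert format are assumptions chosen for the example, not an actual router API.

   # Illustrative sketch of Step 2): scan egress component links and
   # report the large flows on any link whose utilization exceeds a
   # pre-programmed threshold.

   def scan_lag(link_util, large_flow_map, threshold=0.90):
       """link_util:      {link_id: utilization in [0, 1]}
          large_flow_map: {link_id: [flow_id, ...]} from Step 1)."""
       alerts = []
       for link, util in link_util.items():
           if util > threshold:
               # Export the large flows on the congested link to the
               # central management entity (here, simply collected).
               alerts.append((link, util,
                              large_flow_map.get(link, [])))
       return alerts

   # Example: link 3 at 97% utilization triggers an alert carrying
   # its large flows.
   alerts = scan_lag({1: 0.55, 2: 0.30, 3: 0.97},
                     {1: ["flowA"], 3: ["flowB", "flowC"]})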
3.1. Large Flow Recognition

3.1.1. Flow Identification

A flow (large flow or small flow) can be defined as a sequence of packets for which ordered delivery should be maintained. Flows are typically identified using one or more fields from the packet header from the following list:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . IP header: IP protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS labels.

For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow identification is possible based on inner and/or outer headers. The above list is not exhaustive. The mechanisms described in this document are agnostic to the fields that are used for flow identification.

3.1.2. Criteria for Identifying a Large Flow

From a bandwidth and time duration perspective, in order to identify large flows we define an observation interval and observe the bandwidth of the flow over that interval. A flow that exceeds a certain minimum bandwidth threshold over that observation interval would be considered a large flow.

The two parameters -- the observation interval and the minimum bandwidth threshold over that observation interval -- should be programmable in a router to facilitate handling of different use cases and traffic characteristics. For example, a flow which is at or above 10% of link bandwidth for a time period of at least 1 second could be declared a large flow [DevoFlow].

In order to avoid excessive churn in the rebalancing, once a flow has been recognized as a large flow, it should continue to be recognized as a large flow as long as the traffic received during an observation interval exceeds some fraction of the bandwidth threshold, for example 80% of the bandwidth threshold.

Various techniques to identify a large flow are described below.

3.1.3. Sampling Techniques

A number of routers support sampling techniques such as sFlow [sFlow-v5, sFlow-LAG], PSAMP [RFC 5475], and NetFlow sampling [RFC 3954]. For the purpose of large flow identification, sampling must be enabled on all of the egress ports in the router where such measurements are desired.

Using sFlow as an example, processing in an sFlow collector will provide an approximate indication of the large flows mapping to each of the component links in each LAG/ECMP group. It is possible to implement this part of the collector function in the control plane of the router, reducing dependence on an external management station, assuming sufficient control plane resources are available.

If egress sampling is not available, ingress sampling can suffice, since the central management entity used by the sampling technique typically has multi-node visibility and can use the samples from an immediately downstream node to make measurements for egress traffic at the local node. This may not be possible if the downstream device is under the control of a different operator, or if the downstream device does not support sampling. Alternatively, since sampling techniques require that each sample be annotated with the packet's egress port information, ingress sampling at the local node may suffice. However, this means that sampling would have to be enabled on all ports, rather than only on those ports where such monitoring is desired.

The advantages and disadvantages of sampling techniques are as follows.

Advantages:

   . Supported in most existing routers.

   . Requires minimal router resources.

Disadvantages:

   . In order to minimize the error inherent in sampling, there is a
     minimum delay in the recognition time of large flows, and in
     the time that it takes to react to this information.

With sampling, the detection of large flows can be done on the order of one second [DevoFlow].
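As a non-normative illustration of how sampled data could feed the criteria of Section 3.1.2, the sketch below scales 1-in-N packet samples up to a bandwidth estimate and applies the threshold with the hysteresis fraction. The sampling rate, link speed, thresholds, and all names are assumptions chosen for the example; they are not taken from sFlow, PSAMP, or NetFlow.

   # Estimate per-flow bandwidth from 1-in-N packet samples over an
   # observation interval and apply the large flow criteria of
   # Section 3.1.2, including hysteresis.

   SAMPLING_RATE = 1000        # one sampled packet per 1000 packets
   LINK_BPS      = 10e9        # 10 Gbps link
   INTERVAL_SEC  = 1.0         # observation interval
   THRESHOLD     = 0.10        # large flow: >= 10% of link bandwidth
   KEEP_FRACTION = 0.80        # hysteresis: stay large above 80%

   def update_large_flows(sampled_bytes, large_flows):
       """sampled_bytes: {flow_id: bytes seen in samples this interval}
          large_flows:   set of flows currently classified as large."""
       threshold_bytes = THRESHOLD * LINK_BPS / 8 * INTERVAL_SEC
       for flow, nbytes in sampled_bytes.items():
           estimate = nbytes * SAMPLING_RATE   # scale up the samples
           if flow in large_flows:
               # Hysteresis: only 80% of the threshold is needed to
               # remain classified as large, avoiding churn.
               if estimate < KEEP_FRACTION * threshold_bytes:
                   large_flows.discard(flow)
           elif estimate >= threshold_bytes:
               large_flows.add(flow)
       return large_flows

The scaling step is where the sampling error noted above enters: with fewer samples per interval, the estimate is noisier, which is why detection takes on the order of a second.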
3.1.4. Automatic Hardware Recognition

Implementations may perform automatic recognition of large flows in hardware on a router. Since this is done in hardware, it is an inline solution and would be expected to operate at line rate.

Using automatic hardware recognition of large flows, a faster indication of the large flows mapped to each of the component links in a LAG/ECMP group is available (as compared to the sampling approach described above).

The advantages and disadvantages of automatic hardware recognition are:

Advantages:

   . Large flow detection is offloaded to hardware, freeing up
     software resources and removing any dependence on an external
     management station.

   . As link speeds get higher, sampling rates are typically reduced
     to keep the number of samples manageable, which places a lower
     bound on the detection time. With automatic hardware
     recognition, large flows can be detected in shorter windows on
     higher-speed links since every packet is accounted for in
     hardware [NDTM].

Disadvantages:

   . Not supported in many routers.

As mentioned earlier, the observation interval for determining a large flow and the bandwidth threshold for classifying a flow as a large flow should be programmable parameters in a router.

The implementation of automatic hardware recognition of large flows is vendor dependent. Below is a suggested technique based on the use of a variation of the Bloom filter [Bloom].

This technique requires a few tables -- a flow table and multiple hash tables.

The flow table comprises entries that are programmed with packet fields for flows that are already known to be large flows, and each entry has a corresponding byte counter. It is initialized as an empty table (i.e., none of the incoming packets would match a flow table entry).

The hash tables each have a different hash function and comprise entries that are byte counters. The counters are initialized to zero and are modified as described by the algorithm below.

Step 1) If the packet's flow exists in the flow table, increment the counter associated with the flow by the packet size. Else, proceed to Step 2.

Step 2) Each table's hash function is applied to the fields of the packet header, the result is looked up in parallel in the corresponding hash table, and the counter of the entry that is hit in each table is incremented by the packet size. If, within the observation interval, the counters of the entries that were hit in all of the hash tables exceed a programmed byte threshold (this counter threshold would be set to match the bandwidth threshold), a candidate large flow is learnt; it is programmed into the flow table and those counters are reset.

Additionally, the counters in all of the hash tables must be reset every observation interval.

There may be some false positives due to multiple small flows masquerading as a large flow. The number of such false positives can be reduced by increasing the number of parallel hash tables, each of which uses a different hash function. There is a design tradeoff between the size of the hash tables, the number of hash tables, and the probability of a false positive. This aspect is further illustrated in Appendix B.
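A software model of this multistage filter is sketched below. Real implementations run in hardware; the table sizes, the byte threshold, and the salted-CRC32 stand-in for per-stage hash functions are assumptions chosen for the example.

   # Software model of the suggested multistage (Bloom-variant)
   # filter. Sizes and thresholds are example assumptions.
   import zlib

   NUM_STAGES     = 4
   NUM_BUCKETS    = 2048
   BYTE_THRESHOLD = 125_000_000   # matches the bandwidth threshold

   flow_table = {}                # known large flows -> byte count
   stages = [[0] * NUM_BUCKETS for _ in range(NUM_STAGES)]

   def account(flow_key, packet_size):
       # Step 1) Known large flow: just count its bytes.
       if flow_key in flow_table:
           flow_table[flow_key] += packet_size
           return
       # Step 2) Update one counter per stage, each stage using a
       # different hash (modeled by salting CRC32 with the stage
       # index).
       hits = []
       for i, stage in enumerate(stages):
           b = zlib.crc32(f"{i}|{flow_key}".encode()) % NUM_BUCKETS
           stage[b] += packet_size
           hits.append((stage, b))
       # Candidate large flow only if every stage's counter crossed
       # the threshold within the observation interval.
       if all(stage[b] >= BYTE_THRESHOLD for stage, b in hits):
           flow_table[flow_key] = 0
           for stage, b in hits:
               stage[b] = 0

   def end_of_interval():
       # All hash-table counters are reset every observation
       # interval.
       for stage in stages:
           stage[:] = [0] * NUM_BUCKETS

Requiring all stages to agree is what suppresses false positives: a small flow is misclassified only if, in every stage, it shares a bucket with enough other small flows, which becomes less likely as stages are added (see Appendix B).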
3.2. Load Re-balancing Options

Below are suggested techniques for load re-balancing. Equipment vendors should implement all of these techniques and allow the operator to choose one or more of them based on their applications.

Note that regardless of the method used, perfect re-balancing of large flows may not be possible since flows arrive and depart at different times. Also, any flows that are moved from one component link to another may experience momentary packet reordering.

3.2.1. Alternative Placement of Large Flows

Within a LAG/ECMP group, the member component links with the least average port utilization are identified. Some large flow(s) from the heavily loaded component links are then moved to those lightly loaded member component links using a PBR rule in the ingress processing element(s) in the routers.

With this approach, only certain large flows are subjected to momentary packet reordering.

When a large flow is moved, the utilization of the link to which it is moved increases, potentially creating unbalanced utilization once again across the component links. Therefore, when moving large flows, care must be taken to account for the existing load and for what the future load will be after the large flow has been moved. Further, the appearance of new large flows may require a rearrangement of the placement of existing flows.

Consider a case where there is a LAG comprising four 10 Gbps component links and there are four large flows, each of 1 Gbps. Each of these flows is placed on a different component link. Subsequently, a fifth large flow of 2 Gbps is recognized; to maintain equitable load distribution, its placement may require moving one of the existing 1 Gbps flows to a different component link. Even then, some imbalance in the utilization across the component links would remain.

3.2.2. Redistributing Small Flows

Some large flows may consume the entire bandwidth of the component link(s). In this case, it would be desirable for the small flows to not use the congested component link(s). This can be accomplished in one of the following ways; these methods work on some existing router hardware. The idea is to prevent, or reduce the probability, that a small flow hashes into the congested component link(s).

   . The LAG/ECMP table is modified to include only non-congested
     component link(s). Small flows hash into this table to be
     mapped to a destination component link. Alternatively, if
     certain component links are heavily loaded but not congested,
     the output of the hash function can be adjusted to account for
     large flow loading on each of the component links.

   . The PBR rules for large flows (refer to Section 3.2.1) must
     have strict precedence over the LAG/ECMP table lookup result.

With this approach, the small flows that are moved would be subject to momentary packet reordering.
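A simplified sketch of the first option follows. The table layout and all names are illustrative assumptions; actual LAG/ECMP table formats are vendor specific.

   # Illustrative sketch: rebuild the LAG/ECMP selection table to
   # exclude congested component links, so small flows hash only
   # onto the remaining links. Large flows pinned by PBR rules
   # (Section 3.2.1) take strict precedence over the table lookup.
   import zlib

   def build_lag_table(links, congested, table_size=64):
       """links: all component link ids; congested: links to avoid."""
       usable = [l for l in links if l not in congested] or links
       # Populate the hash-indexed selection table round-robin over
       # the usable links.
       return [usable[i % len(usable)] for i in range(table_size)]

   def forward(flow_key, pbr_rules, lag_table):
       # PBR rules for large flows have strict precedence.
       if flow_key in pbr_rules:
           return pbr_rules[flow_key]
       return lag_table[zlib.crc32(flow_key.encode()) % len(lag_table)]

   # Exclude congested link 3; small flows now hash onto links 1, 2.
   lag_table = build_lag_table([1, 2, 3], congested={3})
   out_link = forward("10.0.0.1|20.0.0.2|6|49152|443", {}, lag_table)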
3.2.3. Component Link Protection Considerations

If desired, certain component links may be reserved for link protection. These reserved component links are not used for any flows in the absence of any failures. In the case when the component link(s) fail, all the flows on the failed component link(s) are moved to the reserved component link(s). The mapping table of large flows to component links simply replaces the failed component link with the reserved link. Likewise, the LAG/ECMP hash table replaces the failed component link with the reserved link.

3.2.4. Load Re-Balancing Example

Optimal LAG/ECMP component link utilization for the use case in Figure 1 is depicted below in Figure 2. The large flow rebalancing explained in Section 3.2.1 is used. The improved link utilization is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 4 flows -- 3 small flows and 1 large
     flow -- and the link utilization is normal now.

   . Component link (3) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal now.

   +-----------+          +-----------+
   |           | -> ->    |           |
   |           |=====>    |           |
   |        (1)|--/---/---|(1)        |
   |           |          |           |
   |           |=====>    |           |
   |   (R1)    |-> -> ->  |   (R2)    |
   |        (2)|--/---/---|(2)        |
   |           |          |           |
   |           |          |           |
   |           | -> ->    |           |
   |           |=====>    |           |
   |        (3)|--/---/---|(3)        |
   |           |          |           |
   +-----------+          +-----------+

   Where: ->-> small flows
          ===> large flow

          Figure 2: Evenly Utilized Component Links

Basically, the use of the mechanisms described in Section 3.2.1 resulted in a rebalancing of flows where one of the large flows on component link (3), which was previously congested, was moved to component link (2), which was previously under-utilized.

4. Information Model for Flow Re-balancing

4.1. Configuration Parameters

The following parameters are required for the configuration of this feature (a sketch of a corresponding data structure follows the list):

   . Large flow recognition parameters.

     o Observation interval: The observation interval is the time
       period in seconds over which the packet arrivals are
       observed for the purpose of large flow recognition.

     o Minimum bandwidth threshold: The minimum bandwidth threshold
       would be configured as a percentage of link speed and
       translated into a number of bytes over the observation
       interval. A flow for which the number of bytes received,
       for a given observation interval, exceeds this number would
       be recognized as a large flow.

     o Minimum bandwidth threshold for large flow maintenance: The
       minimum bandwidth threshold for large flow maintenance is
       used to provide hysteresis for large flow recognition.
       Once a flow is recognized as a large flow, it continues to
       be recognized as a large flow until it falls below this
       threshold. This is also configured as a percentage of link
       speed and is typically lower than the minimum bandwidth
       threshold defined above.

   . Imbalance threshold: the difference between the utilization of
     the least utilized and most utilized component links.
     Expressed as a percentage of link speed.
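The configuration parameters above might be modeled as follows. This is a sketch only; the names and default values are illustrative assumptions, not a normative data model.

   # Sketch of the configuration parameters in Section 4.1.
   from dataclasses import dataclass

   @dataclass
   class LargeFlowConfig:
       observation_interval_sec: float = 1.0   # seconds
       min_bw_threshold_pct: float = 10.0      # % of link speed
       maintenance_threshold_pct: float = 8.0  # hysteresis, lower
       imbalance_threshold_pct: float = 20.0   # max-min utilization

       def threshold_bytes(self, link_speed_bps: float) -> float:
           # Translate the percentage into a number of bytes over
           # the observation interval, as described in Section 4.1.
           return (self.min_bw_threshold_pct / 100.0) \
                  * link_speed_bps / 8.0 \
                  * self.observation_interval_sec

   cfg = LargeFlowConfig()
   bytes_per_interval = cfg.threshold_bytes(10e9)   # 10 Gbps link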
4.2. Import of Flow Information

In cases where large flow recognition is handled by an external management station (see Section 3.1.3), an information model for flows is required to allow the import of large flow information to the router.

The following are some of the elements of the information model for importing flows:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . Layer 3 IP: IP protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS labels.

This list is not exhaustive. For example, with overlay protocols such as VXLAN and NVGRE, fields from the outer and/or inner headers may be specified. In general, all fields in the packet that can be used in forwarding decisions should be available for use when importing flow information from an external management station.

5. Operational Considerations

Flows should be re-balanced only when the imbalance in the utilization across component links exceeds a certain threshold. Frequent re-balancing to achieve precise equitable utilization across component links could be counter-productive, as it may result in moving flows back and forth between the component links, impacting packet ordering and system stability. This applies regardless of whether large flows or small flows are redistributed.

The operator would have to experiment with various values of the large flow recognition parameters (minimum bandwidth threshold, observation interval) and the imbalance threshold across component links to tune the solution for their environment.
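As a non-normative illustration of the re-balancing trigger described above (the function name and the 20% example threshold are assumptions of the sketch):

   # Act only when the spread between the most and least utilized
   # component links exceeds the configured imbalance threshold.
   def should_rebalance(link_util, imbalance_threshold_pct):
       """link_util: {link_id: utilization as % of link speed}"""
       if not link_util:
           return False
       spread = max(link_util.values()) - min(link_util.values())
       return spread > imbalance_threshold_pct

   # e.g., utilizations of 95%, 40%, 55% with a 20% threshold:
   trigger = should_rebalance({1: 95.0, 2: 40.0, 3: 55.0}, 20.0)

Gating re-balancing on the spread, rather than on any single link's utilization, avoids reacting to uniformly high load that re-balancing cannot improve.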
6. IANA Considerations

This memo includes no request to IANA.

7. Security Considerations

This document does not directly impact the security of the Internet infrastructure or its applications. In fact, it could help in the case of a DoS attack pattern which causes a hash imbalance, resulting in heavy overloading of certain LAG/ECMP component links by large flows.

8. Acknowledgements

The authors would like to thank the following individuals for their review and valuable feedback on earlier versions of this document: Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong Wong, Peter Phaal, Roman Krzanowski, and Weifeng Zhang.

9. References

9.1. Normative References

9.2. Informative References

   [I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al.,
   "Requirements for MPLS over a Composite Link", June 2012.

   [RFC 6790] Kompella, K., et al., "The Use of Entropy Labels in
   MPLS Forwarding", November 2012.

   [CAIDA] CAIDA Internet Traffic Analysis, http://www.caida.org/home.

   [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport",
   draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

   [ITCOM] Jo, J., et al., "Internet traffic load balancing using
   dynamic hashing with flow volume", SPIE ITCOM, 2002.

   [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast
   and Multicast", November 2000.

   [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
   Algorithm", November 2000.

   [RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques
   for IP Packet Selection", March 2009.

   [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004.

   [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
   structure", September 2012.

   [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export
   Version 9", October 2004.

   [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow
   Management for High Performance Enterprise Networks", Proceedings
   of ACM SIGCOMM, August 2011.

   [Bloom] Bloom, B. H., "Space/Time Trade-offs in Hash Coding with
   Allowable Errors", Communications of the ACM, July 1970.

   [NDTM] Estan, C. and G. Varghese, "New directions in traffic
   measurement and accounting", Proceedings of ACM SIGCOMM, August
   2002.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

Internet traffic [CAIDA] has been analyzed to obtain flow statistics such as the number of packets in a flow and the flow duration. The five tuples in the packet header (IP addresses, TCP/UDP ports, and IP protocol) are used for flow identification. The analysis indicates that < ~2% of the flows take ~30% of total traffic volume while the rest of the flows (> ~98%) contribute ~70% [YONG].

The simulation has shown that, given the Internet traffic pattern, the hash-based technique does not evenly distribute the flows over ECMP paths. Some paths may be > 90% loaded while others are < 40% loaded. The more ECMP paths there are, the more severe the imbalance. This implies that hash-based distribution can cause some paths to become congested while other paths are underutilized [YONG].

The simulation also shows substantial improvement from using the large-flow-aware hash-based distribution technique described in this document. Using the same simulated traffic, the improved rebalancing can achieve < 10% load difference among the paths. This demonstrates how large-flow-aware hash-based distribution can effectively compensate for the uneven load distribution caused by hashing and the traffic characteristics [YONG].

Appendix B. Techniques for Automatic Hardware Recognition

B.1. Comparison of Single Hash vs. Multiple Hash for Large Flow Identification

The suggested multiple hash technique is scalable in terms of hash table storage and flow table storage, independent of the quality of the hash functions. Scalability is an important consideration because there could be millions of flows, only a small percentage of which are large flows that need to be examined. The amount of hash table storage is proportional to the number of large flows -- the exact number would depend on the use cases and traffic characteristics. The amount of flow table storage needed is only slightly more than the number of large flows; a little extra space is needed to accommodate any false positives (small flows which may have been incorrectly identified as large flows).

With a single hash, the hash table storage would be proportional to the total number of flows in order to minimize false positives. The amount of flow table storage needed depends on the total number of flows and the hash table size.

B.2. Steady-State Analysis of the Suggested Multiple Hash Technique

Objective:

To determine the probability of small flows masquerading as large flows.

Assumptions:

The small flows are uniformly distributed over all of the hash buckets. Further, assume that the small flows have an identical number of packets in the observation interval.

Notation:

   Number of hash stages - m

   Number of hash buckets per stage - n

   Minimum large flow rate (bytes/sec) - s

   Time interval of examination (sec) - t

   Number of small flows in time interval t - x1

   Number of packets per small flow in time interval t - y

   Average packet size of small flow - z

   Average number of small flows in a hash bucket - x2 (= x1/n)

   Minimum number of small flows in the same hash bucket that would
   lead to one small flow being incorrectly identified as a large
   flow - x3

Basically, if some number of small flows hash into a bucket, only those buckets which reach the minimum bandwidth threshold would trigger large flow identification, and only the last small flow which triggered the event would be incorrectly identified as a large flow.

Using the above notation, it would take at least x3 small flows hashing to the same bucket to trigger the minimum bandwidth threshold for that bucket, and the flow to which the last packet belongs would be incorrectly identified as a large flow:

   x3 = (s*t)/(y*z)

Let p1 be the probability of such an incorrectly identified large flow per hash stage. Let x be a variable denoting the number of small flows which hash to the same bucket; x can take a value in {0, 1, ..., x1}. As noted above, the mean value of x is x2.

   p1 = Prob(x >= x3)

Overall probability with m independent hash stages: p1^m.

Thus, larger tables and a larger number of tables lead to a lower probability of incorrectly identified large flows.

An example:

   m = 4, n = 2K, s = 1 MB/sec, t = 1 sec, x1 = 200K, y = 10, z = 1K

   x2 = 200K/2K = 100

   x3 = (1024*1024)/(10*1024) = 102.4

   p1 = Prob(x >= x3) ~= 0.5

   (x2 = 100 small flows fall into one hash bucket on average)

Overall probability with m independent hash stages:

   (p1)^m = (0.5)^4 = 0.0625

Thus, by having 4 stages, the probability of a small flow being incorrectly identified as a large flow is reduced from 0.5 to 0.0625.
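The estimate above can be checked numerically. The following sketch assumes the number of small flows per bucket is Poisson distributed with mean x2 = x1/n (an approximation consistent with uniform hashing, and an assumption of this sketch); the draft's rough estimate rounds p1 up to ~0.5, while the Poisson tail comes out somewhat smaller (~0.4), making the draft's figure conservative.

   # Numerical check of the example in B.2, assuming the number of
   # small flows per bucket follows a Poisson(x2) distribution.
   import math

   m, n = 4, 2000                 # stages, buckets per stage
   s, t = 1024 * 1024, 1.0        # 1 MB/sec threshold over 1 sec
   x1, y, z = 200_000, 10, 1024   # small flows, pkts/flow, pkt size

   x3 = (s * t) / (y * z)         # = 102.4 flows needed in a bucket
   k_min = math.ceil(x3)          # i.e., at least 103 small flows

   lam = x1 / n                   # x2 = 100 flows/bucket on average
   pmf = math.exp(-lam)           # P(X = 0)
   cdf = pmf
   for k in range(1, k_min):      # accumulate P(X <= k_min - 1)
       pmf *= lam / k
       cdf += pmf
   p1 = 1.0 - cdf                 # P(X >= 103) per hash stage

   print(f"p1 ~= {p1:.2f}, overall p1^m ~= {p1 ** m:.4f}")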
Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-406-7890
   Email: ramk@brocade.com

   Sanjay Khanna
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-333-4850
   Email: skhanna@brocade.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: +1-469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134
   Phone: +1-408-571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +1-972-955-0914
   Email: ning.so@tatacommunications.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +1-781-752-8003
   Email: bhumip.khasnabish@zteusa.com