OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Informational                  Brocade Communications
Expires: August 12, 2013                                        L. Yong
February 13, 2013                                            Huawei USA
                                                            A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                    Tata Communications
                                                          B. Khasnabish
                                                        ZTE Corporation

      Mechanisms for Optimal LAG/ECMP Component Link Utilization in
                                Networks

         draft-krishnan-opsawg-large-flow-load-balancing-03.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on August 13, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Abstract

   Demands on networking infrastructure are growing exponentially,
   driven by bandwidth-hungry rich media applications, inter-data-
   center communications, and similar workloads.  In this context, it
   is important to optimally use the bandwidth in wired networks that
   extensively use LAG/ECMP techniques for bandwidth scaling.  This
   draft explores some of the mechanisms useful for achieving this.

Table of Contents

   1. Introduction
      1.1. Conventions
      1.2. Acronyms
      1.3. Terminology
   2. Hash-based Load Distribution in LAG/ECMP
   3. Mechanisms for Optimal LAG/ECMP Component Link Utilization
      3.1. Large Flow Recognition
         3.1.1. Flow Identification
         3.1.2. Criteria for Identifying a Large Flow
         3.1.3. Sampling Techniques
         3.1.4. Automatic Hardware Recognition
      3.2. Load Re-Balancing Options
         3.2.1. Alternative Placement of Large Flows
         3.2.2. Redistributing Small Flows
         3.2.3. Component Link Protection Considerations
         3.2.4. Load Re-Balancing Example
   4. Future Work
   5. IANA Considerations
   6. Security Considerations
   7. Acknowledgements
   8. References
      8.1. Normative References
      8.2. Informative References
   Appendix A. Internet Traffic Analysis and Load Balancing Simulation
   Appendix B. Techniques for Automatic Hardware Recognition

1. Introduction

   Networks extensively use LAG/ECMP techniques for capacity scaling.
   Network traffic can be predominantly categorized into two traffic
   types: long-lived large flows, and other flows (which include long-
   lived small flows and short-lived small/large flows).  Stateless
   hash-based techniques [ITCOM], [RFC2991], [RFC2992], [RFC6790] are
   often used to distribute both long-lived large flows and other flows
   over the component links in a LAG/ECMP group.  However, the traffic
   may not be evenly distributed over the component links due to the
   traffic pattern.

   This draft describes best practices for optimal LAG/ECMP component
   link utilization while using hash-based techniques.  These best
   practices comprise the following steps: recognizing long-lived large
   flows in a router, and assigning the long-lived large flows to
   specific LAG/ECMP component links or redistributing other flows when
   a component link on the router is congested.

1.1. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.2. Acronyms

   COTS: Commercial Off-the-Shelf

   DoS: Denial of Service

   ECMP: Equal Cost Multi-Path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   PBR: Policy Based Routing

   QoS: Quality of Service

   STT: Stateless Transport Tunneling

   TCAM: Ternary Content Addressable Memory

   VXLAN: Virtual Extensible LAN

1.3. Terminology

   Large flow(s): long-lived large flow(s)

   Small flow(s): long-lived small flow(s) and short-lived small/large
   flow(s)

2. Hash-based Load Distribution in LAG/ECMP

   Hashing techniques are often used for traffic load balancing to
   select among multiple available paths with LAG/ECMP.  The advantages
   of hash-based load distribution are the preservation of the packet
   sequence in a flow and real-time distribution without maintaining
   per-flow state in the router.  Hash-based techniques use a
   combination of fields in the packet's headers to identify a flow,
   and a hash function over these fields is used to select a link/path
   in a LAG/ECMP group.  The result of the hashing procedure is a
   many-to-one mapping of flows to component links.

   If the hash function maps flows fairly uniformly, so that a similar
   number of flows is mapped to each component link, if the individual
   flow rates are much smaller than the link capacity, and if the rate
   differences among flows are not dramatic, the hash-based algorithm
   produces good results with respect to utilization of the individual
   component links.
   However, if one or more of these conditions are not met, hash-based
   techniques may result in unbalanced loads on individual component
   links.

   One example is illustrated in Figure 1.  In the figure, there are
   two routers, R1 and R2, and there is a LAG between them which has 3
   component links (1), (2), (3).  There are a total of 10 flows that
   need to be distributed across the links in this LAG.  The result of
   hashing is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large flow
     -- and the link utilization is normal.

   . Component link (2) has 3 flows -- 3 small flows and no large flow
     -- and the link utilization is light.

     o The absence of any large flow causes the component link to be
       under-utilized.

   . Component link (3) has 4 flows -- 2 small flows and 2 large flows
     -- and the link capacity is exceeded, resulting in congestion.

     o The presence of 2 large flows causes congestion on this
       component link.

                +-----------+          +-----------+
                |           | -> ->    |           |
                |           | =====>   |           |
                |        (1)|--/---/---|(1)        |
                |           |          |           |
                |           |          |           |
                |   (R1)    | -> -> -> |   (R2)    |
                |        (2)|--/---/---|(2)        |
                |           |          |           |
                |           | -> ->    |           |
                |           | =====>   |           |
                |           | =====>   |           |
                |        (3)|--/---/---|(3)        |
                |           |          |           |
                +-----------+          +-----------+

                Where: -> -> small flows
                       ===>  large flow

               Figure 1: Unevenly Utilized Component Links

   This document presents improved load distribution techniques based
   on large flow awareness.  The techniques compensate for the
   unbalanced load distribution resulting from hashing, as demonstrated
   in the above example.

3. Mechanisms for Optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft constitute a local
   optimization solution; they are local in the sense that both the
   identification of large flows and the re-balancing of the load can
   be accomplished completely within individual nodes in the network,
   without the need for interaction with other nodes.

   This approach may not yield a globally optimal placement of large
   flows across multiple nodes in a network, which may be desirable in
   some networks.  On the other hand, a local approach may be adequate
   for some environments for the following reasons:

   1) Different links within a network experience different levels of
   utilization and, thus, a "targeted" solution is needed for those
   hot-spots in the network.  An example is the utilization of a LAG
   between two routers that needs to be optimized.

   2) Some networks may lack end-to-end visibility, e.g., when a
   certain network, under the control of a given operator, is a transit
   network for traffic from other networks that are not under the
   control of the same operator.

   The various steps in achieving optimal LAG/ECMP component link
   utilization in networks are detailed below:

   Step 1) This involves large flow recognition in routers and
   maintaining the mapping of each large flow to the component link
   that it uses.  The recognition of large flows is explained in
   Section 3.1.

   Step 2) The egress component links are periodically scanned for link
   utilization.  If the egress component link utilization exceeds a
   pre-programmed threshold, an operator alert is generated.  The large
   flows mapped to the congested egress component link are exported to
   a central management entity.
   Step 3) On receiving the alert about the congested component link,
   the operator, through a central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to which
   the component link belongs.

   Step 4) The operator can choose to rebalance the large flows onto
   lightly loaded component links of the LAG/ECMP group, or to
   redistribute the small flows on the congested link to other
   component links of the group.  The operator, through a central
   management entity, can choose one of the following actions:

   1) Indicate specific large flows to rebalance;

   2) Have the router decide the best large flows to rebalance;

   3) Have the router redistribute all the small flows on the congested
   link to other component links in the group.

   The central management entity conveys the above information to the
   router.  The load re-balancing options are explained in Section 3.2.

   Steps 2) to 4) could be automated if desired.

   Providing large flow information to a central management entity
   provides the capability to further optimize flow distribution with
   multi-node visibility.  Consider the following example.  A router
   may have 3 ECMP nexthops that lead down paths P1, P2, and P3.  A
   couple of hops downstream, P1 may be congested while P2 and P3 are
   under-utilized, which the local router does not have visibility
   into.  With the help of a central management entity, the operator
   could redistribute some of the flows from P1 to P2 and P3, resulting
   in a more optimized flow of traffic.

   The techniques described above are especially useful when bundling
   links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as
   described in [I-D.ietf-rtgwg-cl-requirement].

3.1. Large Flow Recognition

3.1.1. Flow Identification

   A flow (large flow or small flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained.
   Flows are typically identified using one or more fields from the
   packet header, from the following list:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . IP header: IP protocol, IP source address, IP destination address,
     flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination
     port.

   . MPLS labels.

   For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow
   identification is possible based on inner and/or outer headers.  The
   above list is not exhaustive.  The mechanisms described in this
   document are agnostic to the fields that are used for flow
   identification.

3.1.2. Criteria for Identifying a Large Flow

   From a bandwidth and time duration perspective, in order to identify
   large flows we define an observation interval and observe the
   bandwidth of each flow over that interval.  A flow that exceeds a
   certain minimum bandwidth threshold over the observation interval
   would be considered a large flow.

   The two parameters -- the observation interval and the minimum
   bandwidth threshold over that observation interval -- should be
   programmable in a router to facilitate handling of different use
   cases and traffic characteristics.  For example, a flow which is at
   or above 100 Mbps for a time period of at least 5 minutes could be
   declared a large flow.  Various techniques to identify a large flow
   are described below.

3.1.3. Sampling Techniques

   A number of routers support sampling techniques such as sFlow
   [sFlow-v5, sFlow-LAG], PSAMP [RFC5475], and NetFlow Sampling
   [RFC3954].  For the purpose of large flow identification, sampling
   must be enabled on all of the egress ports of the router.

   For example, through sFlow processing in an sFlow collector, an
   approximate indication of the large flows mapped to each of the
   component links in each LAG/ECMP group is available.
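   The sampling-based classification described above can be sketched in
   a few lines: each sampled packet is scaled by the inverse of the
   sampling rate to estimate the flow's true byte count over the
   observation interval, and flows whose estimated rate crosses the
   programmed threshold are flagged as large.  This is a minimal
   illustration, not the sFlow or PSAMP protocols themselves; the
   function and parameter names are assumptions for this sketch.

```python
from collections import defaultdict

def detect_large_flows(samples, sampling_rate, interval_s, threshold_bps):
    """Estimate per-flow bandwidth from packet samples; return large flows.

    samples: iterable of (flow_key, sampled_packet_bytes) collected over
             one observation interval; flow_key is e.g. the 5-tuple.
    sampling_rate: fraction of packets sampled (e.g. 1/1024).
    """
    est_bytes = defaultdict(float)
    for flow_key, nbytes in samples:
        # Scale each sampled packet by the inverse sampling rate to
        # approximate the flow's true byte count in the interval.
        est_bytes[flow_key] += nbytes / sampling_rate
    # A flow is "large" if its estimated rate meets the threshold.
    return {k for k, b in est_bytes.items()
            if (b * 8) / interval_s >= threshold_bps}
```

   For example, with a 1-in-1000 sampling rate, a flow seen in 200
   samples of 1000 bytes over a 10-second interval is estimated at
   roughly 160 Mbps and would be flagged against a 100 Mbps threshold.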
   If egress sampling is not available, ingress sampling can suffice,
   since the central management entity used by the sampling technique
   typically has multi-node visibility and can use the samples from an
   immediately downstream node to make measurements for egress traffic
   at the local node.  This may not be possible if the downstream
   device is under the control of a different operator, or if the
   downstream device does not support sampling.

   The advantages and disadvantages of sampling techniques are as
   follows.

   Advantages:

   . Supported in most routers.

   . Requires minimal router resources.

   Disadvantages:

   . There is a delay in the recognition time for large flows, and in
     the time that it takes to react to this information.

   With sampling, the detection of large flows can be done on the order
   of one to a few seconds [DevoFlow].

3.1.4. Automatic Hardware Recognition

   Implementations may perform automatic recognition of large flows in
   hardware on a router.  Since this is done in hardware, it is an
   inline solution and would be expected to operate at line rate.

   Using automatic hardware recognition of large flows, a faster
   indication of the large flows mapped to each of the component links
   in a LAG/ECMP group is available (as compared to the sampling
   approach described above).

   The advantages and disadvantages of automatic hardware recognition
   are:

   Advantages:

   . Accurate and performed in real time.

   Disadvantages:

   . Not supported in many routers.

   As mentioned earlier, the observation interval for determining a
   large flow and the bandwidth threshold for classifying a flow as a
   large flow should be programmable parameters in a router.

   The implementation of automatic hardware recognition of large flows
   is vendor dependent.  A suggested technique is described below.

   This technique requires a few tables -- a flow table and multiple
   hash tables.
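   A minimal software sketch of such a flow table combined with
   multiple hash tables follows.  The stage count, bucket count,
   threshold, and the salted-SHA-1 hash construction are illustrative
   assumptions only; a hardware implementation would use a TCAM flow
   table and native hash functions.

```python
import hashlib

class LargeFlowDetector:
    """Sketch of a flow table plus multiple hash tables of byte counters."""

    def __init__(self, num_stages=4, buckets_per_stage=2048,
                 byte_threshold=1_000_000):
        self.num_stages = num_stages
        self.buckets = buckets_per_stage
        self.threshold = byte_threshold  # bytes per observation interval
        self.flow_table = {}             # known large flows -> byte counter
        self.stages = [[0] * buckets_per_stage for _ in range(num_stages)]

    def _bucket(self, stage, flow_key):
        # A different (salted) hash function per stage.
        digest = hashlib.sha1(f"{stage}:{flow_key}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.buckets

    def packet(self, flow_key, size):
        # Step 1: packets of known large flows are counted in the flow table.
        if flow_key in self.flow_table:
            self.flow_table[flow_key] += size
            return
        # Step 2: increment one counter per hash stage; a flow becomes a
        # candidate large flow only if the hit counters in ALL stages
        # cross the threshold within the observation interval.
        hits = 0
        for stage in range(self.num_stages):
            b = self._bucket(stage, flow_key)
            self.stages[stage][b] += size
            if self.stages[stage][b] >= self.threshold:
                hits += 1
        if hits == self.num_stages:
            self.flow_table[flow_key] = 0
            # Per the algorithm, the counters of the hit entries are reset.
            for stage in range(self.num_stages):
                self.stages[stage][self._bucket(stage, flow_key)] = 0

    def end_of_interval(self):
        # All hash-table counters are reset every observation interval.
        self.stages = [[0] * self.buckets for _ in range(self.num_stages)]
```

   Requiring the threshold to be crossed in every stage is what keeps
   the false-positive rate low: several small flows must collide in the
   same bucket in all stages before one of them is misidentified.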
   The flow table comprises entries that are programmed with the packet
   fields of flows that are already known to be large flows, and each
   entry has a corresponding byte counter.  It is initialized as an
   empty table (i.e., none of the incoming packets would match a flow
   table entry).

   The hash tables each use a different hash function and comprise
   entries which are byte counters.  The counters are initialized to
   zero and are modified as described by the algorithm below.

   Step 1) If the flow exists in the flow table (e.g., a TCAM),
   increment the counter associated with the flow by the packet size.
   Else, proceed to Step 2.

   Step 2) The hash function for each table is applied to the fields of
   the packet header, the results are looked up in parallel in the
   corresponding hash tables, and the counter of the entry that is hit
   in each table is incremented by the packet size.  If, within the
   observation interval, the counters of the entries that were hit in
   all of the hash tables exceed a programmed byte threshold (this
   counter threshold would be set to match the bandwidth threshold), a
   candidate large flow is learned and programmed into the flow table,
   and the counters are reset.

   Additionally, the counters in all of the hash tables must be reset
   every observation interval.

   There may be some false positives due to multiple small flows
   masquerading as a large flow.  The number of such false positives is
   reduced by increasing the number of parallel hash tables using
   different hash functions.  There is a design tradeoff among the size
   of the hash tables, the number of hash tables, and the probability
   of a false positive.  More details on this algorithm are found in
   Appendix B.

3.2. Load Re-Balancing Options

   Below are suggested techniques for load re-balancing.
   Equipment vendors should implement all of these techniques and allow
   the operator to choose one or more of them based on their
   applications.

3.2.1. Alternative Placement of Large Flows

   In the LAG/ECMP group, choose the member component links with the
   least average port utilization.  Move some large flow(s) from the
   heavily loaded component link to those member component links using
   a Policy Based Routing (PBR) rule in the ingress processing
   element(s) of the routers.  The key aspects of this approach are:

   . Small flows are not subjected to flow re-ordering.

   . Only certain large flows are subjected to momentary flow re-
     ordering.

   Note that perfect re-balancing of large flows may not be possible
   since flows arrive and depart at different times.

3.2.2. Redistributing Small Flows

   Some large flows may consume the entire bandwidth of the component
   link(s).  In this case, it would be desirable for the small flows to
   not use the congested component link(s).  This can be accomplished
   as follows; this method works on some existing router hardware.  The
   idea is to prevent, or to reduce the probability, that a small flow
   hashes into the congested component link(s).

   . The LAG/ECMP table is modified to include only non-congested
     component link(s).  Small flows hash into this table to be mapped
     to a destination component link.  Alternatively, if certain
     component links are heavily loaded but not congested, the output
     of the hash function can be adjusted to account for large flow
     loading on each of the component links.

   . Small flows may be subject to momentary packet re-ordering.

   . The PBR rules for large flows (refer to Section 3.2.1) must have
     strict precedence over the LAG/ECMP table lookup result.

3.2.3. Component Link Protection Considerations

   If desired, certain component links may be reserved for link
   protection.
   These reserved component links are not used for any of the flows
   described in Section 3.2.  When the component link(s) fail, all the
   flows on the failed component link(s) are moved to the reserved
   component link(s).  The mapping table of large flows to component
   links simply replaces the reference pointer from the failed
   component link with the reserved link.  Likewise, the LAG/ECMP hash
   table replaces the reference pointer from the failed component link
   with the reserved link.

3.2.4. Load Re-Balancing Example

   Optimal LAG/ECMP component link utilization for the use case in
   Figure 1 is depicted below in Figure 2.  The large flow rebalancing
   explained in Section 3.2.1 is used.  The improved link utilization
   is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large flow
     -- and the link utilization is normal.

   . Component link (2) has 4 flows -- 3 small flows and 1 large flow
     -- and the link utilization is now normal.

   . Component link (3) has 3 flows -- 2 small flows and 1 large flow
     -- and the link utilization is now normal.

                +-----------+          +-----------+
                |           | -> ->    |           |
                |           | =====>   |           |
                |        (1)|--/---/---|(1)        |
                |           |          |           |
                |           | =====>   |           |
                |   (R1)    | -> -> -> |   (R2)    |
                |        (2)|--/---/---|(2)        |
                |           |          |           |
                |           |          |           |
                |           | -> ->    |           |
                |           | =====>   |           |
                |        (3)|--/---/---|(3)        |
                |           |          |           |
                +-----------+          +-----------+

                Where: -> -> small flows
                       ===>  large flow

                Figure 2: Evenly Utilized Component Links

   Basically, the mechanisms described in Section 3.2.1 resulted in a
   rebalancing of flows whereby one of the large flows on component
   link (3), which was previously congested, was moved to component
   link (2), which was previously under-utilized.

4. Future Work

   There are two areas that would benefit from further standards work:

   1) Development of a data model used to move the large flow
   information from the router to the central management entity.

   2) Development of a data model used to move the large flow re-
   balancing information from the central management entity to the
   router.

5. IANA Considerations

   This memo includes no request to IANA.

6. Security Considerations

   This document does not directly impact the security of the Internet
   infrastructure or its applications.  In fact, the mechanisms
   described here could help in the case of a DoS attack pattern that
   causes a hash imbalance, resulting in certain LAG/ECMP component
   links being heavily overloaded by large flows.

7. Acknowledgements

   The authors would like to thank the following individuals for their
   review and valuable feedback on earlier versions of this document:
   Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian
   Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong
   Wong, and Peter Phaal.

8. References

8.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
   Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2. Informative References

   [I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al.,
   "Requirements for MPLS over a Composite Link", June 2012.

   [RFC6790] Kompella, K., et al., "The Use of Entropy Labels in MPLS
   Forwarding", RFC 6790, November 2012.

   [CAIDA] CAIDA Internet Traffic Analysis, http://www.caida.org/home.

   [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport",
   draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.
   [ITCOM] Jo, J., et al., "Internet traffic load balancing using
   dynamic hashing with flow volume", SPIE ITCOM, 2002.

   [RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
   Multicast Next-Hop Selection", RFC 2991, November 2000.

   [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
   Algorithm", RFC 2992, November 2000.

   [RFC5475] Zseby, T., et al., "Sampling and Filtering Techniques for
   IP Packet Selection", RFC 5475, March 2009.

   [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004.

   [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
   structure", September 2012.

   [RFC3954] Claise, B., "Cisco Systems NetFlow Services Export Version
   9", RFC 3954, October 2004.

   [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow
   Management for High Performance Enterprise Networks", Proceedings of
   ACM SIGCOMM, August 2011.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

   Internet traffic [CAIDA] has been analyzed to obtain flow statistics
   such as the number of packets in a flow and the flow duration.  The
   five-tuple in the packet header (IP addresses, TCP/UDP ports, and IP
   protocol) is used for flow identification.  The analysis indicates
   that < ~2% of the flows take ~30% of total traffic volume while the
   rest of the flows (> ~98%) contribute ~70% [YONG].

   The simulation has shown that, given the Internet traffic pattern,
   the hash-based technique does not evenly distribute the flows over
   ECMP paths.  Some paths may be > 90% loaded while others are < 40%
   loaded.  The more ECMP paths there are, the more severe the
   imbalance.  This implies that hash-based distribution can cause some
   paths to become congested while other paths are underutilized
   [YONG].

   The simulation also shows substantial improvement from using the
   large-flow-aware hash-based distribution technique described in this
   document.
   Using the same simulated traffic, the improved rebalancing can
   achieve < 10% load differences among the paths.  This demonstrates
   that large-flow-aware hash-based distribution can effectively
   compensate for the uneven load balancing caused by hashing and the
   traffic characteristics [YONG].

Appendix B. Techniques for Automatic Hardware Recognition

B.1. Comparison of Single Hash vs. Multiple Hash for Large Flow
     Identification

   The suggested multiple-hash technique is scalable in terms of hash
   table storage and flow-table storage.  This is independent of the
   quality of the hash functions.  Scalability is an important
   consideration because there could be millions of flows, only a small
   percentage of which are large flows that need to be identified.  The
   amount of hash table storage is proportional to the number of large
   flows; the exact number would depend on the use cases and traffic
   characteristics.  The amount of flow-table storage needed is only
   slightly more than the number of large flows; a little extra space
   is needed to accommodate any false positives (small flows which may
   have been incorrectly identified as large flows).

   With a single hash, the hash table storage would be proportional to
   the total number of flows in order to minimize false positives.  The
   amount of flow-table storage needed depends on the total number of
   flows and the hash table size.

B.2. Steady-State Analysis of the Suggested Multiple-Hash Technique

   Objective:

   Determine the probability of short-lived flows masquerading as
   long-lived flows.

   Assumptions:

   The small flows are uniformly distributed over all of the hash
   buckets.  Further, assume that the small flows have an identical
   number of packets in the observation interval.
   Notation:

   Number of hash stages - m

   Number of hash buckets per stage - n

   Minimum large flow rate (bytes/sec) - s

   Time interval of examination (sec) - t

   Number of small flows in time interval t - x1

   Number of packets per small flow in time interval t - y

   Average packet size of a small flow - z

   Average number of small flows in a hash bucket - x2 (x1/n)

   Minimum number of small flows in the same hash bucket that would
   lead to one small flow being incorrectly identified as a large flow
   - x3

   Basically, if some number of small flows hash into a bucket, only
   those buckets which hit the minimum bandwidth threshold would
   trigger large flow identification, and only the last small flow
   which triggered the event would be incorrectly identified as a large
   flow.

   Using the above notation, it would take at least x3 small flows
   hashing to the same bucket to trigger the minimum bandwidth
   threshold for that bucket, and the flow to which the last packet
   belongs would be incorrectly identified as a large flow.

   x3 = (s*t)/(y*z)

   Let p1 be the probability of such an incorrectly identified large
   flow per hash stage.  Let x be a variable denoting the number of
   small flows which hash to the same bucket; it can take a value in
   {0, 1, ..., x1}.  As noted above, the mean value of x is x2.

   p1 = Prob(x >= x3)

   Overall probability with m independent hash stages - p1^m

   Thus, larger tables and a larger number of tables lead to a lower
   probability of incorrectly identified large flows.
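   The probability p1 = Prob(x >= x3) can be evaluated numerically.
   The sketch below is an assumption-laden check, not part of the
   suggested technique: it models the number of small flows per bucket
   as Poisson with mean x2 = x1/n (a standard approximation for
   uniformly hashing x1 flows into n buckets when x1 is large).

```python
import math

def misidentification_probability(m, n, s, t, x1, y, z):
    """Return (p1, p1**m) under a Poisson model for flows per bucket.

    m: hash stages, n: buckets per stage, s: large-flow rate (bytes/s),
    t: interval (s), x1: small flows in t, y: packets per small flow,
    z: average small-flow packet size (bytes).
    """
    x2 = x1 / n                # mean number of small flows per bucket
    x3 = (s * t) / (y * z)     # flows needed to cross the byte threshold
    k = math.ceil(x3)
    # p1 = Prob(x >= x3) = 1 - P(x <= k-1) for x ~ Poisson(x2).
    cdf = sum(math.exp(-x2) * x2**i / math.factorial(i) for i in range(k))
    p1 = 1.0 - cdf
    return p1, p1**m
```

   Under this model, the parameter values used in this appendix give p1
   of roughly 0.4; treating p1 as 0.5 is a conservative round-up.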
   An example:

   m = 4, n = 2K, s = 1 MB/sec, t = 1 sec, x1 = 200K, y = 10, z = 1K

   x2 = 200K/2K = 100

   x3 = (1024*1024)/(10*1024) = 102.4

   p1 = Prob(x >= x3) ~= 0.5

   (x2 = 100 small flows fall into one hash bucket on the average)

   Overall probability with m independent hash stages is

   (p1)^m = (0.5)^4 = 0.0625

   Thus, by having 4 stages, the probability of a small flow being
   incorrectly identified as a large flow is reduced from 0.5 to
   0.0625.

Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-406-7890
   Email: ramk@brocade.com

   Sanjay Khanna
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-333-4850
   Email: skhanna@brocade.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: +1-469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134
   Phone: +1-408-571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +1-972-955-0914
   Email: ning.so@tatacommunications.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +1-781-752-8003
   Email: bhumip.khasnabish@zteusa.com