OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Informational                  Brocade Communications
Expires: December 23, 2013                                      L. Yong
June 23, 2013                                                Huawei USA
                                                            A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                    Tata Communications
                                                          B. Khasnabish
                                                        ZTE Corporation

     Mechanisms for Optimal LAG/ECMP Component Link Utilization in
                                Networks

           draft-ietf-opsawg-large-flow-load-balancing-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 23, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Abstract

   Demands on networking infrastructure are growing exponentially,
   driven by bandwidth-hungry rich media applications, inter-data-
   center communications, and similar workloads. In this context, it
   is important to optimally use the bandwidth in wired networks that
   extensively use LAG/ECMP techniques for bandwidth scaling. This
   draft explores some of the mechanisms useful for achieving this.

Table of Contents

   1. Introduction
      1.1. Acronyms
      1.2. Terminology
   2. Flow Categorization
   3. Hash-based Load Distribution in LAG/ECMP
   4. Mechanisms for Optimal LAG/ECMP Component Link Utilization
      4.1. Differences in LAG vs ECMP
      4.2. Overview of the mechanism
      4.3. Large Flow Recognition
         4.3.1. Flow Identification
         4.3.2. Criteria for Identifying a Large Flow
         4.3.3. Sampling Techniques
         4.3.4. Automatic Hardware Recognition
      4.4. Load Re-balancing Options
         4.4.1. Alternative Placement of Large Flows
         4.4.2. Redistributing Small Flows
         4.4.3. Component Link Protection Considerations
         4.4.4. Load Re-balancing Algorithms
         4.4.5. Load Re-Balancing Example
   5. Information Model for Flow Re-balancing
      5.1. Configuration Parameters for Flow Re-balancing
      5.2. System Configuration and Identification Parameters
      5.3. Information for Alternative Placement of Large Flows
      5.4. Information for Redistribution of Small Flows
      5.5. Export of Flow Information
      5.6. Monitoring information
         5.6.1. Interface (link) utilization
         5.6.2. Other monitoring information
   6. Operational Considerations
   7. IANA Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
      10.1. Normative References
      10.2. Informative References

1. Introduction

   Networks extensively use LAG/ECMP techniques for capacity scaling.
   Network traffic can be predominantly categorized into two types:
   long-lived large flows and other flows, the latter including long-
   lived small flows and short-lived small/large flows. Stateless
   hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are
   often used to distribute both long-lived large flows and other
   flows over the component links in a LAG/ECMP group. However, the
   traffic may not be evenly distributed over the component links due
   to the traffic pattern.

   This draft describes mechanisms for optimal LAG/ECMP component link
   utilization while using hash-based techniques. The mechanisms
   comprise two steps: recognizing long-lived large flows in a router,
   and assigning the long-lived large flows to specific LAG/ECMP
   component links, or redistributing the other flows, when a
   component link on the router is congested.
It is useful to keep in mind that the typical use case is one where
   the long-lived large flows are those that consume a significant
   amount of bandwidth on a link, e.g., greater than 5% of link
   bandwidth. The number of such flows would necessarily be fairly
   small, e.g., on the order of tens or hundreds per link. In other
   words, the number of long-lived large flows is NOT expected to be
   on the order of millions of flows. Examples of such long-lived
   large flows would be IPsec tunnels in service provider backbones or
   storage backup traffic in data center networks.

1.1. Acronyms

   COTS: Commercial Off-the-shelf

   DOS: Denial of Service

   ECMP: Equal Cost Multi-path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   PBR: Policy Based Routing

   QoS: Quality of Service

   STT: Stateless Transport Tunneling

   TCAM: Ternary Content Addressable Memory

   VXLAN: Virtual Extensible LAN

1.2. Terminology

   Large flow(s): long-lived large flow(s)

   Small flow(s): long-lived small flow(s) and short-lived small/large
   flow(s)

2. Flow Categorization

   In general, based on size and duration, a flow can be categorized
   into one of the following four types, as shown in Figure 1:

   (a) Short-Lived Large Flow (SLLF),
   (b) Short-Lived Small Flow (SLSF),
   (c) Long-Lived Large Flow (LLLF), and
   (d) Long-Lived Small Flow (LLSF).

    Flow Size
        ^
        |--------------------|--------------------|
        |                    |                    |
  Large |        SLLF        |        LLLF        |
  Flow  |                    |                    |
        |--------------------|--------------------|
        |                    |                    |
  Small |        SLSF        |        LLSF        |
  Flow  |                    |                    |
        +--------------------+--------------------+--> Flow Duration
             Short-Lived          Long-Lived
                Flow                 Flow

                   Figure 1: Flow Categorization

   In this document, we categorize long-lived large flows as "Large"
   flows, and all of the others -- long-lived small flows and short-
   lived small/large flows -- as "Small" flows.

3. Hash-based Load Distribution in LAG/ECMP

   Hashing techniques are often used for traffic load balancing to
   select among multiple available paths within a LAG/ECMP group. The
   advantages of hash-based load distribution are the preservation of
   the packet sequence in a flow and real-time distribution without
   maintaining per-flow state in the router. Hash-based techniques use
   a combination of fields in the packet's headers to identify a flow,
   and a hash function over these fields is used to generate a value
   that selects a link/path in the LAG/ECMP group. The result of the
   hashing procedure is a many-to-one mapping of flows to component
   links.

   Hash-based distribution produces good results with respect to the
   utilization of the individual component links if the traffic
   consists of flows such that the result of the hash function across
   these flows is fairly uniform (so that a similar number of flows is
   mapped to each component link), if the individual flow rates are
   much smaller than the link capacity, and if the rate differences
   between flows are not dramatic. However, if one or more of these
   conditions are not met, hash-based techniques may result in
   unbalanced loads on individual component links.
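   As a rough illustration of this many-to-one mapping, the following
   sketch (Python; the MD5-based hash stands in for whatever vendor-
   specific hash function a real forwarding plane would use) selects a
   component link from a flow's 5-tuple:

      import hashlib

      def select_component_link(flow_fields, component_links):
          # All packets of a flow carry the same header fields, so
          # they hash to the same value; this preserves the packet
          # sequence within the flow.
          key = "|".join(str(f) for f in flow_fields).encode()
          value = int(hashlib.md5(key).hexdigest(), 16)
          # Many-to-one mapping: reduce the hash value modulo the
          # number of component links in the LAG/ECMP group.
          return component_links[value % len(component_links)]

      links = ["link-1", "link-2", "link-3"]
      flow = ("10.0.0.1", "10.0.0.2", 6, 34567, 80)   # 5-tuple
      print(select_component_link(flow, links))

   Because the mapping ignores flow rates, two large flows can land on
   the same component link, which is exactly the failure mode
   discussed next.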
One example is illustrated in Figure 2. In Figure 2, there are two
   routers, R1 and R2, and there is a LAG between them which has 3
   component links (1), (2), (3). There are a total of 10 flows that
   need to be distributed across the links in this LAG. The result of
   hashing is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 3 flows -- 3 small flows and no large
     flow -- and the link utilization is light.

     o The absence of any large flow causes the component link to be
       under-utilized.

   . Component link (3) has 4 flows -- 2 small flows and 2 large
     flows -- and the link capacity is exceeded, resulting in
     congestion.

     o The presence of 2 large flows causes congestion on this
       component link.

      +-----------+         +-----------+
      |           | -> ->   |           |
      |           |=====>   |           |
      |        (1)|--/---/--|(1)        |
      |           |         |           |
      |           |         |           |
      |   (R1)    |-> -> -> |   (R2)    |
      |        (2)|--/---/--|(2)        |
      |           |         |           |
      |           | -> ->   |           |
      |           |=====>   |           |
      |           |=====>   |           |
      |        (3)|--/---/--|(3)        |
      |           |         |           |
      +-----------+         +-----------+

       Where: ->->  small flows
              ===>  large flow

          Figure 2: Unevenly Utilized Component Links

   This document presents improved load distribution techniques based
   on large flow awareness. The techniques compensate for the
   unbalanced load distribution resulting from hashing, as
   demonstrated in the above example.

4. Mechanisms for Optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft constitute a local
   optimization solution; they are local in the sense that both the
   identification of large flows and the re-balancing of the load can
   be accomplished completely within individual nodes in the network,
   without the need for interaction with other nodes.

   This approach may not yield a globally optimal placement of large
   flows across multiple nodes in a network, which may be desirable in
   some networks. On the other hand, a local approach may be adequate
   for some environments for the following reasons:

   1) Different links within a network experience different levels of
   utilization and, thus, a "targeted" solution is needed for those
   hot-spots in the network. An example is the utilization of a LAG
   between two routers that needs to be optimized.

   2) Some networks may lack end-to-end visibility, e.g., when a
   certain network, under the control of a given operator, is a
   transit network for traffic from other networks that are not under
   the control of the same operator.

4.1. Differences in LAG vs ECMP

   While the mechanisms explained herein are applicable to both LAGs
   and ECMP groups, it is useful to note that there are some key
   differences between the two that may impact how effective the
   mechanism is. This relates, in part, to the localized information
   with which the scheme is intended to operate.

   A LAG is almost always between two adjacent routers. As a result,
   the scope of the problem of optimizing the bandwidth utilization on
   the component links is fairly narrow. It simply involves re-
   balancing the load across the component links between these two
   routers, and there is no impact whatsoever on other parts of the
   network. The scheme works equally well for unicast and multicast
   flows.
On the other hand, with ECMP, redistributing the load across
   component links that are part of the ECMP group may impact traffic
   patterns at all of the nodes that are downstream of the given
   router between itself and the destination. The local optimization
   may result in congestion at a downstream node. (In its simplest
   form, an ECMP group may be used to distribute traffic on component
   links that are between two adjacent routers, and in that case the
   ECMP group is no different from a LAG for the purpose of this
   discussion.)

   To demonstrate the limitations of local optimization, consider a
   two-level fat-tree topology with three leaf nodes (L1, L2, L3) and
   two spine nodes (S1, S2), and assume all of the links are 10 Gbps.
   Let L1 have two flows of 4 Gbps each towards L3, and let L2 have
   one flow of 7 Gbps, also towards L3. If L1 balances the load
   optimally between S1 and S2, and L2 sends its flow via S1, then the
   downlink from S1 to L3 would get congested, resulting in packet
   discards. On the other hand, if L1 had sent both its flows towards
   S1 and L2 had sent its flow towards S2, there would have been no
   congestion at either S1 or S2.

   The other issue with applying this scheme to ECMP groups is that it
   may not apply equally to unicast and multicast traffic because of
   the way multicast trees are constructed.

4.2. Overview of the mechanism

   The various steps in achieving optimal LAG/ECMP component link
   utilization in networks are detailed below:

   Step 1) This involves large flow recognition in routers and
   maintaining the mapping of each large flow to the component link
   that it uses. The recognition of large flows is explained in
   Section 4.3.

   Step 2) The egress component links are periodically scanned for
   link utilization. If the egress component link utilization exceeds
   a pre-programmed threshold, an operator alert is generated. The
   large flows mapped to the congested egress component link are
   exported to a central management entity.

   Step 3) On receiving the alert about the congested component link,
   the operator, through a central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to which
   the component link belongs.

   Step 4) The operator can choose to rebalance the large flows onto
   lightly loaded component links of the LAG/ECMP group, or to
   redistribute the small flows on the congested link to other
   component links of the group. The operator, through a central
   management entity, can choose one of the following actions:

   1) Indicate specific large flows to rebalance;

   2) Have the router decide the best large flows to rebalance;

   3) Have the router redistribute all the small flows on the
   congested link to other component links in the group.

   The central management entity conveys the above information to the
   router. The load re-balancing options are explained in Section 4.4.

   Steps 2) to 4) could be automated if desired.
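   As an illustration of how Step 2) might be automated, the following
   sketch (Python; the Flow and Lag records and the export callback
   are hypothetical stand-ins for platform-specific interfaces) scans
   the egress component links and exports the large flows mapped to
   any congested link:

      from dataclasses import dataclass, field

      UTILIZATION_THRESHOLD = 0.80   # pre-programmed (example value)

      @dataclass
      class Flow:
          fields: tuple      # header fields identifying the flow
          link: str          # component link the flow is mapped to
          is_large: bool

      @dataclass
      class Lag:
          component_links: dict          # link -> utilization (0..1)
          flows: list = field(default_factory=list)

      def scan_lag(lag, export):
          # Step 2: scan each egress component link; on exceeding the
          # threshold, alert and export the large flows on that link.
          for link, utilization in lag.component_links.items():
              if utilization > UTILIZATION_THRESHOLD:
                  large = [f for f in lag.flows
                           if f.link == link and f.is_large]
                  export(link, utilization, large)

      lag = Lag({"link-1": 0.5, "link-2": 0.3, "link-3": 0.95},
                [Flow(("10.0.0.1", "10.0.0.2", 6, 1, 2),
                      "link-3", True)])
      scan_lag(lag, lambda link, u, flows:
               print(f"ALERT {link} at {u:.0%}: {flows}"))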
Providing large flow information to a central management entity
   provides the capability to further optimize flow distribution with
   multi-node visibility. Consider the following example. A router may
   have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple
   of hops downstream on P1 may be congested, while P2 and P3 may be
   under-utilized, which the local router does not have visibility
   into. With the help of a central management entity, the operator
   could redistribute some of the flows from P1 to P2 and P3,
   resulting in a more optimized flow of traffic.

   The techniques described above are especially useful when bundling
   links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as
   described in [I-D.ietf-rtgwg-cl-requirement].

4.3. Large Flow Recognition

4.3.1. Flow Identification

   A flow (large flow or small flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained. Flows are
   typically identified using one or more fields from the packet
   header, from the following list:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . IP header: IP Protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS Labels.

   For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow
   identification is possible based on inner and/or outer headers.
   The above list is not exhaustive. The mechanisms described in this
   document are agnostic to the fields that are used for flow
   identification.

4.3.2. Criteria for Identifying a Large Flow

   From a bandwidth and time duration perspective, in order to
   identify large flows we define an observation interval and observe
   the bandwidth of the flow over that interval. A flow that exceeds a
   certain minimum bandwidth threshold over that observation interval
   would be considered a large flow.

   The two parameters -- the observation interval, and the minimum
   bandwidth threshold over that observation interval -- should be
   programmable in a router to facilitate handling of different use
   cases and traffic characteristics. For example, a flow which is at
   or above 10% of link bandwidth for a time period of at least 1
   second could be declared a large flow [DevoFlow].

   In order to avoid excessive churn in the rebalancing, once a flow
   has been recognized as a large flow, it should continue to be
   recognized as a large flow as long as the traffic received during
   an observation interval exceeds some fraction of the bandwidth
   threshold, for example 80% of the bandwidth threshold.
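   A minimal sketch of this recognition logic, including the
   hysteresis described above (all parameter values are examples, not
   recommendations):

      from collections import defaultdict

      OBSERVATION_INTERVAL_S = 1.0    # example value
      RECOGNITION_FRACTION = 0.10     # 10% of link bandwidth
      MAINTENANCE_FRACTION = 0.80     # stay "large" above 80% of it

      class LargeFlowDetector:
          def __init__(self, link_speed_bps):
              interval_bytes = link_speed_bps / 8 * OBSERVATION_INTERVAL_S
              self.recognize_bytes = RECOGNITION_FRACTION * interval_bytes
              self.maintain_bytes = (MAINTENANCE_FRACTION *
                                     self.recognize_bytes)
              self.byte_counts = defaultdict(int)
              self.large_flows = set()

          def record_packet(self, flow_id, size_bytes):
              self.byte_counts[flow_id] += size_bytes

          def end_of_interval(self):
              # Flows above the recognition threshold become large;
              # known large flows stay large until they fall below the
              # (lower) maintenance threshold, avoiding churn.
              for flow_id in set(self.byte_counts) | self.large_flows:
                  count = self.byte_counts.get(flow_id, 0)
                  if flow_id in self.large_flows:
                      if count < self.maintain_bytes:
                          self.large_flows.discard(flow_id)
                  elif count >= self.recognize_bytes:
                      self.large_flows.add(flow_id)
              self.byte_counts.clear()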
Various techniques to identify a large flow are described below.

4.3.3. Sampling Techniques

   A number of routers support sampling techniques such as sFlow
   [sFlow-v5, sFlow-LAG], PSAMP [RFC 5475], and NetFlow Sampling
   [RFC 3954]. For the purpose of large flow identification, sampling
   must be enabled on all of the egress ports in the router where such
   measurements are desired.

   Using sFlow as an example, processing in an sFlow collector will
   provide an approximate indication of the large flows mapping to
   each of the component links in each LAG/ECMP group. It is possible
   to implement this part of the collector function in the control
   plane of the router, reducing dependence on an external management
   station, assuming sufficient control plane resources are available.

   If egress sampling is not available, ingress sampling can suffice,
   since the central management entity used by the sampling technique
   typically has multi-node visibility and can use the samples from an
   immediately downstream node to make measurements for egress traffic
   at the local node. This approach may not be available if the
   downstream device is under the control of a different operator, or
   if the downstream device does not support sampling. Alternatively,
   since sampling techniques require that the sample be annotated with
   the packet's egress port information, ingress sampling alone may
   suffice. However, this means that sampling would have to be enabled
   on all ports, rather than only on those ports where such monitoring
   is desired.

   The advantages and disadvantages of sampling techniques are as
   follows.

   Advantages:

   . Supported in most existing routers.

   . Requires minimal router resources.

   Disadvantages:

   . In order to minimize the error inherent in sampling, there is a
     minimum delay for the recognition time of large flows, and in the
     time that it takes to react to this information.

   With sampling, the detection of large flows can be done on the
   order of one second [DevoFlow].

4.3.4. Automatic Hardware Recognition

   Implementations may perform automatic recognition of large flows in
   hardware on a router. Since this is done in hardware, it is an
   inline solution and would be expected to operate at line rate.

   Using automatic hardware recognition of large flows, a faster
   indication of the large flows mapped to each of the component links
   in a LAG/ECMP group is available (as compared to the sampling
   approach described above).

   The advantages and disadvantages of automatic hardware recognition
   are:

   Advantages:

   . Large flow detection is offloaded to hardware, freeing up
     software resources and reducing possible dependence on an
     external management station.

   . As link speeds get higher, sampling rates are typically reduced
     to keep the number of samples manageable, which places a lower
     bound on the detection time. With automatic hardware recognition,
     large flows can be detected in shorter windows on higher link
     speeds since every packet is accounted for in hardware [NDTM].

   Disadvantages:

   . Not supported in many routers.

   As mentioned earlier, the observation interval for determining a
   large flow and the bandwidth threshold for classifying a flow as a
   large flow should be programmable parameters in a router.

   The implementation of automatic hardware recognition of large flows
   is vendor dependent and beyond the scope of this document.

4.4. Load Re-balancing Options

   Suggested techniques for load re-balancing are given below.
   Equipment vendors should implement all of these techniques and
   allow the operator to choose one or more of them based on their
   applications.

   Note that regardless of the method used, perfect re-balancing of
   large flows may not be possible since flows arrive and depart at
   different times. Also, any flows that are moved from one component
   link to another may experience momentary packet reordering.

4.4.1. Alternative Placement of Large Flows

   Within a LAG/ECMP group, the member component links with the least
   average port utilization are identified. Some large flow(s) from
   the heavily loaded component links are then moved to those lightly
   loaded member component links using a PBR rule in the ingress
   processing element(s) in the routers. With this approach, only
   certain large flows are subjected to momentary flow re-ordering.

   When a large flow is moved, this will increase the utilization of
   the link that it is moved to, potentially creating unbalanced
   utilization once again across the component links. Therefore, when
   moving large flows, care must be taken to account for the existing
   load and for what the future load will be after the large flow has
   been moved. Further, the appearance of new large flows may require
   a rearrangement of the placement of existing flows.

   Consider a case where there is a LAG comprising four 10 Gbps
   component links and there are four large flows, each of 1 Gbps.
   These flows are each placed on one of the component links.
   Subsequently, a fifth large flow of 2 Gbps is recognized, and to
   maintain equitable load distribution, it may be necessary to move
   one of the existing 1 Gbps flows to a different component link.
   Even then, there would still be some imbalance in the utilization
   across the component links.
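   A minimal sketch of this placement step (Python; install_pbr_rule
   is a hypothetical stand-in for the vendor-specific PBR installation
   interface):

      def install_pbr_rule(match_fields, egress_link):
          # Hypothetical hook: a real router would program the
          # ingress processing element(s) to steer the flow.
          print(f"PBR: {match_fields} -> {egress_link}")

      def move_one_large_flow(large_flows, link_utilization):
          # large_flows: list of (match_fields, rate_bps, link)
          # link_utilization: component link -> utilization (0..1)
          busiest = max(link_utilization, key=link_utilization.get)
          lightest = min(link_utilization, key=link_utilization.get)
          candidates = [f for f in large_flows if f[2] == busiest]
          if not candidates or busiest == lightest:
              return None
          # Move the largest flow on the congested link; smarter
          # candidate selection is a placement problem (Section 4.4.4).
          fields, rate, _ = max(candidates, key=lambda f: f[1])
          install_pbr_rule(fields, lightest)
          return (fields, rate, lightest)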
4.4.2. Redistributing Small Flows

   Some large flows may consume the entire bandwidth of the component
   link(s). In this case, it would be desirable for the small flows to
   not use the congested component link(s). This can be accomplished
   in one of the following ways; these methods work on some existing
   router hardware. The idea is to prevent, or reduce the probability,
   that the small flows hash into the congested component link(s).

   . The LAG/ECMP table is modified to include only non-congested
     component link(s). Small flows hash into this table to be mapped
     to a destination component link. Alternatively, if certain
     component links are heavily loaded but not congested, the output
     of the hash function can be adjusted to account for large flow
     loading on each of the component links.

   . The PBR rules for large flows (refer to Section 4.4.1) must have
     strict precedence over the LAG/ECMP table lookup result.

   With this approach, the small flows that are moved would be subject
   to reordering.

4.4.3. Component Link Protection Considerations

   If desired, certain component links may be reserved for link
   protection. These reserved component links are not used for any
   flows in the absence of any failures. In the case when the
   component link(s) fail, all the flows on the failed component
   link(s) are moved to the reserved component link(s). The mapping
   table of large flows to component links simply replaces the failed
   component link with the reserved link. Likewise, the LAG/ECMP hash
   table replaces the failed component link with the reserved link.

4.4.4. Load Re-balancing Algorithms

   Specific algorithms for the placement of large flows are out of the
   scope of this document. One possibility is to formulate the problem
   of large flow placement as the well-known bin-packing problem and
   make use of the various heuristics that are available for that
   problem [bin-pack].
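   As one illustration (not a recommendation), the following sketch
   places large flows, largest first, on the component link with the
   most remaining capacity -- a simple greedy variant of the bin-
   packing heuristics:

      def place_large_flows(flows_bps, links_bps):
          # Greedy heuristic: sort flows largest-first, then place
          # each on the component link with the most remaining
          # capacity.
          remaining = dict(links_bps)
          placement = {}
          for flow, rate in sorted(flows_bps.items(),
                                   key=lambda kv: -kv[1]):
              link = max(remaining, key=remaining.get)
              placement[flow] = link
              remaining[link] -= rate
          return placement

      links = {"link-1": 10e9, "link-2": 10e9,
               "link-3": 10e9, "link-4": 10e9}
      flows = {"f1": 1e9, "f2": 1e9, "f3": 1e9,
               "f4": 1e9, "f5": 2e9}
      print(place_large_flows(flows, links))

   With the example of Section 4.4.1 (four 1 Gbps flows plus one
   2 Gbps flow over four 10 Gbps links), the resulting placement still
   leaves some imbalance, as noted there.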
4.4.5. Load Re-Balancing Example

   Optimal LAG/ECMP component utilization for the use case in Figure 2
   is depicted below in Figure 3. The large flow rebalancing explained
   in Section 4.4.1 is used. The improved link utilization is as
   follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 4 flows -- 3 small flows and 1 large
     flow -- and the link utilization is normal now.

   . Component link (3) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal now.

      +-----------+         +-----------+
      |           | -> ->   |           |
      |           |=====>   |           |
      |        (1)|--/---/--|(1)        |
      |           |         |           |
      |           |=====>   |           |
      |   (R1)    |-> -> -> |   (R2)    |
      |        (2)|--/---/--|(2)        |
      |           |         |           |
      |           |         |           |
      |           | -> ->   |           |
      |           |=====>   |           |
      |        (3)|--/---/--|(3)        |
      |           |         |           |
      +-----------+         +-----------+

       Where: ->->  small flows
              ===>  large flow

           Figure 3: Evenly Utilized Component Links

   Basically, the use of the mechanisms described in Section 4.4.1
   resulted in a rebalancing of flows where one of the large flows on
   component link (3), which was previously congested, was moved to
   component link (2), which was previously under-utilized.

5. Information Model for Flow Re-balancing

5.1. Configuration Parameters for Flow Re-balancing

   The following parameters are required for the configuration of this
   feature:

   . Large flow recognition parameters:

     o Observation interval: The observation interval is the time
       period in seconds over which packet arrivals are observed for
       the purpose of large flow recognition.

     o Minimum bandwidth threshold: The minimum bandwidth threshold
       would be configured as a percentage of link speed and
       translated into a number of bytes over the observation
       interval. A flow for which the number of bytes received, in a
       given observation interval, exceeds this number would be
       recognized as a large flow.

     o Minimum bandwidth threshold for large flow maintenance: The
       minimum bandwidth threshold for large flow maintenance is used
       to provide hysteresis for large flow recognition. Once a flow
       is recognized as a large flow, it continues to be recognized as
       a large flow until it falls below this threshold. This is also
       configured as a percentage of link speed and is typically lower
       than the minimum bandwidth threshold defined above.

   . Imbalance threshold: the difference between the utilization of
     the least utilized and the most utilized component links,
     expressed as a percentage of link speed.

   . Rebalancing interval: the minimum amount of time between
     rebalancing events. This parameter ensures that rebalancing is
     not invoked too frequently, as it impacts frame ordering.

   These parameters may be configured on a system-wide basis or they
   may apply to an individual LAG.
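   A minimal sketch of these parameters gathered into a configuration
   record (Python; the default values are illustrative only):

      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class RebalancingConfig:
          # Large flow recognition parameters (example values).
          observation_interval_s: float = 1.0
          min_bw_threshold_pct: float = 10.0      # % of link speed
          maintenance_threshold_pct: float = 8.0  # hysteresis; lower
          # Rebalancing parameters.
          imbalance_threshold_pct: float = 20.0   # least vs. most
          rebalancing_interval_s: float = 60.0    # min time between
          lag_id: Optional[str] = None            # None = system-wide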
5.2. System Configuration and Identification Parameters

   . IP address: The IP address of the specific router that the
     feature is being configured on, or that the large flow placement
     is being applied to.

   . LAG ID: Identifies the LAG. The LAG ID may be required when
     configuring this feature (to apply a specific set of large flow
     identification parameters to the LAG) and will be required when
     specifying flow placement to achieve the desired rebalancing.

   . Component Link ID: Identifies the component link within a LAG.
     This is required when specifying flow placement to achieve the
     desired rebalancing.

5.3. Information for Alternative Placement of Large Flows

   In cases where large flow recognition is handled by an external
   management station (see Section 4.3.3), an information model for
   flows is required to allow the import of large flow information
   into the router.

   The following are some of the elements of the information model for
   the importing of flows:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . Layer 3 IP: IP Protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS Labels.

   This list is not exhaustive. For example, with overlay protocols
   such as VXLAN and NVGRE, fields from the outer and/or inner headers
   may be specified. In general, all fields in the packet that can be
   used by forwarding decisions should be available for use when
   importing flow information from an external management station.

   The IPFIX information model [RFC 5101] can be leveraged for large
   flow identification. The component link ID would be used to specify
   the target component link for the flow.

5.4. Information for Redistribution of Small Flows

   For small flows, the LAG ID and the component link IDs, along with
   the percentage of traffic to be assigned to each component link ID,
   are required.

5.5. Export of Flow Information

   Exporting flow information is required when large flow
   identification is being done on a router, but the decision to
   rebalance is being made in an external management station.

   It is recommended to use the IPFIX protocol [RFC 5101] for the
   export of large flows from the router to an external management
   station.

5.6. Monitoring information

5.6.1. Interface (link) utilization

   The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and
   interface speed (ifSpeed) can be obtained from the Interfaces table
   (ifTable) MIB [RFC 1213].

   Since ifInOctets and ifOutOctets are cumulative counters, the link
   utilization is computed from the change in the counters over a
   measurement interval:

   Incoming link utilization =
                        (delta ifInOctets * 8) / (interval * ifSpeed)

   Outgoing link utilization =
                        (delta ifOutOctets * 8) / (interval * ifSpeed)

   where "interval" is the measurement interval in seconds.

   For high speed links, the etherStatsHighCapacityTable MIB
   [RFC 3273] can be used.

   The outgoing link utilization of the component links within a LAG
   can be used to compute the imbalance threshold (see Section 5.1)
   for the LAG.
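   A minimal sketch of this computation, including 32-bit counter
   wrap-around (one assumption: the counter wraps at most once per
   interval):

      def link_utilization(octets_t0, octets_t1, interval_s,
                           if_speed_bps):
          # Utilization from two readings of a cumulative octet
          # counter (ifInOctets or ifOutOctets) taken interval_s
          # seconds apart.
          delta = (octets_t1 - octets_t0) % 2**32   # Counter32 wrap
          return (delta * 8) / (interval_s * if_speed_bps)

      # Example: 750,000,000 octets in 60 s on a 10 Gbps link -> 1%
      print(link_utilization(0, 750_000_000, 60, 10_000_000_000))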
5.6.2. Other monitoring information

   Additional monitoring information includes:

   . The number of times rebalancing was done.

   . The time since the last rebalancing event.

6. Operational Considerations

   Flows should be re-balanced only when the imbalance in the
   utilization across component links exceeds a certain threshold.
   Frequent re-balancing to achieve precise equitable utilization
   across component links could be counter-productive, as it may
   result in moving flows back and forth between the component links,
   impacting packet ordering and system stability. This applies
   regardless of whether large flows or small flows are redistributed.
   It should be noted that reordering is a concern for TCP flows with
   even a few packets, because three out-of-order packets would
   trigger sufficient duplicate ACKs to the sender, resulting in a
   retransmission [RFC 5681].

   The operator would have to experiment with various values of the
   large flow recognition parameters (minimum bandwidth threshold,
   observation interval) and the imbalance threshold across component
   links to tune the solution for their environment.
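   A minimal sketch of such a guard, reusing the RebalancingConfig
   record sketched in Section 5.1 (names are illustrative):

      def should_rebalance(link_utilizations, config,
                           seconds_since_last):
          # link_utilizations: utilization of each component link
          # (0..1). Rebalance only when the spread between the most
          # and least utilized component links exceeds the imbalance
          # threshold, and no more often than the rebalancing
          # interval allows.
          imbalance_pct = (max(link_utilizations) -
                           min(link_utilizations)) * 100
          if imbalance_pct < config.imbalance_threshold_pct:
              return False
          return seconds_since_last >= config.rebalancing_interval_s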
7. IANA Considerations

   This memo includes no request to IANA.

8. Security Considerations

   This document does not directly impact the security of the Internet
   infrastructure or its applications. In fact, it could help in the
   case of a DOS attack pattern that causes a hash imbalance resulting
   in heavy loading of certain LAG/ECMP component links by large
   flows.

9. Acknowledgements

   The authors would like to thank the following individuals for their
   review and valuable feedback on earlier versions of this document:
   Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian
   Carpenter, George Yum, Michael Fargano, Michael Bugenhagen,
   Jianrong Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete
   Moyer, Andrew Malis, Dave McDysan, Zhen Cao, and Dan Romascanu.

10. References

10.1. Normative References

10.2. Informative References

   [I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al.,
   "Requirements for MPLS over a Composite Link," September 2013.

   [RFC 6790] Kompella, K., et al., "The Use of Entropy Labels in MPLS
   Forwarding," November 2012.

   [CAIDA] CAIDA Internet Traffic Analysis, http://www.caida.org/home.

   [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport,"
   draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

   [ITCOM] Jo, J., et al., "Internet traffic load balancing using
   dynamic hashing with flow volume," SPIE ITCOM, 2002.

   [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast
   and Multicast Next-Hop Selection," November 2000.

   [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
   Algorithm," November 2000.

   [RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques
   for IP Packet Selection," March 2009.

   [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," July 2004.

   [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
   structure," September 2012.

   [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export
   Version 9," October 2004.

   [RFC 5101] Claise, B., "Specification of the IP Flow Information
   Export (IPFIX) Protocol for the Exchange of IP Traffic Flow
   Information," January 2008.

   [RFC 5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
   Control," September 2009.

   [RFC 1213] McCloghrie, K. and M. Rose, "Management Information Base
   for Network Management of TCP/IP-based internets: MIB-II," March
   1991.

   [RFC 3273] Waldbusser, S., "Remote Network Monitoring Management
   Information Base for High Capacity Networks," July 2002.

   [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow
   Management for High Performance Enterprise Networks," Proceedings
   of ACM SIGCOMM, August 2011.

   [NDTM] Estan, C. and G. Varghese, "New directions in traffic
   measurement and accounting," Proceedings of ACM SIGCOMM, August
   2002.

   [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson,
   "Approximation Algorithms for Bin-Packing -- An Updated Survey," in
   Algorithm Design for Computer System Design, ed. by Ausiello,
   Lucertini, and Serafini, Springer-Verlag, 1984.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

   Internet traffic [CAIDA] has been analyzed to obtain flow
   statistics such as the number of packets in a flow and the flow
   duration. The five-tuple in the packet header (IP addresses,
   TCP/UDP ports, and IP protocol) is used for flow identification.
   The analysis indicates that < ~2% of the flows take ~30% of the
   total traffic volume, while the rest of the flows (> ~98%)
   contribute ~70% [YONG].

   The simulation has shown that, given the Internet traffic pattern,
   the hash-based technique does not evenly distribute the flows over
   ECMP paths. Some paths may be > 90% loaded while others are < 40%
   loaded. The more ECMP paths exist, the more severe the imbalance.
   This implies that hash-based distribution can cause some paths to
   become congested while other paths are underutilized [YONG].

   The simulation also shows substantial improvement from using the
   large flow-aware hash-based distribution technique described in
   this document. Using the same simulated traffic, the improved
   rebalancing can achieve < 10% load differences among the paths.
   This shows how large flow-aware hash-based distribution can
   effectively compensate for the uneven load balancing caused by
   hashing and the traffic characteristics [YONG].

Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-406-7890
   Email: ramk@brocade.com

   Sanjay Khanna
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-333-4850
   Email: skhanna@brocade.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: +1-469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134
   Phone: +1-408-571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +1-972-955-0914
   Email: ning.so@tatacommunications.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +1-781-752-8003
   Email: bhumip.khasnabish@zteusa.com