OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Informational                  Brocade Communications
Expires: January 9, 2014                                        L. Yong
July 9, 2013                                                 Huawei USA
                                                            A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                    Tata Communications
                                                          B. Khasnabish
                                                         ZTE Corporation

     Mechanisms for Optimal LAG/ECMP Component Link Utilization in
                               Networks

           draft-ietf-opsawg-large-flow-load-balancing-04.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. This document may not be modified,
and derivative works of it may not be created, except to publish it
as an RFC and to translate it into languages other than English.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This Internet-Draft will expire on January 9, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Abstract

Demands on networking infrastructure are growing exponentially; the
drivers are bandwidth-hungry rich media applications, inter-data
center communications, etc. In this context, it is important to
optimally use the bandwidth in wired networks that extensively use
LAG/ECMP techniques for bandwidth scaling. This draft explores some
of the mechanisms useful for achieving this.

Table of Contents

1. Introduction
   1.1. Acronyms
   1.2. Terminology
2. Flow Categorization
3. Hash-based Load Distribution in LAG/ECMP
4. Mechanisms for Optimal LAG/ECMP Component Link Utilization
   4.1. Differences in LAG vs ECMP
   4.2. Overview of the Mechanism
   4.3. Large Flow Recognition
      4.3.1. Flow Identification
      4.3.2. Criteria for Identifying a Large Flow
      4.3.3. Sampling Techniques
      4.3.4. Automatic Hardware Recognition
   4.4. Load Re-balancing Options
      4.4.1. Alternative Placement of Large Flows
      4.4.2. Redistributing Small Flows
      4.4.3. Component Link Protection Considerations
      4.4.4. Load Re-balancing Algorithms
      4.4.5. Load Re-balancing Example
5. Information Model for Flow Re-balancing
   5.1. Configuration Parameters for Flow Re-balancing
   5.2. System Configuration and Identification Parameters
   5.3. Information for Alternative Placement of Large Flows
   5.4. Information for Redistribution of Small Flows
   5.5. Export of Flow Information
   5.6. Monitoring Information
      5.6.1. Interface (Link) Utilization
      5.6.2. Other Monitoring Information
6. Operational Considerations
7. IANA Considerations
8. Security Considerations
9. Acknowledgements
10. References
   10.1. Normative References
   10.2. Informative References

1. Introduction

Networks extensively use LAG/ECMP techniques for capacity scaling.
Network traffic can be predominantly categorized into two traffic
types: long-lived large flows and other flows (which include
long-lived small flows and short-lived small/large flows). Stateless
hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often
used to distribute both long-lived large flows and other flows over
the component links in a LAG/ECMP group. However, the traffic may not
be evenly distributed over the component links due to the traffic
pattern.

This draft describes mechanisms for optimal LAG/ECMP component link
utilization while using hash-based techniques.
The mechanisms
comprise two steps: recognizing long-lived large flows in a router,
and assigning the long-lived large flows to specific LAG/ECMP
component links or redistributing the other flows when a component
link on the router is congested.

It is useful to keep in mind that the typical use case is one where
the long-lived large flows are those that consume a significant
amount of bandwidth on a link, e.g. greater than 5% of link
bandwidth. The number of such flows would necessarily be fairly
small, e.g. on the order of 10s or 100s per link. In other words,
the number of long-lived large flows is NOT expected to be on the
order of millions of flows. Examples of such long-lived large flows
would be IPsec tunnels in service provider backbones or storage
backup traffic in data center networks.

1.1. Acronyms

COTS: Commercial Off-the-shelf

DOS: Denial of Service

ECMP: Equal Cost Multi-path

GRE: Generic Routing Encapsulation

LAG: Link Aggregation Group

MPLS: Multiprotocol Label Switching

NVGRE: Network Virtualization using Generic Routing Encapsulation

PBR: Policy Based Routing

QoS: Quality of Service

STT: Stateless Transport Tunneling

TCAM: Ternary Content Addressable Memory

VXLAN: Virtual Extensible LAN

1.2. Terminology

Large flow(s): long-lived large flow(s)

Small flow(s): long-lived small flow(s) and short-lived small/large
flow(s)

2. Flow Categorization

In general, based on size and duration, a flow can be categorized
into one of the following four types, as shown in Figure 1:

(a) Short-Lived Large Flow (SLLF),
(b) Short-Lived Small Flow (SLSF),
(c) Long-Lived Large Flow (LLLF), and
(d) Long-Lived Small Flow (LLSF).

       Flow Size
           ^
           |--------------------|--------------------|
           |                    |                    |
     Large |        SLLF        |        LLLF        |
     Flow  |                    |                    |
           |--------------------|--------------------|
           |                    |                    |
     Small |        SLSF        |        LLSF        |
     Flow  |                    |                    |
           +--------------------+--------------------+---> Flow
                 Short-Lived          Long-Lived          Duration
                    Flow                 Flow

                   Figure 1: Flow Categorization

In this document, we categorize long-lived large flow(s) as "Large"
flow(s), and all of the others -- long-lived small flow(s) and
short-lived small/large flow(s) -- as "Small" flow(s).

3. Hash-based Load Distribution in LAG/ECMP

Hashing techniques are often used for traffic load balancing to
select among multiple available paths within a LAG/ECMP group. The
advantages of hash-based load distribution are the preservation of
the packet sequence in a flow and the real-time distribution of
traffic without maintaining per-flow state in the router. Hash-based
techniques use a combination of fields in the packet's headers to
identify a flow, and a hash function over these fields is used to
generate a value that identifies a link/path in the LAG/ECMP group.
The result of the hashing procedure is a many-to-one mapping of flows
to component links.
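As a non-normative illustration, the following Python sketch shows
such a mapping. The hash function used here (MD5 over the
concatenated fields) is an arbitrary stand-in; actual routers
implement vendor-specific hash functions in forwarding hardware.

   import hashlib

   def select_component_link(flow_fields, num_links):
       # Hash the flow-identifying fields (e.g., the IP 5-tuple) and
       # map the result onto one of the component links.  This is
       # the many-to-one mapping described above: all packets of a
       # flow carry the same fields and so take the same link.
       key = "|".join(str(f) for f in flow_fields).encode()
       digest = hashlib.md5(key).digest()
       return int.from_bytes(digest[:4], "big") % num_links

   # Example: a TCP flow's 5-tuple mapped onto a 3-link LAG.
   link = select_component_link(("10.0.0.1", "20.0.0.2", 6, 33512,
                                 443), num_links=3)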
If the traffic mix is such that the result of the hash function
across the flows is fairly uniform (so that a similar number of flows
is mapped to each component link), if the individual flow rates are
much smaller than the link capacity, and if the rate differences
among flows are not dramatic, hash-based algorithms produce good
results with respect to the utilization of the individual component
links. However, if one or more of these conditions are not met,
hash-based techniques may result in unbalanced loads on individual
component links.

One example is illustrated in Figure 2. In Figure 2, there are two
routers, R1 and R2, and there is a LAG between them which has 3
component links (1), (2), (3). There are a total of 10 flows that
need to be distributed across the links in this LAG. The result of
hashing is as follows:

. Component link (1) has 3 flows -- 2 small flows and 1 large
  flow -- and the link utilization is normal.

. Component link (2) has 3 flows -- 3 small flows and no large
  flow -- and the link utilization is light.

  o The absence of any large flow causes the component link to be
    under-utilized.

. Component link (3) has 4 flows -- 2 small flows and 2 large
  flows -- and the link capacity is exceeded, resulting in
  congestion.

  o The presence of 2 large flows causes congestion on this
    component link.

   +-----------+          +-----------+
   |           |  -> ->   |           |
   |           | =====>   |           |
   |        (1)|--/---/---|(1)        |
   |           |          |           |
   |           |          |           |
   |   (R1)    | -> -> -> |   (R2)    |
   |        (2)|--/---/---|(2)        |
   |           |          |           |
   |           |  -> ->   |           |
   |           | =====>   |           |
   |           | =====>   |           |
   |        (3)|--/---/---|(3)        |
   |           |          |           |
   +-----------+          +-----------+

   Where: -> ->  small flows
          =====> large flow

       Figure 2: Unevenly Utilized Component Links

This document presents improved load distribution techniques based
on large flow awareness. The techniques compensate for unbalanced
load distribution resulting from hashing, as demonstrated in the
above example.

4. Mechanisms for Optimal LAG/ECMP Component Link Utilization

The techniques suggested in this draft constitute a local
optimization solution; they are local in the sense that both the
identification of large flows and the re-balancing of the load can be
accomplished completely within individual nodes in the network,
without the need for interaction with other nodes.

This approach may not yield a globally optimal placement of large
flows across multiple nodes in a network, which may be desirable in
some networks. On the other hand, a local approach may be adequate
for some environments for the following reasons:

1) Different links within a network experience different levels of
utilization and, thus, a "targeted" solution is needed for those
hot-spots in the network. An example is the utilization of a LAG
between two routers that needs to be optimized.

2) Some networks may lack end-to-end visibility, e.g. when a certain
network, under the control of a given operator, is a transit network
for traffic from other networks that are not under the control of the
same operator.

4.1. Differences in LAG vs ECMP

While the mechanisms explained herein are applicable to both LAGs and
ECMP groups, it is useful to note that there are some key differences
between the two that may impact how effective the mechanism is. This
relates, in part, to the localized information with which the scheme
is intended to operate.

A LAG is almost always between two adjacent routers. As a result,
the scope of the problem of optimizing the bandwidth utilization on
the component links is fairly narrow. It simply involves re-balancing
the load across the component links between these two routers, and
there is no impact whatsoever to other parts of the network. The
scheme works equally well for unicast and multicast flows.
On the other hand, with ECMP, redistributing the load across
component links that are part of the ECMP group may impact traffic
patterns at all of the nodes that are downstream of the given router
between itself and the destination. The local optimization may
result in congestion at a downstream node. (In its simplest form, an
ECMP group may be used to distribute traffic on component links that
are between two adjacent routers, and in that case, the ECMP group is
no different from a LAG for the purpose of this discussion.)

To demonstrate the limitations of local optimization, consider a
two-level fat-tree topology with three leaf nodes (L1, L2, L3) and
two spine nodes (S1, S2), and assume all of the links are 10 Gbps.
Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one
flow of 7 Gbps, also towards L3. If L1 balances the load optimally
between S1 and S2, and L2 sends its flow via S1, then the downlink
from S1 to L3 would get congested, resulting in packet discards. On
the other hand, if L1 had sent both its flows towards S1 and L2 had
sent its flow towards S2, there would have been no congestion at
either S1 or S2.

The other issue with applying this scheme to ECMP groups is that it
may not apply equally to unicast and multicast traffic because of the
way multicast trees are constructed.

4.2. Overview of the Mechanism

The various steps in achieving optimal LAG/ECMP component link
utilization in networks are detailed below:

Step 1) This involves large flow recognition in routers and
maintaining the mapping of each large flow to the component link that
it uses. The recognition of large flows is explained in Section 4.3.

Step 2) The egress component links are periodically scanned for link
utilization. If the egress component link utilization exceeds a
pre-programmed threshold, an operator alert is generated. The large
flows mapped to the congested egress component link are exported to a
central management entity.

Step 3) On receiving the alert about the congested component link,
the operator, through a central management entity, finds the large
flows mapped to that component link and the LAG/ECMP group to which
the component link belongs.

Step 4) The operator can choose to rebalance the large flows onto
lightly loaded component links of the LAG/ECMP group, or to
redistribute the small flows on the congested link to other component
links of the group. The operator, through a central management
entity, can choose one of the following actions:

1) Indicate specific large flows to rebalance;

2) Have the router decide the best large flows to rebalance;

3) Have the router redistribute all the small flows on the
congested link to other component links in the group.

The central management entity conveys the above information to the
router. The load re-balancing options are explained in Section 4.4.

Steps 2) to 4) could be automated if desired.
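As a non-normative sketch of Step 2, the following Python fragment
scans the egress component links of a LAG and exports the large flows
on any congested link. The helper functions and the 90% threshold
are assumptions for the example only, standing in for router
internals and the export channel to the central management entity.

   UTILIZATION_THRESHOLD = 0.90   # pre-programmed threshold (assumed)

   def scan_lag(lag, get_utilization, get_large_flows, send_alert):
       # Invoked periodically.  get_utilization() and
       # get_large_flows() are hypothetical hooks into the router;
       # send_alert() generates the operator alert and exports the
       # large flows mapped to the congested component link.
       for link in lag.component_links:
           if get_utilization(link) > UTILIZATION_THRESHOLD:
               send_alert(lag, link, get_large_flows(link))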
Providing large flow information to a central management entity
provides the capability to further optimize flow distribution with
multi-node visibility. Consider the following example. A router may
have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple
of hops downstream, P1 may be congested while P2 and P3 are
under-utilized, which the local router does not have visibility into.
With the help of a central management entity, the operator could
redistribute some of the flows from P1 to P2 and P3, resulting in a
more optimized flow of traffic.

The techniques described above are especially useful when bundling
links of different bandwidths, e.g. 10 Gbps and 100 Gbps, as
described in [I-D.ietf-rtgwg-cl-requirement].

4.3. Large Flow Recognition

4.3.1. Flow Identification

A flow (large flow or small flow) can be defined as a sequence of
packets for which ordered delivery should be maintained. Flows are
typically identified using one or more fields from the packet header,
from the following list:

. Layer 2: source MAC address, destination MAC address, VLAN ID.

. IP header: IP Protocol, IP source address, IP destination
  address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
  destination port.

. MPLS Labels.

For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow
identification is possible based on inner and/or outer headers. The
above list is not exhaustive. The mechanisms described in this
document are agnostic to the fields that are used for flow
identification.

4.3.2. Criteria for Identifying a Large Flow

From a bandwidth and time duration perspective, in order to identify
large flows we define an observation interval and observe the
bandwidth of the flow over that interval. A flow that exceeds a
certain minimum bandwidth threshold over that observation interval
would be considered a large flow.

The two parameters -- the observation interval, and the minimum
bandwidth threshold over that observation interval -- should be
programmable in a router to facilitate handling of different use
cases and traffic characteristics. For example, a flow which is at or
above 10% of link bandwidth for a time period of at least 1 second
could be declared a large flow [DevoFlow].

In order to avoid excessive churn in the rebalancing, once a flow has
been recognized as a large flow, it should continue to be recognized
as a large flow as long as the traffic received during an observation
interval exceeds some fraction of the bandwidth threshold, for
example 80% of the bandwidth threshold.
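The following non-normative Python sketch captures these recognition
criteria. The 10% recognition threshold and the 80% maintenance
fraction mirror the examples above; in practice, both parameters
would be programmable.

   class LargeFlowDetector:
       def __init__(self, link_bps, interval_s=1.0,
                    recognize_frac=0.10, maintain_frac=0.80):
           self.recognize_bps = recognize_frac * link_bps
           # Hysteresis: a recognized large flow stays recognized
           # while it remains above this lower threshold.
           self.maintain_bps = maintain_frac * self.recognize_bps
           self.interval_s = interval_s
           self.large_flows = set()

       def update(self, flow_id, bytes_in_interval):
           # Called once per observation interval for each flow.
           bps = bytes_in_interval * 8 / self.interval_s
           if flow_id in self.large_flows:
               if bps < self.maintain_bps:
                   self.large_flows.discard(flow_id)
           elif bps >= self.recognize_bps:
               self.large_flows.add(flow_id)
           return flow_id in self.large_flows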
Various techniques to identify a large flow are described below.

4.3.3. Sampling Techniques

A number of routers support sampling techniques such as sFlow
[sFlow-v5, sFlow-LAG], PSAMP [RFC 5475], and NetFlow Sampling
[RFC 3954]. For the purpose of large flow identification, sampling
must be enabled on all of the egress ports in the router where such
measurements are desired.

Using sFlow as an example, processing in an sFlow collector will
provide an approximate indication of the large flows mapping to each
of the component links in each LAG/ECMP group. It is possible to
implement this part of the collector function in the control plane of
the router, reducing dependence on an external management station,
assuming sufficient control plane resources are available.

If egress sampling is not available, ingress sampling can suffice
since the central management entity used by the sampling technique
typically has multi-node visibility and can use the samples from an
immediately downstream node to make measurements for egress traffic
at the local node. This may not be possible if the downstream device
is under the control of a different operator, or if the downstream
device does not support sampling. Alternatively, since sampling
techniques require that the sample be annotated with the packet's
egress port information, ingress sampling may suffice. However, this
means that sampling would have to be enabled on all ports, rather
than only on those ports where such monitoring is desired.

The advantages and disadvantages of sampling techniques are as
follows.

Advantages:

. Supported in most existing routers.

. Requires minimal router resources.

Disadvantages:

. In order to minimize the error inherent in sampling, there is a
  minimum delay for the recognition time of large flows, and in
  the time that it takes to react to this information.

With sampling, the detection of large flows can be done on the order
of one second [DevoFlow].

4.3.4. Automatic Hardware Recognition

Implementations may perform automatic recognition of large flows in
hardware on a router. Since this is done in hardware, it is an inline
solution and would be expected to operate at line rate.

Using automatic hardware recognition of large flows, a faster
indication of large flows mapped to each of the component links in a
LAG/ECMP group is available (as compared to the sampling approach
described above).

The advantages and disadvantages of automatic hardware recognition
are:

Advantages:

. Large flow detection is offloaded to hardware, freeing up
  software resources and possible dependence on an external
  management station.

. As link speeds get higher, sampling rates are typically reduced
  to keep the number of samples manageable, which places a lower
  bound on the detection time. With automatic hardware
  recognition, large flows can be detected in shorter windows on
  higher link speeds since every packet is accounted for in
  hardware [NDTM].

Disadvantages:

. Not supported in many routers.

As mentioned earlier, the observation interval for determining a
large flow and the bandwidth threshold for classifying a flow as a
large flow should be programmable parameters in a router.

The implementation of automatic hardware recognition of large flows
is vendor dependent and beyond the scope of this document.

4.4. Load Re-balancing Options

Below are suggested techniques for load re-balancing. Equipment
vendors should implement all of these techniques and allow the
operator to choose one or more of them based on their applications.

Note that regardless of the method used, perfect re-balancing of
large flows may not be possible since flows arrive and depart at
different times. Also, any flows that are moved from one component
link to another may experience momentary packet reordering.

4.4.1. Alternative Placement of Large Flows

Within a LAG/ECMP group, the member component links with the least
average port utilization are identified. Some large flow(s) from the
heavily loaded component links are then moved to those lightly loaded
member component links using a PBR rule in the ingress processing
element(s) in the routers.

With this approach, only certain large flows are subjected to
momentary flow re-ordering.

When a large flow is moved, this will increase the utilization of the
link that it is moved to, potentially creating unbalanced utilization
once again across the component links. Therefore, when moving large
flows, care must be taken to account for the existing load and what
the future load will be after the large flow has been moved. Further,
the appearance of new large flows may require a rearrangement of the
placement of existing flows.

Consider a case where there is a LAG comprising four 10 Gbps
component links and there are four large flows, each of 1 Gbps. These
flows are each placed on one of the component links. Subsequently, a
fifth large flow of 2 Gbps is recognized, and to maintain equitable
load distribution, it may be necessary to move one of the existing
1 Gbps flows to a different component link. Even then, this would
still result in some imbalance in the utilization across the
component links.
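A simple greedy strategy for such placement is sketched below in
Python; it accounts for the moved flow's own load on both links, as
cautioned above. The input structures are assumptions for the
example, and a real implementation would install the resulting moves
as PBR rules.

   def place_large_flows(large_flows, links, utilization):
       # large_flows: flow_id -> (current_link, rate);
       # utilization: link -> current load.  Returns a list of
       # (flow, from, to) moves, largest flows considered first.
       moves = []
       for flow_id, (src, rate) in sorted(large_flows.items(),
                                          key=lambda kv: -kv[1][1]):
           dst = min(links, key=lambda l: utilization[l])
           # Move only if the destination, with the flow added,
           # would still be less loaded than the source is today.
           if dst != src and utilization[dst] + rate < utilization[src]:
               utilization[src] -= rate
               utilization[dst] += rate
               moves.append((flow_id, src, dst))
       return moves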
4.4.2. Redistributing Small Flows

Some large flows may consume the entire bandwidth of the component
link(s). In this case, it would be desirable for the small flows to
not use the congested component link(s). This can be accomplished in
one of the following ways. This method works on some existing router
hardware; the idea is to prevent, or reduce the probability of, the
small flows hashing into the congested component link(s).

. The LAG/ECMP table is modified to include only non-congested
  component link(s). Small flows hash into this table to be mapped
  to a destination component link. Alternatively, if certain
  component links are heavily loaded but not congested, the output
  of the hash function can be adjusted to account for large flow
  loading on each of the component links.

. The PBR rules for large flows (refer to Section 4.4.1) must
  have strict precedence over the LAG/ECMP table lookup result.

With this approach, the small flows that are moved would be subject
to reordering.

4.4.3. Component Link Protection Considerations

If desired, certain component links may be reserved for link
protection. These reserved component links are not used for any flows
in the absence of any failures. When a component link fails, all the
flows on the failed component link are moved to the reserved
component link(s). The mapping table of large flows to component
links simply replaces the failed component link with the reserved
link. Likewise, the LAG/ECMP hash table replaces the failed component
link with the reserved link.

4.4.4. Load Re-balancing Algorithms

Specific algorithms for the placement of large flows are out of scope
of this document. One possibility is to formulate the problem of
large flow placement as the well-known bin-packing problem and make
use of the various heuristics that are available for that problem
[bin-pack].
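As one non-normative example, the first-fit-decreasing heuristic for
this bin-packing formulation could be sketched as follows in Python,
treating the component links as "bins" of a given capacity; many
other heuristics from [bin-pack] would serve equally well.

   def first_fit_decreasing(flow_rates, link_capacity, num_links):
       # flow_rates: flow_id -> rate.  Returns flow_id -> link index.
       loads = [0.0] * num_links
       placement = {}
       for flow_id, rate in sorted(flow_rates.items(),
                                   key=lambda kv: -kv[1]):
           for link in range(num_links):
               if loads[link] + rate <= link_capacity:
                   break
           else:
               # No link can absorb the flow within capacity; fall
               # back to the least loaded link.
               link = loads.index(min(loads))
           loads[link] += rate
           placement[flow_id] = link
       return placement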
4.4.5. Load Re-balancing Example

Optimal LAG/ECMP component link utilization for the use case in
Figure 2 is depicted below in Figure 3. The large flow rebalancing
explained in Section 4.4 is used. The improved link utilization is as
follows:

. Component link (1) has 3 flows -- 2 small flows and 1 large
  flow -- and the link utilization is normal.

. Component link (2) has 4 flows -- 3 small flows and 1 large
  flow -- and the link utilization is now normal.

. Component link (3) has 3 flows -- 2 small flows and 1 large
  flow -- and the link utilization is now normal.

   +-----------+          +-----------+
   |           |  -> ->   |           |
   |           | =====>   |           |
   |        (1)|--/---/---|(1)        |
   |           |          |           |
   |           | =====>   |           |
   |   (R1)    | -> -> -> |   (R2)    |
   |        (2)|--/---/---|(2)        |
   |           |          |           |
   |           |          |           |
   |           |  -> ->   |           |
   |           | =====>   |           |
   |        (3)|--/---/---|(3)        |
   |           |          |           |
   +-----------+          +-----------+

   Where: -> ->  small flows
          =====> large flow

        Figure 3: Evenly Utilized Component Links

Basically, the use of the mechanisms described in Section 4.4.1
resulted in a rebalancing of flows where one of the large flows on
component link (3), which was previously congested, was moved to
component link (2), which was previously under-utilized.

5. Information Model for Flow Re-balancing

In order to support flow rebalancing in a router from an external
system, the exchange of some information is necessary between the
router and the external system. This section provides an exemplary
information model covering the various components needed for the
purpose. The model is intended to be informational and may be used
as input for the development of a data model.

5.1. Configuration Parameters for Flow Re-balancing

The following parameters are required for the configuration of this
feature:

. Large flow recognition parameters:

  o Observation interval: The observation interval is the time
    period in seconds over which packet arrivals are observed for
    the purpose of large flow recognition.

  o Minimum bandwidth threshold: The minimum bandwidth threshold
    would be configured as a percentage of link speed and
    translated into a number of bytes over the observation
    interval. A flow for which the number of bytes received, for
    a given observation interval, exceeds this number would be
    recognized as a large flow.

  o Minimum bandwidth threshold for large flow maintenance: The
    minimum bandwidth threshold for large flow maintenance is
    used to provide hysteresis for large flow recognition. Once a
    flow is recognized as a large flow, it continues to be
    recognized as a large flow until it falls below this
    threshold. This is also configured as a percentage of link
    speed and is typically lower than the minimum bandwidth
    threshold defined above.

. Imbalance threshold: the difference between the utilization of
  the least utilized and the most utilized component links,
  expressed as a percentage of link speed.

. Rebalancing interval: the minimum amount of time between
  rebalancing events. This parameter ensures that rebalancing is
  not invoked too frequently, as it impacts packet ordering.

These parameters may be configured on a system-wide basis or may
apply to an individual LAG.

5.2. System Configuration and Identification Parameters

. IP address: The IP address of the specific router that the
  feature is being configured on, or that the large flow placement
  is being applied to.

. LAG ID: Identifies the LAG. The LAG ID may be required when
  configuring this feature (to apply a specific set of large flow
  identification parameters to the LAG) and will be required when
  specifying flow placement to achieve the desired rebalancing.

. Component Link ID: Identifies the component link within a LAG.
  This is required when specifying flow placement to achieve the
  desired rebalancing.
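As a non-normative illustration, the parameters in Sections 5.1 and
5.2 could be grouped as follows; the Python field names are
illustrative only and do not constitute a data model.

   from dataclasses import dataclass

   @dataclass
   class RebalancingConfig:
       router_ip: str                    # router being configured
       lag_id: int                       # LAG the parameters apply to
       observation_interval_s: float     # large flow recognition
       min_bw_threshold_pct: float       # % of link speed, recognition
       maintain_bw_threshold_pct: float  # % of link speed, hysteresis
       imbalance_threshold_pct: float    # spread across component links
       rebalancing_interval_s: float     # min time between rebalances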
5.3. Information for Alternative Placement of Large Flows

In cases where large flow recognition is handled by an external
management station (see Section 4.3.3), an information model for
flows is required to allow the import of large flow information to
the router.

The following are some of the elements of the information model for
importing of flows:

. Layer 2: source MAC address, destination MAC address, VLAN ID.

. Layer 3 IP: IP Protocol, IP source address, IP destination
  address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
  destination port.

. MPLS Labels.

This list is not exhaustive. For example, with overlay protocols
such as VXLAN and NVGRE, fields from the outer and/or inner headers
may be specified. In general, all fields in the packet that can be
used in forwarding decisions should be available for use when
importing flow information from an external management station.

The IPFIX information model [RFC 5101] can be leveraged for large
flow identification. The component link ID would be used to specify
the target component link for the flow.

5.4. Information for Redistribution of Small Flows

For small flows, the LAG ID and the component link IDs, along with
the percentage of traffic to be assigned to each component link ID,
are required.

5.5. Export of Flow Information

Exporting large flow information is required when large flow
recognition is being done on a router, but the decision to rebalance
is being made in an external management station. Large flow
information includes the flow identification and the component link
ID that the flow is currently assigned to. Other information such as
flow QoS and bandwidth may be exported too.

The IPFIX information model [RFC 5101] can be leveraged for large
flow identification.

5.6. Monitoring Information

5.6.1. Interface (Link) Utilization

The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and
interface speed (ifSpeed) can be obtained from the Interface table
(ifTable) MIB [RFC 1213].

Since ifInOctets and ifOutOctets are cumulative counters, the link
utilization is computed from the change in the counters over a
polling interval T (in seconds):

Incoming link utilization = (delta ifInOctets * 8) / (T * ifSpeed)

Outgoing link utilization = (delta ifOutOctets * 8) / (T * ifSpeed)

For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273]
can be used.

For further scalability, it is recommended to use the counter push
mechanism in [sFlow-v5] for the interface counters; this would help
avoid counter polling through the MIB interface.

The outgoing link utilization of the component links within a LAG can
be used to compute the imbalance threshold (see Section 5.1) for the
LAG.
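A non-normative sketch of these computations in Python, from two
polls of a cumulative octet counter (counter wrap handling is
omitted for brevity):

   def link_utilization(octets_t0, octets_t1, poll_interval_s,
                        if_speed_bps):
       # Fraction of link capacity used over the polling interval,
       # computed from ifInOctets or ifOutOctets deltas.
       return ((octets_t1 - octets_t0) * 8) / (poll_interval_s *
                                               if_speed_bps)

   def imbalance(component_utilizations):
       # Imbalance across a LAG's component links (see Section 5.1):
       # most utilized minus least utilized.
       return (max(component_utilizations) -
               min(component_utilizations))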
5.6.2. Other Monitoring Information

Additional monitoring information includes:

. Number of times rebalancing was done.

. Time since the last rebalancing event.

6. Operational Considerations

Flows should be re-balanced only when the imbalance in the
utilization across component links exceeds a certain threshold.
Frequent re-balancing to achieve precise equitable utilization across
component links could be counter-productive, as it may result in
moving flows back and forth between the component links, impacting
packet ordering and system stability. This applies regardless of
whether large flows or small flows are re-distributed. It should be
noted that reordering is a concern for TCP flows with even a few
packets because three out-of-order packets would trigger sufficient
duplicate ACKs to the sender, resulting in a retransmission
[RFC 5681].

The operator would have to experiment with various values of the
large flow recognition parameters (minimum bandwidth threshold,
observation interval) and the imbalance threshold across component
links to tune the solution for their environment.

7. IANA Considerations

This memo includes no request to IANA.

8. Security Considerations

This document does not directly impact the security of the Internet
infrastructure or its applications. In fact, it could help if there
is a DOS attack pattern which causes a hash imbalance resulting in
certain LAG/ECMP component links being heavily overloaded with large
flows.

9. Acknowledgements

The authors would like to thank the following individuals for their
review and valuable feedback on earlier versions of this document:
Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian
Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong
Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete Moyer,
Andrew Malis, Dave McDysan, Zhen Cao, and Dan Romascanu.

10. References

10.1. Normative References

10.2. Informative References

[I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al., "Requirements
for MPLS over a Composite Link," September 2013.

[RFC 6790] Kompella, K., et al., "The Use of Entropy Labels in MPLS
Forwarding," November 2012.

[CAIDA] CAIDA Internet Traffic Analysis, http://www.caida.org/home.

[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport,"
draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

[ITCOM] Jo, J., et al., "Internet traffic load balancing using
dynamic hashing with flow volume," SPIE ITCOM, 2002.

[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
Multicast," November 2000.

[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
Algorithm," November 2000.

[RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques for
IP Packet Selection," March 2009.

[RFC 5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
Control," September 2009.

[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," July 2004.

[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
structure," September 2012.

[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version
9," October 2004.

[RFC 5101] Claise, B., "Specification of the IP Flow Information
Export (IPFIX) Protocol for the Exchange of IP Traffic Flow
Information," January 2008.

[RFC 1213] McCloghrie, K., "Management Information Base for Network
Management of TCP/IP-based internets: MIB-II," March 1991.

[RFC 3273] Waldbusser, S., "Remote Network Monitoring Management
Information Base for High Capacity Networks," July 2002.

[DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow
Management for High Performance Enterprise Networks," Proceedings of
ACM SIGCOMM, August 2011.

[NDTM] Estan, C. and G. Varghese, "New directions in traffic
measurement and accounting," Proceedings of ACM SIGCOMM, August 2002.

[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson, "Approximation
Algorithms for Bin-Packing -- An Updated Survey," in Algorithm Design
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini,
Springer-Verlag, 1984.
Appendix A. Internet Traffic Analysis and Load Balancing Simulation

Internet traffic [CAIDA] has been analyzed to obtain flow statistics
such as the number of packets in a flow and the flow duration. The
5-tuple in the packet header (IP addresses, TCP/UDP ports, and IP
protocol) is used for flow identification. The analysis indicates
that < ~2% of the flows take ~30% of the total traffic volume, while
the rest of the flows (> ~98%) contribute ~70% [YONG].

The simulation has shown that, given the Internet traffic pattern,
the hash-based technique does not evenly distribute the flows over
ECMP paths. Some paths may be > 90% loaded while others are < 40%
loaded. The more ECMP paths exist, the more severe the imbalance.
This implies that hash-based distribution can cause some paths to
become congested while other paths are under-utilized [YONG].

The simulation also shows substantial improvement from using the
large-flow-aware hash-based distribution technique described in this
document. Using the same simulated traffic, the improved rebalancing
can achieve < 10% load difference among the paths. This demonstrates
how large-flow-aware hash-based distribution can effectively
compensate for the uneven load balancing caused by hashing and the
traffic characteristics [YONG].

Authors' Addresses

Ram Krishnan
Brocade Communications
San Jose, 95134, USA
Phone: +1-408-406-7890
Email: ramk@brocade.com

Sanjay Khanna
Brocade Communications
San Jose, 95134, USA
Phone: +1-408-333-4850
Email: skhanna@brocade.com

Lucy Yong
Huawei USA
5340 Legacy Drive
Plano, TX 75025, USA
Phone: +1-469-277-5837
Email: lucy.yong@huawei.com

Anoop Ghanwani
Dell
San Jose, CA 95134
Phone: +1-408-571-3228
Email: anoop@alumni.duke.edu

Ning So
Tata Communications
Plano, TX 75082, USA
Phone: +1-972-955-0914
Email: ning.so@tatacommunications.com

Bhumip Khasnabish
ZTE Corporation
New Jersey, 07960, USA
Phone: +1-781-752-8003
Email: bhumip.khasnabish@zteusa.com