OPSAWG                                                       R. Krishnan
Internet Draft                                    Brocade Communications
Intended status: Informational                                   L. Yong
Expires: February 23, 2014                                    Huawei USA
                                                             A. Ghanwani
                                                                    Dell
                                                                 Ning So
                                                     Tata Communications
                                                               S. Khanna
                                                           Cisco Systems
                                                           B. Khasnabish
                                                         ZTE Corporation
                                                         August 23, 2013

       Mechanisms for Optimal LAG/ECMP Component Link Utilization in
                                 Networks

             draft-ietf-opsawg-large-flow-load-balancing-05.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.  This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on February 23, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Abstract

   Demands on networking infrastructure are growing exponentially; the
   drivers are bandwidth-hungry rich media applications, inter-data
   center communications, etc.  In this context, it is important to
   optimally use the bandwidth in wired networks that extensively use
   LAG/ECMP techniques for bandwidth scaling.  This draft explores some
   of the mechanisms useful for achieving this.

Table of Contents

   1. Introduction
      1.1. Acronyms
      1.2. Terminology
   2. Flow Categorization
   3. Hash-based Load Distribution in LAG/ECMP
   4. Mechanisms for Optimal LAG/ECMP Component Link Utilization
      4.1. Differences in LAG vs ECMP
      4.2. Overview of the Mechanism
      4.3. Large Flow Recognition
         4.3.1. Flow Identification
         4.3.2. Criteria for Identifying a Large Flow
         4.3.3. Sampling Techniques
         4.3.4. Automatic Hardware Recognition
      4.4. Load Re-balancing Options
         4.4.1. Alternative Placement of Large Flows
         4.4.2. Redistributing Small Flows
         4.4.3. Component Link Protection Considerations
         4.4.4. Load Re-balancing Algorithms
         4.4.5. Load Re-balancing Example
   5. Information Model for Flow Re-balancing
      5.1. Configuration Parameters for Flow Re-balancing
      5.2. System Configuration and Identification Parameters
      5.3. Information for Alternative Placement of Large Flows
      5.4. Information for Redistribution of Small Flows
      5.5. Export of Flow Information
      5.6. Monitoring Information
         5.6.1. Interface (Link) Utilization
         5.6.2. Other Monitoring Information
   6. Operational Considerations
      6.1. Rebalancing Frequency
      6.2. Handling Route Changes
   7. IANA Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
      10.1. Normative References
      10.2. Informative References

1. Introduction

   Networks extensively use LAG/ECMP techniques for capacity scaling.
   Network traffic can be predominantly categorized into two traffic
   types: long-lived large flows and other flows (which include long-
   lived small flows and short-lived small/large flows).  Stateless
   hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are
   often used to distribute both long-lived large flows and other
   flows over the component links in a LAG/ECMP.  However, the traffic
   may not be evenly distributed over the component links due to the
   traffic pattern.
   This draft describes mechanisms for optimal LAG/ECMP component link
   utilization while using hash-based techniques.  The mechanisms
   comprise the following steps: recognizing long-lived large flows in
   a router, and assigning the long-lived large flows to specific
   LAG/ECMP component links or redistributing other flows when a
   component link on the router is congested.

   It is useful to keep in mind that the typical use case is one where
   the long-lived large flows are those that consume a significant
   amount of bandwidth on a link, e.g. greater than 5% of link
   bandwidth.  The number of such flows would necessarily be fairly
   small, e.g. on the order of tens or hundreds per link.  In other
   words, the number of long-lived large flows is NOT expected to be
   on the order of millions of flows.  Examples of such long-lived
   large flows would be IPsec tunnels in service provider backbones or
   storage backup traffic in data center networks.

1.1. Acronyms

   COTS: Commercial Off-the-shelf

   DOS: Denial of Service

   ECMP: Equal Cost Multi-path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   PBR: Policy Based Routing

   QoS: Quality of Service

   STT: Stateless Transport Tunneling

   TCAM: Ternary Content Addressable Memory

   VXLAN: Virtual Extensible LAN

1.2. Terminology

   Large flow(s): long-lived large flow(s)

   Small flow(s): long-lived small flow(s) and short-lived small/large
   flow(s)

2. Flow Categorization

   In general, based on its size and duration, a flow can be
   categorized into one of the following four types, as shown in
   Figure 1:

   (a) Short-Lived Large Flow (SLLF),
   (b) Short-Lived Small Flow (SLSF),
   (c) Long-Lived Large Flow (LLLF), and
   (d) Long-Lived Small Flow (LLSF).

           Flow Size
               ^
               |--------------------|--------------------|
               |                    |                    |
       Large   |        SLLF        |        LLLF        |
       Flow    |                    |                    |
               |--------------------|--------------------|
               |                    |                    |
       Small   |        SLSF        |        LLSF        |
       Flow    |                    |                    |
               +--------------------+--------------------+---> Flow
                    Short-Lived          Long-Lived          duration
                       Flow                 Flow

                      Figure 1: Flow Categorization

   In this document, we categorize long-lived large flow(s) as "large"
   flow(s), and all of the others -- long-lived small flow(s) and
   short-lived small/large flow(s) -- as "small" flow(s).

3. Hash-based Load Distribution in LAG/ECMP

   Hashing techniques are often used for traffic load balancing to
   select among multiple available paths with LAG/ECMP.  The
   advantages of hash-based load distribution are the preservation of
   the packet sequence in a flow and the real-time distribution of
   traffic without maintaining per-flow state in the router.  Hash-
   based techniques use a combination of fields in the packet's
   headers to identify a flow, and a hash function computed over these
   fields is used to select a link/path in a LAG/ECMP.  The result of
   the hashing procedure is a many-to-one mapping of flows to
   component links.
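   As a non-normative illustration, the following Python sketch shows
   such a many-to-one mapping.  The 5-tuple fields and the CRC-based
   hash are assumptions made for the example, not a recommended
   implementation.

      import zlib

      def select_component_link(flow_tuple, num_links):
          # flow_tuple, e.g. (src_ip, dst_ip, protocol, src_port,
          # dst_port), identifies the flow; all packets of the flow
          # hash to the same component link, preserving packet order.
          key = "|".join(str(field) for field in flow_tuple).encode()
          return zlib.crc32(key) % num_links   # many-to-one mapping

   Because the mapping is oblivious to flow rates, two large flows can
   land on the same component link; this is the imbalance discussed
   next.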
   If the traffic load constitutes flows such that the result of the
   hash function across these flows is fairly uniform (so that a
   similar number of flows is mapped to each component link), if the
   individual flow rates are much smaller than the link capacity, and
   if the rate differences are not dramatic, the hash-based algorithm
   produces good results with respect to utilization of the individual
   component links.  However, if one or more of these conditions are
   not met, hash-based techniques may result in unbalanced loads on
   individual component links.

   One example is illustrated in Figure 2.  In Figure 2, there are two
   routers, R1 and R2, and there is a LAG between them which has 3
   component links (1), (2), (3).  There are a total of 10 flows that
   need to be distributed across the links in this LAG.  The result of
   hashing is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 3 flows -- 3 small flows and no large
     flow -- and the link utilization is light.

     o The absence of any large flow causes the component link to be
       under-utilized.

   . Component link (3) has 4 flows -- 2 small flows and 2 large
     flows -- and the link capacity is exceeded, resulting in
     congestion.

     o The presence of 2 large flows causes congestion on this
       component link.

              +-----------+          +-----------+
              |           | -> ->    |           |
              |           |=====>    |           |
              |        (1)|--/---/---|(1)        |
              |           |          |           |
              |           |          |           |
              |   (R1)    |-> -> ->  |   (R2)    |
              |        (2)|--/---/---|(2)        |
              |           |          |           |
              |           | -> ->    |           |
              |           |=====>    |           |
              |           |=====>    |           |
              |        (3)|--/---/---|(3)        |
              |           |          |           |
              +-----------+          +-----------+

              Where: -> -> small flows
                     ===>  large flow

              Figure 2: Unevenly Utilized Component Links

   This document presents improved load distribution techniques based
   on large flow awareness.  The techniques compensate for the
   unbalanced load distribution resulting from hashing as demonstrated
   in the above example.

4. Mechanisms for Optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft constitute a local
   optimization solution; they are local in the sense that both the
   identification of large flows and the re-balancing of the load can
   be accomplished completely within individual nodes in the network
   without the need for interaction with other nodes.

   This approach may not yield a globally optimal placement of large
   flows across multiple nodes in a network, which may be desirable in
   some networks.  On the other hand, a local approach may be adequate
   for some environments for the following reasons:

   1) Different links within a network experience different levels of
   utilization and, thus, a "targeted" solution is needed for those
   hot-spots in the network.  An example is the utilization of a LAG
   between two routers that needs to be optimized.

   2) Some networks may lack end-to-end visibility, e.g. when a
   certain network, under the control of a given operator, is a
   transit network for traffic from other networks that are not under
   the control of the same operator.

4.1. Differences in LAG vs ECMP

   While the mechanisms explained herein are applicable to both LAGs
   and ECMP groups, it is useful to note that there are some key
   differences between the two that may impact how effective the
   mechanism is.
   This relates, in part, to the localized information with which the
   scheme is intended to operate.

   A LAG is almost always between two adjacent routers.  As a result,
   the scope of the problem of optimizing the bandwidth utilization on
   the component links is fairly narrow.  It simply involves re-
   balancing the load across the component links between these two
   routers, and there is no impact whatsoever to other parts of the
   network.  The scheme works equally well for unicast and multicast
   flows.

   On the other hand, with ECMP, redistributing the load across
   component links that are part of the ECMP group may impact traffic
   patterns at all of the nodes that are downstream of the given
   router between itself and the destination.  The local optimization
   may result in congestion at a downstream node.  (In its simplest
   form, an ECMP group may be used to distribute traffic on component
   links that are between two adjacent routers, and in that case, the
   ECMP group is no different from a LAG for the purpose of this
   discussion.)

   To demonstrate the limitations of local optimization, consider a
   two-level fat-tree topology with three leaf nodes (L1, L2, L3) and
   two spine nodes (S1, S2), and assume all of the links are 10 Gbps.
   Let L1 have two flows of 4 Gbps each towards L3, and let L2 have
   one flow of 7 Gbps also towards L3.  If L1 balances the load
   optimally between S1 and S2, and L2 sends the flow via S1, then the
   downlink from S1 to L3 would get congested, resulting in packet
   discards.  On the other hand, if L1 had sent both its flows towards
   S1 and L2 had sent its flow towards S2, there would have been no
   congestion at either S1 or S2.

   The other issue with applying this scheme to ECMP groups is that it
   may not apply equally to unicast and multicast traffic because of
   the way multicast trees are constructed.

4.2. Overview of the Mechanism

   The various steps in achieving optimal LAG/ECMP component link
   utilization in networks are detailed below:

   Step 1) This involves large flow recognition in routers and
   maintaining the mapping of the large flow to the component link
   that it uses.  The recognition of large flows is explained in
   Section 4.3.

   Step 2) The egress component links are periodically scanned for
   link utilization.  If the egress component link utilization exceeds
   a pre-programmed threshold, an operator alert is generated.  The
   large flows mapped to the congested egress component link are
   exported to a central management entity.

   Step 3) On receiving the alert about the congested component link,
   the operator, through a central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to which
   the component link belongs.

   Step 4) The operator can choose to rebalance the large flows on
   lightly loaded component links of the LAG/ECMP group or
   redistribute the small flows on the congested link to other
   component links of the group.  The operator, through a central
   management entity, can choose one of the following actions:

   1) Indicate specific large flows to rebalance;

   2) Have the router decide the best large flows to rebalance;

   3) Have the router redistribute all the small flows on the
   congested link to other component links in the group.

   The central management entity conveys the above information to the
   router.  The load re-balancing options are explained in Section
   4.4, and a sketch of the scanning performed in Step 2 appears
   below.
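   As a non-normative sketch of Step 2, the following Python fragment
   checks egress component link utilization against the pre-programmed
   threshold.  The helper functions and the threshold value are
   assumptions made for the example.

      UTILIZATION_THRESHOLD = 0.90   # assumed pre-programmed threshold

      def scan_lag(lag, get_utilization, alert_operator, export_flows):
          """Step 2: scan egress component links for utilization; on
          congestion, generate an operator alert and export that
          link's large flows to the central management entity.  This
          routine would be invoked periodically."""
          for link in lag.component_links:
              if get_utilization(link) > UTILIZATION_THRESHOLD:
                  alert_operator(link)
                  export_flows(lag.large_flows_on(link))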
   Steps 2) to 4) could be automated if desired.

   Providing large flow information to a central management entity
   provides the capability to further optimize flow distribution with
   multi-node visibility.  Consider the following example.  A router
   may have 3 ECMP next hops that lead down paths P1, P2, and P3.  A
   couple of hops downstream, P1 may be congested, while P2 and P3 may
   be under-utilized; the local router has no visibility into this.
   With the help of a central management entity, the operator could
   redistribute some of the flows from P1 to P2 and P3, resulting in a
   more optimized flow of traffic.

   The techniques described above are especially useful when bundling
   links of different bandwidths, e.g. 10 Gbps and 100 Gbps, as
   described in [I-D.ietf-rtgwg-cl-requirement].

4.3. Large Flow Recognition

4.3.1. Flow Identification

   A flow (large flow or small flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained.  Flows are
   typically identified using one or more fields from the packet
   header, from the following list:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . IP header: IP protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS labels.

   For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow
   identification is possible based on inner and/or outer headers.
   The above list is not exhaustive.  The mechanisms described in this
   document are agnostic to the fields that are used for flow
   identification.

4.3.2. Criteria for Identifying a Large Flow

   From a bandwidth and time duration perspective, in order to
   identify large flows we define an observation interval and observe
   the bandwidth of the flow over that interval.  A flow that exceeds
   a certain minimum bandwidth threshold over that observation
   interval would be considered a large flow.

   The two parameters -- the observation interval, and the minimum
   bandwidth threshold over that observation interval -- should be
   programmable in a router to facilitate handling of different use
   cases and traffic characteristics.  For example, a flow which is at
   or above 10% of link bandwidth for a time period of at least 1
   second could be declared a large flow [DevoFlow].

   In order to avoid excessive churn in the rebalancing, once a flow
   has been recognized as a large flow, it should continue to be
   recognized as a large flow as long as the traffic received during
   an observation interval exceeds some fraction of the bandwidth
   threshold, for example 80% of the bandwidth threshold.

   Various techniques to identify a large flow are described below.

4.3.3. Sampling Techniques

   A number of routers support sampling techniques such as sFlow
   [sFlow-v5, sFlow-LAG], PSAMP [RFC 5475], and NetFlow Sampling
   [RFC 3954].  For the purpose of large flow identification, sampling
   must be enabled on all of the egress ports in the router where such
   measurements are desired.

   Using sFlow as an example, processing in an sFlow collector will
   provide an approximate indication of the large flows mapping to
   each of the component links in each LAG/ECMP group.  It is possible
   to implement this part of the collector function in the control
   plane of the router, reducing dependence on an external management
   station, assuming sufficient control plane resources are available.
   A sketch of such sample-based recognition follows.
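   As a non-normative illustration combining sample-based measurement
   with the thresholds of Section 4.3.2, the Python sketch below
   scales sampled bytes by the sampling rate to estimate per-flow
   bandwidth.  The sampling rate, link speed, and threshold values are
   assumptions made for the example.

      from collections import defaultdict

      SAMPLING_RATE = 1000                 # assumed 1-in-1000 sampling
      OBSERVATION_INTERVAL_S = 1.0         # observation interval
      LINK_SPEED_BPS = 10e9                # assumed 10 Gbps link
      RECOGNITION_BPS = 0.10 * LINK_SPEED_BPS   # 10% of link bandwidth
      MAINTENANCE_BPS = 0.80 * RECOGNITION_BPS  # 80% hysteresis

      sampled_bytes = defaultdict(int)     # per-flow sampled byte count
      large_flows = set()

      def on_sample(flow_key, packet_length):
          sampled_bytes[flow_key] += packet_length

      def end_of_observation_interval():
          # Evaluate every sampled flow plus every currently large
          # flow; a large flow with no samples this interval falls
          # below the maintenance threshold and is aged out.
          for flow in set(sampled_bytes) | set(large_flows):
              est_bps = (sampled_bytes[flow] * SAMPLING_RATE * 8
                         / OBSERVATION_INTERVAL_S)
              if est_bps >= RECOGNITION_BPS:
                  large_flows.add(flow)      # newly recognized
              elif flow in large_flows and est_bps < MAINTENANCE_BPS:
                  large_flows.discard(flow)  # fell below hysteresis
          sampled_bytes.clear()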
   If egress sampling is not available, ingress sampling can suffice
   since the central management entity used by the sampling technique
   typically has multi-node visibility and can use the samples from an
   immediately downstream node to make measurements for egress traffic
   at the local node.  This may not be possible if the downstream
   device is under the control of a different operator, or if the
   downstream device does not support sampling.  Alternatively, since
   sampling techniques require that the sample be annotated with the
   packet's egress port information, ingress sampling may suffice.
   However, this means that sampling would have to be enabled on all
   ports, rather than only on those ports where such monitoring is
   desired.

   The advantages and disadvantages of sampling techniques are as
   follows.

   Advantages:

   . Supported in most existing routers.

   . Requires minimal router resources.

   Disadvantages:

   . In order to minimize the error inherent in sampling, there is a
     minimum delay in the recognition of large flows, and in the time
     that it takes to react to this information.

   With sampling, the detection of large flows can be done on the
   order of one second [DevoFlow].

4.3.4. Automatic Hardware Recognition

   Implementations may perform automatic recognition of large flows in
   hardware on a router.  Since this is done in hardware, it is an
   inline solution and would be expected to operate at line rate.

   Using automatic hardware recognition of large flows, a faster
   indication of large flows mapped to each of the component links in
   a LAG/ECMP group is available (as compared to the sampling approach
   described above).

   The advantages and disadvantages of automatic hardware recognition
   are:

   Advantages:

   . Large flow detection is offloaded to hardware, freeing up
     software resources and reducing possible dependence on an
     external management station.

   . As link speeds get higher, sampling rates are typically reduced
     to keep the number of samples manageable, which places a lower
     bound on the detection time.  With automatic hardware
     recognition, large flows can be detected in shorter windows on
     higher speed links since every packet is accounted for in
     hardware [NDTM].

   Disadvantages:

   . Not supported in many routers.

   As mentioned earlier, the observation interval for determining a
   large flow and the bandwidth threshold for classifying a flow as a
   large flow should be programmable parameters in a router.

   The implementation of automatic hardware recognition of large flows
   is vendor dependent and beyond the scope of this document.
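   Purely to give a flavor of what such hardware might compute, and
   not as an implementation, the following non-normative Python model
   illustrates a multistage filter in the style of [NDTM]: a flow is
   flagged as large only when its counters in all stages cross the
   byte threshold.  The stage count, table size, and threshold are
   assumptions made for the example.

      import zlib

      STAGES = 4                   # assumed number of filter stages
      TABLE_SIZE = 4096            # assumed counters per stage
      THRESHOLD_BYTES = 125_000_000  # e.g. 10% of 10 Gbps over 1 s

      counters = [[0] * TABLE_SIZE for _ in range(STAGES)]

      def account_packet(flow_key, packet_length):
          """Count the packet; return True if flow_key now exceeds
          the threshold in every stage (flagged as a large flow)."""
          large = True
          for stage in range(STAGES):
              idx = zlib.crc32(repr((stage, flow_key)).encode()) \
                    % TABLE_SIZE
              counters[stage][idx] += packet_length
              large = large and counters[stage][idx] >= THRESHOLD_BYTES
          return large

      # All counters would be reset at the end of each observation
      # interval.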
4.4. Load Re-balancing Options

   Below are suggested techniques for load re-balancing.  Equipment
   vendors should implement all of these techniques and allow the
   operator to choose one or more of them based on their applications.

   Note that regardless of the method used, perfect re-balancing of
   large flows may not be possible since flows arrive and depart at
   different times.  Also, any flows that are moved from one component
   link to another may experience momentary packet reordering.

4.4.1. Alternative Placement of Large Flows

   Within a LAG/ECMP group, the member component links with the least
   average port utilization are identified.  Some large flow(s) from
   the heavily loaded component links are then moved to those lightly-
   loaded member component links using a PBR rule in the ingress
   processing element(s) in the routers.

   With this approach, only certain large flows are subjected to
   momentary flow re-ordering.

   When a large flow is moved, this will increase the utilization of
   the link that it is moved to, potentially creating unbalanced
   utilization once again across the component links.  Therefore, when
   moving large flows, care must be taken to account for the existing
   load and for what the future load will be after the large flow has
   been moved.  Further, the appearance of new large flows may require
   a rearrangement of the placement of existing flows.

   Consider a case where there is a LAG comprising four 10 Gbps
   component links and there are four large flows, each of 1 Gbps.
   These flows are each placed on one of the component links.
   Subsequently, a fifth large flow of 2 Gbps is recognized, and to
   maintain equitable load distribution, it may be necessary to move
   one of the existing 1 Gbps flows to a different component link.
   Even so, this would still result in some imbalance in the
   utilization across the component links.

4.4.2. Redistributing Small Flows

   Some large flows may consume the entire bandwidth of the component
   link(s).  In this case, it would be desirable for the small flows
   to not use the congested component link(s).  This can be
   accomplished in one of the following ways; these methods work on
   some existing router hardware.  The idea is to prevent, or reduce
   the probability of, the small flows hashing into the congested
   component link(s).

   . The LAG/ECMP table is modified to include only non-congested
     component link(s).  Small flows hash into this table to be mapped
     to a destination component link.  Alternatively, if certain
     component links are heavily loaded but not congested, the output
     of the hash function can be adjusted to account for large flow
     loading on each of the component links.

   . The PBR rules for large flows (refer to Section 4.4.1) must have
     strict precedence over the LAG/ECMP table lookup result.

   With this approach, the small flows that are moved would be subject
   to reordering.

4.4.3. Component Link Protection Considerations

   If desired, certain component links may be reserved for link
   protection.  These reserved component links are not used for any
   flows in the absence of any failures.  In the case where the
   component link(s) fail, all the flows on the failed component
   link(s) are moved to the reserved component link(s).  The mapping
   table of large flows to component links simply replaces the failed
   component link with the reserved link.  Likewise, the LAG/ECMP hash
   table replaces the failed component link with the reserved link.

4.4.4. Load Re-balancing Algorithms

   Specific algorithms for placement of large flows are out of scope
   of this document.  One possibility is to formulate the problem for
   large flow placement as the well-known bin-packing problem and make
   use of the various heuristics that are available for that problem
   [bin-pack].  One such heuristic is sketched below.
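   As a non-normative example of one such heuristic (not an algorithm
   mandated by this document), the Python sketch below places
   recognized large flows in decreasing order of rate onto the
   component link with the most remaining headroom.  The flow rates
   and link capacities are assumed inputs.

      def place_large_flows(flow_rates_bps, link_capacities_bps):
          """flow_rates_bps: {flow_id: rate}; link_capacities_bps:
          {link_id: capacity}.  Returns {flow_id: link_id}."""
          load = {link: 0.0 for link in link_capacities_bps}
          placement = {}
          # Place the largest flows first, each onto the link with
          # the most remaining headroom.
          for flow, rate in sorted(flow_rates_bps.items(),
                                   key=lambda kv: -kv[1]):
              link = max(load,
                         key=lambda l: link_capacities_bps[l] - load[l])
              placement[flow] = link
              load[link] += rate
          return placement

   As noted in Section 4.4.1, such a placement may still leave some
   residual imbalance, and newly recognized large flows may require
   existing placements to be revisited.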
4.4.5. Load Re-balancing Example

   Optimal LAG/ECMP component link utilization for the use case in
   Figure 2 is depicted below in Figure 3.  The large flow rebalancing
   explained in Section 4.4 is used.  The improved link utilization is
   as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 4 flows -- 3 small flows and 1 large
     flow -- and the link utilization is now normal.

   . Component link (3) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is now normal.

              +-----------+          +-----------+
              |           | -> ->    |           |
              |           |=====>    |           |
              |        (1)|--/---/---|(1)        |
              |           |          |           |
              |           |=====>    |           |
              |   (R1)    |-> -> ->  |   (R2)    |
              |        (2)|--/---/---|(2)        |
              |           |          |           |
              |           |          |           |
              |           | -> ->    |           |
              |           |=====>    |           |
              |        (3)|--/---/---|(3)        |
              |           |          |           |
              +-----------+          +-----------+

              Where: -> -> small flows
                     ===>  large flow

              Figure 3: Evenly Utilized Component Links

   Basically, the use of the mechanisms described in Section 4.4.1
   resulted in a rebalancing of flows where one of the large flows on
   component link (3), which was previously congested, was moved to
   component link (2), which was previously under-utilized.

5. Information Model for Flow Re-balancing

   In order to support flow rebalancing in a router from an external
   system, the exchange of some information is necessary between the
   router and the external system.  This section provides an exemplary
   information model covering the various components needed for the
   purpose.  The model is intended to be informational and may be used
   as input for the development of a data model.

5.1. Configuration Parameters for Flow Re-balancing

   The following parameters are required for the configuration of this
   feature:

   . Large flow recognition parameters:

     o Observation interval: The observation interval is the time
       period in seconds over which packet arrivals are observed for
       the purpose of large flow recognition.

     o Minimum bandwidth threshold: The minimum bandwidth threshold
       would be configured as a percentage of link speed and
       translated into a number of bytes over the observation
       interval.  A flow for which the number of bytes received, for a
       given observation interval, exceeds this number would be
       recognized as a large flow.

     o Minimum bandwidth threshold for large flow maintenance: The
       minimum bandwidth threshold for large flow maintenance is used
       to provide hysteresis for large flow recognition.  Once a flow
       is recognized as a large flow, it continues to be recognized as
       a large flow until it falls below this threshold.  This is also
       configured as a percentage of link speed and is typically lower
       than the minimum bandwidth threshold defined above.

   . Imbalance threshold: the difference between the utilization of
     the least utilized and the most utilized component links,
     expressed as a percentage of link speed.

   . Rebalancing interval: the minimum amount of time between
     rebalancing events.  This parameter ensures that rebalancing is
     not invoked too frequently, as it impacts frame ordering.

   These parameters may be configured on a system-wide basis or may
   apply to an individual LAG.  A possible grouping of these
   parameters is sketched below.
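   As a non-normative illustration, the parameters above could be
   grouped as in the following Python structure.  The field names and
   default values are assumptions made for the example, not a data
   model defined by this document.

      from dataclasses import dataclass

      @dataclass
      class FlowRebalancingConfig:
          observation_interval_s: float = 1.0     # recognition window
          min_bw_threshold_pct: float = 10.0      # % of link speed
          maintenance_threshold_pct: float = 8.0  # hysteresis; lower
          imbalance_threshold_pct: float = 20.0   # max-min gap
          rebalancing_interval_s: float = 60.0    # min time between
                                                  # rebalancing events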
5.2. System Configuration and Identification Parameters

   . IP address: The IP address of the specific router that the
     feature is being configured on, or that the large flow placement
     is being applied to.

   . LAG ID: Identifies the LAG.  The LAG ID may be required when
     configuring this feature (to apply a specific set of large flow
     identification parameters to the LAG) and will be required when
     specifying flow placement to achieve the desired rebalancing.

   . Component Link ID: Identifies the component link within a LAG.
     This is required when specifying flow placement to achieve the
     desired rebalancing.

5.3. Information for Alternative Placement of Large Flows

   In cases where large flow recognition is handled by an external
   management station (see Section 4.3.3), an information model for
   flows is required to allow the import of large flow information to
   the router.

   The following are some of the elements of the information model for
   importing of flows:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . Layer 3 IP: IP protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS labels.

   This list is not exhaustive.  For example, with overlay protocols
   such as VXLAN and NVGRE, fields from the outer and/or inner headers
   may be specified.  In general, all fields in the packet that can be
   used in forwarding decisions should be available for use when
   importing flow information from an external management station.

   The IPFIX information model [RFC 5101] can be leveraged for large
   flow identification.  The component link ID would be used to
   specify the target component link for the flow.

5.4. Information for Redistribution of Small Flows

   For small flows, the LAG ID and the component link IDs, along with
   the percentage of traffic to be assigned to each component link ID,
   are required.

5.5. Export of Flow Information

   Exporting large flow information is required when large flow
   recognition is being done on a router, but the decision to
   rebalance is being made in an external management station.  Large
   flow information includes the flow identification and the component
   link ID that the flow is currently assigned to.  Other information
   such as flow QoS and bandwidth may be exported too.

   The IPFIX information model [RFC 5101] can be leveraged for large
   flow identification.

5.6. Monitoring Information

5.6.1. Interface (Link) Utilization

   The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and
   interface speed (ifSpeed) can be obtained from the interface table
   (ifTable) MIB [RFC 1213].

   The link utilization can then be computed from the change in the
   counters over a polling interval of T seconds as follows:

   Incoming link utilization = (delta ifInOctets * 8) / (T * ifSpeed)

   Outgoing link utilization = (delta ifOutOctets * 8) / (T * ifSpeed)

   For high speed links, the etherStatsHighCapacityTable MIB
   [RFC 3273] can be used.

   For further scalability, it is recommended to use the counter push
   mechanism in [sFlow-v5] for the interface counters; this would help
   avoid counter polling through the MIB interface.

   The outgoing link utilization of the component links within a LAG
   can be used to compute the imbalance threshold (see Section 5.1)
   for the LAG.

5.6.2. Other Monitoring Information

   Additional monitoring information includes:

   . Number of times rebalancing was done.

   . Time since the last rebalancing event.

6. Operational Considerations

6.1. Rebalancing Frequency

   Flows should be re-balanced only when the imbalance in the
   utilization across component links exceeds a certain threshold, as
   sketched below.
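   As a non-normative illustration of the utilization computation of
   Section 5.6.1 together with the imbalance check, consider the
   following Python sketch; the counter-polling helpers are assumed.

      def link_utilization(octets_now, octets_prev, interval_s,
                           if_speed_bps):
          # Section 5.6.1: utilization over a polling interval of
          # interval_s seconds, as a fraction of link speed.
          return ((octets_now - octets_prev) * 8) / (interval_s *
                                                     if_speed_bps)

      def needs_rebalancing(component_utilizations,
                            imbalance_threshold):
          # Rebalance only when the gap between the most and least
          # utilized component links exceeds the imbalance threshold
          # (see Section 5.1), both as fractions of link speed.
          return (max(component_utilizations) -
                  min(component_utilizations)) > imbalance_threshold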
   Frequent re-balancing to achieve precise equitable utilization
   across component links could be counter-productive, as it may
   result in moving flows back and forth between the component links,
   impacting packet ordering and system stability.  This applies
   regardless of whether large flows or small flows are re-
   distributed.  It should be noted that reordering is a concern for
   TCP flows with even a few packets because three out-of-order
   packets would trigger sufficient duplicate ACKs to the sender,
   resulting in a retransmission [RFC 5681].

   The operator would have to experiment with various values of the
   large flow recognition parameters (minimum bandwidth threshold,
   observation interval) and the imbalance threshold across component
   links to tune the solution for their environment.

6.2. Handling Route Changes

   Large flow rebalancing must be aware of any changes to the FIB.  In
   cases where the next hop of a route no longer points to the LAG, or
   to an ECMP group, any PBR entries added as described in Sections
   4.4.1 and 4.4.2 must be withdrawn in order to avoid the creation of
   forwarding loops.

7. IANA Considerations

   This memo includes no request to IANA.

8. Security Considerations

   This document does not directly impact the security of the Internet
   infrastructure or its applications.  In fact, it could help in the
   case of a DOS attack pattern that causes a hash imbalance,
   resulting in certain LAG/ECMP component links being heavily
   overloaded by large flows.

9. Acknowledgements

   The authors would like to thank the following individuals for their
   review and valuable feedback on earlier versions of this document:
   Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian
   Carpenter, George Yum, Michael Fargano, Michael Bugenhagen,
   Jianrong Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete
   Moyer, Andrew Malis, Dave McDysan, Zhen Cao, and Dan Romascanu.

10. References

10.1. Normative References

10.2. Informative References

   [I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al.,
   "Requirements for MPLS over a Composite Link," September 2013.

   [RFC 6790] Kompella, K., et al., "The Use of Entropy Labels in MPLS
   Forwarding," November 2012.

   [CAIDA] CAIDA Internet Traffic Analysis, http://www.caida.org/home.

   [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport,"
   draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

   [ITCOM] Jo, J., et al., "Internet traffic load balancing using
   dynamic hashing with flow volume," SPIE ITCOM, 2002.

   [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast
   and Multicast," November 2000.

   [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
   Algorithm," November 2000.

   [RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques
   for IP Packet Selection," March 2009.

   [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," July 2004.

   [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
   structure," September 2012.

   [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export
   Version 9," October 2004.

   [RFC 5101] Claise, B., "Specification of the IP Flow Information
   Export (IPFIX) Protocol for the Exchange of IP Traffic Flow
   Information," January 2008.

   [RFC 1213] McCloghrie, K., "Management Information Base for Network
   Management of TCP/IP-based internets: MIB-II," March 1991.
   [RFC 3273] Waldbusser, S., "Remote Network Monitoring Management
   Information Base for High Capacity Networks," July 2002.

   [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow
   Management for High Performance Enterprise Networks," Proceedings
   of ACM SIGCOMM, August 2011.

   [NDTM] Estan, C. and G. Varghese, "New directions in traffic
   measurement and accounting," Proceedings of ACM SIGCOMM, August
   2002.

   [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson,
   "Approximation Algorithms for Bin-Packing -- An Updated Survey," in
   Algorithm Design for Computer System Design, ed. by Ausiello,
   Lucertini, and Serafini, Springer-Verlag, 1984.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

   Internet traffic [CAIDA] has been analyzed to obtain flow
   statistics such as the number of packets in a flow and the flow
   duration.  The five-tuple in the packet header (IP addresses,
   TCP/UDP ports, and IP protocol) is used for flow identification.
   The analysis indicates that < ~2% of the flows take ~30% of total
   traffic volume, while the rest of the flows (> ~98%) contribute
   ~70% [YONG].

   The simulation has shown that, given the Internet traffic pattern,
   the hash-based technique does not evenly distribute the flows over
   ECMP paths.  Some paths may be > 90% loaded while others are < 40%
   loaded.  The more ECMP paths there are, the more severe the
   imbalance.  This implies that hash-based distribution can cause
   some paths to become congested while other paths are underutilized
   [YONG].

   The simulation also shows substantial improvement from using the
   large flow-aware hash-based distribution technique described in
   this document.  Using the same simulated traffic, the improved
   rebalancing can achieve < 10% load differences among the paths.
   This demonstrates that large flow-aware hash-based distribution can
   effectively compensate for the uneven load balancing caused by
   hashing and the traffic characteristics [YONG].

Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-406-7890
   Email: ramk@brocade.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: +1-469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134
   Phone: +1-408-571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +1-972-955-0914
   Email: ning.so@tatacommunications.com

   Sanjay Khanna
   Cisco Systems
   Email: sanjakha@gmail.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +1-781-752-8003
   Email: bhumip.khasnabish@zteusa.com