OPSAWG                                                      R. Krishnan
Internet Draft                                   Brocade Communications
Intended status: Informational                                  L. Yong
Expires: June 26, 2014                                       Huawei USA
December 26, 2013                                           A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                     Tata Communications
                                                              S. Khanna
                                                          Cisco Systems
                                                          B. Khasnabish
                                                         ZTE Corporation

      Mechanisms for Optimal LAG/ECMP Component Link Utilization in
                                Networks

           draft-ietf-opsawg-large-flow-load-balancing-06.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on June 26, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Abstract

   Demands on networking infrastructure are growing exponentially due
   to bandwidth-hungry applications such as rich media applications
   and inter-data-center communications. In this context, it is
   important to optimally use the bandwidth in wired networks that
   extensively use link aggregation groups and equal-cost multi-paths
   as techniques for bandwidth scaling. This draft explores some of
   the mechanisms useful for achieving this.

Table of Contents

   1. Introduction
      1.1. Acronyms
      1.2. Terminology
   2. Flow Categorization
   3. Hash-based Load Distribution in LAG/ECMP
   4. Mechanisms for Optimal LAG/ECMP Component Link Utilization
      4.1. Differences in LAG vs ECMP
      4.2. Overview of the mechanism
      4.3. Large Flow Recognition
         4.3.1. Flow Identification
         4.3.2. Criteria for Identifying a Large Flow
         4.3.3. Sampling Techniques
         4.3.4. Automatic Hardware Recognition
      4.4. Load Re-balancing Options
         4.4.1. Alternative Placement of Large Flows
         4.4.2. Redistributing Small Flows
         4.4.3. Component Link Protection Considerations
         4.4.4. Load Re-balancing Algorithms
         4.4.5. Load Re-balancing Example
   5. Information Model for Flow Re-balancing
      5.1. Configuration Parameters for Flow Re-balancing
      5.2. System Configuration and Identification Parameters
      5.3. Information for Alternative Placement of Large Flows
      5.4. Information for Redistribution of Small Flows
      5.5. Export of Flow Information
      5.6. Monitoring Information
         5.6.1. Interface (link) utilization
         5.6.2. Other monitoring information
   6. Operational Considerations
      6.1. Rebalancing Frequency
      6.2. Handling Route Changes
   7. IANA Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
      10.1. Normative References
      10.2. Informative References

1. Introduction

   Networks extensively use link aggregation groups (LAG) [802.1AX]
   and equal-cost multi-paths (ECMP) [RFC 2991] as techniques for
   capacity scaling. For the problems addressed by this document,
   network traffic can be predominantly categorized into two traffic
   types: long-lived large flows and other flows. These other flows,
   which include long-lived small flows, short-lived small flows, and
   short-lived large flows, are referred to as small flows in this
   document.
   Stateless hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC
   6790] are often used to distribute both large flows and small
   flows over the component links in a LAG/ECMP group. However,
   depending on the traffic pattern, the traffic may not be evenly
   distributed over the component links.

   This draft describes mechanisms for optimal LAG/ECMP component
   link utilization while using hash-based techniques. The mechanisms
   comprise the following steps: recognizing large flows in a router,
   and then either assigning the large flows to specific LAG/ECMP
   component links or redistributing the small flows when a component
   link on the router is congested.

   It is useful to keep in mind that in typical use cases for this
   mechanism the large flows are those that consume a significant
   amount of bandwidth on a link, e.g., greater than 5% of link
   bandwidth. The number of such flows would necessarily be fairly
   small, e.g., on the order of tens or hundreds per LAG/ECMP group.
   In other words, the number of large flows is NOT expected to be on
   the order of millions of flows. Examples of such large flows would
   be IPsec tunnels in service provider backbone networks or storage
   backup traffic in data center networks.

1.1. Acronyms

   COTS: Commercial Off-the-shelf

   DoS: Denial of Service

   ECMP: Equal-Cost Multi-path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   PBR: Policy-Based Routing

   QoS: Quality of Service

   STT: Stateless Transport Tunneling

   TCAM: Ternary Content Addressable Memory

   VXLAN: Virtual Extensible LAN

1.2. Terminology

   Large flow(s): long-lived large flow(s)

   Small flow(s): long-lived small flow(s), short-lived small
   flow(s), and short-lived large flow(s)

2. Flow Categorization

   In general, based on its size and duration, a flow can be
   categorized into one of the following four types, as shown in
   Figure 1:

   (a) Short-lived Large Flow (SLLF),
   (b) Short-lived Small Flow (SLSF),
   (c) Long-lived Large Flow (LLLF), and
   (d) Long-lived Small Flow (LLSF).

      Flow Size
          ^
          |--------------------|--------------------|
          |                    |                    |
   Large  |        SLLF        |        LLLF        |
   Flow   |                    |                    |
          |--------------------|--------------------|
          |                    |                    |
   Small  |        SLSF        |        LLSF        |
   Flow   |                    |                    |
          +--------------------+--------------------+---> Flow Duration
               Short-lived           Long-lived
                  Flow                  Flow

                    Figure 1: Flow Categorization

   In this document, as mentioned earlier, we categorize long-lived
   large flows as "large flows", and all of the others -- long-lived
   small flows, short-lived small flows, and short-lived large flows
   -- as "small flows".

3. Hash-based Load Distribution in LAG/ECMP

   Hashing techniques are often used for traffic load balancing to
   select among multiple available paths within a LAG/ECMP group.
   The advantages of hash-based load distribution are the
   preservation of the packet sequence in a flow and the real-time
   distribution of traffic without maintaining per-flow state in the
   router. Hash-based techniques use a combination of fields in the
   packet's headers to identify a flow, and a hash function over
   these fields is used to generate a number that identifies a
   link/path in the LAG/ECMP group. The result of the hashing
   procedure is a many-to-one mapping of flows to component links.
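   The following is a minimal sketch, in Python, of this style of
   stateless hash-based link selection. The use of the IP/TCP
   5-tuple as the flow key and of a CRC-32 hash are illustrative
   assumptions; actual routers use vendor-specific hash functions
   and field sets.

      import zlib

      def select_component_link(src_ip, dst_ip, proto, sport, dport,
                                num_links):
          # All packets of a flow carry the same 5-tuple, so they all
          # hash to the same component link, preserving the packet
          # sequence within the flow.
          key = "%s|%s|%d|%d|%d" % (src_ip, dst_ip, proto, sport,
                                    dport)
          # Many-to-one mapping of flows to component links.
          return zlib.crc32(key.encode()) % num_links

      # Example: pin a flow to one of 3 component links in a LAG.
      link = select_component_link("10.0.0.1", "10.0.0.2", 6, 12345,
                                   80, 3)

   Because the mapping is stateless and many-to-one, nothing prevents
   several large flows from landing on the same component link; that
   imbalance is examined next.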
   If the traffic mix consists of flows such that the result of the
   hash function across these flows is fairly uniform so that a
   similar number of flows is mapped to each component link, if the
   individual flow rates are much smaller than the link capacity, and
   if the rate differences are not dramatic, the hash-based algorithm
   produces good results with respect to utilization of the
   individual component links. However, if one or more of these
   conditions are not met, hash-based techniques may result in
   unbalanced loads on individual component links.

   One example is illustrated in Figure 2. In Figure 2, there are two
   routers, R1 and R2, and there is a LAG between them which has 3
   component links (1), (2), (3). There are a total of 10 flows that
   need to be distributed across the links in this LAG. The result of
   hashing is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 3 flows -- 3 small flows and no large
     flow -- and the link utilization is light.

     o The absence of any large flow causes the component link to be
       under-utilized.

   . Component link (3) has 4 flows -- 2 small flows and 2 large
     flows -- and the link capacity is exceeded, resulting in
     congestion.

     o The presence of 2 large flows causes congestion on this
       component link.

      +-----------+    ->    +-----------+
      |           |    ->    |           |
      |           |   ===>   |           |
      |        (1)|----------|(1)        |
      |           |    ->    |           |
      |           |    ->    |           |
      |   (R1)    |    ->    |   (R2)    |
      |        (2)|----------|(2)        |
      |           |    ->    |           |
      |           |    ->    |           |
      |           |   ===>   |           |
      |           |   ===>   |           |
      |        (3)|----------|(3)        |
      |           |          |           |
      +-----------+          +-----------+

      Where: ->   small flow
             ===> large flow

            Figure 2: Unevenly Utilized Component Links

   This document presents improved load distribution techniques based
   on large flow awareness. The techniques compensate for the
   unbalanced load distribution resulting from hashing, as
   demonstrated in the above example.

4. Mechanisms for Optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft constitute a local
   optimization solution; they are local in the sense that both the
   identification of large flows and the re-balancing of the load can
   be accomplished completely within individual nodes in the network,
   without the need for interaction with other nodes.

   This approach may not yield a globally optimal placement of large
   flows across multiple nodes in a network, which may be desirable
   in some networks. On the other hand, a local approach may be
   adequate for some environments for the following reasons:

   1) Different links within a network experience different levels of
   utilization and, thus, a "targeted" solution is needed for those
   hot-spots in the network. An example is the utilization of a LAG
   between two routers that needs to be optimized.

   2) Some networks may lack end-to-end visibility, e.g., when a
   certain network, under the control of a given operator, is a
   transit network for traffic from other networks that are not under
   the control of the same operator.

4.1. Differences in LAG vs ECMP

   While the mechanisms explained herein are applicable to both LAGs
   and ECMP groups, it is useful to note that there are some key
   differences between the two that may impact how effective the
   mechanism is.
   This relates, in part, to the localized information with which the
   scheme is intended to operate.

   A LAG is almost always between two adjacent routers. As a result,
   the scope of the problem of optimizing the bandwidth utilization
   on the component links is fairly narrow. It simply involves
   re-balancing the load across the component links between these two
   routers, and there is no impact whatsoever on other parts of the
   network. The scheme works equally well for unicast and multicast
   flows.

   On the other hand, with ECMP, redistributing the load across
   component links that are part of the ECMP group may impact traffic
   patterns at all of the nodes that are downstream of the given
   router between itself and the destination. The local optimization
   may result in congestion at a downstream node. (In its simplest
   form, an ECMP group may be used to distribute traffic on component
   links that are between two adjacent routers, and in that case the
   ECMP group is no different from a LAG for the purpose of this
   discussion.)

   To demonstrate the limitations of local optimization, consider a
   two-level fat-tree topology with three leaf nodes (L1, L2, L3) and
   two spine nodes (S1, S2), and assume all of the links are 10 Gbps.

            +-----+          +-----+
            | S1  |          | S2  |
            +-----+          +-----+
             / \ \            / /\
            /   +---------+  /   \
           /   /         \  \ /   \
          /   /           \ +------+ \
         /   /             \ /      \ \
      +-----+          +-----+          +-----+
      | L1  |          | L2  |          | L3  |
      +-----+          +-----+          +-----+

                  Figure 3: Two-Level Fat-tree

   Let L1 have two flows of 4 Gbps each towards L3, and let L2 have
   one flow of 7 Gbps, also towards L3. If L1 balances the load
   optimally between S1 and S2, and L2 sends the flow via S1, then
   the downlink from S1 to L3 would get congested, resulting in
   packet discards. On the other hand, if L1 had sent both its flows
   towards S1 and L2 had sent its flow towards S2, there would have
   been no congestion at either S1 or S2.

   The other issue with applying this scheme to ECMP groups is that
   it may not apply equally to unicast and multicast traffic because
   of the way multicast trees are constructed.

4.2. Overview of the mechanism

   The various steps in achieving optimal LAG/ECMP component link
   utilization in networks are detailed below:

   Step 1) This involves large flow recognition in routers and
   maintaining the mapping of each large flow to the component link
   that it uses. The recognition of large flows is explained in
   Section 4.3.

   Step 2) The egress component links are periodically scanned for
   link utilization. If the egress component link utilization exceeds
   a pre-programmed threshold, an operator alert is generated. The
   large flows mapped to the congested egress component link are
   exported to a central management entity.

   Step 3) On receiving the alert about the congested component link,
   the operator, through a central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to
   which the component link belongs.

   Step 4) The operator can choose to rebalance the large flows on
   lightly loaded component links of the LAG/ECMP group or to
   redistribute the small flows on the congested link to other
   component links of the group. The operator, through a central
   management entity, can choose one of the following actions:

   1) Indicate specific large flows to rebalance;

   2) Have the router decide the best large flows to rebalance;

   3) Have the router redistribute all the small flows on the
   congested link to other component links in the group.

   The central management entity conveys the above information to the
   router. The load re-balancing options are explained in Section
   4.4.

   Steps 2) to 4) could be automated if desired; a sketch of such
   automation follows.
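   As a rough illustration only, the loop below automates Steps 2) to
   4). Flows and links are plain Python dicts here; a real
   implementation would query hardware counters and install PBR rules
   instead, and the 0.9 congestion threshold is an assumed value.

      CONGESTION_THRESHOLD = 0.9   # fraction of link capacity

      def scan_and_rebalance(component_links):
          # component_links: list of dicts like
          #   {"id": 1, "utilization": 0.95,
          #    "large_flows": [{"key": "f1", "rate": 0.30}]}
          for link in component_links:
              if link["utilization"] < CONGESTION_THRESHOLD:
                  continue
              # Step 2: alert the operator; the large flows on the
              # congested link would also be exported here.
              print("ALERT: component link %d congested" % link["id"])
              # Steps 3/4: move large flows to the least-loaded
              # member link until congestion clears.
              target = min(component_links,
                           key=lambda l: l["utilization"])
              while (link["utilization"] >= CONGESTION_THRESHOLD
                     and link["large_flows"]):
                  flow = link["large_flows"].pop()
                  link["utilization"] -= flow["rate"]
                  target["utilization"] += flow["rate"]
                  target["large_flows"].append(flow)

      links = [
          {"id": 1, "utilization": 0.50, "large_flows": []},
          {"id": 2, "utilization": 0.30, "large_flows": []},
          {"id": 3, "utilization": 0.95,
           "large_flows": [{"key": "f1", "rate": 0.30}]},
      ]
      scan_and_rebalance(links)   # moves f1 from link 3 to link 2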
   Providing large flow information to a central management entity
   provides the capability to further optimize flow distribution with
   multi-node visibility. Consider the following example. A router
   may have three ECMP next hops that lead down paths P1, P2, and P3.
   A couple of hops downstream, P1 may be congested, while P2 and P3
   may be under-utilized -- something the local router has no
   visibility into. With the help of a central management entity, the
   operator could redistribute some of the flows from P1 to P2 and
   P3, resulting in a more optimized flow of traffic.

   The techniques described above are especially useful when bundling
   links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as
   described in [ID.ietf-rtgwg-cl-requirement].

4.3. Large Flow Recognition

4.3.1. Flow Identification

   A flow (large flow or small flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained. Flows are
   typically identified using one or more fields from the packet
   header, for example:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . IP header: IP protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS labels.

   For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow
   identification is possible based on inner and/or outer headers.
   The above list is not exhaustive. The mechanisms described in this
   document are agnostic to the fields that are used for flow
   identification.

   This definition of flows is consistent with that in IPFIX
   [RFC 7011].

4.3.2. Criteria for Identifying a Large Flow

   From a bandwidth and time duration perspective, in order to
   identify large flows we define an observation interval and observe
   the bandwidth of the flow over that interval. A flow that exceeds
   a certain minimum bandwidth threshold over that observation
   interval would be considered a large flow.

   The two parameters -- the observation interval and the minimum
   bandwidth threshold over that observation interval -- should be
   programmable in a router to facilitate handling of different use
   cases and traffic characteristics. For example, a flow which is at
   or above 10% of link bandwidth for a time period of at least 1
   second could be declared a large flow [DevoFlow].

   In order to avoid excessive churn in the rebalancing, once a flow
   has been recognized as a large flow, it should continue to be
   recognized as a large flow as long as the traffic received during
   an observation interval exceeds some fraction of the bandwidth
   threshold, for example, 80% of the bandwidth threshold.
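   A minimal sketch of this recognition logic with hysteresis,
   assuming a 10 Gbps link and the example thresholds above (10% of
   link bandwidth to be recognized, 80% of that threshold to remain
   recognized):

      LINK_SPEED_BPS   = 10 * 10**9            # assumed 10 Gbps link
      OBS_INTERVAL_SEC = 1.0                   # observation interval
      RECOGNIZE_BPS    = 0.10 * LINK_SPEED_BPS
      MAINTAIN_BPS     = 0.80 * RECOGNIZE_BPS  # hysteresis threshold

      def is_large_flow(bytes_in_interval, was_large):
          # A flow becomes large when it crosses the recognition
          # threshold, and stays large until it falls below the
          # lower maintenance threshold.
          rate_bps = bytes_in_interval * 8 / OBS_INTERVAL_SEC
          limit = MAINTAIN_BPS if was_large else RECOGNIZE_BPS
          return rate_bps >= limit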
   Various techniques to identify a large flow are described below.

4.3.3. Sampling Techniques

   A number of routers support sampling techniques such as sFlow
   [sFlow-v5, sFlow-LAG], PSAMP [RFC 5475], and NetFlow Sampling
   [RFC 3954]. For the purpose of large flow identification, sampling
   must be enabled on all of the egress ports in the router where
   such measurements are desired.

   Using sFlow as an example, processing in an sFlow collector will
   provide an approximate indication of the large flows mapping to
   each of the component links in each LAG/ECMP group. It is possible
   to implement this part of the collector function in the control
   plane of the router, reducing the dependence on an external
   management station, assuming sufficient control plane resources
   are available.

   If egress sampling is not available, ingress sampling can suffice,
   since the central management entity used by the sampling technique
   typically has multi-node visibility and can use the samples from
   an immediately downstream node to make measurements for egress
   traffic at the local node. This may not be possible if the
   downstream device is under the control of a different operator, or
   if the downstream device does not support sampling. Alternatively,
   since sampling techniques require that the sample be annotated
   with the packet's egress port information, ingress sampling may
   suffice. However, this means that sampling would have to be
   enabled on all ports, rather than only on those ports where such
   monitoring is desired.

   The advantages and disadvantages of sampling techniques are as
   follows.

   Advantages:

   . Supported in most existing routers.

   . Requires minimal router resources.

   Disadvantages:

   . In order to minimize the error inherent in sampling, there is a
     minimum delay for the recognition time of large flows, and in
     the time that it takes to react to this information.

   With sampling, the detection of large flows can be done on the
   order of one second [DevoFlow].
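   For illustration, assuming 1-in-N packet sampling, a collector can
   estimate per-flow rates by scaling the sampled byte counts by N
   and then applying the threshold from Section 4.3.2. The sampling
   rate and threshold below are assumed values; a sketch:

      from collections import defaultdict

      SAMPLING_RATE    = 1000           # assumed 1-in-1000 sampling
      OBS_INTERVAL_SEC = 1.0
      THRESHOLD_BPS    = 1 * 10**9      # assumed large flow threshold

      sampled_bytes = defaultdict(int)  # flow key -> sampled bytes

      def record_sample(flow_key, packet_len):
          sampled_bytes[flow_key] += packet_len

      def large_flows():
          # Estimated flow rate = sampled bytes * N * 8 / interval.
          # The scaling error is what forces a minimum observation
          # time before a flow can be declared large.
          return [key for key, nbytes in sampled_bytes.items()
                  if nbytes * SAMPLING_RATE * 8 / OBS_INTERVAL_SEC
                     >= THRESHOLD_BPS]

      # Example: one 1500-byte sample implies ~12 Mb of traffic
      # over the observation interval.
      record_sample("flow-1", 1500)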
4.3.4. Automatic Hardware Recognition

   Implementations may perform automatic recognition of large flows
   in hardware on a router. Since this is done in hardware, it is an
   inline solution and would be expected to operate at line rate.

   Using automatic hardware recognition of large flows, a faster
   indication of large flows mapped to each of the component links in
   a LAG/ECMP group is available (as compared to the sampling
   approach described above).

   The advantages and disadvantages of automatic hardware recognition
   are:

   Advantages:

   . Large flow detection is offloaded to hardware, freeing up
     software resources and removing a possible dependence on an
     external management station.

   . As link speeds get higher, sampling rates are typically reduced
     to keep the number of samples manageable, which places a lower
     bound on the detection time. With automatic hardware
     recognition, large flows can be detected in shorter windows on
     higher-speed links since every packet is accounted for in
     hardware [NDTM].

   Disadvantages:

   . Not supported in many routers.

   As mentioned earlier, the observation interval for determining a
   large flow and the bandwidth threshold for classifying a flow as a
   large flow should be programmable parameters in a router.

   The implementation of automatic hardware recognition of large
   flows is vendor dependent and beyond the scope of this document.

4.4. Load Re-balancing Options

   Below are suggested techniques for load re-balancing. Equipment
   vendors should implement all of these techniques and allow the
   operator to choose one or more of them based on their
   applications.

   Note that regardless of the method used, perfect re-balancing of
   large flows may not be possible since flows arrive and depart at
   different times. Also, any flows that are moved from one component
   link to another may experience momentary packet reordering.

4.4.1. Alternative Placement of Large Flows

   Within a LAG/ECMP group, the member component links with the least
   average port utilization are identified. Some large flow(s) from
   the heavily loaded component links are then moved to those
   lightly loaded member component links using a PBR rule in the
   ingress processing element(s) in the routers.

   With this approach, only certain large flows are subjected to
   momentary flow re-ordering.

   When a large flow is moved, this will increase the utilization of
   the link that it is moved to, potentially creating unbalanced
   utilization once again across the component links. Therefore, when
   moving large flows, care must be taken to account for the existing
   load and what the future load will be after the large flow has
   been moved. Further, the appearance of new large flows may require
   a rearrangement of the placement of existing flows.

   Consider a case where there is a LAG comprising four 10 Gbps
   component links and there are four large flows, each of 1 Gbps.
   These flows are each placed on one of the component links.
   Subsequently, a fifth large flow of 2 Gbps is recognized; to
   maintain equitable load distribution, it may be necessary to move
   one of the existing 1 Gbps flows to a different component link,
   and this would still result in some imbalance in the utilization
   across the component links.

4.4.2. Redistributing Small Flows

   Some large flows may consume the entire bandwidth of the component
   link(s). In this case, it would be desirable for the small flows
   to not use the congested component link(s). The idea is to
   prevent, or reduce the probability of, the small flows hashing
   into the congested component link(s). This method works on some
   existing router hardware and can be accomplished in the following
   ways (a sketch of the first option follows the list):

   . The LAG/ECMP table is modified to include only non-congested
     component link(s). Small flows hash into this table to be mapped
     to a destination component link. Alternatively, if certain
     component links are heavily loaded but not congested, the output
     of the hash function can be adjusted to account for the large
     flow loading on each of the component links.

   . The PBR rules for large flows (refer to Section 4.4.1) must have
     strict precedence over the LAG/ECMP table lookup result.

   With this approach, the small flows that are moved would be
   subject to reordering.
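   A minimal sketch of the first option above: rebuilding the
   selection table so that small flows hash only onto non-congested
   component links. The table size and the representation of the
   congested set are illustrative assumptions.

      def rebuild_lag_table(component_links, congested,
                            table_size=256):
          # Map each hash bucket to an eligible (non-congested)
          # component link; congested links are left out, so no
          # small flow can hash onto them.  Assumes at least one
          # non-congested link remains.
          eligible = [l for l in component_links
                      if l not in congested]
          return [eligible[i % len(eligible)]
                  for i in range(table_size)]

      # Example: with links 1-3 and link 3 congested, small flows
      # are spread over links 1 and 2 only.  Large flow PBR rules
      # take strict precedence over this table lookup.
      table = rebuild_lag_table([1, 2, 3], congested={3})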
4.4.3. Component Link Protection Considerations

   If desired, certain component links may be reserved for link
   protection. These reserved component links are not used for any
   flows in the absence of failures. When a component link fails, all
   the flows on the failed component link are moved to the reserved
   component link(s). The mapping table of large flows to component
   links simply replaces the failed component link with the reserved
   link. Likewise, the LAG/ECMP hash table replaces the failed
   component link with the reserved link.

4.4.4. Load Re-balancing Algorithms

   Specific algorithms for the placement of large flows are out of
   scope of this document. One possibility is to formulate the
   problem of large flow placement as the well-known bin-packing
   problem and make use of the various heuristics that are available
   for that problem [bin-pack].

4.4.5. Load Re-balancing Example

   Optimal LAG/ECMP component utilization for the use case in Figure
   2 is depicted below in Figure 4. The large flow rebalancing
   explained in Section 4.4 is used. The improved link utilization is
   as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal.

   . Component link (2) has 4 flows -- 3 small flows and 1 large
     flow -- and the link utilization is normal now.

   . Component link (3) has 3 flows -- 2 small flows and 1 large
     flow -- and the link utilization is normal now.

      +-----------+    ->    +-----------+
      |           |    ->    |           |
      |           |   ===>   |           |
      |        (1)|----------|(1)        |
      |           |          |           |
      |           |   ===>   |           |
      |           |    ->    |           |
      |           |    ->    |           |
      |   (R1)    |    ->    |   (R2)    |
      |        (2)|----------|(2)        |
      |           |          |           |
      |           |    ->    |           |
      |           |    ->    |           |
      |           |   ===>   |           |
      |        (3)|----------|(3)        |
      |           |          |           |
      +-----------+          +-----------+

      Where: ->   small flow
             ===> large flow

            Figure 4: Evenly Utilized Component Links

   Basically, the use of the mechanisms described in Section 4.4.1
   resulted in a rebalancing of flows where one of the large flows on
   component link (3), which was previously congested, was moved to
   component link (2), which was previously under-utilized.

5. Information Model for Flow Re-balancing

   In order to support flow rebalancing in a router from an external
   system, the exchange of some information is necessary between the
   router and the external system. This section provides an example
   information model covering the various components needed for the
   purpose. The model is intended to be informational and may be used
   as input for the development of a data model.

5.1. Configuration Parameters for Flow Re-balancing

   The following parameters are required for the configuration of
   this feature:

   . Large flow recognition parameters:

     o Observation interval: The observation interval is the time
       period in seconds over which packet arrivals are observed for
       the purpose of large flow recognition.

     o Minimum bandwidth threshold: The minimum bandwidth threshold
       would be configured as a percentage of link speed and
       translated into a number of bytes over the observation
       interval. A flow for which the number of bytes received, for a
       given observation interval, exceeds this number would be
       recognized as a large flow.

     o Minimum bandwidth threshold for large flow maintenance: The
       minimum bandwidth threshold for large flow maintenance is used
       to provide hysteresis for large flow recognition. Once a flow
       is recognized as a large flow, it continues to be recognized
       as a large flow until it falls below this threshold. This is
       also configured as a percentage of link speed and is typically
       lower than the minimum bandwidth threshold defined above.

   . Imbalance threshold: the difference between the utilization of
     the least utilized and the most utilized component links,
     expressed as a percentage of link speed.

   . Rebalancing interval: the minimum amount of time between
     rebalancing events. This parameter ensures that rebalancing is
     not invoked too frequently, as it impacts frame ordering.

   These parameters may be configured on a system-wide basis, or they
   may apply to an individual LAG. One possible grouping of these
   parameters is sketched below.
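   A sketch of these configuration parameters as a simple data
   structure; the field names, defaults, and example values are
   illustrative only, not a defined data model.

      from dataclasses import dataclass

      @dataclass
      class RebalancingConfig:
          observation_interval_sec: float   # large flow recognition
          min_bw_threshold_pct: float       # % of link speed
          maintenance_threshold_pct: float  # hysteresis; lower value
          imbalance_threshold_pct: float    # least vs. most utilized
          rebalancing_interval_sec: float   # min time between events
          lag_id: int = 0                   # 0 = system-wide default

      # Example using the thresholds from Section 4.3.2 and assumed
      # imbalance and rebalancing settings.
      cfg = RebalancingConfig(1.0, 10.0, 8.0, 20.0, 60.0)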
5.2. System Configuration and Identification Parameters

   . IP address: The IP address of the specific router that the
     feature is being configured on, or that the large flow placement
     is being applied to.

   . LAG ID: Identifies the LAG. The LAG ID may be required when
     configuring this feature (to apply a specific set of large flow
     identification parameters to the LAG) and will be required when
     specifying flow placement to achieve the desired rebalancing.

   . Component Link ID: Identifies the component link within a LAG.
     This is required when specifying flow placement to achieve the
     desired rebalancing.

5.3. Information for Alternative Placement of Large Flows

   In cases where large flow recognition is handled by an external
   management station (see Section 4.3.3), an information model for
   flows is required to allow the import of large flow information to
   the router.

   The following are some of the elements of the information model
   for importing of flows:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . Layer 3 IP: IP protocol, IP source address, IP destination
     address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
     destination port.

   . MPLS labels.

   This list is not exhaustive. For example, with overlay protocols
   such as VXLAN and NVGRE, fields from the outer and/or inner
   headers may be specified. In general, all fields in the packet
   that can be used by forwarding decisions should be available for
   use when importing flow information from an external management
   station.

   The IPFIX information model [RFC 7011] can be leveraged for large
   flow identification. The component link ID would be used to
   specify the target component link for the flow.

5.4. Information for Redistribution of Small Flows

   For small flows, the LAG ID and the component link IDs, along with
   the percentage of traffic to be assigned to each component link
   ID, are required.

5.5. Export of Flow Information

   Exporting large flow information is required when large flow
   recognition is being done on a router, but the decision to
   rebalance is being made in an external management station. Large
   flow information includes the flow identification and the
   component link ID that the flow currently is assigned to. Other
   information such as flow QoS and bandwidth may be exported too.

   The IPFIX information model [RFC 7011] can be leveraged for large
   flow identification.

5.6. Monitoring Information

5.6.1. Interface (link) utilization

   The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and
   interface speed (ifSpeed) can be obtained from the interfaces
   table (ifTable) in the MIB [RFC 1213].

   Since these counters are cumulative, the link utilization is
   computed from the change in the counters over a polling interval:

      Incoming link utilization = (delta ifInOctets * 8) /
                                  (interval * ifSpeed)

      Outgoing link utilization = (delta ifOutOctets * 8) /
                                  (interval * ifSpeed)

   where "delta" is the increase in the counter over the polling
   interval and "interval" is the polling interval in seconds.

   For high speed links, the etherStatsHighCapacityTable in the MIB
   of [RFC 3273] can be used.

   For further scalability, it is recommended to use the counter push
   mechanism in [sFlow-v5] for the interface counters; this would
   help avoid counter polling through the MIB interface.

   The outgoing link utilization of the component links within a LAG
   can be used to compute the imbalance (see the imbalance threshold
   in Section 5.1) for the LAG.
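   A small sketch of this utilization computation from two successive
   counter polls:

      def link_utilization(octets_t0, octets_t1, interval_sec,
                           if_speed_bps):
          # Fraction of link capacity used over the polling interval;
          # ifInOctets/ifOutOctets are cumulative, so a delta is
          # taken between the two polls.
          return (octets_t1 - octets_t0) * 8 / (interval_sec *
                                                if_speed_bps)

      # Example: 500 MB sent over 1 second on a 10 Gbps link -> 0.4.
      u = link_utilization(0, 500 * 10**6, 1.0, 10 * 10**9)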
5.6.2. Other monitoring information

   Additional monitoring information that is useful includes:

   . The number of times rebalancing was done.

   . The time since the last rebalancing event.

   . The number of large flows currently rebalanced by the scheme.

   . A list of the large flows that have been rebalanced, including:

     o the rate of each large flow at the time of the last
       rebalancing for that flow,

     o the time that rebalancing was last performed for the given
       large flow, and

     o the interface that the large flow was (re)directed to.

   . The settings for the weights of the interfaces within a LAG/ECMP
     group used by the small flows, which depend on hashing.

6. Operational Considerations

6.1. Rebalancing Frequency

   Flows should be re-balanced only when the imbalance in the
   utilization across component links exceeds a certain threshold.
   Frequent re-balancing to achieve precise equitable utilization
   across component links could be counter-productive, as it may
   result in moving flows back and forth between the component links,
   impacting packet ordering and system stability. This applies
   regardless of whether large flows or small flows are
   re-distributed. It should be noted that reordering is a concern
   for TCP flows with even a few packets, because three out-of-order
   packets would trigger sufficient duplicate ACKs to the sender,
   resulting in a retransmission [RFC 5681].

   The operator would have to experiment with various values of the
   large flow recognition parameters (minimum bandwidth threshold,
   observation interval) and the imbalance threshold across component
   links to tune the solution for their environment.

6.2. Handling Route Changes

   Large flow rebalancing must be aware of any changes to the FIB. In
   cases where the next hop of a route no longer points to the LAG,
   or to an ECMP group, any PBR entries added as described in
   Sections 4.4.1 and 4.4.2 must be withdrawn in order to avoid the
   creation of forwarding loops.

7. IANA Considerations

   This memo includes no request to IANA.

8. Security Considerations

   This document does not directly impact the security of the
   Internet infrastructure or its applications. In fact, it could
   help in the case of a DoS attack pattern that causes a hash
   imbalance, resulting in certain LAG/ECMP component links being
   heavily overloaded by large flows.

9. Acknowledgements

   The authors would like to thank the following individuals for
   their review and valuable feedback on earlier versions of this
   document: Shane Amante, Curtis Villamizar, Fred Baker, Wes George,
   Brian Carpenter, George Yum, Michael Fargano, Michael Bugenhagen,
   Jianrong Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete
   Moyer, Andrew Malis, Dave McDysan, Zhen Cao, Dan Romascanu, and
   Benoit Claise.

10. References

10.1. Normative References

10.2. Informative References

   [802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE
   Standard for Local and Metropolitan Area Networks - Link
   Aggregation", 2008.

   [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson,
   "Approximation Algorithms for Bin-Packing -- An Updated Survey,"
   in Algorithm Design for Computer System Design, ed. by Ausiello,
   Lucertini, and Serafini, Springer-Verlag, 1984.
   [CAIDA] CAIDA Internet Traffic Analysis, http://www.caida.org/home.

   [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow
   Management for High Performance Enterprise Networks," Proceedings
   of ACM SIGCOMM, August 2011.

   [ID.ietf-rtgwg-cl-requirement] Villamizar, C., et al.,
   "Requirements for MPLS over a Composite Link," September 2013.

   [ITCOM] Jo, J., et al., "Internet traffic load balancing using
   dynamic hashing with flow volume," SPIE ITCOM, 2002.

   [NDTM] Estan, C. and G. Varghese, "New directions in traffic
   measurement and accounting," Proceedings of ACM SIGCOMM, August
   2002.

   [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast
   and Multicast," November 2000.

   [RFC 6790] Kompella, K., et al., "The Use of Entropy Labels in
   MPLS Forwarding," November 2012.

   [RFC 1213] McCloghrie, K., "Management Information Base for
   Network Management of TCP/IP-based internets: MIB-II," March 1991.

   [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
   Algorithm," November 2000.

   [RFC 3273] Waldbusser, S., "Remote Network Monitoring Management
   Information Base for High Capacity Networks," July 2002.

   [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export
   Version 9," October 2004.

   [RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques
   for IP Packet Selection," March 2009.

   [RFC 5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
   Control," September 2009.

   [RFC 7011] Claise, B., "Specification of the IP Flow Information
   Export (IPFIX) Protocol for the Exchange of IP Traffic Flow
   Information," September 2013.

   [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
   structure," http://www.sflow.org/sflow_lag.txt, September 2012.

   [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5,"
   http://www.sflow.org/sflow_version_5.txt, July 2004.

   [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport,"
   draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.

Appendix A. Internet Traffic Analysis and Load Balancing Simulation

   Internet traffic [CAIDA] has been analyzed to obtain flow
   statistics such as the number of packets in a flow and the flow
   duration. The five-tuple in the packet header (IP addresses,
   TCP/UDP ports, and IP protocol) is used for flow identification.
   The analysis indicates that < ~2% of the flows take ~30% of the
   total traffic volume, while the rest of the flows (> ~98%)
   contribute ~70% [YONG].

   The simulation has shown that, given the Internet traffic pattern,
   the hash-based technique does not evenly distribute the flows over
   ECMP paths. Some paths may be > 90% loaded while others are < 40%
   loaded. The more ECMP paths there are, the more severe the
   imbalance. This implies that hash-based distribution can cause
   some paths to become congested while other paths are underutilized
   [YONG].

   The simulation also shows substantial improvement from using the
   large-flow-aware hash-based distribution technique described in
   this document. Using the same simulated traffic, the improved
   rebalancing can achieve < 10% load difference among the paths.
   This demonstrates how large-flow-aware hash-based distribution can
   effectively compensate for the uneven load balancing caused by
   hashing and the traffic characteristics [YONG].
Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, 95134, USA
   Phone: +1-408-406-7890
   Email: ramkri123@gmail.com

   Lucy Yong
   Huawei USA
   5340 Legacy Drive
   Plano, TX 75025, USA
   Phone: +1-469-277-5837
   Email: lucy.yong@huawei.com

   Anoop Ghanwani
   Dell
   San Jose, CA 95134
   Phone: +1-408-571-3228
   Email: anoop@alumni.duke.edu

   Ning So
   Tata Communications
   Plano, TX 75082, USA
   Phone: +1-972-955-0914
   Email: ning.so@tatacommunications.com

   Sanjay Khanna
   Cisco Systems
   Email: sanjakha@gmail.com

   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA
   Phone: +1-781-752-8003
   Email: vumip1@gmail.com