OPSAWG                                                      R. Krishnan
Internet Draft                                   Brocade Communications
Intended status: Informational                                  L. Yong
Expires: December 13, 2014                                   Huawei USA
                                                            A. Ghanwani
                                                                   Dell
                                                                Ning So
                                                    Tata Communications
                                                          B. Khasnabish
                                                        ZTE Corporation
                                                          June 13, 2014

    Mechanisms for Optimizing LAG/ECMP Component Link Utilization in
                                Networks

           draft-ietf-opsawg-large-flow-load-balancing-12.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on December 13, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Abstract

Demands on networking infrastructure are growing exponentially due to bandwidth-hungry applications such as rich media applications and inter-data-center communications. In this context, it is important to optimally use the bandwidth in wired networks that extensively use link aggregation groups and equal-cost multi-paths as techniques for bandwidth scaling. This draft explores some of the mechanisms useful for achieving this.

Table of Contents

   1. Introduction
      1.1. Acronyms
      1.2. Terminology
   2. Flow Categorization
   3. Hash-based Load Distribution in LAG/ECMP
   4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization
      4.1. Differences in LAG vs ECMP
      4.2. Operational Overview
      4.3. Large Flow Recognition
         4.3.1. Flow Identification
         4.3.2. Criteria and Techniques for Large Flow Recognition
         4.3.3. Sampling Techniques
         4.3.4. Inline Data Path Measurement
         4.3.5. Use of More Than One Method for Large Flow Recognition
      4.4. Load Rebalancing Options
         4.4.1. Alternative Placement of Large Flows
         4.4.2. Redistributing Small Flows
         4.4.3. Component Link Protection Considerations
         4.4.4. Load Rebalancing Algorithms
         4.4.5. Load Rebalancing Example
   5. Information Model for Flow Rebalancing
      5.1. Configuration Parameters for Flow Rebalancing
      5.2. System Configuration and Identification Parameters
      5.3. Information for Alternative Placement of Large Flows
      5.4. Information for Redistribution of Small Flows
      5.5. Export of Flow Information
      5.6. Monitoring Information
         5.6.1. Interface (Link) Utilization
         5.6.2. Other Monitoring Information
   6. Operational Considerations
      6.1. Rebalancing Frequency
      6.2. Handling Route Changes
      6.3. Forwarding Resources
   7. IANA Considerations
   8. Security Considerations
   9. Contributing Authors
   10. Acknowledgements
   11. References
      11.1. Normative References
      11.2. Informative References
1. Introduction

Networks extensively use link aggregation groups (LAG) [802.1AX] and equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity scaling. For the problems addressed by this document, network traffic can be predominantly categorized into two traffic types: long-lived large flows and other flows. These other flows, which include long-lived small flows, short-lived small flows, and short-lived large flows, are referred to as "small flows" in this document. Long-lived large flows are simply referred to as "large flows."

Stateless hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used to distribute both large flows and small flows over the component links in a LAG/ECMP. However, the traffic may not be evenly distributed over the component links due to the traffic pattern.

This draft describes mechanisms for optimizing LAG/ECMP component link utilization while using hash-based techniques. The mechanisms comprise two steps: recognizing large flows in a router, and assigning the large flows to specific LAG/ECMP component links or redistributing the small flows when a component link on the router is congested.

It is useful to keep in mind that in typical use cases for this mechanism the large flows are those that consume a significant amount of bandwidth on a link, e.g., greater than 5% of link bandwidth. The number of such flows would necessarily be fairly small, e.g., on the order of tens or hundreds per LAG/ECMP. In other words, the number of large flows is NOT expected to be on the order of millions of flows. Examples of such large flows would be IPsec tunnels in service provider backbone networks or storage backup traffic in data center networks.

1.1. Acronyms

   DOS: Denial of Service

   ECMP: Equal Cost Multi-path

   GRE: Generic Routing Encapsulation

   LAG: Link Aggregation Group

   MPLS: Multiprotocol Label Switching

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   PBR: Policy Based Routing

   QoS: Quality of Service

   STT: Stateless Transport Tunneling

   TCAM: Ternary Content Addressable Memory

   VXLAN: Virtual Extensible LAN

1.2. Terminology

Central management entity: Refers to an entity that is capable of monitoring information about link utilization and flows in routers across the network and may be capable of making traffic engineering decisions for placement of large flows. It may include the functions of a collector if the routers employ a sampling technique [RFC 7011].

ECMP component link: An individual nexthop within an ECMP group. An ECMP component link may itself comprise a LAG.

ECMP table: A table that is used as the nexthop of an ECMP route and that comprises the set of component links and the weights associated with each of those component links. The weights are used to determine which values of the hash function map to a given component link.

LAG component link: An individual link within a LAG. A LAG component link is typically a physical link.

LAG table: A table that is used as the output port, which is a LAG, and that comprises the set of component links and the weights associated with each of those component links. The weights are used to determine which values of the hash function map to a given component link.

Large flow(s): Refers to long-lived large flow(s).
Small flow(s): Refers to any of, or a combination of, long-lived small flow(s), short-lived small flow(s), and short-lived large flow(s).

2. Flow Categorization

In general, based on the size and duration, a flow can be categorized into any one of the following four types, as shown in Figure 1:

   (a) Short-lived Large Flow (SLLF),
   (b) Short-lived Small Flow (SLSF),
   (c) Long-lived Large Flow (LLLF), and
   (d) Long-lived Small Flow (LLSF).

      Flow Size
          ^
          |--------------------|--------------------|
          |                    |                    |
    Large |       SLLF         |       LLLF         |
    Flow  |                    |                    |
          |--------------------|--------------------|
          |                    |                    |
    Small |       SLSF         |       LLSF         |
    Flow  |                    |                    |
          +--------------------+--------------------+-->Flow Duration
               Short-lived          Long-lived
                  Flow                 Flow

                  Figure 1: Flow Categorization

In this document, as mentioned earlier, we categorize long-lived large flows as "large flows", and all of the others -- long-lived small flows, short-lived small flows, and short-lived large flows -- as "small flows".

3. Hash-based Load Distribution in LAG/ECMP

Hash-based techniques are often used for traffic load balancing to select among multiple available paths within a LAG/ECMP group. The advantages of hash-based techniques for load distribution are the preservation of the packet sequence in a flow and the real-time distribution without maintaining per-flow state in the router. Hash-based techniques use a combination of fields in the packet's headers to identify a flow, and the hash function computed over these fields is used to generate a unique number that identifies a link/path in a LAG/ECMP group. The result of the hashing procedure is a many-to-one mapping of flows to component links.

If the traffic mix constitutes flows such that the result of the hash function across these flows is fairly uniform so that a similar number of flows is mapped to each component link, if the individual flow rates are much smaller as compared to the link capacity, and if the rate differences are not dramatic, hash-based techniques produce good results with respect to utilization of the individual component links. However, if one or more of these conditions are not met, hash-based techniques may result in an imbalance in the loads on individual component links.

One example is illustrated in Figure 2. In Figure 2, there are two routers, R1 and R2, and there is a LAG between them which has 3 component links (1), (2), (3). There are a total of 10 flows that need to be distributed across the links in this LAG. The result of applying the hash-based technique is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large flow -- and the link utilization is normal.

   . Component link (2) has 3 flows -- 3 small flows and no large flow -- and the link utilization is light.

      o The absence of any large flow causes the component link to be under-utilized.

   . Component link (3) has 4 flows -- 2 small flows and 2 large flows -- and the link capacity is exceeded, resulting in congestion.

      o The presence of 2 large flows causes congestion on this component link.

       +-----------+    ->     +-----------+
       |           |    ->     |           |
       |           |   ===>    |           |
       |        (1)|-----------|(1)        |
       |           |    ->     |           |
       |           |    ->     |           |
       |   (R1)    |    ->     |   (R2)    |
       |        (2)|-----------|(2)        |
       |           |    ->     |           |
       |           |    ->     |           |
       |           |   ===>    |           |
       |           |   ===>    |           |
       |        (3)|-----------|(3)        |
       |           |           |           |
       +-----------+           +-----------+

          Where: ->   small flow
                 ===> large flow

          Figure 2: Unevenly Utilized Component Links

This document presents mechanisms for addressing the imbalance in load distribution resulting from commonly used hash-based techniques for LAG/ECMP that were shown in the above example. The mechanisms use large flow awareness to compensate for the imbalance in load distribution.
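The behavior described above can be made concrete with a short, non-normative sketch (in Python). The 5-tuple key and the CRC32-based hash below are assumptions for illustration; actual implementations use vendor-specific fields and hash functions.

   # Illustrative sketch of stateless hash-based distribution over the
   # component links of a LAG/ECMP group.  The 5-tuple key and CRC32
   # hash are assumptions for illustration, not a mandated algorithm.
   import zlib

   def component_link(flow_tuple, num_links):
       """Map a flow's header fields to one of num_links component links."""
       key = "|".join(str(field) for field in flow_tuple).encode()
       return zlib.crc32(key) % num_links

   # The mapping is many-to-one and is blind to flow rates: a large flow
   # and several small flows may land on the same component link, which
   # is the source of the imbalance shown in Figure 2.
   for flow in [("10.0.0.1", "10.0.1.1", 6, 12345, 80),
                ("10.0.0.2", "10.0.1.1", 6, 23456, 80)]:
       print(flow, "-> component link", component_link(flow, 3))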
4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization

The suggested mechanisms in this draft constitute a local optimization solution; they are local in the sense that both the identification of large flows and the re-balancing of the load can be accomplished completely within individual nodes in the network without the need for interaction with other nodes.

This approach may not yield a global optimization of the placement of large flows across multiple nodes in a network, which may be desirable in some networks. On the other hand, a local approach may be adequate for some environments for the following reasons:

1) Different links within a network experience different levels of utilization and, thus, a "targeted" solution is needed for those hot-spots in the network. An example is the utilization of a LAG between two routers that needs to be optimized.

2) Some networks may lack end-to-end visibility, e.g., when a certain network, under the control of a given operator, is a transit network for traffic from other networks that are not under the control of the same operator.

4.1. Differences in LAG vs ECMP

While the mechanisms explained herein are applicable to both LAGs and ECMP groups, it is useful to note that there are some key differences between the two that may impact how effective the mechanism is. This relates, in part, to the localized information with which the scheme is intended to operate.

A LAG is usually established across links that are between 2 adjacent routers. As a result, the scope of the problem of optimizing the bandwidth utilization on the component links is fairly narrow. It simply involves re-balancing the load across the component links between these two routers, and there is no impact whatsoever to other parts of the network. The scheme works equally well for unicast and multicast flows.

On the other hand, with ECMP, redistributing the load across component links that are part of the ECMP group may impact traffic patterns at all of the nodes that are downstream of the given router between itself and the destination. The local optimization may result in congestion at a downstream node. (In its simplest form, an ECMP group may be used to distribute traffic on component links that are between two adjacent routers, and in that case, the ECMP group is no different than a LAG for the purpose of this discussion. It should be noted that an ECMP component link may itself comprise a LAG, in which case the scheme may be further applied to the component links within the LAG.)
       +-----+          +-----+
       | S1  |          | S2  |
       +-----+          +-----+

   +-----+     +-----+     +-----+
   | L1  |     | L2  |     | L3  |
   +-----+     +-----+     +-----+

   (Each leaf node L1, L2, and L3 has one link to each of
   the spine nodes S1 and S2; the six links are omitted
   from the drawing for clarity.)

          Figure 3: Two-level Clos Network

To demonstrate the limitations of local optimization, consider a two-level Clos network topology as shown in Figure 3, with three leaf nodes (L1, L2, L3) and two spine nodes (S1, S2). Assume all of the links are 10 Gbps.

Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one flow of 7 Gbps also towards L3. If L1 balances the load optimally between S1 and S2, and L2 sends the flow via S1, then the downlink from S1 to L3 would get congested, resulting in packet discards. On the other hand, if L1 had sent both its flows towards S1 and L2 had sent its flow towards S2, there would have been no congestion at either S1 or S2.

The other issue with applying this scheme to ECMP groups is that it may not apply equally to unicast and multicast traffic because of the way multicast trees are constructed.

Finally, it is possible for a single physical link to participate as a component link in multiple ECMP groups, whereas with LAGs, a link can participate as a component link of only one LAG.

4.2. Operational Overview

The various steps in optimizing LAG/ECMP component link utilization in networks are detailed below:

Step 1) This involves large flow recognition in routers and maintaining the mapping of each large flow to the component link that it uses. The recognition of large flows is explained in Section 4.3.

Step 2) The egress component links are periodically scanned for link utilization, and the imbalance for the LAG/ECMP group is monitored. If the imbalance exceeds a certain imbalance threshold, then re-balancing is triggered. Measurement of the imbalance is discussed further in Section 5.1. Additional criteria, such as the maximum utilization of any of the component links, may also be used to determine whether or not to trigger rebalancing. (A sketch of this step appears at the end of this section.)

Step 3) As a part of rebalancing, the operator can choose to rebalance the large flows onto lightly loaded component links of the LAG/ECMP group, redistribute the small flows on the congested link to other component links of the group, or use a combination of both.

All of the steps identified above can be done locally within the router itself or could involve the use of a central management entity.

Providing large flow information to a central management entity provides the capability to globally optimize flow distribution as described in Section 4.1. Consider the following example. A router may have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple of hops downstream on path P1 there may be a congested link, while paths P2 and P3 may be under-utilized. This is something that the local router does not have visibility into. With the help of a central management entity, the operator could redistribute some of the flows from P1 to P2 and/or P3, resulting in a more optimized flow of traffic.

The mechanisms described above are especially useful when bundling links of different bandwidths, e.g., 10 Gbps and 100 Gbps, as described in [ID.ietf-rtgwg-cl-requirement].
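As a non-normative illustration of Step 2, the following sketch (in Python) shows a periodic scan that triggers rebalancing when the imbalance exceeds a threshold and the minimum rebalancing interval has elapsed. The parameter values and the polling structure are assumptions for illustration; the imbalance computation and its parameters are defined in Section 5.1.

   # Sketch of Step 2: periodic monitoring with a rebalancing trigger.
   # Threshold and interval values are examples, not recommendations.
   import time

   IMBALANCE_THRESHOLD  = 0.20   # see Section 5.1
   REBALANCING_INTERVAL = 60.0   # minimum seconds between rebalancing

   last_rebalance = float("-inf")

   def scan_group(get_utilizations, rebalance):
       """get_utilizations() returns {component_link_id: utilization}."""
       global last_rebalance
       utils = get_utilizations()
       # Equal-speed links are assumed here; Section 5.1 gives the
       # speed-weighted mean for links of different speeds.
       u_ave = sum(utils.values()) / len(utils)
       imbalance = max(abs(u - u_ave) for u in utils.values())
       if (imbalance > IMBALANCE_THRESHOLD and
               time.time() - last_rebalance > REBALANCING_INTERVAL):
           rebalance()               # Step 3: move large or small flows
           last_rebalance = time.time()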
4.3. Large Flow Recognition

4.3.1. Flow Identification

A flow (large flow or small flow) can be defined as a sequence of packets for which ordered delivery should be maintained. Flows are typically identified using one or more fields from the packet header, for example:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . IP header: IP protocol, IP source address, IP destination address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination port.

   . MPLS labels.

For tunneling protocols like Generic Routing Encapsulation (GRE) [RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN], Network Virtualization using Generic Routing Encapsulation (NVGRE) [NVGRE], Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling Protocol (L2TP) [RFC 3931], etc., flow identification is possible based on inner and/or outer headers as well as fields introduced by the tunnel header, as any or all such fields may be used for load balancing decisions [RFC 5640]. The above list is not exhaustive. The mechanisms described in this document are agnostic to the fields that are used for flow identification.

This method of flow identification is consistent with that of IPFIX [RFC 7011].

4.3.2. Criteria and Techniques for Large Flow Recognition

From a bandwidth and time duration perspective, in order to recognize large flows we define an observation interval and observe the bandwidth of the flow over that interval. A flow that exceeds a certain minimum bandwidth threshold over that observation interval would be considered a large flow.

The two parameters -- the observation interval and the minimum bandwidth threshold over that observation interval -- should be programmable to facilitate handling of different use cases and traffic characteristics. For example, a flow which is at or above 10% of link bandwidth for a time period of at least 1 second could be declared a large flow [DevoFlow].

In order to avoid excessive churn in the rebalancing, once a flow has been recognized as a large flow, it should continue to be recognized as a large flow for as long as the traffic received during an observation interval exceeds some fraction of the bandwidth threshold, for example 80% of the bandwidth threshold.
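The following non-normative sketch (in Python) illustrates the recognition criterion and the hysteresis described above. The parameter values are the examples given in this section; the per-flow byte counts would come from one of the measurement techniques described next.

   # Sketch of large flow recognition with hysteresis (Section 4.3.2).
   OBSERVATION_INTERVAL = 1.0    # seconds (example value)
   RECOGNITION_FRACTION = 0.10   # 10% of link bandwidth (example value)
   MAINTENANCE_FRACTION = 0.08   # 80% of the recognition threshold

   def update_large_flows(flow_bytes, link_speed_bps, large_flows):
       """flow_bytes: {flow_id: bytes received in the last observation
       interval}; large_flows: set of currently recognized large flows."""
       recognize = RECOGNITION_FRACTION * link_speed_bps * OBSERVATION_INTERVAL / 8
       maintain  = MAINTENANCE_FRACTION * link_speed_bps * OBSERVATION_INTERVAL / 8
       for flow, nbytes in flow_bytes.items():
           if flow in large_flows:
               if nbytes < maintain:         # fell below the maintenance
                   large_flows.discard(flow) # threshold; demote the flow
           elif nbytes >= recognize:         # newly recognized large flow
               large_flows.add(flow)
       return large_flows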
Various techniques to recognize a large flow are described below.

4.3.3. Sampling Techniques

A number of routers support sampling techniques such as sFlow [sFlow-v5, sFlow-LAG], PSAMP [RFC 5475], and NetFlow Sampling [RFC 3954]. For the purpose of large flow recognition, sampling needs to be enabled on all of the egress ports in the router where such measurements are desired.

Using sFlow as an example, processing in an sFlow collector will provide an approximate indication of the large flows mapping to each of the component links in each LAG/ECMP group. It is possible to implement this part of the collector function in the control plane of the router, reducing dependence on an external management station, assuming sufficient control plane resources are available.

If egress sampling is not available, ingress sampling can suffice since the central management entity used by the sampling technique typically has multi-node visibility and can use the samples from an immediately downstream node to make measurements for egress traffic at the local node.

The option of using ingress sampling for this purpose may not be available if the downstream device is under the control of a different operator, or if the downstream device does not support sampling.

Alternatively, because sampling techniques annotate each sample with the packet's egress port information, ingress sampling at the local router may suffice. However, this means that sampling would have to be enabled on all ports, rather than only on those ports where such monitoring is desired. There is one situation in which this approach may not work. If there are tunnels that originate from the given router, and if the resulting tunnel comprises the large flow, then this cannot be deduced from ingress sampling at the given router. Instead, if egress sampling is unavailable, then ingress sampling from the downstream router must be used.

To illustrate the use of ingress versus egress sampling, we refer to Figure 2. Since we are looking at rebalancing flows at R1, we would need to enable egress sampling on ports (1), (2), and (3) on R1. If egress sampling is not available, and if R2 is also under the control of the same administrator, enabling ingress sampling on R2's ports (1), (2), and (3) would also work, but it would necessitate the involvement of a central management entity in order for R1 to obtain large flow information for each of its links. Finally, R1 can enable ingress sampling on all of its ports (not just the ports that are part of the LAG/ECMP group being monitored), and that would suffice if the sampling technique annotates the samples with the egress port information.

The advantages and disadvantages of sampling techniques are as follows.

Advantages:

   . Supported in most existing routers.

   . Requires minimal router resources.

Disadvantages:

   . In order to minimize the error inherent in sampling, there is a minimum delay for the recognition time of large flows, and in the time that it takes to react to this information.

With sampling, the detection of large flows can be done on the order of one second [DevoFlow]. A discussion on determining the appropriate sampling frequency is available in [SAMP-BASIC].
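Sampled data can be turned into per-flow rate estimates with the standard 1-in-N scaling, as in the following non-normative sketch (in Python); the data structures are assumptions for illustration.

   # Sketch of estimating per-flow rates from 1-in-N packet samples
   # (e.g., sFlow).  Each sampled packet is taken to represent N packets
   # of the same flow and size, the usual estimator for random sampling.
   from collections import defaultdict

   def estimate_flow_rates(samples, sampling_rate_n, interval_seconds):
       """samples: iterable of (flow_id, packet_length_in_bytes)."""
       est_bytes = defaultdict(int)
       for flow_id, pkt_len in samples:
           est_bytes[flow_id] += pkt_len * sampling_rate_n
       # Estimated rate in bits per second for each flow; flows whose
       # estimate exceeds the threshold of Section 4.3.2 are large flows.
       return {f: b * 8 / interval_seconds for f, b in est_bytes.items()}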
4.3.4. Inline Data Path Measurement

Implementations may perform recognition of large flows by performing measurements on traffic in the data path of a router. Such an approach would be expected to operate at the interface speed on every interface, accounting for all packets processed by the data path of the router. An example of such an approach is described in IPFIX [RFC 5470].

Using inline data path measurement, a faster and more accurate indication of large flows mapped to each of the component links in a LAG/ECMP group may be possible (as compared to the sampling-based approach).

The advantages and disadvantages of inline data path measurement are:

Advantages:

   . As link speeds get higher, sampling rates are typically reduced to keep the number of samples manageable, which places a lower bound on the detection time. With inline data path measurement, large flows can be recognized in shorter windows on higher link speeds since every packet is accounted for [NDTM].

   . Eliminates the potential dependence on an external management station for large flow recognition.

Disadvantages:

   . It is more resource intensive in terms of the table sizes required for monitoring all flows in order to perform the measurement.

As mentioned earlier, the observation interval for determining a large flow and the bandwidth threshold for classifying a flow as a large flow should be programmable parameters in a router.

The implementation details of inline data path measurement of large flows are vendor dependent and beyond the scope of this document.

4.3.5. Use of More Than One Method for Large Flow Recognition

It is possible that a router may have line cards that support a sampling technique while other line cards support inline data path measurement of large flows. As long as there is a way for the router to reliably determine the mapping of large flows to component links of a LAG/ECMP group, it is acceptable for the router to use more than one method for large flow recognition.

If both methods are supported, inline data path measurement may be preferable because of its speed of detection [FLOW-ACC].

4.4. Load Rebalancing Options

Below are suggested techniques for load rebalancing. Equipment vendors may implement more than one technique, including those not described in this document, allowing the operator to choose between them.

Note that regardless of the method used, perfect rebalancing of large flows may not be possible since flows arrive and depart at different times. Also, any flows that are moved from one component link to another may experience momentary packet reordering.

4.4.1. Alternative Placement of Large Flows

Within a LAG/ECMP group, the member component links with the least average port utilization are identified. Some large flow(s) from the heavily loaded component links are then moved to those lightly loaded member component links using a policy-based routing (PBR) rule in the ingress processing element(s) in the routers. (A sketch of this placement step appears at the end of this subsection.)

With this approach, only certain large flows are subjected to momentary flow re-ordering.

When a large flow is moved, this will increase the utilization of the link that it is moved to, potentially creating imbalance in the utilization once again across the component links. Therefore, when moving large flows, care must be taken to account for the existing load and for what the future load will be after the large flow has been moved. Further, the appearance of new large flows may require a rearrangement of the placement of existing flows.

Consider a case where there is a LAG comprising four 10 Gbps component links and there are four large flows, each of 1 Gbps. These flows are each placed on one of the component links. Subsequently, a fifth large flow of 2 Gbps is recognized, and to maintain equitable load distribution, it may require moving one of the existing 1 Gbps flows to a different component link. Even then, this would still result in some imbalance in the utilization across the component links.
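The placement step can be sketched as follows (non-normative Python; install_pbr_rule() is a hypothetical hook standing in for the platform's PBR programming interface):

   # Sketch of Section 4.4.1: move a large flow to the least-utilized
   # component link via a PBR rule.  install_pbr_rule() is hypothetical.
   def place_large_flow(flow, link_utilization, flow_fraction,
                        install_pbr_rule):
       """link_utilization: {link_id: fractional utilization};
       flow_fraction: the flow's rate as a fraction of link speed."""
       target = min(link_utilization, key=link_utilization.get)
       # Account for the load the flow itself adds to the target link so
       # that the move does not simply recreate the imbalance there.
       projected = link_utilization[target] + flow_fraction
       if projected < 1.0:
           install_pbr_rule(flow, target)   # overrides the hash lookup
           link_utilization[target] = projected
           return target
       return None                          # no link can absorb the flow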
4.4.2. Redistributing Small Flows

Some large flows may consume the entire bandwidth of the component link(s). In this case, it would be desirable for the small flows to not use the congested component link(s). This can be accomplished as follows; the approach works on some existing router hardware. The idea is to prevent, or to reduce the probability, that a small flow hashes into the congested component link(s).

   . The LAG/ECMP table is modified to include only non-congested component link(s). Small flows hash into this table to be mapped to a destination component link. Alternatively, if certain component links are heavily loaded but not congested, the output of the hash function can be adjusted to account for large flow loading on each of the component links.

   . The PBR rules for large flows (refer to Section 4.4.1) must have strict precedence over the LAG/ECMP table lookup result.

With this approach, the small flows that are moved would be subject to reordering.

4.4.3. Component Link Protection Considerations

If desired, certain component links may be reserved for link protection. These reserved component links are not used for any flows in the absence of any failures. When a component link fails, all of the flows on the failed component link are moved to the reserved component link(s). The mapping table of large flows to component links simply replaces the failed component link with the reserved link. Likewise, the LAG/ECMP table replaces the failed component link with the reserved link.

4.4.4. Load Rebalancing Algorithms

Specific algorithms for placement of large flows are out of scope of this document. One possibility is to formulate the problem of large flow placement as the well-known bin-packing problem and make use of the various heuristics that are available for that problem [bin-pack]. (A sketch of one such heuristic appears after the example in Section 4.4.5.)

4.4.5. Load Rebalancing Example

Optimizing LAG/ECMP component utilization for the use case in Figure 2 is depicted below in Figure 4. The large flow rebalancing explained in Section 4.4 is used. The improved link utilization is as follows:

   . Component link (1) has 3 flows -- 2 small flows and 1 large flow -- and the link utilization is normal.

   . Component link (2) has 4 flows -- 3 small flows and 1 large flow -- and the link utilization is normal now.

   . Component link (3) has 3 flows -- 2 small flows and 1 large flow -- and the link utilization is normal now.

       +-----------+    ->     +-----------+
       |           |    ->     |           |
       |           |   ===>    |           |
       |        (1)|-----------|(1)        |
       |           |           |           |
       |           |   ===>    |           |
       |           |    ->     |           |
       |           |    ->     |           |
       |   (R1)    |    ->     |   (R2)    |
       |        (2)|-----------|(2)        |
       |           |           |           |
       |           |    ->     |           |
       |           |    ->     |           |
       |           |   ===>    |           |
       |        (3)|-----------|(3)        |
       |           |           |           |
       +-----------+           +-----------+

          Where: ->   small flow
                 ===> large flow

          Figure 4: Evenly Utilized Component Links

Basically, the use of the mechanisms described in Section 4.4.1 resulted in a rebalancing of flows where one of the large flows on component link (3), which was previously congested, was moved to component link (2), which was previously under-utilized.
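As a non-normative illustration of the bin-packing formulation mentioned in Section 4.4.4, the following sketch (in Python) applies the first-fit-decreasing heuristic to large flow placement; it is one of many possible heuristics [bin-pack], not a recommendation.

   # First-fit-decreasing placement of large flows onto component links.
   def first_fit_decreasing(flow_rates, link_capacities):
       """flow_rates: {flow_id: rate}; link_capacities: {link_id: spare
       capacity}.  Rates and capacities are in the same units."""
       residual = dict(link_capacities)
       placement = {}
       # Place the largest flows first; greedily use the first link
       # with enough spare capacity for each flow.
       for flow, rate in sorted(flow_rates.items(), key=lambda kv: -kv[1]):
           for link, spare in residual.items():
               if rate <= spare:
                   placement[flow] = link
                   residual[link] = spare - rate
                   break
       return placement  # flows absent from the result could not be placed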
5. Information Model for Flow Rebalancing

In order to support flow rebalancing in a router from an external system, the exchange of some information is necessary between the router and the external system. This section provides an exemplary information model covering the various components needed for the purpose. The model is intended to be informational and may be used as input for development of a data model.

5.1. Configuration Parameters for Flow Rebalancing

The following parameters are required for the configuration of this feature:

   . Large flow recognition parameters:

      o Observation interval: The observation interval is the time period in seconds over which the packet arrivals are observed for the purpose of large flow recognition.

      o Minimum bandwidth threshold: The minimum bandwidth threshold would be configured as a percentage of link speed and translated into a number of bytes over the observation interval. A flow for which the number of bytes received, for a given observation interval, exceeds this number would be recognized as a large flow.

      o Minimum bandwidth threshold for large flow maintenance: The minimum bandwidth threshold for large flow maintenance is used to provide hysteresis for large flow recognition. Once a flow is recognized as a large flow, it continues to be recognized as a large flow until it falls below this threshold. This is also configured as a percentage of link speed and is typically lower than the minimum bandwidth threshold defined above.

   . Imbalance threshold: A measure of the deviation of the component link utilizations from the utilization of the overall LAG/ECMP group. Since component links can be of different speeds, the imbalance can be computed as follows. Let the utilizations of the component links in a LAG/ECMP group with n links of speeds b_1, b_2, ..., b_n be u_1, u_2, ..., u_n. The mean utilization is computed as

         u_ave = [ (u_1 x b_1) + (u_2 x b_2) + ... + (u_n x b_n) ] /
                 [ b_1 + b_2 + ... + b_n ].

     The imbalance is then computed as

         max_{i=1...n} | u_i - u_ave |.

     (A sketch of this computation appears at the end of this section.)

   . Rebalancing interval: The minimum amount of time between rebalancing events. This parameter ensures that rebalancing is not invoked too frequently, as it impacts packet ordering.

These parameters may be configured on a system-wide basis or they may apply to an individual LAG. They may be applied to an ECMP group provided the component links are not shared with any other ECMP group.
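The imbalance computation above can be transcribed directly (non-normative Python):

   # Direct transcription of the imbalance computation in Section 5.1
   # for component links of possibly different speeds.
   def imbalance(utilizations, speeds):
       """utilizations: [u_1, ..., u_n]; speeds: [b_1, ..., b_n]."""
       u_ave = (sum(u * b for u, b in zip(utilizations, speeds)) /
                sum(speeds))
       return max(abs(u - u_ave) for u in utilizations)

   # Example: three 10 Gbps links, one markedly hotter than the others;
   # u_ave is about 0.57 and the imbalance is about 0.33.
   print(imbalance([0.9, 0.4, 0.4], [10e9, 10e9, 10e9]))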
5.2. System Configuration and Identification Parameters

The following parameters are useful for router configuration and operation when using the mechanisms in this document.

   . IP address: The IP address of the specific router that the feature is being configured on, or that the large flow placement is being applied to.

   . LAG ID: Identifies the LAG on a given router. The LAG ID may be required when configuring this feature (to apply a specific set of large flow identification parameters to the LAG) and will be required when specifying flow placement to achieve the desired rebalancing.

   . Component Link ID: Identifies the component link within a LAG or ECMP group. This is required when specifying flow placement to achieve the desired rebalancing.

   . Component Link Weight: The relative weight to be applied to traffic for a given component link when using hash-based techniques for load distribution.

   . ECMP group: Identifies a particular ECMP group. The ECMP group may be required when configuring this feature (to apply a specific set of large flow identification parameters to the ECMP group) and will be required when specifying flow placement to achieve the desired rebalancing. We note that multiple ECMP groups can share an overlapping set (or non-overlapping subset) of component links. This document does not deal with the complexity of addressing such configurations.

The feature may be configured globally for all LAGs and/or for all ECMP groups, or it may be configured specifically for a given LAG or ECMP group.

5.3. Information for Alternative Placement of Large Flows

In cases where large flow recognition is handled by an external management station (see Section 4.3.3), an information model for flows is required to allow the import of large flow information to the router.

Typical fields used for identifying large flows were discussed in Section 4.3.1. The IPFIX information model [RFC 7012] can be leveraged for large flow identification.

Large flow placement is achieved by specifying the relevant flow information along with the following:

   . For LAG: the router's IP address, LAG ID, and LAG component link ID.

   . For ECMP: the router's IP address, ECMP group, and ECMP component link ID.

In the case where the ECMP component link itself comprises a LAG, we would have to specify the parameters for both the ECMP group as well as the LAG to which the large flow is being directed.

5.4. Information for Redistribution of Small Flows

Redistribution of small flows is done using the following:

   . For LAG: the LAG ID and the component link IDs, along with the relative weight of traffic to be assigned to each component link ID, are required.

   . For ECMP: the ECMP group and the ECMP nexthops, along with the relative weight of traffic to be assigned to each ECMP nexthop, are required.

It is possible for an ECMP nexthop (i.e., an ECMP component link) to itself comprise a LAG. In that case, we would have to specify the new weights both for the nexthops within the ECMP group and for the component links within the LAG.

5.5. Export of Flow Information

Exporting large flow information is required when large flow recognition is being done on a router, but the decision to rebalance is being made in an external management station. Large flow information includes the flow identification and the component link ID that the flow is currently assigned to. Other information, such as flow QoS and bandwidth, may be exported too.

The IPFIX information model [RFC 7012] can be leveraged for large flow identification.

5.6. Monitoring Information

5.6.1. Interface (Link) Utilization

The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and interface speed (ifSpeed) can be obtained from the interfaces table (ifTable) of the MIB [RFC 1213]. Since the octet counters are cumulative, the link utilization is computed from the change in the counters over a measurement interval of T seconds:

   Incoming link utilization = (delta_ifInOctets x 8) / (T x ifSpeed)

   Outgoing link utilization = (delta_ifOutOctets x 8) / (T x ifSpeed)

(A sketch of this computation appears at the end of this subsection.)

For high-speed Ethernet links, the etherStatsHighCapacityTable MIB [RFC 3273] can be used.

For scalability, it is recommended to use the counter push mechanism in [sFlow-v5] for the interface counters; doing so would help avoid counter polling through the MIB interface.

The outgoing link utilization of the component links within a LAG/ECMP group can be used to compute the imbalance (see Section 5.1) for the LAG/ECMP group.
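A non-normative sketch of this computation (in Python) follows, assuming 64-bit high-capacity counters (e.g., ifHCInOctets/ifHCOutOctets):

   # Sketch of link utilization from two successive octet-counter
   # readings taken T seconds apart.  Assumes 64-bit HC counters; for
   # 32-bit counters, replace (1 << 64) with (1 << 32).
   def link_utilization(octets_t1, octets_t2, t_seconds, if_speed_bps):
       """Fraction of if_speed_bps used between the two readings."""
       delta = (octets_t2 - octets_t1) % (1 << 64)   # tolerate wrap
       return (delta * 8) / (t_seconds * if_speed_bps)

   # Example: 750,000,000 bytes in 60 s on a 10 Gbps link -> 0.01
   print(link_utilization(0, 750_000_000, 60, 10e9))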
5.6.2. Other Monitoring Information

Additional monitoring information that is useful includes:

   . The number of times rebalancing was done.

   . The time since the last rebalancing event.

   . The number of large flows currently rebalanced by the scheme.

   . A list of the large flows that have been rebalanced, including

      o the rate of each large flow at the time of the last rebalancing for that flow,

      o the time that rebalancing was last performed for the given large flow, and

      o the interface that each large flow was (re)directed to.

   . The settings for the weights of the interfaces within a LAG/ECMP group used by the small flows, which depend on hashing.

6. Operational Considerations

6.1. Rebalancing Frequency

Flows should be rebalanced only when the imbalance in the utilization across component links exceeds a certain threshold. Frequent rebalancing to achieve precise equitable utilization across component links could be counter-productive, as it may result in moving flows back and forth between the component links, impacting packet ordering and system stability. This applies regardless of whether large flows or small flows are redistributed. It should be noted that reordering is a concern for TCP flows with even a few packets, because three out-of-order packets would trigger sufficient duplicate ACKs to the sender, resulting in a retransmission [RFC 5681].

The operator would have to experiment with various values of the large flow recognition parameters (minimum bandwidth threshold, observation interval) and the imbalance threshold across component links to tune the solution for their environment.

6.2. Handling Route Changes

Large flow rebalancing must be aware of any changes to the FIB. In cases where the nexthop of a route no longer points to the LAG, or to an ECMP group, any PBR entries added as described in Sections 4.4.1 and 4.4.2 must be withdrawn in order to avoid the creation of forwarding loops.

6.3. Forwarding Resources

Hash-based techniques used for load balancing with LAG/ECMP are usually stateless. The mechanisms described in this document require additional resources in the forwarding plane of routers for creating PBR rules that are capable of overriding the forwarding decision from the hash-based approach. These resources may limit the number of flows that can be rebalanced and may also impact the latency experienced by packets due to the additional lookups that are required.

7. IANA Considerations

This memo includes no request to IANA.

8. Security Considerations

This document does not directly impact the security of the Internet infrastructure or its applications. In fact, it could help in the event of a DOS attack pattern that causes a hash imbalance resulting in heavy loading of certain LAG/ECMP component links by large flows.

An attacker with knowledge of the large flow recognition algorithm and any stateless distribution method can generate flows that are distributed in a way that overloads a specific path. This could be used to cause the creation of PBR rules that exhaust the available rule capacity on nodes. If PBR rules are consequently discarded, this could result in congestion on the attacker-selected path. Alternatively, tracking large numbers of PBR rules could result in performance degradation.

9. Contributing Authors

Sanjay Khanna
Cisco Systems
Email: sanjakha@gmail.com
10. Acknowledgements

The authors would like to thank the following individuals for their review and valuable feedback on earlier versions of this document: Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, Peter Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George Yum, and Weifeng Zhang. As a part of the IETF Last Call process, valuable comments were received from Martin Thomson.

11. References

11.1. Normative References

[802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE Standard for Local and Metropolitan Area Networks - Link Aggregation", 2008.

[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and Multicast Next-Hop Selection," November 2000.

[RFC 7011] Claise, B., et al., "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information," September 2013.

[RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow Information Export (IPFIX)," September 2013.

[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," http://www.sflow.org/sflow_version_5.txt, July 2004.

11.2. Informative References

[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson, "Approximation Algorithms for Bin-Packing -- An Updated Survey," in Algorithm Design for Computer System Design, ed. by Ausiello, Lucertini, and Serafini, Springer-Verlag, 1984.

[CAIDA] "CAIDA Internet Traffic Analysis," http://www.caida.org/home.

[DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow Management for High Performance Enterprise Networks," Proceedings of ACM SIGCOMM, August 2011.

[FLOW-ACC] Zseby, T., et al., "Packet sampling for flow accounting: challenges and limitations," Proceedings of the 9th International Conference on Passive and Active Network Measurement, 2008.

[ID.ietf-rtgwg-cl-requirement] Villamizar, C., et al., "Requirements for MPLS over a Composite Link," September 2013.

[ITCOM] Jo, J., et al., "Internet traffic load balancing using dynamic hashing with flow volume," SPIE ITCOM, 2002.

[NDTM] Estan, C. and G. Varghese, "New directions in traffic measurement and accounting," Proceedings of ACM SIGCOMM, August 2002.

[NVGRE] Sridharan, M., et al., "NVGRE: Network Virtualization using Generic Routing Encapsulation," draft-sridharan-virtualization-nvgre-04, February 2014.

[RFC 2784] Farinacci, D., et al., "Generic Routing Encapsulation (GRE)," March 2000.

[RFC 6790] Kompella, K., et al., "The Use of Entropy Labels in MPLS Forwarding," November 2012.

[RFC 1213] McCloghrie, K., "Management Information Base for Network Management of TCP/IP-based internets: MIB-II," March 1991.

[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm," November 2000.

[RFC 3273] Waldbusser, S., "Remote Network Monitoring Management Information Base for High Capacity Networks," July 2002.

[RFC 3931] Lau, J., Ed., M. Townsley, Ed., and I. Goyret, Ed., "Layer 2 Tunneling Protocol - Version 3," March 2005.

[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version 9," October 2004.

[RFC 5470] Sadasivan, G., et al., "Architecture for IP Flow Information Export," March 2009.
et al., "Sampling and Filtering Techniques for 1082 IP Packet Selection," March 2009. 1084 [RFC 5640] Filsfils, C., P. Mohapatra, and C. Pignataro, "Load 1085 Balancing for Mesh Softwires," August 2009. 1087 [RFC 5681] Allman, M. et al., "TCP Congestion Control," September 1088 2009. 1090 [SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics," 1091 http://www.sflow.org/packetSamplingBasics/. 1093 [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters 1094 structure," http://www.sflow.org/sflow_lag.txt, September 2012. 1096 [STT] Davie, B. (Ed.) and J. Gross, "A Stateless Transport Tunneling 1097 Protocol for Network Virtualization (STT)," draft-davie-stt-06, March 1098 2014. 1100 [VXLAN] Mahalingam, M. et al., "VXLAN: A Framework for Overlaying 1101 Virtualized Layer 2 Networks over Layer 3 Networks," draft- 1102 mahalingam-dutt-dcops-vxlan-09, April 2014. 1104 [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," 1105 draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. 1107 Appendix A. Internet Traffic Analysis and Load Balancing Simulation 1109 Internet traffic [CAIDA] has been analyzed to obtain flow statistics 1110 such as the number of packets in a flow and the flow duration. The 1111 five tuples in the packet header (IP addresses, TCP/UDP Ports, and IP 1112 protocol) are used for flow identification. The analysis indicates 1113 that < ~2% of the flows take ~30% of total traffic volume while the 1114 rest of the flows (> ~98%) contributes ~70% [YONG]. 1116 The simulation has shown that given Internet traffic pattern, the 1117 hash-based technique does not evenly distribute the flows over ECMP 1118 paths. Some paths may be > 90% loaded while others are < 40% loaded. 1119 The more ECMP paths exist, the more severe the misbalancing. This 1120 implies that hash-based distribution can cause some paths to become 1121 congested while other paths are underutilized [YONG]. 1123 The simulation also shows substantial improvement by using the large 1124 flow-aware hash-based distribution technique described in this 1125 document. In using the same simulated traffic, the improved 1126 rebalancing can achieve < 10% load differences among the paths. It 1127 proves how large flow-aware hash-based distribution can effectively 1128 compensate the uneven load balancing caused by hashing and the 1129 traffic characteristics [YONG]. 1131 Authors' Addresses 1133 Ram Krishnan 1134 Brocade Communications 1135 San Jose, 95134, USA 1136 Phone: +1-408-406-7890 1137 Email: ramkri123@gmail.com 1139 Lucy Yong 1140 Huawei USA 1141 5340 Legacy Drive 1142 Plano, TX 75025, USA 1143 Phone: +1-469-277-5837 1144 Email: lucy.yong@huawei.com 1146 Anoop Ghanwani 1147 Dell 1148 San Jose, CA 95134 1149 Phone: +1-408-571-3228 1150 Email: anoop@alumni.duke.edu 1152 Ning So 1153 Tata Communications 1154 Plano, TX 75082, USA 1155 Phone: +1-972-955-0914 1156 Email: ning.so@tatacommunications.com 1158 Bhumip Khasnabish 1159 ZTE Corporation 1160 New Jersey, 07960, USA 1161 Phone: +1-781-752-8003 1162 Email: vumip1@gmail.com