BESS WorkGroup                                          N. Malhotra, Ed.
Internet-Draft                                                A. Sajassi
Intended status: Standards Track                           Cisco Systems
Expires: 3 December 2022                                      J. Rabadan
                                                                   Nokia
                                                                J. Drake
                                                                 Juniper
                                                              A. Lingala
                                                                     ATT
                                                               S. Thoria
                                                           Cisco Systems
                                                             1 June 2022

          Weighted Multi-Path Procedures for EVPN Multi-Homing
                   draft-ietf-bess-evpn-unequal-lb-16

Abstract

EVPN enables all-active multi-homing for a CE device connected to two or more PEs via a LAG, such that bridged and routed traffic from remote PEs to hosts attached to the Ethernet Segment can be equally load balanced (using Equal Cost Multi-Path) across the multi-homing PEs. EVPN also enables multi-homing for IP subnets advertised in IP Prefix routes, so that routed traffic from remote PEs to those IP subnets can be load balanced. This document defines extensions to EVPN procedures to optimally handle unequal access bandwidth distribution across a set of multi-homing PEs in order to:

* provide greater flexibility with respect to adding or removing individual multi-homed PE-CE links.
* handle multi-homed PE-CE link failures that can result in unequal PE-CE access bandwidth across a set of multi-homing PEs.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 3 December 2022.

Copyright Notice

Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Requirements Language and Terminology
   2.  Introduction
     2.1.  PE-CE Link Provisioning
     2.2.  PE-CE Link Failures
     2.3.  Design Requirement
   3.  Solution Overview
   4.  EVPN Link Bandwidth Extended Community
     4.1.  Encoding and Usage of EVPN Link Bandwidth Extended Community
     4.2.  Note on BGP Link Bandwidth Extended Community
   5.  Weighted Unicast Traffic Load-balancing to an Ethernet Segment
     5.1.  Egress PE Behavior
     5.2.  Ingress PE Behavior
   6.  Weighted BUM Traffic Load-Sharing across an Ethernet Segment
     6.1.  The BW Capability in the DF Election Extended Community
     6.2.  BW Capability and Default DF Election algorithm
     6.3.  BW Capability and HRW DF Election algorithm (Type 1 and 4)
       6.3.1.  BW Increment
       6.3.2.  HRW Hash Computations with BW Increment
     6.4.  BW Capability and Preference DF Election algorithm
   7.  Cost-Benefit Tradeoff on Link Failures
   8.  Real-time Available Bandwidth
   9.  Weighted Load-balancing to Multi-homed Subnets
   10. Weighted Load-balancing without EVPN aliasing
   11. EVPN-IRB Multi-homing With Non-EVPN routing
   12. Operational Considerations
   13. Security Considerations
   14. IANA Considerations
   15. Acknowledgements
   16. Contributors
   17. References
     17.1.  Normative References
     17.2.  Informative References
   Authors' Addresses

1.  Requirements Language and Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

"Local PE" in the context of an Ethernet Segment refers to a provider edge switch or router that physically hosts the Ethernet Segment.

"Remote PE" in the context of an Ethernet Segment refers to a provider edge switch or router in an EVPN overlay whose overlay reachability to the Ethernet Segment is via the Local PE.

* BW: Bandwidth

* LAG: Link Aggregation Group

* ES: Ethernet Segment

* ESI: Ethernet Segment ID

* vES: Virtual Ethernet Segment

* EVI: Ethernet Virtual Instance; this is a MAC-VRF.

* Path-List: a forwarding object used to load-balance routed or bridged traffic across multiple forwarding paths.

* Access Bandwidth: bandwidth of the PE-CE links in an Ethernet Segment.

* Egress PE: in the context of an Ethernet Segment or a route, this is the PE that advertises a locally attached Ethernet Segment RT-1, or a locally attached host or prefix route (RT-2, RT-5).

* Ingress PE: in the context of an Ethernet Segment or a route, this is the receiving PE that learns the remote Ethernet Segment RT-1 and/or host and prefix routes (RT-2, RT-5) from the Egress PE.

* IMET: Inclusive Multicast Ethernet Tag route

* DF: Designated Forwarder

* BDF: Backup Designated Forwarder

* DCI: Data Center Interconnect Router

2.  Introduction

In an EVPN-IRB based network overlay, with a CE multi-homed via EVPN all-active multi-homing, bridged and routed traffic from ingress PEs can be equally load balanced (ECMPed) across the multi-homing egress PEs:

* ECMP load balancing for bridged unicast traffic is enabled via the aliasing and mass-withdraw procedures detailed in [RFC7432].

* ECMP load balancing for routed unicast traffic is enabled via existing L3 ECMP mechanisms.

* Load sharing of bridged BUM traffic on local ports is enabled via the EVPN DF election procedure detailed in [RFC7432].

All of the above load balancing and DF election procedures implicitly assume equal bandwidth distribution between the CE and the set of egress PEs. Essentially, with this assumption of equal "access" bandwidth distribution across all egress PEs, all remote traffic is equally load balanced across the egress PEs. This assumption of equal access bandwidth distribution can be restrictive with respect to adding or removing links in a multi-homed LAG interface, and may also easily be broken by individual link failures. A solution to handle unequal access bandwidth distribution across a set of egress PEs is proposed in this document.
The primary motivation behind this proposal is to enable greater flexibility with respect to adding or removing member PE-CE links as needed, and to optimally handle PE-CE link failures.

2.1.  PE-CE Link Provisioning

             +------------------------+
             | Underlay Network Fabric|
             +------------------------+

              +-----+      +-----+
              | PE1 |      | PE2 |
              +-----+      +-----+
                  \           /
                   \   ES-1  /
                    \       /
                    +\-----/+
                    | \   / |
                    +---+---+
                        |
                       CE1

                     Figure 1

Consider CE1 dual-homed to egress PE1 and egress PE2 via EVPN all-active multi-homing, with single member links of equal bandwidth to each PE (i.e., equal access bandwidth distribution across PE1 and PE2). If the provider wants to increase link bandwidth to CE1, it must add a link to both PE1 and PE2 in order to maintain equal access bandwidth distribution and inter-work with EVPN ECMP load balancing. In other words, for a dual-homed CE, the total number of CE links must be provisioned in multiples of two (2, 4, 6, and so on). For a triple-homed CE, the number of CE links must be provisioned in multiples of three (3, 6, 9, and so on). To generalize, for a CE that is multi-homed to "n" PEs, the number of PE-CE physical links provisioned must be an integral multiple of "n". This is restrictive in the case of dual-homing and very quickly becomes prohibitive in the case of multi-homing.

Instead, a provider may wish to increase PE-CE bandwidth or the number of links in any increment. As an example, for CE1 dual-homed to egress PE1 and egress PE2 in all-active mode, the provider may wish to add a third link to PE1 only, increasing the total bandwidth for this CE by 50%, rather than being required to increase the access bandwidth by 100% by adding a link to each of the two PEs. While existing EVPN-based all-active load balancing procedures do not necessarily preclude such an asymmetric access bandwidth distribution among the PEs providing redundancy, it may result in unexpected traffic loss due to congestion on the access interface towards the CE. This traffic loss occurs because PE1 and PE2 will continue to be treated as equal cost paths at remote PEs, and as a result may attract approximately equal amounts of CE1-destined traffic, even when PE2 has only half the bandwidth to CE1 that PE1 has. This may lead to congestion and traffic loss on the PE2-CE1 link. If the bandwidth distribution to CE1 across PE1 and PE2 is 2:1, traffic from remote hosts must also be load balanced across PE1 and PE2 in a 2:1 manner.

2.2.  PE-CE Link Failures

More importantly, the unequal PE-CE bandwidth distribution described above may occur during regular operation following a link failure, even when the PE-CE links were provisioned to provide equal bandwidth distribution across the multi-homing PEs.

             +------------------------+
             | Underlay Network Fabric|
             +------------------------+

              +-----+      +-----+
              | PE1 |      | PE2 |
              +-----+      +-----+
                 \\           //
                  \\   ES-1  //
                   \\       /X
                   +\\-----//+
                   | \\   // |
                   +----+----+
                        |
                       CE1

                     Figure 2

Consider CE1 multi-homed to egress PE1 and egress PE2 via a LAG with two member links to each PE. On a PE2-CE1 physical link failure, the LAG represented by Ethernet Segment ES-1 on PE2 stays up; however, its bandwidth is cut in half.
With existing ECMP procedures, both PE1 and PE2 may continue to attract an equal amount of traffic from remote PEs, even though PE1 now has twice PE2's bandwidth to CE1. If the bandwidth distribution to CE1 across PE1 and PE2 is 2:1, traffic from remote hosts must also be load balanced across PE1 and PE2 in a 2:1 manner to avoid unexpected congestion and traffic loss on the PE2-CE1 links within the LAG. As an alternative, the LAG min-links feature is sometimes used to bring down the LAG interface on member link failures. This, however, results in a loss of available bandwidth in the network and is not ideal.

2.3.  Design Requirement

               +-----------------------+
               |Underlay Network Fabric|
               +-----------------------+

      +-----+   +-----+         +-----+   +-----+
      | PE1 |   | PE2 |  .....  | PEx |   | PEn |
      +-----+   +-----+         +-----+   +-----+
          \        \               //       //
           \ L1     \ L2          // Lx    // Ln
            \        \           //       //
           +-\--------\---------//-------//-+
           |  \        \  ES-1 //       //  |
           +--------------------------------+
                            |
                            CE

                     Figure 3

To generalize, if the total link bandwidth to a CE is distributed across "n" egress PEs, with Lx being the total bandwidth to PEx across all links, traffic from ingress PEs to this CE must be load balanced unequally across the egress PE set [PE1, PE2, ..., PEn] such that the fraction of total unicast and BUM flows destined for the CE that is serviced by egress PEx is:

   Lx / (L1 + L2 + ... + Ln)

Figure 3 illustrates a scenario where egress PE1..PEn are attached to a multi-homed Ethernet Segment; however, this document generalizes this requirement so that unequal load balancing can also be applied to PEs attached to a vES [EVPN-VIRTUAL-ES] or to a multi-homed subnet advertised via EVPN IP Prefix routes.

The solution proposed below includes extensions to EVPN procedures to achieve the above. The following assumptions apply to the procedures described in this document:

* For procedures related to bridged unicast and BUM traffic, EVPN all-active multi-homing is assumed.

* Procedures related to bridged unicast and BUM traffic are applicable to both the aliasing and non-aliasing modes defined in [RFC7432].

3.  Solution Overview

In order to achieve weighted load balancing to an ES or vES for overlay unicast traffic, the Ethernet A-D per ES route (EVPN Route Type 1) is leveraged to signal the Ethernet Segment weight to ingress PEs. Using the Ethernet A-D per ES route to signal the Ethernet Segment weight provides a mechanism that reacts to changes in access bandwidth or in the number of access links in a service- and host-independent manner. Ingress PEs computing the MAC path-lists based on global and aliasing Ethernet A-D routes now have the ability to set up weighted load balancing path-lists based on the ES access bandwidth or number of links received from each egress PE that the ES is multi-homed to.

In order to achieve weighted load balancing of overlay BUM traffic, the EVPN ES route (Route Type 4) is leveraged to signal the ES weight to the egress PEs within an ES's redundancy group, in order to influence the per-service DF election. Egress PEs in an ES redundancy group now have the ability to do service carving in proportion to each egress PE's relative ES weight.

Unequal load balancing to multi-homed subnets is achieved by signaling the weight along with the IP Prefix routes advertised for the subnet.

Procedures to accomplish this are described in greater detail next.
4.  EVPN Link Bandwidth Extended Community

A new EVPN Link Bandwidth extended community is defined for the solution specified in this document:

* This extended community is of type 0x06 (EVPN).

* IANA is requested to assign a sub-type value of 0x10 for the EVPN Link Bandwidth extended community.

* The EVPN Link Bandwidth extended community is defined as transitive.

4.1.  Encoding and Usage of EVPN Link Bandwidth Extended Community

The EVPN Link Bandwidth Extended Community value field is used to carry the total bandwidth of all of the egress PE's physical links in an Ethernet Segment, expressed in Mbps (megabits per second) and represented as an unsigned integer. Note, however, that the load balancing algorithm defined in this document uses the ratio of link bandwidths. Hence, the operator may choose a different unit, or may use the community as a generalized weight that is set to a link count, a locally configured weight, or a value computed from something other than link bandwidth. In such a case, the operator MUST ensure consistent usage of the unit across all egress PEs in an Ethernet Segment. This may involve multiple routing domains / Autonomous Systems.

In order to facilitate this, as well as to avoid interoperability issues due to provisioning errors, one octet of the extended community's six-octet 'value' field is used to explicitly signal whether the weight encoded in the remaining five octets is a link bandwidth expressed in Mbps or a generalized weight value. This results in the following encoding for the EVPN Link Bandwidth extended community:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |   Sub-Type    |  Value-Units  |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                          Value-Weight                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                               Figure 4

Value-Units is encoded as:

* 0x00: weight expressed in the default units of Mbps

* 0x01: generalized weight expressed in units other than Mbps

Generalized weight units are intentionally left arbitrary to allow flexibility of usage for different applications without having to define a new encoding for each non-default application. Implementations SHOULD support the default units of Mbps, while support for non-default generalized weights is optional.

Additionally, the following considerations apply to the handling of this extended community at the ingress PE:

* An ingress PE MUST check for a consistent 'Value-Units' value in the EVPN Link Bandwidth extended community received from each egress PE in an Ethernet Segment. In case of any inconsistency in 'Value-Units' across the egress PEs in an Ethernet Segment, the EVPN Link Bandwidth extended community is to be ignored.

* An ingress PE MUST ensure that each route contains only a single instance of this extended community sub-type. If more than one instance is present, the EVPN Link Bandwidth extended community is to be ignored.
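As a non-normative illustration, the following Python sketch shows one way to encode and validate the extended community defined above. The function and constant names are illustrative only; the sub-type value reflects the IANA request in this document.

   import struct

   EC_TYPE_EVPN = 0x06        # EVPN extended community type
   EC_SUBTYPE_LINK_BW = 0x10  # sub-type requested from IANA herein
   UNITS_MBPS = 0x00          # Value-Units: link bandwidth in Mbps
   UNITS_GENERALIZED = 0x01   # Value-Units: generalized weight

   def encode_link_bw(weight, units=UNITS_MBPS):
       """Build the 8-octet EVPN Link Bandwidth extended community."""
       if not 0 <= weight < 2**40:    # Value-Weight is five octets
           raise ValueError("weight must fit in 40 bits")
       header = struct.pack("!BBB", EC_TYPE_EVPN,
                            EC_SUBTYPE_LINK_BW, units)
       return header + weight.to_bytes(5, "big")

   def decode_link_bw(ec):
       """Return (units, weight), or None if this 8-octet community
       is not an EVPN Link Bandwidth extended community."""
       if len(ec) != 8 or ec[0] != EC_TYPE_EVPN \
               or ec[1] != EC_SUBTYPE_LINK_BW:
           return None
       return ec[2], int.from_bytes(ec[3:], "big")

   # A PE with 2 Gbps of ES bandwidth advertises a weight of 2000 Mbps:
   assert decode_link_bw(encode_link_bw(2000)) == (UNITS_MBPS, 2000)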
4.2.  Note on BGP Link Bandwidth Extended Community

The link bandwidth extended community described in [BGP-LINK-BW] for layer 3 VPNs was considered for re-use here; however, it is defined in [BGP-LINK-BW] as optional non-transitive. Since it is not possible to change the deployed behavior of the extended community defined in [BGP-LINK-BW], it was decided to define a new one. In inter-AS scenarios, the link bandwidth needs to be signaled to eBGP neighbors. When signaled across an AS boundary, this extended community can be used to achieve optimal load balancing towards egress PEs in a different AS. This applies whether or not the next hop is changed across the AS boundary.

5.  Weighted Unicast Traffic Load-balancing to an Ethernet Segment

5.1.  Egress PE Behavior

A PE that is part of an Ethernet Segment's redundancy group SHOULD advertise an additional EVPN Link Bandwidth extended community with the Ethernet A-D per ES route (EVPN Route Type 1), carrying the total bandwidth of the PE's physical links in the Ethernet Segment or a generalized weight. The new EVPN Link Bandwidth extended community defined in this document is used for this purpose.

The EVPN Link Bandwidth extended community SHOULD NOT be attached to the per-EVI RT-1 or to EVPN RT-2 routes.

5.2.  Ingress PE Behavior

An ingress PE MUST ensure that the EVPN Link Bandwidth extended community is received from all of the egress PEs in an Ethernet Segment, and MUST check for a consistent 'Value-Units' value received from each egress PE in the Ethernet Segment. In case of a missing EVPN Link Bandwidth extended community or inconsistent 'Value-Units' from any of the egress PEs in an Ethernet Segment, the EVPN Link Bandwidth extended community is to be ignored by the ingress PE, and the ingress PE is to follow regular ECMP forwarding to that Ethernet Segment.

Once consistency of 'Value-Units' is validated, the ingress PE SHOULD use the 'Value-Weight' received from each egress PE to compute a relative (normalized) weight for each egress PE, per ES, and then use this relative weight to compute a weighted path-list to be used for load balancing, as opposed to using an ECMP path-list for load balancing across the egress PE paths. The egress PE weight and the resulting weighted path-list computation at ingress PEs is a local matter. An example computation algorithm is shown below to illustrate the idea.

If:

   L(x,y) : link bandwidth advertised by egress PE-x for ES-y

   W(x,y) : normalized weight assigned to egress PE-x for ES-y

   H(y)   : Highest Common Factor (HCF) of [L(1,y), L(2,y), ..., L(n,y)]

then the normalized weight assigned to egress PE-x for ES-y may be computed as follows:

   W(x,y) = L(x,y) / H(y)

For a MAC+IP route (EVPN Route Type 2) received with ES-y, the ingress PE may compute a MAC and IP forwarding path-list weighted by the above normalized weights.

As an example, for a CE multi-homed to PE-1, PE-2, and PE-3 via 2, 1, and 1 GE physical links respectively, as part of a LAG represented by ES-10:

   L(1,10) = 2000 Mbps
   L(2,10) = 1000 Mbps
   L(3,10) = 1000 Mbps
   H(10)   = 1000

The normalized weights assigned to each egress PE for ES-10 are as follows:

   W(1,10) = 2000 / 1000 = 2
   W(2,10) = 1000 / 1000 = 1
   W(3,10) = 1000 / 1000 = 1

For a remote MAC+IP host route received with ES-10, the forwarding load balancing path-list may now be computed as [PE-1, PE-1, PE-2, PE-3] instead of [PE-1, PE-2, PE-3]. This results in load balancing of all traffic destined for ES-10 across the three egress PEs in proportion to the ES-10 bandwidth at each egress PE.
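The example computation above maps directly to a small amount of code. The following non-normative Python sketch (function name and PE labels are illustrative only) derives the normalized weights and the weighted path-list from the advertised 'Value-Weight' values:

   from functools import reduce
   from math import gcd

   def weighted_path_list(weights):
       """weights: dict mapping an egress PE to the 'Value-Weight' it
       advertised for one ES.  Returns the weighted path-list."""
       hcf = reduce(gcd, weights.values())    # H(y)
       path_list = []
       for pe, bw in sorted(weights.items()):
           path_list += [pe] * (bw // hcf)    # W(x,y) = L(x,y) / H(y)
       return path_list

   # ES-10 example: 2 GE to PE-1, 1 GE each to PE-2 and PE-3.
   assert weighted_path_list({"PE-1": 2000, "PE-2": 1000,
                              "PE-3": 1000}) == \
          ["PE-1", "PE-1", "PE-2", "PE-3"]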
Weighted path-list computation must only be done for an ES if the EVPN Link Bandwidth extended community is received from all of the egress PEs advertising reachability to that ES via the Ethernet A-D per ES Route Type 1. In the unlikely event that the EVPN Link Bandwidth extended community is not received from one or more egress PEs, the forwarding path-list should be computed using regular ECMP semantics. Note that a default weight cannot be assumed for an egress PE that does not advertise its link bandwidth, as the weight to be used in the path-list computation is relative.

If the per-ES RT-1 is not advertised by, or is withdrawn from, any of the egress PE(s), then as per [RFC7432] that egress PE is removed from the forwarding path-list for that [EVI, ES]. Hence, the weighted path-list MUST be re-computed.

In the unlikely scenario that the per-[ES, EVI] RT-1 is not advertised by any of the egress PE(s), then as per [RFC7432] that egress PE is not included in the forwarding path-list for that [EVI, ES]. Hence, the weighted path-list for the [EVI, ES] MUST be computed based only on the weights received from the egress PEs that advertised the per-[ES, EVI] RT-1.

6.  Weighted BUM Traffic Load-Sharing across an Ethernet Segment

Optionally, load sharing of the per-service DF role, weighted by each egress PE's link-bandwidth share within a multi-homed ES, may also be achieved.

In order to do that, a new DF Election Capability [RFC8584] called "BW" (Bandwidth Weighted DF Election) is defined. BW MAY be used along with some DF Election Types, as described in the following sections.

6.1.  The BW Capability in the DF Election Extended Community

[RFC8584] defines a new extended community for PEs within a redundancy group to signal and agree on a uniform DF Election Type and Capabilities for each ES. This document requests IANA to allocate a bit in the "DF Election capabilities" registry set up by [RFC8584]:

   Bit 4: BW (Bandwidth Weighted DF Election)

ES routes advertised with the BW bit set indicate the desire of the advertising egress PE to consider the link bandwidth in the DF Election algorithm defined by the value in the "DF Type".

As per [RFC8584], all the egress PEs in the ES MUST advertise the same Capabilities and DF Type; otherwise the PEs will fall back to the Default DF Election procedure [RFC7432].

The BW Capability MAY be advertised with the following DF Types:

* Type 0: Default DF Election algorithm, as in [RFC7432]

* Type 1: HRW algorithm, as in [RFC8584]

* Type 2: Preference algorithm, as in [EVPN-DF-PREF]

* Type 4: HRW per-multicast flow DF Election, as in [EVPN-PER-MCAST-FLOW-DF]

The following sections describe how the DF Election procedures are modified for the above DF Types when the BW Capability is used.

6.2.  BW Capability and Default DF Election algorithm

When all the PEs in the Ethernet Segment (ES) agree to use the BW Capability with DF Type 0, the Default DF Election procedure defined in [RFC7432] is modified as follows:

* Each PE advertises an EVPN Link Bandwidth extended community along with the ES route to signal the PE-CE link bandwidth (LBW) for the ES.

* A receiving egress PE MUST use the EVPN Link Bandwidth extended community received from each egress PE to compute a relative weight for each egress PE in the Ethernet Segment.

* The DF Election procedure MUST now use this weighted list of egress PEs to compute the per-VLAN Designated Forwarder, such that the DF role is distributed in proportion to this normalized weight. As a result, a single PE may have multiple ordinals in the DF candidate PE list, and the 'N' used in the (V mod N) operation defined in [RFC7432] becomes the total number of ordinals instead of the total number of egress PEs in the Ethernet Segment.

Considering the same example as in Section 5.2, the candidate PE list for DF election is:

   [PE-1, PE-1, PE-2, PE-3]

The DF for a given VLAN-a on ES-10 is now computed as (VLAN-a % 4). This results in the DF role being distributed across PE1, PE2, and PE3 in proportion to each PE's normalized weight for ES-10, as illustrated by the sketch below.
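The following non-normative Python sketch shows this modified modulo-based election over the ordinal list (the candidate ordering is assumed to follow the [RFC7432] sort; names are illustrative only):

   def weighted_default_df(vlan, candidates):
       """Default DF election with the BW Capability: 'candidates'
       carries one ordinal per unit of normalized weight, so the N
       in (V mod N) is the number of ordinals, not the number of
       PEs in the Ethernet Segment."""
       return candidates[vlan % len(candidates)]

   # ES-10 example: PE-1 has weight 2; PE-2 and PE-3 have weight 1.
   candidates = ["PE-1", "PE-1", "PE-2", "PE-3"]
   assert weighted_default_df(100, candidates) == "PE-1"  # 100 % 4 == 0
   assert weighted_default_df(103, candidates) == "PE-3"  # 103 % 4 == 3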
6.3.  BW Capability and HRW DF Election algorithm (Type 1 and 4)

[RFC8584] introduces the Highest Random Weight (HRW) algorithm (DF Type 1) for DF election in order to solve potential DF election skew that depends on the Ethernet tag space distribution. [EVPN-PER-MCAST-FLOW-DF] further extends the HRW algorithm to per-multicast-flow hash computations (DF Type 4). This section describes extensions to the HRW algorithm for EVPN DF Election specified in [RFC8584] and in [EVPN-PER-MCAST-FLOW-DF] in order to achieve a DF election distribution that is weighted by link bandwidth.

6.3.1.  BW Increment

A new variable called the "bandwidth increment" is computed for each [PE, ES] advertising the EVPN Link Bandwidth extended community, as follows.

In the context of an ES:

   L(i)   = link bandwidth advertised by PE(i) for this ES

   L(min) = lowest link bandwidth advertised across all PEs for this ES

The bandwidth increment b(i) for a given PE(i) advertising a link bandwidth of L(i) is defined as an integer value computed as:

   b(i) = L(i) / L(min)

As an example, with PE(1) = 10, PE(2) = 10, PE(3) = 20, the bandwidth increment for each PE would be computed as:

   b(1) = 1, b(2) = 1, b(3) = 2

With PE(1) = 10, PE(2) = 10, PE(3) = 10, the bandwidth increment for each PE would be computed as:

   b(1) = 1, b(2) = 1, b(3) = 1

Note that the bandwidth increment must always be an integer, including in the unlikely scenario of a PE's link bandwidth not being an exact multiple of L(min). If it computes to a non-integer value (including as a result of a link failure), it MUST be rounded down to an integer. A short sketch of this computation follows.
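The following non-normative Python sketch (illustrative only) computes the bandwidth increment, including the rounding-down rule:

   def bw_increments(link_bw):
       """link_bw: dict mapping each PE to the link bandwidth it
       advertised for one ES.  Returns each PE's integer b(i)."""
       l_min = min(link_bw.values())
       # Floor division enforces rounding down when L(i) is not an
       # exact multiple of L(min), e.g. after a member link failure.
       return {pe: bw // l_min for pe, bw in link_bw.items()}

   assert bw_increments({"PE1": 10, "PE2": 10, "PE3": 20}) == \
          {"PE1": 1, "PE2": 1, "PE3": 2}
   # 15 is not an exact multiple of 10: b(PE2) is rounded down to 1.
   assert bw_increments({"PE1": 10, "PE2": 15}) == {"PE1": 1, "PE2": 1}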
6.3.2.  HRW Hash Computations with BW Increment

The HRW algorithm, as described in [RFC8584] and in [EVPN-PER-MCAST-FLOW-DF], computes a random hash value for each PE(i), where 0 < i <= N, PE(i) is the PE at ordinal i, and Address(i) is the IP address of PE(i). For 'N' PEs sharing an Ethernet Segment, this results in 'N' candidate hash computations. The PE that has the highest hash value is selected as the DF.

We refer to this hash value as the "affinity" in this document. The affinity computation for each PE(i) is extended to be computed once per bandwidth increment associated with PE(i), instead of a single affinity computation per PE(i). PE(i) with b(i) = j results in j affinity computations:

   affinity(i, x), where 1 <= x <= j

This essentially results in a number of candidate HRW hash computations for each PE that is directly proportional to that PE's relative bandwidth within the ES, and hence gives PE(i) a probability of being elected DF in proportion to its relative bandwidth within the ES.

As an example, consider an ES that is multi-homed to two PEs, PE1 and PE2, with equal bandwidth distribution across PE1 and PE2. This results in a total of two candidate hash computations:

   affinity(PE1, 1)
   affinity(PE2, 1)

Now consider a scenario where PE1's link bandwidth is twice that of PE2. This results in a total of three candidate hash computations to be used for DF election:

   affinity(PE1, 1)
   affinity(PE1, 2)
   affinity(PE2, 1)

which gives PE1 a 2/3 probability of being elected DF, in proportion to its relative bandwidth in the ES.

Depending on the chosen HRW hash function, the affinity function MUST be extended to include the bandwidth increment in the computation.

For example, the affinity function specified in [EVPN-PER-MCAST-FLOW-DF] MAY be extended as follows to incorporate bandwidth increment j:

   affinity(S,G,V, ESI, Address(i,j)) =
      (1103515245.((1103515245.Address(i).j + 12345) XOR
       D(S,G,V,ESI)) + 12345) (mod 2^31)

The affinity (random) function specified in [RFC8584] MAY be extended as follows to incorporate bandwidth increment j:

   affinity(v, Es, Address(i,j)) =
      (1103515245.((1103515245.Address(i).j + 12345) XOR
       D(v,Es)) + 12345) (mod 2^31)

A non-normative sketch of the resulting election follows.
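The following non-normative Python sketch demonstrates the resulting election. It assumes PE addresses are represented as integers, takes the [RFC8584] digest D(v, Es) as a precomputed parameter 'd', and breaks affinity ties by lowest address in the spirit of [RFC8584]; all names are illustrative only:

   def affinity(address, j, d):
       """Extended [RFC8584]-style affinity for PE 'address' and
       bandwidth-increment ordinal j, with 31-bit digest d."""
       return (1103515245 * ((1103515245 * address * j + 12345) ^ d)
               + 12345) % (2 ** 31)

   def hrw_bw_df(increments, d):
       """increments: dict mapping PE address (int) -> b(i).  Elect
       the PE with the highest affinity over its increments 1..b(i),
       breaking ties in favor of the lowest address."""
       return max(((affinity(addr, j, d), -addr, addr)
                   for addr, b in increments.items()
                   for j in range(1, b + 1)))[2]

   # PE1 has twice PE2's bandwidth: b = 2 and 1, so PE1 contributes
   # two candidate affinities and PE2 one, as in the example above.
   df = hrw_bw_df({0x0A000001: 2, 0x0A000002: 1}, d=0x1234)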
6.4.  BW Capability and Preference DF Election algorithm

This section applies to ESes where all the PEs in the ES agree to use the BW Capability with DF Type 2. The BW Capability modifies the Preference DF Election procedure [EVPN-DF-PREF] by adding the LBW value as a tie-breaker, as follows.

Section 4.1, bullet (f) in [EVPN-DF-PREF] now considers the LBW value:

f) In case of equal Preference in two or more PEs in the ES, the tie-breakers will be the DP bit, the LBW value, and the lowest IP PE, in that order. For instance:

* If vES1 parameters were [Pref=500, DP=0, LBW=1000] in PE1 and [Pref=500, DP=1, LBW=2000] in PE2, PE2 would be elected due to the DP bit.

* If vES1 parameters were [Pref=500, DP=0, LBW=1000] in PE1 and [Pref=500, DP=0, LBW=2000] in PE2, PE2 would be elected due to its higher LBW, even if PE1's IP address is lower.

* The exchanged LBW value has no impact on the Non-Revertive option described in [EVPN-DF-PREF].

7.  Cost-Benefit Tradeoff on Link Failures

While incorporating link bandwidth into the DF election process provides optimal BUM traffic distribution across the ES links, it also implies that DF elections are re-adjusted on link failures or bandwidth changes. If the operator does not wish to have this level of churn in their DF election, then they should not advertise the BW capability. Not advertising the BW capability may result in less than optimal BUM traffic distribution, while still retaining the ability for an ingress PE to do weighted ECMP for its unicast traffic to a set of egress PEs.

8.  Real-time Available Bandwidth

PE-CE link bandwidth availability may sometimes vary in real time, and disproportionately so across the PE-CE links within a multi-homed ES, due to various factors such as flow-based hashing combined with fat flows and unbalanced hashing. Reacting to real-time available bandwidth is at this time outside the scope of this document.

9.  Weighted Load-balancing to Multi-homed Subnets

The EVPN Link Bandwidth extended community may also be used to achieve unequal load balancing of prefix-routed traffic by including this extended community in the EVPN IP Prefix route (Route Type 5). When included in an EVPN RT-5, its value is to be interpreted as the egress PE's relative weight for the prefix carried in that RT-5. The ingress PE then computes the forwarding path-list for the prefix route using the weighted paths received from each egress PE.

10.  Weighted Load-balancing without EVPN aliasing

[RFC7432] defines the per-[ES, EVI] RT-1 based EVPN aliasing procedure as optional. In the unlikely scenario where an EVPN implementation does not support the EVPN aliasing procedures, the MAC forwarding path-list at the ingress PE is computed based on the per-ES RT-1 and RT-2 routes received from egress PEs, instead of the per-ES RT-1 and per-[ES, EVI] RT-1 routes. In such a case, only the weights received via per-ES RT-1 from the egress PEs included in the MAC path-list are to be considered for the weighted path-list computation.

11.  EVPN-IRB Multi-homing With Non-EVPN routing

EVPN-LAG based multi-homing on an IRB gateway may also be deployed together with non-EVPN routing, such as global routing or an L3VPN routing control plane. The key property that differentiates this set of use cases from the EVPN-IRB use cases discussed earlier is that the EVPN control plane is used only to enable LAG-interface-based multi-homing and NOT as an overlay VPN control plane. The applicability of the weighted ECMP procedures proposed in this document to this set of use cases is an area of further consideration, beyond the scope of this document.

12.  Operational Considerations

None.

13.  Security Considerations

This document raises no new security issues for EVPN.

14.  IANA Considerations

[RFC8584] defines a new extended community for PEs within a redundancy group to signal and agree on a uniform DF Election Type and Capabilities for each ES. This document requests IANA to allocate a bit in the "DF Election capabilities" registry set up by [RFC8584]:

   Bit 4: BW (Bandwidth Weighted DF Election)

A new EVPN Link Bandwidth extended community is defined to signal the local ES link bandwidth to ingress PEs. This extended community is of type 0x06 (EVPN). IANA is requested to assign a sub-type value of 0x10 for the EVPN Link Bandwidth extended community. The EVPN Link Bandwidth extended community is defined as transitive.

IANA is requested to set up a registry called "Value-Units" for the 1-octet field in the EVPN Link Bandwidth Extended Community. New registrations will be made through the "RFC Required" procedure defined in [RFC8126]. The following initial values exist in that registry:

   Value    Name                       Reference
   -----    ------------------------   -------------
   0        Weight in units of Mbps    This document
   1        Generalized Weight         This document
   2-255    Unassigned
15.  Acknowledgements

The authors would like to thank Satya Mohanty for his valuable review of, and inputs on, the HRW and weighted HRW algorithm refinements proposed in this document. The authors would also like to thank Bruno Decraene and Sergey Fomin for their valuable reviews and comments.

16.  Contributors

   Satya Ranjan Mohanty
   Cisco Systems
   US
   Email: satyamoh@cisco.com

17.  References

17.1.  Normative References

   [EVPN-DF-PREF]
              Rabadan, J., Sathappan, S., Przygienda, T., Lin, W.,
              Drake, J., Sajassi, A., and S. Mohanty, "Preference-based
              EVPN DF Election", Work in Progress, Internet-Draft,
              draft-ietf-bess-evpn-pref-df-06, 19 June 2020.

   [EVPN-PER-MCAST-FLOW-DF]
              Sajassi, A., Mishra, M., Thoria, S., Rabadan, J., and J.
              Drake, "Per multicast flow Designated Forwarder Election
              for EVPN", Work in Progress, Internet-Draft,
              draft-ietf-bess-evpn-per-mcast-flow-df-election-04,
              31 August 2020.

   [EVPN-VIRTUAL-ES]
              Sajassi, A., Brissette, P., Schell, R., Drake, J., and J.
              Rabadan, "EVPN Virtual Ethernet Segment", Work in
              Progress, Internet-Draft,
              draft-ietf-bess-evpn-virtual-eth-segment-06, 9 March 2020.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC7814]  Xu, X., Jacquenet, C., Raszuk, R., Boyes, T., and B. Fee,
              "Virtual Subnet: A BGP/MPLS IP VPN-Based Subnet Extension
              Solution", RFC 7814, DOI 10.17487/RFC7814, March 2016,
              <https://www.rfc-editor.org/info/rfc7814>.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017,
              <https://www.rfc-editor.org/info/rfc8126>.

   [RFC8584]  Rabadan, J., Ed., Mohanty, S., Sajassi, A., Drake, J.,
              Nagaraj, K., and S. Sathappan, "Framework for Ethernet
              VPN Designated Forwarder Election Extensibility",
              RFC 8584, DOI 10.17487/RFC8584, April 2019,
              <https://www.rfc-editor.org/info/rfc8584>.

17.2.  Informative References

   [BGP-LINK-BW]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", Work in Progress, Internet-Draft,
              draft-ietf-idr-link-bandwidth-07, March 2019.

Authors' Addresses

   Neeraj Malhotra (editor)
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   United States of America
   Email: nmalhotr@cisco.com

   Ali Sajassi
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   United States of America
   Email: sajassi@cisco.com

   Jorge Rabadan
   Nokia
   777 E. Middlefield Road
   Mountain View, CA 94043
   United States of America
   Email: jorge.rabadan@nokia.com

   John Drake
   Juniper
   Email: jdrake@juniper.net

   Avinash Lingala
   ATT
   200 S. Laurel Avenue
   Middletown, NJ 07748
   United States of America
   Email: ar977m@att.com

   Samir Thoria
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   United States of America
   Email: sthoria@cisco.com