BESS WorkGroup                                          N. Malhotra, Ed.
Internet-Draft                                                A. Sajassi
Intended status: Standards Track                           Cisco Systems
Expires: August 22, 2021                                      J. Rabadan
                                                                   Nokia
                                                                J. Drake
                                                                 Juniper
                                                              A. Lingala
                                                                    AT&T
                                                               S. Thoria
                                                           Cisco Systems
                                                       February 18, 2021

    Weighted Multi-Path Procedures for EVPN All-Active Multi-Homing
                   draft-ietf-bess-evpn-unequal-lb-08

Abstract

   In an EVPN-IRB based network overlay, EVPN all-active multi-homing
   enables multi-homing for a CE device connected to two or more PEs
   via a LAG, such that bridged and routed traffic from remote PEs can
   be equally load balanced (ECMPed) across the multi-homing PEs.
   This document defines extensions to EVPN procedures to optimally
   handle unequal access bandwidth distribution across a set of
   multi-homing PEs in order to:

   o  provide greater flexibility, with respect to adding or removing
      individual PE-CE links within the access LAG.

   o  handle PE-CE LAG member link failures that can result in unequal
      PE-CE access bandwidth across a set of multi-homing PEs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on August 22, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Requirements Language and Terminology
   2.  Introduction
     2.1.  PE-CE Link Provisioning
     2.2.  PE-CE Link Failures
     2.3.  Design Requirement
   3.  Solution Overview
   4.  Weighted Unicast Traffic Load-balancing
     4.1.  Local PE Behavior
     4.2.  EVPN Link Bandwidth Extended Community
     4.3.  Remote PE Behavior
   5.  Weighted BUM Traffic Load-Sharing
     5.1.  The BW Capability in the DF Election Extended Community
     5.2.  BW Capability and Default DF Election algorithm
     5.3.  BW Capability and HRW DF Election algorithm (Type 1 and 4)
       5.3.1.  BW Increment
       5.3.2.  HRW Hash Computations with BW Increment
     5.4.  BW Capability and Preference DF Election algorithm
   6.  Cost-Benefit Tradeoff on Link Failures
   7.  Real-time Available Bandwidth
   8.  EVPN-IRB Multi-homing With Non-EVPN routing
   9.  Operational Considerations
   10. Security Considerations
   11. IANA Considerations
   12. Acknowledgements
   13. Contributors
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Authors' Addresses

1.  Requirements Language and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

   "Local PE" in the context of an ESI refers to a provider edge
   switch or router that physically hosts the ESI.

   "Remote PE" in the context of an ESI refers to a provider edge
   switch or router in an EVPN overlay, whose overlay reachability to
   the ESI is via the Local PE.
   The following abbreviations are used in this document:

   o  BW: Bandwidth

   o  LAG: Link Aggregation Group

   o  ES: Ethernet Segment

   o  vES: Virtual Ethernet Segment

   o  EVI: EVPN Instance; realized as a MAC-VRF

   o  IMET: Inclusive Multicast Ethernet Tag route

   o  DF: Designated Forwarder

   o  BDF: Backup Designated Forwarder

   o  DCI: Data Center Interconnect Router

2.  Introduction

   In an EVPN-IRB based network overlay, with a CE multi-homed to two
   or more PEs via EVPN all-active multi-homing, bridged and routed
   traffic from remote PEs can be equally load balanced (ECMPed)
   across the multi-homing PEs:

   o  ECMP load balancing for bridged unicast traffic is enabled via
      the aliasing and mass-withdraw procedures detailed in [RFC7432].

   o  ECMP load balancing for routed unicast traffic is enabled via
      existing L3 ECMP mechanisms.

   o  Load sharing of bridged BUM traffic on local ports is enabled
      via the EVPN DF election procedure detailed in [RFC7432].

   All of the above load balancing and DF election procedures
   implicitly assume equal bandwidth distribution between the CE and
   the set of multi-homing PEs.  Essentially, with this assumption of
   equal "access" bandwidth distribution across all PEs, ALL remote
   traffic is equally load balanced across the multi-homing PEs.
   This assumption of equal access bandwidth distribution can be
   restrictive with respect to adding or removing links in a
   multi-homed LAG interface and may also be easily broken by
   individual link failures.  This document proposes a solution to
   handle unequal access bandwidth distribution across a set of
   multi-homing EVPN PEs.  The primary motivation behind this
   proposal is to enable greater flexibility with respect to adding
   or removing member PE-CE links as needed, and to optimally handle
   PE-CE link failures.

2.1.  PE-CE Link Provisioning

                   +------------------------+
                   | Underlay Network Fabric|
                   +------------------------+

                      +-----+      +-----+
                      | PE1 |      | PE2 |
                      +-----+      +-----+
                          \           /
                           \  ESI-1  /
                            \       /
                            +\-----/+
                            | \   / |
                            +---+---+
                                |
                               CE1

                                Figure 1

   Consider CE1, dual-homed to PE1 and PE2 via EVPN all-active
   multi-homing, with a single member link of equal bandwidth to each
   PE (i.e., equal access bandwidth distribution across PE1 and PE2).
   If the provider wants to increase link bandwidth to CE1, it must
   add a link to both PE1 and PE2 in order to maintain equal access
   bandwidth distribution and inter-work with EVPN ECMP load
   balancing.  In other words, for a dual-homed CE, the total number
   of CE links must be provisioned in multiples of 2 (2, 4, 6, and so
   on).  For a triple-homed CE, the number of CE links must be
   provisioned in multiples of three (3, 6, 9, and so on).  To
   generalize, for a CE that is multi-homed to "n" PEs, the number of
   PE-CE physical links provisioned must be an integral multiple of
   "n".  This is restrictive in the case of dual-homing and very
   quickly becomes prohibitive in the case of multi-homing.

   Instead, a provider may wish to increase PE-CE bandwidth or the
   number of links in any increment.  As an example, for CE1
   dual-homed to PE1 and PE2 in all-active mode, the provider may
   wish to add a third link to only PE1 to increase the total
   bandwidth for this CE by 50%, rather than being required to
   increase access bandwidth by 100% by adding a link to each of the
   two PEs.
   While existing EVPN based all-active load balancing procedures do
   not necessarily preclude such asymmetric access bandwidth
   distribution among the PEs providing redundancy, it may result in
   unexpected traffic loss due to congestion on the access interface
   towards the CE.  This traffic loss is due to the fact that PE1 and
   PE2 will continue to be treated as equal cost paths at remote PEs,
   and as a result may attract approximately equal amounts of
   CE1-destined traffic, even when PE2 has only half the bandwidth to
   CE1 that PE1 has.  This may lead to congestion and traffic loss on
   the PE2-CE1 link.  If the bandwidth distribution to CE1 across PE1
   and PE2 is 2:1, traffic from remote hosts must also be load
   balanced across PE1 and PE2 in a 2:1 manner.

2.2.  PE-CE Link Failures

   More importantly, the unequal PE-CE bandwidth distribution
   described above may occur during regular operation following a
   link failure, even when PE-CE links were provisioned to provide
   equal bandwidth distribution across the multi-homing PEs.

                   +------------------------+
                   | Underlay Network Fabric|
                   +------------------------+

                      +-----+      +-----+
                      | PE1 |      | PE2 |
                      +-----+      +-----+
                        \\            //
                         \\  ESI-1   //
                          \\         /X
                          +\\------//+
                          | \\    // |
                          +----+-----+
                               |
                              CE1

                                Figure 2

   Consider a CE1 that is multi-homed to PE1 and PE2 via a LAG with
   two member links to each PE.  On a PE2-CE1 physical link failure,
   the LAG represented by Ethernet Segment ESI-1 on PE2 stays up;
   however, its bandwidth is cut in half.  With existing ECMP
   procedures, both PE1 and PE2 may continue to attract equal amounts
   of traffic from remote PEs, even though PE1 now has double the
   bandwidth to CE1.  If the bandwidth distribution to CE1 across PE1
   and PE2 is 2:1, traffic from remote hosts must also be load
   balanced across PE1 and PE2 in a 2:1 manner to avoid unexpected
   congestion and traffic loss on the PE2-CE1 links within the LAG.
   As an alternative, min-link on LAGs is sometimes used to bring
   down the LAG interface on member link failures.  This, however,
   results in loss of available bandwidth in the network and is not
   ideal.

2.3.  Design Requirement

              +-----------------------+
              |Underlay Network Fabric|
              +-----------------------+

        +-----+   +-----+       +-----+   +-----+
        | PE1 |   | PE2 | ..... | PEx |   | PEn |
        +-----+   +-----+       +-----+   +-----+
            \         \           //        //
             \ L1      \ L2      // Lx     // Ln
              \         \       //        //
             +-\---------\-----//--------//-+
             |  \         \ ESI-1//     //  |
             +------------------------------+
                             |
                             CE

                                Figure 3

   To generalize, if the total link bandwidth to a CE is distributed
   across "n" multi-homing PEs, with Lx being the total bandwidth to
   PEx across all links, traffic from remote PEs to this CE must be
   load balanced unequally across [PE1, PE2, ....., PEn] such that
   the fraction of total unicast and BUM flows destined for the CE
   that are serviced by PEx is:

      Lx / (L1 + L2 + ..... + Ln)

   The solution proposed below includes extensions to EVPN procedures
   to achieve the above.

3.  Solution Overview

   In order to achieve weighted load balancing for overlay unicast
   traffic, the Ethernet A-D per-ES route (EVPN Route Type 1) is
   leveraged to signal the Ethernet Segment bandwidth to remote PEs.
   Using the Ethernet A-D per-ES route to signal the Ethernet Segment
   bandwidth provides a mechanism to react to changes in access
   bandwidth in a service and host independent manner.
   Remote PEs computing the MAC path-lists based on global and
   aliasing Ethernet A-D routes now have the ability to set up
   weighted load-balancing path-lists based on the ESI access
   bandwidth received from each PE that the ESI is multi-homed to.

   In order to achieve weighted load sharing of overlay BUM traffic,
   the EVPN ES route (Route Type 4) is leveraged to signal the ESI
   bandwidth to PEs within an ESI's redundancy group, in order to
   influence the per-service DF election.  PEs in an ESI redundancy
   group now have the ability to do service carving in proportion to
   each PE's relative ESI bandwidth.

   Procedures to accomplish this are described in greater detail
   next.

4.  Weighted Unicast Traffic Load-balancing

4.1.  Local PE Behavior

   A PE that is part of an Ethernet Segment's redundancy group
   advertises an additional "link bandwidth" extended community
   attribute along with the Ethernet A-D per-ES route (EVPN Route
   Type 1); it represents the total bandwidth of the PE's physical
   links in that Ethernet Segment.  The EVPN Link Bandwidth extended
   community defined in Section 4.2 is used for this purpose.

4.2.  EVPN Link Bandwidth Extended Community

   A new EVPN Link Bandwidth extended community is defined to signal
   the local ES link bandwidth to remote PEs.  This extended
   community is defined as transitive, of type 0x06 (EVPN).  IANA is
   requested to assign a sub-type value of 0x10 for the EVPN Link
   Bandwidth extended community.

   The Link Bandwidth extended community described in [BGP-LINK-BW]
   for layer 3 VPNs was considered for re-use here; however, it is
   defined in [BGP-LINK-BW] as optional non-transitive.  Since it is
   not possible to change the deployed behavior of the extended
   community defined in [BGP-LINK-BW], it was decided to define a new
   one.  In inter-AS scenarios, link bandwidth needs to be signaled
   to eBGP neighbors.  When signaled across AS boundaries, this
   attribute can be used to achieve optimal load balancing towards
   source PEs in a different AS.  This applies whether or not the
   next hop is changed across AS boundaries.

4.3.  Remote PE Behavior

   A receiving PE SHOULD use the per-ES link bandwidth attribute
   received from each PE to compute a relative weight for each remote
   PE, per ES, and then use this relative weight to compute a
   weighted path-list to be used for load balancing, as opposed to
   using an ECMP path-list for load balancing across the PE paths.
   The PE weight and the resulting weighted path-list computation at
   remote PEs is a local matter.  An example computation algorithm is
   shown below to illustrate the idea:

   if,

      L(x,y) : link bandwidth advertised by PE-x for ESI-y

      W(x,y) : normalized weight assigned to PE-x for ESI-y

      H(y)   : Highest Common Factor (HCF) of
               [L(1,y), L(2,y), ....., L(n,y)]

   then, the normalized weight assigned to PE-x for ESI-y may be
   computed as follows:

      W(x,y) = L(x,y) / H(y)

   For a MAC+IP route (EVPN Route Type 2) received with ESI-y, the
   receiving PE may compute a MAC and IP forwarding path-list
   weighted by the above normalized weights.
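   The following Python sketch illustrates the example computation
   above.  It is illustrative only and not part of the specification;
   the function name and the input representation (a mapping of PE
   name to advertised link bandwidth in Mbps) are hypothetical:

      from functools import reduce
      from math import gcd

      def weighted_path_list(link_bw):
          """Return a load-balancing path-list for one ESI in which
          each PE-x appears W(x,y) = L(x,y) / H(y) times, where H(y)
          is the highest common factor of the advertised values."""
          hcf = reduce(gcd, link_bw.values())       # H(y)
          path_list = []
          for pe, bw in sorted(link_bw.items()):
              path_list.extend([pe] * (bw // hcf))  # W(x,y) entries
          return path_list

      # The ESI-10 example worked through below: PE-1 advertises
      # 2000 Mbps, while PE-2 and PE-3 advertise 1000 Mbps each.
      print(weighted_path_list({"PE-1": 2000, "PE-2": 1000,
                                "PE-3": 1000}))
      # -> ['PE-1', 'PE-1', 'PE-2', 'PE-3']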
   As an example, for a CE multi-homed to PE-1, PE-2, and PE-3 via 2,
   1, and 1 GE physical links respectively, as part of a LAG
   represented by ESI-10:

      L(1,10) = 2000 Mbps

      L(2,10) = 1000 Mbps

      L(3,10) = 1000 Mbps

      H(10) = 1000

   The normalized weights assigned to each PE for ESI-10 are as
   follows:

      W(1,10) = 2000 / 1000 = 2

      W(2,10) = 1000 / 1000 = 1

      W(3,10) = 1000 / 1000 = 1

   For a remote MAC+IP host route received with ESI-10, the
   forwarding load balancing path-list may now be computed as [PE-1,
   PE-1, PE-2, PE-3] instead of [PE-1, PE-2, PE-3].  This results in
   load balancing of all traffic destined for ESI-10 across the three
   multi-homing PEs in proportion to the ESI-10 bandwidth at each PE.

   Weighted path-list computation must only be done for an ESI if a
   link bandwidth attribute is received from all of the PEs
   advertising reachability to that ESI via Ethernet A-D per-ES
   routes (Route Type 1).  In the unlikely event that the link
   bandwidth attribute is not received from one or more of these PEs,
   the forwarding path-list should be computed using regular ECMP
   semantics.  Note that a default weight cannot be assumed for a PE
   that does not advertise its link bandwidth, as the weight to be
   used in path-list computation is relative.

5.  Weighted BUM Traffic Load-Sharing

   Optionally, load sharing of the per-service DF role, weighted by
   each PE's link bandwidth share within a multi-homed ES, may also
   be achieved.

   In order to do that, a new DF Election Capability [RFC8584] called
   "BW" (Bandwidth Weighted DF Election) is defined.  BW MAY be used
   along with some DF Election Types, as described in the following
   sections.

5.1.  The BW Capability in the DF Election Extended Community

   [RFC8584] defines a new extended community for PEs within a
   redundancy group to signal and agree on a uniform DF Election Type
   and Capabilities for each ES.  This document requests that IANA
   assign a bit in the DF Election extended community bitmap:

      Bit 28: BW (Bandwidth Weighted DF Election)

   ES routes advertised with the BW bit set indicate the desire of
   the advertising PE to consider the link bandwidth in the DF
   Election algorithm defined by the value in the "DF Type" field.

   As per [RFC8584], all the PEs in the ES MUST advertise the same
   Capabilities and DF Type; otherwise, the PEs fall back to the
   Default DF Election procedure of [RFC7432].

   The BW Capability MAY be advertised with the following DF Types:

   o  Type 0: Default DF Election algorithm, as in [RFC7432]

   o  Type 1: HRW algorithm, as in [RFC8584]

   o  Type 2: Preference algorithm, as in [EVPN-DF-PREF]

   o  Type 4: HRW per-multicast flow DF Election, as in
      [EVPN-PER-MCAST-FLOW-DF]

   The following sections describe how the DF Election procedures are
   modified for the above DF Types when the BW Capability is used.

5.2.  BW Capability and Default DF Election algorithm

   When all the PEs in the Ethernet Segment (ES) agree to use the BW
   Capability with DF Type 0, the Default DF Election procedure, as
   defined in [RFC7432], is modified as follows:

   o  Each PE advertises a "Link Bandwidth" extended community
      attribute along with the ES route to signal the PE-CE link
      bandwidth (LBW) for the ES.

   o  A receiving PE MUST use the ES link bandwidth attribute
      received from each PE to compute a relative weight for each
      remote PE.
   o  The DF Election procedure MUST now use this weighted list of
      PEs to compute the per-VLAN Designated Forwarder, such that the
      DF role is distributed in proportion to this normalized weight.
      As a result, a single PE may have multiple ordinals in the DF
      candidate PE list, and the 'N' used in the (V mod N) operation
      defined in [RFC7432] is modified to be the total number of
      ordinals instead of the total number of PEs.

   Considering the same example as in Section 4.3, the candidate PE
   list for DF election is:

      [PE-1, PE-1, PE-2, PE-3]

   The DF for a given VLAN-a on ESI-10 is now computed as (VLAN-a mod
   4).  This results in the DF role being distributed across PE1,
   PE2, and PE3 in proportion to each PE's normalized weight for
   ESI-10.

5.3.  BW Capability and HRW DF Election algorithm (Type 1 and 4)

   [RFC8584] introduces the Highest Random Weight (HRW) algorithm (DF
   Type 1) for DF election in order to solve potential DF election
   skew depending on the Ethernet tag space distribution.
   [EVPN-PER-MCAST-FLOW-DF] further extends the HRW algorithm for
   per-multicast flow based hash computations (DF Type 4).  This
   section describes extensions to the HRW algorithm for EVPN DF
   Election specified in [RFC8584] and in [EVPN-PER-MCAST-FLOW-DF] in
   order to achieve a DF election distribution that is weighted by
   link bandwidth.

5.3.1.  BW Increment

   A new variable called "bandwidth increment" is computed for each
   [PE, ES] advertising the ES link bandwidth attribute as follows:

   In the context of an ES,

      L(i)   = link bandwidth advertised by PE(i) for this ES

      L(min) = lowest link bandwidth advertised across all PEs for
               this ES

   The bandwidth increment, "b(i)", for a given PE(i) advertising a
   link bandwidth of L(i) is defined as an integer value computed as:

      b(i) = L(i) / L(min)

   As an example,

      with PE(1) = 10, PE(2) = 10, PE(3) = 20,

      the bandwidth increment for each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 2

      with PE(1) = 10, PE(2) = 10, PE(3) = 10,

      the bandwidth increment for each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 1

   Note that the bandwidth increment must always be an integer,
   including in the unlikely scenario where a PE's link bandwidth is
   not an exact multiple of L(min).  If it computes to a non-integer
   value (including as a result of a link failure), it MUST be
   rounded down to the nearest integer.

5.3.2.  HRW Hash Computations with BW Increment

   The HRW algorithm, as described in [RFC8584] and in
   [EVPN-PER-MCAST-FLOW-DF], computes a random hash value for each
   PE(i), where 0 < i <= N, PE(i) is the PE at ordinal i, and
   Address(i) is the IP address of PE(i).

   For 'N' PEs sharing an Ethernet Segment, this results in 'N'
   candidate hash computations.  The PE that has the highest hash
   value is selected as the DF.

   We refer to this hash value as "affinity" in this document.  The
   hash or affinity computation for each PE(i) is extended so that
   one affinity is computed per unit of bandwidth increment
   associated with PE(i), instead of a single affinity computation
   per PE(i).

   PE(i) with b(i) = j results in j affinity computations:

      affinity(i, x), where 1 <= x <= j

   This essentially results in a number of candidate HRW hash
   computations for each PE that is directly proportional to that
   PE's relative bandwidth within an ES, and hence gives PE(i) a
   probability of being elected DF in proportion to its relative
   bandwidth within the ES.
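   The following Python sketch makes the weighted HRW procedure
   above concrete.  It is illustrative only and not part of the
   specification; it assumes PE addresses and the digest D(v,Es)
   have already been reduced to integers, and its affinity function
   mirrors the extended [RFC8584]-style formula given later in this
   section:

      def bw_increment(link_bw):
          """b(i) = floor(L(i) / L(min)), per Section 5.3.1, for a
          mapping of PE address (an integer) to link bandwidth."""
          l_min = min(link_bw.values())
          return {addr: bw // l_min for addr, bw in link_bw.items()}

      def affinity(d, address, j):
          # Extended HRW affinity, with the bandwidth increment
          # index j folded into the hash; d is the digest D(v,Es).
          return (1103515245 * ((1103515245 * address * j + 12345)
                  ^ d) + 12345) % (2 ** 31)

      def elect_df(d, link_bw):
          """Pick the DF: the PE with the highest affinity over all
          of its bandwidth increment slots x, 1 <= x <= b(i)."""
          slots = ((affinity(d, addr, x), addr)
                   for addr, b in bw_increment(link_bw).items()
                   for x in range(1, b + 1))
          return max(slots)[1]

      # PE1 (two increment slots) wins for roughly 2/3 of the
      # (v, Es) digest space; PE2 (one slot) wins for the rest.
      df = elect_df(d=0x5EED,
                    link_bw={0x0A000001: 20, 0x0A000002: 10})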
   As an example, consider an ES that is multi-homed to two PEs, PE1
   and PE2, with equal bandwidth distribution across PE1 and PE2.
   This would result in a total of two candidate hash computations:

      affinity(PE1, 1)

      affinity(PE2, 1)

   Now, consider a scenario with PE1's link bandwidth being 2x that
   of PE2.  This would result in a total of three candidate hash
   computations to be used for DF election:

      affinity(PE1, 1)

      affinity(PE1, 2)

      affinity(PE2, 1)

   This would give PE1 a 2/3 probability of being elected DF, in
   proportion to its relative bandwidth in the ES.

   Depending on the chosen HRW hash function, the affinity function
   MUST be extended to include the bandwidth increment in the
   computation.

   For example, the affinity function specified in
   [EVPN-PER-MCAST-FLOW-DF] MAY be extended as follows to incorporate
   the bandwidth increment j:

      affinity(S, G, V, ESI, Address(i,j)) =
         (1103515245.((1103515245.Address(i).j + 12345) XOR
          D(S,G,V,ESI)) + 12345) (mod 2^31)

   The affinity or random function specified in [RFC8584] MAY be
   extended as follows to incorporate the bandwidth increment j:

      affinity(v, Es, Address(i,j)) =
         (1103515245.((1103515245.Address(i).j + 12345) XOR
          D(v,Es)) + 12345) (mod 2^31)

5.4.  BW Capability and Preference DF Election algorithm

   This section applies to ESes where all the PEs in the ES agree to
   use the BW Capability with DF Type 2.  The BW Capability modifies
   the Preference DF Election procedure [EVPN-DF-PREF] by adding the
   LBW value as a tie-breaker as follows:

   Section 4.1, bullet (f), in [EVPN-DF-PREF] now considers the LBW
   value:

      f) In case of equal Preference in two or more PEs in the ES,
      the tie-breakers will be the DP bit, the LBW value, and the
      lowest PE IP address, in that order.  For instance:

   o  If vES1 parameters were [Pref=500, DP=0, LBW=1000] in PE1 and
      [Pref=500, DP=1, LBW=2000] in PE2, PE2 would be elected due to
      the DP bit.

   o  If vES1 parameters were [Pref=500, DP=0, LBW=1000] in PE1 and
      [Pref=500, DP=0, LBW=2000] in PE2, PE2 would be elected due to
      a higher LBW, even if PE1's IP address is lower.

   o  The LBW exchanged value has no impact on the Non-Revertive
      option described in [EVPN-DF-PREF].

6.  Cost-Benefit Tradeoff on Link Failures

   While incorporating link bandwidth into the DF election process
   provides optimal BUM traffic distribution across the ES links, it
   also implies that DF elections are re-adjusted on link failures or
   bandwidth changes.  If the operator does not wish to have this
   level of churn in their DF election, then they should not
   advertise the BW capability.  Not advertising the BW capability
   may result in less than optimal BUM traffic distribution, while
   still retaining the ability for a remote ingress PE to do weighted
   ECMP for its unicast traffic to a set of multi-homed PEs.

7.  Real-time Available Bandwidth

   PE-CE link bandwidth availability may sometimes vary in real time
   disproportionately across the PE-CE links within a multi-homed
   ESI, due to various factors such as flow-based hashing combined
   with fat flows and unbalanced hashing.  Reacting to real-time
   available bandwidth is at this time outside the scope of this
   document.  Procedures described in this document are strictly
   based on the static link bandwidth parameter.
8.  EVPN-IRB Multi-homing With Non-EVPN routing

   EVPN-LAG based multi-homing on an IRB gateway may also be deployed
   together with non-EVPN routing, such as global routing or an L3VPN
   routing control plane.  The key property that differentiates this
   set of use cases from the EVPN-IRB use cases discussed earlier is
   that the EVPN control plane is used only to enable LAG interface
   based multi-homing and NOT as an overlay VPN control plane.  The
   EVPN control plane in this case enables:

   o  DF election via EVPN RT-4 based procedures described in
      [RFC7432]

   o  Local MAC sync across multi-homing PEs via EVPN RT-2

   o  Local ARP and ND sync across multi-homing PEs via EVPN RT-2

   Applicability of the weighted ECMP procedures proposed in this
   document to this set of use cases is an area of further
   consideration.

9.  Operational Considerations

   None.

10.  Security Considerations

   This document raises no new security issues for EVPN.

11.  IANA Considerations

   [RFC8584] defines a new extended community for PEs within a
   redundancy group to signal and agree on a uniform DF Election Type
   and Capabilities for each ES.  This document requests that IANA
   assign a bit in the DF Election extended community bitmap:

      Bit 28: BW (Bandwidth Weighted DF Election)

   A new EVPN Link Bandwidth extended community is defined in this
   document to signal the local ES link bandwidth to remote PEs.
   This extended community is defined as transitive, of type 0x06
   (EVPN).  IANA is requested to assign a sub-type value of 0x10 for
   the EVPN Link Bandwidth extended community.

12.  Acknowledgements

   The authors would like to thank Satya Mohanty for his valuable
   review and inputs with respect to the HRW and weighted HRW
   algorithm refinements proposed in this document.

13.  Contributors

   Satya Ranjan Mohanty
   Cisco Systems
   US

   Email: satyamoh@cisco.com

14.  References

14.1.  Normative References

   [EVPN-DF-PREF]
              Rabadan, J., Sathappan, S., Przygienda, T., Lin, W.,
              Drake, J., Sajassi, A., and S. Mohanty, "Preference-
              based EVPN DF Election", draft-ietf-bess-evpn-pref-df-06
              (work in progress), June 2020.

   [EVPN-PER-MCAST-FLOW-DF]
              Sajassi, A., Mishra, M., Thoria, S., Rabadan, J., and J.
              Drake, "Per multicast flow Designated Forwarder Election
              for EVPN", draft-ietf-bess-evpn-per-mcast-flow-df-
              election-04 (work in progress), August 2020.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-
              Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432,
              February 2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC8584]  Rabadan, J., Ed., Mohanty, S., Sajassi, A., Drake, J.,
              Nagaraj, K., and S. Sathappan, "Framework for Ethernet
              VPN Designated Forwarder Election Extensibility",
              RFC 8584, DOI 10.17487/RFC8584, April 2019,
              <https://www.rfc-editor.org/info/rfc8584>.

14.2.  Informative References

   [BGP-LINK-BW]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-07
              (work in progress), March 2019.
Authors' Addresses

   Neeraj Malhotra (editor)
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Email: nmalhotr@cisco.com

   Ali Sajassi
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Email: sajassi@cisco.com

   Jorge Rabadan
   Nokia
   777 E. Middlefield Road
   Mountain View, CA 94043
   USA

   Email: jorge.rabadan@nokia.com

   John Drake
   Juniper

   Email: jdrake@juniper.net

   Avinash Lingala
   AT&T
   200 S. Laurel Avenue
   Middletown, NJ 07748
   USA

   Email: ar977m@att.com

   Samir Thoria
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Email: sthoria@cisco.com