BESS Working Group                                      N. Malhotra, Ed.
Internet-Draft                                                A. Sajassi
Intended Status: Proposed Standard                             S. Thoria
                                                                   Cisco

                                                              J. Rabadan
                                                                   Nokia

                                                                J. Drake
                                                                 Juniper

                                                              A. Lingala
                                                                    AT&T

Expires: May 5, 2020                                         Nov 2, 2019


     Weighted Multi-Path Procedures for EVPN All-Active Multi-Homing
                   draft-ietf-bess-evpn-unequal-lb-03

Abstract

   In an EVPN-IRB based network overlay, EVPN all-active multi-homing
   enables multi-homing for a CE device connected to two or more PEs via
   a LAG bundle, such that bridged and routed traffic from remote PEs
   can be equally load balanced (ECMPed) across the multi-homing PEs.
   This document defines extensions to EVPN procedures to optimally
   handle unequal access bandwidth distribution across a set of multi-
   homing PEs in order to:

   o  provide greater flexibility, with respect to adding or removing
      individual PE-CE links within the access LAG

   o  handle PE-CE LAG member link failures that can result in unequal
      PE-CE access bandwidth across a set of multi-homing PEs

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-
   Drafts as reference material or to cite them other than as "work
   in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided
   without warranty as described in the Simplified BSD License.

Table of Contents

   1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
     1.1  PE CE Link Provisioning . . . . . . . . . . . . . . . . .  5
     1.2  PE CE Link Failures . . . . . . . . . . . . . . . . . . .  6
     1.3  Design Requirement  . . . . . . . . . . . . . . . . . . .  7
     1.4  Terminology . . . . . . . . . . . . . . . . . . . . . . .  7
   2.  Solution Overview  . . . . . . . . . . . . . . . . . . . . .  8
   3.  Weighted Unicast Traffic Load-balancing  . . . . . . . . . .  8
     3.1  LOCAL PE Behavior . . . . . . . . . . . . . . . . . . . .  8
       3.1.1  Link Bandwidth Extended Community . . . . . . . . . .  8
     3.2  REMOTE PE Behavior  . . . . . . . . . . . . . . . . . . .  9
   4.  Weighted BUM Traffic Load-Sharing  . . . . . . . . . . . . . 10
     4.1  The BW Capability in the DF Election Extended Community . 10
     4.2  BW Capability and Default DF Election algorithm . . . . . 11
     4.3  BW Capability and HRW DF Election algorithm (Type 1 and
          4)  . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
       4.3.1  BW Increment  . . . . . . . . . . . . . . . . . . . . 11
       4.3.2  HRW Hash Computations with BW Increment . . . . . . . 12
       4.3.3  Cost-Benefit Tradeoff on Link Failures  . . . . . . . 13
     4.4  BW Capability and Weighted HRW DF Election algorithm
          (Type TBD)  . . . . . . . . . . . . . . . . . . . . . . . 14
     4.5  BW Capability and Preference DF Election algorithm  . . . 15
   5.  Real-time Available Bandwidth  . . . . . . . . . . . . . . . 16
   6.  Routed EVPN Overlay  . . . . . . . . . . . . . . . . . . . . 16
   7.  EVPN-IRB Multi-homing with non-EVPN routing  . . . . . . . . 17
   8.  References . . . . . . . . . . . . . . . . . . . . . . . . . 18
     8.1  Normative References  . . . . . . . . . . . . . . . . . . 18
     8.2  Informative References  . . . . . . . . . . . . . . . . . 18
   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 19
   10. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 19
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 19

1  Introduction

   In an EVPN-IRB based network overlay, with a CE multi-homed via EVPN
   all-active multi-homing, bridged and routed traffic from remote PEs
   can be equally load balanced (ECMPed) across the multi-homing PEs:

   o  ECMP load-balancing for bridged unicast traffic is enabled via
      aliasing and mass-withdraw procedures detailed in RFC 7432.

   o  ECMP load-balancing for routed unicast traffic is enabled via
      existing L3 ECMP mechanisms.
   o  Load-sharing of bridged BUM traffic on local ports is enabled
      via the EVPN DF election procedure detailed in RFC 7432.

   All of the above load-balancing and DF election procedures
   implicitly assume equal bandwidth distribution between the CE and
   the set of multi-homing PEs.  Essentially, with this assumption of
   equal "access" bandwidth distribution across all PEs, ALL remote
   traffic is equally load balanced across the multi-homing PEs.  This
   assumption of equal access bandwidth distribution can be
   restrictive with respect to adding / removing links in a multi-
   homed LAG interface and may also be easily broken on individual
   link failures.  A solution to handle unequal access bandwidth
   distribution across a set of multi-homing EVPN PEs is proposed in
   this document.  The primary motivation behind this proposal is to
   enable greater flexibility with respect to adding / removing member
   PE-CE links, as needed, and to optimally handle PE-CE link
   failures.

1.1  PE CE Link Provisioning

                     +------------------------+
                     | Underlay Network Fabric|
                     +------------------------+

                      +-----+       +-----+
                      | PE1 |       | PE2 |
                      +-----+       +-----+
                          \           /
                           \  ESI-1  /
                            \       /
                            +\-----/+
                            | \   / |
                            +---+---+
                                |
                               CE1

                             Figure 1

   Consider a CE1 that is dual-homed to PE1 and PE2 via EVPN all-
   active multi-homing with single member links of equal bandwidth to
   each PE (aka, equal access bandwidth distribution across PE1 and
   PE2).  If the provider wants to increase link bandwidth to CE1, it
   MUST add a link to both PE1 and PE2 in order to maintain equal
   access bandwidth distribution and inter-work with EVPN ECMP load-
   balancing.  In other words, for a dual-homed CE, the total number
   of CE links must be provisioned in multiples of 2 (2, 4, 6, and so
   on).  For a triple-homed CE, the number of CE links must be
   provisioned in multiples of three (3, 6, 9, and so on).
   To generalize, for a CE that is multi-homed to "n" PEs, the number
   of PE-CE physical links provisioned must be an integral multiple of
   "n".  This is restrictive in case of dual-homing and very quickly
   becomes prohibitive in case of multi-homing.

   Instead, a provider may wish to increase PE-CE bandwidth OR the
   number of links in ANY link increments.  As an example, for CE1
   dual-homed to PE1 and PE2 in all-active mode, the provider may wish
   to add a third link to ONLY PE1 to increase total bandwidth for
   this CE by 50%, rather than being required to increase access
   bandwidth by 100% by adding a link to each of the two PEs.  While
   existing EVPN based all-active load-balancing procedures do not
   necessarily preclude such asymmetric access bandwidth distribution
   among the PEs providing redundancy, it may result in unexpected
   traffic loss due to congestion in the access interface towards the
   CE.  This traffic loss is due to the fact that PE1 and PE2 will
   continue to attract equal amounts of CE1-destined traffic from
   remote PEs, even when PE2 has only half the bandwidth to CE1 that
   PE1 has.  This may lead to congestion and traffic loss on the
   PE2-CE1 link.  If bandwidth distribution to CE1 across PE1 and PE2
   is 2:1, traffic from remote hosts MUST also be load-balanced across
   PE1 and PE2 in a 2:1 manner.

1.2  PE CE Link Failures

   More importantly, the unequal PE-CE bandwidth distribution
   described above may occur during regular operation following a link
   failure, even when PE-CE links were provisioned to provide equal
   bandwidth distribution across multi-homing PEs.

                     +------------------------+
                     | Underlay Network Fabric|
                     +------------------------+

                      +-----+       +-----+
                      | PE1 |       | PE2 |
                      +-----+       +-----+
                         \\           //
                          \\  ESI-1  //
                           \\       /X
                           +\\-----//+
                           | \\   // |
                           +----+----+
                                |
                               CE1

   Consider a CE1 that is multi-homed to PE1 and PE2 via a link bundle
   with two member links to each PE.
   On a PE2-CE1 physical link failure, the link bundle represented by
   Ethernet Segment ESI-1 on PE2 stays up; however, its bandwidth is
   cut in half.  With existing ECMP procedures, both PE1 and PE2 will
   continue to attract equal amounts of traffic from remote PEs, even
   when PE1 has double the bandwidth to CE1.  If bandwidth
   distribution to CE1 across PE1 and PE2 is 2:1, traffic from remote
   hosts MUST also be load-balanced across PE1 and PE2 in a 2:1 manner
   to avoid unexpected congestion and traffic loss on PE2-CE1 links
   within the LAG.

1.3  Design Requirement

                    +-----------------------+
                    |Underlay Network Fabric|
                    +-----------------------+

        +-----+   +-----+         +-----+   +-----+
        | PE1 |   | PE2 |  .....  | PEx |   | PEn |
        +-----+   +-----+         +-----+   +-----+
            \        \              //        //
             \ L1     \ L2         // Lx     // Ln
              \        \          //        //
             +-\--------\--------//--------//-+
             |  \        \ ESI-1//        //  |
             +--------------------------------+
                             |
                             CE

   To generalize, if total link bandwidth to a CE is distributed
   across "n" multi-homing PEs, with Lx being the number of links /
   bandwidth to PEx, traffic from remote PEs to this CE MUST be load-
   balanced unequally across [PE1, PE2, ....., PEn] such that the
   fraction of total unicast and BUM flows destined for the CE that
   are serviced by PEx is:

      Lx / [L1+L2+.....+Ln]

   The solution proposed below includes extensions to EVPN procedures
   to achieve the above.

1.4  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
   and "OPTIONAL" in this document are to be interpreted as described
   in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in
   all capitals, as shown here.

   "LOCAL PE" in the context of an ESI refers to a provider edge
   switch OR router that physically hosts the ESI.
   "REMOTE PE" in the context of an ESI refers to a provider edge
   switch OR router in an EVPN overlay, whose overlay reachability to
   the ESI is via the LOCAL PE.

2.  Solution Overview

   In order to achieve weighted load balancing for overlay unicast
   traffic, the Ethernet A-D per-ES route (EVPN Route Type 1) is
   leveraged to signal the Ethernet Segment bandwidth to remote PEs.
   Using the Ethernet A-D per-ES route to signal the Ethernet Segment
   bandwidth provides a mechanism to react to changes in access
   bandwidth in a service and host independent manner.  Remote PEs
   computing the MAC path-lists based on global and aliasing Ethernet
   A-D routes now have the ability to set up weighted load-balancing
   path-lists based on the ESI access bandwidth received from each PE
   that the ESI is multi-homed to.  If the Ethernet A-D per-ES route
   is also leveraged for IP path-list computation, as per
   [EVPN-IP-ALIASING], it also provides a method to do weighted load-
   balancing for IP routed traffic.

   In order to achieve weighted load-balancing of overlay BUM traffic,
   the EVPN ES route (Route Type 4) is leveraged to signal the ESI
   bandwidth to PEs within an ESI's redundancy group to influence per-
   service DF election.  PEs in an ESI redundancy group now have the
   ability to do service carving in proportion to each PE's relative
   ESI bandwidth.

   Procedures to accomplish this are described in greater detail next.

3.  Weighted Unicast Traffic Load-balancing

3.1  LOCAL PE Behavior

   A PE that is part of an Ethernet Segment's redundancy group would
   advertise an additional "link bandwidth" EXT-COMM attribute with
   the Ethernet A-D per-ES route (EVPN Route Type 1) that represents
   the total bandwidth of the PE's physical links in the Ethernet
   Segment.  The BGP link bandwidth EXT-COMM defined in [BGP-LINK-BW]
   is re-used for this purpose.
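   As a non-normative illustration, the extended community value
   carried with the route could be packed as sketched below.  The
   type/sub-type values and the 2-octet-AS plus 4-octet IEEE floating
   point layout follow [BGP-LINK-BW]; the function name and the example
   AS and bandwidth values are purely illustrative.

```python
import struct

LINK_BW_TYPE = 0x40      # optional non-transitive, per [BGP-LINK-BW]
LINK_BW_SUBTYPE = 0x04   # link bandwidth sub-type

def encode_link_bandwidth(asn: int, bw_bytes_per_sec: float) -> bytes:
    """Pack the 8-octet link bandwidth extended community:
    1-octet type, 1-octet sub-type, 2-octet AS number, and the
    bandwidth as a 4-octet IEEE float in bytes per second."""
    return struct.pack("!BBHf", LINK_BW_TYPE, LINK_BW_SUBTYPE,
                       asn, bw_bytes_per_sec)

# Example: a PE with 2 x 1GE member links in the ES advertises
# roughly 250,000,000 bytes/sec (values are illustrative).
ec = encode_link_bandwidth(64512, 250e6)
assert len(ec) == 8
```

   Note that, as discussed in the next section, the community is
   defined as non-transitive, which matters in inter-AS deployments.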
3.1.1  Link Bandwidth Extended Community

   The link bandwidth extended community described in [BGP-LINK-BW]
   for layer 3 VPNs is re-used here to signal local ES link bandwidth
   to remote PEs.  The link-bandwidth extended community is, however,
   defined in [BGP-LINK-BW] as optional non-transitive.  In inter-AS
   scenarios, link-bandwidth may need to be signaled to an eBGP
   neighbor along with next-hop unchanged.  Allowing this attribute to
   be used as transitive in inter-AS scenarios is work in progress
   with the authors of [BGP-LINK-BW].

3.2  REMOTE PE Behavior

   A receiving PE should use the per-ES link bandwidth attribute
   received from each PE to compute a relative weight for each remote
   PE, per-ES, as shown below.

   if,

      L(x,y) : link bandwidth advertised by PE-x for ESI-y

      W(x,y) : normalized weight assigned to PE-x for ESI-y

      H(y)   : Highest Common Factor (HCF) of
               [L(1,y), L(2,y), ....., L(n,y)]

   then, the normalized weight assigned to PE-x for ESI-y may be
   computed as follows:

      W(x,y) = L(x,y) / H(y)

   For a MAC+IP route (EVPN Route Type 2) received with ESI-y, the
   receiving PE MUST compute the MAC and IP forwarding path-list
   weighted by the above normalized weights.

   As an example, for a CE multi-homed to PE-1, PE-2, and PE-3 via 2,
   1, and 1 GE physical links respectively, as part of a link bundle
   represented by ESI-10:

      L(1, 10) = 2000 Mbps

      L(2, 10) = 1000 Mbps

      L(3, 10) = 1000 Mbps

      H(10) = 1000

   Normalized weights assigned to each PE for ESI-10 are as follows:

      W(1, 10) = 2000 / 1000 = 2

      W(2, 10) = 1000 / 1000 = 1

      W(3, 10) = 1000 / 1000 = 1

   For a remote MAC+IP host route received with ESI-10, the forwarding
   load-balancing path-list must now be computed as
   [PE-1, PE-1, PE-2, PE-3] instead of [PE-1, PE-2, PE-3].
   This now results in load-balancing of all traffic destined for
   ESI-10 across the three multi-homing PEs in proportion to ESI-10
   bandwidth at each PE.

   The above weighted path-list computation MUST only be done for an
   ESI if a link bandwidth attribute is received from ALL of the PEs
   advertising reachability to that ESI via Ethernet A-D per-ES Route
   Type 1.  In the event that the link bandwidth attribute is not
   received from one or more PEs, the forwarding path-list would be
   computed using regular ECMP semantics.

4.  Weighted BUM Traffic Load-Sharing

   Optionally, load sharing of the per-service DF role, weighted by an
   individual PE's link-bandwidth share within a multi-homed ES, may
   also be achieved.

   In order to do that, a new DF Election Capability [RFC8584] called
   "BW" (Bandwidth Weighted DF Election) is defined.  BW may be used
   along with some DF Election Types, as described in the following
   sections.

4.1  The BW Capability in the DF Election Extended Community

   [RFC8584] defines a new extended community for PEs within a
   redundancy group to signal and agree on uniform DF Election Type
   and Capabilities for each ES.  This document requests a bit in the
   DF Election extended community Bitmap:

      Bit 28: BW (Bandwidth Weighted DF Election)

   ES routes advertised with the BW bit set will indicate the desire
   of the advertising PE to consider the link-bandwidth in the DF
   Election algorithm defined by the value in the "DF Type".

   As per [RFC8584], all the PEs in the ES MUST advertise the same
   Capabilities and DF Type, otherwise the PEs will fall back to the
   Default [RFC7432] DF Election procedure.
   The BW Capability MAY be advertised with the following DF Types:

   o  Type 0: Default DF Election algorithm, as in [RFC7432]
   o  Type 1: HRW algorithm, as in [RFC8584]
   o  Type 2: Preference algorithm, as in [EVPN-DF-PREF]
   o  Type 4: HRW per-multicast flow DF Election, as in
      [EVPN-PER-MCAST-FLOW-DF]

   The following sections describe how the DF Election procedures are
   modified for the above DF Types when the BW Capability is used.

4.2  BW Capability and Default DF Election algorithm

   When all the PEs in the Ethernet Segment (ES) agree to use the BW
   Capability with DF Type 0, the Default DF Election procedure is
   modified as follows:

   o  Each PE advertises a "Link Bandwidth" EXT-COMM attribute along
      with the ES route to signal the PE-CE link bandwidth (LBW) for
      the ES.
   o  A receiving PE MUST use the ES link bandwidth attribute
      received from each PE to compute a relative weight for each
      remote PE.
   o  The DF Election procedure MUST now use this weighted list of
      PEs to compute the per-VLAN Designated Forwarder, such that the
      DF role is distributed in proportion to this normalized weight.

   Considering the same example as in Section 3, the candidate PE
   list for DF election is:

      [PE-1, PE-1, PE-2, PE-3]

   The DF for a given VLAN-a on ES-10 is now computed as (VLAN-a % 4).
   This would result in the DF role being distributed across PE1, PE2,
   and PE3 in proportion to each PE's normalized weight for ES-10.

4.3  BW Capability and HRW DF Election algorithm (Type 1 and 4)

   [RFC8584] introduces the Highest Random Weight (HRW) algorithm (DF
   Type 1) for DF election in order to solve potential DF election
   skew depending on Ethernet tag space distribution.
   [EVPN-PER-MCAST-FLOW-DF] further extends the HRW algorithm for
   per-multicast flow based hash computations (DF Type 4).
   This section describes extensions to the HRW algorithm for EVPN DF
   Election specified in [RFC8584] and in [EVPN-PER-MCAST-FLOW-DF] in
   order to achieve DF election distribution that is weighted by link
   bandwidth.

4.3.1  BW Increment

   A new variable called "bandwidth increment" is computed for each
   [PE, ES] advertising the ES link bandwidth attribute as follows:

   In the context of an ES,

      L(i)   = link bandwidth advertised by PE(i) for this ES

      L(min) = lowest link bandwidth advertised across all PEs for
               this ES

   The bandwidth increment, "b(i)", for a given PE(i) advertising a
   link bandwidth of L(i) is defined as an integer value computed as:

      b(i) = L(i) / L(min)

   As an example,

      with PE(1) = 10, PE(2) = 10, PE(3) = 20

   the bandwidth increment for each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 2

      with PE(1) = 10, PE(2) = 10, PE(3) = 10

   the bandwidth increment for each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 1

   Note that the bandwidth increment must always be an integer,
   including in the unlikely scenario of a PE's link bandwidth not
   being an exact multiple of L(min).  If it computes to a non-integer
   value (including as a result of link failure), it MUST be rounded
   down to an integer.

4.3.2  HRW Hash Computations with BW Increment

   The HRW algorithm, as described in [RFC8584] and in
   [EVPN-PER-MCAST-FLOW-DF], computes a random hash value (referred
   to as affinity here) for each PE(i), where (0 < i <= N), PE(i) is
   the PE at ordinal i, and Address(i) is the IP address of the PE at
   ordinal i.

   For 'N' PEs sharing an Ethernet segment, this results in 'N'
   candidate hash computations.  The PE that has the highest hash
   value is selected as the DF.

   The affinity computation for each PE(i) is extended such that one
   affinity is computed per bandwidth increment associated with
   PE(i), instead of a single affinity computation per PE(i).
   PE(i) with b(i) = j results in j affinity computations:

      affinity(i, x), where 1 <= x <= j

   This essentially results in a number of candidate HRW hash
   computations for each PE that is directly proportional to that
   PE's relative bandwidth within an ES, and hence gives PE(i) a
   probability of being DF in proportion to its relative bandwidth
   within an ES.

   As an example, consider an ES that is multi-homed to two PEs, PE1
   and PE2, with equal bandwidth distribution across PE1 and PE2.
   This would result in a total of two candidate hash computations:

      affinity(PE1, 1)

      affinity(PE2, 1)

   Now, consider a scenario with PE1's link bandwidth 2x that of PE2.
   This would result in a total of three candidate hash computations
   to be used for DF election:

      affinity(PE1, 1)

      affinity(PE1, 2)

      affinity(PE2, 1)

   which would give PE1 a 2/3 probability of getting elected as the
   DF, in proportion to its relative bandwidth in the ES.

   Depending on the chosen HRW hash function, the affinity function
   MUST be extended to include the bandwidth increment in the
   computation.
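   The bandwidth increment and the per-increment candidate
   computations above can be sketched as follows.  This is an
   illustrative, non-normative sketch: a generic SHA-256 based hash
   stands in for the actual affinity functions of [RFC8584] and
   [EVPN-PER-MCAST-FLOW-DF], and the function names and PE addresses
   are hypothetical.

```python
import hashlib

def bw_increment(bw_by_pe):
    """Per Section 4.3.1: b(i) = L(i) / L(min), rounded down."""
    l_min = min(bw_by_pe.values())
    return {pe: int(bw // l_min) for pe, bw in bw_by_pe.items()}

def affinity(vlan, pe_addr, x):
    # Simplified stand-in for the affinity functions of [RFC8584] /
    # [EVPN-PER-MCAST-FLOW-DF]; any deterministic hash agreed upon by
    # all PEs serves for illustration.
    digest = hashlib.sha256(f"{vlan}|{pe_addr}|{x}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def elect_df(vlan, bw_by_pe):
    """Highest affinity wins; PE(i) gets b(i) candidate computations."""
    b = bw_increment(bw_by_pe)
    return max(((affinity(vlan, pe, x), pe)
                for pe in bw_by_pe
                for x in range(1, b[pe] + 1)))[1]

# Example from Section 4.3.1: PE(3) advertises 2x bandwidth, so it
# gets two candidate affinity computations per VLAN.
print(bw_increment({"192.0.2.1": 10, "192.0.2.2": 10, "192.0.2.3": 20}))
# {'192.0.2.1': 1, '192.0.2.2': 1, '192.0.2.3': 2}
```

   Since every PE runs the same deterministic computation over the
   same advertised values, all PEs independently arrive at the same
   DF for a given VLAN.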
   For example, the affinity function specified in
   [EVPN-PER-MCAST-FLOW-DF] MAY be extended as follows to incorporate
   bandwidth increment j:

      affinity(S,G,V, ESI, Address(i,j)) =
         (1103515245.((1103515245.Address(i).j + 12345) XOR
         D(S,G,V,ESI))+12345) (mod 2^31)

   The affinity or random function specified in [RFC8584] MAY be
   extended as follows to incorporate bandwidth increment j:

      affinity(v, Es, Address(i,j)) =
         (1103515245.((1103515245.Address(i).j + 12345) XOR
         D(v,Es))+12345) (mod 2^31)

4.3.3  Cost-Benefit Tradeoff on Link Failures

   While incorporating link bandwidth into the DF election process
   provides optimal BUM traffic distribution across the ES links, it
   also implies that affinity values for a given PE are re-computed,
   and DF elections are re-adjusted, on changes to that PE's
   bandwidth increment that might result from link failures or link
   additions.  If the operator does not wish to have this level of
   churn in their DF election, then they should not advertise the BW
   capability.  Not advertising the BW capability may result in less
   than optimal BUM traffic distribution while still retaining the
   ability to allow a remote ingress PE to do weighted ECMP for its
   unicast traffic to a set of multi-homed PEs, as described in
   Section 3.2.

   The same also applies to the use of the BW capability with service
   carving (DF Type 0), as specified in Section 4.2.

4.4  BW Capability and Weighted HRW DF Election algorithm (Type TBD)

   Use of the BW capability together with the HRW DF election
   algorithm described in the previous section has a few limitations:

   o  While in most scenarios a change in BW for a given PE results
      in re-assignment of DF roles from or to that PE, in certain
      scenarios, a change in PE BW can result in complete
      re-assignment of DF roles.
   o  If the BW values advertised by a set of PEs do not share a
      large common factor, the resulting BW increments may be high
      for each PE, and hence may result in a higher order of
      computational complexity.

   The [WEIGHTED-HRW] document describes an alternate DF election
   algorithm that uses a weighted score function and is minimally
   disruptive, in that it minimizes the probability of complete
   re-assignment of DF roles in a BW change scenario.  It also does
   not require multiple BW increment based computations.

   Instead of computing a BW increment and an HRW hash for each [PE,
   BW increment], a single weighted score is computed for each PE
   using the proposed score function, with the absolute BW advertised
   by each PE as its weight value.

   As described in section 4 of [WEIGHTED-HRW], an HRW hash
   computation for each PE is converted to a weighted score as
   follows:

      Score(Oi, Sj) = -wj / log(Hash(Oi, Sj) / Hmax)

   where:

      Hmax is the maximum hash value;

      Oi is the object being assigned, e.g., a VLAN ID in this case;

      Sj is the server, e.g., a PE IP address in this case;

      wj is the weight, e.g., the advertised link bandwidth in this
      case.

   Object Oi is assigned to the server Sj with the highest score.

4.5  BW Capability and Preference DF Election algorithm

   This section applies to ESes where all the PEs in the ES agree to
   use the BW Capability with DF Type 2.  The BW Capability modifies
   the Preference DF Election procedure [EVPN-DF-PREF] by adding the
   LBW value as a tie-breaker as follows:

   o  Section 4.1, bullet (f) in [EVPN-DF-PREF] now considers the LBW
      value:

      f) In case of equal Preference in two or more PEs in the ES,
         the tie-breakers will be the DP bit, the LBW value, and the
         lowest PE IP address, in that order.  For instance:

         o  If vES1 parameters were [Pref=500,DP=0,LBW=1000] in PE1
            and [Pref=500,DP=1,LBW=2000] in PE2, PE2 would be elected
            due to the DP bit.
         o  If vES1 parameters were [Pref=500,DP=0,LBW=1000] in PE1
            and [Pref=500,DP=0,LBW=2000] in PE2, PE2 would be elected
            due to a higher LBW, even if PE1's IP address is lower.

   o  The exchanged LBW value has no impact on the Non-Revertive
      option described in [EVPN-DF-PREF].

5.  Real-time Available Bandwidth

   PE-CE link bandwidth availability may sometimes vary in real time
   disproportionately across PE-CE links within a multi-homed ESI due
   to various factors such as flow based hashing combined with fat
   flows and unbalanced hashing.  Reacting to real-time available
   bandwidth is at this time outside the scope of this document.
   Procedures described in this document are strictly based on a
   static link bandwidth parameter.

6.  Routed EVPN Overlay

   An additional use case is possible, such that traffic to an end
   host in the overlay is always IP routed.  In a purely routed
   overlay such as this:

   o  A host MAC is never advertised in the EVPN overlay control
      plane
   o  Host /32 or /128 IP reachability is distributed across the
      overlay via EVPN route type 5 (RT-5), along with a zero or
      non-zero ESI
   o  An overlay IP subnet may still be stretched across the underlay
      fabric; however, intra-subnet traffic across the stretched
      overlay is never bridged
   o  Both inter-subnet and intra-subnet traffic in the overlay is
      IP routed at the EVPN GW

   Please refer to [RFC7814] for more details.

   The weighted multi-path procedure described in this document may
   be used together with procedures described in [EVPN-IP-ALIASING]
   for this use case.  The Ethernet A-D per-ES route advertised with
   Layer 3 VRF RTs would be used to signal the ES link bandwidth
   attribute, instead of the Ethernet A-D per-ES route with Layer 2
   VRF RTs.  All other procedures described earlier in this document
   would apply as is.
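   The weighted path-list expansion of Section 3.2, which applies
   unchanged to the routed paths here, can be sketched as follows
   (illustrative only; the function name and the PE names and
   bandwidth values are hypothetical):

```python
from functools import reduce
from math import gcd

def weighted_path_list(bw_by_pe):
    """Per Section 3.2: W(x,y) = L(x,y) / H(y), where H(y) is the
    highest common factor of all advertised bandwidths; each PE is
    repeated W times in the forwarding path-list."""
    h = reduce(gcd, bw_by_pe.values())
    return [pe for pe, bw in bw_by_pe.items() for _ in range(bw // h)]

# Example from Section 3.2: 2 GE / 1 GE / 1 GE member links on ESI-10.
paths = weighted_path_list({"PE-1": 2000, "PE-2": 1000, "PE-3": 1000})
print(paths)   # ['PE-1', 'PE-1', 'PE-2', 'PE-3']
```

   A hash over the flow then picks an entry from this expanded list,
   yielding 2:1:1 traffic distribution without per-flow state.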
   If [EVPN-IP-ALIASING] is not used for routed fast convergence, the
   link bandwidth attribute may still be advertised with IP routes
   (RT-5) to achieve PE-CE link bandwidth based load-balancing as
   described in this document.  In the absence of
   [EVPN-IP-ALIASING], re-balancing of traffic following changes in
   PE-CE link bandwidth will require all IP routes from that CE to be
   re-advertised in a prefix dependent manner.

7.  EVPN-IRB Multi-homing with non-EVPN routing

   EVPN-LAG based multi-homing on an IRB gateway may also be deployed
   together with non-EVPN routing, such as global routing or an L3VPN
   routing control plane.  The key property that differentiates this
   set of use cases from the EVPN-IRB use cases discussed earlier is
   that the EVPN control plane is used only to enable LAG interface
   based multi-homing and NOT as an overlay VPN control plane.  The
   EVPN control plane in this case enables:

   o  DF election via EVPN RT-4 based procedures described in
      [RFC7432]
   o  LOCAL MAC sync across multi-homing PEs via EVPN RT-2
   o  LOCAL ARP and ND sync across multi-homing PEs via EVPN RT-2

   Applicability of the weighted ECMP procedures proposed in this
   document to this set of use cases is an area of further
   consideration.

8.  References

8.1  Normative References

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-
              Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432,
              February 2015.

   [BGP-LINK-BW]  Mohapatra, P. and Fernando, R., "BGP Link Bandwidth
              Extended Community", March 2018.

   [EVPN-IP-ALIASING]  Sajassi, A. and Badoni, G., "L3 Aliasing and
              Mass Withdrawal Support for EVPN", July 2017.

   [EVPN-DF-PREF]  Rabadan, J., Sathappan, S., Przygienda, T., Lin,
              W., Drake, J., Sajassi, A., and S. Mohanty,
              "Preference-based EVPN DF Election", internet-draft
              ietf-bess-evpn-pref-df-01.txt, April 2018.
   [EVPN-PER-MCAST-FLOW-DF]  Sajassi, A., et al., "Per multicast flow
              Designated Forwarder Election for EVPN", March 2018.

   [RFC8584]  Rabadan, J., Mohanty, S., et al., "Framework for
              Ethernet VPN Designated Forwarder Election
              Extensibility", RFC 8584, April 2019.

   [WEIGHTED-HRW]  Mohanty, S., et al., "Weighted HRW and its
              applications", September 2019.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, May 2017.

8.2  Informative References

   [RFC7814]  Xu, X., Jacquenet, C., Raszuk, R., Boyes, T., and B.
              Fee, "Virtual Subnet: A BGP/MPLS IP VPN-Based Subnet
              Extension Solution", RFC 7814, DOI 10.17487/RFC7814,
              March 2016.

9.  Acknowledgements

   The authors would like to thank Satya Mohanty for valuable review
   and inputs with respect to the HRW and weighted HRW algorithm
   refinements proposed in this document.

10.  Contributors

   Satya Ranjan Mohanty
   Cisco
   Email: satyamoh@cisco.com

Authors' Addresses

   Neeraj Malhotra, Editor.
   Cisco
   Email: neeraj.ietf@gmail.com

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Jorge Rabadan
   Nokia
   Email: jorge.rabadan@nokia.com

   John Drake
   Juniper
   Email: jdrake@juniper.net

   Avinash Lingala
   AT&T
   Email: ar977m@att.com

   Samir Thoria
   Cisco
   Email: sthoria@cisco.com