INTERNET-DRAFT                                         N. Malhotra, Ed.
                                                             A. Sajassi
Intended Status: Proposed Standard                                Cisco
                                                             J. Rabadan
                                                                  Nokia
                                                               J. Drake
                                                                Juniper
                                                             A. Lingala
                                                                   AT&T

Expires: Dec 28, 2018                                     June 26, 2018

    Weighted Multi-Path Procedures for EVPN All-Active Multi-Homing
                 draft-malhotra-bess-evpn-unequal-lb-02

Abstract

   In an EVPN-IRB based network overlay, EVPN LAG enables all-active
   multi-homing for a host or CE device connected to two or more PEs via
   a LAG bundle, such that bridged and routed traffic from remote PEs
   can be equally load balanced (ECMPed) across the multi-homing PEs.
   This document defines extensions to EVPN procedures to optimally
   handle unequal access bandwidth distribution across a set of multi-
   homing PEs in order to:

   o  provide greater flexibility with respect to adding or removing
      individual PE-CE links within the access LAG

   o  handle PE-CE LAG member link failures that can result in unequal
      PE-CE access bandwidth across a set of multi-homing PEs

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-
   Drafts as reference material or to cite them other than as "work
   in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided
   without warranty as described in the Simplified BSD License.

Table of Contents

   1  Introduction . . . . . . . . . . .
. . . . . . . . . . . . . .  4
   1.1  PE CE Link Provisioning . . . . . . . . . . . . . . . . . .  5
   1.2  PE CE Link Failures . . . . . . . . . . . . . . . . . . . .  6
   1.3  Design Requirement  . . . . . . . . . . . . . . . . . . . .  7
   1.4  Terminology . . . . . . . . . . . . . . . . . . . . . . . .  7
   2.  Solution Overview  . . . . . . . . . . . . . . . . . . . . .  8
   3.  Weighted Unicast Traffic Load-balancing  . . . . . . . . . .  8
     3.1  LOCAL PE Behavior . . . . . . . . . . . . . . . . . . . .  8
       3.1.1  Link Bandwidth Extended Community  . . . . . . . . . .  8
     3.2  REMOTE PE Behavior  . . . . . . . . . . . . . . . . . . .  9
   4.  Weighted BUM Traffic Load-Sharing  . . . . . . . . . . . . . 10
     4.1  The BW Capability in the DF Election Extended Community . 10
     4.2  BW Capability and Default DF Election algorithm . . . . . 11
     4.3  BW Capability and HRW DF Election algorithm (Type 1 and
          4)  . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
       4.3.1  BW Increment . . . . . . . . . . . . . . . . . . . . . 11
       4.3.2  HRW Hash Computations with BW Increment  . . . . . . . 12
       4.3.3  Cost-Benefit Tradeoff on Link Failures . . . . . . . . 13
     4.4  BW Capability and Preference DF Election algorithm  . . . 14
   5.  Real-time Available Bandwidth  . . . . . . . . . . . . . . . 15
   6.  Routed EVPN Overlay  . . . . . . . . . . . . . . . . . . . . 15
   7.  EVPN-IRB Multi-homing with non-EVPN routing  . . . . . . . . 16
   8.  References . . . . . . . . . . . . . . . . . . . . . . . . . 17
     8.1  Normative References  . . . . . . . . . . . . . . . . . . 17
     8.2  Informative References  . . . . . . . . . . . . . . . . . 17
   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 18
   10. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 18
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 18
1  Introduction

   In an EVPN-IRB based network overlay, with an access CE multi-homed
   via a LAG interface, bridged and routed traffic from remote PEs can
   be equally load balanced (ECMPed) across the multi-homing PEs:

   o  ECMP Load-balancing for bridged unicast traffic is enabled via
      aliasing and mass-withdraw procedures detailed in RFC 7432.

   o  ECMP Load-balancing for routed unicast traffic is enabled via
      existing L3 ECMP mechanisms.

   o  Load-sharing of bridged BUM traffic on local ports is enabled
      via the EVPN DF election procedure detailed in RFC 7432.

   All of the above load-balancing and DF election procedures implicitly
   assume equal bandwidth distribution between the CE and the set of
   multi-homing PEs.  Essentially, with this assumption of equal
   "access" bandwidth distribution across all PEs, ALL remote traffic is
   equally load balanced across the multi-homing PEs.  This assumption
   of equal access bandwidth distribution can be restrictive with
   respect to adding or removing links in a multi-homed LAG interface
   and may also be easily broken on individual link failures.  A
   solution to handle unequal access bandwidth distribution across a set
   of multi-homing EVPN PEs is proposed in this document.  The primary
   motivation behind this proposal is to enable greater flexibility with
   respect to adding or removing member PE-CE links as needed, and to
   optimally handle PE-CE link failures.

1.1  PE CE Link Provisioning

        +------------------------+
        | Underlay Network Fabric|
        +------------------------+

            +-----+   +-----+
            | PE1 |   | PE2 |
            +-----+   +-----+
               \         /
                \ ESI-1 /
                 \     /
                 +\---/+
                 | \ / |
                 +--+--+
                    |
                   CE1

                 Figure 1

   Consider CE1 dual-homed to PE1 and PE2 via EVPN-LAG with single
   member links of equal bandwidth to each PE (i.e., equal access
   bandwidth distribution across PE1 and PE2).
   If the provider wants to increase link bandwidth to CE1, it must add
   a link to both PE1 and PE2 in order to maintain equal access
   bandwidth distribution and inter-work with EVPN ECMP load-balancing.
   In other words, for a dual-homed CE, the total number of CE links
   must be provisioned in multiples of 2 (2, 4, 6, and so on).  For a
   triple-homed CE, the number of CE links must be provisioned in
   multiples of three (3, 6, 9, and so on).  To generalize, for a CE
   that is multi-homed to "n" PEs, the number of PE-CE physical links
   provisioned must be an integral multiple of "n".  This is restrictive
   in the case of dual-homing and very quickly becomes prohibitive in
   the case of multi-homing.

   Instead, a provider may wish to increase PE-CE bandwidth OR the
   number of links in ANY link increments.  As an example, for CE1 dual-
   homed to PE1 and PE2 in all-active mode, the provider may wish to add
   a third link to ONLY PE1 to increase total bandwidth for this CE by
   50%, rather than being required to increase access bandwidth by 100%
   by adding a link to each of the two PEs.  While existing EVPN based
   all-active load-balancing procedures do not necessarily preclude such
   asymmetric access bandwidth distribution among the PEs providing
   redundancy, it may result in unexpected traffic loss due to
   congestion in the access interface towards the CE.  This traffic loss
   is due to the fact that PE1 and PE2 will continue to attract equal
   amounts of CE1-destined traffic from remote PEs, even when PE2 has
   only half the bandwidth to CE1 that PE1 has.  This may lead to
   congestion and traffic loss on the PE2-CE1 link.  If the bandwidth
   distribution to CE1 across PE1 and PE2 is 2:1, traffic from remote
   hosts MUST also be load-balanced across PE1 and PE2 in a 2:1 manner.
1.2  PE CE Link Failures

   More importantly, the unequal PE-CE bandwidth distribution described
   above may occur during regular operation following a link failure,
   even when PE-CE links were provisioned to provide equal bandwidth
   distribution across multi-homing PEs.

        +------------------------+
        | Underlay Network Fabric|
        +------------------------+

            +-----+   +-----+
            | PE1 |   | PE2 |
            +-----+   +-----+
              \\         //
               \\ ESI-1 //
                \\     /X
                +\\---//+
                | \\ // |
                +---+---+
                    |
                   CE1

   Consider CE1 multi-homed to PE1 and PE2 via a link bundle with two
   member links to each PE.  On a PE2-CE1 physical link failure, the
   link bundle represented by ESI-1 on PE2 stays up; however, its
   bandwidth is cut in half.  With the existing ECMP procedures, both
   PE1 and PE2 will continue to attract equal amounts of traffic from
   remote PEs, even when PE1 has double the bandwidth to CE1.  If the
   bandwidth distribution to CE1 across PE1 and PE2 is 2:1, traffic from
   remote hosts MUST also be load-balanced across PE1 and PE2 in a 2:1
   manner to avoid unexpected congestion and traffic loss on PE2-CE1
   links within the LAG.

1.3  Design Requirement

           +-----------------------+
           |Underlay Network Fabric|
           +-----------------------+

   +-----+   +-----+       +-----+   +-----+
   | PE1 |   | PE2 |  .....
  | PEx |   | PEn |
   +-----+   +-----+       +-----+   +-----+
       \        \            //       //
        \ L1     \ L2       // Lx    // Ln
         \        \        //       //
       +-\--------\-------//-------//-+
       |  \        \ ESI-1 //     //  |
       +------------------------------+
                      |
                      CE

   To generalize, if the total link bandwidth to a CE is distributed
   across "n" multi-homing PEs, with Lx being the number of links /
   bandwidth to PEx, traffic from remote PEs to this CE MUST be load-
   balanced unequally across [PE1, PE2, ....., PEn] such that the
   fraction of total unicast and BUM flows destined for the CE that are
   serviced by PEx is:

      Lx / [L1+L2+.....+Ln]

   The solution proposed below includes extensions to EVPN procedures to
   achieve the above.

1.4  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   "LOCAL PE" in the context of an ESI refers to a provider edge switch
   OR router that physically hosts the ESI.

   "REMOTE PE" in the context of an ESI refers to a provider edge switch
   OR router in an EVPN overlay, whose overlay reachability to the ESI
   is via the LOCAL PE.

2.  Solution Overview

   In order to achieve weighted load balancing for overlay unicast
   traffic, the Ethernet A-D per-ES route (EVPN Route Type 1) is
   leveraged to signal the ESI bandwidth to remote PEs.  Using the
   Ethernet A-D per-ES route to signal the ESI bandwidth provides a
   mechanism to react to changes in access bandwidth in a service- and
   host-independent manner.  Remote PEs computing the MAC path-lists
   based on global and aliasing Ethernet A-D routes now have the ability
   to set up weighted load-balancing path-lists based on the ESI access
   bandwidth received from each PE that the ESI is multi-homed to.
   If the Ethernet A-D per-ES route is also leveraged for IP path-list
   computation, as per [EVPN-IP-ALIASING], it also provides a method to
   do weighted load-balancing for IP routed traffic.

   In order to achieve weighted load-balancing of overlay BUM traffic,
   the EVPN ES route (Route Type 4) is leveraged to signal the ESI
   bandwidth to PEs within an ESI's redundancy group to influence per-
   service DF election.  PEs in an ESI redundancy group now have the
   ability to do service carving in proportion to each PE's relative ESI
   bandwidth.

   Procedures to accomplish this are described in greater detail next.

3.  Weighted Unicast Traffic Load-balancing

3.1  LOCAL PE Behavior

   A PE that is part of an ESI's redundancy group would advertise an
   additional "link bandwidth" EXT-COMM attribute with the Ethernet A-D
   per-ES route (EVPN Route Type 1) that represents the total bandwidth
   of the PE's physical links in the ES.  The BGP link bandwidth
   EXT-COMM defined in [BGP-LINK-BW] is re-used for this purpose.

3.1.1  Link Bandwidth Extended Community

   The link bandwidth extended community described in [BGP-LINK-BW] for
   layer 3 VPNs is re-used here to signal local ES link bandwidth to
   remote PEs.  The link-bandwidth extended community is, however,
   defined in [BGP-LINK-BW] as optional and non-transitive.  In inter-AS
   scenarios, link bandwidth may need to be signaled to an eBGP neighbor
   along with an unchanged next-hop.  Work is in progress with the
   authors of [BGP-LINK-BW] to allow this attribute to be used as
   transitive in inter-AS scenarios.

3.2  REMOTE PE Behavior

   A receiving PE should use the per-ES link bandwidth attribute
   received from each PE to compute a relative weight for each remote
   PE, per ES, as shown below.
   If:

      L(x,y) : link bandwidth advertised by PE-x for ESI-y

      W(x,y) : normalized weight assigned to PE-x for ESI-y

      H(y)   : Highest Common Factor (HCF) of [L(1,y), L(2,y), .....,
               L(n,y)]

   then the normalized weight assigned to PE-x for ESI-y may be computed
   as follows:

      W(x,y) = L(x,y) / H(y)

   For a MAC+IP route (EVPN Route Type 2) received with ESI-y, the
   receiving PE MUST compute the MAC and IP forwarding path-list
   weighted by the above normalized weights.

   As an example, for a CE multi-homed to PE-1, PE-2, and PE-3 via 2, 1,
   and 1 GE physical links respectively, as part of a link bundle
   represented by ESI-10:

      L(1,10) = 2000 Mbps

      L(2,10) = 1000 Mbps

      L(3,10) = 1000 Mbps

      H(10) = 1000

   The normalized weights assigned to each PE for ESI-10 are as follows:

      W(1,10) = 2000 / 1000 = 2

      W(2,10) = 1000 / 1000 = 1

      W(3,10) = 1000 / 1000 = 1

   For a remote MAC+IP host route received with ESI-10, the forwarding
   load-balancing path-list must now be computed as [PE-1, PE-1, PE-2,
   PE-3] instead of [PE-1, PE-2, PE-3].  This results in load-balancing
   of all traffic destined for ESI-10 across the three multi-homing PEs
   in proportion to the ESI-10 bandwidth at each PE.

   The above weighted path-list computation MUST only be done for an ESI
   if a link bandwidth attribute is received from ALL of the PEs
   advertising reachability to that ESI via Ethernet A-D per-ES Route
   Type 1.  In the event that the link bandwidth attribute is not
   received from one or more PEs, the forwarding path-list would be
   computed using regular ECMP semantics.

4.  Weighted BUM Traffic Load-Sharing

   Optionally, load-sharing of the per-service DF role, weighted by an
   individual PE's link-bandwidth share within a multi-homed ES, may
   also be achieved.

   In order to do that, a new DF Election Capability [EVPN-DF-ELECT-
   FRAMEWORK] called "BW" (Bandwidth Weighted DF Election) is defined.
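   As a non-normative aside, the weight normalization and weighted
   path-list construction of Section 3.2 can be sketched as follows.
   The function name, PE names, and the use of Mbps integers are
   illustrative assumptions, not part of the procedure:

```python
from functools import reduce
from math import gcd

def weighted_path_list(link_bw):
    """Build a weighted load-balancing path-list for one ES from the
    per-PE link bandwidths advertised in Ethernet A-D per-ES routes.
    H(y) is the highest common factor of all advertised bandwidths;
    each PE appears W(x,y) = L(x,y) / H(y) times in the path-list."""
    h = reduce(gcd, link_bw.values())  # H(y)
    return [pe for pe, bw in sorted(link_bw.items())
            for _ in range(bw // h)]   # W(x,y) copies of each PE

# Section 3.2 example: ESI-10 multi-homed to PE-1, PE-2, PE-3.
print(weighted_path_list({"PE-1": 2000, "PE-2": 1000, "PE-3": 1000}))
# ['PE-1', 'PE-1', 'PE-2', 'PE-3']
```

   A real implementation would fall back to a plain ECMP path-list
   whenever any multi-homing PE does not advertise the attribute, as
   required above.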
   BW may be used along with some DF Election Types, as described in the
   following sections.

4.1  The BW Capability in the DF Election Extended Community

   [EVPN-DF-ELECT-FRAMEWORK] defines a new extended community for PEs
   within a redundancy group to signal and agree on a uniform DF
   Election Type and Capabilities for each ES.  This document requests a
   bit in the DF Election extended community Bitmap:

      Bit 28: BW (Bandwidth Weighted DF Election)

   ES routes advertised with the BW bit set indicate the desire of the
   advertising PE to consider the link bandwidth in the DF Election
   algorithm defined by the value in the "DF Type".

   As per [EVPN-DF-ELECT-FRAMEWORK], all the PEs in the ES MUST
   advertise the same Capabilities and DF Type; otherwise, the PEs will
   fall back to the Default [RFC7432] DF Election procedure.

   The BW Capability MAY be advertised with the following DF Types:

   o  Type 0: Default DF Election algorithm, as in [RFC7432]
   o  Type 1: HRW algorithm, as in [EVPN-DF-ELECT-FRAMEWORK]
   o  Type 2: Preference algorithm, as in [EVPN-DF-PREF]
   o  Type 4: HRW per-multicast flow DF Election, as in
      [EVPN-PER-MCAST-FLOW-DF]

   The following sections describe how the DF Election procedures are
   modified for the above DF Types when the BW Capability is used.

4.2  BW Capability and Default DF Election algorithm

   When all the PEs in the ES agree to use the BW Capability with DF
   Type 0, the Default DF Election procedure is modified as follows:

   o  Each PE advertises a "Link Bandwidth" EXT-COMM attribute along
      with the ES route to signal the PE-CE link bandwidth (LBW) for
      the ES.

   o  A receiving PE MUST use the ES link bandwidth attribute received
      from each PE to compute a relative weight for each remote PE.
   o  The DF Election procedure MUST now use this weighted list of PEs
      to compute the per-VLAN Designated Forwarder, such that the DF
      role is distributed in proportion to this normalized weight.

   Considering the same example as in Section 3, the candidate PE list
   for DF election is:

      [PE-1, PE-1, PE-2, PE-3]

   The DF for a given VLAN-a on ES-10 is now computed as (VLAN-a % 4).
   This results in the DF role being distributed across PE1, PE2, and
   PE3 in proportion to each PE's normalized weight for ES-10.

4.3  BW Capability and HRW DF Election algorithm (Type 1 and 4)

   [EVPN-DF-ELECT-FRAMEWORK] introduces the Highest Random Weight (HRW)
   algorithm (DF Type 1) for DF election in order to solve potential DF
   election skew depending on the Ethernet tag space distribution.
   [EVPN-PER-MCAST-FLOW-DF] further extends the HRW algorithm for per-
   multicast-flow based hash computations (DF Type 4).  This section
   describes extensions to the HRW algorithm for EVPN DF election
   specified in [EVPN-DF-ELECT-FRAMEWORK] and in [EVPN-PER-MCAST-FLOW-
   DF] in order to achieve a DF election distribution that is weighted
   by link bandwidth.
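   Before turning to HRW, the modified Default (Type 0) election of
   Section 4.2 can be sketched as follows.  This is a non-normative
   illustration; the function and PE names are invented, and each PE
   is entered into the candidate list once per unit of normalized
   weight:

```python
from functools import reduce
from math import gcd

def default_df(vlan, link_bw):
    """Modulo-based Default DF election with the BW Capability
    (Section 4.2): the DF for a VLAN is candidates[VLAN mod N],
    where the candidate list repeats each PE by normalized weight."""
    h = reduce(gcd, link_bw.values())          # HCF of advertised LBWs
    candidates = [pe for pe, bw in sorted(link_bw.items())
                  for _ in range(bw // h)]     # weighted candidate list
    return candidates[vlan % len(candidates)]

# ES-10 example: candidate list is [PE-1, PE-1, PE-2, PE-3], so PE-1
# is elected DF for half of the VLANs.
bw = {"PE-1": 2000, "PE-2": 1000, "PE-3": 1000}
print([default_df(v, bw) for v in range(4)])
# ['PE-1', 'PE-1', 'PE-2', 'PE-3']
```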
4.3.1  BW Increment

   A new variable called "bandwidth increment" is computed for each
   [PE, ES] advertising the ES link bandwidth attribute as follows:

   In the context of an ES:

      L(i)   = link bandwidth advertised by PE(i) for this ES

      L(min) = lowest link bandwidth advertised across all PEs for this
               ES

   The bandwidth increment "b(i)" for a given PE(i) advertising a link
   bandwidth of L(i) is defined as an integer value computed as:

      b(i) = L(i) / L(min)

   As an example, with L(1) = 10, L(2) = 10, L(3) = 20, the bandwidth
   increment for each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 2

   With L(1) = 10, L(2) = 10, L(3) = 10, the bandwidth increment for
   each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 1

   Note that the bandwidth increment must always be an integer,
   including in the unlikely scenario of a PE's link bandwidth not being
   an exact multiple of L(min).  If it computes to a non-integer value
   (including as a result of link failure), it MUST be rounded down to
   an integer.

4.3.2  HRW Hash Computations with BW Increment

   The HRW algorithm as described in [EVPN-DF-ELECT-FRAMEWORK] and in
   [EVPN-PER-MCAST-FLOW-DF] computes a random hash value (referred to as
   affinity here) for each PE(i), where (0 < i <= N), PE(i) is the PE at
   ordinal i, and Address(i) is the IP address of the PE at ordinal i.

   For N PEs sharing an Ethernet segment, this results in N candidate
   hash computations.  The PE that has the highest hash value is
   selected as the DF.

   The affinity computation for each PE(i) is extended so that one
   affinity is computed per bandwidth increment associated with PE(i),
   instead of a single affinity computation per PE(i).
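   The bandwidth-increment computation of Section 4.3.1, including the
   round-down rule for non-integer results, can be sketched as follows
   (non-normative; names are illustrative):

```python
def bw_increment(link_bw):
    """b(i) = L(i) / L(min), rounded down to an integer as required
    by Section 4.3.1; floor division implements the round-down."""
    l_min = min(link_bw.values())
    return {pe: bw // l_min for pe, bw in link_bw.items()}

# Section 4.3.1 example: L(1) = 10, L(2) = 10, L(3) = 20.
print(bw_increment({"PE1": 10, "PE2": 10, "PE3": 20}))
# {'PE1': 1, 'PE2': 1, 'PE3': 2}

# Non-integer case, e.g. after a partial link failure: 25/10 -> 2.
print(bw_increment({"PE1": 10, "PE2": 25}))
# {'PE1': 1, 'PE2': 2}
```

   Each PE(i) then enters the HRW computation with b(i) candidate
   affinities, as described in the surrounding text.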
   PE(i) with b(i) = j results in j affinity computations:

      affinity(i, x), where 1 <= x <= j

   This essentially results in a number of candidate HRW hash
   computations for each PE that is directly proportional to that PE's
   relative bandwidth within the ES, and hence gives PE(i) a probability
   of being DF in proportion to its relative bandwidth within the ES.

   As an example, consider an ES that is multi-homed to two PEs, PE1 and
   PE2, with equal bandwidth distribution across PE1 and PE2.  This
   would result in a total of two candidate hash computations:

      affinity(PE1, 1)

      affinity(PE2, 1)

   Now, consider a scenario with PE1's link bandwidth being 2x that of
   PE2.  This would result in a total of three candidate hash
   computations to be used for DF election:

      affinity(PE1, 1)

      affinity(PE1, 2)

      affinity(PE2, 1)

   This would give PE1 a 2/3 probability of being elected as DF, in
   proportion to its relative bandwidth in the ES.

   Depending on the chosen HRW hash function, the affinity function MUST
   be extended to include the bandwidth increment in the computation.
   For example, the affinity function specified in [EVPN-PER-MCAST-
   FLOW-DF] MAY be extended as follows to incorporate bandwidth
   increment j:

      affinity(S,G,V, ESI, Address(i,j)) =
         (1103515245.((1103515245.Address(i).j + 12345) XOR
         D(S,G,V,ESI))+12345) (mod 2^31)

   The affinity or random function specified in [EVPN-DF-ELECT-
   FRAMEWORK] MAY be extended as follows to incorporate bandwidth
   increment j:

      affinity(v, Es, Address(i,j)) =
         (1103515245.((1103515245.Address(i).j + 12345) XOR
         D(v,Es))+12345) (mod 2^31)

4.3.3  Cost-Benefit Tradeoff on Link Failures

   While incorporating link bandwidth into the DF election process
   provides optimal BUM traffic distribution across the ES links, it
   also implies that affinity values for a given PE are re-computed, and
   DF elections are re-adjusted, on changes to that PE's bandwidth
   increment that might result from link failures or link additions.  If
   the operator does not wish to have this level of churn in their DF
   election, then they should not advertise the BW Capability.  Not
   advertising the BW Capability may result in less than optimal BUM
   traffic distribution, while still retaining the ability to allow a
   remote ingress PE to do weighted ECMP for its unicast traffic to a
   set of multi-homed PEs, as described in Section 3.2.

   The same also applies to the use of the BW Capability with service
   carving (DF Type 0), as specified in Section 4.2.

4.4  BW Capability and Preference DF Election algorithm

   This section applies to ESes for which all the PEs in the ES agree to
   use the BW Capability with DF Type 2.  The BW Capability modifies the
   Preference DF Election procedure [EVPN-DF-PREF] by adding the LBW
   value as a tie-breaker as follows:

   o  Section 4.1, bullet (f) in [EVPN-DF-PREF] now considers the LBW
      value:

      f) In case of equal Preference in two or more PEs in the ES, the
         tie-breakers will be the DP bit, the LBW value, and the lowest
         IP PE, in that order.
   For instance:

   o  If vES1 parameters were [Pref=500, DP=0, LBW=1000] in PE1 and
      [Pref=500, DP=1, LBW=2000] in PE2, PE2 would be elected due to
      the DP bit.

   o  If vES1 parameters were [Pref=500, DP=0, LBW=1000] in PE1 and
      [Pref=500, DP=0, LBW=2000] in PE2, PE2 would be elected due to a
      higher LBW, even if PE1's IP address is lower.

   o  The LBW exchanged value has no impact on the Non-Revertive option
      described in [EVPN-DF-PREF].

5.  Real-time Available Bandwidth

   PE-CE link bandwidth availability may sometimes vary in real time,
   disproportionately across PE-CE links within a multi-homed ESI, due
   to various factors such as flow-based hashing combined with fat flows
   and unbalanced hashing.  Reacting to real-time available bandwidth is
   at this time outside the scope of this document.  Procedures
   described in this document are strictly based on the static link
   bandwidth parameter.

6.  Routed EVPN Overlay

   An additional use case is possible, such that traffic to an end host
   in the overlay is always IP routed.  In a purely routed overlay such
   as this:

   o  A host MAC is never advertised in the EVPN overlay control plane.

   o  Host /32 or /128 IP reachability is distributed across the
      overlay via EVPN Route Type 5 (RT-5), along with a zero or non-
      zero ESI.

   o  An overlay IP subnet may still be stretched across the underlay
      fabric; however, intra-subnet traffic across the stretched
      overlay is never bridged.

   o  Both inter-subnet and intra-subnet traffic in the overlay is IP
      routed at the EVPN GW.

   Please refer to [RFC7814] for more details.

   The weighted multi-path procedure described in this document may be
   used together with the procedures described in [EVPN-IP-ALIASING]
   for this use case.  The Ethernet A-D per-ES route advertised with
   Layer 3 VRF RTs would be used to signal the ES link bandwidth
   attribute instead of the Ethernet A-D per-ES route with Layer 2 VRF
   RTs.
   All other procedures described earlier in this document would apply
   as is.

   If [EVPN-IP-ALIASING] is not used for routed fast convergence, the
   link bandwidth attribute may still be advertised with IP routes
   (RT-5) to achieve PE-CE link bandwidth based load-balancing as
   described in this document.  In the absence of [EVPN-IP-ALIASING],
   re-balancing of traffic following changes in PE-CE link bandwidth
   will require all IP routes from that CE to be re-advertised in a
   prefix-dependent manner.

7.  EVPN-IRB Multi-homing with non-EVPN routing

   EVPN-LAG based multi-homing on an IRB gateway may also be deployed
   together with non-EVPN routing, such as global routing or an L3VPN
   routing control plane.  The key property that differentiates this
   set of use cases from the EVPN-IRB use cases discussed earlier is
   that the EVPN control plane is used only to enable LAG interface
   based multi-homing and NOT as an overlay VPN control plane.  The
   EVPN control plane in this case enables:

   o  DF election via EVPN RT-4 based procedures described in [RFC7432]

   o  LOCAL MAC sync across multi-homing PEs via EVPN RT-2

   o  LOCAL ARP and ND sync across multi-homing PEs via EVPN RT-2

   Applicability of the weighted ECMP procedures proposed in this
   document to this set of use cases will be addressed in subsequent
   revisions.

8.  References

8.1  Normative References

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015.

   [BGP-LINK-BW]  Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", Work in Progress, January 2013.

   [EVPN-IP-ALIASING]  Sajassi, A. and G. Badoni, "L3 Aliasing and Mass
              Withdrawal Support for EVPN", Work in Progress, July
              2017.

   [EVPN-DF-PREF]  Rabadan, J., Sathappan, S., Przygienda, T., Lin, W.,
              Drake, J., Sajassi, A., and S.
Mohanty, "Preference-based
              EVPN DF Election", Work in Progress, draft-ietf-bess-
              evpn-pref-df-01, April 2018.

   [EVPN-PER-MCAST-FLOW-DF]  Sajassi, A., et al., "Per multicast flow
              Designated Forwarder Election for EVPN", Work in
              Progress, March 2018.

   [EVPN-DF-ELECT-FRAMEWORK]  Rabadan, J., Mohanty, S., et al.,
              "Framework for EVPN Designated Forwarder Election
              Extensibility", Work in Progress, March 2018.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, May 2017.

8.2  Informative References

   [RFC7814]  Xu, X., et al., "Virtual Subnet: A BGP/MPLS IP VPN-Based
              Subnet Extension Solution", RFC 7814, March 2016.

9.  Acknowledgements

   The authors would like to thank Satya Mohanty for valuable review
   and inputs with respect to the HRW algorithm refinements proposed in
   this document.

10.  Contributors

   Contributors in addition to the authors listed on this draft:

   Samir Thoria
   Cisco
   Email: sthoria@cisco.com

Authors' Addresses

   Neeraj Malhotra, Ed.
   Email: neeraj.ietf@gmail.com

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Jorge Rabadan
   Nokia
   Email: jorge.rabadan@nokia.com

   John Drake
   Juniper
   Email: jdrake@juniper.net

   Avinash Lingala
   AT&T
   Email: ar977m@att.com