BESS Working Group                                    N. Malhotra, Ed.
Internet-Draft                                                  Arrcus
Intended Status: Proposed Standard                          A. Sajassi
                                                             S. Thoria
                                                                 Cisco

                                                            J. Rabadan
                                                                 Nokia

                                                              J. Drake
                                                               Juniper

                                                            A. Lingala
                                                                  AT&T

Expires: September 26, 2019                             March 25, 2019


    Weighted Multi-Path Procedures for EVPN All-Active Multi-Homing
                  draft-ietf-bess-evpn-unequal-lb-01

Abstract

   In an EVPN-IRB based network overlay, EVPN all-active multi-homing
   enables multi-homing for a CE device connected to two or more PEs
   via a LAG bundle, such that bridged and routed traffic from remote
   PEs can be equally load balanced (ECMPed) across the multi-homing
   PEs.  This document defines extensions to EVPN procedures to
   optimally handle unequal access bandwidth distribution across a set
   of multi-homing PEs in order to:

   o  provide greater flexibility, with respect to adding or removing
      individual PE-CE links within the access LAG

   o  handle PE-CE LAG member link failures that can result in unequal
      PE-CE access bandwidth across a set of multi-homing PEs

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1  PE CE Link Provisioning
     1.2  PE CE Link Failures
     1.3  Design Requirement
     1.4  Terminology
   2.  Solution Overview
   3.  Weighted Unicast Traffic Load-balancing
     3.1  LOCAL PE Behavior
     3.2  Link Bandwidth Extended Community
     3.3  REMOTE PE Behavior
   4.  Weighted BUM Traffic Load-Sharing
     4.1  The BW Capability in the DF Election Extended Community
     4.2  BW Capability and Default DF Election algorithm
     4.3  BW Capability and HRW DF Election algorithm (Type 1 and 4)
       4.3.1  BW Increment
       4.3.2  HRW Hash Computations with BW Increment
       4.3.3  Cost-Benefit Tradeoff on Link Failures
     4.4  BW Capability and Preference DF Election algorithm
   5.  Real-time Available Bandwidth
   6.  Routed EVPN Overlay
   7.  EVPN-IRB Multi-homing with non-EVPN routing
   8.  References
     8.1  Normative References
     8.2  Informative References
   9.  Acknowledgements
   Authors' Addresses

1.  Introduction

   In an EVPN-IRB based network overlay, with a CE multi-homed via EVPN
   all-active multi-homing, bridged and routed traffic from remote PEs
   can be equally load balanced (ECMPed) across the multi-homing PEs:

   o  ECMP load-balancing for bridged unicast traffic is enabled via
      aliasing and mass-withdraw procedures detailed in [RFC7432].

   o  ECMP load-balancing for routed unicast traffic is enabled via
      existing L3 ECMP mechanisms.

   o  Load-sharing of bridged BUM traffic on local ports is enabled via
      the EVPN DF election procedure detailed in [RFC7432].

   All of the above load-balancing and DF election procedures
   implicitly assume equal bandwidth distribution between the CE and
   the set of multi-homing PEs.
   Essentially, with this assumption of equal "access" bandwidth
   distribution across all PEs, ALL remote traffic is equally load
   balanced across the multi-homing PEs.  This assumption of equal
   access bandwidth distribution can be restrictive with respect to
   adding or removing links in a multi-homed LAG interface and may also
   be easily broken on individual link failures.  This document
   proposes a solution to handle unequal access bandwidth distribution
   across a set of multi-homing EVPN PEs.  The primary motivation
   behind this proposal is to enable greater flexibility with respect
   to adding or removing member PE-CE links, as needed, and to
   optimally handle PE-CE link failures.

1.1  PE CE Link Provisioning

              +------------------------+
              | Underlay Network Fabric|
              +------------------------+

                +-----+      +-----+
                | PE1 |      | PE2 |
                +-----+      +-----+
                    \          /
                     \ ESI-1  /
                      \      /
                      +\---/+
                      | \ / |
                      +--+--+
                         |
                        CE1

                            Figure 1

   Consider a CE1 that is dual-homed to PE1 and PE2 via EVPN all-active
   multi-homing with single member links of equal bandwidth to each PE
   (aka, equal access bandwidth distribution across PE1 and PE2).  If
   the provider wants to increase link bandwidth to CE1, it MUST add a
   link to both PE1 and PE2 in order to maintain equal access bandwidth
   distribution and inter-work with EVPN ECMP load-balancing.  In other
   words, for a dual-homed CE, the total number of CE links must be
   provisioned in multiples of 2 (2, 4, 6, and so on).  For a triple-
   homed CE, the number of CE links must be provisioned in multiples of
   three (3, 6, 9, and so on).  To generalize, for a CE that is multi-
   homed to "n" PEs, the number of PE-CE physical links provisioned
   must be an integral multiple of "n".  This is restrictive in case of
   dual-homing and very quickly becomes prohibitive in case of multi-
   homing.
   Instead, a provider may wish to increase PE-CE bandwidth OR the
   number of links in ANY link increments.  As an example, for CE1
   dual-homed to PE1 and PE2 in all-active mode, the provider may wish
   to add a third link to ONLY PE1 to increase total bandwidth for this
   CE by 50%, rather than being required to increase access bandwidth
   by 100% by adding a link to each of the two PEs.  While existing
   EVPN based all-active load-balancing procedures do not necessarily
   preclude such asymmetric access bandwidth distribution among the PEs
   providing redundancy, it may result in unexpected traffic loss due
   to congestion in the access interface towards the CE.  This traffic
   loss is due to the fact that PE1 and PE2 will continue to attract an
   equal amount of CE1-destined traffic from remote PEs, even when PE2
   only has half the bandwidth to CE1 that PE1 has.  This may lead to
   congestion and traffic loss on the PE2-CE1 link.  If bandwidth
   distribution to CE1 across PE1 and PE2 is 2:1, traffic from remote
   hosts MUST also be load-balanced across PE1 and PE2 in a 2:1 manner.

1.2  PE CE Link Failures

   More importantly, the unequal PE-CE bandwidth distribution described
   above may occur during regular operation following a link failure,
   even when PE-CE links were provisioned to provide equal bandwidth
   distribution across multi-homing PEs.

              +------------------------+
              | Underlay Network Fabric|
              +------------------------+

                +-----+      +-----+
                | PE1 |      | PE2 |
                +-----+      +-----+
                   \\          //
                    \\ ESI-1  //
                     \\      /X
                     +\\---//+
                     | \\ // |
                     +---+---+
                         |
                        CE1

   Consider a CE1 that is multi-homed to PE1 and PE2 via a link bundle
   with two member links to each PE.  On a PE2-CE1 physical link
   failure, the link bundle represented by Ethernet Segment ESI-1 on
   PE2 stays up; however, its bandwidth is cut in half.
   With existing ECMP procedures, both PE1 and PE2 will continue to
   attract an equal amount of traffic from remote PEs, even when PE1
   has double the bandwidth to CE1.  If bandwidth distribution to CE1
   across PE1 and PE2 is 2:1, traffic from remote hosts MUST also be
   load-balanced across PE1 and PE2 in a 2:1 manner to avoid unexpected
   congestion and traffic loss on PE2-CE1 links within the LAG.

1.3  Design Requirement

             +-----------------------+
             |Underlay Network Fabric|
             +-----------------------+

   +-----+   +-----+       +-----+   +-----+
   | PE1 |   | PE2 | ..... | PEx |   | PEn |
   +-----+   +-----+       +-----+   +-----+
        \       \            //        //
         \ L1    \ L2       // Lx     // Ln
          \       \        //        //
         +-\-------\------//--------//-+
         |  \       \ ESI-1//       // |
         +-----------------------------+
                        |
                        CE

   To generalize, if the total link bandwidth to a CE is distributed
   across "n" multi-homing PEs, with Lx being the number of links (or
   the bandwidth) to PEx, traffic from remote PEs to this CE MUST be
   load-balanced unequally across [PE1, PE2, ....., PEn] such that the
   fraction of total unicast and BUM flows destined for the CE that is
   serviced by PEx is:

      Lx / [L1+L2+.....+Ln]

   The solution proposed below includes extensions to EVPN procedures
   to achieve the above.

1.4  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   "LOCAL PE" in the context of an ESI refers to a provider edge switch
   or router that physically hosts the ESI.

   "REMOTE PE" in the context of an ESI refers to a provider edge
   switch or router in an EVPN overlay whose overlay reachability to
   the ESI is via the LOCAL PE.

2.  Solution Overview

   In order to achieve weighted load-balancing for overlay unicast
   traffic, the Ethernet A-D per-ES route (EVPN Route Type 1) is
   leveraged to signal the Ethernet Segment bandwidth to remote PEs.
   Using the Ethernet A-D per-ES route to signal the Ethernet Segment
   bandwidth provides a mechanism to react to changes in access
   bandwidth in a service and host independent manner.  Remote PEs
   computing the MAC path-lists based on global and aliasing Ethernet
   A-D routes now have the ability to set up weighted load-balancing
   path-lists based on the ESI access bandwidth received from each PE
   that the ESI is multi-homed to.  If the Ethernet A-D per-ES route is
   also leveraged for IP path-list computation, as per
   [EVPN-IP-ALIASING], it also provides a method to do weighted
   load-balancing for IP routed traffic.

   In order to achieve weighted load-balancing of overlay BUM traffic,
   the EVPN ES route (Route Type 4) is leveraged to signal the ESI
   bandwidth to PEs within an ESI's redundancy group to influence
   per-service DF election.  PEs in an ESI redundancy group now have
   the ability to do service carving in proportion to each PE's
   relative ESI bandwidth.

   Procedures to accomplish this are described in greater detail next.

3.  Weighted Unicast Traffic Load-balancing

3.1  LOCAL PE Behavior

   A PE that is part of an Ethernet Segment's redundancy group would
   advertise an additional "link bandwidth" EXT-COMM attribute with the
   Ethernet A-D per-ES route (EVPN Route Type 1) that represents the
   total bandwidth of the PE's physical links in an Ethernet Segment.
   The BGP link bandwidth EXT-COMM defined in [BGP-LINK-BW] is re-used
   for this purpose.

3.2  Link Bandwidth Extended Community

   The link bandwidth extended community described in [BGP-LINK-BW]
   for layer 3 VPNs is re-used here to signal local ES link bandwidth
   to remote PEs.
   The link-bandwidth extended community is, however, defined in
   [BGP-LINK-BW] as optional non-transitive.  In inter-AS scenarios,
   link bandwidth may need to be signaled to an eBGP neighbor along
   with next-hop unchanged.  It is work in progress with the authors of
   [BGP-LINK-BW] to allow this attribute to be used as transitive in
   inter-AS scenarios.

3.3  REMOTE PE Behavior

   A receiving PE should use the per-ES link bandwidth attribute
   received from each PE to compute a relative weight for each remote
   PE, per ES, as shown below.

   If,

      L(x,y) : link bandwidth advertised by PE-x for ESI-y

      W(x,y) : normalized weight assigned to PE-x for ESI-y

      H(y)   : Highest Common Factor (HCF) of
               [L(1,y), L(2,y), ....., L(n,y)]

   then the normalized weight assigned to PE-x for ESI-y may be
   computed as follows:

      W(x,y) = L(x,y) / H(y)

   For a MAC+IP route (EVPN Route Type 2) received with ESI-y, the
   receiving PE MUST compute the MAC and IP forwarding path-list
   weighted by the above normalized weights.

   As an example, for a CE multi-homed to PE-1, PE-2, and PE-3 via 2,
   1, and 1 GE physical links respectively, as part of a link bundle
   represented by ESI-10:

      L(1,10) = 2000 Mbps

      L(2,10) = 1000 Mbps

      L(3,10) = 1000 Mbps

      H(10)   = 1000

   Normalized weights assigned to each PE for ESI-10 are as follows:

      W(1,10) = 2000 / 1000 = 2

      W(2,10) = 1000 / 1000 = 1

      W(3,10) = 1000 / 1000 = 1

   For a remote MAC+IP host route received with ESI-10, the forwarding
   load-balancing path-list must now be computed as
   [PE-1, PE-1, PE-2, PE-3] instead of [PE-1, PE-2, PE-3].  This
   results in load-balancing of all traffic destined for ESI-10 across
   the three multi-homing PEs in proportion to ESI-10 bandwidth at each
   PE.
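The normalized-weight and weighted path-list computation above can be sketched as follows.  This is a minimal illustration, not a specified implementation; the function and variable names are assumptions of this sketch.

```python
from math import gcd
from functools import reduce

def weighted_path_list(link_bw):
    """Expand remote-PE next-hops into a weighted forwarding path-list.

    link_bw maps PE name -> advertised link bandwidth (Mbps) for a
    given ESI, i.e. the L(x,y) values above.  Each PE-x is repeated
    W(x,y) = L(x,y) / H(y) times, H(y) being the highest common factor.
    """
    hcf = reduce(gcd, link_bw.values())            # H(y)
    return [pe
            for pe, bw in sorted(link_bw.items())  # deterministic order
            for _ in range(bw // hcf)]             # W(x,y) copies of PE-x

# ESI-10 example from the text: 2, 1 and 1 GE links to PE-1/2/3.
paths = weighted_path_list({"PE-1": 2000, "PE-2": 1000, "PE-3": 1000})
# paths == ['PE-1', 'PE-1', 'PE-2', 'PE-3']
```

With equal bandwidths the HCF equals each link bandwidth and the sketch degenerates to the regular ECMP path-list.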
   The above weighted path-list computation MUST only be done for an
   ESI if a link bandwidth attribute is received from ALL of the PEs
   advertising reachability to that ESI via Ethernet A-D per-ES Route
   Type 1.  In the event that the link bandwidth attribute is not
   received from one or more PEs, the forwarding path-list would be
   computed using regular ECMP semantics.

4.  Weighted BUM Traffic Load-Sharing

   Optionally, load-sharing of the per-service DF role, weighted by
   each individual PE's link bandwidth share within a multi-homed ES,
   may also be achieved.

   In order to do that, a new DF Election Capability
   [EVPN-DF-ELECT-FRAMEWORK] called "BW" (Bandwidth Weighted DF
   Election) is defined.  BW may be used along with some DF Election
   Types, as described in the following sections.

4.1  The BW Capability in the DF Election Extended Community

   [EVPN-DF-ELECT-FRAMEWORK] defines a new extended community for PEs
   within a redundancy group to signal and agree on a uniform DF
   Election Type and Capabilities for each ES.  This document requests
   a bit in the DF Election extended community Bitmap:

      Bit 28: BW (Bandwidth Weighted DF Election)

   ES routes advertised with the BW bit set indicate the desire of the
   advertising PE to consider the link bandwidth in the DF Election
   algorithm defined by the value in the "DF Type".

   As per [EVPN-DF-ELECT-FRAMEWORK], all the PEs in the ES MUST
   advertise the same Capabilities and DF Type, otherwise the PEs will
   fall back to the Default [RFC7432] DF Election procedure.
   The BW Capability MAY be advertised with the following DF Types:

   o  Type 0: Default DF Election algorithm, as in [RFC7432]
   o  Type 1: HRW algorithm, as in [EVPN-DF-ELECT-FRAMEWORK]
   o  Type 2: Preference algorithm, as in [EVPN-DF-PREF]
   o  Type 4: HRW per-multicast flow DF Election, as in
      [EVPN-PER-MCAST-FLOW-DF]

   The following sections describe how the DF Election procedures are
   modified for the above DF Types when the BW Capability is used.

4.2  BW Capability and Default DF Election algorithm

   When all the PEs in the Ethernet Segment (ES) agree to use the BW
   Capability with DF Type 0, the Default DF Election procedure is
   modified as follows:

   o  Each PE advertises a "Link Bandwidth" EXT-COMM attribute along
      with the ES route to signal the PE-CE link bandwidth (LBW) for
      the ES.
   o  A receiving PE MUST use the ES link bandwidth attribute received
      from each PE to compute a relative weight for each remote PE.
   o  The DF Election procedure MUST now use this weighted list of PEs
      to compute the per-VLAN Designated Forwarder, such that the DF
      role is distributed in proportion to this normalized weight.

   Considering the same example as in Section 3, the candidate PE list
   for DF election is:

      [PE-1, PE-1, PE-2, PE-3]

   The DF for a given VLAN-a on ES-10 is now computed as (VLAN-a % 4).
   This results in the DF role being distributed across PE-1, PE-2,
   and PE-3 in proportion to each PE's normalized weight for ES-10.

4.3  BW Capability and HRW DF Election algorithm (Type 1 and 4)

   [EVPN-DF-ELECT-FRAMEWORK] introduces the Highest Random Weight
   (HRW) algorithm (DF Type 1) for DF election in order to avoid a
   potential DF election skew depending on the Ethernet tag space
   distribution.  [EVPN-PER-MCAST-FLOW-DF] further extends the HRW
   algorithm for per-multicast flow based hash computations (DF Type
   4).
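Before turning to HRW, the bandwidth-weighted Default DF Election of Section 4.2 can be sketched as below.  This is an illustration only: ordering candidates by PE name is an assumption of the sketch, whereas [RFC7432] orders candidates by originator IP address.

```python
from math import gcd
from functools import reduce

def weighted_df(vlan, link_bw):
    """Return the per-VLAN DF from a bandwidth-weighted candidate list.

    link_bw maps PE name -> PE-CE link bandwidth (LBW) for the ES.
    Each PE appears in the candidate list once per unit of its
    normalized weight; the DF is candidates[VLAN % len(candidates)].
    Sorting by PE name stands in for the [RFC7432] ordering by
    originator IP address.
    """
    hcf = reduce(gcd, link_bw.values())
    candidates = [pe
                  for pe, bw in sorted(link_bw.items())
                  for _ in range(bw // hcf)]
    return candidates[vlan % len(candidates)]

# ES-10 example: candidate list is [PE-1, PE-1, PE-2, PE-3], so
# VLAN 6 maps to index 6 % 4 = 2, i.e. PE-2.
df = weighted_df(6, {"PE-1": 2000, "PE-2": 1000, "PE-3": 1000})
```

Over the whole VLAN space, PE-1 is elected DF for twice as many VLANs as PE-2 or PE-3, matching its 2:1:1 bandwidth share.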
   This section describes extensions to the HRW algorithm for EVPN DF
   Election specified in [EVPN-DF-ELECT-FRAMEWORK] and in
   [EVPN-PER-MCAST-FLOW-DF] in order to achieve a DF election
   distribution that is weighted by link bandwidth.

4.3.1  BW Increment

   A new variable called "bandwidth increment" is computed for each
   [PE, ES] advertising the ES link bandwidth attribute as follows:

   In the context of an ES,

      L(i)   = link bandwidth advertised by PE(i) for this ES

      L(min) = lowest link bandwidth advertised across all PEs for
               this ES

   The bandwidth increment "b(i)" for a given PE(i) advertising a link
   bandwidth of L(i) is defined as an integer value computed as:

      b(i) = L(i) / L(min)

   As an example, with PE(1) = 10, PE(2) = 10, PE(3) = 20, the
   bandwidth increment for each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 2

   With PE(1) = 10, PE(2) = 10, PE(3) = 10, the bandwidth increment
   for each PE would be computed as:

      b(1) = 1, b(2) = 1, b(3) = 1

   Note that the bandwidth increment must always be an integer,
   including in the unlikely scenario of a PE's link bandwidth not
   being an exact multiple of L(min).  If it computes to a non-integer
   value (including as a result of link failure), it MUST be rounded
   down to an integer.

4.3.2  HRW Hash Computations with BW Increment

   The HRW algorithm as described in [EVPN-DF-ELECT-FRAMEWORK] and in
   [EVPN-PER-MCAST-FLOW-DF] computes a random hash value (referred to
   as affinity here) for each PE(i), where (0 < i <= N), PE(i) is the
   PE at ordinal i, and Address(i) is the IP address of the PE at
   ordinal i.

   For 'N' PEs sharing an Ethernet Segment, this results in 'N'
   candidate hash computations.  The PE that has the highest hash
   value is selected as the DF.

   The affinity computation for each PE(i) is extended to be computed
   once per bandwidth increment associated with PE(i) instead of a
   single affinity computation per PE(i).
   PE(i) with b(i) = j results in j affinity computations:

      affinity(i, x), where 0 < x <= j

   This essentially results in a number of candidate HRW hash
   computations for each PE that is directly proportional to that PE's
   relative bandwidth within an ES, and hence gives PE(i) a probability
   of being the DF in proportion to its relative bandwidth within the
   ES.

   As an example, consider an ES that is multi-homed to two PEs, PE1
   and PE2, with equal bandwidth distribution across PE1 and PE2.  This
   would result in a total of two candidate hash computations:

      affinity(PE1, 1)

      affinity(PE2, 1)

   Now, consider a scenario with PE1's link bandwidth being 2x that of
   PE2.  This would result in a total of three candidate hash
   computations to be used for DF election:

      affinity(PE1, 1)

      affinity(PE1, 2)

      affinity(PE2, 1)

   which would give PE1 a 2/3 probability of getting elected as the DF,
   in proportion to its relative bandwidth in the ES.

   Depending on the chosen HRW hash function, the affinity function
   MUST be extended to include the bandwidth increment in the
   computation.
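The effect of the bandwidth increment on HRW can be sketched as follows.  The SHA-256 based hash here is a stand-in chosen for illustration only; it is not the affinity function of the referenced drafts, and all names are assumptions of this sketch.

```python
import hashlib

def bw_increments(link_bw):
    """b(i) = L(i) / L(min), rounded down to an integer (Section 4.3.1)."""
    lmin = min(link_bw.values())
    return {pe: bw // lmin for pe, bw in link_bw.items()}

def hrw_df(vlan, link_bw):
    """Elect the DF as the (PE, increment) pair with highest affinity.

    Each PE(i) contributes b(i) candidate hash computations
    affinity(i, x) with 0 < x <= b(i), so its chance of being DF is
    proportional to its relative bandwidth.  The hash below is purely
    illustrative.
    """
    best_pe, best_affinity = None, -1
    for pe, b in bw_increments(link_bw).items():
        for x in range(1, b + 1):  # one computation per increment
            digest = hashlib.sha256(f"{vlan}:{pe}:{x}".encode()).digest()
            affinity = int.from_bytes(digest[:4], "big")
            if affinity > best_affinity:
                best_pe, best_affinity = pe, affinity
    return best_pe

# With PE1 at 20 and PE2 at 10, PE1 gets two candidate computations
# (b(1) = 2, b(2) = 1) and hence a 2/3 chance of being DF per VLAN.
increments = bw_increments({"PE1": 20, "PE2": 10})
# increments == {'PE1': 2, 'PE2': 1}
```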
   For example, the affinity function specified in
   [EVPN-PER-MCAST-FLOW-DF] MAY be extended as follows to incorporate
   bandwidth increment j:

      affinity(S,G,V, ESI, Address(i,j)) =
         (1103515245.((1103515245.Address(i).j + 12345) XOR
         D(S,G,V,ESI))+12345) (mod 2^31)

   The affinity or random function specified in
   [EVPN-DF-ELECT-FRAMEWORK] MAY be extended as follows to incorporate
   bandwidth increment j:

      affinity(v, Es, Address(i,j)) =
         (1103515245((1103515245.Address(i).j + 12345) XOR
         D(v,Es))+12345) (mod 2^31)

4.3.3  Cost-Benefit Tradeoff on Link Failures

   While incorporating link bandwidth into the DF election process
   provides optimal BUM traffic distribution across the ES links, it
   also implies that affinity values for a given PE are re-computed,
   and DF elections are re-adjusted, on changes to that PE's bandwidth
   increment that might result from link failures or link additions.
   If the operator does not wish to have this level of churn in their
   DF election, then they should not advertise the BW capability.  Not
   advertising the BW capability may result in less than optimal BUM
   traffic distribution while still retaining the ability to allow a
   remote ingress PE to do weighted ECMP for its unicast traffic to a
   set of multi-homed PEs, as described in Section 3.3.

   The same also applies to the use of the BW capability with service
   carving (DF Type 0), as specified in Section 4.2.

4.4  BW Capability and Preference DF Election algorithm

   This section applies to ES'es where all the PEs in the ES agree to
   use the BW Capability with DF Type 2.  The BW Capability modifies
   the Preference DF Election procedure [EVPN-DF-PREF] by adding the
   LBW value as a tie-breaker as follows:

   o  Section 4.1, bullet (f) in [EVPN-DF-PREF] now considers the LBW
      value:

      f) In case of equal Preference in two or more PEs in the ES, the
         tie-breakers will be the DP bit, the LBW value, and the lowest
         IP PE, in that order.
   For instance:

   o  If vES1 parameters were [Pref=500,DP=0,LBW=1000] in PE1 and
      [Pref=500,DP=1,LBW=2000] in PE2, PE2 would be elected due to the
      DP bit.
   o  If vES1 parameters were [Pref=500,DP=0,LBW=1000] in PE1 and
      [Pref=500,DP=0,LBW=2000] in PE2, PE2 would be elected due to a
      higher LBW, even if PE1's IP address is lower.
   o  The LBW exchanged value has no impact on the Non-Revertive option
      described in [EVPN-DF-PREF].

5.  Real-time Available Bandwidth

   PE-CE link bandwidth availability may sometimes vary in real time
   disproportionately across PE-CE links within a multi-homed ESI due
   to various factors such as flow-based hashing combined with fat
   flows and unbalanced hashing.  Reacting to real-time available
   bandwidth is at this time outside the scope of this document.
   Procedures described in this document are strictly based on the
   static link bandwidth parameter.

6.  Routed EVPN Overlay

   An additional use case is possible, such that traffic to an end host
   in the overlay is always IP routed.  In a purely routed overlay such
   as this:

   o  A host MAC is never advertised in the EVPN overlay control
      plane.
   o  Host /32 or /128 IP reachability is distributed across the
      overlay via EVPN Route Type 5 (RT-5) along with a zero or
      non-zero ESI.
   o  An overlay IP subnet may still be stretched across the underlay
      fabric; however, intra-subnet traffic across the stretched
      overlay is never bridged.
   o  Both inter-subnet and intra-subnet traffic in the overlay is IP
      routed at the EVPN GW.

   Please refer to [RFC7814] for more details.

   The weighted multi-path procedure described in this document may be
   used together with procedures described in [EVPN-IP-ALIASING] for
   this use case.  The Ethernet A-D per-ES route advertised with Layer
   3 VRF RTs would be used to signal the ES link bandwidth attribute
   instead of the Ethernet A-D per-ES route with Layer 2 VRF RTs.
   All other procedures described earlier in this document would apply
   as is.

   If [EVPN-IP-ALIASING] is not used for routed fast convergence, the
   link bandwidth attribute may still be advertised with IP routes
   (RT-5) to achieve PE-CE link bandwidth based load-balancing as
   described in this document.  In the absence of [EVPN-IP-ALIASING],
   re-balancing of traffic following changes in PE-CE link bandwidth
   will require all IP routes from that CE to be re-advertised in a
   prefix dependent manner.

7.  EVPN-IRB Multi-homing with non-EVPN routing

   EVPN-LAG based multi-homing on an IRB gateway may also be deployed
   together with non-EVPN routing, such as global routing or an L3VPN
   routing control plane.  The key property that differentiates this
   set of use cases from the EVPN-IRB use cases discussed earlier is
   that the EVPN control plane is used only to enable LAG interface
   based multi-homing and NOT as an overlay VPN control plane.  The
   EVPN control plane in this case enables:

   o  DF election via EVPN RT-4 based procedures described in
      [RFC7432]
   o  LOCAL MAC sync across multi-homing PEs via EVPN RT-2
   o  LOCAL ARP and ND sync across multi-homing PEs via EVPN RT-2

   Applicability of the weighted ECMP procedures proposed in this
   document to this set of use cases will be addressed in subsequent
   revisions.

8.  References

8.1  Normative References

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [BGP-LINK-BW]  Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", Work in Progress, January 2013.

   [EVPN-IP-ALIASING]  Sajassi, A. and G. Badoni, "L3 Aliasing and Mass
              Withdrawal Support for EVPN", Work in Progress, July
              2017.

   [EVPN-DF-PREF]  Rabadan, J., Sathappan, S., Przygienda, T., Lin, W.,
              Drake, J., Sajassi, A., and S. Mohanty, "Preference-based
              EVPN DF Election", Work in Progress,
              draft-ietf-bess-evpn-pref-df-01, April 2018.

   [EVPN-PER-MCAST-FLOW-DF]  Sajassi, A., et al., "Per multicast flow
              Designated Forwarder Election for EVPN", Work in
              Progress, March 2018.

   [EVPN-DF-ELECT-FRAMEWORK]  Rabadan, J., Mohanty, S., et al.,
              "Framework for EVPN Designated Forwarder Election
              Extensibility", Work in Progress, March 2018.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, May 2017,
              <https://www.rfc-editor.org/info/rfc8174>.

8.2  Informative References

   [RFC7814]  Xu, X., Jacquenet, C., Raszuk, R., Boyes, T., and B. Fee,
              "Virtual Subnet: A BGP/MPLS IP VPN-Based Subnet Extension
              Solution", RFC 7814, DOI 10.17487/RFC7814, March 2016,
              <https://www.rfc-editor.org/info/rfc7814>.

9.  Acknowledgements

   The authors would like to thank Satya Mohanty for valuable review
   and inputs with respect to the HRW algorithm refinements proposed in
   this document.

Authors' Addresses

   Neeraj Malhotra, Ed.
   Arrcus
   Email: neeraj.ietf@gmail.com

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Jorge Rabadan
   Nokia
   Email: jorge.rabadan@nokia.com

   John Drake
   Juniper
   Email: jdrake@juniper.net

   Avinash Lingala
   AT&T
   Email: ar977m@att.com

   Samir Thoria
   Cisco
   Email: sthoria@cisco.com