BESS Workgroup                                           J. Rabadan, Ed.
Internet Draft                                                     Nokia
Updates: 7432                                            S. Mohanty, Ed.
Intended status: Standards Track                              A. Sajassi
                                                                   Cisco
                                                                J. Drake
                                                                 Juniper
                                                              K. Nagaraj
                                                            S. Sathappan
                                                                   Nokia

Expires: July 22, 2019                                  January 18, 2019


     Framework for EVPN Designated Forwarder Election Extensibility
             draft-ietf-bess-evpn-df-election-framework-08

Abstract

   An alternative to the Default Designated Forwarder (DF) selection
   algorithm in Ethernet VPN (EVPN) networks is defined. The DF is the
   Provider Edge (PE) router responsible for sending broadcast, unknown
   unicast and multicast (BUM) traffic to multi-homed Customer Equipment
   (CE) on a particular Ethernet Segment (ES) within a VLAN. In
   addition, the capability to influence the DF election result for a
   VLAN based on the state of the associated Attachment Circuit (AC) is
   specified. This document clarifies the DF Election Finite State
   Machine in EVPN; therefore, it updates the EVPN specification,
   RFC 7432.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on July 22, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Default Designated Forwarder (DF) Election in EVPN
     1.2. Problem Statement
       1.2.1. Unfair Load-Balancing and Service Disruption
       1.2.2. Traffic Black-Holing on Individual AC Failures
     1.3. The Need for Extending the Default DF Election in EVPN
   2. Conventions and Terminology
   3. Designated Forwarder Election Protocol and BGP Extensions
     3.1. The DF Election Finite State Machine (FSM)
     3.2. The DF Election Extended Community
       3.2.1. Backward Compatibility
     3.3. Auto-Derivation of ES-Import Route Target
   4. The Highest Random Weight DF Election Algorithm
     4.1. HRW and Consistent Hashing
     4.2. HRW Algorithm for EVPN DF Election
   5. The Attachment Circuit Influenced DF Election Capability
     5.1. AC-Influenced DF Election Capability For VLAN-Aware
          Bundle Services
   6. Solution Benefits
   7. Security Considerations
   8. IANA Considerations
   9. References
     9.1. Normative References
     9.2. Informative References
   10. Acknowledgments
   11. Contributors
   Authors' Addresses

1. Introduction

   The Designated Forwarder (DF) in EVPN networks is the Provider Edge
   (PE) router responsible for sending broadcast, unknown unicast and
   multicast (BUM) traffic to a multi-homed Customer Equipment (CE)
   device, on a given VLAN on a particular Ethernet Segment (ES). The DF
   is selected out of a list of candidate PEs that advertise the same
   Ethernet Segment Identifier (ESI) to the EVPN network. By default,
   EVPN uses a DF Election algorithm referred to as "Service Carving",
   which is based on a modulus function (V mod N) that takes the number
   of PEs in the ES (N) and the VLAN value (V) as input. This Default DF
   Election algorithm has some inefficiencies that this document
   addresses by defining a new DF Election algorithm and a capability to
   influence the DF Election result for a VLAN, depending on the state
   of the associated Attachment Circuit (AC). In order to avoid any
   ambiguity with the identifier used in the DF Election Algorithm, this
   document uses the term Ethernet Tag instead of VLAN. This document
   also creates a registry with IANA for future DF Election Algorithms
   and Capabilities. It also presents a formal definition and
   clarification of the DF Election Finite State Machine (FSM);
   therefore, the document updates [RFC7432], and EVPN implementations
   MUST conform to the prescribed FSM.

   The procedures described in this document apply to DF election in all
   EVPN solutions including [RFC7432] and [RFC8214]. Apart from the FSM
   formal description, this document does not intend to update other
   [RFC7432] procedures. It only aims to improve the behavior of the DF
   Election on PEs that are upgraded to follow the described procedures.

1.1. Default Designated Forwarder (DF) Election in EVPN

   [RFC7432] defines the Designated Forwarder (DF) as the EVPN PE
   responsible for:

   o Flooding Broadcast, Unknown unicast and Multicast traffic (BUM), on
     a given Ethernet Tag on a particular Ethernet Segment (ES), to the
     CE. This is valid for single-active and all-active EVPN
     multi-homing.

   o Sending unicast traffic on a given Ethernet Tag on a particular ES
     to the CE. This is valid for single-active multi-homing.

   Figure 1 illustrates an example that we will use to explain the
   Designated Forwarder function.

                    +---------------+
                    |    IP/MPLS    |
                    |     CORE      |
      +----+ ES1 +----+          +----+
      | CE1|-----|    |          |    |____ES2
      +----+     | PE1|          | PE2|    \
                 |    |          +----+     \+----+
                 +----+            |         | CE2|
                   |             +----+     /+----+
                   |             |    |____/   |
                   |             | PE3|  ES2   /
                   |             +----+       /
                   |               |         /
                   +-------------+----+     /
                                 | PE4|____/ES2
                                 |    |
                                 +----+

                Figure 1 Multi-homing Network of EVPN

   Figure 1 illustrates a case where there are two Ethernet Segments,
   ES1 and ES2. PE1 is attached to CE1 via Ethernet Segment ES1, whereas
   PE2, PE3 and PE4 are attached to CE2 via ES2; that is, PE2, PE3 and
   PE4 form a redundancy group. Since CE2 is multi-homed to different
   PEs on the same Ethernet Segment, it is necessary for PE2, PE3 and
   PE4 to agree on a DF to satisfy the above-mentioned requirements.

   The effect of forwarding loops in a Layer-2 network is particularly
   severe because of the broadcast nature of Ethernet traffic and the
   lack of a Time-To-Live (TTL).
   Therefore, it is very important that, in the case of a multi-homed
   CE, only one of the PEs be used to send BUM traffic to it.

   One of the pre-requisites for this support is that participating PEs
   must agree amongst themselves as to who would act as the Designated
   Forwarder (DF). This needs to be achieved through a distributed
   algorithm in which each participating PE independently and
   unambiguously selects one of the participating PEs as the DF, and the
   result should be consistent and unanimous.

   The default algorithm for DF election defined by [RFC7432] at the
   granularity of (ESI,EVI) is referred to as "service carving". In this
   document, service carving and Default DF Election algorithm are used
   interchangeably. With service carving, it is possible to elect
   multiple DFs per Ethernet Segment (one per EVI) in order to perform
   load-balancing of traffic destined to a given Segment. The objective
   is that the load-balancing procedures should carve up the BD space
   among the redundant PE nodes evenly, in such a way that every PE is
   the DF for a distinct set of EVIs.

   The DF Election algorithm as described in [RFC7432] (Section 8.5) is
   based on a modulus operation. The PEs to which the ES (for which DF
   election is to be carried out per EVI) is multi-homed form an ordered
   (ordinal) list in ascending order of the PE IP address values. For
   example, there are N PEs: PE0, PE1, ..., PE(N-1), ranked as per
   increasing IP addresses in the ordinal list; then, for each VLAN with
   Ethernet Tag V configured on the Ethernet Segment ES1, PEx is the DF
   for VLAN V on ES1 when x equals (V mod N). In the case of VLAN
   Bundle, only the lowest VLAN is used.
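
   The modulus-based selection just described can be sketched in a few
   lines. This is only an illustration under stated assumptions: the
   function name default_df and the 192.0.2.x addresses are
   hypothetical, and all candidates are assumed to use IPv4 addresses.

```python
import ipaddress


def default_df(candidate_ips, ethernet_tag):
    """Pick the DF for one Ethernet Tag with service carving (V mod N)."""
    # Build the ordinal list: candidates in ascending IP address order.
    # Assumption: every candidate address is IPv4.
    ordered = sorted(candidate_ips, key=ipaddress.IPv4Address)
    # The PE at ordinal (V mod N) is the DF for Ethernet Tag V.
    return ordered[ethernet_tag % len(ordered)]


# Hypothetical addresses standing in for three PEs on one ES.
es_candidates = ["192.0.2.4", "192.0.2.2", "192.0.2.3"]
print(default_df(es_candidates, 1000))
```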
   When the planned density is high (meaning there is a significant
   number of VLANs and the Ethernet Tags are uniformly distributed), the
   expectation is that the DF Election will be spread across the PEs
   hosting that Ethernet Segment and good load-balancing can be
   achieved.

   However, the described Default DF Election algorithm has some
   undesirable properties and in some cases can be somewhat disruptive
   and unfair. This document describes some of those issues and defines
   a mechanism for dealing with them. These mechanisms do involve
   changes to the Default DF Election algorithm, but they do not require
   any changes to the EVPN Route exchange and have minimal changes in
   the EVPN routes.

   In addition, there is a need to extend the DF Election procedures so
   that new algorithms and capabilities are possible. A single algorithm
   (the Default DF Election algorithm) may not meet the requirements in
   all the use-cases.

   Note that while [RFC7432] elects a DF per (ESI, EVI), this document
   elects a DF per (ESI, BD). This means that unlike [RFC7432], where
   for a VLAN-Aware Bundle service EVI there is only one DF for the EVI,
   this document specifies that there will be multiple DFs, one for each
   BD configured in that EVI.

1.2. Problem Statement

   This section describes some potential issues with the Default DF
   Election algorithm.

1.2.1. Unfair Load-Balancing and Service Disruption

   There are three fundamental problems with the current Default DF
   Election algorithm.

   1- First, the algorithm will not perform well when the Ethernet Tag
      follows a non-uniform distribution, for instance when the Ethernet
      Tags are all even or all odd. In such a case, let us assume that
      the ES is multi-homed to two PEs; one of the PEs will be elected
      as DF for all of the VLANs. This is very sub-optimal. It defeats
      the purpose of service carving, as the DFs are not really evenly
      spread across the PEs.
      In fact, in this particular case, one of the PEs does not get
      elected as DF at all, so it does not participate in the DF
      responsibilities at all. Consider another example where,
      referring to Figure 1, let us assume that PE2, PE3 and PE4 are in
      ascending order of the IP address, and each VLAN configured on ES2
      is associated with an Ethernet Tag of the form (3x+1), where x is
      an integer. This will result in PE3 always being selected as the
      DF.

   2- The Ethernet Tag that identifies the BD can be as large as 2^24;
      however, it is not guaranteed that the tenant BDs on the ES will
      conform to a uniform distribution. In fact, it is up to the
      customer what BDs they will configure on the ES. Quoting [Knuth],
      "In general, we want to avoid values of M that divide r^k+a or
      r^k-a, where k and a are small numbers and r is the radix of the
      alphabetic character set (usually r=64, 256 or 100), since a
      remainder modulo such a value of M tends to be largely a simple
      superposition of key digits. Such considerations suggest that we
      choose M to be a prime number such that r^k != a (modulo M) or
      r^k != -a (modulo M) for small k & a."

      In our case, N is the number of PEs in [RFC7432], which
      corresponds to M above. Since N, N-1 or N+1 need not satisfy the
      primality properties of the M above, as per the [RFC7432] modulo
      based DF assignment, whenever a PE goes down or a new PE boots up
      (hosting the same Ethernet Segment), the modulo scheme will not
      necessarily map BDs to PEs uniformly.

   3- The third problem is one of disruption. Consider a case when the
      same Ethernet Segment is multi-homed to a set of PEs. When the ES
      is down in one of the PEs, say PE1, or PE1 itself reboots, or the
      BGP process goes down or the connectivity between PE1 and an RR
      goes down, the effective number of PEs in the system now becomes
      N-1, and DFs are computed for all the VLANs that are configured on
      that Ethernet Segment.
      In general, if the DF for a VLAN v happens not to be PE1, but
      some other PE, say PE2, it is likely that some other PE
      (different from PE1 and PE2) will become the new DF. This is not
      desirable. Similarly, when a new PE hosts the same Ethernet
      Segment, the mapping again changes because of the modulus
      operation. This results in needless churn. Again referring to
      Figure 1, say v1, v2 and v3 are VLANs configured on ES2 with
      associated Ethernet Tags of value 999, 1000 and 1001
      respectively. So PE2, PE3 and PE4 are the DFs for v1, v2 and v3
      respectively. Now when PE4 goes down, PE3 will become the DF for
      v1 and PE2 will become the DF for v2, even though neither of
      their previous DFs failed.

   One point to note is that the Default DF election algorithm assumes
   that all the PEs that are multi-homed to the same Ethernet Segment
   (and interested in the DF Election by exchanging EVPN routes) use an
   Originating Router's IP Address of the same family. This does not
   need to be the case, as the EVPN address-family can be carried over
   an IPv4 or IPv6 peering, and the PEs attached to the same ES may use
   an address of either family.

   Mathematically, a conventional hash function maps a key k to a number
   i representing one of m hash buckets through a function h(k), i.e.,
   i=h(k). In the EVPN case, h is simply a modulo-m hash function, viz.
   h(v) = v mod N, where N is the number of PEs that are multi-homed to
   the Ethernet Segment in discussion. It is well-known that for good
   hash distribution using the modulus operation, the modulus N should
   be a prime number not too close to a power of 2 [CLRS2009]. When the
   effective number of PEs changes from N to N-1 (or vice versa), all
   the objects (VLAN V) will be remapped except those for which V mod N
   and V mod (N-1) refer to the same PE in the previous and subsequent
   ordinal rankings respectively.
   From a forwarding perspective, this causes churn, as it results in
   re-programming the PE ports as either blocking or non-blocking at the
   PEs where the DF state changes.

   This document addresses this problem and furnishes a solution to this
   undesirable behavior.

1.2.2. Traffic Black-Holing on Individual AC Failures

   As discussed in Section 1.1, the Default DF Election algorithm
   defined by [RFC7432] takes into account only two variables in the
   modulus function for a given ES: the existence of the PE's IP address
   on the candidate list and the locally provisioned Ethernet Tags.

   If the DF fails (due to physical link/node failures), an ES route
   withdrawal will make the Non-DF (NDF) PEs re-elect the DF, and the
   service will be recovered.

   However, the Default DF election procedure does not provide
   protection against "logical" failures or human errors that may occur
   at service level on the DF, while the list of active PEs for a given
   ES does not change. These failures may have an impact not only on the
   local PE where the issue happens, but also on the rest of the PEs of
   the ES. Some examples of such logical failures are listed below:

   a) A given individual Attachment Circuit (AC) defined in an ES is
      accidentally shut down or even not provisioned yet (hence the
      Attachment Circuit Status - ACS - is DOWN), while the ES is
      operationally active (since the ES route is active).

   b) A given MAC-VRF - with a defined ES - is shut down or not
      provisioned yet, while the ES is operationally active (since the
      ES route is active). In this case, the ACS of all the ACs defined
      in that MAC-VRF is considered to be DOWN.

   Neither (a) nor (b) will trigger the DF re-election on the remote
   multi-homed PEs for a given ES, since the ACS is not taken into
   account in the DF election procedures.
   While the ACS is used as a DF election tie-breaker and trigger in
   VPLS multi-homing procedures [VPLS-MH], there is no procedure defined
   in EVPN [RFC7432] to trigger the DF re-election based on the ACS
   change on the DF.

   Figure 2 illustrates the described issue with an example.

                           +---+
                           |CE4|
                           +---+
                             |
                        PE4  |
                       +-----+-----+
       +---------------|  +-----+  |---------------+
       |               |  | BD-1|  |               |
       |               +-----------+               |
       |                                           |
       |                   EVPN                    |
       |                                           |
       | PE1                PE2                PE3 |
       | (NDF)             (DF)               (NDF)|
   +-----------+       +-----------+       +-----------+
   |  | BD-1|  |       |  | BD-1|  |       |  | BD-1|  |
   |  +-----+  |-------|  +-----+  |-------|  +-----+  |
   +-----------+       +-----------+       +-----------+
          AC1\   ES12   /AC2  AC3\   ES23   /AC4
              \        /          \        /
               \      /            \      /
                +----+              +----+
                |CE12|              |CE23|
                +----+              +----+

        Figure 2 Default DF Election and Traffic Black-Holing

   BD-1 is defined in PE1, PE2, PE3 and PE4. CE12 is a multi-homed CE
   connected to ES12 in PE1 and PE2. Similarly, CE23 is multi-homed to
   PE2 and PE3 using ES23. Both CE12 and CE23 are connected to BD-1
   through VLAN-based service interfaces: CE12-VID 1 (VLAN ID 1 on CE12)
   is associated to AC1 and AC2 in BD-1, whereas CE23-VID 1 is
   associated to AC3 and AC4 in BD-1. Assume that, although not
   represented, there are other ACs defined on these ES mapped to
   different BDs.

   After executing the [RFC7432] Default DF election algorithm, PE2
   turns out to be the DF for ES12 and ES23 in BD-1. The following
   issues may arise:

   a) If AC2 is accidentally shut down or even not configured, CE12
      traffic will be impacted. In case of all-active multi-homing, the
      BUM traffic to CE12 will be "black-holed", whereas for single-
      active multi-homing, all the traffic to/from CE12 will be
      discarded.
      This is due to the fact that a logical failure in PE2's AC2 may
      not trigger an ES route withdrawal for ES12 (since there are
      still other ACs active on ES12), and therefore PE1 will not re-run
      the DF election procedures.

   b) If the Bridge Table for BD-1 is administratively shut down or even
      not configured yet on PE2, CE12 and CE23 will both be impacted:
      BUM traffic to both CEs will be discarded in case of all-active
      multi-homing and all traffic will be discarded to/from the CEs in
      case of single-active multi-homing. This is due to the fact that
      PE1 and PE3 will not re-run the DF election procedures and will
      keep assuming PE2 is the DF.

   Quoting [RFC7432], "when an Ethernet Tag is decommissioned on an
   Ethernet Segment, then the PE MUST withdraw the Ethernet A-D per EVI
   route(s) announced for the (ESI, Ethernet tags) that are impacted by
   the decommissioning". However, while this A-D per EVI route
   withdrawal is used at the remote PEs performing aliasing or backup
   procedures, it is not used to influence the DF election for the
   affected EVIs.

   This document adds an optional modification of the DF Election
   procedure so that the ACS may be taken into account as a variable in
   the DF election, and therefore EVPN can provide protection against
   logical failures.

1.3. The Need for Extending the Default DF Election in EVPN

   Section 1.2 describes some of the issues that exist in the Default DF
   Election procedures. In order to address those issues, this document
   introduces a new DF Election framework. This framework allows the PEs
   to agree on a common DF election algorithm, as well as the
   capabilities to enable during the DF Election procedure.
   Generally, 'DF election algorithm' refers to the algorithm by which a
   number of input parameters are used to determine the DF PE, while 'DF
   election capability' refers to an additional feature that can be used
   prior to the invocation of the DF election algorithm, such as
   modifying the inputs (or list of candidate PEs).

   Within this framework, this document defines a new DF Election
   algorithm and a new capability that can influence the DF Election
   result:

   o The new DF Election algorithm is referred to as "Highest Random
     Weight" (HRW). The HRW procedures are described in section 4.

   o The new DF Election capability is referred to as "AC-Influenced DF
     Election" (AC-DF). The AC-DF procedures are described in section 5.

   o HRW and AC-DF mechanisms are independent of each other. Therefore,
     a PE may support either HRW or AC-DF independently or may support
     both of them together. A PE may also support AC-DF capability along
     with the Default DF election algorithm per [RFC7432].

   In addition, this document defines a way to indicate the support of
   HRW and/or AC-DF along with the EVPN ES routes advertised for a given
   ES. Refer to section 3.2 for more details.

2. Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   o AC and ACS - Attachment Circuit and Attachment Circuit Status. An
     AC has an Ethernet Tag associated to it.

   o BUM - refers to the Broadcast, Unknown unicast and Multicast
     traffic.

   o DF, NDF and BDF - Designated Forwarder, Non-Designated Forwarder
     and Backup Designated Forwarder.

   o Ethernet A-D per ES route - refers to [RFC7432] route type 1 or
     Auto-Discovery per Ethernet Segment route.

   o Ethernet A-D per EVI route - refers to [RFC7432] route type 1 or
     Auto-Discovery per EVPN Instance route.

   o ES and ESI - Ethernet Segment and Ethernet Segment Identifier.

   o EVI - EVPN Instance.

   o MAC-VRF - A Virtual Routing and Forwarding table for Media Access
     Control (MAC) addresses on a PE.

   o BD - Broadcast Domain. An EVI may be comprised of one (VLAN-Based
     or VLAN Bundle services) or multiple (VLAN-Aware Bundle services)
     Broadcast Domains.

   o Bridge Table - An instantiation of a broadcast domain on a MAC-VRF.

   o HRW - Highest Random Weight.

   o VID and CE-VID - VLAN Identifier and Customer Equipment VLAN
     Identifier.

   o Ethernet Tag - used to represent a Broadcast Domain that is
     configured on a given ES for the purpose of DF election. Note that
     any of the following may be used to represent a Broadcast Domain:
     VIDs (including Q-in-Q tags), configured IDs, VNI (VXLAN Network
     Identifiers), normalized VID, I-SIDs (Service Instance
     Identifiers), etc., as long as the representation of the broadcast
     domains is configured consistently across the multi-homed PEs
     attached to that ES. The Ethernet Tag value MUST be different from
     zero.

   o Ethernet Tag ID - refers to the identifier used in the EVPN routes
     defined in [RFC7432]. Its value may be the same as the Ethernet Tag
     value (see Ethernet Tag definition) when advertising routes for
     VLAN-aware Bundle services. Note that in case of VLAN-based or VLAN
     Bundle services, the Ethernet Tag ID is zero.

   o DF Election Procedure and DF Algorithm - The Designated Forwarder
     Election Procedure, or simply DF Election, refers to the process in
     its entirety, including the discovery of the PEs in the ES, the
     creation and maintenance of the PE candidate list and the selection
     of a PE. The Designated Forwarder Algorithm is just a component of
     the DF Election Procedure and strictly refers to the selection of a
     PE for a given (ESI, Ethernet Tag).

   o TTL - Time To Live.

   This document also assumes familiarity with the terminology of
   [RFC7432].

3. Designated Forwarder Election Protocol and BGP Extensions

   This section describes the BGP extensions required to support the new
   DF Election procedures. In addition, since the EVPN specification
   [RFC7432] leaves several questions open as to the precise final
   state machine behavior of the DF election, section 3.1 describes
   precisely the intended behavior.

3.1. The DF Election Finite State Machine (FSM)

   Per [RFC7432], the FSM described in Figure 3 is executed per
   (ESI, VLAN) in case of VLAN-based service or per (ESI, [VLANs in
   VLAN Bundle]) in case of VLAN Bundle on each participating PE.

   Observe that currently the VLANs are derived from local configuration
   and the FSM does not provide any protection against misconfiguration
   where the same (EVI,ESI) combination has a different set of VLANs on
   different participating PEs, or one of the PEs elects to consider
   VLANs as VLAN Bundle and another as separate VLANs for election
   purposes (service type mismatch).

   The FSM is conceptual and any design or implementation MUST comply
   with a behavior equivalent to the one outlined in this FSM.
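
   As a rough illustration of what "equivalent behavior" means, the
   state transitions that Figure 3 and the lists below define can be
   captured in a small table. This is a minimal sketch, not a
   definitive implementation: timers and the election itself are left
   out, the names TRANSITIONS, next_state and run are hypothetical, and
   treating an unlisted (state, event) pair as "stay in the same state"
   is an assumption.

```python
RECALC_EVENTS = ("VLAN_CHANGE", "RCVD_ES", "LOST_ES")

# (state, event) -> next state, following the action list of this section.
TRANSITIONS = {
    ("INIT", "ES_UP"): "DF_WAIT",
    ("DF_WAIT", "DF_TIMER"): "DF_CALC",
    ("DF_CALC", "CALCULATED"): "DF_DONE",
}
for _event in RECALC_EVENTS:
    TRANSITIONS[("INIT", _event)] = "INIT"        # do nothing
    TRANSITIONS[("DF_WAIT", _event)] = "DF_WAIT"  # do nothing
    TRANSITIONS[("DF_CALC", _event)] = "DF_CALC"  # re-enter and elect again
    TRANSITIONS[("DF_DONE", _event)] = "DF_CALC"  # a new election is needed


def next_state(state, event):
    """Return the state reached from `state` when `event` fires."""
    if event == "ES_DOWN":
        return "INIT"  # ANY_STATE on ES_DOWN
    # Assumption: a pair with no listed transition leaves the state as is.
    return TRANSITIONS.get((state, event), state)


def run(events, state="INIT"):
    """Feed a sequence of events to the FSM and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


print(run(["ES_UP", "RCVD_ES", "DF_TIMER", "CALCULATED"]))
```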

                                             VLAN_CHANGE
                  VLAN_CHANGE                RCVD_ES
                  RCVD_ES                    LOST_ES
                  LOST_ES                    +----+
                  +----+                     |    v
                  |    |                    ++----++
                  |  +-+----+    ES_UP      |  DF  |
                  +->+ INIT +---------------> WAIT |
                     ++-----+               +----+-+
                      ^                          |
  +-----------+       |                          |DF_TIMER
  | ANY_STATE +-------+        VLAN_CHANGE       |
  +-----------+ ES_DOWN    +-----------------+   |
                           |  RCVD_ES        v   v
                     +-----++ LOST_ES       ++---+-+
                     |  DF  |               |  DF  |
                     | DONE +<--------------+ CALC +<-+
                     +------+   CALCULATED  +----+-+  |
                                                 |    |
                                                 +----+
                                                 VLAN_CHANGE
                                                 RCVD_ES
                                                 LOST_ES

               Figure 3 DF Election Finite State Machine

   States:

   1. INIT: Initial State.

   2. DF_WAIT: State in which the participant waits for enough
      information to perform the DF election for the EVI/ESI/VLAN
      combination.

   3. DF_CALC: State in which the new DF is recomputed.

   4. DF_DONE: State in which the corresponding DF for the EVI/ESI/VLAN
      combination has been elected.

   5. ANY_STATE: Refers to any of the above states.

   Events:

   1. ES_UP: The ESI has been locally configured as 'up'.

   2. ES_DOWN: The ESI has been locally configured as 'down'.

   3. VLAN_CHANGE: The VLANs configured in a bundle (that uses the ESI)
      changed. This event is necessary for VLAN Bundles only.

   4. DF_TIMER: DF Wait timer [RFC7432] has expired.

   5. RCVD_ES: A new or changed Ethernet Segment route is received in a
      BGP REACH UPDATE. Receiving an unchanged UPDATE MUST NOT trigger
      this event.

   6. LOST_ES: A BGP UNREACH UPDATE for a previously received Ethernet
      Segment route has been received. If an UNREACH is seen for a
      route that has not been advertised previously, the event MUST NOT
      be triggered.

   7. CALCULATED: DF has been successfully calculated.

   Corresponding actions when transitions are performed or states
   entered/exited:

   1. ANY_STATE on ES_DOWN: (i) stop DF wait timer (ii) assume NDF for
      local PE.

   2. INIT on ES_UP: transition to DF_WAIT.

   3. INIT on VLAN_CHANGE, RCVD_ES or LOST_ES: do nothing.

   4. DF_WAIT on entering the state: (i) start DF wait timer if not
      started already or expired (ii) assume NDF for local PE.

   5. DF_WAIT on VLAN_CHANGE, RCVD_ES or LOST_ES: do nothing.

   6. DF_WAIT on DF_TIMER: transition to DF_CALC.

   7. DF_CALC on entering or re-entering the state: (i) rebuild
      candidate list, hash and perform election (ii) afterwards the FSM
      generates a CALCULATED event against itself.

   8. DF_CALC on VLAN_CHANGE, RCVD_ES or LOST_ES: do as in transition
      7.

   9. DF_CALC on CALCULATED: mark election result for VLAN or bundle,
      and transition to DF_DONE.

   10. DF_DONE on exiting the state: if there is a new DF election
       triggered and the current DF is lost, then assume NDF for local
       PE for VLAN or VLAN Bundle.

   11. DF_DONE on VLAN_CHANGE, RCVD_ES or LOST_ES: transition to
       DF_CALC.

   The above events and transitions are defined for the Default DF
   Election Algorithm. As described in Section 5, the use of the AC-DF
   capability introduces additional events and transitions.

3.2. The DF Election Extended Community

   For the DF election procedures to be consistent and unanimous, it is
   necessary that all the participating PEs agree on the DF Election
   algorithm and capabilities to be used. For instance, it is not
   possible that some PEs continue to use the Default DF Election
   algorithm and some PEs use HRW. For brown-field deployments and for
   interoperability with legacy PEs, it is important that all PEs have
   the capability to fall back on the Default DF Election. A PE can
   indicate its willingness to support HRW and/or AC-DF by signaling a
   DF Election Extended Community along with the Ethernet Segment route
   (Type-4).
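
   The eight-octet layout that Figure 4 and Figure 5 below specify can
   be sketched as a pack/unpack pair. This is an illustration under
   stated assumptions: the function and constant names are
   hypothetical, the RSV and Reserved bits are simply sent as zero, and
   Bitmap bit 0 is taken to be the most significant bit, as the bit
   numbering in the figures suggests.

```python
import struct

EC_TYPE_EVPN = 0x06             # Type field for EVPN Extended Communities
EC_SUBTYPE_DF_ELECTION = 0x06   # Sub-Type requested by this document
AC_DF_BIT = 1 << (15 - 1)       # Bitmap bit 1, counting from the MSB

# Type(1) Sub-Type(1) RSV+DF Alg(1) Bitmap(2) Reserved(3), network order.
_LAYOUT = "!BBBH3s"


def pack_df_election_ec(df_alg, ac_df=False):
    """Encode the DF Alg and the AC-DF capability into 8 octets."""
    if not 0 <= df_alg <= 31:
        raise ValueError("DF Alg is a 5-bit value (0-31)")
    bitmap = AC_DF_BIT if ac_df else 0
    # The three RSV bits above DF Alg are left at zero.
    return struct.pack(_LAYOUT, EC_TYPE_EVPN, EC_SUBTYPE_DF_ELECTION,
                       df_alg & 0x1F, bitmap, b"\x00" * 3)


def unpack_df_election_ec(data):
    """Decode 8 octets into (df_alg, ac_df); reserved bits are ignored."""
    ec_type, sub_type, rsv_alg, bitmap, _ = struct.unpack(_LAYOUT, data)
    if (ec_type, sub_type) != (EC_TYPE_EVPN, EC_SUBTYPE_DF_ELECTION):
        raise ValueError("not a DF Election Extended Community")
    return rsv_alg & 0x1F, bool(bitmap & AC_DF_BIT)


# HRW (DF Alg 1) with the AC-DF capability set.
print(pack_df_election_ec(1, ac_df=True).hex())
```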

   The DF Election Extended Community is a new BGP transitive extended
   community attribute [RFC4360] that is defined to identify the DF
   election procedure to be used for the Ethernet Segment. Figure 4
   shows the encoding of the DF Election Extended Community.

    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(0x06)| RSV |  DF Alg |    Bitmap     ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~     Bitmap    |            Reserved                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 4 DF Election Extended Community

   Where:

   o Type is 0x06 as registered with IANA for EVPN Extended Communities.

   o Sub-Type is 0x06 - "DF Election Extended Community" as requested by
     this document to IANA.

   o RSV / Reserved - Reserved bits for DF Alg specific information.

   o DF Alg (5 bits) - Encodes the DF Election algorithm values (between
     0 and 31) that the advertising PE desires to use for the ES. This
     document requests IANA to set up a registry called "DF Alg
     Registry" and solicits the following values:

     - Type 0: Default DF Election algorithm, or modulus-based algorithm
       as in [RFC7432].

     - Type 1: HRW algorithm (explained in this document).

     - Types 2-30: Unassigned.

     - Type 31: Reserved for Experimental Use.

   o Bitmap (2 octets) - Encodes "capabilities" to use with the DF
     Election algorithm in the field "DF Alg". This document requests
     IANA to create a registry for the Bitmap field, with values 0-15,
     called "DF Election Capabilities" and solicits the following
     values:

                           1 1 1 1 1 1
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | |A|                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 5 Bitmap field in the DF Election Extended Community

     - Bit 0 (corresponds to Bit 24 of the DF Election Extended
       Community): Unassigned.
  - Bit 1: AC-DF (AC-Influenced DF Election, explained in this
    document). When set to 1, it indicates the desire to use
    AC-Influenced DF Election with the rest of the PEs in the ES.

  - Bits 2-15: Unassigned.

The DF Election Extended Community is used as follows:

o A PE SHOULD attach the DF Election Extended Community to any
  advertised ES route. The Extended Community MUST be sent if the ES
  is locally configured with a DF election algorithm other than the
  Default DF Election algorithm or if a capability is required to be
  used. In the Extended Community, the PE indicates the desired "DF
  Alg" algorithm and "Bitmap" capabilities to be used for the ES.

  - Only one DF Election Extended Community can be sent along with an
    ES route. Note that the intent is not for the advertising PE to
    indicate all the supported DF election algorithms and
    capabilities, but to signal the preferred one.

  - DF Algs 0 and 1 can both be used with the AC-DF bit set to 0 or
    1.

  - In general, a specific DF Alg SHOULD determine the use of the
    reserved bits in the Extended Community, which may be used in a
    different way for a different DF Alg. In particular, for DF Algs
    0 and 1, the reserved bits are not set by the advertising PE and
    SHOULD be ignored by the receiving PE.

o When a PE receives the ES routes from all the other PEs for the ES
  in question, it checks whether all the advertisements carry the
  Extended Community with the same DF Alg and Bitmap:

  - If they do, this particular PE MUST follow the procedures for the
    advertised DF Alg and capabilities. For instance, if all ES
    routes for a given ES indicate DF Alg HRW and AC-DF set to 1, the
    receiving PE, and by induction all the other PEs in the ES, will
    proceed to do DF Election as per the HRW Algorithm and following
    the AC-DF procedures.
  - Otherwise, if even a single advertisement for the Type-4 route is
    received without the locally configured DF Alg and capability,
    the Default DF Election (modulus) algorithm MUST be used as in
    [RFC7432]. This procedure handles the case where participating
    PEs in the ES disagree about the DF algorithm and capability to
    apply.

  - The absence of the DF Election Extended Community or the presence
    of multiple DF Election Extended Communities (in the same route)
    MUST be interpreted by a receiving PE as an indication of the
    Default DF Election algorithm on the sending PE, that is, DF Alg
    0 and no DF Election capabilities.

o When all the PEs in an ES advertise DF Alg 31, they will rely on
  local policy to decide how to proceed with the DF Election.

o For any new capability defined in the future, the
  applicability/compatibility of this new capability to the existing
  DF Algs must be assessed on a case-by-case basis.

o Likewise, for any new DF Alg defined in the future, its
  applicability/compatibility to the existing capabilities must be
  assessed on a case-by-case basis.

3.2.1. Backward Compatibility

[RFC7432] implementations (i.e., those that predate this
specification) will not advertise the DF Election Extended Community.
That means that all other participating PEs in the ES will not
receive DF preferences and will revert to the Default DF Election
algorithm without AC-Influenced DF Election.

Similarly, an [RFC7432] implementation receiving a DF Election
Extended Community will ignore it and will continue to use the
Default DF Election algorithm.

3.3. Auto-Derivation of ES-Import Route Target

Section 7.6 of [RFC7432] describes how the value of the ES-Import
Route Target for ESI types 1, 2, and 3 can be auto-derived by using
the high-order six bytes of the nine-byte ESI value.
The same auto-derivation procedure can be extended to ESI types 0, 4,
and 5 as long as it is ensured that the auto-derived values for
ES-Import RT among different ES types do not overlap. As in
[RFC7432], the mechanism to guarantee that the auto-derived ESI or
ES-Import RT values for different ESIs do not match is out of scope
of this document.

4. The Highest Random Weight DF Election Algorithm

The procedure discussed in this section is applicable to the DF
Election in EVPN Services [RFC7432] and EVPN Virtual Private Wire
Services [RFC8214].

Highest Random Weight (HRW), as defined in [HRW1999], was originally
proposed in the context of Internet caching and proxy server load
balancing. Given an object name and a set of servers, HRW maps a
request to a server using the object-name (object-id) and server-name
(server-id) rather than the server states. HRW forms a hash out of
the server-id and the object-id and forms an ordered list of the
servers for the particular object-id. The server for which the hash
value is highest serves as the primary responsible for that
particular object, and the server with the next-highest hash value
serves as the backup server. HRW always maps a given object name to
the same server within a given cluster; consequently, it can be used
at client sites to achieve global consensus on object-server
mappings. When that server goes down, the backup server becomes the
responsible designate.

Choosing an appropriate hash function that is statistically oblivious
to the key distribution and imparts a good uniform distribution of
the hash output is an important aspect of the algorithm. Fortunately,
many such hash functions exist. [HRW1999] provides pseudo-random
functions based on the Unix utilities rand and srand and easily
constructed XOR functions that perform considerably well.
This imparts very good properties in the load balancing context.
Also, each server independently and unambiguously arrives at the
primary server selection. HRW already finds use in multicast and ECMP
[RFC2991] [RFC2992].

4.1. HRW and Consistent Hashing

HRW is not the only algorithm that addresses the object-to-server
mapping problem with goals of fair load distribution, redundancy and
fast access. There is another family of algorithms that also
addresses this problem; these fall under the umbrella of the
Consistent Hashing Algorithms [CHASH]. These will not be considered
here.

4.2. HRW Algorithm for EVPN DF Election

This section describes the application of HRW to DF election. Let
DF(v) denote the Designated Forwarder and BDF(v) the Backup
Designated Forwarder for the Ethernet Tag v, where v is the VLAN, Si
is the IP address of PE i, Es denotes the Ethernet Segment Identifier
and Weight is a function of v, Si, and Es.

Note that while the DF election algorithm in [RFC7432] uses the PE
address and VLAN as inputs, this document uses the Ethernet Tag, PE
address and ESI as inputs. This is because if the same set of PEs are
multi-homed to the same set of ESes, then the DF election algorithm
used in [RFC7432] would result in the same PE being elected DF for
the same set of broadcast domains on each ES, which can have adverse
side-effects on both load balancing and redundancy. Including the ESI
in the DF election algorithm introduces additional entropy, which
significantly reduces the probability of the same PE being elected DF
for the same set of broadcast domains on each ES. Therefore, the ESI
value in the Weight function below SHOULD be set to that of the
corresponding ES. The ESI value MAY be set to all 0's in the Weight
function below if the operator so chooses.
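The effect of including the ESI can be explored with a minimal
sketch of both elections. It uses the Wrand candidate function, the
31-bit digest, and the lowest-IP tie-break that are specified later
in this section. The sketch assumes that "CRC-32" means the IEEE
802.3 CRC as computed by Python's zlib.crc32, and that PE addresses
are given as text strings; the function names are hypothetical and
are not part of this document.

```python
import ipaddress
import zlib

MULTIPLIER, INCREMENT, MODULUS = 1103515245, 12345, 2 ** 31


def pe_int(pe_ip: str) -> int:
    """Numeric value of a PE's IPv4 or IPv6 address."""
    return int(ipaddress.ip_address(pe_ip))


def digest(ethernet_tag: int, esi: bytes) -> int:
    """31-bit D(v, Es): CRC-32 over tag (4 bytes) + ESI (10 bytes), MSB dropped."""
    if len(esi) != 10:
        raise ValueError("ESI must be 10 bytes")
    stream = ethernet_tag.to_bytes(4, "big") + esi  # network byte order
    # Assumption: zlib's CRC-32 (IEEE 802.3 polynomial) is the intended CRC.
    return zlib.crc32(stream) & 0x7FFFFFFF


def wrand(ethernet_tag: int, esi: bytes, pe_ip: str) -> int:
    """Weight of one PE for the (Ethernet Tag, ESI) pair."""
    inner = (MULTIPLIER * pe_int(pe_ip) + INCREMENT) ^ digest(ethernet_tag, esi)
    return (MULTIPLIER * inner + INCREMENT) % MODULUS


def hrw_elect(ethernet_tag: int, esi: bytes, pe_ips):
    """Return (DF, BDF): highest weight first, lowest IP address wins ties."""
    ranked = sorted(pe_ips,
                    key=lambda ip: (-wrand(ethernet_tag, esi, ip), pe_int(ip)))
    return ranked[0], (ranked[1] if len(ranked) > 1 else None)


def modulus_elect(ethernet_tag: int, pe_ips):
    """Default [RFC7432] election: the ESI plays no role in the result."""
    ordered = sorted(pe_ips, key=pe_int)
    return ordered[ethernet_tag % len(ordered)]
```

Calling modulus_elect() for the same Ethernet Tag and the same PE set
returns the same PE regardless of the ES, whereas hrw_elect() takes
the ESI as an input, so the elected DF and BDF can differ from one ES
to another.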
In case of a VLAN Bundle service, v denotes the lowest VLAN, similar
to the 'lowest VLAN in bundle' logic of [RFC7432].

1. DF(v) = Si: Weight(v, Es, Si) >= Weight(v, Es, Sj), for all j. In
   case of a tie, choose the PE whose IP address is numerically the
   least. Note 0 <= i,j < Number of PEs in the redundancy group.

2. BDF(v) = Sk: Weight(v, Es, Si) >= Weight(v, Es, Sk) and Weight(v,
   Es, Sk) >= Weight(v, Es, Sj). In case of a tie, choose the PE
   whose IP address is numerically the least.

Where:

DF(v) is defined to be the address Si (index i) for which Weight(v,
Es, Si) is the highest, 0 <= i <= N-1.

BDF(v) is defined as that PE with address Sk for which the computed
weight is the next highest after the weight of the DF. j is the
running index from 0 to N-1; i and k are selected values.

Since the Weight is a pseudo-random function with the three-tuple (v,
Es, S) as its domain, it is an efficient and deterministic algorithm
that is independent of the Ethernet Tag v sample space distribution.
Choosing a good hash function for the pseudo-random function is an
important consideration for this algorithm to perform better than the
Default algorithm. As mentioned previously, such functions are
described in the HRW paper. We take as candidate hash function the
first of the two that are preferred in [HRW1999]:

Wrand(v, Es, Si) = (1103515245 * ((1103515245 * Si + 12345) XOR
                   D(v, Es)) + 12345) (mod 2^31)

Here D(v, Es) is the 31-bit digest (CRC-32 and discarding the MSB as
in [HRW1999]) of the 14-byte stream, the Ethernet Tag v (4 bytes)
followed by the Ethernet Segment Identifier (10 bytes). It is
mandated that the 14-byte stream is formed by concatenation of the
Ethernet Tag and the Ethernet Segment Identifier in network byte
order. The CRC should proceed as if the stream is in network byte
order (big-endian). Si is the address of the ith server.
The server's IP address length does not matter, as only the low-order
31 bits are significant under the modulo operation.

A point to note is that the Weight function takes into consideration
the combination of the Ethernet Tag, Ethernet Segment and the PE IP
address, and the actual length of the server IP address (whether IPv4
or IPv6) is not really relevant. The Default algorithm in [RFC7432]
cannot employ both IPv4 and IPv6 PE addresses, since [RFC7432] does
not specify how to decide on the ordering (the ordinal list) when
both IPv4 and IPv6 PEs are present.

HRW addresses the disadvantages pointed out in Section 1.2.1 and
ensures the following:

o With very high probability, the task of DF election for the VLANs
  configured on an ES is more or less equally distributed among the
  PEs, even in the two-PE case.

o If a PE that is not the DF or the BDF for a VLAN goes down, or its
  connection to the ES goes down, this does not result in a DF or BDF
  reassignment. This saves computation, especially when the
  connection flaps.

o More importantly, it avoids the needless disruption case of Section
  1.2.1 (3) that is inherent in the existing Default DF Election.

o In addition to the DF, the algorithm also furnishes the BDF, which
  would be the DF if the current DF fails.

5. The Attachment Circuit Influenced DF Election Capability

The procedure discussed in this section is applicable to the DF
Election in EVPN Services [RFC7432] and EVPN Virtual Private Wire
Services [RFC8214].

The AC-DF capability is expected to be of general applicability with
any future DF Algorithm. It modifies the DF Election procedures by
removing from consideration any candidate PE in the ES that cannot
forward traffic on the AC that belongs to the BD. This section is
applicable to VLAN-Based and VLAN Bundle service interfaces. Section
5.1 describes the procedures for VLAN-Aware Bundle interfaces.
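The pruning idea described above can be sketched as follows, here
combined with the Default (modulus) DF Alg that the modified Step 3
below spells out. The sketch assumes that the local PE already knows
which PEs advertised an ES route, an Ethernet A-D per ES route, and
an Ethernet A-D per EVI route for the Ethernet Tag in question; the
function and parameter names are hypothetical and are not defined by
this document.

```python
import ipaddress
from typing import Iterable, List, Optional, Set


def acdf_candidates(es_route_pes: Iterable[str],
                    ad_per_es_pes: Set[str],
                    ad_per_evi_pes: Set[str]) -> List[str]:
    """Ordered candidate list after AC-DF pruning.

    A PE stays in the list only if its Ethernet A-D per ES route and its
    Ethernet A-D per EVI route for this Ethernet Tag are both present.
    """
    kept = [pe for pe in es_route_pes
            if pe in ad_per_es_pes and pe in ad_per_evi_pes]
    return sorted(kept, key=lambda pe: int(ipaddress.ip_address(pe)))


def default_df(ethernet_tag: int, candidates: List[str]) -> Optional[str]:
    """Default DF Alg: the PE with ordinal (V mod N) in the ordered list."""
    if not candidates:
        return None
    return candidates[ethernet_tag % len(candidates)]


pes = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
all_up = acdf_candidates(pes, set(pes), set(pes))
# 192.0.2.2 has withdrawn its Ethernet A-D per EVI route (its AC is down).
ac_down = acdf_candidates(pes, set(pes), {"192.0.2.1", "192.0.2.3"})
print(default_df(4, all_up))   # 192.0.2.2
print(default_df(4, ac_down))  # 192.0.2.1
```

In the second call the PE whose AC is down is no longer a candidate,
so the election for Ethernet Tag 4 moves to a PE that can actually
forward traffic.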
In particular, when used with the Default DF Alg, the AC-DF
capability modifies Step 3 in the DF Election procedure described in
[RFC7432] Section 8.5, as follows:

3. When the timer expires, each PE builds an ordered "candidate" list
   of the IP addresses of all the PE nodes attached to the Ethernet
   Segment (including itself), in increasing numeric value. The
   candidate list is based on the Originating Router's IP addresses
   of the ES routes, but excludes any PE from whom no Ethernet A-D
   per ES route has been received, or from whom the route has been
   withdrawn. Afterwards, the DF Election algorithm is applied on a
   per <ES, Ethernet Tag> basis; however, the IP address for a PE
   will not be considered a candidate for a given <ES, Ethernet Tag>
   until the corresponding Ethernet A-D per EVI route has been
   received from that PE. In other words, the ACS on the ES for a
   given PE must be UP so that the PE is considered as candidate for
   a given BD. If the Default DF Alg is used, every PE in the
   resulting candidate list is then given an ordinal indicating its
   position in the ordered list, starting with 0 as the ordinal for
   the PE with the numerically lowest IP address. The ordinals are
   used to determine which PE node will be the DF for a given
   Ethernet Tag on the Ethernet Segment, using the following rule:

   Assuming a redundancy group of N PE nodes, for VLAN-based service,
   the PE with ordinal i is the DF for an <ES, Ethernet Tag V> when
   (V mod N) = i. In the case of VLAN-(aware) bundle service, the
   numerically lowest VLAN value in that bundle on that ES MUST be
   used in the modulo function as the Ethernet Tag.

   It should be noted that using the "Originating Router's IP
   address" field in the Ethernet Segment route to get the PE IP
   address needed for the ordered list allows for a CE to be
   multihomed across different ASes if such a need ever arises.
The above three paragraphs differ from [RFC7432] Section 8.5, Step 3,
in two aspects:

o Any DF Alg can be used, and not only the described modulus-based DF
  Alg (referred to as the Default DF Election, or DF Alg 0 in this
  document).

o The candidate list is pruned based upon non-receipt of Ethernet A-D
  routes: a PE's IP address MUST be removed from the ES candidate
  list if its Ethernet A-D per ES route is withdrawn. A PE's IP
  address MUST NOT be considered as candidate DF for an <ES, Ethernet
  Tag> if its Ethernet A-D per EVI route for the <ES, Ethernet Tag>
  is withdrawn.

The following example illustrates the AC-DF behavior applied to the
Default DF election algorithm, assuming the network in Figure 2:

a) When PE1 and PE2 discover ES12, they advertise an ES route for
   ES12 with the associated ES-import extended community and the DF
   Election Extended Community indicating AC-DF=1; they start a DF
   Wait timer (independently). Likewise, PE2 and PE3 advertise an ES
   route for ES23 with AC-DF=1 and start a DF Wait timer.

b) PE1/PE2 advertise an Ethernet A-D per ES route for ES12, and
   PE2/PE3 advertise an Ethernet A-D per ES route for ES23.

c) In addition, PE1/PE2/PE3 advertise an Ethernet A-D per EVI route
   for AC1, AC2, AC3 and AC4 as soon as the ACs are enabled. Note
   that the AC can be associated to a single customer VID (e.g.
   VLAN-based service interfaces) or a bundle of customer VIDs (e.g.
   VLAN Bundle service interfaces).

d) When the timer expires, each PE builds an ordered "candidate" list
   of the IP addresses of all the PE nodes connected to the Ethernet
   Segment (including itself) as explained above in [RFC7432] Step 3.
   Any PE from which an Ethernet A-D per ES route has not been
   received is pruned from the list.

e) When electing the DF for a given BD, a PE will not be considered
   candidate until an Ethernet A-D per EVI route has been received
   from that PE.
   In other words, the ACS on the ES for a given PE must be UP so
   that the PE is considered as candidate for a given BD. For
   example, PE1 will not consider PE2 as candidate for DF election
   for <ES12, VLAN-1> until an Ethernet A-D per EVI route is received
   from PE2 for <ES12, VLAN-1>.

f) Once the PEs with ACS = DOWN for a given BD have been removed from
   the candidate list, the DF Election can be applied for the
   remaining N candidates.

Note that this procedure only modifies the existing EVPN control
plane by adding and processing the DF Election Extended Community,
and by pruning the candidate list of PEs that take part in the DF
election.

In addition to the events defined in the FSM in Section 3.1, the
following events SHALL modify the candidate PE list and trigger the
DF re-election in a PE for a given <ES, Ethernet Tag>. In the FSM of
Figure 3, the events below MUST trigger a transition from DF_DONE to
DF_CALC:

i.   Local AC going DOWN/UP.

ii.  Reception of a new Ethernet A-D per EVI update/withdraw for the
     <ES, Ethernet Tag>.

iii. Reception of a new Ethernet A-D per ES update/withdraw for the
     ES.

5.1. AC-Influenced DF Election Capability For VLAN-Aware Bundle Services

The procedure described in Section 5 works for VLAN-based and VLAN
Bundle service interfaces since, for those service types, a PE
advertises only one Ethernet A-D per EVI route per <ES, VLAN> or
<ES, VLAN Bundle>. In Section 5, an Ethernet Tag represents a given
VLAN or VLAN Bundle for the purpose of DF Election. The withdrawal of
such a route means that the PE cannot forward traffic on that
particular <ES, VLAN> or <ES, VLAN Bundle>; therefore, the PE can be
removed from consideration for DF.

According to [RFC7432], in VLAN-aware Bundle services, the PE
advertises multiple Ethernet A-D per EVI routes per <ES, VLAN Bundle>
(one route per Ethernet Tag), while the DF Election is still
performed per <ES, VLAN Bundle>.
The withdrawal of an individual route only indicates the
unavailability of a specific AC but not necessarily all the ACs in
the <ES, VLAN Bundle>.

This document modifies the DF Election for VLAN-Aware Bundle services
in the following way:

o After confirming that all the PEs in the ES advertise the AC-DF
  capability, a PE will perform a DF Election per <ES, VLAN>, as
  opposed to per <ES, VLAN Bundle> in [RFC7432]. Now, the withdrawal
  of an Ethernet A-D per EVI route for a VLAN will indicate that the
  advertising PE's ACS is DOWN and the rest of the PEs in the ES can
  remove the PE from consideration for DF in the <ES, VLAN>.

o The PEs will now follow the procedures in Section 5.

For example, assuming three Bridge Tables in PE1 for the same MAC-VRF
(each one associated to a different Ethernet Tag, e.g. VLAN-1, VLAN-2
and VLAN-3), PE1 will advertise three Ethernet A-D per EVI routes for
ES12. Each of the three routes will indicate the status of each of
the three ACs in ES12. PE1 will be considered as a valid candidate PE
for DF Election in <ES12, VLAN-1>, <ES12, VLAN-2> and <ES12, VLAN-3>
as long as its three routes are active. For instance, if PE1
withdraws the Ethernet A-D per EVI route for <ES12, VLAN-1>, the PEs
in ES12 will not consider PE1 as a suitable DF candidate for <ES12,
VLAN-1>. PE1 will still be considered for <ES12, VLAN-2> and <ES12,
VLAN-3> since its routes are active.

6. Solution Benefits

The solution described in this document provides the following
benefits:

a) It extends the DF Election in [RFC7432] to address the unfair
   load-balancing and potential black-holing issues of the Default DF
   Election algorithm. The solution is applicable to the DF Election
   in EVPN Services [RFC7432] and EVPN Virtual Private Wire Services
   [RFC8214].

b) It defines a way to signal the DF Election algorithm and
   capabilities intended by the advertising PE.
   This is done by defining the DF Election Extended Community, which
   allows signaling of the capabilities supported by this document as
   well as any other future DF Election algorithms and capabilities.

c) The solution is backwards compatible with the procedures defined
   in [RFC7432]. If one or more PEs in the ES do not support the new
   procedures, they will all follow the [RFC7432] DF Election.

7. Security Considerations

This document addresses some identified issues in the DF Election
procedures described in [RFC7432] by defining a new DF Election
framework. In general, this framework allows the PEs that are part of
the same Ethernet Segment to exchange additional information and
agree on the DF Election Type and Capabilities to be used.

By following the procedures in this document, the operator will
minimize undesired situations such as unfair load-balancing, service
disruption and traffic black-holing. Since those situations may have
been purposely created by a malicious user with access to the
configuration of one PE, this document also enhances the security of
the network. Note that the network will not benefit from the new
procedures if the DF Election Alg is not consistently configured on
all the PEs in the ES (if there is no unanimity among all the PEs,
the DF Election Alg falls back to the Default [RFC7432] DF Election).
This behavior could be exploited by an attacker that manages to
modify the configuration of one PE in the Ethernet Segment so that
the DF Election Alg and capabilities in all the PEs in the Ethernet
Segment fall back to the Default DF Election. If that is the case,
the PEs will be exposed to the unfair load-balancing, service
disruption and black-holing that were mentioned earlier.
In addition, the new framework is extensible and allows for future
new security enhancements that are out of the scope of this document.
Finally, since this document extends the procedures in [RFC7432], the
same Security Considerations described in [RFC7432] are valid for
this document.

8. IANA Considerations

IANA is requested to:

o Allocate Sub-Type value 0x06 in the "EVPN Extended Community
  Sub-Types" registry defined in [RFC7153] as follows:

  SUB-TYPE VALUE   NAME                             Reference
  --------------   ------------------------------   -------------
  0x06             DF Election Extended Community   This document

o Set up a registry called "DF Alg" for the DF Alg field in the
  Extended Community. New registrations will be made through the "RFC
  Required" procedure defined in [RFC8126]. Value 31 is for
  Experimental use and does not require any other RFC than this
  document. The following initial values in that registry are
  requested:

  Alg    Name                            Reference
  ----   -----------------------------   -------------
  0      Default DF Election             This document
  1      HRW algorithm                   This document
  2-30   Unassigned
  31     Reserved for Experimental use   This document

o Set up a registry called "DF Election Capabilities" for the
  two-octet Bitmap field in the Extended Community. New registrations
  will be made through the "RFC Required" procedure defined in
  [RFC8126]. The following initial value in that registry is
  requested:

  Bit    Name               Reference
  ----   ----------------   -------------
  0      Unassigned
  1      AC-DF capability   This document
  2-15   Unassigned

9. References

9.1. Normative References

[RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet
VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015.

[RFC8214] Boutros, S., Sajassi, A., Salam, S., Drake, J., and J.
Rabadan, "Virtual Private Wire Service Support in Ethernet VPN", RFC
8214, DOI 10.17487/RFC8214, August 2017.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March
1997.

[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

[RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
Communities Attribute", RFC 4360, DOI 10.17487/RFC4360, February
2006.

[RFC7153] Rosen, E. and Y. Rekhter, "IANA Registries for BGP
Extended Communities", RFC 7153, DOI 10.17487/RFC7153, March 2014.

[RFC8126] Cotton, M., Leiba, B., and T. Narten, "Guidelines for
Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126,
DOI 10.17487/RFC8126, June 2017.

9.2. Informative References

[VPLS-MH] Kothari, Henderickx et al., "BGP based Multi-homing in
Virtual Private LAN Service",
draft-ietf-bess-vpls-multihoming-02.txt, work in progress, September
2018.

[CHASH] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine,
M., and D. Lewin, "Consistent Hashing and Random Trees: Distributed
Caching Protocols for Relieving Hot Spots on the World Wide Web", ACM
Symposium on Theory of Computing, ACM Press, New York, May 1997.

[CLRS2009] Cormen, T., Leiserson, C., Rivest, R., and C. Stein,
"Introduction to Algorithms (3rd ed.)", MIT Press and McGraw-Hill,
ISBN 0-262-03384-4, February 2009.

[RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
Multicast Next-Hop Selection", RFC 2991, DOI 10.17487/RFC2991,
November 2000.

[RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
Algorithm", RFC 2992, DOI 10.17487/RFC2992, November 2000.

[HRW1999] Thaler, D. and C.
Ravishankar, "Using Name-Based Mappings to Increase Hit Rates",
IEEE/ACM Transactions on Networking, Volume 6, Issue 1, February
1998.

[Knuth] Art of Computer Programming - Sorting and Searching, Vol 3,
Pg. 516, Addison Wesley

10. Acknowledgments

The authors want to thank Sriram Venkateswaran, Laxmi Padakanti,
Ranganathan Boovaraghavan, Tamas Mondal, Sami Boutros, Jakob Heitz,
Mrinmoy Ghosh, Leo Mermelstein, Mankamana Mishra, Anoop Ghanwani and
Samir Thoria for their review and contributions. Special thanks to
Stephane Litkowski for his thorough review and detailed
contributions.

11. Contributors

In addition to the authors listed on the front page, the following
coauthors have also contributed to this document:

Antoni Przygienda
Juniper Networks, Inc.
1194 N. Mathilda Drive
Sunnyvale, CA 95134
USA
Email: prz@juniper.net

Vinod Prabhu
Nokia
Email: vinod.prabhu@nokia.com

Wim Henderickx
Nokia
Email: wim.henderickx@nokia.com

Wen Lin
Juniper Networks, Inc.
Email: wlin@juniper.net

Patrice Brissette
Cisco Systems
Email: pbrisset@cisco.com

Keyur Patel
Arrcus, Inc
Email: keyur@arrcus.com

Autumn Liu
Ciena
Email: hliu@ciena.com

Authors' Addresses

Jorge Rabadan
Nokia
777 E. Middlefield Road
Mountain View, CA 94043 USA
Email: jorge.rabadan@nokia.com

Satya Mohanty
Cisco Systems, Inc.
225 West Tasman Drive
San Jose, CA 95134
USA
Email: satyamoh@cisco.com

Ali Sajassi
Cisco Systems, Inc.
225 West Tasman Drive
San Jose, CA 95134
USA
Email: sajassi@cisco.com

John Drake
Juniper Networks, Inc.
1194 N. Mathilda Drive
Sunnyvale, CA 95134
USA
Email: jdrake@juniper.net

Kiran Nagaraj
Nokia
701 E.
Middlefield Road
Mountain View, CA 94043 USA
Email: kiran.nagaraj@nokia.com

Senthil Sathappan
Nokia
701 E. Middlefield Road
Mountain View, CA 94043 USA
Email: senthil.sathappan@nokia.com