BESS Working Group                                       A. Sajassi, Ed.
Internet-Draft                                                 G. Badoni
Intended status: Standards Track                                  D. Rao
Expires: September 10, 2020                                 P. Brissette
                                                                   Cisco
                                                                J. Drake
                                                                 Juniper
                                                              J. Rabadan
                                                                   Nokia
                                                           March 9, 2020

                   Fast Recovery for EVPN DF Election
                draft-ietf-bess-evpn-fast-df-recovery-01

Abstract

   The Ethernet Virtual Private Network (EVPN) solution in RFC 7432
   describes Designated Forwarder (DF) election procedures for
   multi-homing Ethernet Segments.  These procedures are enhanced
   further in RFC 8584 by applying the Highest Random Weight (HRW)
   algorithm for DF election in order to avoid unnecessary DF status
   changes upon a failure.
   This draft further improves the DF election procedures of RFC 8584
   by providing two options for fast DF election upon recovery of a
   failed link or node associated with the multi-homing Ethernet
   Segment.  This fast DF election is achieved independently of the
   number of EVIs associated with that Ethernet Segment, and it is
   performed via simple signaling between the recovered PE and each PE
   in the multi-homing group.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 10, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Challenges with Existing Solution
     2.1.  Overview of Proposed Solutions
   3.  DF Election Handshake Solution
     3.1.  Discovery
     3.2.  DF Candidates Determination
     3.3.  DF Election Handshake
     3.4.  Node Insertion
     3.5.  BGP Encoding
       3.5.1.  DF Election Handshake Request Route
       3.5.2.  DF Election Handshake Response Route
     3.6.  DF Handshake Scenarios
     3.7.  Backwards Compatibility
   4.  DF Election Synchronization Solution
     4.1.  Advantages
     4.2.  BGP Encoding
     4.3.  Note on NTP-based synchronization
     4.4.  Synchronization Scenarios
     4.5.  Backwards Compatibility
   5.  Interoperability
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Contributors
   Appendix B.  Acknowledgements
   Authors' Addresses

1.  Introduction

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
   becoming pervasive in data center (DC) applications for Network
   Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
   in service provider (SP) applications for next generation virtual
   private LAN services.

   The EVPN solution [RFC7432] describes DF election procedures for
   multi-homing Ethernet Segments.  These procedures are enhanced
   further in [RFC8584] by applying the Highest Random Weight (HRW)
   algorithm for DF election in order to avoid unnecessary DF status
   changes upon a link or node failure associated with the multi-homing
   Ethernet Segment.  This draft further improves the DF election
   procedures in [RFC8584] by providing two options for a fast DF
   election upon recovery of the failed link or node associated with
   the multi-homing Ethernet Segment.  This DF election is achieved
   independently of the number of EVIs associated with that Ethernet
   Segment, and it is performed via simple signaling between the
   recovered PE and each PE in the multi-homing group.  The draft
   presents two signaling options.  The first option is based on a
   bidirectional handshake procedure, whereas the second option is
   based on a simple one-way signaling mechanism.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Provider Edge (PE):  A device that sits at the boundary between
      provider and customer networks and performs encap/decap of data
      from L2 to L3 and vice versa.

   Designated Forwarder (DF):  A PE that is currently forwarding
      (encapsulating/decapsulating) traffic for a given VLAN in and out
      of a site.

2.  Challenges with Existing Solution

   In EVPN technology, multiple PE devices have the ability to encap and
   decap data belonging to the same VLAN.  In certain situations, this
   may cause L2 duplicates and even loops if there is a momentary
   overlap of forwarding roles between two or more PE devices, leading
   to broadcast storms.

   EVPN [RFC7432] currently uses timer-based synchronization among PE
   devices in a redundancy group.  This can result in duplicates (and
   even loops) because of multiple DFs if the timer is too short, or in
   blackholing if the timer is too long.

   Using site-of-origin Split Horizon filtering can prevent loops (but
   not duplicates); however, if there are overlapping DFs in two
   different sites at the same time for the same VLAN, the site
   identifier will be different upon re-entry of the packet and hence
   the split horizon check will fail, leading to L2 loops.

   The current state of the art [RFC8584] uses the well-known HRW
   (Highest Random Weight) algorithm to avoid reshuffling of VLANs among
   PE devices in the redundancy group upon failure/recovery, and thus
   reduces the impact of failure/recovery on VLANs not on the
   failed/recovered ports.  This eliminates loops/duplicates in failure
   scenarios.

   However, upon PE insertion or port bring-up, HRW cannot help, as a
   transfer of the DF role needs to happen to the newly inserted
   device/port while the old DF is still active.

                                        +---------+
                     +-------------+    |         |
                     |             |    |         |
                   / |    PE1      |----|         |   +-------------+
                  /  |             |    |  MPLS/  |   |             |---H3
                 /   +-------------+    |  VxLAN/ |   |    PE10     |
            CE1 -                       |  Cloud  |   |             |
                 \   +-------------+    |         |---|             |
                  \  |             |    |         |   +-------------+
                   \ |    PE2      |----|         |
                     |             |    |         |
                     +-------------+    |         |
                                        +---------+

     Figure 1: CE1 multi-homed to PE1 and PE2.  Potential for duplicate
                                    DF.

   In Figure 1, when PE2 is inserted or booted up, PE1 will transfer the
   DF role of some VLANs to PE2 to achieve load balancing.
   However, because there is no handshake mechanism between PE1 and PE2,
   duplication of DF roles for a given VLAN is possible.  Duplication of
   DF roles may eventually lead to L2 loops as well as duplication of
   traffic.

   The current state of the art in EVPN relies on a blackholing timer
   for transferring the DF role to the newly inserted device.  This can
   cause the following issues:

   *  Loops/duplicates if the timer value is too short

   *  Prolonged traffic blackholing if the timer value is too long

2.1.  Overview of Proposed Solutions

   The first proposed solution deterministically eliminates
   loops/duplicates with a state machine approach.  The second solution
   narrows the DF election window defined in [RFC7432] by relying on
   common clock alignment, with the aim of eliminating loops.  Both
   proposals provide fast convergence upon PE/port insertion.

   Two signaling mechanisms between the newly inserted PE and the
   remaining PEs are described.  The signaling is only possible once
   the newly inserted PE has reliably discovered the other PEs and vice
   versa.  The first option is referred to as the DF Election Handshake
   Solution and is described in Section 3.  The second option is
   referred to as the DF Election Synchronization Solution and is
   described in Section 4.

3.  DF Election Handshake Solution

   Because of HRW, only one handshake per PE device is needed,
   independent of EVI/VNI scale.  The solution is divided into three
   phases:

   Phase 1:  Discovery

   Phase 2:  DF Candidates Determination

   Phase 3:  DF Election Handshake

   Each phase is described in detail below.

3.1.  Discovery

   Each PE needs to have a consistent view of the network, including the
   newly inserted PE.

   The newly inserted PE will advertise its Ethernet Segment route
   and start a flood/wait timer.
   This timer should be long enough to guarantee the dissemination and
   receipt of this advertisement by the previously inserted PEs.

   As the old DF keeps forwarding traffic while the new PE is running
   this timer, the timer can be made as long as required without
   impacting traffic convergence.  In the worst case the timer value
   can be the BGP session hold time, to ensure proper discovery, but in
   most cases it will be equivalent to the PEERING timer of [RFC7432].

3.2.  DF Candidates Determination

   After the discovery timer has elapsed, each PE will have imported
   the Ethernet Segment routes from the other PEs.  The resulting
   database comprises all the DF candidates on a per-ES basis and is
   used for DF election.  Each PE independently runs the selected DF
   algorithm - i.e., the HRW algorithm (or Preference-based) - for all
   VLANs in a given Ethernet Segment.  Since the discovery phase
   guarantees a uniform network view among the participating devices,
   the VLAN distribution results based on HRW (or Preference-based)
   will be consistent.

3.3.  DF Election Handshake

   The DF Election handshake is accomplished in the following steps:

   -  The newly inserted PE sends the DF-Request to the previously
      inserted PEs with a new sequence number.

   -  The previously inserted PE(s) receive the DF-Request and validate
      it against their own discovery state and local DF Candidates
      results (e.g., Modulo, HRW or Preference-based).

   -  The previously inserted PE(s) program their hardware to block
      the VLANs that must be transferred to the newly inserted PE.

   -  The previously inserted PE(s) send a DF-Response (with the DF-ACK
      or DF-NACK flag) to the newly inserted PE with the same sequence
      number that was contained in the DF-Request.

   -  The newly inserted PE receives the DF-Response and validates it
      using the sequence number.
      It acts on each received DF-Response message individually and,
      for faster convergence, does not wait for responses from all
      previously inserted devices.  Handshake transactions are per
      pair of peering PEs.

   -  The DF-Response received at the newly inserted PE is interpreted
      as an indication from the previously inserted PE that it has
      relinquished the DF role on those VLANs for which the newly
      inserted PE should be DF.  In other words, the newly inserted PE
      will only take over as DF for a given VLAN/ISID if

      A.  it is the DF Candidates election winner, AND

      B.  it gets the DF-ACK from the previous DF.

   -  Upon receiving a DF-Response with DF-ACK, the newly inserted PE
      assumes the DF responsibility and programs its hardware to
      unblock the VLANs it is taking over.

   -  In case of Preference-based DF Election, the above procedure
      should only be followed if there is at least one previously
      inserted PE that signals DP=0 in its ES route (there is no need
      for a handshake in non-revertive mode).

   A handshake is not needed on a per VLAN/EVI basis but rather per
   pair of PEs in the redundancy group - i.e., if a new PE is added to
   an existing redundancy group of 3 PE devices, then only 3 handshakes
   are needed.  This is because the devices are already in sync about
   which VLANs to give up or take over.

   At the end of these three phases, the VLAN DF role transfer will
   have happened in a deterministic way while ensuring minimum traffic
   loss.  Device recovery and device insertion scenarios are identical
   in terms of the handshaking procedure.  The next section describes
   the procedure details for device insertion.

3.4.  Node Insertion

   Consider the scenario where PE3 is inserted in the network, while PE1
   and PE2 are already in stable state.
   PE3 will send/receive the following flags along with the EVPN
   Type 4 route:

   -  DF-Request: Upon completing the DF Election, PE3 will send a
      DF-Request with a new sequence number.  PE1 and PE2 will receive
      this message and respond with a DF-Response carrying DF-ACK or
      DF-NACK and the same sequence number that was generated by PE3.

   -  DF-Response DF-ACK: When PE3 receives a DF-Response DF-ACK from
      PE1 with the same sequence number as the DF-Request, it will take
      over the DF role for the appropriate VLANs that are being
      transferred from PE1.  When the DF-Response DF-ACK from PE2
      arrives, the rest of the VLANs to be transferred from PE2 to PE3
      are then taken over by PE3.

   -  DF-Response DF-NACK: If PE3 receives a DF-Response DF-NACK from
      at least one of PE1 or PE2, it will not take over the DF role and
      will start over (with a new sequence number).

   Consider the scenario where two nodes, PE3 and PE4, are being
   inserted at the same time.  Both of them will send a DF-Request to
   PE1 and PE2 at around the same time, possibly with the same sequence
   number.  When PE1 and PE2 respond with DF-Response DF-ACK, it is
   important to signify exactly whom the response is meant for, as it
   could be for either requester (PE3 or PE4).  To remove any ambiguity
   and false positives, the IP address of the requester MUST be
   included in the response message to specify who the response is
   meant for.

3.5.  BGP Encoding

   The EVPN NLRI comprises Route Type (1B), Length (1B) and a Route
   Type specific variable encoding.  This document proposes the
   creation of two new EVPN route types:

   +  TBD1 - DF Election Handshake Request Route

   +  TBD2 - DF Election Handshake Response Route

3.5.1.  DF Election Handshake Request Route

   A DF Election Handshake Request Type NLRI consists of the following:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | RD (8 octets)                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Ethernet Segment Identifier (10 octets) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | DF-Flags (1 octet)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sequence Number (1 octet)               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Originating Router's IP Address         |
   | (4 or 16 octets)                        |
   +-----------------------------------------+

   The DF-Flags can have the following values:

      DF-INIT    : Sent initially upon boot-up; bootstraps the network
      DF-REQUEST : Sent to request DF takeover

   For the purpose of BGP route key processing, the Ethernet Segment
   Identifier and Originating Router's IP Address fields are considered
   to be part of the prefix in the NLRI.  The DF-Flags and Sequence
   Number fields are to be treated as route attributes as opposed to
   being part of the route.  This route is sent along with the
   ES-Import route target.

3.5.2.  DF Election Handshake Response Route

   A DF Election Handshake Response Type NLRI consists of the following:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | RD (8 octets)                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Ethernet Segment Identifier (10 octets) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | IP-Address Length (1 octet)             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Destination Router's IP Address         |
   | (4 or 16 octets)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | DF-Flags (1 octet)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sequence Number (1 octet)               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Originating Router's IP Address         |
   | (4 or 16 octets)                        |
   +-----------------------------------------+

   The DF-Flags can have the following values:

      DF-ACK  : Sent to acknowledge a DF-REQUEST
      DF-NACK : Sent to reject a DF-REQUEST

   For the purpose of BGP route key processing, the Ethernet Segment
   Identifier, IP-Address Length, Destination Router's IP Address and
   Originating Router's IP Address fields are considered to be part of
   the prefix in the NLRI.  The DF-Flags and Sequence Number fields are
   to be treated as route attributes as opposed to being part of the
   route.  This route is sent along with the ES-Import route target.

   This document introduces a new flag called "H" (for Handshake) to the
   bitmap field of the DF Election Extended Community defined in
   [RFC8584].

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(0x06)|   DF Type     |D|A|H|T| |P|   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Reserved = 0                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   H: This flag is located in bit position 26 as shown above.
   When set to 1, it indicates the desire to use the Handshaking
   capability with the rest of the PEs in the ES.  This capability can
   only be used with selected DF election algorithms such as HRW and
   Preference-based.

3.6.  DF Handshake Scenarios

   Consider the scenario where PE3 is freshly inserted into the network
   with PE1 and PE2 in steady state (as shown below).  As shown in the
   sequence diagram below, at time = T0, PE3 sends its Type-4 ES route,
   which causes PE1 and PE2 to discover PE3.

   After the discovery timer expires, at time = T1, PE3 sends a
   DF-Request containing [ESI, DF-REQ, SEQ1].

   PE2 responds with a DF-Response DF-ACK at time = T2, carrying
   [ESI, DF-ACK, PE3, SEQ1].  Note that the sequence number is the same
   as the one contained in the DF-Request from PE3.  PE3 receives the
   DF-Response DF-ACK and takes over the appropriate VLANs based on HRW
   only if the sequence number matches.

   PE1 responds with a DF-Response DF-ACK at time = T3, with the same
   sequence number SEQ1: [ESI, DF-ACK, PE3, SEQ1].  PE3 receives the
   DF-Response DF-ACK and takes over the appropriate VLANs based on HRW
   only if the sequence number matches.

   By the end of the handshake, all appropriate VLANs for the ES are
   transferred from PE1 and PE2 to PE3 with a single per-ES handshake.

   PE1                      PE2                         PE3
    |                        |                           |
    |                        |   Type-4 (Discovery)      |
    |                        |<<-------------------------| T0
    |<<--------------------------------------------------|
    |                        |                           |
    |                        |                           |
    |                        |  Type-TBD1 (DF-Request)   |
    |<<----------------------|<<-------------------------| T1
    |                        |                           |
    |                        |    Type-TBD2 (DF-Response)|
    |                        |------------------------->>| T2
    | Type-TBD2 (DF-Response)|                           |
    |-------------------------------------------------->>| T3
    |                        |                           |
    |<<################################################>>|
    |                PE3 freshly inserted                |
    |<<################################################>>|
    .                        .                           .
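
   The requester side of the scenario above can be sketched as follows.
   This is a minimal illustration and not part of the specification:
   the names NewPE, DFResponse and vlans_from are hypothetical, the
   per-peer VLAN sets are assumed to come from the DF election already
   run in Phase 2, and the addresses are documentation examples.  A
   response is acted on only if the ESI, the destination address and
   the sequence number all match the outstanding request.

```python
from dataclasses import dataclass, field

DF_ACK, DF_NACK = "DF-ACK", "DF-NACK"


@dataclass
class DFResponse:
    esi: str
    flag: str          # DF_ACK or DF_NACK
    destination: str   # IP address of the PE the response is meant for
    seq: int
    originator: str    # IP address of the responding PE


@dataclass
class NewPE:
    """Hypothetical model of the newly inserted PE (requester side)."""
    ip: str
    esi: str
    seq: int = 0
    # vlans_from[peer] = VLANs the DF election moves from that peer to us
    # (assumed to be precomputed, e.g. with HRW)
    vlans_from: dict = field(default_factory=dict)
    active_vlans: set = field(default_factory=set)

    def send_request(self):
        """Start a handshake round with a fresh sequence number."""
        self.seq = (self.seq + 1) % 256   # Sequence Number field is 1 octet
        return (self.esi, "DF-REQUEST", self.seq, self.ip)

    def on_response(self, r: DFResponse) -> bool:
        """Take over VLANs from r.originator only on a valid DF-ACK."""
        if r.esi != self.esi or r.destination != self.ip or r.seq != self.seq:
            return False                  # stale, or meant for another PE
        if r.flag != DF_ACK:
            return False                  # DF-NACK: caller starts over
        # Handshakes are per pair of PEs, so unblock this peer's VLANs now.
        self.active_vlans |= self.vlans_from.get(r.originator, set())
        return True
```

   In this sketch, PE3 would call send_request() at T1 and
   on_response() at T2 and T3, taking over the VLANs of PE2 and then
   PE1 as each acknowledgement arrives.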

   Consider the scenario where PE2 and PE3 are inserted simultaneously
   in the network where PE1 is in steady state (as shown below).  PE2
   and PE3 send their Type-4 ES routes and start the discovery timer.
   This causes PE1, PE2 and PE3 to discover each other.

   PE2 and PE3 then simultaneously and separately send a DF-Request.
   PE1 receives these requests and responds to them.

   To avoid any ambiguity, PE1 explicitly specifies in the DF-Response
   route the destination for which the DF-ACK is meant.  That is why
   the responses from PE1 contain [ESI, DF-ACK, PE2, SEQ] and [ESI,
   DF-ACK, PE3, SEQ], to specify that the response is meant for PE2 and
   PE3 respectively.

   Upon receiving the Type-TBD2 response message, PE2 and PE3 take
   over the respective VLANs.

   PE1                    PE2                         PE3
    |                      |                           |
    |                      |   Type-4 (Discovery)      |
    |<<--------------------|                           | T0
    |<<--------------------|<<-------------------------|
    |                      |                           |
    |                      |                           |
    |                      |  Type-TBD1 (DF-Request)   |
    |                      |<<-------------------------| T1
    |                      |                           |
    | Type-TBD1(DF-Request)|                           |
    |<<--------------------|                           | T2
    |                      |                           |
    |                      |    Type-TBD2 (DF-Response)|
    |------------------------------------------------>>| T3
    |                      |                           |
    |Type-TBD2(DF-Response)|                           |
    |-------------------->>|                           | T4
    |                      |                           |
    |<<##############################################>>|
    |       PE2 and PE3 inserted simultaneously        |
    |<<##############################################>>|
    .                      .                           .

   When PE3 is shut down or removed from the network, the routes
   formerly advertised by PE3 are withdrawn, including the Type-4
   route (as shown below).  When PE1 and PE2 process the deletion of
   PE3's Type-4 route, they clean up any DF handshake state
   pertaining to PE3.  This means that PE1 and PE2 withdraw the DF
   Response routes that they had earlier sent with PE3 as the
   destination.
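
   The responder behaviour in these two scenarios (addressing each
   response to its requester, and purging handshake state when a
   Type-4 route is withdrawn) can be sketched as follows.  This is an
   illustrative model only: the class ResponderPE, its method names,
   and the choice to answer an unknown requester with DF-NACK are
   assumptions made for the example and are not defined by this
   document.

```python
class ResponderPE:
    """Hypothetical model of a previously inserted PE (responder side)."""

    def __init__(self, ip, esi, known_peers):
        self.ip, self.esi = ip, esi
        self.known_peers = set(known_peers)  # PEs discovered via Type-4 routes
        self.blocked = set()                 # VLANs blocked locally
        self.responses = {}                  # requester IP -> advertised response

    def on_request(self, esi, seq, requester, vlans_to_transfer):
        """Return (esi, flag, destination, seq); block VLANs before ACKing."""
        # Assumption: a request is valid if the ESI matches and the
        # requester was discovered during Phase 1.
        ok = esi == self.esi and requester in self.known_peers
        if ok:
            self.blocked |= set(vlans_to_transfer)
        # The requester's address is echoed so that simultaneous requesters
        # using the same sequence number can tell their responses apart.
        resp = (esi, "DF-ACK" if ok else "DF-NACK", requester, seq)
        self.responses[requester] = resp
        return resp

    def on_type4_withdraw(self, peer):
        """Clean up handshake state for a PE whose Type-4 route is gone."""
        self.known_peers.discard(peer)
        # The returned entry is the DF-Response route to withdraw, if any.
        return self.responses.pop(peer, None)
```

   With PE2 and PE3 both sending sequence number 5, the two responses
   differ only in the destination field, which is what lets each
   requester recognise its own acknowledgement.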

   PE1                    PE2                         PE3
    |                      |                           |
    |                      |  Type-4 Route Withdrawal  |
    |                      |<<-------------------------| T0
    |<<------------------------------------------------|
    |                      |                           |
    |  PE2 purges Type-TBD2 (DF-Response) sent to PE3  | T2
    |                      |                           |
    |  PE1 purges Type-TBD2 (DF-Response) sent to PE3  | T3
    |                      |                           |
    |<<##############################################>>|
    |      PE3 shut down/removed from the network      |
    |<<##############################################>>|
    .                      .                           .

3.7.  Backwards Compatibility

   Per redundancy group (per ES), for the DF election procedures to be
   globally convergent and unanimous, it is necessary that all the
   participating PEs agree on the DF Election algorithm to be used.  It
   is, however, possible that some PEs continue to use the existing
   modulus-based DF election and do not rely on the new handshake/sync
   procedures.  PEs running older versions of the draft/RFC shall
   simply discard unrecognized new BGP extended communities.

   A PE can indicate its willingness to support the new Handshake
   and/or Time Synchronization capabilities by signaling them in the DF
   Election Extended Community defined in [RFC8584], sent along with the
   Ethernet-Segment Route (Type-4).

   Consider the case where all the PE devices support the HRW election
   algorithm, but only a subset of them is capable of performing the
   handshake or synchronization mechanism.  In such a situation, the
   following procedure is exercised.

   In the illustration below, PE1, PE2 and PE3 send their respective
   Type-4 routes indicating their DF capabilities at time T1, T2 and T3
   respectively.  Only PE2 and PE3 are Handshake capable, hence only
   PE2 and PE3 take part in the DF Handshaking procedure described
   here, at time T4 and T5.  PE1, on the other hand, runs the DF
   election timer and takes over the DF role upon timer expiry at time
   T6.
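
   The capability check behind this scenario can be sketched as
   follows.  The sketch assumes the usual RFC bit numbering, in which
   bit 0 is the most significant bit of the first octet of the 8-octet
   DF Election Extended Community, so that bit 26 (H) and bit 27 (T)
   fall in the fourth octet.  The function names and the example octet
   values are hypothetical.

```python
H_BIT, T_BIT = 26, 27   # bit positions stated in Sections 3.5.2 and 4.2


def flag_set(ext_community: bytes, bit: int) -> bool:
    """Test one bit, counting bit 0 as the MSB of the first octet."""
    return bool(ext_community[bit // 8] & (0x80 >> (bit % 8)))


def handshake_group(capabilities: dict) -> set:
    """Return the PEs that advertised H=1.

    PEs outside this set fall back to the timer-based procedure.
    `capabilities` maps a PE name to the 8 octets it advertised.
    """
    return {pe for pe, ec in capabilities.items() if flag_set(ec, H_BIT)}
```

   Applied to the illustration, a map in which only PE2 and PE3
   advertise H=1 yields the handshake group {PE2, PE3}, while PE1 keeps
   using its timer.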

   PE1                    PE2                         PE3
    |                      |                           |
    |                      |                           |
    |         Type-4 (0x0 Default Capability)          |
    |-------------------->>|------------------------->>| T1
    |                      |                           |
    |          Type-4 (H=1 Handshake Capable)          |
    |<<--------------------|------------------------->>| T2
    |                      |                           |
    |          Type-4 (H=1 Handshake Capable)          |
    |<<--------------------|<<-------------------------| T3
    |                      |                           |
    |                      |                           |
    |                      |  Type-TBD1 (DF-Request)   |
    |                      |<<-------------------------| T4
    |                      |                           |
    |                      |    Type-TBD2 (DF-Response)|
    |                      |------------------------->>| T5
    |          PE1 Timer Expiry (DF Takeover)          | T6
    |<<##############################################>>|
    |        Only PE2 and PE3 Handshake Capable        |
    |<<##############################################>>|
    .                      .                           .

4.  DF Election Synchronization Solution

   If all PE devices attached to a given Ethernet Segment are
   clock-synchronized with each other, then the above handshaking
   procedures can be simplified and packet loss can be reduced from the
   BGP propagation time (between the recovered PE and the DF PE) to a
   very small interval (e.g., milliseconds or less).

   The simplified procedure is as follows:

   The DF Election procedure, as described in [RFC7432] and as
   optionally signalled in [RFC8584], is applied.

   All PEs attached to a given Ethernet Segment are clock-synchronized
   using a networking protocol for clock synchronization (e.g., NTP,
   PTP, etc.).

   Upon insertion of a new PE, or during failure recovery of a PE, that
   PE communicates to its peering partners the current time plus the
   time remaining on its peering timer.  This constitutes an "end" or
   "absolute" time as seen from the local PE.  That absolute time is
   called the "Service Carving Time" (SCT).

   A new BGP Extended Community is advertised along with the RT-4 to
   communicate the Service Carving Time to the other partners.

   Upon reception of that new BGP Extended Community, partner PEs know
   exactly when carving is to take place.
   The notion of skew is introduced to eliminate any potential
   duplicate traffic or loops.  Partner PEs add a skew (default =
   -10 ms) to the Service Carving Time to enforce this.  The previously
   inserted PE(s) must carve first, followed shortly (after the skew)
   by the newly inserted PE.

   To summarize, all peering PEs carve almost simultaneously at the time
   announced by the newly added/recovered PE.  The newly inserted PE
   initiates the SCT and carves immediately on peering timer expiry.
   The previously inserted PE(s), receiving the RT-4 with an SCT BGP
   extended community, carve shortly before the Service Carving Time.

4.1.  Advantages

   There are multiple advantages to this approach.  Here is a
   non-exhaustive list:

   -  Simple unidirectional signaling is all that is needed

   -  Backwards-compatible: PEs supporting only the older [RFC7432]
      shall simply discard the unrecognized new "Service Carving
      Timestamp" BGP Extended Community

   -  Multiple DF Election algorithms can be supported:

      *  [RFC7432] default ordered list ordinal algorithm (Modulo),

      *  [RFC8584] highest-random weight, etc.

   -  Independent of BGP transmission delay for the RT-4

   -  Agnostic of the time synchronization mechanism used (e.g., NTP,
      PTP, etc.)

4.2.  BGP Encoding

   A new BGP extended community needs to be defined to communicate the
   Service Carving Timestamp for each Ethernet Segment.

   A new transitive extended community, where the Type field is 0x06
   and the Sub-Type is [TBD3], is advertised along with the Ethernet
   Segment route.
   The timestamp for the expected service carving is carried as a
   48-bit value within the 8-octet extended community, as follows:

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(TBD3)|      Timestamp(upper 16)      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Timestamp (lower 32)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This document introduces a new flag called "T" (for Time
   Synchronization) to the bitmap field of the DF Election Extended
   Community defined in [RFC8584].

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(0x06)| DF Type       |P|A|H|T| Bitmap|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Reserved = 0                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   T: This flag is located in bit position 27 as shown above.  When set
   to 1, it indicates the desire to use the Time Synchronization
   capability with the rest of the PEs in the ES.  This capability is
   used in conjunction with the agreed-upon DF Type (DF Election Type).
   For example, if all the PEs in the ES indicate that they have the
   Time Synchronization capability and that they want the DF Type to be
   HRW, then the HRW algorithm is used in conjunction with this
   capability.

4.3.  Note on NTP-based synchronization

   The 64-bit timestamp used by NTP consists of a 32-bit part for
   seconds and a 32-bit part for the fractional second, giving a time
   scale that rolls over every 2^32 seconds (136 years) and a
   theoretical resolution of 2^-32 seconds (233 picoseconds).  The
   recommendation is to keep the 32-bit seconds part and carry only the
   most significant 16 bits of the fractional second, which gives a
   resolution of 2^-16 seconds (about 15 microseconds).

4.4.  Synchronization Scenarios

   Let's take Figure 1 as an example where initially PE2 had failed and
   PE1 had taken over.

   Based on [RFC7432]:

   -  Initial state: PE1 is in steady-state, PE2 is recovering

   -  PE2 recovers at (absolute) time t=99

   -  PE2 advertises RT-4 (sent at t=100) to partner PE1

   -  PE2 starts its 3-second peering timer as per [RFC7432]

   -  PE1 carves immediately on RT-4 reception, i.e., at t=100 plus
      minimal BGP propagation delay

   -  PE2 carves at time t=103

   With the above procedure, and given the [RFC7432] aim of favouring a
   traffic black hole over duplicate traffic, a traffic black hole will
   occur as part of each PE recovery sequence.  The peering timer value
   has a direct effect on the duration of the blackholing.  A short
   (especially zero) peering timer may, however, result in duplicate
   traffic or traffic loops.

   Based on the Service Carving Time (SCT) approach:

   -  Initial state: PE1 is in steady-state, PE2 is recovering

   -  PE2 recovers at (absolute) time t=99

   -  PE2 advertises RT-4 (sent at t=100) with target SCT value t=103
      to partner PE1

   -  PE2 starts its 3-second peering timer as per [RFC7432]

   -  Both PE1 and PE2 carve at (absolute) time t=103; in fact, PE1
      should carve slightly before PE2 (skew).

   Using the SCT approach, the negative effect of the peering timer is
   mitigated.  Also, the BGP RT-4 transmission delay (from PE2 to PE1)
   no longer affects when carving takes place.

4.5.  Backwards Compatibility

   Per redundancy group, for the DF election procedures to be globally
   convergent and unanimous, it is necessary that all the participating
   PEs agree on the DF Election algorithm to be used.  It is, however,
   possible that some PEs continue to use the existing modulus-based DF
   election and do not rely on the new SCT BGP extended community.  PEs
   running a baseline DF election mechanism shall simply discard the
   unrecognized new SCT BGP extended community.
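
   As an illustration only, the following Python sketch shows one way
   the extended community of Section 4.2 could be packed and parsed
   using the NTP truncation of Section 4.3, how a receiver might skip
   communities it does not recognize, and how the skew shifts the carve
   time of a previously inserted PE.  The sub-type value 0xFF is a
   placeholder for the unassigned TBD3, and every function and constant
   name is hypothetical; none of this is specified by this document.

```python
import struct

EVPN_TYPE = 0x06        # Type field shown in Section 4.2
SCT_SUBTYPE = 0xFF      # ASSUMPTION: placeholder only, TBD3 is not yet assigned
DEFAULT_SKEW = -0.010   # seconds; the -10 ms default skew


def encode_sct(ntp_seconds, ntp_fraction, subtype=SCT_SUBTYPE):
    """Build the 8-octet community: 32-bit seconds + top 16 fraction bits."""
    frac16 = (ntp_fraction >> 16) & 0xFFFF
    return struct.pack("!BBIH", EVPN_TYPE, subtype,
                       ntp_seconds & 0xFFFFFFFF, frac16)


def decode_sct(community, subtype=SCT_SUBTYPE):
    """Return (seconds, 32-bit fraction), or None if not an SCT community."""
    if len(community) != 8:
        return None
    ctype, csub, seconds, frac16 = struct.unpack("!BBIH", community)
    if ctype != EVPN_TYPE or csub != subtype:
        return None
    # The low 16 fraction bits were not carried, so they come back as zero.
    return seconds, frac16 << 16


def find_sct(communities):
    """Scan a route's extended communities, skipping unrecognized ones."""
    for community in communities:
        decoded = decode_sct(community)
        if decoded is not None:
            return decoded
    return None


def carve_times(sct, skew=DEFAULT_SKEW):
    """Return (previously inserted PE carve time, newly inserted PE carve time)."""
    return sct + skew, sct
```

   With the Section 4.4 example, carve_times(103.0) places PE1 at about
   t=102.99 and PE2 at t=103.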

   A PE can indicate its willingness to support clock-synchronized
   carving by signaling the new 'T' DF Election Capability as well as
   including the new Service Carving Time BGP extended community along
   with the Ethernet Segment route (Type-4).

5.  Interoperability

   If some PEs in the redundancy group signal both Handshake and Time
   Synchronization capabilities (both H and T set to 1), then the Time
   Synchronization capability SHALL be chosen over the Handshake
   capability with the HRW (or Preference-based) DF election algorithm.

   If some PEs in the redundancy group signal Time Synchronization
   (T=1) but not Handshaking (H=0), whereas some other PEs in the same
   redundancy group signal Handshaking (H=1) but not Time
   Synchronization (T=0), then the PEs that have the Handshaking
   capability SHALL perform DF Election using the signaled or default
   DF Type with handshaking among themselves, and the PEs that have the
   Time Synchronization capability SHALL perform DF Election using the
   signaled or default DF Type with time synchronization among
   themselves.

   If some PEs in the redundancy group do not signal either the Time
   Synchronization or the Handshaking capability, then these PEs SHALL
   perform DF Election (Modulo, HRW or Preference-based) with the
   default peering-timer-based mechanism defined in [RFC7432].

6.  Security Considerations

   The mechanisms in this document use the EVPN control plane as
   defined in [RFC7432].  The security considerations described in
   [RFC7432] are equally applicable.  This document uses MPLS and
   IP-based tunnel technologies to support data plane transport.  The
   security considerations described in [RFC7432] and in [RFC8365] are
   equally applicable.

7.  IANA Considerations

   This document solicits the allocation of the following values in the
   "EVPN Route Types" registry set up by [RFC7432]:

      TBD1     DF Election Handshake Request     This document
      TBD2     DF Election Handshake Response    This document

   This document solicits the allocation of the following sub-type in
   the "EVPN Extended Community Sub-Types" registry set up by
   [RFC7153]:

      TBD3     Service Carving Timestamp         This document

   This document solicits the allocation of the following values in the
   "DF Election Capabilities" registry set up by [RFC8584]:

      Bit      Name                              Reference
      ----     ----------------                  -------------
      2        Handshake                         This document
      3        Time Synchronization              This document

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
              Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
              March 2014.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

   [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
              J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
              VPN Designated Forwarder Election Extensibility",
              RFC 8584, DOI 10.17487/RFC8584, April 2019.

8.2.  Informative References

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017.

Appendix A.  Contributors

   In addition to the authors listed on the front page, the following
   co-authors have also contributed substantially to this document:

   Luc Andre Burdet
   Cisco

   Email: lburdet@cisco.com

Appendix B.  Acknowledgements

   The authors would like to acknowledge the helpful comments and
   contributions of Satya Mohanty and Bharath Vasudevan.

Authors' Addresses

   Ali Sajassi (editor)
   Cisco

   Email: sajassi@cisco.com


   Gaurav Badoni
   Cisco

   Email: gbadoni@cisco.com


   Dhananjaya Rao
   Cisco

   Email: dhrao@cisco.com


   Patrice Brissette
   Cisco

   Email: pbrisset@cisco.com


   John Drake
   Juniper

   Email: jdrake@juniper.net


   Jorge Rabadan
   Nokia

   Email: jorge.rabadan@nokia.com