BESS Working Group                                            A. Sajassi
Internet-Draft                                                 G. Badoni
Intended Status: Standards Track                                  D. Rao
                                                            P. Brissette
                                                                   Cisco
                                                                J. Drake
                                                                 Juniper
                                                              J. Rabadan
                                                                   Nokia

Expires: December 12, 2018                                 June 12, 2018


                   Fast Recovery for EVPN DF Election
                draft-ietf-bess-evpn-fast-df-recovery-00

Abstract

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432]
   describes DF election procedures for multi-homing Ethernet Segments.
   These procedures are enhanced further in [DF-FRAMEWORK] by applying
   the Highest Random Weight (HRW) algorithm for DF election in order to
   avoid unnecessary DF status changes upon a failure. This draft
   further improves the DF election procedures in [DF-FRAMEWORK] by
   providing two options for fast DF election upon recovery of the
   failed link or node associated with the multi-homing Ethernet
   Segment. This fast DF election is achieved independent of the number
   of EVIs associated with that Ethernet Segment, and it is performed
   via simple signaling between the recovered PE and each PE in the
   multi-homing group.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1 Introduction
     1.1 Terminology
   2 Challenges with Existing Solution
   3 Operation
     3.1 DF Election Handshake Solution
       3.1.1 Discovery
       3.1.2 DF candidates Determination
       3.1.3 DF Election Handshake
       3.1.4 Node Insertion
       3.1.5 BGP Encoding
         3.1.5.1 DF Election Handshake Request Route
         3.1.5.2 DF Election Handshake Response Route
       3.1.6 DF Handshake Scenarios
       3.1.7 Interoperability
     3.2 DF Election Synchronization Solution
       3.2.3 Advantages
       3.2.4 Interoperability
       3.2.5 BGP Encoding
       3.2.6 Note on NTP-based synchronization
       3.2.7 An example
   4 Acknowledgement
   5 Security Considerations
   6 IANA Considerations
   7 References
     7.1 Normative References
     7.2 Informative References
   Authors' Addresses

1 Introduction

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
   becoming pervasive in data center (DC) applications for Network
   Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
   in service provider (SP) applications for next generation virtual
   private LAN services.

   The EVPN solution [RFC7432] describes DF election procedures for
   multi-homing Ethernet Segments. These procedures are enhanced further
   in [DF-FRAMEWORK] by applying the Highest Random Weight (HRW)
   algorithm for DF election in order to avoid unnecessary DF status
   changes upon a link or node failure associated with the multi-homing
   Ethernet Segment. This draft further improves the DF election
   procedures in [DF-FRAMEWORK] by providing two options for fast DF
   election upon recovery of the failed link or node associated with the
   multi-homing Ethernet Segment. This DF election is achieved
   independent of the number of EVIs associated with that Ethernet
   Segment, and it is performed via simple signaling between the
   recovered PE and each PE in the multi-homing group. The draft
   presents two signaling options. The first option is based on a
   bidirectional handshake procedure, whereas the second option is based
   on a simple one-way signaling mechanism.

1.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [KEYWORDS].

   Provider Edge (PE): A device that sits at the boundary of Provider
   and Customer networks and performs encap/decap of data from L2 to L3
   and vice-versa.

   Designated Forwarder (DF): A PE that is currently forwarding
   (encapsulating/decapsulating) traffic for a given VLAN in and out of
   a site.

2 Challenges with Existing Solution

   In EVPN technology, multiple PE devices have the ability to encap and
   decap data belonging to the same VLAN. In certain situations, this
   may cause L2 duplicates and even loops if there is a momentary
   overlap of forwarding roles between two or more PE devices, leading
   to broadcast storms.

   EVPN [RFC7432] currently uses timer-based synchronization among PE
   devices in a redundancy group. This can result in duplicates (and
   even loops) because of multiple DFs if the timer is too short, or in
   blackholing if the timer is too long.

   Site-of-origin Split Horizon filtering can prevent loops (but not
   duplicates). However, if there are overlapping DFs in two different
   sites at the same time for the same VLAN, the site identifier will be
   different upon re-entry of the packet and hence the split horizon
   check will fail, leading to L2 loops.

   The current state of the art [DF-FRAMEWORK] uses the well-known HRW
   (Highest Random Weight) algorithm to avoid reshuffling of VLANs among
   PE devices in the redundancy group upon failure/recovery, thus
   reducing the impact of failure/recovery on VLANs not on the
   failed/recovered ports. This eliminates loops/duplicates in failure
   scenarios.

   However, upon PE insertion or port bring-up, HRW cannot help, as a
   transfer of the DF role needs to happen to the newly inserted
   device/port while the old DF is still active.

                                   +---------+
                +-------------+    |         |
                |             |    |         |
              / |     PE1     |----|         |   +-------------+
             /  |             |    |  MPLS/  |   |             |---H3
            /   +-------------+    |  VxLAN/ |   |    PE10     |
       CE1 -                       |  Cloud  |   |             |
            \   +-------------+    |         |---|             |
             \  |             |    |         |   +-------------+
              \ |     PE2     |----|         |
                |             |    |         |
                +-------------+    |         |
                                   +---------+

      Figure 1: CE1 multi-homed to PE1 and PE2.
      Potential for duplicate DF.

   In Figure 1, when PE2 is inserted or booted up, PE1 will transfer the
   DF role of some VLANs to PE2 to achieve load balancing. However,
   because there is no handshake mechanism between PE1 and PE2,
   duplication of DF roles for a given VLAN is possible. Duplication of
   DF roles may eventually lead to L2 loops as well as duplication of
   traffic.

   The current state of the EVPN art relies on a blackholing timer for
   transferring the DF role to the newly inserted device. This can cause
   the following issues:

   * Loops/duplicates if the timer value is too short
   * Prolonged traffic blackholing if the timer value is too long

   This draft proposes solutions that deterministically eliminate
   loops/duplicates and at the same time provide fast convergence upon
   PE/port insertion.

3 Operation

   Here we describe two signaling mechanisms between the newly inserted
   PE and the remaining PEs. The signaling is only possible once the
   newly inserted PE has reliably discovered the other PEs and vice
   versa. The first option is referred to as the DF Election Handshake
   Solution and is described in Section 3.1. The second option is
   referred to as the DF Election Synchronization Solution and is
   described in Section 3.2.

3.1 DF Election Handshake Solution

   Because of HRW, only one handshake is needed per PE device,
   independent of EVI/VNI scale. This solution is divided into three
   phases:

   Phase 1: Discovery

   Phase 2: DF Candidate Determination; HRW or Preference-based

   Phase 3: Handshake

   Each phase is described in detail below.

3.1.1 Discovery

   Each PE needs to have a consistent view of the network including the
   newly inserted PE.

   The newly inserted PE will advertise its Ethernet Segment route and
   start a flood/wait timer.
   This timer should be large enough to
   guarantee the dissemination and receipt of this advertisement by
   previously inserted PEs.

   As the old DF continues forwarding traffic while the new PE is
   running this timer, the timer can be made as long as required
   without impacting traffic convergence. In the worst case, the timer
   value can be the BGP session hold time to ensure proper discovery.

3.1.2 DF candidates Determination

   After the discovery timer has elapsed, each PE will have an imported
   list of the Ethernet Segment routes from the other PEs. The resultant
   database will comprise all the DF candidates on a per-ES basis and
   will be used for DF election. Each PE will independently run the
   selected DF algorithm, i.e., the HRW algorithm (or Preference-based),
   for all VLANs in a given Ethernet Segment. Since the discovery phase
   guarantees a uniform network view among the participating devices,
   the VLAN distribution results based on HRW (or Preference-based) will
   be consistent.

3.1.3 DF Election Handshake

   The DF Election handshake will be accomplished in the following
   steps:

   - The newly inserted PE will send the DF Request to previously
     inserted PEs with a new sequence number.

   - The previously inserted PE(s) will receive the DF Request and will
     validate this request per their own discovery state and HRW (or
     Preference-based) results.

   - The previously inserted PE(s) will program hardware to block the
     VLANs that must be transferred to the newly inserted PE.

   - The previously inserted PE(s) will send a DF Response (with ACK or
     NACK) to the newly inserted PE with the same sequence number that
     was contained in the DF Request.

   - The newly inserted PE will receive the DF Response and validate it
     using the sequence number. For faster convergence, it will act on
     each received DF Response message as it arrives and will not wait
     for responses from all previously inserted devices.
     The received DF Response is interpreted as an indication that the
     previously inserted PE has given up the DF role on those VLANs for
     which the newly inserted PE should be DF. In other words, the
     newly inserted PE will only take over as DF for a given VLAN/ISID
     if (a) it is the DF Election winner AND (b) it gets the ACK from
     the previous DF.

   - In case of Preference-based DF Election, the above procedure should
     only be followed if there is at least one previously inserted PE
     that signals DP=0 in its ES route (there is no need for a handshake
     in case of non-revertive mode).

   - In case of a DF Response ACK, the newly inserted PE will program
     its hardware to assume the DF responsibility.

   The handshake is not needed on a per-VLAN/EVI basis but rather per
   pair of PEs in the redundancy group; i.e., if a new PE is added to an
   existing redundancy group of 3 PE devices, then only 3 handshakes are
   needed. This is because the devices are already in sync about which
   VLANs to give up or take over (HRW).

   At the end of these three phases, the VLAN DF role transfer will
   have happened in a deterministic way while ensuring minimum traffic
   loss. Device recovery and device insertion scenarios are identical in
   terms of the handshaking procedure. In the next section, we describe
   the procedure details for device insertion.

3.1.4 Node Insertion

   Consider the scenario where PE3 is inserted in the network, while PE1
   and PE2 are already in stable state. PE3 will send/receive the
   following flags along with the EVPN Type 4 route:

   - DF Request: Upon completing the DF Election, PE3 will send a DF
     Request with a new sequence number. PE1 and PE2 will receive this
     message and respond with a DF Response ACK or NACK with the same
     sequence number that was generated by PE3.

   - DF Response ACK: When PE3 receives a DF Response ACK from PE1 with
     the same sequence number as the DF Request, it will take over the
     DF role for the appropriate VLANs that are being transferred from
     PE1. When the DF Response ACK from PE2 arrives, the rest of the
     VLANs to be transferred from PE2 to PE3 are then taken over by PE3.

   - DF Response NACK: If PE3 receives a DF Response NACK from at least
     one of PE1 or PE2, it will not take over the DF role and will start
     over.

   Consider the scenario where two nodes, PE3 and PE4, are being
   inserted at the same time. Both of them will send a DF Request to PE1
   and PE2 at around the same time, possibly with the same sequence
   number. When PE1 and PE2 respond with a DF Response ACK, it is
   important to signify exactly whom the response is meant for, as it
   could be for either requester (PE3 or PE4). To remove any ambiguity
   and false positives, the IP address of the requester MUST be included
   in the response message to specify whom the response is meant for.

3.1.5 BGP Encoding

   The EVPN NLRI comprises Route Type (1B), Length (1B), and a Route
   Type specific variable-length encoding.
   Here we propose the creation of two
   new EVPN route types:

   + 0x0C - DF Election Handshake Request Route
   + 0x0D - DF Election Handshake Response Route

3.1.5.1 DF Election Handshake Request Route

   A DF Election Handshake Request Type NLRI consists of the following:

         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | RD (8 octets)                           |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Ethernet Segment Identifier (10 octets) |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | DF-Flags (1 octet)                      |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Sequence Number (1 octet)               |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Originating Router's IP Address         |
         | (4 or 16 octets)                        |
         +-----------------------------------------+

   The DF-Flags can have the following values:

      DF-INIT    : Sent initially upon boot-up; bootstraps the network
      DF-REQUEST : Sent to request DF takeover

   For the purpose of BGP route key processing, the Ethernet Segment
   Identifier and Originating Router's IP Address fields are considered
   to be part of the prefix in the NLRI. The DF-Flags and Sequence
   Number fields are to be treated as route attributes rather than as
   part of the route key. This route is sent along with the ESI-Import
   route target.
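
   As a rough illustration of the Request layout, the following Python
   sketch packs and unpacks a DF Election Handshake Request NLRI. It
   assumes that the Length octet counts the octets that follow it and
   that the fields appear on the wire in the order drawn. The numeric
   values used for DF-INIT and DF-REQUEST are hypothetical, since this
   document names the flags without assigning values, and the function
   names are illustrative only.

```python
import ipaddress
import struct

ROUTE_TYPE_DF_REQUEST = 0x0C  # route type proposed in this document

# Hypothetical flag values: the draft names these flags but assigns no numbers.
DF_INIT = 0x01
DF_REQUEST = 0x02


def encode_df_request(rd, esi, df_flags, seq, originator):
    """Pack Route Type, Length, RD, ESI, DF-Flags, Sequence Number, originator IP."""
    if len(rd) != 8:
        raise ValueError("RD must be 8 octets")
    if len(esi) != 10:
        raise ValueError("ESI must be 10 octets")
    if not 0 <= seq <= 0xFF:
        raise ValueError("Sequence Number is a single octet")
    ip = ipaddress.ip_address(originator).packed  # 4 (IPv4) or 16 (IPv6) octets
    body = rd + esi + struct.pack("!BB", df_flags, seq) + ip
    return struct.pack("!BB", ROUTE_TYPE_DF_REQUEST, len(body)) + body


def decode_df_request(nlri):
    """Inverse of encode_df_request; returns the fields as a dict."""
    route_type, length = struct.unpack("!BB", nlri[:2])
    body = nlri[2:2 + length]
    if route_type != ROUTE_TYPE_DF_REQUEST or len(body) != length:
        raise ValueError("not a well-formed DF Request NLRI")
    ip = body[20:]  # what remains after RD(8) + ESI(10) + flags(1) + seq(1)
    if len(ip) not in (4, 16):
        raise ValueError("originator address must be 4 or 16 octets")
    return {
        "rd": body[0:8],
        "esi": body[8:18],
        "df_flags": body[18],
        "seq": body[19],
        "originator": str(ipaddress.ip_address(ip)),
    }


# Example: PE 192.0.2.3 requests DF takeover with sequence number 7.
nlri = encode_df_request(bytes(8), bytes(range(10)), DF_REQUEST, 7, "192.0.2.3")
fields = decode_df_request(nlri)
```

   In this sketch the originator address is the last field, so its
   length is inferred from the Length octet rather than carried
   explicitly.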

3.1.5.2 DF Election Handshake Response Route

   A DF Election Handshake Response Type NLRI consists of the following:

         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | RD (8 octets)                           |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Ethernet Segment Identifier (10 octets) |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | IP-Address Length (1 octet)             |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Destination Router's IP Address         |
         | (4 or 16 octets)                        |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | DF-Flags (1 octet)                      |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Sequence Number (1 octet)               |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Originating Router's IP Address         |
         | (4 or 16 octets)                        |
         +-----------------------------------------+

   The DF-Flags can have the following values:

      DF-ACK  : Sent to acknowledge DF-REQUEST
      DF-NACK : Sent to reject DF-REQUEST

   For the purpose of BGP route key processing, the Ethernet Segment
   Identifier, IP-Address Length, Destination Router's IP Address, and
   Originating Router's IP Address fields are considered to be part of
   the prefix in the NLRI. The DF-Flags and Sequence Number fields are
   to be treated as route attributes rather than as part of the route
   key. This route is sent along with the ESI-Import route target.

   This document introduces a new flag called "H" (for Handshake) to the
   bitmap field of the DF Election Extended Community defined in
   [DF-FRAMEWORK].

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(0x06)|   DF Type     |P|A|H|T| Bitmap|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Reserved = 0                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   H: This flag is located in bit position 26 as shown above.
   When set to 1, it indicates the desire to use the Handshaking
   capability with the rest of the PEs in the ES. This capability can
   only be used with selected DF election algorithms such as HRW and
   Preference-based.

3.1.6 DF Handshake Scenarios

   Consider the scenario where PE3 is freshly inserted into the network
   with PE1 and PE2 in steady state. As shown in the sequence diagram
   below, at time = T0, PE3 will send its Type 4 ES route, which will
   cause PE1 and PE2 to discover PE3.

   After the discovery timer expires, at time = T1, PE3 will send a DF
   Request containing [ESI, DF-REQ, SEQ1].

   PE2 responds with a DF Response ACK at time = T2 containing [ESI,
   DF-ACK, PE3, SEQ1]. Note that the sequence number is the same as the
   one contained in the DF Request from PE3. PE3 will receive the DF
   Response ACK and take over the appropriate VLANs based on HRW only if
   the sequence number matches.

   PE1 responds with a DF Response ACK at time = T3 containing [ESI,
   DF-ACK, PE3, SEQ1]. PE3 will receive the DF Response ACK and take
   over the appropriate VLANs based on HRW only if the sequence number
   matches.

   By the end of the handshake, all appropriate VLANs for the ES are
   transferred from PE1 and PE2 to PE3 with a single per-ES handshake.

        PE1                 PE2                         PE3
         |                   |                           |
         |                   |  Type 4 (Discovery)       |
         |                   |<<-------------------------| T0
         |<<---------------------------------------------|
         |                   |                           |
         |                   |                           |
         |                   |  Type C (DF Request)      |
         |<<-----------------|<<-------------------------| T1
         |                   |                           |
         |                   |  Type D (DF Response)     |
         |                   |------------------------->>| T2
         |  Type D(DF Resp)  |                           |
         |--------------------------------------------->>| T3
         |                   |                           |
         |<<###########################################>>|
         |             PE3 freshly inserted              |
         |<<###########################################>>|
         .                   .                           .
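
   The scenario above can be sketched from PE3's point of view as
   follows. This is a simplified model under stated assumptions, not
   the procedure's definitive implementation: SHA-256 stands in for
   the HRW weight function (it is not the hash specified in
   [DF-FRAMEWORK]), and the class and field names (NewPE, DFResponse)
   are hypothetical. It shows that PE3 takes over VLANs from each
   previous DF separately, and only when the ESI, destination, and
   sequence number all match.

```python
import hashlib
from dataclasses import dataclass


def hrw_winner(vlan, candidates):
    """Pick the candidate with the highest weight for this VLAN.

    SHA-256 is a stand-in weight function, not the hash from [DF-FRAMEWORK].
    """
    return max(candidates,
               key=lambda pe: hashlib.sha256(f"{vlan}|{pe}".encode()).digest())


@dataclass(frozen=True)
class DFResponse:
    esi: str
    ack: bool          # True for DF-ACK, False for DF-NACK
    destination: str   # requester the response is meant for
    seq: int
    originator: str    # previously inserted PE sending the response


class NewPE:
    """State kept by a newly inserted PE for one Ethernet Segment."""

    def __init__(self, name, esi, vlans, old_pes, seq):
        self.name, self.esi, self.seq = name, esi, seq
        self.active = set()   # VLANs this PE already forwards as DF
        self.pending = {}     # old DF -> VLANs it must hand over to this PE
        for vlan in vlans:
            if hrw_winner(vlan, list(old_pes) + [name]) == name:
                old_df = hrw_winner(vlan, old_pes)
                self.pending.setdefault(old_df, set()).add(vlan)

    def on_response(self, resp):
        """Take over VLANs from resp.originator if the response is a valid ACK."""
        if (resp.esi, resp.destination, resp.seq) != (self.esi, self.name, self.seq):
            return set()      # not for this ES, this PE, or this request
        if not resp.ack:
            return set()      # NACK: no takeover (the restart is omitted here)
        taken = self.pending.pop(resp.originator, set())
        self.active |= taken
        return taken


pe3 = NewPE("PE3", "ESI-1", range(1, 101), ["PE1", "PE2"], seq=1)
from_pe1 = set(pe3.pending.get("PE1", set()))
from_pe2 = set(pe3.pending.get("PE2", set()))
stale = pe3.on_response(DFResponse("ESI-1", True, "PE3", 0, "PE2"))  # wrong SEQ
at_t2 = pe3.on_response(DFResponse("ESI-1", True, "PE3", 1, "PE2"))  # T2
at_t3 = pe3.on_response(DFResponse("ESI-1", True, "PE3", 1, "PE1"))  # T3
```

   Because HRW only ever moves a VLAN to the newly added candidate, the
   set of VLANs PE3 takes from PE1 and the set it takes from PE2 are
   disjoint, which is what allows PE3 to act on each response without
   waiting for the other.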

   Consider the scenario where PE2 and PE3 are inserted simultaneously
   in the network where PE1 is in steady state (as shown below). PE2 and
   PE3 will send their Type 4 ES routes and start the discovery timer.
   This will cause PE1, PE2 and PE3 to discover each other.

   PE2 and PE3 will then simultaneously and separately send a DF
   Request. PE1 will receive these requests and respond to them.

   To avoid any ambiguity, PE1 will explicitly specify in the DF
   Response route the destination for which the DF-ACK is meant. That is
   why the responses from PE1 will contain [ESI, DF-ACK, PE2, SEQ] and
   [ESI, DF-ACK, PE3, SEQ] to specify that the response is meant for PE2
   and PE3 respectively.

   Upon receiving the Type D response message, PE2 and PE3 will take
   over their respective VLANs.

        PE1                 PE2                         PE3
         |                   |                           |
         |                   |  Type 4 (Discovery)       |
         |<<-----------------|                           | T0
         |<<-----------------|<<-------------------------|
         |                   |                           |
         |                   |                           |
         |                   |  Type C (DF Request)      |
         |                   |<<-------------------------| T1
         |                   |                           |
         | Type C(DF Request)|                           |
         |<<-----------------|                           | T2
         |                   |                           |
         |                   |  Type D (DF Response)     |
         |--------------------------------------------->>| T3
         |                   |                           |
         |Type D(DF Response)|                           |
         |----------------->>|                           | T4
         |                   |                           |
         |<<###########################################>>|
         |      PE2 and PE3 inserted simultaneously      |
         |<<###########################################>>|
         .                   .                           .

   When PE3 is shut down or removed from the network, the routes
   formerly advertised by PE3 will be withdrawn, including the Type 4
   route (as shown below). When PE1 and PE2 process the deletion of
   PE3's Type 4 route, they will clean up any DF handshake state
   pertaining to PE3. This means that PE1 and PE2 will withdraw the DF
   Response routes that they had earlier sent with PE3 as the
   destination.

        PE1                 PE2                         PE3
         |                   |                           |
         |                   |  Type 4 Route Withdrawal  |
         |                   |<<-------------------------| T0
         |<<---------------------------------------------|
         |                   |                           |
         |    PE2 purges Type D(DF Resp) sent to PE3     | T2
         |                   |                           |
         |    PE1 purges Type D(DF Resp) sent to PE3     | T3
         |                   |                           |
         |<<###########################################>>|
         |   PE3 booted down/removed from the network    |
         |<<###########################################>>|
         .                   .                           .

3.1.7 Interoperability

   Per redundancy group (per ES), for the DF election procedures to be
   globally convergent and unanimous, it is necessary that all the
   participating PEs agree on the DF Election algorithm to be used. It
   is, however, possible that some PEs continue to use the existing
   modulus-based DF election and do not rely on the new handshake/sync
   procedures. PEs running older versions of the draft/RFC shall simply
   discard unrecognized new BGP extended communities.

   A PE can indicate its willingness to support the new Handshake and/or
   Time Synchronization capabilities by signaling them in the DF
   Election Extended Community defined in [DF-FRAMEWORK], sent along
   with the Ethernet Segment route (Type 4).

   Consider the case where all the PE devices support the HRW election
   algorithm but only a subset of them is capable of performing the
   handshake or synchronization mechanism. In such a situation, the
   following procedures are exercised.

   If some PEs in the redundancy group signal both Handshake and Time
   Synchronization capabilities (both H & T set to 1), then the Time
   Synchronization capability SHALL be chosen over the Handshake
   capability with the HRW (or Preference-based) DF election algorithm.

   If some PEs in the redundancy group signal Time Synchronization
   (T=1) but not Handshaking (H=0), whereas some other PEs in the same
   redundancy group signal Handshaking (H=1) but not Time
   Synchronization (T=0), then the PEs that have the Handshaking
   capability SHALL perform HRW (or Preference-based) with handshaking
   among themselves, and the PEs that have the Time Synchronization
   capability SHALL perform HRW (or Preference-based) with time
   synchronization among themselves.

   If some PEs in the redundancy group signal neither the Time
   Synchronization nor the Handshaking capability, then these PEs SHALL
   perform HRW (or Preference-based) with the default timer-based
   mechanism defined in [RFC7432].

   In the illustration below, PE1, PE2 and PE3 send their respective
   Type 4 routes indicating their DF capabilities at times T1, T2 and T3
   respectively. Only PE2 and PE3 are Handshake capable, hence only PE2
   and PE3 take part in the DF Handshaking procedure described here, at
   times T4 and T5. PE1, on the other hand, runs the DF election timer
   and takes over the DF role upon timer expiry at time T6.

        PE1                 PE2                         PE3
         |                   |                           |
         |                   |                           |
         |  Type 4 (0x0 Default Capability)              |
         |----------------->>|------------------------->>| T1
         |                   |                           |
         |  Type 4 (H=1 Handshake Capable)               |
         |<<-----------------|------------------------->>| T2
         |                   |                           |
         |  Type 4 (H=1 Handshake Capable)               |
         |<<-----------------|<<-------------------------| T3
         |                   |                           |
         |                   |                           |
         |                   |  Type C (DF Request)      |
         |                   |<<-------------------------| T4
         |                   |                           |
         |                   |  Type D (DF Response)     |
         |                   |------------------------->>| T5
         |      PE1 Timer Expiry (DF Takeover)           | T6
         |<<###########################################>>|
         |      Only PE2 and PE3 Handshake Capable       |
         |<<###########################################>>|
         .                   .                           .
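
   The selection rules of this section can be sketched as a small
   lookup on the H and T flags. The sketch below follows the extended
   community layout drawn in Section 3.1.5 of this document (bit 0
   being the most significant bit of the first octet); the function
   names and the mode labels are hypothetical, and the DF Type value
   used in the example is illustrative because this document does not
   list the DF Type codepoints.

```python
import struct

# Masks for bit positions 26 and 27 of the first 32-bit word; both fall in
# the fourth octet, so the same masks apply to that octet on its own.
FLAG_H = 1 << (31 - 26)   # 0x20, Handshake
FLAG_T = 1 << (31 - 27)   # 0x10, Time Synchronization


def build_df_ext_comm(df_type, h=False, t=False):
    """Build the 8-octet DF Election Extended Community as drawn in this draft."""
    flags = (FLAG_H if h else 0) | (FLAG_T if t else 0)
    return struct.pack("!BBBBI", 0x06, 0x06, df_type, flags, 0)


def recovery_mode(ext_comm):
    """Map the advertised H/T flags to the recovery procedure a PE uses."""
    _type, _sub, _df_type, flags, _reserved = struct.unpack("!BBBBI", ext_comm)
    if flags & FLAG_T:
        return "time-sync"    # T is chosen even when H is also set
    if flags & FLAG_H:
        return "handshake"
    return "timer"            # default timer-based mechanism of [RFC7432]


def group_by_mode(advertised):
    """Group PE names by the procedure each one runs within the ES."""
    groups = {}
    for pe, ext_comm in advertised.items():
        groups.setdefault(recovery_mode(ext_comm), []).append(pe)
    return groups


HRW = 1  # illustrative DF Type value, not taken from this document
groups = group_by_mode({
    "PE1": build_df_ext_comm(HRW),           # 0x0, default capability
    "PE2": build_df_ext_comm(HRW, h=True),   # H=1
    "PE3": build_df_ext_comm(HRW, h=True),   # H=1
})
```

   With the capabilities from the illustration above, PE2 and PE3 end
   up in the handshake group while PE1 falls back to the timer.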

3.2 DF Election Synchronization Solution

   If all PE devices attached to a given Ethernet Segment are
   clock-synchronized with each other, then the above handshaking
   procedures can be simplified and packet loss can be reduced from the
   BGP propagation time (between the recovered PE and the DF PE) to a
   very small interval (e.g., milliseconds or less).

   The simplified procedure is as follows:

   First, the DF election procedure described in [RFC7432] is applied as
   before.

   All PEs attached to a given Ethernet Segment are clock-synchronized,
   using a networking protocol for clock synchronization (e.g., NTP,
   PTP, etc.).

   Upon insertion of a new PE, or during failure recovery of a PE, that
   PE communicates to its peering partners the current time plus the
   time remaining on its peering timer. This constitutes an "end time"
   as seen from the local PE. That "end time" is called the "Service
   Carving Time" (SCT).

   A new BGP Extended Community is advertised along with the RT-4 to
   communicate the Service Carving Time to the other partners.

   Upon reception of that new BGP Extended Community, partner PEs know
   exactly when the advertising PE will carve. The notion of skew is
   introduced to eliminate any potential duplicate traffic or loops. The
   partner PEs add a skew (default = -10ms) to the Service Carving Time
   to enforce this; in effect, partner PEs must carve first.

   To summarize, all peering PEs carve almost simultaneously at the time
   announced by the newly added/recovered PE. The newly added/recovered
   PE initiates the SCT and carves immediately on peering timer expiry.
   Other PEs receiving an RT-4 with an SCT BGP ExtComm carve shortly
   before the SCT.

3.2.3 Advantages

   There are multiple advantages to using this approach.
   Here is a non-exhaustive list:

   - Simple unidirectional signaling is all that is needed

   - Backwards-compatible: old versions of the draft/RFC shall simply
     discard the unrecognized new SCT BGP ExtComm

   - Multiple DF Election algorithms can be supported:
     * RFC7432's default ordered list ordinal algorithm (modulo)
     * HRW in [DF-FRAMEWORK], etc.

   - Independent of the BGP transmission delay for the RT-4

   - The solution is agnostic of the time synchronization mechanism
     (e.g., NTP, PTP, ...)

3.2.4 Interoperability

   Per redundancy group, for the DF election procedures to be globally
   convergent and unanimous, it is necessary that all the participating
   PEs agree on the DF Election algorithm to be used. It is, however,
   possible that some PEs continue to use the existing modulus-based DF
   election and do not rely on the new SCT BGP extended community. PEs
   running a baseline DF election mechanism shall simply discard the
   unrecognized new SCT BGP extended community.

   A PE can indicate its willingness to support clock-synched carving by
   signaling the new SCT BGP extended community along with the Ethernet
   Segment route (Type 4).

3.2.5 BGP Encoding

   A new BGP extended community needs to be defined to communicate the
   Service Carving Expected Timestamp for each Ethernet Segment.

   A new transitive extended community, where the Type field is 0x06 and
   the Sub-Type is TBD, is advertised along with the Ethernet Segment
   route.
   The timestamp for the expected service carving is carried in this
   8-octet extended community as follows:

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(TBD) |      Timestamp (upper 16)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Timestamp (lower 32)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This document introduces a new flag called "T" (for Time
   Synchronization) to the bitmap field of the DF Election Extended
   Community defined in [DF-FRAMEWORK].

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(0x06)|   DF Type     |P|A|H|T| Bitmap|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Reserved = 0                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   T: This flag is located in bit position 27 as shown above. When set
   to 1, it indicates the desire to use the Time Synchronization
   capability with the rest of the PEs in the ES. This capability is
   used in conjunction with the agreed-upon DF Type (DF Election Type).
   For example, if all the PEs in the ES indicate that they have the
   Time Synchronization capability and they want the DF Type to be HRW,
   then the HRW algorithm is used in conjunction with this capability.

3.2.6 Note on NTP-based synchronization

   The 64-bit timestamp used by the NTP protocol consists of a 32-bit
   part for seconds and a 32-bit part for fractional seconds, giving a
   time scale that rolls over every 2^32 seconds (136 years) and a
   theoretical resolution of 2^-32 seconds (233 picoseconds). The
   recommendation is to keep the 32-bit seconds part and carry the 16
   most significant bits of the fractional-second part.
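
   As a minimal sketch of this recommendation, the Python below
   truncates an NTP timestamp to 48 bits (32 bits of seconds plus the
   upper 16 bits of the fraction) and places it in the 8-octet extended
   community drawn in Section 3.2.5. The Sub-Type is TBD in this
   document, so the value used here is a placeholder, and the function
   names are hypothetical. The last helper applies the default -10ms
   skew that partner PEs add to the SCT.

```python
import struct

SCT_SUBTYPE_PLACEHOLDER = 0xFF  # hypothetical; the draft leaves Sub-Type as TBD


def to_sct48(ntp_seconds, ntp_fraction):
    """Keep the 32-bit seconds and the top 16 bits of the 32-bit fraction."""
    return ((ntp_seconds & 0xFFFFFFFF) << 16) | ((ntp_fraction & 0xFFFFFFFF) >> 16)


def encode_sct(sct48, subtype=SCT_SUBTYPE_PLACEHOLDER):
    """Type, Sub-Type, Timestamp (upper 16), Timestamp (lower 32)."""
    return struct.pack("!BBHI", 0x06, subtype, sct48 >> 32, sct48 & 0xFFFFFFFF)


def decode_sct(ext_comm):
    """Recover the 48-bit timestamp from the 8-octet extended community."""
    _type, _subtype, upper, lower = struct.unpack("!BBHI", ext_comm)
    return (upper << 32) | lower


def sct48_to_seconds(sct48):
    """Seconds since the NTP epoch, with a 2^-16 second granularity."""
    return (sct48 >> 16) + (sct48 & 0xFFFF) / 65536.0


def partner_carve_time(sct48, skew_ms=-10):
    """Time at which a partner PE carves: the SCT plus the (negative) skew."""
    return sct48_to_seconds(sct48) + skew_ms / 1000.0


# Example: SCT of 103.5 seconds (fraction 0x80000000 is one half second).
sct = to_sct48(103, 0x80000000)
ext_comm = encode_sct(sct)
```

   Truncating the fraction to 16 bits leaves a granularity of 2^-16
   seconds (roughly 15 microseconds), which is far finer than the
   default 10ms skew.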
3.2.7 An example

   Take Figure 1 as an example, where PE2 has initially failed and PE1
   has taken over.

   Based on RFC 7432:

   - Initial state: PE1 is in steady-state, PE2 is recovering
   - PE2 recovers at (absolute) time t=99
   - PE2 advertises RT-4 (sent at t=100) to partner PE1
   - PE2 starts its 3-second peering timer as per RFC 7432
   - PE1 carves immediately on RT-4 reception. PE2 carves at time
     t=103.

   With the above procedure, there is a high chance of generating a
   traffic black hole or a traffic loop. The peering timer value has a
   direct effect on this behavior. A short peering timer may generate a
   loop, whereas a long peering timer produces a prolonged blackout.

   Based on the SCT approach:

   - Initial state: PE1 is in steady-state, PE2 is recovering
   - PE2 recovers at (absolute) time t=99
   - PE2 advertises RT-4 (sent at t=100) with target SCT value t=103 to
     partner PE1
   - PE2 starts its 3-second peering timer as per RFC 7432
   - Both PE1 and PE2 carve at (absolute) time t=103. In fact, PE1
     should carve slightly before PE2 (skew).

   With the SCT approach, the peering timer no longer affects
   convergence. Also, the BGP RT-4 transmission delay (from PE2 to PE1)
   becomes irrelevant.

4 Acknowledgement

   The authors would like to acknowledge the helpful comments and
   contributions of Satya Mohanty and Luc Andre Burdet.

5 Security Considerations

   The mechanisms in this document use the EVPN control plane as
   defined in [RFC7432]. Security considerations described in [RFC7432]
   are equally applicable. This document uses MPLS and IP-based tunnel
   technologies to support data plane transport. Security
   considerations described in [RFC7432] and in [ietf-evpn-overlay] are
   equally applicable.

6 IANA Considerations

   Allocation of Extended Community Type and Sub-Type for EVPN.
7 References

7.1 Normative References

   [KEYWORDS]     Bradner, S., "Key words for use in RFCs to Indicate
                  Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC7432]      Sajassi et al., "BGP MPLS-Based Ethernet VPN",
                  RFC 7432, February 2015.

   [DF-FRAMEWORK] Rabadan, Mohanty et al., "Framework for EVPN
                  Designated Forwarder Election Extensibility",
                  draft-ietf-bess-evpn-df-election-framework-00, work
                  in progress, March 5, 2018.

7.2 Informative References

Authors' Addresses

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Gaurav Badoni
   Cisco
   Email: gbadoni@cisco.com

   Patrice Brissette
   Cisco
   Email: pbrisset@cisco.com

   Dhananjaya Rao
   Cisco
   Email: dhrao@cisco.com

   John Drake
   Juniper
   Email: jdrake@juniper.net

   Jorge Rabadan
   Nokia
   Email: jorge.rabadan@nokia.com