BESS Working Group                                           Ali Sajassi
Internet-Draft                                             Gaurav Badoni
Intended Status: Standards Track                          Dhananjaya Rao
                                                       Patrice Brissette
                                                                   Cisco
                                                              John Drake
                                                                 Juniper

Expires: September 12, 2017                               March 12, 2017


                   Fast Recovery for EVPN DF Election
              draft-sajassi-bess-evpn-fast-df-recovery-00

Abstract

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432]
   describes DF election procedures for multi-homing Ethernet Segments.
   These procedures are enhanced further in [EVPN-DF] by applying the
   Highest Random Weight (HRW) algorithm for DF election in order to
   avoid unnecessary DF status changes upon a failure.
   This draft further improves the DF election procedures in [EVPN-DF]
   by providing two options for fast DF election upon recovery of the
   failed link or node associated with the multi-homing Ethernet
   Segment. This fast DF election is achieved independently of the
   number of EVIs associated with that Ethernet Segment, and it is
   performed via simple signaling between the recovered PE and each PE
   in the multi-homing group.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1  Introduction
     1.1  Terminology
   2  Challenges with Existing Solution
   3  Operation
     3.1  DF Election Handshake Solution
       3.1.1  Discovery
       3.1.2  DF candidates Determination
       3.1.3  DF Election Handshake
       3.1.4  Node Insertion
       3.1.5  BGP Encoding
         3.1.5.1  DF Election Handshake Request Route
         3.1.5.2  DF Election Handshake Response Route
       3.1.6  DF Handshake Scenarios
       3.1.7  Interoperability
     3.2  DF Election Synchronization Solution
       3.2.3  Advantages
       3.2.4  Interoperability
       3.2.5  BGP Encoding
       3.2.6  Note on NTP-based synchronization
       3.2.7  An example
   4  Acknowledgement
   5  Security Considerations
   6  IANA Considerations
   7  References
     7.1  Normative References
     7.2  Informative References
   Authors' Addresses

1  Introduction

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
   becoming pervasive in data center (DC) applications for Network
   Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
   in service provider (SP) applications for next generation virtual
   private LAN services.

   The EVPN solution [RFC7432] describes DF election procedures for
   multi-homing Ethernet Segments. These procedures are enhanced further
   in [EVPN-DF] by applying the Highest Random Weight (HRW) algorithm
   for DF election in order to avoid unnecessary DF status changes upon
   a link or node failure associated with the multi-homing Ethernet
   Segment. This draft further improves the DF election procedures in
   [EVPN-DF] by providing two options for a fast DF election upon
   recovery of the failed link or node associated with the multi-homing
   Ethernet Segment. This DF election is achieved independently of the
   number of EVIs associated with that Ethernet Segment, and it is
   performed via simple signaling between the recovered PE and each PE
   in the multi-homing group. The draft presents two signaling options.
   The first option is based on a bidirectional handshake procedure,
   whereas the second option is based on a simple one-way signaling
   mechanism.

1.1  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [KEYWORDS].

   Provider Edge (PE): A device that sits at the boundary of Provider
   and Customer networks and performs encap/decap of data from L2 to L3
   and vice versa.

   Designated Forwarder (DF): A PE that is currently forwarding
   (encapsulating/decapsulating) traffic for a given VLAN in and out of
   a site.

2  Challenges with Existing Solution

   In EVPN technology, multiple PE devices have the ability to encap and
   decap data belonging to the same VLAN. In certain situations, this
   may cause L2 duplicates and even loops if there is a momentary
   overlap of forwarding roles between two or more PE devices, leading
   to broadcast storms.

   EVPN [RFC7432] currently uses timer-based synchronization among PE
   devices in a redundancy group. This can result in duplicates (and
   even loops) because of multiple DFs if the timer is too short, or in
   blackholing if the timer is too long.

   Site-of-origin Split Horizon filtering can prevent loops (but not
   duplicates); however, if there are overlapping DFs in two different
   sites at the same time for the same VLAN, the site identifier will be
   different upon re-entry of the packet and hence the split horizon
   check will fail, leading to L2 loops.

   The current state of the art [EVPN-DF] uses the well-known HRW
   (Highest Random Weight) algorithm to avoid reshuffling of VLANs among
   PE devices in the redundancy group upon failure/recovery, thus
   reducing the impact of failure/recovery on VLANs not on the
   failed/recovered ports. This eliminates loops/duplicates in failure
   scenarios.

   However, upon PE insertion or port bring-up, HRW cannot help, as a
   transfer of the DF role needs to happen to the newly inserted
   device/port while the old DF is still active.

                            +---------+
         +-------------+    |         |
         |             |    |         |
       / |    PE1      |----|         |   +-------------+
      /  |             |    |  MPLS/  |   |             |---H3
     /   +-------------+    |  VxLAN/ |   |    PE10     |
    CE1 -                   |  Cloud  |   |             |
     \   +-------------+    |         |---|             |
      \  |             |    |         |   +-------------+
       \ |    PE2      |----|         |
         |             |    |         |
         +-------------+    |         |
                            +---------+

   Figure 1: CE1 multi-homed to PE1 and PE2. Potential for duplicate DF.

   In Figure 1, when PE2 is inserted or booted up, PE1 will transfer the
   DF role of some VLANs to PE2 to achieve load balancing. However,
   because there is no handshake mechanism between PE1 and PE2,
   duplication of DF roles for a given VLAN is possible. Duplication of
   DF roles may eventually lead to L2 loops as well as duplication of
   traffic.

   The current state of the EVPN art relies on a blackholing timer for
   transferring the DF role to the newly inserted device. This can cause
   the following issues:

   * Loops/Duplicates if the timer value is too short

   * Prolonged Traffic Blackholing if the timer value is too long

   This draft proposes solutions that deterministically eliminate
   loops/duplicates and at the same time provide fast convergence upon
   PE/port insertion.

3  Operation

   Here we describe two signaling mechanisms between the newly inserted
   PE and the remaining PEs. The signaling is only possible once the
   newly inserted PE has reliably discovered the other PEs and vice
   versa. The first option is referred to as the DF Election Handshake
   Solution and is described in section 3.1. The second option is
   referred to as the DF Election Synchronization Solution and is
   described in section 3.2.

3.1  DF Election Handshake Solution

   Due to HRW, only one handshake is needed per PE device, independent
   of EVI/VNI scale. The solution is divided into three phases:

   Phase 1: Discovery

   Phase 2: DF Candidate Determination; HRW

   Phase 3: Handshake

   Each phase is described in detail below.

3.1.1  Discovery

   Each PE needs to have a consistent view of the network including the
   newly inserted PE.

   The newly inserted PE will advertise its Ethernet Segment route and
   start a flood/wait timer. This timer should be large enough to
   guarantee the dissemination and receipt of this advertisement by
   previously inserted PEs.

   As the old DF continues forwarding traffic while the new PE is
   running this timer, this timer can be made as long as required
   without impacting traffic convergence. The timer value can be the BGP
   session hold time in the worst case to ensure proper discovery.

3.1.2  DF candidates Determination

   After the discovery timer has elapsed, each PE will have an imported
   list of the Ethernet Segment Routes from other PEs. The resultant
   database will comprise all the DF candidates on a per ES basis and
   will be used for DF election. Each PE will independently run the HRW
   algorithm for all VLANs in a given Ethernet Segment. Since the
   discovery phase guarantees a uniform network view among the
   participating devices, the HRW VLAN distribution results will be
   consistent.

3.1.3  DF Election Handshake

   The DF Election handshake will be accomplished in the following
   steps:

   - The newly inserted PE will send the DF Request to previously
   inserted PEs with a new sequence number.

   - The previously inserted PE(s) will receive the DF Request and
   validate it against their own discovery state and HRW results.

   - The previously inserted PE(s) will program hardware to block the
   VLANs that must be transferred to the newly inserted PE.

   - The previously inserted PE(s) will send a DF Response (with ACK or
   NACK) to the newly inserted PE with the same sequence number that was
   contained in the DF Request.

   - The newly inserted PE will receive the DF Response and validate it
   using the sequence number. For faster convergence, it will act on
   each received DF Response as it arrives and will not wait for
   responses from all previously inserted devices.

   - In case of a DF Response ACK, the newly inserted PE will program
   its hardware to assume the DF responsibility.
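
   The handshake steps above can be sketched in a few lines of Python.
   This is a minimal illustration under stated assumptions, not the
   draft's implementation: the class and field names (Pe, Message,
   on_request, on_response) are hypothetical, the BGP transport is
   omitted, and hrw_owner uses a SHA-256 hash as a stand-in for the HRW
   weight function of [EVPN-DF].

```python
import hashlib
from dataclasses import dataclass, field


def hrw_owner(vlan, pes):
    """Stand-in HRW (assumption): the PE with the highest hash weight wins."""
    def weight(pe):
        return hashlib.sha256(f"{vlan}|{pe}".encode()).digest()
    return max(sorted(pes), key=weight)


@dataclass
class Message:
    kind: str               # "DF-REQUEST", "DF-ACK" or "DF-NACK"
    esi: str
    sender: str
    seq: int
    destination: str = ""   # set in responses only


@dataclass
class Pe:
    name: str
    esi: str
    candidates: set                              # PEs known after discovery
    df_vlans: set = field(default_factory=set)   # VLANs this PE forwards
    pending_seq: int = -1

    def send_request(self, seq):
        """Newly inserted PE: issue a DF Request with a new sequence number."""
        self.pending_seq = seq
        return Message("DF-REQUEST", self.esi, self.name, seq)

    def on_request(self, req):
        """Previously inserted PE: validate, block VLANs, then respond."""
        if req.esi != self.esi or req.sender not in self.candidates:
            return Message("DF-NACK", self.esi, self.name, req.seq, req.sender)
        moving = {v for v in self.df_vlans
                  if hrw_owner(v, self.candidates) == req.sender}
        self.df_vlans -= moving   # block before the ACK is sent
        return Message("DF-ACK", self.esi, self.name, req.seq, req.sender)

    def on_response(self, rsp, all_vlans):
        """Newly inserted PE: act on one response without waiting for others."""
        if rsp.destination != self.name or rsp.seq != self.pending_seq:
            return False          # not for us, or stale sequence number
        if rsp.kind != "DF-ACK":
            return False          # NACK: the caller starts over
        others = self.candidates - {self.name}
        self.df_vlans |= {v for v in all_vlans
                          if hrw_owner(v, self.candidates) == self.name
                          and hrw_owner(v, others) == rsp.sender}
        return True
```

   Because the previously inserted PE removes the VLANs from its own
   forwarding set before it returns the ACK, there is no instant at
   which both PEs forward the same VLAN in this sketch.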

   A handshake is not needed on a per VLAN/EVI basis but rather per pair
   of PEs in the redundancy group; i.e., if a new PE is added to an
   existing redundancy group of 3 PE devices, then only 3 handshakes are
   needed. This is because the devices are already in sync (through HRW)
   about which VLANs to give up or take over.

   At the end of these three phases, the VLAN DF role transfer will have
   happened in a deterministic way while ensuring minimum traffic loss.
   Device recovery and device insertion scenarios are identical in terms
   of the handshaking procedure. In the next section, we describe the
   procedure details for device insertion.

3.1.4  Node Insertion

   Consider the scenario where PE3 is inserted in the network, while PE1
   and PE2 are already in stable state. PE3 will send/receive the
   following flags in the route Type 4:

   - DF Request: Upon completing the DF Election, PE3 will send a DF
   Request with a new sequence number. PE1 and PE2 will receive this
   message and respond with a DF Response ACK or NACK with the same
   sequence number that was generated by PE3.

   - DF Response ACK: When PE3 receives a DF Response ACK from PE1 with
   the same sequence number as the DF Request, it will take over the DF
   role for the appropriate VLANs that are being transferred from PE1.
   When the DF Response ACK from PE2 arrives, the rest of the VLANs to
   be transferred from PE2 to PE3 are then taken over by PE3.

   - DF Response NACK: If PE3 receives a DF Response NACK from at least
   one of PE1 or PE2, it will not take over the DF role and will start
   over.

   Consider the scenario where two nodes, PE3 and PE4, are being
   inserted at the same time. Both of them will send a DF Request to PE1
   and PE2 at around the same time with possibly the same sequence
   number.
   When PE1 and PE2 respond with a DF Response ACK, it is important to
   signify exactly whom the response is meant for, as it could be for
   either requester (PE3 or PE4). To remove any ambiguity and false
   positives, the IP address of the requester MUST be included in the
   response message to specify who the response is meant for.

3.1.5  BGP Encoding

   The EVPN NLRI consists of Route Type (1 octet), Length (1 octet) and
   Route Type specific variable encoding. Here we propose the creation
   of two new EVPN route types:

   + 0x0C - DF Election Handshake Request Route
   + 0x0D - DF Election Handshake Response Route

3.1.5.1  DF Election Handshake Request Route

   A DF Election Handshake Request Type NLRI consists of the following:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | RD (8 octets)                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Ethernet Segment Identifier (10 octets) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | DF-Flags (1 octet)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sequence Number (1 octet)               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The DF-Flags can have the following values:

   DF-INIT    : Sent initially upon boot-up; bootstraps the network
   DF-REQUEST : Sent to request DF takeover

   For the purpose of BGP route key processing, only the Ethernet
   Segment Identifier is considered to be part of the prefix in the
   NLRI. The DF-Flags and Sequence Number are to be treated as route
   attributes as opposed to being part of the route.
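
   As an illustration of the Request Route layout above, the following
   Python sketch packs and unpacks the NLRI: Route Type, Length, RD,
   ESI, DF-Flags and Sequence Number, for 22 octets in total. The
   function names are hypothetical. The numeric values of DF_INIT and
   DF_REQUEST are assumptions made only so the example runs, since this
   draft names the flags without assigning values.

```python
import struct

ROUTE_TYPE_DF_REQUEST = 0x0C  # route type proposed in Section 3.1.5
DF_INIT = 0x01                # hypothetical value; the draft assigns none
DF_REQUEST = 0x02             # hypothetical value; the draft assigns none


def encode_df_request(rd: bytes, esi: bytes, flags: int, seq: int) -> bytes:
    """Build Route Type, Length, RD, ESI, DF-Flags, Sequence Number."""
    if len(rd) != 8 or len(esi) != 10:
        raise ValueError("RD must be 8 octets and ESI must be 10 octets")
    # The Sequence Number field is one octet, so it wraps at 256.
    body = rd + esi + struct.pack("!BB", flags, seq & 0xFF)
    return struct.pack("!BB", ROUTE_TYPE_DF_REQUEST, len(body)) + body


def decode_df_request(nlri: bytes) -> dict:
    """Parse the 22-octet NLRI back into its fields."""
    if len(nlri) != 22:
        raise ValueError("unexpected NLRI size")
    route_type, length = struct.unpack_from("!BB", nlri, 0)
    if route_type != ROUTE_TYPE_DF_REQUEST or length != 20:
        raise ValueError("not a DF Election Handshake Request NLRI")
    flags, seq = struct.unpack_from("!BB", nlri, 20)
    return {"rd": nlri[2:10], "esi": nlri[10:20], "flags": flags, "seq": seq}


def route_key(nlri: bytes) -> bytes:
    """Only the ESI forms the route key; flags and sequence do not."""
    return decode_df_request(nlri)["esi"]
```

   Two requests for the same Ethernet Segment that differ only in
   sequence number therefore share one route key, so a newer request
   replaces the older one instead of adding a second route.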

3.1.5.2  DF Election Handshake Response Route

   A DF Election Handshake Response Type NLRI consists of the following:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | RD (8 octets)                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Ethernet Segment Identifier (10 octets) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | IP-Address Length (1 octet)             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Destination Router's IP Address         |
   | (4 or 16 octets)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | DF-Flags (1 octet)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Sequence Number (1 octet)               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The DF-Flags can have the following values:

   DF-ACK  : Sent to acknowledge DF-REQUEST
   DF-NACK : Sent to reject DF-REQUEST

   For the purpose of BGP route key processing, only the Ethernet
   Segment Identifier, IP-Address Length and Destination Router's IP
   Address fields are considered to be part of the prefix in the NLRI.
   The DF-Flags and Sequence Number are to be treated as route
   attributes as opposed to being part of the route.

3.1.6  DF Handshake Scenarios

   Consider the scenario where PE3 is freshly inserted into the network
   with PE1 and PE2 in steady state. As shown in the sequence diagram
   below, at time = T0, PE3 will send a Type 4 ES route, which will
   cause PE1 and PE2 to discover PE3.

   After the discovery timer expires, at time = T1, PE3 will send a DF
   Request containing [ESI, DF-REQ, SEQ1].

   PE2 responds via DF Response ACK at time = T2, with the same sequence
   number SEQ1: [ESI, DF-ACK, PE3, SEQ1]. Note that the sequence number
   is the same as the one contained in the DF Request from PE3. PE3 will
   receive the DF Response ACK and take over the appropriate VLANs based
   on HRW only if the sequence number matches.

   PE1 responds via DF Response ACK at time = T3, with the same sequence
   number SEQ1: [ESI, DF-ACK, PE3, SEQ1]. PE3 will receive the DF
   Response ACK and take over the appropriate VLANs based on HRW only if
   the sequence number matches.

   By the end of the handshake, all appropriate VLANs for the ES are
   transferred from PE1 and PE2 to PE3 with a single per-ES handshake.

   PE1                 PE2                         PE3
    |                   |                           |
    |                   |    Type 4 (Discovery)     |
    |                   |<<-------------------------| T0
    |<<---------------------------------------------|
    |                   |                           |
    |                   |                           |
    |                   |    Type C (DF Request)    |
    |<<-----------------|<<-------------------------| T1
    |                   |                           |
    |                   |   Type D (DF Response)    |
    |                   |------------------------->>| T2
    |  Type D(DF Resp)  |                           |
    |--------------------------------------------->>| T3
    |                   |                           |
    |<<###########################################>>|
    |             PE3 freshly inserted              |
    |<<###########################################>>|
    .                   .                           .

   Consider the scenario where PE2 and PE3 are inserted simultaneously
   in the network where PE1 is in steady state (as shown below). PE2 and
   PE3 will send the Type 4 ES routes and start the discovery timer.
   This will cause PE1, PE2 and PE3 to discover each other.

   PE2 and PE3 will then simultaneously and separately send a DF
   Request. PE1 will receive these requests and respond to them.

   To avoid any ambiguity, PE1 will explicitly specify in the DF
   Response route the destination for which the DF-ACK is meant. That is
   why the responses from PE1 will contain [ESI, DF-ACK, PE2, SEQ] and
   [ESI, DF-ACK, PE3, SEQ] to specify that the response is meant for PE2
   and PE3 respectively.

   Upon receiving the Type-D response message, PE2 and PE3 will take
   over the respective VLANs.

   PE1                 PE2                         PE3
    |                   |                           |
    |                   |    Type 4 (Discovery)     |
    |<<-----------------|                           | T0
    |<<-----------------|<<-------------------------|
    |                   |                           |
    |                   |                           |
    |                   |    Type C (DF Request)    |
    |                   |<<-------------------------| T1
    |                   |                           |
    | Type C(DF Request)|                           |
    |<<-----------------|                           | T2
    |                   |                           |
    |                   |   Type D (DF Response)    |
    |--------------------------------------------->>| T3
    |                   |                           |
    |Type D(DF Response)|                           |
    |----------------->>|                           | T4
    |                   |                           |
    |<<###########################################>>|
    |      PE2 and PE3 inserted simultaneously      |
    |<<###########################################>>|
    .                   .                           .

   When PE3 is shut down or removed from the network, the routes
   formerly advertised by PE3 will be withdrawn, including the Type 4
   route (as shown below). When PE1 and PE2 process the deletion of
   PE3's Type 4 route, they will clean up any DF handshake state
   pertaining to PE3. This means that PE1 and PE2 will withdraw the DF
   Response routes that they had earlier sent with PE3 as the
   destination.

   PE1                 PE2                         PE3
    |                   |                           |
    |                   |  Type 4 Route Withdrawal  |
    |                   |<<-------------------------| T0
    |<<---------------------------------------------|
    |                   |                           |
    |    PE2 purges Type D(DF Resp) sent to PE3     | T2
    |                   |                           |
    |    PE1 purges Type D(DF Resp) sent to PE3     | T3
    |                   |                           |
    |<<###########################################>>|
    |   PE3 booted down/removed from the network    |
    |<<###########################################>>|
    .                   .                           .

3.1.7  Interoperability

   Per redundancy group (per ES), for the DF election procedures to be
   globally convergent and unanimous, it is necessary that all the
   participating PEs agree on the DF Election algorithm to be used. It
   is, however, possible that some PEs continue to use the existing
   modulus-based DF election and do not rely on the new handshake/sync
   procedures.
   PEs running older versions of the draft/RFC shall simply discard
   unrecognized new BGP extended communities.

   A PE can indicate its willingness to support the new DF handshake
   procedures by signaling the DF Election type in the DF Election
   Extended Community defined in [EVPN-DF], sent along with the
   Ethernet-Segment Route (Type-4).

   The following additional types will be used to indicate the
   capability:

   0x3 : Handshake Based Mechanism
   0x4 : Time Sync Based Mechanism

   Even when all the PE devices run the same HRW election algorithm,
   only a subset of them may have the capability of performing the
   handshake or synchronization mechanism. In such a situation, only the
   devices that are capable of handshake will partake in the handshake,
   and only the devices that are capable of synchronization will partake
   in sync; the rest will default to the timer-based mechanism defined
   in the base RFC [RFC7432].

   In the illustration below, PE1, PE2 and PE3 send their respective
   Type 4 routes indicating their DF capabilities at time T1, T2 and T3
   respectively. Only PE2 and PE3 are Handshake capable, hence only PE2
   and PE3 partake in the DF Handshaking procedure described here at
   time T4 and T5. PE1, on the other hand, runs the DF election timer
   and takes over the DF role upon timer expiry at time T6.

   PE1                 PE2                         PE3
    |                   |                           |
    |                   |                           |
    |        Type 4 (0x0 Default Capability)        |
    |----------------->>|------------------------->>| T1
    |                   |                           |
    |        Type 4 (0x1 Handshake Capable)         |
    |<<-----------------|------------------------->>| T2
    |                   |                           |
    |        Type 4 (0x1 Handshake Capable)         |
    |<<-----------------|<<-------------------------| T3
    |                   |                           |
    |                   |                           |
    |                   |    Type C (DF Request)    |
    |                   |<<-------------------------| T4
    |                   |                           |
    |                   |   Type D (DF Response)    |
    |                   |------------------------->>| T5
    |        PE1 Timer Expiry (DF Takeover)         | T6
    |<<###########################################>>|
    |      Only PE2 and PE3 Handshake Capable       |
    |<<###########################################>>|
    .                   .                           .

3.2  DF Election Synchronization Solution

   If all PE devices attached to a given Ethernet Segment are clock-
   synchronized with each other, then the above handshaking procedures
   can be simplified and packet loss can be reduced from the BGP
   propagation time (between the recovered PE and the DF PE) to a very
   small time (e.g., milliseconds or less).

   The simplified procedure is as follows:

   First, the DF election procedure described in [RFC7432] is applied as
   before.

   All PEs attached to a given Ethernet Segment are clock-synchronized
   using a networking protocol for clock synchronization (e.g., NTP,
   PTP, etc.).

   Upon insertion of a new PE, or during failure recovery of a PE, that
   PE communicates to its peering partners the current time plus the
   remaining peering timer time. This constitutes an "end time" as seen
   from the local PE. That end time is called the "Service Carving Time"
   (SCT).

   A new BGP Extended Community is advertised along with the RT-4 to
   communicate the Service Carving Time to the other partners.

   Upon reception of that new BGP Extended Community, partner PEs know
   exactly when carving is to take place. The notion of skew is
   introduced to eliminate any potential duplicate traffic or loops.
   Partner PEs add a skew (default = -10 ms) to the Service Carving Time
   to enforce this; in other words, partner PEs must carve first.

   To summarize, all peering PEs carve almost simultaneously at the time
   announced by the newly added/recovered PE. The newly added/recovered
   PE initiates the SCT and carves immediately upon peering timer
   expiry. Other PEs receiving an RT-4 with an SCT BGP ExtComm carve
   shortly before the SCT.

3.2.3  Advantages

   There are multiple advantages to this approach. Here is a non-
   exhaustive list:

   - Simple unidirectional signaling is all that is needed

   - Backwards-compatible: old versions of the draft/RFC shall simply
   discard the unrecognized new SCT BGP ExtComm

   - Multiple DF Election algorithms can be supported:
      * RFC7432's default ordered list ordinal algorithm (modulo)
      * draft-mohanty-bess-evpn-df-election (HRW), etc.

   - Independent of BGP transmission delay for the RT-4

   - The solution is agnostic of the time synchronization mechanism
   (e.g., NTP, PTP, ...)

3.2.4  Interoperability

   Per redundancy group, for the DF election procedures to be globally
   convergent and unanimous, it is necessary that all the participating
   PEs agree on the DF Election algorithm to be used. It is, however,
   possible that some PEs continue to use the existing modulus-based DF
   election and do not rely on the new SCT BGP extended community. PEs
   running a baseline DF election mechanism shall simply discard the
   unrecognized new SCT BGP extended community.

   A PE can indicate its willingness to support clock-synched carving by
   signaling the new SCT BGP extended community along with the Ethernet-
   Segment Route (Type-4).

3.2.5  BGP Encoding

   A new BGP extended community needs to be defined to communicate the
   Service Carving Expected Timestamp for each Ethernet Segment.

   A new transitive extended community, where the Type field is 0x06 and
   the Sub-Type is TBD, is advertised along with the Ethernet Segment
   route. The timestamp for the expected service carving is carried in
   the 8-octet extended community as follows:

    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type=0x06   | Sub-Type(TBD) |      Timestamp (upper 16)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Timestamp (lower 32)                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.2.6  Note on NTP-based synchronization

   The 64-bit timestamp used by the NTP protocol consists of a 32-bit
   part for seconds and a 32-bit part for the fractional second, giving
   a time scale that rolls over every 2^32 seconds (136 years) and a
   theoretical resolution of 2^-32 seconds (233 picoseconds). The
   recommendation is to keep the 32-bit seconds part and carry only the
   16 most significant bits of the fractional second.

3.2.7  An example

   Let's take Figure 1 as an example, where initially PE2 had failed and
   PE1 had taken over.

   Based on RFC 7432:

   - Initial state: PE1 is in steady state, PE2 is recovering
   - PE2 recovers at (absolute) time t=99
   - PE2 advertises RT-4 (sent at t=100) to partner PE1
   - PE2 starts its 3-second peering timer as per RFC 7432
   - PE1 carves immediately on RT-4 reception. PE2 carves at time t=103.

   With this procedure, there is a high chance of generating a traffic
   black hole or traffic loop. The peering timer value has a direct
   effect on this behavior. A short peering timer may cause a loop,
   whereas a long peering timer prolongs the blackout.
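
   The timing in this example can be checked with a short calculation.
   The Python sketch below is illustrative only: the function names are
   hypothetical, times are in milliseconds, and the 200 ms BGP delay is
   an assumed figure that does not come from this draft. The second
   function applies the SCT and skew rules of Section 3.2 and assumes
   the RT-4 reaches PE1 before the skewed carving time.

```python
def rfc7432_gap_ms(sent_at_ms, peering_timer_ms, bgp_delay_ms):
    """Time between PE1 giving up a VLAN and PE2 taking it over.

    Positive result: no DF exists for that long (blackhole).
    Negative result: two DFs exist for that long (duplicates or loops).
    """
    pe1_carves = sent_at_ms + bgp_delay_ms       # on RT-4 reception
    pe2_carves = sent_at_ms + peering_timer_ms   # on peering timer expiry
    return pe2_carves - pe1_carves


def sct_gap_ms(sent_at_ms, peering_timer_ms, skew_ms=-10):
    """Same gap when PE1 carves at the advertised SCT plus the skew."""
    sct = sent_at_ms + peering_timer_ms          # advertised by PE2
    pe1_carves = sct + skew_ms                   # partner PE carves first
    pe2_carves = sct                             # on peering timer expiry
    return pe2_carves - pe1_carves


# RT-4 sent at t=100 s, 3 s peering timer, assumed 200 ms BGP delay
print(rfc7432_gap_ms(100_000, 3_000, 200))   # 2800 ms without a DF
print(rfc7432_gap_ms(100_000, 100, 200))     # -100 ms: two DFs overlap
print(sct_gap_ms(100_000, 3_000))            # 10 ms, whatever the delay
```

   In the timer-based case the gap equals the peering timer minus the
   BGP delay, so it changes sign when the timer is shorter than the
   delay. With the SCT, the gap depends only on the skew.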

   Based on the SCT approach:

   - Initial state: PE1 is in steady state, PE2 is recovering
   - PE2 recovers at (absolute) time t=99
   - PE2 advertises RT-4 (sent at t=100) with target SCT value t=103 to
     partner PE1
   - PE2 starts its 3-second peering timer as per RFC 7432
   - Both PE1 and PE2 carve at (absolute) time t=103; in fact, PE1
     should carve slightly before PE2 (skew).

   Using the SCT approach, the effect of the peering timer is gone.
   Also, the BGP RT-4 transmission delay (from PE2 to PE1) becomes a
   no-op.

4  Acknowledgement

   The authors would like to acknowledge the helpful comments and
   contributions of Satya Mohanty and Luc Andre Burdet.

5  Security Considerations

   The mechanisms in this document use the EVPN control plane as defined
   in [RFC7432]. Security considerations described in [RFC7432] are
   equally applicable. This document uses MPLS and IP-based tunnel
   technologies to support data plane transport. Security considerations
   described in [RFC7432] and in [ietf-evpn-overlay] are equally
   applicable.

6  IANA Considerations

   Allocation of Extended Community Type and Sub-Type for EVPN.

7  References

7.1  Normative References

   [KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC7432]  Sajassi et al., "BGP MPLS Based Ethernet VPN", February
              2015.

7.2  Informative References

   [EVPN-DF]  Key et al., "A new Designated Forwarder Election for the
              EVPN", draft-ietf-bess-evpn-df-election-01, work in
              progress, April 2017.

Authors' Addresses

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Gaurav Badoni
   Cisco
   Email: gbadoni@cisco.com

   Patrice Brissette
   Cisco
   Email: pbrisset@cisco.com

   Dhananjaya Rao
   Cisco
   Email: dhrao@cisco.com

   John Drake
   Juniper
   Email: jdrake@juniper.net