BESS Working Group                                     P. Brissette, Ed.
Internet-Draft                                                A. Sajassi
Intended status: Standards Track                              LA. Burdet
Expires: 8 September 2022                                          Cisco
                                                                J. Drake
                                                                 Juniper
                                                              J. Rabadan
                                                                   Nokia
                                                            7 March 2022


          Fast Recovery for EVPN Designated Forwarder Election
                draft-ietf-bess-evpn-fast-df-recovery-05

Abstract

   The Ethernet Virtual Private Network (EVPN) solution provides
   Designated Forwarder election procedures for multihomed Ethernet
   Segments.
   These procedures have been enhanced further by applying the Highest
   Random Weight (HRW) algorithm for Designated Forwarder election in
   order to avoid unnecessary DF status changes upon a failure.  This
   draft improves these procedures by providing a fast Designated
   Forwarder (DF) election upon recovery of the failed link or node
   associated with the multihomed Ethernet Segment.  The solution is
   independent of the number of EVIs associated with that Ethernet
   Segment and is performed via simple signaling between the recovered
   PE and each of the other PEs in the multihoming group.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119] and
   RFC 8174 [RFC8174].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 8 September 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Challenges with Existing Solution
   3.  DF Election Synchronization Solution
     3.1.  Advantages
     3.2.  BGP Encoding
     3.3.  Synchronization Scenarios
     3.4.  Backwards Compatibility
   4.  Security Considerations
   5.  IANA Considerations
   6.  Normative References
   Appendix A.  Contributors
   Appendix B.  Acknowledgements
   Authors' Addresses

1.  Introduction

   Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
   becoming pervasive in data center (DC) applications for Network
   Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
   in service provider (SP) applications for next generation virtual
   private LAN services.

   The EVPN specification [RFC7432] describes DF election procedures for
   multihomed Ethernet Segments.
   These procedures are enhanced further in [RFC8584] by applying the
   Highest Random Weight algorithm for DF election in order to avoid
   unnecessary DF status changes upon a link or node failure associated
   with the multihomed Ethernet Segment.  This draft further improves
   the DF election procedures in [RFC8584] by providing an option for a
   fast DF election upon recovery of the failed link or node associated
   with the multihomed Ethernet Segment.  This DF election is achieved
   independently of the number of EVIs associated with that Ethernet
   Segment and is performed via simple signaling between the recovered
   PE and each of the other PEs in the multihomed group.  The solution
   is based on a simple one-way signaling mechanism.

1.1.  Terminology

   Provider Edge (PE):  A device that sits at the boundary of Provider
      and Customer networks and performs encapsulation and decapsulation
      of data from L2 to L3 and vice versa.

   Designated Forwarder (DF):  A PE that is currently forwarding
      (encapsulating/decapsulating) traffic for a given VLAN in and out
      of a site.

2.  Challenges with Existing Solution

   In EVPN technology, multiple PE devices have the ability to
   encapsulate and decapsulate data belonging to the same VLAN.  In
   certain situations, this may cause L2 duplicates and even loops if
   there is a momentary overlap of forwarding roles between two or more
   PE devices, leading to broadcast storms.

   EVPN [RFC7432] currently uses timer-based synchronization among the
   PE devices in a redundancy group.  This can result in duplicates (and
   even loops) because of multiple DFs if the timer is too short, or in
   blackholing if the timer is too long.
   Using split-horizon filtering (Section 8.3 of [RFC7432]) can prevent
   loops (but not duplicates).  However, if there are overlapping DFs in
   two different sites at the same time for the same VLAN, the site
   identifier will be different upon re-entry of the packet and hence
   the split-horizon check will fail, leading to L2 loops.

   The updated DF procedures in [RFC8584] use the well-known HRW
   (Highest Random Weight) algorithm to avoid reshuffling of VLANs
   among PE devices in the redundancy group upon failure/recovery.  This
   reduces the impact to VLANs not assigned to the failed/recovered
   ports and eliminates loops or duplicates at failure/recovery events.

   However, upon PE insertion or port bring-up (recovery event), HRW
   cannot help either, because the DF role must be transferred to the
   newly inserted device/port while the old DF is still active.

                                      +---------+
                   +-------------+    |         |
                   |             |    |         |
                 / |     PE1     |----|         |   +-------------+
                /  |             |    |  MPLS/  |   |             |---CE3
               /   +-------------+    |  VxLAN/ |   |     PE3     |
          CE1 -                       |  Cloud  |   |             |
               \   +-------------+    |         |---|             |
                \  |             |    |         |   +-------------+
                 \ |     PE2     |----|         |
                   |             |    |         |
                   +-------------+    |         |
                                      +---------+

                Figure 1: CE1 multihomed to PE1 and PE2.

   In Figure 1, when PE2 is inserted or booted up, PE1 will transfer the
   DF role of some VLANs to PE2 to achieve load balancing.  However,
   because there is no handshake mechanism between PE1 and PE2,
   duplication of DF roles for a given VLAN is possible.  Duplication of
   DF roles may eventually lead to duplication of traffic as well as L2
   loops.

   The current EVPN specifications [RFC7432] and [RFC8584] rely on a
   timer-based approach for transferring the DF role to the newly
   inserted device.  This can cause the following issues:

   *  Loops/Duplicates if the timer value is too short

   *  Prolonged Traffic Blackholing if the timer value is too long

3.  DF Election Synchronization Solution

   The solution relies on the concept of common clock alignment between
   partner PEs participating in a common Ethernet Segment.  The main
   idea is to have all peering PEs of that Ethernet Segment perform DF
   election, and apply their resulting carving state, at the same
   well-known time.

   The DF Election procedure, as described in [RFC7432] and as
   optionally signalled in [RFC8584], is applied.  All PEs attached to a
   given Ethernet Segment are clock-synchronized using a networking
   protocol for clock synchronization (e.g. NTP, PTP, etc.).  When a PE
   is newly inserted or recovers from a failure, that PE communicates to
   its peering partners the current time plus the time remaining on its
   peering timer.  This constitutes an "end time" or "absolute time" as
   seen from the local PE.  That absolute time is called the "Service
   Carving Time" (SCT).

   A new BGP Extended Community is advertised along with the Ethernet
   Segment route (RT-4) to communicate the Service Carving Time to the
   other partners.

   Upon reception of that new BGP Extended Community, partner PEs know
   exactly when to carve.  The notion of skew is introduced to eliminate
   any potential duplicate traffic or loops: partner PEs add a skew
   (default = -10ms) to the Service Carving Time to enforce this.  The
   previously inserted PE(s) must carve first, followed shortly (skew)
   by the newly inserted PE.

   To summarize, all peering PEs carve almost simultaneously at the time
   announced by the newly added/recovered PE.  The newly inserted PE
   initiates the SCT, and carves immediately on peering timer expiry.
   The previously inserted PE(s) receiving an Ethernet Segment route
   (RT-4) with an SCT BGP extended community carve shortly before the
   Service Carving Time.

3.1.  Advantages

   There are multiple advantages of using this approach.
   Here is a non-exhaustive list:

   *  A simple uni-directional signaling is all that is needed

   *  Backwards-compatible: PEs supporting only older [RFC7432] shall
      simply discard the unrecognized new "Service Carving Timestamp"
      BGP Extended Community

   *  Multiple DF Election algorithms can be supported:

      -  [RFC7432] default ordered list ordinal algorithm (Modulo),

      -  [RFC8584] highest-random weight, etc.

   *  Independent of BGP transmission delay regarding Ethernet Segment
      route (RT-4)

   *  Agnostic of the time synchronization mechanism used (e.g. NTP,
      PTP, etc.)

3.2.  BGP Encoding

   A new BGP extended community needs to be defined to communicate the
   Service Carving Timestamp for each Ethernet Segment.

   A new transitive extended community where the Type field is 0x06 and
   the Sub-Type is 0x0F is advertised along with the Ethernet Segment
   route.  The expected Service Carving Time is encoded as an 8-octet
   value as follows:

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~       Timestamp Seconds       | Timestamp Fractional Seconds  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The timestamp exchanged uses the NTP epoch of January 1, 1900
   [RFC5905].  The 64-bit timestamp of the NTP protocol consists of a
   32-bit part for seconds and a 32-bit part for fractional seconds:

   *  Timestamp Seconds: 32-bit NTP seconds are encoded in this field.

   *  Timestamp Fractional Seconds: 16 bits of the NTP fractional
      seconds are encoded in this field.  The use of 16-bit fractional
      seconds yields adequate precision of approximately 15 microseconds
      (2^-16 s).
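
   The encoding above can be illustrated with a minimal Python sketch.
   It is not part of the specification.  It assumes the local clock is
   available as Unix time and that the 16 encoded fractional bits are
   the most significant 16 bits of the NTP fraction.  The function and
   constant names (encode_sct, decode_sct, NTP_UNIX_OFFSET, and so on)
   are hypothetical and chosen only for this example.

```python
import struct

# Seconds between the NTP epoch (1 January 1900) and the Unix epoch
# (1 January 1970).
NTP_UNIX_OFFSET = 2208988800

EC_TYPE_EVPN = 0x06    # Type field shown in the diagram above
EC_SUBTYPE_SCT = 0x0F  # Sub-Type requested by this document


def encode_sct(unix_now: float, peering_timer_left: float) -> bytes:
    """Pack SCT = current time + remaining peering timer into 8 octets."""
    sct_ntp = unix_now + peering_timer_left + NTP_UNIX_OFFSET
    seconds = int(sct_ntp) & 0xFFFFFFFF              # 32-bit NTP seconds
    frac16 = int((sct_ntp % 1.0) * 65536) & 0xFFFF   # assumed: top 16 bits
    # Network byte order: 1 + 1 + 4 + 2 octets, no padding.
    return struct.pack("!BBIH", EC_TYPE_EVPN, EC_SUBTYPE_SCT, seconds, frac16)


def decode_sct(community: bytes) -> float:
    """Return the SCT as Unix time (sketch only; ignores NTP era wrap)."""
    ec_type, sub_type, seconds, frac16 = struct.unpack("!BBIH", community)
    if (ec_type, sub_type) != (EC_TYPE_EVPN, EC_SUBTYPE_SCT):
        raise ValueError("not a Service Carving Timestamp community")
    return seconds - NTP_UNIX_OFFSET + frac16 / 65536
```

   Because only 16 fractional bits are carried, a decoded value in this
   sketch can be earlier than the sender's exact SCT by up to about 15
   microseconds, which is small compared with the default skew.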

   This document introduces a new flag called "T" (for Time
   Synchronization) to the bitmap field of the DF Election Extended
   Community defined in [RFC8584].

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~     Bitmap    |            Reserved = 0                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   *  Bit 3: Time Synchronization (corresponds to Bit 27 of the DF
      Election Extended Community).  When set to 1, it indicates the
      desire to use the Time Synchronization capability with the rest
      of the PEs in the Ethernet Segment.

   This capability is used in conjunction with the agreed-upon DF Type
   (DF Election Type).  For example, if all the PEs in the Ethernet
   Segment indicate that they have the Time Synchronization capability
   and they want the DF type to be HRW, then the HRW algorithm is used
   in conjunction with this capability.

3.3.  Synchronization Scenarios

   Let's take Figure 1 as an example where initially PE2 had failed and
   PE1 had taken over.  This example shows the problem with the
   DF-Election mechanism in [RFC7432].

   Based on Section 8.5 of [RFC7432], using the default 3 second peering
   timer:

   1.  Initial state: PE1 is in steady-state, PE2 is recovering

   2.  PE2 recovers at (absolute) time t=99

   3.  PE2 advertises RT-4 (sent at t=100) to partner PE1

   4.  PE2 starts a 3 second peering timer

   5.  PE1 carves immediately on RT-4 reception, i.e. t=100 + minimal
       BGP propagation delay

   6.  PE2 carves at time t=103

   [RFC7432] aims to favour traffic blackholing over duplicate traffic.
   With the above procedure, traffic blackholing will occur as part of
   each PE recovery sequence, since PE1 has transitioned some VLANs to
   Non-Designated-Forwarder (NDF) immediately upon reception.  The
   peering timer value (default = 3 seconds) has a direct effect on the
   duration of the blackholing.  A shorter (esp. zero) peering timer
   may, however, result in duplicate traffic or traffic loops.

   Based on the Service Carving Time (SCT) approach:

   1.  Initial state: PE1 is in steady-state, PE2 is recovering

   2.  PE2 recovers at (absolute) time t=99

   3.  PE2 advertises RT-4 (sent at t=100) with target SCT value t=103
       to partner PE1

   4.  PE2 starts a 3 second peering timer

   5.  Both PE1 and PE2 carve at (absolute) time t=103

   In fact, PE1 should carve slightly before PE2 (skew).  PE2, which was
   previously inserted and is now recovering, performs both transitions,
   DF to NDF and NDF to DF, per VLAN at peering timer expiry.  Since the
   goal is to prevent duplicates, the original PE1, which received the
   SCT, will apply:

   *  DF to NDF transition at t=SCT minus skew, where both PEs are NDF
      for 'skew' amount of time

   *  NDF to DF transition at t=SCT

   It is this split behaviour which ensures a clean transfer of the DF
   role with a contained amount of loss.

   Using the SCT approach, the negative effect of the peering timer is
   mitigated.  Furthermore, the BGP Ethernet Segment route (RT-4)
   transmission delay (from PE2 to PE1) becomes a non-issue.  The use of
   the SCT approach remedies the problem associated with the peering
   timer: the 3 second timer window is shortened to the order of
   milliseconds.

3.4.  Backwards Compatibility

   Per redundancy group, for the DF election procedures to be globally
   convergent and unanimous, it is necessary that all the participating
   PEs agree on the DF Election algorithm to be used.
   It is, however, possible that some PEs continue to use the existing
   modulo-based DF election and do not rely on the new SCT BGP extended
   community.  PEs running a baseline DF election mechanism will simply
   discard the new SCT BGP extended community as unrecognized.

   A PE can indicate its willingness to support clock-synched carving by
   signaling the new 'T' DF Election Capability as well as including the
   new Service Carving Time BGP extended community along with the
   Ethernet Segment Route (Type-4).  In the case where one or more PEs
   attached to the Ethernet Segment do not signal T=1, all PEs in the
   Ethernet Segment SHALL revert to the [RFC7432] timer approach.  This
   is especially important in the context of VLAN shuffling with more
   than 2 PEs.

4.  Security Considerations

   The mechanisms in this document use the EVPN control plane as defined
   in [RFC7432].  Security considerations described in [RFC7432] are
   equally applicable.  This document uses MPLS and IP-based tunnel
   technologies to support data plane transport.  Security
   considerations described in [RFC7432] and in [RFC8365] are equally
   applicable.

5.  IANA Considerations

   This document solicits the allocation of the following sub-type in
   the "EVPN Extended Community Sub-Types" registry set up by [RFC7153]:

      0x0F   Service Carving Timestamp   This document

   This document solicits the allocation of the following values in the
   "DF Election Capabilities" registry set up by [RFC8584]:

      Bit    Name                        Reference
      ----   ----------------            -------------
      3      Time Synchronization        This document

6.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W.
              Kasch, "Network Time Protocol Version 4: Protocol and
              Algorithms Specification", RFC 5905,
              DOI 10.17487/RFC5905, June 2010.

   [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
              Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
              March 2014.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

   [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
              J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
              VPN Designated Forwarder Election Extensibility",
              RFC 8584, DOI 10.17487/RFC8584, April 2019.

Appendix A.  Contributors

   In addition to the authors listed on the front page, the following
   co-authors have also contributed substantially to this document:

   Gaurav Badoni
   Cisco

   Email: gbadoni@cisco.com

   Dhananjaya Rao
   Cisco

   Email: dhrao@cisco.com

Appendix B.  Acknowledgements

   The authors would like to acknowledge the helpful comments and
   contributions of Satya Mohanty and Bharath Vasudevan.  Thanks also to
   Anoop Ghanwani for his thorough review with valuable comments and
   corrections.

Authors' Addresses

   Patrice Brissette (editor)
   Cisco
   Email: pbrisset@cisco.com

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Luc Andre Burdet
   Cisco
   Email: lburdet@cisco.com

   John Drake
   Juniper
   Email: jdrake@juniper.net

   Jorge Rabadan
   Nokia
   Email: jorge.rabadan@nokia.com