idnits 2.17.1

draft-ietf-bess-evpn-fast-df-recovery-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------

     No issues found here.

  Checking nits according to
  https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line
     does not match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have
     RFC 2119 boilerplate text.

  -- The document date (July 6, 2021) is 1017 days in the past.  Is
     this intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative
     references to lower-maturity documents in RFCs)

  == Missing Reference: 'TBD3' is mentioned on line 244, but not
     defined

     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment
     (--).

     Run idnits with the --verbose option for more detailed information
     about the items above.

------------------------------------------------------------------------

BESS Working Group                                     P. Brissette, Ed.
Internet-Draft                                                A. Sajassi
Intended status: Standards Track                              LA. Burdet
Expires: January 7, 2022                                           Cisco
                                                                J. Drake
                                                                 Juniper
                                                              J. Rabadan
                                                                   Nokia
                                                            July 6, 2021

                   Fast Recovery for EVPN DF Election
                draft-ietf-bess-evpn-fast-df-recovery-02

Abstract

   The Ethernet Virtual Private Network (EVPN) solution provides
   Designated Forwarder election procedures for multi-homing Ethernet
   Segments.
   These procedures have been enhanced further by applying the Highest
   Random Weight (HRW) algorithm for Designated Forwarder election in
   order to avoid unnecessary DF status changes upon a failure.  This
   draft improves these procedures by providing a fast Designated
   Forwarder (DF) election upon recovery of the failed link or node
   associated with the multi-homing Ethernet Segment.  The solution is
   independent of the number of EVIs associated with that Ethernet
   Segment, and it is performed via simple signaling between the
   recovered PE and each PE in the multi-homing group.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119] and
   RFC 8174 [RFC8174].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Challenges with Existing Solution
   3.  DF Election Synchronization Solution
     3.1.  Advantages
     3.2.  BGP Encoding
     3.3.  Note on NTP-based synchronization
     3.4.  Synchronization Scenarios
     3.5.  Backwards Compatibility
   4.  Security Considerations
   5.  IANA Considerations
   6.  Normative References
   Appendix A.  Contributors
   Appendix B.  Acknowledgements
   Authors' Addresses

1.  Introduction

   The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
   becoming pervasive in data center (DC) applications for Network
   Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
   in service provider (SP) applications for next generation virtual
   private LAN services.

   The EVPN solution [RFC7432] describes DF election procedures for
   multi-homing Ethernet Segments.
   These procedures are enhanced further in [RFC8584] by applying the
   Highest Random Weight algorithm for DF election in order to avoid
   unnecessary DF status changes upon a link or node failure associated
   with the multi-homing Ethernet Segment.  This draft further improves
   the DF election procedures in [RFC8584] by providing an option for a
   fast DF election upon recovery of the failed link or node associated
   with the multi-homing Ethernet Segment.  This DF election is achieved
   independently of the number of EVIs associated with that Ethernet
   Segment, and it is performed via simple signaling between the
   recovered PE and each PE in the multi-homing group.  The solution is
   based on a simple one-way signaling mechanism.

1.1.  Terminology

   Provider Edge (PE):  A device that sits at the boundary of Provider
      and Customer networks and performs encap/decap of data from L2 to
      L3 and vice-versa.

   Designated Forwarder (DF):  A PE that is currently forwarding
      (encapsulating/decapsulating) traffic for a given VLAN in and out
      of a site.

2.  Challenges with Existing Solution

   In EVPN technology, multiple PE devices have the ability to encap and
   decap data belonging to the same VLAN.  In certain situations, this
   may cause L2 duplicates and even loops if there is a momentary
   overlap of forwarding roles between two or more PE devices, leading
   to broadcast storms.

   EVPN [RFC7432] currently uses timer-based synchronization among PE
   devices in a redundancy group, which can result in duplicates (and
   even loops) because of multiple DFs if the timer is too short, or in
   blackholing if the timer is too long.

   Using ESI label Split Horizon filtering can prevent loops (but not
   duplicates).  However, if there are overlapping DFs in two different
   sites at the same time for the same VLAN, the site identifier will be
   different upon re-entry of the packet, and hence the split horizon
   check will fail, leading to L2 loops.

   The current state of the art [RFC8584] uses the well-known HRW
   (Highest Random Weight) algorithm to avoid reshuffling of VLANs among
   PE devices in the redundancy group upon failure/recovery, thus
   reducing the impact of failure/recovery on VLANs not on the
   failed/recovered ports.  This eliminates loops/duplicates in failure
   scenarios.

   However, upon PE insertion or port bring-up, HRW cannot help, as a
   transfer of the DF role needs to happen to the newly inserted
   device/port while the old DF is still active.

                                       +---------+
                    +-------------+    |         |
                    |             |    |         |
                  / |    PE1      |----|         |   +-------------+
                 /  |             |    |  MPLS/  |   |             |---H3
                /   +-------------+    |  VxLAN/ |   |     PE10    |
           CE1 -                       |  Cloud  |   |             |
                \   +-------------+    |         |---|             |
                 \  |             |    |         |   +-------------+
                  \ |     PE2     |----|         |
                    |             |    |         |
                    +-------------+    |         |
                                       +---------+

                Figure 1: CE1 multi-homed to PE1 and PE2.

   In Figure 1, when PE2 is inserted or booted up, PE1 will transfer the
   DF role of some VLANs to PE2 to achieve load balancing.  However,
   because there is no handshake mechanism between PE1 and PE2,
   duplication of DF roles for a given VLAN is possible.  Duplication of
   DF roles may eventually lead to L2 loops as well as duplication of
   traffic.

   The current state of the EVPN art relies on a blackholing timer for
   transferring the DF role to the newly inserted device.  This can
   cause the following issues:

   *  Loops/duplicates if the timer value is too short

   *  Prolonged traffic blackholing if the timer value is too long

3.  DF Election Synchronization Solution

   The solution relies on the concept of common clock alignment between
   partner PEs participating in a common Ethernet-Segment.  The main
   idea is to have them all apply their carving state, resulting from DF
   election, at a well-known time.

   The DF Election procedure, as described in [RFC7432] and as
   optionally signalled in [RFC8584], is applied.  All PEs attached to a
   given Ethernet-Segment are clock-synchronized using a networking
   protocol for clock synchronization (e.g., NTP, PTP, etc.).  When a PE
   is newly inserted or recovers from a failure, it communicates to its
   peering partners the current time plus the time remaining on its
   peering timer.  This constitutes an "end" or "absolute" time as seen
   from the local PE.  That absolute time is called the "Service Carving
   Time" (SCT).

   A new BGP Extended Community is advertised along with the
   Ethernet-Segment route (RT-4) to communicate the Service Carving Time
   to the other partners.

   Upon reception of that new BGP Extended Community, partner PEs know
   the exact carving time.  The notion of skew is introduced to
   eliminate any potential duplicate traffic or loops: partner PEs add a
   skew (default = -10ms) to the Service Carving Time to enforce this.
   The previously inserted PE(s) must carve first, followed shortly
   (skew) by the newly inserted PE.

   To summarize, all peering PEs carve almost simultaneously at the time
   announced by the newly added/recovered PE.  The newly inserted PE
   initiates the SCT and carves immediately on peering timer expiry.
   The previously inserted PE(s), receiving the Ethernet-Segment route
   (RT-4) with an SCT BGP extended community, carve shortly before the
   Service Carving Time.

3.1.  Advantages

   There are multiple advantages to using this approach.
   Here is a non-exhaustive list:

   -  Simple uni-directional signaling is all that is needed

   -  Backwards-compatible: PEs supporting only the older [RFC7432]
      shall simply discard the unrecognized new "Service Carving
      Timestamp" BGP Extended Community

   -  Multiple DF Election algorithms can be supported:

      *  [RFC7432] default ordered list ordinal algorithm (Modulo),

      *  [RFC8584] highest-random weight, etc.

   -  Independent of BGP transmission delay regarding the
      Ethernet-Segment route (RT-4)

   -  Agnostic of the time synchronization mechanism used (e.g., NTP,
      PTP, etc.)

3.2.  BGP Encoding

   A new BGP extended community needs to be defined to communicate the
   Service Carving Timestamp for each Ethernet Segment.

   A new transitive extended community, where the Type field is 0x06 and
   the Sub-Type is [TBD3], is advertised along with the Ethernet Segment
   route.  The timestamp for the expected service carving is carried in
   this 8-octet extended community as follows:

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type(TBD3)|       Timestamp Seconds       ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~       Timestamp Seconds       | Timestamp Fractional Seconds  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This document introduces a new flag called "T" (for Time
   Synchronization) to the bitmap field of the DF Election Extended
   Community defined in [RFC8584].

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~     Bitmap    |                 Reserved = 0                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   T: This flag is located in bit position 27 as shown above.
   When set to 1, it indicates the desire to use the Time
   Synchronization capability with the rest of the PEs in the ES.  This
   capability is used in conjunction with the agreed-upon DF Type (DF
   Election Type).  For example, if all the PEs in the ES indicate that
   they have the Time Synchronization capability and they want the DF
   type to be HRW, then the HRW algorithm is used in conjunction with
   this capability.

3.3.  Note on NTP-based synchronization

   The 64-bit timestamp used by the NTP protocol consists of a 32-bit
   part for seconds and a 32-bit part for fractional seconds.  The
   timestamp exchanged uses the NTP epoch of January 1, 1900 [RFC5905].
   The use of 32-bit seconds and 16-bit fractional seconds yields an
   adequate precision of about 15 microseconds (2^-16 s).

3.4.  Synchronization Scenarios

   Let's take Figure 1 as an example where initially PE2 had failed and
   PE1 had taken over.  This example shows the problem with the known
   mechanism.

   Based on [RFC7432]:

   -  Initial state: PE1 is in steady-state, PE2 is recovering

   -  PE2 recovers at (absolute) time t=99

   -  PE2 advertises RT-4 (sent at t=100) to partner PE1

   -  PE2 starts its 3-second peering timer as per [RFC7432]

   -  PE1 carves immediately on RT-4 reception, i.e., at t=100 plus a
      minimal BGP propagation delay

   -  PE2 carves at time t=103

   [RFC7432] favours a traffic black hole over duplicate traffic.  With
   the above procedure, a traffic black hole will occur as part of each
   PE recovery sequence.  The peering timer value (default = 3 seconds)
   has a direct effect on the duration of the prolonged blackholing.  A
   short (esp. zero) peering timer may, however, result in duplicate
   traffic or traffic loops.
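
   As a rough numerical illustration of the [RFC7432] sequence above,
   the following Python sketch computes the interval during which
   neither PE forwards the VLANs being handed over.  The function name
   and the 50 ms BGP propagation delay are assumptions chosen for
   illustration; only the t=100 advertisement time and the 3-second
   peering timer come from the example.

```python
def rfc7432_blackhole_window(rt4_sent_at, bgp_delay, peering_timer=3.0):
    """Return (pe1_carve, pe2_carve, gap) for the timer-based handover.

    pe1_carve: PE1 carves (gives up the moved VLANs) on RT-4 reception.
    pe2_carve: PE2 carves (takes over) when its peering timer expires.
    gap:       time during which neither PE is DF for the moved VLANs;
               a negative value would mean both PEs are DF at once.
    """
    pe1_carve = rt4_sent_at + bgp_delay
    pe2_carve = rt4_sent_at + peering_timer
    return pe1_carve, pe2_carve, pe2_carve - pe1_carve


# Values from the example: RT-4 sent at t=100, 3-second peering timer.
# The 0.05 s BGP propagation delay is a hypothetical figure.
pe1, pe2, gap = rfc7432_blackhole_window(rt4_sent_at=100.0, bgp_delay=0.05)
print(f"PE1 carves at t={pe1:.2f}, PE2 at t={pe2:.2f}, gap={gap:.2f}s")
```

   Under these assumptions the gap is close to the full peering timer,
   which is why shortening the timer trades blackholing for the risk of
   overlapping DFs.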

   Based on the Service Carving Time (SCT) approach:

   -  Initial state: PE1 is in steady-state, PE2 is recovering

   -  PE2 recovers at (absolute) time t=99

   -  PE2 advertises RT-4 (sent at t=100) with target SCT value t=103 to
      partner PE1

   -  PE2 starts its 3-second peering timer as per [RFC7432]

   -  Both PE1 and PE2 carve at (absolute) time t=103

   In fact, PE1 should carve slightly before PE2 (skew).  The recovering
   PE2 performs both transitions, DF to NDF and NDF to DF, per VLAN at
   the peering timer expiry.  Since the goal is to prevent duplicates,
   the original PE1, which received the SCT, will apply:

   -  DF to NDF transition at t=SCT minus skew, so that both PEs are NDF
      for the 'skew' amount of time

   -  NDF to DF transition at t=SCT

   It is this split behaviour which ensures a clean transition of the DF
   role with a contained amount of loss.

   Using the SCT approach, the negative effect of the peering timer is
   mitigated.  Furthermore, the BGP Ethernet-Segment route (RT-4)
   transmission delay (from PE2 to PE1) becomes a no-op.  The SCT
   approach remedies the problem exposed by the use of the peering
   timer.  The 3-second timer window is shortened to a few milliseconds.

3.5.  Backwards Compatibility

   Per redundancy group, for the DF election procedures to be globally
   convergent and unanimous, it is necessary that all the participating
   PEs agree on the DF Election algorithm to be used.  It is, however,
   possible that some PEs continue to use the existing modulus-based DF
   election and do not rely on the new SCT BGP extended community.  PEs
   running a baseline DF election mechanism shall simply discard the
   unrecognized new SCT BGP extended community.
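
   The SCT schedule of Section 3.4 and the baseline behaviour described
   above can be put side by side in a short Python sketch.  A peer that
   understands the SCT extended community schedules its transitions
   around the advertised time, while a baseline [RFC7432] peer that
   discards the community carves as soon as it receives the RT-4.  The
   names pack_sct, unpack_sct, CarvePlan and peer_carve_plan are
   hypothetical, and treating the skew as a positive 10 ms magnitude
   subtracted from the SCT is an assumption made for illustration.  Only
   the 6 timestamp octets are packed, since the Sub-Type is still TBD3.

```python
import struct
from dataclasses import dataclass
from typing import Optional

SKEW = 0.010  # seconds; assumed magnitude of the default 10 ms skew


def pack_sct(sct: float) -> bytes:
    """Pack an SCT (seconds since the NTP epoch) as 32-bit seconds plus
    a 16-bit fraction: the 6 timestamp octets of the community."""
    seconds = int(sct)
    fraction = int((sct - seconds) * (1 << 16))
    return struct.pack("!IH", seconds & 0xFFFFFFFF, fraction)


def unpack_sct(octets: bytes) -> float:
    """Inverse of pack_sct; precision is limited to 2**-16 s."""
    seconds, fraction = struct.unpack("!IH", octets)
    return seconds + fraction / (1 << 16)


@dataclass
class CarvePlan:
    df_to_ndf_at: float  # when the peer gives up VLANs it no longer owns
    ndf_to_df_at: float  # when the peer takes over VLANs it now owns


def peer_carve_plan(rt4_rx_time: float, sct_octets: Optional[bytes],
                    supports_sct: bool) -> CarvePlan:
    """Carving schedule of an existing peer (PE1) on receiving an RT-4."""
    if not supports_sct or sct_octets is None:
        # Baseline peer: the community is discarded, carve on reception.
        return CarvePlan(rt4_rx_time, rt4_rx_time)
    sct = unpack_sct(sct_octets)
    return CarvePlan(df_to_ndf_at=sct - SKEW, ndf_to_df_at=sct)


# Example from the text: RT-4 sent at t=100 with SCT t=103.
# The 0.05 s reception delay is a hypothetical figure.
wire = pack_sct(103.0)
print(peer_carve_plan(100.05, wire, supports_sct=True))
print(peer_carve_plan(100.05, wire, supports_sct=False))
```

   In the SCT-capable case the result no longer depends on when the RT-4
   was received, which reflects the claim that the BGP transmission
   delay stops mattering.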

   A PE can indicate its willingness to support clock-synched carving by
   signaling the new 'T' DF Election Capability as well as including the
   new Service Carving Time BGP extended community along with the
   Ethernet-Segment Route (Type-4).  In the case where one or more PEs
   attached to the Ethernet-Segment do not signal T=1, all PEs in the
   Ethernet-Segment may revert to the [RFC7432] timer approach.

4.  Security Considerations

   The mechanisms in this document use the EVPN control plane as defined
   in [RFC7432].  Security considerations described in [RFC7432] are
   equally applicable.  This document uses MPLS and IP-based tunnel
   technologies to support data plane transport.  Security
   considerations described in [RFC7432] and in [RFC8365] are equally
   applicable.

5.  IANA Considerations

   This document solicits the allocation of the following sub-type in
   the "EVPN Extended Community Sub-Types" registry set up by [RFC7153]:

      TBD3     Service Carving Timestamp     This document

   This document solicits the allocation of the following values in the
   "DF Election Capabilities" registry set up by [RFC8584]:

      Bit     Name                     Reference
      ----    ----------------         -------------
      3       Time Synchronization     This document

6.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and Algorithms
              Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010.

   [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
              Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
              March 2014.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

   [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
              J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
              VPN Designated Forwarder Election Extensibility",
              RFC 8584, DOI 10.17487/RFC8584, April 2019.

Appendix A.  Contributors

   In addition to the authors listed on the front page, the following
   co-authors have also contributed substantially to this document:

   Gaurav Badoni
   Cisco

   Email: gbadoni@cisco.com

   Dhananjaya Rao
   Cisco

   Email: dhrao@cisco.com

Appendix B.  Acknowledgements

   The authors would like to acknowledge the helpful comments and
   contributions of Satya Mohanty and Bharath Vasudevan.

Authors' Addresses

   Patrice Brissette (editor)
   Cisco

   Email: pbrisset@cisco.com

   Ali Sajassi
   Cisco

   Email: sajassi@cisco.com

   Luc Andre Burdet
   Cisco

   Email: lburdet@cisco.com

   John Drake
   Juniper

   Email: jdrake@juniper.net

   Jorge Rabadan
   Nokia

   Email: jorge.rabadan@nokia.com