idnits 2.17.1

draft-ietf-bess-evpn-fast-df-recovery-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (20 January 2022) is 820 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'TBD3' is mentioned on line 240, but not defined

     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   BESS Working Group                                   P.B. Brissette, Ed.
3   Internet-Draft                                              A.S. Sajassi
4   Intended status: Standards Track                            LA.B. Burdet
5   Expires: 24 July 2022                                              Cisco
6                                                                 J.D. Drake
7                                                                    Juniper
8                                                               J.R. Rabadan
9                                                                      Nokia
10                                                           20 January 2022

12                     Fast Recovery for EVPN DF Election
13                  draft-ietf-bess-evpn-fast-df-recovery-03

15  Abstract

17     Ethernet Virtual Private Network (EVPN) solution provides Designated
18     Forwarder election procedures for multi-homing Ethernet Segments.
19     These procedures have been enhanced further by applying the Highest
20     Random Weight (HRW) Algorithm for Designated Forwarder election in
21     order to avoid unnecessary DF status changes upon a failure. This
22     draft improves these procedures by providing a fast Designated
23     Forwarder (DF) election upon recovery of the failed link or node
24     associated with the multi-homing Ethernet Segment. The solution is
25     independent of the number of EVIs associated with that Ethernet
26     Segment and is performed via simple signaling between the recovered
27     PE and each PE in the multi-homing group.

29  Requirements Language

31     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
32     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
33     document are to be interpreted as described in RFC 2119 [RFC2119] and
34     RFC 8174 [RFC8174].

36  Status of This Memo

38     This Internet-Draft is submitted in full conformance with the
39     provisions of BCP 78 and BCP 79.

41     Internet-Drafts are working documents of the Internet Engineering
42     Task Force (IETF). Note that other groups may also distribute
43     working documents as Internet-Drafts. The list of current
44     Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

46     Internet-Drafts are draft documents valid for a maximum of six months
47     and may be updated, replaced, or obsoleted by other documents at any
48     time. It is inappropriate to use Internet-Drafts as reference
49     material or to cite them other than as "work in progress."

51     This Internet-Draft will expire on 24 July 2022.

53  Copyright Notice

55     Copyright (c) 2022 IETF Trust and the persons identified as the
56     document authors. All rights reserved.

58     This document is subject to BCP 78 and the IETF Trust's Legal
59     Provisions Relating to IETF Documents
60     (https://trustee.ietf.org/license-info) in effect on the date of
       publication of this document.
61     Please review these documents carefully, as they describe your rights
62     and restrictions with respect to this document. Code Components
63     extracted from this document must include Revised BSD License text as
64     described in Section 4.e of the Trust Legal Provisions and are
65     provided without warranty as described in the Revised BSD License.

67  Table of Contents

69     1.  Introduction
70       1.1.  Terminology
71     2.  Challenges with Existing Solution
72     3.  DF Election Synchronization Solution
73       3.1.  Advantages
74       3.2.  BGP Encoding
75       3.3.  Note on NTP-based synchronization
76       3.4.  Synchronization Scenarios
77       3.5.  Backwards Compatibility
78     4.  Security Considerations
79     5.  IANA Considerations
80     6.  Normative References
81     Appendix A.  Contributors
82     Appendix B.  Acknowledgements
83     Authors' Addresses

85  1.  Introduction

87     Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
88     becoming pervasive in data center (DC) applications for Network
89     Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
90     in service provider (SP) applications for next generation virtual
91     private LAN services.

93     EVPN solution [RFC7432] describes DF election procedures for
94     multi-homing Ethernet Segments.
       These procedures are enhanced further in
95     [RFC8584] by applying the Highest Random Weight algorithm for DF
96     election in order to avoid unnecessary DF status changes upon a link
97     or node failure associated with the multi-homing Ethernet Segment.
98     This draft further improves the DF election procedures in
99     [RFC8584] by providing an option for a fast DF election upon recovery
100    of the failed link or node associated with the multi-homing Ethernet
101    Segment. This DF election is achieved independent of the number of
102    EVIs associated with that Ethernet Segment and is performed via
103    simple signaling between the recovered PE and each PE in the
104    multi-homing group. The solution is based on a simple one-way
105    signaling mechanism.

107 1.1.  Terminology

109    Provider Edge (PE):  A device that sits at the boundary of Provider
110       and Customer networks and performs encap/decap of data from L2 to
111       L3 and vice-versa.

113    Designated Forwarder (DF):  A PE that is currently forwarding
114       (encapsulating/decapsulating) traffic for a given VLAN in and out
115       of a site.

117 2.  Challenges with Existing Solution

119    In EVPN technology, multiple PE devices have the ability to encap and
120    decap data belonging to the same VLAN. In certain situations, this
121    may cause L2 duplicates and even loops if there is a momentary
122    overlap of forwarding roles between two or more PE devices, leading
123    to broadcast storms.

125    EVPN [RFC7432] currently uses timer-based synchronization among PE
126    devices in a redundancy group, which can result in duplications (and
127    even loops) because of multiple DFs if the timer is too short, or
128    blackholing if the timer is too long.
130    Using ESI label Split Horizon filtering can prevent loops (but not
131    duplicates). However, if there are overlapping DFs in two different
132    sites at the same time for the same VLAN, the site identifier will be
133    different upon re-entry of the packet and hence the split horizon
134    check will fail, leading to L2 loops.

136    The current state of the art [RFC8584] uses the well-known HRW
137    (Highest Random Weight) algorithm to avoid reshuffling of VLANs
138    among PE devices in the redundancy group upon failure/recovery,
139    thus reducing the impact of failure/recovery on VLANs not on the
140    failed/recovered ports. This eliminates loops/duplicates in
141    failure scenarios. However, upon PE insertion or port bring-up,
142    HRW cannot help, as the DF role needs to be transferred to the
143    newly inserted device/port while the old DF is still active.

145                              +---------+
146           +-------------+    |         |
147           |             |    |         |
148         / |    PE1      |----|         |   +-------------+
149        /  |             |    |  MPLS/  |   |             |---H3
150       /   +-------------+    |  VxLAN/ |   |     PE10    |
151    CE1 -                     |  Cloud  |   |             |
152       \   +-------------+    |         |---|             |
153        \  |             |    |         |   +-------------+
154         \ |    PE2      |----|         |
155           |             |    |         |
156           +-------------+    |         |
157                              +---------+

159                Figure 1: CE1 multi-homed to PE1 and PE2.

161    In Figure 1, when PE2 is inserted or booted up, PE1 will transfer the
162    DF role of some VLANs to PE2 to achieve load balancing. However,
163    because there is no handshake mechanism between PE1 and PE2,
164    duplication of DF roles for a given VLAN is possible. Duplication of
165    DF roles may eventually lead to L2 loops as well as duplication of
166    traffic.

168    The current state of the art in EVPN relies on a blackholing timer
169    for transferring the DF role to the newly inserted device. This can
170    cause the following issues:

172    *  Loops/Duplicates if the timer value is too short

174    *  Prolonged Traffic Blackholing if the timer value is too long

176 3.  DF Election Synchronization Solution

178    The solution relies on the concept of common clock alignment between
179    partner PEs participating in a common Ethernet-Segment. The main
180    idea is to have them all apply their carving state, resulting from
181    DF election, at a well-known time.

183    The DF Election procedure, as described in [RFC7432] and as
184    optionally signalled in [RFC8584], is applied. All PEs attached to a
185    given Ethernet-Segment are clock-synchronized using a networking
186    protocol for clock synchronization (e.g., NTP, PTP). When a PE is
187    newly inserted or recovers from a failure, it communicates to its
188    peering partners the current time plus the time remaining on its
189    peering timer. This constitutes an "end" or "absolute" time as seen
190    from the local PE. That absolute time is called the "Service Carving
191    Time" (SCT).

193    A new BGP Extended Community is advertised along with the Ethernet-
194    Segment route (RT-4) to communicate to other partners the Service
195    Carving Time.

197    Upon reception of that new BGP Extended Community, partner PEs know
198    exactly when the advertising PE will carve. The notion of skew is
199    introduced to eliminate any potential duplicate traffic or loops. The
200    partner PEs add a skew (default = -10ms) to the Service Carving Time
201    to enforce this. The previously inserted PE(s) must carve first,
202    followed shortly (skew) by the newly inserted PE.

204    To summarize, all peering PEs carve almost simultaneously at the time
205    announced by the newly added/recovered PE. The newly inserted PE
206    initiates the SCT, and carves immediately on peering timer expiry.
207    The previously inserted PE(s) receiving the Ethernet-Segment route
208    (RT-4) with an SCT BGP extended community carve shortly before the
209    Service Carving Time.

211 3.1.  Advantages

213    There are multiple advantages of using this approach.
       Here is a
214    non-exhaustive list:

216    -  Simple uni-directional signaling is all that is needed

218    -  Backwards-compatible: PEs supporting only the older [RFC7432]
219       procedures shall simply discard the unrecognized new "Service
220       Carving Timestamp" BGP Extended Community

222    -  Multiple DF Election algorithms can be supported:

224       *  [RFC7432] default ordered list ordinal algorithm (Modulo),

226       *  [RFC8584] Highest Random Weight (HRW), etc.

228    -  Independent of the BGP transmission delay of the Ethernet-Segment
229       route (RT-4)

231    -  Agnostic of the time synchronization mechanism used (e.g., NTP,
232       PTP)

234 3.2.  BGP Encoding

236    A new BGP extended community needs to be defined to communicate the
237    Service Carving Timestamp for each Ethernet Segment.

239    A new transitive extended community, where the Type field is 0x06 and
240    the Sub-Type is [TBD3], is advertised along with the Ethernet Segment
241    route. The timestamp for the expected service carving is encoded as
242    an 8-octet value as follows:

244                         1                   2                   3
245     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
246    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
247    |   Type=0x06   | Sub-Type(TBD3)|       Timestamp Seconds       ~
248    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
249    ~       Timestamp Seconds       | Timestamp Fractional Seconds  |
250    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

252    This document introduces a new flag called "T" (for Time
253    Synchronization) to the bitmap field of the DF Election Extended
254    Community defined in [RFC8584].

256                         1                   2                   3
257     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
258    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
259    | Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~
260    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
261    ~     Bitmap    |            Reserved = 0                       |
262    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

264    T: This flag is located in bit position 27 as shown above.
       When set
265    to 1, it indicates the desire to use the Time Synchronization
266    capability with the rest of the PEs in the ES. This capability is
267    used in conjunction with the agreed upon DF Type (DF Election Type).
268    For example, if all the PEs in the ES indicate that they have the
269    Time Synchronization capability and they want the DF type to be HRW,
270    then the HRW algorithm is used in conjunction with this capability.

272 3.3.  Note on NTP-based synchronization

274    The 64-bit timestamp used by the NTP protocol consists of a 32-bit
275    part for seconds and a 32-bit part for fractional seconds. The
276    timestamp exchanged uses the NTP epoch of January 1, 1900 [RFC5905].
277    The use of 32-bit seconds and 16-bit fractional seconds yields an
278    adequate precision of 15 microseconds (2^-16 s).

280 3.4.  Synchronization Scenarios

282    Let's take Figure 1 as an example where initially PE2 had failed and
283    PE1 had taken over. This example shows the problem with the known
284    mechanism.

286    Based on [RFC7432]:

288    -  Initial state: PE1 is in steady-state, PE2 is recovering

290    -  PE2 recovers at (absolute) time t=99

292    -  PE2 advertises RT-4 (sent at t=100) to partner PE1

294    -  PE2 starts its 3-second peering timer as per [RFC7432]

296    -  PE1 carves immediately on RT-4 reception, i.e. t=100 + minimal BGP
297       propagation delay

299    -  PE2 carves at time t=103

301    [RFC7432] aims at favouring a traffic black hole over duplicate
302    traffic. With the above procedure, a traffic black hole will occur as
303    part of each PE recovery sequence. The peering timer value (default
304    = 3 seconds) has a direct effect on the duration of the blackholing.
305    A short (esp. zero) peering timer may, however, result in duplicate
306    traffic or traffic loops.
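
As a rough illustration of the [RFC7432] timeline above (not part of the specification), the following Python sketch computes when each PE carves and how long the resulting window is. The function name and the 0.05-second BGP propagation delay are assumptions made for the example; only t=100 and the 3-second peering timer come from the text.

```python
def rfc7432_recovery_timeline(rt4_sent_at, peering_timer, bgp_delay):
    """Return (pe1_carve, pe2_carve, blackhole, overlap) in seconds.

    PE1 (already up) carves as soon as it receives the RT-4 route;
    PE2 (recovering) carves when its peering timer expires.
    """
    pe1_carve = rt4_sent_at + bgp_delay
    pe2_carve = rt4_sent_at + peering_timer
    gap = pe2_carve - pe1_carve
    blackhole = max(gap, 0.0)   # no DF for the VLANs moving to PE2
    overlap = max(-gap, 0.0)    # two DFs at once: duplicates or loops
    return pe1_carve, pe2_carve, blackhole, overlap


# Values from the example: RT-4 sent at t=100, 3-second peering timer.
# The 0.05 s BGP propagation delay is an assumed figure.
default_timer = rfc7432_recovery_timeline(100.0, 3.0, 0.05)
zero_timer = rfc7432_recovery_timeline(100.0, 0.0, 0.05)
print(default_timer)
print(zero_timer)
```

With the default timer the model reports a blackhole of roughly the timer value minus the BGP delay; with a zero timer the blackhole disappears but a short window with two DFs appears instead, which is the trade-off described above.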
308    Based on the Service Carving Time (SCT) approach:

310    -  Initial state: PE1 is in steady-state, PE2 is recovering

312    -  PE2 recovers at (absolute) time t=99

314    -  PE2 advertises RT-4 (sent at t=100) with target SCT value t=103 to
315       partner PE1

317    -  PE2 starts its 3-second peering timer as per [RFC7432]

319    -  Both PE1 and PE2 carve at (absolute) time t=103

321    In fact, PE1 should carve slightly before PE2 (skew). The recovering
322    PE2 performs both transitions, DF to NDF and NDF to DF, per VLAN
323    at the peering timer expiry. Since the goal is to prevent
324    duplicates, the original PE1, which received the SCT, will apply:

327    -  DF to NDF transition at t=SCT minus skew, where both PEs are NDF
328       for 'skew' amount of time

330    -  NDF to DF transition at t=SCT

332    It is this split behaviour which ensures a good transition of the DF
333    role with a contained amount of loss.

335    Using the SCT approach, the negative effect of the peering timer is
336    mitigated. Furthermore, the BGP Ethernet-Segment route (RT-4)
337    transmission delay (from PE2 to PE1) becomes a non-issue. The SCT
338    approach thus remedies the problem exposed by the use of the peering
339    timer. The 3-second timer window is shortened to a few
340    milliseconds.

342 3.5.  Backwards Compatibility

344    Per redundancy group, for the DF election procedures to be globally
345    convergent and unanimous, it is necessary that all the participating
346    PEs agree on the DF Election algorithm to be used. It is, however,
347    possible that some PEs continue to use the existing modulus-based DF
348    election and do not rely on the new SCT BGP extended community. PEs
349    running a baseline DF election mechanism shall simply discard the
350    unrecognized new SCT BGP extended community.
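
To make the Section 3.2 encoding, the Section 3.4 skew behaviour and the discard rule above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the Sub-Type value 0xFF is a hypothetical placeholder (the real value is TBD3 and unassigned), all function names are invented for the example, and the skew is modelled as a positive 10 ms magnitude subtracted from the SCT.

```python
import struct

EVPN_EC_TYPE = 0x06    # EVPN extended community type (Section 3.2)
SCT_SUB_TYPE = 0xFF    # hypothetical placeholder; the real value is TBD3
DEFAULT_SKEW = 0.010   # assumed: 10 ms magnitude of the default skew


def encode_sct(sct_seconds):
    """Pack an SCT (seconds since the NTP epoch) into the 8-octet community."""
    whole = int(sct_seconds)
    frac = int((sct_seconds - whole) * 65536) & 0xFFFF   # 16-bit fraction
    return struct.pack("!BBIH", EVPN_EC_TYPE, SCT_SUB_TYPE,
                       whole & 0xFFFFFFFF, frac)


def decode_sct(community, known_sub_types=(SCT_SUB_TYPE,)):
    """Return the SCT in seconds, or None if the community is not recognized."""
    ec_type, sub_type, whole, frac = struct.unpack("!BBIH", community)
    if ec_type != EVPN_EC_TYPE or sub_type not in known_sub_types:
        return None   # a baseline PE simply ignores the community
    return whole + frac / 65536


def peer_schedule(sct, skew=DEFAULT_SKEW):
    """Times at which a previously inserted PE applies its transitions."""
    return {"df_to_ndf": sct - skew, "ndf_to_df": sct}


# PE2 sends RT-4 at t=100 with 3.25 s left on its peering timer (toy values).
community = encode_sct(100 + 3.25)
print(community.hex())
print(peer_schedule(decode_sct(community)))
```

A PE that does not implement the extension can be modelled by calling `decode_sct(community, known_sub_types=())`, which returns `None`, so that PE keeps using its peering timer.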
352    A PE can indicate its willingness to support clock-synchronized
353    carving by signaling the new 'T' DF Election Capability as well as
354    including the new Service Carving Time BGP extended community along
355    with the Ethernet-Segment Route (Type-4). In the case where one or
356    more PEs attached to the Ethernet-Segment do not signal T=1, all PEs
357    in the Ethernet-Segment may revert to the [RFC7432] timer approach.

359 4.  Security Considerations

361    The mechanisms in this document use the EVPN control plane as defined
362    in [RFC7432]. Security considerations described in [RFC7432] are
363    equally applicable. This document uses MPLS and IP-based tunnel
364    technologies to support data plane transport. Security
365    considerations described in [RFC7432] and in [RFC8365] are equally
366    applicable.

368 5.  IANA Considerations

370    This document solicits the allocation of the following sub-type in
371    the "EVPN Extended Community Sub-Types" registry set up by [RFC7153]:

373       TBD3     Service Carving Timestamp     This document

375    This document solicits the allocation of the following values in the
376    "DF Election Capabilities" registry set up by [RFC8584]:

378       Bit      Name                          Reference
379       ----     ----------------              -------------
380       3        Time Synchronization          This document

382 6.  Normative References

384    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
385               Requirement Levels", BCP 14, RFC 2119,
386               DOI 10.17487/RFC2119, March 1997.

389    [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
390               "Network Time Protocol Version 4: Protocol and Algorithms
391               Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010.

394    [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
395               Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
396               March 2014.

398    [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
399               Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
400               Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
401               2015.
403    [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
404               2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
405               May 2017.

407    [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
408               Uttaro, J., and W. Henderickx, "A Network Virtualization
409               Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
410               DOI 10.17487/RFC8365, March 2018.

413    [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
414               J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
415               VPN Designated Forwarder Election Extensibility",
416               RFC 8584, DOI 10.17487/RFC8584, April 2019.

419 Appendix A.  Contributors

421    In addition to the authors listed on the front page, the following
422    co-authors have also contributed substantially to this document:

424    Gaurav Badoni  Cisco  Email: gbadoni@cisco.com

426    Dhananjaya Rao  Cisco  Email: dhrao@cisco.com

428 Appendix B.  Acknowledgements

430    Authors would like to acknowledge helpful comments and contributions
431    of Satya Mohanty and Bharath Vasudevan.

433 Authors' Addresses

435    Patrice Brissette (editor)
436    Cisco

438    Email: pbrisset@cisco.com

440    Ali Sajassi
441    Cisco

443    Email: sajassi@cisco.com

445    Luc Andre Burdet
446    Cisco

448    Email: lburdet@cisco.com

450    John Drake
451    Juniper

453    Email: jdrake@juniper.net

455    Jorge Rabadan
456    Nokia

458    Email: jorge.rabadan@nokia.com