idnits 2.17.1 draft-ietf-rtgwg-backoff-algo-07.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (December 10, 2017) is 2327 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-42) exists of draft-ietf-isis-yang-isis-cfg-19 == Outdated reference: A later version (-29) exists of draft-ietf-ospf-yang-09 == Outdated reference: A later version (-10) exists of draft-ietf-rtgwg-spf-uloop-pb-statement-05 Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group B. Decraene 3 Internet-Draft Orange 4 Intended status: Standards Track S. Litkowski 5 Expires: June 13, 2018 Orange Business Service 6 H. Gredler 7 RtBrick Inc 8 A. Lindem 9 Cisco Systems 10 P. Francois 12 C. Bowers 13 Juniper Networks, Inc. 14 December 10, 2017 16 SPF Back-off algorithm for link state IGPs 17 draft-ietf-rtgwg-backoff-algo-07 19 Abstract 21 This document defines a standard algorithm to back-off link-state IGP 22 Shortest Path First (SPF) computations. 24 Having one standard algorithm improves interoperability by reducing 25 the probability and/or duration of transient forwarding loops during 26 the IGP convergence when the IGP reacts to multiple temporally close 27 IGP events. 29 Requirements Language 31 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 32 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 33 document are to be interpreted as described in [RFC2119]. 35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at https://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on June 13, 2018. 51 Copyright Notice 53 Copyright (c) 2017 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (https://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. Code Components extracted from this document must 62 include Simplified BSD License text as described in Section 4.e of 63 the Trust Legal Provisions and are provided without warranty as 64 described in the Simplified BSD License. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 69 2. High level goals . . . . . . . . . . . . . . . . . . . . . . 3 70 3. Definitions and parameters . . . . . . . . . . . . . . . . . 4 71 4. Principles of SPF delay algorithm . . . . . . . . . . . . . . 5 72 5. Specification of the SPF delay state machine . . . . . . . . 5 73 5.1. States . . . . . . . . . . . . . . . . . . . . . . . . . 6 74 5.2. Timers . . . . . . . . . . . . . . . . . . . . . . . . . 6 75 5.3. States Transitions . . . . . . . . . . . . . . . . . . . 6 76 5.4. FSM Events . . . . . . . . . . . . . . . . . . . . . . . 7 77 6. Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 9 78 7. Partial Deployment . . . . . . . . . . . . . . . . . . . . . 10 79 8. Impact on micro-loops . . . . . . . . . . . . . . . . . . . . 10 80 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11 81 10. Security considerations . . . . . . . . . . . . . . . . . . . 11 82 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11 83 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 84 12.1. Normative References . . . . . . . . . . . . . . . . . . 11 85 12.2. Informative References . . . . . . . . . . . . . . . . . 11 86 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12 88 1. Introduction 90 Link state IGPs, such as IS-IS [ISO10589-Second-Edition] and OSPF 91 [RFC2328], perform distributed route computation on all routers in 92 the area/level. In order to have consistent routing tables across 93 the network, such distributed computation requires that all routers 94 have the same version of the network topology (Link State DataBase 95 (LSDB)) and perform their computation at the same time. 97 In general, when the network is stable, there is a desire to compute 98 a new Shortest Path First (SPF) as soon as a failure is detected in 99 order to quickly route around the failure. However, when the network 100 is experiencing multiple temporally close failures over a short 101 period of time, there is a conflicting desire to limit the frequency 102 of SPF computations. Indeed, this allows a reduction in control 103 plane resources used by IGPs and all protocols/subsystems reacting on 104 the attendant route change, such as LDP [RFC5036], RSVP-TE [RFC3209], 105 BGP [RFC4271], Fast ReRoute computations (e.g. Loop Free Alternates 106 (LFA) [RFC5286], FIB updates... This also reduces the churn on 107 routers and in the network and, in particular, reduces the side 108 effects such as micro-loops [RFC5715] that ensue during IGP 109 convergence. 111 To allow for this, IGPs implement an SPF back-off algorithm. 112 However, different implementations have choosen different algorithms. 113 Hence, in a multi-vendor network, it's not possible to ensure that 114 all routers trigger their SPF computation after the same delay. This 115 situation increases the average and maximum differential delay 116 between routers completing their SPF computation. It also increases 117 the probability that different routers compute their FIBs based on 118 different LSDB versions. Both factors increase the probability and/ 119 or duration of micro-loops as discussed in Section 8. 121 To allow multi-vendor networks to have all routers delay their SPF 122 computations for the same duration, this document specifies a 123 standard algorithm. Optionally, implementations may also offer 124 alternative algorithms. 126 2. High level goals 128 The high level goals of this algorithm are the following: 130 o Very fast convergence for a single event (e.g., link failure). 132 o Paced fast convergence for multiple temporally close IGP events 133 while IGP stability is considered acceptable. 135 o Delayed convergence when IGP stability is problematic. This will 136 allow the IGP and related processes to conserve resources during 137 the period of instability. 139 o Always try to avoid different SPF_DELAY timers values across 140 different routers in the area/level. Even though not all routers 141 will receive IGP messages at the same time, due to differences 142 both in the distance from the originator of the IGP event and in 143 flooding implementations. 145 3. Definitions and parameters 147 IGP events: The reception or origination of an IGP LSDB change 148 requiring a new routing table computation. Examples are a topology 149 change, a prefix change, a metric change on a link or prefix... Note 150 that locally triggering a routing table computation is not considered 151 as an IGP event since other IGP routers are unaware of this 152 occurrence. 154 Routing table computation: Computation of the routing table, by the 155 IGP, using the IGP LSDB. No distinction is made between the type of 156 computation performed. e.g., full SPF, incremental SPF, Partial Route 157 Computation (PRC). The type of computation is a local consideration. 158 This document may interchangeably use the terms routing table 159 computation and SPF computation. 161 SPF_DELAY: The delay between the first IGP event triggering a new 162 routing table computation and the start of that routing table 163 computation. It can take the following values: 165 INITIAL_SPF_DELAY: A very small delay to quickly handle link 166 failure, e.g., 0 milliseconds. 168 SHORT_SPF_DELAY: A small delay to have a fast convergence in case of 169 a single failure (node, SRLG..), e.g., 50-100 milliseconds. 171 LONG_SPF_DELAY: A long delay when the IGP is unstable, e.g., 2 172 seconds. Note that this allows the IGP network to stabilize. 174 TIME_TO_LEARN_INTERVAL: This is the maximum duration typically needed 175 to learn all the IGP events related to a single component failure 176 (e.g., router failure, SRLG failure), e.g., 1 second. It's mostly 177 dependent on failure detection time variation between all routers 178 that are adjacent to the failure. Additionally, it may depend on the 179 different IGP implementations/parameters across the network, related 180 to origination and flooding of their link state advertisements. 182 HOLDDOWN_INTERVAL: The time required with no received IGP events 183 before considering the IGP to be stable again and allowing the 184 SPF_DELAY to be restored to INITIAL_SPF_DELAY. e.g., 3 seconds. The 185 HOLDDOWN_INTERVAL MUST be defaulted or configured to be longer than 186 the TIME_TO_LEARN_INTERVAL. 188 4. Principles of SPF delay algorithm 190 For this first IGP event, we assume that there has been a single 191 simple change in the network which can be taken into account using a 192 single routing computation (e.g., link failure, prefix (metric) 193 change) and we optimize for very fast convergence, delaying the 194 routing computation by INITIAL_SPF_DELAY. Under this assumption, 195 there is no benefit in delaying the routing computation. In a 196 typical network, this is the most common type of IGP event. Hence, 197 it makes sense to optimize this case. 199 If subsequent IGP events are received in a short period of time 200 (TIME_TO_LEARN_INTERVAL), we then assume that a single component 201 failed, but that this failure requires the knowledge of multiple IGP 202 events in order for IGP routing to converge. Under this assumption, 203 we want fast convergence since this is a normal network situation. 204 However, there is a benefit in waiting for all IGP events related to 205 this single component failure so that the IGP can compute the post- 206 failure routing table in a single route computation. In this 207 situation, we delay the routing computation by SHORT_SPF_DELAY. 209 If IGP events are still received after TIME_TO_LEARN_INTERVAL from 210 the initial IGP event received in QUIET state, then the network is 211 presumably experiencing multiple independent failures. In this case, 212 while waiting for network stability, the computations are delayed for 213 a longer time represented by LONG_SPF_DELAY. This SPF delay is kept 214 until no IGP events are received for HOLDDOWN_INTERVAL. 216 Note that previously implemented SPF delay algorithms counted the 217 number of SPF computations. However, as all routers may receive the 218 IGP events at different times, we cannot assume that all routers will 219 perform the same number of SPF computations or that they will 220 schedule them at the same time. For example, assuming that the SPF 221 delay is 50 ms, router R1 may receive 3 IGP events (E1, E2, E3) in 222 those 50 ms and hence will perform a single routing computation. 223 While another router R2 may only receive 2 events (E1, E2) in those 224 50 ms and hence will schedule another routing computation when 225 receiving E3. That's why this document uses a time 226 (TIME_TO_LEARN_INTERVAL) from the initial event detection/reception 227 as opposed to counting the number of SPF computations to determine 228 when the IGP is unstable. 230 5. Specification of the SPF delay state machine 231 5.1. States 233 This section describes the state machine. The naming and semantics 234 of each state corresponds directly to the SPF delay used for IGP 235 events received in that state. Three states are defined: 237 QUIET: This is the initial state, when no IGP events have occured for 238 at least HOLDDOWN_INTERVAL since the previous routing table 239 computation. The state is meant to handle link failures very 240 quickly. 242 SHORT_WAIT: State entered when an IGP event has been received in 243 QUIET state. This state is meant to handle single component failure 244 requiring multiple IGP events (e.g., node, SRLG). 246 LONG_WAIT: State reached after TIME_TO_LEARN_INTERVAL. In other 247 words, state reached after TIME_TO_LEARN_INTERVAL in state 248 SHORT_WAIT. This state is meant to handle multiple independent 249 component failures during periods of IGP instability. 251 5.2. Timers 253 SPF_TIMER: The Finite State Machine (FSM) abstract timer that uses 254 the computed SPF delay. Upon expiration, the Route Table Computation 255 (as defined in Section 3) is performed. 257 HOLDDOWN_TIMER: The Finite State Machine (FSM) abstract timer that is 258 (re)started whan an IGP event is received and set to 259 HOLDDOWN_INTERVAL. Upon expiration, the FSM is moved to the QUIET 260 state. 262 LEARN_TIMER: The Finite State Machine (FSM) abstract timer that is 263 started when an IGP event is recevied while the FSM is in the QUIET 264 state. Upon expiration, the FSM is moved to the LONG_WAIT state. 266 5.3. States Transitions 268 The FSM is initialized to the QUIET state with all three timers 269 timers (SPF_TIMER, HOLDDOWN_TIMER, LEARN_TIMER) deactivated. 271 The events which may change the FSM states are an IGP event or the 272 expiration of one timer (SPF_TIMER, HOLDDOWN_TIMER, LEARN_TIMER). 274 The following diagram briefly describes the state transitions. 276 +-------------------+ 277 +---->| |<-------------------+ 278 | | QUIET | | 279 +-----| |<---------+ | 280 7: +-------------------+ | | 281 SPF_TIMER | | | 282 expiration | | | 283 | 1: IGP event | | 284 | | | 285 v | | 286 +-------------------+ | | 287 +---->| | | | 288 | | SHORT_WAIT |----->----+ | 289 +-----| | | 290 2: +-------------------+ 6: HOLDDOWN_TIMER | 291 IGP event | expiration | 292 8: SPF_TIMER | | 293 expiration | | 294 | 3: LEARN_TIMER | 295 | expiration | 296 | | 297 v | 298 +-------------------+ | 299 +---->| | | 300 | | LONG_WAIT |------------>-------+ 301 +-----| | 302 4: +-------------------+ 5: HOLDDOWN_TIMER 303 IGP event expiration 304 9: SPF_TIMER expiration 306 Figure 1: State Machine 308 5.4. FSM Events 310 This section describes the events and the actions performed in 311 response. 313 Transition 1: IGP event, while in QUIET_STATE. 315 Actions on event 1: 317 o If SPF_TIMER is not already running, start it with value 318 INITIAL_SPF_DELAY. 320 o Start LEARN_TIMER with TIME_TO_LEARN_INTERVAL. 322 o Start HOLDDOWN_TIMER with HOLDDOWN_INTERVAL. 324 o Transition to SHORT_WAIT state. 326 Transition 2: IGP event, while in SHORT_WAIT. 328 Actions on event 2: 330 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 332 o If SPF_TIMER is not already running, start it with value 333 SHORT_SPF_DELAY. 335 o Remain in current state. 337 Transition 3: LEARN_TIMER expiration. 339 Actions on event 3: 341 o Transition to LONG_WAIT state. 343 Transition 4: IGP event, while in LONG_WAIT. 345 Actions on event 4: 347 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 349 o If SPF_TIMER is not already running, start it with value 350 LONG_SPF_DELAY. 352 o Remain in current state. 354 Transition 5: HOLDDOWN_TIMER expiration, while in LONG_WAIT. 356 Actions on event 5: 358 o Transition to QUIET state. 360 Transition 6: HOLDDOWN_TIMER expiration, while in SHORT_WAIT. 362 Actions on event 6: 364 o Deactivate LEARN_TIMER. 366 o Transition to QUIET state. 368 Transition 7: SPF_TIMER expiration, while in QUIET. 370 Actions on event 7: 372 o Compute SPF. 374 o Remain in current state. 376 Transition 8: SPF_TIMER expiration, while in SHORT_WAIT. 378 Actions on event 8: 380 o Compute SPF. 382 o Remain in current state. 384 Transition 9: SPF_TIMER expiration, while in LONG_WAIT. 386 Actions on event 9: 388 o Compute SPF. 390 o Remain in current state. 392 6. Parameters 394 All the parameters MUST be configurable [I-D.ietf-isis-yang-isis-cfg] 395 [I-D.ietf-ospf-yang] at the protocol instance granularity. They MAY 396 be configurable at the area/level granularity. All the delays 397 (INITIAL_SPF_DELAY, SHORT_SPF_DELAY, LONG_SPF_DELAY, 398 TIME_TO_LEARN_INTERVAL, HOLDDOWN_INTERVAL) SHOULD be configurable at 399 the millisecond granularity. They MUST be configurable at least at 400 the tenth of second granularity. The configurable range for all the 401 parameters SHOULD at least be from 0 milliseconds to 60 seconds. 403 This document does not propose default values for the parameters 404 because these values are expected to be context dependent. 405 Implementations are free to propose their own default values. 407 In order to satisfy the goals stated in Section 2, operators are 408 RECOMMENDED to configure delay intervals such that SPF_INITIAL_DELAY 409 <= SPF_SHORT_DELAY and SPF_SHORT_DELAY <= SPF_LONG_DELAY. 411 When setting (default) values, one SHOULD consider the customers and 412 their application requirements, the computational power of the 413 routers, the size of the network, and, in particular, the number of 414 IP prefixes advertised in the IGP, the frequency and number of IGP 415 events, the number of protocols reactions/computations triggered by 416 IGP SPF (e.g., BGP, PCEP, Traffic Engineering CSPF, Fast ReRoute 417 computations). 419 Note that some or all of these factors may change over the life of 420 the network. In case of doubt, it's RECOMMENDED to play it safe and 421 start with safe, i.e., longer timers. 423 For the standard algorithm to be effective in mitigating micro-loops, 424 it is RECOMMENDED that all routers in the IGP domain, or at least all 425 the routers in the same area/level, have exactly the same configured 426 values. 428 7. Partial Deployment 430 In general, the SPF delay algorithm is only effective in mitigating 431 micro-loops if it is deployed, with the same parameters, on all 432 routers, in the IGP domain or, at least, all routers in an IGP area/ 433 level. The impact of partial deployment is based on the particular 434 event, topology, and the SPF algorithm(s) used on other routers in 435 the IGP area/level. In cases where the previous SPF algorithm was 436 implemented uniformly, partial deployment will increase the frequency 437 and duration of micro-loops. Hence, it is RECOMMENDED that all 438 routers in the IGP domain or at least within the same area/level be 439 migrated to the SPF algorithm described herein at roughly the same 440 time. 442 Note that this is not a new consideration as over times, network 443 operators have changed SPF delay parameters in order to accommodate 444 new customer requirements for fast convergence, as permitted by new 445 software and hardware. They may also have progressively replaced an 446 implementation with a given SPF delay algorithm by another 447 implementation with a different one. 449 8. Impact on micro-loops 451 Micro-loops during IGP convergence are due to a non-synchronized or 452 non-ordered update of the forwarding information tables (FIB) 453 [RFC5715] [RFC6976] [I-D.ietf-rtgwg-spf-uloop-pb-statement]. FIBs 454 are installed after multiple steps such as flooding of the IGP event 455 across the network, SPF wait time, SPF computation, FIB distribution 456 across line cards, and FIB update. This document only addresses the 457 first contribution. This standardized procedure reduces the 458 probability and/or duration of micro-loops when IGPs experience 459 multiple temporally close events. It does not prevent all micro- 460 loops. However, it is beneficial and is less complex and costly to 461 implement when compared to full solutions such as [RFC5715] or 462 [RFC6976]. 464 9. IANA Considerations 466 No IANA actions required. 468 10. Security considerations 470 The algorithm presented in this document does not compromise IGP 471 security. An attacker having the ability to generate IGP events 472 would be able to delay the IGP convergence time. The LONG_SPF_DELAY 473 state may help mitigate the effects of Denial-of-Service (DOS) 474 attacks generating many IGP events. 476 11. Acknowledgements 478 We would like to acknowledge Les Ginsberg, Uma Chunduri, Mike Shand 479 and Alexander Vainshtein for the discussions and comments related to 480 this document. 482 12. References 484 12.1. Normative References 486 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 487 Requirement Levels", BCP 14, RFC 2119, 488 DOI 10.17487/RFC2119, March 1997, 489 . 491 12.2. Informative References 493 [I-D.ietf-isis-yang-isis-cfg] 494 Litkowski, S., Yeung, D., Lindem, A., Zhang, Z., and L. 495 Lhotka, "YANG Data Model for IS-IS protocol", draft-ietf- 496 isis-yang-isis-cfg-19 (work in progress), November 2017. 498 [I-D.ietf-ospf-yang] 499 Yeung, D., Qu, Y., Zhang, Z., Chen, I., and A. Lindem, 500 "Yang Data Model for OSPF Protocol", draft-ietf-ospf- 501 yang-09 (work in progress), October 2017. 503 [I-D.ietf-rtgwg-spf-uloop-pb-statement] 504 Litkowski, S., Decraene, B., and M. Horneffer, "Link State 505 protocols SPF trigger and delay algorithm impact on IGP 506 micro-loops", draft-ietf-rtgwg-spf-uloop-pb-statement-05 507 (work in progress), December 2017. 509 [ISO10589-Second-Edition] 510 International Organization for Standardization, 511 "Intermediate system to Intermediate system intra-domain 512 routeing information exchange protocol for use in 513 conjunction with the protocol for providing the 514 connectionless-mode Network Service (ISO 8473)", ISO/ 515 IEC 10589:2002, Second Edition, Nov 2002. 517 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 518 DOI 10.17487/RFC2328, April 1998, 519 . 521 [RFC3209] Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., 522 and G. Swallow, "RSVP-TE: Extensions to RSVP for LSP 523 Tunnels", RFC 3209, DOI 10.17487/RFC3209, December 2001, 524 . 526 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 527 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 528 DOI 10.17487/RFC4271, January 2006, 529 . 531 [RFC5036] Andersson, L., Ed., Minei, I., Ed., and B. Thomas, Ed., 532 "LDP Specification", RFC 5036, DOI 10.17487/RFC5036, 533 October 2007, . 535 [RFC5286] Atlas, A., Ed. and A. Zinin, Ed., "Basic Specification for 536 IP Fast Reroute: Loop-Free Alternates", RFC 5286, 537 DOI 10.17487/RFC5286, September 2008, 538 . 540 [RFC5715] Shand, M. and S. Bryant, "A Framework for Loop-Free 541 Convergence", RFC 5715, DOI 10.17487/RFC5715, January 542 2010, . 544 [RFC6976] Shand, M., Bryant, S., Previdi, S., Filsfils, C., 545 Francois, P., and O. Bonaventure, "Framework for Loop-Free 546 Convergence Using the Ordered Forwarding Information Base 547 (oFIB) Approach", RFC 6976, DOI 10.17487/RFC6976, July 548 2013, . 550 Authors' Addresses 552 Bruno Decraene 553 Orange 555 Email: bruno.decraene@orange.com 556 Stephane Litkowski 557 Orange Business Service 559 Email: stephane.litkowski@orange.com 561 Hannes Gredler 562 RtBrick Inc 564 Email: hannes@rtbrick.com 566 Acee Lindem 567 Cisco Systems 568 301 Midenhall Way 569 Cary, NC 27513 570 USA 572 Email: acee@cisco.com 574 Pierre Francois 576 Email: pfrpfr@gmail.com 578 Chris Bowers 579 Juniper Networks, Inc. 580 1194 N. Mathilda Ave. 581 Sunnyvale, CA 94089 582 US 584 Email: cbowers@juniper.net