idnits 2.17.1 draft-ietf-rtgwg-backoff-algo-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (October 22, 2017) is 2376 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-42) exists of draft-ietf-isis-yang-isis-cfg-18 == Outdated reference: A later version (-29) exists of draft-ietf-ospf-yang-08 == Outdated reference: A later version (-10) exists of draft-ietf-rtgwg-spf-uloop-pb-statement-04 Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group B. Decraene 3 Internet-Draft Orange 4 Intended status: Standards Track S. Litkowski 5 Expires: April 25, 2018 Orange Business Service 6 H. Gredler 7 RtBrick Inc 8 A. Lindem 9 Cisco Systems 10 P. Francois 12 C. Bowers 13 Juniper Networks, Inc. 14 October 22, 2017 16 SPF Back-off algorithm for link state IGPs 17 draft-ietf-rtgwg-backoff-algo-06 19 Abstract 21 This document defines a standard algorithm to back-off link-state IGP 22 SPF computations. 24 Having one standard algorithm improves interoperability by reducing 25 the probability and/or duration of transient forwarding loops during 26 the IGP convergence when the IGP reacts to multiple temporally close 27 IGP events. 29 Requirements Language 31 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 32 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 33 document are to be interpreted as described in [RFC2119]. 35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at https://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on April 25, 2018. 51 Copyright Notice 53 Copyright (c) 2017 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (https://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. Code Components extracted from this document must 62 include Simplified BSD License text as described in Section 4.e of 63 the Trust Legal Provisions and are provided without warranty as 64 described in the Simplified BSD License. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 69 2. High level goals . . . . . . . . . . . . . . . . . . . . . . 3 70 3. Definitions and parameters . . . . . . . . . . . . . . . . . 4 71 4. Principles of SPF delay algorithm . . . . . . . . . . . . . . 5 72 5. Specification of the SPF delay state machine . . . . . . . . 5 73 5.1. States . . . . . . . . . . . . . . . . . . . . . . . . . 5 74 5.2. States Transitions . . . . . . . . . . . . . . . . . . . 6 75 5.3. FSM Events . . . . . . . . . . . . . . . . . . . . . . . 7 76 6. Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 9 77 7. Partial Deployment . . . . . . . . . . . . . . . . . . . . . 9 78 8. Impact on micro-loops . . . . . . . . . . . . . . . . . . . . 10 79 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 80 10. Security considerations . . . . . . . . . . . . . . . . . . . 10 81 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10 82 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 10 83 12.1. Normative References . . . . . . . . . . . . . . . . . . 10 84 12.2. Informative References . . . . . . . . . . . . . . . . . 11 85 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11 87 1. Introduction 89 Link state IGPs, such as IS-IS [ISO10589-Second-Edition] and OSPF 90 [RFC2328], perform distributed route computation on all routers in 91 the area/level. In order to have consistent routing tables across 92 the network, such distributed computation requires that all routers 93 have the same version of the network topology (Link State DataBase 94 (LSDB)) and perform their computation at the same time. 96 In general, when the network is stable, there is a desire to compute 97 a new SPF as soon as a failure is detected in order to quickly route 98 around the failure. However, when the network is experiencing 99 multiple temporally close failures over a short period of time, there 100 is a conflicting desire to limit the frequency of SPF computations. 101 Indeed, this allows a reduction in control plane resources used by 102 IGPs and all protocols/subsystems reacting on the attendant route 103 change, such as LDP, RSVP-TE, BGP, Fast ReRoute computations, FIB 104 updates... This also reduces the churn on routers and in the network 105 and, in particular, reduces the side effects such as micro-loops that 106 ensue during IGP convergence. 108 To allow for this, IGPs implement an SPF back-off algorithm. 109 However, different implementations have choosen different algorithms. 110 Hence, in a multi-vendor network, it's not possible to ensure that 111 all routers trigger their SPF computation after the same delay. This 112 situation increases the average differential delay between routers 113 completing their SPF computation. It also increases the probability 114 that different routers compute their FIBs based on different LSDB 115 versions. Both factors increase the probability and/or duration of 116 micro-loops. 118 To allow multi-vendor networks to have all routers delay their SPF 119 computations for the same duration, this document specifies a 120 standard algorithm. Optionally, implementations may offer 121 alternative algorithms. 123 2. High level goals 125 The high level goals of this algorithm are the following: 127 o Very fast convergence for a single event (e.g., link failure). 129 o Paced fast convergence for multiple temporally close IGP events 130 while IGP stability is considered acceptable. 132 o Delayed convergence when IGP stability is problematic. This will 133 allow the IGP and related processes to conserve resources during 134 the period of instability. 136 o Always try to avoid different SPF_DELAY timers values across 137 different routers in the area/level. Even though not all routers 138 will receive IGP messages at the same time, due to differences 139 both in the distance from the originator of the IGP event and in 140 flooding implementations. 142 3. Definitions and parameters 144 IGP events: The reception or origination of an IGP LSDB change 145 requiring a new routing table computation. Examples are a topology 146 change, a prefix change, a metric change on a link or prefix... Note 147 that locally triggering a routing table computation is not considered 148 as an IGP event since other IGP routers are unaware of this 149 occurrence. 151 Routing table computation: Computation of the routing table, by the 152 IGP, using the IGP LSDB. No distinction is made between the type of 153 computation performed. e.g., full SPF, incremental SPF, Partial Route 154 Computation (PRC). The type of computation is a local consideration. 155 This document may interchangeably use the terms routing table 156 computation and SPF computation. 158 SPF_DELAY: The delay between the first IGP event triggering a new 159 routing table computation and the start of that routing table 160 computation. It can take the following values: 162 INITIAL_SPF_DELAY: A very small delay to quickly handle link 163 failure, e.g., 0 milliseconds. 165 SHORT_SPF_DELAY: A small delay to have a fast convergence in case of 166 a single component failure (node, SRLG..), e.g., 50-100 167 milliseconds. 169 LONG_SPF_DELAY: A long delay when the IGP is unstable, e.g., 2 170 seconds. Note that this allows the IGP network to stabilize. 172 TIME_TO_LEARN_INTERVAL: This is the maximum duration typically needed 173 to learn all the IGP events related to a single component failure 174 (e.g., router failure, SRLG failure), e.g., 1 second. It's mostly 175 dependent on failure detection time variation between all routers 176 that are adjacent to the failure. Additionally, it may depend on the 177 different IGP implementations across the network, related to 178 origination and flooding of their link state advertisements. 180 HOLDDOWN_INTERVAL: The time required with no received IGP events 181 before considering the IGP to be stable again and allowing the 182 SPF_DELAY to be restored to INITIAL_SPF_DELAY. e.g., 3 seconds. 184 SPF_TIMER: The Finite State Machine (FSM) abstract timer that uses 185 the computed SPF delay. Upon expiration, the Route Table Computation 186 (as defined above) is performed. 188 4. Principles of SPF delay algorithm 190 For this first IGP event, we assume that there has been a single 191 simple change in the network which can be taken into account using a 192 single routing computation (e.g., link failure, prefix (metric) 193 change) and we optimize for very fast convergence, delaying the 194 routing computation by INITIAL_SPF_DELAY. Under this assumption, 195 there is no benefit in delaying the routing computation. In a 196 typical network, this is the most common type of IGP event. Hence, 197 it makes sense to optimize this case. 199 If subsequent IGP events are received in a short period of time 200 (TIME_TO_LEARN_INTERVAL), we then assume that a single component 201 failed, but that this failure requires the knowledge of multiple IGP 202 events in order for IGP routing to converge. Under this assumption, 203 we want fast convergence since this is a normal network situation. 204 However, there is a benefit in waiting for all IGP events related to 205 this single component failure so that the IGP can compute the post- 206 failure routing table in a single route computation. In this 207 situation, we delay the routing computation by SHORT_SPF_DELAY. 209 If IGP events are still received after TIME_TO_LEARN_INTERVAL from 210 the initial IGP event received in QUIET state, then the network is 211 presumably experiencing multiple independent failures. In this case, 212 while waiting for network stability, the computations are delayed for 213 a longer time represented by LONG_SPF_DELAY. This SPF delay is kept 214 until no IGP events are received for HOLDDOWN_INTERVAL. 216 Note that previous SPF delay algorithms used to count the number of 217 SPF computations. However, as all routers may receive the IGP events 218 at different times, we cannot assume that all routers will perform 219 the same number of SPF computations or that they will schedule them 220 at the same time. For example, assuming that the SPF delay is 50 ms, 221 router R1 may receive 3 IGP events (E1, E2, E3) in those 50 ms and 222 hence will perform a single routing computation. While another 223 router R2 may only receive 2 events (E1, E2) in those 50 ms and hence 224 will schedule another routing computation when receiving E3. That's 225 why this document uses a time (TIME_TO_LEARN) from the initial event 226 detection/reception as opposed to counting the number of SPF 227 computations to determine when the IGP is unstable. 229 5. Specification of the SPF delay state machine 231 5.1. States 233 This section describes the state machine. The naming and semantics 234 of each state corresponds directly to the SPF delay used for IGP 235 events received in that state. Three states are defined: 237 QUIET: This is the initial state, when no IGP events have occured for 238 at least HOLDDOWN_INTERVAL since the previous routing table 239 computation. The state is meant to handle link failures very 240 quickly. 242 SHORT_WAIT: State entered when an IGP event has been received in 243 QUIET state. This state is meant to handle single component failure 244 requiring multiple IGP events (e.g., node, SRLG). 246 LONG_WAIT: State reached after TIME_TO_LEARN_INTERVAL. In other 247 words, state reached after TIME_TO_LEARN_INTERVAL in state 248 SHORT_WAIT. This state is meant to handle multiple independent 249 component failures during periods of IGP instability. 251 5.2. States Transitions 253 The FSM is initialized to the QUIET_STATE with all three timers 254 deactivated. The following diagram describes briefly the state 255 transitions. 257 +-------------------+ 258 | |<-------------------+ 259 | QUIET | | 260 | |<---------+ | 261 +-------------------+ | | 262 | | | 263 | | | 264 | 1: IGP event | | 265 | | | 266 v | | 267 +-------------------+ | | 268 +---->| | | | 269 | | SHORT_WAIT |----->----+ | 270 +-----| | | 271 2: +-------------------+ 6: HOLDDOWN_TIMER | 272 IGP event | expiration | 273 | | 274 | | 275 | 3: LEARN_TIMER | 276 | expiration | 277 | | 278 v | 279 +-------------------+ | 280 +---->| | | 281 | | LONG_WAIT |------------>-------+ 282 +-----| | 283 4: +-------------------+ 5: HOLDDOWN_TIMER 284 IGP event expiration 286 Figure 1: State Machine 288 5.3. FSM Events 290 This section describes the events and the actions performed in 291 response. 293 Event 1: IGP event, while in QUIET_STATE. 295 Actions on event 1: 297 o If SPF_TIMER is not already running, start it with value 298 INITIAL_SPF_DELAY. 300 o Start LEARN_TIMER with TIME_TO_LEARN_INTERVAL. 302 o Start HOLDDOWN_TIMER with HOLDDOWN_INTERVAL. 304 o Transition to SHORT_WAIT state. 306 Event 2: IGP event, while in SHORT_WAIT. 308 Actions on event 2: 310 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 312 o If SPF_TIMER is not already running, start it with value 313 SHORT_SPF_DELAY. 315 o Remain in current state. 317 Event 3: LEARN_TIMER expiration. 319 Actions on event 3: 321 o Transition to LONG_WAIT state. 323 Event 4: IGP event, while in LONG_WAIT. 325 Actions on event 4: 327 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 329 o If SPF_TIMER is not already running, start it with value 330 LONG_SPF_DELAY. 332 o Remain in current state. 334 Event 5: HOLDDOWN_TIMER expiration, while in LONG_WAIT. 336 Actions on event 5: 338 o Transition to QUIET state. 340 Event 6: HOLDDOWN_TIMER expiration, while in SHORT_WAIT. 342 Actions on event 6: 344 o Deactivate LEARN_TIMER. 346 o Transition to QUIET state. 348 6. Parameters 350 All the parameters MUST be configurable [I-D.ietf-isis-yang-isis-cfg] 351 [I-D.ietf-ospf-yang] at the protocol instance granularity. They MAY 352 be configurable at the area/level granularity. All the delays 353 (INITIAL_SPF_DELAY, SHORT_SPF_DELAY, LONG_SPF_DELAY, 354 TIME_TO_LEARN_INTERVAL, HOLDDOWN_INTERVAL) SHOULD be configurable at 355 the millisecond granularity. They MUST be configurable at least at 356 the tenth of second granularity. The configurable range for all the 357 parameters SHOULD at least be from 0 milliseconds to 60 seconds. 359 This document does not propose default values for the parameters 360 because these values are expected to be context dependent. 361 Implementations are free to propose their own default values. 363 In order to satisfy the goals stated in Section 2, operators are 364 RECOMMENDED to configure delay intervals such that SPF_INITIAL_DELAY 365 <= SPF_SHORT_DELAY and SPF_SHORT_DELAY <= SPF_LONG_DELAY. 367 When setting (default) values, one SHOULD consider the customers and 368 their application requirements, the computational power of the 369 routers, the size of the network, and, in particular, the number of 370 IP prefixes advertised in the IGP, the frequency and number of IGP 371 events, the number of protocols reactions/computations triggered by 372 IGP SPF (e.g., BGP, PCEP, Traffic Engineering CSPF, Fast ReRoute 373 computations). 375 Note that some or all of these factors may change over the life of 376 the network. In case of doubt, it's RECOMMENDED to play it safe and 377 start with safe, i.e., longer timers. 379 For the standard algorithm to be effective in mitigating micro-loops, 380 it is RECOMMENDED that all routers in the IGP domain, or at least all 381 the routers in the same area/level, have exactly the same configured 382 values. 384 7. Partial Deployment 386 In general, the SPF delay algorithm is only effective in mitigating 387 micro-loops if it is deployed on all routers in the IGP domain or, at 388 least, all routers in an IGP area/level. The impact of partial 389 deployment is based on the particular event, topology, and the SPF 390 algorithm(s) used on other routers in the IGP area/level. In cases 391 where the previous SPF algorithm was implemented uniformly, partial 392 deployment will increase the frequency and duration of micro-loops. 393 Hence, it is RECOMMENDED that all routers in the IGP domain or at 394 least within the same area/level be migrated to the SPF algorithm 395 described herein at roughly the same time. 397 Note that this is not a new consideration as over times, network 398 operators have changed SPF delay parameters in order to accommodate 399 new customer requirements for fast convergence, as permitted by new 400 software and hardware. They may also have progressively replaced an 401 implementation with a given SPF delay algorithm by another 402 implementation with a different one. 404 8. Impact on micro-loops 406 Micro-loops during IGP convergence are due to a non-synchronized or 407 non-ordered update of the forwarding information tables (FIB) 408 [RFC5715] [RFC6976] [I-D.ietf-rtgwg-spf-uloop-pb-statement]. FIBs 409 are installed after multiple steps such as SPF wait time, SPF 410 computation, FIB distribution, and FIB update. This document only 411 addresses the first contribution. This standardized procedure 412 reduces the probability and/or duration of micro-loops when IGPs 413 experience multiple temporally close events. It does not prevent all 414 micro-loops. However, it is beneficial and is less complex and 415 costly to implement when compared to full solutions such as [RFC5715] 416 or [RFC6976]. 418 9. IANA Considerations 420 No IANA actions required. 422 10. Security considerations 424 The algorithm presented in this document does not compromise IGP 425 security. An attacker having the ability to generate IGP events 426 would be able to delay the IGP convergence time. The LONG_SPF_DELAY 427 state may help mitigate the effects of Denial-of-Service (DOS) 428 attacks generating many IGP events. 430 11. Acknowledgements 432 We would like to acknowledge Les Ginsberg, Uma Chunduri, Mike Shand 433 and Alexander Vainshtein for the discussions and comments related to 434 this document. 436 12. References 438 12.1. Normative References 440 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 441 Requirement Levels", BCP 14, RFC 2119, 442 DOI 10.17487/RFC2119, March 1997, 443 . 445 12.2. Informative References 447 [I-D.ietf-isis-yang-isis-cfg] 448 Litkowski, S., Yeung, D., Lindem, A., Zhang, Z., and L. 449 Lhotka, "YANG Data Model for IS-IS protocol", draft-ietf- 450 isis-yang-isis-cfg-18 (work in progress), July 2017. 452 [I-D.ietf-ospf-yang] 453 Yeung, D., Qu, Y., Zhang, Z., Chen, I., and A. Lindem, 454 "Yang Data Model for OSPF Protocol", draft-ietf-ospf- 455 yang-08 (work in progress), July 2017. 457 [I-D.ietf-rtgwg-spf-uloop-pb-statement] 458 Litkowski, S., Decraene, B., and M. Horneffer, "Link State 459 protocols SPF trigger and delay algorithm impact on IGP 460 micro-loops", draft-ietf-rtgwg-spf-uloop-pb-statement-04 461 (work in progress), May 2017. 463 [ISO10589-Second-Edition] 464 International Organization for Standardization, 465 "Intermediate system to Intermediate system intra-domain 466 routeing information exchange protocol for use in 467 conjunction with the protocol for providing the 468 connectionless-mode Network Service (ISO 8473)", ISO/ 469 IEC 10589:2002, Second Edition, Nov 2002. 471 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 472 DOI 10.17487/RFC2328, April 1998, 473 . 475 [RFC5715] Shand, M. and S. Bryant, "A Framework for Loop-Free 476 Convergence", RFC 5715, DOI 10.17487/RFC5715, January 477 2010, . 479 [RFC6976] Shand, M., Bryant, S., Previdi, S., Filsfils, C., 480 Francois, P., and O. Bonaventure, "Framework for Loop-Free 481 Convergence Using the Ordered Forwarding Information Base 482 (oFIB) Approach", RFC 6976, DOI 10.17487/RFC6976, July 483 2013, . 485 Authors' Addresses 487 Bruno Decraene 488 Orange 490 Email: bruno.decraene@orange.com 491 Stephane Litkowski 492 Orange Business Service 494 Email: stephane.litkowski@orange.com 496 Hannes Gredler 497 RtBrick Inc 499 Email: hannes@rtbrick.com 501 Acee Lindem 502 Cisco Systems 503 301 Midenhall Way 504 Cary, NC 27513 505 USA 507 Email: acee@cisco.com 509 Pierre Francois 511 Email: pfrpfr@gmail.com 513 Chris Bowers 514 Juniper Networks, Inc. 515 1194 N. Mathilda Ave. 516 Sunnyvale, CA 94089 517 US 519 Email: cbowers@juniper.net