idnits 2.17.1 draft-ietf-rtgwg-backoff-algo-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (January 9, 2017) is 2664 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-10) exists of draft-ietf-rtgwg-spf-uloop-pb-statement-02 Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group B. Decraene 3 Internet-Draft Orange 4 Intended status: Standards Track S. Litkowski 5 Expires: July 13, 2017 Orange Business Service 6 H. Gredler 7 RtBrick Inc 8 A. Lindem 9 Cisco Systems 10 P. Francois 12 C. Bowers 13 Juniper Networks, Inc. 14 January 9, 2017 16 SPF Back-off algorithm for link state IGPs 17 draft-ietf-rtgwg-backoff-algo-04 19 Abstract 21 This document defines a standard algorithm to back-off link-state IGP 22 SPF computations. 24 Having one standard algorithm improves interoperability by reducing 25 the probability and/or duration of transient forwarding loops during 26 the IGP convergence when the IGP reacts to multiple proximate IGP 27 events. 29 Requirements Language 31 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 32 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 33 document are to be interpreted as described in [RFC2119]. 35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at http://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on July 13, 2017. 51 Copyright Notice 53 Copyright (c) 2017 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (http://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. Code Components extracted from this document must 62 include Simplified BSD License text as described in Section 4.e of 63 the Trust Legal Provisions and are provided without warranty as 64 described in the Simplified BSD License. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 69 2. High level goals . . . . . . . . . . . . . . . . . . . . . . 3 70 3. Definitions and parameters . . . . . . . . . . . . . . . . . 4 71 4. Principles of SPF delay algorithm . . . . . . . . . . . . . . 4 72 5. Specification of the SPF delay state machine . . . . . . . . 5 73 5.1. States . . . . . . . . . . . . . . . . . . . . . . . . . 5 74 5.2. States Transitions . . . . . . . . . . . . . . . . . . . 6 75 5.3. FSM Events . . . . . . . . . . . . . . . . . . . . . . . 7 76 6. Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 8 77 7. Impact on micro-loops . . . . . . . . . . . . . . . . . . . . 9 78 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 79 9. Security considerations . . . . . . . . . . . . . . . . . . . 9 80 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 9 81 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 9 82 11.1. Normative References . . . . . . . . . . . . . . . . . . 9 83 11.2. Informative References . . . . . . . . . . . . . . . . . 9 84 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10 86 1. Introduction 88 Link state IGPs, such as IS-IS [ISO10589-Second-Edition] and OSPF 89 [RFC2328], perform distributed route computation on all routers in 90 the area/level. In order to have consistent routing tables across 91 the network, such distributed computation requires that all routers 92 have the same version of the network topology (Link State DataBase 93 (LSDB)) and perform their computation at the same time. 95 In general, when the network is stable, there is a desire to compute 96 a new SPF as soon as a failure is detected in order to quickly route 97 around the failure. However, when the network is experiencing 98 multiple proximate failures over a short period of time, there is a 99 conflicting desire to limit the frequency of SPF computations. 100 Indeed, this allows a reduction in control plane resources used by 101 IGPs and all protocols/subsystems reacting on the attendant route 102 change, such as LDP, RSVP-TE, BGP, Fast ReRoute computations, FIB 103 updates... This also reduces the churn on routers and in the network 104 and, in particular, reduces the side effects such as micro-loops that 105 ensue during IGP convergence. 107 To allow for this, IGPs implement an SPF back-off algorithm. 108 However, different implementations have choosen different algorithms. 109 Hence, in a multi-vendor network, it's not possible to ensure that 110 all routers trigger their SPF computation after the same delay. This 111 situation increases the average differential delay between routers 112 completing their SPF computation. It also increases the probability 113 that different routers compute their FIBs based on different LSDB 114 versions. Both factors increase the probability and/or duration of 115 micro-loops. 117 To allow multi-vendor networks to have all routers delay their SPF 118 computations for the same duration, this document specifies a 119 standard algorithm. Optionally, implementations may offer 120 alternative algorithms. 122 2. High level goals 124 The high level goals of this algorithm are the following: 126 o Very fast convergence for a single event (e.g., link failure). 128 o Paced fast convergence for multiple proximate IGP events while IGP 129 stability is considered acceptable. 131 o Delayed convergence when IGP stability is problematic. This will 132 allow the IGP and related processes to conserve resources during 133 the period of instability. 135 o Always try to avoid different SPF_DELAY timers values across 136 different routers in the area/level. Even though not all routers 137 will receive IGP messages at the same time, due to differences 138 both in the distance from the originator of the IGP event and in 139 flooding implementations. 141 3. Definitions and parameters 143 IGP events: The reception or origination of an IGP LSDB change 144 requiring a new routing table computation. Examples are a topology 145 change, a prefix change, a metric change on a link or prefix... Note 146 that locally triggering a routing table computation is not considered 147 as an IGP event since other IGP routers are unaware of this 148 occurrence. 150 Routing table computation: Computation of the routing table, by the 151 IGP, using the IGP LSDB. No distinction is made between the type of 152 computation performed. e.g., full SPF, incremental SPF, Partial Route 153 Computation (PRC). The type of computation is a local consideration. 154 This document may interchangeably use the terms routing table 155 computation and SPF computation. 157 SPF_DELAY: The delay between the first IGP event triggering a new 158 routing table computation and the start of that routing table 159 computation. It can take the following values: 161 INITIAL_SPF_DELAY: A very small delay to quickly handle link 162 failure, e.g., 0 milliseconds. 164 SHORT_SPF_DELAY: A small delay to have a fast convergence in case of 165 a single component failure (node, SRLG..), e.g., 50-100 166 milliseconds. 168 LONG_SPF_DELAY: A long delay when the IGP is unstable, e.g., 2 169 seconds. Note that this allows the IGP network to stabilize. 171 TIME_TO_LEARN_INTERVAL: This is the maximum duration typically needed 172 to learn all the IGP events related to a single component failure 173 (e.g., router failure, SRLG failure), e.g., 1 second. It's mostly 174 dependent on failure detection time variation between all routers 175 that are adjacent to the failure. Additionally, it may depend on the 176 different IGP implementations across the network, related to 177 origination and flooding of their link state advertisements. 179 HOLD_DOWN_INTERVAL: The time required with no received IGP events 180 before considering the IGP to be stable again and allowing the 181 SPF_DELAY to be restored to INITIAL_WAIT. e.g., 3 seconds. 183 4. Principles of SPF delay algorithm 185 For this first IGP event, we assume that there has been a single 186 simple change in the network which can be taken into account using a 187 single routing computation (e.g., link failure, prefix (metric) 188 change) and we optimize for very fast convergence, delaying the 189 routing computation by INITIAL_SPF_DELAY. Under this assumption, 190 there is no benefit in delaying the routing computation. In a 191 typical network, this is the most common type of IGP event. Hence, 192 it makes sense to optimize this case. 194 If subsequent IGP events are received in a short period of time 195 (TIME_TO_LEARN_INTERVAL), we then assume that a single component 196 failed, but that this failure requires the knowledge of multiple IGP 197 events in order for IGP routing to converge. Under this assumption, 198 we want fast convergence since this is a normal network situation. 199 However, there is a benefit in waiting for all IGP events related to 200 this single component failure so that the IGP can compute the post- 201 failure routing table in a single route computation. In this 202 situation, we delay the routing computation by SHORT_SPF_DELAY. 204 If IGP events are still received after TIME_TO_LEARN_INTERVAL from 205 the initial IGP event received in QUIET state, then the network is 206 presumably experiencing multiple independent failures. In this case, 207 while waiting for network stability, the computations are delayed for 208 a longer time represented by LONG_SPF_DELAY. This SPF delay is kept 209 until no IGP events are received for HOLDDOWN_INTERVAL. 211 Note that previous SPF delay algorithms used to count the number of 212 SPF computations. However, as all routers may receive the IGP events 213 at different times, we cannot assume that all routers will perform 214 the same number of SPF computations or that they will schedule them 215 at the same time. For example, assuming that the SPF delay is 50 ms, 216 router R1 may receive 3 IGP events (E1, E2, E3) in those 50 ms and 217 hence will perform a single routing computation. While another 218 router R2 may only receive 2 events (E1, E2) in those 50 ms and hence 219 will schedule another routing computation when receiving E3. That's 220 why this document uses a time (TIME_TO_LEARN) from the initial event 221 detection/reception as opposed to counting the number of SPF 222 computations to determine when the IGP is unstable. 224 5. Specification of the SPF delay state machine 226 5.1. States 228 This section describes the state machine. The naming and semantics 229 of each state corresponds directly to the SPF delay used for IGP 230 events received in that state. Three states are defined: 232 QUIET: This is the initial state, when no IGP events have occured for 233 at least HOLDDOWN_INTERVAL since the previous routing table 234 computation. The state is meant to handle link failures very 235 quickly. 237 SHORT_WAIT: State entered when an IGP event has been received in 238 QUIET state. This state is meant to handle single component failure 239 requiring multiple IGP events (e.g., node, SRLG). 241 LONG_WAIT: State reached after TIME_TO_LEARN_INTERVAL. In other 242 words, state reached after TIME_TO_LEARN_INTERVAL in state 243 SHORT_WAIT. This state is meant to handle multiple independent 244 component failures during periods of IGP instability. 246 5.2. States Transitions 248 The FSM is initialized to the QUIET_STATE with all three timers 249 deactivated. The following diagram describes briefly the state 250 transitions. 252 +-------------------+ 253 | |<-------------------+ 254 | QUIET | | 255 | |<---------+ | 256 +-------------------+ | | 257 | | | 258 | | | 259 | 1: IGP event | | 260 | | | 261 v | | 262 +-------------------+ | | 263 +---->| | | | 264 | | SHORT_WAIT |----->----+ | 265 +-----| | | 266 2: +-------------------+ 6: HOLDDOWN_TIMER | 267 IGP event | expiration | 268 | | 269 | | 270 | 3: LEARN_TIMER | 271 | expiration | 272 | | 273 v | 274 +-------------------+ | 275 +---->| | | 276 | | LONG_WAIT |------------>-------+ 277 +-----| | 278 4: +-------------------+ 5: HOLDDOWN_TIMER 279 IGP event expiration 281 Figure 1: State Machine 283 5.3. FSM Events 285 This section describes the events and the actions performed in 286 response. 288 Event 1: IGP event, while in QUIET_STATE. 290 Actions on event 1: 292 o If SPF_TIMER is not already running, start it with value 293 INITIAL_SPF_DELAY. 295 o Start LEARN_TIMER with TIME_TO_LEARN_INTERVAL. 297 o Start HOLDDOWN_TIMER with HOLDDOWN_INTERVAL. 299 o Transition to SHORT_WAIT state. 301 Event 2: IGP event, while in SHORT_WAIT. 303 Actions on event 2: 305 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 307 o If SPF_TIMER is not already running, start it with value 308 SHORT_SPF_DELAY. 310 o Remain in current state. 312 Event 3: LEARN_TIMER expiration. 314 Actions on event 3: 316 o Transition to LONG_WAIT state. 318 Event 4: IGP event, while in LONG_WAIT. 320 Actions on event 4: 322 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 324 o If SPF_TIMER is not already running, start it with value 325 LONG_SPF_DELAY. 327 o Remain in current state. 329 Event 5: HOLDDOWN_TIMER expiration, while in LONG_WAIT. 331 Actions on event 5: 333 o Transition to QUIET state. 335 Event 6: HOLDDOWN_TIMER expiration, while in SHORT_WAIT. 337 Actions on event 6: 339 o Deactivate LEARN_TIMER. 341 o Transition to QUIET state. 343 6. Parameters 345 All the parameters MUST be configurable. All the delays 346 (INITIAL_SPF_DELAY, SHORT_SPF_DELAY, LONG_SPF_DELAY, 347 TIME_TO_LEARN_INTERVAL, HOLD_DOWN_INTERVAL) SHOULD be configurable at 348 the millisecond granularity. They MUST be configurable at least at 349 the tenth of second granularity. The configurable range for all the 350 parameters SHOULD at least be from 0 milliseconds to 60 seconds. 352 This document does not propose default values for the parameters 353 because these values are expected to be context dependent. 354 Implementations are free to propose their own default values. 356 When setting (default) values, one SHOULD consider the customers and 357 their application requirements, the computational power of the 358 routers, the size of the network, and, in particular, the number of 359 IP prefixes advertised in the IGP, the frequency and number of IGP 360 events, the number of protocols reactions/computations triggered by 361 IGP SPF (e.g., BGP, PCEP, Traffic Engineering CSPF, Fast ReRoute 362 computations). 364 Note that some or all of these factors may change over the life of 365 the network. In case of doubt, it's RECOMMENDED to play it safe and 366 start with safe, i.e., longer timers. 368 For the standard algorithm to be effective in mitigating micro-loops, 369 it is RECOMMENDED that all routers in the IGP domain, or at least all 370 the routers in the same area/level, have exactly the same configured 371 values. 373 7. Impact on micro-loops 375 Micro-loops during IGP convergence are due to a non-synchronized or 376 non-ordered update of the forwarding information tables (FIB) 377 [RFC5715] [RFC6976] [I-D.ietf-rtgwg-spf-uloop-pb-statement]. FIBs 378 are installed after multiple steps such as SPF wait time, SPF 379 computation, FIB distribution, and FIB update. This document only 380 addresses the first contribution. This standardized procedure 381 reduces the probability and/or duration of micro-loops when IGPs 382 experience multiple proximate events. It does not prevent all micro- 383 loops. However, it is beneficial and is less complex and costly to 384 implement when compared to full solutions such as [RFC5715] or 385 [RFC6976]. 387 8. IANA Considerations 389 No IANA actions required. 391 9. Security considerations 393 The algorithm presented in this document does not compromise IGP 394 security. An attacker having the ability to generate IGP events 395 would be able to delay the IGP convergence time. The LONG_SPF_DELAY 396 state may help mitigate the effects of Denial-of-Service (DOS) 397 attacks generating many IGP events. 399 10. Acknowledgements 401 We would like to acknowledge Les Ginsberg, Uma Chunduri, and Mike 402 Shand for the discussions and comments related to this document. 404 11. References 406 11.1. Normative References 408 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 409 Requirement Levels", BCP 14, RFC 2119, 410 DOI 10.17487/RFC2119, March 1997, 411 . 413 11.2. Informative References 415 [I-D.ietf-rtgwg-spf-uloop-pb-statement] 416 Litkowski, S., Decraene, B., and M. Horneffer, "Link State 417 protocols SPF trigger and delay algorithm impact on IGP 418 micro-loops", draft-ietf-rtgwg-spf-uloop-pb-statement-02 419 (work in progress), December 2015. 421 [ISO10589-Second-Edition] 422 International Organization for Standardization, 423 "Intermediate system to Intermediate system intra-domain 424 routeing information exchange protocol for use in 425 conjunction with the protocol for providing the 426 connectionless-mode Network Service (ISO 8473)", ISO/ 427 IEC 10589:2002, Second Edition, Nov 2002. 429 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 430 DOI 10.17487/RFC2328, April 1998, 431 . 433 [RFC5715] Shand, M. and S. Bryant, "A Framework for Loop-Free 434 Convergence", RFC 5715, DOI 10.17487/RFC5715, January 435 2010, . 437 [RFC6976] Shand, M., Bryant, S., Previdi, S., Filsfils, C., 438 Francois, P., and O. Bonaventure, "Framework for Loop-Free 439 Convergence Using the Ordered Forwarding Information Base 440 (oFIB) Approach", RFC 6976, DOI 10.17487/RFC6976, July 441 2013, . 443 Authors' Addresses 445 Bruno Decraene 446 Orange 448 Email: bruno.decraene@orange.com 450 Stephane Litkowski 451 Orange Business Service 453 Email: stephane.litkowski@orange.com 455 Hannes Gredler 456 RtBrick Inc 458 Email: hannes@rtbrick.com 459 Acee Lindem 460 Cisco Systems 461 301 Midenhall Way 462 Cary, NC 27513 463 USA 465 Email: acee@cisco.com 467 Pierre Francois 469 Email: pfrpfr@gmail.com 471 Chris Bowers 472 Juniper Networks, Inc. 473 1194 N. Mathilda Ave. 474 Sunnyvale, CA 94089 475 US 477 Email: cbowers@juniper.net