idnits 2.17.1 draft-ietf-rtgwg-backoff-algo-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (July 4, 2016) is 2851 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-10) exists of draft-ietf-rtgwg-spf-uloop-pb-statement-02 Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group B. Decraene 3 Internet-Draft Orange 4 Intended status: Standards Track S. Litkowski 5 Expires: January 5, 2017 Orange Business Service 6 H. Gredler 7 RtBrick Inc 8 A. Lindem 9 P. Francois 10 Cisco Systems 11 C. Bowers 12 Juniper Networks, Inc. 13 July 4, 2016 15 SPF Back-off algorithm for link state IGPs 16 draft-ietf-rtgwg-backoff-algo-03 18 Abstract 20 This document defines a standard algorithm to back-off link-state IGP 21 SPF computations. 23 Having one standard algorithm improves interoperability by reducing 24 the probability and/or duration of transient forwarding loops during 25 the IGP convergence when the IGP reacts to multiple proximate IGP 26 events. 28 Requirements Language 30 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 31 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 32 document are to be interpreted as described in [RFC2119]. 34 Status of This Memo 36 This Internet-Draft is submitted in full conformance with the 37 provisions of BCP 78 and BCP 79. 39 Internet-Drafts are working documents of the Internet Engineering 40 Task Force (IETF). Note that other groups may also distribute 41 working documents as Internet-Drafts. The list of current Internet- 42 Drafts is at http://datatracker.ietf.org/drafts/current/. 44 Internet-Drafts are draft documents valid for a maximum of six months 45 and may be updated, replaced, or obsoleted by other documents at any 46 time. It is inappropriate to use Internet-Drafts as reference 47 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on January 5, 2017. 50 Copyright Notice 52 Copyright (c) 2016 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (http://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 Table of Contents 67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 68 2. High level goals . . . . . . . . . . . . . . . . . . . . . . 3 69 3. Definitions and parameters . . . . . . . . . . . . . . . . . 4 70 4. Principles of SPF delay algorithm . . . . . . . . . . . . . . 4 71 5. Specification of the SPF delay state machine . . . . . . . . 5 72 5.1. States . . . . . . . . . . . . . . . . . . . . . . . . . 5 73 5.2. States Transitions . . . . . . . . . . . . . . . . . . . 6 74 5.3. FSM Events . . . . . . . . . . . . . . . . . . . . . . . 7 75 6. Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 8 76 7. Impact on micro-loops . . . . . . . . . . . . . . . . . . . . 9 77 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 78 9. Security considerations . . . . . . . . . . . . . . . . . . . 9 79 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 9 80 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 9 81 11.1. Normative References . . . . . . . . . . . . . . . . . . 9 82 11.2. Informative References . . . . . . . . . . . . . . . . . 9 83 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10 85 1. Introduction 87 Link state IGPs, such as IS-IS [ISO10589-Second-Edition] and OSPF 88 [RFC2328], perform distributed route computation on all routers in 89 the area/level. In order to have consistent routing tables across 90 the network, such distributed computation requires that all routers 91 have the same version of the network topology (Link State DataBase 92 (LSDB)) and perform their computation at the same time. 94 In general, when the network is stable, there is a desire to compute 95 a new SPF as soon as a failure is detected in order to quickly route 96 around the failure. However, when the network is experiencing 97 multiple proximate failures over a short period of time, there is a 98 conflicting desire to limit the frequency of SPF computations. 99 Indeed, this allows a reduction in control plane resources used by 100 IGPs and all protocols/subsystems reacting on the attendant route 101 change, such as LDP, RSVP-TE, BGP, Fast ReRoute computations, FIB 102 updates... This also reduces the churn on routers and in the network 103 and, in particular, reduces the side effects such as micro-loops that 104 ensue during IGP convergence. 106 To allow for this, IGPs implement an SPF back-off algorithm. 107 However, different implementations have choosen different algorithms. 108 Hence, in a multi-vendor network, it's not possible to ensure that 109 all routers trigger their SPF computation after the same delay. This 110 situation increases the average differential delay between routers 111 completing their SPF computation. It also increases the probability 112 that different routers compute their FIBs based on different LSDB 113 versions. Both factors increase the probability and/or duration of 114 micro-loops. 116 To allow multi-vendor networks to have all routers delay their SPF 117 computations for the same duration, this document specifies a 118 standard algorithm. Optionally, implementations may offer 119 alternative algorithms. 121 2. High level goals 123 The high level goals of this algorithm are the following: 125 o Very fast convergence for a single event (e.g., link failure). 127 o Paced fast convergence for multiple proximate IGP events while IGP 128 stability is considered acceptable. 130 o Delayed convergence when IGP stability is problematic. This will 131 allow the IGP and related processes to conserve resources during 132 the period of instability. 134 o Always try to avoid different SPF_DELAY timers values across 135 different routers in the area/level. Even though not all routers 136 will receive IGP messages at the same time, due to differences 137 both in the distance from the originator of the IGP event and in 138 flooding implementations. 140 3. Definitions and parameters 142 IGP events: The reception or origination of an IGP LSDB change 143 requiring a new routing table computation. Examples are a topology 144 change, a prefix change, a metric change on a link or prefix... Note 145 that locally triggering a routing table computation is not considered 146 as an IGP event since other IGP routers are unaware of this 147 occurrence. 149 Routing table computation: Computation of the routing table, by the 150 IGP, using the IGP LSDB. No distinction is made between the type of 151 computation performed. e.g., full SPF, incremental SPF, Partial Route 152 Computation (PRC). The type of computation is a local consideration. 153 This document may interchangeably use the terms routing table 154 computation and SPF computation. 156 SPF_DELAY: The delay between the first IGP event triggering a new 157 routing table computation and the start of that routing table 158 computation. It can take the following values: 160 INITIAL_SPF_DELAY: A very small delay to quickly handle link 161 failure, e.g., 0 milliseconds. 163 SHORT_SPF_DELAY: A small delay to have a fast convergence in case of 164 a single component failure (node, SRLG..), e.g., 50-100 165 milliseconds. 167 LONG_SPF_DELAY: A long delay when the IGP is unstable, e.g., 2 168 seconds. Note that this allows the IGP network to stabilize. 170 TIME_TO_LEARN_INTERVAL: This is the maximum duration typically needed 171 to learn all the IGP events related to a single component failure 172 (e.g., router failure, SRLG failure), e.g., 1 second. It's mostly 173 dependent on failure detection time variation between all routers 174 that are adjacent to the failure. Additionally, it may depend on the 175 different IGP implementations across the network, related to 176 origination and flooding of their link state advertisements. 178 HOLD_DOWN_INTERVAL: The time required with no received IGP events 179 before considering the IGP to be stable again and allowing the 180 SPF_DELAY to be restored to INITIAL_WAIT. e.g., 3 seconds. 182 4. Principles of SPF delay algorithm 184 For this first IGP event, we assume that there has been a single 185 simple change in the network which can be taken into account using a 186 single routing computation (e.g., link failure, prefix (metric) 187 change) and we optimize for very fast convergence, delaying the 188 routing computation by INITIAL_SPF_DELAY. Under this assumption, 189 there is no benefit in delaying the routing computation. In a 190 typical network, this is the most common type of IGP event. Hence, 191 it makes sense to optimize this case. 193 If subsequent IGP events are received in a short period of time 194 (TIME_TO_LEARN_INTERVAL), we then assume that a single component 195 failed, but that this failure requires the knowledge of multiple IGP 196 events in order for IGP routing to converge. Under this assumption, 197 we want fast convergence since this is a normal network situation. 198 However, there is a benefit in waiting for all IGP events related to 199 this single component failure so that the IGP can compute the post- 200 failure routing table in a single route computation. In this 201 situation, we delay the routing computation by LONG_WAIT. 203 If IGP events are still received after TIME_TO_LEARN_INTERVAL from 204 the initial IGP event received in QUIET state, then the network is 205 presumably experiencing multiple independent failures. In this case, 206 while waiting for network stability, the computations are delayed for 207 a longer time represented by LONG_SPF_DELAY. This SPF delay is kept 208 until no IGP events are received for HOLDDOWN_INTERVAL. 210 Note that previous SPF delay algorithms used to count the number of 211 SPF computations. However, as all routers may receive the IGP events 212 at different times, we cannot assume that all routers will perform 213 the same number of SPF computations or that they will schedule them 214 at the same time. For example, assuming that the SPF delay is 50 ms, 215 router R1 may receive 3 IGP events (E1, E2, E3) in those 50 ms and 216 hence will perform a single routing computation. While another 217 router R2 may only receive 2 events (E1, E2) in those 50 ms and hence 218 will schedule another routing computation when receiving E3. That's 219 why this document uses a time (TIME_TO_LEARN) from the initial event 220 detection/reception as opposed to counting the number of SPF 221 computations to determine when the IGP is unstable. 223 5. Specification of the SPF delay state machine 225 5.1. States 227 This section describes the state machine. The naming and semantics 228 of each state corresponds directly to the SPF delay used for IGP 229 events received in that state. Three states are defined: 231 QUIET: This is the initial state, when no IGP events have occured for 232 at least HOLDDOWN_INTERVAL since the previous routing table 233 computation. The state is meant to handle link failures very 234 quickly. 236 SHORT_WAIT: State entered when an IGP event has been received in 237 QUIET state and exited when no IGP events have been received for 238 HOLDDOWN_TIMER. This state is meant to handle single component 239 failure requiring multiple IGP events (e.g., node, SRLG). 241 LONG_WAIT: State reached after TIME_TO_LEARN_INTERVAL. In other 242 words, state reached after TIME_TO_LEARN_INTERVAL in state 243 SHORT_WAIT. This state is meant to handle multiple independent 244 component failures during periods of IGP instability. 246 5.2. States Transitions 248 The FSM is initialized to the QUIET_STATE with all three timers 249 deactivated. The following diagram describes briefly the state 250 transitions. 252 +-------------------+ 253 | |<-------------------+ 254 | QUIET | | 255 | |<---------+ | 256 +-------------------+ | | 257 | | | 258 | | | 259 | 1: IGP event | | 260 | | | 261 v | | 262 +-------------------+ | | 263 +---->| | | | 264 | | SHORT_WAIT |----->----+ | 265 +-----| | | 266 2: +-------------------+ 6: HOLDDOWN_TIMER | 267 IGP event | expiration | 268 | | 269 | | 270 | 3: LEARN_TIMER | 271 | expiration | 272 | | 273 v | 274 +-------------------+ | 275 +---->| | | 276 | | LONG_WAIT |------------>-------+ 277 +-----| | 278 4: +-------------------+ 5: HOLDDOWN_TIMER 279 IGP event expiration 281 Figure 1: State Machine 283 5.3. FSM Events 285 This section describes the events and the actions performed in 286 response. 288 Event 1: IGP events, while in QUIET_STATE. 290 Actions on event 1: 292 o If SPF_TIMER is not already running, start it with value 293 INITIAL_SPF_DELAY. 295 o Start LEARN_TIMER with TIME_TO_LEARN_INTERVAL. 297 o Start HOLDDOWN_TIMER with HOLDDOWN_INTERVAL. 299 o Transition to SHORT_WAIT state. 301 Event 2: IGP events, while in SHORT_WAIT. 303 Actions on event 2: 305 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 307 o If SPF_TIMER is not already running, start it with value 308 SHORT_SPF_DELAY. 310 o Remain in current state. 312 Event 3: LEARN_TIMER expiration. 314 Actions on event 3: 316 o Transition to LONG_WAIT state. 318 Event 4: IGP events, while in LONG_WAIT. 320 Actions on event 4: 322 o Reset HOLDDOWN_TIMER to HOLDDOWN_INTERVAL. 324 o If SPF_TIMER is not already running, start it with value 325 LONG_SPF_DELAY. 327 o Remain in current state. 329 Event 5: HOLDDOWN_TIMER expiration, while in LONG_WAIT. 331 Actions on event 5: 333 o Transition to QUIET state. 335 Event 6: HOLDDOWN_TIMER expiration, while in SHORT_WAIT. 337 Actions on event 6: 339 o Deactivate LEARN_TIMER. 341 o Transition to QUIET state. 343 6. Parameters 345 All the parameters MUST be configurable. All the delays 346 (INITIAL_SPF_DELAY, SHORT_SPF_DELAY, LONG_SPF_DELAY, 347 TIME_TO_LEARN_INTERVAL, HOLD_DOWN_INTERVAL) SHOULD be configurable at 348 the millisecond granularity. They MUST be configurable at least at 349 the tenth of second granularity. The configurable range for all the 350 parameters SHOULD at least be from 0 milliseconds to 60 seconds. 352 This document does not propose default values for the parameters 353 because these values are expected to be context dependent. 354 Implementations are free to propose their own default values. 356 When setting (default) values, one SHOULD consider the customers and 357 their application requirements, the computational power of the 358 routers, the size of the network, and, in particular, the number of 359 IP prefixes advertised in the IGP, the frequency and number of IGP 360 events, the number of protocols reactions/computations triggered by 361 IGP SPF (e.g., BGP, PCEP, Traffic Engineering CSPF, Fast ReRoute 362 computations). 364 Note that some or all of these factors may change over the life of 365 the network. In case of doubt, it's RECOMMENDED to play it safe and 366 start with safe, i.e., longer timers. 368 For the standard algorithm to be effective in mitigating micro-loops, 369 it is RECOMMENDED that all routers in the IGP domain, or at least all 370 the routers in the same area/level, have exactly the same configured 371 values. 373 7. Impact on micro-loops 375 Micro-loops during IGP convergence are due to a non-synchronized or 376 non-ordered update of the forwarding information tables (FIB) 377 [RFC5715] [RFC6976] [I-D.ietf-rtgwg-spf-uloop-pb-statement]. FIBs 378 are installed after multiple steps such as SPF wait time, SPF 379 computation, FIB distribution, and FIB update. This document only 380 addresses the first contribution. This standardized procedure 381 reduces the probability and/or duration of micro-loops when IGPs 382 experience multiple proximate events. It does not prevent all micro- 383 loops. However, it is beneficial and is less complex and costly to 384 implement when compared to full solutions such as [RFC5715] or 385 [RFC6976]. 387 8. IANA Considerations 389 No IANA actions required. 391 9. Security considerations 393 The algorithm presented in this document does not compromise IGP 394 security. An attacker having the ability to generate IGP events 395 would be able to delay the IGP convergence time. The LONG_SPF_DELAY 396 state may help mitigate the effects of Denial-of-Service (DOS) 397 attacks generating many IGP events. 399 10. Acknowledgements 401 We would like to acknowledge Les Ginsberg, Uma Chunduri, and Mike 402 Shand for the discussions and comments related to this document. 404 11. References 406 11.1. Normative References 408 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 409 Requirement Levels", BCP 14, RFC 2119, 410 DOI 10.17487/RFC2119, March 1997, 411 . 413 11.2. Informative References 415 [I-D.ietf-rtgwg-spf-uloop-pb-statement] 416 Litkowski, S., Decraene, B., and M. Horneffer, "Link State 417 protocols SPF trigger and delay algorithm impact on IGP 418 micro-loops", draft-ietf-rtgwg-spf-uloop-pb-statement-02 419 (work in progress), December 2015. 421 [ISO10589-Second-Edition] 422 International Organization for Standardization, 423 "Intermediate system to Intermediate system intra-domain 424 routeing information exchange protocol for use in 425 conjunction with the protocol for providing the 426 connectionless-mode Network Service (ISO 8473)", ISO/ 427 IEC 10589:2002, Second Edition, Nov 2002. 429 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 430 DOI 10.17487/RFC2328, April 1998, 431 . 433 [RFC5715] Shand, M. and S. Bryant, "A Framework for Loop-Free 434 Convergence", RFC 5715, DOI 10.17487/RFC5715, January 435 2010, . 437 [RFC6976] Shand, M., Bryant, S., Previdi, S., Filsfils, C., 438 Francois, P., and O. Bonaventure, "Framework for Loop-Free 439 Convergence Using the Ordered Forwarding Information Base 440 (oFIB) Approach", RFC 6976, DOI 10.17487/RFC6976, July 441 2013, . 443 Authors' Addresses 445 Bruno Decraene 446 Orange 448 Email: bruno.decraene@orange.com 450 Stephane Litkowski 451 Orange Business Service 453 Email: stephane.litkowski@orange.com 455 Hannes Gredler 456 RtBrick Inc 458 Email: hannes@rtbrick.com 459 Acee Lindem 460 Cisco Systems 461 301 Midenhall Way 462 Cary, NC 27513 463 USA 465 Email: acee@cisco.com 467 Pierre Francois 468 Cisco Systems 470 Email: pifranco@cisco.com 472 Chris Bowers 473 Juniper Networks, Inc. 474 1194 N. Mathilda Ave. 475 Sunnyvale, CA 94089 476 US 478 Email: cbowers@juniper.net