Network Working Group                                           M. Shand
Internet-Draft                                                 S. Bryant
Intended status: Informational                             Cisco Systems
Expires: May 3, 2009                                         P. Francois
                                        Universite catholique de Louvain
                                                        October 30, 2008


      Mechanisms for safely abandoning loop-free convergence (AAH)
                draft-bryant-francois-shand-ipfrr-aah-01

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 3, 2009.

Abstract

   IPFRR and loop-free convergence techniques can deal with single
   topology change events, multiple correlated change events, and in
   some cases even certain uncorrelated events.  However, in all cases
   there are events which cannot be dealt with and the mechanism needs
   to quickly revert to normal convergence.  This is known as
   "Abandoning All Hope" (AAH).  This document describes the nature of
   the problem, and various proposed mechanisms to deal with it.

Table of Contents

   1.  Conventions used in this document
   2.  Introduction
   3.  Possible Solutions
     3.1.  Hold-down timer only
     3.2.  Basic per event AAH messages
     3.3.  AAH messages
       3.3.1.  Per Router State Machine
       3.3.2.  Per Neighbor State Machine
   4.  Management Considerations
   5.  Scope and applicability
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

2.  Introduction

   IPFRR[2] and loop-free convergence techniques[3] can deal with single
   topology change events, multiple correlated change events, and in
   some cases even certain uncorrelated events.  However, in all cases
   there are events which cannot be dealt with and the mechanism needs
   to quickly revert to normal convergence.  This is known as
   "Abandoning All Hope" (AAH).

   A good example is the ordered FIB loop-free convergence technique
   (oFIB)[4]; however, the problem and the mechanisms described here
   for its resolution are equally applicable to any loop-free
   convergence mechanism, such as PLSN[5].  All the routers performing
   the calculation must have an identical view of the set of topology
   changes under consideration.  One technique to ensure this is to
   start a hold-down timer on reception of the first event in the hope
   that all subsequent events related to the same root cause will arrive
   before the timer expires.  If this is the case, then all routers in
   the network will have acquired an identical set of changes and
   processing can continue correctly.  However, in some cases the timer
   value will be too short to ensure that all the related events have
   arrived at all routers (perhaps because there was some unexpected
   propagation delay, or one or more of the events are slow in being
   detected).
   In other cases, a completely unrelated event may occur
   after the timer has expired, but before the processing is complete.
   In either case it is necessary to "Abandon all Hope" and revert to
   traditional convergence.

   There are a number of problems with this naive approach.  Firstly,
   since the timer is started at each router on reception of the first
   LSP announcing a topology change, the actual starting time is
   dependent upon the propagation time of the first LSP.  So, for a
   subsequent event occurring around the time of the timer expiry,
   because of variations in propagation delay it may reach some routers
   before the timer expires and others after it has expired.  In the
   former case this LSP will be included in the set of changes to be
   considered, while in the latter it will be excluded and would invoke
   an AAH in the routers receiving it.  Clearly this would be a
   dangerous condition, and it is therefore necessary to arrange that an
   AAH invoked anywhere in the network causes ALL routers to AAH.  This
   can be achieved by reliably propagating an AAH message throughout the
   network.  However, this raises a second problem: the need to
   synchronize the exit from AAH state throughout the network.

   While in AAH state any topology changes previously received, or which
   are subsequently received, should be processed immediately using the
   traditional convergence algorithms, i.e. without invoking controlled
   convergence.  If the exit from the AAH state is not correctly
   synchronized, a new event may be processed by some routers
   immediately (as AAH), while those which have already left AAH state
   will treat it as the first of a new batch of changes and attempt
   controlled convergence.

3.  Possible Solutions

   A number of approaches to this problem have been proposed, in
   increasing order of complexity:

   1.  Hold-down timer only.  This is the solution proposed in PLSN.

   2.  Basic per event AAH messages.

   3.  Synchronization of AAH state using AAH messages.

   These are described below.  The purpose of this draft is to trigger
   discussion on the trade-offs between complexity and robustness in the
   AAH solution-space.

3.1.  Hold-down timer only

   This method uses a hold-down to acquire a set of LSPs which should be
   processed together.  On expiry of the local hold-down timer, the
   router begins processing the batch of LSPs according to the loop-free
   convergence algorithm.

3.2.  Basic per event AAH messages

   This method uses signaling between neighbors to announce the
   abandoning of controlled convergence.

   A router individually decides when it should abandon controlled
   convergence for a given (set of) LSP(s).  It bases this decision on
   the LSP reception timings and the hold-down timers defined for the
   controlled convergence mechanism used.

   When a router makes a decision to abandon controlled convergence for
   an LSP, it sends an AAH message to a selected subset of its
   neighbors.  The message identifies the LSPs for which controlled
   convergence was abandoned.

   The reception of such a message MUST trigger the decision to abandon
   controlled convergence for this LSP by the receiver.  The receiver
   SHOULD also abandon controlled convergence for the other pending
   LSPs.

   A router is only allowed to send AAH messages for a given event once.
   This can be achieved for example with a one bit flag in the LSP of
   the LSDB, stating whether convergence has been abandoned and signaled
   for this LSP.  This can also be achieved by storing the
   identification of the LSPs for which convergence was abandoned for a
   time that is an order of magnitude longer than a typical IGP
   convergence (e.g., 10 seconds).  The subset of neighbors to which an
   AAH message must be sent by a router R depends on the controlled
   convergence mechanism.
   It can be the set of all the neighbors of R,
   but this is not required.

   For any controlled convergence mechanism, the selection of this
   subset MUST be such that if a router R abandons controlled
   convergence, all the routers who could create a forwarding loop with
   R by not abandoning controlled convergence will eventually abandon
   controlled convergence.

   For the case of controlled convergence using ordered-FIB:

   o  In the case of a link up / node up / metric decrease event, the
      set MUST include the neighbors of R that are on the shortest paths
      between R and the originator of the LSP for which controlled
      convergence is abandoned.

   o  In the case of a link down / node down / metric increase event,
      the set MUST include the neighbors of R that are upstream of R on
      the paths towards the originator of the LSP for which controlled
      convergence is abandoned.

3.3.  AAH messages

   Like the others, this method uses a hold-down to acquire a set of
   LSPs which should be processed together.  On expiry of the local
   hold-down timer, the router begins processing the batch of LSPs
   according to the loop-free convergence algorithm.  This is the same
   behaviour as the hold-down timer only method.  However, if any
   router, having started the loop-free convergence process, receives an
   LSP which would trigger a topology change, it locally abandons the
   controlled convergence process, and sends an AAH message to all its
   neighbors.  This eventually triggers all routers to abandon the
   controlled convergence.  The routers remain in AAH state (i.e.
   processing topology changes using normal "fast" convergence), until a
   period of quiescence has elapsed.  The exit from AAH state is
   synchronized by using a two step process.

   To achieve the required synchronization, two additional messages are
   required, AAH and AAH ACK.  The AAH message is reliably exchanged
   between neighbours using the AAH ACK message.
   These could be
   implemented as new messages within the routing protocol or carried
   in existing routing hello messages.

   Two types of state machines are needed: a per-router AAH state
   machine and a per neighbour AAH state machine (PNSM).  These are
   described below.

3.3.1.  Per Router State Machine

                          Per Router State Table
 +-------------+-----------+---------+--------+------------+----------+
 | EVENT       | Q         | Hold    | CC     | AAH        | AAH-hold |
 +=============+===========+=========+========+============+==========+
 | RX LSP      | Start     | -       | TX-AAH | Re-start   | TX-AAH   |
 | triggering  | hold-down |         | Start  | AAH timer. | Start    |
 | change      | timer     |         | AAH    | [AAH]      | AAH      |
 |             | [Hold]    |         | timer. |            | timer.   |
 |             |           |         | [AAH]  |            | [AAH]    |
 +-------------+-----------+---------+--------+------------+----------+
 | RX AAH      | TX-AAH    | TX-AAH  | TX-AAH | [AAH]      | TX-AAH   |
 | (Neighbor's | Start AAH | Start   | Start  |            | Start    |
 | PNSM        | timer.    | AAH     | AAH    |            | AAH      |
 | processes   | [AAH]     | timer   | timer. |            | timer.   |
 | RX AAH.)    |           | [AAH]   | [AAH]  |            | [AAH]    |
 +-------------+-----------+---------+--------+------------+----------+
 | Timer       | -         | Trigger | -      | Start      | [Q]      |
 | expiry      |           | CC.     |        | AAH-hold   |          |
 |             |           | [CC]    |        | timer.     |          |
 |             |           |         |        | [AAH-hold] |          |
 +-------------+-----------+---------+--------+------------+----------+
 | Controlled  | -         | -       | [Q]    | -          | -        |
 | convergence |           |         |        |            |          |
 | completed   |           |         |        |            |          |
 +-------------+-----------+---------+--------+------------+----------+
          TX-AAH = Send "goto TX-AAH" to all other PNSMs.

   Operation of the per-router state machine is as follows:

   Operation of this state machine under normal topology change involves
   only states: Quiescent (Q), Hold-down (Hold) and Controlled
   Convergence (CC).  The remaining states are associated with an AAH
   event.

   The resting state is Quiescent.
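
   As an illustration only, the per-router state table above can be
   written as a lookup keyed on (event, state).  The following is a
   minimal sketch and not part of any proposed solution: the action
   labels (tx_aah, start_hold_timer, start_aah_timer,
   start_aah_hold_timer, trigger_cc) and the helper names step and run
   are hypothetical, timers are represented only as labels rather than
   real timers, and "Re-start AAH timer" is modelled as starting the
   AAH timer again.

```python
from enum import Enum, auto


class State(Enum):
    Q = auto()         # Quiescent
    HOLD = auto()      # Hold-down
    CC = auto()        # Controlled Convergence
    AAH = auto()       # Abandoning All Hope
    AAH_HOLD = auto()  # AAH-hold


class Event(Enum):
    RX_LSP = auto()        # RX LSP triggering change
    RX_AAH = auto()        # RX AAH from a neighbor
    TIMER_EXPIRY = auto()  # expiry of whichever timer is running
    CC_DONE = auto()       # controlled convergence completed


# Hypothetical action labels; "tx_aah" stands for sending
# "goto TX-AAH" to the other per-neighbor state machines.
_ENTER_AAH = (("tx_aah", "start_aah_timer"), State.AAH)

# (event, current state) -> (actions, next state).
# Cells shown as "-" in the table are simply absent.
TABLE = {
    (Event.RX_LSP, State.Q): (("start_hold_timer",), State.HOLD),
    (Event.RX_LSP, State.CC): _ENTER_AAH,
    (Event.RX_LSP, State.AAH): (("start_aah_timer",), State.AAH),
    (Event.RX_LSP, State.AAH_HOLD): _ENTER_AAH,
    (Event.RX_AAH, State.Q): _ENTER_AAH,
    (Event.RX_AAH, State.HOLD): _ENTER_AAH,
    (Event.RX_AAH, State.CC): _ENTER_AAH,
    (Event.RX_AAH, State.AAH): ((), State.AAH),
    (Event.RX_AAH, State.AAH_HOLD): _ENTER_AAH,
    (Event.TIMER_EXPIRY, State.HOLD): (("trigger_cc",), State.CC),
    (Event.TIMER_EXPIRY, State.AAH): (("start_aah_hold_timer",),
                                      State.AAH_HOLD),
    (Event.TIMER_EXPIRY, State.AAH_HOLD): ((), State.Q),
    (Event.CC_DONE, State.CC): ((), State.Q),
}


def step(state, event):
    """Return (actions, next_state); a "-" cell leaves the state as is."""
    return TABLE.get((event, state), ((), state))


def run(events, state=State.Q):
    """Feed a sequence of events and collect the actions taken."""
    log = []
    for event in events:
        actions, state = step(state, event)
        log.extend(actions)
    return state, log
```

   For example, feeding the sequence RX_LSP, TIMER_EXPIRY, RX_LSP to
   run() from the Quiescent state walks Q, Hold, CC and then AAH, which
   is the case of a topology changing LSP arriving during controlled
   convergence.
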
   When the router in the Quiescent
   state receives an LSP indicating a topology change, which would
   normally trigger an SPF, it starts the Hold-down timer and changes
   state to Hold-down.  It normally remains in this state, collecting
   additional LSPs until the Hold-down timer expires.  Note that all
   routers MUST use a common value for the Hold-down timer.  When the
   Hold-down timer expires the router then enters Controlled Convergence
   (CC) state and executes the CC mechanism to re-converge the topology.
   When the CC process has completed on the router, the router re-enters
   the Quiescent state.

   If this router receives a topology changing LSP whilst it is in the
   CC state, it enters AAH state, and sends a "goto TX-AAH" command to
   all per neighbour state machines which causes each per-neighbour
   state machine to signal this state change to its neighbour.
   Alternatively, if this router receives an AAH message from any of its
   neighbors whilst in any state except AAH, it starts the AAH timer and
   enters the AAH state.  The per neighbor state machine corresponding
   to the neighbor from which the AAH was received executes the RX AAH
   action (which causes it to send an AAH ACK), while the remainder are
   sent the "goto TX-AAH" command.  The result is that the AAH is
   acknowledged to the neighbor from which it was received and
   propagated to all other neighbors.  On entering AAH state, all CC
   timers are expired and normal convergence takes place.

   Whilst in the AAH state, LSPs are processed in the traditional
   manner.  Each time an LSP is received, the AAH timer is restarted.
   In an unstable network ALL routers will remain in this state for some
   time and the network will behave in the traditional uncontrolled
   convergence manner.

   When the AAH timer expires, the router enters AAH-hold state and
   starts the AAH hold timer.
   The purpose of the AAH-hold state is to
   synchronize the transition of the network from AAH to Quiescent.  The
   additional state ensures that the network cannot contain a mixture of
   routers in both AAH and Quiescent states.  If, whilst in AAH-Hold
   state the router receives a topology changing LSP, it re-enters AAH
   state and commands all per neighbour state machines to "goto TX-AAH".
   If, whilst in AAH-Hold state the router receives an AAH message from
   one of its neighbours, it re-enters the AAH state and commands all
   other per neighbour state machines to "goto TX-AAH".  Note that the
   per-neighbor state machine receiving the AAH message will
   autonomously acknowledge receipt of the AAH message.  Commanding the
   per-neighbour state machine to "goto TX-AAH" is necessary, because
   routers may be in a mixture of Quiescent, Hold-down and AAH-hold
   state, and it is necessary to rendezvous the entire network back to
   AAH state.

   When the AAH Hold timer expires the router changes to state Quiescent
   and is ready for loop free convergence.

3.3.2.  Per Neighbor State Machine

                         Per Neighbor State Table
 +----------------------------+--------------+------------------------+
 | EVENT                      | Idle         | TX-AAH                 |
 +============================+==============+========================+
 | RX AAH                     | Send ACK.    | Send ACK.              |
 |                            |              | Cancel timer.          |
 |                            | [IDLE]       | [IDLE]                 |
 +----------------------------+--------------+------------------------+
 | RX ACK                     | ignore       | Cancel timer.          |
 |                            |              | [IDLE]                 |
 +----------------------------+--------------+------------------------+
 | RX "goto TX-AAH" from      | Send AAH     | ignore                 |
 | Router State Machine       | [TX-AAH]     |                        |
 +----------------------------+--------------+------------------------+
 | Timer expires              | impossible   | Send AAH               |
 |                            |              | Restart timer.         |
 |                            |              | [TX-AAH]               |
 +----------------------------+--------------+------------------------+

   There is one instance of the per-neighbour (PN) state machine for
   each neighbour within the convergence control domain.

   The normal state is IDLE.

   On command ("goto TX-AAH") from the router state machine, the state
   machine enters TX-AAH state, transmits an AAH message to its
   neighbour and starts a timer.

   On receipt of an AAH ACK in state TX-AAH the state machine cancels
   the timer and enters IDLE state.

   In state IDLE, any AAH ACK message received is ignored.

   On expiry of the timer in state TX-AAH the state machine transmits an
   AAH message to the neighbour and restarts the timer.  (The timer
   cannot expire in any other state.)

   In any state, receipt of an AAH causes the state machine to transmit
   an AAH ACK and enter the IDLE state.

   Note that for correct operation the state machine MUST remain in
   state TX-AAH, until an AAH ACK or an AAH is received, or the state
   machine is deleted.  Deletion of the per neighbor state machine
   occurs when routing determines that the neighbour has gone away, or
   when the interface goes away.

   When routing detects a new neighbour it creates a new instance of the
   per-neighbour state machine in state Idle.  The consequent generation
   of the router's own LSP will then cause the router state machine to
   execute the LSP receipt actions, which will if necessary result in
   the new per-neighbour state machine receiving a "goto TX-AAH" command
   and transitioning to TX-AAH state.

4.  Management Considerations

   The management requirements will depend upon the solution adopted,
   but at the very least there needs to be reporting of the current
   state.

5.  Scope and applicability

   The initial scope of this work is in the context of link state IGPs.

6.  IANA Considerations

   There are no IANA considerations that arise from this document.

7.  Security Considerations

   This document does not itself introduce any security issues, but
   attention must be paid to the security implications of any proposed
   solutions to the problem.

8.  Acknowledgements

   The authors would like to acknowledge contributions made by Les
   Ginsberg.

9.  References

9.1.  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [2]  Shand, M. and S. Bryant, "IP Fast Reroute Framework",
        draft-ietf-rtgwg-ipfrr-framework-09 (work in progress),
        October 2008.

   [3]  Shand, M. and S. Bryant, "A Framework for Loop-free
        Convergence", draft-ietf-rtgwg-lf-conv-frmwk-02 (work in
        progress), February 2008.

   [4]  Francois, P., "Loop-free convergence using oFIB",
        draft-ietf-rtgwg-ordered-fib-02 (work in progress),
        February 2008.

   [5]  Zinin, A., "Analysis and Minimization of Microloops in Link-
        state Routing Protocols", draft-ietf-rtgwg-microloop-analysis-01
        (work in progress), October 2005.

Authors' Addresses

   Mike Shand
   Cisco Systems
   250, Longwater Avenue.
   Reading, Berks  RG2 6GB
   UK

   Email: mshand@cisco.com


   Stewart Bryant
   Cisco Systems
   250, Longwater Avenue.
   Reading, Berks  RG2 6GB
   UK

   Email: stbryant@cisco.com


   Pierre Francois
   Universite catholique de Louvain

   Email: pierre.francois@uclouvain.be
   URI:   http://inl.info.ucl.ac.be/pfr

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.