MPLS Working Group                       Vishal Sharma (Metanoia, Inc.)
Informational Track                 Fiffi Hellstrand (Nortel Networks)
Expires: March 2003                                           (Editors)

                                                         September 2002

                  Framework for MPLS-based Recovery
               draft-ietf-mpls-recovery-frmwrk-07.txt

Status of this memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

Multi-protocol label switching (MPLS) integrates the label swapping forwarding paradigm with network layer routing. To deliver reliable service, MPLS requires a set of procedures to provide protection of the traffic carried on different paths. This requires that the label switched routers (LSRs) support fault detection, fault notification, and fault recovery mechanisms, and that MPLS signaling support the configuration of recovery. With these objectives in mind, this document specifies a framework for MPLS-based recovery.

Table of Contents

   1. Introduction
   1.1. Background
   1.2. Motivation for MPLS-Based Recovery
   1.3. Objectives/Goals
   2. Contributing Authors
   3. Overview
   3.1. Recovery Models
   3.1.1 Rerouting
   3.1.2 Protection Switching
   3.2. The Recovery Cycles
   3.2.1 MPLS Recovery Cycle Model
   3.2.2 MPLS Reversion Cycle Model
   3.2.3 Dynamic Re-routing Cycle Model
   3.3. Definitions and Terminology
   3.3.1 General Recovery Terminology
   3.3.2 Failure Terminology
   3.4. Abbreviations
   4. MPLS-based Recovery Principles
   4.1. Configuration of Recovery
   4.2. Initiation of Path Setup
   4.3. Initiation of Resource Allocation
   4.4. Scope of Recovery
   4.4.1 Topology
   4.4.1.1 Local Repair
   4.4.1.2 Global Repair
   4.4.1.3 Alternate Egress Repair
   4.4.1.4 Multi-Layer Repair
   4.4.1.5 Concatenated Protection Domains
   4.4.2 Path Mapping
   4.4.3 Bypass Tunnels
   4.4.4 Recovery Granularity
   4.4.4.1 Selective Traffic Recovery
   4.4.4.2 Bundling
   4.4.5 Recovery Path Resource Use
   4.5. Fault Detection
   4.6. Fault Notification
   4.7. Switch-Over Operation
   4.7.1 Recovery Trigger
   4.7.2 Recovery Action
   4.8. Post Recovery Operation
   4.8.1 Fixed Protection Counterparts
   4.8.1.1 Revertive Mode
   4.8.1.2 Non-revertive Mode
   4.8.2 Dynamic Protection Counterparts
   4.8.3 Restoration and Notification
   4.8.4 Reverting to Preferred Path (or Controlled Rearrangement)
   4.9. Performance
   5. MPLS Recovery Features
   6. Comparison Criteria
   7. Security Considerations
   8. Intellectual Property Considerations
   9. Acknowledgements
   10. Editors' Addresses
   11. References

1. Introduction

This memo describes a framework for MPLS-based recovery. We provide a detailed taxonomy of recovery terminology, and discuss the motivation for, the objectives of, and the requirements for MPLS-based recovery. We outline principles for MPLS-based recovery, and also provide comparison criteria that may serve as a basis for comparing and evaluating different recovery schemes.

At points in the document, we provide some thoughts about the operation or viability of certain recovery objectives. These should be viewed as the opinions of the authors, and not the consolidated views of the IETF.

1.1. Background

Network routing deployed today is focused primarily on connectivity, and typically supports only one class of service, the best effort class. Multi-protocol label switching [1], on the other hand, by integrating forwarding based on label-swapping of a link local label with network layer routing, allows flexibility in the delivery of new routing services.
MPLS allows the use of media-specific forwarding mechanisms such as label swapping. This enables some sophisticated features such as quality-of-service (QoS) and traffic engineering [2] to be implemented more effectively. An important component of providing QoS, however, is the ability to transport data reliably and efficiently. Although the current routing algorithms are robust and survivable, the amount of time they take to recover from a fault can be significant, on the order of several seconds or minutes, causing disruption of service for some applications in the interim. This is unacceptable in situations where the aim is to provide a highly reliable service, with recovery times on the order of seconds down to 10's of milliseconds. Examples of such applications are Virtual Leased Line services, Stock Exchange data services, voice traffic, and video services; i.e., any application for which a disruption in service due to a failure is long enough that service agreements cannot be fulfilled or the required level of quality cannot be guaranteed.

MPLS recovery may be motivated by the notion that there are limitations to improving the recovery times of current routing algorithms. Additional improvement can be obtained by augmenting these algorithms with MPLS recovery mechanisms [3]. Since MPLS is a possible technology of choice in future IP-based transport networks, it is useful that MPLS be able to provide protection and restoration of traffic. MPLS may facilitate the convergence of network functionality on a common control and management plane. Further, a protection priority could be used as a differentiating mechanism for premium services that require high reliability, such as Virtual Leased Line services and high priority voice and video traffic.

The remainder of this document provides a framework for MPLS-based recovery. It is focused at a conceptual level and is meant to address motivation, objectives, and requirements. Issues of mechanism, policy, routing plans, and characteristics of traffic carried by recovery paths are beyond the scope of this document.

1.2. Motivation for MPLS-Based Recovery

MPLS-based protection of traffic (called MPLS-based Recovery) is useful for a number of reasons. The most important is its ability to increase network reliability by enabling a faster response to faults than is possible with traditional Layer 3 (or IP layer) approaches alone, while still providing the visibility of the network afforded by Layer 3. Furthermore, a protection mechanism using MPLS could enable IP traffic to be put directly over WDM optical channels and provide a recovery option without an intervening SONET layer. This would facilitate the construction of IP-over-WDM networks that require fast recovery capabilities.

The need for MPLS-based recovery arises because of the following:

I. Layer 3 or IP rerouting may be too slow for a core MPLS network that needs to support recovery times that are smaller than the convergence times of IP routing protocols.

II. Layer 0 (for example, optical layer) or Layer 1 (for example, SONET) mechanisms may make wasteful use of resources.

III. The granularity at which the lower layers may be able to protect traffic may be too coarse for traffic that is switched using MPLS-based mechanisms.
IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher layer operations. Thus, while they may provide, for example, link protection, they cannot easily provide node protection or protection of traffic transported at Layer 3. Further, this may prevent the lower layers from providing restoration based on the traffic's needs: for example, fast restoration for traffic that needs it, and slower restoration (with possibly more optimal use of resources) for traffic that does not require fast restoration. In networks where the latter class of traffic is dominant, providing fast restoration to all classes of traffic may not be cost effective from a service provider's perspective.

V. MPLS has desirable attributes when applied to the purpose of recovery for connectionless networks. Specifically, an LSP is source routed, and a forwarding path for recovery can be "pinned" so that it is not affected by transient instability in SPF routing brought on by failure scenarios.

VI. Establishing interoperability of protection mechanisms between routers/LSRs from different vendors in IP or MPLS networks is desired to enable recovery mechanisms to work in a multivendor environment, and to enable the transition of certain protected services to an MPLS core.

1.3. Objectives/Goals

The following are some important goals for MPLS-based recovery.

Ia. MPLS-based recovery mechanisms may be subject to the traffic engineering goal of optimal use of resources.

Ib. MPLS-based recovery mechanisms should aim to facilitate restoration times that are sufficiently fast for the end-user application, that is, times that better match the end-user application's requirements. In some cases, this may be as short as 10s of milliseconds.

We observe that Ia and Ib are conflicting objectives, and a trade-off exists between them. The optimal choice depends on the end-user application's sensitivity to restoration time and the cost impact of introducing restoration in the network, as well as the end-user application's sensitivity to cost.

II. MPLS-based recovery should aim to maximize network reliability and availability. MPLS-based recovery of traffic should aim to minimize the number of single points of failure in the MPLS protected domain.

III. MPLS-based recovery should aim to enhance the reliability of the protected traffic while minimally or predictably degrading the traffic carried by the diverted resources.

IV. MPLS-based recovery techniques should aim to be applicable for protection of traffic at various granularities. For example, it should be possible to specify MPLS-based recovery for a portion of the traffic on an individual path, for all traffic on an individual path, or for all traffic on a group of paths. Note that a path is used as a general term and includes the notion of a link, IP route, or LSP.

V. MPLS-based recovery techniques may be applicable for an entire end-to-end path or for segments of an end-to-end path.

VI. MPLS-based recovery mechanisms should aim to take into consideration the recovery actions of lower layers. MPLS-based mechanisms should not trigger lower layer protection switching.

VII. MPLS-based recovery mechanisms should aim to minimize the loss of data and packet reordering during recovery operations.
(The current MPLS specification itself has no explicit requirement on reordering.)

VIII. MPLS-based recovery mechanisms should aim to minimize the state overhead incurred for each recovery path maintained.

IX. MPLS-based recovery mechanisms should aim to preserve the constraints on traffic after switchover, if desired. That is, if desired, the recovery path should meet the resource requirements of, and achieve the same performance characteristics as, the working path.

We observe that some of the above are conflicting goals, and real deployment will often involve engineering compromises based on a variety of factors such as cost, end-user application requirements, network efficiency, and revenue considerations. Thus, these goals are subject to trade-offs based on the above considerations.

2. Contributing Authors

This document was the collective work of several individuals over a period of two and a half years. The text and content of this document were contributed by the editors and the co-authors listed below. (The contact information for the editors appears in Section 10, and is not repeated below.)

   Ben Mack-Crane
   Tellabs Operations, Inc.
   4951 Indiana Avenue
   Lisle, IL 60532
   Phone: (630) 512-7255
   Ben.Mack-Crane@tellabs.com

   Srinivas Makam
   Eshernet, Inc.
   1712 Ada Ct.
   Naperville, IL 60540
   Phone: (630) 308-3213
   Smakam60540@yahoo.com

   Ken Owens
   Erlang Technology, Inc.
   345 Marshall Ave., Suite 300
   St. Louis, MO 63119
   Phone: (314) 918-1579
   keno@erlangtech.com

   Changcheng Huang
   Carleton University
   Minto Center, Rm. 3082
   1125 Colonel By Drive
   Ottawa, Ont. K1S 5B6 Canada
   Phone: (613) 520-2600 x2477
   Changcheng.Huang@sce.carleton.ca

   Jon Weil
   Nortel Networks
   Harlow Laboratories
   London Road
   Harlow, Essex CM17 9NA, UK
   Phone: +44 (0)1279 403935
   jonweil@nortelnetworks.com

   Brad Cain
   Storigen Systems
   650 Suffolk Street
   Lowell, MA 01854
   Phone: (978) 323-4454
   bcain@storigen.com

   Loa Andersson
   Utfors AB
   Rasundavagen 12, Box 525
   169 29 Solna, Sweden
   Phone: +46 8 5270 5038
   loa.andersson@utfors.se

   Bilel Jamoussi
   Nortel Networks
   3 Federal Street, BL3-03
   Billerica, MA 01821, USA
   Phone: (978) 288-4506
   jamoussi@nortelnetworks.com

   Angela Chiu
   Celion Networks, Inc.
   One Shiela Drive, Suite 2
   Tinton Falls, NJ 07724
   Phone: (732) 345-3441
   angela.chiu@celion.com

   Seyhan Civanlar
   Lemur Networks, Inc.
   135 West 20th Street, 5th Floor
   New York, NY 10011
   Phone: (212) 367-7676
   scivanlar@lemurnetworks.com

3. Overview

There are several options for providing protection of traffic. The most generic requirement is the specification of whether recovery should be via Layer 3 (or IP) rerouting or via MPLS protection switching or rerouting actions.

Generally, network operators aim to provide the fastest and the best protection mechanism that can be provided at a reasonable cost. The higher the level of protection, the more resources are consumed. Therefore, it is expected that network operators will offer a spectrum of service levels. MPLS-based recovery should give the flexibility to select the recovery mechanism, to choose the granularity at which traffic is protected, and to choose the specific types of traffic that are protected, in order to give operators more control over that trade-off.
With MPLS-based recovery, it can be possible to provide different levels of protection for different classes of service, based on their service requirements. For example, using approaches outlined below, a Virtual Leased Line (VLL) service or real-time applications like Voice over IP (VoIP) may be supported using link/node protection together with pre-established, pre-reserved path protection. Best effort traffic, on the other hand, may use path protection that is established on demand, or may simply rely on IP re-route or higher layer recovery mechanisms. As another example of their range of application, MPLS-based recovery strategies may be used to protect traffic not originally flowing on label switched paths, such as IP traffic that is normally routed hop-by-hop, as well as traffic forwarded on label switched paths.

3.1. Recovery Models

There are two basic models for path recovery: rerouting and protection switching.

Protection switching and rerouting, as defined below, may be used together. For example, protection switching to a recovery path may be used for rapid restoration of connectivity, while rerouting determines a new optimal network configuration, rearranging paths, as needed, at a later time.

3.1.1 Rerouting

Recovery by rerouting is defined as establishing new paths or path segments on demand for restoring traffic after the occurrence of a fault. The new paths may be based upon fault information, network routing policies, pre-defined configurations, and network topology information. Thus, upon detecting a fault, paths or path segments to bypass the fault are established using signaling.

Once the network routing algorithms have converged after a fault, it may be preferable, in some cases, to reoptimize the network by performing a reroute based on the current state of the network and network policies. This is discussed further in Section 4.8.

In terms of the principles defined in Section 4, reroute recovery employs paths established-on-demand with resources reserved-on-demand.

3.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery path or path segment, based upon network routing policies, the restoration requirements of the traffic on the working path, and administrative considerations. The recovery path may or may not be link and node disjoint with the working path. However, if the recovery path shares sources of failure with the working path, the overall reliability of the construct is degraded. When a fault is detected, the protected traffic is switched over to the recovery path(s) and restored.

In terms of the principles defined in Section 4, protection switching employs pre-established recovery paths and, if resource reservation is required on the recovery path, pre-reserved resources. The various sub-types of protection switching are detailed in Section 4.4 of this document.

3.2. The Recovery Cycles

There are three defined recovery cycles: the MPLS Recovery Cycle, the MPLS Reversion Cycle, and the Dynamic Re-routing Cycle. The first cycle detects a fault and restores traffic onto MPLS-based recovery paths. If the recovery path is non-optimal, the cycle may be followed by either of the two latter cycles to achieve an optimized network again.
The reversion cycle applies to explicitly routed traffic that does not rely on any dynamic routing protocols to be converged. The dynamic re-routing cycle applies to traffic that is forwarded based on hop-by-hop routing.

3.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1. Definitions and a key to abbreviations follow.

   --Network Impairment
   |    --Fault Detected
   |    |    --Start of Notification
   |    |    |    --Start of Recovery Operation
   |    |    |    |    --Recovery Operation Complete
   |    |    |    |    |    --Path Traffic Restored
   |    |    |    |    |    |
   |    |    |    |    |    |
   v    v    v    v    v    v
   ----------------------------------------------------------------
   | T1 | T2 | T3 | T4 | T5 |

              Figure 1. MPLS Recovery Cycle Model

The various timing measures used in the model are described below.

   T1  Fault Detection Time
   T2  Hold-off Time
   T3  Notification Time
   T4  Recovery Operation Time
   T5  Traffic Restoration Time

Definitions of the recovery cycle times are as follows:

Fault Detection Time

The time between the occurrence of a network impairment and the moment the fault is detected by MPLS-based recovery mechanisms. This time may be highly dependent on lower layer protocols.

Hold-Off Time

The configured waiting time between the detection of a fault and taking MPLS-based recovery action, to allow time for lower layer protection to take effect. The Hold-off Time may be zero.

Note: The Hold-Off Time may occur after the Notification Time interval if the node responsible for the switchover, the Path Switch LSR (PSL), rather than the detecting LSR, is configured to wait.

Notification Time

The time between initiation of a fault indication signal (FIS) by the LSR detecting the fault and the time at which the Path Switch LSR (PSL) begins the recovery operation. This is zero if the PSL detects the fault itself or infers a fault from such events as an adjacency failure.

Note: If the PSL detects the fault itself, there still may be a Hold-Off Time period between detection and the start of the recovery operation.

Recovery Operation Time

The time between the first and last recovery actions. This may include message exchanges between the PSL and PML to coordinate recovery actions.

Traffic Restoration Time

The time between the last recovery action and the time that the traffic (if present) is completely recovered. This interval is intended to account for the time required for traffic to once again arrive at the point in the network that experienced disrupted or degraded service due to the occurrence of the fault (e.g., the PML). This time may depend on the location of the fault, the recovery mechanism, and the propagation delay along the recovery path.
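The total time from a network impairment to restored path traffic is simply the sum of the five intervals above, T1 through T5. The fragment below is an illustrative sketch only (Python, with assumed example values; it is not part of the framework) showing this composition.

   # Illustrative sketch only: composing the recovery cycle intervals
   # T1..T5 of Figure 1.  All numbers are assumed example values (in
   # milliseconds), not measurements or requirements.

   def total_recovery_time(t1_detect, t2_holdoff, t3_notify,
                           t4_recovery_op, t5_restore):
       """Time from network impairment to restored path traffic."""
       return (t1_detect + t2_holdoff + t3_notify
               + t4_recovery_op + t5_restore)

   # Example: fast detection, no hold-off, FIS-based notification.
   print(total_recovery_time(t1_detect=10, t2_holdoff=0, t3_notify=15,
                             t4_recovery_op=20, t5_restore=5))   # -> 50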
3.2.2 MPLS Reversion Cycle Model

Protection switching in revertive mode requires the traffic to be switched back to a preferred path when the fault on that path is cleared. The MPLS reversion cycle model is illustrated in Figure 2. Note that the cycle shown below comes after the recovery cycle shown in Figure 1.

   --Network Impairment Repaired
   |    --Fault Cleared
   |    |    --Path Available
   |    |    |    --Start of Reversion Operation
   |    |    |    |    --Reversion Operation Complete
   |    |    |    |    |    --Traffic Restored on Preferred Path
   |    |    |    |    |    |
   |    |    |    |    |    |
   v    v    v    v    v    v
   -----------------------------------------------------------------
   | T7 | T8 | T9 | T10| T11|

              Figure 2. MPLS Reversion Cycle Model

The various timing measures used in the model are described below.

   T7   Fault Clearing Time
   T8   Wait-to-Restore Time
   T9   Notification Time
   T10  Reversion Operation Time
   T11  Traffic Restoration Time

Note that time T6 (not shown above) is the time for which the network impairment is not repaired and traffic is flowing on the recovery path.

Definitions of the reversion cycle times are as follows:

Fault Clearing Time

The time between the repair of a network impairment and the time that MPLS-based mechanisms learn that the fault has been cleared. This time may be highly dependent on lower layer protocols.

Wait-to-Restore Time

The configured waiting time between the clearing of a fault and MPLS-based recovery action(s). Waiting time may be needed to ensure that the path is stable and to avoid flapping in cases where a fault is intermittent. The Wait-to-Restore Time may be zero.

Note: The Wait-to-Restore Time may occur after the Notification Time interval if the PSL is configured to wait.

Notification Time

The time between initiation of a fault recovery signal (FRS) by the LSR clearing the fault and the time at which the Path Switch LSR begins the reversion operation. This is zero if the PSL clears the fault itself.

Note: If the PSL clears the fault itself, there still may be a Wait-to-Restore Time period between fault clearing and the start of the reversion operation.

Reversion Operation Time

The time between the first and last reversion actions. This may include message exchanges between the PSL and PML to coordinate reversion actions.

Traffic Restoration Time

The time between the last reversion action and the time that traffic (if present) is completely restored on the preferred path. This interval is expected to be quite small, since both paths are working and care may be taken to limit the traffic disruption (e.g., using "make before break" techniques and synchronous switch-over).

In practice, the only interesting times in the reversion cycle are the Wait-to-Restore Time and the Traffic Restoration Time (or some other measure of traffic disruption). Given that both paths are available, there is no need for rapid operation, and a well-controlled switch-back with minimal disruption is desirable.

3.2.3 Dynamic Re-routing Cycle Model

Dynamic rerouting aims to bring the IP network to a stable state after a network impairment has occurred. A re-optimized network is achieved after the routing protocols have converged, and the traffic is moved from a recovery path to a (possibly) new working path. The steps involved in this mode are illustrated in Figure 3.
Note that the cycle shown below may be overlaid on the recovery cycle shown in Figure 1 or the reversion cycle shown in Figure 2, or both (in the event that both the recovery cycle and the reversion cycle take place before the routing protocols converge), and after the convergence of the routing protocols it is determined (based on on-line algorithms or off-line traffic engineering tools, network configuration, or a variety of other possible criteria) that there is a better route for the working path.

   --Network Enters a Semi-stable State after an Impairment
   |     --Dynamic Routing Protocols Converge
   |     |     --Initiate Setup of New Working Path between PSL and PML
   |     |     |     --Switchover Operation Complete
   |     |     |     |     --Traffic Moved to New Working Path
   |     |     |     |     |
   |     |     |     |     |
   v     v     v     v     v
   -----------------------------------------------------------------
   | T12 | T13 | T14 | T15 |

              Figure 3. Dynamic Rerouting Cycle Model

The various timing measures used in the model are described below.

   T12  Network Route Convergence Time
   T13  Hold-down Time (optional)
   T14  Switchover Operation Time
   T15  Traffic Restoration Time

Network Route Convergence Time

We define the network route convergence time as the time taken for the network routing protocols to converge and for the network to reach a stable state.

Hold-down Time

We define the hold-down period as a bounded time for which a recovery path must be used. In some scenarios, it may be difficult to determine if the working path is stable. In these cases, a hold-down time may be used to prevent excess flapping of traffic between a working and a recovery path.

Switchover Operation Time

The time between the first and last switchover actions. This may include message exchanges between the PSL and PML to coordinate the switchover actions.

As an example of the recovery cycles, we present the sequence of events that occurs after a network impairment when a protection switch is followed by dynamic rerouting:

I.    A link or path fault occurs.
II.   Signaling (FIS) is initiated for the detected fault.
III.  The FIS arrives at the PSL.
IV.   The PSL initiates a protection switch to a pre-configured recovery path.
V.    The PSL switches over the traffic from the working path to the recovery path.
VI.   The network enters a semi-stable state.
VII.  Dynamic routing protocols converge after the fault, and a new working path is calculated (based, for example, on some of the criteria mentioned earlier in this section).
VIII. A new working path is established between the PSL and the PML (the assumption is that the PSL and PML have not changed).
IX.   Traffic is switched over to the new working path.
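The following fragment is an illustrative sketch (Python; all class and method names are hypothetical, not defined by this framework) of how a PSL might act in the sequence above: a protection switch on receipt of the FIS (steps III-V), followed by a move to the re-optimized working path once it has been established (steps VIII-IX).

   # Illustrative sketch only; hypothetical names.
   class PathSwitchLSR:
       def __init__(self, working_path, recovery_path):
           self.active_path = working_path
           self.recovery_path = recovery_path

       def on_fis(self):
           # Steps III-V: trigger the pre-configured protection switch.
           self.active_path = self.recovery_path

       def on_new_working_path(self, new_working_path):
           # Steps VIII-IX: after routing convergence, traffic is moved
           # onto the newly established working path.
           self.active_path = new_working_path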
3.3. Definitions and Terminology

This document assumes the terminology given in [1], and, in addition, introduces the following new terms.

3.3.1 General Recovery Terminology

Rerouting

A recovery mechanism in which the recovery path or path segments are created dynamically after the detection of a fault on the working path. In other words, a recovery mechanism in which the recovery path is not pre-established.

Protection Switching

A recovery mechanism in which the recovery path or path segments are created prior to the detection of a fault on the working path. In other words, a recovery mechanism in which the recovery path is pre-established.

Working Path

The protected path that carries traffic before the occurrence of a fault. The working path exists between a PSL and PML. The working path can be of different kinds: a hop-by-hop routed path, a trunk, a link, an LSP, or part of a multipoint-to-point LSP.

Synonyms for a working path are primary path and active path.

Recovery Path

The path by which traffic is restored after the occurrence of a fault. In other words, the path on which the traffic is directed by the recovery mechanism. The recovery path is established by MPLS means. The recovery path can either be an equivalent recovery path, ensuring no reduction in quality of service, or a limited recovery path, which does not guarantee the same quality of service (or some other criteria of performance) as the working path. A limited recovery path is not expected to be used for an extended period of time.

Synonyms for a recovery path are: back-up path, alternative path, and protection path.

Protection Counterpart

The "other" path when discussing pre-planned protection switching schemes. The protection counterpart for the working path is the recovery path, and vice-versa.

Path Group (PG)

A logical bundling of multiple working paths, each of which is routed identically between a Path Switch LSR and a Path Merge LSR.

Protected Path Group (PPG)

A path group that requires protection.

Protected Traffic Portion (PTP)

The portion of the traffic on an individual path that requires protection. For example, code points in the EXP bits of the shim header may identify a protected portion.

Path Switch LSR (PSL)

An LSR that is responsible for switching or replicating the traffic between the working path and the recovery path.

Path Merge LSR (PML)

An LSR that is responsible for receiving the recovery path traffic and either merging the traffic back onto the working path or, if it is itself the destination, passing the traffic on to the higher layer protocols.

Point of Repair (POR)

An LSR that is set up to perform MPLS recovery. In other words, an LSR that is responsible for effecting the repair of an LSP. The POR, for example, can be a PSL or a PML, depending on the type of recovery scheme employed.

Intermediate LSR

An LSR on a working or recovery path that is neither a PSL nor a PML for that path.

Bypass Tunnel

A path that serves to back up a set of working paths using the label stacking approach [1]. The working paths and the bypass tunnel must all share the same Path Switch LSR (PSL) and Path Merge LSR (PML).

Switch-Over

The process of switching the traffic from the path that the traffic is flowing on onto one or more alternate path(s). This may involve moving traffic from a working path onto one or more recovery paths, or moving traffic from a recovery path(s) onto a more optimal working path(s).

Switch-Back

The process of returning the traffic from one or more recovery paths back to the working path(s).

Revertive Mode

A recovery mode in which traffic is automatically switched back from the recovery path to the original working path upon the restoration of the working path to a fault-free condition. This assumes that a failed working path does not automatically surrender resources to the network.

Non-revertive Mode

A recovery mode in which traffic is not automatically switched back to the original working path after this path is restored to a fault-free condition.
(Depending on the configuration, the original working path may, upon moving to a fault-free condition, become the recovery path, or it may be used for new working traffic and no longer be associated with its original recovery path.)

MPLS Protection Domain

The set of LSRs over which a working path and its corresponding recovery path are routed.

MPLS Protection Plan

The set of all LSP protection paths and the mapping from working to protection paths deployed in an MPLS protection domain at a given time.

Liveness Message

A message exchanged periodically between two adjacent LSRs that serves as a link probing mechanism. It provides an integrity check of the forward and the backward directions of the link between the two LSRs, as well as a check of neighbor aliveness.

Path Continuity Test

A test that verifies the integrity and continuity of a path or path segment. The details of such a test are beyond the scope of this draft. (This could be accomplished, for example, by transmitting a control message along the same links and nodes as the data traffic, or it could similarly be measured by the absence of traffic and by providing feedback.)

3.3.2 Failure Terminology

Path Failure (PF)

Path failure is a fault detected by MPLS-based recovery mechanisms, defined as the failure of the liveness message test or of a path continuity test, indicating that path connectivity is lost.

Path Degraded (PD)

Path degraded is a fault detected by MPLS-based recovery mechanisms that indicates that the quality of the path is unacceptable.

Link Failure (LF)

A lower layer fault indicating that link continuity is lost. This may be communicated to the MPLS-based recovery mechanisms by the lower layer.

Link Degraded (LD)

A lower layer indication to MPLS-based recovery mechanisms that the link is performing below an acceptable level.

Fault Indication Signal (FIS)

A signal that indicates that a fault along a path has occurred. It is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches an LSR that is set up to perform MPLS recovery (the POR). The FIS is transmitted periodically by the node/nodes closest to the point of failure, for some configurable length of time.

Fault Recovery Signal (FRS)

A signal that indicates that a fault along a working path has been repaired. Again, like the FIS, it is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches the LSR that performs recovery of the original path. The FRS is transmitted periodically by the node/nodes closest to the point of failure, for some configurable length of time.

3.4. Abbreviations

   FIS:  Fault Indication Signal.
   FRS:  Fault Recovery Signal.
   LD:   Link Degraded.
   LF:   Link Failure.
   PD:   Path Degraded.
   PF:   Path Failure.
   PG:   Path Group.
   PML:  Path Merge LSR.
   POR:  Point of Repair.
   PPG:  Protected Path Group.
   PSL:  Path Switch LSR.
   PTP:  Protected Traffic Portion.

4. MPLS-based Recovery Principles

MPLS-based recovery refers to the ability to effect quick and complete restoration of traffic affected by a fault in an MPLS-enabled network. The fault may be detected on the IP layer or in lower layers over which IP traffic is transported.
The fastest MPLS recovery is assumed to be achieved with protection switching, with an MPLS LSR switch-over completion time comparable, or equivalent, to the 50 ms switch-over completion time of the SONET layer. This section provides a discussion of the concepts and principles of MPLS-based recovery. The concepts are presented in terms of atomic or primitive terms that may be combined to specify recovery approaches. We do not make any assumptions about the underlying Layer 1 or Layer 2 transport mechanisms or their recovery mechanisms.

4.1. Configuration of Recovery

An LSR may support any or all of the following recovery options:

Default-recovery (No MPLS-based recovery enabled):
Traffic on the working path is recovered only via Layer 3 or IP rerouting or by some lower layer mechanism such as SONET APS. This is equivalent to having no MPLS-based recovery. This option may be used for low priority traffic or for traffic that is recovered in another way (for example, load shared traffic on parallel working paths may be automatically recovered upon a fault along one of the working paths by distributing it among the remaining working paths).

Recoverable (MPLS-based recovery enabled):
This working path is recovered using one or more recovery paths, either via rerouting or via protection switching.

4.2. Initiation of Path Setup

There are three options for the initiation of the recovery path setup. The active and recovery paths may be established by using either RSVP-TE [4][5] or CR-LDP [6].

Pre-established:

This is the same as the protection switching option. Here, a recovery path(s) is established prior to any failure on the working path. The path selection can either be determined by an administrative centralized tool, or chosen based on some algorithm implemented at the PSL and possibly intermediate nodes. To guard against the situation when the pre-established recovery path fails before or at the same time as the working path, the recovery path should have secondary configuration options as explained in Section 3.3 below.

Pre-qualified:

A pre-established path need not be created; it may be pre-qualified. A pre-qualified recovery path is not created expressly for protecting the working path, but instead is a path created for other purposes that is designated as a recovery path after determining that it is an acceptable alternative for carrying the working path traffic. Variants include the case where an optical path or trail is configured, but no switches are set.

Established-on-Demand:

This is the same as the rerouting option. Here, a recovery path is established after a failure on its working path has been detected and notified to the PSL.

4.3. Initiation of Resource Allocation

A recovery path may support the same traffic contract as the working path, or it may not. We will distinguish these two situations by using different additive terms. If the recovery path is capable of replacing the working path without degrading service, it will be called an equivalent recovery path. If the recovery path lacks the resources (or resource reservations) to replace the working path without degrading service, it will be called a limited recovery path. Based on this, there are two options for the initiation of resource allocation:

Pre-reserved:

This option applies only to protection switching. Here, a pre-established recovery path reserves the required resources on all hops along its route during its establishment. Although the reserved resources (e.g., bandwidth and/or buffers) at each node cannot be used to admit more working paths, they are available to be used by all traffic that is present at the node before a failure occurs.

Reserved-on-Demand:

This option may apply either to rerouting or to protection switching. Here, a recovery path reserves the required resources after a failure on the working path has been detected and notified to the PSL, and before the traffic on the working path is switched over to the recovery path.

Note that under both the options above, depending on the amount of resources reserved on the recovery path, it could be either an equivalent recovery path or a limited recovery path.
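As a way of visualizing how these atomic options combine into a recovery approach, the sketch below (Python; illustrative only, all type and field names are hypothetical) models a recovery configuration as a choice among the options of Sections 4.1 through 4.3.

   # Illustrative sketch only; the values mirror the atomic options of
   # Sections 4.1-4.3.  These names are hypothetical, not a defined API.
   from dataclasses import dataclass
   from enum import Enum

   class PathSetup(Enum):
       PRE_ESTABLISHED = "pre-established"              # protection switching
       PRE_QUALIFIED = "pre-qualified"
       ESTABLISHED_ON_DEMAND = "established-on-demand"  # rerouting

   class ResourceAllocation(Enum):
       PRE_RESERVED = "pre-reserved"                    # protection switching only
       RESERVED_ON_DEMAND = "reserved-on-demand"

   @dataclass
   class RecoveryConfiguration:
       mpls_recovery_enabled: bool              # Section 4.1: default vs. recoverable
       path_setup: PathSetup                    # Section 4.2
       resource_allocation: ResourceAllocation  # Section 4.3

   # Example: protection switching with pre-reserved resources.
   cfg = RecoveryConfiguration(True, PathSetup.PRE_ESTABLISHED,
                               ResourceAllocation.PRE_RESERVED)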
902 Based on this, there are two options for the initiation of resource 903 allocation: 905 Pre-reserved: 907 This option applies only to protection switching. Here a pre- 908 established recovery path reserves required resources on all hops 909 along its route during its establishment. Although the reserved 910 resources (e.g., bandwidth and/or buffers) at each node cannot be 911 used to admit more working paths, they are available to be used by 912 all traffic that is present at the node before a failure occurs. 914 Reserved-on-Demand: 916 This option may apply either to rerouting or to protection switching. 917 Here a recovery path reserves the required resources after a failure 918 on the working path has been detected and notified to the PSL and 919 before the traffic on the working path is switched over to the 920 recovery path. 922 Note that under both the options above, depending on the amount of 923 resources reserved on the recovery path, it could either be an 924 equivalent recovery path or a limited recovery path. 926 4.4. Scope of Recovery 927 4.4.1 Topology 929 4.4.1.1 Local Repair 931 The intent of local repair is to protect against a link or neighbor 932 node fault and to minimize the amount of time required for failure 933 propagation. In local repair (also known as local recovery), the node 934 immediately upstream of the fault is the one to initiate recovery 935 (either rerouting or protection switching). Local repair can be of 936 two types: 938 Link Recovery/Restoration 940 In this case, the recovery path may be configured to route around a 941 certain link deemed to be unreliable. If protection switching is 942 used, several recovery paths may be configured for one working path, 943 depending on the specific faulty link that each protects against. 945 Alternatively, if rerouting is used, upon the occurrence of a fault 946 on the specified link, each path is rebuilt such that it detours 947 around the faulty link. 948 In this case, the recovery path need only be disjoint from its 949 working path at a particular link on the working path, and may have 950 overlapping segments with the working path. Traffic on the working 951 path is switched over to an alternate path at the upstream LSR that 952 connects to the failed link. This method is potentially the fastest 953 to perform the switchover, and can be effective in situations where 954 certain path components are much more unreliable than others. 956 Node Recovery/Restoration 958 In this case, the recovery path may be configured to route around a 959 neighbor node deemed to be unreliable. Thus the recovery path is 960 disjoint from the working path only at a particular node and at links 961 associated with the working path at that node. Once again, the 962 traffic on the primary path is switched over to the recovery path at 963 the upstream LSR that directly connects to the failed node, and the 964 recovery path shares overlapping portions with the working path. 966 4.4.1.2 Global Repair 968 The intent of global repair is to protect against any link or node 969 fault on a path or on a segment of a path, with the obvious exception 970 of the faults occurring at the ingress node of the protected path 971 segment. In global repair, the POR is usually distant from the 972 failure and needs to be notified by a FIS. 973 In global repair also, end-to-end path recovery/restoration applies. 974 In many cases, the recovery path can be made completely link and node 975 disjoint with its working path. 
This has the advantage of protecting against all link and node fault(s) on the working path (end-to-end path or path segment). However, it may, in some cases, be slower than local repair, since the fault notification message must now travel to the POR to trigger the recovery action.

4.4.1.3 Alternate Egress Repair

It is possible to restore service without specifically recovering the faulted path. For example, for best effort IP service it is possible to select a recovery path that has a different egress point from the working path (i.e., there is no PML). The recovery path egress must simply be a router that is acceptable for forwarding the FEC carried by the working path (without creating looping). In an engineering context, specific alternative FEC/LSP mappings with alternate egresses can be formed.

This may simplify enhancing the reliability of implicitly constructed MPLS topologies. A PSL may qualify LSP/FEC bindings as candidate recovery paths simply by requiring that they be link and node disjoint with the immediate downstream LSR of the working path.

4.4.1.4 Multi-Layer Repair

Multi-layer repair broadens the network designer's tool set for those cases where multiple network layers can be managed together to achieve overall network goals. Specific criteria for determining when multi-layer repair is appropriate are beyond the scope of this draft.

4.4.1.5 Concatenated Protection Domains

A given service may cross multiple networks, and these may employ different recovery mechanisms. It is possible to concatenate protection domains so that service recovery can be provided end-to-end. It is considered that the recovery mechanisms in different domains may operate autonomously, and that multiple points of attachment may be used between domains (to ensure there is no single point of failure). Alternate egress repair requires management of concatenated domains, in that an explicit MPLS point of failure (the PML) is by definition excluded. Details of concatenated protection domains are beyond the scope of this draft.

4.4.2 Path Mapping

Path mapping refers to the methods of mapping traffic from a faulty working path onto the recovery path. There are several options for this, as described below. Note that the options below should be viewed as atomic terms that only describe how the working and protection paths are mapped to each other. The issues of resource reservation along these paths, and how switchover is actually performed, lead to the more commonly used composite terms, such as 1+1 and 1:1 protection, which were described in Section 2.1.

1-to-1 Protection

In 1-to-1 protection, the working path has a designated recovery path that is only to be used to recover that specific working path.

n-to-1 Protection

In n-to-1 protection, up to n working paths are protected using only one recovery path. If the intent is to protect against any single fault on any of the working paths, the n working paths should be diversely routed between the same PSL and PML. In some cases, handshaking between the PSL and PML may be required to complete the recovery, the details of which are beyond the scope of this draft.

n-to-m Protection

In n-to-m protection, up to n working paths are protected using m recovery paths.
Once again, if the intent is to protect against any single fault on any of the n working paths, the n working paths and the m recovery paths should be diversely routed between the same PSL and PML. In some cases, handshaking between the PSL and PML may be required to complete the recovery, the details of which are beyond the scope of this draft. n-to-m protection is for further study.

Split Path Protection

In split path protection, multiple recovery paths are allowed to carry the traffic of a working path based on a certain configurable load splitting ratio. This is especially useful when no single recovery path can be found that can carry the entire traffic of the working path in case of a fault. Split path protection may require handshaking between the PSL and the PML(s), and may require the PML(s) to correlate the traffic arriving on multiple recovery paths with the working path. Although this is an attractive option, the details of split path protection are beyond the scope of this draft, and are for further study.

4.4.3 Bypass Tunnels

It may be convenient, in some cases, to create a "bypass tunnel" for a PPG between a PSL and PML, thereby allowing multiple recovery paths to be transparent to intervening LSRs [2]. In this case, one LSP (the tunnel) is established between the PSL and PML following an acceptable route, and a number of recovery paths are supported through the tunnel via label stacking. A bypass tunnel can be used with any of the path mapping options discussed in the previous section.

As with recovery paths, the bypass tunnel may or may not have resource reservations sufficient to provide recovery without service degradation. It is possible that the bypass tunnel may have sufficient resources to recover some number of working paths, but not all at the same time. If the number of recovery paths carrying traffic in the tunnel at any given time is restricted, this is similar to the n-to-1 or n-to-m protection cases mentioned in Section 4.4.2.

4.4.4 Recovery Granularity

Another dimension of recovery considers the amount of traffic requiring protection. This may range from a fraction of a path to a bundle of paths.

4.4.4.1 Selective Traffic Recovery

This option allows for the protection of a fraction of the traffic within the same path. The portion of the traffic on an individual path that requires protection is called a protected traffic portion (PTP). A single path may carry different classes of traffic, with different protection requirements. The protected portion of this traffic may be identified by its class, as, for example, via the EXP bits in the MPLS shim header or via the priority bit in the ATM header.

4.4.4.2 Bundling

Bundling is a technique used to group multiple working paths together in order to recover them simultaneously. The logical bundling of multiple working paths requiring protection, each of which is routed identically between a PSL and a PML, is called a protected path group (PPG). When a fault occurs on the working path carrying the PPG, the PPG as a whole can be protected either by being switched to a bypass tunnel or by being switched to a recovery path.
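The fragment below is an illustrative sketch (Python; hypothetical names, not a defined MPLS API) of how the bypass tunnel of Section 4.4.3 and the bundling just described fit together via label stacking: at the PSL, each protected path's label is swapped for the recovery path's inner label and the bypass tunnel label is pushed on top, so intervening LSRs switch only on the outer tunnel label; the PML removes the tunnel label and merges the traffic using the inner label.

   # Illustrative sketch only; hypothetical names.  The label stack is
   # modeled as a list whose last element is the top (outermost) label.

   def psl_switch_to_bypass(label_stack, recovery_label, tunnel_label):
       # Swap the working-path label for the recovery path's inner label
       # (meaningful only to the PML), then push the bypass tunnel label.
       stack = list(label_stack)
       stack[-1] = recovery_label
       stack.append(tunnel_label)
       return stack

   def pml_merge_from_bypass(label_stack, working_label):
       # Pop the tunnel label (unless already removed by penultimate-hop
       # popping) and swap back to the working path's outgoing label.
       stack = list(label_stack)
       stack.pop()
       stack[-1] = working_label
       return stack

   # Example: a packet carrying label 17 on a protected working path.
   print(psl_switch_to_bypass([17], recovery_label=42, tunnel_label=99))
   # -> [42, 99]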
4.4.5 Recovery Path Resource Use

In the case of pre-reserved recovery paths, there is the question of what use these resources may be put to when the recovery path is not in use. There are three options:

Dedicated-resource:
If the recovery path resources are dedicated, they may not be used for anything except carrying the working traffic. For example, in the case of 1+1 protection, the working traffic is always carried on the recovery path. Even if the recovery path is not always carrying the working traffic, it may not be possible or desirable to allow other traffic to use these resources.

Extra-traffic-allowed:
If the recovery path only carries the working traffic when the working path fails, then it is possible to allow extra traffic to use the reserved resources at other times. Extra traffic is, by definition, traffic that can be displaced (without violating service agreements) whenever the recovery path resources are needed for carrying the working path traffic.

Shared-resource:
A shared recovery resource is dedicated for use by multiple primary resources that (according to SRLGs) are not expected to fail simultaneously.

4.5. Fault Detection

MPLS recovery is initiated after the detection of either a lower layer fault or a fault at the IP layer or in the operation of MPLS-based mechanisms. We consider four classes of impairments: Path Failure, Path Degraded, Link Failure, and Link Degraded.

Path Failure (PF) is a fault that indicates to an MPLS-based recovery scheme that the connectivity of the path is lost. This may be detected by a path continuity test between the PSL and PML. Some, and perhaps the most common, path failures may be detected using a link probing mechanism between neighbor LSRs. An example of a probing mechanism is a liveness message that is exchanged periodically along the working path between peer LSRs [3]. For either a link probing mechanism or a path continuity test to be effective, the test message must be guaranteed to follow the same route as the working or recovery path, over the segment being tested. In addition, the path continuity test must take the path merge points into consideration. In the case of a bi-directional link implemented as two unidirectional links, path failure could mean that either one or both unidirectional links are damaged.

Path Degraded (PD) is a fault that indicates to MPLS-based recovery schemes/mechanisms that the path has connectivity, but that the quality of the connection is unacceptable. This may be detected by a path performance monitoring mechanism, or some other mechanism for determining the error rate on the path or some portion of the path. One such mechanism, local to the LSR, is the detection of excessive discarding of packets at an interface, for example due to label mismatch or due to TTL errors.

Link Failure (LF) is an indication from a lower layer that the link over which the path is carried has failed. If the lower layer supports detection and reporting of this fault (that is, any fault that indicates link failure, e.g., SONET LOS), this may be used by the MPLS recovery mechanism. In some cases, using LF indications may provide faster fault detection than using only MPLS-based fault detection mechanisms.

Link Degraded (LD) is an indication from a lower layer that the link over which the path is carried is performing below an acceptable level. If the lower layer supports detection and reporting of this fault, it may be used by the MPLS recovery mechanism. In some cases, using LD indications may provide faster fault detection than using only MPLS-based fault detection mechanisms.
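As an illustration of the liveness-message-based detection of a Path Failure described above, the sketch below (Python; the message period and miss threshold are assumptions, not values defined by this framework) declares PF when no liveness message has been seen from the neighbor for a configured number of intervals.

   # Illustrative sketch only: declaring Path Failure (PF) when liveness
   # messages from a neighbor LSR stop arriving.  Period and threshold
   # are assumed example values.
   import time

   class LivenessMonitor:
       def __init__(self, interval_s=0.1, miss_threshold=3):
           self.interval_s = interval_s          # liveness message period
           self.miss_threshold = miss_threshold  # missed messages before PF
           self.last_seen = time.monotonic()

       def on_liveness_message(self):
           # Called whenever a liveness message arrives from the neighbor.
           self.last_seen = time.monotonic()

       def path_failure(self):
           # PF is declared if no liveness message has been seen for
           # miss_threshold consecutive intervals.
           elapsed = time.monotonic() - self.last_seen
           return elapsed > self.interval_s * self.miss_threshold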
Link Degraded (LD) is an indication from a lower layer that the link
over which the path is carried is performing below an acceptable
level. If the lower layer supports detection and reporting of this
fault, it may be used by the MPLS recovery mechanism. In some cases,
using LD indications may provide faster fault detection than using
only MPLS-based fault detection mechanisms.

4.6. Fault Notification

MPLS-based recovery relies on rapid and reliable notification of
faults. Once a fault is detected, the node that detected the fault
must determine whether the fault is severe enough to require path
recovery. If the node is not capable of initiating direct action
(e.g., acting as a point of repair, POR), the node should send out a
notification of the fault by transmitting a FIS to the POR. This can
take several forms:

(i) control plane messaging: relayed hop-by-hop along the path of the
failed LSP until a POR is reached.
(ii) user plane messaging: sent to the PML, which may take corrective
action (as a POR for 1+1) or communicate with a POR (for 1:n) by any
of several means:
   - control plane messaging
   - user plane return path (either through a bi-directional LSP or
     via other means)

Since the FIS is a control message, it should be transmitted with
high priority to ensure that it propagates rapidly towards the
affected POR(s). Depending on how fault notification is configured in
the LSRs of an MPLS domain, the FIS could be sent either as a Layer 2
or a Layer 3 packet [3]. The use of a Layer 2-based notification
requires a direct Layer 2 path to the POR. One example of a FIS is
the liveness message sent by a downstream LSR to its upstream
neighbor with an optional fault notification field set; the FIS may
also be conveyed implicitly by a teardown message, or carried in a
separate fault notification packet. An intermediate LSR should
identify on which of its incoming links to propagate the FIS.

4.7. Switch-Over Operation

4.7.1 Recovery Trigger

The activation of an MPLS protection switch following the detection
or notification of a fault requires a trigger mechanism at the PSL.
MPLS protection switching may be initiated due to automatic inputs or
external commands. The automatic activation of an MPLS protection
switch results from a response to defect or fault conditions detected
at the PSL, or to fault notifications received at the PSL. It is
possible that the fault detection and trigger mechanisms are
combined, as is the case when a PF, PD, LF, or LD is detected at a
PSL and triggers a protection switch to the recovery path. In most
cases, however, the detection and trigger mechanisms are distinct,
involving the detection of a fault at some intermediate LSR followed
by the propagation of a fault notification to the POR via the FIS,
which serves as the protection switch trigger at the POR. MPLS
protection switching in response to external commands results when
the operator initiates a protection switch by a command to a POR (or,
alternatively, by a configuration command to an intermediate LSR,
which transmits the FIS towards the POR).
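The trigger logic just described can be summarized in the following
non-normative Python sketch of a PSL's decision to activate a
protection switch. The impairment names mirror Section 4.5; the
loss-ratio threshold used for the soft PD/LD defects (discussed in
the note below) is an illustrative assumption.

   PD_LOSS_THRESHOLD = 0.05   # assumed provisioned loss ratio for PD/LD

   def should_switch(event: str, loss_ratio: float = 0.0,
                     operator_command: bool = False) -> bool:
       """Return True if the PSL/POR should activate a protection switch."""
       if operator_command:         # external command to the POR
           return True
       if event in ("PF", "LF"):    # hard failures trigger directly
           return True
       if event in ("PD", "LD"):    # soft defects require a threshold
           return loss_ratio > PD_LOSS_THRESHOLD
       if event == "FIS":           # fault notification from a remote LSR
           return True
       return False

   assert should_switch("PF")
   assert not should_switch("PD", loss_ratio=0.01)
   assert should_switch("FIS")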
Note that the PF fault applies to hard failures (fiber cuts,
transmitter failures, or LSR fabric failures), as does the LF fault,
with the difference that the LF is a lower layer impairment that may
be communicated to MPLS-based recovery mechanisms. The PD (or LD)
fault, on the other hand, applies to soft defects (excessive errors
due to noise on the link, for instance). The PD (or LD) results in a
fault declaration only when the percentage of lost packets exceeds a
given threshold, which is provisioned and may be set based on the
service level agreement(s) in effect between a service provider and a
customer.

4.7.2 Recovery Action

After a fault is detected or a FIS is received by the POR, the
recovery action involves either a rerouting or a protection switching
operation. In both scenarios, the next hop label forwarding entry for
a recovery path is bound to the working path.

4.8. Post Recovery Operation

When traffic is flowing on the recovery path, a decision can be made
either to let the traffic remain on the recovery path and treat it as
the new working path, or to switch the traffic to the old working
path or to a new working path. This post-recovery operation comes in
two styles: one in which the protection counterparts, i.e., the
working and recovery paths, are fixed or "pinned" to their routes,
and one in which the PSL or another network entity with real-time
knowledge of the failure dynamically performs re-establishment or
controlled rearrangement of the paths comprising the protected
service.

4.8.1 Fixed Protection Counterparts

For fixed protection counterparts, the PSL is pre-configured with the
appropriate behavior to take when the original fixed path is restored
to service. The choices are revertive and non-revertive mode. The
choice will typically depend on the relative costs of the working and
protection paths, and on the tolerance of the service to the effects
of switching paths yet again. These protection modes indicate whether
or not there is a preferred path for the protected traffic.

4.8.1.1 Revertive Mode

If the working path is always the preferred path, this path will be
used whenever it is available. Thus, in the event of a fault on this
path, its unused resources will not be reclaimed by the network. If
the working path has a fault, traffic is switched to the recovery
path. In the revertive mode of operation, when the preferred path is
restored, the traffic is automatically switched back to it (a sketch
of this behavior follows the list below).

There are a number of implications of pinned working and recovery
paths:
- upon failure, once traffic is moved to the recovery path, the
  traffic is unprotected until the path defect in the original
  working path is repaired and that path is restored to service.
- upon failure, once traffic is moved to the recovery path, the
  resources associated with the original path remain reserved.
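A non-normative Python sketch of the revertive behavior and of the
two implications above follows. The class and attribute names are
illustrative assumptions; "working_reserved" models the fact that the
original path's resources are not reclaimed, and "protected" models
the exposure of the traffic while it rides the recovery path.

   class PinnedProtectionDomain:
       """Fixed (pinned) working/recovery counterparts, revertive mode."""

       def __init__(self) -> None:
           self.active = "working"        # path currently carrying traffic
           self.working_reserved = True   # working-path resources reserved
           self.protected = True          # traffic is currently protected

       def on_working_fault(self) -> None:
           self.active = "recovery"
           self.protected = False         # unprotected until working repaired
           # working_reserved stays True: resources are not reclaimed

       def on_working_restored(self) -> None:
           """Revertive mode: switch back when the preferred path returns."""
           self.active = "working"
           self.protected = True

   d = PinnedProtectionDomain()
   d.on_working_fault()
   assert d.active == "recovery" and not d.protected and d.working_reserved
   d.on_working_restored()
   assert d.active == "working" and d.protected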
4.8.1.2 Non-revertive Mode

In the non-revertive mode of operation, there is no preferred path,
or it may be desirable to minimize further disruption of the service
brought on by a revertive switching operation. A switch-back to the
original working path is either not desired or not possible, since
the original path may no longer exist after the occurrence of a fault
on that path. If there is a fault on the working path, traffic is
switched to the recovery path. When or if the faulty path (the
original working path) is restored, it may become the recovery path
(either by configuration or, if desired, by management action).

In the non-revertive mode of operation, the working traffic may or
may not be restored to a new optimal working path or to the original
working path. This is because, in some cases, it may be useful to:
(a) administratively perform a protection switch back to the original
working path after gaining further assurances about the integrity of
the path, (b) continue operation on the recovery path, or (c) move
the traffic to a new optimal working path that is calculated based on
the network topology and network policies.

4.8.2 Dynamic Protection Counterparts

For dynamic protection counterparts, when the traffic is switched
over to a recovery path, the association between the original working
path and the recovery path may no longer exist, since the original
path itself may no longer exist after the fault. Instead, when the
network reaches a stable state following routing convergence, the
recovery path may be switched over to a different preferred path,
either based on optimization over the new network topology and
associated information, or based on pre-configured information.

Dynamic protection counterparts assume that, upon failure, the PSL or
another network entity will establish new working paths if another
switch-over is to be performed.

4.8.3 Restoration and Notification

MPLS restoration deals with returning the working traffic from the
recovery path to the original or a new working path. Reversion is
performed by the PSL either upon receiving notification, via the FRS,
that the working path is repaired, or upon receiving notification
that a new working path has been established.

For fixed counterparts in revertive mode, the LSR that detected the
fault on the working path also detects the restoration of the working
path. If the working path had experienced an LF defect, the LSR
detects a return to normal operation via the receipt of a liveness
message from its peer. If the working path had experienced an LD
defect at an LSR interface, the LSR could detect a return to normal
operation via the resumption of error-free packet reception on that
interface. Alternatively, a lower layer that no longer detects an LF
defect may inform the MPLS-based recovery mechanisms at the LSR that
the link to its peer LSR is operational. The LSR then transmits the
FRS to its upstream LSR(s) that were transmitting traffic on the
working path. When the PSL receives the FRS, it switches the working
traffic back to the original working path.

A similar scheme applies to dynamic counterparts, where, for example,
a topology update and/or network convergence may trigger the
installation or setup of new working paths, and a notification may be
sent to the PSL to perform a switch-over.

We note that if there is a way to transmit fault information back
along a recovery path towards a PSL, and if the recovery path is an
equivalent working path, it is possible for the working path and its
recovery path to exchange roles once the original working path is
repaired following a fault. This is because, in that case, the
recovery path effectively becomes the working path, and the restored
working path functions as a recovery path for the original recovery
path. This is important, since it affords the benefits of
non-revertive operation outlined in Section 4.8.1, without leaving
the recovery path unprotected.
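The reversion sequence of Section 4.8.3 can be sketched,
non-normatively, as follows. The FRS is modeled as a direct function
call; in practice it would be carried by the control plane or the
user plane, and all class and method names are illustrative
assumptions.

   class PathSwitchLSR:
       """PSL of a fixed, revertive protection domain."""

       def __init__(self) -> None:
           self.active_path = "recovery"   # traffic switched after a fault

       def receive_frs(self, repaired_path: str) -> None:
           """On FRS receipt, switch the working traffic back."""
           self.active_path = repaired_path

   class DetectingLSR:
       """The LSR that detected the fault also detects the restoration."""

       def __init__(self, upstream_psl: PathSwitchLSR) -> None:
           self.upstream_psl = upstream_psl

       def on_liveness_resumed(self) -> None:
           # a liveness message from the peer shows the LF defect cleared;
           # transmit the FRS upstream toward the PSL
           self.upstream_psl.receive_frs("working")

   psl = PathSwitchLSR()
   DetectingLSR(psl).on_liveness_resumed()
   assert psl.active_path == "working"   # traffic reverted to original path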
4.8.4 Reverting to Preferred Path (or Controlled Rearrangement)

In the revertive mode, a "make before break" restoration switch-over
can be used, which is less disruptive than performing protection
switching upon the occurrence of network impairments. This will
minimize both packet loss and packet reordering. The controlled
rearrangement of paths can also be used to satisfy traffic
engineering requirements for load balancing across an MPLS domain.

4.9. Performance

Resource/performance requirements for recovery paths should be
specified in terms of the following attributes:

I. Resource class attribute:
Equivalent Recovery Class: The recovery path has the same resource
reservations and performance guarantees as the working path. In other
words, the recovery path meets the same SLAs as the working path.
Limited Recovery Class: The recovery path does not have the same
resource reservations and performance guarantees as the working path.

A. Lower Class: The recovery path has lower resource requirements or
less stringent performance requirements than the working path.

B. Best Effort Class: The recovery path is best effort.

II. Priority Attribute:
The recovery path has a priority attribute just like the working path
(i.e., the priority attribute of the associated traffic trunks). It
can have the same priority as the working path or a lower priority.

III. Preemption Attribute:
The recovery path can have the same preemption attribute as the
working path or a lower one.

5. MPLS Recovery Features

The following features are desirable from an operational point of
view:

I. It is desirable that MPLS recovery provide an option to identify
protected path groups (PPGs) and protected traffic portions (PTPs).

II. Each PSL should be capable of performing MPLS recovery upon the
detection of impairments or upon receipt of notifications of
impairments.

III. An MPLS recovery method should not preclude manual protection
switching commands. This implies that it should be possible, under
administrative command, to transfer traffic from a working path to a
recovery path, or to transfer traffic from a recovery path to a
working path once the working path becomes operational following a
fault.

IV. A PSL may be capable of performing either a switch-back to the
original working path after the fault is corrected, or a switch-over
to a new working path upon the discovery or establishment of a more
optimal working path.

V. The recovery model should take into consideration path merging at
intermediate LSRs. If a fault affects the merged segment, all the
paths sharing that merged segment should be able to recover.
Similarly, if a fault affects a non-merged segment, only the path
that is affected by the fault should be recovered, as illustrated in
the sketch below.
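The following non-normative Python sketch illustrates feature V: it
determines which paths must be recovered when a fault affects a given
segment (link). A fault on a merged segment selects every path
sharing that segment, while a fault on a non-merged segment selects
only the affected path. The topology and the LSP names are
illustrative assumptions.

   # each path is described by the ordered list of links it traverses
   paths = {
       "lsp-1": [("A", "B"), ("B", "C"), ("C", "D")],
       "lsp-2": [("E", "B"), ("B", "C"), ("C", "D")],  # merges with lsp-1
       "lsp-3": [("A", "F"), ("F", "D")],
   }

   def paths_to_recover(faulted_link):
       """Return every path that traverses the faulted link."""
       return [name for name, links in paths.items()
               if faulted_link in links]

   print(paths_to_recover(("B", "C")))   # ['lsp-1', 'lsp-2'] (merged)
   print(paths_to_recover(("A", "F")))   # ['lsp-3'] (non-merged)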
6. Comparison Criteria

Possible criteria to use for the comparison of MPLS-based recovery
schemes are as follows:

Recovery Time

We define recovery time as the time required for a recovery path to
be activated (and traffic flowing) after a fault. Recovery Time is
the sum of the Fault Detection Time, Hold-off Time, Notification
Time, Recovery Operation Time, and Traffic Restoration Time. In other
words, it is the time between the failure of a node or link in the
network and the time at which a recovery path is installed and the
traffic starts flowing on it.

Full Restoration Time

We define full restoration time as the time required for a permanent
restoration. This is the time required for traffic to be routed onto
links that are capable of, or have been engineered to, handle traffic
in recovery scenarios. Note that this time may or may not differ from
the Recovery Time, depending on whether equivalent or limited
recovery paths are used.

Setup Vulnerability

The amount of time that a working path, or a set of working paths, is
left unprotected during such tasks as recovery path computation and
recovery path setup may be used to compare schemes. The nature of
this vulnerability should be taken into account, e.g.: End-to-End
schemes correlate the vulnerability with working paths, Local Repair
schemes have a topological correlation that cuts across working
paths, and Network Plan approaches have a correlation that impacts
the entire network.

Backup Capacity

Recovery schemes may require differing amounts of "backup capacity"
in the event of a fault. This capacity will be dependent on the
traffic characteristics of the network. However, it may also depend
on the particular protection plan selection algorithms, as well as on
the signaling and re-routing methods.

Additive Latency

Recovery schemes may introduce additive latency to traffic. For
example, a recovery path may take many more hops than the working
path. This may depend on the recovery path selection algorithms.

Quality of Protection

Recovery schemes can be considered to encompass a spectrum of "packet
survivability", which may range from "relative" to "absolute".
Relative survivability may mean that the packet is on an equal
footing with other traffic of, for example, the same diff-serv code
point (DSCP) in contending for the resources of the portion of the
network that survives the failure. Absolute survivability may mean
that the survivability of the protected traffic has explicit
guarantees.

Re-ordering

Recovery schemes may introduce re-ordering of packets. The action of
putting traffic back on preferred paths may also cause packet
re-ordering.

State Overhead

As the number of recovery paths in a protection plan grows, the state
required to maintain them also grows. Schemes may require differing
numbers of paths to maintain certain levels of coverage, and the
state required may also depend on the particular scheme used to
recover. In many cases the state overhead will be in proportion to
the number of recovery paths.
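Returning to the Recovery Time criterion above, its additive
decomposition can be made concrete with the following non-normative
Python sketch. The component values are illustrative assumptions
only, not performance targets or measured results.

   # Recovery Time = Fault Detection + Hold-off + Notification
   #                 + Recovery Operation + Traffic Restoration
   components_ms = {
       "fault_detection": 10.0,    # assumed example values, in milliseconds
       "hold_off": 0.0,
       "notification": 5.0,
       "recovery_operation": 2.0,
       "traffic_restoration": 3.0,
   }

   recovery_time_ms = sum(components_ms.values())
   print(f"Recovery Time = {recovery_time_ms} ms")   # 20.0 ms here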
Loss

Recovery schemes may introduce a certain amount of packet loss during
the switch-over to a recovery path. For schemes that introduce loss
during recovery, this loss can be estimated by evaluating recovery
times in proportion to the link speed. In the case of link or node
failure, a certain amount of packet loss is inevitable.

Coverage

Recovery schemes may offer various types of failover coverage. The
total coverage may be defined in terms of several metrics:

I. Fault Types: Recovery schemes may account for only link faults,
for both node and link faults, or also for degraded service. For
example, a scheme may require more recovery paths to take node faults
into account.

II. Number of concurrent faults: Depending on the layout of recovery
paths in the protection plan, it may be possible to recover from
multiple concurrent faults.

III. Number of recovery paths: For a given fault, there may be one or
more recovery paths.

IV. Percentage of coverage: Depending on a scheme and its
implementation, a certain percentage of faults may be covered. This
may be subdivided into the percentage of link faults and the
percentage of node faults covered.

V. The number of protected paths may affect how fast the total set of
paths affected by a fault can be recovered. The ratio of protected
paths is n/N, where n is the number of protected paths and N is the
total number of paths (see the sketch below).
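The loss estimate and the coverage ratio mentioned above can be
computed as in the following non-normative Python sketch; the input
values are illustrative assumptions.

   def protected_ratio(n_protected: int, n_total: int) -> float:
       """Ratio of protected paths, n/N."""
       return n_protected / n_total

   def loss_estimate_bits(recovery_time_s: float,
                          link_speed_bps: float) -> float:
       """Rough upper bound on traffic lost during the switch-over."""
       return recovery_time_s * link_speed_bps

   print(protected_ratio(80, 100))         # 0.8
   print(loss_estimate_bits(0.020, 1e9))   # ~2e7 bits on a 1 Gb/s link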
7. Security Considerations

The MPLS-based recovery framework specified herein does not raise any
security issues that are not already present in the MPLS
architecture.

8. Intellectual Property Considerations

The IETF has been notified of intellectual property rights claimed in
regard to some or all of the specification contained in this
document. For more information, consult the online list of claimed
rights.

9. Acknowledgements

We would like to thank the members of the MPLS WG mailing list for
their suggestions on earlier versions of this draft. In particular,
we thank Bora Akyol, Dave Allan, Dave Danenberg, Shahram Davari, and
Neil Harrison, whose suggestions and comments were very helpful in
revising the document.

The editors would like to give very special thanks to Curtis
Villamizar for his careful and extremely thorough reading of the
document, and for taking the time to provide numerous suggestions,
which were very helpful in the last couple of revisions of the
document.

10. Editors' Addresses

Vishal Sharma
Metanoia, Inc.
1600 Villa Street, Unit 352
Mountain View, CA 94041-1174
Phone: (650) 386-6723
EMail: v.sharma@ieee.org

Fiffi Hellstrand
Nortel Networks
St Eriksgatan 115
PO Box 6701
113 85 Stockholm, Sweden
Phone: +46 8 5088 3687
EMail: Fiffi@nortelnetworks.com

11. References

[1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label
    Switching Architecture", RFC 3031, January 2001.

[2] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus,
    J., "Requirements for Traffic Engineering Over MPLS", RFC 2702,
    September 1999.

[3] Huang, C., Sharma, V., Owens, K., and Makam, V., "Building
    Reliable MPLS Networks Using a Path Protection Mechanism", IEEE
    Commun. Mag., Vol. 40, Issue 3, March 2002, pp. 156-162.

[4] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource
    ReSerVation Protocol (RSVP) -- Version 1 Functional
    Specification", RFC 2205, September 1997.

[5] Awduche, D., et al., "RSVP-TE: Extensions to RSVP for LSP
    Tunnels", RFC 3209, December 2001.

[6] Jamoussi, B., et al., "Constraint-Based LSP Setup using LDP",
    RFC 3212, January 2002.