IETF Draft                                               Srinivas Makam
Multi-Protocol Label Switching                            Vishal Sharma
Expires: January 2001                                          Ken Owens
                                                        Changcheng Huang
                                               Tellabs Operations, Inc.

                                                        Fiffi Hellstrand
                                                                Jon Weil
                                                           Loa Andersson
                                                           Bilel Jamoussi
                                                         Nortel Networks

                                                               Brad Cain
                                                   Mirror Image Internet

                                                         Seyhan Civanlar
                                                         Coreon Networks

                                                             Angela Chiu
                                                               AT&T Labs

                                                               July 2000

                   Framework for MPLS-based Recovery

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts. Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Multi-protocol label switching (MPLS) [1] integrates the label
   swapping forwarding paradigm with network layer routing. To deliver
   reliable service, MPLS requires a set of procedures to provide
   protection of the traffic carried on different paths.
This requires 53 that the label switched routers (LSRs) support fault detection, 54 fault notification, and fault recovery mechanisms, and that MPLS 55 signaling [2] [3] [4] [5] [6] support the configuration of recovery. 56 With these objectives in mind, this document specifies a framework 57 for MPLS based recovery. 59 Table of Contents Page 61 1.0 Introduction 3 62 1.1 Background 3 63 1.2 Motivations for MPLS-Based Recovery 4 64 1.3 Objectives 5 66 2.0 Overview 6 67 2.1 Recovery Models 6 68 2.2 Recovery Cycles 8 69 2.2.1 MPLS Recovery Cycle Model 8 70 2.2.2 MPLS Reversion Cycle Model 10 71 2.2.3 Dynamic Reroute Cycle Model 11 72 2.3 Terminology 13 73 2.4 Abbreviations 17 75 3.0 MPLS Recovery Principles 17 76 3.1 Configuration of Recovery 17 77 3.2 Initiation of Path Setup 18 78 3.3 Initiation of Resource Allocation 18 79 3.4 Scope of Recovery 19 80 3.4.1 Topology 19 81 3.4.1.1 Local Repair 19 82 3.4.1.2 Global Repair 20 83 3.4.1.3 Alternate Egress Repair 20 84 3.4.1.4 Multi-Layer Repair 21 85 3.4.1.5 Concatenated Protection Domains 21 86 3.4.2 Path Mapping 21 87 3.4.3 Bypass Tunnels 22 88 3.4.4 Recovery Granularity 23 89 3.4.4.1 Selective Traffic Recovery 23 90 3.4.4.2 Bundling 23 91 3.4.5 Recovery Path Resource Use 23 92 3.5 Fault Detection 24 93 3.6 Fault Notification 25 94 3.7 Switch Over Operation 25 95 3.7.1 Recovery Trigger 25 96 3.7.2 Recovery Action 26 97 3.8 Switch Back Operation 26 98 3.8.1 Revertive and Non-revertive Mode 26 99 3.8.2 Restoration and Notification 27 100 3.8.3 Reverting to Preferred LSP 28 101 3.9 Performance 28 102 4.0 Recovery Requirements 28 103 5.0 MPLS Recovery Options 29 104 6.0 Comparison Criteria 30 105 7.0 Security Considerations 32 106 8.0 Intellectual Property Considerations 32 107 9.0 Acknowledgements 32 108 10.0 Author's Addresses 33 109 11.0 References 34 111 1.0 Introduction 113 This memo describes a framework for MPLS-based recovery. We provide 114 a detailed taxonomy of recovery terminology, and discuss the 115 motivation for, the objectives of, and the requirements for MPLS- 116 based recovery. We outline principles for MPLS-based recovery, and 117 also provide comparison criteria that may serve as a basis for 118 comparing and evaluating different recovery schemes. 120 1.1 Background 122 Network routing deployed today is focussed primarily on connectivity 123 and typically supports only one class of service, the best effort 124 class. Multi-protocol label switching, on the other hand, by 125 integrating forwarding based on label-swapping of a link local label 126 with network layer routing allows flexibility in the delivery of new 127 routing services. MPLS allows for using media specific forwarding 128 mechanisms as label swapping. This enables more sophisticated 129 features such as quality-of-service (QoS) and traffic engineering 130 [7] to be implemented more effectively. An important component of 131 providing QoS, however, is the ability to transport data reliably 132 and efficiently. Although the current routing algorithms are very 133 robust and survivable, the amount of time they take to recover from 134 a fault can be significant, on the order of several seconds or 135 minutes, causing serious disruption of service for some applications 136 in the interim. This is unacceptable to many organizations that aim 137 to provide a highly reliable service, and thus require recovery 138 times on the order of tens of milliseconds, as specified, for 139 example, in the GR253 specification for SONET. 
141 Since MPLS is likely to be the technology of choice in the future 142 IP-based transport network, it is imperative that MPLS be able to 143 provide protection and restoration of traffic. In fact, a protection 144 priority could be used as a differentiating mechanism for premium 145 services that require high reliability. The remainder of this 146 document provides a framework for MPLS based recovery. It is 147 focused at a conceptual level and is meant to address motivation, 148 objectives and requirements. Issues of mechanism, policy, routing 149 plans and characteristics of traffic carried by protection paths are 150 beyond the scope of this document. 152 1.2 Motivation for MPLS-Based Recovery 154 MPLS based protection of traffic (called MPLS-based Recovery) is 155 useful for a number of reasons. The most important is its ability to 156 increase network reliability by enabling a faster response to faults 157 than is possible with traditional Layer 3 (or the IP layer) alone 158 while still providing the visibility of the network afforded Layer 159 3. Furthermore, a protection mechanism using MPLS could enable IP 160 traffic to be put directly over WDM optical channels, without an 161 intervening SONET layer. This would facilitate the construction of 162 IP-over-WDM networks. 164 The need for MPLS-based recovery arises because of the following: 166 I. Layer 3 or IP rerouting may be too slow for a core MPLS network 167 that needs to support high reliability/availability. 169 II. Layer 0 (for example, optical layer) or Layer 1 (for example, 170 SONET) mechanisms may be deployed in ring topologies and may not 171 always include mesh protection. That is, layer 0 or layer 1 networks 172 may not be deployed in topologies that meet carriers' protection 173 goals. 175 III. The granularity at which the lower layers may be able to 176 protect traffic may be too coarse for traffic that is switched using 177 MPLS-based mechanisms. 179 IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher 180 layer operations. Thus, while they may provide, for example, link 181 protection, they cannot easily provide node protection. 183 Furthermore there is a need for open standards. 185 V. Establishing interoperability of protection mechanisms between 186 routers/LSRs from different vendors in IP or MPLS networks is 187 urgently required to enable the adoption of MPLS as a viable core 188 transport and traffic engineering technology. 190 1.3 Objectives/Goals 192 We lay down the following objectives for MPLS-based recovery. 194 I. MPLS-based recovery mechanisms should facilitate fast (10's of 195 ms) recovery times. 197 II. MPLS-based recovery should maximize network reliability and 198 availability. MPLS based protection of traffic should minimize the 199 number of single points of failure in the MPLS protected domain. 201 III. MPLS-based recovery techniques should be applicable for 202 protection of traffic at various granularities. For example, it 203 should be possible to specify MPLS-based recovery for a portion of 204 the traffic on an individual path, for all traffic on an individual 205 path, or for all traffic on a group of paths. 207 IV. MPLS-based recovery techniques may be applicable for an entire 208 end-to-end path or for segments of an end-to-end path. 210 V. MPLS-based recovery actions should not adversely affect other 211 network operations. 213 VI. 
MPLS-based recovery actions in one MPLS protection domain 214 (defined in Section 2.2) should not adversely affect the recovery 215 actions in other MPLS protection domains. 217 VII. MPLS-based recovery mechanisms should be able to take into 218 consideration the recovery actions of lower layers. 220 VIII. MPLS-based recovery actions should avoid network-layering 221 violations. That is, defects in MPLS-based mechanisms should not 222 trigger lower layer protection switching. 224 IX. MPLS-based recovery mechanisms should minimize the loss of data 225 and packet reordering during recovery operations. (The current MPLS 226 specification has itself no explicit requirement on reordering). 228 X. MPLS-based recovery mechanisms should minimize the state overhead 229 incurred for each recovery path maintained. 231 XI. MPLS-based recovery mechanisms should be able to preserve the 232 constraints on traffic after switchover, if desired. That is, if 233 desired, the recovery path should meet the resource requirements of, 234 and achieve the same performance characteristics, as the working 235 path. 237 2.0 Overview 239 There are several options for providing protection of traffic using 240 MPLS. The most generic requirement is the specification of whether 241 recovery should be via Layer 3 (or IP) rerouting or via MPLS 242 protection switching or rerouting actions. 244 Generally network operators aim to provide the fastest and the best 245 protection mechanism that can be provided at a reasonable cost. The 246 higher the level of protection, the more resources it consumes. 247 MPLS-based recovery should give the flexibility to select the 248 recovery mechanism, choose the granularity at which traffic is 249 protected, and to also choose the specific types of traffic that are 250 protected in order to give operators more control over that 251 tradeoff. With MPLS-based recovery, it can be possible to provide 252 different levels of protection for different classes of service, 253 based on their service requirements. For example, using approaches 254 outlined below, a VLL service that supports real-time applications 255 like VoIP may be supported using link/node protection together with 256 pre-established, pre-reserved path protection, while best effort 257 traffic may use established-on-demand path protection or simply rely 258 on IP re-route or higher layer recovery mechanisms. As another 259 example of their range of application, MPLS-based recovery 260 strategies may be used to protect traffic not originally flowing on 261 label switched paths, such as IP traffic that is normally routed 262 hop-by-hop, as well as traffic forwarded on label switched paths. 264 2.1 Recovery Models 266 There are two basic models for path recovery: rerouting and 267 protection switching. 269 Protection switching and rerouting, as defined below, may be used 270 together. For example, protection switching to a recovery path may 271 be used for rapid restoration of connectivity while rerouting 272 determines a new optimal network configuration, rearranging paths, 273 as needed, at a later time [8] [9]. 275 2.1.1 Rerouting 277 Recovery by rerouting is defined as establishing new paths or path 278 segments on demand for restoring traffic after the occurrence of a 279 fault. The new paths may be based upon fault information, network 280 routing policies, pre-defined configurations and network topology 281 information. Thus, upon detecting a fault, the affected paths are 282 re-established using signaling. 
Reroute mechanisms are inherently slower than protection switching mechanisms, since more must be done following the detection of a fault. Once the network routing algorithms have converged after a fault, it may be preferable, in some cases, to reoptimize the network by performing a reroute based on the current state of the network and network policies. This is discussed further in Section 3.8 and will be clarified further in upcoming revisions of this document.

In terms of the principles defined in Section 3, reroute recovery employs paths established-on-demand with resources reserved-on-demand.

2.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery path or path segment, based upon network routing policies, the restoration requirements of the traffic on the working path, and administrative considerations. The recovery path may or may not be link and node disjoint with the working path [10]. When a fault is detected, the affected traffic that is considered for protection is switched over to the recovery path(s) and restored.

In terms of the principles in Section 3, protection switching employs pre-established recovery paths and, if resource reservation is required on the recovery path, pre-reserved resources.

2.1.2.1 Subtypes of Protection Switching

The resources (bandwidth, buffers, processing) on the recovery path may be used to carry either a copy of the working path traffic or extra traffic that is displaced when a protection switch occurs. This leads to two subtypes of protection switching.

In 1+1 ("one plus one") protection, the resources (bandwidth, buffers, processing capacity) on the recovery path are fully reserved, if needed, and carry the same traffic as the working path. Selection between the traffic on the working and recovery paths is made at the path merge LSR (PML).

In 1:1 ("one for one") protection, the resources (if any) allocated on the recovery path are fully available to preemptible low priority traffic, except when the recovery path is in use due to a fault on the working path. In other words, in 1:1 protection, the protected traffic normally travels only on the working path, and is switched to the recovery path only when the working path has a fault. Once the protection switch is initiated, the low priority traffic being carried on the recovery path may be displaced by the protected traffic. This method affords a way to make efficient use of the recovery path resources.

This concept can be extended to 1:n (one for n) and m:n (m for n) protection.

Additional specifications of the recovery actions are found in Section 3.

2.2 The Recovery Cycles

There are three defined recovery cycles: the MPLS Recovery Cycle, the MPLS Reversion Cycle, and the Dynamic Re-routing Cycle. The first cycle detects a fault and restores traffic onto MPLS-based recovery paths. If the recovery path is non-optimal, the first cycle may be followed by either of the latter two to return the network to an optimized state. The reversion cycle applies to explicitly routed traffic that does not depend on the convergence of dynamic routing protocols. The dynamic re-routing cycle applies to traffic that is forwarded based on hop-by-hop routing.

2.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1.
354 Definitions and a key to abbreviations follow. 356 --Network Impairment 357 | --Fault Detected 358 | | --Start of Notification 359 | | | -- Start of Recovery Operation 360 | | | | --Recovery Operation Complete 361 | | | | | --Path Traffic Restored 362 | | | | | | 363 | | | | | | 364 v v v v v v 365 ---------------------------------------------------------------- 366 | T1 | T2 | T3 | T4 | T5 | 368 Figure 1. MPLS Recovery Cycle Model 370 The various timing measures used in the model are described below. 372 T1 Fault Detection Time 373 T2 Hold-off Time 374 T3 Notification Time 375 T4 Recovery Operation Time 376 T5 Traffic Restoration Time 378 Definitions of the recovery cycle times are as follows: 380 Fault Detection Time 382 The time between the occurrence of a network impairment and the 383 moment the fault is detected by MPLS-based recovery mechanisms. This 384 time may be highly dependent on lower layer protocols. 386 Hold-Off Time 388 The configured waiting time between the detection of a fault and 389 taking MPLS-based recovery action, to allow time for lower layer 390 protection to take effect. The Hold-off Time may be zero. 392 Note: The Hold-Off Time may occur after the Notification Time 393 interval if the node responsible for the switchover, the Path Switch 394 LSR (PSL), rather than the detecting LSR, is configured to wait. 396 Notification Time 398 The time between initiation of an FIS by the LSR detecting the fault 399 and the time at which the Path Switch LSR (PSL) begins the recovery 400 operation. This is zero if the PSL detects the fault itself. 402 Note: If the PSL detects the fault itself, there still may be a 403 Hold-Off Time period between detection and the start of the recovery 404 operation. 406 Recovery Operation Time 408 The time between the first and last recovery actions. This may 409 include message exchanges between the PSL and PML to coordinate 410 recovery actions. 412 Traffic Restoration Time 414 The time between the last recovery action and the time that the 415 traffic (if present) is completely - recovered. This interval is 416 intended to account for the time required for traffic to once again 417 arrive at the point in the network that experienced disrupted or 418 degraded service due to the occurrence of the fault (e.g. the PML). 419 This time may depend on the location of the fault, the recovery 420 mechanism, and the propagation delay along the recovery path. 422 2.2.2 MPLS Reversion Cycle Model 424 Protection switching, revertive mode, requires the traffic to be 425 switched back to a preferred path when the fault on that path is 426 cleared. The MPLS reversion cycle model is illustrated in Figure 2. 427 Note that the cycle shown below comes after the recovery cycle shown 428 in Fig. 1. 430 --Network Impairment Repaired 431 | --Fault Cleared 432 | | --Path Available 433 | | | --Start of Reversion Operation 434 | | | | --Reversion Operation Complete 435 | | | | | --Traffic Restored on Preferred Path 436 | | | | | | 437 | | | | | | 438 v v v v v v 439 ----------------------------------------------------------------- 440 | T7 | T8 | T9 | T10| T11| 442 Figure 2. MPLS Reversion Cycle Model 444 The various timing measures used in the model are described below. 
446 T7 Fault Clearing Time 447 T8 Wait-to-Restore Time 448 T9 Notification Time 449 T10 Reversion Operation Time 450 T11 Traffic Restoration Time 452 Note that time T6 (not shown above) is the time for which the 453 network impairment is not repaired and traffic is flowing on the 454 recovery path. 456 Definitions of the reversion cycle times are as follows: 458 Fault Clearing Time 460 The time between the repair of a network impairment and the time 461 that MPLS-based mechanisms learn that the fault has been cleared. 462 This time may be highly dependent on lower layer protocols. 464 Wait-to-Restore Time 466 The configured waiting time between the clearing of a fault and 467 MPLS-based recovery action(s). Waiting time may be needed to ensure 468 the path is stable and to avoid flapping in cases where a fault is 469 intermittent. The Wait-to-Restore Time may be zero. 471 Note: The Wait-to-Restore Time may occur after the Notification Time 472 interval if the PSL is configured to wait. 474 Notification Time 476 The time between initiation of an FRS by the LSR clearing the fault 477 and the time at which the path switch LSR begins the reversion 478 operation. This is zero if the PSL clears the fault itself. 480 Note: If the PSL clears the fault itself, there still may be a Wait- 481 to-Restore Time period between fault clearing and the start of the 482 reversion operation. 484 Reversion Operation Time 486 The time between the first and last reversion actions. This may 487 include message exchanges between the PSL and PML to coordinate 488 reversion actions. 490 Traffic Restoration Time 492 The time between the last reversion action and the time that traffic 493 (if present) is completely restored on the preferred path. This 494 interval is expected to be quite small since both paths are working 495 and care may be taken to limit the traffic disruption (e.g., using 496 "make before break" techniques and synchronous switch-over). 498 In practice, the only interesting times in the reversion cycle are 499 the Wait-to-Restore Time and the Traffic Restoration Time (or some 500 other measure of traffic disruption). Given that both paths are 501 available, there is no need for rapid operation, and a well- 502 controlled switch-back with minimal disruption is desirable. 504 2.2.3 Dynamic Re-routing Cycle Model 506 Dynamic rerouting aims to bring the IP network to a stable state 507 after a network impairment has occurred. A re-optimized network is 508 achieved after the routing protocols have converged, and the traffic 509 is moved from a recovery path to a (possibly) new working path. The 510 steps involved in this mode are illustrated in Figure 3. 512 Note that the cycle shown below may follow the recovery cycle shown 513 in Fig. 1 or the reversion cycle shown in Fig. 2, or both (in the 514 event that both the recovery cycle and the reversion cycle take 515 place before the routing protocols converge, and after the 516 convergence of the routing protocols it is determined (based on on- 517 line algorithms or off-line traffic engineering tools, network 518 configuration, or a variety of other possible criteria) that there 519 is a better route for the working path). 
521 --Network Enters a Semi-stable State after an Impairment 522 | --Dynamic Routing Protocols Converge 523 | | --Initiate Setup of New Working Path between PSL 524 | | | and PML 525 | | | --Switchover Operation Complete 526 | | | | --Traffic Moved to New Working Path 527 | | | | | 528 | | | | | 529 v v v v v 530 ----------------------------------------------------------------- 531 | T12 | T13 | T14 | T15 | 533 Figure 3. Dynamic Rerouting Cycle Model 535 The various timing measures used in the model are described below. 537 T12 Network Route Convergence Time 538 T13 Hold-down Time (optional) 539 T14 Switchover Operation Time 540 T15 Traffic Restoration Time 542 Network Route Convergence Time 544 We define the network route convergence time as the time taken for 545 the network routing protocols to converge and for the network to 546 reach a stable state. 548 Holddown Time 550 We define the holddown period as a bounded time for which a recovery 551 path must be used. In some scenarios it may be difficult to 552 determine if the working path is stable. In these cases a holddown 553 time may be used to prevent excess flapping of traffic between a 554 working and a recovery path. 556 Switchover Operation Time 558 The time between the first and last switchover actions. This may 559 include message exchanges between the PSL and PML to coordinate the 560 switchover actions. 562 As an example of the recovery cycle, we present a sequence of events 563 that occur after a network impairment occurs and when a protection 564 switch is followed by dynamic rerouting. 566 I. Link or path fault occurs 568 II. Signaling initiated (FIS) for the fault detected 570 III. FIS arrives at the PSL 572 IV. The PSL initiates a protection switch to a pre-configured 573 recovery path 575 V. The PSL switches over the traffic from the working path to the 576 recovery path 578 VI. The network enters a semi-stable state 580 VII. Dynamic routing protocols converge after the fault, and a new 581 working path is calculated (based, for example, on some of the 582 criteria mentioned earlier in Section 2.1.1). 584 VIII. A new working path is established between the PSL and the PML 585 (assumption is that PSL and PML have not changed) 587 IX. Traffic is switched over to the new working path. 589 2.3 Definitions and Terminology 591 This document assumes the terminology given in [11], and, in 592 addition, introduces the following new terms. 594 2.3.1 General Recovery Terminology 596 Rerouting 598 A recovery mechanism in which the recovery path or path segments are 599 created dynamically after the detection of a fault on the working 600 path. In other words, a recovery mechanism in which the recovery 601 path is not pre-established. 603 Protection Switching 605 A recovery mechanism in which the recovery path or path segments are 606 created prior to the detection of a fault on the working path. In 607 other words, a recovery mechanism in which the recovery path is pre- 608 established. 610 Working Path 612 The protected path that carries traffic before the occurrence of a 613 fault. The working path exists between a PSL and PML. The working 614 path can be of different kinds; a hop-by-hop routed path, a trunk, a 615 link, an LSP or part of a multipoint-to-point LSP. 616 Two synonyms for a working path are primary path, active path. 618 Recovery Path 620 The path by which traffic is restored after the occurrence of a 621 fault. In other words, the path on which the traffic is directed by 622 the recovery mechanism. 
The recovery path is established by MPLS 623 means. The recovery path can either be an equivalent recovery path 624 and ensure no reduction in quality of service, or be a limited 625 recovery path and thereby not guarantee the same quality of service 626 (or some other criteria of performance) as the working path. A 627 limited recovery path is not expected to be used for an extended 628 period of time. 629 Synonyms for a recovery path are; back-up path, alternative path, 630 protection path. 632 Path Group (PG) 634 A logical bundling of multiple working paths, each of which is 635 routed identically between a Path Switch LSR and a Path Merge LSR. 637 Protected Path Group (PPG) 639 A path group that requires protection. 641 Protected Traffic Portion (PTP) 643 The portion of the traffic on an individual path that requires 644 protection. For example, code points in the EXP bits of the shim 645 header may identify a protected portion. 647 Path Switch LSR (PSL) 649 An LSR that is the transmitter of both the working path traffic and 650 its corresponding recovery path traffic. The PSL is responsible for 651 switching of the traffic between the working path and the recovery 652 path. 654 Path Merge LSR (PML) 656 An LSR that receives both working path traffic and its corresponding 657 recovery path traffic, and either merges their traffic into a single 658 outgoing path, or, if it is itself the destination, passes the 659 traffic on to the higher layer protocols. 661 Intermediate LSR 662 An LSR on a working or recovery path that is neither a PSL nor a PML 663 for that path. 665 Bypass Tunnel 667 A path that serves to backup a set of working paths using the label 668 stacking approach. The working paths and the bypass tunnel must all 669 share the same path switch LSR (PSL) and the path merge LSR (PML). 671 Switch-Over 673 The process of switching the traffic from the path that the traffic 674 is flowing on onto one or more alternate path(s). This may involve 675 moving traffic from a working path onto one or more recovery paths, 676 or may involve moving traffic from a recovery path(s) on to a more 677 optimal working path(s). 679 Switch-Back 681 The process of returning the traffic from one or more recovery paths 682 back to the working path(s). 684 Revertive Mode 686 A recovery mode in which traffic is automatically switched back from 687 the recovery path to the original working path upon the restoration 688 of the working path to a fault-free condition. 690 Non-revertive Mode 692 A recovery mode in which traffic is not automatically switched back 693 to the original working path after this path is restored to a fault- 694 free condition. (Depending on the configuration, the original 695 working path may, upon moving to a fault-free condition, become the 696 recovery path, or it may be used for new working traffic, and be no 697 longer associated with its original recovery path). 699 MPLS Protection Domain 701 The set of LSRs over which a working path and its corresponding 702 recovery path are routed. 704 MPLS Protection Plan 706 The set of all LSP protection paths and the mapping from working to 707 protection paths deployed in an MPLS protection domain at a given 708 time. 710 Liveness Message 711 A message exchanged periodically between two adjacent LSRs that 712 serves as a link probing mechanism. It provides an integrity check 713 of the forward and the backward directions of the link between the 714 two LSRs as well as a check of neighbor aliveness. 
Path Continuity Test

A test that verifies the integrity and continuity of a path or path segment. The details of such a test are beyond the scope of this draft. (This could be accomplished, for example, by transmitting a control message along the same links and nodes as the data traffic.)

2.3.2 Failure Terminology

Path Failure (PF)

Path failure is a fault detected by MPLS-based recovery mechanisms, defined as the failure of the liveness message test or of a path continuity test, which indicates that path connectivity is lost.

Path Degraded (PD)

Path degraded is a fault detected by MPLS-based recovery mechanisms that indicates that the quality of the path is unacceptable.

Link Failure (LF)

A lower layer fault indicating that link continuity is lost. This may be communicated to the MPLS-based recovery mechanisms by the lower layer.

Link Degraded (LD)

A lower layer indication to MPLS-based recovery mechanisms that the link is performing below an acceptable level.

Fault Indication Signal (FIS)

A signal that indicates that a fault along a path has occurred. It is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches an LSR that is set up to perform MPLS recovery.

Fault Recovery Signal (FRS)

A signal that indicates that a fault along a working path has been repaired. Like the FIS, it is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches the LSR that performs recovery of the original path.

2.4 Abbreviations

FIS: Fault Indication Signal.
FRS: Fault Recovery Signal.
LD: Link Degraded.
LF: Link Failure.
PD: Path Degraded.
PF: Path Failure.
PML: Path Merge LSR.
PG: Path Group.
PPG: Protected Path Group.
PTP: Protected Traffic Portion.
PSL: Path Switch LSR.

3.0 MPLS-based Recovery Principles

MPLS-based recovery refers to the ability to effect quick and complete restoration of traffic affected by a fault in an MPLS-enabled network. The fault may be detected at the IP layer or in lower layers over which IP traffic is transported. Fast MPLS protection may be viewed as an MPLS LSR switch-over completion time comparable, or equivalent, to the 50 ms switch-over completion time of the SONET layer. This section provides a discussion of the concepts and principles of MPLS-based recovery. The concepts are presented in terms of atomic or primitive terms that may be combined to specify recovery approaches. We do not make any assumptions about the underlying layer 1 or layer 2 transport mechanisms or their recovery mechanisms.

3.1 Configuration of Recovery

An LSR should allow for configuration of the following recovery options:

Default-recovery (no MPLS-based recovery enabled): Traffic on the working path is recovered only via Layer 3 or IP rerouting. This is equivalent to having no MPLS-based recovery. This option may be used for low priority traffic or for traffic that is recovered in another way (for example, load-shared traffic on parallel working paths may be automatically recovered upon a fault along one of the working paths by distributing it among the remaining working paths).

Recoverable (MPLS-based recovery enabled): This working path is recovered using one or more recovery paths, either via rerouting or via protection switching.
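As an informal illustration (not part of the framework), the per-path configuration choice above can be captured in a small data model. The following Python sketch is purely illustrative; the names (RecoveryMode, PathRecoveryConfig, the example LSP identifiers) are hypothetical and do not correspond to any MPLS signaling objects.

   from dataclasses import dataclass
   from enum import Enum

   class RecoveryMode(Enum):
       DEFAULT = "default"        # no MPLS-based recovery; rely on L3/IP rerouting
       REROUTING = "rerouting"    # recovery path established on demand
       PROTECTION = "protection"  # recovery path pre-established

   @dataclass
   class PathRecoveryConfig:
       lsp_id: str
       mode: RecoveryMode = RecoveryMode.DEFAULT
       revertive: bool = True     # switch back once the working path is restored

   # A premium LSP protected by protection switching, and a best-effort LSP
   # left to ordinary Layer 3 rerouting (the "Default-recovery" option).
   premium = PathRecoveryConfig("lsp-premium-1", RecoveryMode.PROTECTION)
   best_effort = PathRecoveryConfig("lsp-best-effort-7")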
807 3.2 Initiation of Path Setup 809 There are three options for the initiation of the recovery path 810 setup. 812 Pre-established: 814 This is the same as the protection switching option. Here a recovery 815 path(s) is established prior to any failure on the working path. The 816 path selection can either be determined by an administrative 817 centralized tool (online or offline), or chosen based on some 818 algorithm implemented at the PSL and possibly intermediate nodes. To 819 guard against the situation when the pre-established recovery path 820 fails before or at the same time as the working path, the recovery 821 path should have secondary configuration options as explained in 822 Section 3.3 below. 824 Pre Qualified: 826 A pre-established path need not be created, it may be pre-qualified. 827 A pre-qualified recovery path is not created expressly for 828 protecting the working path, but instead is a path created for other 829 purposes that is designated as a recovery path after determination 830 that it is an acceptable alternative for carrying the working path 831 traffic. 833 Established-on-Demand: 835 This is the same as the rerouting option. Here, a recovery path is 836 established after a failure on its working path has been detected 837 and notified to the PSL. 839 Additional options are possible as MPLS is extended to control 840 optical networks. One example of this is shared mesh protection in 841 optical networks where the wavelength (or port) in-to-out mapping 842 for a recovery lightpath is selected in every optical layer cross- 843 connect prior to the failure, but the physical cross-connect is not 844 made until after the failure occurs. This and other options related 845 to optical MPLS are for further study. 847 3.3 Initiation of Resource Allocation 849 A recovery path may support the same traffic contract as the working 850 path, or it may not. We will distinguish these two situations by 851 using different additive terms. If the recovery path is capable of 852 replacing the working path without degrading service, it will be 853 called an equivalent recovery path. If the recovery path lacks the 854 resources (or resource reservations) to replace the working path 855 without degrading service, it will be called a limited recovery 856 path. Based on this, there are two options for the initiation of 857 resource allocation: 859 Pre-reserved: 861 This option applies only to protection switching. Here a pre- 862 established recovery path reserves required resources on all hops 863 along its route during its establishment. Although the reserved 864 resources (e.g., bandwidth and/or buffers) at each node cannot be 865 used to admit more working paths, they are available to be used by 866 all traffic that is present at the node before a failure occurs, 867 which results in better resource usage than SONET APS. 869 Reserved-on-Demand: 871 This option may apply either to rerouting or to protection 872 switching. Here a recovery path reserves the required resources 873 after a failure on the working path has been detected and notified 874 to the PSL and before the traffic on the working path is switched 875 over to the recovery path. 877 Note that under both the options above, depending on the amount of 878 resources reserved on the recovery path, it could either be an 879 equivalent recovery path or a limited recovery path. 
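To make the interaction between Sections 3.2 and 3.3 concrete, the following sketch combines the path-setup and resource-allocation options into the recovery models of Section 2.1. It is an informal illustration only, with hypothetical names; the one constraint encoded is that pre-reserved resources apply only to pre-established (protection switching) paths.

   from enum import Enum

   class PathSetup(Enum):
       PRE_ESTABLISHED = "pre-established"
       PRE_QUALIFIED = "pre-qualified"
       ON_DEMAND = "established-on-demand"

   class ResourceAllocation(Enum):
       PRE_RESERVED = "pre-reserved"
       ON_DEMAND = "reserved-on-demand"

   def recovery_model(setup, alloc):
       """Return the recovery model implied by a setup/allocation pair."""
       if alloc is ResourceAllocation.PRE_RESERVED and setup is PathSetup.ON_DEMAND:
           # Section 3.3: pre-reserved resources apply only to protection switching.
           raise ValueError("pre-reserved resources require a pre-established path")
       if setup is PathSetup.ON_DEMAND:
           return "rerouting"
       return "protection switching"

   assert recovery_model(PathSetup.ON_DEMAND, ResourceAllocation.ON_DEMAND) == "rerouting"
   assert recovery_model(PathSetup.PRE_ESTABLISHED, ResourceAllocation.PRE_RESERVED) == "protection switching"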
881 3.4 Scope of Recovery 883 3.4.1 Topology 885 3.4.1.1 Local Repair 887 The intent of local repair is to protect against a single link or 888 neighbor node fault. In local repair (also known as local recovery 889 [12] [9]), the node detecting the fault is the one to initiate 890 recovery (either rerouting or protection switching). Local repair 891 can be of two types: 893 Link Recovery/Restoration 895 In this case, the recovery path may be configured to route around a 896 certain link deemed to be unreliable. If protection switching is 897 used, several recovery paths may be configured for one working path, 898 depending on the specific faulty link that each protects against. 900 Alternatively, if rerouting is used, upon the occurrence of a fault 901 on the specified link each path is rebuilt such that it detours 902 around the faulty link. 904 In this case, the recovery path need only be disjoint from its 905 working path at a particular link on the working path, and may have 906 overlapping segments with the working path. Traffic on the working 907 path is switched over to an alternate path at the upstream LSR that 908 connects to the failed link. This method is potentially the fastest 909 to perform the switchover, and can be effective in situations where 910 certain path components are much more unreliable than others. 912 Node Recovery/Restoration 914 In this case, the recovery path may be configured to route around a 915 neighbor node deemed to be unreliable. Thus the recovery path is 916 disjoint from the working path only at a particular node and at 917 links associated with the working path at that node. Once again, the 918 traffic on the primary path is switched over to the recovery path at 919 the upstream LSR that directly connects to the failed node, and the 920 recovery path shares overlapping portions with the working path. 922 3.4.1.2 Global Repair 924 The intent of global repair is to protect against any link or node 925 fault on the entire path or on a segment of a path (with the obvious 926 exception of the ingress and egress nodes). In global repair (also 927 known as path recovery/restoration) the node that initiates the 928 recovery may be distant from the faulty link or node. In some cases, 929 a fault notification (in the form of a FIS) must be sent from the 930 node detecting the fault to the PSL. In many cases, the recovery 931 path can be made completely link and node disjoint with its working 932 path. This has the advantage of protecting against all link and node 933 fault(s) on the working path (or path segment), and being more 934 efficient than per-hop link or node recovery. 936 In addition, it can be potentially more optimal in resource usage 937 than the link or node recovery. However, it is in some cases slower 938 than local repair since it takes longer for the fault notification 939 message to get to the PSL to trigger the recovery action. 941 3.4.1.3 Alternate Egress Repair 943 It is possible to restore service without specifically recovering 944 the faulted path. 946 For example, for best effort IP service it is possible to select a 947 recovery path that has a different egress point from the working 948 path (i.e., there is no PML). The recovery path egress must simply 949 be a router that is acceptable for forwarding the FEC carried by the 950 working path (without creating looping). In an engineering context, 951 specific alternative FEC/LSP mappings with alternate egresses can be 952 formed. 
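The difference between local repair (Sections 3.4.1.1) and global repair (Section 3.4.1.2) is essentially which elements of the working path the recovery path must avoid. The small sketch below illustrates this on a toy topology; the breadth-first helper and the node names are hypothetical and not part of the framework. Note that in this example a local detour around the failed link exists (and overlaps the working path), while no fully node-disjoint alternative exists for global repair.

   from collections import deque

   def find_path(adj, src, dst, banned_nodes=(), banned_links=()):
       """Breadth-first search avoiding the given nodes and (u, v) links."""
       banned_links = {frozenset(link) for link in banned_links}
       queue, seen = deque([[src]]), {src}
       while queue:
           path = queue.popleft()
           if path[-1] == dst:
               return path
           for nxt in adj[path[-1]]:
               if nxt in seen or nxt in banned_nodes:
                   continue
               if frozenset((path[-1], nxt)) in banned_links:
                   continue
               seen.add(nxt)
               queue.append(path + [nxt])
       return None

   # Working path: PSL - A - X - PML
   adj = {"PSL": ["A", "B"], "A": ["PSL", "B", "X"], "B": ["PSL", "A"],
          "X": ["A", "PML"], "PML": ["X"]}

   # Local (link) repair: avoid only the failed PSL-A link; the detour may
   # overlap the working path downstream of the fault.
   print(find_path(adj, "PSL", "PML", banned_links=[("PSL", "A")]))
   # Global repair: avoid every intermediate node of the working path;
   # returns None here because no node-disjoint route exists.
   print(find_path(adj, "PSL", "PML", banned_nodes={"A", "X"}))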
954 3.4.1.4 Multi-Layer Repair 956 Multi-layer repair broadens the network designer's tool set for 957 those cases where multiple network layers can be managed together to 958 achieve overall network goals. Specific criteria for determining 959 when multi-layer repair is appropriate are beyond the scope of this 960 draft. 962 3.4.1.5 Concatenated Protection Domains 964 A given service may cross multiple networks and these may employ 965 different recovery mechanisms. It is possible to concatenate 966 protection domains so that service recovery can be provided end-to- 967 end. It is considered that the recovery mechanisms in different 968 domains may operate autonomously, and that multiple points of 969 attachment may be used between domains (to ensure there is no single 970 point of failure). Details of concatenated protection domains are 971 beyond the scope of this draft. 973 3.4.2 Path Mapping 975 Path mapping refers to the methods of mapping traffic from a faulty 976 working path on to the recovery path. There are several options for 977 this, as described below. Note that the options below should be 978 viewed as atomic terms that only describe how the working and 979 protection paths are mapped to each other. The issues of resource 980 reservation along these paths, and how switchover is actually 981 performed lead to the more commonly used composite terms, such as 982 1+1 and 1:1 protection, which were described in Section 2.1. 984 i) 1-to-1 Protection 986 In 1-to-1 protection the working path has a designated recovery path 987 that is only to be used to recover that specific working path. 989 ii) n-to-1 Protection 991 In n-to-1 protection, up to n working paths are protected using only 992 one recovery path. If the intent is to protect against any single 993 fault on any of the working paths, the n working paths should be 994 diversely routed between the same PSL and PML. In some cases, 995 handshaking between PSL and PML may be required to complete the 996 recovery, the details of which are beyond the scope of this draft. 998 iii) n-to-m Protection 1000 In n-to-m protection, up to n working paths are protected using m 1001 recovery paths. Once again, if the intent is to protect against any 1002 single fault on any of the n working paths, the n working paths and 1003 the m recovery paths should be diversely routed between the same PSL 1004 and PML. In some cases, handshaking between PSL and PML may be 1005 required to complete the recovery, the details of which are beyond 1006 the scope of this draft. -N-to-m protection is for further study. 1008 iv) Split Path Protection 1010 In split path protection, multiple recovery paths are allowed to 1011 carry the traffic of a working path based on a certain configurable 1012 load splitting ratio. This is especially useful when no single 1013 recovery path can be found that can carry the entire traffic of the 1014 working path in case of a fault. Split path protection may require 1015 handshaking between the PSL and the PML(s), and may require the 1016 PML(s) to correlate the traffic arriving on multiple recovery paths 1017 with the working path. Although this is an attractive option, the 1018 details of split path protection are beyond the scope of this draft, 1019 and are for further study. 1021 3.4.3 Bypass Tunnels 1023 It may be convenient, in some cases, to create a "bypass tunnel" for 1024 a PPG between a PSL and PML, thereby allowing multiple recovery 1025 paths to be transparent to intervening LSRs [8]. 
In this case, one 1026 LSP (the tunnel) is established between the PSL and PML following an 1027 acceptable route and a number of recovery paths are supported 1028 through the tunnel via label stacking. A bypass tunnel can be used 1029 with any of the path mapping options discussed in the previous 1030 section. 1032 As with recovery paths, the bypass tunnel may or may not have 1033 resource reservations sufficient to provide recovery without service 1034 degradation. It is possible that the bypass tunnel may have 1035 sufficient resources to recover some number of working paths, but 1036 not all at the same time. If the number of recovery paths carrying 1037 traffic in the tunnel at any given time is restricted, this is 1038 similar to the 1 to n or m to n protection cases mentioned in 1039 Section 3.4.2. 1041 3.4.4 Recovery Granularity 1043 Another dimension of recovery considers the amount of traffic 1044 requiring protection. This may range from a fraction of a path to a 1045 bundle of paths. 1047 3.4.4.1 Selective Traffic Recovery 1049 This option allows for the protection of a fraction of traffic 1050 within the same path. The portion of the traffic on an individual 1051 path that requires protection is called a protected traffic portion 1052 (PTP). A single path may carry different classes of traffic, with 1053 different protection requirements. The protected portion of this 1054 traffic may be identified by its class, as for example, via the EXP 1055 bits in the MPLS shim header or via the priority bit in the ATM 1056 header. 1058 3.4.4.2 Bundling 1060 Bundling is a technique used to group multiple working paths 1061 together in order to recover them simultaneously. The logical 1062 bundling of multiple working paths requiring protection, each of 1063 which is routed identically between a PSL and a PML, is called a 1064 protected path group (PPG). When a fault occurs on the working path 1065 carrying the PPG, the PPG as a whole can be protected either by 1066 being switched to a bypass tunnel or by being switched to a recovery 1067 path. 1069 3.4.5 Recovery Path Resource Use 1071 In the case of pre-reserved recovery paths, there is the question of 1072 what use these resources may be put to when the recovery path is not 1073 in use. There are two options: 1075 Dedicated-resource: 1077 If the recovery path resources are dedicated, they may not be used 1078 for anything except carrying the working traffic. For example, in 1079 the case of 1+1 protection, the working traffic is always carried on 1080 the recovery path. Even if the recovery path is not always carrying 1081 the working traffic, it may not be possible or desirable to allow 1082 other traffic to use these resources. 1084 Extra-traffic-allowed: 1086 If the recovery path only carries the working traffic when the 1087 working path fails, then it is possible to allow extra traffic to 1088 use the reserved resources at other times. Extra traffic is, by 1089 definition, traffic that can be displaced (without violating service 1090 agreements) whenever the recovery path resources are needed for 1091 carrying the working path traffic. 1093 3.5 Fault Detection 1095 MPLS recovery is initiated after the detection of either a lower 1096 layer fault or a fault at the IP layer or in the operation of MPLS- 1097 based mechanisms. We consider four classes of impairments: Path 1098 Failure, Path Degraded, Link Failure, and Link Degraded. 
1100 Path Failure (PF) is a fault that indicates to an MPLS-based 1101 recovery scheme that the connectivity of the path is lost. This may 1102 be detected by a path continuity test between the PSL and PML. 1103 Some, and perhaps the most common, path failures may be detected 1104 using a link probing mechanism between neighbor LSRs. An example of 1105 a probing mechanism is a liveness message that is exchanged 1106 periodically along the working path between peer LSRs. For either a 1107 link probing mechanism or path continuity test to be effective, the 1108 test message must be guaranteed to follow the same route as the 1109 working or recovery path, over the segment being tested. In 1110 addition, the path continuity test must take the path merge points 1111 into consideration. In the case of a bi-directional link implemented 1112 as two unidirectional links, path failure could mean that either one 1113 or both unidirectional links are damaged. 1115 Path Degraded (PD) is a fault that indicates to MPLS-based recovery 1116 schemes/mechanisms that the path has connectivity, but that the 1117 quality of the connection is unacceptable. This may be detected by 1118 a path performance monitoring mechanism, or some other mechanism for 1119 determining the error rate on the path or some portion of the path. 1120 This is local to the LSR and consists of excessive discarding of 1121 packets at an interface, either due to label mismatch or due to TTL 1122 errors, for example. 1124 Link Failure (LF) is an indication from a lower layer that the link 1125 over which the path is carried has failed. If the lower layer 1126 supports detection and reporting of this fault (that is, any fault 1127 that indicates link failure e.g., SONET LOS), this may be used by 1128 the MPLS recovery mechanism. In some cases, using LF indications may 1129 provide faster fault detection than using only MPLS-based fault 1130 detection mechanisms. 1132 Link Degraded (LD) is an indication from a lower layer that the link 1133 over which the path is carried is performing below an acceptable 1134 level. If the lower layer supports detection and reporting of this 1135 fault, it may be used by the MPLS recovery mechanism. In some cases, 1136 using LD indications may provide faster fault detection than using 1137 only MPLS-based fault detection mechanisms. 1139 3.6 Fault Notification 1141 Protection switching relies on rapid notification of faults. Once a 1142 fault is detected, the node that detected the fault must determine 1143 if the fault is severe enough to require path recovery. Then the 1144 node should send out a notification of the fault by transmitting a 1145 FIS to those of its upstream LSRs that were sending traffic on the 1146 working path that is affected by the fault. This notification is 1147 relayed hop-by-hop by each subsequent LSR to its upstream neighbor, 1148 until it eventually reaches a PSL. A PSL is the only LSR that can 1149 terminate the FIS and initiate a protection switch of the working 1150 path to a recovery path. Since the FIS is a control message, it 1151 should be transmitted with high priority to ensure that it 1152 propagates rapidly towards the affected PSL(s). Depending on how 1153 fault notification is configured in the LSRs of an MPLS domain, the 1154 FIS could be sent either as a Layer 2 or Layer 3 packet. An example 1155 of a FIS could be the liveness message sent by a downstream LSR to 1156 its upstream neighbor, with an optional fault notification field 1157 set. 
Alternatively, it could be a separate fault notification 1158 packet. The intermediate LSR should identify which of its incoming 1159 links (upstream LSRs) to propagate the FIS on. In the case of 1+1 1160 protection, the FIS should also be sent downstream to the PML where 1161 the recovery action is taken. 1163 3.7 Switch-Over Operation 1165 3.7.1 Recovery Trigger 1167 The activation of an MPLS protection switch following the detection 1168 or notification of a fault requires a trigger mechanism at the PSL. 1169 MPLS protection switching may be initiated due to automatic inputs 1170 or external commands. The automatic activation of an MPLS protection 1171 switch results from a response to a defect or fault conditions 1172 detected at the PSL or to fault notifications received at the PSL. 1173 It is possible that the fault detection and trigger mechanisms may 1174 be combined, as is the case when a PF, PD, LF, or LD is detected at 1175 a PSL and triggers a protection switch to the recovery path. In most 1176 cases, however, the detection and trigger mechanisms are distinct, 1177 involving the detection of fault at some intermediate LSR followed 1178 by the propagation of a fault notification back to the PSL via the 1179 FIS, which serves as the protection switch trigger at the PSL. MPLS 1180 protection switching in response to external commands results when 1181 the operator initiates a protection switch by a command to a PSL (or 1182 alternatively by a configuration command to an intermediate LSR, 1183 which transmits the FIS towards the PSL). 1185 Note that the PF fault applies to hard failures (fiber cuts, 1186 transmitter failures, or LSR fabric failures), as does the LF fault, 1187 with the difference that the LF is a lower layer impairment that may 1188 be communicated to - MPLS-based recovery mechanisms. The PD (or LD) 1189 fault, on the other hand, applies to soft defects (excessive errors 1190 due to noise on the link, for instance). The PD (or LD) results in a 1191 fault declaration only when the percentage of lost packets exceeds a 1192 given threshold, which is provisioned and may be set based on the 1193 service level agreement(s) in effect between a service provider and 1194 a customer. 1196 3.7.2 Recovery Action 1198 After a fault is detected or FIS is received by the PSL, the 1199 recovery action involves either a rerouting or protection switching 1200 operation. In both scenarios, the next hop label forwarding entry 1201 for a recovery path is bound to the working path. 1203 3.8 Switch-Back Operation 1205 3.8.1 Revertive and Non-Revertive Modes 1207 These protection modes indicate whether or not there is a preferred 1208 path for the protected traffic. 1210 3.8.1.1 Revertive Mode 1212 If the working path always is the preferred path, this path will be 1213 used whenever it is available. If the working path has a fault, 1214 traffic is switched to the recovery path. In the revertive mode of 1215 operation, when the preferred path is restored the traffic is 1216 automatically switched back to it. 1218 3.8.1.2 Non-revertive Mode 1220 In the non-revertive mode of operation, there is no preferred path. 1221 A switchback to the "original" working path is not desired or not 1222 possible since the original path may no longer exist after the 1223 occurrence of a fault on that path. 1225 If there is a fault on the working path, traffic is switched to the 1226 recovery path. 
When or if the faulty path (the original working 1227 path) is restored, it may become the recovery path (either by 1228 configuration or, if desired, by management actions). This applies 1229 to explicitly routed working paths.

1231 When the traffic is switched over to a recovery path, the 1232 association between the original working path and the recovery path 1233 may no longer exist, since the original path itself may no longer 1234 exist after the fault. Instead, when the network reaches a stable 1235 state following routing convergence, the recovery path may be 1236 switched over to a different preferred path, based either on pre- 1237 configured information or on an optimization over the new network 1238 topology and associated information.

1240 3.8.2 Restoration and Notification

1242 MPLS restoration deals with returning the working traffic from the 1243 recovery path to the original or a new working path. Reversion is 1244 performed by the PSL upon receiving notification, via the FRS, that the 1245 working path is repaired, or upon receiving notification that a new 1246 working path is established.

1248 As before, an LSR that detected the fault on the working path also 1249 detects the restoration of the working path. If the working path had 1250 experienced an LF defect, the LSR detects a return to normal 1251 operation via the receipt of a liveness message from its peer. If 1252 the working path had experienced an LD defect at an LSR interface, 1253 the LSR could detect a return to normal operation via the resumption 1254 of error-free packet reception on that interface. Alternatively, a 1255 lower layer that no longer detects an LF defect may inform the MPLS- 1256 based recovery mechanisms at the LSR that the link to its peer LSR 1257 is operational. The LSR then transmits the FRS to its upstream LSR(s) 1258 that were transmitting traffic on the working path. This is relayed 1259 hop-by-hop until it reaches the PSL(s), at which point the PSL 1260 switches the working traffic back to the original working path.

1262 In the non-revertive mode of operation, the working traffic may or 1263 may not be restored to the original working path. This is because, in some cases: 1264 (a) it might be useful to administratively 1265 perform a protection switch back to the original working path after 1266 gaining further assurances about the integrity of the path; (b) 1267 it may be acceptable to continue operation without the recovery path 1268 being protected; or (c) it may be desirable to move the traffic to a 1269 new working path that is calculated based on network topology and 1270 network policies, after the dynamic routing protocols have 1271 converged.

1273 We note that if there is a way to transmit fault information back 1274 along a recovery path towards a PSL, and if the recovery path is an 1275 equivalent recovery path, it is possible for the working path and 1276 its recovery path to exchange roles once the original working path 1277 is repaired following a fault. This is because, in that case, the 1278 recovery path effectively becomes the working path, and the restored 1279 working path functions as a recovery path for the original recovery 1280 path. This is important, since it affords the benefits of non- 1281 revertive switch operation outlined in Section 3.8.1 without 1282 leaving the recovery path unprotected.
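To make the switch-over and switch-back operations of Sections 3.6 through 3.8 concrete, the following sketch shows how a PSL might react to the FIS and the FRS for a single protected path. It is illustrative only and continues the assumptions of the previous sketch: the class, the bind_nhlfe callback, and the revertive flag are names invented for this example, not elements defined by this framework.

   class PathSwitchLSR:
       # Illustrative sketch (not part of this framework): switch-over and
       # switch-back at a PSL for one protected working path.

       def __init__(self, working_nhlfe, recovery_nhlfe, bind_nhlfe, revertive=True):
           self.working_nhlfe = working_nhlfe    # forwarding entry of the working path
           self.recovery_nhlfe = recovery_nhlfe  # forwarding entry of the recovery path
           self.bind_nhlfe = bind_nhlfe          # callback installing the entry for the protected traffic
           self.revertive = revertive            # revertive is the default (Section 5.0, option III)
           self.on_recovery_path = False

       def on_fis(self):
           # FIS received (or an equivalent fault detected locally): trigger the
           # protection switch of Section 3.7 by rebinding the protected traffic.
           if not self.on_recovery_path:
               self.bind_nhlfe(self.recovery_nhlfe)
               self.on_recovery_path = True

       def on_frs(self):
           # FRS received: the original (or a new) working path is available again.
           if not self.on_recovery_path:
               return
           if self.revertive:
               # Revertive mode: the working path is the preferred path; switch
               # back, ideally in a make-before-break fashion (Section 3.8.3).
               self.bind_nhlfe(self.working_nhlfe)
               self.on_recovery_path = False
           # Non-revertive mode: keep forwarding on the recovery path; the
           # restored path may later be reconfigured as its protection, or the
           # traffic may be moved to a new preferred path by management action
           # (Section 3.8.1.2).

       def manual_switch(self, to_recovery):
           # Administrative protection-switch command (Section 4.0, requirement III).
           self.bind_nhlfe(self.recovery_nhlfe if to_recovery else self.working_nhlfe)
           self.on_recovery_path = to_recovery

In this sketch both the protection switch and the switch-back are expressed as a rebinding of the protected traffic to the recovery or working path's next hop label forwarding entry, which corresponds to the recovery action of Section 3.7.2.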
1284 3.8.3 Reverting to Preferred Path (or Controlled Rearrangement)

1286 In the revertive mode, "make-before-break" restoration switching 1287 can be used, which is less disruptive than performing protection 1288 switching upon the occurrence of network impairments. This will 1289 minimize both packet loss and packet reordering. The controlled 1290 rearrangement of paths can also be used to satisfy traffic 1291 engineering requirements for load balancing across an MPLS domain.

1293 3.9 Performance

1295 Resource/performance requirements for recovery paths should be 1296 specified in terms of the following attributes:

1298 I. Resource class attribute:

1300 Equivalent Recovery Class: The recovery path has the same resource 1301 reservations and performance guarantees as the working path. In 1302 other words, the recovery path meets the same SLAs as the working 1303 path.

1305 Limited Recovery Class: The recovery path does not have the same 1306 resource reservations and performance guarantees as the working 1307 path.

1309 A. Lower Class: The recovery path has lower resource requirements or 1310 less stringent performance requirements than the working path.

1312 B. Best Effort Class: The recovery path is best effort.

1314 II. Priority Attribute:

1316 The recovery path has a priority attribute just like the working 1317 path (i.e., the priority attribute of the associated traffic 1318 trunks). It can have the same priority as the working path or a lower 1319 priority.

1321 III. Preemption Attribute:

1323 The recovery path can have the same preemption attribute as the 1324 working path or a lower one.

1326 4.0 MPLS Recovery Requirements

1327 The following are the MPLS recovery requirements:

1329 I. MPLS recovery SHALL provide an option to identify protection 1330 groups (PPGs) and protection portions (PTPs).

1332 II. Each PSL SHALL be capable of performing MPLS recovery upon the 1333 detection of impairments or upon receipt of notifications of 1334 impairments.

1336 III. An MPLS recovery method SHALL NOT preclude manual protection 1337 switching commands. This implies that it would be possible under 1338 administrative commands to transfer traffic from a working path to a 1339 recovery path, or to transfer traffic from a recovery path to a 1340 working path, once the working path becomes operational following a 1341 fault.

1343 IV. A PSL SHALL be capable of performing either a switch back to the 1344 original working path after the fault is corrected or a switchover 1345 to a new working path, upon the discovery of a more optimal working 1346 path.

1348 V. The recovery model should take into consideration path merging at 1349 intermediate LSRs. If a fault affects the merged segment, all the 1350 paths sharing that merged segment should be able to recover. 1351 Similarly, if a fault affects a non-merged segment, only the path 1352 that is affected by the fault should be recovered.

1354 5.0 MPLS Recovery Options

1356 There SHOULD be an option for:

1358 I. Configuration of the recovery path as excess or reserved, with 1359 excess as the default. The recovery path that is configured as 1360 excess SHALL provide lower-priority, preemptable traffic access to 1361 the protection bandwidth, while the recovery path configured as 1362 reserved SHALL NOT provide any other traffic access to the 1363 protection bandwidth.

1365 II. Each protected path SHALL provide an option for configuring the 1366 protection alternatives as either rerouting or protection switching.
1368 III. Each protected path SHALL provide a configuration option for 1369 enabling restoration as either non-revertive or revertive, with 1370 revertive as the default.

1372 IV. Each LSR supporting protection switching SHALL provide an option 1373 for fault notification to the PSL.

1375 6.0 Comparison Criteria

1377 Possible criteria to use for comparison of MPLS-based recovery 1378 schemes are as follows:

1380 Recovery Time

1382 We define recovery time as the time required for a recovery path to 1383 be activated (and traffic flowing) after a fault. Recovery Time is 1384 the sum of the Fault Detection Time, Hold-off Time, Notification 1385 Time, Recovery Operation Time, and the Traffic Restoration Time. In 1386 other words, it is the time between the failure of a node or link in 1387 the network and the time at which a recovery path is installed and the 1388 traffic starts flowing on it.

1390 Full Restoration Time

1392 We define full restoration time as the time required for a permanent 1393 restoration. This is the time required for traffic to be routed onto 1394 links that are capable of handling, or have been engineered to 1395 handle, traffic in recovery scenarios. Note that this time may or may 1396 not be different from the "Recovery Time", depending on whether 1397 equivalent or limited recovery paths are used.

1399 Backup Capacity

1401 Recovery schemes may require differing amounts of "backup capacity" 1402 in the event of a fault. This capacity will be dependent on the 1403 traffic characteristics of the network. However, it may also be 1404 dependent on the particular protection plan selection algorithms as 1405 well as the signaling and re-routing methods.

1407 Additive Latency

1409 Recovery schemes may introduce additive latency to traffic. For 1410 example, a recovery path may take many more hops than the working 1411 path. This may be dependent on the recovery path selection 1412 algorithms.

1414 Re-ordering

1416 Recovery schemes may introduce re-ordering of packets. The 1417 action of putting traffic back on preferred paths might also cause packet 1418 re-ordering.

1420 State Overhead

1421 As the number of recovery paths in a protection plan grows, the 1422 state required to maintain them also grows. Schemes may require 1423 differing numbers of recovery paths to maintain a given level of coverage. 1424 The state required may also depend on the particular scheme 1425 used to recover. In many cases, the state overhead will be 1426 proportional to the number of recovery paths.

1428 Loss

1430 Recovery schemes may introduce a certain amount of packet loss 1431 during switchover to a recovery path. For schemes that introduce loss 1432 during recovery, this loss can be estimated from the recovery time 1433 and the link speed.

1435 In the case of a link or node failure, some packet loss is inevitable.

1437 Coverage

1439 Recovery schemes may offer various types of failover coverage. The 1440 total coverage may be defined in terms of several metrics:

1442 I. Fault Types: Recovery schemes may account for link faults only, 1443 for both node and link faults, or also for degraded service. For example, a 1444 scheme may require more recovery paths to take node faults into 1445 account.

1447 II. Number of concurrent faults: Depending on the layout of recovery 1448 paths in the protection plan, it may be possible to recover from 1449 multiple concurrent faults.

1451 III. Number of recovery paths: For a given fault, there may be one 1452 or more recovery paths.
1454 IV. Percentage of coverage: Depending on the scheme and its 1455 implementation, a certain percentage of faults may be covered. This 1456 may be subdivided into the percentage of link faults and the percentage of 1457 node faults.

1459 V. The number of protected paths may affect how fast the total set 1460 of paths affected by a fault can be recovered. The ratio of 1461 protected paths is n/N, where n is the number of protected paths and N is 1462 the total number of paths.

1464 7.0 Security Considerations

1466 The MPLS recovery that is specified herein does not raise any 1467 security issues that are not already present in the MPLS 1468 architecture.

1470 8.0 Intellectual Property Considerations

1472 The IETF has been notified of intellectual property rights claimed 1473 in regard to some or all of the specification contained in this 1474 document. For more information, consult the online list of claimed 1475 rights.

1477 9.0 Acknowledgements

1479 We would like to thank the members of the MPLS WG mailing list for their 1480 suggestions on the earlier version of this draft. In particular, 1481 we thank Bora Akyol, Dave Allan, and Neil Harrisson, whose suggestions and 1482 comments were very helpful in revising the document.

1484 10.0 Authors' Addresses

1486 Srinivas Makam Vishal Sharma 1487 Tellabs Operations, Inc. Tellabs Research Center 1488 4951 Indiana Avenue One Kendall Square 1489 Lisle, IL 60532 Bldg. 100, Ste. 121 1490 Phone: 630-512-7217 Cambridge, MA 02139-1562 1491 Srinivas.Makam@tellabs.com Phone: 617-577-8760 1492 Vishal.Sharma@tellabs.com

1494 Ken Owens Changcheng Huang 1495 Tellabs Operations, Inc. Tellabs Operations, Inc. 1496 1106 Fourth Street 4951 Indiana Avenue 1497 St. Louis, MO 63126 Lisle, IL 60532 1498 Phone: 314-918-1579 Phone: 630-512-7754 1499 Ken.Owens@tellabs.com Changcheng.Huang@tellabs.com

1501 Ben Mack-Crane Fiffi Hellstrand 1502 Tellabs Operations, Inc. Nortel Networks 1503 4951 Indiana Avenue St Eriksgatan 115, PO Box 6701 1504 Lisle, IL 60532 113 85 Stockholm, Sweden 1505 Ph: 630-512-7255 Ph: +46 8 5088 3687 1506 Ben.Mack-Crane@tellabs.com Fiffi@nortelnetworks.com

1508 Jon Weil Brad Cain 1509 Nortel Networks Mirror Image Internet 1510 Harlow Laboratories London Road 49 Dragon Ct. 1511 Harlow Essex CM17 9NA, UK Woburn, MA 01801, USA 1512 Phone: +44 (0)1279 403935 bcain@mirror-image.com 1513 jonweil@nortelnetworks.com

1515 Loa Andersson Bilel Jamoussi 1516 Nortel Networks Nortel Networks 1517 St Eriksgatan 115, PO Box 6701 3 Federal Street, BL3-03 1518 113 85 Stockholm, Sweden Billerica, MA 01821, USA 1519 phone: +46 8 50 88 36 34 jamoussi@nortelnetworks.com 1520 loa.andersson@nortelnetworks.com

1522 Seyhan Civanlar Angela Chiu 1523 Coreon, Inc. AT&T Labs, Rm. 4-204, 1524 1200 South Avenue, Suite 103 100 Schulz Dr. 1525 Staten Island, NY 10314 Red Bank, NJ 07701 1526 Ph: (718) 889 4203 Ph: (732) 345-3441 1527 scivanlar@coreon.net alchiu@att.com

1528 11.0 References

1530 [1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label 1531 Switching Architecture", Work in Progress, Internet Draft, August 1999.

1534 [2] Andersson, L., Doolan, P., Feldman, N., Fredette, A., and Thomas, 1535 B., "LDP Specification", Work in Progress, Internet Draft, September 1999.

1538 [3] Awduche, D., Hannan, A., and Xiao, X., "Applicability Statement 1539 for Extensions to RSVP for LSP-Tunnels", draft-ietf-mpls-rsvp- 1540 tunnel-applicability-00.txt, Work in Progress, September 1999.

1542 [4] Jamoussi, B., "Constraint-Based LSP Setup using LDP", Work in 1543 Progress, Internet Draft, 1544 September 1999.
1546 [5] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource 1547 ReSerVation Protocol (RSVP) -- Version 1 Functional 1548 Specification", RFC 2205, September 1997.

1550 [6] Awduche, D., et al., "Extensions to RSVP for LSP Tunnels", Work in 1551 Progress, Internet Draft, September 1999.

1575 [12] Haskin, D., and Krishnan, R., "A Method for Setting an 1576 Alternative Label Switched Path to Handle Fast Reroute", 1577 draft-haskin-mpls-fast-reroute-01.txt, Work in Progress, 1999.