IETF Draft                                                Vishal Sharma
Multi-Protocol Label Switching                   Jasmine Networks, Inc.
Expires: August 2001
                                                         Ben Mack-Crane
                                                         Srinivas Makam
                                               Tellabs Operations, Inc.

                                                              Ken Owens
                                                Erlang Technology, Inc.

                                                       Changcheng Huang
                                                    Carleton University

                                                       Fiffi Hellstrand
                                                               Jon Weil
                                                          Loa Andersson
                                                         Bilel Jamoussi
                                                        Nortel Networks

                                                              Brad Cain
                                                  Mirror Image Internet

                                                        Seyhan Civanlar
                                                        Coreon Networks

                                                            Angela Chiu
                                                  Celion Networks, Inc.

                                                             March 2001

                   Framework for MPLS-based Recovery

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Multi-protocol label switching (MPLS) [1] integrates the label
   swapping forwarding paradigm with network layer routing.
   To deliver reliable service, MPLS requires a set of procedures to
   provide protection of the traffic carried on different paths. This
   requires that the label switched routers (LSRs) support fault
   detection, fault notification, and fault recovery mechanisms, and
   that MPLS signaling [2], [3], [4], [5], [6], [7] support the
   configuration of recovery. With these objectives in mind, this
   document specifies a framework for MPLS-based recovery.

Table of Contents

   1.0 Introduction
       1.1 Background
       1.2 Motivation for MPLS-Based Recovery
       1.3 Objectives/Goals
   2.0 Overview
       2.1 Recovery Models
       2.2 Recovery Cycles
             2.2.1 MPLS Recovery Cycle Model
             2.2.2 MPLS Reversion Cycle Model
             2.2.3 Dynamic Reroute Cycle Model
       2.3 Definitions and Terminology
       2.4 Abbreviations
   3.0 MPLS Recovery Principles
       3.1 Configuration of Recovery
       3.2 Initiation of Path Setup
       3.3 Initiation of Resource Allocation
       3.4 Scope of Recovery
             3.4.1 Topology
                   3.4.1.1 Local Repair
                   3.4.1.2 Global Repair
                   3.4.1.3 Alternate Egress Repair
                   3.4.1.4 Multi-Layer Repair
                   3.4.1.5 Concatenated Protection Domains
             3.4.2 Path Mapping
             3.4.3 Bypass Tunnels
             3.4.4 Recovery Granularity
                   3.4.4.1 Selective Traffic Recovery
                   3.4.4.2 Bundling
             3.4.5 Recovery Path Resource Use
       3.5 Fault Detection
       3.6 Fault Notification
       3.7 Switch Over Operation
             3.7.1 Recovery Trigger
             3.7.2 Recovery Action
       3.8 Post Recovery Operation
             3.8.1 Fixed Protection Counterparts
             3.8.2 Dynamic Protection Counterparts
             3.8.3 Restoration and Notification
             3.8.4 Reverting to Preferred Path
       3.9 Performance
   4.0 Recovery Requirements
   5.0 MPLS Recovery Options
   6.0 Comparison Criteria
   7.0 Security Considerations
   8.0 Intellectual Property Considerations
   9.0 Acknowledgements
   10.0 Author's Addresses
   11.0 References

1.0 Introduction

   This memo describes a framework for MPLS-based recovery. We provide
   a detailed taxonomy of recovery terminology, and discuss the
   motivation for, the objectives of, and the requirements for
   MPLS-based recovery. We outline principles for MPLS-based recovery,
   and also provide comparison criteria that may serve as a basis for
   comparing and evaluating different recovery schemes.

1.1 Background

   Network routing deployed today is focused primarily on connectivity
   and typically supports only one class of service, the best-effort
   class. Multi-protocol label switching, on the other hand, by
   integrating forwarding based on label swapping of a link-local
   label with network layer routing, allows flexibility in the
   delivery of new routing services. MPLS allows the use of
   media-specific forwarding mechanisms such as label swapping. This
   enables more sophisticated features, such as quality-of-service
   (QoS) and traffic engineering [8], to be implemented more
   effectively. An important component of providing QoS, however, is
   the ability to transport data reliably and efficiently. Although
   the current routing algorithms are very robust and survivable, the
   amount of time they take to recover from a fault can be
   significant, on the order of several seconds or minutes, causing
   serious disruption of service for some applications in the interim.
   This is unacceptable to many organizations that aim to provide a
   highly reliable service, and thus require recovery times on the
   order of tens of milliseconds, as specified, for example, in the
   GR-253 specification for SONET.

   MPLS recovery may be motivated by the notion that there are
   inherent limitations to improving the recovery times of current
   routing algorithms.
   Additional improvement not obtainable by other means can be
   obtained by augmenting these algorithms with MPLS recovery
   mechanisms. Since MPLS is likely to be the technology of choice in
   the future IP-based transport network, it is useful that MPLS be
   able to provide protection and restoration of traffic. MPLS may
   facilitate the convergence of network functionality on a common
   control and management plane. Further, a protection priority could
   be used as a differentiating mechanism for premium services that
   require high reliability. The remainder of this document provides a
   framework for MPLS-based recovery. It is focused at a conceptual
   level and is meant to address motivation, objectives and
   requirements. Issues of mechanism, policy, routing plans and
   characteristics of traffic carried by recovery paths are beyond the
   scope of this document.

1.2 Motivation for MPLS-Based Recovery

   MPLS-based protection of traffic (called MPLS-based recovery) is
   useful for a number of reasons. The most important is its ability
   to increase network reliability by enabling a faster response to
   faults than is possible with traditional Layer 3 (or IP layer)
   approaches alone, while still providing the visibility of the
   network afforded by Layer 3. Furthermore, a protection mechanism
   using MPLS could enable IP traffic to be put directly over WDM
   optical channels, without an intervening SONET layer. This would
   facilitate the construction of IP-over-WDM networks.

   The need for MPLS-based recovery arises because of the following:

   I.   Layer 3 or IP rerouting may be too slow for a core MPLS
        network that needs to support high reliability/availability.

   II.  Layer 0 (for example, optical layer) or Layer 1 (for example,
        SONET) mechanisms may not be deployed in topologies that meet
        carriers' protection goals.

   III. The granularity at which the lower layers may be able to
        protect traffic may be too coarse for traffic that is switched
        using MPLS-based mechanisms.

   IV.  Layer 0 or Layer 1 mechanisms may have no visibility into
        higher layer operations. Thus, while they may provide, for
        example, link protection, they cannot easily provide node
        protection or protection of traffic transported at Layer 3.

   V.   MPLS has desirable attributes when applied to the purpose of
        recovery for connectionless networks: specifically, that an
        LSP is source routed and a forwarding path for recovery can be
        "pinned", so that it is not affected by transient instability
        in SPF routing brought on by failure scenarios.

   Furthermore, there is a need for open standards:

   VI.  Establishing interoperability of protection mechanisms between
        routers/LSRs from different vendors in IP or MPLS networks is
        urgently required to enable the adoption of MPLS as a viable
        core transport and traffic engineering technology.

1.3 Objectives/Goals

   We lay down the following objectives for MPLS-based recovery.

   I.   MPLS-based recovery mechanisms should facilitate fast (tens of
        milliseconds) recovery times.

   II.  MPLS-based recovery should maximize network reliability and
        availability. MPLS-based recovery of traffic should minimize
        the number of single points of failure in the MPLS protected
        domain.

   III. MPLS-based recovery should enhance the reliability of the
        protected traffic while minimally or predictably degrading the
        traffic carried by the diverted resources.

   IV.  MPLS-based recovery techniques should be applicable for
        protection of traffic at various granularities. For example,
        it should be possible to specify MPLS-based recovery for a
        portion of the traffic on an individual path, for all traffic
        on an individual path, or for all traffic on a group of paths.
        Note that a path is used as a general term and includes the
        notion of a link, IP route or LSP.

   V.    MPLS-based recovery techniques may be applicable for an
         entire end-to-end path or for segments of an end-to-end path.

   VI.   MPLS-based recovery actions should not adversely affect other
         network operations.

   VII.  MPLS-based recovery actions in one MPLS protection domain
         (defined in Section 2.3) should not adversely affect the
         recovery actions in other MPLS protection domains.

   VIII. MPLS-based recovery mechanisms should be able to take into
         consideration the recovery actions of lower layers.

   IX.   MPLS-based recovery actions should avoid network-layering
         violations. That is, defects in MPLS-based mechanisms should
         not trigger lower layer protection switching.

   X.    MPLS-based recovery mechanisms should minimize the loss of
         data and packet reordering during recovery operations. (The
         current MPLS specification itself has no explicit requirement
         on reordering.)

   XI.   MPLS-based recovery mechanisms should minimize the state
         overhead incurred for each recovery path maintained.

   XII.  MPLS-based recovery mechanisms should be able to preserve the
         constraints on traffic after switchover, if desired. That is,
         if desired, the recovery path should meet the resource
         requirements of, and achieve the same performance
         characteristics as, the working path.

2.0 Overview

   There are several options for providing protection of traffic using
   MPLS. The most generic requirement is the specification of whether
   recovery should be via Layer 3 (or IP) rerouting or via MPLS
   protection switching or rerouting actions.

   Generally, network operators aim to provide the fastest and the
   best protection mechanism that can be provided at a reasonable
   cost. The higher the level of protection, the more resources are
   consumed.
   Therefore it is expected that network operators will offer a
   spectrum of service levels. MPLS-based recovery should give the
   flexibility to select the recovery mechanism, choose the
   granularity at which traffic is protected, and also choose the
   specific types of traffic that are protected, in order to give
   operators more control over that tradeoff. With MPLS-based
   recovery, it is possible to provide different levels of protection
   for different classes of service, based on their service
   requirements. For example, using approaches outlined below, a VLL
   service that supports real-time applications like VoIP may be
   supported using link/node protection together with pre-established,
   pre-reserved path protection, while best-effort traffic may use
   established-on-demand path protection or simply rely on IP reroute
   or higher layer recovery mechanisms. As another example of their
   range of application, MPLS-based recovery strategies may be used to
   protect traffic not originally flowing on label switched paths,
   such as IP traffic that is normally routed hop-by-hop, as well as
   traffic forwarded on label switched paths.

2.1 Recovery Models

   There are two basic models for path recovery: rerouting and
   protection switching.

   Protection switching and rerouting, as defined below, may be used
   together. For example, protection switching to a recovery path may
   be used for rapid restoration of connectivity while rerouting
   determines a new optimal network configuration, rearranging paths,
   as needed, at a later time [9], [10].

2.1.1 Rerouting

   Recovery by rerouting is defined as establishing new paths or path
   segments on demand for restoring traffic after the occurrence of a
   fault. The new paths may be based upon fault information, network
   routing policies, pre-defined configurations and network topology
   information.
   Thus, upon detecting a fault, paths or path segments to bypass the
   fault are established using signaling. Reroute mechanisms are
   inherently slower than protection switching mechanisms, since more
   must be done following the detection of a fault. However, reroute
   mechanisms are simpler and more frugal, as no resources are
   committed until after the fault occurs and the location of the
   fault is known.

   Once the network routing algorithms have converged after a fault,
   it may be preferable, in some cases, to reoptimize the network by
   performing a reroute based on the current state of the network and
   network policies. This is discussed further in Section 3.8.

   In terms of the principles defined in Section 3, reroute recovery
   employs paths established on demand with resources reserved on
   demand.

2.1.2 Protection Switching

   Protection switching recovery mechanisms pre-establish a recovery
   path or path segment, based upon network routing policies, the
   restoration requirements of the traffic on the working path, and
   administrative considerations. The recovery path may or may not be
   link and node disjoint with the working path [11], [14]. However,
   if the recovery path shares sources of failure with the working
   path, the overall reliability of the construct is degraded. When a
   fault is detected, the protected traffic is switched over to the
   recovery path(s) and restored.

   In terms of the principles in Section 3, protection switching
   employs pre-established recovery paths and, if resource reservation
   is required on the recovery path, pre-reserved resources.

2.1.2.1 Subtypes of Protection Switching

   The resources (bandwidth, buffers, processing) on the recovery path
   may be used to carry either a copy of the working path traffic or
   extra traffic that is displaced when a protection switch occurs.
   This leads to two subtypes of protection switching.
   In 1+1 ("one plus one") protection, the resources (bandwidth,
   buffers, processing capacity) on the recovery path are fully
   reserved, and carry the same traffic as the working path. Selection
   between the traffic on the working and recovery paths is made at
   the path merge LSR (PML). In effect, the PSL function is relegated
   to establishment of the working and recovery paths and a simple
   replication function. The recovery intelligence is delegated to the
   PML.

   In 1:1 ("one for one") protection, the resources (if any) allocated
   on the recovery path are fully available to preemptible low
   priority traffic, except when the recovery path is in use due to a
   fault on the working path. In other words, in 1:1 protection, the
   protected traffic normally travels only on the working path, and is
   switched to the recovery path only when the working path has a
   fault. Once the protection switch is initiated, the low priority
   traffic being carried on the recovery path may be displaced by the
   protected traffic. This method affords a way to make efficient use
   of the recovery path resources.

   This concept can be extended to 1:n (one for n) and m:n (m for n)
   protection.

2.2 The Recovery Cycles

   There are three defined recovery cycles: the MPLS Recovery Cycle,
   the MPLS Reversion Cycle and the Dynamic Re-routing Cycle. The
   first cycle detects a fault and restores traffic onto MPLS-based
   recovery paths. If the recovery path is non-optimal, the cycle may
   be followed by either of the latter two to achieve an optimized
   network again. The reversion cycle applies to explicitly routed
   traffic that does not rely on any dynamic routing protocols having
   converged. The dynamic re-routing cycle applies to traffic that is
   forwarded based on hop-by-hop routing.

2.2.1 MPLS Recovery Cycle Model

   The MPLS recovery cycle model is illustrated in Figure 1.
   Definitions and a key to abbreviations follow.

      --Network Impairment
      |    --Fault Detected
      |    |    --Start of Notification
      |    |    |    --Start of Recovery Operation
      |    |    |    |    --Recovery Operation Complete
      |    |    |    |    |    --Path Traffic Restored
      |    |    |    |    |    |
      v    v    v    v    v    v
      ------------------------------------------------------------
        | T1 | T2 | T3 | T4 | T5 |

               Figure 1. MPLS Recovery Cycle Model

   The various timing measures used in the model are described below.

      T1  Fault Detection Time
      T2  Hold-off Time
      T3  Notification Time
      T4  Recovery Operation Time
      T5  Traffic Restoration Time

   Definitions of the recovery cycle times are as follows:

   Fault Detection Time

      The time between the occurrence of a network impairment and the
      moment the fault is detected by MPLS-based recovery mechanisms.
      This time may be highly dependent on lower layer protocols.

   Hold-Off Time

      The configured waiting time between the detection of a fault and
      the taking of MPLS-based recovery action, to allow time for
      lower layer protection to take effect. The Hold-off Time may be
      zero.

      Note: The Hold-off Time may occur after the Notification Time
      interval if the node responsible for the switchover, the Path
      Switch LSR (PSL), rather than the detecting LSR, is configured
      to wait.

   Notification Time

      The time between initiation of a fault indication signal (FIS)
      by the LSR detecting the fault and the time at which the Path
      Switch LSR (PSL) begins the recovery operation. This is zero if
      the PSL detects the fault itself or infers a fault from such
      events as an adjacency failure.

      Note: If the PSL detects the fault itself, there still may be a
      Hold-off Time period between detection and the start of the
      recovery operation.

   Recovery Operation Time

      The time between the first and last recovery actions.
      This may include message exchanges between the PSL and PML to
      coordinate recovery actions.

   Traffic Restoration Time

      The time between the last recovery action and the time that the
      traffic (if present) is completely recovered. This interval is
      intended to account for the time required for traffic to once
      again arrive at the point in the network that experienced
      disrupted or degraded service due to the occurrence of the fault
      (e.g., the PML). This time may depend on the location of the
      fault, the recovery mechanism, and the propagation delay along
      the recovery path.

2.2.2 MPLS Reversion Cycle Model

   Protection switching in revertive mode requires the traffic to be
   switched back to a preferred path when the fault on that path is
   cleared. The MPLS reversion cycle model is illustrated in Figure 2.
   Note that the cycle shown below comes after the recovery cycle
   shown in Figure 1.

      --Network Impairment Repaired
      |    --Fault Cleared
      |    |    --Path Available
      |    |    |    --Start of Reversion Operation
      |    |    |    |    --Reversion Operation Complete
      |    |    |    |    |    --Traffic Restored on Preferred Path
      |    |    |    |    |    |
      v    v    v    v    v    v
      ------------------------------------------------------------
        | T7 | T8 | T9 | T10 | T11 |

               Figure 2. MPLS Reversion Cycle Model

   The various timing measures used in the model are described below.

      T7   Fault Clearing Time
      T8   Wait-to-Restore Time
      T9   Notification Time
      T10  Reversion Operation Time
      T11  Traffic Restoration Time

   Note that time T6 (not shown above) is the time for which the
   network impairment is not repaired and traffic is flowing on the
   recovery path.

   Definitions of the reversion cycle times are as follows:

   Fault Clearing Time

      The time between the repair of a network impairment and the time
      that MPLS-based mechanisms learn that the fault has been
      cleared.
      This time may be highly dependent on lower layer protocols.

   Wait-to-Restore Time

      The configured waiting time between the clearing of a fault and
      the MPLS-based recovery action(s). A waiting time may be needed
      to ensure the path is stable and to avoid flapping in cases
      where a fault is intermittent. The Wait-to-Restore Time may be
      zero.

      Note: The Wait-to-Restore Time may occur after the Notification
      Time interval if the PSL is configured to wait.

   Notification Time

      The time between initiation of a fault recovery signal (FRS) by
      the LSR clearing the fault and the time at which the Path Switch
      LSR begins the reversion operation. This is zero if the PSL
      clears the fault itself.

      Note: If the PSL clears the fault itself, there still may be a
      Wait-to-Restore Time period between fault clearing and the start
      of the reversion operation.

   Reversion Operation Time

      The time between the first and last reversion actions. This may
      include message exchanges between the PSL and PML to coordinate
      reversion actions.

   Traffic Restoration Time

      The time between the last reversion action and the time that
      traffic (if present) is completely restored on the preferred
      path. This interval is expected to be quite small, since both
      paths are working and care may be taken to limit the traffic
      disruption (e.g., using "make before break" techniques and
      synchronous switch-over).

   In practice, the only interesting times in the reversion cycle are
   the Wait-to-Restore Time and the Traffic Restoration Time (or some
   other measure of traffic disruption). Given that both paths are
   available, there is no need for rapid operation, and a
   well-controlled switch-back with minimal disruption is desirable.

2.2.3 Dynamic Re-routing Cycle Model

   Dynamic rerouting aims to bring the IP network to a stable state
   after a network impairment has occurred.
   A re-optimized network is achieved after the routing protocols have
   converged, and the traffic is moved from a recovery path to a
   (possibly) new working path. The steps involved in this mode are
   illustrated in Figure 3.

   Note that the cycle shown below may be overlaid on the recovery
   cycle shown in Figure 1 or the reversion cycle shown in Figure 2,
   or both (in the event that both the recovery cycle and the
   reversion cycle take place before the routing protocols converge,
   and after the convergence of the routing protocols it is determined
   (based on on-line algorithms or off-line traffic engineering tools,
   network configuration, or a variety of other possible criteria)
   that there is a better route for the working path).

      --Network Enters a Semi-stable State after an Impairment
      |    --Dynamic Routing Protocols Converge
      |    |    --Initiate Setup of New Working Path between PSL
      |    |    |  and PML
      |    |    |    --Switchover Operation Complete
      |    |    |    |    --Traffic Moved to New Working Path
      |    |    |    |    |
      v    v    v    v    v
      ------------------------------------------------------------
        | T12 | T13 | T14 | T15 |

              Figure 3. Dynamic Rerouting Cycle Model

   The various timing measures used in the model are described below.

      T12  Network Route Convergence Time
      T13  Hold-down Time (optional)
      T14  Switchover Operation Time
      T15  Traffic Restoration Time

   Network Route Convergence Time

      We define the network route convergence time as the time taken
      for the network routing protocols to converge and for the
      network to reach a stable state.

   Hold-down Time

      We define the hold-down period as a bounded time for which a
      recovery path must be used. In some scenarios it may be
      difficult to determine whether the working path is stable. In
      these cases a hold-down time may be used to prevent excess
      flapping of traffic between a working and a recovery path.
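   The hold-down behavior described above can be sketched in a few
   lines of code. This is an illustrative model only, not part of the
   framework; the class name, the injected clock, and the threshold
   value are all assumptions made for the example.

```python
import time


class HolddownTimer:
    """Illustrative hold-down timer: once traffic has been switched to
    the recovery path, switch-back to a (possibly new) working path is
    not permitted until a bounded hold-down period has elapsed. This
    prevents traffic from flapping between a working and a recovery
    path when the working path's stability is uncertain."""

    def __init__(self, holddown_seconds, clock=time.monotonic):
        self.holddown_seconds = holddown_seconds
        self.clock = clock            # injectable clock, for testability
        self.started_at = None        # set when traffic moves to recovery

    def start(self):
        """Called when traffic is switched onto the recovery path."""
        self.started_at = self.clock()

    def may_switch_back(self):
        """True once the hold-down period has expired."""
        if self.started_at is None:
            return False
        return self.clock() - self.started_at >= self.holddown_seconds
```

   In practice the hold-down period would be a configured parameter,
   and expiry of the timer would merely permit, not force, the
   switch-over to the new working path.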
   Switchover Operation Time

      The time between the first and last switchover actions. This may
      include message exchanges between the PSL and PML to coordinate
      the switchover actions.

   As an example of the recovery cycle, we present the sequence of
   events that occurs after a network impairment, when a protection
   switch is followed by dynamic rerouting:

   I.    A link or path fault occurs.
   II.   Signaling (FIS) is initiated for the detected fault.
   III.  The FIS arrives at the PSL.
   IV.   The PSL initiates a protection switch to a pre-configured
         recovery path.
   V.    The PSL switches the traffic over from the working path to
         the recovery path.
   VI.   The network enters a semi-stable state.
   VII.  Dynamic routing protocols converge after the fault, and a new
         working path is calculated (based, for example, on some of
         the criteria mentioned earlier in Section 2.1.1).
   VIII. A new working path is established between the PSL and the PML
         (the assumption is that the PSL and PML have not changed).
   IX.   Traffic is switched over to the new working path.

2.3 Definitions and Terminology

   This document assumes the terminology given in [1] and, in
   addition, introduces the following new terms.

2.3.1 General Recovery Terminology

   Rerouting

      A recovery mechanism in which the recovery path or path segments
      are created dynamically after the detection of a fault on the
      working path. In other words, a recovery mechanism in which the
      recovery path is not pre-established.

   Protection Switching

      A recovery mechanism in which the recovery path or path segments
      are created prior to the detection of a fault on the working
      path. In other words, a recovery mechanism in which the recovery
      path is pre-established.

   Working Path

      The protected path that carries traffic before the occurrence of
      a fault. The working path exists between a PSL and PML.
The working
625 path can be of different kinds: a hop-by-hop routed path, a trunk, a
626 link, an LSP, or part of a multipoint-to-point LSP.

628 Synonyms for a working path are primary path and active path.

630 Recovery Path

632 The path by which traffic is restored after the occurrence of a
633 fault. In other words, the path on which the traffic is directed by
634 the recovery mechanism. The recovery path is established by MPLS
635 means. The recovery path can either be an equivalent recovery path,
636 ensuring no reduction in quality of service, or a limited
637 recovery path, which does not guarantee the same quality of service
638 (or some other criterion of performance) as the working path. A
639 limited recovery path is not expected to be used for an extended
640 period of time.

642 Synonyms for a recovery path are back-up path, alternative path, and
643 protection path.

645 Protection Counterpart

647 The "other" path when discussing pre-planned protection switching
648 schemes. The protection counterpart for the working path is the
649 recovery path, and vice-versa.

651 Path Group (PG)

653 A logical bundling of multiple working paths, each of which is routed
654 identically between a Path Switch LSR and a Path Merge LSR.

656 Protected Path Group (PPG)

658 A path group that requires protection.

660 Protected Traffic Portion (PTP)

662 The portion of the traffic on an individual path that requires
663 protection. For example, code points in the EXP bits of the shim
664 header may identify a protected portion.

666 Path Switch LSR (PSL)

668 The PSL is responsible for switching or replicating the traffic
669 between the working path and the recovery path.

671 Path Merge LSR (PML)

673 An LSR that receives both working path traffic and its corresponding
674 recovery path traffic, and either merges the traffic into a single
675 outgoing path, or, if it is itself the destination, passes the
676 traffic on to the higher layer protocols.
678 Intermediate LSR

680 An LSR on a working or recovery path that is neither a PSL nor a PML
681 for that path.

683 Bypass Tunnel

685 A path that serves to back up a set of working paths using the label
686 stacking approach [1]. The working paths and the bypass tunnel must
687 all share the same path switch LSR (PSL) and path merge LSR
688 (PML).

690 Switch-Over

692 The process of switching traffic from the path on which it is
693 flowing to one or more alternate paths. This may involve
694 moving traffic from a working path onto one or more recovery paths,
695 or may involve moving traffic from one or more recovery paths onto a
696 more optimal working path.

698 Switch-Back

700 The process of returning traffic from one or more recovery paths
701 back to the working path(s).

703 Revertive Mode

705 A recovery mode in which traffic is automatically switched back from
706 the recovery path to the original working path upon the restoration
707 of the working path to a fault-free condition. This assumes that a
708 failed working path does not automatically surrender resources to the
709 network.

711 Non-revertive Mode

713 A recovery mode in which traffic is not automatically switched back
714 to the original working path after this path is restored to a
715 fault-free condition. (Depending on the configuration, the original
716 working path may, upon returning to a fault-free condition, become
717 the recovery path, or it may be used for new working traffic and no
718 longer be associated with its original recovery path.)

720 MPLS Protection Domain

722 The set of LSRs over which a working path and its corresponding
723 recovery path are routed.

725 MPLS Protection Plan
726 The set of all LSP protection paths and the mapping from working to
727 protection paths deployed in an MPLS protection domain at a given
728 time.

730 Liveness Message

732 A message exchanged periodically between two adjacent LSRs that
733 serves as a link probing mechanism.
It provides an integrity check of
734 the forward and the backward directions of the link between the two
735 LSRs, as well as a check of neighbor aliveness.

737 Path Continuity Test

739 A test that verifies the integrity and continuity of a path or path
740 segment. The details of such a test are beyond the scope of this
741 draft. (This could be accomplished, for example, by transmitting a
742 control message along the same links and nodes as the data traffic,
743 or, similarly, it could be inferred from the absence of traffic
744 together with a feedback mechanism.)

746 2.3.2 Failure Terminology

748 Path Failure (PF)
749 Path failure is a fault detected by MPLS-based recovery mechanisms.
750 It is defined as the failure of the liveness message test or of a
751 path continuity test, indicating that path connectivity is lost.

753 Path Degraded (PD)
754 Path degraded is a fault detected by MPLS-based recovery mechanisms
755 that indicates that the quality of the path is unacceptable.

757 Link Failure (LF)
758 A lower layer fault indicating that link continuity is lost. This may
759 be communicated to the MPLS-based recovery mechanisms by the lower
760 layer.

762 Link Degraded (LD)
763 A lower layer indication to MPLS-based recovery mechanisms that the
764 link is performing below an acceptable level.

766 Fault Indication Signal (FIS)
767 A signal that indicates that a fault along a path has occurred. It is
768 relayed by each intermediate LSR to its upstream or downstream
769 neighbor, until it reaches an LSR that is set up to perform MPLS
770 recovery. The FIS is transmitted periodically by the node/nodes
771 closest to the point of failure, for some configurable length of
772 time.

774 Fault Recovery Signal (FRS)
775 A signal that indicates that a fault along a working path has been
776 repaired. Like the FIS, it is relayed by each intermediate LSR
777 to its upstream or downstream neighbor, until it reaches the LSR that
778 performs recovery of the original path.
The FRS is transmitted
779 periodically by the node/nodes closest to the point of failure, for
780 some configurable length of time.

782 2.4 Abbreviations

784 FIS:  Fault Indication Signal.
785 FRS:  Fault Recovery Signal.
786 LD:   Link Degraded.
787 LF:   Link Failure.
788 PD:   Path Degraded.
789 PF:   Path Failure.
790 PG:   Path Group.
791 PML:  Path Merge LSR.
792 PPG:  Protected Path Group.
793 PSL:  Path Switch LSR.
794 PTP:  Protected Traffic Portion.

796 3.0 MPLS-based Recovery Principles

798 MPLS-based recovery refers to the ability to effect quick and
799 complete restoration of traffic affected by a fault in an MPLS-
800 enabled network. The fault may be detected at the IP layer or in
801 lower layers over which IP traffic is transported. The fastest MPLS
802 recovery is assumed to be achieved with protection switching, with an
803 MPLS LSR switch-over completion time that is comparable,
804 or equivalent, to the 50 ms switch-over completion time of the
805 SONET layer. This section provides a discussion of the concepts and
806 principles of MPLS-based recovery. The concepts are presented in
807 terms of atomic or primitive terms that may be combined to specify
808 recovery approaches. We do not make any assumptions about the
809 underlying layer 1 or layer 2 transport mechanisms or their recovery
810 mechanisms.

812 3.1 Configuration of Recovery

814 An LSR should allow for configuration of the following recovery
815 options:

817 Default-recovery (No MPLS-based recovery enabled):
818 Traffic on the working path is recovered only via Layer 3 or IP
819 rerouting or by some lower layer mechanism such as SONET APS. This
820 is equivalent to having no MPLS-based recovery.
This option may be
821 used for low priority traffic or for traffic that is recovered in
822 another way (for example, load-shared traffic on parallel working
823 paths may be automatically recovered upon a fault along one of the
824 working paths by distributing it among the remaining working paths).

826 Recoverable (MPLS-based recovery enabled):
827 This working path is recovered using one or more recovery paths,
828 either via rerouting or via protection switching.

830 3.2 Initiation of Path Setup

832 There are three options for the initiation of the recovery path
833 setup.

835 Pre-established:

837 This is the same as the protection switching option. Here a recovery
838 path(s) is established prior to any failure on the working path. The
839 path selection can either be determined by an administrative
840 centralized tool (online or offline), or chosen based on some
841 algorithm implemented at the PSL and possibly intermediate nodes. To
842 guard against the situation in which the pre-established recovery path
843 fails before or at the same time as the working path, the recovery
844 path should have secondary configuration options as explained in
845 Section 3.3 below.

847 Pre-qualified:

849 A recovery path need not be pre-established; it may instead be
850 pre-qualified. A pre-qualified recovery path is not created expressly
851 for protecting the working path, but instead is a path created for
852 other purposes that is designated as a recovery path after it is
853 determined to be an acceptable alternative for carrying the working
854 path traffic. Variants include the case where an optical path or
855 trail is configured, but no switches are set.

857 Established-on-Demand:

859 This is the same as the rerouting option. Here, a recovery path is
860 established after a failure on its working path has been detected and
861 notified to the PSL.
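The three initiation options above can be summarized in a small sketch. This is illustrative only; the enum and function names are assumptions made for the example, not names defined by this framework:

```python
from enum import Enum

class SetupMode(Enum):
    # The three initiation options of Section 3.2.
    PRE_ESTABLISHED = "pre-established"        # created before any failure
    PRE_QUALIFIED = "pre-qualified"            # existing path designated later
    ESTABLISHED_ON_DEMAND = "on-demand"        # created after the failure

def recovery_style(mode):
    """Map a setup-initiation option to the recovery mechanism it implies:
    a path that exists before the fault supports protection switching,
    while a path created after the fault corresponds to rerouting."""
    if mode in (SetupMode.PRE_ESTABLISHED, SetupMode.PRE_QUALIFIED):
        return "protection switching"
    return "rerouting"
```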
863 3.3 Initiation of Resource Allocation 865 A recovery path may support the same traffic contract as the working 866 path, or it may not. We will distinguish these two situations by 867 using different additive terms. If the recovery path is capable of 868 replacing the working path without degrading service, it will be 869 called an equivalent recovery path. If the recovery path lacks the 870 resources (or resource reservations) to replace the working path 871 without degrading service, it will be called a limited recovery path. 872 Based on this, there are two options for the initiation of resource 873 allocation: 875 Pre-reserved: 877 This option applies only to protection switching. Here a pre- 878 established recovery path reserves required resources on all hops 879 along its route during its establishment. Although the reserved 880 resources (e.g., bandwidth and/or buffers) at each node cannot be 881 used to admit more working paths, they are available to be used by 882 all traffic that is present at the node before a failure occurs. 884 Reserved-on-Demand: 886 This option may apply either to rerouting or to protection switching. 887 Here a recovery path reserves the required resources after a failure 888 on the working path has been detected and notified to the PSL and 889 before the traffic on the working path is switched over to the 890 recovery path. 892 Note that under both the options above, depending on the amount of 893 resources reserved on the recovery path, it could either be an 894 equivalent recovery path or a limited recovery path. 896 3.4 Scope of Recovery 898 3.4.1 Topology 900 3.4.1.1 Local Repair 902 The intent of local repair is to protect against a link or neighbor 903 node fault and to minimize the amount of time required for failure 904 propagation. In local repair (also known as local recovery [12] [9]), 905 the node immediately upstream of the fault is the one to initiate 906 recovery (either rerouting or protection switching). 
Local repair can
907 be of two types:

909 Link Recovery/Restoration

911 In this case, the recovery path may be configured to route around a
912 certain link deemed to be unreliable. If protection switching is
913 used, several recovery paths may be configured for one working path,
914 depending on the specific faulty link that each protects against.

916 Alternatively, if rerouting is used, upon the occurrence of a fault
917 on the specified link, each path is rebuilt such that it detours
918 around the faulty link.
919 In this case, the recovery path need only be disjoint from its
920 working path at a particular link on the working path, and may have
921 overlapping segments with the working path. Traffic on the working
922 path is switched over to an alternate path at the upstream LSR that
923 connects to the failed link. This method is potentially the fastest
924 to perform the switchover, and can be effective in situations where
925 certain path components are much more unreliable than others.

927 Node Recovery/Restoration

929 In this case, the recovery path may be configured to route around a
930 neighbor node deemed to be unreliable. Thus the recovery path is
931 disjoint from the working path only at a particular node and at links
932 associated with the working path at that node. Once again, the
933 traffic on the primary path is switched over to the recovery path at
934 the upstream LSR that directly connects to the failed node, and the
935 recovery path shares overlapping portions with the working path.

937 3.4.1.2 Global Repair

939 The intent of global repair is to protect against any link or node
940 fault on a path or on a segment of a path, with the obvious exception
941 of faults occurring at the ingress node of the protected path
942 segment. In global repair the PSL is usually distant from the failure
943 and needs to be notified by a FIS.
944 End-to-end path recovery/restoration is a special case of global repair.
945 In many cases, the recovery path can be made completely link and node
946 disjoint with its working path. This has the advantage of protecting
947 against all link and node fault(s) on the working path (end-to-end
948 path or path segment).
949 However, it is in some cases slower than local repair, since it takes
950 longer for the fault notification message to reach the PSL and
951 trigger the recovery action.

953 3.4.1.3 Alternate Egress Repair

955 It is possible to restore service without specifically recovering the
956 faulted path.
957 For example, for best effort IP service it is possible to select a
958 recovery path that has a different egress point from the working path
959 (i.e., there is no PML). The recovery path egress must simply be a
960 router that is acceptable for forwarding the FEC carried by the
961 working path (without creating loops). In an engineering context,
962 specific alternative FEC/LSP mappings with alternate egresses can be
963 formed.

965 This may simplify the task of enhancing the reliability of implicitly
966 constructed MPLS topologies. A PSL may qualify LSP/FEC bindings as
967 candidate recovery paths simply by verifying that they are link and
968 node disjoint from the immediate downstream LSR of the working path.

970 3.4.1.4 Multi-Layer Repair

972 Multi-layer repair broadens the network designer's tool set for those
973 cases where multiple network layers can be managed together to
974 achieve overall network goals. Specific criteria for determining
975 when multi-layer repair is appropriate are beyond the scope of this
976 draft.

978 3.4.1.5 Concatenated Protection Domains

980 A given service may cross multiple networks, and these may employ
981 different recovery mechanisms. It is possible to concatenate
982 protection domains so that service recovery can be provided
983 end-to-end.
It is considered that the recovery mechanisms in different 984 domains may operate autonomously, and that multiple points of 985 attachment may be used between domains (to ensure there is no single 986 point of failure). Alternate egress repair requires management of 987 concatenated domains in that an explicit MPLS point of failure (the 988 PML) is by definition excluded. Details of concatenated protection 989 domains are beyond the scope of this draft. 991 3.4.2 Path Mapping 993 Path mapping refers to the methods of mapping traffic from a faulty 994 working path on to the recovery path. There are several options for 995 this, as described below. Note that the options below should be 996 viewed as atomic terms that only describe how the working and 997 protection paths are mapped to each other. The issues of resource 998 reservation along these paths, and how switchover is actually 999 performed lead to the more commonly used composite terms, such as 1+1 1000 and 1:1 protection, which were described in Section 2.1. 1002 1-to-1 Protection 1004 In 1-to-1 protection the working path has a designated recovery path 1005 that is only to be used to recover that specific working path. 1007 n-to-1 Protection 1009 In n-to-1 protection, up to n working paths are protected using only 1010 one recovery path. If the intent is to protect against any single 1011 fault on any of the working paths, the n working paths should be 1012 diversely routed between the same PSL and PML. In some cases, 1013 handshaking between PSL and PML may be required to complete the 1014 recovery, the details of which are beyond the scope of this draft. 1016 n-to-m Protection 1018 In n-to-m protection, up to n working paths are protected using m 1019 recovery paths. Once again, if the intent is to protect against any 1020 single fault on any of the n working paths, the n working paths and 1021 the m recovery paths should be diversely routed between the same PSL 1022 and PML. 
In some cases, handshaking between PSL and PML may be
1023 required to complete the recovery, the details of which are beyond
1024 the scope of this draft. N-to-m protection is for further study.

1026 Split Path Protection

1028 In split path protection, multiple recovery paths are allowed to
1029 carry the traffic of a working path based on a certain configurable
1030 load splitting ratio. This is especially useful when no single
1031 recovery path can be found that can carry the entire traffic of the
1032 working path in case of a fault. Split path protection may require
1033 handshaking between the PSL and the PML(s), and may require the
1034 PML(s) to correlate the traffic arriving on multiple recovery paths
1035 with the working path. Although this is an attractive option, the
1036 details of split path protection are beyond the scope of this draft,
1037 and are for further study.

1039 3.4.3 Bypass Tunnels
1040 It may be convenient, in some cases, to create a "bypass tunnel" for
1041 a PPG between a PSL and PML, thereby allowing multiple recovery paths
1042 to be transparent to intervening LSRs [8]. In this case, one LSP
1043 (the tunnel) is established between the PSL and PML following an
1044 acceptable route, and a number of recovery paths are supported through
1045 the tunnel via label stacking. A bypass tunnel can be used with any
1046 of the path mapping options discussed in the previous section.

1048 As with recovery paths, the bypass tunnel may or may not have
1049 resource reservations sufficient to provide recovery without service
1050 degradation. It is possible that the bypass tunnel may have
1051 sufficient resources to recover some number of working paths, but not
1052 all at the same time. If the number of recovery paths carrying
1053 traffic in the tunnel at any given time is restricted, this is
1054 similar to the n-to-1 or n-to-m protection cases described in Section
1055 3.4.2.
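The label-stacking idea behind a bypass tunnel can be illustrated with a minimal sketch, where a label stack is modeled as a Python list with the top of the stack first. All label values and function names here are hypothetical, invented for this example:

```python
def enter_bypass_at_psl(stack, inner_label, tunnel_label):
    """Sketch: at the PSL, swap the working path's top label for the
    recovery path's label, then push the bypass tunnel's label on top.
    Intervening LSRs forward on the tunnel label only, so the
    individual recovery paths inside remain transparent to them."""
    return [tunnel_label, inner_label] + stack[1:]

def exit_bypass_at_pml(stack):
    """Sketch: at the PML, pop the tunnel label to expose the recovery
    path's label underneath."""
    return stack[1:]
```

With a working-path label of 100, a recovery-path label of 200, and a tunnel label of 300, the packet carries [300, 200] inside the tunnel and [200] after the PML pops the tunnel label (penultimate-hop popping is ignored in this sketch).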
1057 3.4.4 Recovery Granularity 1059 Another dimension of recovery considers the amount of traffic 1060 requiring protection. This may range from a fraction of a path to a 1061 bundle of paths. 1063 3.4.4.1 Selective Traffic Recovery 1065 This option allows for the protection of a fraction of traffic within 1066 the same path. The portion of the traffic on an individual path that 1067 requires protection is called a protected traffic portion (PTP). A 1068 single path may carry different classes of traffic, with different 1069 protection requirements. The protected portion of this traffic may be 1070 identified by its class, as for example, via the EXP bits in the MPLS 1071 shim header or via the priority bit in the ATM header. 1073 3.4.4.2 Bundling 1075 Bundling is a technique used to group multiple working paths together 1076 in order to recover them simultaneously. The logical bundling of 1077 multiple working paths requiring protection, each of which is routed 1078 identically between a PSL and a PML, is called a protected path group 1079 (PPG). When a fault occurs on the working path carrying the PPG, the 1080 PPG as a whole can be protected either by being switched to a bypass 1081 tunnel or by being switched to a recovery path. 1083 3.4.5 Recovery Path Resource Use 1085 In the case of pre-reserved recovery paths, there is the question of 1086 what use these resources may be put to when the recovery path is not 1087 in use. There are two options: 1089 Dedicated-resource: 1090 If the recovery path resources are dedicated, they may not be used 1091 for anything except carrying the working traffic. For example, in 1092 the case of 1+1 protection, the working traffic is always carried on 1093 the recovery path. Even if the recovery path is not always carrying 1094 the working traffic, it may not be possible or desirable to allow 1095 other traffic to use these resources. 
1097 Extra-traffic-allowed: 1098 If the recovery path only carries the working traffic when the 1099 working path fails, then it is possible to allow extra traffic to use 1100 the reserved resources at other times. Extra traffic is, by 1101 definition, traffic that can be displaced (without violating service 1102 agreements) whenever the recovery path resources are needed for 1103 carrying the working path traffic. 1105 3.5 Fault Detection 1107 MPLS recovery is initiated after the detection of either a lower 1108 layer fault or a fault at the IP layer or in the operation of MPLS- 1109 based mechanisms. We consider four classes of impairments: Path 1110 Failure, Path Degraded, Link Failure, and Link Degraded. 1112 Path Failure (PF) is a fault that indicates to an MPLS-based recovery 1113 scheme that the connectivity of the path is lost. This may be 1114 detected by a path continuity test between the PSL and PML. Some, 1115 and perhaps the most common, path failures may be detected using a 1116 link probing mechanism between neighbor LSRs. An example of a probing 1117 mechanism is a liveness message that is exchanged periodically along 1118 the working path between peer LSRs. For either a link probing 1119 mechanism or path continuity test to be effective, the test message 1120 must be guaranteed to follow the same route as the working or 1121 recovery path, over the segment being tested. In addition, the path 1122 continuity test must take the path merge points into consideration. 1123 In the case of a bi-directional link implemented as two 1124 unidirectional links, path failure could mean that either one or both 1125 unidirectional links are damaged. 1127 Path Degraded (PD) is a fault that indicates to MPLS-based recovery 1128 schemes/mechanisms that the path has connectivity, but that the 1129 quality of the connection is unacceptable. 
This may be detected by a
1130 path performance monitoring mechanism, or some other mechanism for
1131 determining the error rate on the path or some portion of the path.
1132 One example, local to the LSR, is excessive discarding of
1133 packets at an interface, due, for example, to label mismatches or TTL
1134 errors.

1136 Link Failure (LF) is an indication from a lower layer that the link
1137 over which the path is carried has failed. If the lower layer
1138 supports detection and reporting of this fault (that is, any fault
1139 that indicates link failure, e.g., SONET LOS), this may be used by the
1140 MPLS recovery mechanism. In some cases, using LF indications may
1141 provide faster fault detection than using only MPLS-based fault
1142 detection mechanisms.

1144 Link Degraded (LD) is an indication from a lower layer that the link
1145 over which the path is carried is performing below an acceptable
1146 level. If the lower layer supports detection and reporting of this
1147 fault, it may be used by the MPLS recovery mechanism. In some cases,
1148 using LD indications may provide faster fault detection than using
1149 only MPLS-based fault detection mechanisms.

1151 3.6 Fault Notification

1153 MPLS-based recovery relies on rapid and reliable notification of
1154 faults. Once a fault is detected, the node that detected the fault
1155 must determine if the fault is severe enough to require path
1156 recovery. If the node is not capable of initiating direct action
1157 (e.g., as a PSL), the node should send out a notification of the fault
1158 by transmitting a FIS to those of its upstream LSRs that were sending
1159 traffic on the working path that is affected by the fault. This
1160 notification is relayed hop-by-hop by each subsequent LSR to its
1161 upstream neighbor, until it eventually reaches a PSL. A PSL is the
1162 only LSR that can terminate the FIS and initiate a protection switch
1163 of the working path to a recovery path.
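The hop-by-hop relay just described can be sketched as follows; the data structures (an upstream-neighbor map and a set of PSLs) are assumptions made for illustration, not structures defined by this framework:

```python
def propagate_fis(detecting_lsr, upstream, psls):
    """Sketch of FIS propagation: starting at the LSR that detected the
    fault, relay the FIS hop by hop to each upstream neighbor until an
    LSR configured as a PSL terminates it and can initiate the
    protection switch. `upstream` maps an LSR to its upstream neighbor
    on the working path; `psls` is the set of LSRs acting as PSLs."""
    relayed_to = []
    node = detecting_lsr
    while node not in psls:
        node = upstream[node]     # relay the FIS one hop upstream
        relayed_to.append(node)
    return relayed_to             # the last entry is the terminating PSL
```

If the detecting node is itself a PSL, no FIS needs to be relayed and the function returns an empty list, matching the case where the node can initiate direct action.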
1165 Since the FIS is a control message, it should be transmitted with
1166 high priority to ensure that it propagates rapidly towards the
1167 affected PSL(s). Depending on how fault notification is configured in
1168 the LSRs of an MPLS domain, the FIS could be sent either as a Layer 2
1169 or Layer 3 packet [13]. The use of a Layer 2-based notification
1170 requires a direct Layer 2 path to the PSL. An example of a FIS could
1171 be the liveness message sent by a downstream LSR to its upstream
1172 neighbor, with an optional fault notification field set, or it can be
1173 denoted implicitly by a teardown message. Alternatively, it could be
1174 a separate fault notification packet. The intermediate LSR should
1175 identify which of its incoming links (upstream LSRs) to propagate the
1176 FIS on. In the case of 1+1 protection, the FIS should also be sent
1177 downstream to the PML, where the recovery action is taken.

1179 3.7 Switch-Over Operation

1181 3.7.1 Recovery Trigger

1183 The activation of an MPLS protection switch following the detection
1184 or notification of a fault requires a trigger mechanism at the PSL.
1185 MPLS protection switching may be initiated due to automatic inputs or
1186 external commands. The automatic activation of an MPLS protection
1187 switch results from a response to defect or fault conditions
1188 detected at the PSL or to fault notifications received at the PSL. It
1189 is possible that the fault detection and trigger mechanisms may be
1190 combined, as is the case when a PF, PD, LF, or LD is detected at a
1191 PSL and triggers a protection switch to the recovery path. In most
1192 cases, however, the detection and trigger mechanisms are distinct,
1193 involving the detection of a fault at some intermediate LSR followed by
1194 the propagation of a fault notification back to the PSL via the FIS,
1195 which serves as the protection switch trigger at the PSL.
MPLS
1196 protection switching in response to external commands results when
1197 the operator initiates a protection switch by a command to a PSL (or
1198 alternatively by a configuration command to an intermediate LSR,
1199 which transmits the FIS towards the PSL).

1201 Note that the PF fault applies to hard failures (fiber cuts,
1202 transmitter failures, or LSR fabric failures), as does the LF fault,
1203 with the difference that the LF is a lower layer impairment that may
1204 be communicated to MPLS-based recovery mechanisms. The PD (or LD)
1205 fault, on the other hand, applies to soft defects (excessive errors
1206 due to noise on the link, for instance). The PD (or LD) results in a
1207 fault declaration only when the percentage of lost packets exceeds a
1208 given threshold, which is provisioned and may be set based on the
1209 service level agreement(s) in effect between a service provider and a
1210 customer.

1212 3.7.2 Recovery Action

1214 After a fault is detected or a FIS is received by the PSL, the recovery
1215 action involves either a rerouting or a protection switching operation.
1216 In both scenarios, the next hop label forwarding entry for the recovery
1217 path is bound to the working path.

1219 3.8 Post Recovery Operation

1221 When traffic is flowing on the recovery path, a decision can be made
1222 whether to let the traffic remain on the recovery path, treating it
1223 as a new working path, or to switch the traffic back to the old
1224 working path or over to a new one. This post recovery operation has
1225 two styles: one in which the protection counterparts, i.e., the
1226 working and recovery paths, are fixed or "pinned" to their routes,
1227 and one in which the PSL or another network entity with real-time
1228 knowledge of the failure dynamically performs re-establishment or
1229 controlled rearrangement of the paths comprising the protected service.
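The post-recovery decision described above can be sketched as a small decision function. This is an illustrative sketch only; the enum, parameter names, and returned strings are invented for the example:

```python
from enum import Enum

class PostRecoveryStyle(Enum):
    FIXED = "fixed"      # counterparts pinned to their routes (Section 3.8.1)
    DYNAMIC = "dynamic"  # paths re-established or rearranged (Section 3.8.2)

def next_action(style, revertive, working_restored, new_path_ready):
    """Sketch: once traffic flows on the recovery path, decide whether
    it stays there or is switched again, per the two styles above."""
    if style is PostRecoveryStyle.FIXED:
        # Fixed counterparts: revert only in revertive mode, and only
        # once the original working path is restored to service.
        if revertive and working_restored:
            return "switch back to original working path"
        return "stay on recovery path"
    # Dynamic counterparts: wait until a new working path is established.
    if new_path_ready:
        return "switch over to new working path"
    return "stay on recovery path"
```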
1231 3.8.1 Fixed Protection Counterparts

1233 For fixed protection counterparts, the PSL will be pre-configured with
1234 the appropriate behavior to apply when the original fixed path is
1235 restored to service. The choices are revertive and non-revertive
1236 mode. The choice will typically depend on the relative costs of the
1237 working and protection paths, and on the tolerance of the service to
1238 the effects of switching paths yet again. These protection modes
1239 indicate whether or not there is a preferred path for the protected traffic.

1241 3.8.1.1 Revertive Mode

1243 If the working path is always the preferred path, this path will be
1244 used whenever it is available. Thus, in the event of a fault on this
1245 path, its unused resources will not be reclaimed by the network.
1246 If the working path has a fault, traffic is switched to the
1247 recovery path. In the revertive mode of operation, when the
1248 preferred path is restored, the traffic is automatically switched back
1249 to it.

1251 There are a number of implications to pinned working and recovery
1252 paths:
1253 - upon a failure, once traffic is moved to the recovery path, it is
1254   unprotected until the path defect in the original working path is
1255   repaired and that path is restored to service.
1256 - upon a failure, once traffic is moved to the recovery path, the
1257   resources associated with the original path remain reserved.

1259 3.8.1.2 Non-revertive Mode

1261 In the non-revertive mode of operation, either there is no preferred
1262 path, or it may be desirable to minimize further disruption of the
1263 service brought on by a revertive switching operation. A switch-back
1264 to the original working path may be undesired, or impossible, since the
1265 original path may no longer exist after the occurrence of a fault on
1266 that path.
1267 If there is a fault on the working path, traffic is switched to the
1268 recovery path.
When or if the faulty path (the original working
1269 path) is restored, it may become the recovery path (either by
1270 configuration, or, if desired, by management actions).

1272 In the non-revertive mode of operation, the working traffic may or
1273 may not eventually be moved to a new optimal working path or back to
1274 the original working path. Depending on the case, it might be useful
1275 to: (a) administratively perform a protection switch
1276 back to the original working path after gaining further assurances
1277 about the integrity of the path; (b) continue operation on the
1278 recovery path; or (c) move the traffic to a new optimal working path
1279 that is calculated based on network topology and network policies.

1282 3.8.2 Dynamic Protection Counterparts

1284 For dynamic protection counterparts, when the traffic is switched over
1285 to a recovery path, the association between the original working path
1286 and the recovery path may no longer exist, since the original path
1287 itself may no longer exist after the fault. Instead, when the network
1288 reaches a stable state following routing convergence, the traffic on
1289 the recovery path may be switched over to a different preferred path,
1290 either based on optimization over the new network topology and
1291 associated information or based on pre-configured information.

1293 Dynamic protection counterparts assume that, upon failure, the PSL or
1294 another network entity will establish new working paths if another
1295 switch-over is to be performed.

1297 3.8.3 Restoration and Notification

1299 MPLS restoration deals with returning the working traffic from the
1300 recovery path to the original or a new working path. Reversion is
1301 performed by the PSL either upon receiving notification, via the FRS,
1302 that the working path is repaired, or upon receiving notification
1303 that a new working path is established.
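The reversion triggers just listed can be sketched as a tiny handler at the PSL. The signal names and return strings are invented for this illustration; this framework does not define a concrete encoding for either notification:

```python
def on_restoration_signal(active_path, signal):
    """Sketch of Section 3.8.3: the PSL reverts when an FRS reports the
    original working path repaired, or moves traffic when notified that
    a new working path has been established."""
    if signal == "FRS":
        return "original working path"
    if signal == "NEW_WORKING_PATH":
        return "new working path"
    return active_path  # any other signal leaves traffic where it is
```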
   For fixed counterparts in revertive mode, the LSR that detected the
   fault on the working path also detects the restoration of the
   working path. If the working path had experienced an LF defect, the
   LSR detects a return to normal operation via the receipt of a
   liveness message from its peer. If the working path had experienced
   an LD defect at an LSR interface, the LSR could detect a return to
   normal operation via the resumption of error-free packet reception
   on that interface. Alternatively, a lower layer that no longer
   detects an LF defect may inform the MPLS-based recovery mechanisms
   at the LSR that the link to its peer LSR is operational.

   The LSR then transmits an FRS to its upstream LSR(s) that were
   transmitting traffic on the working path. At the point the PSL
   receives the FRS, it switches the working traffic back to the
   original working path.

   A similar scheme applies to dynamic counterparts, where, for
   example, an update of topology and/or network convergence may
   trigger the installation or setup of new working paths and a
   notification to the PSL to perform a switchover.

   We note that if there is a way to transmit fault information back
   along a recovery path towards a PSL, and if the recovery path is an
   equivalent working path, it is possible for the working path and its
   recovery path to exchange roles once the original working path is
   repaired following a fault. This is because, in that case, the
   recovery path effectively becomes the working path, and the restored
   working path functions as a recovery path for the original recovery
   path. This is important, since it affords the benefits of
   non-revertive switch operation outlined in Section 3.8.1.2, without
   leaving the recovery path unprotected.
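   The hop-by-hop propagation of the FRS from the detecting LSR back to
   the PSL can be pictured with a minimal sketch. The function and path
   representation below are assumptions made for illustration, not part
   of the draft.

```python
# Illustrative sketch (names invented): a Fault Recovery Signal (FRS)
# travels upstream from the LSR that detects restoration of the
# working path until it reaches the PSL, which then reverts traffic.

def propagate_frs(path, detecting_lsr):
    """Return the ordered list of LSRs the FRS visits, ending at the
    PSL.  `path` is the working path as an ordered list of LSR names,
    PSL first; `detecting_lsr` is where restoration was detected.
    """
    idx = path.index(detecting_lsr)
    # The FRS is relayed upstream: detecting LSR -> ... -> PSL.
    return path[idx::-1]

working_path = ["PSL", "LSR1", "LSR2", "PML"]
visited = propagate_frs(working_path, "LSR2")
# visited ends at the PSL, which performs the switch back to the
# working path on receipt of the FRS.
```

   The sketch only captures the ordering of the notification; timers,
   message loss, and the actual switchover are outside its scope.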
3.8.4 Reverting to Preferred Path (or Controlled Rearrangement)

   In the revertive mode, "make before break" restoration switching can
   be used, which is less disruptive than performing protection
   switching upon the occurrence of network impairments. This minimizes
   both packet loss and packet reordering. The controlled rearrangement
   of paths can also be used to satisfy traffic engineering
   requirements for load balancing across an MPLS domain.

3.9 Performance

   Resource/performance requirements for recovery paths should be
   specified in terms of the following attributes:

   I. Resource Class Attribute:

   Equivalent Recovery Class: The recovery path has the same resource
   reservations and performance guarantees as the working path. In
   other words, the recovery path meets the same SLAs as the working
   path.

   Limited Recovery Class: The recovery path does not have the same
   resource reservations and performance guarantees as the working
   path.

   A. Lower Class: The recovery path has lower resource requirements or
   less stringent performance requirements than the working path.

   B. Best Effort Class: The recovery path is best effort.

   II. Priority Attribute:

   The recovery path has a priority attribute just like the working
   path (i.e., the priority attribute of the associated traffic
   trunks). It can have the same priority as the working path or a
   lower priority.

   III. Preemption Attribute:

   The recovery path can have the same preemption attribute as the
   working path or a lower one.

4.0 MPLS Recovery Requirements

   The following are the MPLS recovery requirements:

   I. MPLS recovery SHALL provide an option to identify protection
   groups (PPGs) and protection portions (PTPs).

   II. Each PSL SHALL be capable of performing MPLS recovery upon the
   detection of impairments or upon receipt of notifications of
   impairments.

   III.
An MPLS recovery method SHALL NOT preclude manual protection
   switching commands. This implies that it would be possible, under
   administrative commands, to transfer traffic from a working path to
   a recovery path, or to transfer traffic from a recovery path to a
   working path once the working path becomes operational following a
   fault.

   IV. A PSL SHALL be capable of performing either a switch back to the
   original working path after the fault is corrected, or a switchover
   to a new working path upon the discovery or establishment of a more
   optimal working path.

   V. The recovery model should take into consideration path merging at
   intermediate LSRs. If a fault affects the merged segment, all the
   paths sharing that merged segment should be able to recover.
   Similarly, if a fault affects a non-merged segment, only the path
   that is affected by the fault should be recovered.

5.0 MPLS Recovery Options

   There SHOULD be an option for:

   I. Configuration of the recovery path as excess or reserved, with
   excess as the default. A recovery path that is configured as excess
   SHALL give lower-priority preemptable traffic access to the
   protection bandwidth, while a recovery path configured as reserved
   SHALL NOT give any other traffic access to the protection bandwidth.

   II. Configuring the protection alternatives as either rerouting or
   protection switching.

   III. Enabling restoration as either non-revertive or revertive, with
   non-revertive as the default if fixed protection counterparts are
   used.

6.0 Comparison Criteria

   Possible criteria for comparing MPLS-based recovery schemes are as
   follows:

   Recovery Time

   We define recovery time as the time required for a recovery path to
   be activated (and traffic flowing) after a fault.
Recovery Time is the sum of the Fault Detection Time, Hold-off Time,
   Notification Time, Recovery Operation Time, and Traffic Restoration
   Time. In other words, it is the time between the failure of a node
   or link in the network and the time at which a recovery path is
   installed and the traffic starts flowing on it.

   Full Restoration Time

   We define full restoration time as the time required for a permanent
   restoration. This is the time required for traffic to be routed onto
   links that are capable of, or have been engineered sufficiently to,
   handle traffic in recovery scenarios. Note that this time may or may
   not be different from the "Recovery Time", depending on whether
   equivalent or limited recovery paths are used.

   Setup Vulnerability

   The amount of time that a working path, or a set of working paths,
   is left unprotected during such tasks as recovery path computation
   and recovery path setup may be used to compare schemes. The nature
   of this vulnerability should be taken into account, e.g.: End-to-End
   schemes correlate the vulnerability with working paths, Local Repair
   schemes have a topological correlation that cuts across working
   paths, and Network Plan approaches have a correlation that impacts
   the entire network.

   Backup Capacity

   Recovery schemes may require differing amounts of "backup capacity"
   in the event of a fault. This capacity will depend on the traffic
   characteristics of the network. However, it may also depend on the
   particular protection plan selection algorithms as well as on the
   signaling and re-routing methods.

   Additive Latency

   Recovery schemes may introduce additive latency to traffic. For
   example, a recovery path may take many more hops than the working
   path. This may depend on the recovery path selection algorithms.
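   The decomposition of Recovery Time defined above is simple additive
   arithmetic; a minimal sketch follows. The component values are made
   up purely for illustration.

```python
# Recovery Time as defined in Section 6.0: the sum of five
# components.  The millisecond values below are illustrative only.

components = {
    "fault_detection":     10.0,  # time to detect the fault
    "hold_off":            50.0,  # configured wait before acting
    "notification":         5.0,  # fault notification reaching the PSL
    "recovery_operation":   2.0,  # protection switch at the PSL
    "traffic_restoration":  3.0,  # until traffic flows on recovery path
}

# Total time between the failure and traffic flowing on the
# recovery path.
recovery_time_ms = sum(components.values())
# -> 70.0 ms for these example values
```

   Comparing schemes then amounts to comparing how each one shrinks the
   dominant components, e.g. detection versus notification time.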
   Quality of Protection

   Recovery schemes can be considered to encompass a spectrum of
   "packet survivability", which may range from "relative" to
   "absolute". Relative survivability may mean that the packet is on an
   equal footing with other traffic of, for example, the same diff-serv
   code point (DSCP) in contending for the surviving network resources.
   Absolute survivability may mean that the survivability of the
   protected traffic has explicit guarantees.

   Re-ordering

   Recovery schemes may introduce re-ordering of packets. The action of
   putting traffic back on preferred paths might also cause packet
   re-ordering.

   State Overhead

   As the number of recovery paths in a protection plan grows, the
   state required to maintain them also grows. Schemes may require
   differing numbers of paths to maintain certain levels of coverage.
   The state required may also depend on the particular scheme used to
   recover. In many cases, the state overhead will be in proportion to
   the number of recovery paths.

   Loss

   Recovery schemes may introduce a certain amount of packet loss
   during switchover to a recovery path. For schemes that introduce
   loss during recovery, this loss can be estimated by evaluating the
   recovery time in proportion to the link speed. In the case of a
   link or node failure, a certain amount of packet loss is inevitable.

   Coverage

   Recovery schemes may offer various types of failover coverage. The
   total coverage may be defined in terms of several metrics:

   I. Fault types: Recovery schemes may account for only link faults,
   for both node and link faults, or also for degraded service. For
   example, a scheme may require more recovery paths to take node
   faults into account.

   II. Number of concurrent faults: Depending on the layout of recovery
   paths in the protection plan, recovery from multiple fault scenarios
   may be possible.

   III.
Number of recovery paths: For a given fault, there may be one or
   more recovery paths.

   IV. Percentage of coverage: Depending on a scheme and its
   implementation, a certain percentage of faults may be covered. This
   may be subdivided into the percentage of link faults and the
   percentage of node faults covered.

   V. The number of protected paths may affect how fast the total set
   of paths affected by a fault can be recovered. The ratio of
   protected paths is n/N, where n is the number of protected paths and
   N is the total number of paths.

7.0 Security Considerations

   The MPLS recovery that is specified herein does not raise any
   security issues that are not already present in the MPLS
   architecture.

8.0 Intellectual Property Considerations

   The IETF has been notified of intellectual property rights claimed
   in regard to some or all of the specification contained in this
   document. For more information, consult the online list of claimed
   rights.

9.0 Acknowledgements

   We would like to thank the members of the MPLS WG mailing list for
   their suggestions on earlier versions of this draft; in particular,
   Bora Akyol, Dave Allan, Neil Harrison, and Dave Danenberg, whose
   suggestions and comments were very helpful in revising the document.

10.0 Authors' Addresses

   Vishal Sharma
   Jasmine Networks, Inc.
   3061 B, Zanker Road
   San Jose, CA 95134
   Phone: 408-895-5030
   vsharma@JasmineNetworks.com

   Ben Mack-Crane
   Tellabs Operations, Inc.
   4951 Indiana Avenue
   Lisle, IL 60532
   Phone: 630-512-7255
   Ben.Mack-Crane@tellabs.com

   Srinivas Makam
   Tellabs Operations, Inc.
   Lisle, IL 60532
   Phone: 630-512-7217
   Srinivas.Makam@tellabs.com

   Ken Owens
   Erlang Technology, Inc.
   St. Louis, MO 63119
   keno@erlangtech.com

   Changcheng Huang
   Dept. of Systems & Computer Engg.
   Carleton University
   Minto Center, Rm. 3082
   1125 Colonial By Drive
   Ottawa, Ontario K1S 5B6, Canada
   Phone: 613 520-2600 x2477
   Changcheng.Huang@sce.carleton.ca

   Fiffi Hellstrand
   Nortel Networks
   St Eriksgatan 115
   PO Box 6701
   113 85 Stockholm, Sweden
   Phone: +46 8 5088 3687
   Fiffi@nortelnetworks.com

   Jon Weil
   Nortel Networks
   Harlow Laboratories, London Road
   Harlow, Essex CM17 9NA, UK
   Phone: +44 (0)1279 403935
   jonweil@nortelnetworks.com

   Brad Cain
   Mirror Image Internet
   49 Dragon Ct.
   Woburn, MA 01801, USA
   bcain@mirror-image.com

   Loa Andersson
   Nortel Networks
   St Eriksgatan 115, PO Box 6701
   113 85 Stockholm, Sweden
   Phone: +46 8 50 88 36 34
   loa.andersson@nortelnetworks.com

   Bilel Jamoussi
   Nortel Networks
   3 Federal Street, BL3-03
   Billerica, MA 01821, USA
   Phone: (978) 288-4506
   jamoussi@nortelnetworks.com

   Seyhan Civanlar
   Coreon, Inc.
   1200 South Avenue, Suite 103
   Staten Island, NY 10314
   Phone: (718) 889-4203
   scivanlar@coreon.net

   Angela Chiu
   Celion Networks, Inc.
   One Shiela Drive, Suite 2
   Tinton Falls, NJ 07724
   Phone: (732) 345-3441
   angela.chiu@celion.com

11.0 References

   [1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label
   Switching Architecture", RFC 3031, January 2001.

   [2] Andersson, L., Doolan, P., Feldman, N., Fredette, A., and
   Thomas, B., "LDP Specification", RFC 3036, January 2001.

   [3] Awduche, D., Hannan, A., and Xiao, X., "Applicability Statement
   for Extensions to RSVP for LSP-Tunnels", draft-ietf-mpls-rsvp-
   tunnel-applicability-01.txt, Work in Progress, April 2000.

   [4] Jamoussi, B., et al., "Constraint-Based LSP Setup using LDP",
   draft-ietf-mpls-cr-ldp-04.txt, Work in Progress, July 2000.

   [5] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource
   ReSerVation Protocol (RSVP) -- Version 1 Functional Specification",
   RFC 2205, September 1997.

   [6] Awduche, D., et al., "Extensions to RSVP for LSP Tunnels",
   draft-ietf-mpls-rsvp-lsp-tunnel-07.txt, Work in Progress, August
   2000.

   [7] Hellstrand, F., and Andersson, L., "Extensions to RSVP-TE and
   CR-LDP for setup of pre-established LSP Tunnels", draft-hellstrand-
   mpls-recovery-merge-01.txt, Work in Progress, November 2000.

   [8] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus,
   J., "Requirements for Traffic Engineering Over MPLS", RFC 2702,
   September 1999.

   [9] Kini, S., Lakshman, T. V., and Villamizar, C., "Shared Backup
   Label Switched Path Restoration", draft-kini-restoration-shared-
   backup-00.txt, Work in Progress, October 2000.

   [10] Goguen, R., and Swallow, G., "RSVP Label Allocation for Backup
   Tunnels", draft-swallow-rsvp-bypass-label-01.txt, Work in Progress,
   November 2000.

   [11] Kini, S., Lakshman, T. V., and Villamizar, C., "Reservation
   Protocol with Traffic Engineering Extensions: Extension for Label
   Switched Path Restoration", draft-kini-rsvp-lsp-restoration-00.txt,
   Work in Progress, November 2000.

   [12] Haskin, D., and Krishnan, R., "A Method for Setting an
   Alternative Label Switched Path to Handle Fast Reroute", draft-
   haskin-mpls-fast-reroute-05.txt, Work in Progress, November 2000.

   [13] Owens, K., Makam, V., Sharma, V., Mack-Crane, B., and Huang,
   C., "A Path Protection/Restoration Mechanism for MPLS Networks",
   draft-chang-mpls-path-protection-02.txt, Work in Progress, November
   2000.