Network Working Group                                        N. Sprecher
Internet-Draft                                    Nokia Siemens Networks
Intended status: Informational                                 A. Farrel
Expires: December 20, 2010                            Old Dog Consulting
                                                           June 20, 2010

 Multiprotocol Label Switching Transport Profile Survivability Framework

                   draft-ietf-mpls-tp-survive-fwk-06.txt

Abstract

Network survivability is the ability of a network to recover traffic delivery following failure or degradation of network resources. Survivability is critical for the delivery of guaranteed network services, such as those subject to strict Service Level Agreements (SLAs) that place maximum bounds on the length of time for which services may be degraded or unavailable.

The Transport Profile of Multiprotocol Label Switching (MPLS-TP) is a packet-based transport technology based on the MPLS data plane that re-uses many aspects of the MPLS management and control planes.

This document comprises a framework for the provision of survivability in an MPLS-TP network; it describes recovery elements, types, methods, and topological considerations. To enable data-plane recovery, survivability may be supported by the control plane, management plane, and by Operations, Administration and Maintenance (OAM) functions. This document describes mechanisms for recovering MPLS-TP Label Switched Paths (LSPs). A detailed description of pseudowire recovery in MPLS-TP networks is beyond the scope of this document.
This document is a product of a joint Internet Engineering Task Force (IETF) / International Telecommunication Union Telecommunication Standardization Sector (ITU-T) effort to include an MPLS Transport Profile within the IETF MPLS and PWE3 architectures to support the capabilities and functionalities of a packet-based transport network, as defined by the ITU-T.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 20, 2010.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Table of Contents

   1. Introduction
   1.1. Recovery Schemes
   1.2. Recovery Action Initiation
   1.3. Recovery Context
   1.4. Scope of this Framework
   2. Terminology and References
   3. Requirements for Survivability
   4. Functional Architecture
   4.1. Elements of Control
   4.1.1. Operator Control
   4.1.2. Defect-Triggered Actions
   4.1.3. OAM Signaling
   4.1.4. Control-Plane Signaling
   4.2. Elements of Recovery
   4.2.1. Span Recovery
   4.2.2. Segment Recovery
   4.2.3. End-to-End Recovery
   4.3. Levels of Recovery
   4.3.1. Dedicated Protection
   4.3.2. Shared Protection
   4.3.3. Extra Traffic
   4.3.4. Restoration
   4.3.5. Reversion
   4.4. Mechanisms for Protection
   4.4.1. Link-Level Protection
   4.4.2. Alternate Paths and Segments
   4.4.3. Protection Tunnels
   4.5. Recovery Domains
   4.6. Protection in Different Topologies
   4.7. Mesh Networks
   4.7.1. 1:n Linear Protection
   4.7.2. 1+1 Linear Protection
   4.7.3. P2MP Linear Protection
   4.7.4. Triggers for the Linear Protection Switching Action
   4.7.5. Applicability of Linear Protection for LSP Segments
   4.7.6. Shared Mesh Protection
   4.8. Ring Networks
   4.9. Recovery in Layered Networks
   4.9.1. Inherited Link-Level Protection
   4.9.2. Shared Risk Groups
   4.9.3. Fault Correlation
   5. Applicability and Scope of Survivability in MPLS-TP
   6. Mechanisms for Providing Survivability for MPLS-TP LSPs
   6.1. Management Plane
   6.1.1. Configuration of Protection Operation
   6.1.2. External Manual Commands
   6.2. Fault Detection
   6.3. Fault Localization
   6.4. OAM Signaling
   6.4.1. Fault Detection
   6.4.2. Testing for Faults
   6.4.3. Fault Localization
   6.4.4. Fault Reporting
   6.4.5. Coordination of Recovery Actions
   6.5. Control Plane
   6.5.1. Fault Detection
   6.5.2. Testing for Faults
   6.5.3. Fault Localization
   6.5.4. Fault Status Reporting
   6.5.5. Coordination of Recovery Actions
   6.5.6. Establishment of Protection and Restoration LSPs
   7. Pseudowire Recovery Considerations
   7.1. Utilizing Underlying MPLS-TP Recovery
   7.2. Recovery in the Pseudowire Layer
   8. Manageability Considerations
   9. Security Considerations
   10. IANA Considerations
   11. Acknowledgments
   12. References
   12.1. Normative References
   12.2. Informative References
Editors' Note:

This Informational Internet-Draft is aimed at achieving IETF Consensus before publication as an RFC and will be subject to an IETF Last Call.

[RFC Editor, please remove this note before publication as an RFC and insert the correct Streams Boilerplate to indicate that the published RFC has IETF Consensus.]

1. Introduction

Network survivability is the network's ability to recover traffic delivery following the failure or degradation of traffic delivery caused by a network fault or a denial-of-service attack on the network. Survivability plays a critical role in the delivery of reliable services in transport networks. Guaranteed services in the form of Service Level Agreements (SLAs) require a resilient network that very rapidly detects facility or node degradation or failures, and immediately starts to recover network operations in accordance with the terms of the SLA.

The MPLS Transport Profile (MPLS-TP) is described in [MPLS-TP-FWK]. MPLS-TP is designed to be consistent with existing transport network operations and management models, while providing survivability mechanisms, such as protection and restoration. The functionality provided is intended to be similar to or better than that found in established transport networks, which set a high benchmark for reliability. That is, it is intended to provide the operator with functions with which they are familiar through their experience with other transport networks, although this does not preclude additional techniques.

This document provides a framework for MPLS-TP-based survivability that meets the recovery requirements specified in [RFC5654]. It uses the recovery terminology defined in [RFC4427], which draws heavily on [G.808.1].

This document is a product of a joint Internet Engineering Task Force (IETF) / International Telecommunication Union Telecommunication Standardization Sector (ITU-T) effort to include an MPLS Transport Profile within the IETF MPLS and PWE3 architectures to support the capabilities and functionalities of a packet-based transport network as defined by the ITU-T.

1.1. Recovery Schemes

Various recovery schemes (for protection and restoration) and processes have been defined and analyzed in [RFC4427] and [RFC4428]. These schemes can also be applied in MPLS-TP networks to re-establish end-to-end traffic delivery according to the agreed service parameters, and to trigger recovery from "failed" or "degraded" transport entities. In the context of this document, transport entities are nodes, links, transport path segments, concatenated transport path segments, and entire transport paths. Recovery actions are initiated by the detection of a defect or by an external request (e.g., an operator's request for manual control of protection switching).

[RFC4427] makes a distinction between protection switching and restoration mechanisms.

- Protection switching uses pre-assigned capacity between nodes, where the simplest scheme has a single, dedicated protection entity for each working entity, while the most complex scheme has m protection entities shared between n working entities (m:n).

- Restoration uses any capacity available between nodes and usually involves re-routing.
The resources used for restoration may be pre-planned (i.e., predetermined, but not yet allocated to the recovery path), and recovery priority may be used as a differentiation mechanism to determine which services are recovered and which are not.

Both protection switching and restoration may be either unidirectional or bidirectional; unidirectional implies that protection switching is performed independently for each direction of a bidirectional transport path, while bidirectional means that both directions are switched simultaneously using appropriate coordination, even if the fault applies to only one direction of the path.

Both protection and restoration mechanisms may be either revertive or non-revertive, as described in Section 4.11 of [RFC4427].

Pre-emption priority may be used to determine which services are sacrificed to enable the recovery of other services. In general, protection actions are completed within time frames amounting to tens of milliseconds, while automated restoration actions are normally completed within periods ranging from hundreds of milliseconds to a maximum of a few seconds. Restoration is not guaranteed (for example, because network resources may not be available at the time of the defect).

1.2. Recovery Action Initiation

The recovery schemes described in [RFC4427] and evaluated in [RFC4428] are presented in the context of control-plane-driven actions (such as the configuration of the protection entities and functions, etc.). The presence of a distributed control plane in an MPLS-TP network is optional. However, the absence of such a control plane does not affect the operation of the network and the use of MPLS-TP forwarding, Operations, Administration and Maintenance (OAM), and survivability capabilities. In particular, the concepts discussed in [RFC4427] and [RFC4428] refer to recovery actions effected in the data plane; they are equally applicable in MPLS-TP, with or without the use of a control plane.

Thus, some of the MPLS-TP recovery mechanisms do not depend on a control plane and use MPLS-TP OAM mechanisms or management actions to trigger recovery actions.

The principles of MPLS-TP protection-switching actions are similar to those described in [RFC4427], since the protection mechanism is based on the capability to detect certain defects in the transport entities within the recovery domain. The protection-switching controller does not care which initiation method is used, provided that it can be given information about the status of the transport entities within the recovery domain (e.g., OK, signal failure, signal degradation, etc.).

In the context of MPLS-TP, it is imperative to ensure that switchovers can be performed regardless of the way in which the network is configured and managed (for example, regardless of whether a control plane, management plane, or OAM initiation mechanism is used).

All MPLS and GMPLS protection mechanisms [RFC4428] are applicable in an MPLS-TP environment. It is also possible to provision and manage the related protection entities and functions defined in MPLS and GMPLS using the management plane [RFC5654]. Regardless of whether an OAM, management, or control plane initiation mechanism is used, the protection-switching operation is a data-plane operation.
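As a non-normative illustration of this independence, the following Python sketch models a protection-switching controller that acts only on the reported status of the transport entities in the recovery domain. The class and status names are illustrative assumptions; no specific MPLS-TP interface is implied.

<CODE BEGINS>
from enum import Enum

class Status(Enum):
    OK = "ok"                  # no defect reported
    SIGNAL_FAIL = "sf"         # signal failure reported
    SIGNAL_DEGRADE = "sd"      # signal degradation reported

class ProtectionSwitchingController:
    """Selects the working or protection entity from entity status alone.

    The controller is deliberately unaware of whether a status report
    originated from OAM, the management plane, or control-plane
    signaling: any initiation mechanism feeds the same decision.
    """

    def __init__(self):
        self.status = {"working": Status.OK, "protection": Status.OK}
        self.selected = "working"

    def report(self, entity: str, status: Status) -> str:
        """Record a status report from any source and re-evaluate."""
        self.status[entity] = status
        if (self.status["working"] is not Status.OK
                and self.status["protection"] is Status.OK):
            self.selected = "protection"   # switch over
        elif self.status["working"] is Status.OK:
            self.selected = "working"      # revertive behavior assumed
        return self.selected

# The same switchover results whether the trigger came from an OAM
# continuity check, an operator command relayed by the management
# plane, or a control-plane notification:
ctl = ProtectionSwitchingController()
assert ctl.report("working", Status.SIGNAL_FAIL) == "protection"
assert ctl.report("working", Status.OK) == "working"
<CODE ENDS>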
In some recovery schemes (such as bidirectional protection switching), it is necessary to coordinate the protection state between the edges of the recovery domain to achieve initiation of recovery actions for both directions. An MPLS-TP protocol may be used as an in-band (i.e., data-plane-based) control protocol in order to coordinate the protection state between the edges of the protection domain. When the MPLS-TP control plane is in use, a control-plane-based mechanism can also be used to coordinate the protection states between the edges of the protection domain.

1.3. Recovery Context

An MPLS-TP Label Switched Path (LSP) may be subject to any part of, or all of, MPLS-TP link recovery, path-segment recovery, or end-to-end recovery, where:

o MPLS-TP link recovery refers to the recovery of an individual link (and hence all or a subset of the LSPs routed over the link) between two MPLS-TP nodes. For example, link recovery may be provided by server-layer recovery.

o Segment recovery refers to the recovery of an LSP segment (i.e., segment and concatenated segment in the language of [RFC5654]) between two nodes and is used to recover from the failure of one or more links or nodes.

o End-to-end recovery refers to the recovery of an entire LSP, from its ingress to its egress node.

For additional resiliency, more than one of these recovery techniques may be configured concurrently for a single path.

Co-routed bidirectional MPLS-TP LSPs are defined in a way that allows both directions of the LSP to follow the same route through the network. In this scenario, the operator often requires the directions to fate-share (that is, if one direction fails, both directions should cease to operate).

Associated bidirectional MPLS-TP LSPs exist where the two directions of a bidirectional LSP follow different paths through the network. An operator may also request fate-sharing for associated bidirectional LSPs.

The requirement for fate-sharing causes a direct interaction between the recovery processes affecting the two directions of an LSP, so that both directions of the bidirectional LSP are recovered at the same time. This mode of recovery is termed bidirectional recovery and may be seen as a consequence of fate-sharing.

The recovery scheme operating at the data-plane level can function in a multi-domain environment (in the wider sense of a "domain" [RFC4726]). It can also protect against a failure of a boundary node in the case of inter-domain operation. MPLS-TP recovery schemes are intended to protect client traffic as it is sent across the MPLS-TP network.

1.4. Scope of this Framework

This framework introduces the architecture of the MPLS-TP recovery domain and describes the recovery schemes in MPLS-TP (based on the recovery types defined in [RFC4427]) as well as the principles of operation, recovery states, recovery triggers, and information exchanges between the different elements that support the reference model.

The framework also describes the qualitative levels of the survivability functions that can be provided, such as dedicated recovery, shared protection, restoration, etc. In the event of a network failure, the level of recovery offered directly affects the service level provided to the end-user.
The general description of the functional architecture is applicable to both LSPs and pseudowires (PWs); however, PW recovery is only introduced in Section 7, and the relevant details are beyond the scope of this document and are for further study.

This framework applies to general recovery schemes as well as to mechanisms that are optimized for specific topologies and are tailored to efficiently handle protection switching.

This document addresses the need for the coordination of protection switching across multiple layers and at sub-layers (for clarity, we use the term "layer" to refer equally to layers and sub-layers). This allows an operator to prevent race conditions and allows the protection-switching mechanism of one layer to recover from a failure before switching is invoked at another layer.

This framework also specifies the functions that must be supported by MPLS-TP to support the recovery mechanisms. MPLS-TP introduces a tool kit to enable recovery in MPLS-TP-based networks and to ensure that affected services are recovered in the event of a failure.

Generally, network operators aim to provide the fastest, most stable, and best protection mechanism at a reasonable cost in accordance with customer requirements. The greater the level of protection required, the greater the amount of network resources consumed. It is therefore expected that network operators will offer a wide spectrum of service levels. MPLS-TP-based recovery offers the flexibility to select a recovery mechanism, define the granularity at which traffic delivery is to be protected, and choose the specific traffic types that are to be protected. With MPLS-TP-based recovery, it should be possible to provide different levels of protection for different traffic classes within the same path, based on the service requirements.

2. Terminology and References

The terminology used in this document is consistent with that defined in [RFC4427]. The latter is consistent with [G.808.1].

However, certain protection concepts (such as ring protection) are not discussed in [RFC4427]; for those concepts, the terminology used in this document is drawn from [G.841].

Readers should refer to those documents for normative definitions. This document supplies brief summaries of a number of terms for reasons of clarity and to assist the reader, but it does not re-define terms.

Note, in particular, the distinction and definitions made in [RFC4427] for the following three terms:

o Protection: re-establishing end-to-end traffic delivery using pre-allocated resources.

o Restoration: re-establishing end-to-end traffic delivery using resources allocated at the time of need; sometimes referred to as "repair" of a service, LSP, or the traffic.

o Recovery: a generic term covering both Protection and Restoration.

Note that the term "survivability" is used in [RFC5654] to cover the functional elements of "protection" and "restoration", which are collectively known as "recovery".

Important background information on survivability can be found in [RFC3386], [RFC3469], [RFC4426], [RFC4427], and [RFC4428].
In this document, the following additional terminology is applied:

o "Fault Management", as defined in [MPLS-TP-NM-Framework].

o The terms "defect" and "failure" are used interchangeably to indicate any defect or failure in the sense that they are defined in [G.806]. The terms also include any signal degradation event as defined in [G.806].

o A "fault" is a fault or fault cause as defined in [G.806].

o "Trigger" indicates any event that may initiate a recovery action. See Section 4.1 for a more detailed discussion of triggers.

o The acronym "OAM" is defined as Operations, Administration and Maintenance, consistent with [OAM-SOUP].

o A "Transport Entity" is a node, link, transport path segment, concatenated transport path segment, or entire transport path.

o A "Working Entity" is a transport entity that carries traffic during normal network operation.

o A "Protection Entity" is a transport entity that is pre-allocated and used to protect and transport traffic when the working entity fails.

o A "Recovery Entity" is a transport entity that is used to recover and transport traffic when the working entity fails.

o "Survivability Actions" are the steps that may be taken by network nodes to communicate faults and to switch traffic from faulted or degraded paths to other paths. This may include sending messages and establishing new paths.

General terminology for MPLS-TP is found in [MPLS-TP-FWK] and [ROSETTA]. Background information on MPLS-TP requirements can be found in [RFC5654].

3. Requirements for Survivability

MPLS-TP requirements are presented in [RFC5654] and serve as normative references for the definition of all MPLS-TP functionality, including survivability. Survivability is presented in [RFC5654] as playing a critical role in the delivery of reliable services, and the requirements for survivability are set out using the recovery terminology defined in [RFC4427].

4. Functional Architecture

This section presents an overview of the elements relating to the functional architecture for survivability within an MPLS-TP network. The components are presented separately to demonstrate the way in which they may be combined to provide the different levels of recovery needed to meet the requirements set out in the previous section.

4.1. Elements of Control

Recovery is achieved by implementing specific actions. These actions aim to repair network resources or redirect traffic along paths that avoid failures in the network. They may be triggered automatically by the MPLS-TP network nodes upon detection of a network defect, or they may be triggered by an operator. Automated actions may be enhanced by in-band (i.e., data-plane-based) OAM mechanisms, or by in-band or out-of-band control-plane signaling.

4.1.1. Operator Control

The survivability behavior of the network as a whole, and the reaction of each transport path when a fault is reported, may be controlled by the operator. This control can be split into two sets of functions: policies and actions performed when the transport path is set up, and commands used to control or force recovery actions for established transport paths.

The operator may establish network-wide or local policies that determine the actions to be taken when defects affecting different transport paths are reported.
Also, when a service request is made that causes the establishment of one or more transport paths in the network, the operator (or requesting application) may define a particular level of service, and this will be mapped to specific survivability actions taken before and during transport path setup, after the discovery of a failure of network resources, and upon recovery of those resources.

It should be noted that it is unusual to present a user or customer with options directly related to recovery actions. Instead, the user/customer enters into an SLA with the network provider, and the network operator maps the terms of the SLA (for example, for guaranteed delivery, availability, or reliability) to recovery schemes within the network.

The operator can also issue commands to control recovery actions and events. For example, the operator may perform the following actions:

o Enable or disable the survivability function.

o Invoke the simulation of a network fault.

o Force a switchover from a working path to a recovery path, or vice versa.

Forced switchover may be performed for network optimization purposes with minimal service interruption, such as when modifying protected or unprotected services, when replacing MPLS-TP network nodes, etc. In some circumstances, a fault may be reported to the operator, and the operator may then select and initiate the appropriate recovery action. A description of the different operator commands is found in Section 4.12 of [RFC4427].

4.1.2. Defect-Triggered Actions

Survivability actions may be directly triggered by network defects. This means that the device that detects the defect (for example, through notification of an issue reported from equipment in a lower layer, failure to receive an OAM Continuity message, or receipt of an OAM message reporting a failure condition) may immediately perform a survivability action. The action is directly triggered by events in the data plane. Note, however, that coordination of recovery actions between the edges of the recovery domain may require message exchanges for some recovery functions or for performing a bidirectional recovery action.

4.1.3. OAM Signaling

OAM signaling refers to data-plane OAM message exchange. Such messages may be used to detect and localize faults or to indicate a degradation in the operation of the network. In this context, however, these messages are used to control or trigger survivability actions. The mechanisms to achieve this are discussed in [MPLS-TP-OAM-Framework].

OAM signaling may also be used to coordinate recovery actions within the protection domain.

4.1.4. Control-Plane Signaling

Control-plane signaling is responsible for the setup, maintenance, and teardown of transport paths that do not fall under management-plane control. The control plane may also be used to coordinate the detection, localization, and reaction to network defects pertaining to peer relationships (neighbor-to-neighbor, or end-to-end). Thus, control-plane signaling may initiate and coordinate survivability actions.

The control plane can also be used to distribute topology information and information about resource availability. In this way, the "graceful shutdown" [RFC5817] of resources may be effected by withdrawing them; this can be used to invoke a survivability action in a similar way to that used when reporting or discovering a fault, as described in the previous sections.

The use of a control plane for MPLS-TP is discussed in [MPLS-TP-CP-Framework].
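The four elements of control described in this section can all request survivability actions against the same transport path, so an implementation needs a single point of arbitration. The following non-normative Python sketch shows one plausible local arbitration by request priority, loosely modeled on the operator commands of Section 4.12 of [RFC4427]; the specific ordering and request names are assumptions of this sketch, not a specified behavior.

<CODE BEGINS>
# Illustrative priority of local requests, highest first.  The ordering
# below is an assumption (lockout above forced switch, defect-driven
# requests above manual ones); a real scheme would take its ordering
# from the relevant protection-switching specification.
PRIORITY = [
    "lockout-of-protection",   # operator: never use the protection path
    "forced-switch",           # operator: use the protection path
    "signal-fail",             # defect trigger (OAM, lower layer, ...)
    "signal-degrade",          # degradation trigger
    "manual-switch",           # operator: switch only if no defect
    "no-request",
]

def arbitrate(requests):
    """Return the highest-priority outstanding request.

    'requests' may mix operator commands (management plane), defect
    triggers (data plane / OAM), and control-plane notifications; the
    arbitration does not care where a request came from.
    """
    requests = set(requests) | {"no-request"}
    return min(requests, key=PRIORITY.index)

# A signal-fail defect outranks a manual switch, but an operator
# lockout outranks the defect:
assert arbitrate(["manual-switch", "signal-fail"]) == "signal-fail"
assert arbitrate(["signal-fail", "lockout-of-protection"]) == \
    "lockout-of-protection"
<CODE ENDS>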
4.2. Elements of Recovery

This section describes the elements of recovery. These are the quantitative aspects of recovery, that is, the parts of the network for which recovery can be provided.

Note that the terminology in this section is consistent with [RFC4427]. Where the terms differ from those in [RFC5654], a mapping is provided.

4.2.1. Span Recovery

A span is a single hop between neighboring MPLS-TP nodes in the same network layer. A span is sometimes referred to as a link, and this may cause some confusion between the concept of a data link and a traffic engineering (TE) link. LSPs traverse TE links between neighboring MPLS-TP nodes in the MPLS-TP network layer. However, a TE link may be provided by any of the following:

o A single data link.

o A series of data links in a lower layer, established as an LSP and presented to the upper layer as a single TE link.

o A set of parallel data links in the same layer, presented either as a bundle of TE links, or as a collection of data links that together provide a data-link-layer protection scheme.

Thus, span recovery may be provided by any of the following:

o Selecting a different TE link from a bundle.

o Moving the TE link so that it is supported by a different data link between the same pair of neighbors.

o Re-routing the LSP in the lower layer.

Moving the protected LSP to another TE link between the same pair of neighbors is a form of segment recovery and is described in Section 4.2.2.

4.2.2. Segment Recovery

An LSP segment comprises one or more continuous hops on the path of the LSP. [RFC5654] defines two terms: a "segment" is a single hop along the path of an LSP, while a "concatenated segment" is more than one hop along the path of an LSP. In the context of this document, a segment covers both of these concepts.

A PW segment refers to a Single-Segment PW (SS-PW) or to a single segment of a Multi-Segment PW (MS-PW) that is set up between two PE devices that may be Terminating PEs (T-PEs) or Switching PEs (S-PEs), so that the full set of possibilities is T-PE to S-PE, S-PE to S-PE, S-PE to T-PE, or T-PE to T-PE (for the SS-PW case). As indicated in Section 1, the recovery of PWs and PW segments is beyond the scope of this document; however, see Section 7.

Segment recovery involves redirecting or copying traffic at the source end of a segment onto an alternate path leading to the other end of the segment. According to the required level of recovery (described in Section 4.3), traffic may either be redirected to a pre-established segment, through re-routing of the protected segment, or it may be tunneled to the far end of the protected segment through a "bypass" LSP. For details on recovery mechanisms, see Section 4.4.

Note that protecting a transport path against node failure requires the use of segment recovery or end-to-end recovery, while a link failure can be protected against using span, segment, or end-to-end recovery.
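The note above can be restated programmatically. The following Python fragment is a non-normative summary of that rule only; the function and failure-type names are illustrative.

<CODE BEGINS>
def applicable_recovery(failure: str) -> set:
    """Which elements of recovery can protect against a given failure.

    Span recovery operates between a pair of neighboring nodes, so it
    cannot help when one of those nodes itself fails; segment and
    end-to-end recovery can route around a failed node, provided the
    recovery path avoids it.
    """
    if failure == "link":
        return {"span", "segment", "end-to-end"}
    if failure == "node":
        return {"segment", "end-to-end"}
    raise ValueError("unknown failure type: " + failure)

assert "span" not in applicable_recovery("node")
assert applicable_recovery("link") == {"span", "segment", "end-to-end"}
<CODE ENDS>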
4.2.3. End-to-End Recovery

End-to-end recovery is a special case of segment recovery where the protected segment comprises the entire transport path. End-to-end recovery may be provided as link-diverse or node-diverse recovery, where the recovery path shares no links or no nodes with the working path.

Note that node-diverse paths are necessarily link-diverse, and that full, end-to-end node-diversity is required to guarantee recovery.

Two observations need to be made about end-to-end recovery.

- Firstly, there may be circumstances where node-diverse end-to-end paths do not guarantee recovery. The ingress and egress nodes will themselves be single points of failure. Additionally, there may be shared risks of failure (for example, geographic collocation, shared resources, etc.) between diverse nodes, as described in Section 4.9.2.

- Secondly, it is possible to use end-to-end recovery techniques even when there is not full diversity and the working and protection paths share links or nodes.

4.3. Levels of Recovery

This section describes the qualitative levels of survivability that can be provided. In the event of a network failure, the level of recovery offered directly affects the service level provided to the end-user. This will be observed as the amount of data lost when a network fault occurs, and the length of time required to recover connectivity.

In general, there is a correlation between the recovery service level (i.e., the speed of recovery and reduction of data loss) and the amount of resources used in the network; better service levels require the pre-allocation of resources to the recovery paths, and those resources cannot be used for other purposes if high-quality recovery is required. An operator must therefore consider that providing higher levels of recovery requires network resources to be provisioned and allocated for the exclusive use of the recovery paths, so that they cannot be used to support other customer services.

Sections 6 and 7 of [RFC4427] provide a full breakdown of the protection and recovery schemes. This section summarizes the qualitative levels available.

Note that, in the context of recovery, a useful discussion of the term "resource" and its interpretation in both the IETF and ITU-T contexts may be found in Section 3.2 of [RFC4397].

4.3.1. Dedicated Protection

In dedicated protection, the resources for the recovery entity are pre-assigned for the sole use of the protected transport path. This will clearly be the case in 1+1 protection, and may also be the case in 1:1 protection where extra traffic (see Section 4.3.3) is not supported.

Note that when using protection tunnels (see Section 4.4.3), resources may also be dedicated to the protection of a specific transport path. In some cases (1:1 protection) the entire bypass tunnel may be dedicated to providing recovery for a specific transport path, while in other cases (such as facility backup), a subset of the resources associated with the bypass tunnel may be pre-assigned for the recovery of a specific service.

However, as described in Section 4.4.3, the bypass tunnel method can also be used for shared protection (Section 4.3.2), either to carry extra traffic (Section 4.3.3), or to achieve best-effort recovery without the need for resource reservation.
4.3.2. Shared Protection

In shared protection, the resources for the recovery entities of several services are shared. These may be shared as 1:n or m:n, and are shared on individual links. Link-by-link resource sharing may be managed and operated along LSP segments, on PW segments, or on end-to-end transport paths (LSP or PW). Note that there is no requirement for m:n recovery in the list of MPLS-TP requirements documented in [RFC5654]. Shared protection can be applied in different topologies (mesh, ring, etc.) and can utilize different protection mechanisms (linear, ring, etc.).

End-to-end shared protection shares resources between a number of paths that have common end points. Thus, a number of paths (n paths) are all protected by one or more protection paths (m paths, where m may equal 1). After m failures, there are no more available protection paths, and the n paths are no longer protected. Thus, in 1:n protection, a single fault can be protected against; once the protection path is in use, the remaining paths are unprotected. The fact that the paths have become unprotected needs to be conveyed to the path end points, since they may need to report the change in service level or to take further action to increase their protection (a sketch of this bookkeeping appears at the end of this section). In end-to-end shared protection, this communication is simple since the end points are common.

In shared mesh protection (see Section 4.7.6), the paths that share the protection resources do not necessarily have the same end points. This provides a more flexible resource-sharing scheme, but the network planning and the coordination of protection state after a recovery action are more complex.

Where a bypass tunnel is used (Section 4.4.3), the tunnel might not have sufficient resources to simultaneously protect all of the paths for which it offers protection; in the event that all paths were affected by network defects and failures at the same time, not all of them would be recovered. Policy would dictate how this situation should be handled: some paths might be protected, while others would simply fail; the traffic for some paths would be guaranteed, while traffic on other paths would be treated as best-effort with the risk of dropped packets; alternatively, it is possible that protection would not be attempted, according to local policy at the nodes that perform the recovery actions.

Shared protection is a trade-off between assigning network resources to protection (which is not required most of the time) and risking unrecoverable services in the event that multiple network defects or failures occur. Rapid recovery can be achieved with dedicated protection, but it is delayed by message exchanges in the management, control, or data planes for shared protection. This means that there is also a trade-off between rapid recovery and resource sharing. In some cases, shared protection might not meet the speed required for protection, but it may still be faster than restoration.

These trade-offs may be somewhat mitigated by the following:

o Adjusting the value of n in 1:n protection.

o Using m:n protection for a value of m > 1.

o Establishing new protection paths as each available protection path is put into use.
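As referenced above, the following non-normative Python sketch shows the m:n bookkeeping: it tracks how many protection paths remain free and notifies the path end points when the group becomes unprotected. All names, and the notification callback, are illustrative assumptions.

<CODE BEGINS>
class SharedProtectionGroup:
    """m:n shared protection bookkeeping (m protection, n working paths).

    This sketch tracks only resource exhaustion; the protocol machinery
    for coordinating protection state between end points is outside
    its scope.
    """

    def __init__(self, working: list, protection: list, notify):
        self.working = list(working)             # the n working paths
        self.free_protection = list(protection)  # the m protection paths
        self.in_use = {}        # working path -> protection path
        self.notify = notify    # callback toward the path end points

    def fail(self, path: str) -> bool:
        """Recover a failed working path onto a free protection path."""
        if path in self.in_use or not self.free_protection:
            self.notify(f"{path}: unrecoverable, no protection available")
            return False
        self.in_use[path] = self.free_protection.pop()
        if not self.free_protection:
            # All m protection paths are occupied: the remaining working
            # paths are now unprotected, and the end points may need to
            # report a changed service level or act to restore protection.
            self.notify("group unprotected: all protection paths in use")
        return True

# 1:n example (m = 1): the first fault is protected, the second is not.
group = SharedProtectionGroup(["w1", "w2", "w3"], ["p1"], notify=print)
assert group.fail("w1") is True
assert group.fail("w2") is False
<CODE ENDS>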
4.3.3. Extra Traffic

Section 2.5.1.1 of [RFC5654] says: "Support for extra traffic (as defined in [RFC4427]) is not required in MPLS-TP and MAY be omitted from the MPLS-TP specifications." This document observes that extra-traffic facilities may therefore be provided as part of the MPLS-TP survivability toolkit, depending upon the development of suitable solution specifications. The remainder of this section explains the concepts of extra traffic without prejudging the decision to specify or not specify such solutions.

Network resources allocated for protection represent idle capacity during the time that recovery is not actually required, and they can be utilized by carrying other traffic, referred to as "extra traffic".

Note that extra traffic does not need to start or terminate at the ends of the entity (e.g., LSP) that it uses.

When a network resource carrying extra traffic is required for the recovery of protected traffic from the failed working path, the extra traffic is disrupted. This disruption may take one of two forms:

- In "hard preemption", the extra traffic is excluded from the protection resource. The disruption of the extra traffic is total, and the service supported by the extra traffic must be dropped, or some form of rerouting or restoration must be applied to the extra-traffic LSP in order to recover the service.

Hard preemption is achieved by "setting a switch" on the path of the extra traffic such that it no longer flows. This situation may be detected by OAM and reported as a fault, or it may be proactively reported through OAM or control-plane signaling.

- In "soft preemption", the extra traffic is not explicitly excluded from the protection resource, but it is given lower priority than the protected traffic. In a packet network (such as MPLS-TP), this can result in oversubscription of the protection resource, with the result that the extra traffic receives "best-effort" delivery. Depending on the volume of protection and extra traffic, and the level of oversubscription, the extra traffic may be slightly or heavily impacted.

The event of soft preemption may be detected by OAM and reported as a degradation of traffic delivery or as a fault. It may also be proactively reported through OAM or control-plane signaling.

Note that both hard and soft preemption may utilize additional message exchanges in the management, control, or data planes. These messages do not necessarily mean that recovery is delayed, but they may increase the complexity of the protection system. Thus, the benefits of carrying extra traffic must be weighed against the disadvantages of delayed recovery, additional network overhead, and the impact on the services that support the extra traffic, according to the details of the solutions selected.

Note that extra traffic is, by definition, not protected, but it may be restored.

Extra traffic is not supported on dedicated protection resources which, by definition, are used for 1+1 protection (Section 4.3.1), but it can be supported in other protection schemes, including shared protection (Section 4.3.2) and tunnel protection (Section 4.4.3).

Best-effort traffic should not be confused with extra traffic. For best-effort traffic, the network does not guarantee data delivery, and the user does not receive guaranteed quality of service (e.g., in terms of jitter, packet loss, delay, etc.); the service received by best-effort traffic depends on the current traffic load. For extra traffic, however, quality can be guaranteed only until the resources are required for recovery. At this point, the extra traffic may be completely displaced, may be treated as best-effort, or may itself be recovered (for example, by restoration techniques).
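The difference between the two forms of preemption described above can be made concrete. In the following non-normative Python sketch (illustrative names and state, no particular solution implied), hard preemption excludes the extra traffic entirely, while soft preemption demotes it so that it competes for whatever capacity the protected traffic leaves unused.

<CODE BEGINS>
def preempt(extra_lsps, mode: str):
    """Disrupt extra traffic when its resources are needed for recovery.

    extra_lsps: mapping of LSP name -> forwarding-state dictionary.
    mode:       "hard" or "soft" preemption, as described in
                Section 4.3.3.
    """
    for name, state in extra_lsps.items():
        if mode == "hard":
            # "Setting a switch" on the extra-traffic path: the LSP no
            # longer forwards, and its service must be rerouted or
            # restored by other means.
            state["forwarding"] = False
        elif mode == "soft":
            # The LSP keeps forwarding but at lower priority than the
            # protected traffic; with oversubscription it effectively
            # receives best-effort treatment and may suffer loss.
            state["priority"] = "best-effort"
        else:
            raise ValueError(mode)

lsps = {"extra-1": {"forwarding": True, "priority": "guaranteed"}}
preempt(lsps, "soft")
assert lsps["extra-1"]["forwarding"] is True          # still flowing
assert lsps["extra-1"]["priority"] == "best-effort"   # but demoted
<CODE ENDS>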
4.3.4. Restoration

This section refers to LSP restoration. Restoration for PWs is beyond the scope of this document (but see Section 7).

Restoration represents the most effective use of network resources, since no resources are reserved for recovery. However, restoration requires the computation of a new path and the activation of a new LSP (through the management or control plane). It may be more time-consuming to perform these steps than to implement recovery using protection techniques.

Furthermore, there is no guarantee that restoration will be able to recover the service. It may be that all suitable network resources are already in use for other LSPs, so that no new path can be found. This problem can be partially mitigated by using LSP setup priorities, so that recovery LSPs can preempt existing LSPs with lower priorities.

Additionally, when a network defect occurs, multiple LSPs may be disrupted by the same event. These LSPs may have been established by different Network Management Stations (NMSes), or they may have been signaled by different head-end MPLS-TP nodes, meaning that multiple points in the network will try to compute and establish recovery LSPs at the same time. This can lead to a lack of resources within the network and cause recovery failures; some recovery actions will need to be retried, resulting in even slower recovery times for some services.

Both hard and soft LSP restoration may be supported. For hard LSP restoration, the resources of the working LSP are released before the recovery LSP is fully established (i.e., break-before-make). For soft LSP restoration, the resources of the working LSP are released after an alternate LSP is fully established (i.e., make-before-break). Note that in the case of reversion (Section 4.3.5), the resources associated with the working LSP are not released.

The restoration resources may be pre-calculated and even pre-signaled before the restoration action starts, but not pre-allocated. This is known as pre-planned LSP restoration. The complete establishment/activation of the restoration LSP occurs only when the restoration action starts. Pre-planning may occur periodically and provides the most accurate information about the available resources in the network.

4.3.5. Reversion

After a service has been recovered and traffic is flowing along the recovery LSP, the defective network resource may be replaced. Traffic can then be redirected back onto the original working LSP (known as "reversion"), or it can be left where it is on the recovery LSP ("non-revertive" behavior).

It should be possible to specify the reversion behavior of each service; this might even be configured for each recovery instance.

In non-revertive mode, an additional operational option is possible where the protection roles are switched, so that the recovery LSP becomes the working LSP, while the previous working path (or the resources used by the previous working path) is used for recovery in the event of an additional fault.

In revertive mode, it is important to prevent excessive swapping between the working and recovery paths in the case of an intermittent defect. This can be addressed by using a reversion delay timer (the Wait-To-Restore timer), which controls the length of time to wait before reverting following the repair of a fault on the original working path. It should be possible for an operator to configure this timer per LSP, and a default value should be defined.
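A non-normative sketch of the revertive behavior just described: reversion is armed when the working path is repaired and cancelled if the fault recurs before the Wait-To-Restore (WTR) delay expires, which damps swapping under an intermittent defect. The class shape, timer handling, and default value are illustrative assumptions.

<CODE BEGINS>
class RevertiveLsp:
    """Revertive protection with a per-LSP Wait-To-Restore (WTR) delay.

    Time is passed in explicitly so that the sketch stays
    deterministic; a real implementation would use platform timers.
    """

    DEFAULT_WTR_SECONDS = 300  # illustrative default only

    def __init__(self, wtr=DEFAULT_WTR_SECONDS):
        self.wtr = wtr
        self.active = "working"
        self.revert_at = None        # time at which reversion is due

    def working_failed(self, now: float):
        self.active = "recovery"
        self.revert_at = None        # an intermittent fault cancels WTR

    def working_repaired(self, now: float):
        self.revert_at = now + self.wtr   # arm the WTR timer

    def tick(self, now: float):
        if self.revert_at is not None and now >= self.revert_at:
            self.active = "working"  # revert only after WTR expires
            self.revert_at = None

lsp = RevertiveLsp(wtr=30)
lsp.working_failed(now=0)
lsp.working_repaired(now=10)   # repair observed: WTR armed until t=40
lsp.working_failed(now=20)     # intermittent defect: WTR cancelled
lsp.working_repaired(now=25)   # re-armed until t=55
lsp.tick(now=40); assert lsp.active == "recovery"   # too early
lsp.tick(now=55); assert lsp.active == "working"    # reversion
<CODE ENDS>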
4.4. Mechanisms for Protection

This section provides general (non-MPLS-TP-specific) descriptions of the mechanisms that can be used for protection purposes. As indicated above, while the functional architecture applies to both LSPs and PWs, the mechanisms for recovery described in this document refer to LSPs and LSP segments only. Recovery mechanisms for pseudowires and pseudowire segments are for further study and will be described in a separate document (see also Section 7).

4.4.1. Link-Level Protection

Link-level protection refers to two paradigms: (1) protection provided in a lower network layer, and (2) protection provided by the MPLS-TP link layer.

Note that link-level protection mechanisms do not protect the nodes at each end of the entity (e.g., a link or span) that is protected. End-to-end or segment protection should be used in conjunction with link-level protection to protect against a failure of the edge nodes.

Link-level protection offers the following levels of protection:

o Full protection, where a dedicated protection entity (e.g., a link or span) is pre-established to protect a working entity. When the working entity fails, the protected traffic is switched to the protecting entity. In this scenario, all LSPs carried over the working entity are recovered (in one protection operation) when there is a failure condition. This is referred to in [RFC4427] as "bulk recovery".

o Partial protection, where only a subset of the LSPs or traffic carried over a selected entity is recovered when there is a failure condition. The decision as to which LSPs will be recovered and which will not depends on local policy (a sketch of this selection appears at the end of this section).

When there is no failure on the working entity, the protection entity may transport extra traffic, which may be preempted when protection switching occurs.

If link-level protection is available, it may be desirable to allow it to be attempted before other recovery mechanisms are invoked for the transport paths affected by the fault, because link-level protection may be faster and more conservative of network resources. This can be achieved both by limiting the propagation of fault-condition notifications and by delaying the other recovery actions. This ordering of recovery actions can be compared with the discussion of recovery domains (Section 4.5) and recovery in multi-layer networks (Section 4.9).

A protection mechanism may be provided at the MPLS-TP link layer (which connects two MPLS-TP nodes). Such a mechanism can make use of the procedures defined in [RFC5586] to set up in-band communication channels at the MPLS-TP section level, to use these channels to monitor the health of the MPLS-TP link, and to coordinate the protection states between the ends of the MPLS-TP link.
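As referenced in the list above, the full/partial distinction can be sketched as follows: on a failure of the working span, full protection moves every carried LSP in one bulk operation, whereas partial protection consults local policy per LSP. This Python fragment is non-normative, and the names and policy shape are illustrative.

<CODE BEGINS>
def link_level_recover(lsps, level: str, policy=None):
    """Return the LSPs switched to the protecting span on a failure.

    lsps:   names of all LSPs carried over the failed working span.
    level:  "full"    -> bulk recovery of every carried LSP, or
            "partial" -> only the LSPs selected by local policy.
    policy: predicate deciding, per LSP, whether it is recovered;
            required for partial protection.
    """
    if level == "full":
        # "Bulk recovery" in the language of [RFC4427]: one protection
        # operation recovers everything carried over the span.
        return list(lsps)
    if level == "partial":
        return [lsp for lsp in lsps if policy(lsp)]
    raise ValueError(level)

carried = ["lsp-premium-1", "lsp-premium-2", "lsp-besteffort-1"]
# Illustrative local policy: only LSPs marked premium are recovered.
recovered = link_level_recover(
    carried, "partial", policy=lambda lsp: "premium" in lsp)
assert recovered == ["lsp-premium-1", "lsp-premium-2"]
<CODE ENDS>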
4.4.2. Alternate Paths and Segments

The use of alternate paths and segments refers to the paradigm whereby protection is performed in the network layer in which the protected LSP is located; this applies either to the entire end-to-end LSP or to a segment of the LSP. In this case, hierarchical LSPs are not used (compare with Section 4.4.3).

Different levels of protection may be provided:

o Dedicated protection, where a dedicated entity (e.g., an LSP or LSP segment) is (fully) pre-established to protect a working entity (e.g., an LSP or LSP segment). When a failure condition occurs on the working entity, traffic is switched onto the protection entity. Dedicated protection may be performed using 1:1 or 1+1 linear protection schemes. When the failure condition is eliminated, the traffic may revert to the working entity; this is subject to local configuration.

o Shared protection, where one or more protection entities are pre-established to protect against a failure of one or more working entities (1:n or m:n). When the fault condition on the working entity is eliminated, the traffic should revert back to the working entity in order to allow other related working entities to be protected by the shared protection resource.

4.4.3. Protection Tunnels

A protection tunnel is a hierarchical LSP that is pre-provisioned in order to protect against a failure condition along a sequence of spans in the network. We call such a sequence a network segment. A failure of a network segment may affect one or more LSPs that transit the network segment.

When a failure condition occurs in the network segment (detected either by OAM on the network segment, or by OAM on a concatenated segment of one of the LSPs transiting the network segment), one or more of the protected LSPs are switched over at the ingress point of the network segment and are transmitted over the protection tunnel. This is implemented through label stacking (see the sketch at the end of this section); label mapping may be an option as well.

Different levels of protection may be provided:

o Dedicated protection, where the protection tunnel reserves sufficient resources to provide protection for all protected LSPs without causing service degradation.

o Partial protection, where the protection tunnel has enough resources to protect some of the protected LSPs, but not all of them simultaneously. Policy dictates how this situation should be handled: it is possible that some LSPs would be protected, while others would simply fail; it is possible that traffic would be guaranteed for some LSPs, while for other LSPs it would be treated as best-effort with the risk of packets being dropped; alternatively, it is possible that protection would not be attempted.
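As referenced above, the label-stacking operation at the ingress of the protected network segment can be sketched as follows. The sketch is non-normative, shows only the hierarchical-LSP (label stack) variant rather than label mapping, and its label values and structures are illustrative.

<CODE BEGINS>
def switch_into_bypass(packet_label_stack, bypass_label: int,
                       merge_label: int):
    """Redirect one protected LSP into a pre-provisioned bypass tunnel.

    packet_label_stack: list of labels, top of stack first.
    bypass_label:       label of the protection tunnel at this node.
    merge_label:        label the protected LSP expects at the node
                        where the bypass tunnel terminates (the far
                        end of the protected network segment).

    The protected LSP's own label is first swapped to the value
    expected at the merge point, and the bypass tunnel's label is then
    pushed on top (label stacking).  The tunnel's egress pops its
    label, leaving the packet exactly as if it had traversed the
    failed segment.
    """
    stack = list(packet_label_stack)
    stack[0] = merge_label         # swap to the label used at the far end
    stack.insert(0, bypass_label)  # push the protection-tunnel label
    return stack

# A packet of a protected LSP arrives with top label 100; the far end
# of the failed segment expects label 200; the bypass tunnel uses 3999.
assert switch_into_bypass([100], bypass_label=3999, merge_label=200) \
    == [3999, 200]
<CODE ENDS>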
4.5. Recovery Domains

Protection and restoration are performed in the context of a recovery domain. A recovery domain is defined between two or more recovery reference end points, which are located at the edges of the recovery domain and which border the element on which recovery can be provided (as described in Section 4.2). This element can be an end-to-end path, a segment, or a span.

An end-to-end path can be viewed as a special case of a segment, where the ingress and egress label edge routers (LERs) serve as the recovery reference end points.

In the simple case of a point-to-point (P2P) protected entity, two end points reside at the boundary of the Protection Domain. An LSP can enter the recovery domain through one reference end point and exit it through another reference end point.

In the case of unidirectional point-to-multipoint (P2MP), three or more end points reside at the boundary of the Protection Domain. One of the end points is referred to as the source/root, while the others are referred to as sinks/leaves. An LSP can enter the recovery domain through the root point and exit the recovery domain through the leaf points.

The recovery mechanism should restore traffic that was interrupted by a facility (link or node) fault within the recovery domain. Note that a single link may be part of several recovery domains. If two recovery domains have common links, one recovery domain must be contained within the other; this is referred to as nested recovery domains. The boundaries of recovery domains may coincide, but recovery domains must not overlap (a validation sketch appears at the end of this section).

Note that the edges of a recovery domain are not protected, and unless the whole domain is contained within another recovery domain, the edges form a single point of failure.

A recovery group is defined within a recovery domain and consists of a working (primary) entity and one or more recovery (backup) entities that reside between the end points of the recovery domain. To guarantee protection in all situations, a dedicated recovery entity should be pre-provisioned using disjoint resources in the recovery domain, in order to protect against a failure of a working entity. Of course, mechanisms to detect faults and to trigger protection switching are also needed.

The method used to monitor the health of the recovery element is beyond the scope of this document. The end points that are responsible for the recovery action must receive information about its condition, which may be 'OK', 'failed', or 'degraded'.

When the recovery operation is to be triggered by OAM mechanisms, an OAM Maintenance Entity Group must be defined for each of the working and protection entities.

The recovery entities and functions in a recovery domain can be configured using a management plane or a control plane. A management plane may be used to configure the recovery domain by setting the reference points, the working and recovery entities, and the recovery type (e.g., 1:1 bidirectional linear protection, ring protection, etc.). Additional parameters associated with the recovery process may also be configured. For more details, see Section 6.1.

When a control plane is used, the ingress LERs may communicate with the recovery reference points to request that protection or restoration be configured across a recovery domain. For details, see Section 6.5.
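As referenced earlier in this section, the rule that recovery domains sharing links must nest rather than overlap can be checked mechanically. The following non-normative Python sketch validates a set of domains, each modeled simply as its set of links; this flat representation is an illustrative simplification.

<CODE BEGINS>
from itertools import combinations

def validate_recovery_domains(domains: dict) -> list:
    """Check the nesting rule of Section 4.5.

    domains: mapping of domain name -> set of links in the domain.
    Two domains may share links only if one contains the other (nested
    domains); boundaries may coincide, but domains must not partially
    overlap.  Returns the list of offending domain pairs.
    """
    violations = []
    for (a, links_a), (b, links_b) in combinations(domains.items(), 2):
        common = links_a & links_b
        if common and not (links_a <= links_b or links_b <= links_a):
            violations.append((a, b))  # shared links but no containment
    return violations

domains = {
    "outer":  {"L1", "L2", "L3", "L4"},
    "inner":  {"L2", "L3"},            # nested within "outer": allowed
    "astray": {"L4", "L5"},            # overlaps "outer": not allowed
}
assert validate_recovery_domains(domains) == [("outer", "astray")]
<CODE ENDS>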
Cases of multiple interconnections between distinct recovery domains
create a hierarchical arrangement of recovery domains, since a single
top-level recovery domain is created from the concatenation of two
recovery domains with multiple interconnections.  In this case,
recovery actions may be taken both in the individual, lower-level
recovery domains to protect any LSP segment that crosses the domain,
and within the higher-level recovery domain to protect the longer LSP
segment that traverses the higher-level domain.

The MPLS-TP recovery mechanism can be arranged to ensure coordination
between domains.  In interconnected rings, for example, it may be
preferable to allow the upstream ring to perform recovery before the
downstream ring, in order to ensure that recovery takes place in the
ring in which the defect occurred.  Coordination of recovery actions
is particularly important in nested domains, and is discussed further
in Section 4.9.

4.6. Protection in Different Topologies

As described in the requirements listed in Section 3 and detailed in
[RFC5654], the selected recovery techniques may be optimized for
different network topologies if the optimized mechanisms perform
significantly better than the generic mechanisms in the same
topology.

These mechanisms are required (R91 of [RFC5654]) to interoperate with
the mechanisms defined for arbitrary topologies, in order to allow
end-to-end protection and to ensure that consistent protection
techniques are used across the entire network.  In this context,
'interoperate' means that the use of one technique must not inhibit
the use of another technique in an adjacent part of the network for
use on the same end-to-end transport path, and must not prohibit the
use of end-to-end protection mechanisms.

The next sections (4.7 and 4.8) describe two different topologies and
explain how recovery may be markedly different in those different
scenarios.  They also develop the concept of a recovery domain and
show how end-to-end survivability may be achieved through a
concatenation of recovery domains, each providing some level of
recovery in part of the network.

4.7. Mesh Networks

A mesh network is any network where there is arbitrary
interconnectivity between nodes in the network.  Mesh networks are
usually contrasted with more specific topologies such as
hub-and-spoke or ring (see Section 4.8), although such specific
topologies are themselves examples of mesh networks.  This section is
limited to the discussion of protection techniques in the context of
general mesh networks; that is, it does not include optimizations for
specific topologies.

Linear protection is a protection mechanism that provides rapid and
simple protection switching.  In a mesh network, linear protection is
a very suitable protection mechanism because it can operate between
any pair of points within the network.  It can protect against a
defect in a node, a span, a transport path segment, or an end-to-end
transport path.  Linear protection gives a clear indication of the
protection status.

Linear protection operates in the context of a Protection Domain.  A
Protection Domain is a special type of recovery domain (see Section
4.5) associated with the protection function.
A Protection Domain is composed of the following architectural
elements:

o  A set of end points which reside at the boundary of the Protection
   Domain.  In the simple case of 1:n or 1+1 P2P protection, two end
   points reside at the boundary of the Protection Domain.  In each
   transmission direction, one of the end points is referred to as
   the source and the other is referred to as the sink.  For
   unidirectional P2MP protection, three or more end points reside at
   the boundary of the Protection Domain.  One of the end points is
   referred to as the source/root while the others are referred to as
   sinks/leaves.

o  A Protection Group, which consists of one or more working
   (primary) paths and one or more protection (backup) paths which
   run between the end points belonging to the Protection Domain.  To
   guarantee protection in all scenarios, a dedicated protection path
   should be pre-provisioned to protect against a defect of a working
   path (i.e., 1:1 or 1+1 protection schemes).  In addition, the
   working and the protection paths should be disjoint, i.e., their
   routes through the network should be physically diverse in every
   respect.

Note that if fewer resources are allocated to the protection path
than to the working path, the protection path may not have sufficient
resources to protect the traffic of the working path.

As mentioned in Section 4.3.2, the resources of the protection path
may be shared, as in 1:n protection.  In this scenario, the
protection path will not have sufficient resources to protect all the
working paths at any one time.

For bidirectional P2P paths, both unidirectional and bidirectional
protection switching are supported.  If a defect occurs when
bidirectional protection switching is configured, the protection
actions are performed in both directions (even if the defect is
unidirectional).  This requires a level of coordination of the
protection state between the end points of the protection domain.

In unidirectional protection switching, the protection actions are
only performed in the affected direction.

Revertive and non-revertive operations are provided as options for
the network operator.

Linear protection supports the protection schemes described in the
following sub-sections.

4.7.1. 1:n Linear Protection

In the 1:1 scheme, a protection path is allocated to protect against
a defect, failure, or degradation in a working path.  As described
above, to guarantee protection, the protection entity should support
the full capacity and bandwidth of the working entity, although it
may be configured (for example, because of limited network resource
availability) to offer a degraded service when compared with the
working entity.

Figure 1 presents the 1:1 protection architecture.  In normal
conditions, data traffic is transmitted over the working entity,
while the protection entity functions in the idle state.  (OAM may
run on the protection entity to verify its state.)  Normal conditions
apply when there is no defect, failure, or degradation on the working
entity, and no administrative configuration or request causes traffic
to flow over the protection entity.
      |-----------------Protection Domain---------------|

                 ==============================
                /**********Working path***********\
      +--------+ ============================== +--------+
      | Node  /|                                |\ Node  |
      |  A   {<                                  >}  B   |
      |        |                                |        |
      +--------+ ============================== +--------+
                         Protection path
                 ==============================

              Figure 1: 1:1 Protection Architecture

If there is a defect on the working entity, or a specific
administrative request, traffic is switched to the protection entity.

Note that when operating with non-revertive behavior (see Section
4.3.5), after the conditions causing the switchover have been
cleared, the traffic continues to flow on the protection path, but
the working and protection roles are not switched.

In each transmission direction, the protection domain source bridges
traffic onto the appropriate entity, while the sink selects traffic
from the appropriate entity.  The source and the sink need to
coordinate the protection states to ensure that bridging and
selection are performed to and from the same entity.  For this
reason, a signaling coordination protocol (either an in-band,
data-plane signaling protocol or a control-plane-based signaling
protocol) is required.

In bidirectional protection switching, both ends of the protection
domain are switched to the protection entity (even when the fault is
unidirectional).  This requires a protocol to coordinate the
protection state between the two end points of the Protection Domain.

When there is no defect, the bandwidth resources of the idle entity
may be used for lower-priority traffic.  When protection switching is
performed, the lower-priority traffic may be pre-empted by the
protected traffic: the lower-priority LSP may be torn down, a fault
may be reported on the lower-priority LSP, or the lower-priority
traffic may be treated as best effort and discarded when there is
congestion.

In the general case of 1:n linear protection, one protection entity
is allocated to protect n working entities.  The protection entity
might not have sufficient resources to protect all the working
entities that may be affected by fault conditions at a specific time.
In this case, in order to guarantee protection, the protection entity
should support enough capacity and bandwidth to protect any of the n
working entities.

When defects or failures occur along multiple working entities, the
entity to be protected should be selected according to priority.  The
protection states between the edges of the Protection Domain should
be fully coordinated to ensure consistent behavior.  As explained in
Section 4.3.5, revertive behavior is recommended when 1:n protection
is supported.

4.7.2. 1+1 Linear Protection

In the 1+1 protection scheme, a fully dedicated protection entity is
allocated.

As depicted in Figure 2, data traffic is copied and fed at the source
to both the working and the protection entities.  The traffic on the
working and the protection entities is transmitted simultaneously to
the sink of the Protection Domain, where selection between the
working and protection entities is performed (based on some
predetermined criteria).
      |---------------Protection Domain---------------|

                 ==============================
                /**********Working path************\
      +--------+ ============================== +--------+
      | Node  /|                                |\ Node  |
      |  A   {<                                  >}  Z   |
      |        \|                              |/        |
      +--------+ ============================== +--------+
                \**********Protection path*********/
                 ==============================

              Figure 2: 1+1 Protection Architecture

Note that control traffic between the edges of the Protection Domain
(such as OAM or a control protocol to coordinate the protection
state) may be transmitted on an entity that differs from the one used
for the protected traffic.  These packets should not be discarded by
the sink.

In 1+1 unidirectional protection switching, there is no need to
coordinate the protection state between the protection controllers at
the two ends of the protection domain.  In 1+1 bidirectional
protection switching, a protocol is required to coordinate the
protection state between the edges of the Protection Domain.

In both protection schemes, traffic may be restored to the working
entity after the conditions causing the switchover have been cleared.
Data selection may return to the working entity if reversion is
enabled; this requires coordination of the protection state between
the edges of the Protection Domain.  To avoid frequent switching
caused by intermittent defects or failures when the network is not
stable, traffic is not selected from the working entity before the
Wait-to-Restore (WTR) timer has expired.

4.7.3. P2MP Linear Protection

Linear protection may be applied to protect unidirectional P2MP
entities using the 1+1 protection architecture.  The source/root
MPLS-TP node bridges the user traffic to both the working and
protection entities.  Each sink/leaf MPLS-TP node selects the traffic
from one entity according to some predetermined criteria.  Note that
when there is a fault condition on one of the branches of the P2MP
path, some leaf MPLS-TP nodes may select traffic from the working
entity, while other leaf MPLS-TP nodes may select traffic from the
protection entity (a sketch of this per-leaf selection appears at the
end of this section).

In a 1:1 P2MP protection scheme, the source/root MPLS-TP node needs
to identify the existence of a fault condition on any of the branches
of the network.  This means that the sink/leaf MPLS-TP nodes need to
notify the source/root MPLS-TP node of any fault condition.  This
also necessitates a return path from the sinks/leaves to the
source/root MPLS-TP node.  When protection switching is triggered,
the source/root MPLS-TP node selects the protection transport path
for traffic transfer.

A form of "segment recovery for P2MP LSPs" could be constructed.
Given a P2MP LSP, one can protect any possible point of failure (link
or node) using N backup P2MP LSPs.  Each backup P2MP LSP originates
from the upstream node with respect to a different possible failure
point and terminates at all of the destinations downstream of the
potential failure point.  In case of a failure, traffic is redirected
to the backup P2MP path.

Note that such mechanisms do not yet exist and their exact behavior
is for further study.

A 1:n protection scheme for P2MP transport paths is also required by
[RFC5654].  Such a mechanism is for future study.
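The following is a minimal sketch of the sink-side selection logic
described in Sections 4.7.2 and 4.7.3, including optional reversion
guarded by the WTR timer.  The signal-state inputs, the timer
handling, and all names are simplifying assumptions, not a definitive
implementation of any specified state machine.

<CODE BEGINS>
# Minimal sketch of a 1+1 sink selector with optional reversion.
# Inputs and timer handling are simplified assumptions.
import time

class OnePlusOneSelector:
    def __init__(self, revertive=True, wtr_seconds=300):
        self.revertive = revertive
        self.wtr_seconds = wtr_seconds
        self.selected = "working"
        self.wtr_expiry = None       # set while the WTR timer runs

    def update(self, working_ok, protection_ok):
        """Re-evaluate selection from the current signal states."""
        if self.selected == "working" and not working_ok and protection_ok:
            self.selected = "protection"      # switch away from fault
            self.wtr_expiry = None
        elif self.selected == "protection" and working_ok and self.revertive:
            # Do not revert until the working entity has been
            # fault-free for the whole Wait-to-Restore period.
            if self.wtr_expiry is None:
                self.wtr_expiry = time.monotonic() + self.wtr_seconds
            elif time.monotonic() >= self.wtr_expiry:
                self.selected = "working"
                self.wtr_expiry = None
        elif self.selected == "protection" and not working_ok:
            self.wtr_expiry = None            # fault returned: restart WTR
        return self.selected
<CODE ENDS>

In a P2MP deployment, each sink/leaf would run an independent
instance of such a selector, which is why different leaves may be
selecting from different entities at the same time.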
4.7.4. Triggers for the Linear Protection Switching Action

Protection switching may be performed when:

o  A defect condition is detected on the working entity, and the
   protection entity has no defect condition or one less severe than
   that of the working entity.  Proactive in-band OAM Continuity and
   Connectivity Verification (CCV) monitoring of both the working and
   the protection entities may be used to enable the rapid detection
   of a fault condition.  For protection switching, it is common to
   run CCV every 3.33 ms; in the absence of three consecutive CCV
   messages, a fault condition is declared.  (A sketch of this
   detection logic appears at the end of this section.)  In order to
   monitor the working and the protection entities, an OAM
   Maintenance Entity Group should be defined for each entity.  OAM
   indications associated with fault conditions should be provided at
   the edges of the Protection Domain, which are responsible for the
   protection-switching operation.  Input from OAM performance
   monitoring that indicates degradation of the working entity may
   also be used as a trigger for protection switching.  In the case
   of degradation, switching to the protection entity is needed only
   if the protection entity can exhibit better operating conditions.

o  An indication is received from a lower-layer server that there is
   a defect in the lower layer.

o  An external operator command is received (e.g., 'Forced Switch',
   'Manual Switch').  For details, see Section 6.1.2.

o  A request to switch over is received from the far end.  The far
   end may initiate this request, for example, on receipt of an
   administrative request to switch over, or when bidirectional 1:1
   protection switching is supported and a defect has occurred that
   can only be detected by the far end.

As described above, the protection state should be coordinated
between the end points of the Protection Domain.  Control messages
should be exchanged between the edges of the Protection Domain to
coordinate the protection state of the edge nodes.  Control messages
can be delivered using an in-band, data-plane-driven control
protocol, or a control-plane-based protocol.

For 50-ms protection switching, it is recommended that an in-band,
data-plane-driven signaling protocol be used to coordinate the
protection states.  An in-band, data-plane protocol for use in
MPLS-TP networks will be documented in [MPLS-TP-Linear-Protection]
for this purpose.  This protocol is also used to detect mismatches
between the configurations provisioned at the ends of the Protection
Domain.

As described in Section 6.5, the GMPLS control plane already includes
procedures and message elements to coordinate the protection states
between the edges of the protection domain.  These procedures and
protocol messages are specified in [RFC4426], [RFC4872], and
[RFC4873].  However, these messages lack the capability to coordinate
the revertive/non-revertive behavior and the consistency of
configured timers at the edges of the Protection Domain (timers such
as the Wait-to-Restore (WTR) and hold-off timers).
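The sketch below illustrates the CCV-based defect detection described
in the first trigger above: CCV messages are expected every 3.33 ms,
and the loss of three consecutive messages declares a fault that
triggers the switchover.  The polling structure and the callback hook
are simplifying assumptions.

<CODE BEGINS>
# Minimal sketch of CCV-based defect detection: expect a CCV message
# every 3.33 ms; declare Loss of Continuity after three are missed.

CCV_INTERVAL = 0.00333          # 3.33 ms between CCV messages
DETECT_MULTIPLIER = 3           # missed messages before declaring loss

class CcvMonitor:
    def __init__(self, on_fault):
        self.on_fault = on_fault            # hook into the selector
        self.last_rx = None
        self.fault_declared = False

    def ccv_received(self, now):
        """Called for every CCV message received on the entity."""
        self.last_rx = now
        self.fault_declared = False

    def poll(self, now):
        """Called periodically; declares Loss of Continuity when no
        CCV has arrived within three intervals."""
        if self.last_rx is None or self.fault_declared:
            return
        if now - self.last_rx > DETECT_MULTIPLIER * CCV_INTERVAL:
            self.fault_declared = True
            self.on_fault()                 # trigger protection switching

monitor = CcvMonitor(on_fault=lambda: print("switch to protection"))
monitor.ccv_received(now=0.0)
monitor.poll(now=0.02)                      # >10 ms of silence: fault
<CODE ENDS>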
4.7.5. Applicability of Linear Protection for LSP Segments

In order to implement data-plane-based linear protection on LSP
segments, use is made of the Sub-Path Maintenance Entity (SPME), an
MPLS-TP architectural element defined in [MPLS-TP-FWK].  Maintenance
operations (e.g., monitoring, protection, or management) involve the
transmission of messages (e.g., OAM, protection path coordination,
etc.) within the maintained domain.  Further discussion of the
architecture for OAM and the SPME is found in [MPLS-TP-FWK] and
[MPLS-TP-OAM-Framework].  An SPME is an LSP that is defined and used
for the purposes of OAM monitoring, protection, or management of LSP
segments.  The SPME uses the MPLS construct of a hierarchical, nested
LSP, as defined in [RFC3031].

For linear protection, SPMEs should be defined over the working and
protection entities between the edges of a Protection Domain.  OAM
messages and messages used to coordinate the protection state can be
initiated at one edge of the SPME and sent to the peer edge of the
SPME.  Note that these messages are sent over the Generic Associated
Channel (G-ACh) within the SPME, and that they use a two-label stack:
the SPME label and, at the bottom of the stack, the G-ACh Label (GAL)
[RFC5586].  (A sketch of this label stack appears at the end of this
section.)

The end-to-end traffic of the LSP, which includes data traffic and
control traffic (messages for OAM, management, signaling, and
protection state coordination), is tunneled within the SPMEs by means
of label stacking, as defined in [RFC3031].

Mapping between an LSP and an SPME can be 1:1; this is similar to the
ITU-T Tandem Connection element, which defines a sub-layer
corresponding to a segment of a path.  Mapping can also be 1:n to
allow the scalable protection of a set of LSP segments traversing the
part of the network in which a Protection Domain is defined.  Note
that each of these LSPs can be initiated or terminated at different
end points in the network, but that they all traverse the Protection
Domain and share similar constraints (such as requirements for QoS,
terms of protection, etc.).

Note also that in the context of segment protection, the SPMEs serve
as the working and protection entities.
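The following sketch shows the two label stacks mentioned above: the
edge-to-edge control stack carrying the GAL (value 13, per
[RFC5586]), and the stack used to tunnel end-to-end LSP traffic
within the SPME.  The data structures are illustrative only.

<CODE BEGINS>
# Minimal sketch of SPME label stacks.  GAL = 13 per RFC 5586; the
# stack representation is an illustrative assumption.

GAL = 13                        # G-ACh Label, reserved value

def spme_control_stack(spme_label):
    """Label stack for an OAM / protection-coordination message sent
    edge-to-edge within an SPME (top of stack first)."""
    return [
        {"label": spme_label, "s": 0},   # SPME label, not bottom
        {"label": GAL,        "s": 1},   # GAL marks bottom of stack
    ]

def spme_tunneled_stack(spme_label, lsp_label):
    """Label stack for end-to-end LSP traffic tunneled in the SPME:
    the SPME label is pushed above the LSP's own label."""
    return [
        {"label": spme_label, "s": 0},
        {"label": lsp_label,  "s": 1},
    ]

print(spme_control_stack(spme_label=2001))
print(spme_tunneled_stack(spme_label=2001, lsp_label=16))
<CODE ENDS>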
4.7.6. Shared Mesh Protection

For shared mesh protection, the protection resources are used to
protect multiple LSPs which do not all share the same end points.
For example, in Figure 3 there are two paths, ABCDE and VWXYZ.  These
paths do not share end points and cannot, therefore, make use of 1:n
linear protection, even though they do not have any common points of
failure.

ABCDE may be protected by the path APQRE, while VWXYZ can be
protected by the path VPQRZ.  In both cases, 1:1 or 1+1 protection
may be used.  However, it can be seen that if 1:1 protection is used
for both paths, the PQR network segment does not carry traffic when
no failures affect either of the two working paths.  Furthermore, in
the event of only one failure, the PQR segment carries traffic from
only one of the working paths.

Thus, it is possible for the network resources on the PQR segment to
be shared by the two recovery paths.  In this way, mesh protection
can substantially reduce the number of network resources that have to
be reserved in order to provide 1:n protection.

      A----B----C----D----E
       \                 /
        \               /
         \             /
          P-----Q-----R
         /             \
        /               \
       /                 \
      V----W----X----Y----Z

      Figure 3: A Shared Mesh Protection Topology

As the network becomes more complex and the number of LSPs increases,
the potential for shared mesh protection also increases.  However,
this can quickly become unmanageable owing to the increased
complexity.  Therefore, shared mesh protection is normally
pre-planned and configured by the operator, although an automated
system cannot be ruled out.

Note that shared mesh protection operates as 1:n linear protection
(see Section 4.7.1).  However, the protection state needs to be
coordinated between a larger number of nodes: the end points of the
shared concatenated protection segment (nodes P and R in the example)
as well as the end points of the protected LSPs (nodes A, E, V, and Z
in the example).

Additionally, note that the shared protection resources could be used
to carry extra traffic.  For example, in Figure 4, an LSP JPQRK could
be a preemptable LSP that constitutes extra traffic over the PQR
hops; it would be displaced in the event of a protection event.  In
this case, it should be noted that the protection state must also be
coordinated with the ends of the extra-traffic LSPs.

      A----B----C----D----E
       \                 /
        \               /
         \             /
   J-----P-----Q-----R-----K
         /             \
        /               \
       /                 \
      V----W----X----Y----Z

      Figure 4: Shared Mesh Protection with Extra Traffic

4.8. Ring Networks

Several service providers have expressed great interest in the
operation of MPLS-TP in ring topologies and demand a high degree of
survivability functionality in these topologies.

Various criteria for optimization are considered in ring topologies,
such as:

1. Simplification of ring operation in terms of the number of OAM
   Maintenance Entities that are needed to trigger the recovery
   actions, the number of recovery elements, the number of
   management-plane transactions during maintenance operations, etc.

2. Optimization of resource consumption around the ring, such as the
   number of labels needed for the protection paths that traverse the
   network, the total bandwidth required in the ring to ensure path
   protection, etc. (see R91 of [RFC5654]).

[RFC5654] introduces a list of requirements for ring protection
covering the recovery mechanisms needed to protect traffic in a
single ring as well as traffic that traverses more than one ring.
Note that the configuration and operation of the recovery mechanisms
in a ring must scale well with the number of transport paths, the
number of nodes, and the number of ring interconnects.

The requirements for ring protection are fully compatible with the
generic requirements for recovery.

The architecture and the mechanisms for ring protection are specified
in separate documents.  These mechanisms need to be evaluated against
the requirements specified in [RFC5654], which includes guidance on
the principles for the development of new mechanisms.

4.9. Recovery in Layered Networks

In multi-layer or multi-regional networking [RFC5212], recovery may
be performed at multiple layers or across nested recovery domains.

The MPLS-TP recovery mechanism must ensure that the timing of
recovery is coordinated in order to avoid race conditions, to allow
the recovery mechanism of the server layer to fix the problem before
recovery takes place in the MPLS-TP layer, or to allow the MPLS-TP
layer to perform recovery before a client network does.

A configurable hold-off timer is required to coordinate recovery
timing in multiple layers or across nested recovery domains.  Setting
this timer involves a trade-off between rapid recovery and the
creation of a race condition where multiple layers respond to the
same fault, potentially allocating resources in an inefficient
manner.  Thus, the detection of a defect condition in the MPLS-TP
layer should not immediately trigger the recovery process if the
hold-off timer is configured with a value other than zero.  Instead,
the hold-off timer should be started when the defect is detected and,
on expiry, the recovery element should be checked to determine
whether the defect condition still exists.  If it does, the defect
triggers the recovery operation.  (A sketch of this hold-off behavior
follows.)
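The sketch below illustrates the hold-off behavior just described: a
detected defect starts the timer, and recovery is triggered only if
the defect still exists on expiry.  The threading model and the
defect-check hook are simplifying assumptions.

<CODE BEGINS>
# Minimal sketch of hold-off timer behavior for multi-layer recovery
# coordination.  Threading and the hooks are simplified assumptions.
import threading

class HoldOffTimer:
    def __init__(self, hold_off_seconds, defect_still_present, recover):
        self.hold_off = hold_off_seconds
        self.defect_still_present = defect_still_present
        self.recover = recover

    def defect_detected(self):
        if self.hold_off == 0:
            self.recover()            # no hold-off: act immediately
            return
        # Give the server layer a chance to repair the fault first.
        threading.Timer(self.hold_off, self._expired).start()

    def _expired(self):
        if self.defect_still_present():
            self.recover()            # server layer did not repair it
        # else: the lower layer recovered; take no action here

timer = HoldOffTimer(hold_off_seconds=0.1,
                     defect_still_present=lambda: True,
                     recover=lambda: print("MPLS-TP recovery triggered"))
timer.defect_detected()
<CODE ENDS>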
In other configurations, where the lower layer does not have a
restoration capability, or where it is not expected to provide
protection, the lower layer needs to trigger the higher layer to
perform recovery immediately.  Although this can be forced by
configuring the hold-off timer to zero, it may be that, because of
layer independence, the higher layer does not know whether the lower
layer will perform restoration.  In this case, the higher layer will
configure a non-zero hold-off timer and rely on the receipt of a
specific notification from the lower layer if the lower layer cannot
perform restoration.  Since layer boundaries are always within nodes,
such coordination is implementation-specific and does not need to be
covered here.

Reference should be made to [RFC3386], which discusses the
interaction between layers in survivable networks.

4.9.1. Inherited Link-Level Protection

Where a link in the MPLS-TP network is formed through connectivity
(i.e., a packet or non-packet LSP) in a lower-layer network, that
connectivity may itself be protected; for example, the LSP in the
lower-layer network may be provisioned with 1+1 protection.  In this
case, the link in the MPLS-TP network has an inherited level of
protection.

An LSP in the MPLS-TP network may be provisioned with protection in
the MPLS-TP network, as already described, or it may be provisioned
to utilize only those links that have inherited protection.

By classifying the links in the MPLS-TP network according to the
level of protection that they inherit from the server network, it is
possible to compute an end-to-end path in the MPLS-TP network that
uses only those links with a specific or superior level of inherited
protection (a sketch of such a computation follows at the end of this
section).  This means that the end-to-end MPLS-TP LSP can be
protected at the level necessary to conform to the SLA without
needing to provide any additional protection in the MPLS-TP layer.
This reduces complexity, saves network resources, and eliminates
protection-switching coordination problems.

When the requisite level of inherited protection is not available on
all segments along the path in the MPLS-TP network, segment
protection may be used to achieve the desired protection level.

It should be noted, however, that inherited protection only applies
to links.  Nodes cannot be protected in this way.  An operator will
need to perform an analysis of the relative likelihood and
consequences of node failure if this approach is taken without
providing protection in the MPLS-TP LSP or PW layer to handle node
failure.
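The following sketch illustrates path computation over links
classified by inherited protection level: links below the required
level are pruned, and the remaining topology is searched.  The
topology encoding and the ordering of protection levels are
illustrative assumptions.

<CODE BEGINS>
# Minimal sketch: compute a path using only links whose inherited
# protection meets or exceeds a required level.  The encoding and
# level ordering are illustrative assumptions.
from collections import deque

LEVELS = {"unprotected": 0, "shared": 1, "dedicated_1to1": 2,
          "dedicated_1plus1": 3}

def find_path(links, src, dst, required_level):
    """Breadth-first search over links at or above required_level."""
    adjacency = {}
    for (a, b), level in links.items():
        if LEVELS[level] >= LEVELS[required_level]:
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None          # no such path: fall back to MPLS-TP protection

links = {("A", "B"): "dedicated_1plus1", ("B", "C"): "shared",
         ("B", "D"): "dedicated_1to1", ("D", "C"): "dedicated_1plus1"}
print(find_path(links, "A", "C", "dedicated_1to1"))  # ['A','B','D','C']
<CODE ENDS>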
4.9.2. Shared Risk Groups

When an MPLS-TP protection scheme is established, it is important
that the working and protection paths do not share resources in the
network.  If this is not achieved, a single defect may affect both
the working and the protection paths, with the result that traffic
cannot be delivered; under such a condition, the traffic is
effectively unprotected.

Note that this restriction does not apply to restoration, since
restoration takes place after the fault has occurred, which means
that the point of failure can be avoided if an available path exists.

When planning a recovery scheme, it is possible to use a topology map
of the MPLS-TP layer to select paths that use diverse links and nodes
within the MPLS-TP network.  However, this does not guarantee that
the paths are truly diverse.  For example, two separate links in an
MPLS-TP network may be provided by two lambdas in the same optical
fiber, or by two fibers that cross the same bridge.  Moreover, two
completely separate MPLS-TP nodes might be situated in the same
building with a shared power supply.

Thus, in order to achieve proper recovery planning, the MPLS-TP
network must have an understanding of the groups of lower-layer
resources that share a common risk of failure.  From this, MPLS-TP
shared risk groups can be constructed that show which MPLS-TP
resources share a common risk of failure.  Diversity of working and
protection paths can then be planned not only with regard to nodes
and links, but also so as to refrain from using resources from the
same shared risk groups.  A sketch of such a diversity check follows.
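The sketch below checks two link-disjoint paths for shared risks:
even though they have no link in common, they may still share a
lower-layer resource.  The mapping of links to shared risk groups is
an illustrative assumption.

<CODE BEGINS>
# Minimal sketch of a shared-risk-group diversity check.  The
# link-to-SRG mapping is an illustrative assumption.

# link -> set of shared risk group IDs inherited from lower layers
link_srgs = {
    "A-B": {"fiber-7"},
    "B-E": {"fiber-7", "duct-3"},
    "A-C": {"fiber-9"},
    "C-E": {"duct-4"},
}

def shared_risks(path_1, path_2):
    """Return the shared risk groups common to two (link-disjoint)
    paths; an empty result means the paths are SRG-diverse."""
    risks_1 = set().union(*(link_srgs[link] for link in path_1))
    risks_2 = set().union(*(link_srgs[link] for link in path_2))
    return risks_1 & risks_2

working = ["A-B", "B-E"]
protection = ["A-C", "C-E"]
print(shared_risks(working, protection) or "paths are SRG-diverse")
<CODE ENDS>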
4.9.3. Fault Correlation

In a layered network, a low-layer fault may be detected and reported
by multiple layers, and may sometimes lead to the generation of
multiple fault reports from the same layer.  For example, a failure
of a data link may be reported by the line cards in an MPLS-TP node,
but it could also be detected and reported by the MPLS-TP OAM.

Section 4.9 explains how important it is to coordinate the
survivability actions configured and operated in a multi-layer
network in a way that avoids over-provisioning the survivability
resources in the network, while ensuring that recovery actions are
performed in only one layer at a time.

Fault correlation is about understanding which single event has
generated a set of fault reports, so that recovery actions can be
coordinated and so that the fault logging system does not become
overloaded.  Fault correlation depends on understanding resource use
at lower layers, shared risk groups, and a wider view of the way in
which the layers are inter-related.

Fault correlation is most easily performed at the point of fault
detection.  For example, an MPLS-TP node that receives a fault
notification from the lower layer and detects a fault on an LSP in
the MPLS-TP layer can easily correlate these two events.
Furthermore, if the same node detects multiple faults on LSPs that
share the same faulty data link, it can easily correlate them.  Such
a node may use correlation to perform group-based recovery actions,
and can reduce the number of alarm events that it generates to its
management station.

Fault correlation may also be performed at a management station that
receives fault reports from different layers and from different
nodes in the network.  This enables the management station to
coordinate management-originated recovery actions and to present
consolidated fault information to the user and to automated
management systems.

It is also necessary to correlate fault information detected and
reported through OAM.  This function would enable a fault detected at
a lower layer, and reported at a transit node of an MPLS-TP LSP, to
be correlated with an MPLS-TP-layer fault detected at a Maintenance
End Point (MEP) (for example, at the egress of the MPLS-TP LSP).
Such correlation allows the coordination of recovery actions
performed at the MEP, but it also requires that the lower-layer fault
information be propagated to the MEP, which is most easily achieved
using a control plane, management plane, or OAM message.

5. Applicability and Scope of Survivability in MPLS-TP

The MPLS-TP network can be viewed as two layers (the MPLS LSP layer
and the PW layer).  The MPLS-TP network operates over data-link
connections and data-link networks, whereby the MPLS-TP links are
provided by individual data links or by connections in a lower-layer
network.  The MPLS LSP layer is a mandatory part of the MPLS-TP
network, while the PW layer is an optional addition for supporting
specific services.

MPLS-TP survivability provides recovery from failure of the links and
nodes in the MPLS-TP network.  The link defects and failures are
typically caused by defects or failures in the underlying data-link
connections and networks, but this section is only concerned with
recovery actions performed in the MPLS-TP network, which must recover
from the manifestation of any problem as a defect or failure in the
MPLS-TP network.

This section lists the recovery elements (see Section 1) supported in
each of the two layers that can recover from defects or failures of
nodes or links in the MPLS-TP network.

   +--------------+---------------------+------------------------------+
   | Recovery     | MPLS LSP Layer      | PW Layer                     |
   | Element      |                     |                              |
   +--------------+---------------------+------------------------------+
   | Link         | MPLS LSP recovery   | The PW layer is not aware of |
   | Recovery     | can be used to      | the underlying network.      |
   |              | survive the failure | This function is not         |
   |              | of an MPLS-TP link. | supported.                   |
   +--------------+---------------------+------------------------------+
   | Segment/Span | An individual LSP   | For a SS-PW, segment         |
   | Recovery     | segment can be      | recovery is the same as      |
   |              | recovered to        | end-to-end recovery.         |
   |              | survive the failure | Segment recovery for a MS-PW |
   |              | of an MPLS-TP link. | is for future study, and     |
   |              |                     | this function is now         |
   |              |                     | provided using end-to-end    |
   |              |                     | recovery.                    |
   +--------------+---------------------+------------------------------+
   | Concatenated | A concatenated LSP  | Concatenated segment         |
   | Segment      | segment can be      | recovery (in a MS-PW) is for |
   | Recovery     | recovered to        | future study, and this       |
   |              | survive the failure | function is now provided     |
   |              | of an MPLS-TP link  | using end-to-end recovery.   |
   |              | or node.            |                              |
   +--------------+---------------------+------------------------------+
   | End-to-end   | An end-to-end LSP   | End-to-end PW recovery can   |
   | Recovery     | can be recovered to | be applied to survive any    |
   |              | survive any node or | node (including S-PE) or     |
   |              | link failure,       | link failure, except for     |
   |              | except for the      | failure of the ingress or    |
   |              | failure of the      | egress T-PE.                 |
   |              | ingress or egress   |                              |
   |              | node.               |                              |
   +--------------+---------------------+------------------------------+
   | Service      | The MPLS LSP layer  | PW layer service recovery    |
   | Recovery     | is service-         | requires surviving faults in |
   |              | agnostic. This      | T-PEs or on Attachment       |
   |              | function is not     | Circuits (ACs). This is      |
   |              | supported.          | currently out of scope for   |
   |              |                     | MPLS-TP.                     |
   +--------------+---------------------+------------------------------+

                                Table 1

Section 6 provides a description of mechanisms for MPLS-TP LSP
survivability.  Section 7 provides a brief overview of mechanisms for
MPLS-TP PW survivability.

6. Mechanisms for Providing Survivability for MPLS-TP LSPs

This section describes the existing mechanisms that provide LSP
protection within MPLS-TP networks, and highlights areas where new
work is required.

6.1. Management Plane

As described above, a fundamental requirement of MPLS-TP is that
recovery mechanisms should be capable of functioning in the absence
of a control plane.  Recovery may be triggered by MPLS-TP OAM fault
management functions or by external requests (e.g., an operator's
request for manual control of protection switching).  Recovery LSPs
(and in particular restoration LSPs) may be provisioned through the
management plane.

The management plane may be used to configure the recovery domain by
setting the reference end points (which control the recovery
actions), the working and the recovery entities, and the recovery
type (e.g., 1:1 bidirectional linear protection, ring protection,
etc.).

Additional parameters associated with the recovery process (such as
the WTR and hold-off timers, revertive/non-revertive operation, etc.)
may also be configured.

In addition, the management plane may initiate manual control of the
recovery function.  Relative priorities should be set for the fault
conditions and the operator's requests.

Since provisioning the recovery domain involves the selection of a
number of options, mismatches may occur at the different reference
points.  The MPLS-TP protocol to coordinate the protection state,
which is specified in [MPLS-TP-Linear-Protection], may be used as an
in-band (i.e., data-plane-based) control protocol to coordinate the
protection states between the end points of the recovery domain, and
to check the consistency of the configured parameters (such as
timers, revertive/non-revertive behavior, etc.); discovered
inconsistencies are reported to the operator.

It should also be possible for the management plane to track the
recovery status by receiving reports or by issuing polls.

6.1.1. Configuration of Protection Operation

To implement the protection switching mechanisms, the following
entities and information should be configured and provisioned:

o  The end points of a recovery domain.
   As described above, these end points border on the element of
   recovery to which recovery is applied.

o  The protection group which, depending on the required protection
   scheme, consists of a recovery entity and one or more working
   entities.  In 1:1 or 1+1 P2P protection, the paths of the working
   entity and the recovery entity must be physically diverse in every
   respect (i.e., they must not share any resources or physical
   locations) in order to guarantee protection.

o  The SPME (see Section 4.7.5), which must be supported in order to
   implement data-plane-based LSP segment recovery, since related
   control messages (e.g., for OAM, protection path coordination,
   etc.) can be initiated and terminated only at the edges of a path,
   where push and pop operations are enabled.  The SPME is an
   end-to-end LSP which in this context corresponds to the recovery
   entities (working and protection) and makes use of the MPLS
   construct of a hierarchical, nested LSP, as defined in [RFC3031].
   OAM messages and messages to coordinate the protection state can
   be initiated at one edge of the SPME and sent over the G-ACh to
   the peer edge of the SPME.  It is necessary to configure the
   related SPMEs and the mapping between the LSP segments being
   protected and the SPMEs.  Mapping can be 1:1 or 1:n to allow
   scalable protection of a set of LSP segments traversing the part
   of the network in which a Protection Domain is defined.

   Note that each of these LSPs can be initiated or terminated at
   different end points in the network, but that they all traverse
   the Protection Domain and share similar constraints (such as
   requirements for QoS, terms of protection, etc.).

o  The protection type (e.g., unidirectional 1:1, bidirectional 1+1,
   etc.).

o  Revertive/non-revertive behavior.

o  Timers (such as the WTR and hold-off timers).

6.1.2. External Manual Commands

The following external, manual commands may be provided for manual
control of the protection switching operation.  These commands apply
to a protection group; they are listed in descending order of
priority (a sketch of this priority handling follows the list):

o  Blocked protection action - a manual command to prevent data
   traffic from switching to the recovery entity.  This command
   effectively disables the protection group.

o  Forced protection action - a manual command that forces a switch
   of normal data traffic to the recovery entity.

o  Manual protection action - a manual command that forces a switch
   of data traffic to the recovery entity only when there is no
   defect on the recovery entity.

o  Clear switching command - the operator may request that a previous
   administrative switch command (manual or forced switch) be
   cleared.
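The sketch below illustrates priority handling among these inputs: an
active higher-priority input masks lower-priority ones.  The ordering
of the operator commands follows the list above; the placement of
defect-triggered switching below the forced switch is an assumption
of this sketch, not a statement of any specified precedence.

<CODE BEGINS>
# Minimal sketch of priority handling for protection inputs.  The
# placement of "defect_on_working" in the ordering is an assumption.

PRIORITY = ["blocked", "forced_switch", "defect_on_working",
            "manual_switch"]          # highest priority first

def evaluate(active_inputs):
    """Return the protection action implied by the highest-priority
    active input.  A 'clear' command is modeled as removing the
    corresponding administrative input from the active set."""
    for condition in PRIORITY:
        if condition in active_inputs:
            if condition == "blocked":
                return "no switch (protection group disabled)"
            return "traffic on protection entity"
    return "traffic on working entity"

active = {"manual_switch"}
print(evaluate(active))               # switch due to manual command
active.add("blocked")
print(evaluate(active))               # blocked masks the manual switch
<CODE ENDS>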
6.2. Fault Detection

Fault detection is a fundamental part of recovery and survivability.
In all schemes, with the exception of some types of 1+1 protection,
the actions required for the recovery of traffic delivery depend on
the discovery of some kind of fault.  In 1+1 protection, the selector
(at the receiving end) may simply be configured to choose the better
signal; thus, it does not detect a fault or degradation as such, but
simply identifies the path that is better for data delivery.

Faults may be detected in a number of ways, depending on the traffic
pattern and the underlying hardware.  End-to-end faults may be
reported by the application or inferred from knowledge of the
application's data pattern, but this is an unusual approach.  There
are two more common mechanisms for detecting faults in the MPLS-TP
layer:

o  Faults reported by the lower layers.

o  Faults detected by protocols within the MPLS-TP layer.

In an IP/MPLS network, the second mechanism may utilize control-plane
protocols (such as the routing protocols) to detect a failure of
adjacency between neighboring nodes.  In an MPLS-TP network, it is
possible that no control plane will be present.  Even if a control
plane is present, it will be a GMPLS control plane [RFC3945], which
logically separates control channels from data channels; this means
that no conclusion about the health of a data channel can be drawn
from the failure of an associated control channel.  MPLS-TP layer
faults are, therefore, only detected through the use of OAM
protocols, as described in Section 6.4.1.

Faults may, however, be reported by a lower layer.  These generally
show up as interface failures or data-link failures (sometimes known
as connectivity failures) within the MPLS-TP network.  For example,
an underlying optical link may detect loss of light and report a
failure of the MPLS-TP link that uses it.  Alternatively, an
interface card failure may be reported to the MPLS-TP layer.

Faults reported by lower layers are only visible at specific nodes
within the MPLS-TP network (i.e., at the adjacent end points of the
MPLS-TP link).  This would only allow recovery to be performed
locally; so, to enable recovery to be performed by nodes that are not
immediately local to the fault, the fault must be reported (see
Sections 6.4.4 and 6.5.4).

6.3. Fault Localization

If an MPLS-TP node detects that there is a fault in an LSP (that is,
not a network fault reported from a lower layer, but a fault detected
by examining the LSP), it can immediately perform a recovery action.
However, unless the location of the fault is known, the only
practical options are:

o  Perform end-to-end recovery.

o  Perform some other recovery as a speculative act.

Since speculative acts are not guaranteed to achieve the desired
results and could consume resources unnecessarily, and since
end-to-end recovery can require a lot of network resources, it is
important to be able to localize the fault.

Fault localization may be achieved by dividing the network into
protection domains; protection is then operated on LSP segments,
according to the domain in which the fault is discovered.  This
necessitates monitoring of the LSP at the domain edges.

Alternatively, a proactive mechanism of fault localization through
OAM (Section 6.4.3) or through the control plane (Section 6.5.3) is
required.

Fault localization is particularly important for restoration because
a new path must be selected that avoids the fault.  It may not be
practical or desirable to select a path that avoids the entire failed
working path, and it is therefore necessary to isolate the fault's
location.

6.4. OAM Signaling

MPLS-TP provides a comprehensive set of OAM tools for fault
management and performance monitoring at different nested levels
(end-to-end, over a portion of a path (LSP or PW), and at the link
level) [MPLS-TP-OAM-Framework].
These tools support proactive and on-demand fault management (for
fault detection and fault localization) as well as performance
monitoring (to measure the quality of the signals and to detect
degradation).

To support fast recovery, it is useful to use some of the proactive
tools to detect fault conditions (e.g., link/node failure or
degradation) and to trigger the recovery action.

The MPLS-TP OAM messages run in-band with the traffic and support
unidirectional and bidirectional P2P paths as well as P2MP paths.

As described in [MPLS-TP-OAM-Framework], MPLS-TP OAM operates in the
context of a Maintenance Entity, which delimits the scope of the OAM
responsibilities and represents the portion of a path, between two
points, that is monitored and maintained and along which OAM messages
are exchanged.  [MPLS-TP-OAM-Framework] also refers to a Maintenance
Entity Group (MEG), which is a collection of one or more Maintenance
Entities (MEs) that belong to the same transport path (e.g., a P2MP
transport path) and that are maintained and monitored as a group.

An ME includes two Maintenance Entity Group End Points (MEPs), which
reside at the boundaries of the ME, and a set of zero or more
Maintenance Entity Group Intermediate Points (MIPs), which reside
within the ME along the path.  A MEP is capable of initiating and
terminating OAM messages, and as such can only be located at the
edges of a path where push and pop operations are supported.  In
order to define an ME over a portion of a path, it is necessary to
support SPMEs.

The SPME is an end-to-end LSP which in this context corresponds to
the ME; it uses the MPLS construct of hierarchical nested LSPs
defined in [RFC3031].  OAM messages can be initiated at one edge of
the SPME and sent over the G-ACh to the peer edge of the SPME.

The related SPMEs must be configured, and mapping must be performed
between the LSP segments being monitored and the SPMEs.  Mapping can
be 1:1 or 1:n to allow scalable operation.  Note that each of these
LSPs can be initiated or terminated at different end points in the
network and can share similar constraints (such as requirements for
QoS, terms of protection, etc.).

With regard to recovery, where MPLS-TP OAM is supported, an OAM
Maintenance Entity Group is defined for each of the working and
protection entities.

6.4.1. Fault Detection

MPLS-TP OAM tools may be used proactively to detect the following
fault conditions between MEPs:

o  Loss of continuity and misconnectivity - the proactive Continuity
   Check (CC) function is used to detect loss of continuity between
   two MEPs in a MEG.  The proactive Connectivity Verification (CV)
   function allows a sink MEP to detect a misconnectivity defect
   (e.g., mismerge or misconnection) with its peer source MEP when
   the received packet carries an incorrect ME identifier.  For
   protection switching, it is common to run a combined CCV
   (Continuity and Connectivity Verification) message every 3.33 ms.
   In the absence of three consecutive CCV messages, Loss of
   Continuity is declared and notified locally to the edge of the
   recovery domain in order to trigger a recovery action.  In some
   cases, when a slower recovery time is acceptable, it is also
   possible to reduce the transmission rate.
o  Signal degradation - a notification from OAM performance
   monitoring indicating degradation of the working entity may also
   be used as a trigger for protection switching.  In the event of
   degradation, switching to the recovery entity is necessary only if
   the recovery entity can guarantee better conditions.  Degradation
   can be measured by proactively activating MPLS-TP OAM packet loss
   measurement or delay measurement.

o  Remote defect indication - a source MEP can receive a Remote
   Defect Indication from its peer sink MEP and locally notify the
   end point of the recovery domain of the fault condition, in order
   to trigger the recovery action.

6.4.2. Testing for Faults

The management plane may be used to initiate the testing of links,
LSP segments, or entire LSPs.

MPLS-TP provides OAM tools which may be manually invoked on demand,
for a limited period, in order to troubleshoot links, LSP segments,
or entire LSPs (e.g., diagnostics, connectivity verification, packet
loss measurements, etc.).  On-demand monitoring covers a combination
of "in-service" and "out-of-service" monitoring functions.
Out-of-service testing is supported by the OAM on-demand lock
operation.  The lock operation temporarily disables the transport
entity (LSP, LSP segment, or link), preventing the transmission of
all types of traffic except test traffic and OAM (dedicated to the
locked entity).

[MPLS-TP-OAM-Framework] describes the operation of the OAM functions
that may be initiated on demand and provides some considerations.

MPLS-TP also supports in-service and out-of-service testing of the
recovery (protection and restoration) mechanism, the integrity of the
protection/recovery transport paths, and the coordination protocol
between the end points of the recovery domain.  The testing operation
emulates a protection switching request but does not perform the
actual switching action.

6.4.3. Fault Localization

MPLS-TP provides OAM tools to locate a fault and determine its
precise location.  Fault detection often only takes place at key
points in the network (such as at LSP end points or MEPs).  This
means that a fault may be located anywhere within a segment of the
relevant LSP.  Finer information granularity is needed to implement
optimal recovery actions or to diagnose the fault.  On-demand tools
such as trace-route, loopback, and on-demand CCV can be used to
localize a fault.

The information may be notified locally to the end point of the
recovery domain to allow implementation of an optimal recovery
action.  This may be useful for the re-calculation of a recovery
path.

The information should also be reported to network management for
diagnostic purposes.

6.4.4. Fault Reporting

The end points of a recovery domain should be able to detect fault
conditions in the recovery domain and notify the management plane.

In addition, a node within a recovery domain that detects a fault
condition should also be able to report it to network management.
Network management should be capable of correlating the fault reports
and identifying the source of the fault; a sketch of such correlation
follows.
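The sketch below illustrates one simple form of the correlation just
described: fault reports from different layers and nodes are grouped
by the resource they implicate, so that one root-cause event is
identified per group.  The report format is an illustrative
assumption.

<CODE BEGINS>
# Minimal sketch of fault-report correlation by implicated resource.
# The report format is an illustrative assumption.
from collections import defaultdict

reports = [
    {"node": "B", "layer": "optical", "resource": "link-B-C"},
    {"node": "B", "layer": "mpls-tp", "resource": "link-B-C",
     "lsp": "LSP-1"},
    {"node": "E", "layer": "mpls-tp", "resource": "link-B-C",
     "lsp": "LSP-2"},
    {"node": "D", "layer": "mpls-tp", "resource": "link-D-E",
     "lsp": "LSP-3"},
]

def correlate(fault_reports):
    """Group fault reports by the implicated resource; each group is
    treated as a single root-cause event."""
    events = defaultdict(list)
    for report in fault_reports:
        events[report["resource"]].append(report)
    return events

for resource, grouped in correlate(reports).items():
    print(f"{resource}: 1 event, {len(grouped)} reports")
<CODE ENDS>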
MPLS-TP OAM tools support a function whereby an intermediate node
along a path is able to send an alarm report message to the MEP,
indicating the presence of a fault condition in the server layer that
connects it to its adjacent node.  This capability allows a MEP to
suppress alarms that may be generated as a result of a failure
condition in the server layer.

6.4.5. Coordination of Recovery Actions

As described above, in some cases (such as bidirectional protection
switching) it is necessary to coordinate the protection states
between the edges of the recovery domain.
[MPLS-TP-Linear-Protection] defines procedures, protocol messages,
and elements for this purpose.

The protocol is also used to signal administrative requests (e.g.,
manual switch), but only when these are provisioned at the edge of
the recovery domain.

The protocol also enables mismatches to be detected between the
configurations at the two ends of the Protection Domain (such as
timers and revertive/non-revertive behavior); these mismatches can
subsequently be reported to the management plane.

In the absence of suitable coordination (owing to failures in the
delivery or processing of the coordination protocol messages),
protection switching will fail.  This means that the operation of the
protocol that coordinates the protection state is a fundamental part
of protection switching.

6.5. Control Plane

The GMPLS control plane has been proposed as the control plane for
MPLS-TP [RFC5317].  Since GMPLS was designed for use in transport
networks, and since it has been implemented and deployed in many
networks, it is not surprising that it contains many features which
support a high degree of survivability.

The signaling elements of the GMPLS control plane utilize extensions
to the Resource Reservation Protocol (RSVP) (as described in a series
of documents commencing with [RFC3471] and [RFC3473]), and build on
[RFC3209] and [RFC2205].  The architecture for GMPLS is provided in
[RFC3945], while [RFC4426] gives a functional description of the
protocol extensions needed to support GMPLS-based recovery (i.e.,
protection and restoration).

A further control-plane protocol called the Link Management Protocol
(LMP) [RFC4204] is part of the GMPLS protocol family and can be used
to coordinate fault localization and reporting.

Clearly, the control-plane techniques described here only apply where
an MPLS-TP control plane is deployed and operated.  All mandatory
MPLS-TP survivability features must be enabled even in the absence of
the control plane.  However, when present, the control plane may be
used to provide alternative mechanisms that may be desirable, since
they offer simple automation or a richer feature set.

6.5.1. Fault Detection

The control plane is unable to detect data-plane faults.  However, it
does provide mechanisms that detect control-plane faults, and these
can be used to infer data-plane faults when it is evident that the
control and data planes are fate-sharing.  Although [RFC5654]
specifies that MPLS-TP must support an out-of-band control channel,
it does not insist that this be used exclusively.  This means that
there may be deployments where an in-band (or at least an in-fiber)
control channel is used.
In this scenario, failure of the control channel can be used to infer
a failure of the data channel or, at least, to trigger an
investigation of the health of the data channel.

Both RSVP and LMP provide a control-channel "keep-alive" mechanism
(called the Hello message in both cases).  Failure to receive a
message in the configured/negotiated time period indicates a
control-plane failure.  GMPLS routing protocols ([RFC4203] and
[RFC5307]) also include keep-alive mechanisms designed to detect
routing adjacency failures, and, although these keep-alive mechanisms
tend to operate at a relatively low frequency (of the order of
seconds), it is still possible that the first indication of a
control-plane fault will be received through the routing protocol.

Note, however, that care must be taken to ascertain that a specific
failure is not caused by a problem in the control-plane software or
in a processor component at the far end of a link.

Because of the various issues involved, it is not recommended that
the control plane be used as the primary mechanism for fault
detection in an MPLS-TP network.

6.5.2. Testing for Faults

The control plane may be used to initiate and coordinate the testing
of links, LSP segments, or entire LSPs.  This is important in some
technologies where it is necessary to halt data transmission while
testing, but it may also be useful where testing needs to be
specifically enabled or configured.

LMP provides a control-plane mechanism to test the continuity and
connectivity (and naming) of individual links.  A single management
operation is required to initiate the test at one end of the link,
while LMP handles the coordination with the other end of the link.
The test mechanism for an MPLS packet link relies on an LMP Test
message inserted into the data stream at one end of the link and
extracted at the other end of the link.  This mechanism need not
disrupt data flowing over the link.

Note that a link in LMP may, in fact, be an LSP tunnel used to form a
link in the MPLS-TP network.

GMPLS signaling (RSVP) offers two mechanisms that may also assist
with fault testing.  The first mechanism [RFC3473] defines the
Admin_Status object that allows an LSP to be set into "testing mode".
The interpretation of this mode is implementation-specific and could
be documented more precisely for MPLS-TP.  The mode sets the whole
LSP into a state where it can be tested; this need not be disruptive
to data traffic.

The second mechanism provided by GMPLS to support testing is
described in [GMPLS-OAM].  This protocol extension supports the
configuration (including enabling and disabling) of OAM mechanisms
for a specific LSP.

6.5.3. Fault Localization

Fault localization is the process whereby the exact location of a
fault is determined.  Fault detection often only takes place at key
points in the network (such as at LSP end points or at MEPs).  This
means that a fault may be located anywhere within a segment of the
relevant LSP.

If segment or end-to-end protection is in use, this level of
information is often sufficient to repair the LSP.
6.5.4. Fault Status Reporting

GMPLS signaling uses the Notify message to report fault status [RFC3473]. The Notify message can apply to a single LSP, or it can carry fault information for a set of LSPs in order to improve the scalability of fault notification.

Since the Notify message is targeted at a specific node, it can be delivered rapidly without requiring hop-by-hop processing. It can be targeted at LSP end points, or at segment end points (such as MEPs). The target points for Notify messages can be manually configured within the network, or they may be signaled when the LSP is set up. This enables the process to be made consistent with segment protection as well as with the concept of Maintenance Entities.

GMPLS signaling also provides a slower mechanism that reports individual LSP faults hop by hop using the PathErr and ResvErr messages.

[RFC4783] provides a mechanism to coordinate alarms and other event or fault information through GMPLS signaling. This mechanism is useful for understanding the status of the resources used by an LSP, and for providing information as to why an LSP is not functioning; however, it is not intended to replace other fault-reporting mechanisms.

GMPLS routing protocols [RFC4203] and [RFC5307] are used to advertise link availability and capabilities within a GMPLS-enabled network. Thus, the routing protocols can also provide indirect information about network faults; that is, the protocol may stop advertising or may withdraw the advertisement for a failed link, or it may advertise that the link is about to be shut down gracefully [RFC5817]. This mechanism is, however, not normally considered fast enough to be used as a trigger for protection switching.

6.5.5. Coordination of Recovery Actions

Fault coordination is an important feature for certain protection mechanisms (such as bidirectional 1:1 protection). The use of the GMPLS Notify message for this purpose is described in [RFC4426]; however, specific message field values have not yet been defined for this operation.

Further work is needed in GMPLS on the control and configuration of reversion behavior for end-to-end and segment protection, and on the coordination of timer values.

6.5.6. Establishment of Protection and Restoration LSPs

The management plane may be used to set up protection and recovery LSPs but, when present, the control plane may be used instead.

Several protocol extensions exist which simplify this process:

o [RFC4872] provides features which support end-to-end protection switching.

o [RFC4873] describes the establishment of a single, segment-protected LSP. Note that end-to-end protection is a special case of segment protection, and [RFC4873] can also be used to provide end-to-end protection.

o [RFC4874] allows an LSP to be signaled with a request that its path exclude specified resources such as links, nodes, and Shared Risk Link Groups (SRLGs). This allows a disjoint protection path to be requested, or a recovery path to be set up that avoids failed resources (a path-computation sketch illustrating this follows the list).

o Lastly, it should be noted that [RFC5298] provides an overview of the GMPLS techniques available to achieve protection in multi-domain environments.
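For illustration only, the following sketch shows the kind of computation that such route exclusions enable: finding a route that avoids the SRLGs of the working path or of failed resources. The graph model, the SRLG encoding, and the breadth-first search are assumptions of the sketch; [RFC4874] specifies only how the exclusions are signaled, not how paths are computed.

<CODE BEGINS>
// Hedged sketch of SRLG-aware path computation of the kind enabled
// by [RFC4874]-style route exclusion. Illustrative only; a real
// implementation would sit inside a constrained path computation.
package main

import "fmt"

type link struct {
        from, to string
        srlg     int
}

// findPath does a breadth-first search from src to dst, skipping any
// link whose SRLG appears in the exclude set. It returns the node
// sequence of a route, or nil if no disjoint route exists.
func findPath(links []link, src, dst string, exclude map[int]bool) []string {
        prev := map[string]string{src: src}
        queue := []string{src}
        for len(queue) > 0 {
                n := queue[0]
                queue = queue[1:]
                if n == dst {
                        break
                }
                for _, l := range links {
                        if l.from == n && !exclude[l.srlg] {
                                if _, seen := prev[l.to]; !seen {
                                        prev[l.to] = n
                                        queue = append(queue, l.to)
                                }
                        }
                }
        }
        if _, ok := prev[dst]; !ok {
                return nil // no SRLG-disjoint route exists
        }
        path := []string{dst}
        for n := dst; n != src; n = prev[n] {
                path = append([]string{prev[n]}, path...)
        }
        return path
}

func main() {
        links := []link{{"A", "B", 1}, {"B", "D", 1}, {"A", "C", 2}, {"C", "D", 3}}
        // The working path A-B-D uses SRLG 1; request a path excluding it.
        fmt.Println(findPath(links, "A", "D", map[int]bool{1: true}))
}
<CODE ENDS>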
7. Pseudowire Recovery Considerations

Pseudowires provide end-to-end connectivity over the MPLS-TP network and may comprise a single pseudowire segment, or multiple segments "stitched" together to provide that connectivity.

The pseudowire may itself require a level of protection in order to meet the service-level guarantees of its SLA. This protection could be provided by the MPLS-TP LSPs that support the pseudowire, or it could be a feature of the pseudowire layer itself.

As indicated above, the functional architecture described in this document applies to both LSPs and pseudowires. However, the recovery mechanisms for pseudowires are for further study and will be defined in a separate document by the PWE3 working group.

7.1. Utilization of Underlying MPLS-TP Recovery

MPLS-TP PWs are carried across the network inside MPLS-TP LSPs. Therefore, an obvious way to provide protection for a PW is to protect the LSP that carries it. Such protection can take any of the forms described in this document. The choice of recovery scheme will depend on the required speed of recovery and on the traffic loss that is acceptable under the SLA that the PW is providing.

If the PW is a multi-segment PW, then LSP recovery can protect the PW only on individual segments. This means that a single LSP recovery action cannot protect against a failure of a PW switching point (an S-PE), nor can it protect more than one segment at a time, since the LSP tunnel is terminated at each S-PE. In this respect, LSP protection of a PW is very similar to the link-level protection offered to the MPLS-TP LSP layer by an underlying network layer (see Section 4.9).

7.2. Recovery in the Pseudowire Layer

Recovery in the PW layer can be provided by simply running separate PWs end-to-end. Other recovery mechanisms in the PW layer, such as segment or concatenated segment recovery, or service-level recovery involving survivability of T-PE or AC faults, will be described in a separate document.

As with any recovery mechanism, it is important to coordinate between layers. This coordination is necessary to ensure that actions associated with recovery mechanisms are performed in only one layer at a time (that is, the recovery of an underlying LSP needs to be coordinated with the recovery of the PW itself); it also ensures that the working and protection PWs do not both use the same MPLS resources within the network (for example, by running over the same LSP tunnel - see also Section 4.9). One simple coordination approach is sketched below.
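One common approach, noted here purely as an illustration (this framework does not mandate it), is a hold-off timer in the higher layer: the PW layer delays its own recovery action to give the underlying LSP a chance to recover first, so that only one layer acts at a time. The timer value and the hooks in the sketch are hypothetical.

<CODE BEGINS>
// Hedged sketch of inter-layer recovery coordination using a
// hold-off timer in the PW layer. All identifiers are illustrative.
package main

import (
        "fmt"
        "time"
)

// onPWFault waits for holdOff; if the underlying LSP has recovered
// in the meantime, the PW takes no action, otherwise it switches to
// its protection PW. lspRecovered and switchPW are illustrative hooks.
func onPWFault(holdOff time.Duration, lspRecovered func() bool, switchPW func()) {
        time.Sleep(holdOff)
        if lspRecovered() {
                fmt.Println("LSP layer repaired the fault; PW action suppressed")
                return
        }
        switchPW()
}

func main() {
        onPWFault(100*time.Millisecond,
                func() bool { return false },
                func() { fmt.Println("switching traffic to the protection PW") })
}
<CODE ENDS>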
8. Manageability Considerations

Manageability of MPLS-TP networks and their functions is discussed in [MPLS-TP-NM-Framework]. OAM features are discussed in [MPLS-TP-OAM-Framework].

Survivability has some key interactions with management, as described in this document. In particular:

o Recovery domains may be configured in such a way that there is no one-to-one correspondence between the MPLS-TP network and the recovery domains.

o Survivability policies may be configured per network, per recovery domain, or per LSP.

o Configuration of OAM may involve the selection of MEPs; the enabling of OAM on network segments, spans, and links; and the operation of OAM on LSPs, concatenated LSP segments, and LSP segments.

o Manual commands may be used to control recovery functions, including forcing recovery and locking recovery actions (a sketch of typical command precedence follows this list).

See also the considerations regarding security for management and OAM in Section 9 of this document.
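As an illustration of how manual commands and fault conditions typically interact, the sketch below arbitrates requests by precedence in the general style of [G.808.1] linear protection, with lockout of protection overriding forced switch, signal fail, and manual switch. The enumeration and its ordering are assumptions of the sketch, not a reproduction of any specified state machine.

<CODE BEGINS>
// Hedged sketch of precedence arbitration between operator commands
// and fault conditions in a protection domain. Names and ordering
// are illustrative, in the general style of [G.808.1].
package main

import "fmt"

type request int

const (
        noRequest request = iota
        manualSwitch
        signalFail
        forcedSwitch
        lockoutOfProtection // highest precedence: switching disallowed
)

// highestPriority returns the request that wins arbitration; larger
// enumeration values take precedence in this sketch.
func highestPriority(reqs []request) request {
        top := noRequest
        for _, r := range reqs {
                if r > top {
                        top = r
                }
        }
        return top
}

func main() {
        // A signal fail arrives while the operator has locked out the
        // protection path: the lockout wins and no switch-over occurs.
        switch highestPriority([]request{signalFail, lockoutOfProtection}) {
        case lockoutOfProtection:
                fmt.Println("protection locked out: traffic stays on working path")
        case forcedSwitch, signalFail:
                fmt.Println("switch traffic to the protection path")
        case manualSwitch:
                fmt.Println("switch only if the protection path is fault free")
        }
}
<CODE ENDS>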
9. Security Considerations

This framework does not introduce any new security considerations; general issues relating to MPLS security can be found in [MPLS-SEC]. However, several points about MPLS-TP survivability should be noted here.

o If an attacker is able to force a protection switch-over, this may result in a small perturbation to user traffic, and could result in extra traffic being preempted or displaced from the protection resources. In the case of 1:n protection or shared mesh protection, this may result in other traffic becoming unprotected. Therefore, it is important that OAM protocols for detecting or notifying faults use adequate security to prevent them from being used (through the insertion of bogus messages, or through the capture of legitimate messages) to falsely trigger a recovery event.

o If manual commands are modified, captured, or simulated (including replay), it might be possible for an attacker to perform forced recovery actions or to impose lock-out. These actions could impact the capability to provide the recovery function, and could also affect the normal operation of the network for other traffic. Therefore, management protocols used to perform manual commands must allow the operator to use appropriate security mechanisms. This includes verification that the user who performs the commands has appropriate authorization.

o If the control plane is used to configure or operate recovery mechanisms, the control-plane protocols must also be capable of providing adequate security.

10. IANA Considerations

This informational document makes no requests for IANA action.

11. Acknowledgments

Thanks for useful comments and discussions to: Italo Busi, David McWalter, Lou Berger, Yaacov Weingarten, Stewart Bryant, Dan Frost, Lieven Levrau, Xuehui Dai, Liu Guoman, Xiao Min, Daniele Ceccarelli, Scott Bradner, Francesco Fondelli, Curtis Villamizar, Maarten Vissers, and Greg Mirsky.

The Editors would like to thank the participants in ITU-T Study Group 15 for their detailed review.

Some figures and text on shared-mesh protection were borrowed from [MPLS-TP-MESH] with thanks to Tae-sik Cheung and Jeong-dong Ryoo.

12. References

12.1. Normative References

[RFC2205] Braden, R., Ed., Zhang, L., Berson, S., Herzog, S., and S. Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification", RFC 2205, September 1997.

[RFC3209] Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., and G. Swallow, "RSVP-TE: Extensions to RSVP for LSP Tunnels", RFC 3209, December 2001.

[RFC3471] Berger, L., Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Signaling Functional Description", RFC 3471, January 2003.

[RFC3473] Berger, L., Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Signaling Resource ReserVation Protocol-Traffic Engineering (RSVP-TE) Extensions", RFC 3473, January 2003.

[RFC3945] Mannie, E., Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Architecture", RFC 3945, October 2004.

[RFC4203] Kompella, K., Ed. and Y. Rekhter, Ed., "OSPF Extensions in Support of Generalized Multi-Protocol Label Switching (GMPLS)", RFC 4203, October 2005.

[RFC4204] Lang, J., Ed., "Link Management Protocol (LMP)", RFC 4204, October 2005.

[RFC4427] Mannie, E., Ed. and D. Papadimitriou, Ed., "Recovery (Protection and Restoration) Terminology for Generalized Multi-Protocol Label Switching (GMPLS)", RFC 4427, March 2006.

[RFC4428] Papadimitriou, D., Ed. and E. Mannie, Ed., "Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based Recovery Mechanisms (including Protection and Restoration)", RFC 4428, March 2006.

[RFC4873] Berger, L., Bryskin, I., Papadimitriou, D., and A. Farrel, "GMPLS Segment Recovery", RFC 4873, May 2007.

[RFC5307] Kompella, K., Ed. and Y. Rekhter, Ed., "IS-IS Extensions in Support of Generalized Multi-Protocol Label Switching (GMPLS)", RFC 5307, October 2008.

[RFC5317] Bryant, S. and L. Andersson, "Joint Working Team (JWT) Report on MPLS Architectural Considerations for a Transport Profile", RFC 5317, February 2009.

[RFC5654] Niven-Jenkins, B., Ed., Brungard, D., Ed., Betts, M., Ed., Sprecher, N., and S. Ueno, "Requirements of an MPLS Transport Profile", RFC 5654, September 2009.

[RFC5586] Bocci, M., Ed., Vigoureux, M., Ed., and S. Bryant, Ed., "MPLS Generic Associated Channel", RFC 5586, June 2009.

[G.806] ITU-T, "Characteristics of transport equipment - Description methodology and generic functionality", Recommendation G.806, January 2009.

[G.808.1] ITU-T, "Generic Protection Switching - Linear trail and subnetwork protection", Recommendation G.808.1, December 2003.

[G.841] ITU-T, "Types and Characteristics of SDH Network Protection Architectures", Recommendation G.841, October 1998.

[MPLS-TP-FWK] Bocci, M., Bryant, S., Frost, D., Levrau, L., and L. Berger, "A Framework for MPLS in Transport Networks", draft-ietf-mpls-tp-framework, Work in Progress.

[MPLS-TP-NM-Framework] Mansfield, S., Gray, E., and K. Lam, "MPLS-TP Network Management Framework", draft-ietf-mpls-tp-nm-framework, Work in Progress.

[MPLS-TP-OAM-Framework] Busi, I., Ed. and B. Niven-Jenkins, Ed., "Operations, Administration and Maintenance Framework for MPLS-based Transport Networks", draft-ietf-mpls-tp-oam-framework, Work in Progress.

12.2. Informative References

[RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.

[RFC3386] Lai, W. and D. McDysan, "Network Hierarchy and Multilayer Survivability", RFC 3386, November 2002.

[RFC3469] Sharma, V. and F. Hellstrand, "Framework for Multi-Protocol Label Switching (MPLS)-based Recovery", RFC 3469, February 2003.

[RFC4397] Bryskin, I. and A. Farrel, "A Lexicography for the Interpretation of Generalized Multiprotocol Label Switching (GMPLS) Terminology within the Context of the ITU-T's Automatically Switched Optical Network (ASON) Architecture", RFC 4397, February 2006.
[RFC4426] Lang, J., Ed., Rajagopalan, B., Ed., and D. Papadimitriou, Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Recovery Functional Specification", RFC 4426, March 2006.

[RFC4726] Farrel, A., Vasseur, J.-P., and A. Ayyangar, "A Framework for Inter-Domain Multiprotocol Label Switching Traffic Engineering", RFC 4726, November 2006.

[RFC4783] Berger, L., Ed., "GMPLS - Communication of Alarm Information", RFC 4783, December 2006.

[RFC4872] Lang, J., Ed., Rekhter, Y., Ed., and D. Papadimitriou, Ed., "RSVP-TE Extensions in Support of End-to-End Generalized Multi-Protocol Label Switching (GMPLS) Recovery", RFC 4872, May 2007.

[RFC4874] Lee, CY., Farrel, A., and S. De Cnodder, "Exclude Routes - Extension to Resource ReserVation Protocol-Traffic Engineering (RSVP-TE)", RFC 4874, April 2007.

[RFC5212] Shiomoto, K., Papadimitriou, D., Le Roux, JL., Vigoureux, M., and D. Brungard, "Requirements for GMPLS-Based Multi-Region and Multi-Layer Networks (MRN/MLN)", RFC 5212, July 2008.

[RFC5298] Takeda, T., Farrel, A., Ikejiri, Y., and JP. Vasseur, "Analysis of Inter-Domain Label Switched Path (LSP) Recovery", RFC 5298, August 2008.

[RFC5817] Ali, Z., Vasseur, JP., Zamfir, A., and J. Newton, "Graceful Shutdown in MPLS and Generalized MPLS Traffic Engineering Networks", RFC 5817, April 2010.

[GMPLS-OAM] Takacs, A., Fedyk, D., and H. Jia, "OAM Configuration Framework and Requirements for GMPLS RSVP-TE", draft-ietf-ccamp-oam-configuration-fwk, Work in Progress.

[MPLS-SEC] Fang, L., Ed., "Security Framework for MPLS and GMPLS Networks", draft-ietf-mpls-mpls-and-gmpls-security-framework, Work in Progress.

[MPLS-TP-CP-Framework] Andersson, L., Berger, L., Fang, L., and N. Bitar, "MPLS-TP Control Plane Framework", draft-ietf-ccamp-mpls-tp-cp-framework, Work in Progress.

[MPLS-TP-Linear-Protection] Weingarten, Y., Bryant, S., Ed., Sprecher, N., Ed., Van Helvoort, H., Ed., and A. Fulignoli, "MPLS-TP Linear Protection", draft-ietf-mpls-tp-linear-protection, Work in Progress.

[MPLS-TP-MESH] Cheung, T. and J. Ryoo, "MPLS-TP Mesh Protection", draft-cheung-mpls-tp-mesh-protection, Work in Progress.

[OAM-SOUP] Andersson, L., Betts, M., Van Helvoort, H., Bonica, R., and D. Romascanu, "The OAM Acronym Soup", draft-ietf-opsawg-mpls-tp-oam-def, Work in Progress.

[ROSETTA] Van Helvoort, H., Ed., Andersson, L., and N. Sprecher, "A Thesaurus for the Terminology used in Multiprotocol Label Switching Transport Profile (MPLS-TP) drafts/RFCs and ITU-T's Transport Network Recommendations", draft-ietf-mpls-tp-rosetta-stone, Work in Progress.

Authors' Addresses

Nurit Sprecher
Nokia Siemens Networks
3 Hanagar St. Neve Ne'eman B
Hod Hasharon, 45241
Israel

Email: nurit.sprecher@nsn.com

Adrian Farrel
Old Dog Consulting

Email: adrian@olddog.co.uk