Network Working Group                                        N. Sprecher
Internet-Draft                                    Nokia Siemens Networks
Intended status: Informational                                 A. Farrel
Expires: December 20, 2010                            Old Dog Consulting
                                                           June 20, 2010

 Multiprotocol Label Switching Transport Profile Survivability Framework

                   draft-ietf-mpls-tp-survive-fwk-06.txt

Abstract

Network survivability is the ability of a network to recover traffic delivery following failure or degradation of network resources. Survivability is critical for the delivery of guaranteed network services, such as those subject to strict Service Level Agreements (SLAs) that place maximum bounds on the length of time for which services may be degraded or unavailable.

The Transport Profile of Multiprotocol Label Switching (MPLS-TP) is a packet-based transport technology based on the MPLS data plane that re-uses many aspects of the MPLS management and control planes.

This document comprises a framework for the provision of survivability in an MPLS-TP network; it describes recovery elements, types, methods, and topological considerations. To enable data-plane recovery, survivability may be supported by the control plane, management plane, and by Operations, Administration and Maintenance (OAM) functions. This document describes mechanisms for recovering MPLS-TP Label Switched Paths (LSPs). A detailed description of pseudowire recovery in MPLS-TP networks is beyond the scope of this document.
This document is a product of a joint Internet Engineering Task Force (IETF) / International Telecommunication Union Telecommunication Standardization Sector (ITU-T) effort to include an MPLS Transport Profile within the IETF MPLS and PWE3 architectures to support the capabilities and functionalities of a packet-based transport network, as defined by the ITU-T.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 20, 2010.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Table of Contents

   1. Introduction
   1.1. Recovery Schemes
   1.2. Recovery Action Initiation
   1.3. Recovery Context
   1.4. Scope of this Framework
   2. Terminology and References
   3. Requirements for Survivability
   4. Functional Architecture
   4.1. Elements of Control
   4.1.1. Operator Control
   4.1.2. Defect-Triggered Actions
   4.1.3. OAM Signaling
   4.1.4. Control-Plane Signaling
   4.2. Elements of Recovery
   4.2.1. Span Recovery
   4.2.2. Segment Recovery
   4.2.3. End-to-End Recovery
   4.3. Levels of Recovery
   4.3.1. Dedicated Protection
   4.3.2. Shared Protection
   4.3.3. Extra Traffic
   4.3.4. Restoration
   4.3.5. Reversion
   4.4. Mechanisms for Protection
   4.4.1. Link-Level Protection
   4.4.2. Alternate Paths and Segments
   4.4.3. Protection Tunnels
   4.5. Recovery Domains
   4.6. Protection in Different Topologies
   4.7. Mesh Networks
   4.7.1. 1:n Linear Protection
   4.7.2. 1+1 Linear Protection
   4.7.3. P2MP Linear Protection
   4.7.4. Triggers for the Linear Protection Switching Action
   4.7.5. Applicability of Linear Protection for LSP Segments
   4.7.6. Shared Mesh Protection
   4.8. Ring Networks
   4.9. Recovery in Layered Networks
   4.9.1. Inherited Link-Level Protection
   4.9.2. Shared Risk Groups
   4.9.3. Fault Correlation
   5. Applicability and Scope of Survivability in MPLS-TP
   6. Mechanisms for Providing Survivability for MPLS-TP LSPs
   6.1. Management Plane
   6.1.1. Configuration of Protection Operation
   6.1.2. External Manual Commands
   6.2. Fault Detection
   6.3. Fault Localization
   6.4. OAM Signaling
   6.4.1. Fault Detection
   6.4.2. Testing for Faults
   6.4.3. Fault Localization
   6.4.4. Fault Reporting
   6.4.5. Coordination of Recovery Actions
   6.5. Control Plane
   6.5.1. Fault Detection
   6.5.2. Testing for Faults
   6.5.3. Fault Localization
   6.5.4. Fault Status Reporting
   6.5.5. Coordination of Recovery Actions
   6.5.6. Establishment of Protection and Restoration LSPs
   7. Pseudowire Recovery Considerations
   7.1. Utilizing Underlying MPLS-TP Recovery
   7.2. Recovery in the Pseudowire Layer
   8. Manageability Considerations
   9. Security Considerations
   10. IANA Considerations
   11. Acknowledgments
   12. References
   12.1. Normative References
   12.2. Informative References
Editors' Note:

This Informational Internet-Draft is aimed at achieving IETF Consensus before publication as an RFC and will be subject to an IETF Last Call.

[RFC Editor, please remove this note before publication as an RFC and insert the correct Streams Boilerplate to indicate that the published RFC has IETF Consensus.]

1. Introduction

Network survivability is the network's ability to recover traffic delivery following the failure or degradation of traffic delivery caused by a network fault or a denial-of-service attack on the network. Survivability plays a critical role in the delivery of reliable services in transport networks. Guaranteed services in the form of Service Level Agreements (SLAs) require a resilient network that very rapidly detects facility or node degradation or failures, and immediately starts to recover network operations in accordance with the terms of the SLA.

The MPLS Transport Profile (MPLS-TP) is described in [MPLS-TP-FWK]. MPLS-TP is designed to be consistent with existing transport network operations and management models, while providing survivability mechanisms, such as protection and restoration. The functionality provided is intended to be similar to or better than that found in established transport networks, which set a high benchmark for reliability. That is, it is intended to provide the operator with functions with which they are familiar through their experience with other transport networks, although this does not preclude additional techniques.

This document provides a framework for MPLS-TP-based survivability that meets the recovery requirements specified in [RFC5654]. It uses the recovery terminology defined in [RFC4427], which draws heavily on [G.808.1].

This document is a product of a joint Internet Engineering Task Force (IETF) / International Telecommunication Union Telecommunication Standardization Sector (ITU-T) effort to include an MPLS Transport Profile within the IETF MPLS and PWE3 architectures to support the capabilities and functionalities of a packet-based transport network as defined by the ITU-T.

1.1. Recovery Schemes

Various recovery schemes (for protection and restoration) and processes have been defined and analyzed in [RFC4427] and [RFC4428]. These schemes can also be applied in MPLS-TP networks to re-establish end-to-end traffic delivery according to the agreed service parameters, and to trigger recovery from "failed" or "degraded" transport entities. In the context of this document, transport entities are nodes, links, transport path segments, concatenated transport path segments, and entire transport paths. Recovery actions are initiated by the detection of a defect or by an external request (e.g., an operator's request for manual control of protection switching).

[RFC4427] makes a distinction between protection switching and restoration mechanisms.

- Protection switching uses pre-assigned capacity between nodes, where the simplest scheme has a single, dedicated protection entity for each working entity, while the most complex scheme has m protection entities shared between n working entities (m:n).

- Restoration uses any capacity available between nodes and usually involves re-routing.
The resources used for restoration may be pre-planned (i.e., predetermined, but not yet allocated to the recovery path), and recovery priority may be used as a differentiation mechanism to determine which services are recovered and which are not.

Both protection switching and restoration may be either unidirectional or bidirectional; unidirectional implies that protection switching is performed independently for each direction of a bidirectional transport path, while bidirectional means that both directions are switched simultaneously using appropriate coordination, even if the fault applies to only one direction of the path.

Both protection and restoration mechanisms may be either revertive or non-revertive, as described in Section 4.11 of [RFC4427].

Pre-emption priority may be used to determine which services are sacrificed to enable the recovery of other services. In general, protection actions are completed within time frames amounting to tens of milliseconds, while automated restoration actions are normally completed within periods ranging from hundreds of milliseconds to a maximum of a few seconds. Restoration is not guaranteed (for example, because network resources may not be available at the time of the defect).

1.2. Recovery Action Initiation

The recovery schemes described in [RFC4427] and evaluated in [RFC4428] are presented in the context of control-plane-driven actions (such as the configuration of the protection entities and functions, etc.). The presence of a distributed control plane in an MPLS-TP network is optional. However, the absence of such a control plane does not affect the operation of the network and the use of MPLS-TP forwarding, Operations, Administration and Maintenance (OAM), and survivability capabilities. In particular, the concepts discussed in [RFC4427] and [RFC4428] refer to recovery actions effected in the data plane; they are equally applicable in MPLS-TP, with or without the use of a control plane.

Thus, some of the MPLS-TP recovery mechanisms do not depend on a control plane and use MPLS-TP OAM mechanisms or management actions to trigger recovery actions.

The principles of MPLS-TP protection-switching actions are similar to those described in [RFC4427], since the protection mechanism is based on the capability to detect certain defects in the transport entities within the recovery domain. The protection-switching controller does not care which initiation method is used, provided that it can be given information about the status of the transport entities within the recovery domain (e.g., OK, signal failure, signal degradation, etc.).

In the context of MPLS-TP, it is imperative to ensure that switchovers can be performed regardless of the way in which the network is configured and managed (for example, regardless of whether a control plane, management plane, or OAM initiation mechanism is used).

All MPLS and GMPLS protection mechanisms [RFC4428] are applicable in an MPLS-TP environment. It is also possible to provision and manage the related protection entities and functions defined in MPLS and GMPLS using the management plane [RFC5654]. Regardless of whether an OAM, management, or control plane initiation mechanism is used, the protection-switching operation is a data-plane operation.
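As a non-normative illustration of this independence, the following Python sketch models a protection-switching controller that acts only on the reported status of the transport entities in the recovery domain. The class and status names are illustrative assumptions; no specific MPLS-TP interface is implied.

<CODE BEGINS>
from enum import Enum

class Status(Enum):
    OK = "ok"                  # no defect reported
    SIGNAL_FAIL = "sf"         # signal failure reported
    SIGNAL_DEGRADE = "sd"      # signal degradation reported

class ProtectionSwitchingController:
    """Selects the working or protection entity from entity status alone.

    The controller is deliberately unaware of whether a status report
    originated from OAM, the management plane, or control-plane
    signaling: any initiation mechanism feeds the same decision.
    """

    def __init__(self):
        self.status = {"working": Status.OK, "protection": Status.OK}
        self.selected = "working"

    def report(self, entity: str, status: Status) -> str:
        """Record a status report from any source and re-evaluate."""
        self.status[entity] = status
        if (self.status["working"] is not Status.OK
                and self.status["protection"] is Status.OK):
            self.selected = "protection"   # switch over
        elif self.status["working"] is Status.OK:
            self.selected = "working"      # revertive behavior assumed
        return self.selected

# The same switchover results whether the trigger came from an OAM
# continuity check, an operator command relayed by the management
# plane, or a control-plane notification:
ctl = ProtectionSwitchingController()
assert ctl.report("working", Status.SIGNAL_FAIL) == "protection"
assert ctl.report("working", Status.OK) == "working"
<CODE ENDS>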
In some recovery schemes (such as bidirectional protection switching), it is necessary to coordinate the protection state between the edges of the recovery domain to achieve initiation of recovery actions for both directions. An MPLS-TP protocol may be used as an in-band (i.e., data-plane-based) control protocol in order to coordinate the protection state between the edges of the protection domain. When the MPLS-TP control plane is in use, a control-plane-based mechanism can also be used to coordinate the protection states between the edges of the protection domain.

1.3. Recovery Context

An MPLS-TP Label Switched Path (LSP) may be subject to any part of, or all of, MPLS-TP link recovery, path-segment recovery, or end-to-end recovery, where:

o MPLS-TP link recovery refers to the recovery of an individual link (and hence all or a subset of the LSPs routed over the link) between two MPLS-TP nodes. For example, link recovery may be provided by server-layer recovery.

o Segment recovery refers to the recovery of an LSP segment (i.e., segment and concatenated segment in the language of [RFC5654]) between two nodes and is used to recover from the failure of one or more links or nodes.

o End-to-end recovery refers to the recovery of an entire LSP, from its ingress to its egress node.

For additional resiliency, more than one of these recovery techniques may be configured concurrently for a single path.

Co-routed bidirectional MPLS-TP LSPs are defined in a way that allows both directions of the LSP to follow the same route through the network. In this scenario, the operator often requires the directions to fate-share (that is, if one direction fails, both directions should cease to operate).

Associated bidirectional MPLS-TP LSPs exist where the two directions of a bidirectional LSP follow different paths through the network. An operator may also request fate-sharing for associated bidirectional LSPs.

The requirement for fate-sharing causes a direct interaction between the recovery processes affecting the two directions of an LSP, so that both directions of the bidirectional LSP are recovered at the same time. This mode of recovery is termed bidirectional recovery and may be seen as a consequence of fate-sharing.

The recovery scheme operating at the data-plane level can function in a multi-domain environment (in the wider sense of a "domain" [RFC4726]). It can also protect against a failure of a boundary node in the case of inter-domain operation. MPLS-TP recovery schemes are intended to protect client traffic as it is sent across the MPLS-TP network.

1.4. Scope of this Framework

This framework introduces the architecture of the MPLS-TP recovery domain and describes the recovery schemes in MPLS-TP (based on the recovery types defined in [RFC4427]) as well as the principles of operation, recovery states, recovery triggers, and information exchanges between the different elements that support the reference model.

The framework also describes the qualitative levels of the survivability functions that can be provided, such as dedicated recovery, shared protection, restoration, etc. In the event of a network failure, the level of recovery offered directly affects the service level provided to the end-user.
The general description of the functional architecture is applicable to both LSPs and pseudowires (PWs); however, PW recovery is only introduced in Section 7, and the relevant details are beyond the scope of this document and are for further study.

This framework applies to general recovery schemes as well as to mechanisms that are optimized for specific topologies and are tailored to efficiently handle protection switching.

This document addresses the need for the coordination of protection switching across multiple layers and at sub-layers (for clarity, we use the term "layer" to refer equally to layers and sub-layers). This allows an operator to prevent race conditions and allows the protection-switching mechanism of one layer to recover from a failure before switching is invoked at another layer.

This framework also specifies the functions that must be supported by MPLS-TP to support the recovery mechanisms. MPLS-TP introduces a tool kit to enable recovery in MPLS-TP-based networks and to ensure that affected services are recovered in the event of a failure.

Generally, network operators aim to provide the fastest, most stable, and best protection mechanism at a reasonable cost in accordance with customer requirements. The greater the level of protection required, the greater the amount of network resources consumed. It is therefore expected that network operators will offer a wide spectrum of service levels. MPLS-TP-based recovery offers the flexibility to select a recovery mechanism, define the granularity at which traffic delivery is to be protected, and choose the specific traffic types that are to be protected. With MPLS-TP-based recovery, it should be possible to provide different levels of protection for different traffic classes within the same path, based on the service requirements.

2. Terminology and References

The terminology used in this document is consistent with that defined in [RFC4427]. The latter is consistent with [G.808.1].

However, certain protection concepts (such as ring protection) are not discussed in [RFC4427]; for those concepts, the terminology used in this document is drawn from [G.841].

Readers should refer to those documents for normative definitions. This document supplies brief summaries of a number of terms for reasons of clarity and to assist the reader, but it does not re-define terms.

Note, in particular, the distinction and definitions made in [RFC4427] for the following three terms:

o Protection: re-establishing end-to-end traffic delivery using pre-allocated resources.

o Restoration: re-establishing end-to-end traffic delivery using resources allocated at the time of need; sometimes referred to as "repair" of a service, LSP, or the traffic.

o Recovery: a generic term covering both Protection and Restoration.

Note that the term "survivability" is used in [RFC5654] to cover the functional elements of "protection" and "restoration", which are collectively known as "recovery".

Important background information on survivability can be found in [RFC3386], [RFC3469], [RFC4426], [RFC4427], and [RFC4428].
In this document, the following additional terminology is applied:

o "Fault Management", as defined in [MPLS-TP-NM-Framework].

o The terms "defect" and "failure" are used interchangeably to indicate any defect or failure in the sense that they are defined in [G.806]. The terms also include any signal degradation event as defined in [G.806].

o A "fault" is a fault or fault cause as defined in [G.806].

o "Trigger" indicates any event that may initiate a recovery action. See Section 4.1 for a more detailed discussion of triggers.

o The acronym "OAM" is defined as Operations, Administration and Maintenance, consistent with [OAM-SOUP].

o A "Transport Entity" is a node, link, transport path segment, concatenated transport path segment, or entire transport path.

o A "Working Entity" is a transport entity that carries traffic during normal network operation.

o A "Protection Entity" is a transport entity that is pre-allocated and used to protect and transport traffic when the working entity fails.

o A "Recovery Entity" is a transport entity that is used to recover and transport traffic when the working entity fails.

o "Survivability Actions" are the steps that may be taken by network nodes to communicate faults and to switch traffic from faulted or degraded paths to other paths. This may include sending messages and establishing new paths.

General terminology for MPLS-TP is found in [MPLS-TP-FWK] and [ROSETTA]. Background information on MPLS-TP requirements can be found in [RFC5654].

3. Requirements for Survivability

MPLS-TP requirements are presented in [RFC5654] and serve as normative references for the definition of all MPLS-TP functionality, including survivability. Survivability is presented in [RFC5654] as playing a critical role in the delivery of reliable services, and the requirements for survivability are set out using the recovery terminology defined in [RFC4427].

4. Functional Architecture

This section presents an overview of the elements relating to the functional architecture for survivability within an MPLS-TP network. The components are presented separately to demonstrate the way in which they may be combined to provide the different levels of recovery needed to meet the requirements set out in the previous section.

4.1. Elements of Control

Recovery is achieved by implementing specific actions. These actions aim to repair network resources or redirect traffic along paths that avoid failures in the network. They may be triggered automatically by the MPLS-TP network nodes upon detection of a network defect, or they may be triggered by an operator. Automated actions may be enhanced by in-band (i.e., data-plane-based) OAM mechanisms, or by in-band or out-of-band control-plane signaling.

4.1.1. Operator Control

The survivability behavior of the network as a whole, and the reaction of each transport path when a fault is reported, may be controlled by the operator. This control can be split into two sets of functions: policies and actions performed when the transport path is set up, and commands used to control or force recovery actions for established transport paths.

The operator may establish network-wide or local policies that determine the actions to be taken when defects affecting different transport paths are reported.
Also, when a service request is made that causes the establishment of one or more transport paths in the network, the operator (or requesting application) may define a particular level of service, and this will be mapped to specific survivability actions taken before and during transport path setup, after the discovery of a failure of network resources, and upon recovery of those resources.

It should be noted that it is unusual to present a user or customer with options directly related to recovery actions. Instead, the user/customer enters into an SLA with the network provider, and the network operator maps the terms of the SLA (for example, for guaranteed delivery, availability, or reliability) to recovery schemes within the network.

The operator can also issue commands to control recovery actions and events. For example, the operator may perform the following actions:

o Enable or disable the survivability function.

o Invoke the simulation of a network fault.

o Force a switchover from a working path to a recovery path, or vice versa.

Forced switchover may be performed for network optimization purposes with minimal service interruption, such as when modifying protected or unprotected services, when replacing MPLS-TP network nodes, etc. In some circumstances, a fault may be reported to the operator, and the operator may then select and initiate the appropriate recovery action. A description of the different operator commands is found in Section 4.12 of [RFC4427].

4.1.2. Defect-Triggered Actions

Survivability actions may be directly triggered by network defects. This means that the device that detects the defect (for example, through notification of an issue reported from equipment in a lower layer, failure to receive an OAM Continuity message, or receipt of an OAM message reporting a failure condition) may immediately perform a survivability action. The action is directly triggered by events in the data plane. Note, however, that coordination of recovery actions between the edges of the recovery domain may require message exchanges for some recovery functions or for performing a bidirectional recovery action.

4.1.3. OAM Signaling

OAM signaling refers to data-plane OAM message exchange. Such messages may be used to detect and localize faults or to indicate a degradation in the operation of the network. In this context, however, these messages are used to control or trigger survivability actions. The mechanisms to achieve this are discussed in [MPLS-TP-OAM-Framework].

OAM signaling may also be used to coordinate recovery actions within the protection domain.

4.1.4. Control-Plane Signaling

Control-plane signaling is responsible for the setup, maintenance, and teardown of transport paths that do not fall under management-plane control. The control plane may also be used to coordinate the detection, localization, and reaction to network defects pertaining to peer relationships (neighbor-to-neighbor, or end-to-end). Thus, control-plane signaling may initiate and coordinate survivability actions.

The control plane can also be used to distribute topology information and information about resource availability. In this way, the "graceful shutdown" [RFC5817] of resources may be effected by withdrawing them; this can be used to invoke a survivability action in a similar way to that used when reporting or discovering a fault, as described in the previous sections.

The use of a control plane for MPLS-TP is discussed in [MPLS-TP-CP-Framework].
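The four elements of control described in this section can all request survivability actions against the same transport path, so an implementation needs a single point of arbitration. The following non-normative Python sketch shows one plausible local arbitration by request priority, loosely modeled on the operator commands of Section 4.12 of [RFC4427]; the specific ordering and request names are assumptions of this sketch, not a specified behavior.

<CODE BEGINS>
# Illustrative priority of local requests, highest first.  The ordering
# below is an assumption (lockout above forced switch, defect-driven
# requests above manual ones); a real scheme would take its ordering
# from the relevant protection-switching specification.
PRIORITY = [
    "lockout-of-protection",   # operator: never use the protection path
    "forced-switch",           # operator: use the protection path
    "signal-fail",             # defect trigger (OAM, lower layer, ...)
    "signal-degrade",          # degradation trigger
    "manual-switch",           # operator: switch only if no defect
    "no-request",
]

def arbitrate(requests):
    """Return the highest-priority outstanding request.

    'requests' may mix operator commands (management plane), defect
    triggers (data plane / OAM), and control-plane notifications; the
    arbitration does not care where a request came from.
    """
    requests = set(requests) | {"no-request"}
    return min(requests, key=PRIORITY.index)

# A signal-fail defect outranks a manual switch, but an operator
# lockout outranks the defect:
assert arbitrate(["manual-switch", "signal-fail"]) == "signal-fail"
assert arbitrate(["signal-fail", "lockout-of-protection"]) == \
    "lockout-of-protection"
<CODE ENDS>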
4.2. Elements of Recovery

This section describes the elements of recovery. These are the quantitative aspects of recovery, that is, the parts of the network for which recovery can be provided.

Note that the terminology in this section is consistent with [RFC4427]. Where the terms differ from those in [RFC5654], a mapping is provided.

4.2.1. Span Recovery

A span is a single hop between neighboring MPLS-TP nodes in the same network layer. A span is sometimes referred to as a link, and this may cause some confusion between the concept of a data link and a traffic engineering (TE) link. LSPs traverse TE links between neighboring MPLS-TP nodes in the MPLS-TP network layer. However, a TE link may be provided by any of the following:

o A single data link.

o A series of data links in a lower layer, established as an LSP and presented to the upper layer as a single TE link.

o A set of parallel data links in the same layer, presented either as a bundle of TE links, or as a collection of data links that together provide a data-link-layer protection scheme.

Thus, span recovery may be provided by any of the following:

o Selecting a different TE link from a bundle.

o Moving the TE link so that it is supported by a different data link between the same pair of neighbors.

o Re-routing the LSP in the lower layer.

Moving the protected LSP to another TE link between the same pair of neighbors is a form of segment recovery and is described in Section 4.2.2.

4.2.2. Segment Recovery

An LSP segment comprises one or more continuous hops on the path of the LSP. [RFC5654] defines two terms: a "segment" is a single hop along the path of an LSP, while a "concatenated segment" is more than one hop along the path of an LSP. In the context of this document, a segment covers both of these concepts.

A PW segment refers to a Single-Segment PW (SS-PW) or to a single segment of a Multi-Segment PW (MS-PW) that is set up between two PE devices that may be Terminating PEs (T-PEs) or Switching PEs (S-PEs), so that the full set of possibilities is T-PE to S-PE, S-PE to S-PE, S-PE to T-PE, or T-PE to T-PE (for the SS-PW case). As indicated in Section 1, the recovery of PWs and PW segments is beyond the scope of this document; however, see Section 7.

Segment recovery involves redirecting or copying traffic at the source end of a segment onto an alternate path leading to the other end of the segment. According to the required level of recovery (described in Section 4.3), traffic may either be redirected to a pre-established segment, through re-routing of the protected segment, or it may be tunneled to the far end of the protected segment through a "bypass" LSP. For details on recovery mechanisms, see Section 4.4.

Note that protecting a transport path against node failure requires the use of segment recovery or end-to-end recovery, while a link failure can be protected against using span, segment, or end-to-end recovery.
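The note above can be restated programmatically. The following Python fragment is a non-normative summary of that rule only; the function and failure-type names are illustrative.

<CODE BEGINS>
def applicable_recovery(failure: str) -> set:
    """Which elements of recovery can protect against a given failure.

    Span recovery operates between a pair of neighboring nodes, so it
    cannot help when one of those nodes itself fails; segment and
    end-to-end recovery can route around a failed node, provided the
    recovery path avoids it.
    """
    if failure == "link":
        return {"span", "segment", "end-to-end"}
    if failure == "node":
        return {"segment", "end-to-end"}
    raise ValueError("unknown failure type: " + failure)

assert "span" not in applicable_recovery("node")
assert applicable_recovery("link") == {"span", "segment", "end-to-end"}
<CODE ENDS>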
4.2.3. End-to-End Recovery

End-to-end recovery is a special case of segment recovery where the protected segment comprises the entire transport path. End-to-end recovery may be provided as link-diverse or node-diverse recovery, where the recovery path shares no links or no nodes with the working path.

Note that node-diverse paths are necessarily link-diverse, and that full, end-to-end node-diversity is required to guarantee recovery.

Two observations need to be made about end-to-end recovery.

- Firstly, there may be circumstances where node-diverse end-to-end paths do not guarantee recovery. The ingress and egress nodes will themselves be single points of failure. Additionally, there may be shared risks of failure (for example, geographic collocation, shared resources, etc.) between diverse nodes, as described in Section 4.9.2.

- Secondly, it is possible to use end-to-end recovery techniques even when there is not full diversity and the working and protection paths share links or nodes.

4.3. Levels of Recovery

This section describes the qualitative levels of survivability that can be provided. In the event of a network failure, the level of recovery offered directly affects the service level provided to the end-user. This will be observed as the amount of data lost when a network fault occurs, and the length of time required to recover connectivity.

In general, there is a correlation between the recovery service level (i.e., the speed of recovery and reduction of data loss) and the amount of resources used in the network; better service levels require the pre-allocation of resources to the recovery paths, and those resources cannot be used for other purposes if high-quality recovery is required. An operator must therefore consider that providing higher levels of recovery requires network resources to be provisioned and allocated for the exclusive use of the recovery paths, so that they cannot be used to support other customer services.

Sections 6 and 7 of [RFC4427] provide a full breakdown of the protection and recovery schemes. This section summarizes the qualitative levels available.

Note that, in the context of recovery, a useful discussion of the term "resource" and its interpretation in both the IETF and ITU-T contexts may be found in Section 3.2 of [RFC4397].

4.3.1. Dedicated Protection

In dedicated protection, the resources for the recovery entity are pre-assigned for the sole use of the protected transport path. This will clearly be the case in 1+1 protection, and may also be the case in 1:1 protection where extra traffic (see Section 4.3.3) is not supported.

Note that when using protection tunnels (see Section 4.4.3), resources may also be dedicated to the protection of a specific transport path. In some cases (1:1 protection) the entire bypass tunnel may be dedicated to providing recovery for a specific transport path, while in other cases (such as facility backup), a subset of the resources associated with the bypass tunnel may be pre-assigned for the recovery of a specific service.

However, as described in Section 4.4.3, the bypass tunnel method can also be used for shared protection (Section 4.3.2), either to carry extra traffic (Section 4.3.3), or to achieve best-effort recovery without the need for resource reservation.
4.3.2. Shared Protection

In shared protection, the resources for the recovery entities of several services are shared. These may be shared as 1:n or m:n, and are shared on individual links. Link-by-link resource sharing may be managed and operated along LSP segments, on PW segments, or on end-to-end transport paths (LSP or PW). Note that there is no requirement for m:n recovery in the list of MPLS-TP requirements documented in [RFC5654]. Shared protection can be applied in different topologies (mesh, ring, etc.) and can utilize different protection mechanisms (linear, ring, etc.).

End-to-end shared protection shares resources between a number of paths that have common end points. Thus, a number of paths (n paths) are all protected by one or more protection paths (m paths, where m may equal 1). After m failures, there are no more available protection paths, and the n paths are no longer protected. Thus, in 1:n protection, a single fault can be protected against; once the protection path is in use, the remaining paths are unprotected. The fact that the paths have become unprotected needs to be conveyed to the path end points, since they may need to report the change in service level or to take further action to increase their protection (a sketch of this bookkeeping appears at the end of this section). In end-to-end shared protection, this communication is simple since the end points are common.

In shared mesh protection (see Section 4.7.6), the paths that share the protection resources do not necessarily have the same end points. This provides a more flexible resource-sharing scheme, but the network planning and the coordination of protection state after a recovery action are more complex.

Where a bypass tunnel is used (Section 4.4.3), the tunnel might not have sufficient resources to simultaneously protect all of the paths for which it offers protection; in the event that all paths were affected by network defects and failures at the same time, not all of them would be recovered. Policy would dictate how this situation should be handled: some paths might be protected, while others would simply fail; the traffic for some paths would be guaranteed, while traffic on other paths would be treated as best-effort with the risk of dropped packets; alternatively, it is possible that protection would not be attempted, according to local policy at the nodes that perform the recovery actions.

Shared protection is a trade-off between assigning network resources to protection (which is not required most of the time) and risking unrecoverable services in the event that multiple network defects or failures occur. Rapid recovery can be achieved with dedicated protection, but it is delayed by message exchanges in the management, control, or data planes for shared protection. This means that there is also a trade-off between rapid recovery and resource sharing. In some cases, shared protection might not meet the speed required for protection, but it may still be faster than restoration.

These trade-offs may be somewhat mitigated by the following:

o Adjusting the value of n in 1:n protection.

o Using m:n protection for a value of m > 1.

o Establishing new protection paths as each available protection path is put into use.
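As referenced above, the following non-normative Python sketch shows the m:n bookkeeping: it tracks how many protection paths remain free and notifies the path end points when the group becomes unprotected. All names, and the notification callback, are illustrative assumptions.

<CODE BEGINS>
class SharedProtectionGroup:
    """m:n shared protection bookkeeping (m protection, n working paths).

    This sketch tracks only resource exhaustion; the protocol machinery
    for coordinating protection state between end points is outside
    its scope.
    """

    def __init__(self, working: list, protection: list, notify):
        self.working = list(working)             # the n working paths
        self.free_protection = list(protection)  # the m protection paths
        self.in_use = {}        # working path -> protection path
        self.notify = notify    # callback toward the path end points

    def fail(self, path: str) -> bool:
        """Recover a failed working path onto a free protection path."""
        if path in self.in_use or not self.free_protection:
            self.notify(f"{path}: unrecoverable, no protection available")
            return False
        self.in_use[path] = self.free_protection.pop()
        if not self.free_protection:
            # All m protection paths are occupied: the remaining working
            # paths are now unprotected, and the end points may need to
            # report a changed service level or act to restore protection.
            self.notify("group unprotected: all protection paths in use")
        return True

# 1:n example (m = 1): the first fault is protected, the second is not.
group = SharedProtectionGroup(["w1", "w2", "w3"], ["p1"], notify=print)
assert group.fail("w1") is True
assert group.fail("w2") is False
<CODE ENDS>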
4.3.3. Extra Traffic

Section 2.5.1.1 of [RFC5654] says: "Support for extra traffic (as defined in [RFC4427]) is not required in MPLS-TP and MAY be omitted from the MPLS-TP specifications." This document observes that extra-traffic facilities may therefore be provided as part of the MPLS-TP survivability toolkit, depending upon the development of suitable solution specifications. The remainder of this section explains the concepts of extra traffic without prejudging the decision to specify or not specify such solutions.

Network resources allocated for protection represent idle capacity during the time that recovery is not actually required, and they can be utilized by carrying other traffic, referred to as "extra traffic".

Note that extra traffic does not need to start or terminate at the ends of the entity (e.g., LSP) that it uses.

When a network resource carrying extra traffic is required for the recovery of protected traffic from the failed working path, the extra traffic is disrupted. This disruption may take one of two forms:

- In "hard preemption", the extra traffic is excluded from the protection resource. The disruption of the extra traffic is total, and the service supported by the extra traffic must be dropped, or some form of rerouting or restoration must be applied to the extra-traffic LSP in order to recover the service.

Hard preemption is achieved by "setting a switch" on the path of the extra traffic such that it no longer flows. This situation may be detected by OAM and reported as a fault, or it may be proactively reported through OAM or control-plane signaling.

- In "soft preemption", the extra traffic is not explicitly excluded from the protection resource, but it is given lower priority than the protected traffic. In a packet network (such as MPLS-TP), this can result in oversubscription of the protection resource, with the result that the extra traffic receives "best-effort" delivery. Depending on the volume of protection and extra traffic, and the level of oversubscription, the extra traffic may be slightly or heavily impacted.

The event of soft preemption may be detected by OAM and reported as a degradation of traffic delivery or as a fault. It may also be proactively reported through OAM or control-plane signaling.

Note that both hard and soft preemption may utilize additional message exchanges in the management, control, or data planes. These messages do not necessarily mean that recovery is delayed, but they may increase the complexity of the protection system. Thus, the benefits of carrying extra traffic must be weighed against the disadvantages of delayed recovery, additional network overhead, and the impact on the services that support the extra traffic, according to the details of the solutions selected.

Note that extra traffic is, by definition, not protected, but it may be restored.

Extra traffic is not supported on dedicated protection resources which, by definition, are used for 1+1 protection (Section 4.3.1), but it can be supported in other protection schemes, including shared protection (Section 4.3.2) and tunnel protection (Section 4.4.3).

Best-effort traffic should not be confused with extra traffic. For best-effort traffic, the network does not guarantee data delivery, and the user does not receive guaranteed quality of service (e.g., in terms of jitter, packet loss, delay, etc.); the service received by best-effort traffic depends on the current traffic load. For extra traffic, however, quality can be guaranteed only until the resources are required for recovery. At this point, the extra traffic may be completely displaced, may be treated as best-effort, or may itself be recovered (for example, by restoration techniques).
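The difference between the two forms of preemption described above can be made concrete. In the following non-normative Python sketch (illustrative names and state, no particular solution implied), hard preemption excludes the extra traffic entirely, while soft preemption demotes it so that it competes for whatever capacity the protected traffic leaves unused.

<CODE BEGINS>
def preempt(extra_lsps, mode: str):
    """Disrupt extra traffic when its resources are needed for recovery.

    extra_lsps: mapping of LSP name -> forwarding-state dictionary.
    mode:       "hard" or "soft" preemption, as described in
                Section 4.3.3.
    """
    for name, state in extra_lsps.items():
        if mode == "hard":
            # "Setting a switch" on the extra-traffic path: the LSP no
            # longer forwards, and its service must be rerouted or
            # restored by other means.
            state["forwarding"] = False
        elif mode == "soft":
            # The LSP keeps forwarding but at lower priority than the
            # protected traffic; with oversubscription it effectively
            # receives best-effort treatment and may suffer loss.
            state["priority"] = "best-effort"
        else:
            raise ValueError(mode)

lsps = {"extra-1": {"forwarding": True, "priority": "guaranteed"}}
preempt(lsps, "soft")
assert lsps["extra-1"]["forwarding"] is True          # still flowing
assert lsps["extra-1"]["priority"] == "best-effort"   # but demoted
<CODE ENDS>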
4.3.4. Restoration

This section refers to LSP restoration. Restoration for PWs is beyond the scope of this document (but see Section 7).

Restoration represents the most effective use of network resources, since no resources are reserved for recovery. However, restoration requires the computation of a new path and the activation of a new LSP (through the management or control plane). It may be more time-consuming to perform these steps than to implement recovery using protection techniques.

Furthermore, there is no guarantee that restoration will be able to recover the service. It may be that all suitable network resources are already in use for other LSPs, so that no new path can be found. This problem can be partially mitigated by using LSP setup priorities, so that recovery LSPs can preempt existing LSPs with lower priorities.

Additionally, when a network defect occurs, multiple LSPs may be disrupted by the same event. These LSPs may have been established by different Network Management Stations (NMSes), or they may have been signaled by different head-end MPLS-TP nodes, meaning that multiple points in the network will try to compute and establish recovery LSPs at the same time. This can lead to a lack of resources within the network and cause recovery failures; some recovery actions will need to be retried, resulting in even slower recovery times for some services.

Both hard and soft LSP restoration may be supported. For hard LSP restoration, the resources of the working LSP are released before the recovery LSP is fully established (i.e., break-before-make). For soft LSP restoration, the resources of the working LSP are released after an alternate LSP is fully established (i.e., make-before-break). Note that in the case of reversion (Section 4.3.5), the resources associated with the working LSP are not released.

The restoration resources may be pre-calculated and even pre-signaled before the restoration action starts, but not pre-allocated. This is known as pre-planned LSP restoration. The complete establishment/activation of the restoration LSP occurs only when the restoration action starts. Pre-planning may occur periodically and provides the most accurate information about the available resources in the network.

4.3.5. Reversion

After a service has been recovered and traffic is flowing along the recovery LSP, the defective network resource may be replaced. Traffic can then be redirected back onto the original working LSP (known as "reversion"), or it can be left where it is on the recovery LSP ("non-revertive" behavior).

It should be possible to specify the reversion behavior of each service; this might even be configured for each recovery instance.

In non-revertive mode, an additional operational option is possible where the protection roles are switched, so that the recovery LSP becomes the working LSP, while the previous working path (or the resources used by the previous working path) is used for recovery in the event of an additional fault.

In revertive mode, it is important to prevent excessive swapping between the working and recovery paths in the case of an intermittent defect. This can be addressed by using a reversion delay timer (the Wait-To-Restore timer), which controls the length of time to wait before reverting following the repair of a fault on the original working path. It should be possible for an operator to configure this timer per LSP, and a default value should be defined.
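A non-normative sketch of the revertive behavior just described: reversion is armed when the working path is repaired and cancelled if the fault recurs before the Wait-To-Restore (WTR) delay expires, which damps swapping under an intermittent defect. The class shape, timer handling, and default value are illustrative assumptions.

<CODE BEGINS>
class RevertiveLsp:
    """Revertive protection with a per-LSP Wait-To-Restore (WTR) delay.

    Time is passed in explicitly so that the sketch stays
    deterministic; a real implementation would use platform timers.
    """

    DEFAULT_WTR_SECONDS = 300  # illustrative default only

    def __init__(self, wtr=DEFAULT_WTR_SECONDS):
        self.wtr = wtr
        self.active = "working"
        self.revert_at = None        # time at which reversion is due

    def working_failed(self, now: float):
        self.active = "recovery"
        self.revert_at = None        # an intermittent fault cancels WTR

    def working_repaired(self, now: float):
        self.revert_at = now + self.wtr   # arm the WTR timer

    def tick(self, now: float):
        if self.revert_at is not None and now >= self.revert_at:
            self.active = "working"  # revert only after WTR expires
            self.revert_at = None

lsp = RevertiveLsp(wtr=30)
lsp.working_failed(now=0)
lsp.working_repaired(now=10)   # repair observed: WTR armed until t=40
lsp.working_failed(now=20)     # intermittent defect: WTR cancelled
lsp.working_repaired(now=25)   # re-armed until t=55
lsp.tick(now=40); assert lsp.active == "recovery"   # too early
lsp.tick(now=55); assert lsp.active == "working"    # reversion
<CODE ENDS>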
4.4. Mechanisms for Protection

This section provides general (non-MPLS-TP-specific) descriptions of the mechanisms that can be used for protection purposes. As indicated above, while the functional architecture applies to both LSPs and PWs, the mechanisms for recovery described in this document refer to LSPs and LSP segments only. Recovery mechanisms for pseudowires and pseudowire segments are for further study and will be described in a separate document (see also Section 7).

4.4.1. Link-Level Protection

Link-level protection refers to two paradigms: (1) protection provided in a lower network layer, and (2) protection provided by the MPLS-TP link layer.

Note that link-level protection mechanisms do not protect the nodes at each end of the entity (e.g., a link or span) that is protected. End-to-end or segment protection should be used in conjunction with link-level protection to protect against a failure of the edge nodes.

Link-level protection offers the following levels of protection:

o Full protection, where a dedicated protection entity (e.g., a link or span) is pre-established to protect a working entity. When the working entity fails, the protected traffic is switched to the protecting entity. In this scenario, all LSPs carried over the working entity are recovered (in one protection operation) when there is a failure condition. This is referred to in [RFC4427] as "bulk recovery".

o Partial protection, where only a subset of the LSPs or traffic carried over a selected entity is recovered when there is a failure condition. The decision as to which LSPs will be recovered and which will not depends on local policy (a sketch of this selection appears at the end of this section).

When there is no failure on the working entity, the protection entity may transport extra traffic, which may be preempted when protection switching occurs.

If link-level protection is available, it may be desirable to allow it to be attempted before other recovery mechanisms are invoked for the transport paths affected by the fault, because link-level protection may be faster and more conservative of network resources. This can be achieved both by limiting the propagation of fault-condition notifications and by delaying the other recovery actions. This ordering of recovery actions can be compared with the discussion of recovery domains (Section 4.5) and recovery in multi-layer networks (Section 4.9).

A protection mechanism may be provided at the MPLS-TP link layer (which connects two MPLS-TP nodes). Such a mechanism can make use of the procedures defined in [RFC5586] to set up in-band communication channels at the MPLS-TP section level, to use these channels to monitor the health of the MPLS-TP link, and to coordinate the protection states between the ends of the MPLS-TP link.
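As referenced in the list above, the full/partial distinction can be sketched as follows: on a failure of the working span, full protection moves every carried LSP in one bulk operation, whereas partial protection consults local policy per LSP. This Python fragment is non-normative, and the names and policy shape are illustrative.

<CODE BEGINS>
def link_level_recover(lsps, level: str, policy=None):
    """Return the LSPs switched to the protecting span on a failure.

    lsps:   names of all LSPs carried over the failed working span.
    level:  "full"    -> bulk recovery of every carried LSP, or
            "partial" -> only the LSPs selected by local policy.
    policy: predicate deciding, per LSP, whether it is recovered;
            required for partial protection.
    """
    if level == "full":
        # "Bulk recovery" in the language of [RFC4427]: one protection
        # operation recovers everything carried over the span.
        return list(lsps)
    if level == "partial":
        return [lsp for lsp in lsps if policy(lsp)]
    raise ValueError(level)

carried = ["lsp-premium-1", "lsp-premium-2", "lsp-besteffort-1"]
# Illustrative local policy: only LSPs marked premium are recovered.
recovered = link_level_recover(
    carried, "partial", policy=lambda lsp: "premium" in lsp)
assert recovered == ["lsp-premium-1", "lsp-premium-2"]
<CODE ENDS>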
4.4.2. Alternate Paths and Segments

The use of alternate paths and segments refers to the paradigm whereby protection is performed in the network layer in which the protected LSP is located; this applies either to the entire end-to-end LSP or to a segment of the LSP. In this case, hierarchical LSPs are not used (compare with Section 4.4.3).

Different levels of protection may be provided:

o Dedicated protection, where a dedicated entity (e.g., an LSP or LSP segment) is (fully) pre-established to protect a working entity (e.g., an LSP or LSP segment). When a failure condition occurs on the working entity, traffic is switched onto the protection entity. Dedicated protection may be performed using 1:1 or 1+1 linear protection schemes. When the failure condition is eliminated, the traffic may revert to the working entity; this is subject to local configuration.

o Shared protection, where one or more protection entities are pre-established to protect against a failure of one or more working entities (1:n or m:n). When the fault condition on the working entity is eliminated, the traffic should revert back to the working entity in order to allow other related working entities to be protected by the shared protection resource.

4.4.3. Protection Tunnels

A protection tunnel is a hierarchical LSP that is pre-provisioned in order to protect against a failure condition along a sequence of spans in the network. We call such a sequence a network segment. A failure of a network segment may affect one or more LSPs that transit the network segment.

When a failure condition occurs in the network segment (detected either by OAM on the network segment, or by OAM on a concatenated segment of one of the LSPs transiting the network segment), one or more of the protected LSPs are switched over at the ingress point of the network segment and are transmitted over the protection tunnel. This is implemented through label stacking (see the sketch at the end of this section); label mapping may be an option as well.

Different levels of protection may be provided:

o Dedicated protection, where the protection tunnel reserves sufficient resources to provide protection for all protected LSPs without causing service degradation.

o Partial protection, where the protection tunnel has enough resources to protect some of the protected LSPs, but not all of them simultaneously. Policy dictates how this situation should be handled: it is possible that some LSPs would be protected, while others would simply fail; it is possible that traffic would be guaranteed for some LSPs, while for other LSPs it would be treated as best-effort with the risk of packets being dropped; alternatively, it is possible that protection would not be attempted.
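As referenced above, the label-stacking operation at the ingress of the protected network segment can be sketched as follows. The sketch is non-normative, shows only the hierarchical-LSP (label stack) variant rather than label mapping, and its label values and structures are illustrative.

<CODE BEGINS>
def switch_into_bypass(packet_label_stack, bypass_label: int,
                       merge_label: int):
    """Redirect one protected LSP into a pre-provisioned bypass tunnel.

    packet_label_stack: list of labels, top of stack first.
    bypass_label:       label of the protection tunnel at this node.
    merge_label:        label the protected LSP expects at the node
                        where the bypass tunnel terminates (the far
                        end of the protected network segment).

    The protected LSP's own label is first swapped to the value
    expected at the merge point, and the bypass tunnel's label is then
    pushed on top (label stacking).  The tunnel's egress pops its
    label, leaving the packet exactly as if it had traversed the
    failed segment.
    """
    stack = list(packet_label_stack)
    stack[0] = merge_label         # swap to the label used at the far end
    stack.insert(0, bypass_label)  # push the protection-tunnel label
    return stack

# A packet of a protected LSP arrives with top label 100; the far end
# of the failed segment expects label 200; the bypass tunnel uses 3999.
assert switch_into_bypass([100], bypass_label=3999, merge_label=200) \
    == [3999, 200]
<CODE ENDS>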
4.5. Recovery Domains

Protection and restoration are performed in the context of a recovery domain. A recovery domain is defined between two or more recovery reference end points, which are located at the edges of the recovery domain and which border the element on which recovery can be provided (as described in Section 4.2). This element can be an end-to-end path, a segment, or a span.

An end-to-end path can be viewed as a special case of a segment, where the ingress and egress label edge routers (LERs) serve as the recovery reference end points.

In the simple case of a point-to-point (P2P) protected entity, two end points reside at the boundary of the Protection Domain. An LSP can enter the recovery domain through one reference end point and exit it through another reference end point.

In the case of unidirectional point-to-multipoint (P2MP), three or more end points reside at the boundary of the Protection Domain. One of the end points is referred to as the source/root, while the others are referred to as sinks/leaves. An LSP can enter the recovery domain through the root point and exit the recovery domain through the leaf points.

The recovery mechanism should restore traffic that was interrupted by a facility (link or node) fault within the recovery domain. Note that a single link may be part of several recovery domains. If two recovery domains have common links, one recovery domain must be contained within the other; this is referred to as nested recovery domains. The boundaries of recovery domains may coincide, but recovery domains must not overlap (a validation sketch appears at the end of this section).

Note that the edges of a recovery domain are not protected, and unless the whole domain is contained within another recovery domain, the edges form a single point of failure.

A recovery group is defined within a recovery domain and consists of a working (primary) entity and one or more recovery (backup) entities that reside between the end points of the recovery domain. To guarantee protection in all situations, a dedicated recovery entity should be pre-provisioned using disjoint resources in the recovery domain, in order to protect against a failure of a working entity. Of course, mechanisms to detect faults and to trigger protection switching are also needed.

The method used to monitor the health of the recovery element is beyond the scope of this document. The end points that are responsible for the recovery action must receive information about its condition, which may be 'OK', 'failed', or 'degraded'.

When the recovery operation is to be triggered by OAM mechanisms, an OAM Maintenance Entity Group must be defined for each of the working and protection entities.

The recovery entities and functions in a recovery domain can be configured using a management plane or a control plane. A management plane may be used to configure the recovery domain by setting the reference points, the working and recovery entities, and the recovery type (e.g., 1:1 bidirectional linear protection, ring protection, etc.). Additional parameters associated with the recovery process may also be configured. For more details, see Section 6.1.

When a control plane is used, the ingress LERs may communicate with the recovery reference points to request that protection or restoration be configured across a recovery domain. For details, see Section 6.5.
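As referenced earlier in this section, the rule that recovery domains sharing links must nest rather than overlap can be checked mechanically. The following non-normative Python sketch validates a set of domains, each modeled simply as its set of links; this flat representation is an illustrative simplification.

<CODE BEGINS>
from itertools import combinations

def validate_recovery_domains(domains: dict) -> list:
    """Check the nesting rule of Section 4.5.

    domains: mapping of domain name -> set of links in the domain.
    Two domains may share links only if one contains the other (nested
    domains); boundaries may coincide, but domains must not partially
    overlap.  Returns the list of offending domain pairs.
    """
    violations = []
    for (a, links_a), (b, links_b) in combinations(domains.items(), 2):
        common = links_a & links_b
        if common and not (links_a <= links_b or links_b <= links_a):
            violations.append((a, b))  # shared links but no containment
    return violations

domains = {
    "outer":  {"L1", "L2", "L3", "L4"},
    "inner":  {"L2", "L3"},            # nested within "outer": allowed
    "astray": {"L4", "L5"},            # overlaps "outer": not allowed
}
assert validate_recovery_domains(domains) == [("outer", "astray")]
<CODE ENDS>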
Cases of multiple interconnections between distinct recovery domains
create a hierarchical arrangement of recovery domains, since a single
top-level recovery domain is created from the concatenation of two
recovery domains with multiple interconnections.  In this case,
recovery actions may be taken both in the individual, lower-level
recovery domains to protect any LSP segment that crosses the domain,
and within the higher-level recovery domain to protect the longer LSP
segment that traverses the higher-level domain.

The MPLS-TP recovery mechanism can be arranged to ensure coordination
between domains.  In interconnected rings, for example, it may be
preferable to allow the upstream ring to perform recovery before the
downstream ring, in order to ensure that recovery takes place in the
ring in which the defect occurred.  Coordination of recovery actions
is particularly important in nested domains, and is discussed further
in Section 4.9.

4.6. Protection in Different Topologies

As described in the requirements listed in Section 3 and detailed in
[RFC5654], the selected recovery techniques may be optimized for
different network topologies if the optimized mechanisms perform
significantly better than the generic mechanisms in the same
topology.

These mechanisms are required (R91 of [RFC5654]) to interoperate with
the mechanisms defined for arbitrary topologies, in order to allow
end-to-end protection and to ensure that consistent protection
techniques are used across the entire network.  In this context,
'interoperate' means that the use of one technique must not inhibit
the use of another technique in an adjacent part of the network for
use on the same end-to-end transport path, and must not prohibit the
use of end-to-end protection mechanisms.

The next sections (4.7 and 4.8) describe two different topologies and
explain how recovery may be markedly different in those different
scenarios.  They also develop the concept of a recovery domain and
show how end-to-end survivability may be achieved through a
concatenation of recovery domains, each providing some level of
recovery in part of the network.

4.7. Mesh Networks

A mesh network is any network where there is arbitrary
interconnectivity between nodes in the network.  Mesh networks are
usually contrasted with more specific topologies such as
hub-and-spoke or ring (see Section 4.8), although such specific
topologies are themselves examples of mesh networks.  This section is
limited to the discussion of protection techniques in the context of
general mesh networks; that is, it does not include optimizations for
specific topologies.

Linear protection is a protection mechanism that provides rapid and
simple protection switching.  In a mesh network, linear protection is
a very suitable protection mechanism because it can operate between
any pair of points within the network.  It can protect against a
defect in a node, a span, a transport path segment, or an end-to-end
transport path.  Linear protection gives a clear indication of the
protection status.

Linear protection operates in the context of a Protection Domain.  A
Protection Domain is a special type of recovery domain (see Section
4.5) associated with the protection function.
A Protection Domain is composed of the following architectural
elements:

o  A set of end points which reside at the boundary of the Protection
   Domain.  In the simple case of 1:n or 1+1 P2P protection, two end
   points reside at the boundary of the Protection Domain.  In each
   transmission direction, one of the end points is referred to as
   the source and the other is referred to as the sink.  For
   unidirectional P2MP protection, three or more end points reside at
   the boundary of the Protection Domain.  One of the end points is
   referred to as the source/root while the others are referred to as
   sinks/leaves.

o  A Protection Group, which consists of one or more working
   (primary) paths and one or more protection (backup) paths which
   run between the end points belonging to the Protection Domain.  To
   guarantee protection in all scenarios, a dedicated protection path
   should be pre-provisioned to protect against a defect of a working
   path (i.e., 1:1 or 1+1 protection schemes).  In addition, the
   working and the protection paths should be disjoint, i.e., their
   routes through the network should be physically diverse in every
   respect.

Note that if fewer resources are allocated to the protection path
than to the working path, the protection path may not have sufficient
resources to protect the traffic of the working path.

As mentioned in Section 4.3.2, the resources of the protection path
may be shared, as in 1:n protection.  In this scenario, the
protection path will not have sufficient resources to protect all the
working paths at any one time.

For bidirectional P2P paths, both unidirectional and bidirectional
protection switching are supported.  If a defect occurs when
bidirectional protection switching is configured, the protection
actions are performed in both directions (even if the defect is
unidirectional).  This requires a level of coordination of the
protection state between the end points of the protection domain.

In unidirectional protection switching, the protection actions are
only performed in the affected direction.

Revertive and non-revertive operations are provided as options for
the network operator.

Linear protection supports the protection schemes described in the
following sub-sections.

4.7.1. 1:n Linear Protection

In the 1:1 scheme, a protection path is allocated to protect against
a defect, failure, or degradation in a working path.  As described
above, to guarantee protection, the protection entity should support
the full capacity and bandwidth of the working entity, although it
may be configured (for example, because of limited network resource
availability) to offer a degraded service when compared with the
working entity.

Figure 1 presents the 1:1 protection architecture.  In normal
conditions, data traffic is transmitted over the working entity,
while the protection entity functions in the idle state.  (OAM may
run on the protection entity to verify its state.)  Normal conditions
apply when there is no defect, failure, or degradation on the working
entity, and no administrative configuration or request causes traffic
to flow over the protection entity.
      |-----------------Protection Domain---------------|

                 ==============================
                /**********Working path***********\
      +--------+ ============================== +--------+
      | Node  /|                                |\ Node  |
      |  A   {<                                  >}  B   |
      |        |                                |        |
      +--------+ ============================== +--------+
                         Protection path
                 ==============================

              Figure 1: 1:1 Protection Architecture

If there is a defect on the working entity, or a specific
administrative request, traffic is switched to the protection entity.

Note that when operating with non-revertive behavior (see Section
4.3.5), after the conditions causing the switchover have been
cleared, the traffic continues to flow on the protection path, but
the working and protection roles are not switched.

In each transmission direction, the protection domain source bridges
traffic onto the appropriate entity, while the sink selects traffic
from the appropriate entity.  The source and the sink need to
coordinate the protection states to ensure that bridging and
selection are performed to and from the same entity.  For this
reason, a signaling coordination protocol (either an in-band,
data-plane signaling protocol or a control-plane-based signaling
protocol) is required.

In bidirectional protection switching, both ends of the protection
domain are switched to the protection entity (even when the fault is
unidirectional).  This requires a protocol to coordinate the
protection state between the two end points of the Protection Domain.

When there is no defect, the bandwidth resources of the idle entity
may be used for lower-priority traffic.  When protection switching is
performed, the lower-priority traffic may be pre-empted by the
protected traffic: the lower-priority LSP may be torn down, a fault
may be reported on the lower-priority LSP, or the lower-priority
traffic may be treated as best effort and discarded when there is
congestion.

In the general case of 1:n linear protection, one protection entity
is allocated to protect n working entities.  The protection entity
might not have sufficient resources to protect all the working
entities that may be affected by fault conditions at a specific time.
In this case, in order to guarantee protection, the protection entity
should support enough capacity and bandwidth to protect any of the n
working entities.

When defects or failures occur along multiple working entities, the
entity to be protected should be selected according to priority.  The
protection states between the edges of the Protection Domain should
be fully coordinated to ensure consistent behavior.  As explained in
Section 4.3.5, revertive behavior is recommended when 1:n protection
is supported.

4.7.2. 1+1 Linear Protection

In the 1+1 protection scheme, a fully dedicated protection entity is
allocated.

As depicted in Figure 2, data traffic is copied and fed at the source
to both the working and the protection entities.  The traffic on the
working and the protection entities is transmitted simultaneously to
the sink of the Protection Domain, where selection between the
working and protection entities is performed (based on some
predetermined criteria).
      |---------------Protection Domain---------------|

                 ==============================
                /**********Working path************\
      +--------+ ============================== +--------+
      | Node  /|                                |\ Node  |
      |  A   {<                                  >}  Z   |
      |        \|                              |/        |
      +--------+ ============================== +--------+
                \**********Protection path*********/
                 ==============================

              Figure 2: 1+1 Protection Architecture

Note that control traffic between the edges of the Protection Domain
(such as OAM or a control protocol to coordinate the protection
state) may be transmitted on an entity that differs from the one used
for the protected traffic.  These packets should not be discarded by
the sink.

In 1+1 unidirectional protection switching, there is no need to
coordinate the protection state between the protection controllers at
the two ends of the protection domain.  In 1+1 bidirectional
protection switching, a protocol is required to coordinate the
protection state between the edges of the Protection Domain.

In both protection schemes, traffic may be restored to the working
entity after the conditions causing the switchover have been cleared.
Data selection may return to the working entity if reversion is
enabled; this requires coordination of the protection state between
the edges of the Protection Domain.  To avoid frequent switching
caused by intermittent defects or failures when the network is not
stable, traffic is not selected from the working entity before the
Wait-to-Restore (WTR) timer has expired.

4.7.3. P2MP Linear Protection

Linear protection may be applied to protect unidirectional P2MP
entities using the 1+1 protection architecture.  The source/root
MPLS-TP node bridges the user traffic to both the working and
protection entities.  Each sink/leaf MPLS-TP node selects the traffic
from one entity according to some predetermined criteria.  Note that
when there is a fault condition on one of the branches of the P2MP
path, some leaf MPLS-TP nodes may select traffic from the working
entity, while other leaf MPLS-TP nodes may select traffic from the
protection entity (a sketch of this per-leaf selection appears at the
end of this section).

In a 1:1 P2MP protection scheme, the source/root MPLS-TP node needs
to identify the existence of a fault condition on any of the branches
of the network.  This means that the sink/leaf MPLS-TP nodes need to
notify the source/root MPLS-TP node of any fault condition.  This
also necessitates a return path from the sinks/leaves to the
source/root MPLS-TP node.  When protection switching is triggered,
the source/root MPLS-TP node selects the protection transport path
for traffic transfer.

A form of "segment recovery for P2MP LSPs" could be constructed.
Given a P2MP LSP, one can protect any possible point of failure (link
or node) using N backup P2MP LSPs.  Each backup P2MP LSP originates
from the upstream node with respect to a different possible failure
point and terminates at all of the destinations downstream of the
potential failure point.  In case of a failure, traffic is redirected
to the backup P2MP path.

Note that such mechanisms do not yet exist and their exact behavior
is for further study.

A 1:n protection scheme for P2MP transport paths is also required by
[RFC5654].  Such a mechanism is for future study.
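The following is a minimal sketch of the sink-side selection logic
described in Sections 4.7.2 and 4.7.3, including optional reversion
guarded by the WTR timer.  The signal-state inputs, the timer
handling, and all names are simplifying assumptions, not a definitive
implementation of any specified state machine.

<CODE BEGINS>
# Minimal sketch of a 1+1 sink selector with optional reversion.
# Inputs and timer handling are simplified assumptions.
import time

class OnePlusOneSelector:
    def __init__(self, revertive=True, wtr_seconds=300):
        self.revertive = revertive
        self.wtr_seconds = wtr_seconds
        self.selected = "working"
        self.wtr_expiry = None       # set while the WTR timer runs

    def update(self, working_ok, protection_ok):
        """Re-evaluate selection from the current signal states."""
        if self.selected == "working" and not working_ok and protection_ok:
            self.selected = "protection"      # switch away from fault
            self.wtr_expiry = None
        elif self.selected == "protection" and working_ok and self.revertive:
            # Do not revert until the working entity has been
            # fault-free for the whole Wait-to-Restore period.
            if self.wtr_expiry is None:
                self.wtr_expiry = time.monotonic() + self.wtr_seconds
            elif time.monotonic() >= self.wtr_expiry:
                self.selected = "working"
                self.wtr_expiry = None
        elif self.selected == "protection" and not working_ok:
            self.wtr_expiry = None            # fault returned: restart WTR
        return self.selected
<CODE ENDS>

In a P2MP deployment, each sink/leaf would run an independent
instance of such a selector, which is why different leaves may be
selecting from different entities at the same time.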
4.7.4. Triggers for the Linear Protection Switching Action

Protection switching may be performed when:

o  A defect condition is detected on the working entity, and the
   protection entity has no defect condition or one less severe than
   that of the working entity.  Proactive in-band OAM Continuity and
   Connectivity Verification (CCV) monitoring of both the working and
   the protection entities may be used to enable the rapid detection
   of a fault condition.  For protection switching, it is common to
   run CCV every 3.33 ms; in the absence of three consecutive CCV
   messages, a fault condition is declared.  (A sketch of this
   detection logic appears at the end of this section.)  In order to
   monitor the working and the protection entities, an OAM
   Maintenance Entity Group should be defined for each entity.  OAM
   indications associated with fault conditions should be provided at
   the edges of the Protection Domain, which are responsible for the
   protection-switching operation.  Input from OAM performance
   monitoring that indicates degradation of the working entity may
   also be used as a trigger for protection switching.  In the case
   of degradation, switching to the protection entity is needed only
   if the protection entity can exhibit better operating conditions.

o  An indication is received from a lower-layer server that there is
   a defect in the lower layer.

o  An external operator command is received (e.g., 'Forced Switch',
   'Manual Switch').  For details, see Section 6.1.2.

o  A request to switch over is received from the far end.  The far
   end may initiate this request, for example, on receipt of an
   administrative request to switch over, or when bidirectional 1:1
   protection switching is supported and a defect has occurred that
   can only be detected by the far end.

As described above, the protection state should be coordinated
between the end points of the Protection Domain.  Control messages
should be exchanged between the edges of the Protection Domain to
coordinate the protection state of the edge nodes.  Control messages
can be delivered using an in-band, data-plane-driven control
protocol, or a control-plane-based protocol.

For 50-ms protection switching, it is recommended that an in-band,
data-plane-driven signaling protocol be used to coordinate the
protection states.  An in-band, data-plane protocol for use in
MPLS-TP networks will be documented in [MPLS-TP-Linear-Protection]
for this purpose.  This protocol is also used to detect mismatches
between the configurations provisioned at the ends of the Protection
Domain.

As described in Section 6.5, the GMPLS control plane already includes
procedures and message elements to coordinate the protection states
between the edges of the protection domain.  These procedures and
protocol messages are specified in [RFC4426], [RFC4872], and
[RFC4873].  However, these messages lack the capability to coordinate
the revertive/non-revertive behavior and the consistency of
configured timers at the edges of the Protection Domain (timers such
as the Wait-to-Restore (WTR) and hold-off timers).
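The sketch below illustrates the CCV-based defect detection described
in the first trigger above: CCV messages are expected every 3.33 ms,
and the loss of three consecutive messages declares a fault that
triggers the switchover.  The polling structure and the callback hook
are simplifying assumptions.

<CODE BEGINS>
# Minimal sketch of CCV-based defect detection: expect a CCV message
# every 3.33 ms; declare Loss of Continuity after three are missed.

CCV_INTERVAL = 0.00333          # 3.33 ms between CCV messages
DETECT_MULTIPLIER = 3           # missed messages before declaring loss

class CcvMonitor:
    def __init__(self, on_fault):
        self.on_fault = on_fault            # hook into the selector
        self.last_rx = None
        self.fault_declared = False

    def ccv_received(self, now):
        """Called for every CCV message received on the entity."""
        self.last_rx = now
        self.fault_declared = False

    def poll(self, now):
        """Called periodically; declares Loss of Continuity when no
        CCV has arrived within three intervals."""
        if self.last_rx is None or self.fault_declared:
            return
        if now - self.last_rx > DETECT_MULTIPLIER * CCV_INTERVAL:
            self.fault_declared = True
            self.on_fault()                 # trigger protection switching

monitor = CcvMonitor(on_fault=lambda: print("switch to protection"))
monitor.ccv_received(now=0.0)
monitor.poll(now=0.02)                      # >10 ms of silence: fault
<CODE ENDS>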
4.7.5. Applicability of Linear Protection for LSP Segments

In order to implement data-plane-based linear protection on LSP
segments, use is made of the Sub-Path Maintenance Entity (SPME), an
MPLS-TP architectural element defined in [MPLS-TP-FWK].  Maintenance
operations (e.g., monitoring, protection, or management) involve the
transmission of messages (e.g., OAM, protection path coordination,
etc.) within the maintained domain.  Further discussion of the
architecture for OAM and the SPME is found in [MPLS-TP-FWK] and
[MPLS-TP-OAM-Framework].  An SPME is an LSP that is defined and used
for the purposes of OAM monitoring, protection, or management of LSP
segments.  The SPME uses the MPLS construct of a hierarchical, nested
LSP, as defined in [RFC3031].

For linear protection, SPMEs should be defined over the working and
protection entities between the edges of a Protection Domain.  OAM
messages and messages used to coordinate the protection state can be
initiated at one edge of the SPME and sent to the peer edge of the
SPME.  Note that these messages are sent over the Generic Associated
Channel (G-ACh) within the SPME, and that they use a two-label stack:
the SPME label and, at the bottom of the stack, the G-ACh Label (GAL)
[RFC5586].  (A sketch of this label stack appears at the end of this
section.)

The end-to-end traffic of the LSP, which includes data traffic and
control traffic (messages for OAM, management, signaling, and
protection state coordination), is tunneled within the SPMEs by means
of label stacking, as defined in [RFC3031].

Mapping between an LSP and an SPME can be 1:1; this is similar to the
ITU-T Tandem Connection element, which defines a sub-layer
corresponding to a segment of a path.  Mapping can also be 1:n to
allow the scalable protection of a set of LSP segments traversing the
part of the network in which a Protection Domain is defined.  Note
that each of these LSPs can be initiated or terminated at different
end points in the network, but that they all traverse the Protection
Domain and share similar constraints (such as requirements for QoS,
terms of protection, etc.).

Note also that in the context of segment protection, the SPMEs serve
as the working and protection entities.
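The following sketch shows the two label stacks mentioned above: the
edge-to-edge control stack carrying the GAL (value 13, per
[RFC5586]), and the stack used to tunnel end-to-end LSP traffic
within the SPME.  The data structures are illustrative only.

<CODE BEGINS>
# Minimal sketch of SPME label stacks.  GAL = 13 per RFC 5586; the
# stack representation is an illustrative assumption.

GAL = 13                        # G-ACh Label, reserved value

def spme_control_stack(spme_label):
    """Label stack for an OAM / protection-coordination message sent
    edge-to-edge within an SPME (top of stack first)."""
    return [
        {"label": spme_label, "s": 0},   # SPME label, not bottom
        {"label": GAL,        "s": 1},   # GAL marks bottom of stack
    ]

def spme_tunneled_stack(spme_label, lsp_label):
    """Label stack for end-to-end LSP traffic tunneled in the SPME:
    the SPME label is pushed above the LSP's own label."""
    return [
        {"label": spme_label, "s": 0},
        {"label": lsp_label,  "s": 1},
    ]

print(spme_control_stack(spme_label=2001))
print(spme_tunneled_stack(spme_label=2001, lsp_label=16))
<CODE ENDS>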
4.7.6. Shared Mesh Protection

For shared mesh protection, the protection resources are used to
protect multiple LSPs which do not all share the same end points.
For example, in Figure 3 there are two paths, ABCDE and VWXYZ.  These
paths do not share end points and cannot, therefore, make use of 1:n
linear protection, even though they do not have any common points of
failure.

ABCDE may be protected by the path APQRE, while VWXYZ can be
protected by the path VPQRZ.  In both cases, 1:1 or 1+1 protection
may be used.  However, it can be seen that if 1:1 protection is used
for both paths, the PQR network segment does not carry traffic when
no failures affect either of the two working paths.  Furthermore, in
the event of only one failure, the PQR segment carries traffic from
only one of the working paths.

Thus, it is possible for the network resources on the PQR segment to
be shared by the two recovery paths.  In this way, mesh protection
can substantially reduce the number of network resources that have to
be reserved in order to provide 1:n protection.

      A----B----C----D----E
       \                 /
        \               /
         \             /
          P-----Q-----R
         /             \
        /               \
       /                 \
      V----W----X----Y----Z

      Figure 3: A Shared Mesh Protection Topology

As the network becomes more complex and the number of LSPs increases,
the potential for shared mesh protection also increases.  However,
this can quickly become unmanageable owing to the increased
complexity.  Therefore, shared mesh protection is normally
pre-planned and configured by the operator, although an automated
system cannot be ruled out.

Note that shared mesh protection operates as 1:n linear protection
(see Section 4.7.1).  However, the protection state needs to be
coordinated between a larger number of nodes: the end points of the
shared concatenated protection segment (nodes P and R in the example)
as well as the end points of the protected LSPs (nodes A, E, V, and Z
in the example).

Additionally, note that the shared protection resources could be used
to carry extra traffic.  For example, in Figure 4, an LSP JPQRK could
be a preemptable LSP that constitutes extra traffic over the PQR
hops; it would be displaced in the event of a protection event.  In
this case, it should be noted that the protection state must also be
coordinated with the ends of the extra-traffic LSPs.

      A----B----C----D----E
       \                 /
        \               /
         \             /
   J-----P-----Q-----R-----K
         /             \
        /               \
       /                 \
      V----W----X----Y----Z

      Figure 4: Shared Mesh Protection with Extra Traffic

4.8. Ring Networks

Several service providers have expressed great interest in the
operation of MPLS-TP in ring topologies and demand a high degree of
survivability functionality in these topologies.

Various criteria for optimization are considered in ring topologies,
such as:

1. Simplification of ring operation in terms of the number of OAM
   Maintenance Entities that are needed to trigger the recovery
   actions, the number of recovery elements, the number of
   management-plane transactions during maintenance operations, etc.

2. Optimization of resource consumption around the ring, such as the
   number of labels needed for the protection paths that traverse the
   network, the total bandwidth required in the ring to ensure path
   protection, etc. (see R91 of [RFC5654]).

[RFC5654] introduces a list of requirements for ring protection
covering the recovery mechanisms needed to protect traffic in a
single ring as well as traffic that traverses more than one ring.
Note that the configuration and operation of the recovery mechanisms
in a ring must scale well with the number of transport paths, the
number of nodes, and the number of ring interconnects.

The requirements for ring protection are fully compatible with the
generic requirements for recovery.

The architecture and the mechanisms for ring protection are specified
in separate documents.  These mechanisms need to be evaluated against
the requirements specified in [RFC5654], which includes guidance on
the principles for the development of new mechanisms.

4.9. Recovery in Layered Networks

In multi-layer or multi-regional networking [RFC5212], recovery may
be performed at multiple layers or across nested recovery domains.

The MPLS-TP recovery mechanism must ensure that the timing of
recovery is coordinated in order to avoid race conditions, to allow
the recovery mechanism of the server layer to fix the problem before
recovery takes place in the MPLS-TP layer, or to allow the MPLS-TP
layer to perform recovery before a client network does.

A configurable hold-off timer is required to coordinate recovery
timing in multiple layers or across nested recovery domains.  Setting
this timer involves a trade-off between rapid recovery and the
creation of a race condition where multiple layers respond to the
same fault, potentially allocating resources in an inefficient
manner.  Thus, the detection of a defect condition in the MPLS-TP
layer should not immediately trigger the recovery process if the
hold-off timer is configured with a value other than zero.  Instead,
the hold-off timer should be started when the defect is detected and,
on expiry, the recovery element should be checked to determine
whether the defect condition still exists.  If it does, the defect
triggers the recovery operation.  (A sketch of this hold-off behavior
follows.)
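The sketch below illustrates the hold-off behavior just described: a
detected defect starts the timer, and recovery is triggered only if
the defect still exists on expiry.  The threading model and the
defect-check hook are simplifying assumptions.

<CODE BEGINS>
# Minimal sketch of hold-off timer behavior for multi-layer recovery
# coordination.  Threading and the hooks are simplified assumptions.
import threading

class HoldOffTimer:
    def __init__(self, hold_off_seconds, defect_still_present, recover):
        self.hold_off = hold_off_seconds
        self.defect_still_present = defect_still_present
        self.recover = recover

    def defect_detected(self):
        if self.hold_off == 0:
            self.recover()            # no hold-off: act immediately
            return
        # Give the server layer a chance to repair the fault first.
        threading.Timer(self.hold_off, self._expired).start()

    def _expired(self):
        if self.defect_still_present():
            self.recover()            # server layer did not repair it
        # else: the lower layer recovered; take no action here

timer = HoldOffTimer(hold_off_seconds=0.1,
                     defect_still_present=lambda: True,
                     recover=lambda: print("MPLS-TP recovery triggered"))
timer.defect_detected()
<CODE ENDS>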
In other configurations, where the lower layer does not have a
restoration capability, or where it is not expected to provide
protection, the lower layer needs to trigger the higher layer to
perform recovery immediately.  Although this can be forced by
configuring the hold-off timer to zero, it may be that, because of
layer independence, the higher layer does not know whether the lower
layer will perform restoration.  In this case, the higher layer will
configure a non-zero hold-off timer and rely on the receipt of a
specific notification from the lower layer if the lower layer cannot
perform restoration.  Since layer boundaries are always within nodes,
such coordination is implementation-specific and does not need to be
covered here.

Reference should be made to [RFC3386], which discusses the
interaction between layers in survivable networks.

4.9.1. Inherited Link-Level Protection

Where a link in the MPLS-TP network is formed through connectivity
(i.e., a packet or non-packet LSP) in a lower-layer network, that
connectivity may itself be protected; for example, the LSP in the
lower-layer network may be provisioned with 1+1 protection.  In this
case, the link in the MPLS-TP network has an inherited level of
protection.

An LSP in the MPLS-TP network may be provisioned with protection in
the MPLS-TP network, as already described, or it may be provisioned
to utilize only those links that have inherited protection.

By classifying the links in the MPLS-TP network according to the
level of protection that they inherit from the server network, it is
possible to compute an end-to-end path in the MPLS-TP network that
uses only those links with a specific or superior level of inherited
protection (a sketch of such a computation follows at the end of this
section).  This means that the end-to-end MPLS-TP LSP can be
protected at the level necessary to conform to the SLA without
needing to provide any additional protection in the MPLS-TP layer.
This reduces complexity, saves network resources, and eliminates
protection-switching coordination problems.

When the requisite level of inherited protection is not available on
all segments along the path in the MPLS-TP network, segment
protection may be used to achieve the desired protection level.

It should be noted, however, that inherited protection only applies
to links.  Nodes cannot be protected in this way.  An operator will
need to perform an analysis of the relative likelihood and
consequences of node failure if this approach is taken without
providing protection in the MPLS-TP LSP or PW layer to handle node
failure.
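The following sketch illustrates path computation over links
classified by inherited protection level: links below the required
level are pruned, and the remaining topology is searched.  The
topology encoding and the ordering of protection levels are
illustrative assumptions.

<CODE BEGINS>
# Minimal sketch: compute a path using only links whose inherited
# protection meets or exceeds a required level.  The encoding and
# level ordering are illustrative assumptions.
from collections import deque

LEVELS = {"unprotected": 0, "shared": 1, "dedicated_1to1": 2,
          "dedicated_1plus1": 3}

def find_path(links, src, dst, required_level):
    """Breadth-first search over links at or above required_level."""
    adjacency = {}
    for (a, b), level in links.items():
        if LEVELS[level] >= LEVELS[required_level]:
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None          # no such path: fall back to MPLS-TP protection

links = {("A", "B"): "dedicated_1plus1", ("B", "C"): "shared",
         ("B", "D"): "dedicated_1to1", ("D", "C"): "dedicated_1plus1"}
print(find_path(links, "A", "C", "dedicated_1to1"))  # ['A','B','D','C']
<CODE ENDS>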
4.9.2. Shared Risk Groups

When an MPLS-TP protection scheme is established, it is important
that the working and protection paths do not share resources in the
network.  If this is not achieved, a single defect may affect both
the working and the protection paths, with the result that traffic
cannot be delivered; under such a condition, the traffic is
effectively unprotected.

Note that this restriction does not apply to restoration, since
restoration takes place after the fault has occurred, which means
that the point of failure can be avoided if an available path exists.

When planning a recovery scheme, it is possible to use a topology map
of the MPLS-TP layer to select paths that use diverse links and nodes
within the MPLS-TP network.  However, this does not guarantee that
the paths are truly diverse.  For example, two separate links in an
MPLS-TP network may be provided by two lambdas in the same optical
fiber, or by two fibers that cross the same bridge.  Moreover, two
completely separate MPLS-TP nodes might be situated in the same
building with a shared power supply.

Thus, in order to achieve proper recovery planning, the MPLS-TP
network must have an understanding of the groups of lower-layer
resources that share a common risk of failure.  From this, MPLS-TP
shared risk groups can be constructed that show which MPLS-TP
resources share a common risk of failure.  Diversity of working and
protection paths can then be planned not only with regard to nodes
and links, but also so as to refrain from using resources from the
same shared risk groups.  A sketch of such a diversity check follows.
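The sketch below checks two link-disjoint paths for shared risks:
even though they have no link in common, they may still share a
lower-layer resource.  The mapping of links to shared risk groups is
an illustrative assumption.

<CODE BEGINS>
# Minimal sketch of a shared-risk-group diversity check.  The
# link-to-SRG mapping is an illustrative assumption.

# link -> set of shared risk group IDs inherited from lower layers
link_srgs = {
    "A-B": {"fiber-7"},
    "B-E": {"fiber-7", "duct-3"},
    "A-C": {"fiber-9"},
    "C-E": {"duct-4"},
}

def shared_risks(path_1, path_2):
    """Return the shared risk groups common to two (link-disjoint)
    paths; an empty result means the paths are SRG-diverse."""
    risks_1 = set().union(*(link_srgs[link] for link in path_1))
    risks_2 = set().union(*(link_srgs[link] for link in path_2))
    return risks_1 & risks_2

working = ["A-B", "B-E"]
protection = ["A-C", "C-E"]
print(shared_risks(working, protection) or "paths are SRG-diverse")
<CODE ENDS>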
4.9.3. Fault Correlation

In a layered network, a low-layer fault may be detected and reported
by multiple layers, and may sometimes lead to the generation of
multiple fault reports from the same layer.  For example, a failure
of a data link may be reported by the line cards in an MPLS-TP node,
but it could also be detected and reported by the MPLS-TP OAM.

Section 4.9 explains how important it is to coordinate the
survivability actions configured and operated in a multi-layer
network in a way that avoids over-provisioning the survivability
resources in the network, while ensuring that recovery actions are
performed in only one layer at a time.

Fault correlation is about understanding which single event has
generated a set of fault reports, so that recovery actions can be
coordinated and so that the fault logging system does not become
overloaded.  Fault correlation depends on understanding resource use
at lower layers, shared risk groups, and a wider view of the way in
which the layers are inter-related.

Fault correlation is most easily performed at the point of fault
detection.  For example, an MPLS-TP node that receives a fault
notification from the lower layer and detects a fault on an LSP in
the MPLS-TP layer can easily correlate these two events.
Furthermore, if the same node detects multiple faults on LSPs that
share the same faulty data link, it can easily correlate them.  Such
a node may use correlation to perform group-based recovery actions,
and can reduce the number of alarm events that it generates to its
management station.

Fault correlation may also be performed at a management station that
receives fault reports from different layers and from different
nodes in the network.  This enables the management station to
coordinate management-originated recovery actions and to present
consolidated fault information to the user and to automated
management systems.

It is also necessary to correlate fault information detected and
reported through OAM.  This function would enable a fault detected at
a lower layer, and reported at a transit node of an MPLS-TP LSP, to
be correlated with an MPLS-TP-layer fault detected at a Maintenance
End Point (MEP) (for example, at the egress of the MPLS-TP LSP).
Such correlation allows the coordination of recovery actions
performed at the MEP, but it also requires that the lower-layer fault
information be propagated to the MEP, which is most easily achieved
using a control plane, management plane, or OAM message.

5. Applicability and Scope of Survivability in MPLS-TP

The MPLS-TP network can be viewed as two layers (the MPLS LSP layer
and the PW layer).  The MPLS-TP network operates over data-link
connections and data-link networks, whereby the MPLS-TP links are
provided by individual data links or by connections in a lower-layer
network.  The MPLS LSP layer is a mandatory part of the MPLS-TP
network, while the PW layer is an optional addition for supporting
specific services.

MPLS-TP survivability provides recovery from failure of the links and
nodes in the MPLS-TP network.  The link defects and failures are
typically caused by defects or failures in the underlying data-link
connections and networks, but this section is only concerned with
recovery actions performed in the MPLS-TP network, which must recover
from the manifestation of any problem as a defect or failure in the
MPLS-TP network.

This section lists the recovery elements (see Section 1) supported in
each of the two layers that can recover from defects or failures of
nodes or links in the MPLS-TP network.

   +--------------+---------------------+------------------------------+
   | Recovery     | MPLS LSP Layer      | PW Layer                     |
   | Element      |                     |                              |
   +--------------+---------------------+------------------------------+
   | Link         | MPLS LSP recovery   | The PW layer is not aware of |
   | Recovery     | can be used to      | the underlying network.      |
   |              | survive the failure | This function is not         |
   |              | of an MPLS-TP link. | supported.                   |
   +--------------+---------------------+------------------------------+
   | Segment/Span | An individual LSP   | For a SS-PW, segment         |
   | Recovery     | segment can be      | recovery is the same as      |
   |              | recovered to        | end-to-end recovery.         |
   |              | survive the failure | Segment recovery for a MS-PW |
   |              | of an MPLS-TP link. | is for future study, and     |
   |              |                     | this function is now         |
   |              |                     | provided using end-to-end    |
   |              |                     | recovery.                    |
   +--------------+---------------------+------------------------------+
   | Concatenated | A concatenated LSP  | Concatenated segment         |
   | Segment      | segment can be      | recovery (in a MS-PW) is for |
   | Recovery     | recovered to        | future study, and this       |
   |              | survive the failure | function is now provided     |
   |              | of an MPLS-TP link  | using end-to-end recovery.   |
   |              | or node.            |                              |
   +--------------+---------------------+------------------------------+
   | End-to-end   | An end-to-end LSP   | End-to-end PW recovery can   |
   | Recovery     | can be recovered to | be applied to survive any    |
   |              | survive any node or | node (including S-PE) or     |
   |              | link failure,       | link failure, except for     |
   |              | except for the      | failure of the ingress or    |
   |              | failure of the      | egress T-PE.                 |
   |              | ingress or egress   |                              |
   |              | node.               |                              |
   +--------------+---------------------+------------------------------+
   | Service      | The MPLS LSP layer  | PW layer service recovery    |
   | Recovery     | is service-         | requires surviving faults in |
   |              | agnostic. This      | T-PEs or on Attachment       |
   |              | function is not     | Circuits (ACs). This is      |
   |              | supported.          | currently out of scope for   |
   |              |                     | MPLS-TP.                     |
   +--------------+---------------------+------------------------------+

                                Table 1

Section 6 provides a description of mechanisms for MPLS-TP LSP
survivability.  Section 7 provides a brief overview of mechanisms for
MPLS-TP PW survivability.

6. Mechanisms for Providing Survivability for MPLS-TP LSPs

This section describes the existing mechanisms that provide LSP
protection within MPLS-TP networks, and highlights areas where new
work is required.

6.1. Management Plane

As described above, a fundamental requirement of MPLS-TP is that
recovery mechanisms should be capable of functioning in the absence
of a control plane.  Recovery may be triggered by MPLS-TP OAM fault
management functions or by external requests (e.g., an operator's
request for manual control of protection switching).  Recovery LSPs
(and in particular restoration LSPs) may be provisioned through the
management plane.

The management plane may be used to configure the recovery domain by
setting the reference end points (which control the recovery
actions), the working and the recovery entities, and the recovery
type (e.g., 1:1 bidirectional linear protection, ring protection,
etc.).

Additional parameters associated with the recovery process (such as
the WTR and hold-off timers, revertive/non-revertive operation, etc.)
may also be configured.

In addition, the management plane may initiate manual control of the
recovery function.  Relative priorities should be set for the fault
conditions and the operator's requests.

Since provisioning the recovery domain involves the selection of a
number of options, mismatches may occur at the different reference
points.  The MPLS-TP protocol to coordinate the protection state,
which is specified in [MPLS-TP-Linear-Protection], may be used as an
in-band (i.e., data-plane-based) control protocol to coordinate the
protection states between the end points of the recovery domain, and
to check the consistency of the configured parameters (such as
timers, revertive/non-revertive behavior, etc.); discovered
inconsistencies are reported to the operator.

It should also be possible for the management plane to track the
recovery status by receiving reports or by issuing polls.

6.1.1. Configuration of Protection Operation

To implement the protection switching mechanisms, the following
entities and information should be configured and provisioned:

o  The end points of a recovery domain.
   As described above, these end points border on the element of
   recovery to which recovery is applied.

o  The protection group which, depending on the required protection
   scheme, consists of a recovery entity and one or more working
   entities.  In 1:1 or 1+1 P2P protection, the paths of the working
   entity and the recovery entity must be physically diverse in every
   respect (i.e., they must not share any resources or physical
   locations) in order to guarantee protection.

o  The SPME (see Section 4.7.5), which must be supported in order to
   implement data-plane-based LSP segment recovery, since related
   control messages (e.g., for OAM, protection path coordination,
   etc.) can be initiated and terminated only at the edges of a path,
   where push and pop operations are enabled.  The SPME is an
   end-to-end LSP which in this context corresponds to the recovery
   entities (working and protection) and makes use of the MPLS
   construct of a hierarchical, nested LSP, as defined in [RFC3031].
   OAM messages and messages to coordinate the protection state can
   be initiated at one edge of the SPME and sent over the G-ACh to
   the peer edge of the SPME.  It is necessary to configure the
   related SPMEs and the mapping between the LSP segments being
   protected and the SPMEs.  Mapping can be 1:1 or 1:n to allow
   scalable protection of a set of LSP segments traversing the part
   of the network in which a Protection Domain is defined.

   Note that each of these LSPs can be initiated or terminated at
   different end points in the network, but that they all traverse
   the Protection Domain and share similar constraints (such as
   requirements for QoS, terms of protection, etc.).

o  The protection type (e.g., unidirectional 1:1, bidirectional 1+1,
   etc.).

o  Revertive/non-revertive behavior.

o  Timers (such as the WTR and hold-off timers).

6.1.2. External Manual Commands

The following external, manual commands may be provided for manual
control of the protection switching operation.  These commands apply
to a protection group; they are listed in descending order of
priority (a sketch of this priority handling follows the list):

o  Blocked protection action - a manual command to prevent data
   traffic from switching to the recovery entity.  This command
   effectively disables the protection group.

o  Forced protection action - a manual command that forces a switch
   of normal data traffic to the recovery entity.

o  Manual protection action - a manual command that forces a switch
   of data traffic to the recovery entity only when there is no
   defect on the recovery entity.

o  Clear switching command - the operator may request that a previous
   administrative switch command (manual or forced switch) be
   cleared.
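The sketch below illustrates priority handling among these inputs: an
active higher-priority input masks lower-priority ones.  The ordering
of the operator commands follows the list above; the placement of
defect-triggered switching below the forced switch is an assumption
of this sketch, not a statement of any specified precedence.

<CODE BEGINS>
# Minimal sketch of priority handling for protection inputs.  The
# placement of "defect_on_working" in the ordering is an assumption.

PRIORITY = ["blocked", "forced_switch", "defect_on_working",
            "manual_switch"]          # highest priority first

def evaluate(active_inputs):
    """Return the protection action implied by the highest-priority
    active input.  A 'clear' command is modeled as removing the
    corresponding administrative input from the active set."""
    for condition in PRIORITY:
        if condition in active_inputs:
            if condition == "blocked":
                return "no switch (protection group disabled)"
            return "traffic on protection entity"
    return "traffic on working entity"

active = {"manual_switch"}
print(evaluate(active))               # switch due to manual command
active.add("blocked")
print(evaluate(active))               # blocked masks the manual switch
<CODE ENDS>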
6.2. Fault Detection

Fault detection is a fundamental part of recovery and survivability.
In all schemes, with the exception of some types of 1+1 protection,
the actions required for the recovery of traffic delivery depend on
the discovery of some kind of fault.  In 1+1 protection, the selector
(at the receiving end) may simply be configured to choose the better
signal; thus, it does not detect a fault or degradation as such, but
simply identifies the path that is better for data delivery.

Faults may be detected in a number of ways, depending on the traffic
pattern and the underlying hardware.  End-to-end faults may be
reported by the application or inferred from knowledge of the
application's data pattern, but this is an unusual approach.  There
are two more common mechanisms for detecting faults in the MPLS-TP
layer:

o  Faults reported by the lower layers.

o  Faults detected by protocols within the MPLS-TP layer.

In an IP/MPLS network, the second mechanism may utilize control-plane
protocols (such as the routing protocols) to detect a failure of
adjacency between neighboring nodes.  In an MPLS-TP network, it is
possible that no control plane will be present.  Even if a control
plane is present, it will be a GMPLS control plane [RFC3945], which
logically separates control channels from data channels; this means
that no conclusion about the health of a data channel can be drawn
from the failure of an associated control channel.  MPLS-TP layer
faults are, therefore, only detected through the use of OAM
protocols, as described in Section 6.4.1.

Faults may, however, be reported by a lower layer.  These generally
show up as interface failures or data-link failures (sometimes known
as connectivity failures) within the MPLS-TP network.  For example,
an underlying optical link may detect loss of light and report a
failure of the MPLS-TP link that uses it.  Alternatively, an
interface card failure may be reported to the MPLS-TP layer.

Faults reported by lower layers are only visible at specific nodes
within the MPLS-TP network (i.e., at the adjacent end points of the
MPLS-TP link).  This would only allow recovery to be performed
locally; so, to enable recovery to be performed by nodes that are not
immediately local to the fault, the fault must be reported (see
Sections 6.4.4 and 6.5.4).

6.3. Fault Localization

If an MPLS-TP node detects that there is a fault in an LSP (that is,
not a network fault reported from a lower layer, but a fault detected
by examining the LSP), it can immediately perform a recovery action.
However, unless the location of the fault is known, the only
practical options are:

o  Perform end-to-end recovery.

o  Perform some other recovery as a speculative act.

Since speculative acts are not guaranteed to achieve the desired
results and could consume resources unnecessarily, and since
end-to-end recovery can require a lot of network resources, it is
important to be able to localize the fault.

Fault localization may be achieved by dividing the network into
protection domains; protection is then operated on LSP segments,
according to the domain in which the fault is discovered.  This
necessitates monitoring of the LSP at the domain edges.

Alternatively, a proactive mechanism of fault localization through
OAM (Section 6.4.3) or through the control plane (Section 6.5.3) is
required.

Fault localization is particularly important for restoration because
a new path must be selected that avoids the fault.  It may not be
practical or desirable to select a path that avoids the entire failed
working path, and it is therefore necessary to isolate the fault's
location.

6.4. OAM Signaling

MPLS-TP provides a comprehensive set of OAM tools for fault
management and performance monitoring at different nested levels
(end-to-end, over a portion of a path (LSP or PW), and at the link
level) [MPLS-TP-OAM-Framework].
These tools support proactive and on-demand fault management (for
fault detection and fault localization) as well as performance
monitoring (to measure the quality of the signals and to detect
degradation).

To support fast recovery, it is useful to use some of the proactive
tools to detect fault conditions (e.g., link/node failure or
degradation) and to trigger the recovery action.

The MPLS-TP OAM messages run in-band with the traffic and support
unidirectional and bidirectional P2P paths as well as P2MP paths.

As described in [MPLS-TP-OAM-Framework], MPLS-TP OAM operates in the
context of a Maintenance Entity, which delimits the scope of the OAM
responsibilities and represents the portion of a path, between two
points, that is monitored and maintained and along which OAM messages
are exchanged.  [MPLS-TP-OAM-Framework] also refers to a Maintenance
Entity Group (MEG), which is a collection of one or more Maintenance
Entities (MEs) that belong to the same transport path (e.g., a P2MP
transport path) and that are maintained and monitored as a group.

An ME includes two Maintenance Entity Group End Points (MEPs), which
reside at the boundaries of the ME, and a set of zero or more
Maintenance Entity Group Intermediate Points (MIPs), which reside
within the ME along the path.  A MEP is capable of initiating and
terminating OAM messages, and as such can only be located at the
edges of a path where push and pop operations are supported.  In
order to define an ME over a portion of a path, it is necessary to
support SPMEs.

The SPME is an end-to-end LSP which in this context corresponds to
the ME; it uses the MPLS construct of hierarchical nested LSPs
defined in [RFC3031].  OAM messages can be initiated at one edge of
the SPME and sent over the G-ACh to the peer edge of the SPME.

The related SPMEs must be configured, and mapping must be performed
between the LSP segments being monitored and the SPMEs.  Mapping can
be 1:1 or 1:n to allow scalable operation.  Note that each of these
LSPs can be initiated or terminated at different end points in the
network and can share similar constraints (such as requirements for
QoS, terms of protection, etc.).

With regard to recovery, where MPLS-TP OAM is supported, an OAM
Maintenance Entity Group is defined for each of the working and
protection entities.

6.4.1. Fault Detection

MPLS-TP OAM tools may be used proactively to detect the following
fault conditions between MEPs:

o  Loss of continuity and misconnectivity - the proactive Continuity
   Check (CC) function is used to detect loss of continuity between
   two MEPs in a MEG.  The proactive Connectivity Verification (CV)
   function allows a sink MEP to detect a misconnectivity defect
   (e.g., mismerge or misconnection) with its peer source MEP when
   the received packet carries an incorrect ME identifier.  For
   protection switching, it is common to run a combined CCV
   (Continuity and Connectivity Verification) message every 3.33 ms.
   In the absence of three consecutive CCV messages, Loss of
   Continuity is declared and notified locally to the edge of the
   recovery domain in order to trigger a recovery action.  In some
   cases, when a slower recovery time is acceptable, it is also
   possible to reduce the transmission rate.
o  Signal degradation - a notification from OAM performance
   monitoring indicating degradation of the working entity may also
   be used as a trigger for protection switching.  In the event of
   degradation, switching to the recovery entity is necessary only if
   the recovery entity can guarantee better conditions.  Degradation
   can be measured by proactively activating MPLS-TP OAM packet loss
   measurement or delay measurement.

o  Remote defect indication - a source MEP can receive a Remote
   Defect Indication from its peer sink MEP and locally notify the
   end point of the recovery domain of the fault condition, in order
   to trigger the recovery action.

6.4.2. Testing for Faults

The management plane may be used to initiate the testing of links,
LSP segments, or entire LSPs.

MPLS-TP provides OAM tools which may be manually invoked on demand,
for a limited period, in order to troubleshoot links, LSP segments,
or entire LSPs (e.g., diagnostics, connectivity verification, packet
loss measurements, etc.).  On-demand monitoring covers a combination
of "in-service" and "out-of-service" monitoring functions.
Out-of-service testing is supported by the OAM on-demand lock
operation.  The lock operation temporarily disables the transport
entity (LSP, LSP segment, or link), preventing the transmission of
all types of traffic except test traffic and OAM (dedicated to the
locked entity).

[MPLS-TP-OAM-Framework] describes the operation of the OAM functions
that may be initiated on demand and provides some considerations.

MPLS-TP also supports in-service and out-of-service testing of the
recovery (protection and restoration) mechanism, the integrity of the
protection/recovery transport paths, and the coordination protocol
between the end points of the recovery domain.  The testing operation
emulates a protection switching request but does not perform the
actual switching action.

6.4.3. Fault Localization

MPLS-TP provides OAM tools to locate a fault and determine its
precise location.  Fault detection often only takes place at key
points in the network (such as at LSP end points or MEPs).  This
means that a fault may be located anywhere within a segment of the
relevant LSP.  Finer information granularity is needed to implement
optimal recovery actions or to diagnose the fault.  On-demand tools
such as trace-route, loopback, and on-demand CCV can be used to
localize a fault.

The information may be notified locally to the end point of the
recovery domain to allow implementation of an optimal recovery
action.  This may be useful for the re-calculation of a recovery
path.

The information should also be reported to network management for
diagnostic purposes.

6.4.4. Fault Reporting

The end points of a recovery domain should be able to detect fault
conditions in the recovery domain and notify the management plane.

In addition, a node within a recovery domain that detects a fault
condition should also be able to report it to network management.
Network management should be capable of correlating the fault reports
and identifying the source of the fault; a sketch of such correlation
follows.
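The sketch below illustrates one simple form of the correlation just
described: fault reports from different layers and nodes are grouped
by the resource they implicate, so that one root-cause event is
identified per group.  The report format is an illustrative
assumption.

<CODE BEGINS>
# Minimal sketch of fault-report correlation by implicated resource.
# The report format is an illustrative assumption.
from collections import defaultdict

reports = [
    {"node": "B", "layer": "optical", "resource": "link-B-C"},
    {"node": "B", "layer": "mpls-tp", "resource": "link-B-C",
     "lsp": "LSP-1"},
    {"node": "E", "layer": "mpls-tp", "resource": "link-B-C",
     "lsp": "LSP-2"},
    {"node": "D", "layer": "mpls-tp", "resource": "link-D-E",
     "lsp": "LSP-3"},
]

def correlate(fault_reports):
    """Group fault reports by the implicated resource; each group is
    treated as a single root-cause event."""
    events = defaultdict(list)
    for report in fault_reports:
        events[report["resource"]].append(report)
    return events

for resource, grouped in correlate(reports).items():
    print(f"{resource}: 1 event, {len(grouped)} reports")
<CODE ENDS>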
MPLS-TP OAM tools support a function whereby an intermediate node
along a path is able to send an alarm report message to the MEP,
indicating the presence of a fault condition in the server layer that
connects it to its adjacent node.  This capability allows a MEP to
suppress alarms that may be generated as a result of a failure
condition in the server layer.

6.4.5. Coordination of Recovery Actions

As described above, in some cases (such as bidirectional protection
switching) it is necessary to coordinate the protection states
between the edges of the recovery domain.
[MPLS-TP-Linear-Protection] defines procedures, protocol messages,
and elements for this purpose.

The protocol is also used to signal administrative requests (e.g.,
manual switch), but only when these are provisioned at the edge of
the recovery domain.

The protocol also enables mismatches to be detected between the
configurations at the two ends of the Protection Domain (such as
timers and revertive/non-revertive behavior); these mismatches can
subsequently be reported to the management plane.

In the absence of suitable coordination (owing to failures in the
delivery or processing of the coordination protocol messages),
protection switching will fail.  This means that the operation of the
protocol that coordinates the protection state is a fundamental part
of protection switching.

6.5. Control Plane

The GMPLS control plane has been proposed as the control plane for
MPLS-TP [RFC5317].  Since GMPLS was designed for use in transport
networks, and since it has been implemented and deployed in many
networks, it is not surprising that it contains many features which
support a high degree of survivability.

The signaling elements of the GMPLS control plane utilize extensions
to the Resource Reservation Protocol (RSVP) (as described in a series
of documents commencing with [RFC3471] and [RFC3473]), and build on
[RFC3209] and [RFC2205].  The architecture for GMPLS is provided in
[RFC3945], while [RFC4426] gives a functional description of the
protocol extensions needed to support GMPLS-based recovery (i.e.,
protection and restoration).

A further control-plane protocol called the Link Management Protocol
(LMP) [RFC4204] is part of the GMPLS protocol family and can be used
to coordinate fault localization and reporting.

Clearly, the control-plane techniques described here only apply where
an MPLS-TP control plane is deployed and operated.  All mandatory
MPLS-TP survivability features must be enabled even in the absence of
the control plane.  However, when present, the control plane may be
used to provide alternative mechanisms that may be desirable, since
they offer simple automation or a richer feature set.

6.5.1. Fault Detection

The control plane is unable to detect data-plane faults.  However, it
does provide mechanisms that detect control-plane faults, and these
can be used to infer data-plane faults when it is evident that the
control and data planes are fate-sharing.  Although [RFC5654]
specifies that MPLS-TP must support an out-of-band control channel,
it does not insist that this be used exclusively.  This means that
there may be deployments where an in-band (or at least an in-fiber)
control channel is used.
In this scenario, failure of the control channel can be used to infer
a failure of the data channel or, at least, to trigger an
investigation of the health of the data channel.

Both RSVP and LMP provide a control-channel "keep-alive" mechanism
(called the Hello message in both cases).  Failure to receive a
message in the configured/negotiated time period indicates a
control-plane failure.  GMPLS routing protocols ([RFC4203] and
[RFC5307]) also include keep-alive mechanisms designed to detect
routing adjacency failures, and, although these keep-alive mechanisms
tend to operate at a relatively low frequency (of the order of
seconds), it is still possible that the first indication of a
control-plane fault will be received through the routing protocol.

Note, however, that care must be taken to ascertain that a specific
failure is not caused by a problem in the control-plane software or
in a processor component at the far end of a link.

Because of the various issues involved, it is not recommended that
the control plane be used as the primary mechanism for fault
detection in an MPLS-TP network.

6.5.2. Testing for Faults

The control plane may be used to initiate and coordinate the testing
of links, LSP segments, or entire LSPs.  This is important in some
technologies where it is necessary to halt data transmission while
testing, but it may also be useful where testing needs to be
specifically enabled or configured.

LMP provides a control-plane mechanism to test the continuity and
connectivity (and naming) of individual links.  A single management
operation is required to initiate the test at one end of the link,
while LMP handles the coordination with the other end of the link.
The test mechanism for an MPLS packet link relies on an LMP Test
message inserted into the data stream at one end of the link and
extracted at the other end of the link.  This mechanism need not
disrupt data flowing over the link.

Note that a link in LMP may, in fact, be an LSP tunnel used to form a
link in the MPLS-TP network.

GMPLS signaling (RSVP) offers two mechanisms that may also assist
with fault testing.  The first mechanism [RFC3473] defines the
Admin_Status object that allows an LSP to be set into "testing mode".
The interpretation of this mode is implementation-specific and could
be documented more precisely for MPLS-TP.  The mode sets the whole
LSP into a state where it can be tested; this need not be disruptive
to data traffic.

The second mechanism provided by GMPLS to support testing is
described in [GMPLS-OAM].  This protocol extension supports the
configuration (including enabling and disabling) of OAM mechanisms
for a specific LSP.

6.5.3. Fault Localization

Fault localization is the process whereby the exact location of a
fault is determined.  Fault detection often only takes place at key
points in the network (such as at LSP end points or at MEPs).  This
means that a fault may be located anywhere within a segment of the
relevant LSP.

If segment or end-to-end protection is in use, this level of
information is often sufficient to repair the LSP.
6.5.4. Fault Status Reporting

GMPLS signaling uses the Notify message to report fault status [RFC3473]. The Notify message can apply to a single LSP, or it can carry fault information for a set of LSPs in order to improve the scalability of fault notification.

Since the Notify message is targeted at a specific node, it can be delivered rapidly without requiring hop-by-hop processing. It can be targeted at LSP end points, or at segment end points (such as MEPs). The target points for Notify messages can be manually configured within the network, or they may be signaled when the LSP is set up. This enables the process to be made consistent with segment protection as well as with the concept of Maintenance Entities.

GMPLS signaling also provides a slower mechanism that reports individual LSP faults hop by hop using the PathErr and ResvErr messages.

[RFC4783] provides a mechanism to coordinate alarms and other event or fault information through GMPLS signaling. This mechanism is useful for understanding the status of the resources used by an LSP, and for providing information as to why an LSP is not functioning; however, it is not intended to replace other fault-reporting mechanisms.

GMPLS routing protocols [RFC4203] and [RFC5307] are used to advertise link availability and capabilities within a GMPLS-enabled network. Thus, the routing protocols can also provide indirect information about network faults; that is, the protocol may stop advertising or may withdraw the advertisement for a failed link, or it may advertise that the link is about to be shut down gracefully [RFC5817]. This mechanism is, however, not normally considered fast enough to be used as a trigger for protection switching.

6.5.5. Coordination of Recovery Actions

Fault coordination is an important feature for certain protection mechanisms (such as bidirectional 1:1 protection). The use of the GMPLS Notify message for this purpose is described in [RFC4426]; however, specific message field values have not yet been defined for this operation.

Further work is needed in GMPLS on the control and configuration of reversion behavior for end-to-end and segment protection, and on the coordination of timer values.

6.5.6. Establishment of Protection and Restoration LSPs

The management plane may be used to set up protection and recovery LSPs but, when present, the control plane may be used instead.

Several protocol extensions exist which simplify this process:

o [RFC4872] provides features which support end-to-end protection switching.

o [RFC4873] describes the establishment of a single, segment-protected LSP. Note that end-to-end protection is a special case of segment protection, and [RFC4873] can also be used to provide end-to-end protection.

o [RFC4874] allows an LSP to be signaled with a request that its path exclude specified resources such as links, nodes, and Shared Risk Link Groups (SRLGs). This allows a disjoint protection path to be requested, or a recovery path to be set up that avoids failed resources (a path-computation sketch illustrating this follows the list).

o Lastly, it should be noted that [RFC5298] provides an overview of the GMPLS techniques available to achieve protection in multi-domain environments.
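For illustration only, the following sketch shows the kind of computation that such route exclusions enable: finding a route that avoids the SRLGs of the working path or of failed resources. The graph model, the SRLG encoding, and the breadth-first search are assumptions of the sketch; [RFC4874] specifies only how the exclusions are signaled, not how paths are computed.

<CODE BEGINS>
// Hedged sketch of SRLG-aware path computation of the kind enabled
// by [RFC4874]-style route exclusion. Illustrative only; a real
// implementation would sit inside a constrained path computation.
package main

import "fmt"

type link struct {
        from, to string
        srlg     int
}

// findPath does a breadth-first search from src to dst, skipping any
// link whose SRLG appears in the exclude set. It returns the node
// sequence of a route, or nil if no disjoint route exists.
func findPath(links []link, src, dst string, exclude map[int]bool) []string {
        prev := map[string]string{src: src}
        queue := []string{src}
        for len(queue) > 0 {
                n := queue[0]
                queue = queue[1:]
                if n == dst {
                        break
                }
                for _, l := range links {
                        if l.from == n && !exclude[l.srlg] {
                                if _, seen := prev[l.to]; !seen {
                                        prev[l.to] = n
                                        queue = append(queue, l.to)
                                }
                        }
                }
        }
        if _, ok := prev[dst]; !ok {
                return nil // no SRLG-disjoint route exists
        }
        path := []string{dst}
        for n := dst; n != src; n = prev[n] {
                path = append([]string{prev[n]}, path...)
        }
        return path
}

func main() {
        links := []link{{"A", "B", 1}, {"B", "D", 1}, {"A", "C", 2}, {"C", "D", 3}}
        // The working path A-B-D uses SRLG 1; request a path excluding it.
        fmt.Println(findPath(links, "A", "D", map[int]bool{1: true}))
}
<CODE ENDS>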
7. Pseudowire Recovery Considerations

Pseudowires provide end-to-end connectivity over the MPLS-TP network and may comprise a single pseudowire segment, or multiple segments "stitched" together to provide that connectivity.

The pseudowire may itself require a level of protection in order to meet the service-level guarantees of its SLA. This protection could be provided by the MPLS-TP LSPs that support the pseudowire, or it could be a feature of the pseudowire layer itself.

As indicated above, the functional architecture described in this document applies to both LSPs and pseudowires. However, the recovery mechanisms for pseudowires are for further study and will be defined in a separate document by the PWE3 working group.

7.1. Utilization of Underlying MPLS-TP Recovery

MPLS-TP PWs are carried across the network inside MPLS-TP LSPs. Therefore, an obvious way to provide protection for a PW is to protect the LSP that carries it. Such protection can take any of the forms described in this document. The choice of recovery scheme will depend on the required speed of recovery and on the traffic loss that is acceptable under the SLA that the PW is providing.

If the PW is a multi-segment PW, then LSP recovery can protect the PW only on individual segments. This means that a single LSP recovery action cannot protect against a failure of a PW switching point (an S-PE), nor can it protect more than one segment at a time, since the LSP tunnel is terminated at each S-PE. In this respect, LSP protection of a PW is very similar to the link-level protection offered to the MPLS-TP LSP layer by an underlying network layer (see Section 4.9).

7.2. Recovery in the Pseudowire Layer

Recovery in the PW layer can be provided by simply running separate PWs end-to-end. Other recovery mechanisms in the PW layer, such as segment or concatenated segment recovery, or service-level recovery involving survivability of T-PE or AC faults, will be described in a separate document.

As with any recovery mechanism, it is important to coordinate between layers. This coordination is necessary to ensure that actions associated with recovery mechanisms are performed in only one layer at a time (that is, the recovery of an underlying LSP needs to be coordinated with the recovery of the PW itself); it also ensures that the working and protection PWs do not both use the same MPLS resources within the network (for example, by running over the same LSP tunnel - see also Section 4.9). One simple coordination approach is sketched below.
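One common approach, noted here purely as an illustration (this framework does not mandate it), is a hold-off timer in the higher layer: the PW layer delays its own recovery action to give the underlying LSP a chance to recover first, so that only one layer acts at a time. The timer value and the hooks in the sketch are hypothetical.

<CODE BEGINS>
// Hedged sketch of inter-layer recovery coordination using a
// hold-off timer in the PW layer. All identifiers are illustrative.
package main

import (
        "fmt"
        "time"
)

// onPWFault waits for holdOff; if the underlying LSP has recovered
// in the meantime, the PW takes no action, otherwise it switches to
// its protection PW. lspRecovered and switchPW are illustrative hooks.
func onPWFault(holdOff time.Duration, lspRecovered func() bool, switchPW func()) {
        time.Sleep(holdOff)
        if lspRecovered() {
                fmt.Println("LSP layer repaired the fault; PW action suppressed")
                return
        }
        switchPW()
}

func main() {
        onPWFault(100*time.Millisecond,
                func() bool { return false },
                func() { fmt.Println("switching traffic to the protection PW") })
}
<CODE ENDS>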
8. Manageability Considerations

Manageability of MPLS-TP networks and their functions is discussed in [MPLS-TP-NM-Framework]. OAM features are discussed in [MPLS-TP-OAM-Framework].

Survivability has some key interactions with management, as described in this document. In particular:

o Recovery domains may be configured in such a way that there is no one-to-one correspondence between the MPLS-TP network and the recovery domains.

o Survivability policies may be configured per network, per recovery domain, or per LSP.

o Configuration of OAM may involve the selection of MEPs; the enabling of OAM on network segments, spans, and links; and the operation of OAM on LSPs, concatenated LSP segments, and LSP segments.

o Manual commands may be used to control recovery functions, including forcing recovery and locking recovery actions (a sketch of typical command precedence follows this list).

See also the considerations regarding security for management and OAM in Section 9 of this document.
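As an illustration of how manual commands and fault conditions typically interact, the sketch below arbitrates requests by precedence in the general style of [G.808.1] linear protection, with lockout of protection overriding forced switch, signal fail, and manual switch. The enumeration and its ordering are assumptions of the sketch, not a reproduction of any specified state machine.

<CODE BEGINS>
// Hedged sketch of precedence arbitration between operator commands
// and fault conditions in a protection domain. Names and ordering
// are illustrative, in the general style of [G.808.1].
package main

import "fmt"

type request int

const (
        noRequest request = iota
        manualSwitch
        signalFail
        forcedSwitch
        lockoutOfProtection // highest precedence: switching disallowed
)

// highestPriority returns the request that wins arbitration; larger
// enumeration values take precedence in this sketch.
func highestPriority(reqs []request) request {
        top := noRequest
        for _, r := range reqs {
                if r > top {
                        top = r
                }
        }
        return top
}

func main() {
        // A signal fail arrives while the operator has locked out the
        // protection path: the lockout wins and no switch-over occurs.
        switch highestPriority([]request{signalFail, lockoutOfProtection}) {
        case lockoutOfProtection:
                fmt.Println("protection locked out: traffic stays on working path")
        case forcedSwitch, signalFail:
                fmt.Println("switch traffic to the protection path")
        case manualSwitch:
                fmt.Println("switch only if the protection path is fault free")
        }
}
<CODE ENDS>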
9. Security Considerations

This framework does not introduce any new security considerations; general issues relating to MPLS security can be found in [MPLS-SEC]. However, several points about MPLS-TP survivability should be noted here.

o If an attacker is able to force a protection switch-over, this may result in a small perturbation to user traffic, and could result in extra traffic being preempted or displaced from the protection resources. In the case of 1:n protection or shared mesh protection, this may result in other traffic becoming unprotected. Therefore, it is important that OAM protocols for detecting or notifying faults use adequate security to prevent them from being used (through the insertion of bogus messages, or through the capture of legitimate messages) to falsely trigger a recovery event.

o If manual commands are modified, captured, or simulated (including replay), it might be possible for an attacker to perform forced recovery actions or to impose lock-out. These actions could impact the capability to provide the recovery function, and could also affect the normal operation of the network for other traffic. Therefore, management protocols used to perform manual commands must allow the operator to use appropriate security mechanisms. This includes verification that the user who performs the commands has appropriate authorization.

o If the control plane is used to configure or operate recovery mechanisms, the control-plane protocols must also be capable of providing adequate security.

10. IANA Considerations

This informational document makes no requests for IANA action.

11. Acknowledgments

Thanks for useful comments and discussions to: Italo Busi, David McWalter, Lou Berger, Yaacov Weingarten, Stewart Bryant, Dan Frost, Lieven Levrau, Xuehui Dai, Liu Guoman, Xiao Min, Daniele Ceccarelli, Scott Bradner, Francesco Fondelli, Curtis Villamizar, Maarten Vissers, and Greg Mirsky.

The Editors would like to thank the participants in ITU-T Study Group 15 for their detailed review.

Some figures and text on shared-mesh protection were borrowed from [MPLS-TP-MESH] with thanks to Tae-sik Cheung and Jeong-dong Ryoo.

12. References

12.1. Normative References

[RFC2205] Braden, R., Ed., Zhang, L., Berson, S., Herzog, S., and S. Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification", RFC 2205, September 1997.

[RFC3209] Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., and G. Swallow, "RSVP-TE: Extensions to RSVP for LSP Tunnels", RFC 3209, December 2001.

[RFC3471] Berger, L., Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Signaling Functional Description", RFC 3471, January 2003.

[RFC3473] Berger, L., Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Signaling Resource ReserVation Protocol-Traffic Engineering (RSVP-TE) Extensions", RFC 3473, January 2003.

[RFC3945] Mannie, E., Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Architecture", RFC 3945, October 2004.

[RFC4203] Kompella, K., Ed. and Y. Rekhter, Ed., "OSPF Extensions in Support of Generalized Multi-Protocol Label Switching (GMPLS)", RFC 4203, October 2005.

[RFC4204] Lang, J., Ed., "Link Management Protocol (LMP)", RFC 4204, October 2005.

[RFC4427] Mannie, E., Ed. and D. Papadimitriou, Ed., "Recovery (Protection and Restoration) Terminology for Generalized Multi-Protocol Label Switching (GMPLS)", RFC 4427, March 2006.

[RFC4428] Papadimitriou, D., Ed. and E. Mannie, Ed., "Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based Recovery Mechanisms (including Protection and Restoration)", RFC 4428, March 2006.

[RFC4873] Berger, L., Bryskin, I., Papadimitriou, D., and A. Farrel, "GMPLS Segment Recovery", RFC 4873, May 2007.

[RFC5307] Kompella, K., Ed. and Y. Rekhter, Ed., "IS-IS Extensions in Support of Generalized Multi-Protocol Label Switching (GMPLS)", RFC 5307, October 2008.

[RFC5317] Bryant, S. and L. Andersson, "Joint Working Team (JWT) Report on MPLS Architectural Considerations for a Transport Profile", RFC 5317, February 2009.

[RFC5654] Niven-Jenkins, B., Ed., Brungard, D., Ed., Betts, M., Ed., Sprecher, N., and S. Ueno, "Requirements of an MPLS Transport Profile", RFC 5654, September 2009.

[RFC5586] Bocci, M., Ed., Vigoureux, M., Ed., and S. Bryant, Ed., "MPLS Generic Associated Channel", RFC 5586, June 2009.

[G.806] ITU-T, "Characteristics of transport equipment - Description methodology and generic functionality", Recommendation G.806, January 2009.

[G.808.1] ITU-T, "Generic Protection Switching - Linear trail and subnetwork protection", Recommendation G.808.1, December 2003.

[G.841] ITU-T, "Types and Characteristics of SDH Network Protection Architectures", Recommendation G.841, October 1998.

[MPLS-TP-FWK] Bocci, M., Bryant, S., Frost, D., Levrau, L., and L. Berger, "A Framework for MPLS in Transport Networks", draft-ietf-mpls-tp-framework, Work in Progress.

[MPLS-TP-NM-Framework] Mansfield, S., Gray, E., and K. Lam, "MPLS-TP Network Management Framework", draft-ietf-mpls-tp-nm-framework, Work in Progress.

[MPLS-TP-OAM-Framework] Busi, I., Ed. and B. Niven-Jenkins, Ed., "Operations, Administration and Maintenance Framework for MPLS-based Transport Networks", draft-ietf-mpls-tp-oam-framework, Work in Progress.

12.2. Informative References

[RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.

[RFC3386] Lai, W. and D. McDysan, "Network Hierarchy and Multilayer Survivability", RFC 3386, November 2002.

[RFC3469] Sharma, V. and F. Hellstrand, "Framework for Multi-Protocol Label Switching (MPLS)-based Recovery", RFC 3469, February 2003.

[RFC4397] Bryskin, I. and A. Farrel, "A Lexicography for the Interpretation of Generalized Multiprotocol Label Switching (GMPLS) Terminology within the Context of the ITU-T's Automatically Switched Optical Network (ASON) Architecture", RFC 4397, February 2006.
[RFC4426] Lang, J., Ed., Rajagopalan, B., Ed., and D. Papadimitriou, Ed., "Generalized Multi-Protocol Label Switching (GMPLS) Recovery Functional Specification", RFC 4426, March 2006.

[RFC4726] Farrel, A., Vasseur, J.-P., and A. Ayyangar, "A Framework for Inter-Domain Multiprotocol Label Switching Traffic Engineering", RFC 4726, November 2006.

[RFC4783] Berger, L., Ed., "GMPLS - Communication of Alarm Information", RFC 4783, December 2006.

[RFC4872] Lang, J., Ed., Rekhter, Y., Ed., and D. Papadimitriou, Ed., "RSVP-TE Extensions in Support of End-to-End Generalized Multi-Protocol Label Switching (GMPLS) Recovery", RFC 4872, May 2007.

[RFC4874] Lee, CY., Farrel, A., and S. De Cnodder, "Exclude Routes - Extension to Resource ReserVation Protocol-Traffic Engineering (RSVP-TE)", RFC 4874, April 2007.

[RFC5212] Shiomoto, K., Papadimitriou, D., Le Roux, JL., Vigoureux, M., and D. Brungard, "Requirements for GMPLS-Based Multi-Region and Multi-Layer Networks (MRN/MLN)", RFC 5212, July 2008.

[RFC5298] Takeda, T., Farrel, A., Ikejiri, Y., and JP. Vasseur, "Analysis of Inter-Domain Label Switched Path (LSP) Recovery", RFC 5298, August 2008.

[RFC5817] Ali, Z., Vasseur, JP., Zamfir, A., and J. Newton, "Graceful Shutdown in MPLS and Generalized MPLS Traffic Engineering Networks", RFC 5817, April 2010.

[GMPLS-OAM] Takacs, A., Fedyk, D., and H. Jia, "OAM Configuration Framework and Requirements for GMPLS RSVP-TE", draft-ietf-ccamp-oam-configuration-fwk, Work in Progress.

[MPLS-SEC] Fang, L., Ed., "Security Framework for MPLS and GMPLS Networks", draft-ietf-mpls-mpls-and-gmpls-security-framework, Work in Progress.

[MPLS-TP-CP-Framework] Andersson, L., Berger, L., Fang, L., and N. Bitar, "MPLS-TP Control Plane Framework", draft-ietf-ccamp-mpls-tp-cp-framework, Work in Progress.

[MPLS-TP-Linear-Protection] Weingarten, Y., Bryant, S., Ed., Sprecher, N., Ed., Van Helvoort, H., Ed., and A. Fulignoli, "MPLS-TP Linear Protection", draft-ietf-mpls-tp-linear-protection, Work in Progress.

[MPLS-TP-MESH] Cheung, T. and J. Ryoo, "MPLS-TP Mesh Protection", draft-cheung-mpls-tp-mesh-protection, Work in Progress.

[OAM-SOUP] Andersson, L., Betts, M., Van Helvoort, H., Bonica, R., and D. Romascanu, "The OAM Acronym Soup", draft-ietf-opsawg-mpls-tp-oam-def, Work in Progress.

[ROSETTA] Van Helvoort, H., Ed., Andersson, L., and N. Sprecher, "A Thesaurus for the Terminology used in Multiprotocol Label Switching Transport Profile (MPLS-TP) drafts/RFCs and ITU-T's Transport Network Recommendations", draft-ietf-mpls-tp-rosetta-stone, Work in Progress.

Authors' Addresses

Nurit Sprecher
Nokia Siemens Networks
3 Hanagar St. Neve Ne'eman B
Hod Hasharon, 45241
Israel

Email: nurit.sprecher@nsn.com

Adrian Farrel
Old Dog Consulting

Email: adrian@olddog.co.uk