MPLS Working Group                       Vishal Sharma (Metanoia, Inc.)
Informational Track                 Fiffi Hellstrand (Nortel Networks)
Expires: March 2003                                           (Editors)

                                                         September 2002

                  Framework for MPLS-based Recovery
               draft-ietf-mpls-recovery-frmwrk-07.txt

Status of this memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

Multi-protocol label switching (MPLS) integrates the label swapping forwarding paradigm with network layer routing. To deliver reliable service, MPLS requires a set of procedures to provide protection of the traffic carried on different paths. This requires that the label switched routers (LSRs) support fault detection, fault notification, and fault recovery mechanisms, and that MPLS signaling support the configuration of recovery. With these objectives in mind, this document specifies a framework for MPLS-based recovery.

Table of Contents

   1. Introduction
   1.1. Background
   1.2. Motivation for MPLS-Based Recovery
   1.3. Objectives/Goals
   2. Contributing Authors
   3. Overview
   3.1. Recovery Models
   3.1.1 Rerouting
   3.1.2 Protection Switching
   3.2. The Recovery Cycles
   3.2.1 MPLS Recovery Cycle Model
   3.2.2 MPLS Reversion Cycle Model
   3.2.3 Dynamic Re-routing Cycle Model
   3.3. Definitions and Terminology
   3.3.1 General Recovery Terminology
   3.3.2 Failure Terminology
   3.4. Abbreviations
   4. MPLS-based Recovery Principles
   4.1. Configuration of Recovery
   4.2. Initiation of Path Setup
   4.3. Initiation of Resource Allocation
   4.4. Scope of Recovery
   4.4.1 Topology
   4.4.1.1 Local Repair
   4.4.1.2 Global Repair
   4.4.1.3 Alternate Egress Repair
   4.4.1.4 Multi-Layer Repair
   4.4.1.5 Concatenated Protection Domains
   4.4.2 Path Mapping
   4.4.3 Bypass Tunnels
   4.4.4 Recovery Granularity
   4.4.4.1 Selective Traffic Recovery
   4.4.4.2 Bundling
   4.4.5 Recovery Path Resource Use
   4.5. Fault Detection
   4.6. Fault Notification
   4.7. Switch-Over Operation
   4.7.1 Recovery Trigger
   4.7.2 Recovery Action
   4.8. Post Recovery Operation
   4.8.1 Fixed Protection Counterparts
   4.8.1.1 Revertive Mode
   4.8.1.2 Non-revertive Mode
   4.8.2 Dynamic Protection Counterparts
   4.8.3 Restoration and Notification
   4.8.4 Reverting to Preferred Path (or Controlled Rearrangement)
   4.9. Performance
   5. MPLS Recovery Features
   6. Comparison Criteria
   7. Security Considerations
   8. Intellectual Property Considerations
   9. Acknowledgements
   10. Editors' Addresses
   11. References

1. Introduction

This memo describes a framework for MPLS-based recovery. We provide a detailed taxonomy of recovery terminology, and discuss the motivation for, the objectives of, and the requirements for MPLS-based recovery. We outline principles for MPLS-based recovery, and also provide comparison criteria that may serve as a basis for comparing and evaluating different recovery schemes.

At points in the document, we provide some thoughts about the operation or viability of certain recovery objectives. These should be viewed as the opinions of the authors, and not the consolidated views of the IETF.

1.1. Background

Network routing deployed today is focused primarily on connectivity, and typically supports only one class of service, the best effort class. Multi-protocol label switching [1], on the other hand, by integrating forwarding based on label-swapping of a link local label with network layer routing, allows flexibility in the delivery of new routing services.
MPLS allows the use of media-specific forwarding mechanisms such as label swapping. This enables some sophisticated features such as quality-of-service (QoS) and traffic engineering [2] to be implemented more effectively. An important component of providing QoS, however, is the ability to transport data reliably and efficiently. Although the current routing algorithms are robust and survivable, the amount of time they take to recover from a fault can be significant, on the order of several seconds or minutes, causing disruption of service for some applications in the interim. This is unacceptable in situations where the aim is to provide a highly reliable service, with recovery times on the order of seconds down to 10's of milliseconds. Examples of such applications are Virtual Leased Line services, Stock Exchange data services, voice traffic, and video services; i.e., any application for which a disruption in service due to a failure is long enough that service agreements cannot be fulfilled or the required level of quality cannot be guaranteed.

MPLS recovery may be motivated by the notion that there are limitations to improving the recovery times of current routing algorithms. Additional improvement can be obtained by augmenting these algorithms with MPLS recovery mechanisms [3]. Since MPLS is a possible technology of choice in future IP-based transport networks, it is useful that MPLS be able to provide protection and restoration of traffic. MPLS may facilitate the convergence of network functionality on a common control and management plane. Further, a protection priority could be used as a differentiating mechanism for premium services that require high reliability, such as Virtual Leased Line services and high priority voice and video traffic.

The remainder of this document provides a framework for MPLS-based recovery. It is focused at a conceptual level and is meant to address motivation, objectives, and requirements. Issues of mechanism, policy, routing plans, and characteristics of traffic carried by recovery paths are beyond the scope of this document.

1.2. Motivation for MPLS-Based Recovery

MPLS-based protection of traffic (called MPLS-based Recovery) is useful for a number of reasons. The most important is its ability to increase network reliability by enabling a faster response to faults than is possible with traditional Layer 3 (or IP layer) approaches alone, while still providing the visibility of the network afforded by Layer 3. Furthermore, a protection mechanism using MPLS could enable IP traffic to be put directly over WDM optical channels and provide a recovery option without an intervening SONET layer. This would facilitate the construction of IP-over-WDM networks that require fast recovery capabilities.

The need for MPLS-based recovery arises because of the following:

I. Layer 3 or IP rerouting may be too slow for a core MPLS network that needs to support recovery times that are smaller than the convergence times of IP routing protocols.

II. Layer 0 (for example, optical layer) or Layer 1 (for example, SONET) mechanisms may make wasteful use of resources.

III. The granularity at which the lower layers may be able to protect traffic may be too coarse for traffic that is switched using MPLS-based mechanisms.
IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher layer operations. Thus, while they may provide, for example, link protection, they cannot easily provide node protection or protection of traffic transported at Layer 3. Further, this may prevent the lower layers from providing restoration based on the traffic's needs: for example, fast restoration for traffic that needs it, and slower restoration (with possibly more optimal use of resources) for traffic that does not require fast restoration. In networks where the latter class of traffic is dominant, providing fast restoration to all classes of traffic may not be cost effective from a service provider's perspective.

V. MPLS has desirable attributes when applied to the purpose of recovery for connectionless networks. Specifically, an LSP is source routed, and a forwarding path for recovery can be "pinned" so that it is not affected by transient instability in SPF routing brought on by failure scenarios.

VI. Establishing interoperability of protection mechanisms between routers/LSRs from different vendors in IP or MPLS networks is desired to enable recovery mechanisms to work in a multivendor environment, and to enable the transition of certain protected services to an MPLS core.

1.3. Objectives/Goals

The following are some important goals for MPLS-based recovery.

Ia. MPLS-based recovery mechanisms may be subject to the traffic engineering goal of optimal use of resources.

Ib. MPLS-based recovery mechanisms should aim to facilitate restoration times that are sufficiently fast for the end-user application, that is, times that better match the end-user application's requirements. In some cases, this may be as short as 10s of milliseconds.

We observe that Ia and Ib are conflicting objectives, and a trade-off exists between them. The optimal choice depends on the end-user application's sensitivity to restoration time and the cost impact of introducing restoration in the network, as well as the end-user application's sensitivity to cost.

II. MPLS-based recovery should aim to maximize network reliability and availability. MPLS-based recovery of traffic should aim to minimize the number of single points of failure in the MPLS protected domain.

III. MPLS-based recovery should aim to enhance the reliability of the protected traffic while minimally or predictably degrading the traffic carried by the diverted resources.

IV. MPLS-based recovery techniques should aim to be applicable for protection of traffic at various granularities. For example, it should be possible to specify MPLS-based recovery for a portion of the traffic on an individual path, for all traffic on an individual path, or for all traffic on a group of paths. Note that a path is used as a general term and includes the notion of a link, IP route, or LSP.

V. MPLS-based recovery techniques may be applicable for an entire end-to-end path or for segments of an end-to-end path.

VI. MPLS-based recovery mechanisms should aim to take into consideration the recovery actions of lower layers. MPLS-based mechanisms should not trigger lower layer protection switching.

VII. MPLS-based recovery mechanisms should aim to minimize the loss of data and packet reordering during recovery operations.
(The current MPLS specification itself has no explicit requirement on reordering.)

VIII. MPLS-based recovery mechanisms should aim to minimize the state overhead incurred for each recovery path maintained.

IX. MPLS-based recovery mechanisms should aim to preserve the constraints on traffic after switchover, if desired. That is, if desired, the recovery path should meet the resource requirements of, and achieve the same performance characteristics as, the working path.

We observe that some of the above are conflicting goals, and real deployment will often involve engineering compromises based on a variety of factors such as cost, end-user application requirements, network efficiency, and revenue considerations. Thus, these goals are subject to trade-offs based on the above considerations.

2. Contributing Authors

This document was the collective work of several individuals over a period of two and a half years. The text and content of this document were contributed by the editors and the co-authors listed below. (The contact information for the editors appears in Section 10, and is not repeated below.)

   Ben Mack-Crane
   Tellabs Operations, Inc.
   4951 Indiana Avenue
   Lisle, IL 60532
   Phone: (630) 512-7255
   Ben.Mack-Crane@tellabs.com

   Srinivas Makam
   Eshernet, Inc.
   1712 Ada Ct.
   Naperville, IL 60540
   Phone: (630) 308-3213
   Smakam60540@yahoo.com

   Ken Owens
   Erlang Technology, Inc.
   345 Marshall Ave., Suite 300
   St. Louis, MO 63119
   Phone: (314) 918-1579
   keno@erlangtech.com

   Changcheng Huang
   Carleton University
   Minto Center, Rm. 3082
   1125 Colonel By Drive
   Ottawa, Ont. K1S 5B6 Canada
   Phone: (613) 520-2600 x2477
   Changcheng.Huang@sce.carleton.ca

   Jon Weil
   Nortel Networks
   Harlow Laboratories
   London Road
   Harlow, Essex CM17 9NA, UK
   Phone: +44 (0)1279 403935
   jonweil@nortelnetworks.com

   Brad Cain
   Storigen Systems
   650 Suffolk Street
   Lowell, MA 01854
   Phone: (978) 323-4454
   bcain@storigen.com

   Loa Andersson
   Utfors AB
   Rasundavagen 12, Box 525
   169 29 Solna, Sweden
   Phone: +46 8 5270 5038
   loa.andersson@utfors.se

   Bilel Jamoussi
   Nortel Networks
   3 Federal Street, BL3-03
   Billerica, MA 01821, USA
   Phone: (978) 288-4506
   jamoussi@nortelnetworks.com

   Angela Chiu
   Celion Networks, Inc.
   One Shiela Drive, Suite 2
   Tinton Falls, NJ 07724
   Phone: (732) 345-3441
   angela.chiu@celion.com

   Seyhan Civanlar
   Lemur Networks, Inc.
   135 West 20th Street, 5th Floor
   New York, NY 10011
   Phone: (212) 367-7676
   scivanlar@lemurnetworks.com

3. Overview

There are several options for providing protection of traffic. The most generic requirement is the specification of whether recovery should be via Layer 3 (or IP) rerouting or via MPLS protection switching or rerouting actions.

Generally, network operators aim to provide the fastest and the best protection mechanism that can be provided at a reasonable cost. The higher the level of protection, the more resources are consumed. Therefore, it is expected that network operators will offer a spectrum of service levels. MPLS-based recovery should give the flexibility to select the recovery mechanism, to choose the granularity at which traffic is protected, and to choose the specific types of traffic that are protected, in order to give operators more control over that trade-off.
With MPLS-based recovery, it can be possible to provide different levels of protection for different classes of service, based on their service requirements. For example, using approaches outlined below, a Virtual Leased Line (VLL) service or real-time applications like Voice over IP (VoIP) may be supported using link/node protection together with pre-established, pre-reserved path protection. Best effort traffic, on the other hand, may use path protection that is established on demand, or may simply rely on IP re-route or higher layer recovery mechanisms. As another example of their range of application, MPLS-based recovery strategies may be used to protect traffic not originally flowing on label switched paths, such as IP traffic that is normally routed hop-by-hop, as well as traffic forwarded on label switched paths.

3.1. Recovery Models

There are two basic models for path recovery: rerouting and protection switching.

Protection switching and rerouting, as defined below, may be used together. For example, protection switching to a recovery path may be used for rapid restoration of connectivity, while rerouting determines a new optimal network configuration, rearranging paths, as needed, at a later time.

3.1.1 Rerouting

Recovery by rerouting is defined as establishing new paths or path segments on demand for restoring traffic after the occurrence of a fault. The new paths may be based upon fault information, network routing policies, pre-defined configurations, and network topology information. Thus, upon detecting a fault, paths or path segments to bypass the fault are established using signaling.

Once the network routing algorithms have converged after a fault, it may be preferable, in some cases, to reoptimize the network by performing a reroute based on the current state of the network and network policies. This is discussed further in Section 4.8.

In terms of the principles defined in Section 4, reroute recovery employs paths established-on-demand with resources reserved-on-demand.

3.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery path or path segment, based upon network routing policies, the restoration requirements of the traffic on the working path, and administrative considerations. The recovery path may or may not be link and node disjoint with the working path. However, if the recovery path shares sources of failure with the working path, the overall reliability of the construct is degraded. When a fault is detected, the protected traffic is switched over to the recovery path(s) and restored.

In terms of the principles defined in Section 4, protection switching employs pre-established recovery paths and, if resource reservation is required on the recovery path, pre-reserved resources. The various sub-types of protection switching are detailed in Section 4.4 of this document.

3.2. The Recovery Cycles

There are three defined recovery cycles: the MPLS Recovery Cycle, the MPLS Reversion Cycle, and the Dynamic Re-routing Cycle. The first cycle detects a fault and restores traffic onto MPLS-based recovery paths. If the recovery path is non-optimal, the cycle may be followed by either of the two latter cycles to achieve an optimized network again.
The reversion cycle applies to explicitly routed traffic that does not rely on any dynamic routing protocols to be converged. The dynamic re-routing cycle applies to traffic that is forwarded based on hop-by-hop routing.

3.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1. Definitions and a key to abbreviations follow.

   --Network Impairment
   |    --Fault Detected
   |    |    --Start of Notification
   |    |    |    --Start of Recovery Operation
   |    |    |    |    --Recovery Operation Complete
   |    |    |    |    |    --Path Traffic Restored
   |    |    |    |    |    |
   |    |    |    |    |    |
   v    v    v    v    v    v
   ----------------------------------------------------------------
   | T1 | T2 | T3 | T4 | T5 |

              Figure 1. MPLS Recovery Cycle Model

The various timing measures used in the model are described below.

   T1  Fault Detection Time
   T2  Hold-off Time
   T3  Notification Time
   T4  Recovery Operation Time
   T5  Traffic Restoration Time

Definitions of the recovery cycle times are as follows:

Fault Detection Time

The time between the occurrence of a network impairment and the moment the fault is detected by MPLS-based recovery mechanisms. This time may be highly dependent on lower layer protocols.

Hold-Off Time

The configured waiting time between the detection of a fault and taking MPLS-based recovery action, to allow time for lower layer protection to take effect. The Hold-off Time may be zero.

Note: The Hold-Off Time may occur after the Notification Time interval if the node responsible for the switchover, the Path Switch LSR (PSL), rather than the detecting LSR, is configured to wait.

Notification Time

The time between initiation of a fault indication signal (FIS) by the LSR detecting the fault and the time at which the Path Switch LSR (PSL) begins the recovery operation. This is zero if the PSL detects the fault itself or infers a fault from such events as an adjacency failure.

Note: If the PSL detects the fault itself, there still may be a Hold-Off Time period between detection and the start of the recovery operation.

Recovery Operation Time

The time between the first and last recovery actions. This may include message exchanges between the PSL and PML to coordinate recovery actions.

Traffic Restoration Time

The time between the last recovery action and the time that the traffic (if present) is completely recovered. This interval is intended to account for the time required for traffic to once again arrive at the point in the network that experienced disrupted or degraded service due to the occurrence of the fault (e.g., the PML). This time may depend on the location of the fault, the recovery mechanism, and the propagation delay along the recovery path.
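The total time from a network impairment to restored path traffic is simply the sum of the five intervals above, T1 through T5. The fragment below is an illustrative sketch only (Python, with assumed example values; it is not part of the framework) showing this composition.

   # Illustrative sketch only: composing the recovery cycle intervals
   # T1..T5 of Figure 1.  All numbers are assumed example values (in
   # milliseconds), not measurements or requirements.

   def total_recovery_time(t1_detect, t2_holdoff, t3_notify,
                           t4_recovery_op, t5_restore):
       """Time from network impairment to restored path traffic."""
       return (t1_detect + t2_holdoff + t3_notify
               + t4_recovery_op + t5_restore)

   # Example: fast detection, no hold-off, FIS-based notification.
   print(total_recovery_time(t1_detect=10, t2_holdoff=0, t3_notify=15,
                             t4_recovery_op=20, t5_restore=5))   # -> 50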
3.2.2 MPLS Reversion Cycle Model

Protection switching in revertive mode requires the traffic to be switched back to a preferred path when the fault on that path is cleared. The MPLS reversion cycle model is illustrated in Figure 2. Note that the cycle shown below comes after the recovery cycle shown in Figure 1.

   --Network Impairment Repaired
   |    --Fault Cleared
   |    |    --Path Available
   |    |    |    --Start of Reversion Operation
   |    |    |    |    --Reversion Operation Complete
   |    |    |    |    |    --Traffic Restored on Preferred Path
   |    |    |    |    |    |
   |    |    |    |    |    |
   v    v    v    v    v    v
   -----------------------------------------------------------------
   | T7 | T8 | T9 | T10| T11|

              Figure 2. MPLS Reversion Cycle Model

The various timing measures used in the model are described below.

   T7   Fault Clearing Time
   T8   Wait-to-Restore Time
   T9   Notification Time
   T10  Reversion Operation Time
   T11  Traffic Restoration Time

Note that time T6 (not shown above) is the time for which the network impairment is not repaired and traffic is flowing on the recovery path.

Definitions of the reversion cycle times are as follows:

Fault Clearing Time

The time between the repair of a network impairment and the time that MPLS-based mechanisms learn that the fault has been cleared. This time may be highly dependent on lower layer protocols.

Wait-to-Restore Time

The configured waiting time between the clearing of a fault and MPLS-based recovery action(s). Waiting time may be needed to ensure that the path is stable and to avoid flapping in cases where a fault is intermittent. The Wait-to-Restore Time may be zero.

Note: The Wait-to-Restore Time may occur after the Notification Time interval if the PSL is configured to wait.

Notification Time

The time between initiation of a fault recovery signal (FRS) by the LSR clearing the fault and the time at which the Path Switch LSR begins the reversion operation. This is zero if the PSL clears the fault itself.

Note: If the PSL clears the fault itself, there still may be a Wait-to-Restore Time period between fault clearing and the start of the reversion operation.

Reversion Operation Time

The time between the first and last reversion actions. This may include message exchanges between the PSL and PML to coordinate reversion actions.

Traffic Restoration Time

The time between the last reversion action and the time that traffic (if present) is completely restored on the preferred path. This interval is expected to be quite small, since both paths are working and care may be taken to limit the traffic disruption (e.g., using "make before break" techniques and synchronous switch-over).

In practice, the only interesting times in the reversion cycle are the Wait-to-Restore Time and the Traffic Restoration Time (or some other measure of traffic disruption). Given that both paths are available, there is no need for rapid operation, and a well-controlled switch-back with minimal disruption is desirable.

3.2.3 Dynamic Re-routing Cycle Model

Dynamic rerouting aims to bring the IP network to a stable state after a network impairment has occurred. A re-optimized network is achieved after the routing protocols have converged, and the traffic is moved from a recovery path to a (possibly) new working path. The steps involved in this mode are illustrated in Figure 3.
Note that the cycle shown below may be overlaid on the recovery cycle shown in Figure 1 or the reversion cycle shown in Figure 2, or both (in the event that both the recovery cycle and the reversion cycle take place before the routing protocols converge), and after the convergence of the routing protocols it is determined (based on on-line algorithms or off-line traffic engineering tools, network configuration, or a variety of other possible criteria) that there is a better route for the working path.

   --Network Enters a Semi-stable State after an Impairment
   |     --Dynamic Routing Protocols Converge
   |     |     --Initiate Setup of New Working Path between PSL and PML
   |     |     |     --Switchover Operation Complete
   |     |     |     |     --Traffic Moved to New Working Path
   |     |     |     |     |
   |     |     |     |     |
   v     v     v     v     v
   -----------------------------------------------------------------
   | T12 | T13 | T14 | T15 |

              Figure 3. Dynamic Rerouting Cycle Model

The various timing measures used in the model are described below.

   T12  Network Route Convergence Time
   T13  Hold-down Time (optional)
   T14  Switchover Operation Time
   T15  Traffic Restoration Time

Network Route Convergence Time

We define the network route convergence time as the time taken for the network routing protocols to converge and for the network to reach a stable state.

Hold-down Time

We define the hold-down period as a bounded time for which a recovery path must be used. In some scenarios, it may be difficult to determine if the working path is stable. In these cases, a hold-down time may be used to prevent excess flapping of traffic between a working and a recovery path.

Switchover Operation Time

The time between the first and last switchover actions. This may include message exchanges between the PSL and PML to coordinate the switchover actions.

As an example of the recovery cycles, we present the sequence of events that occurs after a network impairment when a protection switch is followed by dynamic rerouting:

I.    A link or path fault occurs.
II.   Signaling (FIS) is initiated for the detected fault.
III.  The FIS arrives at the PSL.
IV.   The PSL initiates a protection switch to a pre-configured recovery path.
V.    The PSL switches over the traffic from the working path to the recovery path.
VI.   The network enters a semi-stable state.
VII.  Dynamic routing protocols converge after the fault, and a new working path is calculated (based, for example, on some of the criteria mentioned earlier in this section).
VIII. A new working path is established between the PSL and the PML (the assumption is that the PSL and PML have not changed).
IX.   Traffic is switched over to the new working path.
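The following fragment is an illustrative sketch (Python; all class and method names are hypothetical, not defined by this framework) of how a PSL might act in the sequence above: a protection switch on receipt of the FIS (steps III-V), followed by a move to the re-optimized working path once it has been established (steps VIII-IX).

   # Illustrative sketch only; hypothetical names.
   class PathSwitchLSR:
       def __init__(self, working_path, recovery_path):
           self.active_path = working_path
           self.recovery_path = recovery_path

       def on_fis(self):
           # Steps III-V: trigger the pre-configured protection switch.
           self.active_path = self.recovery_path

       def on_new_working_path(self, new_working_path):
           # Steps VIII-IX: after routing convergence, traffic is moved
           # onto the newly established working path.
           self.active_path = new_working_path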
3.3. Definitions and Terminology

This document assumes the terminology given in [1], and, in addition, introduces the following new terms.

3.3.1 General Recovery Terminology

Rerouting

A recovery mechanism in which the recovery path or path segments are created dynamically after the detection of a fault on the working path. In other words, a recovery mechanism in which the recovery path is not pre-established.

Protection Switching

A recovery mechanism in which the recovery path or path segments are created prior to the detection of a fault on the working path. In other words, a recovery mechanism in which the recovery path is pre-established.

Working Path

The protected path that carries traffic before the occurrence of a fault. The working path exists between a PSL and PML. The working path can be of different kinds: a hop-by-hop routed path, a trunk, a link, an LSP, or part of a multipoint-to-point LSP.

Synonyms for a working path are primary path and active path.

Recovery Path

The path by which traffic is restored after the occurrence of a fault. In other words, the path on which the traffic is directed by the recovery mechanism. The recovery path is established by MPLS means. The recovery path can either be an equivalent recovery path, ensuring no reduction in quality of service, or a limited recovery path, which does not guarantee the same quality of service (or some other criteria of performance) as the working path. A limited recovery path is not expected to be used for an extended period of time.

Synonyms for a recovery path are: back-up path, alternative path, and protection path.

Protection Counterpart

The "other" path when discussing pre-planned protection switching schemes. The protection counterpart for the working path is the recovery path, and vice-versa.

Path Group (PG)

A logical bundling of multiple working paths, each of which is routed identically between a Path Switch LSR and a Path Merge LSR.

Protected Path Group (PPG)

A path group that requires protection.

Protected Traffic Portion (PTP)

The portion of the traffic on an individual path that requires protection. For example, code points in the EXP bits of the shim header may identify a protected portion.

Path Switch LSR (PSL)

An LSR that is responsible for switching or replicating the traffic between the working path and the recovery path.

Path Merge LSR (PML)

An LSR that is responsible for receiving the recovery path traffic and either merging the traffic back onto the working path or, if it is itself the destination, passing the traffic on to the higher layer protocols.

Point of Repair (POR)

An LSR that is set up to perform MPLS recovery. In other words, an LSR that is responsible for effecting the repair of an LSP. The POR, for example, can be a PSL or a PML, depending on the type of recovery scheme employed.

Intermediate LSR

An LSR on a working or recovery path that is neither a PSL nor a PML for that path.

Bypass Tunnel

A path that serves to back up a set of working paths using the label stacking approach [1]. The working paths and the bypass tunnel must all share the same Path Switch LSR (PSL) and Path Merge LSR (PML).

Switch-Over

The process of switching the traffic from the path that the traffic is flowing on onto one or more alternate path(s). This may involve moving traffic from a working path onto one or more recovery paths, or moving traffic from a recovery path(s) onto a more optimal working path(s).

Switch-Back

The process of returning the traffic from one or more recovery paths back to the working path(s).

Revertive Mode

A recovery mode in which traffic is automatically switched back from the recovery path to the original working path upon the restoration of the working path to a fault-free condition. This assumes that a failed working path does not automatically surrender resources to the network.

Non-revertive Mode

A recovery mode in which traffic is not automatically switched back to the original working path after this path is restored to a fault-free condition.
(Depending on the configuration, the original working path may, upon moving to a fault-free condition, become the recovery path, or it may be used for new working traffic and no longer be associated with its original recovery path.)

MPLS Protection Domain

The set of LSRs over which a working path and its corresponding recovery path are routed.

MPLS Protection Plan

The set of all LSP protection paths and the mapping from working to protection paths deployed in an MPLS protection domain at a given time.

Liveness Message

A message exchanged periodically between two adjacent LSRs that serves as a link probing mechanism. It provides an integrity check of the forward and the backward directions of the link between the two LSRs, as well as a check of neighbor aliveness.

Path Continuity Test

A test that verifies the integrity and continuity of a path or path segment. The details of such a test are beyond the scope of this draft. (This could be accomplished, for example, by transmitting a control message along the same links and nodes as the data traffic, or it could similarly be measured by the absence of traffic and by providing feedback.)

3.3.2 Failure Terminology

Path Failure (PF)

Path failure is a fault detected by MPLS-based recovery mechanisms, defined as the failure of the liveness message test or of a path continuity test, indicating that path connectivity is lost.

Path Degraded (PD)

Path degraded is a fault detected by MPLS-based recovery mechanisms that indicates that the quality of the path is unacceptable.

Link Failure (LF)

A lower layer fault indicating that link continuity is lost. This may be communicated to the MPLS-based recovery mechanisms by the lower layer.

Link Degraded (LD)

A lower layer indication to MPLS-based recovery mechanisms that the link is performing below an acceptable level.

Fault Indication Signal (FIS)

A signal that indicates that a fault along a path has occurred. It is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches an LSR that is set up to perform MPLS recovery (the POR). The FIS is transmitted periodically by the node/nodes closest to the point of failure, for some configurable length of time.

Fault Recovery Signal (FRS)

A signal that indicates that a fault along a working path has been repaired. Again, like the FIS, it is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches the LSR that performs recovery of the original path. The FRS is transmitted periodically by the node/nodes closest to the point of failure, for some configurable length of time.

3.4. Abbreviations

   FIS:  Fault Indication Signal.
   FRS:  Fault Recovery Signal.
   LD:   Link Degraded.
   LF:   Link Failure.
   PD:   Path Degraded.
   PF:   Path Failure.
   PG:   Path Group.
   PML:  Path Merge LSR.
   POR:  Point of Repair.
   PPG:  Protected Path Group.
   PSL:  Path Switch LSR.
   PTP:  Protected Traffic Portion.

4. MPLS-based Recovery Principles

MPLS-based recovery refers to the ability to effect quick and complete restoration of traffic affected by a fault in an MPLS-enabled network. The fault may be detected on the IP layer or in lower layers over which IP traffic is transported.
The fastest MPLS recovery is assumed to be achieved with protection switching, with an MPLS LSR switch-over completion time comparable, or equivalent, to the 50 ms switch-over completion time of the SONET layer. This section provides a discussion of the concepts and principles of MPLS-based recovery. The concepts are presented in terms of atomic or primitive terms that may be combined to specify recovery approaches. We do not make any assumptions about the underlying Layer 1 or Layer 2 transport mechanisms or their recovery mechanisms.

4.1. Configuration of Recovery

An LSR may support any or all of the following recovery options:

Default-recovery (No MPLS-based recovery enabled):
Traffic on the working path is recovered only via Layer 3 or IP rerouting or by some lower layer mechanism such as SONET APS. This is equivalent to having no MPLS-based recovery. This option may be used for low priority traffic or for traffic that is recovered in another way (for example, load shared traffic on parallel working paths may be automatically recovered upon a fault along one of the working paths by distributing it among the remaining working paths).

Recoverable (MPLS-based recovery enabled):
This working path is recovered using one or more recovery paths, either via rerouting or via protection switching.

4.2. Initiation of Path Setup

There are three options for the initiation of the recovery path setup. The active and recovery paths may be established by using either RSVP-TE [4][5] or CR-LDP [6].

Pre-established:

This is the same as the protection switching option. Here, a recovery path(s) is established prior to any failure on the working path. The path selection can either be determined by an administrative centralized tool, or chosen based on some algorithm implemented at the PSL and possibly intermediate nodes. To guard against the situation when the pre-established recovery path fails before or at the same time as the working path, the recovery path should have secondary configuration options as explained in Section 3.3 below.

Pre-qualified:

A pre-established path need not be created; it may be pre-qualified. A pre-qualified recovery path is not created expressly for protecting the working path, but instead is a path created for other purposes that is designated as a recovery path after determining that it is an acceptable alternative for carrying the working path traffic. Variants include the case where an optical path or trail is configured, but no switches are set.

Established-on-Demand:

This is the same as the rerouting option. Here, a recovery path is established after a failure on its working path has been detected and notified to the PSL.

4.3. Initiation of Resource Allocation

A recovery path may support the same traffic contract as the working path, or it may not. We will distinguish these two situations by using different additive terms. If the recovery path is capable of replacing the working path without degrading service, it will be called an equivalent recovery path. If the recovery path lacks the resources (or resource reservations) to replace the working path without degrading service, it will be called a limited recovery path. Based on this, there are two options for the initiation of resource allocation:

Pre-reserved:

This option applies only to protection switching. Here, a pre-established recovery path reserves the required resources on all hops along its route during its establishment. Although the reserved resources (e.g., bandwidth and/or buffers) at each node cannot be used to admit more working paths, they are available to be used by all traffic that is present at the node before a failure occurs.

Reserved-on-Demand:

This option may apply either to rerouting or to protection switching. Here, a recovery path reserves the required resources after a failure on the working path has been detected and notified to the PSL, and before the traffic on the working path is switched over to the recovery path.

Note that under both the options above, depending on the amount of resources reserved on the recovery path, it could be either an equivalent recovery path or a limited recovery path.
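As a way of visualizing how these atomic options combine into a recovery approach, the sketch below (Python; illustrative only, all type and field names are hypothetical) models a recovery configuration as a choice among the options of Sections 4.1 through 4.3.

   # Illustrative sketch only; the values mirror the atomic options of
   # Sections 4.1-4.3.  These names are hypothetical, not a defined API.
   from dataclasses import dataclass
   from enum import Enum

   class PathSetup(Enum):
       PRE_ESTABLISHED = "pre-established"              # protection switching
       PRE_QUALIFIED = "pre-qualified"
       ESTABLISHED_ON_DEMAND = "established-on-demand"  # rerouting

   class ResourceAllocation(Enum):
       PRE_RESERVED = "pre-reserved"                    # protection switching only
       RESERVED_ON_DEMAND = "reserved-on-demand"

   @dataclass
   class RecoveryConfiguration:
       mpls_recovery_enabled: bool              # Section 4.1: default vs. recoverable
       path_setup: PathSetup                    # Section 4.2
       resource_allocation: ResourceAllocation  # Section 4.3

   # Example: protection switching with pre-reserved resources.
   cfg = RecoveryConfiguration(True, PathSetup.PRE_ESTABLISHED,
                               ResourceAllocation.PRE_RESERVED)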
902 Based on this, there are two options for the initiation of resource 903 allocation: 905 Pre-reserved: 907 This option applies only to protection switching. Here a pre- 908 established recovery path reserves required resources on all hops 909 along its route during its establishment. Although the reserved 910 resources (e.g., bandwidth and/or buffers) at each node cannot be 911 used to admit more working paths, they are available to be used by 912 all traffic that is present at the node before a failure occurs. 914 Reserved-on-Demand: 916 This option may apply either to rerouting or to protection switching. 917 Here a recovery path reserves the required resources after a failure 918 on the working path has been detected and notified to the PSL and 919 before the traffic on the working path is switched over to the 920 recovery path. 922 Note that under both the options above, depending on the amount of 923 resources reserved on the recovery path, it could either be an 924 equivalent recovery path or a limited recovery path. 926 4.4. Scope of Recovery 927 4.4.1 Topology 929 4.4.1.1 Local Repair 931 The intent of local repair is to protect against a link or neighbor 932 node fault and to minimize the amount of time required for failure 933 propagation. In local repair (also known as local recovery), the node 934 immediately upstream of the fault is the one to initiate recovery 935 (either rerouting or protection switching). Local repair can be of 936 two types: 938 Link Recovery/Restoration 940 In this case, the recovery path may be configured to route around a 941 certain link deemed to be unreliable. If protection switching is 942 used, several recovery paths may be configured for one working path, 943 depending on the specific faulty link that each protects against. 945 Alternatively, if rerouting is used, upon the occurrence of a fault 946 on the specified link, each path is rebuilt such that it detours 947 around the faulty link. 948 In this case, the recovery path need only be disjoint from its 949 working path at a particular link on the working path, and may have 950 overlapping segments with the working path. Traffic on the working 951 path is switched over to an alternate path at the upstream LSR that 952 connects to the failed link. This method is potentially the fastest 953 to perform the switchover, and can be effective in situations where 954 certain path components are much more unreliable than others. 956 Node Recovery/Restoration 958 In this case, the recovery path may be configured to route around a 959 neighbor node deemed to be unreliable. Thus the recovery path is 960 disjoint from the working path only at a particular node and at links 961 associated with the working path at that node. Once again, the 962 traffic on the primary path is switched over to the recovery path at 963 the upstream LSR that directly connects to the failed node, and the 964 recovery path shares overlapping portions with the working path. 966 4.4.1.2 Global Repair 968 The intent of global repair is to protect against any link or node 969 fault on a path or on a segment of a path, with the obvious exception 970 of the faults occurring at the ingress node of the protected path 971 segment. In global repair, the POR is usually distant from the 972 failure and needs to be notified by a FIS. 973 In global repair also, end-to-end path recovery/restoration applies. 974 In many cases, the recovery path can be made completely link and node 975 disjoint with its working path. 
This has the advantage of protecting against all link and node fault(s) on the working path (end-to-end path or path segment). However, it may, in some cases, be slower than local repair, since the fault notification message must now travel to the POR to trigger the recovery action.

4.4.1.3 Alternate Egress Repair

It is possible to restore service without specifically recovering the faulted path. For example, for best effort IP service it is possible to select a recovery path that has a different egress point from the working path (i.e., there is no PML). The recovery path egress must simply be a router that is acceptable for forwarding the FEC carried by the working path (without creating looping). In an engineering context, specific alternative FEC/LSP mappings with alternate egresses can be formed.

This may simplify enhancing the reliability of implicitly constructed MPLS topologies. A PSL may qualify LSP/FEC bindings as candidate recovery paths simply by requiring that they be link and node disjoint with the immediate downstream LSR of the working path.

4.4.1.4 Multi-Layer Repair

Multi-layer repair broadens the network designer's tool set for those cases where multiple network layers can be managed together to achieve overall network goals. Specific criteria for determining when multi-layer repair is appropriate are beyond the scope of this draft.

4.4.1.5 Concatenated Protection Domains

A given service may cross multiple networks, and these may employ different recovery mechanisms. It is possible to concatenate protection domains so that service recovery can be provided end-to-end. It is considered that the recovery mechanisms in different domains may operate autonomously, and that multiple points of attachment may be used between domains (to ensure there is no single point of failure). Alternate egress repair requires management of concatenated domains, in that an explicit MPLS point of failure (the PML) is by definition excluded. Details of concatenated protection domains are beyond the scope of this draft.

4.4.2 Path Mapping

Path mapping refers to the methods of mapping traffic from a faulty working path onto the recovery path. There are several options for this, as described below. Note that the options below should be viewed as atomic terms that only describe how the working and protection paths are mapped to each other. The issues of resource reservation along these paths, and how switchover is actually performed, lead to the more commonly used composite terms, such as 1+1 and 1:1 protection, which were described in Section 2.1.

1-to-1 Protection

In 1-to-1 protection, the working path has a designated recovery path that is only to be used to recover that specific working path.

n-to-1 Protection

In n-to-1 protection, up to n working paths are protected using only one recovery path. If the intent is to protect against any single fault on any of the working paths, the n working paths should be diversely routed between the same PSL and PML. In some cases, handshaking between the PSL and PML may be required to complete the recovery, the details of which are beyond the scope of this draft.

n-to-m Protection

In n-to-m protection, up to n working paths are protected using m recovery paths.
Once again, if the intent is to protect against any single fault on any of the n working paths, the n working paths and the m recovery paths should be diversely routed between the same PSL and PML. In some cases, handshaking between the PSL and PML may be required to complete the recovery, the details of which are beyond the scope of this draft. n-to-m protection is for further study.

Split Path Protection

In split path protection, multiple recovery paths are allowed to carry the traffic of a working path based on a certain configurable load splitting ratio. This is especially useful when no single recovery path can be found that can carry the entire traffic of the working path in case of a fault. Split path protection may require handshaking between the PSL and the PML(s), and may require the PML(s) to correlate the traffic arriving on multiple recovery paths with the working path. Although this is an attractive option, the details of split path protection are beyond the scope of this draft, and are for further study.

4.4.3 Bypass Tunnels

It may be convenient, in some cases, to create a "bypass tunnel" for a PPG between a PSL and PML, thereby allowing multiple recovery paths to be transparent to intervening LSRs [2]. In this case, one LSP (the tunnel) is established between the PSL and PML following an acceptable route, and a number of recovery paths are supported through the tunnel via label stacking. A bypass tunnel can be used with any of the path mapping options discussed in the previous section.

As with recovery paths, the bypass tunnel may or may not have resource reservations sufficient to provide recovery without service degradation. It is possible that the bypass tunnel may have sufficient resources to recover some number of working paths, but not all at the same time. If the number of recovery paths carrying traffic in the tunnel at any given time is restricted, this is similar to the n-to-1 or n-to-m protection cases mentioned in Section 4.4.2.

4.4.4 Recovery Granularity

Another dimension of recovery considers the amount of traffic requiring protection. This may range from a fraction of a path to a bundle of paths.

4.4.4.1 Selective Traffic Recovery

This option allows for the protection of a fraction of the traffic within the same path. The portion of the traffic on an individual path that requires protection is called a protected traffic portion (PTP). A single path may carry different classes of traffic, with different protection requirements. The protected portion of this traffic may be identified by its class, as, for example, via the EXP bits in the MPLS shim header or via the priority bit in the ATM header.

4.4.4.2 Bundling

Bundling is a technique used to group multiple working paths together in order to recover them simultaneously. The logical bundling of multiple working paths requiring protection, each of which is routed identically between a PSL and a PML, is called a protected path group (PPG). When a fault occurs on the working path carrying the PPG, the PPG as a whole can be protected either by being switched to a bypass tunnel or by being switched to a recovery path.
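The fragment below is an illustrative sketch (Python; hypothetical names, not a defined MPLS API) of how the bypass tunnel of Section 4.4.3 and the bundling just described fit together via label stacking: at the PSL, each protected path's label is swapped for the recovery path's inner label and the bypass tunnel label is pushed on top, so intervening LSRs switch only on the outer tunnel label; the PML removes the tunnel label and merges the traffic using the inner label.

   # Illustrative sketch only; hypothetical names.  The label stack is
   # modeled as a list whose last element is the top (outermost) label.

   def psl_switch_to_bypass(label_stack, recovery_label, tunnel_label):
       # Swap the working-path label for the recovery path's inner label
       # (meaningful only to the PML), then push the bypass tunnel label.
       stack = list(label_stack)
       stack[-1] = recovery_label
       stack.append(tunnel_label)
       return stack

   def pml_merge_from_bypass(label_stack, working_label):
       # Pop the tunnel label (unless already removed by penultimate-hop
       # popping) and swap back to the working path's outgoing label.
       stack = list(label_stack)
       stack.pop()
       stack[-1] = working_label
       return stack

   # Example: a packet carrying label 17 on a protected working path.
   print(psl_switch_to_bypass([17], recovery_label=42, tunnel_label=99))
   # -> [42, 99]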
4.4.5 Recovery Path Resource Use

In the case of pre-reserved recovery paths, there is the question of what use these resources may be put to when the recovery path is not in use. There are three options:

Dedicated-resource:
If the recovery path resources are dedicated, they may not be used for anything except carrying the working traffic. For example, in the case of 1+1 protection, the working traffic is always carried on the recovery path. Even if the recovery path is not always carrying the working traffic, it may not be possible or desirable to allow other traffic to use these resources.

Extra-traffic-allowed:
If the recovery path only carries the working traffic when the working path fails, then it is possible to allow extra traffic to use the reserved resources at other times. Extra traffic is, by definition, traffic that can be displaced (without violating service agreements) whenever the recovery path resources are needed for carrying the working path traffic.

Shared-resource:
A shared recovery resource is dedicated for use by multiple primary resources that (according to SRLGs) are not expected to fail simultaneously.

4.5. Fault Detection

MPLS recovery is initiated after the detection of either a lower layer fault or a fault at the IP layer or in the operation of MPLS-based mechanisms. We consider four classes of impairments: Path Failure, Path Degraded, Link Failure, and Link Degraded.

Path Failure (PF) is a fault that indicates to an MPLS-based recovery scheme that the connectivity of the path is lost. This may be detected by a path continuity test between the PSL and PML. Some, and perhaps the most common, path failures may be detected using a link probing mechanism between neighbor LSRs. An example of a probing mechanism is a liveness message that is exchanged periodically along the working path between peer LSRs [3]. For either a link probing mechanism or a path continuity test to be effective, the test message must be guaranteed to follow the same route as the working or recovery path, over the segment being tested. In addition, the path continuity test must take the path merge points into consideration. In the case of a bi-directional link implemented as two unidirectional links, path failure could mean that either one or both unidirectional links are damaged.

Path Degraded (PD) is a fault that indicates to MPLS-based recovery schemes/mechanisms that the path has connectivity, but that the quality of the connection is unacceptable. This may be detected by a path performance monitoring mechanism, or some other mechanism for determining the error rate on the path or some portion of the path. One such mechanism, local to the LSR, is the detection of excessive discarding of packets at an interface, for example due to label mismatch or due to TTL errors.

Link Failure (LF) is an indication from a lower layer that the link over which the path is carried has failed. If the lower layer supports detection and reporting of this fault (that is, any fault that indicates link failure, e.g., SONET LOS), this may be used by the MPLS recovery mechanism. In some cases, using LF indications may provide faster fault detection than using only MPLS-based fault detection mechanisms.

Link Degraded (LD) is an indication from a lower layer that the link over which the path is carried is performing below an acceptable level. If the lower layer supports detection and reporting of this fault, it may be used by the MPLS recovery mechanism. In some cases, using LD indications may provide faster fault detection than using only MPLS-based fault detection mechanisms.
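As an illustration of the liveness-message-based detection of a Path Failure described above, the sketch below (Python; the message period and miss threshold are assumptions, not values defined by this framework) declares PF when no liveness message has been seen from the neighbor for a configured number of intervals.

   # Illustrative sketch only: declaring Path Failure (PF) when liveness
   # messages from a neighbor LSR stop arriving.  Period and threshold
   # are assumed example values.
   import time

   class LivenessMonitor:
       def __init__(self, interval_s=0.1, miss_threshold=3):
           self.interval_s = interval_s          # liveness message period
           self.miss_threshold = miss_threshold  # missed messages before PF
           self.last_seen = time.monotonic()

       def on_liveness_message(self):
           # Called whenever a liveness message arrives from the neighbor.
           self.last_seen = time.monotonic()

       def path_failure(self):
           # PF is declared if no liveness message has been seen for
           # miss_threshold consecutive intervals.
           elapsed = time.monotonic() - self.last_seen
           return elapsed > self.interval_s * self.miss_threshold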
Link Degraded (LD) is an indication from a lower layer that the link
over which the path is carried is performing below an acceptable
level. If the lower layer supports detection and reporting of this
fault, it may be used by the MPLS recovery mechanism. In some cases,
using LD indications may provide faster fault detection than using
only MPLS-based fault detection mechanisms.

4.6. Fault Notification

MPLS-based recovery relies on rapid and reliable notification of
faults. Once a fault is detected, the node that detected the fault
must determine whether the fault is severe enough to require path
recovery. If the node is not capable of initiating direct action
(e.g., acting as a point of repair, POR), the node should send out a
notification of the fault by transmitting a FIS to the POR. This can
take several forms:

(i) control plane messaging: relayed hop-by-hop along the path of the
failed LSP until a POR is reached.
(ii) user plane messaging: sent to the PML, which may take corrective
action (as a POR for 1+1) or communicate with a POR (for 1:n) by any
of several means:
   - control plane messaging
   - user plane return path (either through a bi-directional LSP or
     via other means)

Since the FIS is a control message, it should be transmitted with
high priority to ensure that it propagates rapidly towards the
affected POR(s). Depending on how fault notification is configured in
the LSRs of an MPLS domain, the FIS could be sent either as a Layer 2
or a Layer 3 packet [3]. The use of a Layer 2-based notification
requires a direct Layer 2 path to the POR. One example of a FIS is
the liveness message sent by a downstream LSR to its upstream
neighbor with an optional fault notification field set; the FIS may
also be conveyed implicitly by a teardown message, or carried in a
separate fault notification packet. An intermediate LSR should
identify on which of its incoming links to propagate the FIS.

4.7. Switch-Over Operation

4.7.1 Recovery Trigger

The activation of an MPLS protection switch following the detection
or notification of a fault requires a trigger mechanism at the PSL.
MPLS protection switching may be initiated due to automatic inputs or
external commands. The automatic activation of an MPLS protection
switch results from a response to defect or fault conditions detected
at the PSL, or to fault notifications received at the PSL. It is
possible that the fault detection and trigger mechanisms are
combined, as is the case when a PF, PD, LF, or LD is detected at a
PSL and triggers a protection switch to the recovery path. In most
cases, however, the detection and trigger mechanisms are distinct,
involving the detection of a fault at some intermediate LSR followed
by the propagation of a fault notification to the POR via the FIS,
which serves as the protection switch trigger at the POR. MPLS
protection switching in response to external commands results when
the operator initiates a protection switch by a command to a POR (or,
alternatively, by a configuration command to an intermediate LSR,
which transmits the FIS towards the POR).
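The trigger logic just described can be summarized in the following
non-normative Python sketch of a PSL's decision to activate a
protection switch. The impairment names mirror Section 4.5; the
loss-ratio threshold used for the soft PD/LD defects (discussed in
the note below) is an illustrative assumption.

   PD_LOSS_THRESHOLD = 0.05   # assumed provisioned loss ratio for PD/LD

   def should_switch(event: str, loss_ratio: float = 0.0,
                     operator_command: bool = False) -> bool:
       """Return True if the PSL/POR should activate a protection switch."""
       if operator_command:         # external command to the POR
           return True
       if event in ("PF", "LF"):    # hard failures trigger directly
           return True
       if event in ("PD", "LD"):    # soft defects require a threshold
           return loss_ratio > PD_LOSS_THRESHOLD
       if event == "FIS":           # fault notification from a remote LSR
           return True
       return False

   assert should_switch("PF")
   assert not should_switch("PD", loss_ratio=0.01)
   assert should_switch("FIS")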
Note that the PF fault applies to hard failures (fiber cuts,
transmitter failures, or LSR fabric failures), as does the LF fault,
with the difference that the LF is a lower layer impairment that may
be communicated to MPLS-based recovery mechanisms. The PD (or LD)
fault, on the other hand, applies to soft defects (excessive errors
due to noise on the link, for instance). The PD (or LD) results in a
fault declaration only when the percentage of lost packets exceeds a
given threshold, which is provisioned and may be set based on the
service level agreement(s) in effect between a service provider and a
customer.

4.7.2 Recovery Action

After a fault is detected or a FIS is received by the POR, the
recovery action involves either a rerouting or a protection switching
operation. In both scenarios, the next hop label forwarding entry for
a recovery path is bound to the working path.

4.8. Post Recovery Operation

When traffic is flowing on the recovery path, a decision can be made
either to let the traffic remain on the recovery path and treat it as
the new working path, or to switch the traffic to the old working
path or to a new working path. This post-recovery operation comes in
two styles: one in which the protection counterparts, i.e., the
working and recovery paths, are fixed or "pinned" to their routes,
and one in which the PSL or another network entity with real-time
knowledge of the failure dynamically performs re-establishment or
controlled rearrangement of the paths comprising the protected
service.

4.8.1 Fixed Protection Counterparts

For fixed protection counterparts, the PSL is pre-configured with the
appropriate behavior to take when the original fixed path is restored
to service. The choices are revertive and non-revertive mode. The
choice will typically depend on the relative costs of the working and
protection paths, and on the tolerance of the service to the effects
of switching paths yet again. These protection modes indicate whether
or not there is a preferred path for the protected traffic.

4.8.1.1 Revertive Mode

If the working path is always the preferred path, this path will be
used whenever it is available. Thus, in the event of a fault on this
path, its unused resources will not be reclaimed by the network. If
the working path has a fault, traffic is switched to the recovery
path. In the revertive mode of operation, when the preferred path is
restored, the traffic is automatically switched back to it (a sketch
of this behavior follows the list below).

There are a number of implications of pinned working and recovery
paths:
- upon failure, once traffic is moved to the recovery path, the
  traffic is unprotected until the path defect in the original
  working path is repaired and that path is restored to service.
- upon failure, once traffic is moved to the recovery path, the
  resources associated with the original path remain reserved.
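A non-normative Python sketch of the revertive behavior and of the
two implications above follows. The class and attribute names are
illustrative assumptions; "working_reserved" models the fact that the
original path's resources are not reclaimed, and "protected" models
the exposure of the traffic while it rides the recovery path.

   class PinnedProtectionDomain:
       """Fixed (pinned) working/recovery counterparts, revertive mode."""

       def __init__(self) -> None:
           self.active = "working"        # path currently carrying traffic
           self.working_reserved = True   # working-path resources reserved
           self.protected = True          # traffic is currently protected

       def on_working_fault(self) -> None:
           self.active = "recovery"
           self.protected = False         # unprotected until working repaired
           # working_reserved stays True: resources are not reclaimed

       def on_working_restored(self) -> None:
           """Revertive mode: switch back when the preferred path returns."""
           self.active = "working"
           self.protected = True

   d = PinnedProtectionDomain()
   d.on_working_fault()
   assert d.active == "recovery" and not d.protected and d.working_reserved
   d.on_working_restored()
   assert d.active == "working" and d.protected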
4.8.1.2 Non-revertive Mode

In the non-revertive mode of operation, there is no preferred path,
or it may be desirable to minimize further disruption of the service
brought on by a revertive switching operation. A switch-back to the
original working path is either not desired or not possible, since
the original path may no longer exist after the occurrence of a fault
on that path. If there is a fault on the working path, traffic is
switched to the recovery path. When or if the faulty path (the
original working path) is restored, it may become the recovery path
(either by configuration or, if desired, by management action).

In the non-revertive mode of operation, the working traffic may or
may not be restored to a new optimal working path or to the original
working path. This is because, in some cases, it may be useful to:
(a) administratively perform a protection switch back to the original
working path after gaining further assurances about the integrity of
the path, (b) continue operation on the recovery path, or (c) move
the traffic to a new optimal working path that is calculated based on
the network topology and network policies.

4.8.2 Dynamic Protection Counterparts

For dynamic protection counterparts, when the traffic is switched
over to a recovery path, the association between the original working
path and the recovery path may no longer exist, since the original
path itself may no longer exist after the fault. Instead, when the
network reaches a stable state following routing convergence, the
recovery path may be switched over to a different preferred path,
either based on optimization over the new network topology and
associated information, or based on pre-configured information.

Dynamic protection counterparts assume that, upon failure, the PSL or
another network entity will establish new working paths if another
switch-over is to be performed.

4.8.3 Restoration and Notification

MPLS restoration deals with returning the working traffic from the
recovery path to the original or a new working path. Reversion is
performed by the PSL either upon receiving notification, via the FRS,
that the working path is repaired, or upon receiving notification
that a new working path has been established.

For fixed counterparts in revertive mode, the LSR that detected the
fault on the working path also detects the restoration of the working
path. If the working path had experienced an LF defect, the LSR
detects a return to normal operation via the receipt of a liveness
message from its peer. If the working path had experienced an LD
defect at an LSR interface, the LSR could detect a return to normal
operation via the resumption of error-free packet reception on that
interface. Alternatively, a lower layer that no longer detects an LF
defect may inform the MPLS-based recovery mechanisms at the LSR that
the link to its peer LSR is operational. The LSR then transmits the
FRS to its upstream LSR(s) that were transmitting traffic on the
working path. When the PSL receives the FRS, it switches the working
traffic back to the original working path.

A similar scheme applies to dynamic counterparts, where, for example,
a topology update and/or network convergence may trigger the
installation or setup of new working paths, and a notification may be
sent to the PSL to perform a switch-over.

We note that if there is a way to transmit fault information back
along a recovery path towards a PSL, and if the recovery path is an
equivalent working path, it is possible for the working path and its
recovery path to exchange roles once the original working path is
repaired following a fault. This is because, in that case, the
recovery path effectively becomes the working path, and the restored
working path functions as a recovery path for the original recovery
path. This is important, since it affords the benefits of
non-revertive operation outlined in Section 4.8.1, without leaving
the recovery path unprotected.
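The reversion sequence of Section 4.8.3 can be sketched,
non-normatively, as follows. The FRS is modeled as a direct function
call; in practice it would be carried by the control plane or the
user plane, and all class and method names are illustrative
assumptions.

   class PathSwitchLSR:
       """PSL of a fixed, revertive protection domain."""

       def __init__(self) -> None:
           self.active_path = "recovery"   # traffic switched after a fault

       def receive_frs(self, repaired_path: str) -> None:
           """On FRS receipt, switch the working traffic back."""
           self.active_path = repaired_path

   class DetectingLSR:
       """The LSR that detected the fault also detects the restoration."""

       def __init__(self, upstream_psl: PathSwitchLSR) -> None:
           self.upstream_psl = upstream_psl

       def on_liveness_resumed(self) -> None:
           # a liveness message from the peer shows the LF defect cleared;
           # transmit the FRS upstream toward the PSL
           self.upstream_psl.receive_frs("working")

   psl = PathSwitchLSR()
   DetectingLSR(psl).on_liveness_resumed()
   assert psl.active_path == "working"   # traffic reverted to original path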
4.8.4 Reverting to Preferred Path (or Controlled Rearrangement)

In the revertive mode, a "make before break" restoration switch-over
can be used, which is less disruptive than performing protection
switching upon the occurrence of network impairments. This will
minimize both packet loss and packet reordering. The controlled
rearrangement of paths can also be used to satisfy traffic
engineering requirements for load balancing across an MPLS domain.

4.9. Performance

Resource/performance requirements for recovery paths should be
specified in terms of the following attributes:

I. Resource class attribute:
Equivalent Recovery Class: The recovery path has the same resource
reservations and performance guarantees as the working path. In other
words, the recovery path meets the same SLAs as the working path.
Limited Recovery Class: The recovery path does not have the same
resource reservations and performance guarantees as the working path.

A. Lower Class: The recovery path has lower resource requirements or
less stringent performance requirements than the working path.

B. Best Effort Class: The recovery path is best effort.

II. Priority Attribute:
The recovery path has a priority attribute just like the working path
(i.e., the priority attribute of the associated traffic trunks). It
can have the same priority as the working path or a lower priority.

III. Preemption Attribute:
The recovery path can have the same preemption attribute as the
working path or a lower one.

5. MPLS Recovery Features

The following features are desirable from an operational point of
view:

I. It is desirable that MPLS recovery provide an option to identify
protected path groups (PPGs) and protected traffic portions (PTPs).

II. Each PSL should be capable of performing MPLS recovery upon the
detection of impairments or upon receipt of notifications of
impairments.

III. An MPLS recovery method should not preclude manual protection
switching commands. This implies that it should be possible, under
administrative command, to transfer traffic from a working path to a
recovery path, or to transfer traffic from a recovery path to a
working path once the working path becomes operational following a
fault.

IV. A PSL may be capable of performing either a switch-back to the
original working path after the fault is corrected, or a switch-over
to a new working path upon the discovery or establishment of a more
optimal working path.

V. The recovery model should take into consideration path merging at
intermediate LSRs. If a fault affects the merged segment, all the
paths sharing that merged segment should be able to recover.
Similarly, if a fault affects a non-merged segment, only the path
that is affected by the fault should be recovered, as illustrated in
the sketch below.
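The following non-normative Python sketch illustrates feature V: it
determines which paths must be recovered when a fault affects a given
segment (link). A fault on a merged segment selects every path
sharing that segment, while a fault on a non-merged segment selects
only the affected path. The topology and the LSP names are
illustrative assumptions.

   # each path is described by the ordered list of links it traverses
   paths = {
       "lsp-1": [("A", "B"), ("B", "C"), ("C", "D")],
       "lsp-2": [("E", "B"), ("B", "C"), ("C", "D")],  # merges with lsp-1
       "lsp-3": [("A", "F"), ("F", "D")],
   }

   def paths_to_recover(faulted_link):
       """Return every path that traverses the faulted link."""
       return [name for name, links in paths.items()
               if faulted_link in links]

   print(paths_to_recover(("B", "C")))   # ['lsp-1', 'lsp-2'] (merged)
   print(paths_to_recover(("A", "F")))   # ['lsp-3'] (non-merged)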
6. Comparison Criteria

Possible criteria to use for the comparison of MPLS-based recovery
schemes are as follows:

Recovery Time

We define recovery time as the time required for a recovery path to
be activated (and traffic flowing) after a fault. Recovery Time is
the sum of the Fault Detection Time, Hold-off Time, Notification
Time, Recovery Operation Time, and Traffic Restoration Time. In other
words, it is the time between the failure of a node or link in the
network and the time at which a recovery path is installed and the
traffic starts flowing on it.

Full Restoration Time

We define full restoration time as the time required for a permanent
restoration. This is the time required for traffic to be routed onto
links that are capable of, or have been engineered to, handle traffic
in recovery scenarios. Note that this time may or may not differ from
the Recovery Time, depending on whether equivalent or limited
recovery paths are used.

Setup Vulnerability

The amount of time that a working path, or a set of working paths, is
left unprotected during such tasks as recovery path computation and
recovery path setup may be used to compare schemes. The nature of
this vulnerability should be taken into account, e.g.: End-to-End
schemes correlate the vulnerability with working paths, Local Repair
schemes have a topological correlation that cuts across working
paths, and Network Plan approaches have a correlation that impacts
the entire network.

Backup Capacity

Recovery schemes may require differing amounts of "backup capacity"
in the event of a fault. This capacity will be dependent on the
traffic characteristics of the network. However, it may also depend
on the particular protection plan selection algorithms, as well as on
the signaling and re-routing methods.

Additive Latency

Recovery schemes may introduce additive latency to traffic. For
example, a recovery path may take many more hops than the working
path. This may depend on the recovery path selection algorithms.

Quality of Protection

Recovery schemes can be considered to encompass a spectrum of "packet
survivability", which may range from "relative" to "absolute".
Relative survivability may mean that the packet is on an equal
footing with other traffic of, for example, the same diff-serv code
point (DSCP) in contending for the resources of the portion of the
network that survives the failure. Absolute survivability may mean
that the survivability of the protected traffic has explicit
guarantees.

Re-ordering

Recovery schemes may introduce re-ordering of packets. The action of
putting traffic back on preferred paths may also cause packet
re-ordering.

State Overhead

As the number of recovery paths in a protection plan grows, the state
required to maintain them also grows. Schemes may require differing
numbers of paths to maintain certain levels of coverage, and the
state required may also depend on the particular scheme used to
recover. In many cases the state overhead will be in proportion to
the number of recovery paths.
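Returning to the Recovery Time criterion above, its additive
decomposition can be made concrete with the following non-normative
Python sketch. The component values are illustrative assumptions
only, not performance targets or measured results.

   # Recovery Time = Fault Detection + Hold-off + Notification
   #                 + Recovery Operation + Traffic Restoration
   components_ms = {
       "fault_detection": 10.0,    # assumed example values, in milliseconds
       "hold_off": 0.0,
       "notification": 5.0,
       "recovery_operation": 2.0,
       "traffic_restoration": 3.0,
   }

   recovery_time_ms = sum(components_ms.values())
   print(f"Recovery Time = {recovery_time_ms} ms")   # 20.0 ms here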
Loss

Recovery schemes may introduce a certain amount of packet loss during
the switch-over to a recovery path. For schemes that introduce loss
during recovery, this loss can be estimated by evaluating recovery
times in proportion to the link speed. In the case of link or node
failure, a certain amount of packet loss is inevitable.

Coverage

Recovery schemes may offer various types of failover coverage. The
total coverage may be defined in terms of several metrics:

I. Fault Types: Recovery schemes may account for only link faults,
for both node and link faults, or also for degraded service. For
example, a scheme may require more recovery paths to take node faults
into account.

II. Number of concurrent faults: Depending on the layout of recovery
paths in the protection plan, it may be possible to recover from
multiple concurrent faults.

III. Number of recovery paths: For a given fault, there may be one or
more recovery paths.

IV. Percentage of coverage: Depending on a scheme and its
implementation, a certain percentage of faults may be covered. This
may be subdivided into the percentage of link faults and the
percentage of node faults covered.

V. The number of protected paths may affect how fast the total set of
paths affected by a fault can be recovered. The ratio of protected
paths is n/N, where n is the number of protected paths and N is the
total number of paths (see the sketch below).
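The loss estimate and the coverage ratio mentioned above can be
computed as in the following non-normative Python sketch; the input
values are illustrative assumptions.

   def protected_ratio(n_protected: int, n_total: int) -> float:
       """Ratio of protected paths, n/N."""
       return n_protected / n_total

   def loss_estimate_bits(recovery_time_s: float,
                          link_speed_bps: float) -> float:
       """Rough upper bound on traffic lost during the switch-over."""
       return recovery_time_s * link_speed_bps

   print(protected_ratio(80, 100))         # 0.8
   print(loss_estimate_bits(0.020, 1e9))   # ~2e7 bits on a 1 Gb/s link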
7. Security Considerations

The MPLS-based recovery framework specified herein does not raise any
security issues that are not already present in the MPLS
architecture.

8. Intellectual Property Considerations

The IETF has been notified of intellectual property rights claimed in
regard to some or all of the specification contained in this
document. For more information, consult the online list of claimed
rights.

9. Acknowledgements

We would like to thank the members of the MPLS WG mailing list for
their suggestions on earlier versions of this draft. In particular,
we thank Bora Akyol, Dave Allan, Dave Danenberg, Shahram Davari, and
Neil Harrison, whose suggestions and comments were very helpful in
revising the document.

The editors would like to give very special thanks to Curtis
Villamizar for his careful and extremely thorough reading of the
document, and for taking the time to provide numerous suggestions,
which were very helpful in the last couple of revisions of the
document.

10. Editors' Addresses

Vishal Sharma
Metanoia, Inc.
1600 Villa Street, Unit 352
Mountain View, CA 94041-1174
Phone: (650) 386-6723
EMail: v.sharma@ieee.org

Fiffi Hellstrand
Nortel Networks
St Eriksgatan 115
PO Box 6701
113 85 Stockholm, Sweden
Phone: +46 8 5088 3687
EMail: Fiffi@nortelnetworks.com

11. References

[1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label
    Switching Architecture", RFC 3031, January 2001.

[2] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus,
    J., "Requirements for Traffic Engineering Over MPLS", RFC 2702,
    September 1999.

[3] Huang, C., Sharma, V., Owens, K., and Makam, V., "Building
    Reliable MPLS Networks Using a Path Protection Mechanism", IEEE
    Commun. Mag., Vol. 40, Issue 3, March 2002, pp. 156-162.

[4] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource
    ReSerVation Protocol (RSVP) -- Version 1 Functional
    Specification", RFC 2205, September 1997.

[5] Awduche, D., et al., "RSVP-TE: Extensions to RSVP for LSP
    Tunnels", RFC 3209, December 2001.

[6] Jamoussi, B., et al., "Constraint-Based LSP Setup using LDP",
    RFC 3212, January 2002.