IETF Draft                                               Srinivas Makam
Multi-Protocol Label Switching                            Vishal Sharma
Expires: January 2001                                          Ken Owens
                                                        Changcheng Huang
                                               Tellabs Operations, Inc.

                                                        Fiffi Hellstrand
                                                                Jon Weil
                                                           Loa Andersson
                                                           Bilel Jamoussi
                                                         Nortel Networks

                                                               Brad Cain
                                                   Mirror Image Internet

                                                         Seyhan Civanlar
                                                         Coreon Networks

                                                             Angela Chiu
                                                               AT&T Labs

                                                               July 2000

                   Framework for MPLS-based Recovery

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts. Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   Multi-protocol label switching (MPLS) [1] integrates the label
   swapping forwarding paradigm with network layer routing. To deliver
   reliable service, MPLS requires a set of procedures to provide
   protection of the traffic carried on different paths.
This requires 53 that the label switched routers (LSRs) support fault detection, 54 fault notification, and fault recovery mechanisms, and that MPLS 55 signaling [2] [3] [4] [5] [6] support the configuration of recovery. 56 With these objectives in mind, this document specifies a framework 57 for MPLS based recovery. 59 Table of Contents Page 61 1.0 Introduction 3 62 1.1 Background 3 63 1.2 Motivations for MPLS-Based Recovery 4 64 1.3 Objectives 5 66 2.0 Overview 6 67 2.1 Recovery Models 6 68 2.2 Recovery Cycles 8 69 2.2.1 MPLS Recovery Cycle Model 8 70 2.2.2 MPLS Reversion Cycle Model 10 71 2.2.3 Dynamic Reroute Cycle Model 11 72 2.3 Terminology 13 73 2.4 Abbreviations 17 75 3.0 MPLS Recovery Principles 17 76 3.1 Configuration of Recovery 17 77 3.2 Initiation of Path Setup 18 78 3.3 Initiation of Resource Allocation 18 79 3.4 Scope of Recovery 19 80 3.4.1 Topology 19 81 3.4.1.1 Local Repair 19 82 3.4.1.2 Global Repair 20 83 3.4.1.3 Alternate Egress Repair 20 84 3.4.1.4 Multi-Layer Repair 21 85 3.4.1.5 Concatenated Protection Domains 21 86 3.4.2 Path Mapping 21 87 3.4.3 Bypass Tunnels 22 88 3.4.4 Recovery Granularity 23 89 3.4.4.1 Selective Traffic Recovery 23 90 3.4.4.2 Bundling 23 91 3.4.5 Recovery Path Resource Use 23 92 3.5 Fault Detection 24 93 3.6 Fault Notification 25 94 3.7 Switch Over Operation 25 95 3.7.1 Recovery Trigger 25 96 3.7.2 Recovery Action 26 97 3.8 Switch Back Operation 26 98 3.8.1 Revertive and Non-revertive Mode 26 99 3.8.2 Restoration and Notification 27 100 3.8.3 Reverting to Preferred LSP 28 101 3.9 Performance 28 102 4.0 Recovery Requirements 28 103 5.0 MPLS Recovery Options 29 104 6.0 Comparison Criteria 30 105 7.0 Security Considerations 32 106 8.0 Intellectual Property Considerations 32 107 9.0 Acknowledgements 32 108 10.0 Author's Addresses 33 109 11.0 References 34 111 1.0 Introduction 113 This memo describes a framework for MPLS-based recovery. We provide 114 a detailed taxonomy of recovery terminology, and discuss the 115 motivation for, the objectives of, and the requirements for MPLS- 116 based recovery. We outline principles for MPLS-based recovery, and 117 also provide comparison criteria that may serve as a basis for 118 comparing and evaluating different recovery schemes. 120 1.1 Background 122 Network routing deployed today is focussed primarily on connectivity 123 and typically supports only one class of service, the best effort 124 class. Multi-protocol label switching, on the other hand, by 125 integrating forwarding based on label-swapping of a link local label 126 with network layer routing allows flexibility in the delivery of new 127 routing services. MPLS allows for using media specific forwarding 128 mechanisms as label swapping. This enables more sophisticated 129 features such as quality-of-service (QoS) and traffic engineering 130 [7] to be implemented more effectively. An important component of 131 providing QoS, however, is the ability to transport data reliably 132 and efficiently. Although the current routing algorithms are very 133 robust and survivable, the amount of time they take to recover from 134 a fault can be significant, on the order of several seconds or 135 minutes, causing serious disruption of service for some applications 136 in the interim. This is unacceptable to many organizations that aim 137 to provide a highly reliable service, and thus require recovery 138 times on the order of tens of milliseconds, as specified, for 139 example, in the GR253 specification for SONET. 
141 Since MPLS is likely to be the technology of choice in the future 142 IP-based transport network, it is imperative that MPLS be able to 143 provide protection and restoration of traffic. In fact, a protection 144 priority could be used as a differentiating mechanism for premium 145 services that require high reliability. The remainder of this 146 document provides a framework for MPLS based recovery. It is 147 focused at a conceptual level and is meant to address motivation, 148 objectives and requirements. Issues of mechanism, policy, routing 149 plans and characteristics of traffic carried by protection paths are 150 beyond the scope of this document. 152 1.2 Motivation for MPLS-Based Recovery 154 MPLS based protection of traffic (called MPLS-based Recovery) is 155 useful for a number of reasons. The most important is its ability to 156 increase network reliability by enabling a faster response to faults 157 than is possible with traditional Layer 3 (or the IP layer) alone 158 while still providing the visibility of the network afforded Layer 159 3. Furthermore, a protection mechanism using MPLS could enable IP 160 traffic to be put directly over WDM optical channels, without an 161 intervening SONET layer. This would facilitate the construction of 162 IP-over-WDM networks. 164 The need for MPLS-based recovery arises because of the following: 166 I. Layer 3 or IP rerouting may be too slow for a core MPLS network 167 that needs to support high reliability/availability. 169 II. Layer 0 (for example, optical layer) or Layer 1 (for example, 170 SONET) mechanisms may be deployed in ring topologies and may not 171 always include mesh protection. That is, layer 0 or layer 1 networks 172 may not be deployed in topologies that meet carriers' protection 173 goals. 175 III. The granularity at which the lower layers may be able to 176 protect traffic may be too coarse for traffic that is switched using 177 MPLS-based mechanisms. 179 IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher 180 layer operations. Thus, while they may provide, for example, link 181 protection, they cannot easily provide node protection. 183 Furthermore there is a need for open standards. 185 V. Establishing interoperability of protection mechanisms between 186 routers/LSRs from different vendors in IP or MPLS networks is 187 urgently required to enable the adoption of MPLS as a viable core 188 transport and traffic engineering technology. 190 1.3 Objectives/Goals 192 We lay down the following objectives for MPLS-based recovery. 194 I. MPLS-based recovery mechanisms should facilitate fast (10's of 195 ms) recovery times. 197 II. MPLS-based recovery should maximize network reliability and 198 availability. MPLS based protection of traffic should minimize the 199 number of single points of failure in the MPLS protected domain. 201 III. MPLS-based recovery techniques should be applicable for 202 protection of traffic at various granularities. For example, it 203 should be possible to specify MPLS-based recovery for a portion of 204 the traffic on an individual path, for all traffic on an individual 205 path, or for all traffic on a group of paths. 207 IV. MPLS-based recovery techniques may be applicable for an entire 208 end-to-end path or for segments of an end-to-end path. 210 V. MPLS-based recovery actions should not adversely affect other 211 network operations. 213 VI. 
MPLS-based recovery actions in one MPLS protection domain 214 (defined in Section 2.2) should not adversely affect the recovery 215 actions in other MPLS protection domains. 217 VII. MPLS-based recovery mechanisms should be able to take into 218 consideration the recovery actions of lower layers. 220 VIII. MPLS-based recovery actions should avoid network-layering 221 violations. That is, defects in MPLS-based mechanisms should not 222 trigger lower layer protection switching. 224 IX. MPLS-based recovery mechanisms should minimize the loss of data 225 and packet reordering during recovery operations. (The current MPLS 226 specification has itself no explicit requirement on reordering). 228 X. MPLS-based recovery mechanisms should minimize the state overhead 229 incurred for each recovery path maintained. 231 XI. MPLS-based recovery mechanisms should be able to preserve the 232 constraints on traffic after switchover, if desired. That is, if 233 desired, the recovery path should meet the resource requirements of, 234 and achieve the same performance characteristics, as the working 235 path. 237 2.0 Overview 239 There are several options for providing protection of traffic using 240 MPLS. The most generic requirement is the specification of whether 241 recovery should be via Layer 3 (or IP) rerouting or via MPLS 242 protection switching or rerouting actions. 244 Generally network operators aim to provide the fastest and the best 245 protection mechanism that can be provided at a reasonable cost. The 246 higher the level of protection, the more resources it consumes. 247 MPLS-based recovery should give the flexibility to select the 248 recovery mechanism, choose the granularity at which traffic is 249 protected, and to also choose the specific types of traffic that are 250 protected in order to give operators more control over that 251 tradeoff. With MPLS-based recovery, it can be possible to provide 252 different levels of protection for different classes of service, 253 based on their service requirements. For example, using approaches 254 outlined below, a VLL service that supports real-time applications 255 like VoIP may be supported using link/node protection together with 256 pre-established, pre-reserved path protection, while best effort 257 traffic may use established-on-demand path protection or simply rely 258 on IP re-route or higher layer recovery mechanisms. As another 259 example of their range of application, MPLS-based recovery 260 strategies may be used to protect traffic not originally flowing on 261 label switched paths, such as IP traffic that is normally routed 262 hop-by-hop, as well as traffic forwarded on label switched paths. 264 2.1 Recovery Models 266 There are two basic models for path recovery: rerouting and 267 protection switching. 269 Protection switching and rerouting, as defined below, may be used 270 together. For example, protection switching to a recovery path may 271 be used for rapid restoration of connectivity while rerouting 272 determines a new optimal network configuration, rearranging paths, 273 as needed, at a later time [8] [9]. 275 2.1.1 Rerouting 277 Recovery by rerouting is defined as establishing new paths or path 278 segments on demand for restoring traffic after the occurrence of a 279 fault. The new paths may be based upon fault information, network 280 routing policies, pre-defined configurations and network topology 281 information. Thus, upon detecting a fault, the affected paths are 282 re-established using signaling. 
Reroute mechanisms are inherently slower than protection switching mechanisms, since more must be done following the detection of a fault. Once the network routing algorithms have converged after a fault, it may be preferable, in some cases, to reoptimize the network by performing a reroute based on the current state of the network and network policies. This is discussed further in Section 3.8 and will be clarified further in upcoming revisions of this document.

In terms of the principles defined in Section 3, reroute recovery employs paths established-on-demand with resources reserved-on-demand.

2.1.2 Protection Switching

Protection switching recovery mechanisms pre-establish a recovery path or path segment, based upon network routing policies, the restoration requirements of the traffic on the working path, and administrative considerations. The recovery path may or may not be link and node disjoint with the working path [10]. When a fault is detected, the affected traffic that is considered for protection is switched over to the recovery path(s) and restored.

In terms of the principles in Section 3, protection switching employs pre-established recovery paths and, if resource reservation is required on the recovery path, pre-reserved resources.

2.1.2.1 Subtypes of Protection Switching

The resources (bandwidth, buffers, processing) on the recovery path may be used to carry either a copy of the working path traffic or extra traffic that is displaced when a protection switch occurs. This leads to two subtypes of protection switching.

In 1+1 ("one plus one") protection, the resources (bandwidth, buffers, processing capacity) on the recovery path are fully reserved, if needed, and carry the same traffic as the working path. Selection between the traffic on the working and recovery paths is made at the path merge LSR (PML).

In 1:1 ("one for one") protection, the resources (if any) allocated on the recovery path are fully available to preemptible low priority traffic, except when the recovery path is in use due to a fault on the working path. In other words, in 1:1 protection, the protected traffic normally travels only on the working path, and is switched to the recovery path only when the working path has a fault. Once the protection switch is initiated, the low priority traffic being carried on the recovery path may be displaced by the protected traffic. This method affords a way to make efficient use of the recovery path resources.

This concept can be extended to 1:n (one for n) and m:n (m for n) protection.

Additional specifications of the recovery actions are found in Section 3.

2.2 The Recovery Cycles

There are three defined recovery cycles: the MPLS Recovery Cycle, the MPLS Reversion Cycle, and the Dynamic Re-routing Cycle. The first cycle detects a fault and restores traffic onto MPLS-based recovery paths. If the recovery path is non-optimal, the first cycle may be followed by either of the latter two to return the network to an optimized state. The reversion cycle applies to explicitly routed traffic that does not depend on the convergence of dynamic routing protocols. The dynamic re-routing cycle applies to traffic that is forwarded based on hop-by-hop routing.

2.2.1 MPLS Recovery Cycle Model

The MPLS recovery cycle model is illustrated in Figure 1.
354 Definitions and a key to abbreviations follow. 356 --Network Impairment 357 | --Fault Detected 358 | | --Start of Notification 359 | | | -- Start of Recovery Operation 360 | | | | --Recovery Operation Complete 361 | | | | | --Path Traffic Restored 362 | | | | | | 363 | | | | | | 364 v v v v v v 365 ---------------------------------------------------------------- 366 | T1 | T2 | T3 | T4 | T5 | 368 Figure 1. MPLS Recovery Cycle Model 370 The various timing measures used in the model are described below. 372 T1 Fault Detection Time 373 T2 Hold-off Time 374 T3 Notification Time 375 T4 Recovery Operation Time 376 T5 Traffic Restoration Time 378 Definitions of the recovery cycle times are as follows: 380 Fault Detection Time 382 The time between the occurrence of a network impairment and the 383 moment the fault is detected by MPLS-based recovery mechanisms. This 384 time may be highly dependent on lower layer protocols. 386 Hold-Off Time 388 The configured waiting time between the detection of a fault and 389 taking MPLS-based recovery action, to allow time for lower layer 390 protection to take effect. The Hold-off Time may be zero. 392 Note: The Hold-Off Time may occur after the Notification Time 393 interval if the node responsible for the switchover, the Path Switch 394 LSR (PSL), rather than the detecting LSR, is configured to wait. 396 Notification Time 398 The time between initiation of an FIS by the LSR detecting the fault 399 and the time at which the Path Switch LSR (PSL) begins the recovery 400 operation. This is zero if the PSL detects the fault itself. 402 Note: If the PSL detects the fault itself, there still may be a 403 Hold-Off Time period between detection and the start of the recovery 404 operation. 406 Recovery Operation Time 408 The time between the first and last recovery actions. This may 409 include message exchanges between the PSL and PML to coordinate 410 recovery actions. 412 Traffic Restoration Time 414 The time between the last recovery action and the time that the 415 traffic (if present) is completely - recovered. This interval is 416 intended to account for the time required for traffic to once again 417 arrive at the point in the network that experienced disrupted or 418 degraded service due to the occurrence of the fault (e.g. the PML). 419 This time may depend on the location of the fault, the recovery 420 mechanism, and the propagation delay along the recovery path. 422 2.2.2 MPLS Reversion Cycle Model 424 Protection switching, revertive mode, requires the traffic to be 425 switched back to a preferred path when the fault on that path is 426 cleared. The MPLS reversion cycle model is illustrated in Figure 2. 427 Note that the cycle shown below comes after the recovery cycle shown 428 in Fig. 1. 430 --Network Impairment Repaired 431 | --Fault Cleared 432 | | --Path Available 433 | | | --Start of Reversion Operation 434 | | | | --Reversion Operation Complete 435 | | | | | --Traffic Restored on Preferred Path 436 | | | | | | 437 | | | | | | 438 v v v v v v 439 ----------------------------------------------------------------- 440 | T7 | T8 | T9 | T10| T11| 442 Figure 2. MPLS Reversion Cycle Model 444 The various timing measures used in the model are described below. 
446 T7 Fault Clearing Time 447 T8 Wait-to-Restore Time 448 T9 Notification Time 449 T10 Reversion Operation Time 450 T11 Traffic Restoration Time 452 Note that time T6 (not shown above) is the time for which the 453 network impairment is not repaired and traffic is flowing on the 454 recovery path. 456 Definitions of the reversion cycle times are as follows: 458 Fault Clearing Time 460 The time between the repair of a network impairment and the time 461 that MPLS-based mechanisms learn that the fault has been cleared. 462 This time may be highly dependent on lower layer protocols. 464 Wait-to-Restore Time 466 The configured waiting time between the clearing of a fault and 467 MPLS-based recovery action(s). Waiting time may be needed to ensure 468 the path is stable and to avoid flapping in cases where a fault is 469 intermittent. The Wait-to-Restore Time may be zero. 471 Note: The Wait-to-Restore Time may occur after the Notification Time 472 interval if the PSL is configured to wait. 474 Notification Time 476 The time between initiation of an FRS by the LSR clearing the fault 477 and the time at which the path switch LSR begins the reversion 478 operation. This is zero if the PSL clears the fault itself. 480 Note: If the PSL clears the fault itself, there still may be a Wait- 481 to-Restore Time period between fault clearing and the start of the 482 reversion operation. 484 Reversion Operation Time 486 The time between the first and last reversion actions. This may 487 include message exchanges between the PSL and PML to coordinate 488 reversion actions. 490 Traffic Restoration Time 492 The time between the last reversion action and the time that traffic 493 (if present) is completely restored on the preferred path. This 494 interval is expected to be quite small since both paths are working 495 and care may be taken to limit the traffic disruption (e.g., using 496 "make before break" techniques and synchronous switch-over). 498 In practice, the only interesting times in the reversion cycle are 499 the Wait-to-Restore Time and the Traffic Restoration Time (or some 500 other measure of traffic disruption). Given that both paths are 501 available, there is no need for rapid operation, and a well- 502 controlled switch-back with minimal disruption is desirable. 504 2.2.3 Dynamic Re-routing Cycle Model 506 Dynamic rerouting aims to bring the IP network to a stable state 507 after a network impairment has occurred. A re-optimized network is 508 achieved after the routing protocols have converged, and the traffic 509 is moved from a recovery path to a (possibly) new working path. The 510 steps involved in this mode are illustrated in Figure 3. 512 Note that the cycle shown below may follow the recovery cycle shown 513 in Fig. 1 or the reversion cycle shown in Fig. 2, or both (in the 514 event that both the recovery cycle and the reversion cycle take 515 place before the routing protocols converge, and after the 516 convergence of the routing protocols it is determined (based on on- 517 line algorithms or off-line traffic engineering tools, network 518 configuration, or a variety of other possible criteria) that there 519 is a better route for the working path). 
521 --Network Enters a Semi-stable State after an Impairment 522 | --Dynamic Routing Protocols Converge 523 | | --Initiate Setup of New Working Path between PSL 524 | | | and PML 525 | | | --Switchover Operation Complete 526 | | | | --Traffic Moved to New Working Path 527 | | | | | 528 | | | | | 529 v v v v v 530 ----------------------------------------------------------------- 531 | T12 | T13 | T14 | T15 | 533 Figure 3. Dynamic Rerouting Cycle Model 535 The various timing measures used in the model are described below. 537 T12 Network Route Convergence Time 538 T13 Hold-down Time (optional) 539 T14 Switchover Operation Time 540 T15 Traffic Restoration Time 542 Network Route Convergence Time 544 We define the network route convergence time as the time taken for 545 the network routing protocols to converge and for the network to 546 reach a stable state. 548 Holddown Time 550 We define the holddown period as a bounded time for which a recovery 551 path must be used. In some scenarios it may be difficult to 552 determine if the working path is stable. In these cases a holddown 553 time may be used to prevent excess flapping of traffic between a 554 working and a recovery path. 556 Switchover Operation Time 558 The time between the first and last switchover actions. This may 559 include message exchanges between the PSL and PML to coordinate the 560 switchover actions. 562 As an example of the recovery cycle, we present a sequence of events 563 that occur after a network impairment occurs and when a protection 564 switch is followed by dynamic rerouting. 566 I. Link or path fault occurs 568 II. Signaling initiated (FIS) for the fault detected 570 III. FIS arrives at the PSL 572 IV. The PSL initiates a protection switch to a pre-configured 573 recovery path 575 V. The PSL switches over the traffic from the working path to the 576 recovery path 578 VI. The network enters a semi-stable state 580 VII. Dynamic routing protocols converge after the fault, and a new 581 working path is calculated (based, for example, on some of the 582 criteria mentioned earlier in Section 2.1.1). 584 VIII. A new working path is established between the PSL and the PML 585 (assumption is that PSL and PML have not changed) 587 IX. Traffic is switched over to the new working path. 589 2.3 Definitions and Terminology 591 This document assumes the terminology given in [11], and, in 592 addition, introduces the following new terms. 594 2.3.1 General Recovery Terminology 596 Rerouting 598 A recovery mechanism in which the recovery path or path segments are 599 created dynamically after the detection of a fault on the working 600 path. In other words, a recovery mechanism in which the recovery 601 path is not pre-established. 603 Protection Switching 605 A recovery mechanism in which the recovery path or path segments are 606 created prior to the detection of a fault on the working path. In 607 other words, a recovery mechanism in which the recovery path is pre- 608 established. 610 Working Path 612 The protected path that carries traffic before the occurrence of a 613 fault. The working path exists between a PSL and PML. The working 614 path can be of different kinds; a hop-by-hop routed path, a trunk, a 615 link, an LSP or part of a multipoint-to-point LSP. 616 Two synonyms for a working path are primary path, active path. 618 Recovery Path 620 The path by which traffic is restored after the occurrence of a 621 fault. In other words, the path on which the traffic is directed by 622 the recovery mechanism. 
The recovery path is established by MPLS 623 means. The recovery path can either be an equivalent recovery path 624 and ensure no reduction in quality of service, or be a limited 625 recovery path and thereby not guarantee the same quality of service 626 (or some other criteria of performance) as the working path. A 627 limited recovery path is not expected to be used for an extended 628 period of time. 629 Synonyms for a recovery path are; back-up path, alternative path, 630 protection path. 632 Path Group (PG) 634 A logical bundling of multiple working paths, each of which is 635 routed identically between a Path Switch LSR and a Path Merge LSR. 637 Protected Path Group (PPG) 639 A path group that requires protection. 641 Protected Traffic Portion (PTP) 643 The portion of the traffic on an individual path that requires 644 protection. For example, code points in the EXP bits of the shim 645 header may identify a protected portion. 647 Path Switch LSR (PSL) 649 An LSR that is the transmitter of both the working path traffic and 650 its corresponding recovery path traffic. The PSL is responsible for 651 switching of the traffic between the working path and the recovery 652 path. 654 Path Merge LSR (PML) 656 An LSR that receives both working path traffic and its corresponding 657 recovery path traffic, and either merges their traffic into a single 658 outgoing path, or, if it is itself the destination, passes the 659 traffic on to the higher layer protocols. 661 Intermediate LSR 662 An LSR on a working or recovery path that is neither a PSL nor a PML 663 for that path. 665 Bypass Tunnel 667 A path that serves to backup a set of working paths using the label 668 stacking approach. The working paths and the bypass tunnel must all 669 share the same path switch LSR (PSL) and the path merge LSR (PML). 671 Switch-Over 673 The process of switching the traffic from the path that the traffic 674 is flowing on onto one or more alternate path(s). This may involve 675 moving traffic from a working path onto one or more recovery paths, 676 or may involve moving traffic from a recovery path(s) on to a more 677 optimal working path(s). 679 Switch-Back 681 The process of returning the traffic from one or more recovery paths 682 back to the working path(s). 684 Revertive Mode 686 A recovery mode in which traffic is automatically switched back from 687 the recovery path to the original working path upon the restoration 688 of the working path to a fault-free condition. 690 Non-revertive Mode 692 A recovery mode in which traffic is not automatically switched back 693 to the original working path after this path is restored to a fault- 694 free condition. (Depending on the configuration, the original 695 working path may, upon moving to a fault-free condition, become the 696 recovery path, or it may be used for new working traffic, and be no 697 longer associated with its original recovery path). 699 MPLS Protection Domain 701 The set of LSRs over which a working path and its corresponding 702 recovery path are routed. 704 MPLS Protection Plan 706 The set of all LSP protection paths and the mapping from working to 707 protection paths deployed in an MPLS protection domain at a given 708 time. 710 Liveness Message 711 A message exchanged periodically between two adjacent LSRs that 712 serves as a link probing mechanism. It provides an integrity check 713 of the forward and the backward directions of the link between the 714 two LSRs as well as a check of neighbor aliveness. 
Path Continuity Test

A test that verifies the integrity and continuity of a path or path segment. The details of such a test are beyond the scope of this draft. (This could be accomplished, for example, by transmitting a control message along the same links and nodes as the data traffic.)

2.3.2 Failure Terminology

Path Failure (PF)

Path failure is a fault detected by MPLS-based recovery mechanisms, defined as the failure of the liveness message test or of a path continuity test, which indicates that path connectivity is lost.

Path Degraded (PD)

Path degraded is a fault detected by MPLS-based recovery mechanisms that indicates that the quality of the path is unacceptable.

Link Failure (LF)

A lower layer fault indicating that link continuity is lost. This may be communicated to the MPLS-based recovery mechanisms by the lower layer.

Link Degraded (LD)

A lower layer indication to MPLS-based recovery mechanisms that the link is performing below an acceptable level.

Fault Indication Signal (FIS)

A signal that indicates that a fault along a path has occurred. It is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches an LSR that is set up to perform MPLS recovery.

Fault Recovery Signal (FRS)

A signal that indicates that a fault along a working path has been repaired. Like the FIS, it is relayed by each intermediate LSR to its upstream or downstream neighbor, until it reaches the LSR that performs recovery of the original path.

2.4 Abbreviations

FIS: Fault Indication Signal.
FRS: Fault Recovery Signal.
LD: Link Degraded.
LF: Link Failure.
PD: Path Degraded.
PF: Path Failure.
PML: Path Merge LSR.
PG: Path Group.
PPG: Protected Path Group.
PTP: Protected Traffic Portion.
PSL: Path Switch LSR.

3.0 MPLS-based Recovery Principles

MPLS-based recovery refers to the ability to effect quick and complete restoration of traffic affected by a fault in an MPLS-enabled network. The fault may be detected at the IP layer or in lower layers over which IP traffic is transported. Fast MPLS protection may be viewed as an MPLS LSR switch-over completion time comparable, or equivalent, to the 50 ms switch-over completion time of the SONET layer. This section provides a discussion of the concepts and principles of MPLS-based recovery. The concepts are presented in terms of atomic or primitive terms that may be combined to specify recovery approaches. We do not make any assumptions about the underlying layer 1 or layer 2 transport mechanisms or their recovery mechanisms.

3.1 Configuration of Recovery

An LSR should allow for configuration of the following recovery options:

Default-recovery (no MPLS-based recovery enabled): Traffic on the working path is recovered only via Layer 3 or IP rerouting. This is equivalent to having no MPLS-based recovery. This option may be used for low priority traffic or for traffic that is recovered in another way (for example, load-shared traffic on parallel working paths may be automatically recovered upon a fault along one of the working paths by distributing it among the remaining working paths).

Recoverable (MPLS-based recovery enabled): This working path is recovered using one or more recovery paths, either via rerouting or via protection switching.
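As an informal illustration (not part of the framework), the per-path configuration choice above can be captured in a small data model. The following Python sketch is purely illustrative; the names (RecoveryMode, PathRecoveryConfig, the example LSP identifiers) are hypothetical and do not correspond to any MPLS signaling objects.

   from dataclasses import dataclass
   from enum import Enum

   class RecoveryMode(Enum):
       DEFAULT = "default"        # no MPLS-based recovery; rely on L3/IP rerouting
       REROUTING = "rerouting"    # recovery path established on demand
       PROTECTION = "protection"  # recovery path pre-established

   @dataclass
   class PathRecoveryConfig:
       lsp_id: str
       mode: RecoveryMode = RecoveryMode.DEFAULT
       revertive: bool = True     # switch back once the working path is restored

   # A premium LSP protected by protection switching, and a best-effort LSP
   # left to ordinary Layer 3 rerouting (the "Default-recovery" option).
   premium = PathRecoveryConfig("lsp-premium-1", RecoveryMode.PROTECTION)
   best_effort = PathRecoveryConfig("lsp-best-effort-7")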
807 3.2 Initiation of Path Setup 809 There are three options for the initiation of the recovery path 810 setup. 812 Pre-established: 814 This is the same as the protection switching option. Here a recovery 815 path(s) is established prior to any failure on the working path. The 816 path selection can either be determined by an administrative 817 centralized tool (online or offline), or chosen based on some 818 algorithm implemented at the PSL and possibly intermediate nodes. To 819 guard against the situation when the pre-established recovery path 820 fails before or at the same time as the working path, the recovery 821 path should have secondary configuration options as explained in 822 Section 3.3 below. 824 Pre Qualified: 826 A pre-established path need not be created, it may be pre-qualified. 827 A pre-qualified recovery path is not created expressly for 828 protecting the working path, but instead is a path created for other 829 purposes that is designated as a recovery path after determination 830 that it is an acceptable alternative for carrying the working path 831 traffic. 833 Established-on-Demand: 835 This is the same as the rerouting option. Here, a recovery path is 836 established after a failure on its working path has been detected 837 and notified to the PSL. 839 Additional options are possible as MPLS is extended to control 840 optical networks. One example of this is shared mesh protection in 841 optical networks where the wavelength (or port) in-to-out mapping 842 for a recovery lightpath is selected in every optical layer cross- 843 connect prior to the failure, but the physical cross-connect is not 844 made until after the failure occurs. This and other options related 845 to optical MPLS are for further study. 847 3.3 Initiation of Resource Allocation 849 A recovery path may support the same traffic contract as the working 850 path, or it may not. We will distinguish these two situations by 851 using different additive terms. If the recovery path is capable of 852 replacing the working path without degrading service, it will be 853 called an equivalent recovery path. If the recovery path lacks the 854 resources (or resource reservations) to replace the working path 855 without degrading service, it will be called a limited recovery 856 path. Based on this, there are two options for the initiation of 857 resource allocation: 859 Pre-reserved: 861 This option applies only to protection switching. Here a pre- 862 established recovery path reserves required resources on all hops 863 along its route during its establishment. Although the reserved 864 resources (e.g., bandwidth and/or buffers) at each node cannot be 865 used to admit more working paths, they are available to be used by 866 all traffic that is present at the node before a failure occurs, 867 which results in better resource usage than SONET APS. 869 Reserved-on-Demand: 871 This option may apply either to rerouting or to protection 872 switching. Here a recovery path reserves the required resources 873 after a failure on the working path has been detected and notified 874 to the PSL and before the traffic on the working path is switched 875 over to the recovery path. 877 Note that under both the options above, depending on the amount of 878 resources reserved on the recovery path, it could either be an 879 equivalent recovery path or a limited recovery path. 
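To make the interaction between Sections 3.2 and 3.3 concrete, the following sketch combines the path-setup and resource-allocation options into the recovery models of Section 2.1. It is an informal illustration only, with hypothetical names; the one constraint encoded is that pre-reserved resources apply only to pre-established (protection switching) paths.

   from enum import Enum

   class PathSetup(Enum):
       PRE_ESTABLISHED = "pre-established"
       PRE_QUALIFIED = "pre-qualified"
       ON_DEMAND = "established-on-demand"

   class ResourceAllocation(Enum):
       PRE_RESERVED = "pre-reserved"
       ON_DEMAND = "reserved-on-demand"

   def recovery_model(setup, alloc):
       """Return the recovery model implied by a setup/allocation pair."""
       if alloc is ResourceAllocation.PRE_RESERVED and setup is PathSetup.ON_DEMAND:
           # Section 3.3: pre-reserved resources apply only to protection switching.
           raise ValueError("pre-reserved resources require a pre-established path")
       if setup is PathSetup.ON_DEMAND:
           return "rerouting"
       return "protection switching"

   assert recovery_model(PathSetup.ON_DEMAND, ResourceAllocation.ON_DEMAND) == "rerouting"
   assert recovery_model(PathSetup.PRE_ESTABLISHED, ResourceAllocation.PRE_RESERVED) == "protection switching"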
881 3.4 Scope of Recovery 883 3.4.1 Topology 885 3.4.1.1 Local Repair 887 The intent of local repair is to protect against a single link or 888 neighbor node fault. In local repair (also known as local recovery 889 [12] [9]), the node detecting the fault is the one to initiate 890 recovery (either rerouting or protection switching). Local repair 891 can be of two types: 893 Link Recovery/Restoration 895 In this case, the recovery path may be configured to route around a 896 certain link deemed to be unreliable. If protection switching is 897 used, several recovery paths may be configured for one working path, 898 depending on the specific faulty link that each protects against. 900 Alternatively, if rerouting is used, upon the occurrence of a fault 901 on the specified link each path is rebuilt such that it detours 902 around the faulty link. 904 In this case, the recovery path need only be disjoint from its 905 working path at a particular link on the working path, and may have 906 overlapping segments with the working path. Traffic on the working 907 path is switched over to an alternate path at the upstream LSR that 908 connects to the failed link. This method is potentially the fastest 909 to perform the switchover, and can be effective in situations where 910 certain path components are much more unreliable than others. 912 Node Recovery/Restoration 914 In this case, the recovery path may be configured to route around a 915 neighbor node deemed to be unreliable. Thus the recovery path is 916 disjoint from the working path only at a particular node and at 917 links associated with the working path at that node. Once again, the 918 traffic on the primary path is switched over to the recovery path at 919 the upstream LSR that directly connects to the failed node, and the 920 recovery path shares overlapping portions with the working path. 922 3.4.1.2 Global Repair 924 The intent of global repair is to protect against any link or node 925 fault on the entire path or on a segment of a path (with the obvious 926 exception of the ingress and egress nodes). In global repair (also 927 known as path recovery/restoration) the node that initiates the 928 recovery may be distant from the faulty link or node. In some cases, 929 a fault notification (in the form of a FIS) must be sent from the 930 node detecting the fault to the PSL. In many cases, the recovery 931 path can be made completely link and node disjoint with its working 932 path. This has the advantage of protecting against all link and node 933 fault(s) on the working path (or path segment), and being more 934 efficient than per-hop link or node recovery. 936 In addition, it can be potentially more optimal in resource usage 937 than the link or node recovery. However, it is in some cases slower 938 than local repair since it takes longer for the fault notification 939 message to get to the PSL to trigger the recovery action. 941 3.4.1.3 Alternate Egress Repair 943 It is possible to restore service without specifically recovering 944 the faulted path. 946 For example, for best effort IP service it is possible to select a 947 recovery path that has a different egress point from the working 948 path (i.e., there is no PML). The recovery path egress must simply 949 be a router that is acceptable for forwarding the FEC carried by the 950 working path (without creating looping). In an engineering context, 951 specific alternative FEC/LSP mappings with alternate egresses can be 952 formed. 
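The difference between local repair (Sections 3.4.1.1) and global repair (Section 3.4.1.2) is essentially which elements of the working path the recovery path must avoid. The small sketch below illustrates this on a toy topology; the breadth-first helper and the node names are hypothetical and not part of the framework. Note that in this example a local detour around the failed link exists (and overlaps the working path), while no fully node-disjoint alternative exists for global repair.

   from collections import deque

   def find_path(adj, src, dst, banned_nodes=(), banned_links=()):
       """Breadth-first search avoiding the given nodes and (u, v) links."""
       banned_links = {frozenset(link) for link in banned_links}
       queue, seen = deque([[src]]), {src}
       while queue:
           path = queue.popleft()
           if path[-1] == dst:
               return path
           for nxt in adj[path[-1]]:
               if nxt in seen or nxt in banned_nodes:
                   continue
               if frozenset((path[-1], nxt)) in banned_links:
                   continue
               seen.add(nxt)
               queue.append(path + [nxt])
       return None

   # Working path: PSL - A - X - PML
   adj = {"PSL": ["A", "B"], "A": ["PSL", "B", "X"], "B": ["PSL", "A"],
          "X": ["A", "PML"], "PML": ["X"]}

   # Local (link) repair: avoid only the failed PSL-A link; the detour may
   # overlap the working path downstream of the fault.
   print(find_path(adj, "PSL", "PML", banned_links=[("PSL", "A")]))
   # Global repair: avoid every intermediate node of the working path;
   # returns None here because no node-disjoint route exists.
   print(find_path(adj, "PSL", "PML", banned_nodes={"A", "X"}))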
954 3.4.1.4 Multi-Layer Repair 956 Multi-layer repair broadens the network designer's tool set for 957 those cases where multiple network layers can be managed together to 958 achieve overall network goals. Specific criteria for determining 959 when multi-layer repair is appropriate are beyond the scope of this 960 draft. 962 3.4.1.5 Concatenated Protection Domains 964 A given service may cross multiple networks and these may employ 965 different recovery mechanisms. It is possible to concatenate 966 protection domains so that service recovery can be provided end-to- 967 end. It is considered that the recovery mechanisms in different 968 domains may operate autonomously, and that multiple points of 969 attachment may be used between domains (to ensure there is no single 970 point of failure). Details of concatenated protection domains are 971 beyond the scope of this draft. 973 3.4.2 Path Mapping 975 Path mapping refers to the methods of mapping traffic from a faulty 976 working path on to the recovery path. There are several options for 977 this, as described below. Note that the options below should be 978 viewed as atomic terms that only describe how the working and 979 protection paths are mapped to each other. The issues of resource 980 reservation along these paths, and how switchover is actually 981 performed lead to the more commonly used composite terms, such as 982 1+1 and 1:1 protection, which were described in Section 2.1. 984 i) 1-to-1 Protection 986 In 1-to-1 protection the working path has a designated recovery path 987 that is only to be used to recover that specific working path. 989 ii) n-to-1 Protection 991 In n-to-1 protection, up to n working paths are protected using only 992 one recovery path. If the intent is to protect against any single 993 fault on any of the working paths, the n working paths should be 994 diversely routed between the same PSL and PML. In some cases, 995 handshaking between PSL and PML may be required to complete the 996 recovery, the details of which are beyond the scope of this draft. 998 iii) n-to-m Protection 1000 In n-to-m protection, up to n working paths are protected using m 1001 recovery paths. Once again, if the intent is to protect against any 1002 single fault on any of the n working paths, the n working paths and 1003 the m recovery paths should be diversely routed between the same PSL 1004 and PML. In some cases, handshaking between PSL and PML may be 1005 required to complete the recovery, the details of which are beyond 1006 the scope of this draft. -N-to-m protection is for further study. 1008 iv) Split Path Protection 1010 In split path protection, multiple recovery paths are allowed to 1011 carry the traffic of a working path based on a certain configurable 1012 load splitting ratio. This is especially useful when no single 1013 recovery path can be found that can carry the entire traffic of the 1014 working path in case of a fault. Split path protection may require 1015 handshaking between the PSL and the PML(s), and may require the 1016 PML(s) to correlate the traffic arriving on multiple recovery paths 1017 with the working path. Although this is an attractive option, the 1018 details of split path protection are beyond the scope of this draft, 1019 and are for further study. 1021 3.4.3 Bypass Tunnels 1023 It may be convenient, in some cases, to create a "bypass tunnel" for 1024 a PPG between a PSL and PML, thereby allowing multiple recovery 1025 paths to be transparent to intervening LSRs [8]. 
In this case, one 1026 LSP (the tunnel) is established between the PSL and PML following an 1027 acceptable route and a number of recovery paths are supported 1028 through the tunnel via label stacking. A bypass tunnel can be used 1029 with any of the path mapping options discussed in the previous 1030 section. 1032 As with recovery paths, the bypass tunnel may or may not have 1033 resource reservations sufficient to provide recovery without service 1034 degradation. It is possible that the bypass tunnel may have 1035 sufficient resources to recover some number of working paths, but 1036 not all at the same time. If the number of recovery paths carrying 1037 traffic in the tunnel at any given time is restricted, this is 1038 similar to the 1 to n or m to n protection cases mentioned in 1039 Section 3.4.2. 1041 3.4.4 Recovery Granularity 1043 Another dimension of recovery considers the amount of traffic 1044 requiring protection. This may range from a fraction of a path to a 1045 bundle of paths. 1047 3.4.4.1 Selective Traffic Recovery 1049 This option allows for the protection of a fraction of traffic 1050 within the same path. The portion of the traffic on an individual 1051 path that requires protection is called a protected traffic portion 1052 (PTP). A single path may carry different classes of traffic, with 1053 different protection requirements. The protected portion of this 1054 traffic may be identified by its class, as for example, via the EXP 1055 bits in the MPLS shim header or via the priority bit in the ATM 1056 header. 1058 3.4.4.2 Bundling 1060 Bundling is a technique used to group multiple working paths 1061 together in order to recover them simultaneously. The logical 1062 bundling of multiple working paths requiring protection, each of 1063 which is routed identically between a PSL and a PML, is called a 1064 protected path group (PPG). When a fault occurs on the working path 1065 carrying the PPG, the PPG as a whole can be protected either by 1066 being switched to a bypass tunnel or by being switched to a recovery 1067 path. 1069 3.4.5 Recovery Path Resource Use 1071 In the case of pre-reserved recovery paths, there is the question of 1072 what use these resources may be put to when the recovery path is not 1073 in use. There are two options: 1075 Dedicated-resource: 1077 If the recovery path resources are dedicated, they may not be used 1078 for anything except carrying the working traffic. For example, in 1079 the case of 1+1 protection, the working traffic is always carried on 1080 the recovery path. Even if the recovery path is not always carrying 1081 the working traffic, it may not be possible or desirable to allow 1082 other traffic to use these resources. 1084 Extra-traffic-allowed: 1086 If the recovery path only carries the working traffic when the 1087 working path fails, then it is possible to allow extra traffic to 1088 use the reserved resources at other times. Extra traffic is, by 1089 definition, traffic that can be displaced (without violating service 1090 agreements) whenever the recovery path resources are needed for 1091 carrying the working path traffic. 1093 3.5 Fault Detection 1095 MPLS recovery is initiated after the detection of either a lower 1096 layer fault or a fault at the IP layer or in the operation of MPLS- 1097 based mechanisms. We consider four classes of impairments: Path 1098 Failure, Path Degraded, Link Failure, and Link Degraded. 
1100 Path Failure (PF) is a fault that indicates to an MPLS-based 1101 recovery scheme that the connectivity of the path is lost. This may 1102 be detected by a path continuity test between the PSL and PML. 1103 Some, and perhaps the most common, path failures may be detected 1104 using a link probing mechanism between neighbor LSRs. An example of 1105 a probing mechanism is a liveness message that is exchanged 1106 periodically along the working path between peer LSRs. For either a 1107 link probing mechanism or path continuity test to be effective, the 1108 test message must be guaranteed to follow the same route as the 1109 working or recovery path, over the segment being tested. In 1110 addition, the path continuity test must take the path merge points 1111 into consideration. In the case of a bi-directional link implemented 1112 as two unidirectional links, path failure could mean that either one 1113 or both unidirectional links are damaged. 1115 Path Degraded (PD) is a fault that indicates to MPLS-based recovery 1116 schemes/mechanisms that the path has connectivity, but that the 1117 quality of the connection is unacceptable. This may be detected by 1118 a path performance monitoring mechanism, or some other mechanism for 1119 determining the error rate on the path or some portion of the path. 1120 This is local to the LSR and consists of excessive discarding of 1121 packets at an interface, either due to label mismatch or due to TTL 1122 errors, for example. 1124 Link Failure (LF) is an indication from a lower layer that the link 1125 over which the path is carried has failed. If the lower layer 1126 supports detection and reporting of this fault (that is, any fault 1127 that indicates link failure e.g., SONET LOS), this may be used by 1128 the MPLS recovery mechanism. In some cases, using LF indications may 1129 provide faster fault detection than using only MPLS-based fault 1130 detection mechanisms. 1132 Link Degraded (LD) is an indication from a lower layer that the link 1133 over which the path is carried is performing below an acceptable 1134 level. If the lower layer supports detection and reporting of this 1135 fault, it may be used by the MPLS recovery mechanism. In some cases, 1136 using LD indications may provide faster fault detection than using 1137 only MPLS-based fault detection mechanisms. 1139 3.6 Fault Notification 1141 Protection switching relies on rapid notification of faults. Once a 1142 fault is detected, the node that detected the fault must determine 1143 if the fault is severe enough to require path recovery. Then the 1144 node should send out a notification of the fault by transmitting a 1145 FIS to those of its upstream LSRs that were sending traffic on the 1146 working path that is affected by the fault. This notification is 1147 relayed hop-by-hop by each subsequent LSR to its upstream neighbor, 1148 until it eventually reaches a PSL. A PSL is the only LSR that can 1149 terminate the FIS and initiate a protection switch of the working 1150 path to a recovery path. Since the FIS is a control message, it 1151 should be transmitted with high priority to ensure that it 1152 propagates rapidly towards the affected PSL(s). Depending on how 1153 fault notification is configured in the LSRs of an MPLS domain, the 1154 FIS could be sent either as a Layer 2 or Layer 3 packet. An example 1155 of a FIS could be the liveness message sent by a downstream LSR to 1156 its upstream neighbor, with an optional fault notification field 1157 set. 
Alternatively, it could be a separate fault notification 1158 packet. The intermediate LSR should identify which of its incoming 1159 links (upstream LSRs) to propagate the FIS on. In the case of 1+1 1160 protection, the FIS should also be sent downstream to the PML where 1161 the recovery action is taken. 1163 3.7 Switch-Over Operation 1165 3.7.1 Recovery Trigger 1167 The activation of an MPLS protection switch following the detection 1168 or notification of a fault requires a trigger mechanism at the PSL. 1169 MPLS protection switching may be initiated due to automatic inputs 1170 or external commands. The automatic activation of an MPLS protection 1171 switch results from a response to a defect or fault conditions 1172 detected at the PSL or to fault notifications received at the PSL. 1173 It is possible that the fault detection and trigger mechanisms may 1174 be combined, as is the case when a PF, PD, LF, or LD is detected at 1175 a PSL and triggers a protection switch to the recovery path. In most 1176 cases, however, the detection and trigger mechanisms are distinct, 1177 involving the detection of fault at some intermediate LSR followed 1178 by the propagation of a fault notification back to the PSL via the 1179 FIS, which serves as the protection switch trigger at the PSL. MPLS 1180 protection switching in response to external commands results when 1181 the operator initiates a protection switch by a command to a PSL (or 1182 alternatively by a configuration command to an intermediate LSR, 1183 which transmits the FIS towards the PSL). 1185 Note that the PF fault applies to hard failures (fiber cuts, 1186 transmitter failures, or LSR fabric failures), as does the LF fault, 1187 with the difference that the LF is a lower layer impairment that may 1188 be communicated to - MPLS-based recovery mechanisms. The PD (or LD) 1189 fault, on the other hand, applies to soft defects (excessive errors 1190 due to noise on the link, for instance). The PD (or LD) results in a 1191 fault declaration only when the percentage of lost packets exceeds a 1192 given threshold, which is provisioned and may be set based on the 1193 service level agreement(s) in effect between a service provider and 1194 a customer. 1196 3.7.2 Recovery Action 1198 After a fault is detected or FIS is received by the PSL, the 1199 recovery action involves either a rerouting or protection switching 1200 operation. In both scenarios, the next hop label forwarding entry 1201 for a recovery path is bound to the working path. 1203 3.8 Switch-Back Operation 1205 3.8.1 Revertive and Non-Revertive Modes 1207 These protection modes indicate whether or not there is a preferred 1208 path for the protected traffic. 1210 3.8.1.1 Revertive Mode 1212 If the working path always is the preferred path, this path will be 1213 used whenever it is available. If the working path has a fault, 1214 traffic is switched to the recovery path. In the revertive mode of 1215 operation, when the preferred path is restored the traffic is 1216 automatically switched back to it. 1218 3.8.1.2 Non-revertive Mode 1220 In the non-revertive mode of operation, there is no preferred path. 1221 A switchback to the "original" working path is not desired or not 1222 possible since the original path may no longer exist after the 1223 occurrence of a fault on that path. 1225 If there is a fault on the working path, traffic is switched to the 1226 recovery path. 
When or if the faulty path (the original working 1227 path) is restored, it may become the recovery path (either by 1228 configuration or, if desired, by management actions). This applies 1229 to explicitly routed working paths.

1231 When the traffic is switched over to a recovery path, the 1232 association between the original working path and the recovery path 1233 may no longer exist, since the original path itself may no longer 1234 exist after the fault. Instead, when the network reaches a stable 1235 state following routing convergence, the recovery path may be 1236 switched over to a different preferred path, based either on pre- 1237 configured information or on an optimization over the new network 1238 topology and associated information.

1240 3.8.2 Restoration and Notification

1242 MPLS restoration deals with returning the working traffic from the 1243 recovery path to the original or a new working path. Reversion is 1244 performed by the PSL upon receiving notification, via the FRS, that the 1245 working path is repaired, or upon receiving notification that a new 1246 working path is established.

1248 As before, an LSR that detected the fault on the working path also 1249 detects the restoration of the working path. If the working path had 1250 experienced an LF defect, the LSR detects a return to normal 1251 operation via the receipt of a liveness message from its peer. If 1252 the working path had experienced an LD defect at an LSR interface, 1253 the LSR could detect a return to normal operation via the resumption 1254 of error-free packet reception on that interface. Alternatively, a 1255 lower layer that no longer detects an LF defect may inform the MPLS- 1256 based recovery mechanisms at the LSR that the link to its peer LSR 1257 is operational. The LSR then transmits the FRS to its upstream LSR(s) 1258 that were transmitting traffic on the working path. This is relayed 1259 hop-by-hop until it reaches the PSL(s), at which point the PSL 1260 switches the working traffic back to the original working path.

1262 In the non-revertive mode of operation, the working traffic may or 1263 may not be restored to the original working path. This is because, in some cases: 1264 (a) it might be useful to administratively 1265 perform a protection switch back to the original working path after 1266 gaining further assurances about the integrity of the path; (b) 1267 it may be acceptable to continue operation without the recovery path 1268 being protected; or (c) it may be desirable to move the traffic to a 1269 new working path that is calculated based on network topology and 1270 network policies, after the dynamic routing protocols have 1271 converged.

1273 We note that if there is a way to transmit fault information back 1274 along a recovery path towards a PSL, and if the recovery path is an 1275 equivalent recovery path, it is possible for the working path and 1276 its recovery path to exchange roles once the original working path 1277 is repaired following a fault. This is because, in that case, the 1278 recovery path effectively becomes the working path, and the restored 1279 working path functions as a recovery path for the original recovery 1280 path. This is important, since it affords the benefits of non- 1281 revertive switch operation outlined in Section 3.8.1 without 1282 leaving the recovery path unprotected.
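To make the switch-over and switch-back operations of Sections 3.6 through 3.8 concrete, the following sketch shows how a PSL might react to the FIS and the FRS for a single protected path. It is illustrative only and continues the assumptions of the previous sketch: the class, the bind_nhlfe callback, and the revertive flag are names invented for this example, not elements defined by this framework.

   class PathSwitchLSR:
       # Illustrative sketch (not part of this framework): switch-over and
       # switch-back at a PSL for one protected working path.

       def __init__(self, working_nhlfe, recovery_nhlfe, bind_nhlfe, revertive=True):
           self.working_nhlfe = working_nhlfe    # forwarding entry of the working path
           self.recovery_nhlfe = recovery_nhlfe  # forwarding entry of the recovery path
           self.bind_nhlfe = bind_nhlfe          # callback installing the entry for the protected traffic
           self.revertive = revertive            # revertive is the default (Section 5.0, option III)
           self.on_recovery_path = False

       def on_fis(self):
           # FIS received (or an equivalent fault detected locally): trigger the
           # protection switch of Section 3.7 by rebinding the protected traffic.
           if not self.on_recovery_path:
               self.bind_nhlfe(self.recovery_nhlfe)
               self.on_recovery_path = True

       def on_frs(self):
           # FRS received: the original (or a new) working path is available again.
           if not self.on_recovery_path:
               return
           if self.revertive:
               # Revertive mode: the working path is the preferred path; switch
               # back, ideally in a make-before-break fashion (Section 3.8.3).
               self.bind_nhlfe(self.working_nhlfe)
               self.on_recovery_path = False
           # Non-revertive mode: keep forwarding on the recovery path; the
           # restored path may later be reconfigured as its protection, or the
           # traffic may be moved to a new preferred path by management action
           # (Section 3.8.1.2).

       def manual_switch(self, to_recovery):
           # Administrative protection-switch command (Section 4.0, requirement III).
           self.bind_nhlfe(self.recovery_nhlfe if to_recovery else self.working_nhlfe)
           self.on_recovery_path = to_recovery

In this sketch both the protection switch and the switch-back are expressed as a rebinding of the protected traffic to the recovery or working path's next hop label forwarding entry, which corresponds to the recovery action of Section 3.7.2.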
1284 3.8.3 Reverting to Preferred Path (or Controlled Rearrangement)

1286 In the revertive mode, "make-before-break" restoration switching 1287 can be used, which is less disruptive than performing protection 1288 switching upon the occurrence of network impairments. This will 1289 minimize both packet loss and packet reordering. The controlled 1290 rearrangement of paths can also be used to satisfy traffic 1291 engineering requirements for load balancing across an MPLS domain.

1293 3.9 Performance

1295 Resource/performance requirements for recovery paths should be 1296 specified in terms of the following attributes:

1298 I. Resource class attribute:

1300 Equivalent Recovery Class: The recovery path has the same resource 1301 reservations and performance guarantees as the working path. In 1302 other words, the recovery path meets the same SLAs as the working 1303 path.

1305 Limited Recovery Class: The recovery path does not have the same 1306 resource reservations and performance guarantees as the working 1307 path.

1309 A. Lower Class: The recovery path has lower resource requirements or 1310 less stringent performance requirements than the working path.

1312 B. Best Effort Class: The recovery path is best effort.

1314 II. Priority Attribute:

1316 The recovery path has a priority attribute just like the working 1317 path (i.e., the priority attribute of the associated traffic 1318 trunks). It can have the same priority as the working path or a lower 1319 priority.

1321 III. Preemption Attribute:

1323 The recovery path can have the same preemption attribute as the 1324 working path or a lower one.

1326 4.0 MPLS Recovery Requirements

1327 The following are the MPLS recovery requirements:

1329 I. MPLS recovery SHALL provide an option to identify protection 1330 groups (PPGs) and protection portions (PTPs).

1332 II. Each PSL SHALL be capable of performing MPLS recovery upon the 1333 detection of impairments or upon receipt of notifications of 1334 impairments.

1336 III. An MPLS recovery method SHALL NOT preclude manual protection 1337 switching commands. This implies that it would be possible under 1338 administrative commands to transfer traffic from a working path to a 1339 recovery path, or to transfer traffic from a recovery path to a 1340 working path, once the working path becomes operational following a 1341 fault.

1343 IV. A PSL SHALL be capable of performing either a switch back to the 1344 original working path after the fault is corrected or a switchover 1345 to a new working path, upon the discovery of a more optimal working 1346 path.

1348 V. The recovery model should take into consideration path merging at 1349 intermediate LSRs. If a fault affects the merged segment, all the 1350 paths sharing that merged segment should be able to recover. 1351 Similarly, if a fault affects a non-merged segment, only the path 1352 that is affected by the fault should be recovered.

1354 5.0 MPLS Recovery Options

1356 There SHOULD be an option for:

1358 I. Configuration of the recovery path as excess or reserved, with 1359 excess as the default. The recovery path that is configured as 1360 excess SHALL provide lower-priority, preemptable traffic access to 1361 the protection bandwidth, while the recovery path configured as 1362 reserved SHALL NOT provide any other traffic access to the 1363 protection bandwidth.

1365 II. Each protected path SHALL provide an option for configuring the 1366 protection alternatives as either rerouting or protection switching.
1368 III. Each protected path SHALL provide a configuration option for 1369 enabling restoration as either non-revertive or revertive, with 1370 revertive as the default.

1372 IV. Each LSR supporting protection switching SHALL provide an option 1373 for fault notification to the PSL.

1375 6.0 Comparison Criteria

1377 Possible criteria to use for comparison of MPLS-based recovery 1378 schemes are as follows:

1380 Recovery Time

1382 We define recovery time as the time required for a recovery path to 1383 be activated (and traffic flowing) after a fault. Recovery Time is 1384 the sum of the Fault Detection Time, Hold-off Time, Notification 1385 Time, Recovery Operation Time, and the Traffic Restoration Time. In 1386 other words, it is the time between the failure of a node or link in 1387 the network and the time at which a recovery path is installed and the 1388 traffic starts flowing on it.

1390 Full Restoration Time

1392 We define full restoration time as the time required for a permanent 1393 restoration. This is the time required for traffic to be routed onto 1394 links that are capable of handling, or have been engineered to 1395 handle, traffic in recovery scenarios. Note that this time may or may 1396 not be different from the "Recovery Time", depending on whether 1397 equivalent or limited recovery paths are used.

1399 Backup Capacity

1401 Recovery schemes may require differing amounts of "backup capacity" 1402 in the event of a fault. This capacity will be dependent on the 1403 traffic characteristics of the network. However, it may also be 1404 dependent on the particular protection plan selection algorithms as 1405 well as the signaling and re-routing methods.

1407 Additive Latency

1409 Recovery schemes may introduce additive latency to traffic. For 1410 example, a recovery path may take many more hops than the working 1411 path. This may be dependent on the recovery path selection 1412 algorithms.

1414 Re-ordering

1416 Recovery schemes may introduce re-ordering of packets. The 1417 action of putting traffic back on preferred paths might also cause packet 1418 re-ordering.

1420 State Overhead

1421 As the number of recovery paths in a protection plan grows, the 1422 state required to maintain them also grows. Schemes may require 1423 differing numbers of recovery paths to maintain a given level of coverage. 1424 The state required may also depend on the particular scheme 1425 used to recover. In many cases, the state overhead will be 1426 proportional to the number of recovery paths.

1428 Loss

1430 Recovery schemes may introduce a certain amount of packet loss 1431 during switchover to a recovery path. For schemes that introduce loss 1432 during recovery, this loss can be estimated from the recovery time 1433 and the link speed.

1435 In the case of a link or node failure, some packet loss is inevitable.

1437 Coverage

1439 Recovery schemes may offer various types of failover coverage. The 1440 total coverage may be defined in terms of several metrics:

1442 I. Fault Types: Recovery schemes may account for link faults only, 1443 for both node and link faults, or also for degraded service. For example, a 1444 scheme may require more recovery paths to take node faults into 1445 account.

1447 II. Number of concurrent faults: Depending on the layout of recovery 1448 paths in the protection plan, it may be possible to recover from 1449 multiple concurrent faults.

1451 III. Number of recovery paths: For a given fault, there may be one 1452 or more recovery paths.
1454 IV. Percentage of coverage: Depending on the scheme and its 1455 implementation, a certain percentage of faults may be covered. This 1456 may be subdivided into the percentage of link faults and the percentage of 1457 node faults.

1459 V. The number of protected paths may affect how fast the total set 1460 of paths affected by a fault can be recovered. The ratio of 1461 protected paths is n/N, where n is the number of protected paths and N is 1462 the total number of paths.

1464 7.0 Security Considerations

1466 The MPLS recovery that is specified herein does not raise any 1467 security issues that are not already present in the MPLS 1468 architecture.

1470 8.0 Intellectual Property Considerations

1472 The IETF has been notified of intellectual property rights claimed 1473 in regard to some or all of the specification contained in this 1474 document. For more information, consult the online list of claimed 1475 rights.

1477 9.0 Acknowledgements

1479 We would like to thank the members of the MPLS WG mailing list for their 1480 suggestions on the earlier version of this draft. In particular, 1481 we thank Bora Akyol, Dave Allan, and Neil Harrisson, whose suggestions and 1482 comments were very helpful in revising the document.

1484 10.0 Authors' Addresses

1486 Srinivas Makam Vishal Sharma 1487 Tellabs Operations, Inc. Tellabs Research Center 1488 4951 Indiana Avenue One Kendall Square 1489 Lisle, IL 60532 Bldg. 100, Ste. 121 1490 Phone: 630-512-7217 Cambridge, MA 02139-1562 1491 Srinivas.Makam@tellabs.com Phone: 617-577-8760 1492 Vishal.Sharma@tellabs.com

1494 Ken Owens Changcheng Huang 1495 Tellabs Operations, Inc. Tellabs Operations, Inc. 1496 1106 Fourth Street 4951 Indiana Avenue 1497 St. Louis, MO 63126 Lisle, IL 60532 1498 Phone: 314-918-1579 Phone: 630-512-7754 1499 Ken.Owens@tellabs.com Changcheng.Huang@tellabs.com

1501 Ben Mack-Crane Fiffi Hellstrand 1502 Tellabs Operations, Inc. Nortel Networks 1503 4951 Indiana Avenue St Eriksgatan 115, PO Box 6701 1504 Lisle, IL 60532 113 85 Stockholm, Sweden 1505 Ph: 630-512-7255 Ph: +46 8 5088 3687 1506 Ben.Mack-Crane@tellabs.com Fiffi@nortelnetworks.com

1508 Jon Weil Brad Cain 1509 Nortel Networks Mirror Image Internet 1510 Harlow Laboratories London Road 49 Dragon Ct. 1511 Harlow Essex CM17 9NA, UK Woburn, MA 01801, USA 1512 Phone: +44 (0)1279 403935 bcain@mirror-image.com 1513 jonweil@nortelnetworks.com

1515 Loa Andersson Bilel Jamoussi 1516 Nortel Networks Nortel Networks 1517 St Eriksgatan 115, PO Box 6701 3 Federal Street, BL3-03 1518 113 85 Stockholm, Sweden Billerica, MA 01821, USA 1519 phone: +46 8 50 88 36 34 jamoussi@nortelnetworks.com 1520 loa.andersson@nortelnetworks.com

1522 Seyhan Civanlar Angela Chiu 1523 Coreon, Inc. AT&T Labs, Rm. 4-204, 1524 1200 South Avenue, Suite 103 100 Schulz Dr. 1525 Staten Island, NY 10314 Red Bank, NJ 07701 1526 Ph: (718) 889 4203 Ph: (732) 345-3441 1527 scivanlar@coreon.net alchiu@att.com

1528 11.0 References

1530 [1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label 1531 Switching Architecture", Work in Progress, Internet Draft, August 1999.

1534 [2] Andersson, L., Doolan, P., Feldman, N., Fredette, A., and Thomas, 1535 B., "LDP Specification", Work in Progress, Internet Draft, September 1999.

1538 [3] Awduche, D., Hannan, A., and Xiao, X., "Applicability Statement 1539 for Extensions to RSVP for LSP-Tunnels", draft-ietf-mpls-rsvp- 1540 tunnel-applicability-00.txt, Work in Progress, September 1999.

1542 [4] Jamoussi, B., "Constraint-Based LSP Setup using LDP", Work in 1543 Progress, Internet Draft, 1544 September 1999.
1546 [5] Braden, R., Zhang, L., Berson, S., and Herzog, S., "Resource 1547 ReSerVation Protocol (RSVP) -- Version 1 Functional 1548 Specification", RFC 2205, September 1997.

1550 [6] Awduche, D., et al., "Extensions to RSVP for LSP Tunnels", Work in 1551 Progress, Internet Draft, September 1999.

1575 [12] Haskin, D., and Krishnan, R., "A Method for Setting an 1576 Alternative Label Switched Path to Handle Fast Reroute", 1577 draft-haskin-mpls-fast-reroute-01.txt, Work in Progress, 1999.