Traffic Engineering Working Group                    Wai Sum Lai, AT&T
Internet Draft                                  Dave McDysan, WorldCom
                                                          (Co-Editors)
Category: Informational
Expiration Date: April 2002                          Jim Boyle, PDNets
                                                         Malin Carlzon
                                                   Rob Coltun, Redback
                                                     Tim Griffin, AT&T
                                                               Ed Kern
                                               Tom Reddington, Lucent

                                                          October 2001

           Network Hierarchy and Multilayer Survivability

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.  Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document is the deliverable of the Network Hierarchy and
   Survivability Techniques Design Team established within the Traffic
   Engineering Working Group.
   This team collected and documented current and near-term
   requirements for survivability and hierarchy in service provider
   environments.  For clarity, an expanded set of definitions is
   included.  The team determined that there appears to be a need to
   define a small set of interoperable survivability approaches in
   packet and non-packet networks.  Suggested approaches include
   path-based approaches as well as one that repairs connections in
   proximity to the network fault.  They operate primarily at a single
   network layer.  For hierarchy, there did not appear to be a driving
   near-term need for work on "vertical hierarchy," defined as
   communication between network layers such as TDM/optical and MPLS.
   In particular, instead of direct exchange of signaling and routing
   between vertical layers, some looser form of coordination and
   communication, such as the specification of hold-off timers, is a
   nearer-term need.  For "horizontal hierarchy" in data networks,
   there are several pressing needs.  The requirement is to be able to
   set up many LSPs in a service provider network with a hierarchical
   IGP.  This is necessary to support layer 2 and layer 3 VPN services
   that require edge-to-edge signaling across a core network.

   Please send comments to te-wg@ops.ietf.org

Table of Contents

   1. Abstract
   2. Conventions used in this document
   3. Introduction
   4. Terminology and Concepts
      4.1 Hierarchy
          4.1.1 Vertical Hierarchy
          4.1.2 Horizontal Hierarchy
      4.2 Survivability Terminology
          4.2.1 Survivability
          4.2.2 Generic Operations
          4.2.3 Survivability Techniques
          4.2.4 Survivability Performance
      4.3 Survivability Mechanisms: Comparison
   5. Survivability
      5.1 Scope
      5.2 Required initial set of survivability mechanisms
          5.2.1 1:1 Path Protection with Pre-Established Capacity
          5.2.2 1:1 Path Protection with Pre-Planned Capacity
          5.2.3 Local Restoration
          5.2.4 Path Restoration
      5.3 Applications Supported
      5.4 Timing Bounds for Survivability Mechanisms
      5.5 Coordination Among Layers
      5.6 Evolution Toward IP Over Optical
   6. Hierarchy Requirements
      6.1 Historical Context
      6.2 Applications for Horizontal Hierarchy
      6.3 Horizontal Hierarchy Requirements
   7. Survivability and Hierarchy
   8. Security Considerations
   9. References
   10. Acknowledgments
   11. Author's Addresses
   Appendix A: Questions used to help develop requirements
   Full Copyright Statement

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].

3. Introduction

   This document presents a proposal of the tangible requirements for
   network survivability and hierarchy in current service provider
   environments.  With feedback solicited from the working group, the
   objective is to help focus the work being addressed in the TEWG
   (Traffic Engineering Working Group), CCAMP (Common Control and
   Measurement Plane Working Group), and other working groups.  A main
   goal of this work is to expedite the delivery of required
   functionality in multi-vendor service provider networks.  The
   initial focus is primarily on intra-domain operations.
   However, to maintain consistency in the provision of end-to-end
   service in a multi-provider environment, rules governing the
   operation of survivability mechanisms at domain boundaries must
   also be specified.  While such issues are raised and discussed
   where appropriate, they are not treated in depth in this initial
   release of the document.

   The document first develops a set of definitions to be used later
   in this document, and potentially in other documents as well.  It
   then addresses the requirements and issues associated with service
   restoration and hierarchy, and closes with a short discussion of
   survivability in a hierarchical context.

   Here is a summary of the findings:

   A. Survivability Requirements

   @ need to define a small set of interoperable survivability
     approaches in packet and non-packet networks
   @ suggested survivability mechanisms include
     - 1:1 path protection with pre-established backup capacity
       (non-shared)
     - 1:1 path protection with pre-planned backup capacity (shared)
     - local restoration with repairs in proximity to the network
       fault
     - path restoration through source-based rerouting
   @ timing bounds for service restoration to support voice call
     cutoff (140 msec to 2 sec), protocol timer requirements in
     premium data services, and mission critical applications
   @ use of restoration priority for service differentiation

   B. Hierarchy Requirements

   B.1. Horizontally Oriented Hierarchy (Intra-Domain)

   @ ability to set up many LSPs in a service provider network with a
     hierarchical IGP, for the support of layer 2 and layer 3 VPN
     services
   @ requirements for multi-area traffic engineering need to be
     developed to provide guidance for any necessary protocol
     extensions
   B.2. Vertically Oriented Hierarchy

   The following functionality for survivability is common on most
   routing equipment today.

   @ near-term need is some loose form of coordination and
     communication based on the use of nested hold-off timers, instead
     of direct exchange of signaling and routing between vertical
     layers
   @ a means for an upper layer to immediately begin recovery actions
     in the event that a lower layer is not configured to perform
     recovery

   C. Survivability Requirements in Horizontal Hierarchy

   @ protection of an end-to-end connection is based on a concatenated
     set of connections, each protected within its own area
   @ mechanisms for connection routing may include (1) a network
     element that participates on both sides of a boundary (e.g., an
     OSPF ABR) - note that this is a common point of failure; (2) a
     route server
   @ need for inter-area signaling of survivability information (1) to
     enable a "least common denominator" survivability mechanism at
     the boundary; (2) to convey the success or failure of the service
     restoration action; e.g., if a part of a "connection" is down on
     one side of a boundary, there is no need for the other side to
     recover from failures

4. Terminology and Concepts

4.1 Hierarchy

   Hierarchy is a technique for building scalable, complex systems.
   It is based on abstraction: at each level, only what is most
   significant is retained from the details and internal structures of
   the levels further away.  This approach exploits a general property
   of hierarchical systems composed of related subsystems: the
   interaction between subsystems decreases as their separation in the
   hierarchy increases.

   Network hierarchy is an abstraction of part of a network's
   topology, routing, and signaling mechanisms.
   Abstraction may be used as a mechanism to build large networks or
   as a technique for enforcing administrative, topological, or
   geographic boundaries.  For example, network hierarchy might be
   used to separate the metropolitan and long-haul regions of a
   network, to separate the regional and backbone sections of a
   network, or to interconnect service provider networks (with BGP,
   which reduces a network to an Autonomous System).

   In this document, network hierarchy is considered from two
   perspectives:

   (1) Vertically oriented: between two network technology layers
   (2) Horizontally oriented: between two areas or administrative
       subdivisions within the same network technology layer

4.1.1 Vertical Hierarchy

   Vertical hierarchy is the abstraction, or reduction in information,
   that is of benefit when communicating information across network
   technology layers, as in propagating information between optical
   and router networks.

   In the vertical hierarchy, the total network functions are
   partitioned into a series of functional or technological layers,
   with a clear logical, and maybe even physical, separation between
   adjacent layers.  Survivability mechanisms either currently exist
   or are being developed at multiple layers in networks [3].  The
   optical layer is now becoming capable of providing dynamic ring and
   mesh restoration functionality, in addition to traditional 1+1 or
   1:1 protection.  The SDH/SONET layer provides survivability
   capability with automatic protection switching (APS), as well as
   self-healing ring and mesh restoration architectures.  Similar
   functionality has been defined in the ATM layer, with work ongoing
   to also provide such functionality using MPLS [4].
   At the IP layer, rerouting is used to restore service continuity
   following link and node outages.  Rerouting at the IP layer,
   however, occurs only after a period of routing convergence, which
   may require from a few seconds to several minutes to complete.

4.1.2 Horizontal Hierarchy

   Horizontal hierarchy is the abstraction that allows a network at
   one technology layer, for instance a packet network, to scale.
   Examples of horizontal hierarchy include BGP confederations,
   separate Autonomous Systems, and multi-area OSPF.

   In the horizontal hierarchy, a large network is partitioned into
   multiple smaller, non-overlapping sub-networks.  The partitioning
   criteria can be based on topology, network function, administrative
   policy, or service domain demarcation.  Two networks at the *same*
   hierarchical level, e.g., two Autonomous Systems in BGP, may share
   a peer relation with each other through some loose form of
   coupling.  On the other hand, for routing in large networks using
   multi-area OSPF, abstraction is achieved through the aggregation of
   routing information over a hierarchical partitioning of the
   network.
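   The nested hold-off timers mentioned in the summary of findings
   (Section B.2) coordinate the per-layer recovery mechanisms of
   Section 4.1.1: each layer waits long enough for the layers below it
   to attempt recovery first.  The sketch below is illustrative only
   and is not part of the design team's deliverable; the layer names
   and recovery times are hypothetical, and a real deployment would
   derive hold-off values from the measured protection switch times of
   its lower layers.

```python
# Sketch of nested hold-off timers for multilayer recovery: each layer
# holds off until all layers below it have had a chance to recover.
# Layer names and per-layer recovery times (ms) are illustrative.

def recovery_schedule(layers):
    """Given (layer, recovery_time_ms) pairs ordered bottom-up, return
    the hold-off timer (ms) each layer should use before starting its
    own recovery actions."""
    schedule = {}
    hold_off = 0
    for layer, recovery_ms in layers:
        schedule[layer] = hold_off
        hold_off += recovery_ms  # the next layer up also waits for this one
    return schedule

layers = [("optical", 50), ("SONET", 60), ("MPLS", 100)]
print(recovery_schedule(layers))
# the MPLS layer holds off until optical and SONET have both tried
```

   If a lower layer is known not to be configured for recovery, its
   entry would simply be omitted, letting the upper layer begin
   recovery immediately, as the findings require.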
4.2 Survivability Terminology

   In alphabetical order, the following terms are defined in this
   section:

   backup entity, same as protection entity (section 4.2.2)
   extra traffic (section 4.2.2)
   non-revertive mode (section 4.2.2)
   normalization (section 4.2.2)
   preemptable traffic, same as extra traffic (section 4.2.2)
   preemption priority (section 4.2.4)
   protection (section 4.2.3)
   protection entity (section 4.2.2)
   protection switching (section 4.2.3)
   protection switch time (section 4.2.4)
   recovery (section 4.2.2)
   recovery by rerouting, same as restoration (section 4.2.3)
   recovery entity, same as protection entity (section 4.2.2)
   restoration (section 4.2.3)
   restoration priority (section 4.2.4)
   restoration time (section 4.2.4)
   revertive mode (section 4.2.2)
   shared risk group (SRG) (section 4.2.2)
   survivability (section 4.2.1)
   working entity (section 4.2.2)

4.2.1 Survivability

   Survivability is the capability of a network to maintain service
   continuity in the presence of faults within the network [5].
   Survivability mechanisms such as protection and restoration are
   implemented on a per-link basis, on a per-path basis, or throughout
   an entire network, to alleviate service disruption at affordable
   cost.  The degree of survivability is determined by the network's
   capability to survive single failures, multiple failures, and
   equipment failures.

4.2.2 Generic Operations

   This document does not discuss the sequence of events by which
   network failures are monitored, detected, and mitigated.  For more
   detail on this aspect, see [4].  The repair process following a
   failure is also outside the scope of this document.

   A working entity is the entity that is used to carry traffic in
   normal operation mode.
   Depending on the context, an entity can be a channel or a
   transmission link in the physical layer, an LSP in MPLS, or a
   logical bundle of one or more LSPs.

   A protection entity, also called a backup entity or recovery
   entity, is the entity that is used to carry protected traffic in
   recovery operation mode, i.e., when the working entity is in error
   or has failed.

   Extra traffic, also referred to as preemptable traffic, is the
   traffic carried over the protection entity while the working entity
   is active.  Extra traffic is not protected, i.e., when the
   protection entity is required to protect the traffic that is being
   carried over the working entity, the extra traffic is preempted.

   A shared risk group (SRG) is a set of network elements that are
   collectively impacted by a specific fault or fault type.  For
   example, a shared risk link group (SRLG) is the union of all the
   links on those fibers that are routed in the same physical conduit
   in a fiber-span network.  This concept covers, besides shared
   conduit, other types of compromise such as shared fiber cable,
   shared right of way, shared optical ring, and shared office without
   power sharing.  The span of an SRG, such as the length of the
   sharing for compromised outside plant, needs to be considered on a
   per-fault basis.  The concept of an SRG can be extended to
   represent a "risk domain" and its associated capabilities and
   summarization for traffic engineering purposes.  See [6] for
   further discussion.

   Normalization is the sequence of events and actions taken by a
   network that returns the network to the preferred state upon
   completing repair of a failure.  This could include the switching
   or rerouting of affected traffic to the original repaired working
   entities or to new routes.
   Revertive mode refers to the case where traffic is automatically
   returned to a repaired working entity (also called switch back).

   Recovery is the sequence of events and actions taken by a network
   after the detection of a failure to maintain the required
   performance level for existing services (e.g., according to service
   level agreements) and to allow normalization of the network.  The
   actions include notification of the failure, followed by two
   parallel processes: (1) a repair process with fault isolation and
   repair of the failed components, and (2) a reconfiguration process
   using survivability mechanisms to maintain service continuity.  In
   protection, reconfiguration involves switching the affected traffic
   from a working entity to a protection entity.  In restoration,
   reconfiguration involves path selection and rerouting for the
   affected traffic.

   Revertive mode is a procedure in which revertive action, i.e.,
   switch back from the protection entity to the working entity, is
   taken once the failed working entity has been repaired.  In
   non-revertive mode, such action is not taken.  To minimize service
   interruption, switch-back in revertive mode should be performed at
   a time when it has the least impact on the traffic concerned, or by
   using the make-before-break concept.

   Non-revertive mode is used where there is no preferred path, or
   where it is desirable to minimize further disruption of the service
   brought on by a revertive switching operation.  A switch-back to
   the original working path may be undesired or impossible, since the
   original path may no longer exist after the occurrence of a fault
   on that path.
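   The SRG/SRLG definition in this section lends itself to a simple
   programmatic test: a working and a protection path are SRG-disjoint
   only if no shared risk group contains a link of both.  The sketch
   below is illustrative and not part of this document's requirements;
   the link names and conduit assignments are hypothetical.

```python
# Sketch: checking whether a candidate protection path shares any
# shared risk link group (SRLG) with the working path.  The SRLG
# assignments below are purely illustrative.

def srlgs_of(path, link_srlgs):
    """Union of SRLG identifiers over all links of a path."""
    groups = set()
    for link in path:
        groups |= link_srlgs.get(link, set())
    return groups

def srg_disjoint(working, protection, link_srlgs):
    """True if the two paths have no SRLG in common."""
    return not (srlgs_of(working, link_srlgs)
                & srlgs_of(protection, link_srlgs))

link_srlgs = {
    "A-B": {"conduit-1"},
    "B-C": {"conduit-2"},
    "A-D": {"conduit-1"},   # same conduit as link A-B
    "D-C": {"conduit-3"},
    "A-E": {"conduit-4"},
    "E-C": {"conduit-5"},
}

print(srg_disjoint(["A-B", "B-C"], ["A-D", "D-C"], link_srlgs))  # False
print(srg_disjoint(["A-B", "B-C"], ["A-E", "E-C"], link_srlgs))  # True
```

   Note that even though A-B and A-D are distinct links, they fail
   together under a conduit cut, which is exactly the compromise the
   SRLG concept is meant to capture.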
4.2.3 Survivability Techniques

   Protection, also called protection switching, is a survivability
   technique based on predetermined failure recovery: as the working
   entity is established, a protection entity is also established.
   Protection techniques can be implemented by several architectures:
   1+1, 1:1, 1:n, and m:n.  In the context of SDH/SONET, they are
   referred to as Automatic Protection Switching (APS).

   In the 1+1 protection architecture, a protection entity is
   dedicated to each working entity.  A dual-feed mechanism is used,
   whereby the working entity is permanently bridged onto the
   protection entity at the source of the protected domain.  In normal
   operation mode, identical traffic is transmitted simultaneously on
   both the working and protection entities.  At the other end (sink)
   of the protected domain, both feeds are monitored for alarms and
   maintenance signals.  A selection between the working and
   protection entity is made based on some predetermined criteria,
   such as the transmission performance requirements or defect
   indication.

   In the 1:1 protection architecture, a protection entity is also
   dedicated to each working entity.  The protected traffic is
   normally transmitted by the working entity.  When the working
   entity fails, the protected traffic is switched to the protection
   entity.  The two ends of the protected domain must signal detection
   of the fault and initiate the switchover.

   In the 1:n protection architecture, a dedicated protection entity
   is shared by n working entities.  In this case, not all of the
   affected traffic may be protected.

   The m:n architecture is a generalization of the 1:n architecture:
   m dedicated protection entities, typically with m <= n, are shared
   by n working entities.
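   When several of the n working entities in a 1:n arrangement fail
   at once, only one can use the shared protection entity; as noted in
   the comparison section below, this contention can be resolved by
   per-entity priorities.  The following sketch is illustrative only
   (entity names and priority values are hypothetical, with a lower
   number meaning higher priority):

```python
# Sketch of contention resolution in 1:n protection: one protection
# entity is shared by n working entities; when several fail at once,
# the failed entity with the highest priority (lowest number) wins.

def select_protected(failed, priority):
    """Pick which failed working entity gets the shared protection
    entity; the remaining failed entities stay unprotected."""
    if not failed:
        return None
    return min(failed, key=lambda entity: priority[entity])

priority = {"w1": 2, "w2": 0, "w3": 1}
print(select_protected({"w1", "w3"}, priority))        # w3
print(select_protected({"w1", "w2", "w3"}, priority))  # w2
```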
   Restoration, also referred to as recovery by rerouting [4], is a
   survivability technique that establishes new paths or path segments
   on demand, to restore affected traffic after the occurrence of a
   fault.  The resources in these alternate paths are the currently
   unassigned (unreserved) resources in the same layer.  Preemption of
   extra traffic may also be used if spare resources are not available
   to carry the higher-priority protected traffic.  Initiated by
   detection of a fault on the working path, the selection of a
   recovery path may be based on preplanned configurations, network
   routing policies, or current network status such as network
   topology and fault information.  Signaling is used to establish the
   new paths that bypass the fault.  Thus, restoration involves a path
   selection process followed by rerouting of the affected traffic
   from the working entity to the recovery entity.

4.2.4 Survivability Performance

   Protection switch time is the time interval from the occurrence of
   a network fault until the completion of the protection-switching
   operations.  It includes the detection time necessary to initiate
   the protection switch, any hold-off time to allow for interworking
   of protection schemes, and the switch completion time.

   Restoration time is the time interval from the occurrence of a
   network fault to the instant when the affected traffic is either
   completely restored, or until spare resources are exhausted and/or
   no more extra traffic exists that can be preempted to make room.

   Restoration priority is a method of giving preference to
   higher-priority traffic so that it is protected ahead of
   lower-priority traffic.  It helps determine the order in which
   traffic is restored after a failure has occurred.
   The purpose is to differentiate service restoration time, as well
   as to control access to available spare capacity, for different
   classes of traffic.

   Preemption priority is a method of determining which traffic can be
   disconnected in the event that not all traffic with a higher
   restoration priority is restored after the occurrence of a failure.

4.3 Survivability Mechanisms: Comparison

   In a survivable network design, spare capacity and diversity must
   be built into the network from the beginning to support some degree
   of self-healing whenever failures occur.  A common strategy is to
   associate each working entity with a protection entity having
   either dedicated resources or shared resources that are
   pre-reserved or reserved on demand.  Different approaches to
   providing survivability can be classified according to the method
   of setting up the protection entity.  Generally, protection
   techniques are based on having a dedicated protection entity set up
   prior to failure.  Such is not the case in restoration techniques,
   which mainly rely on the use of spare capacity in the network.
   Hence, in terms of trade-offs, protection techniques usually offer
   fast recovery from failure with enhanced availability, while
   restoration techniques usually achieve better resource utilization.

   A 1+1 protection architecture is rather expensive, since resource
   duplication is required for the working and protection entities.
   It is generally used for specific services that need very high
   availability.

   A 1:1 architecture is inherently slower in recovering from failure
   than a 1+1 architecture, since communication between both ends of
   the protection domain is required to perform the switch-over
   operation.  An advantage is that the protection entity can
   optionally be used to carry low-priority extra traffic in normal
   operation, if traffic preemption is allowed.
   Packet networks can pre-establish a protection path for later use
   with pre-planned but not pre-reserved capacity.  That is, if no
   packets are sent onto a protection path, then no bandwidth is
   consumed.  This is not the case in transmission networks such as
   optical or TDM, where path establishment and resource reservation
   cannot be decoupled.

   In the 1:n protection architecture, traffic is normally sent on the
   working entities.  When multiple working entities have failed
   simultaneously, only one of them can be restored by the common
   protection entity.  This contention can be resolved by assigning a
   different preemptive priority to each working entity.  As in the
   1:1 case, the protection entity can optionally be used to carry
   preemptable traffic in normal operation.

   While the m:n architecture can improve system availability with
   small cost increases, it has rarely been implemented or
   standardized.

   When compared with protection mechanisms, restoration mechanisms
   are generally more frugal, as no resources are committed until
   after the fault occurs and the location of the fault is known.
   However, restoration mechanisms are inherently slower, since more
   must be done following the detection of a fault.  Also, the time it
   takes for the dynamic selection and establishment of alternate
   paths may vary, depending on the amount of traffic and the number
   of connections to be restored, and is influenced by the network
   topology, the technology employed, and the type and severity of the
   fault.  As a result, restoration time tends to be more variable
   than the protection switch time needed with pre-selected protection
   entities.  Hence, in using restoration mechanisms, it is essential
   to use restoration priority to ensure that service objectives are
   met cost-effectively.
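   The role of restoration priority described above can be made
   concrete with a short sketch: after a fault, affected connections
   are restored in priority order until the spare capacity runs out.
   The sketch is illustrative only; the connection names, priority
   values, and bandwidths are hypothetical.

```python
# Sketch of restoration priority: connections affected by a fault are
# restored in priority order (lower number = restored first) until
# spare capacity is exhausted.  All values are illustrative.

def restoration_order(affected, spare_capacity):
    """affected: list of (name, priority, bandwidth) tuples.
    Returns the names of the connections actually restored."""
    restored = []
    for name, _prio, bw in sorted(affected, key=lambda c: c[1]):
        if bw <= spare_capacity:
            spare_capacity -= bw
            restored.append(name)
    return restored

affected = [("best-effort", 3, 40), ("voice", 0, 30),
            ("premium-data", 1, 50), ("bulk", 2, 60)]
print(restoration_order(affected, spare_capacity=100))
# voice and premium-data are restored; bulk and best-effort wait for
# repair or for extra traffic to be preempted
```

   A real implementation would also apply preemption priority, as
   defined in Section 4.2.4, to decide which lower-priority traffic to
   disconnect when higher-restoration-priority traffic cannot fit.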
   Once the network routing algorithms have converged after a fault,
   it may be preferable, in some cases, to reoptimize the network by
   performing a reroute based on the current state of the network and
   network policies.

5. Survivability

5.1 Scope

   Interoperable approaches to network survivability were determined
   to be an immediate requirement in packet networks as well as in
   SDH/SONET framed TDM networks.  Not as pressing at this time were
   techniques that would cover all-optical networks (e.g., where
   framing is unknown), as the control of these networks in a
   multi-vendor environment appears to face other hurdles that must be
   dealt with first.  Also not of immediate interest were approaches
   to coordinate or explicitly communicate survivability mechanisms
   across network layers (such as from a TDM or optical network
   to/from an IP network).  However, a capability should be provided
   for a network operator to perform fault notification and to control
   the operation of survivability mechanisms among different layers.
   This may require the development of corresponding OAM
   functionality.  However, such issues and those related to OAM are
   currently outside the scope of this document.  (For proposed MPLS
   OAM requirements, see [7, 8].)

   The initial scope is to address only "backhoe failures" in the
   inter-office connections of a service provider network.  A link
   connection in the router layer typically comprises multiple spans
   in the lower layers.  Therefore, the types of network failures that
   cause a recovery to be performed include link/span failures.
   However, linecard and node failures may not need to be treated any
   differently than their respective link/span failures, as a router
   failure may be represented as a set of simultaneous link failures.
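   The observation that a router failure may be represented as a set
   of simultaneous link failures can be sketched directly.  The
   topology and router names below are hypothetical and serve only to
   illustrate the idea:

```python
# Sketch: modelling a node (router) failure as the simultaneous
# failure of all of its incident links, per the Scope section.
# The topology is illustrative.

def links_failed_by_node(topology, node):
    """Return the set of links (as sorted endpoint tuples) lost when
    the given node fails."""
    return {tuple(sorted((node, nbr))) for nbr in topology[node]}

topology = {"R1": ["R2", "R3"], "R2": ["R1", "R3"], "R3": ["R1", "R2"]}
print(links_failed_by_node(topology, "R1"))
# a failure of R1 looks to the rest of the network like the loss of
# links R1-R2 and R1-R3 at the same time
```

   A survivability mechanism that handles this set of simultaneous
   link failures therefore also covers the node failure, which is why
   node failures may not need separate treatment.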
   Depending on the actual network configuration, a drop-side
   interface (e.g., between a customer and an access router, or
   between a router and an optical cross-connect) may be considered
   either inter-domain or inter-layer.  Another inter-domain scenario
   is the use of intra-office links for interconnecting a metro
   network and a core network, with both networks being administered
   by the same service provider.  Failures at such interfaces may be
   similarly protected by the mechanisms of this section.

   Other, more complex failure mechanisms, such as systematic
   control-plane failure, configuration error, or breach of security,
   are not within the scope of the survivability mechanisms discussed
   in this document.  Network impairments such as congestion, which
   result in lower throughput, are also not covered.

5.2 Required initial set of survivability mechanisms

5.2.1 1:1 Path Protection with Pre-Established Capacity

   In this protection mode, the head end of a working connection
   establishes a protection connection to the destination.  There
   should be the ability to maintain relative restoration priorities
   between working and protection connections, as well as between
   different classes of protection connections.

   In normal operation, traffic is sent only on the working
   connection, though the ability to signal that traffic will be sent
   on both connections (1+1 Path for signaling purposes) would be
   valuable in non-packet networks.  Some distinction between working
   and protection connections is likely, either through explicit
   objects, or preferably through implicit methods such as general
   classes or priorities.  Head ends need the ability to create
   connections that are as failure-disjoint as possible from each
   other.  This requires SRG information that can be generally
   assigned to either nodes or links and propagated through the
   control or management plane.
In this mechanism, capacity in the protection connection is pre-
established; however, it should be capable of carrying preemptable
extra traffic in non-packet networks. When protection capacity is
called into service during recovery, there should be the ability to
promote the protection connection to working status (for non-
revertive mode operation) with some form of make-before-break
capability.

5.2.2 1:1 Path Protection with Pre-Planned Capacity

As with 1:1 protection with pre-established capacity, the
protection connection in this case is also pre-signaled. The
difference is in the way protection capacity is assigned. With pre-
planned capacity, the mechanism supports the ability for the
protection capacity to be shared, or "double-booked". Operators
need the ability to provision different amounts of protection
capacity according to expected failure modes and service level
agreements. Thus, an operator may wish to provision sufficient
restoration capacity to handle a single failure affecting all
connections in an SRG, or may wish to provision less or more
restoration capacity. Mechanisms should be provided to allow
restoration capacity on each link to be shared by SRG-disjoint
failures. In a sense, this is 1:1 from a path perspective; however,
the protection capacity in the network (on a link-by-link basis) is
shared in a 1:n fashion, e.g., see the proposals in [9, 10]. If
capacity is planned but not allocated, some form of signaling could
be required before traffic may be sent on protection connections,
especially in TDM networks.

The use of this approach improves network resource utilization, but
may require more careful planning.
So, initial deployment might be based on 1:1 path protection with
pre-established capacity and the local restoration mechanism
described next.

5.2.3 Local Restoration

Due to the time impact of signal propagation, dynamic recovery of an
entire path may not meet the service requirements of some networks.
The solution to this is to restore connectivity of the link or span
in immediate proximity to the fault, e.g., see the proposals in [11,
12]. At a minimum, this approach should be able to protect against
connectivity-type SRGs, though protecting against node-based SRGs
might be worthwhile. Also, this approach is applicable to support
restoration in the inter-domain and inter-layer interconnection
scenarios using intra-office links, as described in the Scope
section.

Head end systems must have some control over whether their
connections are candidates for, or excluded from, local restoration.
For example, best-effort and preemptable traffic may be excluded
from local restoration; such traffic is only restored if there is
bandwidth available. This type of control may require the
definition of an object in signaling.

Since local restoration may be suboptimal, a means for head end
systems to later perform path-level re-grooming must be supported
for this approach.

5.2.4 Path Restoration

In this approach, connections that are impacted by a fault are
rerouted by the originating network element upon notification of
connection failure. Such a source-based approach is efficient in
its use of network resources, but typically takes longer to
accomplish restoration. It does not involve any new mechanisms; it
is merely another common approach to protecting against faults in a
network.
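The capacity sharing described in Section 5.2.2 can be sketched as sizing each link's restoration reservation for the worst single-SRG failure rather than for the sum of all backup paths routed over it. This is an illustrative model under a single-failure assumption, not a mechanism from any of the cited proposals; all names are invented for the example.

```python
# Sketch of "double-booked" restoration capacity: backup paths that
# protect SRG-disjoint failures may share capacity on a link, so
# the reservation needed on each link is the maximum over single-
# SRG failure scenarios, not the sum.
from collections import defaultdict

def shared_backup_capacity(backups):
    """backups: list of (srg, bandwidth, backup_links) tuples, one
    per protected connection; returns {link: capacity_to_reserve}."""
    per_link_per_srg = defaultdict(lambda: defaultdict(int))
    for srg, bw, links in backups:
        for link in links:
            per_link_per_srg[link][srg] += bw
    # Reserve enough for the worst single-SRG failure on each link.
    return {link: max(by_srg.values())
            for link, by_srg in per_link_per_srg.items()}

backups = [
    ("srg1", 10, ["L1", "L2"]),   # backup for a connection in SRG 1
    ("srg2", 7,  ["L2", "L3"]),   # backup for a connection in SRG 2
]
# On shared link L2, 10 units suffice (SRG 1 and SRG 2 are assumed
# not to fail together), instead of the 17 that 1:1 pre-established
# reservations would require.
assert shared_backup_capacity(backups)["L2"] == 10
```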
5.3 Applications Supported

With service continuity under failure as a goal, a network is
"survivable" if, in the face of a network failure, connectivity is
interrupted for a "brief" period and then recovered before the
network failure ends. The length of this interrupted period depends
on the application supported. Here are some typical applications
and considerations that drive the requirements for an acceptable
protection switch time or restoration time:

- Best-effort data: recovery of network connectivity by rerouting at
  the IP layer would be sufficient
- Premium data service: need to meet TCP timeout or application
  protocol timer requirements
- Voice: call cutoff is in the range of 140 msec to 2 sec (the time
  that a person waits after interruption of the speech path before
  hanging up, or the time that a telephone switch will disconnect a
  call)
- Other real-time services (e.g., streaming, fax) where an
  interruption would cause the session to terminate
- Mission-critical applications that cannot tolerate even brief
  interruptions, for example, real-time financial transactions

5.4 Timing Bounds for Survivability Mechanisms

The approach to picking the types of survivability mechanisms
recommended was to consider a spectrum of mechanisms that can be
used to protect traffic with varying characteristics of
survivability and speed of protection/restoration, and then attempt
to select a few general points which provide some coverage across
that spectrum. The focus of this work is to provide requirements
from which a small set of detailed proposals may be developed,
allowing the operator some (limited) flexibility in approaches to
meeting their design goals in engineering multi-vendor networks.
Requirements of different applications as listed in the previous
sub-section were discussed in general terms; however, no one on the
team would attest to the scientific rigor with which the timing
bounds below match any specific application's needs. A few
assumptions include:

1. Approaches that perform a protection switch without propagating
   information are likely to be faster than those that require some
   form of fault notification to some or all elements in a network.
2. Approaches that require some form of signaling after a fault will
   also likely suffer some timing impact.

Proposed timing bounds for different survivability mechanisms are as
follows (all bounds are exclusive of signal propagation):

   1:1 path protection with pre-established capacity: 100-500 ms
   1:1 path protection with pre-planned capacity:     100-750 ms
   Local restoration:                                 50 ms
   Path restoration:                                  1-5 seconds

To ensure that the service requirements for different applications
can be met within the above timing bounds, restoration priority must
be implemented to determine the order in which connections are
restored (to minimize service restoration time as well as to gain
access to available spare capacity on the best paths). For example,
mission-critical applications may require high restoration priority.
At the fiber layer, instead of specific applications, it may be
possible that priority be given to certain classifications of
customers, with their traffic types enclosed within the customer
aggregate. Preemption priority should only be used in the event
that not all connections can be restored, in which case connections
with lower preemption priority should be released.
Depending on a service provider's strategy in provisioning network
resources for backup, preemption may or may not be needed in the
network.

5.5 Coordination Among Layers

A common design goal for networks with multiple technological layers
is to provide the desired level of service in the most cost-
effective manner. Multilayer survivability may allow the
optimization of spare resources through the improvement of resource
utilization by sharing spare capacity across different layers,
though further investigation is needed. Coordination during
recovery among different network layers (e.g., IP, SDH/SONET, the
optical layer) might necessitate the development of vertical
hierarchy. The benefits of providing survivability mechanisms at
multiple layers, and the optimization of the overall approach, must
be weighed against the associated cost and service impacts.

A default coordination mechanism for inter-layer interaction could
be the use of nested timers and current SDH/SONET fault monitoring,
as has been done traditionally for backward compatibility. Thus,
when lower-layer recovery takes longer than higher-layer recovery, a
hold-off timer is utilized to avoid contention between the different
single-layer survivability schemes. In other words, multilayer
interaction is addressed by having successively higher multiplexing
levels operate at a protection/restoration time scale greater than
that of the next lowest layer. This can impact the overall time to
recover service. For example, if SDH/SONET protection switching is
used, MPLS recovery timers must wait until SDH/SONET has had time to
switch.
Setting such timers involves a tradeoff between rapid recovery and
the creation of a race condition in which multiple layers respond to
the same fault, potentially allocating resources in an inefficient
manner.

In other configurations, where the lower layer does not have a
restoration capability or is not expected to protect, say an
unprotected SDH/SONET linear circuit, there must be a mechanism for
the lower layer to trigger the higher layer to take recovery actions
immediately. This difference in network configuration means that
implementations must allow for adjustment of hold-off timer values
and/or a means for a lower layer to immediately indicate to a higher
layer that a fault has occurred, so that the higher layer can take
restoration or protection actions.

Furthermore, faults at higher layers should not trigger restoration
or protection actions at lower layers [3, 4].

It was felt that the current approach to coordination of
survivability mechanisms does not have significant operational
shortfalls. These approaches include protecting traffic solely at
one layer (e.g., at the IP layer over linear WDM, or at the
SDH/SONET layer). Where survivability mechanisms might be deployed
at several layers, such as when a routed network rides a SDH/SONET-
protected network, it was felt that current coordination approaches
were sufficient in many cases. One exception is the hold-off of
MPLS recovery until the completion of SDH/SONET protection
switching, as described above, which limits the recovery time of
fast MPLS restoration. Also, by design, the operations and
mechanisms within a given layer tend to be invisible to other
layers.
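The coordination rules above (nested hold-off timer by default, immediate trigger when the lower layer will not protect) can be sketched as a simple decision function. This is a minimal illustration of the stated requirements, not an implementation of any standard; the function name and timer values are invented for the example.

```python
# Sketch of the inter-layer coordination rules of Section 5.5: the
# higher layer holds off for a configurable period so the lower
# layer can attempt recovery first, unless the lower layer signals
# that it will not protect, in which case the higher layer acts
# immediately. Timer values below are illustrative only.

def higher_layer_should_act(elapsed_ms, hold_off_ms,
                            lower_layer_recovered,
                            lower_layer_unprotected):
    """Decide whether the higher layer starts its own recovery."""
    if lower_layer_unprotected:
        return True                   # immediate trigger, no hold-off
    if lower_layer_recovered:
        return False                  # contention avoided
    return elapsed_ms >= hold_off_ms  # nested-timer default

# SDH/SONET protection switching completes within the hold-off:
assert not higher_layer_should_act(60, 100, True, False)
# Lower layer is an unprotected linear circuit: act at once.
assert higher_layer_should_act(0, 100, False, True)
# Hold-off expires without lower-layer recovery: MPLS recovers.
assert higher_layer_should_act(100, 100, False, False)
```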
5.6 Evolution Toward IP Over Optical

As more pressing requirements for survivability and horizontal
hierarchy for edge-to-edge signaling are met with technical
proposals, it is believed that the benefits of merging (in some
manner) the control planes of multiple layers will be outlined.
When these benefits are self-evident, it would then seem to be the
right time to review whether vertical hierarchy mechanisms are
needed, and what the requirements might be. For example, a future
requirement might be to provide a better match between the recovery
requirements of IP networks and the recovery capability of optical
transport. One such proposal is described in [13].

6. Hierarchy Requirements

Efforts in the area of network hierarchy should focus on mechanisms
that would allow more scalable edge-to-edge signaling, or signaling
across networks with existing network hierarchy (such as multi-area
OSPF). This appears to be a more urgent need than mechanisms that
might be needed to interconnect networks at different layers.

6.1 Historical Context

One reason for horizontal hierarchy is functionality (e.g., metro
versus backbone). Geographic "islands" or partitions reduce the
need for interoperability and make administration and operations
less complex. Using a simpler, more interoperable survivability
scheme at metro/backbone boundaries is natural for many provider
network architectures. In transmission networks, creating
geographic islands of different vendor equipment has long been done
because multi-vendor interoperability has been difficult to achieve.
Traditionally, providers have to coordinate the equipment on either
end of a "connection," and making this interoperable reduces
complexity.
A provider should be able to concatenate survivability mechanisms in
order to provide a "protected link" to the next higher level. Think
of SDH/SONET rings connecting to TDM DXCs with 1+1 line-layer
protection between the ADM and the DXC port. The TDM connection,
e.g., a DS3, is protected, but usually all equipment on each
SDH/SONET ring is from a single vendor. The DXC cross-connections
are controlled by the provider, and the ports are physically
protected, resulting in a highly available design. Thus,
concatenation of survivability approaches can be used to cascade
across horizontal hierarchy. While not perfect, it is workable in
the near- to mid-term until multi-vendor interoperability is
achieved.

While the problems associated with multi-vendor interoperability may
necessitate horizontal hierarchy as a practical matter in the near
to mid-term (at least this has been the case in TDM networks), there
should not be a technical reason for it in the standards developed
by the IETF for core networks, or even most access networks.
Establishing interoperability of survivability mechanisms between
multi-vendor equipment in core IP networks is urgently required to
enable adoption of IP as a viable core transport technology and to
facilitate the traffic engineering of future multi-service IP
networks [3].

Some of the largest service provider networks currently run a single
area/level IGP. Some service providers, as well as many large
enterprise networks, run multi-area OSPF to gain increases in
scalability. Often, this was part of the original design, so it is
difficult to say whether the network truly required the hierarchy to
reach its current size.

Some proposals on improved mechanisms to address network hierarchy
have been suggested [14, 15, 16, 17, 18].
This document aims to provide concrete requirements so that these
and other proposals can first aim to meet some limited objectives.

6.2 Applications for Horizontal Hierarchy

A primary driver for intra-domain horizontal hierarchy is signaling
capability in the context of edge-to-edge VPNs, potentially across
traffic-engineered data networks. There are a number of different
approaches to layer 2 and layer 3 VPNs, and they are currently being
addressed by different emerging protocols in the provider-
provisioned VPN (e.g., virtual routers) and Pseudo Wire Edge-to-Edge
Emulation (PWE3) efforts, based on either MPLS and/or IP tunnels.
These may or may not need explicit signaling from edge to edge, but
it is a common perception that, in order to meet SLAs, some form of
edge-to-edge signaling may be required.

With a large number of edges (N), scalability is concerned with
avoiding the O(N^2) properties of edge-to-edge signaling. The main
issue here is not the scalability of large amounts of signaling as
such, e.g., in O(N^2) meshes with a "connection" between every
edge-pair. Rather, even if establishing and maintaining connections
is feasible in a large network, there might be an impact on core
survivability mechanisms which would cause protection/restoration
times to grow with N^2, which would be undesirable. While some
value of N may be inevitable, approaches to reduce N (e.g., to pull
in from the edge to aggregation points) might be of value.

Thus, most service providers feel that O(N^2) meshes are not
necessary for VPNs, and that the number of tunnels to support VPNs
would be within the scalability bounds of current protocols and
implementations.
While that may be the case, there is currently no ability to signal
MPLS tunnels from edge to edge across IGP hierarchy, such as OSPF
areas. This may require the development of signaling standards that
support dynamic establishment, and potentially restoration, of LSPs
across a 2-level IGP hierarchy.

For routing scalability, especially in data applications, a major
concern is the amount of processing/state that is required in the
variety of network elements. If some nodes might not be able to
communicate and process the state of every other node, it might be
preferable to limit the information. One school of thought holds
that the amount of information contained by a horizontal barrier
should be significant, and that the impacts this might have on
optimality in route selection and on the ability to provide global
survivability are accepted tradeoffs.

6.3 Horizontal Hierarchy Requirements

Mechanisms are required to allow for edge-to-edge signaling of
connections through a network. One network scenario includes medium
to large networks that currently have hierarchical interior routing,
such as multi-area OSPF or multi-level IS-IS. The primary context
of this is edge-to-edge signaling, which is thought to be required
to assure the SLAs for the layer 2 and layer 3 VPNs that are being
carried across the network. Another possible context would be edge-
to-edge signaling in TDM SDH/SONET networks with IP control, where
metro and core networks again might be in a hierarchical interior
routing domain.

To support edge-to-edge signaling in the above network scenarios
within the framework of existing horizontal hierarchies, current
traffic engineering (TE) methods [19, 5] may need to be extended.
Requirements for multi-area TE need to be developed to provide
guidance for any necessary protocol extensions.

7. Survivability and Hierarchy

When horizontal hierarchy exists in a network technology layer, a
question arises as to how survivability can be provided along a
connection which crosses hierarchical boundaries.

In designing protocols to meet the requirements of hierarchy, an
approach to consider is that boundaries are either clean, or are of
minimal value. However, the concept of network elements that
participate on both sides of a boundary might be a consideration
(e.g., OSPF ABRs). That would allow devices on either side to take
an intra-area approach within their region of knowledge, and the
ABR to do this in both areas, splicing the two protected connections
together at a common point (granted, it is now a common point of
failure). If the limitations of this approach start to appear in
operational settings, then perhaps it would be time to start
thinking about route servers and signaling propagated directives.
However, one initial approach might be to signal through a common
border router, and to consider the service as protected since it
consists of a concatenated set of connections which are each
protected within their own area. Another approach might be to have
a least-common-denominator mechanism at the boundary, e.g., 1+1 port
protection. There should also be some standardized means for a
survivability scheme on one side of such a boundary to communicate
with the scheme on the other side regarding the success or failure
of a recovery action. For example, if a part of a "connection" is
down on one side of such a boundary, there is no need for the other
side to recover from failures.
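The concatenation approach above can be sketched as treating an edge-to-edge service as a sequence of per-area segments spliced at border routers, with the service considered protected only when every segment is protected within its own area. This is an illustrative model of the idea, not a defined procedure; the area names and function are invented for the example.

```python
# Sketch of concatenated protection across hierarchical boundaries
# (Section 7): an edge-to-edge connection is built from per-area
# segments spliced at common border routers (e.g., OSPF ABRs), and
# is considered protected when every segment is protected within
# its own area. The splicing ABR itself remains a shared point of
# failure, as the text notes.

def service_protected(segments):
    """segments: list of (area, protected) pairs along the path."""
    return all(protected for _, protected in segments)

path = [("area1", True),   # head end to ABR, protected in area 1
        ("area0", True),   # across the backbone area
        ("area2", True)]   # ABR to tail end, protected in area 2
assert service_protected(path)

# If any one area offers no recovery, the concatenated service
# cannot be considered protected end to end.
assert not service_protected([("area1", True), ("area0", False)])
```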
In summary, at this time, approaches as described above that allow
concatenation of survivability schemes across hierarchical
boundaries seem sufficient.

8. Security Considerations

No security issues have been raised in these requirements.

9. References

1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
   9, RFC 2026, October 1996.

2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
   Levels", BCP 14, RFC 2119, March 1997.

3  K. Owens, V. Sharma, and M. Oommen, "Network Survivability
   Considerations for Traffic Engineered IP Networks," Internet-
   Draft, Work in Progress, July 2001.

4  V. Sharma, B. Crane, S. Makam, K. Owens, C. Huang, F. Hellstrand,
   J. Weil, L. Andersson, B. Jamoussi, B. Cain, S. Civanlar, and A.
   Chiu, "Framework for MPLS-based Recovery," Internet-Draft, Work
   in Progress, July 2001.

5  D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao,
   "Overview and Principles of Internet Traffic Engineering,"
   Internet-Draft, Work in Progress, August 2001.

6  S. Dharanikota, R. Jain, D. Papadimitriou, R. Hartani, G.
   Bernstein, V. Sharma, C. Brownmiller, Y. Xue, and J. Strand,
   "Inter-domain routing with Shared Risk Groups," Internet-Draft,
   Work in Progress, July 2001.

7  N. Harrison, P. Willis, S. Davari, E. Cuevas, B. Mack-Crane, E.
   Franze, H. Ohta, T. So, S. Goldfless, and F. Chen, "Requirements
   for OAM in MPLS Networks," Internet-Draft, Work in Progress, May
   2001.

8  D. Allan and M. Azad, "A Framework for MPLS User Plane OAM,"
   Internet-Draft, Work in Progress, July 2001.

9  S. Kini, M. Kodialam, T.V. Lakshman, S. Sengupta, and C.
   Villamizar, "Shared Backup Label Switched Path Restoration,"
   Internet-Draft, Work in Progress, May 2001.

10 G. Li, C. Kalmanek, J. Yates, G. Bernstein, F. Liaw, and V.
   Sharma, "RSVP-TE Extensions For Shared-Mesh Restoration in
   Transport Networks," Internet-Draft, Work in Progress, July 2001.

11 D.H. Gan, P. Pan, A. Ayyangar, and K. Kompella, "A Method for
   MPLS LSP Fast-Reroute Using RSVP Detours," Internet-Draft, Work
   in Progress, April 2001.

12 A. Atlas, C. Villamizar, and C. Litvanyi, "MPLS RSVP-TE
   Interoperability for Local Protection/Fast Reroute," Internet-
   Draft, Work in Progress, July 2001.

13 A. Chiu and J. Strand, "Joint IP/Optical Layer Restoration after
   a Router Failure," Proc. OFC 2001, Anaheim, CA, March 2001.

14 K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic
   Engineering," Internet-Draft, Work in Progress, March 2001.

15 G. Ash, et al., "Requirements for Multi-Area TE," Internet-Draft,
   Work in Progress, September 2001.

16 A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback Routing
   Extensions for MPLS Signaling," Internet-Draft, Work in Progress,
   July 2001.

17 C-Y. Lee, A. Celer, N. Gammage, S. Ghanti, and G. Ash,
   "Distributed Route Exchangers," Internet-Draft, Work in Progress,
   March 2001.

18 C-Y. Lee and S. Ghanti, "Path Request and Path Reply Message,"
   Internet-Draft, Work in Progress, July 2001.

19 D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, and J. McManus,
   "Requirements for Traffic Engineering Over MPLS," RFC 2702,
   September 1999.

10. Acknowledgments

Much of the direction taken in this document, and by the team in its
initial effort, was steered by the insightful questions provided by
Bala Rajagopalan, Greg Bernstein, Yangguang Xu, and Avri Doria. The
set of questions is attached as Appendix A of this document.

After the release of the first draft, a number of comments were
received.
Thanks are due to Jerry Ash, Sudheer Dharanikota, Chuck Kalmanek,
Dan Koller, Lyndon Ong, Steve Plote, and Yong Xue for their inputs.

11. Authors' Addresses

Wai Sum Lai
AT&T
200 Laurel Avenue
Middletown, NJ 07748, USA
Tel: +1 732-420-3712
wlai@att.com

Dave McDysan
WorldCom
22001 Loudoun County Pkwy
Ashburn, VA 20147, USA
dave.mcdysan@wcom.com

Jim Boyle
Protocol Driven Networks
Tel: +1 919-852-5160
jboyle@pdnets.com

Malin Carlzon
malin@sunet.se

Rob Coltun
Redback Networks
300 Ferguson Drive
Mountain View, CA 94043, USA
Tel: +1 650-390-9030
rcoltun@redback.com

Tim Griffin
AT&T
180 Park Avenue
Florham Park, NJ 07932, USA
Tel: +1 973-360-7238
griffin@research.att.com

Ed Kern
ejk@tech.org

Tom Reddington
Lucent Technologies
67 Whippany Rd
Whippany, NJ 07981, USA
Tel: +1 973-386-7291
treddington@bell-labs.com

Appendix A: Questions Used to Help Develop Requirements

A. Definitions

1. In determining the specific requirements, the design team should
precisely define the concepts "survivability", "restoration",
"protection", "protection switching", "recovery", "re-routing",
etc., and their relations. This would enable the requirements
document to describe precisely which of these will be addressed.
In the following, the term "restoration" is used to indicate the
broad set of policies and mechanisms used to ensure survivability.

B. Network Types and Protection Modes

1. What is the scope of the requirements with regard to the types of
networks covered?
Specifically, are the following in scope:

- Restoration of connections in mesh optical networks (opaque or
  transparent)
- Restoration of connections in hybrid mesh-ring networks
- Restoration of LSPs in MPLS networks (composed of LSRs overlaid
  on a transport network, e.g., optical)
- Any other types of networks?
- Is commonality of approach, or optimization of approach, more
  important?

2. What are the requirements with regard to the protection modes to
be supported in each network type covered? (Examples of protection
modes include 1+1, M:N, shared mesh, UPSR, BLSR, and newly defined
modes such as P-cycles.)

3. What are the requirements on local span (i.e., link-by-link)
protection and end-to-end protection, and the interaction between
them? E.g., what should be the granularity of connections for each
type (single connection, bundle of connections, etc.)?

C. Hierarchy

1. Vertical (between two network layers):
What are the requirements for the interaction between restoration
procedures across two network layers, when these features are
offered in both layers? (Example: an MPLS network realized over
point-to-point optical connections.) In such a case:

(a) Are there any criteria to choose which layer should provide
protection?

(b) If both layers provide survivability features, what are the
requirements to coordinate these mechanisms?

(c) How is the current lack of cross-layer coordination
functionality hampering operations?

(d) Would the benefits be worth the additional complexity associated
with routing isolation (e.g., VPNs, areas), security, address
isolation, and policy/authentication processes?

2. Horizontal (between two areas or administrative subdivisions
within the same network layer):

(a) What are the criteria that trigger the creation of protocol or
administrative boundaries pertaining to restoration? (E.g.,
scalability? Multi-vendor interoperability? Multi-provider
operation? What are the practical issues? Should multi-vendor
deployment necessitate hierarchical separation?)

When such boundaries are defined:

(b) What are the requirements on how protection/restoration is
performed end-to-end across such boundaries?

(c) If different restoration mechanisms are implemented on two sides
of a boundary, what are the requirements on their interaction?

What is the primary driver of horizontal hierarchy? (Select one.)
- functionality (e.g., metro versus backbone)
- routing scalability
- signaling scalability
- current network architecture, trying to layer TE on top of an
  already hierarchical network architecture
- routing and signaling

For signaling scalability, is it
- manageability
- processing/state of the network
- the edge-to-edge N^2 type issue

For routing scalability, is it
- processing/state of the network
- are you flat and want to go hierarchical, or already hierarchical?
- a data or TDM application?

D. Policy

1. What are the requirements for policy support during
protection/restoration, e.g., restoration priority, preemption,
etc.?

E. Signaling Mechanisms

1. What are the requirements on the signaling transport mechanism
(e.g., in-band over SONET/SDH overhead bytes, out-of-band over an IP
network, etc.) used to communicate restoration protocol messages
between network elements? What are the bandwidth and other
requirements on the signaling channels?

2. What are the requirements on fault detection/localization
mechanisms (which are the prelude to performing restoration
procedures) in the case of opaque and transparent optical networks?
What are the requirements in the case of MPLS restoration?

3. What are the requirements on signaling protocols to be used in
restoration procedures (e.g., high-priority processing, security,
etc.)?

4. Are there any requirements on the operation of restoration
protocols?

F. Quantitative

1. What are the quantitative requirements (e.g., latency) for
completing restoration under different protection modes (for both
local and end-to-end protection)?

G. Management

1. What information should be measured/maintained by the control
plane at each network element pertaining to restoration events?

2. What are the requirements for the correlation between control
plane and data plane failures from the restoration point of view?

Full Copyright Statement

Copyright (C) The Internet Society (date). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.