Traffic Engineering Working Group                      Wai Sum Lai, AT&T
Internet Draft                                    Dave McDysan, WorldCom
(Co-Editors)
Category: Informational
Expiration Date: January 2002                                  Jim Boyle
                                                           Malin Carlzon
                                                     Rob Coltun, Redback
                                                       Tim Griffin, AT&T
                                                          Ed Kern, Cogent
                                                   Tom Reddington, Lucent

                                July 2001

             Network Hierarchy and Multilayer Survivability

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026 [1].
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.  Internet-Drafts are draft documents valid for a
   maximum of six months and may be updated, replaced, or obsoleted by
   other documents at any time.  It is inappropriate to use
   Internet-Drafts as reference material or to cite them other than as
   "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document is the deliverable of the Network Hierarchy and
   Survivability Techniques Design Team established within the Traffic
   Engineering Working Group.  The team was asked to determine the
   current and near-term requirements for survivability and hierarchy
   in MPLS networks.  The team determined that there appears to be a
   need for common, interoperable survivability approaches in packet
   and non-packet networks.  Suggested approaches include path-based
   techniques as well as one that repairs connections in proximity to
   the network fault.  For clarity, an expanded set of definitions is
   included.  As for hierarchy, there did not appear to be as much need
   for work on "vertical hierarchy," defined as communication between
   network layers such as TDM/optical and MPLS.  In particular, instead
   of direct exchange of signaling and routing between vertical layers,
   some looser form of coordination and communication is a nearer-term
   need.  For "horizontal hierarchy" in data networks, there does
   appear to be a pressing need.
   This requirement is often presented in the context of layer 2 and
   layer 3 VPN services, where SLAs would appear to necessitate
   signaling from the edges into the core of a network.  Issues include
   potential limitations of current protocols in hierarchical networks
   (e.g., multi-area OSPF) and scalability concerns over potentially
   O(N^2) connection growth in larger networks.

   Please send comments to te-wg@ops.ietf.org

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [2].

3. Introduction

   This document presents a proposal for the tangible requirements for
   network survivability and hierarchy in current service provider
   environments.  With feedback solicited from the working group, the
   objective is to help focus the work being addressed in the traffic
   engineering, ccamp, and other working groups.  A main goal of this
   work is to provide some expedience for required functionality in
   multi-vendor service provider networks.  The initial focus is
   primarily on intra-domain operations.  However, to maintain
   consistency in the provision of end-to-end service in a
   multi-provider environment, rules governing the operation of
   survivability mechanisms at domain boundaries must also be
   specified.  While such issues are raised and discussed where
   appropriate, they will not be treated in depth in the initial
   release of this document.

   The document first develops a set of definitions to be used later in
   this document, and potentially in other documents as well.  It then
   addresses the requirements and issues associated with service
   restoration and hierarchy, and closes with a short discussion of
   survivability in a hierarchical context.

4. Definitions

4.1 Hierarchy Terminology

   Network hierarchy is an abstraction of part of a network's topology
   and the routing and signaling mechanisms needed to support the
   topological abstraction.  Abstraction may be used as a mechanism to
   build large networks or as a technique for enforcing administrative,
   topological, or geographic boundaries.  For example, network
   hierarchy might be used to separate the metropolitan and long-haul
   regions of a network, to separate the regional and backbone sections
   of a network [Bert Wijnen], or to interconnect service provider
   networks (as with BGP, which reduces a network to an Autonomous
   System).  In this document, network hierarchy is considered from two
   perspectives:

   (1) Horizontally oriented: between two areas or administrative
       subdivisions within the same network layer

   (2) Vertically oriented: between two network layers

   Horizontal hierarchy is the abstraction necessary to allow a network
   at one network layer, for instance a packet network, to grow.
   Examples of horizontal hierarchy include BGP and multi-area OSPF.

   Vertical hierarchy is the abstraction, or reduction in information,
   that would be of benefit when communicating information across
   network layers, as in propagating information between optical and
   router networks.

4.2 Survivability Terminology

   Extra traffic is the traffic carried over the protection entity
   while the working entity is active.  Extra traffic is not protected;
   i.e., when the protection entity is required to protect the traffic
   being carried over the working entity (e.g., due to a failure
   affecting the working entity), the extra traffic is preempted.

   Normalization is the return of a network to its normal state upon
   completion of the repair of the network failure.
   This could include the rerouting of affected traffic to the original
   working entities or to new routes.  The term revertive mode is used
   when traffic is returned to the working entity (switch-back).

   Protection, also called protection switching, is a survivability
   technique based on predetermined failure recovery: as the working
   entity is established, resources are reserved for the protection
   entity.  These resources may be used by low-priority traffic
   (referred to as extra traffic) if traffic preemption is allowed.
   Depending on the amount of reserved resources, not all of the
   affected traffic may be protected.  (For further discussion of
   concepts related to protection, see the sub-section below on
   Survivability Concepts.)

   Protection entity (also called back-up entity or recovery entity) is
   the entity used to carry protected traffic in protection operation
   mode, i.e., when the working entity is in error or has failed.

   Recovery is the sequence of actions taken by a network after the
   detection of a failure to maintain the required performance level
   for existing services (e.g., according to service level agreements)
   and to allow normalization of the network.  The actions include
   notification of the failure followed by two parallel processes: (1)
   a repair process, with fault isolation and repair of the failed
   components, and (2) a reconfiguration process, with path selection
   and rerouting for the affected traffic.

   Rerouting is the placement of affected traffic from the working
   entity onto the protection entity, where the path for the protection
   entity is selected after the detection of a fault on the working
   entity.  This is synonymous with switch-over in protection
   techniques.  (In [3], rerouting is synonymous with restoration.)
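   The definitions above interact: a fault triggers switch-over from the
   working entity to the protection entity, and normalization either
   switches traffic back (revertive mode) or leaves it on the protection
   entity (non-revertive mode).  As an illustrative aside, not part of
   this document's requirements, the behavior can be sketched in Python;
   all class and method names are hypothetical.

```python
# Illustrative sketch of switch-over, normalization, and revertive
# vs. non-revertive behavior as defined above.  Hypothetical names;
# not taken from any standard or protocol.

class ProtectedEntity:
    """Tracks whether traffic rides the working or protection entity."""

    def __init__(self, revertive=True):
        self.revertive = revertive
        self.active = "working"      # normal operation mode
        self.working_failed = False

    def fault_detected(self):
        """Switch-over: reroute affected traffic to the protection entity."""
        self.working_failed = True
        self.active = "protection"

    def working_repaired(self):
        """Normalization: revertive mode switches traffic back to the
        working entity; non-revertive mode leaves it on the (promoted)
        protection entity."""
        self.working_failed = False
        if self.revertive:
            self.active = "working"  # switch-back (make-before-break advised)

e = ProtectedEntity(revertive=True)
e.fault_detected()
assert e.active == "protection"
e.working_repaired()
assert e.active == "working"         # revertive: traffic returns

e2 = ProtectedEntity(revertive=False)
e2.fault_detected()
e2.working_repaired()
assert e2.active == "protection"     # non-revertive: no switch-back
```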
   Restoration is a survivability technique that dynamically discovers
   an alternate path from spare resources in the network, or
   establishes new paths on demand, for affected traffic once the
   failure is detected and the affected traffic is identified for
   rerouting.  The new path may be based on preplanned configurations
   or on current network status.  Thus, restoration involves a path
   selection process followed by traffic rerouting.  (In [3],
   restoration is referred to as recovery by rerouting.)

   Restoration, or more specifically service restoration, also refers
   to the actions taken by a network to maintain service continuity
   after the detection of a failure.  In this second usage, restoration
   has a meaning very similar to recovery, except that restoration
   covers only the reconfiguration process and not the repair process.
   Also, in this usage, it should be clear from the context that it is
   irrelevant whether the survivability technique used to achieve
   service continuity is based on protection or restoration techniques.

   Restoration time is the time interval from the occurrence of a
   network impairment to the instant when the affected traffic is
   either completely rerouted, or when spare resources are exhausted
   and no more preemptable traffic is available to make room.

   Revertive mode is a procedure in which revertive action, i.e.,
   switch-back from the protection entity to the working entity, is
   taken once the failed working entity has been repaired.  In
   non-revertive mode, such action is not taken.  To minimize service
   interruption, switch-back in revertive mode should be performed at a
   time when it has the least impact on the traffic concerned, or by
   using the make-before-break concept.

   Shared risk group (SRG) is a set of network elements that are
   collectively impacted by a specific fault or fault type.
   For example, a shared risk link group (SRLG) is the union of all the
   links on those fibers that are routed in the same physical conduit
   in a fiber-span network.  Besides shared conduit, this concept
   includes other types of compromise, such as shared fiber cable,
   shared right of way, shared optical ring, or shared office without
   power sharing.  The span of an SRG, such as the length of the
   sharing for compromised outside plant, needs to be considered on a
   per-fault basis.

   Survivability is the capability of a network to maintain service
   continuity in the presence of faults within the network [4].
   Survivability techniques such as protection and restoration are
   implemented either on a per-link basis, on a per-path basis, or
   throughout an entire network to alleviate service disruption at
   affordable cost.  The degree of survivability is determined by the
   network's capability to survive single failures, multiple failures,
   and equipment failures.

   Working entity is the entity used to carry traffic in normal
   operation mode.  Depending on the context, an entity can be, e.g.,
   a channel or a transmission link in the physical layer, an LSP in
   MPLS, or a logical bundle of one or more LSPs.

4.3 Survivability Concepts

   In a survivable network design, spare capacity and diversity must be
   built into the network from the beginning to support some degree of
   self-healing whenever failures occur.  A common strategy is to
   associate each working entity with a protection entity having either
   dedicated resources or shared resources that are pre-reserved or
   reserved on demand.  Different approaches to providing survivability
   can be classified according to the method of setting up the
   protection entity.  Generally, protection techniques are based on
   having a dedicated protection entity set up prior to failure.  Such
   is not the case in restoration techniques, which mainly rely on the
   use of spare capacity in the network.  Hence, in terms of
   trade-offs, protection techniques usually offer fast recovery from
   failure with enhanced availability, while restoration techniques
   usually achieve better resource utilization.

   Protection techniques can be implemented by several architectures:
   1+1, 1:1, 1:n, and m:n.  In the context of SDH/SONET, they are
   referred to as Automatic Protection Switching (APS).

   In the 1+1 protection architecture, a protection entity is dedicated
   to each working entity.  A dual-feed mechanism is used, whereby the
   working entity is permanently bridged onto the protection entity at
   the source of the protected domain.  In normal operation mode,
   identical traffic is transmitted simultaneously on both the working
   and protection entities.  At the sink of the protected domain, both
   feeds are monitored for alarms and maintenance signals.  A selection
   between the working and protection entities is made based on some
   predetermined criteria, such as transmission performance
   requirements or defect indication.  This architecture is rather
   expensive, since resource duplication is required.  It is generally
   used for specific services that need very high availability.

   In the 1:1 protection architecture, a protection entity is also
   dedicated to each working entity.  The protected traffic is normally
   transmitted by the working entity.  If the working entity fails,
   the protected traffic is rerouted to the protection entity.
   This architecture is inherently slower in recovering from failure
   than a 1+1 architecture, since communication between both ends of
   the protection domain is required to perform the switch-over
   operation.  An advantage is that the protection entity can
   optionally be used to carry preemptable "extra traffic" in normal
   operation.  Also, in packet networks, a protection path can be
   pre-established for later use with pre-planned but not pre-reserved
   capacity.  (If no packets are sent into a link, no bandwidth is
   consumed.)  This is not the case in channelized transport networks.

   In the 1:n protection architecture, a dedicated protection entity is
   shared by n working entities.  Traffic is normally sent on the
   working entities.  When multiple working entities fail
   simultaneously, only one of them can be restored by the common
   protection entity.  This contention is resolved by assigning a
   different preemptive priority to each working entity.  As in the 1:1
   case, the protection entity can optionally be used to carry
   preemptable "extra traffic" in normal operation.

   The m:n architecture is a generalization of the 1:n architecture.
   Typically, with m <= n, m dedicated protection entities are shared
   by n working entities.  While this architecture can improve system
   availability with small cost increases, it has rarely been
   implemented or standardized.

5. Survivability

5.1 Scope

   Interoperable approaches to network survivability were determined to
   be an immediate requirement in packet networks as well as in
   SDH/SONET-framed TDM networks.  Not as pressing at this time were
   techniques covering all-optical networks (e.g., where framing is
   unknown), as the control of these networks in a multi-vendor
   environment appeared to have some other hurdles to deal with first.
   Also not of immediate interest were approaches to coordinate or
   explicitly communicate survivability mechanisms across network
   layers (such as from a TDM or optical network to/from an IP
   network).  However, a capability should be provided for a network
   operator to control the operation of survivability mechanisms among
   different layers.  Such issues, and those related to OAM, are
   currently outside the scope of this document.  (For proposed MPLS
   OAM requirements, see [5].)

   The types of network failures that cause a restoration to be
   performed include link/span and node failures (which might include
   span failures at lower layers).  Other, more complex failure
   mechanisms, such as systematic control-plane failure or breach of
   security, are not within the scope of the survivability mechanisms
   discussed in this document.

5.2 Required initial set of survivability mechanisms

5.2.1 1:1 Path Protection with Pre-Established Capacity

   In this protection mode, the head end of a working connection
   establishes a protection connection to the destination.  In normal
   operation, traffic is sent only on the working connection, though
   the ability to signal that traffic will be sent on both connections
   (1+1 path, for signaling purposes) would be valuable in non-packet
   networks.  Some distinction between working and protection
   connections is likely, either through explicit objects or,
   preferably, through implicit methods such as general classes or
   priorities.  Head ends need the ability to create connections that
   are as failure-disjoint from each other as possible.  This would
   require SRG information that can be generally assigned to either
   nodes or links and propagated through the control or management
   plane.
   In this mechanism, capacity for the protection connection is
   pre-established; however, it can be used to carry preemptable extra
   traffic.  Protect capacity is first-come, first-served.  When
   protect capacity is called into service during restoration, there
   should be the ability to promote the protection connection to
   working status (for non-revertive mode operation) with some form of
   make-before-break capability.

5.2.2 1:1 Path Protection with Pre-Planned Capacity

   As with 1:1 protection with pre-established capacity, the protection
   connection in this case is also pre-signaled.  The difference is in
   the way protect capacity is assigned.  With pre-planned capacity,
   the mechanism allows the protect capacity to be shared, or
   "double-booked."  The expectation is that, should operator-predicted
   failures occur (prediction potentially relying on enumeration in
   SRGs), only a limited set of protection connections would be put
   into service, and the protect capacity available in the network
   would be able to carry this traffic (given proper sizing and
   planning of the network).  In a sense, this is 1:1 from a path
   perspective; however, the protect capacity in the network (on a
   link-by-link basis) is shared in a 1:n fashion.  Some form of
   information propagation could be required before traffic may be sent
   on protection connections, especially in TDM networks.  In data
   networks, a desirable operating approach for this mechanism might be
   one where the protect capacity is not accurately booked against SRGs
   (e.g., non-predictive).

   The use of this approach improves network resource utilization, but
   may require more careful planning.  Thus, initial deployment might
   be based on 1:1 path protection with pre-established capacity and
   the local restoration mechanism described next.
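   As an illustrative aside (not a requirement of this document), the
   "double-booking" check implied above can be sketched: per-link
   protect capacity is shared, and sizing is validated against each
   single-SRG failure.  All names and figures in this Python sketch are
   hypothetical.

```python
# Sketch of validating shared ("double-booked") protect capacity
# against single-SRG failures.  Hypothetical data model; illustrative
# only, not a normative algorithm.

def protect_capacity_ok(srg_to_conns, conn_bw, conn_protect_links,
                        link_protect_cap):
    """For each SRG, sum the bandwidth its affected connections would
    place on each link of their protection paths; every link must have
    enough protect capacity.  Returns (ok, offending_srg, link)."""
    for srg, conns in srg_to_conns.items():
        needed = {}
        for c in conns:
            for link in conn_protect_links[c]:
                needed[link] = needed.get(link, 0) + conn_bw[c]
        for link, bw in needed.items():
            if bw > link_protect_cap[link]:
                return False, srg, link
    return True, None, None

# Two working connections share protect link "L1".  Each SRG failure
# affects only one of them, so 100 units of shared protect capacity
# suffice even though total booked demand is 200 (1:n-style sharing).
ok, _, _ = protect_capacity_ok(
    srg_to_conns={"SRG-A": ["c1"], "SRG-B": ["c2"]},
    conn_bw={"c1": 100, "c2": 100},
    conn_protect_links={"c1": ["L1"], "c2": ["L1"]},
    link_protect_cap={"L1": 100},
)
assert ok
```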
5.2.3 Local Restoration

   Due to the time impact of signal propagation, path-based approaches
   may not be able to meet the service requirements desired in some
   networks.  The solution to this is to restore connectivity in
   immediate proximity to the fault.  At a minimum, this approach
   should be able to protect against connectivity-type SRGs, though
   protecting against node-based SRGs might be worthwhile.  After local
   restoration is in place, it is likely that head-end systems would
   later perform some path-level re-grooming.  Head-end systems must
   have some control over whether their connections are candidates for,
   or excluded from, local restoration.

5.2.4 Path Restoration

   In this approach, connections that are impacted by a fault are
   rerouted by the originating network element upon notification of
   connection failure.  This approach does not involve any new
   mechanisms; it is merely a mention of another common approach to
   protecting against faults in a network.

5.3 Applications Supported

   With service continuity under failure as a goal, a network is
   "survivable" if, in the face of a network failure, connectivity is
   interrupted for a brief period and then restored before the network
   failure ends.  The length of this interrupted period depends on the
   application supported.  Some typical applications that need to be
   considered are:

   - Best-effort data: restoration of network connectivity by rerouting
     at the IP layer would be sufficient
   - Premium data service: need to meet TCP or application protocol
     timer requirements
   - Voice: call cutoff is in the range of 140 msec to 2 sec
   - Other real-time services (e.g., streaming, fax)
   - Mission-critical applications

5.4 Timing Bounds for Service Restoration

   The approach to picking the types of survivability mechanisms
   recommended was to consider a spectrum of mechanisms that can be
   used to protect traffic with varying characteristics of
   survivability and speed of restoration, and then to select a few
   general points that provide some coverage across that spectrum.  The
   focus of this work is to provide requirements against which a small
   set of detailed proposals may be developed, allowing the operator
   some (limited) flexibility in approaches to meeting their design
   goals in engineering multi-vendor networks.  The requirements of the
   different applications listed in the previous sub-section were
   discussed generally; however, no one on the team would likely attest
   to the scientific merit of the ability of the timing bounds below to
   meet any specific application's needs.  A few assumptions include:

   - Approaches that protection switch without propagation of
     information are likely to be faster than those that do require
     some form of fault notification to some or all elements in a
     network.
   - Approaches that require some form of signaling after a fault will
     also likely suffer some timing impact.
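   Beyond raw speed, the order in which connections are restored also
   matters when spare capacity is scarce; this sub-section ties that to
   restoration priority.  As an illustrative aside (not a requirement of
   this document), priority-ordered restoration can be sketched in
   Python; all names and numbers are hypothetical.

```python
# Sketch of restoration ordered by restoration priority, with
# connections that do not fit becoming candidates for preemption or
# release.  Hypothetical names and values; illustrative only.

def restore_order(conns, spare_capacity):
    """conns: list of (name, restoration_priority, bandwidth); a lower
    priority value restores first.  Returns (restored, deferred)."""
    restored, deferred = [], []
    for name, _, bw in sorted(conns, key=lambda c: c[1]):
        if bw <= spare_capacity:
            spare_capacity -= bw
            restored.append(name)
        else:
            deferred.append(name)  # last resort: preemption/release
    return restored, deferred

conns = [("best-effort", 3, 40), ("mission-critical", 0, 60), ("voice", 1, 30)]
restored, deferred = restore_order(conns, spare_capacity=100)
assert restored == ["mission-critical", "voice"]
assert deferred == ["best-effort"]
```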
   Proposed timing bounds for service restoration for the different
   mechanisms are as follows (all bounds are exclusive of signal
   propagation):

   - 1:1 path protection with pre-established capacity: 100-500 ms
   - 1:1 path protection with pre-planned capacity: 100-750 ms
   - Local restoration: 50 ms
   - Path restoration: 1-5 seconds

   To ensure that the service requirements for different applications
   can be met within the above timing bounds, restoration priority is
   used to determine the order in which connections are restored (to
   minimize service restoration time as well as to gain access to
   available spare capacity).  For example, mission-critical
   applications may require high restoration priority.  Preemption
   priority should be used only in the event that not all connections
   can be restored, in which case connections with lower preemption
   priority should be released.  Depending on a service provider's
   strategy in provisioning network resources for backup, preemption
   may or may not be needed in the network.

5.5 Coordination Among Layers

   A common design goal for multi-layered networks is to provide the
   desired level of service in the most cost-effective manner.  The use
   of multilayer survivability might allow the optimization of spare
   resources through improved resource utilization by sharing spare
   capacity across different layers, though further investigation is
   needed.  Coordination during service restoration among different
   network layers (e.g., IP, SDH/SONET, the optical layer) might
   necessitate development of vertical hierarchy.  The benefits of
   providing survivability mechanisms at multiple layers, and the
   optimization of the overall approach, must be weighed against the
   associated cost and service impacts.
   A default coordination mechanism for inter-layer interaction could
   be the use of nested timers and current SDH/SONET fault monitoring,
   as has been done traditionally for backward compatibility.  Thus,
   when lower-layer restoration takes longer than higher-layer
   restoration, a hold-off timer is utilized to avoid contention
   between the different single-layer recovery schemes.  In other
   words, multilayer interaction is addressed by having successively
   higher multiplexing levels operate at a restoration time scale
   greater than that of the next lowest layer.  Currently, if SDH/SONET
   protection switching is used, MPLS recovery timers must wait until
   SDH/SONET has had time to switch.

   It was felt that the current approaches to coordination of
   survivability mechanisms do not have significant operational
   shortfalls.  These approaches include protecting traffic solely at
   one layer (e.g., at the IP layer over linear WDM, or at the
   SDH/SONET layer).  Where survivability mechanisms might be deployed
   at several layers, such as when a routed network rides a SDH/SONET
   protected network, it was felt that current coordination approaches
   are sufficient in many cases.  One exception is the hold-off of MPLS
   recovery until the completion of SDH/SONET protection switching, as
   described above, which limits the recovery time of fast MPLS
   restoration.  Also, note that failures within a layer can be guarded
   against by techniques either in that layer or at a higher layer, but
   not the reverse.  Thus, the optical layer cannot guard against
   failures in the IP layer, such as router system failures or line
   card failures.
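   As an illustrative aside (not a requirement of this document), the
   nested hold-off timer scheme described above can be sketched: each
   higher layer waits at least as long as the layer below it needs to
   attempt recovery.  The timer values and layer names in this Python
   sketch are hypothetical, not normative.

```python
# Sketch of nested hold-off timers for multilayer coordination: each
# layer's hold-off covers the restoration attempt of the layer below,
# so single-layer recovery schemes do not contend.  Illustrative only.

def holdoff_ms(layers_bottom_up, margin_ms=10):
    """layers_bottom_up: list of (layer_name, restoration_time_ms),
    ordered lowest layer first.  Each layer's hold-off equals the time
    by which all lower layers have finished (plus a margin per layer)."""
    holdoffs, below = {}, 0
    for layer, restore_ms in layers_bottom_up:
        holdoffs[layer] = below
        below = holdoffs[layer] + restore_ms + margin_ms
    return holdoffs

h = holdoff_ms([("SDH/SONET", 50), ("MPLS", 100), ("IP rerouting", 5000)])
assert h["SDH/SONET"] == 0       # lowest layer acts immediately
assert h["MPLS"] == 60           # waits out SDH/SONET switching + margin
assert h["IP rerouting"] == 170  # waits out MPLS restoration as well
```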
5.6 Evolution Toward IP Over Optical

   As the more pressing requirements for survivability and for
   horizontal hierarchy with edge-to-edge signaling are met with
   technical proposals, it is believed that the benefits of merging (in
   some manner) the control planes of multiple layers will be outlined.
   When these benefits are self-evident, that would then seem to be the
   right time to review whether vertical hierarchy mechanisms are
   needed, and what the requirements might be.

6. Hierarchy Requirements

   Efforts in the area of network hierarchy should focus on mechanisms
   that would allow more scalable edge-to-edge signaling, or signaling
   across networks with existing network hierarchy (such as multi-area
   OSPF).  This would appear to be a more immediate need than
   mechanisms that might be needed to interconnect networks at
   different layers.

6.1 Historical Context

   One reason for horizontal hierarchy is functionality (e.g., metro
   versus backbone).  Geographic "islands" reduce the need for
   interoperability and make administration and operations less
   complex.  Using a simpler, more interoperable survivability scheme
   at metro/backbone boundaries is natural for many provider network
   architectures.  In transmission networks, creating geographic
   islands of different vendor equipment has long been done because
   multi-vendor interoperability has been difficult to achieve.
   Traditionally, providers have had to coordinate the equipment on
   either end of a "connection," and making this interoperable reduces
   complexity.  A provider should be able to concatenate survivability
   mechanisms in order to provide a "protected link" to the next higher
   level.  Think of SDH/SONET rings connecting to TDM DXCs with 1+1
   line-layer protection between the ADM and the DXC port.  The TDM
   connection, e.g., a DS3, is protected, but usually all equipment on
   each SDH/SONET ring is from a single vendor.  The DXC
   cross-connections are controlled by the provider, and the ports are
   physically protected, resulting in a highly available design.  Thus,
   concatenation of survivability approaches can be used to cascade
   across horizontal hierarchy.  While not perfect, this is workable in
   the near- to mid-term, until multi-vendor interoperability is
   achieved.

   While the problems associated with multi-vendor interoperability may
   necessitate horizontal hierarchy as a practical matter (at least
   this has been the case in TDM networks), there may be no technical
   reason for it.  Members of the team with more experience with IP
   networks felt there should be no need for this in core networks, or
   even in most access networks.

   Some of the largest service provider networks currently run a
   single-area/level IGP.  Some service providers, as well as many
   large enterprise networks, run multi-area OSPF to gain increases in
   scalability.  Often this was part of the original design, so it is
   difficult to say whether the network truly required the hierarchy to
   reach its current size.

   Some proposals for improved mechanisms to address network hierarchy
   have been suggested [6, 7, 8].  This document aims to provide
   concrete requirements so that these and other proposals can first
   aim to meet some limited objectives.

6.2 Applications for Horizontal Hierarchy

   A primary driver for intra-domain horizontal hierarchy is signaling
   scalability in the context of edge-to-edge VPNs, potentially across
   traffic-engineered data networks.  There are a number of different
   approaches to VPNs, and they are currently being addressed by
   different emerging protocols: RFC 2547bis BGP/MPLS VPNs,
   provider-provisioned VPNs based upon MPLS tunnels (e.g., virtual
   routers), Pseudo Wire Emulation Edge-to-Edge (PWE3), etc.
   These may or may not need explicit signaling from edge to edge, but
   it is a common perception that, in order to meet SLAs, some form of
   edge-to-edge signaling is required.

   For signaling scalability, there are probably two types of network
   scenarios to consider:

   - Large SP networks with flat routing domains, where edge-to-edge
     (MPLS) signaling as implemented today would probably not scale.

   - Networks that would like to signal edge-to-edge, and might even
     scale in a limited application.  However, they are hierarchically
     routed (e.g., OSPF areas), and current implementations, and
     potentially standards, prevent signaling across areas.  This
     requires the development of signaling standards that support
     dynamic establishment and potentially restoration of LSPs across
     a 2-level IGP hierarchy.

   Scalability is concerned with the O(N^2) properties of edge-to-edge
   signaling.  For a large network, maintaining a "connection" between
   every pair of edges is simply not scalable.  Even if establishing
   and maintaining connections is feasible, there might be an impact
   on core survivability mechanisms that would cause restoration times
   to grow with N^2, which would be undesirable.  While some value of
   N may be inevitable, approaches to reduce N (e.g., to pull in from
   the edge to aggregation points) might be of value.

   For routing scalability, especially in data applications, a major
   concern is the amount of processing/state that is required in the
   variety of network elements.  If some nodes might not be able to
   communicate and process the state of every other node, it might be
   preferable to limit the information.
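   As a rough illustration of the O(N^2) signaling concern above, the
   following sketch compares full-mesh edge-to-edge LSP counts against
   a design that pulls connections in from the edge to aggregation
   points.  The function names and the assumption that each edge homes
   to a single aggregation point are invented for illustration; they
   are not part of this document's requirements.

```python
# Hypothetical illustration (not from this draft): full-mesh
# edge-to-edge signaling state grows as O(N^2), while pulling
# connections in to aggregation points reduces it sharply.

def full_mesh_lsps(n_edges: int) -> int:
    """Unidirectional LSPs needed to fully mesh n_edges edge devices."""
    return n_edges * (n_edges - 1)

def aggregated_lsps(n_edges: int, n_agg: int) -> int:
    """LSPs when each edge homes to one aggregation point and only
    the aggregation points are fully meshed."""
    core = n_agg * (n_agg - 1)   # full mesh among aggregation points
    access = 2 * n_edges         # one LSP in each direction per edge
    return core + access

# 1000 edges flat: 999,000 LSPs; with 20 aggregation points: 2,380.
print(full_mesh_lsps(1000), aggregated_lsps(1000, 20))
```

   The point of the sketch is only that reducing the number of fully
   meshed endpoints dominates every other term in the state count.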
   One school of thought holds that the amount of information
   contained by a horizontal barrier should be significant, and that
   the impacts this might have on optimality in route selection and on
   the ability to provide global survivability are accepted tradeoffs.

6.3 Horizontal Hierarchy Requirements

   Mechanisms are required to allow for edge-to-edge signaling of
   connections through a network.  The types of network scenarios
   include large networks with a large number of edge devices and flat
   interior routing, as well as medium to large networks that
   currently have hierarchical interior routing, such as multi-area
   OSPF or multi-level IS-IS.  The primary context of this is
   edge-to-edge signaling, which is thought to be required to assure
   the SLAs for the layer 2 and layer 3 VPNs being carried across the
   network.  Another possible context would be edge-to-edge signaling
   in TDM SDH/SONET networks, where metro and core networks again
   might be in either a flat or a hierarchical interior routing
   domain.

7. Survivability and Hierarchy

   When horizontal hierarchy exists in a network layer, a question
   arises as to how survivability can be provided along a connection
   that crosses hierarchical boundaries.

   In designing protocols to meet the requirements of hierarchy, an
   approach to consider is that boundaries are either clean, or are of
   minimal value.  However, the concept of network elements that
   participate on both sides of a boundary might be a consideration
   (e.g., OSPF ABRs).  That would allow devices on either side to take
   an intra-area approach within their region of knowledge, and the
   ABR to do this in both areas, splicing the two protected
   connections together at a common point (granted, it is now a common
   point of failure).
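   A back-of-the-envelope availability model makes the tradeoff of
   such splicing concrete.  The figures and function names below are
   assumed for illustration only: each segment is modeled as
   1+1-protected (it fails only if both diverse paths fail), and the
   splicing ABR appears as a series element shared by both segments,
   i.e., the common point of failure noted above.

```python
# Hypothetical availability model (figures and names assumed, not
# taken from this draft): two protected segments spliced in series
# at one ABR.  Path failures are assumed independent.

def protected_segment(a_path: float) -> float:
    """Availability of a 1+1-style protected segment: it fails only
    when both diverse paths fail."""
    return 1.0 - (1.0 - a_path) ** 2

def spliced_availability(a_path: float, a_abr: float) -> float:
    """Two protected segments in series through a single ABR."""
    return protected_segment(a_path) * a_abr * protected_segment(a_path)

# With each unprotected path 99.9% available, a segment reaches
# roughly 99.9999%, but the end-to-end figure is capped by the ABR.
print(protected_segment(0.999), spliced_availability(0.999, 0.99999))
```

   The model suggests why the ABR's own protection (e.g., physically
   protected ports) matters: it, not the per-area schemes, bounds the
   end-to-end availability.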
   If the limitations of this approach start to appear in operational
   settings, then perhaps it would be time to start thinking about
   route servers and signaling propagated directives.  However, one
   initial approach might be to signal through a common border router,
   and to consider the service protected, as it consists of a
   concatenated set of connections that are each protected within
   their area.  Another approach might be to have a least-common-
   denominator mechanism at the boundary, e.g., 1+1 port protection.
   There should also be some standardized means for a survivability
   scheme on one side of such a boundary to communicate with the
   scheme on the other side regarding the success or failure of the
   service restoration action.  For example, if a part of a
   "connection" is down on one side of such a boundary, there is no
   need for the other side to recover from failures.

   In summary, at this time, approaches that allow concatenation of
   survivability schemes across hierarchical boundaries should prove
   sufficient.

8. Security Considerations

   Security is not considered in this initial version.

9. References

   [1] Bradner, S., "The Internet Standards Process -- Revision 3",
       BCP 9, RFC 2026, October 1996.

   [2] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

   [3] V. Sharma, B. Crane, K. Owens, C. Huang, F. Hellstrand, J.
       Weil, L. Andersson, B. Jamoussi, B. Cain, S. Civanlar, and A.
       Chiu, "Framework for MPLS-based Recovery", Internet-Draft, Work
       in Progress, March 2001.

   [4] D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao, "A
       Framework for Internet Traffic Engineering", Internet-Draft,
       Work in Progress, May 2001.

   [5] N. Harrison, et al, "Requirements for OAM in MPLS Networks",
       Internet-Draft, Work in Progress, May 2001.

   [6] K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic
       Engineering", Internet-Draft, Work in Progress, March 2001.

   [7] G. Ash, et al, "Requirements for Multi-Area TE", Internet-
       Draft, Work in Progress, March 2001.

   [8] A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback
       Routing Extensions for MPLS Signaling", Internet-Draft, Work in
       Progress, July 2001.

10. Acknowledgments

   A lot of the direction taken in this document, and by the team, was
   steered by the insightful questions provided by Bala Rajagopalan,
   Greg Bernstein, Yangguang Xu, and Avri Doria.  The set of questions
   is attached as Appendix A of this document.

11. Authors' Addresses

   Wai Sum Lai
   AT&T
   200 Laurel Avenue
   Middletown, NJ 07748, USA
   Tel: +1 732-420-3712
   wlai@att.com

   Dave McDysan
   WorldCom
   22001 Loudoun County Pkwy
   Ashburn, VA 20147, USA
   dave.mcdysan@wcom.com

   Jim Boyle
   jimpb@nc.rr.com

   Malin Carlzon
   malin@sunet.se

   Rob Coltun
   rcoltun@redback.com

   Tim Griffin
   AT&T
   180 Park Avenue
   Florham Park, NJ 07932, USA
   Tel: +1 973-360-7238
   griffin@research.att.com

   Ed Kern
   Cogent Communications
   3413 Metzerott Rd
   College Park, MD 20740, USA
   Tel: +1 703-852-0522
   ejk@tech.org

   Tom Reddington
   Lucent Technologies
   67 Whippany Rd
   Whippany, NJ 07981, USA
   Tel: +1 973-386-7291
   treddington@bell-labs.com

Appendix A: Questions used to help develop requirements

A. Definitions

   1. In determining the specific requirements, the design team should
   precisely define the concepts "survivability", "restoration",
   "protection", "protection switching", "recovery", "re-routing",
   etc., and their relationships.  This would enable the requirements
   document to describe precisely which of these will be addressed.
   In the following, the term "restoration" is used to indicate the
   broad set of policies and mechanisms used to ensure survivability.

B. Network types and protection modes

   1. What is the scope of the requirements with regard to the types
   of networks covered?  Specifically, are the following in scope:

   - Restoration of connections in mesh optical networks (opaque or
     transparent)
   - Restoration of connections in hybrid mesh-ring networks
   - Restoration of LSPs in MPLS networks (composed of LSRs overlaid
     on a transport network, e.g., optical)
   - Any other types of networks?

   Is commonality of approach, or optimization of approach, more
   important?

   2. What are the requirements with regard to the protection modes to
   be supported in each network type covered?  (Examples of protection
   modes include 1+1, M:N, shared mesh, UPSR, BLSR, newly defined
   modes such as P-cycles, etc.)

   3. What are the requirements on local span (i.e., link-by-link)
   protection and end-to-end protection, and the interaction between
   them?  E.g., what should be the granularity of connections for each
   type (single connection, bundle of connections, etc.)?

C. Hierarchy

   1. Vertical (between two network layers):

   What are the requirements for the interaction between restoration
   procedures across two network layers, when these features are
   offered in both layers?  (Example: an MPLS network realized over
   point-to-point optical connections.)  In such a case,

   (a) Are there any criteria to choose which layer should provide
   protection?
   (b) If both layers provide survivability features, what are the
   requirements to coordinate these mechanisms?

   (c) How is the current lack of cross-layer coordination
   functionality hampering operations?

   (d) Would the benefits be worth the additional complexity
   associated with routing isolation (e.g., VPNs, areas), security,
   address isolation, and policy/authentication processes?

   2. Horizontal (between two areas or administrative subdivisions
   within the same network layer):

   (a) What are the criteria that trigger the creation of protocol or
   administrative boundaries pertaining to restoration?  (E.g.,
   scalability?  Multi-vendor interoperability?  Multi-provider?  What
   are the practical issues?)  Should multi-vendor operation
   necessitate hierarchical separation?

   When such boundaries are defined:

   (b) What are the requirements on how protection/restoration is
   performed end-to-end across such boundaries?

   (c) If different restoration mechanisms are implemented on the two
   sides of a boundary, what are the requirements on their
   interaction?

   What is the primary driver of horizontal hierarchy?  (Select one.)
   - functionality (e.g., metro vs. backbone)
   - routing scalability
   - signaling scalability
   - current network architecture, trying to layer TE on top of an
     already hierarchical network architecture
   - routing and signaling

   For signaling scalability, is it
   - manageability
   - processing/state of the network
   - the edge-to-edge N^2 type issue

   For routing scalability, is it
   - processing/state of the network
   - are you flat and want to go hierarchical
   - or already hierarchical?
   - a data or TDM application?

D. Policy

   1. What are the requirements for policy support during
   protection/restoration, e.g., restoration priority, preemption,
   etc.?

E. Signaling Mechanisms

   1. What are the requirements on the signaling transport mechanism
   (e.g., in-band over SONET/SDH overhead bytes, out-of-band over an
   IP network, etc.) used to communicate restoration protocol messages
   between network elements?  What are the bandwidth and other
   requirements on the signaling channels?

   2. What are the requirements on fault detection/localization
   mechanisms (the prelude to performing restoration procedures) in
   the case of opaque and transparent optical networks?  What are the
   requirements in the case of MPLS restoration?

   3. What are the requirements on signaling protocols to be used in
   restoration procedures (e.g., high-priority processing, security,
   etc.)?

   4. Are there any requirements on the operation of restoration
   protocols?

F. Quantitative

   1. What are the quantitative requirements (e.g., latency) for
   completing restoration under different protection modes (for both
   local and end-to-end protection)?

G. Management

   1. What information should be measured/maintained by the control
   plane at each network element pertaining to restoration events?

   2. What are the requirements for the correlation between control
   plane and data plane failures from the restoration point of view?

Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.