Traffic Engineering Working Group Wai Sum Lai, AT&T Internet Draft Dave McDysan, WorldCom (Co-Editors) Category: Informational Expiration Date: January 2002 Jim Boyle Malin Carlzon Rob Coltun, Redback Tim Griffin, AT&T Ed Kern, Cogent Tom Reddington, Lucent July 2001 Network Hierarchy and Multilayer Survivability Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [1]. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. 1. Abstract This document is the deliverable out of the Network Hierarchy and Survivability Techniques Design Team established within the Traffic Engineering Working Group. This team was requested to try to determine what the current and near term requirements are for survivability and hierarchy in MPLS networks. The team determined that there appears to be a need for common, interoperable survivability approaches in packet and non-packet networks. Suggested approaches include path-based as well as one that repairs connections in proximity to the network fault. For clarity, an expanded set of definitions is included. As for hierarchy, there did not appear to be as much need for work on "vertical hierarchy," defined as communication between network layers such as TDM/optical and MPLS. In particular, instead of direct exchange of signaling and routing between vertical layers, some looser form of coordination and communication is a nearer term need. For Lai, et al Category - Expiration [1] Network Hierarchy and Multilayer Survivability July 2001 "horizontal hierarchy" in data networks, there does appear to be a pressing need. This requirement is often presented in the context of layer 2 and layer 3 VPN services where SLAs would appear to necessitate signaling from the edges into the core of a network. Issues include potential current protocols limitations in networks which are hierarchical (e.g. multi-area OSPF) and scalability concerns of potentially O(N^2) connection growth in larger networks. Please send comments to te-wg@ops.ietf.org 2. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [2]. 3. Introduction This document presents a proposal of the tangible requirements for network survivability and hierarchy in current service provider environments. With feedback from the working group solicited, the objective is to help focus the work that is being addressed in the traffic engineering, ccamp and other working groups. A main goal of this work is to provide some expedience for required functionality in multi-vendor service provider networks. The initial focus is primarily on intra-domain operations. However, to maintain consistency in the provision of end-to-end service in a multi- provider environment, rules governing the operations of survivability mechanisms at domain boundaries must also be specified. While such issues are raised and discussed, where appropriate, they will not be treated in depth in the initial release of this document. The document first develops a set of definitions to be used later in this document and potentially in other documents as well. It then addresses the requirements and issues associated with service restoration, hierarchy, and finally a short discussion of survivability in hierarchical context. 4. Definitions 4.1 Hierarchy Terminology Network hierarchy is an abstraction of part of a network's topology and the routing and signaling mechanism needed to support the topological abstraction. Abstraction may be used as a mechanism to build large networks or as a technique for enforcing administrative, topological or geographic boundaries. For example, network hierarchy might be used to separate the metropolitan and long-haul Lai, et al Category - Expiration 2 Network Hierarchy and Multilayer Survivability July 2001 regions of a network or to separate the regional and backbone sections of a network [Bert Wijnen], or to interconnect service provider networks (with BGP which reduces a network to an Autonomous System). In this document, network hierarchy is considered from two perspectives: (1) Horizontally oriented: between two areas or administrative subdivisions within the same network layer (2) Vertically oriented: between two network layers Horizontal hierarchy is the abstraction necessary to allow a network at one network layer, for instance a packet network, to grow. Examples of horizontal hierarchy include BGP and multi-area OSPF. Vertical hierarchy is the abstraction, or reduction in information, which would be of benefit when communicating information across network layers, as in propagating information between optical and router networks. 4.2 Survivability Terminology Extra traffic is the traffic carried over the protection entity while the working entity is active. Extra traffic is not protected, i.e., when the protection entity is required to protect the traffic that is being carried over the working entity (e.g., due to a failure affecting the working entity), the extra traffic is preempted. Normalization is the return to the normal state of a network upon completing the repair of the network failure. This could include the rerouting of affected traffic to the original working entities or new routes. The term revertive mode is used when traffic is returned to the working entity (switch back). Protection, also called protection switching, is a survivability technique based on predetermined failure recovery: as the working entity is established, resources are reserved for the protection entity. These resources may be used by low-priority traffic (referred to as extra traffic) if traffic preemption is allowed. Depending on the amount of reserved resources, not all of the affected traffic may be protected. (For further discussion of concepts related to protection, see the Sub-section below on Survivability Concepts.) Protection entity (also called back-up entity or recovery entity) is the entity that is used to carry protected traffic in protection operation mode, i.e., when the working entity is in error or has failed. Recovery is the sequence of actions taken by a network after the detection of a failure to maintain the required performance level for existing services (e.g., according to service level agreements) and to allow normalization of the network. The actions include Lai, et al Category - Expiration 3 Network Hierarchy and Multilayer Survivability July 2001 notification of the failure followed by two parallel processes: (1) a repair process with fault isolation and repair of the failed components, and (2) a reconfiguration process with path selection and rerouting for the affected traffic. Rerouting is placement of affected traffic from the working entity to the protection entity, when the path for the protection entity has been selected after the detection of a fault on the working entity. This is synonymous with switch-over in protection techniques. (In [3], rerouting is synonymous with restoration.) Restoration is a survivability technique that dynamically discovers the alternate path from spare resources in network, or establishes new paths on demand, for affected traffic once the failure is detected and the affected traffic is identified for rerouting. The new path may be based on preplanned configurations or current network status. Thus, restoration involves a path selection process followed by traffic rerouting. (In [3], restoration is referred to as recovery by rerouting.) Restoration, or more specifically, service restoration, refers to the actions taken by a network to maintain service continuity after the detection of a failure. In this second usage, restoration has a meaning very similar to recovery, except that restoration covers only the reconfiguration process and not the repair process. Also, in this usage, it should be clear from the context that it is irrelevant whether the survivability technique used to achieve service continuity is based on protection or restoration techniques. Restoration time is the time interval from the occurrence of a network impairment to the instant when the affected traffic is either completely rerouted or until spare resources are exhausted and/or no more preemptable traffic to make room. Revertive mode is a procedure in which revertive action, i.e., switch back from the protection entity to the working entity, is taken once the failed working entity has been repaired. In non- revertive mode, such action is not taken. To minimize service interruption, switch-back in revertive mode should be performed at a time when there is the least impact on the traffic concerned, or by using the make-before-break concept. Shared risk group (SRG) is a set of network elements that are collectively impacted by a specific fault or fault type. For example, a shared risk link group (SRLG) is the union of all the links on those fibers that are routed in the same physical conduit in a fiber-span network. This concept includes, besides shared conduit, other types of compromise such as shared fiber cable, shared right of way, shared optical ring, shared office without power sharing, etc. The span of an SRG, such as the length of the sharing for compromised outside plant, needs to be considered on a per fault basis. Lai, et al Category - Expiration 4 Network Hierarchy and Multilayer Survivability July 2001 Survivability is the capability of a network to maintain service continuity in the presence of faults within the network [4]. Survivability techniques such as protection and restoration are implemented either on a per-link basis, on a per-path basis, or throughout an entire network to alleviate service disruption at affordable costs. The degree of survivability is determined by the network's capability to survive single failures, multiple failures, and equipment failures. Working entity is the entity that is used to carry traffic in normal operation mode. Depending on the context, an entity can be, e.g., a channel or a transmission link in the physical layer, an LSP in MPLS, or a logical bundle of one or more LSPs. 4.3 Survivability Concepts In a survivable network design, spare capacity and diversity must be built into the network from the beginning to support some degree of self-healing whenever failures occur. A common strategy is to associate each working entity with a protection entity having either dedicated resources or shared resources that are pre-reserved or reserved-on-demand. According to the methods of setting up a protection entity, different approaches to providing survivability can be classified. Generally, protection techniques are based on having a dedicated protection entity set up prior to failure. Such is not the case in restoration techniques, which mainly rely on the use of spare capacity in the network. Hence, in terms of trade- offs, protection techniques usually offer fast recovery from failure with enhanced availability, while restoration techniques usually achieve better resource utilization. Protection techniques can be implemented by several architectures: 1+1, 1:1, 1:n, and m:n. In the context of SDH/SONET, they are referred to as Automatic Protection Switching (APS). In the 1+1 protection architecture, a protection entity is dedicated to each working entity. The dual-feed mechanism is used whereby the working entity is permanently bridged onto the protection entity at the source of the protected domain. In normal operation mode, identical traffic is transmitted simultaneously on both the working and protection entities. At the sink of the protected domain, both feeds are monitored for alarms and maintenance signals. A selection between the working and protection entity is made based on some predetermined criteria, such as the transmission performance requirements or defect indication. This architecture is rather expensive since resource duplication is required. It is generally used for specific services that need a very high availability. In the 1:1 protection architecture, a protection entity is also dedicated to each working entity. The protected traffic is normally transmitted by the working entity. If the working entity has failed, the protected traffic is rerouted to the protection entity. Lai, et al Category - Expiration 5 Network Hierarchy and Multilayer Survivability July 2001 This architecture is inherently slower in recovering from failure than a 1+1 architecture since communication between both ends of the protection domain is required to perform the switch-over operation. An advantage is that the protection entity can optionally be used to carry preemptable "extra traffic" in normal operation. Also, in packet networks, a protection path can be pre-established for later use with pre-planned but not pre-reserved capacity. (If no packets are sent into a link, no bandwidth is consumed.) This is not the case in channelized transport networks. In the 1:n protection architecture, a dedicated protection entity is shared by n working entities. Traffic is normally sent on the working entities. When multiple working entities have failed simultaneously, only one of them can be restored by the common protection entity. This contention is resolved by assigning a different preemptive priority to each working entity. As in the 1:1 case, the protection entity can optionally be used to carry preemptable "extra traffic" in normal operation The m:n architecture is a generalization of the 1:n architecture. Typically m <= n, m dedicated protection entities are shared by n working entities. While this architecture can improve system availability with small cost increases, it has rarely been implemented or standardized. 5. Survivability 5.1 Scope Interoperable approaches to network survivability were determined to be an immediate requirement in packet networks as well as in SDH/SONET framed TDM networks. Not as pressing at this time were techniques which would cover all-optical networks (e.g., where framing is unknown), as the control of these networks in a multi- vendor environment appeared to have some other hurdles to first deal with. Also, not of immediate interest were approaches to coordinate or explicitly communicate survivability mechanisms across network layers (such as from a TDM or optical network to/from an IP network). However, a capability should be provided for a network operator to control the operation of survivability mechanisms among different layers. Such issues and those related to OAM are currently outside the scope of this document. (For proposed MPLS OAM requirements, see [5]). The types of network failures that cause a restoration to be performed include link/span and node failures (which might include span failures at lower layers). Other more complex failure mechanisms such as systematic control-plane failure or breach of security are not within the scope of the survivability mechanisms discussed in this document. Lai, et al Category - Expiration 6 Network Hierarchy and Multilayer Survivability July 2001 5.2 Required initial set of survivability mechanisms 5.2.1 1:1 Path Protection with Pre-Established Capacity In this protection mode, the head end of a working connection establishes a protection connection to the destination. In normal operation, traffic is only sent on the working connection, though the ability to signal that traffic will be sent on both connections (1+1 Path for signaling purposes) would be valuable in non-packet networks. Some distinction between working and protection connections is likely, either through explicit objects, or preferably through implicit methods such as general classes or priorities. Head ends need the ability to create connections that are as failure disjoint as possible from each other. This would require SRG information that can be generally assigned to either nodes or links and propagated through the control or management plane. In this mechanism, capacity in the protection connection is pre-established, however it can be used to carry preemptable extra traffic. Protect capacity is first come first served. When protect capacity is called into service during restoration, there should be the ability to promote the protection connection to working status (for non-revertive mode operation) with some form of make-before- break capability. 5.2.2 1:1 Path Protection with Pre-Planned Capacity Similar to the above 1:1 protection with pre-established capacity, the protection connection in this case is also pre-signaled. The difference is in the way protect capacity is assigned. With pre- planned capacity, the mechanism supports the ability for the protect capacity to be shared, or "double-booked." It would be expected that should operator predicted failures occur, which potentially could rely on enumeration in SRGs, that only a limited set of protect connections would be put into service, and that the protect capacity available in the network would be able to fulfill this traffic (given proper sizing and planning of the network). In a sense, this is 1:1 from a path perspective, however the protect capacity in the network (on a link by link basis) is shared in a 1:n fashion. Some form of information propagation could be required before traffic may be sent on protection connections, especially in TDM networks. In data networks, a desirable operating approach for this mechanism might be where the protect capacity is not accurately booked against SRGs (e.g. non-predictive). The use of this approach improves network resource utilization, but may require more careful planning. So, initial deployment might be based on 1:1 path protection with pre-established capacity and the local restoration mechanism to be described next. 5.2.3 Local Restoration Lai, et al Category - Expiration 7 Network Hierarchy and Multilayer Survivability July 2001 Due to the time impact of signal propagation, path-based approaches may not be able to meet the service requirements desired in some networks. The solution to this is to restore connectivity in immediate proximity to the fault. At a minimum, this approach should be able to protect against connectivity-type SRGs, though protecting against node-based SRGs might be worthwhile. After local restoration is in place, it is likely that head end systems would later perform some path-level re-grooming. Head end systems must have some control as to whether their connections are candidates for or excluded from local restoration. 5.2.4 Path Restoration In this approach, connections that are impacted by a fault are rerouted by the originating network element upon notification of connection failure. This approach does not involve any new mechanisms. It merely is a mention of another common approach to protecting against faults in a network. 5.3 Applications Supported With service continuity under failure as a goal, a network is "survivable" if, in the face of a network failure, connectivity is interrupted for a brief period and then restored before the network failure ends. The length of this interrupted period is dependent on the application supported. Here are some typical applications that need to be considered: - Best-effort data: restoration of network connectivity by rerouting at the IP layer would be sufficient - Premium data service: need to meet TCP or application protocol timer requirements - Voice: call cutoff is in the range of 140 msec to 2 sec - Other real-time service (e.g., streaming, fax) - Mission-critical applications 5.4 Timing Bounds for Service Restoration The approach to picking the types of survivability mechanisms recommended was to consider a spectrum of mechanisms that can be used to protect traffic with varying characteristics of survivability and speed of restoration, and then attempt to select a few general points which provide some coverage across that spectrum. The focus of this work is to provide requirements to which a small set of detailed proposals may be developed, allowing the operator some (limited) flexibility in approaches to meeting their design goals in engineering multi-vendor networks. Requirements of different applications as listed in the previous sub-section were discussed generally, however none on the team would likely attest to the scientific merit of the ability of the timing bounds below to meet any specific applicationËs needs. A few assumptions include: Lai, et al Category - Expiration 8 Network Hierarchy and Multilayer Survivability July 2001 Approaches that protection switch without propagation of information are likely to be faster than those that do require some form of fault notification to some or all elements in a network. Approaches that require some form of signaling after a fault will also likely suffer some timing impact. Proposed timing bounds for service restoration for different mechanisms are as follows (all bounds are exclusive of signal propagation): 1:1 path protection with pre-established capacity: 100-500 ms 1:1 path protection with pre-planned capacity: 100-750 ms Local restoration: 50 ms Path restoration: 1-5 seconds To ensure that the service requirements for different applications can be met within the above timing bounds, restoration priority is used to determine the order in which connections are restored (to minimize service restoration time as well as to gain access to available spare capacity). For example, mission critical applications may require high restoration priority. Preemption priority should only be used in the event that all connections cannot be restored, in which case connections with lower preemption priority should be released. Depending on a service provider's strategy in provisioning network resources for backup, preemption may or not be needed in the network. 5.5 Coordination Among Layers A common design goal for multi-layered networks is to provide the desired level of service in the most cost-effective manner. The use of multilayer survivability might allow the optimization of spare resources through the improvement of resource utilization by sharing spare capacity across different layers, though further investigations are needed. Coordination during service restoration among different network layers (e.g. IP, SDH/SONET, optical layer) might necessitate development of vertical hierarchy. The benefits of providing survivability mechanisms at multiple layers, and the optimization of the overall approach, must be weighed with the associated cost and service impacts. A default coordination mechanism for inter-layer interaction could be the use of nested timers and current SDH/SONET fault monitoring, as has been done traditionally for backward compatibility. Thus, when lower-layer restoration happens in a longer time period than higher-layer restoration, a hold-off timer is utilized to avoid contention between the different single-layer recovery schemes. In other words, multilayer interaction is addressed by having successively higher multiplexing levels operate at restoration time scale greater than the next lowest layer. Currently, if SDH/SONET protection switching is used, MPLS recovery timers must wait until SDH/SONET has had time to switch. Lai, et al Category - Expiration 9 Network Hierarchy and Multilayer Survivability July 2001 It was felt that the current approach to coordination of survivability approaches currently did not have significant operational shortfalls. These approaches include protecting traffic solely at one layer (e.g. at the IP layer over linear WDM, or at the SDH/SONET layer). Where survivability mechanisms might be deployed at several layers, such as when a routed network rides a SDH/SONET protected network, it was felt that current coordination approaches were sufficient in many cases. One exception is the hold-off of MPLS recovery until the completion of SDH/SONET protection switching as described above. This limits the recovery time of fast MPLS restoration. Also, note that failures within a layer can be guarded against by techniques either in that layer or at a higher layer, but not in reverse. Thus, the optical layer cannot guard against failures in the IP layer such as router system failures, line card failures. 5.6 Evolution Toward IP Over Optical As more pressing requirements for survivability and horizontal hierarchy for edge-to-edge signaling are met with technical proposals, it is believed that the benefits of merging (in some manner) the control planes of multiple layers will be outlined. When these benefits are self-evident, it would then seem to be the right time to review if vertical hierarchy mechanisms are needed, and what the requirements might be. 6. Hierarchy Requirements Efforts in the area of network hierarchy should focus on mechanisms that would allow more scalable edge-to-edge signaling, or signaling across networks with existing network hierarchy (such as multi-area OSPF). This would appear to be a more immediate need than mechanisms that might be needed to interconnect networks at different layers. 6.1 Historical Context One reason for horizontal hierarchy is functionality (e.g., metro versus backbone). Geographic Ÿislands÷ reduce the need for interoperability and make administration and operations less complex. Using a simpler, more interoperable, survivability scheme at metro/backbone boundaries is natural for many provider network architectures. In transmission networks, creating geographic islands of different vendor equipment has been done for a long time because multi-vendor interoperability has been difficult to achieve. Traditionally, providers have to coordinate the equipment on either end of a "connection," and making this interoperable reduces complexity. A provider should be able to concatenate survivability mechanisms in order to provide a "protected link" to the next higher level. Think of SDH/SONET rings connecting to TDM DXCs with 1+1 line-layer protection between the ADM and the DXC port. The TDM connection, e.g., a DS3 is protected, but usually all equipment on each SDH/SONET ring is from a single vendor. The DXC cross Lai, et al Category - Expiration 10 Network Hierarchy and Multilayer Survivability July 2001 connections are controlled by the provider and the ports are physically protected resulting in a highly available design. Thus, concatenation of survivability approaches can be used to cascade across horizontal hierarchy. While not perfect, it is workable in the near- to mid-term until multi-vendor interoperability is achieved. While the problems associated with multi-vendor interoperability may necessitate horizontal hierarchy as a practical matter (at least this has been the case in TDM networks), there may be no technical reason for it. Members of the team with more experience on IP networks felt there should be no need for this in core networks, or even most access networks. Some of the largest service provider networks currently run a single area/level IGP. Some service providers, as well as many large enterprise networks, run multi-area OSPF to gain increases in scalability. Often, this was from an original design, so it is difficult to say if the network truly required the hierarchy to reach its current size. Some proposals on improved mechanisms to address network hierarchy have been suggested [6, 7, 8]. This document aims to provide the concrete requirements so that these and other proposals can first aim to meet some limited objectives. 6.2 Applications for Horizontal Hierarchy A primary driver for intra-domain horizontal hierarchy is signaling scalability in the context of edge-to-edge VPNs, potentially across traffic-engineered data networks. There are a number of different approaches to VPNs and they are currently being addressed by different emerging protocols: RFC 2547bis BGP/MPLS VPNs, provider- provisioned VPNs based upon MPLS tunnels (e.g., virtual routers), Pseudo Wire Edge-to-edge Emulation (PWE3), etc. These may or not need explicit signaling from edge to edge, but it is a common perception that in order to meet SLAs, some form of edge-to-edge signaling is required. For signaling scalability, there are probably two types of network scenarios to consider: - Large SP networks with flat routing domains where edge-to-edge (MPLS) signaling as implemented today would probably not scale. - Networks which would like to signal edge-to-edge, and might even scale in a limited application. However, they are hierarchically routed (e.g. OSPF areas) and current implementations, and potentially standards prevent signaling across areas. This requires the development of signaling standards that support dynamic establishment and potentially restoration of LSPs across a 2-level IGP hierarchy. Lai, et al Category - Expiration 11 Network Hierarchy and Multilayer Survivability July 2001 Scalability is concerned with the O(N^2) properties of edge-to-edge signaling. For a large network, maintaining a "connection" between every edge is simply not scalable. Even if establishing and maintaining connections is feasible, there might be an impact on core survivability mechanisms which would cause restoration times to grow with N^2, which would be undesirable. While some value of N may be inevitable, approaches to reduce N (e.g. to pull in from the edge to aggregation points) might be of value. For routing scalability, especially in data applications, a major concern is the amount of processing/state that is required in the variety of network elements. If some nodes might not be able to communicate and process the state of every other node, it might be preferable to limit the information. There is one way of thought that says that the amount of information contained by a horizontal barrier should be significant, and that impacts this might have on optimality in route selection and ability to provide global survivability are accepted tradeoffs. 6.3 Horizontal Hierarchy Requirements Mechanisms are required to allow for edge-to-edge signaling of connections through a network. The types of network scenarios include large networks with a large number of edge devices and flat interior routing, as well as medium to large networks which currently have hierarchical interior routing such as multi-area OSPF or multi-level IS-IS. The primary context of this is edge-to-edge signaling which is thought to be required to assure the SLAs for the layer 2 and layer 3 VPNs that are being carried across the network. Another possible context would be edge-to-edge signaling in TDM SDH/SONET networks, where metro and core networks again might either be in a flat or hierarchical interior routing domain. 7. Survivability and Hierarchy When horizontal hierarchy exist in a network layer, a question arises as to how survivability can be provided along a connection which crosses hierarchical boundaries. In designing protocols to meet the requirements of hierarchy, an approach to consider is that boundaries are either clean, or are of minimal value. However, the concept of network elements that participate on both sides of a boundary might be a consideration (e.g. OSPF ABRs). That would allow for devices on either side to take an intra-area approach within their region of knowledge, and for the ABR to do this in both areas, and splice the two protected connections together at a common point (granted it is a common point of failure now). If the limitations of this approach start to appear in operational settings, then perhaps it would be time to start thinking about route-servers and signaling propagated directives. However, one initial approach might be to signal through a common border router, and to consider the service as protected as it consist of a concatenated set of connections which Lai, et al Category - Expiration 12 Network Hierarchy and Multilayer Survivability July 2001 are each protected within their area. Another approach might be to have a least common denominator mechanism at the boundary, e.g., 1+1 port protection. There should also be some standardized means for a survivability scheme on one side of such a boundary to communicate with the scheme on the other side regarding the success or failure of the service restoration action. For example, if a part of a "connection" is down on one side of such a boundary, there is no need for the other side to recover from failures. In summary, at this time, approaches that allow concatenation of survivability schemes across hierarchical boundaries should provide sufficient. 8. Security Considerations Security is not considered in this initial version. 9. References 1 Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996. 2 Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997 3 V. Sharma, B. Crane, K. Owens, C. Huang, F. Hellstrand, J. Weil, L. Andersson, B. Jamoussi, B. Cain, S. Civanlar, and A. Chiu, "Framework for MPLS-based Recovery," Internet-Draft, Work in Progress, March 2001. 4 D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao, "A Framework for Internet Traffic Engineering," Internet-Draft, Work in Progress, May 2001. 5 N. Harrison, et al, "Requirements for OAM in MPLS Networks," Internet-Draft, Work in Progress, May 2001. 6 K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic Engineering," Internet-Draft, Work in Progress, March 2001. 7 G. Ash, et al, "Requirements for Multi-Area TE," Internet-Draft, Work in Progress, March 2001. 8 A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback Routing Extensions for MPLS Signaling," Internet-Draft, Work in Progress, July 2001. 10. Acknowledgments Lai, et al Category - Expiration 13 Network Hierarchy and Multilayer Survivability July 2001 A lot of the direction taken in this document, and by the team, was steered by the insightful questions provided by Bala Rajagoplan, Greg Bernstein, Yangguang Xu, and Avri Doria. The set of questions is attached as Appendix A in this document. 11. Author's Addresses Wai Sum Lai AT&T 200 Laurel Avenue Middletown, NJ 07748, USA Tel: +1 732-420-3712 wlai@att.com Dave McDysan WorldCom 22001 Loudoun County Pkwy Ashburn, VA 20147, USA dave.mcdysan@wcom.com Jim Boyle jimpb@nc.rr.com Malin Carlzon malin@sunet.se Rob Coltun rcoltun@redback.com Tim Griffin AT&T 180 Park Avenue Florham Park, NJ 07932, USA Tel: +1 973-360-7238 griffin@research.att.com Ed Kern Cogent Communications 3413 Metzerott Rd College Park, MD 20740, USA Tel: +1 703-852-0522 ejk@tech.org Tom Reddington Lucent Technologies 67 Whippany Rd Whippany, NJ 07981, USA Tel: +1 973-386-7291 treddington@bell-labs.com Appendix A: Questions used to help develop requirements Lai, et al Category - Expiration 14 Network Hierarchy and Multilayer Survivability July 2001 A. Definitions 1. In determining the specific requirements, the design team should precisely define the concepts "survivability", "restoration", "protection", "protection switching", "recovery", "re-routing" etc. and their relations. This would enable the requirements doc to describe precisely which of these will be addressed. In the following, the term "restoration" is used to indicate the broad set of policies and mechanisms used to ensure survivability. B. Network types and protection modes 1. What is the scope of the requirements with regard to the types of networks covered? Specifically, are the following in scope: Restoration of connections in mesh optical networks (opaque or transparent) Restoration of connections in hybrid mesh-ring networks Restoration of LSPs in MPLS networks (composed of LSRs overlaid on a transport network, e.g., optical) Any other types of networks? Is commonality of approach, or optimization of approach more important? 2. What are the requirements with regard to the protection modes to be supported in each network type covered? (Examples of protection modes include 1+1, M:N, shared mesh, UPSR, BLSR, newly defined modes such as P-cycles, etc.) 3. What are the requirements on local span (i.e., link by link) protection and end-to-end protection, and the interaction between them? E.g.: what should be the granularity of connections for each type (single connection, bundle of connections, etc). C. Hierarchy 1. Vertical (between two network layers): What are the requirements for the interaction between restoration procedures across two network layers, when these features are offered in both layers? (Example, MPLS network realized over pt-to-pt optical connections.) Under such a case, (a) Are there any criteria to choose which layer should provide protection? (b) If both layers provide survivability features, what are the requirements to coordinate these mechanisms? (c) How is lack of current functionality of cross-layer cooridnation currently hampering operations? Lai, et al Category - Expiration 15 Network Hierarchy and Multilayer Survivability July 2001 (d) Would the benefits be worth additional complexity associated with routing isolation (e.g. VPN, areas), security, address isolation and policy / authentication processes? 2. Horizontal (between two areas or administrative subdivisions within the same network layer): (a) What are the criteria that trigger the creation of protocol or administrative boundaries pertaining to restoration? (e.g., scalability? multi-vendor interoperability? what are the practical issues?) multi-provider? Should multi-vendor necessitate hierarchical seperation? When such boundaries are defined: (b) What are the requirements on how protection/restoration is performed end-to-end across such boundaries? (c) If different restoration mechanisms are implemented on two sides of a boundary, what are the requirements on their interaction? What is the primary driver of horizontal hierarchy? (select one) - functionality (e.g. metro -v- backbone) - routing scalability - signaling scalability - current network architecture, trying to layer on TE ontop of already hiearchical network architecture - routing and signalling For signalling scalability, is it - managability - processing/state of network - edge-to-edge N^2 type issue For routing scalability, is it - processing/state of network - are you flat and want to go hierarchical - or already hierarchical? - data or TDM application? D. Policy 1. What are the requirements for policy support during protection/restoration, e.g., restoration priority, preemption, etc. E. Signaling Mechanisms 1. What are the requirements on the signaling transport mechanism (e.g., in-band over sonet/sdh overhead bytes, out-of-band over an IP network, etc.) used to communicate restoration protocol messages between network elements. What are the bandwidth and other requirements on the signaling channels? Lai, et al Category - Expiration 16 Network Hierarchy and Multilayer Survivability July 2001 2. What are the requirements on fault detection/localization mechanisms (which is the prelude to performing restoration procedures) in the case of opaque and transparent optical networks? What are the requirements in the case of MPLS restoration? 3. What are the requirements on signaling protocols to be used in restoration procedures (e.g., high priority processing, security, etc). 4. Are there any requirements on the operation of restoration protocols? F. Quantitative 1. What are the quantitative requirements (e.g., latency) for completing restoration under different protection modes (for both local and end-to-end protection)? G. Management 1. What information should be measured/maintained by the control plane at each network element pertaining to restoration events? 2. What are the requirements for the correlation between control plane and data plane failures from the restoration point of view? Full Copyright Statement "Copyright (C) The Internet Society (date). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implmentation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF Lai, et al Category - Expiration 17 Network Hierarchy and Multilayer Survivability July 2001 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Lai, et al Category - Expiration 18