Network Working Group                Jonathan P. Lang (Calient Networks)
Internet Draft                             John Drake (Calient Networks)
Expiration Date: August 2001            Yakov Rekhter (Juniper Networks)
                                                           February 2001

                  Generalized MPLS Recovery Mechanisms

                    draft-lang-ccamp-recovery-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026 [RFC2026].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This draft discusses protection and restoration mechanisms for fault management within the GMPLS framework [GMPLS].

Lang, J., Drake, J., Rekhter, Y.                                [Page 1]

Internet Draft     draft-lang-ccamp-recovery-00.txt       February 2001

1. Introduction

A key requirement for the development of a common control plane for both optical and electronic networks is that there must be features in the signaling, routing, and link management protocols to enable intelligent fault management. Fault management requires four steps: fault detection, fault localization, fault notification, and fault recovery.

Fault detection should be handled at the layer closest to the failure; for optical networks, this is the physical (optical) layer.
One measure of fault detection at the physical layer is detecting loss of light (LOL); other techniques based on, for example, OSNR, BER, dispersion, crosstalk, and attenuation are still being investigated (see, for example, [OLCP] and [LMP-DWDM]).

Fault localization requires communication between nodes to determine where the failure has occurred (for example, SONET AIS is used to localize failures between SONET terminating devices). One interesting consequence of using LOL to detect failures in optical networks is that LOL propagates downstream along the connection's path. The Link Management Protocol (LMP) [LMP] includes a fault localization procedure that is designed to localize failures in both transparent (all-optical) and opaque (opto-electrical) networks, and is independent of the data encoding scheme.

Fault notification is the communication of a failure between the node detecting it and a node equipped to deal with the failure. Fast fault notification is essential for rapid recovery. The Notify mechanism of [RSVP-GEN] is designed to support fast notification of non-adjacent nodes.

Once a failure has been detected and localized, and the responsible node has been notified, protection and restoration can be used to recover from the failure. We make the distinction between protection and restoration by the time scales in which they operate. Protection is designed to react to failures rapidly (say, in less than a couple hundred milliseconds) and often involves 100% resource redundancy. For example, SONET automatic protection switching (APS) is designed to switch the traffic from a primary (working) path to a secondary (protection) path in less than 50ms. This requires simultaneous transmission along both the primary and secondary paths (called 1+1 protection) with a selector at the receiving node, and uses twice as many network resources as a non-APS protected path.
Restoration, on the other hand, is designed to react to failures quickly, but it typically takes an order of magnitude longer to restore the connection compared to protection switching. This is because restoration typically utilizes pools of shared resources that are more efficient in terms of network utilization. In addition, restoration may involve rerouting connections, which can be computationally expensive if the paths are not pre-calculated or if the pre-calculated resources are no longer available.

Protection and restoration methods have traditionally been addressed using two techniques: path-level recovery, where the failure is addressed at the end nodes (i.e., the initiating and terminating nodes of the path); and span-level recovery, where the failure is addressed at an intermediate or transit node. Path-level recovery can be further subdivided into path protection, where secondary (or protection) paths are pre-allocated, and path restoration, where connections are rerouted, either dynamically or using pre-calculated (but not pre-allocated) paths. Span-level recovery can be subdivided into span protection, where traffic is switched to an alternate channel or link connecting the same two nodes, and span restoration, where traffic is switched to an alternate route between the two nodes (this involves passing through additional intermediate nodes).

To effectively use protection, there must be mechanisms to configure protected links on a span between nodes, advertise the protection bandwidth of a link so that it may be used by a class of traffic that has different availability requirements, establish secondary (protection) LSPs to protect primary LSPs, allow the resources of secondary LSPs to be used by lower priority traffic until a switchover occurs, and signal protection switchover when necessary.
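
As a rough illustration of the four recovery categories introduced earlier in this section (path or span level, protection or restoration), the following Python sketch encodes them as a small data structure. It is not part of any GMPLS specification: the names Scope, Mode, RecoveryScheme, DESCRIPTIONS, and classify are hypothetical, and reducing the distinctions to two boolean inputs is a simplifying assumption made only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    PATH = "path"  # failure addressed at the LSP end nodes
    SPAN = "span"  # failure addressed at an intermediate (transit) node


class Mode(Enum):
    PROTECTION = "protection"    # recovery resources set aside in advance
    RESTORATION = "restoration"  # recovery resources obtained after the failure


# Descriptions paraphrase the Introduction; keys are (scope, mode) pairs.
DESCRIPTIONS = {
    (Scope.PATH, Mode.PROTECTION):
        "switch to a pre-allocated secondary path",
    (Scope.PATH, Mode.RESTORATION):
        "reroute dynamically or over a pre-calculated path",
    (Scope.SPAN, Mode.PROTECTION):
        "switch to an alternate channel or link between the same two nodes",
    (Scope.SPAN, Mode.RESTORATION):
        "switch to an alternate route between the two nodes",
}


@dataclass(frozen=True)
class RecoveryScheme:
    scope: Scope
    mode: Mode

    def describe(self) -> str:
        text = DESCRIPTIONS[(self.scope, self.mode)]
        return f"{self.scope.value} {self.mode.value}: {text}"


def classify(handled_at_end_nodes: bool,
             resources_preallocated: bool) -> RecoveryScheme:
    """Map the two (assumed) yes/no questions onto one of four categories."""
    scope = Scope.PATH if handled_at_end_nodes else Scope.SPAN
    mode = Mode.PROTECTION if resources_preallocated else Mode.RESTORATION
    return RecoveryScheme(scope, mode)


if __name__ == "__main__":
    # Print all four categories.
    for at_ends in (True, False):
        for prealloc in (True, False):
            print(classify(at_ends, prealloc).describe())
```

The table-driven form keeps the four descriptions in one place, so an implementation that distinguishes further variants (for example 1+1 versus M:N) could extend the keys without changing the classification logic.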
In this draft, we discuss protection and restoration in the context of GMPLS signaling. Specifically, we address these issues in the context of RSVP signaling and OSPF and IS-IS routing.

2. Protection Mechanisms

Protection is designed to react to failures at the fastest timescale and typically involves pre-provisioning protection resources. In this section we discuss both span and path protection and present mechanisms within GMPLS to implement both protection schemes.

2.1 Span Protection

A span consists of a number of channels between two adjacent nodes that are bundled into a single link called a TE link (see [LMP]). Span protection involves switching to a protection channel when a failure occurs on a working channel. At the span level, both dedicated (1+1, 1:1) and shared (M:N) protection may be implemented. The protection type supported by a TE link (LPT) will be advertised throughout the network using an IGP so that intelligent routing decisions can be made (see Section 4). The desired protection for a path is signaled as part of the Generalized Label Request in GMPLS signaling. This is needed in signaling if a link supports multiple protection types or if loose routing is used.

For dedicated 1+1 span protection, each node must replicate the data onto two separate channels (possibly using separate component links of a bundled link or separate ports of a TE link) and the adjacent node must select the data from only one channel based on the signal integrity. This is the fastest protection mechanism; however, it requires twice the LSP bandwidth between each pair of nodes and the ability to replicate the data on two separate channels.

For shared M:N protection, M protection links are shared between N primary links. Since data is not replicated on both the primary and secondary links, failures must first be isolated before the switchover can occur.
LMP can be used for fault isolation, and the upstream node (upstream in terms of the direction an RSVP Path message traverses) will initiate the local span protection. To initiate span protection, the upstream node SHOULD send an RSVP Path message with a Label Set object including the labels for the available secondary links. If more than one label is included in the Label Set object, the Suggested Label object should be used to indicate the preferred secondary label. If the failure affected a bi-directional LSP, a new Upstream Label may also need to be transmitted. In addition, a new LinkId, a new PHOP, and a modified ERO may also need to be included based on the shared protection configuration. Note that the benefit of exchanging the shared protection configuration in advance using LMP is that it minimizes potential label conflicts during protection switching.

When the downstream node receives the Path message with the new objects, it MUST verify the parameters, update the RSVP Path state, and either respond with an RSVP Resv message carrying a new label or, if the resources are not available, generate a PathError message.

2.2 Path Protection

Path protection is addressed at the end nodes of an LSP (i.e., LSP initiator and terminator) and requires switching to an alternate path when a failure occurs. For 1+1 path protection, a signal is transmitted simultaneously over two disjoint paths and a selector is used at the receiving node to choose the better signal. For M:N path protection, N primary signals are transmitted along disjoint paths, and M secondary paths are pre-established for shared protection switching among the N primary paths.

2.2.1 1+1 Path Protection

There are a number of 1+1 path protection variations that may be implemented that provide different levels of protection. The most common notion of 1+1 path protection is to select two disjoint paths, one primary and one secondary, where each link along both paths is unprotected.
This protects against a single link or node failure, depending on how the two paths are disjoint. One variation of 1+1 path protection is to select a single path where each link individually supports 1+1 span protection as discussed in Subsection 2.1. This protects against a single link failure, but not a node failure. One may also combine the two approaches: for every contiguous segment of the path that includes only links that do not support 1+1 span protection, the head-end LSR computes a link-disjoint segment, with the constraint that none of the links in the newly computed segments have 1+1 protection.

After the two paths are computed, the head-end LSR will originate two LSPs with dedicated 1+1 and unprotected bits set in the LPT. The setup will indicate that these two paths request Shared-Explicit reservations (see [TUNNEL]). At each node where the two paths branch out, the node must replicate the data into both branches. At each node where the two paths merge, the node must select the data from only one path based on the integrity of the signal. For SONET/SDH, LSPs are bi-directional and each branching point is also a merging point and vice versa.

As an example consider the following:

        M
       / \
  A---B   C----D
       \ /
        N

Only links A-B and C-D support 1+1 protection. Node A wants to establish a 1+1 protected path to D. In this case, A computes a primary path, A, B, M, C, D, where the segment B, M, C has links that do not support 1+1 protection. Therefore, A computes a link-disjoint segment, B, N, C, and uses it to construct a secondary path, A, B, N, C, D. A initiates a setup of two LSPs indicating the desire for Shared-Explicit (SE) reservations: the first path is routed along A, B, M, C, D, and the second path is routed along A, B, N, C, D. Since the two LSPs branch out at node B, B sends the data it receives from A to both M and N.
At node C, the two LSPs merge and C selects the data received over one of these LSPs (based on the integrity of the signal), and forwards this data to D. When the LSP from A to D is bi-directional, C must also send the data it receives from D to both M and N, and B must select the data received from either M or N and forward it to A.

2.2.2 M:N Path Protection

There are a number of M:N path protection variations that may be implemented to provide different levels of protection and to address different network configurations. The most common notion of M:N path protection is to route N node-disjoint primary paths and pre-establish M backup paths that are node disjoint from the primary paths. This protects against M path failures. Another variation of M:N path protection is to select a single path where each link individually supports M:N span protection. This protects against M link failures over each span, but is not robust to node failures. One may also combine the two approaches: for every contiguous segment of the path that includes only links that do not support M:N span protection, the head-end node computes a node- or link-disjoint segment, with the constraint that none of the links in the newly computed segments need to be protected.

An important feature of the GMPLS work is that it allows pre-configuring secondary (backup) LSPs to protect primary LSPs. This is done by indicating the LSP is of type Secondary in the protection field of the Generalized Label Request. Secondary LSPs are used for fast switchover when primary LSPs fail. Although the resources for the secondary LSPs are pre-allocated, lower priority traffic may use the resources with the caveat that the lower priority traffic will be preempted if the primary LSP fails.
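
The preemption behavior described above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not a description of how any GMPLS implementation manages its state: the class SecondaryLsp, its fields, and the methods admit_extra_traffic and switchover are hypothetical names introduced only for illustration, and signaling is left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SecondaryLsp:
    """Hypothetical model of a pre-established secondary (backup) LSP."""
    name: str
    protects: List[str]                       # primary LSPs sharing this secondary
    extra_traffic: List[str] = field(default_factory=list)
    carrying: Optional[str] = None            # primary currently switched onto it

    def admit_extra_traffic(self, flow: str) -> bool:
        """Let lower priority traffic borrow the idle secondary resources."""
        if self.carrying is not None:
            return False                      # resources are in use after a switchover
        self.extra_traffic.append(flow)
        return True

    def switchover(self, failed_primary: str) -> List[str]:
        """Move a failed primary onto this LSP; return the preempted flows."""
        if failed_primary not in self.protects:
            raise ValueError(f"{failed_primary} is not protected by {self.name}")
        if self.carrying is not None:
            raise RuntimeError(f"{self.name} already carries {self.carrying}")
        preempted = list(self.extra_traffic)  # lower priority traffic is preempted
        self.extra_traffic.clear()
        self.carrying = failed_primary
        return preempted


if __name__ == "__main__":
    backup = SecondaryLsp(name="S1", protects=["P1", "P2"])
    backup.admit_extra_traffic("low-priority-1")
    print(backup.switchover("P1"))            # flows displaced by the switchover
```

In this sketch a secondary LSP that already carries one primary refuses a second switchover, which mirrors the point that the remaining primaries are no longer protected once the shared secondary is in use.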
If lower priority traffic is using resources along the secondary LSPs, the end nodes may need to be notified of the failure in order to complete the switchover. The setup of the primary LSP SHOULD indicate that the LSP initiator and terminator wish to receive Notify messages using the Notify Request object. If a failure occurs, LMP can be used to isolate the failure. Once the failure is isolated, the upstream node (upstream in terms of the direction an RSVP Path message traverses) SHOULD send an RSVP Notify message to the LSP initiator, and the downstream node SHOULD send an RSVP Notify message to the LSP terminator. Upon receipt of the Notify messages, the source and destination nodes MUST switch the traffic from the primary LSP to the pre-configured secondary LSP. Note that if a common initiator-terminator is used for all N primary paths sharing the secondary path (assuming 1:N protection), no further notification is required to indicate that the N primary LSPs are no longer protected.

As an example consider the following:

         A---B        E---F
        /     \      /     \
   I---        C----D        ---T
        \     /      \     /
         J---K        L---M

Two node-disjoint routes from initiator I to terminator T cannot be found; however, two node-disjoint routes can be found from node I to node C and from node D to node T. Furthermore, the link from node C to node D is protected using dedicated 1:1 protection. In this case, I computes the primary route R1={I,A,B,C,D,E,F,T} and secondary route R2={I,J,K,C,D,L,M,T}, where the segment {C,D} supports 1:1 span protection. I initiates a setup of two LSPs indicating the desire for Shared-Explicit reservations; the primary LSP is routed along R1 and the secondary LSP is routed along R2.

3. Restoration Mechanisms

Restoration is designed to react to failures quickly and use bandwidth efficiently, but typically involves dynamic resource establishment and route calculation, and therefore takes more time to switch to an alternate path than protection techniques.
Restoration can be implemented at the initiator node or at an intermediate node once the responsible node has been notified. Failure notification can be done using the Notify procedures of [GMPLS] or using the standard RSVP PathError messages. In this section, we briefly discuss span and path restoration and highlight the RSVP mechanisms that can be used to implement them.

To support span restoration, where traffic is switched to an alternate route around a failure, a new LSP is established at an intermediate node that involves passing through additional intermediate nodes. Span restoration may be beneficial for LSPs that span multiple hops and/or large distances because the latency incurred for failure notification may be significantly reduced and only segments of the LSP are rerouted instead of the entire path. The RSVP Notify Request object can be used by an intermediate node to request that it be the target of an RSVP Notify message. Span restoration may break traffic-engineering (TE) requirements if a strict-hop route is defined for the connection. Furthermore, the constraints used for routing the connection must be forwarded so that an intermediate node doing span restoration is able to calculate an appropriate alternate route; this is similar to the problems of establishing and maintaining TE requirements that span multiple areas (see [MULTI] for a proposed mechanism).

Path restoration, on the other hand, switches traffic to an alternate route around a failure, where the new route is selected at the LSP initiator; the new route may reuse intermediate nodes used by the original LSP and may include additional intermediate nodes. For strict-hop routing, TE requirements can be directly applied to the route calculation, and the failed node or link can be avoided.
However, if the failure occurred within a loose-routed hop, the source node may not have enough information to reroute the connection around the failure.

Restoration (span or path) will be initiated by the node that has isolated the failure or by the node that has received either an RSVP Notify message or an RSVP PathError message indicating that a failure has occurred. The new resources can be established in a make-before-break fashion, where the new LSP is set up before the old LSP is torn down, using the mechanisms of the LSP_Tunnel Session object (see [TUNNEL]) and the Shared-Explicit reservation style. Both the new and old LSPs share resources at nodes common to both LSPs. The Tunnel end point addresses, Tunnel Id, Extended Tunnel Id, Tunnel sender address, and LSP Id are all used to uniquely identify both the old and new LSPs; this ensures new resources are established without double counting resource requirements along common segments.

4. Routing Enhancements

The GMPLS extensions to OSPF [OSPF-GE] and IS-IS [ISIS-GE] include the advertisement of the LPT. The LPT field is a bit vector that indicates the protection capabilities that are supported for the link. The LPT field may be configured with Dedicated 1+1, Dedicated 1:1, Shared M:N, and Enhanced protection, as well as Unprotected.

For a link that has dedicated 1+1 protection or is unprotected, this advertisement provides a complete description of the link capabilities and the usable bandwidth. However, a key argument for using dedicated 1:1 or shared M:N is the efficiency gained by reusing the protection bandwidth for lower priority traffic when the bandwidth would otherwise be idle. To advertise the protection bandwidth for a link that has dedicated 1:1 or shared M:N protection, a link with LPT field Extra Traffic should be advertised.
This indicates that bandwidth can be used by LSPs, with the caveat that any LSPs routed over this link will be preempted if the resources are needed as a result of a failure over the primary link. When a failure occurs on a dedicated 1:1 or shared M:N link, the LSPs routed over the link will automatically be switched to the Extra Traffic link that is protecting it.

To support the routing of Secondary LSPs for M:N path protection (as described in Section 2.2.2), new extensions must be added to the current GMPLS routing extensions. In particular, there must be a mechanism to advertise secondary bandwidth and processing rules must be defined for bandwidth accounting when LSP requests arrive at a node. See [BWAcct] for a proposal addressing these issues.

5. Acknowledgments

We would like to thank Kireeti Kompella and Ayan Banerjee for their comments and fruitful discussions.

6. References

[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3," BCP 9, RFC 2026, October 1996.

[GMPLS] Ashwood-Smith, P., Banerjee, A., Berger, L., et al, "Generalized MPLS - signaling functional description," Internet Draft, draft-ietf-mpls-generalized-mpls-signaling-01.txt, (work in progress).

[OLCP] Chiu, A., Strand, J., Tkach, R., "Unique Features and Requirements for The Optical Layer Control Plane," Internet Draft, draft-chiu--strand-unique-OLCP-01.txt, (work in progress).

[LMP-DWDM] Fredette, A., Snyder, E., Shantigram, J., et al, "Link Management Protocol (LMP) for WDM Transmission Systems," Internet Draft, draft-fredette-lmp-wdm-00.txt, (work in progress).

[LMP] Lang, J. P., Mitra, K., Drake, J., Kompella, K., et al, "Link Management Protocol (LMP)," Internet Draft, draft-ietf-mpls-lmp-01.txt, (work in progress).
[RSVP-GEN] Ashwood-Smith, P., Banerjee, A., Berger, L., et al, "Generalized MPLS Signaling - RSVP-TE Extensions," Internet Draft, draft-ietf-mpls-generalized-rsvp-te-00.txt, (work in progress).

[TUNNEL] Awduche, D., Berger, L., Gan, D-H., Li, T., Srinivasan, V., Swallow, G., "RSVP-TE: Extensions to RSVP for LSP Tunnels," Internet Draft, draft-ietf-mpls-rsvp-lsp-tunnel-07.txt, (work in progress).

[MULTI] Kompella, K., Rekhter, Y., "Multi-area MPLS Traffic Engineering," Internet Draft, draft-kompella-mpls-multiarea-te-00.txt, (work in progress).

[OSPF-GE] Kompella, K., Rekhter, Y., Banerjee, A., Drake, J., et al, "OSPF Extensions in Support of MPL(ambda)S," Internet Draft, draft-kompella-ospf-ompls-extensions-00.txt, (work in progress).

[ISIS-GE] Kompella, K., Rekhter, Y., Banerjee, A., Drake, J., et al, "IS-IS Extensions in Support of Generalized MPLS," Internet Draft, draft-ietf-isis-gmpls-extensions-01.txt, (work in progress).

[BWAcct] Kompella, K., Lang, J.P., Drake, J., "Bandwidth Accounting in Support of Secondary LSPs," Internet Draft, (work in progress).

7. Authors' Addresses

Jonathan P. Lang
Calient Networks
25 Castilian Drive
Goleta, CA 93117
email: jplang@calient.net

John Drake
Calient Networks
5853 Rue Ferrari
San Jose, CA 95138
email: jdrake@calient.net

Yakov Rekhter
Juniper Networks
1194 N. Mathilda Avenue
Sunnyvale, CA 94089
email: yakov@juniper.net