Fault Notification and Service Recovery Protocol June 2002 CCAMP Working Group Richard Rabbat (FLA) Internet Draft Ching-Fong Su (FLA) Expires: December 2002 Norihiko Shinomiya (FLL) Vishal Sharma (Metanoia) Takafumi Chujo (FLA) Akira Chugo (FLL) June 2002 A Fault Notification and Service Recovery Protocol draft-rabbat-fault-notification-protocol-00.txt Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This draft describes a fault notification and service recovery protocol that can be used in GMPLS-enabled networks to achieve bounded time activation of protection paths in the case of single failures. It presents a complete solution to the problem and justifies choices made for the notification method, extensions required to current algorithms and protocols and any security and deployment issues. Table of Contents Status of this Memo...............................................1 Abstract..........................................................1 Table of Contents.................................................1 1. Overview.......................................................2 2. Terminology....................................................3 3. Glossary of Terms Used.........................................3 Rabbat Expires - December 2002 [Page 1] Fault Notification and Service Recovery Protocol June 2002 4. Requirements at Protection Path Setup Time.....................4 5. Protocol Steps in Failure Notification and Service Recovery....5 5.1 T1: Fault Detection Time......................................5 5.2 T2: Hold-Off Time.............................................6 5.3 T3: Fault Notification and Completion of Recovery Operation...6 5.3.1 Delays Incurred by Messages to PSL and on Protection Path...8 5.4 T4: Traffic Restoration Time..................................8 6. Examples of Protection switching...............................9 7. Finding the Time-Constrained Protection Path..................11 7.1 Finding a Sub-graph for the Protection Algorithm.............12 7.2 Hybrid Protection/Restoration Method.........................12 7.3 Reversion After Fixing the Failure...........................13 8. Implementation of flooding notification.......................13 9. Security Considerations.......................................13 10. Conclusion...................................................13 Appendix A. Fault Notification Message Delays on Path............14 A.1 Delays Associated with Link Traversal........................14 A.2 Delays Incurred at the Nodes.................................14 Appendix B. Finding a Sub-graph Subject to Time Constraints......15 11. References...................................................17 12. Author's Addresses...........................................18 1. Overview The issue of protection and restoration in optical switching networks is of high importance to ensure high-availability and uninterrupted service. Several mechanisms for protection in mesh and ring topologies have been devised. Protection and restoration algorithms can be used for local repair (link-based or node-based), span protection and path protection. This document describes a fault notification and service recovery protocol to ensure bounded recovery times, e.g., 50 ms recovery time, which is comparable to recovery in SONET/SDH networks. In the case of link-based protection, the protection algorithm can handle faults such as fiber link failures. In the case of a node failure, recovery can be done through either node-based or path-based protection. The advantage of path-based protection lies in its ability to reduce wavelength redundancy (wavelengths that are reserved for possible failures) but its disadvantage is the potentially lengthy delay incurred in notifying all nodes along the protection path, of the failure of a remote resource. In some applications, protection paths need to be chosen carefully to meet some restoration time requirement (e.g., 50ms). Several drafts being discussed by the Internet Engineering Task Force deal with signaling and methodology that can be used in a framework Rabbat Expires - December 2002 [Page 2] Fault Notification and Service Recovery Protocol June 2002 for guaranteeing a bounded protection and recovery time, but none do provide hard guarantees for GMPLS-based SONET and WDM networks, although some have claimed results of fast reroute on MPLS-based IP networks. For example, although several mechanisms and frameworks discuss requirements for 50 ms protection switching, they do not propose a process that achieves the goal. This document addresses this issue. This document presents a fault-notification protocol that is both technology and topology agnostic. It applies to intra-domain protection. Multi-domain protection is left for further study. We assume unidirectional traffic through Label Switched Paths (LSPs) and, where relevant, discuss applicability to bi-directional traffic which consists of two unidirectional LSPs. For the purpose of illustration, we also assume a mesh WDM network; applicability to ring-topology requires further study. 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [1]. 3. Glossary of Terms Used LSR: Label Switched Router PSL: Path Switch LSR, commonly the ingress LSR PML: Path Merge LSR, commonly the egress LSR LMP: Link Management Protocol FDL: Failure Detection LSR, defined in this draft as the LSR that detects the resource failure, either by a signal from the optical/transport layer SD/SF: Signal Degrade and Signal Fail, as described in [2] MEMS: Micro-Electro Mechanical Systems PXC: Photonic Cross-Connect, a cross-connect that switches wavelengths transparently, by means of a switching fabric such as MEMS Rabbat Expires - December 2002 [Page 3] Fault Notification and Service Recovery Protocol June 2002 AIS: Alarm Indication Signal, a signal at the SONET/SDH transport layer BDI: Backward Defect Indication, a signal at the transport layer sent upstream FIS: Fault Indication Signal, defined in [3] 4. Requirements at Protection Path Setup Time When the protection algorithm calculates the protection path for a certain type of protection (local repair or end-to-end) in a GMPLS- enabled network, it distributes labels for that path to reserve the link resources (wavelengths, wavebands, etc.) for protection and possibly select them. These link resources could be used to carry preemptible best-effort traffic to increase the utilization of the network, when the protection path is not activated. Alternatively, the same link resource may be reserved by multiple protection paths for different link failures as long as these protection paths do not need to be activated simultaneously (e.g., M:N shared protection). In either case, proper link resource needs to be activated upon the notification of failure. When a label for a protection LSP is set up on a certain node A through RSVP-TE or CR-LDP, node A must know what network resource this LSP is protecting. In the case of RSVP-TE for example, the protection PATH message may notify all nodes on the protection path with that information at path setup as in the case of [4]. This allows node A to bundle labels (as well as its link resources) that protect a particular network resource. For example, if two labels j and k correspond to two LSPs used to protect working paths from the failure of link (X, Y), then they belong to the bundle L (X, Y). This allows node A to jointly activate/cross-connect both LSPs referenced by labels j and k when it receives notification of the failure of link (X, Y). This documents proposes a method for per-failure fault notification (as compared to per-LSP fault notification), hence such bundled label information is essential. The main difference between "per-failure" vs. "per-LSP" notification is the number of notification mechanisms that need to be started. Per-failure fault notification allows one mechanism to engage to notify all relevant nodes of the fault. On the other hand and in the case of per-LSP notification, as many mechanisms as there exists failed LSP's (for example, all LSPÆs that failed due to a link failure) have to be engaged. Rabbat Expires - December 2002 [Page 4] Fault Notification and Service Recovery Protocol June 2002 5. Protocol Steps in Failure Notification and Service Recovery The steps described in this section present a fault notification protocol that details the process used in notifying nodes of the resource failure and activating the protection lightpaths. The failure sequence is based on Internet Draft [3] adapted to WDM networks. The timing diagram in Figure 1 is reproduced and modified based on [3].The critical component in guaranteeing time constraints to service recovery is the fault notification process. The following sequence of events MUST be followed in order to ensure that the recovery process happens within a specific amount of time, as is the case of SONET/SDH-based networks. --Network Impairment | --Fault Detected | | --Start of Notification | | | --Recovery Operation Complete | | | | --Path Traffic Restored | | | | | | | | | | v v v v v ------------------------------------------------ | T1 | T2 | T3 | T4 | Figure 1. Recovery Cycle Model 5.1 T1: Fault Detection Time This is the period of time between the network impairment and the detection at the control plane. We define the Failure Detection LSR (FDL) to be the node at which the fault is detected. An example of such network impairment is a fiber cut. Layer 1 at a certain node detects the fault and passes it to the control plane. This document assumes that equipment in the optical network can detect such failures. This time is not included in the calculation of the recovery time. In general, in case a full-duplex link is cut, two nodes will detect the fault, one upstream of the fault, and the other downstream. In the case of a unidirectional link failure, the node downstream detects the failure. In that case, that node will send at the transport layer a signal such as the Backward Defect Indication (BDI) defined in ITU-T G.709 to the node upstream that will also act as FDL. We assume that the time difference between detection and inference based on BDI is negligible. Other transport plane technologies MUST offer the same capability to be used in this context. So both upstream and downstream nodes detect the failure. Rabbat Expires - December 2002 [Page 5] Fault Notification and Service Recovery Protocol June 2002 To support the failure detection requirement, nodes MUST implement per-channel monitoring that will pinpoint the failure and report it at the FDL. 5.2 T2: Hold-Off Time This is the period of time that elapses before the node that has detected the fault starts the fault-recovery process. This time assumes that the fault-recovery process at a given layer (for example, the recovery protocol outlined in this draft) may wait for restoration or recovery to occur at another layer (for example, SONET/SDH protection). In the case of WDM-based protection, this time should be 0 sec since there is no underlying protection layer that will kick in. In the case of GMPLS-enabled IP network over SONET, the T2 may be set to 50ms such that SONET protection scheme can activate before any IP (MPLS) layer protection is triggered. For GMPLS-enabled SONET over WDM, the choice is a bit complicated. Protection mechanisms such as SONET/SDH protection could be used in the same environment in conjunction with WDM-based protection by picking either protection mechanism or no protection at all. Allowing redundant protection mechanisms for the same lightpath may increase the recovery time. The SONET/SDH layer, if it exists, makes the decision as to whether to request from the WDM layer a protected or unprotected lightpath to connect the SONET equipment. 5.3 T3: Fault Notification and Completion of Recovery Operation T3 is the period between the time when FDL starts sending out a fault notification message, and the time when every node including PSLs (Path Switch LSR[3]) on the corresponding protection paths have been notified of the failure and finished reconfiguring themselves for carrying restored traffic. The PSL (Path Switch LSR) can be the same physical entity as the FDL in the case of link-based protection. It is important to have the PSL be as close to the link failure as possible. This reduces the recovery time since no messages have to be relayed to a remote or centralized authority to initiate that recovery. Some PSLs or PMLs may detect a failure, for example, a Loss of Light (LoL) event. The fault notification message MUST be initiated by the FDL even if the PSLs and PMLs have detected the error. This allows the fault notification mechanism to solve for the worst-case scenarios and gives timely notification of all concerned nodes on the protection path(s). For the purpose of this draft, transport plane signals such as the AIS (Alarm Indication Signal) and the BDI will be disregarded by all OXCs except the FDLs. It is to be noted that Rabbat Expires - December 2002 [Page 6] Fault Notification and Service Recovery Protocol June 2002 fault notification occurs at the control plane to minimize layer interaction. The FDL MAY use several fault notification methods to notify other nodes of the failure, including GMPLS-based signaling and flooding. In the case of GMPLS-based signaling, there is generally one fault notification message per disrupted Label Switched Path. Hence, signaling does not scale well with the number of connections; in addition, the message processing delay is less predictable. For details about the notification methods and the choice of flooding for this draft, the reader is encouraged to refer to [5]. This document specifies a notification protocol based on message flooding. In the case of flooding, the message sent from the FDL to all other nodes on the different protection paths should reach them within the specified amount of time Trec (recovery time)minus the reconfiguration time Tcfg needed at each node after fault notification. We define this time to be Tnot = Trec - Tcfg. Tnot is the fault notification time. This is explained in detail in Appendix B. Nodes on a protection path (including the PSL) are aware that they are protecting against the failure of a particular resource. All nodes notified of the failure will activate the protection path by performing any required hardware configuration (for example, moving mirrors in the case of a MEMS-based switching fabric). The PSL starts sending data on the protection path at the start time S(I) specified in the next paragraph. If the PSL and/or PML sense at the transport plane a failure of the protection lightpath to be activated, it MUST raise an alarm that may be dealt with at the management plane. The management plane will take appropriate remediation action. Alarm and remediation are outside the scope of this draft. The nodes on protection paths receive the fault notification from the FDL, within a deterministic time. This time delay is calculated by each node as explained in Appendix A. In order to avoid complex clock synchronization implementations, a PSL identified as node I that receives the notification from FDL node J will calculate the start time S(I) at which it switches traffic to the protection path as follows: S(I) = time-of-notification(I) - min-delay-between(J, I) + Trec where time-of-notification (I) returns the clock time at node I; min-delay-between(J, I) returns the minimum time needed for the notification from J to reach node I; Rabbat Expires - December 2002 [Page 7] Fault Notification and Service Recovery Protocol June 2002 Note that (time-of-notification (I) - min-delay-between(J, I)) will give the time when failure was detected at J, and Trec is the recovery time requirement. For simplicity, in this example, we assume that hardware reconfiguration time and fault notification time have been considered during the protection path set up, which will be considered in Section 7. Hence at S(I), every node on the protection paths should have been notified of the failure and finished reconfiguration. Fault notification is carried out through message flooding as follows. The FDL sends a notification packet to its neighbors on all outgoing links. The notification packet is a high-priority packet. The packet contains the unique global identifier of the link at fault. Each node that receives such a packet duplicates it on as many outgoing links to neighboring nodes that it has (minus the node it came from and any other node it may have received a copy from), and sends the duplicates to its neighbors. The node also sends an acknowledgement back on the link it received the message from. Flooding may be implemented using LMP [6]. 5.3.1 Delays Incurred by Messages to PSL and on Protection Path The above discussion suggests that in order for the protection algorithm to abide by the Trec ms recovery requirement, it needs to be either: 1. Aware of timing issues to be able to select a proper path. 2. Passed a set of nodes and links that satisfy the timing constraints. The protection algorithms found in the literature work by computing a protection path for the working paths that require protection. All resources identified in the network topology are usually used in calculating the protection paths. In contrast, a modified protection algorithm should be executed in the case of a strict requirement on the recovery time. For example, a pruned topology should be considered for protection path computation. A database of link information should hold the fiber physical length and the capacity of each link (or channel) as well as the notification message processing time. The total time needed by a notification packet to travel from source to destination can be broken into two types of delay: the time needed to traverse each link and the time needed to go through each node. The different delay calculations are discussed in Appendix A. 5.4 T4: Traffic Restoration Time Rabbat Expires - December 2002 [Page 8] Fault Notification and Service Recovery Protocol June 2002 This is the time between the last recovery action and the time that the traffic (if present) is completely recovered. This interval is intended to account for the time required for traffic to once again arrive at the point in the network that experienced disrupted or degraded service due to the occurrence of the fault, e.g. the PML (Path Merge LSR, the egress LSR) [3]. 6. Examples of Protection switching WDM mesh protection algorithms enable local repair and end-to-end protection. As mentioned previously, local repair refers to either link-based or node-based protection. In both local repair mechanisms, the nodes that are the endpoints at the link failure will act as PSL. In the case of a node failure, local node-based protection can be used where the nodes that are direct neighbors of the failed node on the LSPs will act as PSL. Link-based protection is depicted in scenario A of Figure 2. It is trivial to show a scenario depicting how link-based protection fails to restore a connection in the case of a failed node. The downstream LSP is defined to be the path from node S to node T and the upstream LSP is defined to be the path from node T to node S. ----- Failed Link | ----- ----- v ----- ----- | | | | \ / | | | | | S |----| A |-----/-----| B |-----| T | | | | | / \ | | | | ----- ----- ----- ----- | | | | <-- Protection Path For Link (A, B) ----------------- Figure 2. Scenario for Link-Based Protection: Nodes A and B detect fault and initiate recovery of downstream and upstream LSPs respectively. In Figure 2, both nodes A and B are notified of the link failure at the control plane. Node A can detect upstream link failures. At detection time, node A MUST send an AIS or other signal to node B. If the fault is detected at node B, then node B MUST send an AIS to node A. Each node will initiate the recovery of the unidirectional LSPs, one downstream from node S to node T and the other upstream from node T to node S. Therefore nodes A and B serve as FDLs (as well as PSLs.) Rabbat Expires - December 2002 [Page 9] Fault Notification and Service Recovery Protocol June 2002 Node-based protection is depicted in scenarios A and B of Figure 3 in the cases of link failure and node failure respectively. In this figure, only the protected nodes of interest and their associated protection paths are identified. -------------- Protected Nodes | | v v ----- ----- ----- ----- ----- ----- | | | | | | \ / | | | | | | | S |----| A |----| B |--/--| C |----| D |----| T | | | | | | | / \ | | | | | | ----- ----- ----- ----- ----- ----- | | | | | | | | <-- Protection Paths -------------------- | | | -------------------- Scenario A: Link (B, C) failure. Nodes B and C initiate failure notification and recovery of upstream and downstream LSPs respectively. --------- Protected Node | \ v / ----- ----- ----- ----- ----- | | | | |\ /| | | | | | S |----| A |----| B |----| C |----| T | | | | | |/ \| | | | | ----- ----- ----- ----- ----- | / \ | | | <-- Protection Path ------------------- Scenario B: Node B failure. Nodes A and C initiate fault notification and recovery of upstream and downstream LSPs respectively. Figure 3. Node-Based Protection Scenarios. Rabbat Expires - December 2002 [Page 10] Fault Notification and Service Recovery Protocol June 2002 Scenario A of Figure 3 depicts a link failure, leading node B to act as the PSL for the downstream LSP and set up the protection path toward node D that protects against the failure of node C (as well as Link (B, C)). Similarly, node C acting as PSL will activate the protection path toward node A that protects against the failure of node B (as well as Link (C, B)). Nodes B and C are FDLs that are responsible for sending out fault notification messages. Scenario B of Figure 3 shows node B's failure. The layers that detect fault at nodes A and C notify respectively the nodes' control plane and A and C act as PSL in that case, activating the protection path that protects against the failure of node B. Nodes A and C are FDLs that are responsible for sending out fault notification messages. ----- Failed Link | ----- ----- v ----- ----- | | | | \ / | | | | | S |----| A |-----/-----| B |-----| T | | | | | / \ | | | | ----- ----- ----- ----- | | | | <-- Protection Path ------------------------------------ Figure 4. Path-based protection for failure of Link (A, B). Figure 4 shows a path-based protection. Nodes A and B, acting as FDLs, initiate recovery of upstream and downstream LSPs , by sending a fault notification to ingress and egress nodes S and T (the PSLs)and intermediate nodes on the associated protection paths. 7. Finding the Time-Constrained Protection Path In [5], we outline two methods for dealing with restoration time constraints for the protection path and picks the most suitable. The first method consists of choosing a sub-graph G' of the original graph G where path calculation will yield a protection path that can be used within a recovery time of Trec milliseconds. The second method consists of changing the protection algorithm to make it aware of time constraints while computing a protection path. Because of the complexity of the second method, this document recommends the first method and provide detailed explanation in the following. Rabbat Expires - December 2002 [Page 11] Fault Notification and Service Recovery Protocol June 2002 7.1 Finding a Sub-graph for the Protection Algorithm This method relies on the view of the current network topology, including node distances and channel speeds. We can run a shortest path algorithm such as the Dijkstra or Bellman-Ford algorithm to implement this method. When computing shortest paths, for every link that is considered, the metric used in calculating a new shortest path will associate the time delay incurred to traverse that link, instead of just computing hop count. This of course assumes knowledge of the delays incurred while traversing the links and nodes and is equivalent to finding the fastest path. A delay metric associated with each link allows the Dijkstra algorithm to keep track of total delay incurred on links and nodes. We assume an upper delay limit of Trec (ms). The algorithm shown in Appendix B is considered with applicability to both link and path protection. Note that we use the words link and arc interchangeably to mean directed link. This ensures that the paths returned by the shortest path algorithm can be activated within time Trec after the failure is detected. Of course, it does not guarantee that such a path can be found. The functioning of the algorithm is to determine the set of nodes and arcs that can be used in calculating the protection path, which will guarantee that the protection path will be activated within time Trec. The protection algorithm can then calculate protection paths using sub-graph G' rather than graph G for either link or path protection. In the case of link-based protection, the algorithm just needs to define the sub-graph that can be reached by any of the two nodes at the endpoints of the link. In the case of path-based protection, the algorithm checks that irrespective of where the fault happens, the fault recovery process can activate the protection path(s) within time Trec. This method identifies eligible nodes, but does not guarantee that the algorithm will find a path that satisfies the delay constraint, since once the graph is pruned of some nodes that it could use, it has less choice in finding a protection path. The advantage though is that the protection algorithm will run over a small graph G' than the original graph G, leading to a speedup in the protection path computation. A side effect of this algorithm is that the protection algorithm will find a path in the "neighborhood" of the failed link. 7.2 Hybrid Protection/Restoration Method It is to be noted that a network operator could adopt other protection schemes, such as a mix between path-based and link-based protection. If a path-based protection path cannot be found due to excessive time needed for service recovery, then a less resource- Rabbat Expires - December 2002 [Page 12] Fault Notification and Service Recovery Protocol June 2002 efficient link-based protection path can be chosen, as local repair includes a smaller number of hops on the average, therefore increasing the probability of finding such protection path. 7.3 Reversion After Fixing the Failure Most of the current literature recommends that when the failed link or node is back online, then traffic should be moved back to the original path, as it is more efficient. This can be achieved by having the LSP stop using the protection path and revert to the previously unavailable working path. 8. Implementation of flooding notification We propose in this section an example of an implementation of fault notification using flooding. In this example, we use LMP for flooding. For this purpose, we propose to extend LMP by adding two messages: FaultNotify and FaultNotifyAck. LMP connectivity is established at start-up time, so no additional session initiation needs to occur for fault notification. LMP Hello messages make sure that connectivity is maintained. The proposed LMP message FaultNotify carries the fault notification information described above to the neighboring node. The neighbor replies with a FaultNotifyAck. A node keeps sending FaultNotify messages at intervals until it receives a FaultNotifyAck response back or the control channel connectivity is lost, as in the case of an evCCDn event (event generated when there is indication that the control channel is no longer available) [6]. 9. Security Considerations This draft makes use of several protocols; therefore this draft does not introduce any new security issues besides the ones that arise in the use of these protocols. 10. Conclusion This draft discusses a fault notification and service recovery protocol for GMPLS-enabled optical networks. It presents the steps required in the notification process, leading to lightpath service recovery within specific time bounds. The protocol proposes an algorithm to find the subset of nodes in the network that can be used to pick a protection path that can activated within that time bound. Rabbat Expires - December 2002 [Page 13] Fault Notification and Service Recovery Protocol June 2002 Appendix A. Fault Notification Message Delays on Path This appendix describes the delays incurred on the path. Two types of delays occur on the path between any two nodes. They are delays incurred during traversal of the links on that path, and delays that occur at the nodes along the path. The following presents the computations and expected values for the different delays. A.1 Delays Associated with Link Traversal The time needed to traverse each link is the sum of the transmission time and the link propagation delay: 1. The transmission time is a value based on link capacity. The calculation is as follows: D trans = (packet size) / (link speed). 2. The link propagation delay is due to the physical length of the link: D prop = length / (light propagation speed on fiber). The notification packet size is dependent on its content, and will be provided after information about the packet is presented. The length of such a packet is usually of the order of a hundred bytes (about 10^3 bits). As an example, for a link speed of 1 Gbps, D trans ~= 10^3 / 10^9 = 10^-6 s = 1 micro-second. This value therefore can safely be ignored in calculating delays. On the other hand, the link propagation delay in metropolitan area and long-haul networks affects total delay. For a distance of 100 km, with light speed in a fiber at 2/3 (about 200,000 km/s) of its speed in free space, this delay would be 0.5 ms. A.2 Delays Incurred at the Nodes At each node, two delays are important: queuing delay and processing time. The processing time D proc has been identified in the literature as a few tenths of a millisecond in the case of an RSVP object. This value is smaller in the case of a simpler IP packet requesting the activation of an LSP path. The issue of queuing delay is important at all intermediate nodes. Fault notification messages should be queued at the front of the buffer that holds other control packets in order to avoid queuing delays, (those messages do not have to contend with data packets since obviously no data are sent in the control channel). A queuing process such as priority queuing would allow those packets to be admitted at the head of the queue, through the setup of the priority of the packet. A simple mechanism such as the setup of the priority Rabbat Expires - December 2002 [Page 14] Fault Notification and Service Recovery Protocol June 2002 bits at the IP header, such as the IP precedence bits or DSCP code points of the TOS (Type Of Service) byte would be appropriate. Using priority queuing for fault notification messages will ensure that the queuing delay will be bounded. In the case of flooding for fault notification, D queue(a) = 0 sec. If other fault notification messages are in the queue as well, this implies multiple failures, where the time recovery guarantee does not apply. Otherwise, it may indicate the fact that multiple messages are traveling on different protection paths to notify the same link failure, such as the case when a signaling protocol is used for fault notification. In the case of per-LSP fault notification just as in the case of using a signaling protocol, the maximum queuing delay at node a is: D queue max(a)= (number of protection paths) * (packet size) / (link bandwidth). This explains mathematically the choice against using a signaling protocol for fault notification. Flooding allows that value to be 0 sec. In the absence of priority queuing, the maximum queue delay can be calculated as follows at node a, assuming fair queuing of the FIFO buffers of all control channels and assuming input buffers only: D queue max(a)= (number of queues) * (queue size) / (link bandwidth). This value is an upper bound, and is dependent on hardware buffer implementations. Appendix B. Finding a Sub-graph Subject to Time Constraints For simplicity, we assume the restoration time constraint is Trec seconds and the reconfiguration time at each node once notified of the failure is Tcfg. Thus, the allowable fault notification time Tnot = Trec - Tcfg. Initialization: G (N, A) is the graph containing the set N of nodes n and set A of arcs (I, J) representing the control channel links // Note that the links in this graph could be different from the // links used for the data channels For every link (I, J) in A Set cost (I, J) = d proc (I) + d queue max (I) + d prop (I, J) // this sets all link costs to a delay value that includes the // delays incurred at the node I. node-neighborhood-for-failure (I, (I, J)) is the set of nodes that are Tnot away from node I if link (I, J) fails initially set to {}. Rabbat Expires - December 2002 [Page 15] Fault Notification and Service Recovery Protocol June 2002 Procedure Find-node-neighborhood-for-failure (I, (I, J)) Begin Run shortest path algorithm on graph pruned from link (I, J) // This results in a set SC (I), that contains for each node X, // the smallest cost (minimum delay) from node I. // min-delay-between (I, X) is defined in section 5.3 as the // minimum time (smallest cost) needed for a message to travel // from I to X. For every entry for node X in SC, min-delay-between (I, X) += d proc (X) + d queue max (X) // add the delays incurred at the last node // SC now contains the shortest time from node I to any node // including processing time at the end-nodes for link (I, J) For every entry in SC, if min-delay-between (I, X) < Tnot, then Add node X in SC to node-neighborhood-for-failure (I, (I, J)) // If the message cannot go from I to J within Tnot sec, then // the node is not reachable within the time constraints End Procedure Prune-for-link-protection (I, J) Begin Find-node-neighborhood-for-failure (I, (I, J)) Find-node-neighborhood-for-failure (J, (J, I)) Set reachable (I, J) = union of node-neighborhood-for-failure (I, (I, J)) and node-neighborhood-for-failure (J, (J, I)) // This is the set of nodes reachable by either endpoints of // link (I, J) within time Tnot Return reachable (I, J) End Procedure Prune-for-path-protection (P (I1, I2,... In)) Begin Set N' = N // all nodes originally For any link (Ik; Ik+1) of path P Begin Prune-for-link-protection (Ik, Ik+1) Set N' = intersection of N' and reachable (Ik, Ik+1) End End Running for link (I, J) protection: Rabbat Expires - December 2002 [Page 16] Fault Notification and Service Recovery Protocol June 2002 Prune-for-link-protection (I, J) Set N' = reachable (I, J) Run link protection algorithm considering only nodes in set N' and any data links between them (not necessarily the same links as the control channels) Running for Path P (I1, I2,... In) protection: Prune-for-path-protection (P (I1, I2,... In)) Run path protection algorithm considering only nodes in set N' and any data links between them 11. References [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [2] Papadimitriou, D., et al, "Analysis Grid for GMPLS-based Recovery Mechanisms (including Protection and Restoration)", Internet draft, work in progress, draft-papadimitriou-ccamp-gmpls-recovery- analysis-00.txt, April 2002. [3] Sharma, V., et al, "Framework for MPLS-based recovery", Internet Draft, work in progress, draft-ietf-mpls-recovery-frmwrk-04.txt, May 2002. [4] Li, G., J. Yates, et al, "Experiments in Fast Restoration using GMPLS in Optical/Electronic Mesh Networks", Post-deadline Papers Digest, OFC 2001, Anaheim, CA, March 2001. [5] Rabbat, R. et al, "Fault Notification and Service Recovery in WDM Networks", white paper available at: http://perth.mit.edu/~richard/wp-ietf-fault-notification.pdf. [6] Lang, J. et al, "Link Management Protocol (LMP)", Internet Draft, work in progress, draft-ietf-ccamp-lmp-03.txt, March 2002. Rabbat Expires - December 2002 [Page 17] Fault Notification and Service Recovery Protocol June 2002 12. Author's Addresses Richard Rabbat Ching-Fong Su Fujitsu Labs of America, Inc. Fujitsu Labs of America, Inc. 595 Lawrence Expressway 595 Lawrence Expressway Sunnyvale, CA 94085 Sunnyvale, CA 94085 United States of America United States of America Phone: +1-408-530-4537 Phone: +1-408-530-4572 Email: rabbat@fla.fujitsu.com Email: csu@fla.fujitsu.com Norihiko Shinomiya Vishal Sharma Fujitsu Laboratories Ltd. Metanoia, Inc. 1-1, Kamikodanaka 4-Chome 305 Elan Village Lane, Unit 121 Nakahara-ku, Kawasaki San Jose, CA 95134-2545 211-8588, Japan United States of America Phone: +81-44-754-2635 Phone: +1-408-955-0910 Email: shinomi@jp.fujitsu.com Email: v.sharma@ieee.org Takafumi Chujo Akira Chugo Fujitsu Labs of America, Inc. Fujitsu Laboratories Ltd. 595 Lawrence Expressway 1-1, Kamikodanaka 4-Chome Sunnyvale, CA 94085 Nakahara-ku, Kawasaki United States of America 211-8588, Japan Phone: +1-408-530-4507 Phone: +81-44-754-2629 Email: takafumi@fla.fujitsu.com Email: chugo@flab.fujitsu.co.jp Rabbat Expires - December 2002 [Page 18]