Fault Notification and Service Recovery Protocol June 2002 

CCAMP Working Group                                Richard Rabbat (FLA) 
Internet Draft                                      Ching-Fong Su (FLA) 
Expires: December 2002                         Norihiko Shinomiya (FLL) 
                                               Vishal Sharma (Metanoia) 
                                                   Takafumi Chujo (FLA) 
                                                      Akira Chugo (FLL) 
                                                              June 2002 
 
            A Fault Notification and Service Recovery Protocol 
             draft-rabbat-fault-notification-protocol-00.txt

     
Status of this Memo  
    
   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of RFC2026.  
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that 
   other groups may also distribute working documents as Internet-
   Drafts.  
    
   Internet-Drafts are draft documents valid for a maximum of six months 
   and may be updated, replaced, or obsoleted by other documents at any 
   time. It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress."  
    
   The list of current Internet-Drafts can be accessed at  
        http://www.ietf.org/ietf/1id-abstracts.txt  
   The list of Internet-Draft Shadow Directories can be accessed at  
        http://www.ietf.org/shadow.html.  
    
Abstract  
    
   This draft describes a fault notification and service recovery 
   protocol that can be used in GMPLS-enabled networks to achieve 
   bounded time activation of protection paths in the case of single 
   failures. It presents a complete solution to the problem and 
   justifies choices made for the notification method, extensions 
   required to current algorithms and protocols and any security and 
   deployment issues. 
 
Table of Contents 
    
   Status of this Memo...............................................1 
   Abstract..........................................................1 
   Table of Contents.................................................1 
   1. Overview.......................................................2 
   2. Terminology....................................................3 
   3. Glossary of Terms Used.........................................3 

 
Rabbat                 Expires - December 2002               [Page 1] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   4. Requirements at Protection Path Setup Time.....................4 
   5. Protocol Steps in Failure Notification and Service Recovery....5 
   5.1 T1: Fault Detection Time......................................5 
   5.2 T2: Hold-Off Time.............................................6 
   5.3 T3: Fault Notification and Completion of Recovery Operation...6 
   5.3.1 Delays Incurred by Messages to PSL and on Protection Path...8 
   5.4 T4: Traffic Restoration Time..................................8 
   6. Examples of Protection switching...............................9 
   7. Finding the Time-Constrained Protection Path..................11 
   7.1 Finding a Sub-graph for the Protection Algorithm.............12 
   7.2 Hybrid Protection/Restoration Method.........................12 
   7.3 Reversion After Fixing the Failure...........................13 
   8. Implementation of flooding notification.......................13 
   9. Security Considerations.......................................13 
   10. Conclusion...................................................13 
   Appendix A. Fault Notification Message Delays on Path............14 
   A.1 Delays Associated with Link Traversal........................14 
   A.2 Delays Incurred at the Nodes.................................14 
   Appendix B. Finding a Sub-graph Subject to Time Constraints......15 
   11. References...................................................17 
   12. Author's Addresses...........................................18 
    
    
1.   Overview 
    
   The issue of protection and restoration in optical switching networks 
   is of high importance to ensure high-availability and uninterrupted 
   service.  Several mechanisms for  protection in  mesh and ring 
   topologies have been devised.  Protection and restoration algorithms 
   can be used for local repair (link-based or node-based), span 
   protection and path protection. This document describes a fault 
   notification and service recovery protocol to ensure bounded recovery 
   times, e.g., 50 ms recovery time, which is comparable to recovery in 
   SONET/SDH networks. 
      
   In the case of link-based protection, the protection algorithm can 
   handle faults such as fiber link failures.  In the case of a node 
   failure, recovery can be done through either node-based or path-based 
   protection.  The advantage of path-based protection lies in its 
   ability to reduce wavelength redundancy (wavelengths that are 
   reserved for possible failures) but its disadvantage is the 
   potentially lengthy  delay incurred in notifying all nodes along the 
   protection path, of the failure of a remote resource.  In some 
   applications, protection paths need to be chosen carefully to meet 
   some restoration time requirement (e.g., 50ms). 
    
   Several drafts being discussed by the Internet Engineering Task Force 
   deal with signaling and methodology that can be used in a framework 

 
Rabbat                 Expires - December 2002               [Page 2] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   for guaranteeing a bounded protection and recovery time, but none do 
   provide hard guarantees for GMPLS-based SONET and WDM networks, 
   although some have claimed results of fast reroute on MPLS-based IP 
   networks. For example, although several mechanisms and frameworks 
   discuss requirements for 50 ms protection switching, they do not 
   propose a process that achieves the goal.  This document addresses 
   this issue.  
    
   This document presents a fault-notification protocol that is both 
   technology and topology agnostic.  It applies to intra-domain 
   protection.  Multi-domain protection is left for further study. 
 
   We assume unidirectional traffic through Label Switched Paths (LSPs) 
   and, where relevant, discuss applicability to bi-directional traffic 
   which consists of two unidirectional LSPs.  For the purpose of 
   illustration, we also assume a mesh WDM network; applicability to 
   ring-topology requires further study. 
    
    
2.   Terminology 
    
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in [1]. 
    
    
3.   Glossary of Terms Used 
    
   LSR: Label Switched Router 
    
   PSL: Path Switch LSR, commonly the ingress LSR 
    
   PML: Path Merge LSR, commonly the egress LSR 
    
   LMP: Link Management Protocol  
    
   FDL: Failure Detection LSR, defined in this draft as the LSR that 
   detects the resource failure, either by a signal from the 
   optical/transport layer 
    
   SD/SF: Signal Degrade and Signal Fail, as described in [2] 
    
   MEMS: Micro-Electro Mechanical Systems 
    
   PXC: Photonic Cross-Connect, a cross-connect that switches 
   wavelengths transparently, by means of a switching fabric such as 
   MEMS 
    

Rabbat                 Expires - December 2002               [Page 3] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   AIS: Alarm Indication Signal, a signal at the SONET/SDH transport 
   layer 
    
   BDI: Backward Defect Indication, a signal at the transport layer sent 
   upstream 
    
   FIS: Fault Indication Signal, defined in [3] 
    
    
4.   Requirements at Protection Path Setup Time  
    
   When the protection algorithm calculates the protection path for a 
   certain type of protection (local repair or end-to-end) in a GMPLS-
   enabled  network, it distributes labels for that path to reserve the 
   link resources (wavelengths, wavebands, etc.) for protection and 
   possibly select them.  These link resources could be used to carry 
   preemptible best-effort traffic to increase the utilization of the 
   network, when the protection path is not activated.  Alternatively, 
   the same link resource may be reserved by multiple protection paths 
   for different link failures as long as these protection paths do not 
   need to be activated simultaneously (e.g., M:N shared protection). In 
   either case, proper link resource needs to be activated upon the 
   notification of failure. 
    
   When a label for a protection LSP is set up on a certain node A 
   through RSVP-TE or CR-LDP, node A must know what network resource 
   this LSP is protecting.  In the case of RSVP-TE for example, the 
   protection PATH message may notify all nodes on the protection path 
   with that information at path setup as in the case of [4].  This 
   allows node A to bundle labels (as well as its link resources) that 
   protect a particular network resource.  For example, if two labels j 
   and k correspond to two LSPs used to protect working paths from the 
   failure of link (X, Y), then they belong to the bundle L (X, Y).  
   This allows node A to jointly activate/cross-connect both LSPs 
   referenced by labels j and k when it receives notification of the 
   failure of link (X, Y).   
    
   This documents proposes a method for per-failure fault notification 
   (as compared to per-LSP fault notification), hence such bundled label 
   information is essential.  The main difference between "per-failure" 
   vs. "per-LSP" notification is the number of notification mechanisms 
   that need to be started.  Per-failure fault notification allows one 
   mechanism to engage to notify all relevant nodes of the fault.  On 
   the other hand and in the case of per-LSP notification, as many 
   mechanisms as there exists failed LSP's (for example, all LSPÆs that 
   failed due to a link failure) have to be engaged. 
    
    
Rabbat                 Expires - December 2002               [Page 4] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
5.   Protocol Steps in Failure Notification and Service Recovery  
    
   The steps described in this section present a fault notification 
   protocol that details the process used in notifying nodes of the 
   resource failure and activating the protection lightpaths.  The 
   failure sequence is based on Internet Draft [3] adapted to WDM 
   networks. The timing diagram in Figure 1 is reproduced and modified 
   based on [3].The critical component in guaranteeing time constraints 
   to service recovery is the fault notification process.  The following 
   sequence of events MUST be followed in order to ensure that the 
   recovery process happens within a specific amount of time, as is the 
   case of SONET/SDH-based networks.   
    
    
    --Network Impairment  
    |    --Fault Detected  
    |    |    --Start of Notification   
    |    |    |    --Recovery Operation Complete 
    |    |    |    |    --Path Traffic Restored 
    |    |    |    |    |  
    |    |    |    |    |  
    v    v    v    v    v  
   ------------------------------------------------  
    | T1 | T2 | T3 | T4 |  
    
   Figure 1. Recovery Cycle Model 
    
    
5.1    T1: Fault Detection Time 
    
   This is the period of time between the network impairment and the 
   detection at the control plane.  We define the Failure Detection LSR 
   (FDL) to be the node at which the fault is detected.  An example of 
   such network impairment is a fiber cut.  Layer 1 at a certain node 
   detects the fault and passes it to the control plane.  This document 
   assumes that equipment in the optical network can detect such 
   failures.  This time is not included in the calculation of the 
   recovery time.  In general, in case a full-duplex link is cut, two 
   nodes will detect the fault, one upstream of the fault, and the other 
   downstream.  In the case of a unidirectional link failure, the node 
   downstream detects the failure.  In that case, that node will send at 
   the transport layer a signal such as the Backward Defect Indication 
   (BDI) defined in ITU-T G.709 to the node upstream that will also act 
   as FDL.  We assume that the time difference between detection and 
   inference based on BDI is negligible.  Other transport plane 
   technologies MUST offer the same capability to be used in this 
   context.  So both upstream and downstream nodes detect the failure. 
    

Rabbat                 Expires - December 2002               [Page 5] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   To support the failure detection requirement, nodes MUST implement 
   per-channel monitoring that will pinpoint the failure and report it 
   at the FDL. 
    
5.2    T2: Hold-Off Time 
    
   This is the period of time that elapses before the node that has 
   detected the fault starts the fault-recovery process.  This time 
   assumes that the fault-recovery process at a given layer (for 
   example, the recovery protocol outlined in this draft) may wait for 
   restoration or recovery to occur at another layer (for example, 
   SONET/SDH protection).  In the case of WDM-based protection, this 
   time should be 0 sec since there is no underlying protection layer 
   that will kick in.  
    
   In the case of GMPLS-enabled IP network over SONET, the T2 may be set 
   to 50ms such that SONET protection scheme can activate before any IP 
   (MPLS) layer protection is triggered. For GMPLS-enabled SONET over 
   WDM, the choice is a bit complicated. 
   Protection mechanisms such as SONET/SDH protection could be used in 
   the same environment in conjunction with WDM-based protection by 
   picking either protection mechanism or no protection at all.  
   Allowing redundant protection mechanisms for the same lightpath may 
   increase the recovery time.  The SONET/SDH layer, if it exists, makes 
   the decision as to whether to request from the WDM layer a protected 
   or unprotected lightpath to connect the SONET equipment.   
    
5.3    T3: Fault Notification and Completion of Recovery Operation 
    
   T3 is the period between the time when FDL starts sending out a fault 
   notification message, and the time when every node including PSLs 
   (Path Switch LSR[3]) on the corresponding protection paths have been 
   notified of the failure and finished reconfiguring themselves for 
   carrying restored traffic. The PSL (Path Switch LSR) can be the same 
   physical entity as the FDL in the case of link-based protection.  It 
   is important to have the PSL be as close to the link failure as 
   possible.  This reduces the recovery time since no messages have to 
   be relayed to a remote or centralized authority to initiate that 
   recovery.  
    
   Some PSLs or PMLs may detect a failure, for example, a Loss of Light 
   (LoL) event.  The fault notification message MUST be initiated by the 
   FDL even if the PSLs and PMLs have detected the error.  This allows 
   the fault notification mechanism to solve for the worst-case 
   scenarios and gives timely notification of all concerned nodes on the 
   protection path(s).  For the purpose of this draft, transport plane 
   signals such as the AIS (Alarm Indication Signal) and the BDI will be 
   disregarded by all OXCs except the FDLs.  It is to be noted that 

 
Rabbat                 Expires - December 2002               [Page 6] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   fault notification occurs at the control plane to minimize layer 
   interaction. 
    
   The FDL MAY use several fault notification methods to notify other 
   nodes of the failure, including GMPLS-based signaling and flooding. 
   In the case of GMPLS-based signaling, there is generally one fault 
   notification message per disrupted Label Switched Path.  Hence, 
   signaling does not scale well with the number of connections; in 
   addition, the message processing delay is less predictable. For 
   details about the notification methods and the choice of flooding for 
   this draft, the reader is encouraged to refer to [5]. This document 
   specifies a notification protocol based on message flooding. 
    
   In the case of flooding, the message sent from the FDL to all other 
   nodes on the different protection paths should reach them within the 
   specified amount of time Trec (recovery time)minus the 
   reconfiguration time Tcfg needed at each node after fault 
   notification.  We define this time to be Tnot = Trec - Tcfg.  Tnot is 
   the fault notification time.  This is explained in detail in Appendix 
   B. 
    
   Nodes on a protection path (including the PSL) are aware that they 
   are protecting against the failure of a particular resource.  All 
   nodes notified of the failure will activate the protection path by 
   performing any required hardware configuration (for example, moving 
   mirrors in the case of a MEMS-based switching fabric).  The PSL 
   starts sending data on the protection path at the start time S(I) 
   specified in the next paragraph.  If the PSL and/or PML sense at the 
   transport plane a failure of the protection lightpath to be activated, 
   it MUST raise an alarm that may be dealt with at the management plane.  
   The management plane will take appropriate remediation action.  Alarm 
   and remediation are outside the scope of this draft. 
    
   The nodes on protection paths receive the fault notification from the 
   FDL, within a deterministic time.  This time delay is calculated by 
   each node as explained in Appendix A.  In order to avoid complex 
   clock synchronization implementations, a PSL identified as node I 
   that receives the notification from FDL node J will calculate the 
   start time S(I) at which it switches traffic to the protection path 
   as follows: 
    
      S(I) = time-of-notification(I) - min-delay-between(J, I) + Trec 
    
      where 
    
         time-of-notification (I) returns the clock time at node I; 
         min-delay-between(J, I) returns the minimum time needed for the 
                  notification from J to reach node I; 

 
Rabbat                 Expires - December 2002               [Page 7] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   Note that (time-of-notification (I) - min-delay-between(J, I)) will 
   give the time when failure was detected at J, and Trec is the 
   recovery time requirement.  For simplicity, in this example, we 
   assume that hardware reconfiguration time and fault notification time 
   have been considered during the protection path set up, which will be 
   considered in Section 7. Hence at S(I), every node on the protection 
   paths should have been notified of the failure and finished 
   reconfiguration. 
    
   Fault notification is carried out through message flooding as follows.  
   The FDL sends a notification packet to its neighbors on all outgoing 
   links.  The notification packet is a high-priority packet.  The 
   packet contains the unique global identifier of the link at fault.  
   Each node that receives such a packet duplicates it on as many 
   outgoing links to neighboring nodes that it has (minus the node it 
   came from and any other node it may have received a copy from), and 
   sends the duplicates to its neighbors.  The node also sends an 
   acknowledgement back on the link it received the message from.  
   Flooding may be implemented using LMP [6]. 
    
5.3.1      Delays Incurred by Messages to PSL and on Protection Path 
    
   The above discussion suggests that in order for the protection 
   algorithm to abide by the Trec ms recovery requirement, it needs to 
   be either: 
    
      1. Aware of timing issues to be able to select a proper path. 
      2. Passed a set of nodes and links that satisfy the timing 
      constraints. 
    
   The protection algorithms found in the literature work by computing a 
   protection path for the working paths that require protection.  All 
   resources identified in the network topology are usually used in 
   calculating the protection paths.  In contrast, a modified protection 
   algorithm should be executed in the case of a strict requirement on 
   the recovery time.  For example, a pruned topology should be 
   considered for protection path computation.  A database of link 
   information should hold the fiber physical length and the capacity of 
   each link (or channel) as well as the notification message processing 
   time.  The total time needed by a notification packet to travel from 
   source to destination can be broken into two types of delay: the time 
   needed to traverse each link and the time needed to go through each 
   node.  The different delay calculations are discussed in Appendix A. 
    
5.4    T4: Traffic Restoration Time 
    

Rabbat                 Expires - December 2002               [Page 8] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   This is the time between the last recovery action and the time that 
   the traffic (if present) is completely recovered.  This interval is 
   intended to account for the time required for traffic to once again 
   arrive at the point in the network that experienced disrupted or 
   degraded service due to the occurrence of the fault, e.g. the PML 
   (Path Merge LSR, the egress LSR) [3].   
    
    
6.   Examples of Protection switching 
    
   WDM mesh protection algorithms enable local repair and end-to-end 
   protection.  As mentioned previously, local repair refers to either 
   link-based or node-based protection.  In both local repair 
   mechanisms, the nodes that are the endpoints at the link failure will 
   act as PSL.  In the case of a node failure,  local node-based 
   protection can be used where the nodes that are direct neighbors of 
   the failed node on the LSPs will act as PSL.  Link-based protection 
   is depicted in scenario A of Figure 2.  It is trivial to show a 
   scenario depicting how link-based protection fails to restore a 
   connection in the case of a failed node.  The downstream LSP is 
   defined to be the path from node S to node T and the upstream LSP is 
   defined to be the path from node T to node S. 
    
    
                      ----- Failed Link 
                      |  
   -----    -----     v     -----     ----- 
   |   |    |   |    \ /    |   |     |   | 
   | S |----| A |-----/-----| B |-----| T | 
   |   |    |   |    / \    |   |     |   | 
   -----    -----           -----     ----- 
              |               | 
              |               |  <-- Protection Path For Link (A, B) 
              ----------------- 
    
   Figure 2. Scenario for Link-Based Protection: Nodes A and B detect  
   fault and initiate recovery of downstream and upstream LSPs 
   respectively. 
    
    
   In Figure 2, both nodes A and B are notified of the link failure at 
   the control plane.  Node A can detect upstream link failures.  At 
   detection time, node A MUST send an AIS or other signal to node B.  
   If the fault is detected at node B, then node B MUST send an AIS to 
   node A.  Each node will initiate the recovery of the unidirectional 
   LSPs, one downstream from node S to node T and the other upstream 
   from node T to node S.  Therefore nodes A and B serve as FDLs (as 
   well as PSLs.) 

 
Rabbat                 Expires - December 2002               [Page 9] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   Node-based protection is depicted in scenarios A and B of Figure 3 in 
   the cases of link failure and node failure respectively.  In this 
   figure, only the protected nodes of interest and their associated 
   protection paths are identified. 
    
    
                      -------------- Protected Nodes 
                      |         | 
                      v         v 
   -----    -----    -----     -----    -----    ----- 
   |   |    |   |    |   | \ / |   |    |   |    |   | 
   | S |----| A |----| B |--/--| C |----| D |----| T | 
   |   |    |   |    |   | / \ |   |    |   |    |   | 
   -----    -----    -----     -----    -----    ----- 
              |        |         |        | 
              |        |         |        |  <-- Protection Paths 
              --------------------        | 
                       |                  | 
                       -------------------- 
    
       Scenario A: Link (B, C) failure. Nodes B and C initiate failure  
                   notification and recovery of upstream and 
                   downstream LSPs respectively. 
    
    
                       --------- Protected Node 
                       |       
                    \  v  / 
   -----    -----    -----    -----    ----- 
   |   |    |   |    |\ /|    |   |    |   | 
   | S |----| A |----| B |----| C |----| T | 
   |   |    |   |    |/ \|    |   |    |   | 
   -----    -----    -----    -----    ----- 
              |     /     \     | 
              |                 |  <-- Protection Path 
              ------------------- 
    
       Scenario B: Node B failure. Nodes A and C initiate fault  
                   notification and recovery of upstream and 
                   downstream LSPs respectively. 
    
   Figure 3. Node-Based Protection Scenarios. 
    

Rabbat                 Expires - December 2002              [Page 10] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   Scenario A of Figure 3 depicts a link failure, leading node B to act 
   as the PSL for the downstream LSP and set up the protection path 
   toward node D that protects against the failure of node C (as well as 
   Link (B, C)). Similarly, node C acting as PSL will activate the 
   protection path toward node A that protects against the failure of 
   node B (as well as Link (C, B)).  Nodes B and C are FDLs that are 
   responsible for sending out fault notification messages. 
    
   Scenario B of Figure 3 shows node B's failure.  The layers that 
   detect fault at nodes A and C notify respectively the nodes' control 
   plane and A and C act as PSL in that case, activating the protection 
   path that protects against the failure of node B.  Nodes A and C are 
   FDLs that are responsible for sending out fault notification 
   messages. 
    
    
                      ----- Failed Link 
                      |  
   -----    -----     v     -----     ----- 
   |   |    |   |    \ /    |   |     |   | 
   | S |----| A |-----/-----| B |-----| T | 
   |   |    |   |    / \    |   |     |   | 
   -----    -----           -----     ----- 
     |                                  | 
     |                                  |  <-- Protection Path 
     ------------------------------------ 
    
   Figure 4. Path-based protection for failure of Link (A, B). 
    
    
   Figure 4 shows a path-based protection. Nodes A and B, acting as FDLs, 
   initiate recovery of upstream and downstream LSPs , by sending a 
   fault notification to ingress and egress nodes S and T (the PSLs)and 
   intermediate nodes on the associated protection paths. 
    
    
7.   Finding the Time-Constrained Protection Path 
    
   In [5], we outline two methods for dealing with restoration time 
   constraints for the protection path and picks the most suitable.  The 
   first method consists of choosing a sub-graph G' of the original 
   graph G where path calculation will yield a protection path that can 
   be used within a recovery time of Trec milliseconds.  The second 
   method consists of changing the protection algorithm to make it aware 
   of time constraints while computing a protection path. Because of the 
   complexity of the second method, this document recommends the first 
   method and provide detailed explanation in the following. 

 
Rabbat                 Expires - December 2002              [Page 11] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
7.1     Finding a Sub-graph for the Protection Algorithm 
    
   This method relies on the view of the current network topology, 
   including node distances and channel speeds.  We can run a shortest 
   path algorithm such as the Dijkstra or Bellman-Ford algorithm to 
   implement this method.  When computing shortest paths, for every link 
   that is considered, the metric used in calculating a new shortest 
   path will associate the time delay incurred to traverse that link, 
   instead of just computing hop count.  This of course assumes 
   knowledge of the delays incurred while traversing the links and nodes 
   and is equivalent to finding the fastest path.  A delay metric 
   associated with each link allows the Dijkstra algorithm to keep track 
   of total delay incurred on links and nodes.  We assume an upper delay 
   limit of Trec (ms).  The algorithm shown in Appendix B is considered 
   with applicability to both link and path protection.  Note that we 
   use the words link and arc interchangeably to mean directed link. 
    
   This ensures that the paths returned by the shortest path algorithm 
   can be activated within time Trec after the failure is detected.  Of 
   course, it does not guarantee that such a path can be found.  The 
   functioning of the algorithm is to determine the set of nodes and 
   arcs that can be used in calculating the protection path, which will 
   guarantee that the protection path will be activated within time Trec.  
   The protection algorithm can then calculate protection paths using 
   sub-graph G' rather than graph G for either link or path protection. 
    
   In the case of link-based protection, the algorithm just needs to 
   define the sub-graph that can be reached by any of the two nodes at 
   the endpoints of the link.  In the case of path-based protection, the 
   algorithm checks that irrespective of where the fault happens, the 
   fault recovery process can activate the protection path(s) within 
   time Trec.  This method identifies eligible nodes, but does not 
   guarantee that the algorithm will find a path that satisfies the 
   delay constraint, since once the graph is pruned of some nodes that 
   it could use, it has less choice in finding a protection path.  The 
   advantage though is that the protection algorithm will run over a 
   small graph G' than the original graph G, leading to a speedup in the 
   protection path computation.  A side effect of this algorithm is that 
   the protection algorithm will find a path in the "neighborhood" of 
   the failed link. 
    
7.2     Hybrid Protection/Restoration Method 
    
   It is to be noted that a network operator could adopt other 
   protection schemes, such as a mix between path-based and link-based 
   protection.  If a path-based protection path cannot be found due to 
   excessive time needed for service recovery, then a less resource-

 
Rabbat                 Expires - December 2002              [Page 12] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   efficient link-based protection path can be chosen, as local repair 
   includes a smaller number of hops on the average, therefore 
   increasing the probability of finding such protection path. 
    
7.3    Reversion After Fixing the Failure 
    
   Most of the current literature recommends that when the failed link 
   or node is back online, then traffic should be moved back to the 
   original path, as it is more efficient.  This can be achieved by 
   having the LSP stop using the protection path and revert to the 
   previously unavailable working path. 
    
    
8.   Implementation of flooding notification 
    
   We propose in this section an example of an implementation of fault 
   notification using flooding.  In this example, we use LMP for 
   flooding.  For this purpose, we propose to extend LMP by adding two 
   messages: FaultNotify and FaultNotifyAck.  LMP connectivity is 
   established at start-up time, so no additional session initiation 
   needs to occur for fault notification.  LMP Hello messages make sure 
   that connectivity is maintained.  The proposed LMP message 
   FaultNotify carries the fault notification information described 
   above to the neighboring node.  The neighbor replies with a 
   FaultNotifyAck.  A node keeps sending FaultNotify messages at 
   intervals until it receives a FaultNotifyAck response back or the 
   control channel connectivity is lost, as in the case of an evCCDn 
   event (event generated when there is indication that the control 
   channel is no longer available) [6]. 
    
    
9.   Security Considerations  
     
   This draft makes use of several protocols; therefore this draft does 
   not introduce any new security issues besides the ones that arise in 
   the use of these protocols. 
    
    
10.    Conclusion 
    
   This draft discusses a fault notification and service recovery 
   protocol for GMPLS-enabled optical networks.  It presents the steps 
   required in the notification process, leading to lightpath service 
   recovery within specific time bounds.  The protocol proposes an 
   algorithm to find the subset of nodes in the network that can be used 
   to pick a protection path that can activated within that time bound. 
    
    
Rabbat                 Expires - December 2002              [Page 13] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
Appendix A.            Fault Notification Message Delays on Path 
    
   This appendix describes the delays incurred on the path.  Two types 
   of delays occur on the path between any two nodes.  They are delays 
   incurred during traversal of the links on that path, and delays that 
   occur at the nodes along the path.  The following presents the 
   computations and expected values for the different delays. 
    
A.1    Delays Associated with Link Traversal 
    
   The time needed to traverse each link is the sum of the transmission 
   time and the link propagation delay: 
    
   1. The transmission time is a value based on link capacity.  The 
      calculation is as follows: D trans = (packet size) / (link 
      speed). 
   2. The link propagation delay is due to the physical length of the 
      link: D prop = length / (light propagation speed on fiber). 
    
   The notification packet size is dependent on its content, and will be 
   provided after information about the packet is presented.  The length 
   of such a packet is usually of the order of a hundred bytes (about 
   10^3 bits).  As an example, for a link speed of 1 Gbps, 
    
   D trans ~= 10^3 / 10^9 = 10^-6 s = 1 micro-second. 
    
   This value therefore can safely be ignored in calculating delays.  On 
   the other hand, the link propagation delay in metropolitan area and 
   long-haul networks affects total delay.  For a distance of 100 km, 
   with light speed in a fiber at 2/3 (about 200,000 km/s) of its speed 
   in free space, this delay would be 0.5 ms. 
    
A.2    Delays Incurred at the Nodes 
    
   At each node, two delays are important: queuing delay and processing 
   time.  The processing time D proc has been identified in the 
   literature as a few tenths of a millisecond in the case of an RSVP 
   object.  This value is smaller in the case of a simpler IP packet 
   requesting the activation of an LSP path. 
    
   The issue of queuing delay is important at all intermediate nodes.   
   Fault notification messages should be queued at the front of the 
   buffer that holds other control packets in order to avoid queuing 
   delays, (those messages do not have to contend with data packets 
   since obviously no data are sent in the control channel).  A queuing 
   process such as priority queuing would allow those packets to be 
   admitted at the head of the queue, through the setup of the priority 
   of the packet.  A simple mechanism such as the setup of the priority 

 
Rabbat                 Expires - December 2002              [Page 14] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   bits at the IP header, such as the IP precedence bits or DSCP code 
   points of the TOS (Type Of Service) byte would be appropriate.  Using 
   priority queuing for fault notification messages will ensure that the 
   queuing delay will be bounded.  In the case of flooding for fault 
   notification, D queue(a) = 0 sec.  If other fault notification 
   messages are in the queue as well, this implies multiple failures, 
   where the time recovery guarantee does not apply.  Otherwise, it may 
   indicate the fact that multiple messages are traveling on different 
   protection paths to notify the same link failure, such as the case 
   when a signaling protocol is used for fault notification.  In the 
   case of per-LSP fault notification just as in the case of using a 
   signaling protocol, the maximum queuing delay at node a is: 
    
   D queue max(a)= (number of protection paths) * (packet size) / (link 
   bandwidth). 
    
   This explains mathematically the choice against using a signaling 
   protocol for fault notification.  Flooding allows that value to be 0 
   sec.  In the absence of priority queuing, the maximum queue delay can 
   be calculated as follows at node a, assuming fair queuing of the FIFO 
   buffers of all control channels and assuming input buffers only: 
    
   D queue max(a)= (number of queues) * (queue size) / (link bandwidth). 
    
   This value is an upper bound, and is dependent on hardware buffer 
   implementations. 
    
    
Appendix B.            Finding a Sub-graph Subject to Time Constraints 
    
   For simplicity, we assume the restoration time constraint is Trec 
   seconds and the reconfiguration time at each node once notified of 
   the failure is Tcfg.  Thus, the allowable fault notification time 
   Tnot = Trec - Tcfg. 
     
   Initialization: 
      G (N, A) is the graph containing the set N of nodes n and set A 
      of arcs (I, J) representing the control channel links 
      // Note that the links in this graph could be different from the 
      // links used for the data channels 
    
      For every link (I, J) in A 
         Set cost (I, J) = d proc (I) + d queue max (I) + d prop (I, J) 
      // this sets all link costs to a delay value that includes the 
      // delays incurred at the node I. 
    
      node-neighborhood-for-failure (I, (I, J)) is the set of nodes that 
      are Tnot away from node I if link (I, J) fails initially set to {}. 

 
Rabbat                 Expires - December 2002              [Page 15] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
   Procedure Find-node-neighborhood-for-failure (I, (I, J)) 
   Begin 
      Run shortest path algorithm on graph pruned from link (I, J) 
       
      // This results in a set SC (I), that contains for each node X,  
      // the smallest cost (minimum delay) from node I. 
      // min-delay-between (I, X) is defined in section 5.3 as the 
      // minimum time (smallest cost) needed for a message to travel  
      // from I to X. 
       
      For every entry for node X in SC, 
         min-delay-between (I, X) += d proc (X) + d queue max (X) 
         // add the delays incurred at the last node 
       
      // SC now contains the shortest time from node I to any node 
      // including processing time at the end-nodes for link (I, J) 
      For every entry in SC, if min-delay-between (I, X) < Tnot, then 
         Add node X in SC to node-neighborhood-for-failure (I, (I, J)) 
    
      // If the message cannot go from I to J within Tnot sec, then  
      // the node is not reachable within the time constraints 
   End 
    
   Procedure Prune-for-link-protection (I, J) 
   Begin 
      Find-node-neighborhood-for-failure (I, (I, J)) 
      Find-node-neighborhood-for-failure (J, (J, I)) 
      Set reachable (I, J) = union of  
               node-neighborhood-for-failure (I, (I, J)) 
               and node-neighborhood-for-failure (J, (J, I)) 
      // This is the set of nodes reachable by either endpoints of  
      // link (I, J) within time Tnot 
      Return reachable (I, J) 
   End 
    
   Procedure Prune-for-path-protection (P (I1, I2,... In)) 
   Begin 
      Set N' = N 
      // all nodes originally 
      For any link (Ik; Ik+1) of path P 
      Begin 
         Prune-for-link-protection (Ik, Ik+1) 
         Set N' = intersection of N' and reachable (Ik, Ik+1)  
      End 
   End 
    
   Running for link (I, J) protection: 

 
Rabbat                 Expires - December 2002              [Page 16] 
           Fault Notification and Service Recovery Protocol  June 2002 
 
 
      Prune-for-link-protection (I, J) 
      Set N' = reachable (I, J) 
      Run link protection algorithm considering only nodes in set N' and 
         any data links between them (not necessarily the same links as  
         the control channels) 
    
   Running for Path P (I1, I2,... In) protection: 
      Prune-for-path-protection (P (I1, I2,... In)) 
      Run path protection algorithm considering only nodes in set N' and 
         any data links between them 
    
    
11.    References  
                     
   [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 
      Levels", BCP 14, RFC 2119, March 1997. 
    
   [2] Papadimitriou, D., et al, "Analysis Grid for GMPLS-based Recovery 
      Mechanisms (including Protection and Restoration)", Internet 
      draft, work in progress, draft-papadimitriou-ccamp-gmpls-recovery-
      analysis-00.txt, April 2002. 
    
   [3] Sharma, V., et al, "Framework for MPLS-based recovery", Internet 
      Draft, work in progress, draft-ietf-mpls-recovery-frmwrk-04.txt, 
      May 2002. 
    
   [4] Li, G., J. Yates, et al, "Experiments in Fast Restoration using 
      GMPLS in Optical/Electronic Mesh Networks", Post-deadline Papers 
      Digest, OFC 2001, Anaheim, CA, March 2001. 
    
   [5] Rabbat, R. et al, "Fault Notification and Service Recovery in WDM 
      Networks", white paper available at: 
      http://perth.mit.edu/~richard/wp-ietf-fault-notification.pdf. 
    
   [6] Lang, J. et al, "Link Management Protocol (LMP)", Internet Draft, 
      work in progress, draft-ietf-ccamp-lmp-03.txt, March 2002. 
    

Rabbat                 Expires - December 2002              [Page 17] 
              Fault Notification and Service Recovery Protocol June 2002 

    
12.    Author's Addresses  
        
   Richard Rabbat                      Ching-Fong Su 
   Fujitsu Labs of America, Inc.       Fujitsu Labs of America, Inc. 
   595 Lawrence Expressway             595 Lawrence Expressway 
   Sunnyvale, CA 94085                 Sunnyvale, CA 94085 
   United States of America            United States of America 
   Phone: +1-408-530-4537              Phone: +1-408-530-4572 
   Email: rabbat@fla.fujitsu.com       Email: csu@fla.fujitsu.com 
    
   Norihiko Shinomiya                  Vishal Sharma  
   Fujitsu Laboratories Ltd.           Metanoia, Inc.  
   1-1, Kamikodanaka 4-Chome           305 Elan Village Lane, Unit 121  
   Nakahara-ku, Kawasaki               San Jose, CA 95134-2545  
   211-8588, Japan                     United States of America 
   Phone: +81-44-754-2635              Phone: +1-408-955-0910  
   Email: shinomi@jp.fujitsu.com       Email: v.sharma@ieee.org 
    
   Takafumi Chujo                      Akira Chugo 
   Fujitsu Labs of America, Inc.       Fujitsu Laboratories Ltd. 
   595 Lawrence Expressway             1-1, Kamikodanaka 4-Chome 
   Sunnyvale, CA 94085                 Nakahara-ku, Kawasaki 
   United States of America            211-8588, Japan 
   Phone: +1-408-530-4507              Phone: +81-44-754-2629 
   Email: takafumi@fla.fujitsu.com     Email: chugo@flab.fujitsu.co.jp 
    
    
Rabbat                 Expires - December 2002              [Page 18]