Network Working Group                                        Jerry Ash
Internet Draft                                      Gagan L. Choudhury
                                                               Jeff Han
Expiration Date: May 2002                        Vera D. Sapozhnikova
                                                        Mostafa Sherif
                                                                  AT&T

                                                    Mahmood Noorchashm
                                                               Concert

                                                        Anurag Maunder
                                                        Sanera Systems

                                                        Vishwas Manral
                                                              NetPlane

                                                         October, 2001


          Proposed Mechanisms for Congestion Control/
           Failure Recovery in OSPF & ISIS Networks

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

ABSTRACT:

Earlier papers and contributions identified issues of congestion control and failure recovery for link-state protocol networks, such as OSPF, ISIS, and PNNI networks [maunder, choudhury, pappalardo1, pappalardo2, atm01-0101]. The problem addressed is to enable link-state protocols to a) gracefully recover from massive loss of topology database information, and b) respond gracefully to network overloads and failures. This contribution proposes specific additional considerations for network congestion control/failure recovery. Candidate mechanisms are proposed for control of network congestion and failure recovery; in particular, we initially propose to investigate the following mechanisms:

a) throttle new connection setups, topology-state updates, and Hello updates based on automatic congestion control mechanisms,

b) special marking of critical control messages (e.g., Hello and topology-state-update Ack) so that they may receive prioritized processing,

c) database backup, in which a topology database could be automatically recovered from loss based on local backup mechanisms, and

d) hitless restart, which allows routes to continue to be used if there is an uninterrupted data path, even if the control path is interrupted due to a failure.

We propose that the OSPF and ISIS working groups include the candidate mechanisms in an evaluation of congestion control/failure recovery for OSPF and ISIS networks.

Table of Contents

   1. Introduction
   1.1 Related Work Underway in Standards Bodies
   2. Failure Experience and Analysis
   2.1 Failure Experience
   2.2 Vendor Analysis of Product Performance
   2.3 Analytic Modeling, Simulation Analysis, and Emulation Analysis
   3. Proposed Solution Methods
   4. Candidate Mechanisms for Congestion Control/Failure Recovery
   4.1 Automatic Congestion Control
   4.2 Special Marking of Critical Control Messages
   4.3 Database Backup
   4.4 Hitless Restart
   5. Proposals
   6. Security Considerations
   7. Acknowledgments
   8. References
   9. Authors' Addresses

1. Introduction

Congestion can arise in data networks for many different reasons. There is evidence, based on previous failures, that link state (LS) routing protocols such as OSPF and ISIS currently cannot recover from large failures which result in widespread loss of topology database information (especially when areas/peer-groups get "too large").
LS protocols typically use topology-state update (TSU) mechanisms to build the topology database at each node, conveying topology status through flooding of TSU messages containing link, node, and reachable-address information between nodes. In OSPF the TSU is the link state advertisement (LSA); in PNNI it is the PNNI topology state element (PTSE); frame-relay and proprietary-routing networks may use other TSU mechanisms to exchange topology status information and build the topology database at each node. In this contribution we use the generic term TSU so as to cover LS protocols in general.

Earlier papers and contributions identified issues of congestion control and failure recovery for LS protocol networks, such as OSPF, ISIS, and PNNI networks [maunder, choudhury, pappalardo1, pappalardo2, atm01-0101]. The problem addressed is to enable LS protocols to a) gracefully recover from massive loss of topology database information, and b) respond gracefully to network overloads and failures. This contribution proposes specific additional considerations for network congestion control/failure recovery.

As discussed in Section 2, there have been data network outages in which these problems have manifested themselves. As a result of these failures, a number of congestion control/failure recovery mechanisms were recommended, and these are reviewed in Section 3 of this contribution. In Section 4, we discuss how these candidate mechanisms for control of network congestion and failure recovery can be used. In particular we discuss:

a) how a node might detect congestion or failure conditions based on various criteria, such as CPU real-time use, memory consumption, and others,

b) how a node might control its workload once congestion is detected, such as through
   * the throttling of new connection setups, topology-state updates, and Hello updates with the use of automatic congestion control mechanisms,
   * special marking of critical control messages (e.g., Hello and topology-state-update Ack) so that they may receive prioritized processing,

c) how a node might recover from failure, such as through
   * database backup, in which a topology database could be automatically recovered from loss based on local backup mechanisms,
   * hitless restart, which allows routes to continue to be used if there is an uninterrupted data path, even if the control path is interrupted due to a failure.

We propose that the OSPF and ISIS working groups include the candidate mechanisms in an evaluation of congestion control/failure recovery for OSPF and ISIS networks.

1.1 Related Work Underway in Standards Bodies

There is much work already underway in standards bodies, namely the IETF, ATM Forum, and ITU-T, to address issues of congestion control and failure recovery in ATM- and IP-based packet networks. Numerous references are cited and are further explained in the document [maunder, moy1, moy2, moy3, murphy, whitehead, zinen, atm01-0101, btd-cs-congestion-02.00].

2. Failure Experience and Analysis

In this section we present evidence of the current problems associated with LS failure recovery from various failure conditions, based on a) failure experience, b) vendor analysis of product performance, and c) analytic modeling, simulation analysis, and emulation analysis.

2.1 Failure Experience

AT&T has experienced serious data network outages in which recovery of the underlying LS protocols was inadequate.
For example, in the failure in the AT&T Frame Relay Network on April 13, 1998 [att], an initial procedural error triggered two undetected software bugs, leading to a huge overload of control messages in the network. The result of this control overload was the loss of all topology database information, and the LS protocol then attempted to recover the database with the usual Hello and TSU updates. Analysis has shown that several problems then occurred to prevent the network from recovering properly:

* Very large number of TSUs being sent to every node to process, causing general processor overload

* Route computation based on incomplete topology recovery, causing routes to be generated based on transient, asynchronous topology information and then in need of frequent re-computation

* Inadequate work queue management to allow processes to complete before more work is put into the process queue

* Inability to segment the network (into smaller "peer groups") to aid in the LS protocol recovery

* Inability to access the node processors with network management commands due to lack of necessary priority of these messages

A more recent failure occurred on February 20, 2001 in the AT&T ATM Network, which resulted in a large overload of TSUs and a lengthy network outage [pappalardo1, pappalardo2]. Manual procedures were put in place to reduce TSU flooding, which worked to stabilize the network. It is desirable that such TSU flooding reduction be automatic under overload.

In general, there have been a number of major outages reported by most major carriers, and routing protocol issues have generally been involved. Other relevant LS-network failures are reported in [cholewka, jander].

Various networks employing LS protocols use various control messages and mechanisms to update the LS database, not necessarily LSAs, PTSEs, or flooding mechanisms. Based on experience, however, LS protocols are found to be vulnerable to loss of database information, control overload to re-sync databases, and other failure/overload scenarios which make such networks more vulnerable in the absence of adequate protection mechanisms. Hence we are addressing a generic problem of LS protocols across a variety of implementations, and the basic problem is prevalent in LS protocol networks employing frame-relay, ATM, and IP based technologies.

2.2 Vendor Analysis of Product Performance

Various vendors and service providers were asked to analyze their product with reference to how their LS protocol recovers from a total network failure, that is, loss of all database information in the specified scenario. The specification of the failure presented to the vendors made the following assumptions:

1. A 400-node network is considered:
   a) 100 backbone nodes
   b) 3 edge nodes per backbone node (edge nodes single homed)
   c) each backbone node is connected to no more than 10 nodes (maximum node adjacency is 13, which defines a sparse network)

2. There are 101 areas:
   a) 1 backbone area with 100 backbone nodes
   b) 100 edge areas, each with 3 nodes, all homed on the backbone peer group

3. 1,000,000 addresses are advertised in the network

4. Maximum node adjacency is 13
   a) sparse network

Assumptions on packet sizes and processing times varied according to product capabilities associated with the individual estimates given below.
The projected recovery times from three vendor/service-provider analyses of the above scenario are as follows:

   Recovery Time Estimate A - 3.5 hours
   Recovery Time Estimate B - 5-15 minutes
   Recovery Time Estimate C - 5.5 hours

Clearly there is large variability in the estimates; however, the expectation is that vendor equipment recovery is not adequate under a large failure scenario such as the one analyzed.

2.3 Analytic Modeling, Simulation Analysis, and Emulation Analysis

Some analysis models have been published which reveal problematic performance of LS protocols under various failure conditions [maunder, atmf00-0249]. There are various simulation capabilities [NS, wandl, opnet, scalable-networks, makesystems] for analyzing LS protocol behavior under failure. We report below an event simulation study [choudhury] on overloads in large-scale networks and the corresponding recovery times. Emulation of large-failure performance could also be obtained through laboratory testing of actual protocol performance of vendor products under failure conditions; however, to date no such emulation studies of failures in large-scale networks and the corresponding recovery times have been published. The results from such simulation and emulation studies are specific to the type of equipment and network configuration modeled, and therefore are not completely general. However, the results to date lend support for the existence of problems with LS protocols in the face of congestion and network failures.

A network-wide event simulation model is reported in [choudhury] to study the impact of a TSU storm. It captures the actual congestion seen at various nodes and accounts for propagation delay between nodes, retransmissions in case a TSU is not acknowledged, failure of links for TSUs delayed beyond the node-dead interval, and link recovery following database synchronization and TSU flooding once the TSU is processed. It approximates a real network implementation and uses processing times that are roughly of the same order of magnitude as measured in a real network (of the order of milliseconds).

There are two categories of IGP messages processed at each node in the simulation. Category 1 messages are triggered by a timer and include Hello refresh, TSU refresh, and retransmission packets. Category 2 messages are not triggered by a timer and include received Hellos, received TSUs, and received acknowledgements. Timer-triggered messages are given non-preemptive priority over the other type and are queued in the timer queue. As a result, the received Hello packets and the received acknowledgement packets may see long queuing delays under intense CPU overload.

Table 1 below shows sample results of the simulation study when applied to a network with about 300 nodes and 800 links. The node adjacency varies from node to node, and the maximum node adjacency is 30. The Hello interval is assumed to be 5 seconds, the minimum interval between successive SPF (Shortest-Path-First) calculations is 1 second, and the Node-Dead Interval ("Router-Dead Interval") is 15 seconds, i.e., a link is declared down if no Hello packet is received for three successive Hello intervals. During the study, a TSU storm of size X is created at time 100 seconds, where storm size is defined as the number of TSUs generated during the storm. Three cases are considered, with X = 300, 600, and 900 respectively. Besides the storm, there are also the normal once-in-thirty-minutes TSU refreshes.
At any given point of time we define a quantity, "dispersion", which is the number of control packets already generated in the network but not yet received and processed in at least one node (each control packet is assumed to carry three TSUs). Table 1 plots dispersion as a function of time and thereby illustrates the impact of a TSU storm on network stability. Before the TSU storm, the dispersion due to normal TSU refreshes remains small. We expect the dispersion to jump to a high value right after the storm and then come down to the pre-storm level after some period of time (this happens with X=300 and X=600 but not with X=900).

In Table 1, with a TSU storm of size 300, the "heavy dispersion period" lasted about 11 seconds and no link losses were observed. With a TSU storm of size 600, the "heavy dispersion period" lasted about 40 seconds. Some link losses were observed a little after 15 seconds within the "heavy dispersion period", but eventually all links recovered and the dispersion came down to the pre-storm level. With a TSU storm of size 900, the "heavy dispersion period" lasted throughout the simulation period (6 minutes).

   ======|==============================================================
         |     Table 1: DISPERSION as a FUNCTION of TIME (in sec)
         |                for different TSU Storm Sizes
   STORM |==============================================================
   SIZE  | 100s   106s   110s   115s   140s   170s   230s   330s   370s
   ======|==============================================================
    300  |   0     39      3      1      0      1      0      0      0
   ------|--------------------------------------------------------------
    600  |   0    133    120    100     12      1      0      0      0
   ------|--------------------------------------------------------------
    900  |   0    230    215    196    101    119    224    428    488
   ======|==============================================================

The generic observations are as follows:

* If the initial TSU storm size (e.g., X=300) is such that the delays experienced by Hello packets are not big enough to cause any link failures anywhere in the network, the network remains stable and quickly gets back to a period of "low dispersion". These types of LSA storms are observed quite frequently in operational networks, from which the network easily recovers.

* If the initial TSU storm size (e.g., X=600) is such that the delays experienced by a few Hello packets in a few nodes cause link failures, then some secondary TSU storms are generated. However, the secondary storms do not keep growing indefinitely, and the network remains stable and eventually gets back to a period of "low dispersion". This type of LSA storm was observed in an operational network, triggered by a network upgrade, from which the network recovered but with some difficulty.

* If the initial TSU storm size (e.g., X=900) is such that the delays experienced by many Hello packets in many nodes cause link failures, then a wave of secondary TSU storms is generated. The network enters an unstable state and the secondary storms are sustained indefinitely or for a very long period of time. This type of LSA storm was observed in an operational network, triggered by a network failure (the 2/20/01 AT&T ATM network failure discussed in Section 2.1), from which the network did not recover without corrective steps (manual procedures were used to reduce TSU flooding and stabilize the network).

The results show that there is a TSU storm threshold above which the network shows unstable behavior. It is desirable that TSU flooding reduction be automatic under overload, and the model could be used to assess the benefits of various control mechanisms, such as those discussed in the next section.
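To make the dispersion measure concrete, the short Python sketch below computes it at a single instant from a list of flooded control packets and the set of nodes that have processed each one. This is only an illustration of the definition given above; the data structures and names (ControlPacket, dispersion) are not taken from the simulation code of [choudhury].

   # Illustrative sketch: compute "dispersion" at one instant, i.e. the
   # number of control packets generated but not yet received and
   # processed in at least one node.
   from dataclasses import dataclass, field
   from typing import Set

   @dataclass
   class ControlPacket:
       pkt_id: int
       processed_by: Set[str] = field(default_factory=set)  # node ids done with it

   def dispersion(packets, all_nodes):
       """Packets still unprocessed at one or more nodes (Table 1 measure)."""
       return sum(1 for p in packets if not all_nodes <= p.processed_by)

   # Example: 3 nodes, 2 packets; packet 2 is still unprocessed at node "C".
   nodes = {"A", "B", "C"}
   pkts = [ControlPacket(1, {"A", "B", "C"}), ControlPacket(2, {"A", "B"})]
   print(dispersion(pkts, nodes))   # -> 1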
3. Proposed Solution Methods

Generally, we wish to find generic LS protocol enhancements such that networks can control congestion arising from any stimulus. There are general congestion control/failure recovery methods that have been applied in many networks, and these should be considered for enhancement of the OSPF and ISIS protocols. It would be desirable not to have TSU congestion be the primary factor limiting the size of peer groups. Rather, we would like to be able to grow peer groups to a very large size in order to reduce complexity, and also to recover from large failures resulting in widespread loss of topology database information.

Some candidate protocol mechanisms to allow recovery from large failures are these:

a) throttle new connection setups, topology-state updates, and Hello updates based on automatic congestion control mechanisms, triggered by congestion at nodes. Congestion may be due either to the processing of routing update messages or to connection setup messages, and is manifested by sustained CPU busy time. This is similar to the automatic congestion control (ACC) [Q.764, whitehead, hosein1, hosein2] or dynamic overload control (DOC) mechanisms used in various networks, in which a node can signal other nodes to limit the rate of control messages sent to it [mummert],

b) special marking of critical control messages (e.g., Hello and topology-state-update Ack) so that they may receive prioritized processing [maunder],

c) database backup, in which a topology database could be automatically recovered from loss based on local backup mechanisms,

d) hitless restart, which allows routes to continue to be used if there is an uninterrupted data path, even if the control path is interrupted due to a failure [moy1],

e) node processor CPU real-time and memory control, which would provide work process queuing mechanisms to allow lower priority processes to run and not starve CPU memory and real-time to the point where no processes run,

f) give necessary priority to network management control messages so they are not locked out during times of failure,

g) avoid duplicate control messages where possible (e.g., sent out on parallel links to the same node). PNNI already addresses flooding on parallel links; a corresponding mechanism is currently proposed for OSPF [moy2],

h) mechanisms to automatically segment a large peer group into smaller peer groups, allow these segments to recover their topology and routing, then combine the smaller peer groups into the original large peer group and allow that to recover. Such mechanisms are implemented in operational networks, for example through network management scripts that segment the network under severe overload. Currently these procedures are triggered manually; however, it would be desirable to automate such mechanisms within the LS protocol,

i) provide bandwidth allocation mechanisms to limit the amount of link bandwidth allocated to control traffic versus data traffic. This capability exists in PNNI: RCCs are provisioned with traffic parameters (default parameters are specified in [af-pnni-0055.000]) and traffic is shaped into the RCC VC at the configured traffic rate,

j) avoid acknowledging messages that overflow the work queue. The point is not to ack a message that is eventually discarded and not processed; ack only when it is known that the message will be processed.
4. Candidate Mechanisms for Congestion Control/Failure Recovery

In the following sections, we describe in more detail the first four of the mechanisms listed in Section 3, and propose that more detailed evaluation be undertaken for their inclusion in OSPF and ISIS.

4.1 Automatic Congestion Control

Mechanisms are proposed to slow down TSU/Hello update rates and to throttle new connection setups, topology-state updates, and Hello updates based on automatic congestion control, triggered by congestion at nodes. The objective of throttling new connection setups applies directly in the case of link-state protocols such as PNNI, which support connection setup, and indirectly in the case of OSPF or ISIS, where the concept could well apply to MPLS setups of label switched paths (LSPs) [mpls-architecture]. That is, in both establishing an LSP and modifying LSPs, throttling connection setups would help reduce the associated control load, and the methods described below might then apply.

Automatic congestion control (ACC) [Q.764, whitehead, hosein1, hosein2] is used during node overloads to maintain network throughput. Such mechanisms have been used in large-scale voice networks for many years, such as the AT&T voice network [mummert, ash], which is a large dynamic routing network. ACC therefore is applicable to both dynamic routing networks and fixed routing networks.

In ITU-T Recommendation Q.764, ACC can distinguish different levels of congestion, for example a less severe congestion condition (level 1) and a more severe condition (level 2). If either congestion level is reached, an ACC information message may be sent to the adjacent nodes indicating the level of congestion. The adjacent nodes, on receiving an ACC information message, should reduce their traffic to the affected overloaded node. The ACC level is typically based on real-time utilization and queue-length thresholds in the overloaded node. The ACC levels can be passed back to the adjacent nodes, for example, in release messages. When an adjacent node finds that a node is overloaded, it blocks connection requests to the overloaded node. For example, if the ACC is set to level 2, then the adjacent node could block a percentage of traffic and adjust the blocking rate until ACC congestion level 1 is reached. The mechanism described in ITU-T Recommendation Q.764, however, was found to over-control in response to overload [hosein1, hosein2]. Various improved algorithms have been investigated [whitehead, hosein1]; they provide good performance and are relatively simple to implement. The exact mechanisms for automatic congestion control need to be specified in the follow-on study of the mechanism.

In the case of applying ACC to new connection setups during congestion, traffic can be shed by rejecting newly-arriving connection requests back to the originating node, which then either routes around the congestion or denies the new setup. This feature is currently being provided within the signaling congestion control upgrades for PNNI [atm00-0257].
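As a rough illustration of the blocking behavior described above, the Python sketch below shows how an adjacent node might adapt the fraction of new connection setups it rejects toward an overloaded neighbor, based on the ACC congestion level carried back in release messages. The class name, step sizes, and the simple additive adjustment are assumptions made for illustration only; they are not the Q.764 procedure or the improved algorithms of [whitehead, hosein1].

   # Illustrative ACC-style throttle at an adjacent node: raise the blocking
   # fraction quickly on a level-2 indication, back off slowly on level 1,
   # and stop blocking when no congestion is reported.
   import random

   class AccThrottle:
       def __init__(self, step_up=0.2, step_down=0.05):
           self.block_fraction = 0.0       # fraction of new setups rejected locally
           self.step_up = step_up
           self.step_down = step_down

       def on_congestion_level(self, level):
           """Called when a release message carries an ACC indication (0, 1, or 2)."""
           if level >= 2:                  # severe congestion: shed more traffic
               self.block_fraction = min(1.0, self.block_fraction + self.step_up)
           elif level == 1:                # mild congestion: back off slowly
               self.block_fraction = max(0.0, self.block_fraction - self.step_down)
           else:                           # no congestion reported
               self.block_fraction = 0.0

       def admit_setup(self):
           """True if a new connection setup may be sent to the neighbor."""
           return random.random() >= self.block_fraction

A node receiving repeated level-2 indications would ramp up blocking quickly, then back off gradually as the neighbor drops to level 1 and finally reports no congestion.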
For controlling the level of TSU flooding during congestion, a difficulty is that with distributed routing, flooding happens from all nodes, which must continue to flood to their neighbors the information they receive. There is no initial source at which to flow-control the TSUs flooded to a node. A mechanism could be devised in which nodes "advertise" some level of congestion, and if enough nodes are sufficiently "congested" then all nodes in the peer group start to slow down the rate at which they emit TSUs.

As an example, suppose the maximum rate at which trunk bandwidth-change TSUs can be generated is one per five seconds per trunk. Under a TSU overload, as evidenced by long CPU busy times at some number of nodes, a mechanism could automatically change the "five" to, say, "ten". The TSU congestion state could be inferred, for example, by not receiving Hello updates within, say, 2 times the Hello interval. We could also have an explicit way to notify other nodes of congestion, such as a mechanism similar to the grace-LSA mechanism, where the information is sent in Opaque LSAs [moy1]. We could envision an overload bit which is set on LSU storm detection, and which results in the following actions:

1. The LSU retransmit interval is automatically increased, thus throttling LSUs as called for above. The LSU retransmit interval, Rxmt, refers to three things:

   a) In the sending and receiving of LS database packets, two nodes form a master/slave relationship. Each LS database packet has a sequence number, and each database packet sent by the master is acknowledged by the slave through echoing of the sequence number. Both the packets sent by the master and their responses contain summaries of LS data. The master is the only one allowed to retransmit LS database packets, and it does so only at fixed intervals, the length of which is configured per interface and denoted the Rxmt interval.

   b) Link state request packets that are not satisfied are resent at fixed intervals (the Rxmt interval).

   c) The time interval between LSU retransmissions, for adjacencies belonging to this interface.

   Increasing Rxmt helps slow the rate of LSUs, including retransmits, thus reducing congestion.

2. Define an interval which is greater than the node dead interval, called the adjacency breaking interval. When the node dead interval expires, we do not put the link down; instead we update the node LSU. If Hellos are received before the adjacency breaking interval is reached, the adjacency is not broken. This would keep adjacencies from breaking and causing a secondary flooding storm, while the correct LSU information is still sent out for the SPF calculation. This would also result in less data traffic going through the link in case of problems on that link, as the node LSU will not show the link as having transit capability.

3. Increase the Hello interval and the node dead interval. However, we need to ensure that adjacencies are not broken in case of a Hello or node dead interval mismatch. After the network stabilizes, we decrease the intervals and then clear the overload bit.

4. Do not do any incremental SPF and, if required, increase the SPF interval.

To prevent LSU flooding storms in the stable state, we could use optimized refreshment, in which all LSUs are not refreshed at the same time, thus preventing a flooding storm. A method for this is given in [refresh-guide].

To reduce the rate of flooding during database synchronization, the master node should not immediately send a new database description packet if either the master node or the slave node is congested. Generally, when a node detects that either it or a connected node is congested, it should reduce the rate of flooding. The mechanism could be something like a sliding window control in which the size of the window is equal to the approximate number of LSUs that an adjacent node can process. This helps in the case when a node has to flood thousands of LSUs, which at present it would do immediately, perhaps causing the other end to become overloaded and resulting in further congestion through innumerable retransmits. With a sliding window control, we would limit the rate of LSU flooding, thus causing no overload at the other end.
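A minimal sketch of the sliding-window idea is shown below: at most a fixed number of LSUs may be outstanding (flooded but unacknowledged) toward a given neighbor, and further LSUs are queued until acknowledgements arrive. The class name, the default window size, and the send callback are assumptions for illustration only; an actual window size would be tied to the neighbor's advertised or inferred processing capacity.

   # Illustrative sliding-window control on LSU flooding toward one neighbor.
   from collections import deque

   class FloodWindow:
       """Limit the number of unacknowledged LSUs outstanding to one neighbor."""

       def __init__(self, send, window=20):
           self.send = send            # callable that transmits one LSU id
           self.window = window        # approx. LSUs the neighbor can process
           self.pending = deque()      # LSUs waiting to be flooded
           self.unacked = set()        # LSU ids flooded but not yet acknowledged

       def enqueue(self, lsu_id):
           self.pending.append(lsu_id)
           self._flood_within_window()

       def on_ack(self, lsu_id):
           self.unacked.discard(lsu_id)
           self._flood_within_window()  # an ack opens room for the next LSU

       def _flood_within_window(self):
           while self.pending and len(self.unacked) < self.window:
               lsu_id = self.pending.popleft()
               self.unacked.add(lsu_id)
               self.send(lsu_id)

   # Example: a window of 2 lets only two LSUs out until acks come back.
   fw = FloodWindow(send=lambda i: print("flood LSU", i), window=2)
   for i in range(5):
       fw.enqueue(i)        # floods 0 and 1, queues 2-4
   fw.on_ack(0)             # floods 2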
The exact mechanisms for automatic congestion control need to be specified in the follow-on study of the mechanism.

4.2 Special Marking of Critical Control Messages

Special marking is proposed to quickly identify critical control messages (e.g., Hello and TSU Ack) so that they may receive prioritized processing. This mechanism has been proposed for OSPF [maunder]. The benefit and use of this capability has been illustrated in Section 2.3.

It is useful to mark control messages, such as TSU Acks and Hello packets, with a priority marking so that the work-load management process within a node can easily distinguish high-priority work from lower-priority work [maunder]. In the case of ATM networks, this can perhaps be accomplished through the use of a separate VCI. Database description packets from a slave node could also be prioritized in the master node, as such a packet is equivalent to an Ack for the database description packet from the master.

In addition, packets could be prioritized based on:

a) the state of the sending neighbor: the higher the state of the neighbor interaction, the higher the priority of packets from it. In OSPF, for example, packets from neighbors with a state greater than Exchange could be given higher priority than those from neighbors in a lower state,

b) whether the packets convey "bad" information,

c) LSUs that carry a change versus LSUs that carry no change. LSUs generated by LSU refresh generally do not carry any change, so they should be given lower priority than LSUs transmitted because of a change. Delay in sending the refresh will not cause any harm, as the previous version of the same LSU is already present in the network. However, packets with a change could cause a change in the routing table, and hence need to be prioritized.

In addition, it is necessary to provide a flexible scheduling mechanism for CPU real-time and memory that gives each process a certain minimal share even during overload and does not starve any process. CPU overload, if not managed properly, can result in poor utilization of CPU real-time and memory. For example, during an overload in the receipt of control messages such as TSUs and Hello messages, it is useful for node processors to manage CPU real-time and memory so that node resources are not exhausted and useful work is always accomplished. Various means of workload management can be employed. The exact mechanisms for special marking of critical control messages, and for processor real-time and memory management, need to be specified in the follow-on study of the mechanism.
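The following sketch illustrates the kind of classification described in this section: received Hellos and TSU Acks are processed before LSUs carrying a change, which in turn are processed before refresh LSUs. The numeric priority values and the heap-based work queue are assumptions for illustration, not a proposed encoding of the marking.

   # Illustrative priority classification of received IGP packets.
   import heapq
   import itertools

   PRIORITY = {"hello": 0, "ack": 0, "changed_lsu": 1, "refresh_lsu": 2}
   _seq = itertools.count()      # tie-breaker keeps FIFO order within a class

   work_queue = []

   def receive(pkt_type, pkt):
       heapq.heappush(work_queue, (PRIORITY[pkt_type], next(_seq), pkt))

   def next_work():
       """Return the highest-priority packet waiting for processing."""
       return heapq.heappop(work_queue)[2] if work_queue else None

   receive("refresh_lsu", "LSU-17 (refresh)")
   receive("hello", "Hello from neighbor N1")
   print(next_work())            # -> Hello from neighbor N1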
4.3 Database Backup

A common practice in large networks is to provide a method of locally backing up the LS database information. Then, in the case of a failure condition in which database information is lost, the backup database is accessed and used to reinitialize the LS database. A mechanism to recover any lost updates of database information, that is, those updates lost during the interval from the time of failure to the recovery time, would be needed. The concept of a database backup therefore is not new; however, it is new in the context of using database backup together with LS protocols.

Database backup has worked in practice for about 25 years in the AT&T core voice network, which is a large-scale distributed, dynamic routing network (it uses RTNR, with class-of-service capabilities [ash]). The intuition behind the database backup concept is simple: if a node loses its whole database (just to take that example), it is better to reinitialize with a local copy, which can be expected to have a large percentage of correct, up-to-date information. The alternative of flooding the entire database from other nodes is far more time consuming and less efficient. Getting the database back to 100% accuracy after restoration from the local backup is an issue that needs to be addressed.

To further define the concept, we illustrate two possible implementations; there are many possible implementations, and these are not necessarily being proposed.

One way to view the LS database at each node is to consider a decomposition into the slow information database (SIDB) and the fast information database (FIDB). Information in the SIDB changes slowly, for example:

   * max link bandwidth
   * metric
   * reachable addresses

Information in the FIDB changes rapidly, for example:

   * available link bandwidth (e.g., real-time updates)

Because the data is backed up, SIDB and FIDB information would only be updated when changes occurred, and need not be periodically refreshed. Under a total LS database failure scenario, as discussed in Section 2, a failed node restores its most recent database and advertises the time-stamp of the most recent SIDB update received. Other nodes would then know which SIDB updates to restore to the node. FIDB information is flooded as normal and is not restored through the most-recent time-stamp mechanism; eventually the FIDB would re-sync.

After a failure, a node bases its routing on the current database, derived as above. Some of the database information will be somewhat out of sync, and routing will not be optimal; however, the routing would suffice until the database re-syncs completely. Since the SIDB information changes slowly, restoring that data should require a minimal amount of flooding. Since the FIDB information is not restored, the total amount of flooding after a database failure would be vastly reduced. Since the TSU flooding is slowed down after the failure event, for a while information from the old copy of the database would be used for routing purposes, but as TSUs trickle in, the database would gradually be populated with new information. This would mean somewhat inefficient routing and increased crankbacks for a while, but would improve stability due to the slowed-down TSU flooding.

Another alternative is to use the stable information as soon as a node gets back up, as suggested above, but without attempting to reduce the flooding in the network, that is, the stable information is still re-flooded. This can be achieved by each node maintaining a shadow copy of the stable information, as described above. When a node comes back up, it routes based on the backed-up shadow copy, and even if the current information collected through the routing protocol is not complete, the node can still use the old information for routing. In this approach, each node could explicitly specify what information is stable, and this may be easier than globally agreeing upon the information that is stable versus the information that is unstable. In this approach, the overall control load may be reduced by fewer instances of failed routing resulting from incomplete database information.
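A minimal sketch of the SIDB time-stamp recovery step described in the first illustrative implementation above is given below, assuming the SIDB is keyed by an identifier and each entry carries the time-stamp of its last update. The data layout and the function names (restore_from_backup, missing_sidb_updates) are hypothetical; they simply show how advertising the newest backed-up time-stamp lets a neighbor flood back only the SIDB updates the failed node missed.

   # Illustrative SIDB recovery: restore the local backup, advertise its
   # newest time-stamp, and merge in only the updates that post-date it.
   def restore_from_backup(backup_sidb):
       """Return the restored SIDB and the newest update time-stamp it holds."""
       latest = max((entry["ts"] for entry in backup_sidb.values()), default=0)
       return dict(backup_sidb), latest

   def missing_sidb_updates(neighbor_sidb, advertised_ts):
       """Updates a neighbor floods back because they post-date the backup."""
       return {k: v for k, v in neighbor_sidb.items() if v["ts"] > advertised_ts}

   # Failed node's backup misses the metric change made at t=120 on link-1.
   backup   = {"link-1": {"metric": 10, "ts": 100},
               "link-2": {"metric": 5, "ts": 90}}
   neighbor = {"link-1": {"metric": 15, "ts": 120},
               "link-2": {"metric": 5, "ts": 90}}

   sidb, ts = restore_from_backup(backup)
   sidb.update(missing_sidb_updates(neighbor, ts))
   print(sidb)   # link-1 now carries the newer metric; link-2 was already current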
Another mechanism for database backup would be to have a redundant backup completely in sync with the primary database. Upon database failure, the node would automatically make use of the backup. Another mechanism is to back up all the non-self-originated LSUs, as well as the states of the interfaces. Accordingly, when the network is recovering, a node can generate all of its own LSUs: node, network, and summary LSUs.

The above are just illustrative examples, which are not necessarily proposed. The exact mechanisms for using LS backup and database backup need to be specified in the follow-on study of the mechanism.

4.4 Hitless Restart

Nodes normally implement a separation of control and forwarding functions, wherein certain processors are dedicated to control and management tasks such as routing, while other processors perform the data forwarding tasks. This separation creates the possibility of maintaining a node's data forwarding capability while the control software is restarted/reloaded, which is known as "hitless restart". Under normal operation, LS protocols intentionally route around a restarting node while it rebuilds its LS database. Avoidance is accomplished by having the node's neighbors reissue their TSUs, omitting links to the restarting node. However, if the network topology remains stable and the restarting node is able to keep its forwarding table(s) across the restart, it would be possible to keep the restarting node on the forwarding path.

Mechanisms have been proposed for implementing hitless restart within LS protocols [moy1]. [zinen] in addition enables recovery from a node crash in an active/standby configuration, or other multiprocessor model. The exact mechanisms for hitless restart need to be specified in the follow-on study of the mechanism.

5. Proposals

Proposal A: We propose that the OSPF and ISIS working groups start a new work item on OSPF and ISIS routing congestion.

Proposal B: We propose that the OSPF and ISIS working groups include the four candidate mechanisms in a more detailed evaluation of their inclusion in the OSPF and ISIS protocols for congestion control/failure recovery in OSPF and ISIS networks. We suggest that the OSPF and ISIS working groups explore the suggested solution mechanisms in Sections 3-4:

a) automatic congestion control mechanisms to throttle new connection setups, topology-state updates, and Hello updates, triggered by congestion at nodes,

b) special marking of critical control messages (e.g., Hello and topology-state-update Ack) so that they may receive prioritized processing,

c) database backup, in which a topology database could be automatically recovered from loss based on local backup mechanisms, and

d) hitless restart, which allows routes to continue to be used if there is an uninterrupted data path, even if the control path is interrupted due to a failure.

6. Security Considerations

There are no new security considerations based on the proposals in this draft.

7. Acknowledgments

The authors gratefully acknowledge the comments and suggestions from many people. At AT&T we thank Margaret Chiosi, Enrique Cuevas, Tom Helstern, Steve Holmgren, Aswatnarayan Raghuram, and Mike Shapiro. We also thank Shawn McAllister at Alcatel, Marty Albright at Worldcom, Sohel Khan at Sprint, Ghassem Koleni at Nortel Networks, Thomas Cornely and Greg Wetzel at Covad, Bob Dianda at Sprint, and Jeff Parker at Axiowave.
8. References

[af-pnni-0055.000] "Private Network-Network Interface Specification Version 1.0 (PNNI 1.0)," March 1996.

[ash] Ash, G. R., "Dynamic Routing in Telecommunications Networks," McGraw Hill.

[atmf00-0249] "Scalability and Reliability of large ATM networks."

[atm00-0257] "Signaling Congestion Control in PNNI Networks: The Need and Proposed Solution Outline."

[atm00-0480] "Congestion Control/Failure Recovery in PNNI Networks."

[atm01-0101] "Proposed Mechanisms for Congestion Control/Failure Recovery in PNNI Networks."

[att] "AT&T announces cause of frame-relay network outage," AT&T Press Release, April 22, 1998.

[btd-cs-congestion-02.00] "Signaling Congestion Control Version 1.0," Baseline Text.

[cholewka] Cholewka, K., "MCI Outage Has Domino Effect," Inter@ctive Week, August 20, 1999.

[choudhury] Choudhury, G., Maunder, A. S., Sapozhnikova, V., "Faster Link-State Convergence and Improving Network Scalability and Stability," submitted for presentation at LCN 2001.

[hosein1] Hosein, P., "An Improved ACC Algorithm for Telecommunication Networks," Telecommunication Systems 0, 1998.

[hosein2] Hosein, P., "Overload Control for Real-Time Telecommunication Databases," International Teletraffic Congress - 16, Edinburgh, Scotland, June 1999.

[jander] Jander, M., "In Qwest Outage, ATM Takes Some Heat," Light Reading, April 6, 2001.

[maunder] Maunder, A. S., Choudhury, G., "Explicit Marking and Prioritized Treatment of Specific IGP Packets for Faster IGP Convergence and Improved Network Scalability and Stability," work in progress.

[mpls-architecture] Rosen, E., Viswanathan, A., Callon, R., "Multiprotocol Label Switching Architecture," RFC 3031, January 2001.

[mummert] Mummert, V. S., "Network Management and its Implementation on the No. 4ESS," International Switching Symposium, Japan, 1976.

[moy1] Moy, J., "Hitless OSPF Restart," work in progress.

[moy2] Moy, J., "Flooding over parallel point-to-point links," work in progress.

[moy3] Moy, J., "Flooding Over a Subset Topology," work in progress.

[murphy] Murphy, P., "OSPF Floodgates," work in progress.

[pappalardo1] Pappalardo, D., "Can one rogue switch buckle AT&T's network?," Network World Fusion, February 23, 2001.

[pappalardo2] Pappalardo, D., "AT&T, customers grapple with ATM net outage," Network World, February 26, 2001.

[Q.764] "Signalling System No. 7 - ISDN user part signalling procedures," December 1999.

[refresh-guide] Zinin, A., "Guidelines for Efficient LSA Refreshment in OSPF," work in progress.

[whitehead] Whitehead, Martin, "A class of overload controls based on controlling call reject rates," ITU-T contribution D.19, February 2001.

[zinen] Zinin, A., et al., "OSPF Restart Signaling," work in progress.
9. Authors' Addresses

Jerry Ash
AT&T
Room MT D5-2A01
200 Laurel Avenue
Middletown, NJ 07748, USA
Phone: +1 732 420-4578
Fax: +1 732 368-8659
E-mail: gash@att.com

Gagan L. Choudhury
AT&T
Room D5-3C21
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: +1 732 420-3721
Fax: +1 732 368-1919
E-mail: gchoudhury@att.com

Jeff Han
AT&T
Room C5-2C34
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: +1 732 420-2438
Fax: +1 732 371-7234
E-mail: jhan@att.com

Vishwas Manral
NetPlane
189, Prashasan Nagar, Road number 72
Jubilee Hills
Hyderabad, India
E-mail: vishwasm@netplane.com

Anurag S. Maunder
Sanera Systems
370 San Aleso Avenue
Sunnyvale, CA 94085
Phone: +1 408 734-6123
Fax: +1 408 400-0151
E-mail: amaunder@sanera.net

Mahmood Noorchashm
AT&T
Room D3-3C02
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: +1 732 420-4185
Fax: +1 732 668-1769
E-mail: mahmood.noorchashm@concert.com

Vera D. Sapozhnikova
AT&T
Room C5-2C29
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: +1 732 420-2653
Fax: +1 732 368-1774
E-mail: sapozhnikova@att.com

Mostafa Hashem Sherif
AT&T
Room C5-2D18
200 S. Laurel Avenue
Middletown, NJ 07748
Phone: +1 732 420-2448
Fax: +1 732 371-7234
E-mail: mhs@att.com