Internet Engineering Task Force C. Alaettinoglu Internet Draft V. Jacobson Expires in May, 2001 H. Yu draft-alaettinoglu-isis-convergence-00.txt Packet Design November, 2000 Towards Milli-Second IGP Convergence Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at www.ietf.org/shadow.html. Distribution of this memo is unlimited. Copyright Notice Copyright (C) The Internet Society (2000). All Rights Reserved. Abstract It is possible for link-state routing protocols to converge in link propagation time scales, that is, in tens of milliseconds. However, current deployments of ISIS[ISIS-ISO][ISIS-IP], a link-state routing protocol, are not anywhere near this point. In this paper, we present some analyses of ISIS convergence by showing its behavior upon link/router failures and repairs, and its scaling properties to large networks, both in terms of number of nodes and links. We then explore changes needed in the ISIS specification and implementations to reach IGP convergence in milliseconds. Our results are based on experimentation done with ISIS, but some of the findings may apply to OSPF as well. This paper contains color figures and graphs which are missing from the ASCII version of the paper. The postscript and pdf versions of this paper can be found in the Internet-Drafts repository. 1 Motivation The theoretical limit for link-state routing protocols to re-route Alaettinoglu, et. al. Expires: May, 2000 [page 1 ] ^L INTERNET DRAFT draft-alaettinoglu-ISIS-convergence-00 November, 2000 is in link propagation time scales, that is in tens of milliseconds. However, the current ISIS re-route times are in tens of seconds. During the re-route period, a subset of destinations are either not-reachable or are reached through non-optimal routes. It is important to close this gap between theory and implementation for several reasons. First, millisecond re-route times will lead to increased network reliability since the periods at which routes are not available/non-optimal will be shorter (one to three orders of magnitude shorter). Second, it enables multi-service traffic such as voice over IP since it decreases the number of packets dropped to tolerable levels for these services. And finally, it eliminates the need for more expensive and complex layer 2 protections schemes, such as SONET. There are three steps to ISIS re-routing: detection, propagation and shortest path calculation. In the detection step, the topology change is locally detected. That is, a link failure/repair is detected by the routers the link is adjacent to, and a router failure/repair is detected by the router's neighbors. In the propagation step, a new link-state packet (LSP) reflecting the local topology after the change is flooded to all the other routers in the network. Finally, in the shortest path calculation step, each router having received the new LSP computes the new routes using a shortest path tree algorithm. Typically, Dijkstra's Shortest Path First (SPF) algorithm is used in this calculation. Only after these three steps, the re-route has completed, or equivalently we say that ISIS has converged. It is prudent to understand where the tens of seconds of re-route times go before trying to fix ISIS re-routing. Hence, we have done extensive experiments to characterize the time spent at each of these steps. In our experiments, we tested ISIS running on Cisco 7200s running IOS 12.0S9 and IOS 12.1P and on Juniper M40s running JunOS 4.1. In Sections [detection], [propagation], and [spf], we present our findings on the detection, propagation, and shortest path calculation steps respectively. In Section [summary], we summarize our findings and suggest changes to ISIS protocol specification and on its implementations. ISTalk Software In many of our experiments, we emulated very large random networks. To facilitate this, we developed a software tool called ISTalk. ISTalk can establish adjacencies with ISIS routers and inject them LSPs emulating any topology allowing us to vary both the number of nodes and the number of edges in the network. This is graphically illustrated in Figure [istalk].([fig] ISTalk Software emulates virtual topologies.) 2 Detection The topology change, or the link-state change, is either detected Alaettinoglu, et. al. Expires: May, 2000 [page 2 ] ^L INTERNET DRAFT draft-alaettinoglu-ISIS-convergence-00 November, 2000 and communicated to ISIS by the lower level protocols (i.e. link level protocols), or it is detected using the peer to peer ISIS hello protocol. The link level detection is fastest but seems to be inconsistently implemented by vendors. In our experiments, with certain medium we noticed that the device driver detected the failure but did not communicate it to the router's OS. The link level detection is also not always possible, for example in a switched environment where a router fails behind a switch. In this case, the switch gets the link layer notification, but has no mechanism to notify other routers in the network. The ISIS hello protocol is more robust, but slower. In this case, the adjacent routers send periodic hello packets to each other (the adjacency is also established through the use of hello packets, but not described here). If a router misses a fixed number of hello packets from an adjacent router, it declares the adjacency to that router down. By default, the routers send each other hello packets every 10 seconds, and declare the adjacency down after missing 3 hello packets. Hence, with default parameters, the detection can take up to 30 seconds. The specification limits the granularity of hello interval to seconds, hence making 3 seconds the fastest possible detection based on the hello protocol. Several ISPs use 2 second hello intervals, but declare an adjacency down after missing 5 hello packets, resulting in a detection time of 10 seconds.([fig] Hello Experiment Setup) To study the effect of hello interval, we configured 3 routers in a triangle topology and connected a workstation running ISTalk to router 1 as shown in Figure [hello-setup]. The routers are connected to each other using switched ethernet interfaces. Hence, when we take down one of the routers, the other two have to rely on the hello protocol to detect the change. The ISTalk establishes an ISIS adjacency with router 1 so that it receives the LSPs flooded in the network. The workstation pings the router 3 using router 3's interface address towards router 2 so that the pings follow the path shown on the left hand side of Figure [hello-setup]. We then bring down router 2 and after re-route the pings follow the path shown on the right hand side of Figure [hello-setup]. The pings are sent once every 100 milliseconds. ([fig] Packets received by the workstation during the Hello Experiment) Figure [hello-received-packets] shows the relevant packets received by the workstation. The short spikes are the ICMP ECHO REPLY packets received; they are only received if there is a valid route to the ping's destination and back. As seen in the figure, there is a 33 second period where there are no ICMP ECHO REPLY packets received. This is the re-route time, during which router 1 does not have a route to the destination. The two tall spikes are LSPs reflecting the topology change as detected by routers 1 and 3. Note that the two LSPs are separated by 1 second. This is because the two routers detect the change at different times due to their 3 hello packet intervals ending at different times. This separation can be anywhere from 0 second to one full hello interval, in this case 10 seconds. Alaettinoglu, et. al. Expires: May, 2000 [page 3 ] ^L INTERNET DRAFT draft-alaettinoglu-ISIS-convergence-00 November, 2000 Why then after receipt of LSPs there is still 5 seconds of re-route time? This is because the routers delay the SPF calculation by a number of seconds (default is 5 seconds) in the hope that they may catch more LSPs, particularly this second LSP in the figure, and do one SPF calculation instead of many (because the costly SPF calculation overwhelmed routers in the past). We refer to this delay as the SPF delay. For millisecond convergence, the SPF delay should be in milliseconds as well, and should ideally be zero. The argument against a small hello interval is the bandwidth consumed by the hello packets and the hello packets getting queued behind data packets and the routers not receiving them in time. The most desirable solution to the latter problem is to treat hello packets preferentially over the data packets. In case this solution is not available, we studied how the hello protocol is affected by different hello intervals by simulating a link and varied both the data load (using heavy tail traffic) and the hello interval (as a fraction of the link's bandwidth). Figure [effect of hello interval] shows the results. As seen from the figure, the hello interval does not play a significant role, i.e. the hello packets do not miss their deadlines, until the hello packets become a dominant bandwidth consumer of the link's capacity.([fig] Effect of hello interval and load on the Hello Protocol.) Since the protocol's ultimate limit on the hello interval is set by the bandwidth used by hello packets, extending the specification to allow sub-second intervals would allow sub-second detection on almost all links. This does not mean that the detection time can be made arbitrarily small, only that detection should be limited by the physical constraints of the link (its bandwidth, propagation delay, etc), not by arbitrary clock granularities set by the protocol designers. Detection: stability and damping For either event triggered (via the link layer notification) or hello driven detection, there are network-wide stability issues if routing tries to follow rapid link transients (i.e., a link that goes down and up several times a second). Making hello interval smaller may seem like allowing more instability into the network. The usual way of dealing with this is to treat "bad news" differently from the "good news" so routing is quick to find an alternate path on any failure but slow to switch back when the link comes up. The current ISIS specification treats bad news and good news the same way but it should be trivial to change the detection specification to allow different filtering constants for "down" and "up" state changes. 3 Propagation After a topological change is detected, a new link state packet (LSP) is generated at the point of detection and then flooded unmodified (except for the "remaining life" field) through the network. Flooding Alaettinoglu, et. al. Expires: May, 2000 [page 4 ] ^L INTERNET DRAFT draft-alaettinoglu-ISIS-convergence-00 November, 2000 should propagate the LSP across the network at near the speed of light plus one store-and-forward delay per hop. Thus, in theory the LSP propagation should make a negligible contribution to the re-route time. Unfortunately, this is not what we observed. In our experiment, we connected 3 routers in a line topology using switched ethernet as shown in Figure [LSP Experiment Setup]. The ISTalk established ISIS adjacencies with router 1 on its interface 0 and with router 3 on its interface 3. It also passively listened on its interfaces 1 and 2. Our experiments simulated networks of varying sizes. In the steady state, that is after three routers think they are part of a large network, we injected an LSP at interface 0 and observed the delay until it appeared at other interfaces of the workstation. Since the SPF calculation can take a significant amount of time (see Section [spf]), commercial router implementations impose a limit on how frequently the SPF calculation can be done using the SPF delay parameter. In some implementations this is 5 seconds fixed. In some implementations, this is changeable but the granularity is in seconds. This limit essentially adds to the propagation time. In this experiment, to remove its affect we set the SPF delay to zero seconds.([fig] LSP Propagation Experiment Setup.) Figure [LSP Propagation Results] shows the results for 500, 800 and 1000 node topologies (the average node degree is fixet at 5). The x-axis is the interface number and the y-axis is the time in seconds at which the LSP is seen. As seen from the figure, propagating an LSP across 3 routers takes close to one second and propagation time is related to the topology size. The reason for this delay is related to the topology size is that the routers are performing the SPF calculation before propagating the LSPs. ([fig] LSP Propagation Experiment Results.) In our experiment, we set SPF delay to zero seconds resulting in a race condition at the routers between performing an SPF calculation or propagating the LSP. The routers chose to do the SPF calculation first. This degrades the propagation time from near the speed of light to O(diameter\times SPFtime). To prevent this, the specification might be amended to explicitly state that the LSP flooding is higher priority than the SPF calculation. 4 Shortest Path Calculation The final step in re-route is to compute new routes using a shortest path tree algorithm, typically Dijkstra's Shortest Path First algorithm. For sparse topologies, a binary heap implementation of Dijkstra's algorithm has O(n\log n) complexity, and a naive implementation has O(n^{2}) complexity. Figures [vary nodes] shows the SPF calculation time in milliseconds versus number of nodes in the network. The Figure [vary nodes log] shows the same but the y-axis is in log scale. In this experiment each node had a fixed number of neighbors, 5.([fig] SPF calculation time versus number of nodes.) Alaettinoglu, et. al. Expires: May, 2000 [page 5 ] ^L INTERNET DRAFT draft-alaettinoglu-ISIS-convergence-00 November, 2000 ([fig] SPF calculation time versus number of nodes (log scale).) The top 3 curves are the commercial router vendor implementations (collected at the Ciscos using "show isis spf log" and using "show isis spf-log" in Junipers). The bottom curve is an in-house implementation of Dijkstra's shortest path algorithm. All the time values are wall-clock times, and there is no other load on the routers. The SPF calculation is indeed CPU intensive, taking 100s of milliseconds. As can be seen from the log-scale graph, the bottom 3 curves have the same shape. As a matter of fact an n\log n curve fits perfectly to them, unfortunately the top curve, one of the commercial router implementations fits an n^{2} curve perfectly.([fig] SPF calculation time versus number of edges per node.) In Figure [vary degree], we kept the number of nodes at 300 but varied the average number of edges per node. Today, some ISPs have full-mesh topologies (using ATM/MPLS technologies) and run ISIS over it. With full-mesh, the SPF calculation is even more intense, almost about one second. As illustrated by these experiments, even in high-end platforms, the SPF calculation can take a long time (seconds) and has poor scaling properties (n\log n to n^{2}). This has a serious impact both on re-route times since the SPF calculation is in series with the LSP propagation and on overall network stability since the CPU on the routers will be saturated. To fix this, an algorithm that scales much better than Dijkstra's is needed. Dijkstra's algorithm re-computes all routes after a topology change regardless of whether they were affected or not, each time starting from scratch. More recent algorithms[Franciosa97:dijk][Frigioni94:dijk][Narvaez99:dijk] store the data structures from earlier calculations and only re-compute the affected routes. Their average case complexity is O(\log n). We implemented one of these algorithms and results for 300 node topology with degree varied are shown in Figure [dynamic SPT]. Note that the y-axis, SPF time in milliseconds, is in log scale, and the new dynamic shortest path tree algorithm computes new routes in microseconds, that is 10,000 times faster than the commercial implementations of Dijkstra's algorithm. Note that with the dynamic shortest path tree algorithm, the calculation time first increases and then decreases. This is because as the average degree increases, the number of affected nodes by a topology change first increases and then decreases. In the case of a full-mesh topology, the number of affected nodes is often only one node.([fig] Dynamic SPT algorithm is 10,000 times faster.) 5 Summary of Findings and Reccomendations Stable, robust IP re-routing that works at the network's propagation time (the theoretical maximum for any re-routing scheme) is both possible and achievable. To get there, we have to make some minor changes to the ISIS specification in the following priority order: * switch to a modern algorithm for the SPF calculation Alaettinoglu, et. al. Expires: May, 2000 [page 6 ] ^L INTERNET DRAFT draft-alaettinoglu-ISIS-convergence-00 November, 2000 * make the granularity of the Hello Interval in millisecond, rather than seconds * allow different detection and damping filter constants for the link up and down events, i.e. differentiate good and bad news * give higher priority to LSP propagation than SPF computation * queue hello packets in front of data packets What we did not see: During our experiments, we injected topologies with 90,000 edges, thousands of nodes, drove all the routers to 100% CPU utilization, randomly unplugged links and powered down routers. Although we were looking for it, we saw no evidence of routing instability and we observed several subtle things done to avoid getting into an unstable operating regime. For example, even under 100% CPU utilization, the routers have never missed sending out an hello packet. If they did, they could have dropped adjacencies, causing more link-state changes, which in turn causing more routing load, and which in turn causing more hello packet being missed and so on. It appears that the two vendors we looked at have learned a lot from a decade's worth or routing disasters and meltdowns and are currently shipping robust routing code. What we did not look for: In full-mesh topologies, each newly generated LSP is received by a router n times, one from each of its neighboring routers. Even though the router only performs one SPF for this LSP, it needs to recognize that the other n-1 LSPs are duplicates. When a router goes down in a full-mesh topology, each router loses an adjacency and generates a new LSP. Hence, routers now need to recognize O(n^{2}) duplicate LSPs and perform up to O(n) SPF calculations (in practice LSPs received while performing an SPF calculation are bundled together for the next round of SPF calculation). We did not design experiments to study the time it takes to recognize O(n^{2}) duplicate LSPs since there are already proposals[mpls-unnum] addressing this issue. 6 Acknowledgments Authors are grateful to the Global Crossing Production Testing Lab, Cisco Systems, and Juniper Networks for allowing us to use their facilities to perform some of these tests. Authors would also like to thank Kathie Nichols, Judy Estrin, Chia-Chee Kuan, and Doug Klein for their comments on this document. 7 References 8 Authors' Addresses Cengiz Alaettinoglu Packet Design Alaettinoglu, et. al. Expires: May, 2000 [page 7 ] ^L INTERNET DRAFT draft-alaettinoglu-ISIS-convergence-00 November, 2000 66 Willow Place Menlo Park, CA 94025 USA email: cengiz@packetdesign.com Van Jacobson Packet Design 66 Willow Place Menlo Park, CA 94025 USA email: van@packetdesign.com Haobo Yu Packet Design 66 Willow Place Menlo Park, CA 94025 USA email: haoboy@packetdesign.com