idnits 2.17.1 

draft-alaettinoglu-isis-convergence-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 435 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Introduction section.

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 7 instances of too long lines in the document, the longest one
     being 4 characters in excess of 72.

  ** There are 5 instances of lines with control characters in the document.

  ** The abstract seems to contain references ([ISIS-IP], [ISIS-ISO]), which
     it shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'ISIS-ISO' on line 37 looks like a reference

  -- Missing reference section? 'ISIS-IP' on line 37 looks like a reference


     Summary: 10 errors (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force        		      C. Alaettinoglu
2	Internet Draft                         		      V. Jacobson
3	Expires in May, 2001                   		      H. Yu
4	draft-alaettinoglu-isis-convergence-00.txt  	      Packet Design
5	                                       		      November, 2000

7	Towards Milli-Second IGP Convergence

9	<draft-alaettinoglu-isis-convergence-00>

11	Status of this Memo

13	This document is an Internet-Draft and is in full conformance with
14	all provisions of Section 10 of RFC2026. Internet-Drafts are working
15	documents of the Internet Engineering Task Force (IETF), its areas,
16	and its working groups. Note that other groups may also distribute
17	working documents as Internet-Drafts.

19	Internet-Drafts are draft documents valid for a maximum of six months
20	and may be updated, replaced, or obsoleted by other documents at
21	any time. It is inappropriate to use Internet-Drafts as reference
22	material or to cite them other than as "work in progress."

24	The list of current Internet-Drafts can be accessed at
25	www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow
26	Directories can be accessed at www.ietf.org/shadow.html. Distribution of
27	this memo is unlimited.

29	Copyright Notice

31	Copyright (C) The Internet Society (2000). All Rights Reserved.

33	Abstract

35	It is possible for link-state routing protocols to converge in link
36	propagation time scales, that is, in tens of milliseconds. However,
37	current deployments of ISIS[ISIS-ISO][ISIS-IP], a link-state routing
38	protocol, are not anywhere near this point. In this paper, we present
39	some analyses of ISIS convergence by showing its behavior upon
40	link/router failures and repairs, and its scaling properties to
41	large networks, both in terms of number of nodes and links. We
42	then explore changes needed in the ISIS specification and implementations
43	to reach IGP convergence in milliseconds. Our results are based
44	on experimentation done with ISIS, but some of the findings may
45	apply to OSPF as well.

47	This paper contains color figures and graphs which are missing from
48	the ASCII version of the paper. The postscript and pdf versions
49	of this paper can be found in the Internet-Drafts repository.

51	1 Motivation

53	The theoretical limit for link-state routing protocols to re-route

55	Alaettinoglu, et. al.         Expires: May, 2000             [page  1 ]
56	^L
57	is in link propagation time scales, that is in tens of milliseconds.
58	However, the current ISIS re-route times are in tens of seconds.
59	During the re-route period, a subset of destinations are either
60	not-reachable or are reached through non-optimal routes.

62	It is important to close this gap between theory and implementation
63	for several reasons. First, millisecond re-route times will lead
64	to increased network reliability since the periods at which routes
65	are not available/non-optimal will be shorter (one to three orders
66	of magnitude shorter). Second, it enables multi-service traffic
67	such as voice over IP since it decreases the number of packets
68	dropped to tolerable levels for these services. And finally, it
69	eliminates the need for more expensive and complex layer 2 protections
70	schemes, such as SONET.

72	There are three steps to ISIS re-routing: detection, propagation
73	and shortest path calculation. In the detection step, the topology
74	change is locally detected. That is, a link failure/repair is detected
75	by the routers the link is adjacent to, and a router failure/repair
76	is detected by the router's neighbors. In the propagation step,
77	a new link-state packet (LSP) reflecting the local topology after
78	the change is flooded to all the other routers in the network.
79	Finally, in the shortest path calculation step, each router having
80	received the new LSP computes the new routes using a shortest path
81	tree algorithm. Typically, Dijkstra's Shortest Path First (SPF)
82	algorithm is used in this calculation. Only after these three steps,
83	the re-route has completed, or equivalently we say that ISIS has
84	converged.

86	It is prudent to understand where the tens of seconds of re-route
87	times go before trying to fix ISIS re-routing. Hence, we have done
88	extensive experiments to characterize the time spent at each of
89	these steps. In our experiments, we tested ISIS running on Cisco
90	7200s running IOS 12.0S9 and IOS 12.1P and on Juniper M40s running
91	JunOS 4.1. In Sections [detection], [propagation], and [spf], we
92	present our findings on the detection, propagation, and shortest
93	path calculation steps respectively. In Section [summary], we summarize
94	our findings and suggest changes to ISIS protocol specification
95	and on its implementations.

97	 ISTalk Software

99	In many of our experiments, we emulated very large random networks.
100	To facilitate this, we developed a software tool called ISTalk.
101	ISTalk can establish adjacencies with ISIS routers and inject them
102	LSPs emulating any topology allowing us to vary both the number
103	of nodes and the number of edges in the network. This is graphically
104	illustrated in Figure [istalk].([fig] ISTalk Software emulates
105	virtual topologies.<istalk>)

107	2 Detection<detection>

109	The topology change, or the link-state change, is either detected

111	Alaettinoglu, et. al.         Expires: May, 2000             [page  2 ]
112	^L
113	and communicated to ISIS by the lower level protocols (i.e. link
114	level protocols), or it is detected using the peer to peer ISIS
115	hello protocol. The link level detection is fastest but seems to
116	be inconsistently implemented by vendors. In our experiments, with
117	certain medium we noticed that the device driver detected the failure
118	but did not communicate it to the router's OS. The link level detection
119	is also not always possible, for example in a switched environment
120	where a router fails behind a switch. In this case, the switch
121	gets the link layer notification, but has no mechanism to notify
122	other routers in the network.

124	The ISIS hello protocol is more robust, but slower. In this case,
125	the adjacent routers send periodic hello packets to each other
126	(the adjacency is also established through the use of hello packets,
127	but not described here). If a router misses a fixed number of hello
128	packets from an adjacent router, it declares the adjacency to that
129	router down. By default, the routers send each other hello packets
130	every 10 seconds, and declare the adjacency down after missing
131	3 hello packets. Hence, with default parameters, the detection
132	can take up to 30 seconds. The specification limits the granularity
133	of hello interval to seconds, hence making 3 seconds the fastest
134	possible detection based on the hello protocol. Several ISPs use
135	2 second hello intervals, but declare an adjacency down after missing
136	5 hello packets, resulting in a detection time of 10 seconds.([fig] Hello
137	Experiment Setup<hello-setup>)

139	To study the effect of hello interval, we configured 3 routers in
140	a triangle topology and connected a workstation running ISTalk
141	to router 1 as shown in Figure [hello-setup]. The routers are connected
142	to each other using switched ethernet interfaces. Hence, when we
143	take down one of the routers, the other two have to rely on the
144	hello protocol to detect the change. The ISTalk establishes an
145	ISIS adjacency with router 1 so that it receives the LSPs flooded
146	in the network. The workstation pings the router 3 using router
147	3's interface address towards router 2 so that the pings follow
148	the path shown on the left hand side of Figure [hello-setup]. We
149	then bring down router 2 and after re-route the pings follow the
150	path shown on the right hand side of Figure [hello-setup]. The
151	pings are sent once every 100 milliseconds. ([fig] Packets received
152	by the workstation during the Hello Experiment<hello-received-packets>)

154	Figure [hello-received-packets] shows the relevant packets received
155	by the workstation. The short spikes are the ICMP ECHO REPLY packets
156	received; they are only received if there is a valid route to the
157	ping's destination and back. As seen in the figure, there is a
158	33 second period where there are no ICMP ECHO REPLY packets received.
159	This is the re-route time, during which router 1 does not have
160	a route to the destination. The two tall spikes are LSPs reflecting
161	the topology change as detected by routers 1 and 3. Note that the
162	two LSPs are separated by 1 second. This is because the two routers
163	detect the change at different times due to their 3 hello packet
164	intervals ending at different times. This separation can be anywhere
165	from 0 second to one full hello interval, in this case 10 seconds.

167	Alaettinoglu, et. al.         Expires: May, 2000             [page  3 ]
168	^L
169	Why then after receipt of LSPs there is still 5 seconds of re-route
170	time? This is because the routers delay the SPF calculation by
171	a number of seconds (default is 5 seconds) in the hope that they
172	may catch more LSPs, particularly this second LSP in the figure,
173	and do one SPF calculation instead of many (because the costly
174	SPF calculation overwhelmed routers in the past). We refer to this
175	delay as the SPF delay. For millisecond convergence, the SPF delay
176	should be in milliseconds as well, and should ideally be zero.

178	The argument against a small hello interval is the bandwidth consumed
179	by the hello packets and the hello packets getting queued behind
180	data packets and the routers not receiving them in time. The most
181	desirable solution to the latter problem is to treat hello packets
182	preferentially over the data packets. In case this solution is
183	not available, we studied how the hello protocol is affected by
184	different hello intervals by simulating a link and varied both
185	the data load (using heavy tail traffic) and the hello interval
186	(as a fraction of the link's bandwidth). Figure [effect of hello interval]
187	shows the results. As seen from the figure, the hello interval
188	does not play a significant role, i.e. the hello packets do not
189	miss their deadlines, until the hello packets become a dominant
190	bandwidth consumer of the link's capacity.([fig] Effect of hello
191	interval and load on the Hello Protocol.<effect of hello interval>)

193	Since the protocol's ultimate limit on the hello interval is set
194	by the bandwidth used by hello packets, extending the specification
195	to allow sub-second intervals would allow sub-second detection
196	on almost all links. This does not mean that the detection time
197	can be made arbitrarily small, only that detection should be limited
198	by the physical constraints of the link (its bandwidth, propagation
199	delay, etc), not by arbitrary clock granularities set by the protocol
200	designers.

202	 Detection: stability and damping

204	For either event triggered (via the link layer notification) or
205	hello driven detection, there are network-wide stability issues
206	if routing tries to follow rapid link transients (i.e., a link
207	that goes down and up several times a second). Making hello interval
208	smaller may seem like allowing more instability into the network.
209	The usual way of dealing with this is to treat "bad news" differently
210	from the "good news" so routing is quick to find an alternate path
211	on any failure but slow to switch back when the link comes up.
212	The current ISIS specification treats bad news and good news the
213	same way but it should be trivial to change the detection specification
214	to allow different filtering constants for "down" and "up" state
215	changes.

217	3 Propagation<propagation>

219	After a topological change is detected, a new link state packet
220	(LSP) is generated at the point of detection and then flooded unmodified
221	(except for the "remaining life" field) through the network. Flooding

223	Alaettinoglu, et. al.         Expires: May, 2000             [page  4 ]
224	^L
225	should propagate the LSP across the network at near the speed of
226	light plus one store-and-forward delay per hop. Thus, in theory
227	the LSP propagation should make a negligible contribution to the
228	re-route time. Unfortunately, this is not what we observed.

230	In our experiment, we connected 3 routers in a line topology using
231	switched ethernet as shown in Figure [LSP Experiment Setup]. The
232	ISTalk established ISIS adjacencies with router 1 on its interface
233	0 and with router 3 on its interface 3. It also passively listened
234	on its interfaces 1 and 2. Our experiments simulated networks of
235	varying sizes. In the steady state, that is after three routers
236	think they are part of a large network, we injected an LSP at interface
237	0 and observed the delay until it appeared at other interfaces
238	of the workstation.

240	Since the SPF calculation can take a significant amount of time
241	(see Section [spf]), commercial router implementations impose a
242	limit on how frequently the SPF calculation can be done using the
243	SPF delay parameter. In some implementations this is 5 seconds
244	fixed. In some implementations, this is changeable but the granularity
245	is in seconds. This limit essentially adds to the propagation time.
246	In this experiment, to remove its affect we set the SPF delay to
247	zero seconds.([fig] LSP Propagation Experiment Setup.<LSP Experiment Setup>)

249	Figure [LSP Propagation Results] shows the results for 500, 800
250	and 1000 node topologies (the average node degree is fixet at 5).
251	The x-axis is the interface number and the y-axis is the time in
252	seconds at which the LSP is seen. As seen from the figure, propagating
253	an LSP across 3 routers takes close to one second and propagation
254	time is related to the topology size. The reason for this delay
255	is related to the topology size is that the routers are performing
256	the SPF calculation before propagating the LSPs. ([fig] LSP Propagation
257	Experiment Results.<LSP Propagation Results>)

259	In our experiment, we set SPF delay to zero seconds resulting in
260	a race condition at the routers between performing an SPF calculation
261	or propagating the LSP. The routers chose to do the SPF calculation
262	first. This degrades the propagation time from near the speed of
263	light to O(diameter\times SPFtime). To prevent this, the specification
264	might be amended to explicitly state that the LSP flooding is higher
265	priority than the SPF calculation.

267	4 Shortest Path Calculation<spf>

269	The final step in re-route is to compute new routes using a shortest
270	path tree algorithm, typically Dijkstra's Shortest Path First algorithm.
271	For sparse topologies, a binary heap implementation of Dijkstra's
272	algorithm has O(n\log n) complexity, and a naive implementation
273	has O(n^{2}) complexity. Figures [vary nodes] shows the SPF calculation
274	time in milliseconds versus number of nodes in the network. The
275	Figure [vary nodes log] shows the same but the y-axis is in log
276	scale. In this experiment each node had a fixed number of neighbors,
277	5.([fig] SPF calculation time versus number of nodes.<vary nodes>)

279	Alaettinoglu, et. al.         Expires: May, 2000             [page  5 ]
280	^L
281	([fig] SPF
282	calculation time versus number of nodes (log scale).<vary nodes log>) The
283	top 3 curves are the commercial router vendor implementations (collected
284	at the Ciscos using "show isis spf log" and using "show isis spf-log"
285	in Junipers). The bottom curve is an in-house implementation of
286	Dijkstra's shortest path algorithm. All the time values are wall-clock
287	times, and there is no other load on the routers. The SPF calculation
288	is indeed CPU intensive, taking 100s of milliseconds. As can be
289	seen from the log-scale graph, the bottom 3 curves have the same
290	shape. As a matter of fact an n\log n curve fits perfectly to them,
291	unfortunately the top curve, one of the commercial router implementations
292	fits an n^{2} curve perfectly.([fig] SPF calculation time versus
293	number of edges per node.<vary degree>) In Figure [vary degree],
294	we kept the number of nodes at 300 but varied the average number
295	of edges per node. Today, some ISPs have full-mesh topologies (using
296	ATM/MPLS technologies) and run ISIS over it. With full-mesh, the
297	SPF calculation is even more intense, almost about one second.

299	As illustrated by these experiments, even in high-end platforms,
300	the SPF calculation can take a long time (seconds) and has poor
301	scaling properties (n\log n to n^{2}). This has a serious impact
302	both on re-route times since the SPF calculation is in series with
303	the LSP propagation and on overall network stability since the
304	CPU on the routers will be saturated. To fix this, an algorithm
305	that scales much better than Dijkstra's is needed. Dijkstra's algorithm
306	re-computes all routes after a topology change regardless of whether
307	they were affected or not, each time starting from scratch. More
308	recent algorithms[Franciosa97:dijk][Frigioni94:dijk][Narvaez99:dijk]
309	store the data structures from earlier calculations and only re-compute
310	the affected routes. Their average case complexity is O(\log n).

312	We implemented one of these algorithms and results for 300 node
313	topology with degree varied are shown in Figure [dynamic SPT].
314	Note that the y-axis, SPF time in milliseconds, is in log scale,
315	and the new dynamic shortest path tree algorithm computes new routes
316	in microseconds, that is 10,000 times faster than the commercial
317	implementations of Dijkstra's algorithm. Note that with the dynamic
318	shortest path tree algorithm, the calculation time first increases
319	and then decreases. This is because as the average degree increases,
320	the number of affected nodes by a topology change first increases
321	and then decreases. In the case of a full-mesh topology, the number
322	of affected nodes is often only one node.([fig] Dynamic SPT algorithm
323	is 10,000 times faster.<dynamic SPT>)

325	5 Summary of Findings and Reccomendations<summary>

327	Stable, robust IP re-routing that works at the network's propagation
328	time (the theoretical maximum for any re-routing scheme) is both
329	possible and achievable. To get there, we have to make some minor
330	changes to the ISIS specification in the following priority order:

332	* switch to a modern algorithm for the SPF calculation

334	Alaettinoglu, et. al.         Expires: May, 2000             [page  6 ]
335	^L
336	* make the granularity of the Hello Interval in millisecond, rather
337	  than seconds

339	* allow different detection and damping filter constants for the
340	  link up and down events, i.e. differentiate good and bad news

342	* give higher priority to LSP propagation than SPF computation

344	* queue hello packets in front of data packets

346	What we did not see: During our experiments, we injected topologies
347	with 90,000 edges, thousands of nodes, drove all the routers to
348	100% CPU utilization, randomly unplugged links and powered down
349	routers. Although we were looking for it, we saw no evidence of
350	routing instability and we observed several subtle things done
351	to avoid getting into an unstable operating regime. For example,
352	even under 100% CPU utilization, the routers have never missed
353	sending out an hello packet. If they did, they could have dropped
354	adjacencies, causing more link-state changes, which in turn causing
355	more routing load, and which in turn causing more hello packet
356	being missed and so on. It appears that the two vendors we looked
357	at have learned a lot from a decade's worth or routing disasters
358	and meltdowns and are currently shipping robust routing code.

360	What we did not look for: In full-mesh topologies, each newly generated
361	LSP is received by a router n times, one from each of its neighboring
362	routers. Even though the router only performs one SPF for this
363	LSP, it needs to recognize that the other n-1 LSPs are duplicates.
364	When a router goes down in a full-mesh topology, each router loses
365	an adjacency and generates a new LSP. Hence, routers now need to
366	recognize O(n^{2}) duplicate LSPs and perform up to O(n) SPF calculations
367	(in practice LSPs received while performing an SPF calculation
368	are bundled together for the next round of SPF calculation). We
369	did not design experiments to study the time it takes to recognize
370	O(n^{2}) duplicate LSPs since there are already proposals[mpls-unnum]
371	addressing this issue.

373	6 Acknowledgments

375	Authors are grateful to the Global Crossing Production Testing Lab,
376	Cisco Systems, and Juniper Networks for allowing us to use their
377	facilities to perform some of these tests. Authors would also like
378	to thank Kathie Nichols, Judy Estrin, Chia-Chee Kuan, and Doug
379	Klein for their comments on this document.

381	7 References

383	8 Authors' Addresses

385	Cengiz Alaettinoglu
386	Packet Design

388	Alaettinoglu, et. al.         Expires: May, 2000             [page  7 ]
389	^L
390	66 Willow Place
391	Menlo Park, CA 94025
392	USA
393	email: cengiz@packetdesign.com

395	Van Jacobson
396	Packet Design
397	66 Willow Place
398	Menlo Park, CA 94025
399	USA
400	email: van@packetdesign.com

402	Haobo Yu
403	Packet Design
404	66 Willow Place
405	Menlo Park, CA 94025
406	USA
407	email: haoboy@packetdesign.com