Internet Engineering Task Force                           A. S. Maunder
Internet Draft                                             Cisco Systems
Expires: August, 2001                                       G. Choudhury
draft-ietf-ospf-scalability-00.txt                             AT&T Labs
                                                             March, 2001

   Explicit Marking and Prioritized Treatment of Specific IGP Packets
    for Faster IGP Convergence and Improved Network Scalability and
                              Stability

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.  Internet-Drafts are
   working documents of the Internet Engineering Task Force (IETF), its
   areas, and its working groups.  Note that other groups may also
   distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   www.ietf.org/ietf/1id-abstracts.txt.  The list of Internet-Draft
   Shadow Directories can be accessed at www.ietf.org/shadow.html.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

   There is strong interest in the networking community in fast failure
   detection followed by fast restoration and recovery.  Fast recovery
   can be provided by special mechanisms; however, there is also strong
   interest in addressing this issue at a more fundamental level,
   namely at IGP convergence, because that addresses the problem at a
   much broader scale.  Faster IGP convergence inevitably requires
   faster failure detection through smaller hello interval timers
   (unless one relies on link-level detection, which is not always
   possible), fast flooding, and more frequent SPF calculations.
   However, we provide analytic and simulation results* showing that
   this compromises the scalability and stability of the network,
   mainly because Hello packets received at a router are
   indistinguishable from other packets and may experience long
   queueing delays during a sudden burst of many LSA updates.  In this
   draft we argue that Hello and potentially some other IGP packets
   need to be marked explicitly so that efficient
   implementations can detect and act upon these messages in a
   prioritized fashion, allowing a significant reduction in IGP
   convergence time while maintaining network stability.

   The figures and graphs are missing from the ASCII version of this
   draft.  The PDF version of this draft, which includes them, can be
   found in the Internet-Drafts repository.

1  Motivation

   The motivation of this draft is to address two key issues:

   (1) Fast restoration under failure conditions
   (2) Increased network scalability and stability

   The motivation for allowing fast restoration under failure
   conditions is similar to the one provided in [1].  The theoretical
   limit for link-state routing protocols to re-route is on the scale
   of link propagation times, i.e., tens of milliseconds.  In practice,
   however, it takes seconds to tens of seconds to detect a link
   failure, disseminate this information through the network, and
   converge on the new set of paths.  This is an inordinately long
   transient period for mission-critical traffic destined to the
   unreachable nodes of the network.
   One component of the long re-route time is the link failure
   detection time of between 20 and 30 seconds, corresponding to three
   missed Hello packets with the typical hello interval of 10 seconds
   (between 30 and 40 seconds if the missed-hello threshold is 4).
   This component would be much shorter in the presence of link-level
   detection, but as pointed out in [1], link-level detection does not
   work in all cases.  For example, a device driver may detect a
   link-level failure but fail to notify the IGP.  Also, if a router
   fails behind a switch in a switched environment, then even though
   the switch gets the link-level notification it cannot communicate
   that to other routers.  Therefore, for faster reliable detection at
   the IGP level, one has to reduce the hello interval.  Reference [1]
   suggests that it be reduced to below a second, perhaps even to tens
   of milliseconds.  A second component of the long re-route time is
   delayed SPF (shortest-path-first) computation.  The typical delay
   value is 5 seconds, but it needs to be reduced significantly to
   achieve sub-second rerouting.

   The second issue we address is the ability of a network to withstand
   the simultaneous or near-simultaneous update of a large number of
   link-state advertisements, or LSAs.  We call this event an LSA
   storm.  An LSA storm may be generated for many reasons.  Here are
   some examples: (a) one or more link failures due to fiber cuts, (b)
   one or more node failures for some reason, e.g., a failed power
   supply in an office, (c) the need to take down and later bring back
   many nodes during a software/hardware upgrade, (d)
   near-synchronization of the once-in-30-minutes refresh instants of
   some types of LSAs, and (e) refresh of all LSAs in the system during
   a change in software version.  An LSA storm tends to drive the node
   CPU utilization to 100% for a period of time, and the duration of
   this period increases with the size of the storm and the node
   adjacency, i.e., the number of trunks connected to the node.  During
   this period the Hello packets received at the node see high delays,
   and if this delay exceeds three or four hello intervals (typically
   30 or 40 seconds) then the associated trunk is declared down.
   Depending on the implementation, there may be other impacts of a
   long CPU-busy period as well.  For example, in a reliable node
   architecture with an active and a standby processor, a processor
   switchover may result during an extended CPU-busy period, which may
   mean that all adjacencies are lost and need to be re-established.
   Both of the above events would cause more database synchronization
   with neighbors and network-wide LSA flooding, which in turn might
   cause extended CPU-busy periods at other nodes.  This may cause
   unstable behavior in the network for an extended period of time and
   potentially a meltdown in the extreme case.  Due to world-wide
   increases in traffic demand, data networks are ever increasing in
   size.  As the network size grows, a bigger LSA storm and a higher
   adjacency at certain nodes become more likely, and so the
   probability of unstable behavior increases.
   One way to address the scalability issue is to divide the network
   hierarchically into different areas so that flooding of LSAs remains
   localized within areas.  However, this approach increases network
   management and design complexity and results in less optimal routing
   between areas.  Also, area 0 may see the flooding of a large number
   of summary LSAs, and some newer protocols may not work well under a
   hierarchical system.  Thus it is important to allow the network to
   grow to as large a size as possible under a single area.  The
   undesirable impact of large LSA storms is understood in the
   networking community, and it is well known that large-scale flooding
   of control messages (either naturally or due to a bug) has been
   responsible for several network events in the past, causing a
   meltdown or a near-meltdown.  Recently, proposals have been
   submitted to avoid synchronization of LSA refreshes [2] and to
   reduce flooding overhead when more than one interface goes to the
   same neighbor [3][4].

   In this proposal we make the point that reducing hello intervals and
   computing SPF more frequently would in fact reduce network
   scalability and stability.  We use a simple and approximate but
   easy-to-understand analytic model for this purpose.  We also use a
   more involved simulation model.  We then make the point that many of
   the underlying causes of poor network scalability could be avoided
   if certain IGP messages could be specially marked and given
   prioritized treatment.

2  Analytic Model for the Delay Seen by a Received Hello Packet During
   an LSA Storm

   For every trunk interface, a node has to send and receive a Hello
   packet once every hello interval.  Sending of a Hello packet can be
   triggered by a timer, and it is possible to give higher priority to
   timer-driven jobs and thereby ensure that the outgoing Hello is not
   excessively delayed even during extended CPU-busy periods.  However,
   a received Hello packet cannot easily be distinguished from other
   IGP or IP packets and is therefore typically served in a
   first-come-first-served fashion.  We do a simple and approximate
   analysis of the delay experienced by this packet during an LSA storm
   at the node with the highest adjacency.  Let us assume:

   - S = Size of the LSA storm (i.e., the number of LSAs in it).  It is
     also assumed that each LSA is carried in one LSU packet.

   - L = Link adjacency of the node under consideration.  This is
     assumed to be the maximum in the network.

   - t1 = Time to send or receive one IGP packet over an interface.
     (The same time is assumed for Hello, LSA, duplicate LSA and LSA
     acknowledgement packets, even though in general there may be some
     differences.  This is a good approximation if the majority of the
     time goes into the act of receiving or sending and a relatively
     small part into packet-type-specific work.)  In the numerical
     examples we assume t1 = 1 ms.

   - t2 = Time to do one SPF calculation.  For a large network this
     time is usually in hundreds of ms; in the numerical examples we
     assume t2 = 200 ms.

   - Hi = Hello interval.

   - Si = Minimum interval between successive SPF calculations.

   - ro = Rate at which non-IGP work arrives at the node (e.g.,
     forwarding of data packets).
     For the numerical examples we assume ro = 0.2.

   - T = Total work brought to the node during the LSA storm.  For each
     LSA update generated elsewhere, the node will receive one new LSA
     packet over one interface, send an acknowledgement packet over
     that interface, and send copies of the LSA packet over the
     remaining L-1 interfaces.  Also, assuming that the implicit
     acknowledgement mechanism is in use, the node will subsequently
     receive either an acknowledgement or a duplicate LSA over each of
     the remaining L-1 interfaces.  So over each interface one packet
     is sent and one is received.  It can be seen that the same is true
     for self-generated LSAs.  So the total work per LSA update is
     2*L*t1.  Since there are S LSAs in the storm, we get

        T = 2*S*L*t1                                            (1)

     In Equation (1) we ignore retransmissions of LSAs in case
     acknowledgements are not received or processed within 5 seconds.
     This impact and other details are taken into account in the
     simulation model presented later.

   - T2 = Time period over which the work arrives.  Due to differences
     in propagation times and congestion at other nodes, it is possible
     for the work arrival time to be spread out over a long interval.
     However, since we are considering the node with the highest
     adjacency, i.e., the one with the highest congestion (assuming
     that all nodes have the same processing power and about the same
     non-IGP workload), most of the work will come in one chunk.  We
     verified using simulations that this is usually true.  One part of
     T2 is of the order of the link propagation delay, and we assume
     that there is a second part which is proportional to T.  Therefore
     we get

        T2 = A + B*T                                            (2)

     where A and B are constants.  For the numerical examples we assume
     A = 10 ms and B = 0.1.

   - D = Maximum delay experienced by a Hello packet during the LSA
     storm.  We assume first-come-first-served service, and hence the
     delay seen by the Hello packet is the total outstanding work at
     the node at the arrival instant plus its own processing time.  We
     assume that the outstanding work steadily increases over the
     interval T2, and so the maximum delay is seen by a Hello packet
     that arrives near the end of this interval.  We write down an
     approximate expression for D and then explain the various terms on
     the right-hand side:

        D = T - T2 + max(1,2*T2/Hi)*t1 + max(1,T2/Si)*t2 + ro*T2    (3)

     The first term is the total work brought in by the LSA storm.  The
     second term is the work the node was able to finish, since we
     assume it was continuously busy during the period T2.  The third
     term is the total work due to the sending and receiving of Hello
     packets during the period T2; note that at least one Hello packet,
     namely the one under study, is assumed to be processed.  The
     fourth term is due to SPF processing during the period T2, and we
     assume that at least one SPF computation is done.  The last term
     is the total non-IGP work arriving at the node over the interval
     T2.

   - Dmax = Maximum allowed value of D, i.e., if D exceeds this value
     then the associated link is declared down.  In the numerical
     examples below we assume

        Dmax = 3*Hi                                             (4)

     If we assume that the previous Hello packet was minimally delayed,
     then exceeding Dmax really means four missed hellos, since the
     Hello packet under study itself arrived a period Hi after the
     previous one.
   In the numerical examples below, both D and Dmax change with the
   choice of system parameters, and we are mainly interested in
   identifying whether D exceeds Dmax.  For this purpose we define the
   ratio

      Delay Ratio = D/Dmax                                      (5)

   and identify whether the Delay Ratio exceeds 1.

   In Figures 1-3 we plot the Delay Ratio as a function of LSA storm
   size with node adjacencies of 10, 20 and 50, respectively.  All
   parameters except for the ones noted explicitly on the figures are
   as stated earlier.  Figure 1 assumes Hello packets every 10 seconds
   and an SPF calculation every 5 seconds, which are typical default
   values today.  With a node adjacency of 10, the Delay Ratio stays
   below 1 even with an LSA storm of size 1000.  However, with a node
   adjacency of 20, the Delay Ratio exceeds 1 at around a storm of size
   800, and with a node adjacency of 50, the Delay Ratio exceeds 1 at
   around a storm of size 325.

   Figure 1: Delay Ratio with Hello Every 10 Seconds, SPF Every 5
   Seconds, Dmax = 30 Seconds

   In a large network it is not unusual to have LSA storms of size
   several hundred, since the LSA database size may be several
   thousand.  This is particularly true if there are many type 5 LSAs,
   or if there are special LSAs carrying information about available
   bandwidth on trunks, as is common in ATM networks and might be used
   in MPLS-based networks as well.

   Figure 2 decreases the hello interval to 2 seconds, with an SPF
   calculation done once a second.  The LSA storm thresholds are
   significantly reduced.  Specifically, with a node adjacency of 10,
   the Delay Ratio exceeds 1 at around a storm of size 310; with a node
   adjacency of 20, at around a storm of size 160; and with a node
   adjacency of 50, at around a storm of size only 65.

   Figure 2: Delay Ratio with Hello Every 2 Seconds, SPF Every 1
   Second, Dmax = 6 Seconds

   Figure 3 decreases the hello interval even further to 300 ms, with
   an SPF calculation done once every 500 ms.  The LSA storm thresholds
   are now very small.  Specifically, with a node adjacency of 10, the
   Delay Ratio exceeds 1 at around a storm of size 40; with a node
   adjacency of 20, at around a storm of size 20; and with a node
   adjacency of 50, the Delay Ratio is already over 1 even with a storm
   of size 10.

   Figure 3: Delay Ratio with Hello Every 300 ms, SPF Every 500 ms,
   Dmax = 900 ms

   Whenever the Delay Ratio exceeds 1, the associated link is declared
   down even though it is actually up, and eventually other undesirable
   events start (e.g., trunk flapping and cascading of extended CPU
   overload periods to other nodes).  Therefore, the LSA storm
   threshold at which the Delay Ratio exceeds 1 may also roughly be
   considered the network stability threshold.  Figures 1-3 show that
   the stability threshold decreases rapidly as the hello interval and
   the SPF computation interval decrease.  One reason for this is the
   increased CPU work due to more frequent hello and SPF computations,
   but the dominant reason is that Dmax itself decreases, so a smaller
   CPU-busy interval is enough to exceed it.  Specifically, Dmax is 30
   seconds in Figure 1, 6 seconds in Figure 2, and only 900 ms in
   Figure 3.
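   For readers who wish to experiment with the model, the following
   short Python sketch evaluates Equations (1)-(5) under the parameter
   assumptions stated above (t1 = 1 ms, t2 = 200 ms, ro = 0.2, A = 10
   ms, B = 0.1, Dmax = 3*Hi) and searches for the smallest storm size
   at which the Delay Ratio exceeds 1.  The function and variable names
   are ours and purely illustrative; the sketch is not part of any
   protocol specification.

      # Sketch of the analytic model in Equations (1)-(5).
      # All times are in seconds; parameter values are the
      # illustrative ones assumed in the text.

      def delay_ratio(S, L, Hi, Si, t1=0.001, t2=0.2, ro=0.2,
                      A=0.010, B=0.1, missed_hellos=3):
          """Return D/Dmax for a storm of S LSAs at a node of adjacency L."""
          T = 2 * S * L * t1                    # Eq. (1): work brought by the storm
          T2 = A + B * T                        # Eq. (2): period over which it arrives
          D = (T - T2                           # Eq. (3): backlog left after T2 ...
               + max(1, 2 * T2 / Hi) * t1       # ... plus hello send/receive work
               + max(1, T2 / Si) * t2           # ... plus SPF work
               + ro * T2)                       # ... plus non-IGP work
          Dmax = missed_hellos * Hi             # Eq. (4)
          return D / Dmax                       # Eq. (5)

      def stability_threshold(L, Hi, Si, max_storm=2000):
          """Smallest storm size at which the Delay Ratio exceeds 1, if any."""
          for S in range(1, max_storm + 1):
              if delay_ratio(S, L, Hi, Si) > 1:
                  return S
          return None

      if __name__ == "__main__":
          # Roughly reproduces the trends of Figures 1-3.
          for Hi, Si in [(10.0, 5.0), (2.0, 1.0), (0.3, 0.5)]:
              for L in (10, 20, 50):
                  print(Hi, Si, L, stability_threshold(L, Hi, Si))

   Running the sketch yields stability thresholds close to the values
   quoted above for the three scenarios of Figures 1-3.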
   It is clear from the above examples that, in order to maintain
   network stability as the hello interval decreases, it is necessary
   to provide faster, prioritized treatment to received Hello packets,
   which of course can only be done if those packets can be
   distinguished from other IGP or IP packets.

3  Simulation Study

   We have also developed a simulation model to capture more accurately
   the impact of an LSA storm on all the nodes of the network.  It
   captures the actual congestion seen at the various nodes, the
   propagation delay between nodes, and retransmissions in case an LSA
   is not acknowledged.  It also approximates a real network
   implementation and uses processing times that are roughly of the
   same order of magnitude as measured in a real network (on the order
   of milliseconds).  There are two categories of IGP messages.
   Category 1 messages are triggered by a timer and include Hello
   refresh, LSA refresh and retransmission packets.  Category 2
   messages are not triggered by a timer and include received Hellos,
   received LSAs and received acknowledgements.  Timer-triggered
   messages are given non-preemptive priority over the other type.  A
   beneficial effect of this strategy is that Hello packets are sent
   out with little delay even under intense CPU overload.  However, the
   received Hello packets and the received acknowledgement packets may
   see long queueing delays under intense CPU overload.  Figures 4 and
   5 below show sample results of the simulation study when applied to
   a network with about 300 nodes and 800 trunks.  The hello interval
   is assumed to be 5 seconds, the minimum interval between successive
   SPF calculations is 1 second, and a trunk is declared down if no
   Hello packet is received for three successive hello intervals, i.e.,
   15 seconds.  During the study, an LSA storm of size 300 or 600
   (Figures 4 and 5, respectively) is created at time 100 seconds.
   Three LSAs are packed into one LSU packet, and it is assumed that
   they remain packed the same way during the flooding process.
   Besides the storm, there are also the normal once-in-thirty-minutes
   LSA refreshes, and those LSAs are packed one per LSU packet.  We
   define a quantity "dispersion", which is the number of LSU packets
   generated in the network but not yet received and processed at at
   least one node.  Figures 4 and 5 plot dispersion as a function of
   time.  Before the LSA storm, the dispersion due to normal LSA
   refreshes remains small.  As expected, right after the storm the
   dispersion jumps and then comes down again to the pre-storm level
   after some period of time.  In Figure 4, with an LSA storm of size
   300, the "heavy dispersion period" lasted about 11 seconds and no
   trunk losses were observed.  In Figure 5, with an LSA storm of size
   600, the "heavy dispersion period" lasted about 40 seconds.  Some
   trunk losses were observed a little after 15 seconds into the "heavy
   dispersion period", but eventually all trunks recovered and the
   dispersion came down to the pre-storm level.
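   To make the definition of dispersion concrete, the short Python
   sketch below shows one way the metric could be computed at a given
   instant in a simulation, given the set of LSU packet identifiers
   generated so far and, for each node, the set it has already received
   and processed.  The data structures and names are ours for
   illustration; the actual simulator may track this differently.

      # "Dispersion" = number of LSU packets generated somewhere in the
      # network but not yet received and processed by at least one node.

      def dispersion(generated, processed_by):
          """generated    : set of identifiers of all LSUs generated so far
             processed_by : dict mapping node name -> set of LSU ids it has
                            received and processed"""
          outstanding = set()
          for node, done in processed_by.items():
              outstanding |= generated - done   # LSUs still missing at this node
          return len(outstanding)

      # Example: LSUs 1-4 generated; node B still lacks 4, node C lacks 3 and 4.
      print(dispersion({1, 2, 3, 4},
                       {"A": {1, 2, 3, 4},
                        "B": {1, 2, 3},
                        "C": {1, 2}}))          # prints 2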
   Figure 4: Dispersion Versus Time (LSA Storm Size = 300)

   Figure 5: Dispersion Versus Time (LSA Storm Size = 600)

4  Need for Special Marking and Prioritized Treatment of Specific IGP
   Packets

   The analytic and simulation models show that a major cause of
   unstable behavior in networks is Hello packets received at a node
   getting queued behind other work brought to the node during an LSA
   storm and missing the deadline of typically three or four hello
   intervals.  This need not happen to outgoing Hello packets, which
   are triggered by a timer, since the node CPU can give them
   prioritized treatment.  Clearly, if received Hello packets can be
   specially marked to distinguish them from other IGP and IP packets,
   then they too can be given prioritized treatment, and they would not
   miss the deadline even during a large LSA storm.  Some specific
   field of the IP packet may be used for this purpose.  Besides Hello
   packets there may be other IGP packets that could also benefit from
   special marking and prioritized treatment.  We give two examples,
   but clearly others are possible.

   - One example is the LSA acknowledgement packet.  This packet
     disables retransmission, and if a large queueing delay of this
     packet causes the retransmission timer (typical default value 5
     seconds) to expire, then a needless retransmission happens,
     causing extra traffic load.  Retransmission events are usually
     rare due to the reliable nature of transmission links, but during
     the 600-LSA storm simulation in Figure 5 many retransmission
     events were noted.  Usually, retransmission events happen more
     often with a longer CPU-busy period.  Clearly, special marking and
     prioritization of the LSA acknowledgement packet would eliminate
     many needless retransmissions.

   - A second example is an LSA carrying bad news, i.e., the failure of
     a trunk or a node.  It is preferable to transmit this information
     faster than other LSAs in the network that either carry good news
     or are just once-in-30-minutes refreshes.  The explicit
     identification can also be used to trigger the SPF calculation
     right after processing LSAs carrying bad news.  This obviates the
     need to lower the SPF calculation interval under all circumstances
     and thus reduces the processing overhead.

   The examples in this draft focus explicitly on the control domain.
   However, it can easily be seen that having an explicit
   identification for certain "chosen" packets will also help minimize
   their drop probability in the traffic plane.  The explicit
   identification allows these control packets to be easily
   distinguished from data packets in the line card, and hence their
   processing (forwarding) can be expedited even under heavy traffic
   conditions.
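   As an illustration only (this draft deliberately leaves the exact
   encoding open), the Python sketch below shows how a receiving
   implementation might use such a marking, here assumed to be carried
   in the IP precedence bits, to steer received Hellos, LSA
   acknowledgements and bad-news LSAs into a high-priority queue ahead
   of other IGP and IP packets.  The field choice, the threshold value
   and the two-queue structure are hypothetical.

      # Illustrative sketch of marking-based prioritization of received
      # IGP packets.  The marking field and the constant below are
      # assumptions, not a protocol definition.

      from collections import deque

      URGENT_PRECEDENCE = 7        # hypothetical marking for "urgent" IGP packets

      class IgpReceiver:
          def __init__(self):
              self.urgent = deque()   # marked: Hellos, LSA acks, bad-news LSAs
              self.normal = deque()   # everything else (bulk flooding, data, ...)

          def enqueue(self, pkt):
              # 'pkt' is assumed to expose the IP precedence of the packet.
              if pkt.get("precedence", 0) >= URGENT_PRECEDENCE:
                  self.urgent.append(pkt)
              else:
                  self.normal.append(pkt)

          def next_packet(self):
              # Marked packets are always served first, so a received Hello
              # is not delayed behind a backlog of LSAs during a storm.
              if self.urgent:
                  return self.urgent.popleft()
              if self.normal:
                  return self.normal.popleft()
              return None

      # Usage: during an LSA storm the Hello is still served first.
      rx = IgpReceiver()
      rx.enqueue({"type": "LSU", "precedence": 0})
      rx.enqueue({"type": "Hello", "precedence": 7})
      print(rx.next_packet()["type"])            # prints Hello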
5  Summary

   In this proposal we point out that if a large LSA storm is generated
   as a result of some type of failure/recovery of nodes or trunks, or
   of synchronization among refreshes, then the Hello packets received
   at a node may see large queueing delays and miss the deadline of
   typically three or four hello intervals.  This causes the trunk to
   be declared down and is potentially the beginning of unstable
   behavior in the network.  This is already a concern in today's
   networks, but it would be a much bigger concern if the hello
   interval and the minimum interval between SPF calculations were
   substantially reduced (below or perhaps well below a second) in
   order to allow faster rerouting, as proposed in [1].  To avoid the
   above, we propose the use of a special marking for Hello packets
   (perhaps using a special field of the IP packet) so that they may be
   distinguished from other IGP and IP packets and given prioritized
   treatment during the intense CPU overload periods caused by LSA
   storms.  We also point out that other IGP packets could benefit from
   special marking as well.  Two examples are LSA acknowledgement
   packets and LSA packets carrying bad news.

6  Acknowledgments

   The authors would like to thank members of the High-Speed Packet
   Switching division of AT&T for their help during the study.

7  References

   [1] draft-alaettinoglu-isis-convergence-00.txt, work in progress,
       November 2000.

   [2] draft-ietf-ospf-refresh-guide-01.txt, work in progress, July
       2000.

   [3] draft-ietf-ospf-isis-flood-opt-00.txt, work in progress, October
       2000.

   [4] draft-ietf-ospf-ppp-flood-00.txt, work in progress, November
       2000.

8  Authors' Addresses

   Anurag S. Maunder
   Cisco Systems
   Email: amaunder@cisco.com

   Gagan Choudhury
   AT&T Labs, Middletown, NJ, USA
   Email: gchoudhury@att.com

   * The study was done when Anurag S. Maunder was a Senior Member of
     Technical Staff at AT&T.