idnits 2.17.1 draft-li-bgp-stability-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 1020. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1031. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1038. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1044. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 13, 2007) is 6161 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'AggrWithdrawl' is mentioned on line 940, but not defined == Missing Reference: 'TON-1998' is mentioned on line 985, but not defined == Missing Reference: 'Infocom-1999' is mentioned on line 951, but not defined == Missing Reference: 'FTCS-1999' is mentioned on line 946, but not defined == Missing Reference: 'Sigcomm-2000' is mentioned on line 976, but not defined == Missing Reference: 'Infocom-2001' is mentioned on line 955, but not defined == Missing Reference: 'Sigcomm-2002' is mentioned on line 980, but not defined == Missing Reference: 'PCC-2004' is mentioned on line 964, but not defined == Missing Reference: 'Infocom-2005' is mentioned on line 960, but not defined == Missing Reference: 'R-BGP' is mentioned on line 970, but not defined Summary: 1 error (**), 0 flaws (~~), 12 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Inter-Domain Routing T. Li 3 Internet-Draft Cisco Systems, Inc. 4 Intended status: Standards Track G. Huston 5 Expires: December 15, 2007 APNIC 6 June 13, 2007 8 BGP Stability Improvements 9 draft-li-bgp-stability-01 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on December 15, 2007. 36 Copyright Notice 38 Copyright (C) The IETF Trust (2007). 40 Abstract 42 BGP is the routing protocol used to tie the Autonomous Systems (ASes) 43 of the Internet together. The ongoing stability of BGP in the face 44 of arbitrary inputs, both malicious and unintentional, is of primary 45 importance to the overall stability of the Internet. The overall 46 issue is not a new one. Previously, one aspect of stability, known 47 as route flap damping was originally discussed in RFC 2439. In the 48 intervening years, a great deal of experience with flap damping and 49 other stability concerns has been accumulated. Most recently, the 50 issue of BGP stability has been highlighted in RAWS. This document 51 describes the experience that has been gained concerning stability in 52 the intervening years, hypotheses about remaining problems, 53 suggestions for experiments to be performed, and proposals for 54 possible alternatives. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 59 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 60 1.2. History . . . . . . . . . . . . . . . . . . . . . . . . . 3 61 1.3. Observations . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.3.1. Path hunting . . . . . . . . . . . . . . . . . . . . . 4 63 1.4. The wave model . . . . . . . . . . . . . . . . . . . . . . 5 64 1.4.1. Refraction and diffraction . . . . . . . . . . . . . . 5 65 2. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 66 2.1. Flap damping . . . . . . . . . . . . . . . . . . . . . . . 6 67 2.2. Rapid convergence . . . . . . . . . . . . . . . . . . . . 6 68 2.3. Reduced overhead . . . . . . . . . . . . . . . . . . . . . 7 69 3. Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . 7 70 3.1. Turn it off . . . . . . . . . . . . . . . . . . . . . . . 7 71 3.2. Alternate parameters . . . . . . . . . . . . . . . . . . . 7 72 3.3. Band-stop filtering . . . . . . . . . . . . . . . . . . . 8 73 3.4. Path length damping . . . . . . . . . . . . . . . . . . . 8 74 3.5. Optimal path hysteresis . . . . . . . . . . . . . . . . . 10 75 3.5.1. Optimal path length hysteresis . . . . . . . . . . . . 11 76 3.6. Delayed best path selection . . . . . . . . . . . . . . . 11 77 3.7. Deprecate MRAI . . . . . . . . . . . . . . . . . . . . . . 12 78 3.8. Hybrid approaches . . . . . . . . . . . . . . . . . . . . 13 79 3.9. Other work . . . . . . . . . . . . . . . . . . . . . . . . 13 80 4. Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . 13 81 4.1. Call for collaboration . . . . . . . . . . . . . . . . . . 13 82 4.2. Literature search . . . . . . . . . . . . . . . . . . . . 13 83 4.3. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 14 84 4.3.1. A Case Study: Path length damping . . . . . . . . . . 14 85 4.4. Prototyping, Testing and Pilot Deployment . . . . . . . . 19 86 5. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 19 87 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 88 7. Security Considerations . . . . . . . . . . . . . . . . . . . 19 89 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 20 90 8.1. Normative References . . . . . . . . . . . . . . . . . . . 20 91 8.2. Informative References . . . . . . . . . . . . . . . . . . 20 92 8.3. Potential References . . . . . . . . . . . . . . . . . . . 20 93 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 21 94 Intellectual Property and Copyright Statements . . . . . . . . . . 23 96 1. Introduction 98 BGP [RFC4271] is the routing protocol used to tie the Autonomous 99 Systems (ASes) of the Internet together. The ongoing stability of 100 BGP in the face of arbitrary inputs, both malicious and 101 unintentional, is of primary importance to the overall stability of 102 the Internet. The overall issue is not a new one. Previously, one 103 aspect of stability, known as route flap damping was originally 104 discussed in RFC 2439 [RFC2439]. In the intervening years, a great 105 deal of experience with flap damping and other stability concerns has 106 been accumulated. Most recently, the issue of BGP stability has been 107 highlighted in RAWS [I-D.iab-raws-report]. This document describes 108 the experience that has been gained concerning stability in the 109 intervening years, hypotheses about remaining problems, suggestions 110 for experiments to be performed, and proposals for possible 111 alternatives. 113 Please note that this document is very much a work-in-progress. 115 1.1. Requirements Language 117 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 118 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 119 document are to be interpreted as described in RFC 2119 [RFC2119]. 121 1.2. History 123 The circuits used in computer networks have the unfortunate property 124 that they can intermittently fail and then recover. This was an 125 especially common failure mode for copper-based circuits. Under such 126 circumstances, when there was a BGP speaker on both ends of the 127 circuit, any prefixes advertised across the link would tend to 128 oscillate at the frequency induced by the intermittent link. The 129 oscillating prefixes would then propagate across the full Internet, 130 causing the entire routing subsystem to churn at the rate of the 131 prefix. 133 Individually, a single such prefix is not a significant issue. 134 However, as the Internet continued to scale upwards, it became 135 obvious that the CPU requirements to deal with the ever-increasing 136 number of oscillating prefixes would quickly become onerous. This 137 was aggravated by the fact that the party responsible for the 138 flapping circuit was frequently unaware of the problem, or, worse 139 yet, unwilling to address the issue. 141 Route flap can also be caused by policy oscillations [CN2000] or MED 142 churn oscillations [RFC3345]. 144 Thus, the original goal of route flap damping was to protect the 145 control plane from oscillations. This was done by determining the 146 number of flaps and the time elapsed since the last transition. This 147 is fed into an exponential decay function, and, if the prefix is 148 found to be flapping based on this data, the actual propagation of 149 the route is suppressed. Since the frequency information must be 150 stored even if the prefix is not currently active, there is state 151 overhead associated with flap damping for each prefix that has been 152 oscillating. 154 1.3. Observations 156 Unfortunately, flap damping isn't truly discerning about the nature 157 of routing changes. Any routing change can easily be misinterpreted 158 by flap damping as instability, resulting in premature damping of 159 prefixes [Harmful]. 161 1.3.1. Path hunting 163 One source of path changes is BGP's normal mechanism for _path 164 exploration_ or _path hunting_. These situations occur because BGP 165 is a path-vector protocol, where each BGP speaker advertises the path 166 that it is using to its neighbors, complete with the full AS path to 167 the destination. Since the number of possible paths through even a 168 simple topology is large, there can be many different path 169 transitions that can possibly be advertised. 171 Path hunting can occur both when a prefix is first advertised or when 172 a prefix is withdrawn. At advertisement time, the prefix may 173 propagate through the topology at different rates, sometimes 174 resulting in it first appearing at an AS with a suboptimal path. 175 Over time, optimal paths will appear where suboptimal paths were 176 before, resulting in a path change that is subsequently propagated. 178 Similarly, when a prefix is withdrawn from the network, as the local 179 BGP speaker receives the prefix withdrawal from a BGP peer it may 180 substitute a previously received announcement for this prefix from a 181 different peer in its place and announce this replacement route to 182 its peers in response to the received withdrawal. If this 183 replacement path is subsequently withdrawn, the local BGP speaker 184 will again select another previously received announcement from a 185 different peer, if one exists. This process may continue until the 186 original prefix withdrawal has propagated across the entire routing 187 space. 189 Interestingly, the amount of path hunting can increase dramatically 190 as the meshiness of the topology increases. It's easy to observe 191 this if you first consider an acyclic topology (i.e., a tree). In 192 such a topology, there is only one possible path, so there is no 193 hunting. If a single link is added to this topology, then there is 194 one cycle in the graph and at most two possible paths for BGP to 195 explore. Subsequent links can add many more alternate paths, 196 depending on their placement. 198 In the worst possible case path hunting in BGP can explore every 199 possible path of each path length. More commonly it has been 200 observed that path hunting in today's Internet can add an additional 201 2 or 3 BGP updates to a prefix withdrawal. 203 1.4. The wave model 205 An intuitive means of understanding the observed behavior is by 206 analogy to a wave. Any change in the network triggers the 207 dissemination of information (either updates or withdrawals) through 208 the topology from the point of occurrence. The information travels 209 outwards along all of the paths supported by BGP, in much the same 210 way that a wave would propagate from a pebble dropped in a lake. 212 The wave expands at each BGP speaker, where the information is 213 propagated to all other BGP peers, including ones that already have 214 the information. If the newly arrived information is inferior to the 215 existing path information, then the wave dies out at that BGP 216 speaker. If the newly arrived path is the best path, then the wave 217 continues to propagate. 219 1.4.1. Refraction and diffraction 221 Information does not traverse the BGP mesh at constant rates. 222 Differences in implementations, processing loads, propagation delay, 223 damping parameters, and policy can all contribute to delaying 224 updates. As with a physical wave, we know that the speed of 225 propagation varies with the material. This results in a bending of 226 the wave, known as _refraction_. 228 Similarly, since the BGP mesh is not a continuous medium we get 229 _diffraction_, where the wave passes through an aperture and fans out 230 from there, creating a new wavefront. Each additional wavefront 231 represents additional processing burden on the routing subsystem. 233 It is interesting to note that flap damping itself may be a 234 contributor to the creation of additional wavefronts. Since a route 235 that is being damped will be delayed for a long time, damping is 236 effectively delaying a wave of information, possibly creating more 237 refractive and diffractive effects. 239 Another major contributor to the generation of additional 240 intermediate wavefronts of information is the disparate use of the 241 Minimum Route Advertisement Interval (MRAI) Timer, where various 242 implementations impose differing delays on update propagation. 244 It should be noted that the wave analogy does break down when it 245 comes to interference. Unlike a physical wave, two waves of BGP 246 information do not interfere additively or destructively. Instead, 247 as noted above, at most one wave will propagate from any point, and 248 which wave will propagate may vary from point to point. 250 2. Goals 252 2.1. Flap damping 254 As we reconsider the mechanisms that constitute flap damping, we need 255 to keep in mind that the original goals of detecting and protecting 256 the routing subsystem from noisy inputs is still a requirement. 257 While copper circuits are now less common in the core, the overall 258 network has expanded dramatically and there is a wide variance in the 259 skills and experience in operational roles. 261 As a result, it is still possible for an errant AS to inject flapping 262 information into the BGP mesh, either as the result of policy 263 misconfiguration, implementation error, an intermittent circuit, or 264 even as an intentional destructive act. Thus, it is important that 265 there still be mechanisms that intervene and ameliorate these 266 effects, protecting the routing subsystem. 268 2.2. Rapid convergence 270 While protecting the routing system is of paramount importance, it is 271 vital that the routing subsystem also continue to perform its primary 272 task: providing routing. Any flap damping and path hunting 273 suppression mechanism must continue to provide rapid convergence to 274 some workable path so that connectivity is restored. However, this 275 goal should not be construed to require rapid optimality. While a 276 best path should eventually be selected and propagated, it is far 277 more important that some connectivity be restored immediately. Most 278 applications can survive with a sub-optimal path, while no 279 applications can succeed if no path is selected. 281 This distinction seems vital for understanding the precise behavior 282 of any mechanism, so for the sake of this discussion, we explicitly 283 define two milestones for convergence: 285 reachability: The source has a valid path to the destination. The 286 path may be suboptimal and may not respect the ultimate traffic 287 engineering preferences of the ASes along the path. As a result, 288 the path may exhibit congestion, unwelcome QoS handling, or any 289 other of a number of sub-optimalities. Time-sensitive 290 applications such as video or voice may require the optimal path 291 for proper operation. 293 optimality: The source has the long-term best path to the 294 destination. This milestone has been reached when there is a 295 stable transitive closure of the best path selection process of 296 all ASes along the path from source to destination. 298 2.3. Reduced overhead 300 Any form of damping of BGP updates should strive to remove the 301 unnecessary exchange of information. As described above, both path 302 hunting and refractive effects cause unnecessary churn in BGP. The 303 flap damping mechanism should be generalized to help suppress as much 304 of this unnecessary information as possible. 306 3. Hypotheses 308 In this section, we put forth a number of hypotheses about possible 309 mechanisms to achieve the goals above. As of this writing, more 310 investigation is needed on each of these theories, and where possible 311 we've included some discussion of the experiments that we feel would 312 be worthwhile. Our goal here is to examine a number of different 313 mechanisms, understand their relative benefits, and select a small 314 subset to become the core set of replacement mechanisms. 316 3.1. Turn it off 318 Given the concern about the negative effects of path damping, 319 [RIPE-378] recommends that path damping be disabled. While this is 320 not unreasonable given the lack of beneficial alternatives, we feel 321 that some of the possibilities presented here will eventually prevail 322 and that this sentiment can be changed over time. 324 3.2. Alternate parameters 326 It has been suggested in [Harmful] that the default flap damping 327 parameters in existing implementations are simply too aggressive and 328 quickly convert normal path hunting into a damping event that 329 precludes connectivity. Significantly increasing the parameters 330 could permit significantly more churn to be passed by the routing 331 subsystem while still filtering out truly periodic sources of flap. 333 It would be useful to test this by simply configuring numerous 334 differing parameters and observing if there is any beneficial effect. 335 At this time, we have no recommendations for possible alternative 336 parameter settings. 338 3.3. Band-stop filtering 340 Another view is that classical flap damping isn't working as well as 341 we might like because of a frequency mismatch. The current mechanism 342 looks for a number of changes in a given period of time. If the 343 route exceeds this threshold frequency, then it is damped. The 344 assumed information model behind the current BGP Flap Damping 345 parameters was that of a relatively low frequency oscillation 346 occurring over an extended period of time. 348 Measurements of BGP activity indicate that one of the predominant 349 signal components is a high frequency oscillation occurring over a 350 short time interval. This acts as a false positive for damping that 351 we would like to avoid. One alternative approach is to shift from 352 looking for flapping above a given threshold frequency and simply 353 accept that when there is a real topological change, there will be 354 extensive high frequency path changes. These should be dealt with by 355 separate path hunting suppression techniques, as described below. 357 Some time after the topological change, the high-frequency path 358 changes should stop and the route should then resume its stability. 359 Subsequent path changes would then be indicative of real oscillation 360 and would be subject to damping. 362 The implementation of this would be relatively straightforward. When 363 a change is seen on a stable route, it opens an oscillation window of 364 a fixed duration (e.g., 60s). Any changes within that window are not 365 considered as contributing to flap damping. After the window is 366 closed, any subsequent changes would count as significant events 367 towards damping. Effectively, this technique creates a filter that 368 passes very, very low frequencies (e.g., configuration changes) and 369 defers high frequency changes to path hunting mechanisms, but will 370 detect and deter ongoing route flap within a certain frequency band. 371 This is normally known as a band-stop filter. 373 3.4. Path length damping 375 The increased meshiness of the core of the Internet has significantly 376 changed the nature of path changes that are visible in BGP. As the 377 meshiness of the network increases, the number of parallel links 378 between any given pair of ASes tends to increase. This helps protect 379 against single link failures between ASes. This also reduces the 380 frequency of AS path changes on transit prefixes because most of the 381 link failures in the densely meshed part of the network will not 382 result in an AS path change. 384 As a result, when a BGP speaker does see a change in the AS path, and 385 in particular, when the AS path length increases, this would seem to 386 be a good heuristic indication that there is some significant 387 failure. The ultimate trigger for the advertisement of an update 388 with a longer AS path is the removal of a previously used shorter 389 path. This is either due to withdrawal at the origin, or removal of 390 an interior connection. As a result, it seems likely that such 391 failures are less likely to have alternative working paths and that 392 the increase in path length is a harbinger of path hunting that is 393 likely to be unsuccessful. We therefore suggest that this event 394 could be used to trigger a suppression period, which would allow the 395 prefix to oscillate arbitrarily without propagation to the remainder 396 of the network. The obvious risk is that this would be a false 397 negative, unnecessarily disrupting connectivity. 399 Observations of the time-coupling of updates in BGP that refer to the 400 same prefix show that only 28% of all BGP updates occur as an 401 isolated event, 26% of all updates occur as a part of a pair of 402 updates occurring within 65 seconds of each other and the remaining 403 46% of all BGP Updates occur within a coupled sequence of 3 or more 404 updates where each update occurs within 65 seconds of the previous 405 update in the sequence. 407 Again, the implementation of this would be relatively 408 straightforward. When a BGP speaker found that it needed to change 409 its best path for a prefix and that the new best path was longer than 410 the previous best path, then it would suppress the advertisement of 411 the longer AS path to its neighbor and start a timer. Subsequent 412 changes to the prefix that continue to lengthen the AS path would 413 restart the timer. If the BGP speaker installed a shorter best path, 414 or undertook a local withdrawal it would remove the suppressed update 415 and cancel the timer. Otherwise, when the timer expired, the BGP 416 speaker would advertise the result, if any. 418 Some analysis suggests that this technique will be effective at 419 suppressing about 20% of the overall churn rate. [Potaroo0607] 421 The benefits of this approach are that it would drastically reduce 422 the amount of path hunting that happened when a stub site had a 423 failure of its tail circuit. 425 This approach has a number of drawbacks: 427 1. Failures that result in alternate best paths with path lengths 428 that are equal to the previous best path are not damped, even in 429 the case of stub tail-circuit failures. 431 2. Convergence is harmed if the alternate AS paths that are damped 432 out are optimal: 434 A. If the failure that triggered the path change is not due to a 435 tail-circuit failure, then useful alternate AS paths will be 436 ignored. 438 B. While this approach is beneficial for stub sites, such sites 439 are not particularly good candidates for being explicitly 440 routed. Such sites should only ever need prefixes that are 441 aggregateable. Such prefixes may be explicitly distributed 442 by BGP for the sake of traffic engineering, so understanding 443 the scope of affected prefixes is left for future study. For 444 non-stub sites, this approach damages convergence. 445 Unfortunately, it seems difficult to distinguish between stub 446 prefixes and non-stub prefixes. To do so would seem to 447 require require explicit annotation that could only be 448 reasonably generated by manual configuration and would likely 449 be incorrect. Transporting the annotation itself would 450 require further protocol extension. 452 3.5. Optimal path hysteresis 454 It has been observed that the overall topology of the Internet at the 455 AS level changes at a fairly low rate. Thus, the optimal AS path to 456 a given prefix, ignoring transient issues, changes at a very low 457 rate. This suggests that caching the optimal AS path and waiting for 458 it to reappear would be another alternative heuristic to help select 459 only the long-term optimal path. 461 An implementation of this technique might retain a copy of the AS 462 path on per-prefix basis, even if it had no active path to the 463 prefix. Because most implementations maintain a cache of AS paths, 464 this is not necessarily prohibitively expensive. When a new AS path 465 is received for a prefix, the new path is compared to the cached 466 optimal path. If it matches, or it is preferable to the stored 467 optimal path, then the new path is immediately accepted, advertised, 468 and the cache can be updated appropriately. However, if the new AS 469 path is inferior to the cached path, then the implementation can 470 infer that there is some path hunting in progress and can choose to 471 either not perform best path selection, not select the new path, or 472 not advertise the new path. Again, after a suitable period has 473 elapsed, the implementation may decide that the optimal path is 474 unlikely to appear and may process the inferior path normally. 476 The benefits of this approach are that when a site has been 477 temporarily unreachable for a short time, then when it returns to 478 connectivity, only the optimal path will propagate through the 479 network. 481 There are three drawbacks to this approach. First, this approach 482 will delay convergence in the case where the cached optimal path is 483 not going to be restored. Second, this approach will delay 484 reachability. Third, the maintenance of the cache could be a non- 485 trivial amount of overhead. Many implementations maintain a cache of 486 AS paths, where the actual path is stored a single time and then 487 various prefixes point to a cache entry. In such circumstances, if 488 the AS path is maintained for some other active prefix, then the 489 additional cost of caching the path is the additional entry for the 490 unreachable prefix and the pointer to the cache entry. However, if 491 there are no other references for the AS path, then the storage of 492 the AS path itself would be part of the overhead. The impact of this 493 overhead is tempered by the fact that a prefix that is only 494 disconnected from the network for a short time would likely reuse the 495 same memory shortly thereafter in any case. An implementation could 496 also alleviate the cost of this overhead by limiting the amount of 497 memory spent on caching optimal paths for inactive prefixes, or 498 temporally limiting the lifetime of the cached information. 500 3.5.1. Optimal path length hysteresis 502 A variant of this approach would be to only cache the path length of 503 the optimal path. This would allow certain suboptimal paths matching 504 the cached length to pass rapidly through the system, and in 505 exchange, it would decrease the amount of overhead necessary to 506 maintain the cache. 508 3.6. Delayed best path selection 510 Another observation based on the discussion in Section 1.4.1 is that 511 the amount of flap is exacerbated by each AS selecting the best 512 possible path each time a new path is presented. This is not 513 strictly required by BGP, so ignoring some of the incoming paths 514 would be perfectly acceptable. Further, an implementation could 515 reasonably delay performing any best path analysis for an arbitrarily 516 long time, as long as it continued to advertise the path it actually 517 used. Thus, one possible policy would be to only perform best path 518 selection when absolutely required. When the first path for a prefix 519 arrives, the implementation would immediately select that path, 520 thereby restoring connectivity. Subsequent paths from other 521 neighbors for the same prefix would not trigger a new best path 522 computation. Rather, they would simply start a timer that would only 523 expire when the paths had stabilized. 525 The benefit of this approach is that it would provide rapid 526 reachability and a major reduction in churn. By propagating one path 527 and suppressing most of the intermediate paths, only a few paths are 528 likely to be propagated. The drawback to this approach is that it 529 will still necessary propagate some suboptimal paths. If the initial 530 path was the long-term optimal path, then churn would not be an 531 issue. Thus, under this approach, propagating the first path helps 532 to optimize reachability, but makes it very likely that the optimal 533 path will subsequently follow. Further, if the neighbor that relayed 534 the initial path decides to change its path selection for any reason, 535 then the local BGP speaker will have no alternative except to execute 536 best path selection and propagate a new best path. This would likely 537 be another intermediate path, not necessarily the long-term optimal 538 path. 540 Some basic analysis shows that this technique, when combined with 541 optimal path hysteresis is capable of reducing the overall BGP 542 message load by 35% for prefixes that oscillate frequently. In 543 addition, when these techniques are combined with path length 544 damping, there is negligible further improvement. Examination of the 545 result shows that optimal path hysteresis also effectively damps out 546 much of the path hunting messaging that occurs when a prefix is being 547 withdrawn, in much the same way that path length damping would do. 549 3.7. Deprecate MRAI 551 BGP specifies that a prefix should not be advertised multiple times 552 within a fixed period of time. This is called the Min Route 553 Advertisement Interval (MRAI). Implementations of this are sometimes 554 simplistic and can result in information being delayed for the length 555 of this interval even when such delay causes an increase in the 556 number of updates due to diffractive replications. It also leads to 557 unnecessary delays in convergence. 559 The original intent of MRAI was to rate-limit BGP updates to prevent 560 thrashing. Unfortunately, this unsophisticated control has some side 561 effects that are deleterious to the overall effort. If other can 562 provide appropriate guarantees that the update rate will remain 563 appropriately constrained, then the spirit of the MRAI requirement 564 would be satisfied and the actual mechanism itself would not be 565 necessary. 567 For example, path length damping could be used to ensure that the 568 rate of increases in the AS path length would be kept to a 569 controllable level. Refinements in flap damping might be used to 570 deal with the case where the AS path length is constant. No 571 mechanism would be necessary to deal with a decreasing AS path 572 length, since that is necessarily bounded by the AS path length 573 itself. In such circumstances, it should be possible to remove the 574 MRAI mechanism entirely, improve convergence, decrease diffraction, 575 and continue to ensure overall mesh stability. 577 3.8. Hybrid approaches 579 The approaches listed here could, in principle, be used in 580 conjunction with one another. This would result in a hybrid behavior 581 that had the benefits and drawbacks of the combined mechanisms. For 582 example, it should be possible to combine optimal path caching and 583 delayed best path selection. This approach would then propagate the 584 first path seen by a BGP speaker, but would then delay subsequent 585 path selection opportunities until the optimal path is seen. 587 3.9. Other work 589 Other work is already in progress to help reduce the amount of BGP 590 messages that are generated when a large number of routes are 591 withdrawn. [AggrWithdrawl] 593 4. Next steps 595 4.1. Call for collaboration 597 As can be seen from the above, there is a great deal of work yet to 598 be done on this subject. Collaborators are most welcome in any 599 aspect of the work. 601 4.2. Literature search 603 There are a number of technical articles listed below that have been 604 published on BGP flap damping and stability that need to be reviewed 605 and included if they prove substantive. A few known ones are listed 606 here. There are very likely a number of other articles in the 607 literature that are relevant. 609 [TON-1998] 611 [Infocom-1999] 613 [FTCS-1999] 615 [Sigcomm-2000] 617 [Infocom-2001] 619 [Sigcomm-2002] 621 [PCC-2004] 623 [Infocom-2005] 625 [R-BGP] 627 4.3. Analysis 629 A number of projects have collected traces of BGP update messages 630 that demonstrate both flap and path hunting. It would be of great 631 interest to examine the effects of some of the proposal in Section 3 632 in detail on these traces. 634 4.3.1. A Case Study: Path length damping 636 This case study examines the BGP updates over the month of April 2007 637 based on an observation point adjacent to AS 4637. 639 During that month a single eBGP peering session received 1,341,520 640 BGP update messages, reflecting 3,523,906 individual prefix updates 641 and 627,538 individual prefix withdrawals. Considering that there 642 were an average of some 215,000 individual prefixes in the BGP 643 routing table across the month, that's an average of around 19 644 updates for every prefix in the month, assuming a uniform 645 distribution of updates across the entire routing domain. 647 Of course, the distribution of updates is not uniform, and most of 648 the network is highly stable. Half of these 210,000 prefixes had 649 less than 10 routing updates across the month, and only 20,000 650 prefixes had more than 40 updates for the month. In other words, 651 this is a very skewed distribution, with 10% of announced prefixes 652 being responsible for 53% of all routing updates, and the busiest 1% 653 of prefixes responsible for 24% of the routing updates for the month. 655 The first step is to look at what kinds of updates one can expect 656 from a single peer in BGP. The following table classifies the types 657 of BGP updates of interest here. 659 +------+------------------------------------------------------------+ 660 | Code | Description | 661 +------+------------------------------------------------------------+ 662 | AA+ | Announcement of an already announced prefix with a longer | 663 | | AS path (update to longer path) | 664 | AA- | Announcement of an announced prefix with a shorter AS path | 665 | | (update to shorter path) | 666 | AA0 | Announcement of an announced prefix with a different path | 667 | | of the same length (update to a different AS path of same | 668 | | length) | 669 | AA* | Announcement of an announced prefix with the same path but | 670 | | different attributes (update of attributes) | 671 | AA | Announcement of an announced prefix with no change in path | 672 | | or attributes (possible BGP error or data collection | 673 | | error) | 674 | WA+ | Announcement of a withdrawn prefix, with longer AS path | 675 | WA- | Announcement of a withdrawn prefix, with shorter AS path | 676 | WA0 | Announcement of a withdrawn prefix, with different AS path | 677 | | of the same length | 678 | WA* | Announcement of a withdrawn prefix with the same AS path, | 679 | | but different attributes | 680 | WA | Announcement of a withdrawn prefix with the same AS path | 681 | | and same attributes | 682 | AW | Withdrawal of an announced prefix | 683 | WW | Withdrawal of a withdrawn prefix (possible BGP error or a | 684 | | data collection error) | 685 +------+------------------------------------------------------------+ 687 The following table classifies all the updates according to this 688 taxonomy. 690 +------+---------+ 691 | Code | Count | 692 +------+---------+ 693 | AA+ | 607,093 | 694 | AA- | 555,609 | 695 | AA0 | 594,029 | 696 | AA* | 782,404 | 697 | AA | 195,707 | 698 | WA+ | 238,141 | 699 | WA- | 190,328 | 700 | WA0 | 51,780 | 701 | WA* | 30,797 | 702 | WA | 77,440 | 703 | AW | 627,538 | 704 | WW | 0 | 705 +------+---------+ 707 The interesting numbers here are those associated with BGP path 708 hunting following a withdrawal, which are likely to be associated 709 with the 607,093 AA+ updates and the 627,538 AW updates. But the 710 population of these update types alone is not enough on its own to 711 justify a conclusion that over 1.2 million updates are associated 712 with path hunting as a precursor to prefix withdrawal events. The 713 other salient factor that needs to be examined is the time 714 distribution of updates, as the path hunting condition is associated 715 with a rapid burst of updates. 717 In looking at the time distribution of updates for the same prefix, 718 there are some prominent peaks. The operation of the 30 second MRAI 719 timer appears to be very prominent, and 934,391 updates occurred 720 precisely 30 seconds after the previous update for the same prefix, 721 and a total 1,636,093 updates for the same prefix occurred within 31 722 seconds of the previous update. That's the equivalent to 39% of the 723 entire BGP update activity for the month. There are further local 724 peaks at 30 second intervals at 60, 90 and all the way through to 240 725 seconds. Almost half of the BGP update activity occurs in a closely 726 time-coupled manner. 728 There are also smaller local peaks at 30 minute and 1 hour intervals. 729 It is likely that these correspond to Route Flap Damping outcomes, 730 where the damping period is typically one of these two values. 731 Interestingly, there are also local peaks at 1, 2 and 3 day 732 intervals. This is unlikely to be an artifact of Route Flap Damping, 733 and is more likely to be the outcome of some form of time-managed 734 traffic system that performs routing changes on a regular 24 hour 735 basis. 737 Another way of looking at this time distribution of updates is to 738 construct update "sequences" where a pair of updates is considered to 739 be part of the same sequence if it refers to the same prefix and is 740 received within 65 seconds (or slightly longer than two MRAI Timer 741 intervals) of any previous update for the same prefix. Only 28% of 742 the updates for the month are not part of any sequence, while 26% of 743 updates are part of a coupled update pair, and 46% of updates are 744 part of sequences of 3 or more updates. Interestingly enough, 745 changing the timer as to what defines a sequence does not alter the 746 profile greatly. Extending the sequence timer to 125 seconds (or 747 four MRAI Timer intervals) produces the outcome that 54% of updates 748 are part of sequences of 3 or more updates, while reducing the 749 sequence timer to 35 seconds produces the outcome that 36% of all 750 updates are part of sequences of 3 or more updates. 752 The approaches to flap damping to date have tended to look at flap 753 damping as a persistent condition that lasts for hours or longer, and 754 are an outcome of a strongly persistent announcement and withdrawal 755 characteristics that are assumed to be associated with some form of 756 cyclical behavior of an underlying circuit. This now appears to be 757 well wide of the mark in terms of capturing the profile of what 758 appears to be redundant BGP updates that reflect transitory routing 759 states that are not in any converged state. 761 The question this prompts is whether there is any value in looking at 762 BGP update patterns in the micro view rather than the macro? Can we 763 identify, on the fly, update sequences that are highly likely to 764 correlate to the BGP behavior of path hunting to a withdrawal and 765 damp the intermediate path hunting states and simply propagate the 766 resultant converged state? This was part of the intent of the MRAI 767 timer, but rather than simply apply a uniform damping interval to 768 update propagation, can we devise a selective algorithm that attempt 769 to pinpoint routing transitions that are strongly associated with BGP 770 path hunting? 772 There are a number of observations here that appear to point to some 773 value in considering this approach: 775 o A BGP update generator may perform "update compression" by 776 removing an already queued update when a further update that 777 refers to the same prefix is generated. Thus, when using the MRAI 778 timer, only the most recent state for each prefix is passed to the 779 BGP peers, and any intermediate state that occurred during the 780 MRAI-imposed quiescent time is suppressed. 782 o Convergence in BGP appears to take longer than a single MRAI timer 783 interval. As noted in the sequencing profile for updates, some 784 36% of all sequences take more than two MRAI timer intervals, or 785 more than 60 seconds, to complete. 787 o Path hunting in BGP is commonly represented as an update sequence 788 of the form {AA+}* AW, i.e. a sequence of lengthening AS path 789 lengths followed by a withdrawal. 791 o Suppression of updates that lengthen the AS path length of a 792 prefix does not implicitly create any risks of routing loop 793 formation during the suppression period. If the peer had already 794 selected a different path as the best path, then the update to a 795 longer path would have no impact on the previous selection. If 796 the peer was using the path advertised by this BGP speaker as its 797 best path, then the suppression may cause the peer to continue to 798 use this out-of-date path, but would not cause a path loop, as if 799 the peer was listed on the longer path then the peer would already 800 have a shorter path, and this update would not alter the peer's 801 forwarding state. 803 So the profile of update sequences that appear to be effective 804 targets for some form of local suppression are those that lengthen 805 the AS path, and possibly also those updates that do not change the 806 AS path Length, and are also part of a sequence of time-clustered 807 updates for the same prefix. 809 One approach to path length damping is to delay the processing an 810 update if the update represents a lengthening of the AS path for an 811 already announced prefix, selecting the AA+ updates. Furthermore, 812 the updates that are of interest here are those that occur during BGP 813 path hunting, so the length of time of the suppression should not be 814 minutes or hours, but slightly over one MRAI time interval, or 35 815 seconds. If no further updates for this prefix are generated in this 816 suppression interval, then the update is processed at the end of the 817 suppression time, otherwise the suppressed update is replaced by its 818 successor update. 820 How effective would this form of path length damping be in the 821 context of the BGP Update data set we've been examining here? 823 The algorithm used to implement this damping response is to suppress 824 the processing all AA+ updates by up to 35 seconds. If a further 825 update for this prefix occurs during this suppression interval, then 826 the suppressed update is ignored and the successor update is 827 processed in stead. If this update represents a further lengthening 828 of the AS path then it, too, is suppressed for 35 seconds. There are 829 607,093 AA+ updates in this set of suppressed updates, or some 15% of 830 the total update load for the month. Path length damping would 831 result in not processing 371,943 updates, or some 9.5% of the total 832 update load. This result also indicates that 61% of all AA+ updates 833 are followed by a subsequent update for the same prefix within one 834 MRAI time interval. 836 Decreasing the sensitivity of the suppression parameters to a little 837 over 2 MRAI intervals, or 65 seconds, increases the number of 838 unprocessed updates to 418,805, or an additional 1%, so it appears 839 that a damping sensitivity of a single MRAI interval represents a 840 suitable point of compromise between maximizing the number of damped 841 BGP events without making the BGP convergence process significantly 842 slower. 844 Of these damped updates, how many are actually path hunt updates? 845 Some 96,135 of these damped updates are immediately followed by a 846 withdrawal within the 35 second suppression period, and a further 847 36,691 damped updates are followed by another suppressed AA+ update. 849 This approach could be extended in a number of directions. One 850 approach is to regard any update that does not reduce the AS path 851 length or withdraw the prefix as being a candidate for damping. In 852 this case some 845,290 updates would be damped or 21% of all updates. 853 Of these update just under one quarter, or 208,007 of these damped 854 updates are followed by a withdrawal within one MRAI interval, and a 855 further 474,234 of these damped updates are followed by an update 856 with an AS path of the same or greater length. The implication being 857 that path length damping removes around one fifth of the total BGP 858 update volume without reducing the time to convergence for route 859 withdrawal, nor the time for propagation of more preferred routing 860 paths. 862 Using this latter form of path length damping, over the month the 863 average prefix update rate per second falls from 1.60 prefix updates 864 per second to 1.22 prefix updates per second, with 0.38 damped 865 updates per second on average. 867 The average peak update rate per hour falls from 355 to 290 prefix 868 updates per second using path length damping, an average reduction of 869 65 updates per second on the hourly peaks. 871 4.4. Prototyping, Testing and Pilot Deployment 873 After some analysis, it would then be helpful to actually implement 874 the most useful possible solutions in a number of BGP 875 implementations. Since this is a change to BGP, extensive testing is 876 going to be necessary and a period of pilot deployment will be 877 required. Implementers, testers, and operators could help accelerate 878 this portion of the project. 880 5. Acknowledgments 882 This document builds on the work of [RFC2439] and we would like to 883 thank Curtis Villamizar, Ravi Chandra, and Ramesh Govindan for their 884 excellent work. 886 We would like to thank Brian Carpenter and Robin Whittle for their 887 helpful comments. 889 6. IANA Considerations 891 This memo includes no requests to IANA. 893 7. Security Considerations 895 This document raises no new security issues. 897 8. References 899 8.1. Normative References 901 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 902 Requirement Levels", BCP 14, RFC 2119, March 1997. 904 [RFC4271] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway 905 Protocol 4 (BGP-4)", RFC 4271, January 2006. 907 8.2. Informative References 909 [CN2000] Varadhan, K., Govindan, R., and D. Estrin, "Persistent 910 Route Oscillations in Inter-domain Routing", Computer 911 Networks vol. 32, pp. 1-16, 2000, . 914 [Harmful] Bush, R., Griffin, T., and Z. Mao, "Route flap damping: 915 harmful?", . 917 [I-D.iab-raws-report] 918 Meyers, D., "Report from the IAB Workshop on Routing and 919 Addressing", draft-iab-raws-report-02 (work in progress), 920 April 2007. 922 [Potaroo0607] 923 Huston, G., "Damping BGP", 924 . 926 [RFC2439] Villamizar, C., Chandra, R., and R. Govindan, "BGP Route 927 Flap Damping", RFC 2439, November 1998. 929 [RFC3345] McPherson, D., Gill, V., Walton, D., and A. Retana, 930 "Border Gateway Protocol (BGP) Persistent Route 931 Oscillation Condition", RFC 3345, August 2002. 933 [RIPE-378] 934 Smith, P. and C. Panigl, "RIPE Routing Working Group 935 Recommendations on Route-flap Damping", 936 . 938 8.3. Potential References 940 [AggrWithdrawl] 941 Raszuk, R., Patel, K., Appanna, C., and D. Ward, "BGP 942 Aggregate Withdraw", draft-raszuk-aggr-withdraw-00 (work 943 in progress), . 946 [FTCS-1999] 947 Labovitz, C., Ahuja, A., and F. Jahanian, "Experimental 948 Study of Internet Stability and Wide-Area Network 949 Failures", FTCS 1999. 951 [Infocom-1999] 952 Labovitz, C., Malan, G., and F. Jahanian, "Origins of 953 Internet Routing Instability", Infocom 1999. 955 [Infocom-2001] 956 Labovitz, C., Ahuja, A., Wattenhofer, R., and S. 957 Venkatachary, "The Impact of Internet Policy and Topology 958 on Delayed Routing Convergence", Infocom 2001. 960 [Infocom-2005] 961 Chandrashekar, J., Duan, Z., Zhang, Z., and J. Krasky, 962 "Limiting path exploration in BGP", Infocom 2005. 964 [PCC-2004] 965 Duan, Z., Chandrashekar, J., Krasky, J., Xu, K., and Z. 966 Zhang, "Damping BGP Route Flaps", IEEE International 967 Conference on Performance, Computing, and 968 Communications 2002. 970 [R-BGP] Kushman, N., Kandula, S., Katabi, D., and B. Maggs, 971 "R-BGP: Staying Connected In a Connected World", 4th 972 USENIX Symposium on Networked Systems Design & 973 Implementation 2007, 974 . 976 [Sigcomm-2000] 977 Labovitz, C., Ahuja, A., Bose, A., and F. Jahanian, 978 "Delayed Internet Routing Convergence", Sigcomm 2000. 980 [Sigcomm-2002] 981 Mao, Z., Govindan, R., Varghese, G., and R. Katz, "Route 982 Flap Damping Exacerbates Internet Routing Convergence", 983 Sigcomm 2002. 985 [TON-1998] 986 Labovitz, C., Malan, G., and F. Jahanian, "Internet 987 Routing Instability", TON 1998. 989 Authors' Addresses 991 Tony Li 992 Cisco Systems, Inc. 993 170 W. Tasman Dr. 994 San Jose, CA 95134 995 US 997 Phone: +1 408 853 1494 998 Email: tli@cisco.com 1000 Geoff Huston 1001 Asia Pacific Network Information Centre 1003 Email: gih@apnic.net 1004 URI: http://www.apnic.net 1006 Full Copyright Statement 1008 Copyright (C) The IETF Trust (2007). 1010 This document is subject to the rights, licenses and restrictions 1011 contained in BCP 78, and except as set forth therein, the authors 1012 retain all their rights. 1014 This document and the information contained herein are provided on an 1015 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 1016 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 1017 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 1018 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 1019 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 1020 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1022 Intellectual Property 1024 The IETF takes no position regarding the validity or scope of any 1025 Intellectual Property Rights or other rights that might be claimed to 1026 pertain to the implementation or use of the technology described in 1027 this document or the extent to which any license under such rights 1028 might or might not be available; nor does it represent that it has 1029 made any independent effort to identify any such rights. Information 1030 on the procedures with respect to rights in RFC documents can be 1031 found in BCP 78 and BCP 79. 1033 Copies of IPR disclosures made to the IETF Secretariat and any 1034 assurances of licenses to be made available, or the result of an 1035 attempt made to obtain a general license or permission for the use of 1036 such proprietary rights by implementers or users of this 1037 specification can be obtained from the IETF on-line IPR repository at 1038 http://www.ietf.org/ipr. 1040 The IETF invites any interested party to bring to its attention any 1041 copyrights, patents or patent applications, or other proprietary 1042 rights that may cover technology that may be required to implement 1043 this standard. Please address the information to the IETF at 1044 ietf-ipr@ietf.org. 1046 Acknowledgment 1048 Funding for the RFC Editor function is provided by the IETF 1049 Administrative Support Activity (IASA).