idnits 2.17.1 draft-mcpherson-bgp-route-oscillation-01.txt: ** The Abstract section seems to be numbered Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 54 instances of too long lines in the document, the longest one being 5 characters in excess of 72. ** The abstract seems to contain references ([2], [3], [4], [1]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. == There are 7 instances of lines with private range IPv4 addresses in the document. 
If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 1771 (ref. '1') (Obsoleted by RFC 4271) ** Obsolete normative reference: RFC 2796 (ref. '2') (Obsoleted by RFC 4456) ** Obsolete normative reference: RFC 1965 (ref. '3') (Obsoleted by RFC 3065) -- Possible downref: Non-RFC (?) normative reference: ref. '4' == Outdated reference: A later version (-26) exists of draft-ietf-idr-bgp4-12 Summary: 12 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Danny McPherson 3 INTERNET DRAFT Amber Networks, Inc. 4 Vijay Gill 5 Metromedia Fiber Network, Inc. 6 Daniel Walton 7 Alvaro Retana 8 January 2001 Cisco Systems, Inc. 10 BGP Persistent Route Oscillation Condition 11 13 1. 
Status of this Memo 15 This document is an Internet-Draft and is in full conformance with 16 all provisions of Section 10 of RFC 2026. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet- Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 2. Abstract 36 The Border Gateway Protocol (BGP) [1] is an inter-Autonomous System 37 routing protocol. The primary function of a BGP speaking system is to 38 exchange network reachability information with other BGP systems. 40 It has recently been discovered that in particular configurations, 41 the BGP scaling mechanisms defined in "BGP Route Reflection - An 42 Alternative to Full Mesh IBGP" [2] and "Autonomous System 43 Confederations for BGP" [3] will introduce persistent BGP route 44 oscillation[4]. This document discusses the two types of persistent 45 route oscillation that have been identified, describes when these 46 conditions will occur, and provides some network design guidelines to 47 avoid introducing such occurrences. 49 3. Introduction 51 It has been known for some time that in particular configurations, 52 the BGP scaling mechanisms defined in "BGP Route Reflection - An 53 Alternative to Full Mesh IBGP" [2] and "Autonomous System 54 Confederations for BGP" [3] will introduce persistent BGP route 55 oscillation. 
57 The problem is inherent in the way BGP works: locally defined routing 58 policies may conflict globally, and certain types of conflicts can 59 cause persistent oscillation of the protocol. Given current 60 practices, we happen to see the problem manifest itself in the 61 context of MED + route reflectors or confederations. 63 The current specification of BGP-4 [5] states that the 64 MULTI_EXIT_DISC is only comparable between routes learned from the 65 same neighboring AS. This limitation is consistent with the 66 description of the attribute: "The MULTI_EXIT_DISC attribute may be 67 used on external (inter-AS) links to discriminate among multiple exit 68 or entry points to the same neighboring AS." [1,5] 70 In a full mesh iBGP network, all the internal routers have complete 71 visibility of the available exit points into a neighboring AS. The 72 comparison of the MULTI_EXIT_DISC for only some paths is not a 73 problem. 75 Because of the scalability implications of a full mesh iBGP network, 76 two alternatives have been standardized: route reflectors [2] and AS 77 confederations [3]. Both alternatives describe methods by which 78 route distribution may be achieved without a full iBGP mesh in an AS. 80 The route reflector alternative defines the ability to re-advertise 81 (reflect) iBGP-learned routes to other iBGP peers once the best path 82 is selected [2]. AS Confederations specify the operation of a 83 collection of autonomous systems under a common administration as a 84 single entity (i.e. from the outside, the internal topology and the 85 existence of separate autonomous systems are not visible). In both 86 cases, the reduction of the iBGP full mesh results in the fact that 87 not all the BGP speakers in the AS have complete visibility of the 88 available exit points into a neighboring AS. In fact, the visibility 89 may be partial and inconsistent depending on the location (and 90 function) of the router in the AS. 
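The effect of partial visibility on the MED step can be illustrated with a minimal sketch. It is not taken from any BGP implementation: the `Route` record, the `best_path` function, and the attribute values are hypothetical, and the decision process is reduced to two steps (lowest MED among routes from the same neighboring AS, then lowest IGP metric to the NEXT_HOP).

```python
from collections import namedtuple

# Hypothetical, simplified route record (an assumption for illustration):
# neighboring AS, MED, and IGP cost to the NEXT_HOP.
Route = namedtuple("Route", "name neighbor_as med igp_cost")

def best_path(routes):
    """Reduced decision process, assumed for this sketch:
    1. per neighboring AS, keep only the route with the lowest MED
       (ties broken by IGP cost);
    2. among the survivors, pick the lowest IGP cost."""
    survivors = {}
    for r in routes:
        cur = survivors.get(r.neighbor_as)
        if cur is None or (r.med, r.igp_cost) < (cur.med, cur.igp_cost):
            survivors[r.neighbor_as] = r
    return min(survivors.values(), key=lambda r: r.igp_cost)

# Illustrative values: A and B come from the same neighboring AS, C from another.
a = Route("A", neighbor_as=6, med=0, igp_cost=13)
b = Route("B", neighbor_as=6, med=1, igp_cost=4)
c = Route("C", neighbor_as=10, med=10, igp_cost=5)

full_view = best_path([a, b, c])   # A eliminates B on MED, then C beats A on IGP cost
partial_view = best_path([b, c])   # A is not visible, so B beats C on IGP cost
print(full_view.name, partial_view.name)
```

In this sketch a router that sees all three routes selects C, while a router that does not see A selects B. The presence or absence of a single route from one neighboring AS changes which route from a different AS wins, which is the kind of inconsistency that partial visibility makes possible.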
92 In certain topologies involving either route reflectors or 93 confederations (described in detail later in this document), the 94 partial visibility of the available exit points into a neighboring AS 95 may result in an inconsistent best path selection decision as the 96 routers don't have all the relevant information. If the 97 inconsistencies span more than one peering router, they may result in 98 a persistent route oscillation. The best path selection rules 99 applied in this document are consistent with the current 100 specification [5]. 102 The persistent route oscillation behavior is deterministic and can be 103 avoided by employing some rudimentary BGP network design principles 104 until protocol enhancements resolve the problem. 106 In the following sections a taxonomy of the types of oscillations is 107 presented and a description of the set of conditions that will 108 trigger route oscillations is given. We continue by providing 109 several network design alternatives that remove the potential for 110 this to occur. 112 It is the intent of the authors that this document serve to increase 113 operator awareness of the problem, as well as to trigger discussion 114 and subsequent proposals for potential protocol enhancements that 115 remove the possibility for this to occur. 117 The oscillations are classified into Type I and Type II depending 118 upon criteria documented below. 120 4. Type I Discussion 122 In the following two subsections we provide configurations under 123 which Type I Churn will occur. We begin with a discussion of the 124 problem when using Route Reflection, and then discuss the problem as 125 it relates to AS Confederations. 
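The Route Reflection case (Figure 1 and Table 1) can be approximated with a small simulation. This is a sketch under stated assumptions, not an implementation of BGP: the names `Route`, `best_path`, `reflect` and `simulate` are hypothetical, the decision process is reduced to the MED and IGP metric steps, Ra and Rd take turns recomputing their best path, and a Route Reflector is assumed to advertise its best path to the other Route Reflector only when that path was learned from one of its own clients. The MED and IGP values are those shown in Figure 1.

```python
from collections import namedtuple

Route = namedtuple("Route", "as_path neighbor_as med igp_cost from_client")

def best_path(routes):
    """Reduced decision process (assumed): lowest MED per neighboring AS,
    then lowest IGP cost to the NEXT_HOP among the survivors."""
    survivors = {}
    for r in routes:
        cur = survivors.get(r.neighbor_as)
        if cur is None or (r.med, r.igp_cost) < (cur.med, cur.igp_cost):
            survivors[r.neighbor_as] = r
    return min(survivors.values(), key=lambda r: r.igp_cost)

# Client-learned routes, with values read from Figure 1 and Table 1.
RA_CLIENT_ROUTES = [Route("6 100", 6, 1, 4, True),     # via Rc
                    Route("10 100", 10, 10, 5, True)]  # via Rb
RD_CLIENT_ROUTES = [Route("6 100", 6, 0, 12, True)]    # via Re

def reflect(best, inter_cluster_cost):
    """Route one RR advertises to the other: its best path if client-learned,
    otherwise nothing (modeling a withdraw)."""
    if not best.from_client:
        return None
    return best._replace(igp_cost=best.igp_cost + inter_cluster_cost,
                         from_client=False)

def simulate(inter_cluster_cost=1, rounds=6):
    """Return the sequence of best paths selected by Ra, one per round."""
    from_rd = None              # route Rd currently advertises to Ra
    ra_history = []
    for _ in range(rounds):
        ra_best = best_path(RA_CLIENT_ROUTES + ([from_rd] if from_rd else []))
        from_ra = reflect(ra_best, inter_cluster_cost)
        rd_best = best_path(RD_CLIENT_ROUTES + ([from_ra] if from_ra else []))
        from_rd = reflect(rd_best, inter_cluster_cost)
        ra_history.append(ra_best.as_path)
    return ra_history

print(simulate())                        # Ra keeps alternating between two paths
print(simulate(inter_cluster_cost=20))   # Ra settles once the Ra-Rd metric is large
```

With the Ra-Rd metric of 1 from Figure 1, Ra's selection in this sketch alternates between the '6 100' and '10 100' paths indefinitely. With a much larger inter-cluster metric, the same model settles after the second round.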
127 In general, Type I Churn occurs only when BOTH of the following 128 conditions are met: 130 1) a single-level Route Reflection or AS Confederations 131 design is used in the network AND 133 2) the network accepts the BGP MULTI_EXIT_DISC (MED) 134 attribute from two or more ASs for a single prefix 135 and the MED values are unique. 137 It is also possible for the non-deterministic ordering of paths to 138 cause the route oscillation problem. [1] does not specify that paths 139 should be ordered based on MEDs but it has been proven that non- 140 deterministic ordering can lead to loops and inconsistent routing 141 decisions. Most vendors have either implemented deterministic 142 ordering as default behavior, or provide a knob that permits the 143 operator to configure the router to order paths in a deterministic 144 manner based on MEDs. 146 4.1. Route Reflection and Type I Churn 148 We now discuss Type I oscillation as it relates to Route Reflection. 149 To begin, consider the topology depicted in Figure 1: 151 --------------------------------------------------------------- 152 / -------------------- -------------------- \ 153 | / \ / \ | 154 | | Cluster 1 | | Cluster 2 | | 155 | | | | | | 156 | | | *1 | | | 157 | | Ra(RR) . . . . . . . . . . . . . . Rd(RR) | | 158 | | . . | | . | | 159 | | .*5 .*4 | | .*12 | | 160 | | . . | | . | | 161 | | Rb(C) Rc(C) | | Re(C) | | 162 | | . . | | . | | 163 | \ . . / \ . / | 164 | ---.------------.--- ---------.---------- | 165 \ .(10) .(1) AS1 .(0) / 166 -------.------------.---------------------------.-------------- 167 . . . 168 ------ . ------------ . 169 / \ . / \ . 170 | AS10 | | AS6 | 171 \ / \ / 172 ------ ------------ 173 . . 174 . . 175 . -------------- 176 . / \ 177 | AS100 |- 10.0.0.0/8 178 \ / 179 -------------- 181 Figure 1: Example Route Reflection Topology 183 In Figure 1 AS1 contains two Route Reflector Clusters, Clusters 1 and 184 2. 
Each Cluster contains one Route Reflector (RR) (i.e., Ra and Rd, 185 respectively). Each RR is marked with an 'RR' in parentheses. 186 Cluster 1 contains two RR Clients (Rb and Rc), and Cluster 2 contains 187 one RR Client (Re). An associated 'C' in parentheses indicates RR 188 Client status. The dotted lines are used to represent BGP peering 189 sessions. 191 The number contained in parentheses on each AS1 EBGP peering session 192 represents the MED value advertised by the peer to be associated with 193 the 10.0.0.0/8 network reachability advertisement. 195 The number following each '*' on the IBGP peering sessions 196 represents the additive IGP metrics that are to be associated with the BGP 197 NEXT_HOP attribute for the concerned route. For example, the Ra IGP 198 metric value associated with a NEXT_HOP learned via Rb would be 5; 199 while the metric value associated with the NEXT_HOP learned via Re 200 would be 13. 202 Table 1 depicts the 10.0.0.0/8 route attributes as seen by routers 203 Rb, Rc and Re, respectively. Note that the IGP metrics in Figure 1 204 are only of concern when advertising the route to an IBGP peer. 206 Router MED AS_PATH 207 -------------------- 208 Rb 10 10 100 209 Rc 1 6 100 210 Re 0 6 100 212 Table 1: Route Attribute Table 214 For the following steps 1 through 5 the best path will be marked with 215 a '*'. 217 1) Ra has the following installed in its BGP table with 218 the path learned via AS10 marked best: 220 NEXT_HOP 221 AS_PATH MED IGP Cost 222 ----------------------- 223 6 100 1 4 224 * 10 100 10 5 226 The '10 100' route should not be marked as best, though 227 this is not the cause of the persistent route oscillation. 228 Ra realizes it has the wrong route marked as best since the 229 '6 100' path has a lower IGP metric. As such, Ra makes this 230 change and advertises an UPDATE message to its neighbors to 231 let them know that it now considers the '6 100, 1, 4' route 232 as best. 
234 2) Rd receives the UPDATE from Ra, which leaves Rd with the 235 following installed in its BGP table: 237 NEXT_HOP 238 AS_PATH MED IGP Cost 239 ----------------------- 240 * 6 100 0 12 241 6 100 1 5 243 Rd then marks the '6 100, 0, 12' route as best because it has 244 a lower MED. Rd sends an UPDATE message to its neighbors to 245 let them know that this is the best route. 247 3) Ra receives the UPDATE message from Rd and now has the 248 following in its BGP table: 250 NEXT_HOP 251 AS_PATH MED IGP Cost 252 ----------------------- 253 6 100 0 13 254 6 100 1 4 255 * 10 100 10 5 257 The first route (6 100, 0, 13) beats the second route (6 100, 258 1, 4) because of lower MED, then the third route (10 100, 10, 259 5) beats the first route because of lower IGP metric to 260 NEXT_HOP. Ra sends an UPDATE message to its peers to let them 261 know its new best route. 263 4) Rd receives the UPDATE message from Ra, which leaves Rd with the 264 following BGP table: 266 NEXT_HOP 267 AS_PATH MED IGP Cost 268 ----------------------- 269 6 100 0 12 270 * 10 100 10 6 272 Rd selects the '10 100, 10, 6' path as best because of the IGP 273 metric. Rd sends an UPDATE/withdraw to its peers to let them 274 know this is its best route. 276 5) Ra receives the UPDATE message from Rd, which leaves Ra with the 277 following BGP table: 279 NEXT_HOP 280 AS_PATH MED IGP Cost 281 ----------------------- 282 6 100 1 4 283 * 10 100 10 5 285 Ra received a withdraw for '6 100, 0, 13', which changes what is 286 considered the best route for Ra. 287 This is why Ra has the '10 100, 10, 5' route selected as best in 288 Step 1, even though '6 100, 1, 4' is actually better. 290 At this point, we've made a full loop and are back at Step 1. The 291 router realizes it is using the incorrect best path, and the cycle 292 repeats. This is an example of Type I Churn when using Route Reflec- 293 tion. 295 4.2. 
AS Confederations and Type I Churn 297 We'll now provide an example of Type I Churn occurring with AS 298 Confederations. To begin, consider the topology depicted in Figure 2: 300 --------------------------------------------------------------- 301 / -------------------- -------------------- \ 302 | / \ / \ | 303 | | Sub-AS 65000 | | Sub-AS 65001 | | 304 | | | | | | 305 | | | *1 | | | 306 | | Ra . . . . . . . . . . . . . . . . . Rd | | 307 | | . . | | . | | 308 | | .*3 .*2 | | .*6 | | 309 | | . . | | . | | 310 | | Rb . . . . . Rc | | Re | | 311 | | . *5 . | | . | | 312 | \ . . / \ . / | 313 | ---.------------.--- ---------.---------- | 314 \ .(10) .(1) AS1 .(0) / 315 -------.------------.---------------------------.-------------- 316 . . . 317 ------ . ------------ . 318 / \ . / \ . 319 | AS10 | | AS6 | 320 \ / \ / 321 ------ ------------ 322 . . 323 . . 324 . -------------- 325 . / \ 326 | AS100 |- 10.0.0.0/8 327 \ / 328 -------------- 330 Figure 2: Example AS Confederations Topology 332 The number following each '*' on the BGP peering sessions represents 333 the additive IGP metrics that are to be associated with the BGP 334 NEXT_HOP attribute for the concerned route. The number contained in parentheses on each AS1 EBGP 335 peering session represents the MED value advertised by the peer to be 336 associated with the 10.0.0.0/8 network reachability advertisement. 346 For example, the Ra IGP metric value associated with a NEXT_HOP 347 learned via Rb would be 3; while the metric value associated with the 348 NEXT_HOP learned via Re would be 7. 
350 Table 2 depicts the 10.0.0.0/8 route attributes as seen by routers 351 Rb, Rc and Re, respectively. Note that the IGP metrics in Figure 2 352 are only of concern when advertising the route to an IBGP peer. 354 Router MED AS_PATH 355 -------------------- 356 Rb 10 10 100 357 Rc 1 6 100 358 Re 0 6 100 360 Table 2: Route Attribute Table 362 For the following steps 1 through 6 the best route will be marked 363 with a '*'. 365 1) Ra has the following BGP table: 367 NEXT_HOP 368 AS_PATH MED IGP Cost 369 ------------------------------- 370 * 10 100 10 3 371 (65001) 6 100 0 7 372 6 100 1 2 374 The '10 100' route is selected as best and advertised to 375 Rd, though this is not the cause of the persistent route 376 oscillation. 378 2) Rd has the following in its BGP table: 380 NEXT_HOP 381 AS_PATH MED IGP Cost 382 ------------------------------- 383 6 100 0 6 384 * (65000) 10 100 10 4 386 The '(65000) 10 100' route is selected as best because it has 387 the lowest IGP metric. As a result, Rd sends an UPDATE/withdraw 388 to Ra for the '6 100' route that it had previously advertised. 390 3) Ra receives the withdraw from Rd. Ra now has the following in 391 its BGP table: 393 NEXT_HOP 394 AS_PATH MED IGP Cost 395 ------------------------------- 396 * 10 100 10 3 397 6 100 1 2 399 Ra received a withdraw for '(65001) 6 100', which changes what 400 is considered the best route for Ra. Ra does not compute the 401 best path for a prefix unless its best route was withdrawn. 402 This is why Ra has the '10 100, 10, 3' route selected as best, 403 even though the '6 100, 1, 2' route is better. 405 4) Ra realizes that the '6 100' route is better because of the 406 lower IGP metric. Ra sends an UPDATE/withdraw to Rd for the '10 407 100' route since Ra is now using the '6 100' path as its best 408 route. 
410 Ra's BGP table looks like this: 412 NEXT_HOP 413 AS_PATH MED IGP Cost 414 ------------------------------- 415 10 100 10 3 416 * 6 100 1 2 418 5) Rd receives the UPDATE from Ra and now has the following in 419 its BGP table: 421 NEXT_HOP 422 AS_PATH MED IGP Cost 423 ------------------------------- 424 (65000) 6 100 1 3 425 * 6 100 0 6 427 Rd selects the '6 100, 0, 6' route as best because of the lower 428 MED value. Rd sends an UPDATE message to Ra, reporting that 429 '6 100, 0, 6' is now its best route. 431 6) Ra receives the UPDATE from Rd. Ra now has the following in its 432 BGP table: 434 NEXT_HOP 435 AS_PATH MED IGP Cost 436 ------------------------------- 437 * 10 100 10 3 438 (65001) 6 100 0 7 439 6 100 1 2 441 At this point we have made a full cycle and are back to step 1. This 442 is an example of Type I Churn with AS Confederations. 444 4.3. Potential Workarounds for Type I Churn 446 There are a number of alternatives that can be employed to provide 447 workarounds to this problem: 449 1) When using Route Reflection make sure that the inter-Cluster 450 links have a higher IGP metric than the intra-Cluster links. 451 This is the preferred choice when using Route Reflection. Had 452 the inter-Cluster IGP metrics been much larger than the intra- 453 Cluster IGP metrics, the above would not have occurred. 455 2) When using AS Confederations ensure that the inter-Sub-AS 456 links have a higher IGP metric than the intra-Sub-AS links. 457 This is the preferred option when using AS Confederations. 458 Had the inter-Sub-AS IGP metrics been much larger than the 459 intra-Sub-AS IGP metrics, the above would not have occurred. 461 3) Do not accept MEDs from peers (this may not be a feasible 462 alternative). 464 4) Utilize other BGP attributes higher in the decision process 465 so that the BGP decision algorithm never reaches the MED 466 step. Because this approach completely overrides MEDs, Option 3 may make 467 more sense. 
469 5) Always compare BGP MEDs, regardless of whether or not they were 470 obtained from a single AS. This is probably a bad idea since 471 MEDs may be derived in a number of ways, and are typically done 472 so as a matter of operator-specific policy. As such, comparing 473 MED values for a single prefix learned from multiple ASs is 474 ill-advised. Of course, this mostly defeats the purpose of MEDs, 475 and as such, Option 3 may be a more viable alternative. 477 6) Use a full IBGP mesh. This is not a feasible solution for 478 ASs with a large number of BGP speakers. 480 5. Type II Discussion 482 In the following subsection we provide configurations under which 483 Type II Churn will occur when using AS Confederations. For sake of 484 brevity, we avoid similar discussion of the occurrence when using 485 Route Reflection. 487 In general, Type II churn occurs only when BOTH of the following con- 488 ditions are met: 490 1) More than one tier of Route Reflection or Sub-ASs 491 is used in the network AND 493 2) the network accepts the BGP MULTI_EXIT_DISC (MED) 494 attribute from two or more ASs for a single prefix 495 and the MED values are unique. 497 5.1. AS Confederations and Type II Churn 499 Let's now examine the occurrence of Type II Churn as it relates to AS 500 Confederations. Figure 3 provides our sample topology: 502 --------------------------------------------------------------- 503 / -------------------- \ 504 | AS N / Sub-AS 65500 \ | 505 | | | | 506 | | Rc . . . . Rd | | 507 | | . *2 . | | 508 | \ . . / | 509 | -.---------------.-- | 510 | .*40 .*40 | 511 | --------------.----- .------------------- | 512 | / . \ / . \ | 513 | | Sub-AS . | | . Sub-AS | | 514 | | 65501 . | | . 65502 | | 515 | | Rb | | Re | | 516 | | . | | . . | | 517 | | .*10 | | *3. .*2 | | 518 | | . | | . . | | 519 | | Ra . | | . Rf . . . Rg | | 520 | \ . / . . / | 521 | -----------------.--- . -----------.--------- | 522 \ (0) . 
.() .(1) / 523 ---------------------------.----.---------------.-------------- 524 . . 525 ------ . . ------------ 526 |AS X| | AS Y | 527 ------ ------------ 529 Figure 3: Example AS Confederations Topology 531 In Figure 3 AS N contains three Sub-ASs, 65500, 65501 and 532 65502. No RR is used within any Sub-AS, and as such, all routers 533 within each Sub-AS are fully meshed. Ra and Rb are members of Sub-AS 534 65501. Rc and Rd are members of Sub-AS 65500. Ra and Rg have EBGP 535 peerings with AS Y; router Rf has an EBGP peering with AS X. The 536 dotted lines are used to represent BGP peering sessions. 538 The number following each '*' on the BGP peering sessions 539 represents the additive IGP metrics that are to be associated with 540 the BGP NEXT_HOP. The number contained in parentheses on each AS N 541 EBGP peering session represents the MED value advertised by the peer 542 to be associated with the network reachability advertisement(s). 544 Rc, Rd and Re are the primary routers involved in the churn, and as 545 such, theirs are the only BGP tables that we will monitor step by step. 547 For the following steps 1 through 8 each router's best route will be 548 marked with a '*'. 550 1) Re receives the 'X' and 'Y1' paths. Re selects 'Y1' because of 551 IGP metric. 553 NEXT_HOP 554 Router AS_PATH MED IGP Cost 555 ------------------------------ 556 Re X 3 557 * Y 1 2 559 Re will advertise its new best path to Rd. 561 2) The 'Y0' path was passed from Ra to Rb, and then from Rb 562 to Rc. Rd learns the 'Y1' path from Re. Rc selects 'Y0', 563 Rd selects 'Y1'. 565 NEXT_HOP 566 Router AS_PATH MED IGP Cost 567 ------------------------------- 568 Rc * Y 0 50 569 Rd * Y 1 42 570 Re X 3 571 * Y 1 2 573 3) Rc and Rd advertise their best paths to each other; 574 Rd selects 'Y0' because of MED. 
576 NEXT_HOP 577 Router AS_PATH MED IGP Cost 578 ------------------------------ 579 Rc * Y 0 50 580 Y 1 44 581 Rd * Y 0 52 582 Y 1 42 583 Re X 3 584 * Y 1 2 586 Rd has a new best path so it will send an advertisement 587 to Re and send a withdraw for 'Y1' to Rc. 589 4) Re selects 'X': 'Y0' beats 'Y1' because of the MED, and 590 'X' beats 'Y0' because of IGP metric. 592 NEXT_HOP 593 Router AS_PATH MED IGP Cost 594 ------------------------------ 595 Rc * Y 0 50 596 Rd * Y 0 52 597 Y 1 42 598 Re * X 3 599 Y 0 92 601 5) Rd selects 'X' because of IGP metric. 603 NEXT_HOP 604 Router AS_PATH MED IGP Cost 605 ------------------------------ 606 Rc * Y 0 50 607 Rd Y 0 52 608 * X 43 609 Re * X 3 610 Y 0 92 611 Y 1 2 613 Rd has a new best path so it will send an UPDATE to Rc 614 and an UPDATE/withdraw to Re for 'Y0'. 616 6) Rc selects 'X' because of IGP metric. Re selects 'Y1' 617 because of IGP metric. 619 NEXT_HOP 620 Router AS_PATH MED IGP Cost 621 ------------------------------ 622 Rc Y 0 50 623 * X 45 624 Rd Y 0 52 625 * X 43 626 Re X 3 627 * Y 1 2 629 7) Rd selects 'Y1'. 631 NEXT_HOP 632 Router AS_PATH MED IGP Cost 633 ------------------------------ 634 Rc Y 0 50 635 * X 45 636 Rd * Y 1 42 637 Re X 3 638 * Y 1 2 640 8) Rc selects 'Y0'. 642 NEXT_HOP 643 Router AS_PATH MED IGP Cost 644 ------------------------------ 645 Rc * Y 0 50 646 Y 1 44 647 Rd * Y 1 42 648 Re X 3 649 * Y 1 2 651 At this point we are back to Step 2 and are in a loop. 653 5.2. Potential Workarounds for Type II Churn 655 1) Do not accept MEDs from peers (this may not be a feasible 656 alternative). 658 2) Utilize other BGP attributes higher in the decision process so 659 that the BGP decision algorithm selects a single AS before it 660 reaches the MED step. For example, if local-pref were set based 661 on the advertising AS, then all routes except those from a single AS 662 would be eliminated first. In the example, router Re 663 would pick either X or Y based on local-pref and never change 664 that selection. 
666 This leaves two simple workarounds for the two types of problems. 668 Type I: Make inter-cluster or inter-sub-AS link metrics higher 669 than intra-cluster or intra-sub-AS metrics. 671 Type II: Make route selections based on local-pref assigned to 672 the advertising AS first, and then use IGP cost and MED 673 to select among routes from the same AS. 675 Note that this requires per-prefix policies, as well as 676 near-intimate knowledge of other networks by the network operator. 677 The authors are not aware of ANY [large] provider today that 678 applies per-prefix policies to routes learned from peers. 679 Implicitly removing this dynamic portion of route selection 680 does not appear to be a viable option in today's networks. 681 The main point is that a workaround is available: using 682 local-pref so that no two ASs advertise a given prefix at the same 683 local-pref solves Type II Churn. 685 3) Always compare BGP MEDs, regardless of whether or not they were 686 obtained from a single AS. This is probably a bad idea since 687 MEDs may be derived in a number of ways; the derivation is typically 688 a matter of operator-specific policy and largely a function 689 of the available metric space provided by the employed IGP. As such, 690 comparing MED values for a single prefix learned from multiple 691 ASs is ill-advised. This mostly defeats the purpose of MEDs; 692 Option 1 may be a more viable alternative. 694 4) Do not use more than one tier of Route Reflection or Sub-ASs 695 in the network. The risk of route oscillation should be 696 considered when designing networks that might use a multi-tiered 697 routing isolation architecture. 699 5) In an RR topology, mesh the clients. For confederations, mesh 700 the border routers at each level in the hierarchy. In 701 Figure 3, for example, if Rb and Re are peers, then there's 702 no churn. 704 Future drafts will propose other solutions for Type II Churn. 706 6. 
Future Work 708 It should be stated that protocol enhancements regarding this problem 709 must be pursued. Imposing network design requirements such as those 710 outlined above is clearly an unreasonable long-term solution. 711 Problems such as this should not occur under 'default' configurations. 713 7. Security Considerations 715 This discussion introduces no new security concerns to BGP or other 716 specifications referenced in this document. 718 8. Acknowledgments 720 The authors would like to thank Curtis Villamizar, Tim Griffin, John 721 Scudder and Ron Da Silva. 723 9. References 725 [1] Rekhter, Y., and T. Li, "A Border Gateway Protocol 4 (BGP-4)", 726 RFC 1771, March 1995. 728 [2] Bates, T., Chandra, R., and E. Chen, "BGP Route Reflection - An 729 Alternative to Full Mesh IBGP", RFC 2796, April 2000. 731 [3] Traina, P., McPherson, D., and J. Scudder, "Autonomous System 732 Confederations for BGP", RFC 1965bis, "Work In Progress", 733 October 2000. 735 [4] Cisco Systems, Inc., "Endless BGP Convergence Problem in Cisco 736 IOS Software Releases", FN, October 10, 2000. 738 [5] Rekhter, Y., and T. Li, "A Border Gateway Protocol 4 (BGP-4)", 739 Work in Progress (draft-ietf-idr-bgp4-12.txt), January 2001. 741 10. Authors' Addresses 743 Danny McPherson 744 Amber Networks, Inc. 745 48664 Milmont Drive 746 Fremont, CA 94538 747 Email: danny@ambernetworks.com 749 Vijay Gill 750 Metromedia Fiber Network, Inc. 751 8075 Leesburg Pike, STE 3 752 Vienna, VA, 22182 753 Email: vijay@umbc.edu 755 Daniel Walton 756 Cisco Systems, Inc. 757 7025 Kit Creek Rd. 758 Research Triangle Park, NC 27709 759 Email: dwalton@cisco.com 761 Alvaro Retana 762 Cisco Systems, Inc. 763 7025 Kit Creek Rd. 764 Research Triangle Park, NC 27709 765 Email: aretana@cisco.com