idnits 2.17.1 draft-bhatia-bgp-multiple-next-hops-01.txt: -(93): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 18. -- Found old boilerplate from RFC 3978, Section 5.5 on line 796. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 773. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 780. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 786. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == There are 6 instances of lines with non-ascii characters in the document. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an Authors' Addresses Section. == There are 9 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 785 has weird spacing: '...IETF at ietf...' == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: As such, speakers implementing the MULTIPLE_NEXT_HOP capability MUST not send additional paths, beyond the single best path allowed by BGP-4 [BGP4], unless the remote speaker has indicated its preparedness with the RM bit. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (August 2006) is 6464 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'KEYWORDS' is mentioned on line 58, but not defined == Missing Reference: 'IANA-AFI' is mentioned on line 174, but not defined == Missing Reference: 'IANA-SAFI' is mentioned on line 181, but not defined == Unused Reference: 'BGP-CAP' is defined on line 468, but no explicit reference was found in the text == Unused Reference: 'RFC2119' is defined on line 486, but no explicit reference was found in the text ** Obsolete normative reference: RFC 3392 (ref. 'BGP-CAP') (Obsoleted by RFC 5492) Summary: 5 errors (**), 0 flaws (~~), 12 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Draft August 2006 3 Network Working Group Manav Bhatia 4 Internet Draft Lucent Technologies 5 Joel M. Halpern 6 Paul Jakma 7 Expires: January 2007 Sun Microsystems 9 Advertising Multiple NextHop Routes in BGP 11 draft-bhatia-bgp-multiple-next-hops-01.txt 13 Status of this Memo 15 By submitting this Internet-Draft, each author represents that any 16 applicable patent or other IPR claims of which he or she is aware 17 have been or will be disclosed, and any of which he or she becomes 18 aware will be disclosed, in accordance with Section 6 of BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF), its areas, and its working groups. Note that 22 other groups may also distribute working documents as Internet- 23 Drafts. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. 
It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 The list of current Internet-Drafts can be accessed at 31 http://www.ietf.org/ietf/1id-abstracts.txt 33 The list of Internet-Draft Shadow Directories can be accessed at 34 http://www.ietf.org/shadow.html. 36 This Internet draft will expire on August 2006 38 Copyright Notice 40 Copyright (C) The Internet Society (2006). 42 Abstract 44 This document describes an extensible mechanism that allows a BGP 45 speaker to advertise multiple BGP paths for a destination to its 46 peers, by describing a new BGP capability, termed "Multiple-Hop 47 Capability". 49 The mechanisms described in this document are applicable to all 50 routers, both those with the ability to inject multiple routing 51 entries in their forwarding table and those without. 53 Conventions used in this document 55 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 56 "SHOULD", "SHOULD NOT", "RECOMMENDED","MAY", and "OPTIONAL" in this 57 document are to be interpreted as described in RFC 2119 [KEYWORDS] 59 Table of Contents 61 1. Introduction...................................................2 62 2. Multiple-Hop Capability........................................3 63 2.1 Multiple-Hop attribute - MULTIPLE_HOP......................5 64 3. Operation when both peers are Multiple-Hop capable.............6 65 3.1 Advertisement of Multiple-Hop BGP routes...................7 66 3.2 Withdrawal Procedures......................................7 67 3.3 Procedures for the Receiving Speaker.......................8 68 3.4 Working with Multiple-Hop capable IBGP peers...............8 69 3.5 Implicit Withdrawal for one of the Next-Hops...............9 70 4. Multiprotocol Extensions to BGP................................9 71 5. Security Considerations.......................................10 72 6. Acknowledgements..............................................10 73 7. 
IANA Considerations...........................................10
74      8. References....................................................10
75         8.1 Normative References......................................10
76         8.2 Informative References....................................11
77      9. Appendix A....................................................11
78         9.1 Suboptimal Routing in Route Reflector clients.............11
79         9.2 Avoiding Persistent Route Oscillations....................12
80         9.3 eBGP mesh scaling at IXes via Route Servers...............15
81         9.4 Advertising a subset of routes in BGP.....................15
82         9.5 Equal Cost Multiple Path BGP..............................16
83      10. Author’s Address.............................................16
84      11. Intellectual Property Statement..............................17

86   1. Introduction

88      Currently BGP [BGP4] speakers cannot announce multiple paths, even if
89      it is desirable in certain scenarios. This is because the BGP
90      specification allows only one "best" route to be inserted into the
91      Loc-RIB, and to be announced to other BGP speakers. If another route
92      for a destination that has previously been announced to a BGP peer,
93      is sent later, then the receiver “implicitly withdraws” the former
94      route and replaces it with the new one.

96      Because of this behavior, BGP speakers are never able to advertise
97      multiple paths for the same destination to their peers.
99 Lifting this restriction would have benefit for at least the 100 following scenarios in BGP: 102 o Persistent route-oscillation conditions in BGP [MED] 104 o eBGP mesh scaling at Internet Exchanges 106 o Interaction between ECMP capable BGP speakers 108 The first concerns route-reflectors [RR], where in certain 109 topologies, persistent route-oscillation conditions can arise due to 110 the clients of route-reflectors being never fully informed of each 111 others best paths, particularly where MED/Router ID values are 112 considered as part of the best-path selection. If BGP were to 113 provide a means to allow route-reflectors to share all the collective 114 best-paths with its clients, then these conditions could be 115 alleviated, as has been shown in the Appendix. 117 The second concerns scaling of eBGP meshes at Internet Exchanges 118 (referred to as an IX from now on, or IXes in the plural). IX 119 operators have deployed eBGP route-servers, in a variety of guises, 120 in order to reduce the need for customers to establish direct 121 sessions with other customers. These route-servers however have 122 severe limitations because of the single-path restriction in BGP. 123 Removing this limitation would allow for efficient deployment of IX 124 route-servers. 126 The third concerns BGP implementations which are capable of 127 considering multiple routes for inclusion into their RIB, and hence 128 likely their FIB, but do not have a way to relay the full resulting 129 state of their BGP RIB to their peers. 131 This document specifies the mechanism by which Multiple-Hop operates; 132 however it will not attempt to fully describe the usages. In 133 particular this document anticipates that the ECMP scenario will be 134 described fully in another document, as it would have to be even if 135 documented without consideration of the Multiple-Hop capability. 
137     It is anticipated however that any speaker implementing the
138     functionality described in this document would be able to
139     interoperate with Multiple-Hop capable route-servers and route-
140     reflectors, just as BGP speakers interoperate with Route-Reflectors
141     in the absence of the Multiple-Hop capability.

143  2. Multiple-Hop Capability

145     Multiple Hop capability is a new capability that can be used by a BGP
146     speaker to indicate its ability to understand Multiple-Hop Updates
147     from a remote peer.

149     This capability is defined as follows:

151        Capability Code: TBD

153        Capability Length: Variable

155        Capability Values: Consists of one or more of the tuples as follows:


158        +--------------------------------------------------+
159        | Address Family Identifier (16 bits)              |
160        +--------------------------------------------------+
161        | Subsequent Address Family Identifier (8 bits)    |
162        +--------------------------------------------------+
163        | Flags for the Address Family (8 bits)            |
164        +--------------------------------------------------+

166                                Figure 1

168     The use and meaning of the fields are as follows:

170        Address Family Identifier

172           This field carries the identity of the Network Layer protocol
173           for which the Multiple Hop support is advertised. Presently
174           defined values for this field are specified in [IANA-AFI].

176        Subsequent Address Family Identifier (SAFI):

178           This field provides additional information about the type of
179           the Network Layer Reachability Information carried in the
180           attribute. Presently defined values for this field are specified
181           in [IANA-SAFI].

183        Flags for Address Family:

185           This field contains bit flags for the address family.

187               0 1 2 3 4 5 6 7
188              +-+-+-+-+-+-+-+--+
189              |R|R|R|R|R|R|R|RM|
190              +-+-+-+-+-+-+-+--+

192           R Reserved:

194           MUST be set to zero by the sender and ignored by the receiver.
196           RM Receive Multiple

198           Indicates that the speaker is interested in receiving additional
199           BGP paths, other than just the best path from the receiver.

201           A speaker sets this bit in its MULTIPLE_NEXT_HOP capability to
202           indicate that it is prepared to receive additional path
203           advertisements, beyond just the best path, by way of the
204           MULTIPLE_NEXT_HOP capability.

206           As such, speakers implementing the MULTIPLE_NEXT_HOP capability
207           MUST not send additional paths, beyond the single best path
208           allowed by BGP-4 [BGP4], unless the remote speaker has
209           indicated its preparedness with the RM bit.

211  2.1 Multiple-Hop attribute - MULTIPLE_HOP

213     This attribute is an optional, non-transitive attribute that can be
214     used for advertising multiple next-hops associated with an NLRI.

216     The attribute data contains one or more tuples of (AFI, SAFI, List
217     of Next Hop Information), where each tuple is encoded as shown
218     below:

220        +------------------------------------------------+
221        | Address Family Identifier (2 octets)           |
222        +------------------------------------------------+
223        | Subsequent Address Family Identifier (1 octet) |
224        +------------------------------------------------+
225        | Number of Next Hops (1 octet)                  |
226        +------------------------------------------------+
227        | Length of the First Next Hop (1 octet)         |
228        +------------------------------------------------+
229        | Network Address of First Next Hop (variable)   |
230        +------------------------------------------------+
231        | Length of the Second Next Hop (1 octet)        |
232        +------------------------------------------------+
233        | Network Address of Second Next Hop (variable)  |
234        +------------------------------------------------+
235        |                    .  .  .                     |
236        |                    .  .  .                     |
237        +------------------------------------------------+
238        | Length of the Nth Next Hop (1 octet)           |
239        +------------------------------------------------+
240        | Network Address of Nth Next Hop (variable)     |
241        +------------------------------------------------+

243                                Figure 2

245     The various fields are defined as follows:

247     Address Family Identifier: The AFI field carries the identity of
248     the Network Layer protocol associated with the Network Address
249     that follows.

251     Subsequent Address Family Identifier: The SAFI field in
252     combination with the Address Family Identifier field identifies
253     the Network Layer context associated with the Network Address of
254     the Next Hop(s).

256     Number of Next-Hops: This field carries the total number of Multiple-
257     Hop BGP routes for the given NLRI.

259     Length of Nth Next Hop Network Address: A 1 octet field whose value
260     expresses the length of the "Network Address of Next Hop" field as
261     measured in octets. For IPv6 routes the value shall be set to 16,
262     when only a global address is present, or 32 if a link-local
263     address is also included in the Next Hop field [BGP-IPv6].

265     Network Address of Nth Next Hop: This is a variable length field that
266     contains the Network Address of the next router on the path to the
267     destination.

269     The N next-hops listed in the MULTIPLE_HOP path attribute define the
270     Network Layer address of the routers that should be used as next-hops
271     to the destinations listed in the UPDATE message.

273  3. Operation when both peers are Multiple-Hop capable

275     In the following sections, "Local speaker" refers to a router which
276     is advertising the BGP Multiple-Hop routes, and the "Receiving
277     Speaker" refers to a router that peers with the former to accept
278     multiple BGP routes for a destination.

280     Consider that the Multiple-Hop Capability has been exchanged between
281     the Local speaker and the Receiving speaker, and a BGP session
282     between them is established.
The following sections detail the 283 procedures that shall be followed by the Local speaker as well as the 284 Receiving speaker once the Multiple-Hop capability has been 285 exchanged, and the local speaker wants to advertise some BGP 286 Multiple-Hop routes. 288 Note that for operation within the confines of this document and BGP, 289 the local speaker almost certainly will be acting as an eBGP route- 290 server or iBGP route-reflector, with the receiver asserting the RM 291 bit in the Multiple-Hop capability, and therefore acting as a client 292 of that speaker. 294 Other uses, such as ECMP speakers exchanging Multiple-Hop routes will 295 require further consideration, not addressed in this document as 296 stated previously, considerations not per se related to the Multiple- 297 Hop capability itself. 299 3.1 Advertisement of Multiple-Hop BGP routes 301 The extensions proposed in this draft allow BGP paths to be 302 identified by their NLRI and next-hop address, rather than just by 303 their NLRI. This extended identification is indicated by the 304 presence of the MULTIPLE_HOP attribute. Given that this is used when 305 there are multiple paths sharing NLRI, this attribute allows for the 306 representation of multiple such paths in a single advertisement. 308 Thus between Multiple-Hop capable speakers, the MULTIPLE_HOP 309 attribute MUST be used in addition to the existing NEXT_HOP in order 310 to announce multiple next-hops for the destinations listed in the 311 NLRI field of the UPDATE message. 313 All prefixes announced using this attribute MUST NOT replace the 314 previous advertisements and thus, multiple BGP paths for a prefix can 315 be advertised by the Local Speaker. If the same prefix is later 316 announced with ONLY the NEXT_HOP attribute then it MUST be taken as 317 an implicit withdraw for all the previous paths advertised by that 318 peer for that destination. 
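As an illustration of the encodings shown in Figure 1 and Figure 2, the following Python sketch packs one capability tuple and one MULTIPLE_HOP attribute value. It is a minimal sketch and not part of the draft: the function and constant names are hypothetical, the RM flag is assumed to be the least significant bit of the flags octet (bit 7 in the flags diagram, taking bit 0 as the most significant bit), and only the attribute value is built because this document does not assign an attribute type code.

```python
import ipaddress
import struct

AFI_IPV4 = 1       # IANA address family number for IPv4
SAFI_UNICAST = 1   # IANA SAFI value for unicast forwarding
RM_FLAG = 0x01     # assumption: RM occupies the last bit of the flags octet


def encode_capability_tuple(afi, safi, receive_multiple):
    """Pack one (AFI, SAFI, Flags) tuple as laid out in Figure 1."""
    flags = RM_FLAG if receive_multiple else 0
    return struct.pack("!HBB", afi, safi, flags)


def encode_multiple_hop(afi, safi, next_hops):
    """Pack one (AFI, SAFI, next-hop list) tuple as laid out in Figure 2.

    next_hops is a list of textual IP addresses; each one is written as a
    1-octet length followed by the address in network byte order.
    """
    if not 1 <= len(next_hops) <= 255:
        raise ValueError("the next-hop count must fit in one octet")
    value = struct.pack("!HBB", afi, safi, len(next_hops))
    for hop in next_hops:
        packed = ipaddress.ip_address(hop).packed
        value += struct.pack("!B", len(packed)) + packed
    return value


if __name__ == "__main__":
    cap = encode_capability_tuple(AFI_IPV4, SAFI_UNICAST, receive_multiple=True)
    attr = encode_multiple_hop(AFI_IPV4, SAFI_UNICAST, ["192.0.2.1", "192.0.2.2"])
    print(cap.hex(), attr.hex())
```

The sketch handles one address per next-hop entry; the 32-octet IPv6 form that carries a global and a link-local address together is left out for brevity.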
320 It should be noted that transmission of multiple paths is only valid 321 for the same NLRI that differ on the next-hop. 323 An UPDATE message which contains feasible routes and carries 324 MULTIPLE_HOP and no NEXT_HOP attribute MUST NOT be considered as an 325 implicit withdrawal. The Receiving Speaker MUST append these 326 routes in its Adj-RIBs-In [BGP4], as additional paths to that 327 destination. 329 When advertising multiple paths which do not have identical path 330 attributes, separate BGP UPDATE messages MUST be sent, each with a 331 MULTIPLE_HOP attribute even if there is only one next-hop in each 332 MULTIPLE_HOP attribute. Presence of MULTIPLE_HOP suppresses route 333 replacement at the receiving end. 335 3.2 Withdrawal Procedures 337 An UPDATE message which contains an IP address prefix in the 338 WITHDRAWN ROUTES marks all the associated routes as being no longer 339 available for use. 341 An UPDATE message consisting of an IP address prefix in the NLRI 342 field and only the NEXT_HOP attribute implicitly withdraws all the 343 routes to that address prefix and replaces it with the one advertised 344 by the NEXT_HOP. 346 An UPDATE message which contains an IP address prefix in the 347 WITHDRAWN ROUTES and the MULTIPLE_HOP attribute only removes the path 348 associated with that next-hop. 350 An UPDATE message announced with a MULTIPLE_HOP attribute for a given 351 IP address prefix implicitly withdraws any previous route announced 352 with the same next-hop. 354 3.3 Procedures for the Receiving Speaker 356 The Receiving Speaker upon receiving the MULTIPLE_HOP attribute will 357 understand that the Local Speaker has advertised Multiple-Hop BGP 358 routes. Within a single UPDATE message all the prefixes will have 359 identical attributes, except for the next-hops, which will be carried 360 in the MULTIPLE_HOP attribute. 
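The announcement and withdrawal rules of Sections 3.1 and 3.2 can be sketched as operations on an Adj-RIB-In keyed by prefix and next-hop. The Python below is a minimal sketch under stated assumptions, not an implementation taken from the draft: the class and method names are hypothetical, path attributes are plain dictionaries, and UPDATE parsing is omitted, so the caller passes the already decoded next-hop list from the MULTIPLE_HOP attribute.

```python
class AdjRibIn:
    """Per-peer table of received routes: prefix -> {next_hop: attributes}."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, attrs, next_hop=None, multiple_hops=None):
        if multiple_hops:
            # MULTIPLE_HOP present: additive. Each listed next-hop is added,
            # replacing only an earlier route with that same next-hop.
            paths = self.routes.setdefault(prefix, {})
            for hop in multiple_hops:
                paths[hop] = attrs
        else:
            # Only NEXT_HOP present: implicit withdrawal of every earlier
            # path for the prefix, replaced by this single route.
            self.routes[prefix] = {next_hop: attrs}

    def withdraw(self, prefix, multiple_hops=None):
        if multiple_hops:
            # WITHDRAWN ROUTES plus MULTIPLE_HOP: remove only the listed paths.
            paths = self.routes.get(prefix, {})
            for hop in multiple_hops:
                paths.pop(hop, None)
            if not paths:
                self.routes.pop(prefix, None)
        else:
            # Plain withdrawal: every route for the prefix is removed.
            self.routes.pop(prefix, None)


if __name__ == "__main__":
    rib = AdjRibIn()
    rib.announce("192.0.2.0/24", {"as_path": [64500]}, multiple_hops=["Nj", "Nk"])
    rib.announce("192.0.2.0/24", {"as_path": [64501]}, multiple_hops=["Nl"])
    rib.withdraw("192.0.2.0/24", multiple_hops=["Nk"])
    print(sorted(rib.routes["192.0.2.0/24"]))
```

Keying the inner table by next-hop is what makes a repeated announcement for the same next-hop behave as an implicit withdrawal of the earlier one, while announcements for other next-hops accumulate.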
362     A series of further UPDATE messages for the same NLRI, with or
363     without the same set of attributes and containing the MULTIPLE_HOP
364     attribute will be understood to be additive. Each UPDATE message
365     would append these additional feasible routes to the appropriate
366     Adj-RIBs-In, whereafter the receiving speaker may run its normal
367     decision process to select the best path to install in its Loc-RIB.

369     Upon receiving an UPDATE message for the same NLRI, without the
370     MULTIPLE_HOP attribute, the receiver will consider this as a
371     replacement route for all the previously announced routes to that
372     destination.

374     If the BGP Speaker wants to withdraw all the BGP routes for a
375     particular address prefix then it can send a normal BGP UPDATE
376     message listing the IP address prefix in the WITHDRAWN ROUTES field.
377     The Receiving Speaker upon receiving this message MUST remove all the
378     routes associated with that destination.

380     If the Receiving Speaker receives an UPDATE message with the
381     MULTIPLE_HOP attribute listing both the feasible and the
382     unfeasible routes, then it MUST consider the path attributes for the
383     feasible routes. All the destinations listed in the WITHDRAWN ROUTES
384     MUST be removed as per [BGP4].

386  3.4 Working with Multiple-Hop capable IBGP peers

388     This section explains how the multiple-hop feature will work in the
389     normal scenarios.

391     Assume that the two IBGP speakers A and B exchange this capability.
392     Consider a case where A receives multiple UPDATE messages for NLRI X
393     with next-hops Nj, Nk and Nl. Assume that all these routes are valid
394     and A wants to pass on this set to B. Also assume that Nj and Nk
395     share the same path attributes (Origin, AS Path, Local Pref, etc) and
396     can be thus advertised in a single UPDATE message.

398     A makes an UPDATE message and uses the MULTIPLE_HOP path attribute.
399 It puts the AFI, SAFI, number of next-hops as 2, length of the first 400 next-hop Nj, network address of Nj, length of Nk and the network 401 address of Nk. 403 When this UPDATE message reaches B, it looks at the MULTIPLE_HOP 404 attribute and understands that there are multiple routes to reach X. 405 It inserts the two routes for X with the next-hops Nj and Nk in its 406 Adj-RIBs-In. 408 A also needs to announce the remaining route to X with next-hop Nl. 409 It makes an UPDATE message, fills the path attributes, and uses the 410 MULTIPLE_HOP attribute to encode next-hop information about Nl. This 411 UPDATE message is sent to B. 413 When B receives this UPDATE message it knows that this is not a 414 replacement route for X as it comes with the MULTIPLE_HOP 415 attribute. It simply appends this new route in its adj-RIBs-In, 416 runs the decision process, and proceeds as normal. 418 Assume that at some point later, A needs to withdraw the route 419 associated with the tuple [X, nexthop Nk]. It makes an UPDATE 420 message, puts X in the WITHDRAWN ROUTES and inserts the MULTIPLE_HOP 421 attribute, encoding the next-hop Nk inside. 423 When B receives this UPDATE message it understands that A wants to 424 remove one (or more) of the routes associated with X. To determine 425 which exact route(s) needs to be removed, it looks at the 426 MULTIPLE_HOP attribute and goes about removing all the routes 427 associated with the next-hops listed therein. 429 3.5 Implicit Withdrawal for one of the Next-Hops 431 In the same scenario to replace a route associated with the tuple [X, 432 next-hop Nk], A can advertise a fresh route with a new set of path 433 attributes. B would consider the new advertisement as an implicit 434 withdrawal for the previously announced route for the tuple [X, next- 435 hop Nk]. 437 4. Multiprotocol Extensions to BGP 439 Since the MULTIPLE_HOP includes both the AFI and SAFI, it is possible 440 to advertise multiple MPBGP routes. 
In this case, MP_REACH_NLRI
441     [MBGP] attribute shall carry the NLRI information and MULTIPLE_HOP
442     the information about the additional next-hops.

444     To suppress route replacement the additional routes must be
445     advertised by keeping the length of the next-hop as 0 in the
446     MP_REACH_NLRI attribute. The same should be encoded in the
447     MULTIPLE_HOP attribute.

449  5. Security Considerations

451     This extension to BGP does not change the underlying security issues
452     inherent in the existing BGP.

454  6. Acknowledgements

456     The authors would like to thank Tony Li, Arnold Nipper and Curtis
457     Villamizar for their valuable comments and suggestions on the earlier
458     versions of this draft from which the current work has been derived.

460  7. IANA Considerations

462     IANA needs to assign a capability code to the Multiple Hop capability.

464  8. References

466  8.1 Normative References

468     [BGP-CAP]   Chandra, R. and J. Scudder, "Capabilities Advertisement
469                 with BGP-4", RFC 3392, November 2002

471     [BGP4]      Rekhter, Y., Li, T. and Hares, S., "A Border Gateway
472                 Protocol 4 (BGP-4)", RFC 4271, January 2006

474     [RR]        Chandra, R., Bates, T., and E. Chen, "BGP Route Reflection
475                 - An Alternative to Full Mesh Internal BGP (IBGP)", RFC
476                 4456, April 2006

478     [BGP-IPv6]  Marques, P. and F. Dupont, "Use of BGP-4 Multiprotocol
479                 Extensions for IPv6 Inter-Domain Routing", RFC 2545,
480                 March 1999.

482     [MBGP]      Chandra, R., Rekhter, Y., Bates, T., and D. Katz,
483                 "Multiprotocol Extension for BGP-4",
484                 draft-ietf-idr-rfc2858bis-10.txt (work in progress)

486     [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
487                 Requirement Levels", RFC 2119, BCP 14, March 1997.

489     [IANA_AFI]  http://www.iana.org/assignments/address-family-numbers

491     [IANA-SAFI] http://www.iana.org/assignments/safi-namespace

493  8.2 Informative References

495     [MED]       Retana, A., Walton, D., McPherson, D., and V. Gill,
496                 "Border Gateway Protocol (BGP) Persistent Route
497                 Oscillation Condition", RFC 3345, August 2002.

499     [COMM]      Chandra, R., Traina, P. and Li, T., “BGP Communities
500                 Attribute”, RFC 1997, August 1996

502  9. Appendix A

504     This section explains some scenarios where advertising multiple BGP
505     paths may prove to be useful.

507  9.1 Suboptimal Routing in Route Reflector clients

509     Route Reflection can result in suboptimal routing due to the client
510     not having full visibility to all the BGP paths in the AS. This is
511     because the RR selects the best path and reflects only that best path
512     to its clients. In case the RR has equal cost BGP routes, then it
513     shall select the one based on the lower Router ID. As a result, the
514     clients do not receive the full view of the available paths, or at
515     least the paths that are equidistant from the RR. This can result in
516     suboptimal routing from the client's perspective. A client may have
517     selected a different best path if more paths had been made visible to
518     it. With Multiple-hop BGP, the RR can advertise all the equal cost
519     BGP routes that it has to its client, giving the client more options
520     to choose from.

522     The extensions proposed in this draft make provision for the RR to
523     reflect all the routes to its clients.
525  9.2 Avoiding Persistent Route Oscillations

527         ----------------------------------
528        /               AS X                \
529       |              -----                  |
530       |             /     \                 |
531       |            |       |                |
532       |            |  RR   |                |
533       |             \     /                 |
534       |              -/+\-                  |
535       |          c1  /   \  c2              |
536       |     ----    /     \    ----         |
537       |    /     \ /       \ /      \       |
538       |   (  Ra   )         (   Rb   )      |
539       |    \     /           \      /       |
540       |     -/\--             ------        |
541       |     /  \                    \       |
542       |    /    \                    \      |
543        \  /      \                    \    /
544         --/------\--------------------\----
545          /        \                    \
546         /          ---------------------------
547        /          / \                  --\--  \
548     --/-         |   \                /     \  |
549   //    \\       |    \              |       | |
550  |   R2   |      |     \             |  R3   | |
551  |        |      |     -\--           \     /  |
552   \\    //       |    /    \           -----   |
553     ----         |   |      |                  |
554     AS Y         |   |  R1  |                  |
555                  |    \    /                   |
556                  |     ----                    |
557                  \            AS Z             /
558                   -----------------------------

560                                Figure 3

562     Consider the topology as shown in Figure 3. Say, AS X consists of
563     Route Reflector (RR) and two clients Ra and Rb. Ra is connected to
564     R2 in AS Y and R1 in AS Z. Rb is connected to R3 in AS Z. Assume that
565     the Router ID of R1 < R2 and IGP cost c1 < c2. The dashed lines
566     between the routers show BGP peering. Assume that the BGP speakers
567     in AS Y and AS Z receive a BGP UPDATE for 10.0.0.0/8 from AS W.
568     Assume that they advertise the following path attributes to BGP
569     speakers in AS X:

571     R2: NLRI 10.0.0.0/8, AS_PATH Y W, MED 100, NEXT_HOP R2
572     R1: NLRI 10.0.0.0/8, AS_PATH Z W, MED 300, NEXT_HOP R1

574     R3: NLRI 10.0.0.0/8, AS_PATH Z W, MED 200, NEXT_HOP R3

576     Scenario 1: Traditional BGP in AS X

578     The following events happen:

580     1. Ra receives UPDATE messages from R2 and R1. Since they are from
581        different ASes, MEDs are not compared and the tie breaks on the
582        lower Router ID. Since R1 < R2, route from R1 is selected and
583        advertised to the RR. Ra thus has the following path as the
584        best one for 10.0.0.0/8:

586        AS_PATH Z W, MED 300, NEXT_HOP R1

588     2. Rb receives the UPDATE from R3, installs this and advertises the
589        same to the RR. Rb thus has the following path for 10.0.0.0/8:

591        AS_PATH Z W, MED 200, NEXT_HOP R3

593     3. RR receives two UPDATE messages from its clients. Since the
594        neighboring AS is the same in both of them, the tie breaks on the
595        route having the lower value of MED. It thus selects the route it
596        learns from Rb as the best one and advertises this to Ra.

598     4. Ra now has all the three paths. Route learnt from Rb wins over
599        the route learnt from R1 (lower MED) and the route learnt from
600        R2 wins over the route learnt from Rb (EBGP > IBGP).

602     5. Ra thus sends an implicit WITHDRAW to the RR, replacing the
603        earlier announcement with the route learnt from R2.

605     6. RR thus has the following paths for 10.0.0.0/8:

607        AS_PATH Y W, MED 100, NEXT_HOP R2
608        AS_PATH Z W, MED 200, NEXT_HOP R3

610        It selects the first path because the IGP cost to reach the
611        NEXT_HOP (R2) is lower for the first one. It thus advertises
612        this path to Rb and sends a WITHDRAW message to Ra, removing the
613        path it had initially announced (one learnt from Rb).

615     7. Ra receives the WITHDRAW message from the RR and removes the path.
616        Nothing is done as it is currently not the best path.

618     8. Rb receives the advertisement from RR, but doesn't do anything, as
619        the path learnt from R3 is better (EBGP > IBGP).

621     9. Ra at this time has only two routes. One, learnt from R1 and the
622        other learnt from R2:

624        AS_PATH Z W, MED 300, NEXT_HOP R1

626        AS_PATH Y W, MED 100, NEXT_HOP R2

628        It has selected the route learnt from R2. After some time, this
629        router runs its scanner process for validating the NEXT_HOPs.
630        There it runs the best path algorithm and finds that the route
631        learnt from R1 is better than the route learnt from R2, because
632        of the lower Router ID.

634     10. Ra sends an implicit WITHDRAW to RR, replacing the earlier
635        announcement with the route learnt from R1.

637     11...

639     The loop follows and it cycles again and again.

641     Scenario 2: Multiple-Hop BGP is implemented in AS X

643     1. If everything happens the same as in the preceding example then
644        Ra will have two paths to reach 10.0.0.0/8. Since everything
645        else is the same, it will advertise both these routes to the RR.
646        Note that Ra will not look at the Router ID, etc. for tie
647        breaking if Multiple-Hop capabilities are implemented.

649     2. RR will now have three paths for 10.0.0.0/8. Path 3, from Rb and
650        Paths 1 and 2 from Ra.

652        Path 1: AS_PATH Y W, MED 100, NEXT_HOP R2

654        Path 2: AS_PATH Z W, MED 300, NEXT_HOP R1

656        Path 3: AS_PATH Z W, MED 200, NEXT_HOP R3

658        Out of Path 2 and Path 3, it will select Path 3 (lower MED). From
659        Path 1 and Path 3, it will select Path 1, based on the lower
660        IGP cost. RR thus selects the Path 1 as the best route.

662     3. RR will advertise the new path to Rb. Rb will thus have the
663        following two paths:

665        Path 1: AS_PATH Y W, MED 100, NEXT_HOP R2

667        Path 2: AS_PATH Z W, MED 200, NEXT_HOP R3
668        Path 2 will win because of the EBGP > IBGP rule, and it will
669        continue using R3. There is thus no change on Rb and it
670        continues using the same path as before.

672     4. The network is stable and there are no route oscillations.

674  9.3 eBGP mesh scaling at IXes via Route Servers

676     IXes today sometimes offer their customers the facility to peer with
677     a neutral IX route-server as a means to reduce the direct peering
678     requirements for their customers. The peering overhead may be
679     considerable given the many hundreds of ASes which may be present at
680     some of the larger IXes today, and it is quite plausible that IXes
681     will continue to grow in terms of attached customers and ASes.

683     However, the single-path limitation of BGP imposes great operational
684     difficulty in allowing such a route-server to be effective.

   There are typically two kinds of route-server: one which is a normal
   BGP speaker and simply provides a single-best-path-for-all service,
   and one which is configured with each customer's policies and
   calculates the best-path separately for each. Both approaches have
   their limitations:

   o  Route-servers which simply advertise the current best known IX
      path according to normal BGP procedures, without applying any
      customer-specific policy, often still require customers to
      establish direct sessions with each other for cases where they
      wish to apply policy. Much of the scaling benefit is never
      realised.

   o  Route-servers which apply policy on their customers' behalf,
      selecting the best-path on a per-customer basis and then
      advertising to each customer a tailor-made best-path, require
      extensive co-ordination of policy between the IX operator and
      each of its customers. Further, it may be difficult for
      customers to keep their policies private due to the operational
      requirements of policy co-ordination between IX and customer.

   If there were a mechanism in BGP to allow an IX route-server to pass
   all advertisements received from other peers to a customer peer,
   without performing any path selection or applying any policy, then
   this would remove the need for policy co-ordination between each
   customer and the IX, and address the other shortcomings listed
   above. Such a mechanism would be easy for both the IX operator and
   each customer to deploy and maintain.

9.4 Advertising a subset of routes in BGP

   Providers can tag some selected routes with certain communities
   [COMM]. An administrator could write a policy that would advertise
   all the paths carrying a known community within that AS to another
   router capable of understanding the Multiple-Hop extensions.
   This is a form of policy implementation, and a detailed study of
   what could be achieved using such techniques is beyond the scope of
   this draft.

9.5 Equal Cost Multiple Path BGP

   Currently some implementations, when they receive multiple equal cost
   BGP routes from different peers, are able to insert all of them (or a
   subset of those, based on their local policies) in their forwarding
   table to locally split the load for the destination, while announcing
   only one "best" BGP path to their other peers. This, however, has
   implications for those other peers which receive such an announcement
   from this ECMP capable BGP speaker. The implication, as with route
   aggregation, is that these other peers potentially will not possess
   the full path information, which can lead to loops. Hence, such an
   ECMP capable BGP speaker can only enable this feature if great care
   is taken, if at all, or must act as if it had aggregated the set of
   routes concerned.

   While this document does not directly address the question of ECMP,
   the mechanism introduced can be built upon in order to do so. It
   would be feasible to introduce additional semantics on top of the
   Multiple-Nexthop Capability so as to allow the ECMP BGP speaker to
   fully communicate the details of all the paths it is forwarding on,
   and hence allow those other peers to have full visibility of path
   information and be able to avoid selecting paths which would
   otherwise loop, while still maintaining compatibility with speakers
   not implementing ECMP and Multiple-Hop.

10. Authors' Addresses

   Manav Bhatia
   Lucent Technologies

   Email: manav@lucent.com

   Joel M. Halpern

   Email: joel@stevecrocker.com

   Paul Jakma
   Sun Microsystems

   Email: paul.jakma@sun.com

11.
Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.