idnits 2.17.1 draft-dickson-idr-well-ordered-nth-best-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 14. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 509. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 520. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 527. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 533. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([5]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (July 13, 2008) is 5758 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: 'RFC2842' on line 200 -- Looks like a reference, but probably isn't: 'IANA-AFI' on line 218 -- Looks like a reference, but probably isn't: 'IANA-SAFI' on line 222 ** Downref: Normative reference to an Informational RFC: RFC 3345 (ref. '1') -- Possible downref: Non-RFC (?) normative reference: ref. '4' Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 11 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 idr B. Dickson 3 Internet-Draft Afilias Canada, Inc 4 Expires: January 14, 2009 July 13, 2008 6 Enhanced BGP Capabilities for Exchanging Additional Nth-Best Paths 7 draft-dickson-idr-well-ordered-nth-best-01 9 Status of this Memo 11 By submitting this Internet-Draft, each author represents that any 12 applicable patent or other IPR claims of which he or she is aware 13 have been or will be disclosed, and any of which he or she becomes 14 aware will be disclosed, in accordance with Section 6 of BCP 79. 16 Internet-Drafts are working documents of the Internet Engineering 17 Task Force (IETF), its areas, and its working groups. Note that 18 other groups may also distribute working documents as Internet- 19 Drafts. 21 Internet-Drafts are draft documents valid for a maximum of six months 22 and may be updated, replaced, or obsoleted by other documents at any 23 time. It is inappropriate to use Internet-Drafts as reference 24 material or to cite them other than as "work in progress." 26 The list of current Internet-Drafts can be accessed at 27 http://www.ietf.org/ietf/1id-abstracts.txt. 29 The list of Internet-Draft Shadow Directories can be accessed at 30 http://www.ietf.org/shadow.html. 32 This Internet-Draft will expire on January 14, 2009. 34 Copyright Notice 36 Copyright (C) The IETF Trust (2008). 38 Abstract 40 This Internet Draft describes an enhanced way to exchange prefix 41 information, so as to permit multiple copies of a prefix, with 42 different paths, to be announced and withdrawn. 44 This negotiated capability facilitates faster local (inter-AS) and 45 global (intra-AS) convergence, reduces path-hunting, improves route- 46 reflector behaviour, including eliminating persistent oscillations. 48 Additional prefix instances have a new wire format for updates and 49 withdrawals, to control path selection. 51 Benefits are seen both when deployed intra-AS, and on inter-AS 52 peering. 54 Author's Note 56 This Internet Draft is intended to result in this draft or a related 57 draft(s) being placed on the Standards Track for idr. 59 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 60 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 61 document are to be interpreted as described in [5]. 63 Intended Status: Proposed Standard. 65 Table of Contents 67 1. Background . . . . . . . . . . . . . . . . . . . . . . . . . . 3 68 1.1. The Best Path Chaining and the Best Path Tree . . . . . . 3 69 1.2. The Withdrawal Problem . . . . . . . . . . . . . . . . . . 3 70 1.3. The Uniqueness Property . . . . . . . . . . . . . . . . . 4 71 2. Proposed Changes . . . . . . . . . . . . . . . . . . . . . . . 4 72 2.1. USE_N_BEST Capability . . . . . . . . . . . . . . . . . . 5 73 3. Modifications to BGP Behavior . . . . . . . . . . . . . . . . 6 74 3.1. Changes to Path Selection Rules . . . . . . . . . . . . . 6 75 3.2. N Best - Basic Method . . . . . . . . . . . . . . . . . . 7 76 3.3. N Best - Route Reflector . . . . . . . . . . . . . . . . . 7 77 3.4. N Best - Inter-AS Hybrid Method . . . . . . . . . . . . . 7 78 3.5. IBGP vs EBGP . . . . . . . . . . . . . . . . . . . . . . . 7 79 4. Implementation Guidelines . . . . . . . . . . . . . . . . . . 8 80 5. Security Considerations . . . . . . . . . . . . . . . . . . . 9 81 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 82 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 9 83 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 9 84 8.1. Normative References . . . . . . . . . . . . . . . . . . . 9 85 8.2. Informative References . . . . . . . . . . . . . . . . . . 10 86 Appendix A. Path-Hunting Examples . . . . . . . . . . . . . . . . 10 87 Appendix B. Persistent Oscillation Examples . . . . . . . . . . . 10 88 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 11 89 Intellectual Property and Copyright Statements . . . . . . . . . . 12 91 1. Background 93 Even when all the best current practises are observed, operational 94 problems may be experienced when running a BGP network. 96 These include slow convergence due to "path-hunting" and persistant 97 oscillations [1]. 99 Standardization of MRAI timers helps path-hunting, and oscillations 100 can be worked around with RFC 5004 [3]. 102 However, both of these RFCs identify the above issues as needing 103 further work. 105 1.1. The Best Path Chaining and the Best Path Tree 107 In a stable system of BGP speakers, for every given prefix, the 108 selected best paths should form a spanning tree. At each node, the 109 best path selected points further up the tree. The root of the tree 110 is the destination, i.e. the originator of the prefix. The path from 111 any leaf to the root forms a "chain" of best paths. 113 There are any number of ways that path attributes may be modified 114 over time, at arbitrary places in this tree. When this happens, 115 individual segments of the tree may conceptually "stretch" or 116 "shrink". These changes may have no effect on the overall set of 117 choices of best path, or they may cause a cascade effect "below" that 118 point in the tree, with nodes migrating to new locations in a new 119 version of the tree. 121 However, each node makes its choice of best path locally, and every 122 time a node changes its selection of best path, that change is 123 visible to its peers, and may in turn affect their own choice of best 124 path. This propogation of changes is not instantaneous, and owing to 125 the non-tree-like nature of the actual connectivity between nodes, 126 can and does result in race conditions. 128 Depending on connectivity, peering policy, and initial conditions, 129 the behavior may border on that of systems best describe through 130 chaos theory. The time to reach a stable state, while generally 131 bounded, is often far from fast, not necessarily predictable, and not 132 necessarily consistent. 134 1.2. The Withdrawal Problem 136 Under normal circumstances, a change in attributes for a prefix will 137 "flow" along the tree of best paths, without disrupting the structure 138 of the tree itself signficantly. Even when a node selects a new best 139 path (and thus re-attaches itself to the tree in a new location), it 140 typically will continue to pass the new attributes along the branch 141 of the tree for which it is the root. 143 However, under certain circumstances, its choice of new best path, 144 requires it to WITHDRAW the prefix from those peers, and effectively 145 sever the branch. It is in the after-effects of this truncation that 146 much of the path-hunting behavior gets triggered. 148 When a withdrawal effectively severs a branch of the tree, all the 149 nodes on the tree will need to find new paths to the root. The 150 problem is, that it takes some time for them to learn this fact. 152 In the mean time, the nodes in the severed branch may continue to 153 use, and propogate, paths that are technically infeasible. 155 The idea is to fast-track the flooding of the infeasibility of paths 156 throughout all parts of the tree below a given link, so as to 157 minimize the use of infeasible paths. 159 1.3. The Uniqueness Property 161 Currently, for each prefix, only one path for that prefix is ever 162 announced from one peer to another (ignoring Route Reflectors). 163 Because of this property, uniqueness, a withdrawal on a prefix does 164 not require path information. This also means that a change of best 165 path is accomplished via an update for a prefix with the new path 166 information. 168 If, however, more than one path for a given prefix were sent, then 169 any attempt to withdraw a prefix+path would require some mechanism to 170 distinguish between prefix instances. 172 In an environment where multiple path announcments per prefix are 173 possible, but only one "best" path per prefix is maintained, then two 174 steps would be involved in changing the "best" path. In no 175 particular order, that would be the withdrawal of the old prefix+ 176 path, and the announcement of the new prefix+path. 178 2. Proposed Changes 180 What is being proposed is, maintaining a set of "N best" for each 181 prefix, and sending ALL of these rather than just the "best" path. 183 When any of the "N best" becomes infeasible, a withdrawal is sent. 184 If a withdrawal is received, it receives special fast-track handling, 185 taking advantage of the "N best" information. If any of the N best 186 is affected by the withdrawal, it is immediately flooded to peers 187 without doing a prefix BGP path comparison (since those results have 188 already been pre-calculated). 190 The supposition is that pruning all infeasible branches, while 191 maintaining information on N best paths, allows for fast removal of 192 all best paths which are dependent on infeasible paths, and fast 193 reconvergence with pre-computed alternate paths. It is expected that 194 the N-best mechanism should act as a stop-gap until, but not actually 195 replace, full prefix path comparisons to generate a new set of "N 196 best" paths. 198 2.1. USE_N_BEST Capability 200 The USE_N_BEST Capability is a new BGP capability [RFC2842]. The 201 Capability Code for this capability is specified in the IANA 202 Considerations section of this document. The Capability Length field 203 of this capability is variable. The Capability Value field consists 204 of zero or more of the tuples as follows: 206 +------------------------------------------------+ 207 | Address Family Identifier (2 octets) | 208 +------------------------------------------------+ 209 | Subsequent Address Family Identifier (1 octet) | 210 +------------------------------------------------+ 212 Figure 1 214 The meaning and use of the fields are as follows: 215 Address Family Identifier (AFI): This field carries the identity of 216 the Network Layer protocol for which the BGP speaker intends to 217 advertise multiple paths. Presently defined values for this field 218 are specified in [IANA-AFI]. 219 Subsequent Address Family Identifier (SAFI): This field provides 220 additional information about the type of the Network Layer 221 Reachability Information carried in the attribute. Presently 222 defined values for this field are specified in [IANA-SAFI]. 223 When advertising the USE_N_BEST Capability to a peer, a BGP speaker 224 conveys to the peer that the speaker is capable of receiving multiple 225 paths as well as the single path from the peer for address families 226 that the speaker supports. When a tuple is included in 227 the capability, it indicates that the BGP speaker intends to 228 advertise multiple paths for the . If the USE_N_BEST 229 Capability is also received from the peer, the speaker would then 230 follow the procedures for advertising "Best N" paths to the peer for 231 the specified . 233 When advertising "Best N" paths: 235 o Update messages MUST be in the new format [4], and 236 ADD_PATH_ORDERED must also be advertised 237 o For each prefix, at most one of each ordinal value, 1 through N, 238 may be sent 239 o The sender is responsible for selecting its own path ordinals 240 o The sender is responsible for maintaining the sequence order per 241 prefix 242 o As a result of withdrawals, the sequence sent might not start at 243 1, and might be sparse 245 3. Modifications to BGP Behavior 247 3.1. Changes to Path Selection Rules 249 The path selection rules for BGP (section 9.1.2.2 of BGP4 [2]) are 250 changed as follows: 251 o The following rule is a modification to step (c). 253 It MAY only be needed when the node is acting as a Route 254 Reflector. If a node is NOT a Route Reflector, a simplified 255 modification (remove any paths NOT marked BEST) MAY be used. 256 (This modification exists to resolve the Persistent Oscillation 257 problem only.) 259 The modification to step (c) is: 261 Step (c) is first performed INCLUDING paths NOT marked as BEST. 263 If, at the end of the first attempt at step (c), no paths marked 264 BEST remain, re-run step (c), this time EXCLUDING all paths NOT 265 marked BEST. 267 After this modified version of step (c), it should be observed 268 (and asserted) that only paths marked BEST must remain. 270 In other words, Step (c) MUST remove any non-BEST paths. 271 o The remainder of the usual BGP path selection rules are applied as 272 normal 273 The path selection rules for "Nth Best" path are as follows: 274 o The already-selected (N-1) best paths are removed from the set of 275 paths to compare 276 o The same rules are applied as for the "best" path (including the 277 modification to step (c), above) 278 o The selected path is advertised (to any peers with whom Nth-best 279 has been negotiated), with the ordinal value of N applied 280 The prefix instances for consideration of Nth-best path are the 281 REMAINDER of non-yet-selected instances. NB: Only the best (lowest 282 received ordinal), not-yet-selected instance of any IN-RIB may be 283 selected for the local (and out-RIB) Nth-best path. 285 3.2. N Best - Basic Method 287 Once the capabality for doing so has been negotiated between a pair 288 of BGP speakers, each sends the best N paths for each prefix. The 289 path information will include the additional ordinal value on the 290 each Nth-best path. 292 When the current "best" path is withdrawn, the withdrawal MAY be 293 propogated without having to perform a full BGP prefix path 294 selection. The current "second best" path in the local-RIB is 295 promoted to "best". This is because the alternate candidates have 296 already been evaluated and "second-best" has already been selected. 298 Whenever an AS consists of a mesh of BGP speakers who have negotiated 299 this capability, the withdrawal will propogate through the entire AS. 300 This will either have no effect, or will cause a change in "best", 301 which does not require non-local information in order to choose the 302 new "best" path. 304 The second-best path from a neighbor MUST ONLY be considered as a 305 candidate for best path, when the previous best path from that 306 neighbor is withdrawn. When this occurs, the path in question is 307 promoted to "best" status. 309 3.3. N Best - Route Reflector 311 The N best are all reflected. The same mechanism is used for 312 determining the best N per prefix. Updates must be reflected 313 whenever the choice of any of the best N change. Withdrawals may be 314 propogated immediately. 316 3.4. N Best - Inter-AS Hybrid Method 318 When a withdrawal of the current best path is received from a peer 319 doing USE_N_BEST, and the rules for sending updates require that an 320 update for this prefix be sent to a peer who does not support 321 USE_N_BEST, the current second-best instance of the prefix is sent to 322 that peer in an Update. The neighbor does not need the withdrawal, 323 since the new path replaces the old path. 325 3.5. IBGP vs EBGP 327 The same rules apply for EBGP->EBGP, EBGP->IBGP, IBGP->EBGP, and 328 IBGP->IBGP. If a particular peering has had USE_N_BEST negotiated, 329 then any update for a particular prefix that results in new selection 330 of any of the N best paths, the new selections (and possible 331 withdrawal of old selections) is sent to the appropriate peers. 333 4. Implementation Guidelines 335 In order to encourage effective implementation schemes, and to 336 demonstrate some of the benefits of deployment, here are some 337 suggestions for facilitating fast propogation of path changes, which 338 are anticipated as improving behavior. This applies in particular to 339 Path Hunting issues. 341 In-RIB-N (many) -> RIB-N -> out-RIB-N 342 | \ 343 v `-> out-RIB (to non-Nth-best peers) 344 RIB -> FIB 346 +----------+------+--------+---------+-----------------| 347 | PREFIX | UNIQ | IN-ORD | OUT-ORD | *PATH-info-ptr | 348 +----------+------+--------+---------+-----------------| 350 Figure 2 352 Where IN-ORD and OUT-ORD indicate the preference order (from BEST to 353 Nth-BEST) of the sender, or ourselves, and UNIQ is chosen to uniquely 354 identify the prefix; BGP Originator is used for UNIQ. IN-ORD are the 355 values sent from a peer. OUT-ORD is non-zero for ONLY those prefixes 356 selected for inclusion into the RIB-N. 358 For example, if all external peers have NOT negotiated Nth-Best, 359 those prefixes would have an ordinal value of 1. Each In-RIB-N would 360 have at most one instance. And for each prefix, at most one In-RIB-N 361 would be selected as best, and have its corresponding OUT-ORD set to 362 1. 364 This forward-chaining allows for expedited processing of updates. We 365 can immediately determine whether any given withdrawals need to be 366 flooded to peers, and if so, what ordinal to use on the forwarded 367 update. This flooding MAY be performed in parallel to normal BGP 368 table update processing. 370 For clarity, it should be pointed out that: 371 o The process for the step RIB-N to RIB is "select prefixes with 372 OUT-ORD == 1". 374 o The process for the step RIB-N to out-RIB is also "select prefixes 375 with OUT-ORD == 1". 376 o The process for the step RIB-N to out-RIB-N is the same as 377 ordinary RIB to out-RIB, except for preservation of Ordinal 378 values. 380 5. Security Considerations 382 No additional security considerations beyond those already present in 383 BGP are introduced. 385 6. IANA Considerations 387 IANA will need to assign a new code point for BGP Capabilities for 388 USE_N_BEST. 390 7. Acknowledgements 392 The author wishes to acknowledge the helpful guidance of Joe Abley, 393 and Tony Li. The author thanks the following for feedback during the 394 review and revision process: Joel M. Halpern, Tony Li. The author 395 also wishes to acknowledge the insight gained from his Scottish 396 Deerhound, Skylar, winning a Reserve Best-in-Show. (The selection 397 method of "second best" comes from the Reserve system used at the 398 group and best-in-show levels of dog shows). 400 8. References 402 8.1. Normative References 404 [1] McPherson, D., Gill, V., Walton, D., and A. Retana, "Border 405 Gateway Protocol (BGP) Persistent Route Oscillation Condition", 406 RFC 3345, August 2002. 408 [2] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway Protocol 4 409 (BGP-4)", RFC 4271, January 2006. 411 [3] Chen, E. and S. Sangli, "Avoid BGP Best Path Transitions from 412 One External to Another", RFC 5004, September 2007. 414 [4] Dickson, B., "Enhanced BGP Capabilities for Exchanging Second- 415 Best Paths", July 2008. 417 8.2. Informative References 419 [5] Bradner, S., "Key words for use in RFCs to Indicate Requirement 420 Levels", BCP 14, RFC 2119, March 1997. 422 Appendix A. Path-Hunting Examples 424 (These will be included in a subsequent version of this ID.) 426 Appendix B. Persistent Oscillation Examples 428 Consider the example in Figure 3 where o R1, R2, R3, R4, and R5 429 belong to one AS. o R1 is a route reflector with R2 and R3 as its 430 clients. o R4 is a route reflector with R5 as its client. o The IGP 431 metrics are as listed. o External paths (a), (b), and (c) are as 432 described in Figure 4. 434 +----+ 1 +----+ 435 | R1 |-------------| R4 | 436 +----+ +----+ 437 | \ | 438 | \ | 439 3| \ 2 | 6 440 | \ | 441 | \ | 442 +----+ +----+ +----+ 443 | R2 | | R3 | | R5 | 444 +----+ +----+ +----+ 445 | | | 446 (a) (b) (c) 448 Figure 3 450 Path AS_PATH MED 451 a 1 3 10 452 b 2 3 1 453 c 2 3 0 455 Figure 4 457 With the addition of "Nth Best", and locally limiting N to 2, we 458 have: 460 R1 has the following: 462 Path AS_PATH MED IGP-metric 463 a 1 3 10 3 (received:best) (best) 464 b 2 3 1 2 (received:best) 465 c 2 3 0 7 (received:best) (second_best - not sent) 467 R4 has the following: 469 Path AS_PATH MED IGP-metric 470 a 1 3 10 4 (received:best) (best - not sent) 471 c 2 3 0 6 (received: best) (second_best) 473 This results in R1 having: 475 Path AS_PATH MED IGP-metric 476 a 1 3 10 3 (received:best) (best) 477 b 2 3 1 2 (received:best) 478 c 2 3 0 7 (received:second_best) (second_best - not sent) 480 By including N best (for N=2) in the best path calculation, the 481 persistent oscillation problem is resolved. 483 Author's Address 485 Brian Dickson 486 Afilias Canada, Inc 487 4141 Yonge St, 488 Suite 204 489 North York, ON M2P 2A8 490 Canada 492 Email: brian.peter.dickson@gmail.com 493 URI: www.afilias.info 495 Full Copyright Statement 497 Copyright (C) The IETF Trust (2008). 499 This document is subject to the rights, licenses and restrictions 500 contained in BCP 78, and except as set forth therein, the authors 501 retain all their rights. 503 This document and the information contained herein are provided on an 504 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 505 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 506 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 507 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 508 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 509 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 511 Intellectual Property 513 The IETF takes no position regarding the validity or scope of any 514 Intellectual Property Rights or other rights that might be claimed to 515 pertain to the implementation or use of the technology described in 516 this document or the extent to which any license under such rights 517 might or might not be available; nor does it represent that it has 518 made any independent effort to identify any such rights. Information 519 on the procedures with respect to rights in RFC documents can be 520 found in BCP 78 and BCP 79. 522 Copies of IPR disclosures made to the IETF Secretariat and any 523 assurances of licenses to be made available, or the result of an 524 attempt made to obtain a general license or permission for the use of 525 such proprietary rights by implementers or users of this 526 specification can be obtained from the IETF on-line IPR repository at 527 http://www.ietf.org/ipr. 529 The IETF invites any interested party to bring to its attention any 530 copyrights, patents or patent applications, or other proprietary 531 rights that may cover technology that may be required to implement 532 this standard. Please address the information to the IETF at 533 ietf-ipr@ietf.org. 535 Acknowledgment 537 Funding for the RFC Editor function is provided by the IETF 538 Administrative Support Activity (IASA).