idnits 2.17.1

draft-uttaro-idr-bgp-persistence-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 278 has weird spacing: '...eration secti...'

  == Line 933 has weird spacing: '...lineaux cedex...'

  == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but
     does not include the phrase in its RFC 2119 key words list.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).

     Found 'MUST not' in this paragraph:

     For MPLS VPN services, the effectiveness of the traffic isolation
     between VPNs relies on the correctness of the MPLS labels between ingress
     and egress PEs.  In particular, when an egress PE withdraws a label L1
     allocated to a VPN1 route, this label MUST not be assigned to a VPN route
     of a different VPN until all ingress PEs stop using the old VPN1 route
     using L1.

  -- The document date (July 12, 2013) is 3939 days in the past.  Is this
     intentional?

Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-16) exists of draft-ietf-idr-bgp-gr-notification-01 == Outdated reference: A later version (-12) exists of draft-ietf-idr-bgp-bestpath-selection-criteria-06 -- Obsolete informational reference (is this intentional?): RFC 5575 (Obsoleted by RFC 8955) Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force J. Uttaro 3 Internet-Draft AT&T 4 Intended status: Standards Track E. Chen 5 Expires: January 13, 2014 Cisco Systems 6 B. Decraene 7 Orange 8 J. Scudder 9 Juniper Networks 10 July 12, 2013 12 Support for Long-lived BGP Graceful Restart 13 draft-uttaro-idr-bgp-persistence-02 15 Abstract 17 In this document we introduce a new BGP capability termed "Long-lived 18 Graceful Restart Capability" so that stale routes can be retained for 19 a longer time upon session failure. In addition a new BGP community 20 "LLGR_STALE" is introduced for marking stale routes retained for a 21 longer time. We also specify that such long-lived stale routes be 22 treated as the least-preferred, and their advertisements be limited 23 to BGP speakers that have advertised the new capability. Use of this 24 extension is not advisable in all cases, and we provide guidelines to 25 help determine if it is. 27 Status of this Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. 
The list of current Internet-
35     Drafts is at http://datatracker.ietf.org/drafts/current/.

37     Internet-Drafts are draft documents valid for a maximum of six months
38     and may be updated, replaced, or obsoleted by other documents at any
39     time.  It is inappropriate to use Internet-Drafts as reference
40     material or to cite them other than as "work in progress."

42     This Internet-Draft will expire on January 13, 2014.

44  Copyright Notice

46     Copyright (c) 2013 IETF Trust and the persons identified as the
47     document authors.  All rights reserved.

49     This document is subject to BCP 78 and the IETF Trust's Legal
50     Provisions Relating to IETF Documents
51     (http://trustee.ietf.org/license-info) in effect on the date of
52     publication of this document.  Please review these documents
53     carefully, as they describe your rights and restrictions with respect
54     to this document.  Code Components extracted from this document must
55     include Simplified BSD License text as described in Section 4.e of
56     the Trust Legal Provisions and are provided without warranty as
57     described in the Simplified BSD License.

59  Table of Contents

61     1.  Introduction
62       1.1.  Requirements Language
63     2.  Definitions
64     3.  Protocol Extensions
65       3.1.  Long-lived Graceful Restart Capability
66       3.2.  LLGR_STALE Community
67       3.3.  NO_LLGR Community
68     4.  Operation
69       4.1.  Use of Graceful Restart Capability
70       4.2.  Session Resets
71       4.3.  Processing LLGR_STALE Routes
72       4.4.  Route Selection
73       4.5.  Multicast VPN
74       4.6.  Errors
75       4.7.  Optional Partial Deployment Procedure
76       4.8.  Procedures When BGP is the PE-CE Protocol in a VPN
77     5.  Deployment Considerations
78       5.1.  When BGP is the PE-CE Protocol in a VPN
79       5.2.  Risks of Depreferencing Routes
80     6.  Security Considerations
81     7.  Examples of Operation
82     8.  Acknowledgements
83     9.  Contributors
84     10. IANA Considerations
85     11. References
86       11.1.  Normative References
87       11.2.  Informative References
88     Authors' Addresses

90  1.  Introduction

92     Historically, routing protocols in general and BGP in particular have
93     been designed with a focus on correctness, where a key part of
94     "correctness" is for each network element's forwarding state to
95     converge toward the current state of the network as quickly as
96     possible.  For this reason, the protocol was designed to remove state
97     advertised by routers which went down (from a BGP perspective) as
98     quickly as possible.  Over time, this has been relaxed somewhat,
99     notably by BGP Graceful Restart [RFC4724]; however, the paradigm has
100    remained one of attempting to rapidly remove "stale" state from the
101    network.

103    Over time, two phenomena have arisen that call into question the
104    underlying assumptions of this paradigm.  The first is the widespread
105    adoption of tunneled forwarding infrastructures, for example MPLS.
106 Such infrastructures eliminate the risk of some types of forwarding 107 loops that can arise in hop-by-hop forwarding, and thus reduce one of 108 the motivations for strong consistency between forwarding elements. 109 The second is the increasing use of BGP as a transport for data less 110 closely associated with packet forwarding than was originally the 111 case. Examples include the use of BGP for autodiscovery (VPLS 112 [RFC4761]) and filter programming (FLOWSPEC [RFC5575]). In these 113 cases, BGP data takes on a character more akin to configuration than 114 to traditional routing. 116 The observations above motivate a desire to offer network operators 117 the ability to choose to retain BGP data for a longer period than has 118 hitherto been possible when the BGP control plane fails for some 119 reason. Although the semantics of BGP Graceful Restart [RFC4724] are 120 close to those desired, several gaps exist, most notably in maximum 121 time for which "stale" information can be retained -- Graceful 122 Restart imposes a 4095 second upper bound. 124 In this document we introduce a new BGP capability termed "Long-lived 125 Graceful Restart Capability" so that stale information can be 126 retained for a longer time across a session reset. We also introduce 127 a new BGP community, "LLGR_STALE", to mark such information. Such 128 stale information is to be treated as least-preferred, and its 129 advertisement limited to BGP speakers that support the new 130 capability. Where possible, we reference the semantics of BGP 131 Graceful Restart [RFC4724] rather than specifying similar semantics 132 in this document. 134 The expected deployment model for this extension is that it will only 135 be invoked for certain address families. This is discussed in more 136 detail in the Deployment Considerations section (Section 5). 
When 137 used, its use may be combined with that of traditional Graceful 138 Restart, in which case it is invoked only after the traditional 139 Graceful Restart interval has elapsed, or it may be invoked 140 immediately. Apart from the potential to greatly extend the timer, 141 the most obvious difference between Long-Lived and traditional 142 Graceful Restart is that in the Long-Lived version, routes are 143 "depreferenced", that is, treated as least-preferred, whereas in the 144 traditional version, route preference is not affected. The design 145 choice to treat Long-Lived Stale routes as least-preferred was 146 informed by the expectation that they might be retained for a 147 (potentially) almost unbounded period of time, whereas in the 148 traditional Graceful Restart case, stale routes are retained for only 149 a brief interval. In the GR case, the tradeoff between advertising 150 new route status (at the cost of routing churn) and not advertising 151 it (at the cost of suboptimal or incorrect route selection) is 152 resolved in favor of not advertising, and in the LLGR case, it is 153 resolved in favor of advertising new state. 155 1.1. Requirements Language 157 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 158 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 159 document are to be interpreted as described in RFC 2119 [RFC2119]. 161 2. Definitions 163 Depreference, Depreferenced: A route is said to be depreferenced if 164 it has its route selection preference reduced in reaction to some 165 event. 167 GR: Abbreviation for "Graceful Restart" [RFC4724], also sometimes 168 referred to herein as "conventional Graceful Restart" or 169 "conventional GR" to distinguish it from the "Long-lived Graceful 170 Restart" defined by this document. 172 Helper: Or "helper router". During Graceful Restart or Long-lived 173 Graceful Restart, the router that detects a session failure and 174 applies the listed procedures. 
[RFC4724] refers to this as the
175       "receiving speaker".

177    LLGR:  Abbreviation for "Long-lived Graceful Restart".

179    LLST:  Abbreviation for "Long-lived Stale Time".

181    Route:  We use "route" to mean any information encoded as a BGP NLRI
182       and set of path attributes.  As discussed above, the connection
183       between such routes and installation of forwarding state may be
184       quite remote.

186  3.  Protocol Extensions

188    A new BGP capability and two new BGP communities are introduced.

190  3.1.  Long-lived Graceful Restart Capability

192    The "Long-lived Graceful Restart Capability" is a new BGP capability
193    [RFC5492] that can be used by a BGP speaker to indicate its ability
194    to preserve its state according to the procedures of this document.
195    This capability MUST be advertised in conjunction with the Graceful
196    Restart capability [RFC4724], see the "Use of Graceful Restart
197    Capability" section (Section 4.1).

199    The capability value consists of one or more tuples as follows:

202       +--------------------------------------------------+
203       | Address Family Identifier (16 bits)              |
204       +--------------------------------------------------+
205       | Subsequent Address Family Identifier (8 bits)    |
206       +--------------------------------------------------+
207       | Flags for Address Family (8 bits)                |
208       +--------------------------------------------------+
209       | Long-lived Stale Time (24 bits)                  |
210       +--------------------------------------------------+
211       | ...                                              |
212       +--------------------------------------------------+
213       | Address Family Identifier (16 bits)              |
214       +--------------------------------------------------+
215       | Subsequent Address Family Identifier (8 bits)    |
216       +--------------------------------------------------+
217       | Flags for Address Family (8 bits)                |
218       +--------------------------------------------------+
219       | Long-lived Stale Time (24 bits)                  |
220       +--------------------------------------------------+

222    The meanings of the fields are as follows:

224       Address Family Identifier (AFI), Subsequent Address Family
225       Identifier (SAFI):

227          The AFI and SAFI, taken in combination, indicate that the BGP
228          speaker has the ability to preserve its forwarding state for
229          the address family during a subsequent BGP restart.  Routes may
230          be explicitly associated with a particular AFI and SAFI using
231          the encoding of [RFC4760] or implicitly associated with
232          <AFI=IPv4, SAFI=Unicast> if using the encoding of [RFC4271].

234       Flags for Address Family:

236          This field contains bit flags relating to routes that were
237          advertised with the given AFI and SAFI.

239              0 1 2 3 4 5 6 7
240             +-+-+-+-+-+-+-+-+
241             |F|   Reserved  |
242             +-+-+-+-+-+-+-+-+

244          The most significant bit is used to indicate whether the state
245          for routes that were advertised with the given AFI and SAFI has
246          indeed been preserved during the previous BGP restart.  When
247          set (value 1), the bit indicates that the state has been
248          preserved.  This bit is called the "F bit" since it was
249          historically used to indicate preservation of Forwarding State.
250          Use of the F bit is detailed in the Session Resets section
251          (Section 4.2).

253          The remaining bits are reserved and MUST be set to zero by the
254          sender and ignored by the receiver.
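The tuple layout shown in the diagram above can be illustrated with a short sketch. This is not part of the specification. It assumes the fields appear in network byte order, as is usual for BGP, and that each tuple occupies 7 octets (16 + 8 + 8 + 24 bits). The names LlgrTuple, parse_llgr_capability and encode_llgr_tuple are hypothetical, and rejecting a value whose length is not a multiple of 7 is an illustrative choice rather than a behavior this document mandates.

```python
from dataclasses import dataclass
from typing import List

# Assumed tuple size: 2 (AFI) + 1 (SAFI) + 1 (Flags) + 3 (Long-lived Stale Time).
TUPLE_LEN = 7
# The F bit is the most significant bit of the Flags octet.
F_BIT = 0x80


@dataclass(frozen=True)
class LlgrTuple:
    """One per-address-family entry of the capability value (hypothetical type)."""
    afi: int
    safi: int
    f_bit: bool
    stale_time: int  # Long-lived Stale Time in seconds (24-bit value)


def parse_llgr_capability(value: bytes) -> List[LlgrTuple]:
    """Decode a capability value into its per-AFI/SAFI tuples."""
    if len(value) % TUPLE_LEN != 0:
        # Illustrative handling only; the draft does not define this case.
        raise ValueError("value is not a whole number of 7-octet tuples")
    tuples = []
    for off in range(0, len(value), TUPLE_LEN):
        afi = int.from_bytes(value[off:off + 2], "big")
        safi = value[off + 2]
        flags = value[off + 3]
        stale_time = int.from_bytes(value[off + 4:off + 7], "big")
        # Reserved flag bits are ignored on receipt.
        tuples.append(LlgrTuple(afi, safi, bool(flags & F_BIT), stale_time))
    return tuples


def encode_llgr_tuple(t: LlgrTuple) -> bytes:
    """Encode one tuple; reserved flag bits are sent as zero."""
    flags = F_BIT if t.f_bit else 0
    return (t.afi.to_bytes(2, "big")
            + bytes([t.safi, flags])
            + t.stale_time.to_bytes(3, "big"))
```

With a 24-bit field, the largest Long-lived Stale Time that can be encoded is 16777215 seconds, compared with the 4095 second bound of the conventional "Restart Time".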
256 Long-lived Stale Time: 258 This time (in seconds) specifies how long stale information 259 (for the AFI/SAFI) may be retained (possibly in conjunction 260 with the period specified by the "Restart Time" in the Graceful 261 Restart Capability, if present). 263 3.2. LLGR_STALE Community 265 We introduce a new BGP community [RFC1997] "LLGR_STALE" (value: TBD). 266 It can be used to mark stale routes retained for a longer period of 267 time. Such long-lived stale routes are to be handled according to 268 the procedures specified in the Operation section (Section 4). 270 An implementation MAY allow users to configure policies that accept, 271 reject, or modify routes based on the presence or absence of this 272 community. 274 3.3. NO_LLGR Community 276 We introduce a new BGP community "NO_LLGR" (value: TBD). It can be 277 used to mark routes which a BGP speaker does not want treated 278 according to these procedures, as detailed in the Operation section 279 (Section 4). 281 An implementation MAY allow users to configure policies that accept, 282 reject, or modify routes based on the presence or absence of this 283 community. 285 4. Operation 287 A BGP speaker MAY use BGP Capabilities Advertisements [RFC5492] to 288 advertise the "Long-lived Graceful Restart Capability" to indicate 289 its ability to retain state and perform related procedures specified 290 in this document. The setting of the parameters for an AFI/SAFI 291 depends on the properties of the BGP speaker, network scale, and 292 local configuration. 294 In the presence of the "Long-lived Graceful Restart Capability", the 295 procedures specified in [RFC4724] and 296 [I-D.ietf-idr-bgp-gr-notification] continue to apply unless 297 explicitly revised by this document. 299 4.1. Use of Graceful Restart Capability 301 The Graceful Restart capability MUST be advertised in conjunction 302 with the LLGR capability. If it is not so advertised, the LLGR 303 capability MUST be disregarded. 
The purpose for mandating that both 304 be used in conjunction is to enable reuse of certain base mechanisms 305 that are common to both "flavors", notably origination, collection 306 and processing of EoR, as well as the finite state machine 307 modifications and connection reset logic introduced by GR. 309 We observe that if support for conventional Graceful Restart is not 310 desired for the session, the conventional GR phase can be skipped by 311 omitting all AFI/SAFI from the GR capability, advertising a Restart 312 Time of zero, or both. The Session Resets section (Section 4.2) 313 discusses the interaction of conventional and long-lived GR. 315 4.2. Session Resets 317 BGP Graceful Restart [RFC4724], updated by 318 [I-D.ietf-idr-bgp-gr-notification], defines conditions under which a 319 BGP session can reset and have its associated routes retained. If 320 such a reset occurs for a session for which the LLGR Capability has 321 also been exchanged, the following procedures apply. 323 If the Graceful Restart Capability that was received does not list 324 all AFI/SAFI supported by the session, then for those non-listed AFI/ 325 SAFI the GR "Restart Time" shall be deemed zero. Similarly, if the 326 received LLGR Capability does not list all AFI/SAFI supported by the 327 session, then for those non-listed AFI/SAFI the "Long-lived Stale 328 Time" shall be deemed zero. 330 The following text in Section 4.2 of the GR specification [RFC4724] 331 no longer applies: 333 If the session does not get re-established within the "Restart 334 Time" that the peer advertised previously, the Receiving Speaker 335 MUST delete all the stale routes from the peer that it is 336 retaining. 338 and the following procedures are specified instead: 340 After the session goes down and before the session is re-established, 341 the stale routes for an AFI/SAFI MUST be retained. 
The interval for 342 which they are retained is limited by the sum of the "Restart Time" 343 in the received Graceful Restart Capability and the "Long-lived Stale 344 Time" in the received Long-lived Graceful Restart Capability. These 345 timers MAY be modified by local configuration. 347 If the value of the "Restart Time" or the "Long-lived Stale Time" is 348 zero, the duration of the corresponding period would be zero seconds. 349 So, for example, if the "Restart Time" is zero and the "Long-lived 350 Stale Time" is nonzero, only the procedures particular to LLGR would 351 apply. Conversely, if the "Long-lived Stale Time" is zero and the 352 "Restart Time" is nonzero, only the procedures of GR would apply. If 353 both are zero, none of these procedures would apply, only those of 354 the base BGP specification (although EoR would still be used as 355 detailed in [RFC4724]). And finally, if both are nonzero, then the 356 procedures would be applied serially -- first those of GR, then those 357 of LLGR. We observe that during the first interval, while the 358 procedures of GR are in effect, route preference would not be 359 affected, while during the second interval, while LLGR procedures are 360 in effect, routes would be treated as least-preferred as specified 361 elsewhere in this document. 363 Once the "Restart Time" period ends (including the case that the 364 "Restart Time" is zero), the LLGR period is said to have begun and 365 the following procedures MUST be performed: 367 o The helper router MUST start a timer for the "Long-lived Stale 368 Time". If the timer for the "Long-lived Stale Time" expires 369 before the session is re-established, the helper MUST delete all 370 the stale routes from the neighbor that it is retaining. 372 o The helper router MUST attach the LLGR_STALE community for the 373 stale routes being retained. Note that this requirement implies 374 that the routes would need to be readvertised, to disseminate the 375 modified community. 
377    o  If any of the routes from the peer have been marked with the
378       NO_LLGR community, either as sent by the peer, or as the result of
379       a configured policy, they MUST NOT be retained, but MUST be
380       removed as per the normal operation of [RFC4271].

382    o  The helper router MUST perform the procedures listed under
383       Section 4.3.

385    Once the session is re-established, the procedures specified in
386    [RFC4724] apply for the stale routes irrespective of whether the
387    stale routes are retained during the "Restart Time" period or the
388    "Long-lived Stale Time" period.  However, in the case of consecutive
389    restarts (i.e., the session goes down before the EoR is received), the
390    previously marked stale routes MUST NOT be deleted before the timer
391    for the "Long-lived Stale Time" expires.

393    Similarly to [RFC4724], once the session is re-established, if the F
394    bit for a specific address family is not set in the newly received
395    LLGR Capability, or if a specific address family is not included in
396    the newly received LLGR Capability, or if the LLGR and accompanying
397    GR Capability are not received in the re-established session at all,
398    then the Helper MUST immediately remove all the stale routes from the
399    peer that it is retaining for that address family.

401    If a "Long-lived Stale Time" timer is running for a peer, it MUST NOT
402    be updated (other than by manual operator intervention) until the
403    peer has established and synchronized a new session.  The session is
404    termed "synchronized" once the EoR has been received from the peer.

406    The value of the "Long-lived Stale Time" in the capability received
407    from a neighbor MAY be reduced by local configuration.

409    While the session is down, the expiration of the "Long-lived Stale
410    Time" timer is treated analogously to the expiration of the "Restart
411    Time" timer in Graceful Restart.  However, the timer continues to run
412    once the session has re-established.
The timer is not stopped, nor
413    updated, until EoR is received from the peer.  If the timer expires
414    during synchronization with the peer, any stale routes that the peer
415    has not refreshed are removed.  If the session subsequently resets
416    prior to becoming synchronized, any remaining routes should be
417    removed immediately.

419  4.3.  Processing LLGR_STALE Routes

421    A BGP speaker that has advertised the "Long-lived Graceful Restart
422    Capability" to a neighbor MUST perform the following upon receiving a
423    route from that neighbor with the "LLGR_STALE" community, or upon
424    attaching the "LLGR_STALE" community itself per Section 4.2:

426    o  Treat the route as the least-preferred in route selection (see
427       below).  See the Risks of Depreferencing Routes section
428       (Section 5.2) for a discussion of potential risks inherent in
429       doing this.

431    o  The route SHOULD NOT be advertised to any neighbor from which the
432       Long-lived Graceful Restart Capability has not been received.  The
433       exception is described in the Optional Partial Deployment
434       Procedure section (Section 4.7).  Note that this requirement
435       implies that such routes should be withdrawn from any such
436       neighbor.

438    o  The "LLGR_STALE" community MUST NOT be removed when the route is
439       further advertised.

441  4.4.  Route Selection

443    In this document, when we refer to treating a route as least-
444    preferred, this means the route MUST be treated as less preferred
445    than any other route that is not so treated.  When performing route
446    selection between two routes both of which are least-preferred,
447    normal tie-breaking applies.  Note that this would only be expected
448    to happen if the only routes available for selection were least-
449    preferred -- in all other cases, such routes would have been
450    eliminated from consideration.

452  4.5.  Multicast VPN

454    Special consideration is required if LLGR is to be applied to the
455    Multicast VPN SAFI [RFC6514].
Considerations for Multicast VPNs will 456 be covered in a future revision of this document. 458 4.6. Errors 460 If the LLGR capability is received without an accompanying GR 461 capability, the LLGR capability MUST be ignored, that is, the 462 implementation MUST behave as though no LLGR capability had been 463 received. 465 4.7. Optional Partial Deployment Procedure 467 Ideally, all routers in an Autonomous System would support this 468 specification before it was enabled. However, to facilitate 469 incremental deployment, stale routes MAY be advertised to neighbors 470 that have not advertised the Long-lived Graceful Restart Capability 471 under the following conditions: 473 o The neighbors MUST be internal (IBGP or Confederation) neighbors. 475 o The NO_EXPORT community [RFC1997] MUST be attached to the stale 476 routes. 478 o The stale routes MUST have their LOCAL_PREF set to zero. See the 479 Risks of Depreferencing Routes section (Section 5.2) for a 480 discussion of potential risks inherent in doing this. 482 If this strategy for partial deployment is used, the network operator 483 should set LOCAL_PREF to zero for all LLGR routes throughout the 484 Autonomous System. This trades off a small reduction in flexibility 485 (ordering may not be preserved between competing LLGR routes) for 486 consistency between routers which do, and do not, support this 487 specification. Since consistency of route selection can be important 488 for preventing forwarding loops, the latter consideration dominates. 490 4.8. Procedures When BGP is the PE-CE Protocol in a VPN 492 In VPN deployments, for example [RFC4364], BGP is often used as a 493 PE-CE protocol. It may be a practical necessity in such deployments 494 to accommodate interoperation with CEs that cannot easily be upgraded 495 to support specifications such as this one. 
This leads to a problem: 496 in this specification, we take pains to ensure that "stale" routing 497 information will not leak beyond the perimeter of routers that 498 support these procedures, so that it can be depreferenced as 499 expected, and we provide a workaround (Section 4.7) for the case 500 where one or more IBGP routers are not upgraded. However, in the VPN 501 PE-CE case, the protocol in use is EBGP, and our workaround does not 502 work since it relies on the use of LOCAL_PREF, an IBGP-only path 503 attribute. 505 We observe that the principal motivation for restricting the 506 propagation of "stale" routing information is the desire to prevent 507 it from spreading without limit once it exits the "safe" perimeter. 508 We further observe that VPN deployments are typically topologically 509 constrained, making this concern moot. For this reason, an 510 implementation MAY advertise stale routes over a PE-CE session, when 511 explicitly configured to do so. That is, the second rule listed in 512 Section 4.3 MAY be disregarded in such cases. All other rules 513 continue to apply. Finally, if this exception is used, the 514 implementation SHOULD by default attach the NO_EXPORT community to 515 the routes in question, as an additional protection against stale 516 routes spreading without limit. Attachment of the NO_EXPORT 517 community MAY be disabled by explicit configuration, to accommodate 518 exceptional cases. 520 See further discussion in Section 5.1. 522 5. Deployment Considerations 524 The deployment considerations discussed in [RFC4724] apply to this 525 document. In addition, network operators are cautioned to carefully 526 consider the potential disadvantages of deploying these procedures 527 for a given AFI/SAFI. Most notably, if used for an AFI/SAFI that 528 conveys traditional reachability information, use of a long-lived 529 stale route could result in a loss of connectivity for the covered 530 prefix. 
This specification takes pains to mitigate this risk where
531    possible, by making such routes least-preferred and by restricting
532    the scope of such routes to routers that support these procedures
533    (or, optionally, a single Autonomous System, see "Optional Partial
534    Deployment Procedure", above).  However, according to the normal
535    rules of IP forwarding, a stale more-specific route that has no non-
536    stale alternate paths available will still be used instead of a non-
537    stale less-specific route.  Networks in which the deployment of these
538    procedures would be especially concerning include those which do not
539    use "tunneled" forwarding (in other words, those using traditional
540    hop-by-hop forwarding).

542    Implementations MUST NOT enable these procedures by default.  They
543    MUST require affirmative configuration per AFI/SAFI in order to
544    enable them.

546    The procedures of this document do not alter the route resolvability
547    requirement of [RFC4271] Section 9.1.2.1.  Because of this, it will
548    commonly be the case that "stale" IBGP routes will only continue to
549    be used if the router depicted in the next hop remains resolvable,
550    even if its BGP component is down.  Details of IGP fault-tolerance
551    strategies are beyond the scope of this document.  In addition to the
552    foregoing, it may be advisable to check the viability of the next hop
553    through other means, see for example
554    [I-D.ietf-idr-bgp-bestpath-selection-criteria].  This may be
555    especially useful in cases where the next hop is known directly at
556    the network layer, notably EBGP.

558    As discussed in this document, after a BGP session goes down and
559    before the session is re-established, stale routes may be retained
560    for up to two consecutive periods, controlled by the "Restart Time"
561    and the "Long-lived Stale Time", respectively.  During the first
562    period routing churn would be prevented but with potential
563    blackholing of traffic.
During the second period potential
564    blackholing of traffic may be reduced but routing churn would be
565    visible throughout the network.  The setting of the relevant
566    parameters for a particular application should take into account the
567    tradeoffs, the network dynamics and potential failure scenarios.  If
568    needed, the first period can be bypassed either by local
569    configuration or by setting the "Restart Time" in the Graceful
570    Restart Capability to zero and/or not listing the AFI/SAFI in that
571    Capability.

573    The setting of the F bit (and the "Forwarding State" bit of the
574    accompanying GR capability) depends in part on deployment
575    considerations.  The F bit can be understood as an indication that
576    the Helper should flush associated routes (if the bit is left clear).
577    As discussed in the Introduction, an important use case for LLGR is
578    for routes that are more akin to configuration than to traditional
579    routing.  For such routes, it may make sense to always set the F bit,
580    regardless of other considerations.  Likewise, for control-plane-only
581    entities such as dedicated route reflectors, that do not participate
582    in the forwarding plane, it makes sense to always set the F bit.
583    Overall, the rule of thumb is that if loss of state on the restarting
584    router can reasonably be expected to cause a forwarding loop or black
585    hole, the F bit should be set scrupulously according to whether state
586    has been retained.  Specifics of when the F bit is, and is not, set
587    are implementation-dependent and may also be controlled by
588    configuration.

590  5.1.  When BGP is the PE-CE Protocol in a VPN

592    As discussed in Section 4.8, it may be necessary to advertise stale
593    routes to a CE in some VPN deployments, even if the CE does not
594    support this specification.
In that case, the network operator configuring their PE to advertise
such routes should notify the operator of the CE receiving the
routes, and the CE should be configured to depreference the routes.
Typical BGP implementations will be able to do this by matching on
the LLGR_STALE community and setting the LOCAL_PREF for matching
routes to zero, similar to the procedure described in Section 4.7.

5.2.  Risks of Depreferencing Routes

Depreferencing EBGP routes is considered safe, no different from the
common practice of applying a routing policy to an EBGP session.
However, the same is not always true of IBGP.

Consistent route selection is a fundamental tenet of IBGP correctness
and safe operation in hop-by-hop routed networks. When routers
within an AS apply different criteria in selecting routes, they can
arrive at inconsistent route selections, potentially with the
consequence of forming forwarding loops unless some form of tunneled
forwarding is used to prevent "core" routers from making a
(potentially inconsistent) forwarding decision based on the IP
header.

This specification uses the state of a peering session as an input to
the selection criteria, depreferencing routes that are associated
with a session that has gone down but have not yet aged out. Since
different routers within an AS might have different notions as to
whether their respective sessions with a given peer are up or down,
they might apply different selection criteria to routes from that
peer. This could result in a forwarding loop forming between such
routers.

For an example of such a forwarding loop, consider the following
simple topology:

A ---- B ---- C ------------------------- D
^                                         ^
|                                         |
R1                                        R2

In this example, A through D are routers with a full mesh of IBGP
sessions between them. The short links have unit cost; the long link
has cost 5.
Routers A and D are AS border routers, each advertising some route,
R, into the AS; these are denoted R1 and R2 in the diagram. In
ordinary operation, it can be seen that routers B and C will select
R1 for forwarding, and will forward toward A.

Suppose that the session between A and B goes down for some reason,
and stays down long enough for LLGR processing to be invoked on B.
Then on B, route R1 will be depreferenced, leading to the selection
of R2 by B. However, C will continue to prefer R1. B therefore
forwards packets destined to R toward D by way of C, while C forwards
them toward A by way of B, so a forwarding loop for such packets
would form between B and C. (We note that other forwarding loop
scenarios can be constructed for traditional GR, but are generally
considered less severe since GR can remain in effect for a much more
limited interval.)

The potential benefits of this specification can outweigh the risks
discussed above, as long as care is exercised in deployment. The
cardinal rule to be followed is that if a given set of routes is
being used within an AS for hop-by-hop forwarding, it is NOT
RECOMMENDED to enable LLGR procedures. If tunneled forwarding (such
as MPLS) is used within the AS, or if routes are being used for
purposes other than hop-by-hop forwarding, less caution is needed,
though the operator should still carefully consider the consequences
of enabling LLGR.

6.  Security Considerations

The security implications of the LLGR mechanism defined in this
document are akin to those incurred by the maintenance of stale
routing information within a network. This is particularly relevant
when considering the maintenance of routing information that is
used for service segregation, such as MPLS label entries.

For MPLS VPN services, the effectiveness of the traffic isolation
between VPNs relies on the correctness of the MPLS labels between
ingress and egress PEs.
In particular, when an egress PE withdraws a label L1 allocated to a
VPN1 route, this label MUST NOT be assigned to a VPN route of a
different VPN until all ingress PEs stop using the old VPN1 route
using L1.

A violation of this rule can already occur today if the propagation
of VPN routes by BGP messages between PEs takes more time than the
label re-allocation delay on a PE. Given that we can generally bound
worst-case BGP propagation time to a few minutes (for example, 2-5),
the security breach will not occur if PEs are designed not to
reallocate a previously used and withdrawn label until a few minutes
have passed.

The problem is made worse with BGP GR between PEs, as VPN routes can
be retained as stale for a longer period of time (for example, 20
minutes).

This is further aggravated by the BGP LLGR extension proposed in this
document, as VPN routes can be retained as stale for a much longer
period of time (for example, 2 hours or 1 day).

Therefore, to avoid a VPN breach, before enabling BGP LLGR, SPs need
to check how fast a given label can be reused by a PE, taking into
account:

o  The load of the BGP route churn on a PE (in terms of the number of
   VPN labels advertised and the churn rate).

o  The label allocation policy on the PE, possibly depending upon the
   size of the pool of VPN labels (which can be restricted by
   hardware considerations or other MPLS uses), the label allocation
   scheme (for example, per route or per VRF/CE), and the
   re-allocation policy (for example, least recently used label).

Note that [RFC4781], which defines the Graceful Restart Mechanism for
BGP with MPLS, is also applicable to BGP LLGR.
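As a rough illustration of the check described above, the following
Python sketch compares a PE's minimum label reuse delay against the
longest time an ingress PE might keep using a stale VPN route. It is
not part of this specification. The names StaleRetention and
label_reuse_is_safe are hypothetical, and treating the worst-case
window as the sum of the Restart Time, the Long-lived Stale Time, and
a BGP propagation bound is a simplifying assumption made for
illustration only; a real assessment also depends on the churn and
allocation-policy factors listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaleRetention:
    """Timers that bound how long an ingress PE may keep using a label.

    All names here are hypothetical; they are not defined by the draft.
    """
    restart_time_s: int           # GR "Restart Time" (0 if that period is bypassed)
    long_lived_stale_time_s: int  # LLGR "Long-lived Stale Time" (LLST)
    propagation_bound_s: int      # assumed worst-case BGP propagation time

    def max_stale_use_s(self) -> int:
        # Assumption: the two retention periods run back to back, and the
        # propagation bound is added on top as a safety margin.
        return (self.restart_time_s
                + self.long_lived_stale_time_s
                + self.propagation_bound_s)


def label_reuse_is_safe(min_reuse_delay_s: int, retention: StaleRetention) -> bool:
    """Return True if a withdrawn label cannot be reassigned while any
    ingress PE could still be forwarding on the stale route."""
    return min_reuse_delay_s > retention.max_stale_use_s()


if __name__ == "__main__":
    # Restart Time of 1 s and LLST of 3600 s, plus an assumed 5-minute
    # propagation bound.
    retention = StaleRetention(restart_time_s=1,
                               long_lived_stale_time_s=3600,
                               propagation_bound_s=300)
    print(retention.max_stale_use_s())              # 3901
    print(label_reuse_is_safe(300, retention))      # False
    print(label_reuse_is_safe(2 * 3600, retention)) # True
```

With a Restart Time of 1 second and an LLST of 3600 seconds, a PE
that may reuse a label after only a few minutes fails this check,
whereas one that holds withdrawn labels for two hours passes it.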

In addition to these considerations, the LLGR mechanism described
in this document is considered to be complex to exploit maliciously.
In order to inject packets into a topology, an attacker would need to
engineer a specific LLGR state between two PE devices, while also
engineering label reallocation to occur in a manner that results in
the two topologies overlapping. Such allocation is particularly
difficult to engineer (since it is typically an internal mechanism of
an LSR).

7.  Examples of Operation

For illustrative purposes, we present a few examples of how this
specification might be used in practice. These examples are neither
exhaustive nor normative.

Consider the following scenario: A border router, ASBR1, has an IBGP
peering with a route reflector, RR1, from which it learns routes. It
has an EBGP peering with an external peer, EXT, to which it
advertises those routes. The external peer has advertised the GR and
LLGR Capabilities to ASBR1. ASBR1 is configured to support GR and
LLGR on its sessions with RR1 and EXT. RR1 advertises a GR Restart
Time of 1 (second) and an LLST of 3600 (seconds):

+----------+--------------------------------------------------------+
| Time     | Event                                                  |
+----------+--------------------------------------------------------+
| t        | ASBR1's IBGP session with RR1 fails. ASBR1 retains     |
|          | RR1's routes according to the rules of GR [RFC4724].   |
|          |                                                        |
| t+1      | GR Restart Time expires. ASBR1 transitions RR1's       |
|          | routes to long-lived stale by attaching the LLGR_STALE |
|          | community and depreferencing them. However, since it   |
|          | has no backup routes, it continues to make use of      |
|          | them. It re-announces them to EXT with the LLGR_STALE  |
|          | community attached.                                    |
|          |                                                        |
| t+1+3600 | LLST expires. ASBR1 removes RR1's stale routes from    |
|          | its own RIB and sends BGP updates to withdraw them     |
|          | from EXT.                                              |
+----------+--------------------------------------------------------+

Next, imagine the same scenario but suppose RR1 advertised a GR
Restart Time of zero, effectively disabling GR. Equally, ASBR1 could
have used local configuration to override RR1's offered Restart Time,
setting it to a locally configured value of zero:

+----------+--------------------------------------------------------+
| Time     | Event                                                  |
+----------+--------------------------------------------------------+
| t        | ASBR1's IBGP session with RR1 fails. ASBR1 transitions |
|          | RR1's routes to long-lived stale by attaching the      |
|          | LLGR_STALE community and depreferencing them.          |
|          | However, since it has no backup routes, it continues   |
|          | to make use of them. It re-announces them to EXT with  |
|          | the LLGR_STALE community attached.                     |
|          |                                                        |
| t+0+3600 | LLST expires. ASBR1 removes RR1's stale routes from    |
|          | its own RIB and sends BGP updates to withdraw them     |
|          | from EXT.                                              |
+----------+--------------------------------------------------------+

Next, imagine the original scenario, but consider that the ASBR1-RR1
session comes back up and becomes synchronized 180 seconds after the
failure was detected:

+----------+--------------------------------------------------------+
| Time     | Event                                                  |
+----------+--------------------------------------------------------+
| t        | ASBR1's IBGP session with RR1 fails. ASBR1 retains     |
|          | RR1's routes according to the rules of GR [RFC4724].   |
|          |                                                        |
| t+1      | GR Restart Time expires. ASBR1 transitions RR1's       |
|          | routes to long-lived stale by attaching the LLGR_STALE |
|          | community and depreferencing them. However, since it   |
|          | has no backup routes, it continues to make use of      |
|          | them. It re-announces them to EXT with the LLGR_STALE  |
|          | community attached.                                    |
|          |                                                        |
| t+1+179  | Session is reestablished and resynchronized. ASBR1     |
|          | removes the LLGR_STALE community from RR1's routes and |
|          | re-announces them to EXT with the LLGR_STALE community |
|          | removed.                                               |
+----------+--------------------------------------------------------+

Finally, imagine the original scenario, but consider that EXT has not
advertised the LLGR Capability to ASBR1:

+----------+--------------------------------------------------------+
| Time     | Event                                                  |
+----------+--------------------------------------------------------+
| t        | ASBR1's IBGP session with RR1 fails. ASBR1 retains     |
|          | RR1's routes according to the rules of GR [RFC4724].   |
|          |                                                        |
| t+1      | GR Restart Time expires. ASBR1 transitions RR1's       |
|          | routes to long-lived stale by attaching the LLGR_STALE |
|          | community and depreferencing them. However, since it   |
|          | has no backup routes, it continues to make use of      |
|          | them. It withdraws them from EXT.                      |
|          |                                                        |
| t+1+3600 | LLST expires. ASBR1 removes RR1's stale routes from    |
|          | its own RIB.                                           |
+----------+--------------------------------------------------------+

8.  Acknowledgements

We would like to thank Roberto Fragassi, John Medamana, Han Nguyen,
Jeffrey Haas, Nabil Bitar, Nicolai Leymann, Pranav Mehta, Saikat Ray,
Martin Djernaes, and Eric Rosen for their valuable inputs and
contributions to the discussions and solutions.

9.  Contributors

Clarence Filsfils
Cisco Systems
Brussels 1000
Belgium

Email: cf@cisco.com

Pradosh Mohapatra
Cumulus Networks

Email: pmohapat@cumulusnetworks.com

Yakov Rekhter
Juniper Networks

Email: yakov@juniper.net

Rob Shakir
BT

Email: rob.shakir@bt.com

Adam Simpson
Alcatel-Lucent
600 March Road
Ottawa, Ontario K2K 2E6
Canada

Email: adam.simpson@alcatel-lucent.com

10.  IANA Considerations

This document defines a new BGP capability, the Long-lived Graceful
Restart Capability.
The Capability Code needs to be assigned by IANA.

This document introduces a new BGP community "LLGR_STALE" for marking
the long-lived stale routes, and another community "NO_LLGR" to
indicate that stale routes should not be retained. These community
values need to be assigned by IANA.

11.  References

11.1.  Normative References

[I-D.ietf-idr-bgp-gr-notification]
           Patel, K., Fernando, R., Scudder, J., and J. Haas,
           "Notification Message support for BGP Graceful Restart",
           draft-ietf-idr-bgp-gr-notification-01 (work in progress),
           April 2013.

[RFC1997]  Chandrasekeran, R., Traina, P., and T. Li, "BGP
           Communities Attribute", RFC 1997, August 1996.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
           Protocol 4 (BGP-4)", RFC 4271, January 2006.

[RFC4724]  Sangli, S., Chen, E., Fernando, R., Scudder, J., and Y.
           Rekhter, "Graceful Restart Mechanism for BGP", RFC 4724,
           January 2007.

[RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
           "Multiprotocol Extensions for BGP-4", RFC 4760,
           January 2007.

[RFC5492]  Scudder, J. and R. Chandra, "Capabilities Advertisement
           with BGP-4", RFC 5492, February 2009.

[RFC6514]  Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
           Encodings and Procedures for Multicast in MPLS/BGP IP
           VPNs", RFC 6514, February 2012.

11.2.  Informative References

[I-D.ietf-idr-bgp-bestpath-selection-criteria]
           Asati, R., "BGP Bestpath Selection Criteria Enhancement",
           draft-ietf-idr-bgp-bestpath-selection-criteria-06 (work in
           progress), February 2013.

[RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
           Networks (VPNs)", RFC 4364, February 2006.

[RFC4761]  Kompella, K. and Y. Rekhter, "Virtual Private LAN Service
           (VPLS) Using BGP for Auto-Discovery and Signaling",
           RFC 4761, January 2007.

[RFC4781]  Rekhter, Y. and R. Aggarwal, "Graceful Restart Mechanism
           for BGP with MPLS", RFC 4781, January 2007.

[RFC5575]  Marques, P., Sheth, N., Raszuk, R., Greene, B., Mauch, J.,
           and D. McPherson, "Dissemination of Flow Specification
           Rules", RFC 5575, August 2009.

Authors' Addresses

James Uttaro
AT&T
200 S. Laurel Avenue
Middletown, NJ 07748
USA

Email: ju1738@att.com

Enke Chen
Cisco Systems
170 W. Tasman Drive
San Jose, CA 95134
USA

Email: enkechen@cisco.com

Bruno Decraene
Orange
38-40 Rue de General Leclerc
92794 Issy Moulineaux cedex 9
France

Email: bruno.decraene@orange.com

John G. Scudder
Juniper Networks
1194 N. Mathilda Ave
Sunnyvale, CA 94089
USA

Email: jgs@juniper.net