Internet Engineering Task Force                                J. Uttaro
Internet-Draft                                                      AT&T
Intended status: Standards Track                                 E. Chen
Expires: May 17, 2014                                      Cisco Systems
                                                             B. Decraene
                                                                  Orange
                                                              J. Scudder
                                                        Juniper Networks
                                                       November 13, 2013

               Support for Long-lived BGP Graceful Restart
                   draft-uttaro-idr-bgp-persistence-03

Abstract

   In this document we introduce a new BGP capability termed "Long-lived
   Graceful Restart Capability" so that stale routes can be retained for
   a longer time upon session failure. In addition a new BGP community
   "LLGR_STALE" is introduced for marking stale routes retained for a
   longer time. We also specify that such long-lived stale routes be
   treated as the least-preferred, and their advertisements be limited
   to BGP speakers that have advertised the new capability. Use of this
   extension is not advisable in all cases, and we provide guidelines to
   help determine if it is.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 17, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Definitions
   3.  Protocol Extensions
     3.1.  Long-lived Graceful Restart Capability
     3.2.  LLGR_STALE Community
     3.3.  NO_LLGR Community
   4.  Operation
     4.1.  Use of Graceful Restart Capability
     4.2.  Session Resets
     4.3.  Processing LLGR_STALE Routes
     4.4.  Route Selection
     4.5.  Multicast VPN
     4.6.  Errors
     4.7.  Optional Partial Deployment Procedure
     4.8.  Procedures When BGP is the PE-CE Protocol in a VPN
   5.  Deployment Considerations
     5.1.  When BGP is the PE-CE Protocol in a VPN
     5.2.  Risks of Depreferencing Routes
   6.  Security Considerations
   7.  Examples of Operation
   8.  Acknowledgements
   9.  Contributors
   10. IANA Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   Historically, routing protocols in general and BGP in particular have
   been designed with a focus on correctness, where a key part of
   "correctness" is for each network element's forwarding state to
   converge toward the current state of the network as quickly as
   possible. For this reason, the protocol was designed to remove state
   advertised by routers which went down (from a BGP perspective) as
   quickly as possible. Over time, this has been relaxed somewhat,
   notably by BGP Graceful Restart [RFC4724]; however, the paradigm has
   remained one of attempting to rapidly remove "stale" state from the
   network.

   Over time, two phenomena have arisen that call into question the
   underlying assumptions of this paradigm.
   The first is the widespread
   adoption of tunneled forwarding infrastructures, for example MPLS.
   Such infrastructures eliminate the risk of some types of forwarding
   loops that can arise in hop-by-hop forwarding, and thus reduce one of
   the motivations for strong consistency between forwarding elements.
   The second is the increasing use of BGP as a transport for data less
   closely associated with packet forwarding than was originally the
   case. Examples include the use of BGP for autodiscovery (VPLS
   [RFC4761]) and filter programming (FLOWSPEC [RFC5575]). In these
   cases, BGP data takes on a character more akin to configuration than
   to traditional routing.

   The observations above motivate a desire to offer network operators
   the ability to choose to retain BGP data for a longer period than has
   hitherto been possible when the BGP control plane fails for some
   reason. Although the semantics of BGP Graceful Restart [RFC4724] are
   close to those desired, several gaps exist, most notably in the
   maximum time for which "stale" information can be retained --
   Graceful Restart imposes a 4095-second upper bound.

   In this document we introduce a new BGP capability termed "Long-lived
   Graceful Restart Capability" so that stale information can be
   retained for a longer time across a session reset. We also introduce
   a new BGP community, "LLGR_STALE", to mark such information. Such
   stale information is to be treated as least-preferred, and its
   advertisement limited to BGP speakers that support the new
   capability. Where possible, we reference the semantics of BGP
   Graceful Restart [RFC4724] rather than specifying similar semantics
   in this document.

   The expected deployment model for this extension is that it will only
   be invoked for certain address families. This is discussed in more
   detail in the Deployment Considerations section (Section 5).
   When
   used, its use may be combined with that of traditional Graceful
   Restart, in which case it is invoked only after the traditional
   Graceful Restart interval has elapsed, or it may be invoked
   immediately. Apart from the potential to greatly extend the timer,
   the most obvious difference between Long-Lived and traditional
   Graceful Restart is that in the Long-Lived version, routes are
   "depreferenced", that is, treated as least-preferred, whereas in the
   traditional version, route preference is not affected. The design
   choice to treat Long-Lived Stale routes as least-preferred was
   informed by the expectation that they might be retained for a
   (potentially) almost unbounded period of time, whereas in the
   traditional Graceful Restart case, stale routes are retained for only
   a brief interval. In the GR case, the tradeoff between advertising
   new route status (at the cost of routing churn) and not advertising
   it (at the cost of suboptimal or incorrect route selection) is
   resolved in favor of not advertising, and in the LLGR case, it is
   resolved in favor of advertising new state.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Definitions

   Depreference, Depreferenced: A route is said to be depreferenced if
      it has its route selection preference reduced in reaction to some
      event.

   GR: Abbreviation for "Graceful Restart" [RFC4724], also sometimes
      referred to herein as "conventional Graceful Restart" or
      "conventional GR" to distinguish it from the "Long-lived Graceful
      Restart" defined by this document.

   Helper: Or "helper router". During Graceful Restart or Long-lived
      Graceful Restart, the router that detects a session failure and
      applies the listed procedures.
      [RFC4724] refers to this as the
      "receiving speaker".

   LLGR: Abbreviation for "Long-lived Graceful Restart".

   LLST: Abbreviation for "Long-lived Stale Time".

   Route: We use "route" to mean any information encoded as a BGP NLRI
      and set of path attributes. As discussed above, the connection
      between such routes and installation of forwarding state may be
      quite remote.

3.  Protocol Extensions

   A new BGP capability and two new BGP communities are introduced.

3.1.  Long-lived Graceful Restart Capability

   The "Long-lived Graceful Restart Capability" is a new BGP capability
   [RFC5492] that can be used by a BGP speaker to indicate its ability
   to preserve its state according to the procedures of this document.
   This capability MUST be advertised in conjunction with the Graceful
   Restart capability [RFC4724], see the "Use of Graceful Restart
   Capability" section (Section 4.1).

   The capability value consists of one or more tuples as follows:

      +--------------------------------------------------+
      | Address Family Identifier (16 bits)              |
      +--------------------------------------------------+
      | Subsequent Address Family Identifier (8 bits)    |
      +--------------------------------------------------+
      | Flags for Address Family (8 bits)                |
      +--------------------------------------------------+
      | Long-lived Stale Time (24 bits)                  |
      +--------------------------------------------------+
      | ...                                              |
      +--------------------------------------------------+
      | Address Family Identifier (16 bits)              |
      +--------------------------------------------------+
      | Subsequent Address Family Identifier (8 bits)    |
      +--------------------------------------------------+
      | Flags for Address Family (8 bits)                |
      +--------------------------------------------------+
      | Long-lived Stale Time (24 bits)                  |
      +--------------------------------------------------+

   The meanings of the fields are as follows:

      Address Family Identifier (AFI), Subsequent Address Family
      Identifier (SAFI):

         The AFI and SAFI, taken in combination, indicate that the BGP
         speaker has the ability to preserve its forwarding state for
         the address family during a subsequent BGP restart. Routes may
         be explicitly associated with a particular AFI and SAFI using
         the encoding of [RFC4760] or implicitly associated with
         <AFI=IPv4, SAFI=Unicast> if using the encoding of [RFC4271].

      Flags for Address Family:

         This field contains bit flags relating to routes that were
         advertised with the given AFI and SAFI.

             0 1 2 3 4 5 6 7
            +-+-+-+-+-+-+-+-+
            |F|   Reserved  |
            +-+-+-+-+-+-+-+-+

         The most significant bit is used to indicate whether the state
         for routes that were advertised with the given AFI and SAFI has
         indeed been preserved during the previous BGP restart. When
         set (value 1), the bit indicates that the state has been
         preserved. This bit is called the "F bit" since it was
         historically used to indicate preservation of Forwarding State.
         Use of the F bit is detailed in the Session Resets section
         (Section 4.2).

         The remaining bits are reserved and MUST be set to zero by the
         sender and ignored by the receiver.
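
   As a rough illustration of the tuple layout shown in the figure
   above, the following Python sketch packs and unpacks the capability
   value as a sequence of 7-octet tuples (16-bit AFI, 8-bit SAFI, 8-bit
   flags, 24-bit stale time). The names LlgrTuple, encode_tuple, and
   decode_capability are hypothetical, and rejecting a value whose
   length is not a multiple of 7 is an assumption made for the sketch;
   this document does not specify that handling.

```python
import struct
from typing import List, NamedTuple

F_BIT = 0x80    # most significant bit of the flags octet
TUPLE_LEN = 7   # 2 (AFI) + 1 (SAFI) + 1 (flags) + 3 (stale time) octets


class LlgrTuple(NamedTuple):
    """One per-AFI/SAFI entry of the capability value (hypothetical type)."""
    afi: int
    safi: int
    f_bit: bool
    stale_time: int  # seconds, carried in 24 bits


def encode_tuple(afi: int, safi: int, f_bit: bool, stale_time: int) -> bytes:
    """Encode one tuple; reserved flag bits are sent as zero."""
    if not 0 <= stale_time < (1 << 24):
        raise ValueError("stale time does not fit in 24 bits")
    flags = F_BIT if f_bit else 0
    return struct.pack("!HBB", afi, safi, flags) + stale_time.to_bytes(3, "big")


def decode_capability(value: bytes) -> List[LlgrTuple]:
    """Decode a capability value into tuples, ignoring reserved flag bits."""
    if len(value) % TUPLE_LEN != 0:
        # Assumption for this sketch: treat a ragged length as an error.
        raise ValueError("capability length is not a multiple of 7")
    tuples = []
    for off in range(0, len(value), TUPLE_LEN):
        afi, safi, flags = struct.unpack_from("!HBB", value, off)
        stale_time = int.from_bytes(value[off + 4:off + 7], "big")
        tuples.append(LlgrTuple(afi, safi, bool(flags & F_BIT), stale_time))
    return tuples
```

   The decoder masks only the F bit, so any reserved bits a peer might
   set are ignored on receipt, while the encoder never sets them.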

      Long-lived Stale Time:

         This time (in seconds) specifies how long stale information
         (for the AFI/SAFI) may be retained (possibly in conjunction
         with the period specified by the "Restart Time" in the Graceful
         Restart Capability, if present).

3.2.  LLGR_STALE Community

   We introduce a new BGP community [RFC1997] "LLGR_STALE" (value: TBD).
   It can be used to mark stale routes retained for a longer period of
   time. Such long-lived stale routes are to be handled according to
   the procedures specified in the Operation section (Section 4).

   An implementation MAY allow users to configure policies that accept,
   reject, or modify routes based on the presence or absence of this
   community.

3.3.  NO_LLGR Community

   We introduce a new BGP community "NO_LLGR" (value: TBD). It can be
   used to mark routes which a BGP speaker does not want treated
   according to these procedures, as detailed in the Operation section
   (Section 4).

   An implementation MAY allow users to configure policies that accept,
   reject, or modify routes based on the presence or absence of this
   community.

4.  Operation

   A BGP speaker MAY use BGP Capabilities Advertisements [RFC5492] to
   advertise the "Long-lived Graceful Restart Capability" to indicate
   its ability to retain state and perform related procedures specified
   in this document. The setting of the parameters for an AFI/SAFI
   depends on the properties of the BGP speaker, network scale, and
   local configuration.

   In the presence of the "Long-lived Graceful Restart Capability", the
   procedures specified in [RFC4724] and
   [I-D.ietf-idr-bgp-gr-notification] continue to apply unless
   explicitly revised by this document.

4.1.  Use of Graceful Restart Capability

   The Graceful Restart capability MUST be advertised in conjunction
   with the LLGR capability. If it is not so advertised, the LLGR
   capability MUST be disregarded.
   The purpose for mandating that both
   be used in conjunction is to enable reuse of certain base mechanisms
   that are common to both "flavors", notably origination, collection
   and processing of EoR, as well as the finite state machine
   modifications and connection reset logic introduced by GR.

   We observe that if support for conventional Graceful Restart is not
   desired for the session, the conventional GR phase can be skipped by
   omitting all AFI/SAFI from the GR capability, advertising a Restart
   Time of zero, or both. The Session Resets section (Section 4.2)
   discusses the interaction of conventional and long-lived GR.

4.2.  Session Resets

   BGP Graceful Restart [RFC4724], updated by
   [I-D.ietf-idr-bgp-gr-notification], defines conditions under which a
   BGP session can reset and have its associated routes retained. If
   such a reset occurs for a session for which the LLGR Capability has
   also been exchanged, the following procedures apply.

   If the Graceful Restart Capability that was received does not list
   all AFI/SAFI supported by the session, then for those non-listed AFI/
   SAFI the GR "Restart Time" shall be deemed zero. Similarly, if the
   received LLGR Capability does not list all AFI/SAFI supported by the
   session, then for those non-listed AFI/SAFI the "Long-lived Stale
   Time" shall be deemed zero.

   The following text in Section 4.2 of the GR specification [RFC4724]
   no longer applies:

      If the session does not get re-established within the "Restart
      Time" that the peer advertised previously, the Receiving Speaker
      MUST delete all the stale routes from the peer that it is
      retaining.

   and the following procedures are specified instead:

   After the session goes down and before the session is re-established,
   the stale routes for an AFI/SAFI MUST be retained.
   The interval for
   which they are retained is limited by the sum of the "Restart Time"
   in the received Graceful Restart Capability and the "Long-lived Stale
   Time" in the received Long-lived Graceful Restart Capability. These
   timers MAY be modified by local configuration.

   If the value of the "Restart Time" or the "Long-lived Stale Time" is
   zero, the duration of the corresponding period would be zero seconds.
   So, for example, if the "Restart Time" is zero and the "Long-lived
   Stale Time" is nonzero, only the procedures particular to LLGR would
   apply. Conversely, if the "Long-lived Stale Time" is zero and the
   "Restart Time" is nonzero, only the procedures of GR would apply. If
   both are zero, none of these procedures would apply, only those of
   the base BGP specification (although EoR would still be used as
   detailed in [RFC4724]). And finally, if both are nonzero, then the
   procedures would be applied serially -- first those of GR, then those
   of LLGR. We observe that during the first interval, while the
   procedures of GR are in effect, route preference would not be
   affected, while during the second interval, while LLGR procedures are
   in effect, routes would be treated as least-preferred as specified
   elsewhere in this document.

   Once the "Restart Time" period ends (including the case that the
   "Restart Time" is zero), the LLGR period is said to have begun and
   the following procedures MUST be performed:

   o  The helper router MUST start a timer for the "Long-lived Stale
      Time". If the timer for the "Long-lived Stale Time" expires
      before the session is re-established, the helper MUST delete all
      the stale routes from the neighbor that it is retaining.

   o  The helper router MUST attach the LLGR_STALE community for the
      stale routes being retained. Note that this requirement implies
      that the routes would need to be readvertised, to disseminate the
      modified community.

   o  If any of the routes from the peer have been marked with the
      NO_LLGR community, either as sent by the peer, or as the result of
      a configured policy, they MUST NOT be retained, but MUST be
      removed as per the normal operation of [RFC4271].

   o  The helper router MUST perform the procedures listed under
      Section 4.3.

   Once the session is re-established, the procedures specified in
   [RFC4724] apply for the stale routes irrespective of whether the
   stale routes are retained during the "Restart Time" period or the
   "Long-lived Stale Time" period. However, in the case of consecutive
   restarts (i.e., the session goes down before the EoR is received) the
   previously marked stale routes MUST NOT be deleted before the timer
   for the "Long-lived Stale Time" expires.

   Similarly to [RFC4724], once the session is re-established, if the F
   bit for a specific address family is not set in the newly received
   LLGR Capability, or if a specific address family is not included in
   the newly received LLGR Capability, or if the LLGR and accompanying
   GR Capability are not received in the re-established session at all,
   then the Helper MUST immediately remove all the stale routes from the
   peer that it is retaining for that address family.

   If a "Long-lived Stale Time" timer is running for a peer, it MUST NOT
   be updated (other than by manual operator intervention) until the
   peer has established and synchronized a new session. The session is
   termed "synchronized" once the EoR has been received from the peer.

   The value of the "Long-lived Stale Time" in the capability received
   from a neighbor MAY be reduced by local configuration.

   While the session is down, the expiration of the "Long-lived Stale
   Time" timer is treated analogously to the expiration of the "Restart
   Time" timer in Graceful Restart. However, the timer continues to run
   once the session has re-established.
   The timer is not stopped, nor
   updated, until EoR is received from the peer. If the timer expires
   during synchronization with the peer, any stale routes that the peer
   has not refreshed are removed. If the session subsequently resets
   prior to becoming synchronized, any remaining routes should be
   removed immediately.

4.3.  Processing LLGR_STALE Routes

   A BGP speaker that has advertised the "Long-lived Graceful Restart
   Capability" to a neighbor MUST perform the following upon receiving a
   route from that neighbor with the "LLGR_STALE" community, or upon
   attaching the "LLGR_STALE" community itself per Section 4.2:

   o  Treat the route as the least-preferred in route selection (see
      below). See the Risks of Depreferencing Routes section
      (Section 5.2) for a discussion of potential risks inherent in
      doing this.

   o  The route SHOULD NOT be advertised to any neighbor from which the
      Long-lived Graceful Restart Capability has not been received. The
      exception is described in the Optional Partial Deployment
      Procedure section (Section 4.7). Note that this requirement
      implies that such routes should be withdrawn from any such
      neighbor.

   o  The "LLGR_STALE" community MUST NOT be removed when the route is
      further advertised.

4.4.  Route Selection

   In this document, when we refer to treating a route as least-
   preferred, this means the route MUST be treated as less preferred
   than any other route that is not so treated. When performing route
   selection between two routes both of which are least-preferred,
   normal tie-breaking applies. Note that this would only be expected
   to happen if the only routes available for selection were least-
   preferred -- in all other cases, such routes would have been
   eliminated from consideration.

4.5.  Multicast VPN

   If LLGR is being used in a network that carries Multicast VPN (MVPN)
   traffic ([RFC6513], [RFC6514]), special considerations apply.

   [RFC6513] defines the notion of the "Upstream PE" and the "Upstream
   Multicast Hop" (UMH) for a particular multicast flow. To determine
   the Upstream PE and/or the UMH for a particular flow, a particular
   set of comparable BGP routes (the "UMH-eligible" routes for that
   flow, as defined in [RFC6513]) is considered, and the "best" one
   (according to the BGP bestpath selection algorithm) is chosen. The
   UMH-eligible routes are routes with AFI/SAFI 1/1, 1/2, 2/1, or 2/2.
   When a router detects a change in the Upstream PE or UMH for a given
   flow, the router may modify its data plane state for that flow. For
   example, the router may begin to discard any packets of the flow that
   it believes have arrived from the previously chosen Upstream PE or
   UMH. The assumption is that the newly chosen Upstream PE and/or UMH
   will make the corresponding changes, if necessary, to their own data
   plane states. In addition, if a router detects a change in the
   Upstream PE or UMH for a given flow, it may originate or readvertise
   (with different attributes) certain of the BGP MCAST-VPN routes
   (routes with SAFI 5) that are defined in [RFC6514]. The assumption
   is that the MCAST-VPN routes will be properly distributed by BGP to
   other routers that have data plane states for the given flow, i.e.,
   that BGP will converge so that all routers handle the flow in a
   consistent manner.

   However, if detection of a change to the Upstream PE or UMH is based
   entirely on stale routes, one cannot assume that BGP will converge;
   rather one must assume that the UMH-eligible routes and the MCAST-VPN
   routes are not being properly distributed.
   Since the purpose of the
   LLGR procedures is to try to keep the data flowing (by "freezing" the
   data plane states) when the control plane updates are not being
   properly distributed, it does not seem appropriate to react to
   changes that are based entirely on stale routes. Therefore, the
   following rules MUST be applied when a router is computing or
   recomputing the Upstream PE and/or the UMH for a given multicast
   flow:

   o  STALE routes (i.e., UMH-eligible routes with the LLGR_STALE
      community) are less preferable than non-STALE routes.

   o  If all the UMH-eligible routes for a given flow are STALE, then
      the Upstream PE and/or UMH for that flow is considered to be
      "stale".

   o  If the Upstream PE or UMH for a given multicast flow has already
      been determined, and the result of a new computation yields a new
      Upstream PE or UMH, but the Upstream PE or UMH is "stale" (as
      defined just above), then the Upstream PE and/or UMH for that flow
      MUST be left unchanged.

   o  If the Upstream PE or UMH for a given multicast flow has not
      already been determined, but is now determined to be STALE, the
      multicast flow is considered to have no reachable Upstream PE and/
      or UMH.

   [RFC6514] also defines a set of route types with SAFI 5 ("MCAST-VPN"
   routes). LLGR can be applied to MCAST-VPN routes. However, the
   following MCAST-VPN route types require special procedures, as
   specified in this section:

   o  Leaf A-D routes

   o  C-multicast Shared Tree Join routes

   o  C-multicast Source Tree Join routes

   Routes of these three types are always "targeted" to a particular
   upstream router. Depending on the situation, the targeted router may
   be the Upstream PE for a given flow or the UMH for a given flow.
   Alternatively, the targeted router may be determined by choosing the
   "best" route (according to the BGP bestpath algorithm) from among a
   set of comparable Intra-AS I-PMSI A-D routes, or from among a set of
   comparable Inter-AS I-PMSI A-D routes, or from among a set of
   comparable S-PMSI A-D routes. (See [RFC6513], [RFC6514], [RFC6625],
   and draft-ietf-mpls-seamless-mcast for details.) Once the target is
   chosen, it is identified in an IPv4-address-specific Route Target
   (RT) or an IPv6-address-specific RT that is attached to the route
   before the route is advertised. If the target for one of these
   routes changes, the value of the attached RT will also change. This
   in turn may cause the route to be advertised, readvertised, or
   withdrawn on specific BGP sessions.

   For cases where the targeted router is the Upstream PE or the UMH for
   a particular flow, the rules given previously in this section apply.
   For example, if a Leaf A-D route is targeted to a flow's UMH, and all
   the relevant UMH-eligible routes are stale, the UMH is left
   unchanged. Thus the Leaf A-D route is not readvertised with a new
   RT.

   In those cases where the targeted router for a given Leaf A-D route
   is selected by comparing a set of S-PMSI A-D routes, or where the
   targeted router for a given C-multicast Shared or Source Tree Join
   route is selected by comparing a set of Inter-AS I-PMSI A-D routes,
   the following rules MUST be applied:

   o  STALE routes (i.e., "I/S-PMSI A-D routes" with the LLGR_STALE
      community) are less preferable than non-STALE routes.

   o  If all the routes being considered are STALE, then the targeted
      router of the Leaf A-D route or C-multicast Shared or Source Tree
      Join route MUST NOT be changed.

   This prevents a Leaf A-D route or C-multicast route from being
   targeted to a particular router if the relevant I/S-PMSI A-D routes
   from that router are stale.
   Since those routes are stale, it is
   likely that the Leaf A-D route or C-multicast route would not make it
   to the targeted router, in which case it is better to maintain the
   existing data plane states than to make changes that presuppose that
   the MCAST-VPN routes will be properly distributed.

4.6.  Errors

   If the LLGR capability is received without an accompanying GR
   capability, the LLGR capability MUST be ignored, that is, the
   implementation MUST behave as though no LLGR capability had been
   received.

4.7.  Optional Partial Deployment Procedure

   Ideally, all routers in an Autonomous System would support this
   specification before it was enabled. However, to facilitate
   incremental deployment, stale routes MAY be advertised to neighbors
   that have not advertised the Long-lived Graceful Restart Capability
   under the following conditions:

   o  The neighbors MUST be internal (IBGP or Confederation) neighbors.

   o  The NO_EXPORT community [RFC1997] MUST be attached to the stale
      routes.

   o  The stale routes MUST have their LOCAL_PREF set to zero. See the
      Risks of Depreferencing Routes section (Section 5.2) for a
      discussion of potential risks inherent in doing this.

   If this strategy for partial deployment is used, the network operator
   should set LOCAL_PREF to zero for all LLGR routes throughout the
   Autonomous System. This trades off a small reduction in flexibility
   (ordering may not be preserved between competing LLGR routes) for
   consistency between routers which do, and do not, support this
   specification. Since consistency of route selection can be important
   for preventing forwarding loops, the latter consideration dominates.

4.8.  Procedures When BGP is the PE-CE Protocol in a VPN

   In VPN deployments, for example [RFC4364], BGP is often used as a
   PE-CE protocol.
It may be a practical necessity in such deployments to 596 accommodate interoperation with CEs that cannot easily be upgraded to 597 support specifications such as this one. This leads to a problem: in 598 this specification, we take pains to ensure that "stale" routing 599 information will not leak beyond the perimeter of routers that 600 support these procedures, so that it can be depreferenced as 601 expected, and we provide a workaround (Section 4.7) for the case 602 where one or more IBGP routers are not upgraded. However, in the VPN 603 PE-CE case, the protocol in use is EBGP, and our workaround does not 604 work since it relies on the use of LOCAL_PREF, an IBGP-only path 605 attribute. 607 We observe that the principal motivation for restricting the 608 propagation of "stale" routing information is the desire to prevent 609 it from spreading without limit once it exits the "safe" perimeter. 610 We further observe that VPN deployments are typically topologically 611 constrained, making this concern moot. For this reason, an 612 implementation MAY advertise stale routes over a PE-CE session, when 613 explicitly configured to do so. That is, the second rule listed in 614 Section 4.3 MAY be disregarded in such cases. All other rules 615 continue to apply. Finally, if this exception is used, the 616 implementation SHOULD by default attach the NO_EXPORT community to 617 the routes in question, as an additional protection against stale 618 routes spreading without limit. Attachment of the NO_EXPORT 619 community MAY be disabled by explicit configuration, to accommodate 620 exceptional cases. 622 See further discussion in Section 5.1. 624 5. Deployment Considerations 626 The deployment considerations discussed in [RFC4724] apply to this 627 document. In addition, network operators are cautioned to carefully 628 consider the potential disadvantages of deploying these procedures 629 for a given AFI/SAFI. 
Most notably, if used for an AFI/SAFI that 630 conveys traditional reachability information, use of a long-lived 631 stale route could result in a loss of connectivity for the covered 632 prefix. This specification takes pains to mitigate this risk where 633 possible, by making such routes least-preferred and by restricting 634 the scope of such routes to routers that support these procedures 635 (or, optionally, a single Autonomous System, see "Optional Partial 636 Deployment Procedure", above). However, according to the normal 637 rules of IP forwarding, a stale more-specific route that has no 638 non-stale alternate paths available will still be used instead of a 639 non-stale less-specific route. Networks in which the deployment of these 640 procedures would be especially concerning include those which do not 641 use "tunneled" forwarding (in other words, those using traditional 642 hop-by-hop forwarding). 644 Implementations MUST NOT enable these procedures by default. They 645 MUST require affirmative configuration per AFI/SAFI in order to 646 enable them. 648 The procedures of this document do not alter the route resolvability 649 requirement of [RFC4271] Section 9.1.2.1. Because of this, it will 650 commonly be the case that "stale" IBGP routes will only continue to 651 be used if the router depicted in the next hop remains resolvable, 652 even if its BGP component is down. Details of IGP fault-tolerance 653 strategies are beyond the scope of this document. In addition to the 654 foregoing, it may be advisable to check the viability of the next hop 655 through other means, see for example 656 [I-D.ietf-idr-bgp-bestpath-selection-criteria]. This may be 657 especially useful in cases where the next hop is known directly at 658 the network layer, notably EBGP.
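The interaction between staleness and next-hop resolvability described above can be sketched in Python. This is a minimal illustration, not the BGP decision process of [RFC4271]: the Route type, the select_route function, and the set of resolvable next hops are hypothetical names introduced here, and only two criteria are modeled (stale routes are least-preferred, and an unresolvable route is never used).

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of one BGP path for a prefix."""
    prefix: str
    next_hop: str
    llgr_stale: bool       # True if the path carries LLGR_STALE
    local_pref: int = 100


def select_route(routes: Iterable[Route],
                 resolvable_next_hops: Set[str]) -> Optional[Route]:
    """Pick a path, treating long-lived stale paths as least-preferred.

    Assumption: resolvability is modeled as simple set membership; a real
    implementation resolves the next hop through its routing table.
    """
    # A path whose next hop cannot be resolved is not eligible at all,
    # whether or not it is stale.
    candidates = [r for r in routes if r.next_hop in resolvable_next_hops]
    if not candidates:
        return None
    # Non-stale sorts before stale; LOCAL_PREF only breaks ties within
    # the same staleness class.
    return min(candidates, key=lambda r: (r.llgr_stale, -r.local_pref))
```

In this sketch a stale path is still returned when it is the only resolvable candidate, which matches the text's observation that stale routes continue to be used when no alternative exists.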
660 As discussed in this document, after a BGP session goes down and 661 before the session is re-established, stale routes may be retained 662 for up to two consecutive periods, controlled by the "Restart Time" 663 and the "Long-lived Stale Time", respectively. During the first 664 period, routing churn would be prevented, but with potential 665 blackholing of traffic. During the second period, potential 666 blackholing of traffic may be reduced, but routing churn would be 667 visible throughout the network. The setting of the relevant 668 parameters for a particular application should take into account the 669 tradeoffs, the network dynamics and potential failure scenarios. If 670 needed, the first period can be bypassed either by local 671 configuration or by setting the "Restart Time" in the Graceful 672 Restart Capability to zero and/or not listing the AFI/SAFI in that 673 Capability. 675 The setting of the F bit (and the "Forwarding State" bit of the 676 accompanying GR capability) depends in part on deployment 677 considerations. When left clear, the F bit can be understood as an 678 indication that the Helper should flush the associated routes. 679 As discussed in the Introduction, an important use case for LLGR is 680 for routes that are more akin to configuration than to traditional 681 routing. For such routes, it may make sense to always set the F bit, 682 regardless of other considerations. Likewise, for control-plane-only 683 entities such as dedicated route reflectors that do not participate 684 in the forwarding plane, it makes sense to always set the F bit. 685 Overall, the rule of thumb is that if loss of state on the restarting 686 router can reasonably be expected to cause a forwarding loop or black 687 hole, the F bit should be set scrupulously according to whether state 688 has been retained. Specifics of when the F bit is, and is not, set 689 are implementation-dependent and may also be controlled by 690 configuration. 692 5.1.
When BGP is the PE-CE Protocol in a VPN 694 As discussed in Section 4.8, it may be necessary to advertise stale 695 routes to a CE in some VPN deployments, even if the CE does not 696 support this specification. In that case, the network operator 697 configuring their PE to advertise such routes should notify the 698 operator of the CE receiving the routes, and the CE should be 699 configured to depreference the routes. Typical BGP implementations 700 will be able to do this by matching on the LLGR_STALE community, and 701 setting the LOCAL_PREF for matching routes to zero, similar to the 702 procedure described in Section 4.7. 704 5.2. Risks of Depreferencing Routes 706 Depreferencing EBGP routes is considered safe, no different from the 707 common practice of applying a routing policy to an EBGP session. 708 However, the same is not always true of IBGP. 710 Consistent route selection is a fundamental tenet of IBGP correctness 711 and safe operation in hop-by-hop routed networks. When routers 712 within an AS apply different criteria in selecting routes, they can 713 arrive at inconsistent route selections, potentially with the 714 consequence of forming forwarding loops unless some form of tunneled 715 forwarding is used to prevent "core" routers from making a 716 (potentially inconsistent) forwarding decision based on the IP 717 header. 719 This specification uses the state of a peering session as an input to 720 the selection criteria, depreferencing routes that are associated 721 with a session that has gone down but have not yet aged out. Since 722 different routers within an AS might have different notions as to 723 whether their respective sessions with a given peer are up or down, 724 they might apply different selection criteria to routes from that 725 peer. This could result in a forwarding loop forming between such 726 routers. 
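A minimal Python sketch of how such divergent views can produce a loop is given here. It assumes the same four-router chain that this section uses as its example (A, B, C, D in a line, unit cost on the short links, cost 5 on the long link, with A and D as the exits for the same destination). The function names and the stale_view mapping are hypothetical, and route selection is reduced to two criteria: non-stale before stale, then lowest IGP cost to the exit.

```python
# Routers in chain order, and the IGP cost of each adjacent link.
ORDER = ["A", "B", "C", "D"]
LINK_COST = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 5}
EXITS = ["A", "D"]  # border routers advertising the same destination


def igp_cost(src, dst):
    """Sum of link costs along the chain between src and dst."""
    i, j = sorted((ORDER.index(src), ORDER.index(dst)))
    return sum(LINK_COST[(ORDER[k], ORDER[k + 1])] for k in range(i, j))


def next_hop_toward(src, dst):
    """Adjacent router on the chain in the direction of dst."""
    i, j = ORDER.index(src), ORDER.index(dst)
    return ORDER[i + (1 if j > i else -1)]


def select_exit(router, stale_exits):
    """Prefer exits whose routes are not stale, then the closest exit."""
    return min(EXITS, key=lambda e: (e in stale_exits, igp_cost(router, e)))


def forward(start, stale_view, max_hops=8):
    """Follow hop-by-hop forwarding; return (path, loop_detected).

    stale_view maps a router to the set of exits whose routes that
    router currently treats as long-lived stale.
    """
    path, router = [start], start
    for _ in range(max_hops):
        exit_router = select_exit(router, stale_view.get(router, set()))
        if exit_router == router:
            return path, False          # packet leaves the AS here
        router = next_hop_toward(router, exit_router)
        if router in path:
            return path + [router], True
        path.append(router)
    return path, True
```

With an empty stale_view, a packet entering at B is forwarded to A and exits. If only B treats the routes from A as stale, B forwards toward D via C while C still forwards toward A via B, so the sketch reports a loop between B and C.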
728 For an example of such a forwarding loop, consider the following 729 simple topology:
731 A ---- B ---- C ------------------------- D
732 ^                                         ^
733 |                                         |
734 R1                                        R2
736 In this example, A through D are routers with a full mesh of IBGP sessions 737 between them. The short links have unit cost; the long link has cost 738 5. Routers A and D are AS border routers, each advertising some 739 route, R, into the AS -- these are denoted R1 and R2 in the diagram. 740 In ordinary operation, it can be seen that routers B and C will 741 select R1 for forwarding, and will forward toward A. 743 Suppose that the session between A and B goes down for some reason, 744 and stays down long enough for LLGR processing to be invoked on B. 745 Then on B, route R1 will be depreferenced, leading to the selection 746 of R2 by B. However, C will continue to prefer R1. It can be seen 747 that in this case, a forwarding loop for packets destined to R would 748 form between B and C. (We note that other forwarding loop scenarios 749 can be constructed for traditional GR, but are generally considered 750 less severe since GR can remain in effect for a much more limited 751 interval.) 753 The potential benefits of this specification can outweigh the risks 754 discussed above, as long as care is exercised in deployment. The 755 cardinal rule to be followed is, if a given set of routes is being 756 used within an AS for hop-by-hop forwarding, it is NOT RECOMMENDED to 757 enable LLGR procedures. If tunneled forwarding (such as MPLS) is 758 used within the AS, or if routes are being used for purposes other 759 than hop-by-hop forwarding, less caution is needed, though the 760 operator should still carefully consider the consequences of enabling 761 LLGR. 763 6. Security Considerations 765 The security implications of the LLGR mechanism defined in 766 this document are akin to those incurred by the maintenance of stale 767 routing information within a network.
This is particularly relevant 768 when considering the maintenance of routing information that is 769 utilised for service segregation, such as MPLS label entries. 771 For MPLS VPN services, the effectiveness of the traffic isolation 772 between VPNs relies on the correctness of the MPLS labels between 773 ingress and egress PEs. In particular, when an egress PE withdraws a 774 label L1 allocated to a VPN1 route, this label MUST NOT be assigned 775 to a VPN route of a different VPN until all ingress PEs stop using 776 the old VPN1 route using L1. 778 Such a corner case may happen today, if the propagation of VPN routes 779 by BGP messages between PEs takes more time than the label 780 re-allocation delay on a PE. Given that we can generally bound 781 worst-case BGP propagation time to a few minutes (for example 2-5), the 782 security breach will not occur if PEs are designed not to reallocate 783 a previously used and withdrawn label until a few minutes have elapsed. 785 The problem is made worse with BGP GR between PEs, as VPN routes can 786 remain stale for a longer period of time (for example 20 minutes). 788 This is further aggravated by the BGP LLGR extension proposed in this 789 document, as VPN routes can remain stale for a much longer period of 790 time (for example 2 hours or 1 day). 792 Therefore, to avoid a VPN breach, before enabling BGP LLGR, SPs need 793 to check how fast a given label can be reused by a PE, taking into 794 account: 796 o The load of the BGP route churn on a PE (in terms of the number of VPN 797 labels advertised and the churn rate). 799 o The label allocation policy on the PE (possibly depending upon the 800 size of the VPN label pool (which can be restricted by 801 hardware considerations or other MPLS usages), the label 802 allocation scheme (for example per route or per VRF/CE), and the 803 re-allocation policy (for example least recently used label...)).
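The check described above can be approximated with a back-of-the-envelope calculation, sketched here in Python. Everything in it is an assumption made for illustration: the function names are hypothetical, the PE is assumed to reallocate labels in least-recently-used order from a free pool of fixed size, and an ingress PE is assumed to keep using a withdrawn label for at most the BGP propagation bound plus the GR Restart Time plus the Long-lived Stale Time. Real label allocators and failure scenarios may behave differently.

```python
def min_label_reuse_seconds(free_pool_size, allocations_per_second):
    """Earliest time a freed label can be handed out again.

    Assumption: least-recently-used allocation, so a freed label waits
    behind every other label currently in the free pool.
    """
    if allocations_per_second <= 0:
        return float("inf")  # no churn: the label is never reused
    return free_pool_size / allocations_per_second


def worst_case_stale_seconds(propagation_bound, restart_time, llst):
    """Longest time an ingress PE might keep using a withdrawn label.

    Assumption: the three delays simply add up.
    """
    return propagation_bound + restart_time + llst


def label_reuse_is_safe(free_pool_size, allocations_per_second,
                        propagation_bound, restart_time, llst):
    """True if a label cannot be reused while stale routes may still use it."""
    reuse = min_label_reuse_seconds(free_pool_size, allocations_per_second)
    stale = worst_case_stale_seconds(propagation_bound, restart_time, llst)
    return reuse > stale
```

For example, with a propagation bound of 300 seconds, a Restart Time of 120 seconds, and a Long-lived Stale Time of one day (86400 seconds), a free pool of 100000 labels consumed at one allocation per second gives a reuse interval of 100000 seconds, which exceeds the 86820-second stale window. A pool of 10000 labels at the same rate does not.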
805 Note that [RFC4781], which defines the Graceful Restart Mechanism for BGP 806 with MPLS, is also applicable to BGP LLGR. 808 In addition to these considerations, the LLGR mechanism described 809 within this document is considered to be complex to exploit 810 maliciously: in order to inject packets into a topology, there is a 811 requirement to engineer a specific LLGR state between two PE devices, 812 whilst engineering label reallocation to occur in a manner that 813 results in the two topologies overlapping. Such allocation is 814 particularly difficult to engineer (since it is typically an internal 815 mechanism of an LSR). 817 7. Examples of Operation 819 For illustrative purposes, we present a few examples of how this 820 specification might be used in practice. These examples are neither 821 exhaustive nor normative. 823 Consider the following scenario: A border router, ASBR1, has an IBGP 824 peering with a route reflector, RR1, from which it learns routes. It 825 has an EBGP peering with an external peer, EXT, to which it 826 advertises those routes. The external peer has advertised the GR and 827 LLGR Capabilities to ASBR1. ASBR1 is configured to support GR and 828 LLGR on its sessions with RR1 and EXT. RR1 advertises a GR Restart 829 Time of 1 (second) and an LLST of 3600 (seconds):
831 +------------+------------------------------------------------------+
832 | Time       | Event                                                |
833 +------------+------------------------------------------------------+
834 | t          | ASBR1's IBGP session with RR1 fails. ASBR1 retains   |
835 |            | RR1's routes according to the rules of GR [RFC4724]. |
836 |            |                                                      |
837 | t+1        | GR Restart Time expires. ASBR1 transitions RR1's     |
838 |            | routes to long-lived stale by attaching the          |
839 |            | LLGR_STALE community and depreferencing them.        |
840 |            | However, since it has no backup routes, it continues |
841 |            | to make use of them. It re-announces them to EXT     |
842 |            | with the LLGR_STALE community attached.              |
843 |            |                                                      |
844 | t+1+3600   | LLST expires. ASBR1 removes RR1's stale routes from  |
845 |            | its own RIB and sends BGP updates to withdraw them   |
846 |            | from EXT.                                            |
847 +------------+------------------------------------------------------+
849 Next, imagine the same scenario but suppose RR1 advertised a GR 850 Restart Time of zero, effectively disabling GR. Equally, ASBR1 could 851 have used local configuration to override RR1's offered Restart Time, 852 setting it to a locally-configured value of zero:
854 +------------+------------------------------------------------------+
855 | Time       | Event                                                |
856 +------------+------------------------------------------------------+
857 | t          | ASBR1's IBGP session with RR1 fails. ASBR1           |
858 |            | transitions RR1's routes to long-lived stale by      |
859 |            | attaching the LLGR_STALE community and               |
860 |            | depreferencing them. However, since it has no backup |
861 |            | routes, it continues to make use of them. It         |
862 |            | re-announces them to EXT with the LLGR_STALE         |
863 |            | community attached.                                  |
864 |            |                                                      |
865 | t+0+3600   | LLST expires. ASBR1 removes RR1's stale routes from  |
866 |            | its own RIB and sends BGP updates to withdraw them   |
867 |            | from EXT.                                            |
868 +------------+------------------------------------------------------+
870 Next, imagine the original scenario, but consider that the ASBR1-RR1 871 session comes back up and becomes synchronized 180 seconds after the 872 failure was detected:
874 +---------+---------------------------------------------------------+
875 | Time    | Event                                                   |
876 +---------+---------------------------------------------------------+
877 | t       | ASBR1's IBGP session with RR1 fails. ASBR1 retains      |
878 |         | RR1's routes according to the rules of GR [RFC4724].    |
879 |         |                                                         |
880 | t+1     | GR Restart Time expires. ASBR1 transitions RR1's routes |
881 |         | to long-lived stale by attaching the LLGR_STALE         |
882 |         | community and depreferencing them. However, since it    |
883 |         | has no backup routes, it continues to make use of them. |
884 |         | It re-announces them to EXT with the LLGR_STALE         |
885 |         | community attached.                                     |
886 |         |                                                         |
887 | t+1+179 | Session is reestablished and resynchronized. ASBR1      |
888 |         | removes the LLGR_STALE community from RR1's routes and  |
889 |         | re-announces them to EXT with the LLGR_STALE community  |
890 |         | removed.                                                |
891 +---------+---------------------------------------------------------+
893 Finally, imagine the original scenario, but consider that EXT has not 894 advertised the LLGR Capability to ASBR1:
896 +------------+------------------------------------------------------+
897 | Time       | Event                                                |
898 +------------+------------------------------------------------------+
899 | t          | ASBR1's IBGP session with RR1 fails. ASBR1 retains   |
900 |            | RR1's routes according to the rules of GR [RFC4724]. |
901 |            |                                                      |
902 | t+1        | GR Restart Time expires. ASBR1 transitions RR1's     |
903 |            | routes to long-lived stale by attaching the          |
904 |            | LLGR_STALE community and depreferencing them.        |
905 |            | However, since it has no backup routes, it continues |
906 |            | to make use of them. It withdraws them from EXT.     |
907 |            |                                                      |
908 | t+1+3600   | LLST expires. ASBR1 removes RR1's stale routes from  |
909 |            | its own RIB.                                         |
910 +------------+------------------------------------------------------+
912 8. Acknowledgements 914 We would like to thank Roberto Fragassi, John Medamana, Han Nguyen, 915 Jeffrey Haas, Nabil Bitar, Nicolai Leymann, Pranav Mehta, Saikat Ray, 916 Martin Djernaes and Eric Rosen for their valuable inputs and 917 contributions to the discussions and solutions. 919 9.
Contributors 921 Clarence Filsfils 922 Cisco Systems 923 Brussels 1000 924 Belgium 926 Email: cf@cisco.com 928 Pradosh Mohapatra 929 Cumulus Networks 930 Email: pmohapat@cumulusnetworks.com 932 Yakov Rekhter 933 Juniper Networks 935 Email: yakov@juniper.net 937 Eric Rosen 938 Cisco Systems 940 Email: erosen@cisco.com 942 Rob Shakir 943 BT 945 Email: rob.shakir@bt.com 947 Adam Simpson 948 Alcatel-Lucent 949 600 March Road 950 Ottawa, Ontario K2K 2E6 951 Canada 953 Email: adam.simpson@alcatel-lucent.com 955 10. IANA Considerations 957 This document defines a new BGP capability - Long-lived Graceful 958 Restart Capability. The Capability Code needs to be assigned by 959 IANA. 961 This document introduces a new BGP community "LLGR_STALE" for marking 962 the long-lived stale routes, and another community "NO_LLGR" to 963 indicate that stale routes should not be retained. These community 964 values need to be assigned by IANA. 966 11. References 967 11.1. Normative References 969 [I-D.ietf-idr-bgp-gr-notification] 970 Patel, K., Fernando, R., Scudder, J., and J. Haas, 971 "Notification Message support for BGP Graceful Restart", 972 draft-ietf-idr-bgp-gr-notification-01 (work in progress), 973 April 2013. 975 [RFC1997] Chandrasekeran, R., Traina, P., and T. Li, "BGP 976 Communities Attribute", RFC 1997, August 1996. 978 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 979 Requirement Levels", BCP 14, RFC 2119, March 1997. 981 [RFC4271] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway 982 Protocol 4 (BGP-4)", RFC 4271, January 2006. 984 [RFC4724] Sangli, S., Chen, E., Fernando, R., Scudder, J., and Y. 985 Rekhter, "Graceful Restart Mechanism for BGP", RFC 4724, 986 January 2007. 988 [RFC4760] Bates, T., Chandra, R., Katz, D., and Y. Rekhter, 989 "Multiprotocol Extensions for BGP-4", RFC 4760, January 990 2007. 992 [RFC5492] Scudder, J. and R. Chandra, "Capabilities Advertisement 993 with BGP-4", RFC 5492, February 2009. 995 [RFC6513] Rosen, E. and R. 
Aggarwal, "Multicast in MPLS/BGP IP 996 VPNs", RFC 6513, February 2012. 998 [RFC6514] Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP 999 Encodings and Procedures for Multicast in MPLS/BGP IP 1000 VPNs", RFC 6514, February 2012. 1002 11.2. Informative References 1004 [I-D.ietf-idr-bgp-bestpath-selection-criteria] 1005 Asati, R., "BGP Bestpath Selection Criteria Enhancement", 1006 draft-ietf-idr-bgp-bestpath-selection-criteria-06 (work in 1007 progress), February 2013. 1009 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 1010 Networks (VPNs)", RFC 4364, February 2006. 1012 [RFC4761] Kompella, K. and Y. Rekhter, "Virtual Private LAN Service 1013 (VPLS) Using BGP for Auto-Discovery and Signaling", RFC 1014 4761, January 2007. 1016 [RFC4781] Rekhter, Y. and R. Aggarwal, "Graceful Restart Mechanism 1017 for BGP with MPLS", RFC 4781, January 2007. 1019 [RFC5575] Marques, P., Sheth, N., Raszuk, R., Greene, B., Mauch, J., 1020 and D. McPherson, "Dissemination of Flow Specification 1021 Rules", RFC 5575, August 2009. 1023 Authors' Addresses 1025 James Uttaro 1026 AT&T 1027 200 S. Laurel Avenue 1028 Middletown, NJ 07748 1029 USA 1031 Email: ju1738@att.com 1033 Enke Chen 1034 Cisco Systems 1035 170 W. Tasman Drive 1036 San Jose, CA 95134 1037 USA 1039 Email: enkechen@cisco.com 1041 Bruno Decraene 1042 Orange 1043 38-40 Rue de General Leclerc 1044 92794 Issy Moulineaux cedex 9 1045 France 1047 Email: bruno.decraene@orange.com 1049 John G. Scudder 1050 Juniper Networks 1051 1194 N. Mathilda Ave 1052 Sunnyvale, CA 94089 1053 USA 1055 Email: jgs@juniper.net