Network Working Group                                          J. Uttaro
Internet-Draft                                                      AT&T
Intended status: Standards Track                              A. Simpson
Expires: April 22, 2012                                   Alcatel-Lucent
                                                               R. Shakir
                                                                     C&W
                                                             C. Filsfils
                                                            P. Mohapatra
                                                           Cisco Systems
                                                             B. Decraene
                                                          France Telecom
                                                              J. Scudder
                                                              Y. Rekhter
                                                        Juniper Networks
                                                        October 20, 2011

                            BGP Persistence
                  draft-uttaro-idr-bgp-persistence-00

Abstract

   For certain AFI/SAFI combinations it is desirable that a BGP speaker
   be able to retain routing state learned over a session that has
   terminated.  By maintaining routing state, forwarding may be
   preserved.  This technique works effectively as long as the AFI/SAFI
   is primarily used to realize services that do not depend on
   exchanging BGP routing state with peers or customers.  There may be
   exceptions based upon the amount and frequency of route exchange that
   allow for this technique.  Generally the BGP protocol tightly couples
   the viability of a session and the routing state that is learned over
   it.
   This is driven by the history of the protocol and its
   application in the Internet space as a vehicle to exchange routing
   state between administrative authorities.  This document addresses
   new services whose requirements for persistence diverge from the
   Internet routing point of view.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 22, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Communities
     2.1.  PERSIST
     2.2.  DO_NOT_PERSIST
     2.3.  STALE
   3.  Configuration (Persistence Timer, PERSIST and DO_NOT_PERSIST
       Community)
     3.1.  Settings for Different Applications
   4.  Operation
     4.1.  Attaching the STALE Community Value and Propagation of Paths
     4.2.  Forwarding
     4.3.  Example Behaviour
   5.  Deployment Considerations
   6.  Applications
     6.1.  Persistence in L2VPN (VPLS/VPWS)
     6.2.  Persistence in L3VPN
   7.  Security Considerations
   8.  IANA Considerations
   9.  Acknowledgements
   10. Normative References
   Authors' Addresses

1.  Introduction

   In certain scenarios, a BGP speaker may maintain forwarding in spite
   of BGP session termination.  Currently all routing state learned
   between two speakers is flushed upon either normal or abnormal
   session termination.  There are techniques that are useful for
   maintaining routing when a session terminates abnormally, such as
   BGP Graceful Restart (RFC 4724), or terminates normally, such as
   increasing timers, but they do not change the fundamental problem.
   The technique of BGP persistence works effectively as long as the
   expectation is that there is a decoupling of session viability and
   the correct service delivery, and the delivery uses the routing state
   learned over that session.  This document proposes a modification to
   BGP's behavior by enabling persistence of BGP-learned routing state
   in spite of normal or abnormal session termination.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Communities

   This memo defines three new communities that are used to identify the
   capability of a path to persist and whether that path is live or
   stale.

2.1.  PERSIST

   This memo defines a new transitive BGP community, PERSIST, with value
   TBD (to be assigned by IANA).  Attachment of the PERSIST community
   SHOULD be controlled by configuration.  Attaching the PERSIST
   community indicates that the peer should maintain forwarding in the
   case of a session failure.  The functionality SHOULD default to being
   disabled.

2.2.  DO_NOT_PERSIST

   This memo defines a new transitive BGP community, DO_NOT_PERSIST,
   with value TBD (to be assigned by IANA).  Attachment of the
   DO_NOT_PERSIST community SHOULD be controlled by configuration.  The
   functionality SHOULD default to being disabled.

2.3.  STALE

   This memo defines a new transitive BGP community, STALE, with value
   TBD (to be assigned by IANA).  Attachment of the STALE community is
   limited to a path that currently has the PERSIST community attached.

3.  Configuration (Persistence Timer, PERSIST and DO_NOT_PERSIST
    Community)

   Persistence must be configured on a per-session basis.  A speaker
   configures the ability to persist independently of its peer.  There
   is no negotiation between the peers.
   A timer must be configured indicating the time to persist stale
   state from a peer where the session is no longer viable.  This timer
   is designated as the persist-timer.  A speaker must also attach a
   persistence community value indicating whether a path to a route
   should persist.

3.1.  Settings for Different Applications

   The setting of the persist-timer should be based upon the field of
   use.  BGP is used in many different applications, each of which
   brings a unique requirement for retaining state.  The following is
   not meant as a comprehensive listing but to suggest timer settings
   for a subset of AFI/SAFIs.

   L2VPN  This AFI/SAFI requires the exchange of routing state in order
      to establish PWs to realize a VPLS VPN, or a VPWS PW.  This
      AFI/SAFI does not require exchange of routing state with a
      customer and there is no eBGP session established.  The
      persist-timer should be set to a large value on the order of days
      to infinity.

   L3VPN  This AFI/SAFI requires the exchange of routing state to create
      a private VPN.  This AFI/SAFI requires exchange of state with
      customers via eBGP and is dynamic.  The SP needs to consider the
      possibility that stale state may not reflect the latest route
      updates and therefore may be incorrect from the customer
      perspective.  The persist-timer should be set to a large value on
      the order of hours to a few days.  This is built upon the notion
      that some incorrectness is preferable to a large outage.

4.  Operation

   Assuming a session failure has occurred, a BGP persistent router must
   retain local forwarding state for those paths that are Persistent/
   Stale and propagate paths to downstream speakers that indicate that a
   given path is now stale.

4.1.  Attaching the STALE Community Value and Propagation of Paths

   The following rules must be followed.

   o  Identify paths learned over a failed session that have the PERSIST
      capable community value attached.

   o  For those paths, attach the STALE community value and propagate
      to all peers.

   o  For those paths learned over the failed session that do not have
      the PERSIST capable community value, or that are marked with the
      DO_NOT_PERSIST community, follow BGP rules and generate
      withdrawals to all peers for those paths.

4.2.  Forwarding

   The following rules must be followed to ensure valid forwarding:

   o  All forwarding state must be retained, i.e., labels for BGP
      labeled unicast.

   o  Forwarding must ensure that the Next Hop to a "stale" route is
      viable.

   o  Forwarding to a "stale" route is only used if there are no other
      paths available to that route.  In other words, an active path
      always wins regardless of path selection.  "Stale" state is always
      considered to be less preferred when compared with an active path.

   o  Forwarding should be retained through an advertisement.  When the
      session is re-established, forwarding should only change if the
      new state is either different or better in terms of path
      selection.  A make-before-break strategy should be employed.

   o  Stale state may be retained indefinitely or may be programmed to
      expire via configuration.

   o  The Receiving Speaker MUST replace the stale routes with the
      routing updates received from the peer.  Once the End-of-RIB
      marker for an address family is received from the peer, it MUST
      immediately remove any paths from the peer that are still marked
      as stale for that address family.

   o  There is no restriction on whether the session is internal or
      external.

4.3.  Example Behaviour

   Upon session establishment a speaker S2 may receive paths from S1
   that are marked with PERSIST, DO_NOT_PERSIST or neither.  Assume S2
   is also peered with a downstream speaker S3.  Implementations MUST
   follow the behavior outlined below.
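
   As a non-normative illustration, the marking rules of Section 4.1
   and the preference and End-of-RIB rules of Section 4.2, as they
   might apply at S2, could be sketched in Python as follows.  The Path
   class, the function names, and the use of local_pref as a stand-in
   for the rest of BGP path selection are assumptions made for this
   sketch only; none of them is defined by this document.

```python
from dataclasses import dataclass, field

# Community names from Section 2; the actual values are TBD (IANA).
PERSIST, DO_NOT_PERSIST, STALE = "PERSIST", "DO_NOT_PERSIST", "STALE"


@dataclass
class Path:
    """Hypothetical, simplified view of a BGP path held by a speaker."""
    prefix: str
    peer: str
    communities: set = field(default_factory=set)
    local_pref: int = 100  # stand-in for the full path selection process


def on_session_failure(rib, peer):
    """Apply Section 4.1: return (paths kept, paths to withdraw)."""
    kept, withdrawn = [], []
    for path in rib:
        if path.peer != peer:
            kept.append(path)  # learned over another session: untouched
        elif PERSIST in path.communities and DO_NOT_PERSIST not in path.communities:
            path.communities.add(STALE)  # retain and mark as stale
            kept.append(path)
        else:
            withdrawn.append(path)  # normal BGP behaviour
    return kept, withdrawn


def best_path(rib, prefix):
    """Apply Section 4.2: an active path always beats a stale one."""
    candidates = [p for p in rib if p.prefix == prefix]
    if not candidates:
        return None
    # False sorts before True, so non-stale paths are preferred first.
    return min(candidates, key=lambda p: (STALE in p.communities, -p.local_pref))


def on_end_of_rib(rib, peer):
    """Drop paths from the peer that are still stale at End-of-RIB."""
    return [p for p in rib if not (p.peer == peer and STALE in p.communities)]
```

   In this sketch a stale path is selected only when no active path for
   the same prefix exists, even if the active path would otherwise lose
   path selection.  Replacement of a stale path by a fresh update after
   the session is re-established is not modelled.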

   Upon recognition of the failure of the session to S1, S2 will
   identify the paths learned from S1 that had been marked with PERSIST,
   DO_NOT_PERSIST or neither.  For each such path P1, S2 MUST implement
   the following behavior:

      if ( P1 is tagged with PERSIST ) {
          Retain forwarding
          Attach the STALE community to all paths that were marked
              with PERSIST
          Advertise STALE paths to all peers, including S3
      } else {
          /* P1 is marked with DO_NOT_PERSIST or is not marked */
          Tear down the forwarding structure for P1
          Follow normal BGP rules, i.e., best path, withdrawal, etc.
      }

5.  Deployment Considerations

   BGP Persistence as described in this document is useful within a
   single autonomous system or across autonomous systems.

6.  Applications

   This technique may be useful in a wide array of applications where
   routing state is either fairly static or localized within a routing
   context.  Some applications that come immediately to mind are L2 and
   L3 VPN.

6.1.  Persistence in L2VPN (VPLS/VPWS)

   VPLS/VPWS VPNs use BGP to exchange routing state between two PEs.
   This exchange allows for the creation of a PW within a VPN context
   between those PEs.  By definition, L2VPN does not exchange any
   routing state with customers via BGP.  BGP persistence is very useful
   here as the state is quite constant.  The only time state is
   exchanged is when a PW endpoint is provisioned or deleted, or when a
   speaker reboots.

   Referring to Figure 1, PE1 and PE2 have advertised BGP routing state
   in order to create PWs between PE1 and PE2.  The RRs are responsible
   only for reflecting this state between the PEs.  The use of a unique
   RD makes every path unique from the RRs' perspective.

   Assume that both RRs experience catastrophic failure.

   Case 1 - All BGP speakers are persistent capable.

   The PWs created between PE1 and PE2 persist.  Forwarding is
   uninterrupted.

   Case 2 - PE1 and the RRs are persistent capable, PE2 is not.

   In this case the path advertised from PE2 via the RRs is persistent
   at PE1, so the PW from PE1 to PE2 is not torn down.  PE2 will remove
   the path from PE1 and tear down the PW from PE2 to PE1.  The effect
   is that MAC state learned at PE2 is valid as the PW is still valid.
   MAC state learned at PE1 is removed as the PW is no longer valid.
   Eventually MAC destinations recursed to the PW at PE1 destined for
   PE2 over the valid PW will time out.

   Assume that the RRs are valid but the iBGP sessions are torn down.

   Case 3 - All BGP speakers are persistent capable.

   The PWs created between PE1 and PE2 persist.  Forwarding is
   uninterrupted.

              VPNA               VPNA
              PW+++++++++++++++++++PW

   CE1-------PE1--------RR1-------PE2------CE2
              |                    |
              |                    |
              ----------RR2---------

              <--iBGP---><---iBGP-->

                          Figure 1

6.2.  Persistence in L3VPN

                 --------RR1-------
                /    A        C    \
   CE1 ----- PE1 --Forwarding Path-- PE2 ---- CE2
                \    B        D    /
                 ------- RR2 ------

                          Figure 2

   In a Layer 3 VPN topology today, the failure of a route reflector
   device causes all routing information propagated via BGP to be
   purged from the routing database.  In this case, forwarding is
   interrupted within such a topology due to the lack of signalling
   information, rather than an outage to the forwarding path between
   the PE devices.  With the addition of BGP persistence, a complete
   service outage can be avoided.

   The topology shown in Figure 2 is a simple L3VPN topology consisting
   of two customer edge (CE) devices, along with two provider edge (PE)
   and two route reflector (RR) devices.  In this case, where an
   RFC 4364 [RFC4364] VPN topology is utilised, a BGP session exists
   from PE1 to both RR1 and RR2, and from PE2 to both RR1 and RR2, in
   order to propagate the VPN topology.
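
   The two uniform deployments discussed in this section (no speaker
   persistence capable, or every speaker persistence capable) can be
   illustrated with a small, non-normative reachability check over the
   sessions A, B, C and D of Figure 2.  The function name and its
   parameters are hypothetical, and the sketch assumes that the two RRs
   do not exchange routes with each other; mixed deployments are
   deliberately not modelled.

```python
# Sessions from Figure 2: A and B terminate on PE1, C and D on PE2.
# A and C share RR1; B and D share RR2.
RR_LEGS = {"RR1": ("A", "C"), "RR2": ("B", "D")}


def pe1_holds_pe2_routes(failed, persistent, elapsed_s=0, persist_timer_s=0):
    """Return True if PE1 still holds usable routes originated by PE2.

    failed          -- set of failed session names, e.g. {"A", "B"}
    persistent      -- True if every speaker is persistence capable
    elapsed_s       -- seconds since the failure (hypothetical input)
    persist_timer_s -- configured persist-timer in seconds
    """
    failed = set(failed)
    # A live signalling path exists if both legs through one RR are up.
    live = any(
        pe1_leg not in failed and pe2_leg not in failed
        for pe1_leg, pe2_leg in RR_LEGS.values()
    )
    if live:
        return True
    # Otherwise routes survive only as stale state, until the timer expires.
    return persistent and elapsed_s < persist_timer_s
```

   For example, with all four sessions failed the function returns
   False when no speaker is persistent, and True for a uniformly
   persistent deployment until the persist-timer has elapsed.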

   Case 1: No BGP speakers are persistence capable:

   o  In this scenario, during a simultaneous failure of RR1 and RR2
      (which are extremely likely to share route reflector clients),
      both PE1 and PE2 remove all routing information for the VPN from
      their RIBs, and hence a complete service outage is experienced.

   o  Where either sessions A and B, or C and D, fail simultaneously,
      routing information from either PE1 (in the case of A and B) or
      PE2 (in the case of C and D) is withdrawn, and a partial service
      topology exists.

   o  Both of the states described reflect a service outage where the
      forwarding path between the PE devices is not interrupted.

   Case 2: All BGP speakers are persistence capable:

   o  PE1 continues to forward utilising the label information received
      from PE2 via the working forwarding path for the duration of the
      persistence timer (and vice versa).

   o  This condition occurs regardless of the session(s) that fail.  In
      the worst case, where sessions A, B, C and D fail simultaneously,
      the network continues to operate in the state in which it was at
      the time of the failure.

   Case 3: PE1 and RR[12] are persistence capable - PE2 is not.

   o  During a failure of BGP session A or B, PE1 will continue to
      forward utilising the routing information received from the RRs
      for PE2 for the duration of the persistence timer.  PE2 will
      continue to forward utilising the routing information received
      from the RRs, again for the duration of the persistence timer.

   o  In the case that either BGP session C or D fails, all routes will
      be withdrawn by RR[12] towards PE1 since these routes are not
      valid to be persisted by the RRs.  The end result of this will be
      that the routes advertised by CE2 into the VPN will be withdrawn.

   o  Where the worst case failure occurs (i.e., sessions A, B, C and D
      fail), the routes advertised by CE1 into the VPN will be
      persistently advertised by the RR devices, whereas those
      advertised by CE2 will be withdrawn.  Clearly in the example shown
      in the figure this results in a service outage, but where multiple
      PE devices exist within a topology, service is maintained for the
      subset of CEs attached to PE devices supporting the persistence
      capability.

   Within the Layer 3 VPN deployment it should be noted that routing
   information is less static than that of many Layer 2 VPNs, since
   typically multiple routes exist within the topology rather than an
   individual MAC address or egress interface per CE device on the PE
   device.  As such, the L3VPN operates with the routing databases in
   the 'core' of the network reflecting those at the time of failure.
   Should there be re-convergence for any path between the PE and CE
   devices, this will result in invalid routing information, should the
   egress PE device not hold alternate routing information for the
   prefixes undergoing such re-convergence.  Where each PE maintains
   multiple paths to each egress prefix (where an alternate path is
   available), it is expected that the egress PE will forward packets
   towards an alternative egress PE for the prefix in question where
   the topology is no longer valid.

   The lack of convergence within a Layer 3 topology during the
   persistent state SHOULD be considered since it may adversely affect
   services; however, an assumption is made that a degraded service is
   preferable to a complete service outage during a large-scale BGP
   control plane failure.

7.  Security Considerations

   The security implications of the persistence mechanism defined within
   this document are akin to those incurred by the maintenance of stale
   routing information within a network.
   This is particularly relevant when considering the maintenance of
   routing information that is utilised for service segregation, such
   as MPLS label entries.

   For MPLS VPN services, the effectiveness of the traffic isolation
   between VPNs relies on the correctness of the MPLS labels between
   ingress and egress PEs.  In particular, when an egress PE withdraws a
   label L1 allocated to a VPN1 route, this label MUST NOT be assigned
   to a VPN route of a different VPN until all ingress PEs stop using
   the old VPN1 route using L1.

   Such a corner case may happen today, if the propagation of VPN routes
   by BGP messages between PEs takes more time than the label
   re-allocation delay on a PE.  Given that we can generally bound worst
   case BGP propagation time to a few minutes (e.g. 2-5), the security
   breach will not occur if PEs are designed not to reallocate a
   previously used and withdrawn label before a few minutes have passed.

   The problem is made worse with BGP GR between PEs as VPN routes can
   be stalled for a longer period of time (e.g. 20 minutes).

   This is further aggravated by the BGP persistent extension proposed
   in this document as VPN routes can be stalled for a much longer
   period of time (e.g. 2 hours, 1 day).

   Therefore, to avoid a VPN breach, before enabling BGP persistence,
   SPs need to check how fast a given label can be reused by a PE,
   taking into account:

   o  The load of the BGP route churn on a PE (in terms of the number of
      VPN labels advertised and the churn rate).

   o  The label allocation policy on the PE (possibly depending upon the
      size of the pool of VPN labels (which can be restricted by
      hardware considerations or other MPLS usages), the label
      allocation scheme (e.g. per route or per VRF/CE), the
      re-allocation policy (e.g. least recently used label...)).
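
   The check described above can be approximated with a short,
   non-normative Python sketch.  It assumes a least-recently-used
   re-allocation policy, under which a withdrawn label is reused only
   after roughly all other free labels have been handed out, and it
   takes the 5-minute worst-case propagation figure mentioned earlier
   in this section as a default.  The LabelPool class, the function
   names and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabelPool:
    """Hypothetical summary of VPN label usage on an egress PE."""
    free_labels: int             # labels currently unallocated
    allocations_per_hour: float  # rate of new label allocations (churn)


def hours_until_reuse(pool: LabelPool) -> float:
    """Estimate the time before a just-withdrawn label is reallocated.

    Assumes least-recently-used reallocation: the withdrawn label sits
    behind every other free label in the queue.
    """
    if pool.allocations_per_hour <= 0:
        return float("inf")  # no churn: the label is never reused
    return pool.free_labels / pool.allocations_per_hour


def persistence_is_safe(pool: LabelPool, persist_timer_hours: float,
                        propagation_hours: float = 5 / 60) -> bool:
    """True if a label cannot be reused while stale routes may still use it."""
    return hours_until_reuse(pool) > persist_timer_hours + propagation_hours
```

   With 100,000 free labels and 500 allocations per hour the estimated
   reuse delay is 200 hours, which comfortably exceeds a one-day
   persist-timer; with only 1,000 free labels the delay drops to 2
   hours and the same timer would not be safe under these assumptions.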

   In addition to these considerations, the persistence mechanism
   described within this document is considered to be complex to exploit
   maliciously - in order to inject packets into a topology, there is a
   requirement to engineer a specific persistence state between two PE
   devices, whilst engineering label reallocation to occur in a manner
   that results in the two topologies overlapping.  Such allocation is
   particularly difficult to engineer (since it is typically an internal
   mechanism of an LSR).

8.  IANA Considerations

   IANA shall assign community values from the BGP well-known
   communities registry for the PERSIST, DO_NOT_PERSIST and STALE
   communities.  No additional IANA action is required.

9.  Acknowledgements

   We would like to acknowledge Roberto Fragassi (Alcatel-Lucent), John
   Medamana (AT&T), Han Nguyen (AT&T), Jeffrey Haas (Juniper), Nabil
   Bitar (Verizon), and Nicolai Leymann (DT) for their contributions to
   this document.

10.  Normative References

   [RFC1997]  Chandrasekeran, R., Traina, P., and T. Li, "BGP
              Communities Attribute", RFC 1997, August 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
              Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
              Networks (VPNs)", RFC 4364, February 2006.

Authors' Addresses

   James Uttaro
   AT&T
   200 S. Laurel Avenue
   Middletown, NJ  07748
   USA

   Email: ju1738@att.com

   Adam Simpson
   Alcatel-Lucent
   600 March Road
   Ottawa, Ontario  K2K 2E6
   Canada

   Email: adam.simpson@alcatel-lucent.com

   Rob Shakir
   Cable&Wireless Worldwide
   London
   UK

   Email: rjs@cw.net
   URI:   http://www.cw.com/

   Clarence Filsfils
   Cisco Systems
   Brussels  1000
   BE

   Email: cf@cisco.com

   Pradosh Mohapatra
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: pmohapat@cisco.com

   Bruno Decraene
   France Telecom
   38-40 Rue de General Leclerc
   92794 Issy Moulineaux cedex 9
   France

   Email: bruno.decraene@orange.com

   John Scudder
   Juniper Networks
   1194 N. Mathilda Ave
   Sunnyvale, CA  94089
   USA

   Email: jgs@juniper.net

   Yakov Rekhter
   Juniper Networks

   Email: yakov@juniper.net