idnits 2.17.1

draft-ietf-idr-restart-06.txt:
  ** The Abstract section seems to be numbered

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 9
     longer pages, the longest (page 2) being 61 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 1771 (ref. 'BGP-4') (Obsoleted by RFC
     4271)

  ** Obsolete normative reference: RFC 2858 (ref. 'BGP-MP') (Obsoleted by RFC
     4760)

  -- Unexpected draft version: The latest known version of
     draft-ietf-idr-rfc2842bis is -01, but you're referring to -02.

  ** Obsolete normative reference: RFC 2385 (ref. 'BGP-AUTH') (Obsoleted by
     RFC 5925)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IANA-AFI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IANA-SAFI'


     Summary: 7 errors (**), 0 flaws (~~), 2 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group            Srihari R. Sangli (Procket Networks)
Internet Draft                   Yakov Rekhter (Juniper Networks)
Expiration Date: July 2003       Rex Fernando (Procket Networks)
                                 John G. Scudder (Cisco Systems)
                                 Enke Chen (Redback Networks)

                  Graceful Restart Mechanism for BGP

                    draft-ietf-idr-restart-06.txt

1. Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   The list of current Internet-Drafts can be accessed at
      http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
      http://www.ietf.org/shadow.html.

2. Abstract

   This document proposes a mechanism for BGP that would help minimize
   the negative effects on routing caused by BGP restart. An End-of-RIB
   marker is specified and can be used to convey routing convergence
   information.  A new BGP capability, termed "Graceful Restart
   Capability", is defined which would allow a BGP speaker to express
   its ability to preserve forwarding state during BGP restart. Finally,
   procedures are outlined for temporarily retaining routing information
   across a TCP transport reset.

   The mechanisms described in this document are applicable to all
   routers, both those with the ability to preserve forwarding state
   during BGP restart and those without (although the latter need to
   implement only a subset of the mechanisms described in this
   document).

3. Introduction

   Usually when BGP on a router restarts, all the BGP peers detect that
   the session went down, and then came up. This "down/up" transition
   results in a "routing flap" and causes BGP route re-computation,
   generation of BGP routing updates, and churn in the forwarding
   tables. It could spread across multiple routing domains. Such routing
   flaps may create transient forwarding blackholes and/or transient
   forwarding loops. They also consume resources on the control plane of
   the routers affected by the flap. As such they are detrimental to the
   overall network performance.

   This document proposes a mechanism for BGP that would help minimize
   the negative effects on routing caused by BGP restart. An End-of-RIB
   marker is specified and can be used to convey routing convergence
   information.
   A new BGP capability, termed "Graceful Restart
   Capability", is defined which would allow a BGP speaker to express
   its ability to preserve forwarding state during BGP restart. Finally,
   procedures are outlined for temporarily retaining routing information
   across a TCP transport reset.

4. Marker for End-of-RIB

   An UPDATE message with no reachable NLRI and empty withdrawn NLRI is
   specified as the End-of-RIB Marker that can be used by a BGP speaker
   to indicate to its peer the completion of the initial routing update
   after the session is established. For the IPv4 unicast address
   family, the End-of-RIB Marker is an UPDATE message with the minimum
   length [BGP-4]. For any other address family, it is an UPDATE message
   that contains only the MP_UNREACH_NLRI attribute [BGP-MP] with no
   withdrawn routes for that <AFI, SAFI>.

   Although the End-of-RIB Marker is specified for the purpose of BGP
   graceful restart, it is noted that the generation of such a marker
   upon completion of the initial update would be useful for routing
   convergence in general, and thus the practice is recommended.

   In addition, it would be beneficial for routing convergence if a BGP
   speaker can indicate to its peer up-front that it will generate the
   End-of-RIB marker, regardless of its ability to preserve its
   forwarding state during BGP restart. This can be accomplished using
   the Graceful Restart Capability described in the next section.

5. Graceful Restart Capability

   The Graceful Restart Capability is a new BGP capability [BGP-CAP]
   that can be used by a BGP speaker to indicate its ability to preserve
   its forwarding state during BGP restart. It can also be used to
   convey to its peer its intention of generating the End-of-RIB marker
   upon the completion of its initial routing updates.

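   As an illustration of the End-of-RIB marker described in Section 4,
   the following is a minimal Python sketch, not a definitive
   implementation. The function name end_of_rib is hypothetical. The
   sketch assumes the all-ones header marker used when no authentication
   information is carried, and the MP_UNREACH_NLRI encoding of [BGP-MP]
   (optional non-transitive attribute, type code 15).

```python
import struct

BGP_MARKER = b"\xff" * 16       # assumption: all-ones marker (no authentication data)
MSG_TYPE_UPDATE = 2             # UPDATE message type in [BGP-4]
ATTR_FLAG_OPTIONAL = 0x80       # optional, non-transitive attribute
ATTR_MP_UNREACH_NLRI = 15       # attribute type code in [BGP-MP]


def end_of_rib(afi=1, safi=1):
    """Build an End-of-RIB UPDATE for the given address family (hypothetical helper)."""
    if (afi, safi) == (1, 1):
        # IPv4 unicast: minimum-length UPDATE, no attributes at all.
        path_attrs = b""
    else:
        # Any other family: only MP_UNREACH_NLRI, carrying AFI/SAFI and
        # no withdrawn routes.
        value = struct.pack("!HB", afi, safi)
        path_attrs = struct.pack(
            "!BBB", ATTR_FLAG_OPTIONAL, ATTR_MP_UNREACH_NLRI, len(value)
        ) + value
    # Withdrawn Routes Length = 0, then Total Path Attribute Length.
    body = struct.pack("!HH", 0, len(path_attrs)) + path_attrs
    header = BGP_MARKER + struct.pack("!HB", 19 + len(body), MSG_TYPE_UPDATE)
    return header + body
```

   With these assumptions, end_of_rib() yields the 23-octet minimum
   UPDATE, and end_of_rib(2, 1) yields a 29-octet UPDATE whose only
   attribute is MP_UNREACH_NLRI for IPv6 unicast.
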
   This capability is defined as follows:

      Capability code: 64

      Capability length: variable

      Capability value: Consists of the "Restart Flags" field, "Restart
      Time" field, and zero or more of the tuples <AFI, SAFI, Flags for
      address family> as follows:

         +--------------------------------------------------+
         | Restart Flags (4 bits)                           |
         +--------------------------------------------------+
         | Restart Time in seconds (12 bits)                |
         +--------------------------------------------------+
         | Address Family Identifier (16 bits)              |
         +--------------------------------------------------+
         | Subsequent Address Family Identifier (8 bits)    |
         +--------------------------------------------------+
         | Flags for Address Family (8 bits)                |
         +--------------------------------------------------+
         | ...                                              |
         +--------------------------------------------------+
         | Address Family Identifier (16 bits)              |
         +--------------------------------------------------+
         | Subsequent Address Family Identifier (8 bits)    |
         +--------------------------------------------------+
         | Flags for Address Family (8 bits)                |
         +--------------------------------------------------+

   The use and meaning of the fields are as follows:

      Restart Flags:

         This field contains bit flags related to restart.

             0 1 2 3
            +-+-+-+-+
            |R|Resv.|
            +-+-+-+-+

         The most significant bit is defined as the Restart State (R)
         bit which can be used to avoid possible deadlock caused by
         waiting for the End-of-RIB marker when multiple BGP speakers
         peering with each other restart. When set (value 1), this bit
         indicates that the BGP speaker has restarted, and its peer
         should not wait for the End-of-RIB marker from the speaker
         before advertising routing information to the speaker.

         The remaining bits are reserved, and should be set to zero by
         the sender and ignored by the receiver.

      Restart Time:

         This is the estimated time (in seconds) it will take for the
         BGP session to be re-established after a restart. This can be
         used to speed up routing convergence by its peer in case the
         BGP speaker does not come back after a restart.

      Address Family Identifier (AFI):

         This field carries the identity of the Network Layer protocol
         for which the Graceful Restart support is advertised. Presently
         defined values for this field are specified in [IANA-AFI].

      Subsequent Address Family Identifier (SAFI):

         This field provides additional information about the type of
         the Network Layer Reachability Information carried in the
         attribute. Presently defined values for this field are
         specified in [IANA-SAFI].

      Flags for Address Family:

         This field contains bit flags for the <AFI, SAFI>.

             0 1 2 3 4 5 6 7
            +-+-+-+-+-+-+-+-+
            |F|   Reserved  |
            +-+-+-+-+-+-+-+-+

         The most significant bit is defined as the Forwarding State (F)
         bit which can be used to indicate if the forwarding state for
         the <AFI, SAFI> has indeed been preserved during the previous
         BGP restart. When set (value 1), the bit indicates that the
         forwarding state has been preserved.

         The remaining bits are reserved, and should be set to zero by
         the sender and ignored by the receiver.

   When a sender of this capability doesn't include any <AFI, SAFI> in
   the capability, it means that the sender is not capable of preserving
   its forwarding state during BGP restart, but supports procedures for
   the Receiving Speaker (as defined in Section 6.2 of this document).
   In that case the value of the "Restart Time" field advertised by the
   sender is irrelevant.

   A BGP speaker should not include more than one instance of the
   Graceful Restart Capability in the capability advertisement
   [BGP-CAP].
   If more than one instance of the Graceful Restart Capability
   is carried in the capability advertisement, the receiver of the
   advertisement should ignore all but the last instance of the Graceful
   Restart Capability.

   Including <AFI=IPv4, SAFI=unicast> into the Graceful Restart
   Capability doesn't imply that the IPv4 unicast routing information
   should be carried by using the BGP Multiprotocol extensions [BGP-MP]
   - it could be carried in the NLRI field of the BGP UPDATE message.

6. Operation

   A BGP speaker may advertise the Graceful Restart Capability for an
   address family to its peer if it has the ability to preserve its
   forwarding state for the address family when BGP restarts. In
   addition, even if the speaker does not have the ability to preserve
   its forwarding state for any address family during BGP restart, it is
   still recommended that the speaker advertise the Graceful Restart
   Capability to its peer (as mentioned before, this is done by not
   including any <AFI, SAFI> in the advertised capability). There are
   two reasons for doing this. First, to indicate its intention of
   generating the End-of-RIB marker upon the completion of its initial
   routing updates, as doing this would be useful for routing
   convergence in general. Second, to indicate its support for a peer
   which wishes to perform a graceful restart.

   The End-of-RIB marker should be sent by a BGP speaker to its peer
   once it completes the initial routing update (including the case when
   there is no update to send) for an address family after the BGP
   session is established.

   It is noted that the normal BGP procedures must be followed when the
   TCP session terminates due to the sending or receiving of a BGP
   NOTIFICATION message.

   In general the Restart Time should not be greater than the HOLDTIME
   carried in the OPEN.

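   The capability layout defined in Section 5 can be sketched in Python
   as follows. This is an illustrative sketch under stated assumptions,
   not a reference implementation: the function names are hypothetical,
   the capability is assumed to be carried as a one-octet code and a
   one-octet length followed by the value, and the capability code 64
   is the one given in this document.

```python
import struct

CAP_GRACEFUL_RESTART = 64   # capability code given in this document
R_BIT = 0x8000              # Restart State bit: top bit of the 4-bit Restart Flags
F_BIT = 0x80                # Forwarding State bit: top bit of the per-family flags


def encode_gr_capability(restarted, restart_time, families):
    """families: iterable of (afi, safi, forwarding_preserved) tuples."""
    if not 0 <= restart_time < 4096:
        raise ValueError("Restart Time must fit in 12 bits")
    value = struct.pack("!H", (R_BIT if restarted else 0) | restart_time)
    for afi, safi, preserved in families:
        value += struct.pack("!HBB", afi, safi, F_BIT if preserved else 0)
    # Assumed framing: code (1 octet), length (1 octet), value.
    return struct.pack("!BB", CAP_GRACEFUL_RESTART, len(value)) + value


def decode_gr_value(value):
    """Return (restarted, restart_time, {(afi, safi): forwarding_preserved})."""
    if len(value) < 2 or (len(value) - 2) % 4:
        raise ValueError("malformed Graceful Restart Capability value")
    (word,) = struct.unpack_from("!H", value, 0)
    families = {}
    for offset in range(2, len(value), 4):
        afi, safi, flags = struct.unpack_from("!HBB", value, offset)
        families[(afi, safi)] = bool(flags & F_BIT)   # reserved bits ignored
    return bool(word & R_BIT), word & 0x0FFF, families


def pick_gr_value(capabilities):
    """capabilities: list of (code, value) pairs; the last GR instance wins."""
    values = [v for code, v in capabilities if code == CAP_GRACEFUL_RESTART]
    return values[-1] if values else None
```

   An empty list of families encodes the case of a speaker that cannot
   preserve forwarding state but supports the Receiving Speaker
   procedures; the decoder masks out the reserved bits rather than
   rejecting them, as the text asks receivers to ignore them.
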
   In the following sections, "Restarting Speaker" refers to a router
   whose BGP has restarted, and "Receiving Speaker" refers to a router
   that peers with the restarting speaker.

   Consider that the Graceful Restart Capability for an address family
   is advertised by the Restarting Speaker, and is understood by the
   Receiving Speaker, and a BGP session between them is established.
   The following sections detail the procedures that shall be followed
   by the Restarting Speaker as well as the Receiving Speaker once the
   Restarting Speaker restarts.

6.1. Procedures for the Restarting Speaker

   When the Restarting Speaker restarts, if possible it shall retain the
   forwarding state for the BGP routes in the Loc-RIB, and shall mark
   them as stale. It should not differentiate between stale and other
   information during forwarding.

   To re-establish the session with its peer, the Restarting Speaker
   must set the "Restart State" bit in the Graceful Restart Capability
   of the OPEN message. Unless allowed via configuration, the
   "Forwarding State" bit for an address family in the capability can be
   set only if the forwarding state has indeed been preserved for that
   address family during the restart.

   Once the session between the Restarting Speaker and the Receiving
   Speaker is re-established, the Restarting Speaker will receive and
   process BGP messages from its peers. However, it shall defer route
   selection for an address family until it receives the End-of-RIB
   marker from all its peers (excluding the ones with the "Restart
   State" bit set in the received capability and excluding the ones
   which do not advertise the graceful restart capability). It is noted
   that prior to route selection, the speaker has no routes to advertise
   to its peers and no routes to update the forwarding state.

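   The deferral rule above can be sketched as a small predicate. This
   is a simplified illustration for a single address family; the Peer
   class, its field names, and the function names are hypothetical, and
   the max_defer parameter stands in for the configurable upper-bound
   timer that this section requires.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    advertised_gr: bool         # peer sent the Graceful Restart Capability
    restart_state: bool         # "Restart State" bit in the received capability
    eor_received: bool = False  # End-of-RIB marker seen for this address family


def must_wait_for(peer):
    # Peers that have themselves restarted, or that never advertised the
    # capability, are excluded from the wait.
    return peer.advertised_gr and not peer.restart_state


def can_run_route_selection(peers, elapsed, max_defer):
    """True once route selection no longer needs to be deferred."""
    if elapsed >= max_defer:
        return True             # upper bound on the deferral has expired
    return all(p.eor_received for p in peers if must_wait_for(p))
```

   Excluding peers whose "Restart State" bit is set is what avoids the
   deadlock in which two restarted speakers each wait for the other's
   End-of-RIB marker.
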
   In situations where both IGP and BGP have restarted, it might be
   advantageous to wait for IGP to converge before the BGP speaker
   performs route selection.

   After the BGP speaker performs route selection, the forwarding state
   of the speaker shall be updated and any previously marked stale
   information shall be removed. The Adj-RIB-Out can then be advertised
   to its peers. Once the initial update is complete for an address
   family (including the case that there is no routing update to send),
   the End-of-RIB marker shall be sent.

   To put an upper bound on the amount of time a router defers its route
   selection, an implementation must support a (configurable) timer that
   imposes this upper bound.

   If one wants to apply graceful restart only when the restart is
   planned (as opposed to both planned and unplanned restart), then one
   way to accomplish this would be to set the Forwarding State bit to 1
   after a planned restart, and to 0 in all other cases. Other
   approaches to accomplish this are outside the scope of this document.

6.2. Procedures for the Receiving Speaker

   When the Restarting Speaker restarts, the Receiving Speaker may or
   may not detect the termination of the TCP session with the Restarting
   Speaker, depending on the underlying TCP implementation, whether or
   not [BGP-AUTH] is in use, and the specific circumstances of the
   restart.  In case it does not detect the TCP reset and still
   considers the BGP session as being established, it shall treat the
   subsequent open connection from the peer as an indication of TCP
   reset and act accordingly (when the Graceful Restart Capability has
   been received from the peer).

   "Acting accordingly" in this context means that the previous TCP
   session should be closed, and the new one retained.  Note that this
   behavior differs from the default behavior, as specified in [BGP-4]
   section 6.8.
   Since the previous connection is considered to be
   reset, no NOTIFICATION message should be sent -- the previous TCP
   session is simply closed.

   When the Receiving Speaker detects TCP reset for a BGP session with a
   peer that has advertised the Graceful Restart Capability, it shall
   retain the routes received from the peer for all the address families
   that were previously received in the Graceful Restart Capability, and
   shall mark them as stale routing information. To deal with possible
   consecutive restarts, a route (from the peer) previously marked as
   stale shall be deleted. The router should not differentiate between
   stale and other routing information during forwarding.

   In re-establishing the session, the "Restart State" bit in the
   Graceful Restart Capability of the OPEN message sent by the Receiving
   Speaker shall not be set unless the Receiving Speaker has restarted.
   The presence and the setting of the "Forwarding State" bit for an
   address family depends upon the actual forwarding state and
   configuration.

   If the session does not get re-established within the "Restart Time"
   that the peer advertised previously, the Receiving Speaker shall
   delete all the stale routes from the peer that it is retaining.

   Once the session is re-established, if the "Forwarding State" bit for
   a specific address family is not set in the newly received Graceful
   Restart Capability, or if a specific address family is not included
   in the newly received Graceful Restart Capability, or if the Graceful
   Restart Capability isn't received in the re-established session at
   all, then the Receiving Speaker shall immediately remove all the
   stale routes from the peer that it is retaining for that address
   family.

   The Receiving Speaker shall send the End-of-RIB marker once it
   completes the initial update for an address family (including the
   case that it has no routes to send) to the peer.

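   The stale-route handling in this section can be sketched as a small
   per-peer table. This is an illustrative sketch only: the PeerRib
   class and its method names are hypothetical, routes are reduced to a
   prefix string with opaque attributes, and routes of address families
   that were not listed in the peer's capability are assumed to be
   dropped on reset as under normal BGP procedures.

```python
class PeerRib:
    """Routes learned from one peer, keyed by (afi, safi) and then prefix."""

    def __init__(self):
        self.routes = {}    # (afi, safi) -> {prefix: [attrs, stale]}

    def update(self, family, prefix, attrs):
        # A fresh advertisement replaces any stale copy of the route.
        self.routes.setdefault(family, {})[prefix] = [attrs, False]

    def _drop_stale(self, family):
        table = self.routes.get(family, {})
        for prefix in [p for p, (_, stale) in table.items() if stale]:
            del table[prefix]

    def on_tcp_reset(self, gr_families):
        """gr_families: families listed in the capability received earlier."""
        for family in list(self.routes):
            if family not in gr_families:
                del self.routes[family]     # assumption: normal BGP behavior
                continue
            self._drop_stale(family)        # consecutive restart: old stale routes go
            for entry in self.routes[family].values():
                entry[1] = True             # retain the rest, marked stale

    def on_restart_time_expired(self):
        # Session was not re-established within the advertised Restart Time.
        for family in list(self.routes):
            self._drop_stale(family)

    def on_reestablished(self, forwarding_bits):
        """forwarding_bits: {(afi, safi): F bit}; empty if no capability came."""
        for family in list(self.routes):
            if not forwarding_bits.get(family, False):
                self._drop_stale(family)

    def on_end_of_rib(self, family):
        # Anything still stale was not re-advertised by the peer.
        self._drop_stale(family)
```

   Stale routes stay in the table, and therefore usable for forwarding,
   until one of the four removal events occurs; the stale flag only
   controls when they are purged.
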
   The Receiving Speaker shall replace the stale routes by the routing
   updates received from the peer. Once the End-of-RIB marker for an
   address family is received from the peer, it shall immediately remove
   any routes from the peer that are still marked as stale for that
   address family.

   To put an upper bound on the amount of time a router retains the
   stale routes, an implementation may support a (configurable) timer
   that imposes this upper bound.

7. Deployment Considerations

   While the procedures described in this document would help minimize
   the effect of routing flaps, it is noted, however, that when a BGP
   Graceful Restart capable router restarts, there is a potential for
   transient routing loops or blackholes in the network if routing
   information changes before the involved routers complete routing
   updates and convergence. Also, depending on the network topology, if
   not all IBGP speakers are Graceful Restart capable, there could be an
   increased exposure to transient routing loops or blackholes when the
   Graceful Restart procedures are exercised.

   The Restart Time, the upper bound for retaining routes and the upper
   bound for deferring route selection may need to be tuned as more
   deployment experience is gained.

   Finally, it is noted that the benefits of deploying BGP Graceful
   Restart in an AS whose IGPs and BGP are tightly coupled (i.e., BGP
   and IGPs would both restart) and IGPs have no similar Graceful
   Restart capability are reduced relative to the scenario where IGPs do
   have similar Graceful Restart capability.

8. Security Considerations

   Since with this proposal a new connection can cause an old one to be
   terminated, it might seem to open the door to denial of service
   attacks. However, it is noted that unauthenticated BGP is already
   known to be vulnerable to denials of service through attacks on the
   TCP transport.
   The TCP transport is commonly protected through use
   of [BGP-AUTH]. Such authentication will equally protect against
   denials of service through spurious new connections.

   It is thus concluded that this proposal does not change the
   underlying security model (and issues) of BGP-4.

9. Acknowledgments

   The authors would like to thank Bruce Cole, Bill Fenner, Eric Gray,
   Jeffrey Haas, Alvaro Retana, Naiming Shen, Satinder Singh, David
   Ward, Shane Wright and Alex Zinin for their review and comments.

10. Normative References

   [BGP-4] Rekhter, Y., and T. Li, "A Border Gateway Protocol 4 (BGP-
   4)", RFC 1771, March 1995.

   [BGP-MP] Bates, T., Chandra, R., Katz, D., and Rekhter, Y.,
   "Multiprotocol Extensions for BGP-4", RFC 2858, June 2000.

   [BGP-CAP] Chandra, R., Scudder, J., "Capabilities Advertisement with
   BGP-4", draft-ietf-idr-rfc2842bis-02.txt, April 2002.

   [BGP-AUTH] Heffernan, A., "Protection of BGP Sessions via the TCP MD5
   Signature Option", RFC 2385, August 1998.

   [IANA-AFI] http://www.iana.org/assignments/address-family-numbers.

   [IANA-SAFI] http://www.iana.org/assignments/safi-namespace.

11. Author Information

   Srihari R. Sangli
   Procket Networks, Inc.
   1100 Cadillac Court
   Milpitas, CA 95035
   e-mail: srihari@procket.com

   Yakov Rekhter
   Juniper Networks, Inc.
   1194 N. Mathilda Avenue
   Sunnyvale, CA 94089
   e-mail: yakov@juniper.net

   Rex Fernando
   Procket Networks, Inc.
   1100 Cadillac Court
   Milpitas, CA 95035
   e-mail: rex@procket.com

   John G. Scudder
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: jgs@cisco.com

   Enke Chen
   Redback Networks, Inc.
   350 Holger Way
   San Jose, CA 95134
   e-mail: enke@redback.com