idnits 2.17.1

draft-ietf-idr-restart-04.txt:

  ** The Abstract section seems to be numbered

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 9
     longer pages, the longest (page 2) being 61 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)
  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 1771 (ref. 'BGP-4') (Obsoleted by RFC
     4271)

  ** Obsolete normative reference: RFC 2858 (ref. 'BGP-MP') (Obsoleted by RFC
     4760)

  -- Unexpected draft version: The latest known version of
     draft-ietf-idr-rfc2842bis is -01, but you're referring to -02.

  ** Obsolete normative reference: RFC 2385 (ref. 'BGP-AUTH') (Obsoleted by
     RFC 5925)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IANA-AFI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'IANA-SAFI'

     Summary: 8 errors (**), 0 flaws (~~), 2 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group               Srihari R. Sangli (Procket Networks)
Internet Draft                          Yakov Rekhter (Juniper Networks)
Expiration Date: December 2002           Rex Fernando (Procket Networks)
                                         John G. Scudder (Cisco Systems)
                                            Enke Chen (Redback Networks)

                  Graceful Restart Mechanism for BGP

                     draft-ietf-idr-restart-04.txt

1. Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

2. Abstract

   This document proposes a mechanism for BGP that would help minimize
   the negative effects on routing caused by BGP restart. An End-of-RIB
   marker is specified and can be used to convey routing convergence
   information.  A new BGP capability, termed "Graceful Restart
   Capability", is defined which would allow a BGP speaker to express
   its ability to preserve forwarding state during BGP restart. Finally,
   procedures are outlined for temporarily retaining routing information
   across a TCP transport reset.

3. Introduction

   Usually when BGP on a router restarts, all the BGP peers detect that
   the session went down, and then came up. This "down/up" transition
   results in a "routing flap" and causes BGP route re-computation,
   generation of BGP routing updates, and flapping of the forwarding
   tables. The flap could spread across multiple routing domains. Such
   routing flaps may create transient forwarding blackholes and/or
   transient forwarding loops. They also consume resources on the
   control plane of the routers affected by the flap. As such they are
   detrimental to the overall network performance.

   This document proposes a mechanism for BGP that would help minimize
   the negative effects on routing caused by BGP restart. An End-of-RIB
   marker is specified and can be used to convey routing convergence
   information.  A new BGP capability, termed "Graceful Restart
   Capability", is defined which would allow a BGP speaker to express
   its ability to preserve forwarding state during BGP restart.
   Finally, procedures are outlined for temporarily retaining routing
   information across a TCP transport reset.

4. Marker for End-of-RIB

   An UPDATE message with no reachable NLRI and empty withdrawn NLRI is
   specified as the End-of-RIB marker that can be used by a BGP speaker
   to indicate to its peer the completion of the initial routing update
   after the session is established. For the IPv4 unicast address
   family, the End-of-RIB marker is an UPDATE message with the minimum
   length [BGP-4]. For any other address family, it is an UPDATE message
   that contains only the MP_UNREACH_NLRI attribute [BGP-MP] with no
   withdrawn routes for that <AFI, SAFI>.

   Although the End-of-RIB marker is specified for the purpose of BGP
   graceful restart, it is noted that the generation of such a marker
   upon completion of the initial update would be useful for routing
   convergence in general, and thus the practice is recommended.

   In addition, it would be beneficial for routing convergence if a BGP
   speaker can indicate to its peer up-front that it will generate the
   End-of-RIB marker, regardless of its ability to preserve its
   forwarding state during BGP restart. This can be accomplished using
   the Graceful Restart Capability described in the next section.

5. Graceful Restart Capability

   The Graceful Restart Capability is a new BGP capability [BGP-CAP]
   that can be used by a BGP speaker to indicate its ability to preserve
   its forwarding state during BGP restart. It can also be used to
   convey to its peer its intention of generating the End-of-RIB marker
   upon the completion of its initial routing updates.
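
   The two End-of-RIB marker encodings described in Section 4 can be
   illustrated with a minimal Python sketch. It assumes the standard
   BGP message header (16-octet all-ones marker, 2-octet length,
   1-octet type) and the MP_UNREACH_NLRI type code 15; the function
   name end_of_rib and its defaults are hypothetical, chosen only for
   this example.

```python
import struct

BGP_MARKER = b"\xff" * 16      # 16-octet all-ones marker of the BGP header
BGP_TYPE_UPDATE = 2            # UPDATE message type code
ATTR_FLAG_OPTIONAL = 0x80      # optional, non-transitive attribute
ATTR_MP_UNREACH_NLRI = 15      # MP_UNREACH_NLRI attribute type code


def end_of_rib(afi: int = 1, safi: int = 1) -> bytes:
    """Build an End-of-RIB marker for <afi, safi> (hypothetical helper)."""
    if (afi, safi) == (1, 1):
        # IPv4 unicast: no path attributes, i.e. a minimum-length UPDATE.
        attrs = b""
    else:
        # Other families: a lone MP_UNREACH_NLRI carrying only AFI and
        # SAFI, with no withdrawn routes following them.
        value = struct.pack("!HB", afi, safi)
        attrs = struct.pack(
            "!BBB", ATTR_FLAG_OPTIONAL, ATTR_MP_UNREACH_NLRI, len(value)
        ) + value
    # Withdrawn Routes Length (0), then Total Path Attribute Length.
    body = struct.pack("!HH", 0, len(attrs)) + attrs
    length = len(BGP_MARKER) + 3 + len(body)
    return BGP_MARKER + struct.pack("!HB", length, BGP_TYPE_UPDATE) + body
```

   For IPv4 unicast the sketch yields the 23-octet minimum-length
   UPDATE; for IPv6 unicast (AFI 2, SAFI 1) it yields a 29-octet
   message.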

   This capability is defined as follows:

      Capability code: 64

      Capability length: variable

      Capability value: Consists of the "Restart Flags" field, "Restart
      Time" field, and zero or more of the tuples <AFI, SAFI, Flags for
      address family> as follows:

         +--------------------------------------------------+
         | Restart Flags (4 bits)                           |
         +--------------------------------------------------+
         | Restart Time in seconds (12 bits)                |
         +--------------------------------------------------+
         | Address Family Identifier (16 bits)              |
         +--------------------------------------------------+
         | Subsequent Address Family Identifier (8 bits)    |
         +--------------------------------------------------+
         | Flags for Address Family (8 bits)                |
         +--------------------------------------------------+
         | ...                                              |
         +--------------------------------------------------+
         | Address Family Identifier (16 bits)              |
         +--------------------------------------------------+
         | Subsequent Address Family Identifier (8 bits)    |
         +--------------------------------------------------+
         | Flags for Address Family (8 bits)                |
         +--------------------------------------------------+

   The use and meaning of the fields are as follows:

      Restart Flags:

         This field contains bit flags related to restart.

             0 1 2 3
            +-+-+-+-+
            |R|Resv.|
            +-+-+-+-+

         The most significant bit is defined as the Restart State (R)
         bit which can be used to avoid possible deadlock caused by
         waiting for the End-of-RIB marker when multiple BGP speakers
         peering with each other restart. When set (value 1), this bit
         indicates that the BGP speaker has restarted, and its peer
         should not wait for the End-of-RIB marker from the speaker
         before advertising routing information to the speaker.

         The remaining bits are reserved, and should be set to zero by
         the sender and ignored by the receiver.

      Restart Time:

         This is the estimated time (in seconds) it will take for the
         BGP session to be re-established after a restart. The peer can
         use it to speed up routing convergence in case the BGP speaker
         does not come back after a restart.

      Address Family Identifier (AFI):

         This field carries the identity of the Network Layer protocol
         for which the Graceful Restart support is advertised. Presently
         defined values for this field are specified in [IANA-AFI].

      Subsequent Address Family Identifier (SAFI):

         This field provides additional information about the type of
         the Network Layer Reachability Information carried in the
         attribute. Presently defined values for this field are
         specified in [IANA-SAFI].

      Flags for Address Family:

         This field contains bit flags for the <AFI, SAFI>.

             0 1 2 3 4 5 6 7
            +-+-+-+-+-+-+-+-+
            |F|   Reserved  |
            +-+-+-+-+-+-+-+-+

         The most significant bit is defined as the Forwarding State (F)
         bit which can be used to indicate if the forwarding state for
         the <AFI, SAFI> has indeed been preserved during the previous
         BGP restart. When set (value 1), the bit indicates that the
         forwarding state has been preserved.

         The remaining bits are reserved, and should be set to zero by
         the sender and ignored by the receiver.

   When a sender of this capability doesn't include any <AFI, SAFI> in
   the capability, it means that the sender is not capable of preserving
   its forwarding state during BGP restart, but is going to generate the
   End-of-RIB marker upon the completion of its initial routing updates.
   The value of the "Restart Time" field is irrelevant in that case.

   A BGP speaker should not include more than one instance of the
   Graceful Restart Capability in the capability advertisement
   [BGP-CAP].
   If more than one instance of the Graceful Restart Capability
   is carried in the capability advertisement, the receiver of the
   advertisement should ignore all but the last instance of the Graceful
   Restart Capability.

   Including <AFI=IPv4, SAFI=unicast> into the Graceful Restart
   Capability doesn't imply that the IPv4 unicast routing information
   should be carried by using the BGP Multiprotocol extensions [BGP-MP];
   it could be carried in the NLRI field of the BGP UPDATE message.

6. Operation

   A BGP speaker may advertise the Graceful Restart Capability for an
   address family to its peer if it has the ability to preserve its
   forwarding state for the address family when BGP restarts. In
   addition, even if the speaker does not have the ability to preserve
   its forwarding state for any address family during BGP restart, it is
   still recommended that the speaker advertise the Graceful Restart
   Capability to its peer to indicate its intention of generating the
   End-of-RIB marker upon the completion of its initial routing updates
   (as mentioned before this is done by not including any <AFI, SAFI> in
   the advertised capability), as doing this would be useful for routing
   convergence in general.

   The End-of-RIB marker should be sent by a BGP speaker to its peer
   once it completes the initial routing update (including the case when
   there is no update to send) for an address family after the BGP
   session is established.

   It is noted that the normal BGP procedures must be followed when the
   TCP session terminates due to the sending or receiving of a BGP
   NOTIFICATION message.

   In general the Restart Time should not be greater than the HOLDTIME
   carried in the OPEN.

   In the following sections, "Restarting Speaker" refers to a router
   whose BGP has restarted, and "Receiving Speaker" refers to a router
   that peers with the restarting speaker.
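
   The capability value layout defined in Section 5 can be sketched in
   Python as follows. This is an illustrative encoder and decoder
   only; the names GracefulRestart, encode_value, decode_value and
   effective_instance are hypothetical, and the sketch handles just
   the capability value, not the surrounding capability code and
   length octets.

```python
import struct
from typing import List, NamedTuple, Optional, Tuple

R_BIT = 0x8    # most significant bit of the 4-bit Restart Flags
F_BIT = 0x80   # most significant bit of the 8-bit Flags for Address Family


class GracefulRestart(NamedTuple):
    restarted: bool                        # Restart State (R) bit
    restart_time: int                      # seconds, 12 bits
    families: List[Tuple[int, int, bool]]  # (AFI, SAFI, F bit)


def encode_value(cap: GracefulRestart) -> bytes:
    if not 0 <= cap.restart_time < 4096:
        raise ValueError("Restart Time must fit in 12 bits")
    flags = R_BIT if cap.restarted else 0  # reserved bits sent as zero
    out = struct.pack("!H", (flags << 12) | cap.restart_time)
    for afi, safi, preserved in cap.families:
        out += struct.pack("!HBB", afi, safi, F_BIT if preserved else 0)
    return out


def decode_value(value: bytes) -> GracefulRestart:
    if len(value) < 2 or (len(value) - 2) % 4:
        raise ValueError("malformed Graceful Restart Capability value")
    (head,) = struct.unpack_from("!H", value, 0)
    families = []
    for off in range(2, len(value), 4):
        afi, safi, fl = struct.unpack_from("!HBB", value, off)
        families.append((afi, safi, bool(fl & F_BIT)))  # reserved bits ignored
    return GracefulRestart(bool((head >> 12) & R_BIT), head & 0x0FFF, families)


def effective_instance(instances: List[bytes]) -> Optional[GracefulRestart]:
    """If several instances were received, ignore all but the last."""
    return decode_value(instances[-1]) if instances else None
```

   An empty families list encodes to the two-octet value that signals
   "End-of-RIB marker only", as described at the end of Section 5.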

   Consider that the Graceful Restart Capability for an address family
   is advertised by the Restarting Speaker, and is understood by the
   Receiving Speaker, and a BGP session between them is established.
   The following sections detail the procedures that shall be followed
   by the Restarting Speaker as well as the Receiving Speaker once the
   Restarting Speaker restarts.

6.1. Procedures for the Restarting Speaker

   When the Restarting Speaker restarts, if possible it shall retain the
   forwarding state for the BGP routes in the Loc-RIB, and shall mark
   them as stale. It should not differentiate between stale and other
   information during forwarding.

   To re-establish the session with its peer, the Restarting Speaker
   must set the "Restart State" bit in the Graceful Restart Capability
   of the OPEN message. Unless allowed via configuration, the
   "Forwarding State" bit for an address family in the capability can be
   set only if the forwarding state has indeed been preserved for that
   address family during the restart.

   Once the session between the Restarting Speaker and the Receiving
   Speaker is re-established, the Restarting Speaker will receive and
   process BGP messages from its peers. However, it shall defer route
   selection for an address family until it receives the End-of-RIB
   marker from all its peers (excluding the ones with the "Restart
   State" bit set in the received capability and excluding the ones
   which do not advertise the graceful restart capability). It is noted
   that prior to route selection, the speaker has no routes to advertise
   to its peers and no routes to update the forwarding state.

   In situations where both IGP and BGP have restarted, it might be
   advantageous to wait for IGP to converge before the BGP speaker
   performs route selection.
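
   The deferral rule above can be sketched in Python for a single
   address family. The Peer record and both function names are
   hypothetical, and the max_defer parameter stands for the
   configurable upper bound on deferral that this section requires;
   real implementations would track the state per address family.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Peer:
    name: str
    advertised_gr: bool       # peer sent the Graceful Restart Capability
    restart_state_bit: bool   # R bit in the capability received from the peer
    eor_received: bool = False


def awaited_peers(peers: List[Peer]) -> List[Peer]:
    """Peers whose End-of-RIB marker the Restarting Speaker still waits for."""
    return [
        p for p in peers
        if p.advertised_gr            # peers without the capability are excluded
        and not p.restart_state_bit   # peers that restarted too are excluded
        and not p.eor_received
    ]


def may_select_routes(peers: List[Peer], elapsed: float, max_defer: float) -> bool:
    """Route selection may run once nobody is awaited or the timer expired."""
    return elapsed >= max_defer or not awaited_peers(peers)
```

   Excluding peers that set the Restart State bit is what prevents two
   simultaneously restarting speakers from waiting on each other.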

   After the BGP speaker performs route selection, the forwarding state
   of the speaker shall be updated and any previously marked stale
   information shall be removed. The Adj-RIB-Out can then be advertised
   to its peers. Once the initial update is complete for an address
   family (including the case that there is no routing update to send),
   the End-of-RIB marker shall be sent.

   To put an upper bound on the amount of time a router defers its route
   selection, an implementation must support a (configurable) timer that
   imposes this upper bound.

   If one wants to apply graceful restart only when the restart is
   planned (as opposed to both planned and unplanned restart), then one
   way to accomplish this would be to set the Forwarding State bit to 1
   after a planned restart, and to 0 in all other cases. Other
   approaches to accomplish this are outside the scope of this document.

6.2. Procedures for the Receiving Speaker

   When the Restarting Speaker restarts, the Receiving Speaker may or
   may not detect the termination of the TCP session with the Restarting
   Speaker, depending on the underlying TCP implementation, whether or
   not [BGP-AUTH] is in use, and the specific circumstances of the
   restart.  In case it does not detect the TCP reset and still
   considers the BGP session as being established, it shall treat the
   subsequent open connection from the peer as an indication of TCP
   reset and act accordingly (when the Graceful Restart Capability has
   been received from the peer).

   "Acting accordingly" in this context means that the previous TCP
   session should be closed, and the new one retained.  Note that this
   behavior differs from the default behavior, as specified in [BGP-4]
   section 6.8.  Since the previous connection is considered to be
   reset, no NOTIFICATION message should be sent; the previous TCP
   session is simply closed.
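
   The decision just described can be written as a small Python
   sketch. The Action enumeration and the on_new_connection function
   are hypothetical names, and the fallback branch merely stands for
   the default behavior of [BGP-4] section 6.8, which is not modeled
   here.

```python
from enum import Enum


class Action(Enum):
    DEFAULT_HANDLING = "apply the default behavior of [BGP-4] section 6.8"
    REPLACE_SESSION = "close the old TCP session silently, keep the new one"


def on_new_connection(session_established: bool, peer_advertised_gr: bool) -> Action:
    """Decide what to do when a peer opens a new connection."""
    if session_established and peer_advertised_gr:
        # Treat the new connection as evidence of a TCP reset: the old
        # session is closed without sending a NOTIFICATION message.
        return Action.REPLACE_SESSION
    return Action.DEFAULT_HANDLING
```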

   When the Receiving Speaker detects TCP reset for a BGP session with a
   peer that has advertised the Graceful Restart Capability, it shall
   retain the routes received from the peer for all the address families
   that were previously received in the Graceful Restart Capability, and
   shall mark them as stale routing information. To deal with possible
   consecutive restarts, a route (from the peer) previously marked as
   stale shall be deleted. The router should not differentiate between
   stale and other routing information during forwarding.

   In re-establishing the session, the "Restart State" bit in the
   Graceful Restart Capability of the OPEN message sent by the Receiving
   Speaker shall not be set unless the Receiving Speaker has restarted.
   The presence and the setting of the "Forwarding State" bit for an
   address family depend upon the actual forwarding state and
   configuration.

   If the session does not get re-established within the "Restart Time"
   that the peer advertised previously, the Receiving Speaker shall
   delete all the stale routes from the peer that it is retaining.

   Once the session is re-established, if the "Forwarding State" bit for
   a specific address family is not set in the newly received Graceful
   Restart Capability, or if a specific address family is not included
   in the newly received Graceful Restart Capability, or if the Graceful
   Restart Capability isn't received in the re-established session at
   all, then the Receiving Speaker shall immediately remove all the
   stale routes from the peer that it is retaining for that address
   family.

   The Receiving Speaker shall send the End-of-RIB marker once it
   completes the initial update for an address family (including the
   case that it has no routes to send) to the peer.

   The Receiving Speaker shall replace the stale routes by the routing
   updates received from the peer.
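
   The stale-route rules of this section can be summarized in a
   minimal Python sketch of the routes learned from one peer. The
   PeerRoutes class and its method names are hypothetical; address
   families are represented as (AFI, SAFI) tuples, timers are driven
   by the caller, and the End-of-RIB handling reflects the rule stated
   in this section that remaining stale routes are removed once the
   marker arrives.

```python
class PeerRoutes:
    """Routes received from a single peer (illustrative only)."""

    def __init__(self, gr_families):
        # Families listed in the peer's last Graceful Restart Capability.
        self.gr_families = set(gr_families)
        self.routes = {}  # (family, prefix) -> [attributes, stale flag]

    def update(self, family, prefix, attrs):
        # A fresh update replaces any stale copy of the same route.
        self.routes[(family, prefix)] = [attrs, False]

    def on_tcp_reset(self):
        kept = {}
        for (family, prefix), (attrs, stale) in self.routes.items():
            # Retain only routes of graceful-restart families; a route
            # already stale from an earlier restart is deleted.
            if family in self.gr_families and not stale:
                kept[(family, prefix)] = [attrs, True]
        self.routes = kept

    def purge_stale(self, family=None):
        self.routes = {
            key: val for key, val in self.routes.items()
            if not (val[1] and (family is None or key[0] == family))
        }

    def on_restart_time_expired(self):
        self.purge_stale()  # session did not come back in time

    def on_session_reestablished(self, forwarding_preserved):
        # forwarding_preserved: families whose F bit is set in the newly
        # received capability (empty if no capability was received).
        for family in {key[0] for key in self.routes}:
            if family not in forwarding_preserved:
                self.purge_stale(family)

    def on_end_of_rib(self, family):
        self.purge_stale(family)  # whatever is still stale is gone
```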
   Once the End-of-RIB marker for an
   address family is received from the peer, it shall immediately remove
   any routes from the peer that are still marked as stale for that
   address family.

   To put an upper bound on the amount of time a router retains the
   stale routes, an implementation may support a (configurable) timer
   that imposes this upper bound.

7. Deployment Considerations

   While the procedures described in this document would help minimize
   the effect of routing flaps, it is noted, however, that when a BGP
   Graceful Restart capable router restarts, there is a potential for
   transient routing loops or blackholes in the network if routing
   information changes before the involved routers complete routing
   updates and convergence. Also, depending on the network topology, if
   not all IBGP speakers are Graceful Restart capable, there could be an
   increased exposure to transient routing loops or blackholes when the
   Graceful Restart procedures are exercised.

   The Restart Time, the upper bound for retaining routes and the upper
   bound for deferring route selection may need to be tuned as more
   deployment experience is gained.

   Finally, it is noted that the benefits of deploying BGP Graceful
   Restart in an AS whose IGPs and BGP are tightly coupled (i.e., BGP
   and IGPs would both restart) and IGPs have no similar Graceful
   Restart capability are reduced relative to the scenario where IGPs do
   have similar Graceful Restart capability.

8. Security Considerations

   Since with this proposal a new connection can cause an old one to be
   terminated, it might seem to open the door to denial of service
   attacks.  However, it is noted that unauthenticated BGP is already
   known to be vulnerable to denials of service through attacks on the
   TCP transport.  The TCP transport is commonly protected through use
   of [BGP-AUTH].
   Such authentication will equally protect against
   denials of service through spurious new connections.

   It is thus concluded that this proposal does not change the
   underlying security model (and issues) of BGP-4.

9. Acknowledgments

   The authors would like to thank Bruce Cole, Bill Fenner, Eric Gray,
   Jeffrey Haas, Alvaro Retana, Naiming Shen, Satinder Singh, David
   Ward, Shane Wright and Alex Zinin for their review and comments.

10. References

   [BGP-4] Rekhter, Y., and T. Li, "A Border Gateway Protocol 4
   (BGP-4)", RFC 1771, March 1995.

   [BGP-MP] Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
   "Multiprotocol Extensions for BGP-4", RFC 2858, June 2000.

   [BGP-CAP] Chandra, R., and J. Scudder, "Capabilities Advertisement
   with BGP-4", draft-ietf-idr-rfc2842bis-02.txt, April 2002.

   [BGP-AUTH] Heffernan, A., "Protection of BGP Sessions via the TCP MD5
   Signature Option", RFC 2385, August 1998.

   [IANA-AFI] http://www.iana.org/assignments/address-family-numbers.

   [IANA-SAFI] http://www.iana.org/assignments/safi-namespace.

11. Author Information

   Srihari R. Sangli
   Procket Networks, Inc.
   1100 Cadillac Court
   Milpitas, CA 95035
   e-mail: srihari@procket.com

   Yakov Rekhter
   Juniper Networks, Inc.
   1194 N. Mathilda Avenue
   Sunnyvale, CA 94089
   e-mail: yakov@juniper.net

   Rex Fernando
   Procket Networks, Inc.
   1100 Cadillac Court
   Milpitas, CA 95035
   e-mail: rex@procket.com

   John G. Scudder
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134
   e-mail: jgs@cisco.com

   Enke Chen
   Redback Networks, Inc.
   350 Holger Way
   San Jose, CA 95134
   e-mail: enke@redback.com