idnits 2.17.1 

draft-uttaro-idr-bgp-persistence-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 278 has weird spacing: '...eration  secti...'

  == Line 933 has weird spacing: '...lineaux  cedex...'

  == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but
     does not include the phrase in its RFC 2119 key words list.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     For MPLS VPN services, the effectiveness of the traffic isolation
     between VPNs relies on the correctness of the MPLS labels between ingress
     and egress PEs.  In particular, when an egress PE withdraws a label L1
     allocated to a VPN1 route, this label MUST not be assigned to a VPN route
     of a different VPN until all ingress PEs stop using the old VPN1 route
     using L1.

  -- The document date (July 12, 2013) is 3939 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-16) exists of
     draft-ietf-idr-bgp-gr-notification-01

  == Outdated reference: A later version (-12) exists of
     draft-ietf-idr-bgp-bestpath-selection-criteria-06

  -- Obsolete informational reference (is this intentional?): RFC 5575
     (Obsoleted by RFC 8955)


     Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                                J. Uttaro
3	Internet-Draft                                                      AT&T
4	Intended status: Standards Track                                 E. Chen
5	Expires: January 13, 2014                                  Cisco Systems
6	                                                             B. Decraene
7	                                                                  Orange
8	                                                              J. Scudder
9	                                                        Juniper Networks
10	                                                           July 12, 2013

12	              Support for Long-lived BGP Graceful Restart
13	                  draft-uttaro-idr-bgp-persistence-02

15	Abstract

17	   In this document we introduce a new BGP capability termed "Long-lived
18	   Graceful Restart Capability" so that stale routes can be retained for
19	   a longer time upon session failure.  In addition a new BGP community
20	   "LLGR_STALE" is introduced for marking stale routes retained for a
21	   longer time.  We also specify that such long-lived stale routes be
22	   treated as the least-preferred, and their advertisements be limited
23	   to BGP speakers that have advertised the new capability.  Use of this
24	   extension is not advisable in all cases, and we provide guidelines to
25	   help determine if it is.

27	Status of this Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on January 13, 2014.

44	Copyright Notice

46	   Copyright (c) 2013 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	     1.1.  Requirements Language
63	   2.  Definitions
64	   3.  Protocol Extensions
65	     3.1.  Long-lived Graceful Restart Capability
66	     3.2.  LLGR_STALE Community
67	     3.3.  NO_LLGR Community
68	   4.  Operation
69	     4.1.  Use of Graceful Restart Capability
70	     4.2.  Session Resets
71	     4.3.  Processing LLGR_STALE Routes
72	     4.4.  Route Selection
73	     4.5.  Multicast VPN
74	     4.6.  Errors
75	     4.7.  Optional Partial Deployment Procedure
76	     4.8.  Procedures When BGP is the PE-CE Protocol in a VPN
77	   5.  Deployment Considerations
78	     5.1.  When BGP is the PE-CE Protocol in a VPN
79	     5.2.  Risks of Depreferencing Routes
80	   6.  Security Considerations
81	   7.  Examples of Operation
82	   8.  Acknowledgements
83	   9.  Contributors
84	   10. IANA Considerations
85	   11. References
86	     11.1. Normative References
87	     11.2. Informative References
88	   Authors' Addresses

90	1.  Introduction

92	   Historically, routing protocols in general and BGP in particular have
93	   been designed with a focus on correctness, where a key part of
94	   "correctness" is for each network element's forwarding state to
95	   converge toward the current state of the network as quickly as
96	   possible.  For this reason, the protocol was designed to remove state
97	   advertised by routers which went down (from a BGP perspective) as
98	   quickly as possible.  Over time, this has been relaxed somewhat,
99	   notably by BGP Graceful Restart [RFC4724]; however, the paradigm has
100	   remained one of attempting to rapidly remove "stale" state from the
101	   network.

103	   Over time, two phenomena have arisen that call into question the
104	   underlying assumptions of this paradigm.  The first is the widespread
105	   adoption of tunneled forwarding infrastructures, for example MPLS.
106	   Such infrastructures eliminate the risk of some types of forwarding
107	   loops that can arise in hop-by-hop forwarding, and thus reduce one of
108	   the motivations for strong consistency between forwarding elements.
109	   The second is the increasing use of BGP as a transport for data less
110	   closely associated with packet forwarding than was originally the
111	   case.  Examples include the use of BGP for autodiscovery (VPLS
112	   [RFC4761]) and filter programming (FLOWSPEC [RFC5575]).  In these
113	   cases, BGP data takes on a character more akin to configuration than
114	   to traditional routing.

116	   The observations above motivate a desire to offer network operators
117	   the ability to choose to retain BGP data for a longer period than has
118	   hitherto been possible when the BGP control plane fails for some
119	   reason.  Although the semantics of BGP Graceful Restart [RFC4724] are
120	   close to those desired, several gaps exist, most notably in the
121	   maximum time for which "stale" information can be retained --
122	   Graceful Restart imposes a 4095-second upper bound.

124	   In this document we introduce a new BGP capability termed "Long-lived
125	   Graceful Restart Capability" so that stale information can be
126	   retained for a longer time across a session reset.  We also introduce
127	   a new BGP community, "LLGR_STALE", to mark such information.  Such
128	   stale information is to be treated as least-preferred, and its
129	   advertisement limited to BGP speakers that support the new
130	   capability.  Where possible, we reference the semantics of BGP
131	   Graceful Restart [RFC4724] rather than specifying similar semantics
132	   in this document.

134	   The expected deployment model for this extension is that it will only
135	   be invoked for certain address families.  This is discussed in more
136	   detail in the Deployment Considerations section (Section 5).  When
137	   used, its use may be combined with that of traditional Graceful
138	   Restart, in which case it is invoked only after the traditional
139	   Graceful Restart interval has elapsed, or it may be invoked
140	   immediately.  Apart from the potential to greatly extend the timer,
141	   the most obvious difference between Long-Lived and traditional
142	   Graceful Restart is that in the Long-Lived version, routes are
143	   "depreferenced", that is, treated as least-preferred, whereas in the
144	   traditional version, route preference is not affected.  The design
145	   choice to treat Long-Lived Stale routes as least-preferred was
146	   informed by the expectation that they might be retained for a
147	   (potentially) almost unbounded period of time, whereas in the
148	   traditional Graceful Restart case, stale routes are retained for only
149	   a brief interval.  In the GR case, the tradeoff between advertising
150	   new route status (at the cost of routing churn) and not advertising
151	   it (at the cost of suboptimal or incorrect route selection) is
152	   resolved in favor of not advertising, and in the LLGR case, it is
153	   resolved in favor of advertising new state.

155	1.1.  Requirements Language

157	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
158	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
159	   document are to be interpreted as described in RFC 2119 [RFC2119].

161	2.  Definitions

163	   Depreference, Depreferenced:  A route is said to be depreferenced if
164	      it has its route selection preference reduced in reaction to some
165	      event.

167	   GR:  Abbreviation for "Graceful Restart" [RFC4724], also sometimes
168	      referred to herein as "conventional Graceful Restart" or
169	      "conventional GR" to distinguish it from the "Long-lived Graceful
170	      Restart" defined by this document.

172	   Helper:  Or "helper router".  During Graceful Restart or Long-lived
173	      Graceful Restart, the router that detects a session failure and
174	      applies the listed procedures.  [RFC4724] refers to this as the
175	      "receiving speaker".

177	   LLGR:  Abbreviation for "Long-lived Graceful Restart".

179	   LLST:  Abbreviation for "Long-lived Stale Time".

181	   Route:  We use "route" to mean any information encoded as a BGP NLRI
182	      and set of path attributes.  As discussed above, the connection
183	      between such routes and installation of forwarding state may be
184	      quite remote.

186	3.  Protocol Extensions

188	   A new BGP capability and two new BGP communities are introduced.

190	3.1.  Long-lived Graceful Restart Capability

192	   The "Long-lived Graceful Restart Capability" is a new BGP capability
193	   [RFC5492] that can be used by a BGP speaker to indicate its ability
194	   to preserve its state according to the procedures of this document.
195	   This capability MUST be advertised in conjunction with the Graceful
196	   Restart capability [RFC4724], see the "Use of Graceful Restart
197	   Capability" section (Section 4.1).

199	   The capability value consists of one or more tuples <AFI, SAFI,
200	   Flags, Long-lived Stale Time> as follows:

202	         +--------------------------------------------------+
203	         | Address Family Identifier (16 bits)              |
204	         +--------------------------------------------------+
205	         | Subsequent Address Family Identifier (8 bits)    |
206	         +--------------------------------------------------+
207	         | Flags for Address Family (8 bits)                |
208	         +--------------------------------------------------+
209	         | Long-lived Stale Time (24 bits)                  |
210	         +--------------------------------------------------+
211	         | ...                                              |
212	         +--------------------------------------------------+
213	         | Address Family Identifier (16 bits)              |
214	         +--------------------------------------------------+
215	         | Subsequent Address Family Identifier (8 bits)    |
216	         +--------------------------------------------------+
217	         | Flags for Address Family (8 bits)                |
218	         +--------------------------------------------------+
219	         | Long-lived Stale Time (24 bits)                  |
220	         +--------------------------------------------------+

222	   The meanings of the fields are as follows:

224	      Address Family Identifier (AFI), Subsequent Address Family
225	      Identifier (SAFI):

227	         The AFI and SAFI, taken in combination, indicate that the BGP
228	         speaker has the ability to preserve its forwarding state for
229	         the address family during a subsequent BGP restart.  Routes may
230	         be explicitly associated with a particular AFI and SAFI using
231	         the encoding of [RFC4760] or implicitly associated with
232	         <AFI=IPv4, SAFI=Unicast> if using the encoding of [RFC4271].

234	      Flags for Address Family:

236	         This field contains bit flags relating to routes that were
237	         advertised with the given AFI and SAFI.

239	                0 1 2 3 4 5 6 7
240	               +-+-+-+-+-+-+-+-+
241	               |F|   Reserved  |
242	               +-+-+-+-+-+-+-+-+

244	         The most significant bit is used to indicate whether the state
245	         for routes that were advertised with the given AFI and SAFI has
246	         indeed been preserved during the previous BGP restart.  When
247	         set (value 1), the bit indicates that the state has been
248	         preserved.  This bit is called the "F bit" since it was
249	         historically used to indicate preservation of Forwarding State.
250	         Use of the F bit is detailed in the Session Resets section
251	         (Section 4.2).

253	         The remaining bits are reserved and MUST be set to zero by the
254	         sender and ignored by the receiver.

256	      Long-lived Stale Time:

258	         This time (in seconds) specifies how long stale information
259	         (for the AFI/SAFI) may be retained (possibly in conjunction
260	         with the period specified by the "Restart Time" in the Graceful
261	         Restart Capability, if present).
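
   As a rough illustration of the encoding above, the following Python
   sketch decodes a capability value into its 7-octet tuples.  The names
   parse_llgr_capability and LlgrTuple are hypothetical and not part of
   this specification; network byte order is assumed, as is usual for
   BGP fields.

```python
from typing import List, NamedTuple


class LlgrTuple(NamedTuple):
    """One <AFI, SAFI, Flags, Long-lived Stale Time> tuple (hypothetical type)."""
    afi: int          # Address Family Identifier (16 bits)
    safi: int         # Subsequent Address Family Identifier (8 bits)
    f_bit: bool       # most significant bit of the Flags octet
    stale_time: int   # Long-lived Stale Time in seconds (24 bits)


TUPLE_LEN = 7  # 2 (AFI) + 1 (SAFI) + 1 (Flags) + 3 (Long-lived Stale Time)


def parse_llgr_capability(value: bytes) -> List[LlgrTuple]:
    """Decode a capability value into its per-family tuples."""
    if len(value) == 0 or len(value) % TUPLE_LEN != 0:
        raise ValueError("value must hold one or more 7-octet tuples")
    tuples = []
    for off in range(0, len(value), TUPLE_LEN):
        afi = int.from_bytes(value[off:off + 2], "big")
        safi = value[off + 2]
        flags = value[off + 3]
        stale_time = int.from_bytes(value[off + 4:off + 7], "big")
        # Only the F bit is examined; reserved bits are ignored on receipt.
        tuples.append(LlgrTuple(afi, safi, bool(flags & 0x80), stale_time))
    return tuples
```

   Because the Long-lived Stale Time field is 24 bits wide, the largest
   value a tuple can carry is 16777215 seconds.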

263	3.2.  LLGR_STALE Community

265	   We introduce a new BGP community [RFC1997] "LLGR_STALE" (value: TBD).
266	   It can be used to mark stale routes retained for a longer period of
267	   time.  Such long-lived stale routes are to be handled according to
268	   the procedures specified in the Operation section (Section 4).

270	   An implementation MAY allow users to configure policies that accept,
271	   reject, or modify routes based on the presence or absence of this
272	   community.

274	3.3.  NO_LLGR Community

276	   We introduce a new BGP community "NO_LLGR" (value: TBD).  It can be
277	   used to mark routes which a BGP speaker does not want treated
278	   according to these procedures, as detailed in the Operation section
279	   (Section 4).

281	   An implementation MAY allow users to configure policies that accept,
282	   reject, or modify routes based on the presence or absence of this
283	   community.

285	4.  Operation

287	   A BGP speaker MAY use BGP Capabilities Advertisements [RFC5492] to
288	   advertise the "Long-lived Graceful Restart Capability" to indicate
289	   its ability to retain state and perform related procedures specified
290	   in this document.  The setting of the parameters for an AFI/SAFI
291	   depends on the properties of the BGP speaker, network scale, and
292	   local configuration.

294	   In the presence of the "Long-lived Graceful Restart Capability", the
295	   procedures specified in [RFC4724] and
296	   [I-D.ietf-idr-bgp-gr-notification] continue to apply unless
297	   explicitly revised by this document.

299	4.1.  Use of Graceful Restart Capability

301	   The Graceful Restart capability MUST be advertised in conjunction
302	   with the LLGR capability.  If it is not so advertised, the LLGR
303	   capability MUST be disregarded.  The purpose for mandating that both
304	   be used in conjunction is to enable reuse of certain base mechanisms
305	   that are common to both "flavors", notably origination, collection
306	   and processing of EoR, as well as the finite state machine
307	   modifications and connection reset logic introduced by GR.

309	   We observe that if support for conventional Graceful Restart is not
310	   desired for the session, the conventional GR phase can be skipped by
311	   omitting all AFI/SAFI from the GR capability, advertising a Restart
312	   Time of zero, or both.  The Session Resets section (Section 4.2)
313	   discusses the interaction of conventional and long-lived GR.

315	4.2.  Session Resets

317	   BGP Graceful Restart [RFC4724], updated by
318	   [I-D.ietf-idr-bgp-gr-notification], defines conditions under which a
319	   BGP session can reset and have its associated routes retained.  If
320	   such a reset occurs for a session for which the LLGR Capability has
321	   also been exchanged, the following procedures apply.

323	   If the Graceful Restart Capability that was received does not list
324	   all AFI/SAFI supported by the session, then for those non-listed AFI/
325	   SAFI the GR "Restart Time" shall be deemed zero.  Similarly, if the
326	   received LLGR Capability does not list all AFI/SAFI supported by the
327	   session, then for those non-listed AFI/SAFI the "Long-lived Stale
328	   Time" shall be deemed zero.
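
   The defaulting rule above might be sketched as follows.  This is an
   illustration only: effective_timers and its parameter names are
   hypothetical, and the sketch assumes (per [RFC4724]) that the GR
   capability carries a single Restart Time plus a list of AFI/SAFI,
   whereas the LLGR capability carries a stale time per AFI/SAFI.

```python
from typing import Dict, Set, Tuple

AfiSafi = Tuple[int, int]


def effective_timers(
    session_families: Set[AfiSafi],
    gr_restart_time: int,
    gr_families: Set[AfiSafi],
    llgr_stale_times: Dict[AfiSafi, int],
) -> Dict[AfiSafi, Tuple[int, int]]:
    """Return {family: (restart_time, long_lived_stale_time)} in seconds."""
    timers = {}
    for family in session_families:
        # A family not listed in the GR capability gets Restart Time zero.
        restart = gr_restart_time if family in gr_families else 0
        # A family not listed in the LLGR capability gets stale time zero.
        stale = llgr_stale_times.get(family, 0)
        timers[family] = (restart, stale)
    return timers
```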

330	   The following text in Section 4.2 of the GR specification [RFC4724]
331	   no longer applies:

333	      If the session does not get re-established within the "Restart
334	      Time" that the peer advertised previously, the Receiving Speaker
335	      MUST delete all the stale routes from the peer that it is
336	      retaining.

338	   and the following procedures are specified instead:

340	   After the session goes down and before the session is re-established,
341	   the stale routes for an AFI/SAFI MUST be retained.  The interval for
342	   which they are retained is limited by the sum of the "Restart Time"
343	   in the received Graceful Restart Capability and the "Long-lived Stale
344	   Time" in the received Long-lived Graceful Restart Capability.  These
345	   timers MAY be modified by local configuration.

347	   If the value of the "Restart Time" or the "Long-lived Stale Time" is
348	   zero, the duration of the corresponding period would be zero seconds.
349	   So, for example, if the "Restart Time" is zero and the "Long-lived
350	   Stale Time" is nonzero, only the procedures particular to LLGR would
351	   apply.  Conversely, if the "Long-lived Stale Time" is zero and the
352	   "Restart Time" is nonzero, only the procedures of GR would apply.  If
353	   both are zero, none of these procedures would apply, only those of
354	   the base BGP specification (although EoR would still be used as
355	   detailed in [RFC4724]).  And finally, if both are nonzero, then the
356	   procedures would be applied serially -- first those of GR, then those
357	   of LLGR.  We observe that during the first interval, while the
358	   procedures of GR are in effect, route preference would not be
359	   affected, while during the second interval, while LLGR procedures are
360	   in effect, routes would be treated as least-preferred as specified
361	   elsewhere in this document.
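
   The four cases just described can be summarized in a small sketch.
   The function names are hypothetical, and local configuration that
   modifies the timers is not modeled.

```python
from typing import List, Tuple


def retention_phases(restart_time: int, stale_time: int) -> List[Tuple[str, int]]:
    """List the retention phases in the order they are applied.

    Each entry is (phase name, duration in seconds).  An empty list means
    that only the base BGP procedures apply.
    """
    phases = []
    if restart_time > 0:
        phases.append(("GR", restart_time))      # route preference unaffected
    if stale_time > 0:
        phases.append(("LLGR", stale_time))      # routes least-preferred
    return phases


def max_retention(restart_time: int, stale_time: int) -> int:
    """Upper bound on retention: the sum of the two timers."""
    return sum(duration for _, duration in retention_phases(restart_time, stale_time))
```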

363	   Once the "Restart Time" period ends (including the case that the
364	   "Restart Time" is zero), the LLGR period is said to have begun and
365	   the following procedures MUST be performed:

367	   o  The helper router MUST start a timer for the "Long-lived Stale
368	      Time".  If the timer for the "Long-lived Stale Time" expires
369	      before the session is re-established, the helper MUST delete all
370	      the stale routes from the neighbor that it is retaining.

372	   o  The helper router MUST attach the LLGR_STALE community to the
373	      stale routes being retained.  Note that this requirement implies
374	      that the routes would need to be readvertised, to disseminate the
375	      modified community.

377	   o  If any of the routes from the peer have been marked with the
378	      NO_LLGR community, either as sent by the peer, or as the result of
379	      a configured policy, they MUST NOT be retained, but MUST be
380	      removed as per the normal operation of [RFC4271].

382	   o  The helper router MUST perform the procedures listed under
383	      Section 4.3.
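
   A minimal sketch of the community handling in the steps above is
   given below.  The Route type and begin_llgr_period are hypothetical,
   the community values are still TBD in this document so symbolic names
   stand in for them, and starting the "Long-lived Stale Time" timer and
   readvertising the routes are left to the caller.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

LLGR_STALE = "LLGR_STALE"  # placeholder; the actual community value is TBD
NO_LLGR = "NO_LLGR"        # placeholder; the actual community value is TBD


@dataclass(frozen=True)
class Route:
    prefix: str
    communities: FrozenSet[str] = frozenset()


def begin_llgr_period(stale_routes: List[Route]) -> Tuple[List[Route], List[Route]]:
    """Split a peer's stale routes into (retained, removed).

    Routes marked NO_LLGR are removed; all others are retained with the
    LLGR_STALE community attached.
    """
    retained, removed = [], []
    for route in stale_routes:
        if NO_LLGR in route.communities:
            removed.append(route)
        else:
            marked = Route(route.prefix, route.communities | {LLGR_STALE})
            retained.append(marked)
    return retained, removed
```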

385	   Once the session is re-established, the procedures specified in
386	   [RFC4724] apply for the stale routes irrespective of whether the
387	   stale routes are retained during the "Restart Time" period or the
388	   "Long-lived Stale Time" period.  However, in the case of consecutive
389	   restarts (i.e., the session goes down before the EoR is received) the
390	   previously marked stale routes MUST NOT be deleted before the timer
391	   for the "Long-lived Stale Time" expires.

393	   Similarly to [RFC4724], once the session is re-established, if the F
394	   bit for a specific address family is not set in the newly received
395	   LLGR Capability, or if a specific address family is not included in
396	   the newly received LLGR Capability, or if the LLGR and accompanying
397	   GR Capability are not received in the re-established session at all,
398	   then the Helper MUST immediately remove all the stale routes from the
399	   peer that it is retaining for that address family.
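
   The three removal conditions in the preceding paragraph could be
   checked along the following lines.  This is a sketch only;
   must_flush_stale and its parameters are hypothetical names.

```python
from typing import Dict, Optional, Tuple

AfiSafi = Tuple[int, int]


def must_flush_stale(
    family: AfiSafi,
    gr_received: bool,
    llgr_f_bits: Optional[Dict[AfiSafi, bool]],
) -> bool:
    """Decide whether stale routes for `family` are removed immediately.

    `llgr_f_bits` maps each AFI/SAFI in the newly received LLGR capability
    to its F bit, or is None if no LLGR capability was received.
    """
    if not gr_received or llgr_f_bits is None:
        return True          # capabilities missing from the new session
    if family not in llgr_f_bits:
        return True          # address family not included
    return not llgr_f_bits[family]   # F bit not set
```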

401	   If a "Long-lived Stale Time" timer is running for a peer, it MUST NOT
402	   be updated (other than by manual operator intervention) until the
403	   peer has established and synchronized a new session.  The session is
404	   termed "synchronized" once the EoR has been received from the peer.

406	   The value of the "Long-lived Stale Time" in the capability received
407	   from a neighbor MAY be reduced by local configuration.

409	   While the session is down, the expiration of the "Long-lived Stale
410	   Time" timer is treated analogously to the expiration of the "Restart
411	   Time" timer in Graceful Restart.  However, the timer continues to run
412	   once the session has re-established.  The timer is not stopped, nor
413	   updated, until EoR is received from the peer.  If the timer expires
414	   during synchronization with the peer, any stale routes that the peer
415	   has not refreshed are removed.  If the session subsequently resets
416	   prior to becoming synchronized, any remaining routes should be
417	   removed immediately.

419	4.3.  Processing LLGR_STALE Routes

421	   A BGP speaker that has advertised the "Long-lived Graceful Restart
422	   Capability" to a neighbor MUST perform the following upon receiving a
423	   route from that neighbor with the "LLGR_STALE" community, or upon
424	   attaching the "LLGR_STALE" community itself per Section 4.2:

426	   o  Treat the route as the least-preferred in route selection (see
427	      below).  See the Risks of Depreferencing Routes section
428	      (Section 5.2) for a discussion of potential risks inherent in
429	      doing this.

431	   o  The route SHOULD NOT be advertised to any neighbor from which the
432	      Long-lived Graceful Restart Capability has not been received.  The
433	      exception is described in the Optional Partial Deployment
434	      Procedure section (Section 4.7).  Note that this requirement
435	      implies that such routes should be withdrawn from any such
436	      neighbor.

438	   o  The "LLGR_STALE" community MUST NOT be removed when the route is
439	      further advertised.
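
   The advertisement rule above can be pictured with the short sketch
   below.  The function name is hypothetical, "LLGR_STALE" is a symbolic
   stand-in for the TBD community value, and the exceptions of
   Section 4.7 and Section 4.8 are deliberately not modeled.

```python
from typing import Iterable


def may_advertise(route_communities: Iterable[str], neighbor_sent_llgr_cap: bool) -> bool:
    """Return True if the route may be advertised to the neighbor.

    A route carrying LLGR_STALE is advertised only to neighbors from which
    the Long-lived Graceful Restart Capability has been received; a False
    result also means any earlier advertisement should be withdrawn.
    """
    if "LLGR_STALE" not in set(route_communities):
        return True
    return neighbor_sent_llgr_cap
```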

441	4.4.  Route Selection

443	   In this document, when we refer to treating a route as least-
444	   preferred, this means the route MUST be treated as less preferred
445	   than any other route that is not so treated.  When performing route
446	   selection between two routes both of which are least-preferred,
447	   normal tie-breaking applies.  Note that this would only be expected
448	   to happen if the only routes available for selection were least-
449	   preferred -- in all other cases, such routes would have been
450	   eliminated from consideration.
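
   One way to picture this rule is as a two-level sort key, sketched
   below.  The Candidate type is hypothetical, and normal_rank is an
   assumed stand-in for the outcome of the ordinary BGP decision process
   (lower is better); it is not defined by this document.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    least_preferred: bool  # True if the route is treated as least-preferred
    normal_rank: int       # assumed rank under normal tie-breaking


def select_best(candidates: List[Candidate]) -> Candidate:
    """Pick the best route.

    Any route that is not least-preferred beats every route that is;
    within each group, normal tie-breaking decides.
    """
    return min(candidates, key=lambda c: (c.least_preferred, c.normal_rank))
```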

452	4.5.  Multicast VPN

454	   Special consideration is required if LLGR is to be applied to the
455	   Multicast VPN SAFI [RFC6514].  Considerations for Multicast VPNs will
456	   be covered in a future revision of this document.

458	4.6.  Errors

460	   If the LLGR capability is received without an accompanying GR
461	   capability, the LLGR capability MUST be ignored, that is, the
462	   implementation MUST behave as though no LLGR capability had been
463	   received.

465	4.7.  Optional Partial Deployment Procedure

467	   Ideally, all routers in an Autonomous System would support this
468	   specification before it was enabled.  However, to facilitate
469	   incremental deployment, stale routes MAY be advertised to neighbors
470	   that have not advertised the Long-lived Graceful Restart Capability
471	   under the following conditions:

473	   o  The neighbors MUST be internal (IBGP or Confederation) neighbors.

475	   o  The NO_EXPORT community [RFC1997] MUST be attached to the stale
476	      routes.

478	   o  The stale routes MUST have their LOCAL_PREF set to zero.  See the
479	      Risks of Depreferencing Routes section (Section 5.2) for a
480	      discussion of potential risks inherent in doing this.

482	   If this strategy for partial deployment is used, the network operator
483	   should set LOCAL_PREF to zero for all LLGR routes throughout the
484	   Autonomous System.  This trades off a small reduction in flexibility
485	   (ordering may not be preserved between competing LLGR routes) for
486	   consistency between routers which do, and do not, support this
487	   specification.  Since consistency of route selection can be important
488	   for preventing forwarding loops, the latter consideration dominates.
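
   The conditions of this optional procedure might be applied on export
   roughly as follows.  This is a sketch under stated assumptions: the
   Advert type and function name are hypothetical, community names are
   symbolic, and the neighbor is assumed not to have advertised the
   Long-lived Graceful Restart Capability.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional


class Advert(NamedTuple):
    local_pref: int
    communities: FrozenSet[str]


def advert_for_non_llgr_neighbor(
    local_pref: int,
    communities: Iterable[str],
    neighbor_is_internal: bool,
) -> Optional[Advert]:
    """Build the advertisement for a neighbor lacking the LLGR capability.

    Returns None when a stale route is not to be advertised.
    """
    comms = frozenset(communities)
    if "LLGR_STALE" not in comms:
        return Advert(local_pref, comms)   # not stale: unchanged
    if not neighbor_is_internal:
        return None                        # only IBGP/Confederation neighbors
    # Stale route to an internal neighbor: LOCAL_PREF zero plus NO_EXPORT.
    return Advert(0, comms | {"NO_EXPORT"})
```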

490	4.8.  Procedures When BGP is the PE-CE Protocol in a VPN

492	   In VPN deployments, for example [RFC4364], BGP is often used as a
493	   PE-CE protocol.  It may be a practical necessity in such deployments
494	   to accommodate interoperation with CEs that cannot easily be upgraded
495	   to support specifications such as this one.  This leads to a problem:
496	   in this specification, we take pains to ensure that "stale" routing
497	   information will not leak beyond the perimeter of routers that
498	   support these procedures, so that it can be depreferenced as
499	   expected, and we provide a workaround (Section 4.7) for the case
500	   where one or more IBGP routers are not upgraded.  However, in the VPN
501	   PE-CE case, the protocol in use is EBGP, and our workaround does not
502	   work since it relies on the use of LOCAL_PREF, an IBGP-only path
503	   attribute.

505	   We observe that the principal motivation for restricting the
506	   propagation of "stale" routing information is the desire to prevent
507	   it from spreading without limit once it exits the "safe" perimeter.
508	   We further observe that VPN deployments are typically topologically
509	   constrained, making this concern moot.  For this reason, an
510	   implementation MAY advertise stale routes over a PE-CE session, when
511	   explicitly configured to do so.  That is, the second rule listed in
512	   Section 4.3 MAY be disregarded in such cases.  All other rules
513	   continue to apply.  Finally, if this exception is used, the
514	   implementation SHOULD by default attach the NO_EXPORT community to
515	   the routes in question, as an additional protection against stale
516	   routes spreading without limit.  Attachment of the NO_EXPORT
517	   community MAY be disabled by explicit configuration, to accommodate
518	   exceptional cases.

520	   See further discussion in Section 5.1.

522	5.  Deployment Considerations

524	   The deployment considerations discussed in [RFC4724] apply to this
525	   document.  In addition, network operators are cautioned to carefully
526	   consider the potential disadvantages of deploying these procedures
527	   for a given AFI/SAFI.  Most notably, if used for an AFI/SAFI that
528	   conveys traditional reachability information, use of a long-lived
529	   stale route could result in a loss of connectivity for the covered
530	   prefix.  This specification takes pains to mitigate this risk where
531	   possible, by making such routes least-preferred and by restricting
532	   the scope of such routes to routers that support these procedures
533	   (or, optionally, a single Autonomous System, see "Optional Partial
534	   Deployment Procedure", above).  However, according to the normal
535	   rules of IP forwarding, a stale more-specific route that has no non-
536	   stale alternate paths available will still be used instead of a non-
537	   stale less-specific route.  Networks in which the deployment of these
538	   procedures would be especially concerning include those which do not
539	   use "tunneled" forwarding (in other words, those using traditional
540	   hop-by-hop forwarding).

542	   Implementations MUST NOT enable these procedures by default.  They
543	   MUST require affirmative configuration per AFI/SAFI in order to
544	   enable them.

546	   The procedures of this document do not alter the route resolvability
547	   requirement of [RFC4271] Section 9.1.2.1.  Because of this, it will
548	   commonly be the case that "stale" IBGP routes will only continue to
549	   be used if the router depicted in the next hop remains resolvable,
550	   even if its BGP component is down.  Details of IGP fault-tolerance
551	   strategies are beyond the scope of this document.  In addition to the
552	   foregoing, it may be advisable to check the viability of the next hop
553	   through other means; see, for example,
554	   [I-D.ietf-idr-bgp-bestpath-selection-criteria].  This may be
555	   especially useful in cases where the next hop is known directly at
556	   the network layer, notably EBGP.

558	   As discussed in this document, after a BGP session goes down and
559	   before the session is re-established, stale routes may be retained
560	   for up to two consecutive periods, controlled by the "Restart Time"
561	   and the "Long-lived Stale Time", respectively.  During the first
562	   period routing churn would be prevented but with potential
563	   blackholing of traffic.  During the second period potential
564	   blackholing of traffic may be reduced but routing churn would be
565	   visible throughout the network.  The setting of the relevant
566	   parameters for a particular application should take into account the
567	   tradeoffs, the network dynamics and potential failure scenarios.  If
568	   needed, the first period can be bypassed either by local
569	   configuration or by setting the "Restart Time" in the Graceful
570	   Restart Capability to zero and/or not listing the AFI/SAFI in that
571	   Capability.
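
   The two consecutive retention periods can be sketched as a simple
   classification of a route by the time elapsed since the session
   failed.  This is a minimal sketch that assumes the session is not
   re-established in the meantime; the function name and the phase
   labels are illustrative only.

```python
def retention_phase(elapsed, restart_time, llst):
    """Classify a route `elapsed` seconds after its session went down.

    restart_time: the GR "Restart Time" in seconds (0 bypasses the
                  first period).
    llst:         the "Long-lived Stale Time" in seconds.
    """
    if elapsed < restart_time:
        return "gr-stale"       # first period: no churn, possible blackholing
    if elapsed < restart_time + llst:
        return "llgr-stale"     # second period: depreferenced, churn visible
    return "removed"            # both timers have expired
```

   For example, with a Restart Time of zero the route enters the
   long-lived stale phase as soon as the session fails.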

573	   The setting of the F bit (and the "Forwarding State" bit of the
574	   accompanying GR capability) depends in part on deployment
575	   considerations.  The F bit can be understood as an indication that
576	   the Helper should flush associated routes (if the bit is left clear).
577	   As discussed in the Introduction, an important use case for LLGR is
578	   for routes that are more akin to configuration than to traditional
579	   routing.  For such routes, it may make sense to always set the F bit,
580	   regardless of other considerations.  Likewise, for control-plane-only
581	   entities such as dedicated route reflectors, that do not participate
582	   in the forwarding plane, it makes sense to always set the F bit.
583	   Overall, the rule of thumb is that if loss of state on the restarting
584	   router can reasonably be expected to cause a forwarding loop or black
585	   hole, the F bit should be set scrupulously according to whether state
586	   has been retained.  Specifics of when the F bit is, and is not, set
587	   are implementation-dependent and may also be controlled by
588	   configuration.

590	5.1.  When BGP is the PE-CE Protocol in a VPN

592	   As discussed in Section 4.8, it may be necessary to advertise stale
593	   routes to a CE in some VPN deployments, even if the CE does not
594	   support this specification.  In that case, the network operator
595	   configuring their PE to advertise such routes should notify the
596	   operator of the CE receiving the routes, and the CE should be
597	   configured to depreference the routes.  Typical BGP implementations
598	   will be able to do this by matching on the LLGR_STALE community, and
599	   setting the LOCAL_PREF for matching routes to zero, similar to the
600	   procedure described in Section 4.7.
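
   The CE-side policy described above can be sketched as follows.  The
   route representation (a dictionary with "communities" and
   "local_pref" keys) and the function name are assumptions made for
   illustration; an actual deployment would express this in the routing
   policy language of the CE.

```python
def ce_import_policy(route):
    """Return a copy of `route`, depreferenced if it carries LLGR_STALE."""
    updated = dict(route)
    if "LLGR_STALE" in route.get("communities", ()):
        updated["local_pref"] = 0   # least preferred among candidate paths
    return updated


def best_path(routes):
    """Pick the path with the highest LOCAL_PREF (other tie-breakers omitted)."""
    return max(routes, key=lambda r: r["local_pref"])
```

   With this policy a stale path is selected only when no non-stale
   path with a higher LOCAL_PREF is available for the same prefix.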

602	5.2.  Risks of Depreferencing Routes

604	   Depreferencing EBGP routes is considered safe, no different from the
605	   common practice of applying a routing policy to an EBGP session.
606	   However, the same is not always true of IBGP.

608	   Consistent route selection is a fundamental tenet of IBGP correctness
609	   and safe operation in hop-by-hop routed networks.  When routers
610	   within an AS apply different criteria in selecting routes, they can
611	   arrive at inconsistent route selections, potentially with the
612	   consequence of forming forwarding loops unless some form of tunneled
613	   forwarding is used to prevent "core" routers from making a
614	   (potentially inconsistent) forwarding decision based on the IP
615	   header.

617	   This specification uses the state of a peering session as an input to
618	   the selection criteria, depreferencing routes that are associated
619	   with a session that has gone down but have not yet aged out.  Since
620	   different routers within an AS might have different notions as to
621	   whether their respective sessions with a given peer are up or down,
622	   they might apply different selection criteria to routes from that
623	   peer.  This could result in a forwarding loop forming between such
624	   routers.

626	   For an example of such a forwarding loop, consider the following
627	   simple topology:

629	        A ---- B ---- C ------------------------- D
630	        ^                                         ^
631	        |                                         |
632	        R1                                        R2

634	   In this example, A - D are routers with a full mesh of IBGP sessions
635	   between them.  The short links have unit cost, the long link has cost
636	   5.  Routers A and D are AS border routers, each advertising some
637	   route, R, into the AS -- these are denoted R1 and R2 in the diagram.
638	   In ordinary operation, it can be seen that routers B and C will
639	   select R1 for forwarding, and will forward toward A.

641	   Suppose that the session between A and B goes down for some reason,
642	   and stays down long enough for LLGR processing to be invoked on B.
643	   Then on B, route R1 will be depreferenced, leading to the selection
644	   of R2 by B.  However, C will continue to prefer R1.  It can be seen
645	   that in this case, a forwarding loop for packets destined to R would
646	   form between B and C.  (We note that other forwarding loop scenarios
647	   can be constructed for traditional GR, but are generally considered
648	   less severe since GR can remain in effect for a much more limited
649	   interval.)
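
   The loop in this example can be reproduced with a small simulation.
   The sketch below assumes that each router chooses between R1 and R2
   purely on IGP cost to the advertising border router, with all other
   attributes equal, and that a depreferenced route is used only when
   no other route exists.  All function names are illustrative.

```python
import heapq

# Topology from the example: short links cost 1, the long link costs 5.
LINKS = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 5}
EXITS = {"R1": "A", "R2": "D"}  # route -> border router advertising it


def neighbors(node):
    for (u, v), cost in LINKS.items():
        if u == node:
            yield v, cost
        elif v == node:
            yield u, cost


def shortest(src):
    """Dijkstra: map each router to (IGP cost, first hop) as seen from src."""
    best = {src: (0, "")}
    heap = [(0, src, "")]
    while heap:
        cost, node, first = heapq.heappop(heap)
        if cost > best[node][0]:
            continue
        for nxt, weight in neighbors(node):
            new_cost = cost + weight
            hop = first or nxt
            if nxt not in best or new_cost < best[nxt][0]:
                best[nxt] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, nxt, hop))
    return best


def next_hop(router, depreferenced=()):
    """Return (selected route, IGP first hop toward its exit)."""
    paths = shortest(router)
    candidates = [r for r in EXITS if r not in depreferenced] or list(EXITS)
    route = min(candidates, key=lambda r: paths[EXITS[r]][0])
    return route, paths[EXITS[route]][1]


def trace(start, stale_on, max_hops=6):
    """Follow a packet hop by hop; returns (path, loop_detected)."""
    path = [start]
    while len(path) <= max_hops:
        _, hop = next_hop(path[-1], stale_on.get(path[-1], ()))
        if not hop:
            return path, False          # reached the selected exit router
        if hop in path:
            return path + [hop], True   # revisited a router: forwarding loop
        path.append(hop)
    return path, True
```

   In ordinary operation, trace("C", {}) follows C, B, A and exits.
   With R1 depreferenced on B only, trace("C", {"B": {"R1"}}) bounces
   from C to B and back to C, which is the loop described above.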

651	   The potential benefits of this specification can outweigh the risks
652	   discussed above, as long as care is exercised in deployment.  The
653	   cardinal rule to be followed is, if a given set of routes is being
654	   used within an AS for hop-by-hop forwarding, it is NOT RECOMMENDED to
655	   enable LLGR procedures.  If tunneled forwarding (such as MPLS) is
656	   used within the AS, or if routes are being used for purposes other
657	   than hop-by-hop forwarding, less caution is needed, though the
658	   operator should still carefully consider the consequences of enabling
659	   LLGR.

661	6.  Security Considerations

663	   The security implications of the LLGR mechanism defined in
664	   this document are akin to those incurred by the maintenance of stale
665	   routing information within a network.  This is particularly relevant
666	   when considering the maintenance of routing information that is
667	   utilised for service segregation - such as MPLS label entries.

669	   For MPLS VPN services, the effectiveness of the traffic isolation
670	   between VPNs relies on the correctness of the MPLS labels between
671	   ingress and egress PEs.  In particular, when an egress PE withdraws a
672	   label L1 allocated to a VPN1 route, this label MUST NOT be assigned
673	   to a VPN route of a different VPN until all ingress PEs stop using
674	   the old VPN1 route using L1.

676	   Such a corner case may happen today, if the propagation of VPN routes
677	   by BGP messages between PEs takes more time than the label re-
678	   allocation delay on a PE.  Given that we can generally bound worst
679	   case BGP propagation time to a few minutes (for example 2-5), the
680	   security breach will not occur if PEs are designed not to reallocate
681	   a previously used and withdrawn label for a few minutes.

683	   The problem is made worse with BGP GR between PEs as VPN routes can
684	   remain stale for a longer period of time (for example 20 minutes).

686	   This is further aggravated by the BGP LLGR extension proposed in this
687	   document as VPN routes can remain stale for a much longer period of
688	   time (for example 2 hours, 1 day).

690	   Therefore, to avoid VPN breach, before enabling BGP LLGR, SPs need
691	   to check how fast a given label can be reused by a PE, taking into
692	   account:

694	   o  The load of the BGP route churn on a PE (in terms of the number of
695	      VPN labels advertised and the churn rate).

697	   o  The label allocation policy on the PE, possibly depending upon the
698	      size of the pool of VPN labels (which can be restricted by
699	      hardware considerations or other MPLS usages), the label
700	      allocation scheme (for example per route or per VRF/CE), and the
701	      re-allocation policy (for example least recently used label...).
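
   One way to reason about the check above is to compare how long a
   stale VPN route may remain in use against how soon a freed label
   could be handed out again.  The sketch below is only a
   back-of-the-envelope model: it assumes a least-recently-used
   allocator drawing from a pool of free labels at a steady rate, which
   is an assumption of this example and not something this document
   specifies.  All names are illustrative.

```python
def max_stale_lifetime(restart_time, llst, propagation_bound):
    """Worst-case seconds an ingress PE may keep using a withdrawn label."""
    return restart_time + llst + propagation_bound


def lru_reuse_delay(free_labels, allocations_per_second):
    """Estimated seconds before an LRU allocator reuses a just-freed label."""
    if allocations_per_second <= 0:
        return float("inf")             # no churn: the label is never reused
    return free_labels / allocations_per_second


def label_reuse_is_safe(restart_time, llst, propagation_bound,
                        free_labels, allocations_per_second):
    return (lru_reuse_delay(free_labels, allocations_per_second)
            >= max_stale_lifetime(restart_time, llst, propagation_bound))
```

   For instance, with a 20 minute Restart Time, a 2 hour LLST and a 5
   minute propagation bound, a pool of 100000 free labels consumed at 10
   allocations per second passes this check, whereas the same pool with
   a 1 day LLST does not.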

703	   Note that [RFC4781], which defines the Graceful Restart Mechanism for
704	   BGP with MPLS, is also applicable to BGP LLGR.

706	   In addition to these considerations, the LLGR mechanism described
707	   within this document is considered to be complex to exploit
708	   maliciously - in order to inject packets into a topology, there is a
709	   requirement to engineer a specific LLGR state between two PE devices,
710	   whilst engineering label reallocation to occur in a manner that
711	   results in the two topologies overlapping.  Such allocation is
712	   particularly difficult to engineer (since it is typically an internal
713	   mechanism of an LSR).

715	7.  Examples of Operation

717	   For illustrative purposes, we present a few examples of how this
718	   specification might be used in practice.  These examples are neither
719	   exhaustive nor normative.

721	   Consider the following scenario: A border router, ASBR1, has an IBGP
722	   peering with a route reflector, RR1, from which it learns routes.  It
723	   has an EBGP peering with an external peer, EXT, to which it
724	   advertises those routes.  The external peer has advertised the GR and
725	   LLGR Capabilities to ASBR1.  ASBR1 is configured to support GR and
726	   LLGR on its sessions with RR1 and EXT.  RR1 advertises a GR Restart
727	   Time of 1 (second) and an LLST of 3600 (seconds):

729	   +----------+--------------------------------------------------------+
730	   | Time     | Event                                                  |
731	   +----------+--------------------------------------------------------+
732	   | t        | ASBR1's IBGP session with RR fails.  ASBR1 retains     |
733	   |          | RR's routes according to the rules of GR [RFC4724]     |
734	   |          |                                                        |
735	   | t+1      | GR Restart Time expires.  ASBR1 transitions RR's       |
736	   |          | routes to long-lived stale by attaching the LLGR_STALE |
737	   |          | community and depreferencing them.  However, since it  |
738	   |          | has no backup routes, it continues to make use of      |
739	   |          | them.  It re-announces them to EXT with the LLGR_STALE |
740	   |          | community attached.                                    |
741	   |          |                                                        |
742	   | t+1+3600 | LLST expires.  ASBR1 removes RR's stale routes from    |
743	   |          | its own RIB and sends BGP updates to withdraw them     |
744	   |          | from EXT.                                              |
745	   +----------+--------------------------------------------------------+

747	   Next, imagine the same scenario but suppose RR1 advertised a GR
748	   Restart Time of zero, effectively disabling GR.  Equally, ASBR1 could
749	   have used local configuration to override RR1's offered Restart Time,
750	   setting it to a locally-configured value of zero:

752	   +----------+--------------------------------------------------------+
753	   | Time     | Event                                                  |
754	   +----------+--------------------------------------------------------+
755	   | t        | ASBR1's IBGP session with RR fails.  ASBR1 transitions |
756	   |          | RR's routes to long-lived stale by attaching the       |
757	   |          | LLGR_STALE community and depreferencing them.          |
758	   |          | However, since it has no backup routes, it continues   |
759	   |          | to make use of them.  It re-announces them to EXT with |
760	   |          | the LLGR_STALE community attached.                     |
761	   |          |                                                        |
762	   | t+0+3600 | LLST expires.  ASBR1 removes RR's stale routes from    |
763	   |          | its own RIB and sends BGP updates to withdraw them     |
764	   |          | from EXT.                                              |
765	   +----------+--------------------------------------------------------+

767	   Next, imagine the original scenario, but consider that the ASBR1-RR1
768	   session comes back up and becomes synchronized 180 seconds after the
769	   failure was detected:

771	   +---------+---------------------------------------------------------+
772	   | Time    | Event                                                   |
773	   +---------+---------------------------------------------------------+
774	   | t       | ASBR1's IBGP session with RR fails.  ASBR1 retains RR's |
775	   |         | routes according to the rules of GR [RFC4724]           |
776	   |         |                                                         |
777	   | t+1     | GR Restart Time expires.  ASBR1 transitions RR's routes |
778	   |         | to long-lived stale by attaching the LLGR_STALE         |
779	   |         | community and depreferencing them.  However, since it   |
780	   |         | has no backup routes, it continues to make use of them. |
781	   |         | It re-announces them to EXT with the LLGR_STALE         |
782	   |         | community attached.                                     |
783	   |         |                                                         |
784	   | t+1+179 | Session is reestablished and resynchronized.  ASBR1     |
785	   |         | removes the LLGR_STALE community from RR1's routes and  |
786	   |         | re-announces them to EXT with the LLGR_STALE community  |
787	   |         | removed.                                                |
788	   +---------+---------------------------------------------------------+

790	   Finally, imagine the original scenario, but consider that EXT has not
791	   advertised the LLGR Capability to ASBR1:

793	   +----------+--------------------------------------------------------+
794	   | Time     | Event                                                  |
795	   +----------+--------------------------------------------------------+
796	   | t        | ASBR1's IBGP session with RR fails.  ASBR1 retains     |
797	   |          | RR's routes according to the rules of GR [RFC4724]     |
798	   |          |                                                        |
799	   | t+1      | GR Restart Time expires.  ASBR1 transitions RR's       |
800	   |          | routes to long-lived stale by attaching the LLGR_STALE |
801	   |          | community and depreferencing them.  However, since it  |
802	   |          | has no backup routes, it continues to make use of      |
803	   |          | them.  It withdraws them from EXT.                     |
804	   |          |                                                        |
805	   | t+1+3600 | LLST expires.  ASBR1 removes RR's stale routes from    |
806	   |          | its own RIB.                                           |
807	   +----------+--------------------------------------------------------+
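
   The four timelines above can be summarized by a small event
   generator.  This is a non-normative sketch: the function name, its
   parameters and the wording of the actions are assumptions of this
   example, and times are expressed in seconds after the session
   failure at time t.

```python
def llgr_timeline(restart_time, llst, ext_supports_llgr=True,
                  reestablished_at=None):
    """Return (seconds_after_failure, action) pairs for ASBR1."""
    events = []
    stale_at = restart_time
    expiry = restart_time + llst
    if restart_time > 0:
        events.append((0, "retain routes under GR"))
    if reestablished_at is not None and reestablished_at < stale_at:
        events.append((reestablished_at, "resynchronized under GR"))
        return events
    if ext_supports_llgr:
        events.append((stale_at, "mark LLGR_STALE, depreference, "
                                 "advertise to EXT with LLGR_STALE"))
    else:
        events.append((stale_at, "mark LLGR_STALE, depreference, "
                                 "withdraw from EXT"))
    if reestablished_at is not None and reestablished_at < expiry:
        events.append((reestablished_at, "resynchronized, clear LLGR_STALE, "
                                         "re-advertise to EXT"))
        return events
    action = "remove stale routes"
    if ext_supports_llgr:
        action += ", withdraw from EXT"   # EXT still holds the stale routes
    events.append((expiry, action))
    return events
```

   Calling llgr_timeline(1, 3600) yields events at 0, 1 and 3601
   seconds, matching the first table; passing reestablished_at=180 or
   ext_supports_llgr=False yields the third and fourth tables
   respectively.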

809	8.  Acknowledgements

811	   We would like to thank Roberto Fragassi, John Medamana, Han Nguyen,
812	   Jeffrey Haas, Nabil Bitar, Nicolai Leymann, Pranav Mehta, Saikat Ray,
813	   Martin Djernaes and Eric Rosen for their valuable inputs and
814	   contributions to the discussions and solutions.

816	9.  Contributors

818	    Clarence Filsfils
819	    Cisco Systems
820	    Brussels  1000
821	    Belgium

823	    Email: cf@cisco.com

825	    Pradosh Mohapatra
826	    Cumulus Networks

828	    Email: pmohapat@cumulusnetworks.com

830	    Yakov Rekhter
831	    Juniper Networks

833	    Email: yakov@juniper.net
834	    Rob Shakir
835	    BT

837	    Email: rob.shakir@bt.com

839	    Adam Simpson
840	    Alcatel-Lucent
841	    600 March Road
842	    Ottawa, Ontario  K2K 2E6
843	    Canada

845	    Email: adam.simpson@alcatel-lucent.com

847	10.  IANA Considerations

849	   This document defines a new BGP capability - Long-lived Graceful
850	   Restart Capability.  The Capability Code needs to be assigned by
851	   IANA.

853	   This document introduces a new BGP community "LLGR_STALE" for marking
854	   the long-lived stale routes, and another community "NO_LLGR" to
855	   indicate that stale routes should not be retained.  These community
856	   values need to be assigned by IANA.

858	11.  References

860	11.1.  Normative References

862	   [I-D.ietf-idr-bgp-gr-notification]
863	              Patel, K., Fernando, R., Scudder, J., and J. Haas,
864	              "Notification Message support for BGP Graceful Restart",
865	              draft-ietf-idr-bgp-gr-notification-01 (work in progress),
866	              April 2013.

868	   [RFC1997]  Chandra, R., Traina, P., and T. Li, "BGP
869	              Communities Attribute", RFC 1997, August 1996.

871	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
872	              Requirement Levels", BCP 14, RFC 2119, March 1997.

874	   [RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
875	              Protocol 4 (BGP-4)", RFC 4271, January 2006.

877	   [RFC4724]  Sangli, S., Chen, E., Fernando, R., Scudder, J., and Y.
878	              Rekhter, "Graceful Restart Mechanism for BGP", RFC 4724,
879	              January 2007.

881	   [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
882	              "Multiprotocol Extensions for BGP-4", RFC 4760,
883	              January 2007.

885	   [RFC5492]  Scudder, J. and R. Chandra, "Capabilities Advertisement
886	              with BGP-4", RFC 5492, February 2009.

888	   [RFC6514]  Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
889	              Encodings and Procedures for Multicast in MPLS/BGP IP
890	              VPNs", RFC 6514, February 2012.

892	11.2.  Informative References

894	   [I-D.ietf-idr-bgp-bestpath-selection-criteria]
895	              Asati, R., "BGP Bestpath Selection Criteria Enhancement",
896	              draft-ietf-idr-bgp-bestpath-selection-criteria-06 (work in
897	              progress), February 2013.

899	   [RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
900	              Networks (VPNs)", RFC 4364, February 2006.

902	   [RFC4761]  Kompella, K. and Y. Rekhter, "Virtual Private LAN Service
903	              (VPLS) Using BGP for Auto-Discovery and Signaling",
904	              RFC 4761, January 2007.

906	   [RFC4781]  Rekhter, Y. and R. Aggarwal, "Graceful Restart Mechanism
907	              for BGP with MPLS", RFC 4781, January 2007.

909	   [RFC5575]  Marques, P., Sheth, N., Raszuk, R., Greene, B., Mauch, J.,
910	              and D. McPherson, "Dissemination of Flow Specification
911	              Rules", RFC 5575, August 2009.

913	Authors' Addresses

915	   James Uttaro
916	   AT&T
917	   200 S. Laurel Avenue
918	   Middletown, NJ  07748
919	   USA

921	   Email: ju1738@att.com
922	   Enke Chen
923	   Cisco Systems
924	   170 W. Tasman Drive
925	   San Jose, CA  95134
926	   USA

928	   Email: enkechen@cisco.com

930	   Bruno Decraene
931	   Orange
932	   38-40 Rue de General Leclerc
933	   92794 Issy Moulineaux  cedex 9
934	   France

936	   Email: bruno.decraene@orange.com

938	   John G. Scudder
939	   Juniper Networks
940	   1194 N. Mathilda Ave
941	   Sunnyvale, CA  94089
942	   USA

944	   Email: jgs@juniper.net