Network Working Group                                           M. Shand
Internet Draft                                             Cisco Systems
Expiration Date: December 2004                                 June 2004

                       IP Fast Reroute Framework

              draft-ietf-rtgwg-ipfrr-framework-01.txt

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   or will be disclosed, and any of which I become aware will be
   disclosed, in accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.  Internet-Drafts are draft documents valid for a
   maximum of six months and may be updated, replaced, or obsoleted by
   other documents at any time.  It is inappropriate to use
   Internet-Drafts as reference material or to cite them other than as
   "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document provides a framework for the development of IP fast
   reroute mechanisms which provide protection against link or router
   failure by invoking locally determined repair paths.  Unlike MPLS
   fast reroute, the mechanisms are applicable to a network employing
   conventional IP routing and forwarding.  An essential part of such
   mechanisms is the prevention of the packet loss caused by the loops
   which normally occur during the re-convergence of the network
   following a failure.

1. Introduction

   When a link or node failure occurs in a routed network, there is
   inevitably a period of disruption to the delivery of traffic until
   the network re-converges on the new topology.  Packets for
   destinations which were previously reached by traversing the failed
   component may be dropped or may suffer looping.  Traditionally such
   disruptions have lasted for periods of at least several seconds,
   and most applications have been constructed to tolerate such a
   quality of service.

   Recent advances in routers have reduced this interval to under a
   second for carefully configured networks using link state IGPs.
   However, new Internet services are emerging which may be sensitive
   to periods of traffic loss that are orders of magnitude shorter
   than this.

   Addressing these issues is difficult, because the distributed
   nature of the network imposes an intrinsic limit on the minimum
   convergence time which can be achieved.

   However, there is an alternative approach: compute backup routes
   that allow the failure to be repaired locally by the router(s)
   detecting the failure, without the immediate need to inform other
   routers of the failure.  In this case, the disruption time can be
   limited to the short time taken to detect the adjacent failure and
   invoke the backup routes.  This is analogous to the technique
   employed by MPLS fast reroute [MPLSFRR], but the mechanisms
   employed for the backup routes in pure IP networks are necessarily
   very different.

   This document provides a framework for the development of this
   approach.

2. Problem Analysis

   The duration of the packet delivery disruption caused by a
   conventional routing transition is determined by a number of
   factors:

   1. The time taken to detect the failure.  This may be of the order
      of a few milliseconds (ms) when the failure can be detected at
      the physical layer, up to several tens of seconds when a
      routing protocol hello is employed.
      During this period, packets will be unavoidably lost.

   2. The time taken for the local router to react to the failure.
      This will typically involve generating and flooding new routing
      updates, perhaps after some hold-down delay, and re-computing
      the router's FIB.

   3. The time taken to pass the information about the failure to
      other routers in the network.  In the absence of routing
      protocol packet loss, this is typically between 10 ms and
      100 ms per hop.

   4. The time taken to re-compute the forwarding tables.  This is
      typically a few ms for a link state protocol using Dijkstra's
      algorithm.

   5. The time taken to load the revised forwarding tables into the
      forwarding hardware.  This time is very implementation
      dependent and also depends on the number of prefixes affected
      by the failure, but may be several hundred ms.

   The disruption will last until the routers adjacent to the failure
   have completed steps 1 and 2, and then all the routers in the
   network whose paths are affected by the failure have completed the
   remaining steps.

   The initial packet loss is caused by the router(s) adjacent to the
   failure continuing to attempt to transmit packets across the
   failure until it is detected.  This loss is unavoidable, but the
   detection time can be reduced to a few tens of ms, as described in
   section 3.1.

   Subsequent packet loss is caused by the "micro-loops" which form
   because of temporary inconsistencies between routers' forwarding
   tables.  These inconsistencies arise because routers update their
   forwarding tables at different times, with the variable delays
   caused by steps 3, 4 and 5 above.  In many routers it is step 5
   which is both the largest factor and which has the greatest
   variance between routers.  The large variance arises from
   implementation differences and from the differing impact that a
   failure has on each individual router.
   For example, the number of prefixes affected by the failure may
   vary dramatically from one router to another.

   In order to achieve packet disruption times which are commensurate
   with the failure detection times, it is necessary to perform two
   distinct tasks:

   1. Provide a mechanism for the router(s) adjacent to the failure
      to rapidly invoke a repair path which is unaffected by any
      subsequent re-convergence.

   2. Provide a mechanism to prevent the effects of micro-loops
      during the subsequent re-convergence.

   Performing the first task without the second will result in the
   repair path being starved of traffic and hence being redundant.
   Performing the second without the first will result in traffic
   being discarded by the router(s) adjacent to the failure.  Both
   tasks are necessary for an effective solution to the problem.

   However, repair paths can be used in isolation where the failure
   is short-lived.  In that case the repair paths can be kept in
   place until the failure is repaired, and there is no need to
   advertise the failure to other routers.

   Similarly, micro-loop avoidance can be used in isolation to
   prevent the loops arising from pre-planned management action.

   Note that micro-loops can also occur when a link or node is
   restored to service; a micro-loop avoidance mechanism is therefore
   required for both the link-up and link-down cases.

3. Mechanisms for IP Fast Reroute

   The set of mechanisms required for an effective solution to the
   problem can be broken down into the following sub-problems.

3.1. Mechanisms for fast failure detection

   It is critical that the failure detection time is minimized.  A
   number of approaches are possible, such as:

   1. Physical detection; for example, loss of light.

   2. Routing protocol independent detection; for example, the
      Bidirectional Forwarding Detection protocol [BFD].

   3. Routing protocol detection; for example, use of "fast hellos".
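   The hello-based approaches in items 2 and 3 share a common timing
   pattern: a failure is declared after a configured multiple of the
   hello interval elapses with no packet received.  The following
   minimal sketch illustrates that pattern only; the class name,
   parameter values, and injected clock are assumptions made for this
   example, not part of any protocol (BFD, for instance, negotiates
   its intervals and detection multiplier in-band).

   ```python
   import time

   class HelloDetector:
       """Illustrative hello-timeout detector: declares failure after
       `mult` hello intervals elapse without receiving a hello."""

       def __init__(self, interval_s=0.05, mult=3, clock=time.monotonic):
           self.interval_s = interval_s   # expected hello interval
           self.mult = mult               # detection multiplier
           self.clock = clock             # injectable for testing
           self.last_heard = clock()

       def on_hello(self):
           # Called whenever a hello packet arrives from the neighbor.
           self.last_heard = self.clock()

       def failed(self):
           # True once the detection time (interval * mult) has expired.
           return (self.clock() - self.last_heard) > self.interval_s * self.mult

   # Deterministic usage with a fake clock:
   now = [0.0]
   det = HelloDetector(interval_s=0.05, mult=3, clock=lambda: now[0])
   now[0] = 0.10; det.on_hello()   # hello received at t = 0.10 s
   now[0] = 0.20                   # 0.10 s elapsed: still up
   now[0] = 0.26                   # 0.16 s elapsed > 0.15 s: declared down
   ```

   With these illustrative values the detection time is 150 ms, which
   corresponds to the "fast hello" end of the range discussed in
   section 2, item 1.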
3.2. Mechanisms for repair paths

   Once a failure has been detected by one of the above mechanisms,
   traffic which previously traversed the failure is transmitted over
   one or more repair paths.  The design of the repair paths should
   be such that they can be pre-calculated in anticipation of each
   local failure and made available for invocation with minimal
   delay.  There are three basic categories of repair paths:

   1. Equal cost multiple paths (ECMP).  Where such paths exist, and
      one or more of the alternate paths do not traverse the failure,
      they may trivially be used as repair paths.

   2. Downstream paths (also known as "loop-free feasible
      alternates").  Such a path exists when a direct neighbor of the
      router adjacent to the failure has a path to the destination
      which can be guaranteed not to traverse the failure.

   3. Multi-hop repair paths.  When there is no feasible downstream
      path, it may still be possible to locate a router, more than
      one hop away from the router adjacent to the failure, from
      which traffic will be forwarded to the destination without
      traversing the failure.

   ECMP and downstream paths offer the simplest repair paths and
   would normally be used when they are available.  It is anticipated
   that around 80% of failures (see section 3.2.2) can be repaired
   using these alone.

   Multi-hop repair paths are considerably more complex, both in the
   computations required to determine their existence and in the
   mechanisms required to invoke them.  They can be further
   classified as:

   1. Mechanisms where one or more alternate FIBs are pre-computed in
      all routers, and the repaired packet is forwarded using a
      "repair FIB" by some method of signaling, such as detecting a
      "U-turn" [U-TURNS] or marking the packet.

   2. Mechanisms functionally equivalent to a loose source route
      which is invoked using the normal FIB.
      These include tunnels [TUNNELS] and label-based mechanisms.

   In many cases a repair path which reaches two hops away from the
   router detecting the failure will suffice, and it is anticipated
   that around 98% of failures (see section 3.2.2) can be repaired by
   this method.  However, to provide complete repair coverage, some
   use of longer multi-hop repair paths is generally necessary.

3.2.1. Scope of repair paths

   A particular repair path may be valid for all destinations which
   require repair, or may only be valid for a subset of destinations.
   If a repair path is valid for a node immediately downstream of the
   failure, then it will be valid for all destinations previously
   reachable by traversing the failure.  However, in cases where such
   a repair path is difficult to achieve because it requires a high
   order multi-hop repair path, it may still be possible to identify
   lower order repair paths (possibly even downstream paths) which
   allow the majority of destinations to be repaired.  When IPFRR is
   unable to provide complete repair, it is desirable that the extent
   of the repair coverage can be determined and reported via network
   management.

   There is a tradeoff between minimizing the number of repair paths
   to be computed, and minimizing the overheads incurred in using
   higher order multi-hop repair paths for destinations for which
   they are not strictly necessary.  However, the computational cost
   of determining repair paths on an individual destination basis can
   be very high.

   The use of repair paths may result in excessive traffic passing
   over a link, resulting in congestion discard.  This reduces the
   effectiveness of IPFRR.  Mechanisms to influence the distribution
   of repaired traffic so as to minimize this effect are therefore
   desirable.

3.2.2. Analysis of repair coverage

   In some cases the repair strategy will permit the repair of all
   single link or node failures in the network for all possible
   destinations.  This can be defined as 100% coverage.  However,
   where the coverage is less than 100%, it is important for the
   purposes of comparison between different proposed repair
   strategies to define what is meant by such a percentage.  There
   are three possibilities:

   1. The percentage of links (or nodes) which can be fully protected
      for all destinations.  This is appropriate where the
      requirement is to protect all traffic, but some percentage of
      the possible failures may be identified as being
      un-protectable.

   2. The percentage of destinations which can be fully protected for
      all link (or node) failures.  This is appropriate where the
      requirement is to protect against all possible failures, but
      some percentage of destinations may be identified as being
      un-protectable.

   3. For all destinations (d) and for all failures (f), the
      percentage of the total potential failure cases (d*f) which are
      protected.  This is appropriate where the requirement is an
      overall "best effort" protection.

   The coverage obtained is dependent on the repair strategy and
   highly dependent on the detailed topology and metrics.  Any
   figures quoted in this document are for illustrative purposes
   only.

3.2.3. Link or node repair

   A repair path may be computed to protect against failure of an
   adjacent link, or failure of an adjacent node.  In general, link
   protection is simpler to achieve.  A repair which protects against
   node failure will also protect against link failure for all
   destinations except those for which the adjacent node is a single
   point of failure.

   In some cases it may be necessary to distinguish between a link
   and a node failure in order that the optimal repair strategy is
   invoked.
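   The downstream-path condition of section 3.2, and the distinction
   between link- and node-protecting repairs, can be made concrete
   with a small sketch.  The topology, metrics, and helper names
   below are assumptions made for illustration only; the conditions
   checked are the downstream criterion dist(N,D) < dist(S,D) and the
   node-protection criterion dist(N,D) < dist(N,E) + dist(E,D), where
   S is the repairing router, E its (failed) next-hop node, and N a
   candidate neighbor of S.

   ```python
   import heapq

   def dijkstra(graph, src):
       """Shortest-path distances from src over an undirected weighted graph."""
       dist = {src: 0}
       pq = [(0, src)]
       while pq:
           d, u = heapq.heappop(pq)
           if d > dist.get(u, float("inf")):
               continue
           for v, cost in graph[u].items():
               nd = d + cost
               if nd < dist.get(v, float("inf")):
                   dist[v] = nd
                   heapq.heappush(pq, (nd, v))
       return dist

   # Illustrative topology (undirected link metrics); S reaches D via E.
   edges = [("S", "E", 1), ("E", "D", 1), ("S", "N", 2), ("N", "D", 1),
            ("S", "M", 1), ("M", "E", 1), ("M", "D", 5)]
   graph = {}
   for a, b, c in edges:
       graph.setdefault(a, {})[b] = c
       graph.setdefault(b, {})[a] = c

   dist = {n: dijkstra(graph, n) for n in graph}

   def is_downstream(s, n, d):
       # Neighbor n offers a downstream path to d: strictly closer than s.
       return dist[n][d] < dist[s][d]

   def is_node_protecting(n, e, d):
       # n's shortest path to d cannot traverse the failed node e.
       return dist[n][d] < dist[n][e] + dist[e][d]
   ```

   With these metrics, neighbor N is a downstream (and
   node-protecting) alternate of S for destination D, while M
   satisfies only the weaker loop-free condition
   dist(M,D) < dist(M,S) + dist(S,D) and its shortest path may
   traverse E, so it link-protects but does not node-protect.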
   Methods for link/node failure determination may be based on
   techniques such as BFD [BFD].  This determination may be made
   prior to invoking any repairs, but this will increase the period
   of packet loss following a failure unless the determination can be
   performed as part of the failure detection mechanism itself.
   Alternatively, a subsequent determination can be used to optimize
   an already invoked default strategy.

3.2.4. Maintenance of repair paths

   In order to meet the response time goals, it is expected (though
   not required) that repair paths, and their associated FIB entries,
   will be pre-computed and installed ready for invocation when a
   failure is detected.  Following invocation, the repair paths
   remain in effect until they are no longer required.  This will
   normally be when the routing protocol has re-converged on the new
   topology, taking the failure into account, and traffic is no
   longer using the repair paths.

   The repair paths have the property that they are unaffected by any
   topology changes resulting from the failure which caused their
   instantiation.  Therefore there is no need to re-compute them
   during the convergence period.  They may be affected by an
   unrelated simultaneous topology change, but such events are out of
   the scope of this work (see section 3.2.5).

   Once the routing protocol has re-converged, it is necessary for
   all repair paths to take account of the new topology.  Various
   optimizations may permit the efficient identification of repair
   paths which are unaffected by the change and hence do not require
   full re-computation.  Since the new repair paths will not be
   required until the next failure occurs, the re-computation may be
   performed as a background task and be subject to a hold-down, but
   excessive delay in completing this operation will increase the
   risk of a new failure occurring before the repair paths are in
   place.

3.2.5. Multiple failures and Shared Risk Groups

   Complete protection against multiple unrelated failures is out of
   the scope of this work.  However, it is important that the
   occurrence of a second failure while one failure is undergoing
   repair should not result in a level of service which is
   significantly worse than that which would have been achieved in
   the absence of any repair strategy.

   Shared Risk Groups are an example of multiple related failures,
   and their protection is a matter for further study.

   One specific example of an SRLG which is clearly within the scope
   of this work is a node failure.  This causes the simultaneous
   failure of multiple links, but their closely defined topological
   relationship makes the problem more tractable.

3.3. Mechanisms for micro-loop prevention

   Control of micro-loops is important not only because they can
   cause packet loss in traffic which is affected by the failure, but
   also because, by saturating a link with looping packets, they can
   cause congestion loss of traffic flowing over that link which
   would otherwise be unaffected by the failure.

   A number of solutions to the problem of micro-loop formation have
   been proposed.  The following factors are significant in their
   classification:

   1. Partial or complete protection against micro-loops.

   2. Delay imposed upon convergence.

   3. Tolerance of multiple failures (from node failures, and in
      general).

   4. Computational complexity (pre-computed or real time).

   5. Applicability to scheduled events.

   6. Applicability to link/node reinstatement.

4. Management Considerations

   While many of the management requirements will be specific to
   particular IPFRR solutions, the following general aspects need to
   be addressed:

   1. Configuration

      a. Enabling/disabling IPFRR support.

      b. Enabling/disabling protection on a per link/node basis.

      c.
         Expressing preferences regarding the links/nodes used for
         repair paths.

      d. Configuration of failure detection mechanisms.

      e. Configuration of loop avoidance strategies.

   2. Monitoring

      a. Notification of links/nodes/destinations which cannot be
         protected.

      b. Notification of pre-computed repair paths and anticipated
         traffic patterns.

      c. Counts of failure detections, protection invocations, and
         packets forwarded over repair paths.

5. Scope and applicability

   Link state protocols provide ubiquitous topology information,
   which facilitates the computation of repair paths.  Therefore the
   initial scope of this work is in the context of link state IGPs.

   Provision of similar facilities in non-link-state IGPs and BGP is
   a matter for further study, but the correct operation of the
   repair mechanisms for traffic with a destination outside the IGP
   domain is an important consideration for solutions based on this
   framework.

6. IANA considerations

   There are no IANA considerations that arise from this framework
   document.

7. Security Considerations

   This framework document does not itself introduce any security
   issues, but attention must be paid to the security implications of
   any proposed solutions to the problem.

8. IPR Disclosure Acknowledgement

   Certain IPR may be applicable to the mechanisms outlined in this
   document.  Please check the detailed specifications for possible
   IPR notices.

9. Normative References

   Internet-Drafts are works in progress, available from
   http://www.ietf.org/internet-drafts/

10. Informative References

   Internet-Drafts are works in progress, available from
   http://www.ietf.org/internet-drafts/

   BFD       Katz, D. and Ward, D., "Bidirectional Forwarding
             Detection", draft-katz-ward-bfd-02.txt, (work in
             progress).

   MPLSFRR   Pan, P., et al., "Fast Reroute Extensions to RSVP-TE
             for LSP Tunnels",
             draft-ietf-mpls-rsvp-lsp-fastreroute-05.txt, (work in
             progress).

   TUNNELS   Bryant, S., et al., "IP Fast Reroute using tunnels",
             draft-bryant-ipfrr-tunnels-00.txt, (work in progress).

   U-TURNS   Atlas, A., et al., "IP/LDP Local Protection",
             draft-atlas-ip-local-protect-00.txt, (work in
             progress).

11. Author's Address

   Mike Shand
   Cisco Systems
   250 Longwater Avenue
   Green Park
   Reading, RG2 6GB
   United Kingdom

   Email: mshand@cisco.com

Full copyright statement

   Copyright (C) The Internet Society (2004).  This document is
   subject to the rights, licenses and restrictions contained in
   BCP 78, and except as set forth therein, the authors retain all
   their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
   ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
   PARTICULAR PURPOSE.