idnits 2.17.1 

draft-ietf-tsvwg-sctp-failover-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RFC4960]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (October 23, 2014) is 3472 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 4960 (Obsoleted by RFC 9260)


     Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         Y. Nishida
3	Internet-Draft                                        GE Global Research
4	Intended status: Standards Track                            P. Natarajan
5	Expires: April 26, 2015                                    Cisco Systems
6	                                                                 A. Caro
7	                                                        BBN Technologies
8	                                                                 P. Amer
9	                                                  University of Delaware
10	                                                              K. Nielsen
11	                                                                Ericsson
12	                                                        October 23, 2014

14	                    Quick Failover Algorithm in SCTP
15	                 draft-ietf-tsvwg-sctp-failover-07.txt

17	Abstract

19	   One of the major advantages of SCTP is that it supports multi-homed
20	   communication.  A multi-homed SCTP end-point has the ability to
21	   withstand network failures by migrating the traffic from an inactive
22	   network to an active one.  However, if the [RFC4960] specified
23	   failover operation is followed there can be a significant delay in
24	   the migration to the active destination addresses, thus severely
25	   reducing the effectiveness of SCTP multi-homed operation.

27	   This memo complements RFC4960 by introducing the Potentially Failed
28	   state and the associated Quick Failover operation to apply during
29	   network failure, and specifies that SCTP senders support this more
30	   efficient failover procedure as an add-on to the [RFC4960] failover
31	   operation.  In addition, the memo complements [RFC4960] by
32	   introducing alternative switchover operation modes for data
33	   transfer path management after a failover.  These operation modes
34	   offer better performance in some network environments.  From the
35	   perspective of this memo, the implementation of the additional
36	   switchover operation modes is considered optional.

39	   The procedures defined require only minimal modifications to the
40	   current specification.  The procedures are sender-side only and do
41	   not impact the SCTP receiver.

43	Status of This Memo

45	   This Internet-Draft is submitted in full conformance with the
46	   provisions of BCP 78 and BCP 79.

48	   Internet-Drafts are working documents of the Internet Engineering
49	   Task Force (IETF).  Note that other groups may also distribute
50	   working documents as Internet-Drafts.  The list of current Internet-
51	   Drafts is at http://datatracker.ietf.org/drafts/current/.

53	   Internet-Drafts are draft documents valid for a maximum of six months
54	   and may be updated, replaced, or obsoleted by other documents at any
55	   time.  It is inappropriate to use Internet-Drafts as reference
56	   material or to cite them other than as "work in progress."

58	   This Internet-Draft will expire on April 26, 2015.

60	Copyright Notice

62	   Copyright (c) 2014 IETF Trust and the persons identified as the
63	   document authors.  All rights reserved.

65	   This document is subject to BCP 78 and the IETF Trust's Legal
66	   Provisions Relating to IETF Documents
67	   (http://trustee.ietf.org/license-info) in effect on the date of
68	   publication of this document.  Please review these documents
69	   carefully, as they describe your rights and restrictions with respect
70	   to this document.  Code Components extracted from this document must
71	   include Simplified BSD License text as described in Section 4.e of
72	   the Trust Legal Provisions and are provided without warranty as
73	   described in the Simplified BSD License.

75	Table of Contents

77	   1.  Introduction
78	   2.  Conventions and Terminology
79	   3.  Issues with the SCTP Path Management
80	   4.  SCTP with Potentially-Failed Destination State (SCTP-PF)
81	     4.1.  SCTP-PF Concept
82	     4.2.  SCTP-PF Algorithm Detail
83	     4.3.  Optional Feature: Permanent Failover
84	   5.  Socket API Considerations
85	     5.1.  Support for the Potentially Failed Path State
86	     5.2.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) Socket
87	           Option
88	     5.3.  Exposing the Potentially Failed Path State
89	           (SCTP_EXPOSE_POTENTIALLY_FAILED_STATE) Socket Option
90	   6.  Security Considerations
91	   7.  IANA Considerations
92	   8.  Proposed Change of Status (to be Deleted before Publication)
93	   9.  References
94	     9.1.  Normative References
95	     9.2.  Informative References

97	   Appendix A.  Discussions of Alternative Approaches
98	     A.1.  Reduce Path.Max.Retrans (PMR)
99	     A.2.  Adjust RTO related parameters
100	   Appendix B.  Discussions for Path Bouncing Effect
101	   Authors' Addresses

103	1.  Introduction

105	   The Stream Control Transmission Protocol (SCTP) as specified in
106	   [RFC4960] supports multihoming at the transport layer -- an SCTP
107	   association can bind to multiple IP addresses at each endpoint.
108	   SCTP's multihoming features include failure detection and failover
109	   procedures to provide network interface redundancy and improved end-
110	   to-end fault tolerance.

112	   In SCTP's current failure detection procedure, the sender must
113	   experience Path.Max.Retrans (PMR) number of consecutive failed
114	   retransmissions on a destination before detecting a path failure.
115	   The sender fails over to an alternate active destination only after
116	   failure detection.  Until detecting the failure, the sender
117	   continues to transmit data on the failed path, which degrades the
118	   SCTP performance.  Concurrent Multipath Transfer (CMT) [IYENGAR06] is
119	   an extension to SCTP and allows the sender to transmit data on
120	   multiple paths simultaneously.  Research [NATARAJAN09] shows that the
121	   current failure detection procedure worsens CMT performance during
122	   failover and can be significantly improved by employing a better
123	   failover algorithm.

125	   This document specifies an alternative failure detection procedure
126	   for SCTP that improves the SCTP performance during a failover.

128	   The operation after a failover also impacts the performance of the
129	   protocol.  With [RFC4960] procedures, SCTP will, after a failover
130	   from the primary path, switch back to use the primary path for data
131	   transfer as soon as this path becomes available.  From a performance
132	   perspective, as confirmed in research [CARO02], such a switchback of
133	   the data transmission path is not optimal in general.  As an optional
134	   alternative to the switchback operation of [RFC4960], this document
135	   specifies that SCTP support the Permanent Failover switchover
136	   procedures proposed by [CARO02].  Additional discussions of
137	   alternative approaches that do not require modifications to
138	   [RFC4960], and of path bouncing effects that might be caused by
139	   frequent switchover, are provided in Appendices A and B.

141	2.  Conventions and Terminology

143	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
144	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
145	   document are to be interpreted as described in [RFC2119].

147	3.  Issues with the SCTP Path Management

149	   This section describes issues in the current SCTP to be fixed by the
150	   approach described in this document.

152	   SCTP can utilize multiple IP addresses for a single SCTP association.
153	   Each SCTP endpoint exchanges the list of its usable addresses during
154	   initial negotiation with its peer.  Then the endpoints select one
155	   address from the peer's list and define this as the primary
156	   destination.  During normal transmission, SCTP sends all user data to
157	   the primary destination.  Also, it sends heartbeat packets to all
158	   idle destinations at a certain interval to check the reachability of
159	   the path.  Idle destinations normally include all non-primary
160	   destinations.

162	   If a sender has multiple active destination addresses, it can
163	   retransmit data to a secondary destination address when the
164	   transmission to the primary times out.

166	   When a sender receives an acknowledgment for DATA or HEARTBEAT chunks
167	   sent to one of the destination addresses, it considers that
168	   destination to be active.  If it fails to receive acknowledgments,
169	   the error counter for the address is incremented.  If the error
170	   counter exceeds the protocol parameter 'Path.Max.Retrans', the SCTP
171	   endpoint considers the address to be inactive.

173	   The failover process of SCTP is initiated when the primary path
174	   becomes inactive (error counter for the primary path exceeds
175	   Path.Max.Retrans).  If the primary path is marked inactive, SCTP
176	   chooses a new destination address from one of the active destinations
177	   and starts using this address to send data.  If the primary path
178	   becomes active again, SCTP uses the primary destination for
179	   subsequent data transmissions and stops using the non-primary one.

181	   One issue with this failover process is that it usually takes a
182	   significant amount of time before SCTP switches to the new
183	   destination.  For example, if the primary path on a multi-homed host
184	   becomes unavailable and the RTO value for the primary path at that
185	   time is around 1 second, it usually takes over 60 seconds before SCTP
186	   starts to use the secondary path.  This is because the recommended
187	   value for Path.Max.Retrans in the standard is 5, which requires 6
188	   consecutive timeouts before failover takes place.  Before SCTP
189	   switches to the secondary address, SCTP keeps trying to send packets
190	   to the primary; only retransmitted packets are sent to the
191	   secondary and can thus reach the receiver.  This slow failover
192	   process can cause significant performance degradation and will not
193	   be acceptable in some situations.
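
   The "over 60 seconds" figure above can be checked with a short
   calculation.  The following is a minimal sketch, not part of the
   specification: it assumes that the retransmission timer doubles after
   each expiry and is capped at an upper bound (60 seconds is used in the
   usage note below).  The function name failover_delay_ms and its
   parameters are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/*
 * Total time, in milliseconds, until the error counter of a path exceeds
 * path_max_retrans.  Assumption: the timer doubles after every expiry
 * and is capped at rto_max_ms.
 */
uint64_t failover_delay_ms(uint32_t initial_rto_ms,
                           unsigned path_max_retrans,
                           uint32_t rto_max_ms)
{
    uint64_t total = 0;
    uint64_t rto = initial_rto_ms;

    if (rto > rto_max_ms)
        rto = rto_max_ms;

    /* The counter must exceed PMR, so PMR + 1 consecutive timeouts occur. */
    for (unsigned i = 0; i <= path_max_retrans; i++) {
        total += rto;          /* wait for the current timer to expire */
        rto *= 2;              /* exponential backoff */
        if (rto > rto_max_ms)
            rto = rto_max_ms;
    }
    return total;
}
```

   With an initial RTO of 1000 ms, Path.Max.Retrans of 5 and a 60000 ms
   cap, the six timeouts last 1 + 2 + 4 + 8 + 16 + 32 seconds, so
   failover_delay_ms(1000, 5, 60000) yields 63000 ms.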

195	   Another issue is that once the primary path is active again, the
196	   traffic is switched back.  This is not optimal in some situations.
197	   This is further discussed in Section 4.3.

199	4.  SCTP with Potentially-Failed Destination State (SCTP-PF)

201	   To address the issues described in Section 3, this section updates
202	   SCTP path management scheme with the Potentially Failed state and
203	   associated Quick Failover operation.  We use the term SCTP-PF to
204	   denote the resulting SCTP path management operation.

206	4.1.  SCTP-PF Concept

208	   SCTP-PF as defined stems from the following two observations about
209	   SCTP's failure detection procedure:

211	   o  To minimize performance impact during failover, the sender should
212	      stop transmitting data to the failed destination as early as
213	      possible.  In the current SCTP path management scheme, the sender
214	      stops transmitting data to a destination only after the
215	      destination is marked Failed (inactive).  Thus, a smaller PMR
216	      value is ideal so that the sender transitions a destination to the
217	      Failed (inactive) state more quickly.

219	   o  Smaller PMR values increase the chances of spurious failure
220	      detection where the sender incorrectly marks a destination as
221	      Failed (inactive) during periods of temporary congestion.  Because
222	      [RFC4960] recommends coupling the PMR value and the
223	      Association.Max.Retrans (AMR) value, such spurious failure
224	      detection risks carrying over into spurious association failure
225	      detection and closure.  Larger PMR values are preferable to avoid
226	      spurious failure detection.

227	   From the above observations it is clear that tuning the PMR value
228	   involves the following tradeoff -- a lower value improves performance
229	   but increases the chances of spurious failure detection, whereas a
230	   higher value degrades performance and reduces spurious failure
231	   detection in a wide range of path conditions.  Thus, tuning the
232	   association's PMR value is an incomplete solution to address
233	   performance impact during failure.

235	   This new method introduces a new "Potentially-Failed" (PF)
236	   destination state in SCTP's path management procedure.  The PF state
237	   was originally proposed to improve CMT performance [NATARAJAN09].
238	   The PF state is an intermediate state between Active and Failed
239	   states.  SCTP's failure detection procedure is modified to include
240	   the PF state.  The new failure detection algorithm assumes that loss
241	   detected by a timeout implies either severe congestion or failure en-
242	   route.  After a number of consecutive timeouts on a path, the sender
243	   is unsure, and marks the corresponding destination as PF.  A PF
244	   destination is not used for data transmission except in special cases
245	   (discussed below).  The new failure detection algorithm requires only
246	   sender-side changes.

248	4.2.  SCTP-PF Algorithm Detail

250	   SCTP PF operation is specified as follows:

252	   1.   The sender maintains a new tunable parameter called
253	        Potentially-Failed.Max.Retrans (PFMR).  The RECOMMENDED value of
254	        PFMR is 0 when Quick Failover is used.  When PFMR is larger than
255	        or equal to PMR, Quick Failover is turned off.

257	   2.   The error counter of an active destination address is
258	        incremented as specified in [RFC4960].  This means that the
259	        error counter of the destination address will be incremented
260	        each time the T3-rtx timer expires, or at times where a
261	        HEARTBEAT sent to an idle, active address is not acknowledged
262	        within an RTO.  When the value in the destination address error
263	        counter exceeds PFMR, the endpoint MUST mark the destination
264	        transport address as PF.

266	   3.   The sender SHOULD avoid data transmission to PF destinations.
267	        When the destinations are all in PF state or some in PF state
268	        and some in inactive state, the sender MUST choose one
269	        destination in PF state and transmit data to this destination.
270	        The sender SHOULD choose the destination in PF state with the
271	        lowest error count (fewest consecutive timeouts) for data
272	        transmission.  When there are multiple PF destinations with the
273	        same error count, the sender SHOULD base the choice among them
274	        on the principles in [RFC4960], section 6.4.1, of choosing the
275	        most divergent source-destination pairs when executing
276	        (potentially consecutive) retransmissions.  This means that the
277	        sender SHOULD attempt to pick the most divergent
278	        source-destination pair from the last source-destination pair
279	        on which data were transmitted or retransmitted.  Rules for
280	        picking the most divergent source-destination pair are an
281	        implementation decision and are not specified within this
282	        document.  A sender may choose to deploy other strategies than
283	        the above when choosing among multiple PF destinations with
284	        equal error count.  In all cases the sender MUST NOT change the
285	        state of the chosen destination and it MUST NOT clear the
286	        destination's error counter as a result of choosing the
287	        destination for data transmission.

290	   4.   Heartbeats SHOULD be sent to PF destination(s) once per RTO.
291	        This means the sender MUST ignore HB.interval for PF
292	        destinations.  If a heartbeat is unanswered, the sender SHOULD
293	        increment the error counter and exponentially back off the RTO
294	        value.  If the error counter is less than PMR, the sender SHOULD
295	        transmit another heartbeat immediately after T3-timer
296	        expiration.  When data is transmitted to a PF destination, the
297	        transmission of heartbeats may be omitted as SACK or T3-rtx
298	        timer expiration can provide equivalent information.  It is
299	        RECOMMENDED that heartbeats be sent to PF destinations
300	        regardless of whether the Path Heartbeat function (Section 8.3
301	        of [RFC4960]) is enabled for the destination address or not.

303	   5.   When the sender receives a heartbeat ACK from a PF destination,
304	        the sender MUST clear the destination's error counter and
305	        transition the PF destination back to Active state.  When the
306	        sender resumes data transmission on the destination it MUST do
307	        this following the prescriptions of Section 7.2 of [RFC4960].

309	   6.   Additional (PMR - PFMR) consecutive timeouts on a PF destination
310	        confirm the path failure, upon which the destination transitions
311	        to the Inactive state.  As described in [RFC4960], the sender
312	        (i) SHOULD notify the ULP about this state transition, and (ii)
313	        transmit heartbeats to the Inactive destination at a lower
314	        frequency as described in Section 8.3 of [RFC4960] (when this
315	        function is enabled for the destination address).

317	   7.   When all destinations are in inactive state (association dormant
318	        state) the sender MUST also choose one destination to transmit
319	        data to.  The sender SHOULD choose the destination in inactive
320	        state with the lowest error count (fewest consecutive timeouts)
321	        for data transmission.  When there are multiple destinations
322	        with the same error count in inactive state, the sender SHOULD
323	        attempt to pick the most divergent source-destination pair from
324	        the last source-destination pair on which data were transmitted
325	        or retransmitted, following [RFC4960].  Rules for picking the
326	        most divergent source-destination pair are an implementation
327	        decision and are not specified within this document.  To
328	        support this selection, a sender SHOULD allow for incrementing
329	        the destination error counters up to some reasonable limit
330	        larger than PMR+1, thus changing the prescriptions of
331	        [RFC4960], section 8.3, in this respect.  The exact limit to
332	        apply is not specified in this document, but it is considered
333	        reasonable for it to be an order of magnitude higher than the
334	        PMR value.  A sender MAY choose to deploy other strategies than
335	        the above.  For example, a sender could choose to prioritize
336	        the last active destination during dormant state.  The strategy
337	        to prioritize the last active destination is optimal when some
338	        paths are permanently inactive, but suboptimal when paths'
339	        instability is transient.  Incrementing the error counters
340	        above PMR+1 is a prerequisite for the error counter values to
341	        guide the path selection in dormant state.  However, by virtue
342	        of the introduction of the Potentially Failed state, one may
343	        deploy higher values of PMR without compromising the efficiency
344	        of the failover operation, which makes the increase of path
345	        error counters above PMR+1 less critical because the dormant
346	        state will be less likely to happen.  The downside of
347	        increasing the PMR value relative to the AMR value, however, is
348	        that the per destination address failure detection and
349	        notification of such to the ULP is thereby weakened.  In all
350	        cases the sender MUST NOT change the state of the chosen
351	        destination and it MUST NOT clear the destination's error
352	        counter as a result of choosing the destination for data
353	        transmission.

355	   8.   ACKs for chunks that have been transmitted to multiple
356	        destinations (i.e., a chunk which has been retransmitted to a
357	        different destination than the destination to which the chunk
358	        was first transmitted) SHOULD NOT clear the error count of an
359	        inactive destination and SHOULD NOT transition a PF destination
360	        back to Active state, since a sender cannot disambiguate whether
361	        the ACK was for the original transmission or the
362	        retransmission(s).  The same ambiguity concerns the related
363	        congestion window growth.  The bytes of a newly acknowledged
364	        chunk which has been transmitted to multiple destinations SHOULD
365	        be considered for contribution to the congestion window growth
366	        towards the destination where the chunk was last sent.  The
367	        contribution of the acked bytes to the window growth is subject
368	        to the prescriptions described in Section 7.2 of [RFC4960] being
369	        fulfilled.  An SCTP sender MAY apply a different approach for
370	        both the error count handling and the congestion control growth
371	        handling based on unequivocal information on which destination
372	        (including multiple destinations) the chunk reached.  This
373	        document makes no reference to what such unequivocal
374	        information could consist of, nor how such unequivocal
375	        information could be obtained.  The implementation of such an
376	        alternative approach is left to implementations.

378	   9.   ACKs for chunks which have been transmitted to one destination
379	        address only MUST clear the error counter of the destination
380	        address and MUST transition a PF destination back to Active
381	        state.  This situation can happen when new data is sent to a
382	        destination address in PF state.  It can also happen in
383	        situations where the destination address is in PF state due to
384	        the occurrence of a spurious T3-rtx timer and ACKs start to
385	        arrive for data sent prior to occurrence of the spurious T3-rtx
386	        and data has not yet been retransmitted towards other
387	        destinations.  This document does not specify special handling
388	        for detection of or reaction to spurious T3-rtx timeouts, e.g.,
389	        for special operation vis-a-vis the congestion control handling
390	        or data retransmission operation towards a destination address
391	        which undergoes a transition from active to PF to active state
392	        due to a spurious T3-rtx timeout.  But it is noted that this is
393	        an area which would benefit from additional attention,
394	        experimentation and specification for Single Homed SCTP as well
395	        as for Multi Homed SCTP protocol operation.

397	   10.  An SCTP stack SHOULD provide the ULP with the means to expose
398	        the PF state of its destinations as well as the means to notify
399	        the state transitions from Active to PF, and vice-versa.  When
400	        doing this, the SCTP stack MUST also provide the ULP with the
401	        means to suppress exposure of PF state and association state
402	        transitions.
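
   The state classification and destination selection rules in the steps
   above can be sketched as follows.  This is an illustrative reading of
   steps 2, 3, 6 and 7, not a normative implementation.  The names
   path_state, struct path, classify_path and choose_destination are
   hypothetical.  Two simplifications are assumed: ties between
   destinations with equal error count go to the lowest index instead of
   the most divergent source-destination pair, and no preference is given
   to the primary path among Active destinations.

```c
#include <stddef.h>

/* Ordered from most to least preferred for data transmission. */
enum path_state { PATH_ACTIVE, PATH_PF, PATH_INACTIVE };

struct path {
    unsigned error_count;   /* consecutive timeouts on this destination */
};

/*
 * Classify a destination from its error counter.  A counter above PMR
 * means Inactive, a counter above PFMR means PF, anything else Active.
 * When PFMR >= PMR the PF branch is never reached, which matches
 * "Quick Failover is turned off".
 */
enum path_state classify_path(const struct path *p,
                              unsigned pfmr, unsigned pmr)
{
    if (p->error_count > pmr)
        return PATH_INACTIVE;
    if (p->error_count > pfmr)
        return PATH_PF;
    return PATH_ACTIVE;
}

/*
 * Return the index of the destination to send data to, or -1 if n == 0.
 * An Active destination is preferred, then the PF destination with the
 * lowest error count, then the Inactive destination with the lowest
 * error count.  Selection never changes a state or clears a counter.
 */
int choose_destination(const struct path *paths, size_t n,
                       unsigned pfmr, unsigned pmr)
{
    int best = -1;
    enum path_state best_state = PATH_INACTIVE;

    for (size_t i = 0; i < n; i++) {
        enum path_state s = classify_path(&paths[i], pfmr, pmr);

        if (best < 0 || s < best_state ||
            (s == best_state &&
             paths[i].error_count < paths[best].error_count)) {
            best = (int)i;
            best_state = s;
        }
    }
    return best;
}
```

   For example, with PFMR = 0 and PMR = 5, three destinations with error
   counts 3, 1 and 7 are classified as PF, PF and Inactive, and the
   second one (index 1) is selected.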

404	4.3.  Optional Feature: Permanent Failover

406	   In [RFC4960], an SCTP sender migrates the traffic back to the
407	   original primary destination once this destination becomes active
408	   again.  As the CWND towards the original primary destination has to
409	   be rebuilt once data transfer resumes, the switch back to use the
410	   original primary path is not always optimal.  Indeed [CARO02] shows
411	   that the switch back to the original primary may degrade SCTP
412	   performance compared to continuing data transmission on the same
413	   path, especially, but not only, in scenarios where this path's
414	   characteristics are better.  In order to mitigate this performance
415	   degradation, Permanent Failover operation was proposed in [CARO02].
416	   When SCTP changes the destination due to failover, Permanent Failover
417	   operation allows the SCTP sender to continue data transmission on
418	   the new working path even if the old primary destination becomes
419	   active again.  This is achieved by having SCTP perform a switchover
420	   of the primary path to the alternative working path rather than
421	   having SCTP switch back data transfer to the (previous) primary path.

423	   Which switchover operation is optimal in a given scenario depends on
424	   the quality of the set primary path relative to the quality of the
425	   available alternative paths, and on the extent to which the mode of
426	   operation is expected to enforce traffic distribution over a number
427	   of network paths.  For example, load distribution of traffic from
428	   multiple SCTP associations may be enforced by distributing the set
429	   primary paths and using [RFC4960] switchback operation.  However, as
430	   [RFC4960] switchback behavior is suboptimal in certain situations,
431	   especially in scenarios where a number of equally good paths are
432	   available, it is recommended that SCTP also support, as alternative
433	   behavior, the Permanent Failover switchover modes of operation.

436	   The Permanent Failover operation requires only sender side changes.
437	   The details are:

439	   1.  The sender maintains a new tunable parameter, called
440	       Primary.Switchover.Max.Retrans (PSMR).  The PSMR MUST be set
441	       greater than or equal to the PFMR value.  Implementations MUST
442	       reject any other values of PSMR.

444	   2.  When the path error counter on a set primary path exceeds PSMR,
445	       the SCTP implementation MUST autonomously select and set a new
446	       primary path.

448	   3.  The primary path selected by the SCTP implementation MUST be the
449	       path which at the given time would be chosen for data transfer.
450	       A previously failed primary path MAY come in use as data transfer
451	       path as per normal path selection when the present data transfer
452	       path fails.

454	   4.  The recommended value of PSMR is PFMR when Permanent Failover is
455	       used.  This means that no forced switchback to a previously
456	       failed primary path is performed.  An implementation of Permanent
457	       Failover MUST support the setting of PSMR = PFMR.  An
458	       implementation of Permanent Failover MAY support setting of PSMR
459	       > PFMR.

461	   5.  It MUST be possible to disable the Permanent Failover and obtain
462	       the standard switchback operation of [RFC4960].
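
   The Permanent Failover rules above can be sketched as follows.  This
   is a minimal illustration under stated assumptions, not a normative
   implementation.  The function names psmr_is_valid and update_primary
   are hypothetical, and the caller is assumed to already know which path
   would currently be chosen for data transfer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Step 1: PSMR values below PFMR are rejected. */
bool psmr_is_valid(unsigned psmr, unsigned pfmr)
{
    return psmr >= pfmr;
}

/*
 * Steps 2, 3 and 5: return the index of the primary path after a
 * timeout has been processed.  If Permanent Failover is enabled and the
 * error counter of the set primary exceeds PSMR, the path currently
 * chosen for data transfer becomes the new primary.  Otherwise the set
 * primary is kept, so standard switchback still applies.
 */
size_t update_primary(size_t primary, unsigned primary_error_count,
                      unsigned psmr, size_t data_path,
                      bool permanent_failover_enabled)
{
    if (permanent_failover_enabled && primary_error_count > psmr)
        return data_path;
    return primary;
}
```

   With PSMR = PFMR = 0, a single timeout on primary path 0 while data
   is being sent on path 2 makes update_primary(0, 1, 0, 2, true) return
   2.  With Permanent Failover disabled, the same call returns 0.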

464	   This specification RECOMMENDS a default configuration that uses
465	   standard RFC4960 switchback, i.e., switch back to the old primary
466	   destination once the destination becomes active again.  However, to
467	   support optimal operation in a wider range of network scenarios, an
468	   implementation MAY implement Permanent Failover operation as detailed
469	   above and MAY enable it based on network configurations or users'
470	   requests.

472	5.  Socket API Considerations

474	   This section describes how the socket API defined in [RFC6458] is
475	   extended to provide a way for the application to control and observe
476	   the quick failover behavior.

478	   Please note that this section is informational only.

480	   A socket API implementation based on [RFC6458] is extended in two
481	   ways.  The existing SCTP_PEER_ADDR_CHANGE event provides a
482	   notification when a peer address enters or leaves the potentially
483	   failed state, and the existing SCTP_GET_PEER_ADDR_INFO socket option
484	   exposes the potentially failed state of a peer address in the
485	   structure it returns.

487	   Furthermore, two new read/write socket options for the level
488	   IPPROTO_SCTP with the names SCTP_PEER_ADDR_THLDS and
489	   SCTP_EXPOSE_POTENTIALLY_FAILED_STATE are defined as described below.
490	   The first socket option is used to control the values of the PFMR and
491	   PSMR parameters described in Section 4.  The second one controls the
492	   exposure of the potentially failed path state.

494	   Support for the SCTP_PEER_ADDR_THLDS and
495	   SCTP_EXPOSE_POTENTIALLY_FAILED_STATE socket options also needs to be
496	   added to the function sctp_opt_info().

498	5.1.  Support for the Potentially Failed Path State

500	   As defined in [RFC6458], the SCTP_PEER_ADDR_CHANGE event is provided
501	   if the status of a peer address changes.  In addition to the state
502	   changes described in [RFC6458], this event is also provided if a
503	   peer address enters or leaves the potentially failed state.  The
504	   notification as defined in [RFC6458] uses the following structure:

506	   struct sctp_paddr_change {
507	     uint16_t spc_type;
508	     uint16_t spc_flags;
509	     uint32_t spc_length;
510	     struct sockaddr_storage spc_aaddr;
511	     uint32_t spc_state;
512	     uint32_t spc_error;
513	     sctp_assoc_t spc_assoc_id;
514	   };

516	   [RFC6458] defines the constants SCTP_ADDR_AVAILABLE,
517	   SCTP_ADDR_UNREACHABLE, SCTP_ADDR_REMOVED, SCTP_ADDR_ADDED, and
518	   SCTP_ADDR_MADE_PRIM to be provided in the spc_state field.  In
519	   addition, this document defines the new constant
520	   SCTP_ADDR_POTENTIALLY_FAILED, which is reported if the affected
521	   address becomes potentially failed.

523	   The SCTP_GET_PEER_ADDR_INFO socket option defined in [RFC6458] can be
524	   used to query the state of a peer address.  It uses the following
525	   structure:

527	   struct sctp_paddrinfo {
528	     sctp_assoc_t spinfo_assoc_id;
529	     struct sockaddr_storage spinfo_address;
530	     int32_t spinfo_state;
531	     uint32_t spinfo_cwnd;
532	     uint32_t spinfo_srtt;
533	     uint32_t spinfo_rto;
534	     uint32_t spinfo_mtu;
535	   };

537	   [RFC6458] defines the constants SCTP_UNCONFIRMED, SCTP_ACTIVE, and
538	   SCTP_INACTIVE to be provided in the spinfo_state field.  In
539	   addition, this document defines the new constant
540	   SCTP_POTENTIALLY_FAILED, which is reported if the peer address is
541	   potentially failed.

543	5.2.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) Socket Option

545	   Applications can control the quick failover behavior by getting or
546	   setting the number of consecutive timeouts before a peer address is
547	   considered potentially failed or unreachable and before the primary
548	   path is changed automatically.  This socket option uses the level
549	   IPPROTO_SCTP and the name SCTP_PEER_ADDR_THLDS.

551	   The following structure is used to access and modify the thresholds:

553	   struct sctp_paddrthlds {
554	     sctp_assoc_t spt_assoc_id;
555	     struct sockaddr_storage spt_address;
556	     uint16_t spt_pathmaxrxt;
557	     uint16_t spt_pathpfthld;
558	     uint16_t spt_pathcpthld;
559	   };

561	   spt_assoc_id:  This parameter is ignored for one-to-one style
562	      sockets.  For one-to-many style sockets the application may fill
563	      in an association identifier or SCTP_FUTURE_ASSOC.  It is an error
564	      to use SCTP_{CURRENT|ALL}_ASSOC in spt_assoc_id.

566	   spt_address:  This specifies which peer address is of interest.  If a
567	      wildcard address is provided, this socket option applies to all
568	      current and future peer addresses.

570	   spt_pathmaxrxt:  Each peer address of interest is considered
571	      unreachable if its path error counter exceeds spt_pathmaxrxt.

573	   spt_pathpfthld:  Each peer address of interest is considered
574	      potentially failed if its path error counter exceeds
575	      spt_pathpfthld.

577	   spt_pathcpthld:  Each peer address of interest is no longer
578	      considered the primary remote address if its path error counter
579	      exceeds spt_pathcpthld.  Using a value of 0xffff disables the
580	      selection of a new primary peer address.  If an implementation
581	      does not support the automatic selection of a new primary
582	      address, it should indicate an error with errno set to EINVAL if
583	      a value different from 0xffff is used in spt_pathcpthld.  Setting
584	      spt_pathcpthld < spt_pathpfthld should be rejected with errno set
585	      to EINVAL.  An implementation MAY support only the settings
586	      spt_pathcpthld = spt_pathpfthld and spt_pathcpthld = 0xffff.  In
587	      this case, it shall reject other values with errno set to EINVAL.
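
   The threshold rules above can be sketched in plain C as follows.
   This is only an illustration under stated assumptions: struct
   demo_thlds, the demo_* functions, and the supports_auto_primary flag
   are hypothetical names that mirror the three threshold fields of
   struct sctp_paddrthlds; they are not part of [RFC6458] or of any
   SCTP stack.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the threshold fields of struct
 * sctp_paddrthlds (spt_assoc_id and spt_address are left out). */
struct demo_thlds {
    uint16_t pathmaxrxt;   /* mirrors spt_pathmaxrxt */
    uint16_t pathpfthld;   /* mirrors spt_pathpfthld */
    uint16_t pathcpthld;   /* mirrors spt_pathcpthld */
};

#define DEMO_CPTHLD_DISABLED 0xffff

enum demo_path_state {
    DEMO_PATH_ACTIVE,
    DEMO_PATH_POTENTIALLY_FAILED,
    DEMO_PATH_UNREACHABLE
};

/* Returns 0 if the thresholds are acceptable, EINVAL otherwise.
 * supports_auto_primary is an assumed capability flag telling whether
 * the stack can select a new primary address automatically. */
int demo_check_thlds(const struct demo_thlds *t, int supports_auto_primary)
{
    if (!supports_auto_primary && t->pathcpthld != DEMO_CPTHLD_DISABLED)
        return EINVAL;
    if (t->pathcpthld < t->pathpfthld)
        return EINVAL;
    return 0;
}

/* Classifies a peer address from its path error counter. */
enum demo_path_state demo_classify(const struct demo_thlds *t,
                                   unsigned error_count)
{
    if (error_count > t->pathmaxrxt)
        return DEMO_PATH_UNREACHABLE;
    if (error_count > t->pathpfthld)
        return DEMO_PATH_POTENTIALLY_FAILED;
    return DEMO_PATH_ACTIVE;
}

/* Nonzero if the address should no longer be treated as primary. */
int demo_primary_switch_due(const struct demo_thlds *t,
                            unsigned error_count)
{
    return t->pathcpthld != DEMO_CPTHLD_DISABLED &&
           error_count > t->pathcpthld;
}
```

   With spt_pathmaxrxt = 5 and spt_pathpfthld = 0, for example, a
   single timeout moves the address to the potentially failed state,
   while it becomes unreachable only once the counter reaches 6.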

589	5.3.  Exposing the Potentially Failed Path State
590	      (SCTP_EXPOSE_POTENTIALLY_FAILED_STATE) Socket Option

592	   Applications can control the exposure of the potentially failed path
593	   state in the SCTP_PEER_ADDR_CHANGE event and the
594	   SCTP_GET_PEER_ADDR_INFO socket option as described in Section 5.1.
595	   The default value is implementation specific.

597	   This socket option uses the level IPPROTO_SCTP and the name
598	   SCTP_EXPOSE_POTENTIALLY_FAILED_STATE.

600	   The following structure is used to control the exposure of the
601	   potentially failed path state:

603	   struct sctp_assoc_value {
604	     sctp_assoc_t assoc_id;
605	     uint32_t assoc_value;
606	   };

608	   assoc_id:  This parameter is ignored for one-to-one style sockets.
609	      For one-to-many style sockets the application may fill in an
610	      association identifier or SCTP_FUTURE_ASSOC.  It is an error to
611	      use SCTP_{CURRENT|ALL}_ASSOC in assoc_id.

613	   assoc_value:  The potentially failed path state is exposed if and
614	      only if this parameter is non-zero.
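
   One possible reading of this option is sketched below.  The enum
   values and the function demo_reported_state() are hypothetical
   stand-ins, not constants from [RFC6458].  The sketch also assumes
   that a hidden potentially failed address is reported as active;
   this document does not say what is reported in that case, so that
   mapping is a guess and an implementation may choose differently.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the spinfo_state constants. */
enum demo_state {
    DEMO_UNCONFIRMED,
    DEMO_ACTIVE,
    DEMO_INACTIVE,
    DEMO_POTENTIALLY_FAILED
};

/* Returns the state shown to the application.  assoc_value mirrors
 * the assoc_value field of struct sctp_assoc_value: the potentially
 * failed state is exposed if and only if it is non-zero. */
enum demo_state demo_reported_state(enum demo_state internal,
                                    uint32_t assoc_value)
{
    if (internal == DEMO_POTENTIALLY_FAILED && assoc_value == 0) {
        /* Assumption: a hidden potentially failed path is reported
         * as active.  The document leaves this case open. */
        return DEMO_ACTIVE;
    }
    return internal;
}
```

   All other states pass through unchanged, so an application that
   never enables the option sees only the states defined in [RFC6458].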

616	6.  Security Considerations

618	   Security considerations for the use of SCTP and its APIs are
619	   discussed in [RFC4960] and [RFC6458].  There are no new security
620	   considerations introduced in this document.

622	7.  IANA Considerations

624	   This document does not create any new registries or modify the rules
625	   for any existing registries managed by IANA.

627	8.  Proposed Change of Status (to be Deleted before Publication)

629	   The initial status of this document was Experimental.  However,
630	   because of its usefulness, simple design and the existence of
631	   multiple active implementations, it has been changed to PS by WG
632	   consensus.

634	9.  References

636	9.1.  Normative References

638	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
639	              Requirement Levels", BCP 14, RFC 2119, March 1997.

641	   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol", RFC
642	              4960, September 2007.

644	9.2.  Informative References

646	   [CARO02]   Caro Jr., A., Iyengar, J., Amer, P., Heinz, G., and R.
647	              Stewart, "A Two-level Threshold Recovery Mechanism for
648	              SCTP", Tech report, CIS Dept, University of Delaware,
649	              July 2002.

651	   [CARO04]   Caro Jr., A., Amer, P., and R. Stewart, "End-to-End
652	              Failover Thresholds for Transport Layer Multihoming",
653	              MILCOM 2004, November 2004.

655	   [CARO05]   Caro Jr., A., "End-to-End Fault Tolerance using Transport
656	              Layer Multihoming", Ph.D Thesis, University of Delaware,
657	              January 2005.

659	   [FALLON08]
660	              Fallon, S., Jacob, P., Qiao, Y., Murphy, L., Fallon, E.,
661	              and A. Hanley, "SCTP Switchover Performance Issues in WLAN
662	              Environments", IEEE CCNC 2008, January 2008.

664	   [GRINNEMO04]
665	              Grinnemo, K-J. and A. Brunstrom, "Performance of SCTP-
666	              controlled failovers in M3UA-based SIGTRAN networks",
667	              Advanced Simulation Technologies Conference, April 2004.

669	   [IYENGAR06]
670	              Iyengar, J., Amer, P., and R. Stewart, "Concurrent
671	              Multipath Transfer using SCTP Multihoming over Independent
672	              End-to-end Paths", IEEE/ACM Trans on Networking 14(5),
673	              October 2006.

675	   [JUNGMAIER02]
676	              Jungmaier, A., Rathgeb, E., and M. Tuexen, "On the use of
677	              SCTP in failover scenarios", World Multiconference on
678	              Systemics, Cybernetics and Informatics, July 2002.

680	   [NATARAJAN09]
681	              Natarajan, P., Ekiz, N., Amer, P., and R. Stewart,
682	              "Concurrent Multipath Transfer during Path Failure",
683	              Computer Communications, May 2009.

685	   [RFC6458]  Stewart, R., Tuexen, M., Poon, K., Lei, P., and V.
686	              Yasevich, "Sockets API Extensions for the Stream Control
687	              Transmission Protocol (SCTP)", RFC 6458, December 2011.

689	Appendix A.  Discussions of Alternative Approaches

691	   This section lists alternative approaches for the issues described in
692	   this document.  Although these approaches do not require updating
693	   [RFC4960], we do not recommend them for the reasons described below.

695	A.1.  Reduce Path.Max.Retrans (PMR)

697	   Smaller values for Path.Max.Retrans shorten the failover duration.
698	   In fact, this is recommended in some research results [JUNGMAIER02]
699	   [GRINNEMO04] [FALLON08].  For example, when Path.Max.Retrans=0,
700	   SCTP switches to another destination on a single timeout.  However,
701	   such a small value for Path.Max.Retrans can result in spurious
702	   failover, which might be a problem.

704	   Unlike in SCTP-PF, the interval for heartbeat packets is governed by
705	   'HB.interval' even during the failover process.  'HB.interval' is
706	   usually set on the order of seconds (recommended value is 30
707	   seconds).  When the primary path becomes inactive, the next HB can
708	   be transmitted only seconds later.  Meanwhile, the primary path may
709	   have recovered.  In such situations, post failover, an endpoint is
710	   forced to wait on the order of seconds before the endpoint can resume
711	   transmission on the primary path.  Using a smaller value for
712	   'HB.interval' might help in this situation, but it would waste
713	   bandwidth in most cases.

715	   In addition, smaller Path.Max.Retrans values also affect
716	   'Association.Max.Retrans' values.  When the SCTP association's error
717	   count (sum of error counts on all ACTIVE paths) exceeds the
718	   Association.Max.Retrans threshold, the SCTP sender considers the peer
719	   endpoint unreachable and terminates the association.  Therefore,
720	   Section 8.2 in [RFC4960] recommends that the Association.Max.Retrans
721	   value should not be larger than the summation of the Path.Max.Retrans
722	   of each of the destination addresses, else the SCTP sender considers
723	   its peer reachable even when all destinations are INACTIVE.  To avoid
724	   such inconsistent behavior an SCTP implementation SHOULD reduce
725	   Association.Max.Retrans accordingly whenever it reduces
726	   Path.Max.Retrans.  However, a smaller Association.Max.Retrans value
727	   increases the chance of association termination during minor
728	   congestion events.
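
   The consistency rule from Section 8.2 of [RFC4960] can be checked
   with a few lines of C.  This is a minimal sketch; the function
   names demo_sum_pmr() and demo_amr_consistent() are made up for this
   illustration and do not exist in any SCTP implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of Path.Max.Retrans over all destination addresses. */
unsigned demo_sum_pmr(const uint16_t *pmr, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pmr[i];
    return sum;
}

/* Nonzero if Association.Max.Retrans (amr) is not larger than the
 * sum of Path.Max.Retrans of all destination addresses. */
int demo_amr_consistent(unsigned amr, const uint16_t *pmr, size_t n)
{
    return amr <= demo_sum_pmr(pmr, n);
}
```

   With two destination addresses and the values suggested in
   [RFC4960] (Association.Max.Retrans = 10, Path.Max.Retrans = 5 per
   path), the check passes.  Lowering Path.Max.Retrans to 1 on both
   paths while keeping Association.Max.Retrans at 10 fails the check,
   which is the inconsistency described above.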

730	A.2.  Adjust RTO related parameters

732	   As several research results indicate, we can also shorten the
733	   duration of the failover process by adjusting RTO-related parameters
734	   [JUNGMAIER02] [FALLON08].  During the failover process, the RTO is
735	   doubled on each timeout.  However, by choosing a smaller value for
736	   RTO.max, we can stop the exponential growth of the RTO at some point.
737	   Also, choosing smaller values for RTO.initial or RTO.min can help
738	   keep the RTO value small.
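
   The combined effect of Path.Max.Retrans and RTO.max on the failover
   duration can be estimated with a simplified model.  The sketch
   below assumes that the path fails completely, that the RTO at the
   time of failure is rto_ms, that the RTO doubles on every timeout up
   to RTO.max, and that the path is declared inactive once the error
   counter exceeds Path.Max.Retrans.  The function demo_failover_ms()
   is a hypothetical helper; it ignores heartbeats and RTT
   measurements.

```c
#include <stdint.h>

/* Estimated time in milliseconds until a failed path is declared
 * inactive, i.e. the sum of pmr + 1 consecutive timeouts with the
 * RTO doubling each time and capped at rto_max_ms. */
uint64_t demo_failover_ms(uint32_t rto_ms, uint32_t rto_max_ms,
                          uint16_t pmr)
{
    uint64_t total = 0;
    uint64_t rto = rto_ms;

    if (rto > rto_max_ms)
        rto = rto_max_ms;
    for (unsigned i = 0; i <= pmr; i++) {
        total += rto;          /* wait for this timeout to expire */
        rto *= 2;              /* exponential backoff */
        if (rto > rto_max_ms)
            rto = rto_max_ms;  /* growth stops at RTO.max */
    }
    return total;
}
```

   Under these assumptions, an RTO of 1 second with RTO.max = 60
   seconds and Path.Max.Retrans = 5 yields 1 + 2 + 4 + 8 + 16 + 32 =
   63 seconds.  Lowering RTO.max to 4 seconds gives 19 seconds, and
   Path.Max.Retrans = 0 gives a single timeout of 1 second.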

740	   Similar to reducing Path.Max.Retrans, the advantage of this approach
741	   is that it requires no modification to the current specification,
742	   although it needs to ignore several recommendations described in
743	   Section 15 of [RFC4960].  However, this approach requires sufficient
744	   knowledge about the network characteristics between the endpoints.
745	   Otherwise, it can introduce adverse side effects such as spurious
746	   timeouts.

748	Appendix B.  Discussions for Path Bouncing Effect

750	   The methods described in the document can accelerate the failover
751	   process.  Hence, they might introduce the path bouncing effect where
752	   the sender keeps changing the data transmission path frequently.
753	   This sounds harmful to the data transfer.  However, several research
754	   results indicate that there is no serious problem with SCTP in terms
755	   of the path bouncing effect [CARO04] [CARO05].

757	   There are two main reasons for this.  First, SCTP is basically
758	   designed for multipath communication, which means SCTP maintains all
759	   path-related parameters (CWND, ssthresh, RTT, error count, etc.) per
760	   destination address.  These parameters cannot be affected by path
761	   bouncing.  In addition, when SCTP migrates the data transfer to
762	   another path, it starts with the minimal or the initial CWND.  Hence,
763	   there is little chance for packet reordering or duplication.

765	   Second, even if all communication paths between the end-nodes share
766	   the same bottleneck, the quick failover results in a behavior already
767	   allowed by [RFC4960].

769	Authors' Addresses

771	   Yoshifumi Nishida
772	   GE Global Research
773	   2623 Camino Ramon
774	   San Ramon, CA  94583
775	   USA

777	   Email: nishida@wide.ad.jp

779	   Preethi Natarajan
780	   Cisco Systems
781	   510 McCarthy Blvd
782	   Milpitas, CA  95035
783	   USA

785	   Email: prenatar@cisco.com

787	   Armando Caro
788	   BBN Technologies
789	   10 Moulton St.
790	   Cambridge, MA  02138
791	   USA

793	   Email: acaro@bbn.com

795	   Paul D. Amer
796	   University of Delaware
797	   Computer Science Department - 434 Smith Hall
798	   Newark, DE  19716-2586
799	   USA

801	   Email: amer@udel.edu

802	   Karen E. E. Nielsen
803	   Ericsson
804	   Kistavaegen 25
805	   Stockholm  164 80
806	   Sweden

808	   Email: karen.nielsen@tieto.com