idnits 2.17.1 

draft-ietf-tsvwg-sctp-failover-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RFC4960]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (December 24, 2014) is 3411 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 4960 (Obsoleted by RFC 9260)


     Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         Y. Nishida
3	Internet-Draft                                        GE Global Research
4	Intended status: Standards Track                            P. Natarajan
5	Expires: June 27, 2015                                     Cisco Systems
6	                                                                 A. Caro
7	                                                        BBN Technologies
8	                                                                 P. Amer
9	                                                  University of Delaware
10	                                                              K. Nielsen
11	                                                                Ericsson
12	                                                       December 24, 2014

14	               SCTP-PF: Quick Failover Algorithm in SCTP
15	                 draft-ietf-tsvwg-sctp-failover-09.txt

17	Abstract

19	   One of the major advantages of SCTP is the support of multi-homed
20	   communication.  A multi-homed SCTP end-point has the ability to
21	   withstand network failures by migrating the traffic from an inactive
22	   network to an active one.  However, if the failover operation as
23	   specified in [RFC4960] is followed, there can be a significant delay
24	   in the migration to the active destination addresses, thus severely
25	   reducing the effectiveness of the SCTP failover operation.

27	   This memo complements [RFC4960] by the introduction of the
28	   Potentially Failed path state and the associated new failover
29	   operation called SCTP-PF to apply during a network failure.  In
30	   addition, the memo complements [RFC4960] by introducing
31	   alternative switchover operation modes for the data transfer path
32	   management after the recovery of a failed primary path.  These modes
33	   offer better performance in some network
34	   environments.  The implementation of the additional switchover
35	   operation modes is optional.

37	   The procedures defined in the document require only minimal
38	   modifications to the current specification.  The procedures are
39	   sender-side only and do not impact the SCTP receiver.

41	Status of This Memo

43	   This Internet-Draft is submitted in full conformance with the
44	   provisions of BCP 78 and BCP 79.

46	   Internet-Drafts are working documents of the Internet Engineering
47	   Task Force (IETF).  Note that other groups may also distribute
48	   working documents as Internet-Drafts.  The list of current Internet-
49	   Drafts is at http://datatracker.ietf.org/drafts/current/.

51	   Internet-Drafts are draft documents valid for a maximum of six months
52	   and may be updated, replaced, or obsoleted by other documents at any
53	   time.  It is inappropriate to use Internet-Drafts as reference
54	   material or to cite them other than as "work in progress."

56	   This Internet-Draft will expire on June 27, 2015.

58	Copyright Notice

60	   Copyright (c) 2014 IETF Trust and the persons identified as the
61	   document authors.  All rights reserved.

63	   This document is subject to BCP 78 and the IETF Trust's Legal
64	   Provisions Relating to IETF Documents
65	   (http://trustee.ietf.org/license-info) in effect on the date of
66	   publication of this document.  Please review these documents
67	   carefully, as they describe your rights and restrictions with respect
68	   to this document.  Code Components extracted from this document must
69	   include Simplified BSD License text as described in Section 4.e of
70	   the Trust Legal Provisions and are provided without warranty as
71	   described in the Simplified BSD License.

73	Table of Contents

75	   1.  Introduction
76	   2.  Conventions and Terminology
77	   3.  Issues with the SCTP Path Management
78	   4.  SCTP with Potentially-Failed Destination State (SCTP-PF)
79	     4.1.  SCTP-PF Concept
80	     4.2.  SCTP-PF Algorithm in Detail
81	     4.3.  Optional Feature: Permanent Failover
82	   5.  Socket API Considerations
83	     5.1.  Support for the Potentially Failed Path State
84	     5.2.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) Socket
85	           Option
86	     5.3.  Exposing the Potentially Failed Path State
87	           (SCTP_EXPOSE_POTENTIALLY_FAILED_STATE) Socket Option
88	   6.  Security Considerations
89	   7.  IANA Considerations
90	   8.  Proposed Change of Status (to be Deleted before Publication)
91	   9.  References
92	     9.1.  Normative References
93	     9.2.  Informative References
94	   Appendix A.  Discussions of Alternative Approaches
95	     A.1.  Reduce Path.Max.Retrans (PMR)
96	     A.2.  Adjust RTO related parameters
97	   Appendix B.  Discussions for Path Bouncing Effect
98	   Authors' Addresses

100	1.  Introduction

102	   The Stream Control Transmission Protocol (SCTP) as specified in
103	   [RFC4960] supports multihoming at the transport layer -- an SCTP
104	   endpoint can bind to multiple IP addresses.  SCTP's multihoming
105	   features include failure detection and failover procedures to provide
106	   network interface redundancy and improved end-to-end fault tolerance.

108	   In SCTP's current failure detection procedure, the sender must
109	   experience Path.Max.Retrans (PMR) number of consecutive failed timer-
110	   based retransmissions on a destination address before detecting a
111	   path failure.  The sender fails over to an alternate active
112	   destination address only after failure detection.  Until detecting
113	   the failure, the sender continues to transmit data on the failed
114	   path, which degrades the SCTP performance.  Concurrent Multipath
115	   Transfer (CMT) [IYENGAR06] is an extension to SCTP that allows the
116	   sender to transmit data on multiple paths simultaneously.  Research
117	   [NATARAJAN09] shows that the current failure detection procedure
118	   worsens CMT performance during failover and can be significantly
119	   improved by employing a better failover algorithm.

121	   This document specifies an alternative failure detection procedure
122	   for SCTP that improves the SCTP performance during a failover.

124	   The operation after the recovery of a failed path also impacts the
125	   performance of the protocol.  With procedures specified in [RFC4960],
126	   SCTP will, after a failover from the primary path, switch back to the
127	   primary path for data transfer as soon as this path becomes available
128	   again.  From a performance perspective, as confirmed in research
129	   [CARO02], such a switchback of the data transmission path is not
130	   optimal in general.  As an optional alternative to the switchback
131	   operation of [RFC4960], this document specifies the Permanent
132	   Failover procedures proposed by [CARO02].

134	   Additional discussions for alternative approaches that do not require
135	   modifications to [RFC4960] and path bouncing effects that might be
136	   caused by frequent switchover are provided in the Appendices.

138	2.  Conventions and Terminology

140	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
141	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
142	   document are to be interpreted as described in [RFC2119].

144	3.  Issues with the SCTP Path Management

146	   This section describes issues in SCTP as specified in [RFC4960]
147	   that the approach described in this document aims to fix.

149	   An SCTP endpoint can support multiple IP addresses.  Each SCTP
150	   endpoint exchanges the list of its usable addresses during the
151	   initial negotiation with its peer.  Then the endpoints select one
152	   address from the peer's list and use this as the primary destination
153	   address.  During normal transmission, an SCTP endpoint sends all user
154	   data to the primary destination address.  Also, it sends packets
155	   containing a HEARTBEAT chunk to all idle destination addresses at a
156	   certain interval to check the reachability of these destination
157	   addresses.  Idle destination addresses normally include all non-
158	   primary destination addresses.

160	   If a sender has multiple active destination addresses, it can
161	   retransmit data to a non-primary destination address, if the
162	   transmission to the primary times out.

164	   When a sender receives an acknowledgment for DATA or HEARTBEAT chunks
165	   sent to one of the destination addresses, it considers that
166	   destination address to be active and clears the error counter for the
167	   destination address.  If it fails to receive acknowledgments, the
168	   error count for the destination address is increased.  If the error
169	   counter exceeds the tunable protocol parameter Path.Max.Retrans
170	   (PMR), the SCTP endpoint considers the destination address to be
171	   inactive.

173	   The failover process of SCTP is initiated when the primary path
174	   becomes inactive (the error counter for the primary path exceeds
175	   Path.Max.Retrans).  If the primary path is marked inactive, SCTP
176	   chooses a new destination address from one of the active destinations
177	   and starts using this address to send data.  If the primary path
178	   becomes active again, SCTP uses the primary destination address for
179	   subsequent data transmissions and stops using the non-primary one.

181	   One issue with this failover process is that it usually takes a
182	   significant amount of time before SCTP switches to the new
183	   destination address.  For example, if the primary path on a
184	   multi-homed host becomes unavailable and the RTO value for the
185	   primary path at that time is around 1 second, it usually takes over
186	   60 seconds before SCTP starts to use the non-primary path for initial
187	   data transmission.  This is because the recommended value for
188	   Path.Max.Retrans in [RFC4960] is 5, which requires 6 consecutive
189	   timeouts before the failover takes place.  Before SCTP switches to
190	   the non-primary address, SCTP keeps trying to send packets to the
191	   primary address; only retransmitted packets are sent to the non-
192	   primary address and thus can be received by the receiver.  This slow
193	   failover process can cause significant performance degradation and is
194	   not acceptable in some situations.
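
   The 60-second figure follows from the exponential backoff of the
   retransmission timer.  As a rough sketch, assuming the RTO doubles
   after every timeout and is capped at an upper bound (the function
   name, the millisecond units, and the cap parameter are illustrative
   choices, not part of this document), the time spent on the failed
   path before failover can be estimated as follows:

```c
#include <stdint.h>

/*
 * Estimate the time (in milliseconds) a sender keeps using a failed
 * path before the path is marked inactive.
 *
 * Assumptions of this sketch: the RTO doubles after each T3-rtx
 * expiration and is capped at rto_max_ms.  The path becomes inactive
 * when the error counter exceeds pmr, that is, after pmr + 1
 * consecutive timeouts.
 */
uint64_t failover_delay_ms(uint64_t rto_ms, unsigned pmr, uint64_t rto_max_ms)
{
    uint64_t total = 0;

    for (unsigned i = 0; i <= pmr; i++) {
        total += rto_ms;            /* wait for this timeout to fire */
        rto_ms *= 2;                /* exponential backoff */
        if (rto_ms > rto_max_ms)
            rto_ms = rto_max_ms;    /* cap the backed-off RTO */
    }
    return total;
}
```

   With an initial RTO of 1 second and PMR = 5, the six timeouts add up
   to 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds, which matches the "over 60
   seconds" stated above.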

196	   Another issue is that once the primary path becomes active again, the
197	   traffic is switched back.  This is not optimal in some situations.
198	   This is further discussed in Section 4.3.

200	4.  SCTP with Potentially-Failed Destination State (SCTP-PF)

202	   To address the issues described in Section 3, this section extends
203	   the SCTP path management scheme by adding the Potentially Failed
204	   state and the associated failover operation.  We use the term
205	   SCTP-PF to denote the resulting SCTP path management operation.

207	4.1.  SCTP-PF Concept

209	   SCTP-PF as defined stems from the following two observations about
210	   SCTP's failure detection procedure:

212	   o  To minimize the performance impact during failover, the sender
213	      should avoid transmitting data to the failed destination address
214	      as early as possible.  In the current SCTP path management scheme,
215	      the sender stops transmitting data to a destination address
216	      only after the destination is marked Failed (inactive).  Thus, a
217	      smaller PMR value is better because the sender can transition a
218	      destination address to the Failed (inactive) state quicker.

220	   o  Smaller PMR values increase the chances of spurious failure
221	      detection where the sender incorrectly marks a destination address
222	      as Failed (inactive) during periods of temporary congestion.
223	      Because [RFC4960] recommends coupling the PMR value with the
224	      protocol parameter Association.Max.Retrans (AMR) value, such
225	      spurious failure detection risks carrying over to spurious
226	      association failure detection and closure.  Larger PMR values are
227	      preferable to avoid spurious failure detection.

229	   From the above observations it is clear that tuning the PMR value
230	   involves the following tradeoff -- a lower value improves performance
231	   but increases the chances of spurious failure detection, whereas a
232	   higher value degrades performance and reduces spurious failure
233	   detection in a wide range of path conditions.  Thus, tuning the
234	   association's PMR value is an incomplete solution to address the
235	   performance impact during failure.

237	   SCTP-PF defined in this document introduces a new "Potentially-
238	   Failed" (PF) destination state in SCTP's path management procedure.
239	   The PF state was originally proposed to improve CMT performance
241	   [NATARAJAN09].  The PF state is an intermediate state between the
242	   Active and Failed states.  SCTP's failure detection procedure is
243	   modified to include the PF state.  The new failure detection
244	   algorithm assumes that loss detected by a timeout implies either
245	   severe congestion or failure en-route.  After a number of consecutive
246	   timeouts on a path, the sender is unsure, and marks the corresponding
247	   destination address as PF.  A PF destination address is not used for
248	   data transmission except in special cases (discussed below).  The new
249	   failure detection algorithm requires only sender-side changes.

251	4.2.  SCTP-PF Algorithm in Detail

253	   The SCTP-PF operation is specified as follows:

255	   1.   The sender maintains a new tunable parameter called Potentially-
256	        Failed.Max.Retrans (PFMR).  The RECOMMENDED value of PFMR is 0
257	        when SCTP-PF is used.  When PFMR is larger than or equal to PMR,
258	        SCTP-PF is turned off.

260	   2.   The error counter of an active destination address is
261	        incremented as specified in [RFC4960].  This means that the
262	        error counter of the destination address will be incremented
263	        each time the T3-rtx timer expires, or at times where a
264	        HEARTBEAT sent to an idle, active address is not acknowledged
265	        within an RTO.  When the value in the destination address error
266	        counter exceeds PFMR, the endpoint MUST mark the destination
267	        transport address as PF.

269	   3.   The sender SHOULD avoid data transmission to PF destination
270	        addresses.  When the destination addresses are all in PF state
271	        or some in PF state and some in inactive state, the sender MUST
272	        choose one destination address in PF state and transmit data to
273	        this destination.  The sender SHOULD choose the destination
274	        address in PF state with the lowest error count (fewest
275	        consecutive timeouts) for data transmission and transmit data to
276	        this destination.  When there are multiple PF destinations with
277	        the same error count, the sender SHOULD base the choice among
278	        these PF destination addresses on the principles of
279	        [RFC4960], Section 6.4.1, for choosing the most divergent
280	        source-destination pairs when executing (potentially
281	        consecutive) retransmission.  This means that the sender SHOULD
282	        attempt to pick the most divergent source - destination pair
283	        from the last source - destination pair on which data were
284	        transmitted or retransmitted.  Rules for picking the most
285	        divergent source-destination pair are an implementation decision
286	        and are not specified within this document.  A sender may choose
287	        to deploy other strategies than the above when choosing among
288	        multiple PF destinations with equal error count.  In all cases,
289	        the sender MUST NOT change the state of the chosen destination
290	        address and it MUST NOT clear the destination's error counter as
291	        a result of choosing the destination address for data
292	        transmission.

294	   4.   HEARTBEAT chunks SHOULD be sent to PF destination(s) once per
295	        RTO, which requires ignoring HB.interval for PF destinations.
296	        If a HEARTBEAT chunk is not acknowledged, the sender SHOULD
297	        increment the error counter and exponentially back off the RTO
298	        value.  If the error counter is less than PMR, the sender SHOULD
299	        transmit another packet containing a HEARTBEAT chunk immediately
300	        after T3-timer expiration.  When data is transmitted to a PF
301	        destination, the transmission of a HEARTBEAT chunk MAY be omitted
302	        as receipt of SACK chunks or a T3-rtx timer expiration can
303	        provide equivalent information.  It is RECOMMENDED that
304	        HEARTBEAT chunks are sent to PF destinations regardless of
305	        whether the Path Heartbeat function (Section 8.3 of [RFC4960])
306	        is enabled for the destination address or not.

308	   5.   When the sender receives a HEARTBEAT ACK from a PF destination,
309	        the sender MUST clear the destination's error counter and
310	        transition the PF destination address back to Active state.
311	        When the sender resumes data transmission on the destination
312	        address, it MUST do this following the prescriptions of
313	        Section 7.2 of [RFC4960].

315	   6.   Additional (PMR - PFMR) consecutive timeouts on a PF destination
316	        address confirm the path failure, upon which the destination
317	        address transitions to the Inactive state.  As described in
318	        [RFC4960], the sender (i) SHOULD notify ULP about this state
319	        transition, and (ii) transmit HEARTBEAT chunks to the Inactive
320	        destination address at a lower frequency as described in
321	        Section 8.3 of [RFC4960] (when this function is enabled for the
322	        destination address).

324	   7.   When all destinations are in inactive state (association dormant
325	        state) the sender MUST also choose one destination address to
326	        transmit data to.  The sender SHOULD choose the destination
327	        address in inactive state with the lowest error count (fewest
328	        consecutive timeouts) for data transmission and transmit data to
329	        this destination.  When there are multiple destination addresses
330	        with the same error count in inactive state, the sender SHOULD
331	        attempt to pick the most divergent source - destination pair
332	        from the last source - destination pair on which data were
333	        transmitted or retransmitted following [RFC4960].  Rules for
334	        picking the most divergent source-destination pair are an
335	        implementation decision and are not specified within this
336	        document.  To support this selection, a sender SHOULD allow the
337	        destination error counters to be incremented up to some
338	        reasonable limit larger than PMR+1, thus changing the
339	        prescriptions of [RFC4960], Section 8.3, in this respect.  The
340	        exact limit to apply is not specified in this document, but it
341	        is considered reasonable for it to be an order of magnitude
342	        higher than the PMR value.  A sender MAY choose to deploy other
343	        strategies than the above.  For example, a sender could choose
344	        to prioritize the last active destination address during dormant
345	        state.  The strategy to prioritize the last active destination
346	        address is optimal when some paths are permanently inactive, but
347	        suboptimal when paths' instability is transient.  Incrementing
348	        the error counters above PMR+1 is a prerequisite for the error
349	        counter values to guide the path selection in dormant state.
350	        However, with the introduction of the Potentially Failed state,
351	        one may deploy higher values of PMR without compromising the
352	        efficiency of the failover operation, which makes the increase
353	        of path error counters above PMR+1 less critical because the
354	        dormant state will be less likely to happen.  The downside of
355	        increasing the PMR value relative to the AMR value, however, is
356	        that the per-destination address failure detection and the
357	        notification of it to the ULP are weakened.  In all cases the
358	        sender MUST NOT change
359	        the state of the chosen destination address and it MUST NOT
360	        clear the destination's error counter as a result of choosing
361	        the destination address for data transmission.

363	   8.   Acknowledgments for chunks that have been transmitted to
364	        multiple destinations (i.e., a chunk which has been
365	        retransmitted to a different destination address than the
366	        destination address to which the chunk was first transmitted)
367	        SHOULD NOT clear the error count of an inactive destination
368	        address and SHOULD NOT transition a PF destination address back
369	        to Active state, since a sender cannot disambiguate whether the
370	        ACK was for the original transmission or the retransmission(s).
371	        The same ambiguity concerns the related congestion window
372	        growth.  The bytes of a newly acknowledged chunk which has been
373	        transmitted to multiple destination addresses SHOULD be
374	        considered for contribution to the congestion window growth
375	        towards the destination address where the chunk was last sent.
376	        The contribution of the ACKed bytes to the window growth is
377	        subject to the prescriptions described in Section 7.2 of
378	        [RFC4960] being fulfilled.  An SCTP sender MAY apply a different
379	        approach for both the error count handling and the congestion
380	        control growth handling based on unequivocal information on
381	        which destination (including multiple destination addresses) the
382	        chunk reached.  This document makes no reference to what such
383	        unequivocal information could consist of, nor how such
384	        unequivocal information could be obtained.  The implementation
385	        of such an alternative approach is left to implementations.

387	   9.   Acknowledgments for chunks that have been transmitted to one
388	        destination address only MUST clear the error counter of the
389	        destination address and MUST transition a PF destination address
390	        back to Active state.  This situation can happen when new data
391	        is sent to a destination address in PF state.  It can also
392	        happen in situations where the destination address is in PF
393	        state due to the occurrence of a spurious T3-rtx timer and
394	        Acknowledgments start to arrive for data sent prior to
395	        occurrence of the spurious T3-rtx and data has not yet been
396	        retransmitted towards other destinations.  This document does
397	        not specify special handling for detection of or reaction to
398	        spurious T3-rtx timeouts, e.g., for special operation vis-a-vis
399	        the congestion control handling or data retransmission operation
400	        towards a destination address which undergoes a transition from
401	        active to PF to active state due to a spurious T3-rtx timeout.
402	        But it is noted that this is an area which would benefit from
403	        additional attention, experimentation and specification for
404	        Single Homed SCTP as well as for Multi Homed SCTP protocol
405	        operation.

407	   10.  An SCTP stack SHOULD provide the ULP with the means to expose
408	        the PF state of its destinations as well as the means to notify
409	        the state transitions from Active to PF, and vice-versa.  When
410	        doing this, such an SCTP stack MUST provide the ULP with the
411	        means to suppress exposure of PF state and associated state
412	        transitions as well.
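
   The core of the steps above can be sketched in a few lines of C.
   This is a minimal illustration only: the type and function names are
   hypothetical, HEARTBEAT scheduling and congestion control are left
   out, the upper limit on the error counter from step 7 is not
   modelled, and ties between destinations are broken by array order
   rather than by the most divergent source-destination pair.

```c
#include <stddef.h>

/* Illustrative path states; names are not taken from any SCTP stack.
 * The order matters: lower values are preferred for data transfer. */
enum path_state { PATH_ACTIVE, PATH_PF, PATH_INACTIVE };

struct path {
    enum path_state state;
    unsigned error_count;   /* consecutive timeouts on this destination */
};

/* Steps 2 and 6: a T3-rtx expiration or an unanswered HEARTBEAT.
 * If pfmr >= pmr the PF branch is never reached before the path goes
 * inactive, which corresponds to SCTP-PF being turned off (step 1). */
void path_on_timeout(struct path *p, unsigned pfmr, unsigned pmr)
{
    p->error_count++;
    if (p->error_count > pmr)
        p->state = PATH_INACTIVE;
    else if (p->error_count > pfmr)
        p->state = PATH_PF;
}

/* Steps 5 and 9: a HEARTBEAT ACK, or an acknowledgment for a chunk
 * that was sent to this destination only. */
void path_on_unambiguous_ack(struct path *p)
{
    p->error_count = 0;
    p->state = PATH_ACTIVE;
}

/* Steps 3 and 7: pick a destination for data.  Active paths win over
 * PF paths, PF paths win over inactive ones, and within one state the
 * lowest error count wins.  The choice itself changes neither the
 * state nor the error counter.  Returns NULL only if n == 0. */
struct path *path_select(struct path *paths, size_t n)
{
    struct path *best = NULL;

    for (size_t i = 0; i < n; i++) {
        struct path *p = &paths[i];
        if (best == NULL
            || p->state < best->state
            || (p->state == best->state
                && p->error_count < best->error_count))
            best = p;
    }
    return best;
}
```

   With PFMR = 0 and PMR = 5, the first timeout moves a destination to
   PF and the sixth consecutive timeout moves it to inactive.  The
   sketch does not model the primary path; a real sender would keep
   using the primary among the active destinations.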

414	4.3.  Optional Feature: Permanent Failover

416	   In [RFC4960], an SCTP sender migrates the traffic back to the
417	   original primary destination address once this address becomes active
418	   again.  As the CWND towards the original primary destination address
419	   has to be rebuilt once data transfer resumes, the switch back to use
420	   the original primary address is not always optimal.  Indeed [CARO02]
421	   shows that the switch back to the original primary may degrade SCTP
422	   performance compared to continuing data transmission on the same
423	   path, especially, but not only, in scenarios where this path's
424	   characteristics are better.  In order to mitigate this performance
425	   degradation, the Permanent Failover operation was proposed in
426	   [CARO02].  When SCTP changes the destination address due to failover,
427	   Permanent Failover operation allows the SCTP sender to continue data
428	   transmission on the new working path even when the old primary
429	   destination address becomes active again.  This is achieved by having
430	   SCTP perform a switch over of the primary path to the alternative
431	   working path rather than having SCTP switch back data transfer to the
432	   (previous) primary path.

434	   The manner of switchover operation that is optimal in a given
435	   scenario depends on the relative quality of a set primary path versus
436	   the quality of the alternative paths available, and on the extent to
437	   which the mode of operation is expected to enforce traffic
438	   distribution over a number of network paths.  For example, load
439	   distribution of traffic from multiple SCTP associations may be
440	   enforced by distributing the set primary paths and relying on
441	   [RFC4960] switchback operation.  However, as [RFC4960] switchback
442	   behavior is suboptimal in certain situations, especially in scenarios
443	   where a number of equally good paths are available, it is recommended
444	   that SCTP also support, as alternative behavior, the Permanent
445	   Failover switchover modes of operation.

447	   The Permanent Failover operation requires only sender side changes.
448	   The details are:

450	   1.  The sender maintains a new tunable parameter, called
451	       Primary.Switchover.Max.Retrans (PSMR).  The PSMR MUST be set
452	       greater than or equal to the PFMR value.  Implementations MUST
453	       reject any other values of PSMR.

455	   2.  When the path error counter on a set primary path exceeds PSMR,
456	       the SCTP implementation MUST autonomously select and set a new
457	       primary path.

459	   3.  The primary path selected by the SCTP implementation MUST be the
460	       path which at the given time would be chosen for data transfer.
461	       A previously failed primary path MAY come in use as data transfer
462	       path as per normal path selection when the present data transfer
463	       path fails.

465	   4.  The recommended value of PSMR is PFMR when Permanent Failover is
466	       used.  This means that no forced switchback to a previously
467	       failed primary path is performed.  An implementation of Permanent
468	       Failover MUST support the setting of PSMR = PFMR.  An
469	       implementation of Permanent Failover MAY support setting of PSMR
470	       > PFMR.

472	   5.  It MUST be possible to disable the Permanent Failover and obtain
473	       the standard switchback operation of [RFC4960].
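
   As a hedged illustration of steps 1, 2, 3 and 5, the following C
   sketch validates PSMR against PFMR and moves the primary to the path
   currently used for data once the primary's error counter exceeds
   PSMR.  The structure and function names are hypothetical and are not
   defined by this document or by [RFC4960].

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-association settings for Permanent Failover. */
struct pf_assoc {
    bool     permanent_failover;  /* step 5: the feature can be disabled */
    unsigned pfmr;
    unsigned psmr;
    size_t   primary;             /* index of the set primary path */
};

/* Step 1: reject PSMR values below PFMR.  Returns false on rejection
 * and leaves the stored value untouched. */
bool pf_set_psmr(struct pf_assoc *a, unsigned psmr)
{
    if (psmr < a->pfmr)
        return false;
    a->psmr = psmr;
    return true;
}

/* Steps 2 and 3: call after the primary's error counter has changed.
 * data_path is the index of the path the sender would currently pick
 * for data transfer. */
void pf_maybe_switch_primary(struct pf_assoc *a, unsigned primary_errors,
                             size_t data_path)
{
    if (a->permanent_failover && primary_errors > a->psmr)
        a->primary = data_path;
}
```

   When permanent_failover is false the primary is never changed by
   this code, which leaves the standard switchback behavior in place.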

475	   This specification RECOMMENDS a default configuration that uses
476	   standard RFC4960 switchback, i.e., switch back to the old primary
477	   destination once the destination address becomes active again.
478	   However, to support optimal operation in a wider range of network
479	   scenarios, an implementation MAY implement Permanent Failover
480	   operation as detailed above and MAY enable it based on network
481	   configurations or users' requests.

483	5.  Socket API Considerations

485	   This section describes how the socket API defined in [RFC6458] is
486	   extended to provide a way for the application to control and observe
487	   the SCTP-PF behavior.

489	   Please note that this section is informational only.

491	   A socket API implementation based on [RFC6458] is extended, by means
492	   of the existing SCTP_PEER_ADDR_CHANGE event, to provide an event
493	   notification when a peer address enters or leaves the potentially
494	   failed state.  The socket API implementation is also extended to
495	   expose the potentially failed state of a peer address in the existing
496	   SCTP_GET_PEER_ADDR_INFO structure.

498	   Furthermore, two new read/write socket options for the level
499	   IPPROTO_SCTP with the names SCTP_PEER_ADDR_THLDS and
500	   SCTP_EXPOSE_POTENTIALLY_FAILED_STATE are defined as described below.
501	   The first socket option is used to control the values of the PFMR and
502	   PSMR parameters described in Section 4.  The second one controls the
503	   exposure of the potentially failed path state.

505	   Support for the SCTP_PEER_ADDR_THLDS and
506	   SCTP_EXPOSE_POTENTIALLY_FAILED_STATE socket options also needs to be
507	   added to the function sctp_opt_info().

509	5.1.  Support for the Potentially Failed Path State

511	   As defined in [RFC6458], the SCTP_PEER_ADDR_CHANGE event is provided
512	   if the status of a peer address changes.  In addition to the state
513	   changes described in [RFC6458], this event is also provided if a
514	   peer address enters or leaves the potentially failed state.  The
515	   notification as defined in [RFC6458] uses the following structure:

517	   struct sctp_paddr_change {
518	     uint16_t spc_type;
519	     uint16_t spc_flags;
520	     uint32_t spc_length;
521	     struct sockaddr_storage spc_aaddr;
522	     uint32_t spc_state;
523	     uint32_t spc_error;
524	     sctp_assoc_t spc_assoc_id;
525	   };

527	   [RFC6458] defines the constants SCTP_ADDR_AVAILABLE,
528	   SCTP_ADDR_UNREACHABLE, SCTP_ADDR_REMOVED, SCTP_ADDR_ADDED, and
529	   SCTP_ADDR_MADE_PRIM to be provided in the spc_state field.  In
530	   addition, this document defines the new constant
531	   SCTP_ADDR_POTENTIALLY_FAILED, which is reported if the affected
532	   address becomes potentially failed.
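
   As a non-normative illustration, an application might map the
   spc_state values, including the new one, to log strings as sketched
   below.  The EX_-prefixed enumerators, their numeric values, and the
   function name spc_state_name() are assumptions made for this
   example only; real code would use the SCTP_ADDR_* constants from
   the platform's SCTP header.

```c
#include <stdint.h>

/* Illustrative stand-ins for the spc_state constants. The numeric values
 * are hypothetical; the real ones come from the platform's SCTP header. */
enum example_spc_state {
    EX_ADDR_AVAILABLE = 1,
    EX_ADDR_UNREACHABLE,
    EX_ADDR_REMOVED,
    EX_ADDR_ADDED,
    EX_ADDR_MADE_PRIM,
    EX_ADDR_POTENTIALLY_FAILED   /* new state defined by this document */
};

/* Map a spc_state value to a printable name for logging. */
static const char *spc_state_name(uint32_t spc_state)
{
    switch (spc_state) {
    case EX_ADDR_AVAILABLE:          return "SCTP_ADDR_AVAILABLE";
    case EX_ADDR_UNREACHABLE:        return "SCTP_ADDR_UNREACHABLE";
    case EX_ADDR_REMOVED:            return "SCTP_ADDR_REMOVED";
    case EX_ADDR_ADDED:              return "SCTP_ADDR_ADDED";
    case EX_ADDR_MADE_PRIM:          return "SCTP_ADDR_MADE_PRIM";
    case EX_ADDR_POTENTIALLY_FAILED: return "SCTP_ADDR_POTENTIALLY_FAILED";
    default:                         return "UNKNOWN";
    }
}
```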

534	   The SCTP_GET_PEER_ADDR_INFO socket option defined in [RFC6458] can be
535	   used to query the state of a peer address.  It uses the following
536	   structure:

538	   struct sctp_paddrinfo {
539	     sctp_assoc_t spinfo_assoc_id;
540	     struct sockaddr_storage spinfo_address;
541	     int32_t spinfo_state;
542	     uint32_t spinfo_cwnd;
543	     uint32_t spinfo_srtt;
544	     uint32_t spinfo_rto;
545	     uint32_t spinfo_mtu;
546	   };

548	   [RFC6458] defines the constants SCTP_UNCONFIRMED, SCTP_ACTIVE, and
549	   SCTP_INACTIVE to be provided in the spinfo_state field.  In
550	   addition, this document defines the new constant
551	   SCTP_POTENTIALLY_FAILED, which is reported if the peer address is
552	   potentially failed.

554	5.2.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) Socket Option

556	   Applications can control the SCTP-PF behavior by getting or setting
557	   the number of consecutive timeouts before a peer address is
558	   considered potentially failed or unreachable and before the primary
559	   path is changed automatically.  This socket option uses the level
560	   IPPROTO_SCTP and the name SCTP_PEER_ADDR_THLDS.

562	   The following structure is used to access and modify the thresholds:

564	   struct sctp_paddrthlds {
565	     sctp_assoc_t spt_assoc_id;
566	     struct sockaddr_storage spt_address;
567	     uint16_t spt_pathmaxrxt;
568	     uint16_t spt_pathpfthld;
569	     uint16_t spt_pathcpthld;
570	   };

572	   spt_assoc_id:  This parameter is ignored for one-to-one style
573	      sockets.  For one-to-many style sockets the application may fill
574	      in an association identifier or SCTP_FUTURE_ASSOC.  It is an error
575	      to use SCTP_{CURRENT|ALL}_ASSOC in spt_assoc_id.

577	   spt_address:  This specifies which peer address is of interest.  If a
578	      wildcard address is provided, this socket option applies to all
579	      current and future peer addresses.

581	   spt_pathmaxrxt:  Each peer address of interest is considered
582	      unreachable if its path error counter exceeds spt_pathmaxrxt.

584	   spt_pathpfthld:  Each peer address of interest is considered
585	      potentially failed if its path error counter exceeds
586	      spt_pathpfthld.

588	   spt_pathcpthld:  Each peer address of interest is no longer
589	      considered the primary remote address if its path error counter
590	      exceeds spt_pathcpthld.  Using a value of 0xffff disables the
591	      selection of a new primary peer address.  If an implementation
592	      does not support the automatic selection of a new primary
593	      address, it should indicate an error with errno set to EINVAL if
594	      a value different from 0xffff is used in spt_pathcpthld.  Setting
595	      spt_pathcpthld < spt_pathpfthld should be rejected with errno set
596	      to EINVAL.  An implementation MAY support only the settings
597	      spt_pathcpthld = spt_pathpfthld and spt_pathcpthld = 0xffff.  In
598	      this case, it shall reject other values with errno set to EINVAL.
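
   The threshold semantics above can be sketched in C as follows.
   This is a non-normative illustration: the names classify_path(),
   check_cpthld(), and enum path_state are hypothetical, and the two
   flags supports_auto_primary and restricted are assumptions that
   model the implementation capabilities described for spt_pathcpthld.

```c
#include <errno.h>
#include <stdint.h>

enum path_state { PATH_ACTIVE, PATH_POTENTIALLY_FAILED, PATH_UNREACHABLE };

/* Classify a peer address from its path error counter and thresholds.
 * "Exceeds" is a strict comparison, as in the field descriptions. */
static enum path_state classify_path(uint32_t error_count,
                                     uint16_t pathpfthld,
                                     uint16_t pathmaxrxt)
{
    if (error_count > pathmaxrxt)
        return PATH_UNREACHABLE;
    if (error_count > pathpfthld)
        return PATH_POTENTIALLY_FAILED;
    return PATH_ACTIVE;
}

/* Validate a requested spt_pathcpthld. Returns 0 if acceptable, else EINVAL.
 * supports_auto_primary: implementation can select a new primary address.
 * restricted: implementation only accepts cpthld == pfthld or 0xffff. */
static int check_cpthld(uint16_t pathpfthld, uint16_t pathcpthld,
                        int supports_auto_primary, int restricted)
{
    if (!supports_auto_primary)
        return pathcpthld == 0xffff ? 0 : EINVAL;
    if (pathcpthld < pathpfthld)
        return EINVAL;
    if (restricted && pathcpthld != pathpfthld && pathcpthld != 0xffff)
        return EINVAL;
    return 0;
}
```

   For example, with spt_pathpfthld = 0 and spt_pathmaxrxt = 5, a
   single timeout (error counter 1) yields the potentially failed
   state, and the sixth consecutive timeout yields the unreachable
   state.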

600	5.3.  Exposing the Potentially Failed Path State
601	      (SCTP_EXPOSE_POTENTIALLY_FAILED_STATE) Socket Option

603	   Applications can control the exposure of the potentially failed path
604	   state in the SCTP_PEER_ADDR_CHANGE event and the
605	   SCTP_GET_PEER_ADDR_INFO socket option as described in Section 5.1.
606	   The default value is implementation specific.

608	   This socket option uses the level IPPROTO_SCTP and the name
609	   SCTP_EXPOSE_POTENTIALLY_FAILED_STATE.

611	   The following structure is used to control the exposure of the
612	   potentially failed path state:

614	   struct sctp_assoc_value {
615	     sctp_assoc_t assoc_id;
616	     uint32_t assoc_value;
617	   };

619	   assoc_id:  This parameter is ignored for one-to-one style sockets.
620	      For one-to-many style sockets the application may fill in an
621	      association identifier or SCTP_FUTURE_ASSOC.  It is an error to
622	      use SCTP_{CURRENT|ALL}_ASSOC in assoc_id.

624	   assoc_value:  The potentially failed path state is exposed if and
625	      only if this parameter is non-zero.

627	6.  Security Considerations

629	   Security considerations for the use of SCTP and its APIs are
630	   discussed in [RFC4960] and [RFC6458].  There are no new security
631	   considerations introduced in this document.

633	7.  IANA Considerations

635	   This document does not create any new registries or modify the rules
636	   for any existing registries managed by IANA.

638	8.  Proposed Change of Status (to be Deleted before Publication)

640	   Initially, this work appeared to entail some changes to the
641	   Congestion Control (CC) operation of SCTP, and for this reason the
642	   work was proposed as Experimental.  These intended changes to the CC
643	   operation have since been judged to be irrelevant and are no longer
644	   part of the specification.  As the specification entails no other
645	   potentially harmful features, consensus exists in the WG to bring
646	   the work forward as PS.

648	   Concerns were initially expressed that the mechanism could introduce
649	   path bouncing with potentially harmful network impacts.  These
650	   concerns are believed to be unfounded.  This issue is addressed in
651	   Appendix B.

653	   It is noted that the feature specified by this document is
654	   implemented by multiple SCTP software implementations and,
655	   furthermore, that variants of the solution have been deployed in
656	   Telco signaling environments for several years with good results.

658	9.  References

660	9.1.  Normative References

662	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
663	              Requirement Levels", BCP 14, RFC 2119, March 1997.

665	   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol", RFC
666	              4960, September 2007.

668	9.2.  Informative References

670	   [CARO02]   Caro Jr., A., Iyengar, J., Amer, P., Heinz, G., and R.
671	              Stewart, "A Two-level Threshold Recovery Mechanism for
672	              SCTP", Tech report, CIS Dept, University of Delaware, July
673	              2002.

675	   [CARO04]   Caro Jr., A., Amer, P., and R. Stewart, "End-to-End
676	              Failover Thresholds for Transport Layer Multihoming",
677	              MILCOM 2004, November 2004.

679	   [CARO05]   Caro Jr., A., "End-to-End Fault Tolerance using Transport
680	              Layer Multihoming", Ph.D Thesis, University of Delaware,
681	              January 2005.

683	   [FALLON08]
684	              Fallon, S., Jacob, P., Qiao, Y., Murphy, L., Fallon, E.,
685	              and A. Hanley, "SCTP Switchover Performance Issues in WLAN
686	              Environments", IEEE CCNC 2008, January 2008.

688	   [GRINNEMO04]
689	              Grinnemo, K-J. and A. Brunstrom, "Performance of SCTP-
690	              controlled failovers in M3UA-based SIGTRAN networks",
691	              Advanced Simulation Technologies Conference, April 2004.

693	   [IYENGAR06]
694	              Iyengar, J., Amer, P., and R. Stewart, "Concurrent
695	              Multipath Transfer using SCTP Multihoming over Independent
696	              End-to-end Paths", IEEE/ACM Trans on Networking 14(5),
697	              October 2006.

699	   [JUNGMAIER02]
700	              Jungmaier, A., Rathgeb, E., and M. Tuexen, "On the use of
701	              SCTP in failover scenarios", World Multiconference on
702	              Systemics, Cybernetics and Informatics, July 2002.

704	   [NATARAJAN09]
705	              Natarajan, P., Ekiz, N., Amer, P., and R. Stewart,
706	              "Concurrent Multipath Transfer during Path Failure",
707	              Computer Communications, May 2009.

709	   [RFC6458]  Stewart, R., Tuexen, M., Poon, K., Lei, P., and V.
710	              Yasevich, "Sockets API Extensions for the Stream Control
711	              Transmission Protocol (SCTP)", RFC 6458, December 2011.

713	Appendix A.  Discussions of Alternative Approaches

715	   This section lists alternative approaches to the issues described in
716	   this document.  Although these approaches do not require updating
717	   RFC4960, we do not recommend them for the reasons described below.

719	A.1.  Reduce Path.Max.Retrans (PMR)

721	   Smaller values for Path.Max.Retrans shorten the failover duration.
722	   In fact, this is recommended in some research results [JUNGMAIER02]
723	   [GRINNEMO04] [FALLON08].  For example, when Path.Max.Retrans=0,
724	   SCTP switches to another destination address on a single timeout.
725	   However, such a small value for Path.Max.Retrans can result in
726	   spurious failover, which might be a problem.

728	   Unlike in SCTP-PF, the interval for heartbeat packets is governed by
729	   'HB.interval' even during the failover process.  'HB.interval' is
730	   usually set on the order of seconds (the recommended value is 30
731	   seconds).  When the primary path becomes inactive, the next HB can
732	   be transmitted only seconds later.  Meanwhile, the primary path may
733	   have recovered.  In such situations, after failover, an endpoint is
734	   forced to wait on the order of seconds before it can resume
735	   transmission on the primary path.  Using a smaller value for
736	   'HB.interval' might help in this situation, but it wastes bandwidth
737	   in most cases.

739	   In addition, smaller Path.Max.Retrans values also affect
740	   'Association.Max.Retrans' values.  When the SCTP association's error
741	   count (the sum of error counts on all ACTIVE paths) exceeds the
742	   Association.Max.Retrans threshold, the SCTP sender considers the peer
743	   endpoint unreachable and terminates the association.  Therefore,
744	   Section 8.2 of [RFC4960] recommends that the Association.Max.Retrans
745	   value not be larger than the sum of the Path.Max.Retrans values of
746	   all destination addresses; otherwise, the SCTP sender considers its
747	   peer reachable even when all destinations are INACTIVE.  To avoid
748	   such inconsistent behavior, an SCTP implementation SHOULD reduce
749	   Association.Max.Retrans accordingly whenever it reduces
750	   Path.Max.Retrans.  However, a smaller Association.Max.Retrans value
751	   increases the chance of association termination during minor
752	   congestion events.
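
   The consistency condition between Association.Max.Retrans and the
   per-destination Path.Max.Retrans values can be sketched as a simple
   check.  This is a non-normative illustration; the function name
   amr_is_consistent() and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if Association.Max.Retrans (amr) is not larger than the sum of
 * the Path.Max.Retrans values of all destination addresses, else 0. */
static int amr_is_consistent(uint32_t amr, const uint16_t *pmr, size_t n_paths)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n_paths; i++)
        sum += pmr[i];
    return amr <= sum;
}
```

   With the values recommended in [RFC4960] (Association.Max.Retrans =
   10, Path.Max.Retrans = 5) and two destination addresses, the check
   passes.  If Path.Max.Retrans is reduced to 2 on both paths, the
   check fails until Association.Max.Retrans is lowered to 4 or less.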

754	A.2.  Adjust RTO related parameters

756	   As several research results indicate, we can also shorten the
757	   duration of the failover process by adjusting RTO-related parameters
758	   [JUNGMAIER02] [FALLON08].  During the failover process, the RTO
759	   keeps being doubled.  However, if we choose a smaller value for
760	   RTO.max, we can stop the exponential growth of the RTO at some
761	   point.  Also, choosing smaller values for RTO.initial or RTO.min can
762	   help keep the RTO value small.
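
   The effect of RTO doubling and of the RTO.max cap on the failover
   duration can be estimated with the simplified model below.  It is a
   non-normative sketch that assumes every transmission on the path
   times out, that the RTO doubles after each timeout up to RTO.max,
   and that failover happens once the error counter exceeds the given
   threshold.  The function name failover_time_ms() and its parameters
   are hypothetical.

```c
#include <stdint.h>

/* Estimate the time in milliseconds until the path error counter exceeds
 * `threshold`, that is, after threshold + 1 consecutive timeouts.
 * rto_ms is the RTO in effect when the failure starts; the RTO doubles
 * after every timeout and is capped at rto_max_ms. */
static uint64_t failover_time_ms(uint32_t rto_ms, uint32_t rto_max_ms,
                                 uint16_t threshold)
{
    uint64_t total = 0;
    uint64_t rto = rto_ms < rto_max_ms ? rto_ms : rto_max_ms;

    for (uint32_t timeouts = 0; timeouts <= threshold; timeouts++) {
        total += rto;            /* wait for this timer to expire */
        rto *= 2;                /* exponential backoff */
        if (rto > rto_max_ms)
            rto = rto_max_ms;    /* growth stops at RTO.max */
    }
    return total;
}
```

   With an RTO of 1 second, RTO.max = 60 seconds, and a threshold of
   5, this model gives 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds.  Lowering
   RTO.max to 4 seconds gives 1 + 2 + 4 + 4 + 4 + 4 = 19 seconds, and
   a threshold of 0 gives a single RTO.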

764	   Similar to reducing Path.Max.Retrans, the advantage of this approach
765	   is that it requires no modification to the current specification,
766	   although it requires ignoring several recommendations described in
767	   Section 15 of [RFC4960].  However, this approach requires sufficient
768	   knowledge of the network characteristics between the endpoints.
769	   Otherwise, it can introduce adverse side effects such as spurious
770	   timeouts.

772	Appendix B.  Discussions for Path Bouncing Effect

774	   The methods described in this document can accelerate the failover
775	   process.  Hence, they might introduce a path bouncing effect, where
776	   the sender keeps changing the data transmission path frequently.
777	   This sounds harmful to the data transfer; however, several research
778	   results indicate that there is no serious problem with SCTP in terms
779	   of the path bouncing effect [CARO04] [CARO05].

781	   There are two main reasons for this.  First, SCTP is designed for
782	   multipath communication, which means that SCTP maintains all
783	   path-related parameters (CWND, ssthresh, RTT, error count, etc.) per
784	   destination address.  These parameters cannot be affected by path
785	   bouncing.  In addition, when SCTP migrates the data transfer to
786	   another path, it starts with the minimal or the initial CWND.  Hence,
787	   there is little chance of packet reordering or duplication.

789	   Second, even if all communication paths between the end nodes share
790	   the same bottleneck, SCTP-PF results in behavior already allowed by
791	   [RFC4960].

793	Authors' Addresses

795	   Yoshifumi Nishida
796	   GE Global Research
797	   2623 Camino Ramon
798	   San Ramon, CA  94583
799	   USA

801	   Email: nishida@wide.ad.jp

802	   Preethi Natarajan
803	   Cisco Systems
804	   510 McCarthy Blvd
805	   Milpitas, CA  95035
806	   USA

808	   Email: prenatar@cisco.com

810	   Armando Caro
811	   BBN Technologies
812	   10 Moulton St.
813	   Cambridge, MA  02138
814	   USA

816	   Email: acaro@bbn.com

818	   Paul D. Amer
819	   University of Delaware
820	   Computer Science Department - 434 Smith Hall
821	   Newark, DE  19716-2586
822	   USA

824	   Email: amer@udel.edu

826	   Karen E. E. Nielsen
827	   Ericsson
828	   Kistavaegen 25
829	   Stockholm  164 80
830	   Sweden

832	   Email: karen.nielsen@tieto.com