idnits 2.17.1 

draft-nishida-tsvwg-sctp-failover-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (March 12, 2012) is 4425 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 4960 (Obsoleted by RFC 9260)

  == Outdated reference: A later version (-32) exists of
     draft-ietf-tsvwg-sctpsocket-31


     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         Y. Nishida
3	Internet-Draft                                              WIDE Project
4	Intended status: Standards Track                            P. Natarajan
5	Expires: September 13, 2012                                Cisco Systems
6	                                                                 A. Caro
7	                                                        BBN Technologies
8	                                                          March 12, 2012

10	                    Quick Failover Algorithm in SCTP
11	                  draft-nishida-tsvwg-sctp-failover-05

13	Abstract

15	   One of the major advantages of SCTP is its support for multi-homed
16	   communication.  If a multi-homed endpoint has redundant network
17	   connections, SCTP sessions have a good chance of surviving network
18	   failures by migrating from an inactive network to an active one.
19	   However, if we follow the SCTP standard, there can be a significant
20	   delay before this migration.  During this migration period, SCTP
21	   cannot transmit much data to the destination.  This issue drastically
22	   impairs the usability of SCTP in some situations.  This memo
23	   describes this issue in the SCTP failover mechanism and discusses
24	   solutions that require minimal modification to the current standard.

26	Status of this Memo

28	   This Internet-Draft is submitted in full conformance with the
29	   provisions of BCP 78 and BCP 79.

31	   Internet-Drafts are working documents of the Internet Engineering
32	   Task Force (IETF).  Note that other groups may also distribute
33	   working documents as Internet-Drafts.  The list of current Internet-
34	   Drafts is at http://datatracker.ietf.org/drafts/current/.

36	   Internet-Drafts are draft documents valid for a maximum of six months
37	   and may be updated, replaced, or obsoleted by other documents at any
38	   time.  It is inappropriate to use Internet-Drafts as reference
39	   material or to cite them other than as "work in progress."

41	   This Internet-Draft will expire on September 13, 2012.

43	Copyright Notice

45	   Copyright (c) 2012 IETF Trust and the persons identified as the
46	   document authors.  All rights reserved.

48	   This document is subject to BCP 78 and the IETF Trust's Legal
49	   Provisions Relating to IETF Documents
50	   (http://trustee.ietf.org/license-info) in effect on the date of
51	   publication of this document.  Please review these documents
52	   carefully, as they describe your rights and restrictions with respect
53	   to this document.  Code Components extracted from this document must
54	   include Simplified BSD License text as described in Section 4.e of
55	   the Trust Legal Provisions and are provided without warranty as
56	   described in the Simplified BSD License.

58	Table of Contents

60	   1.  Introduction
61	   2.  Conventions and Terminology
62	   3.  Issue in SCTP Path Management Process
63	   4.  Existing Solutions for Smooth Failover
64	     4.1.  Reduce Path.Max.Retrans
65	     4.2.  Adjust RTO related parameters
66	   5.  Proposed Solution: SCTP with Potentially-Failed
67	       Destination State (SCTP-PF)
68	     5.1.  SCTP-PF Description
69	     5.2.  Effect of Path Bouncing
70	     5.3.  Permanent Failover
71	     5.4.  Handling Error Counter
72	   6.  Socket API Considerations
73	     6.1.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) socket
74	           option
75	   7.  Security Considerations
76	   8.  IANA Considerations
77	   9.  References
78	     9.1.  Normative References
79	     9.2.  Informative References
80	   Authors' Addresses

82	1.  Introduction

84	   The Stream Control Transmission Protocol (SCTP) [RFC4960] natively
85	   supports multihoming at the transport layer -- an SCTP association
86	   can bind to multiple IP addresses at each endpoint.  SCTP's
87	   multihoming features include failure detection and failover
88	   procedures to provide network interface redundancy and improved end-
89	   to-end fault tolerance.

91	   In SCTP's current failure detection procedure, the sender must
92	   experience Path.Max.Retrans (PMR) number of consecutive timeouts on a
93	   destination before detecting path failure.  The sender fails over to
94	   an alternate active destination only after failure detection.  Until
95	   failover, the sender transmits data on the failed path, degrading
96	   SCTP performance.  Concurrent Multipath Transfer (CMT) [IYENGAR06] is
97	   an extension to SCTP and allows the sender to transmit data on
98	   multiple paths simultaneously.  Research [NATARAJAN09] shows that the
99	   current failure detection procedure worsens CMT performance during
100	   failover and can be significantly improved by employing a better
101	   failover algorithm.

103	   This document proposes an alternative failure detection procedure for
104	   SCTP (and CMT) that improves SCTP (CMT) performance during failover.

106	2.  Conventions and Terminology

108	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
109	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
110	   document are to be interpreted as described in [RFC2119].

112	3.  Issue in SCTP Path Management Process

114	   SCTP can utilize multiple IP addresses for a single SCTP association.
115	   Each SCTP endpoint exchanges the list of available addresses on the
116	   node during initial negotiation.  After this, endpoints select one
117	   address from the list and define this as the primary destination.
118	   During normal transmission, SCTP sends all data to the primary
119	   destination.  Also, it sends heartbeat packets to other (non-primary)
120	   destinations at a certain interval to check the reachability of the
121	   path.

123	   If the sender has multiple active destination addresses, it can
124	   retransmit data to a secondary destination address when the
125	   transmission to the primary times out.

127	   When the sender receives an acknowledgment for data or heartbeat
128	   packets from one of the destination addresses, it considers that
129	   destination active.  If it fails to receive acknowledgments, the
130	   error counter for the address is incremented.  If the error counter
131	   exceeds the protocol parameter 'Path.Max.Retrans', the SCTP endpoint
132	   considers the address inactive.

134	   The failover process of SCTP is initiated when the primary path
135	   becomes inactive (the error counter for the primary path exceeds
136	   Path.Max.Retrans).  If the primary path is marked inactive, SCTP
137	   chooses a new destination address from among the active destinations
138	   and starts using this address to send data.  If the primary path
139	   becomes active again, SCTP uses the primary destination for
140	   subsequent data transmissions and stops using the non-primary one.

142	   An issue in this failover process is that it usually takes a
143	   significant amount of time before SCTP switches to the new
144	   destination.  For example, if the primary path on a multi-homed host
145	   becomes unavailable and the RTO value for the primary path at that
146	   time is around 1 second, it usually takes over 60 seconds before SCTP
147	   starts to use the secondary path.  This is because the recommended
148	   value for Path.Max.Retrans in the standard is 5, which requires 6
149	   consecutive timeouts before failover takes place.  Before SCTP
150	   switches to the secondary address, it keeps trying to send packets
151	   to the primary, and only the retransmitted packets sent to the
152	   secondary can reach the receiver.  This slow failover process
153	   can cause significant performance degradation and will not be
154	   acceptable in some situations.
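
   The delay in this example can be reproduced with a short
   calculation.  The following is a minimal sketch, assuming that the
   RTO simply doubles after every timeout and never reaches RTO.max.
   The function name failover_delay_sec is hypothetical and is not
   part of any SCTP implementation.

```c
/*
 * Hypothetical helper: total time in seconds spent on the failed
 * primary path before failover.  Assumes the RTO doubles after each
 * timeout and is never capped.  Failover needs pmr + 1 consecutive
 * timeouts.
 */
unsigned int failover_delay_sec(unsigned int rto_sec, unsigned int pmr)
{
    unsigned int total = 0;
    unsigned int i;

    for (i = 0; i <= pmr; i++) {
        total += rto_sec;   /* wait one full RTO for this timeout */
        rto_sec *= 2;       /* exponential backoff */
    }
    return total;
}
```

   With an RTO of 1 second and Path.Max.Retrans set to 5, the six
   timeouts add up to 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds.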

156	4.  Existing Solutions for Smooth Failover

158	   The following approaches are conceivable as solutions to this
159	   issue.

161	4.1.  Reduce Path.Max.Retrans

163	   If we choose a smaller value for Path.Max.Retrans, we can shorten the
164	   duration of the failover process.  In fact, this is recommended in
165	   some research results [JUNGMAIER02] [GRINNEMO04] [FALLON08].  For
166	   example, if we set Path.Max.Retrans to 0, SCTP switches to another
167	   destination on a single timeout.  However, a smaller value for
168	   Path.Max.Retrans might cause spurious failover.  In addition, if we
169	   use a smaller value for Path.Max.Retrans, we may also need to choose
170	   a smaller value for 'Association.Max.Retrans'.  This parameter is
171	   the threshold for the total number of consecutive errors across the
172	   entire SCTP association.  If the total of the error counts for all
173	   paths exceeds this value, the endpoint considers the peer endpoint
174	   unreachable and terminates the association.  According to Section
175	   8.2 of [RFC4960], we should avoid having the value of
176	   Association.Max.Retrans larger than the summation of the
177	   Path.Max.Retrans of all the destination addresses.  Otherwise, even
178	   if all the destination addresses become inactive, the endpoint still
179	   considers the peer endpoint reachable.  The behavior in this
180	   situation is not defined in the RFC and depends on each
181	   implementation.  In order to avoid inconsistent behavior between
182	   implementations, it is better to use a smaller value for
183	   Association.Max.Retrans.  However, if we choose a smaller value for
184	   Association.Max.Retrans, associations will be prone to termination
185	   under minor congestion.
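
   The constraint between the two thresholds can be expressed as a
   simple configuration check.  The sketch below is illustrative only;
   the function name assoc_max_retrans_ok and its parameters are
   hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical configuration check: returns 1 when
 * Association.Max.Retrans does not exceed the sum of Path.Max.Retrans
 * over all destination addresses, and 0 otherwise.
 */
int assoc_max_retrans_ok(unsigned int assoc_max_retrans,
                         const unsigned int *path_max_retrans,
                         size_t num_paths)
{
    unsigned int sum = 0;
    size_t i;

    for (i = 0; i < num_paths; i++)
        sum += path_max_retrans[i];
    return assoc_max_retrans <= sum;
}
```

   For example, with two destination addresses, an
   Association.Max.Retrans of 10 passes the check when both paths use
   a Path.Max.Retrans of 5, but fails once both are lowered to 0.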

187	   Another issue is that the heartbeat interval, 'HB.interval', may
188	   not be small (the recommended value is 30 seconds).  This means that
189	   once failover takes place, an endpoint might need a certain amount of
190	   time before it can use the primary path again.  This can cause
191	   undesirable effects in case of spurious failover.  If we choose a
192	   smaller value for HB.interval, the traffic used for path probing in a
193	   session will increase.

195	   The advantage of tuning Path.Max.Retrans is that it requires no
196	   modification to the current standard, although it requires ignoring
197	   several recommendations.  In addition, some research results indicate
198	   that path bouncing caused by spurious failover does not cause serious
199	   problems.  We discuss the effect of path bouncing in Section 5.2.

201	4.2.  Adjust RTO related parameters

203	   As several research results indicate, we can also shorten the
204	   duration of the failover process by adjusting RTO-related parameters
205	   [JUNGMAIER02] [FALLON08].  During the failover process, the RTO keeps
206	   doubling.  However, if we choose a smaller value for RTO.max, we can
207	   stop the exponential growth of the RTO at some point.  Also, choosing
208	   smaller values for RTO.initial or RTO.min can help keep the RTO
209	   value small.
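
   The effect of a smaller RTO.max can be illustrated with a small
   model.  This is a sketch under simplifying assumptions: the RTO
   doubles after each timeout, is clamped to RTO.max, and failover
   happens after Path.Max.Retrans + 1 consecutive timeouts.  The
   function name failover_delay_capped_sec is hypothetical.

```c
/*
 * Hypothetical model: seconds spent on the failed path before
 * failover when the RTO is clamped to rto_max_sec (RTO.max).
 */
unsigned int failover_delay_capped_sec(unsigned int rto_sec,
                                       unsigned int rto_max_sec,
                                       unsigned int pmr)
{
    unsigned int total = 0;
    unsigned int i;

    for (i = 0; i <= pmr; i++) {
        if (rto_sec > rto_max_sec)
            rto_sec = rto_max_sec;  /* stop the exponential growth */
        total += rto_sec;           /* wait one RTO for this timeout */
        rto_sec *= 2;               /* back off for the next attempt */
    }
    return total;
}
```

   In this model, starting from an RTO of 1 second with
   Path.Max.Retrans set to 5, an RTO.max of 60 seconds gives a delay
   of 63 seconds, while an RTO.max of 4 seconds gives
   1 + 2 + 4 + 4 + 4 + 4 = 19 seconds.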

211	   Similar to reducing Path.Max.Retrans, the advantage of this approach
212	   is that it requires no modification to the current standard, although
213	   it requires ignoring several recommendations.  However, this approach
214	   requires sufficient knowledge of the network characteristics
215	   between endpoints.  Otherwise, it can introduce adverse side effects
216	   such as spurious timeouts.

218	5.  Proposed Solution: SCTP with Potentially-Failed Destination State
219	    (SCTP-PF)

221	5.1.  SCTP-PF Description

223	   Our proposal stems from the following two observations about SCTP's
224	   failure detection procedure:

226	   o  In order to minimize performance impact during failover, the
227	      sender should avoid transmitting data to the failed destination as
228	      early as possible.  In the current SCTP path management scheme,
229	      the sender stops transmitting data to a destination only after the
230	      destination is marked Failed.  Thus, a smaller PMR value is ideal
231	      so that the sender transitions a destination to the Failed state
232	      quicker.

234	   o  Smaller PMR values increase the chances of spurious failure
235	      detection where the sender incorrectly marks a destination as
236	      Failed during periods of temporary congestion.  Larger PMR values
237	      are preferable to avoid spurious failure detection.

239	   From the above observations it is clear that tweaking the PMR value
240	   involves the following tradeoff -- a lower value improves performance
241	   but increases the chances of spurious failure detection, whereas a
242	   higher value degrades performance and reduces spurious failure
243	   detection in a wide range of path conditions.  Thus, tweaking the
244	   association's PMR value is an incomplete solution to address
245	   performance impact during failure.

247	   We propose a new "Potentially-failed" (PF) destination state in
248	   SCTP's path management procedure.  The PF state was originally
249	   proposed to improve CMT performance [NATARAJAN09].  The PF state is
250	   an intermediate state between Active and Failed states.  SCTP's
251	   failure detection procedure is modified to include the PF state.  The
252	   new failure detection algorithm assumes that loss detected by a
253	   timeout implies either severe congestion or failure en-route.  After
254	   a single timeout on a path, a sender is unsure, and marks the
255	   corresponding destination as PF.  A PF destination is not used for
256	   data transmission except in special cases (discussed below).  The new
257	   failure detection algorithm requires only sender-side changes.
258	   Details are:

260	   1.  The sender maintains a new tunable parameter called Potentially-
261	       failed.Max.Retrans (PFMR).  The recommended value of PFMR is 0
262	       when quick failover is used.  When an association's PFMR >= PMR,
263	       quick failover is turned off.

265	   2.  Each time the T3-rtx timer expires on an active or idle
266	       destination, the error counter of that destination address will
267	       be incremented.  When the value in the error counter exceeds
268	       PFMR, the endpoint should mark the destination transport address
269	       as PF.  SCTP MUST NOT send any notification to the upper layer
270	       about the active to PF state transition.

272	   3.  The sender SHOULD avoid data transmission to PF destinations.
273	       When all destinations are in either PF or Inactive state, the
274	       sender MAY either move the destination from PF to active state
275	       (and transmit data to the active destination) or the sender MAY
276	       transmit data to a PF destination.  In the former scenario, (i)
277	       the sender MUST NOT notify the ULP about the state transition,
278	       and (ii) MUST NOT clear the destination's error counter.  It is
279	       recommended that the sender picks the PF destination with least
280	       error count (fewest consecutive timeouts) for data transmission.
281	       In case of a tie (multiple PF destinations with same error
282	       count), the sender MAY choose the last active destination.

284	   4.  Only heartbeats MUST be sent to PF destination(s) once per RTO.
285	       This means the sender SHOULD ignore HB.interval for PF
286	       destinations.  If a heartbeat is unanswered, the sender
287	       increments the error counter and exponentially backs off the RTO
288	       value.  If the error counter is less than PMR, the sender SHOULD
289	       transmit another heartbeat immediately after T3-timer expiration.

291	   5.  When the sender receives a heartbeat ACK from a PF destination,
292	       the sender clears the destination's error counter and transitions
293	       the PF destination back to active state.  This state transition
294	       MUST NOT be notified to the ULP.  This destination's cwnd is set
295	       to 1 MTU (TODO: or 2?  Needs more text discussing rationale; can
296	       revisit later?)

298	   6.  An additional (PMR - PFMR) consecutive timeouts on a PF
299	       destination confirm the path failure, upon which the destination
300	       transitions to the Inactive state.  As described in [RFC4960],
301	       the sender (i) SHOULD notify ULP about this state transition, and
302	       (ii) transmit heartbeats to the Inactive destination at a lower
303	       frequency as described in Section 8.3 of [RFC4960].

305	   7.  When all destinations are in the Inactive state, the sender picks
306	       one of the Inactive destinations for data transmission.  This
307	       proposal recommends that the sender picks the Inactive
308	       destination with least error count (fewest consecutive timeouts)
309	       for data transmission.  In case of a tie (multiple Inactive
310	       destinations with same error count), the sender MAY choose the
311	       last active destination.

313	   8.  ACKs for retransmissions do not transition a PF destination back
314	       to the active state, since a sender cannot disambiguate whether
315	       the ACK was for the original transmission or the
316	       retransmission(s).
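
   The state transitions in the steps above can be sketched as
   follows.  This is a minimal, non-normative illustration: the type
   and function names (pf_dest, pf_on_timeout, pf_on_heartbeat_ack,
   pf_pick_dest) are hypothetical, congestion window handling is
   omitted, and the tie-break in pf_pick_dest takes the optional
   "last active destination" choice from step 3.

```c
#include <stddef.h>

/* Destination states used by this sketch. */
enum pf_state { PF_ACTIVE, PF_POTENTIALLY_FAILED, PF_INACTIVE };

/* Hypothetical per-destination bookkeeping. */
struct pf_dest {
    enum pf_state state;
    unsigned int error_count;   /* consecutive timeouts */
};

/*
 * Timeout on a destination (steps 2, 4 and 6).  Returns 1 when the
 * upper layer should be notified (transition to Inactive), else 0.
 * If pfmr >= pmr the PF state is never entered.
 */
int pf_on_timeout(struct pf_dest *d, unsigned int pmr, unsigned int pfmr)
{
    d->error_count++;
    if (d->error_count > pmr) {
        if (d->state != PF_INACTIVE) {
            d->state = PF_INACTIVE;
            return 1;
        }
    } else if (d->error_count > pfmr && d->state == PF_ACTIVE) {
        d->state = PF_POTENTIALLY_FAILED;   /* no notification */
    }
    return 0;
}

/*
 * Heartbeat ACK (step 5): clear the counter and return to Active.
 * Returns 1 if the destination was Inactive, a case in which RFC 4960
 * allows a report to the upper layer; PF to Active returns 0.
 */
int pf_on_heartbeat_ack(struct pf_dest *d)
{
    int was_inactive = (d->state == PF_INACTIVE);

    d->error_count = 0;
    d->state = PF_ACTIVE;
    return was_inactive;
}

/*
 * Step 3: pick the PF destination with the fewest consecutive
 * timeouts; on a tie prefer index last_active.  Returns the chosen
 * index, or -1 if no destination is in the PF state.
 */
int pf_pick_dest(const struct pf_dest *dests, size_t n, size_t last_active)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (dests[i].state != PF_POTENTIALLY_FAILED)
            continue;
        if (best < 0 ||
            dests[i].error_count < dests[best].error_count ||
            (dests[i].error_count == dests[best].error_count &&
             i == last_active))
            best = (int)i;
    }
    return best;
}
```

   With PMR = 5 and PFMR = 0, the first timeout moves a destination to
   the PF state and the sixth consecutive timeout moves it to the
   Inactive state, matching the (PMR - PFMR) additional timeouts in
   step 6.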

318	5.2.  Effect of Path Bouncing

320	   The methods described above can accelerate the failover process.
321	   Hence, they might introduce a path bouncing effect, in which the data
322	   transmission path changes frequently.  This sounds harmful for data
323	   transfer; however, several research results indicate that path
324	   bouncing causes no serious problem for SCTP [CARO04] [CARO05].

326	   There are two main reasons for this.  First, SCTP is basically
327	   designed for multipath communication, which means SCTP maintains all
328	   path-related parameters (cwnd, ssthresh, RTT, error count, etc.) for
329	   each destination address.  These parameters cannot be affected by
330	   path bouncing.  In addition, when SCTP migrates to another path, it
331	   starts with minimal cwnd because of slow-start.  Hence, there is
332	   little chance of packet reordering or duplication.

334	   Second, even if all communication paths between end-nodes share the
335	   same bottleneck, the proposed method does not make situations worse.
336	   In case of congestion, the current standard tries to transmit data
337	   packets to the primary during failover, while the proposed method
338	   tries to explore other destinations.  In either case, the same amount
339	   of data packets is sent to the same bottleneck.

341	5.3.  Permanent Failover

343	   When the primary path becomes active again after failover, SCTP
344	   migrates back to the primary path.  After this, SCTP starts data
345	   transfer with minimal cwnd.  This is because SCTP must perform
346	   slow-start when it migrates to a new path.  However, this might
347	   degrade communication performance if the alternative path performs
348	   relatively well.  In order to mitigate this effect of slow-start,
349	   permanent failover was proposed in [CARO02].  Permanent failover
350	   allows SCTP to remain on the alternative path even if the primary
351	   path becomes active again.  This approach can improve performance in
352	   some cases; however, it will require more detailed analysis since it
353	   might impact the SCTP failover algorithm.  Since we prefer to keep
354	   the current behavior of the standard as much as possible, we
355	   recommend not taking this approach for now.

357	5.4.  Handling Error Counter

359	   When multiple destinations are in the PF state, the sender may
360	   transmit heartbeats to multiple destinations at the same time.  This
361	   allows the sender to quickly track and respond to network status
362	   changes.  However, when all PF destinations become unavailable, this
363	   approach increases the total number of consecutive retransmissions
364	   more aggressively than the current SCTP specification does.  Because
365	   of this aggressive increase, an SCTP association may be terminated
366	   earlier than it would be under the standard [RFC4960].

368	   One way to avoid early termination is to send retransmitted data or
369	   HB to only one PF destination at a time, but this approach may delay
370	   path status tracking.  An alternative solution is to exclude HB
371	   timeouts from incrementing the error count.  The latter approach is
372	   preferred but requires an update to Section 8.3 of [RFC4960].

374	6.  Socket API Considerations

376	   This section describes how the socket API defined in
377	   [I-D.ietf-tsvwg-sctpsocket] is extended to provide a way for the
378	   application to control the quick failover behavior.

380	   Please note that this section is informational only.

382	   A socket API implementation based on [I-D.ietf-tsvwg-sctpsocket] is
383	   extended by adding a new read/write socket option for the level
384	   IPPROTO_SCTP and the name SCTP_PEER_ADDR_THLDS as described below.
385	   This socket option is used to read/write the value of the PFMR
386	   parameter described in Section 5.

388	   Support for the SCTP_PEER_ADDR_THLDS socket option also needs to be
389	   added to the function sctp_opt_info().

391	6.1.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) socket option

393	   Applications can control the quick failover behavior by getting or
394	   setting the number of timeouts before a peer address is considered
395	   potentially failed or unreachable.

397	   The following structure is used to access and modify the thresholds:

399	   struct sctp_paddrthlds {
400	     sctp_assoc_t spt_assoc_id;
401	     struct sockaddr_storage spt_address;
402	     uint16_t spt_pathmaxrxt;
403	     uint16_t spt_pathpfthld;
404	   };

406	   spt_assoc_id:  This parameter is ignored for one-to-one style
407	      sockets.  For one-to-many style sockets the application may fill
408	      in an association identifier or SCTP_FUTURE_ASSOC for this query.
409	      It is an error to use SCTP_{CURRENT|ALL}_ASSOC in spt_assoc_id.

411	   spt_address:  This specifies which peer address is of interest.  If a
412	      wildcard address is provided, this socket option applies to all
413	      current and future peer addresses.

415	   spt_pathmaxrxt:  Each peer address of interest is considered
416	      unreachable if its path error counter exceeds spt_pathmaxrxt.

418	   spt_pathpfthld:  Each peer address of interest is considered
419	      potentially failed if its path error counter exceeds
420	      spt_pathpfthld.
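
   The following sketch shows how an application might fill in the
   structure before passing it to setsockopt().  It is illustrative
   only.  To stay self-contained it declares local stand-ins for
   sctp_assoc_t and struct sctp_paddrthlds; real code would take both
   from the platform's SCTP header, and the int32_t typedef is an
   assumption.  The helper name fill_paddrthlds is hypothetical, and
   the use of the IPv4 INADDR_ANY address as the wildcard address is
   likewise an assumption.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Local stand-ins (assumptions); normally from the SCTP header. */
typedef int32_t sctp_assoc_t;

struct sctp_paddrthlds {
    sctp_assoc_t spt_assoc_id;
    struct sockaddr_storage spt_address;
    uint16_t spt_pathmaxrxt;
    uint16_t spt_pathpfthld;
};

/*
 * Hypothetical helper: prepare thresholds that apply to all current
 * and future peer addresses by using the IPv4 wildcard address.
 */
void fill_paddrthlds(struct sctp_paddrthlds *t, sctp_assoc_t assoc_id,
                     uint16_t pmr, uint16_t pfmr)
{
    struct sockaddr_in *sin = (struct sockaddr_in *)&t->spt_address;

    memset(t, 0, sizeof(*t));
    t->spt_assoc_id = assoc_id;
    sin->sin_family = AF_INET;
    sin->sin_addr.s_addr = htonl(INADDR_ANY);  /* wildcard address */
    t->spt_pathmaxrxt = pmr;    /* threshold for unreachable */
    t->spt_pathpfthld = pfmr;   /* threshold for potentially failed */
}
```

   An application would then pass the filled structure to setsockopt()
   with the level IPPROTO_SCTP and the name SCTP_PEER_ADDR_THLDS.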

422	7.  Security Considerations

424	   There are no new security considerations introduced in this document.

426	8.  IANA Considerations

428	   This document does not create any new registries or modify the rules
429	   for any existing registries managed by IANA.

431	9.  References

433	9.1.  Normative References

435	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
436	              Requirement Levels", BCP 14, RFC 2119, March 1997.

438	   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol",
439	              RFC 4960, September 2007.

441	9.2.  Informative References

443	   [CARO02]   Caro Jr., A., Iyengar, J., Amer, P., Heinz, G., and R.
444	              Stewart, "A Two-level Threshold Recovery Mechanism for
445	              SCTP", Tech report, CIS Dept, University of Delaware,
446	              July 2002.

448	   [CARO04]   Caro Jr., A., Amer, P., and R. Stewart, "End-to-End
449	              Failover Thresholds for Transport Layer Multihoming",
450	              MILCOM 2004, November 2004.

452	   [CARO05]   Caro Jr., A., "End-to-End Fault Tolerance using Transport
453	              Layer Multihoming", Ph.D Thesis, University of Delaware,
454	              January 2005.

456	   [FALLON08]
457	              Fallon, S., Jacob, P., Qiao, Y., Murphy, L., Fallon, E.,
458	              and A. Hanley, "SCTP Switchover Performance Issues in WLAN
459	              Environments", IEEE CCNC 2008, January 2008.

461	   [GRINNEMO04]
462	              Grinnemo, K-J. and A. Brunstrom, "Performance of SCTP-
463	              controlled failovers in M3UA-based SIGTRAN networks",
464	              Advanced Simulation Technologies Conference, April 2004.

466	   [I-D.ietf-tsvwg-sctpsocket]
467	              Stewart, R., Tuexen, M., Poon, K., Lei, P., and V.
468	              Yasevich, "Sockets API Extensions for Stream Control
469	              Transmission Protocol (SCTP)",
470	              draft-ietf-tsvwg-sctpsocket-31 (work in progress),
471	              August 2011.

473	   [IYENGAR06]
474	              Iyengar, J., Amer, P., and R. Stewart, "Concurrent
475	              Multipath Transfer using SCTP Multihoming over Independent
476	              End-to-end Paths", IEEE/ACM Trans on Networking 14(5),
477	              October 2006.

479	   [JUNGMAIER02]
480	              Jungmaier, A., Rathgeb, E., and M. Tuexen, "On the use of
481	              SCTP in failover scenarios", World Multiconference on
482	              Systemics, Cybernetics and Informatics, July 2002.

484	   [NATARAJAN09]
485	              Natarajan, P., Ekiz, N., Amer, P., and R. Stewart,
486	              "Concurrent Multipath Transfer during Path Failure",
487	              Computer Communications, May 2009.

489	Authors' Addresses

491	   Yoshifumi Nishida
492	   WIDE Project
493	   Endo 5322
494	   Fujisawa, Kanagawa  252-8520
495	   Japan

497	   Email: nishida@wide.ad.jp

499	   Preethi Natarajan
500	   Cisco Systems
501	   510 McCarthy Blvd
502	   Milpitas, CA  95035
503	   USA

505	   Email: prenatar@cisco.com

507	   Armando Caro
508	   BBN Technologies
509	   10 Moulton St.
510	   Cambridge, MA  02138
511	   USA

513	   Email: acaro@bbn.com