Network Working Group                                         Y. Nishida
Internet-Draft                                              WIDE Project
Intended status: Standards Track                            P. Natarajan
Expires: June 12, 2011                                     Cisco Systems
                                                        December 9, 2010


                    Quick Failover Algorithm in SCTP
                  draft-nishida-tsvwg-sctp-failover-01

Abstract

   One of the major advantages of SCTP is its support for multi-homed
   communication.  If a multi-homed endpoint has redundant network
   connections, an SCTP session has a good chance of surviving network
   failures by migrating from an inactive network to an active one.
   However, if we follow the SCTP standard, there can be a significant
   delay before this migration takes place.  During this period, SCTP
   cannot transmit much data to the destination.  This issue drastically
   impairs the usability of SCTP in some situations.  This memo
   describes the issue in the SCTP failover mechanism and discusses
   solutions that require minimal modification to the current standard.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 12, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Terminology
   3.  Issue in SCTP Path Management Process
   4.  Solutions for Smooth Failover
     4.1.  Reduce Path.Max.Retrans
     4.2.  Adjust RTO related parameters
     4.3.  Introducing Potentially-failed Destination State in
           Failure Detection Algorithm
   5.  Discussion
     5.1.  Effect of Path Bouncing
     5.2.  Permanent Failover
   6.  Security Considerations
   7.  IANA Considerations
   8.  Normative References
   Authors' Addresses

1.  Introduction

   The Stream Control Transmission Protocol (SCTP) [RFC4960] natively
   supports multihoming at the transport layer -- an SCTP association
   can bind to multiple IP addresses at each endpoint.  SCTP's
   multihoming features include failure detection and failover
   procedures to provide network interface redundancy and improved
   end-to-end fault tolerance.

   In SCTP's current failure detection procedure, the sender must
   experience more than Path.Max.Retrans (PMR) consecutive timeouts on
   a destination before detecting path failure.  The sender fails over
   to an alternate active destination only after failure detection.
   Until failover, the sender transmits data on the failed path,
   degrading SCTP performance.  Concurrent Multipath Transfer (CMT)
   [IYENGAR06] is an extension to SCTP that allows the sender to
   transmit data on multiple paths simultaneously.  Research
   [NATARAJAN09] shows that the current failure detection procedure
   worsens CMT performance during failover, and that performance can be
   significantly improved by employing a better failover algorithm.

   This document proposes an alternative failure detection procedure for
   SCTP (and CMT) that improves SCTP (CMT) performance during failover.

2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Since this document describes a potential risk in NewReno, it uses
   the same terminology and definitions as [RFC4690].

3.  Issue in SCTP Path Management Process

   SCTP can utilize multiple IP addresses for a single SCTP association.
   Each SCTP endpoint exchanges the list of addresses available on the
   node during initial negotiation.  After this, the endpoints select
   one address from the list and define it as the destination of the
   primary path.  SCTP normally sends all data through this primary
   path.  It also sends heartbeat packets to the other (non-primary)
   destinations at a certain interval to check the reachability of
   those paths.

   If a sender has multiple active destination addresses, it can
   retransmit data to a secondary destination address when a
   transmission to the primary times out.

   When a sender receives an acknowledgment for data or heartbeat
   packets from one of the destination addresses, it considers that
   destination active.  If it fails to receive acknowledgments, the
   error counter for the address is incremented.  If the error counter
   exceeds the protocol parameter 'Path.Max.Retrans', the SCTP endpoint
   considers the address inactive.

   The failover process of SCTP is initiated when the primary path
   becomes inactive (the error counter for the primary path exceeds
   Path.Max.Retrans).  If the primary path is marked inactive, SCTP
   chooses a new destination address from among the active destinations
   and starts using this address to send data.  If the primary path
   becomes active again, SCTP uses the primary destination for
   subsequent data transmissions and stops using the non-primary one.

   An issue in this failover process is that it usually takes a
   significant amount of time before SCTP switches to the new
   destination.  For example, if the primary path on a multi-homed host
   becomes unavailable and the RTO value for the primary path at that
   time is around 1 second, it usually takes over 60 seconds before SCTP
   starts to use the secondary path.
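
   As a rough check of this figure, the sketch below sums the timeouts
   that must elapse before the error counter exceeds Path.Max.Retrans.
   It assumes that the RTO doubles after every timeout and is capped at
   RTO.max (60 seconds, the value recommended in [RFC4960]).  The
   function name failover_delay and its parameters are illustrative
   only and are not part of any SCTP API.

```python
def failover_delay(rto_initial, path_max_retrans, rto_max=60.0):
    """Seconds of consecutive timeouts before a path is declared inactive.

    Assumption: the RTO doubles on each timeout, capped at rto_max.
    """
    delay = 0.0
    rto = rto_initial
    # The error counter must EXCEED Path.Max.Retrans, so PMR + 1 timeouts.
    for _ in range(path_max_retrans + 1):
        delay += rto
        rto = min(rto * 2, rto_max)
    return delay


print(failover_delay(1.0, 5))  # 63.0  (1 + 2 + 4 + 8 + 16 + 32)
print(failover_delay(1.0, 0))  # 1.0   (failover after a single timeout)
```

   Under these assumptions, an initial RTO of 1 second and a
   Path.Max.Retrans of 5 give a delay of 63 seconds.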
   This is because the recommended value for Path.Max.Retrans in the
   standard is 5, which requires 6 consecutive timeouts, with the RTO
   doubled after each one, before failover takes place.  Before SCTP
   switches to the secondary address, it keeps trying to send packets
   to the primary, and only the retransmitted packets, which are sent
   to the secondary, can reach the receiver.  This slow failover
   process can cause significant performance degradation and will not
   be acceptable in some situations.

4.  Solutions for Smooth Failover

   The following approaches are conceivable as solutions to this issue.

4.1.  Reduce Path.Max.Retrans

   If we choose a smaller value for Path.Max.Retrans, we can shorten the
   duration of the failover process.  In fact, this is recommended in
   some research results [JUNGMAIER02] [GRINNEMO04] [FALLON08].  For
   example, if we set Path.Max.Retrans to 0, SCTP switches to another
   destination on a single timeout.  However, a smaller value for
   Path.Max.Retrans might cause spurious failover.  In addition, if we
   use a smaller value for Path.Max.Retrans, we may also need to choose
   a smaller value for 'Association.Max.Retrans'.
   Association.Max.Retrans is the threshold for the total number of
   consecutive errors across the entire SCTP association.  If the total
   error count for all paths exceeds this value, the endpoint considers
   the peer endpoint unreachable and terminates the association.
   According to Section 8.2 of [RFC4960], we should avoid having the
   value of Association.Max.Retrans larger than the sum of the
   Path.Max.Retrans values of all the destination addresses.
   Otherwise, even if all the destination addresses become inactive,
   the endpoint still considers the peer endpoint reachable.  The
   behavior in this situation is not defined in the RFC and depends on
   each implementation.
   In order to avoid inconsistent behavior between implementations, it
   is better to use a smaller value for Association.Max.Retrans.
   However, if we choose a smaller value for Association.Max.Retrans,
   associations will be prone to termination under minor congestion.

   Another issue is that the heartbeat interval, 'HB.interval', may not
   be small (the recommended value is 30 seconds).  This means that once
   failover takes place, an endpoint might need a certain amount of time
   before it can use the primary path again.  This can cause undesirable
   effects in case of spurious failover.  If we choose a smaller value
   for HB.interval, the traffic used for path probing in a session will
   increase.

   The advantage of tuning Path.Max.Retrans is that it requires no
   modification to the current standard, although it means ignoring
   several recommendations.  In addition, some research results indicate
   that path bouncing caused by spurious failover does not cause serious
   problems.  We discuss the effect of path bouncing in Section 5.

4.2.  Adjust RTO related parameters

   As several research results indicate, we can also shorten the
   duration of the failover process by adjusting RTO related parameters
   [JUNGMAIER02] [FALLON08].  During the failover process, the RTO keeps
   being doubled.  However, if we choose a smaller value for RTO.max, we
   can stop the exponential growth of the RTO at some point.  Also,
   choosing smaller values for RTO.initial or RTO.min can help keep the
   RTO value small.

   As with reducing Path.Max.Retrans, the advantage of this approach
   is that it requires no modification to the current standard, although
   it means ignoring several recommendations.  However, this approach
   requires sufficient knowledge of the network characteristics between
   the endpoints.  Otherwise, it can introduce adverse side effects
   such as spurious timeouts.

4.3.  Introducing Potentially-failed Destination State in Failure
      Detection Algorithm

   Our proposal stems from the following two observations about SCTP's
   failure detection procedure:

   o  In order to minimize performance impact during failover, the
      sender should stop transmitting data to the failed destination as
      early as possible.  In the current SCTP path management scheme,
      the sender stops transmitting data to a destination only after the
      destination is marked Failed.  Thus, a smaller PMR value is ideal
      so that the sender transitions a destination to the Failed state
      more quickly.

   o  Smaller PMR values increase the chances of spurious failure
      detection, where the sender incorrectly marks a destination as
      Failed during periods of temporary congestion.  Larger PMR values
      are preferable to avoid spurious failure detection.

   From the above observations it is clear that tweaking the PMR value
   involves the following tradeoff -- a lower value improves performance
   but increases the chances of spurious failure detection, whereas a
   higher value degrades performance and reduces spurious failure
   detection in a wide range of path conditions.  Thus, tweaking the
   association's PMR value is an incomplete solution to the performance
   impact during failure.

   We propose a new "Potentially-failed" (PF) destination state in
   SCTP's path management procedure.  The PF state is an intermediate
   state between the Active and Failed states and was originally
   proposed to improve CMT performance [NATARAJAN09].  SCTP's failure
   detection procedure is modified to include the PF state.  The new
   failure detection algorithm assumes that loss detected by a timeout
   implies either severe congestion or failure en route.  After a single
   timeout on a path, a sender cannot tell which, and marks the
   corresponding destination as PF.
   A PF destination is not used for data transmission except in special
   cases (discussed below).  The new failure detection algorithm
   requires only sender-side changes.  Details are:

   o  The sender maintains a new tunable parameter called Potentially-
      failed.Max.Retrans (PFMR).  An association's PFMR value MUST be
      lower than the association's PMR value.  The recommended value of
      PFMR is 0.

   o  Each time the T3-rtx timer expires on an active or idle
      destination, the error counter of that destination address will be
      incremented.  When the value in the error counter exceeds PFMR,
      the endpoint should mark the destination transport address as PF.
      SCTP MUST NOT send any notification to the upper layer about the
      active to PF state transition.

   o  The sender never transmits data to a PF destination.  However,
      when all destinations are in either the PF or the Inactive state,
      the sender SHOULD transition a destination marked PF to the active
      state and transmit data to this destination.  The destination's
      error counter MUST NOT be cleared during this state transition.
      It is recommended that the sender transition the PF destination
      with the lowest error count (fewest consecutive timeouts) to the
      active state.  In case of a tie (multiple PF destinations with the
      same error count), the sender MAY choose the last active
      destination.

   o  Only heartbeats MUST be sent to PF destination(s) once per RTO.
      This means the sender SHOULD ignore HB.interval for PF
      destinations.  If a heartbeat is unanswered, the sender
      increments the error counter and exponentially backs off the RTO
      value.  If the error counter is less than PMR, the sender SHOULD
      transmit another heartbeat immediately after T3-timer expiration.
      An implementation MAY use the protocol parameter 'PFHB.interval'
      for the interval of heartbeat transmissions.
      If PFHB.interval is non-zero, a heartbeat packet is sent once per
      RTO of each destination address plus PFHB.interval, with jitter
      of +/- 50% of the RTO value.  Use of PFHB.interval can reduce the
      frequency of failover, which might be useful where the
      characteristics of the paths are mostly equal.

   o  When the sender receives a heartbeat ack from a PF destination,
      the sender clears the destination's error counter and transitions
      the PF destination back to the active state.  This state
      transition MUST NOT be notified to the ULP unless it is explicitly
      requested.  This destination's cwnd is set to 1 MTU (TODO: or 2?
      Needs more text discussing rationale; can revisit later?)

   o  An additional (PMR - PFMR) consecutive timeouts on a PF
      destination confirm the path failure, upon which the destination
      transitions to the Inactive state.  As described in [RFC4960], the
      sender (i) SHOULD notify the ULP about this state transition, and
      (ii) transmit heartbeats to the Inactive destination at a lower
      frequency as described in Section 8.3 of [RFC4960].

   o  When all destinations are in the Inactive state, the sender
      transitions one of the destinations back to the Active state and
      continues data transmission to this destination.  This proposal
      recommends that the sender transition the Inactive destination
      with the lowest error count (fewest consecutive timeouts) to the
      active state.  In case of a tie (multiple Inactive destinations
      with the same error count), the sender MAY choose the last active
      destination.

   o  Acks for retransmissions do not transition a PF destination back
      to the active state, since a sender cannot disambiguate whether
      the ack was for the original transmission or the
      retransmission(s).

5.  Discussion

5.1.  Effect of Path Bouncing

   The methods described above can accelerate the failover process.
   Hence, they might introduce a path bouncing effect, in which the
   data transmission path keeps changing frequently.  This sounds
   harmful for data transfer; however, several research results
   indicate that path bouncing causes no serious problem for SCTP
   [CARO04] [CARO05].

   There are two main reasons for this.  First, SCTP is designed for
   multipath communication, which means SCTP maintains all path related
   parameters (cwnd, ssthresh, RTT, error count, etc.) for each
   destination address.  These parameters are not affected by path
   bouncing.  In addition, when SCTP migrates to another path, it
   starts with a minimal cwnd because of slow-start.  Hence, there is
   little chance of packet reordering or duplication.

   Second, even if all communication paths between end-nodes share the
   same bottleneck, the proposed method does not make the situation
   worse.  In case of congestion, the current standard tries to transmit
   data packets to the primary during failover, while the proposed
   method tries to explore other destinations.  In either case, the
   same amount of data packets is sent to the same bottleneck.

5.2.  Permanent Failover

   When the primary path becomes active again after failover, SCTP
   migrates back to the primary path.  After this, SCTP starts data
   transfer with a minimal cwnd.  This is because SCTP must perform
   slow-start when it migrates to a new path.  However, this might
   degrade communication performance when the performance of the
   alternative path is relatively good.  In order to mitigate this
   effect of slow-start, permanent failover was proposed in [CARO02].
   Permanent failover allows SCTP to remain on the alternative path even
   if the primary path becomes active again.  This approach can improve
   performance in some cases; however, it will require more detailed
   analysis since it might have an impact on the SCTP failover
   algorithm.
   Since we prefer to keep the current behavior of the standard as far
   as possible, we recommend not taking this approach for now.

6.  Security Considerations

   There are no new security considerations introduced in this document.

7.  IANA Considerations

   This document does not create any new registries or modify the rules
   for any existing registries managed by IANA.

8.  Normative References

   [CARO02]   Caro Jr., A., Iyengar, J., Amer, P., Heinz, G., and R.
              Stewart, "A Two-level Threshold Recovery Mechanism for
              SCTP", Tech report, CIS Dept, University of Delaware,
              July 2002.

   [CARO04]   Caro Jr., A., Amer, P., and R. Stewart, "End-to-End
              Failover Thresholds for Transport Layer Multihoming",
              MILCOM 2004, November 2004.

   [CARO05]   Caro Jr., A., "End-to-End Fault Tolerance using Transport
              Layer Multihoming", Ph.D Thesis, University of Delaware,
              January 2005.

   [FALLON08]
              Fallon, S., Jacob, P., Qiao, Y., Murphy, L., Fallon, E.,
              and A. Hanley, "SCTP Switchover Performance Issues in WLAN
              Environments", IEEE CCNC 2008, January 2008.

   [GRINNEMO04]
              Grinnemo, K-J. and A. Brunstrom, "Performance of SCTP-
              controlled failovers in M3UA-based SIGTRAN networks",
              Advanced Simulation Technologies Conference, April 2004.

   [IYENGAR06]
              Iyengar, J., Amer, P., and R. Stewart, "Concurrent
              Multipath Transfer using SCTP Multihoming over Independent
              End-to-end Paths", IEEE/ACM Trans on Networking 14(5),
              October 2006.

   [JUNGMAIER02]
              Jungmaier, A., Rathgeb, E., and M. Tuexen, "On the use of
              SCTP in failover scenarios", World Multiconference on
              Systemics, Cybernetics and Informatics, July 2002.

   [NATARAJAN09]
              Natarajan, P., Ekiz, N., Amer, P., and R. Stewart,
              "Concurrent Multipath Transfer during Path Failure",
              Computer Communications, May 2009.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol",
              RFC 4960, September 2007.

Authors' Addresses

   Yoshifumi Nishida
   WIDE Project
   Endo 5322
   Fujisawa, Kanagawa  252-8520
   Japan

   Email: nishida@wide.ad.jp


   Preethi Natarajan
   Cisco Systems
   425 E. Tasman Drive
   San Jose, CA  95134
   USA

   Email: prenatar@cisco.com
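
   The PF failure detection procedure of Section 4.3 can be sketched as
   a small state machine.  This is a minimal illustration under stated
   assumptions, not a reference implementation: the names Destination,
   PathManager, on_timeout, on_heartbeat_ack and pick_data_destination
   are hypothetical, the first destination in the list is assumed to be
   the primary, and timers, heartbeat scheduling, cwnd handling, ULP
   notifications and the "last active destination" tie-break are not
   modeled.

```python
from dataclasses import dataclass

ACTIVE, PF, INACTIVE = "active", "pf", "inactive"


@dataclass
class Destination:
    name: str
    state: str = ACTIVE
    error_count: int = 0


class PathManager:
    def __init__(self, names, pmr=5, pfmr=0):
        if pfmr >= pmr:
            raise ValueError("PFMR must be lower than PMR")
        self.pmr, self.pfmr = pmr, pfmr
        self.dests = [Destination(n) for n in names]  # primary first

    def on_timeout(self, dest):
        """T3-rtx expiry or unanswered heartbeat on dest."""
        dest.error_count += 1
        if dest.error_count > self.pmr:
            dest.state = INACTIVE
        elif dest.error_count > self.pfmr:
            dest.state = PF

    def on_heartbeat_ack(self, dest):
        """A heartbeat ack clears the counter and reactivates the path."""
        dest.error_count = 0
        dest.state = ACTIVE

    def pick_data_destination(self):
        for d in self.dests:
            if d.state == ACTIVE:
                return d
        # No active destination: prefer PF ones; if there are none,
        # every destination is Inactive.  Pick the lowest error count.
        candidates = [d for d in self.dests if d.state == PF] or self.dests
        best = min(candidates, key=lambda d: d.error_count)
        best.state = ACTIVE  # error counter is deliberately NOT cleared
        return best


pm = PathManager(["primary", "secondary"])
primary, secondary = pm.dests
pm.on_timeout(primary)                  # one timeout with PFMR = 0
print(primary.state)                    # pf
print(pm.pick_data_destination().name)  # secondary
pm.on_heartbeat_ack(primary)
print(pm.pick_data_destination().name)  # primary
```

   With PFMR = 0 and PMR = 5, a single timeout moves a destination to
   PF, and five further consecutive timeouts move it to Inactive, which
   matches the (PMR - PFMR) rule in Section 4.3.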