idnits 2.17.1

draft-ietf-tsvwg-sctp-failover-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 18, 2013) is 3958 days in the past.  Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  ** Obsolete normative reference: RFC 4960 (Obsoleted by RFC 9260)

     Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                         Y. Nishida
Internet-Draft                                        GE Global Research
Intended status: Experimental                               P. Natarajan
Expires: December 20, 2013                                 Cisco Systems
                                                                 A. Caro
                                                        BBN Technologies
                                                                 P. Amer
                                                  University of Delaware
                                                           June 18, 2013


                    Quick Failover Algorithm in SCTP
                   draft-ietf-tsvwg-sctp-failover-01

Abstract

   One of the major advantages of SCTP is its support for multi-homed
   communication.  If a multi-homed endpoint has redundant network
   connections, an SCTP session has a good chance of surviving network
   failures by migrating from an inactive network path to an active one.
   However, an endpoint that follows the SCTP standard can experience a
   significant delay before this migration takes place.  During this
   period, SCTP cannot transmit much data to the destination, which
   drastically impairs the usability of SCTP in some situations.  This
   memo describes the issue with the SCTP failover mechanism and
   discusses solutions that require minimal modification to the current
   standard.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 20, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Terminology
   3.  Issue in SCTP Path Management Process
   4.  Existing Solutions for Smooth Failover
     4.1.  Reduce Path.Max.Retrans
     4.2.  Adjust RTO related parameters
   5.  Proposed Solution: SCTP with Potentially-Failed Destination
       State (SCTP-PF)
     5.1.  SCTP-PF Description
     5.2.  Effect of Path Bouncing
     5.3.  Permanent Failover
     5.4.  Handling Error Counter
   6.  Socket API Considerations
     6.1.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) socket option
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Stream Control Transmission Protocol (SCTP) [RFC4960] natively
   supports multihoming at the transport layer -- an SCTP association
   can bind to multiple IP addresses at each endpoint.  SCTP's
   multihoming features include failure detection and failover
   procedures to provide network interface redundancy and improved
   end-to-end fault tolerance.

   In SCTP's current failure detection procedure, the sender must
   experience more than Path.Max.Retrans (PMR) consecutive timeouts on
   a destination before it detects path failure.  The sender fails over
   to an alternate active destination only after failure detection.
   Until failover, the sender transmits data on the failed path,
   degrading SCTP performance.  Concurrent Multipath Transfer (CMT)
   [IYENGAR06] is an extension to SCTP that allows the sender to
   transmit data on multiple paths simultaneously.  Research
   [NATARAJAN09] shows that the current failure detection procedure
   worsens CMT performance during failover and that performance can be
   significantly improved by employing a better failover algorithm.

   This document proposes an alternative failure detection procedure for
   SCTP (and CMT) that improves SCTP (CMT) performance during failover.

2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Issue in SCTP Path Management Process

   SCTP can utilize multiple IP addresses for a single SCTP association.
   Each SCTP endpoint exchanges the list of addresses available on the
   node during initial negotiation.  After this, each endpoint selects
   one address from the peer's list and defines it as the primary
   destination.  During normal transmission, SCTP sends all data to the
   primary destination.  It also sends heartbeat packets to the other
   (non-primary) destinations at a certain interval to check the
   reachability of each path.

   If the sender has multiple active destination addresses, it can
   retransmit data to a secondary destination address when a
   transmission to the primary times out.

   When the sender receives an acknowledgment for data or heartbeat
   packets from one of the destination addresses, it considers that
   destination active.  If it fails to receive an acknowledgment, the
   error counter for the address is incremented.  If the error counter
   exceeds the protocol parameter 'Path.Max.Retrans', the SCTP endpoint
   considers the address inactive.
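
   The error counting described above can be sketched in C as follows.
   This is a minimal illustration, not taken from any SCTP
   implementation: the struct, the function names, and the fixed
   threshold are assumptions made for the example.  Clearing the
   counter on an acknowledgment follows Section 8.2 of [RFC4960].

```c
#include <stdbool.h>

/* Recommended Path.Max.Retrans from RFC 4960; tunable in practice. */
#define PATH_MAX_RETRANS 5u

/* Hypothetical per-destination record; the names are illustrative only. */
struct dest {
    unsigned error_count;   /* consecutive unacknowledged transmissions */
    bool     active;        /* false once the address is marked inactive */
};

/* DATA or a HEARTBEAT sent to this address was acknowledged. */
static void dest_on_ack(struct dest *d)
{
    d->error_count = 0;     /* RFC 4960, Section 8.2: clear the counter */
    d->active = true;
}

/* A retransmission timer expired or a HEARTBEAT went unanswered. */
static void dest_on_timeout(struct dest *d)
{
    d->error_count++;
    if (d->error_count > PATH_MAX_RETRANS)
        d->active = false;  /* counter exceeds Path.Max.Retrans */
}
```

   With the threshold of 5 used here, the fifth consecutive call to
   dest_on_timeout() leaves the destination active and the sixth marks
   it inactive.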

   The failover process of SCTP is initiated when the primary path
   becomes inactive (the error counter for the primary path exceeds
   Path.Max.Retrans).  If the primary path is marked inactive, SCTP
   chooses a new destination address from the active destinations and
   starts using this address to send data.  If the primary path becomes
   active again, SCTP uses the primary destination for subsequent data
   transmissions and stops using the non-primary one.

   An issue in this failover process is that it usually takes a
   significant amount of time before SCTP switches to the new
   destination.  For example, if the primary path on a multi-homed host
   becomes unavailable and the RTO value for the primary path at that
   time is around 1 second, it usually takes over 60 seconds before SCTP
   starts to use the secondary path.  This is because the recommended
   value for Path.Max.Retrans in the standard is 5, which requires 6
   consecutive timeouts before failover takes place.  Because the RTO
   doubles on each timeout, these six timeouts take 1 + 2 + 4 + 8 + 16
   + 32 = 63 seconds in total.  Before SCTP switches to the secondary
   address, it keeps trying to send new packets to the primary, and
   only the retransmitted packets, which are sent to the secondary, can
   reach the receiver.  This slow failover process can cause
   significant performance degradation and will not be acceptable in
   some situations.

4.  Existing Solutions for Smooth Failover

   The following approaches are conceivable as solutions to this issue.

4.1.  Reduce Path.Max.Retrans

   If we choose a smaller value for Path.Max.Retrans, we can shorten the
   duration of the failover process.  In fact, this is recommended in
   some research results [JUNGMAIER02] [GRINNEMO04] [FALLON08].  For
   example, if we set Path.Max.Retrans to 0, SCTP switches to another
   destination on a single timeout.  However, a smaller value for
   Path.Max.Retrans might cause spurious failover.
   In addition, if we use a smaller value for Path.Max.Retrans, we may
   also need to choose a smaller value for 'Association.Max.Retrans'.
   Association.Max.Retrans is the threshold for the total number of
   consecutive errors across the entire SCTP association.  If the total
   error count for all paths exceeds this value, the endpoint considers
   the peer endpoint unreachable and terminates the association.
   According to Section 8.2 of [RFC4960], we should avoid setting
   Association.Max.Retrans larger than the sum of the Path.Max.Retrans
   values of all the destination addresses.  Otherwise, even if all the
   destination addresses become inactive, the endpoint still considers
   the peer endpoint reachable.  The behavior in this situation is not
   defined in the RFC and depends on each implementation.  In order to
   avoid inconsistent behavior between implementations, it is better to
   use a smaller value for Association.Max.Retrans.  However, if we
   choose a smaller value for Association.Max.Retrans, associations
   become prone to termination under minor congestion.

   Another issue is that the heartbeat interval, 'HB.interval', may not
   be small (the recommended value is 30 seconds).  This means that once
   failover takes place, an endpoint might need a certain amount of time
   before it can use the primary path again.  This can cause undesirable
   effects in case of spurious failover.  If we choose a smaller value
   for HB.interval, the traffic used for path probing in a session
   increases.

   The advantage of tuning Path.Max.Retrans is that it requires no
   modification to the current standard, although it means ignoring
   several recommendations.  In addition, some research results indicate
   that path bouncing caused by spurious failover does not cause serious
   problems.  We discuss the effect of path bouncing in Section 5.2.

4.2.  Adjust RTO related parameters

   As several research results indicate, we can also shorten the
   duration of the failover process by adjusting RTO related parameters
   [JUNGMAIER02] [FALLON08].  During the failover process, the RTO keeps
   being doubled.  However, if we choose a smaller value for RTO.max, we
   can stop the exponential growth of the RTO at some point.  Choosing
   smaller values for RTO.initial or RTO.min can also help to keep the
   RTO value small.

   Similar to reducing Path.Max.Retrans, the advantage of this approach
   is that it requires no modification to the current standard, although
   it means ignoring several recommendations.  However, this approach
   requires sufficient knowledge about the network characteristics
   between the endpoints.  Otherwise, it can introduce adverse side
   effects such as spurious timeouts.

5.  Proposed Solution: SCTP with Potentially-Failed Destination State
    (SCTP-PF)

5.1.  SCTP-PF Description

   Our proposal stems from the following two observations about SCTP's
   failure detection procedure:

   o  In order to minimize the performance impact during failover, the
      sender should stop transmitting data to the failed destination as
      early as possible.  In the current SCTP path management scheme,
      the sender stops transmitting data to a destination only after the
      destination is marked Failed.  Thus, a smaller PMR value is ideal
      so that the sender transitions a destination to the Failed state
      more quickly.

   o  Smaller PMR values increase the chances of spurious failure
      detection, where the sender incorrectly marks a destination as
      Failed during periods of temporary congestion.  Larger PMR values
      are preferable to avoid spurious failure detection.
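
   The failover delay behind these observations, and behind the
   parameter tuning of Section 4, can be estimated with a short
   calculation.  The following C sketch is illustrative only: the
   function name and the simplified timer model are assumptions.  It
   assumes that every timeout doubles the RTO, that the RTO is capped
   at RTO.max, and that failover happens when the error counter first
   exceeds PMR.

```c
/*
 * Seconds from the first lost transmission until the error counter
 * exceeds pmr.  Simplified model (an assumption of this sketch): each
 * timeout doubles the RTO, and the RTO never grows beyond rto_max.
 */
static double failover_delay(double rto, double rto_max, unsigned pmr)
{
    double total = 0.0;

    /* Failover needs pmr + 1 consecutive timeouts. */
    for (unsigned i = 0; i <= pmr; i++) {
        total += rto;           /* wait for the current timer to expire */
        rto *= 2.0;             /* exponential backoff */
        if (rto > rto_max)
            rto = rto_max;      /* RTO.max stops the growth */
    }
    return total;
}
```

   For an initial RTO of 1 second and RTO.max of 60 seconds,
   failover_delay(1.0, 60.0, 5) yields 63 seconds, matching the "over
   60 seconds" figure in Section 3.  Setting PMR to 0 reduces the delay
   to 1 second, and lowering RTO.max to 4 seconds while keeping PMR at
   5 yields 19 seconds.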

   From the above observations it is clear that tweaking the PMR value
   involves the following tradeoff -- a lower value improves performance
   but increases the chances of spurious failure detection, whereas a
   higher value degrades performance but reduces spurious failure
   detection in a wide range of path conditions.  Thus, tweaking the
   association's PMR value is an incomplete solution to the performance
   impact during failure.

   We propose a new "Potentially-failed" (PF) destination state in
   SCTP's path management procedure.  The PF state was originally
   proposed to improve CMT performance [NATARAJAN09].  The PF state is
   an intermediate state between the Active and Failed states.  SCTP's
   failure detection procedure is modified to include the PF state.  The
   new failure detection algorithm assumes that loss detected by a
   timeout implies either severe congestion or failure en route.  After
   a single timeout on a path, a sender is unsure which of the two has
   occurred, and marks the corresponding destination as PF.  A PF
   destination is not used for data transmission except in special
   cases (discussed below).  The new failure detection algorithm
   requires only sender-side changes.  Details are:

   1.  The sender maintains a new tunable parameter called
       Potentially-failed.Max.Retrans (PFMR).  The recommended value of
       PFMR is 0 when quick failover is used.  When an association's
       PFMR >= PMR, quick failover is turned off.

   2.  Each time the T3-rtx timer expires on an active or idle
       destination, the error counter of that destination address will
       be incremented.  When the value in the error counter exceeds
       PFMR, the endpoint should mark the destination transport address
       as PF.  SCTP MUST NOT send any notification to the upper layer
       about the Active to PF state transition.

   3.  The sender SHOULD avoid data transmission to PF destinations.
       When all destinations are in either the PF or the Inactive
       state, the sender MAY either move the destination from the PF to
       the Active state (and transmit data to the active destination)
       or transmit data to a PF destination.  In the former scenario,
       the sender (i) MUST NOT notify the ULP about the state
       transition, and (ii) MUST NOT clear the destination's error
       counter.  It is recommended that the sender pick the PF
       destination with the least error count (fewest consecutive
       timeouts) for data transmission.  In case of a tie (multiple PF
       destinations with the same error count), the sender MAY choose
       the last active destination.

   4.  Only heartbeats MUST be sent to PF destination(s), once per RTO.
       This means the sender SHOULD ignore HB.interval for PF
       destinations.  If a heartbeat is unanswered, the sender
       increments the error counter and exponentially backs off the RTO
       value.  If the error counter is less than PMR, the sender SHOULD
       transmit another heartbeat immediately after T3-timer
       expiration.

   5.  When the sender receives a heartbeat ACK from a PF destination,
       the sender clears the destination's error counter and
       transitions the PF destination back to the Active state.  The
       sender MUST NOT notify the ULP about this state transition.
       This destination's cwnd is set to 1 MTU.  Note that in scenarios
       where the destination was temporarily congested during the
       T3-timer expiration, an SCTP sender transmits 1 MTU worth of
       data while an SCTP-PF sender transmits an HB after the T3-timer
       expiry (more details in Section 5 of [NATARAJAN09]).  The SCTP
       sender therefore has a 1 RTT head start in cwnd evolution
       compared to the SCTP-PF sender.  An SCTP-PF sender may set cwnd
       to 2 MTUs after receiving the HB-ACK in order to offset this
       performance difference.

   6.  An additional (PMR - PFMR) consecutive timeouts on a PF
       destination confirm the path failure, upon which the destination
       transitions to the Inactive state.  As described in [RFC4960],
       the sender (i) SHOULD notify the ULP about this state
       transition, and (ii) transmits heartbeats to the Inactive
       destination at a lower frequency, as described in Section 8.3 of
       [RFC4960].

   7.  When all destinations are in the Inactive state, the sender picks
       one of the Inactive destinations for data transmission.  This
       proposal recommends that the sender pick the Inactive
       destination with the least error count (fewest consecutive
       timeouts) for data transmission.  In case of a tie (multiple
       Inactive destinations with the same error count), the sender MAY
       choose the last active destination.

   8.  ACKs for retransmissions do not transition a PF destination back
       to the Active state, since a sender cannot disambiguate whether
       the ACK was for the original transmission or the
       retransmission(s).

5.2.  Effect of Path Bouncing

   The methods described above can accelerate the failover process.
   Hence, they might introduce a path bouncing effect, in which the
   data transmission path changes frequently.  This sounds harmful for
   data transfer; however, several research results indicate that path
   bouncing causes no serious problem for SCTP [CARO04] [CARO05].

   There are two main reasons for this.  First, SCTP is designed for
   multipath communication, which means that SCTP maintains all path
   related parameters (cwnd, ssthresh, RTT, error count, etc.) per
   destination address.  These parameters cannot be affected by path
   bouncing.  In addition, when SCTP migrates to another path, it
   starts with a minimal cwnd because of slow start.  Hence, there is
   little chance of packet reordering or duplication.
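
   As an illustration of this per-destination bookkeeping, the
   following C sketch models the state transitions of Section 5.1 with
   independent variables for each destination.  It is a simplified
   model under stated assumptions, not an implementation of any
   existing SCTP stack: all type and function names are hypothetical,
   data and heartbeat timeouts are treated alike, ties are broken by
   the lowest index rather than by the last active destination, and
   the preference for the primary among Active destinations is not
   modeled.

```c
#include <stdbool.h>

/* Ordered so that a smaller value means a more preferred destination. */
enum pf_state { DEST_ACTIVE, DEST_PF, DEST_INACTIVE };

/* Hypothetical per-destination variables; nothing is shared across paths. */
struct pf_dest {
    enum pf_state state;
    unsigned error_count;   /* consecutive timeouts on this destination */
    unsigned cwnd_mtus;     /* congestion window, in MTUs */
};

struct pf_params {
    unsigned pmr;           /* Path.Max.Retrans */
    unsigned pfmr;          /* Potentially-failed.Max.Retrans */
};

/*
 * A T3-rtx or heartbeat timer expired for this destination.
 * Returns true when the upper layer should be notified, which happens
 * only on the transition to Inactive (items 2 and 6).
 */
static bool pf_on_timeout(struct pf_dest *d, const struct pf_params *p)
{
    d->error_count++;
    if (d->error_count > p->pmr) {
        bool changed = (d->state != DEST_INACTIVE);
        d->state = DEST_INACTIVE;
        return changed;
    }
    if (d->error_count > p->pfmr && d->state == DEST_ACTIVE)
        d->state = DEST_PF;         /* silent transition, no notification */
    return false;
}

/* A heartbeat ACK arrived from this destination (item 5). */
static void pf_on_heartbeat_ack(struct pf_dest *d)
{
    if (d->state == DEST_PF)
        d->cwnd_mtus = 1;           /* restart from 1 MTU */
    d->error_count = 0;
    d->state = DEST_ACTIVE;
}

/*
 * Choose a destination for new data: Active before PF before Inactive,
 * and the least error count within the same state (items 3 and 7).
 * Returns the index of the chosen destination, or -1 if n is 0.
 */
static int pf_pick(const struct pf_dest *d, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (best < 0 ||
            d[i].state < d[best].state ||
            (d[i].state == d[best].state &&
             d[i].error_count < d[best].error_count))
            best = i;
    }
    return best;
}
```

   Because the check against PMR comes first, a configuration with
   PFMR >= PMR never enters the PF state, which corresponds to turning
   quick failover off as described in item 1 of Section 5.1.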

   Second, even if all communication paths between the end nodes share
   the same bottleneck, the proposed method does not make the situation
   worse.  In case of congestion, the current standard tries to transmit
   data packets to the primary during failover, while the proposed
   method tries to explore other destinations.  In either case, the
   same amount of data is sent through the same bottleneck.

5.3.  Permanent Failover

   After failover, an SCTP sender migrates back to the original primary
   destination once this destination becomes active.  The sender sets
   cwnd to the initial cwnd value and performs slow start.  [CARO02]
   shows that switching back to the original primary may degrade SCTP
   performance compared to continuing data transmission on the alternate
   path, especially in scenarios where the alternate path's
   characteristics are better.  In order to mitigate this performance
   degradation, permanent failover was proposed in [CARO02].  Permanent
   failover allows SCTP to remain on the alternate path even if the
   primary path becomes active again.  We recommend that SCTP-PF stick
   to the standard [RFC4960] behavior, i.e., switch back to the original
   primary once this destination becomes active again.  Permanent
   failover may be considered in the future based on discussions and
   consensus within the community.

5.4.  Handling Error Counter

   When multiple destinations are in the PF state, the sender may
   transmit heartbeats to multiple destinations at the same time.  This
   allows an SCTP-PF sender to track and respond to network status
   changes more quickly than an SCTP sender.  However, when all PF
   destinations become unavailable, an SCTP-PF sender, unlike an SCTP
   sender, has outstanding HBs on all destinations and therefore
   increases the count of total consecutive retransmissions faster than
   the SCTP sender.
   SCTP-PF's faster increase in the error count will result in
   association termination sooner than with SCTP.

   For deployments where aggressive failure detection and association
   termination are not desired, we recommend that
   Association.Max.Retrans (AMR) be set to the maximum allowed value
   (the sum of the PMRs of all paths) to delay association termination
   with SCTP-PF.  Another option is to send retransmitted data or HBs
   to only one PF destination at a time, but this approach may delay
   path status tracking.  Excluding HB timeouts from the error count is
   another possible solution; however, this requires an update to
   Section 8.3 of [RFC4960].

6.  Socket API Considerations

   This section describes how the socket API defined in [RFC6458] is
   extended to provide a way for the application to control the quick
   failover behavior.

   Please note that this section is informational only.

   A socket API implementation based on [RFC6458] is extended by adding
   a new read/write socket option for the level IPPROTO_SCTP and the
   name SCTP_PEER_ADDR_THLDS as described below.  This socket option is
   used to read/write the value of the PFMR parameter described in
   Section 5.

   Support for the SCTP_PEER_ADDR_THLDS socket option also needs to be
   added to the function sctp_opt_info().

6.1.  Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) socket option

   Applications can control the quick failover behavior by getting or
   setting the number of timeouts before a peer address is considered
   potentially failed or unreachable.

   The following structure is used to access and modify the thresholds:

   struct sctp_paddrthlds {
     sctp_assoc_t spt_assoc_id;
     struct sockaddr_storage spt_address;
     uint16_t spt_pathmaxrxt;
     uint16_t spt_pathpfthld;
   };

   spt_assoc_id:  This parameter is ignored for one-to-one style
      sockets.  For one-to-many style sockets the application may fill
      in an association identifier or SCTP_FUTURE_ASSOC for this query.
      It is an error to use SCTP_{CURRENT|ALL}_ASSOC in spt_assoc_id.

   spt_address:  This specifies which peer address is of interest.  If a
      wildcard address is provided, this socket option applies to all
      current and future peer addresses.

   spt_pathmaxrxt:  Each peer address of interest is considered
      unreachable if its path error counter exceeds spt_pathmaxrxt.

   spt_pathpfthld:  Each peer address of interest is considered
      potentially failed if its path error counter exceeds
      spt_pathpfthld.

7.  Security Considerations

   There are no new security considerations introduced in this document.

8.  IANA Considerations

   This document does not create any new registries or modify the rules
   for any existing registries managed by IANA.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol",
              RFC 4960, September 2007.

9.2.  Informative References

   [CARO02]   Caro Jr., A., Iyengar, J., Amer, P., Heinz, G., and R.
              Stewart, "A Two-level Threshold Recovery Mechanism for
              SCTP", Tech report, CIS Dept, University of Delaware,
              July 2002.

   [CARO04]   Caro Jr., A., Amer, P., and R. Stewart, "End-to-End
              Failover Thresholds for Transport Layer Multihoming",
              MILCOM 2004, November 2004.

   [CARO05]   Caro Jr., A., "End-to-End Fault Tolerance using Transport
              Layer Multihoming", Ph.D Thesis, University of Delaware,
              January 2005.

   [FALLON08]
              Fallon, S., Jacob, P., Qiao, Y., Murphy, L., Fallon, E.,
              and A. Hanley, "SCTP Switchover Performance Issues in WLAN
              Environments", IEEE CCNC 2008, January 2008.

   [GRINNEMO04]
              Grinnemo, K-J. and A. Brunstrom, "Performance of
              SCTP-controlled failovers in M3UA-based SIGTRAN networks",
              Advanced Simulation Technologies Conference, April 2004.

   [IYENGAR06]
              Iyengar, J., Amer, P., and R. Stewart, "Concurrent
              Multipath Transfer using SCTP Multihoming over Independent
              End-to-end Paths", IEEE/ACM Trans on Networking 14(5),
              October 2006.

   [JUNGMAIER02]
              Jungmaier, A., Rathgeb, E., and M. Tuexen, "On the use of
              SCTP in failover scenarios", World Multiconference on
              Systemics, Cybernetics and Informatics, July 2002.

   [NATARAJAN09]
              Natarajan, P., Ekiz, N., Amer, P., and R. Stewart,
              "Concurrent Multipath Transfer during Path Failure",
              Computer Communications, May 2009.

   [RFC6458]  Stewart, R., Tuexen, M., Poon, K., Lei, P., and V.
              Yasevich, "Sockets API Extensions for the Stream Control
              Transmission Protocol (SCTP)", RFC 6458, December 2011.

Authors' Addresses

   Yoshifumi Nishida
   GE Global Research
   2623 Camino Ramon
   San Ramon, CA  94583
   USA

   Email: nishida@wide.ad.jp


   Preethi Natarajan
   Cisco Systems
   510 McCarthy Blvd
   Milpitas, CA  95035
   USA

   Email: prenatar@cisco.com


   Armando Caro
   BBN Technologies
   10 Moulton St.
   Cambridge, MA  02138
   USA

   Email: acaro@bbn.com


   Paul D. Amer
   University of Delaware
   Computer Science Department - 434 Smith Hall
   Newark, DE  19716-2586
   USA

   Email: amer@udel.edu