Network Working Group                                         Y. Nishida
Internet-Draft                                              WIDE Project
Intended status: Standards Track                            P. Natarajan
Expires: September 13, 2012                                Cisco Systems
                                                                 A. Caro
                                                        BBN Technologies
                                                          March 12, 2012

                    Quick Failover Algorithm in SCTP
                  draft-nishida-tsvwg-sctp-failover-05

Abstract

One of the major advantages of SCTP is its support for multi-homed communication.
If a multi-homed endpoint has redundant network connections, an SCTP session has a good chance of surviving network failures by migrating from an inactive path to an active one. However, if we follow the SCTP standard, this migration can be significantly delayed. During the migration period, SCTP cannot transmit much data to the destination. This issue drastically impairs the usability of SCTP in some situations. This memo describes the issue in the SCTP failover mechanism and discusses solutions that require minimal modification to the current standard.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 13, 2012.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Conventions and Terminology
3. Issue in SCTP Path Management Process
4. Existing Solutions for Smooth Failover
   4.1. Reduce Path.Max.Retrans
   4.2. Adjust RTO related parameters
5. Proposed Solution: SCTP with Potentially-Failed Destination State (SCTP-PF)
   5.1. SCTP-PF Description
   5.2. Effect of Path Bouncing
   5.3. Permanent Failover
   5.4. Handling Error Counter
6. Socket API Considerations
   6.1. Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) socket option
7. Security Considerations
8. IANA Considerations
9. References
   9.1. Normative References
   9.2. Informative References
Authors' Addresses

1. Introduction

The Stream Control Transmission Protocol (SCTP) [RFC4960] natively supports multihoming at the transport layer: an SCTP association can bind to multiple IP addresses at each endpoint.
SCTP's multihoming features include failure detection and failover procedures to provide network interface redundancy and improved end-to-end fault tolerance.

In SCTP's current failure detection procedure, the sender must experience more than Path.Max.Retrans (PMR) consecutive timeouts on a destination before detecting path failure. The sender fails over to an alternate active destination only after failure detection. Until failover, the sender transmits data on the failed path, degrading SCTP performance. Concurrent Multipath Transfer (CMT) [IYENGAR06] is an extension to SCTP that allows the sender to transmit data on multiple paths simultaneously. Research [NATARAJAN09] shows that the current failure detection procedure worsens CMT performance during failover, and that this performance can be significantly improved by employing a better failover algorithm.

This document proposes an alternative failure detection procedure for SCTP (and CMT) that improves SCTP (CMT) performance during failover.

2. Conventions and Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3. Issue in SCTP Path Management Process

SCTP can utilize multiple IP addresses for a single SCTP association. Each SCTP endpoint exchanges the list of addresses available on the node during initial negotiation. After this, each endpoint selects one address from the list and defines it as the primary destination. During normal transmission, SCTP sends all data to the primary destination. It also sends heartbeat packets to the other (non-primary) destinations at a certain interval to check the reachability of each path.
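
The bookkeeping described above can be sketched in C as follows. This is only a minimal illustration under stated assumptions: the `struct dest` record, its field names, and both function names are hypothetical and are not taken from [RFC4960] or from any SCTP implementation. It shows that new data always goes to the primary destination, while non-primary destinations are probed once the heartbeat interval has elapsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-destination record; names are illustrative only. */
struct dest {
    const char *addr;   /* peer address, kept as text for readability */
    bool primary;       /* true for the single primary destination */
    long last_probe;    /* time in seconds of the last heartbeat sent */
};

/* During normal transmission, all new data goes to the primary. */
const struct dest *data_destination(const struct dest *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (d[i].primary)
            return &d[i];
    return NULL;        /* no primary configured */
}

/* Probe every non-primary destination whose heartbeat is due at `now`.
 * Returns the number of heartbeats that would be sent. */
size_t send_due_heartbeats(struct dest *d, size_t n, long now, long hb_interval)
{
    assert(hb_interval > 0);
    size_t sent = 0;
    for (size_t i = 0; i < n; i++) {
        if (!d[i].primary && now - d[i].last_probe >= hb_interval) {
            d[i].last_probe = now;  /* a real stack would emit a HEARTBEAT here */
            sent++;
        }
    }
    return sent;
}
```

A caller would invoke `send_due_heartbeats()` from a periodic timer and `data_destination()` whenever new data is queued. The sketch deliberately omits error counting, which the following paragraphs describe.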

If the sender has multiple active destination addresses, it can retransmit data to a secondary destination address when the transmission to the primary times out.

When the sender receives an acknowledgment for data or heartbeat packets from one of the destination addresses, it considers the destination active. If it fails to receive acknowledgments, the error counter for the address is incremented. If the error counter exceeds the protocol parameter 'Path.Max.Retrans', the SCTP endpoint considers the address inactive.

The failover process of SCTP is initiated when the primary path becomes inactive (the error counter for the primary path exceeds Path.Max.Retrans). If the primary path is marked inactive, SCTP chooses a new destination address from among the active destinations and starts using this address to send data. If the primary path becomes active again, SCTP uses the primary destination for subsequent data transmissions and stops using the non-primary one.

An issue in this failover process is that it usually takes a significant amount of time before SCTP switches to the new destination. For example, if the primary path on a multi-homed host becomes unavailable and the RTO value for the primary path at that time is around 1 second, it usually takes over 60 seconds before SCTP starts to use the secondary path. This is because the recommended value for Path.Max.Retrans in the standard is 5, which requires 6 consecutive timeouts before failover takes place. With the RTO doubling on each timeout, these six timeouts take about 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds. Before SCTP switches to the secondary address, it keeps trying to send packets to the primary, and only retransmitted packets, which are sent to the secondary, can reach the receiver. This slow failover process can cause significant performance degradation and will not be acceptable in some situations.

4.
Existing Solutions for Smooth Failover

The following approaches are conceivable solutions to this issue.

4.1. Reduce Path.Max.Retrans

If we choose a smaller value for Path.Max.Retrans, we can shorten the duration of the failover process. In fact, this is recommended in some research results [JUNGMAIER02] [GRINNEMO04] [FALLON08]. For example, if we set Path.Max.Retrans to 0, SCTP switches to another destination on a single timeout. However, a smaller value for Path.Max.Retrans might cause spurious failover. In addition, if we use a smaller value for Path.Max.Retrans, we may also need to choose a smaller value for 'Association.Max.Retrans'. Association.Max.Retrans is the threshold for the total number of consecutive errors for the entire SCTP association. If the total error count for all paths exceeds this value, the endpoint considers the peer endpoint unreachable and terminates the association. According to Section 8.2 of [RFC4960], we should avoid setting Association.Max.Retrans larger than the sum of the Path.Max.Retrans values of all the destination addresses. Otherwise, even if all the destination addresses become inactive, the endpoint still considers the peer endpoint reachable. The behavior in this situation is not defined in the RFC and depends on each implementation. In order to avoid inconsistent behavior between implementations, it is better to use a smaller value for Association.Max.Retrans. However, if we choose a smaller value for Association.Max.Retrans, associations will be prone to termination under minor congestion.

Another issue is that the heartbeat interval, 'HB.interval', may not be small (the recommended value is 30 seconds). This means that once failover takes place, an endpoint might need a certain amount of time before it can use the primary path again.
This can cause undesirable effects in the case of spurious failover. If we choose a smaller value for HB.interval, the traffic used for path probing in a session will increase.

The advantage of tuning Path.Max.Retrans is that it requires no modification to the current standard, although it requires ignoring several recommendations. In addition, some research results indicate that path bouncing caused by spurious failover does not cause serious problems. We discuss the effect of path bouncing in Section 5.2.

4.2. Adjust RTO related parameters

As several research results indicate, we can also shorten the duration of the failover process by adjusting RTO related parameters [JUNGMAIER02] [FALLON08]. During the failover process, the RTO keeps doubling. However, if we choose a smaller value for RTO.max, we can stop the exponential growth of the RTO at some point. Choosing smaller values for RTO.initial or RTO.min can also help to keep the RTO value small.

As with reducing Path.Max.Retrans, the advantage of this approach is that it requires no modification to the current standard, although it requires ignoring several recommendations. However, this approach requires sufficient knowledge of the network characteristics between the endpoints. Otherwise, it can introduce adverse side effects such as spurious timeouts.

5. Proposed Solution: SCTP with Potentially-Failed Destination State (SCTP-PF)

5.1. SCTP-PF Description

Our proposal stems from the following two observations about SCTP's failure detection procedure:

o  In order to minimize the performance impact during failover, the sender should stop transmitting data to the failed destination as early as possible. In the current SCTP path management scheme, the sender stops transmitting data to a destination only after the destination is marked Failed.
   Thus, a smaller PMR value is ideal so that the sender transitions a destination to the Failed state more quickly.

o  Smaller PMR values increase the chances of spurious failure detection, where the sender incorrectly marks a destination as Failed during periods of temporary congestion. Larger PMR values are preferable to avoid spurious failure detection.

From the above observations it is clear that tweaking the PMR value involves the following tradeoff: a lower value improves performance but increases the chances of spurious failure detection, whereas a higher value degrades performance and reduces spurious failure detection in a wide range of path conditions. Thus, tweaking the association's PMR value is an incomplete solution to the performance impact during failure.

We propose a new "Potentially-failed" (PF) destination state in SCTP's path management procedure. The PF state was originally proposed to improve CMT performance [NATARAJAN09]. The PF state is an intermediate state between the Active and Failed states. SCTP's failure detection procedure is modified to include the PF state. The new failure detection algorithm assumes that loss detected by a timeout implies either severe congestion or failure en route. After a single timeout on a path, a sender is unsure which is the case and marks the corresponding destination as PF. A PF destination is not used for data transmission except in special cases (discussed below). The new failure detection algorithm requires only sender-side changes. Details are:

1.  The sender maintains a new tunable parameter called Potentially-failed.Max.Retrans (PFMR). The recommended value of PFMR is 0 when quick failover is used. When an association's PFMR >= PMR, quick failover is turned off.

2.  Each time the T3-rtx timer expires on an active or idle destination, the error counter of that destination address will be incremented.
    When the value in the error counter exceeds PFMR, the endpoint should mark the destination transport address as PF. SCTP MUST NOT send any notification to the upper layer about the active to PF state transition.

3.  The sender SHOULD avoid data transmission to PF destinations. When all destinations are in either the PF or the Inactive state, the sender MAY either move the destination from the PF to the active state (and transmit data to the active destination) or transmit data to a PF destination. In the former scenario, the sender (i) MUST NOT notify the ULP about the state transition, and (ii) MUST NOT clear the destination's error counter. It is recommended that the sender pick the PF destination with the lowest error count (fewest consecutive timeouts) for data transmission. In case of a tie (multiple PF destinations with the same error count), the sender MAY choose the last active destination.

4.  Only heartbeats MUST be sent to PF destination(s) once per RTO. This means the sender SHOULD ignore HB.interval for PF destinations. If a heartbeat is unanswered, the sender increments the error counter and exponentially backs off the RTO value. If the error counter is less than PMR, the sender SHOULD transmit another heartbeat immediately after the T3-rtx timer expires.

5.  When the sender receives a heartbeat ACK from a PF destination, the sender clears the destination's error counter and transitions the PF destination back to the active state. The ULP MUST NOT be notified of this state transition. This destination's cwnd is set to 1 MTU (TODO: or 2? Needs more text discussing rationale; can revisit later?)

6.  An additional (PMR - PFMR) consecutive timeouts on a PF destination confirm the path failure, upon which the destination transitions to the Inactive state.
    As described in [RFC4960], the sender (i) SHOULD notify the ULP about this state transition, and (ii) transmit heartbeats to the Inactive destination at a lower frequency as described in Section 8.3 of [RFC4960].

7.  When all destinations are in the Inactive state, the sender picks one of the Inactive destinations for data transmission. This proposal recommends that the sender pick the Inactive destination with the lowest error count (fewest consecutive timeouts) for data transmission. In case of a tie (multiple Inactive destinations with the same error count), the sender MAY choose the last active destination.

8.  ACKs for retransmissions do not transition a PF destination back to the active state, since a sender cannot determine whether the ACK was for the original transmission or for the retransmission(s).

5.2. Effect of Path Bouncing

The methods described above can accelerate the failover process. Hence, they might introduce a path bouncing effect, in which the data transmission path changes frequently. This sounds harmful for data transfer; however, several research results indicate that path bouncing causes no serious problem for SCTP [CARO04] [CARO05].

There are two main reasons for this. First, SCTP is designed for multipath communication, which means that SCTP maintains all path related parameters (cwnd, ssthresh, RTT, error count, etc.) per destination address. These parameters are not affected by path bouncing. In addition, when SCTP migrates to another path, it starts with a minimal cwnd because of slow-start. Hence, there is little chance of packet reordering or duplication.

Second, even if all communication paths between the end nodes share the same bottleneck, the proposed method does not make the situation worse.
In case of congestion, the current standard tries to transmit data packets to the primary during failover, while the proposed method tries to explore other destinations. In either case, the same amount of data packets is sent to the same bottleneck.

5.3. Permanent Failover

When the primary path becomes active again after failover, SCTP migrates back to the primary path. After this, SCTP starts data transfer with a minimal cwnd. This is because SCTP must perform slow-start when it migrates to a new path. However, this might degrade communication performance when the performance of the alternative path is relatively good. In order to mitigate this effect of slow-start, permanent failover was proposed in [CARO02]. Permanent failover allows SCTP to remain on the alternative path even if the primary path becomes active again. This approach can improve performance in some cases; however, it requires more detailed analysis since it might affect the SCTP failover algorithm. Since we prefer to keep the current behavior of the standard as much as possible, we recommend not taking this approach for now.

5.4. Handling Error Counter

When multiple destinations are in the PF state, the sender may transmit heartbeats to multiple destinations at the same time. This allows the sender to quickly track and respond to network status changes. However, when all PF destinations become unavailable, this approach increases the total number of consecutive retransmissions more aggressively than the current SCTP specification does. Because of this aggressive increase, an SCTP association may be terminated earlier than under the standard [RFC4960].

One way to avoid early termination is to send retransmitted data or HB to only one PF destination at a time, but this approach may delay path status tracking. An alternative solution is to exclude HB timeouts from incrementing the error count.
The latter approach is preferred but requires an update to Section 8.3 of [RFC4960].

6. Socket API Considerations

This section describes how the socket API defined in [I-D.ietf-tsvwg-sctpsocket] is extended to provide a way for the application to control the quick failover behavior.

Please note that this section is informational only.

A socket API implementation based on [I-D.ietf-tsvwg-sctpsocket] is extended by adding a new read/write socket option for the level IPPROTO_SCTP and the name SCTP_PEER_ADDR_THLDS as described below. This socket option is used to read and write the value of the PFMR parameter described in Section 5.

Support for the SCTP_PEER_ADDR_THLDS socket option also needs to be added to the function sctp_opt_info().

6.1. Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) socket option

Applications can control the quick failover behavior by getting or setting the number of timeouts before a peer address is considered potentially failed or unreachable.

The following structure is used to access and modify the thresholds:

   struct sctp_paddrthlds {
     sctp_assoc_t spt_assoc_id;
     struct sockaddr_storage spt_address;
     uint16_t spt_pathmaxrxt;
     uint16_t spt_pathpfthld;
   };

spt_assoc_id:  This parameter is ignored for one-to-one style sockets. For one-to-many style sockets, the application may fill in an association identifier or SCTP_FUTURE_ASSOC for this query. It is an error to use SCTP_{CURRENT|ALL}_ASSOC in spt_assoc_id.

spt_address:  This specifies which peer address is of interest. If a wildcard address is provided, this socket option applies to all current and future peer addresses.

spt_pathmaxrxt:  Each peer address of interest is considered unreachable if its path error counter exceeds spt_pathmaxrxt.

spt_pathpfthld:  Each peer address of interest is considered potentially failed if its path error counter exceeds spt_pathpfthld.

7. Security Considerations

There are no new security considerations introduced in this document.

8. IANA Considerations

This document does not create any new registries or modify the rules for any existing registries managed by IANA.

9. References

9.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4960]  Stewart, R., "Stream Control Transmission Protocol", RFC 4960, September 2007.

9.2. Informative References

[CARO02]  Caro Jr., A., Iyengar, J., Amer, P., Heinz, G., and R. Stewart, "A Two-level Threshold Recovery Mechanism for SCTP", Tech report, CIS Dept, University of Delaware, July 2002.

[CARO04]  Caro Jr., A., Amer, P., and R. Stewart, "End-to-End Failover Thresholds for Transport Layer Multihoming", MILCOM 2004, November 2004.

[CARO05]  Caro Jr., A., "End-to-End Fault Tolerance using Transport Layer Multihoming", Ph.D Thesis, University of Delaware, January 2005.

[FALLON08]  Fallon, S., Jacob, P., Qiao, Y., Murphy, L., Fallon, E., and A. Hanley, "SCTP Switchover Performance Issues in WLAN Environments", IEEE CCNC 2008, January 2008.

[GRINNEMO04]  Grinnemo, K-J. and A. Brunstrom, "Performance of SCTP-controlled failovers in M3UA-based SIGTRAN networks", Advanced Simulation Technologies Conference, April 2004.

[I-D.ietf-tsvwg-sctpsocket]  Stewart, R., Tuexen, M., Poon, K., Lei, P., and V. Yasevich, "Sockets API Extensions for Stream Control Transmission Protocol (SCTP)", draft-ietf-tsvwg-sctpsocket-31 (work in progress), August 2011.

[IYENGAR06]  Iyengar, J., Amer, P., and R.
   Stewart, "Concurrent Multipath Transfer using SCTP Multihoming over Independent End-to-end Paths", IEEE/ACM Trans on Networking 14(5), October 2006.

[JUNGMAIER02]  Jungmaier, A., Rathgeb, E., and M. Tuexen, "On the use of SCTP in failover scenarios", World Multiconference on Systemics, Cybernetics and Informatics, July 2002.

[NATARAJAN09]  Natarajan, P., Ekiz, N., Amer, P., and R. Stewart, "Concurrent Multipath Transfer during Path Failure", Computer Communications, May 2009.

Authors' Addresses

   Yoshifumi Nishida
   WIDE Project
   Endo 5322
   Fujisawa, Kanagawa  252-8520
   Japan

   Email: nishida@wide.ad.jp

   Preethi Natarajan
   Cisco Systems
   510 McCarthy Blvd
   Milpitas, CA  95035
   USA

   Email: prenatar@cisco.com

   Armando Caro
   BBN Technologies
   10 Moulton St.
   Cambridge, MA  02138
   USA

   Email: acaro@bbn.com
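
As an informal illustration of the timing discussed in Section 3 and the state transitions listed in Section 5.1, the following C sketch computes how long a sender waits, under consecutive timeouts, before a destination leaves the active state. It is a simplified model under stated assumptions, not a normative algorithm: the enum and function names are hypothetical, time is counted in whole seconds, the RTO is assumed to double on every timeout up to a caller-supplied RTO.max, and acknowledgments are not modeled.

```c
#include <assert.h>

/* Hypothetical names; ordered so that "worse" states compare greater. */
enum path_state { PATH_ACTIVE, PATH_PF, PATH_INACTIVE };

/* State implied by `errors` consecutive timeouts: PF once the counter
 * exceeds PFMR, Inactive once it exceeds PMR (Section 5.1). */
enum path_state state_after_timeouts(unsigned errors, unsigned pfmr, unsigned pmr)
{
    if (errors > pmr)
        return PATH_INACTIVE;
    if (errors > pfmr)
        return PATH_PF;
    return PATH_ACTIVE;
}

/* Seconds of back-to-back timeouts until the destination first reaches
 * `target` or a later state. The RTO doubles after each timeout and is
 * capped at rto_max. */
unsigned seconds_until_state(enum path_state target, unsigned rto,
                             unsigned rto_max, unsigned pfmr, unsigned pmr)
{
    assert(rto > 0 && rto <= rto_max);
    assert(target != PATH_ACTIVE);
    unsigned elapsed = 0, errors = 0;
    while (state_after_timeouts(errors, pfmr, pmr) < target) {
        elapsed += rto;                 /* wait for the timer to expire */
        errors++;                       /* one more consecutive timeout */
        rto = (rto * 2 > rto_max) ? rto_max : rto * 2;
    }
    return elapsed;
}
```

With an initial RTO of 1 second, RTO.max of 60 seconds, PMR = 5, and PFMR = 0, `seconds_until_state(PATH_INACTIVE, 1, 60, 0, 5)` yields 63 seconds, matching the "over 60 seconds" figure in Section 3, while `seconds_until_state(PATH_PF, 1, 60, 0, 5)` yields 1 second, which is when a PF-aware sender would stop sending new data to the destination.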