idnits 2.17.1

draft-scheffenegger-tcpm-sack-loss-recovery-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (November 15, 2010) is 4909 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC1323' is defined on line 253, but no explicit
     reference was found in the text

  == Unused Reference: 'I-D.blanton-tcpm-3517bis' is defined on line 279, but
     no explicit reference was found in the text

  == Unused Reference: 'I-D.henderson-tcpm-rfc3782-bis' is defined on line
     285, but no explicit reference was found in the text

  == Unused Reference: 'I-D.ietf-tcpm-sack-recovery-entry' is defined on line
     291, but no explicit reference was found in the text

  == Unused Reference: 'LRSF' is defined on line 298, but no explicit
     reference was found in the text

  == Unused Reference: 'SimRTO' is defined on line 316, but no explicit
     reference was found in the text

  == Unused Reference: 'TCPLat' is defined on line 322, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 1323 (Obsoleted by RFC 7323)

  ** Obsolete normative reference: RFC 3517 (Obsoleted by RFC 6675)

  ** Obsolete normative reference: RFC 3782 (Obsoleted by RFC 6582)

  ** Downref: Normative reference to an Experimental RFC: RFC 5827

  == Outdated reference: A later version (-01) exists of
     draft-blanton-tcpm-3517bis-00

  -- Obsolete informational reference (is this intentional?): RFC 793
     (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 896
     (Obsoleted by RFC 7805)

     Summary: 4 errors (**), 0 flaws (~~), 9 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   TCP Maintenance and Minor                               R. Scheffenegger
3   Extensions (tcpm)                                           NetApp, Inc.
4   Internet-Draft                                         November 15, 2010
5   Intended status: Standards Track
6   Expires: May 19, 2011

8                Improving SACK-based loss recovery for TCP
9              draft-scheffenegger-tcpm-sack-loss-recovery-00

11  Abstract

13     This note clarifies the behavior of TCP SACK while doing loss
14     recovery close to the end-of-stream.  This allows TCP SACK to never
15     exhibit worse loss recovery characteristics than TCP NewReno under
16     identical circumstances.

18  Status of this Memo

20     This Internet-Draft is submitted in full conformance with the
21     provisions of BCP 78 and BCP 79.

23     Internet-Drafts are working documents of the Internet Engineering
24     Task Force (IETF).  Note that other groups may also distribute
25     working documents as Internet-Drafts.  The list of current Internet-
26     Drafts is at http://datatracker.ietf.org/drafts/current/.

28     Internet-Drafts are draft documents valid for a maximum of six months
29     and may be updated, replaced, or obsoleted by other documents at any
30     time.  It is inappropriate to use Internet-Drafts as reference
31     material or to cite them other than as "work in progress."

33     This Internet-Draft will expire on May 19, 2011.

35  Copyright Notice

37     Copyright (c) 2010 IETF Trust and the persons identified as the
38     document authors.  All rights reserved.

40     This document is subject to BCP 78 and the IETF Trust's Legal
41     Provisions Relating to IETF Documents
42     (http://trustee.ietf.org/license-info) in effect on the date of
43     publication of this document.  Please review these documents
44     carefully, as they describe your rights and restrictions with respect
45     to this document.  Code Components extracted from this document must
46     include Simplified BSD License text as described in Section 4.e of
47     the Trust Legal Provisions and are provided without warranty as
48     described in the Simplified BSD License.

50  Table of Contents

52     1.  Introduction
53     2.  Requirements Language
54     3.  Overview
55     4.  Definitions
56     5.  Algorithm
57     6.  Discussion
58       6.1.  Reordering
59     7.  Acknowledgements
60     8.  IANA Considerations
61     9.  Security Considerations
62     10. References
63       10.1.  Normative References
64       10.2.  Informative References
65     Appendix A.  Scenarios
66       A.1.  Basic Case
67       A.2.  Data delay ~1 RTT
68       A.3.  Data reordering
69     Author's Address

71  1.  Introduction

73     Selective Acknowledgement (SACK) is widely used to identify exactly
74     which TCP segments were lost and to retransmit only those segments
75     during a recovery episode.  This helps improve the effectiveness of
76     loss recovery and aligns with the principle of packet conservation.
77     When no SACK information is available, TCP senders typically revert
78     to the [RFC3782] NewReno fast retransmit / fast recovery
79     algorithm.  As the method of last resort, retransmission timeouts
80     (RTO) are used to perform loss recovery.

83     When multiple segments of a window are lost, including one or more
84     segments directly prior to the end-of-stream, TCP sessions making use
85     of [RFC3517] SACK suffer worse loss recovery performance than TCP
86     sessions utilizing [RFC3782] NewReno.  When this happens, TCP SACK has
87     to revert to retransmission timeout (RTO) for loss recovery.

89     An algorithm is described that allows the complete and timely
90     recovery at the end-of-stream.  The aim of this algorithm is to
91     address one corner case of TCP SACK.  The timeliness of recovery for
92     TCP SACK is improved to that of TCP NewReno.  Overall, this minor
93     change will reduce the prevalence of RTOs.

95  2.  Requirements Language

97     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
98     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
99     document are to be interpreted as described in [RFC2119].

101 3.  Overview

103    TCP SACK Loss Recovery [RFC3517] was designed to reduce the number of
104    unnecessary retransmissions to close to zero and also to recover from
105    multiple segment loss within a single window without reverting to a
106    retransmission timeout.

108    In addition, [RFC2018] specifically stipulated up to which point a
109    SACK enabled sender may promote segments to become eligible for
110    retransmission under the SACK scheme.
       This heuristic works very well
111    during bulk transfers, where the sender always has additional data to
112    transmit.  Close to the end of a stream, when there is no more data
113    in the socket to send, segments that are still outstanding and were
114    never acknowledged cannot become eligible for retransmission.

116    When this happens, TCP SACK performance degrades and becomes worse
117    than the performance of TCP NewReno [RFC3782].  TCP NewReno can
118    recover such a set of loss events without reverting to RTO loss
119    recovery.

121    This aspect may seem unimportant at first glance.  However, when
122    TCP is used for boundary synchronized transactions,
123    where applications regularly stall transmitting data, end-of-stream
124    performance can dominate the transfer.  Such streams are very
125    frequently application limited during their existence (see definition
126    in [RFC5827]), and the performance penalty of TCP SACK often requires
127    the use of TCP NewReno despite its worse overall network
128    efficiency.

130 4.  Definitions

132    The reader is expected to be familiar with the definitions given in
133    [RFC5827], [RFC5681], [RFC3517] and [RFC2018].

135    SND.FACK (forward acknowledgment) is used to describe the highest
136    sequence number that has been SACKed by the receiver and subsequently
137    seen by the sender.  The full definition can be found in [MM96a] and
138    [MM96b].

140    End-of-stream is used similarly to the definition of small congestion
141    windows in [RFC5827], with the exception of small congestion windows
142    due to TCP congestion control.  End-of-stream indicates that the TCP
143    sender has no additional unsent data in the sender socket, or may
144    wait for enough data to accumulate before sending (Nagle's Algorithm
145    [RFC0896]).

147 5.  Algorithm

149    The key observation is that when the receiver sends out a cumulative
150    ACK with no SACK entries, all data delivered to the receiver is fully
151    contiguous but some segments are potentially lost.
       In NewReno loss
152    recovery, any cumulative ACK below "recover" triggers a single
153    retransmission regardless of whether NewReno is at end-of-stream or
154    in continuous transfer.

156    TCP SACK already performs at least as well as TCP NewReno, as long as
157    the sender can continue to inject new data into the network.  The
158    modification outlined below ensures that TCP SACK can perform as
159    well as NewReno under a wider range of circumstances.

161    This algorithm is only applicable when the TCP sender has SACK
162    enabled for the TCP connection, and also maintains a variable
163    SND.FACK.

165    A.  A TCP sender SHOULD NOT exit loss recovery if it receives a
166        cumulative ACK for a sequence number greater than RecoveryPoint
167        while it is at end-of-stream.  Any necessary congestion window
168        adjustments SHALL still be performed.

170    B.  A TCP sender using this algorithm MUST perform the following
171        steps upon the receipt of a cumulative ACK containing no SACK
172        information, while it is in loss recovery.

174        1.  Process ACK information per the loss recovery algorithm
175            outlined in [RFC3517].

177        2.  If the ACK contains no SACK information, cumulatively
178            acknowledges all data up to SND.FACK (SND.UNA == SND.FACK),
179            some data is still outstanding (SND.UNA < SND.MAX), the TCP
180            sender may send additional data (cwnd - Pipe >= 1 SMSS), and
181            the TCP sender has no additional data to send beyond SND.MAX,
182            the TCP sender SHOULD transmit one segment.

184    In order to achieve timely recovery, the retransmission timer MUST NOT
185    be reset when this algorithm performs a retransmission.  This is in
186    strict compliance with [RFC0793].

188 6.  Discussion

190    This algorithm does not deviate from current implementations of SACK
191    loss recovery for bulk transfers.  However, at the end-of-stream,
192    when there is no data to advance SND.MAX, this heuristic allows the
193    recovery of segments similar to NewReno loss recovery.
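
       As an illustration of the condition in step B.2 of Section 5, the
       following is a minimal sketch and not a reference implementation.
       The SenderState class, its field names and the function name are
       hypothetical and do not appear in this document.  The sketch
       assumes that step B.1 has already updated the state, ignores
       sequence number wraparound, and approximates end-of-stream as "no
       unsent bytes queued", which leaves out the Nagle case from
       Section 4.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    """Hypothetical snapshot of sender variables after step B.1 has run."""
    snd_una: int           # oldest unacknowledged sequence number (SND.UNA)
    snd_fack: int          # highest sequence number SACKed so far (SND.FACK)
    snd_max: int           # highest sequence number sent so far (SND.MAX)
    cwnd: int              # congestion window, in bytes
    pipe: int              # estimate of bytes in flight (Pipe in RFC 3517)
    smss: int              # sender maximum segment size, in bytes
    unsent_bytes: int      # new data queued beyond SND.MAX (assumed field)
    in_loss_recovery: bool


def should_transmit_one_segment(state: SenderState, ack_has_sack: bool) -> bool:
    """Return True when every condition listed in step B.2 holds.

    Plain integer comparison is used; a real stack would need
    wraparound-safe sequence number arithmetic.
    """
    return (
        state.in_loss_recovery
        and not ack_has_sack                        # cumulative ACK, no SACK blocks
        and state.snd_una == state.snd_fack         # ACK covers everything SACKed
        and state.snd_una < state.snd_max           # some data still outstanding
        and state.cwnd - state.pipe >= state.smss   # room for one full segment
        and state.unsent_bytes == 0                 # nothing new to send
    )


# Example: two segments of 1000 bytes are still outstanding past SND.FACK.
example = SenderState(snd_una=6000, snd_fack=6000, snd_max=8000,
                      cwnd=4000, pipe=2000, smss=1000,
                      unsent_bytes=0, in_loss_recovery=True)
print(should_transmit_one_segment(example, ack_has_sack=False))
```

       When the function returns True, the sender would transmit one
       segment and, per Section 5, leave the retransmission timer
       untouched.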
       If the loss
194    occurs during times where cwnd is very small, or when the ACK clock
195    fails, this approach still falls back to RTO loss recovery.

197    For the case of only a few (2-3) segments lost in the last window
198    before the end-of-stream, which this algorithm addresses, no spurious
199    retransmissions are performed unless a reordering delay above 1 RTT
200    occurs and a cumulative ACK is received by the sender in the
201    meantime.  This property of the outlined algorithm is identical to
202    that of TCP NewReno.

204    The aspects of packet conservation, timely loss recovery and
205    avoidance of retransmission timeouts have led to allowing only a
206    single segment to be recovered per RTT.

208 6.1.  Reordering

210    If the last segment(s) at the end-of-stream are not lost, but delayed,
211    three different cases may result:

213    If RTT > RTO(min), and reordering delay >= RTT:  No change in the
214       sender behavior, all segments may be retransmitted spuriously.
215       Without this algorithm they are retransmitted due to RTO; with
216       this algorithm the retransmitted segments may be clocked out by
217       ACKs.  Slow-start may be postponed somewhat, relieving acute
218       network congestion slightly.

220    If RTT < RTO(min), and reordering delay between RTT and RTO(min):  Some
221       spurious retransmits can happen, but retransmissions will again
222       occur at most 1 segment per RTT.  A premature, spurious RTO may
223       be avoided.

225    If reordering delay < RTT:  The TCP sender will not see a cumulative
226       ACK without SACK entries, thus SND.UNA will remain lower than
227       SND.FACK.  The TCP sender behavior is therefore unchanged.

229 7.  Acknowledgements

231    The author would like to thank Matt Mathis for the insightful
232    discussions about SACK and its intended behavior and the spirit
233    driving the design of SACK.

235    Furthermore, valuable feedback was received from Ethan Blanton,
236    Yoshifumi Nishida and John Heffner.

238    Dragana Damjanovic was very helpful in reviewing the text.

240 8.  IANA Considerations

242    This memo includes no request to IANA.

244 9.  Security Considerations

246    The algorithm presented in this paper shares security considerations
247    with [RFC2018] and [RFC3517].

249 10.  References

251 10.1.  Normative References

253    [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
254               for High Performance", RFC 1323, May 1992.

256    [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
257               Selective Acknowledgment Options", RFC 2018, October 1996.

259    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
260               Requirement Levels", BCP 14, RFC 2119, March 1997.

262    [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
263               Conservative Selective Acknowledgment (SACK)-based Loss
264               Recovery Algorithm for TCP", RFC 3517, April 2003.

266    [RFC3782]  Floyd, S., Henderson, T., and A. Gurtov, "The NewReno
267               Modification to TCP's Fast Recovery Algorithm", RFC 3782,
268               April 2004.

270    [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
271               Control", RFC 5681, September 2009.

273    [RFC5827]  Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and
274               P. Hurtig, "Early Retransmit for TCP and Stream Control
275               Transmission Protocol (SCTP)", RFC 5827, May 2010.

277 10.2.  Informative References

279    [I-D.blanton-tcpm-3517bis]
280               Blanton, E., Allman, M., Jarvinen, I., and M. Kojo, "A
281               Conservative Selective Acknowledgment (SACK)-based Loss
282               Recovery Algorithm for TCP", draft-blanton-tcpm-3517bis-00
283               (work in progress), October 2010.

285    [I-D.henderson-tcpm-rfc3782-bis]
286               Floyd, S., Henderson, T., and A. Gurtov, "The NewReno
287               Modification to TCP's Fast Recovery Algorithm",
288               draft-henderson-tcpm-rfc3782-bis-01 (work in progress),
289               October 2010.

291    [I-D.ietf-tcpm-sack-recovery-entry]
292               Jarvinen, I. and M.
                  Kojo, "Using TCP Selective
293               Acknowledgement (SACK) Information to Determine Duplicate
294               Acknowledgements for Loss Recovery Initiation",
295               draft-ietf-tcpm-sack-recovery-entry-01 (work in progress),
296               March 2010.

298    [LRSF]     Hurtig, P., Garcia, J., and A. Brunstrom, "Loss Recovery
299               in Short TCP/SCTP Flows", Dec 2006.

302    [MM96a]    Mathis, M. and J. Mahdavi, "Forward Acknowledgment:
303               Refining TCP Congestion Control", Aug 1996.

306    [MM96b]    Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
307               Parameters", Sep 2004.

310    [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
311               RFC 793, September 1981.

313    [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
314               RFC 896, January 1984.

316    [SimRTO]   Guan, Y., Van den Broeck, B., Potemans, J., Theunis, J.,
317               Li, D., Van Lil, E., and A. Van de Capelle, "Simulation
318               Study of TCP Eifel Algorithms", 2005.

322    [TCPLat]   Cardwell, N., Savage, S., and T. Anderson, "Modeling TCP
323               Latency", Mar 2000.

325 Appendix A.  Scenarios

327    For clarity, each segment is denoted by a single number.  Note
328    that the ACKs are also labeled with the segment they acknowledge, not
329    the next sequence number.  For example, S1 may span sequence numbers
330    1000-1999, while the acknowledgement A1 carries the sequence number
331    2000.  If an acknowledgement also carries SACK information, the SACK
332    entries are listed after a colon.  A hyphen denotes which segments a
333    single SACK entry spans.  For simplicity, all segments are SMSS sized.

335 A.1.  Basic Case

337    In this scenario, the sender has no more data to send past S7.
338    Reordering of data segments or ACKs and ACK losses are absent
339    from this scenario.

341    ACK    TX    RX    ACK
342    Rcvd    Seg    Seg    Sent

344    A00
345    S1    S1
346    S2    (dropped)
347    A1
348    A0
349    S3    S3
350    S4    S4    A1,3
351    A1,3-4
352    A1
353    S5    S5
354    S6    (dropped)
355    A1,3-5
356    A1,3
357    S7    (dropped)
358    ---
359    A1,3-4
360    ---
361    A1,3-5
362    S2    S2
363    A5
364    A5
365    S6    S6
366    A6
367    A6
368    S7    S7
369    A7
370    A7

372                       end-of-stream loss recovery

374 A.2.  Data delay ~1 RTT

376    In this scenario, segments S6 and S7 are not dropped, but delayed by
377    about 1 RTT, while the RTT is smaller than the minimum allowed
378    retransmission timeout threshold RTO(min).  Segments that are delayed
379    by less than 1 RTT are not retransmitted.  Segments delayed more than
380    1 RTT are either retransmitted by this algorithm, or by RTO loss
381    recovery.

383    ACK    TX    RX    ACK
384    Rcvd    Seg    Seg    Sent

386    A00
387    S1    S1
388    S2    (dropped)
389    A1
390    A0
391    S3    S3
392    S4    S4    A1,3
393    A1,3-4
394    A1
395    S5    S5
396    S6    (delayed)
397    A1,3-5
398    A1,3
399    S7    (delayed)
400    ---
401    A1,3-4
402    ---
403    A1,3-5    S6
404    S2    S2    A1,3-6
405    A6
406    A1,3-6    S7
407    ---    A7
408    A6
409    S7    S7
410    A7
411    A7

413    A7

415                    end-of-stream segment delay < RTT

417 A.3.  Data reordering

419    In this case, the segments S6 and S7 are delivered out of order.
420    This is a normal SACK recovery event.

422    ACK    TX    RX    ACK
423    Rcvd    Seg    Seg    Sent

425    A00
426    S1    S1
427    S2    (dropped)
428    A1
429    A0
430    S3    S3
431    S4    S4    A1,3
432    A1,3-4
433    A1
434    S5    S5
435    S6    (reordered)
436    A1,3-5
437    A1,3
438    S7    S7
439    ---    A1,3-5,7
440    A1,3-4
441    ---    S6
442    A1,3-7
443    A1,3-5
444    S2    S2
445    A7
446    A1,3-5,7
447    S6
448    A7
449    A1,3-7
450    ---

452    A7

454    A7

456                   end-of-stream segment reorder < RTT

458 Author's Address

460    Richard Scheffenegger
461    NetApp, Inc.
462    Am Euro Platz 2
463    Vienna,   1120
464    Austria

466    Phone: +43 1 3676811 3146
467    Email: rs@netapp.com