idnits 2.17.1 draft-allman-tcp-sack-10.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-03-28) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 59 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There is 1 instance of too long lines in the document, the longest one being 1 character in excess of 72. ** There are 22 instances of lines with control characters in the document. ** The abstract seems to contain references ([RFC2119], [RFC2581]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'A' is mentioned on line 118, but not defined == Missing Reference: 'B' is mentioned on line 118, but not defined ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 2582 (Obsoleted by RFC 3782) -- Obsolete informational reference (is this intentional?): RFC 2988 (Obsoleted by RFC 6298) Summary: 7 errors (**), 0 flaws (~~), 4 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force Ethan Blanton 3 INTERNET DRAFT Ohio University 4 File: draft-allman-tcp-sack-10.txt Mark Allman 5 BBN/NASA GRC 6 Kevin Fall 7 Intel Research 8 May, 2002 9 Expires: November, 2002 11 A Conservative SACK-based Loss Recovery Algorithm for TCP 13 Status of this Memo 15 This document is an Internet-Draft and is in full conformance with 16 all provisions of Section 10 of [RFC2026]. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as 21 Internet-Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other documents 25 at any time. It is inappropriate to use Internet-Drafts as 26 reference material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 Abstract 36 This document presents a conservative loss recovery algorithm 37 for TCP that is based on the use of the selective acknowledgment 38 TCP option. The algorithm presented in this document conforms 39 to the spirit of the current congestion control specification 40 [RFC2581], but allows TCP senders to recover more effectively 41 when multiple segments are lost from a single flight of data. 43 Terminology 45 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 46 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 47 document are to be interpreted as described in RFC 2119 [RFC2119]. 49 1 Introduction 51 This document presents a conservative loss recovery algorithm for 52 TCP that is based on the use of the selective acknowledgment TCP 53 option. While the TCP selective acknowledgment (SACK) option 54 [RFC2018] is being steadily deployed in the Internet [All00] there 55 is evidence that hosts are not using the SACK information when 56 making retransmission and congestion control decisions [PF01]. The 57 goal of this document is to outline one straightforward method for 58 TCP implementations to use SACK information to increase performance. 60 [RFC2581] allows advanced loss recovery algorithms to be used by TCP 61 [RFC793] provided that they follow the spirit of TCP's congestion 62 control algorithms [RFC2581,RFC2914]. [RFC2582] outlines one such 63 advanced recovery algorithm called NewReno. This document outlines 64 a loss recovery algorithm that uses the selective acknowledgment 65 (SACK) [RFC2018] TCP option to enhance TCP's loss recovery. The 66 algorithm outlined in this document, heavily based on the algorithm 67 detailed in [FF96], is a conservative replacement of the fast 68 recovery algorithm [Jac90,RFC2581]. The algorithm specified in this 69 document is a straightforward SACK-based loss recovery strategy that 70 follows the guidelines set in [RFC2581] and can safely be used in 71 TCP implementations. Alternate SACK-based loss recovery methods can 72 be used in TCP as implementers see fit (as long as the alternate 73 algorithms follow the guidelines provided in [RFC2581]). Please 74 note, however, that the SACK-based decisions in this document (such 75 as what segments are to be sent at what time) are largely decoupled 76 from the congestion control algorithms, and as such can be treated 77 as separate issues if so desired. 79 2 Definitions 81 The reader is expected to be familiar with the definitions given in 82 [RFC2581]. 84 The reader is assumed to be familiar with selective acknowledgments 85 as specified in [RFC2018]. 87 For the purposes of explaining the SACK-based loss recovery 88 algorithm we define four variables that a TCP sender stores: 90 ``HighACK'' is the sequence number of the highest byte of 91 data that has been cumulatively ACKed at a given point. 93 ``HighData'' is the highest sequence number transmitted at a 94 given point. 96 ``HighRxt'' is the highest sequence number which has been 97 retransmitted during the current loss recovery phase. 99 ``Pipe'' is a sender's estimate of the number of bytes 100 outstanding in the network. This is used during recovery 101 for limiting the sender's sending rate. The pipe variable 102 allows TCP to use a fundamentally different congestion 103 control than specified in [RFC2581]. The algorithm is often 104 referred to as the ``pipe algorithm''. 106 For the purposes of this specification we define a ``duplicate 107 acknowledgment'' as an acknowledgment (ACK) whose cumulative ACK 108 number is equal to the current value of HighACK, as described in 109 [RFC2581]. 111 We define a variable ``DupThresh'' that holds the number of 112 duplicate acknowledgments required to trigger a retransmission. Per 113 [RFC2581] this threshold is defined to be 3 duplicate 114 acknowledgments. However, implementers should consult any updates 115 to [RFC2581] to determine the current value for DupThresh (or method 116 for determining its value). 118 Finally, a range of sequence numbers [A,B] is said to ``cover'' 119 sequence number S if A <= S <= B. 121 3 Keeping Track of SACK Information 123 For a TCP sender to implement the algorithm defined in the next 124 section it must keep a data structure to store incoming 125 selective acknowledgment information on a per connection basis. 126 Such a data structure is commonly called the ``scoreboard''. 127 The specifics of the scoreboard data structure are out of scope 128 for this document (as long as the implementation can perform all 129 functions required by this specification). 131 Note that while this document speaks of marking and keeping 132 track of octets, a real world implementation would probably want 133 to keep track of octet ranges or otherwise collapse the data 134 while ensuring that arbitrary ranges are still markable. 136 4 Processing and Acting Upon SACK Information 138 For the purposes of the algorithm defined in this document the 139 scoreboard SHOULD implement the following functions: 141 Update (): 143 Given the information provided in an ACK, each octet that is 144 cumulatively ACKed or SACKed should be marked accordingly in 145 the scoreboard data structure, and the total number of 146 octets SACKed should be recorded. 148 Note: SACK information is advisory and therefore SACKed data 149 MUST NOT be removed from TCP's retransmission buffer until the 150 data is cumulatively acknowledged [RFC2018]. 152 NextSeg (): 154 This routine uses the scoreboard data structure maintained by 155 the Update() function to determine what to transmit based on 156 the SACK information that has arrived from the data receiver 157 (and hence been marked in the scoreboard). NextSeg () MUST 158 return the sequence number range of the next segment that is 159 to be transmitted, per the following rules: 161 (1) If there exists a smallest unSACKed sequence number 'S1' 162 such that HighRxt < S1 < HighData and there are either 163 DupThresh * SMSS octets above S1 which have been SACKed or 164 the number of discontiguous SACKed sequence spaces above S1 165 is greater than DupThresh, S1 is presumed to have been lost 166 and the sequence range of one segment of up to SMSS octets 167 starting with S1 MUST be returned. 169 (2) If no sequence number 'S1' per rule (1) exists but there 170 exists available unsent data and the receiver's advertised 171 window allows, the sequence range of one segment of up to 172 SMSS octets of previously unsent data starting with sequence 173 number HighData+1 MUST be returned. 175 (3) If the conditions for rules (1) and (2) fail, but there 176 exists an unSACKed sequence number 'S2' such that HighRxt < 177 S2 < HighData, one segment of up to SMSS octets starting 178 with S2 MUST be returned. Note that this segment need not 179 meet the additional requirements in (1). 181 (4) If the conditions for each of (1), (2), and (3) are not 182 met, then NextSeg () MUST indicate failure, and no segment 183 is returned. 185 AmountSACKed (RangeBegin,RangeEnd): 187 This routine MUST return the total number of octets which fall 188 between RangeBegin and RangeEnd that have been selectively 189 acknowledged by the receiver. 191 Note: The SACK-based loss recovery algorithm outlined in this 192 document requires more computational resources than previous TCP 193 loss recovery strategies. However, we believe the scoreboard data 194 structure can be implemented in a reasonably efficient manner (both 195 in terms of computation complexity and memory usage) in most TCP 196 implementations. 198 5 Algorithm Details 200 Upon the receipt of any ACK containing SACK information, the 201 scoreboard MUST be updated via the Update () routine. 203 Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the 204 scoreboard is to be updated as normal. Note: The first and second 205 duplicate ACKs can also be used to trigger the transmission of 206 previously unsent segments using the Limited Transmit algorithm 207 [RFC3042]. 209 When a TCP sender receives the duplicate ACK corresponding to 210 DupThresh ACKs, the scoreboard MUST be updated with the new SACK 211 information (via Update ()) and a loss recovery phase SHOULD be 212 initiated, per the fast retransmit algorithm outlined in [RFC2581], 213 and in doing so the following steps MUST be taken: 215 (1) pipe = HighData - HighACK - AmountSACKed (HighACK + 1,HighData) 217 Set a ``pipe'' variable to the number of outstanding octets 218 currently ``in the pipe''; this is the data which has been 219 sent by the TCP sender but for which no cumulative or 220 selective acknowledgment has been received. This data is 221 assumed to be still traversing the network path. 223 (2) RecoveryPoint = HighData 225 When the TCP sender receives a cumulative ACK for this data 226 octet the loss recovery phase is terminated. 228 (3) ssthresh = cwnd = (FlightSize / 2) 230 The congestion window (cwnd) and slow start threshold 231 (sstrhesh) are reduced to half of FlightSize per [RFC2581]. 233 (4) Retransmit the first data segment presumed dropped -- the 234 segment starting with sequence number HighACK + 1. To 235 prevent repeated retransmission of the same data, set 236 HighRxt to the highest sequence number in the retransmitted 237 segment. 239 (5) In order to take advantage of potential additional available 240 cwnd, proceed to step (D) below. 242 Once a TCP is in the loss recovery phase the following procedure 243 MUST be used for each arriving ACK: 245 (A) An incoming cumulative ACK for a sequence number greater than 246 RecoveryPoint signals the end of loss recovery and the loss 247 recovery phase MUST be terminated. Any information contained in 248 the scoreboard for sequence numbers greater than the new value 249 of HighACK SHOULD NOT be cleared when leaving the loss recovery 250 phase. 252 (B) Upon receipt of a duplicate ACK the following actions MUST be 253 taken: 255 (B.1) Use Update () to record the new SACK information conveyed 256 by the incoming ACK. 258 (B.2) The pipe variable is decremented by the number of newly 259 SACKed data octets conveyed in the incoming ACK (i.e., those 260 octets that are being SACKed for the first time), as that is 261 the amount of new data presumed to have left the network. 263 (C) When a ``partial ACK'' (an ACK that increases the HighACK point, 264 but does not terminate loss recovery) arrives, the following 265 actions MUST be performed: 267 (C.1) Before updating HighACK based on the received cumulative 268 ACK, save HighACK as OldHighACK. 270 (C.2) The scoreboard MUST be updated based on the cumulative ACK 271 and any new SACK information that is included in the ACK via 272 the Update () routine. 274 (C.3) The value of pipe MUST be decremented by the number of 275 octets that have left the network path using the following 276 equation: 278 pipe = pipe - ((HighACK - OldHighACK) - 279 AmountSACKed (OldHighACK + 1, HighACK)) 281 (C.4) The value of pipe MUST be decremented by the number of 282 newly SACKed data octets conveyed in the incoming ACK (i.e., 283 those octets that are being SACKed for the first time), as 284 these octets represent data that has left the network. 286 (D) While pipe is less than cwnd the TCP sender SHOULD transmit one 287 or more segments as follows: 289 (D.1) The scoreboard MUST be queried via NextSeg () for the 290 sequence number range of the next segment to transmit, and 291 the given segment sent. 293 (D.2) If any of the data octets sent in (D.1) are above 294 HighData, the pipe variable MUST be incremented by the 295 number of data octets previously unsent in (D.1). 297 (D.3) If any of the data octets sent in (D.1) are below 298 HighData, HighRxt MUST be set to the highest sequence number 299 of the segment retransmitted. 301 (D.4) If any of the data octets sent in (D.1) are above 302 HighData, HighData must be updated to reflect the 303 transmission of previously unsent data. 305 (D.5) If cwnd - pipe is greater than 1 SMSS, return to (D.1) 307 5.1 Retransmission Timeouts 309 In order to avoid memory deadlocks, the TCP receiver is allowed to 310 discard data that has already been acknowledged with a selective 311 acknowledgment. As a result [RFC2018] suggests that a TCP sender 312 SHOULD expunge the SACK information gathered from a receiver upon a 313 retransmission timeout ``since the timeout might indicate that the 314 data receiver has reneged.'' Additionally, a TCP sender MUST 315 ``ignore prior SACK information in determining which data to 316 retransmit.'' However, a SACK TCP sender SHOULD still use all SACK 317 information made available during the slow start phase of loss 318 recovery following an RTO. 320 As described in Sections 4 and 5, Update () MAY continue to be 321 used appropriately upon receipt of ACKs. This will allow the 322 slow start recovery period to benefit from all available 323 information provided by the receiver, despite the fact that SACK 324 information was expunged due to the RTO. 326 If there are segments missing from the receiver's buffer following 327 processing of the retransmitted segment, the corresponding ACK will 328 contain SACK information. In this case, a TCP sender SHOULD use 329 this SACK information by using the NextSeg () routine to determine 330 what data should be sent in each segment of the slow start. 332 6 Managing the RTO Timer 334 The standard TCP RTO estimator is defined in [RFC2988]. Due to 335 the fact that the SACK algorithm in this document can have an 336 impact on the behavior of the estimator, implementers may wish 337 to consider how the timer is managed. [RFC2988] calls for the 338 RTO timer to be re-armed each time an ACK arrives that advances 339 the cumulative ACK point. Because the algorithm presented in 340 this document can keep the ACK clock going through a fairly 341 significant loss event, (comparatively longer than the algorithm 342 described in [RFC2581]), on some networks the loss event could 343 last longer than the RTO. In this case the RTO timer would 344 expire prematurely and a segment that need not be retransmitted 345 would be resent. 347 Therefore we give implementers the latitude to use the standard 348 [RFC2988] style RTO management or, optionally, a more careful 349 variant that re-arms the RTO timer on each retransmission that 350 is sent during recovery MAY be used. This provides a more 351 conservative timer than specified in [RFC2988], and so may not 352 always be an attractive alternative. However, in some cases it 353 may prevent needless retransmissions, go-back-N transmission and 354 further reduction of the congestion window. 356 7 Research 358 The algorithm specified in this document is analyzed in [FF96], 359 which shows that the above algorithm is effective in reducing 360 transfer time over standard TCP Reno [RFC2581] when multiple 361 segments are dropped from a window of data (especially as the number 362 of drops increases). [AHKO97] shows that the algorithm defined in 363 this document can greatly improve throughput in connections 364 traversing satellite channels. 366 8 Security Considerations 368 The algorithm presented in this paper shares security considerations 369 with [RFC2581]. A key difference is that an algorithm based on 370 SACKs is more robust against attackers forging duplicate ACKs to 371 force the TCP sender to reduce cwnd. With SACKs, TCP senders have an 372 additional check on whether or not a particular ACK is legitimate. 373 While not fool-proof, SACK does provide some amount of protection in 374 this area. 376 Acknowledgments 378 The authors wish to thank Sally Floyd for encouraging this document 379 and commenting on early drafts. The algorithm described in this 380 document is largely based on an algorithm outlined by Kevin Fall and 381 Sally Floyd in [FF96], although the authors of this document assume 382 responsibility for any mistakes in the above text. Murali Bashyam, 383 Reiner Ludwig, Jamshid Mahdavi, Matt Mathis, Shawn Ostermann, Vern 384 Paxson and Venkat Venkatsubra provided valuable feedback on earlier 385 versions of this document. Finally, we thank Matt Mathis and 386 Jamshid Mahdavi for implementing the scoreboard in ns and hence 387 guiding our thinking in keeping track of SACK state. 389 Normative References 391 [RFC793] Jon Postel, Transmission Control Protocol, STD 7, RFC 793, 392 September 1981. 394 [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. TCP Selective 395 Acknowledgment Options. RFC 2018, October 1996 397 [RFC2026] Scott Bradner. The Internet Standards Process -- Revision 398 3, RFC 2026, October 1996 400 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 401 Requirement Levels", BCP 14, RFC 2119, March 1997. 403 [RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens, TCP 404 Congestion Control, RFC 2581, April 1999. 406 Non-Normative References 408 [AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann. TCP 409 Performance Over Satellite Links. Proceedings of the Fifth 410 International Conference on Telecommunications Systems, 411 Nashville, TN, March, 1997. 413 [All00] Mark Allman. A Web Server's View of the Transport Layer. ACM 414 Computer Communication Review, 30(5), October 2000. 416 [FF96] Kevin Fall and Sally Floyd. Simulation-based Comparisons of 417 Tahoe, Reno and SACK TCP. Computer Communication Review, July 418 1996. 420 [Jac90] Van Jacobson. Modified TCP Congestion Avoidance Algorithm. 421 Technical Report, LBL, April 1990. 423 [PF01] Jitendra Padhye, Sally Floyd. Identifying the TCP Behavior 424 of Web Servers, ACM SIGCOMM, August 2001. 426 [RFC2582] Sally Floyd and Tom Henderson. The NewReno Modification 427 to TCP's Fast Recovery Algorithm, RFC 2582, April 1999. 429 [RFC2914] Sally Floyd. Congestion Control Principles, RFC 2914, 430 September 2000. 432 [RFC2988] Vern Paxson, Mark Allman. Computing TCP's Retransmission 433 Timer, RFC 2988, November 2000. 435 [RFC3042] Mark Allman, Hari Balkrishnan, Sally Floyd. Enhancing 436 TCP's Loss Recovery Using Limited Transmit. RFC 3042, 437 January 2001 439 Author's Addresses: 441 Ethan Blanton 442 Ohio University Internetworking Research Lab 443 Stocker Center 444 Athens, OH 45701 445 eblanton@irg.cs.ohiou.edu 447 Mark Allman 448 BBN Technologies/NASA Glenn Research Center 449 Lewis Field 450 21000 Brookpark Rd. MS 54-5 451 Cleveland, OH 44135 452 Phone: 216-433-6586 453 Fax: 216-433-8705 454 mallman@bbn.com 455 http://roland.grc.nasa.gov/~mallman 457 Kevin Fall 458 Intel Research 459 2150 Shattuck Ave., PH Suite 460 Berkeley, CA 94704 461 kfall@intel-research.net