idnits 2.17.1 draft-allman-tcp-sack-08.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-25) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 59 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 2 instances of too long lines in the document, the longest one being 1 character in excess of 72. ** There are 22 instances of lines with control characters in the document. ** The abstract seems to contain references ([RFC2119]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'A' is mentioned on line 118, but not defined == Missing Reference: 'B' is mentioned on line 118, but not defined -- Possible downref: Non-RFC (?) normative reference: ref. 'AHKO97' -- Possible downref: Non-RFC (?) normative reference: ref. 'All00' -- Possible downref: Non-RFC (?) normative reference: ref. 'FF96' -- Possible downref: Non-RFC (?) normative reference: ref. 'Jac90' -- Possible downref: Non-RFC (?) normative reference: ref. 'PF01' ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) ** Obsolete normative reference: RFC 2582 (Obsoleted by RFC 3782) Summary: 9 errors (**), 0 flaws (~~), 4 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force Ethan Blanton 3 INTERNET DRAFT Ohio University 4 File: draft-allman-tcp-sack-08.txt Mark Allman 5 BBN/NASA GRC 6 Kevin Fall 7 Intel Research 8 November, 2001 9 Expires: May, 2002 11 A Conservative SACK-based Loss Recovery Algorithm for TCP 13 Status of this Memo 15 This document is an Internet-Draft and is in full conformance with 16 all provisions of Section 10 of [RFC2026]. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as 21 Internet-Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six 24 months and may be updated, replaced, or obsoleted by other documents 25 at any time. It is inappropriate to use Internet-Drafts as 26 reference material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 Abstract 36 This document presents a conservative loss recovery algorithm for 37 TCP that is based on the use of the selective acknowledgment TCP 38 option. The algorithm presented in this document conforms to the 39 spirit of the current congestion control specification, but allows 40 TCP senders to recover more effectively when multiple segments are 41 lost from a single flight of data. 43 Terminology 45 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 46 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 47 document are to be interpreted as described in RFC 2119 [RFC2119]. 49 1 Introduction 51 This document presents a conservative loss recovery algorithm for 52 TCP that is based on the use of the selective acknowledgment TCP 53 option. While the TCP selective acknowledgment (SACK) option 54 [RFC2018] is being steadily deployed in the Internet [All00] there 55 is evidence that hosts are not using the SACK information when 56 making retransmission and congestion control decisions [PF01]. The 57 goal of this document is to outline one straightforward method for 58 TCP implementations to use SACK information to increase performance. 60 [RFC2581] allows advanced loss recovery algorithms to be used by TCP 61 [RFC793] provided that they follow the spirit of TCP's congestion 62 control algorithms [RFC2581,RFC2914]. [RFC2582] outlines one such 63 advanced recovery algorithm called NewReno. This document outlines 64 a loss recovery algorithm that uses the selective acknowledgment 65 (SACK) [RFC2018] TCP option to enhance TCP's loss recovery. The 66 algorithm outlined in this document, heavily based on the algorithm 67 detailed in [FF96], is a conservative replacement of the fast 68 recovery algorithm [Jac90,RFC2581]. The algorithm specified in this 69 document is a straightforward SACK-based loss recovery strategy that 70 follows the guidelines set in [RFC2581] and can safely be used in 71 TCP implementations. Alternate SACK-based loss recovery methods can 72 be used in TCP as implementers see fit (as long as the alternate 73 algorithms follow the guidelines provided in [RFC2581]). Please 74 note, however, that the SACK-based decisions in this document (such 75 as what segments are to be sent at what time) are largely decoupled 76 from the congestion control algorithms, and as such can be treated 77 as separate issues if so desired. 79 2 Definitions 81 The reader is expected to be familiar with the definitions given in 82 [RFC2581]. 84 The reader is assumed to be familiar with selective acknowledgments 85 as specified in [RFC2018]. 87 For the purposes of explaining the SACK-based loss recovery 88 algorithm we define four variables that a TCP sender stores: 90 ``HighACK'' is the sequence number of the highest byte of 91 data that has been cumulatively ACKed at a given point. 93 ``HighData'' is the highest sequence number transmitted at a 94 given point. 96 ``HighRxt'' is the highest sequence number which has been 97 retransmitted during the current loss recovery phase. 99 ``RxtBytes'' is the number of octets that have been 100 retransmitted but not yet cumulatively acknowledged. 102 ``Pipe'' is a sender's estimate of the number of bytes 103 outstanding in the network. This is used during recovery 104 for limiting the sender's sending rate. 106 For the purposes of this specification we define a ``duplicate 107 acknowledgment'' as an acknowledgment (ACK) whose cumulative ACK 108 number is equal to the current value of HighACK, as described in 109 [RFC2581]. 111 We define a variable ``DupThresh'' that holds the number of 112 duplicate acknowledgments required to trigger a retransmission. Per 113 [RFC2581] this threshold is defined to be 3 duplicate 114 acknowledgments. However, implementers should consult any updates 115 to [RFC2581] to determine the current value for DupThresh (or method 116 for determining its value). 118 Finally, a range of sequence numbers [A,B] is said to ``cover'' 119 sequence number S if A <= S <= B. 121 3 Keeping Track of SACK Information 123 For a TCP sender to implement the algorithm defined in the next 124 section it must keep a data structure to store incoming 125 selective acknowledgment information on a per connection basis. 126 Such a data structure is commonly called the ``scoreboard''. 127 The specifics of the scoreboard data structure are out of scope 128 for this document (as long as the implementation can perform all 129 functions required by this specification). 131 Note that while this document speaks of marking and keeping 132 track of octets, a real world implementation would probably want 133 to keep track of octet ranges or otherwise collapse the data 134 while ensuring that arbitrary ranges are still markable. 136 4 Processing and Acting Upon SACK Information 138 For the purposes of the algorithm defined in this document the 139 scoreboard SHOULD implement the following functions: 141 Update (): 143 Each octet that is cumulatively ACKed or SACKed should be marked 144 accordingly in the scoreboard data structure, and the total 145 number of octets SACKed should be recorded. 147 Note: SACK information is advisory and therefore SACKed data 148 MUST NOT be removed from TCP's retransmission buffer until the 149 data is cumulatively acknowledged [RFC2018]. 151 NextSeg (): 153 This routine MUST return the sequence number range of the next 154 segment that is to be transmitted, per the following rules: 156 (1) If there exists a smallest unSACKed sequence number 'S1' 157 such that HighRxt < S1 < HighData and there are either 158 DupThresh * SMSS octets above S1 which have been SACKed or 159 the number of discontiguous SACKed sequence spaces above S1 160 is greater than DupThresh, S1 is presumed to have been lost 161 and the sequence range of one segment of up to SMSS octets 162 starting with S1 MUST be returned. 164 (2) If no sequence number 'S1' per rule (1) exists but there 165 exists available unsent data and the receiver's advertised 166 window allows, the sequence range of one segment of up to 167 SMSS octets of previously unsent data starting with sequence 168 number HighData+1 MUST be returned. 170 (3) If the conditions for rules (1) and (2) fail, but there 171 exists an unSACKed sequence number 'S2' such that HighRxt < 172 S2 < HighData, one segment of up to SMSS octets starting 173 with S2 MUST be returned. Note that this segment need not 174 meet the additional requirements in (1). 176 (4) If the conditions for each of (1), (2), and (3) are not 177 met, then NextSeg () MUST indicate failure, and no segment 178 is returned. 180 AmountSACKed (RangeBegin,RangeEnd): 182 This routine MUST return the total number of octets which fall 183 between RangeBegin and RangeEnd that have been selectively 184 acknowledged by the receiver. 186 LeftNetwork (OldHighACK,HighACK): 188 This function MUST return the number of octets in the given 189 sequence number range that have left the network. 191 First, we determine the amount of previously unaccounted for 192 data that is now known to have left the network: 194 PrevUnAcct = (HighACK - OldHighACK) - 195 AmountSACKed (OldHighACK + 1,HighACK) 197 PrevUnAcct represents the amount of data that was not previously 198 covered by the cumulative ACK, less the amount of that data that 199 is known to have been SACKed. 201 If the incoming acknowledgment does not include SACK 202 information we conclude that the newly cumulatively ACKed 203 data does not contain any retransmissions and this routine 204 MUST NOT conduct any of the following calculations and MUST 205 return PrevUnAcct. 207 Otherwise, we calculate the number of octets that have been 208 retransmitted and ACKed. Since retransmissions increase the 209 estimate of the number of octets in the pipe we have to 210 decrement the pipe twice when a retransmission is cumulatively 211 ACKed (once for the original transmission and once for the 212 retransmission). 214 Note: Segments can be retransmitted more than once by TCP, but 215 not by the algorithm presented in this document. Therefore, the 216 assumption that a segment was retransmitted only once in the 217 calculations specified is safe. 219 We calculate the number of retransmitted bytes being ACKed as: 221 RxtBytesACKed = min (PrevUnAcct,RxtBytes) 223 We expect that all previously unaccounted for octets have been 224 retransmitted in the general case. However, dropped ACKs (with 225 useful SACK information) can cause this number to be an 226 overestimate of the actual number of octets retransmitted. 227 Thus, we bound RxtBytesACKed to the actual number of octets 228 retransmitted. 230 Next, we adjust RxtBytes as follows: 232 RxtBytes -= RxtBytesACKed 234 to account for the retransmitted octets we now know have left the 235 network. 237 Finally, the function returns the sum of PrevUnAcct and 238 RxtBytesACKed. 240 Note: The SACK-based loss recovery algorithm outlined in this 241 document requires more computational resources than previous TCP 242 loss recovery strategies. However, we believe the scoreboard data 243 structure can be implemented in a reasonably efficient manner (both 244 in terms of computation complexity and memory usage) in most TCP 245 implementations. 247 5 Algorithm Details 249 Upon the receipt of any ACK containing SACK information, the 250 scoreboard MUST be updated via the Update () routine. 252 Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the 253 scoreboard is to be updated as normal. Note: The first and second 254 duplicate ACKs can also be used to trigger the transmission of 255 previously unsent segments using the Limited Transmit algorithm 256 [RFC3042]. 258 When a TCP sender receives the duplicate ACK corresponding to 259 DupThresh ACKs, the scoreboard MUST be updated with the new SACK 260 information (via Update ()) and a loss recovery phase SHOULD be 261 initiated, per the fast retransmit algorithm outlined in [RFC2581], 262 and in doing so the following steps MUST be taken: 264 (1) pipe = HighData - HighACK - AmountSACKed (HighACK,HighData) 266 Set a ``pipe'' variable to the number of outstanding octets 267 currently ``in the pipe''; this is the data which has been 268 sent by the TCP sender but for which no cumulative or 269 selective acknowledgment has been received. This data is 270 assumed to be still traversing the network path. 272 (2) RecoveryPoint = HighData 274 When the TCP sender receives a cumulative ACK for this data 275 octet the loss recovery phase is terminated. 277 (3) ssthresh = cwnd = (FlightSize / 2) 279 The congestion window (cwnd) and slow start threshold 280 (sstrhesh) are reduced to half of FlightSize per [RFC2581]. 282 (4) Retransmit the first data segment presumed dropped -- the 283 segment starting with sequence number HighACK + 1. To 284 prevent repeated retransmission of the same data, set 285 HighRxt to the highest sequence number in the retransmitted 286 segment. 288 (5) Set RxtBytes to the number of octets retransmitted in step 289 (4) above. 291 (6) In order to take advantage of potential additional available 292 cwnd, proceed to step (D) below. 294 Once a TCP is in the loss recovery phase the following procedure 295 MUST be used for each arriving ACK: 297 (A) An incoming cumulative ACK for a sequence number greater than 298 RecoveryPoint signals the end of loss recovery and the loss 299 recovery phase MUST be terminated. Any information contained in 300 the scoreboard for sequence numbers greater than the new value 301 of HighACK SHOULD NOT be cleared when leaving the loss recovery 302 phase. 304 (B) Upon receipt of a duplicate ACK the following actions MUST be 305 taken: 307 (B.1) Use Update () to record the new SACK information conveyed 308 by the incoming ACK. 310 (B.2) The pipe variable is decremented by the number of newly 311 SACKed data octets conveyed in the incoming ACK (i.e., those 312 octets that are being SACKed for the first time), as that is 313 the amount of new data presumed to have left the network. 315 (C) When a ``partial ACK'' (an ACK that increases the HighACK point, 316 but does not terminate loss recovery) arrives, the following 317 actions MUST be performed: 319 (C.1) Before updating HighACK based on the received cumulative 320 ACK, save HighACK as OldHighACK. 322 (C.2) The scoreboard MUST be updated based on the cumulative ACK 323 and any new SACK information that is included in the ACK via 324 the Update () routine. 326 (C.3) The value of pipe MUST be decremented by the number of 327 octets returned by calling LeftNetwork (OldHighACK,HighACK). 329 (D) While pipe is less than cwnd the TCP sender SHOULD transmit one 330 or more segments as follows: 332 (D.1) The scoreboard MUST be queried via NextSeg () for the 333 sequence number range of the next segment to transmit, and 334 the given segment sent. 336 (D.2) The pipe variable MUST be incremented by the number of 337 data octets sent in (D.1). 339 (D.3) If any of the data octets sent in (D.1) are below 340 HighData, HighRxt MUST be set to the highest sequence number 341 of the segment retransmitted. 343 (D.4) If any of the data octets sent in (D.1) are below 344 HighData, RxtBytes MUST be incremented by the number of 345 octets retransmitted. 347 (D.5) If any of the data octets sent in (D.1) are above 348 HighData, HighData must be updated to reflect the 349 transmission of previously unsent data. 351 (D.6) If cwnd - pipe is greater than 1 SMSS, return to (D.1) 353 5.1 Retransmission Timeouts 355 Keeping track of SACK information depends on the TCP sender having 356 an accurate measure of the current state of the network, the 357 conditions of this connection, and the state of the receiver's 358 buffer. Due to these limitations, [RFC2018] suggests that a TCP 359 sender SHOULD expunge the SACK information gathered from a receiver 360 upon a retransmission timeout ``since the timeout might indicate 361 that the data receiver has reneged.'' Additionally, a TCP sender 362 MUST ``ignore prior SACK information in determining which data to 363 retransmit.'' However, a SACK TCP sender SHOULD still use all SACK 364 information made available during the slow start phase of loss 365 recovery following an RTO. 367 As described in Sections 4 and 5, Update () MAY continue to be 368 used appropriately upon receipt of ACKs. This will allow the 369 slow start recovery period to benefit from all available 370 information provided by the receiver, despite the fact that SACK 371 information was expunged due to the RTO. 373 If there are segments missing from the receiver's buffer following 374 processing of the retransmitted segment, the corresponding ACK will 375 contain SACK information. In this case, a TCP sender SHOULD use 376 this SACK information by using the NextSeg () routine to determine 377 what data should be sent in each segment of the slow start. 379 6 Research 380 The algorithm specified in this document is analyzed in [FF96], 381 which shows that the above algorithm is effective in reducing 382 transfer time over standard TCP Reno [RFC2581] when multiple 383 segments are dropped from a window of data (especially as the number 384 of drops increases). [AHKO97] shows that the algorithm defined in 385 this document can greatly improve throughput in connections 386 traversing satellite channels. 388 7 Security Considerations 390 The algorithm presented in this paper shares security considerations 391 with [RFC2581]. A key difference is that an algorithm based on 392 SACKs is more robust against attackers forging duplicate ACKs to 393 force the TCP sender to reduce cwnd. With SACKs, TCP senders have an 394 additional check on whether or not a particular ACK is legitimate. 395 While not fool-proof, SACK does provide some amount of protection in 396 this area. 398 Acknowledgments 400 The authors wish to thank Sally Floyd for encouraging this document 401 and commenting on an early draft. The algorithm described in this 402 document is largely based on an algorithm outlined by Kevin Fall and 403 Sally Floyd in [FF96], although the authors of this document assume 404 responsibility for any mistakes in the above text. Murali Bashyam, 405 Reiner Ludwig, Jamshid Mahdavi, Matt Mathis, Shawn Ostermann, Vern 406 Paxson and Venkat Venkatsubra provided valuable feedback on earlier 407 versions of this document. Finally, we thank Matt Mathis and 408 Jamshid Mahdavi for implementing the scoreboard in ns and hence 409 guiding our thinking in keeping track of SACK state. 411 References 413 [AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann. TCP 414 Performance Over Satellite Links. Proceedings of the Fifth 415 International Conference on Telecommunications Systems, 416 Nashville, TN, March, 1997. 418 [All00] Mark Allman. A Web Server's View of the Transport Layer. ACM 419 Computer Communication Review, 30(5), October 2000. 421 [FF96] Kevin Fall and Sally Floyd. Simulation-based Comparisons of 422 Tahoe, Reno and SACK TCP. Computer Communication Review, July 423 1996. 425 [Jac90] Van Jacobson. Modified TCP Congestion Avoidance Algorithm. 426 Technical Report, LBL, April 1990. 428 [PF01] Jitendra Padhye, Sally Floyd. Identifying the TCP Behavior 429 of Web Servers, ACM SIGCOMM, August 2001. 431 [RFC793] Jon Postel, Transmission Control Protocol, STD 7, RFC 793, 432 September 1981. 434 [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. TCP Selective 435 Acknowledgment Options. RFC 2018, October 1996 437 [RFC2026] Scott Bradner. The Internet Standards Process -- Revision 438 3, RFC 2026, October 1996 440 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 441 Requirement Levels", BCP 14, RFC 2119, March 1997. 443 [RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens, TCP 444 Congestion Control, RFC 2581, April 1999. 446 [RFC2582] Sally Floyd and Tom Henderson. The NewReno Modification 447 to TCP's Fast Recovery Algorithm, RFC 2582, April 1999. 449 [RFC2914] Sally Floyd. Congestion Control Principles, RFC 2914, 450 September 2000. 452 [RFC3042] Mark Allman, Hari Balkrishnan, Sally Floyd. Enhancing 453 TCP's Loss Recovery Using Limited Transmit. RFC 3042, 454 January 2001 456 Author's Addresses: 458 Ethan Blanton 459 Ohio University Internetworking Research Lab 460 Stocker Center 461 Athens, OH 45701 462 eblanton@irg.cs.ohiou.edu 464 Mark Allman 465 BBN Technologies/NASA Glenn Research Center 466 Lewis Field 467 21000 Brookpark Rd. MS 54-5 468 Cleveland, OH 44135 469 Phone: 216-433-6586 470 Fax: 216-433-8705 471 mallman@bbn.com 472 http://roland.grc.nasa.gov/~mallman 474 Kevin Fall 475 Intel Research 476 2150 Shattuck Ave., PH Suite 477 Berkeley, CA 94704 478 kfall@intel-research.net