idnits 2.17.1 draft-allman-tcp-sack-12.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-24) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) 
** There is 1 instance of too long lines in the document, the longest one being 1 character in excess of 72. ** There are 31 instances of lines with control characters in the document. ** The abstract seems to contain references ([RFC2119], [RFC2581]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 262 has weird spacing: '...ariable to th...' == Line 263 has weird spacing: '... is the data ...' == Line 264 has weird spacing: '... sent by the ...' == Line 265 has weird spacing: '...ent has been ...' == Line 266 has weird spacing: '... not been det...' == (1 more instance...) -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'A' is mentioned on line 117, but not defined == Missing Reference: 'B' is mentioned on line 117, but not defined ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 2582 (Obsoleted by RFC 3782) -- Obsolete informational reference (is this intentional?): RFC 2988 (Obsoleted by RFC 6298) Summary: 7 errors (**), 0 flaws (~~), 9 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Engineering Task Force Ethan Blanton 2 INTERNET DRAFT Ohio University 3 File: draft-allman-tcp-sack-12.txt Mark Allman 4 BBN/NASA GRC 5 Kevin Fall 6 Intel Research 7 July, 2002 8 Expires: January, 2003 10 A Conservative SACK-based Loss Recovery Algorithm for TCP 12 Status of this Memo 14 This document is an Internet-Draft and is in full conformance with 15 all provisions of Section 10 of [RFC2026]. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that 19 other groups may also distribute working documents as 20 Internet-Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six 23 months and may be updated, replaced, or obsoleted by other documents 24 at any time. It is inappropriate to use Internet-Drafts as 25 reference material or to cite them other than as "work in progress." 
27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 Abstract 35 This document presents a conservative loss recovery algorithm 36 for TCP that is based on the use of the selective acknowledgment 37 TCP option. The algorithm presented in this document conforms 38 to the spirit of the current congestion control specification 39 [RFC2581], but allows TCP senders to recover more effectively 40 when multiple segments are lost from a single flight of data. 42 Terminology 44 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 45 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 46 document are to be interpreted as described in RFC 2119 [RFC2119]. 48 1 Introduction 50 This document presents a conservative loss recovery algorithm for 51 TCP that is based on the use of the selective acknowledgment TCP 52 option. While the TCP selective acknowledgment (SACK) option 53 [RFC2018] is being steadily deployed in the Internet [All00] there 54 is evidence that hosts are not using the SACK information when 55 making retransmission and congestion control decisions [PF01]. The 56 goal of this document is to outline one straightforward method for 57 TCP implementations to use SACK information to increase performance. 59 [RFC2581] allows advanced loss recovery algorithms to be used by TCP 60 [RFC793] provided that they follow the spirit of TCP's congestion 61 control algorithms [RFC2581,RFC2914]. [RFC2582] outlines one such 62 advanced recovery algorithm called NewReno. This document outlines 63 a loss recovery algorithm that uses the selective acknowledgment 64 (SACK) [RFC2018] TCP option to enhance TCP's loss recovery. The 65 algorithm outlined in this document, heavily based on the algorithm 66 detailed in [FF96], is a conservative replacement of the fast 67 recovery algorithm [Jac90,RFC2581]. 
The algorithm specified in this 68 document is a straightforward SACK-based loss recovery strategy that 69 follows the guidelines set in [RFC2581] and can safely be used in 70 TCP implementations. Alternate SACK-based loss recovery methods can 71 be used in TCP as implementers see fit (as long as the alternate 72 algorithms follow the guidelines provided in [RFC2581]). Please 73 note, however, that the SACK-based decisions in this document (such 74 as what segments are to be sent at what time) are largely decoupled 75 from the congestion control algorithms, and as such can be treated 76 as separate issues if so desired. 78 2 Definitions 80 The reader is expected to be familiar with the definitions given in 81 [RFC2581]. 83 The reader is assumed to be familiar with selective acknowledgments 84 as specified in [RFC2018]. 86 For the purposes of explaining the SACK-based loss recovery 87 algorithm we define four variables that a TCP sender stores: 89 ``HighACK'' is the sequence number of the highest byte of 90 data that has been cumulatively ACKed at a given point. 92 ``HighData'' is the highest sequence number transmitted at a 93 given point. 95 ``HighRxt'' is the highest sequence number which has been 96 retransmitted during the current loss recovery phase. 98 ``Pipe'' is a sender's estimate of the number of bytes 99 outstanding in the network. This is used during recovery 100 for limiting the sender's sending rate. The pipe variable 101 allows TCP to use a fundamentally different congestion 102 control than specified in [RFC2581]. The algorithm is often 103 referred to as the ``pipe algorithm''. 105 For the purposes of this specification we define a ``duplicate 106 acknowledgment'' as an acknowledgment (ACK) whose cumulative ACK 107 number is equal to the current value of HighACK, as described in 108 [RFC2581]. 110 We define a variable ``DupThresh'' that holds the number of 111 duplicate acknowledgments required to trigger a retransmission. 
Per 112 [RFC2581] this threshold is defined to be 3 duplicate 113 acknowledgments. However, implementers should consult any updates 114 to [RFC2581] to determine the current value for DupThresh (or method 115 for determining its value). 117 Finally, a range of sequence numbers [A,B] is said to ``cover'' 118 sequence number S if A <= S <= B. 120 3 Keeping Track of SACK Information 122 For a TCP sender to implement the algorithm defined in the next 123 section it must keep a data structure to store incoming 124 selective acknowledgment information on a per connection basis. 125 Such a data structure is commonly called the ``scoreboard''. 126 The specifics of the scoreboard data structure are out of scope 127 for this document (as long as the implementation can perform all 128 functions required by this specification). 130 Note that while this document speaks of marking and keeping 131 track of octets, a real world implementation would probably want 132 to keep track of octet ranges or otherwise collapse the data 133 while ensuring that arbitrary ranges are still markable. 135 4 Processing and Acting Upon SACK Information 137 For the purposes of the algorithm defined in this document the 138 scoreboard SHOULD implement the following functions: 140 Update (): 142 Given the information provided in an ACK, each octet that is 143 cumulatively ACKed or SACKed should be marked accordingly in 144 the scoreboard data structure, and the total number of 145 octets SACKed should be recorded. 147 Note: SACK information is advisory and therefore SACKed data 148 MUST NOT be removed from TCP's retransmission buffer until the 149 data is cumulatively acknowledged [RFC2018]. 151 IsLost (SeqNum): 153 This routine returns whether the given sequence number is 154 considered to be lost. The routine returns true when either 155 DupThresh discontiguous SACKed sequences have arrived above 156 'SeqNum' or DupThresh * SMSS bytes with sequence numbers greater 157 than 'SeqNum' have been SACKed. 
Otherwise, the routine returns 158 false. 160 SetPipe (): 162 This routine traverses the sequence space from HighACK to 163 HighData and MUST set the ``pipe'' variable to an estimate of 164 the number of octets that are currently in transit between the 165 TCP sender and the TCP receiver. After initializing pipe to 166 zero the following steps are taken for each octet 'S1' in the 167 sequence space between HighACK and HighData that has not been 168 SACKed: 170 (a) The pipe variable is incremented by 1 octet. 172 (b) If S1 <= HighRxt and IsLost (S1) returns false: 174 Pipe is incremented by 1 octet. 176 The effect of this condition is that pipe is incremented for 177 both the original transmission and the retransmission of the 178 octet because neither has been determined to have left the 179 network at this point. 181 NextSeg (): 183 This routine uses the scoreboard data structure maintained by 184 the Update () function to determine what to transmit based on 185 the SACK information that has arrived from the data receiver 186 (and hence been marked in the scoreboard). NextSeg () MUST 187 return the sequence number range of the next segment that is 188 to be transmitted, per the following rules: 190 (1) If there exists a smallest unSACKed sequence number 'S2' 191 that meets the following three criteria for determining loss, 192 the sequence range of one segment of up to SMSS octets 193 starting with S2 MUST be returned. 195 (1.a) S2 is greater than HighRxt. 197 (1.b) S2 is less than the highest octet covered by any 198 received SACK. 200 (1.c) IsLost (S2) returns true. 202 (2) If no sequence number 'S2' per rule (1) exists but there 203 exists available unsent data and the receiver's advertised 204 window allows, the sequence range of one segment of up to 205 SMSS octets of previously unsent data starting with sequence 206 number HighData+1 MUST be returned. 
208 (3) If the conditions for rules (1) and (2) fail, but there 209 exists an unSACKed sequence number 'S3' that meets the 210 criteria for detecting loss given in steps (1.a) and (1.b) 211 above (specifically excluding step (1.c)) then one segment 212 of up to SMSS octets starting with S3 MUST be returned. 214 (4) If the conditions for each of (1), (2), and (3) are not 215 met, then NextSeg () MUST indicate failure, and no segment 216 is returned. 218 Note: The SACK-based loss recovery algorithm outlined in this 219 document requires more computational resources than previous TCP 220 loss recovery strategies. However, we believe the scoreboard data 221 structure can be implemented in a reasonably efficient manner (both 222 in terms of computation complexity and memory usage) in most TCP 223 implementations. 225 5 Algorithm Details 227 Upon the receipt of any ACK containing SACK information, the 228 scoreboard MUST be updated via the Update () routine. 230 Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the 231 scoreboard is to be updated as normal. Note: The first and second 232 duplicate ACKs can also be used to trigger the transmission of 233 previously unsent segments using the Limited Transmit algorithm 234 [RFC3042]. 236 When a TCP sender receives the duplicate ACK corresponding to 237 DupThresh ACKs, the scoreboard MUST be updated with the new SACK 238 information (via Update ()). If no previous loss event has 239 occurred on the connection or the cumulative acknowledgement point 240 is beyond the last value of RecoveryPoint, a loss recovery phase 241 SHOULD be initiated, per the fast retransmit algorithm outlined in 242 [RFC2581]. The following steps MUST be taken: 244 (1) RecoveryPoint = HighData 246 When the TCP sender receives a cumulative ACK for this data 247 octet the loss recovery phase is terminated. 
249 (2) ssthresh = cwnd = (FlightSize / 2) 251 The congestion window (cwnd) and slow start threshold 252 (ssthresh) are reduced to half of FlightSize per [RFC2581]. 254 (3) Retransmit the first data segment presumed dropped -- the 255 segment starting with sequence number HighACK + 1. To 256 prevent repeated retransmission of the same data, set 257 HighRxt to the highest sequence number in the retransmitted 258 segment. 260 (4) Run SetPipe () 262 Set a ``pipe'' variable to the number of outstanding octets 263 currently ``in the pipe''; this is the data which has been 264 sent by the TCP sender but for which no cumulative or 265 selective acknowledgment has been received and the data has 266 not been determined to have been dropped in the network. 267 This data is assumed to be still traversing the network 268 path. 270 (5) In order to take advantage of potential additional available 271 cwnd, proceed to step (C) below. 273 Once a TCP is in the loss recovery phase the following procedure 274 MUST be used for each arriving ACK: 276 (A) An incoming cumulative ACK for a sequence number greater than 277 RecoveryPoint signals the end of loss recovery and the loss 278 recovery phase MUST be terminated. Any information contained in 279 the scoreboard for sequence numbers greater than the new value 280 of HighACK SHOULD NOT be cleared when leaving the loss recovery 281 phase. 283 (B) Upon receipt of an ACK that does not cover RecoveryPoint the 284 following actions MUST be taken: 286 (B.1) Use Update () to record the new SACK information conveyed 287 by the incoming ACK. 289 (B.2) Use SetPipe () to re-calculate the number of octets still 290 in the network. 292 (C) If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more 293 segments as follows: 295 (C.1) The scoreboard MUST be queried via NextSeg () for the 296 sequence number range of the next segment to transmit (if 297 any), and the given segment sent. 
299 (C.2) If any of the data octets sent in (C.1) are below 300 HighData, HighRxt MUST be set to the highest sequence number 301 of the segment retransmitted. 303 (C.3) If any of the data octets sent in (C.1) are above 304 HighData, HighData must be updated to reflect the 305 transmission of previously unsent data. 307 (C.4) The estimate of the amount of data outstanding in the 308 network must be updated by incrementing pipe by the 309 number of octets transmitted in (C.1). 311 (C.5) If cwnd - pipe >= 1 SMSS, return to (C.1) 313 5.1 Retransmission Timeouts 315 In order to avoid memory deadlocks, the TCP receiver is allowed to 316 discard data that has already been selectively acknowledged. As a 317 result, [RFC2018] suggests that a TCP sender SHOULD expunge the 318 SACK information gathered from a receiver upon a retransmission 319 timeout ``since the timeout might indicate that the data receiver 320 has reneged.'' Additionally, a TCP sender MUST ``ignore prior SACK 321 information in determining which data to retransmit.'' However, a 322 SACK TCP sender SHOULD still use all SACK information made 323 available during the slow start phase of loss recovery following 324 an RTO. 326 If an RTO occurs during loss recovery as specified in this document, 327 RecoveryPoint MUST be preserved and the loss recovery algorithm 328 outlined in this document MUST be terminated. In addition, a new 329 recovery phase (as described in section 5) MUST NOT be initiated 330 until HighACK is greater than or equal to RecoveryPoint. 332 As described in Sections 4 and 5, Update () SHOULD continue to be 333 used appropriately upon receipt of ACKs. This will allow the slow 334 start recovery period to benefit from all available information 335 provided by the receiver, despite the fact that SACK information was 336 expunged due to the RTO. 
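As an illustration only, the scoreboard routines of Section 4 and the sending loop of step (C) can be sketched in Python as follows. The sketch tracks individual octets in a set, follows the SetPipe () and NextSeg () steps as they are worded in this document, numbers octets from 1 (so HighACK = 0 means nothing has been ACKed), and ignores sequence number wraparound. The class and function names, the tiny SMSS of 4 octets, the `unsent` count (octets of new data that the receiver's window currently permits), the choice to end a retransmitted segment before already SACKed data, and picking the smallest candidate in rule (3) are all assumptions of this sketch, not part of the specification.

```python
from dataclasses import dataclass, field

SMSS = 4        # assumed tiny segment size (octets) to keep the example readable
DUP_THRESH = 3  # DupThresh, per [RFC2581]


@dataclass
class Scoreboard:
    high_ack: int = 0   # HighACK
    high_data: int = 0  # HighData
    high_rxt: int = 0   # HighRxt
    sacked: set = field(default_factory=set)  # SACKed octets above HighACK

    def update(self, cum_ack, sack_blocks):
        """Update (): record a cumulative ACK and inclusive SACK ranges [A,B]."""
        self.high_ack = max(self.high_ack, cum_ack)
        for left, right in sack_blocks:
            self.sacked.update(range(left, right + 1))
        self.sacked = {s for s in self.sacked if s > self.high_ack}

    def on_rto(self):
        """Section 5.1: expunge gathered SACK information after a timeout."""
        self.sacked.clear()

    def is_lost(self, seq):
        """IsLost (SeqNum): DupThresh SACKed runs, or DupThresh * SMSS SACKed
        octets, above seq."""
        above = sorted(s for s in self.sacked if s > seq)
        runs = sum(1 for i, s in enumerate(above)
                   if i == 0 or s != above[i - 1] + 1)
        return runs >= DUP_THRESH or len(above) >= DUP_THRESH * SMSS

    def set_pipe(self):
        """SetPipe (): steps (a) and (b) as worded in this document."""
        pipe = 0
        for s in range(self.high_ack + 1, self.high_data + 1):
            if s in self.sacked:
                continue
            pipe += 1                                   # step (a)
            if s <= self.high_rxt and not self.is_lost(s):
                pipe += 1                               # step (b)
        return pipe

    def _segment(self, first):
        """Up to SMSS octets from `first`, ending before SACKed data (assumption)."""
        last = first
        while (last + 1 - first < SMSS and last + 1 <= self.high_data
               and last + 1 not in self.sacked):
            last += 1
        return (first, last)

    def next_seg(self, unsent):
        """NextSeg (): inclusive (first, last) octet range, or None on failure."""
        top = max(self.sacked, default=self.high_ack)
        # Candidates satisfying (1.a) and (1.b), in increasing order.
        holes = [s for s in range(self.high_ack + 1, top)
                 if s not in self.sacked and s > self.high_rxt]
        for s in holes:                                 # rule (1)
            if self.is_lost(s):
                return self._segment(s)
        if unsent > 0:                                  # rule (2)
            first = self.high_data + 1
            return (first, first + min(SMSS, unsent) - 1)
        if holes:                                       # rule (3)
            return self._segment(holes[0])
        return None                                     # rule (4)


def send_while_window_allows(sb, cwnd, unsent):
    """Steps (C.1)-(C.5): send while cwnd - pipe >= 1 SMSS; returns ranges sent."""
    sent = []
    pipe = sb.set_pipe()
    while cwnd - pipe >= SMSS:
        seg = sb.next_seg(unsent)                       # (C.1)
        if seg is None:
            break
        first, last = seg
        if first <= sb.high_data:                       # (C.2) retransmission
            sb.high_rxt = last
        if last > sb.high_data:                         # (C.3) new data
            unsent -= last - sb.high_data
            sb.high_data = last
        pipe += last - first + 1                        # (C.4)
        sent.append(seg)
    return sent
```

For example, with octets 1 through 40 outstanding, SACK blocks [5,8] and [13,40] recorded, and octets 1 through 4 already retransmitted (HighRxt = 4), the loop with cwnd = 20 and 8 unsent octets first retransmits 9 through 12 under rule (1) and then sends two new segments under rule (2). A production scoreboard would store octet ranges rather than single octets, as noted in Section 3.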
338 If there are segments missing from the receiver's buffer following 339 processing of the retransmitted segment, the corresponding ACK will 340 contain SACK information. In this case, a TCP sender SHOULD use 341 this SACK information when determining what data should be sent in 342 each segment of the slow start. The exact algorithm for this 343 selection is not specified in this document (specifically NextSeg () 344 is inappropriate during slow start after an RTO). A relatively 345 straightforward approach to ``filling in'' the sequence space 346 reported as missing should be reasonable. 348 6 Managing the RTO Timer 350 The standard TCP RTO estimator is defined in [RFC2988]. Due to 351 the fact that the SACK algorithm in this document can have an 352 impact on the behavior of the estimator, implementers may wish 353 to consider how the timer is managed. [RFC2988] calls for the 354 RTO timer to be re-armed each time an ACK arrives that advances 355 the cumulative ACK point. Because the algorithm presented in 356 this document can keep the ACK clock going through a fairly 357 significant loss event (comparatively longer than the algorithm 358 described in [RFC2581]), on some networks the loss event could 359 last longer than the RTO. In this case the RTO timer would 360 expire prematurely and a segment that need not be retransmitted 361 would be resent. 363 Therefore we give implementers the latitude to use the standard 364 [RFC2988] style RTO management; optionally, a more careful 365 variant that re-arms the RTO timer on each retransmission that 366 is sent during recovery MAY be used. This provides a more 367 conservative timer than specified in [RFC2988], and so may not 368 always be an attractive alternative. However, in some cases it 369 may prevent needless retransmissions, go-back-N transmission and 370 further reduction of the congestion window. 
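The two timer-management options of Section 6 can be contrasted with a small sketch. The class name `RtoTimer`, its methods, the `rearm_on_retransmit` flag, and the use of integer milliseconds are hypothetical and chosen only for illustration; computing the RTO value itself is left to [RFC2988] and is not shown.

```python
class RtoTimer:
    """Minimal RTO timer model; all times are integer milliseconds."""

    def __init__(self, rto_ms, rearm_on_retransmit=False):
        self.rto_ms = rto_ms
        # False: standard [RFC2988]-style management.
        # True: the optional, more careful variant described in Section 6.
        self.rearm_on_retransmit = rearm_on_retransmit
        self.deadline = None

    def arm(self, now):
        """(Re)start the timer so that it expires one RTO from now."""
        self.deadline = now + self.rto_ms

    def on_cumulative_ack_advance(self, now):
        """Both variants re-arm when the cumulative ACK point advances."""
        self.arm(now)

    def on_recovery_retransmission(self, now):
        """Only the careful variant re-arms on a retransmission in recovery."""
        if self.rearm_on_retransmit:
            self.arm(now)

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline


standard = RtoTimer(rto_ms=1000)
careful = RtoTimer(rto_ms=1000, rearm_on_retransmit=True)
for timer in (standard, careful):
    timer.arm(0)
    timer.on_recovery_retransmission(800)  # retransmission during recovery
```

In this sketch, a recovery phase that is still making progress at 1200 ms has already timed out under the standard policy, while the careful variant does not expire until 1800 ms. That later deadline is the sense in which the variant is more conservative.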
372 7 Research 374 The algorithm specified in this document is analyzed in [FF96], 375 which shows that the above algorithm is effective in reducing 376 transfer time over standard TCP Reno [RFC2581] when multiple 377 segments are dropped from a window of data (especially as the number 378 of drops increases). [AHKO97] shows that the algorithm defined in 379 this document can greatly improve throughput in connections 380 traversing satellite channels. 382 8 Security Considerations 384 The algorithm presented in this paper shares security considerations 385 with [RFC2581]. A key difference is that an algorithm based on 386 SACKs is more robust against attackers forging duplicate ACKs to 387 force the TCP sender to reduce cwnd. With SACKs, TCP senders have an 388 additional check on whether or not a particular ACK is legitimate. 389 While not fool-proof, SACK does provide some amount of protection in 390 this area. 392 Acknowledgments 394 The authors wish to thank Sally Floyd for encouraging this 395 document and commenting on early drafts. The algorithm 396 described in this document is loosely based on an algorithm 397 outlined by Kevin Fall and Sally Floyd in [FF96], although the 398 authors of this document assume responsibility for any mistakes 399 in the above text. Murali Bashyam, Ken Calvert, Tom Henderson, 400 Reiner Ludwig, Jamshid Mahdavi, Matt Mathis, Shawn Ostermann, 401 Vern Paxson, Venkat Venkatsubra and Lili Wang provided valuable 402 feedback on earlier versions of this document. Finally, we 403 thank Matt Mathis and Jamshid Mahdavi for implementing the 404 scoreboard in ns and hence guiding our thinking in keeping track 405 of SACK state. 407 Normative References 409 [RFC793] Jon Postel, Transmission Control Protocol, STD 7, RFC 793, 410 September 1981. 412 [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. TCP Selective 413 Acknowledgment Options. RFC 2018, October 1996 415 [RFC2026] Scott Bradner. 
The Internet Standards Process -- Revision 416 3, RFC 2026, October 1996. 418 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 419 Requirement Levels", BCP 14, RFC 2119, March 1997. 421 [RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens, TCP 422 Congestion Control, RFC 2581, April 1999. 424 Non-Normative References 426 [AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann. TCP 427 Performance Over Satellite Links. Proceedings of the Fifth 428 International Conference on Telecommunications Systems, 429 Nashville, TN, March, 1997. 431 [All00] Mark Allman. A Web Server's View of the Transport Layer. ACM 432 Computer Communication Review, 30(5), October 2000. 434 [FF96] Kevin Fall and Sally Floyd. Simulation-based Comparisons of 435 Tahoe, Reno and SACK TCP. Computer Communication Review, July 436 1996. 438 [Jac90] Van Jacobson. Modified TCP Congestion Avoidance Algorithm. 439 Technical Report, LBL, April 1990. 441 [PF01] Jitendra Padhye, Sally Floyd. Identifying the TCP Behavior 442 of Web Servers, ACM SIGCOMM, August 2001. 444 [RFC2582] Sally Floyd and Tom Henderson. The NewReno Modification 445 to TCP's Fast Recovery Algorithm, RFC 2582, April 1999. 447 [RFC2914] Sally Floyd. Congestion Control Principles, RFC 2914, 448 September 2000. 450 [RFC2988] Vern Paxson, Mark Allman. Computing TCP's Retransmission 451 Timer, RFC 2988, November 2000. 453 [RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd. Enhancing 454 TCP's Loss Recovery Using Limited Transmit. RFC 3042, 455 January 2001. 457 Authors' Addresses: 459 Ethan Blanton 460 Ohio University Internetworking Research Lab 461 Stocker Center 462 Athens, OH 45701 463 eblanton@irg.cs.ohiou.edu 465 Mark Allman 466 BBN Technologies/NASA Glenn Research Center 467 Lewis Field 468 21000 Brookpark Rd. 
MS 54-5 469 Cleveland, OH 44135 470 Phone: 216-433-6586 471 Fax: 216-433-8705 472 mallman@bbn.com 473 http://roland.grc.nasa.gov/~mallman 475 Kevin Fall 476 Intel Research 477 2150 Shattuck Ave., PH Suite 478 Berkeley, CA 94704 479 kfall@intel-research.net