idnits 2.17.1

draft-swami-tsvwg-tcp-dclor-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.
     Expected boilerplate is as follows today (2024-04-26) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the
        provisions of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the
        document authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal
        Provisions Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with respect
        to this document.  Code Components extracted from this document must
        include Simplified BSD License text as described in Section 4.e of
        the Trust Legal Provisions and are provided without warranty as
        described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There is 1 instance of too long lines in the document, the longest one
     being 1 character in excess of 72.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 218: '..., the TCP sender MUST set its congesti...'
     RFC 2119 keyword, line 223: '...lue of SS_THRESH MUST be left unchange...'
     RFC 2119 keyword, line 227: '.... The TCP sender SHOULD also reset all...'
     RFC 2119 keyword, line 240: '...w data, the TCP sender SHOULD send the...'
     RFC 2119 keyword, line 245: '... 5. A TCP sender MUST repeat step-2 to...'
     (10 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 297 has weird spacing: '...-flight shoul...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (September 24, 2003) is 7520 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2026' is mentioned on line 16, but not defined

  == Unused Reference: 'RFC3517' is defined on line 379, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2883' is defined on line 385, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3522' is defined on line 389, but no explicit
     reference was found in the text

  == Unused Reference: 'LG03' is defined on line 392, but no explicit
     reference was found in the text

  == Unused Reference: 'SK03' is defined on line 396, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2988' is defined on line 401, but no explicit
     reference was found in the text

  == Unused Reference: 'BA02' is defined on line 404, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Obsolete normative reference: RFC 2861 (Obsoleted by RFC 7661)

  ** Obsolete normative reference: RFC 3517 (Obsoleted by RFC 6675)

  ** Downref: Normative reference to an Experimental RFC: RFC 3522

  == Outdated reference: A later version (-06) exists of
     draft-ietf-tsvwg-tcp-eifel-response-03

  == Outdated reference: A later version (-04) exists of
     draft-sarolahti-tsvwg-tcp-frto-03

  -- Possible downref: Normative reference to a draft: ref. 'SK03'

  ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298)

  -- Possible downref: Normative reference to a draft: ref. 'BA02'


     Summary: 10 errors (**), 0 flaws (~~), 12 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

2    Internet Engineering Task Force                             Yogesh Swami
3    INTERNET DRAFT                                                  Khiem Le
4    File: draft-swami-tsvwg-tcp-dclor-02.txt           Nokia Research Center
5                                                                      Dallas
6                                                          September 24, 2003
7    Expires: March 24, 2004

9            DCLOR: De-correlated Loss Recovery using SACK option
10                          for spurious timeouts.

12   Status of this Memo

14      This document is an Internet-Draft and is in full conformance with
15      all provisions of Section 10 of [RFC2026].

17      Internet-Drafts are working documents of the Internet Engineering
18      Task Force (IETF), its areas, and its working groups. Note that
19      other groups may also distribute working documents as Internet-
20      Drafts.

22      Internet-Drafts are draft documents valid for a maximum of six months
23      and may be updated, replaced, or obsoleted by other documents at any
24      time. It is inappropriate to use Internet-Drafts as reference
25      material or to cite them other than as "work in progress."

27      The list of current Internet-Drafts can be accessed at
28      http://www.ietf.org/ietf/1id-abstracts.txt

30      The list of Internet-Draft Shadow Directories can be accessed at
31      http://www.ietf.org/shadow.html

33   Abstract

35      A spurious timeout in TCP forces the sender to unnecessarily
36      retransmit one complete congestion window of data into the network.
37      In addition, TCP uses the rate of arrival of ACKs as the basic
38      criterion for congestion control. TCP makes the assumption that the
39      rate at which ACKs are received reflects the end-to-end state of the
40      network in terms of congestion. However, ACKs after a spurious
41      timeout don't reflect the end-to-end congestion state of the network;
42      they only reflect the congestion state of a part of the network. In
43      these cases, the slow-start behavior after a timeout can further add
44      to network congestion.
        In this draft we propose changes to the TCP
45      sender that can be used to solve the problem of both redundant
46      retransmission and network congestion after a spurious timeout.

48   1. Introduction

50      The response of a TCP sender after a retransmission timeout is
51      governed by the underlying assumption that a mid-stream timeout can
52      occur only if there is heavy congestion--manifested as packet
53      loss--in the network. TCP therefore assumes that a timeout is a
54      sufficient indication to a) recover all the packets in flight, and b)
55      initiate a congestion response (slow start in this case) suited
56      for heavy congestion scenarios.

58      Even though a timeout is often a sufficient indication for recovering
59      all the packets in flight and initiating slow start, the loss
60      recovery algorithm should be separate from the congestion control
61      decisions. The loss recovery algorithm should only answer the
62      question of "what" data (i.e., what sequence numbers) to send. On the
63      other hand, the congestion control algorithm should answer the
64      question of "how much" data to send. But after a timeout, TCP
65      addresses the issues of loss recovery and congestion control using a
66      single mechanism--send one packet per retransmission timeout (RTO)
67      (answers the "how much" question) until an acknowledgment is
68      received; the single segment sent is always the first unacknowledged
69      outstanding packet in the retransmission queue (answers the "what"
70      question). Since the present TCP's loss recovery and congestion
71      control algorithms are coupled together, we call this "Correlated
72      Loss Recovery (CLOR)."

74      Although the assumption that a timeout can occur only if there is
75      severe congestion is valid for traditional wire-line networks, it
76      does not hold for some other types of networks--networks where
77      packets can be stalled "in the network" for a significant duration
78      without being discarded. Typical examples of such networks are
79      cellular networks.
        In cellular networks, the link layer can
80      experience a relatively long disruption due to errors, and the link
81      layer protocol can keep the packets in error buffered as long as
82      the link layer disruption lasts.

84      In this document we present an alternative approach to loss recovery
85      and congestion control that "De-Correlates" Loss Recovery from
86      congestion control and allows an independent choice of which
87      TCP sequence number to send without compromising on the congestion
88      control principles of [RFC2581][RFC2914][RFC2861].

90   2. Problem Description

92      Let us assume that a TCP sender has sent N packets, p(1) ... p(N),
93      into the network and is waiting for the ACK of p(1) (Figure-1). Due
94      to bad network conditions or some other problem, these packets are
95      excessively delayed at some intermediary node RTR-1. Unlike
96      standard IP routers, RTR-1 keeps these packets buffered for a
97      relatively long period of time until these packets are forwarded to
98      their intended recipient. This excessive delay forces the TCP sender
99      to time out and enter slow start.

101     As far as the sender is concerned, a timeout is always interpreted as
102     heavy congestion. The TCP sender therefore makes the assumption that
103     all packets between p(1) and p(N) were lost in the network. To
104     recover from this misconstrued loss, the TCP sender retransmits p1(1)
105     (px(k) represents the xth retransmission of the packet with sequence
106     number k), and waits for the ACK a(1).

108     After some period of time, when the network conditions at RTR-1
109     improve, the queued packets are finally dispatched to their
110     intended recipient; in response to p(1) the TCP receiver
111     generates the ACK a(1). When the TCP sender receives a(1), it is
112     fooled into believing that a(1) was generated in response to the
113     retransmitted packet p1(1), while in reality a(1) was generated in
114     response to the originally transmitted packet p(1).
        When the sender
115     receives a(1), it increases its congestion window to two, and
116     retransmits p1(2) and p1(3). As the sender receives more
117     acknowledgments, it continues with retransmissions and finally starts
118     sending new data.

120     The following two subsections examine the problems associated with
121     the above-mentioned TCP behavior.

123  2.1 Redundant Data Retransmission

125     The obvious and relatively easy-to-solve inefficiency of the above
126     algorithm is that the entire congestion window worth of data is
127     unnecessarily retransmitted. Although such retransmissions are
128     harmless to high-bandwidth, well-provisioned backbone links (so long
129     as they are infrequent), they could severely degrade the performance
130     of slow links.

132     In cases where bandwidth is a commodity at a premium (e.g., cellular
133     networks), unnecessary retransmission can also be costly.

135  2.2 Congestion after Spurious Timeout

137     To analyze network congestion after a spurious timeout, we compute the
138     worst-case packet loss in the system--assuming only TCP
139     connections to be present.

141     After the spurious timeout, the TCP sender sets its SS_THRESH to N/2.
142     Therefore, for the first N/2 ACKs received (i.e., ACKs a(1) to
143     a(N/2)), the TCP sender will grow its congestion window by one per
144     ACK and reach the SS_THRESH value of N/2. For each ACK received, the
145     TCP sender sends 2 packets. Therefore, by the end of slow start, the
146     TCP sender would have sent 2*(N/2) = N packets into the network. For
147     the remaining N/2 ACKs (i.e., ACKs a(N/2+1) to a(N)) the TCP
148     sender will remain in the congestion avoidance phase and send one
149     packet for each ACK received--sending N/2 more data segments. The net
150     amount of data sent is therefore N/2 + N = 3N/2.

152     Please note that all 3N/2 packets are injected into the
153     network within a time period less than or equal to one RTT in most cases.
154     The number of data segments that left the network during this time is
155     only N. Therefore, N/2 packets out of 3N/2 packets will be lost with
156     a very high probability. These N/2 lost packets, however, need not
157     come from the same connection, and such a data burst will
158     unnecessarily penalize all the competing TCP connections that share
159     the same bottleneck router.

161     Going further, let us assume there are M competing TCP
162     connections that share the same bottleneck router(s) with C(0) (the
163     connections are numbered C(0) ... C(M)). During the period of time
164     while C(0) is stalled, its TCP sender does not use its network
165     resources--the buffer space--on the bottleneck router(s). The
166     competing connections, C(1) ... C(M), however, see this lack of
167     activity as resource availability and start growing their window by
168     at least one segment per RTT during this time period (by virtue of
169     linear window increase during the congestion avoidance phase). For
170     simplicity, we assume that each of these connections has the
171     same round trip time of RTT, and the idle time for C(0) is k*RTT
172     (where k > RTO/RTT). Under these assumptions, each of these competing
173     connections will increase its congestion window by k segments.
174     Therefore, the number of packets lost in the network after the
175     spurious timeout can be as high as:

177              N/2 + M*k               ... (4)

179     The first term in the above equation is the packet loss due to slow
180     start, while the second term is the loss due to window growth of
181     competing connections (if the competing connections were in slow
182     start the response could have been worse).

184     Based on the above equation, we note that the congestion state of the
185     network depends upon the duration of the spurious timeout. In our
186     response algorithm we therefore take the duration of the spurious
187     timeout into account to reduce the data rate by half every RTO.
        Please note
188     that this scheme works well only when the number of competing
189     connections M does not vary too much while C(0) is stalled. A more
190     conservative response algorithm should reduce the data rate to
191     INIT_WINDOW if M is not bounded.

193     In the following sections we describe an algorithm that solves the
194     problem of both redundant retransmission and packet loss after a
195     spurious timeout.

197  3. De-correlated Loss Recovery (DCLOR)

199     The basic idea behind DCLOR is to send a new data segment from
200     outside the sender's retransmission queue and wait for the ACK or
201     SACK of the new data before initiating the response algorithm. Unlike
202     slow start, where the response algorithm starts immediately after
203     receiving the first ACK, DCLOR waits for the ACK/SACK of the new data
204     sent after timeout before initiating loss recovery. The SACK block
205     for new data contains sufficient information to determine all the
206     packets that were lost in the network. Once the sequence numbers of
207     the lost packets are determined, the TCP sender grows its congestion
208     window as determined by SS_THRESH and its current congestion window.

210  3.1 Probe phase after a timeout

212     The following steps describe the response of a TCP sender on a
213     timeout:

215        1. If the timeout occurs before the three-way handshake is complete,
216        the TCP sender's behavior is unchanged.

218        2. After each timeout, the TCP sender MUST set its congestion
219        window to:

221              cwnd = max(cwnd/2, INIT_WINDOW)

223        The value of SS_THRESH MUST be left unchanged at this point. The
224        TCP sender should also count the number of packets in flight at
225        this time, and keep it in a state variable stale_outstanding.

227        3. The TCP sender SHOULD also reset all the SACK tag bits in its
228        retransmission queue if this is the first timeout.

230        4.
           Instead of sending the first unacknowledged packet P1 after a
231        timeout, the TCP sender should *disregard* its congestion window
232        and send ONE new MSS-sized data segment (Pn+1).

234        The TCP sender should also store the sequence number of the new
235        segment in a new state variable called SS_PTR (for slow start
236        pointer).

238        If the sender does not have any new data outside its
239        retransmission queue, or if the receiver's flow control window
240        cannot sustain any new data, the TCP sender SHOULD send the
241        highest-sequence-numbered MSS-sized data chunk from its
242        retransmission queue (i.e., it should send the last packet from
243        its retransmission queue).

245        5. A TCP sender MUST repeat step-2 to step-4 until it enters the
246        Timeout-Recovery state as described in step-6.

248  3.2 Congestion Control After the probe phase

250        6. For each ACK received with ACK-sequence number less than
251        SS_PTR, the TCP sender SHOULD NOT grow its congestion window.
252        If the ACK contains a new SACK block, the SACK tag SHOULD be set
253        in the corresponding data packet, and the number of packets in
254        flight should be updated. If a pure ACK is received, the packet
255        should be removed from the retransmission queue and the value of
256        packets in flight should be updated.

258        After making the above-mentioned changes, the TCP sender SHOULD
259        send new data (i.e., data from outside the retransmission queue)
260        if the number of packets in flight is less than the congestion
261        window. In addition, the TCP sender should keep a variable
262        'new_packets' which counts the number of bytes (packets if the
263        congestion window is maintained as a count of packets) sent that
264        have a sequence number greater than or equal to SS_PTR.

266        In addition, the TCP sender SHOULD NOT take any timer sample for
267        the stale ACKs.
           (NOTE: We do not attempt to change the RTT
268        calculation in an ad-hoc manner; we believe that this is a
269        research problem that needs better network modeling before an
270        appropriate timer calculation can be found.)

272        7. Step-6 continues until the TCP sender receives an ACK
273        with a sequence number greater than SS_PTR, or a SACK block
274        covering a sequence number greater than SS_PTR.

276        If the sender receives a SACK block containing SS_PTR, i.e., if
277        there is a packet loss in the stalled window, it SHOULD follow
278        step-8.

280        If the sender receives an ACK that acknowledges SS_PTR, i.e., if
281        no packets were lost from the stalled window, it SHOULD go to
282        step-10.

284     NOTE: In our previous experiments we had set the congestion window
285     to one MSS after a spurious timeout; however, the present algorithm
286     performs better if there is moderate load on the routers and the
287     number of competing connections does not vary a lot during the
288     stalling period. In case of heavy load, setting the congestion window
289     to INIT_WINDOW still performs better. We believe that using the
290     present congestion response makes a fair compromise for different
        scenarios.

292  3.3 Timeout-Recovery: recovering lost packets after timeout

294        8. The TCP sender traverses the retransmission queue and marks
295        all the packets without any SACK tag as lost. The TCP sender
296        also updates its packets in flight based on the SACK tags and
297        the lost segment information (the packets-in-flight should be
298        ZERO after the update).

300        Please note that unlike Fast Retransmit and Fast Recovery, DCLOR
301        uses only one SACK block containing SS_PTR to mark packets as
302        lost. This is because we do not expect packet reordering to
303        exist over the period of an RTO.

305        9. The TCP sender should update its SS_THRESH as:

307              SS_THRESH = stale_outstanding/2

309        10. The TCP sender SHOULD set cwnd = new_packets + 1.
           (Note that if
310        all packets were lost, the value of 'new_packets' will be 1, and
311        therefore the congestion window will become 2, which is the
312        value for a timeout due to congestion.) If packets were lost in
313        the network (i.e., if a SACK for SS_PTR was received), the TCP
314        sender should start by sending the packets with the lowest sequence
315        numbers; otherwise it should continue with new data.

317        The sender should follow the normal window growth strategy based
318        on the value of SS_THRESH after this step.

320     Please note that with a pure ACK acknowledging SS_PTR, the TCP sender
321     does not update the SS_THRESH value (it directly enters step-10 from
322     step-7). This prevents a TCP sender from setting its SS_THRESH to a
323     very small value if the spurious timeout occurs at the start of the
324     connection.

326  4. Data Delivery To Upper Layers

328     If a TCP sender loses its entire congestion window worth of data,
329     sending new data after a timeout prevents a TCP receiver from
330     forwarding the new data to the upper layers immediately. However,
331     once the SACK for this new data is received, the TCP sender will send
332     the first lost segment. This essentially means that data delivery to
333     the upper layers could be delayed by at most one RTT when all the
334     packets are lost in the network.

336     This, however, does not affect the throughput of the connection in
337     any way. If a timeout has occurred, then the data delivery to the
338     upper layers has already been excessively delayed. Delaying it by
339     another round trip is not a serious problem. Please note that
340     reliability and timeliness are two conflicting issues and one cannot
341     gain on one without sacrificing something on the other.

343  5. Security Considerations

345     The TCP SACK information is meant to be advisory, and a TCP receiver
346     is allowed--though strongly discouraged--to discard data blocks the
347     receiver has already SACKed [RFC2018].
        Please note, however, that even
348     if the TCP receiver discards the data block it received, it MUST still
349     send the SACK block for at least the most recently received data.
350     Therefore, in spite of SACK reneging, DCLOR will work without any
351     deadlocks.

353     A SACK implementation is also allowed not to send a SACK block even
354     though the TCP sender and receiver might have agreed to the
355     SACK-Permitted option at the start of the connection. In these cases,
356     however, if the receiver sends one SACK block, it must send SACK
357     blocks for the rest of the connection. Because of the above-mentioned
358     leniency in implementation, it is possible that a TCP receiver may
359     agree on the SACK-Permitted option, and yet not send any SACK blocks.
360     To make DCLOR robust under these circumstances, DCLOR SHOULD NOT be
361     invoked unless the sender has seen at least one SACK block before
362     timeout. We, however, believe that once the SACK-Permitted option is
363     accepted, the TCP receiver MUST send a SACK block--even though that
364     block might finally be discarded. Otherwise, the SACK-Permitted
365     option is completely redundant and serves little purpose. To the best
366     of our knowledge, almost all SACK implementations send a SACK block
367     if they have accepted the SACK-Permitted option.

369  6. References

371     [RFC2581]  M. Allman, V. Paxson, W. Stevens, "TCP Congestion
372                Control," Apr 1999.

374     [RFC2914]  S. Floyd, "Congestion Control Principles," Sep 2000.

376     [RFC2861]  M. Handley, J. Padhye, S. Floyd, "TCP Congestion
377                Window Validation," Jun 2000.

379     [RFC3517]  E. Blanton, M. Allman, K. Fall, L. Wang, "Conservative
380                SACK-based Loss Recovery Algorithm for TCP," Apr 2003.

382     [RFC2018]  M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP
383                Selective Acknowledgment Options," Oct 1996.

385     [RFC2883]  S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, "An
386                Extension to the Selective Acknowledgment (SACK) Option
387                for TCP," Jul 2000.

389     [RFC3522]  R. Ludwig, M. Meyer,
"The Eiffel Detection Algorithm 390 for TCP," Apr 2003. 392 [LG03] R. Ludwig, A. Gurtov, "The Eifel Response Algorithm for 393 TCP." Internet draft; work in progress, draft-ietf-tsvwg- 394 tcp-eifel-response-03.txt, Mar 2003. 396 [SK03] P. Sarolahti, M. Kojo. "F-RTO: A TCP RTO Recovery 397 Algorithm for Avoiding Unnecessary Retransmissions." 398 Internet draft; work in progress. draft-sarolahti-tsvwg- 399 tcp-frto-03.txt, Jan 2003. 401 [RFC2988] V. Paxon, M. Allman. "Computing TCP's Retransmission 402 Timer," Nov 2000. 404 [BA02] E. Blanton, M. Allman, "Using TCP DSACKs and SCTP 405 Duplicate TSNs to Detect Spurious Retransmissions," 406 Internet draft; work in progress, draft-blanton-dsack- 407 use-02.txt, Oct 2002. 409 7. IPR Statement 411 The IETF has been notified of intellectual property rights claimed in 412 regard to some or all of the specification contained in this 413 document. For more information consult the on-line list of claimed 414 rights at http://www.ietf.org/ipr. 416 Author's Address: 418 Yogesh Prem Swami Khiem Le 419 Nokia Research Center Nokia Research Center 420 6000 Connection Drive 6000 Connection Drive 421 Irving TX-75063 Irving TX-75063 422 USA USA 424 Phone: +1 972-374-0669 Phone: +1 972-894-4882 425 Email: yogesh.swami@nokia.com Email: khiem.le@nokia.com