idnits 2.17.1

draft-swami-tsvwg-tcp-dclor-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.
     Expected boilerplate is as follows today (2024-04-19) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:

        This Internet-Draft is submitted in full conformance with the
        provisions of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:

        Copyright (c) 2024 IETF Trust and the persons identified as the
        document authors. All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:

        This document is subject to BCP 78 and the IETF Trust's Legal
        Provisions Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document. Please review these documents
        carefully, as they describe your rights and restrictions with respect
        to this document. Code Components extracted from this document must
        include Simplified BSD License text as described in Section 4.e of
        the Trust Legal Provisions and are provided without warranty as
        described in the Simplified BSD License.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)
  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 260: '..., the TCP sender MUST set its congesti...'
     RFC 2119 keyword, line 265: '...lue of SS_THRESH MUST be left UNCHANGE...'
     RFC 2119 keyword, line 269: '.... The TCP sender SHOULD also reset all...'
     RFC 2119 keyword, line 282: '...w data, the TCP sender SHOULD send the...'
     RFC 2119 keyword, line 287: '... 5. A TCP sender MUST repeat step-2 to...'
     (10 more instances...)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you can
     ignore this comment. If not, you may need to add the pre-RFC5378
     disclaimer. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (Apr 2003) is 7675 days in the past. Is this
     intentional?
  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2026' is mentioned on line 16, but not defined

  == Unused Reference: 'BAFW03' is defined on line 415, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2883' is defined on line 423, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2988' is defined on line 440, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Obsolete normative reference: RFC 2861 (Obsoleted by RFC 7661)

  ** Downref: Normative reference to an Experimental draft:
     draft-ietf-tsvwg-tcp-eifel-alg (ref. 'LM02')

  == Outdated reference: A later version (-06) exists of
     draft-ietf-tsvwg-tcp-eifel-response-03

  == Outdated reference: A later version (-04) exists of
     draft-sarolahti-tsvwg-tcp-frto-03

  -- Possible downref: Normative reference to a draft: ref. 'SK03'

  ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298)

  -- Possible downref: Normative reference to a draft: ref. 'BA02'

     Summary: 8 errors (**), 0 flaws (~~), 7 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   Internet Engineering Task Force                             Yogesh Swami
3   INTERNET DRAFT                                                  Khiem Le
4   File: draft-swami-tsvwg-tcp-dclor-01.txt           Nokia Research Center
5                                                                     Dallas
6                                                                   Apr 2003
7   Expires: Oct 2003

9          DCLOR: De-correlated Loss Recovery using SACK option
10                        for spurious timeouts.

12  Status of this Memo

14     This document is an Internet-Draft and is in full conformance with
15     all provisions of Section 10 of [RFC2026].

17     Internet-Drafts are working documents of the Internet Engineering
18     Task Force (IETF), its areas, and its working groups. Note that
19     other groups may also distribute working documents as Internet-
20     Drafts.

22     Internet-Drafts are draft documents valid for a maximum of six months
23     and may be updated, replaced, or obsoleted by other documents at any
24     time. It is inappropriate to use Internet-Drafts as reference
25     material or to cite them other than as "work in progress."

27     The list of current Internet-Drafts can be accessed at
28     http://www.ietf.org/ietf/1id-abstracts.txt

30     The list of Internet-Draft Shadow Directories can be accessed at
31     http://www.ietf.org/shadow.html

33  Abstract

35     A spurious timeout in TCP forces the sender to unnecessarily
36     retransmit one complete congestion window of data into the network.
37     In addition, TCP uses the rate of arrival of ACKs as the basic
38     criterion for congestion control. TCP makes the assumption that the
39     rate at which ACKs are received reflects the end-to-end state of the
40     network in terms of congestion. But after a spurious timeout, the
41     ACKs don't reflect the end-to-end congestion state of the network,
42     but only a part of it. In these cases, the slow-start behavior after
43     a timeout can further add to network congestion. In this draft we
44     propose changes to the TCP sender (no change is needed for TCP
45     receiver) that can be used to solve the problem of both redundant-
46     retransmission and network congestion after a spurious timeout.

48  1. Introduction

50     The response of a TCP sender after a retransmission timeout is
51     governed by the underlying assumption that a mid-stream timeout can
52     occur only if there is heavy congestion--manifested as packet
53     loss--in the network. Even though loss is often caused by congestion,
54     the loss recovery algorithm itself should only answer the question of
55     "what" data (i.e., what sequence number of data) to send.
       In contrast,
56     the congestion control algorithm should answer the
57     question of "how much" data to send. But after a timeout, TCP
58     addresses the issues of loss recovery and congestion control using a
59     single mechanism--send one segment per retransmission timeout (RTO)
60     (answers the "how much" question) until an acknowledgment is
61     received. The single segment sent is always the first unacknowledged
62     outstanding packet in the retransmission queue (answers the "what"
63     question). Since the present TCP's loss recovery and congestion
64     control algorithms are coupled together, we call this "Correlated
65     Loss Recovery (CLOR)."

67     Although the assumption that a timeout can occur only if there is
68     severe congestion is valid for traditional wire-line networks, it
69     does not hold for some other types of networks--networks where
70     packets can be stalled "in the network" for a significant duration
71     without being discarded. Typical examples of such networks are
72     cellular networks. In cellular networks, the link layer can
73     experience a relatively long disruption due to errors, and the link
74     layer protocol can keep these packets-in-error buffered as long as
75     the link layer disruption lasts.

77     In this document we present an alternative approach to loss recovery
78     and congestion control that "De-Correlates" Loss Recovery from
79     congestion control and allows independent choice on using a
80     particular TCP sequence number without compromising on the congestion
81     control principles of [RFC2581][RFC2914][RFC2861].

83     Although several drafts [LM02][LG03][SK03][BA02] have been presented
84     on this topic, we believe that none of them fully considers all the
85     problems associated with spurious timeouts. In the following section
86     we first describe these problems in more detail and then describe the
87     DCLOR mechanism in section-3.

89  2. Problem Description.

91     Let us assume that a TCP sender has sent N packets, p(1) ...
       p(N),
92     into the network and is waiting for the ACK of p(1) (Figure-1). Due
93     to bad network conditions or some other problem, these packets are
94     excessively delayed at some intermediary node NDN. Unlike
95     standard IP routers, the NDN keeps these packets buffered for a
96     relatively long period of time until these packets are forwarded to
97     their intended recipient. This excessive delay forces the TCP sender
98     to time out and enter slow start.

100                            Figure-1

102        TCP-Sender          NDN    TCP-Receiver

104      .....  |----p(1)------>|           |
105        ^    |----p(2)------>|           |
106        :    |       .       |           |
107      RTT=D  |       .       |           |
108        :    |       .       |           |
109      .....  |----p(N)------>|           |
110             |       ^       |           |
111             |       :       |           |
112             |      RTO      |           |
113             |       :       |           |
114             |       V       |----p(1)-->|
115       ...   |----p1(1)----->|<---a(1)---|...
116        L    |               |           |
117       ...   |<----a(1)------|----p(2)-->|
118             |->p1(2),p1(3)->|<---a(2)---|...
119             |       .       |     .     |
120             |       .       |     .     |
121             |       .       |     .     |
122             |               |<---a(N)---|
123             |               |---p1(1)-->|
124             |               |<---a(N)---|
125             |               |           |

127    As far as the sender is concerned, a timeout is always interpreted as
128    heavy congestion. The TCP sender therefore makes the assumption that
129    all packets between p(1) and p(N) were lost in the network. To
130    recover from this misconstrued loss, the TCP sender retransmits p1(1)
131    (px(k) represents the xth retransmission of the packet with sequence
132    number k), and waits for the ACK a(1).

134    After some period of time when the network conditions at NDN improve,
135    the queued packets are finally dispatched to their intended
136    recipient; in response the TCP receiver generates the ACK a(1). When
137    the TCP sender receives a(1), it's fooled into believing that a(1)
138    was generated in response to the retransmitted packet p1(1), while in
139    reality a(1) was generated in response to the originally transmitted
140    packet p(1). When the sender receives a(1), it increases its
141    congestion window to two, and retransmits p1(2) and p1(3).
       As the
142    sender receives more acknowledgments, it continues with
143    retransmissions and finally starts sending new data.

145    The following two subsections examine the problems associated with
146    the above-mentioned TCP behavior.

148 2.1 Redundant Data Retransmission

150    The obvious and relatively easy-to-solve inefficiency of the above
151    algorithm is that the entire congestion window worth of data is
152    unnecessarily retransmitted. Although such retransmissions are
153    harmless to high-bandwidth, well-provisioned, backbone links (so long
154    as they are infrequent), they could severely degrade the performance
155    of slow links.

157    In cases where bandwidth is a commodity at a premium (e.g., cellular
158    networks), unnecessary retransmission can also be costly.

160 2.2 Congestion after Spurious Timeout

162    To analyze network congestion after spurious timeout, we compute the
163    worst-case packet loss in the system--assuming only TCP
164    connections to be present.

166    After the spurious timeout, the TCP sender sets its SS_THRESH to N/2.
167    Therefore, for the first N/2 ACKs received (i.e., ACK a(1) to a(N/2)),
168    the TCP sender will grow its congestion window by one per ACK and
169    reach the SS_THRESH value of N/2. For each ACK received, the TCP sender
170    sends 2 packets. Therefore, by the end of the slow start, the TCP
171    sender would have sent 2*(N/2) packets into the network. For the
172    remaining N/2 ACKs (i.e., ACKs between a(N/2+1) and a(N)) the TCP
173    sender will remain in the congestion avoidance phase and send one
174    packet for each ACK received--sending N/2 more data segments. The net
175    amount of data sent is therefore N/2 + N = 3N/2.

177    Please note that the entire 3N/2 packets are injected into the
178    network within a time period less than or equal to RTT in most cases.
179    The number of data segments that left the network during this time is
180    only N. Therefore, N/2 packets out of 3N/2 packets will be lost with
181    a very high probability.
       These N/2 lost packets, however, need not
182    come from the same connection, and such a data-burst will
183    unnecessarily penalize all the competing TCP connections that share
184    the same bottleneck router.

186    Going further ahead, let us assume there are M competing TCP
187    connections that share the same bottleneck router(s) with
188    C(0) (Figure-2). During the period of time while C(0) is stalled, the
189    TCP sender of C(0) does not use its network resources--the buffer
190    space--on the bottleneck router(s). The competing connections,
191    C(1)... C(M), however see this lack of activity as resource
192    availability and start growing their window by at least one segment
193    per RTT during this time period (by virtue of linear window increase
194    during congestion avoidance phase). For simplicity, we
195    assume that each of these connections has the same round trip time of
196    RTT, and the idle time for C(0) is k*RTT (where k > RTO/RTT). Under
197    these assumptions, each of these competing connections will increase
198    its congestion window by k segments. Therefore the number of
199    packets lost in the network due to slow start can be as high as:

201                     N/2 + M*k               ... (4)

203    The first term in the above equation is the packet loss due to slow
204    start, while the second term is the loss due to window growth of
205    competing connections (if the competing connections were in slow
206    start the response could have been worse).

208                                Figure-2
209                     C(1) C(2)...   C(M)
210                        |  |   ...    |
211                        |  |   ...    |
212                        |  |   ...    |
213                        V  V   ...    V
214                         \  \        /
215                          \  \      /
216                           \  \    /
217                     +------X--X--X---+       +------------------+
218    Defaulting       |                |       |                  |
219    C(0) ----------->|   Bottleneck   |------>| Buffered packets |--->
220    connection       |     router     |       |                  |
221                     +-----X--X----X--+       +------------------+
222                           |  |    |
223                           |  |    |
224                       C(1) C(2)  C(M)

226    Based on the above equation, we note that the congestion state of the
227    network depends upon the duration of spurious timeout.
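The estimate in (4) can be checked with a few lines of Python. This is only a sketch of the simplified analysis above: the function names are hypothetical, N is assumed to be even, and every competing connection is assumed to grow its window by exactly k segments while C(0) is stalled.

```python
def slow_start_burst(n: int) -> int:
    """Segments the sender injects in response to the N stale ACKs."""
    slow_start = 2 * (n // 2)   # two segments per ACK for the first N/2 ACKs
    cong_avoid = n - n // 2     # one segment per ACK for the remaining ACKs
    return slow_start + cong_avoid


def worst_case_loss(n: int, m: int, k: int) -> int:
    """Upper estimate of lost packets, N/2 + M*k, as in equation (4)."""
    burst_excess = slow_start_burst(n) - n   # only N segments left the network
    competitor_growth = m * k                # M connections grew by k segments each
    return burst_excess + competitor_growth


if __name__ == "__main__":
    print(slow_start_burst(16))       # 3N/2 for N = 16
    print(worst_case_loss(16, 4, 3))  # N/2 + M*k for N = 16, M = 4, k = 3
```

With N = 16, M = 4 and k = 3, the burst is 24 segments and the estimate is 8 + 12 = 20 lost packets.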
       In our response
228    algorithm we therefore take the duration of the spurious timeout
229    into account and reduce the data rate by half every RTO. Please note
230    that this scheme works well only when the number of competing
231    connections M does not vary too much while C(0) is stalled. A more
232    conservative response algorithm should reduce the data rate to
233    INIT_WINDOW if M is not bounded.

235    In the following sections we describe an algorithm that solves the
236    problem of both redundant retransmission and packet loss after a
237    spurious timeout.

239 3. De-correlated Loss Recovery (DCLOR)

241    The basic idea behind DCLOR is to send a new data segment from
242    outside the sender's retransmission queue and wait for the ACK or
243    SACK of the new data before initiating the response algorithm. Unlike
244    slow-start where the response algorithm starts immediately after
245    receiving the first ACK, DCLOR waits for the ACK/SACK of the new data
246    sent after timeout before initiating loss recovery. The SACK block
247    for new data contains sufficient information to determine all the
248    packets that were lost in the network. Once the sequence number of
249    lost packets is determined, the TCP sender grows its congestion
250    window as determined by the SS_THRESH and its congestion window.

252 3.1 Probe phase after a timeout

254    The following steps describe the response of a TCP sender on a
255    timeout:

257    1. If the timeout occurs before the three-way handshake is complete,
258       the TCP sender's behavior is unchanged.

260    2. After each timeout, the TCP sender MUST set its congestion
261       window to:

263          cwnd = max( cwnd >> 1, INIT_WINDOW).

265       The value of SS_THRESH MUST be left UNCHANGED at this point. The
266       TCP sender should also count the number of packets in flight at
267       this time, and keep it in a state variable stale_outstanding.

269    3. The TCP sender SHOULD also reset all the SACK tag bits in its
270       retransmission queue if this is the first timeout.

272    4.
          Instead of sending the first unacknowledged packet p(1)
273       after a timeout, the TCP sender should *disregard* its
274       congestion window and send ONE NEW MSS-sized data segment p(N+1).

276       The TCP sender should also store the sequence number of the new
277       segment in a new state variable called SS_PTR (for slow start
278       pointer).

280       If the sender does not have any new data outside its
281       retransmission queue, or if the receiver's flow control window
282       cannot sustain any new data, the TCP sender SHOULD send the
283       highest sequence numbered MSS sized data chunk from its
284       retransmission queue (i.e., it should send the last packet from
285       its retransmission queue).

287    5. A TCP sender MUST repeat step-2 to step-4 until it
288       enters the Timeout-Recovery state as described in step 6.

290 3.2 Congestion Control After the probe phase

292    6. For each ACK received with the ACK-sequence number
293       less than SS_PTR, regardless of the value of the SS_THRESH, the
294       TCP sender SHOULD NOT grow its congestion window. If the ACK
295       contains a new SACK block, the SACK tag SHOULD be set in the
296       corresponding data packet. If new segments were ACKed, and the
297       congestion window allows, the TCP sender SHOULD send new data.
298       (Note: the idea here is that the congestion window should not be
299       grown in response to stale ACKs since these ACKs don't reflect
300       the end-to-end state of the network).

302       In addition, the TCP sender SHOULD NOT take any timer sample for
303       the stale ACKs. (NOTE: We do not attempt to change the RTT
304       calculation in an ad-hoc manner; we believe that this is a
305       research problem that needs better network modelling before an
306       appropriate timer calculation can be found)

308    7. Step-6 continues until the TCP sender receives an
309       ACK acking a sequence number greater than SS_PTR, or it receives
310       a SACK block covering the sequence number greater than SS_PTR.

312       If the sender receives a SACK block containing SS_PTR, i.e., if
313       there is a packet loss in the stalled window, it SHOULD go to
314       step-8.

316       If the sender receives an ACK that acknowledges SS_PTR, i.e., if
317       no packets were lost from the stalled window, it SHOULD go to
318       step-10.

320    NOTE: In our previous experiments we had set the congestion window
321    to one MSS after a spurious timeout; however, this algorithm performs
322    better if there is moderate load on the routers and the number of
323    competing connections does not vary a lot during the stalling period.
324    In case of heavy load, setting the congestion window to INIT_WINDOW
325    still performs better. We believe that using the present congestion
326    response makes a fair compromise for different scenarios.

328 3.3 Timeout-Recovery: recovering lost packets after timeout

330    8. The TCP sender traverses the retransmission queue and marks
331       all the packets without any SACK tag as lost. The TCP sender
332       also updates its packets-in-flight (pipe) based on the SACK tags
333       and the lost segment information (the packets-in-flight (pipe)
334       should be ZERO after the update).

336       Please note that unlike Fast-Retransmit and Fast-recovery, DCLOR
337       uses only one SACK block containing SS_PTR to mark packets as
338       lost. This is because we do not expect packet reordering to
339       exist over the period of RTO.

341    9. The TCP sender should update its SS_THRESH, as:

343          SS_THRESH = stale_outstanding >> 1   (recorded in step-2)

345    10. The TCP sender SHOULD set its congestion window to cwnd+1.
346       If packets were lost in the network (i.e., if a SACK for
347       SS_PTR was received), the TCP sender should start by sending
348       packets with the lowest sequence number; else it should continue
349       with new data. (Note: for each new SACK block received, the
350       sender should send a segment--lost or new--and therefore the
351       problem of duplicate ACKs is not of concern here.)

353       The sender should follow the normal window growth strategy based
354       on the value of SS_THRESH after this step.

356    Please note that with a pure ACK acknowledging SS_PTR, the TCP sender
357    does not update the SS_THRESH value (it directly enters step-10 from
358    step-7). This prevents a TCP sender from setting its SS_THRESH to a
359    very small value if the spurious timeout occurs at the start of the
360    connection.

362 4. Data Delivery To Upper Layers

364    If a TCP sender loses its entire congestion window worth of data,
365    sending new data after timeout prevents a TCP receiver from
366    forwarding the new data to the upper layers immediately. However,
367    once the SACK for this new data is received, the TCP sender will send
368    the first lost segment. This essentially means that data delivery to
369    the upper layers could be delayed by at most one RTT when all the
370    packets are lost in the network.

372    This, however, does not affect the throughput of the connection in
373    any way. If a timeout has occurred, then the data delivery to the
374    upper layers has already been excessively delayed. Delaying it by
375    another round trip is not a serious problem. Please note that
376    reliability and timeliness are two conflicting issues and one cannot
377    gain on one without sacrificing something on the other.

379 5. Security Considerations

381    The TCP SACK information is meant to be advisory, and a TCP receiver
382    is allowed--though strongly discouraged--to discard data blocks the
383    receiver has already SACKed [RFC2018]. Please note however that even
384    if the TCP receiver discards the data block it received, it MUST still
385    send the SACK block for at least the most recent data received.
386    Therefore in spite of SACK reneging, DCLOR will work without any
387    deadlocks.

389    A SACK implementation is also allowed not to send a SACK block even
390    though the TCP sender and receiver might have agreed to the SACK-
391    Permitted option at the start of the connection.
       In these cases,
392    however, if the receiver sends one SACK block, it must send SACK
393    blocks for the rest of the connection. Because of the above mentioned
394    leniency in implementation, it is possible that a TCP receiver may
395    agree on the SACK-Permitted option, and yet not send any SACK blocks.
396    To make DCLOR robust under these circumstances, DCLOR SHOULD NOT be
397    invoked unless the sender has seen at least one SACK block before
398    timeout. We, however, believe that once the SACK-Permitted option is
399    accepted, the TCP receiver MUST send a SACK block--even though that
400    block might finally be discarded. Otherwise, the SACK-Permitted
401    option is completely redundant and serves little purpose. To the best
402    of our knowledge, almost all SACK implementations send a SACK block
403    if they have accepted the SACK-Permitted option.

405 6. References

407    [RFC2581] M. Allman, V. Paxson, W. Stevens, "TCP Congestion
408              Control," RFC 2581, Apr 1999.

410    [RFC2914] S. Floyd, "Congestion Control Principles," RFC 2914, Sep 2000.

412    [RFC2861] M. Handley, J. Padhye, S. Floyd, "TCP Congestion
413              Window Validation," RFC 2861, Jun 2000.

415    [BAFW03]  E. Blanton, M. Allman, K. Fall, L. Wang, "Conservative
416              SACK-based Loss Recovery Algorithm for TCP," draft-
417              allman-tcp-sack-13.txt. Internet draft; work in progress.
418              Oct 2002.

420    [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP
421              Selective Acknowledgment Options," RFC 2018, Oct 1996.

423    [RFC2883] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, "An
424              Extension to the Selective Acknowledgment (SACK) Option
425              for TCP," RFC 2883, Jul 2000.

427    [LM02]    R. Ludwig, M. Meyer, "The Eifel Detection Algorithm
428              for TCP." Internet draft; work in progress, draft-ietf-
429              tsvwg-tcp-eifel-alg-07.txt, Dec 2002.

431    [LG03]    R. Ludwig, A. Gurtov, "The Eifel Response Algorithm for
432              TCP." Internet draft; work in progress, draft-ietf-tsvwg-
433              tcp-eifel-response-03.txt, Mar 2003.

435    [SK03]    P. Sarolahti, M. Kojo, "F-RTO: A TCP RTO Recovery
436              Algorithm for Avoiding Unnecessary Retransmissions."
437              Internet draft; work in progress, draft-sarolahti-tsvwg-
438              tcp-frto-03.txt, Jan 2003.

440    [RFC2988] V. Paxson, M. Allman, "Computing TCP's Retransmission
441              Timer," RFC 2988, Nov 2000.

443    [BA02]    E. Blanton, M. Allman, "Using TCP DSACKs and SCTP
444              Duplicate TSNs to Detect Spurious Retransmissions,"
445              Internet draft; work in progress, draft-blanton-dsack-
446              use-02.txt, Oct 2002.

448 7. IPR Statement

450    The IETF has been notified of intellectual property rights claimed in
451    regard to some or all of the specification contained in this
452    document. For more information consult the on-line list of claimed
453    rights at http://www.ietf.org/ipr.

455 Authors' Addresses:

457    Yogesh Prem Swami                  Khiem Le
458    Nokia Research Center              Nokia Research Center
459    6000 Connection Drive              6000 Connection Drive
460    Irving TX-75063                    Irving TX-75063
461    USA                                USA

463    Phone: +1 972-374-0669             Phone: +1 972-894-4882
464    Email: yogesh.swami@nokia.com      Email: khiem.le@nokia.com