TCP Maintenance Working Group                                  M. Mathis
Internet-Draft                                              N. Dukkipati
Intended status: Experimental                                   Y. Cheng
Expires: January 12, 2012                                    Google, Inc
                                                           July 11, 2011

                  Proportional Rate Reduction for TCP
          draft-mathis-tcpm-proportional-rate-reduction-01.txt

Abstract

   This document describes an experimental algorithm, Proportional Rate
   Reduction (PRR), and related algorithms that improve the accuracy of
   the amount of data sent by TCP during loss recovery.  Standard
   congestion control requires that TCP and other protocols reduce
   their congestion window in response to losses.  This window
   reduction naturally occurs in the same round trip as the data
   retransmissions that repair the losses, and is implemented by
   choosing not to transmit any data in response to some ACKs arriving
   from the receiver.  Two widely deployed algorithms are used to
   implement this window reduction: Fast Recovery and Rate Halving.
   Both algorithms are needlessly fragile under a number of conditions,
   particularly when there is a burst of losses such that the number of
   ACKs returning to the sender is so small that the effective window
   falls below the target congestion window chosen by the congestion
   control algorithm.  Proportional Rate Reduction avoids these excess
   window reductions such that at the end of recovery the actual window
   size will be as close as possible to the window size determined by
   the congestion control algorithm.
   It is patterned after Rate Halving, but using the fraction that is
   appropriate for the target window chosen by the congestion control
   algorithm.  In addition, we propose two slightly different
   algorithms that bound the total window reduction due to all
   mechanisms, including application stalls and the losses themselves,
   and inhibit further window reductions when possible.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 12, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Algorithms
       3.1.  Examples
   4.  Properties
   5.  Measurements
   6.  Conclusion and Recommendations
   7.  Acknowledgements
   8.  Security Considerations
   9.  IANA Considerations
   10. References
   Appendix A.  Packet Conservation Bound
   Authors' Addresses

1.  Introduction

   This document describes an experimental algorithm, Proportional Rate
   Reduction (PRR), and two slightly different reduction bound
   algorithms that improve the accuracy of the amount of data sent by
   TCP during loss recovery.

   Standard congestion control [RFC 5681] requires that TCP (and other
   protocols) reduce their congestion window in response to losses.
   Fast Recovery, described in the same document, is the reference
   algorithm for making this adjustment.  Its stated goal is to recover
   TCP's self clock by relying on returning ACKs during recovery to
   clock more data into the network.  Fast Recovery adjusts the window
   by waiting for one half RTT of ACKs to pass before sending any data.
   It is fragile because it cannot compensate for the implicit window
   reduction caused by the losses themselves, and it is exposed to
   timeouts.  For example, if half of the data or ACKs are lost, Fast
   Recovery's expected behavior would be to reduce the window by not
   sending in response to the first half window of ACKs, but then it
   would not receive any additional ACKs and would timeout because it
   failed to send anything at all.

   The Rate-Halving algorithm improves this situation by sending data
   on alternate ACKs during recovery, such that after one RTT the
   window has been halved.  Rate-Halving is implemented in Linux,
   although it was only informally published [RHweb], including in an
   uncompleted Internet-Draft [RHID].  Rate-Halving also does not
   adequately compensate for the implicit window reduction caused by
   the losses, and it assumes a 50% window reduction, which was
   completely standard at the time it was written (several modern
   congestion control algorithms, such as CUBIC [CUBIC], can sometimes
   reduce the window by much less than 50%).  As a consequence,
   Rate-Halving often allows the window to fall further than
   necessary, reducing performance and increasing the risk of timeouts
   if there are any additional losses.

   Proportional Rate Reduction (PRR) avoids these excess window
   reductions such that at the end of recovery the actual window size
   will be as close as possible to the window size determined by the
   congestion control algorithm.  It is patterned after Rate Halving,
   but using the fraction that is appropriate for the target window
   chosen by the congestion control algorithm.  During PRR, one of two
   additional reduction bound algorithms monitors the total window
   reduction due to all mechanisms, including application stalls and
   the losses themselves, and attempts to inhibit further window
   reductions.

   We describe two slightly different reduction bound algorithms: a
   conservative reduction bound (CRB), which meets a strict segment
   conserving correctness criterion; and a slow start reduction bound
   (SSRB), which is more aggressive than CRB by at most one segment
   per ACK.  CRB meets a conservative, philosophically pure, and
   aesthetically appealing notion of correctness; however, in real
   networks it does not perform as well as the algorithms described in
   RFC 3517, which prove to be non-conservative in a statistically
   significant number of cases.  SSRB offers a compromise by allowing
   TCP to send one additional segment per ACK relative to CRB in some
   situations.  Although SSRB is less aggressive than RFC 3517
   (transmitting fewer segments or transmitting them later), it
   slightly outperforms it, due to the slightly lower probability of
   additional losses during recovery.

   All three algorithms are based on common design principles, derived
   from Van Jacobson's packet conservation principle: segments
   delivered to the receiver are used as the clock to trigger sending
   additional segments into the network.  As much as possible,
   Proportional Rate Reduction and the reduction bounds rely on this
   self clock process, and are only slightly affected by the accuracy
   of other estimators, such as pipe [RFC 3517] and cwnd.  This is
   what gives the algorithms their precision in the presence of events
   that cause uncertainty in other estimators.
   In Section 5, we summarize a companion paper [Recovery] with some
   measurement experiments: PRR+SSRB outperforms both RFC 3517 and
   PRR+CRB under authentic network traffic.

   The algorithms are described as modifications to RFC 5681, TCP
   Congestion Control, using concepts drawn from the pipe algorithm
   [RFC 3517].  They are most accurate and more easily implemented
   with SACK [RFC 2018], but they do not require SACK.

2.  Definitions

   The following terms, parameters, and state variables are used as
   they are defined in earlier documents:

   RFC 3517: covered (as in "covered sequence numbers")

   RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size
   (SMSS)

   Voluntary Window Reductions: choosing not to send data in response
   to some ACKs, for the purpose of reducing the sending window size
   or data rate.

   We define some additional variables:

   SACKd: The total number of bytes that the scoreboard indicates have
   been delivered to the receiver.  This can be computed by scanning
   the scoreboard and counting the total number of bytes covered by
   all SACK blocks.

   DeliveredData: The total number of bytes that the current ACK
   indicates have been delivered to the receiver, relative to all past
   ACKs.  When not in recovery, DeliveredData is the change in
   snd.una.  With SACK, DeliveredData is not an estimator and can be
   computed precisely as the change in snd.una plus the change in
   SACKd.  Note that if there are SACK blocks and snd.una advances,
   the change in SACKd is typically negative.  In recovery without
   SACK, DeliveredData is estimated to be 1 SMSS on duplicate
   acknowledgements, and on a subsequent partial or full ACK,
   DeliveredData is estimated to be the change in snd.una, minus one
   SMSS for each preceding duplicate ACK.

   Note that DeliveredData is robust: for TCP using SACK,
   DeliveredData can be precisely computed anywhere in the network
   just by inspecting the returning ACKs.  The consequence of missing
   ACKs is that later ACKs will show a larger DeliveredData.
   Furthermore, for any TCP (with or without SACK) the sum of
   DeliveredData must agree with the forward progress over the same
   time interval.

   We introduce a local variable "sndcnt", which indicates exactly how
   many bytes should be sent in response to each ACK.  Note that the
   decision of which data to send (e.g., retransmit missing data or
   send more new data) is out of scope for this document.
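   To make the SACK case concrete, the following is a minimal sketch
   (in C) of computing DeliveredData from the per-ACK changes in
   snd.una and SACKd.  It is illustrative only and not taken from any
   implementation; the structure and names (prr_sock, snd_una, sackd,
   delivered_data) are our own assumptions.

      /* Illustrative sketch, not from any real stack:
       * DeliveredData = delta(snd.una) + delta(SACKd), per ACK. */
      #include <stdint.h>

      struct prr_sock {
          uint32_t snd_una;  /* first unacknowledged sequence number */
          uint32_t sackd;    /* total bytes covered by SACK blocks */
      };

      /* Called once per ACK with the values implied by that ACK;
       * returns DeliveredData and updates the stored state. */
      static uint32_t delivered_data(struct prr_sock *tp,
                                     uint32_t ack_una,
                                     uint32_t ack_sackd)
      {
          /* Bytes newly acknowledged cumulatively. */
          uint32_t delta_una = ack_una - tp->snd_una;
          /* Change in SACKd; typically negative when snd.una
           * advances past previously SACKed data, so keep it
           * signed. */
          int32_t delta_sackd = (int32_t)(ack_sackd - tp->sackd);

          tp->snd_una = ack_una;
          tp->sackd = ack_sackd;
          /* Per the definition above, the sum is never negative. */
          return delta_una + (uint32_t)delta_sackd;
      }

   Without SACK, the same interface could instead return the SMSS-
   based estimates described above.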
3.  Algorithms

   Summary: If pipe (the estimated amount of data in flight) is larger
   than ssthresh (the target cwnd at the end of recovery), then
   Proportional Rate Reduction spreads the voluntary window reductions
   across a full RTT, such that as prr_delivered approaches RecoverFS
   (at the end of recovery) prr_out approaches ssthresh, the target
   value for cwnd.  If there are excess losses such that pipe falls
   below ssthresh, the selected reduction bound algorithm tries to
   hold pipe at ssthresh by sending at most "limit" segments per ACK
   to catch up.  For both reduction bound algorithms the limit is
   first set to past voluntary window reductions (prr_delivered -
   prr_out), permitting single ACKs to trigger sending multiple
   segments.  With PRR+CRB (conservative==True), if there are too many
   losses then prr_delivered - prr_out will be exactly the same as
   DeliveredData for the current ACK, resulting in
   sndcnt=DeliveredData; therefore there will be no further Voluntary
   Window Reductions.  With PRR+SSRB (conservative==False) the same
   situation results in sndcnt=DeliveredData+1, which ultimately
   causes TCP to slowstart up to ssthresh.

   At the beginning of recovery, initialize the PRR state.  This
   assumes a modern congestion control algorithm, CongCtrlAlg(), that
   might set ssthresh to something other than FlightSize/2:

      ssthresh = CongCtrlAlg()      // Target cwnd after recovery
      prr_delivered = 0             // Total bytes delivered during recovery
      prr_out = 0                   // Total bytes sent during recovery
      RecoverFS = snd.nxt - snd.una // FlightSize at the start of recovery

   On every ACK during recovery compute:

      DeliveredData = delta(snd.una) + delta(SACKd)
      prr_delivered += DeliveredData
      pipe = (RFC 3517 pipe algorithm)
      if (pipe > ssthresh) {
         // Proportional Rate Reduction
         sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
      } else {
         // Two versions of the reduction bound
         if (conservative) {        // PRR+CRB
            limit = prr_delivered - prr_out
         } else {                   // PRR+SSRB
            limit = MAX(prr_delivered - prr_out, DeliveredData) + 1
         }
         sndcnt = MIN(ssthresh - pipe, limit)
      }
      sndcnt = MAX(sndcnt, 0)       // Ensure a non-negative value

   On any data transmission or retransmission:

      prr_out += (data sent)        // strictly less than or equal to sndcnt
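   The per-ACK computation above transcribes almost directly into
   code.  The following is a self-contained sketch in C, not an
   excerpt from any implementation; pipe is passed in by the caller
   because the RFC 3517 scoreboard is out of scope here, and the names
   (struct prr, prr_on_ack) are our own.

      /* Sketch of the per-ACK PRR computation above; illustrative
       * only.  Byte counting is assumed. */
      #include <stdbool.h>
      #include <stdint.h>

      struct prr {
          uint64_t ssthresh;       /* target cwnd after recovery */
          uint64_t prr_delivered;  /* bytes delivered during recovery */
          uint64_t prr_out;        /* bytes sent during recovery */
          uint64_t recover_fs;     /* FlightSize at start of recovery */
          bool conservative;       /* true: PRR+CRB, false: PRR+SSRB */
      };

      /* Returns sndcnt: bytes that may be sent for this ACK. */
      static uint64_t prr_on_ack(struct prr *p,
                                 uint64_t delivered_data,
                                 uint64_t pipe)
      {
          int64_t sndcnt, limit;

          p->prr_delivered += delivered_data;
          if (pipe > p->ssthresh) {
              /* Proportional Rate Reduction:
               * CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out,
               * with integer ceiling division. */
              sndcnt = (int64_t)((p->prr_delivered * p->ssthresh +
                                  p->recover_fs - 1) / p->recover_fs)
                       - (int64_t)p->prr_out;
          } else {
              /* Reduction bound: catch up toward ssthresh, within
               * the past voluntary reductions (plus one for SSRB). */
              limit = (int64_t)p->prr_delivered - (int64_t)p->prr_out;
              if (!p->conservative) {               /* PRR+SSRB */
                  if (limit < (int64_t)delivered_data)
                      limit = (int64_t)delivered_data;
                  limit += 1;
              }
              sndcnt = (int64_t)p->ssthresh - (int64_t)pipe;
              if (sndcnt > limit)
                  sndcnt = limit;
          }
          return sndcnt > 0 ? (uint64_t)sndcnt : 0;
      }

   On each transmission or retransmission the sender would add the
   bytes actually sent to prr_out, mirroring the last step of the
   pseudocode.  For instance, with ssthresh = 10 and RecoverFS = 20
   (the single-loss example in Section 3.1, counting segments), after
   10 ACKs prr_delivered = 10 and the cumulative allowance is
   CEIL(10 * 10 / 20) = 5 segments: half of what was delivered, as
   intended.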
   The following examples will make these algorithms much clearer.

3.1.  Examples

   We illustrate these algorithms by showing their different behaviors
   for two scenarios: TCP experiencing either a single loss or a burst
   of 15 consecutive losses.  In all cases we assume bulk data,
   standard AIMD congestion control (ssthresh is set to FlightSize/2),
   and cwnd = FlightSize = pipe = 20 segments, so ssthresh will be set
   to 10 at the beginning of recovery.  We also assume standard Fast
   Retransmit and Limited Transmit, so we send two new segments
   followed by one retransmit on the first 3 duplicate ACKs after the
   losses.

   Each of the diagrams below shows the per ACK response to the first
   round trip for the various recovery algorithms when the zeroth
   segment is lost.  The top line indicates the transmitted segment
   number triggering the ACKs, with an X for the lost segment.  "cwnd"
   and "pipe" indicate the values of these variables after processing
   each returning ACK.  "Sent" indicates how much "N"ew or
   "R"etransmitted data would be sent.  Note that the algorithms for
   deciding which data should be sent are out of scope of this
   document.

   When there is a single loss, PRR with either of the reduction bound
   algorithms has the same behavior.  We show "RB", a flag indicating
   which reduction bound subexpression ultimately determined the value
   of sndcnt.  When losses are minimal, "limit" (under both
   algorithms) will always be larger than ssthresh - pipe, so sndcnt
   will be ssthresh - pipe, indicated by "s" in the "RB" row.  PRR
   does not use cwnd during recovery.

   RFC 3517
   ack#  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
   cwnd:    20 20 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
   pipe:    19 19 18 18 17 16 15 14 13 12 11 10 10 10 10 10 10 10 10
   sent:     N  N  R                          N  N  N  N  N  N  N  N

   Rate Halving (Linux)
   ack#  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
   cwnd:    20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11
   pipe:    19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10
   sent:     N  N  R     N     N     N     N     N     N     N     N

   PRR
   ack#  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
   pipe:    19 19 18 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 10
   sent:     N  N  R     N     N     N     N     N     N     N     N
   RB:                                                         s  s

   Note that all three algorithms send the same total amount of data.
   RFC 3517 experiences a "half-window of silence", while Rate Halving
   and PRR spread the voluntary window reduction across an entire RTT.

   Next we consider the same initial conditions when the first 15
   packets (0-14) are lost.  During the remainder of the lossy RTT,
   only 5 ACKs are returned to the sender.  We examine each of these
   algorithms in succession.

   RFC 3517
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   cwnd:                                              20 20 11 11 11
   pipe:                                              19 19  4 10 10
   sent:                                               N  N 7R  R  R

   Rate Halving (Linux)
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   cwnd:                                              20 20  5  5  5
   pipe:                                              19 19  4  4  4
   sent:                                               N  N  R  R  R

   PRR-CRB
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   pipe:                                              19 19  4  4  4
   sent:                                               N  N  R  R  R
   RB:                                                       f  f  f

   PRR-SSRB
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   pipe:                                              19 19  4  5  6
   sent:                                               N  N 2R 2R 2R
   RB:                                                       d  d  d

   In this situation, RFC 3517 is very non-conservative, because once
   fast retransmit is triggered (on the ACK for segment 17) TCP
   immediately retransmits sufficient data to bring pipe up to cwnd.
   Our measurement data (see Section 5) indicates that RFC 3517
   significantly outperforms Rate Halving, PRR-CRB, and some other
   similarly conservative algorithms that we tested, suggesting that
   it is significantly common for the actual losses to exceed the
   window reduction determined by the congestion control algorithm.

   The Linux implementation of Rate Halving includes an early version
   of the conservative reduction bound [RHweb].  In this situation the
   five ACKs trigger exactly 5 transmissions in total (2 new data, 3
   old data), and cwnd is set to 5.  At a window size of 5, it takes
   three round trips to retransmit the 15 lost segments.  Rate Halving
   does not raise the window during recovery, so when recovery finally
   completes, TCP will slowstart cwnd from 5 up to 10.  In this
   example, TCP operates at half of the window chosen by the
   congestion control for more than three RTTs, increasing the elapsed
   time and exposing it to timeouts if there are additional losses.

   PRR-CRB implements a conservative reduction bound.  Since the total
   losses bring pipe below ssthresh, data is sent such that the total
   data transmitted, prr_out, follows the total data delivered to the
   receiver as reported by returning ACKs.  Transmissions are
   controlled by the sending limit, which is set to prr_delivered -
   prr_out.  This is indicated by the RB:f tagging in the figure.  In
   this case PRR-CRB is exposed to exactly the same problems as Rate
   Halving: taking excessively long to recover from the losses and
   being exposed to additional timeouts.
   PRR-SSRB increases the window by exactly 1 segment per ACK until
   pipe rises to ssthresh during recovery.  This is accomplished by
   setting limit to one greater than the data reported to have been
   delivered to the receiver on this ACK, effectively implementing a
   slowstart during recovery, as indicated by the RB:d tagging in the
   figure.  Although increasing the window during recovery seems to be
   ill advised, it is important to remember that this is actually less
   aggressive than the current standard, which permits sending the
   same quantity of extra data as a single burst in response to the
   ACK that triggered Fast Retransmit.

   Under less extreme conditions, when the total losses are smaller
   than the difference between FlightSize and ssthresh, PRR-CRB and
   PRR-SSRB have identical behaviors.

4.  Properties

   The following properties are common to both PRR-CRB and PRR-SSRB:

   Normally Proportional Rate Reduction will spread Voluntary Window
   Reductions out evenly across a full RTT.  This has the potential to
   generally reduce the burstiness of Internet traffic, and could be
   considered to be a type of soft pacing.  Theoretically any pacing
   increases the probability that different flows are interleaved,
   reducing the opportunity for ACK compression and other phenomena
   that increase traffic burstiness.  However these effects have not
   been quantified.

   If there are minimal losses, Proportional Rate Reduction will
   converge to exactly the target window chosen by the congestion
   control algorithm.  Note that as TCP approaches the end of
   recovery, prr_delivered will approach RecoverFS and sndcnt will be
   computed such that prr_out approaches ssthresh.

   Implicit window reductions, due to multiple isolated losses during
   recovery, cause later Voluntary Reductions to be skipped.  For
   small numbers of losses the window size ends at exactly the window
   chosen by the congestion control algorithm.

   For burst losses, earlier Voluntary Window Reductions can be undone
   by sending extra segments in response to ACKs arriving later during
   recovery.  Note that as long as some Voluntary Window Reductions
   are not undone, the final value for pipe will be the same as
   ssthresh, the target cwnd value chosen by the congestion control
   algorithm.

   Proportional Rate Reduction with either reduction bound improves
   the situation when there are application stalls (e.g., when the
   sending application does not queue data for transmission quickly
   enough or the receiver stops advancing rwnd).  When there is an
   application stall early during recovery, prr_out will fall behind
   the sum of the transmissions permitted by sndcnt.  The missed
   opportunities to send due to stalls are treated like banked
   Voluntary Window Reductions: specifically they cause
   prr_delivered - prr_out to be significantly positive.  If the
   application catches up while TCP is still in recovery, TCP will
   send a partial window burst to catch up to exactly where it would
   have been, had the application never stalled.  Although this burst
   might be viewed as being hard on the network, this is exactly what
   happens every time there is a partial RTT application stall while
   not in recovery.  We have made the partial RTT stall behavior
   uniform in all states.  Changing this behavior is out of scope for
   this document.
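   To illustrate the banked reductions, the following hypothetical
   trace (building on the prr_on_ack() sketch in Section 3, counting
   segments for readability) shows a stalled sender accumulating
   prr_delivered - prr_out, and a later ACK permitting a multi-segment
   catch-up burst.

      /* Hypothetical trace using the prr_on_ack() sketch from
       * Section 3; all values count segments for readability. */
      static void stall_demo(void)
      {
          struct prr p = { .ssthresh = 10, .recover_fs = 20,
                           .conservative = false };

          /* Five ACKs arrive while the application has nothing to
           * send: prr_delivered grows to 5 but prr_out stays 0, so
           * five missed sending opportunities are "banked". */
          for (uint64_t i = 0; i < 5; i++)
              (void)prr_on_ack(&p, 1, 19 - i);  /* pipe drains */

          /* When the application catches up, the next ACK permits
           * CEIL(6 * 10 / 20) - 0 = 3 segments at once, bringing the
           * sender to exactly where it would have been without the
           * stall. */
          uint64_t burst = prr_on_ack(&p, 1, 14);
          (void)burst;  /* burst == 3 */
      }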
   Proportional Rate Reduction with a Reduction Bound is significantly
   less sensitive to errors of the pipe estimator.  While in recovery,
   pipe is intrinsically an estimator, using incomplete information to
   guess if un-SACKed segments are actually lost or merely out-of-
   order in the network.  Under some conditions pipe can have
   significant errors; for example, when a burst of reordered data is
   presumed to be lost and is retransmitted, but then the original
   data arrives before the retransmission.  If the transmissions are
   regulated directly by pipe as they are in RFC 3517, then errors and
   discontinuities in the value of the pipe estimator can cause
   significant errors in the amount of data sent.  With Proportional
   Rate Reduction with a Reduction Bound, pipe merely determines how
   sndcnt is computed from DeliveredData.  Since short term errors in
   pipe are smoothed out across multiple ACKs, and both Proportional
   Rate Reduction and the reduction bound converge to the same final
   window, errors in the pipe estimator have less impact on the final
   outcome.  (This needs to be tested better.)

   Under all conditions and sequences of events during recovery,
   PRR-CRB strictly bounds the data transmitted to be equal to or less
   than the amount of data delivered to the receiver.  We claim that
   this packet conservation bound is the most aggressive algorithm
   that does not lead to additional forced losses in some
   environments.  It has the property that if there is a standing
   queue at a bottleneck that is carrying no other traffic, the queue
   will maintain exactly constant length for the entire recovery
   duration (except for +1/-1 fluctuation due to differences in packet
   arrival and exit times).  See Appendix A for a detailed discussion
   of this property.

   Although the Packet Conserving Bound is very appealing for a
   number of reasons, our measurements summarized in Section 5
   demonstrate that it is less aggressive and does not perform as well
   as RFC 3517, which permits large bursts of data when there are
   bursts of losses.  PRR-SSRB is a compromise that permits TCP to
   send one extra segment per ACK as compared to the packet conserving
   bound.  From the perspective of the packet conserving bound,
   PRR-SSRB does indeed open the window during recovery; however, it
   is significantly less aggressive than RFC 3517 in the presence of
   burst losses.

5.  Measurements

   In a (to be published) companion paper [Recovery] we describe some
   measurements comparing the various strategies for reducing the
   window during recovery.  The results presented in that paper are
   summarized here.

   The various window reduction algorithms and extensive
   instrumentation were all implemented in a modified Linux 2.6.34
   kernel.  For all experiments we used a uniform subset of the non-
   standard algorithms present in the base Linux implementation.
   Specifically we disabled threshold retransmit [FACK], which
   triggers Fast Retransmit earlier than the standard algorithm.  We
   left enabled CUBIC [CUBIC], limited transmit [LT], and the lost
   retransmission detection algorithms.  This subset was a compromise
   chosen such that the behaviors of both Rate Halving (the Linux
   default) and RFC 3517 mode were authentic to their respective
   specifications, while at the same time the performance and features
   were comparable to the kernels in production use.
   The different window reduction algorithms were all present in the
   same kernel and could be selected with a sysctl, such that we had
   an absolutely uniform baseline for comparing RFC 3517, Rate
   Halving, and PRR with various reduction bounds.

   Our experiments included an additional algorithm, PRR with an
   unlimited bound (PRR-UB), which sends ssthresh-pipe bursts when
   pipe falls below ssthresh.  This behavior parallels RFC 3517.

   An important detail of this configuration is that CUBIC only
   reduces the window by 30%, as opposed to the 50% reduction used by
   traditional congestion control algorithms.  This, in conjunction
   with using only standard algorithms to trigger Fast Retransmit,
   accentuates the tendency for RFC 3517 and PRR-UB to send a burst at
   the point when Fast Retransmit gets triggered, if pipe is already
   below ssthresh.

   All experiments were performed on servers carrying production
   traffic for multiple Google services.

   In this configuration it is observed that for 32% of the recovery
   events, pipe falls below ssthresh before Fast Retransmit is
   triggered; thus the various PRR algorithms start in the reduction
   bound phase, and both PRR-UB and RFC 3517 send bursts of segments
   with the fast retransmit.

   In the companion paper we observe that PRR-SSRB spends the least
   time in recovery of all the algorithms tested, largely because it
   experiences fewer timeouts once it is already in recovery.

   RFC 3517 experiences 29% more detected lost retransmissions and
   2.6% more timeouts (presumably due to undetected lost
   retransmissions) than PRR-SSRB.  These results are representative
   of PRR-UB and other algorithms that send bursts when pipe falls
   below ssthresh.

   Rate Halving experiences 5% more timeouts and significantly smaller
   final cwnd values at the end of recovery.  The smaller cwnd
   sometimes causes the recovery itself to take extra round trips.
   These results are representative of PRR-CRB and other algorithms
   that implement strict packet conservation during recovery.

6.  Conclusion and Recommendations

   [This text assumes standards track.  Experimental status would be
   somewhat more reserved.]

   Although the packet conserving bound is very appealing for a number
   of reasons, our measurements summarized in Section 5 demonstrate
   that it is less aggressive and does not perform as well as RFC
   3517, which permits large bursts of data when there are bursts of
   losses.  PRR-SSRB is a compromise that permits TCP to send one
   extra segment per ACK as compared to the packet conserving bound.
   From the perspective of the packet conserving bound, PRR-SSRB does
   indeed open the window during recovery; however, it is
   significantly less aggressive than RFC 3517 in the presence of
   burst losses.

   All TCP implementations SHOULD implement both PRR-CRB and PRR-SSRB,
   with a control to select which algorithm is used.  It is
   RECOMMENDED that PRR-SSRB be the default algorithm.

7.  Acknowledgements

   This draft is based in part on previous incomplete work by Matt
   Mathis, Jeff Semke, and Jamshid Mahdavi [RHID] and influenced by
   several discussions with John Heffner.

   Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the
   experiments.

8.  Security Considerations

   Proportional Rate Reduction does not change the risk profile for
   TCP.
   Implementers that change PRR from counting bytes to counting
   segments have to be cautious about the effects of ACK splitting
   attacks [SPLIT], where the receiver acknowledges partial segments
   for the purpose of confusing the sender's congestion accounting.

9.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as
   an RFC.

10.  References

   TODO: A proper reference section.

   [RFC 2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow,
               "TCP Selective Acknowledgment Options", October 1996.

   [RFC 3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
               Conservative Selective Acknowledgment (SACK)-based
               Loss Recovery Algorithm for TCP", April 2003.

   [RFC 5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
               Control", September 2009.

   [LT]        Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
               TCP's Loss Recovery Using Limited Transmit", RFC 3042,
               January 2001.

   [SPLIT]     Savage, S., Cardwell, N., Wetherall, D., and T.
               Anderson, "TCP Congestion Control with a Misbehaving
               Receiver", ACM Computer Communication Review, October
               1999.

   [RHweb]     Mathis, M. and J. Mahdavi, "TCP Rate-Halving with
               Bounding Parameters",
               http://www.psc.edu/networking/papers/FACKnotes/971219/,
               December 1997.

   [RHID]      Mathis, M., Semke, J., Mahdavi, J., and K. Lahey, "The
               Rate-Halving Algorithm for TCP Congestion Control",
               http://www.psc.edu/networking/ftp/papers/
               draft-ratehalving.txt, Work in progress, last updated
               June 1999.

   [CUBIC]     Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-
               speed TCP variant", PFLDnet, February 2005.

   [FACK]      Mathis, M. and J. Mahdavi, "Forward Acknowledgment:
               Refining TCP Congestion Control", Proceedings of
               SIGCOMM '96, Stanford, CA, August 1996.

   [Recovery]  Dukkipati, N., Mathis, M., and Y. Cheng, "Improving TCP
               loss recovery", to be published, 2011.

Appendix A.  Packet Conservation Bound

   Under all conditions and sequences of events during recovery,
   PRR-CRB strictly bounds the data transmitted to be equal to or less
   than the amount of data delivered to the receiver.  We claim that
   this packet conservation bound is the most aggressive algorithm
   that does not lead to additional forced losses in some
   environments.  It has the property that if there is a standing
   queue at a bottleneck that is carrying no other traffic, the queue
   will maintain exactly constant length for the entire recovery
   duration (except for +1/-1 fluctuation due to differences in packet
   arrival and exit times).  Any less aggressive algorithm will result
   in a declining queue at the bottleneck.  Any more aggressive
   algorithm will result in an increasing queue or additional losses
   at the bottleneck.

   We demonstrate this property with a little thought experiment:

   Imagine a network path that has insignificant delays in both
   directions, except for the processing time and queue at a single
   bottleneck in the forward path.  By insignificant delay, we mean
   that when a packet is "served" at the head of the bottleneck queue,
   the following events happen in much less than one packet time at
   the bottleneck: the packet arrives at the receiver; the receiver
   sends an ACK, which arrives at the sender; the sender processes the
   ACK and sends some data; the data is queued at the bottleneck.

   If sndcnt is set to DeliveredData and nothing else is inhibiting
   sending data, then clearly the data arriving at the bottleneck
   queue will exactly replace the data that was served at the head of
   the queue, so the queue will have a constant length.  If the queue
   is drop tail and full, then the queue will stay exactly full, even
   in the presence of losses or reordering on the ACK path, and
   independent of whether the data is in order or out-of-order (e.g.,
   simple reordering or loss recovery from an earlier RTT).  Any more
   aggressive algorithm sending additional data will cause a queue
   overflow and loss.  Any less aggressive algorithm will underfill
   the queue.  Therefore setting sndcnt to DeliveredData is the most
   aggressive algorithm that does not cause forced losses in this
   simple network.  Relaxing the assumptions (e.g., making delays more
   authentic and adding more flows, delayed ACKs, etc.) increases the
   noise (jitter) in the system but does not change its basic
   behavior.
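   The constant-queue behavior is easy to check mechanically.  The toy
   loop below is a sketch under the idealized assumptions above, with
   entirely hypothetical names and setup: it serves packets from a
   full drop-tail queue and lets each resulting ACK trigger
   sndcnt = DeliveredData new arrivals, and the queue length never
   changes.

      /* Toy check of the constant-queue claim under the idealized
       * assumptions above; names and setup are hypothetical. */
      #include <stdio.h>

      int main(void)
      {
          int queue = 8;                /* standing bottleneck queue */
          for (int t = 0; t < 20; t++) {
              queue--;                  /* head packet served, ACKed */
              int delivered_data = 1;   /* ACK reports one packet */
              int sndcnt = delivered_data;  /* PRR-CRB bound */
              queue += sndcnt;          /* response joins the queue */
          }
          printf("queue after 20 served packets: %d\n", queue); /* 8 */
          return 0;
      }

   Replacing sndcnt = delivered_data with anything larger overflows
   the full queue; anything smaller drains it, matching the argument
   above.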
   Note that the congestion control algorithm implements a broader
   notion of optimal that includes appropriate sharing of the network.
   PRR-CRB will choose to send the lesser of the data permitted by
   this packet conserving bound and the amount determined by the
   congestion control algorithm, as PRR brings TCP's actual window
   down to ssthresh.

Authors' Addresses

   Matt Mathis
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: mattmathis@google.com

   Nandita Dukkipati
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: nanditad@google.com

   Yuchung Cheng
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: ycheng@google.com