2 TCP Maintenance Working Group M. Mathis 3 Internet-Draft N. Dukkipati 4 Obsoletes: 6937 (if approved) Y. Cheng 5 Intended status: Standards Track Google, Inc. 6 Expires: 16 July 2021 12 January 2021 8 Proportional Rate Reduction for TCP 9 draft-ietf-tcpm-prr-rfc6937bis-00 11 Abstract 13 This document updates the Proportional Rate Reduction (PRR) algorithm 14 described as experimental in RFC 6937 to standards track. PRR 15 potentially replaces the Fast Recovery and Rate-Halving algorithms. 16 All of these algorithms regulate the amount of data sent by TCP or 17 other transport protocol during loss recovery. PRR more accurately 18 regulates the actual flight size through recovery such that at the 19 end of recovery it will be as close as possible to the ssthresh, as 20 determined by the congestion control algorithm. 22 Status of This Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts.
The list of current Internet- 30 Drafts is at https://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on 16 July 2021. 39 Copyright Notice 41 Copyright (c) 2021 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 46 license-info) in effect on the date of publication of this document. 47 Please review these documents carefully, as they describe your rights 48 and restrictions with respect to this document. Code Components 49 extracted from this document must include Simplified BSD License text 50 as described in Section 4.e of the Trust Legal Provisions and are 51 provided without warranty as described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction to bis . . . . . . . . . . . . . . . . . . . . . 2 56 1.1. Document and WG Information . . . . . . . . . . . . . . . 3 57 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 58 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 6 59 4. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 7 60 4.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . 8 61 5. Properties . . . . . . . . . . . . . . . . . . . . . . . . . 11 62 6. Measurements . . . . . . . . . . . . . . . . . . . . . . . . 13 63 7. Conclusion and Recommendations . . . . . . . . . . . . . . . 14 64 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 15 65 9. Security Considerations . . . . . . . . . . . . . . . . . . . 15 66 10. Normative References . . . . . . . . . . . . . . . . . . . . 15 67 11. Informative References . . . . . . . . . . . . . . . . . . . 16 68 Appendix A. Strong Packet Conservation Bound . . . . . . . . . . 17 69 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 18 71 1. Introduction to bis 73 This document updates the Proportional Rate Reduction (PRR) algorithm 74 described in [RFC6937] from experimental to standards track. PRR 75 accurately regulates the amount of data sent during loss recovery, such 76 that at the end of recovery the flight size will be as close as 77 possible to the ssthresh, as determined by the congestion control 78 algorithm. PRR has been deployed in at least 3 major operating 79 systems covering the vast majority of today's web traffic. There 80 have been no changes to PRR as documented in the experimental RFC. 81 The descriptions here have been [will be] updated to normative 82 standards language. For a tutorial description of the algorithms and 83 the rationale behind them, please see the original RFC. 85 The experimental RFC describes two different reduction bound 86 algorithms to limit the total window reduction due to all mechanisms, 87 including transient application stalls and the losses themselves: 88 Conservative Reduction Bound (CRB), which is strictly packet 89 conserving; and a Slow Start Reduction Bound (SSRB), which is more 90 aggressive than CRB by at most 1 segment per ACK. [RFC6937] left the 91 choice of Reduction Bound to the discretion of the implementer.
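To make the difference between the two bounds concrete (this is a paraphrase of the algorithm in Section 4, not new normative text): once pipe has fallen below ssthresh, PRR-CRB limits what may be sent on each ACK to the bytes delivered so far in recovery minus the bytes already sent (prr_delivered - prr_out), while PRR-SSRB raises that limit to at least the bytes newly reported delivered by this ACK plus one SMSS. The extra SMSS per ACK is what lets SSRB rebuild the flight size toward ssthresh in the manner of slow start, and it is the sole source of SSRB's additional aggressiveness relative to CRB.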
93 The paper "An Internet-Wide Analysis of Traffic Policing" [Flach et 94 al] uncovered a crucial situation where the Reduction Bound mattered. 95 Under certain configurations, token bucket traffic policers 96 [token_bucket] can suddenly start discarding a large fraction of the 97 traffic. This happens without warning to the end systems when 98 policers run out of tokens. The transport congestion control has no 99 opportunity to measure the token rate, and sets ssthresh based on the 100 recently observed path performance. This value for ssthresh may be 101 substantially larger than can be sustained by the token rate, 102 potentially causing persistent high loss. Under these conditions, 103 both reduction bounds perform very poorly. PRR-CRB is too timid, 104 sometimes causing very long recovery times at smaller than necessary 105 windows, and PRR-SSRB is too aggressive, often causing many 106 retransmissions to be lost multiple times. 108 Investigating these environments led to the development of a 109 heuristic to dynamically switch between Reduction Bounds: use PRR- 110 SSRB only while snd.una is advancing without evidence of ongoing 111 congestion and use PRR-CRB otherwise. 113 This heuristic is only invoked for what should be a rare corner case: 114 when losses or other events cause the flight size to fall below 115 ssthresh. The extreme loss rates that make the heuristic important 116 are only common in the presence of poorly configured token bucket 117 policers, which are pathologically wasteful and inefficient. In 118 these environments, the heuristic serves to salvage a bad situation, 119 and any reasonable implementation of the heuristic performs far 120 better than either bound by itself. 122 The algorithm below is identical to the algorithm presented in 123 [RFC6937]. The "conservative" parameter MAY be replaced by the 124 heuristic described above. 126 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 127 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 128 document are to be interpreted as described in [RFC2119]. 130 1.1. Document and WG Information 132 Please send all comments, questions, and feedback to tcpm@ietf.org. 134 Formatted: 2021-01-13 15:31:40-08:00 136 About this revision: 138 Text above (the "Introduction to bis") is drawn from draft-mathis- 139 tcpm-rfc6937bis-00. In this -00, all of the text below is copied 140 verbatim from RFC 6937, to facilitate comparison between RFC 6937 and 141 this document as it evolves. 143 2. Introduction 145 This document describes an experimental algorithm, PRR, to improve 146 the accuracy of the amount of data sent by TCP during loss recovery. 148 Standard congestion control [RFC5681] requires that TCP (and other 149 protocols) reduce their congestion window (cwnd) in response to 150 losses. Fast Recovery, described in the same document, is the 151 reference algorithm for making this adjustment. Its stated goal is 152 to recover TCP's self clock by relying on returning ACKs during 153 recovery to clock more data into the network. Fast Recovery 154 typically adjusts the window by waiting for one half round-trip time 155 (RTT) of ACKs to pass before sending any data. It is fragile because 156 it cannot compensate for the implicit window reduction caused by the 157 losses themselves. 159 RFC 6675 [RFC6675] makes Fast Recovery with Selective Acknowledgement 160 (SACK) [RFC2018] more accurate by computing "pipe", a sender side 161 estimate of the number of bytes still outstanding in the network.
162 With RFC 6675, Fast Recovery is implemented by sending data as 163 necessary on each ACK to prevent pipe from falling below slow-start 164 threshold (ssthresh), the window size as determined by the congestion 165 control algorithm. This protects Fast Recovery from timeouts in many 166 cases where there are heavy losses, although not if the entire second 167 half of the window of data or ACKs are lost. However, a single ACK 168 carrying a SACK option that implies a large quantity of missing data 169 can cause a step discontinuity in the pipe estimator, which can cause 170 Fast Retransmit to send a burst of data. 172 The Rate-Halving algorithm sends data on alternate ACKs during 173 recovery, such that after 1 RTT the window has been halved. Rate- 174 Halving is implemented in Linux after only being informally published 175 [RHweb], including an uncompleted document [RHID]. Rate-Halving also 176 does not adequately compensate for the implicit window reduction 177 caused by the losses and assumes a net 50% window reduction, which 178 was completely standard at the time it was written but not 179 appropriate for modern congestion control algorithms, such as CUBIC 180 [CUBIC], which reduce the window by less than 50%. As a consequence, 181 Rate-Halving often allows the window to fall further than necessary, 182 reducing performance and increasing the risk of timeouts if there are 183 additional losses. 185 PRR avoids these excess window adjustments such that at the end of 186 recovery the actual window size will be as close as possible to 187 ssthresh, the window size as determined by the congestion control 188 algorithm. It is patterned after Rate-Halving, but using the 189 fraction that is appropriate for the target window chosen by the 190 congestion control algorithm. During PRR, one of two additional 191 Reduction Bound algorithms limits the total window reduction due to 192 all mechanisms, including transient application stalls and the losses 193 themselves. 195 We describe two slightly different Reduction Bound algorithms: 196 Conservative Reduction Bound (CRB), which is strictly packet 197 conserving; and a Slow Start Reduction Bound (SSRB), which is more 198 aggressive than CRB by, at most, 1 segment per ACK. PRR-CRB meets 199 the Strong Packet Conservation Bound described in Appendix A; 200 however, in real networks it does not perform as well as the 201 algorithms described in RFC 6675, which prove to be more aggressive 202 in a significant number of cases. SSRB offers a compromise by 203 allowing TCP to send 1 additional segment per ACK relative to CRB in 204 some situations. Although SSRB is less aggressive than RFC 6675 205 (transmitting fewer segments or taking more time to transmit them), 206 it outperforms it, due to the lower probability of additional losses 207 during recovery. 209 The Strong Packet Conservation Bound on which PRR and both Reduction 210 Bounds are based is patterned after Van Jacobson's packet 211 conservation principle: segments delivered to the receiver are used 212 as the clock to trigger sending the same number of segments back into 213 the network. As much as possible, PRR and the Reduction Bound 214 algorithms rely on this self clock process, and are only slightly 215 affected by the accuracy of other estimators, such as pipe [RFC6675] 216 and cwnd. This is what gives the algorithms their precision in the 217 presence of events that cause uncertainty in other estimators. 
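As a brief worked illustration of this scaled self clock (the numbers here are ours, not taken from the RFC): if a CUBIC sender enters recovery with RecoverFS = 20 segments and ssthresh = 0.7 * RecoverFS = 14 segments, then while pipe remains above ssthresh the algorithm in Section 4 allows roughly 0.7 * DeliveredData to be sent per ACK, i.e., about 7 segments transmitted for every 10 segments reported delivered. The window reduction is thereby spread across the recovery RTT instead of being taken as a half-window pause or as a single burst.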
219 The original definition of the packet conservation principle 220 [Jacobson88] treated packets that are presumed to be lost (e.g., 221 marked as candidates for retransmission) as having left the network. 222 This idea is reflected in the pipe estimator defined in RFC 6675 and 223 used here, but it is distinct from the Strong Packet Conservation 224 Bound as described in Appendix A, which is defined solely on the 225 basis of data arriving at the receiver. 227 We evaluated these and other algorithms in a large scale measurement 228 study presented in a companion paper [IMC11] and summarized in 229 Section 6. This measurement study was based on RFC 3517 [RFC3517], 230 which has since been superseded by RFC 6675. Since there are slight 231 differences between the two specifications, and we were meticulous 232 about our implementation of RFC 3517, we are not comfortable 233 unconditionally asserting that our measurement results apply to RFC 234 6675, although we believe this to be the case. We have instead 235 chosen to be pedantic about describing measurement results relative 236 to RFC 3517, on which they were actually based. General discussions 237 of algorithms and their properties have been updated to refer to RFC 238 6675. 240 We found that for authentic network traffic, PRR-SSRB outperforms 241 both RFC 3517 and Linux Rate-Halving even though it is less 242 aggressive than RFC 3517. We believe that these results apply to RFC 243 6675 as well. 245 The algorithms are described as modifications to RFC 5681 [RFC5681], 246 "TCP Congestion Control", using concepts drawn from the pipe 247 algorithm [RFC6675]. They are most accurate and more easily 248 implemented with SACK [RFC2018], but do not require SACK. 250 3. Definitions 252 The following terms, parameters, and state variables are used as they 253 are defined in earlier documents: 255 RFC 793: snd.una (send unacknowledged) 257 RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size 258 (SMSS) 260 RFC 6675: covered (as in "covered sequence numbers") 262 Voluntary window reductions: choosing not to send data in response to 263 some ACKs, for the purpose of reducing the sending window size and 264 data rate 266 We define some additional variables: 268 SACKd: The total number of bytes that the scoreboard indicates have 269 been delivered to the receiver. This can be computed by scanning the 270 scoreboard and counting the total number of bytes covered by all sack 271 blocks. If SACK is not in use, SACKd is not defined. 273 DeliveredData: The total number of bytes that the current ACK 274 indicates have been delivered to the receiver. When not in recovery, 275 DeliveredData is the change in snd.una. With SACK, DeliveredData can 276 be computed precisely as the change in snd.una, plus the (signed) 277 change in SACKd. In recovery without SACK, DeliveredData is 278 estimated to be 1 SMSS on duplicate acknowledgements, and on a 279 subsequent partial or full ACK, DeliveredData is estimated to be the 280 change in snd.una, minus 1 SMSS for each preceding duplicate ACK. 282 Note that DeliveredData is robust; for TCP using SACK, DeliveredData 283 can be precisely computed anywhere in the network just by inspecting 284 the returning ACKs. The consequence of missing ACKs is that later 285 ACKs will show a larger DeliveredData. Furthermore, for any TCP 286 (with or without SACK), the sum of DeliveredData must agree with the 287 forward progress over the same time interval. 
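To make the SACK-based computation of DeliveredData concrete, the following C sketch may be useful. It is illustrative only and not part of the specification: the structure, field, and function names (prr_sock, snd_una, sackd, delivered_data) are invented for this example, and maintaining the SACK scoreboard itself is assumed to happen elsewhere.

   /* Illustrative sketch (not part of the specification) of per-ACK
    * DeliveredData bookkeeping for a SACK-enabled sender, following the
    * definition above.  Names are invented for this example; updating the
    * SACK scoreboard is assumed to happen elsewhere. */
   #include <stdint.h>
   #include <stdio.h>

   struct prr_sock {
       uint64_t snd_una;   /* highest cumulatively acknowledged sequence number */
       uint64_t sackd;     /* total bytes currently covered by SACK blocks      */
   };

   /* Called once per arriving ACK, after the scoreboard has been updated;
    * new_una and new_sackd are the post-update values of snd.una and SACKd. */
   int64_t delivered_data(struct prr_sock *tp, uint64_t new_una,
                          uint64_t new_sackd)
   {
       /* Change in snd.una plus the (signed) change in SACKd. */
       int64_t delivered = (int64_t)(new_una - tp->snd_una) +
                           ((int64_t)new_sackd - (int64_t)tp->sackd);
       tp->snd_una = new_una;
       tp->sackd = new_sackd;
       return delivered;
   }

   int main(void)
   {
       struct prr_sock tp = { .snd_una = 1000, .sackd = 0 };

       /* A SACK-only duplicate ACK: snd.una is unchanged and one new
        * 1000-byte SACK block appears, so 1000 bytes were delivered.   */
       printf("%lld\n", (long long)delivered_data(&tp, 1000, 1000));

       /* A partial ACK advancing snd.una by 2000 bytes that also covers
        * the earlier SACK block, so SACKd shrinks by 1000: the ACK
        * reports 2000 - 1000 = 1000 newly delivered bytes.             */
       printf("%lld\n", (long long)delivered_data(&tp, 3000, 0));
       return 0;
   }

The second ACK in the example shows why the change in SACKd is signed: bytes that move below snd.una leave the SACK scoreboard and enter the cumulative acknowledgement, so they are not counted twice.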
289 We introduce a local variable "sndcnt", which indicates exactly how 290 many bytes should be sent in response to each ACK. Note that the 291 decision of which data to send (e.g., retransmit missing data or send 292 more new data) is out of scope for this document. 294 4. Algorithms 296 At the beginning of recovery, initialize PRR state. This assumes a 297 modern congestion control algorithm, CongCtrlAlg(), that might set 298 ssthresh to something other than FlightSize/2: 300 ssthresh = CongCtrlAlg() // Target cwnd after recovery 301 prr_delivered = 0 // Total bytes delivered during recovery 302 prr_out = 0 // Total bytes sent during recovery 303 RecoverFS = snd.nxt-snd.una // FlightSize at the start of recovery 305 Figure 1 307 On every ACK during recovery compute: 308 DeliveredData = change_in(snd.una) + change_in(SACKd) 309 prr_delivered += DeliveredData 310 pipe = (RFC 6675 pipe algorithm) 311 if (pipe > ssthresh) { 312 // Proportional Rate Reduction 313 sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out 314 } else { 315 // Two versions of the Reduction Bound 316 if (conservative) { // PRR-CRB 317 limit = prr_delivered - prr_out 318 } else { // PRR-SSRB 319 limit = MAX(prr_delivered - prr_out, DeliveredData) + MSS 320 } 321 // Attempt to catch up, as permitted by limit 322 sndcnt = MIN(ssthresh - pipe, limit) 323 } 325 Figure 2 327 On any data transmission or retransmission: 328 prr_out += (data sent) // strictly less than or equal to sndcnt 330 Figure 3 332 4.1. Examples 334 We illustrate these algorithms by showing their different behaviors 335 for two scenarios: TCP experiencing either a single loss or a burst 336 of 15 consecutive losses. In all cases we assume bulk data (no 337 application pauses), standard Additive Increase Multiplicative 338 Decrease (AIMD) congestion control, and cwnd = FlightSize = pipe = 20 339 segments, so ssthresh will be set to 10 at the beginning of recovery. 340 We also assume standard Fast Retransmit and Limited Transmit 341 [RFC3042], so TCP will send 2 new segments followed by 1 retransmit 342 in response to the first 3 duplicate ACKs following the losses. 344 Each of the diagrams below shows the per ACK response to the first 345 round trip for the various recovery algorithms when the zeroth 346 segment is lost. The top line indicates the transmitted segment 347 number triggering the ACKs, with an X for the lost segment. "cwnd" 348 and "pipe" indicate the values of these algorithms after processing 349 each returning ACK. "Sent" indicates how much 'N'ew or 350 'R'etransmitted data would be sent. Note that the algorithms for 351 deciding which data to send are out of scope of this document. 353 When there is a single loss, PRR with either of the Reduction Bound 354 algorithms has the same behavior. We show "RB", a flag indicating 355 which Reduction Bound subexpression ultimately determined the value 356 of sndcnt. When there are minimal losses, "limit" (both algorithms) 357 will always be larger than ssthresh - pipe, so the sndcnt will be 358 ssthresh - pipe, indicated by "s" in the "RB" row. 
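Before walking through the per-ACK traces, the following C sketch may help in reproducing them. It is an illustrative transcription of Figures 1-3, not normative text: the struct layout, the byte-based MSS constant, the clamping of a negative sndcnt to zero, and the assumption that the caller supplies pipe and DeliveredData (per RFC 6675 and Section 3) are conventions of the sketch rather than requirements of this document.

   /* Illustrative transcription of Figures 1-3; see the caveats above. */
   #include <stdbool.h>
   #include <stdint.h>

   #define MSS 1460              /* sender maximum segment size, in bytes */

   struct prr {
       uint64_t ssthresh;        /* target cwnd after recovery, in bytes  */
       uint64_t prr_delivered;   /* total bytes delivered during recovery */
       uint64_t prr_out;         /* total bytes sent during recovery      */
       uint64_t recover_fs;      /* FlightSize at the start of recovery   */
   };

   /* Figure 1: initialize PRR state at the beginning of recovery.
    * ssthresh comes from CongCtrlAlg(); flight_size is snd.nxt - snd.una
    * and is assumed to be non-zero. */
   void prr_init(struct prr *p, uint64_t ssthresh, uint64_t flight_size)
   {
       p->ssthresh = ssthresh;
       p->prr_delivered = 0;
       p->prr_out = 0;
       p->recover_fs = flight_size;
   }

   /* Figure 2: per-ACK computation of sndcnt, the number of bytes that may
    * be sent in response to this ACK.  delivered_data and pipe are computed
    * by the caller as described in Section 3 and RFC 6675. */
   uint64_t prr_on_ack(struct prr *p, uint64_t delivered_data,
                       uint64_t pipe, bool conservative)
   {
       int64_t sndcnt;

       p->prr_delivered += delivered_data;

       if (pipe > p->ssthresh) {
           /* Proportional Rate Reduction:
            * CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out */
           uint64_t target = (p->prr_delivered * p->ssthresh +
                              p->recover_fs - 1) / p->recover_fs;
           sndcnt = (int64_t)target - (int64_t)p->prr_out;
       } else {
           /* Two versions of the Reduction Bound. */
           int64_t banked = (int64_t)p->prr_delivered - (int64_t)p->prr_out;
           int64_t limit;
           if (conservative) {                       /* PRR-CRB  */
               limit = banked;
           } else {                                  /* PRR-SSRB */
               limit = (banked > (int64_t)delivered_data ?
                        banked : (int64_t)delivered_data) + MSS;
           }
           /* Attempt to catch up, as permitted by limit. */
           int64_t deficit = (int64_t)p->ssthresh - (int64_t)pipe;
           sndcnt = deficit < limit ? deficit : limit;
       }
       /* A non-positive value means nothing may be sent on this ACK. */
       return sndcnt > 0 ? (uint64_t)sndcnt : 0;
   }

   /* Figure 3: account for any data transmission or retransmission.
    * n is the number of bytes actually sent (at most sndcnt). */
   void prr_on_send(struct prr *p, uint64_t n)
   {
       p->prr_out += n;
   }

The traces below can be reproduced by driving these routines with the per-ACK DeliveredData and pipe values implied by each scenario.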
360 RFC 6675 361 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 362 cwnd: 20 20 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 363 pipe: 19 19 18 18 17 16 15 14 13 12 11 10 10 10 10 10 10 10 10 364 sent: N N R N N N N N N N N 366 Rate-Halving (Linux) 367 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 368 cwnd: 20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 369 pipe: 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 370 sent: N N R N N N N N N N N 372 PRR 373 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 374 pipe: 19 19 18 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 10 375 sent: N N R N N N N N N N N 376 RB: s s 377 Cwnd is not shown because PRR does not use it. 379 Key for RB 380 s: sndcnt = ssthresh - pipe // from ssthresh 381 b: sndcnt = prr_delivered - prr_out + SMSS // from banked 382 d: sndcnt = DeliveredData + SMSS // from DeliveredData 383 (Sometimes, more than one applies.) 385 Figure 4 387 Note that all 3 algorithms send the same total amount of data. RFC 388 6675 experiences a "half window of silence", while the Rate-Halving 389 and PRR spread the voluntary window reduction across an entire RTT. 391 Next, we consider the same initial conditions when the first 15 392 packets (0-14) are lost. During the remainder of the lossy RTT, only 393 5 ACKs are returned to the sender. We examine each of these 394 algorithms in succession. 396 RFC 6675 397 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 398 cwnd: 20 20 11 11 11 399 pipe: 19 19 4 10 10 400 sent: N N 7R R R 402 Rate-Halving (Linux) 403 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 404 cwnd: 20 20 5 5 5 405 pipe: 19 19 4 4 4 406 sent: N N R R R 408 PRR-CRB 409 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 410 pipe: 19 19 4 4 4 411 sent: N N R R R 412 RB: b b b 414 PRR-SSRB 415 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 416 pipe: 19 19 4 5 6 417 sent: N N 2R 2R 2R 418 RB: bd d d 420 Figure 5 422 In this specific situation, RFC 6675 is more aggressive because once 423 Fast Retransmit is triggered (on the ACK for segment 17), TCP 424 immediately retransmits sufficient data to bring pipe up to cwnd. 425 Our measurement data (see Section 6) indicates that RFC 6675 426 significantly outperforms Rate-Halving, PRR-CRB, and some other 427 similarly conservative algorithms that we tested, showing that it is 428 significantly common for the actual losses to exceed the window 429 reduction determined by the congestion control algorithm. 431 The Linux implementation of Rate-Halving includes an early version of 432 the Conservative Reduction Bound [RHweb]. In this situation, the 5 433 ACKs trigger exactly 1 transmission each (2 new data, 3 old data), 434 and cwnd is set to 5. At a window size of 5, it takes 3 round trips 435 to retransmit all 15 lost segments. Rate-Halving does not raise the 436 window at all during recovery, so when recovery finally completes, 437 TCP will slow start cwnd from 5 up to 10. In this example, TCP 438 operates at half of the window chosen by the congestion control for 439 more than 3 RTTs, increasing the elapsed time and exposing it to 440 timeouts in the event that there are additional losses. 442 PRR-CRB implements a Conservative Reduction Bound. Since the total 443 losses bring pipe below ssthresh, data is sent such that the total 444 data transmitted, prr_out, follows the total data delivered to the 445 receiver as reported by returning ACKs. Transmission is controlled 446 by the sending limit, which is set to prr_delivered - prr_out. 
This 447 is indicated by the RB:b tagging in the figure. In this case, PRR- 448 CRB is exposed to exactly the same problems as Rate-Halving; the 449 excess window reduction causes it to take excessively long to recover 450 the losses and exposes it to additional timeouts. 452 PRR-SSRB increases the window by exactly 1 segment per ACK until pipe 453 rises to ssthresh during recovery. This is accomplished by setting 454 limit to one greater than the data reported to have been delivered to 455 the receiver on this ACK, implementing slow start during recovery, 456 and indicated by RB:d tagging in the figure. Although increasing the 457 window during recovery seems to be ill advised, it is important to 458 remember that this is actually less aggressive than permitted by RFC 459 5681, which sends the same quantity of additional data as a single 460 burst in response to the ACK that triggered Fast Retransmit. 462 For less extreme events, where the total losses are smaller than the 463 difference between FlightSize and ssthresh, PRR-CRB and PRR-SSRB have 464 identical behaviors. 466 5. Properties 468 The following properties are common to both PRR-CRB and PRR-SSRB, 469 except as noted: 471 PRR maintains TCP's ACK clocking across most recovery events, 472 including burst losses. RFC 6675 can send large unclocked bursts 473 following burst losses. 475 Normally, PRR will spread voluntary window reductions out evenly 476 across a full RTT. This has the potential to generally reduce the 477 burstiness of Internet traffic, and could be considered to be a type 478 of soft pacing. Hypothetically, any pacing increases the probability 479 that different flows are interleaved, reducing the opportunity for 480 ACK compression and other phenomena that increase traffic burstiness. 481 However, these effects have not been quantified. 483 If there are minimal losses, PRR will converge to exactly the target 484 window chosen by the congestion control algorithm. Note that as TCP 485 approaches the end of recovery, prr_delivered will approach RecoverFS 486 and sndcnt will be computed such that prr_out approaches ssthresh. 488 Implicit window reductions, due to multiple isolated losses during 489 recovery, cause later voluntary reductions to be skipped. For small 490 numbers of losses, the window size ends at exactly the window chosen 491 by the congestion control algorithm. 493 For burst losses, earlier voluntary window reductions can be undone 494 by sending extra segments in response to ACKs arriving later during 495 recovery. Note that as long as some voluntary window reductions are 496 not undone, the final value for pipe will be the same as ssthresh, 497 the target cwnd value chosen by the congestion control algorithm. 499 PRR with either Reduction Bound improves the situation when there are 500 application stalls, e.g., when the sending application does not queue 501 data for transmission quickly enough or the receiver stops advancing 502 rwnd (receiver window). When there is an application stall early 503 during recovery, prr_out will fall behind the sum of the 504 transmissions permitted by sndcnt. The missed opportunities to send 505 due to stalls are treated like banked voluntary window reductions; 506 specifically, they cause prr_delivered - prr_out to be significantly 507 positive. If the application catches up while TCP is still in 508 recovery, TCP will send a partial window burst to catch up to exactly 509 where it would have been had the application never stalled. 
Although 510 this burst might be viewed as being hard on the network, this is 511 exactly what happens every time there is a partial RTT application 512 stall while not in recovery. We have made the partial RTT stall 513 behavior uniform in all states. Changing this behavior is out of 514 scope for this document. 516 PRR with Reduction Bound is less sensitive to errors in the pipe 517 estimator. While in recovery, pipe is intrinsically an estimator, 518 using incomplete information to estimate if un-SACKed segments are 519 actually lost or merely out of order in the network. Under some 520 conditions, pipe can have significant errors; for example, pipe is 521 underestimated when a burst of reordered data is prematurely assumed 522 to be lost and marked for retransmission. If the transmissions are 523 regulated directly by pipe as they are with RFC 6675, a step 524 discontinuity in the pipe estimator causes a burst of data, which 525 cannot be retracted once the pipe estimator is corrected a few ACKs 526 later. For PRR, pipe merely determines which algorithm, PRR or the 527 Reduction Bound, is used to compute sndcnt from DeliveredData. While 528 pipe is underestimated, the algorithms are different by at most 1 529 segment per ACK. Once pipe is updated, they converge to the same 530 final window at the end of recovery. 532 Under all conditions and sequences of events during recovery, PRR-CRB 533 strictly bounds the data transmitted to be equal to or less than the 534 amount of data delivered to the receiver. We claim that this Strong 535 Packet Conservation Bound is the most aggressive algorithm that does 536 not lead to additional forced losses in some environments. It has 537 the property that if there is a standing queue at a bottleneck with 538 no cross traffic, the queue will maintain exactly constant length for 539 the duration of the recovery, except for +1/-1 fluctuation due to 540 differences in packet arrival and exit times. See Appendix A for a 541 detailed discussion of this property. 543 Although the Strong Packet Conservation Bound is very appealing for a 544 number of reasons, our measurements summarized in Section 6 545 demonstrate that it is less aggressive and does not perform as well 546 as RFC 6675, which permits bursts of data when there are bursts of 547 losses. PRR-SSRB is a compromise that permits TCP to send 1 extra 548 segment per ACK as compared to the Packet Conserving Bound. From the 549 perspective of a strict Packet Conserving Bound, PRR-SSRB does indeed 550 open the window during recovery; however, it is significantly less 551 aggressive than RFC 6675 in the presence of burst losses. 553 6. Measurements 555 In a companion IMC11 paper [IMC11], we describe some measurements 556 comparing the various strategies for reducing the window during 557 recovery. The experiments were performed on servers carrying Google 558 production traffic and are briefly summarized here. 560 The various window reduction algorithms and extensive instrumentation 561 were all implemented in Linux 2.6. We used the uniform set of 562 algorithms present in the base Linux implementation, including CUBIC 563 [CUBIC], Limited Transmit [RFC3042], threshold transmit (Section 3.1 564 in [FACK]) (this algorithm was not present in RFC 3517, but a similar 565 algorithm has been added to RFC 6675), and lost retransmission 566 detection algorithms. 
We confirmed that the behaviors of Rate- 567 Halving (the Linux default), RFC 3517, and PRR were authentic to 568 their respective specifications and that performance and features 569 were comparable to the kernels in production use. All of the 570 different window reduction algorithms were all present in a common 571 kernel and could be selected with a sysctl, such that we had an 572 absolutely uniform baseline for comparing them. 574 Our experiments included an additional algorithm, PRR with an 575 unlimited bound (PRR-UB), which sends ssthresh-pipe bursts when pipe 576 falls below ssthresh. This behavior parallels RFC 3517. 578 An important detail of this configuration is that CUBIC only reduces 579 the window by 30%, as opposed to the 50% reduction used by 580 traditional congestion control algorithms. This accentuates the 581 tendency for RFC 3517 and PRR-UB to send a burst at the point when 582 Fast Retransmit gets triggered because pipe is likely to already be 583 below ssthresh. Precisely this condition was observed for 32% of the 584 recovery events: pipe fell below ssthresh before Fast Retransmit was 585 triggered, thus the various PRR algorithms started in the Reduction 586 Bound phase, and RFC 3517 sent bursts of segments with the Fast 587 Retransmit. 589 In the companion paper, we observe that PRR-SSRB spends the least 590 time in recovery of all the algorithms tested, largely because it 591 experiences fewer timeouts once it is already in recovery. 593 RFC 3517 experiences 29% more detected lost retransmissions and 2.6% 594 more timeouts (presumably due to undetected lost retransmissions) 595 than PRR-SSRB. These results are representative of PRR-UB and other 596 algorithms that send bursts when pipe falls below ssthresh. 598 Rate-Halving experiences 5% more timeouts and significantly smaller 599 final cwnd values at the end of recovery. The smaller cwnd sometimes 600 causes the recovery itself to take extra round trips. These results 601 are representative of PRR-CRB and other algorithms that implement 602 strict packet conservation during recovery. 604 7. Conclusion and Recommendations 606 Although the Strong Packet Conservation Bound used in PRR-CRB is very 607 appealing for a number of reasons, our measurements show that it is 608 less aggressive and does not perform as well as RFC 3517 (and by 609 implication RFC 6675), which permits bursts of data when there are 610 bursts of losses. RFC 3517 and RFC 6675 are conservative in the 611 original sense of Van Jacobson's packet conservation principle, which 612 included the assumption that presumed lost segments have indeed left 613 the network. PRR-CRB makes no such assumption, following instead a 614 Strong Packet Conservation Bound in which only packets that have 615 actually arrived at the receiver are considered to have left the 616 network. PRR-SSRB is a compromise that permits TCP to send 1 extra 617 segment per ACK relative to the Strong Packet Conservation Bound, to 618 partially compensate for excess losses. 620 From the perspective of the Strong Packet Conservation Bound, PRR- 621 SSRB does indeed open the window during recovery; however, it is 622 significantly less aggressive than RFC 3517 (and RFC 6675) in the 623 presence of burst losses. Even so, it often outperforms RFC 3517 624 (and presumably RFC 6675) because it avoids some of the self- 625 inflicted losses caused by bursts. 627 At this time, we see no reason not to test and deploy PRR-SSRB on a 628 large scale. 
Implementers worried about any potential impact of 629 raising the window during recovery may want to optionally support 630 PRR-CRB (which is actually simpler to implement) for comparison 631 studies. Furthermore, there is one minor detail of PRR that can be 632 improved by replacing pipe by total_pipe, as defined by Laminar TCP 633 [Laminar]. 635 One final comment about terminology: we expect that common usage will 636 drop "Slow Start Reduction Bound" from the algorithm name. This 637 document needed to be pedantic about having distinct names for PRR 638 and every variant of the Reduction Bound. However, we do not 639 anticipate any future exploration of the alternative Reduction 640 Bounds. 642 8. Acknowledgements 644 This document is based in part on previous incomplete work by Matt 645 Mathis, Jeff Semke, and Jamshid Mahdavi [RHID] and influenced by 646 several discussions with John Heffner. 648 Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the 649 experiments. 651 Ilpo Jarvinen reviewed the code. 653 Mark Allman improved the document through his insightful review. 655 Neal Cardwell reviewed and tested the patch. 657 9. Security Considerations 659 PRR does not change the risk profile for TCP. 661 Implementers that change PRR from counting bytes to segments have to 662 be cautious about the effects of ACK splitting attacks [Savage99], 663 where the receiver acknowledges partial segments for the purpose of 664 confusing the sender's congestion accounting. 666 10. Normative References 668 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 669 RFC 793, DOI 10.17487/RFC0793, September 1981, 670 <https://www.rfc-editor.org/info/rfc793>. 672 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 673 Selective Acknowledgment Options", RFC 2018, 674 DOI 10.17487/RFC2018, October 1996, 675 <https://www.rfc-editor.org/info/rfc2018>. 677 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 678 Requirement Levels", BCP 14, RFC 2119, 679 DOI 10.17487/RFC2119, March 1997, 680 <https://www.rfc-editor.org/info/rfc2119>. 682 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 683 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 684 <https://www.rfc-editor.org/info/rfc5681>. 686 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 687 and Y. Nishida, "A Conservative Loss Recovery Algorithm 688 Based on Selective Acknowledgment (SACK) for TCP", 689 RFC 6675, DOI 10.17487/RFC6675, August 2012, 690 <https://www.rfc-editor.org/info/rfc6675>. 692 11. Informative References 694 [CUBIC] Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-speed 695 TCP variant", PFLDnet 2005, February 2005. 697 [FACK] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: 698 Refining TCP Congestion Control", ACM SIGCOMM SIGCOMM96, 699 August 1996. 701 [IMC11] Dukkipati, N., Mathis, M., Cheng, Y., and M. Ghobadi, 702 "Proportional Rate Reduction for TCP", Proceedings of the 703 11th ACM SIGCOMM Conference on Internet Measurement 704 2011, Berlin, Germany, November 2011. 706 [Jacobson88] 707 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 708 Comput. Commun. Rev. 18(4), August 1988. 710 [Laminar] Mathis, M., "Laminar TCP and the case for refactoring TCP 711 congestion control", Work in Progress, 16 July 2012. 713 [RFC3042] Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing 714 TCP's Loss Recovery Using Limited Transmit", RFC 3042, 715 DOI 10.17487/RFC3042, January 2001, 716 <https://www.rfc-editor.org/info/rfc3042>. 718 [RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A 719 Conservative Selective Acknowledgment (SACK)-based Loss 720 Recovery Algorithm for TCP", RFC 3517, 721 DOI 10.17487/RFC3517, April 2003, 722 <https://www.rfc-editor.org/info/rfc3517>.
724 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 725 Rate Reduction for TCP", RFC 6937, DOI 10.17487/RFC6937, 726 May 2013, <https://www.rfc-editor.org/info/rfc6937>. 728 [RHID] Mathis, M., Semke, J., and J. Mahdavi, "The Rate-Halving 729 Algorithm for TCP Congestion Control", Work in Progress, 730 August 1999. 732 [RHweb] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding 733 Parameters", Web publication, December 1997, 734 . 736 [Savage99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 737 "TCP congestion control with a misbehaving receiver", 738 SIGCOMM Comput. Commun. Rev. 29(5), October 1999. 740 Appendix A. Strong Packet Conservation Bound 742 PRR-CRB is based on a conservative, philosophically pure, and 743 aesthetically appealing Strong Packet Conservation Bound, described 744 here. Although inspired by Van Jacobson's packet conservation 745 principle [Jacobson88], it differs in how it treats segments that are 746 missing and presumed lost. Under all conditions and sequences of 747 events during recovery, PRR-CRB strictly bounds the data transmitted 748 to be equal to or less than the amount of data delivered to the 749 receiver. Note that the effects of presumed losses are included in 750 the pipe calculation, but do not affect the outcome of PRR-CRB, once 751 pipe has fallen below ssthresh. 753 We claim that this Strong Packet Conservation Bound is the most 754 aggressive algorithm that does not lead to additional forced losses 755 in some environments. It has the property that if there is a 756 standing queue at a bottleneck that is carrying no other traffic, the 757 queue will maintain exactly constant length for the entire duration 758 of the recovery, except for +1/-1 fluctuation due to differences in 759 packet arrival and exit times. Any less aggressive algorithm will 760 result in a declining queue at the bottleneck. Any more aggressive 761 algorithm will result in an increasing queue or additional losses if 762 it is a full drop tail queue. 764 We demonstrate this property with a little thought experiment: 766 Imagine a network path that has insignificant delays in both 767 directions, except for the processing time and queue at a single 768 bottleneck in the forward path. By insignificant delay, we mean when 769 a packet is "served" at the head of the bottleneck queue, the 770 following events happen in much less than one bottleneck packet time: 771 the packet arrives at the receiver; the receiver sends an ACK that 772 arrives at the sender; the sender processes the ACK and sends some 773 data; the data is queued at the bottleneck. 775 If sndcnt is set to DeliveredData and nothing else is inhibiting 776 sending data, then clearly the data arriving at the bottleneck queue 777 will exactly replace the data that was served at the head of the 778 queue, so the queue will have a constant length. If queue is drop 779 tail and full, then the queue will stay exactly full. Losses or 780 reordering on the ACK path only cause wider fluctuations in the queue 781 size, but do not raise its peak size, independent of whether the data 782 is in order or out of order (including loss recovery from an earlier 783 RTT). Any more aggressive algorithm that sends additional data will 784 overflow the drop tail queue and cause loss. Any less aggressive 785 algorithm will under-fill the queue. Therefore, setting sndcnt to 786 DeliveredData is the most aggressive algorithm that does not cause 787 forced losses in this simple network.
Relaxing the assumptions 788 (e.g., making delays more authentic and adding more flows, delayed 789 ACKs, etc.) is likely to increase the fine grained fluctuations in 790 queue size but does not change its basic behavior. 792 Note that the congestion control algorithm implements a broader 793 notion of optimal that includes appropriately sharing the network. 794 Typical congestion control algorithms are likely to reduce the data 795 sent relative to the Packet Conserving Bound implemented by PRR, 796 bringing TCP's actual window down to ssthresh. 798 Authors' Addresses 800 Matt Mathis 801 Google, Inc. 802 1600 Amphitheatre Parkway 803 Mountain View, California 94043 804 United States of America 806 Email: mattmathis@google.com 807 Nandita Dukkipati 808 Google, Inc. 809 1600 Amphitheatre Parkway 810 Mountain View, California 94043 811 United States of America 813 Email: nanditad@google.com 815 Yuchung Cheng 816 Google, Inc. 817 1600 Amphitheatre Parkway 818 Mountain View, California 94043 819 United States of America 821 Email: ycheng@google.com