TCP Maintenance Working Group                                 M. Mathis
Internet-Draft                                             N. Dukkipati
Obsoletes: 6937 (if approved)                                  Y. Cheng
Intended status: Standards Track                           Google, Inc.
Expires: 4 May 2021                                     31 October 2020

                  Proportional Rate Reduction for TCP
                    draft-mathis-tcpm-rfc6937bis-00

Abstract

This document updates the Proportional Rate Reduction (PRR) algorithm, described as experimental in RFC 6937, to standards track. PRR potentially replaces the Fast Recovery and Rate-Halving algorithms. All of these algorithms regulate the amount of data sent by TCP or other transport protocols during loss recovery.
PRR more accurately regulates the actual flight size through recovery such that at the end of recovery it will be as close as possible to ssthresh, as determined by the congestion control algorithm. This document obsoletes RFC 6937.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 May 2021.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Algorithms
       3.1.  Examples
   4.  Properties
   5.  Measurements
   6.  Conclusion and Recommendations
   7.  Acknowledgements
   8.  Security Considerations
   9.  Normative References
   10. Informative References
   Appendix A.  Strong Packet Conservation Bound
   Authors' Addresses

1. Introduction

This document updates the Proportional Rate Reduction (PRR) algorithm described in [RFC6937] from experimental to standards track. PRR accurately regulates the amount of data sent during loss recovery, such that at the end of recovery the flight size will be as close as possible to ssthresh, as determined by the congestion control algorithm. PRR has been deployed in at least 3 major operating systems, covering the vast majority of today's web traffic. There have been no changes to PRR as documented in the experimental RFC. The descriptions here have been [will be] updated to normative standards language. For a tutorial description of the algorithms and the rationale behind them, please see the original RFC.
The experimental RFC describes two different Reduction Bound algorithms to limit the total window reduction due to all mechanisms, including transient application stalls and the losses themselves: Conservative Reduction Bound (CRB), which is strictly packet conserving, and Slow Start Reduction Bound (SSRB), which is more aggressive than CRB by at most 1 segment per ACK. [RFC6937] left the choice of Reduction Bound to the discretion of the implementer.

The paper "An Internet-Wide Analysis of Traffic Policing" [Flach2016] uncovered a crucial situation where the Reduction Bound mattered. Under certain configurations, token bucket traffic policers [token_bucket] can suddenly start discarding a large fraction of the traffic. This happens without warning when policers run out of tokens. The transport congestion control has no opportunity to measure the token rate, and sets ssthresh based on the recently observed path performance. This value for ssthresh may be substantially larger than can be sustained by the token rate, potentially causing persistent high loss. Under these conditions, both Reduction Bounds perform very poorly. PRR-CRB is too timid, sometimes causing very long recovery times at smaller than necessary windows, and PRR-SSRB is too aggressive, often causing many retransmissions to be lost multiple times.

Investigating these environments led to the development of a heuristic to dynamically switch between Reduction Bounds: use PRR-SSRB only while snd.una is advancing without additional losses, and use PRR-CRB otherwise. (A non-normative sketch of this heuristic appears at the end of this section.)

This heuristic is only invoked for what should be a rare corner case: when losses or other events cause the flight size to fall below ssthresh. The extreme loss rates that make the heuristic important are only common in the presence of poorly configured token bucket policers, which are pathologically wasteful and inefficient. In these environments the heuristic serves to salvage a bad situation, and any reasonable implementation of it performs far better than either bound by itself.

The algorithm below is identical to the algorithm presented in [RFC6937]. The "conservative" parameter MAY be replaced by the heuristic described above.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

[All text below is copied from RFC 6937; it will be revised after this document is adopted as a tcpm work item.]
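As a concrete illustration of the bound-switching heuristic introduced above, the following Python sketch shows one way it might be wired into a per-ACK path. It is non-normative: the names (prev_snd_una, newly_marked_lost, conservative) are ours, not part of the algorithm specified below.

   # Non-normative sketch of the bound-switching heuristic.
   # state.prev_snd_una is assumed to be initialized to snd.una when
   # recovery starts; state.conservative feeds the algorithm below.

   def choose_reduction_bound(state, snd_una, newly_marked_lost):
       # Use PRR-SSRB only while snd.una advances without additional
       # losses being detected; otherwise fall back to PRR-CRB.
       if snd_una > state.prev_snd_una and newly_marked_lost == 0:
           state.conservative = False   # PRR-SSRB
       else:
           state.conservative = True    # PRR-CRB
       state.prev_snd_una = snd_una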
2. Definitions

The following terms, parameters, and state variables are used as they are defined in earlier documents:

   RFC 793:  snd.una (send unacknowledged)

   RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size
             (SMSS)

   RFC 6675: covered (as in "covered sequence numbers")

Voluntary window reductions: choosing not to send data in response to some ACKs, for the purpose of reducing the sending window size and data rate.

We define some additional variables:

SACKd: The total number of bytes that the scoreboard indicates have been delivered to the receiver. This can be computed by scanning the scoreboard and counting the total number of bytes covered by all SACK blocks. If SACK is not in use, SACKd is not defined.

DeliveredData: The total number of bytes that the current ACK indicates have been delivered to the receiver. When not in recovery, DeliveredData is the change in snd.una. With SACK, DeliveredData can be computed precisely as the change in snd.una, plus the (signed) change in SACKd. In recovery without SACK, DeliveredData is estimated to be 1 SMSS on duplicate acknowledgements, and on a subsequent partial or full ACK, DeliveredData is estimated to be the change in snd.una, minus 1 SMSS for each preceding duplicate ACK.

Note that DeliveredData is robust; for TCP using SACK, DeliveredData can be precisely computed anywhere in the network just by inspecting the returning ACKs. The consequence of missing ACKs is that later ACKs will show a larger DeliveredData. Furthermore, for any TCP (with or without SACK), the sum of DeliveredData must agree with the forward progress over the same time interval.

We introduce a local variable "sndcnt", which indicates exactly how many bytes should be sent in response to each ACK. Note that the decision of which data to send (e.g., retransmit missing data or send more new data) is out of scope for this document.

3. Algorithms

At the beginning of recovery, initialize PRR state. This assumes a modern congestion control algorithm, CongCtrlAlg(), that might set ssthresh to something other than FlightSize/2:

   ssthresh = CongCtrlAlg()      // Target cwnd after recovery
   prr_delivered = 0             // Total bytes delivered during recovery
   prr_out = 0                   // Total bytes sent during recovery
   RecoverFS = snd.nxt-snd.una   // FlightSize at the start of recovery

                                Figure 1

On every ACK during recovery compute:

   DeliveredData = change_in(snd.una) + change_in(SACKd)
   prr_delivered += DeliveredData
   pipe = (RFC 6675 pipe algorithm)
   if (pipe > ssthresh) {
       // Proportional Rate Reduction
       sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
   } else {
       // Two versions of the Reduction Bound
       if (conservative) {    // PRR-CRB
           limit = prr_delivered - prr_out
       } else {               // PRR-SSRB
           limit = MAX(prr_delivered - prr_out, DeliveredData) + MSS
       }
       // Attempt to catch up, as permitted by limit
       sndcnt = MIN(ssthresh - pipe, limit)
   }

                                Figure 2

On any data transmission or retransmission:

   prr_out += (data sent)    // strictly less than or equal to sndcnt

                                Figure 3
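To make Figures 1-3 concrete, the following self-contained Python sketch combines them into a single per-ACK routine. It is illustrative only: the names are ours, byte counting is assumed, and the caller is assumed to compute DeliveredData and the RFC 6675 pipe estimate as defined above.

   import math

   SMSS = 1460  # assumed sender maximum segment size, in bytes

   class PRRState:
       """Non-normative sketch of Figures 1-3."""

       def __init__(self, ssthresh, flight_size):
           # Figure 1: initialize PRR state at the start of recovery.
           self.ssthresh = ssthresh        # target cwnd from CongCtrlAlg()
           self.prr_delivered = 0          # bytes delivered in recovery
           self.prr_out = 0                # bytes sent in recovery
           self.recover_fs = flight_size   # snd.nxt - snd.una at entry

       def on_ack(self, delivered_data, pipe, conservative=False):
           # Figure 2: delivered_data is change_in(snd.una) +
           # change_in(SACKd); pipe is the RFC 6675 pipe estimate;
           # conservative may be driven by the Section 1 heuristic.
           self.prr_delivered += delivered_data
           if pipe > self.ssthresh:
               # Proportional Rate Reduction
               sndcnt = (math.ceil(self.prr_delivered * self.ssthresh
                                   / self.recover_fs) - self.prr_out)
           else:
               if conservative:   # PRR-CRB
                   limit = self.prr_delivered - self.prr_out
               else:              # PRR-SSRB
                   limit = max(self.prr_delivered - self.prr_out,
                               delivered_data) + SMSS
               # Attempt to catch up, as permitted by limit
               sndcnt = min(self.ssthresh - pipe, limit)
           # Clamp (our addition): a non-positive result means that
           # nothing should be sent on this ACK.
           return max(sndcnt, 0)

       def on_send(self, data_sent):
           # Figure 3: account for every transmission/retransmission.
           self.prr_out += data_sent

A caller would invoke on_ack() once per ACK during recovery and on_send() for every transmission or retransmission it actually makes.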
"Sent" indicates how much 'N'ew or 230 'R'etransmitted data would be sent. Note that the algorithms for 231 deciding which data to send are out of scope of this document. 233 When there is a single loss, PRR with either of the Reduction Bound 234 algorithms has the same behavior. We show "RB", a flag indicating 235 which Reduction Bound subexpression ultimately determined the value 236 of sndcnt. When there are minimal losses, "limit" (both algorithms) 237 will always be larger than ssthresh - pipe, so the sndcnt will be 238 ssthresh - pipe, indicated by "s" in the "RB" row. 240 RFC 6675 241 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 242 cwnd: 20 20 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 243 pipe: 19 19 18 18 17 16 15 14 13 12 11 10 10 10 10 10 10 10 10 244 sent: N N R N N N N N N N N 246 Rate-Halving (Linux) 247 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 248 cwnd: 20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 249 pipe: 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 250 sent: N N R N N N N N N N N 252 PRR 253 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 254 pipe: 19 19 18 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 10 255 sent: N N R N N N N N N N N 256 RB: s s 257 Cwnd is not shown because PRR does not use it. 259 Key for RB 260 s: sndcnt = ssthresh - pipe // from ssthresh 261 b: sndcnt = prr_delivered - prr_out + SMSS // from banked 262 d: sndcnt = DeliveredData + SMSS // from DeliveredData 263 (Sometimes, more than one applies.) 265 Figure 4 267 Note that all 3 algorithms send the same total amount of data. RFC 268 6675 experiences a "half window of silence", while the Rate-Halving 269 and PRR spread the voluntary window reduction across an entire RTT. 271 Next, we consider the same initial conditions when the first 15 272 packets (0-14) are lost. During the remainder of the lossy RTT, only 273 5 ACKs are returned to the sender. We examine each of these 274 algorithms in succession. 276 RFC 6675 277 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 278 cwnd: 20 20 11 11 11 279 pipe: 19 19 4 10 10 280 sent: N N 7R R R 282 Rate-Halving (Linux) 283 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 284 cwnd: 20 20 5 5 5 285 pipe: 19 19 4 4 4 286 sent: N N R R R 288 PRR-CRB 289 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 290 pipe: 19 19 4 4 4 291 sent: N N R R R 292 RB: b b b 294 PRR-SSRB 295 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 296 pipe: 19 19 4 5 6 297 sent: N N 2R 2R 2R 298 RB: bd d d 300 Figure 5 302 In this specific situation, RFC 6675 is more aggressive because once 303 Fast Retransmit is triggered (on the ACK for segment 17), TCP 304 immediately retransmits sufficient data to bring pipe up to cwnd. 305 Our measurement data (see Section 5) indicates that RFC 6675 306 significantly outperforms Rate-Halving, PRR-CRB, and some other 307 similarly conservative algorithms that we tested, showing that it is 308 significantly common for the actual losses to exceed the window 309 reduction determined by the congestion control algorithm. 311 The Linux implementation of Rate-Halving includes an early version of 312 the Conservative Reduction Bound [RHweb]. In this situation, the 5 313 ACKs trigger exactly 1 transmission each (2 new data, 3 old data), 314 and cwnd is set to 5. At a window size of 5, it takes 3 round trips 315 to retransmit all 15 lost segments. Rate-Halving does not raise the 316 window at all during recovery, so when recovery finally completes, 317 TCP will slow start cwnd from 5 up to 10. 
In this example, TCP operates at half of the window chosen by the congestion control for more than 3 RTTs, increasing the elapsed time and exposing it to timeouts in the event that there are additional losses.

PRR-CRB implements a Conservative Reduction Bound. Since the total losses bring pipe below ssthresh, data is sent such that the total data transmitted, prr_out, follows the total data delivered to the receiver as reported by returning ACKs. Transmission is controlled by the sending limit, which is set to prr_delivered - prr_out. This is indicated by the RB:b tagging in the figure. In this case, PRR-CRB is exposed to exactly the same problems as Rate-Halving; the excess window reduction causes it to take excessively long to recover the losses and exposes it to additional timeouts.

PRR-SSRB increases the window by exactly 1 segment per ACK until pipe rises to ssthresh during recovery. This is accomplished by setting limit to one greater than the data reported to have been delivered to the receiver on this ACK, implementing slow start during recovery, as indicated by the RB:d tagging in the figure. Although increasing the window during recovery seems ill advised, it is important to remember that this is actually less aggressive than permitted by RFC 5681, which sends the same quantity of additional data as a single burst in response to the ACK that triggered Fast Retransmit.

For less extreme events, where the total losses are smaller than the difference between FlightSize and ssthresh, PRR-CRB and PRR-SSRB have identical behaviors.

4. Properties

The following properties are common to both PRR-CRB and PRR-SSRB, except as noted:

PRR maintains TCP's ACK clocking across most recovery events, including burst losses. RFC 6675 can send large unclocked bursts following burst losses.

Normally, PRR will spread voluntary window reductions out evenly across a full RTT. This has the potential to generally reduce the burstiness of Internet traffic, and could be considered to be a type of soft pacing. Hypothetically, any pacing increases the probability that different flows are interleaved, reducing the opportunity for ACK compression and other phenomena that increase traffic burstiness. However, these effects have not been quantified.

If there are minimal losses, PRR will converge to exactly the target window chosen by the congestion control algorithm. Note that as TCP approaches the end of recovery, prr_delivered will approach RecoverFS and sndcnt will be computed such that prr_out approaches ssthresh.

Implicit window reductions, due to multiple isolated losses during recovery, cause later voluntary reductions to be skipped. For small numbers of losses, the window size ends at exactly the window chosen by the congestion control algorithm.

For burst losses, earlier voluntary window reductions can be undone by sending extra segments in response to ACKs arriving later during recovery. Note that as long as some voluntary window reductions are not undone, the final value for pipe will be the same as ssthresh, the target cwnd value chosen by the congestion control algorithm. (A numeric check of this convergence appears below.)
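As an illustration only (not part of the specification), the convergence can be checked by direct arithmetic. Counting in segments, with ssthresh = 10 and RecoverFS = 20 as in the examples above, and assuming pipe stays above ssthresh so that only the proportional branch of Figure 2 runs:

   import math

   # Once everything outstanding at the start of recovery has been
   # delivered (prr_delivered == RecoverFS), the total sent under the
   # proportional branch is exactly ssthresh.
   ssthresh, recover_fs = 10, 20       # segments, as in the examples
   prr_delivered, prr_out = 0, 0
   for _ in range(recover_fs):         # one ACK per delivered segment
       prr_delivered += 1
       sndcnt = math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out
       prr_out += sndcnt
   assert prr_out == ssthresh          # final flight size == ssthresh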
PRR with either Reduction Bound improves the situation when there are application stalls, e.g., when the sending application does not queue data for transmission quickly enough or the receiver stops advancing rwnd (receiver window). When there is an application stall early during recovery, prr_out will fall behind the sum of the transmissions permitted by sndcnt. The missed opportunities to send due to stalls are treated like banked voluntary window reductions; specifically, they cause prr_delivered - prr_out to be significantly positive. If the application catches up while TCP is still in recovery, TCP will send a partial window burst to catch up to exactly where it would have been had the application never stalled. (A small worked example of this banking appears at the end of this section.) Although this burst might be viewed as being hard on the network, this is exactly what happens every time there is a partial RTT application stall while not in recovery. We have made the partial RTT stall behavior uniform in all states. Changing this behavior is out of scope for this document.

PRR with either Reduction Bound is less sensitive to errors in the pipe estimator. While in recovery, pipe is intrinsically an estimator, using incomplete information to estimate if un-SACKed segments are actually lost or merely out of order in the network. Under some conditions, pipe can have significant errors; for example, pipe is underestimated when a burst of reordered data is prematurely assumed to be lost and marked for retransmission. If the transmissions are regulated directly by pipe, as they are with RFC 6675, a step discontinuity in the pipe estimator causes a burst of data, which cannot be retracted once the pipe estimator is corrected a few ACKs later. For PRR, pipe merely determines which algorithm, PRR or the Reduction Bound, is used to compute sndcnt from DeliveredData. While pipe is underestimated, the algorithms differ by at most 1 segment per ACK. Once pipe is updated, they converge to the same final window at the end of recovery.

Under all conditions and sequences of events during recovery, PRR-CRB strictly bounds the data transmitted to be equal to or less than the amount of data delivered to the receiver. We claim that this Strong Packet Conservation Bound is the most aggressive algorithm that does not lead to additional forced losses in some environments. It has the property that if there is a standing queue at a bottleneck with no cross traffic, the queue will maintain exactly constant length for the duration of the recovery, except for +1/-1 fluctuation due to differences in packet arrival and exit times. See Appendix A for a detailed discussion of this property.

Although the Strong Packet Conservation Bound is very appealing for a number of reasons, our measurements summarized in Section 5 demonstrate that it is less aggressive and does not perform as well as RFC 6675, which permits bursts of data when there are bursts of losses. PRR-SSRB is a compromise that permits TCP to send 1 extra segment per ACK as compared to the Packet Conserving Bound. From the perspective of a strict Packet Conserving Bound, PRR-SSRB does indeed open the window during recovery; however, it is significantly less aggressive than RFC 6675 in the presence of burst losses.
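Returning to the application-stall discussion above, a toy calculation (ours, counting in segments and using the PRR-CRB limit for simplicity) illustrates the banked catch-up:

   # Illustrative numbers, not from the RFC: pipe sagged to 4 while an
   # application stall kept prr_out from keeping pace with deliveries.
   ssthresh, pipe = 10, 4
   prr_delivered, prr_out = 8, 2     # 6 sending opportunities banked
   limit = prr_delivered - prr_out   # PRR-CRB limit
   sndcnt = min(ssthresh - pipe, limit)
   assert sndcnt == 6                # one catch-up burst, bounded by
                                     # ssthresh - pipe, then back in step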
5. Measurements

In a companion IMC11 paper [IMC11], we describe some measurements comparing the various strategies for reducing the window during recovery. The experiments were performed on servers carrying Google production traffic and are briefly summarized here.

The various window reduction algorithms and extensive instrumentation were all implemented in Linux 2.6. We used the uniform set of algorithms present in the base Linux implementation, including CUBIC [CUBIC], Limited Transmit [RFC3042], threshold transmit (Section 3.1 in [FACK]) (this algorithm was not present in RFC 3517, but a similar algorithm has been added to RFC 6675), and lost retransmission detection algorithms. We confirmed that the behaviors of Rate-Halving (the Linux default), RFC 3517, and PRR were authentic to their respective specifications and that performance and features were comparable to the kernels in production use. All of the different window reduction algorithms were present in a common kernel and could be selected with a sysctl, such that we had an absolutely uniform baseline for comparing them.

Our experiments included an additional algorithm, PRR with an unlimited bound (PRR-UB), which sends ssthresh - pipe bursts when pipe falls below ssthresh. This behavior parallels RFC 3517.

An important detail of this configuration is that CUBIC only reduces the window by 30%, as opposed to the 50% reduction used by traditional congestion control algorithms. This accentuates the tendency for RFC 3517 and PRR-UB to send a burst at the point when Fast Retransmit gets triggered, because pipe is likely to already be below ssthresh. Precisely this condition was observed for 32% of the recovery events: pipe fell below ssthresh before Fast Retransmit was triggered, thus the various PRR algorithms started in the Reduction Bound phase, and RFC 3517 sent bursts of segments with the Fast Retransmit.

In the companion paper, we observe that PRR-SSRB spends the least time in recovery of all the algorithms tested, largely because it experiences fewer timeouts once it is already in recovery.

RFC 3517 experiences 29% more detected lost retransmissions and 2.6% more timeouts (presumably due to undetected lost retransmissions) than PRR-SSRB. These results are representative of PRR-UB and other algorithms that send bursts when pipe falls below ssthresh.

Rate-Halving experiences 5% more timeouts and significantly smaller final cwnd values at the end of recovery. The smaller cwnd sometimes causes the recovery itself to take extra round trips. These results are representative of PRR-CRB and other algorithms that implement strict packet conservation during recovery.

6. Conclusion and Recommendations

Although the Strong Packet Conservation Bound used in PRR-CRB is very appealing for a number of reasons, our measurements show that it is less aggressive and does not perform as well as RFC 3517 (and by implication RFC 6675), which permits bursts of data when there are bursts of losses. RFC 3517 and RFC 6675 are conservative in the original sense of Van Jacobson's packet conservation principle, which included the assumption that presumed lost segments have indeed left the network.
PRR-CRB makes no such assumption, following instead a Strong Packet Conservation Bound in which only packets that have actually arrived at the receiver are considered to have left the network. PRR-SSRB is a compromise that permits TCP to send 1 extra segment per ACK relative to the Strong Packet Conservation Bound, to partially compensate for excess losses.

From the perspective of the Strong Packet Conservation Bound, PRR-SSRB does indeed open the window during recovery; however, it is significantly less aggressive than RFC 3517 (and RFC 6675) in the presence of burst losses. Even so, it often outperforms RFC 3517 (and presumably RFC 6675) because it avoids some of the self-inflicted losses caused by bursts.

At this time, we see no reason not to test and deploy PRR-SSRB on a large scale. Implementers worried about any potential impact of raising the window during recovery may want to optionally support PRR-CRB (which is actually simpler to implement) for comparison studies. Furthermore, there is one minor detail of PRR that can be improved by replacing pipe with total_pipe, as defined by Laminar TCP [Laminar].

One final comment about terminology: we expect that common usage will drop "Slow Start Reduction Bound" from the algorithm name. This document needed to be pedantic about having distinct names for PRR and every variant of the Reduction Bound. However, we do not anticipate any future exploration of the alternative Reduction Bounds.

7. Acknowledgements

This document is based in part on previous incomplete work by Matt Mathis, Jeff Semke, and Jamshid Mahdavi [RHID] and influenced by several discussions with John Heffner.

Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the experiments.

Ilpo Jarvinen reviewed the code.

Mark Allman improved the document through his insightful review.

Neal Cardwell reviewed and tested the patch.

8. Security Considerations

PRR does not change the risk profile for TCP.

Implementers that change PRR from counting bytes to segments have to be cautious about the effects of ACK splitting attacks [Savage99], where the receiver acknowledges partial segments for the purpose of confusing the sender's congestion accounting. (A small illustration follows.)
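The distinction can be shown with a toy calculation (ours, not the RFC's), in which a misbehaving receiver splits one segment's worth of acknowledgement into ten slivers:

   SMSS = 1000   # assumed segment size, in bytes, for illustration

   # The receiver acknowledges one SMSS-sized segment in ten
   # 100-byte partial acknowledgements [Savage99].
   acks = [100] * 10

   byte_credit = sum(acks)             # byte counting: 1000 bytes
   segment_credit = len(acks) * SMSS   # naive segment counting

   assert byte_credit == SMSS          # DeliveredData grows at the
                                       # true delivery rate
   assert segment_credit == 10 * SMSS  # inflated tenfold by the attack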
9. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981,
              <https://www.rfc-editor.org/info/rfc793>.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018,
              DOI 10.17487/RFC2018, October 1996,
              <https://www.rfc-editor.org/info/rfc2018>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
              <https://www.rfc-editor.org/info/rfc5681>.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
              and Y. Nishida, "A Conservative Loss Recovery Algorithm
              Based on Selective Acknowledgment (SACK) for TCP",
              RFC 6675, DOI 10.17487/RFC6675, August 2012,
              <https://www.rfc-editor.org/info/rfc6675>.

10. Informative References

   [CUBIC]    Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-speed
              TCP variant", PFLDnet 2005, February 2005.

   [FACK]     Mathis, M. and J. Mahdavi, "Forward Acknowledgment:
              Refining TCP Congestion Control", ACM SIGCOMM,
              August 1996.

   [Flach2016]
              Flach, T., Papageorge, P., Terzis, A., Pedrosa, L.,
              Cheng, Y., Karim, T., Katz-Bassett, E., and R. Govindan,
              "An Internet-Wide Analysis of Traffic Policing",
              ACM SIGCOMM, August 2016.

   [IMC11]    Dukkipati, N., Mathis, M., Cheng, Y., and M. Ghobadi,
              "Proportional Rate Reduction for TCP", Proceedings of the
              11th ACM SIGCOMM Conference on Internet Measurement 2011,
              Berlin, Germany, November 2011.

   [Jacobson88]
              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
              Comput. Commun. Rev. 18(4), August 1988.

   [Laminar]  Mathis, M., "Laminar TCP and the case for refactoring TCP
              congestion control", Work in Progress, 16 July 2012.

   [RFC3042]  Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
              TCP's Loss Recovery Using Limited Transmit", RFC 3042,
              DOI 10.17487/RFC3042, January 2001,
              <https://www.rfc-editor.org/info/rfc3042>.

   [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
              Conservative Selective Acknowledgment (SACK)-based Loss
              Recovery Algorithm for TCP", RFC 3517,
              DOI 10.17487/RFC3517, April 2003,
              <https://www.rfc-editor.org/info/rfc3517>.

   [RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
              Rate Reduction for TCP", RFC 6937, DOI 10.17487/RFC6937,
              May 2013, <https://www.rfc-editor.org/info/rfc6937>.

   [RHID]     Mathis, M., Semke, J., and J. Mahdavi, "The Rate-Halving
              Algorithm for TCP Congestion Control", Work in Progress,
              August 1999.

   [RHweb]    Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
              Parameters", Web publication, December 1997.

   [Savage99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
              "TCP congestion control with a misbehaving receiver",
              SIGCOMM Comput. Commun. Rev. 29(5), October 1999.

Appendix A. Strong Packet Conservation Bound

PRR-CRB is based on a conservative, philosophically pure, and aesthetically appealing Strong Packet Conservation Bound, described here. Although inspired by Van Jacobson's packet conservation principle [Jacobson88], it differs in how it treats segments that are missing and presumed lost. Under all conditions and sequences of events during recovery, PRR-CRB strictly bounds the data transmitted to be equal to or less than the amount of data delivered to the receiver. Note that the effects of presumed losses are included in the pipe calculation, but do not affect the outcome of PRR-CRB, once pipe has fallen below ssthresh.

We claim that this Strong Packet Conservation Bound is the most aggressive algorithm that does not lead to additional forced losses in some environments. It has the property that if there is a standing queue at a bottleneck that is carrying no other traffic, the queue will maintain exactly constant length for the entire duration of the recovery, except for +1/-1 fluctuation due to differences in packet arrival and exit times. Any less aggressive algorithm will result in a declining queue at the bottleneck. Any more aggressive algorithm will result in an increasing queue or additional losses if it is a full drop tail queue.

We demonstrate this property with a little thought experiment:

Imagine a network path that has insignificant delays in both directions, except for the processing time and queue at a single bottleneck in the forward path. By insignificant delay, we mean when a packet is "served" at the head of the bottleneck queue, the following events happen in much less than one bottleneck packet time: the packet arrives at the receiver; the receiver sends an ACK that arrives at the sender; the sender processes the ACK and sends some data; the data is queued at the bottleneck.
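Before walking through the argument, a toy rendition of this setup in Python (illustrative only; unit-sized segments, instantaneous ACKs) checks the constant-queue claim developed in the next paragraph:

   from collections import deque

   queue = deque([1] * 20)        # standing queue at the bottleneck
   for _ in range(100):           # serve 100 segments
       queue.popleft()            # head-of-queue segment departs...
       delivered_data = 1         # ...and is ACKed almost immediately
       sndcnt = delivered_data    # Strong Packet Conservation Bound
       queue.extend([1] * sndcnt) # the new data lands in the queue
       assert len(queue) == 20    # queue length is exactly constant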
If sndcnt is set to DeliveredData and nothing else is inhibiting sending data, then clearly the data arriving at the bottleneck queue will exactly replace the data that was served at the head of the queue, so the queue will have a constant length. If the queue is drop tail and full, then the queue will stay exactly full. Losses or reordering on the ACK path only cause wider fluctuations in the queue size, but do not raise its peak size, independent of whether the data is in order or out of order (including loss recovery from an earlier RTT). Any more aggressive algorithm that sends additional data will overflow the drop tail queue and cause loss. Any less aggressive algorithm will under-fill the queue. Therefore, setting sndcnt to DeliveredData is the most aggressive algorithm that does not cause forced losses in this simple network. Relaxing the assumptions (e.g., making delays more authentic and adding more flows, delayed ACKs, etc.) is likely to increase the fine-grained fluctuations in queue size but does not change its basic behavior.

Note that the congestion control algorithm implements a broader notion of optimal that includes appropriately sharing the network. Typical congestion control algorithms are likely to reduce the data sent relative to the Packet Conserving Bound implemented by PRR, bringing TCP's actual window down to ssthresh.

Authors' Addresses

   Matt Mathis
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   United States of America

   Email: mattmathis@google.com

   Nandita Dukkipati
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   United States of America

   Email: nanditad@google.com

   Yuchung Cheng
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   United States of America

   Email: ycheng@google.com