idnits 2.17.1

draft-ietf-quic-recovery-28.txt:
-(2081): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == There is 1 instance of lines with non-ascii characters in the document.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (20 May 2020) is 1437 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'Initial' is mentioned on line 1468, but not defined

  == Outdated reference: A later version (-34) exists of
     draft-ietf-quic-tls-28

  == Outdated reference: A later version (-34) exists of
     draft-ietf-quic-transport-28

  == Outdated reference: A later version (-15) exists of
     draft-ietf-tcpm-rack-08

  -- Obsolete informational reference (is this intentional?): RFC 8312
     (Obsoleted by RFC 9438)

     Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

2  QUIC                                                     J. Iyengar, Ed.
3  Internet-Draft                                                    Fastly
4  Intended status: Standards Track                           I. Swett, Ed.
5  Expires: 21 November 2020                                         Google
6                                                               20 May 2020

8                 QUIC Loss Detection and Congestion Control
9                        draft-ietf-quic-recovery-28

11  Abstract

13     This document describes loss detection and congestion control
14     mechanisms for QUIC.

16  Note to Readers

18     Discussion of this draft takes place on the QUIC working group
19     mailing list (quic@ietf.org (mailto:quic@ietf.org)), which is
20     archived at https://mailarchive.ietf.org/arch/
21     search/?email_list=quic.

23     Working Group information can be found at https://github.com/quicwg;
24     source code and issues list for this draft can be found at
25     https://github.com/quicwg/base-drafts/labels/-recovery.

27  Status of This Memo

29     This Internet-Draft is submitted in full conformance with the
30     provisions of BCP 78 and BCP 79.

32     Internet-Drafts are working documents of the Internet Engineering
33     Task Force (IETF). Note that other groups may also distribute
34     working documents as Internet-Drafts. The list of current Internet-
35     Drafts is at https://datatracker.ietf.org/drafts/current/.

37     Internet-Drafts are draft documents valid for a maximum of six months
38     and may be updated, replaced, or obsoleted by other documents at any
39     time. It is inappropriate to use Internet-Drafts as reference
40     material or to cite them other than as "work in progress."

42     This Internet-Draft will expire on 21 November 2020.

44  Copyright Notice

46     Copyright (c) 2020 IETF Trust and the persons identified as the
47     document authors. All rights reserved.

49     This document is subject to BCP 78 and the IETF Trust's Legal
50     Provisions Relating to IETF Documents (https://trustee.ietf.org/
51     license-info) in effect on the date of publication of this document.
52     Please review these documents carefully, as they describe your rights
53     and restrictions with respect to this document. Code Components
54     extracted from this document must include Simplified BSD License text
55     as described in Section 4.e of the Trust Legal Provisions and are
56     provided without warranty as described in the Simplified BSD License.

58  Table of Contents

60     1. Introduction
61     2. Conventions and Definitions
62     3. Design of the QUIC Transmission Machinery
63       3.1. Relevant Differences Between QUIC and TCP
64         3.1.1. Separate Packet Number Spaces
65         3.1.2. Monotonically Increasing Packet Numbers
66         3.1.3. Clearer Loss Epoch
67         3.1.4. No Reneging
68         3.1.5. More ACK Ranges
69         3.1.6. Explicit Correction For Delayed Acknowledgements
70         3.1.7. Probe Timeout Replaces RTO and TLP
71         3.1.8. The Minimum Congestion Window is Two Packets
72     4. Estimating the Round-Trip Time
73       4.1. Generating RTT samples
74       4.2. Estimating min_rtt
75       4.3. Estimating smoothed_rtt and rttvar
76     5. Loss Detection
77       5.1. Acknowledgement-based Detection
78         5.1.1. Packet Threshold
79         5.1.2. Time Threshold
80       5.2. Probe Timeout
81         5.2.1. Computing PTO
82         5.2.2. Handshakes and New Paths
83         5.2.3. Speeding Up Handshake Completion
84         5.2.4. Sending Probe Packets
85       5.3. Handling Retry Packets
86       5.4. Discarding Keys and Packet State
87     6. Congestion Control
88       6.1. Explicit Congestion Notification
89       6.2. Initial and Minimum Congestion Window
90       6.3. Slow Start
91       6.4. Congestion Avoidance
92       6.5. Recovery Period
93       6.6. Ignoring Loss of Undecryptable Packets
94       6.7. Probe Timeout
95       6.8. Persistent Congestion
96       6.9. Pacing
97       6.10. Under-utilizing the Congestion Window
98     7. Security Considerations
99       7.1. Congestion Signals
100       7.2. Traffic Analysis
101       7.3. Misreporting ECN Markings
102     8. IANA Considerations
103     9. References
104       9.1. Normative References
105       9.2. Informative References
106     Appendix A. Loss Recovery Pseudocode
107       A.1. Tracking Sent Packets
108         A.1.1. Sent Packet Fields
109       A.2. Constants of Interest
110       A.3. Variables of interest
111       A.4. Initialization
112       A.5. On Sending a Packet
113       A.6. On Receiving a Datagram
114       A.7. On Receiving an Acknowledgment
115       A.8. Setting the Loss Detection Timer
116       A.9. On Timeout
117       A.10. Detecting Lost Packets
118     Appendix B. Congestion Control Pseudocode
119       B.1. Constants of interest
120       B.2. Variables of interest
121       B.3. Initialization
122       B.4. On Packet Sent
123       B.5. On Packet Acknowledgement
124       B.6. On New Congestion Event
125       B.7. Process ECN Information
126       B.8. On Packets Lost
127       B.9. Upon dropping Initial or Handshake keys
128     Appendix C. Change Log
129       C.1. Since draft-ietf-quic-recovery-27
130       C.2. Since draft-ietf-quic-recovery-26
131       C.3. Since draft-ietf-quic-recovery-25
132       C.4. Since draft-ietf-quic-recovery-24
133       C.5. Since draft-ietf-quic-recovery-23
134       C.6. Since draft-ietf-quic-recovery-22
135       C.7. Since draft-ietf-quic-recovery-21
136       C.8. Since draft-ietf-quic-recovery-20
137       C.9. Since draft-ietf-quic-recovery-19
138       C.10. Since draft-ietf-quic-recovery-18
139       C.11. Since draft-ietf-quic-recovery-17
140       C.12. Since draft-ietf-quic-recovery-16
141       C.13. Since draft-ietf-quic-recovery-14
142       C.14. Since draft-ietf-quic-recovery-13
143       C.15. Since draft-ietf-quic-recovery-12
144       C.16. Since draft-ietf-quic-recovery-11
145       C.17. Since draft-ietf-quic-recovery-10
146       C.18. Since draft-ietf-quic-recovery-09
147       C.19. Since draft-ietf-quic-recovery-08
148       C.20. Since draft-ietf-quic-recovery-07
149       C.21. Since draft-ietf-quic-recovery-06
150       C.22. Since draft-ietf-quic-recovery-05
151       C.23. Since draft-ietf-quic-recovery-04
152       C.24. Since draft-ietf-quic-recovery-03
153       C.25. Since draft-ietf-quic-recovery-02
154       C.26. Since draft-ietf-quic-recovery-01
155       C.27. Since draft-ietf-quic-recovery-00
156       C.28. Since draft-iyengar-quic-loss-recovery-01
157     Appendix D. Contributors
158     Acknowledgments
159     Authors' Addresses

161  1. Introduction

163     QUIC is a new multiplexed and secure transport protocol atop UDP,
164     specified in [QUIC-TRANSPORT]. This document describes congestion
165     control and loss recovery for QUIC. Mechanisms described in this
166     document follow the spirit of existing TCP congestion control and
167     loss recovery mechanisms, described in RFCs, various Internet-drafts,
168     or academic papers, and also those prevalent in TCP implementations.

170  2. Conventions and Definitions

172     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
173     "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
174     "OPTIONAL" in this document are to be interpreted as described in
175     BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
176     capitals, as shown here.

178     Definitions of terms that are used in this document:

180     Ack-eliciting Frames: All frames other than ACK, PADDING, and
181        CONNECTION_CLOSE are considered ack-eliciting.

183     Ack-eliciting Packets: Packets that contain ack-eliciting frames
184        elicit an ACK from the receiver within the maximum ack delay and
185        are called ack-eliciting packets.

187     In-flight: Packets are considered in-flight when they are ack-
188        eliciting or contain a PADDING frame, and they have been sent but
189        are not acknowledged, declared lost, or abandoned along with old
190        keys.

192  3. Design of the QUIC Transmission Machinery

194     All transmissions in QUIC are sent with a packet-level header, which
195     indicates the encryption level and includes a packet sequence number
196     (referred to below as a packet number). The encryption level
197     indicates the packet number space, as described in [QUIC-TRANSPORT].
198     Packet numbers never repeat within a packet number space for the
199     lifetime of a connection. Packet numbers are sent in monotonically
200     increasing order within a space, preventing ambiguity.

202     This design obviates the need for disambiguating between
203     transmissions and retransmissions and eliminates significant
204     complexity from QUIC's interpretation of TCP loss detection
205     mechanisms.

207     QUIC packets can contain multiple frames of different types. The
208     recovery mechanisms ensure that data and frames that need reliable
209     delivery are acknowledged or declared lost and sent in new packets as
210     necessary. The types of frames contained in a packet affect recovery
211     and congestion control logic:

213     * All packets are acknowledged, though packets that contain no ack-
214       eliciting frames are only acknowledged along with ack-eliciting
215       packets.

217     * Long header packets that contain CRYPTO frames are critical to the
218       performance of the QUIC handshake and use shorter timers for
219       acknowledgement.

221     * Packets containing frames besides ACK or CONNECTION_CLOSE frames
222       count toward congestion control limits and are considered in-
223       flight.

225     * PADDING frames cause packets to contribute toward bytes in flight
226       without directly causing an acknowledgment to be sent.

228  3.1. Relevant Differences Between QUIC and TCP

230     Readers familiar with TCP's loss detection and congestion control
231     will find algorithms here that parallel well-known TCP ones.
232     Protocol differences between QUIC and TCP, however, contribute to
233     algorithmic differences. We briefly describe these protocol
234     differences below.

236  3.1.1. Separate Packet Number Spaces

238     QUIC uses separate packet number spaces for each encryption level,
239     except that 0-RTT and all generations of 1-RTT keys use the same
240     packet number space. Separate packet number spaces ensure that
241     acknowledgement of packets sent with one level of encryption will not
242     cause spurious retransmission of packets sent with a different
243     encryption level. Congestion control and round-trip time (RTT)
244     measurement are unified across packet number spaces.

246  3.1.2. Monotonically Increasing Packet Numbers

248     TCP conflates transmission order at the sender with delivery order at
249     the receiver, which results in retransmissions of the same data
250     carrying the same sequence number, and consequently leads to
251     "retransmission ambiguity". QUIC separates the two. QUIC uses a
252     packet number to indicate transmission order. Application data is
253     sent in one or more streams and delivery order is determined by
254     stream offsets encoded within STREAM frames.

256     QUIC's packet number is strictly increasing within a packet number
257     space, and directly encodes transmission order. A higher packet
258     number signifies that the packet was sent later, and a lower packet
259     number signifies that the packet was sent earlier. When a packet
260     containing ack-eliciting frames is detected lost, QUIC rebundles
261     necessary frames in a new packet with a new packet number, removing
262     ambiguity about which packet is acknowledged when an ACK is received.
263     Consequently, more accurate RTT measurements can be made, spurious
264     retransmissions are trivially detected, and mechanisms such as Fast
265     Retransmit can be applied universally, based only on packet number.

267     This design point significantly simplifies loss detection mechanisms
268     for QUIC. Most TCP mechanisms implicitly attempt to infer
269     transmission ordering based on TCP sequence numbers - a non-trivial
270     task, especially when TCP timestamps are not available.

272  3.1.3. Clearer Loss Epoch

274     QUIC starts a loss epoch when a packet is lost and ends one when any
275     packet sent after the epoch starts is acknowledged. TCP waits for
276     the gap in the sequence number space to be filled, and so if a
277     segment is lost multiple times in a row, the loss epoch may not end
278     for several round trips. Because both should reduce their congestion
279     windows only once per epoch, QUIC will do it once for every round
280     trip that experiences loss, while TCP may only do it once across
281     multiple round trips.

283  3.1.4. No Reneging

285     QUIC ACKs contain information that is similar to TCP SACK, but QUIC
286     does not allow any acked packet to be reneged, greatly simplifying
287     implementations on both sides and reducing memory pressure on the
288     sender.

290  3.1.5. More ACK Ranges

292     QUIC supports many ACK ranges, as opposed to TCP's 3 SACK ranges. In
293     high loss environments, this speeds recovery, reduces spurious
294     retransmits, and ensures forward progress without relying on
295     timeouts.

297  3.1.6. Explicit Correction For Delayed Acknowledgements

299     QUIC endpoints measure the delay incurred between when a packet is
300     received and when the corresponding acknowledgment is sent, allowing
301     a peer to maintain a more accurate round-trip time estimate; see
302     Section 13.2 of [QUIC-TRANSPORT].

304  3.1.7. Probe Timeout Replaces RTO and TLP

306     QUIC uses a probe timeout (see Section 5.2), with a timer based on
307     TCP's RTO computation. QUIC's PTO includes the peer's maximum
308     expected acknowledgement delay instead of using a fixed minimum
309     timeout. QUIC does not collapse the congestion window until
310     persistent congestion (Section 6.8) is declared, unlike TCP, which
311     collapses the congestion window upon expiry of an RTO. Instead of
312     collapsing the congestion window and declaring everything in-flight
313     lost, QUIC allows probe packets to temporarily exceed the congestion
314     window whenever the timer expires.

316     In doing this, QUIC avoids unnecessary congestion window reductions,
317     obviating the need for correcting mechanisms such as F-RTO [RFC5682].
318     Since QUIC does not collapse the congestion window on a PTO
319     expiration, a QUIC sender is not limited from sending more in-flight
320     packets after a PTO expiration if it still has available congestion
321     window. This occurs when a sender is application-limited and the PTO
322     timer expires. This is more aggressive than TCP's RTO mechanism when
323     application-limited, but identical when not application-limited.

325     A single packet loss at the tail does not indicate persistent
326     congestion, so QUIC specifies a time-based definition to ensure one
327     or more packets are sent prior to a dramatic decrease in congestion
328     window; see Section 6.8.

330  3.1.8. The Minimum Congestion Window is Two Packets

332     TCP uses a minimum congestion window of one packet. However, loss of
333     that single packet means that the sender needs to wait for a PTO
334     (Section 5.2) to recover, which can be much longer than a round-trip
335     time. Sending a single ack-eliciting packet also increases the
336     chances of incurring additional latency when a receiver delays its
337     acknowledgement.

339     QUIC therefore recommends that the minimum congestion window be two
340     packets. While this increases network load, it is considered safe,
341     since the sender will still reduce its sending rate exponentially
342     under persistent congestion (Section 5.2).

344  4. Estimating the Round-Trip Time

346     At a high level, an endpoint measures the time from when a packet was
347     sent to when it is acknowledged as a round-trip time (RTT) sample.
348     The endpoint uses RTT samples and peer-reported host delays (see
349     Section 13.2 of [QUIC-TRANSPORT]) to generate a statistical
350     description of the network path's RTT. An endpoint computes the
351     following three values for each path: the minimum value observed over
352     the lifetime of the path (min_rtt), an exponentially-weighted moving
353     average (smoothed_rtt), and the mean deviation (referred to as
354     "variation" in the rest of this document) in the observed RTT samples
355     (rttvar).

357  4.1. Generating RTT samples

359     An endpoint generates an RTT sample on receiving an ACK frame that
360     meets the following two conditions:

362     * the largest acknowledged packet number is newly acknowledged, and

364     * at least one of the newly acknowledged packets was ack-eliciting.

366     The RTT sample, latest_rtt, is generated as the time elapsed since
367     the largest acknowledged packet was sent:

369     latest_rtt = ack_time - send_time_of_largest_acked

371     An RTT sample is generated using only the largest acknowledged packet
372     in the received ACK frame. This is because a peer reports ACK delays
373     for only the largest acknowledged packet in an ACK frame. While the
374     reported ACK delay is not used by the RTT sample measurement, it is
375     used to adjust the RTT sample in subsequent computations of
376     smoothed_rtt and rttvar (Section 4.3).

378     To avoid generating multiple RTT samples for a single packet, an ACK
379     frame SHOULD NOT be used to update RTT estimates if it does not newly
380     acknowledge the largest acknowledged packet.

382     An RTT sample MUST NOT be generated on receiving an ACK frame that
383     does not newly acknowledge at least one ack-eliciting packet. A peer
384     usually does not send an ACK frame when only non-ack-eliciting
385     packets are received. Therefore, an ACK frame that contains
386     acknowledgements for only non-ack-eliciting packets could include an
387     arbitrarily large Ack Delay value. Ignoring such ACK frames avoids
388     complications in subsequent smoothed_rtt and rttvar computations.

390     A sender might generate multiple RTT samples per RTT when multiple
391     ACK frames are received within an RTT. As suggested in [RFC6298],
392     doing so might result in inadequate history in smoothed_rtt and
393     rttvar. Ensuring that RTT estimates retain sufficient history is an
394     open research question.

396  4.2. Estimating min_rtt

398     min_rtt is the minimum RTT observed for a given network path.
399     min_rtt is set to the latest_rtt on the first RTT sample, and to the
400     lesser of min_rtt and latest_rtt on subsequent samples. In this
401     document, min_rtt is used by loss detection to reject implausibly
402     small RTT samples.

404     An endpoint uses only locally observed times in computing the min_rtt
405     and does not adjust for ACK delays reported by the peer. Doing so
406     allows the endpoint to set a lower bound for the smoothed_rtt based
407     entirely on what it observes (see Section 4.3), and limits potential
408     underestimation due to erroneously-reported delays by the peer.

410     The RTT for a network path may change over time. If a path's actual
411     RTT decreases, the min_rtt will adapt immediately on the first low
412     sample. If the path's actual RTT increases, the min_rtt will not
413     adapt to it, allowing future RTT samples that are smaller than the
414     new RTT to be included in smoothed_rtt.

416  4.3. Estimating smoothed_rtt and rttvar

418     smoothed_rtt is an exponentially-weighted moving average of an
419     endpoint's RTT samples, and rttvar is the variation in the RTT
420     samples, estimated using a mean variation.

422     The calculation of smoothed_rtt uses path latency after adjusting RTT
423     samples for acknowledgement delays. These delays are computed using
424     the ACK Delay field of the ACK frame as described in Section 19.3 of
425     [QUIC-TRANSPORT]. For packets sent in the ApplicationData packet
426     number space, a peer limits any delay in sending an acknowledgement
427     for an ack-eliciting packet to no greater than the value it
428     advertised in the max_ack_delay transport parameter. Consequently,
429     when a peer reports an Ack Delay that is greater than its
430     max_ack_delay, the delay is attributed to reasons out of the peer's
431     control, such as scheduler latency at the peer or loss of previous
432     ACK frames. Any delays beyond the peer's max_ack_delay are therefore
433     considered effectively part of path delay and incorporated into the
434     smoothed_rtt estimate.

436     When adjusting an RTT sample using peer-reported acknowledgement
437     delays, an endpoint:

439     * MUST ignore the Ack Delay field of the ACK frame for packets sent
440       in the Initial and Handshake packet number space.

442     * MUST use the lesser of the value reported in Ack Delay field of
443       the ACK frame and the peer's max_ack_delay transport parameter.

445     * MUST NOT apply the adjustment if the resulting RTT sample is
446       smaller than the min_rtt. This limits the underestimation that a
447       misreporting peer can cause to the smoothed_rtt.

449     smoothed_rtt and rttvar are computed as follows, similar to
450     [RFC6298].

452     When there are no samples for a network path, and on the first RTT
453     sample for the network path:

455     smoothed_rtt = rtt_sample
456     rttvar = rtt_sample / 2

458     Before any RTT samples are available, the initial RTT is used as
459     rtt_sample. On the first RTT sample for the network path, that
460     sample is used as rtt_sample. This ensures that the first
461     measurement erases the history of any persisted or default values.

463     On subsequent RTT samples, smoothed_rtt and rttvar evolve as follows:

465     ack_delay = min(Ack Delay in ACK Frame, max_ack_delay)
466     adjusted_rtt = latest_rtt
467     if (min_rtt + ack_delay < latest_rtt):
468       adjusted_rtt = latest_rtt - ack_delay
469     smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rtt
470     rttvar_sample = abs(smoothed_rtt - adjusted_rtt)
471     rttvar = 3/4 * rttvar + 1/4 * rttvar_sample

473  5. Loss Detection

475     QUIC senders use acknowledgements to detect lost packets, and a probe
476     timeout (see Section 5.2) to ensure acknowledgements are received.
477     This section provides a description of these algorithms.

479     If a packet is lost, the QUIC transport needs to recover from that
480     loss, such as by retransmitting the data, sending an updated frame,
481     or abandoning the frame. For more information, see Section 13.3 of
482     [QUIC-TRANSPORT].

484  5.1. Acknowledgement-based Detection

486     Acknowledgement-based loss detection implements the spirit of TCP's
487     Fast Retransmit [RFC5681], Early Retransmit [RFC5827], FACK [FACK],
488     SACK loss recovery [RFC6675], and RACK [RACK]. This section provides
489     an overview of how these algorithms are implemented in QUIC.

491     A packet is declared lost if it meets all the following conditions:

493     * The packet is unacknowledged, in-flight, and was sent prior to an
494       acknowledged packet.

496     * Either its packet number is kPacketThreshold smaller than an
497       acknowledged packet (Section 5.1.1), or it was sent long enough in
498       the past (Section 5.1.2).
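
The two conditions above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not part of the draft: the SentPacket record and the time_threshold and detect_lost functions are hypothetical names, all times are seconds on a monotonic clock, the packets belong to a single packet number space, and the constants take the recommended values given in Sections 5.1.1 and 5.1.2.

```python
from dataclasses import dataclass

K_PACKET_THRESHOLD = 3     # RECOMMENDED initial value (Section 5.1.1)
K_TIME_THRESHOLD = 9 / 8   # RECOMMENDED RTT multiplier (Section 5.1.2)
K_GRANULARITY = 0.001      # RECOMMENDED timer granularity (1ms), in seconds


@dataclass
class SentPacket:
    """Hypothetical record of one sent packet in a packet number space."""
    packet_number: int
    time_sent: float       # seconds on a monotonic clock
    in_flight: bool = True
    acked: bool = False


def time_threshold(smoothed_rtt, latest_rtt):
    """Time threshold of Section 5.1.2, in seconds."""
    return max(K_TIME_THRESHOLD * max(smoothed_rtt, latest_rtt), K_GRANULARITY)


def detect_lost(sent, largest_acked, now, smoothed_rtt, latest_rtt):
    """Return the packets that meet both conditions for being declared lost."""
    threshold = time_threshold(smoothed_rtt, latest_rtt)
    lost = []
    for pkt in sent:
        # Condition 1: unacknowledged, in-flight, and sent before an
        # acknowledged packet (packet numbers encode transmission order).
        if pkt.acked or not pkt.in_flight or pkt.packet_number >= largest_acked:
            continue
        # Condition 2: far enough behind in packet number, or old enough.
        by_packets = largest_acked - pkt.packet_number >= K_PACKET_THRESHOLD
        by_time = now - pkt.time_sent >= threshold
        if by_packets or by_time:
            lost.append(pkt)
    return lost


# Packets 1..5 sent 10ms apart; only packet 5 has been acknowledged.
packets = [SentPacket(n, 0.89 + 0.01 * n) for n in range(1, 6)]
packets[4].acked = True
lost_now = detect_lost(packets, largest_acked=5, now=1.0,
                       smoothed_rtt=0.1, latest_rtt=0.1)
print([p.packet_number for p in lost_now])
```

With a 100ms RTT, packets 1 and 2 trail the acknowledged packet by at least three packet numbers and are declared lost at once. Packets 3 and 4 are not, and would only be declared lost once the time threshold of 112.5ms has elapsed since they were sent.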

500     The acknowledgement indicates that a packet sent later was delivered,
501     and the packet and time thresholds provide some tolerance for packet
502     reordering.

504     Spuriously declaring packets as lost leads to unnecessary
505     retransmissions and may result in degraded performance due to the
506     actions of the congestion controller upon detecting loss.
507     Implementations can detect spurious retransmissions and increase the
508     reordering threshold in packets or time to reduce future spurious
509     retransmissions and loss events. Implementations with adaptive time
510     thresholds MAY choose to start with smaller initial reordering
511     thresholds to minimize recovery latency.

513  5.1.1. Packet Threshold

515     The RECOMMENDED initial value for the packet reordering threshold
516     (kPacketThreshold) is 3, based on best practices for TCP loss
517     detection [RFC5681] [RFC6675]. Implementations SHOULD NOT use a
518     packet threshold less than 3, to keep in line with TCP [RFC5681].

520     Some networks may exhibit higher degrees of reordering, causing a
521     sender to detect spurious losses. Algorithms that increase the
522     reordering threshold after spuriously detecting losses, such as TCP-
523     NCR [RFC4653], have proven to be useful in TCP and are expected to be
524     at least as useful in QUIC. Re-ordering could be more common with
525     QUIC than TCP, because network elements cannot observe and fix the
526     order of out-of-order packets.

528  5.1.2. Time Threshold

530     Once a later packet within the same packet number space has been
531     acknowledged, an endpoint SHOULD declare an earlier packet lost if it
532     was sent a threshold amount of time in the past. To avoid declaring
533     packets as lost too early, this time threshold MUST be set to at
534     least the local timer granularity, as indicated by the kGranularity
535     constant. The time threshold is:

537     max(kTimeThreshold * max(smoothed_rtt, latest_rtt), kGranularity)

539     If packets sent prior to the largest acknowledged packet cannot yet
540     be declared lost, then a timer SHOULD be set for the remaining time.

542     Using max(smoothed_rtt, latest_rtt) protects against the following
543     two cases:

545     * the latest RTT sample is lower than the smoothed RTT, perhaps due
546       to reordering where the acknowledgement encountered a shorter
547       path;

549     * the latest RTT sample is higher than the smoothed RTT, perhaps due
550       to a sustained increase in the actual RTT, but the smoothed RTT
551       has not yet caught up.

553     The RECOMMENDED time threshold (kTimeThreshold), expressed as a
554     round-trip time multiplier, is 9/8. The RECOMMENDED value of the
555     timer granularity (kGranularity) is 1ms.

557     Implementations MAY experiment with absolute thresholds, thresholds
558     from previous connections, adaptive thresholds, or including RTT
559     variation. Smaller thresholds reduce reordering resilience and
560     increase spurious retransmissions, and larger thresholds increase
561     loss detection delay.

563  5.2. Probe Timeout

565     A Probe Timeout (PTO) triggers sending one or two probe datagrams
566     when ack-eliciting packets are not acknowledged within the expected
567     period of time or the server may not have validated the client's
568     address. A PTO enables a connection to recover from loss of tail
569     packets or acknowledgements.

571     A PTO timer expiration event does not indicate packet loss and MUST
572     NOT cause prior unacknowledged packets to be marked as lost. When an
573     acknowledgement is received that newly acknowledges packets, loss
574     detection proceeds as dictated by packet and time threshold
575     mechanisms; see Section 5.1.

577     As with loss detection, the probe timeout is per packet number space.
578 The PTO algorithm used in QUIC implements the reliability functions 579 of Tail Loss Probe [RACK], RTO [RFC5681], and F-RTO algorithms for 580 TCP [RFC5682]. The timeout computation is based on TCP's 581 retransmission timeout period [RFC6298]. 583 5.2.1. Computing PTO 585 When an ack-eliciting packet is transmitted, the sender schedules a 586 timer for the PTO period as follows: 588 PTO = smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay 590 The PTO period is the amount of time that a sender ought to wait for 591 an acknowledgement of a sent packet. This time period includes the 592 estimated network roundtrip-time (smoothed_rtt), the variation in the 593 estimate (4*rttvar), and max_ack_delay, to account for the maximum 594 time by which a receiver might delay sending an acknowledgement. 595 When the PTO is armed for Initial or Handshake packet number spaces, 596 the max_ack_delay is 0, as specified in 13.2.1 of [QUIC-TRANSPORT]. 598 The PTO value MUST be set to at least kGranularity, to avoid the 599 timer expiring immediately. 601 A sender recomputes and may need to reset its PTO timer every time an 602 ack-eliciting packet is sent or acknowledged, when the handshake is 603 confirmed, or when Initial or Handshake keys are discarded. This 604 ensures the PTO is always set based on the latest RTT information and 605 for the last sent packet in the correct packet number space. 607 When ack-eliciting packets in multiple packet number spaces are in 608 flight, the timer MUST be set for the packet number space with the 609 earliest timeout, with one exception. The ApplicationData packet 610 number space (Section 4.1.1 of [QUIC-TLS]) MUST be ignored until the 611 handshake completes. 
Not arming the PTO for ApplicationData prevents
a client from retransmitting a 0-RTT packet on a PTO expiration
before confirming that the server is able to decrypt 0-RTT packets,
and prevents a server from sending a 1-RTT packet on a PTO expiration
before it has the keys to process an acknowledgement.

When a PTO timer expires, the PTO backoff MUST be increased,
resulting in the PTO period being set to twice its current value.
The PTO backoff factor is reset when an acknowledgement is received,
except in the following case. A server might take longer to respond
to packets during the handshake than otherwise. To protect such a
server from repeated client probes, the PTO backoff is not reset at a
client that is not yet certain that the server has finished
validating the client's address. That is, a client does not reset
the PTO backoff factor on receiving acknowledgements until it
receives a HANDSHAKE_DONE frame or an acknowledgement for one of its
Handshake or 1-RTT packets.

This exponential reduction in the sender's rate is important because
consecutive PTOs might be caused by loss of packets or
acknowledgements due to severe congestion. Even when there are
ack-eliciting packets in flight in multiple packet number spaces, the
exponential increase in probe timeout occurs across all spaces to
prevent excess load on the network. For example, a timeout in the
Initial packet number space doubles the length of the timeout in the
Handshake packet number space.

The life of a connection that is experiencing consecutive PTOs is
limited by the endpoint's idle timeout.

The probe timer MUST NOT be set if the time threshold (Section 5.1.2)
loss detection timer is set. The time threshold loss detection timer
is expected to both expire earlier than the PTO and be less likely to
spuriously retransmit data.

5.2.2.  Handshakes and New Paths

Resumed connections over the same network MAY use the previous
connection's final smoothed RTT value as the resumed connection's
initial RTT. When no previous RTT is available, the initial RTT
SHOULD be set to 333ms, resulting in a 1 second initial timeout, as
recommended in [RFC6298].

A connection MAY use the delay between sending a PATH_CHALLENGE and
receiving a PATH_RESPONSE to set the initial RTT (see kInitialRtt in
Appendix A.2) for a new path, but the delay SHOULD NOT be considered
an RTT sample.

Prior to handshake completion, when few or no RTT samples have been
generated, it is possible that the probe timer expiration is due to
an incorrect RTT estimate at the client. To allow the client to
improve its RTT estimate, the new packet that it sends MUST be
ack-eliciting.

Initial packets and Handshake packets might never be acknowledged,
but they are removed from bytes in flight when the Initial and
Handshake keys are discarded, as described below in Section 5.4.
When Initial or Handshake keys are discarded, the PTO and loss
detection timers MUST be reset, because discarding keys indicates
forward progress and the loss detection timer might have been set
for a now discarded packet number space.

5.2.2.1.  Before Address Validation

Until the server has validated the client's address on the path, the
amount of data it can send is limited to three times the amount of
data received, as specified in Section 8.1 of [QUIC-TRANSPORT]. If
no additional data can be sent, the server's PTO timer MUST NOT be
armed until datagrams have been received from the client, because
packets sent on PTO count against the anti-amplification limit. Note
that the server could fail to validate the client's address even if
0-RTT is accepted.
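The interaction between the anti-amplification limit and the server's
PTO timer can be sketched as follows. This is an illustrative sketch
only, not part of the specification: the class ServerPathState and
its method names are hypothetical, and a real implementation would
fold this bookkeeping into its own connection state.

```python
from dataclasses import dataclass

# Factor of three, taken from the limit described in the text above.
AMPLIFICATION_FACTOR = 3


@dataclass
class ServerPathState:
    """Hypothetical per-path bookkeeping for a server (illustrative names)."""

    bytes_received: int = 0
    bytes_sent: int = 0
    address_validated: bool = False

    def send_budget(self) -> float:
        """Bytes the server may still send on this path."""
        if self.address_validated:
            return float("inf")
        allowed = AMPLIFICATION_FACTOR * self.bytes_received
        return max(0, allowed - self.bytes_sent)

    def may_arm_pto(self) -> bool:
        """Arming the PTO is pointless if no probe could be sent."""
        return self.send_budget() > 0

    def on_datagram_received(self, size: int) -> None:
        self.bytes_received += size

    def on_datagram_sent(self, size: int) -> None:
        self.bytes_sent += size


# Usage: one 1200-byte datagram from the client permits 3600 bytes.
path = ServerPathState()
path.on_datagram_received(1200)
path.on_datagram_sent(3600)
blocked = not path.may_arm_pto()    # budget exhausted, PTO stays unarmed
path.on_datagram_received(1200)     # a further client datagram unblocks
unblocked = path.may_arm_pto()
```

In this sketch, receiving any datagram raises the budget, which
mirrors why the client has to keep sending until the server's address
validation has finished.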
684 Since the server could be blocked until more packets are received 685 from the client, it is the client's responsibility to send packets to 686 unblock the server until it is certain that the server has finished 687 its address validation (see Section 8 of [QUIC-TRANSPORT]). That is, 688 the client MUST set the probe timer if the client has not received an 689 acknowledgement for one of its Handshake or 1-RTT packets, and has 690 not received a HANDSHAKE_DONE frame. If Handshake keys are available 691 to the client, it MUST send a Handshake packet, and otherwise it MUST 692 send an Initial packet in a UDP datagram of at least 1200 bytes. 694 A client could have received and acknowledged a Handshake packet, 695 causing it to discard state for the Initial packet number space, but 696 not sent any ack-eliciting Handshake packets. In this case, the PTO 697 is set from the current time. 699 5.2.3. Speeding Up Handshake Completion 701 When a server receives an Initial packet containing duplicate CRYPTO 702 data, it can assume the client did not receive all of the server's 703 CRYPTO data sent in Initial packets, or the client's estimated RTT is 704 too small. When a client receives Handshake or 1-RTT packets prior 705 to obtaining Handshake keys, it may assume some or all of the 706 server's Initial packets were lost. 708 To speed up handshake completion under these conditions, an endpoint 709 MAY send a packet containing unacknowledged CRYPTO data earlier than 710 the PTO expiry, subject to address validation limits; see Section 8.1 711 of [QUIC-TRANSPORT]. 713 Peers can also use coalesced packets to ensure that each datagram 714 elicits at least one acknowledgement. For example, clients can 715 coalesce an Initial packet containing PING and PADDING frames with a 716 0-RTT data packet and a server can coalesce an Initial packet 717 containing a PING frame with one or more packets in its first flight. 719 5.2.4. 
Sending Probe Packets 721 When a PTO timer expires, a sender MUST send at least one ack- 722 eliciting packet in the packet number space as a probe, unless there 723 is no data available to send. An endpoint MAY send up to two full- 724 sized datagrams containing ack-eliciting packets, to avoid an 725 expensive consecutive PTO expiration due to a single lost datagram or 726 transmit data from multiple packet number spaces. All probe packets 727 sent on a PTO MUST be ack-eliciting. 729 In addition to sending data in the packet number space for which the 730 timer expired, the sender SHOULD send ack-eliciting packets from 731 other packet number spaces with in-flight data, coalescing packets if 732 possible. This is particularly valuable when the server has both 733 Initial and Handshake data in-flight or the client has both Handshake 734 and ApplicationData in-flight, because the peer might only have 735 receive keys for one of the two packet number spaces. 737 If the sender wants to elicit a faster acknowledgement on PTO, it can 738 skip a packet number to eliminate the ack delay. 740 When the PTO timer expires, and there is new or previously sent 741 unacknowledged data, it MUST be sent. A probe packet SHOULD carry 742 new data when possible. A probe packet MAY carry retransmitted 743 unacknowledged data when new data is unavailable, when flow control 744 does not permit new data to be sent, or to opportunistically reduce 745 loss recovery delay. Implementations MAY use alternative strategies 746 for determining the content of probe packets, including sending new 747 or retransmitted data based on the application's priorities. 749 It is possible the sender has no new or previously-sent data to send. 750 As an example, consider the following sequence of events: new 751 application data is sent in a STREAM frame, deemed lost, then 752 retransmitted in a new packet, and then the original transmission is 753 acknowledged. 
When there is no data to send, the sender SHOULD send 754 a PING or other ack-eliciting frame in a single packet, re-arming the 755 PTO timer. 757 Alternatively, instead of sending an ack-eliciting packet, the sender 758 MAY mark any packets still in flight as lost. Doing so avoids 759 sending an additional packet, but increases the risk that loss is 760 declared too aggressively, resulting in an unnecessary rate reduction 761 by the congestion controller. 763 Consecutive PTO periods increase exponentially, and as a result, 764 connection recovery latency increases exponentially as packets 765 continue to be dropped in the network. Sending two packets on PTO 766 expiration increases resilience to packet drops, thus reducing the 767 probability of consecutive PTO events. 769 When the PTO timer expires multiple times and new data cannot be 770 sent, implementations must choose between sending the same payload 771 every time or sending different payloads. Sending the same payload 772 may be simpler and ensures the highest priority frames arrive first. 773 Sending different payloads each time reduces the chances of spurious 774 retransmission. 776 5.3. Handling Retry Packets 778 A Retry packet causes a client to send another Initial packet, 779 effectively restarting the connection process. A Retry packet 780 indicates that the Initial was received, but not processed. A Retry 781 packet cannot be treated as an acknowledgment, because it does not 782 indicate that a packet was processed or specify the packet number. 784 Clients that receive a Retry packet reset congestion control and loss 785 recovery state, including resetting any pending timers. Other 786 connection state, in particular cryptographic handshake messages, is 787 retained; see Section 17.2.5 of [QUIC-TRANSPORT]. 789 The client MAY compute an RTT estimate to the server as the time 790 period from when the first Initial was sent to when a Retry or a 791 Version Negotiation packet is received. 
The client MAY use this 792 value in place of its default for the initial RTT estimate. 794 5.4. Discarding Keys and Packet State 796 When packet protection keys are discarded (see Section 4.10 of 797 [QUIC-TLS]), all packets that were sent with those keys can no longer 798 be acknowledged because their acknowledgements cannot be processed 799 anymore. The sender MUST discard all recovery state associated with 800 those packets and MUST remove them from the count of bytes in flight. 802 Endpoints stop sending and receiving Initial packets once they start 803 exchanging Handshake packets; see Section 17.2.2.1 of 804 [QUIC-TRANSPORT]. At this point, recovery state for all in-flight 805 Initial packets is discarded. 807 When 0-RTT is rejected, recovery state for all in-flight 0-RTT 808 packets is discarded. 810 If a server accepts 0-RTT, but does not buffer 0-RTT packets that 811 arrive before Initial packets, early 0-RTT packets will be declared 812 lost, but that is expected to be infrequent. 814 It is expected that keys are discarded after packets encrypted with 815 them would be acknowledged or declared lost. Initial secrets however 816 might be destroyed sooner, as soon as handshake keys are available; 817 see Section 4.11.1 of [QUIC-TLS]. 819 6. Congestion Control 821 This document specifies a congestion controller for QUIC similar to 822 TCP NewReno [RFC6582]. 824 The signals QUIC provides for congestion control are generic and are 825 designed to support different algorithms. Endpoints can unilaterally 826 choose a different algorithm to use, such as Cubic [RFC8312]. 828 If an endpoint uses a different controller than that specified in 829 this document, the chosen controller MUST conform to the congestion 830 control guidelines specified in Section 3.1 of [RFC8085]. 832 Similar to TCP, packets containing only ACK frames do not count 833 towards bytes in flight and are not congestion controlled. 
Unlike
TCP, QUIC can detect the loss of these packets and MAY use that
information to adjust the congestion controller or the rate of
ACK-only packets being sent, but this document does not describe a
mechanism for doing so.

The algorithm in this document specifies and uses the controller's
congestion window in bytes.

An endpoint MUST NOT send a packet if it would cause bytes_in_flight
(see Appendix B.2) to be larger than the congestion window, unless
the packet is sent on a PTO timer expiration; see Section 5.2.

6.1.  Explicit Congestion Notification

If a path has been verified to support ECN [RFC3168] [RFC8311], QUIC
treats a Congestion Experienced (CE) codepoint in the IP header as a
signal of congestion. This document specifies an endpoint's response
when its peer receives packets with the ECN-CE codepoint.

6.2.  Initial and Minimum Congestion Window

QUIC begins every connection in slow start with the congestion window
set to an initial value. Endpoints SHOULD use an initial congestion
window of 10 times the maximum datagram size (max_datagram_size),
limited to the larger of 14720 or twice the maximum datagram size.
This follows the analysis and recommendations in [RFC6928],
increasing the byte limit to account for the smaller 8 byte overhead
of UDP compared to the 20 byte overhead for TCP.

Prior to validating the client's address, the server can be further
limited by the anti-amplification limit as specified in Section 8.1
of [QUIC-TRANSPORT]. Though the anti-amplification limit can prevent
the congestion window from being fully utilized and therefore slow
down the increase in congestion window, it does not directly affect
the congestion window.

The minimum congestion window is the smallest value the congestion
window can decrease to as a response to loss, ECN-CE, or persistent
congestion. The RECOMMENDED value is 2 * max_datagram_size.
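The initial and minimum window recommendations above can be worked
through with a short sketch. The function names are hypothetical and
chosen only for illustration; the arithmetic follows the sentences in
this section.

```python
def initial_congestion_window(max_datagram_size: int) -> int:
    """10 * max_datagram_size, limited to the larger of 14720 bytes
    or twice the maximum datagram size."""
    limit = max(14720, 2 * max_datagram_size)
    return min(10 * max_datagram_size, limit)


def minimum_congestion_window(max_datagram_size: int) -> int:
    """RECOMMENDED minimum window: 2 * max_datagram_size."""
    return 2 * max_datagram_size


# Worked values for three datagram sizes (bytes).
small = initial_congestion_window(1200)    # 10 * 1200 is below the limit
typical = initial_congestion_window(1472)  # 10 * 1472 equals the limit
jumbo = initial_congestion_window(9000)    # limit becomes 2 * 9000
```

With a 1200-byte datagram size the window is 12000 bytes; with
1472 bytes it is exactly 14720 bytes; with 9000 bytes the limit of
twice the datagram size applies and the window is 18000 bytes.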
874 6.3. Slow Start 876 While in slow start, QUIC increases the congestion window by the 877 number of bytes acknowledged when each acknowledgment is processed, 878 resulting in exponential growth of the congestion window. 880 QUIC exits slow start upon loss or upon increase in the ECN-CE 881 counter. When slow start is exited, the congestion window halves and 882 the slow start threshold is set to the new congestion window. QUIC 883 re-enters slow start any time the congestion window is less than the 884 slow start threshold, which only occurs after persistent congestion 885 is declared. 887 6.4. Congestion Avoidance 889 Slow start exits to congestion avoidance. Congestion avoidance uses 890 an Additive Increase Multiplicative Decrease (AIMD) approach that 891 increases the congestion window by one maximum packet size per 892 congestion window acknowledged. When a loss or ECN-CE marking is 893 detected, NewReno halves the congestion window, sets the slow start 894 threshold to the new congestion window, and then enters the recovery 895 period. 897 6.5. Recovery Period 899 A recovery period is entered when loss or ECN-CE marking of a packet 900 is detected in congestion avoidance after the congestion window and 901 slow start threshold have been decreased. A recovery period ends 902 when a packet sent during the recovery period is acknowledged. This 903 is slightly different from TCP's definition of recovery, which ends 904 when the lost packet that started recovery is acknowledged. 906 The recovery period aims to limit congestion window reduction to once 907 per round trip. Therefore during recovery, the congestion window 908 remains unchanged irrespective of new losses or increases in the ECN- 909 CE counter. 911 When entering recovery, a single packet MAY be sent even if bytes in 912 flight now exceeds the recently reduced congestion window. 
This 913 speeds up loss recovery if the data in the lost packet is 914 retransmitted and is similar to TCP as described in Section 5 of 915 [RFC6675]. If further packets are lost while the sender is in 916 recovery, sending any packets in response MUST obey the congestion 917 window limit. 919 6.6. Ignoring Loss of Undecryptable Packets 921 During the handshake, some packet protection keys might not be 922 available when a packet arrives and the receiver can choose to drop 923 the packet. In particular, Handshake and 0-RTT packets cannot be 924 processed until the Initial packets arrive and 1-RTT packets cannot 925 be processed until the handshake completes. Endpoints MAY ignore the 926 loss of Handshake, 0-RTT, and 1-RTT packets that might have arrived 927 before the peer had packet protection keys to process those packets. 928 Endpoints MUST NOT ignore the loss of packets that were sent after 929 the earliest acknowledged packet in a given packet number space. 931 6.7. Probe Timeout 933 Probe packets MUST NOT be blocked by the congestion controller. A 934 sender MUST however count these packets as being additionally in 935 flight, since these packets add network load without establishing 936 packet loss. Note that sending probe packets might cause the 937 sender's bytes in flight to exceed the congestion window until an 938 acknowledgement is received that establishes loss or delivery of 939 packets. 941 6.8. Persistent Congestion 943 When an ACK frame is received that establishes loss of all in-flight 944 packets sent over a long enough period of time, the network is 945 considered to be experiencing persistent congestion. Commonly, this 946 can be established by consecutive PTOs, but since the PTO timer is 947 reset when a new ack-eliciting packet is sent, an explicit duration 948 must be used to account for those cases where PTOs do not occur or 949 are substantially delayed. 
The rationale for this threshold is to
enable a sender to use initial PTOs for aggressive probing, as TCP
does with Tail Loss Probe (TLP) [RACK], before establishing
persistent congestion, as TCP does with a Retransmission Timeout
(RTO) [RFC5681]. The RECOMMENDED value for
kPersistentCongestionThreshold is 3, which is approximately
equivalent to two TLPs before an RTO in TCP.

This duration is computed as follows:

   (smoothed_rtt + 4 * rttvar + max_ack_delay) *
       kPersistentCongestionThreshold

For example, assume:

   smoothed_rtt = 1
   rttvar = 0
   max_ack_delay = 0
   kPersistentCongestionThreshold = 3

If an ack-eliciting packet is sent at time t = 0, the following
scenario would illustrate persistent congestion:

   +------+------------------------+
   | Time | Action                 |
   +======+========================+
   | t=0  | Send Pkt #1 (App Data) |
   +------+------------------------+
   | t=1  | Send Pkt #2 (PTO 1)    |
   +------+------------------------+
   | t=3  | Send Pkt #3 (PTO 2)    |
   +------+------------------------+
   | t=7  | Send Pkt #4 (PTO 3)    |
   +------+------------------------+
   | t=8  | Recv ACK of Pkt #4     |
   +------+------------------------+

                Table 1

The first three packets are determined to be lost when the
acknowledgement of packet 4 is received at t = 8. The congestion
period is calculated as the time between the oldest and newest lost
packets: (3 - 0) = 3. The duration for persistent congestion is
equal to: (1 * kPersistentCongestionThreshold) = 3. Because the
threshold was reached and because none of the packets between the
oldest and the newest packets are acknowledged, the network is
considered to have experienced persistent congestion.

When persistent congestion is established, the sender's congestion
window MUST be reduced to the minimum congestion window
(kMinimumWindow). This response of collapsing the congestion window
on persistent congestion is functionally similar to a sender's
response on a Retransmission Timeout (RTO) in TCP [RFC5681] after
Tail Loss Probes (TLP) [RACK].

6.9.  Pacing

This document does not specify a pacer, but it is RECOMMENDED that a
sender pace sending of all in-flight packets based on input from the
congestion controller. For example, a pacer might distribute the
congestion window over the smoothed RTT when used with a window-based
controller, or a pacer might use the rate estimate of a rate-based
controller.

An implementation should take care to architect its congestion
controller to work well with a pacer. For instance, a pacer might
wrap the congestion controller and control the availability of the
congestion window, or a pacer might pace out packets handed to it by
the congestion controller.

Timely delivery of ACK frames is important for efficient loss
recovery. Packets containing only ACK frames SHOULD therefore not be
paced, to avoid delaying their delivery to the peer.

Endpoints can implement pacing as they choose. A perfectly paced
sender spreads packets exactly evenly over time. For a window-based
congestion controller, such as the one in this document, that rate
can be computed by averaging the congestion window over the
round-trip time. Expressed as a rate in bytes:

   rate = N * congestion_window / smoothed_rtt

Or, expressed as an inter-packet interval:

   interval = smoothed_rtt * packet_size / congestion_window / N

Using a value for "N" that is small, but at least 1 (for example,
1.25) ensures that variations in round-trip time don't result in
under-utilization of the congestion window. Values of "N" larger
than 1 ultimately result in sending packets as acknowledgments are
received rather than when timers fire, provided the congestion window
is fully utilized and acknowledgments arrive at regular intervals.

Practical considerations, such as packetization, scheduling delays,
and computational efficiency, can cause a sender to deviate from this
rate over time periods that are much shorter than a round-trip time.
Sending multiple packets into the network without any delay between
them creates a packet burst that might cause short-term congestion
and losses. Implementations MUST either use pacing or limit such
bursts to the initial congestion window; see Section 6.2.

One possible implementation strategy for pacing uses a leaky bucket
algorithm, where the capacity of the "bucket" is limited to the
maximum burst size and the rate the "bucket" fills is determined by
the above function.

6.10.  Under-utilizing the Congestion Window

When bytes in flight is smaller than the congestion window and
sending is not pacing limited, the congestion window is
under-utilized. When this occurs, the congestion window SHOULD NOT
be increased in either slow start or congestion avoidance. This can
happen due to insufficient application data or flow control limits.

A sender MAY use the pipeACK method described in Section 4.3 of
[RFC7661] to determine if the congestion window is sufficiently
utilized.

A sender that paces packets (see Section 6.9) might delay sending
packets and not fully utilize the congestion window due to this
delay. A sender SHOULD NOT consider itself application limited if it
would have fully utilized the congestion window without pacing delay.

A sender MAY implement alternative mechanisms to update its
congestion window after periods of under-utilization, such as those
proposed for TCP in [RFC7661].
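The pacing formulas in Section 6.9, together with the bucket strategy
mentioned there, can be sketched as follows. This is one possible
reading under stated assumptions, not a specified pacer: the names
pacing_rate, pacing_interval, and BucketPacer are hypothetical, times
are in seconds, and the bucket is modelled as send credit that
refills at the pacing rate and is capped at the maximum burst size.

```python
def pacing_rate(congestion_window: int, smoothed_rtt: float,
                n: float = 1.25) -> float:
    """rate = N * congestion_window / smoothed_rtt (bytes per second)."""
    return n * congestion_window / smoothed_rtt


def pacing_interval(packet_size: int, congestion_window: int,
                    smoothed_rtt: float, n: float = 1.25) -> float:
    """interval = smoothed_rtt * packet_size / congestion_window / N."""
    return smoothed_rtt * packet_size / congestion_window / n


class BucketPacer:
    """Hypothetical pacer: credit refills at `rate`, capped at `max_burst`."""

    def __init__(self, rate: float, max_burst: int) -> None:
        self.rate = rate
        self.max_burst = max_burst
        self.credit = float(max_burst)  # start with one full burst
        self.last_update = 0.0

    def try_send(self, now: float, packet_size: int) -> bool:
        """Return True and consume credit if the packet may be sent now."""
        elapsed = now - self.last_update
        self.credit = min(self.max_burst, self.credit + elapsed * self.rate)
        self.last_update = now
        if self.credit >= packet_size:
            self.credit -= packet_size
            return True
        return False


# Usage: 12000-byte window, 100 ms smoothed RTT, bursts of two packets.
rate = pacing_rate(12000, 0.1)
pacer = BucketPacer(rate, max_burst=2400)
sends = [pacer.try_send(0.0, 1200) for _ in range(3)]  # third is held back
later = pacer.try_send(0.01, 1200)                     # credit has refilled
```

Capping the credit at the maximum burst size is what keeps an idle
sender from accumulating credit and then emitting a long burst.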
1076 7. Security Considerations 1078 7.1. Congestion Signals 1080 Congestion control fundamentally involves the consumption of signals 1081 - both loss and ECN codepoints - from unauthenticated entities. On- 1082 path attackers can spoof or alter these signals. An attacker can 1083 cause endpoints to reduce their sending rate by dropping packets, or 1084 alter send rate by changing ECN codepoints. 1086 7.2. Traffic Analysis 1088 Packets that carry only ACK frames can be heuristically identified by 1089 observing packet size. Acknowledgement patterns may expose 1090 information about link characteristics or application behavior. 1091 Endpoints can use PADDING frames or bundle acknowledgments with other 1092 frames to reduce leaked information. 1094 7.3. Misreporting ECN Markings 1096 A receiver can misreport ECN markings to alter the congestion 1097 response of a sender. Suppressing reports of ECN-CE markings could 1098 cause a sender to increase their send rate. This increase could 1099 result in congestion and loss. 1101 A sender MAY attempt to detect suppression of reports by marking 1102 occasional packets that they send with ECN-CE. If a packet sent with 1103 ECN-CE is not reported as having been CE marked when the packet is 1104 acknowledged, then the sender SHOULD disable ECN for that path. 1106 Reporting additional ECN-CE markings will cause a sender to reduce 1107 their sending rate, which is similar in effect to advertising reduced 1108 connection flow control limits and so no advantage is gained by doing 1109 so. 1111 Endpoints choose the congestion controller that they use. Though 1112 congestion controllers generally treat reports of ECN-CE markings as 1113 equivalent to loss [RFC8311], the exact response for each controller 1114 could be different. Failure to correctly respond to information 1115 about ECN markings is therefore difficult to detect. 1117 8. IANA Considerations 1119 This document has no IANA actions. 1121 9. References 1123 9.1. 
Normative References 1125 [QUIC-TLS] Thomson, M., Ed. and S. Turner, Ed., "Using TLS to Secure 1126 QUIC", Work in Progress, Internet-Draft, draft-ietf-quic- 1127 tls-28, 20 May 2020, 1128 . 1130 [QUIC-TRANSPORT] 1131 Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based 1132 Multiplexed and Secure Transport", Work in Progress, 1133 Internet-Draft, draft-ietf-quic-transport-28, 20 May 2020, 1134 . 1137 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1138 Requirement Levels", BCP 14, RFC 2119, 1139 DOI 10.17487/RFC2119, March 1997, 1140 . 1142 [RFC8085] Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage 1143 Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085, 1144 March 2017, . 1146 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 1147 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 1148 May 2017, . 1150 9.2. Informative References 1152 [FACK] Mathis, M. and J. Mahdavi, "Forward Acknowledgement: 1153 Refining TCP Congestion Control", ACM SIGCOMM , August 1154 1996. 1156 [RACK] Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "RACK: 1157 a time-based fast loss detection algorithm for TCP", Work 1158 in Progress, Internet-Draft, draft-ietf-tcpm-rack-08, 9 1159 March 2020, . 1162 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 1163 of Explicit Congestion Notification (ECN) to IP", 1164 RFC 3168, DOI 10.17487/RFC3168, September 2001, 1165 . 1167 [RFC4653] Bhandarkar, S., Reddy, A. L. N., Allman, M., and E. 1168 Blanton, "Improving the Robustness of TCP to Non- 1169 Congestion Events", RFC 4653, DOI 10.17487/RFC4653, August 1170 2006, . 1172 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1173 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 1174 . 1176 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. 
Hata, 1177 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 1178 Spurious Retransmission Timeouts with TCP", RFC 5682, 1179 DOI 10.17487/RFC5682, September 2009, 1180 . 1182 [RFC5827] Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and 1183 P. Hurtig, "Early Retransmit for TCP and Stream Control 1184 Transmission Protocol (SCTP)", RFC 5827, 1185 DOI 10.17487/RFC5827, May 2010, 1186 . 1188 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1189 "Computing TCP's Retransmission Timer", RFC 6298, 1190 DOI 10.17487/RFC6298, June 2011, 1191 . 1193 [RFC6582] Henderson, T., Floyd, S., Gurtov, A., and Y. Nishida, "The 1194 NewReno Modification to TCP's Fast Recovery Algorithm", 1195 RFC 6582, DOI 10.17487/RFC6582, April 2012, 1196 . 1198 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1199 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1200 Based on Selective Acknowledgment (SACK) for TCP", 1201 RFC 6675, DOI 10.17487/RFC6675, August 2012, 1202 . 1204 [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis, 1205 "Increasing TCP's Initial Window", RFC 6928, 1206 DOI 10.17487/RFC6928, April 2013, 1207 . 1209 [RFC7661] Fairhurst, G., Sathiaseelan, A., and R. Secchi, "Updating 1210 TCP to Support Rate-Limited Traffic", RFC 7661, 1211 DOI 10.17487/RFC7661, October 2015, 1212 . 1214 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 1215 Notification (ECN) Experimentation", RFC 8311, 1216 DOI 10.17487/RFC8311, January 2018, 1217 . 1219 [RFC8312] Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and 1220 R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", 1221 RFC 8312, DOI 10.17487/RFC8312, February 2018, 1222 . 1224 Appendix A. Loss Recovery Pseudocode 1226 We now describe an example implementation of the loss detection 1227 mechanisms described in Section 5. 1229 A.1. 
Tracking Sent Packets 1231 To correctly implement congestion control, a QUIC sender tracks every 1232 ack-eliciting packet until the packet is acknowledged or lost. It is 1233 expected that implementations will be able to access this information 1234 by packet number and crypto context and store the per-packet fields 1235 (Appendix A.1.1) for loss recovery and congestion control. 1237 After a packet is declared lost, the endpoint can track it for an 1238 amount of time comparable to the maximum expected packet reordering, 1239 such as 1 RTT. This allows for detection of spurious 1240 retransmissions. 1242 Sent packets are tracked for each packet number space, and ACK 1243 processing only applies to a single space. 1245 A.1.1. Sent Packet Fields 1247 packet_number: The packet number of the sent packet. 1249 ack_eliciting: A boolean that indicates whether a packet is ack- 1250 eliciting. If true, it is expected that an acknowledgement will 1251 be received, though the peer could delay sending the ACK frame 1252 containing it by up to the MaxAckDelay. 1254 in_flight: A boolean that indicates whether the packet counts 1255 towards bytes in flight. 1257 sent_bytes: The number of bytes sent in the packet, not including 1258 UDP or IP overhead, but including QUIC framing overhead. 1260 time_sent: The time the packet was sent. 1262 A.2. Constants of Interest 1264 Constants used in loss recovery are based on a combination of RFCs, 1265 papers, and common practice. 1267 kPacketThreshold: Maximum reordering in packets before packet 1268 threshold loss detection considers a packet lost. The value 1269 recommended in Section 5.1.1 is 3. 1271 kTimeThreshold: Maximum reordering in time before time threshold 1272 loss detection considers a packet lost. Specified as an RTT 1273 multiplier. The value recommended in Section 5.1.2 is 9/8. 1275 kGranularity: Timer granularity. This is a system-dependent value, 1276 and Section 5.1.2 recommends a value of 1ms. 
1278 kInitialRtt: The RTT used before an RTT sample is taken. The value 1279 recommended in Section 5.2.2 is 500ms. 1281 kPacketNumberSpace: An enum to enumerate the three packet number 1282 spaces. 1284 enum kPacketNumberSpace { 1285 Initial, 1286 Handshake, 1287 ApplicationData, 1288 } 1290 A.3. Variables of interest 1292 Variables required to implement the congestion control mechanisms are 1293 described in this section. 1295 latest_rtt: The most recent RTT measurement made when receiving an 1296 ack for a previously unacked packet. 1298 smoothed_rtt: The smoothed RTT of the connection, computed as 1299 described in Section 4.3. 1301 rttvar: The RTT variation, computed as described in Section 4.3. 1303 min_rtt: The minimum RTT seen in the connection, ignoring ack delay, 1304 as described in Section 4.2. 1306 max_ack_delay: The maximum amount of time by which the receiver 1307 intends to delay acknowledgments for packets in the 1308 ApplicationData packet number space. The actual ack_delay in a 1309 received ACK frame may be larger due to late timers, reordering, 1310 or lost ACK frames. 1312 loss_detection_timer: Multi-modal timer used for loss detection. 1314 pto_count: The number of times a PTO has been sent without receiving 1315 an ack. 1317 time_of_last_sent_ack_eliciting_packet[kPacketNumberSpace]: The time 1318 the most recent ack-eliciting packet was sent. 1320 largest_acked_packet[kPacketNumberSpace]: The largest packet number 1321 acknowledged in the packet number space so far. 1323 loss_time[kPacketNumberSpace]: The time at which the next packet in 1324 that packet number space will be considered lost based on 1325 exceeding the reordering window in time. 1327 sent_packets[kPacketNumberSpace]: An association of packet numbers 1328 in a packet number space to information about them. Described in 1329 detail above in Appendix A.1. 1331 A.4. 
Initialization

1333    At the beginning of the connection, initialize the loss detection
1334    variables as follows:

1336      loss_detection_timer.reset()
1337      pto_count = 0
1338      latest_rtt = 0
1339      smoothed_rtt = kInitialRtt
1340      rttvar = kInitialRtt / 2
1341      min_rtt = 0
1342      max_ack_delay = 0
1343      for pn_space in [ Initial, Handshake, ApplicationData ]:
1344        largest_acked_packet[pn_space] = infinite
1345        time_of_last_sent_ack_eliciting_packet[pn_space] = 0
1346        loss_time[pn_space] = 0

1348 A.5. On Sending a Packet

1350    After a packet is sent, information about the packet is stored. The
1351    parameters to OnPacketSent are described in detail above in
1352    Appendix A.1.1.

1354    Pseudocode for OnPacketSent follows:

1356     OnPacketSent(packet_number, pn_space, ack_eliciting,
1357                  in_flight, sent_bytes):
1358       sent_packets[pn_space][packet_number].packet_number =
1359                                                packet_number
1360       sent_packets[pn_space][packet_number].time_sent = now()
1361       sent_packets[pn_space][packet_number].ack_eliciting =
1362                                                ack_eliciting
1363       sent_packets[pn_space][packet_number].in_flight = in_flight
1364       if (in_flight):
1365         if (ack_eliciting):
1366           time_of_last_sent_ack_eliciting_packet[pn_space] = now()
1367         OnPacketSentCC(sent_bytes)
1368         sent_packets[pn_space][packet_number].size = sent_bytes
1369         SetLossDetectionTimer()

1371 A.6. On Receiving a Datagram

1373    When a server is blocked by anti-amplification limits, receiving a
1374    datagram unblocks it, even if none of the packets in the datagram are
1375    successfully processed. In such a case, the PTO timer will need to
1376    be re-armed.

1378    Pseudocode for OnDatagramReceived follows:

1380    OnDatagramReceived(datagram):
1381      // If this datagram unblocks the server, arm the
1382      // PTO timer to avoid deadlock.
1383      if (server was at anti-amplification limit):
1384        SetLossDetectionTimer()

1386 A.7. On Receiving an Acknowledgment

1388    When an ACK frame is received, it may newly acknowledge any number of
1389    packets.
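
   This appendix calls DetectAndRemoveAckedPackets without defining it.
   As a rough illustration only, the following Python sketch shows one
   way such a helper could work for a single packet number space. The
   SentPacket class, the representation of an ACK frame as a list of
   inclusive (smallest, largest) ranges, and the use of a dict keyed by
   packet number are all assumptions made for this example, not part of
   the specification.

```python
from dataclasses import dataclass


@dataclass
class SentPacket:
    # Hypothetical record holding the per-packet fields of Appendix A.1.1.
    packet_number: int
    time_sent: float
    ack_eliciting: bool
    in_flight: bool
    sent_bytes: int


def detect_and_remove_acked_packets(ack_ranges, sent_packets):
    """Return the newly acknowledged packets, sorted by packet number.

    ack_ranges: iterable of inclusive (smallest, largest) pairs
        (an assumed representation of the ranges in an ACK frame).
    sent_packets: dict mapping packet number to SentPacket for one
        packet number space; acknowledged entries are removed from it.
    """
    newly_acked = []
    for smallest, largest in ack_ranges:
        # Snapshot the matching keys first so the dict can be mutated.
        matching = [pn for pn in sent_packets if smallest <= pn <= largest]
        for pn in matching:
            newly_acked.append(sent_packets.pop(pn))
    newly_acked.sort(key=lambda p: p.packet_number)
    return newly_acked
```

   A packet that was already removed by an earlier ACK is simply not
   found again, so a duplicate or overlapping ACK frame yields an empty
   result and OnAckReceived returns early.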
1391    Pseudocode for OnAckReceived and UpdateRtt follows:

1393    OnAckReceived(ack, pn_space):
1394      if (largest_acked_packet[pn_space] == infinite):
1395        largest_acked_packet[pn_space] = ack.largest_acked
1396      else:
1397        largest_acked_packet[pn_space] =
1398            max(largest_acked_packet[pn_space], ack.largest_acked)

1400      // DetectAndRemoveAckedPackets finds packets that are newly
1401      // acknowledged and removes them from sent_packets.
1402      newly_acked_packets =
1403          DetectAndRemoveAckedPackets(ack, pn_space)
1404      // Nothing to do if there are no newly acked packets.
1405      if (newly_acked_packets.empty()):
1406        return

1408      // If the largest acknowledged is newly acked and
1409      // at least one ack-eliciting was newly acked, update the RTT.
1410      if (newly_acked_packets.largest().packet_number ==
1411              ack.largest_acked &&
1412          IncludesAckEliciting(newly_acked_packets)):
1413        latest_rtt =
1414          now() - newly_acked_packets.largest().time_sent
1415        ack_delay = 0
1416        if (pn_space == ApplicationData):
1417          ack_delay = ack.ack_delay
1418        UpdateRtt(ack_delay)

1420      // Process ECN information if present.
1421      if (ACK frame contains ECN information):
1422        ProcessECN(ack, pn_space)

1424      lost_packets = DetectAndRemoveLostPackets(pn_space)
1425      if (!lost_packets.empty()):
1426        OnPacketsLost(lost_packets)
1427      OnPacketsAcked(newly_acked_packets)
1428      // Reset pto_count unless the client is unsure if
1429      // the server has validated the client's address.
1430      if (PeerCompletedAddressValidation()):
1431        pto_count = 0
1432      SetLossDetectionTimer()

1434    UpdateRtt(ack_delay):
1435      if (is first RTT sample):
1436        min_rtt = latest_rtt
1437        smoothed_rtt = latest_rtt
1438        rttvar = latest_rtt / 2
1439        return

1441      // min_rtt ignores ack delay.
1442      min_rtt = min(min_rtt, latest_rtt)
1443      // Limit ack_delay by max_ack_delay
1444      ack_delay = min(ack_delay, max_ack_delay)
1445      // Adjust for ack delay if plausible.
1446      adjusted_rtt = latest_rtt
1447      if (latest_rtt > min_rtt + ack_delay):
1448        adjusted_rtt = latest_rtt - ack_delay

1450      rttvar = 3/4 * rttvar + 1/4 * abs(smoothed_rtt - adjusted_rtt)
1451      smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rtt

1453 A.8. Setting the Loss Detection Timer

1455    QUIC loss detection uses a single timer for all timeout loss
1456    detection. The duration of the timer is based on the timer's mode,
1457    which is set in the packet and timer events further below. The
1458    function SetLossDetectionTimer defined below shows how the single
1459    timer is set.

1461    This algorithm may result in the timer being set in the past,
1462    particularly if timers wake up late. Timers set in the past fire
1463    immediately.

1465    Pseudocode for SetLossDetectionTimer follows:

1467    GetEarliestTimeAndSpace(times):
1468      time = times[Initial]
1469      space = Initial
1470      for pn_space in [ Handshake, ApplicationData ]:
1471        if (times[pn_space] != 0 &&
1472            (time == 0 || times[pn_space] < time) &&
1473            // Skip ApplicationData until handshake completion.
1474            (pn_space != ApplicationData ||
1475              IsHandshakeComplete())):
1476          time = times[pn_space]
1477          space = pn_space
1478      return time, space

1480    PeerCompletedAddressValidation():
1481      // Assume clients validate the server's address implicitly.
1482      if (endpoint is server):
1483        return true
1484      // Servers complete address validation when a
1485      // protected packet is received.
1486      return has received Handshake ACK ||
1487           has received 1-RTT ACK ||
1488           has received HANDSHAKE_DONE

1490    SetLossDetectionTimer():
1491      earliest_loss_time, _ = GetEarliestTimeAndSpace(loss_time)
1492      if (earliest_loss_time != 0):
1493        // Time threshold loss detection.
1494        loss_detection_timer.update(earliest_loss_time)
1495        return

1497      if (server is at anti-amplification limit):
1498        // The server's timer is not set if nothing can be sent.
1499        loss_detection_timer.cancel()
1500        return

1502      if (no ack-eliciting packets in flight &&
1503          PeerCompletedAddressValidation()):
1504        // There are no packets to detect as lost, so no timer is set.
1505        // However, the client needs to arm the timer if the
1506        // server might be blocked by the anti-amplification limit.
1507        loss_detection_timer.cancel()
1508        return

1510      // Determine which PN space to arm PTO for.
1511      sent_time, pn_space = GetEarliestTimeAndSpace(
1512        time_of_last_sent_ack_eliciting_packet)
1513      // Don't arm PTO for ApplicationData until handshake confirmed.
1514      if (pn_space == ApplicationData &&
1515          handshake is not confirmed):
1516        loss_detection_timer.cancel()
1517        return
1518      if (sent_time == 0):
1519        assert(!PeerCompletedAddressValidation())
1520        sent_time = now()

1522      // Calculate PTO duration
1523      timeout = smoothed_rtt + max(4 * rttvar, kGranularity) +
1524        max_ack_delay
1525      timeout = timeout * (2 ^ pto_count)

1527      loss_detection_timer.update(sent_time + timeout)

1529 A.9. On Timeout

1531    When the loss detection timer expires, the timer's mode determines
1532    the action to be performed.

1534    Pseudocode for OnLossDetectionTimeout follows:

1536    OnLossDetectionTimeout():
1537      earliest_loss_time, pn_space =
1538        GetEarliestTimeAndSpace(loss_time)
1539      if (earliest_loss_time != 0):
1540        // Time threshold loss detection
1541        lost_packets = DetectAndRemoveLostPackets(pn_space)
1542        assert(!lost_packets.empty())
1543        OnPacketsLost(lost_packets)
1544        SetLossDetectionTimer()
1545        return

1547      if (bytes_in_flight > 0):
1548        // PTO. Send new data if available, else retransmit old data.
1549        // If neither is available, send a single PING frame.
1550        _, pn_space = GetEarliestTimeAndSpace(
1551          time_of_last_sent_ack_eliciting_packet)
1552        SendOneOrTwoAckElicitingPackets(pn_space)
1553      else:
1554        assert(endpoint is client without 1-RTT keys)
1555        // Client sends an anti-deadlock packet: Initial is padded
1556        // to earn more anti-amplification credit,
1557        // a Handshake packet proves address ownership.
1558        if (has Handshake keys):
1559          SendOneAckElicitingHandshakePacket()
1560        else:
1561          SendOneAckElicitingPaddedInitialPacket()

1563      pto_count++
1564      SetLossDetectionTimer()

1566 A.10. Detecting Lost Packets

1568    DetectAndRemoveLostPackets is called every time an ACK is received or
1569    the time threshold loss detection timer expires. This function
1570    operates on the sent_packets for that packet number space and returns
1571    a list of packets newly detected as lost.

1573    Pseudocode for DetectAndRemoveLostPackets follows:

1575    DetectAndRemoveLostPackets(pn_space):
1576      assert(largest_acked_packet[pn_space] != infinite)
1577      loss_time[pn_space] = 0
1578      lost_packets = {}
1579      loss_delay = kTimeThreshold * max(latest_rtt, smoothed_rtt)

1581      // Minimum time of kGranularity before packets are deemed lost.
1582      loss_delay = max(loss_delay, kGranularity)

1584      // Packets sent before this time are deemed lost.
1585      lost_send_time = now() - loss_delay

1587      foreach unacked in sent_packets[pn_space]:
1588        if (unacked.packet_number > largest_acked_packet[pn_space]):
1589          continue

1591        // Mark packet as lost, or set time when it should be marked.
1592        if (unacked.time_sent <= lost_send_time ||
1593            largest_acked_packet[pn_space] >=
1594              unacked.packet_number + kPacketThreshold):
1595          sent_packets[pn_space].remove(unacked.packet_number)
1596          if (unacked.in_flight):
1597            lost_packets.insert(unacked)
1598        else:
1599          if (loss_time[pn_space] == 0):
1600            loss_time[pn_space] = unacked.time_sent + loss_delay
1601          else:
1602            loss_time[pn_space] = min(loss_time[pn_space],
1603                                      unacked.time_sent + loss_delay)
1604      return lost_packets

1606 Appendix B. Congestion Control Pseudocode

1608    We now describe an example implementation of the congestion
1609    controller described in Section 6.

1611 B.1. Constants of interest

1613    Constants used in congestion control are based on a combination of
1614    RFCs, papers, and common practice.

1616    kInitialWindow: Default limit on the initial bytes in flight as
1617       described in Section 6.2.

1619    kMinimumWindow: Minimum congestion window in bytes as described in
1620       Section 6.2.

1622    kLossReductionFactor: Reduction in congestion window when a new loss
1623       event is detected. Section 6 recommends a value of
1624       0.5.

1626    kPersistentCongestionThreshold: Period of time for persistent
1627       congestion to be established, specified as a PTO multiplier.
1628       Section 6.8 recommends a value of 3.

1630 B.2. Variables of interest

1632    Variables required to implement the congestion control mechanisms are
1633    described in this section.

1635    max_datagram_size: The sender's current maximum payload size. Does
1636       not include UDP or IP overhead. The max datagram size is used for
1637       congestion window computations. An endpoint sets the value of
1638       this variable based on its PMTU (see Section 14.1 of
1639       [QUIC-TRANSPORT]), with a minimum value of 1200 bytes.

1641    ecn_ce_counters[kPacketNumberSpace]: The highest value reported for
1642       the ECN-CE counter in the packet number space by the peer in an
1643       ACK frame.
      This value is used to detect increases in the reported
1644       ECN-CE counter.

1646    bytes_in_flight: The sum of the size in bytes of all sent packets
1647       that contain at least one ack-eliciting or PADDING frame, and have
1648       not been acked or declared lost. The size does not include IP or
1649       UDP overhead, but does include the QUIC header and AEAD overhead.
1650       Packets only containing ACK frames do not count towards
1651       bytes_in_flight to ensure congestion control does not impede
1652       congestion feedback.

1654    congestion_window: Maximum number of bytes-in-flight that may be
1655       sent.

1657    congestion_recovery_start_time: The time when QUIC first detects
1658       congestion due to loss or ECN, causing it to enter congestion
1659       recovery. When a packet sent after this time is acknowledged,
1660       QUIC exits congestion recovery.

1662    ssthresh: Slow start threshold in bytes. When the congestion window
1663       is below ssthresh, the mode is slow start and the window grows by
1664       the number of bytes acknowledged.

1666 B.3. Initialization

1668    At the beginning of the connection, initialize the congestion control
1669    variables as follows:

1671      congestion_window = kInitialWindow
1672      bytes_in_flight = 0
1673      congestion_recovery_start_time = 0
1674      ssthresh = infinite
1675      for pn_space in [ Initial, Handshake, ApplicationData ]:
1676        ecn_ce_counters[pn_space] = 0

1678 B.4. On Packet Sent

1680    Whenever a packet is sent, and it contains non-ACK frames, the packet
1681    increases bytes_in_flight.

1683      OnPacketSentCC(bytes_sent):
1684        bytes_in_flight += bytes_sent

1686 B.5. On Packet Acknowledgement

1688    Invoked from loss detection's OnAckReceived and supplied with the
1689    newly acknowledged packets (acked_packets) from sent_packets.

1691      InCongestionRecovery(sent_time):
1692        return sent_time <= congestion_recovery_start_time

1694      OnPacketsAcked(acked_packets):
1695        for (packet in acked_packets):
1696          // Remove from bytes_in_flight.
1697          bytes_in_flight -= packet.size
1698          if (InCongestionRecovery(packet.time_sent)):
1699            // Do not increase congestion window in recovery period.
1700            continue
1701          if (IsAppOrFlowControlLimited()):
1702            // Do not increase congestion_window if application
1703            // limited or flow control limited.
1704            continue
1705          if (congestion_window < ssthresh):
1706            // Slow start.
1707            congestion_window += packet.size
1708            continue
1709          // Congestion avoidance.
1710          congestion_window += max_datagram_size * packet.size
1711              / congestion_window

1713 B.6. On New Congestion Event

1715    Invoked from ProcessECN and OnPacketsLost when a new congestion event
1716    is detected. May start a new recovery period and reduce the
1717    congestion window.

1719      CongestionEvent(sent_time):
1720        // Start a new congestion event if packet was sent after the
1721        // start of the previous congestion recovery period.
1722        if (!InCongestionRecovery(sent_time)):
1723          congestion_recovery_start_time = now()
1724          congestion_window *= kLossReductionFactor
1725          congestion_window = max(congestion_window, kMinimumWindow)
1726          ssthresh = congestion_window
1727          // A packet can be sent to speed up loss recovery.
1728          MaybeSendOnePacket()

1730 B.7. Process ECN Information

1732    Invoked when an ACK frame with an ECN section is received from the
1733    peer.

1735      ProcessECN(ack, pn_space):
1736        // If the ECN-CE counter reported by the peer has increased,
1737        // this could be a new congestion event.
1738        if (ack.ce_counter > ecn_ce_counters[pn_space]):
1739          ecn_ce_counters[pn_space] = ack.ce_counter
1740          CongestionEvent(sent_packets[ack.largest_acked].time_sent)

1742 B.8. On Packets Lost

1744    Invoked when DetectAndRemoveLostPackets deems packets lost.
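
   As a worked illustration of the persistent congestion period used in
   this section, the following Python sketch evaluates the same
   arithmetic with times expressed in integer milliseconds. The function
   names and the choice of milliseconds are assumptions made for this
   example; the constants use the values recommended in Appendix A.2 and
   Appendix B.1.

```python
K_GRANULARITY_MS = 1                   # kGranularity (1ms), in milliseconds
K_PERSISTENT_CONGESTION_THRESHOLD = 3  # recommended PTO multiplier


def pto_ms(smoothed_rtt, rttvar, max_ack_delay):
    """PTO period in milliseconds, without any pto_count backoff."""
    return smoothed_rtt + max(4 * rttvar, K_GRANULARITY_MS) + max_ack_delay


def congestion_period_ms(smoothed_rtt, rttvar, max_ack_delay):
    """Duration over which every packet must be lost before the
    sender declares persistent congestion."""
    pto = pto_ms(smoothed_rtt, rttvar, max_ack_delay)
    return pto * K_PERSISTENT_CONGESTION_THRESHOLD


# smoothed_rtt = 100ms, rttvar = 25ms, max_ack_delay = 25ms:
# PTO = 100 + max(100, 1) + 25 = 225ms, so the period is 675ms.
print(congestion_period_ms(100, 25, 25))  # prints 675
```

   With a perfectly stable RTT (rttvar of 0), the kGranularity floor
   applies instead of 4 * rttvar, giving (100 + 1 + 25) * 3 = 378ms for
   the same smoothed_rtt and max_ack_delay.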
1746      InPersistentCongestion(lost_packets):
1747        pto = smoothed_rtt + max(4 * rttvar, kGranularity) +
1748          max_ack_delay
1749        congestion_period = pto * kPersistentCongestionThreshold
1750        // Determine if all packets in the time period before the
1751        // largest newly lost packet, including the edges, are
1752        // marked lost
1753        return AreAllPacketsLost(lost_packets, congestion_period)

1755      OnPacketsLost(lost_packets):
1756        // Remove lost packets from bytes_in_flight.
1757        for (lost_packet in lost_packets):
1758          bytes_in_flight -= lost_packet.size
1759        CongestionEvent(lost_packets.largest().time_sent)

1761        // Collapse congestion window if persistent congestion
1762        if (InPersistentCongestion(lost_packets)):
1763          congestion_window = kMinimumWindow

1765 B.9. Upon dropping Initial or Handshake keys

1767    When Initial or Handshake keys are discarded, packets from the space
1768    are discarded and loss detection state is updated.

1770    Pseudocode for OnPacketNumberSpaceDiscarded follows:

1772      OnPacketNumberSpaceDiscarded(pn_space):
1773        assert(pn_space != ApplicationData)
1774        // Remove any unacknowledged packets from flight.
1775        foreach packet in sent_packets[pn_space]:
1776          if (packet.in_flight):
1777            bytes_in_flight -= packet.size
1778        sent_packets[pn_space].clear()
1779        // Reset the loss detection and PTO timer
1780        time_of_last_sent_ack_eliciting_packet[pn_space] = 0
1781        loss_time[pn_space] = 0
1782        pto_count = 0
1783        SetLossDetectionTimer()

1785 Appendix C. Change Log

1787       *RFC Editor's Note:* Please remove this section prior to
1788       publication of a final version of this document.

1790    Issue and pull request numbers are listed with a leading octothorp.

1792 C.1.
Since draft-ietf-quic-recovery-27 1794 * Added recommendations for speeding up handshake under some loss 1795 conditions (#3078, #3080) 1797 * PTO count is reset when handshake progress is made (#3272, #3415) 1799 * PTO count is not reset by a client when the server might be 1800 awaiting address validation (#3546, #3551) 1802 * Recommend repairing losses immediately after entering the recovery 1803 period (#3335, #3443) 1805 * Clarified what loss conditions can be ignored during the handshake 1806 (#3456, #3450) 1808 * Allow, but don't recommend, using RTT from previous connection to 1809 seed RTT (#3464, #3496) 1811 * Recommend use of adaptive loss detection thresholds (#3571, #3572) 1813 C.2. Since draft-ietf-quic-recovery-26 1815 No changes. 1817 C.3. Since draft-ietf-quic-recovery-25 1819 No significant changes. 1821 C.4. Since draft-ietf-quic-recovery-24 1823 * Require congestion control of some sort (#3247, #3244, #3248) 1825 * Set a minimum reordering threshold (#3256, #3240) 1827 * PTO is specific to a packet number space (#3067, #3074, #3066) 1829 C.5. Since draft-ietf-quic-recovery-23 1831 * Define under-utilizing the congestion window (#2630, #2686, #2675) 1833 * PTO MUST send data if possible (#3056, #3057) 1835 * Connection Close is not ack-eliciting (#3097, #3098) 1837 * MUST limit bursts to the initial congestion window (#3160) 1839 * Define the current max_datagram_size for congestion control 1840 (#3041, #3167) 1842 C.6. Since draft-ietf-quic-recovery-22 1844 * PTO should always send an ack-eliciting packet (#2895) 1846 * Unify the Handshake Timer with the PTO timer (#2648, #2658, #2886) 1848 * Move ACK generation text to transport draft (#1860, #2916) 1850 C.7. Since draft-ietf-quic-recovery-21 1852 * No changes 1854 C.8. 
Since draft-ietf-quic-recovery-20 1856 * Path validation can be used as initial RTT value (#2644, #2687) 1858 * max_ack_delay transport parameter defaults to 0 (#2638, #2646) 1860 * Ack Delay only measures intentional delays induced by the 1861 implementation (#2596, #2786) 1863 C.9. Since draft-ietf-quic-recovery-19 1864 * Change kPersistentThreshold from an exponent to a multiplier 1865 (#2557) 1867 * Send a PING if the PTO timer fires and there's nothing to send 1868 (#2624) 1870 * Set loss delay to at least kGranularity (#2617) 1872 * Merge application limited and sending after idle sections. Always 1873 limit burst size instead of requiring resetting CWND to initial 1874 CWND after idle (#2605) 1876 * Rewrite RTT estimation, allow RTT samples where a newly acked 1877 packet is ack-eliciting but the largest_acked is not (#2592) 1879 * Don't arm the handshake timer if there is no handshake data 1880 (#2590) 1882 * Clarify that the time threshold loss alarm takes precedence over 1883 the crypto handshake timer (#2590, #2620) 1885 * Change initial RTT to 500ms to align with RFC6298 (#2184) 1887 C.10. Since draft-ietf-quic-recovery-18 1889 * Change IW byte limit to 14720 from 14600 (#2494) 1891 * Update PTO calculation to match RFC6298 (#2480, #2489, #2490) 1893 * Improve loss detection's description of multiple packet number 1894 spaces and pseudocode (#2485, #2451, #2417) 1896 * Declare persistent congestion even if non-probe packets are sent 1897 and don't make persistent congestion more aggressive than RTO 1898 verified was (#2365, #2244) 1900 * Move pseudocode to the appendices (#2408) 1902 * What to send on multiple PTOs (#2380) 1904 C.11. 
Since draft-ietf-quic-recovery-17 1906 * After Probe Timeout discard in-flight packets or send another 1907 (#2212, #1965) 1909 * Endpoints discard initial keys as soon as handshake keys are 1910 available (#1951, #2045) 1912 * 0-RTT state is discarded when 0-RTT is rejected (#2300) 1914 * Loss detection timer is cancelled when ack-eliciting frames are in 1915 flight (#2117, #2093) 1917 * Packets are declared lost if they are in flight (#2104) 1919 * After becoming idle, either pace packets or reset the congestion 1920 controller (#2138, 2187) 1922 * Process ECN counts before marking packets lost (#2142) 1924 * Mark packets lost before resetting crypto_count and pto_count 1925 (#2208, #2209) 1927 * Congestion and loss recovery state are discarded when keys are 1928 discarded (#2327) 1930 C.12. Since draft-ietf-quic-recovery-16 1932 * Unify TLP and RTO into a single PTO; eliminate min RTO, min TLP 1933 and min crypto timeouts; eliminate timeout validation (#2114, 1934 #2166, #2168, #1017) 1936 * Redefine how congestion avoidance in terms of when the period 1937 starts (#1928, #1930) 1939 * Document what needs to be tracked for packets that are in flight 1940 (#765, #1724, #1939) 1942 * Integrate both time and packet thresholds into loss detection 1943 (#1969, #1212, #934, #1974) 1945 * Reduce congestion window after idle, unless pacing is used (#2007, 1946 #2023) 1948 * Disable RTT calculation for packets that don't elicit 1949 acknowledgment (#2060, #2078) 1951 * Limit ack_delay by max_ack_delay (#2060, #2099) 1953 * Initial keys are discarded once Handshake keys are available 1954 (#1951, #2045) 1956 * Reorder ECN and loss detection in pseudocode (#2142) 1958 * Only cancel loss detection timer if ack-eliciting packets are in 1959 flight (#2093, #2117) 1961 C.13. Since draft-ietf-quic-recovery-14 1963 * Used max_ack_delay from transport params (#1796, #1782) 1965 * Merge ACK and ACK_ECN (#1783) 1967 C.14. 
Since draft-ietf-quic-recovery-13 1969 * Corrected the lack of ssthresh reduction in CongestionEvent 1970 pseudocode (#1598) 1972 * Considerations for ECN spoofing (#1426, #1626) 1974 * Clarifications for PADDING and congestion control (#837, #838, 1975 #1517, #1531, #1540) 1977 * Reduce early retransmission timer to RTT/8 (#945, #1581) 1979 * Packets are declared lost after an RTO is verified (#935, #1582) 1981 C.15. Since draft-ietf-quic-recovery-12 1983 * Changes to manage separate packet number spaces and encryption 1984 levels (#1190, #1242, #1413, #1450) 1986 * Added ECN feedback mechanisms and handling; new ACK_ECN frame 1987 (#804, #805, #1372) 1989 C.16. Since draft-ietf-quic-recovery-11 1991 No significant changes. 1993 C.17. Since draft-ietf-quic-recovery-10 1995 * Improved text on ack generation (#1139, #1159) 1997 * Make references to TCP recovery mechanisms informational (#1195) 1999 * Define time_of_last_sent_handshake_packet (#1171) 2001 * Added signal from TLS the data it includes needs to be sent in a 2002 Retry packet (#1061, #1199) 2004 * Minimum RTT (min_rtt) is initialized with an infinite value 2005 (#1169) 2007 C.18. Since draft-ietf-quic-recovery-09 2009 No significant changes. 2011 C.19. Since draft-ietf-quic-recovery-08 2013 * Clarified pacing and RTO (#967, #977) 2015 C.20. Since draft-ietf-quic-recovery-07 2017 * Include Ack Delay in RTO(and TLP) computations (#981) 2019 * Ack Delay in SRTT computation (#961) 2021 * Default RTT and Slow Start (#590) 2023 * Many editorial fixes. 2025 C.21. Since draft-ietf-quic-recovery-06 2027 No significant changes. 2029 C.22. Since draft-ietf-quic-recovery-05 2031 * Add more congestion control text (#776) 2033 C.23. Since draft-ietf-quic-recovery-04 2035 No significant changes. 2037 C.24. Since draft-ietf-quic-recovery-03 2039 No significant changes. 2041 C.25. 
Since draft-ietf-quic-recovery-02 2043 * Integrate F-RTO (#544, #409) 2045 * Add congestion control (#545, #395) 2047 * Require connection abort if a skipped packet was acknowledged 2048 (#415) 2050 * Simplify RTO calculations (#142, #417) 2052 C.26. Since draft-ietf-quic-recovery-01 2054 * Overview added to loss detection 2056 * Changes initial default RTT to 100ms 2058 * Added time-based loss detection and fixes early retransmit 2060 * Clarified loss recovery for handshake packets 2062 * Fixed references and made TCP references informative 2064 C.27. Since draft-ietf-quic-recovery-00 2066 * Improved description of constants and ACK behavior 2068 C.28. Since draft-iyengar-quic-loss-recovery-01 2070 * Adopted as base for draft-ietf-quic-recovery 2072 * Updated authors/editors list 2074 * Added table of contents 2076 Appendix D. Contributors 2078 The IETF QUIC Working Group received an enormous amount of support 2079 from many people. The following people provided substantive 2080 contributions to this document: Alessandro Ghedini, Benjamin 2081 Saunders, Gorry Fairhurst, 奥 一穂 (Kazuho Oku), Lars Eggert, Magnus 2082 Westerlund, Marten Seemann, Martin Duke, Martin Thomson, Nick Banks, 2083 Praveen Balasubramaniam. 2085 Acknowledgments 2087 Authors' Addresses 2089 Jana Iyengar (editor) 2090 Fastly 2092 Email: jri.ietf@gmail.com 2094 Ian Swett (editor) 2095 Google 2097 Email: ianswett@google.com