Internet Congestion Control Research Group                   N. Cardwell
Internet-Draft                                                   Y. Cheng
Intended status: Experimental                           S. Hassas Yeganeh
Expires: 12 May 2022                                             I. Swett
                                                              V. Jacobson
                                                                   Google
                                                          8 November 2021

                         BBR Congestion Control
             draft-cardwell-iccrg-bbr-congestion-control-01

Abstract

This document specifies the BBR congestion control algorithm. BBR ("Bottleneck Bandwidth and Round-trip propagation time") uses recent measurements of a transport connection's delivery rate, round-trip time, and packet loss rate to build an explicit model of the network path. BBR then uses this model to control both how fast it sends data and the maximum volume of data it allows in flight in the network at any time. Relative to loss-based congestion control algorithms such as Reno [RFC5681] or CUBIC [RFC8312], BBR offers substantially higher throughput for bottlenecks with shallow buffers or random losses, and substantially lower queueing delays for bottlenecks with deep buffers (avoiding "bufferbloat"). BBR can be implemented in any transport protocol that supports packet-delivery acknowledgment.
Thus far, open source implementations are available 28 for TCP [RFC793] and QUIC [RFC9000]. This document specifies version 29 2 of the BBR algorithm, also sometimes referred to as BBRv2 or bbr2. 31 Status of This Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at https://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on 12 May 2022. 48 Copyright Notice 50 Copyright (c) 2021 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 55 license-info) in effect on the date of publication of this document. 56 Please review these documents carefully, as they describe your rights 57 and restrictions with respect to this document. Code Components 58 extracted from this document must include Simplified BSD License text 59 as described in Section 4.e of the Trust Legal Provisions and are 60 provided without warranty as described in the Simplified BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 65 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5 66 2.1. Transport Connection State . . . . . . . . . . . . . . . 5 67 2.2. Per-Packet State . . . . . . . . . . . . . . . . . . . . 5 68 2.3. Per-ACK Rate Sample State . . . . . . . . . . . . . . . . 5 69 2.4. Output Control Parameters . . . . . . . . . . . . . . . . 6 70 2.5. Pacing State and Parameters . . . . . . . . . . . . . . . 6 71 2.6. cwnd State and Parameters . . . . . . . . . . . . . . . . 7 72 2.7. General Algorithm State . . . . . . . . . . . . . . . . . 7 73 2.8. Core Algorithm Design Parameters . . . . . . . . . . . . 7 74 2.9. Network Path Model Parameters . . . . . . . . . . . . . . 8 75 2.9.1. Data Rate Network Path Model Parameters . . . . . . . 8 76 2.9.2. Data Volume Network Path Model Parameters . . . . . . 8 77 2.10. State for Responding to Congestion . . . . . . . . . . . 9 78 2.11. Estimating BBR.max_bw . . . . . . . . . . . . . . . . . . 10 79 2.12. Estimating BBR.extra_acked . . . . . . . . . . . . . . . 10 80 2.13. Startup Parameters and State . . . . . . . . . . . . . . 10 81 2.14. ProbeRTT and min_rtt Parameters and State . . . . . . . . 10 82 2.14.1. Parameters for Estimating BBR.min_rtt . . . . . . . 10 83 2.14.2. Parameters for Scheduling ProbeRTT . . . . . . . . . 11 84 3. Design Overview . . . . . . . . . . . . . . . . . . . . . . . 11 85 3.1. High-Level Design Goals . . . . . . . . . . . . . . . . . 11 86 3.2. Algorithm Overview . . . . . . . . . . . . . . . . . . . 12 87 3.3. State Machine Overview . . . . . . . . . . . . . . . . . 13 88 3.4. Network Path Model Overview . . . . . . . . . . . . . . . 13 89 3.4.1. High-Level Design Goals for the Network Path Model . 13 90 3.4.2. Time Scales for the Network Model . . . . . . . . . . 14 91 3.5. Control Parameter Overview . . . . . . . . . . . . . . . 15 92 3.6. 
Environment and Usage . . . . . . . . . . . . . . . . . . 15 93 4. Detailed Algorithm . . . . . . . . . . . . . . . . . . . . . 15 94 4.1. State Machine . . . . . . . . . . . . . . . . . . . . . . 15 95 4.1.1. State Transition Diagram . . . . . . . . . . . . . . 15 96 4.1.2. State Machine Operation Overview . . . . . . . . . . 16 97 4.1.3. State Machine Tactics . . . . . . . . . . . . . . . . 17 98 4.2. Algorithm Organization . . . . . . . . . . . . . . . . . 18 99 4.2.1. Initialization . . . . . . . . . . . . . . . . . . . 18 100 4.2.2. Per-Transmit Steps . . . . . . . . . . . . . . . . . 18 101 4.2.3. Per-ACK Steps . . . . . . . . . . . . . . . . . . . . 19 102 4.2.4. Per-Loss Steps . . . . . . . . . . . . . . . . . . . 19 103 4.3. State Machine Operation . . . . . . . . . . . . . . . . . 19 104 4.3.1. Startup . . . . . . . . . . . . . . . . . . . . . . . 19 105 4.3.2. Drain . . . . . . . . . . . . . . . . . . . . . . . . 22 106 4.3.3. ProbeBW . . . . . . . . . . . . . . . . . . . . . . . 23 107 4.3.4. ProbeRTT . . . . . . . . . . . . . . . . . . . . . . 33 108 4.4. Restarting From Idle . . . . . . . . . . . . . . . . . . 38 109 4.4.1. Setting Pacing Rate in ProbeBW . . . . . . . . . . . 38 110 4.4.2. Checking for ProberRTT Completion . . . . . . . . . . 39 111 4.4.3. Logic . . . . . . . . . . . . . . . . . . . . . . . . 39 112 4.5. Updating Network Path Model Parameters . . . . . . . . . 39 113 4.5.1. Tracking Packet-Timed Round Trips and 114 BBR.round_count . . . . . . . . . . . . . . . . . . . 40 115 4.5.2. BBR.max_bw . . . . . . . . . . . . . . . . . . . . . 41 116 4.5.3. BBR.min_rtt . . . . . . . . . . . . . . . . . . . . . 43 117 4.5.4. BBR.offload_budget . . . . . . . . . . . . . . . . . 45 118 4.5.5. BBR.extra_acked . . . . . . . . . . . . . . . . . . . 46 119 4.5.6. Updating the Model Upon Packet Loss . . . . . . . . . 47 120 4.6. Updating Control Parameters . . . . . . . . . . . . . . . 51 121 4.6.1. Summary of Control Behavior in the State Machine . . 51 122 4.6.2. Pacing Rate: BBR.pacing_rate . . . . . . . . . . . . 53 123 4.6.3. Send Quantum: BBR.send_quantum . . . . . . . . . . . 54 124 4.6.4. Congestion Window . . . . . . . . . . . . . . . . . . 55 125 5. Implementation Status . . . . . . . . . . . . . . . . . . . . 60 126 6. Security Considerations . . . . . . . . . . . . . . . . . . . 62 127 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 62 128 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 62 129 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 62 130 9.1. Normative References . . . . . . . . . . . . . . . . . . 62 131 9.2. Informative References . . . . . . . . . . . . . . . . . 63 132 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 65 134 1. Introduction 136 The Internet has traditionally used loss-based congestion control 137 algorithms like Reno ([Jac88], [Jac90], [WS95] [RFC5681]) and CUBIC 138 ([HRX08], [RFC8312]). These algorithms worked well for many years 139 because they were sufficiently well-matched to the prevalent range of 140 bandwidth-delay products and degrees of buffering in Internet paths. 141 As the Internet has evolved, loss-based congestion control is 142 increasingly problematic in several important scenarios: 144 1. Shallow buffers: In shallow buffers, packet loss can happen even 145 when a link has low utilization. 
With high-speed, long-haul links employing commodity switches with shallow buffers, loss-based congestion control can cause abysmal throughput because it overreacts, multiplicatively decreasing the sending rate upon packet loss, and only slowly growing its sending rate thereafter. This can happen even if the packet loss arises from transient traffic bursts when the link is mostly idle.

2. Deep buffers: At the edge of today's Internet, loss-based congestion control can cause the problem of "bufferbloat", by repeatedly filling deep buffers in last-mile links and causing high queuing delays.

3. Dynamic traffic workloads: With buffers of any depth, dynamic mixes of newly-entering flows or flights of data from recently idle flows can cause frequent packet loss. In such scenarios loss-based congestion control can fail to maintain its fair share of bandwidth, leading to poor application performance.

In both the shallow-buffer (1.) and dynamic-traffic (3.) scenarios mentioned above it is difficult to achieve full throughput with loss-based congestion control in practice: for CUBIC, sustaining 10Gbps over 100ms RTT needs a packet loss rate below 0.000003% (i.e., more than 40 seconds between packet losses), and over a 100ms RTT path a more feasible loss rate like 1% can only sustain at most 3 Mbps [RFC8312]. These limitations apply no matter what the bottleneck link is capable of or what the connection's fair share is. Furthermore, failure to reach the fair share can cause poor throughput and poor tail latency for latency-sensitive applications.

The BBR ("Bottleneck Bandwidth and Round-trip propagation time") congestion control algorithm is a model-based algorithm that takes an approach different from loss-based congestion control: BBR uses recent measurements of a transport connection's delivery rate, round-trip time, and packet loss rate to build an explicit model of the network path, including its estimated available bandwidth, bandwidth-delay product, and the maximum volume of data that the connection can place in-flight in the network without causing excessive queue pressure. It then uses this model to guide its control behavior in seeking high throughput and low queue pressure.

This document describes the current version of the BBR algorithm, BBRv2. The previous version of the algorithm, BBRv1, was described at a high level in [CCGHJ16] [CCGHJ17]. The implications of BBR in allowing high utilization of high-speed networks with shallow buffers have been discussed in other work [MM19]. Active work on the BBR algorithm is continuing.

This document is organized as follows. Section 2 provides various definitions that will be used throughout this document. Section 3 provides an overview of the design of the BBR algorithm, and section 4 describes the BBR algorithm in detail, including BBR's network path model, control parameters, and state machine. Section 5 describes the implementation status, section 6 describes security considerations, section 7 notes that there are no IANA considerations, and section 8 closes with Acknowledgments.

2. Terminology

This document defines state variables and constants for the BBR algorithm.

The variables starting with C, P, or rs that are not defined below are defined in [draft-cheng-iccrg-delivery-rate-estimation].

2.1.
Transport Connection State 212 C.delivered: The total amount of data (tracked in octets or in 213 packets) delivered so far over the lifetime of the transport 214 connection C. 216 SMSS: The Sender Maximum Segment Size. 218 is_cwnd_limited: True if the connection has fully utilized its cwnd 219 at any point in the last packet-timed round trip. 221 InitialCwnd: The initial congestion window set by the transport 222 protocol implementation for the connection at initialization time. 224 2.2. Per-Packet State 226 packet.delivered: C.delivered when the given packet was sent by 227 transport connection C. 229 packet.departure_time: The earliest pacing departure time for the 230 given packet. 232 packet.tx_in_flight: The volume of data that was estimated to be in 233 flight at the time of the transmission of the packet. 235 2.3. Per-ACK Rate Sample State 237 rs.delivered: The volume of data delivered between the transmission 238 of the packet that has just been ACKed and the current time. 240 rs.delivery_rate: The delivery rate (aka bandwidth) sample obtained 241 from the packet that has just been ACKed. 243 rs.rtt: The RTT sample calculated based on the most recently-sent 244 segment of the segments that have just been ACKed. 246 rs.newly_acked: The volume of data cumulatively or selectively 247 acknowledged upon the ACK that was just received. (This quantity is 248 referred to as "DeliveredData" in [RFC6937].) 250 rs.newly_lost: The volume of data newly marked lost upon the ACK that 251 was just received. 253 rs.tx_in_flight: The volume of data that was estimated to be in 254 flight at the time of the transmission of the packet that has just 255 been ACKed (the most recently sent segment among segments ACKed by 256 the ACK that was just received). 258 rs.lost: The volume of data that was declared lost between the 259 transmission and acknowledgement of the packet that has just been 260 ACKed (the most recently sent segment among segments ACKed by the ACK 261 that was just received). 263 2.4. Output Control Parameters 265 cwnd: The transport sender's congestion window, which limits the 266 amount of data in flight. 268 BBR.pacing_rate: The current pacing rate for a BBR flow, which 269 controls inter-packet spacing. 271 BBR.send_quantum: The maximum size of a data aggregate scheduled and 272 transmitted together. 274 2.5. Pacing State and Parameters 276 BBR.pacing_gain: The dynamic gain factor used to scale BBR.bw to 277 produce BBR.pacing_rate. 279 BBRPacingMarginPercent: The static discount factor of 1% used to 280 scale BBR.bw to produce BBR.pacing_rate. 282 BBR.next_departure_time: The earliest pacing departure time for the 283 next packet BBR schedules for transmission. 285 2.6. cwnd State and Parameters 287 BBR.cwnd_gain: The dynamic gain factor used to scale the estimated 288 BDP to produce a congestion window (cwnd). 290 BBRStartupPacingGain: A constant specifying the minimum gain value 291 for calculating the pacing rate that will allow the sending rate to 292 double each round (4*ln(2) ~= 2.77) [BBRStartupPacingGain]; used in 293 Startup mode for BBR.pacing_gain. 295 BBRStartupCwndGain: A constant specifying the minimum gain value for 296 calculating the cwnd that will allow the sending rate to double each 297 round (2.0); used in Startup mode for BBR.cwnd_gain. 299 BBR.packet_conservation: A boolean indicating whether BBR is 300 currently using packet conservation dynamics to bound cwnd. 302 2.7. 
General Algorithm State 304 BBR.state: The current state of a BBR flow in the BBR state machine. 306 BBR.round_count: Count of packet-timed round trips elapsed so far. 308 BBR.round_start: A boolean that BBR sets to true once per packet- 309 timed round trip, on ACKs that advance BBR.round_count. 311 BBR.next_round_delivered: packet.delivered value denoting the end of 312 a packet-timed round trip. 314 BBR.idle_restart: A boolean that is true if and only if a connection 315 is restarting after being idle. 317 2.8. Core Algorithm Design Parameters 319 BBRLossThresh: The maximum tolerated per-round-trip packet loss rate 320 when probing for bandwidth (the default is 2%). 322 BBRBeta: The default multiplicative decrease to make upon each round 323 trip during which the connection detects packet loss (the value is 324 0.7). 326 BBRHeadroom: The multiplicative factor to apply to BBR.inflight_hi 327 when attempting to leave free headroom in the path (e.g. free space 328 in the bottleneck buffer or free time slots in the bottleneck link) 329 that can be used by cross traffic (the value is 0.85). 331 BBRMinPipeCwnd: The minimal cwnd value BBR targets, to allow 332 pipelining with TCP endpoints that follow an "ACK every other packet" 333 delayed-ACK policy: 4 * SMSS. 335 2.9. Network Path Model Parameters 337 2.9.1. Data Rate Network Path Model Parameters 339 The data rate model parameters together estimate both the sending 340 rate required to reach the full bandwidth available to the flow 341 (BBR.max_bw), and the maximum pacing rate control parameter that is 342 consistent with the queue pressure objective (BBR.bw). 344 BBR.max_bw: The windowed maximum recent bandwidth sample - obtained 345 using the BBR delivery rate sampling algorithm [draft-cheng-iccrg- 346 delivery-rate-estimation] - measured during the current or previous 347 bandwidth probing cycle (or during Startup, if the flow is still in 348 that state). (Part of the long-term model.) 350 BBR.bw_hi: The long-term maximum sending bandwidth that the algorithm 351 estimates will produce acceptable queue pressure, based on signals in 352 the current or previous bandwidth probing cycle, as measured by loss. 353 (Part of the long-term model.) 355 BBR.bw_lo: The short-term maximum sending bandwidth that the 356 algorithm estimates is safe for matching the current network path 357 delivery rate, based on any loss signals in the current bandwidth 358 probing cycle. This is generally lower than max_bw or bw_hi (thus 359 the name). (Part of the short-term model.) 361 BBR.bw: The maximum sending bandwidth that the algorithm estimates is 362 appropriate for matching the current network path delivery rate, 363 given all available signals in the model, at any time scale. It is 364 the min() of max_bw, bw_hi, and bw_lo. 366 2.9.2. Data Volume Network Path Model Parameters 368 The data volume model parameters together estimate both the volume of 369 in-flight data required to reach the full bandwidth available to the 370 flow (BBR.max_inflight), and the maximum volume of data that is 371 consistent with the queue pressure objective (cwnd). 373 BBR.min_rtt: The windowed minimum round-trip time sample measured 374 over the last MinRTTFilterLen = 10 seconds. 
This attempts to 375 estimate the two-way propagation delay of the network path when all 376 connections sharing a bottleneck are using BBR, but also allows BBR 377 to estimate the value required for a bdp estimate that allows full 378 throughput if there are legacy loss-based Reno or CUBIC flows sharing 379 the bottleneck. 381 BBR.bdp: The estimate of the network path's BDP (Bandwidth-Delay 382 Product), computed as: BBR.bdp = BBR.bw * BBR.min_rtt. 384 BBR.extra_acked: A volume of data that is the estimate of the recent 385 degree of aggregation in the network path. 387 BBR.offload_budget: The estimate of the minimum volume of data 388 necessary to achieve full throughput when using sender (TSO/GSO) and 389 receiver (LRO, GRO) host offload mechanisms. 391 BBR.max_inflight: The estimate of the volume of in-flight data 392 required to fully utilize the bottleneck bandwidth available to the 393 flow, based on the BDP estimate (BBR.bdp), the aggregation estimate 394 (BBR.extra_acked), the offload budget (BBR.offload_budget), and 395 BBRMinPipeCwnd. 397 BBR.inflight_hi: Analogous to BBR.bw_hi, the long-term maximum volume 398 of in-flight data that the algorithm estimates will produce 399 acceptable queue pressure, based on signals in the current or 400 previous bandwidth probing cycle, as measured by loss. That is, if a 401 flow is probing for bandwidth, and observes that sending a particular 402 volume of in-flight data causes a loss rate higher than the loss rate 403 objective, it sets inflight_hi to that volume of data. (Part of the 404 long-term model.) 406 BBR.inflight_lo: Analogous to BBR.bw_lo, the short-term maximum 407 volume of in-flight data that the algorithm estimates is safe for 408 matching the current network path delivery process, based on any loss 409 signals in the current bandwidth probing cycle. This is generally 410 lower than max_inflight or inflight_hi (thus the name). (Part of the 411 short-term model.) 413 2.10. State for Responding to Congestion 415 BBR.bw_latest: a 1-round-trip max of delivered bandwidth 416 (rs.delivery_rate). 418 BBR.inflight_latest: a 1-round-trip max of delivered volume of data 419 (rs.delivered). 421 2.11. Estimating BBR.max_bw 423 BBR.MaxBwFilter: The filter for tracking the maximum recent 424 rs.delivery_rate sample, for estimating BBR.max_bw. 426 MaxBwFilterLen: The filter window length for BBR.MaxBwFilter = 2 427 (representing up to 2 ProbeBW cycles, the current cycle and the 428 previous full cycle). 430 BBR.cycle_count: The virtual time used by the BBR.max_bw filter 431 window. Note that BBR.cycle_count only needs to be tracked with a 432 single bit, since the BBR.MaxBwFilter only needs to track samples 433 from two time slots: the previous ProbeBW cycle and the current 434 ProbeBW cycle. 436 2.12. Estimating BBR.extra_acked 438 BBR.extra_acked_interval_start: the start of the time interval for 439 estimating the excess amount of data acknowledged due to aggregation 440 effects. 442 BBR.extra_acked_delivered: the volume of data marked as delivered 443 since BBR.extra_acked_interval_start. 445 BBR.ExtraACKedFilter: the max filter tracking the recent maximum 446 degree of aggregation in the path. 448 BBRExtraAckedFilterLen = The window length of the 449 BBR.ExtraACKedFilter max filter window: 10 (in units of packet-timed 450 round trips). 452 2.13. Startup Parameters and State 454 BBR.filled_pipe: A boolean that records whether BBR estimates that it 455 has ever fully utilized its available bandwidth ("filled the pipe"). 
457 BBR.full_bw: A recent baseline BBR.max_bw to estimate if BBR has 458 "filled the pipe" in Startup. 460 BBR.full_bw_count: The number of non-app-limited round trips without 461 large increases in BBR.full_bw. 463 2.14. ProbeRTT and min_rtt Parameters and State 465 2.14.1. Parameters for Estimating BBR.min_rtt 467 BBR.min_rtt_stamp: The wall clock time at which the current 468 BBR.min_rtt sample was obtained. 470 MinRTTFilterLen: A constant specifying the length of the BBR.min_rtt 471 min filter window, MinRTTFilterLen is 10 secs. 473 2.14.2. Parameters for Scheduling ProbeRTT 475 BBRProbeRTTCwndGain = A constant specifying the gain value for 476 calculating the cwnd during ProbeRTT: 0.5 (meaning that ProbeRTT 477 attempts to reduce in-flight data to 50% of the estimated BDP). 479 ProbeRTTDuration: A constant specifying the minimum duration for 480 which ProbeRTT state holds inflight to BBRMinPipeCwnd or fewer 481 packets: 200 ms. 483 ProbeRTTInterval: A constant specifying the minimum time interval 484 between ProbeRTT states: 5 secs. 486 BBR.probe_rtt_min_delay: The minimum RTT sample recorded in the last 487 ProbeRTTInterval. 489 BBR.probe_rtt_min_stamp: The wall clock time at which the current 490 BBR.probe_rtt_min_delay sample was obtained. 492 BBR.probe_rtt_expired: A boolean recording whether the 493 BBR.probe_rtt_min_delay has expired and is due for a refresh with an 494 application idle period or a transition into ProbeRTT state. 496 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 497 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 498 document are to be interpreted as described in [RFC2119]. 500 3. Design Overview 502 3.1. High-Level Design Goals 504 The high-level goal of BBR is to achieve both: 506 1. The full throughput (or approximate fair share thereof) available 507 to a flow 509 * Achieved in a fast and scalable manner (using bandwidth in 510 O(log(BDP)) time). 512 * Achieved with average packet loss rates of up to 1%. 514 2. Low queue pressure (low queuing delay and low packet loss). 516 These goals are in tension: sending faster improves the odds of 517 achieving (1) but reduces the odds of achieving (2), while sending 518 slower improves the odds of achieving (2) but reduces the odds of 519 achieving (1). Thus the algorithm cannot maximize throughput or 520 minimize queue pressure independently, and must jointly optimize 521 both. 523 To try to achieve these goals, and seek an operating point with high 524 throughput and low delay [K79] [GK81], BBR aims to adapt its sending 525 process to match the network delivery process, in two dimensions: 527 1. data rate: the rate at which the flow sends data should ideally 528 match the rate at which the network delivers the flow's data (the 529 available bottleneck bandwidth) 531 2. data volume: the amount of unacknowledged data in flight in the 532 network should ideally match the bandwidth-delay product (BDP) of 533 the path 535 Both the control of the data rate (via the pacing rate) and data 536 volume (directly via the congestion window or cwnd; and indirectly 537 via the pacing rate) are important. A mismatch in either dimension 538 can cause the sender to fail to meet its high-level design goals: 540 1. volume mismatch: If a sender perfectly matches its sending rate 541 to the available bandwidth, but its volume of in-flight data 542 exceeds the BDP, then the sender can maintain a large standing 543 queue, increasing network latency and risking packet loss. 545 2. 
rate mismatch: If a sender's volume of in-flight data matches the 546 BDP perfectly but its sending rate exceeds the available 547 bottleneck bandwidth (e.g. the sender transmits a BDP of data in 548 an unpaced fashion, at the sender's link rate), then up to a full 549 BDP of data can burst into the bottleneck queue, causing high 550 delay and/or high loss. 552 3.2. Algorithm Overview 554 Based on the rationale above, BBR tries to spend most of its time 555 matching its sending process (data rate and data volume) to the 556 network path's delivery process. To do this, it explores the 557 2-dimensional control parameter space of (1) data rate ("bandwidth" 558 or "throughput") and (2) data volume ("in-flight data"), with a goal 559 of finding the maximum values of each control parameter that are 560 consistent with its objective for queue pressure. 562 Depending on what signals a given network path manifests at a given 563 time, the objective for queue pressure is measured in terms of the 564 most strict among: 566 * the amount of data that is estimated to be queued in the 567 bottleneck buffer (data_in_flight - estimated_BDP): the objective 568 is to maintain this amount at or below 1.5 * estimated_BDP 570 * the packet loss rate: the objective is a maximum per-round-trip 571 packet loss rate of BBRLossThresh=2% (and an average packet loss 572 rate considerably lower) 574 3.3. State Machine Overview 576 BBR varies its control parameters with a simple state machine that 577 aims for high throughput, low latency, and an approximately fair 578 sharing of bandwidth, while maintaining an up-to-date model of the 579 network path. 581 A BBR flow starts in the Startup state, and ramps up its sending rate 582 quickly, to rapidly estimate the maximum available bandwidth 583 (BBR.max_bw). When it estimates the bottleneck bandwidth has been 584 fully utilized, it enters the Drain state to drain the estimated 585 queue. In steady state a BBR flow mostly uses the ProbeBW states, to 586 periodically briefly send faster to probe for higher capacity and 587 then briefly send slower to try to drain any resulting queue. If 588 needed, it briefly enters the ProbeRTT state, to lower the sending 589 rate to probe for lower BBR.min_rtt samples. The detailed behavior 590 for each state is described below. 592 3.4. Network Path Model Overview 594 3.4.1. High-Level Design Goals for the Network Path Model 596 At a high level, the BBR model is trying to reflect two aspects of 597 the network path: 599 * Model what's required for achieving full throughput: Estimate the 600 minimum data rate and data volume required to fully utilize the 601 fair share of the bottleneck bandwidth available to the flow. 602 This must incorporate estimates of the maximum available bandwidth 603 (BBR.max_bw), the BDP of the path (BBR.bdp), and the requirements 604 of any offload features on the end hosts or mechanisms on the 605 network path that produce aggregation effects (summing up to 606 BBR.max_inflight). 608 * Model what's permitted for achieving low queue pressure: Estimate 609 the maximum data rate (BBR.bw) and data volume (cwnd) consistent 610 with the queue pressure objective, as measured by the estimated 611 degree of queuing and packet loss. 
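As a rough, non-normative sketch of how these two aspects of the model fit together (using the parameters defined in Section 2.9): the rate side reduces to taking the minimum of the long-term and short-term rate bounds, and the volume side to combining the BDP, aggregation, offload, and minimum-cwnd estimates. BBRBoundBWForModel() below is the step named in Section 4.2.3; the second helper name is purely illustrative, and the exact decomposition is an implementation detail:

   /* Illustrative sketch: combine the model's rate bounds into the
    * single sending rate bound used for control (Section 2.9.1). */
   BBRBoundBWForModel():
     BBR.bw = min(BBR.max_bw, BBR.bw_hi, BBR.bw_lo)

   /* Illustrative sketch (hypothetical helper name): combine the
    * data volume estimates (Section 2.9.2) into the volume of
    * in-flight data needed for full throughput, before the cwnd
    * bounds that enforce the queue pressure objective are applied. */
   BBREstimateMaxInflight():
     BBR.bdp = BBR.bw * BBR.min_rtt
     estimate = BBR.bdp + BBR.extra_acked   /* allow for aggregation */
     estimate = max(estimate, BBR.offload_budget)
     BBR.max_inflight = max(estimate, BBRMinPipeCwnd)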
Note that those two aspects are in tension: the highest throughput is available to the flow when it sends as fast as possible and occupies as many bottleneck buffer slots as possible; the lowest queue pressure is achieved by the flow when it sends as slowly as possible and occupies as few bottleneck buffer slots as possible. To resolve the tension, the algorithm aims to achieve the maximum throughput achievable while still meeting the queue pressure objective.

3.4.2. Time Scales for the Network Model

At a high level, the BBR model is trying to reflect the properties of the network path on two different time scales:

3.4.2.1. Long-term model

One goal is for BBR to maintain high average utilization of the fair share of the available bandwidth, over long time intervals. This requires estimates of the path's data rate and volume capacities that are robust over long time intervals. This means being robust to congestion signals that may be noisy or may reflect short-term congestion that has already abated by the time an ACK arrives. This also means providing a robust history of the best recently-achievable performance on the path so that the flow can quickly and robustly aim to re-probe that level of performance whenever it decides to probe the capacity of the path.

3.4.2.2. Short-term model

A second goal of BBR is to react to every congestion signal, including loss, as if it may indicate a persistent/long-term increase in congestion and/or decrease in the bandwidth available to the flow, because that may indeed be the case.

3.4.2.3. Time Scale Strategy

BBR sequentially alternates between spending most of its time using its short-term model to conservatively respect all congestion signals in case they represent persistent congestion, and periodically using its long-term model to robustly probe the limits of the available path capacity in case the congestion has abated and more capacity is available.

3.5. Control Parameter Overview

BBR uses its model to control the connection's sending behavior. Rather than using a single control parameter, like the cwnd parameter that limits the volume of in-flight data in the Reno and CUBIC congestion control algorithms, BBR uses three distinct control parameters:

1. pacing rate: the maximum rate at which BBR sends data.

2. send quantum: the maximum size of any aggregate that the transport sender implementation may need to transmit as a unit to amortize per-packet transmission overheads.

3. cwnd: the maximum volume of data BBR allows in-flight in the network at any time.

3.6. Environment and Usage

BBR is a congestion control algorithm that is agnostic to transport-layer and link-layer technologies, requires only sender-side changes, and does not require changes in the network. Open source implementations of BBR are available for the TCP [RFC793] and QUIC [RFC9000] transport protocols, and these implementations have been used in production for a large volume of Internet traffic.

4. Detailed Algorithm

4.1. State Machine

BBR implements a state machine that uses the network path model to guide its decisions, and the control parameters to enact its decisions.

4.1.1.
State Transition Diagram

The following state transition diagram summarizes the flow of control and the relationship between the different states:

                  |
                  V
         +---> Startup ------------+
         |        |                |
         |        V                |
         |      Drain -------------+
         |        |                |
         |        V                |
         +---> ProbeBW_DOWN -------+
         | ^      |                |
         | |      V                |
         | |    ProbeBW_CRUISE ----+
         | |      |                |
         | |      V                |
         | |    ProbeBW_REFILL ----+
         | |      |                |
         | |      V                |
         | |    ProbeBW_UP --------+
         | |      |                |
         | +------+                |
         |                         |
         +---- ProbeRTT <----------+

4.1.2. State Machine Operation Overview

When starting up, BBR probes to try to quickly build a model of the network path; to adapt to later changes to the path or its traffic, BBR must continue to probe to update its model. If the available bottleneck bandwidth increases, BBR must send faster to discover this. Likewise, if the round-trip propagation delay changes, this changes the BDP, and thus BBR must send slower to get inflight below the new BDP in order to measure the new BBR.min_rtt. Thus, BBR's state machine runs periodic, sequential experiments, sending faster to check for BBR.bw increases or sending slower to yield bandwidth, drain the queue, and check for BBR.min_rtt decreases. The frequency, magnitude, duration, and structure of these experiments differ depending on what's already known (startup or steady-state) and application sending behavior (intermittent or continuous).

This state machine has several goals:

* Achieve high throughput by efficiently utilizing available bandwidth.

* Achieve low latency and packet loss rates by keeping queues bounded and small.

* Share bandwidth with other flows in an approximately fair manner.

* Feed samples to the model estimators to refresh and update the model.

4.1.3. State Machine Tactics

In the BBR framework, at any given time the sender can choose one of the following tactics:

* Acceleration: Send faster than the network is delivering data: to probe for BBR.max_bw, the maximum bandwidth available to the flow

* Cruising: Send at the same rate the network is delivering data: try to match the sending rate to the flow's current available bandwidth, to try to achieve high utilization of the available bandwidth without increasing queue pressure

* Deceleration: Send slower than the network is delivering data: to reduce the amount of data in flight, with a number of overlapping motivations:

- Reducing queuing delay: to reduce queuing delay, to reduce latency for request/response cross-traffic (e.g. RPC, web traffic).

- Reducing packet loss: to reduce packet loss, to reduce tail latency for request/response cross-traffic (e.g. RPC, web traffic) and improve coexistence with Reno/CUBIC.

- Probing BBR.min_rtt: to probe the path's BBR.min_rtt

- Bandwidth convergence: to aid bandwidth fairness convergence, by leaving unused capacity in the bottleneck link or bottleneck buffer, to allow other flows that may have lower sending rates to discover and utilize the unused capacity

- Burst tolerance: to allow bursty arrivals of cross-traffic (e.g.
short web or RPC requests) to be able to share the bottleneck link without causing excessive queuing delay or packet loss

Throughout the lifetime of a BBR flow, it sequentially cycles through all three tactics, to measure the network path and try to optimize its operating point.

BBR's state machine uses two control mechanisms. Primarily, it uses the pacing_gain (see the "Pacing Rate" section), which controls how fast packets are sent relative to BBR.bw. A pacing_gain > 1 decreases inter-packet time and increases inflight. A pacing_gain < 1 has the opposite effect, increasing inter-packet time while aiming to decrease inflight. Second, if the state machine needs to quickly reduce inflight to a particular absolute value, it uses the cwnd.

4.2. Algorithm Organization

The BBR algorithm is an event-driven algorithm that executes steps upon the following events: connection initialization, each ACK, the transmission of each quantum, and each loss detection event. All of the sub-steps invoked below are described in the sections that follow.

4.2.1. Initialization

Upon transport connection initialization, BBR executes its initialization steps:

   BBROnInit():
     init_windowed_max_filter(filter=BBR.MaxBwFilter, value=0, time=0)
     BBR.min_rtt = SRTT ? SRTT : Inf
     BBR.min_rtt_stamp = Now()
     BBR.probe_rtt_done_stamp = 0
     BBR.probe_rtt_round_done = false
     BBR.prior_cwnd = 0
     BBR.idle_restart = false
     BBR.extra_acked_interval_start = Now()
     BBR.extra_acked_delivered = 0
     BBRResetCongestionSignals()
     BBRResetLowerBounds()
     BBRInitRoundCounting()
     BBRInitFullPipe()
     BBRInitPacingRate()
     BBREnterStartup()

4.2.2. Per-Transmit Steps

When transmitting, BBR merely needs to check for the case where the flow is restarting from idle:

   BBROnTransmit():
     BBRHandleRestartFromIdle()

4.2.3. Per-ACK Steps

On every ACK, the BBR algorithm executes the following BBRUpdateOnACK() steps in order to update its network path model, update its state machine, and adjust its control parameters to adapt to the updated model:

   BBRUpdateOnACK():
     BBRUpdateModelAndState()
     BBRUpdateControlParameters()

   BBRUpdateModelAndState():
     BBRUpdateLatestDeliverySignals()
     BBRUpdateCongestionSignals()
     BBRUpdateACKAggregation()
     BBRCheckStartupDone()
     BBRCheckDrain()
     BBRUpdateProbeBWCyclePhase()
     BBRUpdateMinRTT()
     BBRCheckProbeRTT()
     BBRAdvanceLatestDeliverySignals()
     BBRBoundBWForModel()

   BBRUpdateControlParameters():
     BBRSetPacingRate()
     BBRSetSendQuantum()
     BBRSetCwnd()

4.2.4. Per-Loss Steps

On every packet loss event, where some sequence range "packet" is marked lost, the BBR algorithm executes the following BBRUpdateOnLoss() steps in order to update its network path model:

   BBRUpdateOnLoss(packet):
     BBRHandleLostPacket(packet)

4.3. State Machine Operation

4.3.1. Startup

4.3.1.1. Startup Dynamics

When a BBR flow starts up, it performs its first (and most rapid) sequential probe/drain process in the Startup and Drain states. Network link bandwidths currently span a range of at least 11 orders of magnitude, from a few bps to 200 Gbps. To quickly learn BBR.max_bw, given this huge range to explore, BBR's Startup state does an exponential search of the rate space, doubling the sending rate each round. This finds BBR.max_bw in O(log_2(BDP)) round trips.
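As a rough illustration of this time scale (not part of the algorithm itself): for a path with a 10 Gbps bottleneck and a 100 ms round-trip time, the BDP is about 10 Gbps * 100 ms / (1514 bytes) ~= 82,563 packets. Assuming a typical initial window of 10 packets, a sender that doubles its delivery rate each round reaches the full bandwidth in roughly log_2(82,563 / 10) ~= 13 round trips, on the order of 1.3 seconds, rather than the tens of thousands of round trips that a purely additive-increase search would need.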
887 To achieve this rapid probing in the smoothest possible fashion, in 888 Startup BBR uses the minimum gain values that will allow the sending 889 rate to double each round: in Startup BBR sets BBR.pacing_gain to 890 BBRStartupPacingGain (2.77) [BBRStartupPacingGain] and BBR.cwnd_gain 891 to BBRStartupCwndGain (2). 893 When initializing a connection, or upon any later entry into Startup 894 mode, BBR executes the following BBREnterStartup() steps: 896 BBREnterStartup(): 897 BBR.state = Startup 898 BBR.pacing_gain = BBRStartupPacingGain 899 BBR.cwnd_gain = BBRStartupCwndGain 901 As BBR grows its sending rate rapidly, it obtains higher delivery 902 rate samples, BBR.max_bw increases, and the pacing rate and cwnd both 903 adapt by smoothly growing in proportion. Once the pipe is full, a 904 queue typically forms, but the cwnd_gain bounds any queue to 905 (cwnd_gain - 1) * estimated_BDP, which is approximately (2.77 - 1) * 906 estimated_BDP = 1.77 * estimated_BDP. The immediately following 907 Drain state is designed to quickly drain that queue. 909 During Startup, BBR estimates whether the pipe is full using two 910 estimators. The first looks for a plateau in the BBR.max_bw 911 estimate. The second looks for packet loss. The following 912 subsections discuss these estimators. 914 BBRCheckStartupDone(): 915 BBRCheckStartupFullBandwidth() 916 BBRCheckStartupHighLoss() 917 if (BBR.state == Startup and BBR.filled_pipe) 918 BBREnterDrain() 920 4.3.1.2. Estimating When Startup has Filled the Pipe Based on Bandwidth 921 Plateau 923 During Startup, BBR estimates whether the pipe is full by looking for 924 a plateau in the BBR.max_bw estimate. The output of this "full pipe" 925 estimator is tracked in BBR.filled_pipe, a boolean that records 926 whether BBR estimates that it has ever fully utilized its available 927 bandwidth ("filled the pipe"). If BBR notices that there are several 928 (three) rounds where attempts to double the delivery rate actually 929 result in little increase (less than 25 percent), then it estimates 930 that it has reached BBR.max_bw, sets BBR.filled_pipe to true, exits 931 Startup and enters Drain. 933 Upon connection initialization the full pipe estimator runs: 935 BBRInitFullPipe(): 936 BBR.filled_pipe = false 937 BBR.full_bw = 0 938 BBR.full_bw_count = 0 940 Once per round trip, upon an ACK that acknowledges new data, and when 941 the delivery rate sample is not application-limited (see [draft- 942 cheng-iccrg-delivery-rate-estimation]), BBR runs the "full pipe" 943 estimator, if needed: 945 BBRCheckStartupFullBandwidth(): 946 if BBR.filled_pipe or 947 !BBR.round_start or rs.is_app_limited 948 return /* no need to check for a full pipe now */ 949 if (BBR.max_bw >= BBR.full_bw * 1.25) /* still growing? */ 950 BBR.full_bw = BBR.max_bw /* record new baseline level */ 951 BBR.full_bw_count = 0 952 return 953 BBR.full_bw_count++ /* another round w/o much growth */ 954 if (BBR.full_bw_count >= 3) 955 BBR.filled_pipe = true 957 BBR waits three rounds to have solid evidence that the sender is not 958 detecting a delivery-rate plateau that was temporarily imposed by the 959 receive window. 
Allowing three rounds provides time for the receiver's receive-window auto-tuning to open up the receive window and for the BBR sender to realize that BBR.max_bw should be higher: in the first round the receive-window auto-tuning algorithm grows the receive window; in the second round the sender fills the higher receive window; in the third round the sender gets higher delivery-rate samples. This three-round threshold was validated by YouTube experimental data.

4.3.1.3. Estimating When Startup has Filled the Pipe Based on Packet Loss

A second method BBR uses for estimating whether the bottleneck is full is to look for sustained packet losses. Specifically, BBR checks for a case where the following criteria are all met:

* The connection has been in fast recovery for at least one full round trip.

* The loss rate over the time scale of a single full round trip exceeds BBRLossThresh (2%).

* There are at least BBRStartupFullLossCnt=3 discontiguous sequence ranges lost in that round trip.

If these criteria are all met, then BBRCheckStartupHighLoss() sets BBR.filled_pipe = true, exits Startup, and enters Drain.

The algorithm waits until all three criteria are met to filter out noise from burst losses, and to try to ensure the bottleneck is fully utilized on a sustained basis, and the full bottleneck bandwidth has been measured, before attempting to drain the level of in-flight data to the estimated BDP.

4.3.2. Drain

Upon exiting Startup, BBR enters its Drain state. In Drain, BBR aims to quickly drain any queue created in Startup by switching to a pacing_gain well below 1.0, until any estimated queue has been drained. It uses a pacing_gain that is the inverse of the value used during Startup, chosen to try to drain the queue in one round [BBRDrainPacingGain]:

   BBREnterDrain():
     BBR.state = Drain
     BBR.pacing_gain = 1/BBRStartupCwndGain  /* pace slowly */
     BBR.cwnd_gain = BBRStartupCwndGain      /* maintain cwnd */

In Drain, when the amount of data in flight is less than or equal to the estimated BDP, meaning BBR estimates that the queue has been fully drained, then BBR exits Drain and enters ProbeBW. To implement this, upon every ACK BBR executes:

   BBRCheckDrain():
     if (BBR.state == Drain and packets_in_flight <= BBRInflight(1.0))
       BBREnterProbeBW()  /* BBR estimates the queue was drained */

4.3.3. ProbeBW

Long-lived BBR flows tend to spend the vast majority of their time in the ProbeBW states. In the ProbeBW states, a BBR flow sequentially accelerates, decelerates, and cruises, to measure the network path, improve its operating point (increase throughput and reduce queue pressure), and converge toward a more fair allocation of bottleneck bandwidth. To do this, the flow sequentially cycles through all three tactics: trying to send faster than, slower than, and at the same rate as the network delivery process. To achieve this, a BBR flow in ProbeBW mode cycles through the four ProbeBW states - DOWN, CRUISE, REFILL, and UP - described below in turn.

4.3.3.1. ProbeBW_DOWN

In the ProbeBW_DOWN phase of the cycle, a BBR flow pursues the deceleration tactic, to try to send slower than the network is delivering data, to reduce the amount of data in flight, with all of the standard motivations for the deceleration tactic (discussed in "State Machine Tactics", above).
It does this by switching to a 1036 BBR.pacing_gain of 0.75, sending at 75% of BBR.bw. 1038 Exit conditions: The flow exits this phase and enters CRUISE when the 1039 flow estimates that both of the following conditions have been met: 1041 * There is free headroom: If inflight_hi is set, then BBR remains in 1042 DOWN at least until the volume of in-flight data is less than or 1043 equal to BBRHeadroom*BBR.inflight_hi. The goal of this constraint 1044 is to ensure that in cases where loss signals suggest an upper 1045 limit on the volume of in-flight data, then the flow attempts to 1046 leave some free headroom in the path (e.g. free space in the 1047 bottleneck buffer or free time slots in the bottleneck link) that 1048 can be used by cross traffic (both for volume-based convergence of 1049 bandwidth shares and for burst tolerance). 1051 * At least BBR.min_rtt has elapsed in ProbeBW_DOWN. 1053 4.3.3.2. ProbeBW_CRUISE 1055 In the ProbeBW_CRUISE phase of the cycle, a BBR flow pursues the 1056 "cruising" tactic (discussed in "State Machine Tactics", above), 1057 attempting to send at the same rate the network is delivering data. 1058 It tries to match the sending rate to the flow's current available 1059 bandwidth, to try to achieve high utilization of the available 1060 bandwidth without increasing queue pressure. It does this by 1061 switching to a pacing_gain of 1.0, sending at 100% of BBR.bw. 1062 Notably, while in this state it responds to concrete congestion 1063 signals (loss) by reducing BBR.bw_lo and BBR.inflight_lo, because 1064 these signals suggest that the available bandwidth and deliverable 1065 volume of in-flight data have likely reduced, and the flow needs to 1066 change to adapt, slowing down to match the latest delivery process. 1068 Exit conditions: The connection adaptively holds this state until it 1069 decides that it is time to probe for bandwidth, at which time it 1070 enters ProbeBW_REFILL (see "Time Scale for Bandwidth Probing", 1071 below). 1073 4.3.3.3. ProbeBW_REFILL 1075 The goal of the ProbeBW_REFILL state is to "refill the pipe", to try 1076 to fully utilize the network bottleneck without creating any 1077 significant queue pressure. 1079 To do this, BBR first resets the short-term model parameters bw_lo 1080 and inflight_lo, setting both to "Infinity". This is the key moment 1081 in the BBR time scale strategy (see "Time Scale Strategy", above) 1082 where the flow pivots, discarding its conservative short-term bw_lo 1083 and inflight_lo parameters and beginning to robustly probe the 1084 bottleneck's long-term available bandwidth. During this time bw_hi 1085 and inflight_hi, if set, constrain the connection. 1087 During ProbeBW_REFILL BBR uses a BBR.pacing_gain of 1.0, to send at a 1088 rate that matches the current estimated available bandwidth, for one 1089 packet-timed round trip. The goal is to fully utilize the bottleneck 1090 link before transitioning into ProbeBW_UP and significantly 1091 increasing the chances of causing a loss signal. The motivating 1092 insight is that, as soon as a flow starts acceleration, sending 1093 faster than the available bandwidth, it will start building a queue 1094 at the bottleneck. And if the buffer is shallow enough, then the 1095 flow can cause loss signals very shortly after the first accelerating 1096 packets arrive at the bottleneck. 
If the flow were to neglect to 1097 fill the pipe before it causes this loss signal, then these very 1098 quick signals of excess queue could cause the flow's estimate of the 1099 path's capacity (i.e. inflight_hi) to significantly underestimate. 1100 In particular, if the flow were to transition directly from 1101 ProbeBW_CRUISE to ProbeBW_UP, the volume of in-flight data (at the 1102 time the first accelerating packets were sent) may often be still 1103 very close to the volume of in-flight data maintained in CRUISE, 1104 which may be only BBRHeadroom*inflight_hi. 1106 Exit conditions: The flow exits ProbeBW_REFILL after one packet-timed 1107 round trip, and enters UP. This is because after one full round trip 1108 of sending in ProbeBW_REFILL the flow can safely estimate that the 1109 pipe is as full as its BBR.bdp estimate permits. And 1110 correspondingly, at this point the flow starts to see bandwidth 1111 samples reflecting its ProbeBW_REFILL behavior, which may be putting 1112 too much data in flight. 1114 4.3.3.4. ProbeBW_UP 1116 After ProbeBW_REFILL refills the pipe, ProbeBW_UP probes for possible 1117 increases in available bandwidth by using a BBR.pacing_gain of 1.25, 1118 sending faster than the current estimated available bandwidth. 1120 If the flow has not set BBR.inflight_hi or BBR.bw_hi, it tries to 1121 raise the volume of in-flight data to at least BBR.pacing_gain * 1122 BBR.bdp = 1.25 * BBR.bdp; note that this may take more than 1123 BBR.min_rtt if BBR.min_rtt is small (e.g. on a LAN). 1125 If the flow has set BBR.inflight_hi or BBR.bw_hi, it moves to an 1126 operating point based on those limits and then gradually increases 1127 the upper volume bound (BBR.inflight_hi) and rate bound (BBR.bw_hi) 1128 using the following approach: 1130 * bw_hi: The flow raises bw_hi to the latest measured bandwidth 1131 sample if the latest measured bandwidth sample is above bw_hi and 1132 the loss rate for the sample is not above the BBRLossThresh. 1134 * inflight_hi: The flow raises inflight_hi in ProbeBW_UP in a manner 1135 that is slow and cautious at first, but increasingly rapid and 1136 bold over time. The initial caution is motivated by the fact that 1137 a given BBR flow may be sharing a shallow buffer with thousands of 1138 other flows, so that the buffer space available to the flow may be 1139 quite tight - even just a single packet. The increasingly rapid 1140 growth over time is motivated by the fact that in a high-speed WAN 1141 the increase in available bandwidth (and thus the estimated BDP) 1142 may require the flow to grow the volume of its inflight data by up 1143 to O(1,000,000); even a quite typical BDP like 10Gbps * 100ms is 1144 82,563 packets. BBR takes an approach where the additive increase 1145 to BBR.inflight_hi exponentially doubles each round trip; in each 1146 successive round trip, inflight_hi grows by 1, 2, 4, 8, 16, etc, 1147 with the increases spread uniformly across the entire round trip. 1148 This helps allow BBR to utilize a larger BDP in O(log(BDP)) round 1149 trips, meeting the design goal for scalable utilization of newly- 1150 available bandwidth. 1152 Exit conditions: The BBR flow ends ProbeBW_UP bandwidth probing and 1153 transitions to ProbeBW_DOWN to try to drain the bottleneck queue when 1154 any of the following conditions are met: 1156 1. 
Estimated queue: The flow has been in ProbeBW_UP for at least 1*min_rtt, and the estimated queue is high enough that the flow judges it has robustly probed for available bandwidth (packets_in_flight > 1.25 * BBR.bdp).

2. Loss: The current loss rate exceeds BBRLossThresh (2%).

4.3.3.5. Time Scale for Bandwidth Probing

Choosing the time scale for probing bandwidth is tied to the question of how to coexist with legacy Reno/CUBIC flows, since probing for bandwidth runs a significant risk of causing packet loss, and causing packet loss can significantly limit the throughput of such legacy Reno/CUBIC flows.

4.3.3.5.1. Bandwidth Probing and Coexistence with Reno/CUBIC

BBR has an explicit strategy for coexistence with Reno/CUBIC: to try to behave in a manner so that Reno/CUBIC flows coexisting with BBR can continue to work well in the primary contexts where they do today:

* Intra-datacenter/LAN traffic: we want Reno/CUBIC to be able to perform well in 100 Mbps through 40 Gbps enterprise and datacenter Ethernet

- BDP = 40 Gbps * 20 us / (1514 bytes) ~= 66 packets

* Public Internet last mile traffic: we want Reno/CUBIC to be able to support up to 25Mbps (for 4K Video) at an RTT of 30ms, typical parameters for common CDNs for large video services:

- BDP = 25Mbps * 30 ms / (1514 bytes) ~= 62 packets

The challenge in meeting these goals is that Reno/CUBIC need long periods of no loss to utilize large BDPs. The good news is that in the environments where Reno/CUBIC work well today (mentioned above), the BDPs are small, roughly ~100 packets or less.

4.3.3.5.2. A Dual-Time-Scale Approach for Coexistence

The BBR strategy has several aspects:

1. The highest priority is to estimate the bandwidth available to the BBR flow in question.

2. Secondarily, a given BBR flow adapts (within bounds) the frequency at which it probes bandwidth and knowingly risks packet loss, to allow Reno/CUBIC to reach a bandwidth at least as high as that given BBR flow.

To adapt the frequency of bandwidth probing, BBR considers two time scales: a BBR-native time scale, and a bounded Reno-conscious time scale:

* T_bbr: BBR-native time-scale

- T_bbr = uniformly randomly distributed between 2 and 3 secs

* T_reno: Reno-coexistence time scale

- T_reno_bound = pick_randomly_either({62, 63})

- T_reno = min(BDP in packets, T_reno_bound) round trips

* T_probe: The time between bandwidth probe UP phases:

- T_probe = min(T_bbr, T_reno)

This dual-time-scale approach is similar to that used by CUBIC, which has a CUBIC-native time scale given by a cubic curve, and a "Reno emulation" module that estimates what cwnd would give the flow Reno-equivalent throughput. At any given moment, CUBIC chooses the cwnd implied by the more aggressive strategy.

We randomize both the T_bbr and T_reno parameters, for better mixing and fairness convergence.

4.3.3.5.3. Design Considerations for Choosing Constant Parameters

We design the maximum wall-clock bounds of BBR-native inter-bandwidth-probe wall clock time, T_bbr, to be:

* Higher than 2 sec to try to avoid causing loss for a long enough time to allow a Reno flow with RTT=30ms to get 25Mbps (4K video) throughput.
For this workload, given the Reno sawtooth that 1242 raises cwnd from roughly BDP to 2*BDP, growing by one MSS per round trip, the 1243 inter-bandwidth-probe time must be at least: BDP * RTT = (25Mbps * 1244 0.030 sec / (1514 bytes)) * 0.030 sec ~= 62 * 0.030 sec ~= 1.9 secs 1246 * Lower than 3 sec to ensure flows can start probing in a reasonable 1247 amount of time to discover unutilized bw on human-scale 1248 interactive time-scales (e.g. perhaps traffic from a competing web 1249 page download is now complete). 1251 The maximum round-trip bounds of the Reno-coexistence time scale, 1252 T_reno, are chosen to be 62-63 with the following considerations in 1253 mind: 1255 * Choosing a value smaller than roughly 60 would imply that when BBR 1256 flows coexisted with Reno/CUBIC flows (e.g. Netflix Reno flows) 1257 on public Internet broadband links, the Reno/CUBIC flows would not 1258 be able to achieve enough bandwidth to show 4K video. 1260 * Choosing a value larger than roughly 65 would prevent BBR from 1261 reaching its goal of tolerating 1% loss per round trip. Given 1262 that the steady-state (non-bandwidth-probing) BBR response to a 1263 round trip with X% packet loss is to reduce the sending rate by X% 1264 (see the "Updating the Model Upon Packet Loss" section), this 1265 means that the BBR sending rate after N rounds of packet loss at a 1266 rate loss_rate is (1 - loss_rate)^N. This means that a flow 1267 that encounters 1% loss in 65 round trips of ProbeBW_CRUISE, and 1268 then doubles its cwnd (back to BBR.inflight_hi) in ProbeBW_REFILL 1269 and ProbeBW_UP, will be able to restore and reprobe its 1270 original sending rate, since: BBR.max_bw * (1 - loss_rate)^N * 2 = 1271 BBR.max_bw * (1 - .01)^65 * 2 ~= 1.04 * BBR.max_bw. That is, the flow 1272 will be able to fully respond to packet loss signals in 1273 ProbeBW_CRUISE while also fully re-measuring its maximum 1274 achievable throughput in ProbeBW_UP. 1276 The resulting behavior is that for BBR flows with small BDPs, the 1277 bandwidth probing will be on roughly the same time scale as Reno/ 1278 CUBIC; flows with large BDPs will intentionally probe more rapidly/ 1279 frequently than Reno/CUBIC would (roughly every 62 round trips for 1280 low-RTT flows, or 2-3 secs for high-RTT flows). 1282 The considerations above for timing bandwidth probing can be 1283 implemented as follows: 1285 /* Is it time to transition from DOWN or CRUISE to REFILL? */ 1286 BBRCheckTimeToProbeBW(): 1287 if (BBRHasElapsedInPhase(BBR.bw_probe_wait) || 1288 BBRIsRenoCoexistenceProbeTime()) 1289 BBRStartProbeBW_REFILL() 1290 return true 1291 return false 1293 /* Randomized decision about how long to wait until 1294 * probing for bandwidth, using round count and wall clock. 1295 */ 1296 BBRPickProbeWait(): 1297 /* Decide random round-trip bound for wait: */ 1298 BBR.rounds_since_bw_probe = 1299 random_int_between(0, 1); /* 0 or 1 */ 1300 /* Decide the random wall clock bound for wait: */ 1301 BBR.bw_probe_wait = 1302 2sec + random_float_between(0.0, 1.0) /* 0..1 sec */ 1304 BBRIsRenoCoexistenceProbeTime(): 1305 reno_rounds = BBRTargetInflight() 1306 rounds = min(reno_rounds, 63) 1307 return BBR.rounds_since_bw_probe >= rounds 1309 /* How much data do we want in flight? 1310 * Our estimated BDP, unless congestion cut cwnd. */ 1311 BBRTargetInflight() 1312 return min(BBR.bdp, cwnd) 1314 4.3.3.6. ProbeBW Algorithm Details 1316 BBR's ProbeBW algorithm operates as follows.
1318 Upon entering ProbeBW, BBR executes: 1320 BBREnterProbeBW(): 1321 BBRStartProbeBW_DOWN() 1323 The core logic for entering each state: 1325 BBRStartProbeBW_DOWN(): 1326 BBRResetCongestionSignals() 1327 BBR.probe_up_cnt = Infinity /* not growing inflight_hi */ 1328 BBRPickProbeWait() 1329 BBR.cycle_stamp = Now() /* start wall clock */ 1330 BBR.ack_phase = ACKS_PROBE_STOPPING 1331 BBRStartRound() 1332 BBR.state = ProbeBW_DOWN 1334 BBRStartProbeBW_CRUISE(): 1335 BBR.state = ProbeBW_CRUISE 1337 BBRStartProbeBW_REFILL(): 1338 BBRResetLowerBounds() 1339 BBR.bw_probe_up_rounds = 0 1340 BBR.bw_probe_up_acks = 0 1341 BBR.ack_phase = ACKS_REFILLING 1342 BBRStartRound() 1343 BBR.state = ProbeBW_REFILL 1345 BBRStartProbeBW_UP(): 1346 BBR.ack_phase = ACKS_PROBE_STARTING 1347 BBRStartRound() 1348 BBR.cycle_stamp = Now() /* start wall clock */ 1349 BBR.state = ProbeBW_UP 1350 BBRRaiseInflightHiSlope() 1352 BBR executes the following BBRUpdateProbeBWCyclePhase() logic on each 1353 ACK that ACKs or SACKs new data, to advance the ProbeBW state 1354 machine: 1356 /* The core state machine logic for ProbeBW: */ 1357 BBRUpdateProbeBWCyclePhase(): 1358 if (!BBR.filled_pipe) 1359 return /* only handling steady-state behavior here */ 1360 BBRAdaptUpperBounds() 1361 if (!IsInAProbeBWState()) 1362 return /* only handling ProbeBW states here */ 1364 switch (state) 1366 ProbeBW_DOWN: 1367 if (BBRCheckTimeToProbeBW()) 1368 return /* already decided state transition */ 1369 if (BBRCheckTimeToCruise()) 1370 BBRStartProbeBW_CRUISE() 1372 ProbeBW_CRUISE: 1373 if (BBRCheckTimeToProbeBW()) 1374 return /* already decided state transition */ 1376 ProbeBW_REFILL: 1377 /* After one round of REFILL, start UP */ 1378 if (BBR.round_start) 1379 BBR.bw_probe_samples = 1 1380 BBRStartProbeBW_UP() 1382 ProbeBW_UP: 1383 if (BBRHasElapsedInPhase(BBR.min_rtt) and 1384 inflight > BBRInflight(BBR.max_bw, 1.25)) 1385 BBRStartProbeBW_DOWN() 1387 The ancillary logic to implement the ProbeBW state machine: 1389 IsInAProbeBWState() 1390 state = BBR.state 1391 return (state == ProbeBW_DOWN or 1392 state == ProbeBW_CRUISE or 1393 state == ProbeBW_REFILL or 1394 state == ProbeBW_UP) 1396 /* Time to transition from DOWN to CRUISE? */ 1397 BBRCheckTimeToCruise(): 1398 if (inflight > BBRInflightWithHeadroom()) 1399 return false /* not enough headroom */ 1400 if (inflight <= BBRInflight(BBR.max_bw, 1.0)) 1401 return true /* inflight <= estimated BDP */ 1402 if (BBRHasElapsedInPhase(BBR.min_rtt)) 1403 return true /* at least a round in DOWN */ 1405 BBRHasElapsedInPhase(interval): 1406 return Now() > BBR.cycle_stamp + interval 1408 /* Return a volume of data that tries to leave free 1409 * headroom in the bottleneck buffer or link for 1410 * other flows, for fairness convergence and lower 1411 * RTTs and loss */ 1412 BBRInflightWithHeadroom(): 1413 if (BBR.inflight_hi == Infinity) 1414 return Infinity 1415 headroom = max(1, BBRHeadroom * BBR.inflight_hi) 1416 return max(BBR.inflight_hi - headroom, 1417 BBRMinPipeCwnd) 1419 /* Raise inflight_hi slope if appropriate. */ 1420 BBRRaiseInflightHiSlope(): 1421 growth_this_round = 1MSS << BBR.bw_probe_up_rounds 1422 BBR.bw_probe_up_rounds = min(BBR.bw_probe_up_rounds + 1, 30) 1423 BBR.probe_up_cnt = max(cwnd / growth_this_round, 1) 1425 /* Increase inflight_hi if appropriate.
*/ 1426 BBRProbeInflightHiUpward(): 1427 if (!is_cwnd_limited or cwnd < BBR.inflight_hi) 1428 return /* not fully using inflight_hi, so don't grow it */ 1429 BBR.bw_probe_up_acks += rs.newly_acked 1430 if (BBR.bw_probe_up_acks >= BBR.probe_up_cnt) 1431 delta = BBR.bw_probe_up_acks / BBR.probe_up_cnt 1432 BBR.bw_probe_up_acks -= delta * BBR.probe_up_cnt 1433 BBR.inflight_hi += delta 1434 if (BBR.round_start) 1435 BBRRaiseInflightHiSlope() 1437 /* Track ACK state and update BBR.max_bw window and 1438 * BBR.inflight_hi and BBR.bw_hi. */ 1439 BBRAdaptUpperBounds(): 1440 if (BBR.ack_phase == ACKS_PROBE_STARTING and BBR.round_start) 1441 /* starting to get bw probing samples */ 1442 BBR.ack_phase = ACKS_PROBE_FEEDBACK 1443 if (BBR.ack_phase == ACKS_PROBE_STOPPING and BBR.round_start) 1444 /* end of samples from bw probing phase */ 1445 if (IsInAProbeBWState() and !rs.is_app_limited) 1446 BBRAdvanceMaxBwFilter() 1448 if (!CheckInflightTooHigh()) 1449 /* Loss rate is safe. Adjust upper bounds upward. */ 1450 if (BBR.inflight_hi == Infinity or BBR.bw_hi == Infinity) 1451 return /* no upper bounds to raise */ 1452 if (rs.tx_in_flight > BBR.inflight_hi) 1453 BBR.inflight_hi = rs.tx_in_flight 1454 if (rs.delivery_rate > BBR.bw_hi) 1455 BBR.bw_hi = rs.delivery_rate 1456 if (BBR.state == ProbeBW_UP) 1457 BBRProbeInflightHiUpward() 1459 4.3.4. ProbeRTT 1461 4.3.4.1. ProbeRTT Overview 1463 To help probe for BBR.min_rtt, on an as-needed basis BBR flows enter 1464 the ProbeRTT state to try to cooperate to periodically drain the 1465 bottleneck queue - and thus improve their BBR.min_rtt estimate of the 1466 unloaded two-way propagation delay. 1468 A critical point is that before BBR raises its BBR.min_rtt estimate 1469 (which would in turn raise its maximum permissible cwnd), it first 1470 enters ProbeRTT to try to make a concerted and coordinated effort to 1471 drain the bottleneck queue and make a robust BBR.min_rtt measurement. 1472 This allows the BBR.min_rtt estimates of ensembles of BBR flows to 1473 converge, avoiding feedback loops of ever-increasing queues and RTT 1474 samples. 1476 The ProbeRTT state works in concert with BBR.min_rtt estimation. Up 1477 to once every ProbeRTTInterval = 5 seconds, the flow enters ProbeRTT, 1478 decelerating by setting its cwnd_gain to BBRProbeRTTCwndGain = 0.5 to 1479 reduce its volume of inflight data to half of its estimated BDP, to 1480 try to allow the flow to measure the unloaded two-way propagation 1481 delay. 1483 There are two main motivations for making the MinRTTFilterLen roughly 1484 twice the ProbeRTTInterval. First, this ensures that during a 1485 ProbeRTT episode the flow will "remember" the BBR.min_rtt value it 1486 measured during the previous ProbeRTT episode, providing a robust bdp 1487 estimate for the cwnd = 0.5*bdp calculation, increasing the 1488 likelihood of fully draining the bottleneck queue. Second, this 1489 allows the flow's BBR.min_rtt filter window to generally include RTT 1490 samples from two ProbeRTT episodes, providing a more robust estimate. 1492 The algorithm for ProbeRTT is as follows: 1494 Entry conditions: In any state other than ProbeRTT itself, if the 1495 BBR.probe_rtt_min_delay estimate has not been updated (i.e., by 1496 getting a lower RTT measurement) for more than ProbeRTTInterval = 5 1497 seconds, then BBR enters ProbeRTT and reduces the BBR.cwnd_gain to 1498 BBRProbeRTTCwndGain = 0.5.
1500 Exit conditions: After maintaining the volume of in-flight data at 1501 BBRProbeRTTCwndGain*BBR.bdp for at least ProbeRTTDuration (200 ms) 1502 and at least one round trip, BBR leaves ProbeRTT and transitions to 1503 ProbeBW if it estimates the pipe was filled already, or Startup 1504 otherwise. 1506 4.3.4.2. ProbeRTT Design Rationale 1508 BBR is designed to have ProbeRTT sacrifice no more than roughly 2% of 1509 a flow's available bandwidth. It is also designed to spend the vast 1510 majority of its time (at least roughly 96 percent) in ProbeBW and the 1511 rest in ProbeRTT, based on a set of tradeoffs. ProbeRTT lasts long 1512 enough (at least ProbeRTTDuration = 200 ms) to allow flows with 1513 different RTTs to have overlapping ProbeRTT states, while still being 1514 short enough to bound the throughput penalty of ProbeRTT's cwnd 1515 capping to roughly 2%, with the average throughput targeted at: 1517 throughput = (200ms*0.5*BBR.bw + (5s - 200ms)*BBR.bw) / 5s 1518 = (.1s + 4.8s)/5s * BBR.bw = 0.98 * BBR.bw 1520 As discussed above, BBR's BBR.min_rtt filter window, MinRTTFilterLen, 1521 and time interval between ProbeRTT states, ProbeRTTInterval, work in 1522 concert. BBR uses a MinRTTFilterLen equal to or longer than 1523 ProbeRTTInterval to allow the filter window to include at least one 1524 ProbeRTT. 1526 To allow coordination with other BBR flows, each flow MUST use the 1527 standard ProbeRTTInterval of 5 secs. 1529 A ProbeRTTInterval of 5 secs is short enough to allow quick 1530 convergence if traffic levels or routes change, but long enough so 1531 that interactive applications (e.g., Web, remote procedure calls, 1532 video chunks) often have natural silences or low-rate periods within 1533 the window where the flow's rate is low enough for long enough to 1534 drain its queue in the bottleneck. Then the BBR.probe_rtt_min_delay 1535 filter opportunistically picks up these measurements, and the 1536 BBR.probe_rtt_min_delay estimate refreshes without requiring 1537 ProbeRTT. This way, flows typically need only pay the 2 percent 1538 throughput penalty if there are multiple bulk flows busy sending over 1539 the entire ProbeRTTInterval window. 1541 As an optimization, when restarting from idle and finding that the 1542 BBR.probe_rtt_min_delay has expired, BBR does not enter ProbeRTT; the 1543 idleness is deemed a sufficient attempt to coordinate to drain the 1544 queue. 1546 4.3.4.3. Calculating the rs.rtt RTT Sample 1548 Upon transmitting each packet, BBR (or the associated transport 1549 protocol) stores in per-packet data the wall-clock scheduled 1550 transmission time of the packet in packet.departure_time (see the 1551 "Pacing Rate: BBR.pacing_rate" section for how this is calculated). 1553 For every ACK that newly acknowledges some data (whether cumulatively 1554 or selectively), the sender's BBR implementation (or the associated 1555 transport protocol implementation) attempts to calculate an RTT 1556 sample. The sender MUST consider any potential retransmission 1557 ambiguities that can arise in some transport protocols. If some of 1558 the acknowledged data was not retransmitted, or some of the data was 1559 retransmitted but the sender can still unambiguously determine the 1560 RTT of the data (e.g. if the transport supports [RFC7323] TCP 1561 timestamps or an equivalent mechanism), then the sender calculates an 1562 RTT sample, rs.rtt, as follows: 1564 rs.rtt = Now() - packet.departure_time 1566 4.3.4.4.
ProbeRTT Logic 1568 On every ACK BBR executes BBRUpdateMinRTT() to update its ProbeRTT 1569 scheduling state (BBR.probe_rtt_min_delay and 1570 BBR.probe_rtt_min_stamp) and its BBR.min_rtt estimate: 1572 BBRUpdateMinRTT() 1573 BBR.probe_rtt_expired = 1574 Now() > BBR.probe_rtt_min_stamp + ProbeRTTInterval 1575 if (rs.rtt >= 0 and 1576 (rs.rtt < BBR.probe_rtt_min_delay or 1577 BBR.probe_rtt_expired)) 1578 BBR.probe_rtt_min_delay = rs.rtt 1579 BBR.probe_rtt_min_stamp = Now() 1581 min_rtt_expired = 1582 Now() > BBR.min_rtt_stamp + MinRTTFilterLen 1583 if (BBR.probe_rtt_min_delay < BBR.min_rtt or 1584 min_rtt_expired) 1585 BBR.min_rtt = BBR.probe_rtt_min_delay 1586 BBR.min_rtt_stamp = BBR.probe_rtt_min_stamp 1588 Here BBR.probe_rtt_expired is a boolean recording whether the 1589 BBR.probe_rtt_min_delay has expired and is due for a refresh, via 1590 either an application idle period or a transition into ProbeRTT 1591 state. 1593 On every ACK BBR executes BBRCheckProbeRTT() to handle the steps 1594 related to the ProbeRTT state as follows: 1596 BBRCheckProbeRTT(): 1597 if (BBR.state != ProbeRTT and 1598 BBR.probe_rtt_expired and 1599 not BBR.idle_restart) 1600 BBREnterProbeRTT() 1601 BBR.prior_cwnd = BBRSaveCwnd() 1602 BBR.probe_rtt_done_stamp = 0 1603 BBR.ack_phase = ACKS_PROBE_STOPPING 1604 BBRStartRound() 1605 if (BBR.state == ProbeRTT) 1606 BBRHandleProbeRTT() 1607 if (rs.delivered > 0) 1608 BBR.idle_restart = false 1610 BBREnterProbeRTT(): 1611 BBR.state = ProbeRTT 1612 BBR.pacing_gain = 1 1613 BBR.cwnd_gain = BBRProbeRTTCwndGain /* 0.5 */ 1615 BBRHandleProbeRTT(): 1616 /* Ignore low rate samples during ProbeRTT: */ 1617 MarkConnectionAppLimited() 1618 if (BBR.probe_rtt_done_stamp == 0 and 1619 packets_in_flight <= BBRProbeRTTCwnd()) 1620 /* Wait for at least ProbeRTTDuration to elapse: */ 1621 BBR.probe_rtt_done_stamp = 1622 Now() + ProbeRTTDuration 1623 /* Wait for at least one round to elapse: */ 1624 BBR.probe_rtt_round_done = false 1625 BBRStartRound() 1626 else if (BBR.probe_rtt_done_stamp != 0) 1627 if (BBR.round_start) 1628 BBR.probe_rtt_round_done = true 1629 if (BBR.probe_rtt_round_done) 1630 BBRCheckProbeRTTDone() 1632 BBRCheckProbeRTTDone(): 1633 if (BBR.probe_rtt_done_stamp != 0 and 1634 Now() > BBR.probe_rtt_done_stamp) 1635 /* schedule next ProbeRTT: */ 1636 BBR.probe_rtt_min_stamp = Now() 1637 BBRRestoreCwnd() 1638 BBRExitProbeRTT() 1640 MarkConnectionAppLimited(): 1641 C.app_limited = 1642 (C.delivered + packets_in_flight) ? : 1 1644 4.3.4.5. Exiting ProbeRTT 1646 When exiting ProbeRTT, BBR transitions to ProbeBW if it estimates the 1647 pipe was filled already, or Startup otherwise. 1649 When transitioning out of ProbeRTT, BBR calls BBRResetLowerBounds() 1650 to reset the lower bounds, since any congestion encountered in 1651 ProbeRTT may have pulled the short-term model far below the capacity 1652 of the path. 1654 But the algorithm is cautious in timing the next bandwidth probe: 1655 raising inflight after ProbeRTT may cause loss, so the algorithm 1656 resets the bandwidth-probing clock by starting the cycle in 1657 ProbeBW_DOWN. But then as an optimization, since the connection is 1658 exiting ProbeRTT, we know that inflight is already below the estimated 1659 BDP, so the connection can proceed immediately to ProbeBW_CRUISE. 1661 To summarize, the logic for exiting ProbeRTT is as follows: 1663 BBRExitProbeRTT(): 1664 BBRResetLowerBounds() 1665 if (BBR.filled_pipe) 1666 BBRStartProbeBW_DOWN() 1667 BBRStartProbeBW_CRUISE() 1668 else 1669 BBREnterStartup() 1671 4.4.
Restarting From Idle 1673 4.4.1. Setting Pacing Rate in ProbeBW 1675 When restarting from idle in ProbeBW states, BBR leaves its cwnd as- 1676 is and paces packets at exactly BBR.bw, aiming to return as quickly 1677 as possible to its target operating point of rate balance and a full 1678 pipe. Specifically, if the flow's BBR.state is ProbeBW, and the flow 1679 is application-limited, and there are no packets in flight currently, 1680 then at the moment the flow sends one or more packets BBR sets 1681 BBR.pacing_rate to exactly BBR.bw. More precisely, the BBR algorithm 1682 takes the steps in BBRHandleRestartFromIdle(), shown in the "Logic" 1683 section below, before sending a packet for a flow. 1685 The "Restarting Idle Connections" section of [RFC5681] suggests 1686 restarting from idle by slow-starting from the initial window. 1687 However, this approach assumed a congestion control algorithm 1688 that had no estimate of the bottleneck bandwidth and no pacing, and 1689 thus resorted to relying on slow-starting driven by an ACK clock. 1690 The long (log_2(BDP)*RTT) delays required to reach full utilization 1691 with that "slow start after idle" approach caused many large 1692 deployments to disable this mechanism, resulting in a "BDP-scale 1693 line-rate burst" approach instead. Instead of these two approaches, 1694 BBR restarts by pacing at BBR.bw, typically achieving approximate 1695 rate balance and a full pipe after only one BBR.min_rtt has elapsed. 1697 4.4.2. Checking for ProbeRTT Completion 1699 As an optimization, when restarting from idle BBR checks to see if 1700 the connection is in ProbeRTT and has met the exit conditions for 1701 ProbeRTT. If a connection goes idle during ProbeRTT then often it 1702 will have met those exit conditions by the time it restarts, so that 1703 the connection can restore the cwnd to its full value before it 1704 starts transmitting a new flight of data. 1706 4.4.3. Logic 1708 The BBR algorithm takes the following steps in 1709 BBRHandleRestartFromIdle() before sending a packet for a flow: 1711 BBRHandleRestartFromIdle(): 1712 if (packets_in_flight == 0 and C.app_limited) 1713 BBR.idle_restart = true 1714 BBR.extra_acked_interval_start = Now() 1715 if (IsInAProbeBWState()) 1716 BBRSetPacingRateWithGain(1) 1717 else if (BBR.state == ProbeRTT) 1718 BBRCheckProbeRTTDone() 1720 4.5. Updating Network Path Model Parameters 1722 BBR is a model-based congestion control algorithm: it is based on an 1723 explicit model of the network path over which a transport flow 1724 travels. The following is a summary of each parameter, including its 1725 meaning and how the algorithm calculates and uses its value. We can 1726 group the parameters into three groups: 1728 * core state machine parameters 1730 * parameters to model the data rate 1732 * parameters to model the volume of in-flight data 1734 4.5.1. Tracking Packet-Timed Round Trips and BBR.round_count 1736 Several aspects of the BBR algorithm depend on counting the progress 1737 of "packet-timed" round trips, which start at the transmission of 1738 some segment, and then end at the acknowledgement of that segment. 1739 BBR.round_count is a count of the number of these "packet-timed" 1740 round trips elapsed so far. BBR uses this virtual BBR.round_count 1741 because it is more robust than using wall clock time.
In particular, 1742 arbitrary intervals of wall clock time can elapse due to application 1743 idleness, variations in RTTs, or timer delays for retransmission 1744 timeouts, causing wall-clock-timed model parameter estimates to "time 1745 out" or to be "forgotten" too quickly to provide robustness. 1747 BBR counts packet-timed round trips by recording state about a 1748 sentinel packet, and waiting for an ACK of any data packet that was 1749 sent after that sentinel packet, using the following pseudocode: 1751 Upon connection initialization: 1753 BBRInitRoundCounting(): 1754 BBR.next_round_delivered = 0 1755 BBR.round_start = false 1756 BBR.round_count = 0 1758 Upon sending each packet, the rate estimation algorithm [draft-cheng- 1759 iccrg-delivery-rate-estimation] records the amount of data thus far 1760 acknowledged as delivered: 1762 packet.delivered = C.delivered 1764 Upon receiving an ACK for a given data packet, the rate estimation 1765 algorithm [draft-cheng-iccrg-delivery-rate-estimation] updates the 1766 amount of data thus far acknowledged as delivered: 1768 C.delivered += packet.size 1770 Upon receiving an ACK for a given data packet, the BBR algorithm 1771 first executes the following logic to see if a round trip has 1772 elapsed, and if so, increments the count of such round trips elapsed: 1774 BBRUpdateRound(): 1775 if (packet.delivered >= BBR.next_round_delivered) 1776 BBRStartRound() 1777 BBR.round_count++ 1778 BBR.rounds_since_bw_probe++ 1779 BBR.round_start = true 1780 else 1781 BBR.round_start = false 1783 BBRStartRound(): 1784 BBR.next_round_delivered = C.delivered 1786 4.5.2. BBR.max_bw 1788 BBR.max_bw is BBR's estimate of the maximum bottleneck bandwidth 1789 available to data transmissions for the transport flow. At any time, 1790 a transport connection's data transmissions experience some slowest 1791 link or bottleneck. The bottleneck's delivery rate determines the 1792 connection's maximum data-delivery rate. BBR tries to closely match 1793 its sending rate to this bottleneck delivery rate to help seek "rate 1794 balance", where the flow's packet arrival rate at the bottleneck 1795 equals the departure rate. The bottleneck rate varies over the life 1796 of a connection, so BBR continually estimates BBR.max_bw using recent 1797 signals. 1799 4.5.2.1. Delivery Rate Samples for Estimating BBR.max_bw 1801 Since calculating delivery rate samples is subtle, and the samples 1802 are useful independent of congestion control, the approach BBR uses 1803 for measuring each single delivery rate sample is specified in a 1804 separate Internet Draft [draft-cheng-iccrg-delivery-rate-estimation]. 1806 4.5.2.2. BBR.max_bw Max Filter 1808 Delivery rate samples are often below the typical bottleneck 1809 bandwidth available to the flow, due to "noise" introduced by random 1810 variation in physical transmission processes (e.g. radio link layer 1811 noise) or queues along the network path. To filter out these effects 1812 BBR uses a max filter: BBR estimates BBR.max_bw using the windowed 1813 maximum recent delivery rate sample seen by the connection over 1814 recent history. 1816 The BBR.max_bw max filter window covers a time period extending over 1817 the past two ProbeBW cycles. The BBR.max_bw max filter window length 1818 is driven by trade-offs among several considerations: 1820 * It is long enough to cover at least one entire ProbeBW cycle (see 1821 the "ProbeBW" section).
This ensures that the window contains at 1822 least some delivery rate samples that are the result of data 1823 transmitted with a super-unity pacing_gain (a pacing_gain larger 1824 than 1.0). Such super-unity delivery rate samples are 1825 instrumental in revealing the path's underlying available 1826 bandwidth even when there is noise from delivery rate shortfalls 1827 due to aggregation delays, queuing delays from variable cross- 1828 traffic, lossy link layers with uncorrected losses, or short-term 1829 buffer exhaustion (e.g., brief coincident bursts in a shallow 1830 buffer). 1832 * It aims to be long enough to cover short-term fluctuations in the 1833 network's delivery rate due to the aforementioned sources of 1834 noise. In particular, the delivery rate for radio link layers 1835 (e.g., wifi and cellular technologies) can be highly variable, and 1836 the filter window needs to be long enough to remember "good" 1837 delivery rate samples in order to be robust to such variations. 1839 * It aims to be short enough to respond in a timely manner to 1840 sustained reductions in the bandwidth available to a flow, whether 1841 this is because other flows are using a larger share of the 1842 bottleneck, or the bottleneck link service rate has reduced due to 1843 layer 1 or layer 2 changes, policy changes, or routing changes. 1844 In any of these cases, existing BBR flows traversing the 1845 bottleneck should, in a timely manner, reduce their BBR.max_bw 1846 estimates and thus pacing rate and in-flight data, in order to 1847 match the sending behavior to the new available bandwidth. 1849 4.5.2.3. BBR.max_bw and Application-limited Delivery Rate Samples 1851 Transmissions can be application-limited, meaning the transmission 1852 rate is limited by the application rather than the congestion control 1853 algorithm. This is quite common because of request/response traffic. 1854 When there is a transmission opportunity but no data to send, the 1855 delivery rate sampler marks the corresponding bandwidth sample(s) as 1856 application-limited [draft-cheng-iccrg-delivery-rate-estimation]. 1857 The BBR.max_bw estimator carefully decides which samples to include 1858 in the bandwidth model to ensure that BBR.max_bw reflects network 1859 limits, not application limits. By default, the estimator discards 1860 application-limited samples, since by definition they reflect 1861 application limits. However, the estimator does use application- 1862 limited samples if the measured delivery rate happens to be larger 1863 than the current BBR.max_bw estimate, since this indicates the 1864 current BBR.max_bw estimate is too low. 1866 4.5.2.4. Updating the BBR.max_bw Max Filter 1868 For every ACK that acknowledges some data packets as delivered, BBR 1869 invokes BBRUpdateMaxBw() to update the BBR.max_bw estimator as 1870 follows (here rs.delivery_rate is the delivery rate sample obtained 1871 from the ACK that is being processed, as specified in [draft-cheng- 1872 iccrg-delivery-rate-estimation]): 1874 BBRUpdateMaxBw() 1875 BBRUpdateRound() 1876 if (rs.delivery_rate >= BBR.max_bw || !rs.is_app_limited) 1877 BBR.max_bw = update_windowed_max_filter( 1878 filter=BBR.MaxBwFilter, 1879 value=rs.delivery_rate, 1880 time=BBR.cycle_count, 1881 window_length=MaxBwFilterLen) 1883 4.5.2.5. Tracking Time for the BBR.max_bw Max Filter 1885 BBR tracks time for the BBR.max_bw filter window using a virtual 1886 (non-wall-clock) time tracked by counting the cyclical progression 1887 through ProbeBW cycles.
Each time through the ProbeBW cycle, one 1888 round trip after exiting ProbeBW_UP (the point at which the flow has 1889 its best chance to measure the highest throughput of the cycle), BBR 1890 increments BBR.cycle_count, the virtual time used by the BBR.max_bw 1891 filter window. Note that BBR.cycle_count only needs to be tracked 1892 with a single bit, since the BBR.max_bw filter only needs to track 1893 samples from two time slots: the previous ProbeBW cycle and the 1894 current ProbeBW cycle: 1896 BBRAdvanceMaxBwFilter(): 1897 BBR.cycle_count++ 1899 4.5.3. BBR.min_rtt 1901 BBR.min_rtt is BBR's estimate of the round-trip propagation delay of 1902 the path over which a transport connection is sending. The path's 1903 round-trip propagation delay determines the minimum amount of time 1904 over which the connection must be willing to sustain transmissions at 1905 the BBR.bw rate, and thus the minimum amount of data needed in- 1906 flight, for the connection to reach full utilization (a "Full Pipe"). 1907 The round-trip propagation delay can vary over the life of a 1908 connection, so BBR continually estimates BBR.min_rtt using recent 1909 round-trip delay samples. 1911 4.5.3.1. Round-Trip Time Samples for Estimating BBR.min_rtt 1913 For every data packet a connection sends, BBR calculates an RTT 1914 sample that measures the time interval from sending a data packet 1915 until that packet is acknowledged. 1917 For the most part, the same considerations and mechanisms that apply 1918 to RTT estimation for the purposes of retransmission timeout 1919 calculations [RFC6298] apply to BBR RTT samples. Namely, BBR does 1920 not use RTT samples based on the transmission time of retransmitted 1921 packets, since these are ambiguous, and thus unreliable. Also, BBR 1922 calculates RTT samples using both cumulative and selective 1923 acknowledgments (if the transport supports [RFC2018] SACK options or 1924 an equivalent mechanism), or transport-layer timestamps (if the 1925 transport supports [RFC7323] TCP timestamps or an equivalent 1926 mechanism). 1928 The only divergence from RTT estimation for retransmission timeouts 1929 is in the case where a given acknowledgment ACKs more than one data 1930 packet. In order to be conservative and schedule long timeouts to 1931 avoid spurious retransmissions, the maximum among such potential RTT 1932 samples is typically used for computing retransmission timeouts; 1933 i.e., SRTT is typically calculated using the data packet with the 1934 earliest transmission time. By contrast, in order for BBR to try to 1935 reach the minimum amount of data in flight to fill the pipe, BBR uses 1936 the minimum among such potential RTT samples; i.e., BBR calculates 1937 the RTT using the data packet with the latest transmission time. 1939 4.5.3.2. BBR.min_rtt Min Filter 1941 RTT samples tend to be above the round-trip propagation delay of the 1942 path, due to "noise" introduced by random variation in physical 1943 transmission processes (e.g. radio link layer noise), queues along 1944 the network path, the receiver's delayed ack strategy, ack 1945 aggregation, etc. Thus to filter out these effects BBR uses a min 1946 filter: BBR estimates BBR.min_rtt using the minimum recent RTT sample 1947 seen by the connection over the past MinRTTFilterLen seconds.
(Many 1948 of the same network effects that can decrease delivery rate 1949 measurements can increase RTT samples, which is why BBR's min- 1950 filtering approach for RTTs is the complement of its max-filtering 1951 approach for delivery rates.) 1953 The length of the BBR.min_rtt min filter window is MinRTTFilterLen = 1954 10 secs. This is driven by trade-offs among several considerations: 1956 * The MinRTTFilterLen is longer than ProbeRTTInterval, so that it 1957 covers an entire ProbeRTT cycle (see the "ProbeRTT" section 1958 below). This helps ensure that the window can contain RTT samples 1959 that are the result of data transmitted with inflight below the 1960 estimated BDP of the flow. Such RTT samples are important for 1961 helping to reveal the path's underlying two-way propagation delay 1962 even when the aforementioned "noise" effects can often obscure it. 1964 * The MinRTTFilterLen aims to be long enough to avoid needing to cut 1965 in-flight and throughput often. Measuring two-way propagation 1966 delay requires in-flight to be at or below BDP, which risks some 1967 amount of underutilization, so BBR uses a filter window long 1968 enough that such underutilization events can be rare. 1970 * The MinRTTFilterLen aims to be long enough that many applications 1971 have a "natural" moment of silence or low utilization that can cut 1972 in-flight below BDP and naturally serve to refresh the 1973 BBR.min_rtt, without requiring BBR to force an artificial cut in 1974 in-flight. This applies to many popular applications, including 1975 Web, RPC, chunked audio or video traffic. 1977 * The MinRTTFilterLen aims to be short enough to respond in a timely 1978 manner to real increases in the two-way propagation delay of the 1979 path, e.g. due to route changes, which are expected to typically 1980 happen on longer time scales. 1982 A BBR implementation MAY use a generic windowed min filter to track 1983 BBR.min_rtt. However, a significant savings in space and improvement 1984 in freshness can be achieved by integrating the BBR.min_rtt 1985 estimation into the ProbeRTT state machine, so this document 1986 discusses that approach in the ProbeRTT section. 1988 4.5.4. BBR.offload_budget 1990 BBR.offload_budget is the estimate of the minimum volume of data 1991 necessary to achieve full throughput using sender (TSO/GSO) and 1992 receiver (LRO, GRO) host offload mechanisms, computed as follows: 1994 BBRUpdateOffloadBudget(): 1995 BBR.offload_budget = 3 * BBR.send_quantum 1997 The factor of 3 is chosen to allow maintaining at least: 1999 * 1 quantum in the sending host's queuing discipline layer 2001 * 1 quantum being segmented in the sending host TSO/GSO engine 2003 * 1 quantum being reassembled or otherwise remaining unacknowledged 2004 due to the receiver host's LRO/GRO/delayed-ACK engine 2006 4.5.5. BBR.extra_acked 2008 BBR.extra_acked is a volume of data that is the estimate of the 2009 recent degree of aggregation in the network path. For each ACK, the 2010 algorithm computes a sample of the estimated extra ACKed data beyond 2011 the amount of data that the sender expected to be ACKed over the 2012 timescale of a round-trip, given the BBR.bw. Then it computes 2013 BBR.extra_acked as the windowed maximum sample over the last 2014 BBRExtraAckedFilterLen=10 packet-timed round-trips. If the ACK rate 2015 falls below the expected bandwidth, then the algorithm estimates an 2016 aggregation episode has terminated, and resets the sampling interval 2017 to start from the current time. 
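To make the sampling concrete, the following is a minimal sketch in Python (illustrative only; the traffic values are assumptions, not constants from this document) of how a single aggregated ACK yields an "extra ACKed" sample. It collapses the interval accounting to one ACK; the precise per-ACK logic is given in the BBRUpdateACKAggregation() pseudocode below.

      # Sketch of one extra-ACKed sample; all values are assumed.
      bw = 100e6 / 8           # assumed BBR.bw: 100 Mbit/s, in bytes per second
      cwnd = 1_000_000         # assumed cwnd, in bytes
      interval = 0.020         # 20 ms since BBR.extra_acked_interval_start
      newly_acked = 500_000    # bytes ACKed by one aggregated ACK after a
                               # link-layer batching episode (e.g. wifi)

      expected_delivered = bw * interval             # 250,000 bytes expected
      extra = max(newly_acked - expected_delivered, 0)
      extra = min(extra, cwnd)                       # bound the sample by cwnd
      print(extra)             # 250,000 bytes offered to the max filter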
2019 The BBR.extra_acked thus reflects the recently-measured magnitude of 2020 data and ACK aggregation effects such as batching and slotting at 2021 shared-medium L2 hops (wifi, cellular, DOCSIS), as well as end-host 2022 offload mechanisms (TSO, GSO, LRO, GRO), and end host or middlebox 2023 ACK decimation/thinning. 2025 BBR augments its cwnd by BBR.extra_acked to allow the connection to 2026 keep sending during inter-ACK silences, to an extent that matches the 2027 recently measured degree of aggregation. 2029 More precisely, this is computed as: 2031 BBRUpdateACKAggregation(): 2032 /* Find excess ACKed beyond expected amount over this interval */ 2033 interval = (Now() - BBR.extra_acked_interval_start) 2034 expected_delivered = BBR.bw * interval 2035 /* Reset interval if ACK rate is below expected rate: */ 2036 if (BBR.extra_acked_delivered <= expected_delivered) 2037 BBR.extra_acked_delivered = 0 2038 BBR.extra_acked_interval_start = Now() 2039 expected_delivered = 0 2040 BBR.extra_acked_delivered += rs.newly_acked 2041 extra = BBR.extra_acked_delivered - expected_delivered 2042 extra = min(extra, cwnd) 2043 BBR.extra_acked = 2044 update_windowed_max_filter( 2045 filter=BBR.ExtraACKedFilter, 2046 value=extra, 2047 time=BBR.round_count, 2048 window_length=BBRExtraAckedFilterLen) 2050 4.5.6. Updating the Model Upon Packet Loss 2052 In every state, BBR responds to (filtered) congestion signals, 2053 including loss. The response to those congestion signals depends on 2054 the flow's current state, since the information that the flow can 2055 infer depends on what the flow was doing when the flow experienced 2056 the signal. 2058 4.5.6.1. Probing for Bandwidth In Startup 2060 In Startup, if the congestion signals meet the Startup exit criteria, 2061 the flow exits Startup and enters Drain. 2063 4.5.6.2. Probing for Bandwidth In ProbeBW 2065 BBR searches for the maximum volume of data that can be sensibly 2066 placed in-flight in the network. A key precondition is that the flow 2067 is actually trying robustly to find that operating point. To 2068 implement this, when a flow is in ProbeBW, and an ACK covers data 2069 sent in one of the accelerating phases (REFILL or UP), and the ACK 2070 indicates that the loss rate over the past round trip exceeds the 2071 queue pressure objective, and the flow is not application limited, 2072 and has not yet responded to congestion signals from the most recent 2073 REFILL or UP phase, then the flow estimates that the volume of data 2074 it allowed in flight exceeded what matches the current delivery 2075 process on the path, and reduces BBR.inflight_hi: 2077 /* Do loss signals suggest inflight is too high? 2078 * If so, react. */ 2079 CheckInflightTooHigh(): 2080 if (IsInflightTooHigh(rs)) 2081 if (BBR.bw_probe_samples) 2082 BBRHandleInflightTooHigh(rs) 2083 return true /* inflight too high */ 2084 else 2085 return false /* inflight not too high */ 2087 IsInflightTooHigh(rs): 2088 return (rs.lost > rs.tx_in_flight * BBRLossThresh) 2090 BBRHandleInflightTooHigh(rs): 2091 BBR.bw_probe_samples = 0; /* only react once per bw probe */ 2092 if (!rs.is_app_limited) 2093 BBR.inflight_hi = max(rs.tx_in_flight, 2094 BBRTargetInflight() * BBRBeta) 2095 if (BBR.state == ProbeBW_UP) 2096 BBRStartProbeBW_DOWN() 2098 Here rs.tx_in_flight is the amount of data that was estimated to be 2099 in flight when the most recently ACKed packet was sent.
And the 2100 BBRBeta (0.7x) bound is to try to ensure that BBR does not react more 2101 dramatically than CUBIC's 0.7x multiplicative decrease factor. 2103 Some loss detection algorithms, including algorithms like RACK 2104 [RFC8985] that delay loss marking while waiting for potential 2105 reordering to resolve, may mark packets as lost long after the loss 2106 itself happened. In such cases, the tx_in_flight for the delivered 2107 sequence range that allowed the loss to be detected may be 2108 considerably smaller than the tx_in_flight of the lost packet itself. 2109 In such cases using the former tx_in_flight rather than the latter 2110 can cause BBR.inflight_hi to be significantly underestimated. To 2111 avoid such issues, BBR processes each loss detection event to more 2112 precisely estimate the volume of in-flight data at which loss rates 2113 cross BBRLossThresh, noting that this may have happened mid-way 2114 through some packet. To estimate this value, we can solve for 2115 "lost_prefix" in the following equation, where inflight_prev 2116 represents the volume of in-flight data preceding this packet, 2117 lost_prev represents the data lost among that previous in-flight 2118 data: 2120 lost / inflight >= BBRLossThresh 2121 (lost_prev + lost_prefix) / (inflight_prev + lost_prefix) >= BBRLossThresh 2122 /* solving for lost_prefix we arrive at: */ 2123 lost_prefix = (BBRLossThresh * inflight_prev - lost_prev) / (1 - BBRLossThresh) 2125 In pseudocode: 2127 BBRHandleLostPacket(packet): 2128 if (!BBR.bw_probe_samples) 2129 return /* not a packet sent while probing bandwidth */ 2130 rs.tx_in_flight = packet.tx_in_flight /* inflight at transmit */ 2131 rs.lost = C.lost - packet.lost /* data lost since transmit */ 2132 rs.is_app_limited = packet.is_app_limited; 2133 if (IsInflightTooHigh(rs)) 2134 rs.tx_in_flight = BBRInflightHiFromLostPacket(rs, packet) 2135 BBRHandleInflightTooHigh(rs) 2137 /* At what prefix of packet did losses exceed BBRLossThresh? */ 2138 BBRInflightHiFromLostPacket(rs, packet): 2139 size = packet.size 2140 /* What was in flight before this packet? */ 2141 inflight_prev = rs.tx_in_flight - size 2142 /* What was lost before this packet? */ 2143 lost_prev = rs.lost - size 2144 lost_prefix = (BBRLossThresh * inflight_prev - lost_prev) / 2145 (1 - BBRLossThresh) 2146 /* At what inflight value did losses cross BBRLossThresh? */ 2147 inflight = inflight_prev + lost_prefix 2148 return inflight 2150 4.5.6.3. When not Probing for Bandwidth 2152 When not explicitly accelerating to probe for bandwidth (Drain, 2153 ProbeRTT, ProbeBW_DOWN, ProbeBW_CRUISE), BBR responds to loss by 2154 slowing down to some extent. This is because loss suggests that the 2155 available bandwidth and safe volume of in-flight data may have 2156 decreased recently, and the flow needs to adapt, slowing down toward 2157 the latest delivery process. BBR flows implement this response by 2158 reducing the short-term model parameters, BBR.bw_lo and 2159 BBR.inflight_lo. 2161 When encountering packet loss when the flow is not probing for 2162 bandwidth, the strategy is to gradually adapt to the current measured 2163 delivery process (the rate and volume of data that is delivered 2164 through the network path over the last round trip). This applies 2165 generally: whether in fast recovery, RTO recovery, TLP recovery; 2166 whether application-limited or not. 
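As a rough numeric illustration of this gradual adaptation, the following Python sketch (illustrative only; it applies the per-loss-round bw_lo update rule given just below, with assumed bandwidth values that are not normative) shows BBR.bw_lo decaying across consecutive lossy rounds:

      # Sketch: decay of BBR.bw_lo across lossy rounds; values are assumed.
      BBRBeta = 0.7                 # multiplicative decrease factor
      bw_lo = 10_000_000            # assumed starting bw_lo = BBR.max_bw, bytes/sec
      bw_latest = 6_000_000         # assumed measured delivery rate each round

      for lossy_round in range(1, 5):
          bw_lo = max(bw_latest, BBRBeta * bw_lo)
          print(lossy_round, bw_lo)  # 7,000,000 then 6,000,000 thereafter

The point is that bw_lo (and, analogously, inflight_lo) never decays below the most recently measured delivery process, so the flow converges toward what the path is currently delivering rather than decreasing multiplicatively without bound.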
2168 There are two key parameters the algorithm tracks to measure the 2169 current delivery process: 2171 BBR.bw_latest: a 1-round-trip max of delivered bandwidth 2172 (rs.delivery_rate). 2174 BBR.inflight_latest: a 1-round-trip max of delivered volume of data 2175 (rs.delivered). 2177 Upon the ACK at the end of each round that encountered a newly-marked 2178 loss, the flow updates its model (bw_lo and inflight_lo) as follows: 2180 bw_lo = max( bw_latest, BBRBeta * bw_lo ) 2181 inflight_lo = max( inflight_latest, BBRBeta * inflight_lo ) 2183 This logic can be represented as follows: 2185 /* Near start of ACK processing: */ 2186 BBRUpdateLatestDeliverySignals(): 2187 BBR.loss_round_start = 0 2188 BBR.bw_latest = max(BBR.bw_latest, rs.delivery_rate) 2189 BBR.inflight_latest = max(BBR.inflight_latest, rs.delivered) 2190 if (rs.prior_delivered >= BBR.loss_round_delivered) 2191 BBR.loss_round_delivered = C.delivered 2192 BBR.loss_round_start = 1 2194 /* Near end of ACK processing: */ 2195 BBRAdvanceLatestDeliverySignals(): 2196 if (BBR.loss_round_start) 2197 BBR.bw_latest = rs.delivery_rate 2198 BBR.inflight_latest = rs.delivered 2200 BBRResetCongestionSignals(): 2201 BBR.loss_in_round = 0 2202 BBR.bw_latest = 0 2203 BBR.inflight_latest = 0 2205 /* Update congestion state on every ACK */ 2206 BBRUpdateCongestionSignals(): 2207 BBRUpdateMaxBw() 2208 if (rs.losses > 0) 2209 BBR.loss_in_round = 1 2210 if (!BBR.loss_round_start) 2211 return /* wait until end of round trip */ 2212 BBRAdaptLowerBoundsFromCongestion() 2213 BBR.loss_in_round = 0 2215 /* Once per round-trip respond to congestion */ 2216 BBRAdaptLowerBoundsFromCongestion(): 2217 if (BBRIsProbingBW()) 2218 return 2219 if (BBR.loss_in_round) 2220 BBRInitLowerBounds() 2221 BBRLossLowerBounds() 2223 /* Handle the first congestion episode in this cycle */ 2224 BBRInitLowerBounds(): 2225 if (BBR.bw_lo == Infinity) 2226 BBR.bw_lo = BBR.max_bw 2227 if (BBR.inflight_lo == Infinity) 2228 BBR.inflight_lo = cwnd 2230 /* Adjust model once per round based on loss */ 2231 BBRLossLowerBounds() 2232 BBR.bw_lo = max(BBR.bw_latest, 2233 BBRBeta * BBR.bw_lo) 2234 BBR.inflight_lo = max(BBR.inflight_latest, 2235 BBRBeta * BBR.inflight_lo) 2237 BBRResetLowerBounds(): 2238 BBR.bw_lo = Infinity 2239 BBR.inflight_lo = Infinity 2241 BBRBoundBWForModel(): 2242 BBR.bw = min(BBR.max_bw, BBR.bw_lo, BBR.bw_hi) 2244 4.6. Updating Control Parameters 2246 BBR uses three distinct but interrelated control parameters: pacing 2247 rate, send quantum, and congestion window (cwnd). 2249 4.6.1. Summary of Control Behavior in the State Machine 2251 The following table summarizes how BBR modulates the control 2252 parameters in each state. In the table below, the semantics of the 2253 columns are as follows: 2255 * State: the state in the BBR state machine, as depicted in the 2256 "State Transition Diagram" section above. 2258 * Tactic: The tactic chosen from the "State Machine Tactics" 2259 subsection above: "accel" refers to acceleration, "decel" to 2260 deceleration, and "cruise" to cruising. 2262 * Pacing Gain: the value used for BBR.pacing_gain in the given 2263 state. 2265 * Cwnd Gain: the value used for BBR.cwnd_gain in the given state. 2267 * Rate Cap: the rate values applied as bounds on the BBR.max_bw 2268 value used to compute BBR.bw. 2270 * Volume Cap: the volume values applied as bounds on the 2271 BBR.max_inflight value used to compute cwnd.
2273 The control behavior can be summarized as follows: 2275 +-----------------+--------+--------+------+--------+------------------+ 2276 | State | Tactic | Pacing | Cwnd | Rate | Volume | 2277 | | | Gain | Gain | Cap | Cap | 2278 +-----------------+--------+--------+------+--------+------------------+ 2279 | Startup | accel | 2.77 | 2 | | | 2280 | | | | | | | 2281 +-----------------+--------+--------+------+--------+------------------+ 2282 | Drain | decel | 0.5 | 2 | bw_hi, | inflight_hi, | 2283 | | | | | bw_lo | inflight_lo | 2284 +-----------------+--------+--------+------+--------+------------------+ 2285 | ProbeBW_DOWN | decel | 0.75 | 2 | bw_hi, | inflight_hi, | 2286 | | | | | bw_lo | inflight_lo | 2287 +-----------------+--------+--------+------+--------+------------------+ 2288 | ProbeBW_CRUISE | cruise | 1.0 | 2 | bw_hi, | 0.85*inflight_hi | 2289 | | | | | bw_lo | inflight_lo | 2290 +-----------------+--------+--------+------+--------+------------------+ 2291 | ProbeBW_REFILL | accel | 1.0 | 2 | bw_hi | inflight_hi | 2292 | | | | | | | 2293 +-----------------+--------+--------+------+--------+------------------+ 2294 | ProbeBW_UP | accel | 1.25 | 2 | bw_hi | inflight_hi | 2295 | | | | | | | 2296 +-----------------+--------+--------+------+--------+------------------+ 2297 | ProbeRTT | decel | 1.0 | 0.5 | bw_hi, | 0.85*inflight_hi | 2298 | | | | | bw_lo | inflight_lo | 2299 +-----------------+--------+--------+------+--------+------------------+ 2301 Upon processing each ACK, BBR uses the values in the table above to 2302 compute BBR.bw in BBRBoundBWForModel(), and the cwnd in 2303 BBRBoundCwndForModel(). 2305 The pacing gain values are chosen so that the average of the 2306 ProbeBW_UP pacing gain of 1.25 and ProbeBW_DOWN pacing gain of 0.75 2307 is a value of 1.0. Since the ProbeBW_CRUISE and ProbeBW_REFILL 2308 states also have a pacing gain of 1.0, on average during the entire 2309 cycle the pacing gain is 1.0. This is done to allow for the average 2310 pacing rate to equal BBR.bw, the estimated available bandwidth, and 2311 thus allow for the flow -- to the extent that its BBR.bw estimate is 2312 accurate -- to maintain high utilization, while maintaining a small, 2313 well-bounded queue. 2315 4.6.2. Pacing Rate: BBR.pacing_rate 2317 To help match the packet-arrival rate to the bottleneck bandwidth 2318 available to the flow, BBR paces data packets. Pacing enforces a 2319 maximum rate at which BBR schedules quanta of packets for 2320 transmission. 2322 The sending host implements pacing by maintaining inter-quantum 2323 spacing at the time each packet is scheduled for departure, 2324 calculating the next departure time for a packet for a given flow 2325 (BBR.next_departure_time) as a function of the most recent packet 2326 size and the current pacing rate, as follows: 2328 BBR.next_departure_time = max(Now(), BBR.next_departure_time) 2329 packet.departure_time = BBR.next_departure_time 2330 pacing_delay = packet.size / BBR.pacing_rate 2331 BBR.next_departure_time = BBR.next_departure_time + pacing_delay 2333 To adapt to the bottleneck, in general BBR sets the pacing rate to be 2334 proportional to bw, with a dynamic gain, or scaling factor of 2335 proportionality, called pacing_gain. 2337 When a BBR flow starts it has no bw estimate (bw is 0). So in this 2338 case it sets an initial pacing rate based on the transport sender 2339 implementation's initial congestion window ("InitialCwnd", e.g. 
from 2340 [RFC6928]), the initial SRTT (smoothed round-trip time) after the 2341 first non-zero RTT sample, and the initial pacing_gain: 2343 BBRInitPacingRate(): 2344 nominal_bandwidth = InitialCwnd / (SRTT ? SRTT : 1ms) 2345 BBR.pacing_rate = BBRStartupPacingGain * nominal_bandwidth 2347 After initialization, on each data ACK BBR updates its pacing rate to 2348 be proportional to bw, as long as it estimates that it has filled the 2349 pipe (BBR.filled_pipe is true; see the "Startup" section for 2350 details), or doing so increases the pacing rate. Limiting the pacing 2351 rate updates in this way helps the connection probe robustly for 2352 bandwidth until it estimates it has reached its full available 2353 bandwidth ("filled the pipe"). In particular, this prevents the 2354 pacing rate from being reduced when the connection has only seen 2355 application-limited bandwidth samples. BBR updates the pacing rate 2356 on each ACK by executing the BBRSetPacingRate() step as follows: 2358 BBRSetPacingRateWithGain(pacing_gain): 2359 rate = pacing_gain * bw * (100 - BBRPacingMarginPercent) / 100 2360 if (BBR.filled_pipe || rate > BBR.pacing_rate) 2361 BBR.pacing_rate = rate 2363 BBRSetPacingRate(): 2364 BBRSetPacingRateWithGain(BBR.pacing_gain) 2366 To help drive the network toward lower queues and low latency while 2367 maintaining high utilization, the BBRPacingMarginPercent constant of 2368 1 aims to cause BBR to pace at 1% below the bw, on average. 2370 4.6.3. Send Quantum: BBR.send_quantum 2372 In order to amortize per-packet overheads involved in the sending 2373 process (host CPU, NIC processing, and interrupt processing delays), 2374 high-performance transport sender implementations (e.g., Linux TCP) 2375 often schedule an aggregate containing multiple packets (multiple 2376 SMSS) worth of data as a single quantum (using TSO, GSO, or other 2377 offload mechanisms). The BBR congestion control algorithm makes this 2378 control decision explicitly, dynamically calculating a quantum 2379 control parameter that specifies the maximum size of these 2380 transmission aggregates. This decision is based on a trade-off: 2382 * A smaller quantum is preferred at lower data rates because it 2383 results in shorter packet bursts, shorter queues, lower queueing 2384 delays, and lower rates of packet loss. 2386 * A bigger quantum can be required at higher data rates because it 2387 results in lower CPU overheads at the sending and receiving hosts, 2388 who can ship larger amounts of data with a single trip through the 2389 networking stack. 2391 On each ACK, BBR runs BBRSetSendQuantum() to update BBR.send_quantum 2392 as follows: 2394 BBRSetSendQuantum(): 2395 if (BBR.pacing_rate < 1.2 Mbps) 2396 floor = 1 * SMSS 2397 else 2398 floor = 2 * SMSS 2399 BBR.send_quantum = min(BBR.pacing_rate * 1ms, 64KBytes) 2400 BBR.send_quantum = max(BBR.send_quantum, floor) 2402 A BBR implementation MAY use alternate approaches to select a 2403 BBR.send_quantum, as appropriate for the CPU overheads anticipated 2404 for senders and receivers, and buffering considerations anticipated 2405 in the network path. However, for the sake of the network and other 2406 users, a BBR implementation SHOULD attempt to use the smallest 2407 feasible quanta. 2409 4.6.4. Congestion Window 2411 The congestion window, or cwnd, controls the maximum volume of data 2412 BBR allows in flight in the network at any time. 
It is the maximum 2413 volume of in-flight data that the algorithm estimates is appropriate 2414 for matching the current network path delivery process, given all 2415 available signals in the model, at any time scale. BBR adapts the 2416 cwnd based on its model of the network path and the state machine's 2417 decisions about how to probe that path. 2419 By default, BBR grows its cwnd to meet its BBR.max_inflight, which 2420 models what's required for achieving full throughput, and as such is 2421 scaled to adapt to the estimated BDP computed from its path model. 2422 But BBR's selection of cwnd is designed to explicitly trade off among 2423 competing considerations that dynamically adapt to various 2424 conditions. So in loss recovery BBR more conservatively adjusts its 2425 sending behavior based on more recent delivery samples, and if BBR 2426 needs to re-probe the current BBR.min_rtt of the path then it cuts 2427 its cwnd accordingly. The following sections describe the various 2428 considerations that impact cwnd. 2430 4.6.4.1. Initial cwnd 2432 BBR generally uses measurements to build a model of the network path 2433 and then adapts control decisions to the path based on that model. 2434 As such, the selection of the initial cwnd is considered to be 2435 outside the scope of the BBR algorithm, since at initialization there 2436 are no measurements yet upon which BBR can operate. Thus, at 2437 initialization, BBR uses the transport sender implementation's 2438 initial congestion window (e.g. from [RFC6928] for TCP). 2440 4.6.4.2. Computing BBR.max_inflight 2442 BBR.max_inflight is the upper bound on the volume of data BBR 2443 allows in flight. This bound is always in place, and dominates when 2444 all other considerations have been satisfied: the flow is not in loss 2445 recovery, does not need to probe BBR.min_rtt, and has accumulated 2446 confidence in its model parameters by receiving enough ACKs to 2447 gradually grow the current cwnd to meet the BBR.max_inflight. 2449 On each ACK, BBR calculates the BBR.max_inflight in 2450 BBRUpdateMaxInflight() as follows: 2452 BBRBDPMultiple(gain): 2453 if (BBR.min_rtt == Infinity) 2454 return InitialCwnd /* no valid RTT samples yet */ 2455 BBR.bdp = BBR.bw * BBR.min_rtt 2456 return gain * BBR.bdp 2458 BBRQuantizationBudget(inflight) 2459 BBRUpdateOffloadBudget() 2460 inflight = max(inflight, BBR.offload_budget) 2461 inflight = max(inflight, BBRMinPipeCwnd) 2462 if (BBR.state == ProbeBW_UP) 2463 inflight += 2 2464 return inflight 2466 BBRInflight(gain): 2467 inflight = BBRBDPMultiple(gain) 2468 return BBRQuantizationBudget(inflight) 2470 BBRUpdateMaxInflight(): 2471 BBRUpdateAggregationBudget() 2472 inflight = BBRBDPMultiple(BBR.cwnd_gain) 2473 inflight += BBR.extra_acked 2474 BBR.max_inflight = BBRQuantizationBudget(inflight) 2476 The estimated-BDP term tries to allow enough packets in flight to 2477 fully utilize the estimated BDP of the path, by allowing the flow to 2478 send at BBR.bw for a duration of BBR.min_rtt. Scaling up the BDP by 2479 BBR.cwnd_gain bounds in-flight data to a small multiple of the BDP, 2480 to handle common network and receiver behavior, such as delayed, 2481 stretched, or aggregated ACKs [A15]. The "quanta" term allows enough 2482 quanta in flight on the sending and receiving hosts to reach high 2483 throughput even in environments using offload mechanisms. 2485 4.6.4.3.
Minimum cwnd for Pipelining 2487 For BBR.max_inflight, BBR imposes a floor of BBRMinPipeCwnd (4 2488 packets, i.e. 4 * SMSS). This floor helps ensure that even at very 2489 low BDPs, and with a transport like TCP where a receiver may ACK only 2490 every alternate SMSS of data, there are enough packets in flight to 2491 maintain full pipelining. In particular BBR tries to allow at least 2492 2 data packets in flight and ACKs for at least 2 data packets on the 2493 path from receiver to sender. 2495 4.6.4.4. Modulating cwnd in Loss Recovery 2497 BBR interprets loss as a hint that there may be recent changes in 2498 path behavior that are not yet fully reflected in its model of the 2499 path, and thus it needs to be more conservative. 2501 Upon a retransmission timeout (RTO), BBR conservatively reduces cwnd 2502 to a value that will allow 1 SMSS to be transmitted. Then BBR 2503 gradually increases cwnd using the normal approach outlined below in 2504 "Core cwnd Adjustment Mechanism". 2506 When a BBR sender detects packet loss but there are still packets in 2507 flight, on the first round of the loss-repair process BBR temporarily 2508 reduces the cwnd to match the current delivery rate as ACKs arrive. 2509 On second and later rounds of loss repair, it ensures the sending 2510 rate never exceeds twice the current delivery rate as ACKs arrive. 2512 When BBR exits loss recovery it restores the cwnd to the "last known 2513 good" value that cwnd held before entering recovery. This applies 2514 equally whether the flow exits loss recovery because it finishes 2515 repairing all losses or because it executes an "undo" event after 2516 inferring that a loss recovery event was spurious. 2518 There are several ways to implement this high-level design for 2519 updating cwnd in loss recovery. One is as follows: 2521 Upon retransmission timeout (RTO): 2523 BBROnEnterRTO(): 2524 BBR.prior_cwnd = BBRSaveCwnd() 2525 cwnd = packets_in_flight + 1 2527 Upon entering Fast Recovery, set cwnd to the number of packets still 2528 in flight (allowing at least one for a fast retransmit): 2530 BBROnEnterFastRecovery(): 2531 BBR.prior_cwnd = BBRSaveCwnd() 2532 cwnd = packets_in_flight + max(rs.newly_acked, 1) 2533 BBR.packet_conservation = true 2535 Upon every ACK in Fast Recovery, run the following 2536 BBRModulateCwndForRecovery() steps, which help ensure packet 2537 conservation on the first round of recovery, and sending at no more 2538 than twice the current delivery rate on later rounds of recovery 2539 (given that "rs.newly_acked" packets were newly marked ACKed or 2540 SACKed and "rs.newly_lost" were newly marked lost): 2542 BBRModulateCwndForRecovery(): 2543 if (rs.newly_lost > 0) 2544 cwnd = max(cwnd - rs.newly_lost, 1) 2545 if (BBR.packet_conservation) 2546 cwnd = max(cwnd, packets_in_flight + rs.newly_acked) 2548 After one round-trip in Fast Recovery: 2550 BBR.packet_conservation = false 2552 Upon exiting loss recovery (RTO recovery or Fast Recovery), either by 2553 repairing all losses or undoing recovery, BBR restores the best-known 2554 cwnd value we had upon entering loss recovery: 2556 BBR.packet_conservation = false 2557 BBRRestoreCwnd() 2559 Note that exiting loss recovery happens during ACK processing, and at 2560 the end of ACK processing BBRBoundCwndForModel() will bound the cwnd 2561 based on the current model parameters. Thus the cwnd and pacing rate 2562 after loss recovery will generally be smaller than the values 2563 entering loss recovery. 
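As a concrete illustration of that last point, the following Python sketch (illustrative only; the packet counts are assumed, and BBRBoundCwndForModel() is collapsed to just its inflight_lo term) traces why the post-recovery cwnd is usually smaller than the pre-recovery cwnd:

      # Sketch: cwnd across recovery exit; all counts are assumed.
      BBRBeta = 0.7
      prior_cwnd = 100       # cwnd saved via BBRSaveCwnd() when recovery began
      cwnd = 60              # cwnd while in recovery, matched to delivery rate
      inflight_lo = 100      # initialized to cwnd on the first lossy round
      inflight_latest = 60   # data delivered over the most recent round

      # Per-loss-round adaptation (see "When not Probing for Bandwidth"):
      inflight_lo = max(inflight_latest, BBRBeta * inflight_lo)   # 70 packets

      # On exit, BBRRestoreCwnd() restores the last known good cwnd ...
      cwnd = max(cwnd, prior_cwnd)                                # 100 packets
      # ... but BBRBoundCwndForModel() then re-caps cwnd with the model:
      cwnd = min(cwnd, inflight_lo)                               # 70 packets
      print(cwnd)

In the full algorithm the cap also considers inflight_hi and the BBRMinPipeCwnd floor, but the net effect is the same: the restored cwnd is immediately re-bounded by the current model parameters.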
The BBRSaveCwnd() and BBRRestoreCwnd() helpers remember and restore the last-known good cwnd (the latest cwnd value unmodulated by loss recovery or ProbeRTT), and are defined as follows:

   BBRSaveCwnd():
     if (!InLossRecovery() and BBR.state != ProbeRTT)
       return cwnd
     else
       return max(BBR.prior_cwnd, cwnd)

   BBRRestoreCwnd():
     cwnd = max(cwnd, BBR.prior_cwnd)

4.6.4.5.  Modulating cwnd in ProbeRTT

If BBR decides it needs to enter the ProbeRTT state (see the "ProbeRTT" section below), its goal is to quickly reduce the volume of in-flight data and drain the bottleneck queue, thereby allowing measurement of BBR.min_rtt. To implement this mode, BBR bounds the cwnd to BBRProbeRTTCwnd(): a fraction (BBRProbeRTTCwndGain) of the estimated BDP, with a floor of BBRMinPipeCwnd, the minimal value that allows pipelining (see the "Minimum cwnd for Pipelining" section, above):

   BBRProbeRTTCwnd():
     probe_rtt_cwnd = BBRBDPMultiple(BBRProbeRTTCwndGain)
     probe_rtt_cwnd = max(probe_rtt_cwnd, BBRMinPipeCwnd)
     return probe_rtt_cwnd

   BBRBoundCwndForProbeRTT():
     if (BBR.state == ProbeRTT)
       cwnd = min(cwnd, BBRProbeRTTCwnd())

4.6.4.6.  Core cwnd Adjustment Mechanism

The network path and the traffic traveling over it can make sudden dramatic changes. To adapt to these changes smoothly and robustly, and to reduce packet losses in such cases, BBR uses a conservative strategy. When cwnd is above the BBR.max_inflight derived from BBR's path model, BBR cuts the cwnd immediately to BBR.max_inflight. When cwnd is below BBR.max_inflight, BBR raises the cwnd gradually and cautiously, increasing cwnd by no more than the amount of data acknowledged (cumulatively or selectively) upon each ACK.

Specifically, on each ACK that acknowledges "rs.newly_acked" packets as newly ACKed or SACKed, BBR runs the following BBRSetCwnd() steps to update cwnd:

   BBRSetCwnd():
     BBRUpdateMaxInflight()
     BBRModulateCwndForRecovery()
     if (!BBR.packet_conservation) {
       if (BBR.filled_pipe)
         cwnd = min(cwnd + rs.newly_acked, BBR.max_inflight)
       else if (cwnd < BBR.max_inflight || C.delivered < InitialCwnd)
         cwnd = cwnd + rs.newly_acked
       cwnd = max(cwnd, BBRMinPipeCwnd)
     }
     BBRBoundCwndForProbeRTT()
     BBRBoundCwndForModel()

There are several considerations embodied in the logic above. If BBR has measured enough samples to achieve confidence that it has filled the pipe (see the description of BBR.filled_pipe in the "Startup" section below), then it increases its cwnd based on the number of packets delivered, while bounding its cwnd to be no larger than the BBR.max_inflight adapted to the estimated BDP. Otherwise, if the cwnd is below the BBR.max_inflight, or the sender has marked so little data delivered (less than InitialCwnd) that it does not yet judge its BBR.max_bw estimate and BBR.max_inflight as useful, then it increases cwnd without applying the BBR.max_inflight bound. Finally, BBR imposes a floor of BBRMinPipeCwnd in order to allow pipelining even with small BDPs (see the "Minimum cwnd for Pipelining" section, above).
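The following C sketch illustrates the core per-ACK growth logic of BBRSetCwnd() in isolation. It is a simplified illustration under stated assumptions: cwnd is tracked in whole packets, the struct fields and helper names are hypothetical, and the recovery and ProbeRTT/model modulations are reduced to a packet_conservation flag and a trailing comment.

   /* Sketch of the core cwnd growth logic in BBRSetCwnd(); names,
    * types, and the whole-packet units are assumptions. */
   #include <stdio.h>

   #define BBR_MIN_PIPE_CWND 4         /* packets */

   struct bbr_cwnd_state {
       unsigned int  cwnd;             /* congestion window, packets */
       unsigned int  max_inflight;     /* from BBRUpdateMaxInflight() */
       unsigned int  initial_cwnd;     /* transport's initial window */
       unsigned long delivered;        /* packets delivered so far (C.delivered) */
       int filled_pipe;                /* set once Startup has filled the pipe */
       int packet_conservation;        /* set during first round of recovery */
   };

   static unsigned int umin(unsigned int a, unsigned int b)
   {
       return a < b ? a : b;
   }

   static unsigned int umax(unsigned int a, unsigned int b)
   {
       return a > b ? a : b;
   }

   static void bbr_set_cwnd(struct bbr_cwnd_state *s, unsigned int newly_acked)
   {
       if (s->packet_conservation)
           return;  /* growth handled by the recovery modulation instead */

       if (s->filled_pipe) {
           /* Confident in the model: grow, but never beyond max_inflight. */
           s->cwnd = umin(s->cwnd + newly_acked, s->max_inflight);
       } else if (s->cwnd < s->max_inflight ||
                  s->delivered < s->initial_cwnd) {
           /* Not yet confident in the model: grow without that bound. */
           s->cwnd += newly_acked;
       }
       s->cwnd = umax(s->cwnd, BBR_MIN_PIPE_CWND);
       /* A full implementation would then apply BBRBoundCwndForProbeRTT()
        * and BBRBoundCwndForModel(). */
   }

   int main(void)
   {
       struct bbr_cwnd_state s = {
           .cwnd = 20, .max_inflight = 24, .initial_cwnd = 10,
           .delivered = 500, .filled_pipe = 1,
       };
       bbr_set_cwnd(&s, 10);  /* 10 packets newly ACKed or SACKed */
       printf("cwnd = %u packets\n", s.cwnd); /* capped at max_inflight = 24 */
       return 0;
   }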
4.6.4.7.  Bounding cwnd Based on Recent Congestion

Finally, BBR bounds the cwnd based on recent congestion, as outlined in the "Volume Cap" column of the table in the "Summary of Control Behavior in the State Machine" section:

   BBRBoundCwndForModel():
     cap = Infinity
     if (IsInAProbeBWState() and
         BBR.state != ProbeBW_CRUISE)
       cap = BBR.inflight_hi
     else if (BBR.state == ProbeRTT or
              BBR.state == ProbeBW_CRUISE)
       cap = BBRInflightWithHeadroom()

     /* apply inflight_lo (possibly infinite): */
     cap = min(cap, BBR.inflight_lo)
     cap = max(cap, BBRMinPipeCwnd)
     cwnd = min(cwnd, cap)

5.  Implementation Status

This section records the status of known implementations of the algorithm defined by this specification at the time of posting of this Internet-Draft, and is based on a proposal described in [RFC7942]. The description of implementations in this section is intended to assist the IETF in its decision processes in progressing drafts to RFCs. Please note that the listing of any individual implementation here does not imply endorsement by the IETF. Furthermore, no effort has been spent to verify the information presented here that was supplied by IETF contributors. This is not intended as, and must not be construed to be, a catalog of available implementations or their features. Readers are advised to note that other implementations may exist.

According to [RFC7942], "this will allow reviewers and working groups to assign due consideration to documents that have the benefit of running code, which may serve as evidence of valuable experimentation and feedback that have made the implemented protocols more mature. It is up to the individual working groups to use this information as they see fit".

As of the time of writing, the following implementations of BBR have been publicly released:

*  Linux TCP

   -  Source code URLs:

      o  https://github.com/google/bbr/blob/v2alpha/README.md

      o  https://github.com/google/bbr/blob/v2alpha/net/ipv4/tcp_bbr2.c

   -  Source: Google

   -  Maturity: production

   -  License: dual-licensed: GPLv2 / BSD

   -  Contact: https://groups.google.com/d/forum/bbr-dev

   -  Last updated: August 21, 2021

*  QUIC

   -  Source code URLs:

      o  https://cs.chromium.org/chromium/src/net/third_party/quiche/src/quic/core/congestion_control/bbr2_sender.cc

      o  https://cs.chromium.org/chromium/src/net/third_party/quiche/src/quic/core/congestion_control/bbr2_sender.h

   -  Source: Google

   -  Maturity: production

   -  License: BSD-style

   -  Contact: https://groups.google.com/d/forum/bbr-dev

   -  Last updated: October 21, 2021

6.  Security Considerations

This proposal makes no changes to the underlying security of transport protocols or congestion control algorithms. BBR shares the same security considerations as the existing standard congestion control algorithm [RFC5681].

7.  IANA Considerations

This document has no IANA actions. Here we are using that phrase, suggested by [RFC5226], because BBR does not modify or extend the wire format of any network protocol, nor does it add new dependencies on assigned numbers. BBR involves only a change to the congestion control algorithm of a transport sender, and does not involve changes in the network, the receiver, or any network protocol.
Note to RFC Editor: this section may be removed on publication as an RFC.

8.  Acknowledgments

The authors are grateful to Len Kleinrock for his work on the theory underlying congestion control. We are indebted to Larry Brakmo for pioneering work on the Vegas [BP95] and New Vegas [B15] congestion control algorithms, which presaged many elements of BBR, and for Larry's advice and guidance during BBR's early development. The authors would also like to thank Kevin Yang, Priyaranjan Jha, Yousuk Seung, and Luke Hsiao for their work on TCP BBR; Jana Iyengar, Victor Vasiliev, and Bin Wu for their work on QUIC BBR; and Matt Mathis for his research work on the BBR algorithm and its implications [MM19]. We would also like to thank C. Stephen Gunn, Eric Dumazet, Nandita Dukkipati, Pawel Jurczyk, Biren Roy, David Wetherall, Amin Vahdat, Leonidas Kontothanassis, and the YouTube, google.com, Bandwidth Enforcer, and Google SRE teams for their invaluable help and support. We would like to thank Randall R. Stewart, Jim Warner, Loganaden Velvindron, and Hiren Panchasara for feedback and suggestions on earlier versions of this document.

9.  References

9.1.  Normative References

   [RFC793]   Postel, J., "Transmission Control Protocol", RFC 793, September 1981.

   [RFC2018]  Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R. Scheffenegger, "TCP Extensions for High Performance", RFC 7323, September 2014.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, March 1997.

   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", RFC 5226, May 2008.

   [RFC6298]  Paxson, V., "Computing TCP's Retransmission Timer", RFC 6298, June 2011.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running Code: The Implementation Status Section", RFC 7942, July 2016.

   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", RFC 8312, February 2018.

   [RFC8985]  Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "The RACK-TLP Loss Detection Algorithm for TCP", RFC 8985, DOI 10.17487/RFC8985, February 2021.

   [RFC9000]  Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based Multiplexed and Secure Transport", RFC 9000, DOI 10.17487/RFC9000, May 2021.

9.2.  Informative References

   [draft-cheng-iccrg-delivery-rate-estimation]
              Cheng, Y., Cardwell, N., Hassas Yeganeh, S., and V. Jacobson, "Delivery Rate Estimation", Work in Progress, Internet-Draft, draft-cheng-iccrg-delivery-rate-estimation, November 2021.

   [CCGHJ16]  Cardwell, N., Cheng, Y., Gunn, C., Hassas Yeganeh, S., and V. Jacobson, "BBR: Congestion-Based Congestion Control", ACM Queue, October 2016.

   [CCGHJ17]  Cardwell, N., Cheng, Y., Gunn, C., Hassas Yeganeh, S., and V. Jacobson, "BBR: Congestion-Based Congestion Control", Communications of the ACM, February 2017.

   [MM19]     Mathis, M. and J. Mahdavi, "Deprecating The TCP Macroscopic Model", Computer Communication Review, vol. 49, no. 5, pp. 63-68, October 2019.

   [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis, "Increasing TCP's Initial Window", RFC 6928, April 2013.
   [BBRStartupPacingGain]
              Cardwell, N., Cheng, Y., Hassas Yeganeh, S., and V. Jacobson, "BBR Startup Pacing Gain: a Derivation", June 2018.

   [BBRDrainPacingGain]
              Cardwell, N., Cheng, Y., Hassas Yeganeh, S., and V. Jacobson, "BBR Drain Pacing Gain: a Derivation", September 2021.

   [DC13]     Dumazet, E. and Y. Cheng, "TSO, fair queuing, pacing: three's a charm", IETF 88, November 2013.

   [A15]      Abrahamsson, M., "TCP ACK suppression", IETF AQM mailing list, November 2015.

   [Jac88]    Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 1988, Computer Communication Review, vol. 18, no. 4, pp. 314-329, August 1988.

   [Jac90]    Jacobson, V., "Modified TCP Congestion Avoidance Algorithm", end2end-interest mailing list, April 1990.

   [BP95]     Brakmo, L. and L. Peterson, "TCP Vegas: end-to-end congestion avoidance on a global Internet", IEEE Journal on Selected Areas in Communications 13(8): 1465-1480, October 1995.

   [B15]      Brakmo, L., "TCP-NV: An Update to TCP-Vegas", August 2015.

   [WS95]     Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The Implementation", Addison-Wesley, 1995.

   [HRX08]    Ha, S., Rhee, I., and L. Xu, "CUBIC: A New TCP-Friendly High-Speed TCP Variant", ACM SIGOPS Operating System Review, 2008.

   [GK81]     Gail, R. and L. Kleinrock, "An Invariant Property of Computer Network Power", Proceedings of the International Conference on Communications, June 1981.

   [K79]      Kleinrock, L., "Power and deterministic rules of thumb for probabilistic problems in computer communications", Proceedings of the International Conference on Communications, 1979.

Authors' Addresses

   Neal Cardwell
   Google

   Email: ncardwell@google.com

   Yuchung Cheng
   Google

   Email: ycheng@google.com

   Soheil Hassas Yeganeh
   Google

   Email: soheil@google.com

   Ian Swett
   Google

   Email: ianswett@google.com

   Van Jacobson
   Google

   Email: vanj@google.com