2 Internet Congestion Control Research Group                 N. Cardwell
3 Internet-Draft                                                 Y. Cheng
4 Intended status: Experimental                        S. Hassas Yeganeh
5 Expires: 8 September 2022                                      I. Swett
6                                                             V. Jacobson
7                                                                   Google
8                                                             7 March 2022

10                         BBR Congestion Control
11             draft-cardwell-iccrg-bbr-congestion-control-02

13 Abstract

15 This document specifies the BBR congestion control algorithm. BBR
16 ("Bottleneck Bandwidth and Round-trip propagation time") uses recent
17 measurements of a transport connection's delivery rate, round-trip
18 time, and packet loss rate to build an explicit model of the network
19 path. BBR then uses this model to control both how fast it sends
20 data and the maximum volume of data it allows in flight in the
21 network at any time. Relative to loss-based congestion control
22 algorithms such as Reno [RFC5681] or CUBIC [RFC8312], BBR offers
23 substantially higher throughput for bottlenecks with shallow buffers
24 or random losses, and substantially lower queueing delays for
25 bottlenecks with deep buffers (avoiding "bufferbloat"). BBR can be
26 implemented in any transport protocol that supports packet-delivery
27 acknowledgment.
Thus far, open source implementations are available 28 for TCP [RFC793] and QUIC [RFC9000]. This document specifies version 29 2 of the BBR algorithm, also sometimes referred to as BBRv2 or bbr2. 31 Status of This Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at https://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on 8 September 2022. 48 Copyright Notice 50 Copyright (c) 2022 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 55 license-info) in effect on the date of publication of this document. 56 Please review these documents carefully, as they describe your rights 57 and restrictions with respect to this document. Code Components 58 extracted from this document must include Revised BSD License text as 59 described in Section 4.e of the Trust Legal Provisions and are 60 provided without warranty as described in the Revised BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 65 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5 66 2.1. Transport Connection State . . . . . . . . . . . . . . . 5 67 2.2. Per-Packet State . . . . . . . . . . . . . . . . . . . . 5 68 2.3. Per-ACK Rate Sample State . . . . . . . . . . . . . . . . 5 69 2.4. Output Control Parameters . . . . . . . . . . . . . . . . 6 70 2.5. Pacing State and Parameters . . . . . . . . . . . . . . . 6 71 2.6. cwnd State and Parameters . . . . . . . . . . . . . . . . 7 72 2.7. General Algorithm State . . . . . . . . . . . . . . . . . 7 73 2.8. Core Algorithm Design Parameters . . . . . . . . . . . . 7 74 2.9. Network Path Model Parameters . . . . . . . . . . . . . . 8 75 2.9.1. Data Rate Network Path Model Parameters . . . . . . . 8 76 2.9.2. Data Volume Network Path Model Parameters . . . . . . 8 77 2.10. State for Responding to Congestion . . . . . . . . . . . 9 78 2.11. Estimating BBR.max_bw . . . . . . . . . . . . . . . . . . 10 79 2.12. Estimating BBR.extra_acked . . . . . . . . . . . . . . . 10 80 2.13. Startup Parameters and State . . . . . . . . . . . . . . 10 81 2.14. ProbeRTT and min_rtt Parameters and State . . . . . . . . 10 82 2.14.1. Parameters for Estimating BBR.min_rtt . . . . . . . 10 83 2.14.2. Parameters for Scheduling ProbeRTT . . . . . . . . . 11 84 3. Design Overview . . . . . . . . . . . . . . . . . . . . . . . 11 85 3.1. High-Level Design Goals . . . . . . . . . . . . . . . . . 11 86 3.2. Algorithm Overview . . . . . . . . . . . . . . . . . . . 12 87 3.3. State Machine Overview . . . . . . . . . . . . . . . . . 13 88 3.4. Network Path Model Overview . . . . . . . . . . . . . . . 13 89 3.4.1. High-Level Design Goals for the Network Path Model . 13 90 3.4.2. Time Scales for the Network Model . . . . . . . . . . 14 91 3.5. Control Parameter Overview . . . . . . . . . . . . . . . 15 92 3.6. 
Environment and Usage . . . . . . . . . . . . . . . . . . 15 93 4. Detailed Algorithm . . . . . . . . . . . . . . . . . . . . . 15 94 4.1. State Machine . . . . . . . . . . . . . . . . . . . . . . 15 95 4.1.1. State Transition Diagram . . . . . . . . . . . . . . 15 96 4.1.2. State Machine Operation Overview . . . . . . . . . . 16 97 4.1.3. State Machine Tactics . . . . . . . . . . . . . . . . 17 98 4.2. Algorithm Organization . . . . . . . . . . . . . . . . . 18 99 4.2.1. Initialization . . . . . . . . . . . . . . . . . . . 18 100 4.2.2. Per-Transmit Steps . . . . . . . . . . . . . . . . . 18 101 4.2.3. Per-ACK Steps . . . . . . . . . . . . . . . . . . . . 19 102 4.2.4. Per-Loss Steps . . . . . . . . . . . . . . . . . . . 19 103 4.3. State Machine Operation . . . . . . . . . . . . . . . . . 19 104 4.3.1. Startup . . . . . . . . . . . . . . . . . . . . . . . 19 105 4.3.2. Drain . . . . . . . . . . . . . . . . . . . . . . . . 22 106 4.3.3. ProbeBW . . . . . . . . . . . . . . . . . . . . . . . 23 107 4.3.4. ProbeRTT . . . . . . . . . . . . . . . . . . . . . . 34 108 4.4. Restarting From Idle . . . . . . . . . . . . . . . . . . 39 109 4.4.1. Setting Pacing Rate in ProbeBW . . . . . . . . . . . 39 110 4.4.2. Checking for ProberRTT Completion . . . . . . . . . . 40 111 4.4.3. Logic . . . . . . . . . . . . . . . . . . . . . . . . 40 112 4.5. Updating Network Path Model Parameters . . . . . . . . . 40 113 4.5.1. BBR.round_count: Tracking Packet-Timed Round Trips . 41 114 4.5.2. BBR.max_bw: Estimated Maximum Bandwidth . . . . . . . 42 115 4.5.3. BBR.min_rtt: Estimated Minimum Round-Trip Time . . . 44 116 4.5.4. BBR.offload_budget . . . . . . . . . . . . . . . . . 46 117 4.5.5. BBR.extra_acked . . . . . . . . . . . . . . . . . . . 47 118 4.5.6. Updating the Model Upon Packet Loss . . . . . . . . . 48 119 4.6. Updating Control Parameters . . . . . . . . . . . . . . . 52 120 4.6.1. Summary of Control Behavior in the State Machine . . 52 121 4.6.2. Pacing Rate: BBR.pacing_rate . . . . . . . . . . . . 53 122 4.6.3. Send Quantum: BBR.send_quantum . . . . . . . . . . . 55 123 4.6.4. Congestion Window . . . . . . . . . . . . . . . . . . 55 124 5. Implementation Status . . . . . . . . . . . . . . . . . . . . 61 125 6. Security Considerations . . . . . . . . . . . . . . . . . . . 62 126 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 62 127 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 63 128 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 63 129 9.1. Normative References . . . . . . . . . . . . . . . . . . 63 130 9.2. Informative References . . . . . . . . . . . . . . . . . 64 131 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 66 133 1. Introduction 135 The Internet has traditionally used loss-based congestion control 136 algorithms like Reno ([Jac88], [Jac90], [WS95] [RFC5681]) and CUBIC 137 ([HRX08], [RFC8312]). These algorithms worked well for many years 138 because they were sufficiently well-matched to the prevalent range of 139 bandwidth-delay products and degrees of buffering in Internet paths. 140 As the Internet has evolved, loss-based congestion control is 141 increasingly problematic in several important scenarios: 143 1. Shallow buffers: In shallow buffers, packet loss can happen even 144 when a link has low utilization. 
With high-speed, long-haul
145 links employing commodity switches with shallow buffers, loss-
146 based congestion control can cause abysmal throughput because it
147 overreacts, multiplicatively decreasing the sending rate upon
148 packet loss, and only slowly growing its sending rate thereafter.
149 This can happen even if the packet loss arises from transient
150 traffic bursts when the link is mostly idle.

152 2. Deep buffers: At the edge of today's Internet, loss-based
153 congestion control can cause the problem of "bufferbloat", by
154 repeatedly filling deep buffers in last-mile links and causing
155 high queuing delays.

157 3. Dynamic traffic workloads: With buffers of any depth, dynamic
158 mixes of newly-entering flows or flights of data from recently
159 idle flows can cause frequent packet loss. In such scenarios
160 loss-based congestion control can fail to maintain its fair share
161 of bandwidth, leading to poor application performance.

163 In both the shallow-buffer (1.) and dynamic-traffic (3.) scenarios
164 mentioned above, it is difficult to achieve full throughput with loss-
165 based congestion control in practice: for CUBIC, sustaining 10Gbps
166 over 100ms RTT needs a packet loss rate below 0.000003% (i.e., more
167 than 40 seconds between packet losses), and over a 100ms RTT path a
168 more feasible loss rate like 1% can only sustain at most 3 Mbps
169 [RFC8312]. These limitations apply no matter what the bottleneck
170 link is capable of or what the connection's fair share is.
171 Furthermore, failure to reach the fair share can cause poor
172 throughput and poor tail latency for latency-sensitive applications.

174 The BBR ("Bottleneck Bandwidth and Round-trip propagation time")
175 congestion control algorithm is a model-based algorithm that takes an
176 approach different from loss-based congestion control: BBR uses
177 recent measurements of a transport connection's delivery rate, round-
178 trip time, and packet loss rate to build an explicit model of the
179 network path, including its estimated available bandwidth, bandwidth-
180 delay product, and the maximum volume of data that the connection can
181 place in-flight in the network without causing excessive queue
182 pressure. It then uses this model in order to guide its control
183 behavior in seeking high throughput and low queue pressure.

185 This document describes the current version of the BBR algorithm,
186 BBRv2. The previous version of the algorithm, BBRv1, was described
187 at a high level in [CCGHJ16][CCGHJ17]. The implications of
188 BBR in allowing high utilization of high-speed networks with shallow
189 buffers have been discussed in other work [MM19]. Active work on the
190 BBR algorithm is continuing.

192 This document is organized as follows. Section 2 provides various
193 definitions that will be used throughout this document. Section 3
194 provides an overview of the design of the BBR algorithm, and section
195 4 describes the BBR algorithm in detail, including BBR's network path
196 model, control parameters, and state machine. Section 5 describes
197 the implementation status, section 6 describes security
198 considerations, section 7 notes that there are no IANA
199 considerations, and section 8 closes with Acknowledgments.

201 2. Terminology

203 This document defines state variables and constants for the BBR
204 algorithm.

206 The variables starting with C, P, or rs not defined below are defined
207 in [draft-cheng-iccrg-delivery-rate-estimation].

209 2.1.
Transport Connection State 211 C.delivered: The total amount of data (tracked in octets or in 212 packets) delivered so far over the lifetime of the transport 213 connection C. 215 SMSS: The Sender Maximum Segment Size. 217 is_cwnd_limited: True if the connection has fully utilized its cwnd 218 at any point in the last packet-timed round trip. 220 InitialCwnd: The initial congestion window set by the transport 221 protocol implementation for the connection at initialization time. 223 2.2. Per-Packet State 225 packet.delivered: C.delivered when the given packet was sent by 226 transport connection C. 228 packet.departure_time: The earliest pacing departure time for the 229 given packet. 231 packet.tx_in_flight: The volume of data that was estimated to be in 232 flight at the time of the transmission of the packet. 234 2.3. Per-ACK Rate Sample State 236 rs.delivered: The volume of data delivered between the transmission 237 of the packet that has just been ACKed and the current time. 239 rs.delivery_rate: The delivery rate (aka bandwidth) sample obtained 240 from the packet that has just been ACKed. 242 rs.rtt: The RTT sample calculated based on the most recently-sent 243 segment of the segments that have just been ACKed. 245 rs.newly_acked: The volume of data cumulatively or selectively 246 acknowledged upon the ACK that was just received. (This quantity is 247 referred to as "DeliveredData" in [RFC6937].) 249 rs.newly_lost: The volume of data newly marked lost upon the ACK that 250 was just received. 252 rs.tx_in_flight: The volume of data that was estimated to be in 253 flight at the time of the transmission of the packet that has just 254 been ACKed (the most recently sent segment among segments ACKed by 255 the ACK that was just received). 257 rs.lost: The volume of data that was declared lost between the 258 transmission and acknowledgement of the packet that has just been 259 ACKed (the most recently sent segment among segments ACKed by the ACK 260 that was just received). 262 2.4. Output Control Parameters 264 cwnd: The transport sender's congestion window, which limits the 265 amount of data in flight. 267 BBR.pacing_rate: The current pacing rate for a BBR flow, which 268 controls inter-packet spacing. 270 BBR.send_quantum: The maximum size of a data aggregate scheduled and 271 transmitted together. 273 2.5. Pacing State and Parameters 275 BBR.pacing_gain: The dynamic gain factor used to scale BBR.bw to 276 produce BBR.pacing_rate. 278 BBRPacingMarginPercent: The static discount factor of 1% used to 279 scale BBR.bw to produce BBR.pacing_rate. 281 BBR.next_departure_time: The earliest pacing departure time for the 282 next packet BBR schedules for transmission. 284 2.6. cwnd State and Parameters 286 BBR.cwnd_gain: The dynamic gain factor used to scale the estimated 287 BDP to produce a congestion window (cwnd). 289 BBRStartupPacingGain: A constant specifying the minimum gain value 290 for calculating the pacing rate that will allow the sending rate to 291 double each round (4*ln(2) ~= 2.77) [BBRStartupPacingGain]; used in 292 Startup mode for BBR.pacing_gain. 294 BBRStartupCwndGain: A constant specifying the minimum gain value for 295 calculating the cwnd that will allow the sending rate to double each 296 round (2.0); used in Startup mode for BBR.cwnd_gain. 298 BBR.packet_conservation: A boolean indicating whether BBR is 299 currently using packet conservation dynamics to bound cwnd. 301 2.7. 
General Algorithm State 303 BBR.state: The current state of a BBR flow in the BBR state machine. 305 BBR.round_count: Count of packet-timed round trips elapsed so far. 307 BBR.round_start: A boolean that BBR sets to true once per packet- 308 timed round trip, on ACKs that advance BBR.round_count. 310 BBR.next_round_delivered: packet.delivered value denoting the end of 311 a packet-timed round trip. 313 BBR.idle_restart: A boolean that is true if and only if a connection 314 is restarting after being idle. 316 2.8. Core Algorithm Design Parameters 318 BBRLossThresh: The maximum tolerated per-round-trip packet loss rate 319 when probing for bandwidth (the default is 2%). 321 BBRBeta: The default multiplicative decrease to make upon each round 322 trip during which the connection detects packet loss (the value is 323 0.7). 325 BBRHeadroom: The multiplicative factor to apply to BBR.inflight_hi 326 when attempting to leave free headroom in the path (e.g. free space 327 in the bottleneck buffer or free time slots in the bottleneck link) 328 that can be used by cross traffic (the value is 0.85). 330 BBRMinPipeCwnd: The minimal cwnd value BBR targets, to allow 331 pipelining with TCP endpoints that follow an "ACK every other packet" 332 delayed-ACK policy: 4 * SMSS. 334 2.9. Network Path Model Parameters 336 2.9.1. Data Rate Network Path Model Parameters 338 The data rate model parameters together estimate both the sending 339 rate required to reach the full bandwidth available to the flow 340 (BBR.max_bw), and the maximum pacing rate control parameter that is 341 consistent with the queue pressure objective (BBR.bw). 343 BBR.max_bw: The windowed maximum recent bandwidth sample - obtained 344 using the BBR delivery rate sampling algorithm [draft-cheng-iccrg- 345 delivery-rate-estimation] - measured during the current or previous 346 bandwidth probing cycle (or during Startup, if the flow is still in 347 that state). (Part of the long-term model.) 349 BBR.bw_hi: The long-term maximum sending bandwidth that the algorithm 350 estimates will produce acceptable queue pressure, based on signals in 351 the current or previous bandwidth probing cycle, as measured by loss. 352 (Part of the long-term model.) 354 BBR.bw_lo: The short-term maximum sending bandwidth that the 355 algorithm estimates is safe for matching the current network path 356 delivery rate, based on any loss signals in the current bandwidth 357 probing cycle. This is generally lower than max_bw or bw_hi (thus 358 the name). (Part of the short-term model.) 360 BBR.bw: The maximum sending bandwidth that the algorithm estimates is 361 appropriate for matching the current network path delivery rate, 362 given all available signals in the model, at any time scale. It is 363 the min() of max_bw, bw_hi, and bw_lo. 365 2.9.2. Data Volume Network Path Model Parameters 367 The data volume model parameters together estimate both the volume of 368 in-flight data required to reach the full bandwidth available to the 369 flow (BBR.max_inflight), and the maximum volume of data that is 370 consistent with the queue pressure objective (cwnd). 372 BBR.min_rtt: The windowed minimum round-trip time sample measured 373 over the last MinRTTFilterLen = 10 seconds. 
This attempts to 374 estimate the two-way propagation delay of the network path when all 375 connections sharing a bottleneck are using BBR, but also allows BBR 376 to estimate the value required for a bdp estimate that allows full 377 throughput if there are legacy loss-based Reno or CUBIC flows sharing 378 the bottleneck. 380 BBR.bdp: The estimate of the network path's BDP (Bandwidth-Delay 381 Product), computed as: BBR.bdp = BBR.bw * BBR.min_rtt. 383 BBR.extra_acked: A volume of data that is the estimate of the recent 384 degree of aggregation in the network path. 386 BBR.offload_budget: The estimate of the minimum volume of data 387 necessary to achieve full throughput when using sender (TSO/GSO) and 388 receiver (LRO, GRO) host offload mechanisms. 390 BBR.max_inflight: The estimate of the volume of in-flight data 391 required to fully utilize the bottleneck bandwidth available to the 392 flow, based on the BDP estimate (BBR.bdp), the aggregation estimate 393 (BBR.extra_acked), the offload budget (BBR.offload_budget), and 394 BBRMinPipeCwnd. 396 BBR.inflight_hi: Analogous to BBR.bw_hi, the long-term maximum volume 397 of in-flight data that the algorithm estimates will produce 398 acceptable queue pressure, based on signals in the current or 399 previous bandwidth probing cycle, as measured by loss. That is, if a 400 flow is probing for bandwidth, and observes that sending a particular 401 volume of in-flight data causes a loss rate higher than the loss rate 402 objective, it sets inflight_hi to that volume of data. (Part of the 403 long-term model.) 405 BBR.inflight_lo: Analogous to BBR.bw_lo, the short-term maximum 406 volume of in-flight data that the algorithm estimates is safe for 407 matching the current network path delivery process, based on any loss 408 signals in the current bandwidth probing cycle. This is generally 409 lower than max_inflight or inflight_hi (thus the name). (Part of the 410 short-term model.) 412 2.10. State for Responding to Congestion 414 BBR.bw_latest: a 1-round-trip max of delivered bandwidth 415 (rs.delivery_rate). 417 BBR.inflight_latest: a 1-round-trip max of delivered volume of data 418 (rs.delivered). 420 2.11. Estimating BBR.max_bw 422 BBR.MaxBwFilter: The filter for tracking the maximum recent 423 rs.delivery_rate sample, for estimating BBR.max_bw. 425 MaxBwFilterLen: The filter window length for BBR.MaxBwFilter = 2 426 (representing up to 2 ProbeBW cycles, the current cycle and the 427 previous full cycle). 429 BBR.cycle_count: The virtual time used by the BBR.max_bw filter 430 window. Note that BBR.cycle_count only needs to be tracked with a 431 single bit, since the BBR.MaxBwFilter only needs to track samples 432 from two time slots: the previous ProbeBW cycle and the current 433 ProbeBW cycle. 435 2.12. Estimating BBR.extra_acked 437 BBR.extra_acked_interval_start: the start of the time interval for 438 estimating the excess amount of data acknowledged due to aggregation 439 effects. 441 BBR.extra_acked_delivered: the volume of data marked as delivered 442 since BBR.extra_acked_interval_start. 444 BBR.ExtraACKedFilter: the max filter tracking the recent maximum 445 degree of aggregation in the path. 447 BBRExtraAckedFilterLen = The window length of the 448 BBR.ExtraACKedFilter max filter window: 10 (in units of packet-timed 449 round trips). 451 2.13. Startup Parameters and State 453 BBR.filled_pipe: A boolean that records whether BBR estimates that it 454 has ever fully utilized its available bandwidth ("filled the pipe"). 
456 BBR.full_bw: A recent baseline BBR.max_bw to estimate if BBR has 457 "filled the pipe" in Startup. 459 BBR.full_bw_count: The number of non-app-limited round trips without 460 large increases in BBR.full_bw. 462 2.14. ProbeRTT and min_rtt Parameters and State 464 2.14.1. Parameters for Estimating BBR.min_rtt 466 BBR.min_rtt_stamp: The wall clock time at which the current 467 BBR.min_rtt sample was obtained. 469 MinRTTFilterLen: A constant specifying the length of the BBR.min_rtt 470 min filter window, MinRTTFilterLen is 10 secs. 472 2.14.2. Parameters for Scheduling ProbeRTT 474 BBRProbeRTTCwndGain = A constant specifying the gain value for 475 calculating the cwnd during ProbeRTT: 0.5 (meaning that ProbeRTT 476 attempts to reduce in-flight data to 50% of the estimated BDP). 478 ProbeRTTDuration: A constant specifying the minimum duration for 479 which ProbeRTT state holds inflight to BBRMinPipeCwnd or fewer 480 packets: 200 ms. 482 ProbeRTTInterval: A constant specifying the minimum time interval 483 between ProbeRTT states: 5 secs. 485 BBR.probe_rtt_min_delay: The minimum RTT sample recorded in the last 486 ProbeRTTInterval. 488 BBR.probe_rtt_min_stamp: The wall clock time at which the current 489 BBR.probe_rtt_min_delay sample was obtained. 491 BBR.probe_rtt_expired: A boolean recording whether the 492 BBR.probe_rtt_min_delay has expired and is due for a refresh with an 493 application idle period or a transition into ProbeRTT state. 495 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 496 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 497 document are to be interpreted as described in [RFC2119]. 499 3. Design Overview 501 3.1. High-Level Design Goals 503 The high-level goal of BBR is to achieve both: 505 1. The full throughput (or approximate fair share thereof) available 506 to a flow 508 * Achieved in a fast and scalable manner (using bandwidth in 509 O(log(BDP)) time). 511 * Achieved with average packet loss rates of up to 1%. 513 2. Low queue pressure (low queuing delay and low packet loss). 515 These goals are in tension: sending faster improves the odds of 516 achieving (1) but reduces the odds of achieving (2), while sending 517 slower improves the odds of achieving (2) but reduces the odds of 518 achieving (1). Thus the algorithm cannot maximize throughput or 519 minimize queue pressure independently, and must jointly optimize 520 both. 522 To try to achieve these goals, and seek an operating point with high 523 throughput and low delay [K79] [GK81], BBR aims to adapt its sending 524 process to match the network delivery process, in two dimensions: 526 1. data rate: the rate at which the flow sends data should ideally 527 match the rate at which the network delivers the flow's data (the 528 available bottleneck bandwidth) 530 2. data volume: the amount of unacknowledged data in flight in the 531 network should ideally match the bandwidth-delay product (BDP) of 532 the path 534 Both the control of the data rate (via the pacing rate) and data 535 volume (directly via the congestion window or cwnd; and indirectly 536 via the pacing rate) are important. A mismatch in either dimension 537 can cause the sender to fail to meet its high-level design goals: 539 1. volume mismatch: If a sender perfectly matches its sending rate 540 to the available bandwidth, but its volume of in-flight data 541 exceeds the BDP, then the sender can maintain a large standing 542 queue, increasing network latency and risking packet loss. 544 2. 
rate mismatch: If a sender's volume of in-flight data matches the 545 BDP perfectly but its sending rate exceeds the available 546 bottleneck bandwidth (e.g. the sender transmits a BDP of data in 547 an unpaced fashion, at the sender's link rate), then up to a full 548 BDP of data can burst into the bottleneck queue, causing high 549 delay and/or high loss. 551 3.2. Algorithm Overview 553 Based on the rationale above, BBR tries to spend most of its time 554 matching its sending process (data rate and data volume) to the 555 network path's delivery process. To do this, it explores the 556 2-dimensional control parameter space of (1) data rate ("bandwidth" 557 or "throughput") and (2) data volume ("in-flight data"), with a goal 558 of finding the maximum values of each control parameter that are 559 consistent with its objective for queue pressure. 561 Depending on what signals a given network path manifests at a given 562 time, the objective for queue pressure is measured in terms of the 563 most strict among: 565 * the amount of data that is estimated to be queued in the 566 bottleneck buffer (data_in_flight - estimated_BDP): the objective 567 is to maintain this amount at or below 1.5 * estimated_BDP 569 * the packet loss rate: the objective is a maximum per-round-trip 570 packet loss rate of BBRLossThresh=2% (and an average packet loss 571 rate considerably lower) 573 3.3. State Machine Overview 575 BBR varies its control parameters with a simple state machine that 576 aims for high throughput, low latency, and an approximately fair 577 sharing of bandwidth, while maintaining an up-to-date model of the 578 network path. 580 A BBR flow starts in the Startup state, and ramps up its sending rate 581 quickly, to rapidly estimate the maximum available bandwidth 582 (BBR.max_bw). When it estimates the bottleneck bandwidth has been 583 fully utilized, it enters the Drain state to drain the estimated 584 queue. In steady state a BBR flow mostly uses the ProbeBW states, to 585 periodically briefly send faster to probe for higher capacity and 586 then briefly send slower to try to drain any resulting queue. If 587 needed, it briefly enters the ProbeRTT state, to lower the sending 588 rate to probe for lower BBR.min_rtt samples. The detailed behavior 589 for each state is described below. 591 3.4. Network Path Model Overview 593 3.4.1. High-Level Design Goals for the Network Path Model 595 At a high level, the BBR model is trying to reflect two aspects of 596 the network path: 598 * Model what's required for achieving full throughput: Estimate the 599 minimum data rate and data volume required to fully utilize the 600 fair share of the bottleneck bandwidth available to the flow. 601 This must incorporate estimates of the maximum available bandwidth 602 (BBR.max_bw), the BDP of the path (BBR.bdp), and the requirements 603 of any offload features on the end hosts or mechanisms on the 604 network path that produce aggregation effects (summing up to 605 BBR.max_inflight). 607 * Model what's permitted for achieving low queue pressure: Estimate 608 the maximum data rate (BBR.bw) and data volume (cwnd) consistent 609 with the queue pressure objective, as measured by the estimated 610 degree of queuing and packet loss. 
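As a concrete, non-normative illustration of these two aspects, the
sketch below (in Python; the helper names and the example bandwidth,
RTT, and offload values are assumptions, not part of this
specification) combines the rate parameters into BBR.bw via min(), as
described in Section 2.9.1, and combines the data volume inputs
listed in Section 2.9.2 into one plausible BBR.max_inflight estimate:

   # Non-normative sketch: one way the model parameters above can be
   # combined (example values and helper names are illustrative only).

   SMSS = 1514                   # sender maximum segment size, in bytes
   BBRMinPipeCwnd = 4 * SMSS     # minimal cwnd target: 4 * SMSS

   def model_bw(max_bw, bw_hi, bw_lo):
       """BBR.bw: the min() of max_bw, bw_hi, and bw_lo."""
       return min(max_bw, bw_hi, bw_lo)

   def model_max_inflight(bw, min_rtt, extra_acked, offload_budget):
       """A plausible BBR.max_inflight: the BDP plus the aggregation
       estimate, floored by the offload budget and BBRMinPipeCwnd."""
       bdp = bw * min_rtt        # BBR.bdp = BBR.bw * BBR.min_rtt
       return max(bdp + extra_acked, offload_budget, BBRMinPipeCwnd)

   # Example: 100 Mbit/s of available bandwidth (12.5e6 bytes/sec)
   # and a 40 ms min_rtt, with no aggregation or offload requirements.
   bw = model_bw(max_bw=12.5e6, bw_hi=float("inf"), bw_lo=float("inf"))
   print(model_max_inflight(bw, 0.040, 0, 0))   # -> 500000.0 bytes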
612 Note that those two aspects are in tension: the highest throughput is
613 available to the flow when it sends as fast as possible and occupies
614 as many bottleneck buffer slots as possible; the lowest queue pressure
615 is achieved by the flow when it sends as slowly as possible and
616 occupies as few bottleneck buffer slots as possible. To resolve the
617 tension, the algorithm aims to achieve the maximum throughput
618 possible while still meeting the queue pressure objective.

620 3.4.2. Time Scales for the Network Model

622 At a high level, the BBR model is trying to reflect the properties of
623 the network path on two different time scales:

625 3.4.2.1. Long-term model

627 One goal is for BBR to maintain high average utilization of the fair
628 share of the available bandwidth, over long time intervals. This
629 requires estimates of the path's data rate and volume capacities that
630 are robust over long time intervals. This means being robust to
631 congestion signals that may be noisy or may reflect short-term
632 congestion that has already abated by the time an ACK arrives. This
633 also means providing a robust history of the best recently-achievable
634 performance on the path so that the flow can quickly and robustly aim
635 to re-probe that level of performance whenever it decides to probe
636 the capacity of the path.

638 3.4.2.2. Short-term model

640 A second goal of BBR is to react to every congestion signal,
641 including loss, as if it may indicate a persistent/long-term increase
642 in congestion and/or decrease in the bandwidth available to the flow,
643 because that may indeed be the case.

645 3.4.2.3. Time Scale Strategy

647 BBR sequentially alternates between spending most of its time using
648 short-term models to conservatively respect all congestion signals in
649 case they represent persistent congestion, and periodically using its
650 long-term model to robustly probe the limits of the available path
651 capacity in case the congestion has abated and more capacity is
652 available.

654 3.5. Control Parameter Overview

656 BBR uses its model to control the connection's sending behavior.
657 Rather than using a single control parameter, like the cwnd parameter
658 that limits the volume of in-flight data in the Reno and CUBIC
659 congestion control algorithms, BBR uses three distinct control
660 parameters:

662 1. pacing rate: the maximum rate at which BBR sends data.

664 2. send quantum: the maximum size of any aggregate that the
665 transport sender implementation may need to transmit as a unit to
666 amortize per-packet transmission overheads.

668 3. cwnd: the maximum volume of data BBR allows in-flight in the
669 network at any time.

671 3.6. Environment and Usage

673 BBR is a congestion control algorithm that is agnostic to transport-
674 layer and link-layer technologies, requires only sender-side changes,
675 and does not require changes in the network. Open source
676 implementations of BBR are available for the TCP [RFC793] and QUIC
677 [RFC9000] transport protocols, and these implementations have been
678 used in production for a large volume of Internet traffic. An open
679 source implementation of BBR is also available for DCCP [RFC4340]
680 [draft-romo-iccrg-ccid5].

682 4. Detailed Algorithm

684 4.1. State Machine

686 BBR implements a state machine that uses the network path model to
687 guide its decisions, and the control parameters to enact its
688 decisions.

690 4.1.1. State Transition Diagram
692 The following state transition diagram summarizes the flow of control
693 and the relationship between the different states:

695             |
696             V
697     +---> Startup  ------------+
698     |       |                  |
699     |       V                  |
700     |     Drain  --------------+
701     |       |                  |
702     |       V                  |
703     +---> ProbeBW_DOWN  -------+
704     | ^     |                  |
705     | |     V                  |
706     | |   ProbeBW_CRUISE ------+
707     | |     |                  |
708     | |     V                  |
709     | |   ProbeBW_REFILL ------+
710     | |     |                  |
711     | |     V                  |
712     | |   ProbeBW_UP ----------+
713     | |     |                  |
714     | +-----+                  |
715     |                          |
716     +---- ProbeRTT <-----------+

718 4.1.2. State Machine Operation Overview

720 When starting up, BBR probes to try to quickly build a model of the
721 network path; to adapt to later changes to the path or its traffic,
722 BBR must continue to probe to update its model. If the available
723 bottleneck bandwidth increases, BBR must send faster to discover
724 this. Likewise, if the round-trip propagation delay changes, this
725 changes the BDP, and thus BBR must send slower to get inflight below
726 the new BDP in order to measure the new BBR.min_rtt. Thus, BBR's
727 state machine runs periodic, sequential experiments, sending faster
728 to check for BBR.bw increases or sending slower to yield bandwidth,
729 drain the queue, and check for BBR.min_rtt decreases. The frequency,
730 magnitude, duration, and structure of these experiments differ
731 depending on what's already known (startup or steady-state) and
732 application sending behavior (intermittent or continuous).

734 This state machine has several goals:

736 * Achieve high throughput by efficiently utilizing available
737   bandwidth.

739 * Achieve low latency and packet loss rates by keeping queues
740   bounded and small.

742 * Share bandwidth with other flows in an approximately fair manner.

744 * Feed samples to the model estimators to refresh and update the
745   model.

747 4.1.3. State Machine Tactics

749 In the BBR framework, at any given time the sender can choose one of
750 the following tactics:

752 * Acceleration: Send faster than the network is delivering data: to
753   probe the maximum bandwidth available to the flow

755 * Cruising: Send at the same rate the network is delivering data:
756   try to match the sending rate to the flow's current available
757   bandwidth, to try to achieve high utilization of the available
758   bandwidth without increasing queue pressure

760 * Deceleration: Send slower than the network is delivering data: to
761   reduce the amount of data in flight, with a number of overlapping
762   motivations:

764   - Reducing queuing delay: to reduce queuing delay, to reduce
765     latency for request/response cross-traffic (e.g. RPC, web
766     traffic).

768   - Reducing packet loss: to reduce packet loss, to reduce tail
769     latency for request/response cross-traffic (e.g. RPC, web
770     traffic) and improve coexistence with Reno/CUBIC.

772   - Probing BBR.min_rtt: to probe the path's BBR.min_rtt

774   - Bandwidth convergence: to aid bandwidth fairness convergence,
775     by leaving unused capacity in the bottleneck link or bottleneck
776     buffer, to allow other flows that may have lower sending rates
777     to discover and utilize the unused capacity

779   - Burst tolerance: to allow bursty arrivals of cross-traffic
780     (e.g. short web or RPC requests) to be able to share the
781     bottleneck link without causing excessive queuing delay or
782     packet loss

784 Throughout its lifetime, a BBR flow sequentially cycles through
785 all three tactics, to measure the network path and try to optimize
786 its operating point.
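As a rough, non-normative illustration of these three tactics, the
following sketch (in Python; the packet size and bandwidth are
assumed example values, and the gain values are the ProbeBW_UP,
ProbeBW_CRUISE, and ProbeBW_DOWN pacing gains described later in this
document) shows how a pacing_gain above, equal to, or below 1.0
translates into inter-packet spacing relative to BBR.bw:

   # Non-normative sketch: how each tactic's pacing_gain maps to the
   # spacing between packet transmissions, relative to BBR.bw.
   # (Ignores the 1% BBRPacingMarginPercent discount for simplicity.)

   PACKET_SIZE = 1514        # bytes per packet (assumed example value)

   TACTIC_GAIN = {
       "acceleration": 1.25,  # send faster than the delivery rate
       "cruising":     1.00,  # match the estimated delivery rate
       "deceleration": 0.90,  # send slower, letting inflight drain
   }

   def interpacket_interval(bw, tactic):
       """Seconds between departures when pacing at pacing_gain * BBR.bw,
       where bw is in bytes per second."""
       return PACKET_SIZE / (TACTIC_GAIN[tactic] * bw)

   bw = 12.5e6   # example BBR.bw: 100 Mbit/s expressed in bytes/sec
   for tactic in TACTIC_GAIN:
       print(tactic, round(interpacket_interval(bw, tactic) * 1e6), "us")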
788 BBR's state machine uses two control mechanisms. Primarily, it uses
789 the pacing_gain (see the "Pacing Rate" section), which controls how
790 fast packets are sent relative to BBR.bw. A pacing_gain > 1
791 decreases inter-packet time and increases inflight. A pacing_gain <
792 1 has the opposite effect, increasing inter-packet time while
793 aiming to decrease inflight. Second, if the state machine needs to
794 quickly reduce inflight to a particular absolute value, it uses the
795 cwnd.

797 4.2. Algorithm Organization

799 The BBR algorithm is an event-driven algorithm that executes steps
800 upon the following events: connection initialization, upon each ACK,
801 upon the transmission of each quantum, and upon loss detection
802 events. All of the sub-steps invoked below are described
803 below.

805 4.2.1. Initialization

807 Upon transport connection initialization, BBR executes its
808 initialization steps:

810 BBROnInit():
811   init_windowed_max_filter(filter=BBR.MaxBwFilter, value=0, time=0)
812   BBR.min_rtt = SRTT ? SRTT : Inf
813   BBR.min_rtt_stamp = Now()
814   BBR.probe_rtt_done_stamp = 0
815   BBR.probe_rtt_round_done = false
816   BBR.prior_cwnd = 0
817   BBR.idle_restart = false
818   BBR.extra_acked_interval_start = Now()
819   BBR.extra_acked_delivered = 0
820   BBRResetCongestionSignals()
821   BBRResetLowerBounds()
822   BBRInitRoundCounting()
823   BBRInitFullPipe()
824   BBRInitPacingRate()
825   BBREnterStartup()

827 4.2.2. Per-Transmit Steps

829 When transmitting, BBR merely needs to check for the case where the
830 flow is restarting from idle:

832 BBROnTransmit():
833   BBRHandleRestartFromIdle()

835 4.2.3. Per-ACK Steps

837 On every ACK, the BBR algorithm executes the following
838 BBRUpdateOnACK() steps in order to update its network path model,
839 update its state machine, and adjust its control parameters to adapt
840 to the updated model:

842 BBRUpdateOnACK():
843   BBRUpdateModelAndState()
844   BBRUpdateControlParameters()

846 BBRUpdateModelAndState():
847   BBRUpdateLatestDeliverySignals()
848   BBRUpdateCongestionSignals()
849   BBRUpdateACKAggregation()
850   BBRCheckStartupDone()
851   BBRCheckDrain()
852   BBRUpdateProbeBWCyclePhase()
853   BBRUpdateMinRTT()
854   BBRCheckProbeRTT()
855   BBRAdvanceLatestDeliverySignals()
856   BBRBoundBWForModel()

858 BBRUpdateControlParameters():
859   BBRSetPacingRate()
860   BBRSetSendQuantum()
861   BBRSetCwnd()

863 4.2.4. Per-Loss Steps

865 On every packet loss event, where some sequence range "packet" is
866 marked lost, the BBR algorithm executes the following
867 BBRUpdateOnLoss() steps in order to update its network path model:

869 BBRUpdateOnLoss(packet):
870   BBRHandleLostPacket(packet)

872 4.3. State Machine Operation

874 4.3.1. Startup

875 4.3.1.1. Startup Dynamics

877 When a BBR flow starts up, it performs its first (and most rapid)
878 sequential probe/drain process in the Startup and Drain states.
879 Network link bandwidths currently span a range of at least 11 orders
880 of magnitude, from a few bps to 200 Gbps. To quickly learn
881 BBR.max_bw, given this huge range to explore, BBR's Startup state
882 does an exponential search of the rate space, doubling the sending
883 rate each round. This finds BBR.max_bw in O(log_2(BDP)) round trips.
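As a non-normative illustration of this exponential search, the
sketch below (in Python; the initial sending rate and the bottleneck
bandwidth are assumed example values) estimates how many packet-timed
round trips a rate-doubling Startup needs before its sending rate
first reaches a given bottleneck bandwidth:

   # Non-normative sketch: rounds needed by a rate-doubling Startup to
   # first reach a given bottleneck bandwidth (O(log2) in the ratio).

   import math

   def startup_rounds(initial_rate, bottleneck_bw):
       """Smallest r such that initial_rate * 2**r >= bottleneck_bw."""
       if initial_rate >= bottleneck_bw:
           return 0
       return math.ceil(math.log2(bottleneck_bw / initial_rate))

   # Example: roughly 10 packets per 100 ms round trip (~1.2 Mbit/s)
   # growing toward a 10 Gbit/s bottleneck.
   initial = 10 * 1514 * 8 / 0.100        # bits per second
   print(startup_rounds(initial, 10e9))   # -> 14 rounds in this example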
885 To achieve this rapid probing in the smoothest possible fashion, in 886 Startup BBR uses the minimum gain values that will allow the sending 887 rate to double each round: in Startup BBR sets BBR.pacing_gain to 888 BBRStartupPacingGain (2.77) [BBRStartupPacingGain] and BBR.cwnd_gain 889 to BBRStartupCwndGain (2). 891 When initializing a connection, or upon any later entry into Startup 892 mode, BBR executes the following BBREnterStartup() steps: 894 BBREnterStartup(): 895 BBR.state = Startup 896 BBR.pacing_gain = BBRStartupPacingGain 897 BBR.cwnd_gain = BBRStartupCwndGain 899 As BBR grows its sending rate rapidly, it obtains higher delivery 900 rate samples, BBR.max_bw increases, and the pacing rate and cwnd both 901 adapt by smoothly growing in proportion. Once the pipe is full, a 902 queue typically forms, but the cwnd_gain bounds any queue to 903 (cwnd_gain - 1) * estimated_BDP, which is approximately (2.77 - 1) * 904 estimated_BDP = 1.77 * estimated_BDP. The immediately following 905 Drain state is designed to quickly drain that queue. 907 During Startup, BBR estimates whether the pipe is full using two 908 estimators. The first looks for a plateau in the BBR.max_bw 909 estimate. The second looks for packet loss. The following 910 subsections discuss these estimators. 912 BBRCheckStartupDone(): 913 BBRCheckStartupFullBandwidth() 914 BBRCheckStartupHighLoss() 915 if (BBR.state == Startup and BBR.filled_pipe) 916 BBREnterDrain() 918 4.3.1.2. Exiting Startup Based on Bandwidth Plateau 920 During Startup, BBR estimates whether the pipe is full by looking for 921 a plateau in the BBR.max_bw estimate. The output of this "full pipe" 922 estimator is tracked in BBR.filled_pipe, a boolean that records 923 whether BBR estimates that it has ever fully utilized its available 924 bandwidth ("filled the pipe"). If BBR notices that there are several 925 (three) rounds where attempts to double the delivery rate actually 926 result in little increase (less than 25 percent), then it estimates 927 that it has reached BBR.max_bw, sets BBR.filled_pipe to true, exits 928 Startup and enters Drain. 930 Upon connection initialization the full pipe estimator runs: 932 BBRInitFullPipe(): 933 BBR.filled_pipe = false 934 BBR.full_bw = 0 935 BBR.full_bw_count = 0 937 Once per round trip, upon an ACK that acknowledges new data, and when 938 the delivery rate sample is not application-limited (see [draft- 939 cheng-iccrg-delivery-rate-estimation]), BBR runs the "full pipe" 940 estimator, if needed: 942 BBRCheckStartupFullBandwidth(): 943 if BBR.filled_pipe or 944 !BBR.round_start or rs.is_app_limited 945 return /* no need to check for a full pipe now */ 946 if (BBR.max_bw >= BBR.full_bw * 1.25) /* still growing? */ 947 BBR.full_bw = BBR.max_bw /* record new baseline level */ 948 BBR.full_bw_count = 0 949 return 950 BBR.full_bw_count++ /* another round w/o much growth */ 951 if (BBR.full_bw_count >= 3) 952 BBR.filled_pipe = true 954 BBR waits three rounds to have solid evidence that the sender is not 955 detecting a delivery-rate plateau that was temporarily imposed by the 956 receive window. Allowing three rounds provides time for the 957 receiver's receive-window auto-tuning to open up the receive window 958 and for the BBR sender to realize that BBR.max_bw should be higher: 959 in the first round the receive-window auto-tuning algorithm grows the 960 receive window; in the second round the sender fills the higher 961 receive window; in the third round the sender gets higher delivery- 962 rate samples. 
This three-round threshold was validated by YouTube
963 experimental data.

965 4.3.1.3. Exiting Startup Based on Packet Loss

967 A second method BBR uses for estimating whether the bottleneck is
968 full is to look at sustained packet losses, specifically a case where
969 the following criteria are all met:

971 * The connection has been in fast recovery for at least one full
972   round trip.

974 * The loss rate over the time scale of a single full round trip
975   exceeds BBRLossThresh (2%).

977 * There are at least BBRStartupFullLossCnt=3 discontiguous sequence
978   ranges lost in that round trip.

980 If these criteria are all met, then BBRCheckStartupHighLoss() sets
981 BBR.filled_pipe = true and exits Startup and enters Drain.

983 The algorithm waits until all three criteria are met to filter out
984 noise from burst losses, and to try to ensure that the bottleneck is
985 fully utilized on a sustained basis and that the full bottleneck
986 bandwidth has been measured, before attempting to drain the level of
987 in-flight data to the estimated BDP.

989 4.3.2. Drain

991 Upon exiting Startup, BBR enters its Drain state. In Drain, BBR aims
992 to quickly drain any queue created in Startup by switching to a
993 pacing_gain well below 1.0, until any estimated queue has been
994 drained. It uses a pacing_gain that is the inverse of the value used
995 during Startup, chosen to try to drain the queue in one round
996 [BBRDrainPacingGain]:

998 BBREnterDrain():
999   BBR.state = Drain
1000   BBR.pacing_gain = 1/BBRStartupCwndGain  /* pace slowly */
1001   BBR.cwnd_gain = BBRStartupCwndGain      /* maintain cwnd */

1003 In Drain, when the amount of data in flight is less than or equal to
1004 the estimated BDP, meaning BBR estimates that the queue has been
1005 fully drained, then BBR exits Drain and enters ProbeBW. To implement
1006 this, upon every ACK BBR executes:

1008 BBRCheckDrain():
1009   if (BBR.state == Drain and packets_in_flight <= BBRInflight(1.0))
1010     BBREnterProbeBW()  /* BBR estimates the queue was drained */

1012 4.3.3. ProbeBW

1014 Long-lived BBR flows tend to spend the vast majority of their time in
1015 the ProbeBW states. In the ProbeBW states, a BBR flow sequentially
1016 accelerates, decelerates, and cruises, to measure the network path,
1017 improve its operating point (increase throughput and reduce queue
1018 pressure), and converge toward a more fair allocation of bottleneck
1019 bandwidth. To do this, the flow sequentially cycles through all
1020 three tactics: trying to send faster than, slower than, and at the
1021 same rate as the network delivery process. To achieve this, a BBR
1022 flow in ProbeBW mode cycles through the four ProbeBW states - DOWN,
1023 CRUISE, REFILL, and UP - described below in turn.

1025 4.3.3.1. ProbeBW_DOWN

1027 In the ProbeBW_DOWN phase of the cycle, a BBR flow pursues the
1028 deceleration tactic, to try to send slower than the network is
1029 delivering data, to reduce the amount of data in flight, with all of
1030 the standard motivations for the deceleration tactic (discussed in
1031 "State Machine Tactics", above). It does this by switching to a
1032 BBR.pacing_gain of 0.9, sending at 90% of BBR.bw. The pacing_gain
1033 value of 0.9 is derived based on the ProbeBW_UP pacing gain of 1.25,
1034 as the minimum pacing_gain value that allows bandwidth-based
1035 convergence to approximate fairness.
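As a non-normative illustration, the following sketch (in Python; the
bandwidth is an assumed example value) shows the ProbeBW_DOWN pacing
rate that results when the 0.9 pacing_gain is applied to BBR.bw,
assuming the dynamic gain and the static BBRPacingMarginPercent
discount from Section 2.5 are applied multiplicatively:

   # Non-normative sketch: the pacing rate implied by the ProbeBW_DOWN
   # pacing_gain of 0.9, assuming the static 1% margin is applied
   # multiplicatively on top of the dynamic gain.

   BBRPacingMarginPercent = 1    # static 1% discount (Section 2.5)

   def pacing_rate(bw, pacing_gain):
       """Scale BBR.bw by the dynamic gain and the static margin."""
       return pacing_gain * bw * (100 - BBRPacingMarginPercent) / 100.0

   bw = 12.5e6                 # example BBR.bw: 100 Mbit/s in bytes/sec
   print(pacing_rate(bw, 0.9)) # ProbeBW_DOWN: ~11137500.0 bytes/sec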
1037 Exit conditions: The flow exits this phase and enters CRUISE when the 1038 flow estimates that both of the following conditions have been met: 1040 * There is free headroom: If inflight_hi is set, then BBR remains in 1041 DOWN at least until the volume of in-flight data is less than or 1042 equal to BBRHeadroom*BBR.inflight_hi. The goal of this constraint 1043 is to ensure that in cases where loss signals suggest an upper 1044 limit on the volume of in-flight data, then the flow attempts to 1045 leave some free headroom in the path (e.g. free space in the 1046 bottleneck buffer or free time slots in the bottleneck link) that 1047 can be used by cross traffic (both for volume-based convergence of 1048 bandwidth shares and for burst tolerance). 1050 * The volume of in-flight data is less than or equal to BBR.BDP, 1051 i.e. the flow estimates that it has drained any queue at the 1052 bottleneck. 1054 4.3.3.2. ProbeBW_CRUISE 1056 In the ProbeBW_CRUISE phase of the cycle, a BBR flow pursues the 1057 "cruising" tactic (discussed in "State Machine Tactics", above), 1058 attempting to send at the same rate the network is delivering data. 1059 It tries to match the sending rate to the flow's current available 1060 bandwidth, to try to achieve high utilization of the available 1061 bandwidth without increasing queue pressure. It does this by 1062 switching to a pacing_gain of 1.0, sending at 100% of BBR.bw. 1063 Notably, while in this state it responds to concrete congestion 1064 signals (loss) by reducing BBR.bw_lo and BBR.inflight_lo, because 1065 these signals suggest that the available bandwidth and deliverable 1066 volume of in-flight data have likely reduced, and the flow needs to 1067 change to adapt, slowing down to match the latest delivery process. 1069 Exit conditions: The connection adaptively holds this state until it 1070 decides that it is time to probe for bandwidth, at which time it 1071 enters ProbeBW_REFILL (see "Time Scale for Bandwidth Probing", 1072 below). 1074 4.3.3.3. ProbeBW_REFILL 1076 The goal of the ProbeBW_REFILL state is to "refill the pipe", to try 1077 to fully utilize the network bottleneck without creating any 1078 significant queue pressure. 1080 To do this, BBR first resets the short-term model parameters bw_lo 1081 and inflight_lo, setting both to "Infinity". This is the key moment 1082 in the BBR time scale strategy (see "Time Scale Strategy", above) 1083 where the flow pivots, discarding its conservative short-term bw_lo 1084 and inflight_lo parameters and beginning to robustly probe the 1085 bottleneck's long-term available bandwidth. During this time bw_hi 1086 and inflight_hi, if set, constrain the connection. 1088 During ProbeBW_REFILL BBR uses a BBR.pacing_gain of 1.0, to send at a 1089 rate that matches the current estimated available bandwidth, for one 1090 packet-timed round trip. The goal is to fully utilize the bottleneck 1091 link before transitioning into ProbeBW_UP and significantly 1092 increasing the chances of causing a loss signal. The motivating 1093 insight is that, as soon as a flow starts acceleration, sending 1094 faster than the available bandwidth, it will start building a queue 1095 at the bottleneck. And if the buffer is shallow enough, then the 1096 flow can cause loss signals very shortly after the first accelerating 1097 packets arrive at the bottleneck. 
If the flow were to neglect to
1098 fill the pipe before it causes this loss signal, then these very
1099 quick signals of excess queue could cause the flow's estimate of the
1100 path's capacity (i.e. inflight_hi) to be a significant underestimate.
1101 In particular, if the flow were to transition directly from
1102 ProbeBW_CRUISE to ProbeBW_UP, the volume of in-flight data (at the
1103 time the first accelerating packets were sent) may often still be
1104 very close to the volume of in-flight data maintained in CRUISE,
1105 which may be only BBRHeadroom*inflight_hi.

1107 Exit conditions: The flow exits ProbeBW_REFILL after one packet-timed
1108 round trip, and enters UP. This is because after one full round trip
1109 of sending in ProbeBW_REFILL the flow (if not application-limited)
1110 has had an opportunity to place as many packets in flight as its
1111 BBR.bw estimate permits. Correspondingly, at this point the flow
1112 starts to see bandwidth samples reflecting its ProbeBW_REFILL
1113 behavior, which may be putting too much data in flight.

1115 4.3.3.4. ProbeBW_UP

1117 After ProbeBW_REFILL refills the pipe, ProbeBW_UP probes for possible
1118 increases in available bandwidth by using a BBR.pacing_gain of 1.25,
1119 sending faster than the current estimated available bandwidth.

1121 If the flow has not set BBR.inflight_hi or BBR.bw_hi, it tries to
1122 raise the volume of in-flight data to at least BBR.pacing_gain *
1123 BBR.bdp = 1.25 * BBR.bdp; note that this may take more than
1124 BBR.min_rtt if BBR.min_rtt is small (e.g. on a LAN).

1126 If the flow has set BBR.inflight_hi or BBR.bw_hi, it moves to an
1127 operating point based on those limits and then gradually increases
1128 the upper volume bound (BBR.inflight_hi) and rate bound (BBR.bw_hi)
1129 using the following approach:

1131 * bw_hi: The flow raises bw_hi to the latest measured bandwidth
1132   sample if the latest measured bandwidth sample is above bw_hi and
1133   the loss rate for the sample is not above BBRLossThresh.

1135 * inflight_hi: The flow raises inflight_hi in ProbeBW_UP in a manner
1136   that is slow and cautious at first, but increasingly rapid and
1137   bold over time. The initial caution is motivated by the fact that
1138   a given BBR flow may be sharing a shallow buffer with thousands of
1139   other flows, so that the buffer space available to the flow may be
1140   quite tight - even just a single packet. The increasingly rapid
1141   growth over time is motivated by the fact that in a high-speed WAN
1142   the increase in available bandwidth (and thus the estimated BDP)
1143   may require the flow to grow the volume of its inflight data by up
1144   to O(1,000,000); even a quite typical BDP like 10Gbps * 100ms is
1145   82,563 packets. BBR takes an approach where the additive increase
1146   to BBR.inflight_hi exponentially doubles each round trip; in each
1147   successive round trip, inflight_hi grows by 1, 2, 4, 8, 16, etc.,
1148   with the increases spread uniformly across the entire round trip.
1149   This helps allow BBR to utilize a larger BDP in O(log(BDP)) round
1150   trips, meeting the design goal for scalable utilization of newly-
1151   available bandwidth.

1153 Exit conditions: The BBR flow ends ProbeBW_UP bandwidth probing and
1154 transitions to ProbeBW_DOWN to try to drain the bottleneck queue when
1155 any of the following conditions are met:

1157 1. Estimated queue: The flow has been in ProbeBW_UP for at least
1158 1*min_rtt, and the estimated queue is high enough that the flow
1159 judges it has robustly probed for available bandwidth
1160 (packets_in_flight > 1.25 * BBR.bdp).

1162 2. Loss: The current loss rate exceeds BBRLossThresh (2%).

1164 4.3.3.5. Time Scale for Bandwidth Probing

1166 Choosing the time scale for probing bandwidth is tied to the question
1167 of how to coexist with legacy Reno/CUBIC flows, since probing for
1168 bandwidth runs a significant risk of causing packet loss, and causing
1169 packet loss can significantly limit the throughput of such legacy
1170 Reno/CUBIC flows.

1172 4.3.3.5.1. Bandwidth Probing and Coexistence with Reno/CUBIC

1174 BBR has an explicit strategy for coexistence with Reno/CUBIC: to try
1175 to behave in a manner so that Reno/CUBIC flows coexisting with BBR
1176 can continue to work well in the primary contexts where they do
1177 today:

1179 * Intra-datacenter/LAN traffic: we want Reno/CUBIC to be able to
1180   perform well in 100M through 40G enterprise and datacenter
1181   Ethernet
1182   - BDP = 40 Gbps * 20 us / (1514 bytes) ~= 66 packets

1184 * Public Internet last mile traffic: we want Reno/CUBIC to be able
1185   to support up to 25Mbps (for 4K Video) at an RTT of 30ms, typical
1186   parameters for common CDNs for large video services:

1188   - BDP = 25Mbps * 30 ms / (1514 bytes) ~= 62 packets

1190 The challenge in meeting these goals is that Reno/CUBIC need long
1191 periods of no loss to utilize large BDPs. The good news is that in
1192 the environments where Reno/CUBIC work well today (mentioned above),
1193 the BDPs are small, roughly ~100 packets or less.

1195 4.3.3.5.2. A Dual-Time-Scale Approach for Coexistence

1197 The BBR strategy has several aspects:

1199 1. The highest priority is to estimate the bandwidth available to
1200    the BBR flow in question.

1202 2. Secondarily, a given BBR flow adapts (within bounds) the
1203    frequency at which it probes bandwidth and knowingly risks packet
1204    loss, to allow Reno/CUBIC to reach a bandwidth at least as high
1205    as that of the given BBR flow.

1207 To adapt the frequency of bandwidth probing, BBR considers two time
1208 scales: a BBR-native time scale, and a bounded Reno-conscious time
1209 scale:

1211 * T_bbr: BBR-native time-scale

1213   - T_bbr = uniformly randomly distributed between 2 and 3 secs

1215 * T_reno: Reno-coexistence time scale

1217   - T_reno_bound = pick_randomly_either({62, 63})

1219   - reno_bdp = min(BBR.bdp, cwnd)

1221   - T_reno = min(reno_bdp, T_reno_bound) round trips

1223 * T_probe: The time between bandwidth probe UP phases:

1225   - T_probe = min(T_bbr, T_reno)

1227 This dual-time-scale approach is similar to that used by CUBIC, which
1228 has a CUBIC-native time scale given by a cubic curve, and a "Reno
1229 emulation" module that estimates what cwnd would give the flow Reno-
1230 equivalent throughput. At any given moment, CUBIC chooses the cwnd
1231 implied by the more aggressive strategy.

1233 We randomize both the T_bbr and T_reno parameters, for better mixing
1234 and fairness convergence.

1236 4.3.3.5.3. Design Considerations for Choosing Constant Parameters

1238 We design the wall-clock bounds of the BBR-native inter-
1239 bandwidth-probe time, T_bbr, to be:

1241 * Higher than 2 sec to try to avoid causing loss for a long enough
1242   time to allow a Reno flow with RTT=30ms to get 25Mbps (4K video)
1243   throughput. For this workload, given the Reno sawtooth that
For this workload, given the Reno sawtooth that 1244 raises cwnd from roughly BDP to 2*BDP, growing by one MSS per round trip, the 1245 inter-bandwidth-probe time must be at least: BDP * RTT = 25Mbps * 1246 .030 sec / (1514 bytes) * 0.030 sec = 1.9 secs 1248 * Lower than 3 sec to ensure flows can start probing in a reasonable 1249 amount of time to discover unutilized bw on human-scale 1250 interactive time-scales (e.g. perhaps traffic from a competing web 1251 page download is now complete). 1253 The maximum round-trip bounds of the Reno-coexistence time scale, 1254 T_reno, are chosen to be 62-63 with the following considerations in 1255 mind: 1257 * Choosing a value smaller than roughly 60 would imply that when BBR 1258 flows coexisted with Reno/CUBIC flows (e.g. Netflix Reno flows) 1259 on public Internet broadband links, the Reno/CUBIC flows would not 1260 be able to achieve enough bandwidth to show 4K video. 1262 * Choosing a value larger than roughly 65 would prevent BBR from 1263 reaching its goal of tolerating 1% loss per round trip. Given 1264 that the steady-state (non-bandwidth-probing) BBR response to a 1265 round trip with X% packet loss is to reduce the sending rate by X% 1266 (see the "Updating the Model Upon Packet Loss" section), this 1267 means that the BBR sending rate after N rounds of packet loss at a 1268 rate loss_rate is (1 - loss_rate)^N. This means that for a flow 1269 that encounters 1% loss in 65 round trips of ProbeBW_CRUISE, and 1270 then doubles its cwnd (back to BBR.inflight_hi) in ProbeBW_REFILL 1271 and ProbeBW_UP, it will be able to restore and reprobe its 1272 original sending rate, since: BBR.max_bw * (1 - loss_rate)^N * 2 = 1273 BBR.max_bw * (1 - .01)^65 * 2 ~= 1.04 * BBR.max_bw. That is, the flow 1274 will be able to fully respond to packet loss signals in 1275 ProbeBW_CRUISE while also fully re-measuring its maximum 1276 achievable throughput in ProbeBW_UP. 1278 The resulting behavior is that for BBR flows with small BDPs, the 1279 bandwidth probing will be on roughly the same time scale as Reno/ 1280 CUBIC; flows with large BDPs will intentionally probe more rapidly/ 1281 frequently than Reno/CUBIC would (roughly every 62 round trips for 1282 low-RTT flows, or 2-3 secs for high-RTT flows). 1284 The considerations above for timing bandwidth probing can be 1285 implemented as follows: 1287 /* Is it time to transition from DOWN or CRUISE to REFILL? */ 1288 BBRCheckTimeToProbeBW(): 1289 if (BBRHasElapsedInPhase(BBR.bw_probe_wait) || 1290 BBRIsRenoCoexistenceProbeTime()) 1291 BBRStartProbeBW_REFILL() 1292 return true 1293 return false 1295 /* Randomized decision about how long to wait until 1296 * probing for bandwidth, using round count and wall clock. 1297 */ 1298 BBRPickProbeWait(): 1299 /* Decide random round-trip bound for wait: */ 1300 BBR.rounds_since_bw_probe = 1301 random_int_between(0, 1); /* 0 or 1 */ 1302 /* Decide the random wall clock bound for wait: */ 1303 BBR.bw_probe_wait = 1304 2sec + random_float_between(0.0, 1.0) /* 0..1 sec */ 1306 BBRIsRenoCoexistenceProbeTime(): 1307 reno_rounds = BBRTargetInflight() 1308 rounds = min(reno_rounds, 63) 1309 return BBR.rounds_since_bw_probe >= rounds 1311 /* How much data do we want in flight? 1312 * Our estimated BDP, unless congestion cut cwnd. */ 1313 BBRTargetInflight(): 1314 return min(BBR.bdp, cwnd) 1316 4.3.3.6. ProbeBW Algorithm Details 1318 BBR's ProbeBW algorithm operates as follows. 
1320 Upon entering ProbeBW, BBR executes: 1322 BBREnterProbeBW(): 1323 BBRStartProbeBW_DOWN() 1325 The core logic for entering each state: 1327 BBRStartProbeBW_DOWN(): 1328 BBRResetCongestionSignals() 1329 BBR.probe_up_cnt = Infinity /* not growing inflight_hi */ 1330 BBRPickProbeWait() 1331 BBR.cycle_stamp = Now() /* start wall clock */ 1332 BBR.ack_phase = ACKS_PROBE_STOPPING 1333 BBRStartRound() 1334 BBR.state = ProbeBW_DOWN 1336 BBRStartProbeBW_CRUISE(): 1337 BBR.state = ProbeBW_CRUISE 1339 BBRStartProbeBW_REFILL(): 1340 BBRResetLowerBounds() 1341 BBR.bw_probe_up_rounds = 0 1342 BBR.bw_probe_up_acks = 0 1343 BBR.ack_phase = ACKS_REFILLING 1344 BBRStartRound() 1345 BBR.state = ProbeBW_REFILL 1347 BBRStartProbeBW_UP(): 1348 BBR.ack_phase = ACKS_PROBE_STARTING 1349 BBRStartRound() 1350 BBR.cycle_stamp = Now() /* start wall clock */ 1351 BBR.state = ProbeBW_UP 1352 BBRRaiseInflightHiSlope() 1354 BBR executes the following BBRUpdateProbeBWCyclePhase() logic on each 1355 ACK that ACKs or SACKs new data, to advance the ProbeBW state 1356 machine: 1358 /* The core state machine logic for ProbeBW: */ 1359 BBRUpdateProbeBWCyclePhase(): 1360 if (!BBR.filled_pipe) 1361 return /* only handling steady-state behavior here */ 1362 BBRAdaptUpperBounds() 1363 if (!IsInAProbeBWState()) 1364 return /* only handling ProbeBW states here: */ 1366 switch (state) 1368 ProbeBW_DOWN: 1369 if (BBRCheckTimeToProbeBW()) 1370 return /* already decided state transition */ 1371 if (BBRCheckTimeToCruise()) 1372 BBRStartProbeBW_CRUISE() 1374 ProbeBW_CRUISE: 1375 if (BBRCheckTimeToProbeBW()) 1376 return /* already decided state transition */ 1378 ProbeBW_REFILL: 1379 /* After one round of REFILL, start UP */ 1380 if (BBR.round_start) 1381 BBR.bw_probe_samples = 1 1382 BBRStartProbeBW_UP() 1384 ProbeBW_UP: 1385 if (BBRHasElapsedInPhase(BBR.min_rtt) and 1386 inflight > BBRInflight(BBR.max_bw, 1.25)) 1387 BBRStartProbeBW_DOWN() 1389 The ancillary logic to implement the ProbeBW state machine: 1391 IsInAProbeBWState() 1392 state = BBR.state 1393 return (state == ProbeBW_DOWN or 1394 state == ProbeBW_CRUISE or 1395 state == ProbeBW_REFILL or 1396 state == ProbeBW_UP) 1398 /* Time to transition from DOWN to CRUISE? */ 1399 BBRCheckTimeToCruise(): 1400 if (inflight > BBRInflightWithHeadroom()) 1401 return false /* not enough headroom */ 1402 if (inflight <= BBRInflight(BBR.max_bw, 1.0)) 1403 return true /* inflight <= estimated BDP */ 1405 BBRHasElapsedInPhase(interval): 1407 return Now() > BBR.cycle_stamp + interval 1409 /* Return a volume of data that tries to leave free 1410 * headroom in the bottleneck buffer or link for 1411 * other flows, for fairness convergence and lower 1412 * RTTs and loss */ 1413 BBRInflightWithHeadroom(): 1414 if (BBR.inflight_hi == Infinity) 1415 return Infinity 1416 headroom = max(1, BBRHeadroom * BBR.inflight_hi) 1417 return max(BBR.inflight_hi - headroom, 1418 BBRMinPipeCwnd) 1420 /* Raise inflight_hi slope if appropriate. */ 1421 BBRRaiseInflightHiSlope(): 1422 growth_this_round = 1MSS << BBR.bw_probe_up_rounds 1423 BBR.bw_probe_up_rounds = min(BBR.bw_probe_up_rounds + 1, 30) 1424 BBR.probe_up_cnt = max(cwnd / growth_this_round, 1) 1426 /* Increase inflight_hi if appropriate. 
*/ 1427 BBRProbeInflightHiUpward(): 1428 if (!is_cwnd_limited or cwnd < BBR.inflight_hi) 1429 return /* not fully using inflight_hi, so don't grow it */ 1430 BBR.bw_probe_up_acks += rs.newly_acked 1431 if (BBR.bw_probe_up_acks >= BBR.probe_up_cnt) 1432 delta = BBR.bw_probe_up_acks / BBR.probe_up_cnt 1433 BBR.bw_probe_up_acks -= delta * BBR.probe_up_cnt 1434 BBR.inflight_hi += delta 1435 if (BBR.round_start) 1436 BBRRaiseInflightHiSlope() 1438 /* Track ACK state and update BBR.max_bw window and 1439 * BBR.inflight_hi and BBR.bw_hi. */ 1440 BBRAdaptUpperBounds(): 1441 if (BBR.ack_phase == ACKS_PROBE_STARTING and BBR.round_start) 1442 /* starting to get bw probing samples */ 1443 BBR.ack_phase = ACKS_PROBE_FEEDBACK 1444 if (BBR.ack_phase == ACKS_PROBE_STOPPING and BBR.round_start) 1445 /* end of samples from bw probing phase */ 1446 if (IsInAProbeBWState() and !rs.is_app_limited) 1447 BBRAdvanceMaxBwFilter() 1449 if (!CheckInflightTooHigh()) 1450 /* Loss rate is safe. Adjust upper bounds upward. */ 1451 if (BBR.inflight_hi == Infinity or BBR.bw_hi == Infinity) 1452 return /* no upper bounds to raise */ 1453 if (rs.tx_in_flight > BBR.inflight_hi) 1454 BBR.inflight_hi = rs.tx_in_flight 1456 if (rs.delivery_rate > BBR.bw_hi) 1457 BBR.bw_hi = rs.delivery_rate 1458 if (BBR.state == ProbeBW_UP) 1459 BBRProbeInflightHiUpward() 1461 4.3.4. ProbeRTT 1463 4.3.4.1. ProbeRTT Overview 1465 To help probe for BBR.min_rtt, on an as-needed basis BBR flows enter 1466 the ProbeRTT state to try to cooperate to periodically drain the 1467 bottleneck queue - and thus improve their BBR.min_rtt estimate of the 1468 unloaded two-way propagation delay. 1470 A critical point is that before BBR raises its BBR.min_rtt estimate 1471 (which would in turn raise its maximum permissible cwnd), it first 1472 enters ProbeRTT to try to make a concerted and coordinated effort to 1473 drain the bottleneck queue and make a robust BBR.min_rtt measurement. 1474 This allows the BBR.min_rtt estimates of ensembles of BBR flows to 1475 converge, avoiding feedback loops of ever-increasing queues and RTT 1476 samples. 1478 The ProbeRTT state works in concert with BBR.min_rtt estimation. Up 1479 to once every ProbeRTTInterval = 5 seconds, the flow enters ProbeRTT, 1480 decelerating by setting its cwnd_gain to BBRProbeRTTCwndGain = 0.5 to 1481 reduce its volume of inflight data to half of its estimated BDP, to 1482 try to allow the flow to measure the unloaded two-way propagation 1483 delay. 1485 There are two main motivations for making the MinRTTFilterLen roughly 1486 twice the ProbeRTTInterval. First, this ensures that during a 1487 ProbeRTT episode the flow will "remember" the BBR.min_rtt value it 1488 measured during the previous ProbeRTT episode, providing a robust bdp 1489 estimate for the cwnd = 0.5*bdp calculation, increasing the 1490 likelihood of fully draining the bottleneck queue. Second, this 1491 allows the flow's BBR.min_rtt filter window to generally include RTT 1492 samples from two ProbeRTT episodes, providing a more robust estimate. 1494 The algorithm for ProbeRTT is as follows: 1496 Entry conditions: In any state other than ProbeRTT itself, if the 1497 BBR.probe_rtt_min_delay estimate has not been updated (i.e., by 1498 getting a lower RTT measurement) for more than ProbeRTTInterval = 5 1499 seconds, then BBR enters ProbeRTT and reduces the BBR.cwnd_gain to 1500 BBRProbeRTTCwndGain = 0.5. 
1502 Exit conditions: After maintaining the volume of in-flight data at 1503 BBRProbeRTTCwndGain*BBR.bdp for at least ProbeRTTDuration (200 ms) 1504 and at least one round trip, BBR leaves ProbeRTT and transitions to 1505 ProbeBW if it estimates the pipe was filled already, or Startup 1506 otherwise. 1508 4.3.4.2. ProbeRTT Design Rationale 1510 BBR is designed to have ProbeRTT sacrifice no more than roughly 2% of 1511 a flow's available bandwidth. It is also designed to spend the vast 1512 majority of its time (at least roughly 96 percent) in ProbeBW and the 1513 rest in ProbeRTT, based on a set of tradeoffs. ProbeRTT lasts long 1514 enough (at least ProbeRTTDuration = 200 ms) to allow flows with 1515 different RTTs to have overlapping ProbeRTT states, while still being 1516 short enough to bound the throughput penalty of ProbeRTT's cwnd 1517 capping to roughly 2%, with the average throughput targeted at: 1519 throughput = (200ms*0.5*BBR.bw + (5s - 200ms)*BBR.bw) / 5s 1520 = (.1s + 4.8s)/5s * BBR.bw = 0.98 * BBR.bw 1522 As discussed above, BBR's BBR.min_rtt filter window, MinRTTFilterLen, 1523 and time interval between ProbeRTT states, ProbeRTTInterval, work in 1524 concert. BBR uses a MinRTTFilterLen equal to or longer than 1525 ProbeRTTInterval to allow the filter window to include at least one 1526 ProbeRTT. 1528 To allow coordination with other BBR flows, each flow MUST use the 1529 standard ProbeRTTInterval of 5 secs. 1531 An ProbeRTTInterval of 5 secs is short enough to allow quick 1532 convergence if traffic levels or routes change, but long enough so 1533 that interactive applications (e.g., Web, remote procedure calls, 1534 video chunks) often have natural silences or low-rate periods within 1535 the window where the flow's rate is low enough for long enough to 1536 drain its queue in the bottleneck. Then the BBR.probe_rtt_min_delay 1537 filter opportunistically picks up these measurements, and the 1538 BBR.probe_rtt_min_delay estimate refreshes without requiring 1539 ProbeRTT. This way, flows typically need only pay the 2 percent 1540 throughput penalty if there are multiple bulk flows busy sending over 1541 the entire ProbeRTTInterval window. 1543 As an optimization, when restarting from idle and finding that the 1544 BBR.probe_rtt_min_delay has expired, BBR does not enter ProbeRTT; the 1545 idleness is deemed a sufficient attempt to coordinate to drain the 1546 queue. 1548 4.3.4.3. Calculating the rs.rtt RTT Sample 1550 Upon transmitting each packet, BBR (or the associated transport 1551 protocol) stores in per-packet data the wall-clock scheduled 1552 transmission time of the packet in packet.departure_time (see the 1553 "Pacing Rate: BBR.pacing_rate" section for how this is calculated). 1555 For every ACK that newly acknowledges some data (whether cumulatively 1556 or selectively), the sender's BBR implementation (or the associated 1557 transport protocol implementation) attempts to calculate an RTT 1558 sample. The sender MUST consider any potential retransmission 1559 ambiguities that can arise in some transport protocols. If some of 1560 the acknowledged data was not retransmitted, or some of the data was 1561 retransmitted but the sender can still unambiguously determine the 1562 RTT of the data (e.g. if the transport supports [RFC7323] TCP 1563 timestamps or an equivalent mechanism), then the sender calculates an 1564 RTT sample, rs.rtt, as follows: 1566 rs.rtt = Now() - packet.departure_time 1568 4.3.4.4. 
ProbeRTT Logic 1570 On every ACK BBR executes BBRUpdateMinRTT() to update its ProbeRTT 1571 scheduling state (BBR.probe_rtt_min_delay and 1572 BBR.probe_rtt_min_stamp) and its BBR.min_rtt estimate: 1574 BBRUpdateMinRTT(): 1575 BBR.probe_rtt_expired = 1576 Now() > BBR.probe_rtt_min_stamp + ProbeRTTInterval 1577 if (rs.rtt >= 0 and 1578 (rs.rtt < BBR.probe_rtt_min_delay or 1579 BBR.probe_rtt_expired)) 1580 BBR.probe_rtt_min_delay = rs.rtt 1581 BBR.probe_rtt_min_stamp = Now() 1583 min_rtt_expired = 1584 Now() > BBR.min_rtt_stamp + MinRTTFilterLen 1585 if (BBR.probe_rtt_min_delay < BBR.min_rtt or 1586 min_rtt_expired) 1587 BBR.min_rtt = BBR.probe_rtt_min_delay 1588 BBR.min_rtt_stamp = BBR.probe_rtt_min_stamp 1590 Here BBR.probe_rtt_expired is a boolean recording whether the 1591 BBR.probe_rtt_min_delay has expired and is due for a refresh, via 1592 either an application idle period or a transition into ProbeRTT 1593 state. 1595 On every ACK BBR executes BBRCheckProbeRTT() to handle the steps 1596 related to the ProbeRTT state as follows: 1598 BBRCheckProbeRTT(): 1599 if (BBR.state != ProbeRTT and 1600 BBR.probe_rtt_expired and 1601 not BBR.idle_restart) 1602 BBREnterProbeRTT() 1603 BBR.prior_cwnd = BBRSaveCwnd() 1604 BBR.probe_rtt_done_stamp = 0 1605 BBR.ack_phase = ACKS_PROBE_STOPPING 1606 BBRStartRound() 1607 if (BBR.state == ProbeRTT) 1608 BBRHandleProbeRTT() 1609 if (rs.delivered > 0) 1610 BBR.idle_restart = false 1612 BBREnterProbeRTT(): 1613 BBR.state = ProbeRTT 1614 BBR.pacing_gain = 1 1615 BBR.cwnd_gain = BBRProbeRTTCwndGain /* 0.5 */ 1617 BBRHandleProbeRTT(): 1618 /* Ignore low rate samples during ProbeRTT: */ 1619 MarkConnectionAppLimited() 1620 if (BBR.probe_rtt_done_stamp == 0 and 1621 packets_in_flight <= BBRProbeRTTCwnd()) 1622 /* Wait for at least ProbeRTTDuration to elapse: */ 1623 BBR.probe_rtt_done_stamp = 1624 Now() + ProbeRTTDuration 1625 /* Wait for at least one round to elapse: */ 1626 BBR.probe_rtt_round_done = false 1627 BBRStartRound() 1628 else if (BBR.probe_rtt_done_stamp != 0) 1629 if (BBR.round_start) 1630 BBR.probe_rtt_round_done = true 1631 if (BBR.probe_rtt_round_done) 1632 BBRCheckProbeRTTDone() 1634 BBRCheckProbeRTTDone(): 1635 if (BBR.probe_rtt_done_stamp != 0 and 1636 Now() > BBR.probe_rtt_done_stamp) 1637 /* schedule next ProbeRTT: */ 1638 BBR.probe_rtt_min_stamp = Now() 1639 BBRRestoreCwnd() 1640 BBRExitProbeRTT() 1642 MarkConnectionAppLimited(): 1643 C.app_limited = 1644 (C.delivered + packets_in_flight) ? : 1 1646 4.3.4.5. Exiting ProbeRTT 1648 When exiting ProbeRTT, BBR transitions to ProbeBW if it estimates the 1649 pipe was filled already, or Startup otherwise. 1651 When transitioning out of ProbeRTT, BBR calls BBRResetLowerBounds() 1652 to reset the lower bounds, since any congestion encountered in 1653 ProbeRTT may have pulled the short-term model far below the capacity 1654 of the path. 1656 But the algorithm is cautious in timing the next bandwidth probe: 1657 raising inflight after ProbeRTT may cause loss, so the algorithm 1658 resets the bandwidth-probing clock by starting the cycle in 1659 ProbeBW_DOWN. But then as an optimization, since the connection is 1660 exiting ProbeRTT, we know that inflight is already below the estimated 1661 BDP, so the connection can proceed immediately to ProbeBW_CRUISE. 1663 To summarize, the logic for exiting ProbeRTT is as follows: 1665 BBRExitProbeRTT(): 1666 BBRResetLowerBounds() 1667 if (BBR.filled_pipe) 1668 BBRStartProbeBW_DOWN() 1669 BBRStartProbeBW_CRUISE() 1670 else 1671 BBREnterStartup() 1673 4.4. 
Restarting From Idle 1675 4.4.1. Setting Pacing Rate in ProbeBW 1677 When restarting from idle in ProbeBW states, BBR leaves its cwnd as- 1678 is and paces packets at exactly BBR.bw, aiming to return as quickly 1679 as possible to its target operating point of rate balance and a full 1680 pipe. Specifically, if the flow's BBR.state is ProbeBW, and the flow 1681 is application-limited, and there are no packets in flight currently, 1682 then at the moment the flow sends one or more packets BBR sets 1683 BBR.pacing_rate to exactly BBR.bw. More precisely, the BBR algorithm 1684 takes the following steps in BBRHandleRestartFromIdle() before 1685 sending a packet for a flow. 1687 The "Restarting Idle Connections" section of [RFC5681] suggests 1688 restarting from idle by slow-starting from the initial window. 1689 However, this approach assumed a congestion control algorithm 1690 that had no estimate of the bottleneck bandwidth and no pacing, and 1691 thus resorted to relying on slow-starting driven by an ACK clock. 1692 The long (log_2(BDP)*RTT) delays required to reach full utilization 1693 with that "slow start after idle" approach caused many large 1694 deployments to disable this mechanism, resulting in a "BDP-scale 1695 line-rate burst" approach instead. Instead of these two approaches, 1696 BBR restarts by pacing at BBR.bw, typically achieving approximate 1697 rate balance and a full pipe after only one BBR.min_rtt has elapsed. 1699 4.4.2. Checking for ProbeRTT Completion 1701 As an optimization, when restarting from idle BBR checks to see if 1702 the connection is in ProbeRTT and has met the exit conditions for 1703 ProbeRTT. If a connection goes idle during ProbeRTT then often it 1704 will have met those exit conditions by the time it restarts, so that 1705 the connection can restore the cwnd to its full value before it 1706 starts transmitting a new flight of data. 1708 4.4.3. Logic 1710 The BBR algorithm takes the following steps in 1711 BBRHandleRestartFromIdle() before sending a packet for a flow: 1713 BBRHandleRestartFromIdle(): 1714 if (packets_in_flight == 0 and C.app_limited) 1715 BBR.idle_restart = true 1716 BBR.extra_acked_interval_start = Now() 1717 if (IsInAProbeBWState()) 1718 BBRSetPacingRateWithGain(1) 1719 else if (BBR.state == ProbeRTT) 1720 BBRCheckProbeRTTDone() 1722 4.5. Updating Network Path Model Parameters 1724 BBR is a model-based congestion control algorithm: it is based on an 1725 explicit model of the network path over which a transport flow 1726 travels. The following is a summary of each parameter, including its 1727 meaning and how the algorithm calculates and uses its value. We can 1728 group the parameters into three groups: 1730 * core state machine parameters 1732 * parameters to model the data rate 1734 * parameters to model the volume of in-flight data 1736 4.5.1. BBR.round_count: Tracking Packet-Timed Round Trips 1738 Several aspects of the BBR algorithm depend on counting the progress 1739 of "packet-timed" round trips, which start at the transmission of 1740 some segment, and then end at the acknowledgement of that segment. 1741 BBR.round_count is a count of the number of these "packet-timed" 1742 round trips elapsed so far. BBR uses this virtual BBR.round_count 1743 because it is more robust than using wall clock time. 
In particular, 1744 arbitrary intervals of wall clock time can elapse due to application 1745 idleness, variations in RTTs, or timer delays for retransmission 1746 timeouts, causing wall-clock-timed model parameter estimates to "time 1747 out" or to be "forgotten" too quickly to provide robustness. 1749 BBR counts packet-timed round trips by recording state about a 1750 sentinel packet, and waiting for an ACK of any data packet that was 1751 sent after that sentinel packet, using the following pseudocode: 1753 Upon connection initialization: 1755 BBRInitRoundCounting(): 1756 BBR.next_round_delivered = 0 1757 BBR.round_start = false 1758 BBR.round_count = 0 1760 Upon sending each packet, the rate estimation algorithm [draft-cheng- 1761 iccrg-delivery-rate-estimation] records the amount of data thus far 1762 acknowledged as delivered: 1764 packet.delivered = C.delivered 1766 Upon receiving an ACK for a given data packet, the rate estimation 1767 algorithm [draft-cheng-iccrg-delivery-rate-estimation] updates the 1768 amount of data thus far acknowledged as delivered: 1770 C.delivered += packet.size 1772 Upon receiving an ACK for a given data packet, the BBR algorithm 1773 first executes the following logic to see if a round trip has 1774 elapsed, and if so, increments the count of such round trips elapsed: 1776 BBRUpdateRound(): 1777 if (packet.delivered >= BBR.next_round_delivered) 1778 BBRStartRound() 1779 BBR.round_count++ 1780 BBR.rounds_since_bw_probe++ 1781 BBR.round_start = true 1782 else 1783 BBR.round_start = false 1785 BBRStartRound(): 1786 BBR.next_round_delivered = C.delivered 1788 4.5.2. BBR.max_bw: Estimated Maximum Bandwidth 1790 BBR.max_bw is BBR's estimate of the maximum bottleneck bandwidth 1791 available to data transmissions for the transport flow. At any time, 1792 a transport connection's data transmissions experience some slowest 1793 link or bottleneck. The bottleneck's delivery rate determines the 1794 connection's maximum data-delivery rate. BBR tries to closely match 1795 its sending rate to this bottleneck delivery rate to help seek "rate 1796 balance", where the flow's packet arrival rate at the bottleneck 1797 equals the departure rate. The bottleneck rate varies over the life 1798 of a connection, so BBR continually estimates BBR.max_bw using recent 1799 signals. 1801 4.5.2.1. Delivery Rate Samples for Estimating BBR.max_bw 1803 Since calculating delivery rate samples is subtle, and the samples 1804 are useful independent of congestion control, the approach BBR uses 1805 for measuring each single delivery rate sample is specified in a 1806 separate Internet Draft [draft-cheng-iccrg-delivery-rate-estimation]. 1808 4.5.2.2. BBR.max_bw Max Filter 1810 Delivery rate samples are often below the typical bottleneck 1811 bandwidth available to the flow, due to "noise" introduced by random 1812 variation in physical transmission processes (e.g. radio link layer 1813 noise) or queues along the network path. To filter these effects 1814 BBR uses a max filter: BBR estimates BBR.max_bw using the windowed 1815 maximum recent delivery rate sample seen by the connection over 1816 recent history. 1818 The BBR.max_bw max filter window covers a time period extending over 1819 the past two ProbeBW cycles. The BBR.max_bw max filter window length 1820 is driven by trade-offs among several considerations: 1822 * It is long enough to cover at least one entire ProbeBW cycle (see 1823 the "ProbeBW" section). 
This ensures that the window contains at 1824 least some delivery rate samples that are the result of data 1825 transmitted with a super-unity pacing_gain (a pacing_gain larger 1826 than 1.0). Such super-unity delivery rate samples are 1827 instrumental in revealing the path's underlying available 1828 bandwidth even when there is noise from delivery rate shortfalls 1829 due to aggregation delays, queuing delays from variable cross- 1830 traffic, lossy link layers with uncorrected losses, or short-term 1831 buffer exhaustion (e.g., brief coincident bursts in a shallow 1832 buffer). 1834 * It aims to be long enough to cover short-term fluctuations in the 1835 network's delivery rate due to the aforementioned sources of 1836 noise. In particular, the delivery rate for radio link layers 1837 (e.g., wifi and cellular technologies) can be highly variable, and 1838 the filter window needs to be long enough to remember "good" 1839 delivery rate samples in order to be robust to such variations. 1841 * It aims to be short enough to respond in a timely manner to 1842 sustained reductions in the bandwidth available to a flow, whether 1843 this is because other flows are using a larger share of the 1844 bottleneck, or the bottleneck link service rate has reduced due to 1845 layer 1 or layer 2 changes, policy changes, or routing changes. 1846 In any of these cases, existing BBR flows traversing the 1847 bottleneck should, in a timely manner, reduce their BBR.max_bw 1848 estimates and thus pacing rate and in-flight data, in order to 1849 match the sending behavior to the new available bandwidth. 1851 4.5.2.3. BBR.max_bw and Application-limited Delivery Rate Samples 1853 Transmissions can be application-limited, meaning the transmission 1854 rate is limited by the application rather than the congestion control 1855 algorithm. This is quite common because of request/response traffic. 1856 When there is a transmission opportunity but no data to send, the 1857 delivery rate sampler marks the corresponding bandwidth sample(s) as 1858 application-limited [draft-cheng-iccrg-delivery-rate-estimation]. 1859 The BBR.max_bw estimator carefully decides which samples to include 1860 in the bandwidth model to ensure that BBR.max_bw reflects network 1861 limits, not application limits. By default, the estimator discards 1862 application-limited samples, since by definition they reflect 1863 application limits. However, the estimator does use application- 1864 limited samples if the measured delivery rate happens to be larger 1865 than the current BBR.max_bw estimate, since this indicates the 1866 current BBR.max_bw estimate is too low. 1868 4.5.2.4. Updating the BBR.max_bw Max Filter 1870 For every ACK that acknowledges some data packets as delivered, BBR 1871 invokes BBRUpdateMaxBw() to update the BBR.max_bw estimator as 1872 follows (here rs.delivery_rate is the delivery rate sample obtained 1873 from the ACK that is being processed, as specified in [draft-cheng- 1874 iccrg-delivery-rate-estimation]): 1876 BBRUpdateMaxBw(): 1877 BBRUpdateRound() 1878 if (rs.delivery_rate >= BBR.max_bw || !rs.is_app_limited) 1879 BBR.max_bw = update_windowed_max_filter( 1880 filter=BBR.MaxBwFilter, 1881 value=rs.delivery_rate, 1882 time=BBR.cycle_count, 1883 window_length=MaxBwFilterLen)
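The update_windowed_max_filter() helper used above is not further specified in this document. As a non-normative illustration only, the BBR.max_bw filter can be kept in two time slots, one per ProbeBW cycle, as the next subsection describes; the struct and function names in the following C sketch are assumptions made for this example, not part of the algorithm:

   #include <stdint.h>

   struct max_bw_filter {
       /* Max delivery rate sample seen in each of the two most
        * recent ProbeBW cycles (previous and current). */
       uint64_t slot[2];
   };

   /* Record a delivery rate sample at virtual time cycle_count.
    * Only the low bit of the virtual time is needed. */
   static void max_bw_filter_update(struct max_bw_filter *f,
                                    uint32_t cycle_count,
                                    uint64_t sample)
   {
       uint32_t i = cycle_count & 1;

       if (sample > f->slot[i])
           f->slot[i] = sample;
   }

   /* Advance virtual time by one ProbeBW cycle; the slot that now
    * represents the new cycle starts out empty. */
   static void max_bw_filter_advance(struct max_bw_filter *f,
                                     uint32_t new_cycle_count)
   {
       f->slot[new_cycle_count & 1] = 0;
   }

   /* The filter output is the max over the two tracked cycles. */
   static uint64_t max_bw_filter_get(const struct max_bw_filter *f)
   {
       return f->slot[0] > f->slot[1] ? f->slot[0] : f->slot[1];
   }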
1885 4.5.2.5. Tracking Time for the BBR.max_bw Max Filter 1887 BBR tracks time for the BBR.max_bw filter window using a virtual 1888 (non-wall-clock) time tracked by counting the cyclical progression 1889 through ProbeBW cycles. Each time through the ProbeBW cycle, one 1890 round trip after exiting ProbeBW_UP (the point at which the flow has 1891 its best chance to measure the highest throughput of the cycle), BBR 1892 increments BBR.cycle_count, the virtual time used by the BBR.max_bw 1893 filter window. Note that BBR.cycle_count only needs to be tracked 1894 with a single bit, since the BBR.max_bw filter only needs to track 1895 samples from two time slots: the previous ProbeBW cycle and the 1896 current ProbeBW cycle: 1898 BBRAdvanceMaxBwFilter(): 1899 BBR.cycle_count++ 1901 4.5.3. BBR.min_rtt: Estimated Minimum Round-Trip Time 1903 BBR.min_rtt is BBR's estimate of the round-trip propagation delay of 1904 the path over which a transport connection is sending. The path's 1905 round-trip propagation delay determines the minimum amount of time 1906 over which the connection must be willing to sustain transmissions at 1907 the BBR.bw rate, and thus the minimum amount of data needed in- 1908 flight, for the connection to reach full utilization (a "Full Pipe"). 1909 The round-trip propagation delay can vary over the life of a 1910 connection, so BBR continually estimates BBR.min_rtt using recent 1911 round-trip delay samples. 1913 4.5.3.1. Round-Trip Time Samples for Estimating BBR.min_rtt 1915 For every data packet a connection sends, BBR calculates an RTT 1916 sample that measures the time interval from sending a data packet 1917 until that packet is acknowledged. 1919 For the most part, the same considerations and mechanisms that apply 1920 to RTT estimation for the purposes of retransmission timeout 1921 calculations [RFC6298] apply to BBR RTT samples. Namely, BBR does 1922 not use RTT samples based on the transmission time of retransmitted 1923 packets, since these are ambiguous, and thus unreliable. Also, BBR 1924 calculates RTT samples using both cumulative and selective 1925 acknowledgments (if the transport supports [RFC2018] SACK options or 1926 an equivalent mechanism), or transport-layer timestamps (if the 1927 transport supports [RFC7323] TCP timestamps or an equivalent 1928 mechanism). 1930 The only divergence from RTT estimation for retransmission timeouts 1931 is in the case where a given acknowledgment ACKs more than one data 1932 packet. In order to be conservative and schedule long timeouts to 1933 avoid spurious retransmissions, the maximum among such potential RTT 1934 samples is typically used for computing retransmission timeouts; 1935 i.e., SRTT is typically calculated using the data packet with the 1936 earliest transmission time. By contrast, in order for BBR to try to 1937 reach the minimum amount of data in flight to fill the pipe, BBR uses 1938 the minimum among such potential RTT samples; i.e., BBR calculates 1939 the RTT using the data packet with the latest transmission time. 1941 4.5.3.2. BBR.min_rtt Min Filter 1943 RTT samples tend to be above the round-trip propagation delay of the 1944 path, due to "noise" introduced by random variation in physical 1945 transmission processes (e.g. radio link layer noise), queues along 1946 the network path, the receiver's delayed ack strategy, ack 1947 aggregation, etc. Thus to filter out these effects BBR uses a min 1948 filter: BBR estimates BBR.min_rtt using the minimum recent RTT sample 1949 seen by the connection over the past MinRTTFilterLen seconds. 
(Many 1950 of the same network effects that can decrease delivery rate 1951 measurements can increase RTT samples, which is why BBR's min- 1952 filtering approach for RTTs is the complement of its max-filtering 1953 approach for delivery rates.) 1955 The length of the BBR.min_rtt min filter window is MinRTTFilterLen = 1956 10 secs. This is driven by trade-offs among several considerations: 1958 * The MinRTTFilterLen is longer than ProbeRTTInterval, so that it 1959 covers an entire ProbeRTT cycle (see the "ProbeRTT" section 1960 below). This helps ensure that the window can contain RTT samples 1961 that are the result of data transmitted with inflight below the 1962 estimated BDP of the flow. Such RTT samples are important for 1963 helping to reveal the path's underlying two-way propagation delay 1964 even when the aforementioned "noise" effects can often obscure it. 1966 * The MinRTTFilterLen aims to be long enough to avoid needing to cut 1967 in-flight and throughput often. Measuring two-way propagation 1968 delay requires in-flight to be at or below BDP, which risks some 1969 amount of underutilization, so BBR uses a filter window long 1970 enough that such underutilization events can be rare. 1972 * The MinRTTFilterLen aims to be long enough that many applications 1973 have a "natural" moment of silence or low utilization that can cut 1974 in-flight below BDP and naturally serve to refresh the 1975 BBR.min_rtt, without requiring BBR to force an artificial cut in 1976 in-flight. This applies to many popular applications, including 1977 Web, RPC, chunked audio or video traffic. 1979 * The MinRTTFilterLen aims to be short enough to respond in a timely 1980 manner to real increases in the two-way propagation delay of the 1981 path, e.g. due to route changes, which are expected to typically 1982 happen on longer time scales. 1984 A BBR implementation MAY use a generic windowed min filter to track 1985 BBR.min_rtt. However, a significant savings in space and improvement 1986 in freshness can be achieved by integrating the BBR.min_rtt 1987 estimation into the ProbeRTT state machine, so this document 1988 discusses that approach in the ProbeRTT section. 1990 4.5.4. BBR.offload_budget 1992 BBR.offload_budget is the estimate of the minimum volume of data 1993 necessary to achieve full throughput using sender (TSO/GSO) and 1994 receiver (LRO, GRO) host offload mechanisms, computed as follows: 1996 BBRUpdateOffloadBudget(): 1997 BBR.offload_budget = 3 * BBR.send_quantum 1999 The factor of 3 is chosen to allow maintaining at least: 2001 * 1 quantum in the sending host's queuing discipline layer 2003 * 1 quantum being segmented in the sending host TSO/GSO engine 2005 * 1 quantum being reassembled or otherwise remaining unacknowledged 2006 due to the receiver host's LRO/GRO/delayed-ACK engine 2008 4.5.5. BBR.extra_acked 2010 BBR.extra_acked is a volume of data that is the estimate of the 2011 recent degree of aggregation in the network path. For each ACK, the 2012 algorithm computes a sample of the estimated extra ACKed data beyond 2013 the amount of data that the sender expected to be ACKed over the 2014 timescale of a round-trip, given the BBR.bw. Then it computes 2015 BBR.extra_acked as the windowed maximum sample over the last 2016 BBRExtraAckedFilterLen=10 packet-timed round-trips. If the ACK rate 2017 falls below the expected bandwidth, then the algorithm estimates an 2018 aggregation episode has terminated, and resets the sampling interval 2019 to start from the current time. 
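As an illustrative example (the numbers here are chosen only for concreteness and are not part of the algorithm): suppose BBR.bw is 1.25 MBytes/sec (10 Mbps) and the path includes a wifi hop that delivers ACKs in bursts. If 100 ms have elapsed since BBR.extra_acked_interval_start, the sender expects roughly

   expected_delivered = BBR.bw * interval
                      = 1.25 MBytes/sec * 0.100 sec = 125 KBytes

to have been newly ACKed over that interval. If the ACKs that actually arrived in the interval cover 150 KBytes, then the sample of extra ACKed data is 150 KBytes - 125 KBytes = 25 KBytes (capped at cwnd). If that is the windowed maximum over the last BBRExtraAckedFilterLen round trips, BBR.extra_acked is 25 KBytes, and the cwnd is allowed to run roughly that much above the estimated BDP so the flow can keep sending through similar ACK silences.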
2021 The BBR.extra_acked thus reflects the recently-measured magnitude of 2022 data and ACK aggregation effects such as batching and slotting at 2023 shared-medium L2 hops (wifi, cellular, DOCSIS), as well as end-host 2024 offload mechanisms (TSO, GSO, LRO, GRO), and end host or middlebox 2025 ACK decimation/thinning. 2027 BBR augments its cwnd by BBR.extra_acked to allow the connection to 2028 keep sending during inter-ACK silences, to an extent that matches the 2029 recently measured degree of aggregation. 2031 More precisely, this is computed as: 2033 BBRUpdateACKAggregation(): 2034 /* Find excess ACKed beyond expected amount over this interval */ 2035 interval = (Now() - BBR.extra_acked_interval_start) 2036 expected_delivered = BBR.bw * interval 2037 /* Reset interval if ACK rate is below expected rate: */ 2038 if (BBR.extra_acked_delivered <= expected_delivered) 2039 BBR.extra_acked_delivered = 0 2040 BBR.extra_acked_interval_start = Now() 2041 expected_delivered = 0 2042 BBR.extra_acked_delivered += rs.newly_acked 2043 extra = BBR.extra_acked_delivered - expected_delivered 2044 extra = min(extra, cwnd) 2045 BBR.extra_acked = 2046 update_windowed_max_filter( 2047 filter=BBR.ExtraACKedFilter, 2048 value=extra, 2049 time=BBR.round_count, 2050 window_length=BBRExtraAckedFilterLen) 2052 4.5.6. Updating the Model Upon Packet Loss 2054 In every state, BBR responds to (filtered) congestion signals, 2055 including loss. The response to those congestion signals depends on 2056 the flow's current state, since the information that the flow can 2057 infer depends on what the flow was doing when the flow experienced 2058 the signal. 2060 4.5.6.1. Probing for Bandwidth In Startup 2062 In Startup, if the congestion signals meet the Startup exit criteria, 2063 the flow exits Startup and enters Drain. 2065 4.5.6.2. Probing for Bandwidth In ProbeBW 2067 BBR searches for the maximum volume of data that can be sensibly 2068 placed in-flight in the network. A key precondition is that the flow 2069 is actually trying robustly to find that operating point. To 2070 implement this, when a flow is in ProbeBW, and an ACK covers data 2071 sent in one of the accelerating phases (REFILL or UP), and the ACK 2072 indicates that the loss rate over the past round trip exceeds the 2073 queue pressure objective, and the flow is not application limited, 2074 and has not yet responded to congestion signals from the most recent 2075 REFILL or UP phase, then the flow estimates that the volume of data 2076 it allowed in flight exceeded what matches the current delivery 2077 process on the path, and reduces BBR.inflight_hi: 2079 /* Do loss signals suggest inflight is too high? 2080 * If so, react. */ 2081 CheckInflightTooHigh(): 2082 if (IsInflightTooHigh(rs)) 2083 if (BBR.bw_probe_samples) 2084 BBRHandleInflightTooHigh() 2085 return true /* inflight too high */ 2086 else 2087 return false /* inflight not too high */ 2089 IsInflightTooHigh(): 2090 return (rs.lost > rs.tx_in_flight * BBRLossThresh) 2092 BBRHandleInflightTooHigh(): 2093 BBR.bw_probe_samples = 0; /* only react once per bw probe */ 2094 if (!rs.is_app_limited) 2095 BBR.inflight_hi = max(rs.tx_in_flight, 2096 BBRTargetInflight() * BBRBeta) 2097 if (BBR.state == ProbeBW_UP) 2098 BBRStartProbeBW_DOWN() 2100 Here rs.tx_in_flight is the amount of data that was estimated to be 2101 in flight when the most recently ACKed packet was sent. 
And the 2102 BBRBeta (0.7x) bound is to try to ensure that BBR does not react more 2103 dramatically than CUBIC's 0.7x multiplicative decrease factor. 2105 Some loss detection algorithms, including algorithms like RACK 2106 [RFC8985] that delay loss marking while waiting for potential 2107 reordering to resolve, may mark packets as lost long after the loss 2108 itself happened. In such cases, the tx_in_flight for the delivered 2109 sequence range that allowed the loss to be detected may be 2110 considerably smaller than the tx_in_flight of the lost packet itself. 2111 In such cases using the former tx_in_flight rather than the latter 2112 can cause BBR.inflight_hi to be significantly underestimated. To 2113 avoid such issues, BBR processes each loss detection event to more 2114 precisely estimate the volume of in-flight data at which loss rates 2115 cross BBRLossThresh, noting that this may have happened mid-way 2116 through some packet. To estimate this value, we can solve for 2117 "lost_prefix" in the following equation, where inflight_prev 2118 represents the volume of in-flight data preceding this packet, 2119 lost_prev represents the data lost among that previous in-flight 2120 data: 2122 lost / inflight >= BBRLossThresh 2123 (lost_prev + lost_prefix) / (inflight_prev + lost_prefix) >= BBRLossThresh 2124 /* solving for lost_prefix we arrive at: */ 2125 lost_prefix = (BBRLossThresh * inflight_prev - lost_prev) / (1 - BBRLossThresh) 2127 In pseudocode: 2129 BBRHandleLostPacket(packet): 2130 if (!BBR.bw_probe_samples) 2131 return /* not a packet sent while probing bandwidth */ 2132 rs.tx_in_flight = packet.tx_in_flight /* inflight at transmit */ 2133 rs.lost = C.lost - packet.lost /* data lost since transmit */ 2134 rs.is_app_limited = packet.is_app_limited; 2135 if (IsInflightTooHigh(rs)) 2136 rs.tx_in_flight = BBRInflightHiFromLostPacket(rs, packet) 2137 BBRHandleInflightTooHigh(rs) 2139 /* At what prefix of packet did losses exceed BBRLossThresh? */ 2140 BBRInflightHiFromLostPacket(rs, packet): 2141 size = packet.size 2142 /* What was in flight before this packet? */ 2143 inflight_prev = rs.tx_in_flight - size 2144 /* What was lost before this packet? */ 2145 lost_prev = rs.lost - size 2146 lost_prefix = (BBRLossThresh * inflight_prev - lost_prev) / 2147 (1 - BBRLossThresh) 2148 /* At what inflight value did losses cross BBRLossThresh? */ 2149 inflight = inflight_prev + lost_prefix 2150 return inflight 2152 4.5.6.3. When not Probing for Bandwidth 2154 When not explicitly accelerating to probe for bandwidth (Drain, 2155 ProbeRTT, ProbeBW_DOWN, ProbeBW_CRUISE), BBR responds to loss by 2156 slowing down to some extent. This is because loss suggests that the 2157 available bandwidth and safe volume of in-flight data may have 2158 decreased recently, and the flow needs to adapt, slowing down toward 2159 the latest delivery process. BBR flows implement this response by 2160 reducing the short-term model parameters, BBR.bw_lo and 2161 BBR.inflight_lo. 2163 When encountering packet loss when the flow is not probing for 2164 bandwidth, the strategy is to gradually adapt to the current measured 2165 delivery process (the rate and volume of data that is delivered 2166 through the network path over the last round trip). This applies 2167 generally: whether in fast recovery, RTO recovery, TLP recovery; 2168 whether application-limited or not. 
2170 There are two key parameters the algorithm tracks, to measure the 2171 current delivery process: 2173 BBR.bw_latest: a 1-round-trip max of delivered bandwidth 2174 (rs.delivery_rate). 2176 BBR.inflight_latest: a 1-round-trip max of delivered volume of data 2177 (rs.delivered). 2179 Upon the ACK at the end of each round that encountered a newly-marked 2180 loss, the flow updates its model (bw_lo and inflight_lo) as follows: 2182 bw_lo = max( bw_latest, BBRBeta * bw_lo ) 2184 inflight_lo = max( inflight_latest, BBRBeta * inflight_lo ) 2186 This logic can be represented as follows: 2188 /* Near start of ACK processing: */ 2189 BBRUpdateLatestDeliverySignals(): 2190 BBR.loss_round_start = 0 2191 BBR.bw_latest = max(BBR.bw_latest, rs.delivery_rate) 2192 BBR.inflight_latest = max(BBR.inflight_latest, rs.delivered) 2193 if (rs.prior_delivered >= BBR.loss_round_delivered) 2194 BBR.loss_round_delivered = C.delivered 2195 BBR.loss_round_start = 1 2197 /* Near end of ACK processing: */ 2198 BBRAdvanceLatestDeliverySignals(): 2199 if (BBR.loss_round_start) 2200 BBR.bw_latest = rs.delivery_rate 2201 BBR.inflight_latest = rs.delivered 2203 BBRResetCongestionSignals(): 2204 BBR.loss_in_round = 0 2205 BBR.bw_latest = 0 2206 BBR.inflight_latest = 0 2208 /* Update congestion state on every ACK */ 2209 BBRUpdateCongestionSignals(): 2210 BBRUpdateMaxBw() 2211 if (rs.losses > 0) 2212 BBR.loss_in_round = 1 2213 if (!BBR.loss_round_start) 2214 return /* wait until end of round trip */ 2215 BBRAdaptLowerBoundsFromCongestion() 2216 BBR.loss_in_round = 0 2218 /* Once per round-trip respond to congestion */ 2219 BBRAdaptLowerBoundsFromCongestion(): 2220 if (BBRIsProbingBW()) 2221 return 2222 if (BBR.loss_in_round) 2223 BBRInitLowerBounds() 2224 BBRLossLowerBounds() 2226 /* Handle the first congestion episode in this cycle */ 2227 BBRInitLowerBounds(): 2228 if (BBR.bw_lo == Infinity) 2229 BBR.bw_lo = BBR.max_bw 2230 if (BBR.inflight_lo == Infinity) 2231 BBR.inflight_lo = cwnd 2233 /* Adjust model once per round based on loss */ 2234 BBRLossLowerBounds(): 2235 BBR.bw_lo = max(BBR.bw_latest, 2236 BBRBeta * BBR.bw_lo) 2237 BBR.inflight_lo = max(BBR.inflight_latest, 2238 BBRBeta * BBR.inflight_lo) 2240 BBRResetLowerBounds(): 2241 BBR.bw_lo = Infinity 2242 BBR.inflight_lo = Infinity 2244 BBRBoundBWForModel(): 2245 BBR.bw = min(BBR.max_bw, BBR.bw_lo, BBR.bw_hi) 2247 4.6. Updating Control Parameters 2249 BBR uses three distinct but interrelated control parameters: pacing 2250 rate, send quantum, and congestion window (cwnd). 2252 4.6.1. Summary of Control Behavior in the State Machine 2254 The following table summarizes how BBR modulates the control 2255 parameters in each state. In the table below, the semantics of the 2256 columns are as follows: 2258 * State: the state in the BBR state machine, as depicted in the 2259 "State Transition Diagram" section above. 2261 * Tactic: The tactic chosen from the "State Machine Tactics" 2262 subsection above: "accel" refers to acceleration, "decel" to 2263 deceleration, and "cruise" to cruising. 2265 * Pacing Gain: the value used for BBR.pacing_gain in the given 2266 state. 2268 * Cwnd Gain: the value used for BBR.cwnd_gain in the given state. 2270 * Rate Cap: the rate values applied as bounds on the BBR.max_bw 2271 value used to compute BBR.bw. 2273 * Volume Cap: the volume values applied as bounds on the 2274 BBR.max_inflight value to compute cwnd. 2276 The control behavior can be summarized as follows. 
Upon processing 2276 each ACK, BBR uses the values in the table below to compute BBR.bw in 2277 BBRBoundBWForModel(), and the cwnd in BBRBoundCwndForModel(): 2279 +-----------------+--------+--------+------+--------+------------------+ 2280 | State | Tactic | Pacing | Cwnd | Rate | Volume | 2281 | | | Gain | Gain | Cap | Cap | 2282 +-----------------+--------+--------+------+--------+------------------+ 2283 | Startup | accel | 2.77 | 2 | | | 2284 | | | | | | | 2285 +-----------------+--------+--------+------+--------+------------------+ 2286 | Drain | decel | 0.5 | 2 | bw_hi, | inflight_hi, | 2287 | | | | | bw_lo | inflight_lo | 2288 +-----------------+--------+--------+------+--------+------------------+ 2289 | ProbeBW_DOWN | decel | 0.9 | 2 | bw_hi, | inflight_hi, | 2290 | | | | | bw_lo | inflight_lo | 2291 +-----------------+--------+--------+------+--------+------------------+ 2292 | ProbeBW_CRUISE | cruise | 1.0 | 2 | bw_hi, | 0.85*inflight_hi | 2293 | | | | | bw_lo | inflight_lo | 2294 +-----------------+--------+--------+------+--------+------------------+ 2295 | ProbeBW_REFILL | accel | 1.0 | 2 | bw_hi | inflight_hi | 2296 | | | | | | | 2297 +-----------------+--------+--------+------+--------+------------------+ 2298 | ProbeBW_UP | accel | 1.25 | 2 | bw_hi | inflight_hi | 2299 | | | | | | | 2300 +-----------------+--------+--------+------+--------+------------------+ 2301 | ProbeRTT | decel | 1.0 | 0.5 | bw_hi, | 0.85*inflight_hi | 2302 | | | | | bw_lo | inflight_lo | 2303 +-----------------+--------+--------+------+--------+------------------+ 2305 4.6.2. Pacing Rate: BBR.pacing_rate 2307 To help match the packet-arrival rate to the bottleneck bandwidth 2308 available to the flow, BBR paces data packets. Pacing enforces a 2309 maximum rate at which BBR schedules quanta of packets for 2310 transmission. 2312 The sending host implements pacing by maintaining inter-quantum 2313 spacing at the time each packet is scheduled for departure, 2314 calculating the next departure time for a packet for a given flow 2315 (BBR.next_departure_time) as a function of the most recent packet 2316 size and the current pacing rate, as follows: 2318 BBR.next_departure_time = max(Now(), BBR.next_departure_time) 2319 packet.departure_time = BBR.next_departure_time 2320 pacing_delay = packet.size / BBR.pacing_rate 2321 BBR.next_departure_time = BBR.next_departure_time + pacing_delay 2323 To adapt to the bottleneck, in general BBR sets the pacing rate to be 2324 proportional to bw, with a dynamic gain, or scaling factor of 2325 proportionality, called pacing_gain. 2327 When a BBR flow starts it has no bw estimate (bw is 0). So in this 2328 case it sets an initial pacing rate based on the transport sender 2329 implementation's initial congestion window ("InitialCwnd", e.g. from 2330 [RFC6928]), the initial SRTT (smoothed round-trip time) after the 2331 first non-zero RTT sample, and the initial pacing_gain: 2333 BBRInitPacingRate(): 2334 nominal_bandwidth = InitialCwnd / (SRTT ? SRTT : 1ms) 2335 BBR.pacing_rate = BBRStartupPacingGain * nominal_bandwidth 2337 After initialization, on each data ACK BBR updates its pacing rate to 2338 be proportional to bw, as long as it estimates that it has filled the 2339 pipe (BBR.filled_pipe is true; see the "Startup" section for 2340 details), or doing so increases the pacing rate. 
Limiting the pacing 2341 rate updates in this way helps the connection probe robustly for 2342 bandwidth until it estimates it has reached its full available 2343 bandwidth ("filled the pipe"). In particular, this prevents the 2344 pacing rate from being reduced when the connection has only seen 2345 application-limited bandwidth samples. BBR updates the pacing rate 2346 on each ACK by executing the BBRSetPacingRate() step as follows: 2348 BBRSetPacingRateWithGain(pacing_gain): 2349 rate = pacing_gain * bw * (100 - BBRPacingMarginPercent) / 100 2350 if (BBR.filled_pipe || rate > BBR.pacing_rate) 2351 BBR.pacing_rate = rate 2353 BBRSetPacingRate(): 2354 BBRSetPacingRateWithGain(BBR.pacing_gain) 2356 To help drive the network toward lower queues and low latency while 2357 maintaining high utilization, the BBRPacingMarginPercent constant of 2358 1 aims to cause BBR to pace at 1% below the bw, on average. 2360 4.6.3. Send Quantum: BBR.send_quantum 2362 In order to amortize per-packet overheads involved in the sending 2363 process (host CPU, NIC processing, and interrupt processing delays), 2364 high-performance transport sender implementations (e.g., Linux TCP) 2365 often schedule an aggregate containing multiple packets (multiple 2366 SMSS) worth of data as a single quantum (using TSO, GSO, or other 2367 offload mechanisms). The BBR congestion control algorithm makes this 2368 control decision explicitly, dynamically calculating a quantum 2369 control parameter that specifies the maximum size of these 2370 transmission aggregates. This decision is based on a trade-off: 2372 * A smaller quantum is preferred at lower data rates because it 2373 results in shorter packet bursts, shorter queues, lower queueing 2374 delays, and lower rates of packet loss. 2376 * A bigger quantum can be required at higher data rates because it 2377 results in lower CPU overheads at the sending and receiving hosts, 2378 who can ship larger amounts of data with a single trip through the 2379 networking stack. 2381 On each ACK, BBR runs BBRSetSendQuantum() to update BBR.send_quantum 2382 as follows: 2384 BBRSetSendQuantum(): 2385 if (BBR.pacing_rate < 1.2 Mbps) 2386 floor = 1 * SMSS 2387 else 2388 floor = 2 * SMSS 2389 BBR.send_quantum = min(BBR.pacing_rate * 1ms, 64KBytes) 2390 BBR.send_quantum = max(BBR.send_quantum, floor) 2392 A BBR implementation MAY use alternate approaches to select a 2393 BBR.send_quantum, as appropriate for the CPU overheads anticipated 2394 for senders and receivers, and buffering considerations anticipated 2395 in the network path. However, for the sake of the network and other 2396 users, a BBR implementation SHOULD attempt to use the smallest 2397 feasible quanta. 2399 4.6.4. Congestion Window 2401 The congestion window, or cwnd, controls the maximum volume of data 2402 BBR allows in flight in the network at any time. It is the maximum 2403 volume of in-flight data that the algorithm estimates is appropriate 2404 for matching the current network path delivery process, given all 2405 available signals in the model, at any time scale. BBR adapts the 2406 cwnd based on its model of the network path and the state machine's 2407 decisions about how to probe that path. 2409 By default, BBR grows its cwnd to meet its BBR.max_inflight, which 2410 models what's required for achieving full throughput, and as such is 2411 scaled to adapt to the estimated BDP computed from its path model. 
2412 But BBR's selection of cwnd is designed to explicitly trade off among 2413 competing considerations that dynamically adapt to various 2414 conditions. So in loss recovery BBR more conservatively adjusts its 2415 sending behavior based on more recent delivery samples, and if BBR 2416 needs to re-probe the current BBR.min_rtt of the path then it cuts 2417 its cwnd accordingly. The following sections describe the various 2418 considerations that impact cwnd. 2420 4.6.4.1. Initial cwnd 2422 BBR generally uses measurements to build a model of the network path 2423 and then adapts control decisions to the path based on that model. 2424 As such, the selection of the initial cwnd is considered to be 2425 outside the scope of the BBR algorithm, since at initialization there 2426 are no measurements yet upon which BBR can operate. Thus, at 2427 initialization, BBR uses the transport sender implementation's 2428 initial congestion window (e.g. from [RFC6928] for TCP). 2430 4.6.4.2. Computing BBR.max_inflight 2432 The BBR.max_inflight is the upper bound on the volume of data BBR 2433 allows in flight. This bound is always in place, and dominates when 2434 all other considerations have been satisfied: the flow is not in loss 2435 recovery, does not need to probe BBR.min_rtt, and has accumulated 2436 confidence in its model parameters by receiving enough ACKs to 2437 gradually grow the current cwnd to meet the BBR.max_inflight. 2439 On each ACK, BBR calculates the BBR.max_inflight in 2440 BBRUpdateMaxInflight() as follows: 2442 BBRBDPMultiple(gain): 2443 if (BBR.min_rtt == Infinity) 2444 return InitialCwnd /* no valid RTT samples yet */ 2445 BBR.bdp = BBR.bw * BBR.min_rtt 2446 return gain * BBR.bdp 2448 BBRQuantizationBudget(inflight): 2449 BBRUpdateOffloadBudget() 2450 inflight = max(inflight, BBR.offload_budget) 2451 inflight = max(inflight, BBRMinPipeCwnd) 2452 if (BBR.state == ProbeBW_UP) 2453 inflight += 2 2454 return inflight 2456 BBRInflight(gain): 2457 inflight = BBRBDPMultiple(gain) 2458 return BBRQuantizationBudget(inflight) 2460 BBRUpdateMaxInflight(): 2461 BBRUpdateAggregationBudget() 2462 inflight = BBRBDPMultiple(BBR.cwnd_gain) 2463 inflight += BBR.extra_acked 2464 BBR.max_inflight = BBRQuantizationBudget(inflight) 2466 The estimated BDP term (BBR.bdp in BBRBDPMultiple()) tries to allow enough packets in flight to 2467 fully utilize the estimated BDP of the path, by allowing the flow to 2468 send at BBR.bw for a duration of BBR.min_rtt. Scaling up the BDP by 2469 BBR.cwnd_gain bounds in-flight data to a small multiple of the BDP, 2470 to handle common network and receiver behavior, such as delayed, 2471 stretched, or aggregated ACKs [A15]. The quantization budget term (BBRQuantizationBudget()) allows enough 2472 quanta in flight on the sending and receiving hosts to reach high 2473 throughput even in environments using offload mechanisms. 2475 4.6.4.3. Minimum cwnd for Pipelining 2477 For BBR.max_inflight, BBR imposes a floor of BBRMinPipeCwnd (4 2478 packets, i.e. 4 * SMSS). This floor helps ensure that even at very 2479 low BDPs, and with a transport like TCP where a receiver may ACK only 2480 every alternate SMSS of data, there are enough packets in flight to 2481 maintain full pipelining. In particular BBR tries to allow at least 2482 2 data packets in flight and ACKs for at least 2 data packets on the 2483 path from receiver to sender.
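As a concrete illustration (the numbers here are examples only and are not normative): consider a path with BBR.bw = 12.5 MBytes/sec (100 Mbps) and BBR.min_rtt = 40 ms, so BBR.bdp = 500 KBytes, or roughly 330 packets of 1514 bytes. With BBR.cwnd_gain = 2 and BBR.extra_acked = 0, BBRUpdateMaxInflight() yields approximately

   BBR.max_inflight = 2 * 500 KBytes + 0 = 1000 KBytes ~= 660 packets

The quantization budget does not bind in this case: with BBR.pacing_rate near 12.5 MBytes/sec, BBR.send_quantum is 12.5 KBytes (see the "Send Quantum" section, above), so BBR.offload_budget is 37.5 KBytes, and the BBRMinPipeCwnd floor of 4 packets is far below the BDP-derived value.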
2485 4.6.4.4. Modulating cwnd in Loss Recovery 2487 BBR interprets loss as a hint that there may be recent changes in 2488 path behavior that are not yet fully reflected in its model of the 2489 path, and thus it needs to be more conservative. 2491 Upon a retransmission timeout (RTO), BBR conservatively reduces cwnd 2492 to a value that will allow 1 SMSS to be transmitted. Then BBR 2493 gradually increases cwnd using the normal approach outlined below in 2494 "Core cwnd Adjustment Mechanism". 2496 When a BBR sender detects packet loss but there are still packets in 2497 flight, on the first round of the loss-repair process BBR temporarily 2498 reduces the cwnd to match the current delivery rate as ACKs arrive. 2499 On second and later rounds of loss repair, it ensures the sending 2500 rate never exceeds twice the current delivery rate as ACKs arrive. 2502 When BBR exits loss recovery it restores the cwnd to the "last known 2503 good" value that cwnd held before entering recovery. This applies 2504 equally whether the flow exits loss recovery because it finishes 2505 repairing all losses or because it executes an "undo" event after 2506 inferring that a loss recovery event was spurious. 2508 There are several ways to implement this high-level design for 2509 updating cwnd in loss recovery. One is as follows: 2511 Upon retransmission timeout (RTO): 2513 BBROnEnterRTO(): 2514 BBR.prior_cwnd = BBRSaveCwnd() 2515 cwnd = packets_in_flight + 1 2517 Upon entering Fast Recovery, set cwnd to the number of packets still 2518 in flight (allowing at least one for a fast retransmit): 2520 BBROnEnterFastRecovery(): 2521 BBR.prior_cwnd = BBRSaveCwnd() 2522 cwnd = packets_in_flight + max(rs.newly_acked, 1) 2523 BBR.packet_conservation = true 2525 Upon every ACK in Fast Recovery, run the following 2526 BBRModulateCwndForRecovery() steps, which help ensure packet 2527 conservation on the first round of recovery, and sending at no more 2528 than twice the current delivery rate on later rounds of recovery 2529 (given that "rs.newly_acked" packets were newly marked ACKed or 2530 SACKed and "rs.newly_lost" were newly marked lost): 2532 BBRModulateCwndForRecovery(): 2533 if (rs.newly_lost > 0) 2534 cwnd = max(cwnd - rs.newly_lost, 1) 2535 if (BBR.packet_conservation) 2536 cwnd = max(cwnd, packets_in_flight + rs.newly_acked) 2538 After one round-trip in Fast Recovery: 2540 BBR.packet_conservation = false 2542 Upon exiting loss recovery (RTO recovery or Fast Recovery), either by 2543 repairing all losses or undoing recovery, BBR restores the best-known 2544 cwnd value it had upon entering loss recovery: 2546 BBR.packet_conservation = false 2547 BBRRestoreCwnd() 2549 Note that exiting loss recovery happens during ACK processing, and at 2550 the end of ACK processing BBRBoundCwndForModel() will bound the cwnd 2551 based on the current model parameters. Thus the cwnd and pacing rate 2552 after loss recovery will generally be smaller than the values 2553 entering loss recovery. 2555 The BBRSaveCwnd() and BBRRestoreCwnd() helpers remember and 2556 restore the last-known good cwnd (the latest cwnd unmodulated by loss 2557 recovery or ProbeRTT), and are defined as follows: 2559 BBRSaveCwnd(): 2560 if (!InLossRecovery() and BBR.state != ProbeRTT) 2561 return cwnd 2562 else 2563 return max(BBR.prior_cwnd, cwnd) 2565 BBRRestoreCwnd(): 2566 cwnd = max(cwnd, BBR.prior_cwnd) 2568 4.6.4.5. 
4.6.4.5.  Modulating cwnd in ProbeRTT

   If BBR decides it needs to enter the ProbeRTT state (see the
   "ProbeRTT" section below), its goal is to quickly reduce the volume
   of in-flight data and drain the bottleneck queue, thereby allowing
   measurement of BBR.min_rtt.  To implement this mode, BBR bounds the
   cwnd to BBRProbeRTTCwnd(): a BBRProbeRTTCwndGain multiple of the
   estimated BDP, with a floor of BBRMinPipeCwnd, the minimal value
   that allows pipelining (see the "Minimum cwnd for Pipelining"
   section, above):

   BBRProbeRTTCwnd():
     probe_rtt_cwnd = BBRBDPMultiple(BBRProbeRTTCwndGain)
     probe_rtt_cwnd = max(probe_rtt_cwnd, BBRMinPipeCwnd)
     return probe_rtt_cwnd

   BBRBoundCwndForProbeRTT():
     if (BBR.state == ProbeRTT)
       cwnd = min(cwnd, BBRProbeRTTCwnd())

4.6.4.6.  Core cwnd Adjustment Mechanism

   The network path, and the traffic traveling over it, can change
   suddenly and dramatically.  To adapt to these changes smoothly and
   robustly, and to reduce packet losses in such cases, BBR uses a
   conservative strategy.  When cwnd is above the BBR.max_inflight
   derived from BBR's path model, BBR cuts the cwnd immediately to the
   BBR.max_inflight.  When cwnd is below BBR.max_inflight, BBR raises
   the cwnd gradually and cautiously, increasing cwnd by no more than
   the amount of data acknowledged (cumulatively or selectively) upon
   each ACK.

   Specifically, on each ACK that acknowledges "rs.newly_acked" packets
   as newly ACKed or SACKed, BBR runs the following BBRSetCwnd() steps
   to update cwnd:

   BBRSetCwnd():
     BBRUpdateMaxInflight()
     BBRModulateCwndForRecovery()
     if (!BBR.packet_conservation) {
       if (BBR.filled_pipe)
         cwnd = min(cwnd + rs.newly_acked, BBR.max_inflight)
       else if (cwnd < BBR.max_inflight || C.delivered < InitialCwnd)
         cwnd = cwnd + rs.newly_acked
       cwnd = max(cwnd, BBRMinPipeCwnd)
     }
     BBRBoundCwndForProbeRTT()
     BBRBoundCwndForModel()

   There are several considerations embodied in the logic above.  If
   BBR has measured enough samples to achieve confidence that it has
   filled the pipe (see the description of BBR.filled_pipe in the
   "Startup" section below), then it increases its cwnd based on the
   number of packets delivered, while bounding its cwnd to be no larger
   than the BBR.max_inflight adapted to the estimated BDP.  Otherwise,
   if the cwnd is below the BBR.max_inflight, or the sender has had so
   little data marked as delivered (less than InitialCwnd) that it does
   not yet judge its BBR.max_bw estimate and BBR.max_inflight as
   useful, then it increases cwnd without bounding it to be below
   BBR.max_inflight.  Finally, BBR imposes a floor of BBRMinPipeCwnd in
   order to allow pipelining even with small BDPs (see the "Minimum
   cwnd for Pipelining" section, above).

4.6.4.7.  Bounding cwnd Based on Recent Congestion

   Finally, BBR bounds the cwnd based on recent congestion, as outlined
   in the "Volume Cap" column of the table in the "Summary of Control
   Behavior in the State Machine" section:

   BBRBoundCwndForModel():
     cap = Infinity
     if (IsInAProbeBWState() and
         BBR.state != ProbeBW_CRUISE)
       cap = BBR.inflight_hi
     else if (BBR.state == ProbeRTT or
              BBR.state == ProbeBW_CRUISE)
       cap = BBRInflightWithHeadroom()

     /* apply inflight_lo (possibly infinite): */
     cap = min(cap, BBR.inflight_lo)
     cap = max(cap, BBRMinPipeCwnd)
     cwnd = min(cwnd, cap)
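   To make the ordering of these steps concrete, here is a compact,
   non-normative Python sketch of the per-ACK cwnd update path:
   BBRSetCwnd()-style growth followed by the ProbeRTT bound and the
   model-based cap.  The bbr argument is a hypothetical object assumed
   to already carry the quantities computed in the preceding sections
   (cwnd, max_inflight, filled_pipe, packet_conservation, delivered,
   initial_cwnd, state, probe_rtt_cwnd, inflight_hi, inflight_lo,
   inflight_with_headroom, and the ProbeBW flags); BBRUpdateMaxInflight()
   and BBRModulateCwndForRecovery() are assumed to have run first, as
   in the sketches above.

     import math

     BBRMinPipeCwnd = 4   # packets

     def set_cwnd(bbr, newly_acked):
         # Growth phase of BBRSetCwnd(), skipped during packet
         # conservation.
         if not bbr.packet_conservation:
             if bbr.filled_pipe:
                 # Confident in the model: grow toward max_inflight,
                 # but no further.
                 bbr.cwnd = min(bbr.cwnd + newly_acked, bbr.max_inflight)
             elif (bbr.cwnd < bbr.max_inflight or
                   bbr.delivered < bbr.initial_cwnd):
                 # Not yet confident: grow by the amount newly delivered.
                 bbr.cwnd = bbr.cwnd + newly_acked
             bbr.cwnd = max(bbr.cwnd, BBRMinPipeCwnd)

         # BBRBoundCwndForProbeRTT(): keep inflight low while probing
         # min_rtt.
         if bbr.state == "ProbeRTT":
             bbr.cwnd = min(bbr.cwnd, bbr.probe_rtt_cwnd)

         # BBRBoundCwndForModel(): cap cwnd based on recent congestion.
         cap = math.inf
         if bbr.in_probe_bw and not bbr.in_probe_bw_cruise:
             cap = bbr.inflight_hi
         elif bbr.state == "ProbeRTT" or bbr.in_probe_bw_cruise:
             cap = bbr.inflight_with_headroom
         cap = min(cap, bbr.inflight_lo)
         cap = max(cap, BBRMinPipeCwnd)
         bbr.cwnd = min(bbr.cwnd, cap)

   A caller might construct bbr as, for example, a
   types.SimpleNamespace with those fields filled in from its own
   connection state; nothing in the sketch depends on a particular
   transport.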
5.  Implementation Status

   This section records the status of known implementations of the
   algorithm defined by this specification at the time of posting of
   this Internet-Draft, and is based on a proposal described in
   [RFC7942].  The description of implementations in this section is
   intended to assist the IETF in its decision processes in progressing
   drafts to RFCs.  Please note that the listing of any individual
   implementation here does not imply endorsement by the IETF.
   Furthermore, no effort has been spent to verify the information
   presented here that was supplied by IETF contributors.  This is not
   intended as, and must not be construed to be, a catalog of available
   implementations or their features.  Readers are advised to note that
   other implementations may exist.

   According to [RFC7942], "this will allow reviewers and working
   groups to assign due consideration to documents that have the
   benefit of running code, which may serve as evidence of valuable
   experimentation and feedback that have made the implemented
   protocols more mature.  It is up to the individual working groups to
   use this information as they see fit".

   As of the time of writing, the following implementations of BBR have
   been publicly released:

   *  Linux TCP

      -  Source code URLs:

         o  https://github.com/google/bbr/blob/v2alpha/README.md

         o  https://github.com/google/bbr/blob/v2alpha/net/ipv4/tcp_bbr2.c

      -  Source: Google

      -  Maturity: production

      -  License: dual-licensed: GPLv2 / BSD

      -  Contact: https://groups.google.com/d/forum/bbr-dev

      -  Last updated: August 21, 2021

   *  QUIC

      -  Source code URLs:

         o  https://cs.chromium.org/chromium/src/net/third_party/quiche/src/quic/core/congestion_control/bbr2_sender.cc

         o  https://cs.chromium.org/chromium/src/net/third_party/quiche/src/quic/core/congestion_control/bbr2_sender.h

      -  Source: Google

      -  Maturity: production

      -  License: BSD-style

      -  Contact: https://groups.google.com/d/forum/bbr-dev

      -  Last updated: October 21, 2021

6.  Security Considerations

   This proposal makes no changes to the underlying security of
   transport protocols or congestion control algorithms.  BBR shares
   the same security considerations as the existing standard congestion
   control algorithm [RFC5681].

7.  IANA Considerations

   This document has no IANA actions.  That phrase, suggested by
   [RFC5226], is used here because BBR does not modify or extend the
   wire format of any network protocol, nor does it add new
   dependencies on assigned numbers.  BBR involves only a change to the
   congestion control algorithm of a transport sender, and does not
   involve changes in the network, the receiver, or any network
   protocol.

   Note to RFC Editor: this section may be removed on publication as an
   RFC.

8.  Acknowledgments

   The authors are grateful to Len Kleinrock for his work on the theory
   underlying congestion control.  We are indebted to Larry Brakmo for
   pioneering work on the Vegas [BP95] and New Vegas [B15] congestion
   control algorithms, which presaged many elements of BBR, and for his
   advice and guidance during BBR's early development.
   The authors would also like to thank Kevin Yang, Priyaranjan Jha,
   Yousuk Seung, and Luke Hsiao for their work on TCP BBR; Jana
   Iyengar, Victor Vasiliev, and Bin Wu for their work on QUIC BBR; and
   Matt Mathis for his research work on the BBR algorithm and its
   implications [MM19].  We would also like to thank C. Stephen Gunn,
   Eric Dumazet, Nandita Dukkipati, Pawel Jurczyk, Biren Roy, David
   Wetherall, Amin Vahdat, Leonidas Kontothanassis, and the YouTube,
   google.com, Bandwidth Enforcer, and Google SRE teams for their
   invaluable help and support.  We would like to thank Randall R.
   Stewart, Jim Warner, Loganaden Velvindron, Hiren Panchasara, and
   Adrian Zapletal for feedback and suggestions on earlier versions of
   this document.

9.  References

9.1.  Normative References

   [RFC793]   Postel, J., "Transmission Control Protocol", September
              1981.

   [RFC2018]  Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment
              Options", RFC 2018, October 1996.

   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
              Scheffenegger, "TCP Extensions for High Performance",
              September 2014.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", RFC 2119, March 1997.

   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", May 2008.

   [RFC6298]  Paxson, V., "Computing TCP's Retransmission Timer",
              RFC 6298, June 2011.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of
              Running Code: The Implementation Status Section", July
              2016.

   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
              R. Scheffenegger, "CUBIC for Fast Long-Distance
              Networks", February 2018.

   [RFC8985]  Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "The
              RACK-TLP Loss Detection Algorithm for TCP", RFC 8985,
              DOI 10.17487/RFC8985, February 2021.

   [RFC9000]  Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based
              Multiplexed and Secure Transport", RFC 9000,
              DOI 10.17487/RFC9000, May 2021.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340,
              DOI 10.17487/RFC4340, March 2006.

9.2.  Informative References

   [draft-cheng-iccrg-delivery-rate-estimation]
              Cheng, Y., Cardwell, N., Hassas Yeganeh, S., and V.
              Jacobson, "Delivery Rate Estimation", Work in Progress,
              Internet-Draft, draft-cheng-iccrg-delivery-rate-
              estimation, November 2021.

   [CCGHJ16]  Cardwell, N., Cheng, Y., Gunn, C., Hassas Yeganeh, S.,
              and V. Jacobson, "BBR: Congestion-Based Congestion
              Control", ACM Queue, October 2016.

   [CCGHJ17]  Cardwell, N., Cheng, Y., Gunn, C., Hassas Yeganeh, S.,
              and V. Jacobson, "BBR: Congestion-Based Congestion
              Control", Communications of the ACM, February 2017.

   [MM19]     Mathis, M. and J. Mahdavi, "Deprecating The TCP
              Macroscopic Model", Computer Communication Review, vol.
              49, no. 5, pp. 63-68, October 2019.

   [BBRStartupPacingGain]
              Cardwell, N., Cheng, Y., Hassas Yeganeh, S., and V.
              Jacobson, "BBR Startup Pacing Gain: a Derivation", June
              2018.

   [BBRDrainPacingGain]
              Cardwell, N., Cheng, Y., Hassas Yeganeh, S., and V.
              Jacobson, "BBR Drain Pacing Gain: a Derivation",
              September 2021.

   [draft-romo-iccrg-ccid5]
              Romo, N., Kim, J., and M. Amend, "Profile for Datagram
              Congestion Control Protocol (DCCP) Congestion Control ID
              5", Work in Progress, Internet-Draft, draft-romo-iccrg-
              ccid5, 25 October 2021.

   [DC13]     Dumazet, E. and Y. Cheng, "TSO, fair queuing, pacing:
              three's a charm", IETF 88, November 2013.

   [A15]      Abrahamsson, M., "TCP ACK suppression", IETF AQM mailing
              list, November 2015.

   [Jac88]    Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
              1988, Computer Communication Review, vol. 18, no. 4, pp.
              314-329, August 1988.

   [Jac90]    Jacobson, V., "Modified TCP Congestion Avoidance
              Algorithm", end2end-interest mailing list, April 1990.

   [BP95]     Brakmo, L. and L. Peterson, "TCP Vegas: end-to-end
              congestion avoidance on a global Internet", IEEE Journal
              on Selected Areas in Communications 13(8): 1465-1480,
              October 1995.

   [B15]      Brakmo, L., "TCP-NV: An Update to TCP-Vegas", August
              2015.

   [WS95]     Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2:
              The Implementation", Addison-Wesley, 1995.

   [HRX08]    Ha, S., Rhee, I., and L. Xu, "CUBIC: A New TCP-Friendly
              High-Speed TCP Variant", ACM SIGOPS Operating System
              Review, 2008.

   [GK81]     Gail, R. and L. Kleinrock, "An Invariant Property of
              Computer Network Power", Proceedings of the International
              Conference on Communications, June 1981.

   [K79]      Kleinrock, L., "Power and deterministic rules of thumb
              for probabilistic problems in computer communications",
              Proceedings of the International Conference on
              Communications, 1979.

Authors' Addresses

   Neal Cardwell
   Google
   Email: ncardwell@google.com

   Yuchung Cheng
   Google
   Email: ycheng@google.com

   Soheil Hassas Yeganeh
   Google
   Email: soheil@google.com

   Ian Swett
   Google
   Email: ianswett@google.com

   Van Jacobson
   Google
   Email: vanj@google.com