idnits 2.17.1 draft-ietf-ledbat-congestion-08.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (October 9, 2011) is 4555 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Missing Reference: 'Schneider10' is mentioned on line 179, but not defined Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 LEDBAT WG S. Shalunov 3 Internet-Draft G. Hazel 4 Intended status: Experimental BitTorrent Inc 5 Expires: April 11, 2012 J. Iyengar 6 Franklin and Marshall College 7 M. Kuehlewind 8 University of Stuttgart 9 October 9, 2011 11 Low Extra Delay Background Transport (LEDBAT) 12 draft-ietf-ledbat-congestion-08.txt 14 Abstract 16 LEDBAT is an experimental delay-based congestion control algorithm 17 that attempts to utilize the available bandwidth on an end-to-end 18 path while limiting the consequent increase in queueing delay on the 19 path. LEDBAT uses changes in one-way delay measurements to limit 20 congestion that the flow itself induces in the network. LEDBAT is 21 designed for use by background bulk-transfer applications; it is 22 designed to be no more aggressive than TCP congestion control and to 23 yield in the presence of any competing flows when latency builds, 24 thus limiting interference with the network performance of the 25 competing flows. 27 Status of this Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at http://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on April 11, 2012. 44 Copyright Notice 46 Copyright (c) 2011 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Requirements notation . . . . . . . . . . . . . . . . . . . . 3 62 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 63 2.1. Design Goals . . . . . . . . . . . . . . . . . . . . . . . 3 64 2.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 4 65 3. LEDBAT Congestion Control . . . . . . . . . . . . . . . . . . 4 66 3.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 5 67 3.2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . 5 68 3.3. Receiver-Side Operation . . . . . . . . . . . . . . . . . 5 69 3.4. Sender-Side Operation . . . . . . . . . . . . . . . . . . 6 70 3.4.1. An Overview . . . . . . . . . . . . . . . . . . . . . 6 71 3.4.2. The Complete Sender Algorithm . . . . . . . . . . . . 6 72 3.5. Parameter Values . . . . . . . . . . . . . . . . . . . . . 9 73 4. Understanding LEDBAT Mechanisms . . . . . . . . . . . . . . . 10 74 4.1. Delay Estimation . . . . . . . . . . . . . . . . . . . . . 11 75 4.1.1. Estimating Base Delay . . . . . . . . . . . . . . . . 11 76 4.1.2. Estimating Queueing Delay . . . . . . . . . . . . . . 11 77 4.2. Managing the Congestion Window . . . . . . . . . . . . . . 12 78 4.2.1. Window Increase: Probing For More Bandwidth . . . . . 12 79 4.2.2. Window Decrease: Responding To Congestion . . . . . . 12 80 4.3. Choosing The Queuing Delay Target . . . . . . . . . . . . 12 81 5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 13 82 5.1. Framing and Ack Frequency Considerations . . . . . . . . . 13 83 5.2. Competing With TCP Flows . . . . . . . . . . . . . . . . . 13 84 5.3. Fairness Among LEDBAT Flows . . . . . . . . . . . . . . . 13 85 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15 86 7. Security Considerations . . . . . . . . . . . . . . . . . . . 15 87 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 15 88 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 15 89 9.1. Normative References . . . . . . . . . . . . . . . . . . . 15 90 9.2. Informative References . . . . . . . . . . . . . . . . . . 16 91 Appendix A. Timestamp errors . . . . . . . . . . . . . . . . . . 16 92 A.1. Clock offset . . . . . . . . . . . . . . . . . . . . . . . 16 93 A.2. Clock skew . . . . . . . . . . . . . . . . . . . . . . . . 16 94 A.3. Clock skew correction mechanism . . . . . . . . . . . . . 17 95 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 18 97 1. Requirements notation 99 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 100 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 101 document are to be interpreted as described in [RFC2119]. 103 2. Introduction 105 TCP congestion control [RFC5681] seeks to share bandwidth at a 106 bottleneck link equitably among flows competing at the bottleneck, 107 and it is the predominant congestion control mechanism used on the 108 Internet. Not all applications seek an equitable share of network 109 throughput, however "background" applications, such as software 110 updates or file-sharing applications, seek to operate without 111 interfering with the performance of more interactive and delay- 112 and/or bandwidth-sensitive "foreground" applications and standard TCP 113 may be too aggressive for use with such background applications. 115 LEDBAT is an experimental delay-based congestion control mechanism 116 that reacts early to congestion in the network, thus enabling 117 "background" applications to use the network while avoidoing 118 interference with the network performance of competing flows. A 119 LEDBAT sender uses one-way delay measurements to estimate the amount 120 of queueing on the data path, controls the LEDBAT flow's congestion 121 window based on this estimate, and minimizes interference with 122 competing flows when latency builds by adding low extra queueing 123 delay on the end-to-end path. 125 Delay-based congestion control protocols, such as TCP-Vegas 126 [Bra94][Low02], are generally designed to achieve more, not less 127 throughput than standard TCP, and often outperform TCP under 128 particular network settings. In contrast, LEDBAT is designed to be 129 no more aggressive than TCP; LEDBAT is a "scavenger" congestion 130 control mechanism that seeks to utilize all available bandwidth and 131 yields quickly when competing with standard TCP at a bottleneck link. 133 2.1. Design Goals 135 LEDBAT congestion control seeks to: 136 1. utilize end-to-end available bandwidth, and maintain low queueing 137 delay when no other traffic is present, 138 2. add little to the queuing delay induced by concurrent flows, 139 3. quickly yield to flows using standard TCP congestion control that 140 share the same bottleneck link, 142 2.2. Applicability 144 LEDBAT is a "scavenger" congestion control mechanism that is 145 primarily motivated by background bulk-transfer applications, such as 146 large file transfers (as with file-sharing applications) and software 147 updates. It can be used with any application that seeks to minimize 148 its impact on the network and on other interactive delay- and/or 149 bandwidth-sensitive network applications. LEDBAT is expected to work 150 well when the sender and/or receiver is connected via a residential 151 access network. 153 LEDBAT can be used as part of a transport protocol or as part of an 154 application, as long as the data transmission mechanisms are capable 155 of carrying timestamps and acknowledging data frequently. LEDBAT can 156 be used, with appropriate extensions where necessary, with TCP, SCTP, 157 and DCCP, and with proprietary application protocols such as those 158 built on top of UDP for P2P applications. 160 When used with an ECN-capable framing protocol, LEDBAT should react 161 to an ECN mark as it would to a loss, as specified in [RFC3168]. 163 LEDBAT is designed to reduce build-up of a standing queue by long- 164 lived LEDBAT flows at a link with a tail-drop FIFO queue, so as to 165 avoid persistently delaying other flows sharing the queue. If Active 166 Queue Management (AQM) is configured to drop or ECN-mark packets 167 before the LEDBAT starts reacting to persistent queue build-up, 168 LEDBAT reverts to standard TCP behavior, rather than yield to other 169 TCP flows. However, such an AQM is still desirable since it keeps 170 queuing delay low, achieving an outcome that is in line with LEDBAT's 171 goals. Additionally, a LEDBAT transport that supports ECN enjoys the 172 advantages that an ECN-capable TCP enjoys over an ECN-agnostic TCP; 173 avoiding losses and possible retransmissions. Weighted Fair Queuing 174 (WFQ), as employed by some home gateways [Schneider10], seeks to 175 isolate and protect delay-sensitive flows from delays due to standing 176 queues built up by concurrent long-lived flows. Consequently, while 177 it prevents LEDBAT from yielding to other TCP flows, it again 178 achieves an outcome that is in line with LEDBAT's goals 179 [Schneider10]. 181 Further study is required to fully understand the behaviour of LEDBAT 182 with non-drop-tail, non-FIFO queues. 184 3. LEDBAT Congestion Control 185 3.1. Overview 187 A standard TCP sender increases its congestion window until a loss 188 occurs [RFC5681] or an ECN mark is received [RFC3168], which, in the 189 absence of any Active Queue Management (AQM) and link errors in the 190 network, occurs only when the queue at the bottleneck link on the 191 end-to-end path overflows. Since packet loss or marking at the 192 bottleneck link is expected to be preceded by an increase in the 193 queueing delay at the bottleneck link, LEDBAT congestion control uses 194 this increase in queueing delay as an early signal of congestion, 195 enabling it to respond to congestion earlier than standard TCP, and 196 enabling it to yield bandwidth to a competing TCP flow. 198 LEDBAT employs one-way delay measurements to estimate queueing delay. 199 When the estimated queueing delay is less than a pre-determined 200 target, LEDBAT infers that the network is not yet congested, and 201 increases its sending rate to utilize any spare capacity in the 202 network. When the estimated queueing delay becomes greater than a 203 pre-determined target, LEDBAT decreases its sending rate quickly as a 204 response to potential congestion in the network. 206 3.2. Preliminaries 208 A LEDBAT sender uses a congestion window (cwnd) to gate the amount of 209 data that the sender can send into the network in one roundtrip time 210 (RTT). A sender MAY maintain its cwnd in bytes or in packets; this 211 document uses cwnd in bytes. LEDBAT requires that each data segment 212 carries a "timestamp" from the sender, based on which the receiver 213 computes the one-way delay from the sender, and sends this computed 214 value back to the sender. 216 In addition to the LEDBAT mechanism described below, we note that a 217 slow start mechanism can be used as specified in [RFC5681]. Since 218 slow start leads to faster increase in the window than that specified 219 in LEDBAT, conservative congestion control implementations employing 220 LEDBAT may skip slow start altogether and start with an initial 221 window of INIT_CWND * MSS. (INIT_CWND is described later in 222 Section 3.5.) 224 The term "MSS", or the sender's Maximum Segment Size, used in this 225 document refers to the size of the largest segment that the sender 226 can transmit. The value of MSS can be based on the path MTU 227 discovery [RFC4821] algorithm and/or on other factors. 229 3.3. Receiver-Side Operation 231 A LEDBAT receiver operates as follows: 233 on data_packet: 234 remote_timestamp = data_packet.timestamp 235 acknowledgement.delay = local_timestamp() - remote_timestamp 236 # fill in other fields of acknowledgement 237 acknowlegement.send() 239 A receiver MAY send more than one delay sample in an acknowledgment. 240 For instance, a receiver that delays acknowledgments, i.e., sends an 241 acknowledgment less frequently than once per data packet, MAY send 242 all the one-way delay samples that it gathers in one acknowledgment. 244 3.4. Sender-Side Operation 246 3.4.1. An Overview 248 As a first approximation, a LEDAT sender operates as shown below; the 249 complete algorithm is specified later in Section 3.4.2. TARGET is 250 the maximum queueing delay that LEDBAT itself can introduce in the 251 network, and GAIN determines the rate at which the cwnd responds to 252 changes in queueing delay; both constants are specified later. Since 253 off_target can be positive or negative, the cwnd increases or 254 decreases in proportion to off_target. 256 on initialization: 257 base_delay = +INFINITY 259 on acknowledgement: 260 current_delay = acknowledgement.delay 261 base_delay = min(base_delay, current_delay) 262 queuing_delay = current_delay - base_delay 263 off_target = (TARGET - queuing_delay) / TARGET 264 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 266 The simplified mechanism above ignores multiple delay samples in an 267 acknowledgment, noise filtering, base delay expiration, and sender 268 idle times, which we now take into account in our complete sender 269 algorithm below. 271 3.4.2. The Complete Sender Algorithm 273 update_current_delay() maintains a list of one-way delay 274 measurements, of which a filtered value is used as an estimate of the 275 current end-to-end delay. update_base_delay() maintains a list of 276 one-way delay minima over a number of one-minute intervals, to 277 measure and to track changes in the base delay of the end-to-end 278 path. 280 This algorithm restricts cwnd growth after a period of inactivity, 281 where the cwnd is clamped down to a little more than flightsize using 282 max_allowed_cwnd. To be TCP-friendly on data loss, LEDBAT halves its 283 cwnd. The full sender-side algorithm is given below: 285 on initialization: 286 # cwnd is the amount of data that is allowed to send in one RTT and 287 # is defined in bytes. 288 # CTO is the Congestion Timeout value. 290 create current_delays list with CURRENT_FILTER elements 291 create base_delays list with BASE_HISTORY number of elements 292 initialize elements in current_delays and base_delays to +INFINITY 293 last_rollover = -INFINITY # More than a minute in the past 294 flightsize = 0 295 cwnd = INIT_CWND * MSS 296 CTO = 1 second 298 on acknowledgment: 299 # flightsize is the amount of data outstanding before this ack 300 # was received and is updated later; 301 # bytes_newly_acked is the number of bytes that this ack 302 # newly acknowledges, and it MAY be set to MSS. 304 for each delay sample in the acknowledgment: 305 delay = acknowledgement.delay 306 update_base_delay(delay) 307 update_current_delay(delay) 308 queuing_delay = FILTER(current_delays) - MIN(base_delays) 309 off_target = (TARGET - queuing_delay) / TARGET 310 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 311 max_allowed_cwnd = flightsize + ALLOWED_INCREASE * MSS 312 cwnd = min(cwnd, max_allowed_cwnd) 313 cwnd = max(cwnd, MIN_CWND * MSS) 314 flightsize = flightsize - bytes_newly_acked 315 update_CTO() 317 on data loss: 318 # atmost once per RTT 319 cwnd = min (cwnd, max (cwnd/2, MIN_CWND * MSS)) 320 if data lost is not to be retransmitted: 321 flightsize = flightsize - bytes_not_to_be_retransmitted 323 if no acks are received within a CTO: 324 # extreme congestion, or significant RTT change. 325 # set cwnd to 1MSS and backoff the congestion timer. 326 cwnd = 1 * MSS 327 CTO = 2 * CTO 329 update_CTO() 330 # implements an RTT estimation mechanism using data 331 # transmission times and ack reception times, 332 # which is used to implement a congestion timeout (CTO). 333 # If implementing in TCP, sender SHOULD use 334 # mechanisms described in RFC 6298 , 335 # and the CTO is the same as the RTO. 337 update_current_delay(delay) 338 # Maintain a list of CURRENT_FILTER last delays observed. 339 delete first item in current_delays list 340 append delay to current_delays list 342 update_base_delay(delay) 343 # Maintain BASE_HISTORY delay-minima. 344 # Each minimum is measured over a period of a minute. 345 # 'now' is the current system time 346 if round_to_minute(now) != round_to_minute(last_rollover) 347 last_rollover = now 348 delete first item in base_delays list 349 append delay to base_delays list 350 else 351 base_delays.tail = MIN(base_delays.tail, delay) 353 The LEDBAT sender ensures that any outliers in the current_delay 354 samples are eliminated by implementing FILTER() to extract the actual 355 delay estimate from the current_delay samples. An example of such a 356 filter is the Exponentially-Weighted Moving Average (EWMA) function. 358 To implement an approximate minimum over the past few minutes, a 359 LEDBAT sender stores BASE_HISTORY separate minima---one each for the 360 last BASE_HISTORY-1 minutes, and one for the running current minute. 361 At the end of the current minute, the window moves---the earliest 362 minimum is dropped and the latest minimum is added. If the 363 connection is idle for a given minute, no data is available for the 364 one-way delay and, therefore, a value of +INFINITY has to be stored 365 in the list. If the connection has been idle for BASE_HISTORY 366 minutes, all minima in the list are thus set to +INFINITY and 367 measurement begins anew. LEDBAT thus requires that during idle 368 periods, an implementation must maintain the base delay list. 370 LEDBAT uses a congestion timeout (CTO) to avoid transmitting data 371 during periods of heavy congestion, and to avoid congestion collapse. 372 A CTO is used to detect heavy congestion indicated by loss of all 373 outstanding data or acknowledgments, resulting in reduction of the 374 cwnd to 1 MSS and an exponential backoff of the CTO interval. This 375 backoff of the CTO value avoids sending more data into an overloaded 376 queue, and also allows the sender to cope with sudden changes in the 377 RTT of the path. The function of a CTO is similar to that of an 378 retransmission timeout (RTO) in TCP [RFC3390], but since LEDBAT 379 separates reliability from congestion control, a retransmission need 380 not be triggered by a CTO. LEDBAT, however does not preclude a CTO 381 from triggering retransmissions, as could be the case if LEDBAT 382 congestion control were to be used with TCP framing and reliability. 384 The CTO is a gating mechanism that ensures exponential backoff of 385 sending rate under heavy congestion, and it may be implemented with 386 or without a timer. An implementation choosing to avoid timers may 387 consider using a "next-time-to-send" variable, set based on the CTO, 388 to control the earliest time a sender may transmit without receiving 389 any acks. 391 A maximum value MAY be placed on the CTO, and if placed, it MUST be 392 60 seconds or more. 394 We note that LEDBAT assumes random fluctuations in inter-packet 395 transmission times. That will help to measure the correct base delay 396 beacuse the the bottleneck runs empty from time to time; see section 397 Section 5.3 for a discussion. 399 3.5. Parameter Values 401 TARGET MUST be 100 milliseconds or less, and this choice of value is 402 explained further in Section 4.3. Note that using the same TARGET 403 value across LEDBAT flows enables equitable sharing of the bottleneck 404 bandwidth---flows with a higher TARGET may get a larger share of the 405 bottleneck bandwidth. It is possible to consider the use of 406 different TARGET values for implementing a relative priority between 407 two competing LEDBAT flows by setting a higher TARGET value for the 408 higher-priority flow. 410 ALLOWED_INCREASE SHOULD be 1, and it MUST be greater than 0. An 411 ALLOWED_INCREASE of 0 results in no cwnd growth at all, and an 412 ALLOWED_INCREASE of 1 allows and limits the cwnd increase based on 413 flightsize in the previous RTT. An ALLOWED_INCREASE greater than 1 414 MAY be used when interactions between LEDBAT and the framing protocol 415 provide a clear reason for doing so. 417 GAIN MUST be set to 1 or less. A GAIN of 1 limits the maximum cwnd 418 ramp-up to the same rate as TCP Reno in Congestion Avoidance. While 419 this document specifies the use of the same GAIN for both cwnd 420 increase (when off_target is greater than zero) and decrease (when 421 off_target is less than zero), implementations MAY use a higher GAIN 422 for cwnd decrease than for the increase; our justification follows. 424 When a competing non-LEDBAT flow increases its sending rate, the 425 LEDBAT sender may only measure a small amount of additional delay and 426 decrease the sending rate slowly. To ensure no impact on a competing 427 non-LEDBAT flow, the LEDBAT flow should decrease its sending rate at 428 least as quickly as the competing flow increases its sending rate. A 429 higher decrease GAIN MAY be used to allow the LEDBAT flow to decrease 430 its sending rate faster than the competing flow's increase rate. 432 The size of the base_delays list, BASE_HISTORY, SHOULD be 10. If the 433 actual base delay decreases, due to a route change for instance, a 434 LEDBAT sender adapts immediately, irrespective of the value of 435 BASE_HISTORY. If the actual base delay increases however, a LEDBAT 436 sender will take BASE_HISTORY minutes to adapt and may wrongly infer 437 a little more extra delay than intended (TARGET) in the meanwhile. A 438 value for BASE_HISTORY is thus a tradeoff: a higher value may yield a 439 more accurate measurement when the base delay is unchanging, and a 440 lower value results in a quicker response to actual increase in base 441 delay. 443 A LEDBAT sender uses the current_delays list to maintain only delay 444 measurements made within a RTT amount of time in the past, seeking to 445 eliminate noise spikes in its measurement of the current one-way 446 delay through the network. The size of this list, CURRENT_FILTER, 447 may be variable, and depends on the number of successful measurements 448 made within a RTT amount of time in the past. The sender should seek 449 to gather enough delay samples in each RTT so as to have statistical 450 confidence in the measurements. While the number of delay samples 451 required for such confidence will vary depending on network 452 conditions, we recommend that the sender SHOULD use at least 4 453 samples in each RTT, unless the number of samples is lower due to a 454 small congestion window. Thus, subject to congestion window 455 constraints, the number of delay samples in each RTT SHOULD be at 456 least 4, and thus CURRENT_FILTER SHOULD be at least 4. 457 CURRENT_FILTER MUST be limited such that samples in the list are not 458 older than an RTT in the past. 460 INIT_CWND SHOULD be 4, and MIN_CWND SHOULD be 2. An INIT_CWND of 4 461 should help seed FILTER() at the sender when there are no samples at 462 the beginning of a flow, and a MIN_CWND of 2 allows FILTER() to use 463 more than a single instantaneous delay estimate while not being too 464 aggressive. Slight deviations may be warranted, for example, when 465 these values of MIN_CWND and INIT_CWND interact poorly with the 466 framing protocol. 468 4. Understanding LEDBAT Mechanisms 470 This section describes the delay estimation and window management 471 mechanisms used in LEDBAT. 473 4.1. Delay Estimation 475 LEDBAT estimates congestion in the direction of the data flow, and to 476 avoid measuring additional delay from e.g. queue build-up on the 477 reverse path (or ack path) or reordering, LEDBAT uses one-way delay 478 estimates. LEDBAT assumes measurements are done with data packets, 479 thus avoiding the need for separate measurement packets and avoiding 480 the pitfall of measurement packets being treated differently from the 481 data packets in the network. 483 End-to-end delay can be decomposed into transmission (or 484 serialization) delay, propagation (or speed-of-light) delay, queueing 485 delay, and processing delay. On any given path, barring some noise, 486 all delay components except for queueing delay are constant. To 487 observe an increase in the queueing delay in the network, a LEDBAT 488 sender separates the queueing delay component from the rest of the 489 end-to-end delay, as described below. 491 4.1.1. Estimating Base Delay 493 Since queuing delay is always additive to the end-to-end delay, 494 LEDBAT estimates the sum of the constant delay components, which we 495 call "base delay", to be the minimum delay observed on the end-to-end 496 path. 498 To respond to true changes in the base delay, as can be caused by a 499 route change, LEDBAT uses only recent measurements in estimating the 500 base delay. The duration of the observation window itself is a 501 tradeoff between robustness of measurement and responsiveness to 502 change---a larger observation window increases the chances that the 503 true base delay will be detected (as long as the true base delay is 504 unchanged), whereas a smaller observation window results in faster 505 response to true changes in the base delay. 507 4.1.2. Estimating Queueing Delay 509 Given that the base delay is constant, the queueing delay is 510 represented by the variable component of the measured end-to-end 511 delay. LEDBAT measures queueing delay as simply the difference 512 between an end-to-end delay measurement and the current estimate of 513 base delay. The queueing delay should be filtered (depending on the 514 usage scenario) to eliminate noise in the delay estimation, such as 515 due to spikes in processing delay at a node on the path. 517 4.2. Managing the Congestion Window 519 4.2.1. Window Increase: Probing For More Bandwidth 521 A LEDBAT sender increases its congestion window if the queuing delay 522 is smaller than a target value, proportionally to the relative 523 difference between the current queueing delay and the delay target. 524 To be friendly to competing TCP flows, we set this highest rate of 525 window growth to be the same as TCP's. In other words, a LEDBAT flow 526 thus never ramps up faster than a competing TCP flow over the same 527 path. As closer the extra delay gets to the TARGET value, as slower 528 LEDBAT will increase the window. 530 4.2.2. Window Decrease: Responding To Congestion 532 When the sender's queueing delay estimate is higher than the target, 533 the LEDBAT flow's rate should be reduced. LEDBAT uses a simple 534 linear controller to detemine sending rate as a function of the delay 535 estimate, where the response is proportional to the difference 536 between the current queueing delay estimate and the target. This 537 allows to decrease the window only slightly while probing and leads 538 to a quite stable state with high link utilization. In limited 539 experiments with Bittorrent nodes, this controller seems to work 540 well. 542 Unlike TCP-like loss-based congestion control, LEDBAT seeks to avoid 543 losses and so a LEDBAT sender is not expected to normally rely on 544 losses to determine the sending rate. However, when data loss does 545 occur, LEDBAT must respond as standard TCP does; even if the queueing 546 delay estimates indicate otherwise, a loss is assumed to be a strong 547 indication of congestion. Thus, to deal with severe congestion when 548 packets are dropped in the network, and to provide a fallback against 549 incorrect queuing delay estimates, a LEDBAT sender halves its 550 congestion window when a loss event is detected. As with TCP New- 551 Reno, LEDBAT reduces its cwnd by half at most once per RTT. 553 4.3. Choosing The Queuing Delay Target 555 The queueing delay target is a tradeoff. A target that is too low 556 might result in under-utilization of the bottleneck link, because of 557 the noise in the delay measurement e.g in a mobile scenario, and may 558 also be more sensitive to error in the measured delay. The 559 International Telecommunication Union's (ITU's) Recommendation G.114 560 defines a delay of 150 ms to be acceptable for most user voice 561 applications. Thus the extra delay induced by LEDBAT must be below 562 150 ms to reduce impact on delay-sentive applications. If the TARGET 563 value is larger than the maximum delay the queue can induce, LEBAT 564 will fallback to the same behavior than standard TCP (see section 565 Section 5.2). 567 Our recommendation of 100 ms or less as the target is based on these 568 considerations. Anecdotal evidence indicates that this value works 569 well: LEDBAT has been implemented and successfully deployed with a 570 target value of 100 ms in two Bittorrent implementations---BitTorrent 571 DNA as the exclusive congestion control mechanism and in uTorrent as 572 an experimental mechanism. 574 5. Discussion 576 5.1. Framing and Ack Frequency Considerations 578 While the actual framing and wire format of the protocols using 579 LEDBAT are outside the scope of this document, we briefly consider 580 the data framing and ack frequency needs of LEDBAT mechanisms. 582 To compute the data path's one-way delay, our discussion of LEDBAT 583 assumes a framing that allows the sender to timestamp packets and for 584 the receiver to convey the measured one-way delay back to the sender 585 in ack packets. LEDBAT does not require this particular method, but 586 it does require unambiguous delay estimates using data and ack 587 packets. 589 A LEDBAT receiver may send an ack as frequently as one for every data 590 packet received or less frequently; LEDBAT does require that the 591 receiver MUST transmit at least one ack in every RTT. 593 5.2. Competing With TCP Flows 595 LEDBAT is designed to respond to congestion indications earlier than 596 loss-based TCP. A LEDBAT flow is more aggressive when the queueing 597 delay estimate is lower; since the queueing delay estimate is non- 598 negative, LEDBAT is most aggressive when its queuing delay estimate 599 is zero. In this case, LEDBAT ramps up its congestion window at the 600 same rate as TCP does. LEDBAT reduces its rate earlier than TCP 601 does, always halving the congestion window on loss. Thus, in the 602 worst case where the delay estimates are completely and consistently 603 off, a LEDBAT flow falls back to TCP mechanisms as it will be at most 604 as aggressive as a TCP flow and halves on loss. 606 5.3. Fairness Among LEDBAT Flows 608 The primary design goals of LEDBAT are focussed on the aggregate 609 behavior of LEDBAT flows when they compete with standard TCP. Since 610 LEDBAT is designed for background traffic, we consider link 611 utilization to be more important than fairness amongst LEDBAT flows. 613 Nevertheless, we now consider fairness issues that might arise 614 amongst competing LEDBAT flows. 616 LEDBAT as described so far lacks a mechanism specifically designed to 617 equalize utilization amongst LEDBAT flows. Anecdotally observed 618 behavior of existing implementations indicates that a rough 619 equalization does occur since in most enviroments some amount of 620 randomness in the inter-packet transmission times exist, as explained 621 further below. 623 Delay-based congestion control systems suffer from the possibility of 624 late-comers incorrectly measuring and using a higher base-delay than 625 an active flow that started earlier. Suppose a LEDBAT flow is the 626 only flow on the bottleneck, which the flow saturates, steadily 627 maintaining the queueing delay at a target delay. When a new LEDBAT 628 flow arrives, it might incorrectly measure the current end-to-end 629 delay, including the queueing delay being maintained by the first 630 LEDBAT flow, as its base delay, and the incoming flow might now 631 effectively seek to build on top of the existing, already maximal 632 queueing delay. As the second flow builds up, the first flow sees 633 the true queueing delay and backs off, while the late-comer keeps 634 building up, using up the entire link's capacity; this advantage is 635 called the "late-comer's advantage". 637 In the worse case, if the first flow yields at the same rate as the 638 new flow increases its sending rate, the new flow will see constant 639 end-to-end delay, which it assumes is the base delay, until the first 640 flow backs off completely. As a result, by the time the second flow 641 stops increasing its cwnd, it would have added twice the target 642 queueing delay to the network. 644 This advantage can be reduced if the the first flow yields quickly 645 enough to empty the bottleneck queue faster than the incoming flow 646 increases its occupancy in the queue; as a result, the late-comer 647 might measure a delay closer to the base delay. While such a 648 reduction might be achieved through a multiplicative decrease of the 649 congestion window, this might cause stronger fluctuations in flow 650 throughput during steady state. Thus we do not recommend a 651 multiplicative decrease scheme. 653 We note that in certain use-case scenarios, it is possible for a 654 later LEDBAT flow to gain an unfair advantage over an existing one 655 [Car10]. In practice, this concern may be alleviated by the 656 burstiness of network traffic: all that's needed to measure the base 657 delay is one small gap in transmission schedules between the LEDBAT 658 flows. These gaps can occur for a number of reasons such as latency 659 introduced due to application sending patterns, OS scheduling at the 660 sender, processing delay at the sender or any network node, and link 661 contention. When such a gap occurs in the first sender's 662 transmission while the late-comer is starting, base delay is 663 immediately correctly measured. With a small number of LEDBAT flows, 664 system noise may sufficiently regulate the late-comer's advantage. 666 6. IANA Considerations 668 There are no IANA considerations for this document. 670 7. Security Considerations 672 A network on the path might choose to cause higher delay measurements 673 than the real queuing delay so that LEDBAT backs off even when 674 there's no congestion present. While shaping of traffic into an 675 artificially narrow bottleneck by increasing the queueing delay 676 cannot be trivially counteracted, a protocol using LEDBAT should seek 677 to minimize the risk of such an attack by authenticating the 678 timestamp and delay fields in the packets. 680 LEDBAT is not known to introduce any new concerns with privacy, 681 integrity, or other security issues for flows that use it. It should 682 be compatible with use of IPsec and TLS/DTLS. 684 8. Acknowledgements 686 We thank folks in the LEDBAT working group for their comments and 687 feedback. Special thanks to Murari Sridharan and Rolf Winter for 688 their patient and untiring shepherding. 690 9. References 692 9.1. Normative References 694 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 695 Requirement Levels", BCP 14, RFC 2119, March 1997. 697 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 698 of Explicit Congestion Notification (ECN) to IP", 699 RFC 3168, September 2001. 701 [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's 702 Initial Window", RFC 3390, October 2002. 704 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 705 Discovery", RFC 4821, March 2007. 707 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 708 Control", RFC 5681, September 2009. 710 9.2. Informative References 712 [Bra94] Brakmo, L., O'Malley, S., and L. Peterson, "TCP Vegas: New 713 techniques for congestion detection and avoidance", 714 Proceedings of SIGCOMM '94, pages 24-35, August 1994. 716 [Car10] Carofiglio, G., Muscariello, L., Rossi, D., Testa, C., and 717 S. Valenti, "Rethinking Low Extra Delay Background 718 Transport Protocols", arXiv:1010.5623v1, September 2010. 720 [Low02] Low, S., Peterson, L., and L. Wang, "Understanding TCP 721 Vegas: A Duality Model", JACM 49 (2), March 2002. 723 Appendix A. Timestamp errors 725 One-way delay measurement needs to deal with timestamp errors. We'll 726 use the same locally linear clock model and the same terminology as 727 Network Time Protocol (NTP). This model is valid for any 728 differentiable clocks. NTP uses the term "offset" to refer to 729 difference from true time and "skew" to refer to difference of clock 730 rate from the true rate. The clock will thus have a fixed offset 731 from the true time and a skew. We'll consider what we need to do 732 about the offset and the skew separately. 734 A.1. Clock offset 736 First, consider the case of zero skew. The offset of each of the two 737 clocks shows up as a fixed error in one-way delay measurement. The 738 difference of the offsets is the absolute error of the one-way delay 739 estimate. We won't use this estimate directly, however. We'll use 740 the difference between that and a base delay. Because the error 741 (difference of clock offsets) is the same for the current and base 742 delay, it cancels from the queuing delay estimate, which is what 743 we'll use. Clock offset is thus irrelevant to the design. 745 A.2. Clock skew 747 The clock skew manifests in a linearly changing error in the time 748 estimate. For a given pair of clocks, the difference in skews is the 749 skew of the one-way delay estimate. Unlike the offset, this no 750 longer cancels in the computation of the queuing delay estimate. On 751 the other hand, while the offset could be huge, with some clocks off 752 by minutes or even hours or more, the skew is typically small. For 753 example, NTP is designed to work with most clocks, yet it gives up 754 when the skew is more than 500 parts per million (PPM). Typical 755 skews of clocks that have never been trained seem to often be around 756 100-200 PPM. Previously trained clocks could have 10-20 PPM skew due 757 to temperature changes. A 100-PPM skew means accumulating 6 758 milliseconds of error per minute. The base delay updates mostly 759 takes care of clock skew unless the skew is unusually high or extreme 760 values have been chosen for TARGET and BASE_HISTORY so that the clock 761 skew in BASE_DELAY minutes is larger than the TARGET. 763 Clock skew can be in two directions: either the sender's clock is 764 faster than the receiver's, or vice versa. 766 If the sender's clock is faster the one-way delay measurement will 767 get more and more reduced by the clock drift over time. Whenever 768 there is no additional delay the base delay will be updated by a 769 smaller one-way delay value and the skew is compensated. If a 770 competing flow introduces additional queueing delay LEDBAT will 771 anyway get out of the way quickly and an overestimated one-way delay 772 will just speed-up the back-off. 774 When the receiver clock runs faster, the raw delay estimate will 775 drift up with time. This can suppress the throughput unnecessarily. 776 In this case a skew correction mechanim can be benefital. Further 777 condersiderations based on a deployed implementation and LEDBAT 778 specific preconditions are given in the next section. 780 A.3. Clock skew correction mechanism 782 The following paragraph describes the deployed clock skew correction 783 mechanism in the BitTorrent implementation for documentation purpose. 785 In the BitTorrent implemtation the receiver sends back the raw 786 (sending and receiving) timestamps. Based on this imfomation and the 787 system time at feedback receiption the sender can estimated the one- 788 way delay in both directions. Thus the sender can run the same base 789 delay calculation algorithm it runs for itself for the receiver as 790 well. If the sender can detect the receiver reducing its base delay, 791 it can infer that this is due to clock drift. The sender can be 792 compensated for by increasing the it's base delay by the same amount. 793 To apply this mechanism symmetrical framing is need (i.e., same 794 information about delays transmitted in both way). 796 The following considerations can be used for an alternative 797 implementation as a reference: 799 o Skew correction with faster virtual clock: 800 Since having a faster clock on the sender will continuousely 801 update the base delay, a faster virtual clock for sender 802 timestamping can be applied. This virual clock can be computed 803 from the default machine clock through a linear transformation. 804 E.g. with a 500 PPM speed-up the sender's clock is very likely to 805 be faster than any receiver's clock and thus LEDBAT will benefit 806 from the implicit correction when updating the base delay. 808 o Skew correction with estimating drift: 809 With LEDBAT the history of base delay minima is already kept for 810 each minute. This can provide a base to compute the clock skew 811 difference between the two hosts. The slope of a linear function 812 fitted to the set of minima base delays gives an estimate of the 813 clock skew. This estimation can be used to correct the clocks. 814 If the other endpoint is doing the same, the clock should be 815 corrected by half of the estimated skew amount. 817 o Byzantine skew correction: 818 When it is known that each host maintains long-lived connections 819 to a number of different other hosts, a byzantine scheme can be 820 used to estimate the skew with respect to the true time. Namely, 821 calculate the skew difference for each of the peer hosts as 822 described with the previous approach, then take the median of the 823 skew differences. While this scheme is not universally 824 applicable, it combines well with other schemes, since it is 825 essentially a clock training mechanism. The scheme also acts the 826 fastest, since the state is preserved between connections. 828 Authors' Addresses 830 Stanislav Shalunov 831 BitTorrent Inc 832 612 Howard St, Suite 400 833 San Francisco, CA 94105 834 USA 836 Email: shalunov@bittorrent.com 837 URI: http://shlang.com 838 Greg Hazel 839 BitTorrent Inc 840 612 Howard St, Suite 400 841 San Francisco, CA 94105 842 USA 844 Email: greg@bittorrent.com 846 Janardhan Iyengar 847 Franklin and Marshall College 848 415 Harrisburg Ave. 849 Lancaster, PA 17603 850 USA 852 Email: jiyengar@fandm.edu 854 Mirja Kuehlewind 855 University of Stuttgart 856 Stuttgart 857 DE 859 Email: mirja.kuehlewind@ikr.uni-stuttgart.de