idnits 2.17.1 draft-ietf-ledbat-congestion-09.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (October 31, 2011) is 4561 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Experimental ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 LEDBAT WG S. Shalunov 3 Internet-Draft G. Hazel 4 Intended status: Experimental BitTorrent Inc 5 Expires: May 3, 2012 J. Iyengar 6 Franklin and Marshall College 7 M. Kuehlewind 8 University of Stuttgart 9 October 31, 2011 11 Low Extra Delay Background Transport (LEDBAT) 12 draft-ietf-ledbat-congestion-09.txt 14 Abstract 16 LEDBAT is an experimental delay-based congestion control algorithm 17 that attempts to utilize the available bandwidth on an end-to-end 18 path while limiting the consequent increase in queueing delay on the 19 path. LEDBAT uses changes in one-way delay measurements to limit 20 congestion that the flow itself induces in the network. LEDBAT is 21 designed for use by background bulk-transfer applications; it is 22 designed to be no more aggressive than TCP congestion control and to 23 yield in the presence of any competing flows when latency builds, 24 thus limiting interference with the network performance of the 25 competing flows. 27 Status of this Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at http://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on May 3, 2012. 44 Copyright Notice 46 Copyright (c) 2011 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Requirements notation . . . . . . . . . . . . . . . . . . . . 3 62 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 63 2.1. Design Goals . . . . . . . . . . . . . . . . . . . . . . . 3 64 2.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 4 65 3. LEDBAT Congestion Control . . . . . . . . . . . . . . . . . . 4 66 3.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 4 67 3.2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . 5 68 3.3. Receiver-Side Operation . . . . . . . . . . . . . . . . . 5 69 3.4. Sender-Side Operation . . . . . . . . . . . . . . . . . . 6 70 3.4.1. An Overview . . . . . . . . . . . . . . . . . . . . . 6 71 3.4.2. The Complete Sender Algorithm . . . . . . . . . . . . 6 72 3.5. Parameter Values . . . . . . . . . . . . . . . . . . . . . 9 73 4. Understanding LEDBAT Mechanisms . . . . . . . . . . . . . . . 11 74 4.1. Delay Estimation . . . . . . . . . . . . . . . . . . . . . 11 75 4.1.1. Estimating Base Delay . . . . . . . . . . . . . . . . 11 76 4.1.2. Estimating Queueing Delay . . . . . . . . . . . . . . 12 77 4.2. Managing the Congestion Window . . . . . . . . . . . . . . 12 78 4.2.1. Window Increase: Probing For More Bandwidth . . . . . 12 79 4.2.2. Window Decrease: Responding To Congestion . . . . . . 12 80 4.3. Choosing The Queuing Delay Target . . . . . . . . . . . . 13 81 5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 13 82 5.1. Framing and Ack Frequency Considerations . . . . . . . . . 13 83 5.2. Competing With TCP Flows . . . . . . . . . . . . . . . . . 13 84 5.3. Fairness Among LEDBAT Flows . . . . . . . . . . . . . . . 14 85 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15 86 7. Security Considerations . . . . . . . . . . . . . . . . . . . 15 87 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 15 88 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16 89 9.1. Normative References . . . . . . . . . . . . . . . . . . . 16 90 9.2. Informative References . . . . . . . . . . . . . . . . . . 16 91 Appendix A. Timestamp errors . . . . . . . . . . . . . . . . . . 16 92 A.1. Clock offset . . . . . . . . . . . . . . . . . . . . . . . 17 93 A.2. Clock skew . . . . . . . . . . . . . . . . . . . . . . . . 17 94 A.3. Clock skew correction mechanism . . . . . . . . . . . . . 18 95 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19 97 1. Requirements notation 99 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 100 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 101 document are to be interpreted as described in [RFC2119]. 103 2. Introduction 105 TCP congestion control [RFC5681] seeks to share bandwidth at a 106 bottleneck link equitably among flows competing at the bottleneck, 107 and it is the predominant congestion control mechanism used on the 108 Internet. Not all applications seek an equitable share of network 109 throughput, however "background" applications, such as software 110 updates or file-sharing applications, seek to operate without 111 interfering with the performance of more interactive and delay- 112 and/or bandwidth-sensitive "foreground" applications and standard TCP 113 may be too aggressive for use with such background applications. 115 LEDBAT is an experimental delay-based congestion control mechanism 116 that reacts early to congestion in the network, thus enabling 117 "background" applications to use the network while avoidoing 118 interference with the network performance of competing flows. A 119 LEDBAT sender uses one-way delay measurements to estimate the amount 120 of queueing on the data path, controls the LEDBAT flow's congestion 121 window based on this estimate, and minimizes interference with 122 competing flows when latency builds by adding low extra queueing 123 delay on the end-to-end path. 125 Delay-based congestion control protocols, such as TCP-Vegas 126 [Bra94][Low02], are generally designed to achieve more, not less 127 throughput than standard TCP, and often outperform TCP under 128 particular network settings. In contrast, LEDBAT is designed to be 129 no more aggressive than TCP; LEDBAT is a "scavenger" congestion 130 control mechanism that seeks to utilize all available bandwidth and 131 yields quickly when competing with standard TCP at a bottleneck link. 133 2.1. Design Goals 135 LEDBAT congestion control seeks to: 136 1. utilize end-to-end available bandwidth, and maintain low queueing 137 delay when no other traffic is present, 138 2. add little to the queuing delay induced by concurrent flows, 139 3. quickly yield to flows using standard TCP congestion control that 140 share the same bottleneck link, 142 2.2. Applicability 144 LEDBAT is a "scavenger" congestion control mechanism that is 145 primarily motivated by background bulk-transfer applications, such as 146 large file transfers (as with file-sharing applications) and software 147 updates. It can be used with any application that seeks to minimize 148 its impact on the network and on other interactive delay- and/or 149 bandwidth-sensitive network applications. LEDBAT is expected to work 150 well when the sender and/or receiver is connected via a residential 151 access network. 153 LEDBAT can be used as part of a transport protocol or as part of an 154 application, as long as the data transmission mechanisms are capable 155 of carrying timestamps and acknowledging data frequently. LEDBAT can 156 be used, with appropriate extensions where necessary, with TCP, SCTP, 157 and DCCP, and with proprietary application protocols such as those 158 built on top of UDP for P2P applications. 160 When used with an ECN-capable framing protocol, LEDBAT should react 161 to an ECN mark as it would to a loss, as specified in [RFC3168]. 163 LEDBAT is designed to reduce build-up of a standing queue by long- 164 lived LEDBAT flows at a link with a tail-drop FIFO queue, so as to 165 avoid persistently delaying other flows sharing the queue. If Active 166 Queue Management (AQM) is configured to drop or ECN-mark packets 167 before the LEDBAT starts reacting to persistent queue build-up, 168 LEDBAT reverts to standard TCP behavior, rather than yield to other 169 TCP flows. However, such an AQM is still desirable since it keeps 170 queuing delay low, achieving an outcome that is in line with LEDBAT's 171 goals. Additionally, a LEDBAT transport that supports ECN enjoys the 172 advantages that an ECN-capable TCP enjoys over an ECN-agnostic TCP; 173 avoiding losses and possible retransmissions. Weighted Fair Queuing 174 (WFQ), as employed by some home gateways, seeks to isolate and 175 protect delay-sensitive flows from delays due to standing queues 176 built up by concurrent long-lived flows. Consequently, while it 177 prevents LEDBAT from yielding to other TCP flows, it again achieves 178 an outcome that is in line with LEDBAT's goals [Sch10]. 180 Further study is required to fully understand the behaviour of LEDBAT 181 with non-drop-tail, non-FIFO queues. 183 3. LEDBAT Congestion Control 185 3.1. Overview 187 A standard TCP sender increases its congestion window until a loss 188 occurs [RFC5681] or an ECN mark is received [RFC3168], which, in the 189 absence of any Active Queue Management (AQM) and link errors in the 190 network, occurs only when the queue at the bottleneck link on the 191 end-to-end path overflows. Since packet loss or marking at the 192 bottleneck link is expected to be preceded by an increase in the 193 queueing delay at the bottleneck link, LEDBAT congestion control uses 194 this increase in queueing delay as an early signal of congestion, 195 enabling it to respond to congestion earlier than standard TCP, and 196 enabling it to yield bandwidth to a competing TCP flow. 198 LEDBAT employs one-way delay measurements to estimate queueing delay. 199 When the estimated queueing delay is less than a pre-determined 200 target, LEDBAT infers that the network is not yet congested, and 201 increases its sending rate to utilize any spare capacity in the 202 network. When the estimated queueing delay becomes greater than a 203 pre-determined target, LEDBAT decreases its sending rate quickly as a 204 response to potential congestion in the network. 206 3.2. Preliminaries 208 A LEDBAT sender uses a congestion window (cwnd) to gate the amount of 209 data that the sender can send into the network in one roundtrip time 210 (RTT). A sender MAY maintain its cwnd in bytes or in packets; this 211 document uses cwnd in bytes. LEDBAT requires that each data segment 212 carries a "timestamp" from the sender, based on which the receiver 213 computes the one-way delay from the sender, and sends this computed 214 value back to the sender. 216 In addition to the LEDBAT mechanism described below, we note that a 217 slow start mechanism can be used as specified in [RFC5681]. Since 218 slow start leads to faster increase in the window than that specified 219 in LEDBAT, conservative congestion control implementations employing 220 LEDBAT may skip slow start altogether and start with an initial 221 window of INIT_CWND * MSS. (INIT_CWND is described later in 222 Section 3.5.) 224 The term "MSS", or the sender's Maximum Segment Size, used in this 225 document refers to the size of the largest segment that the sender 226 can transmit. The value of MSS can be based on the path MTU 227 discovery [RFC4821] algorithm and/or on other factors. 229 3.3. Receiver-Side Operation 231 A LEDBAT receiver operates as follows: 233 on data_packet: 234 remote_timestamp = data_packet.timestamp 235 acknowledgement.delay = local_timestamp() - remote_timestamp 236 # fill in other fields of acknowledgement 237 acknowlegement.send() 239 A receiver MAY send more than one delay sample in an acknowledgment. 240 For instance, a receiver that delays acknowledgments, i.e., sends an 241 acknowledgment less frequently than once per data packet, MAY send 242 all the one-way delay samples that it gathers in one acknowledgment. 244 3.4. Sender-Side Operation 246 3.4.1. An Overview 248 As a first approximation, a LEDBAT sender operates as shown below; 249 the complete algorithm is specified later in Section 3.4.2. TARGET 250 is the maximum queueing delay that LEDBAT itself may introduce in the 251 network, and GAIN determines the rate at which the cwnd responds to 252 changes in queueing delay; both constants are specified later. 253 off_target is a normalized value representing the difference between 254 the measured current queueing delay and the pre-determined TARGET 255 queuing delay. off_target can be positive or negative, and 256 consequently, cwnd increases or decreases in proportion to 257 off_target. 259 on initialization: 260 base_delay = +INFINITY 262 on acknowledgement: 263 current_delay = acknowledgement.delay 264 base_delay = min(base_delay, current_delay) 265 queuing_delay = current_delay - base_delay 266 off_target = (TARGET - queuing_delay) / TARGET 267 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 269 The simplified mechanism above ignores multiple delay samples in an 270 acknowledgment, noise filtering, base delay expiration, and sender 271 idle times, which we now take into account in our complete sender 272 algorithm below. 274 3.4.2. The Complete Sender Algorithm 276 update_current_delay() maintains a list of one-way delay 277 measurements, of which a filtered value is used as an estimate of the 278 current end-to-end delay. update_base_delay() maintains a list of 279 one-way delay minima over a number of one-minute intervals, to 280 measure and to track changes in the base delay of the end-to-end 281 path. 283 This algorithm restricts cwnd growth after a period of inactivity, 284 where the cwnd is clamped down to a little more than flightsize using 285 max_allowed_cwnd. To be TCP-friendly on data loss, LEDBAT halves its 286 cwnd. The full sender-side algorithm is given below: 288 on initialization: 289 # cwnd is the amount of data that is allowed to send in one RTT and 290 # is defined in bytes. 291 # CTO is the Congestion Timeout value. 293 create current_delays list with CURRENT_FILTER elements 294 create base_delays list with BASE_HISTORY number of elements 295 initialize elements in base_delays to +INFINITY 296 initialize elements in current_delays appropriate to FILTER() 297 last_rollover = -INFINITY # More than a minute in the past 298 flightsize = 0 299 cwnd = INIT_CWND * MSS 300 CTO = 1 second 302 on acknowledgment: 303 # flightsize is the amount of data outstanding before this ack 304 # was received and is updated later; 305 # bytes_newly_acked is the number of bytes that this ack 306 # newly acknowledges, and it MAY be set to MSS. 308 for each delay sample in the acknowledgment: 309 delay = acknowledgement.delay 310 update_base_delay(delay) 311 update_current_delay(delay) 312 queuing_delay = FILTER(current_delays) - MIN(base_delays) 313 off_target = (TARGET - queuing_delay) / TARGET 314 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 315 max_allowed_cwnd = flightsize + ALLOWED_INCREASE * MSS 316 cwnd = min(cwnd, max_allowed_cwnd) 317 cwnd = max(cwnd, MIN_CWND * MSS) 318 flightsize = flightsize - bytes_newly_acked 319 update_CTO() 321 on data loss: 322 # at most once per RTT 323 cwnd = min (cwnd, max (cwnd/2, MIN_CWND * MSS)) 324 if data lost is not to be retransmitted: 325 flightsize = flightsize - bytes_not_to_be_retransmitted 327 if no acks are received within a CTO: 328 # extreme congestion, or significant RTT change. 329 # set cwnd to 1MSS and backoff the congestion timer. 330 cwnd = 1 * MSS 331 CTO = 2 * CTO 333 update_CTO() 334 # implements an RTT estimation mechanism using data 335 # transmission times and ack reception times, 336 # which is used to implement a congestion timeout (CTO). 337 # If implementing LEDBAT in TCP, sender SHOULD use 338 # mechanisms described in RFC 6298 , 339 # and the CTO would be the same as the RTO. 341 update_current_delay(delay) 342 # Maintain a list of CURRENT_FILTER last delays observed. 343 delete first item in current_delays list 344 append delay to current_delays list 346 update_base_delay(delay) 347 # Maintain BASE_HISTORY delay-minima. 348 # Each minimum is measured over a period of a minute. 349 # 'now' is the current system time 350 if round_to_minute(now) != round_to_minute(last_rollover) 351 last_rollover = now 352 delete first item in base_delays list 353 append delay to base_delays list 354 else 355 base_delays.tail = MIN(base_delays.tail, delay) 357 The LEDBAT sender seeks to to extract the actual delay estimate from 358 the current_delay samples by implementing FILTER() to eliminate any 359 outliers. Different types of filters MAY be used for FILTER() --- a 360 NULL filter, that does not filter at all, is a reasonable candidate 361 as well, since LEDBAT's use of a linear controller for cwnd increase 362 and decrease may allow it to recover quickly from errors induced by 363 bad samples. Another example of a filter is the Exponentially- 364 Weighted Moving Average (EWMA) function, with weights that enable 365 agile tracking of changing network delay. A simple MIN filter 366 applied over a small window may also provide robustness to large 367 delay peaks, as may occur with delayed acks in TCP. Care should be 368 taken that the filter used, while providing robustness to noise, 369 remains sensitive to persistent congestion signals. 371 To implement an approximate minimum over the past few minutes, a 372 LEDBAT sender stores BASE_HISTORY separate minima---one each for the 373 last BASE_HISTORY-1 minutes, and one for the running current minute. 374 At the end of the current minute, the window moves---the earliest 375 minimum is dropped and the latest minimum is added. If the 376 connection is idle for a given minute, no data is available for the 377 one-way delay and, therefore, a value of +INFINITY has to be stored 378 in the list. If the connection has been idle for BASE_HISTORY 379 minutes, all minima in the list are thus set to +INFINITY and 380 measurement begins anew. LEDBAT thus requires that during idle 381 periods, an implementation must maintain the base delay list. 383 LEDBAT uses a congestion timeout (CTO) to avoid transmitting data 384 during periods of heavy congestion, and to avoid congestion collapse. 385 A CTO is used to detect heavy congestion indicated by loss of all 386 outstanding data or acknowledgments, resulting in reduction of the 387 cwnd to 1 MSS and an exponential backoff of the CTO interval. This 388 backoff of the CTO value avoids sending more data into an overloaded 389 queue, and also allows the sender to cope with sudden changes in the 390 RTT of the path. The function of a CTO is similar to that of an 391 retransmission timeout (RTO) in TCP [RFC6298], but since LEDBAT 392 separates reliability from congestion control, a retransmission need 393 not be triggered by a CTO. LEDBAT, however does not preclude a CTO 394 from triggering retransmissions, as could be the case if LEDBAT 395 congestion control were to be used with TCP framing and reliability. 397 The CTO is a gating mechanism that ensures exponential backoff of 398 sending rate under heavy congestion, and it may be implemented with 399 or without a timer. An implementation choosing to avoid timers may 400 consider using a "next-time-to-send" variable, set based on the CTO, 401 to control the earliest time a sender may transmit without receiving 402 any acks. 404 A maximum value MAY be placed on the CTO, and if placed, it MUST be 405 60 seconds or more. 407 We note that LEDBAT assumes random fluctuations in inter-packet 408 transmission times. That will help to measure the correct base delay 409 because the bottleneck runs empty from time to time; see section 410 Section 5.3 for a discussion. 412 3.5. Parameter Values 414 TARGET MUST be 100 milliseconds or less, and this choice of value is 415 explained further in Section 4.3. Note that using the same TARGET 416 value across LEDBAT flows enables equitable sharing of the bottleneck 417 bandwidth---flows with a higher TARGET may get a larger share of the 418 bottleneck bandwidth. It is possible to consider the use of 419 different TARGET values for implementing a relative priority between 420 two competing LEDBAT flows by setting a higher TARGET value for the 421 higher-priority flow. 423 ALLOWED_INCREASE SHOULD be 1, and it MUST be greater than 0. An 424 ALLOWED_INCREASE of 0 results in no cwnd growth at all, and an 425 ALLOWED_INCREASE of 1 allows and limits the cwnd increase based on 426 flightsize in the previous RTT. An ALLOWED_INCREASE greater than 1 427 MAY be used when interactions between LEDBAT and the framing protocol 428 provide a clear reason for doing so. 430 GAIN MUST be set to 1 or less. A GAIN of 1 limits the maximum cwnd 431 ramp-up to the same rate as TCP Reno in Congestion Avoidance. While 432 this document specifies the use of the same GAIN for both cwnd 433 increase (when off_target is greater than zero) and decrease (when 434 off_target is less than zero), implementations MAY use a higher GAIN 435 for cwnd decrease than for the increase; our justification follows. 436 When a competing non-LEDBAT flow increases its sending rate, the 437 LEDBAT sender may only measure a small amount of additional delay and 438 decrease the sending rate slowly. To ensure no impact on a competing 439 non-LEDBAT flow, the LEDBAT flow should decrease its sending rate at 440 least as quickly as the competing flow increases its sending rate. A 441 higher decrease GAIN MAY be used to allow the LEDBAT flow to decrease 442 its sending rate faster than the competing flow's increase rate. 444 The size of the base_delays list, BASE_HISTORY, SHOULD be 10. If the 445 actual base delay decreases, due to a route change for instance, a 446 LEDBAT sender adapts immediately, irrespective of the value of 447 BASE_HISTORY. If the actual base delay increases however, a LEDBAT 448 sender will take BASE_HISTORY minutes to adapt and may wrongly infer 449 a little more extra delay than intended (TARGET) in the meanwhile. A 450 value for BASE_HISTORY is thus a tradeoff: a higher value may yield a 451 more accurate measurement when the base delay is unchanging, and a 452 lower value results in a quicker response to actual increase in base 453 delay. 455 A LEDBAT sender uses the current_delays list to maintain only delay 456 measurements made within a RTT amount of time in the past, seeking to 457 eliminate noise spikes in its measurement of the current one-way 458 delay through the network. The size of this list, CURRENT_FILTER, 459 may be variable, and depends on the FILTER() function as well as the 460 number of successful measurements made within a RTT amount of time in 461 the past. The sender should seek to gather enough delay samples in 462 each RTT so as to have statistical confidence in the measurements. 463 While the number of delay samples required for such confidence will 464 vary depending on network conditions, we recommend that the sender 465 SHOULD use at least 4 samples in each RTT, unless the number of 466 samples is lower due to a small congestion window. Thus, subject to 467 congestion window constraints, the number of delay samples in each 468 RTT SHOULD be at least 4. The value of CURRENT_FILTER will depend on 469 the filter being employed, but CURRENT_FILTER MUST be limited such 470 that samples in the list are not older than an RTT in the past. 472 INIT_CWND and MIN_CWND SHOULD both be 2. An INIT_CWND of 2 should 473 help seed FILTER() at the sender when there are no samples at the 474 beginning of a flow, and a MIN_CWND of 2 allows FILTER() to use more 475 than a single instantaneous delay estimate while not being too 476 aggressive. Slight deviations may be warranted, for example, when 477 these values of INIT_CWND and MIN_CWND interact poorly with the 478 framing protocol. However, INIT_CWND and MIN_CWND MUST be no larger 479 than the corresponding values specified for TCP [RFC5681]. 481 4. Understanding LEDBAT Mechanisms 483 This section describes the delay estimation and window management 484 mechanisms used in LEDBAT. 486 4.1. Delay Estimation 488 LEDBAT estimates congestion in the direction of the data flow, and to 489 avoid measuring additional delay from e.g. queue build-up on the 490 reverse path (or ack path) or reordering, LEDBAT uses one-way delay 491 estimates. LEDBAT assumes measurements are done with data packets, 492 thus avoiding the need for separate measurement packets and avoiding 493 the pitfall of measurement packets being treated differently from the 494 data packets in the network. 496 End-to-end delay can be decomposed into transmission (or 497 serialization) delay, propagation (or speed-of-light) delay, queueing 498 delay, and processing delay. On any given path, barring some noise, 499 all delay components except for queueing delay are constant. To 500 observe an increase in the queueing delay in the network, a LEDBAT 501 sender separates the queueing delay component from the rest of the 502 end-to-end delay, as described below. 504 4.1.1. Estimating Base Delay 506 Since queuing delay is always additive to the end-to-end delay, 507 LEDBAT estimates the sum of the constant delay components, which we 508 call "base delay", to be the minimum delay observed on the end-to-end 509 path. 511 To respond to true changes in the base delay, as can be caused by a 512 route change, LEDBAT uses only recent measurements in estimating the 513 base delay. The duration of the observation window itself is a 514 tradeoff between robustness of measurement and responsiveness to 515 change---a larger observation window increases the chances that the 516 true base delay will be detected (as long as the true base delay is 517 unchanged), whereas a smaller observation window results in faster 518 response to true changes in the base delay. 520 4.1.2. Estimating Queueing Delay 522 Given that the base delay is constant, the queueing delay is 523 represented by the variable component of the measured end-to-end 524 delay. LEDBAT measures queueing delay as simply the difference 525 between an end-to-end delay measurement and the current estimate of 526 base delay. The queueing delay should be filtered (depending on the 527 usage scenario) to eliminate noise in the delay estimation, such as 528 due to spikes in processing delay at a node on the path. 530 4.2. Managing the Congestion Window 532 4.2.1. Window Increase: Probing For More Bandwidth 534 A LEDBAT sender increases its congestion window if the queuing delay 535 is smaller than a target value, proportionally to the relative 536 difference between the current queueing delay and the delay target. 537 To be friendly to competing TCP flows, we set this highest rate of 538 window growth to be the same as TCP's. In other words, a LEDBAT flow 539 thus never ramps up faster than a competing TCP flow over the same 540 path. As closer the extra delay gets to the TARGET value, as slower 541 LEDBAT will increase the window. 543 4.2.2. Window Decrease: Responding To Congestion 545 When the sender's queueing delay estimate is higher than the target, 546 the LEDBAT flow's rate should be reduced. LEDBAT uses a simple 547 linear controller to determine the sending rate as a function of the 548 delay estimate, where the response is proportional to the difference 549 between the current queueing delay estimate and the target. This 550 allows to decrease the window only slightly while probing and leads 551 to a quite stable state with high link utilization. In limited 552 experiments with Bittorrent nodes, this controller seems to work 553 well. 555 Unlike TCP-like loss-based congestion control, LEDBAT seeks to avoid 556 losses and so a LEDBAT sender is not expected to normally rely on 557 losses to determine the sending rate. However, when data loss does 558 occur, LEDBAT must respond as standard TCP does; even if the queueing 559 delay estimates indicate otherwise, a loss is assumed to be a strong 560 indication of congestion. Thus, to deal with severe congestion when 561 packets are dropped in the network, and to provide a fallback against 562 incorrect queuing delay estimates, a LEDBAT sender halves its 563 congestion window when a loss event is detected. As with TCP New- 564 Reno, LEDBAT reduces its cwnd by half at most once per RTT. 566 4.3. Choosing The Queuing Delay Target 568 The queueing delay target is a tradeoff. A target that is too low 569 might result in under-utilization of the bottleneck link, because of 570 the noise in the delay measurement e.g in a mobile scenario, and may 571 also be more sensitive to error in the measured delay. The 572 International Telecommunication Union's (ITU's) Recommendation G.114 573 defines a delay of 150 ms to be acceptable for most user voice 574 applications. Thus the extra delay induced by LEDBAT must be below 575 150 ms to reduce impact on delay-sentive applications. If the TARGET 576 value is larger than the maximum delay the queue can induce, LEDBAT 577 will fallback to the same behavior than standard TCP (see section 578 Section 5.2). 580 Our recommendation of 100 ms or less as the target is based on these 581 considerations. Anecdotal evidence indicates that this value works 582 well: LEDBAT has been implemented and successfully deployed with a 583 target value of 100 ms in two Bittorrent implementations---BitTorrent 584 DNA as the exclusive congestion control mechanism and in uTorrent as 585 an experimental mechanism. 587 5. Discussion 589 5.1. Framing and Ack Frequency Considerations 591 While the actual framing and wire format of the protocols using 592 LEDBAT are outside the scope of this document, we briefly consider 593 the data framing and ack frequency needs of LEDBAT mechanisms. 595 To compute the data path's one-way delay, our discussion of LEDBAT 596 assumes a framing that allows the sender to timestamp packets and for 597 the receiver to convey the measured one-way delay back to the sender 598 in ack packets. LEDBAT does not require this particular method, but 599 it does require unambiguous delay estimates using data and ack 600 packets. 602 A LEDBAT receiver may send an ack as frequently as one for every data 603 packet received or less frequently; LEDBAT does require that the 604 receiver MUST transmit at least one ack in every RTT. 606 5.2. Competing With TCP Flows 608 LEDBAT is designed to respond to congestion indications earlier than 609 loss-based TCP. A LEDBAT flow is more aggressive when the queueing 610 delay estimate is lower; since the queueing delay estimate is non- 611 negative, LEDBAT is most aggressive when its queuing delay estimate 612 is zero. In this case, LEDBAT ramps up its congestion window at the 613 same rate as TCP does. LEDBAT reduces its rate earlier than TCP 614 does, always halving the congestion window on loss. Thus, in the 615 worst case where the delay estimates are completely and consistently 616 off, a LEDBAT flow falls back to TCP mechanisms as it will be at most 617 as aggressive as a TCP flow and halves on loss. 619 5.3. Fairness Among LEDBAT Flows 621 The primary design goals of LEDBAT are focussed on the aggregate 622 behavior of LEDBAT flows when they compete with standard TCP. Since 623 LEDBAT is designed for background traffic, we consider link 624 utilization to be more important than fairness amongst LEDBAT flows. 625 Nevertheless, we now consider fairness issues that might arise 626 amongst competing LEDBAT flows. 628 LEDBAT as described so far lacks a mechanism specifically designed to 629 equalize utilization amongst LEDBAT flows. Anecdotally observed 630 behavior of existing implementations indicates that a rough 631 equalization does occur since in most enviroments some amount of 632 randomness in the inter-packet transmission times exist, as explained 633 further below. 635 Delay-based congestion control systems suffer from the possibility of 636 late-comers incorrectly measuring and using a higher base-delay than 637 an active flow that started earlier. Suppose a LEDBAT flow is the 638 only flow on the bottleneck, which the flow saturates, steadily 639 maintaining the queueing delay at a target delay. When a new LEDBAT 640 flow arrives, it might incorrectly measure the current end-to-end 641 delay, including the queueing delay being maintained by the first 642 LEDBAT flow, as its base delay, and the incoming flow might now 643 effectively seek to build on top of the existing, already maximal 644 queueing delay. As the second flow builds up, the first flow sees 645 the true queueing delay and backs off, while the late-comer keeps 646 building up, using up the entire link's capacity; this advantage is 647 called the "late-comer's advantage". 649 In the worse case, if the first flow yields at the same rate as the 650 new flow increases its sending rate, the new flow will see constant 651 end-to-end delay, which it assumes is the base delay, until the first 652 flow backs off completely. As a result, by the time the second flow 653 stops increasing its cwnd, it would have added twice the target 654 queueing delay to the network. 656 This advantage can be reduced if the the first flow yields quickly 657 enough to empty the bottleneck queue faster than the incoming flow 658 increases its occupancy in the queue; as a result, the late-comer 659 might measure a delay closer to the base delay. While such a 660 reduction might be achieved through a multiplicative decrease of the 661 congestion window, this might cause stronger fluctuations in flow 662 throughput during steady state. Thus we do not recommend a 663 multiplicative decrease scheme. 665 We note that in certain use-case scenarios, it is possible for a 666 later LEDBAT flow to gain an unfair advantage over an existing one 667 [Car10]. In practice, this concern may be alleviated by the 668 burstiness of network traffic: all that's needed to measure the base 669 delay is one small gap in transmission schedules between the LEDBAT 670 flows. These gaps can occur for a number of reasons such as latency 671 introduced due to application sending patterns, OS scheduling at the 672 sender, processing delay at the sender or any network node, and link 673 contention. When such a gap occurs in the first sender's 674 transmission while the late-comer is starting, base delay is 675 immediately correctly measured. With a small number of LEDBAT flows, 676 system noise may sufficiently regulate the late-comer's advantage. 678 6. IANA Considerations 680 There are no IANA considerations for this document. 682 7. Security Considerations 684 A network on the path might choose to cause higher delay measurements 685 than the real queuing delay so that LEDBAT backs off even when 686 there's no congestion present. While shaping of traffic into an 687 artificially narrow bottleneck by increasing the queueing delay 688 cannot be trivially counteracted, a protocol using LEDBAT should seek 689 to minimize the risk of such an attack by authenticating the 690 timestamp and delay fields in the packets. 692 LEDBAT is not known to introduce any new concerns with privacy, 693 integrity, or other security issues for flows that use it. It should 694 be compatible with use of IPsec and TLS/DTLS. 696 8. Acknowledgements 698 We thank folks in the LEDBAT working group for their comments and 699 feedback. Special thanks to Murari Sridharan and Rolf Winter for 700 their patient and untiring shepherding. 702 9. References 703 9.1. Normative References 705 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 706 Requirement Levels", BCP 14, RFC 2119, March 1997. 708 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 709 of Explicit Congestion Notification (ECN) to IP", 710 RFC 3168, September 2001. 712 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 713 Discovery", RFC 4821, March 2007. 715 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 716 Control", RFC 5681, September 2009. 718 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 719 "Computing TCP's Retransmission Timer", RFC 6298, 720 June 2011. 722 9.2. Informative References 724 [Bra94] Brakmo, L., O'Malley, S., and L. Peterson, "TCP Vegas: New 725 techniques for congestion detection and avoidance", 726 Proceedings of SIGCOMM '94, pages 24-35, August 1994. 728 [Car10] Carofiglio, G., Muscariello, L., Rossi, D., Testa, C., and 729 S. Valenti, "Rethinking Low Extra Delay Background 730 Transport Protocols", arXiv:1010.5623v1, September 2010. 732 [Low02] Low, S., Peterson, L., and L. Wang, "Understanding TCP 733 Vegas: A Duality Model", JACM 49 (2), March 2002. 735 [Sch10] Schneider, J., Wagner, J., Winter, R., and H. Kolbe, "Out 736 of my Way -- Evaluating Low Extra Delay Background 737 Transport in an ADSL Access Network", Proceedings of 22nd 738 International Teletraffic Congress (ITC22), September 739 2010. 741 Appendix A. Timestamp errors 743 One-way delay measurement needs to deal with timestamp errors. We'll 744 use the same locally linear clock model and the same terminology as 745 Network Time Protocol (NTP). This model is valid for any 746 differentiable clocks. NTP uses the term "offset" to refer to 747 difference from true time and "skew" to refer to difference of clock 748 rate from the true rate. The clock will thus have a fixed offset 749 from the true time and a skew. We'll consider what we need to do 750 about the offset and the skew separately. 752 A.1. Clock offset 754 First, consider the case of zero skew. The offset of each of the two 755 clocks shows up as a fixed error in one-way delay measurement. The 756 difference of the offsets is the absolute error of the one-way delay 757 estimate. We won't use this estimate directly, however. We'll use 758 the difference between that and a base delay. Because the error 759 (difference of clock offsets) is the same for the current and base 760 delay, it cancels from the queuing delay estimate, which is what 761 we'll use. Clock offset is thus irrelevant to the design. 763 A.2. Clock skew 765 The clock skew manifests in a linearly changing error in the time 766 estimate. For a given pair of clocks, the difference in skews is the 767 skew of the one-way delay estimate. Unlike the offset, this no 768 longer cancels in the computation of the queuing delay estimate. On 769 the other hand, while the offset could be huge, with some clocks off 770 by minutes or even hours or more, the skew is typically small. For 771 example, NTP is designed to work with most clocks, yet it gives up 772 when the skew is more than 500 parts per million (PPM). Typical 773 skews of clocks that have never been trained seem to often be around 774 100-200 PPM. Previously trained clocks could have 10-20 PPM skew due 775 to temperature changes. A 100-PPM skew means accumulating 6 776 milliseconds of error per minute. The base delay updates mostly 777 takes care of clock skew unless the skew is unusually high or extreme 778 values have been chosen for TARGET and BASE_HISTORY so that the clock 779 skew in BASE_DELAY minutes is larger than the TARGET. 781 Clock skew can be in two directions: either the sender's clock is 782 faster than the receiver's, or vice versa. 784 If the sender's clock is faster the one-way delay measurement will 785 get more and more reduced by the clock drift over time. Whenever 786 there is no additional delay the base delay will be updated by a 787 smaller one-way delay value and the skew is compensated. If a 788 competing flow introduces additional queueing delay LEDBAT will 789 anyway get out of the way quickly and an overestimated one-way delay 790 will just speed-up the back-off. 792 When the receiver clock runs faster, the raw delay estimate will 793 drift up with time. This can suppress the throughput unnecessarily. 794 In this case a skew correction mechanim can be benefital. Further 795 condersiderations based on a deployed implementation and LEDBAT 796 specific preconditions are given in the next section. 798 A.3. Clock skew correction mechanism 800 The following paragraph describes the deployed clock skew correction 801 mechanism in the BitTorrent implementation for documentation purpose. 803 In the BitTorrent implementation, the receiver sends back raw 804 (sending and receiving) timestamps. Using this information, the 805 sender can estimate the one-way delays in both directions, and can 806 also compute and maintain an estimate of the base delay as would be 807 observed by the receiver. If the sender detects the receiver 808 reducing its base delay, it infers that this reduction is due to 809 clock drift. The sender can be compensated by increasing its base 810 delay by the same amount. To apply this mechanism however, 811 timestamps need to be transmitted in both directions. 813 The following considerations can be used for an alternative 814 implementation as a reference: 815 o Skew correction with faster virtual clock: 816 Since having a faster clock on the sender will continuousely 817 update the base delay, a faster virtual clock for sender 818 timestamping can be applied. This virual clock can be computed 819 from the default machine clock through a linear transformation. 820 E.g. with a 500 PPM speed-up the sender's clock is very likely to 821 be faster than any receiver's clock and thus LEDBAT will benefit 822 from the implicit correction when updating the base delay. 824 o Skew correction with estimating drift: 825 With LEDBAT the history of base delay minima is already kept for 826 each minute. This can provide a base to compute the clock skew 827 difference between the two hosts. The slope of a linear function 828 fitted to the set of minima base delays gives an estimate of the 829 clock skew. This estimation can be used to correct the clocks. 830 If the other endpoint is doing the same, the clock should be 831 corrected by half of the estimated skew amount. 833 o Byzantine skew correction: 834 When it is known that each host maintains long-lived connections 835 to a number of different other hosts, a byzantine scheme can be 836 used to estimate the skew with respect to the true time. Namely, 837 calculate the skew difference for each of the peer hosts as 838 described with the previous approach, then take the median of the 839 skew differences. While this scheme is not universally 840 applicable, it combines well with other schemes, since it is 841 essentially a clock training mechanism. The scheme also acts the 842 fastest, since the state is preserved between connections. 844 Authors' Addresses 846 Stanislav Shalunov 847 BitTorrent Inc 848 612 Howard St, Suite 400 849 San Francisco, CA 94105 850 USA 852 Email: shalunov@bittorrent.com 853 URI: http://shlang.com 855 Greg Hazel 856 BitTorrent Inc 857 612 Howard St, Suite 400 858 San Francisco, CA 94105 859 USA 861 Email: greg@bittorrent.com 863 Janardhan Iyengar 864 Franklin and Marshall College 865 415 Harrisburg Ave. 866 Lancaster, PA 17603 867 USA 869 Email: jiyengar@fandm.edu 871 Mirja Kuehlewind 872 University of Stuttgart 873 Stuttgart 874 DE 876 Email: mirja.kuehlewind@ikr.uni-stuttgart.de