idnits 2.17.1 draft-ietf-ledbat-congestion-07.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (July 11, 2011) is 4666 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Missing Reference: 'Schneider10' is mentioned on line 179, but not defined Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 LEDBAT WG S. Shalunov 3 Internet-Draft G. Hazel 4 Intended status: Experimental BitTorrent Inc 5 Expires: January 12, 2012 J. Iyengar 6 Franklin and Marshall College 7 M. Kuehlewind 8 University of Stuttgart 9 July 11, 2011 11 Low Extra Delay Background Transport (LEDBAT) 12 draft-ietf-ledbat-congestion-07.txt 14 Abstract 16 LEDBAT is an experimental delay-based congestion control algorithm 17 that attempts to utilize the available bandwidth on an end-to-end 18 path while limiting the consequent increase in queueing delay on the 19 path. LEDBAT uses changes in one-way delay measurements to limit 20 congestion that the flow itself induces in the network. LEDBAT is 21 designed for use by background bulk-transfer applications; it is 22 designed to be no more aggressive than TCP congestion control and to 23 yield in the presence of any competing flows when latency builds, 24 thus limiting interference with the network performance of the 25 competing flows. 27 Status of this Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at http://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on January 12, 2012. 44 Copyright Notice 46 Copyright (c) 2011 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Requirements notation . . . . . . . . . . . . . . . . . . . . 3 62 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 63 2.1. Design Goals . . . . . . . . . . . . . . . . . . . . . . . 3 64 2.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 4 65 3. LEDBAT Congestion Control . . . . . . . . . . . . . . . . . . 4 66 3.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 5 67 3.2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . 5 68 3.3. Receiver-Side Operation . . . . . . . . . . . . . . . . . 5 69 3.4. Sender-Side Operation . . . . . . . . . . . . . . . . . . 6 70 3.4.1. An Overview . . . . . . . . . . . . . . . . . . . . . 6 71 3.4.2. The Complete Sender Algorithm . . . . . . . . . . . . 6 72 3.5. Parameter Values . . . . . . . . . . . . . . . . . . . . . 8 73 4. Understanding LEDBAT Mechanisms . . . . . . . . . . . . . . . 10 74 4.1. Delay Estimation . . . . . . . . . . . . . . . . . . . . . 10 75 4.1.1. Estimating Base Delay . . . . . . . . . . . . . . . . 10 76 4.1.2. Estimating Queueing Delay . . . . . . . . . . . . . . 11 77 4.2. Managing the Congestion Window . . . . . . . . . . . . . . 11 78 4.2.1. Window Increase: Probing For More Bandwidth . . . . . 11 79 4.2.2. Window Decrease: Responding To Congestion . . . . . . 11 80 4.3. Choosing The Queuing Delay Target . . . . . . . . . . . . 12 81 5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 12 82 5.1. Framing and Ack Frequency Considerations . . . . . . . . . 12 83 5.2. Competing With TCP Flows . . . . . . . . . . . . . . . . . 12 84 5.3. Fairness Among LEDBAT Flows . . . . . . . . . . . . . . . 13 85 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 86 7. Security Considerations . . . . . . . . . . . . . . . . . . . 14 87 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 14 88 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 15 89 9.1. Normative References . . . . . . . . . . . . . . . . . . . 15 90 9.2. Informative References . . . . . . . . . . . . . . . . . . 15 91 Appendix A. Timestamp errors . . . . . . . . . . . . . . . . . . 15 92 A.1. Clock offset . . . . . . . . . . . . . . . . . . . . . . . 15 93 A.2. Clock skew . . . . . . . . . . . . . . . . . . . . . . . . 16 94 A.3. Clock skew correction mechanism . . . . . . . . . . . . . 16 95 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 17 97 1. Requirements notation 99 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 100 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 101 document are to be interpreted as described in [RFC2119]. 103 2. Introduction 105 TCP congestion control [RFC5681] seeks to share bandwidth at a 106 bottleneck link equitably among flows competing at the bottleneck, 107 and it is the predominant congestion control mechanism used on the 108 Internet. Not all applications seek an equitable share of network 109 throughput, however "background" applications, such as software 110 updates or file-sharing applications, seek to operate without 111 interfering with the performance of more interactive and delay- 112 and/or bandwidth-sensitive "foreground" applications and standard TCP 113 may be too aggressive for use with such background applications. 115 LEDBAT is an experimental delay-based congestion control mechanism 116 that reacts early to congestion in the network, thus enabling 117 "background" applications to use the network while avoidoing 118 interference with the network performance of competing flows. A 119 LEDBAT sender uses one-way delay measurements to estimate the amount 120 of queueing on the data path, controls the LEDBAT flow's congestion 121 window based on this estimate, and minimizes interference with 122 competing flows when latency builds by adding low extra queueing 123 delay on the end-to-end path. 125 Delay-based congestion control protocols, such as TCP-Vegas 126 [Bra94][Low02], are generally designed to achieve more, not less 127 throughput than standard TCP, and often outperform TCP under 128 particular network settings. In contrast, LEDBAT is designed to be 129 no more aggressive than TCP; LEDBAT is a "scavenger" congestion 130 control mechanism that seeks to utilize all available bandwidth and 131 yields quickly when competing with standard TCP at a bottleneck link. 133 2.1. Design Goals 135 LEDBAT congestion control seeks to: 136 1. utilize end-to-end available bandwidth, and maintain low queueing 137 delay when no other traffic is present, 138 2. add little to the queuing delay induced by concurrent flows, 139 3. quickly yield to flows using standard TCP congestion control that 140 share the same bottleneck link, 142 2.2. Applicability 144 LEDBAT is a "scavenger" congestion control mechanism that is 145 primarily motivated by background bulk-transfer applications, such as 146 large file transfers (as with file-sharing applications) and software 147 updates. It can be used with any application that seeks to minimize 148 its impact on the network and on other interactive delay- and/or 149 bandwidth-sensitive network applications. LEDBAT is expected to work 150 well when the sender and/or receiver is connected via a residential 151 access network. 153 LEDBAT can be used as part of a transport protocol or as part of an 154 application, as long as the data transmission mechanisms are capable 155 of carrying timestamps and acknowledging data frequently. LEDBAT can 156 be used, with appropriate extensions where necessary, with TCP, SCTP, 157 and DCCP, and with proprietary application protocols such as those 158 built on top of UDP for P2P applications. 160 When used with an ECN-capable framing protocol, LEDBAT should react 161 to an ECN mark as it would to a loss, as specified in [RFC3168]. 163 LEDBAT is designed to reduce build-up of a standing queue by long- 164 lived LEDBAT flows at a link with a tail-drop FIFO queue, so as to 165 avoid persistently delaying other flows sharing the queue. If Active 166 Queue Management (AQM) is configured to drop or ECN-mark packets 167 before the LEDBAT starts reacting to persistent queue build-up, 168 LEDBAT reverts to standard TCP behavior, rather than yield to other 169 TCP flows. However, such an AQM is still desirable since it keeps 170 queuing delay low, achieving an outcome that is in line with LEDBAT's 171 goals. Additionally, a LEDBAT transport that supports ECN enjoys the 172 advantages that an ECN-capable TCP enjoys over an ECN-agnostic TCP; 173 avoiding losses and possible retransmissions. Weighted Fair Queuing 174 (WFQ), as employed by some home gateways [Schneider10], seeks to 175 isolate and protect delay-sensitive flows from delays due to standing 176 queues built up by concurrent long-lived flows. Consequently, while 177 it prevents LEDBAT from yielding to other TCP flows, it again 178 achieves an outcome that is in line with LEDBAT's goals 179 [Schneider10]. 181 Further study is required to fully understand the behaviour of LEDBAT 182 with non-drop-tail, non-FIFO queues. 184 3. LEDBAT Congestion Control 185 3.1. Overview 187 A standard TCP sender increases its congestion window until a loss 188 occurs [RFC5681] or an ECN mark is received [RFC3168], which, in the 189 absence of any Active Queue Management (AQM) and link errors in the 190 network, occurs only when the queue at the bottleneck link on the 191 end-to-end path overflows. Since packet loss or marking at the 192 bottleneck link is expected to be preceded by an increase in the 193 queueing delay at the bottleneck link, LEDBAT congestion control uses 194 this increase in queueing delay as an early signal of congestion, 195 enabling it to respond to congestion earlier than standard TCP, and 196 enabling it to yield bandwidth to a competing TCP flow. 198 LEDBAT employs one-way delay measurements to estimate queueing delay. 199 When the estimated queueing delay is less than a pre-determined 200 target, LEDBAT infers that the network is not yet congested, and 201 increases its sending rate to utilize any spare capacity in the 202 network. When the estimated queueing delay becomes greater than a 203 pre-determined target, LEDBAT decreases its sending rate quickly as a 204 response to potential congestion in the network. 206 3.2. Preliminaries 208 A LEDBAT sender uses a congestion window (cwnd) to gate the amount of 209 data that the sender can send into the network in one roundtrip time 210 (RTT). A sender MAY maintain its cwnd in bytes or in packets; this 211 document uses cwnd in bytes. LEDBAT requires that each data segment 212 carries a "timestamp" from the sender, based on which the receiver 213 computes the one-way delay from the sender, and sends this computed 214 value back to the sender. 216 In addition to the LEDBAT mechanism described below, we note that a 217 slow start mechanism can be used as specified in [RFC5681]. Since 218 slow start leads to faster increase in the window than that specified 219 in LEDBAT, conservative congestion control implementations employing 220 LEDBAT may skip slow start altogether and start with an initial 221 window of INIT_CWND * MSS. (INIT_CWND is described later in 222 Section 3.5.) 224 The term "MSS", or the sender's Maximum Segment Size, used in this 225 document refers to the size of the largest segment that the sender 226 can transmit. The value of MSS can be based on the path MTU 227 discovery [RFC4821] algorithm and/or on other factors. 229 3.3. Receiver-Side Operation 231 A LEDBAT receiver operates as follows: 233 on data_packet: 234 remote_timestamp = data_packet.timestamp 235 acknowledgement.delay = local_timestamp() - remote_timestamp 236 # fill in other fields of acknowledgement 237 acknowlegement.send() 239 A receiver MAY send more than one delay sample in an acknowledgment. 240 For instance, a receiver that delays acknowledgments, i.e., sends an 241 acknowledgment less frequently than once per data packet, MAY send 242 all the one-way delay samples that it gathers in one acknowledgment. 244 3.4. Sender-Side Operation 246 3.4.1. An Overview 248 As a first approximation, a LEDAT sender operates as shown below; the 249 complete algorithm is specified later in Section 3.4.2. TARGET is 250 the maximum queueing delay that LEDBAT itself can introduce in the 251 network, and GAIN determines the rate at which the cwnd responds to 252 changes in queueing delay; both constants are specified later. Since 253 off_target can be positive or negative, the cwnd increases or 254 decreases in proportion to off_target. 256 on initialization: 257 base_delay = +INFINITY 259 on acknowledgement: 260 current_delay = acknowledgement.delay 261 base_delay = min(base_delay, current_delay) 262 queuing_delay = current_delay - base_delay 263 off_target = (TARGET - queuing_delay) / TARGET 264 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 266 The simplified mechanism above ignores multiple delay samples in an 267 acknowledgment, noise filtering, base delay expiration, and sender 268 idle times, which we now take into account in our complete sender 269 algorithm below. 271 3.4.2. The Complete Sender Algorithm 273 update_current_delay() maintains a list of one-way delay 274 measurements, of which a filtered value is used as an estimate of the 275 current end-to-end delay. update_base_delay() maintains a list of 276 one-way delay minima over a number of one-minute intervals, to 277 measure and to track changes in the base delay of the end-to-end 278 path. 280 This algorithm restricts cwnd growth after a period of inactivity, 281 where the cwnd is clamped down to a little more than flightsize using 282 max_allowed_cwnd. To be TCP-friendly on data loss, LEDBAT halves its 283 cwnd. The full sender-side algorithm is given below: 285 on initialization: 286 # cwnd is the amount of data that is allowed to send in one RTT and 287 # is defined in bytes. 289 create current_delays list with CURRENT_FILTER elements 290 create base_delays list with BASE_HISTORY number of elements 291 initialize elements in current_delays and base_delays to +INFINITY 292 last_rollover = -INFINITY # More than a minute in the past 293 flightsize = 0 294 cwnd = INIT_CWND * MSS 296 on acknowledgment: 297 # flightsize is the amount of data outstanding before this ack 298 # was received and is updated later by update_flightsize(); 299 # bytes_newly_acked is the number of bytes that this ack 300 # newly acknowledges, and it MAY be set to MSS. 302 for each delay sample in the acknowledgment: 303 delay = acknowledgement.delay 304 update_base_delay(delay) 305 update_current_delay(delay) 306 queuing_delay = FILTER(current_delays) - MIN(base_delays) 307 off_target = (TARGET - queuing_delay) / TARGET 308 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 309 max_allowed_cwnd = flightsize + ALLOWED_INCREASE * MSS 310 cwnd = min(cwnd, max_allowed_cwnd) 311 cwnd = max(cwnd, MIN_CWND * MSS) 312 update_flightsize() 314 on data loss: 315 # atmost once per RTT 316 cwnd = max(cwnd/2, MIN_CWND * MSS) 318 update_flightsize() 319 #subtracts bytes_newly_acked from flightsize 320 flightsize = flightsize - bytes_newly_acked 322 update_current_delay(delay) 323 # Maintain a list of CURRENT_FILTER last delays observed. 324 delete first item in current_delays list 325 append delay to current_delays list 327 update_base_delay(delay) 328 # Maintain BASE_HISTORY delay-minima. 329 # Each minimum is measured over a period of a minute. 330 # 'now' is the current system time 331 if round_to_minute(now) != round_to_minute(last_rollover) 332 last_rollover = now 333 delete first item in base_delays list 334 append delay to base_delays list 335 else 336 base_delays.tail = MIN(base_delays.tail, delay) 338 The LEDBAT sender ensures that any outliers in the current_delay 339 samples are eliminated by implementing FILTER() to extract the actual 340 delay estimate from the current_delay samples. An example of such a 341 filter is the Exponentially-Weighted Moving Average (EWMA) function. 343 To implement an approximate minimum over the past few minutes, a 344 LEDBAT sender stores BASE_HISTORY separate minima---one each for the 345 last BASE_HISTORY-1 minutes, and one for the running current minute. 346 At the end of the current minute, the window moves---the earliest 347 minimum is dropped and the latest minimum is added. If the 348 connection is idle for a given minute, no data is available for the 349 one-way delay and, therefore, a value of +INFINITY has to be stored 350 in the list. If the connection has been idle for BASE_HISTORY 351 minutes, all minima in the list are thus set to +INFINITY and 352 measurement begins anew. LEDBAT thus requires that during idle 353 periods, an implementation must maintain the base delay list. 355 We note that LEDBAT assumes random fluctuations in inter-packet 356 transmission times. That will help to measure the correct base delay 357 beacuse the the bottleneck runs empty from time to time; see section 358 Section 5.3 for a discussion. 360 3.5. Parameter Values 362 TARGET MUST be 100 milliseconds or less, and this choice of value is 363 explained further in Section 4.3. Note that using the same TARGET 364 value across LEDBAT flows enables equitable sharing of the bottleneck 365 bandwidth---flows with a higher TARGET may get a larger share of the 366 bottleneck bandwidth. It is possible to consider the use of 367 different TARGET values for implementing a relative priority between 368 two competing LEDBAT flows by setting a higher TARGET value for the 369 higher-priority flow. 371 ALLOWED_INCREASE SHOULD be 1, and it MUST be greater than 0. An 372 ALLOWED_INCREASE of 0 results in no cwnd growth at all, and an 373 ALLOWED_INCREASE of 1 allows and limits the cwnd increase based on 374 flightsize in the previous RTT. An ALLOWED_INCREASE greater than 1 375 MAY be used when interactions between LEDBAT and the framing protocol 376 provide a clear reason for doing so. 378 GAIN MUST be set to 1 or less. A GAIN of 1 limits the maximum cwnd 379 ramp-up to the same rate as TCP Reno in Congestion Avoidance. While 380 this document specifies the use of the same GAIN for both cwnd 381 increase (when off_target is greater than zero) and decrease (when 382 off_target is less than zero), implementations MAY use a higher GAIN 383 for cwnd decrease than for the increase; our justification follows. 384 When a competing non-LEDBAT flow increases its sending rate, the 385 LEDBAT sender may only measure a small amount of additional delay and 386 decrease the sending rate slowly. To ensure no impact on a competing 387 non-LEDBAT flow, the LEDBAT flow should decrease its sending rate at 388 least as quickly as the competing flow increases its sending rate. A 389 higher decrease GAIN MAY be used to allow the LEDBAT flow to decrease 390 its sending rate faster than the competing flow's increase rate. 392 The size of the base_delays list, BASE_HISTORY, SHOULD be 10. If the 393 actual base delay decreases, due to a route change for instance, a 394 LEDBAT sender adapts immediately, irrespective of the value of 395 BASE_HISTORY. If the actual base delay increases however, a LEDBAT 396 sender will take BASE_HISTORY minutes to adapt and may wrongly infer 397 a little more extra delay than intended (TARGET) in the meanwhile. A 398 value for BASE_HISTORY is thus a tradeoff: a higher value may yield a 399 more accurate measurement when the base delay is unchanging, and a 400 lower value results in a quicker response to actual increase in base 401 delay. 403 A LEDBAT sender uses the current_delays list to maintain only delay 404 measurements made within a RTT amount of time in the past, seeking to 405 eliminate noise spikes in its measurement of the current one-way 406 delay through the network. The size of this list, CURRENT_FILTER, 407 may be variable, and depends on the number of successful measurements 408 made within a RTT amount of time in the past. The sender should seek 409 to gather enough delay samples in each RTT so as to have statistical 410 confidence in the measurements. While the number of delay samples 411 required for such confidence will vary depending on network 412 conditions, we recommend that the sender SHOULD use at least 4 413 samples in each RTT. Thus, CURRENT_FILTER SHOULD be at least 4, and 414 limited such that no samples in list are older than an RTT in the 415 past. 417 MIN_CWND SHOULD be 2, and it MUST be at least 1. INIT_CWND SHOULD be 418 2, and it MUST be at least 1. The choice of MIN_CWND and INIT_CWND 419 are strongly connected to the framing protocol; a larger MIN_CWND 420 and/or INIT_CWND MAY be used if the framing protocol allows it. For 421 instance, TCP senders may use a larger INIT_CWND as specified in 422 [RFC3390]. 424 4. Understanding LEDBAT Mechanisms 426 This section describes the delay estimation and window management 427 mechanisms used in LEDBAT. 429 4.1. Delay Estimation 431 LEDBAT estimates congestion in the direction of the data flow, and to 432 avoid measuring additional delay from e.g. queue build-up on the 433 reverse path (or ack path) or reordering, LEDBAT uses one-way delay 434 estimates. LEDBAT assumes measurements are done with data packets, 435 thus avoiding the need for separate measurement packets and avoiding 436 the pitfall of measurement packets being treated differently from the 437 data packets in the network. 439 End-to-end delay can be decomposed into transmission (or 440 serialization) delay, propagation (or speed-of-light) delay, queueing 441 delay, and processing delay. On any given path, barring some noise, 442 all delay components except for queueing delay are constant. To 443 observe an increase in the queueing delay in the network, a LEDBAT 444 sender separates the queueing delay component from the rest of the 445 end-to-end delay, as described below. 447 4.1.1. Estimating Base Delay 449 Since queuing delay is always additive to the end-to-end delay, 450 LEDBAT estimates the sum of the constant delay components, which we 451 call "base delay", to be the minimum delay observed on the end-to-end 452 path. 454 To respond to true changes in the base delay, as can be caused by a 455 route change, LEDBAT uses only recent measurements in estimating the 456 base delay. The duration of the observation window itself is a 457 tradeoff between robustness of measurement and responsiveness to 458 change---a larger observation window increases the chances that the 459 true base delay will be detected (as long as the true base delay is 460 unchanged), whereas a smaller observation window results in faster 461 response to true changes in the base delay. 463 4.1.2. Estimating Queueing Delay 465 Given that the base delay is constant, the queueing delay is 466 represented by the variable component of the measured end-to-end 467 delay. LEDBAT measures queueing delay as simply the difference 468 between an end-to-end delay measurement and the current estimate of 469 base delay. The queueing delay should be filtered (depending on the 470 usage scenario) to eliminate noise in the delay estimation, such as 471 due to spikes in processing delay at a node on the path. 473 4.2. Managing the Congestion Window 475 4.2.1. Window Increase: Probing For More Bandwidth 477 A LEDBAT sender increases its congestion window if the queuing delay 478 is smaller than a target value, proportionally to the relative 479 difference between the current queueing delay and the delay target. 480 To be friendly to competing TCP flows, we set this highest rate of 481 window growth to be the same as TCP's. In other words, a LEDBAT flow 482 thus never ramps up faster than a competing TCP flow over the same 483 path. As closer the extra delay gets to the TARGET value, as slower 484 LEDBAT will increase the window. 486 4.2.2. Window Decrease: Responding To Congestion 488 When the sender's queueing delay estimate is higher than the target, 489 the LEDBAT flow's rate should be reduced. LEDBAT uses a simple 490 linear controller to detemine sending rate as a function of the delay 491 estimate, where the response is proportional to the difference 492 between the current queueing delay estimate and the target. This 493 allows to decrease the window only slightly while probing and leads 494 to a quite stable state with high link utilization. In limited 495 experiments with Bittorrent nodes, this controller seems to work 496 well. 498 Unlike TCP-like loss-based congestion control, LEDBAT seeks to avoid 499 losses and so a LEDBAT sender is not expected to normally rely on 500 losses to determine the sending rate. However, when data loss does 501 occur, LEDBAT must respond as standard TCP does; even if the queueing 502 delay estimates indicate otherwise, a loss is assumed to be a strong 503 indication of congestion. Thus, to deal with severe congestion when 504 packets are dropped in the network, and to provide a fallback against 505 incorrect queuing delay estimates, a LEDBAT sender halves its 506 congestion window when a loss event is detected. As with TCP New- 507 Reno, LEDBAT reduces its cwnd by half at most once per RTT. 509 4.3. Choosing The Queuing Delay Target 511 The queueing delay target is a tradeoff. A target that is too low 512 might result in under-utilization of the bottleneck link, because of 513 the noise in the delay measurement e.g in a mobile scenario, and may 514 also be more sensitive to error in the measured delay. The 515 International Telecommunication Union's (ITU's) Recommendation G.114 516 defines a delay of 150 ms to be acceptable for most user voice 517 applications. Thus the extra delay induced by LEDBAT must be below 518 150 ms to reduce impact on delay-sentive applications. If the TARGET 519 value is larger than the maximum delay the queue can induce, LEBAT 520 will fallback to the same behavior than standard TCP (see section 521 Section 5.2). 523 Our recommendation of 100 ms or less as the target is based on these 524 considerations. Anecdotal evidence indicates that this value works 525 well: LEDBAT has been implemented and successfully deployed with a 526 target value of 100 ms in two Bittorrent implementations---BitTorrent 527 DNA as the exclusive congestion control mechanism and in uTorrent as 528 an experimental mechanism. 530 5. Discussion 532 5.1. Framing and Ack Frequency Considerations 534 While the actual framing and wire format of the protocols using 535 LEDBAT are outside the scope of this document, we briefly consider 536 the data framing and ack frequency needs of LEDBAT mechanisms. 538 To compute the data path's one-way delay, our discussion of LEDBAT 539 assumes a framing that allows the sender to timestamp packets and for 540 the receiver to convey the measured one-way delay back to the sender 541 in ack packets. LEDBAT does not require this particular method, but 542 it does require unambiguous delay estimates using data and ack 543 packets. 545 A LEDBAT receiver may send an ack as frequently as one for every data 546 packet received or less frequently; LEDBAT does require that the 547 receiver MUST transmit at least one ack in every RTT. 549 5.2. Competing With TCP Flows 551 LEDBAT is designed to respond to congestion indications earlier than 552 loss-based TCP. A LEDBAT flow is more aggressive when the queueing 553 delay estimate is lower; since the queueing delay estimate is non- 554 negative, LEDBAT is most aggressive when its queuing delay estimate 555 is zero. In this case, LEDBAT ramps up its congestion window at the 556 same rate as TCP does. LEDBAT reduces its rate earlier than TCP 557 does, always halving the congestion window on loss. Thus, in the 558 worst case where the delay estimates are completely and consistently 559 off, a LEDBAT flow falls back to TCP mechanisms as it will be at most 560 as aggressive as a TCP flow and halves on loss. 562 5.3. Fairness Among LEDBAT Flows 564 The primary design goals of LEDBAT are focussed on the aggregate 565 behavior of LEDBAT flows when they compete with standard TCP. Since 566 LEDBAT is designed for background traffic, we consider link 567 utilization to be more important than fairness amongst LEDBAT flows. 568 Nevertheless, we now consider fairness issues that might arise 569 amongst competing LEDBAT flows. 571 LEDBAT as described so far lacks a mechanism specifically designed to 572 equalize utilization amongst LEDBAT flows. Anecdotally observed 573 behavior of existing implementations indicates that a rough 574 equalization does occur since in most enviroments some amount of 575 randomness in the inter-packet transmission times exist, as explained 576 further below. 578 Delay-based congestion control systems suffer from the possibility of 579 late-comers incorrectly measuring and using a higher base-delay than 580 an active flow that started earlier. Suppose a LEDBAT flow is the 581 only flow on the bottleneck, which the flow saturates, steadily 582 maintaining the queueing delay at a target delay. When a new LEDBAT 583 flow arrives, it might incorrectly measure the current end-to-end 584 delay, including the queueing delay being maintained by the first 585 LEDBAT flow, as its base delay, and the incoming flow might now 586 effectively seek to build on top of the existing, already maximal 587 queueing delay. As the second flow builds up, the first flow sees 588 the true queueing delay and backs off, while the late-comer keeps 589 building up, using up the entire link's capacity; this advantage is 590 called the "late-comer's advantage". 592 In the worse case, if the first flow yields at the same rate as the 593 new flow increases its sending rate, the new flow will see constant 594 end-to-end delay, which it assumes is the base delay, until the first 595 flow backs off completely. As a result, by the time the second flow 596 stops increasing its cwnd, it would have added twice the target 597 queueing delay to the network. 599 This advantage can be reduced if the the first flow yields quickly 600 enough to empty the bottleneck queue faster than the incoming flow 601 increases its occupancy in the queue; as a result, the late-comer 602 might measure a delay closer to the base delay. While such a 603 reduction might be achieved through a multiplicative decrease of the 604 congestion window, this might cause stronger fluctuations in flow 605 throughput during steady state. Thus we do not recommend a 606 multiplicative decrease scheme. 608 We note that in certain use-case scenarios, it is possible for a 609 later LEDBAT flow to gain an unfair advantage over an existing one 610 [Carofiglio10]. In practice, this concern may be alleviated by the 611 burstiness of network traffic: all that's needed to measure the base 612 delay is one small gap in transmission schedules between the LEDBAT 613 flows. These gaps can occur for a number of reasons such as latency 614 introduced due to application sending patterns, OS scheduling at the 615 sender, processing delay at the sender or any network node, and link 616 contention. When such a gap occurs in the first sender's 617 transmission while the late-comer is starting, base delay is 618 immediately correctly measured. With a small number of LEDBAT flows, 619 system noise may sufficiently regulate the late-comer's advantage. 621 6. IANA Considerations 623 There are no IANA considerations for this document. 625 7. Security Considerations 627 A network on the path might choose to cause higher delay measurements 628 than the real queuing delay so that LEDBAT backs off even when 629 there's no congestion present. While shaping of traffic into an 630 artificially narrow bottleneck by increasing the queueing delay 631 cannot be trivially counteracted, a protocol using LEDBAT should seek 632 to minimize the risk of such an attack by authenticating the 633 timestamp and delay fields in the packets. 635 LEDBAT is not known to introduce any new concerns with privacy, 636 integrity, or other security issues for flows that use it. It should 637 be compatible with use of IPsec and TLS/DTLS. 639 8. Acknowledgements 641 We thank folks in the LEDBAT working group for their comments and 642 feedback. Special thanks to Murari Sridharan and Rolf Winter for 643 their patient and untiring shepherding. 645 9. References 646 9.1. Normative References 648 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 649 Requirement Levels", BCP 14, RFC 2119, March 1997. 651 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 652 of Explicit Congestion Notification (ECN) to IP", 653 RFC 3168, September 2001. 655 [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's 656 Initial Window", RFC 3390, October 2002. 658 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 659 Discovery", RFC 4821, March 2007. 661 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 662 Control", RFC 5681, September 2009. 664 9.2. Informative References 666 [Bra94] Brakmo, L., O'Malley, S., and L. Peterson, "TCP Vegas: New 667 techniques for congestion detection and avoidance", 668 Proceedings of SIGCOMM '94, pages 24-35, August 1994. 670 [Carofiglio10] 671 Carofiglio, G., Muscariello, L., Rossi, D., and S. 672 Valenti, "The quest for LEDBAT fairness", IEEE 673 Globecom'10, December 2010. 675 [Low02] Low, S., Peterson, L., and L. Wang, "Understanding TCP 676 Vegas: A Duality Model", JACM 49 (2), March 2002. 678 Appendix A. Timestamp errors 680 One-way delay measurement needs to deal with timestamp errors. We'll 681 use the same locally linear clock model and the same terminology as 682 Network Time Protocol (NTP). This model is valid for any 683 differentiable clocks. NTP uses the term "offset" to refer to 684 difference from true time and "skew" to refer to difference of clock 685 rate from the true rate. The clock will thus have a fixed offset 686 from the true time and a skew. We'll consider what we need to do 687 about the offset and the skew separately. 689 A.1. Clock offset 691 First, consider the case of zero skew. The offset of each of the two 692 clocks shows up as a fixed error in one-way delay measurement. The 693 difference of the offsets is the absolute error of the one-way delay 694 estimate. We won't use this estimate directly, however. We'll use 695 the difference between that and a base delay. Because the error 696 (difference of clock offsets) is the same for the current and base 697 delay, it cancels from the queuing delay estimate, which is what 698 we'll use. Clock offset is thus irrelevant to the design. 700 A.2. Clock skew 702 The clock skew manifests in a linearly changing error in the time 703 estimate. For a given pair of clocks, the difference in skews is the 704 skew of the one-way delay estimate. Unlike the offset, this no 705 longer cancels in the computation of the queuing delay estimate. On 706 the other hand, while the offset could be huge, with some clocks off 707 by minutes or even hours or more, the skew is typically small. For 708 example, NTP is designed to work with most clocks, yet it gives up 709 when the skew is more than 500 parts per million (PPM). Typical 710 skews of clocks that have never been trained seem to often be around 711 100-200 PPM. Previously trained clocks could have 10-20 PPM skew due 712 to temperature changes. A 100-PPM skew means accumulating 6 713 milliseconds of error per minute. The base delay updates mostly 714 takes care of clock skew unless the skew is unusually high or extreme 715 values have been chosen for TARGET and BASE_HISTORY so that the clock 716 skew in BASE_DELAY minutes is larger than the TARGET. 718 Clock skew can be in two directions: either the sender's clock is 719 faster than the receiver's, or vice versa. 721 If the sender's clock is faster the one-way delay measurement will 722 get more and more reduced by the clock drift over time. Whenever 723 there is no additional delay the base delay will be updated by a 724 smaller one-way delay value and the skew is compensated. If a 725 competing flow introduces additional queueing delay LEDBAT will 726 anyway get out of the way quickly and an overestimated one-way delay 727 will just speed-up the back-off. 729 When the receiver clock runs faster, the raw delay estimate will 730 drift up with time. This can suppress the throughput unnecessarily. 731 In this case a skew correction mechanim can be benefital. Further 732 condersiderations based on a deployed implementation and LEDBAT 733 specific preconditions are given in the next section. 735 A.3. Clock skew correction mechanism 737 The following paragraph describes the deployed clock skew correction 738 mechanism in the BitTorrent implementation for documentation purpose. 740 In the BitTorrent implemtation the receiver sends back the raw 741 (sending and receiving) timestamps. Based on this imfomation and the 742 system time at feedback receiption the sender can estimated the one- 743 way delay in both directions. Thus the sender can run the same base 744 delay calculation algorithm it runs for itself for the receiver as 745 well. If the sender can detect the receiver reducing its base delay, 746 it can infer that this is due to clock drift. The sender can be 747 compensated for by increasing the it's base delay by the same amount. 748 To apply this mechanism symmetrical framing is need (i.e., same 749 information about delays transmitted in both way). 751 The following considerations can be used for an alternative 752 implementation as a reference: 753 o Skew correction with faster virtual clock: 754 Since having a faster clock on the sender will continuousely 755 update the base delay, a faster virtual clock for sender 756 timestamping can be applied. This virual clock can be computed 757 from the default machine clock through a linear transformation. 758 E.g. with a 500 PPM speed-up the sender's clock is very likely to 759 be faster than any receiver's clock and thus LEDBAT will benefit 760 from the implicit correction when updating the base delay. 762 o Skew correction with estimating drift: 763 With LEDBAT the history of base delay minima is already kept for 764 each minute. This can provide a base to compute the clock skew 765 difference between the two hosts. The slope of a linear function 766 fitted to the set of minima base delays gives an estimate of the 767 clock skew. This estimation can be used to correct the clocks. 768 If the other endpoint is doing the same, the clock should be 769 corrected by half of the estimated skew amount. 771 o Byzantine skew correction: 772 When it is known that each host maintains long-lived connections 773 to a number of different other hosts, a byzantine scheme can be 774 used to estimate the skew with respect to the true time. Namely, 775 calculate the skew difference for each of the peer hosts as 776 described with the previous approach, then take the median of the 777 skew differences. While this scheme is not universally 778 applicable, it combines well with other schemes, since it is 779 essentially a clock training mechanism. The scheme also acts the 780 fastest, since the state is preserved between connections. 782 Authors' Addresses 784 Stanislav Shalunov 785 BitTorrent Inc 786 612 Howard St, Suite 400 787 San Francisco, CA 94105 788 USA 790 Email: shalunov@bittorrent.com 791 URI: http://shlang.com 793 Greg Hazel 794 BitTorrent Inc 795 612 Howard St, Suite 400 796 San Francisco, CA 94105 797 USA 799 Email: greg@bittorrent.com 801 Janardhan Iyengar 802 Franklin and Marshall College 803 415 Harrisburg Ave. 804 Lancaster, PA 17603 805 USA 807 Email: jiyengar@fandm.edu 809 Mirja Kuehlewind 810 University of Stuttgart 811 Stuttgart 812 DE 814 Email: mirja.kuehlewind@ikr.uni-stuttgart.de