idnits 2.17.1 draft-ietf-ledbat-congestion-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (May 31, 2011) is 4713 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Experimental ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 LEDBAT WG S. Shalunov 3 Internet-Draft G. Hazel 4 Intended status: Experimental BitTorrent Inc 5 Expires: December 2, 2011 J. Iyengar 6 Franklin and Marshall College 7 M. Kuehlewind 8 University of Stuttgart 9 May 31, 2011 11 Low Extra Delay Background Transport (LEDBAT) 12 draft-ietf-ledbat-congestion-06.txt 14 Abstract 16 LEDBAT is an experimental delay-based congestion control algorithm 17 that attempts to utilize the available bandwidth on an end-to-end 18 path while limiting the consequent increase in queueing delay on the 19 path. LEDBAT uses changes in one-way delay measurements to limit 20 congestion that the flow itself induces in the network. LEDBAT is 21 designed for use by background bulk-transfer applications; it is 22 designed to be no more aggressive than TCP congestion control and to 23 yield in the presence of any competing TCP flows, thus limiting 24 interference with the network performance of the competing flows. 26 Status of this Memo 28 This Internet-Draft is submitted in full conformance with the 29 provisions of BCP 78 and BCP 79. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF). Note that other groups may also distribute 33 working documents as Internet-Drafts. The list of current Internet- 34 Drafts is at http://datatracker.ietf.org/drafts/current/. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 This Internet-Draft will expire on December 2, 2011. 43 Copyright Notice 45 Copyright (c) 2011 IETF Trust and the persons identified as the 46 document authors. All rights reserved. 48 This document is subject to BCP 78 and the IETF Trust's Legal 49 Provisions Relating to IETF Documents 50 (http://trustee.ietf.org/license-info) in effect on the date of 51 publication of this document. Please review these documents 52 carefully, as they describe your rights and restrictions with respect 53 to this document. Code Components extracted from this document must 54 include Simplified BSD License text as described in Section 4.e of 55 the Trust Legal Provisions and are provided without warranty as 56 described in the Simplified BSD License. 58 Table of Contents 60 1. Requirements notation . . . . . . . . . . . . . . . . . . . . 3 61 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 62 2.1. Design Goals . . . . . . . . . . . . . . . . . . . . . . . 3 63 2.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 4 64 3. LEDBAT Congestion Control . . . . . . . . . . . . . . . . . . 4 65 3.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 4 66 3.2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . 5 67 3.3. Receiver-Side Operation . . . . . . . . . . . . . . . . . 5 68 3.4. Sender-Side Operation . . . . . . . . . . . . . . . . . . 5 69 3.4.1. An Overview . . . . . . . . . . . . . . . . . . . . . 5 70 3.4.2. The Complete Sender Algorithm . . . . . . . . . . . . 6 71 3.5. Parameter Values . . . . . . . . . . . . . . . . . . . . . 8 72 4. Understanding LEDBAT Mechanisms . . . . . . . . . . . . . . . 9 73 4.1. Delay Estimation . . . . . . . . . . . . . . . . . . . . . 9 74 4.1.1. Estimating Base Delay . . . . . . . . . . . . . . . . 10 75 4.1.2. Estimating Queueing Delay . . . . . . . . . . . . . . 10 76 4.2. Managing the Congestion Window . . . . . . . . . . . . . . 10 77 4.2.1. Window Increase: Probing For More Bandwidth . . . . . 10 78 4.2.2. Window Decrease: Responding To Congestion . . . . . . 11 79 4.3. Choosing The Queuing Delay Target . . . . . . . . . . . . 11 80 5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 11 81 5.1. Framing and Ack Frequency Considerations . . . . . . . . . 11 82 5.2. Competing With TCP Flows . . . . . . . . . . . . . . . . . 12 83 5.3. Fairness Among LEDBAT Flows . . . . . . . . . . . . . . . 12 84 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13 85 7. Security Considerations . . . . . . . . . . . . . . . . . . . 13 86 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 14 87 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 14 88 9.1. Normative References . . . . . . . . . . . . . . . . . . . 14 89 9.2. Informative References . . . . . . . . . . . . . . . . . . 14 90 Appendix A. Timestamp errors . . . . . . . . . . . . . . . . . . 14 91 A.1. Clock offset . . . . . . . . . . . . . . . . . . . . . . . 14 92 A.2. Clock skew . . . . . . . . . . . . . . . . . . . . . . . . 15 93 A.3. Clock skew correction mechanism . . . . . . . . . . . . . 15 94 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 16 96 1. Requirements notation 98 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 99 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 100 document are to be interpreted as described in [RFC2119]. 102 2. Introduction 104 TCP congestion control [RFC5681] seeks to share bandwidth at a 105 bottleneck link equitably among flows competing at the bottleneck, 106 and it is the predominant congestion control mechanism used on the 107 Internet. Not all applications seek an equitable share of network 108 throughput, however---"background" applications, such as software 109 updates or file-sharing applications, seek to operate without 110 interfering with the performance of more interactive and delay- 111 and/or bandwidth-sensitive "foreground" applications---and standard 112 TCP may be too aggressive for use with such background applications. 114 LEDBAT is an experimental delay-based congestion control mechanism 115 that reacts early to congestion in the network, thus enabling 116 "background" applications to use the network while avoidoing 117 interference with the network performance of competing TCP flows. A 118 LEDBAT sender uses one-way delay measurements to estimate the amount 119 of queueing on the data path, controls the LEDBAT flow's congestion 120 window based on this estimate, and minimizes interference with 121 competing TCP flows by adding low extra queueing delay on the end-to- 122 end path. 124 Delay-based congestion control protocols, such as TCP-Vegas [Bra94], 125 are generally designed to achieve more, not less throughput than 126 standard TCP, and often outperform TCP under particular network 127 settings. In contrast, LEDBAT is designed to be no more aggressive 128 than TCP; LEDBAT is a "scavenger" congestion control mechanism that 129 seeks to utilize all available bandwidth and yields quickly when 130 competing with standard TCP at a bottleneck link. 132 2.1. Design Goals 134 LEDBAT congestion control seeks to: 135 1. utilize end-to-end available bandwidth, and maintain low queueing 136 delay when no other traffic is present, 137 2. add little to the queuing delay induced by concurrent TCP flows, 138 3. quickly yield to flows using standard TCP congestion control that 139 share the same bottleneck link, 141 2.2. Applicability 143 LEDBAT is a "scavenger" congestion control mechanism that is 144 primarily motivated by background bulk-transfer applications, such as 145 large file transfers (as with file-sharing applications) and software 146 updates. It can be used with any application that seeks to minimize 147 its impact on the network and on other interactive delay- and/or 148 bandwidth-sensitive network applications. LEDBAT is expected to work 149 well when the sender and/or receiver is connected via a residential 150 access network. 152 LEDBAT seeks to operate in networks with FIFO queues and with tail- 153 drop queue management. Further study is required to understand the 154 implications of Active Queue Management (AQM) schemes on LEDBAT 155 mechanisms. 157 LEDBAT, as specified in this document, is a congestion control 158 mechanism that can be used as part of a transport protocol or as part 159 of an application. LEDBAT can be used where the data transmission 160 mechanisms are capable of carrying timestamps and acknowledging data 161 frequently. LEDBAT can be used, with appropriate extensions where 162 necessary, with TCP, SCTP, and DCCP, and with proprietary application 163 protocols such as those built atop UDP for P2P applications. 165 3. LEDBAT Congestion Control 167 3.1. Overview 169 A standard TCP sender increases its congestion window until a loss 170 occurs [RFC5681], which, in the absence of any Active Queue 171 Management (AQM) in the network, occurs only when the queue at the 172 bottleneck link on the end-to-end path overflows. Since packet loss 173 at the bottleneck link is expected to be preceded by an increase in 174 the queueing delay at the bottleneck link, LEDBAT congestion control 175 uses this increase in queueing delay as an early signal of 176 congestion, enabling it to respond to congestion earlier than 177 standard TCP, and enabling it to yield bandwidth to a competing TCP 178 flow. 180 LEDBAT employs one-way delay measurements to estimate queueing delay. 181 When the estimated queueing delay is less than a pre-determined 182 target, LEDBAT infers that the network is not yet congested, and 183 increases its sending rate to utilize any spare capacity in the 184 network. When the estimated queueing delay becomes greater than a 185 pre-determined target, LEDBAT decreases its sending rate quickly as a 186 response to potential congestion in the network. 188 3.2. Preliminaries 190 A LEDBAT sender uses a congestion window (cwnd) to gate the amount of 191 data that the sender can send into the network in one roundtrip time 192 (RTT). A sender MAY maintain its cwnd in bytes or in packets; this 193 document uses cwnd in bytes. LEDBAT requires that each data segment 194 carries a "timestamp" from the sender, based on which the receiver 195 computes the one-way delay from the sender, and sends this computed 196 value back to the sender. 198 In addition to the LEDBAT mechanism described below, we note that a 199 slow start mechanism can be used as specified in [RFC5681]. Since 200 slow start leads to faster increase in the window than that specified 201 in LEDBAT, conservative congestion control implementations employing 202 LEDBAT may skip slow start altogether and start with an initial 203 window of INIT_CWND * MSS. (INIT_CWND is described later in 204 Section 3.5.) 206 The term "MSS", or the sender's Maximum Segment Size, used in this 207 document refers to the size of the largest segment that the sender 208 can transmit. The value of MSS can be based on the path MTU 209 discovery [RFC4821] algorithm and/or on other factors. 211 3.3. Receiver-Side Operation 213 A LEDBAT receiver operates as follows: 215 on data_packet: 216 remote_timestamp = data_packet.timestamp 217 acknowledgement.delay = local_timestamp() - remote_timestamp 218 # fill in other fields of acknowledgement 219 acknowlegement.send() 221 A receiver MAY send more than one delay sample in an acknowledgment. 222 For instance, a receiver that delays acknowledgments, i.e., sends an 223 acknowledgment less frequently than once per data packet, MAY send 224 all the one-way delay samples that it gathers in one acknowledgment. 226 3.4. Sender-Side Operation 228 3.4.1. An Overview 230 As a first approximation, a LEDAT sender operates as shown below; the 231 complete algorithm is specified later in Section 3.4.2. TARGET is 232 the maximum queueing delay that LEDBAT itself can introduce in the 233 network, and GAIN determines the rate at which the cwnd responds to 234 changes in queueing delay; both constants are specified later. Since 235 off_target can be positive or negative, the cwnd increases or 236 decreases in proportion to off_target. 238 on initialization: 239 base_delay = +INFINITY 241 on acknowledgement: 242 current_delay = acknowledgement.delay 243 base_delay = min(base_delay, current_delay) 244 queuing_delay = current_delay - base_delay 245 off_target = (TARGET - queuing_delay) / TARGET 246 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 248 The simplified mechanism above ignores multiple delay samples in an 249 acknowledgment, noise filtering, base delay expiration, and sender 250 idle times, which we now take into account in our complete sender 251 algorithm below. 253 3.4.2. The Complete Sender Algorithm 255 update_current_delay() maintains a list of one-way delay 256 measurements, of which the minimum is used as an estimate of the 257 current end-to-end delay. update_base_delay() maintains a list of 258 one-way delay minima over a number of one-minute intervals, to 259 measure and to track changes in the base delay of the end-to-end 260 path. 262 This algorithm restricts cwnd growth after a period of inactivity, 263 where the cwnd is clamped down to a little more than flightsize using 264 max_allowed_cwnd. To be TCP-friendly on data loss, LEDBAT halves its 265 cwnd. The full sender-side algorithm is given below: 267 on initialization: 268 create current_delays list with CURRENT_FILTER elements 269 create base_delays list with BASE_HISTORY number of elements 270 inialize elements in current_delays and base_delays to +INFINITY 271 last_rollover = -INFINITY # More than a minute in the past 272 cwnd = INIT_CWND * MSS 274 on acknowledgement: 275 # flightsize is the amount of data oustanding before this ack 276 # was received and is updated later by update_flightsize(); 277 # bytes_newly_acked is the number of bytes that this ack 278 # newly acknowledges, and it MAY be set to MSS; and 279 # cwnd is in bytes. 281 for each delay sample in the acknowledgment: 282 delay = acknowledgement.delay 283 update_base_delay(delay) 284 update_current_delay(delay) 285 queuing_delay = FILTER(current_delays) - MIN(base_delays) 286 off_target = (TARGET - queuing_delay) / TARGET 287 cwnd += GAIN * off_target * bytes_newly_acked * MSS / cwnd 288 max_allowed_cwnd = flightsize + ALLOWED_INCREASE * MSS 289 cwnd = min(cwnd, max_allowed_cwnd) 290 cwnd = max(cwnd, MIN_CWND * MSS) 291 update_flightsize() #subtracts bytes_newly_acked from flightsize 293 on data loss: 294 # atmost once per RTT 295 cwnd = max(cwnd/2, MIN_CWND * MSS) 297 update_current_delay(delay) 298 # Maintain a list of CURRENT_FILTER last delays observed. 299 delete first item in current_delays list 300 append delay to current_delays list 302 update_base_delay(delay) 303 # Maintain BASE_HISTORY delay-minima. 304 # Each minimum is measured over a period of a minute. 305 if round_to_minute(now) != round_to_minute(last_rollover) 306 last_rollover = now 307 delete first item in base_delays list 308 append delay to base_delays list 309 else 310 base_delays.tail = MIN(base_delays.tail, delay) 312 The LEDBAT sender ensures that any outliers in the current_delay 313 samples are eliminated by implementing FILTER() to extract the actual 314 delay estimate from the current_delay samples. An example of such a 315 filter is the Exponentially-Weighted Moving Average (EWMA) function. 317 To implement an approximate minimum over the past few minutes, a 318 LEDBAT sender stores BASE_HISTORY+1 separate minima---one each for 319 the last BASE_HISTORY minutes, and one for the running current 320 minute. At the end of the current minute, the window moves---the 321 earliest minimum is dropped and the latest minimum is added. If the 322 connection is idle for a given minute, no data is available for the 323 one-way delay and, therefore, a value of +INFINITY is stored in the 324 list. If the connection has been idle for BASE_HISTORY minutes, all 325 minima in the list are thus set to +INFINITY and measurement begins 326 anew. LEDBAT thus requires that during idle periods, an 327 implementation must maintain the base delay list. 329 We note that LEDBAT assumes random fluctuations in inter-packet 330 transmission times; see section Section 5.3 for a discussion. 332 3.5. Parameter Values 334 TARGET MUST be 100 milliseconds or less, and this choice of value is 335 explained further in Section 4.3. Note that using the same TARGET 336 value across LEDBAT flows enables equitable sharing of the bottleneck 337 bandwidth---flows with a higher TARGET may get a larger share of the 338 bottleneck bandwidth. It is possible to consider the use of 339 different TARGET values for implementing a relative priority between 340 two competing LEDBAT flows by setting a higher TARGET value for the 341 higher-priority flow. 343 ALLOWED_INCREASE SHOULD be 1, and it MUST be greater than 0. An 344 ALLOWED_INCREASE of 0 results in no cwnd growth at all, and an 345 ALLOWED_INCREASE of 1 allows and limits cwnd increase based on 346 flightsize in the previous RTT. An ALLOWED_INCREASE greater than 1 347 MAY be used when interactions between LEDBAT and the framing protocol 348 provide a clear reason for doing so. 350 GAIN MUST be set to 1 or less. A GAIN of 1 limits the maximum cwnd 351 ramp-up to the same rate as TCP Reno in Congestion Avoidance. While 352 this document specifies the use of the same GAIN for both cwnd 353 increase (when off_target is greater than zero) and decrease (when 354 off_target is less than zero), implementations MAY use a higher GAIN 355 for cwnd decrease than for the increase; our justification follows. 356 When a competing non-LEDBAT flow increases its sending rate, the 357 LEDBAT sender may only measure a small amount of additional delay and 358 decrease the sending rate slowly. To ensure no impact on a competing 359 non-LEDBAT flow, the LEDBAT flow should decrease its sending rate at 360 least as quickly as the competing flow increases its sending rate. A 361 higher decrease GAIN MAY be used to allow the LEDBAT flow to decrease 362 its sending rate faster than the competing flow's increase rate. 364 The size of the base_delays list, BASE_HISTORY, SHOULD be 10. If the 365 actual base delay decreases, due to a route change for instance, a 366 LEDBAT sender adapts immediately, irrespective of the value of 367 BASE_HISTORY. If the actual base delay increases however, a LEDBAT 368 sender will take BASE_HISTORY minutes to adapt and may wrongly infer 369 congestion in the meanwhile. A value for BASE_HISTORY is thus a 370 tradeoff: a higher value may yield a more accurate measurement when 371 the base delay is unchanging, and a lower value results in a quicker 372 response to actual increase in base delay. 374 A LEDBAT sender uses the current_delays list to maintain delay 375 measurements made within an RTT amount of time in the past, seeking 376 to eliminate noise spikes in its measurement of the current one-way 377 delay through the network. The size of this list, CURRENT_FILTER, 378 may be variable, and depends on the number of successful measurements 379 made within an RTT amount of time in the past. The sender should 380 seek to gather enough delay samples in each RTT so as to have 381 statistical confidence in the measurements. While the number of 382 delay samples required for such confidence will vary depending on 383 network conditions, we recommend that the sender SHOULD use at least 384 4 samples in each RTT. Thus, CURRENT_FILTER SHOULD be at least 4, 385 and limited such that no samples in list are older than an RTT in the 386 past. 388 MIN_CWND SHOULD be 2, and it MUST be at least 1. INIT_CWND SHOULD be 389 2, and it MUST be at least 1. The choice of MIN_CWND and INIT_CWND 390 are strongly connected to the framing protocol; a larger MIN_CWND 391 and/or INIT_CWND MAY be used if the framing protocol allows it. For 392 instance, TCP senders may use a larger INIT_CWND as specified in 393 [RFC3390]. 395 4. Understanding LEDBAT Mechanisms 397 This section describes the delay estimation and window management 398 mechanisms used in LEDBAT. 400 4.1. Delay Estimation 402 LEDBAT estimates congestion in the direction of the data flow, and to 403 avoid measuring queue build-up on the reverse path (or ack path), 404 LEDBAT uses one-way delay estimates. LEDBAT assumes measurements are 405 done with data packets, thus avoiding the need for separate 406 measurement packets and avoiding the pitfall of measurement packets 407 being treated differently from the data packets in the network. 409 End-to-end delay can be decomposed into transmission (or 410 serialization) delay, propagation (or speed-of-light) delay, queueing 411 delay, and processing delay. On any given path, barring some noise, 412 all delay components except for queueing delay are constant. To 413 observe an increase in the queueing delay in the network, a LEDBAT 414 sender separates the queueing delay component from the rest of the 415 end-to-end delay, as described below. 417 4.1.1. Estimating Base Delay 419 Since queuing delay is always additive to the end-to-end delay, 420 LEDBAT estimates the sum of the constant delay components, which we 421 call "base delay", to be the minimum delay observed on the end-to-end 422 path. Using the minimum observed delay also allows LEDBAT to 423 eliminate noise in the delay estimation, such as due to spikes in 424 processing delay at a node on the path. 426 To respond to true changes in the base delay, as can be caused by a 427 route change, LEDBAT uses only recent measurements in estimating the 428 base delay. The duration of the observation window itself is a 429 tradeoff between robustness of measurement and responsiveness to 430 change---a larger observation window increases the chances that the 431 true base delay will be detected (as long as the true base delay is 432 unchanged), whereas a smaller observation window results in faster 433 response to true changes in the base delay. 435 4.1.2. Estimating Queueing Delay 437 Given that the base delay is constant, the queueing delay is 438 represented by the variable component of the measured end-to-end 439 delay. LEDBAT measures queueing delay as simply the difference 440 between an end-to-end delay measurement and the current estimate of 441 base delay. 443 4.2. Managing the Congestion Window 445 4.2.1. Window Increase: Probing For More Bandwidth 447 A LEDBAT sender increases its congestion window if the queuing delay 448 is smaller than a target value, proportionally to the relative 449 difference between the current queueing delay and the delay target. 450 To be friendly to competing TCP flows, we set this highest rate of 451 window growth to be the same as TCP's. In other words, A LEDBAT flow 452 thus never ramps up faster than a competing TCP flow over the same 453 path. 455 4.2.2. Window Decrease: Responding To Congestion 457 When the sender's queueing delay estimate is higher than the target, 458 the LEDBAT flow's rate should be reduced. LEDBAT uses a simple 459 linear controller to detemine sending rate as a function of the delay 460 estimate, where the response is proportional to the difference 461 between the current queueing delay estimate and the target. In 462 limited experiments with Bittorrent nodes, this controller seems to 463 work well. 465 Unlike TCP-like loss-based congestion control, LEDBAT seeks to avoid 466 losses and so a LEDBAT sender is not expected to normally rely on 467 losses to determine the sending rate. However, when data loss does 468 occur, LEDBAT must respond as standard TCP does; even if the queueing 469 delay estimates indicate otherwise, a loss is assumed to be a strong 470 indication of congestion. Thus, to deal with severe congestion when 471 packets are dropped in the network, and to provide a fallback against 472 incorrect queuing delay estimates, a LEDBAT sender halves its 473 congestion window when a loss event is detected. As with TCP New- 474 Reno, LEDBAT reduces its cwnd by half at most once per RTT. 476 4.3. Choosing The Queuing Delay Target 478 The queueing delay target is a tradeoff. A target that is too low 479 might result in under-utilization of the bottleneck link, especially 480 if the LEDBAT flow is the only flow on the link, and may also be more 481 sensitive to error in the measured delay. The International 482 Telecommunication Union's (ITU's) Recommendation G.114 defines a 483 delay of 150 ms to be acceptable for most user voice applications. 484 Thus the extra delay induced by LEDBAT must be below 150 ms to reduce 485 impact on delay-sentive applications. 487 Our recommendation of 100 ms or less as the target is based on these 488 considerations. Anecdotal evidence indicates that this value works 489 well: LEDBAT has been implemented and successfully deployed with a 490 target value of 100 ms in two Bittorrent implementations---BitTorrent 491 DNA as the exclusive congestion control mechanism and in uTorrent as 492 an experimental mechanism. 494 5. Discussion 496 5.1. Framing and Ack Frequency Considerations 498 While the actual framing and wire format of the protocols using 499 LEDBAT are outside the scope of this document, we briefly consider 500 the data framing and ack frequency needs of LEDBAT mechanisms. 502 To compute the data path's one-way delay, our discussion of LEDBAT 503 assumes a framing that allows the sender to timestamp packets and for 504 the receiver to convey the measured one-way delay back to the sender 505 in ack packets. LEDBAT does not require this particular method, but 506 it does require unambiguous delay estimates using data and ack 507 packets. 509 A LEDBAT receiver may send an ack as frequently as one for every data 510 packet received or less frequently; LEDBAT does require that the 511 receiver MUST transmit at least one ack in every RTT. 513 5.2. Competing With TCP Flows 515 LEDBAT is designed to respond to congestion indications earlier than 516 loss-based TCP. A LEDBAT flow is more aggressive when the queueing 517 delay estimate is lower; since the queueing delay estimate is non- 518 negative, LEDBAT is most aggressive when its queuing delay estimate 519 is zero. In this case, LEDBAT ramps up its congestion window at the 520 same rate as TCP does. LEDBAT reduces its rate earlier than TCP 521 does, always halving the congestion window on loss. Thus, in the 522 worst case where the delay estimates are completely and consistently 523 off, a LEDBAT flow falls back to TCP mechanisms and is as aggressive 524 as a TCP flow. 526 5.3. Fairness Among LEDBAT Flows 528 The primary design goals of LEDBAT are focussed on the aggregate 529 behavior of LEDBAT flows when they compete with standard TCP. Since 530 LEDBAT is designed for background traffic, we consider link 531 utilization to be more important than fairness amongst LEDBAT flows. 532 Nevertheless, we now consider fairness issues that might arise 533 amongst competing LEDBAT flows. 535 LEDBAT as described so far lacks a mechanism specifically designed to 536 equalize utilization amongst LEDBAT flows. Anecdotally observed 537 behavior of existing implementations indicates that a rough 538 equalization does occur since in most enviroments some amount of 539 randomness in the inter-packet transmission times exist, as explained 540 further below. 542 Delay-based congestion control systems suffer from the possibility of 543 late-comers incorrectly measuring and using a higher base-delay than 544 an active flow that started earlier. Suppose a LEDBAT flow is the 545 only flow on the bottleneck, which the flow saturates, steadily 546 maintaining the queueing delay at a target delay. When a new LEDBAT 547 flow arrives, it might incorrectly measure the current end-to-end 548 delay, including the queueing delay being maintained by the first 549 LEDBAT flow, as its base delay, and the incoming flow might now 550 effectively seek to build on top of the existing, already maximal 551 queueing delay. As the second flow builds up, the first flow sees 552 the true queueing delay and backs off, while the late-comer keeps 553 building up, using up the entire link's capacity; this advantage is 554 called the "late-comer's advantage". 556 In the worse case, if the first flow yields at the same rate as the 557 new flow increases its sending rate, the new flow will see constant 558 end-to-end delay, which it assumes is the base delay, until the first 559 flow backs off completely. As a result, by the time the second flow 560 stops increasing its cwnd, it would have added twice the target 561 queueing delay to the network. 563 This advantage can be reduced if the the first flow yields quickly 564 enough to empty the bottleneck queue faster than the incoming flow 565 increases its occupancy in the queue; as a result, the late-comer 566 might measure a delay closer to the base delay. While such a 567 reduction might be achieved through a multiplicative decrease of the 568 congestion window, this might cause stronger fluctuations in flow 569 throughput during steady state. 571 In practice, however, this concern seems to be alleviated by the 572 burstiness of network traffic: all that's needed to measure the base 573 delay is one small gap in transmission schedules between the LEDBAT 574 flows. These gaps can occur for a number of reasons such as latency 575 introduced due to OS scheduling at the sender, processing delay at 576 the sender or any network node, and link contention. When such a gap 577 occurs while the late-comer is starting, base delay is immediately 578 correctly measured. With a small number of LEDBAT flows, system 579 noise seems to sufficiently regulate the late-comer's advantage. 581 6. IANA Considerations 583 There are no IANA considerations for this document. 585 7. Security Considerations 587 A network on the path might choose to cause higher delay measurements 588 than the real queuing delay so that LEDBAT backs off even when 589 there's no congestion present. While shaping of traffic into an 590 artificially narrow bottleneck by increasing the queueing delay 591 cannot be trivially counteracted, a protocol using LEDBAT should seek 592 to minimize the risk of such an attack by authenticating the 593 timestamp and delay fields in the packets. 595 8. Acknowledgements 597 We thank folks in the LEDBAT working group for their comments and 598 feedback. Special thanks to Murari Sridharan and Rolf Winter for 599 their patient and untiring shepherding. 601 9. References 603 9.1. Normative References 605 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 606 Requirement Levels", BCP 14, RFC 2119, March 1997. 608 [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's 609 Initial Window", RFC 3390, October 2002. 611 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 612 Discovery", RFC 4821, March 2007. 614 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 615 Control", RFC 5681, September 2009. 617 9.2. Informative References 619 [Bra94] Brakmo, L., O'Malley, S., and L. Peterson, "TCP Vegas: New 620 techniques for congestion detection and avoidance", 621 Proceedings of SIGCOMM '94, pages 24-35, August 1994. 623 Appendix A. Timestamp errors 625 One-way delay measurement needs to deal with timestamp errors. We'll 626 use the same locally linear clock model and the same terminology as 627 Network Time Protocol (NTP). This model is valid for any 628 differentiable clocks. NTP uses the term "offset" to refer to 629 difference from true time and "skew" to refer to difference of clock 630 rate from the true rate. The clock will thus have a fixed offset 631 from the true time and a skew. We'll consider what we need to do 632 about the offset and the skew separately. 634 A.1. Clock offset 636 First, consider the case of zero skew. The offset of each of the two 637 clocks shows up as a fixed error in one-way delay measurement. The 638 difference of the offsets is the absolute error of the one-way delay 639 estimate. We won't use this estimate directly, however. We'll use 640 the difference between that and a base delay. Because the error 641 (difference of clock offsets) is the same for the current and base 642 delay, it cancels from the queuing delay estimate, which is what 643 we'll use. Clock offset is thus irrelevant to the design. 645 A.2. Clock skew 647 The clock skew manifests in a linearly changing error in the time 648 estimate. For a given pair of clocks, the difference in skews is the 649 skew of the one-way delay estimate. Unlike the offset, this no 650 longer cancels in the computation of the queuing delay estimate. On 651 the other hand, while the offset could be huge, with some clocks off 652 by minutes or even hours or more, the skew is typically small. For 653 example, NTP is designed to work with most clocks, yet it gives up 654 when the skew is more than 500 parts per million (PPM). Typical 655 skews of clocks that have never been trained seem to often be around 656 100-200 PPM. Previously trained clocks could have 10-20 PPM skew due 657 to temperature changes. A 100-PPM skew means accumulating 6 658 milliseconds of error per minute. The base delay updates mostly 659 takes care of clock skew unless the skew is unusually high or extreme 660 values have been chosen for TARGET and BASE_HISTORY so that the clock 661 skew in BASE_DELAY minutes is larger than the TARGET. 663 Clock skew can be in two directions: either the sender's clock is 664 faster than the receiver's, or vice versa. 666 If the senders's clock is faster the one-way delay measurement will 667 get more and more reduced by the clock drift over time. Whenever 668 there is no additional delay the base delay will be updated by a 669 smaller one-way delay value and the skew is compensated. This will 670 happen continuously as LEBDAT is design to keep the queue empty. If 671 a competing flow introduces additional queueing delay LEDBAT will 672 anyway get out of the way quickly and an overestimated one-way delay 673 will just speed-up the back-off. 675 When the receiver clock runs faster, the raw delay estimate will 676 drift up with time. This can suppress the throughput unnecessarily. 677 In this case a skew correction mechanim can be benefital. Further 678 condersiderations based on a deployed implementation and LEDBAT 679 specific preconditions are given in the next section. 681 A.3. Clock skew correction mechanism 683 The following paragraph describes the deployed clock skew correction 684 mechanism in the BitTorrent implementation for documentation purpose. 686 In the BitTorrent implemtation the receiver sends back the raw 687 (sending and receiving) timestamps. Based on this imfomation and the 688 system time at feedback receiption the sender can estimated the one- 689 way delay in both directions. Thus the sender can run the same base 690 delay calculation algorithm it runs for itself for the receiver as 691 well. If the sender can detect the receiver reducing its base delay, 692 it can infer that this is due to clock drift. The sender can be 693 compensated for by increasing the it's base delay by the same amount. 694 To apply this mechanism symmetrical framing is need (i.e., same 695 information about delays transmitted in both way). 697 The following considerations can be used for an alternative 698 implementation as a reference: 699 o Skew correction with faster virtual clock: 700 Since having a faster clock on the sender will continuousely 701 update the base delay, a faster virtual clock for sender 702 timestamping can be applied. This virual clock can be computed 703 from the default machine clock through a linear transformation. 704 E.g. with a 500 PPM speed-up the sender's clock is very likely to 705 be faster than any receiver's clock and thus LEDBAT will benefit 706 from the implicit correction when updating the base delay. 708 o Skew correction with estimating drift: 709 With LEDBAT the history of base delay minima is already kept for 710 each minute. This can provide a base to compute the clock skew 711 difference between the two hosts. The slope of a linear function 712 fitted to the set of minima base delays gives an estimate of the 713 clock skew. This estimation can be used to correct the clocks. 714 If the other endpoint is doing the same, the clock should be 715 corrected by half of the estimated skew amount. 717 o Byzantine skew correction: 718 When it is known that each host maintains long-lived connections 719 to a number of different other hosts, a byzantine scheme can be 720 used to estimate the skew with respect to the true time. Namely, 721 calculate the skew difference for each of the peer hosts as 722 described with the previous approach, then take the median of the 723 skew differences. While this scheme is not universally 724 applicable, it combines well with other schemes, since it is 725 essentially a clock training mechanism. The scheme also acts the 726 fastest, since the state is preserved between connections. 728 Authors' Addresses 730 Stanislav Shalunov 731 BitTorrent Inc 732 612 Howard St, Suite 400 733 San Francisco, CA 94105 734 USA 736 Email: shalunov@bittorrent.com 737 URI: http://shlang.com 739 Greg Hazel 740 BitTorrent Inc 741 612 Howard St, Suite 400 742 San Francisco, CA 94105 743 USA 745 Email: greg@bittorrent.com 747 Janardhan Iyengar 748 Franklin and Marshall College 749 415 Harrisburg Ave. 750 Lancaster, PA 17603 751 USA 753 Email: jiyengar@fandm.edu 755 Mirja Kuehlewind 756 University of Stuttgart 757 Stuttgart 758 DE 760 Email: mirja.kuehlewind@ikr.uni-stuttgart.de