idnits 2.17.1 draft-ietf-ledbat-congestion-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (March 14, 2011) is 4791 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Experimental ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 LEDBAT WG S. Shalunov 3 Internet-Draft G. Hazel 4 Intended status: Experimental BitTorrent Inc 5 Expires: September 15, 2011 J. Iyengar 6 Franklin and Marshall College 7 M. Kuehlewind 8 University of Stuttgart 9 March 14, 2011 11 Low Extra Delay Background Transport (LEDBAT) 12 draft-ietf-ledbat-congestion-04.txt 14 Abstract 16 LEDBAT is an experimental delay-based congestion control algorithm 17 that attempts to utilize the available bandwidth on an end-to-end 18 path while limiting the consequent increase in queueing delay on the 19 path. LEDBAT uses changes in one-way delay measurements to limit 20 congestion induced in the network by the LEDBAT flow. LEDBAT is 21 designed for use by background bulk-transfer applications; it is 22 designed to be no more aggressive than TCP congestion control and to 23 yield in the presence of competing TCP flows, thus limiting 24 interference with the network performance of the competing TCP flows. 26 Status of this Memo 28 This Internet-Draft is submitted in full conformance with the 29 provisions of BCP 78 and BCP 79. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF). Note that other groups may also distribute 33 working documents as Internet-Drafts. The list of current Internet- 34 Drafts is at http://datatracker.ietf.org/drafts/current/. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 This Internet-Draft will expire on September 15, 2011. 43 Copyright Notice 45 Copyright (c) 2011 IETF Trust and the persons identified as the 46 document authors. All rights reserved. 48 This document is subject to BCP 78 and the IETF Trust's Legal 49 Provisions Relating to IETF Documents 50 (http://trustee.ietf.org/license-info) in effect on the date of 51 publication of this document. Please review these documents 52 carefully, as they describe your rights and restrictions with respect 53 to this document. Code Components extracted from this document must 54 include Simplified BSD License text as described in Section 4.e of 55 the Trust Legal Provisions and are provided without warranty as 56 described in the Simplified BSD License. 58 Table of Contents 60 1. Requirements notation . . . . . . . . . . . . . . . . . . . . 3 61 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 62 2.1. Design Goals . . . . . . . . . . . . . . . . . . . . . . . 3 63 2.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 4 64 3. LEDBAT Congestion Control . . . . . . . . . . . . . . . . . . 4 65 3.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . 4 66 3.2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . 5 67 3.3. Receiver-Side Operation . . . . . . . . . . . . . . . . . 5 68 3.4. Sender-Side Operation . . . . . . . . . . . . . . . . . . 5 69 3.4.1. An Overview . . . . . . . . . . . . . . . . . . . . . 5 70 3.4.2. The Complete Sender Algorithm . . . . . . . . . . . . 6 71 3.5. Parameter Values . . . . . . . . . . . . . . . . . . . . . 7 72 4. Understanding LEDBAT Mechanisms . . . . . . . . . . . . . . . 8 73 4.1. Delay Estimation . . . . . . . . . . . . . . . . . . . . . 8 74 4.1.1. Estimating Base Delay . . . . . . . . . . . . . . . . 8 75 4.1.2. Estimating Queueing Delay . . . . . . . . . . . . . . 9 76 4.2. Managing the Congestion Window . . . . . . . . . . . . . . 9 77 4.2.1. Window Increase: Probing For More Bandwidth . . . . . 9 78 4.2.2. Window Decrease: Responding To Congestion . . . . . . 9 79 4.3. Choosing The Queuing Delay Target . . . . . . . . . . . . 10 80 5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 10 81 5.1. Framing and Ack Frequency Considerations . . . . . . . . . 10 82 5.2. Competing With TCP Flows . . . . . . . . . . . . . . . . . 11 83 5.3. Fairness Among LEDBAT Flows . . . . . . . . . . . . . . . 11 84 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 85 7. Security Considerations . . . . . . . . . . . . . . . . . . . 12 86 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 87 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13 88 9.1. Normative References . . . . . . . . . . . . . . . . . . . 13 89 9.2. Informative References . . . . . . . . . . . . . . . . . . 13 90 Appendix A. Timestamp errors . . . . . . . . . . . . . . . . . . 13 91 A.1. Clock offset . . . . . . . . . . . . . . . . . . . . . . . 13 92 A.2. Clock skew . . . . . . . . . . . . . . . . . . . . . . . . 13 93 A.3. Clock skew correction mechanism . . . . . . . . . . . . . 14 94 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 15 96 1. Requirements notation 98 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 99 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 100 document are to be interpreted as described in [RFC2119]. 102 2. Introduction 104 TCP congestion control [RFC5681] seeks to share bandwidth at a 105 bottleneck link equitably among flows competing at the bottleneck, 106 and it is the predominant congestion control mechanism used on the 107 Internet. Not all applications seek an equitable share of network 108 throughput, however---"background" applications, such as software 109 updates or file-sharing applications, seek to operate without 110 interfering with the performance of more interactive and delay- 111 and/or bandwidth-sensitive "foreground" applications---and standard 112 TCP may be too aggressive for use with such background applications. 114 LEDBAT is an experimental delay-based congestion control mechanism 115 that reacts early to congestion in the network, thus enabling network 116 use by applications that seek to avoid interfering with the network 117 performance of TCP flows. LEDBAT uses one-way delay measurements to 118 determine congestion on the data path, and keeps latency across the 119 bottleneck link in the end-to-end path low while attempting to 120 utilize the available bandwidth on the end-to-end path. 122 Delay-based congestion control protocols, such as TCP-Vegas [Bra94], 123 are generally designed to achieve more, not less throughput than 124 standard TCP, and often outperform TCP under particular network 125 settings. LEDBAT, however, is designed to be no more aggressive than 126 TCP; LEDBAT is a "scavenger" congestion control mechanism that seeks 127 to utilize all available bandwidth and yields quickly when competing 128 with standard TCP at a bottleneck link. A LEDBAT sender uses one-way 129 delay measurements to estimate the amount of queueing on the data 130 path, controls the LEDBAT flow's congestion window based on this 131 estimate, and minimizes interference with competing TCP flows by 132 adding low extra queueing delay on the end-to-end path. 134 2.1. Design Goals 136 LEDBAT congestion control seeks to: 137 1. keep delay low when no other traffic is present, 138 2. add little to the queuing delays induced by TCP traffic, 139 3. quickly yield to flows using standard TCP congestion control that 140 share the same bottleneck link, 142 4. utilize end-to-end available bandwidth, and 143 5. operate well in networks with FIFO queues and tail-drop queue 144 management. 146 2.2. Applicability 148 LEDBAT is a "scavenger" congestion control mechanism---a LEDBAT flow 149 seeks to utilize all available bandwidth and to yield quickly to a 150 competing standard TCP flow---and is primarily motivated by 151 background bulk-transfer applications, such as large file transfers 152 (as with file-sharing applications) and software updates. It can be 153 used with any application that seeks to minimize its impact on the 154 network and on other interactive delay- and/or bandwidth-sensitive 155 network applications. LEDBAT is expected to work well when the 156 sender and/or receiver is connected via a residential access network. 158 This document specifies a congestion control mechanism that can be 159 used as part of a transport protocol or as part of an application. 160 LEDBAT can be used where the data transmission mechanisms are capable 161 of carrying timestamps and acknowledging data frequently. LEDBAT can 162 be used, with appropriate extensions where necessary, with TCP, SCTP, 163 and DCCP, and with proprietary application protocols such as those 164 built atop UDP for P2P applications. 166 3. LEDBAT Congestion Control 168 3.1. Overview 170 A standard TCP sender increases its congestion window until a loss 171 occurs [RFC5681], which, in the absence of any Active Queue 172 Management (AQM) in the network, occurs only when the queue at the 173 bottleneck link on the end-to-end path overflows. Since packet loss 174 at the bottleneck link is expected to be preceded by an increase in 175 the queueing delay at the bottleneck link, LEDBAT congestion control 176 uses this increase in queueing delay as an early signal of 177 congestion, enabling it to respond to congestion earlier than 178 standard TCP, and enabling it to yield bandwidth to a competing TCP 179 flow. 181 LEDBAT employs one-way delay measurements to estimate queueing delay. 182 When the estimated queueing delay is less than a pre-determined 183 target, LEDBAT infers that the network is not yet congested, and 184 increases its sending rate to utilize any spare capacity in the 185 network. When the estimated queueing delay becomes greater than a 186 pre-determined target, LEDBAT decreases its sending rate quickly as a 187 response to potential congestion in the network. 189 3.2. Preliminaries 191 A LEDBAT sender uses a congestion window (cwnd) that gates the amount 192 of data that the sender can send into the network in one RTT. A 193 sender can choose to maintain a cwnd in bytes or in packets; this 194 document uses cwnd in bytes. LEDBAT requires that each data segment 195 carries a "timestamp" from the sender, based on which the receiver 196 computes the one-way delay from the sender, and sends this computed 197 value back to the sender. 199 In addition to the LEDBAT mechanism described below, we note that a 200 slow start mechanism can be used as specified in [RFC5681]. Since 201 slow start leads to faster increase in the window than that specified 202 in LEDBAT, conservative congestion control implementations employing 203 LEDBAT may skip slow start altogether and start with an initial 204 window of MIN_CWND MSS (MIN_CWND is described later in Section 3.5. 206 3.3. Receiver-Side Operation 208 A LEDBAT receiver operates as follows: 210 on data_packet: 211 remote_timestamp = data_packet.timestamp 212 acknowledgement.delay = local_timestamp() - remote_timestamp 213 # fill in other fields of acknowledgement 214 acknowlegement.send() 216 3.4. Sender-Side Operation 218 3.4.1. An Overview 220 As a first approximation, a LEDAT sender operates as shown below; the 221 complete algorithm is specified later in Section 3.4.2. TARGET is 222 the maximum queueing delay that LEDBAT itself can introduce in the 223 network, and GAIN determines the rate at which the congestion window 224 responds to changes in queueing delay; both constants are specified 225 later. Since off_target can be positive or negative, the congestion 226 window (cwnd) increases or decreases in proportion to off_target. 228 on initialization: 229 base_delay = +INFINITY 231 on acknowledgement: 232 current_delay = acknowledgement.delay 233 base_delay = min(base_delay, current_delay) 234 queuing_delay = current_delay - base_delay 235 off_target = (TARGET - queuing_delay) / TARGET 236 cwnd += GAIN * off_target * bytes_newly_acked / cwnd 238 3.4.2. The Complete Sender Algorithm 240 The simplified mechanism above ignores noise filtering and base delay 241 expiration, which we now take into account in our complete sender 242 algorithm below. update_base_delay() maintains a list of one-way 243 delay minima over a number of one-minute intervals, to measure and 244 track changes in the base delay of the end-to-end path. 245 update_current_delay() maintains a list of one-way delay 246 measurements, of which the minimum is used as an estimate of the 247 current end-to-end delay. Note that while this document uses the 248 minimum to filter any noise in the one-way delay, a different and 249 more sophisticated filter MAY be used. 251 This complete algorithm restricts cwnd growth after a period of 252 inactivity, where the cwnd is clamped down to a little more than 253 flightsize using max_allowed_cwnd. Finally, to be TCP-friendly on 254 data loss, LEDBAT halves its congestion window. The full sender-side 255 algorithm is given below: 257 on initialization: 258 create current_delays list with CURRENT_FILTER elements 259 create base_delays list with BASE_HISTORY number of elements 260 inialize elements in current_delays and base_delays to +INFINITY 261 last_rollover = -INFINITY # More than a minute in the past. 262 cwnd = INIT_CWND * MSS 264 on acknowledgement: 265 # flightsize is the amount of data oustanding before this ack 266 # was received and is updated later by update_flightsize(); 267 # bytes_newly_acked is the number of bytes that this ack 268 # newly acknowledges, and it MAY be set to MSS; and 269 # cwnd is in bytes. 271 delay = acknowledgement.delay 272 update_base_delay(delay) 273 update_current_delay(delay) 274 queuing_delay = MIN(current_delays) - MIN(base_delays) 275 off_target = (TARGET - queuing_delay) / TARGET 276 cwnd += GAIN * off_target * bytes_newly_acked / cwnd 277 max_allowed_cwnd = flightsize + ALLOWED_INCREASE * MSS 278 cwnd = min(cwnd, max_allowed_cwnd) 279 cwnd = max(cwnd, MIN_CWND * MSS) 280 update_flightsize() #subtracts bytes_newly_acked from flightsize 282 on data loss: 283 # atmost once per RTT 284 cwnd = cwnd/2 285 cwnd = max(cwnd, MIN_CWND * MSS) 287 update_current_delay(delay) 288 # Maintain a list of CURRENT_FILTER last delays observed. 289 delete first item in current_delays list 290 append delay to current_delays list 292 update_base_delay(delay) 293 # Maintain BASE_HISTORY min delays. Each represents a minute. 294 if round_to_minute(now) != round_to_minute(last_rollover) 295 last_rollover = now 296 delete first item in base_delays list 297 append delay to tail of base_delays list 298 else 299 tail of base_delays list = MIN(tail of base_delays list, delay) 301 Note, that random fluctuations in inter-packet transmission time is 302 assumed; see section Section 5.3 for a discussion. 304 To implement an approximate minimum over the last N minutes, a LEDBAT 305 sender stores BASE_HISTORY+1 separate minima---BASE_HISTORY for the 306 last BASE_HISTORY minutes, and one for the running current minute. 307 At the end of the current minute, the window moves---the earliest 308 minimum is dropped and the latest minimum is added. When the 309 connection is idle for a given minute, no data is available for the 310 one-way delay and, therefore, no minimum is stored. When the 311 connection has been idle for N minutes, the measurement begins anew. 312 Thus even if no data has sent and consequently no acknowledgements 313 have been received the implementation have to mainatin the base delay 314 vector. 316 3.5. Parameter Values 318 TARGET MUST be 100 milliseconds or less, and this choice of value is 319 explained further in Section 4.3. Note that using the same TARGET 320 value across LEDBAT flows enables equitable sharing of the bottleneck 321 bandwidth---flows with a higher TARGET may get a larger share of the 322 bottleneck bandwidth. It is possible to consider the use of 323 different TARGET values for implementing a relative priority between 324 two competing LEDBAT flows by setting a higher TARGET value for the 325 higher-priority flow. 327 ALLOWED_INCREASE SHOULD be 1, and it MUST be greater than 0. An 328 ALLOWED_INCREASE of 0 results in no cwnd growth at all, and an 329 ALLOWED_INCREASE of 1 allows and limits cwnd increase based on 330 flightsize in the previous RTT. An ALLOWED_INCREASE greater than 1 331 MAY be used when interactions between LEDBAT and the framing protocol 332 provide a clear reason for doing so. GAIN MUST be set to 1 or less. 333 A GAIN of 1 limits the maximum congestion window ramp-up to the same 334 rate as TCP Reno in Congestion Avoidance. BASE_HISTORY SHOULD be 10; 335 MIN_CWND SHOULD be 2, and it MUST be at least 1. INIT_CWND SHOULD be 336 2, and it MUST be at least 1. The choice of MIN_CWND and INIT_CWND 337 are strongly connected to the framing protocol; a larger MIN_CWND 338 and/or INIT_CWND MAY be used if the framing protocol allows it. For 339 instance, TCP senders may use a larger INIT_CWND as specified in 340 [RFC3390]. 342 A LEDBAT sender uses the current_delays list to maintain delay 343 measurements made within an RTT amount of time in the past, seeking 344 to eliminate noise spikes in its measurement of the current one-way 345 delay through the network. The size of this list, CURRENT_FILTER, 346 may be variable, and depends on the number of successful measurements 347 made within an RTT amount of time in the past. CURRENT_FILTER SHOULD 348 be 1, and it MUST be at least 1 and limited such that no samples in 349 list are older than an RTT in the past. Note that after a long 350 silent period, a LEDBAT sender will still have at least 1 351 current_delay sample; any current_delay samples older than an RTT 352 MUST NOT be used in computing queueing_delay. A simple 353 implementation with a fixed-size list replaces measurements in the 354 list that are older than an RTT with +INFINITY. 356 4. Understanding LEDBAT Mechanisms 358 This section describes the delay estimation and window management 359 mechanisms used in LEDBAT congestion control. 361 4.1. Delay Estimation 363 To observe an increase in the queueing delay in the network, LEDBAT 364 separates the queueing delay component from the rest of the end-to- 365 end delay. This section explains how LEDBAT decomposes the observed 366 changes in end-to-end delay into these two components. 368 LEDBAT estimates congestion in the direction of the data flow, and to 369 avoid measuring queue build-up on the reverse path (or ack path), 370 LEDBAT uses one-way delay estimates. LEDBAT assumes measurements are 371 done with data packets, thus avoiding the need for separate 372 measurement packets and avoiding the pitfall of measurement packets 373 being treated differently from the data packets in the network. 375 4.1.1. Estimating Base Delay 377 End-to-end delay can be decomposed into transmission (or 378 serialization) delay, propagation (or speed-of-light) delay, queueing 379 delay, and processing delay. On any given path, barring some noise, 380 all delay components except for queueing delay are constant. Since 381 queuing delay is always additive to the end-to-end delay, we estimate 382 the sum of the constant delay components, which we call "base delay", 383 to be the minimum delay observed on the end-to-end path. Using the 384 minimum observed delay also allows LEDBAT to eliminate noise in the 385 delay estimation, such as due to spikes in processing delay at a node 386 on the path. 388 To respond to true changes in the base delay, as can be caused by a 389 route change, LEDBAT uses only recent measurements in estimating the 390 base delay.The duration of the observation window itself is a 391 tradeoff between robustness of measurement and responsiveness to 392 change: a larger observation window increases the chances that the 393 true base delay will be detected (as long as the true base delay is 394 unchanged), whereas a smaller observation window results in faster 395 response to true changes in the base delay. 397 4.1.2. Estimating Queueing Delay 399 Given that the base delay is constant, the queueing delay is 400 represented by the variable component of the measured end-to-end 401 delay. LEDBAT measures queueing delay as simply the difference 402 between an end-to-end delay measurement and the current estimate of 403 base delay. 405 4.2. Managing the Congestion Window 407 4.2.1. Window Increase: Probing For More Bandwidth 409 A LEDBAT sender increases its congestion window if the queuing delay 410 is smaller than a target value, proportionally to the relative 411 difference between the current queueing delay and the delay target. 412 To be friendly to competing TCP flows, we set this highest rate of 413 window growth to be the same as TCP's. In other words, A LEDBAT flow 414 thus never ramps up faster than a competing TCP flow over the same 415 path. 417 4.2.2. Window Decrease: Responding To Congestion 419 When the sender's queueing delay estimate is higher than the target, 420 the LEDBAT flow's rate should be reduced. LEDBAT uses a simple 421 linear controller to detemine sending rate as a function of the delay 422 estimate, where the response is proportional to the difference 423 between the current queueing delay estimate and the target. In 424 limited experiments with Bittorrent nodes, this controller seems to 425 work well. 427 Unlike TCP-like loss-based congestion control, LEDBAT does not induce 428 losses and so a LEDBAT sender is not expected to normally rely on 429 losses to determine the sending rate. However, when data loss does 430 occur, LEDBAT must respond as standard TCP does; even if the queueing 431 delay estimates indicate otherwise, a loss is assumed to be a strong 432 indication of congestion. Thus, to deal with severe congestion when 433 packets are dropped in the network, and to provide a fallback against 434 incorrect queuing delay estimates, a LEDBAT sender halves its 435 congestion window when a loss event is detected. As with TCP New- 436 Reno, LEDBAT reduces its cwnd by half at most once per RTT. 438 4.3. Choosing The Queuing Delay Target 440 The queueing delay target is a tradeoff. A target that is too low 441 might result in under-utilization of the bottleneck link, especially 442 if the LEDBAT flow is the only flow on the link, and may also be more 443 sensitive to error in the measured delay. The International 444 Telecommunication Union's (ITU's) Recommendation G.114 defines a 445 delay of 150 ms to be acceptable for most user voice applications. 446 Thus the extra delay induced by LEDBAT must be below 150 ms to reduce 447 impact on delay-sentive applications. 449 Our recommendation of 100 ms or less as the target is based on these 450 considerations. Anecdotal evidence indicates that this value works 451 well: LEDBAT has been been implemented and successfully deployed with 452 a target value of 100 ms in two Bittorrent implementations--- 453 BitTorrent DNA as the exclusive congestion control mechanism and in 454 uTorrent as an experimental mechanism. 456 5. Discussion 458 5.1. Framing and Ack Frequency Considerations 460 While the actual framing and wire format of the protocols using 461 LEDBAT are outside the scope of this document, we briefly consider 462 the data framing and ack frequency needs of LEDBAT mechanisms. 464 To compute the data path's one-way delay, our discussion of LEDBAT 465 assumes a framing that allows the sender to timestamp packets and for 466 the receiver to convey the measured one-way delay back to the sender 467 in ack packets. LEDBAT does not require this particular method, but 468 it does require unambiguous delay estimates using data and ack 469 packets. 471 A LEDBAT receiver may send an ack as frequently as one for every data 472 packet received or less frequently; LEDBAT does require that the 473 receiver MUST transmit at least one ack in every RTT. 475 5.2. Competing With TCP Flows 477 LEDBAT is designed to respond to congestion indications earlier than 478 loss-based TCP. A LEDBAT flow is more aggressive when the queueing 479 delay estimate is lower; since the queueing delay estimate is non- 480 negative, LEDBAT is most aggressive when its queuing delay estimate 481 is zero. In this case, LEDBAT ramps up its congestion window at the 482 same rate as TCP does. LEDBAT reduces its rate earlier than TCP 483 does, always halving the congestion window on loss. Thus, in the 484 worst case where the delay estimates are completely and consistently 485 off, a LEDBAT flow falls back to TCP mechanisms and is as aggressive 486 as a TCP flow. 488 5.3. Fairness Among LEDBAT Flows 490 The primary design goals of LEDBAT are focussed on the aggregate 491 behavior of LEDBAT flows when they compete with standard TCP. Since 492 LEDBAT is designed for background traffic, we consider link 493 utilization to be more important than fairness amongst LEDBAT flows. 494 Nevertheless, we now consider fairness issues that might arise 495 amongst competing LEDBAT flows. 497 LEDBAT as described so far lacks a mechanism specifically designed to 498 equalize utilization amongst LEDBAT flows. Anecdotally observed 499 behavior of existing implementations indicates that a rough 500 equalization does occur since in most enviroments some amount of 501 randomness in the inter-packet transmission times exist, as explained 502 further below. 504 Delay-based congestion control systems suffer from the possibility of 505 late-comers incorrectly measuring and using a higher base-delay than 506 an active flow that started earlier. Suppose a LEDBAT flow is the 507 only flow on the bottleneck, which the flow saturates, steadily 508 maintaining the queueing delay at a target delay. When a new LEDBAT 509 flow arrives, it might incorrectly measure the current end-to-end 510 delay, including the queueing delay being maintained by the first 511 LEDBAT flow, as its base delay, and the incoming flow might now 512 effectively seek to build on top of the existing, already maximal 513 queueing delay. As the second flow builds up, the first flow sees 514 the true queueing delay and backs off, while the late-comer keeps 515 building up, using up the entire link's capacity; this advantage is 516 called the "late-comer's advantage". 518 In the worse case, if the first flow yields at the same rate as the 519 new flow increases its sending rate, the new flow will see constant 520 end-to-end delay, which it assumes is the base delay, until the first 521 flow backs off completely. As a result, by the time the second flow 522 stops increasing its cwnd, it would have added twice the target 523 queueing delay to the network. 525 This advantage can be reduced if the the first flow yields quickly 526 enough to empty the bottleneck queue faster than the incoming flow 527 increases its occupancy in the queue; as a result, the late-comer 528 might measure a delay closer to the base delay. While such a 529 reduction might be achieved through a multiplicative decrease of the 530 congestion window, this might cause stronger fluctuations in flow 531 throughput during steady state. 533 In practice, however, this concern seems to be alleviated by the 534 burstiness of network traffic: all that's needed to measure the base 535 delay is one small gap in transmission schedules between the LEDBAT 536 flows. These gaps can occur for a number of reasons such as latency 537 introduced due to OS scheduling at the sender, processing delay at 538 the sender or any network node, and link contention. When such a gap 539 occurs while the late-comer is starting, base delay is immediately 540 correctly measured. With a small number of LEDBAT flows, system 541 noise seems to sufficiently regulate the late-comer's advantage. 543 6. IANA Considerations 545 There are no IANA considerations for this document. 547 7. Security Considerations 549 A network on the path might choose to cause higher delay measurements 550 than the real queuing delay so that LEDBAT backs off even when 551 there's no congestion present. While shaping of traffic into an 552 artificially narrow bottleneck by increasing the queueing delay 553 cannot be trivially counteracted, a protocol using LEDBAT should seek 554 to minimize the risk of such an attack by authenticating the 555 timestamp and delay fields in the packets. 557 8. Acknowledgements 559 We thank folks in the LEDBAT working group for their comments and 560 feedback. Special thanks to Murari Sridharan and Rolf Winter for 561 their patient and untiring shepherding. 563 9. References 564 9.1. Normative References 566 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 567 Requirement Levels", BCP 14, RFC 2119, March 1997. 569 [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's 570 Initial Window", RFC 3390, October 2002. 572 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 573 Control", RFC 5681, September 2009. 575 9.2. Informative References 577 [Bra94] Brakmo, L., O'Malley, S., and L. Peterson, "TCP Vegas: New 578 techniques for congestion detection and avoidance", 579 Proceedings of SIGCOMM '94, pages 24-35, August 1994. 581 Appendix A. Timestamp errors 583 One-way delay measurement needs to deal with timestamp errors. We'll 584 use the same locally linear clock model and the same terminology as 585 Network Time Protocol (NTP). This model is valid for any 586 differentiable clocks. NTP uses the term "offset" to refer to 587 difference from true time and "skew" to refer to difference of clock 588 rate from the true rate. The clock will thus have a fixed offset 589 from the true time and a skew. We'll consider what we need to do 590 about the offset and the skew separately. 592 A.1. Clock offset 594 First, consider the case of zero skew. The offset of each of the two 595 clocks shows up as a fixed error in one-way delay measurement. The 596 difference of the offsets is the absolute error of the one-way delay 597 estimate. We won't use this estimate directly, however. We'll use 598 the difference between that and a base delay. Because the error 599 (difference of clock offsets) is the same for the current and base 600 delay, it cancels from the queuing delay estimate, which is what 601 we'll use. Clock offset is thus irrelevant to the design. 603 A.2. Clock skew 605 The clock skew manifests in a linearly changing error in the time 606 estimate. For a given pair of clocks, the difference in skews is the 607 skew of the one-way delay estimate. Unlike the offset, this no 608 longer cancels in the computation of the queuing delay estimate. On 609 the other hand, while the offset could be huge, with some clocks off 610 by minutes or even hours or more, the skew is typically small. For 611 example, NTP is designed to work with most clocks, yet it gives up 612 when the skew is more than 500 parts per million (PPM). Typical 613 skews of clocks that have never been trained seem to often be around 614 100-200 PPM. Previously trained clocks could have 10-20 PPM skew due 615 to temperature changes. A 100-PPM skew means accumulating 6 616 milliseconds of error per minute. The base delay updates mostly 617 takes care of clock skew unless the skew is unusually high or extreme 618 values have been chosen for TARGET and BASE_HISTORY so that the clock 619 skew in BASE_DELAY minutes is larger than the TARGET. 621 Clock skew can be in two directions: either the sender's clock is 622 faster than the receiver's, or vice versa. 624 If the senders's clock is faster the one-way delay measurement will 625 get more and more reduced by the clock drift over time. Whenever 626 there is no additional delay the base delay will be updated by a 627 smaller one-way delay value and the skew is compensated. This will 628 happen continuously as LEBDAT is design to keep the queue empty. If 629 a competing flow introduces additional queueing delay LEDBAT will 630 anyway get out of the way quickly and an overestimated one-way delay 631 will just speed-up the back-off. 633 When the receiver clock runs faster, the raw delay estimate will 634 drift up with time. This can suppress the throughput unnecessarily. 635 In this case a skew correction mechanim can be benefital. Further 636 condersiderations based on a deployed implementation and LEDBAT 637 specific preconditions are given in the next section. 639 A.3. Clock skew correction mechanism 641 The following paragraph describes the deployed clock skew correction 642 mechanism in the BitTorrent implementation for documentation purpose. 644 In the BitTorrent implemtation the receiver sends back the raw 645 (sending and receiving) timestamps. Based on this imfomation and the 646 system time at feedback receiption the sender can estimated the one- 647 way delay in both directions. Thus the sender can run the same base 648 delay calculation algorithm it runs for itself for the receiver as 649 well. If the sender can detect the receiver reducing its base delay, 650 it can infer that this is due to clock drift. The sender can be 651 compensated for by increasing the it's base delay by the same amount. 652 To apply this mechanism symmetrical framing is need (i.e., same 653 information about delays transmitted in both way). 655 The following considerations can be used for an alternative 656 implementation as a reference: 658 o Skew correction with faster virtual clock: 659 Since having a faster clock on the sender will continuousely 660 update the base delay, a faster virtual clock for sender 661 timestamping can be applied. This virual clock can be computed 662 from the default machine clock through a linear transformation. 663 E.g. with a 500 PPM speed-up the sender's clock is very likely to 664 be faster than any receiver's clock and thus LEDBAT will benefit 665 from the implicit correction when updating the base delay. 667 o Skew correction with estimating drift: 668 With LEDBAT the history of base delay minima is already kept for 669 each minute. This can provide a base to compute the clock skew 670 difference between the two hosts. The slope of a linear function 671 fitted to the set of minima base delays gives an estimate of the 672 clock skew. This estimation can be used to correct the clocks. 673 If the other endpoint is doing the same, the clock should be 674 corrected by half of the estimated skew amount. 676 o Byzantine skew correction: 677 When it is known that each host maintains long-lived connections 678 to a number of different other hosts, a byzantine scheme can be 679 used to estimate the skew with respect to the true time. Namely, 680 calculate the skew difference for each of the peer hosts as 681 described with the previous approach, then take the median of the 682 skew differences. While this scheme is not universally 683 applicable, it combines well with other schemes, since it is 684 essentially a clock training mechanism. The scheme also acts the 685 fastest, since the state is preserved between connections. 687 Authors' Addresses 689 Stanislav Shalunov 690 BitTorrent Inc 691 612 Howard St, Suite 400 692 San Francisco, CA 94105 693 USA 695 Email: shalunov@bittorrent.com 696 URI: http://shlang.com 697 Greg Hazel 698 BitTorrent Inc 699 612 Howard St, Suite 400 700 San Francisco, CA 94105 701 USA 703 Email: greg@bittorrent.com 705 Janardhan Iyengar 706 Franklin and Marshall College 707 415 Harrisburg Ave. 708 Lancaster, PA 17603 709 USA 711 Email: jiyengar@fandm.edu 713 Mirja Kuehlewind 714 University of Stuttgart 715 Stuttgart 716 DE 718 Email: mirja.kuehlewind@ikr.uni-stuttgart.de