LEDBAT WG                                                    S. Shalunov
Internet-Draft                                            BitTorrent Inc
Intended status: Experimental                              March 4, 2009
Expires: September 5, 2009

             Low Extra Delay Background Transport (LEDBAT)
                draft-shalunov-ledbat-congestion-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 5, 2009.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.
Abstract

   LEDBAT is an experimental alternative congestion control algorithm.

   LEDBAT enables an advanced networking application to minimize the
   extra delay it induces in the bottleneck while still saturating the
   bottleneck.  It thus implements an end-to-end version of scavenger
   service.  LEDBAT has been implemented in BitTorrent DNA, as the
   exclusive congestion control mechanism, and in uTorrent, as an
   experimental mechanism, and has been deployed in the wild with
   favorable results.

Table of Contents

   1. Requirements notation
   2. Introduction
   3. LEDBAT design goals
   4. LEDBAT motivation
      4.1. Simplest network topology
      4.2. Extra delay
      4.3. Queuing delay target
      4.4. Need to measure delay
      4.5. Queuing delay estimate
      4.6. Controller
      4.7. Max rampup rate same as TCP
      4.8. Halve on loss
      4.9. Yield to TCP
      4.10. Need for one-way delay
      4.11. Measuring one-way delay
      4.12. Route changes
      4.13. Timestamp errors
         4.13.1. Clock offset
         4.13.2. Clock skew
      4.14. Noise filtering
      4.15. Safety of LEDBAT
   5. LEDBAT congestion control
   6. Security Considerations
   7. Normative References
   Author's Address

1. Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Introduction

   The standard congestion control in TCP is based on loss and was not
   designed to drive delay to any given value.  Because TCP needs
   losses to back off, when a FIFO bottleneck lacks active queue
   management (AQM), TCP fills the buffer, effectively maximizing the
   possible delay.  A large number of the thinnest links in the
   Internet, particularly most uplinks of home connections, lack AQM.
   They also frequently contain enough buffer space to push delays into
   hundreds of milliseconds or even seconds.  There is no benefit to
   delays this large, but there are very substantial drawbacks for
   interactive applications: games and VoIP become impossible and even
   web browsing becomes very slow.

   While a number of delay-based congestion control mechanisms have
   been proposed, they were generally not designed to minimize the
   delay induced in the network.

   LEDBAT is designed to keep the latency across the congested
   bottleneck low even as the bottleneck is saturated.  This allows
   applications that send large amounts of data, particularly upstream
   on home connections, such as peer-to-peer applications, to operate
   without destroying the user experience in interactive applications.
   LEDBAT takes advantage of delay measurements and backs off before
   loss occurs.  It has been deployed by BitTorrent in the wild with
   the BitTorrent DNA client and now, experimentally, with the uTorrent
   client.  This mechanism not only keeps delay across a bottleneck
   low, but also yields quickly to competing traffic that uses
   loss-based congestion control.

   Beyond its utility for P2P, LEDBAT enables other advanced networking
   applications to better stay out of the way of interactive
   applications.

   In addition to direct and immediate benefits for P2P and other
   applications that can benefit from scavenger service, LEDBAT could
   point the way to a possible future evolution of the Internet in
   which loss is not part of the designed behavior and delay is
   minimized.
3. LEDBAT design goals

   The LEDBAT design goals are:

   1. saturate the bottleneck
   2. keep delay low when no other traffic is present
   3. quickly yield to traffic that shares the same bottleneck queue
      and uses standard TCP congestion control
   4. add little to the queuing delays induced by TCP traffic
   5. operate well in networks with FIFO queuing and drop-tail
      discipline
   6. be deployable for popular applications that currently comprise
      noticeable fractions of Internet traffic
   7. where available, use explicit congestion notification (ECN),
      active queue management (AQM), and/or end-to-end differentiated
      services (DiffServ).

4. LEDBAT motivation

   This section describes LEDBAT informally and provides some
   motivation.  It is expected to be helpful for general understanding
   and useful in discussions of the properties of LEDBAT.

   Without loss of generality, we consider only one direction of the
   data transfer.  The opposite direction can be treated identically.

4.1. Simplest network topology

   Consider first the desired behavior when there is only a single
   bottleneck and no competing traffic whatsoever, not even other
   LEDBAT connections.  The design goals obviously need to be robustly
   met in this trivial case.

4.2. Extra delay

   Consider the queuing delay on the bottleneck.  This delay is the
   extra delay induced by congestion control.  One of our design goals
   is to keep this delay low.  However, when this delay is zero, the
   queue is empty, so no data is being transmitted and the link is not
   saturated.  Hence, our design goal is to keep the queuing delay low,
   but non-zero.

4.3. Queuing delay target

   How low do we want the queuing delay to be?  Because another design
   goal is to be deployable on networks with only simple FIFO queuing
   and drop-tail discipline, we cannot rely on explicit signaling of
   the queuing delay.  We therefore estimate it from external
   measurements.  These measurements will have an error at least on the
   order of the best-case scheduling delays in the operating systems at
   the endpoints.  There is thus good reason to make the queuing delay
   larger than this error, and no reason to push it much higher.  We
   therefore adopt a fixed delay target to maintain.

4.4. Need to measure delay

   To maintain delay near the target, we have to use delay
   measurements.  Lacking delay measurements, we would have to go only
   by loss (when ECN is unavailable).  For loss to occur on a FIFO link
   with drop-tail discipline, the buffer must first be filled.  This
   would drive the delay to the largest possible value for the link,
   violating our design goal of keeping delay low.

4.5. Queuing delay estimate

   Since our goal is to control the queuing delay, it is natural to
   maintain an estimate of it.  Divide the delay into propagation,
   serialization, processing, and queuing components.  All components
   except queuing are nearly constant, while queuing is variable.
   Because queuing delay is always non-negative, the constant
   propagation+serialization+processing delay is no greater than the
   minimum delay observed.  Assuming that the queuing delay
   distribution has non-zero probability in every sufficiently small
   interval above zero, the minimum observed delay is also an
   asymptotically consistent estimator of the constant part of the
   delay.  We can thus estimate the queuing delay as the difference
   between the current delay and this base delay.
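   As a non-normative illustration (not part of the specification), the
   following Python sketch shows this estimate in its simplest form,
   with a base delay kept since the start of the connection; the names
   are placeholders, and the normative pseudocode appears in Section 5.

       # Estimate queuing delay as the difference between the latest
       # one-way delay sample and the minimum ("base") delay observed.
       base_delay = float("inf")

       def queuing_delay_estimate(one_way_delay_sample):
           global base_delay
           base_delay = min(base_delay, one_way_delay_sample)
           return one_way_delay_sample - base_delay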
4.6. Controller

   When our estimate of the queuing delay is lower than the target, it
   is natural to send faster.  When our estimate is higher, it is
   natural to send slower.  To avoid trivial oscillations on the
   round-trip-time (RTT) scale, the response of the controller needs to
   be near zero when the estimate is near the target.  To converge
   faster, the response needs to grow as the difference grows.  The
   simplest controller with this property is the linear controller,
   whose response is proportional to the difference between the
   estimate and the target.  This controller happens to work well in
   practice, obviating the need for more complex ones.

4.7. Max rampup rate same as TCP

   The congestion window therefore grows fastest when the queuing delay
   estimate is zero.  To be on the safe side, we make this maximum
   rampup rate equal to the rate at which TCP increases its sending
   speed.  Since the queuing delay estimate is always non-negative,
   this ensures that LEDBAT never ramps up faster than TCP would.

4.8. Halve on loss

   Further, to deal with severe congestion, when most packets are lost,
   and to provide a safety net against incorrect queuing delay
   estimates, we halve the window when a loss event is detected.  We do
   so at most once per RTT.

4.9. Yield to TCP

   Consider competition between a LEDBAT connection and a connection
   governed by loss-based congestion control (on a FIFO bottleneck with
   drop-tail discipline).  The loss-based connection needs to
   experience loss to back off, and loss only occurs after the
   connection experiences the maximum possible delay.  LEDBAT thus
   receives a congestion indication sooner than the loss-based
   connection.  If LEDBAT can ramp down faster than the loss-based
   connection ramps up, LEDBAT will yield.  LEDBAT ramps down when the
   queuing delay estimate exceeds the target: the larger the excess,
   the faster the ramp-down.  When the loss-based connection is
   standard TCP, LEDBAT yields at precisely the rate at which TCP ramps
   up when the queuing delay is double the target.
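   As a non-normative illustration, the following Python sketch shows
   one possible form of this controller.  The normative pseudocode in
   Section 5 leaves the scaling of GAIN to the implementer (it only
   requires that the maximum rampup rate match TCP's); here that
   requirement is met by normalizing the distance from the target by
   the target itself.  All names and constants are illustrative only.

       TARGET = 25.0   # queuing delay target, in milliseconds
       GAIN = 1.0      # with the normalization below, the maximum
                       # rampup rate is one packet per RTT, matching
                       # TCP congestion avoidance

       def on_acknowledgement(cwnd, queuing_delay):
           # Linear response, proportional to the distance from the
           # target (Section 4.6).
           off_target = (TARGET - queuing_delay) / TARGET
           # queuing_delay == 0:        off_target == 1, ramp up as
           #                            fast as TCP (Section 4.7)
           # queuing_delay == 2*TARGET: off_target == -1, ramp down as
           #                            fast as TCP ramps up (4.9)
           return cwnd + GAIN * off_target / cwnd

       def on_loss(cwnd):
           # Safety net: halve the window, at most once per RTT
           # (Section 4.8).
           return cwnd / 2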
4.10. Need for one-way delay

   Now consider a case in which one link direction is saturated with
   unrelated TCP traffic while the other direction is near-empty, and
   LEDBAT is sending in the near-empty direction.  Our design goal is
   to saturate it.  However, if we pay attention to round-trip delays,
   we will sense the delays on the reverse path and respond to them as
   described above.  We must therefore measure one-way delay and use it
   for our queuing delay estimate.

4.11. Measuring one-way delay

   A special IETF protocol, the One-Way Active Measurement Protocol
   (OWAMP), exists for measuring one-way delay.  However, since LEDBAT
   is already sending data, it is more efficient to add a timestamp to
   the packets in the data direction and a measurement result field in
   the acknowledgement direction.  This also removes the danger of
   measurement packets being treated differently from data packets.
   The failure case would be better treatment of measurement packets,
   where the data connection would be driven to losses.

4.12. Route changes

   Routes can change.  To cope with this, the base delay needs to be
   computed over the last few minutes instead of over the lifetime of
   the connection.  The tradeoff is: with longer intervals, the base is
   more accurate; with shorter intervals, reaction to route changes is
   faster.

   A convenient way to implement an approximate minimum over the last N
   minutes is to keep separate minima for the last N+1 minutes (the
   last one covering the partial current minute).
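   As a non-normative illustration, the following Python sketch shows
   one way to keep such per-minute minima; the history length N and the
   names are placeholders, and the normative version is the
   update_base_delay()/base_delay() pseudocode of Section 5.

       import time

       N = 3                  # minutes of history to keep (placeholder)
       base_minima = []       # one minimum per minute, oldest first
       current_minute = None

       def update_base_delay(delay_sample):
           global current_minute
           minute = int(time.time() // 60)
           if minute != current_minute:
               # A new minute has started: open a new slot and keep
               # only the last N+1 per-minute minima.
               current_minute = minute
               base_minima.append(delay_sample)
               del base_minima[:-(N + 1)]
           else:
               base_minima[-1] = min(base_minima[-1], delay_sample)

       def base_delay():
           return min(base_minima)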
4.13. Timestamp errors

   One-way delay measurement needs to deal with timestamp errors.  We
   use the same locally linear clock model as the Network Time Protocol
   (NTP).  This model is valid for any differentiable clock.  A clock
   thus has a fixed offset from the true time and a skew.  We consider
   what to do about the offset and the skew separately.

4.13.1. Clock offset

   First, consider the case of zero skew.  The offset of each of the
   two clocks shows up as a fixed error in the one-way delay
   measurement.  The difference of the offsets is the absolute error of
   the one-way delay estimate.  We do not use this estimate directly,
   however; we use the difference between it and the base delay.
   Because the error (the difference of the clock offsets) is the same
   for the current delay and the base delay, it cancels out of the
   queuing delay estimate, which is what we actually use.  Clock offset
   is thus irrelevant to the design.
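   As a non-normative illustration, the following Python fragment
   demonstrates this cancellation with made-up numbers (the offset and
   delay values are arbitrary and serve only the illustration):

       OFFSET = 120.0          # receiver clock ahead of sender, in ms
       CONSTANT_DELAY = 40.0   # propagation + serialization +
                               # processing, in ms

       def measured_delay(queuing_delay):
           # What the receiver computes:
           # local_timestamp() - remote_timestamp.
           return CONSTANT_DELAY + queuing_delay + OFFSET

       base_delay = measured_delay(0.0)      # sample with an empty queue
       current_delay = measured_delay(25.0)  # sample with 25 ms queuing
       print(current_delay - base_delay)     # 25.0: the offset cancels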
4.13.2. Clock skew

   Now consider the skew.  For a given clock, skew manifests itself as
   a linearly changing error in the time estimate.  For a given pair of
   clocks, the difference of the skews is the skew of the one-way delay
   estimate.  Unlike the offset, this no longer cancels in the
   computation of the queuing delay estimate.  On the other hand, while
   the offset can be huge, with some clocks off by minutes or even
   hours or more, the skew is typically not too bad.  For example, NTP
   is designed to work with most clocks, yet it gives up when the skew
   is more than 500 parts per million (PPM).  Typical skews of clocks
   that have never been trained seem to often be around 100-200 PPM.
   Previously trained clocks can have a 10-20 PPM skew due to
   temperature changes.  A 100-PPM skew means accumulating 6
   milliseconds of error per minute.  The expiration of the base delay
   introduced to handle route changes mostly takes care of clock skew.
   A technique to specifically compute and cancel the skew is trivially
   possible and involves tracking the base delay skew over a number of
   minutes and then correcting for it, but it is usually unnecessary
   unless the target is unusually low, the skew is unusually high, or
   the base interval is unusually long.  It is not further described in
   this document.

4.14. Noise filtering

   In addition to timestamp errors, the one-way delay estimate includes
   a measurement error when part of the time measured was spent inside
   the sending or receiving machines.  Different views are possible on
   the nature of this delay: one view holds that, to the extent this
   delay internal to a machine is not constant, it is a variety of
   queuing delay and nothing needs to be done to detect or eliminate
   it; another view holds that, since this delay does not have the same
   characteristics as queuing delay induced by a fixed-capacity
   bottleneck, it is more correctly classified as non-constant
   processing delay and should be filtered out.  In practice, this does
   not seem to matter much one way or the other.  The way to filter the
   noise out is to observe, again, that the noise is always
   non-negative, so a good filter is the minimum of several recent
   delay measurements.

4.15. Safety of LEDBAT

   LEDBAT is most aggressive when its queuing delay estimate is most
   wrong and is as low as it can be.  The queuing delay estimate is
   non-negative, so the worst possible case is when the estimate is
   somehow always returned as zero.  In this case, LEDBAT ramps up as
   fast as TCP and halves its rate on loss.  Thus, in the case of the
   worst possible failure of the estimates, LEDBAT behaves identically
   to TCP.  This provides an extra safety net.

5. LEDBAT congestion control

   Consider two parties, a sender and a receiver, with the sender
   having an unlimited source of data to send to the receiver and the
   receiver merely acknowledging the data.  (In an actual protocol, it
   is more convenient to have bidirectional connections, but the
   unidirectional abstraction suffices to describe the congestion
   control mechanism.)

   Consider a protocol that uses packets of equal size and acknowledges
   each of them separately.  (Variable-sized packets and delayed
   acknowledgements are possible and are being implemented, but they
   complicate the exposition.)

   Assume that each data packet contains a header field, timestamp.
   The sender puts a timestamp from its clock into this field.  Further
   assume that each acknowledgement packet contains a field, delay.
   How this field is populated is shown below.

   Slow start behavior is unchanged in LEDBAT.  Note that rampup is
   faster in slow start than during congestion avoidance, so very
   conservative implementations MAY skip slow start altogether.

   As far as congestion control is concerned, the receiver is then very
   simple and operates as follows, in pseudocode:

     on data_packet:
       remote_timestamp = data_packet.timestamp
       acknowledgement.delay = local_timestamp() - remote_timestamp
       # fill in other fields of acknowledgement
       acknowledgement.send()

   The sender actually operates the congestion control algorithm and
   acts, in first approximation, as follows:

     on acknowledgement:
       current_delay = acknowledgement.delay
       base_delay = min(base_delay, current_delay)
       queuing_delay = current_delay - base_delay
       off_target = TARGET - queuing_delay
       cwnd += GAIN * off_target / cwnd

   The pseudocode above is a simplification and ignores noise filtering
   and base expiration.  The more precise pseudocode that takes these
   factors into account is as follows and MUST be followed:

     on acknowledgement:
       delay = acknowledgement.delay
       update_base_delay(delay)
       update_current_delay(delay)
       queuing_delay = current_delay() - base_delay()
       off_target = TARGET - queuing_delay
       cwnd += GAIN * off_target / cwnd

     update_current_delay(delay)
       # Maintain a list of the NOISE_FILTER last delays observed.
       forget the earliest of the NOISE_FILTER current_delays
       add delay to the end of current_delays

     current_delay()
       min(the NOISE_FILTER delays stored by update_current_delay)

     update_base_delay(delay)
       # Maintain BASE_HISTORY minimum delays.  Each represents a minute.
       if round_to_minute(now) != round_to_minute(last_rollover)
         last_rollover = now
         forget the earliest of base_delays
         add delay to the end of base_delays
       else
         last of base_delays = min(last of base_delays, delay)

     base_delay()
       min(the BASE_HISTORY minimum delays stored by update_base_delay)

   The TARGET parameter MUST be set to 25 milliseconds, and GAIN MUST
   be set so that the maximum rampup rate is the same as for TCP.
   BASE_HISTORY MUST be no less than 2 and SHOULD NOT be more than 10.
   NOISE_FILTER SHOULD be tuned so that it is at least 1 and no more
   than cwnd/2.

6. Security Considerations

   A network element on the path might choose to cause higher delay
   measurements than the real queuing delay so that LEDBAT backs off
   even when no congestion is present.  Shaping of traffic into an
   artificially narrow bottleneck cannot be counteracted, but faking of
   the timestamp field can be and SHOULD be.  A protocol using LEDBAT
   congestion control SHOULD authenticate the timestamp and delay
   fields, preferably as part of authenticating most of the rest of the
   packet, with the exception of volatile header fields.  The choice of
   an authentication mechanism that resists man-in-the-middle attacks
   is outside the scope of this document.

7. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Author's Address

   Stanislav Shalunov
   BitTorrent Inc
   612 Howard St, Suite 400
   San Francisco, CA 94105
   USA

   Email: shalunov@bittorrent.com
   URI:   http://shlang.com