LEDBAT WG                                                    S. Shalunov
Internet-Draft                                                  G. Hazel
Intended status: Experimental                             BitTorrent Inc
Expires: January 13, 2011                                  July 12, 2010

            Low Extra Delay Background Transport (LEDBAT)
                  draft-ietf-ledbat-congestion-02.txt

Abstract

LEDBAT is an alternative experimental congestion control algorithm.
LEDBAT enables a networking application to minimize the extra delay it induces in the bottleneck while saturating the bottleneck. It thus implements an end-to-end version of scavenger service. LEDBAT has been implemented in uTorrent, which is currently the most popular BitTorrent client, as the preferred congestion control mechanism, and has been deployed in the wild with favorable results.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 13, 2011.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Requirements notation
   2. Introduction
   3. LEDBAT design goals
   4. LEDBAT applicability
   5. LEDBAT motivation
      5.1. Simplest network topology
      5.2. Extra delay
      5.3. Queuing delay target
      5.4. Need to measure delay
      5.5. Queuing delay estimate
      5.6. Controller
      5.7. Max rampup rate same as TCP
      5.8. Halve on loss
      5.9. Yield to TCP
      5.10. Need for one-way delay
      5.11. Measuring one-way delay
      5.12. Route changes
      5.13. Timestamp errors
         5.13.1. Clock offset
         5.13.2. Clock skew
      5.14. Noise filtering
      5.15. Non-bulk flows
      5.16. LEDBAT framing and wire format
      5.17. Fairness between LEDBAT flows
         5.17.1. Late comers
      5.18. Safety of LEDBAT
   6. LEDBAT congestion control
   7. Security Considerations
   8. Normative References
   Authors' Addresses

1. Requirements notation

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. Introduction

The standard congestion control in TCP is based on loss and was not designed to drive delay to any given value. Because TCP needs losses to back off, when a FIFO bottleneck lacks AQM, TCP fills the buffer, effectively maximizing the possible delay. A large number of the thinnest links in the Internet, particularly most uplinks of home connections, lack AQM. They also frequently contain enough buffer space to push delays into hundreds of milliseconds and even seconds. There is no benefit to having delays this large, but there are very substantial drawbacks for interactive applications: games and VoIP become impossible, and even web browsing becomes very slow.

While a number of delay-based congestion control mechanisms have been proposed, they were generally not designed to minimize the delay induced in the network.

LEDBAT is designed to keep the latency across the congested bottleneck low even as the bottleneck is saturated. This allows applications that send large amounts of data, particularly upstream on home connections, such as peer-to-peer applications, to operate without destroying the user experience in interactive applications. LEDBAT takes advantage of delay measurements and backs off before loss occurs. It has been deployed by BitTorrent in the wild, first with the BitTorrent DNA client (a P2P-based CDN) and now with the uTorrent client. This mechanism not only keeps the delay across a bottleneck low, but also yields quickly in the presence of competing traffic with loss-based congestion control.
Beyond its utility for P2P, LEDBAT enables other advanced networking applications to better get out of the way of interactive apps.

In addition to direct and immediate benefits for P2P and other applications that can benefit from scavenger service, LEDBAT could point the way for a possible future evolution of the Internet where loss is not part of the designed behavior and delay is minimized.

3. LEDBAT design goals

LEDBAT design goals are:

   1. saturate the bottleneck
   2. keep delay low when no other traffic is present
   3. quickly yield to traffic sharing the same bottleneck queue that
      uses standard TCP congestion control
   4. add little to the queuing delays induced by TCP traffic
   5. operate well in networks with FIFO queuing with drop-tail
      discipline
   6. be deployable for popular applications that currently comprise
      noticeable fractions of Internet traffic
   7. where available, use explicit congestion notification (ECN),
      active queue management (AQM), and/or end-to-end differentiated
      services (DiffServ).

4. LEDBAT applicability

While originally deployed with a P2P application and a proprietary UDP framing, LEDBAT is applicable to any transport protocol that has provisions for its deployment and satisfies the wire format constraints (it can carry timestamps and delay estimates, and it ACKs frequently). Currently, no IETF standard transport specifies the use of LEDBAT, but all of TCP, DCCP, and SCTP could be adapted to use it.

LEDBAT has been tested on the Internet across a range of link speeds from kilobits to multiple gigabits per second and, given its deployment in uTorrent, across a very wide range of network topologies. It is designed to scale to higher speeds than those tested, which were the highest the available hardware would support.
On the low end of the link speed spectrum, LEDBAT works best with slower links when the packet size is reduced, thus reducing serialization delay. On slower rural and wireless links, even a single 1500-byte packet can have a substantial serialization delay that, when combined with the protocol target delay, can become too large for VoIP applications sharing the bottleneck. The mechanism to reduce packet size for slower networks is not further discussed in this document.

5. LEDBAT motivation

This section describes LEDBAT informally and provides some motivation. It is expected to be helpful for general understanding and useful in discussion of the properties of LEDBAT.

Without loss of generality, we consider only one direction of the data transfer. The opposite direction can be treated identically.

5.1. Simplest network topology

Consider first the desired behavior when there's only a single bottleneck and no competing traffic whatsoever, not even other LEDBAT connections. The design goals obviously need to be robustly met for this trivial case.

5.2. Extra delay

Consider the queuing delay on the bottleneck. This delay is the extra delay induced by congestion control. One of our design goals is to keep this delay low. However, when this delay is zero, the queue is empty, so no data is being transmitted and the link is thus not saturated. Hence, our design goal is to keep the queuing delay low, but non-zero.

5.3. Queuing delay target

How low do we want the queuing delay to be? Because another design goal is to be deployable on networks with only simple FIFO queuing and drop-tail discipline, we can't rely on explicit signaling for the queuing delay. So we're going to estimate it using external measurements. The external measurements will have an error at least on the order of best-case scheduling delays in the OSes.
There's thus a good reason to try to make the queuing delay larger than this error. There's no reason to push the delay much further up. Thus, we will have a delay target that we want to maintain.

5.4. Need to measure delay

To maintain delay near the target, we have to use delay measurements. Lacking delay measurements, we'd have to go only by loss (when ECN is lacking). For loss to occur (on a FIFO link with drop-tail discipline), the buffer must first be filled. This would drive the delay to the largest possible value for this link, thus violating our design goal of keeping delay low.

5.5. Queuing delay estimate

Since our goal is to control the queuing delay, it is natural to maintain an estimate of it. Let's call the delay components propagation, serialization, processing, and queuing. All components but queuing are nearly constant; queuing is variable. Because queuing delay is always positive, the constant propagation+serialization+processing delay is no less than the minimum delay observed. Assuming that the queuing delay distribution density has non-zero integral from zero to any sufficiently small upper limit, the minimum is also an asymptotically consistent estimate of the constant fraction of the delay. We can thus estimate the queuing delay as the difference between the current and base delay, as usual.

5.6. Controller

When our estimate of the queuing delay is lower than the target, it's natural to send faster. When our estimate is higher, it's natural to send slower. To avoid trivial oscillations on the round-trip-time (RTT) scale, the response of the controller needs to be near zero when the estimate is near the target. To converge faster, the response needs to increase as the difference increases.
The simplest controller 240 with this property is the linear controller, where the response is 241 proportional to the difference between the estimate and the target. 242 This controller happens to work well in practice obviating the need 243 for more complex ones. 245 5.7. Max rampup rate same as TCP 247 The maximum speed with which we can increase our congestion window is 248 then when queuing delay estimate is zero. To be on the safe side, 249 we'll make this speed equal to how fast TCP increases its sending 250 speed. Since queuing delay estimate is always non-negative, this 251 will ensure never ramping up faster than TCP would. 253 5.8. Halve on loss 255 Further, to deal with severe congestion when most packets are lost 256 and to provide a safety net against incorrect queuing delay 257 estimates, we'll halve the window when a loss event is detected. 258 We'll do so at most once per RTT. Note that, unlike with loss-based 259 congestion control, the normal mode of LEDBAT operation does not 260 induce losses and so normally does not rely on a loss rate to 261 determine the sending rate. This makes the details of reaction to 262 loss less important than in the case of loss-based congestion 263 control. For LEDBAT, reducing the congestion window on loss is a 264 safety net in case of severe congestion (or mistaken or broken delay 265 estimates). 267 5.9. Yield to TCP 269 Consider competition between a LEDBAT connection and a connection 270 governed by loss-based congestion control (on a FIFO bottleneck with 271 drop-tail discipline). Loss-based connection will need to experience 272 loss to back off. Loss will only occur after the connection 273 experiences maximum possible delays. LEDBAT will thus receive 274 congestion indication sooner than the loss-based connection. If 275 LEDBAT can ramp down faster than the loss-based connection ramps up, 276 LEDBAT will yield. 
LEDBAT ramps down when the queuing delay estimate exceeds the target: the greater the excess, the faster the ramp-down. When the loss-based connection is standard TCP, LEDBAT will yield at precisely the same rate as TCP is ramping up when the queuing delay is double the target.

5.10. Need for one-way delay

Now consider a case where one link direction is saturated with unrelated TCP traffic while the other direction is near-empty. Consider LEDBAT sending in the near-empty direction. Our design goal is to saturate it. However, if we pay attention to round-trip delays, we'll sense the delays on the reverse path and respond to them as described in the previous paragraph. We must, thus, measure one-way delay and use that for our queuing delay estimate.

5.11. Measuring one-way delay

A special IETF protocol, the One-Way Active Measurement Protocol (OWAMP), exists for measuring one-way delay. However, since LEDBAT will already be sending data, it is more efficient to add a timestamp to the packets in the data direction and a measurement result field in the acknowledgement direction. This also prevents the danger of measurement packets being treated differently from the data packets. The failure case would be better treatment of measurement packets, where the data connection would be driven to losses.

5.12. Route changes

Routes can change. To deal with this, the base delay needs to be computed over a period of the last few minutes instead of since the start of the connection. The tradeoff is: for longer intervals, the base is more accurate; for shorter intervals, reaction to route changes is faster.

A convenient way to implement an approximate minimum over the last N minutes is to keep separate minima for the last N+1 minutes (the last one for the partial current minute).

When the connection is idle for a given minute, no data are available for the one-way delay and, therefore, no minimum is kept.
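The approximate minimum over the last N minutes described in Section 5.12 can be sketched as follows; the class and method names, the default history length, and the injectable clock are illustrative assumptions, not part of the protocol.

```python
import time
from collections import deque

class BaseDelayHistory:
    """Approximate minimum one-way delay over the last N minutes,
    kept as N+1 per-minute minima (Section 5.12); the extra slot
    holds the partial current minute."""

    def __init__(self, n_minutes=10, clock=time.monotonic):
        self.clock = clock
        # Oldest minute first; deque drops the oldest minimum itself.
        self.minima = deque([float("inf")], maxlen=n_minutes + 1)
        self.minute_start = clock()

    def add(self, delay):
        # Roll over to a fresh per-minute minimum once a minute.
        if self.clock() - self.minute_start >= 60:
            self.minima.append(float("inf"))
            self.minute_start = self.clock()
        self.minima[-1] = min(self.minima[-1], delay)

    def base_delay(self):
        return min(self.minima)
```

An old route's low minimum ages out of the deque after N+1 minute rollovers, which is how the base delay recovers after a route change to a longer path.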
When the connection has been idle for N minutes, the measurement thus has to begin anew.

5.13. Timestamp errors

One-way delay measurement needs to deal with timestamp errors. We'll use the same locally linear clock model and the same terminology as the Network Time Protocol (NTP). This model is valid for any differentiable clocks. NTP uses the term "offset" to refer to the difference from true time and "skew" to refer to the difference of the clock rate from the true rate. The clock will thus have a fixed offset from the true time and a skew. We'll consider what we need to do about the offset and the skew separately.

5.13.1. Clock offset

First, consider the case of zero skew. The offset of each of the two clocks shows up as a fixed error in the one-way delay measurement. The difference of the offsets is the absolute error of the one-way delay estimate. We won't use this estimate directly, however. We'll use the difference between that and a base delay. Because the error (the difference of clock offsets) is the same for the current and base delay, it cancels from the queuing delay estimate, which is what we'll use. Clock offset is thus irrelevant to the design.

5.13.2. Clock skew

Now consider the skew. For a given clock, skew manifests as a linearly changing error in the time estimate. For a given pair of clocks, the difference in skews is the skew of the one-way delay estimate. Unlike the offset, this no longer cancels in the computation of the queuing delay estimate. On the other hand, while the offset could be huge, with some clocks off by minutes or even hours or more, the skew is typically not too bad. For example, NTP is designed to work with most clocks, yet it gives up when the skew is more than 500 parts per million (PPM). Typical skews of clocks that have never been trained seem to often be around 100-200 PPM.
Previously trained clocks could have 10-20 PPM skew due to temperature changes. A 100-PPM skew means accumulating 6 milliseconds of error per minute. The expiration of base delay related to route changes mostly takes care of clock skew. A technique to specifically compute and cancel it is trivially possible, and involves tracking the base delay skew over a number of minutes and then correcting for it, but it usually isn't necessary unless the target is unusually low, the skew is unusually high, or the base interval is unusually long. It is not further described in this document.

For cases when the base interval is long, the skew is high, or the target is low, a technique to correct for skew can be beneficial. The technique described here or a different mechanism MAY be used by implementations. The technique described is still experimental, but it is in actual current use. The pseudocode in the specification below does not include any of the skew correction algorithms.

5.13.2.1. Deployed clock skew correction mechanism

Clock skew can be in two directions: either the sender's clock is faster than the receiver's, or vice versa. We refer to the former situation as clock drift "in the sender's favor" and to the latter as clock drift "in the receiver's favor".

When the clock drift is "in the sender's favor", nothing special needs to be done, because the timestamp differences (i.e., the raw delay estimates) will grow smaller with time, and thus the base delay will be continuously updated with the drift.

When the clock drift is "in the receiver's favor", the raw delay estimates will drift up with time, suppressing the throughput needlessly. This is the case that can benefit from a special mechanism. Assume symmetrical framing (i.e., the same information about delays transmitted both ways).
If the sender can detect the receiver reducing its base delay, it can infer that this is due to clock drift "in the receiver's favor". This can be compensated for by increasing the sender's base delay by the same amount. Since, in our implementation, the receiver sends back the raw timestamp estimate, the sender can run the same base delay calculation algorithm it runs for itself for the receiver as well; when it reduces the inferred receiver's delay, it increases its own by the same amount.

5.13.2.2. Skew correction with faster virtual clock

This is an alternative skew correction algorithm, currently under consideration and not deployed in the wild.

Since having a faster clock on the sender is, relatively speaking, a non-problem, one can use two different virtual clocks in each LEDBAT implementation: use, for example, the default machine clock for situations where the instance is acting as a receiver, and use a virtual clock, easily computed from the default machine clock through a linear transformation, for situations where the instance is acting as a sender. Make the virtual clock, e.g., 500 PPM faster than the machine clock. Since 500 PPM is more than the variability of most clocks (plus or minus 100 PPM), any sender's clock is very likely to be faster than any receiver's clock, thus benefitting from the implicit correction of taking the minimum as the base delay.

Note that this approach is not compatible with the one described in the preceding section.

5.13.2.3. Skew correction with estimating drift

This is an alternative skew correction algorithm, currently under consideration and not deployed in the wild.

The history of base delay minima we already keep for each minute provides us with a direct means of computing the clock skew difference between the two hosts.
Namely, we can fit a linear function to the set of base delay estimates for each minute. The slope of the function is an estimate of the clock skew difference for the given pair of sender and receiver. Once the clock skew difference is estimated, it can be used to correct the clocks so that they advance at nearly the same rate. Namely, each clock needs to be corrected by half of the estimated skew amount, since the other half will be corrected by the other endpoint. Note that the skew differences are then maintained per connection, and the virtual clocks used with each connection can differ, since they do not attempt to estimate the skew with respect to the true time, but instead with respect to the other endpoint.

5.13.2.3.1. Byzantine skew correction

This is an alternative skew correction algorithm, currently under consideration and not deployed in the wild.

When it is known that each host maintains long-lived connections to a number of different other hosts, a Byzantine scheme can be used to estimate the skew with respect to the true time. Namely, calculate the skew difference for each of the peer hosts as described in the preceding section, then take the median of the skew differences.

This inherent clock drift can then be corrected with a linear transformation before the clock data is used in the algorithm from the preceding section, the currently deployed algorithm, or nearly any other skew correction algorithm.

While this scheme is not universally applicable, it combines well with other schemes, since it is essentially a clock training mechanism. The scheme also acts the fastest, since its state is preserved between connections.

5.14. Noise filtering

In addition to timestamp errors, the one-way delay estimate includes a measurement error when part of the time measured was spent inside the sending or the receiving machines.
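The skew estimation of Section 5.13.2.3 and the Byzantine median of Section 5.13.2.3.1 can be sketched as follows; the function names and the minute-indexed input format are illustrative assumptions.

```python
from statistics import median

def estimate_skew(minute_minima):
    """Least-squares slope of per-minute base delay minima
    (Section 5.13.2.3).  Input: one delay minimum per minute, in
    seconds, oldest first (at least two minutes).  Output: skew
    difference as seconds of error accumulated per minute
    (e.g., a 100-PPM skew gives roughly 0.006)."""
    n = len(minute_minima)
    xs = range(n)                       # minute indices
    mean_x = sum(xs) / n
    mean_y = sum(minute_minima) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, minute_minima))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def byzantine_skew(per_peer_minima):
    """Median of per-peer skew differences (Section 5.13.2.3.1),
    an estimate of this host's skew with respect to true time."""
    return median(estimate_skew(m) for m in per_peer_minima)
```

Per Section 5.13.2.3, a host would correct its clock by half of the per-connection estimate, leaving the other half to the peer.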
Different views are possible on the nature of this delay: one view holds that, to the extent this delay internal to a machine is not constant, it is a variety of queuing delay, and nothing needs to be done to detect or eliminate it; another view holds that, since this delay does not have the same characteristics as queuing delay induced by a fixed-capacity bottleneck, it is more correctly classified as non-constant processing delay and should be filtered out. In practice, this doesn't seem to matter very much one way or the other. The way to filter the noise out is to observe, again, that the noise is always nonnegative, and so a good filter is the minimum of several recent delay measurements.

5.15. Non-bulk flows

Normally in transport, we're mainly concerned with bulk flows with infinite sources. Sometimes this model needs modification. For example, an application might be limiting the rate at which data is fed to the transport layer. If the offered rate can be carried by the network without any sign of congestion, an adaptive congestion control loop will increase the rate allowed by congestion control. The LEDBAT mechanism in the form described above will keep increasing the congestion window linearly with time. This is clearly wrong for non-bulk flows because it allows the congestion control to decide that it is allowed to send much faster than it has ever sent. Therefore, some special treatment is required to deal with the case of non-bulk flows.

The same problem may also occur in some TCP implementations. It was first noticed for TCP in situations where a connection has been idle for a substantial time, and the original workaround was to re-enter slow start after an idle period.
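The noise filter of Section 5.14 can be sketched as a minimum over the last few delay samples; the class name and the NOISE_FILTER value are illustrative assumptions.

```python
from collections import deque

NOISE_FILTER = 4  # recent delay samples to keep (illustrative value)

class CurrentDelayFilter:
    """Filter out nonnegative measurement noise (Section 5.14) by
    taking the minimum of the last NOISE_FILTER delay samples."""

    def __init__(self, size=NOISE_FILTER):
        # Unfilled slots are +infinity, so they never win the minimum.
        self.samples = deque([float("inf")] * size, maxlen=size)

    def add(self, delay):
        self.samples.append(delay)  # oldest sample drops out

    def current_delay(self):
        return min(self.samples)
```

A single noise-inflated sample is ignored until it ages out of the window; a genuine delay increase persists across samples and shows through once it fills the window.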
While re-entering slow start (or, more generally, returning to the original state of the connection) is a good idea after a long idle period, it does not solve the problem for connections that are trickling data out at an application-determined rate. In fact, even application-level keepalives or small metadata exchanges can be enough to keep the connection from becoming idle for long. The fix for this problem is to limit the growth of the congestion window by a quantity related to the current flight size, i.e., the amount of outstanding unacknowledged data.

The congestion window should not grow beyond a factor of the current flight size for sufficiently large values of the flight size. This provides leeway for the congestion window to grow and probe the network beyond the current rate, while at the same time keeping it on a tether that does not let it increase indefinitely. A refinement of this rule is that the congestion window should not grow beyond a constant plus a factor of the flight size. The additive constant is necessary because the congestion window needs to be able to increase by at least one packet even when it is small -- otherwise no probing is possible. The smallest value, one packet, would then define the minimum value of the factor: 2. However, for larger values of the congestion window, a tighter bound than "double the flight size" may be desirable. The additive constant makes it possible to use values between 1 and 2 for the factor.

5.16. LEDBAT framing and wire format

The actual framing and wire format of the protocol(s) using the LEDBAT congestion control mechanism are outside the scope of this document, which only describes the congestion control part.

One implication is that the sender needs to know the one-way delay from the sender to the receiver.
An obvious way to support this is to use a framing that timestamps packets at the sender and conveys the measured one-way delay back to the sender in ACK packets. This is the method we'll keep in mind for the purposes of exposition. Other methods are possible and valid.

Another implication is the receipt of frequent ACK packets. The exposition below assumes one ACK per data packet, but any reasonably small number of data packets per ACK will work as long as there is at least one ACK every round-trip time.

The protocols to which this congestion control mechanism is applicable, with possible appropriate extensions, are TCP, SCTP, DCCP, etc. It is not a goal of this document to cover such applications. The mechanism can also be used with proprietary transport protocols, e.g., those built over UDP for P2P applications.

5.17. Fairness between LEDBAT flows

The design goals of LEDBAT center on the aggregate behavior of LEDBAT flows when they compete with standard TCP. It is also interesting how LEDBAT flows share bottleneck bandwidth when they compete only between themselves.

LEDBAT as described so far lacks a mechanism specifically designed to equalize utilization between these flows. The observed behavior of existing implementations indicates that a rough equalization does, in fact, occur.

The delay measurements used as control inputs by LEDBAT contain some amount of noise and errors. The linear controller converts this input noise into the same amount of output noise. The effect is that the uncorrelated component of the noise between flows serves to randomly shuffle some amount of bandwidth between flows. The amount shuffled during each RTT is proportional to the noise divided by the target delay. The random-walk trajectory of bandwidth utilized by each of the flows over time tends to the fair share.
The timescales on which the rates become comparable are proportional to the target delay multiplied by the RTT and divided by the noise.

In complex real-life systems, the main concern is usually the reduction of the amount of noise, which is copious if not eliminated. In some circumstances, however, the measurements might be "too good" -- since the equalization timescale is inversely proportional to the noise, perfect measurements would result in a lack of convergence.

Under these circumstances, it may be beneficial to introduce some artificial randomness into the inputs (or, equivalently, the outputs) of the controller. Note that most systems should not require this and should be primarily concerned with reducing, not adding, noise.

5.17.1. Late comers

With delay-based congestion control systems, there's a concern about the ability of late comers to measure the base delay correctly. Suppose a LEDBAT flow saturates a bottleneck; another LEDBAT flow starts and proceeds to measure the base delay and the current delay and to estimate the queuing delay. If the bottleneck always contains a target delay's worth of packets, the second flow would see the bottleneck as empty and start building a second target delay's worth of queue on top of the existing queue. The concern ("late comers' advantage") is that the initial flow would now back off, because it sees the real delay, and the late comer would use the whole capacity.

However, once the initial flow yields, the late comer immediately measures the true base delay, and the two flows operate from the same (correct) inputs.

Additionally, in practice this concern is further alleviated by the burstiness of network traffic: all that's needed to measure the base delay is one small gap.
These gaps can occur for a variety of reasons: the OS may delay the scheduling of the sending process until a time slice ends, the sending computer might be unusually busy for some number of milliseconds or tens of milliseconds, etc. If such a gap occurs while the late comer is starting, the base delay is immediately measured correctly. With a small number of flows, this appears to be the main mechanism regulating the late comers' advantage.

5.18. Safety of LEDBAT

LEDBAT is most aggressive when its queuing delay estimate is most wrong, i.e., as low as it can possibly be. The queuing delay estimate is nonnegative, so the worst possible case is one in which the estimate is somehow always returned as zero. In that case, LEDBAT will ramp up as fast as TCP and halve its rate on loss. Thus, under the worst possible failure of the estimates, LEDBAT will behave identically to TCP. This provides an extra safety net.

6. LEDBAT congestion control

Consider two parties, a sender and a receiver, with the sender having an unlimited source of data to send to the receiver and the receiver merely acknowledging the data. (In an actual protocol, it is more convenient to have bidirectional connections, but the unidirectional abstraction suffices to describe the congestion control mechanism.)

Consider a protocol that uses packets of equal size and acknowledges each of them separately. (Variable-sized packets and delayed acknowledgements are possible and are being implemented, but they complicate the exposition.)

Assume that each data packet contains a header field named timestamp. The sender puts a timestamp from its clock into this field. Further assume that each acknowledgement packet contains a field named delay; how it is populated is shown below.

Slow start behavior is unchanged in LEDBAT.
Note that ramp-up is faster in slow start than during congestion avoidance, and so very conservative implementations MAY skip slow start altogether.

As far as congestion control is concerned, the receiver is then very simple and operates as follows, in pseudocode:

   on data_packet:
       remote_timestamp = data_packet.timestamp
       acknowledgement.delay = local_timestamp() - remote_timestamp
       # fill in other fields of acknowledgement
       acknowledgement.send()

The sender actually operates the congestion control algorithm and acts, to a first approximation, as follows:

   on initialization:
       base_delay = +infinity

   on acknowledgement:
       current_delay = acknowledgement.delay
       base_delay = min(base_delay, current_delay)
       queuing_delay = current_delay - base_delay
       off_target = TARGET - queuing_delay
       cwnd += GAIN * off_target / cwnd

The pseudocode above is a simplification that ignores noise filtering and base expiration. The more precise pseudocode that takes these factors into account is as follows and MUST be followed:

   on initialization:
       set all NOISE_FILTER delays used by current_delay() to +infinity
       set all BASE_HISTORY delays used by base_delay() to +infinity
       last_rollover = -infinity  # more than a minute in the past

   on acknowledgement:
       delay = acknowledgement.delay
       update_base_delay(delay)
       update_current_delay(delay)
       queuing_delay = current_delay() - base_delay()
       off_target = TARGET - queuing_delay + random_input()
       cwnd += GAIN * off_target / cwnd
       # flight_size() is the amount of currently unacknowledged data
       max_allowed_cwnd = ALLOWED_INCREASE + TETHER*flight_size()
       cwnd = min(cwnd, max_allowed_cwnd)

   random_input()
       # random() is a PRNG between 0.0 and 1.0
       # NB: RANDOMNESS_AMOUNT is normally 0
       RANDOMNESS_AMOUNT * TARGET * ((random() - 0.5)*2)

   update_current_delay(delay)
       # Maintain a list of NOISE_FILTER last delays observed.
       forget the earliest of NOISE_FILTER current_delays
       add delay to the end of current_delays

   current_delay()
       min(the NOISE_FILTER delays stored by update_current_delay)

   update_base_delay(delay)
       # Maintain BASE_HISTORY min delays.  Each represents a minute.
       if round_to_minute(now) != round_to_minute(last_rollover)
           last_rollover = now
           forget the earliest of base_delays
           add delay to the end of base_delays
       else
           last of base_delays = min(last of base_delays, delay)

   base_delay()
       min(the BASE_HISTORY min delays stored by update_base_delay)

The TARGET parameter MUST be set to 100 milliseconds. GAIN SHOULD be set so that the maximum ramp-up rate is the same as TCP's, and MAY be up to twice that. BASE_HISTORY SHOULD be 10; it MUST be no less than 2 and SHOULD NOT be more than 20. NOISE_FILTER SHOULD be 1; it MAY be tuned so that it is at least 1 and no more than cwnd/2. ALLOWED_INCREASE SHOULD be 1 packet; it MUST be at least 1 packet and SHOULD NOT be more than 3 packets. TETHER SHOULD be 1.5; it MUST be greater than 1. RANDOMNESS_AMOUNT SHOULD be 0; it MUST be between 0 and 0.1 inclusive.

Note, in particular, that using a consistent TARGET value is important, as flows using different TARGET values will not share a bottleneck well: flows with higher values will tend to get a disproportionately large share of the bottleneck. This is why TARGET is the only parameter that is specified with no implementation flexibility. (Historically, uTorrent used different values of TARGET, both as a temporary safety precaution and later experimentally. The current choice is based on the experience with those different values.)

7. Security Considerations

A network on the path might choose to cause higher delay measurements than the real queuing delay, so that LEDBAT backs off even when no congestion is present.
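To illustrate the exposure with the simplified sender step from Section 6 (a sketch of ours; the delay values are illustrative), an on-path party that inflates the reported delay field drives off_target negative, so the sender shrinks cwnd even on an otherwise idle path:

```python
TARGET = 100.0  # ms, per the parameter values above

def off_target(reported_delay_ms, base_delay_ms):
    # Simplified sender step from Section 6: a positive off_target
    # grows cwnd, a negative one shrinks it.
    queuing_delay = reported_delay_ms - base_delay_ms
    return TARGET - queuing_delay

base = 30.0                            # ms, true base delay, idle path
honest = off_target(30.0, base)        # no queuing: +100.0, cwnd grows
faked = off_target(230.0, base)        # inflated delay field: -100.0,
                                       # cwnd shrinks despite no congestion
```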
Shaping of traffic into an artificially narrow bottleneck cannot be counteracted, but faking of the timestamp field can and SHOULD be. A protocol using LEDBAT congestion control SHOULD authenticate the timestamp and delay fields, preferably as part of authenticating most of the rest of the packet, with the exception of volatile header fields. The choice of an authentication mechanism that resists man-in-the-middle attacks is outside the scope of this document.

8. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Stanislav Shalunov
   BitTorrent Inc
   612 Howard St, Suite 400
   San Francisco, CA 94105
   USA

   Email: shalunov@bittorrent.com
   URI:   http://shlang.com

   Greg Hazel
   BitTorrent Inc
   612 Howard St, Suite 400
   San Francisco, CA 94105
   USA

   Email: greg@bittorrent.com