LEDBAT WG                                                    S. Shalunov
Internet-Draft                                                  G. Hazel
Intended status: Experimental                             BitTorrent Inc
Expires: January 13, 2011                                  July 12, 2010

            Low Extra Delay Background Transport (LEDBAT)
                  draft-ietf-ledbat-congestion-02.txt

Abstract

LEDBAT is an alternative experimental congestion control algorithm.
LEDBAT enables a networking application to minimize the extra delay it induces in the bottleneck while saturating the bottleneck. It thus implements an end-to-end version of scavenger service. LEDBAT has been implemented in uTorrent, which is currently the most popular BitTorrent client, as the preferred congestion control mechanism, and has been deployed in the wild with favorable results.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 13, 2011.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Requirements notation
   2. Introduction
   3. LEDBAT design goals
   4. LEDBAT applicability
   5. LEDBAT motivation
      5.1. Simplest network topology
      5.2. Extra delay
      5.3. Queuing delay target
      5.4. Need to measure delay
      5.5. Queuing delay estimate
      5.6. Controller
      5.7. Max rampup rate same as TCP
      5.8. Halve on loss
      5.9. Yield to TCP
      5.10. Need for one-way delay
      5.11. Measuring one-way delay
      5.12. Route changes
      5.13. Timestamp errors
         5.13.1. Clock offset
         5.13.2. Clock skew
      5.14. Noise filtering
      5.15. Non-bulk flows
      5.16. LEDBAT framing and wire format
      5.17. Fairness between LEDBAT flows
         5.17.1. Late comers
      5.18. Safety of LEDBAT
   6. LEDBAT congestion control
   7. Security Considerations
   8. Normative References
   Authors' Addresses

1. Requirements notation

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. Introduction

The standard congestion control in TCP is based on loss and was not designed to drive delay to any given value. Because TCP needs losses to back off, when a FIFO bottleneck lacks AQM, TCP fills the buffer, effectively maximizing the possible delay. A large number of the thinnest links in the Internet, particularly most uplinks of home connections, lack AQM. They also frequently contain enough buffer space to push delays into hundreds of milliseconds and even seconds. There is no benefit to having delays this large, but there are very substantial drawbacks for interactive applications: games and VoIP become impossible, and even web browsing becomes very slow.

While a number of delay-based congestion control mechanisms have been proposed, they were generally not designed to minimize the delay induced in the network.

LEDBAT is designed to keep the latency across the congested bottleneck low even as the bottleneck is saturated. This allows applications that send large amounts of data, particularly upstream on home connections, such as peer-to-peer applications, to operate without destroying the user experience in interactive applications. LEDBAT takes advantage of delay measurements and backs off before loss occurs. It has been deployed by BitTorrent in the wild, first with the BitTorrent DNA client (a P2P-based CDN) and now with the uTorrent client. This mechanism not only keeps the delay across a bottleneck low, but also yields quickly in the presence of competing traffic with loss-based congestion control.
Beyond its utility for P2P, LEDBAT enables other advanced networking applications to better get out of the way of interactive apps.

In addition to direct and immediate benefits for P2P and other applications that can benefit from scavenger service, LEDBAT could point the way for a possible future evolution of the Internet where loss is not part of the designed behavior and delay is minimized.

3. LEDBAT design goals

LEDBAT design goals are:

   1. saturate the bottleneck
   2. keep delay low when no other traffic is present
   3. quickly yield to traffic sharing the same bottleneck queue that
      uses standard TCP congestion control
   4. add little to the queuing delays induced by TCP traffic
   5. operate well in networks with FIFO queuing with drop-tail
      discipline
   6. be deployable for popular applications that currently comprise
      noticeable fractions of Internet traffic
   7. where available, use explicit congestion notification (ECN),
      active queue management (AQM), and/or end-to-end differentiated
      services (DiffServ).

4. LEDBAT applicability

While originally deployed with a P2P application and a proprietary UDP framing, LEDBAT is applicable to any transport protocol that has provisions for its deployment and satisfies the wire format constraints (it can carry timestamps and delay estimates, and it ACKs frequently). Currently, no IETF standard transport specifies the use of LEDBAT, but all of TCP, DCCP, and SCTP could be adapted to use it.

LEDBAT has been tested on the Internet across a range of link speeds from kilobits to multiple gigabits per second and, given its deployment in uTorrent, across a very wide range of network topologies. It is designed to scale to higher speeds than those tested, which were the highest the available hardware would support.
On the low end of the link speed spectrum, LEDBAT works best with slower links when the packet size is reduced, thus reducing serialization delay. On slower rural and wireless links, even a single 1500-byte packet can have a substantial serialization delay that, when combined with the protocol target delay, can become too large for VoIP applications sharing the bottleneck. The mechanism to reduce packet size for slower networks is not further discussed in this document.

5. LEDBAT motivation

This section describes LEDBAT informally and provides some motivation. It is expected to be helpful for general understanding and useful in discussion of the properties of LEDBAT.

Without loss of generality, we consider only one direction of the data transfer. The opposite direction can be treated identically.

5.1. Simplest network topology

Consider first the desired behavior when there's only a single bottleneck and no competing traffic whatsoever, not even other LEDBAT connections. The design goals obviously need to be robustly met for this trivial case.

5.2. Extra delay

Consider the queuing delay on the bottleneck. This delay is the extra delay induced by congestion control. One of our design goals is to keep this delay low. However, when this delay is zero, the queue is empty, so no data is being transmitted and the link is thus not saturated. Hence, our design goal is to keep the queuing delay low, but non-zero.

5.3. Queuing delay target

How low do we want the queuing delay to be? Because another design goal is to be deployable on networks with only simple FIFO queuing and drop-tail discipline, we can't rely on explicit signaling for the queuing delay. So we're going to estimate it using external measurements. The external measurements will have an error at least on the order of best-case scheduling delays in the OSes.
There's thus a good reason to try to make the queuing delay larger than this error. There's no reason to push the delay much further up. Thus, we will have a delay target that we want to maintain.

5.4. Need to measure delay

To maintain delay near the target, we have to use delay measurements. Lacking delay measurements, we'd have to go only by loss (when ECN is lacking). For loss to occur (on a FIFO link with drop-tail discipline), the buffer must first be filled. This would drive the delay to the largest possible value for this link, thus violating our design goal of keeping delay low.

5.5. Queuing delay estimate

Since our goal is to control the queuing delay, it is natural to maintain an estimate of it. Let's call the delay components propagation, serialization, processing, and queuing. All components but queuing are nearly constant; queuing is variable. Because queuing delay is always positive, the constant propagation+serialization+processing delay is no less than the minimum delay observed. Assuming that the queuing delay distribution density has non-zero integral from zero to any sufficiently small upper limit, the minimum is also an asymptotically consistent estimate of the constant fraction of the delay. We can thus estimate the queuing delay as the difference between the current and base delay, as usual.

5.6. Controller

When our estimate of the queuing delay is lower than the target, it's natural to send faster. When our estimate is higher, it's natural to send slower. To avoid trivial oscillations on the round-trip-time (RTT) scale, the response of the controller needs to be near zero when the estimate is near the target. To converge faster, the response needs to increase as the difference increases.
The simplest controller 240 with this property is the linear controller, where the response is 241 proportional to the difference between the estimate and the target. 242 This controller happens to work well in practice obviating the need 243 for more complex ones. 245 5.7. Max rampup rate same as TCP 247 The maximum speed with which we can increase our congestion window is 248 then when queuing delay estimate is zero. To be on the safe side, 249 we'll make this speed equal to how fast TCP increases its sending 250 speed. Since queuing delay estimate is always non-negative, this 251 will ensure never ramping up faster than TCP would. 253 5.8. Halve on loss 255 Further, to deal with severe congestion when most packets are lost 256 and to provide a safety net against incorrect queuing delay 257 estimates, we'll halve the window when a loss event is detected. 258 We'll do so at most once per RTT. Note that, unlike with loss-based 259 congestion control, the normal mode of LEDBAT operation does not 260 induce losses and so normally does not rely on a loss rate to 261 determine the sending rate. This makes the details of reaction to 262 loss less important than in the case of loss-based congestion 263 control. For LEDBAT, reducing the congestion window on loss is a 264 safety net in case of severe congestion (or mistaken or broken delay 265 estimates). 267 5.9. Yield to TCP 269 Consider competition between a LEDBAT connection and a connection 270 governed by loss-based congestion control (on a FIFO bottleneck with 271 drop-tail discipline). Loss-based connection will need to experience 272 loss to back off. Loss will only occur after the connection 273 experiences maximum possible delays. LEDBAT will thus receive 274 congestion indication sooner than the loss-based connection. If 275 LEDBAT can ramp down faster than the loss-based connection ramps up, 276 LEDBAT will yield. 
LEDBAT ramps down when the queuing delay estimate exceeds the target: the greater the excess, the faster the ramp-down. When the loss-based connection is standard TCP, LEDBAT will yield at precisely the same rate as TCP is ramping up when the queuing delay is double the target.

5.10. Need for one-way delay

Now consider a case where one link direction is saturated with unrelated TCP traffic while the other direction is near-empty. Consider LEDBAT sending in the near-empty direction. Our design goal is to saturate it. However, if we pay attention to round-trip delays, we'll sense the delays on the reverse path and respond to them as described in the previous paragraph. We must, thus, measure one-way delay and use that for our queuing delay estimate.

5.11. Measuring one-way delay

A special IETF protocol, the One-Way Active Measurement Protocol (OWAMP), exists for measuring one-way delay. However, since LEDBAT will already be sending data, it is more efficient to add a timestamp to the packets in the data direction and a measurement result field in the acknowledgement direction. This also prevents the danger of measurement packets being treated differently from the data packets. The failure case would be better treatment of measurement packets, where the data connection would be driven to losses.

5.12. Route changes

Routes can change. To deal with this, the base delay needs to be computed over a period of the last few minutes instead of since the start of the connection. The tradeoff is: for longer intervals, the base is more accurate; for shorter intervals, reaction to route changes is faster.

A convenient way to implement an approximate minimum over the last N minutes is to keep separate minima for the last N+1 minutes (the last one for the partial current minute).

When the connection is idle for a given minute, no data are available for the one-way delay and, therefore, no minimum is kept.
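The approximate minimum over the last N minutes described in Section 5.12 can be sketched as follows; the class and method names, the default history length, and the injectable clock are illustrative assumptions, not part of the protocol.

```python
import time
from collections import deque

class BaseDelayHistory:
    """Approximate minimum one-way delay over the last N minutes,
    kept as N+1 per-minute minima (Section 5.12); the extra slot
    holds the partial current minute."""

    def __init__(self, n_minutes=10, clock=time.monotonic):
        self.clock = clock
        # Oldest minute first; deque drops the oldest minimum itself.
        self.minima = deque([float("inf")], maxlen=n_minutes + 1)
        self.minute_start = clock()

    def add(self, delay):
        # Roll over to a fresh per-minute minimum once a minute.
        if self.clock() - self.minute_start >= 60:
            self.minima.append(float("inf"))
            self.minute_start = self.clock()
        self.minima[-1] = min(self.minima[-1], delay)

    def base_delay(self):
        return min(self.minima)
```

An old route's low minimum ages out of the deque after N+1 minute rollovers, which is how the base delay recovers after a route change to a longer path.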
When the connection has been idle for N minutes, the measurement thus has to begin anew.

5.13. Timestamp errors

One-way delay measurement needs to deal with timestamp errors. We'll use the same locally linear clock model and the same terminology as the Network Time Protocol (NTP). This model is valid for any differentiable clocks. NTP uses the term "offset" to refer to the difference from true time and "skew" to refer to the difference of the clock rate from the true rate. The clock will thus have a fixed offset from the true time and a skew. We'll consider what we need to do about the offset and the skew separately.

5.13.1. Clock offset

First, consider the case of zero skew. The offset of each of the two clocks shows up as a fixed error in the one-way delay measurement. The difference of the offsets is the absolute error of the one-way delay estimate. We won't use this estimate directly, however. We'll use the difference between that and a base delay. Because the error (the difference of clock offsets) is the same for the current and base delay, it cancels from the queuing delay estimate, which is what we'll use. Clock offset is thus irrelevant to the design.

5.13.2. Clock skew

Now consider the skew. For a given clock, skew manifests as a linearly changing error in the time estimate. For a given pair of clocks, the difference in skews is the skew of the one-way delay estimate. Unlike the offset, this no longer cancels in the computation of the queuing delay estimate. On the other hand, while the offset could be huge, with some clocks off by minutes or even hours or more, the skew is typically not too bad. For example, NTP is designed to work with most clocks, yet it gives up when the skew is more than 500 parts per million (PPM). Typical skews of clocks that have never been trained seem to often be around 100-200 PPM.
Previously trained clocks could have 10-20 PPM skew due to temperature changes. A 100-PPM skew means accumulating 6 milliseconds of error per minute. The expiration of base delay related to route changes mostly takes care of clock skew. A technique to specifically compute and cancel it is trivially possible, and involves tracking the base delay skew over a number of minutes and then correcting for it, but it usually isn't necessary unless the target is unusually low, the skew is unusually high, or the base interval is unusually long. It is not further described in this document.

For cases when the base interval is long, the skew is high, or the target is low, a technique to correct for skew can be beneficial. The technique described here or a different mechanism MAY be used by implementations. The technique described is still experimental, but it is in actual current use. The pseudocode in the specification below does not include any of the skew correction algorithms.

5.13.2.1. Deployed clock skew correction mechanism

Clock skew can be in two directions: either the sender's clock is faster than the receiver's, or vice versa. We refer to the former situation as clock drift "in the sender's favor" and to the latter as clock drift "in the receiver's favor".

When the clock drift is "in the sender's favor", nothing special needs to be done, because the timestamp differences (i.e., the raw delay estimates) will grow smaller with time, and thus the base delay will be continuously updated with the drift.

When the clock drift is "in the receiver's favor", the raw delay estimates will drift up with time, suppressing the throughput needlessly. This is the case that can benefit from a special mechanism. Assume symmetrical framing (i.e., the same information about delays transmitted both ways).
If the sender can detect the receiver reducing its base delay, it can infer that this is due to clock drift "in the receiver's favor". This can be compensated for by increasing the sender's base delay by the same amount. Since, in our implementation, the receiver sends back the raw timestamp estimate, the sender can run the same base delay calculation algorithm it runs for itself for the receiver as well; when it reduces the inferred receiver's delay, it increases its own by the same amount.

5.13.2.2. Skew correction with faster virtual clock

This is an alternative skew correction algorithm, currently under consideration and not deployed in the wild.

Since having a faster clock on the sender is, relatively speaking, a non-problem, one can use two different virtual clocks in each LEDBAT implementation: use, for example, the default machine clock for situations where the instance is acting as a receiver, and use a virtual clock, easily computed from the default machine clock through a linear transformation, for situations where the instance is acting as a sender. Make the virtual clock, e.g., 500 PPM faster than the machine clock. Since 500 PPM is more than the variability of most clocks (plus or minus 100 PPM), any sender's clock is very likely to be faster than any receiver's clock, thus benefitting from the implicit correction of taking the minimum as the base delay.

Note that this approach is not compatible with the one described in the preceding section.

5.13.2.3. Skew correction with estimating drift

This is an alternative skew correction algorithm, currently under consideration and not deployed in the wild.

The history of base delay minima we already keep for each minute provides us with a direct means of computing the clock skew difference between the two hosts.
Namely, we can fit a linear function to the set of base delay estimates for each minute. The slope of the function is an estimate of the clock skew difference for the given pair of sender and receiver. Once the clock skew difference is estimated, it can be used to correct the clocks so that they advance at nearly the same rate. Namely, each clock needs to be corrected by half of the estimated skew amount, since the other half will be corrected by the other endpoint. Note that the skew differences are then maintained per connection, and the virtual clocks used with each connection can differ, since they do not attempt to estimate the skew with respect to the true time, but instead with respect to the other endpoint.

5.13.2.3.1. Byzantine skew correction

This is an alternative skew correction algorithm, currently under consideration and not deployed in the wild.

When it is known that each host maintains long-lived connections to a number of different other hosts, a Byzantine scheme can be used to estimate the skew with respect to the true time. Namely, calculate the skew difference for each of the peer hosts as described in the preceding section, then take the median of the skew differences.

This inherent clock drift can then be corrected with a linear transformation before the clock data is used in the algorithm from the preceding section, the currently deployed algorithm, or nearly any other skew correction algorithm.

While this scheme is not universally applicable, it combines well with other schemes, since it is essentially a clock training mechanism. The scheme also acts the fastest, since its state is preserved between connections.

5.14. Noise filtering

In addition to timestamp errors, the one-way delay estimate includes a measurement error when part of the time measured was spent inside the sending or the receiving machines.
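The skew estimation of Section 5.13.2.3 and the Byzantine median of Section 5.13.2.3.1 can be sketched as follows; the function names and the minute-indexed input format are illustrative assumptions.

```python
from statistics import median

def estimate_skew(minute_minima):
    """Least-squares slope of per-minute base delay minima
    (Section 5.13.2.3).  Input: one delay minimum per minute, in
    seconds, oldest first (at least two minutes).  Output: skew
    difference as seconds of error accumulated per minute
    (e.g., a 100-PPM skew gives roughly 0.006)."""
    n = len(minute_minima)
    xs = range(n)                       # minute indices
    mean_x = sum(xs) / n
    mean_y = sum(minute_minima) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, minute_minima))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def byzantine_skew(per_peer_minima):
    """Median of per-peer skew differences (Section 5.13.2.3.1),
    an estimate of this host's skew with respect to true time."""
    return median(estimate_skew(m) for m in per_peer_minima)
```

Per Section 5.13.2.3, a host would correct its clock by half of the per-connection estimate, leaving the other half to the peer.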
Different views are possible on the nature of this delay: one view holds that, to the extent this delay internal to a machine is not constant, it is a variety of queuing delay, and nothing needs to be done to detect or eliminate it; another view holds that, since this delay does not have the same characteristics as queuing delay induced by a fixed-capacity bottleneck, it is more correctly classified as non-constant processing delay and should be filtered out. In practice, this doesn't seem to matter very much one way or the other. The way to filter the noise out is to observe, again, that the noise is always nonnegative, and so a good filter is the minimum of several recent delay measurements.

5.15. Non-bulk flows

Normally in transport, we're mainly concerned with bulk flows with infinite sources. Sometimes this model needs modification. For example, an application might be limiting the rate at which data is fed to the transport layer. If the offered rate can be carried by the network without any sign of congestion, an adaptive congestion control loop will increase the rate allowed by congestion control. The LEDBAT mechanism in the form described above will keep increasing the congestion window linearly with time. This is clearly wrong for non-bulk flows because it allows the congestion control to decide that it is allowed to send much faster than it has ever sent. Therefore, some special treatment is required to deal with the case of non-bulk flows.

The same problem may also occur in some TCP implementations. It was first noticed for TCP in situations where a connection has been idle for a substantial time, and the original workaround was to re-enter slow start after an idle period.
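The noise filter of Section 5.14 can be sketched as a minimum over the last few delay samples; the class name and the NOISE_FILTER value are illustrative assumptions.

```python
from collections import deque

NOISE_FILTER = 4  # recent delay samples to keep (illustrative value)

class CurrentDelayFilter:
    """Filter out nonnegative measurement noise (Section 5.14) by
    taking the minimum of the last NOISE_FILTER delay samples."""

    def __init__(self, size=NOISE_FILTER):
        # Unfilled slots are +infinity, so they never win the minimum.
        self.samples = deque([float("inf")] * size, maxlen=size)

    def add(self, delay):
        self.samples.append(delay)  # oldest sample drops out

    def current_delay(self):
        return min(self.samples)
```

A single noise-inflated sample is ignored until it ages out of the window; a genuine delay increase persists across samples and shows through once it fills the window.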
While re-entering slow start (or, more generally, returning to the original state of the connection) is a good idea after a long idle period, it does not solve the problem for connections that are trickling data out at an application-determined rate. In fact, even application-level keepalives or small metadata exchanges can be enough to keep the connection from becoming idle for long. The fix for this problem is to limit the growth of the congestion window by a quantity related to the current flight size, i.e., the amount of outstanding unacknowledged data.

The congestion window should not grow beyond a factor of the current flight size for sufficiently large values of the flight size. This provides leeway for the congestion window to grow and probe the network beyond the current rate, while at the same time keeping it on a tether that does not let it increase indefinitely. A refinement of this rule is that the congestion window should not grow beyond a constant plus a factor of the flight size. The additive constant is necessary because the congestion window needs to be able to increase by at least one packet even when it is small -- otherwise no probing is possible. The smallest value, one packet, would then define the minimum value of the factor: 2. However, for larger values of the congestion window, a tighter bound than "double the flight size" may be desirable. The additive constant makes it possible to use values between 1 and 2 for the factor.

5.16. LEDBAT framing and wire format

The actual framing and wire format of the protocol(s) using the LEDBAT congestion control mechanism are outside the scope of this document, which only describes the congestion control part.

One implication is that the sender needs to know the one-way delay from the sender to the receiver.
An obvious way to support this is to use a framing that timestamps packets at the sender and conveys the measured one-way delay back to the sender in ACK packets. This is the method we'll keep in mind for the purposes of exposition. Other methods are possible and valid.

Another implication is the receipt of frequent ACK packets. The exposition below assumes one ACK per data packet, but any reasonably small number of data packets per ACK will work as long as there is at least one ACK every round-trip time.

The protocols to which this congestion control mechanism is applicable, with possible appropriate extensions, are TCP, SCTP, DCCP, etc. It is not a goal of this document to cover such applications. The mechanism can also be used with proprietary transport protocols, e.g., those built over UDP for P2P applications.

5.17. Fairness between LEDBAT flows

The design goals of LEDBAT center on the aggregate behavior of LEDBAT flows when they compete with standard TCP. It is also interesting how LEDBAT flows share bottleneck bandwidth when they compete only between themselves.

LEDBAT as described so far lacks a mechanism specifically designed to equalize utilization between these flows. The observed behavior of existing implementations indicates that a rough equalization does, in fact, occur.

The delay measurements used as control inputs by LEDBAT contain some amount of noise and errors. The linear controller converts this input noise into the same amount of output noise. The effect is that the uncorrelated component of the noise between flows serves to randomly shuffle some amount of bandwidth between flows. The amount shuffled during each RTT is proportional to the noise divided by the target delay. The random-walk trajectory of bandwidth utilized by each of the flows over time tends to the fair share.
The timescales on which the rates become comparable are proportional to the target delay multiplied by the RTT and divided by the noise.

In complex real-life systems, the main concern is usually the reduction of the amount of noise, which is copious if not eliminated. In some circumstances, however, the measurements might be "too good" -- since the equalization timescale is inversely proportional to the noise, perfect measurements would result in a lack of convergence.

Under these circumstances, it may be beneficial to introduce some artificial randomness into the inputs (or, equivalently, the outputs) of the controller. Note that most systems should not require this and should be primarily concerned with reducing, not adding, noise.

5.17.1. Late comers

With delay-based congestion control systems, there's a concern about the ability of late comers to measure the base delay correctly. Suppose a LEDBAT flow saturates a bottleneck; another LEDBAT flow starts and proceeds to measure the base delay and the current delay and to estimate the queuing delay. If the bottleneck always contains a target delay's worth of packets, the second flow would see the bottleneck as empty and start building a second target delay's worth of queue on top of the existing queue. The concern ("late comers' advantage") is that the initial flow would now back off, because it sees the real delay, and the late comer would use the whole capacity.

However, once the initial flow yields, the late comer immediately measures the true base delay, and the two flows operate from the same (correct) inputs.

Additionally, in practice this concern is further alleviated by the burstiness of network traffic: all that's needed to measure the base delay is one small gap.
These gaps can occur for a variety of reasons: the OS may delay the scheduling of the sending process until a time slice ends, the sending computer might be unusually busy for some number of milliseconds or tens of milliseconds, etc. If such a gap occurs while the late comer is starting, the base delay is immediately measured correctly. With a small number of flows, this appears to be the main mechanism regulating the late comers' advantage.

5.18. Safety of LEDBAT

LEDBAT is most aggressive when its queuing delay estimate is most wrong, i.e., as low as it can possibly be. The queuing delay estimate is nonnegative, so the worst possible case is one in which the estimate is somehow always returned as zero. In that case, LEDBAT will ramp up as fast as TCP and halve its rate on loss. Thus, under the worst possible failure of the estimates, LEDBAT will behave identically to TCP. This provides an extra safety net.

6. LEDBAT congestion control

Consider two parties, a sender and a receiver, with the sender having an unlimited source of data to send to the receiver and the receiver merely acknowledging the data. (In an actual protocol, it is more convenient to have bidirectional connections, but the unidirectional abstraction suffices to describe the congestion control mechanism.)

Consider a protocol that uses packets of equal size and acknowledges each of them separately. (Variable-sized packets and delayed acknowledgements are possible and are being implemented, but they complicate the exposition.)

Assume that each data packet contains a header field named timestamp. The sender puts a timestamp from its clock into this field. Further assume that each acknowledgement packet contains a field named delay; how it is populated is shown below.

Slow start behavior is unchanged in LEDBAT.
Note that ramp-up is faster in slow start than during congestion avoidance, and so very conservative implementations MAY skip slow start altogether.

As far as congestion control is concerned, the receiver is then very simple and operates as follows, in pseudocode:

   on data_packet:
       remote_timestamp = data_packet.timestamp
       acknowledgement.delay = local_timestamp() - remote_timestamp
       # fill in other fields of acknowledgement
       acknowledgement.send()

The sender actually operates the congestion control algorithm and acts, to a first approximation, as follows:

   on initialization:
       base_delay = +infinity

   on acknowledgement:
       current_delay = acknowledgement.delay
       base_delay = min(base_delay, current_delay)
       queuing_delay = current_delay - base_delay
       off_target = TARGET - queuing_delay
       cwnd += GAIN * off_target / cwnd

The pseudocode above is a simplification that ignores noise filtering and base expiration. The more precise pseudocode that takes these factors into account is as follows and MUST be followed:

   on initialization:
       set all NOISE_FILTER delays used by current_delay() to +infinity
       set all BASE_HISTORY delays used by base_delay() to +infinity
       last_rollover = -infinity  # more than a minute in the past

   on acknowledgement:
       delay = acknowledgement.delay
       update_base_delay(delay)
       update_current_delay(delay)
       queuing_delay = current_delay() - base_delay()
       off_target = TARGET - queuing_delay + random_input()
       cwnd += GAIN * off_target / cwnd
       # flight_size() is the amount of currently unacknowledged data
       max_allowed_cwnd = ALLOWED_INCREASE + TETHER*flight_size()
       cwnd = min(cwnd, max_allowed_cwnd)

   random_input()
       # random() is a PRNG between 0.0 and 1.0
       # NB: RANDOMNESS_AMOUNT is normally 0
       RANDOMNESS_AMOUNT * TARGET * ((random() - 0.5)*2)

   update_current_delay(delay)
       # Maintain a list of NOISE_FILTER last delays observed.
       forget the earliest of NOISE_FILTER current_delays
       add delay to the end of current_delays

   current_delay()
       min(the NOISE_FILTER delays stored by update_current_delay)

   update_base_delay(delay)
       # Maintain BASE_HISTORY min delays.  Each represents a minute.
       if round_to_minute(now) != round_to_minute(last_rollover)
           last_rollover = now
           forget the earliest of base_delays
           add delay to the end of base_delays
       else
           last of base_delays = min(last of base_delays, delay)

   base_delay()
       min(the BASE_HISTORY min delays stored by update_base_delay)

The TARGET parameter MUST be set to 100 milliseconds. GAIN SHOULD be set so that the maximum ramp-up rate is the same as TCP's, and MAY be up to twice that. BASE_HISTORY SHOULD be 10; it MUST be no less than 2 and SHOULD NOT be more than 20. NOISE_FILTER SHOULD be 1; it MAY be tuned so that it is at least 1 and no more than cwnd/2. ALLOWED_INCREASE SHOULD be 1 packet; it MUST be at least 1 packet and SHOULD NOT be more than 3 packets. TETHER SHOULD be 1.5; it MUST be greater than 1. RANDOMNESS_AMOUNT SHOULD be 0; it MUST be between 0 and 0.1 inclusive.

Note, in particular, that using a consistent TARGET value is important, as flows using different TARGET values will not share a bottleneck well: flows with higher values will tend to get a disproportionately large share of the bottleneck. This is why TARGET is the only parameter that is specified with no implementation flexibility. (Historically, uTorrent used different values of TARGET, both as a temporary safety precaution and later experimentally. The current choice is based on the experience with those different values.)

7. Security Considerations

A network on the path might choose to cause higher delay measurements than the real queuing delay, so that LEDBAT backs off even when no congestion is present.
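To illustrate the exposure with the simplified sender step from Section 6 (a sketch of ours; the delay values are illustrative), an on-path party that inflates the reported delay field drives off_target negative, so the sender shrinks cwnd even on an otherwise idle path:

```python
TARGET = 100.0  # ms, per the parameter values above

def off_target(reported_delay_ms, base_delay_ms):
    # Simplified sender step from Section 6: a positive off_target
    # grows cwnd, a negative one shrinks it.
    queuing_delay = reported_delay_ms - base_delay_ms
    return TARGET - queuing_delay

base = 30.0                            # ms, true base delay, idle path
honest = off_target(30.0, base)        # no queuing: +100.0, cwnd grows
faked = off_target(230.0, base)        # inflated delay field: -100.0,
                                       # cwnd shrinks despite no congestion
```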
Shaping of traffic into an artificially narrow bottleneck cannot be counteracted, but faking of the timestamp field can and SHOULD be. A protocol using LEDBAT congestion control SHOULD authenticate the timestamp and delay fields, preferably as part of authenticating most of the rest of the packet, with the exception of volatile header fields. The choice of an authentication mechanism that resists man-in-the-middle attacks is outside the scope of this document.

8. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Stanislav Shalunov
   BitTorrent Inc
   612 Howard St, Suite 400
   San Francisco, CA 94105
   USA

   Email: shalunov@bittorrent.com
   URI:   http://shlang.com

   Greg Hazel
   BitTorrent Inc
   612 Howard St, Suite 400
   San Francisco, CA 94105
   USA

   Email: greg@bittorrent.com