idnits 2.17.1 draft-ietf-aqm-fq-codel-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 287: '...he minimum delay MUST be experienced i...' RFC 2119 keyword, line 288: '... interval. It SHOULD be set on the ...' RFC 2119 keyword, line 309: '...e memory, so some packet limit MUST be...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 562 has weird spacing: '...st_head flowc...' -- The document date (October 19, 2015) is 3111 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '
' and
     '' lines.


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '1' on line 842


     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	AQM working group                                 T. Hoeiland-Joergensen
3	Internet-Draft                                       Karlstad University
4	Intended status: Informational                               P. McKenney
5	Expires: April 21, 2016                      IBM Linux Technology Center
6	                                                                 D. Taht
7	                                                                Teklibre
8	                                                               J. Gettys

10	                                                              E. Dumazet
11	                                                            Google, Inc.
12	                                                        October 19, 2015

14	                            FlowQueue-Codel
15	                       draft-ietf-aqm-fq-codel-02

17	Abstract

19	   This memo presents the FQ-CoDel hybrid packet scheduler/AQM
20	   algorithm, a powerful tool for fighting bufferbloat and reducing
21	   latency.

23	   FQ-CoDel mixes packets from multiple flows and reduces the impact of
24	   head of line blocking from bursty traffic.  It provides isolation for
25	   low-rate traffic such as DNS, web, and videoconferencing traffic.  It
26	   improves utilisation across the networking fabric, especially for
27	   bidirectional traffic, by keeping queue lengths short; and it can be
28	   implemented in a memory- and CPU-efficient fashion across a wide
29	   range of hardware.

31	Status of This Memo

33	   This Internet-Draft is submitted in full conformance with the
34	   provisions of BCP 78 and BCP 79.

36	   Internet-Drafts are working documents of the Internet Engineering
37	   Task Force (IETF).  Note that other groups may also distribute
38	   working documents as Internet-Drafts.  The list of current Internet-
39	   Drafts is at http://datatracker.ietf.org/drafts/current/.

41	   Internet-Drafts are draft documents valid for a maximum of six months
42	   and may be updated, replaced, or obsoleted by other documents at any
43	   time.  It is inappropriate to use Internet-Drafts as reference
44	   material or to cite them other than as "work in progress."

46	   This Internet-Draft will expire on April 21, 2016.

48	Copyright Notice

50	   Copyright (c) 2015 IETF Trust and the persons identified as the
51	   document authors.  All rights reserved.

53	   This document is subject to BCP 78 and the IETF Trust's Legal
54	   Provisions Relating to IETF Documents
55	   (http://trustee.ietf.org/license-info) in effect on the date of
56	   publication of this document.  Please review these documents
57	   carefully, as they describe your rights and restrictions with respect
58	   to this document.  Code Components extracted from this document must
59	   include Simplified BSD License text as described in Section 4.e of
60	   the Trust Legal Provisions and are provided without warranty as
61	   described in the Simplified BSD License.

63	Table of Contents

65	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
66	     1.1.  Terminology and concepts  . . . . . . . . . . . . . . . .   3
67	     1.2.  Informal summary of FQ-CoDel  . . . . . . . . . . . . . .   4
68	   2.  CoDel . . . . . . . . . . . . . . . . . . . . . . . . . . . .   5
69	   3.  Flow Queueing . . . . . . . . . . . . . . . . . . . . . . . .   5
70	   4.  FQ-CoDel Parameters and Data Structures . . . . . . . . . . .   6
71	     4.1.  Parameters  . . . . . . . . . . . . . . . . . . . . . . .   6
72	       4.1.1.  Interval  . . . . . . . . . . . . . . . . . . . . . .   6
73	       4.1.2.  Target  . . . . . . . . . . . . . . . . . . . . . . .   7
74	       4.1.3.  Packet limit  . . . . . . . . . . . . . . . . . . . .   7
75	       4.1.4.  Quantum . . . . . . . . . . . . . . . . . . . . . . .   7
76	       4.1.5.  Flows . . . . . . . . . . . . . . . . . . . . . . . .   8
77	       4.1.6.  ECN . . . . . . . . . . . . . . . . . . . . . . . . .   8
78	       4.1.7.  CE threshold  . . . . . . . . . . . . . . . . . . . .   8
79	     4.2.  Data structures . . . . . . . . . . . . . . . . . . . . .   8
80	       4.2.1.  Internal queues . . . . . . . . . . . . . . . . . . .   8
81	       4.2.2.  New and old queues lists  . . . . . . . . . . . . . .   9
82	   5.  The FQ-CoDel scheduler  . . . . . . . . . . . . . . . . . . .   9
83	     5.1.  Enqueue . . . . . . . . . . . . . . . . . . . . . . . . .   9
84	       5.1.1.  Alternative classification schemes  . . . . . . . . .  10
85	     5.2.  Dequeue . . . . . . . . . . . . . . . . . . . . . . . . .  10
86	   6.  Implementation considerations . . . . . . . . . . . . . . . .  11
87	     6.1.  Probability of hash collisions  . . . . . . . . . . . . .  11
88	     6.2.  Memory Overhead . . . . . . . . . . . . . . . . . . . . .  12
89	     6.3.  Per-Packet Timestamping . . . . . . . . . . . . . . . . .  13
90	     6.4.  Other forms of "Fair Queueing"  . . . . . . . . . . . . .  13
91	     6.5.  Differences between CoDel and FQ-CoDel behaviour  . . . .  13
92	   7.  Limitations of flow queueing  . . . . . . . . . . . . . . . .  14
93	     7.1.  Fairness between things other than flows  . . . . . . . .  14
94	     7.2.  Flow bunching by opaque encapsulation . . . . . . . . . .  15
95	     7.3.  Low-priority congestion control algorithms  . . . . . . .  15

97	   8.  Deployment status and future work . . . . . . . . . . . . . .  16
98	   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  16
99	   10. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  17
100	   11. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  17
101	   12. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  17
102	   13. References  . . . . . . . . . . . . . . . . . . . . . . . . .  17
103	     13.1.  Normative References . . . . . . . . . . . . . . . . . .  17
104	     13.2.  Informative References . . . . . . . . . . . . . . . . .  18
105	     13.3.  URIs . . . . . . . . . . . . . . . . . . . . . . . . . .  18
106	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  18

108	1.  Introduction

110	   The FQ-CoDel algorithm is a combined packet scheduler and AQM
111	   developed as part of the bufferbloat-fighting community effort.  It
112	   is based on a modified Deficit Round Robin (DRR) queue scheduler,
113	   with the CoDel AQM algorithm operating on each queue.  This document
114	   describes the combined algorithm; reference implementations are
115	   available for ns2 and ns3 and it is included in the mainline Linux
116	   kernel as the fq_codel queueing discipline.

118	   The rest of this document is structured as follows: This section
119	   gives some concepts and terminology used in the rest of the document,
120	   and gives a short informal summary of the FQ-CoDel algorithm.
121	   Section 2 gives an overview of the CoDel algorithm.  Section 3 covers
122	   the flow hashing and DRR portion.  Section 4 defines the parameters
123	   and data structures employed by FQ-CoDel.  Section 5 describes the
124	   working of the algorithm in detail.  Section 6 describes
125	   implementation considerations, and Section 7 lists some of the
126	   limitations of using flow queueing.  Section 8 outlines the current
127	   status of FQ-CoDel deployment and lists some possible future areas of
128	   inquiry, and finally, Section 12 concludes.

130	1.1.  Terminology and concepts

132	   Flow: A flow is typically identified by a 5-tuple of source IP,
133	   destination IP, source port, destination port, and protocol.  It can
134	   also be identified by a superset or subset of those parameters, or by
135	   mac address, or other means.

137	   Queue: A queue of packets represented internally in FQ-CoDel.  In
138	   most instances each flow gets its own queue; however because of the
139	   possibility of hash collisions, this is not always the case.  In an
140	   attempt to avoid confusion, the word 'queue' is used to refer to the
141	   internal data structure, and 'flow' to refer to the actual stream of
142	   packets being delivered to the FQ-CoDel algorithm.

144	   Scheduler: A mechanism to select which queue a packet is dequeued
145	   from.

147	   CoDel AQM: The Active Queue Management algorithm employed by FQ-
148	   CoDel.

150	   DRR: Deficit round-robin scheduling.

152	   Quantum: The maximum amount of bytes to be dequeued from a queue at
153	   once.

155	1.2.  Informal summary of FQ-CoDel

157	   FQ-CoDel is a _hybrid_ of DRR [DRR] and CoDel [CODELDRAFT], with an
158	   optimisation for sparse flows similar to SQF [SQF2012] and DRR++
159	   [DRRPP].  We call this "Flow Queueing" rather than "Fair Queueing" as
160	   flows that build a queue are treated differently than flows that do
161	   not.

163	   FQ-CoDel stochastically classifies incoming packets into different
164	   queues by hashing the 5-tuple of IP protocol number and source and
165	   destination IP and port numbers, perturbed with a random number
166	   selected at initiation time (although other flow classification
167	   schemes can optionally be configured instead).  Each queue is managed
168	   by the CoDel AQM algorithm.  Packet ordering within a queue is
169	   preserved, since queues have FIFO ordering.

171	   The FQ-CoDel algorithm consists of two logical parts: the scheduler
172	   which selects which queue to dequeue a packet from, and the CoDel AQM
173	   which works on each of the queues.  The subtleties of FQ-CoDel are
174	   mostly in the scheduling part, whereas the interaction between the
175	   scheduler and the CoDel algorithm are fairly straight forward:

177	   At initialisation, each queue is set up to have a separate set of
178	   CoDel state variables.  By default, 1024 queues are created.  The
179	   current implementation supports anywhere from one to 64K separate
180	   queues, and each queue maintains the state variables throughout its
181	   lifetime, and so acts the same as the non-FQ CoDel variant would.
182	   This means that with only one queue, FQ-CoDel behaves essentially the
183	   same as CoDel by itself.

185	   On dequeue, FQ-CoDel selects a queue from which to dequeue by a two-
186	   tier round-robin scheme, in which each queue is allowed to dequeue up
187	   to a configurable quantum of bytes for each iteration.  Deviations
188	   from this quantum is maintained as a deficit for the queue, which
189	   serves to make the fairness scheme byte-based rather than a packet-
190	   based.  The two-tier round-robin mechanism distinguishes between
191	   "new" queues (which don't build up a standing queue) and "old"
192	   queues, that have queued enough data to be around for more than one
193	   iteration of the round-robin scheduler.

195	   This new/old queue distinction has a particular consequence for
196	   queues that don't build up more than a quantum of bytes before being
197	   visited by the scheduler: Such queues are removed from the list, and
198	   then re-added as a new queue each time a packet arrives for it, and
199	   so will get priority over queues that do not empty out each round
200	   (except for a minor modification to protect against starvation,
201	   detailed below).  Exactly how little data a flow has to send to keep
202	   its queue in this state is somewhat difficult to reason about,
203	   because it depends on both the egress link speed and the number of
204	   concurrent flows.  However, in practice many things that are
205	   beneficial to have prioritised for typical internet use (ACKs, DNS
206	   lookups, interactive SSH, HTTP requests, ARP, RA, ICMP, VoIP) _tend_
207	   to fall in this category, which is why FQ-CoDel performs so well for
208	   many practical applications.  However, the implicitness of the
209	   prioritisation means that for applications that require guaranteed
210	   priority (for instance multiplexing the network control plane over
211	   the network itself), explicit classification is still needed.

213	   This scheduling scheme has some subtlety to it, which is explained in
214	   detail in the remainder of this document.

216	2.  CoDel

218	   CoDel is described in the the ACM Queue paper [CODEL2012], and the
219	   AQM working group draft [CODELDRAFT].  The basic idea is to control
220	   queue length, maintaining sufficient queueing to keep the outgoing
221	   link busy, but avoiding building up the queue beyond that point.
222	   This is done by preferentially dropping packets that remain in the
223	   queue for "too long".  Packets are dropped by head drop, which lowers
224	   the time for the drop signal to propagate back to the sender by the
225	   length of the queue, and helps trigger TCP fast retransmit sooner.

227	   The CoDel algorithm itself will not be described here; instead we
228	   refer the reader to the CoDel draft [CODELDRAFT].

230	3.  Flow Queueing

232	   The intention of FQ-CoDel's scheduler is to give each _flow_ its own
233	   queue, hence the term _Flow Queueing_. Rather than a perfect
234	   realisation of this, a hashing-based scheme is used, where flows are
235	   hashed into a number of buckets which each has its own queue.  The
236	   number of buckets are configurable, and presently defaults to 1024 in
237	   the Linux implementation.  This is enough to avoid hash collisions on
238	   a moderate number of flows as seen for instance in a home gateway.
239	   Depending on the characteristics of the link, this can be tuned to
240	   trade off memory for a lower probability of hash collisions.  See
241	   Section 6 for a more in-depth discussion of this.

243	   By default, the flow hashing is performed on the 5-tuple of source
244	   and destination IP and port numbers and IP protocol number.  While
245	   the hashing can be customised to match on arbitrary packet bytes,
246	   care should be taken when doing so: Much of the benefit of the FQ-
247	   CoDel scheduler comes from this per-flow distinction.  However, the
248	   default hashing does have some limitations, as discussed in
249	   Section 7.

251	   FQ-CoDel's DRR scheduler is byte-based, employing a deficit round-
252	   robin mechanism between queues.  This works by keeping track of the
253	   current byte _deficit_ of each queue.  This deficit is initialised to
254	   the configurable quantum; each time a queue gets a dequeue
255	   opportunity, it gets to dequeue packets, decreasing the deficit by
256	   the packet size for each packet, until the deficit runs into the
257	   negative, at which point it is increased by one quantum, and the
258	   dequeue opportunity ends.

260	   This means that if one queue contains packets of, for instance, size
261	   quantum/3, and another contains quantum-sized packets, the first
262	   queue will dequeue three packets each time it gets a turn, whereas
263	   the second only dequeues one.  This means that flows that send small
264	   packets are not penalised by the difference in packet sizes; rather,
265	   the DRR scheme approximates a (single-)byte-based fairness queueing.
266	   The size of the quantum determines the scheduling granularity, with
267	   the tradeoff from too small a quantum being scheduling overhead.  For
268	   small bandwidths, lowering the quantum from the default MTU size can
269	   be advantageous.

271	   Unlike plain DRR there are two sets of flows - a "new" list for flows
272	   that have not built a queue recently, and an "old" list for flow-
273	   building queues.  This distinction is an integral part of the FQ-
274	   CoDel scheduler and is described in more detail in Section 5.

276	4.  FQ-CoDel Parameters and Data Structures

278	   This section goes into the parameters and data structures in FQ-
279	   CoDel.

281	4.1.  Parameters

283	4.1.1.  Interval

285	   The _interval_ parameter has the same semantics as CoDel and is used
286	   to ensure that the measured minimum delay does not become too stale.
287	   The minimum delay MUST be experienced in the last epoch of length
288	   interval.  It SHOULD be set on the order of the worst-case RTT
289	   through the bottleneck to give end-points sufficient time to react.

291	   The default interval value is 100 ms.

293	4.1.2.  Target

295	   The _target_ parameter has the same semantics as CoDel.  It is the
296	   acceptable minimum standing/persistent queue delay for each FQ-CoDel
297	   Queue.  This minimum delay is identified by tracking the local
298	   minimum queue delay that packets experience.

300	   The default target value is 5 ms, but this value should be tuned to
301	   be at least the transmission time of a single MTU-sized packet at the
302	   prevalent egress link speed (which for e.g. 1Mbps and MTU 1500 is
303	   ~15ms), to prevent CoDel from being too aggressive at low bandwidths.
304	   It should otherwise be set to on the order of 5-10% of the configured
305	   interval.

307	4.1.3.  Packet limit

309	   Routers do not have infinite memory, so some packet limit MUST be
310	   enforced.

312	   The _limit_ parameter is the hard limit on the real queue size,
313	   measured in number of packets.  This limit is a global limit on the
314	   number of packets in all queues; each individual queue does not have
315	   an upper limit.  When the limit is reached and a new packet arrives
316	   for enqueue, a packet is dropped from the head of the largest queue
317	   (measured in bytes) to make room for the new packet.

319	   In Linux, the default packet limit is 10240 packets, which is
320	   suitable for up to 10GigE speeds.  In practice, the hard limit is
321	   rarely, if ever, hit, as drops are performed by the CoDel algorithm
322	   long before the limit is hit.  For platforms that are severely memory
323	   constrained, a lower limit can be used.

325	4.1.4.  Quantum

327	   The _quantum_ parameter is the number of bytes each queue gets to
328	   dequeue on each round of the scheduling algorithm.  The default is
329	   set to 1514 bytes which corresponds to the Ethernet MTU plus the
330	   hardware header length of 14 bytes.

332	   In TSO-enabled systems, where a "packet" consists of an offloaded
333	   packet train, it can presently be as large as 64K bytes.  In GRO-
334	   enabled systems, up to 17 times the TCP max segment size (or 25K
335	   bytes).  These mega-packets severely impact FQ-CoDel's ability to
336	   schedule traffic, and hurt latency needlessly.  There is ongoing work
337	   in Linux to make smarter use of offload engines.

339	4.1.5.  Flows

341	   The _flows_ parameter sets the number of queues into which the
342	   incoming packets are classified.  Due to the stochastic nature of
343	   hashing, multiple flows may end up being hashed into the same slot.

345	   This parameter can be set only at load time since memory has to be
346	   allocated for the hash table in the current implementation.

348	   The default value is 1024 in the current Linux implementation.

350	4.1.6.  ECN

352	   ECN is _enabled_ by default.  Rather than do anything special with
353	   misbehaved ECN flows, FQ-CoDel relies on the packet scheduling system
354	   to minimise their impact, thus unresponsive packets in a flow being
355	   marked with ECN can grow to the overall packet limit, but will not
356	   otherwise affect the performance of the system.

358	   It can be disabled by specifying the _noecn_ parameter.

360	4.1.7.  CE threshold

362	   This parameter enables DCTCP-like processing to enable CE marking ECT
363	   packets at a lower setpoint than the default codel target.

365	   The parameter, _ce_threshold_, is disabled by default and can be set
366	   to a number of usec to enable.

368	4.2.  Data structures

370	4.2.1.  Internal queues

372	   The main data structure of FQ-CoDel is the array of queues, which is
373	   instantiated to the number of queues specified by the _flows_
374	   parameter at instantiation time.  Each queue consists simply of an
375	   ordered list of packets with FIFO semantics, two state variables
376	   tracking the queue deficit and total number of bytes enqueued, and
377	   the set of CoDel state variables.  Other state variables to track
378	   queue statistics can also be included: for instance, the Linux
379	   implementation keeps a count of dropped packets.

381	   Queue space is shared: there's a global limit on the number of
382	   packets the queues can hold, but not one per queue.

384	4.2.2.  New and old queues lists

386	   FQ-CoDel maintains two lists of active queues, called "new" and "old"
387	   queues.  Each list is an ordered list containing references to the
388	   array of queues.  When a packet is added to a queue that is not
389	   currently active, that queue becomes active by being added to the
390	   list of new queues.  Later on, it is moved to the list of old queues,
391	   from which it is removed when it is no longer active.  This behaviour
392	   is the source of some subtlety in the packet scheduling at dequeue
393	   time, explained below.

395	5.  The FQ-CoDel scheduler

397	   This section describes the operation of the FQ-CoDel scheduler and
398	   AQM.  It is split into two parts explaining the enqueue and dequeue
399	   operations.

401	5.1.  Enqueue

403	   The packet enqueue mechanism consists of three stages: classification
404	   into a queue, timestamping and bookkeeping, and optionally dropping a
405	   packet when the total number of enqueued packets goes over the
406	   maximum.

408	   When a packet is enqueued, it is first classified into the
409	   appropriate queue.  By default, this is done by hashing (using a
410	   Jenkins hash function) on the 5-tuple of IP protocol, and source and
411	   destination IP and port numbers, permuted by a random value selected
412	   at initialisation time, and taking the hash value modulo the number
413	   of queues.

415	   Once the packet has been successfully classified into a queue, it is
416	   handed over to the CoDel algorithm for timestamping.  It is then
417	   added to the tail of the selected queue, and the queue's byte count
418	   is updated by the packet size.  Then, if the queue is not currently
419	   active (i.e. if it is not in either the list of new or the list of
420	   old queues), it is added to the end of the list of new queues, and
421	   its deficit is initiated to the configured quantum.  Otherwise it is
422	   added to the old queue list.

424	   Finally, the total number of enqueued packets is compared with the
425	   configured limit, and if it is _above_ this value (which can happen
426	   since a packet was just enqueued), a packet is dropped from the head
427	   of the queue with the largest current byte count.  Note that this in
428	   most cases means that the packet that gets dropped is different from
429	   the one that was just enqueued, and may even be from a different
430	   queue.

432	5.1.1.  Alternative classification schemes

434	   As mentioned previously, it is possible to modify the classification
435	   scheme to provide a different notion of a 'flow'.  The Linux
436	   implementation provides this option in the form of the "tc filter"
437	   command.  While this can add capabilities (for instance, matching on
438	   other possible parameters such as mac address, diffserv, firewall and
439	   flow specific markings, etc.), care should be taken to preserve the
440	   notion of 'flow' as much of the benefit of the FQ-CoDel scheduler
441	   comes from keeping flows in separate queues.  We are not aware of any
442	   deployments utilising the custom classification feature.

444	   An alternative to changing the classification scheme is to perform
445	   decapsulation prior to hashing.  The Linux implementation does this
446	   for common encapsulations known to the kernel, such as 6in4, IPIP and
447	   GRE tunnels.  This helps to distinguish between flows that share the
448	   same (outer) 5-tuple, but of course is limited to unencrypted tunnels
449	   (see Section 7.2).

451	5.2.  Dequeue

453	   Most of FQ-CoDel's work is done at packet dequeue time.  It consists
454	   of three parts: selecting a queue from which to dequeue a packet,
455	   actually dequeuing it (employing the CoDel algorithm in the process),
456	   and some final bookkeeping.

458	   For the first part, the scheduler first looks at the list of new
459	   queues; for each queue in that list, if that queue has a negative
460	   deficit (i.e.  it has already dequeued at least a quantum of bytes),
461	   its deficit is increased by one quantum, and the queue is put onto
462	   _the end of_ the list of old queues, and the routine selects the next
463	   queue and starts again.

465	   Otherwise, that queue is selected for dequeue.  If the list of new
466	   queues is empty, the scheduler proceeds down the list of old queues
467	   in the same fashion (checking the deficit, and either selecting the
468	   queue for dequeuing, or increasing the deficit and putting the queue
469	   back at the end of the list).

471	   After having selected a queue from which to dequeue a packet, the
472	   CoDel algorithm is invoked on that queue.  This applies the CoDel
473	   control law, and may discard one or more packets from the head of
474	   that queue, before returning the packet that should be dequeued (or
475	   nothing if the queue is or becomes empty while being handled by the
476	   CoDel algorithm).

478	   Finally, if the CoDel algorithm did not return a packet, the queue is
479	   empty, and the scheduler does one of two things: if the queue
480	   selected for dequeue came from the list of new queues, it is moved to
481	   _the end of_ the list of old queues.  If instead it came from the
482	   list of old queues, that queue is removed from the list, to be added
483	   back (as a new queue) the next time a packet arrives that hashes to
484	   that queue.  Then (since no packet was available for dequeue), the
485	   whole dequeue process is restarted from the beginning.

487	   If, instead, the scheduler _did_ get a packet back from the CoDel
488	   algorithm, it updates the byte deficit for the selected queue before
489	   returning the packet as the result of the dequeue operation.

491	   The step that moves an empty queue from the list of new queues to
492	   _the end of_ the list of old queues before it is removed is crucial
493	   to prevent starvation.  Otherwise the queue could reappear (the next
494	   time a packet arrives for it) before the list of old queues is
495	   visited; this can go on indefinitely even with a small number of
496	   active flows, if the flow providing packets to the queue in question
497	   transmits at just the right rate.  This is prevented by first moving
498	   the queue to _the end of_ the list of old queues, forcing a pass
499	   through that, and thus preventing starvation.  Moving it to the end
500	   of the list, rather than the front, is crucial for this to work.

502	   The resulting migration of queues between the different states is
503	   summarised in the following state diagram:

505	   +-----------------+                +------------------+
506	   |                 |     Empty      |                  |
507	   |     Empty       |<---------------+       Old        +-----+
508	   |                 |                |                  |     |
509	   +-------+---------+                +------------------+     |
510	           |                             ^            ^        |Quantum
511	           |Arrival                      |            |        |Exceeded
512	           v                             |            |        |
513	   +-----------------+                   |            |        |
514	   |                 |     Empty or      |            |        |
515	   |      New        +-------------------+            +--------+
516	   |                 |  Quantum exceeded
517	   +-----------------+

519	6.  Implementation considerations

521	6.1.  Probability of hash collisions

523	   Since the Linux FQ-CoDel implementation by default uses 1024 hash
524	   buckets, the probability that (say) 100 VoIP sessions will all hash
525	   to the same bucket is something like ten to the power of minus 300.
526	   Thus, the probability that at least one of the VoIP sessions will
527	   hash to some other queue is very high indeed.

529	   Expanding on this, based on analytical equations for hash collision
530	   probabilities, for 100 VoIP sessions, the probability of no collision
531	   is 90.78%; the probability that no more than two of the 100 VoIP
532	   sessions will be involved in any given collision = 99.57%; and the
533	   probability that no more than three of the 100 VoIP sessions will be
534	   involved in any given collision = 99.99%.

536	   These probabilities can be improved upon by using set-associative
537	   hashing, a technique used in the Cake algorithm currently being
538	   developed as a further development upon the FQ-CoDel principles.  For
539	   a 4-way associative hash with the same number of total queues, the
540	   probability of no collisions for 100 VoIP sessions is 99.93%, while
541	   for an 8-way associative hash it is ~100%.

543	6.2.  Memory Overhead

545	   FQ-CoDel can be implemented with a very low memory footprint (less
546	   than 64 bytes per queue on 64 bit systems).  These are the data
547	   structures used in the Linux implementation:

549	   struct codel_vars {
550	      u32             count;
551	      u32             lastcount;
552	      bool            dropping;
553	      u16             rec_inv_sqrt;
554	      codel_time_t    first_above_time;
555	      codel_time_t    drop_next;
556	      codel_time_t    ldelay;
557	   };

559	   struct fq_codel_flow {
560	      struct sk_buff    *head;
561	      struct sk_buff    *tail;
562	      struct list_head  flowchain;
563	      int               deficit;
564	      u32               dropped; /* # of drops (or ECN marks) on flow */
565	      struct codel_vars cvars;
566	   };

568	   The master table managing all queues looks like this:

570	   struct fq_codel_sched_data {
571	      struct tcf_proto *filter_list;  /* optional external classifier */
572	      struct fq_codel_flow *flows;    /* Flows table [flows_cnt] */
573	      u32             *backlogs;      /* backlog table [flows_cnt] */
574	      u32             flows_cnt;      /* number of flows */
575	      u32             perturbation;   /* hash perturbation */
576	      u32             quantum;        /* psched_mtu(qdisc_dev(sch)); */
577	      struct codel_params cparams;
578	      struct codel_stats cstats;
579	      u32             drop_overlimit;
580	      u32             new_flow_count;

582	      struct list_head new_flows;     /* list of new flows */
583	      struct list_head old_flows;     /* list of old flows */
584	   };

586	6.3.  Per-Packet Timestamping

588	   The CoDel portion of the algorithm requires per-packet timestamps be
589	   stored along with the packet.  While this approach works well for
590	   software-based routers, it may be impossible to retrofit devices that
591	   do most of their processing in silicon and lack space or mechanism
592	   for timestamping.

594	   Also, while perfect resolution is not needed, timestamp resolution
595	   below the target is necessary.  Furthermore, timestamping functions
596	   in the core OS need to be efficient as they are called at least once
597	   on each packet enqueue and dequeue.

599	6.4.  Other forms of "Fair Queueing"

601	   Much of the scheduling portion of FQ-CoDel is derived from DRR and is
602	   substantially similar to DRR++. SFQ-based versions have also been
603	   produced and tested in ns2.  Other forms of Fair Queueing, such as
604	   WFQ or QFQ, have not been thoroughly explored.

606	6.5.  Differences between CoDel and FQ-CoDel behaviour

608	   CoDel can be applied to a single queue system as a straight AQM,
609	   where it converges towards an "ideal" drop rate (i.e. one that
610	   minimises delay while keeping a high link utilisation), and then
611	   optimises around that control point.

613	   The scheduling of FQ-CoDel mixes packets of competing flows, which
614	   acts to pace bursty flows to better fill the pipe.  Additionally, a
615	   new flow gets substantial "credit" over other flows until CoDel finds
616	   an ideal drop rate for it.  However, for a new flow that exceeds the
617	   configured quantum, more time passes before all of its data is
618	   delivered (as packets from it, too, are mixed across the other
619	   existing queue-building flows).  Thus, FQ-CoDel takes longer (as
620	   measured in time) to converge towards an ideal drop rate for a given
621	   new flow, but does so within fewer delivered _packets_ from that
622	   flow.

624	   Finally, the flow isolation FQ-CoDel provides means that the CoDel
625	   drop mechanism operates on the flows actually building queues, which
626	   results in packets being dropped more accurately from the largest
627	   flows than CoDel alone manages.  Additionally, flow isolation
628	   radically improves the transient behaviour of the network when
629	   traffic or link characteristics change (e.g. when new flows start up
630	   or the link bandwidth changes); while CoDel itself can take a while
631	   to respond, fq_codel doesn't miss a beat.

633	7.  Limitations of flow queueing

635	   While FQ-CoDel has been shown in many scenarios to offer significant
636	   performance gains, there are some scenarios where the scheduling
637	   algorithm in particular is not a good fit.  This section documents
638	   some of the known cases which either may require tweaking the default
639	   behaviour, or where alternatives to flow queueing should be
640	   considered.

642	7.1.  Fairness between things other than flows

644	   In some parts of the network, enforcing flow-level fairness may not
645	   be desirable, or some other form of fairness may be more important.
646	   An example of this can be an Internet Service Provider that may be
647	   more interested in ensuring fairness between customers than between
648	   flows.  Or a hosting or transit provider that wishes to ensure
649	   fairness between connecting Autonomous Systems or networks.  Another
650	   issue can be that the number of simultaneous flows experienced at a
651	   particular link can be too high for flow-based fairness queueing to
652	   be effective.

654	   Whatever the reason, in a scenario where fairness between flows is
655	   not desirable, reconfiguring FQ-CoDel to match on a different
656	   characteristic can be a way forward.  The implementation in Linux can
657	   leverage the packet matching mechanism of the _tc_ subsystem to use
658	   any available packet field to partition packets into virtual queues,
659	   to for instance match on address or subnet source/destination pairs,
660	   application layer characteristics, etc.

662	   Furthermore, as commonly deployed today, FQ-CoDel is used with three
663	   or more tiers of classification: priority, best effort and
664	   background, based on diffserv markings.  Some products do more
665	   detailed classification, including deep packet inspection and
666	   destination-specific filters to achieve their desired result.

668	7.2.  Flow bunching by opaque encapsulation

670	   Where possible, FQ-CoDel will attempt to decapsulate packets before
671	   matching on the header fields for the flow hashing.  However, for
672	   some encapsulation techniques, most notably encrypted VPNs, this is
673	   not possible.  If several flows are bunched into one such
674	   encapsulated tunnel, they will be seen as one flow by the FQ-CoDel
675	   algorithm.  This means that they will share a queue, and drop
676	   behaviour, and so flows inside the encapsulation will not benefit
677	   from the implicit prioritisation of FQ-CoDel, but will continue to
678	   benefit from the reduced overall queue length from the CoDel
679	   algorithm operating on the queue.  In addition, when such an
680	   encapsulated bunch competes against other flows, it will count as one
681	   flow, and not assigned a share of the bandwidth based on how many
682	   flows are inside the encapsulation.

684	   Depending on the application, this may or may not be desirable
685	   behaviour.  In cases where it is not, changing FQ-CoDel's matching to
686	   not be flow-based (as detailed in the previous subsection above) can
687	   be a mitigation.  Going forward, having some mechanism for opaque
688	   encapsulations to express to the outer layer which flow a packet
689	   belongs to, could be a way to mitigate this.

691	7.3.  Low-priority congestion control algorithms

693	   In the presence of queue management schemes that contain latency
694	   under load, low-priority congestion control algorithms such as LEDBAT
695	   [RFC6817] (or, in general, algorithms that try to voluntarily use up
696	   less than their fair share of bandwidth) experiences very little
697	   added latency when the link is congested.  Thus, they lack the signal
698	   to back off that added latency previously afforded them.  This effect
699	   is seen with FQ-CoDel as well as with any effective AQM [GONG2014].

701	   As such, these delay-based algorithms tend to revert to loss-based
702	   congestion control, and will consume the fair share of bandwidth
703	   afforded to them by the FQ-CoDel scheduler.  However, low-priority
704	   congestion control mechanisms may be able to take steps to continue
705	   to be low priority, for instance by taking into account the vastly
706	   reduced level of delay afforded by an AQM, or by using a coupled
707	   approach to observing the behaviour of multiple flows.

709	8.  Deployment status and future work

711	   The FQ-CoDel algorithm as described in this document has been shipped
712	   as part of the Linux kernel since version 3.5, released on the 21st
713	   of July, 2012, with the ce_threshold being added in version 4.2.  The
714	   algorithm has seen widespread testing in a variety of contexts and is
715	   configured as the default queueing discipline in a number of mainline
716	   Linux distributions (as of this writing at least OpenWRT, Arch Linux
717	   and Fedora).  We believe it to be a safe default and encourage people
718	   running Linux to turn it on: It is a massive improvement over the
719	   previous default FIFO queue.

721	   Of course there is always room for improvement, and this document has
722	   listed some of the know limitations of the algorithm.  As such, we
723	   encourage further research into algorithm refinements and addressing
724	   of limitations.  One such effort is undertaken by the bufferbloat
725	   community in the form of the Cake [1] queue management scheme.  In
726	   addition to this we believe the following (non-exhaustive) list of
727	   issues to be worthy of further enquiry:

729	   o  Variations on the flow classification mechanism to fit different
730	      notions of flows.  For instance, an ISP might want to deploy per-
731	      subscriber scheduling, while in other cases several flows can
732	      share a 5-tuple, as exemplified by the RTCWEB QoS recommendations
733	      [RTCWEB-QOS].

735	   o  Interactions between flow queueing and delay-based congestion
736	      control algorithms and scavenger protocols.

738	   o  Other scheduling mechanisms to replace the DRR portion of the
739	      algorithm, e.g.  QFQ or WFQ.

741	   o  Sensitivity of parameters, most notably the number of queues and
742	      the CoDel parameters.

744	9.  Security Considerations

746	   There are no specific security exposures associated with FQ-CoDel
747	   that are not also present in current FIFO systems.  And some are in
748	   fact reduced (e.g. simple minded packet floods).  However, some care
749	   is needed in the implementation to ensure this is the case.  These
750	   are included in the description above, however we reiterate them
751	   here:

753	   o  To prevent packets in the new queues from starving old queues, it
754	      is important that when a queue on the list of new queues empties,
755	      it is moved to _the end of_ the list of old queues.  This is
756	      described at the end of Section 5.2.

758	   o  To prevent an attacker targeting a specific flow for a denial of
759	      service attack, the hash that maps packets to queues should not be
760	      predictable.  To achieve this, FQ-CoDel salts the hash, as
761	      described in the beginning of Section 5.1.  The size of the salt
762	      and the strength of the hash function is obviously a tradeoff
763	      between performance and security.  The Linux implementation uses a
764	      32 bit random value as the salt and a Jenkins hash function.  This
765	      makes it possible to achieve very high throughput, and we consider
766	      it sufficient to ward off the most obvious attacks.

768	10.  IANA Considerations

770	   This document has no actions for IANA.

772	11.  Acknowledgements

774	   Our deepest thanks to Kathie Nichols, Van Jacobson, and all the
775	   members of the bufferbloat.net effort for all the help on developing
776	   and testing the algorithm.  In addition, our thanks to Anil Agarwal
777	   for his help with getting the hash collision probabilities in this
778	   document right.

780	12.  Conclusions

782	   FQ-CoDel is a very general, efficient, nearly parameterless queue
783	   management approach combining flow queueing with CoDel.  It is a very
784	   powerful tool for solving bufferbloat, and we believe it to be safe
785	   to turn on by default, as has already happened in a number of Linux
786	   distributions.  In this document we have documented the Linux
787	   implementation in sufficient detail for an independent
788	   implementation, and we encourage such implementations be widely
789	   deployed.

791	13.  References

793	13.1.  Normative References

795	   [CODELDRAFT]
796	              Nichols, K., Jacobson, V., McGregor, A., and J. Iyengar,
797	              "Controlled Delay Active Queue Management", October 2014,
798	              .

800	   [RFC6817]  Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind,
801	              "Low Extra Delay Background Transport (LEDBAT)", RFC 6817,
802	              DOI 10.17487/RFC6817, December 2012,
803	              .

805	   [RTCWEB-QOS]
806	              Dhesikan, S., Jennings, C., Druta, D., Jones, P., and J.
807	              Polk, "DSCP and other packet markings for RTCWeb QoS",
808	              December 2014, .

811	13.2.  Informative References

813	   [CODEL2012]
814	              Nichols, K. and V. Jacobson, "Controlling Queue Delay",
815	              July 2012, .

817	   [DRR]      Shreedhar, M. and G. Varghese, "Efficient Fair Queueing
818	              Using Deficit Round Robin", June 1996,
819	              .

822	   [DRRPP]    MacGregor, M. and W. Shi, "Deficits for Bursty Latency-
823	              critical Flows: DRR++", 2000,
824	              .

827	   [GONG2014]
828	              Gong, Y., Rossi, D., Testa, C., Valenti, S., and D. Taht,
829	              "Fighting the bufferbloat: on the coexistence of AQM and
830	              low priority congestion control", July 2014,
831	              .

834	   [SQF2012]  Bonald, T., Muscariello, L., and N. Ostallo, "On the
835	              impact of TCP and per-flow scheduling on Internet
836	              Performance - IEEE/ACM transactions on Networking", April
837	              2012, .

840	13.3.  URIs

842	   [1] http://www.bufferbloat.net/projects/codel/wiki/Cake

844	Authors' Addresses

846	   Toke Hoeiland-Joergensen
847	   Karlstad University
848	   Dept. of Computer Science
849	   Karlstad  65188
850	   Sweden

852	   Email: toke.hoiland-jorgensen@kau.se
853	   Paul McKenney
854	   IBM Linux Technology Center
855	   1385 NW Amberglen Parkway
856	   Hillsboro, OR  97006
857	   USA

859	   Email: paulmck@linux.vnet.ibm.com
860	   URI:   http://www2.rdrop.com/~paulmck/

862	   Dave Taht
863	   Teklibre
864	   2104 W First street
865	   Apt 2002
866	   FT Myers, FL  33901
867	   USA

869	   Email: dave.taht@gmail.com
870	   URI:   http://www.teklibre.com/

872	   Jim Gettys
873	   21 Oak Knoll Road
874	   Carlisle, MA  993
875	   USA

877	   Email: jg@freedesktop.org
878	   URI:   https://en.wikipedia.org/wiki/Jim_Gettys

880	   Eric Dumazet
881	   Google, Inc.
882	   1600 Amphitheater Pkwy
883	   Mountain View, CA  94043
884	   USA

886	   Email: edumazet@gmail.com