idnits 2.17.1 

draft-ietf-aqm-fq-codel-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 578 has weird spacing: '...st_head  flowc...'

  -- The document date (March 01, 2016) is 2978 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	AQM working group                                 T. Hoeiland-Joergensen
3	Internet-Draft                                       Karlstad University
4	Intended status: Experimental                                P. McKenney
5	Expires: September 2, 2016                   IBM Linux Technology Center
6	                                                                 D. Taht
7	                                                                Teklibre
8	                                                               J. Gettys

10	                                                              E. Dumazet
11	                                                            Google, Inc.
12	                                                          March 01, 2016

14	                            FlowQueue-Codel
15	                       draft-ietf-aqm-fq-codel-05

17	Abstract

19	   This memo presents the FQ-CoDel hybrid packet scheduler/AQM
20	   algorithm, a powerful tool for fighting bufferbloat and reducing
21	   latency.

23	   FQ-CoDel mixes packets from multiple flows and reduces the impact of
24	   head of line blocking from bursty traffic.  It provides isolation for
25	   low-rate traffic such as DNS, web, and videoconferencing traffic.  It
26	   improves utilisation across the networking fabric, especially for
27	   bidirectional traffic, by keeping queue lengths short; and it can be
28	   implemented in a memory- and CPU-efficient fashion across a wide
29	   range of hardware.

31	Status of This Memo

33	   This Internet-Draft is submitted in full conformance with the
34	   provisions of BCP 78 and BCP 79.

36	   Internet-Drafts are working documents of the Internet Engineering
37	   Task Force (IETF).  Note that other groups may also distribute
38	   working documents as Internet-Drafts.  The list of current Internet-
39	   Drafts is at http://datatracker.ietf.org/drafts/current/.

41	   Internet-Drafts are draft documents valid for a maximum of six months
42	   and may be updated, replaced, or obsoleted by other documents at any
43	   time.  It is inappropriate to use Internet-Drafts as reference
44	   material or to cite them other than as "work in progress."

46	   This Internet-Draft will expire on September 2, 2016.

48	Copyright Notice

50	   Copyright (c) 2016 IETF Trust and the persons identified as the
51	   document authors.  All rights reserved.

53	   This document is subject to BCP 78 and the IETF Trust's Legal
54	   Provisions Relating to IETF Documents
55	   (http://trustee.ietf.org/license-info) in effect on the date of
56	   publication of this document.  Please review these documents
57	   carefully, as they describe your rights and restrictions with respect
58	   to this document.  Code Components extracted from this document must
59	   include Simplified BSD License text as described in Section 4.e of
60	   the Trust Legal Provisions and are provided without warranty as
61	   described in the Simplified BSD License.

63	Table of Contents

65	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
66	     1.1.  Conventions used in this document . . . . . . . . . . . .   3
67	     1.2.  Terminology and concepts  . . . . . . . . . . . . . . . .   4
68	     1.3.  Informal summary of FQ-CoDel  . . . . . . . . . . . . . .   4
69	   2.  CoDel . . . . . . . . . . . . . . . . . . . . . . . . . . . .   5
70	   3.  Flow Queueing . . . . . . . . . . . . . . . . . . . . . . . .   6
71	   4.  The FQ-CoDel scheduler  . . . . . . . . . . . . . . . . . . .   7
72	     4.1.  Enqueue . . . . . . . . . . . . . . . . . . . . . . . . .   7
73	       4.1.1.  Alternative classification schemes  . . . . . . . . .   8
74	     4.2.  Dequeue . . . . . . . . . . . . . . . . . . . . . . . . .   8
75	   5.  Implementation considerations . . . . . . . . . . . . . . . .   9
76	     5.1.  Data structures . . . . . . . . . . . . . . . . . . . . .  10
77	     5.2.  Parameters  . . . . . . . . . . . . . . . . . . . . . . .  10
78	       5.2.1.  Interval  . . . . . . . . . . . . . . . . . . . . . .  10
79	       5.2.2.  Target  . . . . . . . . . . . . . . . . . . . . . . .  10
80	       5.2.3.  Packet limit  . . . . . . . . . . . . . . . . . . . .  11
81	       5.2.4.  Quantum . . . . . . . . . . . . . . . . . . . . . . .  11
82	       5.2.5.  Flows . . . . . . . . . . . . . . . . . . . . . . . .  11
83	       5.2.6.  ECN . . . . . . . . . . . . . . . . . . . . . . . . .  11
84	       5.2.7.  CE threshold  . . . . . . . . . . . . . . . . . . . .  12
85	     5.3.  Probability of hash collisions  . . . . . . . . . . . . .  12
86	     5.4.  Memory Overhead . . . . . . . . . . . . . . . . . . . . .  12
87	     5.5.  Per-Packet Timestamping . . . . . . . . . . . . . . . . .  13
88	     5.6.  Other forms of "Fair Queueing"  . . . . . . . . . . . . .  14
89	     5.7.  Differences between CoDel and FQ-CoDel behaviour  . . . .  14
90	   6.  Limitations of flow queueing  . . . . . . . . . . . . . . . .  15
91	     6.1.  Fairness between things other than flows  . . . . . . . .  15
92	     6.2.  Flow bunching by opaque encapsulation . . . . . . . . . .  15
93	     6.3.  Low-priority congestion control algorithms  . . . . . . .  16
94	   7.  Deployment status and future work . . . . . . . . . . . . . .  16
95	   8.  Security Considerations . . . . . . . . . . . . . . . . . . .  17
96	   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  18
97	   10. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  18
98	   11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  18
99	   12. References  . . . . . . . . . . . . . . . . . . . . . . . . .  18
100	     12.1.  Normative References . . . . . . . . . . . . . . . . . .  18
101	     12.2.  Informative References . . . . . . . . . . . . . . . . .  19
102	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  19

104	1.  Introduction

106	   The FQ-CoDel algorithm is a combined packet scheduler and AQM
107	   developed as part of the bufferbloat-fighting community effort.  It
108	   is based on a modified Deficit Round Robin (DRR) queue scheduler,
109	   with the CoDel AQM algorithm operating on each queue.  This document
110	   describes the combined algorithm; reference implementations are
111	   available for ns2 and ns3 and it is included in the mainline Linux
112	   kernel as the fq_codel queueing discipline.

114	   Since the FQ-CoDel algorithm was originally developed in the Linux
115	   kernel, that implementation is still considered canonical.  This
116	   document strives to describe the algorithm in the abstract in the
117	   first sections and separate out most implementation details in
118	   subsequent sections, but does use the Linux implementation as
119	   reference for default behaviour in the algorithm description itself.

121	   The rest of this document is structured as follows: This section
122	   gives some concepts and terminology used in the rest of the document,
123	   and gives a short informal summary of the FQ-CoDel algorithm.
124	   Section 2 gives an overview of the CoDel algorithm.  Section 3 covers
125	   the flow hashing and DRR portion.  Section 4 then describes the
126	   working of the algorithm in detail, while Section 5 describes
127	   implementation details and considerations.  Section 6 lists some of
128	   the limitations of using flow queueing.  Section 7 outlines the
129	   current status of FQ-CoDel deployment and lists some possible future
130	   areas of inquiry, and Section 8 reiterates some important security
131	   points that must be observed in the implementation.  Finally,
132	   Section 11 concludes.

134	1.1.  Conventions used in this document

136	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
137	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
138	   document are to be interpreted as described in [RFC2119].

140	   In this document, these words will appear with that interpretation
141	   only when in ALL CAPS.  Lower case uses of these words are not to be
142	   interpreted as carrying [RFC2119] significance.

144	1.2.  Terminology and concepts

146	   Flow: A flow is typically identified by a 5-tuple of source IP,
147	   destination IP, source port, destination port, and protocol.  It can
148	   also be identified by a superset or subset of those parameters, or by
149	   mac address, or other means.

151	   Queue: A queue of packets represented internally in FQ-CoDel.  In
152	   most instances each flow gets its own queue; however because of the
153	   possibility of hash collisions, this is not always the case.  In an
154	   attempt to avoid confusion, the word 'queue' is used to refer to the
155	   internal data structure, and 'flow' to refer to the actual stream of
156	   packets being delivered to the FQ-CoDel algorithm.

158	   Scheduler: A mechanism to select which queue a packet is dequeued
159	   from.

161	   CoDel AQM: The Active Queue Management algorithm employed by FQ-
162	   CoDel.

164	   DRR: Deficit round-robin scheduling.

166	   Quantum: The maximum amount of bytes to be dequeued from a queue at
167	   once.

169	1.3.  Informal summary of FQ-CoDel

171	   FQ-CoDel is a _hybrid_ of DRR [DRR] and CoDel [CODELDRAFT], with an
172	   optimisation for sparse flows similar to SQF [SQF2012] and DRR++
173	   [DRRPP].  We call this "Flow Queueing" rather than "Fair Queueing" as
174	   flows that build a queue are treated differently than flows that do
175	   not.

177	   FQ-CoDel stochastically classifies incoming packets into different
178	   queues by hashing the 5-tuple of IP protocol number and source and
179	   destination IP and port numbers, perturbed with a random number
180	   selected at initiation time (although other flow classification
181	   schemes can optionally be configured instead).  Each queue is managed
182	   by the CoDel AQM algorithm.  Packet ordering within a queue is
183	   preserved, since queues have FIFO ordering.

185	   The FQ-CoDel algorithm consists of two logical parts: the scheduler
186	   which selects which queue to dequeue a packet from, and the CoDel AQM
187	   which works on each of the queues.  The subtleties of FQ-CoDel are
188	   mostly in the scheduling part, whereas the interaction between the
189	   scheduler and the CoDel algorithm are fairly straight forward:

191	   At initialisation, each queue is set up to have a separate set of
192	   CoDel state variables.  By default, 1024 queues are created.  The
193	   current implementation supports anywhere from one to 64K separate
194	   queues, and each queue maintains the state variables throughout its
195	   lifetime, and so acts the same as the non-FQ CoDel variant would.
196	   This means that with only one queue, FQ-CoDel behaves essentially the
197	   same as CoDel by itself.

199	   On dequeue, FQ-CoDel selects a queue from which to dequeue by a two-
200	   tier round-robin scheme, in which each queue is allowed to dequeue up
201	   to a configurable quantum of bytes for each iteration.  Deviations
202	   from this quantum is maintained as byte credits for the queue, which
203	   serves to make the fairness scheme byte-based rather than a packet-
204	   based.  The two-tier round-robin mechanism distinguishes between
205	   "new" queues (which don't build up a standing queue) and "old"
206	   queues, that have queued enough data to be around for more than one
207	   iteration of the round-robin scheduler.

209	   This new/old queue distinction has a particular consequence for
210	   queues that don't build up more than a quantum of bytes before being
211	   visited by the scheduler: Such queues are removed from the list, and
212	   then re-added as a new queue each time a packet arrives for it, and
213	   so will get priority over queues that do not empty out each round
214	   (except for a minor modification to protect against starvation,
215	   detailed below).  Exactly how little data a flow has to send to keep
216	   its queue in this state is somewhat difficult to reason about,
217	   because it depends on both the egress link speed and the number of
218	   concurrent flows.  However, in practice many things that are
219	   beneficial to have prioritised for typical internet use (ACKs, DNS
220	   lookups, interactive SSH, HTTP requests, ARP, RA, ICMP, VoIP) _tend_
221	   to fall in this category, which is why FQ-CoDel performs so well for
222	   many practical applications.  However, the implicitness of the
223	   prioritisation means that for applications that require guaranteed
224	   priority (for instance multiplexing the network control plane over
225	   the network itself), explicit classification is still needed.

227	   This scheduling scheme has some subtlety to it, which is explained in
228	   detail in the remainder of this document.

230	2.  CoDel

232	   CoDel is described in the the ACM Queue paper [CODEL2012], and the
233	   AQM working group draft [CODELDRAFT].  The basic idea is to control
234	   queue length, maintaining sufficient queueing to keep the outgoing
235	   link busy, but avoiding building up the queue beyond that point.
236	   This is done by preferentially dropping packets that remain in the
237	   queue for "too long".  Packets are dropped by head drop, which lowers
238	   the time for the drop signal to propagate back to the sender by the
239	   length of the queue, and helps trigger TCP fast retransmit sooner.

241	   The CoDel algorithm itself will not be described here; instead we
242	   refer the reader to the CoDel draft [CODELDRAFT].

244	3.  Flow Queueing

246	   The intention of FQ-CoDel's scheduler is to give each _flow_ its own
247	   queue, hence the term _Flow Queueing_. Rather than a perfect
248	   realisation of this, a hashing-based scheme is used, where flows are
249	   hashed into a number of buckets which each has its own queue.  The
250	   number of buckets are configurable, and presently defaults to 1024 in
251	   the Linux implementation.  This is enough to avoid hash collisions on
252	   a moderate number of flows as seen for instance in a home gateway.
253	   Depending on the characteristics of the link, this can be tuned to
254	   trade off memory for a lower probability of hash collisions.  See
255	   Section 6 for a more in-depth discussion of this.

257	   By default, the flow hashing is performed on the 5-tuple of source
258	   and destination IP and port numbers and IP protocol number.  While
259	   the hashing can be customised to match on arbitrary packet bytes,
260	   care should be taken when doing so: Much of the benefit of the FQ-
261	   CoDel scheduler comes from this per-flow distinction.  However, the
262	   default hashing does have some limitations, as discussed in
263	   Section 6.

265	   FQ-CoDel's DRR scheduler is byte-based, employing a deficit round-
266	   robin mechanism between queues.  This works by keeping track of the
267	   current number _byte credits_ of each queue.  This number is is
268	   initialised to the configurable quantum; each time a queue gets a
269	   dequeue opportunity, it gets to dequeue packets, decreasing the
270	   number of credits by the packet size for each packet.  This continues
271	   until the number of credits runs into the negative, at which point it
272	   is increased by one quantum, and the dequeue opportunity ends.

274	   This means that if one queue contains packets of, for instance, size
275	   quantum/3, and another contains quantum-sized packets, the first
276	   queue will dequeue three packets each time it gets a turn, whereas
277	   the second only dequeues one.  This means that flows that send small
278	   packets are not penalised by the difference in packet sizes; rather,
279	   the DRR scheme approximates a (single-)byte-based fairness queueing.
280	   The size of the quantum determines the scheduling granularity, with
281	   the tradeoff from too small a quantum being scheduling overhead.  For
282	   small bandwidths, lowering the quantum from the default MTU size can
283	   be advantageous.

285	   Unlike plain DRR there are two sets of flows - a "new" list for flows
286	   that have not built a queue recently, and an "old" list for flow-
287	   building queues.  This distinction is an integral part of the FQ-
288	   CoDel scheduler and is described in more detail in Section 4.

290	4.  The FQ-CoDel scheduler

292	   To make its scheduling decisions, FQ-CoDel maintains two ordered
293	   lists of active queues, called "new" and "old" queues.  When a packet
294	   is added to a queue that is not currently active, that queue becomes
295	   active by being added to the list of new queues.  Later on, it is
296	   moved to the list of old queues, from which it is removed when it is
297	   no longer active.  This behaviour is the source of some subtlety in
298	   the packet scheduling at dequeue time, explained below.

300	4.1.  Enqueue

302	   The packet enqueue mechanism consists of three stages: classification
303	   into a queue, timestamping and bookkeeping, and optionally dropping a
304	   packet when the total number of enqueued packets goes over the
305	   maximum.

307	   When a packet is enqueued, it is first classified into the
308	   appropriate queue.  By default, this is done by hashing (using a
309	   Jenkins hash function) on the 5-tuple of IP protocol, and source and
310	   destination IP and port numbers, permuted by a random value selected
311	   at initialisation time, and taking the hash value modulo the number
312	   of queues.

314	   Once the packet has been successfully classified into a queue, it is
315	   handed over to the CoDel algorithm for timestamping.  It is then
316	   added to the tail of the selected queue, and the queue's byte count
317	   is updated by the packet size.  Then, if the queue is not currently
318	   active (i.e. if it is not in either the list of new or the list of
319	   old queues), it is added to the end of the list of new queues, and
320	   its number of credits is initiated to the configured quantum.
321	   Otherwise, the queue is left in its current queue list.

323	   Finally, the total number of enqueued packets is compared with the
324	   configured limit, and if it is _above_ this value (which can happen
325	   since a packet was just enqueued), a packet is dropped from the head
326	   of the queue with the largest current byte count.  Note that this in
327	   most cases means that the packet that gets dropped is different from
328	   the one that was just enqueued, and may even be from a different
329	   queue.

331	4.1.1.  Alternative classification schemes

333	   As mentioned previously, it is possible to modify the classification
334	   scheme to provide a different notion of a 'flow'.  The Linux
335	   implementation provides this option in the form of the "tc filter"
336	   command.  While this can add capabilities (for instance, matching on
337	   other possible parameters such as mac address, diffserv, firewall and
338	   flow specific markings, etc.), care should be taken to preserve the
339	   notion of 'flow' as much of the benefit of the FQ-CoDel scheduler
340	   comes from keeping flows in separate queues.  We are not aware of any
341	   deployments utilising the custom classification feature.

343	   An alternative to changing the classification scheme is to perform
344	   decapsulation prior to hashing.  The Linux implementation does this
345	   for common encapsulations known to the kernel, such as 6in4, IPIP and
346	   GRE tunnels.  This helps to distinguish between flows that share the
347	   same (outer) 5-tuple, but of course is limited to unencrypted tunnels
348	   (see Section 6.2).

350	4.2.  Dequeue

352	   Most of FQ-CoDel's work is done at packet dequeue time.  It consists
353	   of three parts: selecting a queue from which to dequeue a packet,
354	   actually dequeuing it (employing the CoDel algorithm in the process),
355	   and some final bookkeeping.

357	   For the first part, the scheduler first looks at the list of new
358	   queues; for the queue at the head of that list, if that queue has a
359	   negative number of credits (i.e. it has already dequeued at least a
360	   quantum of bytes), it is given an additional quantum of credits, the
361	   queue is put onto _the end of_ the list of old queues, and the
362	   routine selects the next queue and starts again.

364	   Otherwise, that queue is selected for dequeue.  If the list of new
365	   queues is empty, the scheduler proceeds down the list of old queues
366	   in the same fashion (checking the credits, and either selecting the
367	   queue for dequeuing, or adding credits and putting the queue back at
368	   the end of the list).

370	   After having selected a queue from which to dequeue a packet, the
371	   CoDel algorithm is invoked on that queue.  This applies the CoDel
372	   control law, and may discard one or more packets from the head of
373	   that queue, before returning the packet that should be dequeued (or
374	   nothing if the queue is or becomes empty while being handled by the
375	   CoDel algorithm).

377	   Finally, if the CoDel algorithm does not return a packet, then the
378	   queue must be empty, and the scheduler does one of two things: if the
379	   queue selected for dequeue came from the list of new queues, it is
380	   moved to _the end of_ the list of old queues.  If instead it came
381	   from the list of old queues, that queue is removed from the list, to
382	   be added back (as a new queue) the next time a packet arrives that
383	   hashes to that queue.  Then (since no packet was available for
384	   dequeue), the whole dequeue process is restarted from the beginning.

386	   If, instead, the scheduler _did_ get a packet back from the CoDel
387	   algorithm, it subtracts the size of the packet from the the byte
388	   credits for the selected queue and returns the packet as the result
389	   of the dequeue operation.

391	   The step that moves an empty queue from the list of new queues to
392	   _the end of_ the list of old queues before it is removed is crucial
393	   to prevent starvation.  Otherwise the queue could reappear (the next
394	   time a packet arrives for it) before the list of old queues is
395	   visited; this can go on indefinitely even with a small number of
396	   active flows, if the flow providing packets to the queue in question
397	   transmits at just the right rate.  This is prevented by first moving
398	   the queue to _the end of_ the list of old queues, forcing a pass
399	   through that, and thus preventing starvation.  Moving it to the end
400	   of the list, rather than the front, is crucial for this to work.

402	   The resulting migration of queues between the different states is
403	   summarised in the following state diagram:

405	   +-----------------+                +------------------+
406	   |                 |     Empty      |                  |
407	   |     Empty       |<---------------+       Old        +----+
408	   |                 |                |                  |    |
409	   +-------+---------+                +------------------+    |
410	           |                             ^            ^       |Credits
411	           |Arrival                      |            |       |Exhausted
412	           v                             |            |       |
413	   +-----------------+                   |            |       |
414	   |                 |      Empty or     |            |       |
415	   |      New        +-------------------+            +-------+
416	   |                 | Credits exhausted
417	   +-----------------+

419	5.  Implementation considerations

421	   This section contains implementation details for the FQ-CoDel
422	   algorithm.  This includes the data structures and parameters used in
423	   the Linux implementation, as well as discussion of some required
424	   features of the target platform and other considerations.

426	5.1.  Data structures

428	   The main data structure of FQ-CoDel is the array of queues, which is
429	   instantiated with the number of queues specified by the _flows_
430	   parameter at instantiation time.  Each queue consists simply of an
431	   ordered list of packets with FIFO semantics, two state variables
432	   tracking the queue credits and total number of bytes enqueued, and
433	   the set of CoDel state variables.  Other state variables to track
434	   queue statistics can also be included: for instance, the Linux
435	   implementation keeps a count of dropped packets.

437	   In addition to the queue structures themselves, FQ-CoDel maintains
438	   two ordered lists containing references to the subset of queues that
439	   are currently active.  These are the list of 'new' queues and the
440	   list of 'old' queues, as explained in Section 4 above.

442	   In the Linux implementation, queue space is shared: there's a global
443	   limit on the number of packets the queues can hold, but not one per
444	   queue.

446	5.2.  Parameters

448	   The following are the user configuration parameters exposed by the
449	   Linux implementation of FQ-CoDel.

451	5.2.1.  Interval

453	   The _interval_ parameter has the same semantics as CoDel and is used
454	   to ensure that the measured minimum delay does not become too stale.
455	   That is, CoDel only reacts to delay experienced in the last epoch of
456	   length interval.  It SHOULD be set on the order of the worst-case RTT
457	   through the bottleneck to give end-points sufficient time to react.

459	   The default interval value is 100 ms.

461	5.2.2.  Target

463	   The _target_ parameter has the same semantics as CoDel.  It is the
464	   acceptable minimum standing/persistent queue delay for each FQ-CoDel
465	   Queue.  This minimum delay is identified by tracking the local
466	   minimum queue delay that packets experience.

468	   The default target value is 5 ms, but this value should be tuned to
469	   be at least the transmission time of a single MTU-sized packet at the
470	   prevalent egress link speed (which for e.g. 1Mbps and MTU 1500 is
471	   ~15ms), to prevent CoDel from being too aggressive at low bandwidths.
472	   It should otherwise be set to on the order of 5-10% of the configured
473	   interval.

475	5.2.3.  Packet limit

477	   Routers do not have infinite memory, so some packet limit MUST be
478	   enforced.

480	   The _limit_ parameter is the hard limit on the real queue size,
481	   measured in number of packets.  This limit is a global limit on the
482	   number of packets in all queues; each individual queue does not have
483	   an upper limit.  When the limit is reached and a new packet arrives
484	   for enqueue, a packet is dropped from the head of the largest queue
485	   (measured in bytes) to make room for the new packet.

487	   In Linux, the default packet limit is 10240 packets, which is
488	   suitable for up to 10GigE speeds.  In practice, the hard limit is
489	   rarely, if ever, hit, as drops are performed by the CoDel algorithm
490	   long before the limit is hit.  For platforms that are severely memory
491	   constrained, a lower limit can be used.

493	5.2.4.  Quantum

495	   The _quantum_ parameter is the number of bytes each queue gets to
496	   dequeue on each round of the scheduling algorithm.  The default is
497	   set to 1514 bytes which corresponds to the Ethernet MTU plus the
498	   hardware header length of 14 bytes.

500	   In TSO-enabled systems, where a "packet" consists of an offloaded
501	   packet train, it can presently be as large as 64K bytes.  In GRO-
502	   enabled systems, up to 17 times the TCP max segment size (or 25K
503	   bytes).  These mega-packets severely impact FQ-CoDel's ability to
504	   schedule traffic, and hurt latency needlessly.  There is ongoing work
505	   in Linux to make smarter use of offload engines.

507	5.2.5.  Flows

509	   The _flows_ parameter sets the number of queues into which the
510	   incoming packets are classified.  Due to the stochastic nature of
511	   hashing, multiple flows may end up being hashed into the same slot.

513	   This parameter can be set only at load time since memory has to be
514	   allocated for the hash table in the current implementation.

516	   The default value is 1024 in the current Linux implementation.

518	5.2.6.  ECN

520	   ECN is _enabled_ by default.  Rather than do anything special with
521	   misbehaved ECN flows, FQ-CoDel relies on the packet scheduling system
522	   to minimise their impact, thus unresponsive packets in a flow being
523	   marked with ECN can grow to the overall packet limit, but will not
524	   otherwise affect the performance of the system.

526	   It can be disabled by specifying the _noecn_ parameter.

528	5.2.7.  CE threshold

530	   This parameter enables DCTCP-like processing to enable CE marking ECT
531	   packets at a lower setpoint than the default codel target.

533	   The parameter, _ce_threshold_, is disabled by default and can be set
534	   to a number of microseconds to enable.

536	5.3.  Probability of hash collisions

538	   Since the Linux FQ-CoDel implementation by default uses 1024 hash
539	   buckets, the probability that (say) 100 flows will all hash to the
540	   same bucket is something like ten to the power of minus 300.  Thus,
541	   the probability that at least one of the flows will hash to some
542	   other queue is very high indeed.

544	   Expanding on this, based on analytical equations for hash collision
545	   probabilities, for 100 flows, the probability of no collision is
546	   90.78%; the probability that no more than two of the 100 flows will
547	   be involved in any given collision = 99.57%; and the probability that
548	   no more than three of the 100 flows will be involved in any given
549	   collision = 99.99%.

551	   These probabilities can be improved upon by using set-associative
552	   hashing, a technique used in the Cake algorithm currently being
553	   developed as a further development upon the FQ-CoDel principles.  For
554	   a 4-way associative hash with the same number of total queues, the
555	   probability of no collisions for 100 flows is 99.93%, while for an
556	   8-way associative hash it is ~100%.

558	5.4.  Memory Overhead

560	   FQ-CoDel can be implemented with a very low memory footprint (less
561	   than 64 bytes per queue on 64 bit systems).  These are the data
562	   structures used in the Linux implementation:

564	   <CODE BEGINS>
565	   struct codel_vars {
566	      u32             count;
567	      u32             lastcount;
568	      bool            dropping;
569	      u16             rec_inv_sqrt;
570	      codel_time_t    first_above_time;
571	      codel_time_t    drop_next;
572	      codel_time_t    ldelay;
573	   };

575	   struct fq_codel_flow {
576	      struct sk_buff    *head;
577	      struct sk_buff    *tail;
578	      struct list_head  flowchain;
579	      int               credits;  /* current number of queue credits */
580	      u32               dropped; /* # of drops (or ECN marks) on flow */
581	      struct codel_vars cvars;
582	   };

584	   <CODE ENDS>

586	   The master table managing all queues looks like this:

588	   <CODE BEGINS>

590	   struct fq_codel_sched_data {
591	      struct tcf_proto *filter_list;  /* optional external classifier */
592	      struct fq_codel_flow *flows;    /* Flows table [flows_cnt] */
593	      u32             *backlogs;      /* backlog table [flows_cnt] */
594	      u32             flows_cnt;      /* number of flows */
595	      u32             perturbation;   /* hash perturbation */
596	      u32             quantum;        /* psched_mtu(qdisc_dev(sch)); */
597	      struct codel_params cparams;
598	      struct codel_stats cstats;
599	      u32             drop_overlimit;
600	      u32             new_flow_count;

602	      struct list_head new_flows;     /* list of new flows */
603	      struct list_head old_flows;     /* list of old flows */
604	   };

606	   <CODE ENDS>

608	5.5.  Per-Packet Timestamping

610	   The CoDel portion of the algorithm requires per-packet timestamps be
611	   stored along with the packet.  While this approach works well for
612	   software-based routers, it may be impossible to retrofit devices that
613	   do most of their processing in silicon and lack space or mechanism
614	   for timestamping.

616	   Also, while perfect resolution is not needed, timestamp resolution
617	   below the target is necessary.  Furthermore, timestamping functions
618	   in the core OS need to be efficient as they are called at least once
619	   on each packet enqueue and dequeue.

621	5.6.  Other forms of "Fair Queueing"

623	   Much of the scheduling portion of FQ-CoDel is derived from DRR and is
624	   substantially similar to DRR++. SFQ-based versions have also been
625	   produced and tested in ns2.  Other forms of Fair Queueing, such as
626	   WFQ or QFQ, have not been thoroughly explored, but there's no a
627	   priori reason why the round-robin scheduling of FQ-CoDel couldn't be
628	   replaced with something else.

630	5.7.  Differences between CoDel and FQ-CoDel behaviour

632	   CoDel can be applied to a single queue system as a straight AQM,
633	   where it converges towards an "ideal" drop rate (i.e. one that
634	   minimises delay while keeping a high link utilisation), and then
635	   optimises around that control point.

637	   The scheduling of FQ-CoDel mixes packets of competing flows, which
638	   acts to pace bursty flows to better fill the pipe.  Additionally, a
639	   new flow gets substantial leeway over other flows until CoDel finds
640	   an ideal drop rate for it.  However, for a new flow that exceeds the
641	   configured quantum, more time passes before all of its data is
642	   delivered (as packets from it, too, are mixed across the other
643	   existing queue-building flows).  Thus, FQ-CoDel takes longer (as
644	   measured in time) to converge towards an ideal drop rate for a given
645	   new flow, but does so within fewer delivered _packets_ from that
646	   flow.

648	   Finally, the flow isolation FQ-CoDel provides means that the CoDel
649	   drop mechanism operates on the flows actually building queues, which
650	   results in packets being dropped more accurately from the largest
651	   flows than CoDel alone manages.  Additionally, flow isolation
652	   radically improves the transient behaviour of the network when
653	   traffic or link characteristics change (e.g. when new flows start up
654	   or the link bandwidth changes); while CoDel itself can take a while
655	   to respond, fq_codel doesn't miss a beat.

657	6.  Limitations of flow queueing

659	   While FQ-CoDel has been shown in many scenarios to offer significant
660	   performance gains, there are some scenarios where the scheduling
661	   algorithm in particular is not a good fit.  This section documents
662	   some of the known cases which either may require tweaking the default
663	   behaviour, or where alternatives to flow queueing should be
664	   considered.

666	6.1.  Fairness between things other than flows

668	   In some parts of the network, enforcing flow-level fairness may not
669	   be desirable, or some other form of fairness may be more important.
670	   An example of this can be an Internet Service Provider that may be
671	   more interested in ensuring fairness between customers than between
672	   flows.  Or a hosting or transit provider that wishes to ensure
673	   fairness between connecting Autonomous Systems or networks.  Another
674	   issue can be that the number of simultaneous flows experienced at a
675	   particular link can be too high for flow-based fairness queueing to
676	   be effective.

678	   Whatever the reason, in a scenario where fairness between flows is
679	   not desirable, reconfiguring FQ-CoDel to match on a different
680	   characteristic can be a way forward.  The implementation in Linux can
681	   leverage the packet matching mechanism of the _tc_ subsystem to use
682	   any available packet field to partition packets into virtual queues,
683	   to for instance match on address or subnet source/destination pairs,
684	   application layer characteristics, etc.

686	   Furthermore, as commonly deployed today, FQ-CoDel is used with three
687	   or more tiers of classification: priority, best effort and
688	   background, based on diffserv markings.  Some products do more
689	   detailed classification, including deep packet inspection and
690	   destination-specific filters to achieve their desired result.

692	6.2.  Flow bunching by opaque encapsulation

694	   Where possible, FQ-CoDel will attempt to decapsulate packets before
695	   matching on the header fields for the flow hashing.  However, for
696	   some encapsulation techniques, most notably encrypted VPNs, this is
697	   not possible.  If several flows are bunched into one such
698	   encapsulated tunnel, they will be seen as one flow by the FQ-CoDel
699	   algorithm.  This means that they will share a queue, and drop
700	   behaviour, and so flows inside the encapsulation will not benefit
701	   from the implicit prioritisation of FQ-CoDel, but will continue to
702	   benefit from the reduced overall queue length from the CoDel
703	   algorithm operating on the queue.  In addition, when such an
704	   encapsulated bunch competes against other flows, it will count as one
705	   flow, and not assigned a share of the bandwidth based on how many
706	   flows are inside the encapsulation.

708	   Depending on the application, this may or may not be desirable
709	   behaviour.  In cases where it is not, changing FQ-CoDel's matching to
710	   not be flow-based (as detailed in the previous subsection above) can
711	   be a mitigation.  Going forward, having some mechanism for opaque
712	   encapsulations to express to the outer layer which flow a packet
713	   belongs to, could be a way to mitigate this.

715	6.3.  Low-priority congestion control algorithms

717	   In the presence of queue management schemes that contain latency
718	   under load, low-priority congestion control algorithms such as LEDBAT
719	   [RFC6817] (or, in general, algorithms that try to voluntarily use up
720	   less than their fair share of bandwidth) experiences very little
721	   added latency when the link is congested.  Thus, they lack the signal
722	   to back off that added latency previously afforded them.  This effect
723	   is seen with FQ-CoDel as well as with any effective AQM [GONG2014].

725	   As such, these delay-based algorithms tend to revert to loss-based
726	   congestion control, and will consume the fair share of bandwidth
727	   afforded to them by the FQ-CoDel scheduler.  However, low-priority
728	   congestion control mechanisms may be able to take steps to continue
729	   to be low priority, for instance by taking into account the vastly
730	   reduced level of delay afforded by an AQM, or by using a coupled
731	   approach to observing the behaviour of multiple flows.

733	7.  Deployment status and future work

735	   The FQ-CoDel algorithm as described in this document has been shipped
736	   as part of the Linux kernel since version 3.5, released on the 21st
737	   of July, 2012, with the ce_threshold being added in version 4.2.  The
738	   algorithm has seen widespread testing in a variety of contexts and is
739	   configured as the default queueing discipline in a number of mainline
740	   Linux distributions (as of this writing at least OpenWRT, Arch Linux
741	   and Fedora).  We believe it to be a safe default and encourage people
742	   running Linux to turn it on: It is a massive improvement over the
743	   previous default FIFO queue.

745	   Of course there is always room for improvement, and this document has
746	   listed some of the know limitations of the algorithm.  As such, we
747	   encourage further research into algorithm refinements and addressing
748	   of limitations.  One such effort is undertaken by the bufferbloat
749	   community in the form of the Cake queue management scheme (see
750	   http://www.bufferbloat.net/projects/codel/wiki/Cake).  In addition to
751	   this we believe the following (non-exhaustive) list of issues to be
752	   worthy of further enquiry:

754	   o  Variations on the flow classification mechanism to fit different
755	      notions of flows.  For instance, an ISP might want to deploy per-
756	      subscriber scheduling, while in other cases several flows can
757	      share a 5-tuple, as exemplified by the RTCWEB QoS recommendations
758	      [RTCWEB-QOS].

760	   o  Interactions between flow queueing and delay-based congestion
761	      control algorithms and scavenger protocols.

763	   o  Other scheduling mechanisms to replace the DRR portion of the
764	      algorithm, e.g.  QFQ or WFQ.

766	   o  Sensitivity of parameters, most notably the number of queues and
767	      the CoDel parameters.

769	8.  Security Considerations

771	   There are no specific security exposures associated with FQ-CoDel
772	   that are not also present in current FIFO systems.  And some are in
773	   fact reduced (e.g. simple minded packet floods).  However, some care
774	   is needed in the implementation to ensure this is the case.  These
775	   are included in the description above, however we reiterate them
776	   here:

778	   o  To prevent packets in the new queues from starving old queues, it
779	      is important that when a queue on the list of new queues empties,
780	      it is moved to _the end of_ the list of old queues.  This is
781	      described at the end of Section 4.2.

783	   o  To prevent an attacker targeting a specific flow for a denial of
784	      service attack, the hash that maps packets to queues should not be
785	      predictable.  To achieve this, FQ-CoDel salts the hash, as
786	      described in the beginning of Section 4.1.  The size of the salt
787	      and the strength of the hash function is obviously a tradeoff
788	      between performance and security.  The Linux implementation uses a
789	      32 bit random value as the salt and a Jenkins hash function.  This
790	      makes it possible to achieve very high throughput, and we consider
791	      it sufficient to ward off the most obvious attacks.

793	   o  Packet fragments without a layer 4 header can be hashed into
794	      different bins than the first fragment with the header intact.
795	      This can cause reordering and/or adversely affect the performance
796	      of the flow.  Keeping state to match the fragments to the
797	      beginning of the packet, or simply putting all fragmented packets
798	      into the same queue, are two ways to alleviate this.

800	9.  IANA Considerations

802	   This document has no actions for IANA.

804	10.  Acknowledgements

806	   Our deepest thanks to Kathie Nichols, Van Jacobson, and all the
807	   members of the bufferbloat.net effort for all the help on developing
808	   and testing the algorithm.  In addition, our thanks to Anil Agarwal
809	   for his help with getting the hash collision probabilities in this
810	   document right.

812	11.  Conclusions

814	   FQ-CoDel is a very general, efficient, nearly parameterless queue
815	   management approach combining flow queueing with CoDel.  It is a very
816	   powerful tool for solving bufferbloat, and we believe it to be safe
817	   to turn on by default, as has already happened in a number of Linux
818	   distributions.  In this document we have documented the Linux
819	   implementation in sufficient detail for an independent
820	   implementation, and we encourage such implementations be widely
821	   deployed.

823	12.  References

825	12.1.  Normative References

827	   [CODELDRAFT]
828	              Nichols, K., Jacobson, V., McGregor, A., and J. Iyengar,
829	              "Controlled Delay Active Queue Management", October 2014,
830	              <https://datatracker.ietf.org/doc/draft-ietf-aqm-codel/>.

832	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
833	              Requirement Levels", BCP 14, RFC 2119,
834	              DOI 10.17487/RFC2119, March 1997,
835	              <http://www.rfc-editor.org/info/rfc2119>.

837	   [RFC6817]  Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind,
838	              "Low Extra Delay Background Transport (LEDBAT)", RFC 6817,
839	              DOI 10.17487/RFC6817, December 2012,
840	              <http://www.rfc-editor.org/info/rfc6817>.

842	   [RTCWEB-QOS]
843	              Dhesikan, S., Jennings, C., Druta, D., Jones, P., and J.
844	              Polk, "DSCP and other packet markings for RTCWeb QoS",
845	              December 2014, <https://datatracker.ietf.org/doc/draft-
846	              dhesikan-tsvwg-rtcweb-qos/>.

848	12.2.  Informative References

850	   [CODEL2012]
851	              Nichols, K. and V. Jacobson, "Controlling Queue Delay",
852	              July 2012, <http://queue.acm.org/detail.cfm?id=2209336>.

854	   [DRR]      Shreedhar, M. and G. Varghese, "Efficient Fair Queueing
855	              Using Deficit Round Robin", June 1996,
856	              <http://users.ece.gatech.edu/~siva/ECE4607/presentations/
857	              DRR.pdf>.

859	   [DRRPP]    MacGregor, M. and W. Shi, "Deficits for Bursty Latency-
860	              critical Flows: DRR++", 2000,
861	              <http://ieeexplore.ieee.org/xpls/
862	              abs_all.jsp?arnumber=875803>.

864	   [GONG2014]
865	              Gong, Y., Rossi, D., Testa, C., Valenti, S., and D. Taht,
866	              "Fighting the bufferbloat: on the coexistence of AQM and
867	              low priority congestion control", July 2014,
868	              <http://perso.telecom-paristech.fr/~drossi/paper/
869	              rossi14comnet-b.pdf>.

871	   [SQF2012]  Bonald, T., Muscariello, L., and N. Ostallo, "On the
872	              impact of TCP and per-flow scheduling on Internet
873	              Performance - IEEE/ACM transactions on Networking", April
874	              2012, <http://perso.telecom-
875	              paristech.fr/~bonald/Publications_files/BMO2011.pdf>.

877	Authors' Addresses

879	   Toke Hoeiland-Joergensen
880	   Karlstad University
881	   Dept. of Computer Science
882	   Karlstad  65188
883	   Sweden

885	   Email: toke.hoiland-jorgensen@kau.se

887	   Paul McKenney
888	   IBM Linux Technology Center
889	   1385 NW Amberglen Parkway
890	   Hillsboro, OR  97006
891	   USA

893	   Email: paulmck@linux.vnet.ibm.com
894	   URI:   http://www2.rdrop.com/~paulmck/
895	   Dave Taht
896	   Teklibre
897	   2104 W First street
898	   Apt 2002
899	   FT Myers, FL  33901
900	   USA

902	   Email: dave.taht@gmail.com
903	   URI:   http://www.teklibre.com/

905	   Jim Gettys
906	   21 Oak Knoll Road
907	   Carlisle, MA  993
908	   USA

910	   Email: jg@freedesktop.org
911	   URI:   https://en.wikipedia.org/wiki/Jim_Gettys

913	   Eric Dumazet
914	   Google, Inc.
915	   1600 Amphitheater Pkwy
916	   Mountain View, CA  94043
917	   USA

919	   Email: edumazet@gmail.com