idnits 2.17.1 draft-ietf-aqm-fq-codel-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 287: '...he minimum delay MUST be experienced i...' RFC 2119 keyword, line 288: '... interval. It SHOULD be set on the ...' RFC 2119 keyword, line 309: '...e memory, so some packet limit MUST be...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 562 has weird spacing: '...st_head flowc...' -- The document date (October 19, 2015) is 3111 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '1' on line 842 Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 AQM working group T. Hoeiland-Joergensen 3 Internet-Draft Karlstad University 4 Intended status: Informational P. McKenney 5 Expires: April 21, 2016 IBM Linux Technology Center 6 D. Taht 7 Teklibre 8 J. Gettys 10 E. Dumazet 11 Google, Inc. 12 October 19, 2015 14 FlowQueue-Codel 15 draft-ietf-aqm-fq-codel-02 17 Abstract 19 This memo presents the FQ-CoDel hybrid packet scheduler/AQM 20 algorithm, a powerful tool for fighting bufferbloat and reducing 21 latency. 23 FQ-CoDel mixes packets from multiple flows and reduces the impact of 24 head of line blocking from bursty traffic. It provides isolation for 25 low-rate traffic such as DNS, web, and videoconferencing traffic. It 26 improves utilisation across the networking fabric, especially for 27 bidirectional traffic, by keeping queue lengths short; and it can be 28 implemented in a memory- and CPU-efficient fashion across a wide 29 range of hardware. 31 Status of This Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at http://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on April 21, 2016. 48 Copyright Notice 50 Copyright (c) 2015 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (http://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document. Please review these documents 57 carefully, as they describe your rights and restrictions with respect 58 to this document. Code Components extracted from this document must 59 include Simplified BSD License text as described in Section 4.e of 60 the Trust Legal Provisions and are provided without warranty as 61 described in the Simplified BSD License. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 1.1. Terminology and concepts . . . . . . . . . . . . . . . . 3 67 1.2. Informal summary of FQ-CoDel . . . . . . . . . . . . . . 4 68 2. CoDel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 69 3. Flow Queueing . . . . . . . . . . . . . . . . . . . . . . . . 5 70 4. FQ-CoDel Parameters and Data Structures . . . . . . . . . . . 6 71 4.1. Parameters . . . . . . . . . . . . . . . . . . . . . . . 6 72 4.1.1. Interval . . . . . . . . . . . . . . . . . . . . . . 6 73 4.1.2. Target . . . . . . . . . . . . . . . . . . . . . . . 7 74 4.1.3. Packet limit . . . . . . . . . . . . . . . . . . . . 7 75 4.1.4. Quantum . . . . . . . . . . . . . . . . . . . . . . . 7 76 4.1.5. Flows . . . . . . . . . . . . . . . . . . . . . . . . 8 77 4.1.6. ECN . . . . . . . . . . . . . . . . . . . . . . . . . 8 78 4.1.7. CE threshold . . . . . . . . . . . . . . . . . . . . 8 79 4.2. Data structures . . . . . . . . . . . . . . . . . . . . . 8 80 4.2.1. Internal queues . . . . . . . . . . . . . . . . . . . 8 81 4.2.2. New and old queues lists . . . . . . . . . . . . . . 9 82 5. The FQ-CoDel scheduler . . . . . . . . . . . . . . . . . . . 9 83 5.1. Enqueue . . . . . . . . . . . . . . . . . . . . . . . . . 9 84 5.1.1. Alternative classification schemes . . . . . . . . . 10 85 5.2. Dequeue . . . . . . . . . . . . . . . . . . . . . . . . . 10 86 6. Implementation considerations . . . . . . . . . . . . . . . . 11 87 6.1. Probability of hash collisions . . . . . . . . . . . . . 11 88 6.2. Memory Overhead . . . . . . . . . . . . . . . . . . . . . 12 89 6.3. Per-Packet Timestamping . . . . . . . . . . . . . . . . . 13 90 6.4. Other forms of "Fair Queueing" . . . . . . . . . . . . . 13 91 6.5. Differences between CoDel and FQ-CoDel behaviour . . . . 13 92 7. Limitations of flow queueing . . . . . . . . . . . . . . . . 14 93 7.1. Fairness between things other than flows . . . . . . . . 14 94 7.2. Flow bunching by opaque encapsulation . . . . . . . . . . 15 95 7.3. Low-priority congestion control algorithms . . . . . . . 15 97 8. Deployment status and future work . . . . . . . . . . . . . . 16 98 9. Security Considerations . . . . . . . . . . . . . . . . . . . 16 99 10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 100 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 17 101 12. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 17 102 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 17 103 13.1. Normative References . . . . . . . . . . . . . . . . . . 17 104 13.2. Informative References . . . . . . . . . . . . . . . . . 18 105 13.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 18 106 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 18 108 1. Introduction 110 The FQ-CoDel algorithm is a combined packet scheduler and AQM 111 developed as part of the bufferbloat-fighting community effort. It 112 is based on a modified Deficit Round Robin (DRR) queue scheduler, 113 with the CoDel AQM algorithm operating on each queue. This document 114 describes the combined algorithm; reference implementations are 115 available for ns2 and ns3 and it is included in the mainline Linux 116 kernel as the fq_codel queueing discipline. 118 The rest of this document is structured as follows: This section 119 gives some concepts and terminology used in the rest of the document, 120 and gives a short informal summary of the FQ-CoDel algorithm. 121 Section 2 gives an overview of the CoDel algorithm. Section 3 covers 122 the flow hashing and DRR portion. Section 4 defines the parameters 123 and data structures employed by FQ-CoDel. Section 5 describes the 124 working of the algorithm in detail. Section 6 describes 125 implementation considerations, and Section 7 lists some of the 126 limitations of using flow queueing. Section 8 outlines the current 127 status of FQ-CoDel deployment and lists some possible future areas of 128 inquiry, and finally, Section 12 concludes. 130 1.1. Terminology and concepts 132 Flow: A flow is typically identified by a 5-tuple of source IP, 133 destination IP, source port, destination port, and protocol. It can 134 also be identified by a superset or subset of those parameters, or by 135 mac address, or other means. 137 Queue: A queue of packets represented internally in FQ-CoDel. In 138 most instances each flow gets its own queue; however because of the 139 possibility of hash collisions, this is not always the case. In an 140 attempt to avoid confusion, the word 'queue' is used to refer to the 141 internal data structure, and 'flow' to refer to the actual stream of 142 packets being delivered to the FQ-CoDel algorithm. 144 Scheduler: A mechanism to select which queue a packet is dequeued 145 from. 147 CoDel AQM: The Active Queue Management algorithm employed by FQ- 148 CoDel. 150 DRR: Deficit round-robin scheduling. 152 Quantum: The maximum amount of bytes to be dequeued from a queue at 153 once. 155 1.2. Informal summary of FQ-CoDel 157 FQ-CoDel is a _hybrid_ of DRR [DRR] and CoDel [CODELDRAFT], with an 158 optimisation for sparse flows similar to SQF [SQF2012] and DRR++ 159 [DRRPP]. We call this "Flow Queueing" rather than "Fair Queueing" as 160 flows that build a queue are treated differently than flows that do 161 not. 163 FQ-CoDel stochastically classifies incoming packets into different 164 queues by hashing the 5-tuple of IP protocol number and source and 165 destination IP and port numbers, perturbed with a random number 166 selected at initiation time (although other flow classification 167 schemes can optionally be configured instead). Each queue is managed 168 by the CoDel AQM algorithm. Packet ordering within a queue is 169 preserved, since queues have FIFO ordering. 171 The FQ-CoDel algorithm consists of two logical parts: the scheduler 172 which selects which queue to dequeue a packet from, and the CoDel AQM 173 which works on each of the queues. The subtleties of FQ-CoDel are 174 mostly in the scheduling part, whereas the interaction between the 175 scheduler and the CoDel algorithm are fairly straight forward: 177 At initialisation, each queue is set up to have a separate set of 178 CoDel state variables. By default, 1024 queues are created. The 179 current implementation supports anywhere from one to 64K separate 180 queues, and each queue maintains the state variables throughout its 181 lifetime, and so acts the same as the non-FQ CoDel variant would. 182 This means that with only one queue, FQ-CoDel behaves essentially the 183 same as CoDel by itself. 185 On dequeue, FQ-CoDel selects a queue from which to dequeue by a two- 186 tier round-robin scheme, in which each queue is allowed to dequeue up 187 to a configurable quantum of bytes for each iteration. Deviations 188 from this quantum is maintained as a deficit for the queue, which 189 serves to make the fairness scheme byte-based rather than a packet- 190 based. The two-tier round-robin mechanism distinguishes between 191 "new" queues (which don't build up a standing queue) and "old" 192 queues, that have queued enough data to be around for more than one 193 iteration of the round-robin scheduler. 195 This new/old queue distinction has a particular consequence for 196 queues that don't build up more than a quantum of bytes before being 197 visited by the scheduler: Such queues are removed from the list, and 198 then re-added as a new queue each time a packet arrives for it, and 199 so will get priority over queues that do not empty out each round 200 (except for a minor modification to protect against starvation, 201 detailed below). Exactly how little data a flow has to send to keep 202 its queue in this state is somewhat difficult to reason about, 203 because it depends on both the egress link speed and the number of 204 concurrent flows. However, in practice many things that are 205 beneficial to have prioritised for typical internet use (ACKs, DNS 206 lookups, interactive SSH, HTTP requests, ARP, RA, ICMP, VoIP) _tend_ 207 to fall in this category, which is why FQ-CoDel performs so well for 208 many practical applications. However, the implicitness of the 209 prioritisation means that for applications that require guaranteed 210 priority (for instance multiplexing the network control plane over 211 the network itself), explicit classification is still needed. 213 This scheduling scheme has some subtlety to it, which is explained in 214 detail in the remainder of this document. 216 2. CoDel 218 CoDel is described in the the ACM Queue paper [CODEL2012], and the 219 AQM working group draft [CODELDRAFT]. The basic idea is to control 220 queue length, maintaining sufficient queueing to keep the outgoing 221 link busy, but avoiding building up the queue beyond that point. 222 This is done by preferentially dropping packets that remain in the 223 queue for "too long". Packets are dropped by head drop, which lowers 224 the time for the drop signal to propagate back to the sender by the 225 length of the queue, and helps trigger TCP fast retransmit sooner. 227 The CoDel algorithm itself will not be described here; instead we 228 refer the reader to the CoDel draft [CODELDRAFT]. 230 3. Flow Queueing 232 The intention of FQ-CoDel's scheduler is to give each _flow_ its own 233 queue, hence the term _Flow Queueing_. Rather than a perfect 234 realisation of this, a hashing-based scheme is used, where flows are 235 hashed into a number of buckets which each has its own queue. The 236 number of buckets are configurable, and presently defaults to 1024 in 237 the Linux implementation. This is enough to avoid hash collisions on 238 a moderate number of flows as seen for instance in a home gateway. 239 Depending on the characteristics of the link, this can be tuned to 240 trade off memory for a lower probability of hash collisions. See 241 Section 6 for a more in-depth discussion of this. 243 By default, the flow hashing is performed on the 5-tuple of source 244 and destination IP and port numbers and IP protocol number. While 245 the hashing can be customised to match on arbitrary packet bytes, 246 care should be taken when doing so: Much of the benefit of the FQ- 247 CoDel scheduler comes from this per-flow distinction. However, the 248 default hashing does have some limitations, as discussed in 249 Section 7. 251 FQ-CoDel's DRR scheduler is byte-based, employing a deficit round- 252 robin mechanism between queues. This works by keeping track of the 253 current byte _deficit_ of each queue. This deficit is initialised to 254 the configurable quantum; each time a queue gets a dequeue 255 opportunity, it gets to dequeue packets, decreasing the deficit by 256 the packet size for each packet, until the deficit runs into the 257 negative, at which point it is increased by one quantum, and the 258 dequeue opportunity ends. 260 This means that if one queue contains packets of, for instance, size 261 quantum/3, and another contains quantum-sized packets, the first 262 queue will dequeue three packets each time it gets a turn, whereas 263 the second only dequeues one. This means that flows that send small 264 packets are not penalised by the difference in packet sizes; rather, 265 the DRR scheme approximates a (single-)byte-based fairness queueing. 266 The size of the quantum determines the scheduling granularity, with 267 the tradeoff from too small a quantum being scheduling overhead. For 268 small bandwidths, lowering the quantum from the default MTU size can 269 be advantageous. 271 Unlike plain DRR there are two sets of flows - a "new" list for flows 272 that have not built a queue recently, and an "old" list for flow- 273 building queues. This distinction is an integral part of the FQ- 274 CoDel scheduler and is described in more detail in Section 5. 276 4. FQ-CoDel Parameters and Data Structures 278 This section goes into the parameters and data structures in FQ- 279 CoDel. 281 4.1. Parameters 283 4.1.1. Interval 285 The _interval_ parameter has the same semantics as CoDel and is used 286 to ensure that the measured minimum delay does not become too stale. 287 The minimum delay MUST be experienced in the last epoch of length 288 interval. It SHOULD be set on the order of the worst-case RTT 289 through the bottleneck to give end-points sufficient time to react. 291 The default interval value is 100 ms. 293 4.1.2. Target 295 The _target_ parameter has the same semantics as CoDel. It is the 296 acceptable minimum standing/persistent queue delay for each FQ-CoDel 297 Queue. This minimum delay is identified by tracking the local 298 minimum queue delay that packets experience. 300 The default target value is 5 ms, but this value should be tuned to 301 be at least the transmission time of a single MTU-sized packet at the 302 prevalent egress link speed (which for e.g. 1Mbps and MTU 1500 is 303 ~15ms), to prevent CoDel from being too aggressive at low bandwidths. 304 It should otherwise be set to on the order of 5-10% of the configured 305 interval. 307 4.1.3. Packet limit 309 Routers do not have infinite memory, so some packet limit MUST be 310 enforced. 312 The _limit_ parameter is the hard limit on the real queue size, 313 measured in number of packets. This limit is a global limit on the 314 number of packets in all queues; each individual queue does not have 315 an upper limit. When the limit is reached and a new packet arrives 316 for enqueue, a packet is dropped from the head of the largest queue 317 (measured in bytes) to make room for the new packet. 319 In Linux, the default packet limit is 10240 packets, which is 320 suitable for up to 10GigE speeds. In practice, the hard limit is 321 rarely, if ever, hit, as drops are performed by the CoDel algorithm 322 long before the limit is hit. For platforms that are severely memory 323 constrained, a lower limit can be used. 325 4.1.4. Quantum 327 The _quantum_ parameter is the number of bytes each queue gets to 328 dequeue on each round of the scheduling algorithm. The default is 329 set to 1514 bytes which corresponds to the Ethernet MTU plus the 330 hardware header length of 14 bytes. 332 In TSO-enabled systems, where a "packet" consists of an offloaded 333 packet train, it can presently be as large as 64K bytes. In GRO- 334 enabled systems, up to 17 times the TCP max segment size (or 25K 335 bytes). These mega-packets severely impact FQ-CoDel's ability to 336 schedule traffic, and hurt latency needlessly. There is ongoing work 337 in Linux to make smarter use of offload engines. 339 4.1.5. Flows 341 The _flows_ parameter sets the number of queues into which the 342 incoming packets are classified. Due to the stochastic nature of 343 hashing, multiple flows may end up being hashed into the same slot. 345 This parameter can be set only at load time since memory has to be 346 allocated for the hash table in the current implementation. 348 The default value is 1024 in the current Linux implementation. 350 4.1.6. ECN 352 ECN is _enabled_ by default. Rather than do anything special with 353 misbehaved ECN flows, FQ-CoDel relies on the packet scheduling system 354 to minimise their impact, thus unresponsive packets in a flow being 355 marked with ECN can grow to the overall packet limit, but will not 356 otherwise affect the performance of the system. 358 It can be disabled by specifying the _noecn_ parameter. 360 4.1.7. CE threshold 362 This parameter enables DCTCP-like processing to enable CE marking ECT 363 packets at a lower setpoint than the default codel target. 365 The parameter, _ce_threshold_, is disabled by default and can be set 366 to a number of usec to enable. 368 4.2. Data structures 370 4.2.1. Internal queues 372 The main data structure of FQ-CoDel is the array of queues, which is 373 instantiated to the number of queues specified by the _flows_ 374 parameter at instantiation time. Each queue consists simply of an 375 ordered list of packets with FIFO semantics, two state variables 376 tracking the queue deficit and total number of bytes enqueued, and 377 the set of CoDel state variables. Other state variables to track 378 queue statistics can also be included: for instance, the Linux 379 implementation keeps a count of dropped packets. 381 Queue space is shared: there's a global limit on the number of 382 packets the queues can hold, but not one per queue. 384 4.2.2. New and old queues lists 386 FQ-CoDel maintains two lists of active queues, called "new" and "old" 387 queues. Each list is an ordered list containing references to the 388 array of queues. When a packet is added to a queue that is not 389 currently active, that queue becomes active by being added to the 390 list of new queues. Later on, it is moved to the list of old queues, 391 from which it is removed when it is no longer active. This behaviour 392 is the source of some subtlety in the packet scheduling at dequeue 393 time, explained below. 395 5. The FQ-CoDel scheduler 397 This section describes the operation of the FQ-CoDel scheduler and 398 AQM. It is split into two parts explaining the enqueue and dequeue 399 operations. 401 5.1. Enqueue 403 The packet enqueue mechanism consists of three stages: classification 404 into a queue, timestamping and bookkeeping, and optionally dropping a 405 packet when the total number of enqueued packets goes over the 406 maximum. 408 When a packet is enqueued, it is first classified into the 409 appropriate queue. By default, this is done by hashing (using a 410 Jenkins hash function) on the 5-tuple of IP protocol, and source and 411 destination IP and port numbers, permuted by a random value selected 412 at initialisation time, and taking the hash value modulo the number 413 of queues. 415 Once the packet has been successfully classified into a queue, it is 416 handed over to the CoDel algorithm for timestamping. It is then 417 added to the tail of the selected queue, and the queue's byte count 418 is updated by the packet size. Then, if the queue is not currently 419 active (i.e. if it is not in either the list of new or the list of 420 old queues), it is added to the end of the list of new queues, and 421 its deficit is initiated to the configured quantum. Otherwise it is 422 added to the old queue list. 424 Finally, the total number of enqueued packets is compared with the 425 configured limit, and if it is _above_ this value (which can happen 426 since a packet was just enqueued), a packet is dropped from the head 427 of the queue with the largest current byte count. Note that this in 428 most cases means that the packet that gets dropped is different from 429 the one that was just enqueued, and may even be from a different 430 queue. 432 5.1.1. Alternative classification schemes 434 As mentioned previously, it is possible to modify the classification 435 scheme to provide a different notion of a 'flow'. The Linux 436 implementation provides this option in the form of the "tc filter" 437 command. While this can add capabilities (for instance, matching on 438 other possible parameters such as mac address, diffserv, firewall and 439 flow specific markings, etc.), care should be taken to preserve the 440 notion of 'flow' as much of the benefit of the FQ-CoDel scheduler 441 comes from keeping flows in separate queues. We are not aware of any 442 deployments utilising the custom classification feature. 444 An alternative to changing the classification scheme is to perform 445 decapsulation prior to hashing. The Linux implementation does this 446 for common encapsulations known to the kernel, such as 6in4, IPIP and 447 GRE tunnels. This helps to distinguish between flows that share the 448 same (outer) 5-tuple, but of course is limited to unencrypted tunnels 449 (see Section 7.2). 451 5.2. Dequeue 453 Most of FQ-CoDel's work is done at packet dequeue time. It consists 454 of three parts: selecting a queue from which to dequeue a packet, 455 actually dequeuing it (employing the CoDel algorithm in the process), 456 and some final bookkeeping. 458 For the first part, the scheduler first looks at the list of new 459 queues; for each queue in that list, if that queue has a negative 460 deficit (i.e. it has already dequeued at least a quantum of bytes), 461 its deficit is increased by one quantum, and the queue is put onto 462 _the end of_ the list of old queues, and the routine selects the next 463 queue and starts again. 465 Otherwise, that queue is selected for dequeue. If the list of new 466 queues is empty, the scheduler proceeds down the list of old queues 467 in the same fashion (checking the deficit, and either selecting the 468 queue for dequeuing, or increasing the deficit and putting the queue 469 back at the end of the list). 471 After having selected a queue from which to dequeue a packet, the 472 CoDel algorithm is invoked on that queue. This applies the CoDel 473 control law, and may discard one or more packets from the head of 474 that queue, before returning the packet that should be dequeued (or 475 nothing if the queue is or becomes empty while being handled by the 476 CoDel algorithm). 478 Finally, if the CoDel algorithm did not return a packet, the queue is 479 empty, and the scheduler does one of two things: if the queue 480 selected for dequeue came from the list of new queues, it is moved to 481 _the end of_ the list of old queues. If instead it came from the 482 list of old queues, that queue is removed from the list, to be added 483 back (as a new queue) the next time a packet arrives that hashes to 484 that queue. Then (since no packet was available for dequeue), the 485 whole dequeue process is restarted from the beginning. 487 If, instead, the scheduler _did_ get a packet back from the CoDel 488 algorithm, it updates the byte deficit for the selected queue before 489 returning the packet as the result of the dequeue operation. 491 The step that moves an empty queue from the list of new queues to 492 _the end of_ the list of old queues before it is removed is crucial 493 to prevent starvation. Otherwise the queue could reappear (the next 494 time a packet arrives for it) before the list of old queues is 495 visited; this can go on indefinitely even with a small number of 496 active flows, if the flow providing packets to the queue in question 497 transmits at just the right rate. This is prevented by first moving 498 the queue to _the end of_ the list of old queues, forcing a pass 499 through that, and thus preventing starvation. Moving it to the end 500 of the list, rather than the front, is crucial for this to work. 502 The resulting migration of queues between the different states is 503 summarised in the following state diagram: 505 +-----------------+ +------------------+ 506 | | Empty | | 507 | Empty |<---------------+ Old +-----+ 508 | | | | | 509 +-------+---------+ +------------------+ | 510 | ^ ^ |Quantum 511 |Arrival | | |Exceeded 512 v | | | 513 +-----------------+ | | | 514 | | Empty or | | | 515 | New +-------------------+ +--------+ 516 | | Quantum exceeded 517 +-----------------+ 519 6. Implementation considerations 521 6.1. Probability of hash collisions 523 Since the Linux FQ-CoDel implementation by default uses 1024 hash 524 buckets, the probability that (say) 100 VoIP sessions will all hash 525 to the same bucket is something like ten to the power of minus 300. 526 Thus, the probability that at least one of the VoIP sessions will 527 hash to some other queue is very high indeed. 529 Expanding on this, based on analytical equations for hash collision 530 probabilities, for 100 VoIP sessions, the probability of no collision 531 is 90.78%; the probability that no more than two of the 100 VoIP 532 sessions will be involved in any given collision = 99.57%; and the 533 probability that no more than three of the 100 VoIP sessions will be 534 involved in any given collision = 99.99%. 536 These probabilities can be improved upon by using set-associative 537 hashing, a technique used in the Cake algorithm currently being 538 developed as a further development upon the FQ-CoDel principles. For 539 a 4-way associative hash with the same number of total queues, the 540 probability of no collisions for 100 VoIP sessions is 99.93%, while 541 for an 8-way associative hash it is ~100%. 543 6.2. Memory Overhead 545 FQ-CoDel can be implemented with a very low memory footprint (less 546 than 64 bytes per queue on 64 bit systems). These are the data 547 structures used in the Linux implementation: 549 struct codel_vars { 550 u32 count; 551 u32 lastcount; 552 bool dropping; 553 u16 rec_inv_sqrt; 554 codel_time_t first_above_time; 555 codel_time_t drop_next; 556 codel_time_t ldelay; 557 }; 559 struct fq_codel_flow { 560 struct sk_buff *head; 561 struct sk_buff *tail; 562 struct list_head flowchain; 563 int deficit; 564 u32 dropped; /* # of drops (or ECN marks) on flow */ 565 struct codel_vars cvars; 566 }; 568 The master table managing all queues looks like this: 570 struct fq_codel_sched_data { 571 struct tcf_proto *filter_list; /* optional external classifier */ 572 struct fq_codel_flow *flows; /* Flows table [flows_cnt] */ 573 u32 *backlogs; /* backlog table [flows_cnt] */ 574 u32 flows_cnt; /* number of flows */ 575 u32 perturbation; /* hash perturbation */ 576 u32 quantum; /* psched_mtu(qdisc_dev(sch)); */ 577 struct codel_params cparams; 578 struct codel_stats cstats; 579 u32 drop_overlimit; 580 u32 new_flow_count; 582 struct list_head new_flows; /* list of new flows */ 583 struct list_head old_flows; /* list of old flows */ 584 }; 586 6.3. Per-Packet Timestamping 588 The CoDel portion of the algorithm requires per-packet timestamps be 589 stored along with the packet. While this approach works well for 590 software-based routers, it may be impossible to retrofit devices that 591 do most of their processing in silicon and lack space or mechanism 592 for timestamping. 594 Also, while perfect resolution is not needed, timestamp resolution 595 below the target is necessary. Furthermore, timestamping functions 596 in the core OS need to be efficient as they are called at least once 597 on each packet enqueue and dequeue. 599 6.4. Other forms of "Fair Queueing" 601 Much of the scheduling portion of FQ-CoDel is derived from DRR and is 602 substantially similar to DRR++. SFQ-based versions have also been 603 produced and tested in ns2. Other forms of Fair Queueing, such as 604 WFQ or QFQ, have not been thoroughly explored. 606 6.5. Differences between CoDel and FQ-CoDel behaviour 608 CoDel can be applied to a single queue system as a straight AQM, 609 where it converges towards an "ideal" drop rate (i.e. one that 610 minimises delay while keeping a high link utilisation), and then 611 optimises around that control point. 613 The scheduling of FQ-CoDel mixes packets of competing flows, which 614 acts to pace bursty flows to better fill the pipe. Additionally, a 615 new flow gets substantial "credit" over other flows until CoDel finds 616 an ideal drop rate for it. However, for a new flow that exceeds the 617 configured quantum, more time passes before all of its data is 618 delivered (as packets from it, too, are mixed across the other 619 existing queue-building flows). Thus, FQ-CoDel takes longer (as 620 measured in time) to converge towards an ideal drop rate for a given 621 new flow, but does so within fewer delivered _packets_ from that 622 flow. 624 Finally, the flow isolation FQ-CoDel provides means that the CoDel 625 drop mechanism operates on the flows actually building queues, which 626 results in packets being dropped more accurately from the largest 627 flows than CoDel alone manages. Additionally, flow isolation 628 radically improves the transient behaviour of the network when 629 traffic or link characteristics change (e.g. when new flows start up 630 or the link bandwidth changes); while CoDel itself can take a while 631 to respond, fq_codel doesn't miss a beat. 633 7. Limitations of flow queueing 635 While FQ-CoDel has been shown in many scenarios to offer significant 636 performance gains, there are some scenarios where the scheduling 637 algorithm in particular is not a good fit. This section documents 638 some of the known cases which either may require tweaking the default 639 behaviour, or where alternatives to flow queueing should be 640 considered. 642 7.1. Fairness between things other than flows 644 In some parts of the network, enforcing flow-level fairness may not 645 be desirable, or some other form of fairness may be more important. 646 An example of this can be an Internet Service Provider that may be 647 more interested in ensuring fairness between customers than between 648 flows. Or a hosting or transit provider that wishes to ensure 649 fairness between connecting Autonomous Systems or networks. Another 650 issue can be that the number of simultaneous flows experienced at a 651 particular link can be too high for flow-based fairness queueing to 652 be effective. 654 Whatever the reason, in a scenario where fairness between flows is 655 not desirable, reconfiguring FQ-CoDel to match on a different 656 characteristic can be a way forward. The implementation in Linux can 657 leverage the packet matching mechanism of the _tc_ subsystem to use 658 any available packet field to partition packets into virtual queues, 659 to for instance match on address or subnet source/destination pairs, 660 application layer characteristics, etc. 662 Furthermore, as commonly deployed today, FQ-CoDel is used with three 663 or more tiers of classification: priority, best effort and 664 background, based on diffserv markings. Some products do more 665 detailed classification, including deep packet inspection and 666 destination-specific filters to achieve their desired result. 668 7.2. Flow bunching by opaque encapsulation 670 Where possible, FQ-CoDel will attempt to decapsulate packets before 671 matching on the header fields for the flow hashing. However, for 672 some encapsulation techniques, most notably encrypted VPNs, this is 673 not possible. If several flows are bunched into one such 674 encapsulated tunnel, they will be seen as one flow by the FQ-CoDel 675 algorithm. This means that they will share a queue, and drop 676 behaviour, and so flows inside the encapsulation will not benefit 677 from the implicit prioritisation of FQ-CoDel, but will continue to 678 benefit from the reduced overall queue length from the CoDel 679 algorithm operating on the queue. In addition, when such an 680 encapsulated bunch competes against other flows, it will count as one 681 flow, and not assigned a share of the bandwidth based on how many 682 flows are inside the encapsulation. 684 Depending on the application, this may or may not be desirable 685 behaviour. In cases where it is not, changing FQ-CoDel's matching to 686 not be flow-based (as detailed in the previous subsection above) can 687 be a mitigation. Going forward, having some mechanism for opaque 688 encapsulations to express to the outer layer which flow a packet 689 belongs to, could be a way to mitigate this. 691 7.3. Low-priority congestion control algorithms 693 In the presence of queue management schemes that contain latency 694 under load, low-priority congestion control algorithms such as LEDBAT 695 [RFC6817] (or, in general, algorithms that try to voluntarily use up 696 less than their fair share of bandwidth) experiences very little 697 added latency when the link is congested. Thus, they lack the signal 698 to back off that added latency previously afforded them. This effect 699 is seen with FQ-CoDel as well as with any effective AQM [GONG2014]. 701 As such, these delay-based algorithms tend to revert to loss-based 702 congestion control, and will consume the fair share of bandwidth 703 afforded to them by the FQ-CoDel scheduler. However, low-priority 704 congestion control mechanisms may be able to take steps to continue 705 to be low priority, for instance by taking into account the vastly 706 reduced level of delay afforded by an AQM, or by using a coupled 707 approach to observing the behaviour of multiple flows. 709 8. Deployment status and future work 711 The FQ-CoDel algorithm as described in this document has been shipped 712 as part of the Linux kernel since version 3.5, released on the 21st 713 of July, 2012, with the ce_threshold being added in version 4.2. The 714 algorithm has seen widespread testing in a variety of contexts and is 715 configured as the default queueing discipline in a number of mainline 716 Linux distributions (as of this writing at least OpenWRT, Arch Linux 717 and Fedora). We believe it to be a safe default and encourage people 718 running Linux to turn it on: It is a massive improvement over the 719 previous default FIFO queue. 721 Of course there is always room for improvement, and this document has 722 listed some of the know limitations of the algorithm. As such, we 723 encourage further research into algorithm refinements and addressing 724 of limitations. One such effort is undertaken by the bufferbloat 725 community in the form of the Cake [1] queue management scheme. In 726 addition to this we believe the following (non-exhaustive) list of 727 issues to be worthy of further enquiry: 729 o Variations on the flow classification mechanism to fit different 730 notions of flows. For instance, an ISP might want to deploy per- 731 subscriber scheduling, while in other cases several flows can 732 share a 5-tuple, as exemplified by the RTCWEB QoS recommendations 733 [RTCWEB-QOS]. 735 o Interactions between flow queueing and delay-based congestion 736 control algorithms and scavenger protocols. 738 o Other scheduling mechanisms to replace the DRR portion of the 739 algorithm, e.g. QFQ or WFQ. 741 o Sensitivity of parameters, most notably the number of queues and 742 the CoDel parameters. 744 9. Security Considerations 746 There are no specific security exposures associated with FQ-CoDel 747 that are not also present in current FIFO systems. And some are in 748 fact reduced (e.g. simple minded packet floods). However, some care 749 is needed in the implementation to ensure this is the case. These 750 are included in the description above, however we reiterate them 751 here: 753 o To prevent packets in the new queues from starving old queues, it 754 is important that when a queue on the list of new queues empties, 755 it is moved to _the end of_ the list of old queues. This is 756 described at the end of Section 5.2. 758 o To prevent an attacker targeting a specific flow for a denial of 759 service attack, the hash that maps packets to queues should not be 760 predictable. To achieve this, FQ-CoDel salts the hash, as 761 described in the beginning of Section 5.1. The size of the salt 762 and the strength of the hash function is obviously a tradeoff 763 between performance and security. The Linux implementation uses a 764 32 bit random value as the salt and a Jenkins hash function. This 765 makes it possible to achieve very high throughput, and we consider 766 it sufficient to ward off the most obvious attacks. 768 10. IANA Considerations 770 This document has no actions for IANA. 772 11. Acknowledgements 774 Our deepest thanks to Kathie Nichols, Van Jacobson, and all the 775 members of the bufferbloat.net effort for all the help on developing 776 and testing the algorithm. In addition, our thanks to Anil Agarwal 777 for his help with getting the hash collision probabilities in this 778 document right. 780 12. Conclusions 782 FQ-CoDel is a very general, efficient, nearly parameterless queue 783 management approach combining flow queueing with CoDel. It is a very 784 powerful tool for solving bufferbloat, and we believe it to be safe 785 to turn on by default, as has already happened in a number of Linux 786 distributions. In this document we have documented the Linux 787 implementation in sufficient detail for an independent 788 implementation, and we encourage such implementations be widely 789 deployed. 791 13. References 793 13.1. Normative References 795 [CODELDRAFT] 796 Nichols, K., Jacobson, V., McGregor, A., and J. Iyengar, 797 "Controlled Delay Active Queue Management", October 2014, 798 . 800 [RFC6817] Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, 801 "Low Extra Delay Background Transport (LEDBAT)", RFC 6817, 802 DOI 10.17487/RFC6817, December 2012, 803 . 805 [RTCWEB-QOS] 806 Dhesikan, S., Jennings, C., Druta, D., Jones, P., and J. 807 Polk, "DSCP and other packet markings for RTCWeb QoS", 808 December 2014, . 811 13.2. Informative References 813 [CODEL2012] 814 Nichols, K. and V. Jacobson, "Controlling Queue Delay", 815 July 2012, . 817 [DRR] Shreedhar, M. and G. Varghese, "Efficient Fair Queueing 818 Using Deficit Round Robin", June 1996, 819 . 822 [DRRPP] MacGregor, M. and W. Shi, "Deficits for Bursty Latency- 823 critical Flows: DRR++", 2000, 824 . 827 [GONG2014] 828 Gong, Y., Rossi, D., Testa, C., Valenti, S., and D. Taht, 829 "Fighting the bufferbloat: on the coexistence of AQM and 830 low priority congestion control", July 2014, 831 . 834 [SQF2012] Bonald, T., Muscariello, L., and N. Ostallo, "On the 835 impact of TCP and per-flow scheduling on Internet 836 Performance - IEEE/ACM transactions on Networking", April 837 2012, . 840 13.3. URIs 842 [1] http://www.bufferbloat.net/projects/codel/wiki/Cake 844 Authors' Addresses 846 Toke Hoeiland-Joergensen 847 Karlstad University 848 Dept. of Computer Science 849 Karlstad 65188 850 Sweden 852 Email: toke.hoiland-jorgensen@kau.se 853 Paul McKenney 854 IBM Linux Technology Center 855 1385 NW Amberglen Parkway 856 Hillsboro, OR 97006 857 USA 859 Email: paulmck@linux.vnet.ibm.com 860 URI: http://www2.rdrop.com/~paulmck/ 862 Dave Taht 863 Teklibre 864 2104 W First street 865 Apt 2002 866 FT Myers, FL 33901 867 USA 869 Email: dave.taht@gmail.com 870 URI: http://www.teklibre.com/ 872 Jim Gettys 873 21 Oak Knoll Road 874 Carlisle, MA 993 875 USA 877 Email: jg@freedesktop.org 878 URI: https://en.wikipedia.org/wiki/Jim_Gettys 880 Eric Dumazet 881 Google, Inc. 882 1600 Amphitheater Pkwy 883 Mountain View, CA 94043 884 USA 886 Email: edumazet@gmail.com