idnits 2.17.1 draft-ietf-aqm-fq-codel-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 445: '...he minimum delay MUST be experienced i...' RFC 2119 keyword, line 446: '... interval. It SHOULD be set on the ...' RFC 2119 keyword, line 467: '...e memory, so some packet limit MUST be...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 567 has weird spacing: '...st_head flowc...' -- The document date (February 04, 2016) is 2996 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '1' on line 856 Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 AQM working group T. Hoeiland-Joergensen 3 Internet-Draft Karlstad University 4 Intended status: Experimental P. McKenney 5 Expires: August 7, 2016 IBM Linux Technology Center 6 D. Taht 7 Teklibre 8 J. Gettys 10 E. Dumazet 11 Google, Inc. 12 February 04, 2016 14 FlowQueue-Codel 15 draft-ietf-aqm-fq-codel-04 17 Abstract 19 This memo presents the FQ-CoDel hybrid packet scheduler/AQM 20 algorithm, a powerful tool for fighting bufferbloat and reducing 21 latency. 23 FQ-CoDel mixes packets from multiple flows and reduces the impact of 24 head of line blocking from bursty traffic. It provides isolation for 25 low-rate traffic such as DNS, web, and videoconferencing traffic. It 26 improves utilisation across the networking fabric, especially for 27 bidirectional traffic, by keeping queue lengths short; and it can be 28 implemented in a memory- and CPU-efficient fashion across a wide 29 range of hardware. 31 Status of This Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at http://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on August 7, 2016. 48 Copyright Notice 50 Copyright (c) 2016 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (http://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document. Please review these documents 57 carefully, as they describe your rights and restrictions with respect 58 to this document. Code Components extracted from this document must 59 include Simplified BSD License text as described in Section 4.e of 60 the Trust Legal Provisions and are provided without warranty as 61 described in the Simplified BSD License. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 1.1. Terminology and concepts . . . . . . . . . . . . . . . . 3 67 1.2. Informal summary of FQ-CoDel . . . . . . . . . . . . . . 4 68 2. CoDel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 69 3. Flow Queueing . . . . . . . . . . . . . . . . . . . . . . . . 5 70 4. The FQ-CoDel scheduler . . . . . . . . . . . . . . . . . . . 6 71 4.1. Enqueue . . . . . . . . . . . . . . . . . . . . . . . . . 7 72 4.1.1. Alternative classification schemes . . . . . . . . . 7 73 4.2. Dequeue . . . . . . . . . . . . . . . . . . . . . . . . . 8 74 5. Implementation considerations . . . . . . . . . . . . . . . . 9 75 5.1. Data structures . . . . . . . . . . . . . . . . . . . . . 9 76 5.2. Parameters . . . . . . . . . . . . . . . . . . . . . . . 10 77 5.2.1. Interval . . . . . . . . . . . . . . . . . . . . . . 10 78 5.2.2. Target . . . . . . . . . . . . . . . . . . . . . . . 10 79 5.2.3. Packet limit . . . . . . . . . . . . . . . . . . . . 10 80 5.2.4. Quantum . . . . . . . . . . . . . . . . . . . . . . . 11 81 5.2.5. Flows . . . . . . . . . . . . . . . . . . . . . . . . 11 82 5.2.6. ECN . . . . . . . . . . . . . . . . . . . . . . . . . 11 83 5.2.7. CE threshold . . . . . . . . . . . . . . . . . . . . 11 84 5.3. Probability of hash collisions . . . . . . . . . . . . . 12 85 5.4. Memory Overhead . . . . . . . . . . . . . . . . . . . . . 12 86 5.5. Per-Packet Timestamping . . . . . . . . . . . . . . . . . 13 87 5.6. Other forms of "Fair Queueing" . . . . . . . . . . . . . 13 88 5.7. Differences between CoDel and FQ-CoDel behaviour . . . . 13 89 6. Limitations of flow queueing . . . . . . . . . . . . . . . . 14 90 6.1. Fairness between things other than flows . . . . . . . . 14 91 6.2. Flow bunching by opaque encapsulation . . . . . . . . . . 15 92 6.3. Low-priority congestion control algorithms . . . . . . . 15 93 7. Deployment status and future work . . . . . . . . . . . . . . 16 94 8. Security Considerations . . . . . . . . . . . . . . . . . . . 16 95 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 96 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 17 97 11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 17 98 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 17 99 12.1. Normative References . . . . . . . . . . . . . . . . . . 17 100 12.2. Informative References . . . . . . . . . . . . . . . . . 18 101 12.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 18 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 19 104 1. Introduction 106 The FQ-CoDel algorithm is a combined packet scheduler and AQM 107 developed as part of the bufferbloat-fighting community effort. It 108 is based on a modified Deficit Round Robin (DRR) queue scheduler, 109 with the CoDel AQM algorithm operating on each queue. This document 110 describes the combined algorithm; reference implementations are 111 available for ns2 and ns3 and it is included in the mainline Linux 112 kernel as the fq_codel queueing discipline. 114 Since the FQ-CoDel algorithm was originally developed in the Linux 115 kernel, that implementation is still considered canonical. This 116 document strives to describe the algorithm in the abstract in the 117 first sections and separate out most implementation details in 118 subsequent sections, but does use the Linux implementation as 119 reference for default behaviour in the algorithm description itself. 121 The rest of this document is structured as follows: This section 122 gives some concepts and terminology used in the rest of the document, 123 and gives a short informal summary of the FQ-CoDel algorithm. 124 Section 2 gives an overview of the CoDel algorithm. Section 3 covers 125 the flow hashing and DRR portion. Section 4 then describes the 126 working of the algorithm in detail, while Section 5 describes 127 implementation details and considerations. Section 6 lists some of 128 the limitations of using flow queueing. Section 7 outlines the 129 current status of FQ-CoDel deployment and lists some possible future 130 areas of inquiry, and Section 8 reiterates some important security 131 points that must be observed in the implementation. Finally, 132 Section 11 concludes. 134 1.1. Terminology and concepts 136 Flow: A flow is typically identified by a 5-tuple of source IP, 137 destination IP, source port, destination port, and protocol. It can 138 also be identified by a superset or subset of those parameters, or by 139 mac address, or other means. 141 Queue: A queue of packets represented internally in FQ-CoDel. In 142 most instances each flow gets its own queue; however because of the 143 possibility of hash collisions, this is not always the case. In an 144 attempt to avoid confusion, the word 'queue' is used to refer to the 145 internal data structure, and 'flow' to refer to the actual stream of 146 packets being delivered to the FQ-CoDel algorithm. 148 Scheduler: A mechanism to select which queue a packet is dequeued 149 from. 151 CoDel AQM: The Active Queue Management algorithm employed by FQ- 152 CoDel. 154 DRR: Deficit round-robin scheduling. 156 Quantum: The maximum amount of bytes to be dequeued from a queue at 157 once. 159 1.2. Informal summary of FQ-CoDel 161 FQ-CoDel is a _hybrid_ of DRR [DRR] and CoDel [CODELDRAFT], with an 162 optimisation for sparse flows similar to SQF [SQF2012] and DRR++ 163 [DRRPP]. We call this "Flow Queueing" rather than "Fair Queueing" as 164 flows that build a queue are treated differently than flows that do 165 not. 167 FQ-CoDel stochastically classifies incoming packets into different 168 queues by hashing the 5-tuple of IP protocol number and source and 169 destination IP and port numbers, perturbed with a random number 170 selected at initiation time (although other flow classification 171 schemes can optionally be configured instead). Each queue is managed 172 by the CoDel AQM algorithm. Packet ordering within a queue is 173 preserved, since queues have FIFO ordering. 175 The FQ-CoDel algorithm consists of two logical parts: the scheduler 176 which selects which queue to dequeue a packet from, and the CoDel AQM 177 which works on each of the queues. The subtleties of FQ-CoDel are 178 mostly in the scheduling part, whereas the interaction between the 179 scheduler and the CoDel algorithm are fairly straight forward: 181 At initialisation, each queue is set up to have a separate set of 182 CoDel state variables. By default, 1024 queues are created. The 183 current implementation supports anywhere from one to 64K separate 184 queues, and each queue maintains the state variables throughout its 185 lifetime, and so acts the same as the non-FQ CoDel variant would. 186 This means that with only one queue, FQ-CoDel behaves essentially the 187 same as CoDel by itself. 189 On dequeue, FQ-CoDel selects a queue from which to dequeue by a two- 190 tier round-robin scheme, in which each queue is allowed to dequeue up 191 to a configurable quantum of bytes for each iteration. Deviations 192 from this quantum is maintained as byte credits for the queue, which 193 serves to make the fairness scheme byte-based rather than a packet- 194 based. The two-tier round-robin mechanism distinguishes between 195 "new" queues (which don't build up a standing queue) and "old" 196 queues, that have queued enough data to be around for more than one 197 iteration of the round-robin scheduler. 199 This new/old queue distinction has a particular consequence for 200 queues that don't build up more than a quantum of bytes before being 201 visited by the scheduler: Such queues are removed from the list, and 202 then re-added as a new queue each time a packet arrives for it, and 203 so will get priority over queues that do not empty out each round 204 (except for a minor modification to protect against starvation, 205 detailed below). Exactly how little data a flow has to send to keep 206 its queue in this state is somewhat difficult to reason about, 207 because it depends on both the egress link speed and the number of 208 concurrent flows. However, in practice many things that are 209 beneficial to have prioritised for typical internet use (ACKs, DNS 210 lookups, interactive SSH, HTTP requests, ARP, RA, ICMP, VoIP) _tend_ 211 to fall in this category, which is why FQ-CoDel performs so well for 212 many practical applications. However, the implicitness of the 213 prioritisation means that for applications that require guaranteed 214 priority (for instance multiplexing the network control plane over 215 the network itself), explicit classification is still needed. 217 This scheduling scheme has some subtlety to it, which is explained in 218 detail in the remainder of this document. 220 2. CoDel 222 CoDel is described in the the ACM Queue paper [CODEL2012], and the 223 AQM working group draft [CODELDRAFT]. The basic idea is to control 224 queue length, maintaining sufficient queueing to keep the outgoing 225 link busy, but avoiding building up the queue beyond that point. 226 This is done by preferentially dropping packets that remain in the 227 queue for "too long". Packets are dropped by head drop, which lowers 228 the time for the drop signal to propagate back to the sender by the 229 length of the queue, and helps trigger TCP fast retransmit sooner. 231 The CoDel algorithm itself will not be described here; instead we 232 refer the reader to the CoDel draft [CODELDRAFT]. 234 3. Flow Queueing 236 The intention of FQ-CoDel's scheduler is to give each _flow_ its own 237 queue, hence the term _Flow Queueing_. Rather than a perfect 238 realisation of this, a hashing-based scheme is used, where flows are 239 hashed into a number of buckets which each has its own queue. The 240 number of buckets are configurable, and presently defaults to 1024 in 241 the Linux implementation. This is enough to avoid hash collisions on 242 a moderate number of flows as seen for instance in a home gateway. 243 Depending on the characteristics of the link, this can be tuned to 244 trade off memory for a lower probability of hash collisions. See 245 Section 6 for a more in-depth discussion of this. 247 By default, the flow hashing is performed on the 5-tuple of source 248 and destination IP and port numbers and IP protocol number. While 249 the hashing can be customised to match on arbitrary packet bytes, 250 care should be taken when doing so: Much of the benefit of the FQ- 251 CoDel scheduler comes from this per-flow distinction. However, the 252 default hashing does have some limitations, as discussed in 253 Section 6. 255 FQ-CoDel's DRR scheduler is byte-based, employing a deficit round- 256 robin mechanism between queues. This works by keeping track of the 257 current number _byte credits_ of each queue. This number is is 258 initialised to the configurable quantum; each time a queue gets a 259 dequeue opportunity, it gets to dequeue packets, decreasing the 260 number of credits by the packet size for each packet. This continues 261 until the number of credits runs into the negative, at which point it 262 is increased by one quantum, and the dequeue opportunity ends. 264 This means that if one queue contains packets of, for instance, size 265 quantum/3, and another contains quantum-sized packets, the first 266 queue will dequeue three packets each time it gets a turn, whereas 267 the second only dequeues one. This means that flows that send small 268 packets are not penalised by the difference in packet sizes; rather, 269 the DRR scheme approximates a (single-)byte-based fairness queueing. 270 The size of the quantum determines the scheduling granularity, with 271 the tradeoff from too small a quantum being scheduling overhead. For 272 small bandwidths, lowering the quantum from the default MTU size can 273 be advantageous. 275 Unlike plain DRR there are two sets of flows - a "new" list for flows 276 that have not built a queue recently, and an "old" list for flow- 277 building queues. This distinction is an integral part of the FQ- 278 CoDel scheduler and is described in more detail in Section 4. 280 4. The FQ-CoDel scheduler 282 To make its scheduling decisions, FQ-CoDel maintains two ordered 283 lists of active queues, called "new" and "old" queues. When a packet 284 is added to a queue that is not currently active, that queue becomes 285 active by being added to the list of new queues. Later on, it is 286 moved to the list of old queues, from which it is removed when it is 287 no longer active. This behaviour is the source of some subtlety in 288 the packet scheduling at dequeue time, explained below. 290 4.1. Enqueue 292 The packet enqueue mechanism consists of three stages: classification 293 into a queue, timestamping and bookkeeping, and optionally dropping a 294 packet when the total number of enqueued packets goes over the 295 maximum. 297 When a packet is enqueued, it is first classified into the 298 appropriate queue. By default, this is done by hashing (using a 299 Jenkins hash function) on the 5-tuple of IP protocol, and source and 300 destination IP and port numbers, permuted by a random value selected 301 at initialisation time, and taking the hash value modulo the number 302 of queues. 304 Once the packet has been successfully classified into a queue, it is 305 handed over to the CoDel algorithm for timestamping. It is then 306 added to the tail of the selected queue, and the queue's byte count 307 is updated by the packet size. Then, if the queue is not currently 308 active (i.e. if it is not in either the list of new or the list of 309 old queues), it is added to the end of the list of new queues, and 310 its number of credits is initiated to the configured quantum. 311 Otherwise, the queue is left in its current queue list. 313 Finally, the total number of enqueued packets is compared with the 314 configured limit, and if it is _above_ this value (which can happen 315 since a packet was just enqueued), a packet is dropped from the head 316 of the queue with the largest current byte count. Note that this in 317 most cases means that the packet that gets dropped is different from 318 the one that was just enqueued, and may even be from a different 319 queue. 321 4.1.1. Alternative classification schemes 323 As mentioned previously, it is possible to modify the classification 324 scheme to provide a different notion of a 'flow'. The Linux 325 implementation provides this option in the form of the "tc filter" 326 command. While this can add capabilities (for instance, matching on 327 other possible parameters such as mac address, diffserv, firewall and 328 flow specific markings, etc.), care should be taken to preserve the 329 notion of 'flow' as much of the benefit of the FQ-CoDel scheduler 330 comes from keeping flows in separate queues. We are not aware of any 331 deployments utilising the custom classification feature. 333 An alternative to changing the classification scheme is to perform 334 decapsulation prior to hashing. The Linux implementation does this 335 for common encapsulations known to the kernel, such as 6in4, IPIP and 336 GRE tunnels. This helps to distinguish between flows that share the 337 same (outer) 5-tuple, but of course is limited to unencrypted tunnels 338 (see Section 6.2). 340 4.2. Dequeue 342 Most of FQ-CoDel's work is done at packet dequeue time. It consists 343 of three parts: selecting a queue from which to dequeue a packet, 344 actually dequeuing it (employing the CoDel algorithm in the process), 345 and some final bookkeeping. 347 For the first part, the scheduler first looks at the list of new 348 queues; for the queue at the head of that list, if that queue has a 349 negative number of credits (i.e. it has already dequeued at least a 350 quantum of bytes), it is given an additional quantum of credits, the 351 queue is put onto _the end of_ the list of old queues, and the 352 routine selects the next queue and starts again. 354 Otherwise, that queue is selected for dequeue. If the list of new 355 queues is empty, the scheduler proceeds down the list of old queues 356 in the same fashion (checking the credits, and either selecting the 357 queue for dequeuing, or adding credits and putting the queue back at 358 the end of the list). 360 After having selected a queue from which to dequeue a packet, the 361 CoDel algorithm is invoked on that queue. This applies the CoDel 362 control law, and may discard one or more packets from the head of 363 that queue, before returning the packet that should be dequeued (or 364 nothing if the queue is or becomes empty while being handled by the 365 CoDel algorithm). 367 Finally, if the CoDel algorithm does not return a packet, then the 368 queue must be empty, and the scheduler does one of two things: if the 369 queue selected for dequeue came from the list of new queues, it is 370 moved to _the end of_ the list of old queues. If instead it came 371 from the list of old queues, that queue is removed from the list, to 372 be added back (as a new queue) the next time a packet arrives that 373 hashes to that queue. Then (since no packet was available for 374 dequeue), the whole dequeue process is restarted from the beginning. 376 If, instead, the scheduler _did_ get a packet back from the CoDel 377 algorithm, it subtracts the size of the packet from the the byte 378 credits for the selected queue and returns the packet as the result 379 of the dequeue operation. 381 The step that moves an empty queue from the list of new queues to 382 _the end of_ the list of old queues before it is removed is crucial 383 to prevent starvation. Otherwise the queue could reappear (the next 384 time a packet arrives for it) before the list of old queues is 385 visited; this can go on indefinitely even with a small number of 386 active flows, if the flow providing packets to the queue in question 387 transmits at just the right rate. This is prevented by first moving 388 the queue to _the end of_ the list of old queues, forcing a pass 389 through that, and thus preventing starvation. Moving it to the end 390 of the list, rather than the front, is crucial for this to work. 392 The resulting migration of queues between the different states is 393 summarised in the following state diagram: 395 +-----------------+ +------------------+ 396 | | Empty | | 397 | Empty |<---------------+ Old +----+ 398 | | | | | 399 +-------+---------+ +------------------+ | 400 | ^ ^ |Credits 401 |Arrival | | |Exhausted 402 v | | | 403 +-----------------+ | | | 404 | | Empty or | | | 405 | New +-------------------+ +-------+ 406 | | Credits exhausted 407 +-----------------+ 409 5. Implementation considerations 411 This section contains implementation details for the FQ-CoDel 412 algorithm. This includes the data structures and parameters used in 413 the Linux implementation, as well as discussion of some required 414 features of the target platform and other considerations. 416 5.1. Data structures 418 The main data structure of FQ-CoDel is the array of queues, which is 419 instantiated with the number of queues specified by the _flows_ 420 parameter at instantiation time. Each queue consists simply of an 421 ordered list of packets with FIFO semantics, two state variables 422 tracking the queue credits and total number of bytes enqueued, and 423 the set of CoDel state variables. Other state variables to track 424 queue statistics can also be included: for instance, the Linux 425 implementation keeps a count of dropped packets. 427 In addition to the queue structures themselves, FQ-CoDel maintains 428 two ordered lists containing references to the subset of queues that 429 are currently active. These are the list of 'new' queues and the 430 list of 'old' queues, as explained in Section 4 above. 432 In the Linux implementation, queue space is shared: there's a global 433 limit on the number of packets the queues can hold, but not one per 434 queue. 436 5.2. Parameters 438 The following are the user configuration parameters exposed by the 439 Linux implementation of FQ-CoDel. 441 5.2.1. Interval 443 The _interval_ parameter has the same semantics as CoDel and is used 444 to ensure that the measured minimum delay does not become too stale. 445 The minimum delay MUST be experienced in the last epoch of length 446 interval. It SHOULD be set on the order of the worst-case RTT 447 through the bottleneck to give end-points sufficient time to react. 449 The default interval value is 100 ms. 451 5.2.2. Target 453 The _target_ parameter has the same semantics as CoDel. It is the 454 acceptable minimum standing/persistent queue delay for each FQ-CoDel 455 Queue. This minimum delay is identified by tracking the local 456 minimum queue delay that packets experience. 458 The default target value is 5 ms, but this value should be tuned to 459 be at least the transmission time of a single MTU-sized packet at the 460 prevalent egress link speed (which for e.g. 1Mbps and MTU 1500 is 461 ~15ms), to prevent CoDel from being too aggressive at low bandwidths. 462 It should otherwise be set to on the order of 5-10% of the configured 463 interval. 465 5.2.3. Packet limit 467 Routers do not have infinite memory, so some packet limit MUST be 468 enforced. 470 The _limit_ parameter is the hard limit on the real queue size, 471 measured in number of packets. This limit is a global limit on the 472 number of packets in all queues; each individual queue does not have 473 an upper limit. When the limit is reached and a new packet arrives 474 for enqueue, a packet is dropped from the head of the largest queue 475 (measured in bytes) to make room for the new packet. 477 In Linux, the default packet limit is 10240 packets, which is 478 suitable for up to 10GigE speeds. In practice, the hard limit is 479 rarely, if ever, hit, as drops are performed by the CoDel algorithm 480 long before the limit is hit. For platforms that are severely memory 481 constrained, a lower limit can be used. 483 5.2.4. Quantum 485 The _quantum_ parameter is the number of bytes each queue gets to 486 dequeue on each round of the scheduling algorithm. The default is 487 set to 1514 bytes which corresponds to the Ethernet MTU plus the 488 hardware header length of 14 bytes. 490 In TSO-enabled systems, where a "packet" consists of an offloaded 491 packet train, it can presently be as large as 64K bytes. In GRO- 492 enabled systems, up to 17 times the TCP max segment size (or 25K 493 bytes). These mega-packets severely impact FQ-CoDel's ability to 494 schedule traffic, and hurt latency needlessly. There is ongoing work 495 in Linux to make smarter use of offload engines. 497 5.2.5. Flows 499 The _flows_ parameter sets the number of queues into which the 500 incoming packets are classified. Due to the stochastic nature of 501 hashing, multiple flows may end up being hashed into the same slot. 503 This parameter can be set only at load time since memory has to be 504 allocated for the hash table in the current implementation. 506 The default value is 1024 in the current Linux implementation. 508 5.2.6. ECN 510 ECN is _enabled_ by default. Rather than do anything special with 511 misbehaved ECN flows, FQ-CoDel relies on the packet scheduling system 512 to minimise their impact, thus unresponsive packets in a flow being 513 marked with ECN can grow to the overall packet limit, but will not 514 otherwise affect the performance of the system. 516 It can be disabled by specifying the _noecn_ parameter. 518 5.2.7. CE threshold 520 This parameter enables DCTCP-like processing to enable CE marking ECT 521 packets at a lower setpoint than the default codel target. 523 The parameter, _ce_threshold_, is disabled by default and can be set 524 to a number of microseconds to enable. 526 5.3. Probability of hash collisions 528 Since the Linux FQ-CoDel implementation by default uses 1024 hash 529 buckets, the probability that (say) 100 flows will all hash to the 530 same bucket is something like ten to the power of minus 300. Thus, 531 the probability that at least one of the flows will hash to some 532 other queue is very high indeed. 534 Expanding on this, based on analytical equations for hash collision 535 probabilities, for 100 flows, the probability of no collision is 536 90.78%; the probability that no more than two of the 100 flows will 537 be involved in any given collision = 99.57%; and the probability that 538 no more than three of the 100 flows will be involved in any given 539 collision = 99.99%. 541 These probabilities can be improved upon by using set-associative 542 hashing, a technique used in the Cake algorithm currently being 543 developed as a further development upon the FQ-CoDel principles. For 544 a 4-way associative hash with the same number of total queues, the 545 probability of no collisions for 100 flows is 99.93%, while for an 546 8-way associative hash it is ~100%. 548 5.4. Memory Overhead 550 FQ-CoDel can be implemented with a very low memory footprint (less 551 than 64 bytes per queue on 64 bit systems). These are the data 552 structures used in the Linux implementation: 554 struct codel_vars { 555 u32 count; 556 u32 lastcount; 557 bool dropping; 558 u16 rec_inv_sqrt; 559 codel_time_t first_above_time; 560 codel_time_t drop_next; 561 codel_time_t ldelay; 562 }; 564 struct fq_codel_flow { 565 struct sk_buff *head; 566 struct sk_buff *tail; 567 struct list_head flowchain; 568 int credits; /* current number of queue credits */ 569 u32 dropped; /* # of drops (or ECN marks) on flow */ 570 struct codel_vars cvars; 571 }; 573 The master table managing all queues looks like this: 575 struct fq_codel_sched_data { 576 struct tcf_proto *filter_list; /* optional external classifier */ 577 struct fq_codel_flow *flows; /* Flows table [flows_cnt] */ 578 u32 *backlogs; /* backlog table [flows_cnt] */ 579 u32 flows_cnt; /* number of flows */ 580 u32 perturbation; /* hash perturbation */ 581 u32 quantum; /* psched_mtu(qdisc_dev(sch)); */ 582 struct codel_params cparams; 583 struct codel_stats cstats; 584 u32 drop_overlimit; 585 u32 new_flow_count; 587 struct list_head new_flows; /* list of new flows */ 588 struct list_head old_flows; /* list of old flows */ 589 }; 591 5.5. Per-Packet Timestamping 593 The CoDel portion of the algorithm requires per-packet timestamps be 594 stored along with the packet. While this approach works well for 595 software-based routers, it may be impossible to retrofit devices that 596 do most of their processing in silicon and lack space or mechanism 597 for timestamping. 599 Also, while perfect resolution is not needed, timestamp resolution 600 below the target is necessary. Furthermore, timestamping functions 601 in the core OS need to be efficient as they are called at least once 602 on each packet enqueue and dequeue. 604 5.6. Other forms of "Fair Queueing" 606 Much of the scheduling portion of FQ-CoDel is derived from DRR and is 607 substantially similar to DRR++. SFQ-based versions have also been 608 produced and tested in ns2. Other forms of Fair Queueing, such as 609 WFQ or QFQ, have not been thoroughly explored, but there's no a 610 priori reason why the round-robin scheduling of FQ-CoDel couldn't be 611 replaced with something else. 613 5.7. Differences between CoDel and FQ-CoDel behaviour 615 CoDel can be applied to a single queue system as a straight AQM, 616 where it converges towards an "ideal" drop rate (i.e. one that 617 minimises delay while keeping a high link utilisation), and then 618 optimises around that control point. 620 The scheduling of FQ-CoDel mixes packets of competing flows, which 621 acts to pace bursty flows to better fill the pipe. Additionally, a 622 new flow gets substantial leeway over other flows until CoDel finds 623 an ideal drop rate for it. However, for a new flow that exceeds the 624 configured quantum, more time passes before all of its data is 625 delivered (as packets from it, too, are mixed across the other 626 existing queue-building flows). Thus, FQ-CoDel takes longer (as 627 measured in time) to converge towards an ideal drop rate for a given 628 new flow, but does so within fewer delivered _packets_ from that 629 flow. 631 Finally, the flow isolation FQ-CoDel provides means that the CoDel 632 drop mechanism operates on the flows actually building queues, which 633 results in packets being dropped more accurately from the largest 634 flows than CoDel alone manages. Additionally, flow isolation 635 radically improves the transient behaviour of the network when 636 traffic or link characteristics change (e.g. when new flows start up 637 or the link bandwidth changes); while CoDel itself can take a while 638 to respond, fq_codel doesn't miss a beat. 640 6. Limitations of flow queueing 642 While FQ-CoDel has been shown in many scenarios to offer significant 643 performance gains, there are some scenarios where the scheduling 644 algorithm in particular is not a good fit. This section documents 645 some of the known cases which either may require tweaking the default 646 behaviour, or where alternatives to flow queueing should be 647 considered. 649 6.1. Fairness between things other than flows 651 In some parts of the network, enforcing flow-level fairness may not 652 be desirable, or some other form of fairness may be more important. 653 An example of this can be an Internet Service Provider that may be 654 more interested in ensuring fairness between customers than between 655 flows. Or a hosting or transit provider that wishes to ensure 656 fairness between connecting Autonomous Systems or networks. Another 657 issue can be that the number of simultaneous flows experienced at a 658 particular link can be too high for flow-based fairness queueing to 659 be effective. 661 Whatever the reason, in a scenario where fairness between flows is 662 not desirable, reconfiguring FQ-CoDel to match on a different 663 characteristic can be a way forward. The implementation in Linux can 664 leverage the packet matching mechanism of the _tc_ subsystem to use 665 any available packet field to partition packets into virtual queues, 666 to for instance match on address or subnet source/destination pairs, 667 application layer characteristics, etc. 669 Furthermore, as commonly deployed today, FQ-CoDel is used with three 670 or more tiers of classification: priority, best effort and 671 background, based on diffserv markings. Some products do more 672 detailed classification, including deep packet inspection and 673 destination-specific filters to achieve their desired result. 675 6.2. Flow bunching by opaque encapsulation 677 Where possible, FQ-CoDel will attempt to decapsulate packets before 678 matching on the header fields for the flow hashing. However, for 679 some encapsulation techniques, most notably encrypted VPNs, this is 680 not possible. If several flows are bunched into one such 681 encapsulated tunnel, they will be seen as one flow by the FQ-CoDel 682 algorithm. This means that they will share a queue, and drop 683 behaviour, and so flows inside the encapsulation will not benefit 684 from the implicit prioritisation of FQ-CoDel, but will continue to 685 benefit from the reduced overall queue length from the CoDel 686 algorithm operating on the queue. In addition, when such an 687 encapsulated bunch competes against other flows, it will count as one 688 flow, and not assigned a share of the bandwidth based on how many 689 flows are inside the encapsulation. 691 Depending on the application, this may or may not be desirable 692 behaviour. In cases where it is not, changing FQ-CoDel's matching to 693 not be flow-based (as detailed in the previous subsection above) can 694 be a mitigation. Going forward, having some mechanism for opaque 695 encapsulations to express to the outer layer which flow a packet 696 belongs to, could be a way to mitigate this. 698 6.3. Low-priority congestion control algorithms 700 In the presence of queue management schemes that contain latency 701 under load, low-priority congestion control algorithms such as LEDBAT 702 [RFC6817] (or, in general, algorithms that try to voluntarily use up 703 less than their fair share of bandwidth) experiences very little 704 added latency when the link is congested. Thus, they lack the signal 705 to back off that added latency previously afforded them. This effect 706 is seen with FQ-CoDel as well as with any effective AQM [GONG2014]. 708 As such, these delay-based algorithms tend to revert to loss-based 709 congestion control, and will consume the fair share of bandwidth 710 afforded to them by the FQ-CoDel scheduler. However, low-priority 711 congestion control mechanisms may be able to take steps to continue 712 to be low priority, for instance by taking into account the vastly 713 reduced level of delay afforded by an AQM, or by using a coupled 714 approach to observing the behaviour of multiple flows. 716 7. Deployment status and future work 718 The FQ-CoDel algorithm as described in this document has been shipped 719 as part of the Linux kernel since version 3.5, released on the 21st 720 of July, 2012, with the ce_threshold being added in version 4.2. The 721 algorithm has seen widespread testing in a variety of contexts and is 722 configured as the default queueing discipline in a number of mainline 723 Linux distributions (as of this writing at least OpenWRT, Arch Linux 724 and Fedora). We believe it to be a safe default and encourage people 725 running Linux to turn it on: It is a massive improvement over the 726 previous default FIFO queue. 728 Of course there is always room for improvement, and this document has 729 listed some of the know limitations of the algorithm. As such, we 730 encourage further research into algorithm refinements and addressing 731 of limitations. One such effort is undertaken by the bufferbloat 732 community in the form of the Cake [1] queue management scheme. In 733 addition to this we believe the following (non-exhaustive) list of 734 issues to be worthy of further enquiry: 736 o Variations on the flow classification mechanism to fit different 737 notions of flows. For instance, an ISP might want to deploy per- 738 subscriber scheduling, while in other cases several flows can 739 share a 5-tuple, as exemplified by the RTCWEB QoS recommendations 740 [RTCWEB-QOS]. 742 o Interactions between flow queueing and delay-based congestion 743 control algorithms and scavenger protocols. 745 o Other scheduling mechanisms to replace the DRR portion of the 746 algorithm, e.g. QFQ or WFQ. 748 o Sensitivity of parameters, most notably the number of queues and 749 the CoDel parameters. 751 8. Security Considerations 753 There are no specific security exposures associated with FQ-CoDel 754 that are not also present in current FIFO systems. And some are in 755 fact reduced (e.g. simple minded packet floods). However, some care 756 is needed in the implementation to ensure this is the case. These 757 are included in the description above, however we reiterate them 758 here: 760 o To prevent packets in the new queues from starving old queues, it 761 is important that when a queue on the list of new queues empties, 762 it is moved to _the end of_ the list of old queues. This is 763 described at the end of Section 4.2. 765 o To prevent an attacker targeting a specific flow for a denial of 766 service attack, the hash that maps packets to queues should not be 767 predictable. To achieve this, FQ-CoDel salts the hash, as 768 described in the beginning of Section 4.1. The size of the salt 769 and the strength of the hash function is obviously a tradeoff 770 between performance and security. The Linux implementation uses a 771 32 bit random value as the salt and a Jenkins hash function. This 772 makes it possible to achieve very high throughput, and we consider 773 it sufficient to ward off the most obvious attacks. 775 o Packet fragments without a layer 4 header can be hashed into 776 different bins than the first fragment with the header intact. 777 This can cause reordering and/or adversely affect the performance 778 of the flow. Keeping state to match the fragments to the 779 beginning of the packet, or simply putting all fragmented packets 780 into the same queue, are two ways to alleviate this. 782 9. IANA Considerations 784 This document has no actions for IANA. 786 10. Acknowledgements 788 Our deepest thanks to Kathie Nichols, Van Jacobson, and all the 789 members of the bufferbloat.net effort for all the help on developing 790 and testing the algorithm. In addition, our thanks to Anil Agarwal 791 for his help with getting the hash collision probabilities in this 792 document right. 794 11. Conclusions 796 FQ-CoDel is a very general, efficient, nearly parameterless queue 797 management approach combining flow queueing with CoDel. It is a very 798 powerful tool for solving bufferbloat, and we believe it to be safe 799 to turn on by default, as has already happened in a number of Linux 800 distributions. In this document we have documented the Linux 801 implementation in sufficient detail for an independent 802 implementation, and we encourage such implementations be widely 803 deployed. 805 12. References 807 12.1. Normative References 809 [CODELDRAFT] 810 Nichols, K., Jacobson, V., McGregor, A., and J. Iyengar, 811 "Controlled Delay Active Queue Management", October 2014, 812 . 814 [RFC6817] Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, 815 "Low Extra Delay Background Transport (LEDBAT)", RFC 6817, 816 DOI 10.17487/RFC6817, December 2012, 817 . 819 [RTCWEB-QOS] 820 Dhesikan, S., Jennings, C., Druta, D., Jones, P., and J. 821 Polk, "DSCP and other packet markings for RTCWeb QoS", 822 December 2014, . 825 12.2. Informative References 827 [CODEL2012] 828 Nichols, K. and V. Jacobson, "Controlling Queue Delay", 829 July 2012, . 831 [DRR] Shreedhar, M. and G. Varghese, "Efficient Fair Queueing 832 Using Deficit Round Robin", June 1996, 833 . 836 [DRRPP] MacGregor, M. and W. Shi, "Deficits for Bursty Latency- 837 critical Flows: DRR++", 2000, 838 . 841 [GONG2014] 842 Gong, Y., Rossi, D., Testa, C., Valenti, S., and D. Taht, 843 "Fighting the bufferbloat: on the coexistence of AQM and 844 low priority congestion control", July 2014, 845 . 848 [SQF2012] Bonald, T., Muscariello, L., and N. Ostallo, "On the 849 impact of TCP and per-flow scheduling on Internet 850 Performance - IEEE/ACM transactions on Networking", April 851 2012, . 854 12.3. URIs 856 [1] http://www.bufferbloat.net/projects/codel/wiki/Cake 858 Authors' Addresses 860 Toke Hoeiland-Joergensen 861 Karlstad University 862 Dept. of Computer Science 863 Karlstad 65188 864 Sweden 866 Email: toke.hoiland-jorgensen@kau.se 868 Paul McKenney 869 IBM Linux Technology Center 870 1385 NW Amberglen Parkway 871 Hillsboro, OR 97006 872 USA 874 Email: paulmck@linux.vnet.ibm.com 875 URI: http://www2.rdrop.com/~paulmck/ 877 Dave Taht 878 Teklibre 879 2104 W First street 880 Apt 2002 881 FT Myers, FL 33901 882 USA 884 Email: dave.taht@gmail.com 885 URI: http://www.teklibre.com/ 887 Jim Gettys 888 21 Oak Knoll Road 889 Carlisle, MA 993 890 USA 892 Email: jg@freedesktop.org 893 URI: https://en.wikipedia.org/wiki/Jim_Gettys 895 Eric Dumazet 896 Google, Inc. 897 1600 Amphitheater Pkwy 898 Mountain View, CA 94043 899 USA 901 Email: edumazet@gmail.com