idnits 2.17.1 draft-ietf-aqm-fq-codel-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 578 has weird spacing: '...st_head flowc...' -- The document date (March 01, 2016) is 2978 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 AQM working group T. Hoeiland-Joergensen 3 Internet-Draft Karlstad University 4 Intended status: Experimental P. McKenney 5 Expires: September 2, 2016 IBM Linux Technology Center 6 D. Taht 7 Teklibre 8 J. Gettys 10 E. Dumazet 11 Google, Inc. 12 March 01, 2016 14 FlowQueue-Codel 15 draft-ietf-aqm-fq-codel-05 17 Abstract 19 This memo presents the FQ-CoDel hybrid packet scheduler/AQM 20 algorithm, a powerful tool for fighting bufferbloat and reducing 21 latency. 23 FQ-CoDel mixes packets from multiple flows and reduces the impact of 24 head of line blocking from bursty traffic. It provides isolation for 25 low-rate traffic such as DNS, web, and videoconferencing traffic. It 26 improves utilisation across the networking fabric, especially for 27 bidirectional traffic, by keeping queue lengths short; and it can be 28 implemented in a memory- and CPU-efficient fashion across a wide 29 range of hardware. 31 Status of This Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at http://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on September 2, 2016. 48 Copyright Notice 50 Copyright (c) 2016 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (http://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document. Please review these documents 57 carefully, as they describe your rights and restrictions with respect 58 to this document. Code Components extracted from this document must 59 include Simplified BSD License text as described in Section 4.e of 60 the Trust Legal Provisions and are provided without warranty as 61 described in the Simplified BSD License. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 1.1. Conventions used in this document . . . . . . . . . . . . 3 67 1.2. Terminology and concepts . . . . . . . . . . . . . . . . 4 68 1.3. Informal summary of FQ-CoDel . . . . . . . . . . . . . . 4 69 2. CoDel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 70 3. Flow Queueing . . . . . . . . . . . . . . . . . . . . . . . . 6 71 4. The FQ-CoDel scheduler . . . . . . . . . . . . . . . . . . . 7 72 4.1. Enqueue . . . . . . . . . . . . . . . . . . . . . . . . . 7 73 4.1.1. Alternative classification schemes . . . . . . . . . 8 74 4.2. Dequeue . . . . . . . . . . . . . . . . . . . . . . . . . 8 75 5. Implementation considerations . . . . . . . . . . . . . . . . 9 76 5.1. Data structures . . . . . . . . . . . . . . . . . . . . . 10 77 5.2. Parameters . . . . . . . . . . . . . . . . . . . . . . . 10 78 5.2.1. Interval . . . . . . . . . . . . . . . . . . . . . . 10 79 5.2.2. Target . . . . . . . . . . . . . . . . . . . . . . . 10 80 5.2.3. Packet limit . . . . . . . . . . . . . . . . . . . . 11 81 5.2.4. Quantum . . . . . . . . . . . . . . . . . . . . . . . 11 82 5.2.5. Flows . . . . . . . . . . . . . . . . . . . . . . . . 11 83 5.2.6. ECN . . . . . . . . . . . . . . . . . . . . . . . . . 11 84 5.2.7. CE threshold . . . . . . . . . . . . . . . . . . . . 12 85 5.3. Probability of hash collisions . . . . . . . . . . . . . 12 86 5.4. Memory Overhead . . . . . . . . . . . . . . . . . . . . . 12 87 5.5. Per-Packet Timestamping . . . . . . . . . . . . . . . . . 13 88 5.6. Other forms of "Fair Queueing" . . . . . . . . . . . . . 14 89 5.7. Differences between CoDel and FQ-CoDel behaviour . . . . 14 90 6. Limitations of flow queueing . . . . . . . . . . . . . . . . 15 91 6.1. Fairness between things other than flows . . . . . . . . 15 92 6.2. Flow bunching by opaque encapsulation . . . . . . . . . . 15 93 6.3. Low-priority congestion control algorithms . . . . . . . 16 94 7. Deployment status and future work . . . . . . . . . . . . . . 16 95 8. Security Considerations . . . . . . . . . . . . . . . . . . . 17 96 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 18 97 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 18 98 11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 18 99 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 18 100 12.1. Normative References . . . . . . . . . . . . . . . . . . 18 101 12.2. Informative References . . . . . . . . . . . . . . . . . 19 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 19 104 1. Introduction 106 The FQ-CoDel algorithm is a combined packet scheduler and AQM 107 developed as part of the bufferbloat-fighting community effort. It 108 is based on a modified Deficit Round Robin (DRR) queue scheduler, 109 with the CoDel AQM algorithm operating on each queue. This document 110 describes the combined algorithm; reference implementations are 111 available for ns2 and ns3 and it is included in the mainline Linux 112 kernel as the fq_codel queueing discipline. 114 Since the FQ-CoDel algorithm was originally developed in the Linux 115 kernel, that implementation is still considered canonical. This 116 document strives to describe the algorithm in the abstract in the 117 first sections and separate out most implementation details in 118 subsequent sections, but does use the Linux implementation as 119 reference for default behaviour in the algorithm description itself. 121 The rest of this document is structured as follows: This section 122 gives some concepts and terminology used in the rest of the document, 123 and gives a short informal summary of the FQ-CoDel algorithm. 124 Section 2 gives an overview of the CoDel algorithm. Section 3 covers 125 the flow hashing and DRR portion. Section 4 then describes the 126 working of the algorithm in detail, while Section 5 describes 127 implementation details and considerations. Section 6 lists some of 128 the limitations of using flow queueing. Section 7 outlines the 129 current status of FQ-CoDel deployment and lists some possible future 130 areas of inquiry, and Section 8 reiterates some important security 131 points that must be observed in the implementation. Finally, 132 Section 11 concludes. 134 1.1. Conventions used in this document 136 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 137 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 138 document are to be interpreted as described in [RFC2119]. 140 In this document, these words will appear with that interpretation 141 only when in ALL CAPS. Lower case uses of these words are not to be 142 interpreted as carrying [RFC2119] significance. 144 1.2. Terminology and concepts 146 Flow: A flow is typically identified by a 5-tuple of source IP, 147 destination IP, source port, destination port, and protocol. It can 148 also be identified by a superset or subset of those parameters, or by 149 mac address, or other means. 151 Queue: A queue of packets represented internally in FQ-CoDel. In 152 most instances each flow gets its own queue; however because of the 153 possibility of hash collisions, this is not always the case. In an 154 attempt to avoid confusion, the word 'queue' is used to refer to the 155 internal data structure, and 'flow' to refer to the actual stream of 156 packets being delivered to the FQ-CoDel algorithm. 158 Scheduler: A mechanism to select which queue a packet is dequeued 159 from. 161 CoDel AQM: The Active Queue Management algorithm employed by FQ- 162 CoDel. 164 DRR: Deficit round-robin scheduling. 166 Quantum: The maximum amount of bytes to be dequeued from a queue at 167 once. 169 1.3. Informal summary of FQ-CoDel 171 FQ-CoDel is a _hybrid_ of DRR [DRR] and CoDel [CODELDRAFT], with an 172 optimisation for sparse flows similar to SQF [SQF2012] and DRR++ 173 [DRRPP]. We call this "Flow Queueing" rather than "Fair Queueing" as 174 flows that build a queue are treated differently than flows that do 175 not. 177 FQ-CoDel stochastically classifies incoming packets into different 178 queues by hashing the 5-tuple of IP protocol number and source and 179 destination IP and port numbers, perturbed with a random number 180 selected at initiation time (although other flow classification 181 schemes can optionally be configured instead). Each queue is managed 182 by the CoDel AQM algorithm. Packet ordering within a queue is 183 preserved, since queues have FIFO ordering. 185 The FQ-CoDel algorithm consists of two logical parts: the scheduler 186 which selects which queue to dequeue a packet from, and the CoDel AQM 187 which works on each of the queues. The subtleties of FQ-CoDel are 188 mostly in the scheduling part, whereas the interaction between the 189 scheduler and the CoDel algorithm are fairly straight forward: 191 At initialisation, each queue is set up to have a separate set of 192 CoDel state variables. By default, 1024 queues are created. The 193 current implementation supports anywhere from one to 64K separate 194 queues, and each queue maintains the state variables throughout its 195 lifetime, and so acts the same as the non-FQ CoDel variant would. 196 This means that with only one queue, FQ-CoDel behaves essentially the 197 same as CoDel by itself. 199 On dequeue, FQ-CoDel selects a queue from which to dequeue by a two- 200 tier round-robin scheme, in which each queue is allowed to dequeue up 201 to a configurable quantum of bytes for each iteration. Deviations 202 from this quantum is maintained as byte credits for the queue, which 203 serves to make the fairness scheme byte-based rather than a packet- 204 based. The two-tier round-robin mechanism distinguishes between 205 "new" queues (which don't build up a standing queue) and "old" 206 queues, that have queued enough data to be around for more than one 207 iteration of the round-robin scheduler. 209 This new/old queue distinction has a particular consequence for 210 queues that don't build up more than a quantum of bytes before being 211 visited by the scheduler: Such queues are removed from the list, and 212 then re-added as a new queue each time a packet arrives for it, and 213 so will get priority over queues that do not empty out each round 214 (except for a minor modification to protect against starvation, 215 detailed below). Exactly how little data a flow has to send to keep 216 its queue in this state is somewhat difficult to reason about, 217 because it depends on both the egress link speed and the number of 218 concurrent flows. However, in practice many things that are 219 beneficial to have prioritised for typical internet use (ACKs, DNS 220 lookups, interactive SSH, HTTP requests, ARP, RA, ICMP, VoIP) _tend_ 221 to fall in this category, which is why FQ-CoDel performs so well for 222 many practical applications. However, the implicitness of the 223 prioritisation means that for applications that require guaranteed 224 priority (for instance multiplexing the network control plane over 225 the network itself), explicit classification is still needed. 227 This scheduling scheme has some subtlety to it, which is explained in 228 detail in the remainder of this document. 230 2. CoDel 232 CoDel is described in the the ACM Queue paper [CODEL2012], and the 233 AQM working group draft [CODELDRAFT]. The basic idea is to control 234 queue length, maintaining sufficient queueing to keep the outgoing 235 link busy, but avoiding building up the queue beyond that point. 236 This is done by preferentially dropping packets that remain in the 237 queue for "too long". Packets are dropped by head drop, which lowers 238 the time for the drop signal to propagate back to the sender by the 239 length of the queue, and helps trigger TCP fast retransmit sooner. 241 The CoDel algorithm itself will not be described here; instead we 242 refer the reader to the CoDel draft [CODELDRAFT]. 244 3. Flow Queueing 246 The intention of FQ-CoDel's scheduler is to give each _flow_ its own 247 queue, hence the term _Flow Queueing_. Rather than a perfect 248 realisation of this, a hashing-based scheme is used, where flows are 249 hashed into a number of buckets which each has its own queue. The 250 number of buckets are configurable, and presently defaults to 1024 in 251 the Linux implementation. This is enough to avoid hash collisions on 252 a moderate number of flows as seen for instance in a home gateway. 253 Depending on the characteristics of the link, this can be tuned to 254 trade off memory for a lower probability of hash collisions. See 255 Section 6 for a more in-depth discussion of this. 257 By default, the flow hashing is performed on the 5-tuple of source 258 and destination IP and port numbers and IP protocol number. While 259 the hashing can be customised to match on arbitrary packet bytes, 260 care should be taken when doing so: Much of the benefit of the FQ- 261 CoDel scheduler comes from this per-flow distinction. However, the 262 default hashing does have some limitations, as discussed in 263 Section 6. 265 FQ-CoDel's DRR scheduler is byte-based, employing a deficit round- 266 robin mechanism between queues. This works by keeping track of the 267 current number _byte credits_ of each queue. This number is is 268 initialised to the configurable quantum; each time a queue gets a 269 dequeue opportunity, it gets to dequeue packets, decreasing the 270 number of credits by the packet size for each packet. This continues 271 until the number of credits runs into the negative, at which point it 272 is increased by one quantum, and the dequeue opportunity ends. 274 This means that if one queue contains packets of, for instance, size 275 quantum/3, and another contains quantum-sized packets, the first 276 queue will dequeue three packets each time it gets a turn, whereas 277 the second only dequeues one. This means that flows that send small 278 packets are not penalised by the difference in packet sizes; rather, 279 the DRR scheme approximates a (single-)byte-based fairness queueing. 280 The size of the quantum determines the scheduling granularity, with 281 the tradeoff from too small a quantum being scheduling overhead. For 282 small bandwidths, lowering the quantum from the default MTU size can 283 be advantageous. 285 Unlike plain DRR there are two sets of flows - a "new" list for flows 286 that have not built a queue recently, and an "old" list for flow- 287 building queues. This distinction is an integral part of the FQ- 288 CoDel scheduler and is described in more detail in Section 4. 290 4. The FQ-CoDel scheduler 292 To make its scheduling decisions, FQ-CoDel maintains two ordered 293 lists of active queues, called "new" and "old" queues. When a packet 294 is added to a queue that is not currently active, that queue becomes 295 active by being added to the list of new queues. Later on, it is 296 moved to the list of old queues, from which it is removed when it is 297 no longer active. This behaviour is the source of some subtlety in 298 the packet scheduling at dequeue time, explained below. 300 4.1. Enqueue 302 The packet enqueue mechanism consists of three stages: classification 303 into a queue, timestamping and bookkeeping, and optionally dropping a 304 packet when the total number of enqueued packets goes over the 305 maximum. 307 When a packet is enqueued, it is first classified into the 308 appropriate queue. By default, this is done by hashing (using a 309 Jenkins hash function) on the 5-tuple of IP protocol, and source and 310 destination IP and port numbers, permuted by a random value selected 311 at initialisation time, and taking the hash value modulo the number 312 of queues. 314 Once the packet has been successfully classified into a queue, it is 315 handed over to the CoDel algorithm for timestamping. It is then 316 added to the tail of the selected queue, and the queue's byte count 317 is updated by the packet size. Then, if the queue is not currently 318 active (i.e. if it is not in either the list of new or the list of 319 old queues), it is added to the end of the list of new queues, and 320 its number of credits is initiated to the configured quantum. 321 Otherwise, the queue is left in its current queue list. 323 Finally, the total number of enqueued packets is compared with the 324 configured limit, and if it is _above_ this value (which can happen 325 since a packet was just enqueued), a packet is dropped from the head 326 of the queue with the largest current byte count. Note that this in 327 most cases means that the packet that gets dropped is different from 328 the one that was just enqueued, and may even be from a different 329 queue. 331 4.1.1. Alternative classification schemes 333 As mentioned previously, it is possible to modify the classification 334 scheme to provide a different notion of a 'flow'. The Linux 335 implementation provides this option in the form of the "tc filter" 336 command. While this can add capabilities (for instance, matching on 337 other possible parameters such as mac address, diffserv, firewall and 338 flow specific markings, etc.), care should be taken to preserve the 339 notion of 'flow' as much of the benefit of the FQ-CoDel scheduler 340 comes from keeping flows in separate queues. We are not aware of any 341 deployments utilising the custom classification feature. 343 An alternative to changing the classification scheme is to perform 344 decapsulation prior to hashing. The Linux implementation does this 345 for common encapsulations known to the kernel, such as 6in4, IPIP and 346 GRE tunnels. This helps to distinguish between flows that share the 347 same (outer) 5-tuple, but of course is limited to unencrypted tunnels 348 (see Section 6.2). 350 4.2. Dequeue 352 Most of FQ-CoDel's work is done at packet dequeue time. It consists 353 of three parts: selecting a queue from which to dequeue a packet, 354 actually dequeuing it (employing the CoDel algorithm in the process), 355 and some final bookkeeping. 357 For the first part, the scheduler first looks at the list of new 358 queues; for the queue at the head of that list, if that queue has a 359 negative number of credits (i.e. it has already dequeued at least a 360 quantum of bytes), it is given an additional quantum of credits, the 361 queue is put onto _the end of_ the list of old queues, and the 362 routine selects the next queue and starts again. 364 Otherwise, that queue is selected for dequeue. If the list of new 365 queues is empty, the scheduler proceeds down the list of old queues 366 in the same fashion (checking the credits, and either selecting the 367 queue for dequeuing, or adding credits and putting the queue back at 368 the end of the list). 370 After having selected a queue from which to dequeue a packet, the 371 CoDel algorithm is invoked on that queue. This applies the CoDel 372 control law, and may discard one or more packets from the head of 373 that queue, before returning the packet that should be dequeued (or 374 nothing if the queue is or becomes empty while being handled by the 375 CoDel algorithm). 377 Finally, if the CoDel algorithm does not return a packet, then the 378 queue must be empty, and the scheduler does one of two things: if the 379 queue selected for dequeue came from the list of new queues, it is 380 moved to _the end of_ the list of old queues. If instead it came 381 from the list of old queues, that queue is removed from the list, to 382 be added back (as a new queue) the next time a packet arrives that 383 hashes to that queue. Then (since no packet was available for 384 dequeue), the whole dequeue process is restarted from the beginning. 386 If, instead, the scheduler _did_ get a packet back from the CoDel 387 algorithm, it subtracts the size of the packet from the the byte 388 credits for the selected queue and returns the packet as the result 389 of the dequeue operation. 391 The step that moves an empty queue from the list of new queues to 392 _the end of_ the list of old queues before it is removed is crucial 393 to prevent starvation. Otherwise the queue could reappear (the next 394 time a packet arrives for it) before the list of old queues is 395 visited; this can go on indefinitely even with a small number of 396 active flows, if the flow providing packets to the queue in question 397 transmits at just the right rate. This is prevented by first moving 398 the queue to _the end of_ the list of old queues, forcing a pass 399 through that, and thus preventing starvation. Moving it to the end 400 of the list, rather than the front, is crucial for this to work. 402 The resulting migration of queues between the different states is 403 summarised in the following state diagram: 405 +-----------------+ +------------------+ 406 | | Empty | | 407 | Empty |<---------------+ Old +----+ 408 | | | | | 409 +-------+---------+ +------------------+ | 410 | ^ ^ |Credits 411 |Arrival | | |Exhausted 412 v | | | 413 +-----------------+ | | | 414 | | Empty or | | | 415 | New +-------------------+ +-------+ 416 | | Credits exhausted 417 +-----------------+ 419 5. Implementation considerations 421 This section contains implementation details for the FQ-CoDel 422 algorithm. This includes the data structures and parameters used in 423 the Linux implementation, as well as discussion of some required 424 features of the target platform and other considerations. 426 5.1. Data structures 428 The main data structure of FQ-CoDel is the array of queues, which is 429 instantiated with the number of queues specified by the _flows_ 430 parameter at instantiation time. Each queue consists simply of an 431 ordered list of packets with FIFO semantics, two state variables 432 tracking the queue credits and total number of bytes enqueued, and 433 the set of CoDel state variables. Other state variables to track 434 queue statistics can also be included: for instance, the Linux 435 implementation keeps a count of dropped packets. 437 In addition to the queue structures themselves, FQ-CoDel maintains 438 two ordered lists containing references to the subset of queues that 439 are currently active. These are the list of 'new' queues and the 440 list of 'old' queues, as explained in Section 4 above. 442 In the Linux implementation, queue space is shared: there's a global 443 limit on the number of packets the queues can hold, but not one per 444 queue. 446 5.2. Parameters 448 The following are the user configuration parameters exposed by the 449 Linux implementation of FQ-CoDel. 451 5.2.1. Interval 453 The _interval_ parameter has the same semantics as CoDel and is used 454 to ensure that the measured minimum delay does not become too stale. 455 That is, CoDel only reacts to delay experienced in the last epoch of 456 length interval. It SHOULD be set on the order of the worst-case RTT 457 through the bottleneck to give end-points sufficient time to react. 459 The default interval value is 100 ms. 461 5.2.2. Target 463 The _target_ parameter has the same semantics as CoDel. It is the 464 acceptable minimum standing/persistent queue delay for each FQ-CoDel 465 Queue. This minimum delay is identified by tracking the local 466 minimum queue delay that packets experience. 468 The default target value is 5 ms, but this value should be tuned to 469 be at least the transmission time of a single MTU-sized packet at the 470 prevalent egress link speed (which for e.g. 1Mbps and MTU 1500 is 471 ~15ms), to prevent CoDel from being too aggressive at low bandwidths. 472 It should otherwise be set to on the order of 5-10% of the configured 473 interval. 475 5.2.3. Packet limit 477 Routers do not have infinite memory, so some packet limit MUST be 478 enforced. 480 The _limit_ parameter is the hard limit on the real queue size, 481 measured in number of packets. This limit is a global limit on the 482 number of packets in all queues; each individual queue does not have 483 an upper limit. When the limit is reached and a new packet arrives 484 for enqueue, a packet is dropped from the head of the largest queue 485 (measured in bytes) to make room for the new packet. 487 In Linux, the default packet limit is 10240 packets, which is 488 suitable for up to 10GigE speeds. In practice, the hard limit is 489 rarely, if ever, hit, as drops are performed by the CoDel algorithm 490 long before the limit is hit. For platforms that are severely memory 491 constrained, a lower limit can be used. 493 5.2.4. Quantum 495 The _quantum_ parameter is the number of bytes each queue gets to 496 dequeue on each round of the scheduling algorithm. The default is 497 set to 1514 bytes which corresponds to the Ethernet MTU plus the 498 hardware header length of 14 bytes. 500 In TSO-enabled systems, where a "packet" consists of an offloaded 501 packet train, it can presently be as large as 64K bytes. In GRO- 502 enabled systems, up to 17 times the TCP max segment size (or 25K 503 bytes). These mega-packets severely impact FQ-CoDel's ability to 504 schedule traffic, and hurt latency needlessly. There is ongoing work 505 in Linux to make smarter use of offload engines. 507 5.2.5. Flows 509 The _flows_ parameter sets the number of queues into which the 510 incoming packets are classified. Due to the stochastic nature of 511 hashing, multiple flows may end up being hashed into the same slot. 513 This parameter can be set only at load time since memory has to be 514 allocated for the hash table in the current implementation. 516 The default value is 1024 in the current Linux implementation. 518 5.2.6. ECN 520 ECN is _enabled_ by default. Rather than do anything special with 521 misbehaved ECN flows, FQ-CoDel relies on the packet scheduling system 522 to minimise their impact, thus unresponsive packets in a flow being 523 marked with ECN can grow to the overall packet limit, but will not 524 otherwise affect the performance of the system. 526 It can be disabled by specifying the _noecn_ parameter. 528 5.2.7. CE threshold 530 This parameter enables DCTCP-like processing to enable CE marking ECT 531 packets at a lower setpoint than the default codel target. 533 The parameter, _ce_threshold_, is disabled by default and can be set 534 to a number of microseconds to enable. 536 5.3. Probability of hash collisions 538 Since the Linux FQ-CoDel implementation by default uses 1024 hash 539 buckets, the probability that (say) 100 flows will all hash to the 540 same bucket is something like ten to the power of minus 300. Thus, 541 the probability that at least one of the flows will hash to some 542 other queue is very high indeed. 544 Expanding on this, based on analytical equations for hash collision 545 probabilities, for 100 flows, the probability of no collision is 546 90.78%; the probability that no more than two of the 100 flows will 547 be involved in any given collision = 99.57%; and the probability that 548 no more than three of the 100 flows will be involved in any given 549 collision = 99.99%. 551 These probabilities can be improved upon by using set-associative 552 hashing, a technique used in the Cake algorithm currently being 553 developed as a further development upon the FQ-CoDel principles. For 554 a 4-way associative hash with the same number of total queues, the 555 probability of no collisions for 100 flows is 99.93%, while for an 556 8-way associative hash it is ~100%. 558 5.4. Memory Overhead 560 FQ-CoDel can be implemented with a very low memory footprint (less 561 than 64 bytes per queue on 64 bit systems). These are the data 562 structures used in the Linux implementation: 564 565 struct codel_vars { 566 u32 count; 567 u32 lastcount; 568 bool dropping; 569 u16 rec_inv_sqrt; 570 codel_time_t first_above_time; 571 codel_time_t drop_next; 572 codel_time_t ldelay; 573 }; 575 struct fq_codel_flow { 576 struct sk_buff *head; 577 struct sk_buff *tail; 578 struct list_head flowchain; 579 int credits; /* current number of queue credits */ 580 u32 dropped; /* # of drops (or ECN marks) on flow */ 581 struct codel_vars cvars; 582 }; 584 586 The master table managing all queues looks like this: 588 590 struct fq_codel_sched_data { 591 struct tcf_proto *filter_list; /* optional external classifier */ 592 struct fq_codel_flow *flows; /* Flows table [flows_cnt] */ 593 u32 *backlogs; /* backlog table [flows_cnt] */ 594 u32 flows_cnt; /* number of flows */ 595 u32 perturbation; /* hash perturbation */ 596 u32 quantum; /* psched_mtu(qdisc_dev(sch)); */ 597 struct codel_params cparams; 598 struct codel_stats cstats; 599 u32 drop_overlimit; 600 u32 new_flow_count; 602 struct list_head new_flows; /* list of new flows */ 603 struct list_head old_flows; /* list of old flows */ 604 }; 606 608 5.5. Per-Packet Timestamping 610 The CoDel portion of the algorithm requires per-packet timestamps be 611 stored along with the packet. While this approach works well for 612 software-based routers, it may be impossible to retrofit devices that 613 do most of their processing in silicon and lack space or mechanism 614 for timestamping. 616 Also, while perfect resolution is not needed, timestamp resolution 617 below the target is necessary. Furthermore, timestamping functions 618 in the core OS need to be efficient as they are called at least once 619 on each packet enqueue and dequeue. 621 5.6. Other forms of "Fair Queueing" 623 Much of the scheduling portion of FQ-CoDel is derived from DRR and is 624 substantially similar to DRR++. SFQ-based versions have also been 625 produced and tested in ns2. Other forms of Fair Queueing, such as 626 WFQ or QFQ, have not been thoroughly explored, but there's no a 627 priori reason why the round-robin scheduling of FQ-CoDel couldn't be 628 replaced with something else. 630 5.7. Differences between CoDel and FQ-CoDel behaviour 632 CoDel can be applied to a single queue system as a straight AQM, 633 where it converges towards an "ideal" drop rate (i.e. one that 634 minimises delay while keeping a high link utilisation), and then 635 optimises around that control point. 637 The scheduling of FQ-CoDel mixes packets of competing flows, which 638 acts to pace bursty flows to better fill the pipe. Additionally, a 639 new flow gets substantial leeway over other flows until CoDel finds 640 an ideal drop rate for it. However, for a new flow that exceeds the 641 configured quantum, more time passes before all of its data is 642 delivered (as packets from it, too, are mixed across the other 643 existing queue-building flows). Thus, FQ-CoDel takes longer (as 644 measured in time) to converge towards an ideal drop rate for a given 645 new flow, but does so within fewer delivered _packets_ from that 646 flow. 648 Finally, the flow isolation FQ-CoDel provides means that the CoDel 649 drop mechanism operates on the flows actually building queues, which 650 results in packets being dropped more accurately from the largest 651 flows than CoDel alone manages. Additionally, flow isolation 652 radically improves the transient behaviour of the network when 653 traffic or link characteristics change (e.g. when new flows start up 654 or the link bandwidth changes); while CoDel itself can take a while 655 to respond, fq_codel doesn't miss a beat. 657 6. Limitations of flow queueing 659 While FQ-CoDel has been shown in many scenarios to offer significant 660 performance gains, there are some scenarios where the scheduling 661 algorithm in particular is not a good fit. This section documents 662 some of the known cases which either may require tweaking the default 663 behaviour, or where alternatives to flow queueing should be 664 considered. 666 6.1. Fairness between things other than flows 668 In some parts of the network, enforcing flow-level fairness may not 669 be desirable, or some other form of fairness may be more important. 670 An example of this can be an Internet Service Provider that may be 671 more interested in ensuring fairness between customers than between 672 flows. Or a hosting or transit provider that wishes to ensure 673 fairness between connecting Autonomous Systems or networks. Another 674 issue can be that the number of simultaneous flows experienced at a 675 particular link can be too high for flow-based fairness queueing to 676 be effective. 678 Whatever the reason, in a scenario where fairness between flows is 679 not desirable, reconfiguring FQ-CoDel to match on a different 680 characteristic can be a way forward. The implementation in Linux can 681 leverage the packet matching mechanism of the _tc_ subsystem to use 682 any available packet field to partition packets into virtual queues, 683 to for instance match on address or subnet source/destination pairs, 684 application layer characteristics, etc. 686 Furthermore, as commonly deployed today, FQ-CoDel is used with three 687 or more tiers of classification: priority, best effort and 688 background, based on diffserv markings. Some products do more 689 detailed classification, including deep packet inspection and 690 destination-specific filters to achieve their desired result. 692 6.2. Flow bunching by opaque encapsulation 694 Where possible, FQ-CoDel will attempt to decapsulate packets before 695 matching on the header fields for the flow hashing. However, for 696 some encapsulation techniques, most notably encrypted VPNs, this is 697 not possible. If several flows are bunched into one such 698 encapsulated tunnel, they will be seen as one flow by the FQ-CoDel 699 algorithm. This means that they will share a queue, and drop 700 behaviour, and so flows inside the encapsulation will not benefit 701 from the implicit prioritisation of FQ-CoDel, but will continue to 702 benefit from the reduced overall queue length from the CoDel 703 algorithm operating on the queue. In addition, when such an 704 encapsulated bunch competes against other flows, it will count as one 705 flow, and not assigned a share of the bandwidth based on how many 706 flows are inside the encapsulation. 708 Depending on the application, this may or may not be desirable 709 behaviour. In cases where it is not, changing FQ-CoDel's matching to 710 not be flow-based (as detailed in the previous subsection above) can 711 be a mitigation. Going forward, having some mechanism for opaque 712 encapsulations to express to the outer layer which flow a packet 713 belongs to, could be a way to mitigate this. 715 6.3. Low-priority congestion control algorithms 717 In the presence of queue management schemes that contain latency 718 under load, low-priority congestion control algorithms such as LEDBAT 719 [RFC6817] (or, in general, algorithms that try to voluntarily use up 720 less than their fair share of bandwidth) experiences very little 721 added latency when the link is congested. Thus, they lack the signal 722 to back off that added latency previously afforded them. This effect 723 is seen with FQ-CoDel as well as with any effective AQM [GONG2014]. 725 As such, these delay-based algorithms tend to revert to loss-based 726 congestion control, and will consume the fair share of bandwidth 727 afforded to them by the FQ-CoDel scheduler. However, low-priority 728 congestion control mechanisms may be able to take steps to continue 729 to be low priority, for instance by taking into account the vastly 730 reduced level of delay afforded by an AQM, or by using a coupled 731 approach to observing the behaviour of multiple flows. 733 7. Deployment status and future work 735 The FQ-CoDel algorithm as described in this document has been shipped 736 as part of the Linux kernel since version 3.5, released on the 21st 737 of July, 2012, with the ce_threshold being added in version 4.2. The 738 algorithm has seen widespread testing in a variety of contexts and is 739 configured as the default queueing discipline in a number of mainline 740 Linux distributions (as of this writing at least OpenWRT, Arch Linux 741 and Fedora). We believe it to be a safe default and encourage people 742 running Linux to turn it on: It is a massive improvement over the 743 previous default FIFO queue. 745 Of course there is always room for improvement, and this document has 746 listed some of the know limitations of the algorithm. As such, we 747 encourage further research into algorithm refinements and addressing 748 of limitations. One such effort is undertaken by the bufferbloat 749 community in the form of the Cake queue management scheme (see 750 http://www.bufferbloat.net/projects/codel/wiki/Cake). In addition to 751 this we believe the following (non-exhaustive) list of issues to be 752 worthy of further enquiry: 754 o Variations on the flow classification mechanism to fit different 755 notions of flows. For instance, an ISP might want to deploy per- 756 subscriber scheduling, while in other cases several flows can 757 share a 5-tuple, as exemplified by the RTCWEB QoS recommendations 758 [RTCWEB-QOS]. 760 o Interactions between flow queueing and delay-based congestion 761 control algorithms and scavenger protocols. 763 o Other scheduling mechanisms to replace the DRR portion of the 764 algorithm, e.g. QFQ or WFQ. 766 o Sensitivity of parameters, most notably the number of queues and 767 the CoDel parameters. 769 8. Security Considerations 771 There are no specific security exposures associated with FQ-CoDel 772 that are not also present in current FIFO systems. And some are in 773 fact reduced (e.g. simple minded packet floods). However, some care 774 is needed in the implementation to ensure this is the case. These 775 are included in the description above, however we reiterate them 776 here: 778 o To prevent packets in the new queues from starving old queues, it 779 is important that when a queue on the list of new queues empties, 780 it is moved to _the end of_ the list of old queues. This is 781 described at the end of Section 4.2. 783 o To prevent an attacker targeting a specific flow for a denial of 784 service attack, the hash that maps packets to queues should not be 785 predictable. To achieve this, FQ-CoDel salts the hash, as 786 described in the beginning of Section 4.1. The size of the salt 787 and the strength of the hash function is obviously a tradeoff 788 between performance and security. The Linux implementation uses a 789 32 bit random value as the salt and a Jenkins hash function. This 790 makes it possible to achieve very high throughput, and we consider 791 it sufficient to ward off the most obvious attacks. 793 o Packet fragments without a layer 4 header can be hashed into 794 different bins than the first fragment with the header intact. 795 This can cause reordering and/or adversely affect the performance 796 of the flow. Keeping state to match the fragments to the 797 beginning of the packet, or simply putting all fragmented packets 798 into the same queue, are two ways to alleviate this. 800 9. IANA Considerations 802 This document has no actions for IANA. 804 10. Acknowledgements 806 Our deepest thanks to Kathie Nichols, Van Jacobson, and all the 807 members of the bufferbloat.net effort for all the help on developing 808 and testing the algorithm. In addition, our thanks to Anil Agarwal 809 for his help with getting the hash collision probabilities in this 810 document right. 812 11. Conclusions 814 FQ-CoDel is a very general, efficient, nearly parameterless queue 815 management approach combining flow queueing with CoDel. It is a very 816 powerful tool for solving bufferbloat, and we believe it to be safe 817 to turn on by default, as has already happened in a number of Linux 818 distributions. In this document we have documented the Linux 819 implementation in sufficient detail for an independent 820 implementation, and we encourage such implementations be widely 821 deployed. 823 12. References 825 12.1. Normative References 827 [CODELDRAFT] 828 Nichols, K., Jacobson, V., McGregor, A., and J. Iyengar, 829 "Controlled Delay Active Queue Management", October 2014, 830 . 832 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 833 Requirement Levels", BCP 14, RFC 2119, 834 DOI 10.17487/RFC2119, March 1997, 835 . 837 [RFC6817] Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, 838 "Low Extra Delay Background Transport (LEDBAT)", RFC 6817, 839 DOI 10.17487/RFC6817, December 2012, 840 . 842 [RTCWEB-QOS] 843 Dhesikan, S., Jennings, C., Druta, D., Jones, P., and J. 844 Polk, "DSCP and other packet markings for RTCWeb QoS", 845 December 2014, . 848 12.2. Informative References 850 [CODEL2012] 851 Nichols, K. and V. Jacobson, "Controlling Queue Delay", 852 July 2012, . 854 [DRR] Shreedhar, M. and G. Varghese, "Efficient Fair Queueing 855 Using Deficit Round Robin", June 1996, 856 . 859 [DRRPP] MacGregor, M. and W. Shi, "Deficits for Bursty Latency- 860 critical Flows: DRR++", 2000, 861 . 864 [GONG2014] 865 Gong, Y., Rossi, D., Testa, C., Valenti, S., and D. Taht, 866 "Fighting the bufferbloat: on the coexistence of AQM and 867 low priority congestion control", July 2014, 868 . 871 [SQF2012] Bonald, T., Muscariello, L., and N. Ostallo, "On the 872 impact of TCP and per-flow scheduling on Internet 873 Performance - IEEE/ACM transactions on Networking", April 874 2012, . 877 Authors' Addresses 879 Toke Hoeiland-Joergensen 880 Karlstad University 881 Dept. of Computer Science 882 Karlstad 65188 883 Sweden 885 Email: toke.hoiland-jorgensen@kau.se 887 Paul McKenney 888 IBM Linux Technology Center 889 1385 NW Amberglen Parkway 890 Hillsboro, OR 97006 891 USA 893 Email: paulmck@linux.vnet.ibm.com 894 URI: http://www2.rdrop.com/~paulmck/ 895 Dave Taht 896 Teklibre 897 2104 W First street 898 Apt 2002 899 FT Myers, FL 33901 900 USA 902 Email: dave.taht@gmail.com 903 URI: http://www.teklibre.com/ 905 Jim Gettys 906 21 Oak Knoll Road 907 Carlisle, MA 993 908 USA 910 Email: jg@freedesktop.org 911 URI: https://en.wikipedia.org/wiki/Jim_Gettys 913 Eric Dumazet 914 Google, Inc. 915 1600 Amphitheater Pkwy 916 Mountain View, CA 94043 917 USA 919 Email: edumazet@gmail.com