idnits 2.17.1 draft-hoeiland-joergensen-aqm-fq-codel-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document is more than 15 pages and seems to lack a Table of Contents. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There is 1 instance of too long lines in the document, the longest one being 8 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 490 has weird spacing: '...st_head flowc...' == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (November 10, 2014) is 3454 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '1' on line 719 == Unused Reference: 'RFC2119' is defined on line 671, but no explicit reference was found in the text == Unused Reference: 'RFC0896' is defined on line 674, but no explicit reference was found in the text == Unused Reference: 'RFC0970' is defined on line 677, but no explicit reference was found in the text == Unused Reference: 'RFC2309' is defined on line 680, but no explicit reference was found in the text == Unused Reference: 'SFQ' is defined on line 694, but no explicit reference was found in the text ** Obsolete normative reference: RFC 896 (Obsoleted by RFC 7805) ** Obsolete normative reference: RFC 2309 (Obsoleted by RFC 7567) Summary: 4 errors (**), 0 flaws (~~), 8 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 AQM working group T. Hoeiland-Joergensen 3 Internet-Draft Karlstad University 4 Intended status: Informational P. McKenney 5 Expires: May 14, 2015 IBM Linux Technology Center 6 D. Taht 7 Teklibre 8 J. Gettys 9 E. Dumazet 10 Google, Inc. 11 November 10, 2014 13 FlowQueue-Codel 14 draft-hoeiland-joergensen-aqm-fq-codel-01 16 Abstract 18 This memo presents the FQ-CoDel hybrid packet scheduler/AQM 19 algorithm, a critical tool for fighting bufferbloat and reducing 20 latency across the Internet. 22 FQ-CoDel mixes packets from multiple flows and reduces the impact of 23 head of line blocking from bursty traffic. It provides isolation for 24 low-rate traffic such as DNS, web, and videoconferencing traffic. It 25 improves utilisation across the networking fabric, especially for 26 bidirectional traffic, by keeping queue lengths short; and it can be 27 implemented in a memory- and CPU-efficient fashion across a wide 28 range of hardware. 30 Status of This Memo 32 This Internet-Draft is submitted in full conformance with the 33 provisions of BCP 78 and BCP 79. 35 Internet-Drafts are working documents of the Internet Engineering 36 Task Force (IETF). Note that other groups may also distribute 37 working documents as Internet-Drafts. The list of current Internet- 38 Drafts is at http://datatracker.ietf.org/drafts/current/. 40 Internet-Drafts are draft documents valid for a maximum of six months 41 and may be updated, replaced, or obsoleted by other documents at any 42 time. It is inappropriate to use Internet-Drafts as reference 43 material or to cite them other than as "work in progress." 45 This Internet-Draft will expire on May 14, 2015. 47 Copyright Notice 49 Copyright (c) 2014 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents 54 (http://trustee.ietf.org/license-info) in effect on the date of 55 publication of this document. Please review these documents 56 carefully, as they describe your rights and restrictions with respect 57 to this document. Code Components extracted from this document must 58 include Simplified BSD License text as described in Section 4.e of 59 the Trust Legal Provisions and are provided without warranty as 60 described in the Simplified BSD License. 62 1. Introduction 64 The FQ-CoDel algorithm is a combined packet scheduler and AQM 65 developed as part of the bufferbloat-fighting community effort. It 66 is based on a modified Deficit Round Robin (DRR) queue scheduler, 67 with the CoDel AQM algorithm operating on each sub-queue. This 68 document describes the combined algorithm; reference implementations 69 are available for ns2 and ns3 and it is included in the mainline 70 Linux kernel as the FQ-CoDel queueing discipline. 72 The rest of this document is structured as follows: This section 73 gives some concepts and terminology used in the rest of the document, 74 and gives a short informal summary of the FQ-CoDel algorithm. 75 Section 2 gives an overview of the CoDel algorithm. Section 3 covers 76 the DRR portion. Section 4 defines the parameters and data 77 structures employed by FQ-CoDel. Section 5 describes the working of 78 the algorithm in detail. Section 6 describes implementation 79 considerations, and section 7 lists some of the limitations of using 80 flow queueing. Finally section 11 concludes. 82 1.1. Terminology and concepts 84 Flow: A flow is typically identified by a 5-tuple of source IP, 85 destination IP, source port, destination port, and protocol. It can 86 also be identified by a superset or subset of those parameters, or by 87 mac address, or other means. 89 Queue: A queue of packets represented internally in FQ-CoDel. In 90 most instances each flow gets its own queue; however because of the 91 possibility of hash collisions, this is not always the case. In an 92 attempt to avoid confusion, the word 'queue' is used to refer to the 93 internal data structure, and 'flow' to refer to the actual stream of 94 packets being delivered to the FQ-CoDel algorithm. 96 Scheduler: A mechanism to select which queue a packet is dequeued 97 from. 99 CoDel AQM: The Active Queue Management algorithm employed by FQ- 100 CoDel. 102 DRR: Deficit round-robin scheduling. 104 Quantum: The maximum amount of bytes to be dequeued from a queue at 105 once. 107 1.2. Informal summary of FQ-CoDel 109 FQ-CoDel is a _hybrid_ of DRR [DRR] and CODEL [CODEL2012], with an 110 optimisation for sparse flows similar to SQF [SQF2012] and DRR++ 111 [DRRPP]. We call this "Flow Queueing" rather than "Fair Queueing" as 112 flows that build a queue are treated differently than flows that do 113 not. 115 FQ-CoDel stochastically classifies incoming packets into different 116 sub-queues by hashing the 5-tuple of IP protocol number and source 117 and destination IP and port numbers, perturbed with a random number 118 selected at initiation time (although other flow classification 119 schemes can optionally be configured instead). Each queue is managed 120 by the CoDel queueing discipline. Packet ordering within a queue is 121 preserved, since queues have FIFO ordering. 123 The FQ-CoDel algorithm consists of two logical parts: the scheduler 124 which selects which queue to dequeue a packet from, and the CoDel AQM 125 which works on each of the queues. The subtleties of FQ-CoDel are 126 mostly in the scheduling part, whereas the interaction between the 127 scheduler and the CoDel algorithm are fairly straight forward: 129 At initialisation, each queue is set up to have a separate set of 130 CoDel state variables. By default, 1024 queues are created. The 131 current implementation supports anywhere from one to 64K separate 132 queues, and each queue maintains the state variables throughout its 133 lifetime, and so acts the same as the non-FQ CoDel variant would. 134 This means that with only one queue, FQ-CoDel behaves essentially the 135 same as CoDel by itself. 137 On dequeue, FQ-CoDel selects a queue from which to dequeue by a two- 138 tier round-robin scheme, in which each queue is allowed to dequeue up 139 to a configurable quantum of bytes for each iteration. Deviations 140 from this quantum is maintained as a deficit for the queue, which 141 serves to make the fairness scheme byte-based rather than a packet- 142 based. The two-tier round-robin mechanism distinguishes between 143 "new" queues (which don't build up a standing queue) and "old" 144 queues, that have queued enough data to be around for more than one 145 iteration of the round-robin scheduler. 147 This new/old queue distinction has a particular consequence for 148 queues that don't build up more than a quantum of bytes before being 149 visited by the scheduler: Such queues are removed from the list, and 150 then re-added as a new queue each time a packet arrives for it, and 151 so will get priority over queues that do not empty out each round 152 (except for a minor modification to protect against starvation, 153 detailed below). Exactly how much data a flow has to send to keep 154 its queue in this state is somewhat difficult to reason about, 155 because it depends on both the egress link speed and the number of 156 concurrent flows. However, in practice many things that are 157 beneficial to have prioritised for typical internet use (ACKs, DNS 158 lookups, interactive SSH, HTTP requests, ARP, ICMP, VoIP) _tend_ to 159 fall in this category, which is why FQ-CoDel performs so well for 160 many practical applications. However, the implicitness of the 161 prioritisation means that for applications that require guaranteed 162 priority (for instance multiplexing the network control plane over 163 the network itself), explicit classification is still needed. 165 This scheduling scheme has some subtlety to it, which is explained in 166 detail in the remainder of this document. 168 2. CoDel 170 CoDel is described in the the ACM Queue paper, CODEL [CODEL2012], and 171 Van Jacobson's IETF84 presentation CODELDRAFT [CODELDRAFT]. The 172 basic idea is to control queue length, maintaining sufficient 173 queueing to keep the outgoing link busy, but avoiding building up the 174 queue beyond that point. This is done by preferentially dropping 175 packets that remain in the queue for "too long". 177 When each new packet arrives, its arrival time is stored with it. 178 Later, when it is that packet's turn to be dequeued, CoDel computes 179 its sojourn time (the current time minus the arrival time). If the 180 sojourn time for packets being dequeued exceeds the _target_ time for 181 a time period of at least the (current value of) _interval_, one or 182 more packets will be dropped (or marked, if ECN is enabled) in order 183 to signal the source endpoint to reduce its send rate. If the 184 sojourn still remains above the target time, the value of interval be 185 lowered, and additional packet drops will occur on a schedule 186 computed from an inverse-square-root control law until either (1) the 187 queue becomes empty or (2) a packet is encountered with a sojourn 188 time that is less than the target time. Upon exiting the dropping 189 mode, CoDel caches the last calculated interval (applying varying 190 amounts of hysteresis to it), to be used as the starting point on 191 subsequent re-entries into dropping mode. 193 The _target_ time is normally set to about five milliseconds, and the 194 initial _interval_ is normally set to about 100 milliseconds. This 195 approach has proven to be quite effective in a wide variety of 196 situations. 198 CoDel drops packets at the head of a queue, rather than at the tail. 200 3. Flow Queueing 202 FQ-CoDel's DRR scheduler is byte-based, employing a deficit round- 203 robin mechanism between queues. This works by keeping track of the 204 current byte _deficit_ of each queue. This deficit is initialised to 205 the configurable quantum; each time a queue gets a dequeue 206 opportunity, it gets to dequeue packets, decreasing the deficit by 207 the packet size for each packet, until the deficit runs into the 208 negative, at which point it is increased by one quantum, and the 209 dequeue opportunity ends. 211 This means that if one queue contains packets of size quantum/3, and 212 another contains quantum-sized packets, the first queue will dequeue 213 three packets each time it gets a turn, whereas the second only 214 dequeues one. This means that flows that send small packets are not 215 penalised by the difference in packet sizes; rather, the DRR scheme 216 approximates a (single-)byte-based fairness queueing. The size of 217 the quantum determines the scheduling granularity, with the tradeoff 218 from too small a quantum being scheduling overhead. For small 219 bandwidths, lowering the quantum from the default MTU size can be 220 advantageous. 222 Unlike DRR there are two sets of flows - a "new" list for flows that 223 have not built a queue recently, and an "old" list for flow-building 224 queues. 226 4. FQ-CoDel Parameters and Data Structures 228 This section goes into the parameters and data structures in FQ- 229 CoDel. 231 4.1. Parameters 233 4.1.1. Interval 235 The _interval_ parameter has the same semantics as CoDel and is used 236 to ensure that the measured minimum delay does not become too stale. 237 The minimum delay MUST be experienced in the last epoch of length 238 interval. It SHOULD be set on the order of the worst-case RTT 239 through the bottleneck to give end-points sufficient time to react. 241 The default interval value is 100 ms. 243 4.1.2. Target 245 The _target_ parameter has the same semantics as CoDel. It is the 246 acceptable minimum standing/persistent queue delay for each FQ-CoDel 247 Queue. This minimum delay is identified by tracking the local 248 minimum queue delay that packets experience. 250 The default target value is 5 ms, but this value SHOULD be tuned to 251 be at least the transmission time of a single MTU-sized packet at the 252 prevalent egress link speed (which for e.g. 1Mbps and MTU 1500 is 253 ~15ms). It should otherwise be set to on the order of 5-10% of the 254 configured interval. 256 4.1.3. Packet limit 258 Routers do not have infinite memory, so some packet limit MUST be 259 enforced. 261 The _limit_ parameter is the hard limit on the real queue size, 262 measured in number of packets. This limit is a global limit on the 263 number of packets in all queues; each individual queue does not have 264 an upper limit. When the limit is reached and a new packet arrives 265 for enqueue, a packet is dropped from the head of the largest queue 266 (measured in bytes) to make room for the new packet. 268 The default packet limit is 10240 packets, which is suitable for up 269 to 10GigE speeds. In practice, the hard limit is rarely, if ever, 270 hit, as drops are performed by the CoDel algorithm long before the 271 limit is hit. For platforms that are severely memory constrained, a 272 lower limit can be used. 274 4.1.4. Quantum 276 The _quantum_ parameter is the number of bytes each queue gets to 277 dequeue on each round of the scheduling algorithm. The default is 278 set to 1514 bytes which corresponds to the Ethernet MTU plus the 279 hardware header length of 14 bytes. 281 In TSO-enabled systems, where a "packet" consists of an offloaded 282 packet train, it can presently be as large as 64K bytes. In GRO- 283 enabled systems, up to 17 times the TCP max segment size (or 25K 284 bytes). 286 4.1.5. Flows 288 The _flows_ parameter sets the number of sub-queues into which the 289 incoming packets are classified. Due to the stochastic nature of 290 hashing, multiple flows may end up being hashed into the same slot. 292 This parameter can be set only at load time since memory has to be 293 allocated for the hash table in the current implementation. 295 The default value is 1024. 297 4.1.6. ECN 299 ECN is _enabled_ by default. Rather than do anything special with 300 misbehaved ECN flows, FQ-CoDel relies on the packet scheduling system 301 to minimise their impact, thus unresponsive packets in a flow being 302 marked with ECN can grow to the overall packet limit, but will not 303 otherwise affect the performance of the system. 305 It can be disabled by specifying the _noecn_ parameter. 307 4.2. Data structures 309 4.2.1. Internal sub-queues 311 The main data structure of FQ-CoDel is the array of sub-queues, which 312 is instantiated to the number of queues specified by the _flows_ 313 parameter at instantiation time. Each sub-queue consists simply of 314 an ordered list of packets with FIFO semantics, two state variables 315 tracking the queue deficit and total number of bytes enqueued, and 316 the set of CoDel state variables. Other state variables to track 317 queue statistics can also be included: for instance, the Linux 318 implementation keeps a count of dropped packets. 320 Queue space is shared: there's a global limit on the number of 321 packets the queues can hold, but not one per queue. 323 4.2.2. New and old queues lists 325 FQ-CoDel maintains two lists of active queues, called "new" and "old" 326 queues. Each list is an ordered list containing references to the 327 array of sub-queues. When a packet is added to a queue that is not 328 currently active, that queue becomes active by being added to the 329 list of new queues. Later on, it is moved to the list of old queues, 330 from which it is removed when it is no longer active. This behaviour 331 is the source of some subtlety in the packet scheduling at dequeue 332 time, explained below. 334 5. The FQ-CoDel scheduler and AQM interactions 336 This section describes the operation of the FQ-CoDel scheduler and 337 AQM. It is split into two parts explaining the enqueue and dequeue 338 operations. 340 5.1. Enqueue 342 The packet enqueue mechanism consists of three stages: classification 343 into a sub-queue, timestamping and bookkeeping, and optionally 344 dropping a packet when the total number of enqueued packets goes over 345 the maximum. 347 When a packet is enqueued, it is first classified into the 348 appropriate sub-queue. By default, this is done by hashing on the 349 5-tuple of IP protocol, and source and destination IP and port 350 numbers, permuted by a random value selected at initialisation time, 351 and taking the hash value modulo the number of sub-queues. However, 352 an implementation MAY also specify a configurable classification 353 scheme along a wide variety of other possible parameters such as mac 354 address, diffserv, firewall and flow specific markings, etc. (the 355 Linux implementation does so in the form of the 'tc filter' command). 357 If a custom filter fails, classification failure results in the 358 packet being dropped and no further action taken. By design the 359 standard filter cannot fail. 361 Additionally, the default hashing algorithm presently deployed does 362 decapsulation of some common packet types (6in4, IPIP, GRE 0), mixes 363 IPv6 IP addresses thoroughly, and uses Jenkins hash on the result. 365 Once the packet has been successfully classified into a sub-queue, it 366 is handed over to the CoDel algorithm for timestamping. It is then 367 added to the tail of the selected queue, and the queue's byte count 368 is updated by the packet size. Then, if the queue is not currently 369 active (i.e. if it is not in either the list of new or the list of 370 old queues), it is added to the end of the list of new queues, and 371 its deficit is initiated to the configured quantum. Otherwise it is 372 added to the old queue list. 374 Finally, the total number of enqueued packets is compared with the 375 configured limit, and if it is _above_ this value (which can happen 376 since a packet was just enqueued), a packet is dropped from the head 377 of the queue with the largest current byte count. Note that this in 378 most cases means that the packet that gets dropped is different from 379 the one that was just enqueued, and may even be from a different 380 queue. 382 5.2. Dequeue 384 Most of FQ-CoDel's work is done at packet dequeue time. It consists 385 of three parts: selecting a queue from which to dequeue a packet, 386 actually dequeuing it (employing the CoDel algorithm in the process), 387 and some final bookkeeping. 389 For the first part, the scheduler first looks at the list of new 390 queues; for each queue in that list, if that queue has a negative 391 deficit (i.e. it has already dequeued at least a quantum of bytes), 392 its deficit is increased by one quantum, and the queue is put onto 393 the end of the list of old queues, and the routine selects the next 394 queue and starts again. 396 Otherwise, that queue is selected for dequeue. If the list of new 397 queues is empty, the scheduler proceeds down the list of old queues 398 in the same fashion (checking the deficit, and either selecting the 399 queue for dequeuing, or increasing the deficit and putting the queue 400 back at the end of the list). 402 After having selected a queue from which to dequeue a packet, the 403 CoDel algorithm is invoked on that queue. This applies the CoDel 404 control law, and may discard one or more packets from the head of 405 that queue, before returning the packet that should be dequeued (or 406 nothing if the queue is or becomes empty while being handled by the 407 CoDel algorithm). 409 Finally, if the CoDel algorithm did not return a packet, the queue is 410 empty, and the scheduler does one of two things: if the queue 411 selected for dequeue came from the list of new queues, it is moved to 412 the end of the list of old queues. If instead it came from the list 413 of old queues, that queue is removed from the list, to be added back 414 (as a new queue) the next time a packet arrives that hashes to that 415 queue. Then (since no packet was available for dequeue), the whole 416 dequeue process is restarted from the beginning. 418 If, instead, the scheduler _did_ get a packet back from the CoDel 419 algorithm, it updates the byte deficit for the selected queue before 420 returning the packet as the result of the dequeue operation. 422 The step that moves an empty queue from the list of new queues to the 423 list of old queues before it is removed is crucial to prevent 424 starvation. Otherwise the queue could reappear (the next time a 425 packet arrives for it) before the list of old queues is visited; this 426 can go on indefinitely even with a small number of active flows, if 427 the flow providing packets to the queue in question transmits at just 428 the right rate. This is prevented by first moving the queue to the 429 list of old queues, forcing a pass through that, and thus preventing 430 starvation. 432 The resulting migration of queues between the different states is 433 summarised in the following state diagram: 435 +-----------------+ +--------------------+ 436 | | Empty | | 437 | Empty |<---------------+ Old +-----+ 438 | | | | | 439 +-------+---------+ +--------------------+ | 440 | ^ ^ |Quantum 441 |Arrival | | |Exceeded 442 v | | | 443 +-----------------+ | | | 444 | | Empty or | | | 445 | New +-------------------+ +--------+ 446 | | Quantum exceeded 447 +-----------------+ 449 6. Implementation considerations 451 6.1. Probability of hash collisions 453 Since the Linux FQ-CoDel implementation by default uses 1024 hash 454 buckets, the probability that (say) 100 VoIP sessions will all hash 455 to the same bucket is something like ten to the power of minus 300. 456 Thus, the probability that at least one of the VoIP sessions will 457 hash to some other queue is very high indeed. 459 Conversely, the probability that each of the 100 VoIP sessions will 460 get its own queue is given by (1023!/(924!*1024^99)) or about 0.007; 461 so not all that probable. The probability rises sharply, however, if 462 we are willing to accept a few collisions. For example, there is 463 about an 86% probability that no more than two of the 100 VoIP 464 sessions will be involved in any given collision, and about a 99% 465 probability that no more than three of the VoIP sessions will be 466 involved in any given collision. These last two results were 467 computed using Monte Carlo simulations: Oddly enough, the mathematics 468 for VoIP-session collision exactly matches that of hardware cache 469 overflow. 471 6.2. Memory Overhead 473 FQ-CoDel can be implemented with a very low memory footprint (less 474 than 64 bytes per queue on 64 bit systems). These are the data 475 structures used in the Linux implementation: 477 struct codel_vars { 478 u32 count; 479 u32 lastcount; 480 bool dropping; 481 u16 rec_inv_sqrt; 482 codel_time_t first_above_time; 483 codel_time_t drop_next; 484 codel_time_t ldelay; 485 }; 487 struct fq_codel_flow { 488 struct sk_buff *head; 489 struct sk_buff *tail; 490 struct list_head flowchain; 491 int deficit; 492 u32 dropped; /* number of drops (or ECN marks) on this flow */ 493 struct codel_vars cvars; 494 }; 496 The master table managing all queues looks like this: 498 struct fq_codel_sched_data { 499 struct tcf_proto *filter_list; /* optional external classifier */ 500 struct fq_codel_flow *flows; /* Flows table [flows_cnt] */ 501 u32 *backlogs; /* backlog table [flows_cnt] */ 502 u32 flows_cnt; /* number of flows */ 503 u32 perturbation; /* hash perturbation */ 504 u32 quantum; /* psched_mtu(qdisc_dev(sch)); */ 505 struct codel_params cparams; 506 struct codel_stats cstats; 507 u32 drop_overlimit; 508 u32 new_flow_count; 510 struct list_head new_flows; /* list of new flows */ 511 struct list_head old_flows; /* list of old flows */ 512 }; 514 6.3. Per-Packet Timestamping 516 The CoDel portion of the algorithm requires per-packet timestamps be 517 stored along with the packet. While this approach works well for 518 software-based routers, it may be impossible to retrofit devices that 519 do most of their processing in silicon and lack space or mechanism 520 for timestamping. 522 Also, while perfect resolution is not needed, timestamping functions 523 in the core OS need to be efficient as they are called at least once 524 on each packet enqueue and dequeue. 526 6.4. Other forms of "Fair Queueing" 528 Much of the scheduling portion of FQ-CoDel is derived from DRR and is 529 substantially similar to DRR++. SFQ-based versions have also been 530 produced and tested in ns2. Other forms of Fair Queueing, such as 531 WFQ or QFQ, have not been thoroughly explored. 533 6.5. Differences between CoDel and FQ-CoDel behaviour 535 CoDel can be applied to a single queue system as a straight AQM, 536 where it converges towards an "ideal" drop rate (i.e. one that 537 minimises delay while keeping a high link utilisation), and then 538 optimises around that control point. 540 The scheduling of FQ-CoDel mixes packets of competing flows, which 541 acts to pace bursty flows to better fill the pipe. Additionally, a 542 new flow gets substantial "credit" over other flows until CoDel finds 543 an ideal drop rate for it. However, for a new flow that exceeds the 544 configured quantum, more time passes before all of its data is 545 delivered (as packets from it, too, are mixed across the other 546 existing queue-building flows). Thus, FQ-CoDel takes longer (as 547 measured in time) to converge towards an ideal drop rate for a given 548 new flow, but does so within fewer delivered _packets_ from that 549 flow. 551 Finally, the flow isolation FQ-CoDel provides means that the CoDel 552 drop mechanism operates on the flows actually building queues, which 553 results in packets being dropped more accurately from the largest 554 flows than CoDel alone manages. Additionally, flow isolation 555 radically improves the transient behaviour of the network when 556 traffic or link characteristics change (e.g. when new flows start up 557 or the link bandwidth changes); while CoDel itself can take a while 558 to respond, fq_codel doesn't miss a beat. 560 7. Limitations of flow queueing 562 While FQ-CoDel has been shown in many scenarios to offer significant 563 performance gains, there are some scenarios where the scheduling 564 algorithm in particular is not a good fit. This section documents 565 some of the known cases which either may require tweaking the default 566 behaviour, or where alternatives to flow queueing should be 567 considered. 569 7.1. Fairness between things other than flows 571 In some parts of the network, enforcing flow-level fairness may not 572 be desirable, or some other level of fairness may be more important. 573 An example of this can be an Internet Service Provider that may be 574 more interested in ensuring fairness between customers than between 575 flows. Or a hosting or transit provider that wishes to ensure 576 fairness between connecting Autonomous Systems or networks. Another 577 issue can be that the number of simultaneous flows experienced at a 578 particular link can be too high for flow-based fairness queueing to 579 be effective. 581 Whatever the reason, in a scenario where fairness between flows is 582 not desirable, reconfiguring FQ-CoDel to match on a different 583 characteristic can be a way forward. The implementation in Linux can 584 leverage the powerful packet matching mechanism of the _tc_ subsystem 585 to use any available packet field to partition packets into virtual 586 queues, to for instance match on address or subnet source/destination 587 pairs, application layer characteristics, etc. 589 Furthermore, as commonly deployed today, FQ-CoDel is used with three 590 or more tiers of classification: priority, best effort and 591 background, based on diffserv markings. Some products do more 592 detailed classification, including deep packet inspection and 593 destination-specific filters to achieve their desired result. 595 7.2. Flow bunching by opaque encapsulation 597 Where possible, FQ-CoDel will attempt to decapsulate packets before 598 matching on the header fields for the flow hashing. However, for 599 some encapsulation techniques, most notably encrypted VPNs, this is 600 not possible. If several flows are bunched into one such 601 encapsulated tunnel, they will be seen as one flow by the FQ-CoDel 602 algorithm. This means that they will share a queue, and drop 603 behaviour, and so flows inside the encapsulation will not benefit 604 from the implicit prioritisation of FQ-CoDel, but will continue to 605 benefit from the reduced overall queue length from the CoDel 606 algorithm operating on the queue. In addition, when such an 607 encapsulated bunch competes against other flows, it will count as one 608 flow, and not assigned a share of the bandwidth based on how many 609 flows are inside the encapsulation. 611 Depending on the application, this may or may not be desirable 612 behaviour. In cases where it is not, changing FQ-CoDel's matching to 613 not be flow-based (as detailed in the previous subsection above) can 614 be a way to mitigate this. 616 7.3. Low-priority congestion control algorithms 618 Because of the flow isolation that FQ-CoDel provides, low-priority 619 congestion control algorithms (or, in general, algorithms that try to 620 voluntarily use up less than their fair share of bandwidth) can be 621 re-prioritised. Because a flow experiences very little added latency 622 when the link is congested, such algorithms lack the signal to back 623 off that added latency previously afforded them. As such, existing 624 algorithms tend to revert to loss-based congestion control, and will 625 consume the fair share of bandwidth afforded to them by the FQ-CoDel 626 scheduler. However, low-priority congestion control mechanisms may 627 be able to take steps to continue to be low priority, for instance by 628 taking into account the vastly reduced level of delay afforded by an 629 AQM, or by using a coupled approach to observing the behaviour of 630 multiple flows. 632 8. Security Considerations 634 There are no specific security exposures associated with FQ-CoDel. 635 Some exposures present in current FIFO systems are in fact reduced 636 (e.g. simple minded packet floods). 638 9. IANA Considerations 640 This document has no actions for IANA. 642 10. Acknowledgements 644 Our deepest thanks to Eric Dumazet (author of FQ-CoDel), Kathie 645 Nichols, Van Jacobson, and all the members of the bufferbloat.net 646 effort. 648 11. Conclusions 650 FQ-CoDel is a very general, efficient, nearly parameterless active 651 queue management approach combining flow queueing with CoDel. It is 652 a critical tool in solving bufferbloat. 654 FQ-CoDel's default settings SHOULD be modified for other special- 655 purpose networking applications, such as for exceptionally slow 656 links, for use in data centres, or on links with inherent delay 657 greater than 800ms (e.g. satellite links). 659 On-going projects are: improving FQ-CoDel with more SFQ-like 660 behaviour for lower bandwidth systems, improving the control law, 661 optimising sparse packet drop behaviour, etc.. 663 In addition to the Linux kernel sources, ns2 and ns3 models are 664 available. Refinements (such as NFQCODEL [1]) are being tested in 665 the CeroWrt effort. 667 12. References 669 12.1. Normative References 671 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 672 Requirement Levels", BCP 14, RFC 2119, March 1997. 674 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 675 RFC 896, January 1984. 677 [RFC0970] Nagle, J., "On packet switches with infinite storage", RFC 678 970, December 1985. 680 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 681 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 682 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 683 S., Wroclawski, J., and L. Zhang, "Recommendations on 684 Queue Management and Congestion Avoidance in the 685 Internet", RFC 2309, April 1998. 687 [CODELDRAFT] 688 Nichols, K., Jacobson, V., McGregor, A., and J. Iyengar, 689 "Controlling Queue Delay", October 2014, 690 . 692 12.2. Informative References 694 [SFQ] McKenney, P., "Stochastic Fairness Queuing", September 695 1990, . 698 [CODEL2012] 699 Nichols, K. and V. Jacobson, "Controlling Queue Delay", 700 July 2012, . 702 [SQF2012] Bonald, T., Muscariello, L., and N. Ostallo, "On the 703 impact of TCP and per-flow scheduling on Internet 704 Performance - IEEE/ACM transactions on Networking", April 705 2012, . 708 [DRR] Shreedhar, M. and G. Varghese, "Efficient Fair Queueing 709 Using Deficit Round Robin", June 1996, 710 . 713 [DRRPP] MacGregor, and Shi, "Deficits for Bursty Latency-critical 714 Flows: DRR++", 2000, . 717 12.3. URIs 719 [1] http://www.bufferbloat.net/projects/cerowrt/wiki/nfq_codel 721 Authors' Addresses 723 Toke Hoeiland-Joergensen 724 Karlstad University 725 Dept. of Computer Science 726 Karlstad 65188 727 Sweden 729 Email: toke.hoiland-jorgensen@kau.se 731 Paul McKenney 732 IBM Linux Technology Center 733 1385 NW Amberglen Parkway 734 Hillsboro, OR 97006 735 USA 737 Email: paulmck@linux.vnet.ibm.com 738 URI: http://www2.rdrop.com/~paulmck/ 740 Dave Taht 741 Teklibre 742 2104 W First street 743 Apt 2002 744 FT Myers, FL 33901 745 USA 747 Email: d+ietf@teklibre.com 748 URI: http://www.teklibre.com/ 750 Jim Gettys 751 Google, Inc. 752 21 Oak Knoll Road 753 Carlisle, MA 01741 754 USA 756 Email: jg@freedesktop.org 757 URI: https://en.wikipedia.org/wiki/Jim_Gettys 758 Eric Dumazet 759 Google, Inc. 760 1600 Amphitheater Pkwy 761 Mountain View, Ca 94043 762 USA 764 Email: edumazet@gmail.com