TSVWG                                                        K. Nichols
Internet-Draft                                            Pollere, Inc.
Intended status: Informational                              V. Jacobson
Expires: September 4, 2014                                       Google
                                                          March 3, 2014

              Controlled Delay Active Queue Management
                     draft-nichols-tsvwg-codel-02

Abstract

The "persistently full buffer" problem has been discussed in the IETF
community since the early 80's, starting with RFC 896.  The IRTF's
End-to-End Working Group called for the deployment of active queue
management (AQM) to solve the problem in 1998, in RFC 2309.  Despite
this awareness, the problem has gotten worse as Moore's-Law growth in
memory density fueled an exponential increase in buffer pool size.
Efforts to deploy AQM have been frustrated by difficult configuration
and negative impact on network utilization.  The full buffer problem,
recently christened "bufferbloat" (Gettys and Nichols, 2011), has
become increasingly important throughout the Internet but
particularly at the consumer edge.

To address bufferbloat, this document describes a general framework
for controlling excessive delay in networks, called Controlled Delay
(CoDel), designed to work in modern networking environments as a part
of the solution to bufferbloat (Nichols and Jacobson, 2012).  CoDel
consists of an estimator, a setpoint, and a control loop and can be
deployed in the Internet without configuration.  CoDel comprises some
major technical innovations and has been made available as open
source so that the framework can be applied by the community to a
range of problems.  It has been implemented in Linux (and is
available in the Linux distribution) and deployed in some networks at
the consumer edge.  In addition, the framework has been successfully
applied in other ways.

Note: Code Components extracted from this document must include the
license as included with the code in Section 5.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 4, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

1.  Introduction

The need for queue management has been evident for decades.  Recently
the need has become more critical due to increased consumer use of
the Internet, mixing large video transactions with time-critical VoIP
and gaming.  Gettys [TSV2011, BB2011] has been instrumental in
publicizing the problem and the measurement work [CHARB2007,
NETAL2010] and in coining the term "bufferbloat".  Large content
distributors such as Google have observed that bufferbloat is
ubiquitous and adversely affects performance for many users.  The
solution is an effective AQM that remediates bufferbloat at a
bottleneck while "doing no harm" at hops where buffers are not
bloated.

The development and deployment of effective active queue management
has been hampered by persistent misconceptions about the cause and
meaning of queues.  Network buffers exist to absorb the packet bursts
that occur naturally in statistically multiplexed networks.  Short-
term mismatches in traffic arrival and departure rates that arise
from upstream resource contention, transport conversation startup
transients, and/or changes in the number of conversations sharing a
link create queues.
Unfortunately, other network behavior can cause queues to fill, and
their effects aren't nearly as benign.  Discussion of these issues
and why the solution isn't just smaller buffers can be found in
[RFC2309], [VANQ2006], [REDL1998], and [CODEL2012].  It is critical
to understand the difference between the necessary, useful "good"
queue and the counterproductive "bad" queue.

Many approaches to active queue management (AQM) have been developed
over the past two decades, but none has been widely deployed due to
performance problems.  When designed with the wrong conceptual model
for queues, AQMs have limited operational range, require a lot of
configuration tweaking, and frequently impair rather than improve
performance.  Today, the demands on an effective AQM are even
greater: many network devices must work across a range of bandwidths,
either due to link variations or due to the mobility of the device.
The CoDel approach is designed to meet the following goals:

o  is parameterless for normal operation - has no knobs for
   operators, users, or implementers to adjust

o  treats "good queue" and "bad queue" differently, that is, keeps
   delay low while permitting necessary bursts of traffic

o  controls delay, while remaining insensitive (or nearly so) to
   round trip delays, link rates, and traffic loads; this goal is to
   "do no harm" to network traffic while controlling delay

o  adapts to dynamically changing link rates with no negative impact
   on utilization

o  is simple and efficient (can easily span the spectrum from low-
   end, Linux-based access points and home routers up to high-end
   commercial router silicon)

Since April 2012, when CoDel was published, a number of talented and
enthusiastic implementers have been using and adapting it with
promising results.  Much of this work is collected at
http://www.bufferbloat.net/projects/codel.  CoDel has five major
innovations that distinguish it from prior AQMs:

o  use of the local queue minimum to track congestion ("bad queue"),

o  use of an efficient single state variable representation of that
   tracked statistic,

o  use of packet sojourn time as the observed datum, rather than
   packets, bytes, or rates,

o  use of a mathematically determined setpoint derived from
   maximizing the network power metric, and

o  a modern state space controller.

CoDel configures itself based on a round-trip time metric which can
be set to 100ms for the normal, terrestrial Internet.  With no
changes to parameters, we have found CoDel to work across a wide
range of conditions, with varying links and the full range of
terrestrial round trip times.  CoDel has been implemented in Linux
very efficiently and should lend itself to silicon implementation.
CoDel is well-adapted for use in multiple queued devices and has been
used by Eric Dumazet with multiple queues in a sophisticated queue
management approach, fq_codel (covered in another draft).

2.  Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].

In this document, these words will appear with that interpretation
only when in ALL CAPS.  Lower case uses of these words are not to be
interpreted as carrying [RFC2119] significance.

In this document, the characters ">>" preceding an indented line or
lines indicate a compliance requirement statement using the key words
listed above.  This convention aids reviewers in quickly identifying
or finding the explicit compliance requirements of this RFC.

3.  Building Blocks of Queue Management

Two decades of work on queue management failed to yield an approach
that could be widely deployed in the Internet.  With careful tuning
for particular usages, queue management techniques have been able to
"kind of" work, that is, decrease queuing delays, but utilization and
fairness suffer unduly.  At the heart of queue management is the
notion of "good queue" and "bad queue" and the search for ways to get
rid of the bad queue (which only adds delay) while preserving the
good queue (which provides for good utilization).  This section
explains queuing, both good and bad, and covers the innovative CoDel
building blocks that can be used to manage packet buffers to keep
their queues in the "good" range.

Packet queues form in buffers facing bottleneck links, i.e., where
the line rate goes from high to low or where many links converge.
The well-known bandwidth-delay product (sometimes called "pipe size")
is the bottleneck's bandwidth multiplied by the sender-receiver-
sender round-trip delay and is the amount of data that has to be in
transit between two hosts in order to run at 100% utilization.  To
explore how queues can form, consider a long-lived TCP connection
with a 25 packet window sending through a connection with a
bandwidth-delay product of 20 packets.  After an initial burst of
packets, the connection will settle into a five packet (+/-1)
standing queue, the size determined by the mismatch between the
window and the pipe size and unrelated to the connection's sending
rate.  The connection has 25 packets in flight at all times, but only
20 packets arrive at the destination over a round trip time.  If the
TCP connection has a 30 packet window, the queue will be ten packets
with no change in sending rate.  Similarly, if the window is 20
packets, there will be no queue, but the sending rate is the same.
Nothing can be inferred about the sending rate from the queue length,
and any queue other than a transient burst can only create delay in
the network.  The sender needs to reduce the number of packets in
flight rather than the sending rate.

In the above example, the five packet standing queue can be seen to
contribute nothing but delay to the connection and thus is clearly
"bad queue".  If, in our example, there is a single bottleneck link
and it is much slower than the link that feeds it (say, a high-speed
ethernet link into a limited DSL uplink), a 20 packet buffer at the
bottleneck might be necessary to temporarily hold the 20 packets in
flight to keep the utilization high.  The burst of packets should
drain completely (to 0 or 1 packets) within a round trip time, and
this transient queue is "good queue" because it allows the connection
to keep the 20 packets in flight and the bottleneck link to be fully
utilized.  In terms of the delay experienced, we can observe that
"good queue" goes away in about a round trip time, while "bad queue"
hangs around causing delays.

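To make the window/pipe-size arithmetic of this example concrete, the
following fragment (an illustrative sketch, not part of the reference
pseudocode of Section 5) computes the delivery rate and standing
queue for the three windows discussed:

#include <cstdio>
#include <algorithm>
#include <initializer_list>

int main() {
    const int bdp = 20;  // pipe size: bandwidth * RTT, in packets
    for (int window : {20, 25, 30}) {
        // The bottleneck pins delivery at bdp packets per RTT no
        // matter how large the window; the excess only adds delay.
        int delivered = std::min(window, bdp);
        int standing_queue = std::max(0, window - bdp);
        printf("window %2d: %d packets/RTT delivered, "
               "standing queue %d\n", window, delivered,
               standing_queue);
    }
}

For windows of 20, 25, and 30 packets this prints standing queues of
0, 5, and 10 packets with an identical delivery rate, matching the
discussion above.
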
Effective queue management detects "bad queue" while ignoring "good
queue" and takes action to get rid of the bad queue when it is
detected.  The goal is a queue controller that accomplishes this
objective.  To control a queue, we need three basic components:

o  Estimator - figure out what we've got

o  Setpoint - know what we want

o  Control loop - if what we've got isn't what we want, we need a
   way to move it there

3.1.  Estimator

The Estimator both observes the queue and detects when good queue
turns to bad queue and vice versa.  CoDel has two innovations in its
Estimator: what is observed as an indicator of queue and how the
observations are used to detect good/bad queue.

In the past, queue length has been widely used as an observed
indicator of congestion and is frequently conflated with sending
rate.  Use of queue length as a metric is sensitive to how and when
the length is observed.  A high speed arrival link to a buffer
serviced at a much lower rate can rapidly build up a queue that might
disperse completely or down to a single packet before a round trip
time has elapsed.  If the queue length is monitored at packet arrival
(as in the original RED) or at departure time, every packet will see
a queue, with one possible exception.  If the queue length itself is
time sampled (as recommended in [REDL1998]), a truer picture of the
queue's occupancy can be gained, but a separate process is required.

The use of queue length is further complicated in networks that are
subject to both short and long term changes in available link rate
(as in WiFi).  Link rate drops can result in a spike in queue length
that should be ignored unless it persists.  The length metric is
problematic when what we really want to control is the amount of
excess delay packets experience due to a persistent or standing
queue.  The sojourn time that a packet spends in the buffer is
exactly what we want to track.  Tracking the packet sojourn times in
the buffer observes the actual delay experienced by each packet.
Sojourn time is independent of link rate, gives superior performance
to use of buffer size, and is directly related to user-visible
performance.  It works regardless of line rate changes or whether
the link is shared by multiple queues (which the individual queues
may experience as changing rates).

Consider a link shared by two queues, one priority queue and one of
lower priority.  Packets that arrive at the high priority queue are
sent as soon as the link is available, while packets of the other
queue have to wait until the priority queue is empty (i.e., a strict
priority scheduler).  The number of packets in the priority queue
might be large, but the queue is emptied quickly and the amount of
time each packet spends enqueued (the sojourn time) is not large.
The other queue might have a smaller number of packets, but packet
sojourn times will include the wait for the high priority packets to
be sent.  This makes the sojourn time a good sample of the congestion
that each separate queue is experiencing and shows how this metric is
independent of the number of queues used or the service discipline
and instead reflects the congestion seen by the individual queue.

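The mechanics of observing sojourn time are simple: stamp each packet
at enqueue and subtract at dequeue.  A minimal sketch follows (the
full reference pseudocode appears in Section 5, which also defines
the packet_t, time_t, and clock() assumed here):

// Stamp the arrival time on every packet at enqueue...
void codel_queue_t::enque(packet_t* pkt)
{
    pkt->tstamp = clock();
    queue_t::enque(pkt);
}

// ...and observe the per-packet delay at dequeue.  Note that the
// measurement needs no knowledge of the link rate or of how many
// other queues share the link.
packet_t* codel_queue_t::deque()
{
    packet_t* pkt = queue_t::deque();
    if (pkt != NULL) {
        time_t sojourn_time = clock() - pkt->tstamp;
        // feed sojourn_time to the Estimator (see below)
    }
    return pkt;
}
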
With sojourn time as the observation, how can it be used to separate
good queue from bad queue?  In the past, averages, in particular of
queue length, have been used to detect bad queue.  Consider the burst
that disperses every round trip time.  The average queue will be one-
half the burst size, though this might vary depending on when the
average is computed and the timing of arrivals.  The average then
would indicate a persistent queue where there is none.  If instead we
track the minimum observation, and there is even one packet with a
zero sojourn time, then there is no persistent queue.  The value of
the minimum in detecting persistent queue is apparent when looking at
graphs of queue delay.

The standing queue can be detected by tracking the (local) minimum
queue delay packets experience.  To ensure that this minimum value
does not become stale, it has to have been experienced recently,
i.e., during an appropriate past time interval.  This "interval" is
the maximum amount of time a minimum is considered to be in effect.
It is clear that this interval should be at least a round trip time
(to avoid falsely detecting a persistent queue) and not a lot more
than a round trip time (to avoid delay in detecting the persistent
queue).  This suggests that the appropriate interval value is the
maximum round-trip time of all the connections sharing the buffer.
To avoid outlier values, the 95-99th percentile value is preferred to
a strict maximum.

A key realization makes the local minimum an efficiently computed
statistic: it is sufficient to keep a single state variable recording
how long the minimum has been above or below a target value, rather
than retaining all the local values to compute the minimum.  This
leads to both storage and computational savings.

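A minimal sketch of that single state variable as it is used in the
dequeue path (simplified from the dodeque() pseudocode of Section 5;
first_above_time_ is the lone state variable, with 0 meaning "not
currently above target"):

flag_t codel_queue_t::above_target(time_t now, time_t sojourn_time)
{
    if (sojourn_time < target_) {
        first_above_time_ = 0;      // min went below target: reset
    } else if (first_above_time_ == 0) {
        // just went above from below; if still above at
        // first_above_time_, the minimum has stayed above target
        // for a full interval
        first_above_time_ = now + interval_;
    } else if (now >= first_above_time_) {
        return 1;                   // continuously above >= interval
    }
    return 0;
}
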
These two innovations, use of sojourn time as the observed value and
use of the local minimum as the statistic to monitor queue
congestion, are key to CoDel's Estimator building block.  The local
minimum sojourn time provides an accurate and robust measure of
standing queue and has an efficient implementation.  In addition, use
of the minimum sojourn time has important advantages in
implementation.  The minimum packet sojourn can only be decreased
when a packet is dequeued, which means that all the work of CoDel can
take place when packets are dequeued for transmission and that no
locks are needed in the implementation.  The minimum is the only
statistic with this property.

A more detailed explanation with many pictures can be found at
http://pollere.net/Pdfdocs/QrantJul06.pdf and
http://www.ietf.org/proceedings/84/slides/slides-84-tsvarea-4.pdf .

3.2.  Setpoint

Now that we have a robust way of detecting standing queue, we need a
Setpoint that tells us when to act.  If the controller is set to take
action as soon as the estimator has a non-zero value, the average
drop rate will be maximized, which minimizes TCP goodput
[MACTCP1997].  Also, since this policy results in no backlog over
time (no persistent queue), it also maximizes the bottleneck link
bandwidth lost to normal stochastic variation in packet interarrival
times and obliterates much of the value of having a buffer.  We want
a setpoint that maximizes utilization while minimizing delay.  Early
in the history of packet networking, Kleinrock developed the analytic
machinery to do this using a quantity he called 'power' (the ratio of
a normalized throughput to a normalized delay) [KLEIN81].

It's straightforward to derive an analytic expression for the average
goodput of a TCP conversation for a given round-trip time r and
setpoint f (where f is expressed as a fraction of r) [VJTARG14].
Reno TCP, for example, yields:

   goodput = r (3 + 6f - f^2) / (4 (1 + f))

Since the peak delay is just f r, power is solely a function of f:
the r's in the numerator and denominator cancel.

As Kleinrock observed, the best operating point, in terms of the
bandwidth/delay tradeoff, is the peak power point, since points off
the peak represent a higher cost (in delay) per unit of bandwidth.
The power vs. f curve for any AIMD TCP is monotone decreasing, but
the curve is very flat for f < 0.1, followed by increasing curvature
with a knee around f = 0.2, then a steep, almost linear fall off
[TSV84] [VJTARG14].  Since the previous equation showed that goodput
is monotone increasing with f, the best operating point is near the
right edge of the flat top, since that represents the highest goodput
achievable for a negligible increase in delay.  However, since the r
in the model is a conservative upper bound, a target of 0.1 r runs
the risk of pushing shorter-RTT connections over the knee and giving
them higher delay for no significant goodput increase.  Generally, a
more conservative target of 0.05 r offers a good utilization vs.
delay tradeoff while giving enough headroom to work well with a large
variation in real RTT.

As the above analysis shows, a very small standing queue gives close
to 100% utilization.  While this result was derived for Reno TCP, the
derivation uses only properties that must hold for any 'TCP friendly'
transport.  We have verified by both analysis and simulation that
this result holds for Reno, Cubic, and Westwood [TSV84].  This
results in a particularly simple form for the setpoint: the ideal
range for the permitted standing queue is between 5 and 10% of the
TCP connection RTT.  Thus target is simply 5% of the interval of
section 3.1.

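A quick numeric check of the goodput expression (an illustrative
calculation, not reference code) shows the diminishing return from a
larger setpoint:

#include <cstdio>
#include <initializer_list>

// Reno goodput from the expression above, in units of r:
//   goodput(f) = (3 + 6f - f^2) / (4 (1 + f))
static double goodput(double f)
{
    return (3.0 + 6.0*f - f*f) / (4.0 * (1.0 + f));
}

int main() {
    for (double f : {0.0, 0.05, 0.10, 0.20}) {
        printf("f = %.2f: goodput %.3f r, gain %+5.1f%%, "
               "peak delay %3.0f%% of r\n", f, goodput(f),
               100.0 * (goodput(f) / goodput(0.0) - 1.0),
               100.0 * f);
    }
}

The goodput gain over an empty queue is about 4.7% at f = 0.05, 8.8%
at f = 0.1, and 15.6% at f = 0.2, while the added peak delay grows
linearly with f; the gain per unit of added delay shrinks as f grows,
which is the flat-top-and-knee behavior described above.
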
3.3.  Control Loop

Section 3.1 describes a simple, reliable way to measure bad
(persistent) queue.  Section 3.2 shows that TCP congestion control
dynamics give rise to a setpoint for this measure that's a provably
good balance between enhancing throughput and minimizing delay, and
that this setpoint is a constant fraction of the same 'largest
average RTT' interval used to distinguish persistent from transient
queue.  The only remaining building block needed for a basic AQM is
a 'control loop' algorithm to effectively drive the queuing system
from any 'persistent queue above target' state to a state where the
persistent queue is below target.

Control theory provides a wealth of approaches to the design of
control loops.  Most of classical control theory deals with the
control of linear, time-invariant, single-input-single-output (SISO)
systems.  Control loops for these systems generally come from a (well
understood) class known as Proportional-Integral-Derivative (PID)
controllers.  Unfortunately, a queue is not a linear system, and an
AQM operates at the point of maximum non-linearity (where the output
link bandwidth saturates, so increased demand creates delay rather
than higher utilization).  Output queues are also not time-invariant,
since traffic is generally a mix of connections which start and stop
at arbitrary times and which can have radically different behaviors,
ranging from "open loop" UDP audio/video to "closed-loop" congestion-
avoiding TCP.  Finally, the constantly changing mix of connections
(which can't be converted to a single 'lumped parameter' model
because of their transfer function differences) makes the system
multi-input-multi-output (MIMO), not SISO.

Since queuing systems match none of the prerequisites for a classical
controller, a modern state-space controller is a better approach,
with states 'no persistent queue' and 'has persistent queue'.  Since
Internet traffic mixtures change rapidly and unpredictably, a noise
and error tolerant adaptation algorithm like Stochastic Gradient is a
good choice.  Since there's essentially no information in the amount
of persistent queue [TSV84], the adaptation should be driven by how
long it has persisted.

Consider the two extremes of traffic behavior, a single open-loop UDP
video stream and a single, long-lived TCP bulk data transfer.  If the
average bandwidth of the UDP video stream is greater than the
bottleneck link rate, the link's queue will grow and the controller
will eventually enter 'has persistent queue' state and start dropping
packets.  Since the video stream is open loop, its arrival rate is
unaffected by drops, so the queue will persist until the average drop
rate is greater than the output bandwidth deficit (= average arrival
rate - average departure rate); the job of the adaptation algorithm
is to discover this rate.  For this example, the adaptation could
consist of simply estimating the arrival and departure rates and then
dropping at a rate slightly greater than their difference.  But this
class of algorithm won't work at all for the bulk data TCP stream.
TCP runs in closed-loop flow balance [TSV84], so its arrival rate is
almost always exactly equal to the departure rate - the queue isn't
the result of a rate imbalance but rather a mismatch between the TCP
sender's window and the src-dst-src round-trip path capacity (i.e.,
the connection's bandwidth*delay product).  The sender's TCP
congestion avoidance algorithm will slowly increase the send window
(one packet per round-trip-time) [RFC2581], which will eventually
cause the bottleneck to enter 'has persistent queue' state.  But,
since the average input rate is the same as the average output rate,
the rate deficit estimation that gave the correct drop rate for the
video stream would compute a drop rate of zero for the TCP stream.
However, if the output link drops one packet as it enters 'has
persistent queue' state, when the sender discovers this (via TCP's
normal packet loss repair mechanisms) it will reduce its window by a
factor of two [RFC2581]; so, one round-trip-time after the drop, the
persistent queue will go away.

If there were N TCP conversations sharing the bottleneck, the
controller would have to drop O(N) packets, one from each
conversation, to make all the conversations reduce their window and
get rid of the persistent queue.
If the traffic mix consists of short (<= bandwidth*delay product)
conversations, the aggregate behavior becomes more like the open-loop
video example, since each conversation is likely to have already sent
all its packets by the time it learns about a drop, so each drop has
negligible effect on subsequent traffic.

The controller doesn't know the type, number, or duration of the
conversations creating its queue, so it has to learn them.  Since
single drops can have a large effect if the degree of multiplexing
(the number of active conversations) is small, dropping at too high a
rate is likely to have a catastrophic effect on throughput.  Dropping
at a low rate (< 1 packet per round-trip-time) then increasing the
drop rate slowly until the persistent queue goes below target is
unlikely to overdrop yet is guaranteed to eventually dissipate the
persistent queue.  This stochastic gradient learning procedure is the
core of CoDel's control loop (the gradient exists because a drop
always reduces the (instantaneous) queue, so an increasing drop rate
always moves the system "down" toward no persistent queue, regardless
of traffic mix).

The next drop time is decreased in inverse proportion to the square
root of the number of drops since the dropping state was entered,
using the well-known nonlinear relationship of drop rate to
throughput to get a linear change in throughput [REDL1998,
MACTCP1997], as illustrated in the sketch at the end of this section.

Since the best rate to start dropping is at slightly more than one
packet per RTT, the controller's initial drop rate can be directly
derived from the Estimator's interval, defined in section 3.1.  Where
the interval is likely to be very close to the usual round trip time,
the initial drop spacing SHOULD be set to the Estimator's interval
plus twice the target (i.e., initial drop spacing = 1.1 * interval)
to ensure that acceptable congestion delays are covered.

Use of the minimum statistic lets the Controller be placed in the
dequeue routine with the Estimator.  This means that the control
signal (the drop) can be sent at the first sign of bad queue (as
indicated by the sojourn time) and that the Controller can stop
acting as soon as the sojourn time falls below the Setpoint.
Dropping at dequeue has both implementation and control advantages.

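To illustrate the control law (the same form appears in the
pseudocode of Section 5.6), the following sketch prints the spacing
between successive drops for the default 100ms interval:

#include <cstdio>
#include <cmath>

// Next drop time decreases in inverse proportion to sqrt(count),
// reflecting the sqrt(p) dependence of TCP throughput on drop
// probability.
static double control_law(double t, double interval, int count)
{
    return t + interval / sqrt((double)count);
}

int main() {
    const double interval = 100.0;  // ms
    double t = 0.0;
    for (int count = 1; count <= 5; ++count) {
        double next = control_law(t, interval, count);
        printf("drop %d: spacing %5.1f ms\n", count, next - t);
        t = next;
    }
}

The drop spacings come out as 100, 70.7, 57.7, 50.0, and 44.7 ms: the
drop rate ramps up slowly from one drop per interval, as described
above.
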
4.  Putting it together: queue management for the network edge

The CoDel building blocks are able to adapt to different or time-
varying link rates, to be easily used with multiple queues, to have
excellent utilization with low delay, and to have a simple and
efficient implementation.  The only setting CoDel requires is its
interval value, and since 100ms satisfies that requirement for normal
Internet usage, CoDel can be parameter-free for consumer use.  CoDel
was released to the open source community, where it has been widely
promulgated and adapted to many problems.  We can see how well these
building blocks work in a simple CoDel queue management
implementation.  This AQM was designed as a bufferbloat solution and
is focused on the consumer network edge.

4.1.  Overview of CoDel AQM

To ensure that link utilization is not adversely affected, CoDel's
Estimator sets its target to the Setpoint that optimizes power, and
CoDel's Controller does not drop packets when the drop would leave
the queue empty or with fewer than a maximum transmission unit (MTU)
worth of bytes in the buffer.  Section 3.2 showed that the ideal
Setpoint is 5-10% of the connection RTT.  In the open Internet, in
particular the consumer edge, we can use the "usual maximum"
terrestrial RTT of 100ms to calculate a minimum target of 5ms.  Under
the same assumptions, we compute the interval for tracking the
minimum to be the nominal RTT of 100ms.  In practice, uncongested
links will see sojourn times under the target more often than once
per RTT, so the Estimator is not overly sensitive to the value of the
interval.

When the Estimator finds a persistent delay above target, the
Controller enters the drop state, where a packet is dropped and the
next drop time is set.  As discussed in section 3.3, the initial next
drop spacing is intended to be long enough to give the endpoints time
to react to the single drop and so SHOULD be set to a value of 1.0 to
1.1 times the interval.  If the Estimator's output falls below the
target, the Controller cancels the next drop and exits the drop
state.  (The Controller is more sensitive than the Estimator to an
overly short interval, since an unnecessary drop could occur and
lower utilization.)  If the next drop time is reached while the
Controller is still in drop state, the packet being dequeued is
dropped and the next drop time is recalculated.  Additional logic
prevents re-entering the dropping state too soon after exiting it
and resumes the dropping state at a recent control level, if one
exists.

Note that the CoDel AQM enters its dropping state only when the local
minimum sojourn delay has exceeded an acceptable standing queue
target for a time interval long enough for normal bursts to
dissipate, ensuring that a burst of packets that fits in the pipe
will not be dropped.

CoDel's efficient implementation and lack of configuration are unique
features and make it suitable to manage modern packet buffers.  For
more background and results on CoDel, see [CODEL2012] and
http://pollere.net/CoDel.html .

4.2.  Non-starvation

CoDel's goals are to control delay with little or no impact on link
utilization and to be deployed on a wide range of link bandwidths,
including varying rate links, without reconfiguration.  To keep from
making drops when it would starve the output link, CoDel makes
another check before dropping to see if at least an MTU worth of
bytes remains in the buffer.  If not, the packet SHOULD NOT be
dropped and, currently, CoDel exits the drop state.  The MTU size can
be set adaptively to the largest packet seen so far or can be read
from the driver.

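A minimal sketch of this drop-inhibit check with adaptive MTU
tracking (pkt->size() is a hypothetical accessor for the packet
length in bytes; the reference pseudocode of Section 5 uses a
maxpacket constant instead):

// Track the largest packet seen so far as the adaptive "MTU"
// (alternatively, read the MTU from the driver).
void codel_queue_t::enque(packet_t* pkt)
{
    if (pkt->size() > maxpacket_)
        maxpacket_ = pkt->size();
    pkt->tstamp = clock();
    queue_t::enque(pkt);
}

// Before any drop: if less than an MTU worth of bytes would remain,
// don't drop, so the output link is never starved.
flag_t codel_queue_t::drop_allowed()
{
    return bytes() > maxpacket_;
}
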
4.3.  Using the interval

The interval is chosen to give endpoints time to react to a drop
without being so long that response times suffer.  CoDel's Estimator,
Setpoint, and Control Loop all use the interval.  Understanding their
derivation shows that CoDel is the most sensitive to the value of
interval for single long-lived TCPs, with a decreased sensitivity for
traffic mixes.  This is fortunate, as RTTs vary across connections
and are not known a priori, and it's difficult to obtain a definitive
histogram of the RTTs seen on the normal consumer edge link.  The
best policy is to use an interval slightly larger than the RTT seen
by most of the connections using a link, a value that can be
determined as the largest RTT seen if that value is not an outlier
(as in section 3.1, use of a 95-99th percentile value should work).
In practice, this value is not known or measured (though see Section
6.2 for an application where interval is measured; work in progress
at Pollere may lead to a method of doing this in an Internet buffer).
A setting of 100ms works well across a range of RTTs from 10ms to 1
second (excellent performance is achieved in the range from 10ms to
300ms).  For devices intended for the normal terrestrial Internet,
interval SHOULD have the value of 100ms.  This will only cause
overdropping where a long-lived TCP has an RTT longer than 100ms and
there is little or no mixing with other connections through the link.

Some confusion has arisen about the roles of the target Setpoint and
the minimum-tracking interval.  In particular, some experimenters
believe the value of target needs to be increased when the lower
layers have a bursty nature, where packets are transmitted for short
periods interspersed with idle periods where the link is waiting for
permission to send.  CoDel's Estimator will "see" the effective
transmission rate over an interval, and increasing target will just
lead to longer queue delays.  On the other hand, where a significant
additional delay is added to the intrinsic round trip time of most or
all packets due to the waiting time for a transmission, it is
necessary to increase interval by that extra delay.  That is, target
SHOULD NOT be adjusted, but interval MAY need to be adjusted.  For
more on this (and pictures), see
http://pollere.net/Pdfdocs/noteburstymacs.pdf .

4.4.  The target Setpoint

The target is the maximum acceptable standing queue delay, above
which CoDel is dropping or preparing to drop and below which CoDel
will not drop.  The calculations of section 3.2 showed that the best
setpoint is 5-10% of the RTT, with the low end of 5% preferred.  We
used simulation to explore the impact when TCPs are mixed with other
traffic and with connections of different RTTs.  Accordingly, we
experimented extensively with values in the 5-10% of RTT range and,
overall, used target values between 1 and 20 milliseconds for RTTs
from 30 to 500ms and link bandwidths of 64Kbps to 100Mbps,
experimentally exploring the Setpoint that gives consistently high
utilization while controlling delay across a range of bandwidths,
RTTs, and traffic loads.  Our results were notably consistent with
the mathematics of section 3.2.  Below a target of 5ms, utilization
suffers for some conditions and traffic loads; above 5ms we saw very
little or no improvement in utilization.  Thus target SHOULD be set
to 5ms for normal Internet traffic.

If a CoDel link has only or primarily long-lived TCP flows driving it
to congestion but not overload, the median delay through the link
will tend to the target.  For bursty traffic loads and for overloaded
conditions (where it is difficult or impossible for all the arriving
flows to be accommodated), the median queues will be longer than
target.

The non-starvation drop inhibit feature dominates where the link rate
becomes very small.  By inhibiting drops when there is less than an
(outbound link) MTU worth of bytes in the buffer, CoDel adapts to
very low bandwidth links.  This is shown in [CODEL2012], and
interested parties should see the discussion of results there.
Unpublished studies were carried out down to 64Kbps.  The drop
inhibit condition can be expanded to include a test to retain
sufficient bytes or packets to fill an allocation in a request-and-
grant MAC.

Sojourn times must remain above target for an entire interval in
order to enter the drop state.  Any packet with a sojourn time less
than target will reset the time that the queue was last below the
target.  Since Internet traffic has very dynamic characteristics, the
actual sojourn delays experienced by packets vary greatly and are
often less than the target unless the overload is excessive.  When a
link is not overloaded, it is not a bottleneck and packet sojourn
times will be small or nonexistent.  In the usual case, there are
only one or two places along a path where packets will encounter a
bottleneck (usually at the edge), so the amount of queuing delay
experienced by a packet should be less than 10ms even under extremely
congested conditions.  Contrast this to the queuing delays that grow
to orders of seconds and that have led to the term "bufferbloat"
[NETAL2010, CHARB2007].

4.5.  Use with multiple queues

Unlike other AQMs, CoDel is easily adapted to multiple queue systems.
With other approaches there is always a question of how to account
for the fact that each queue receives less than the full link rate
over time and usually sees a varying rate over time.  This is exactly
what CoDel excels at: using a packet's sojourn time in the buffer
completely bypasses this problem.  A separate CoDel algorithm runs on
each queue, but each CoDel uses the packet sojourn time the same way
a single-queue CoDel does.  Just as a single-queue CoDel adapts to
changing link bandwidths [CODEL2012], so do the multiple-queue
CoDels.  When testing for queue occupancy before dropping, the total
occupancy of all bins should be used.  This property of CoDel has
been exploited in fq_codel, sketched below, briefly discussed in the
next section, and the subject of another Internet Draft.

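A minimal sketch of the per-bin arrangement (hash_5tuple() and NBINS
are hypothetical names for illustration; fq_codel itself adds DRR
scheduling and other machinery beyond this):

// Hash each flow's five-tuple to a bin and run an independent CoDel
// instance on that bin.  Each bin's CoDel sees only its own sojourn
// times, so no bin needs to know its (time-varying) share of the
// link rate.
const int NBINS = 16;
codel_queue_t bins[NBINS];

uint32_t hash_5tuple(packet_t* pkt);  // hypothetical flow hash

void enque(packet_t* pkt)
{
    bins[hash_5tuple(pkt) % NBINS].enque(pkt);
}

// Note: before dropping, the queue occupancy test should use the
// total bytes across all bins, not just the bin being served.
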
4.6.  Use of stochastic bins or sub-queues to improve performance

Shortly after the release of the CoDel pseudocode, Eric Dumazet
created fq_codel, applying CoDel to each bin, or queue, used with
stochastic fair queuing.  (To understand further, see [SFQ1990] or
the Linux sfq at http://linux.die.net/man/8/tc-sfq .)  Fq_codel
hashes on the packet header fields to determine a specific bin, or
sub-queue, for each five-tuple flow and runs CoDel on each bin or
sub-queue, thus creating a well-mixed output flow and obviating
issues of reverse path flows (including "ack compression").
Dumazet's code is part of the CeroWrt project code at the
bufferbloat.net web site, and an Internet Draft describing fq_codel,
draft-hoeiland-joergensen-mckenney-taht-gettys-dumazet-fq-codel, has
been submitted.

We've experimented with a similar approach by creating an ns-2
simulator code module, sfqcodel.  This has provided excellent results
thus far: median queues remain small across a range of traffic
patterns that includes bidirectional file transfers (that is, the
same traffic sent in both directions on a link), constant bit-rate
VoIP-like flows, and emulated web traffic, and utilizations are
consistently better than single-queue CoDel, generally very close to
100%.  Our version differs from Dumazet's by preferring a packet-
based round robin of the bins rather than byte-based DRR, and there
may be other minor differences in implementation.  Our code, intended
for simulation experiments, is available at
http://pollere.net/CoDel.html and is being integrated into the ns-2
distribution.  Andrew McGregor has an ns-3 version of fq_codel.

Stochastic flow queuing provides better traffic mixing on the link
and tends to isolate a larger flow or flows.  For real priority
treatment, use of DiffServ isolation is encouraged.  We've
experimented in simulation with creating a queue to isolate all the
UDP traffic (which is all simulated VoIP, thus low bandwidth), but
this approach has to be applied with caution in the real world.  Some
experimenters are trying round robin with a small quantum (on the
order of a voice packet size), but this also needs thorough study.

A number of open issues should be studied.  In particular, if the
number of different queues or bins is too large, the scheduling will
be the dominant factor, not the AQM; it is NOT the case that more
bins are always better.  In our simulations, we have found good
behavior across mixed traffic types with smaller numbers of queues,
8-16 for a 5Mbps link.  This configuration appears to give the best
behavior for voice, web browsing, and file transfers, where increased
numbers of bins seem to favor file transfers at the expense of the
other traffic.  Our work has been very preliminary, and we encourage
others to take this up and to explore analytic modeling.  It would be
instructive to see the effects of different numbers of bins on a
range of traffic models, something like an updated version of
[BMPFQ].

Implementers SHOULD use the fq_codel multiple queue approach if
possible, as it deals with many problems beyond the reach of an AQM
on a single queue.

4.7.  Setting up CoDel AQM

CoDel is set up for use in devices in the open Internet.  An interval
of 100ms is used, target is set to 5% of interval, and the initial
drop spacing is also set to interval.  These settings have been
chosen so that a device, such as a small WiFi router, can be sold
without the need for any values to be made adjustable, yielding a
parameterless implementation.
In addition, CoDel is useful in environments with significantly
different characteristics from the normal Internet, for example, in
switches used as a cluster interconnect within a data center.  Since
cluster traffic is entirely internal to the data center, round trip
latencies are low (typically <100us) but bandwidths are high
(1-40Gbps), so it's relatively easy for the aggregation phase of a
distributed computation (e.g., the Reduce part of a Map/Reduce) to
persistently fill and then overflow the modest per-port buffering
available in most high speed switches.  A CoDel configured for this
environment (target and interval in the microsecond rather than
millisecond range) can minimize drops (or ECN marks) while keeping
throughput high and latency low.

Devices destined for these environments MAY use a different interval,
where suitable.  If appropriate analysis indicates, the target MAY be
set to some other value in the 5-10%-of-interval range, and the
initial drop spacing MAY be set to a value of 1.0 to 1.2 times the
interval.  But these settings will cause problems such as
overdropping and low throughput if used on the open Internet, so
devices that allow CoDel to be configured MUST default to the
Internet-appropriate values given in this document.

5.  Annotated Pseudo-code for CoDel AQM

What follows is the CoDel algorithm in C++-like pseudo-code.  Since
CoDel adds relatively little new code to a basic tail-drop fifo-
queue, we've tried to highlight just these additions by presenting
CoDel as a sub-class of a basic fifo-queue base class.  There have
been a number of minor variants in the code, and our reference
pseudo-code has not yet been completely updated.  The reference code
is included to aid implementers who wish to apply CoDel to queue
management as described here or to adapt its principles to other
applications.

Implementers are strongly encouraged to also look at Eric Dumazet's
Linux kernel version of CoDel - a well-written, well-tested, real-
world, C-based implementation.  As of this writing, it is at:

http://git.kernel.org/?p=linux/kernel/git/torvalds/
linux.git;a=blob_plain;f=net/sched/sch_codel.c;hb=HEAD

This code is open-source with a dual BSD/GPL license:

Codel - The Controlled-Delay Active Queue Management algorithm

Copyright (C) 2011-2014 Kathleen Nichols

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

o  Redistributions of source code must retain the above copyright
   notice, this list of conditions, and the following disclaimer,
   without modification.

o  Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in
   the documentation and/or other materials provided with the
   distribution.

o  The names of the authors may not be used to endorse or promote
   products derived from this software without specific prior written
   permission.

Alternatively, provided that this notice is retained in full, this
software may be distributed under the terms of the GNU General Public
License ("GPL") version 2, in which case the provisions of the GPL
apply INSTEAD OF those given above.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

5.1.  Data Types

"time_t" is an integer time value in units convenient for the system.
Resolution to at least a millisecond is required, and better
resolution is useful up to the minimum possible packet time on the
output link; 64- or 32-bit widths are acceptable, but with 32 bits
the resolution should be no finer than 2^-16 to leave enough dynamic
range to represent a wide range of queue waiting times.  Narrower
widths also have implementation issues due to overflow (wrapping) and
underflow (limit cycles because of truncation to zero) that are not
addressed in this pseudocode.  The code presented here uses 0 as a
flag value to indicate "no time set."

"packet_t*" is a pointer to a packet descriptor.  We assume it has a
tstamp field capable of holding a time_t and that the field is
available for use by CoDel (it will be set by the enque routine and
used by the deque routine).

"queue_t" is a base class for queue objects (the parent class for
codel_queue_t objects).  We assume it has enque() and deque() methods
that can be implemented in child classes.  We assume it has a bytes()
method that returns the current queue size in bytes.  This can be an
approximate value.  The method is invoked in the deque() method but
shouldn't require a lock with the enque() method.

"flag_t" is a Boolean.

5.2.  Per-queue state (codel_queue_t instance variables)

time_t first_above_time_; // Time to declare sojourn time above target
time_t drop_next_;        // Time to drop next packet
uint32_t count_;          // Packets dropped since entering drop state
flag_t dropping_;         // Equal to 1 if in drop state

5.3.  Constants

time_t target_ = MS2TIME(5);     // 5ms target queue delay
time_t interval_ = MS2TIME(100); // 100ms sliding-minimum window
u_int maxpacket_ = 512;          // Maximum packet size in bytes
                                 // (should use interface MTU)

5.4.  Enque routine

All the work of CoDel is done in the deque routine.  The only CoDel
addition to enque is putting the current time in the packet's tstamp
field so that the deque routine can compute the packet's sojourn
time.

void codel_queue_t::enque(packet_t* pkt)
{
    pkt->tstamp = clock();
    queue_t::enque(pkt);
}

5.5.  Deque routine

This is the heart of CoDel.  There are two branches: if we're in
packet-dropping state (meaning that the queue-sojourn time has gone
above target and hasn't come down yet), then we need to check if it's
time to leave that state or time for the next drop(s); if we're not
in dropping state, then we need to decide if it's time to enter it
and do the initial drop.

packet_t* codel_queue_t::deque()
{
    time_t now = clock();
    dodeque_result r = dodeque(now);

    if (dropping_) {
        if (! r.ok_to_drop) {
            // sojourn time below target - leave dropping state
            dropping_ = 0;
        }
        // Time for the next drop.  Drop current packet and dequeue
        // next.  If the dequeue doesn't take us out of dropping
        // state, schedule the next drop.  A large backlog might
        // result in drop rates so high that the next drop should
        // happen now, hence the 'while' loop.
        while (now >= drop_next_ && dropping_) {
            drop(r.p);
            r = dodeque(now);
            if (! r.ok_to_drop) {
                // leave dropping state
                dropping_ = 0;
            } else {
                ++count_;
                // schedule the next drop.
                drop_next_ = control_law(drop_next_);
            }
        }
    // If we get here we're not in dropping state.  The 'ok_to_drop'
    // return from dodeque means that the sojourn time has been
    // above 'target' for 'interval' so enter dropping state.
    } else if (r.ok_to_drop) {
        drop(r.p);
        r = dodeque(now);
        dropping_ = 1;

        // If min went above target close to when it last went
        // below, assume that the drop rate that controlled the
        // queue on the last cycle is a good starting point to
        // control it now.  ('drop_next' will be at most 'interval'
        // later than the time of the last drop so 'now - drop_next'
        // is a good approximation of the time from the last drop
        // until now.)
        count_ = (count_ > 2 && now - drop_next_ < 8*interval_)?
                     count_ - 2 : 1;
        drop_next_ = control_law(now);
    }
    return (r.p);
}

5.6.  Helper routines

Since the degree of multiplexing and the nature of the traffic
sources are unknown, CoDel acts as a closed-loop servo system that
gradually increases the frequency of dropping until the queue is
controlled (sojourn time goes below target).  This is the control law
that governs the servo.  It has this form because of the sqrt(p)
dependence of TCP throughput on drop probability.  Note that for
embedded systems or kernel implementations, the inverse sqrt can be
computed efficiently using only integer multiplication.  See Eric
Dumazet's excellent Linux CoDel implementation for example code (in
file net/sched/sch_codel.c of the kernel source for 3.5 or newer
kernels).

time_t codel_queue_t::control_law(time_t t)
{
    return t + interval_ / sqrt(count_);
}

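As a sketch of that integer approach (an illustration of the idea,
not a copy of the kernel code, which uses Q0.32 precision), one
Newton-Raphson step refines a fixed-point estimate of 1/sqrt(count)
using only multiplies and shifts:

#include <stdint.h>

// One Newton-Raphson step for y ~= 1/sqrt(count), with y held as a
// Q16 fixed-point fraction (y_q16 = y * 2^16).  The recurrence
//     y' = y * (3 - count * y^2) / 2
// converges when seeded from the previous count's estimate, so one
// step per drop tracks sqrt(count) cheaply.
static uint32_t invsqrt_newton_step(uint32_t count, uint32_t y_q16)
{
    int64_t y2 = ((int64_t)y_q16 * y_q16) >> 16;    // y^2, Q16
    int64_t t = (3LL << 16) - (int64_t)count * y2;  // 3 - count*y^2
    if (t < 0)
        t = 0;                                      // guard a bad seed
    return (uint32_t)(((int64_t)y_q16 * t) >> 17);  // y * t / 2, Q16
}

// The next drop time then needs no floating point:
//     drop_next_ = t + ((interval_ * y_q16) >> 16);
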
Next is a helper routine that does the actual packet dequeue and
tracks whether the sojourn time is above or below target and, if
above, whether it has remained above continuously for at least
interval.  It returns two values: a Boolean indicating if it is OK to
drop (sojourn time above target for at least interval) and the packet
dequeued.

typedef struct {
    packet_t* p;
    flag_t ok_to_drop;
} dodeque_result;

dodeque_result codel_queue_t::dodeque(time_t now)
{
    dodeque_result r = { queue_t::deque(), 0 };
    if (r.p == NULL) {
        // queue is empty - we can't be above target
        first_above_time_ = 0;
        return r;
    }

    // To span a large range of bandwidths, CoDel runs two
    // different AQMs in parallel.  One is sojourn-time-based
    // and takes effect when the time to send an MTU-sized
    // packet is less than target.  The 1st term of the "if"
    // below does this.  The other is backlog-based and takes
    // effect when the time to send an MTU-sized packet is >=
    // target.  The goal here is to keep the output link
    // utilization high by never allowing the queue to get
    // smaller than the amount that arrives in a typical
    // interarrival time (MTU-sized packets arriving spaced
    // by the amount of time it takes to send such a packet on
    // the bottleneck).  The 2nd term of the "if" does this.
    time_t sojourn_time = now - r.p->tstamp;
    if (sojourn_time < target_ || bytes() <= maxpacket_) {
        // went below - stay below for at least interval
        first_above_time_ = 0;
    } else {
        if (first_above_time_ == 0) {
            // just went above from below. if still above at
            // first_above_time, will say it's ok to drop.
            first_above_time_ = now + interval_;
        } else if (now >= first_above_time_) {
            r.ok_to_drop = 1;
        }
    }
    return r;
}

5.7.  Implementation considerations

Since CoDel requires relatively little per-queue state and no direct
communication or state sharing between the enqueue and dequeue
routines, it's relatively simple to add it to almost any packet
processing pipeline, including ASIC- or NPU-based forwarding engines.

One issue to think about is dodeque's use of a 'bytes()' function to
find out how many bytes are currently in the queue.  This value does
not need to be exact.  If the enqueue part of the pipeline keeps a
running count of the total number of bytes it has put into the queue
and the dequeue routine keeps a running count of the total bytes it
has removed from the queue, 'bytes()' is just the difference between
these two counters.  32-bit counters are more than adequate.  Enqueue
has to update its counter once per packet queued, but it doesn't
matter when (before, during, or after the packet has been added to
the queue).  The worst that can happen is a slight, transient
underestimate of the queue size, which might cause a drop to be
briefly deferred.

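A minimal sketch of that two-counter arrangement (illustrative; it
would sit behind the bytes() method assumed in Section 5.1):

#include <stdint.h>

// Two free-running byte counters, each written by only one side of
// the pipeline, so no lock is needed.  Unsigned wraparound keeps the
// difference correct even after either counter overflows.
static uint32_t bytes_queued = 0;    // written only by enqueue
static uint32_t bytes_dequeued = 0;  // written only by dequeue

uint32_t bytes(void)
{
    // May be a slight, transient underestimate if it races an
    // enqueue update, which at worst briefly defers a drop.
    return bytes_queued - bytes_dequeued;
}
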
6.  Adapting and applying CoDel's building blocks

CoDel is being implemented and tested in a range of environments.
Dave Taht has been instrumental in the integration and distribution
of bufferbloat solutions, including CoDel, and has set up a website
and a mailing list for CeroWrt implementers
(www.bufferbloat.net/projects/codel).  This is an active area of work
and an excellent place to track developments.

6.1.  Validations and available code

An experiment by Stanford graduate students successfully used the
Linux CoDel to duplicate our published simulation work on CoDel's
ability to follow drastic link rate changes; it can be found at
http://reproducingnetworkresearch.wordpress.com/2012/06/06/solving-
bufferbloat-the-codel-way/ .

Our ns-2 simulations are available at http://pollere.net/CoDel.html .
Cable Labs has funded some additions to the simulator sfqcodel code,
which have been made public.  The basic algorithm of CoDel remains
unchanged, but we continue to experiment with the drop interval
setting when resuming the drop state, with inhibiting or canceling
drop state when the bytes in the queue are small, and with other
minor details.  Our approach to changes is to only make them if we
are convinced they do more good than harm, both operationally and in
the implementation.  With this in mind, some of these issues may
continue to evolve as we get more deployment and as the building
blocks are applied to a wider range of problems.

CoDel is being made available with the ns-2 distribution.

Andrew McGregor has an ns-3 implementation of both CoDel and FQ_CoDel
(https://github.com/dtaht/ns-3-dev).

CoDel is available in Linux.  Eric Dumazet has put CoDel into the
Linux distribution.

6.2.  CoDel in the datacenter

Nandita Dukkipati's team at Google was quick to realize that the
CoDel building blocks could be applied to bufferbloat problems in
datacenter servers, not just to Internet routers.  The Linux CoDel
queueing discipline (qdisc) was adapted in three ways to tackle this
bufferbloat problem.

Non-data packets were not dropped, as these are typically small and
sometimes critical control packets.  Being located on the server,
there is no concern with misbehaving users scamming such a policy as
there would be in an Internet router.

In several data center workload benchmarks, which are typically
bursty, CoDel reduced the queueing latencies at the qdisc and thereby
improved the mean and 99th percentile latencies from several tens of
milliseconds to less than one millisecond.  The minimum-tracking part
of the CoDel framework proved useful in disambiguating "good" queue
from "bad" queue and was particularly helpful in controlling qdisc
buffers that are inherently bursty because of TCP Segmentation
Offload.

7.  Security Considerations

This document describes an active queue management algorithm for
implementation in networked devices.  There are no specific security
exposures associated with CoDel.

8.  IANA Considerations

This document does not require actions by IANA.

9.  Conclusions

CoDel provides very general, efficient, parameterless building blocks
for queue management that can be applied to single or multiple queues
in a variety of data networking scenarios.  It is a critical tool in
solving bufferbloat.  CoDel's settings MAY be modified for other
special-purpose networking applications.

On-going projects are creating a deployable CoDel in Linux routers
and experimenting with applying CoDel to stochastic queuing, with
very promising results.

10.  Acknowledgments

The authors wish to thank Jim Gettys for the constructive nagging
that made us get the work "out there" before we thought it was ready.
We also want to thank Dave Taht, Eric Dumazet, and the open source
community for showing the value of getting it "out there" and for
making it real.  We also wish to thank Nandita Dukkipati for
contributions to Section 6 and for comments which helped to
substantially improve this draft.

11.  References

11.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2.  Informative References

[TSV2011]  Gettys, J., "Bufferbloat: Dark Buffers in the Internet",
           IETF 80 presentation to Transport Area Open Meeting,
           March 2011.

[BB2011]   Gettys, J. and K. Nichols, "Bufferbloat: Dark Buffers in
           the Internet", Communications of the ACM 9(11),
           pp. 57-65.

[BMPFQ]    Suter, B., "Buffer Management Schemes for Supporting TCP
           in Gigabit Routers with Per-flow Queueing", IEEE Journal
           on Selected Areas in Communications Vol. 17 Issue 6,
           June 1999, pp. 1159-1169.

[CMNTS]    Allman, M., "Comments on Bufferbloat", Computer
           Communication Review Vol. 43 No. 1, January 2013,
           pp. 31-37.

[CODEL2012]
           Nichols, K. and V. Jacobson, "Controlling Queue Delay",
           Communications of the ACM Vol. 55 No. 7, July 2012,
           pp. 42-50.

[VANQ2006]
           Jacobson, V., "A Rant on Queues", talk at MIT Lincoln
           Labs, Lexington, MA, July 2006.

[REDL1998]
           Nichols, K., Jacobson, V., and K. Poduri, "RED in a
           Different Light", Tech report, September 1999.

[NETAL2010]
           Kreibich, C., et al., "Netalyzr: Illuminating the Edge
           Network", Proceedings of the Internet Measurement
           Conference, Melbourne, Australia, 2010.

[TSV84]    Jacobson, V., "CoDel talk at TSV meeting, IETF 84".

[CHARB2007]
           Dischinger, M., et al., "Characterizing Residential
           Broadband Networks", Proceedings of the Internet
           Measurement Conference, San Diego, CA, 2007.

[MACTCP1997]
           Mathis, M., Semke, J., and J. Mahdavi, "The Macroscopic
           Behavior of the TCP Congestion Avoidance Algorithm", ACM
           SIGCOMM Computer Communication Review Vol. 27 No. 3,
           July 1997.

[SFQ1990]  McKenney, P., "Stochastic Fairness Queuing", Proceedings
           of IEEE INFOCOM '90, San Francisco, 1990.

[KLEIN81]  Kleinrock, L. and R. Gail, "An Invariant Property of
           Computer Network Power", Proceedings of the International
           Conference on Communications, June 1981.

[RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
           RFC 896, January 1984.

[RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
           S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
           Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
           S., Wroclawski, J., and L. Zhang, "Recommendations on
           Queue Management and Congestion Avoidance in the
           Internet", RFC 2309, April 1998.

[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
           Control", RFC 2581, April 1999.

Authors' Addresses

Kathleen Nichols
Pollere, Inc.
PO Box 370201
Montara, CA 94037
USA

Email: nichols@pollere.com

Van Jacobson
Google

Email: vanj@google.com