Active Queue Management and Packet Scheduling (aqm)           K. Nichols
Internet-Draft                                             Pollere, Inc.
Intended status: Informational                               V. Jacobson
Expires: October 26, 2015                                    A. McGregor
                                                              J. Iyengar
                                                                  Google
                                                          April 27, 2015

               Controlled Delay Active Queue Management
                        draft-ietf-aqm-codel-01

Abstract

   The "persistently full buffer" problem has been discussed in the
   IETF community since the early 80's, beginning with RFC 896.  The
   IRTF's End-to-End Working Group called for the deployment of active
   queue management (AQM) to solve the problem in 1998 (RFC 2309).
   Despite this awareness, the problem has only gotten worse as Moore's
   Law growth in memory density fueled an exponential increase in
   buffer pool size.  Efforts to deploy AQM have been frustrated by
   difficult configuration and negative impact on network utilization.
   This problem, recently christened "bufferbloat", has become
   increasingly important throughout the Internet but particularly at
   the consumer edge.
   This document describes a general framework called CoDel (Controlled
   Delay) [CODEL2012] that controls bufferbloat-generated excess delay
   in modern networking environments.  CoDel consists of an estimator,
   a setpoint, and a control loop.  It requires no configuration in
   normal Internet deployments.  CoDel comprises some major technical
   innovations and has been made available as open source so that the
   framework can be applied by the community to a range of problems.
   It has been implemented in Linux (and is available in the Linux
   distribution) and deployed in some networks at the consumer edge.
   In addition, the framework has been successfully applied in other
   ways.

   Note: Code Components extracted from this document must include the
   license as included with the code in Section 5.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 26, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions used in this document
   3.  Building Blocks of Queue Management
       3.1.  Estimator
       3.2.  Setpoint
       3.3.  Control Loop
   4.  Putting it together: queue management for the network edge
       4.1.  Overview of CoDel AQM
       4.2.  Non-starvation
       4.3.  Using the interval
       4.4.  The target Setpoint
       4.5.  Use with multiple queues
       4.6.  Use of stochastic bins or sub-queues to improve
             performance
       4.7.  Setting up CoDel AQM
   5.  Annotated Pseudo-code for CoDel AQM
       5.1.  Data Types
       5.2.  Per-queue state (codel_queue_t instance variables)
       5.3.  Constants
       5.4.  Enque routine
       5.5.  Deque routine
       5.6.  Helper routines
       5.7.  Implementation considerations
   6.  Adapting and applying CoDel's building blocks
       6.1.  Validations and available code
       6.2.  CoDel in the datacenter
   7.  Security Considerations
   8.  IANA Considerations
   9.  Conclusions
   10. Acknowledgments
   11. References
       11.1.  Normative References
       11.2.  Informative References
   Authors' Addresses

1.  Introduction

   The need for queue management has been evident for decades.
   Recently the need has become more critical due to the increased
   consumer use of the Internet, mixing large video transactions with
   time-critical VoIP and gaming.  Gettys [TSV2011, BB2011] has been
   instrumental in publicizing the problem, drawing on measurement work
   [CHARB2007, NETAL2010], and in coining the term "bufferbloat".
   Large content distributors such as Google have observed that
   bufferbloat is ubiquitous and adversely affects performance for many
   users.  The solution is an effective AQM that remediates bufferbloat
   at a bottleneck while "doing no harm" at hops where buffers are not
   bloated.

   The development and deployment of effective active queue management
   has been hampered by persistent misconceptions about the cause and
   meaning of queues.  Network buffers exist to absorb the packet
   bursts that occur naturally in statistically multiplexed networks.
   Short-term mismatches in traffic arrival and departure rates that
   arise from upstream resource contention, transport conversation
   startup transients, and/or changes in the number of conversations
   sharing a link create queues.  Unfortunately, other network behavior
   can cause queues to fill, and their effects aren't nearly as benign.
   Discussion of these issues and why the solution isn't just smaller
   buffers can be found in [RFC2309], [VANQ2006], [REDL1998], and
   [CODEL2012].  It is critical to understand the difference between
   the necessary, useful "good" queue and the counterproductive "bad"
   queue.

   Many approaches to active queue management (AQM) have been developed
   over the past two decades, but none has been widely deployed due to
   performance problems.  When designed with the wrong conceptual model
   for queues, AQMs have limited operational range, require a lot of
   configuration tweaking, and frequently impair rather than improve
   performance.  Today, the demands on an effective AQM are even
   greater: many network devices must work across a range of
   bandwidths, either due to link variations or due to the mobility of
   the device.
   The CoDel approach is designed to meet the following goals:

   o  is parameterless for normal operation - has no knobs for
      operators, users, or implementers to adjust

   o  treats "good queue" and "bad queue" differently, that is, keeps
      delay low while permitting necessary bursts of traffic

   o  controls delay while insensitive (or nearly so) to round trip
      delays, link rates, and traffic loads; this goal is to "do no
      harm" to network traffic while controlling delay

   o  adapts to dynamically changing link rates with no negative
      impact on utilization

   o  is simple and efficient (can easily span the spectrum from low-
      end, Linux-based access points and home routers up to high-end
      commercial router silicon)

   Since April 2012, when CoDel was published, a number of talented and
   enthusiastic implementers have been using and adapting it with
   promising results.  Much of this work is collected at:
   http://www.bufferbloat.net/projects/codel .  CoDel has five major
   innovations that distinguish it from prior AQMs: use of the local
   queue minimum to track congestion ("bad queue"), use of an efficient
   single state variable representation of that tracked statistic, use
   of packet sojourn time as the observed datum (rather than packets,
   bytes, or rates), use of a mathematically determined setpoint
   derived from maximizing the network power metric, and use of a
   modern state space controller.

   CoDel configures itself based on a round-trip time metric which can
   be set to 100ms for the normal, terrestrial Internet.  With no
   changes to parameters, we have found CoDel to work across a wide
   range of conditions, with varying links and the full range of
   terrestrial round trip times.  CoDel has been implemented in Linux
   very efficiently and should lend itself to silicon implementation.
   CoDel is well-adapted for use in multiple queued devices and has
   been used by Eric Dumazet with multiple queues in a sophisticated
   queue management approach, fq_codel (covered in another draft).

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to be
   interpreted as carrying [RFC2119] significance.

   In this document, the characters ">>" preceding an indented line(s)
   indicates a compliance requirement statement using the key words
   listed above.  This convention aids reviewers in quickly identifying
   or finding the explicit compliance requirements of this RFC.

3.  Building Blocks of Queue Management

   Two decades of work on queue management failed to yield an approach
   that could be widely deployed in the Internet.  With careful tuning
   for particular usages, queue management techniques have been able to
   "kind of" work, that is, decrease queuing delays, but utilization
   and fairness suffer unduly.  At the heart of queue management is the
   notion of "good queue" and "bad queue" and the search for ways to
   get rid of the bad queue (which only adds delay) while preserving
   the good queue (which provides for good utilization).
   This section explains queuing, both good and bad, and covers the
   innovative CoDel building blocks that can be used to manage packet
   buffers to keep their queues in the "good" range.

   Packet queues form in buffers facing bottleneck links, i.e., where
   the line rate goes from high to low or many links converge.  The
   well-known bandwidth-delay product (sometimes called "pipe size") is
   the bottleneck's bandwidth multiplied by the sender-receiver-sender
   round-trip delay and is the amount of data that has to be in transit
   between two hosts in order to run at 100% utilization.  To explore
   how queues can form, consider a long-lived TCP connection with a 25
   packet window sending through a connection with a bandwidth-delay
   product of 20 packets.  After an initial burst of packets, the
   connection will settle into a five packet (+/-1) standing queue, the
   size determined by the window mismatch to the pipe size and
   unrelated to the connection's sending rate.  The connection has 25
   packets in flight at all times, but only 20 packets arrive at the
   destination over a round trip time.  If the TCP connection has a 30
   packet window, the queue will be ten packets with no change in
   sending rate.  Similarly, if the window is 20 packets, there will be
   no queue but the sending rate is the same.  Nothing can be inferred
   about the sending rate from the queue, and the existence of any
   queue at all other than transient bursts can only create delay in
   the network.  The sender needs to reduce the number of packets in
   flight rather than its sending rate.
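   To make the arithmetic of this example concrete, the standing queue
   is just the excess of the window over the pipe size.  The following
   small C++-like sketch (an illustration only, not part of the CoDel
   algorithm) restates the three cases above:

   int window = 25;   // packets in flight
   int pipe   = 20;   // bandwidth-delay product, in packets
   // A standing queue holds whatever the pipe can't: 5 packets here,
   // 10 for a 30 packet window, 0 for a 20 packet window.  The
   // delivery rate remains 20 packets per RTT in all three cases.
   int standing_queue = (window > pipe) ? window - pipe : 0;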
   In the above example, the five packet standing queue can be seen to
   contribute nothing but delay to the connection and thus is clearly
   "bad queue".  If, in our example, there is a single bottleneck link
   and it is much slower than the link that feeds it (say, a high-speed
   ethernet link into a limited DSL uplink), a 20 packet buffer at the
   bottleneck might be necessary to temporarily hold the 20 packets in
   flight to keep the utilization high.  The burst of packets should
   drain completely (to 0 or 1 packets) within a round trip time, and
   this transient queue is "good queue" because it allows the
   connection to keep the 20 packets in flight and the bottleneck link
   to be fully utilized.  In terms of the delay experienced, we can
   observe that "good queue" goes away in about a round trip time,
   while "bad queue" hangs around causing delays.

   Effective queue management detects "bad queue" while ignoring "good
   queue" and takes action to get rid of the bad queue when it is
   detected.  The goal is a queue controller that accomplishes this
   objective.  To control a queue, we need three basic components:

   o  Estimator - figure out what we've got

   o  Setpoint - know what we want

   o  Control loop - if what we've got isn't what we want, we need a
      way to move it there

3.1.  Estimator

   The Estimator both observes the queue and detects when good queue
   turns to bad queue and vice versa.  CoDel has two innovations in its
   Estimator: what is observed as an indicator of queue and how the
   observations are used to detect good/bad queue.

   In the past, queue length has been widely used as an observed
   indicator of congestion and is frequently conflated with sending
   rate.  Use of queue length as a metric is sensitive to how and when
   the length is observed.  A high speed arrival link to a buffer
   serviced at a much lower rate can rapidly build up a queue that
   might disperse completely, or down to a single packet, before a
   round trip time has elapsed.  If the queue length is monitored at
   packet arrival (as in original RED) or departure time, every packet
   will see a queue with one possible exception.  If the queue length
   itself is time sampled (as recommended in [REDL1998]), a truer
   picture of the queue's occupancy can be gained, but a separate
   process is required.

   The use of queue length is further complicated in networks that are
   subject to both short and long term changes in available link rate
   (as in wifi).  Link rate drops can result in a spike in queue length
   that should be ignored unless it persists.  The length metric is
   problematic when what we really want to control is the amount of
   excess delay packets experience due to a persistent or standing
   queue.  The sojourn time that a packet spends in the buffer is
   exactly what we want to track.  Tracking the packet sojourn times in
   the buffer observes the actual delay experienced by each packet.
   Sojourn time is independent of link rate, gives superior performance
   to use of buffer size, and is directly related to user-visible
   performance.  It works regardless of line rate changes or whether
   the link is shared by multiple queues (which the individual queues
   may experience as changing rates).

   Consider a link shared by two queues, one priority queue and one of
   lower priority.  Packets that arrive to the high priority queue are
   sent as soon as the link is available, while packets of the other
   queue have to wait until the priority queue is empty (i.e., a strict
   priority scheduler).  The number of packets in the priority queue
   might be large, but the queue is emptied quickly and the amount of
   time each packet spends enqueued (the sojourn time) is not large.
   The other queue might have a smaller number of packets, but packet
   sojourn times will include the wait for the high priority packets to
   be sent.  This makes the sojourn time a good sample of the
   congestion that each separate queue is experiencing and shows how
   this metric is independent of the number of queues used or the
   service discipline and instead reflects the congestion seen by the
   individual queue.

   With sojourn time as the observation, how can it be used to separate
   good queue from bad queue?  In the past, averages, in particular of
   queue length, have been used to determine bad queue.  Consider the
   burst that disperses every round trip time.  The average queue will
   be one-half the burst size, though this might vary depending on when
   the average is computed and the timing of arrivals.  The average
   then would indicate a persistent queue where there is none.  If
   instead we track the minimum observation, a single packet with a
   zero sojourn time is enough to show that there is no persistent
   queue.  The value of the minimum in detecting persistent queue is
   apparent when looking at graphs of queue delay.

   The standing queue can be detected by tracking the (local) minimum
   queue delay packets experience.  To ensure that this minimum value
   does not become stale, it has to have been experienced recently,
   i.e., during an appropriate past time interval.  This "interval" is
   the maximum amount of time a minimum is considered to be in effect.
   It is clear that this interval should be at least a round trip time
   to avoid falsely detecting a persistent queue and not a lot more
   than a round trip time to avoid delay in detecting the persistent
   queue.  This suggests that the appropriate interval value is the
   maximum round-trip time of all the connections sharing the buffer.
   To avoid outlier values, the 95-99th percentile value is preferred
   rather than a strict maximum.

   A key realization makes the local minimum an efficiently computed
   statistic.  Note that it is sufficient to keep a single state
   variable of how long the minimum has been above or below a target
   value rather than retaining all the local values to compute the
   minimum, leading to both storage and computational savings.
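   This test can be written in a few lines.  The sketch below is
   illustrative (names are chosen to match the pseudocode of Section 5;
   the real logic is folded into the dodeque routine of Section 5.6)
   and shows the single state variable doing all the minimum tracking:

   time_t first_above_time = 0; // 0 => min sojourn time below target

   flag_t persistent_queue(time_t sojourn_time, time_t now)
   {
       if (sojourn_time < target) {
           first_above_time = 0;               // min fell below target
       } else if (first_above_time == 0) {
           first_above_time = now + interval;  // just went above
       } else if (now >= first_above_time) {
           return 1;   // continuously above target for an interval
       }
       return 0;
   }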
   These two innovations, use of the sojourn time as the observed value
   and use of the local minimum as the statistic to monitor queue
   congestion, are the key to CoDel's Estimator building block.  The
   local minimum sojourn time provides an accurate and robust measure
   of standing queue and has an efficient implementation.  In addition,
   use of the minimum sojourn time has important advantages in
   implementation.  The minimum packet sojourn can only be decreased
   when a packet is dequeued, which means that all the work of CoDel
   can take place when packets are dequeued for transmission and that
   no locks are needed in the implementation.  The minimum is the only
   statistic with this property.

   A more detailed explanation with many pictures can be found at:
   http://pollere.net/Pdfdocs/QrantJul06.pdf and
   http://www.ietf.org/proceedings/84/slides/slides-84-tsvarea-4.pdf .

3.2.  Setpoint

   Now that we have a robust way of detecting standing queue, we need a
   Setpoint that tells us when to act.  If the controller is set to
   take action as soon as the estimator has a non-zero value, the
   average drop rate will be maximized, which minimizes TCP goodput
   [MACTCP1997].  Also, since this policy results in no backlog over
   time (no persistent queue), it maximizes the bottleneck link
   bandwidth lost to normal stochastic variation in packet interarrival
   time and obliterates much of the value of having a buffer.  We want
   a setpoint that maximizes utilization while minimizing delay.  Early
   in the history of packet networking, Kleinrock developed the
   analytic machinery to do this using a quantity he called 'power'
   (the ratio of a normalized throughput to a normalized delay)
   [KLEIN81].

   It's straightforward to derive an analytic expression for the
   average goodput of a TCP conversation for a given round-trip time r
   and setpoint f (where f is expressed as a fraction of r) [VJTARG14].
   Reno TCP, for example, yields:

      goodput = r (3 + 6f - f^2) / (4 (1 + f))

   Since the peak delay is just f r, it's clear that power is solely a
   function of f, since the r's in the numerator and denominator
   cancel:

      power = (1 + 2f - (1/3) f^2) / (1 + f)^2

   As Kleinrock observed, the best operating point, in terms of
   bandwidth / delay tradeoff, is the peak power point, since points
   off the peak represent a higher cost (in delay) per unit of
   bandwidth.  The power vs. f curve for any AIMD TCP is monotone
   decreasing.  But the curve is very flat for f < 0.1, followed by
   increasing curvature with a knee around f = 0.2, then a steep,
   almost linear fall off [TSV84] [VJTARG14].
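   To make the shape of this curve concrete, the short program below
   (an illustration, not part of CoDel) evaluates the power expression
   above at a few values of f; the normalization constant does not
   affect the shape:

   #include <cstdio>

   // Kleinrock power for Reno TCP as a function of the setpoint
   // fraction f, per the expression above.
   static double power(double f)
   {
       return (1.0 + 2.0*f - f*f/3.0) / ((1.0 + f) * (1.0 + f));
   }

   int main()
   {
       for (double f : {0.0, 0.05, 0.1, 0.2, 0.5})
           printf("f = %.2f   power = %.3f\n", f, power(f));
       // Prints 1.000, 0.997, 0.989, 0.963, 0.852: nearly flat out
       // to f ~ 0.1, then falling off as described above.
       return 0;
   }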
   Since the previous equation showed that goodput is monotone
   increasing with f, the best operating point is near the right edge
   of the flat top, since that represents the highest goodput
   achievable for a negligible increase in delay.  However, since the r
   in the model is a conservative upper bound, a target of 0.1r runs
   the risk of pushing shorter RTT connections over the knee and giving
   them higher delay for no significant goodput increase.  Generally, a
   more conservative target of 0.05r offers a good utilization vs.
   delay tradeoff while giving enough headroom to work well with a
   large variation in real RTT.

   As the above analysis shows, a very small standing queue gives close
   to 100% utilization.  While this result was for Reno TCP, the
   derivation uses only properties that must hold for any "TCP
   friendly" transport.  We have verified by both analysis and
   simulation that this result holds for Reno, Cubic, and Westwood
   [TSV84].  This results in a particularly simple form for the
   setpoint: the ideal range for the permitted standing queue is
   between 5 and 10% of the TCP connection's RTT.  Thus target is
   simply 5% of the interval of section 3.1.

3.3.  Control Loop

   Section 3.1 describes a simple, reliable way to measure bad
   (persistent) queue.  Section 3.2 shows that TCP congestion control
   dynamics gives rise to a setpoint for this measure that's a provably
   good balance between enhancing throughput and minimizing delay, and
   that this setpoint is a constant fraction of the same 'largest
   average RTT' interval used to distinguish persistent from transient
   queue.  The only remaining building block needed for a basic AQM is
   a 'control loop' algorithm to effectively drive the queuing system
   from any 'persistent queue above target' state to a state where the
   persistent queue is below target.

   Control theory provides a wealth of approaches to the design of
   control loops.  Most of classical control theory deals with the
   control of linear, time-invariant, single-input-single-output (SISO)
   systems.  Control loops for these systems generally come from a
   (well understood) class known as Proportional-Integral-Derivative
   (PID) controllers.  Unfortunately, a queue is not a linear system,
   and an AQM operates at the point of maximum non-linearity (where the
   output link bandwidth saturates, so increased demand creates delay
   rather than higher utilization).  Output queues are also not time-
   invariant since traffic is generally a mix of connections which
   start and stop at arbitrary times and which can have radically
   different behaviors ranging from "open loop" UDP audio/video to
   "closed-loop" congestion-avoiding TCP.  Finally, the constantly
   changing mix of connections (which can't be converted to a single
   'lumped parameter' model because of their transfer function
   differences) makes the system multi-input-multi-output (MIMO), not
   SISO.

   Since queuing systems match none of the prerequisites for a
   classical controller, a modern state-space controller is a better
   approach, with states 'no persistent queue' and 'has persistent
   queue'.  Since Internet traffic mixtures change rapidly and
   unpredictably, a noise and error tolerant adaptation algorithm like
   Stochastic Gradient is a good choice.
   Since there's essentially no information in the amount of persistent
   queue [TSV84], the adaptation should be driven by how long it has
   persisted.

   Consider the two extremes of traffic behavior, a single open-loop
   UDP video stream and a single, long-lived TCP bulk data transfer.
   If the average bandwidth of the UDP video stream is greater than the
   bottleneck link rate, the link's queue will grow, and the controller
   will eventually enter 'has persistent queue' state and start
   dropping packets.  Since the video stream is open loop, its arrival
   rate is unaffected by drops, so the queue will persist until the
   average drop rate is greater than the output bandwidth deficit
   (= average arrival rate - average departure rate), and the job of
   the adaptation algorithm is to discover this rate.  For this
   example, the adaptation could consist of simply estimating the
   arrival and departure rates, then dropping at a rate slightly
   greater than their difference.  But this class of algorithm won't
   work at all for the bulk data TCP stream.  TCP runs in closed-loop
   flow balance [TSV84], so its arrival rate is almost always exactly
   equal to the departure rate - the queue isn't the result of a rate
   imbalance but rather a mismatch between the TCP sender's window and
   the src-dst-src round-trip path capacity (i.e., the connection's
   bandwidth*delay product).  The sender's TCP congestion avoidance
   algorithm will slowly increase the send window (one packet per
   round-trip-time) [RFC2581], which will eventually cause the
   bottleneck to enter 'has persistent queue' state.  But, since the
   average input rate is the same as the average output rate, the rate
   deficit estimation that gave the correct drop rate for the video
   stream would compute a drop rate of zero for the TCP stream.
   However, if the output link drops one packet as it enters 'has
   persistent queue' state, when the sender discovers this (via TCP's
   normal packet loss repair mechanisms) it will reduce its window by a
   factor of two [RFC2581] so, one round-trip-time after the drop, the
   persistent queue will go away.

   If there were N TCP conversations sharing the bottleneck, the
   controller would have to drop O(N) packets, one from each
   conversation, to make all the conversations reduce their window and
   get rid of the persistent queue.  If the traffic mix consists of
   short (<= bandwidth*delay product) conversations, the aggregate
   behavior becomes more like the open-loop video example, since each
   conversation is likely to have already sent all its packets by the
   time it learns about a drop, so each drop has negligible effect on
   subsequent traffic.

   The controller doesn't know the number, type, or duration of the
   conversations creating its queue, so it has to learn these.  Since
   single drops can have a large effect if the degree of multiplexing
   (the number of active conversations) is small, dropping at too high
   a rate is likely to have a catastrophic effect on throughput.
   Dropping at a low rate (< 1 packet per round-trip-time) then
   increasing the drop rate slowly until the persistent queue goes
   below target is unlikely to overdrop yet is guaranteed to eventually
   dissipate the persistent queue.
   This stochastic gradient learning procedure is the core of CoDel's
   control loop (the gradient exists because a drop always reduces the
   (instantaneous) queue, so an increasing drop rate always moves the
   system "down" toward no persistent queue, regardless of traffic
   mix).

   The next drop time is decreased in inverse proportion to the square
   root of the number of drops since the dropping state was entered,
   using the well-known nonlinear relationship of drop rate to
   throughput to get a linear change in throughput [REDL1998]
   [MACTCP1997].

   Since the best rate to start dropping is at slightly more than one
   packet per RTT, the controller's initial drop rate can be directly
   derived from the Estimator's interval, defined in section 3.1.
   Where the interval is likely to be very close to the usual round
   trip time, the initial drop spacing SHOULD be set to the Estimator's
   interval plus twice the target (i.e., initial drop spacing = 1.1 *
   interval) to ensure that acceptable congestion delays are covered.

   Use of the minimum statistic lets the Controller be placed in the
   dequeue routine with the Estimator.  This means that the control
   signal (the drop) can be sent at the first sign of bad queue (as
   indicated by the sojourn time) and that the Controller can stop
   acting as soon as the sojourn time falls below the Setpoint.
   Dropping at dequeue has both implementation and control advantages.

4.  Putting it together: queue management for the network edge

   The CoDel building blocks are able to adapt to different or time-
   varying link rates, to be easily used with multiple queues, to have
   excellent utilization with low delay, and to have a simple and
   efficient implementation.  The only setting CoDel requires is its
   interval value, and since a value of 100ms works for normal Internet
   usage, CoDel can be parameter-free for consumer use.  CoDel was
   released to the open source community, where it has been widely
   promulgated and adapted to many problems.  We can see how well these
   building blocks work in a simple CoDel queue management
   implementation.  This AQM was designed as a bufferbloat solution and
   is focused on the consumer network edge.

4.1.  Overview of CoDel AQM

   To ensure that link utilization is not adversely affected, CoDel's
   Estimator sets its target to the Setpoint that optimizes power, and
   CoDel's Controller does not drop packets when the drop would leave
   the queue empty or with fewer than a maximum transmission unit (MTU)
   worth of bytes in the buffer.  Section 3.2 showed that the ideal
   Setpoint is 5-10% of the connection RTT.  In the open Internet, in
   particular the consumer edge, we can use the "usual maximum"
   terrestrial RTT of 100 ms to calculate a minimum target of 5ms.
   Under the same assumptions, we compute the interval for tracking the
   minimum to be the nominal RTT of 100ms.  In practice, uncongested
   links will see sojourn times under the target more often than once
   per RTT, so the Estimator is not overly sensitive to the value of
   the interval.

   When the Estimator finds a persistent delay above target, the
   Controller enters the drop state where a packet is dropped and the
   next drop time is set.  As discussed in section 3.3, the initial
   next drop spacing is intended to be long enough to give the
   endpoints time to react to the single drop, so SHOULD be set to a
   value of 1.0 to 1.1 times the interval.
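   If the delay stays above target, subsequent drops are spaced
   interval/sqrt(count) apart (the control_law routine of Section 5.6).
   The short illustration below (not part of the pseudocode) shows how
   the spacing ramps for an interval of 100ms:

   #include <cstdio>
   #include <cmath>

   int main()
   {
       const double interval_ms = 100.0;
       // Spacing between successive drops while the controller
       // remains in drop state: interval / sqrt(count).
       for (int count : {1, 2, 3, 4, 9, 16})
           printf("count = %2d   spacing = %5.1f ms\n",
                  count, interval_ms / sqrt((double)count));
       // 100.0, 70.7, 57.7, 50.0, 33.3, 25.0 ms: the drop rate
       // increases gradually, matching TCP's 1/sqrt(p) throughput
       // relation so throughput decreases roughly linearly.
       return 0;
   }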
   If the Estimator's output falls below the target, the Controller
   cancels the next drop and exits the drop state.  (The Controller is
   more sensitive than the Estimator to an overly short interval, since
   an unnecessary drop could occur and lower utilization.)  If the next
   drop time is reached while the Controller is still in drop state,
   the packet being dequeued is dropped and the next drop time is
   recalculated.  Additional logic prevents re-entering the dropping
   state too soon after exiting it and resumes the dropping state at a
   recent control level, if one exists.

   Note that CoDel AQM only enters its dropping state when the local
   minimum sojourn delay has exceeded an acceptable standing queue
   target for a time interval long enough for normal bursts to
   dissipate, ensuring that a burst of packets that fits in the pipe
   will not be dropped.

   CoDel's efficient implementation and lack of configuration are
   unique features and make it suitable to manage modern packet
   buffers.  For more background and results on CoDel, see [CODEL2012]
   and http://pollere.net/CoDel.html .

4.2.  Non-starvation

   CoDel's goals are to control delay with little or no impact on link
   utilization and to be deployed on a wide range of link bandwidths,
   including varying rate links, without reconfiguration.  To keep from
   making drops when it would starve the output link, CoDel makes
   another check before dropping to see if at least an MTU worth of
   bytes remains in the buffer.  If not, the packet SHOULD NOT be
   dropped and, currently, CoDel exits the drop state.  The MTU size
   can be set adaptively to the largest packet seen so far or can be
   read from the driver.
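   In terms of the Section 5 pseudocode, this guard is a single test in
   the dequeue path.  A simplified sketch (the actual check is folded
   into the dodeque routine of Section 5.6):

   // Drop inhibit: if dropping would leave less than one MTU's worth
   // of bytes in the buffer, treat the queue as if it were below
   // target; this also takes the Controller out of drop state.
   if (bytes() <= maxpacket_) {
       first_above_time_ = 0;
   }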
4.3.  Using the interval

   The interval is chosen to give endpoints time to react to a drop
   without being so long that response times suffer.  CoDel's
   Estimator, Setpoint, and Control Loop all use the interval.
   Understanding their derivation shows that CoDel is the most
   sensitive to the value of interval for single long-lived TCPs, with
   a decreased sensitivity for traffic mixes.  This is fortunate, as
   RTTs vary across connections and are not known a priori, and it's
   difficult to obtain a definitive histogram of RTTs seen on the
   normal consumer edge link.  The best policy is to use an interval
   slightly larger than the RTT seen by most of the connections using a
   link, a value that can be determined as the largest RTT seen if the
   value is not an outlier (as in section 3.1, use of a 95-99th
   percentile value should work).  In practice, this value is not known
   or measured (though see Section 6.2 for an application where
   interval is measured; work-in-progress at Pollere may lead to a
   method of doing this in an Internet buffer).  A setting of 100ms
   works well across a range of RTTs from 10ms to 1 second (excellent
   performance is achieved in the range from 10 ms to 300ms).  For
   devices intended for the normal terrestrial Internet, interval
   SHOULD have the value of 100ms.  This will only cause overdropping
   where a long-lived TCP has an RTT longer than 100ms and there is
   little or no mixing with other connections through the link.

   Some confusion concerns the roles of the target Setpoint and the
   minimum-tracking interval.  In particular, some experimenters
   believe the value of target needs to be increased when the lower
   layers have a bursty nature, where packets are transmitted for short
   periods interspersed with idle periods where the link is waiting for
   permission to send.  CoDel's Estimator will "see" the effective
   transmission rate over an interval, and increasing target will just
   lead to longer queue delays.  On the other hand, where a significant
   additional delay is added to the intrinsic round trip time of most
   or all packets due to the waiting time for a transmission, it is
   necessary to increase interval by that extra delay.  That is, target
   SHOULD NOT be adjusted but interval MAY need to be adjusted.  For
   more on this (and pictures) see
   http://pollere.net/Pdfdocs/noteburstymacs.pdf

4.4.  The target Setpoint

   The target is the maximum acceptable standing queue delay above
   which CoDel is dropping or preparing to drop and below which CoDel
   will not drop.  The calculations of section 3.2 showed that the best
   setpoint is 5-10% of the RTT, with the low end of 5% preferred.  We
   used simulation to explore the impact when TCPs are mixed with other
   traffic and with connections of different RTTs.  We experimented
   extensively with values in the 5-10% of RTT range and, overall, used
   target values between 1 and 20 milliseconds for RTTs from 30 to
   500ms and link bandwidths of 64Kbps to 100Mbps to experimentally
   explore the Setpoint that gives consistently high utilization while
   controlling delay across a range of bandwidths, RTTs, and traffic
   loads.  Our results were notably consistent with the mathematics of
   section 3.2: below a target of 5ms, utilization suffers for some
   conditions and traffic loads; above 5ms we saw very little or no
   improvement in utilization.  Thus target SHOULD be set to 5ms for
   normal Internet traffic.

   If a CoDel link has only or primarily long-lived TCP flows driving
   the link to congestion but not overload, the median delay through
   the link will tend to the target.  For bursty traffic loads and for
   overloaded conditions (where it is difficult or impossible for all
   the arriving flows to be accommodated), the median queues will be
   longer than target.

   The non-starvation drop inhibit feature dominates where the link
   rate becomes very small.  By inhibiting drops when there is less
   than an (outbound link) MTU worth of bytes in the buffer, CoDel
   adapts to very low bandwidth links.  This is shown in [CODEL2012],
   and interested parties should see the discussion of results there.
   Unpublished studies were carried out down to 64Kbps.  The drop
   inhibit condition can be expanded to include a test to retain
   sufficient bytes or packets to fill an allocation in a request-and-
   grant MAC.

   Sojourn times must remain above target for an entire interval in
   order to enter the drop state.  Any packet with a sojourn time less
   than target will reset the time that the queue was last below the
   target.  Since Internet traffic has very dynamic characteristics,
   the actual sojourn delays experienced by packets vary greatly and
   are often less than the target unless the overload is excessive.
   When a link is not overloaded, it is not a bottleneck and packet
   sojourn times will be small or nonexistent.
   In the usual case, there are only one or two places along a path
   where packets will encounter a bottleneck (usually at the edge), so
   the amount of queuing delay experienced by a packet should be less
   than 10 ms even under extremely congested conditions.  Contrast this
   to the queuing delays that grow to orders of seconds that have led
   to the "bufferbloat" term [NETAL2010] [CHARB2007].

4.5.  Use with multiple queues

   Unlike other AQMs, CoDel is easily adapted to multiple queue
   systems.  With other approaches there is always a question of how to
   account for the fact that each queue receives less than the full
   link rate over time and usually sees a varying rate over time.  This
   is exactly what CoDel excels at: using a packet's sojourn time in
   the buffer completely bypasses this problem.  A separate CoDel
   algorithm runs on each queue, but each CoDel uses the packet sojourn
   time the same way a single queue CoDel does.  Just as a single queue
   CoDel adapts to changing link bandwidths [CODEL2012], so do the
   multiple queue CoDels.  When testing for queue occupancy before
   dropping, the total occupancy of all bins should be used.  This
   property of CoDel has been exploited in fq_codel, briefly discussed
   in the next section and the subject of another Internet Draft.

4.6.  Use of stochastic bins or sub-queues to improve performance

   Shortly after the release of the CoDel pseudocode, Eric Dumazet
   created fq_codel, applying CoDel to each bin, or queue, used with
   stochastic fair queuing.  (To understand further, see [SFQ1990] or
   the Linux sfq at http://linux.die.net/man/8/tc-sfq .)  Fq_codel
   hashes on the packet header fields to determine a specific bin, or
   sub-queue, for each five-tuple flow and runs CoDel on each bin or
   sub-queue, thus creating a well-mixed output flow and obviating
   issues of reverse path flows (including "ack compression").
   Dumazet's code is part of the CeroWrt project code at the
   bufferbloat.net web site, and an Internet Draft describing fq_codel,
   draft-hoeiland-joergensen-aqm-fq-codel, has been submitted.
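   In outline, the binning step is just a hash in front of per-bin
   CoDel instances.  A sketch follows; the hash function name and the
   bin count here are illustrative placeholders, not taken from
   fq_codel's implementation:

   codel_queue_t bin[NBINS];     // independent CoDel state per bin

   void fq_enque(packet_t* pkt)
   {
       // Hash the five-tuple (addresses, ports, protocol) to pick a
       // sub-queue; each bin runs the unmodified CoDel logic.
       u_int b = five_tuple_hash(pkt) % NBINS;
       bin[b].enque(pkt);
   }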
   We've experimented with a similar approach by creating an ns-2
   simulator code module, sfqcodel.  This has provided excellent
   results thus far: median queues remain small across a range of
   traffic patterns that includes bidirectional file transfers (that
   is, the same traffic sent in both directions on a link), constant
   bit-rate VoIP-like flows, and emulated web traffic, and utilizations
   are consistently better than single queue CoDel, generally very
   close to 100%.  Our version differs from Dumazet's by preferring a
   packet-based round robin of the bins rather than byte-based DRR, and
   there may be other minor differences in implementation.  Our code,
   intended for simulation experiments, is available at
   http://pollere.net/CoDel.html and is being integrated into the ns-2
   distribution.  Andrew McGregor has an ns-3 version of fq_codel.

   Stochastic flow queuing provides better traffic mixing on the link
   and tends to isolate a larger flow or flows.  For real priority
   treatment, use of DiffServ isolation is encouraged.  We've
   experimented in simulation with creating a queue to isolate all the
   UDP traffic (which is all simulated VoIP, thus low bandwidth), but
   this approach has to be applied with caution in the real world.
   Some experimenters are trying round robin with a small quantum (on
   the order of a voice packet size), but this also needs thorough
   study.

   A number of open issues should be studied.  In particular, if the
   number of different queues or bins is too large, the scheduling will
   be the dominant factor, not the AQM; it is NOT the case that more
   bins are always better.  In our simulations, we have found good
   behavior across mixed traffic types with smaller numbers of queues,
   8-16 for a 5Mbps link.  This configuration appears to give the best
   behavior for voice, web browsing, and file transfers, where
   increased numbers of bins seem to favor file transfers at the
   expense of the other traffic.  Our work has been very preliminary
   and we encourage others to take this up and to explore analytic
   modeling.  It would be instructive to see the effects of different
   numbers of bins on a range of traffic models, something like an
   updated version of [BMPFQ].

   Implementers SHOULD use the fq_codel multiple queue approach if
   possible as it deals with many problems beyond the reach of an AQM
   on a single queue.

4.7.  Setting up CoDel AQM

   CoDel is set up for use in devices in the open Internet.  An
   interval of 100ms is used, target is set to 5% of interval, and the
   initial drop spacing is also set to the interval.  These settings
   have been chosen so that a device, such as a small WiFi router, can
   be sold without the need for any values to be made adjustable,
   yielding a parameterless implementation.  In addition, CoDel is
   useful in environments with significantly different characteristics
   from the normal Internet, for example, in switches used as a cluster
   interconnect within a data center.  Since cluster traffic is
   entirely internal to the data center, round trip latencies are low
   (typically <100us) but bandwidths are high (1-40Gbps), so it's
   relatively easy for the aggregation phase of a distributed
   computation (e.g., the Reduce part of a Map/Reduce) to persistently
   fill then overflow the modest per-port buffering available in most
   high speed switches.  A CoDel configured for this environment
   (target and interval in the microsecond rather than millisecond
   range) can minimize drops (or ECN marks) while keeping throughput
   high and latency low.

   Devices destined for these environments MAY use a different
   interval, where suitable.  If appropriate analysis indicates, the
   target MAY be set to some other value in the 5-10% of interval range
   and the initial drop spacing MAY be set to a value of 1.0 to 1.2
   times the interval.  But these settings will cause problems such as
   overdropping and low throughput if used on the open Internet, so
   devices that allow CoDel to be configured MUST default to the
   Internet-appropriate values given in this document.
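   The complete set of settings is small; the sketch below restates
   them.  The cluster-interconnect values are purely illustrative of
   the microsecond-range scaling described above (US2TIME is a
   hypothetical analog of the MS2TIME macro used in Section 5.3):

   // Open Internet defaults (Section 5.3 uses the same values):
   time_t target   = MS2TIME(5);     // 5% of interval
   time_t interval = MS2TIME(100);   // "usual maximum" RTT
   // initial drop spacing = interval (up to 1.1 * interval)

   // Hypothetical cluster-interconnect scaling, keeping target at
   // 5% of interval for sub-100us RTTs (illustrative only):
   // time_t target   = US2TIME(5);
   // time_t interval = US2TIME(100);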
5.  Annotated Pseudo-code for CoDel AQM

   What follows is the CoDel algorithm in C++-like pseudo-code.  Since
   CoDel adds relatively little new code to a basic tail-drop fifo-
   queue, we've tried to highlight just these additions by presenting
   CoDel as a sub-class of a basic fifo-queue base class.  There have
   been a number of minor variants in the code, and our reference
   pseudo-code has not yet been completely updated.  The reference code
   is included to aid implementers who wish to apply CoDel to queue
   management as described here or to adapt its principles to other
   applications.

   Implementors are strongly encouraged to also look at Eric Dumazet's
   Linux kernel version of CoDel - a well-written, well-tested, real-
   world, C-based implementation.  As of this writing, it is at:

   http://git.kernel.org/?p=linux/kernel/git/torvalds/
   linux.git;a=blob_plain;f=net/sched/sch_codel.c;hb=HEAD

   This code is open-source with a dual BSD/GPL license:

      Codel - The Controlled-Delay Active Queue Management algorithm

      Copyright (C) 2011-2014 Kathleen Nichols

      Redistribution and use in source and binary forms, with or
      without modification, are permitted provided that the following
      conditions are met:

      o  Redistributions of source code must retain the above copyright
         notice, this list of conditions, and the following disclaimer,
         without modification.

      o  Redistributions in binary form must reproduce the above
         copyright notice, this list of conditions and the following
         disclaimer in the documentation and/or other materials
         provided with the distribution.

      o  The names of the authors may not be used to endorse or promote
         products derived from this software without specific prior
         written permission.

      Alternatively, provided that this notice is retained in full,
      this software may be distributed under the terms of the GNU
      General Public License ("GPL") version 2, in which case the
      provisions of the GPL apply INSTEAD OF those given above.

      THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
      CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
      INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
      MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
      DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
      CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
      SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
      LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
      USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
      AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
      LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
      ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
      POSSIBILITY OF SUCH DAMAGE.

5.1.  Data Types

   "time_t" is an integer time value in units convenient for the
   system.  Resolution to at least a millisecond is required, and
   better resolution is useful up to the minimum possible packet time
   on the output link; 64- or 32-bit widths are acceptable, but with 32
   bits the resolution should be no finer than 2^-16 to leave enough
   dynamic range to represent a wide range of queue waiting times.
   Narrower widths also have implementation issues due to overflow
   (wrapping) and underflow (limit cycles because of truncation to
   zero) that are not addressed in this pseudocode.  The code presented
   here uses 0 as a flag value to indicate "no time set."

   "packet_t*" is a pointer to a packet descriptor.  We assume it has a
   tstamp field capable of holding a time_t and that the field is
   available for use by CoDel (it will be set by the enque routine and
   used by the deque routine).

   "queue_t" is a base class for queue objects (the parent class for
   codel_queue_t objects).  We assume it has enque() and deque()
   methods that can be implemented in child classes.  We assume it has
   a bytes() method that returns the current queue size in bytes.  This
   can be an approximate value.
   The method is invoked in the deque() method but shouldn't require a
   lock with the enque() method.

   "flag_t" is a Boolean.

5.2.  Per-queue state (codel_queue_t instance variables)

   time_t first_above_time_; // Time to declare sojourn time above
                             // target (0 if below target)
   time_t drop_next_;        // Time to drop next packet
   uint32_t count_;          // Packets dropped since entering drop
                             // state
   flag_t dropping_;         // Equal to 1 if in drop state

5.3.  Constants

   time_t target_ = MS2TIME(5);      // 5ms target queue delay
   time_t interval_ = MS2TIME(100);  // 100ms sliding-minimum window
   u_int maxpacket_ = 512;           // Maximum packet size in bytes
                                     // (should use interface MTU)

5.4.  Enque routine

   All the work of CoDel is done in the deque routine.  The only CoDel
   addition to enque is putting the current time in the packet's tstamp
   field so that the deque routine can compute the packet's sojourn
   time.

   void codel_queue_t::enque(packet_t* pkt)
   {
       pkt->tstamp = clock();
       queue_t::enque(pkt);
   }

5.5.  Deque routine

   This is the heart of CoDel.  There are two branches: if we're in
   packet-dropping state (meaning that the queue-sojourn time has gone
   above target and hasn't come down yet), then we need to check if
   it's time to leave or if it's time for the next drop(s); if we're
   not in dropping state, then we need to decide if it's time to enter
   that state and do the initial drop.

   packet_t* codel_queue_t::deque()
   {
       time_t now = clock();
       dodeque_result r = dodeque(now);

       if (dropping_) {
           if (! r.ok_to_drop) {
               // sojourn time below target - leave dropping state
               dropping_ = 0;
           }
           // Time for the next drop.  Drop current packet and dequeue
           // next.  If the dequeue doesn't take us out of dropping
           // state, schedule the next drop.  A large backlog might
           // result in drop rates so high that the next drop should
           // happen now, hence the 'while' loop.
           while (now >= drop_next_ && dropping_) {
               drop(r.p);
               r = dodeque(now);
               if (! r.ok_to_drop) {
                   // leave dropping state
                   dropping_ = 0;
               } else {
                   ++count_;
                   // schedule the next drop
                   drop_next_ = control_law(drop_next_);
               }
           }
       // If we get here, we're not in dropping state.  The
       // 'ok_to_drop' return from dodeque means that the sojourn time
       // has been above 'target' for 'interval', so enter dropping
       // state.
       } else if (r.ok_to_drop) {
           drop(r.p);
           r = dodeque(now);
           dropping_ = 1;

           // If min went above target close to when it last went
           // below, assume that the drop rate that controlled the
           // queue on the last cycle is a good starting point to
           // control it now.  ('drop_next' will be at most 'interval'
           // later than the time of the last drop, so 'now -
           // drop_next' is a good approximation of the time from the
           // last drop until now.)
           count_ = (count_ > 2 && now - drop_next_ < 8*interval_)?
                        count_ - 2 : 1;
           drop_next_ = control_law(now);
       }
       return (r.p);
   }

5.6.  Helper routines

   Since the degree of multiplexing and the nature of the traffic
   sources are unknown, CoDel acts as a closed-loop servo system that
   gradually increases the frequency of dropping until the queue is
   controlled (sojourn time goes below target).  This is the control
   law that governs the servo.  It has this form because of the sqrt(p)
   dependence of TCP throughput on drop probability.
   Note that for embedded systems or kernel implementations, the
   inverse sqrt can be computed efficiently using only integer
   multiplication.  See Eric Dumazet's excellent Linux CoDel
   implementation for example code (in file net/sched/sch_codel.c of
   the kernel source for 3.5 or newer kernels).

   time_t codel_queue_t::control_law(time_t t)
   {
       return t + interval_ / sqrt(count_);
   }

   Next is a helper routine that does the actual packet dequeue and
   tracks whether the sojourn time is above or below target and, if
   above, whether it has remained above continuously for at least
   interval.  It returns two values: a Boolean indicating if it is OK
   to drop (sojourn time above target for at least interval) and the
   packet dequeued.

   typedef struct {
       packet_t* p;
       flag_t ok_to_drop;
   } dodeque_result;

   dodeque_result codel_queue_t::dodeque(time_t now)
   {
       dodeque_result r = { queue_t::deque(), 0 };
       if (r.p == NULL) {
           // queue is empty - we can't be above target
           first_above_time_ = 0;
           return r;
       }

       // To span a large range of bandwidths, CoDel runs two
       // different AQMs in parallel.  One is sojourn-time-based
       // and takes effect when the time to send an MTU-sized
       // packet is less than target.  The 1st term of the "if"
       // below does this.  The other is backlog-based and takes
       // effect when the time to send an MTU-sized packet is >=
       // target.  The goal here is to keep the output link
       // utilization high by never allowing the queue to get
       // smaller than the amount that arrives in a typical
       // interarrival time (MTU-sized packets arriving spaced
       // by the amount of time it takes to send such a packet on
       // the bottleneck).  The 2nd term of the "if" does this.
       time_t sojourn_time = now - r.p->tstamp;
       if (sojourn_time < target_ || bytes() <= maxpacket_) {
           // went below - stay below for at least interval
           first_above_time_ = 0;
       } else {
           if (first_above_time_ == 0) {
               // just went above from below.  if still above at
               // first_above_time, will say it's ok to drop.
               first_above_time_ = now + interval_;
           } else if (now >= first_above_time_) {
               r.ok_to_drop = 1;
           }
       }
       return r;
   }

5.7.  Implementation considerations

   Since CoDel requires relatively little per-queue state and no direct
   communication or state sharing between the enqueue and dequeue
   routines, it's relatively simple to add it to almost any packet
   processing pipeline, including ASIC- or NPU-based forwarding
   engines.

   One issue to think about is dodeque's use of a 'bytes()' function to
   find out about how many bytes are currently in the queue.  This
   value does not need to be exact.  If the enqueue part of the
   pipeline keeps a running count of the total number of bytes it has
   put into the queue and the dequeue routine keeps a running count of
   the total bytes it has removed from the queue, 'bytes()' is just the
   difference between these two counters.  32 bit counters are more
   than adequate.  Enqueue has to update its counter once per packet
   queued, but it doesn't matter when (before, during or after the
   packet has been added to the queue).  The worst that can happen is a
   slight, transient underestimate of the queue size, which might cause
   a drop to be briefly deferred.
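   Another cost-sensitive spot is the division by sqrt(count) in
   control_law.  Below is a sketch of the integer-only approach
   mentioned in Section 5.6: maintain a fixed-point estimate of
   1/sqrt(count) with one Newton-Raphson step each time count changes.
   The scaling choices here are illustrative; see the Linux
   sch_codel.c for a production version.

   // Q16 fixed-point estimate of 1/sqrt(count_); 1.0 == 1 << 16.
   uint32_t invsqrt_ = 1 << 16;

   // One Newton-Raphson step for f(x) = 1/x^2 - count:
   //   x' = x * (3 - count * x^2) / 2
   // Called each time count_ changes by 1, so x stays near the root
   // and a single step per call is enough.
   void codel_queue_t::update_invsqrt()
   {
       uint64_t x  = invsqrt_;                    // Q16
       uint64_t x2 = x * x * (uint64_t)count_;    // Q32: count * x^2
       uint64_t t  = (3ULL << 32) - x2;           // Q32: 3 - count*x^2
       invsqrt_ = (uint32_t)((x * t) >> 33);      // Q16: x * t / 2
   }

   time_t codel_queue_t::control_law(time_t t)
   {
       // interval / sqrt(count) without floating point
       return t + (time_t)(((uint64_t)interval_ * invsqrt_) >> 16);
   }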
6.  Adapting and applying CoDel's building blocks

   CoDel is being implemented and tested in a range of environments.
   Dave Taht has been instrumental in the integration and distribution
   of bufferbloat solutions, including CoDel, and has set up a website
   and a mailing list for CeroWRT implementers
   (www.bufferbloat.net/projects/codel).  This is an active area of
   work and an excellent place to track developments.

6.1.  Validations and available code

   An experiment by Stanford graduate students successfully used the
   Linux CoDel to duplicate our published simulation work on CoDel's
   ability to follow drastic link rate changes; the writeup can be
   found at
   http://reproducingnetworkresearch.wordpress.com/2012/06/06/solving-
   bufferbloat-the-codel-way/ .

   Our ns-2 simulations are available at http://pollere.net/CoDel.html .
   Cable Labs has funded some additions to the simulator sfqcodel
   code, which have been made public.  The basic algorithm of CoDel
   remains unchanged, but we continue to experiment with the drop
   interval setting when resuming the drop state, inhibiting or
   canceling drop state when the number of bytes in the queue is
   small, and other minor details.  Our approach to changes is to make
   them only if we are convinced they do more good than harm, both
   operationally and in the implementation.  With this in mind, some
   of these issues may continue to evolve as we get more deployment
   and as the building blocks are applied to a wider range of
   problems.

   CoDel is being made available with the ns-2 distribution.

   Andrew McGregor has an ns-3 implementation of both CoDel and
   FQ_CoDel (https://github.com/dtaht/ns-3-dev ).

   CoDel is available in Linux.  Eric Dumazet has put CoDel into the
   Linux distribution.

6.2.  CoDel in the datacenter

   Nandita Dukkipati's team at Google was quick to realize that the
   CoDel building blocks could be applied to bufferbloat problems in
   datacenter servers, not just to Internet routers.  The Linux CoDel
   queueing discipline (Qdisc) was adapted in three ways to tackle
   this bufferbloat problem.

   1.  The default CoDel action was modified to be a direct feedback
       from Qdisc to the TCP layer at dequeue.  The direct feedback
       simply reduces TCP's congestion window just as congestion
       control would do in the event of drop.  The scheme falls back
       to ECN marking or packet drop if the TCP socket lock could not
       be acquired at dequeue.

   2.  Being located in the server makes it possible to monitor the
       actual RTT to use as CoDel's interval rather than making a
       "best guess" of RTT.  The CoDel interval is dynamically
       adjusted by using the maximum TCP round-trip time (RTT) of
       those connections sharing the same Qdisc/bucket.  In
       particular, there is a history entry of the maximum RTT
       experienced over the last second.  As a packet is dequeued, the
       RTT estimate is accessed from its TCP socket.  If the estimate
       is larger than the current CoDel interval, the CoDel interval
       is immediately refreshed to the new value.  If the CoDel
       interval is not refreshed for over a second, it is decreased to
       the history entry and the process is repeated.  The use of the
       dynamic TCP RTT estimate lets the interval adapt to the actual
       maximum value currently seen and thus lets the controller space
       its drop intervals appropriately.  (A sketch of this refresh
       logic appears after this list.)

   3.  Since the mathematics of computing the setpoint are invariant,
       a target of 5% of the RTT or CoDel interval was used here.
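   The interval-refresh logic of item 2 might be sketched as follows
   (our illustration only; the variable names and the per-dequeue RTT
   hook are assumptions, MS2TIME() is as in Section 5.3, and the
   production code differs):

   time_t interval_ = MS2TIME(100); // current CoDel interval
   time_t max_rtt_ = 0;             // max RTT seen since last refresh
   time_t last_refresh_ = 0;        // when interval_ last changed

   // Called at dequeue with the RTT estimate read from the packet's
   // TCP socket.
   void update_interval(time_t now, time_t sock_rtt)
   {
       if (sock_rtt > max_rtt_)
           max_rtt_ = sock_rtt;
       if (sock_rtt > interval_) {
           interval_ = sock_rtt;        // grow to new maximum at once
           last_refresh_ = now;
       } else if (now - last_refresh_ > MS2TIME(1000)) {
           interval_ = max_rtt_;        // decay toward recent maximum
           max_rtt_ = 0;
           last_refresh_ = now;
       }
   }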
   Non-data packets were not dropped, as these are typically small and
   sometimes critical control packets.  Being located on the server,
   there is no concern with misbehaving users scamming such a policy
   as there would be in an Internet router.

   In several datacenter workload benchmarks, which are typically
   bursty, CoDel reduced the queueing latencies at the Qdisc and
   thereby improved the mean and 99th percentile latencies from
   several tens of milliseconds to less than one millisecond.  The
   minimum-tracking part of the CoDel framework proved useful in
   disambiguating "good" queue versus "bad" queue, and was
   particularly helpful in controlling Qdisc buffers that are
   inherently bursty because of TCP Segmentation Offload.

7.  Security Considerations

   This document describes an active queue management algorithm for
   implementation in networked devices.  There are no specific
   security exposures associated with CoDel.

8.  IANA Considerations

   This document does not require actions by IANA.

9.  Conclusions

   CoDel provides very general, efficient, parameterless building
   blocks for queue management that can be applied to single or
   multiple queues in a variety of data networking scenarios.  It is
   a critical tool in solving bufferbloat.  CoDel's settings MAY be
   modified for other special-purpose networking applications.

   On-going projects are creating a deployable CoDel in Linux routers
   and experimenting with applying CoDel to stochastic queuing with
   very promising results.

10.  Acknowledgments

   The authors wish to thank Jim Gettys for the constructive nagging
   that made us get the work "out there" before we thought it was
   ready.  We also want to thank Dave Taht, Eric Dumazet, and the open
   source community for showing the value of getting it "out there"
   and for making it real.  We also wish to thank Nandita Dukkipati
   for contributions to Section 6 and for comments that helped to
   substantially improve this draft.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2.  Informative References

   [BB2011]   Gettys, J. and K. Nichols, "Bufferbloat: Dark Buffers in
              the Internet", Communications of the ACM 9(11),
              pp. 57-65.

   [BMPFQ]    Suter, B., "Buffer Management Schemes for Supporting TCP
              in Gigabit Routers with Per-flow Queueing", IEEE Journal
              on Selected Areas in Communications Vol. 17 Issue 6,
              June 1999, pp. 1159-1169.

   [CHARB2007]
              Dischinger, M., et al., "Characterizing Residential
              Broadband Networks", Proceedings of the Internet
              Measurement Conference, San Diego, CA, 2007.

   [CMNTS]    Allman, M., "Comments on Bufferbloat", Computer
              Communication Review Vol. 43 No. 1, January 2013,
              pp. 31-37.

   [CODEL2012]
              Nichols, K. and V. Jacobson, "Controlling Queue Delay",
              Communications of the ACM Vol. 55 No. 7, July 2012,
              pp. 42-50.

   [KLEIN81]  Kleinrock, L. and R. Gail, "An Invariant Property of
              Computer Network Power", International Conference on
              Communications, June 1981.
Mahdavi, "The Macroscopic 1202 Behavior of the TCP Congestion Avoidance Algorithm", ACM 1203 SIGCOMM Computer Communications Review Vol. 27 no. 1, Jan. 1204 2007, . 1206 [NETAL2010] 1207 Kreibich, C., et. al., "Netalyzr: Illuminating the Edge 1208 Network", Proceedings of the Internet Measurement 1209 Conference Melbourne, Australia, 2010, . 1211 [REDL1998] 1212 Nichols, K., Jacobson, V., and K. Poduri, "RED in a 1213 Different Light", Tech report, September, 1999, 1214 . 1217 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 1218 RFC 896, January 1984. 1220 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 1221 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 1222 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 1223 S., Wroclawski, J., and L. Zhang, "Recommendations on 1224 Queue Management and Congestion Avoidance in the 1225 Internet", RFC 2309, April 1998. 1227 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1228 Control", RFC 2581, April 1999. 1230 [SFQ1990] McKenney, P., "Stochastic Fairness Queuing", Proceedings 1231 of IEEE INFOCOMM 90 San Francisco, 1990, . 1233 [TSV2011] Gettys, J., "Bufferbloat: Dark Buffers in the Internet", 1234 IETF 80 presentation to Transport Area Open Meeting, 1235 March, 2011, 1236 . 1238 [TSV84] Jacobson, V., "CoDel talk at TSV meeting IETF 84", 1239 . 1242 [VANQ2006] 1243 Jacobson, V., "A Rant on Queues", talk at MIT Lincoln 1244 Labs, Lexington, MA July, 2006, 1245 . 1247 Authors' Addresses 1249 Kathleen Nichols 1250 Pollere, Inc. 1251 PO Box 370201 1252 Montara, CA 94037 1253 USA 1255 Email: nichols@pollere.com 1257 Van Jacobson 1258 Google 1260 Email: vanj@google.com 1262 Andrew McGregor 1263 Google 1265 Email: andrewmcgr@google.com 1266 Jana Iyengar 1267 Google 1269 Email: jri@google.com