AQM                                                           K. Nichols
Internet-Draft                                             Pollere, Inc.
Intended status: Experimental                                V. Jacobson
Expires: May 4, 2017                                    A. McGregor, ed.
                                                         J. Iyengar, ed.
                                                                  Google
                                                        October 31, 2016

              Controlled Delay Active Queue Management
                       draft-ietf-aqm-codel-05

Abstract

   This document describes a general framework called CoDel (Controlled Delay) that controls bufferbloat-generated excess delay in modern networking environments. CoDel consists of an estimator, a setpoint, and a control loop. It requires no configuration in normal Internet deployments. CoDel comprises some major technical innovations and has been made available as open source so that the framework can be applied by the community to a range of problems. It has been implemented in Linux (and is available in the Linux distribution) and deployed in some networks at the consumer edge. In addition, the framework has been successfully applied in other ways.

   Note: Code Components extracted from this document must include the license as included with the code in Section 5.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2017.
Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Conventions used in this document  . . . . . . . . . . . . .   5
   3.  Building Blocks of Queue Management  . . . . . . . . . . . .   5
     3.1.  Estimator  . . . . . . . . . . . . . . . . . . . . . . .   6
     3.2.  Setpoint . . . . . . . . . . . . . . . . . . . . . . . .   8
     3.3.  Control Loop . . . . . . . . . . . . . . . . . . . . . .   9
   4.  Putting it together: queue management for the network edge .  12
     4.1.  Overview of CoDel AQM  . . . . . . . . . . . . . . . . .  12
     4.2.  Non-starvation . . . . . . . . . . . . . . . . . . . . .  13
     4.3.  Using the interval . . . . . . . . . . . . . . . . . . .  13
     4.4.  The target setpoint  . . . . . . . . . . . . . . . . . .  14
     4.5.  Use with multiple queues . . . . . . . . . . . . . . . .  15
     4.6.  Use of stochastic bins or sub-queues to improve
           performance  . . . . . . . . . . . . . . . . . . . . . .  15
     4.7.  Setting up CoDel AQM . . . . . . . . . . . . . . . . . .  16
   5.  Annotated Pseudo-code for CoDel AQM  . . . . . . . . . . . .  17
     5.1.  Data Types . . . . . . . . . . . . . . . . . . . . . . .  18
     5.2.  Per-queue state (codel_queue_t instance variables). . .  19
     5.3.  Constants  . . . . . . . . . . . . . . . . . . . . . . .  19
     5.4.  Enqueue routine  . . . . . . . . . . . . . . . . . . . .  19
     5.5.  Dequeue routine  . . . . . . . . . . . . . . . . . . . .  19
     5.6.  Helper routines  . . . . . . . . . . . . . . . . . . . .  21
     5.7.  Implementation considerations  . . . . . . . . . . . . .  22
   6.  Adapting and applying CoDel's building blocks  . . . . . . .  23
     6.1.  Validations and available code . . . . . . . . . . . . .  23
     6.2.  CoDel in the datacenter  . . . . . . . . . . . . . . . .  24
   7.  Security Considerations  . . . . . . . . . . . . . . . . . .  25
   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . .  25
   9.  Conclusions  . . . . . . . . . . . . . . . . . . . . . . . .  25
   10. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . .  25
   11. References . . . . . . . . . . . . . . . . . . . . . . . . .  25
     11.1.  Normative References  . . . . . . . . . . . . . . . . .  25
     11.2.  Informative References  . . . . . . . . . . . . . . . .  26
     11.3.  URIs  . . . . . . . . . . . . . . . . . . . . . . . . .  27
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  27

1. Introduction

   The need for queue management has been evident for decades. The "persistently full buffer" problem has been discussed in the IETF community since the early 1980s [RFC896]. The IRTF's End-to-End Working Group called for the deployment of active queue management (AQM) to solve the problem in 1998 [RFC2309].
Despite this awareness, the problem has only gotten worse as Moore's Law growth in memory density fueled an exponential increase in buffer pool size. Efforts to deploy AQM have been frustrated by difficult configuration and negative impact on network utilization. This problem, recently christened "bufferbloat" [TSV2011] [BB2011], has become increasingly important throughout the Internet but particularly at the consumer edge. Recently, queue management has become more critical due to increased consumer use of the Internet, mixing large video transactions with time-critical VoIP and gaming. Gettys [TSV2011] [BB2011] has been instrumental in publicizing the problem and the measurement work [CHARB2007] [NETAL2010], and in coining the term "bufferbloat". Large content distributors such as Google have observed that bufferbloat is ubiquitous and adversely affects performance for many users. The solution is an effective AQM that remediates bufferbloat at a bottleneck while "doing no harm" at hops where buffers are not bloated.

   The development and deployment of effective active queue management has been hampered by persistent misconceptions about the cause and meaning of packet queues in network buffers. Network buffers exist to absorb the packet bursts that occur naturally in statistically multiplexed networks. Buffers helpfully absorb the queues created by such reasonable packet network behavior as short-term mismatches in traffic arrival and departure rates that arise from upstream resource contention, transport conversation startup transients, and/or changes in the number of conversations sharing a link. Unfortunately, other less useful network behaviors can cause queues to fill, and their effects are not nearly as benign. Discussion of these issues and the reason why the solution is not simply smaller buffers can be found in [RFC2309], [VANQ2006], [REDL1998], and [CODEL2012]. To understand queue management, it is critical to understand the difference between the necessary, useful "good" queue and the counterproductive "bad" queue.

   Many approaches to active queue management (AQM) have been developed over the past two decades, but none has been widely deployed due to performance problems. When designed with the wrong conceptual model for queues, AQMs have limited operational range, require a lot of configuration tweaking, and frequently impair rather than improve performance. Today, the demands on an effective AQM are even greater: many network devices must work across a range of bandwidths, either due to link variations or due to the mobility of the device.
The CoDel approach is designed to meet the following goals:

   o  parameterless for normal operation - has no knobs for operators, users, or implementers to adjust

   o  treat "good queue" and "bad queue" differently, that is, keep delay low while permitting necessary bursts of traffic

   o  control delay while insensitive (or nearly so) to round trip delays, link rates, and traffic loads; this goal is to "do no harm" to network traffic while controlling delay

   o  adapt to dynamically changing link rates with no negative impact on utilization

   o  simple and efficient implementation (can easily span the spectrum from low-end, Linux-based access points and home routers up to high-end commercial router silicon)

   CoDel has five major innovations that distinguish it from prior AQMs: (1) use of the local queue minimum to track congestion ("bad queue"); (2) use of an efficient single state variable representation of that tracked statistic; (3) use of packet sojourn time as the observed datum, rather than packets, bytes, or rates; (4) use of a mathematically determined setpoint derived from maximizing the network power metric; and (5) a modern state-space controller.

   CoDel configures itself based on a round-trip time metric which can be set to 100ms for the normal, terrestrial Internet. With no changes to parameters, we have found CoDel to work across a wide range of conditions, with varying links and the full range of terrestrial round trip times.

   Since CoDel was first published [CODEL2012], a number of implementers have been using and adapting it with promising results. Much of this work is collected at http://www.bufferbloat.net/projects/codel . CoDel has been implemented in Linux very efficiently and should lend itself to silicon implementation. CoDel is well adapted for use in multiple queued devices and has been used by Eric Dumazet with multiple queues in a sophisticated queue management approach, fq_codel [FQ-CODEL-ID].

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying [RFC2119] significance.

3. Building Blocks of Queue Management

   Two decades of work on queue management failed to yield an approach that could be widely deployed in the Internet. Careful tuning for particular usages has enabled queue management techniques to "kind of" work; that is, they have been able to decrease queueing delays, but only at the undue cost of link utilization and/or fairness. At the heart of queue management is the notion of "good queue" and "bad queue" and the search for ways to get rid of the bad queue (which only adds delay) while preserving the good queue (which provides for good utilization). This section explains queueing, both good and bad, and covers the innovative CoDel building blocks that can be used to manage packet buffers to keep their queues in the "good" range.

   Packet queues form in buffers facing bottleneck links, i.e., where the line rate goes from high to low or where many links converge.
The well-known bandwidth-delay product (sometimes called "pipe size") is the bottleneck's bandwidth multiplied by the sender-receiver-sender round-trip delay, and is the amount of data that has to be in transit between two hosts in order to run the bottleneck link at 100% utilization. To explore how queues can form, consider a long-lived TCP connection with a 25 packet window sending through a connection with a bandwidth-delay product of 20 packets. After an initial burst of packets, the connection will settle into a five packet (+/- 1) standing queue; this standing queue size is determined by the mismatch between the window size and the pipe size, and is unrelated to the connection's sending rate. The connection has 25 packets in flight at all times, but only 20 packets arrive at the destination over a round trip time. If the TCP connection has a 30 packet window, the queue will be ten packets with no change in sending rate. Similarly, if the window is 20 packets, there will be no queue, but the sending rate is the same. Nothing can be inferred about the sending rate from the queue size, and any queue other than transient bursts only creates delays in the network. The sender needs to reduce the number of packets in flight rather than the sending rate.
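   To make the pipe-size arithmetic concrete, here is the same computation with assumed numbers (a 10 Mbps bottleneck, 100 ms RTT, and 1500-byte packets are our illustrative choices, not values from the example above):

      // Illustrative arithmetic only (assumed rate, RTT, packet size):
      //   pipe size = bandwidth * round-trip delay
      //             = 10,000,000 bit/s * 0.1 s = 1,000,000 bits
      //             = 125,000 bytes ~= 83 packets of 1500 bytes
      // A sender holding a 100-packet window over this path keeps a
      // standing queue of 100 - 83 = 17 packets at the bottleneck,
      // independent of its sending rate.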
   In the above example (the 25 packet window over a 20 packet pipe), the five packet standing queue can be seen to contribute nothing but delay to the connection and thus is clearly "bad queue". If, in our example, there is a single bottleneck link and it is much slower than the link that feeds it (say, a high-speed ethernet link into a limited DSL uplink), a 20 packet buffer at the bottleneck might be necessary to temporarily hold the 20 packets in flight to keep the bottleneck link's utilization high. The burst of packets should drain completely (to 0 or 1 packets) within a round trip time, and this transient queue is "good queue" because it allows the connection to keep the 20 packets in flight and the bottleneck link to be fully utilized. In terms of the delay experienced, the "good queue" goes away in about a round trip time, while "bad queue" hangs around for longer, causing delays.

   Effective queue management detects "bad queue" while ignoring "good queue" and takes action to get rid of the bad queue when it is detected. The goal is a queue controller that accomplishes this objective. To control a queue, we need three basic components:

   o  Estimator - figure out what we've got

   o  Setpoint - know what we want

   o  Control loop - if what we've got isn't what we want, we need a way to move it there

3.1. Estimator

   The estimator both observes the queue and detects when good queue turns to bad queue and vice versa. CoDel has two innovations in its estimator: what is observed as an indicator of queue and how the observations are used to detect good/bad queue.

   Queue length has been widely used as an observed indicator of congestion and is frequently conflated with sending rate. Use of queue length as a metric is sensitive to how and when the length is observed. A high speed arrival link to a buffer serviced at a much lower rate can rapidly build up a queue that might disperse completely or down to a single packet before a round trip time has elapsed. If the queue length is monitored at packet arrival (as in original RED) or departure time, every packet will see a queue with one possible exception. If the queue length itself is time sampled (as recommended in [REDL1998]), a truer picture of the queue's occupancy can be gained, at the expense of considerable implementation complexity.

   The use of queue length is further complicated in networks that are subject to both short- and long-term changes in available link rate (as in WiFi). Link rate drops can result in a spike in queue length that should be ignored unless it persists. It is not the queue length that should be controlled but the amount of excess delay packets experience due to a persistent or standing queue, which means that the packet sojourn time in the buffer is exactly what we want to track. Tracking the packet sojourn times in the buffer observes the actual delay experienced by each packet. Sojourn time allows queue management to be independent of link rate, gives superior performance to use of buffer size, and is directly related to user-visible performance. It works regardless of line rate changes or link sharing by multiple queues (which the individual queues may experience as changing rates).

   Consider a link shared by two queues with different priorities. Packets that arrive at the high priority queue are sent as soon as the link is available, while packets in the other queue have to wait until the high priority queue is empty (i.e., a strict priority scheduler). The number of packets in the high priority queue might be large, but the queue is emptied quickly and the amount of time each packet spends enqueued (the sojourn time) is not large. The other queue might have a smaller number of packets, but packet sojourn times will include the waiting time for the high priority packets to be sent. This makes the sojourn time a good sample of the congestion that each separate queue is experiencing. This example also shows how the metric of sojourn time is independent of the number of queues or the service discipline used, and is instead indicative of congestion seen by the individual queues.

   How can observed sojourn time be used to separate good queue from bad queue? Although averages, especially of queue length, have previously been widely used as an indicator of bad queue, their efficacy is questionable. Consider the burst that disperses every round trip time. The average queue will be one-half the burst size, though this might vary depending on when the average is computed and the timing of arrivals. The average queue sojourn time would be one-half the time it takes to clear the burst. The average then would indicate a persistent queue where there is none. Instead of averages, we recommend tracking the minimum sojourn time; if there is even one packet that has a zero sojourn time, then there is no persistent queue. The value of the minimum in detecting persistent queue is apparent when looking at graphs of queue delay.

   A persistent queue can be detected by tracking the (local) minimum queue delay packets experience. To ensure that this minimum value does not become stale, it has to have been experienced recently, i.e., during an appropriate past time interval. This interval is the maximum amount of time a minimum value is considered to be in effect, and is related to the amount of time it takes for the largest expected burst to drain. Conservatively, this interval should be at least a round trip time to avoid falsely detecting a persistent queue and not a lot more than a round trip time to avoid delay in detecting the persistent queue. This suggests that the appropriate interval value is the maximum round-trip time of all the connections sharing the buffer.

   (The following key insight makes computation of the local minimum efficient: it is sufficient to keep a single state variable of how long the minimum has been above or below a target value, rather than retaining all the local values to compute the minimum, leading to both storage and computational savings. We use this insight in the pseudo-code for CoDel later in the draft.)
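   In pseudo-code form, that single state variable is first_above_time_, the time by which the continuously observed minimum must have stayed above target before a persistent queue is declared. The condensed sketch below uses the variable names of Section 5; the full logic appears in dodequeue() in Section 5.6:

      // Condensed sketch of single-state-variable minimum tracking
      // (full version in Section 5.6).
      if (sojourn_time < target_) {
          // a packet with a low sojourn time: no persistent queue
          first_above_time_ = 0;
      } else if (first_above_time_ == 0) {
          // minimum just rose above target; it must stay above
          // until now + interval_ before we act
          first_above_time_ = now + interval_;
      } else if (now >= first_above_time_) {
          // minimum has been above target for a full interval:
          // persistent ("bad") queue detected
      }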
   These two innovations, use of sojourn time as the observed value and use of the local minimum as the statistic monitored for queue congestion, are key to CoDel's estimator building block. The local minimum sojourn time provides an accurate and robust measure of standing queue and has an efficient implementation. In addition, use of the minimum sojourn time has important advantages in implementation. The minimum packet sojourn time can only be decreased when a packet is dequeued, which means that all the work of CoDel can take place when packets are dequeued for transmission and that no locks are needed in the implementation. The minimum is the only statistic with this property.

   A more detailed explanation with many pictures can be found in http://pollere.net/Pdfdocs/QrantJul06.pdf and http://www.ietf.org/proceedings/84/slides/slides-84-tsvarea-4.pdf .

3.2. Setpoint

   Now that we have a robust way of detecting standing queue, we need a setpoint that tells us when to act. If the controller is set to take action as soon as the estimator has a non-zero value, the average drop rate will be maximized, which minimizes TCP goodput [MACTCP1997]. Also, this policy results in no backlog over time (no persistent queue), which negates much of the value of having a buffer, since it maximizes the bottleneck link bandwidth lost due to normal stochastic variation in packet interarrival time. We want a setpoint that maximizes utilization while minimizing delay. Early in the history of packet networking, Kleinrock developed the analytic machinery to do this using a quantity he called 'power', which is the ratio of a normalized throughput to a normalized delay [KLEIN81].

   It is straightforward to derive an analytic expression for the average goodput of a TCP conversation at a given round-trip time r and setpoint f (where f is expressed as a fraction of r). Reno TCP, for example, yields:

      goodput = r (3 + 6f - f^2) / (4 (1+f))

   Since the peak queue delay is simply f r, power is solely a function of f since the r's in the numerator and denominator cancel:

      power is proportional to (1 + 2f - 1/3 f^2) / (1 + f)^2

   As Kleinrock observed, the best operating point, in terms of bandwidth / delay tradeoff, is the peak power point, since points off the peak represent a higher cost (in delay) per unit of bandwidth. The power vs. f curve for any AIMD TCP is monotone decreasing. But the curve is very flat for f < 0.1, followed by an increasing curvature with a knee around f = 0.2, then a steep, almost linear falloff [TSV84]. Since the previous equation showed that goodput is monotone increasing with f, the best operating point is near the right edge of the flat top, since that represents the highest goodput achievable for a negligible increase in delay. However, since the r in the model is a conservative upper bound, a target of 0.1r runs the risk of pushing shorter RTT connections over the knee and giving them higher delay for no significant goodput increase. Generally, a more conservative target of 0.05r offers a good utilization vs. delay tradeoff while giving enough headroom to work well with a large variation in real RTT.
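   To make the shape of the curve concrete, the short program below evaluates the power expression at a few values of f; the sample points are our own illustrative choices, not taken from the draft:

      #include <stdio.h>

      /* Evaluate power ~ (1 + 2f - f^2/3) / (1 + f)^2 at a few
       * sample setpoints f (peak queue delay as a fraction of RTT). */
      int main(void)
      {
          const double f[] = { 0.0, 0.05, 0.1, 0.2, 0.5 };
          for (int i = 0; i < 5; i++) {
              double p = (1 + 2*f[i] - f[i]*f[i]/3)
                         / ((1 + f[i]) * (1 + f[i]));
              printf("f = %4.2f  power ~ %.3f\n", f[i], p);
          }
          return 0;
      }
      // Output: 1.000, 0.997, 0.989, 0.963, 0.852 - nearly flat
      // below f = 0.1, falling away past the knee at f ~ 0.2.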
   As the above analysis shows, a very small standing queue gives close to 100% utilization of the bottleneck link. While this result was for Reno TCP, the derivation uses only properties that must hold for any 'TCP friendly' transport. We have verified by both analysis and simulation that this result holds for Reno, Cubic, and Westwood [TSV84]. This results in a particularly simple form for the setpoint: the ideal range for the permitted standing queue is between 5% and 10% of the TCP connection's RTT. Thus, target is simply 5% of the interval of Section 3.1.

3.3. Control Loop

   Section 3.1 describes a simple, reliable way to measure bad (persistent) queue. Section 3.2 shows that TCP congestion control dynamics gives rise to a setpoint for this measure that's a provably good balance between enhancing throughput and minimizing delay, and that this setpoint is a constant fraction of the same 'largest average RTT' interval used to distinguish persistent from transient queue. The only remaining building block needed for a basic AQM is a 'control loop' algorithm to effectively drive the queueing system from any 'persistent queue above target' state to a state where the persistent queue is below target.

   Control theory provides a wealth of approaches to the design of control loops. Most of classical control theory deals with the control of linear, time-invariant, single-input-single-output (SISO) systems. Control loops for these systems generally come from a (well understood) class known as Proportional-Integral-Derivative (PID) controllers. Unfortunately, a queue is not a linear system, and an AQM operates at the point of maximum non-linearity (where the output link bandwidth saturates, so increased demand creates delay rather than higher utilization). Output queues are also not time-invariant, since traffic is generally a mix of connections which start and stop at arbitrary times and which can have radically different behaviors ranging from "open loop" UDP audio/video to "closed-loop" congestion-avoiding TCP. Finally, the constantly changing mix of connections (which can't be converted to a single 'lumped parameter' model because of their transfer function differences) makes the system multi-input-multi-output (MIMO), not SISO.

   Since queueing systems match none of the prerequisites for a classical controller, a modern state-space controller is a better approach, with states 'no persistent queue' and 'has persistent queue'.
Since Internet traffic mixtures change rapidly and unpredictably, a noise- and error-tolerant adaptation algorithm like Stochastic Gradient is a good choice. Since there's essentially no information in the amount of persistent queue [TSV84], the adaptation should be driven by how long it has persisted.

   Consider the two extremes of traffic behavior, a single open-loop UDP video stream and a single, long-lived TCP bulk data transfer. If the average bandwidth of the UDP video stream is greater than the bottleneck link rate, the link's queue will grow, and the controller will eventually enter 'has persistent queue' state and start dropping packets. Since the video stream is open loop, its arrival rate is unaffected by drops, so the queue will persist until the average drop rate is greater than the output bandwidth deficit (= average arrival rate - average departure rate); the job of the adaptation algorithm is to discover this rate. For this example, the adaptation could consist of simply estimating the arrival and departure rates and then dropping at a rate slightly greater than their difference. But this class of algorithm won't work at all for the bulk data TCP stream. TCP runs in closed-loop flow balance [TSV84], so its arrival rate is almost always exactly equal to the departure rate - the queue isn't the result of a rate imbalance but rather a mismatch between the TCP sender's window and the source-destination-source round-trip path capacity (i.e., the connection's bandwidth-delay product). The sender's TCP congestion avoidance algorithm will slowly increase the send window (one packet per round-trip time) [RFC2581], which will eventually cause the bottleneck to enter 'has persistent queue' state. But, since the average input rate is the same as the average output rate, the rate deficit estimation that gave the correct drop rate for the video stream would compute a drop rate of zero for the TCP stream. However, if the output link drops one packet as it enters 'has persistent queue' state, when the sender discovers this (via TCP's normal packet loss repair mechanisms) it will reduce its window by a factor of two [RFC2581]; so, one round-trip time after the drop, the persistent queue will go away.

   If there were N TCP conversations sharing the bottleneck, the controller would have to drop O(N) packets, one from each conversation, to make all the conversations reduce their windows and get rid of the persistent queue. If the traffic mix consists of short (<= bandwidth-delay product) conversations, the aggregate behavior becomes more like the open-loop video example, since each conversation is likely to have already sent all its packets by the time it learns about a drop, so each drop has negligible effect on subsequent traffic.

   The controller does not know the number, duration, or kind of conversations creating its queue, so it has to learn the appropriate response. Since single drops can have a large effect if the degree of multiplexing (the number of active conversations) is small, dropping at too high a rate is likely to have a catastrophic effect on throughput. Dropping at a low rate (< 1 packet per round-trip time) and then increasing the drop rate slowly until the persistent queue goes below target is unlikely to overdrop and is guaranteed to eventually dissipate the persistent queue.
This stochastic gradient learning procedure is the core of CoDel's control loop (the gradient exists because a drop always reduces the (instantaneous) queue, so an increasing drop rate always moves the system "down" toward no persistent queue, regardless of traffic mix).

   The "next drop time" is decreased in inverse proportion to the square root of the number of drops since the dropping state was entered, using the well-known nonlinear relationship of drop rate to throughput to get a linear change in throughput [REDL1998], [MACTCP1997].
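   Concretely, with the control_law() of Section 5.6 and the default 100ms interval, the gaps between successive drops shrink as:

      // Drop spacing under the sqrt control law (illustration,
      // using the default interval of 100 ms):
      //   interval/sqrt(1) = 100 ms    (1st to 2nd drop)
      //   interval/sqrt(2) ~=  71 ms   (2nd to 3rd drop)
      //   interval/sqrt(3) ~=  58 ms
      //   interval/sqrt(4) =   50 ms
      // i.e., the drop rate ramps up gently until the standing
      // queue drains below target.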
   Since the best rate to start dropping is at slightly more than one packet per RTT, the controller's initial drop rate can be directly derived from the estimator's interval, defined in Section 3.1. When the minimum sojourn time first crosses the target and CoDel drops a packet, the earliest the controller could see the effect of the drop is the round trip time (interval) plus the local queue wait time (target). If the next drop happens any earlier than this time (interval + target), CoDel will overdrop. In practice, the local queue waiting time tends to vary, so making the initial drop spacing (i.e., the time to the second drop) be exactly the minimum possible also leads to overdropping. Analysis of simulation and real-world measured data shows that the 75th percentile magnitude of this variation is less than the target, and so the initial drop spacing SHOULD be set to the estimator's interval (i.e., initial drop spacing = interval) to ensure that the controller has accounted for acceptable congestion delays.

   Use of the minimum statistic lets the controller be placed in the dequeue routine with the estimator. This means that the control signal (the drop) can be sent at the first sign of bad queue (as indicated by the sojourn time) and that the controller can stop acting as soon as the sojourn time falls below the setpoint. Dropping at dequeue has both implementation and control advantages.

4. Putting it together: queue management for the network edge

   CoDel was initially designed as a bufferbloat solution for the consumer network edge. The CoDel building blocks are able to adapt to different or time-varying link rates, to be easily used with multiple queues, to have excellent utilization with low delay, and to have a simple and efficient implementation. The only setting CoDel requires is its interval value, and as 100ms satisfies that definition for normal Internet usage, CoDel can be parameter-free for consumer use. CoDel was released to the open source community, where it has been widely promulgated and adapted to many problems. CoDel's efficient implementation and lack of configuration are unique features and make it suitable to manage modern packet buffers. For more background and results on CoDel, see [CODEL2012] and http://pollere.net/Codel.html .

4.1. Overview of CoDel AQM

   To ensure that link utilization is not adversely affected, CoDel's estimator sets its target to the setpoint that optimizes power, and CoDel's controller does not drop packets when the drop would leave the queue empty or with fewer than a maximum transmission unit (MTU) worth of bytes in the buffer. Section 3.2 showed that the ideal setpoint is 5-10% of the connection RTT. In the open terrestrial-based Internet, especially at the consumer edge, we expect most unbloated RTTs to have a ceiling of 100ms [CHARB2007]. Using this RTT gives a minimum target of 5ms, and the interval for tracking the minimum is 100ms. In practice, uncongested links will see sojourn times below target more often than once per RTT, so the estimator is not overly sensitive to the value of the interval.

   When the estimator finds a persistent delay above target, the controller enters the drop state, where a packet is dropped and the next drop time is set. As discussed in Section 3.3, the initial next drop spacing is intended to be long enough to give the endpoints time to react to the single drop and so SHOULD be set to a value equal to the interval. If the estimator's output falls below target, the controller cancels the next drop and exits the drop state. (The controller is more sensitive than the estimator to an overly short interval, since an unnecessary drop would occur and lower link utilization.) If the next drop time is reached while the controller is still in drop state, the packet being dequeued is dropped, and the next drop time is recalculated.

   Additional logic prevents re-entering the dropping state too soon after exiting it and resumes the dropping state at a recent control level, if one exists. This logic is described more precisely in the pseudo-code below. Additional work is required to determine the frequency and importance of re-entering the dropping state.

   Note that CoDel AQM only enters its dropping state when the local minimum sojourn delay has exceeded target for a time interval long enough for normal bursts to dissipate, ensuring that a burst of packets that fits in the pipe will not be dropped.

4.2. Non-starvation

   CoDel's goals are to control delay with little or no impact on link utilization and to be deployed on a wide range of link bandwidths, including variable-rate links, without reconfiguration. To keep from making drops when it would starve the output link, CoDel makes another check before dropping to see if at least an MTU worth of bytes remains in the buffer. If not, the packet SHOULD NOT be dropped and, therefore, CoDel exits the drop state. The MTU size can be set adaptively to the largest packet seen so far or can be read from the driver.

4.3. Using the interval

   The interval is chosen to give endpoints time to react to a drop without being so long that response times suffer. CoDel's estimator, setpoint, and control loop all use the interval. Understanding their derivation shows that CoDel is the most sensitive to the value of interval for single long-lived TCPs, with a decreased sensitivity for traffic mixes. This is fortunate, as RTTs vary across connections and are not known a priori. The best policy is to use an interval slightly larger than the RTT seen by most of the connections using a link, a value that can be determined as the largest RTT seen if the value is not an outlier (use of a 95-99th percentile value should work). In practice, this value is not known or measured (though see Section 6.2 for an application where interval is measured). An interval setting of 100ms works well across a range of RTTs from 10ms to 1 second (excellent performance is achieved in the range from 10ms to 300ms).
For devices intended for the normal terrestrial Internet, interval SHOULD have a value of 100ms. This will only cause overdropping where a long-lived TCP has an RTT longer than 100ms and there is little or no mixing with other connections through the link.

   Some confusion concerns the roles of the target setpoint and the minimum-tracking interval. In particular, some experimenters believe the value of target needs to be increased when the lower layers have a bursty nature, where packets are transmitted for short periods interspersed with idle periods where the link is waiting for permission to send. CoDel's estimator will "see" the effective transmission rate over an interval, and increasing target will just lead to longer queue delays. On the other hand, where a significant additional delay is added to the intrinsic round trip time of most or all packets due to the waiting time for a transmission, it is necessary to increase interval by that extra delay. That is, target SHOULD NOT be adjusted, but interval MAY need to be adjusted, as sketched below. For more on this (and pictures) see http://pollere.net/Pdfdocs/noteburstymacs.pdf
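   A configuration sketch of that adjustment; the variable names and the 40 ms example access delay are our own, purely illustrative values:

      // Illustrative only: adjust interval (not target) for a MAC
      // that adds a per-transmission access delay.
      time_t base_interval    = MS2TIME(100); // normal Internet ceiling
      time_t mac_access_delay = MS2TIME(40);  // example extra wait for
                                              // permission to send
      time_t interval_ = base_interval + mac_access_delay;
      time_t target_   = MS2TIME(5);          // unchanged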
4.4. The target setpoint

   The target is the maximum acceptable persistent queue delay above which CoDel is dropping or preparing to drop and below which CoDel will not drop. The calculations of Section 3.2 showed that the best setpoint is 5-10% of the RTT, with the low end of 5% preferred. We used simulation to explore the impact when TCPs are mixed with other traffic and with connections of different RTTs. Accordingly, we experimented extensively with values in the 5-10% of RTT range and, overall, used target values between 1 and 20 milliseconds for RTTs from 30 to 500ms and link bandwidths of 64Kbps to 100Mbps to experimentally explore the setpoint that gives consistently high utilization while controlling delay across a range of bandwidths, RTTs, and traffic loads. Our results were notably consistent with the mathematics of Section 3.2. Below a target of 5ms, utilization suffers for some conditions and traffic loads; above 5ms, we saw very little or no improvement in utilization. Thus, target SHOULD be set to 5ms for normal Internet traffic.

   On a congested (but not overloaded) CoDel link with traffic composed solely or primarily of long-lived TCP flows, the median delay through the link will tend to the target. For bursty traffic loads and for overloaded conditions (where it is difficult or impossible for all the arriving flows to be accommodated), the median queues will be longer than target.

   The non-starvation drop inhibit feature dominates where the link rate becomes very small. By inhibiting drops when there is less than an (outbound link) MTU worth of bytes in the buffer, CoDel adapts to very low bandwidth links. This is shown in [CODEL2012], and interested parties should see the discussion of results there. Unpublished studies were carried out down to 64Kbps. The drop inhibit condition can be expanded to include a test to retain sufficient bytes or packets to fill an allocation in a request-and-grant MAC.

   Sojourn times must remain above the target for an entire interval in order to enter the drop state. Any packet with a sojourn time less than the target will reset the time that the queue was last below the target. Since Internet traffic has very dynamic characteristics, the actual sojourn delay experienced by packets varies greatly and is often less than the target unless the overload is excessive. When a link is not overloaded, it is not a bottleneck, and packet sojourn times will be small or nonexistent. In the usual case, there are only one or two places along a path where packets will encounter a bottleneck (usually at the edge), so the total amount of queueing delay experienced by a packet should be less than 10ms even under extremely congested conditions. This net delay is substantially lower than common current queueing delays on the Internet that grow to orders of seconds [NETAL2010], [CHARB2007].

4.5. Use with multiple queues

   Unlike other AQMs, CoDel is easily adapted to multiple queue systems. With other approaches, there is always a question of how to account for the fact that each queue receives less than the full link rate over time and usually sees a varying rate over time. This is exactly what CoDel excels at: using a packet's sojourn time in the buffer completely circumvents this problem. In a multiple-queue setting, a separate CoDel algorithm runs on each queue, but each CoDel instance uses the packet sojourn time the same way a single-queue CoDel does. Just as a single-queue CoDel adapts to changing link bandwidths [CODEL2012], so does a multiple-queue CoDel system. As an optimization to avoid queueing more than necessary, when testing for queue occupancy before dropping, the total occupancy of all queues sharing the same output link should be used. This property of CoDel has been exploited in fq_codel, briefly discussed in the next section and in more detail in [FQ-CODEL-ID].

4.6. Use of stochastic bins or sub-queues to improve performance

   Shortly after the release of the CoDel pseudocode, Eric Dumazet created fq_codel, applying CoDel to each bin, or queue, used with stochastic fair queueing. (To understand further, see [SFQ1990] or the Linux sfq documentation.) fq_codel hashes on the packet header fields to determine a specific bin, or sub-queue, for each five-tuple flow and runs CoDel on each bin or sub-queue, thus creating a well-mixed output flow and obviating issues of reverse path flows (including "ack compression"). Dumazet's code is part of the CeroWrt project code at the bufferbloat.net web site and is described in [FQ-CODEL-ID]. Andrew McGregor has implemented a version of fq_codel for the network simulator ns-3 (http://www.nsnam.org).

   We have also experimented with a similar multi-queue approach by creating an ns-2 simulator code module, sfqcodel, which uses Stochastic Fair Queueing (SFQ) to isolate flows into bins, each running CoDel. This setup has provided excellent results: median queues remain small across a range of traffic patterns that includes bidirectional file transfers (that is, the same traffic sent in both directions on a link), constant bit-rate VoIP-like flows, and emulated web traffic; utilizations are consistently better than single-queue CoDel, generally very close to 100%. Our code, intended for simulation experiments, is available at http://pollere.net/Codel.html and is being integrated into the ns-2 distribution.
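   The fragment below sketches the general binning idea (hash a flow's 5-tuple to pick a sub-queue, each with its own CoDel state). It is our illustration of the approach, not the fq_codel algorithm of [FQ-CODEL-ID], which adds its own scheduler among bins; NBINS and hash5tuple() are hypothetical names:

      // Illustrative sketch of stochastic binning with per-bin CoDel.
      #define NBINS 16              // our choice; see the bin-count
                                    // discussion that follows
      codel_queue_t bins[NBINS];

      void binned_enqueue(packet_t* pkt)
      {
          // Hash the 5-tuple (addresses, ports, protocol) so all
          // packets of a flow share one bin and its CoDel state.
          u_int h = hash5tuple(pkt) % NBINS;  // hypothetical helper
          bins[h].enqueue(pkt);
      }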
   A number of open issues should be studied. In particular, if the number of different queues or bins is too large, the scheduling will be the dominant factor, not the AQM; it is NOT the case that more bins are always better. In our simulations, we have found good behavior across mixed traffic types with smaller numbers of queues, 8-16 for a 5Mbps link. This configuration appears to give the best behavior for voice, web browsing, and file transfers, where increased numbers of bins seem to favor file transfers at the expense of the other traffic. Our work has been very preliminary, and we encourage others to take this up and to explore analytic modeling. It would be instructive to see the effects of different numbers of bins on a range of traffic models, something like an updated version of [BMPFQ].

   Implementers SHOULD use the fq_codel multiple-queue approach if possible, as it deals with many problems beyond the reach of an AQM on a single queue.

4.7. Setting up CoDel AQM

   CoDel is set for use in devices in the open Internet. An interval of 100ms is used, target is set to 5% of interval, and the initial drop spacing is also set to the interval. These settings have been chosen so that a device, such as a small WiFi router, can be sold without the need for any values to be made adjustable, yielding a parameterless implementation. In addition, CoDel is useful in environments with significantly different characteristics from the normal Internet, for example, in switches used as a cluster interconnect within a data center. Since cluster traffic is entirely internal to the data center, round trip latencies are low (typically <100us) but bandwidths are high (1-40Gbps), so it's relatively easy for the aggregation phase of a distributed computation (e.g., the Reduce part of a Map/Reduce) to persistently fill and then overflow the modest per-port buffering available in most high speed switches. A CoDel configured for this environment (target and interval in the microsecond rather than millisecond range) can minimize drops (or ECN marks) while keeping throughput high and latency low.

   Devices destined for these environments MAY use a different interval, where suitable. If appropriate analysis indicates, the target MAY be set to some other value in the 5-10%-of-interval range, and the initial drop spacing MAY be set to a value of 1.0 to 1.2 times the interval. But these settings will cause problems, such as overdropping and low throughput, if used on the open Internet, so devices that allow CoDel to be configured SHOULD default to the Internet-appropriate values given in this document.

5. Annotated Pseudo-code for CoDel AQM

   What follows is the CoDel algorithm in C++-like pseudo-code. Since CoDel adds relatively little new code to a basic tail-drop fifo-queue, we have attempted to highlight just these additions by presenting CoDel as a sub-class of a basic fifo-queue base class. The reference code is included to aid implementers who wish to apply CoDel to queue management as described here or to adapt its principles to other applications.

   Implementors are strongly encouraged to also look at Eric Dumazet's Linux kernel version of CoDel - a well-written, well-tested, real-world, C-based implementation. As of this writing, it is at https://github.com/torvalds/linux/blob/master/net/sched/sch_codel.c .
The following pseudo-code is open-source with a dual BSD/GPL license:

   Codel - The Controlled-Delay Active Queue Management algorithm.
   Copyright (C) 2011-2014 Kathleen Nichols <nichols@pollere.com>.
   Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

   o  Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer, without modification.

   o  Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.

   o  The names of the authors may not be used to endorse or promote products derived from this software without specific prior written permission.

   Alternatively, provided that this notice is retained in full, this software may be distributed under the terms of the GNU General Public License ("GPL") version 2, in which case the provisions of the GPL apply INSTEAD OF those given above.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

5.1. Data Types

   time_t is an integer time value in units convenient for the system. Resolution to at least a millisecond is required, and better resolution is useful up to the minimum possible packet time on the output link; 64- or 32-bit widths are acceptable but with 32 bits the resolution should be no finer than 2^-16 to leave enough dynamic range to represent a wide range of queue waiting times. Narrower widths also have implementation issues due to overflow (wrapping) and underflow (limit cycles because of truncation to zero) that are not addressed in this pseudocode. The code presented here uses 0 as a flag value to indicate "no time set."

   packet_t* is a pointer to a packet descriptor. We assume it has a tstamp field capable of holding a time_t and that the field is available for use by CoDel (it will be set by the enqueue routine and used by the dequeue routine).

   queue_t is a base class for queue objects (the parent class for codel_queue_t objects). We assume it has enqueue() and dequeue() methods that can be implemented in child classes. We assume it has a bytes() method that returns the current queue size in bytes. This can be an approximate value. The method is invoked in the dequeue() method but shouldn't require a lock with the enqueue() method.

   flag_t is a Boolean.
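   For concreteness, one plausible realization of time_t and the MS2TIME() helper used below, assuming a 64-bit microsecond clock (our assumption; any units meeting the resolution guidance above work):

      // Example realization only: 64-bit microsecond timestamps.
      typedef int64_t time_t;   // shadows the C library's time_t;
                                // fine in pseudo-code, rename in
                                // real code
      #define MS2TIME(ms) ((time_t)(ms) * 1000)  // millisec -> usec
      time_t clock();           // current time, monotonic, in usec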
5.2. Per-queue state (codel_queue_t instance variables)

   time_t first_above_time_ = 0; // Time to declare sojourn time above
                                 // target
   time_t drop_next_ = 0;        // Time to drop next packet
   uint32_t count_ = 0;          // Packets dropped in dropping state
   uint32_t lastcount_ = 0;      // Count from previous iteration
   flag_t dropping_ = false;     // Set to true if in drop state

5.3. Constants

   time_t target_ = MS2TIME(5);     // 5ms target queue delay
   time_t interval_ = MS2TIME(100); // 100ms sliding-minimum window
   u_int maxpacket_ = 512;          // Maximum packet size in bytes
                                    // (should use interface MTU)

5.4. Enqueue routine

   All the work of CoDel is done in the dequeue routine. The only CoDel addition to enqueue is putting the current time in the packet's tstamp field so that the dequeue routine can compute the packet's sojourn time.

   void codel_queue_t::enqueue(packet_t* pkt)
   {
       pkt->tstamp = clock();
       queue_t::enqueue(pkt);
   }

5.5. Dequeue routine

   This is the heart of CoDel. There are two branches based on whether the controller is in dropping state: (i) if the controller is in dropping state (that is, the minimum packet sojourn time is greater than target), then the controller checks if it is time to leave dropping state or schedules the next drop(s); or (ii) if the controller is not in dropping state, it determines if it should enter dropping state and do the initial drop.

   packet_t* codel_queue_t::dequeue()
   {
       time_t now = clock();
       dodequeue_result r = dodequeue(now);
       uint32_t delta;

       if (dropping_) {
           if (! r.ok_to_drop) {
               // sojourn time below target - leave dropping state
               dropping_ = false;
           }
           // Time for the next drop. Drop current packet and dequeue
           // next. If the dequeue doesn't take us out of dropping
           // state, schedule the next drop. A large backlog might
           // result in drop rates so high that the next drop should
           // happen now, hence the 'while' loop.
           while (now >= drop_next_ && dropping_) {
               drop(r.p);
               ++count_;
               r = dodequeue(now);
               if (! r.ok_to_drop) {
                   // leave dropping state
                   dropping_ = false;
               } else {
                   // schedule the next drop.
                   drop_next_ = control_law(drop_next_, count_);
               }
           }
       // If we get here we're not in dropping state. The 'ok_to_drop'
       // return from dodequeue means that the sojourn time has been
       // above 'target' for 'interval' so enter dropping state.
       } else if (r.ok_to_drop) {
           drop(r.p);
           r = dodequeue(now);
           dropping_ = true;

           // If min went above target close to when it last went
           // below, assume that the drop rate that controlled the
           // queue on the last cycle is a good starting point to
           // control it now. ('drop_next' will be at most 'interval'
           // later than the time of the last drop so 'now - drop_next'
           // is a good approximation of the time from the last drop
           // until now.) Implementations vary slightly here; this is
           // the Linux version, which is more widely deployed and
           // tested.
           delta = count_ - lastcount_;
           count_ = (delta > 1 && now - drop_next_ < 16*interval_)?
                        delta : 1;
           drop_next_ = control_law(now, count_);
           lastcount_ = count_;
       }
       return (r.p);
   }

5.6. Helper routines

   Since the degree of multiplexing and nature of the traffic sources is unknown, CoDel acts as a closed-loop servo system that gradually increases the frequency of dropping until the queue is controlled (sojourn time goes below target). This is the control law that governs the servo. It has this form because of the sqrt(p) dependence of TCP throughput on drop probability. Note that for embedded systems or kernel implementation, the inverse sqrt can be computed efficiently using only integer multiplication; an illustrative sketch follows the code below.

   time_t codel_queue_t::control_law(time_t t, uint32_t count)
   {
       return t + interval_ / sqrt(count);
   }
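   As an illustration of that integer-only approach, the fragment below performs one Newton-Raphson refinement of 1/sqrt(count) in 16.16 fixed point. The Linux implementation caches such an estimate in the per-queue state and refines it as count_ changes; the exact format, shifts, and function name here are our own illustrative choices:

      // Newton-Raphson step for y ~ 1/sqrt(count), y held in 16.16
      // fixed point: y' = y * (3 - count * y^2) / 2. Iterating as
      // count changes keeps y accurate with no floating point, and
      // control_law() becomes:
      //     t + (((uint64_t)interval_ * y) >> 16)
      uint32_t invsqrt_step(uint32_t y, uint32_t count)
      {
          uint64_t y2   = (uint64_t)y * y;                   // Q32
          uint64_t diff = (3ULL << 32) - (uint64_t)count * y2; // Q32
          // (assumes y does not overestimate 1/sqrt(count) enough
          //  to make diff go negative)
          return (uint32_t)(((uint64_t)y * diff) >> 33);     // Q16
      }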
   Next is a helper routine that does the actual packet dequeue and tracks whether the sojourn time is above or below target and, if above, whether it has remained above continuously for at least interval. It returns two values: a Boolean indicating if it is OK to drop (sojourn time above target for at least interval) and the packet dequeued.

   typedef struct {
       packet_t* p;
       flag_t ok_to_drop;
   } dodequeue_result;

   dodequeue_result codel_queue_t::dodequeue(time_t now)
   {
       dodequeue_result r = { queue_t::dequeue(), false };
       if (r.p == NULL) {
           // queue is empty - we can't be above target
           first_above_time_ = 0;
           return r;
       }

       // To span a large range of bandwidths, CoDel runs two
       // different AQMs in parallel. One is sojourn-time-based
       // and takes effect when the time to send an MTU-sized
       // packet is less than target. The 1st term of the "if"
       // below does this. The other is backlog-based and takes
       // effect when the time to send an MTU-sized packet is >=
       // target. The goal here is to keep the output link
       // utilization high by never allowing the queue to get
       // smaller than the amount that arrives in a typical
       // interarrival time (MTU-sized packets arriving spaced
       // by the amount of time it takes to send such a packet on
       // the bottleneck). The 2nd term of the "if" does this.
       time_t sojourn_time = now - r.p->tstamp;
       if (sojourn_time < target_ || bytes() <= maxpacket_) {
           // went below - stay below for at least interval
           first_above_time_ = 0;
       } else {
           if (first_above_time_ == 0) {
               // just went above from below. if still above at
               // first_above_time, will say it's ok to drop.
               first_above_time_ = now + interval_;
           } else if (now >= first_above_time_) {
               r.ok_to_drop = true;
           }
       }
       return r;
   }

5.7. Implementation considerations

   Since CoDel requires relatively little per-queue state and no direct communication or state sharing between the enqueue and dequeue routines, it is relatively simple to add CoDel to almost any packet processing pipeline, including ASIC- or NPU-based forwarding engines.

   One issue to consider is dodequeue()'s use of a 'bytes()' function to determine the current queue size in bytes. This value does not need to be exact. If the enqueue part of the pipeline keeps a running count of the total number of bytes it has put into the queue and the dequeue routine keeps a running count of the total bytes it has removed from the queue, 'bytes()' is simply the difference between these two counters (32-bit counters should be adequate). Enqueue has to update its counter once per packet queued, but it does not matter when (before, during, or after the packet has been added to the queue). The worst that can happen is a slight, transient underestimate of the queue size, which might cause a drop to be briefly deferred.
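   A sketch of that counter-pair arrangement (our illustration; the field names are invented):

      // Illustrative two-counter bytes() implementation. Each
      // counter is written by only one side of the pipeline, so no
      // lock is needed; a momentarily stale read only defers a drop
      // briefly, as noted above.
      uint32_t bytes_in_ = 0;   // updated only by enqueue
      uint32_t bytes_out_ = 0;  // updated only by dequeue

      u_int codel_queue_t::bytes()
      {
          return bytes_in_ - bytes_out_;  // unsigned wrap is benign
      }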
6. Adapting and applying CoDel's building blocks

   CoDel has been implemented and tested in a range of environments.

6.1. Validations and available code

   An experiment by Stanford graduate students successfully used
   Linux CoDel to duplicate our published simulation work on CoDel's
   ability to adapt to drastic link rate changes.  This experiment
   can be found at
   http://reproducingnetworkresearch.wordpress.com/2012/06/06/solving-bufferbloat-the-codel-way/ .

   Our ns-2 simulations are available at
   http://pollere.net/Codel.html .  CableLabs has funded some
   additions to the simulator's sfqcodel code, which have been made
   public.  The basic algorithm of CoDel remains unchanged, but we
   continue to experiment with the drop interval setting when
   resuming drop state, with inhibiting or canceling drop state when
   few bytes are in the queue, and with other details.  Our approach
   to changes is to make them only if we are convinced they do more
   good than harm, both operationally and in the implementation.
   With this in mind, some of these issues may continue to evolve as
   we get more deployment and as the building blocks are applied to a
   wider range of problems.

   CoDel is available in ns-2 version 2.35 and later.

   Andrew McGregor has an ns-3 implementation of both CoDel and
   fq_codel.  CoDel is available in ns-3 version 3.21 and later at
   https://www.nsnam.org/ .  At the time of this writing, the ns-3
   implementation of fq_codel is available at
   https://www.nsnam.org/wiki/GSOC2014Bufferbloat .

   CoDel is available in Linux; Eric Dumazet implemented CoDel in the
   Linux kernel.

   Dave Taht integrated and distributed CoDel as a bufferbloat
   solution (http://www.bufferbloat.net/projects/codel ).

6.2. CoDel in the datacenter

   Nandita Dukkipati and her group at Google realized that the CoDel
   building blocks could be applied to bufferbloat problems in
   datacenter servers, not just to Internet routers.  The Linux CoDel
   queueing discipline (qdisc) was adapted in three ways to tackle
   this bufferbloat problem.

   1.  The default CoDel action was modified to be a direct feedback
       from the qdisc to the TCP layer at dequeue.  The direct
       feedback simply reduces TCP's congestion window just as
       congestion control would do in the event of a drop.  The
       scheme falls back to ECN marking or packet drop if the TCP
       socket lock cannot be acquired at dequeue.

   2.  Being located in the server makes it possible to monitor the
       actual RTT to use as CoDel's interval rather than making a
       "best guess" of RTT.  The CoDel interval is dynamically
       adjusted by using the maximum TCP round-trip time (RTT) of the
       connections sharing the same qdisc/bucket.  In particular,
       there is a history entry of the maximum RTT experienced over
       the last second.  As a packet is dequeued, the RTT estimate is
       accessed from its TCP socket.  If the estimate is larger than
       the current CoDel interval, the CoDel interval is immediately
       refreshed to the new value.  If the CoDel interval is not
       refreshed for over a second, it is decreased to the history
       entry and the process is repeated.  The use of the dynamic TCP
       RTT estimate lets the interval adapt to the actual maximum
       value currently seen and thus lets the controller space its
       drop intervals appropriately.  (A sketch of this style of
       windowed-maximum tracking follows this list.)

   3.  Since the mathematics of computing the setpoint are invariant,
       a target of 5% of the RTT or CoDel interval was used here.
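   As an illustration of item 2, the following is a minimal sketch
   under our own assumptions (not the code Google deployed) of
   keeping a one-second history of the maximum observed RTT and using
   it to refresh or decay the CoDel interval; the names
   'update_interval', 'hist_max_rtt_', and 'hist_start_' are
   illustrative.

   time_t hist_max_rtt_ = 0;  // max RTT seen in current window
   time_t hist_start_ = 0;    // start time of current window

   void codel_queue_t::update_interval(time_t rtt, time_t now)
   {
       if (rtt > interval_) {
           // a connection with a longer RTT appeared - refresh now
           interval_ = rtt;
           hist_max_rtt_ = rtt;
           hist_start_ = now;
       } else {
           if (rtt > hist_max_rtt_)
               hist_max_rtt_ = rtt;
           if (now - hist_start_ > MS2TIME(1000)) {
               // no refresh for over a second - decay the interval
               // to the window maximum and start a new window
               interval_ = hist_max_rtt_;
               hist_max_rtt_ = rtt;
               hist_start_ = now;
           }
       }
   }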
   Non-data packets were not dropped, as these are typically small
   and sometimes critical control packets.  Being located on the
   server, there is no concern with misbehaving users as there would
   be on the public Internet.

   In several datacenter workload benchmarks, which are typically
   bursty, CoDel reduced the queueing latencies at the qdisc and
   thereby improved the mean and 99th-percentile latencies from
   several tens of milliseconds to less than one millisecond.  The
   minimum-tracking part of the CoDel framework proved useful in
   disambiguating "good" queue from "bad" queue, and was particularly
   helpful in controlling qdisc buffers that are inherently bursty
   because of TCP Segmentation Offload (TSO).

7. Security Considerations

   This document describes an active queue management algorithm for
   implementation in networked devices.  There are no specific
   security exposures associated with CoDel.

8. IANA Considerations

   This document does not require actions by IANA.

9. Conclusions

   CoDel provides very general, efficient, parameterless building
   blocks for queue management that can be applied to single or
   multiple queues in a variety of data networking scenarios.  It is
   a critical tool in solving bufferbloat.  CoDel's settings MAY be
   modified for other special-purpose networking applications.  We
   encourage experimentation, improvement, and rigorous testing.

   Ongoing projects are creating a deployable CoDel in Linux routers
   and experimenting with applying CoDel to stochastic queueing, with
   promising results.

10. Acknowledgments

   The authors thank Jim Gettys for the constructive nagging that
   made us get the work "out there" before we thought it was ready.
   We thank Dave Taht, Eric Dumazet, and the open source community
   for showing the value of getting it "out there" and for making it
   real.  We thank Nandita Dukkipati for contributions to Section 6
   and for comments that helped to substantially improve this draft.
   We thank the AQM working group and the Transport Area shepherd,
   Wes Eddy, for patiently prodding this draft all the way to a
   standard.

11. References

11.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2. Informative References

   [FQ-CODEL-ID]
             Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
             J., and E. Dumazet, "FlowQueue-Codel",
             draft-ietf-aqm-fq-codel-03 (work in progress),
             November 2015.

   [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
             Control", RFC 2581, April 1999.

   [RFC896]  Nagle, J., "Congestion Control in IP/TCP Internetworks",
             RFC 896, January 1984.

   [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B.,
             Deering, S., Estrin, D., Floyd, S., Jacobson, V.,
             Minshall, G., Partridge, C., Peterson, L., Ramakrishnan,
             K., Shenker, S., Wroclawski, J., and L. Zhang,
             "Recommendations on Queue Management and Congestion
             Avoidance in the Internet", RFC 2309, April 1998.
   [TSV2011] Gettys, J., "Bufferbloat: Dark Buffers in the Internet",
             IETF 80 presentation to Transport Area Open Meeting,
             March 2011.

   [BB2011]  Gettys, J. and K. Nichols, "Bufferbloat: Dark Buffers in
             the Internet", Communications of the ACM, Vol. 55,
             No. 1, pp. 57-65, January 2012.

   [BMPFQ]   Suter, B., "Buffer Management Schemes for Supporting TCP
             in Gigabit Routers with Per-flow Queueing", IEEE Journal
             on Selected Areas in Communications, Vol. 17, Issue 6,
             pp. 1159-1169, June 1999.

   [CODEL2012]
             Nichols, K. and V. Jacobson, "Controlling Queue Delay",
             Communications of the ACM, Vol. 55, No. 7, pp. 42-50,
             July 2012.

   [VANQ2006]
             Jacobson, V., "A Rant on Queues", talk at MIT Lincoln
             Labs, Lexington, MA, July 2006.

   [REDL1998]
             Nichols, K., Jacobson, V., and K. Poduri, "RED in a
             Different Light", tech report, September 1999.

   [NETAL2010]
             Kreibich, C., et al., "Netalyzr: Illuminating the Edge
             Network", Proceedings of the Internet Measurement
             Conference, Melbourne, Australia, 2010.

   [TSV84]   Jacobson, V., "CoDel talk at TSV meeting, IETF 84".

   [CHARB2007]
             Dischinger, M., et al., "Characterizing Residential
             Broadband Networks", Proceedings of the Internet
             Measurement Conference, San Diego, CA, 2007.

   [MACTCP1997]
             Mathis, M., Semke, J., and J. Mahdavi, "The Macroscopic
             Behavior of the TCP Congestion Avoidance Algorithm", ACM
             SIGCOMM Computer Communication Review, Vol. 27, No. 3,
             July 1997.

   [SFQ1990] McKenney, P., "Stochastic Fairness Queuing", Proceedings
             of IEEE INFOCOM '90, San Francisco, 1990.

   [KLEIN81] Kleinrock, L. and R. Gail, "An Invariant Property of
             Computer Network Power", Proceedings of the
             International Conference on Communications, June 1981.

11.3. URIs

   [6] http://pollere.net/Codel.html

   [7] http://pollere.net/Codel.html

Authors' Addresses

   Kathleen Nichols
   Pollere, Inc.
   PO Box 370201
   Montara, CA  94037
   USA

   Email: nichols@pollere.com

   Van Jacobson
   Google

   Email: vanj@google.com

   Andrew McGregor
   Google

   Email: andrewmcgr@google.com

   Janardhan Iyengar
   Google

   Email: jri@google.com