| < draft-ietf-aqm-codel-05.txt | draft-ietf-aqm-codel-06.txt > | |||
|---|---|---|---|---|
| AQM K. Nichols | AQM K. Nichols | |||
| Internet-Draft Pollere, Inc. | Internet-Draft Pollere, Inc. | |||
| Intended status: Experimental V. Jacobson | Intended status: Experimental V. Jacobson | |||
| Expires: May 4, 2017 A. McGregor, ed. | Expires: June 25, 2017 A. McGregor, ed. | |||
| J. Iyengar, ed. | J. Iyengar, ed. | |||
| October 31, 2016 | December 22, 2016 | |||
| Controlled Delay Active Queue Management | Controlled Delay Active Queue Management | |||
| draft-ietf-aqm-codel-05 | draft-ietf-aqm-codel-06 | |||
| Abstract | Abstract | |||
| This document describes a general framework called CoDel (Controlled | This document describes a general framework called CoDel (Controlled | |||
| Delay) that controls bufferbloat-generated excess delay in modern | Delay) that controls bufferbloat-generated excess delay in modern | |||
| networking environments. CoDel consists of an estimator, a setpoint, | networking environments. CoDel consists of an estimator, a setpoint, | |||
| and a control loop. It requires no configuration in normal Internet | and a control loop. It requires no configuration in normal Internet | |||
| deployments. CoDel comprises some major technical innovations and | deployments. | |||
| has been made available as open source so that the framework can be | ||||
| applied by the community to a range of problems. It has been | ||||
| implemented in Linux (and available in the Linux distribution) and | ||||
| deployed in some networks at the consumer edge. In addition, the | ||||
| framework has been successfully applied in other ways. | ||||
| Note: Code Components extracted from this document must include the | ||||
| license as included with the code in Section 5. | ||||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on May 4, 2017. | This Internet-Draft will expire on June 25, 2017. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2016 IETF Trust and the persons identified as the | Copyright (c) 2016 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 2. Conventions used in this document . . . . . . . . . . . . . . 5 | 2. Conventions used in this document . . . . . . . . . . . . . . 4 | |||
| 3. Building Blocks of Queue Management . . . . . . . . . . . . . 5 | 3. Overview of the Codel AQM . . . . . . . . . . . . . . . . . . 4 | |||
| 3.1. Estimator . . . . . . . . . . . . . . . . . . . . . . . . 6 | 3.1. Non-starvation . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 3.2. Setpoint . . . . . . . . . . . . . . . . . . . . . . . . 8 | 3.2. Using the interval . . . . . . . . . . . . . . . . . . . 6 | |||
| 3.3. Control Loop . . . . . . . . . . . . . . . . . . . . . . 9 | 3.3. The target setpoint . . . . . . . . . . . . . . . . . . . 6 | |||
| 4. Putting it together: queue management for the network edge . 12 | 3.4. Use with multiple queues . . . . . . . . . . . . . . . . 7 | |||
| 4.1. Overview of CoDel AQM . . . . . . . . . . . . . . . . . . 12 | 3.5. Setting up CoDel . . . . . . . . . . . . . . . . . . . . 7 | |||
| 4.2. Non-starvation . . . . . . . . . . . . . . . . . . . . . 13 | 4. Annotated Pseudo-code for CoDel AQM . . . . . . . . . . . . . 8 | |||
| 4.3. Using the interval . . . . . . . . . . . . . . . . . . . 13 | 4.1. Data Types . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 4.4. The target setpoint . . . . . . . . . . . . . . . . . . . 14 | 4.2. Per-queue state (codel_queue_t instance variables) . . . 10 | |||
| 4.5. Use with multiple queues . . . . . . . . . . . . . . . . 15 | 4.3. Constants . . . . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 4.6. Use of stochastic bins or sub-queues to improve | 4.4. Enqueue routine . . . . . . . . . . . . . . . . . . . . . 10 | |||
| performance . . . . . . . . . . . . . . . . . . . . . . . 15 | 4.5. Dequeue routine . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 4.7. Setting up CoDel AQM . . . . . . . . . . . . . . . . . . 16 | 4.6. Helper routines . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 5. Annotated Pseudo-code for CoDel AQM . . . . . . . . . . . . . 17 | 4.7. Implementation considerations . . . . . . . . . . . . . . 13 | |||
| 5.1. Data Types . . . . . . . . . . . . . . . . . . . . . . . 18 | 5. Understanding the Building Blocks of Queue Management . . . . 14 | |||
| 5.2. Per-queue state (codel_queue_t instance variables) . . . 19 | 5.1. Estimator . . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 5.3. Constants . . . . . . . . . . . . . . . . . . . . . . . . 19 | 5.2. Setpoint . . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 5.4. Enqueue routine . . . . . . . . . . . . . . . . . . . . . 19 | 5.3. Control Loop . . . . . . . . . . . . . . . . . . . . . . 19 | |||
| 5.5. Dequeue routine . . . . . . . . . . . . . . . . . . . . . 19 | 6. Further Experimentation . . . . . . . . . . . . . . . . . . . 21 | |||
| 5.6. Helper routines . . . . . . . . . . . . . . . . . . . . . 21 | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 22 | |||
| 5.7. Implementation considerations . . . . . . . . . . . . . . 22 | 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 6. Adapting and applying CoDel's building blocks . . . . . . . . 23 | 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 6.1. Validations and available code . . . . . . . . . . . . . 23 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 6.2. CoDel in the datacenter . . . . . . . . . . . . . . . . . 24 | 10.1. Normative References . . . . . . . . . . . . . . . . . . 22 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 25 | 10.2. Informative References . . . . . . . . . . . . . . . . . 22 | |||
| 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25 | Appendix A. Applying CoDel in the datacenter . . . . . . . . . . 24 | |||
| 9. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 25 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 25 | ||||
| 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 25 | ||||
| 11.1. Normative References . . . . . . . . . . . . . . . . . . 25 | ||||
| 11.2. Informative References . . . . . . . . . . . . . . . . . 26 | ||||
| 11.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 27 | ||||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 27 | ||||
| 1. Introduction | 1. Introduction | |||
| The need for queue management has been evident for decades. The | The "persistently full buffer" problem has been discussed in the IETF | |||
| "persistently full buffer" problem has been discussed in the IETF | ||||
| community since the early 80's [RFC896]. The IRTF's End-to-End | community since the early 80's [RFC896]. The IRTF's End-to-End | |||
| Working Group called for the deployment of active queue management | Working Group called for the deployment of active queue management | |||
| (AQM) to solve the problem in 1998 [RFC2309]. Despite this | (AQM) to solve the problem in 1998 [RFC2309]. Despite this | |||
| awareness, the problem has only gotten worse as Moore's Law growth in | awareness, the problem has only gotten worse as Moore's Law growth in | |||
| memory density fueled an exponential increase in buffer pool size. | memory density fueled an exponential increase in buffer pool size. | |||
| Efforts to deploy AQM have been frustrated by difficult configuration | Efforts to deploy AQM have been frustrated by difficult configuration | |||
| and negative impact on network utilization. This problem, recently | and negative impact on network utilization. This "bufferbloat" | |||
| christened "bufferbloat", [TSV2011] [BB2011] has become increasingly | problem [TSV2011] [BB2011] has become increasingly important | |||
| important throughout the Internet but particularly at the consumer | throughout the Internet but particularly at the consumer edge. Queue | |||
| edge. Recently, queue management has become more critical due to | management has become more critical due to increased consumer use of | |||
| increased consumer use of the Internet, mixing large video | the Internet, mixing large video transactions with time-critical VoIP | |||
| transactions with time-critical VoIP and gaming. Gettys [TSV2011, | and gaming. | |||
| BB2011] has been instrumental in publicizing the problem and the | ||||
| measurement work [CHARB2007, NETAL2010] and coining the term | ||||
| bufferbloat. Large content distributors such as Google have observed | ||||
| that bufferbloat is ubiquitous and adversely affects performance for | ||||
| many users. The solution is an effective AQM that remediates | ||||
| bufferbloat at a bottleneck while "doing no harm" at hops where | ||||
| buffers are not bloated. | ||||
| The development and deployment of effective active queue management | An effective AQM remediates bufferbloat at a bottleneck while "doing | |||
| has been hampered by persistent misconceptions about the cause and | no harm" at hops where buffers are not bloated. The development and | |||
| meaning of packet queues in network buffers. Network buffers exist | deployment of AQM however is commonly subject to common | |||
| to absorb the packet bursts that occur naturally in statistically | misconceptions about the cause of packet queues in network buffers. | |||
| multiplexed networks. Buffers helpfully absorb the queues created by | Network buffers exist to absorb the packet bursts that occur | |||
| such reasonable packet network behavior as short-term mismatches in | naturally in statistically multiplexed networks. Buffers helpfully | |||
| traffic arrival and departure rates that arise from upstream resource | absorb the queues created by such reasonable packet network behavior | |||
| contention, transport conversation startup transients and/or changes | as short-term mismatches in traffic arrival and departure rates that | |||
| in the number of conversations sharing a link. Unfortunately, other | arise from upstream resource contention, transport conversation | |||
| less useful network behaviors can cause queues to fill and their | startup transients and/or changes in the number of conversations | |||
| effects are not nearly as benign. Discussion of these issues and the | sharing a link. Unfortunately, other less useful network behaviors | |||
| reason why the solution is not simply smaller buffers can be found in | can cause queues to fill and their effects are not nearly as benign. | |||
| [RFC2309], [VANQ2006], [REDL1998], and [CODEL2012]. To understand | Discussion of these issues and the reason why the solution is not | |||
| queue management, it is critical to understand the difference between | simply smaller buffers can be found in [RFC2309], [VANQ2006], | |||
| the necessary, useful "good" queue, and the counterproductive "bad" | [REDL1998], and [CODEL2012]. To understand queue management, it is | |||
| queue. | critical to understand the difference between the necessary, useful | |||
| "good" queue, and the counterproductive "bad" queue. | ||||
| Many approaches to active queue management (AQM) have been developed | Several approaches to AQM have been developed over the past two | |||
| over the past two decades but none has been widely deployed due to | decades but none has been widely deployed due to performance | |||
| performance problems. When designed with the wrong conceptual model | problems. When designed with the wrong conceptual model for queues, | |||
| for queues, AQMs have limited operational range, require a lot of | AQMs have limited operational range, require a lot of configuration | |||
| configuration tweaking, and frequently impair rather than improve | tweaking, and frequently impair rather than improve performance. | |||
| performance. Today, the demands on an effective AQM are even | Learning from this past history, the CoDel approach is designed to | |||
| greater: many network devices must work across a range of bandwidths, | meet the following goals: | |||
| either due to link variations or due to the mobility of the device. | ||||
| The CoDel approach is designed to meet the following goals: | ||||
| o parameterless for normal operation - has no knobs for operators, | o parameterless for normal operation - has no knobs for operators, | |||
| users, or implementers to adjust | users, or implementers to adjust | |||
| o treat "good queue" and "bad queue" differently, that is, keep | o treat "good queue" and "bad queue" differently, that is, keep | |||
| delay low while permitting necessary bursts of traffic | delay low while permitting necessary bursts of traffic | |||
| o control delay while insensitive (or nearly so) to round trip | o control delay while insensitive (or nearly so) to round trip | |||
| delays, link rates and traffic loads; this goal is to "do no harm" | delays, link rates and traffic loads; this goal is to "do no harm" | |||
| to network traffic while controlling delay | to network traffic while controlling delay | |||
| o adapt to dynamically changing link rates with no negative impact | o adapt to dynamically changing link rates with no negative impact | |||
| on utilization | on utilization | |||
| o simple and efficient implementation (can easily span the spectrum | o simple and efficient implementation (can easily span the spectrum | |||
| from low-end, linux-based access points and home routers up to | from low-end, linux-based access points and home routers up to | |||
| high-end commercial router silicon) | high-end commercial router silicon) | |||
| CoDel has five major innovations that distinguish it from prior AQMs: | CoDel has five major differences from prior AQMs: use of local queue | |||
| use of local queue minimum to track congestion ("bad queue"), use of | minimum to track congestion ("bad queue"), use of an efficient single | |||
| an efficient single state variable representation of that tracked | state variable representation of that tracked statistic, use of | |||
| statistic, use of packet sojourn time as the observed datum, rather | packet sojourn time as the observed datum, rather than packets, | |||
| than packets, bytes, or rates, use of mathematically determined | bytes, or rates, use of mathematically determined setpoint derived | |||
| setpoint derived from maximizing the network power metric, and a | from maximizing the network power metric, and a modern state space | |||
| modern state space controller. | controller. | |||
| CoDel configures itself based on a round-trip time metric which can | CoDel configures itself based on a round-trip time metric which can | |||
| be set to 100ms for the normal, terrestrial Internet. With no | be set to 100ms for the normal, terrestrial Internet. With no | |||
| changes to parameters, we have found CoDel to work across a wide | changes to parameters, CoDel is expected to work across a wide range | |||
| range of conditions, with varying links and the full range of | of conditions, with varying links and the full range of terrestrial | |||
| terrestrial round trip times. | round trip times. | |||
| Since CoDel was first published [CODEL2012], a number of implementers | CoDel is easily adapted to multiple queue systems as shown by [FQ- | |||
| have been using and adapting it with promising results. Much of this | CODEL-ID]. Implementers should use the fq_codel multiple-queue | |||
| work is collected at http://www.bufferbloat.net/projects/codel . | approach as it deals with many problems beyond the reach of an AQM on | |||
| CoDel has been implemented in Linux very efficiently and should lend | a single queue. | |||
| itself to silicon implementation. CoDel is well-adapted for use in | ||||
| multiple queued devices and has been used by Eric Dumazet with | CoDel was first published in [CODEL2012] and has been implemented in | |||
| multiple queues in a sophisticated queue management approach, | the Linux kernel. | |||
| fq_codel [FQ-CODEL-ID]. | ||||
| Note that while this document refers to dropping packets when | ||||
| indicated by CoDel, it is reasonable to ECN-mark packets instead as | ||||
| well. | ||||
| 2. Conventions used in this document | 2. Conventions used in this document | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
| In this document, these words will appear with that interpretation | In this document, these words will appear with that interpretation | |||
| only when in ALL CAPS. Lower case uses of these words are not to be | only when in ALL CAPS. Lower case uses of these words are not to be | |||
| interpreted as carrying [RFC2119] significance. | interpreted as carrying [RFC2119] significance. | |||
| 3. Building Blocks of Queue Management | 3. Overview of the Codel AQM | |||
| Two decades of work on queue management failed to yield an approach | ||||
| that could be widely deployed in the Internet. Careful tuning for | ||||
| particular usages has enabled queue management techniques to "kind | ||||
| of" work; that is, they have been able to decrease queueing delays, | ||||
| but only at the undue cost of link utilization and/or fairness. At | ||||
| the heart of queue management is the notion of "good queue" and "bad | ||||
| queue" and the search for ways to get rid of the bad queue (which | ||||
| only adds delay) while preserving the good queue (which provides for | ||||
| good utilization). This section explains queueing, both good and | ||||
| bad, and covers the innovative CoDel building blocks that can be used | ||||
| to manage packet buffers to keep their queues in the "good" range. | ||||
| Packet queues form in buffers facing bottleneck links, i.e., where | ||||
| the line rate goes from high to low or where many links converge. | ||||
| The well-known bandwidth-delay product (sometimes called "pipe size") | ||||
| is the bottleneck's bandwidth multiplied by the sender-receiver- | ||||
| sender round-trip delay, and is the amount of data that has to be in | ||||
| transit between two hosts in order to run the bottleneck link at 100% | ||||
| utilization. To explore how queues can form, consider a long-lived | ||||
| TCP connection with a 25 packet window sending through a connection | ||||
| with a bandwidth-delay product of 20 packets. After an initial burst | ||||
| of packets the connection will settle into a five packet (+/-1) | ||||
| standing queue; this standing queue size is determined by the | ||||
| mismatch between the window size and the pipe size, and is unrelated | ||||
| to the connection's sending rate. The connection has 25 packets in | ||||
| flight at all times, but only 20 packets arrive at the destination | ||||
| over a round trip time. If the TCP connection has a 30 packet | ||||
| window, the queue will be ten packets with no change in sending rate. | ||||
| Similarly, if the window is 20 packets, there will be no queue but | ||||
| the sending rate is the same. Nothing can be inferred about the | ||||
| sending rate from the queue size, and any queue other than transient | ||||
| bursts only creates delays in the network. The sender needs to | ||||
| reduce the number of packets in flight rather than sending rate. | ||||
| In the above example, the five packet standing queue can be seen to | ||||
| contribute nothing but delay to the connection, and thus is clearly | ||||
| "bad queue". If, in our example, there is a single bottleneck link | ||||
| and it is much slower than the link that feeds it (say, a high-speed | ||||
| ethernet link into a limited DSL uplink) a 20 packet buffer at the | ||||
| bottleneck might be necessary to temporarily hold the 20 packets in | ||||
| flight to keep the bottleneck link's utilization high. The burst of | ||||
| packets should drain completely (to 0 or 1 packets) within a round | ||||
| trip time and this transient queue is "good queue" because it allows | ||||
| the connection to keep the 20 packets in flight and for the | ||||
| bottleneck link to be fully utilized. In terms of the delay | ||||
| experienced, the "good queue" goes away in about a round trip time, | ||||
| while "bad queue" hangs around for longer, causing delays. | ||||
| Effective queue management detects "bad queue" while ignoring "good | ||||
| queue" and takes action to get rid of the bad queue when it is | ||||
| detected. The goal is a queue controller that accomplishes this | ||||
| objective. To control a queue, we need three basic components | ||||
| o Estimator - figure out what we've got | ||||
| o Setpoint - know what what we want | ||||
| o Control loop - if what we've got isn't what we want, we need a way | ||||
| to move it there | ||||
| 3.1. Estimator | ||||
| The estimator both observes the queue and detects when good queue | ||||
| turns to bad queue and vice versa. CoDel has two innovations in its | ||||
| estimator: what is observed as an indicator of queue and how the | ||||
| observations are used to detect good/bad queue. | ||||
| Queue length has been widely used as an observed indicator of | ||||
| congestion and is frequently conflated with sending rate. Use of | ||||
| queue length as a metric is sensitive to how and when the length is | ||||
| observed. A high speed arrival link to a buffer serviced at a much | ||||
| lower rate can rapidly build up a queue that might disperse | ||||
| completely or down to a single packet before a round trip time has | ||||
| elapsed. If the queue length is monitored at packet arrival (as in | ||||
| original RED) or departure time, every packet will see a queue with | ||||
| one possible exception. If the queue length itself is time sampled | ||||
| (as recommended in [REDL1998], a truer picture of the queue's | ||||
| occupancy can be gained at the expense of considerable implementation | ||||
| complexity. | ||||
| The use of queue length is further complicated in networks that are | ||||
| subject to both short and long term changes in available link rate | ||||
| (as in WiFi). Link rate drops can result in a spike in queue length | ||||
| that should be ignored unless it persists. It is not the queue | ||||
| length that should be controlled but the amount of excess delay | ||||
| packets experience due to a persistent or standing queue, which means | ||||
| that the packet sojourn time in the buffer is exactly what we want to | ||||
| track. Tracking the packet sojourn times in the buffer observes the | ||||
| actual delay experienced by each packet. Sojourn time allows queue | ||||
| management to be independent of link rate, gives superior performance | ||||
| to use of buffer size, and is directly related to user-visible | ||||
| performance. It works regardless of line rate changes or link | ||||
| sharing by multiple queues (which the individual queues may | ||||
| experience as changing rates). | ||||
| Consider a link shared by two queues with different priorities. | ||||
| Packets that arrive at the high priority queue are sent as soon as | ||||
| the link is available while packets in the other queue have to wait | ||||
| until the high priority queue is empty (i.e., a strict priority | ||||
| scheduler). The number of packets in the high priority queue might | ||||
| be large but the queue is emptied quickly and the amount of time each | ||||
| packet spends enqueued (the sojourn time) is not large. The other | ||||
| queue might have a smaller number of packets, but packet sojourn | ||||
| times will include the waiting time for the high priority packets to | ||||
| be sent. This makes the sojourn time a good sample of the congestion | ||||
| that each separate queue is experiencing. This example also shows | ||||
| how the metric of sojourn time is independent of the number of queues | ||||
| or the service discipline used, and is instead indicative of | ||||
| congestion seen by the individual queues. | ||||
| How can observed sojourn time be used to separate good queue from bad | ||||
| queue? Although averages, especially of queue length, have | ||||
| previously been widely used as an indicator of bad queue, their | ||||
| efficacy is questionable. Consider the burst that disperses every | ||||
| round trip time. The average queue will be one-half the burst size, | ||||
| though this might vary depending on when the average is computed and | ||||
| the timing of arrivals. The average queue sojourn time would be one- | ||||
| half the time it takes to clear the burst. The average then would | ||||
| indicate a persistent queue where there is none. Instead of averages | ||||
| we recommend tracking the minimum sojourn time, then, if there is one | ||||
| packet that has a zero sojourn time then there is no persistent | ||||
| queue. The value of the minimum in detecting persistent queue is | ||||
| apparent when looking at graphs of queue delay. | ||||
| A persistent queue can be detected by tracking the (local) minimum | ||||
| queue delay packets experience. To ensure that this minimum value | ||||
| does not become stale, it has to have been experienced recently, i.e. | ||||
| during an appropriate past time interval. This interval is the | ||||
| maximum amount of time a minimum value is considered to be in effect, | ||||
| and is related to the amount of time it takes for the largest | ||||
| expected burst to drain. Conservatively, this interval should be at | ||||
| least a round trip time to avoid falsely detecting a persistent queue | ||||
| and not a lot more than a round trip time to avoid delay in detecting | ||||
| the persistent queue. This suggests that the appropriate interval | ||||
| value is the maximum round-trip time of all the connections sharing | ||||
| the buffer. | ||||
| (The following key insight makes computation of the local minimum | ||||
| efficient: It is sufficient to keep a single state variable of how | ||||
| long the minimum has been above or below a target value rather than | ||||
| retaining all the local values to compute the minimum, leading to | ||||
| both storage and computational savings. We use this insight in the | ||||
| pseudo-code for CoDel later in the draft.) | ||||
| These two innovations, use of sojourn time as observed values and the | ||||
| local minimum as the statistic to monitor queue congestion are key to | ||||
| CoDel's estimator building block. The local minimum sojourn time | ||||
| provides an accurate and robust measure of standing queue and has an | ||||
| efficient implementation. In addition, use of the minimum sojourn | ||||
| time has important advantages in implementation. The minimum packet | ||||
| sojourn can only be decreased when a packet is dequeued which means | ||||
| that all the work of CoDel can take place when packets are dequeued | ||||
| for transmission and that no locks are needed in the implementation. | ||||
| The minimum is the only statistic with this property. | ||||
| A more detailed explanation with many pictures can be found in | ||||
| http://pollere.net/Pdfdocs/QrantJul06.pdf and | ||||
| http://www.ietf.org/proceedings/84/slides/slides-84-tsvarea-4.pdf . | ||||
| 3.2. Setpoint | ||||
| Now that we have a robust way of detecting standing queue, we need a | ||||
| setpoint that tells us when to act. If the controller is set to take | ||||
| action as soon as the estimator has a non-zero value, the average | ||||
| drop rate will be maximized, which minimizes TCP goodput | ||||
| [MACTCP1997]. Also, this policy results in no backlog over time (no | ||||
| persistent queue), which negates much of the value of having a | ||||
| buffer, since it maximizes the bottleneck link bandwidth lost due to | ||||
| normal stochastic variation in packet interarrival time. We want a | ||||
| setpoint that maximizes utilization while minimizing delay. Early in | ||||
| the history of packet networking, Kleinrock developed the analytic | ||||
| machinery to do this using a quantity he called 'power', which is the | ||||
| ratio of a normalized throughput to a normalized delay [KLEIN81]. | ||||
| It is straightforward to derive an analytic expression for the | ||||
| average goodput of a TCP conversation at a given round-trip time r | ||||
| and setpoint f (where f is expressed as a fraction of r). Reno TCP, | ||||
| for example, yields: | ||||
| goodput = r (3 + 6f - f^2) / (4 (1+f)) | ||||
| Since the peak queue delay is simply f r, power is solely a function | ||||
| of f since the r's in the numerator and denominator cancel: | ||||
| power is proportional to (1 + 2f - 1/3 f^2) / (1 + f)^2 | ||||
| As Kleinrock observed, the best operating point, in terms of | ||||
| bandwidth / delay tradeoff, is the peak power point, since points off | ||||
| the peak represent a higher cost (in delay) per unit of bandwidth. | ||||
| The power vs. f curve for any AIMD TCP is monotone decreasing. But | ||||
| the curve is very flat for f < 0.1 followed by a increasing curvature | ||||
| with a knee around f = 0.2, then a steep, almost linear fall off | ||||
| [TSV84]. Since the previous equation showed that goodput is monotone | ||||
| increasing with f, the best operating point is near the right edge of | ||||
| the flat top since that represents the highest goodput achievable for | ||||
| a negligible increase in delay. However, since the r in the model is | ||||
| a conservative upper bound, a target of 0.1r runs the risk of pushing | ||||
| shorter RTT connections over the knee and giving them higher delay | ||||
| for no significant goodput increase. Generally, a more conservative | ||||
| target of 0.05r offers a good utilization vs. delay tradeoff while | ||||
| giving enough headroom to work well with a large variation in real | ||||
| RTT. | ||||
| As the above analysis shows, a very small standing queue gives close | ||||
| to 100% utilization of the bottleneck link. While this result was | ||||
| for Reno TCP, the derivation uses only properties that must hold for | ||||
| any 'TCP friendly' transport. We have verified by both analysis and | ||||
| simulation that this result holds for Reno, Cubic, and | ||||
| Westwood[TSV84]. This results in a particularly simple form for the | ||||
| setpoint: the ideal range for the permitted standing queue is between | ||||
| 5% and 10% of the TCP connection's RTT. Thus target is simply 5% of | ||||
| the interval of section 3.1. | ||||
| 3.3. Control Loop | ||||
| Section 3.1 describes a simple, reliable way to measure bad | ||||
| (persistent) queue. Section 3.2 shows that TCP congestion control | ||||
| dynamics gives rise to a setpoint for this measure that's a provably | ||||
| good balance between enhancing throughput and minimizing delay, and | ||||
| that this setpoint is a constant fraction of the same 'largest | ||||
| average RTT' interval used to distinguish persistent from transient | ||||
| queue. The only remaining building block needed for a basic AQM is a | ||||
| 'control loop' algorithm to effectively drive the queueing system | ||||
| from any 'persistent queue above target' state to a state where the | ||||
| persistent queue is below target. | ||||
| Control theory provides a wealth of approaches to the design of | ||||
| control loops. Most of classical control theory deals with the | ||||
| control of linear, time-invariant, single-input-single-output (SISO) | ||||
| systems. Control loops for these systems generally come from a (well | ||||
| understood) class known as Proportional-Integral-Derivative (PID) | ||||
| controllers. Unfortunately, a queue is not a linear system and an | ||||
| AQM operates at the point of maximum non-linearity (where the output | ||||
| link bandwidth saturates so increased demand creates delay rather | ||||
| than higher utilization). Output queues are also not time-invariant | ||||
| since traffic is generally a mix of connections which start and stop | ||||
| at arbitrary times and which can have radically different behaviors | ||||
| ranging from "open loop" UDP audio/video to "closed-loop" congestion- | ||||
| avoiding TCP. Finally, the constantly changing mix of connections | ||||
| (which can't be converted to a single 'lumped parameter' model | ||||
| because of their transfer function differences) makes the system | ||||
| multi-input-multi-output (MIMO), not SISO. | ||||
| Since queueing systems match none of the prerequisites for a | ||||
| classical controller, a modern state-space controller is a better | ||||
| approach with states 'no persistent queue' and 'has persistent | ||||
| queue'. Since Internet traffic mixtures change rapidly and | ||||
| unpredictably, a noise and error tolerant adaptation algorithm like | ||||
| Stochastic Gradient is a good choice. Since there's essentially no | ||||
| information in the amount of persistent queue [TSV84], the adaptation | ||||
| should be driven by how long it has persisted. | ||||
| Consider the two extremes of traffic behavior, a single open-loop UDP | ||||
| video stream and a single, long-lived TCP bulk data transfer. If the | ||||
| average bandwidth of the UDP video stream is greater that the | ||||
| bottleneck link rate, the link's queue will grow and the controller | ||||
| will eventually enter 'has persistent queue' state and start dropping | ||||
| packets. Since the video stream is open loop, its arrival rate is | ||||
| unaffected by drops so the queue will persist until the average drop | ||||
| rate is greater than the output bandwidth deficit (= average arrival | ||||
| rate - average departure rate) so the job of the adaptation algorithm | ||||
| is to discover this rate. For this example, the adaptation could | ||||
| consist of simply estimating the arrival and departure rates then | ||||
| dropping at a rate slightly greater than their difference. But this | ||||
| class of algorithm won't work at all for the bulk data TCP stream. | ||||
| TCP runs in closed-loop flow balance [TSV84] so its arrival rate is | ||||
| almost always exactly equal to the departure rate - the queue isn't | ||||
| the result of a rate imbalance but rather a mismatch between the TCP | ||||
| sender's window and the source-destination-source round-trip path | ||||
| capacity (i.e., the connection's bandwidth-delay product). The | ||||
| sender's TCP congestion avoidance algorithm will slowly increase the | ||||
| send window (one packet per round-trip-time) [RFC2581] which will | ||||
| eventually cause the bottleneck to enter 'has persistent queue' | ||||
| state. But, since the average input rate is the same as the average | ||||
| output rate, the rate deficit estimation that gave the correct drop | ||||
| rate for the video stream would compute a drop rate of zero for the | ||||
| TCP stream. However, if the output link drops one packet as it | ||||
| enters 'has persistent queue' state, when the sender discovers this | ||||
| (via TCP's normal packet loss repair mechanisms) it will reduce its | ||||
| window by a factor of two [RFC2581] so, one round-trip-time after the | ||||
| drop, the persistent queue will go away. | ||||
| If there were N TCP conversations sharing the bottleneck, the | ||||
| controller would have to drop O(N) packets, one from each | ||||
| conversation, to make all the conversations reduce their window to | ||||
| get rid of the persistent queue. If the traffic mix consists of | ||||
| short (<= bandwidth-delay product) conversations, the aggregate | ||||
| behavior becomes more like the open-loop video example since each | ||||
| conversation is likely to have already sent all its packets by the | ||||
| time it learns about a drop so each drop has negligible effect on | ||||
| subsequent traffic. | ||||
| The controller does not know the number, duration, or kind of | ||||
| conversations creating its queue, so it has to learn the appropriate | ||||
| response. Since single drops can have a large effect if the degree | ||||
| of multiplexing (the number of active conversations) is small, | ||||
| dropping at too high a rate is likely to have a catastrophic effect | ||||
| on throughput. Dropping at a low rate (< 1 packet per round-trip- | ||||
| time) then increasing the drop rate slowly until the persistent queue | ||||
| goes below target is unlikely to overdrop and is guaranteed to | ||||
| eventually dissipate the persistent queue. This stochastic gradient | ||||
| learning procedure is the core of CoDel's control loop (the gradient | ||||
| exists because a drop always reduces the (instantaneous) queue so an | ||||
| increasing drop rate always moves the system "down" toward no | ||||
| persistent queue, regardless of traffic mix). | ||||
| The "next drop time" is decreased in inverse proportion to the square | ||||
| root of the number of drops since the dropping state was entered, | ||||
| using the well-known nonlinear relationship of drop rate to | ||||
| throughput to get a linear change in throughput [REDL1998], | ||||
| [MACTCP1997]. | ||||
| Since the best rate to start dropping is at slightly more than one | ||||
| packet per RTT, the controller's initial drop rate can be directly | ||||
| derived from the estimator's interval, defined in section 3.1. When | ||||
| the minimum sojourn time first crosses the target and CoDel drops a | ||||
| packet, the earliest the controller could see the effect of the drop | ||||
| is the round trip time (interval) + the local queue wait time | ||||
| (target). If the next drop happens any earlier than this time | ||||
| (interval + target), CoDel will overdrop. In practice, the local | ||||
| queue waiting time tends to vary, so making the initial drop spacing | ||||
| (i.e., the time to the second drop) be exactly the minimum possible | ||||
| also leads to overdropping. Analysis of simulation and real-world | ||||
| measured data shows that the 75th percentile magnitude of this | ||||
| variation is less than the target, and so the initial drop spacing | ||||
| SHOULD be set to the estimator's interval (i.e., initial drop spacing | ||||
| = interval) to ensure that the controller has accounted for | ||||
| acceptable congestion delays. | ||||
| Use of the minimum statistic lets the controller be placed in the | ||||
| dequeue routine with the estimator. This means that the control | ||||
| signal (the drop) can be sent at the first sign of bad queue (as | ||||
| indicated by the sojourn time) and that the controller can stop | ||||
| acting as soon as the sojourn time falls below the setpoint. | ||||
| Dropping at dequeue has both implementation and control advantages. | ||||
| 4. Putting it together: queue management for the network edge | ||||
| CoDel was initially designed as a bufferbloat solution for the | CoDel was initially designed as a bufferbloat solution for the | |||
| consumer network edge. The CoDel building blocks are able to adapt | consumer network edge. The CoDel building blocks are able to adapt | |||
| to different or time-varying link rates, to be easily used with | to different or time-varying link rates, to be easily used with | |||
| multiple queues, to have excellent utilization with low delay, and to | multiple queues, to have excellent utilization with low delay, and to | |||
| have a simple and efficient implementation. The only setting CoDel | have a simple and efficient implementation. The only setting CoDel | |||
| requires is its interval value, and as 100ms satisfies that | requires is its interval value, and as 100ms satisfies that | |||
| definition for normal Internet usage, CoDel can be parameter-free for | definition for normal Internet usage, CoDel can be parameter-free for | |||
| consumer use. CoDel was released to the open source community, where | consumer use. | |||
| it has been widely promulgated and adapted to many problems. CoDel's | ||||
| efficient implementation and lack of configuration are unique | ||||
| features and make it suitable to manage modern packet buffers. For | ||||
| more background and results on CoDel, see [CODEL2012] and | ||||
| http://pollere.net/Codel.html . | ||||
| 4.1. Overview of CoDel AQM | ||||
| To ensure that link utilization is not adversely affected, CoDel's | To ensure that link utilization is not adversely affected, CoDel's | |||
| estimator sets its target to the setpoint that optimizes power and | estimator sets its target to the setpoint that optimizes power and | |||
| CoDel's controller does not drop packets when the drop would leave | CoDel's controller does not drop packets when the drop would leave | |||
| the queue empty or with fewer than a maximum transmission unit (MTU) | the queue empty or with fewer than a maximum transmission unit (MTU) | |||
| worth of bytes in the buffer. Section 3.2 showed that the ideal | worth of bytes in the buffer. Section 5.2 shows that the ideal | |||
| setpoint is 5-10% of the connection RTT. In the open terrestrial- | setpoint is 5-10% of the connection RTT. In the open terrestrial- | |||
| based Internet, especially at the consumer edge, we expect most | based Internet, especially at the consumer edge, we expect most | |||
| unbloated RTTs to have a ceiling of 100ms [CHARB2007]. Using this | unbloated RTTs to have a ceiling of 100ms [CHARB2007]. Using this | |||
| RTT gives a minimum target of 5ms and the interval for tracking the | RTT gives a minimum target of 5ms and the interval for tracking the | |||
| minimum is 100ms. In practice, uncongested links will see sojourn | minimum is 100ms. In practice, uncongested links will see sojourn | |||
| times below target more often than once per RTT, so the estimator is | times below target more often than once per RTT, so the estimator is | |||
| not overly sensitive to the value of the interval. | not overly sensitive to the value of the interval. | |||
| When the estimator finds a persistent delay above target, the | When the estimator finds a persistent delay above target, the | |||
| controller enters the drop state where a packet is dropped and the | controller enters the drop state where a packet is dropped and the | |||
| next drop time is set. As discussed in section 3.3, the initial next | next drop time is set. As discussed in section 5.3, the initial next | |||
| drop spacing is intended to be long enough to give the endpoints time | drop spacing is intended to be long enough to give the endpoints time | |||
| to react to the single drop so SHOULD be set to a value equal to the | to react to the single drop so SHOULD be set to a value equal to the | |||
| interval. If the estimator's output falls below target, the | interval. If the estimator's output falls below target, the | |||
| controller cancels the next drop and exits the drop state. (The | controller cancels the next drop and exits the drop state. (The | |||
| controller is more sensitive than the estimator to an overly short | controller is more sensitive than the estimator to an overly short | |||
| interval, since an unnecessary drop would occur and lower link | interval, since an unnecessary drop would occur and lower link | |||
| utilization.) If next drop time is reached while the controller is | utilization.) If next drop time is reached while the controller is | |||
| still in drop state, the packet being dequeued is dropped and the | still in drop state, the packet being dequeued is dropped and the | |||
| next drop time is recalculated. | next drop time is recalculated. | |||
| skipping to change at page 13, line 22 ¶ | skipping to change at page 5, line 42 ¶ | |||
| after exiting it and resumes the dropping state at a recent control | after exiting it and resumes the dropping state at a recent control | |||
| level, if one exists. This logic is described more precisely in the | level, if one exists. This logic is described more precisely in the | |||
| pseudo-code below. Additional work is required to determine the | pseudo-code below. Additional work is required to determine the | |||
| frequency and importance of re-entering the dropping state. | frequency and importance of re-entering the dropping state. | |||
| Note that CoDel AQM only enters its dropping state when the local | Note that CoDel AQM only enters its dropping state when the local | |||
| minimum sojourn delay has exceeded target for a time interval long | minimum sojourn delay has exceeded target for a time interval long | |||
| enough for normal bursts to dissipate, ensuring that a burst of | enough for normal bursts to dissipate, ensuring that a burst of | |||
| packets that fits in the pipe will not be dropped. | packets that fits in the pipe will not be dropped. | |||
| 4.2. Non-starvation | 3.1. Non-starvation | |||
| CoDel's goals are to control delay with little or no impact on link | CoDel's goals are to control delay with little or no impact on link | |||
| utilization and to be deployed on a wide range of link bandwidths, | utilization and to be deployed on a wide range of link bandwidths, | |||
| including variable-rate links, without reconfiguration. To keep from | including variable-rate links, without reconfiguration. To keep from | |||
| making drops when it would starve the output link, CoDel makes | making drops when it would starve the output link, CoDel makes | |||
| another check before dropping to see if at least an MTU worth of | another check before dropping to see if at least an MTU worth of | |||
| bytes remains in the buffer. If not, the packet SHOULD NOT be | bytes remains in the buffer. If not, the packet SHOULD NOT be | |||
| dropped and, therefore, CoDel exits the drop state. The MTU size can | dropped and, therefore, CoDel exits the drop state. The MTU size can | |||
| be set adaptively to the largest packet seen so far or can be read | be set adaptively to the largest packet seen so far or can be read | |||
| from the driver. | from the driver. | |||
| 4.3. Using the interval | 3.2. Using the interval | |||
| The interval is chosen to give endpoints time to react to a drop | The interval is chosen to give endpoints time to react to a drop | |||
| without being so long that response times suffer. CoDel's estimator, | without being so long that response times suffer. CoDel's estimator, | |||
| setpoint, and control loop all use the interval. Understanding their | setpoint, and control loop all use the interval. Understanding their | |||
| derivation shows that CoDel is the most sensitive to the value of | derivation shows that CoDel is the most sensitive to the value of | |||
| interval for single long-lived TCPs with a decreased sensitivity for | interval for single long-lived TCPs with a decreased sensitivity for | |||
| traffic mixes. This is fortunate as RTTs vary across connections and | traffic mixes. This is fortunate as RTTs vary across connections and | |||
| are not known apriori. The best policy is to use an interval | are not known apriori. The best policy seems to be to use an | |||
| slightly larger than the RTT seen by most of the connections using a | interval slightly larger than the RTT seen by most of the connections | |||
| link, a value that can be determined as the largest RTT seen if the | using a link, a value that can be determined as the largest RTT seen | |||
| value is not an outlier (use of a 95-99th percentile value should | if the value is not an outlier (use of a 95-99th percentile value | |||
| work). In practice, this value is not known or measured (though see | should work). In practice, this value is not known or measured | |||
| Section 6.2 for an application where interval is measured. An | (though see Section 6.2 for an application where interval is | |||
| interval setting of 100ms works well across a range of RTTs from 10ms | measured. An interval setting of 100ms works well across a range of | |||
| to 1 second (excellent performance is achieved in the range from 10 | RTTs from 10ms to 1 second (excellent performance is achieved in the | |||
| ms to 300ms). For devices intended for the normal terrestrial | range from 10 ms to 300ms). For devices intended for the normal | |||
| Internet, interval SHOULD have a value of 100ms. This will only | terrestrial Internet, interval SHOULD have a value of 100ms. This | |||
| cause overdropping where a long-lived TCP has an RTT longer than | will only cause overdropping where a long-lived TCP has an RTT longer | |||
| 100ms and there is little or no mixing with other connections through | than 100ms and there is little or no mixing with other connections | |||
| the link. | through the link. | |||
| Some confusion concerns the roles of the target setpoint and the | A note on the roles of the target setpoint and the minimum-tracking | |||
| minimum-tracking interval. In particular, some experimenters believe | interval. Target should not be increased in response to lower layers | |||
| the value of target needs to be increased when the lower layers have | that have a bursty nature, where packets are transmitted for short | |||
| a bursty nature where packets are transmitted for short periods | periods interspersed with idle periods where the link is waiting for | |||
| interspersed with idle periods where the link is waiting for | ||||
| permission to send. CoDel's estimator will "see" the effective | permission to send. CoDel's estimator will "see" the effective | |||
| transmission rate over an interval and increasing target will just | transmission rate over an interval, and increasing target only leads | |||
| lead to longer queue delays. On the other hand, where a significant | to longer queue delays. On the other hand, where a significant | |||
| additional delay is added to the intrinsic round trip time of most or | additional delay is added to the intrinsic RTT of most or all packets | |||
| all packets due to the waiting time for a transmission, it is | due to the waiting time for a transmission, it is necessary to | |||
| necessary to increase interval by that extra delay. That is, target | increase interval by that extra delay. Target SHOULD NOT be adjusted | |||
| SHOULD NOT be adjusted but interval MAY need to be adjusted. For | for such short-term bursts, but interval MAY need to be adjusted if | |||
| more on this (and pictures) see http://pollere.net/Pdfdocs/ | the path's intrinsic RTT changes. | |||
| noteburstymacs.pdf | ||||
| 4.4. The target setpoint | 3.3. The target setpoint | |||
| The target is the maximum acceptable persistent queue delay above | The target is the maximum acceptable persistent queue delay above | |||
| which CoDel is dropping or preparing to drop and below which CoDel | which CoDel is dropping or preparing to drop and below which CoDel | |||
| will not drop. The calculations of section 3.2 showed that the best | will not drop. Target SHOULD be set to 5ms for normal Internet | |||
| setpoint is 5-10% of the RTT, with the low end of 5% preferred. We | traffic. | |||
| used simulation to explore the impact when TCPs are mixed with other | ||||
| traffic and with connections of different RTTs. Accordingly, we | ||||
| experimented extensively with values in the 5-10% of RTT range and, | ||||
| overall, used target values between 1 and 20 milliseconds for RTTs | ||||
| from 30 to 500ms and link bandwidths of 64Kbps to 100Mbps to | ||||
| experimentally explore the setpoint that gives consistently high | ||||
| utilization while controlling delay across a range of bandwidths, | ||||
| RTTs, and traffic loads. Our results were notably consistent with | ||||
| the mathematics of section 3.2. Below a target of 5ms, utilization | ||||
| suffers for some conditions and traffic loads, and above 5ms we saw | ||||
| very little or no improvement in utilization. Thus target SHOULD be | ||||
| set to 5ms for normal Internet traffic. | ||||
| A congested (but not overloaded) CoDel link with traffic composed | ||||
| solely or primarily of long-lived TCP flows will have a median delay | ||||
| through the link will tend to the target. For bursty traffic loads | ||||
| and for overloaded conditions (where it is difficult or impossible | ||||
| for all the arriving flows to be accommodated) the median queues will | ||||
| be longer than target. | ||||
| The non-starvation drop inhibit feature dominates where the link rate | The calculations of section 5.2 show that the best setpoint is 5-10% | |||
| becomes very small. By inhibiting drops when there is less than an | of the RTT, with the low end of 5% preferred. Extensive simulations | |||
| (outbound link) MTU worth of bytes in the buffer, CoDel adapts to | exploring the impact of different target values when used with mixed | |||
| very low bandwidth links. This is shown in [CODEL2012] and | traffic flows with different RTTs and different bandwidths show that | |||
| interested parties should see the discussion of results there. | below a target of 5ms, utilization suffers for some conditions and | |||
| Unpublished studies were carried out down to 64Kbps. The drop | traffic loads, and above 5ms showed very little or no improvement in | |||
| inhibit condition can be expanded to include a test to retain | utilization. | |||
| sufficient bytes or packets to fill an allocation in a request-and- | ||||
| grant MAC. | ||||
| Sojourn times must remain above the target for an entire interval in | Sojourn times must remain above the target for an entire interval in | |||
| order to enter the drop state. Any packet with a sojourn time less | order to enter the drop state. Any packet with a sojourn time less | |||
| than the target will reset the time that the queue was last below the | than the target will reset the time that the queue was last below the | |||
| target. Since Internet traffic has very dynamic characteristics, the | target. Since Internet traffic has very dynamic characteristics, the | |||
| actual sojourn delay experienced by packets varies greatly and is | actual sojourn delay experienced by packets varies greatly and is | |||
| often less than the target unless the overload is excessive. When a | often less than the target unless the overload is excessive. When a | |||
| link is not overloaded, it is not a bottleneck and packet sojourn | link is not overloaded, it is not a bottleneck and packet sojourn | |||
| times will be small or nonexistent. In the usual case, there are | times will be small or nonexistent. In the usual case, there are | |||
| only one or two places along a path where packets will encounter a | only one or two places along a path where packets will encounter a | |||
| bottleneck (usually at the edge), so the total amount of queueing | bottleneck (usually at the edge), so the total amount of queueing | |||
| delay experienced by a packet should be less than 10ms even under | delay experienced by a packet should be less than 10ms even under | |||
| extremely congested conditions. This net delay is substantially | extremely congested conditions. This net delay is substantially | |||
| lower than common current queueing delays on the Internet that grow | lower than common current queueing delays on the Internet that grow | |||
| to orders of seconds [NETAL2010, CHARB2007]. | to orders of seconds [NETAL2010, CHARB2007]. | |||
| 4.5. Use with multiple queues | 3.4. Use with multiple queues | |||
| Unlike other AQMs, CoDel is easily adapted to multiple queue systems. | ||||
| With other approaches there is always a question of how to account | ||||
| for the fact that each queue receives less than the full link rate | ||||
| over time and usually sees a varying rate over time. This is exactly | ||||
| what CoDel excels at: using a packet's sojourn time in the buffer | ||||
| completely circumvents this problem. In a multiple-queue setting, a | ||||
| separate CoDel algorithm runs on each queue, but each CoDel instance | ||||
| uses the packet sojourn time the same way a single-queue CoDel does. | ||||
| Just as a single-queue CoDel adapts to changing link bandwidths | ||||
| [CODEL2012], so does a multiple-queue CoDel system. As an | ||||
| optimization to avoid queueing more than necessary, when testing for | ||||
| queue occupancy before dropping, the total occupancy of all queues | ||||
| sharing the same output link should be used. This property of CoDel | ||||
| has been exploited in fq_codel, briefly discussed in the next section | ||||
| and in more detail in [FQ-CODEL-ID]. | ||||
| 4.6. Use of stochastic bins or sub-queues to improve performance | ||||
| Shortly after the release of the CoDel pseudocode, Eric Dumazet | ||||
| created fq_codel, applying CoDel to each bin, or queue, used with | ||||
| stochastic fair queueing. (To understand further, see [SFQ1990] or | ||||
| the linux sfq documentation.) fq_codel hashes on the packet header | ||||
| fields to determine a specific bin, or sub-queue, for each five-tuple | ||||
| flow, and runs CoDel on each bin or sub-queue thus creating a well- | ||||
| mixed output flow and obviating issues of reverse path flows | ||||
| (including "ack compression"). Dumazet's code is part of the CeroWrt | ||||
| project code at the bufferbloat.net's web site and described in [FQ- | ||||
| CODEL-ID]. Andrew McGregor has implemented a version of fq_codel for | ||||
| the network simulator ns-3 (http://www.nsnam.org). | ||||
| We have also experimented with a similar multi-queue approach by | ||||
| creating an ns-2 simulator code module, sfqcodel, which uses | ||||
| Stochastic Fair Queueing (SFQ) to isolate flows into bins, each | ||||
| running CoDel. This setup has provided excellent results: median | ||||
| queues remain small across a range of traffic patterns that includes | ||||
| bidirectional file transfers (that is, the same traffic sent in both | ||||
| directions on a link), constant bit-rate VoIP-like flows, and | ||||
| emulated web traffic and utilizations are consistently better than | ||||
| single queue CoDel, generally very close to 100%. Our code, intended | ||||
| for simulation experiments, is available at http://pollere.net/Co | ||||
| [6]del.html [7] and being integrated into the ns-2 distribution. | ||||
| A number of open issues should be studied. In particular, if the | ||||
| number of different queues or bins is too large, the scheduling will | ||||
| be the dominant factor, not the AQM; it is NOT the case that more | ||||
| bins are always better. In our simulations, we have found good | ||||
| behavior across mixed traffic types with smaller numbers of queues, | ||||
| 8-16 for a 5Mbps link. This configuration appears to give the best | ||||
| behavior for voice, web browsing and file transfers where increased | ||||
| numbers of bins seems to favor file transfers at the expense of the | ||||
| other traffic. Our work has been very preliminary and we encourage | ||||
| others to take this up and to explore analytic modeling. It would be | ||||
| instructive to see the effects of different numbers of bins on a | ||||
| range of traffic models, something like an updated version of | ||||
| [BMPFQ]. | ||||
| Implementers SHOULD use the fq_codel multiple queue approach if | CoDel is easily adapted to multiple queue systems. With other | |||
| possible as it deals with many problems beyond the reach of an AQM on | approaches there is always a question of how to account for the fact | |||
| a single queue. | that each queue receives less than the full link rate over time and | |||
| usually sees a varying rate over time. This is what CoDel excels at: | ||||
| using a packet's sojourn time in the buffer completely circumvents | ||||
| this problem. In a multiple-queue setting, a separate CoDel | ||||
| algorithm runs on each queue, but each CoDel instance uses the packet | ||||
| sojourn time the same way a single-queue CoDel does. Just as a | ||||
| single-queue CoDel adapts to changing link bandwidths [CODEL2012], so | ||||
| does a multiple-queue CoDel system. As an optimization to avoid | ||||
| queueing more than necessary, when testing for queue occupancy before | ||||
| dropping, the total occupancy of all queues sharing the same output | ||||
| link should be used. This property of CoDel has been exploited in | ||||
| fq_codel [FQ-CODEL-ID], which hashes on the packet header fields to | ||||
| determine a specific bin, or sub-queue, for each five-tuple flow, and | ||||
| runs CoDel on each bin or sub-queue thus creating a well-mixed output | ||||
| flow and obviating issues of reverse path flows (including "ack | ||||
| compression"). | ||||
| 4.7. Setting up CoDel AQM | 3.5. Setting up CoDel | |||
| CoDel is set for use in devices in the open Internet. An interval of | CoDel is set for use in devices in the open Internet. An interval of | |||
| 100ms is used, target is set to 5% of interval, and the initial drop | 100ms is used, target is set to 5% of interval, and the initial drop | |||
| spacing is also set to interval. These settings have been chosen so | spacing is also set to interval. These settings have been chosen so | |||
| that a device, such as a small WiFi router, can be sold without the | that a device, such as a small WiFi router, can be sold without the | |||
| need for any values to be made adjustable, yielding a parameterless | need for any values to be made adjustable, yielding a parameterless | |||
| implementation. In addition, CoDel is useful in environments with | implementation. In addition, CoDel is useful in environments with | |||
| significantly different characteristics from the normal Internet, for | significantly different characteristics from the normal Internet, for | |||
| example, in switches used as a cluster interconnect within a data | example, in switches used as a cluster interconnect within a data | |||
| center. Since cluster traffic is entirely internal to the data | center. Since cluster traffic is entirely internal to the data | |||
| center, round trip latencies are low (typically <100us) but | center, round trip latencies are low (typically <100us) but | |||
| bandwidths are high (1-40Gbps) so it's relatively easy for the | bandwidths are high (1-40Gbps) so it's relatively easy for the | |||
| aggregation phase of a distributed computation (e.g., the Reduce part | aggregation phase of a distributed computation (e.g., the Reduce part | |||
| of a Map/Reduce) to persistently fill then overflow the modest per- | of a Map/Reduce) to persistently fill then overflow the modest per- | |||
| port buffering available in most high speed switches. A CoDel | port buffering available in most high speed switches. A CoDel | |||
| configured for this environment (target and interval in the | configured for this environment (target and interval in the | |||
| microsecond rather than millisecond range) can minimize drops (or ECN | microsecond rather than millisecond range) can minimize drops or ECN | |||
| marks) while keeping throughput high and latency low. | marks while keeping throughput high and latency low. | |||
| Devices destined for these environments MAY use a different interval, | Devices destined for these environments MAY use a different interval, | |||
| where suitable. If appropriate analysis indicates, the target MAY be | where suitable. If appropriate analysis indicates, the target MAY be | |||
| set to some other value in the 5-10% of interval and the initial drop | set to some other value in the 5-10% of interval and the initial drop | |||
| spacing MAY be set to a value of 1.0 to 1.2 times the interval. But | spacing MAY be set to a value of 1.0 to 1.2 times the interval. But | |||
| these settings will cause problems such as overdropping and low | these settings will cause problems such as overdropping and low | |||
| throughput if used on the open Internet, so devices that allow CoDel | throughput if used on the open Internet, so devices that allow CoDel | |||
| to be configured SHOULD default to Internet-appropriate values given | to be configured SHOULD default to Internet-appropriate values given | |||
| in this document. | in this document. | |||
| 5. Annotated Pseudo-code for CoDel AQM | 4. Annotated Pseudo-code for CoDel AQM | |||
| What follows is the CoDel algorithm in C++-like pseudo-code. Since | What follows is the CoDel algorithm in C++-like pseudo-code. Since | |||
| CoDel adds relatively little new code to a basic tail-drop fifo- | CoDel adds relatively little new code to a basic tail-drop fifo- | |||
| queue, we have attempted to highlight just these additions by | queue, we have attempted to highlight just these additions by | |||
| presenting CoDel as a sub-class of a basic fifo-queue base class. | presenting CoDel as a sub-class of a basic fifo-queue base class. | |||
| The reference code is included to aid implementers who wish to apply | The reference code is included to aid implementers who wish to apply | |||
| CoDel to queue management as described here or to adapt its | CoDel to queue management as described here or to adapt its | |||
| principles to other applications. | principles to other applications. | |||
| Implementors are strongly encouraged to also look at Eric Dumazet's | Implementors are strongly encouraged to also look at the Linux kernel | |||
| Linux kernel version of CoDel - a well-written, well tested, real- | version of CoDel - a well-written, well tested, real-world, C-based | |||
| world, C-based implementation. As of this writing, it is at | implementation. As of this writing, it is available at | |||
| https://github.com/torvalds/linux/blob/master/net/sched/sch_codel.c. | https://github.com/torvalds/linux/blob/master/net/sched/sch_codel.c. | |||
| The following pseudo-code is open-source with a dual BSD/GPL license: | The following pseudo-code is open-source with a dual BSD/GPL license: | |||
| Codel - The Controlled-Delay Active Queue Management algorithm. | Codel - The Controlled-Delay Active Queue Management algorithm. | |||
| Copyright (C) 2011-2014 Kathleen Nichols <nichols@pollere.com>. | Copyright (C) 2011-2014 Kathleen Nichols <nichols@pollere.com>. | |||
| Redistribution and use in source and binary forms, with or without | Redistribution and use in source and binary forms, with or without | |||
| modification, are permitted provided that the following conditions | modification, are permitted provided that the following conditions | |||
| are met: | are met: | |||
| skipping to change at page 18, line 26 ¶ | skipping to change at page 9, line 31 ¶ | |||
| LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR | |||
| A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT | |||
| OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, | OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, | |||
| SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT | SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT | |||
| LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, | LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, | |||
| DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY | |||
| THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT | |||
| (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | |||
| OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||
| 5.1. Data Types | 4.1. Data Types | |||
| time_t is an integer time value in units convenient for the system. | time_t is an integer time value in units convenient for the system. | |||
| Resolution to at least a millisecond is required and better | The code presented here uses 0 as a flag value to indicate "no time | |||
| resolution is useful up to the minimum possible packet time on the | set." | |||
| output link; 64- or 32-bit widths are acceptable but with 32 bits the | ||||
| resolution should be no finer than 2^{-16} to leave enough dynamic | ||||
| range to represent a wide range of queue waiting times. Narrower | ||||
| widths also have implementation issues due to overflow (wrapping) and | ||||
| underflow (limit cycles because of truncation to zero) that are not | ||||
| addressed in this pseudocode. The code presented here uses 0 as a | ||||
| flag value to indicate "no time set." | ||||
| packet_t* is a pointer to a packet descriptor. We assume it has a | packet_t* is a pointer to a packet descriptor. We assume it has a | |||
| tstamp field capable of holding a time_t and that field is available | tstamp field capable of holding a time_t and that field is available | |||
| for use by CoDel (it will be set by the enqueue routine and used by | for use by CoDel (it will be set by the enqueue routine and used by | |||
| the dequeue routine). | the dequeue routine). | |||
| queue_t is a base class for queue objects (the parent class for | queue_t is a base class for queue objects (the parent class for | |||
| codel_queue_t objects). We assume it has enqueue() and dequeue() | codel_queue_t objects). We assume it has enqueue() and dequeue() | |||
| methods that can be implemented in child classes. We assume it has a | methods that can be implemented in child classes. We assume it has a | |||
| bytes() method that returns the current queue size in bytes. This | bytes() method that returns the current queue size in bytes. This | |||
| can be an approximate value. The method is invoked in the dequeue() | can be an approximate value. The method is invoked in the dequeue() | |||
| method but shouldn't require a lock with the enqueue() method. | method but shouldn't require a lock with the enqueue() method. | |||
| flag_t is a Boolean. | flag_t is a Boolean. | |||
| 5.2. Per-queue state (codel_queue_t instance variables) | 4.2. Per-queue state (codel_queue_t instance variables) | |||
| time_t first_above_time_ = 0; // Time to declare sojourn time above | time_t first_above_time_ = 0; // Time to declare sojourn time above | |||
| // target | // target | |||
| time_t drop_next_ = 0; // Time to drop next packet | time_t drop_next_ = 0; // Time to drop next packet | |||
| uint32_t count_ = 0; // Packets dropped in dropping state | uint32_t count_ = 0; // Packets dropped in dropping state | |||
| uint32_t lastcount_ = 0; // Count from previous iteration | uint32_t lastcount_ = 0; // Count from previous iteration | |||
| flag_t dropping_ = false; // Set to true if in drop state | flag_t dropping_ = false; // Set to true if in drop state | |||
| 5.3. Constants | 4.3. Constants | |||
| time_t target = MS2TIME(5); // 5ms target queue delay | time_t target = MS2TIME(5); // 5ms target queue delay | |||
| time_t interval = MS2TIME(100); // 100ms sliding-minimum window | time_t interval = MS2TIME(100); // 100ms sliding-minimum window | |||
| u_int maxpacket = 512; // Maximum packet size in bytes | u_int maxpacket = 512; // Maximum packet size in bytes | |||
| // (should use interface MTU) | // (should use interface MTU) | |||
| 5.4. Enqueue routine | 4.4. Enqueue routine | |||
| All the work of CoDel is done in the dequeue routine. The only CoDel | All the work of CoDel is done in the dequeue routine. The only CoDel | |||
| addition to enqueue is putting the current time in the packet's | addition to enqueue is putting the current time in the packet's | |||
| tstamp field so that the dequeue routine can compute the packet's | tstamp field so that the dequeue routine can compute the packet's | |||
| sojourn time. | sojourn time. | |||
| void codel_queue_t::enqueue(packet_t* pkt) | void codel_queue_t::enqueue(packet_t* pkt) | |||
| { | { | |||
| pkt->timestamp() = clock(); | pkt->timestamp() = clock(); | |||
| queue_t::enqueue(pkt); | queue_t::enqueue(pkt); | |||
| } | } | |||
| 5.5. Dequeue routine | 4.5. Dequeue routine | |||
| This is the heart of CoDel. There are two branches based on whether | This is the heart of CoDel. There are two branches based on whether | |||
| the controller is in dropping state: (i) if the controller is in | the controller is in dropping state: (i) if the controller is in | |||
| dropping state (that is, the minimum packet sojourn time is greater | dropping state (that is, the minimum packet sojourn time is greater | |||
| than target) then the controller checks if it is time to leave | than target) then the controller checks if it is time to leave | |||
| dropping state or schedules the next drop(s); or (ii) if the | dropping state or schedules the next drop(s); or (ii) if the | |||
| controller is not in dropping state, it determines if it should enter | controller is not in dropping state, it determines if it should enter | |||
| dropping state and do the initial drop. | dropping state and do the initial drop. | |||
| packet_t* CoDelQueue::dequeue() | packet_t* CoDelQueue::dequeue() | |||
| skipping to change at page 21, line 5 ¶ | skipping to change at page 12, line 5 ¶ | |||
| // tested. | // tested. | |||
| delta = count_ - lastcount_; | delta = count_ - lastcount_; | |||
| count_ = (delta > 1 && now - drop_next_ < 16*interval_)? | count_ = (delta > 1 && now - drop_next_ < 16*interval_)? | |||
| delta : 1; | delta : 1; | |||
| drop_next_ = control_law(now, count_); | drop_next_ = control_law(now, count_); | |||
| lastcount_ = count_; | lastcount_ = count_; | |||
| } | } | |||
| return (r.p); | return (r.p); | |||
| } | } | |||
| 5.6. Helper routines | 4.6. Helper routines | |||
| Since the degree of multiplexing and nature of the traffic sources is | Since the degree of multiplexing and nature of the traffic sources is | |||
| unknown, CoDel acts as a closed-loop servo system that gradually | unknown, CoDel acts as a closed-loop servo system that gradually | |||
| increases the frequency of dropping until the queue is controlled | increases the frequency of dropping until the queue is controlled | |||
| (sojourn time goes below target). This is the control law that | (sojourn time goes below target). This is the control law that | |||
| governs the servo. It has this form because of the sqrt(p) | governs the servo. It has this form because of the sqrt(p) | |||
| dependence of TCP throughput on drop probability. Note that for | dependence of TCP throughput on drop probability. Note that for | |||
| embedded systems or kernel implementation, the inverse sqrt can be | embedded systems or kernel implementation, the inverse sqrt can be | |||
| computed efficiently using only integer multiplication. | computed efficiently using only integer multiplication. | |||
| skipping to change at page 22, line 47 ¶ | skipping to change at page 13, line 47 ¶ | |||
| // just went above from below. if still above at | // just went above from below. if still above at | |||
| // first_above_time, will say it's ok to drop. | // first_above_time, will say it's ok to drop. | |||
| first_above_time_ = now + interval_; | first_above_time_ = now + interval_; | |||
| } else if (now >= first_above_time_) { | } else if (now >= first_above_time_) { | |||
| r.ok_to_drop = true; | r.ok_to_drop = true; | |||
| } | } | |||
| } | } | |||
| return r; | return r; | |||
| } | } | |||
| 5.7. Implementation considerations | 4.7. Implementation considerations | |||
| time_t is an integer time value in units convenient for the system. | ||||
| Resolution to at least a millisecond is required and better | ||||
| resolution is useful up to the minimum possible packet time on the | ||||
| output link; 64- or 32-bit widths are acceptable but with 32 bits the | ||||
| resolution should be no finer than 2^{-16} to leave enough dynamic | ||||
| range to represent a wide range of queue waiting times. Narrower | ||||
| widths also have implementation issues due to overflow (wrapping) and | ||||
| underflow (limit cycles because of truncation to zero) that are not | ||||
| addressed in this pseudocode. | ||||
| Since CoDel requires relatively little per-queue state and no direct | Since CoDel requires relatively little per-queue state and no direct | |||
| communication or state sharing between the enqueue and dequeue | communication or state sharing between the enqueue and dequeue | |||
| routines, it is relatively simple to add CoDel to almost any packet | routines, it is relatively simple to add CoDel to almost any packet | |||
| processing pipeline, including ASIC- or NPU-based forwarding engines. | processing pipeline, including ASIC- or NPU-based forwarding engines. | |||
| One issue to consider is dodequeue()'s use of a 'bytes()' function to | One issue to consider is dodequeue()'s use of a 'bytes()' function to | |||
| determine the current queue size in bytes. This value does not need | determine the current queue size in bytes. This value does not need | |||
| to be exact. If the enqueue part of the pipeline keeps a running | to be exact. If the enqueue part of the pipeline keeps a running | |||
| count of the total number of bytes it has put into the queue and the | count of the total number of bytes it has put into the queue and the | |||
| dequeue routine keeps a running count of the total bytes it has | dequeue routine keeps a running count of the total bytes it has | |||
| removed from the queue, 'bytes()' is simply the difference between | removed from the queue, 'bytes()' is simply the difference between | |||
| these two counters (32-bit counters should be adequate.) Enqueue has | these two counters (32-bit counters should be adequate.) Enqueue has | |||
| to update its counter once per packet queued but it does not matter | to update its counter once per packet queued but it does not matter | |||
| when (before, during or after the packet has been added to the | when (before, during or after the packet has been added to the | |||
| queue). The worst that can happen is a slight, transient, | queue). The worst that can happen is a slight, transient, | |||
| underestimate of the queue size which might cause a drop to be | underestimate of the queue size which might cause a drop to be | |||
| briefly deferred. | briefly deferred. | |||
| 6. Adapting and applying CoDel's building blocks | 5. Understanding the Building Blocks of Queue Management | |||
| CoDel has been implemented and tested in a range of environments. | At the heart of queue management is the notion of "good queue" and | |||
| "bad queue" and the search for ways to get rid of the bad queue | ||||
| (which only adds delay) while preserving the good queue (which | ||||
| provides for good utilization). This section explains queueing, both | ||||
| good and bad, and covers the CoDel building blocks that can be used | ||||
| to manage packet buffers to keep their queues in the "good" range. | ||||
| 6.1. Validations and available code | Packet queues form in buffers facing bottleneck links, i.e., where | |||
| the line rate goes from high to low or where many links converge. | ||||
| The well-known bandwidth-delay product (sometimes called "pipe size") | ||||
| is the bottleneck's bandwidth multiplied by the sender-receiver- | ||||
| sender round-trip delay, and is the amount of data that has to be in | ||||
| transit between two hosts in order to run the bottleneck link at 100% | ||||
| utilization. To explore how queues can form, consider a long-lived | ||||
| TCP connection with a 25 packet window sending through a connection | ||||
| with a bandwidth-delay product of 20 packets. After an initial burst | ||||
| of packets the connection will settle into a five packet (+/-1) | ||||
| standing queue; this standing queue size is determined by the | ||||
| mismatch between the window size and the pipe size, and is unrelated | ||||
| to the connection's sending rate. The connection has 25 packets in | ||||
| flight at all times, but only 20 packets arrive at the destination | ||||
| over a round trip time. If the TCP connection has a 30 packet | ||||
| window, the queue will be ten packets with no change in sending rate. | ||||
| An experiment by Stanford graduate students successfully used Linux | Similarly, if the window is 20 packets, there will be no queue but | |||
| CoDel to duplicate our published simulation work on CoDel's ability | the sending rate is the same. Nothing can be inferred about the | |||
| to adapt to drastic link rate changes. This experiment can be found | sending rate from the queue size, and any queue other than transient | |||
| at http://reproducingnetworkresearch.wordpress.com/2012/06/06/ | bursts only creates delays in the network. The sender needs to | |||
| solving-bufferbloat-the-codel-way/ . | reduce the number of packets in flight rather than sending rate. | |||
| Our ns-2 simulations are available at http://pollere.net/Codel.html . | In the above example, the five packet standing queue can be seen to | |||
| CableLabs has funded some additions to the simulator sfqcodel code, | contribute nothing but delay to the connection, and thus is clearly | |||
| which have been made public. The basic algorithm of CoDel remains | "bad queue". If, in our example, there is a single bottleneck link | |||
| unchanged, but we continue to experiment with drop interval setting | and it is much slower than the link that feeds it (say, a high-speed | |||
| when resuming the drop state, inhibiting or canceling drop state when | ethernet link into a limited DSL uplink) a 20 packet buffer at the | |||
| few bytes are in the queue, and other details. Our approach to | bottleneck might be necessary to temporarily hold the 20 packets in | |||
| changes is to only make them if we are convinced they do more good | flight to keep the bottleneck link's utilization high. The burst of | |||
| than harm, both operationally and in the implementation. With this | packets should drain completely (to 0 or 1 packets) within a round | |||
| in mind, some of these issues may continue to evolve as we get more | trip time and this transient queue is "good queue" because it allows | |||
| deployment and as the building blocks are applied to a wider range of | the connection to keep the 20 packets in flight and for the | |||
| problems. | bottleneck link to be fully utilized. In terms of the delay | |||
| experienced, the "good queue" goes away in about a round trip time, | ||||
| while "bad queue" hangs around for longer, causing delays. | ||||
| CoDel is available in ns-2 version 2.35 and later. | Effective queue management detects "bad queue" while ignoring "good | |||
| queue" and takes action to get rid of the bad queue when it is | ||||
| detected. The goal is a queue controller that accomplishes this | ||||
| objective. To control a queue, we need three basic components | ||||
| Andrew McGregor has an ns-3 implementation of both CoDel and | o Estimator - figure out what we've got | |||
| fq_codel. CoDel is available in ns-3 version 3.21 and later at | ||||
| https://www.nsnam.org/ . At the time of this writing, the ns-3 | ||||
| implementation of fq_codel is available at | ||||
| https://www.nsnam.org/wiki/GSOC2014Bufferbloat . | ||||
| CoDel is available in Linux. Eric Dumazet implemented CoDel in the | o Setpoint - know what what we want | |||
| Linux kernel. | ||||
| Dave Taht integrated and distributed CoDel as a bufferbloat solution | o Control loop - if what we've got isn't what we want, we need a way | |||
| (http://www.bufferbloat.net/projects/codel ). | to move it there | |||
| 6.2. CoDel in the datacenter | 5.1. Estimator | |||
| Nandita Dukkipati and her group at Google realized that the CoDel | The estimator both observes the queue and detects when good queue | |||
| building blocks could be applied to bufferbloat problems in | turns to bad queue and vice versa. CoDel has two parts to its | |||
| datacenter servers, not just to Internet routers. The Linux CoDel | estimator: what is observed as an indicator of queue and how the | |||
| queueing discipline (qdisc) was adapted in three ways to tackle this | observations are used to detect good/bad queue. | |||
| bufferbloat problem. | ||||
| 1. The default CoDel action was modified to be a direct feedback | Queue length has been widely used as an observed indicator of | |||
| from qdisc to the TCP layer at dequeue. The direct feedback | congestion and is frequently conflated with sending rate. Use of | |||
| simply reduces TCP's congestion window just as congestion control | queue length as a metric is sensitive to how and when the length is | |||
| would do in the event of drop. The scheme falls back to ECN | observed. A high speed arrival link to a buffer serviced at a much | |||
| marking or packet drop if the TCP socket lock could not be | lower rate can rapidly build up a queue that might disperse | |||
| acquired at dequeue. | completely or down to a single packet before a round trip time has | |||
| elapsed. If the queue length is monitored at packet arrival (as in | ||||
| original RED) or departure time, every packet will see a queue with | ||||
| one possible exception. If the queue length itself is time sampled | ||||
| (as recommended in [REDL1998], a truer picture of the queue's | ||||
| occupancy can be gained at the expense of considerable implementation | ||||
| complexity. | ||||
| 2. Being located in the server makes it possible to monitor the | The use of queue length is further complicated in networks that are | |||
| actual RTT to use as CoDel's interval rather than making a "best | subject to both short and long term changes in available link rate | |||
| guess" of RTT. The CoDel interval is dynamically adjusted by | (as in WiFi). Link rate drops can result in a spike in queue length | |||
| using the maximum TCP round-trip time (RTT) of those connections | that should be ignored unless it persists. It is not the queue | |||
| sharing the same Qdisc/bucket. In particular, there is a history | length that should be controlled but the amount of excess delay | |||
| entry of the maximum RTT experienced over the last second. As a | packets experience due to a persistent or standing queue, which means | |||
| packet is dequeued, the RTT estimate is accessed from its TCP | that the packet sojourn time in the buffer is exactly what we want to | |||
| socket. If the estimate is larger than the current CoDel | track. Tracking the packet sojourn times in the buffer observes the | |||
| interval, the CoDel interval is immediately refreshed to the new | actual delay experienced by each packet. Sojourn time allows queue | |||
| value. If the CoDel interval is not refreshed for over a second, | management to be independent of link rate, gives superior performance | |||
| it is decreased it to the history entry and the process is | to use of buffer size, and is directly related to user-visible | |||
| repeated. The use of the dynamic TCP RTT estimate lets interval | performance. It works regardless of line rate changes or link | |||
| adapt to the actual maximum value currently seen and thus lets | sharing by multiple queues (which the individual queues may | |||
| the controller space its drop intervals appropriately. | experience as changing rates). | |||
| 3. Since the mathematics of computing the setpoint are invariant, a | Consider a link shared by two queues with different priorities. | |||
| target of 5% of the RTT or CoDel interval was used here. | Packets that arrive at the high priority queue are sent as soon as | |||
| the link is available while packets in the other queue have to wait | ||||
| until the high priority queue is empty (i.e., a strict priority | ||||
| scheduler). The number of packets in the high priority queue might | ||||
| be large but the queue is emptied quickly and the amount of time each | ||||
| packet spends enqueued (the sojourn time) is not large. The other | ||||
| queue might have a smaller number of packets, but packet sojourn | ||||
| times will include the waiting time for the high priority packets to | ||||
| be sent. This makes the sojourn time a good sample of the congestion | ||||
| that each separate queue is experiencing. This example also shows | ||||
| how the metric of sojourn time is independent of the number of queues | ||||
| or the service discipline used, and is instead indicative of | ||||
| congestion seen by the individual queues. | ||||
| Non-data packets were not dropped as these are typically small and | How can observed sojourn time be used to separate good queue from bad | |||
| sometimes critical control packets. Being located on the server, | queue? Although averages, especially of queue length, have | |||
| there is no concern with misbehaving users as there would be on the | previously been widely used as an indicator of bad queue, their | |||
| public Internet. | efficacy is questionable. Consider the burst that disperses every | |||
| round trip time. The average queue will be one-half the burst size, | ||||
| though this might vary depending on when the average is computed and | ||||
| the timing of arrivals. The average queue sojourn time would be one- | ||||
| half the time it takes to clear the burst. The average then would | ||||
| indicate a persistent queue where there is none. Instead of averages | ||||
| we recommend tracking the minimum sojourn time, then, if there is one | ||||
| packet that has a zero sojourn time then there is no persistent | ||||
| queue. The value of the minimum in detecting persistent queue is | ||||
| apparent when looking at graphs of queue delay. | ||||
| In several data center workload benchmarks, which are typically | A persistent queue can be detected by tracking the (local) minimum | |||
| bursty, CoDel reduced the queueing latencies at the qdisc, and | queue delay packets experience. To ensure that this minimum value | |||
| thereby improved the mean and 99th-percentile latencies from several | does not become stale, it has to have been experienced recently, i.e. | |||
| tens of milliseconds to less than one millisecond. The minimum | during an appropriate past time interval. This interval is the | |||
| tracking part of the CoDel framework proved useful in disambiguating | maximum amount of time a minimum value is considered to be in effect, | |||
| "good" queue versus "bad" queue, particularly helpful in controlling | and is related to the amount of time it takes for the largest | |||
| qdisc buffers that are inherently bursty because of TCP Segmentation | expected burst to drain. Conservatively, this interval should be at | |||
| Offload (TSO). | least a round trip time to avoid falsely detecting a persistent queue | |||
| and not a lot more than a round trip time to avoid delay in detecting | ||||
| the persistent queue. This suggests that the appropriate interval | ||||
| value is the maximum round-trip time of all the connections sharing | ||||
| the buffer. | ||||
| (The following key insight makes computation of the local minimum | ||||
| efficient: It is sufficient to keep a single state variable of how | ||||
| long the minimum has been above or below a target value rather than | ||||
| retaining all the local values to compute the minimum, leading to | ||||
| both storage and computational savings. We use this insight in the | ||||
| pseudo-code for CoDel later in the draft.) | ||||
| These two parts, use of sojourn time as observed values and the local | ||||
| minimum as the statistic to monitor queue congestion are key to | ||||
| CoDel's estimator building block. The local minimum sojourn time | ||||
| provides an accurate and robust measure of standing queue and has an | ||||
| efficient implementation. In addition, use of the minimum sojourn | ||||
| time has important advantages in implementation. The minimum packet | ||||
| sojourn can only be decreased when a packet is dequeued which means | ||||
| that all the work of CoDel can take place when packets are dequeued | ||||
| for transmission and that no locks are needed in the implementation. | ||||
| The minimum is the only statistic with this property. | ||||
| A more detailed explanation with many pictures can be found in | ||||
| http://www.ietf.org/proceedings/84/slides/slides-84-tsvarea-4.pdf . | ||||
| 5.2. Setpoint | ||||
| Now that we have a robust way of detecting standing queue, we need a | ||||
| setpoint that tells us when to act. If the controller is set to take | ||||
| action as soon as the estimator has a non-zero value, the average | ||||
| drop rate will be maximized, which minimizes TCP goodput | ||||
| [MACTCP1997]. Also, this policy results in no backlog over time (no | ||||
| persistent queue), which negates much of the value of having a | ||||
| buffer, since it maximizes the bottleneck link bandwidth lost due to | ||||
| normal stochastic variation in packet interarrival time. We want a | ||||
| setpoint that maximizes utilization while minimizing delay. Early in | ||||
| the history of packet networking, Kleinrock developed the analytic | ||||
| machinery to do this using a quantity he called 'power', which is the | ||||
| ratio of a normalized throughput to a normalized delay [KLEIN81]. | ||||
| It is straightforward to derive an analytic expression for the | ||||
| average goodput of a TCP conversation at a given round-trip time r | ||||
| and setpoint f (where f is expressed as a fraction of r). Reno TCP, | ||||
| for example, yields: | ||||
| goodput = r (3 + 6f - f^2) / (4 (1+f)) | ||||
| Since the peak queue delay is simply the product of f and r, power is | ||||
| solely a function of f since the r's in the numerator and denominator | ||||
| cancel: | ||||
| power is proportional to (1 + 2f - 1/3 f^2) / (1 + f)^2 | ||||
| As Kleinrock observed, the best operating point, in terms of | ||||
| bandwidth / delay tradeoff, is the peak power point, since points off | ||||
| the peak represent a higher cost (in delay) per unit of bandwidth. | ||||
| The power vs. f curve for any AIMD TCP is monotone decreasing. But | ||||
| the curve is very flat for f < 0.1 followed by a increasing curvature | ||||
| with a knee around f = 0.2, then a steep, almost linear fall off | ||||
| [TSV84]. Since the previous equation showed that goodput is monotone | ||||
| increasing with f, the best operating point is near the right edge of | ||||
| the flat top since that represents the highest goodput achievable for | ||||
| a negligible increase in delay. However, since the r in the model is | ||||
| a conservative upper bound, a target of 0.1r runs the risk of pushing | ||||
| shorter RTT connections over the knee and giving them higher delay | ||||
| for no significant goodput increase. Generally, a more conservative | ||||
| target of 0.05r offers a good utilization vs. delay tradeoff while | ||||
| giving enough headroom to work well with a large variation in real | ||||
| RTT. | ||||
| As the above analysis shows, a very small standing queue gives close | ||||
| to 100% utilization of the bottleneck link. While this result was | ||||
| for Reno TCP, the derivation uses only properties that must hold for | ||||
| any 'TCP friendly' transport. We have verified by both analysis and | ||||
| simulation that this result holds for Reno, Cubic, and | ||||
| Westwood[TSV84]. This results in a particularly simple form for the | ||||
| setpoint: the ideal range for the permitted standing queue is between | ||||
| 5% and 10% of the TCP connection's RTT. Thus target is simply 5% of | ||||
| the interval of section 3.1. | ||||
| We used simulation to explore the impact when TCPs are mixed with | ||||
| other traffic and with connections of different RTTs. Accordingly, | ||||
| we experimented extensively with values in the 5-10% of RTT range | ||||
| and, overall, used target values between 1 and 20 milliseconds for | ||||
| RTTs from 30 to 500ms and link bandwidths of 64Kbps to 100Mbps to | ||||
| experimentally explore the setpoint that gives consistently high | ||||
| utilization while controlling delay across a range of bandwidths, | ||||
| RTTs, and traffic loads. Our results were notably consistent with | ||||
| the mathematics above. | ||||
| A congested (but not overloaded) CoDel link with traffic composed | ||||
| solely or primarily of long-lived TCP flows will have a median delay | ||||
| through the link will tend to the target. For bursty traffic loads | ||||
| and for overloaded conditions (where it is difficult or impossible | ||||
| for all the arriving flows to be accommodated) the median queues will | ||||
| be longer than target. | ||||
| The non-starvation drop inhibit feature dominates where the link rate | ||||
| becomes very small. By inhibiting drops when there is less than an | ||||
| (outbound link) MTU worth of bytes in the buffer, CoDel adapts to | ||||
| very low bandwidth links, as shown in [CODEL2012]. | ||||
| 5.3. Control Loop | ||||
| Section 5.1 describes a simple, reliable way to measure bad | ||||
| (persistent) queue. Section 5.2 shows that TCP congestion control | ||||
| dynamics gives rise to a setpoint for this measure that's a provably | ||||
| good balance between enhancing throughput and minimizing delay, and | ||||
| that this setpoint is a constant fraction of the same 'largest | ||||
| average RTT' interval used to distinguish persistent from transient | ||||
| queue. The only remaining building block needed for a basic AQM is a | ||||
| 'control loop' algorithm to effectively drive the queueing system | ||||
| from any 'persistent queue above target' state to a state where the | ||||
| persistent queue is below target. | ||||
| Control theory provides a wealth of approaches to the design of | ||||
| control loops. Most of classical control theory deals with the | ||||
| control of linear, time-invariant, single-input-single-output (SISO) | ||||
| systems. Control loops for these systems generally come from a (well | ||||
| understood) class known as Proportional-Integral-Derivative (PID) | ||||
| controllers. Unfortunately, a queue is not a linear system and an | ||||
| AQM operates at the point of maximum non-linearity (where the output | ||||
| link bandwidth saturates so increased demand creates delay rather | ||||
| than higher utilization). Output queues are also not time-invariant | ||||
| since traffic is generally a mix of connections which start and stop | ||||
| at arbitrary times and which can have radically different behaviors | ||||
| ranging from "open loop" UDP audio/video to "closed-loop" congestion- | ||||
| avoiding TCP. Finally, the constantly changing mix of connections | ||||
| (which can't be converted to a single 'lumped parameter' model | ||||
| because of their transfer function differences) makes the system | ||||
| multi-input-multi-output (MIMO), not SISO. | ||||
| Since queueing systems match none of the prerequisites for a | ||||
| classical controller, a modern state-space controller is a better | ||||
| approach with states 'no persistent queue' and 'has persistent | ||||
| queue'. Since Internet traffic mixtures change rapidly and | ||||
| unpredictably, a noise and error tolerant adaptation algorithm like | ||||
| Stochastic Gradient is a good choice. Since there's essentially no | ||||
| information in the amount of persistent queue [TSV84], the adaptation | ||||
| should be driven by how long it has persisted. | ||||
| Consider the two extremes of traffic behavior, a single open-loop UDP | ||||
| video stream and a single, long-lived TCP bulk data transfer. If the | ||||
| average bandwidth of the UDP video stream is greater that the | ||||
| bottleneck link rate, the link's queue will grow and the controller | ||||
| will eventually enter 'has persistent queue' state and start dropping | ||||
| packets. Since the video stream is open loop, its arrival rate is | ||||
| unaffected by drops so the queue will persist until the average drop | ||||
| rate is greater than the output bandwidth deficit (= average arrival | ||||
| rate - average departure rate) so the job of the adaptation algorithm | ||||
| is to discover this rate. For this example, the adaptation could | ||||
| consist of simply estimating the arrival and departure rates then | ||||
| dropping at a rate slightly greater than their difference. But this | ||||
| class of algorithm won't work at all for the bulk data TCP stream. | ||||
| TCP runs in closed-loop flow balance [TSV84] so its arrival rate is | ||||
| almost always exactly equal to the departure rate - the queue isn't | ||||
| the result of a rate imbalance but rather a mismatch between the TCP | ||||
| sender's window and the source-destination-source round-trip path | ||||
| capacity (i.e., the connection's bandwidth-delay product). The | ||||
| sender's TCP congestion avoidance algorithm will slowly increase the | ||||
| send window (one packet per round-trip-time) [RFC2581] which will | ||||
| eventually cause the bottleneck to enter 'has persistent queue' | ||||
| state. But, since the average input rate is the same as the average | ||||
| output rate, the rate deficit estimation that gave the correct drop | ||||
| rate for the video stream would compute a drop rate of zero for the | ||||
| TCP stream. However, if the output link drops one packet as it | ||||
| enters 'has persistent queue' state, when the sender discovers this | ||||
| (via TCP's normal packet loss repair mechanisms) it will reduce its | ||||
| window by a factor of two [RFC2581] so, one round-trip-time after the | ||||
| drop, the persistent queue will go away. | ||||
| If there were N TCP conversations sharing the bottleneck, the | ||||
| controller would have to drop O(N) packets, one from each | ||||
| conversation, to make all the conversations reduce their window to | ||||
| get rid of the persistent queue. If the traffic mix consists of | ||||
| short (<= bandwidth-delay product) conversations, the aggregate | ||||
| behavior becomes more like the open-loop video example since each | ||||
| conversation is likely to have already sent all its packets by the | ||||
| time it learns about a drop so each drop has negligible effect on | ||||
| subsequent traffic. | ||||
| The controller does not know the number, duration, or kind of | ||||
| conversations creating its queue, so it has to learn the appropriate | ||||
| response. Since single drops can have a large effect if the degree | ||||
| of multiplexing (the number of active conversations) is small, | ||||
| dropping at too high a rate is likely to have a catastrophic effect | ||||
| on throughput. Dropping at a low rate (< 1 packet per round-trip- | ||||
| time) then increasing the drop rate slowly until the persistent queue | ||||
| goes below target is unlikely to overdrop and is guaranteed to | ||||
| eventually dissipate the persistent queue. This stochastic gradient | ||||
| learning procedure is the core of CoDel's control loop (the gradient | ||||
| exists because a drop always reduces the (instantaneous) queue so an | ||||
| increasing drop rate always moves the system "down" toward no | ||||
| persistent queue, regardless of traffic mix). | ||||
| The "next drop time" is decreased in inverse proportion to the square | ||||
| root of the number of drops since the dropping state was entered, | ||||
| using the well-known nonlinear relationship of drop rate to | ||||
| throughput to get a linear change in throughput [REDL1998], | ||||
| [MACTCP1997]. | ||||
| Since the best rate to start dropping is at slightly more than one | ||||
| packet per RTT, the controller's initial drop rate can be directly | ||||
| derived from the estimator's interval, defined in section 3.1. When | ||||
| the minimum sojourn time first crosses the target and CoDel drops a | ||||
| packet, the earliest the controller could see the effect of the drop | ||||
| is the round trip time (interval) + the local queue wait time | ||||
| (target). If the next drop happens any earlier than this time | ||||
| (interval + target), CoDel will overdrop. In practice, the local | ||||
| queue waiting time tends to vary, so making the initial drop spacing | ||||
| (i.e., the time to the second drop) be exactly the minimum possible | ||||
| also leads to overdropping. Analysis of simulation and real-world | ||||
| measured data shows that the 75th percentile magnitude of this | ||||
| variation is less than the target, and so the initial drop spacing | ||||
| should be set to the estimator's interval (i.e., initial drop spacing | ||||
| = interval) to ensure that the controller has accounted for | ||||
| acceptable congestion delays. | ||||
| Use of the minimum statistic lets the controller be placed in the | ||||
| dequeue routine with the estimator. This means that the control | ||||
| signal (the drop) can be sent at the first sign of bad queue (as | ||||
| indicated by the sojourn time) and that the controller can stop | ||||
| acting as soon as the sojourn time falls below the setpoint. | ||||
| Dropping at dequeue has both implementation and control advantages. | ||||
| 6. Further Experimentation | ||||
| We encourage experimentation with the recommended values of target | ||||
| and interval for Internet settings. CoDel provides general, | ||||
| efficient, parameterless building blocks for queue management that | ||||
| can be applied to single or multiple queues in a variety of data | ||||
| networking scenarios. CoDel's settings may be modified for other | ||||
| special-purpose networking applications. | ||||
| 7. Security Considerations | 7. Security Considerations | |||
| This document describes an active queue management algorithm for | This document describes an active queue management algorithm for | |||
| implementation in networked devices. There are no specific security | implementation in networked devices. There are no specific security | |||
| exposures associated with CoDel. | exposures associated with CoDel. | |||
| 8. IANA Considerations | 8. IANA Considerations | |||
| This document does not require actions by IANA. | This document does not require actions by IANA. | |||
| 9. Conclusions | 9. Acknowledgments | |||
| CoDel provides very general, efficient, parameterless building blocks | ||||
| for queue management that can be applied to single or multiple queues | ||||
| in a variety of data networking scenarios. It is a critical tool in | ||||
| solving bufferbloat. CoDel's settings MAY be modified for other | ||||
| special-purpose networking applications. We encourage | ||||
| experimentation, improvement, and rigorous testing. | ||||
| On-going projects are creating a deployable CoDel in Linux routers | ||||
| and experimenting with applying CoDel to stochastic queueing with | ||||
| promising results. | ||||
| 10. Acknowledgments | ||||
| The authors thank Jim Gettys for the constructive nagging that made | The authors thank Jim Gettys for the constructive nagging that made | |||
| us get the work "out there" before we thought it was ready. We thank | us get the work "out there" before we thought it was ready. We thank | |||
| Dave Taht, Eric Dumazet, and the open source community for showing | Dave Taht, Eric Dumazet, and the open source community for showing | |||
| the value of getting it "out there" and for making it real. We thank | the value of getting it "out there" and for making it real. We thank | |||
| Nandita Dukkipati for contributions to Section 6 and for comments | Nandita Dukkipati for contributions to Section 6 and for comments | |||
| which helped to substantially improve this draft. We thank the AQM | which helped to substantially improve this draft. We thank the AQM | |||
| working group and the Transport Area shepherd, Wes Eddy, for | working group and the Transport Area shepherd, Wes Eddy, for | |||
| patiently prodding this draft all the way to a standard. | patiently prodding this draft all the way to a standard. | |||
| 11. References | 10. References | |||
| 11.1. Normative References | 10.1. Normative References | |||
| [RFC2119] Bradner, S., "Key Words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key Words for use in RFCs to Indicate | |||
| Requirement Levels", March 1997. | Requirement Levels", March 1997. | |||
| 11.2. Informative References | 10.2. Informative References | |||
| [FQ-CODEL-ID] | [FQ-CODEL-ID] | |||
| Hoeiland-Joergensen, T., McKenney, P., | Hoeiland-Joergensen, T., McKenney, P., | |||
| dave.taht@gmail.com, d., Gettys, J., and E. Dumazet, | dave.taht@gmail.com, d., Gettys, J., and E. Dumazet, | |||
| "FlowQueue-Codel", draft-ietf-aqm-fq-codel-03 (work in | "FlowQueue-Codel", draft-ietf-aqm-fq-codel-03 (work in | |||
| progress), November 2015. | progress), November 2015. | |||
| [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion | [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion | |||
| Control", RFC 2581, April 1999. | Control", RFC 2581, April 1999. | |||
| skipping to change at page 27, line 39 ¶ | skipping to change at page 24, line 24 ¶ | |||
| 2007. | 2007. | |||
| [SFQ1990] McKenney, P., "Stochastic Fairness Queuing", Proceedings | [SFQ1990] McKenney, P., "Stochastic Fairness Queuing", Proceedings | |||
| of IEEE INFOCOMM 90 San Francisco, 1990. | of IEEE INFOCOMM 90 San Francisco, 1990. | |||
| [KLEIN81] Kleinrock, L. and R. Gail, "An Invariant Property of | [KLEIN81] Kleinrock, L. and R. Gail, "An Invariant Property of | |||
| Computer Network Power", International Conference on | Computer Network Power", International Conference on | |||
| Communications June, 1981, | Communications June, 1981, | |||
| <http://www.lk.cs.ucla.edu/data/files/Gail/power.pdf>. | <http://www.lk.cs.ucla.edu/data/files/Gail/power.pdf>. | |||
| 11.3. URIs | Appendix A. Applying CoDel in the datacenter | |||
| [6] http://pollere.net/Codel.html | Nandita Dukkipati and her group at Google realized that the CoDel | |||
| building blocks could be applied to bufferbloat problems in | ||||
| datacenter servers, not just to Internet routers. The Linux CoDel | ||||
| queueing discipline (qdisc) was adapted in three ways to tackle this | ||||
| bufferbloat problem. | ||||
| [7] http://pollere.net/Codel.html | 1. The default CoDel action was modified to be a direct feedback | |||
| from qdisc to the TCP layer at dequeue. The direct feedback | ||||
| simply reduces TCP's congestion window just as congestion control | ||||
| would do in the event of drop. The scheme falls back to ECN | ||||
| marking or packet drop if the TCP socket lock could not be | ||||
| acquired at dequeue. | ||||
| 2. Being located in the server makes it possible to monitor the | ||||
| actual RTT to use as CoDel's interval rather than making a "best | ||||
| guess" of RTT. The CoDel interval is dynamically adjusted by | ||||
| using the maximum TCP round-trip time (RTT) of those connections | ||||
| sharing the same Qdisc/bucket. In particular, there is a history | ||||
| entry of the maximum RTT experienced over the last second. As a | ||||
| packet is dequeued, the RTT estimate is accessed from its TCP | ||||
| socket. If the estimate is larger than the current CoDel | ||||
| interval, the CoDel interval is immediately refreshed to the new | ||||
| value. If the CoDel interval is not refreshed for over a second, | ||||
| it is decreased it to the history entry and the process is | ||||
| repeated. The use of the dynamic TCP RTT estimate lets interval | ||||
| adapt to the actual maximum value currently seen and thus lets | ||||
| the controller space its drop intervals appropriately. | ||||
| 3. Since the mathematics of computing the setpoint are invariant, a | ||||
| target of 5% of the RTT or CoDel interval was used here. | ||||
| Non-data packets were not dropped as these are typically small and | ||||
| sometimes critical control packets. Being located on the server, | ||||
| there is no concern with misbehaving users as there would be on the | ||||
| public Internet. | ||||
| In several data center workload benchmarks, which are typically | ||||
| bursty, CoDel reduced the queueing latencies at the qdisc, and | ||||
| thereby improved the mean and 99th-percentile latencies from several | ||||
| tens of milliseconds to less than one millisecond. The minimum | ||||
| tracking part of the CoDel framework proved useful in disambiguating | ||||
| "good" queue versus "bad" queue, particularly helpful in controlling | ||||
| qdisc buffers that are inherently bursty because of TCP Segmentation | ||||
| Offload (TSO). | ||||
| Authors' Addresses | Authors' Addresses | |||
| Kathleen Nichols | Kathleen Nichols | |||
| Pollere, Inc. | Pollere, Inc. | |||
| PO Box 370201 | PO Box 370201 | |||
| Montara, CA 94037 | Montara, CA 94037 | |||
| USA | USA | |||
| Email: nichols@pollere.com | Email: nichols@pollere.com | |||
| Van Jacobson | Van Jacobson | |||
| End of changes. 64 change blocks. | ||||
| 686 lines changed or deleted | 569 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||