Internet Draft                                                   R. Pan
Network Working Group              P. Natarajan, C. Piglione, M. Prabhu
Intended Status: Informational     V. Subramanian, F. Baker, B. V. Steeg
                                                           Cisco Systems
Expires: December 5, 2013                                   June 3, 2013

             PIE: A Lightweight Control Scheme To Address the
                           Bufferbloat Problem

                          draft-pan-aqm-pie-00

Abstract

Bufferbloat is a phenomenon where excess buffers in the network cause
high latency and jitter. As more and more interactive applications
(e.g., voice over IP, real-time video streaming and financial
transactions) run in the Internet, high latency and jitter degrade
application performance. There is a pressing need to design intelligent
queue management schemes that can control latency and jitter, and hence
provide desirable quality of service to users.

We present here a lightweight design, PIE (Proportional Integral
controller Enhanced), that can effectively control the average queueing
latency to a target value. Simulation results, theoretical analysis and
Linux testbed results have shown that PIE can ensure low latency and
achieve high link utilization under various congestion situations. The
design does not require per-packet timestamps, so it incurs very small
overhead and is simple enough to implement in both hardware and
software.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.
Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Copyright and License Notice

Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents carefully,
as they describe your rights and restrictions with respect to this
document. Code Components extracted from this document must include
Simplified BSD License text as described in Section 4.e of the Trust
Legal Provisions and are provided without warranty as described in the
Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. Design Goals
   4. The PIE Scheme
      4.1 Random Dropping
      4.2 Drop Probability Calculation
      4.3 Departure Rate Estimation
      4.4 Handling Bursts
   5. Comments and Discussions
   6. Implementation
   7. Incremental Deployment
   8. IANA Considerations
   9. References
      9.1 Normative References
      9.2 Informative References
      9.3 Other References
   Authors' Addresses

1. Introduction

The explosion of smart phones, tablets and video traffic in the
Internet brings about a unique set of challenges for congestion
control. To avoid packet drops, many service providers or data center
operators require vendors to put in as much buffer as possible. With
the rapid decrease in memory chip prices, these requests are easily
accommodated to keep customers happy. However, simply deploying large
buffers fails to take into account the nature of TCP, the dominant
transport protocol running in the Internet. TCP continuously increases
its sending rate and causes network buffers to fill up. TCP cuts its
rate only when it receives a packet drop or mark that is interpreted as
a congestion signal. However, drops and marks usually occur when
network buffers are full or almost full. As a result, excess buffers,
initially designed to avoid packet drops, lead to highly elevated
queueing latency and jitter.
It is a delicate balancing act to design a queue management scheme that
not only allows short-term bursts to pass smoothly, but also controls
the average latency when long-term congestion persists.

Active queue management (AQM) schemes, such as Random Early Detection
(RED), have been around for well over a decade. AQM schemes could
potentially solve the aforementioned problem. RFC 2309 [RFC2309]
strongly recommends the adoption of AQM schemes in the network to
improve the performance of the Internet. RED is implemented in a wide
variety of network devices, both in hardware and software.
Unfortunately, because RED needs careful tuning of its parameters for
various network conditions, most network operators do not turn RED on.
In addition, RED is designed to control the queue length, which affects
delay only implicitly; it does not control latency directly. Hence, the
Internet today still lacks an effective design that can control buffer
latency and improve the quality of experience for latency-sensitive
applications.

Recently, a new AQM scheme, CoDel [CoDel], was proposed to control
latency directly and address the bufferbloat problem. CoDel requires
per-packet timestamps. Also, packets are dropped at the dequeue
function after they have been enqueued for a while. Both of these
requirements consume excessive processing and infrastructure resources,
which makes CoDel expensive to implement and operate, especially in
hardware.

PIE aims to combine the benefits of both RED and CoDel: easy to
implement like RED, and direct control of latency like CoDel. Similar
to RED, PIE randomly drops a packet at the onset of congestion. The
congestion detection, however, is based on the queueing latency, as in
CoDel, instead of the queue length, as in RED. Furthermore, PIE also
uses the latency trend, i.e., whether latency is increasing or
decreasing, to help determine congestion levels. The design parameters
of PIE are chosen via stability analysis. While these parameters can be
fixed to work in various traffic conditions, they could be made
self-tuning to optimize system performance.

In addition, we assume any delay-based AQM scheme would be applied over
a Fair Queueing (FQ) structure or its approximate design, Class Based
Queueing (CBQ). FQ is one of the most studied scheduling algorithms
since it was first proposed in 1985 [RFC970]. CBQ has been a standard
feature in most network devices today [CBQ]. These designs help
flows/classes achieve max-min fairness and help mitigate bias against
long flows with long round trip times (RTTs). Any AQM scheme that is
built on top of FQ or CBQ could benefit from these advantages.
Furthermore, we believe that these advantages, such as per flow/class
fairness, are orthogonal to the AQM design, whose primary goal is to
control latency for a given queue. For flows that are classified into
the same class and put into the same queue, we need to ensure that
their latency is better controlled and their fairness is not worse than
under the standard DropTail or RED design.

This draft describes the overall design goals, system elements and
implementation details of PIE. We also discuss various design
considerations, including how auto-tuning can be done.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

3. Design Goals

We explore a queue management framework where we aim to improve the
performance of interactive and delay-sensitive applications. The design
of our scheme follows a few basic criteria.

   * First, we directly control queueing latency instead of controlling
     queue length. Queue sizes change with queue draining rates and
     various flows' round trip times. Delay bloat is the real issue
     that we need to address, as it impairs real-time applications. If
     latency can be controlled, bufferbloat is not an issue. As a
     matter of fact, we would allow more buffering for sporadic bursts
     as long as the latency is under control.

   * Secondly, we aim to attain high link utilization. The goal of low
     latency shall be achieved without suffering link under-utilization
     or losing network efficiency. An early congestion signal could
     cause TCP to back off and avoid queue buildup. On the other hand,
     however, TCP's rate reduction could result in link
     under-utilization. There is a delicate balance between achieving
     high link utilization and low latency.

   * Furthermore, the scheme should be simple to implement and easily
     scalable in both hardware and software. The wide adoption of RED
     over a variety of network devices is a testament to the power of
     simple random early dropping/marking. We strive to maintain
     similar design simplicity.

   * Finally, the scheme should ensure system stability for various
     network topologies and scale well with an arbitrary number of
     streams. Design parameters shall be set automatically. Users only
     need to set performance-related parameters such as the target
     queue delay, not design parameters.

In the following, we will elaborate on the design of PIE and its
operation.

4. The PIE Scheme

As illustrated in Fig. 1, our scheme comprises three simple components:
a) random dropping at enqueueing; b) periodic drop probability updates;
c) dequeueing rate estimation.

The following sections describe these components in further detail, and
explain how they interact with each other. At the end of this section,
we will discuss how the scheme can be easily augmented to precisely
control bursts.

4.1 Random Dropping

Like any state-of-the-art AQM scheme, PIE drops packets randomly
according to a drop probability, p, that is obtained from the
drop-probability-calculation component:

   * upon a packet arrival

        randomly drop a packet with a probability p.

             Random Drop
                 /                --------------
         -------/  -------------->   | | | | |  -------------->
             /|\                     | | | | |
              |                   --------------
              |                    Queue Buffer   |
              |                        |          |  Departure bytes
              |                        |queue     |
              |                        |length    |
              |                        |          |
              |                       \|/        \|/
              |            -----------------    -------------------
              |            |     Drop      |    |                 |
              -----<-------|  Probability  |<---|  Departure Rate |
                           |  Calculation  |    |   Estimation    |
                           -----------------    -------------------

                        Figure 1. The PIE Structure
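To make the enqueue path concrete, the following C sketch shows one way
the random drop decision could be realized. It is illustrative only:
the function name, the extern variable p, and the use of rand() as a
uniform random source are our assumptions, not part of the draft.

   /*
    * Minimal sketch of PIE's enqueue-side drop decision. rand()
    * stands in for any uniform random source an implementation
    * might prefer.
    */
   #include <stdbool.h>
   #include <stdlib.h>

   extern double p;   /* drop probability, updated per Section 4.2 */

   /* Return true if the arriving packet should be randomly dropped. */
   static bool pie_drop_on_arrival(void)
   {
       /* Draw a uniform value in [0, 1) and compare it against p. */
       double u = (double)rand() / ((double)RAND_MAX + 1.0);
       return u < p;
   }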
4.2 Drop Probability Calculation

The PIE algorithm periodically updates the drop probability as follows:

   * estimate the current queueing delay using Little's law:

        est_del = qlen/depart_rate;

   * calculate drop probability p as:

        p = p + alpha*(est_del-target_del) + beta*(est_del-est_del_old);

        est_del_old = est_del.

Here, the current queue length is denoted by qlen. The draining rate of
the queue, depart_rate, is obtained from the departure-rate-estimation
block. The variables est_del and est_del_old represent the current and
previous estimates of the queueing delay. The target latency value is
expressed as target_del. The update interval is denoted by Tupdate.

Note that the calculation of the drop probability is based not only on
the current estimate of the queueing delay, but also on the direction
in which the delay is moving, i.e., whether the delay is getting longer
or shorter. This direction can simply be measured as the difference
between est_del and est_del_old. This is the classic Proportional
Integral controller design, adopted here for controlling queueing
latency. The controller parameters, in units of Hz, are designed using
feedback loop analysis in which TCP's behavior is modeled using results
from well-studied prior art [TCP-Models].

We would like to point out that this type of controller has been
studied before for controlling the queue length [PI, QCN]. PIE adopts
the Proportional Integral controller for controlling delay and makes
the scheme auto-tuning. The theoretical analysis of PIE is under paper
submission and its reference will be included in this draft once it
becomes available. Nonetheless, we discuss the intuitions behind these
parameters in Section 5.
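As a sketch of the update above, the C fragment below runs once every
Tupdate. Variable names follow the pseudocode; the clamping of p to
[0, 1] is our own safety assumption and is not stated in the draft.

   /* Periodic drop probability update, run every Tupdate seconds. */
   static double p;            /* drop probability                  */
   static double est_del_old;  /* previous delay estimate (seconds) */

   static void pie_calculate_probability(double qlen,        /* bytes     */
                                         double depart_rate, /* bytes/sec */
                                         double target_del,  /* seconds   */
                                         double alpha,
                                         double beta)
   {
       /* Little's law: queueing delay = backlog / draining rate. */
       double est_del = qlen / depart_rate;

       /* The alpha term reacts to the offset from the target; the
        * beta term reacts to the delay trend (rising or falling). */
       p = p + alpha * (est_del - target_del)
             + beta  * (est_del - est_del_old);

       /* Clamp to a valid probability (our assumption). */
       if (p < 0.0) p = 0.0;
       if (p > 1.0) p = 1.0;

       est_del_old = est_del;
   }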
4.3 Departure Rate Estimation

The draining rate of a queue in the network often varies, either
because other queues share the same link or because the link capacity
fluctuates. Rate fluctuation is particularly common in wireless
networks. Hence, we decide to measure the departure rate directly, as
follows.

   * we are in a measurement cycle if we have enough data in the queue:

        qlen > deq_threshold

   * if in a measurement cycle, upon a packet departure:

        dq_count = dq_count + deque_pkt_size;

   * if dq_count > deq_threshold then

        depart_rate = dq_count/(now-start);

        dq_count = 0;

        start = now;

We only measure the departure rate when there is sufficient data in the
buffer, i.e., when the queue length is over a certain threshold,
deq_threshold. Short, non-persistent bursts of packets result in empty
queues from time to time, which would make the measurement less
accurate. The variable dq_count represents the number of bytes departed
since the last measurement. Once dq_count exceeds the threshold
deq_threshold, we obtain a measurement sample. The threshold is
recommended to be set to 10KB, assuming a typical packet size of around
1KB or 1.5KB. This threshold gives a period long enough to average out
the draining rate, yet short enough to reflect sudden changes in it.
Note that this threshold is not crucial for the system's stability.
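One possible realization of this measurement loop is sketched below.
now_seconds() is an assumed monotonic clock helper and the explicit
in_measurement flag is our own bookkeeping; the 10KB threshold follows
the recommendation above.

   #define DEQ_THRESHOLD 10000.0        /* bytes, ~10KB as recommended */

   extern double now_seconds(void);     /* assumed monotonic clock */

   static double dq_count;              /* bytes departed this sample  */
   static double start;                 /* sample start time (seconds) */
   static double depart_rate;           /* latest rate (bytes/sec)     */
   static int    in_measurement;

   static void pie_on_packet_departure(double qlen, double pkt_size)
   {
       /* Start a measurement cycle only when the backlog is large
        * enough to keep the queue busy across the whole sample. */
       if (!in_measurement && qlen > DEQ_THRESHOLD) {
           in_measurement = 1;
           dq_count = 0.0;
           start = now_seconds();
       }
       if (!in_measurement)
           return;

       dq_count += pkt_size;
       if (dq_count > DEQ_THRESHOLD) {
           double now = now_seconds();
           depart_rate = dq_count / (now - start);
           dq_count = 0.0;
           start = now;
           in_measurement = 0;   /* re-check the backlog next time */
       }
   }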
4.4 Handling Bursts

The above three components form the basis of the PIE algorithm.
Although we aim to control the average latency of a congested queue,
the scheme should allow short-term bursts to pass through the system
without hurting them. We discuss how PIE manages bursts in this
section.

Bursts are well tolerated in the basic scheme for the following
reasons: first, the drop probability is updated periodically. Any
short-term burst that occurs within this period can pass through
without incurring extra drops, as it does not trigger a new drop
probability calculation. Secondly, PIE's drop probability calculation
is done incrementally. A single update leads to only a small
incremental change in the probability. So even if a burst occurs at the
exact instant that the probability is being calculated, the incremental
nature of the calculation ensures that its impact stays small.

Nonetheless, we would like to give users precise control over bursts.
We introduce a parameter, max_burst, that is similar to the burst
tolerance in the token bucket design. By default, the parameter is set
to 100ms. Users can certainly modify it according to their application
scenarios. The burst allowance is added into the basic PIE design as
follows:

   * if p == 0 and est_del < target_del and est_del_old < target_del

        burst_allowance = max_burst;

   * upon packet arrival

        if burst_allowance > 0 enqueue packet;

   * upon probability update

        burst_allowance = burst_allowance - Tupdate;

The burst allowance, denoted by burst_allowance, is initialized to
max_burst. As long as burst_allowance is above zero, an incoming packet
is enqueued, bypassing the random drop process. At each update
instance, the value of burst_allowance is decremented by the update
period, Tupdate. When the congestion goes away, defined by us as p
equal to 0 and both the current and previous samples of estimated delay
below target_del, we reset burst_allowance to max_burst.
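The burst allowance could be folded into the enqueue and update paths
as sketched below. The TUPDATE value is only a placeholder, and
pie_drop_on_arrival() refers to the Section 4.1 sketch; none of these
names are prescribed by the draft.

   #include <stdbool.h>

   #define MAX_BURST 0.1        /* seconds: the 100ms default above */
   #define TUPDATE   0.03       /* placeholder update period        */

   extern double p;                        /* drop probability   */
   bool pie_drop_on_arrival(void);         /* Section 4.1 sketch */

   static double burst_allowance = MAX_BURST;

   /* Enqueue path: bypass the random drop while the allowance lasts. */
   static bool pie_enqueue_should_drop(void)
   {
       if (burst_allowance > 0.0)
           return false;           /* let the burst pass untouched */
       return pie_drop_on_arrival();
   }

   /* Update path: run once per TUPDATE, after the probability update. */
   static void pie_update_burst_allowance(double est_del,
                                          double est_del_old,
                                          double target_del)
   {
       if (burst_allowance > 0.0)
           burst_allowance -= TUPDATE;

       /* Congestion is over: p has decayed to zero and the last two
        * delay samples are below target, so re-arm the allowance. */
       if (p == 0.0 && est_del < target_del && est_del_old < target_del)
           burst_allowance = MAX_BURST;
   }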
5. Comments and Discussions

While the formal analysis will be included later, we would like to
discuss the intuition behind the key parameters. Although the PIE
algorithm sets them automatically, they are not meant to be magic
numbers. We hope to give enough explanation here to demystify them so
that users can experiment and explore on their own.

As is obvious from the above, the crucial equation in the PIE algorithm
is

   p = p + alpha*(est_del-target_del) + beta*(est_del-est_del_old).

The value of alpha determines how the deviation of the current latency
from the target value affects the drop probability. The beta term
exerts additional adjustment depending on whether the latency is
trending up or down. Note that the drop probability is reached
incrementally, not in a single step. To avoid big swings in
adjustments, which often lead to instability, we would like to tune p
in small increments. Suppose that p is in the range of 1%. Then we
would want each single-step adjustment via alpha and beta to be small
by comparison, say around 0.1%. If p is in a higher range, say above
10%, then the situation warrants a larger single-step adjustment, for
example 1%. Finally, the drop probability stabilizes only when the
latency is stable, i.e., est_del equals est_del_old, and the latency
equals target_del. The relative weight between alpha and beta
determines the final balance between latency offset and latency jitter.

The update interval, Tupdate, also plays a key role in stability. Given
the same alpha and beta values, the faster the update, the higher the
loop gain. As Tupdate does not appear explicitly in the above equation,
it can become an oversight. Notice also that alpha and beta have units
of Hz.

As a further extension, we could introduce weights for flows that are
classified into the same queue to achieve differential dropping. For
example, the dropping probability for flow i could be p(i) =
p/weight(i). Flows with higher weights would receive proportionally
fewer drops, and vice versa. Adding FQ on top, FQ_PIE, is another
alternative.

Also, we have discussed congestion notification in the form of packet
drops. The algorithm can easily be applied to networks where Explicit
Congestion Notification (ECN) is enabled. The drop probability, p,
above would then become the marking probability.

6. Implementation

PIE can be applied to existing hardware or software solutions. In this
section, we discuss the implementation cost of the PIE algorithm. There
are three steps involved in PIE, as discussed in Section 4. We examine
their complexities as follows.

Upon packet arrival, the algorithm simply drops a packet randomly based
on the drop probability p. This step is straightforward and requires no
packet header examination or manipulation. Besides, since no per-packet
overhead, such as a timestamp, is required, there is no extra memory
requirement. Furthermore, the input side of a queue is typically under
software control while the output side of a queue is hardware based.
Hence, a drop at enqueueing can be readily retrofitted into existing
hardware or software implementations.

The drop probability calculation is done in the background and occurs
every Tupdate interval. Given modern high speed links, this period
translates into once every tens, hundreds or even thousands of packets.
Hence the calculation occurs on a much slower time scale than packet
processing, at least an order of magnitude slower. The calculation of
the drop probability involves multiplications by alpha and beta. Since
the algorithm is not sensitive to the precise values of alpha and beta,
we can choose values, e.g., alpha = 0.25 and beta = 2.5, such that the
multiplications can be done using simple adds and shifts. As no
complicated functions are required, PIE can be easily implemented in
both hardware and software. The state requirement is only two variables
per queue: est_del and est_del_old. Hence the memory overhead is small.
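To illustrate the add-and-shift claim, here is a hypothetical
fixed-point version of the update: with alpha = 0.25 and beta = 2.5,
alpha*x becomes x >> 2 and beta*x becomes (x << 1) + (x >> 1). The
fixed-point scaling, the clamping, and the assumption of arithmetic
right shifts on signed values are ours, not the draft's.

   #include <stdint.h>

   static int64_t p_fix;   /* drop probability in fixed point */

   static void pie_update_shift_add(int64_t est_del,
                                    int64_t est_del_old,
                                    int64_t target_del)
   {
       int64_t off   = est_del - target_del;   /* latency offset */
       int64_t trend = est_del - est_del_old;  /* latency trend  */

       /* p += 0.25*off + 2.5*trend, with no multiplier needed. */
       p_fix += (off >> 2) + (trend << 1) + (trend >> 1);

       if (p_fix < 0)
           p_fix = 0;   /* clamp to a valid range (our assumption) */
   }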
In the departure rate estimation, PIE uses a counter to keep track of
the number of bytes departed in the current interval. This counter is
incremented on each packet departure. Every Tupdate, PIE calculates the
latency using the departure rate, which can be implemented with a
single multiplication. Note that many network devices already keep
track of an interface's departure rate. In that case, PIE might be able
to reuse this information, simply skip the third step of the algorithm
and hence incur no extra cost. We also understand that in some software
implementations, timestamps are added for other purposes. In that case,
we can also make use of the timestamps, bypass the departure rate
estimation and use the timestamp information directly in the drop
probability calculation.

In summary, the state requirement for PIE is limited and the
computation overhead is small. Hence, PIE is simple to implement. In
addition, since PIE does not require any user configuration, it does
not impose any new cost on existing network management system
solutions. SFQ can be combined with PIE to further improve the latency
of flows with different priorities. However, SFQ requires extra
queueing and scheduling structures. Whether the performance gain
justifies the design overhead needs further investigation.

7. Incremental Deployment

One nice property of the AQM design is that it can be independently
designed and operated without any interoperability requirement.
Although all network nodes cannot be changed at once to adopt
latency-based AQM schemes, we envision a gradual adoption that would
eventually lead to end-to-end low latency service for real-time
applications.

8. IANA Considerations

There are no actions for IANA.

9. References

9.1 Normative References

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2 Informative References

   [RFC970]     Nagle, J., "On Packet Switches With Infinite Storage",
                RFC 970, December 1985.

   [RFC2309]    Braden, B., et al., "Recommendations on Queue
                Management and Congestion Avoidance in the Internet",
                RFC 2309, April 1998.

9.3 Other References

   [CoDel]      Nichols, K. and Jacobson, V., "Controlling Queue
                Delay", ACM Queue, ACM Publishing,
                doi:10.1145/2209249.2209264.

   [CBQ]        Cisco White Paper, "http://www.cisco.com/en/US/docs/
                12_0t/12_0tfeature/guide/cbwfq.html".

   [TCP-Models] Misra, V., Gong, W., and Towsley, D., "Fluid-based
                Analysis of a Network of AQM Routers Supporting TCP
                Flows with an Application to RED", SIGCOMM 2000.

   [PI]         Hollot, C.V., Misra, V., Towsley, D., and Gong, W.,
                "On Designing Improved Controllers for AQM Routers
                Supporting TCP Flows", Infocom 2001.

   [QCN]        "Data Center Bridging - Congestion Notification",
                http://www.ieee802.org/1/pages/802.1au.html.

Authors' Addresses

   Rong Pan
   Cisco Systems
   510 McCarthy Blvd,
   Milpitas, CA 95134, USA
   Email: ropan@cisco.com

   Preethi Natarajan
   Cisco Systems
   510 McCarthy Blvd,
   Milpitas, CA 95134, USA
   Email: prenatar@cisco.com

   Chiara Piglione
   Cisco Systems
   510 McCarthy Blvd,
   Milpitas, CA 95134, USA
   Email: cpiglion@cisco.com

   Mythili Prabhu
   Cisco Systems
   510 McCarthy Blvd,
   Milpitas, CA 95134, USA
   Email: mysuryan@cisco.com

   Vijay Subramanian
   Cisco Systems
   510 McCarthy Blvd,
   Milpitas, CA 95134, USA
   Email: vijaynsu@cisco.com

   Fred Baker
   Cisco Systems
   510 McCarthy Blvd,
   Milpitas, CA 95134, USA
   Email: fred@cisco.com

   Bill Ver Steeg
   Cisco Systems
   5030 Sugarloaf Parkway
   Lawrenceville, GA, 30044, USA
   Email: versteb@cisco.com