idnits 2.17.1 draft-ietf-aqm-pie-07.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There is 1 instance of too long lines in the document, the longest one being 1 character in excess of 72. ** There are 31 instances of lines with control characters in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (April 19, 2016) is 2927 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Missing Reference: 'PI' is mentioned on line 766, but not defined == Missing Reference: 'QCN' is mentioned on line 770, but not defined == Missing Reference: 'TCP-Models' is mentioned on line 773, but not defined -- Obsolete informational reference (is this intentional?): RFC 2309 (Obsoleted by RFC 7567) Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Draft R. Pan, P. Natarajan, F. Baker 3 Active Queue Management G. White, B. VerSteeg, M.S. Prabhu 4 Working Group C. Piglione, V. Subramanian 5 Intended Status: Experimental Track 7 Expires: October 21, 2016 April 19, 2016 9 PIE: A Lightweight Control Scheme To Address the 10 Bufferbloat Problem 12 draft-ietf-aqm-pie-07 14 Abstract 16 Bufferbloat is a phenomenon where excess buffers in the network cause 17 high latency and jitter. As more and more interactive applications 18 (e.g. voice over IP, real time video streaming and financial 19 transactions) run in the Internet, high latency and jitter degrade 20 application performance. There is a pressing need to design 21 intelligent queue management schemes that can control latency and 22 jitter; and hence provide desirable quality of service to users. 24 This document presents a lightweight active queue management design, 25 called PIE (Proportional Integral controller Enhanced), that can 26 effectively control the average queueing latency to a target value. 27 Simulation results, theoretical analysis and Linux testbed results 28 have shown that PIE can ensure low latency and achieve high link 29 utilization under various congestion situations. The design does not 30 require per-packet timestamp, so it incurs very small overhead and is 31 simple enough to implement in both hardware and software. 33 Status of this Memo 35 This Internet-Draft is submitted to IETF in full conformance with the 36 provisions of BCP 78 and BCP 79. 38 Internet-Drafts are working documents of the Internet Engineering 39 Task Force (IETF), its areas, and its working groups. Note that 40 other groups may also distribute working documents as 41 Internet-Drafts. 
43  Internet-Drafts are draft documents valid for a maximum of six months
44  and may be updated, replaced, or obsoleted by other documents at any
45  time. It is inappropriate to use Internet-Drafts as reference
46  material or to cite them other than as "work in progress."
47  The list of current Internet-Drafts can be accessed at
48  http://www.ietf.org/1id-abstracts.html

50  The list of Internet-Draft Shadow Directories can be accessed at
51  http://www.ietf.org/shadow.html

53  Copyright and License Notice

55  Copyright (c) 2016 IETF Trust and the persons identified as the
56  document authors. All rights reserved.

58  This document is subject to BCP 78 and the IETF Trust's Legal
59  Provisions Relating to IETF Documents
60  (http://trustee.ietf.org/license-info) in effect on the date of
61  publication of this document. Please review these documents
62  carefully, as they describe your rights and restrictions with respect
63  to this document. Code Components extracted from this document must
64  include Simplified BSD License text as described in Section 4.e of
65  the Trust Legal Provisions and are provided without warranty as
66  described in the Simplified BSD License.

68  Table of Contents

70  1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
71  2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . .  5
72  3. Design Goals . . . . . . . . . . . . . . . . . . . . . . . . .  5
73  4. The Basic PIE Scheme . . . . . . . . . . . . . . . . . . . . .  6
74    4.1 Random Dropping (ECN Support is described later in this
75        document) . . . . . . . . . . . . . . . . . . . . . . . . .  7
76    4.2 Drop Probability Calculation . . . . . . . . . . . . . . .  8
77    4.3 Latency Calculation . . . . . . . . . . . . . . . . . . . . 10
78    4.4 Burst Tolerance . . . . . . . . . . . . . . . . . . . . . . 10
79  5. Optional Design Elements of PIE . . . . . . . . . . . . . . . . 11
80    5.1 ECN Support . . . . . . . . . . . . . . . . . . . . . . . . 11
81    5.2 Departure Rate Estimation . . . . . . . . . . . . . . . . . 11
82    5.3 Turning PIE on and off . . . . . . . . . . . . . . . . . . . 13
83    5.4 De-randomization . . . . . . . . . . . . . . . . . . . . . . 14
84    5.5 Cap Drop Adjustment . . . . . . . . . . . . . . . . . . . . 15
85  6. Implementation Cost . . . . . . . . . . . . . . . . . . . . . . 15
86  7. Future Research . . . . . . . . . . . . . . . . . . . . . . . . 16
87  8. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 17
88  9. Security Considerations . . . . . . . . . . . . . . . . . . . . 17
89  10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17
90  11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 17
91    11.1 Normative References . . . . . . . . . . . . . . . . . . . 17
92    11.2 Informative References . . . . . . . . . . . . . . . . . . 17
93    11.3 Other References . . . . . . . . . . . . . . . . . . . . . 18
94  12. The Basic PIE pseudo Code . . . . . . . . . . . . . . . . . . 19
95  13. Pseudo code for PIE with optional enhancement . . . . . . . . 22

97  1. Introduction

99  The explosion of smart phones, tablets and video traffic in the
100 Internet brings about a unique set of challenges for congestion
101 control. To avoid packet drops, many service providers or data center
102 operators require vendors to put in as much buffer as possible. With
103 the rapid decrease in memory chip prices, these requests are easily
104 accommodated to keep customers happy.
While this solution succeeds in
105 assuring low packet loss and high TCP throughput, it suffers from a
106 major downside. The TCP protocol continuously increases its sending
107 rate and causes network buffers to fill up. TCP cuts its rate only
108 when it receives a packet drop or mark that is interpreted as a
109 congestion signal. However, drops and marks usually occur when
110 network buffers are full or almost full. As a result, excess buffers,
111 initially designed to avoid packet drops, lead to highly
112 elevated queueing latency and jitter. It is a delicate balancing act
113 to design a queue management scheme that not only allows short-term
114 bursts to pass smoothly, but also controls the average latency in the
115 presence of long-running greedy flows.

117 Active queue management (AQM) schemes, such as Random Early Detection
118 (RED), have been around for well over a decade. AQM schemes could
119 potentially solve the aforementioned problem. RFC 2309 [RFC2309]
120 strongly recommends the adoption of AQM schemes in the network to
121 improve the performance of the Internet. RED is implemented in a wide
122 variety of network devices, both in hardware and software.
123 Unfortunately, because RED needs careful tuning of its
124 parameters for various network conditions, most network operators
125 don't turn RED on. In addition, RED is designed to control the queue
126 length, which affects delay only implicitly. It does not control
127 latency directly. Hence, the Internet today still lacks an effective
128 design that can control buffer latency to improve the quality of
129 experience for latency-sensitive applications. Notably, a recent IETF
130 AQM working group draft [IETF-AQM] calls for new methods of
131 controlling network latency.

133 New algorithms are beginning to emerge to control queueing latency
134 directly to address the bufferbloat problem [CoDel]. Along these
135 lines, PIE also aims to keep the benefits of RED, such as easy
136 implementation and scalability to high speeds. Similar to RED, PIE
137 randomly drops an incoming packet at the onset of congestion. The
138 congestion detection, however, is based on the queueing latency
139 instead of the queue length as in RED. Furthermore, PIE also uses the
140 derivative (rate of change) of the queueing latency to help determine
141 congestion levels and an appropriate response. The design parameters
142 of PIE are chosen via control theory stability analysis. While these
143 parameters can be fixed to work in various traffic conditions, they
144 could be made self-tuning to optimize system performance.

146 Separately, it is assumed that any delay-based AQM scheme would be
147 applied over a Fair Queueing (FQ) structure or one of its approximate
148 designs, Flow Queueing or Class Based Queueing (CBQ). FQ is one of
149 the most studied scheduling algorithms since it was first proposed in
150 1985 [RFC970]. CBQ has been a standard feature in most network
151 devices today [CBQ]. Any AQM scheme that is built on top of FQ or CBQ
152 could benefit from these designs. Furthermore, their advantages,
153 such as per-flow/class fairness, are orthogonal to the AQM design,
154 whose primary goal is to control latency for a given queue. For flows
155 that are classified into the same class and put into the same queue,
156 one needs to ensure that their latency is better controlled and that
157 their fairness is no worse than under the standard DropTail or RED
158 design.
More details about the relationship between FQ and AQM can be
159 found in IETF draft [FQ-Implement].

161 In October 2013, CableLabs' DOCSIS 3.1 specification [DOCSIS_3.1]
162 mandated that cable modems implement a specific variant of the PIE
163 design as the active queue management algorithm. In addition to cable-
164 specific improvements, the PIE design in DOCSIS 3.1 [DOCSIS-PIE] has
165 improved the original design in several areas, including de-
166 randomization of coin tosses and enhanced burst protection.

168 This draft separates the PIE design into basic elements that
169 MUST be implemented and optional SHOULD/MAY enhancement elements.

171 2. Terminology

173 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
174 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
175 document are to be interpreted as described in RFC 2119 [RFC2119].

177 3. Design Goals

179 A queue management framework is designed to improve the performance
180 of interactive and delay-sensitive applications. It should follow the
181 general guidelines set by the AQM working group document "IETF
182 Recommendations Regarding Active Queue Management" [IETF-AQM]. More
183 specifically, the PIE design has the following basic criteria.

185   * First, queueing latency, instead of queue length, is
186   controlled. Queue sizes change with queue draining rates and
187   various flows' round trip times. Delay bloat is the real issue
188   that needs to be addressed, as it impairs real-time applications.
189   If latency can be controlled, bufferbloat is not an issue. In
190   fact, once latency is under control, it frees up buffers for
191   sporadic bursts.

193   * Secondly, PIE aims to attain high link utilization. The goal
194   of low latency shall be achieved without suffering link under-
195   utilization or losing network efficiency. An early congestion
196   signal could cause TCP to back off and avoid queue buildup.
197   On the other hand, however, TCP's rate reduction could result in
198   link under-utilization. There is a delicate balance between
199   achieving high link utilization and low latency.

201   * Furthermore, the scheme should be simple to implement and
202   easily scalable in both hardware and software. PIE strives to
203   maintain design simplicity similar to RED, which has been
204   implemented in a wide variety of network devices.

206   * Finally, the scheme should ensure system stability for various
207   network topologies and scale well across an arbitrary number of
208   streams. Design parameters shall be set automatically. Users
209   only need to set performance-related parameters such as target
210   queue delay, not design parameters.

212 In the following, the design of PIE and its operation are described in
213 detail.

215 4. The Basic PIE Scheme

217 As illustrated in Fig. 1, PIE conceptually comprises three simple
218 components, all of which MUST be implemented: a) random dropping at
219 enqueueing; b) periodic drop probability update; c) latency
220 calculation. When a packet arrives, a random decision is made regarding
221 whether to drop the packet. The drop probability is updated periodically
222 based on how far the current delay is away from the target and whether
223 the queueing delay is currently trending up or down. The queueing delay
224 can be obtained using direct measurements or using estimations
225 calculated from the queue length and the dequeue rate.

227 The detailed definition of parameters can be found in the pseudo code
228 sections of this document (Sections 12 and 13).
Any state variables that PIE
229 maintains are noted using "PIE->". For a full description of the
230 algorithm, one can refer to the paper [HPSR-PIE].

232                            Random Drop
233                /                       --------------
234        -------/  -------------->      | | | | |      -------------->
235               /|\                      | | | | |
236                |                       --------------
237                |                       Queue Buffer   \
238                |                            |           \
239                |                            |queue       \
240                |                            |length       \
241                |                            |               \
242                |                           \|/               \/
243                |          -----------------    -------------------
244                |          |     Drop      |    |                 |
245                -----<-----|  Probability  |<---|     Latency     |
246                           |  Calculation  |    |   Calculation   |
247                           -----------------    -------------------

249 Figure 1. The PIE Structure

251 4.1 Random Dropping (ECN Support is described later in this document)

253 PIE MUST drop a packet upon its arrival at a queue according to a drop
254 probability, PIE->drop_prob_, that is obtained from the drop-
255 probability-calculation component. The random drop is triggered by a
256 packet arrival before enqueueing into a queue.

258   * Upon a packet enqueue, PIE MUST:

260     randomly drop the packet with a probability PIE->drop_prob_.

262 To ensure that PIE is work conserving, we MAY bypass the random drop if
263 the delay sample, PIE->qdelay_old_, is smaller than half of QDELAY_REF
264 when the drop probability is not too high, PIE->drop_prob_ < 0.2, or if
265 the queue holds fewer than a couple of packets.

267   * Upon a packet enqueue, PIE MAY:

269     //Safeguard PIE to be work conserving
270     if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 0.2)
271          || (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) {
272        return ENQUE;
273     } else
274        randomly drop the packet with a probability PIE->drop_prob_.

276 PIE optionally supports ECN; see Section 5.1.

278 4.2 Drop Probability Calculation

280 The PIE algorithm periodically updates the drop probability based on the
281 delay samples: not only the current delay sample but also the trend of
282 the delay, i.e., whether it is going up or down. This is the classical
283 Proportional Integral (PI) controller method, which is known for
284 eliminating steady-state errors. This type of controller has been
285 studied before for controlling the queue length [PI, QCN]. PIE adopts
286 the Proportional Integral controller for controlling delay. The
287 algorithm also auto-adjusts the control parameters based on how heavy
288 the congestion is, which is reflected in the current drop probability.
289 Note that the current drop probability is a direct measure of the
290 current congestion level; there is no need to measure the arrival rate
291 and departure rate mismatch.

293 When a congestion period ends, we might be left with a high drop
294 probability and light packet arrivals. Hence, the PIE algorithm MUST
295 include a mechanism by which the drop probability decays exponentially
296 (rather than linearly) when the system is not congested. This helps
297 the drop probability converge to 0 faster, while the PI controller
298 ensures that it eventually reaches zero. The decay parameter of 2%
299 gives a time constant of around 750ms, i.e., a few RTTs.
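To see where the 750ms figure comes from: a 2% multiplicative decay applied
once per T_UPDATE (15ms) falls to 1/e of its starting value after about 50
update intervals, i.e., roughly 750ms. The short C program below is a
non-normative check of this arithmetic; the variable names are illustrative
and are not part of the algorithm.

    #include <math.h>
    #include <stdio.h>

    /* Non-normative sketch: estimate the time constant of the 2% per-update
     * exponential decay applied when the queue is idle (Section 4.2).
     * T_UPDATE and the decay factor are the defaults from this document.  */
    int main(void) {
        const double t_update_ms = 15.0;      /* drop probability update period  */
        const double decay       = 0.98;      /* multiplicative decay per update */
        const double inv_e       = exp(-1.0); /* time-constant threshold         */
        double p = 1.0;                       /* arbitrary nonzero starting value */
        int n = 0;

        while (p > inv_e) {                   /* count updates until p falls to 1/e */
            p *= decay;
            n++;
        }
        /* prints roughly "time constant ~ 50 updates ~ 750 ms" */
        printf("time constant ~ %d updates ~ %.0f ms\n", n, n * t_update_ms);
        return 0;
    }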
301 Specifically, the PIE algorithm MUST periodically adjust the drop
302 probability every T_UPDATE interval:

304   * MUST calculate drop probability PIE->drop_prob_ and auto-tune it
305     as:

307       p = alpha*(current_qdelay-QDELAY_REF) +
308           beta*(current_qdelay-PIE->qdelay_old_);

310       if (PIE->drop_prob_ < 0.000001) {
311          p /= 2048;
312       } else if (PIE->drop_prob_ < 0.00001) {
313          p /= 512;
314       } else if (PIE->drop_prob_ < 0.0001) {
315          p /= 128;
316       } else if (PIE->drop_prob_ < 0.001) {
317          p /= 32;
318       } else if (PIE->drop_prob_ < 0.01) {
319          p /= 8;
320       } else if (PIE->drop_prob_ < 0.1) {
321          p /= 2;
322       } else {
323          p = p;
325       }
326       PIE->drop_prob_ += p;

328   * MUST decay the drop probability exponentially:

330       if (current_qdelay == 0 && PIE->qdelay_old_ == 0) {

332         PIE->drop_prob_ = PIE->drop_prob_*0.98;    //1 - 1/64
333                                                    //is sufficient

335       }

337   * MUST bound the drop probability:
338       if (PIE->drop_prob_ < 0)
339           PIE->drop_prob_ = 0.0
340       if (PIE->drop_prob_ > 1)
341           PIE->drop_prob_ = 1.0

343   * MUST store the current delay value:

345       PIE->qdelay_old_ = current_qdelay.

347 The update interval, T_UPDATE, defaults to 15ms. It MAY be
348 reduced on high-speed links in order to provide a smoother response. The
349 target delay value, QDELAY_REF, SHOULD be set to 15ms. The variables
350 current_qdelay and PIE->qdelay_old_ represent the current and previous
351 samples of the queueing delay, which are calculated by the "Latency
352 Calculation" component (see Section 4.3). The variable current_qdelay is
353 actually a temporary variable while PIE->qdelay_old_ is a state variable
354 that PIE keeps. The drop probability is a value between 0 and 1.
355 However, implementations can certainly use integers.

357 The controller parameters, alpha and beta (in units of Hz), are
358 designed using feedback loop analysis, where TCP's behaviors are modeled
359 using the results from well-studied prior art [TCP-Models]. Note that the
360 above adjustment of p effectively scales the alpha and beta parameters
361 based on the current congestion level, as indicated by the drop probability.

363 The theoretical analysis of PIE can be found in [HPSR-PIE]. As a rule of
364 thumb, to keep the same feedback loop dynamics, if we cut T_UPDATE in
365 half, we should also cut alpha by half and increase beta by alpha/4. If
366 the target delay is reduced, e.g., for data center use, the values of
367 alpha and beta SHOULD be increased by the same order of magnitude by
368 which the target latency is reduced. For example, if QDELAY_REF is
369 reduced from 15ms to 150us, a reduction of two orders of magnitude, then
370 alpha and beta values should be increased to alpha*100 and beta*100.

372 4.3 Latency Calculation

374 The PIE algorithm MUST use latency to calculate the drop probability.

376   * It MAY estimate the current queueing delay using Little's law:

378       current_qdelay = queue_.byte_length()/dequeue_rate;

380     Details can be found in Section 5.2.

382   * Or it MAY use other techniques for calculating the queueing delay,
383     e.g., timestamp packets at enqueue and use the timestamps to
384     calculate the delay during dequeue.

386 4.4 Burst Tolerance

388 PIE MUST also NOT penalize short-term packet bursts [IETF-AQM]. PIE MUST
389 allow bursts of traffic that create finite-duration events in which the
390 current queueing delay exceeds QDELAY_REF, without triggering packet
391 drops. A parameter, MAX_BURST, is introduced that defines the burst
392 duration that will be protected. By default, the parameter SHOULD be set
393 to 150ms.
For simplicity, the PIE algorithm MAY effectively round
394 MAX_BURST up to an integer multiple of T_UPDATE.

396 To implement the burst tolerance function, two basic components of PIE
397 are involved: "random dropping" and "drop probability calculation". The
398 PIE algorithm MUST do the following:

400   * In the "Random Dropping" block, upon a packet arrival, PIE MUST
401     check:

403       Upon a packet enqueue:
404          if PIE->burst_allowance_ > 0 enqueue packet;
405          else randomly drop a packet with a probability PIE->drop_prob_.

407          if (PIE->drop_prob_ == 0 and current_qdelay < QDELAY_REF/2 and
408              PIE->qdelay_old_ < QDELAY_REF/2)
409              PIE->burst_allowance_ = MAX_BURST;

411   * In the "Drop Probability Calculation" block, PIE MUST additionally
412     calculate:

414       PIE->burst_allowance_ = max(0,PIE->burst_allowance_ -
415                                   T_UPDATE);

417 The burst allowance, denoted by PIE->burst_allowance_, is initialized to
418 MAX_BURST. As long as PIE->burst_allowance_ is above zero, an incoming
419 packet will be enqueued, bypassing the random drop process. During each
420 update instance, the value of PIE->burst_allowance_ is decremented by
421 the update period, T_UPDATE, and is floored at 0. When the congestion
422 goes away, defined here as PIE->drop_prob_ being equal to 0 and both the
423 current and previous samples of estimated delay being less than half of
424 QDELAY_REF, PIE->burst_allowance_ is reset to MAX_BURST.

426 5. Optional Design Elements of PIE

428 The above forms the basic MUST-have elements of the PIE algorithm. There
429 are several enhancements that can be added to further augment the
430 performance of the basic algorithm. For clarity, they are
431 included in this section.

433 5.1 ECN Support

435 PIE SHOULD support ECN by marking (rather than dropping) ECN-capable
436 packets [IETF-ECN]. However, as a safeguard, an additional threshold,
437 mark_ecnth, is introduced. If the calculated drop probability exceeds
438 mark_ecnth, PIE MUST revert to packet drop for ECN-capable packets. The
439 variable mark_ecnth SHOULD be set to 0.1 (10%).

441   * To support ECN, the "random drop with a probability
442     PIE->drop_prob_" function in the "Random Dropping" block SHOULD be
443     changed to the following:

445   * Upon a packet enqueue:

447       if rand() < PIE->drop_prob_:

449          if PIE->drop_prob_ < mark_ecnth && ecn_capable_packet == TRUE:

451             mark packet;

453          else:

455             drop packet;

457 5.2 Departure Rate Estimation

459 One way to calculate latency is to obtain the departure rate. The
460 draining rate of a queue in the network often varies, either because
461 other queues are sharing the same link or because the link capacity fluctuates.

463 Rate fluctuation is particularly common in wireless networks. One MAY
464 measure the departure rate directly at the dequeue operation. Short,
465 non-persistent bursts of packets result in empty queues from time to
466 time, which would make the measurement less accurate. PIE SHOULD measure
467 only when there is sufficient data in the buffer, i.e., when the queue
468 length is over a certain threshold (DQ_THRESHOLD). PIE measures how long
469 it takes to drain DQ_THRESHOLD bytes.
More specifically, PIE MAY implement the rate estimation as
470 follows:

472     current_qdelay = queue_.byte_length() *
473                      PIE->avg_dq_time_/DQ_THRESHOLD;

475   * Upon a packet deque:

477       if PIE->in_measurement_ == FALSE and queue_.byte_length() >=
478                                            DQ_THRESHOLD:
479          PIE->in_measurement_ = TRUE;
480          PIE->measurement_start_ = now;
481          PIE->dq_count_ = 0;

483       if PIE->in_measurement_ == TRUE:
484          PIE->dq_count_ = PIE->dq_count_ + deque_pkt_size;
485          if PIE->dq_count_ >= DQ_THRESHOLD then
486             weight = DQ_THRESHOLD/2^16
487             PIE->avg_dq_time_ = (now-PIE->measurement_start_)*weight
488                                 + PIE->avg_dq_time_*(1-weight);
489             PIE->dq_count_ = 0;
490             PIE->measurement_start_ = now
491          else
492             PIE->in_measurement_ = FALSE;

494 The variable PIE->dq_count_ represents the number of bytes departed
495 since the last measurement. Once PIE->dq_count_ is over DQ_THRESHOLD, a
496 measurement sample is obtained. The threshold is recommended to be set
497 to 16KB, assuming a typical packet size of around 1KB or 1.5KB. This
498 threshold allows enough data to obtain an average draining rate while
499 still being small enough (< 64KB) to reflect sudden changes in the
500 draining rate. If DQ_THRESHOLD is smaller than 64KB, a small weight is
501 used to smooth out the dequeue time and obtain PIE->avg_dq_time_. The
502 dequeue rate is simply DQ_THRESHOLD divided by PIE->avg_dq_time_. This
503 threshold is not crucial for the system's stability. Please note that
504 the update interval for calculating the drop probability is different
505 from the rate measurement cycle. The drop probability calculation is
506 done periodically per Section 4.2, and it is done even when the
507 algorithm is not in a measurement cycle; in this case, the previously
508 latched value of PIE->avg_dq_time_ is used.

510                      Random Drop
511                   /                            --------------
512           -------/  -------------------->     | | | | |      -------------->
513                  /|\          |                | | | | |
514                   |           |                --------------
515                   |           |                Queue Buffer
516                   |           |                     |
517                   |           |                     |queue
518                   |           |                     |length
519                   |           |                     |
520                   |          \|/                   \|/
521                   |          ------------------------------
522                   |          |       Departure Rate       |
523                   -----<-----|     & Drop Probability     |
524                              |         Calculation        |
525                              ------------------------------

527 Figure 2. The Enqueue-based PIE Structure

529 On some platforms, the enqueueing and dequeueing functions belong to
530 different modules that are independent of each other. In such
531 situations, a pure enqueue-based design, as depicted in Figure 2, MAY
532 be used. The departure rate is
533 deduced from the number of packets enqueued and the queue length. The
534 design is based on the following key observation: over a certain time
535 interval, the number of departed packets = the number of enqueued
536 packets - the number of packets remaining in the queue. In this design,
537 everything can be triggered by a packet arrival, including the
538 background update process. The design complexity here is similar to the
539 original design.

541 5.3 Turning PIE on and off

543 Traffic naturally fluctuates in a network. It would be preferable not to
544 unnecessarily drop packets due to a spurious uptick in queueing latency.
545 PIE can be optionally turned on and off. It SHOULD only be turned on
546 (from off) when the buffer occupancy is over a certain threshold, which
547 SHOULD be set to 1/3 of the tail drop threshold. If it is on, PIE SHOULD
548 be turned off when congestion is over, i.e., when the drop probability
549 reaches 0 and the current and previous delay samples are both below half
550 of QDELAY_REF.
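As a minimal, non-normative sketch of the on/off conditions just described
(the constants, structure and field names below are illustrative assumptions,
not part of the specification; the optional pseudocode later in this section
shows the full interaction with burst protection):

    #include <stdbool.h>

    #define QDELAY_REF_MS   15.0      /* latency target                     */
    #define TAIL_DROP_BYTES 262144    /* illustrative tail drop threshold   */

    struct pie_state {
        bool   active;
        double drop_prob;
        double qdelay_ms;        /* current delay sample  */
        double qdelay_old_ms;    /* previous delay sample */
    };

    /* Called on packet arrival with the instantaneous queue depth in bytes. */
    static void pie_update_active(struct pie_state *s, unsigned queue_bytes)
    {
        if (!s->active && queue_bytes >= TAIL_DROP_BYTES / 3) {
            /* buffer occupancy over 1/3 of the tail drop threshold */
            s->active = true;
        } else if (s->active && s->drop_prob == 0.0 &&
                   s->qdelay_ms < QDELAY_REF_MS / 2 &&
                   s->qdelay_old_ms < QDELAY_REF_MS / 2) {
            /* congestion is over */
            s->active = false;
        }
    }

Such a check would run on packet arrival, before the random drop decision of
Section 4.1.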
552 Ideally, PIE should be turned on or off based on the latency. However,
553 calculating latency when PIE is off would introduce unnecessary packet
554 processing overhead. Weighing the trade-offs, it was decided to compare
555 against the tail drop threshold to keep things simple.

557 When PIE is optionally turned on and off, the burst protection logic in
558 Section 4.4 MAY be modified as follows:

560   * In the "Random Dropping" block, PIE MAY add:

562       Upon packet arrival:

564          if PIE->active_ == FALSE && queue_length >= TAIL_DROP/3:
565             PIE->active_ = TRUE;
566             PIE->burst_allowance_ = MAX_BURST;

568          if PIE->burst_allowance_ > 0 enqueue packet;
569          else randomly drop a packet with a probability PIE->drop_prob_.

571          if (PIE->drop_prob_ == 0 and current_qdelay < QDELAY_REF/2 and
572              PIE->qdelay_old_ < QDELAY_REF/2)
573              PIE->active_ = FALSE;
574              PIE->burst_allowance_ = MAX_BURST;

576   * In the "Drop Probability Calculation" block, PIE MAY do the following:
577       if PIE->active_ == TRUE:
578          PIE->burst_allowance_ = max(0,PIE->burst_allowance_ - T_UPDATE);

580 5.4 De-randomization

582 Although PIE adopts random dropping to achieve latency control,
583 independent coin tosses could introduce outlier situations where packets
584 are dropped too close to each other or too far from each other. This
585 would cause the real drop percentage to temporarily deviate from the
586 intended value, PIE->drop_prob_. In certain scenarios, such as a small
587 number of simultaneous TCP flows, these deviations can cause significant
588 fluctuations in link utilization and queueing latency. PIE MAY introduce
589 a de-randomization mechanism to avoid such scenarios. A variable, called
590 PIE->accu_prob_, is reset to 0 after a drop. Upon a packet arrival,
591 PIE->accu_prob_ is incremented by the amount of the drop probability,
592 PIE->drop_prob_. If PIE->accu_prob_ is less than a low threshold, e.g.,
593 0.85, the arriving packet is enqueued; on the other hand, if
594 PIE->accu_prob_ is more than a high threshold, e.g., 8.5, a packet is
595 forced to be dropped. A packet is only randomly dropped if
596 PIE->accu_prob_ falls between the two thresholds. Since PIE->accu_prob_
597 is reset to 0 after a drop, another drop will not happen until
598 0.85/PIE->drop_prob_ packets later. This avoids packets being dropped
599 too close to each other. In the other extreme case, where
600 8.5/PIE->drop_prob_ packets have been enqueued without incurring a drop,
601 PIE would force a drop in order to prevent the drops from being spaced
602 too far apart. Further analysis can be found in [DOCSIS-PIE].

604 5.5 Cap Drop Adjustment

606 In the case of a single TCP flow in its slow start phase, the queue
607 could increase quickly and demand a high
608 drop probability. In some environments, such as a cable modem speed test,
609 one cannot afford to trigger a TCP timeout and lose throughput, as the
610 measured throughput is shown to customers who are testing their connection
611 speed. We MAY cap the maximum drop probability increase in each step.

613   * In the "Drop Probability Calculation" block, PIE MAY add:

615       if (PIE->drop_prob_ >= 0.1 && p > 0.02) {

617          p = 0.02;

619       }

621 6. Implementation Cost

623 PIE can be applied to existing hardware or software solutions. There are
624 three steps involved in PIE, as discussed in Section 4. Their
625 complexities are examined below.

627 Upon packet arrival, the algorithm simply drops a packet randomly based
628 on the drop probability.
This step is straightforward and requires no
629 packet header examination or manipulation. If the implementation
630 doesn't rely on packet timestamps for calculating latency, PIE does not
631 require extra memory. Furthermore, the input side of a queue is
632 typically under software control while the output side of a queue is
633 hardware based. Hence, a drop at enqueueing can be readily retrofitted
634 into existing hardware or software implementations.

636 The drop probability calculation is done in the background and occurs
637 every T_UPDATE interval. Given modern high-speed links, this period
638 translates into once every tens, hundreds or even thousands of packets.
639 Hence, the calculation occurs at a much slower time scale than packet
640 processing time, at least an order of magnitude slower. The calculation
641 of drop probability involves multiplications using alpha and beta. Since
642 PIE's control law is robust to minor changes in alpha and beta values,
643 an implementation MAY round these values to the closest multiples of 2
644 or 1/2 (e.g., alpha=1/8, beta=1 + 1/4) such that the multiplications can
645 be done using simple adds and shifts. As no complicated functions are
646 required, PIE can be easily implemented in both hardware and software.
647 The state requirement is only one variable per queue: PIE->qdelay_old_.
648 Hence, the memory overhead is small.

650 If one chooses to implement the departure rate estimation, PIE uses a
651 counter to keep track of the number of bytes departed for the current
652 interval. This counter is incremented per packet departure. Every
653 T_UPDATE, PIE calculates latency using the departure rate, which can be
654 implemented using a multiplication. Note that many network devices keep
655 track of an interface's departure rate. In this case, PIE might be able
656 to reuse this information and simply skip the third step of the algorithm,
657 hence incurring no extra cost. If the platform already leverages packet
658 timestamps for other purposes, PIE MAY make use of these packet
659 timestamps for latency calculation instead of estimating the departure rate.

661 Since the PIE design is separated into the data path and the control
662 path, if the control path is implemented in software, any further
663 improvement in the control path can be easily accommodated.

665 Flow queueing can also be combined with PIE to provide isolation between
666 flows. In this case, it is preferable to have an independent value of
667 drop probability per queue. This allows each flow to receive the most
668 appropriate level of congestion signal and ensures that sparse flows
669 are protected from experiencing packet drops. However, running the
670 entire PIE algorithm independently on each queue in order to calculate
671 the drop probability may be overkill. Furthermore, in the case where
672 departure rate estimation is used to predict queueing latency, it is not
673 possible to calculate an accurate per-queue departure rate upon which to
674 implement the PIE drop probability calculation. Instead, it has been
675 proposed ([DOCSIS_AQM]) that a single implementation of the PIE drop
676 probability calculation based on the overall latency estimate be used,
677 followed by a per-queue scaling of the drop probability based on the
678 ratio of queue depth between the queue in question and the current
679 largest queue. This scaling is reasonably simple and has a couple of
680 nice properties.
One, if a packet arrives at an empty queue, it is given
681 immunity from packet drops altogether, regardless of the state of the
682 other queues. Two, in the situation where only a single queue is in use,
683 the algorithm behaves exactly like the single-queue PIE algorithm.

685 In summary, PIE is simple enough to be implemented in both software and
686 hardware.

688 7. Future Research

690 The design of the PIE algorithm is presented in this document. It
691 effectively controls the average queueing latency to a target value. The
692 following areas can be further studied:

694   * Autotuning of the target delay without losing utilization;

696   * Autotuning for the average RTT of traffic.

698 8. Incremental Deployment

700 The PIE scheme can be deployed and managed independently, without any
701 need for interoperability.

703 Although not all network nodes can be changed at once to adopt
704 latency-based AQM schemes, a gradual adoption would eventually lead
705 to end-to-end low-latency service for all applications.

707 9. Security Considerations

709 This document describes an active queue management algorithm based
710 on implementations in Cisco products. This algorithm introduces no
711 specific security exposures.

713 10. IANA Considerations

715 There are no actions for IANA.

717 11. References

719 11.1 Normative References

721 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
722           Requirement Levels", BCP 14, RFC 2119, March 1997.

724 11.2 Informative References

726 [RFC970]  Nagle, J., "On Packet Switches With Infinite
727           Storage", RFC 970, December 1985.

729 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
730           S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
731           Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S.,
732           Wroclawski, J. and Zhang, L., "Recommendations on Queue
733           Management and Congestion Avoidance in the Internet",
734           RFC 2309, April 1998.

736 [CBQ]     Cisco White Paper,
737           "http://www.cisco.com/en/US/docs/12_0t/12_0tfeature/guide/cbwfq.html".

739 [CoDel]   Nichols, K. and Jacobson, V., "Controlling Queue Delay", ACM
740           Queue, ACM Publishing, doi:10.1145/2209249.22W.09264.

742 [DOCSIS_3.1] http://www.cablelabs.com/wp-content/uploads/specdocs
743           /CM-SP-MULPIv3.1-I01-131029.pdf.

745 [DOCSIS-PIE] White, G. and Pan, R., "A PIE-Based AQM for DOCSIS
746           Cable Modems", IETF draft-white-aqm-docsis-pie-02.

748 [FQ-Implement] Baker, F. and Pan, R., "On Queueing, Marking and
749           Dropping", IETF draft-ietf-aqm-fq-implementation.

751 [HPSR-PIE] Pan, R., Natarajan, P., Piglione, C., Prabhu, M.S.,
752           Subramanian, V., Baker, F. and VerSteeg, B., "PIE: A Lightweight
753           Control Scheme to Address the Bufferbloat Problem", IEEE HPSR 2013.
754           https://www.researchgate.net/publication/261134127_PIE_A_lightweight
755           _control_scheme_to_address_the_bufferbloat_problem?origin=mail

757 [IETF-AQM] Baker, F. and Fairhurst, G., "IETF Recommendations
758           Regarding Active Queue Management", draft-ietf-aqm-recommendation-11.

760 [IETF-ECN] Briscoe, B., Kaippallimalil, J., and Thaler, P.,
761           "Guidelines for Adding Congestion Notification to Protocols that
762           Encapsulate IP", draft-ietf-tsvwg-ecn-encap-guidelines.

764 11.3 Other References

766 [PI]      Hollot, C.V., Misra, V., Towsley, D. and Gong, W., "On
767           Designing Improved Controllers for AQM Routers Supporting TCP Flows",
768           Infocom 2001.

770 [QCN]     "Data Center Bridging - Congestion Notification",
771           http://www.ieee802.org/1/pages/802.1au.html.
773 [TCP-Models] Misra, V., Gong, W., and Towsley, D., "Fluid-base 774 Analysis of a Network of AQM Routers Supporting TCP Flows with an 775 Application to RED", SIGCOMM 2000 777 Authors' Addresses 779 Rong Pan 780 Cisco Systems 781 3625 Cisco Way, 782 San Jose, CA 95134, USA 783 Email: ropan@cisco.com 785 Preethi Natarajan, 786 Cisco Systems 787 725 Alder Drive, 788 Milpitas, CA 95035, USA 789 Email: prenatar@cisco.com 791 Fred Baker 792 Cisco Systems 793 725 Alder Drive, 794 Milpitas, CA 95035, USA 795 Email: fred@cisco.com 797 Bill Ver Steeg 798 Cisco Systems 799 5030 Sugarloaf Parkway 800 Lawrenceville, GA, 30044, USA 801 Email: versteb@cisco.com 803 Mythili Prabhu* 804 Akamai Technologies 805 3355 Scott Blvd 806 Santa Clara, CA - 95054 807 Email: mythili@akamai.com 809 Chiara Piglione* 810 Broadcom Corporation 811 3151 Zanker Road 812 San Jose, CA 95134 813 Email: chiara@broadcom.com 815 Vijay Subramanian* 816 PLUMgrid, Inc. 817 350 Oakmead Parkway, 818 Suite 250 819 Sunnyvale, CA 94085 820 Email: vns@plumgrid.com 822 Greg White 823 CableLabs 824 858 Coal Creek Circle 825 Louisville, CO 80027, USA 826 Email: g.white@cablelabs.com 828 * Formerly at Cisco Systems 830 12. The Basic PIE pseudo Code 832 Configurable Parameters: 833 - QDELAY_REF. AQM Latency Target (default: 15ms) 834 - MAX_BURST. AQM Max Burst Allowance (default: 150ms) 836 Internal Parameters: 837 - Weights in the drop probability calculation (1/s): 838 alpha (default: 1/8), beta(default: 1 + 1/4) 839 - T_UPDATE: a period to calculate drop probability (default:15ms) 841 Table which stores status variables (ending with "_"): 842 - burst_allowance_: current burst allowance 843 - drop_prob_: The current packet drop probability. reset to 0 844 - qdelay_old_: The previous queue delay. reset to 0 846 Public/system functions: 847 - queue_. Holds the pending packets. 848 - drop(packet). Drops/discards a packet 849 - now(). Returns the current time 850 - random(). Returns a uniform r.v. in the range 0 ~ 1 851 - queue_.byte_length(). Returns current queue_ length in bytes 852 - queue_.enque(packet). Adds packet to tail of queue_ 853 - queue_.deque(). Returns the packet from the head of queue_ 854 - packet.size(). Returns size of packet 855 - packet.timestamp_delay(). 
Returns timestamped packet latency 857 ============================ 859 //called on each packet arrival 860 enque(Packet packet) { 861 if (PIE->drop_prob_ == 0 && current_qdelay < QDELAY_REF/2 862 && PIE->qdelay_old_ < QDELAY_REF/2) { 863 PIE->burst_allowance_ = MAX_BURST; 864 } 865 if (PIE->burst_allowance_ == 0 && drop_early() == DROP) { 866 drop(packet); 867 } else { 868 queue_.enque(packet); 869 } 870 } 872 =========================== 874 drop_early() { 876 //Safeguard PIE to be work conserving 877 if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 0.2) 878 || (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) { 879 return ENQUE; 881 } 883 double u = random(); 884 if (u < PIE->drop_prob_) { 885 return DROP; 886 } else { 887 return ENQUE; 888 } 889 } 891 =========================== 892 //we choose the timestamp option of obtaining latency for clarity 893 //rate estimation method can be found in the extended PIE pseudo code 895 deque(Packet packet) { 897 current_qdelay = packet.timestamp_delay(); 899 } 901 ============================ 902 //update periodically, T_UPDATE = 15ms 904 calculate_drop_prob() { 906 //can be implemented using integer multiply, 908 p = alpha*(current_qdelay - QDELAY_REF) + \ 909 beta*(current_qdelay-PIE->qdelay_old_); 911 if (PIE->drop_prob_ < 0.000001) { 912 p /= 2048; 913 } else if (PIE->drop_prob_ < 0.00001) { 914 p /= 512; 915 } else if (PIE->drop_prob_ < 0.0001) { 916 p /= 128; 917 } else if (PIE->drop_prob_ < 0.001) { 918 p /= 32; 919 } else if (PIE->drop_prob_ < 0.01) { 920 p /= 8; 921 } else if (PIE->drop_prob_ < 0.1) { 922 p /= 2; 923 } else { 924 p = p; 925 } 926 PIE->drop_prob_ += p; 928 //Exponentially decay drop prob when congestion goes away 929 if (current_qdelay == 0 && PIE->qdelay_old_ == 0) { 930 PIE->drop_prob_ *= 0.98; //1- 1/64 is sufficient 931 } 933 //bound drop probability 934 if (PIE->drop_prob_ < 0) 935 PIE->drop_prob_ = 0.0 936 if (PIE->drop_prob_ > 1) 937 PIE->drop_prob_ = 1.0 939 PIE->qdelay_old_ = current_qdelay; 941 PIE->burst_allowance_ = max(0,PIE->burst_allowance_ - T_UPDATE); 943 } 944 } 946 13. Pseudo code for PIE with optional enhancement 948 Configurable Parameters: 949 - QDELAY_REF. AQM Latency Target (default: 15ms) 950 - MAX_BURST. AQM Max Burst Allowance (default: 150ms) 951 - MAX_ECNTH. AQM Max ECN Marking Threshold (default: 10%) 953 Internal Parameters: 954 - Weights in the drop probability calculation (1/s): 955 alpha (default: 1/8), beta(default: 1+1/4) 956 - DQ_THRESHOLD: (in bytes, default: 2^14 (in a power of 2) ) 957 - T_UPDATE: a period to calculate drop probability (default:15ms) 958 - TAIL_DROP: each queue has a tail drop threshold, pass it to PIE 960 Table which stores status variables (ending with "_"): 961 - active_: INACTIVE/ACTIVE 962 - burst_allowance_: current burst allowance 963 - drop_prob_: The current packet drop probability. reset to 0 964 - accu_prob_: Accumulated drop probability. reset to 0 965 - qdelay_old_: The previous queue delay estimate. reset to 0 966 - last_timestamp_: Timestamp of previous status update 967 - dq_count_, measurement_start_, in_measurement_, 968 avg_dq_time_. variables for measuring average dequeue rate. 970 Public/system functions: 971 - queue_. Holds the pending packets. 972 - drop(packet). Drops/discards a packet 973 - mark(packet). Marks ECN for a packet 974 - now(). Returns the current time 975 - random(). Returns a uniform r.v. in the range 0 ~ 1 976 - queue_.byte_length(). Returns current queue_ length in bytes 977 - queue_.enque(packet). 
Adds packet to tail of queue_ 978 - queue_.deque(). Returns the packet from the head of queue_ 979 - packet.size(). Returns size of packet 980 - packet.ecn(). Returns whether packet is ECN capable or not 982 ============================ 983 //called on each packet arrival 984 enque(Packet packet) { 985 if (queue_.byte_length()+packet.size() > TAIL_DROP) { 986 drop(packet); 987 PIE->accu_prob_ = 0; 988 } else if (PIE->active_ == TRUE && drop_early() == DROP 989 && PIE->burst_allowance_ == 0) { 990 if (PIE->drop_prob_ < MAX_ECNTH && packet.ecn() == TRUE) 991 mark(packet); 992 else 993 drop(packet); 994 PIE->accu_prob_ = 0; 995 } else { 996 queue_.enque(packet); 997 } 999 //If the queue is over a certain threshold, turn on PIE 1000 if (PIE->active_ == INACTIVE 1001 && queue_.byte_length() >= TAIL_DROP/3) { 1002 PIE->active_ = ACTIVE; 1003 PIE->qdelay_old_ = 0; 1004 PIE->drop_prob_ = 0; 1005 PIE->in_measurement_ = TRUE; 1006 PIE->dq_count_ = 0; 1007 PIE->avg_dq_time_ = 0; 1008 PIE->last_timestamp_ = now; 1009 PIE->burst_allowance_ = MAX_BURST; 1010 PIE->accu_prob_ = 0; 1011 PIE->measurement_start_ = now; 1012 } 1014 //If the queue has been idle for a while, turn off PIE 1015 //reset counters when accessing the queue after some idle 1016 //period if PIE was active before 1017 if ( PIE->drop_prob_ == 0 && PIE->qdelay_old_ == 0 1018 && current_qdelay == 0) { 1019 PIE->active_ = INACTIVE; 1020 PIE->in_measurement_ = FALSE; 1021 } 1022 } 1024 =========================== 1026 drop_early() { 1028 //PIE is active but the queue is not congested, return ENQUE 1029 if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 0.2) 1030 || (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) { 1031 return ENQUE; 1032 } 1034 if (PIE->drop_prob_ == 0) { 1035 PIE->accu_prob_ = 0; 1036 } 1038 //For practical reasons, drop probability can be further scaled 1039 //according to packet size. 
but need to set a bound to 1040 //avoid unnecessary bias 1042 //Random drop 1043 PIE->accu_prob_ += PIE->drop_prob_; 1044 if (PIE->accu_prob_ < 0.85) 1045 return ENQUE; 1046 if (PIE->accu_prob_ >= 8.5) 1047 return DROP; 1048 double u = random(); 1049 if (u < PIE->drop_prob_) { 1050 PIE->accu_prob_ = 0; 1051 return DROP; 1052 } else { 1053 return ENQUE; 1054 } 1055 } 1057 ============================ 1058 //update periodically, T_UPDATE = 15ms 1059 calculate_drop_prob() { 1060 if ( (now - PIE->last_timestamp_) >= T_UPDATE && 1061 PIE->active_ == ACTIVE) { 1062 //can be implemented using integer multiply, 1063 //DQ_THRESHOLD is power of 2 value 1064 current_qdelay = queue_.byte_length() * PIE- 1065 >avg_dq_time_/DQ_THRESHOLD; 1067 p = alpha*(current_qdelay - QDELAY_REF) + \ 1068 beta*(current_qdelay-PIE->qdelay_old_); 1070 if (PIE->drop_prob_ < 0.000001) { 1071 p /= 2048; 1072 } else if (PIE->drop_prob_ < 0.00001) { 1073 p /= 512; 1074 } else if (PIE->drop_prob_ < 0.0001) { 1075 p /= 128; 1076 } else if (PIE->drop_prob_ < 0.001) { 1077 p /= 32; 1078 } else if (PIE->drop_prob_ < 0.01) { 1079 p /= 8; 1080 } else if (PIE->drop_prob_ < 0.1) { 1081 p /= 2; 1082 } else { 1083 p = p; 1084 } 1086 if (PIE->drop_prob_ >= 0.1 && p > 0.02) { 1087 p = 0.02; 1088 } 1089 PIE->drop_prob_ += p; 1091 //Exponentially decay drop prob when congestion goes away 1092 if (current_qdelay < QDELAY_REF/2 && PIE->qdelay_old_ < 1093 QDELAY_REF/2) { 1094 PIE->drop_prob_ *= 0.98; //1- 1/64 is sufficient 1095 } 1097 //bound drop probability 1098 if (PIE->drop_prob_ < 0) 1099 PIE->drop_prob_ = 0 1100 if (PIE->drop_prob_ > 1) 1101 PIE->drop_prob_ = 1 1103 PIE->qdelay_old_ = current_qdelay; 1104 PIE->last_timestamp_ = now; 1105 PIE->burst_allowance_ = max(0,PIE->burst_allowance_ - T_UPDATE); 1106 } 1107 } 1109 ========================== 1110 //called on each packet departure 1111 deque(Packet packet) { 1113 //deque rate estimation 1114 if (PIE->in_measurement_ == TRUE) { 1115 PIE->dq_count_ = packet.size() + PIE->dq_count_; 1116 //start a new measurement cycle if we have enough packets 1117 if ( PIE->dq_count_ >= DQ_THRESHOLD) { 1118 dq_time = now - PIE->measurement_start_; 1119 if(PIE->avg_dq_time_ == 0) { 1120 PIE->avg_dq_time_ = dq_time; 1121 } else { 1122 weight = DQ_THRESHOLD/2^16 1123 PIE->avg_dq_time_ = dq_time*weight + PIE->avg_dq_time_*(1- 1124 weight); 1125 } 1126 PIE->in_measurement_ = FALSE; 1127 } 1128 } 1130 //start a measurement if we have enough data in the queue: 1131 if (queue_.byte_length() >= DQ_THRESHOLD && 1132 PIE->in_measurement_ == FALSE) { 1133 PIE->in_measurement_ = TRUE; 1134 PIE->measurement_start_ = now; 1135 PIE->dq_count_ = 0; 1136 } 1137 }
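As a closing, non-normative illustration of the shift-friendly arithmetic
mentioned in Section 6 (alpha = 1/8, beta = 1 + 1/4), the control law of
Section 4.2 can be computed in fixed point using only adds and power-of-two
divisions. The scaling below (the drop probability carried as a fraction of
2^32 and delays in microseconds) and all names are assumptions made for this
example, not part of the specification; the auto-tuning scale factors and the
probability bounds of Section 4.2 are omitted for brevity.

    #include <stdint.h>
    #include <stdio.h>

    #define PROB_ONE      (1LL << 32)   /* fixed-point representation of 1.0    */
    #define QDELAY_REF_US 15000LL       /* 15ms latency target, in microseconds */

    /* Convert a delay error in microseconds into the fixed-point probability
     * domain (error in seconds times PROB_ONE).                               */
    static int64_t prob_term(int64_t delay_err_us) {
        return delay_err_us * (PROB_ONE / 1000000LL);
    }

    /* p = alpha*(current_qdelay - QDELAY_REF) + beta*(current_qdelay - qdelay_old)
     * with alpha = 1/8 and beta = 1 + 1/4; the divisions by 8 and 4 are
     * power-of-two operations that compilers typically turn into shift-based
     * code, so no general multiplication is needed.                            */
    int64_t pie_control_law(int64_t current_qdelay_us, int64_t qdelay_old_us) {
        int64_t e1 = prob_term(current_qdelay_us - QDELAY_REF_US);
        int64_t e2 = prob_term(current_qdelay_us - qdelay_old_us);
        return e1 / 8 + e2 + e2 / 4;
    }

    int main(void) {
        /* example: current delay 20ms, previous sample 18ms */
        int64_t p = pie_control_law(20000, 18000);
        /* prints p of about 0.003125, i.e., 0.125*5ms/1s + 1.25*2ms/1s */
        printf("p = %f\n", (double)p / PROB_ONE);
        return 0;
    }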