idnits 2.17.1 draft-ietf-aqm-pie-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** There are 2 instances of too long lines in the document, the longest one being 21 characters in excess of 72. ** There are 31 instances of lines with control characters in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (February 29, 2016) is 2980 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'CBQ' is mentioned on line 721, but not defined == Missing Reference: 'IETF-AQM' is mentioned on line 743, but not defined == Missing Reference: 'CoDel' is mentioned on line 724, but not defined == Missing Reference: 'FQ-Implement' is mentioned on line 733, but not defined == Missing Reference: 'DOCSIS-PIE' is mentioned on line 730, but not defined == Missing Reference: 'HPSR-PIE' is mentioned on line 736, but not defined == Missing Reference: 'PI' is mentioned on line 752, but not defined == Missing Reference: 'QCN' is mentioned on line 756, but not defined == Missing Reference: 'TCP-Models' is mentioned on line 759, but not defined == Missing Reference: 'IETF-ECN' is mentioned on line 746, but not defined == Missing Reference: 'AQM DOCSIS' is mentioned on line 748, but not defined -- Obsolete informational reference (is this intentional?): RFC 2309 (Obsoleted by RFC 7567) Summary: 3 errors (**), 0 flaws (~~), 12 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Draft R. Pan, P. Natarajan, F. Baker 3 Active Queue Management G. White, B. VerSteeg, M.S. Prabhu 4 Working Group C. Piglione, V. Subramanian 5 Intended Status: Standards Track 7 Expires: September 1, 2016 February 29, 2016 9 PIE: A Lightweight Control Scheme To Address the 10 Bufferbloat Problem 12 draft-ietf-aqm-pie-04 14 Abstract 16 Bufferbloat is a phenomenon where excess buffers in the network cause 17 high latency and jitter. As more and more interactive applications 18 (e.g. voice over IP, real time video streaming and financial 19 transactions) run in the Internet, high latency and jitter degrade 20 application performance. There is a pressing need to design 21 intelligent queue management schemes that can control latency and 22 jitter; and hence provide desirable quality of service to users. 24 This document presents a lightweight active queue management design, 25 called PIE (Proportional Integral controller Enhanced), that can 26 effectively control the average queueing latency to a target value. 27 Simulation results, theoretical analysis and Linux testbed results 28 have shown that PIE can ensure low latency and achieve high link 29 utilization under various congestion situations. The design does not 30 require per-packet timestamp, so it incurs very small overhead and is 31 simple enough to implement in both hardware and software. 33 Status of this Memo 35 This Internet-Draft is submitted to IETF in full conformance with the 36 provisions of BCP 78 and BCP 79. 38 Internet-Drafts are working documents of the Internet Engineering 39 Task Force (IETF), its areas, and its working groups. Note that 40 other groups may also distribute working documents as 41 Internet-Drafts. 43 Internet-Drafts are draft documents valid for a maximum of six months 44 and may be updated, replaced, or obsoleted by other documents at any 45 time. It is inappropriate to use Internet-Drafts as reference 46 material or to cite them other than as "work in progress." 47 The list of current Internet-Drafts can be accessed at 48 http://www.ietf.org/1id-abstracts.html 50 The list of Internet-Draft Shadow Directories can be accessed at 51 http://www.ietf.org/shadow.html 53 Copyright and License Notice 55 Copyright (c) 2012 IETF Trust and the persons identified as the 56 document authors. All rights reserved. 58 This document is subject to BCP 78 and the IETF Trust's Legal 59 Provisions Relating to IETF Documents 60 (http://trustee.ietf.org/license-info) in effect on the date of 61 publication of this document. Please review these documents 62 carefully, as they describe your rights and restrictions with respect 63 to this document. Code Components extracted from this document must 64 include Simplified BSD License text as described in Section 4.e of 65 the Trust Legal Provisions and are provided without warranty as 66 described in the Simplified BSD License. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 71 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 5 72 3. Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . 5 73 4. The Basic PIE Scheme . . . . . . . . . . . . . . . . . . . . . 6 74 4.1 Random Dropping(ECN Support is described later in this 75 document) . . . . . . . . . . . . . . . . . . . . . . . . . 7 76 4.2 Drop Probability Calculation . . . . . . . . . . . . . . . . 8 77 4.3 Latency Calculation . . . . . . . . . . . . . . . . . . . . 10 78 4.4 Burst Tolerance . . . . . . . . . . . . . . . . . . . . . . 10 79 5. Optional Design Elements of PIE . . . . . . . . . . . . . . . . 11 80 5.1 ECN Support . . . . . . . . . . . . . . . . . . . . . . . . 11 81 5.2 Departure Rate Estimation . . . . . . . . . . . . . . . . . 11 82 5.3 Turning PIE on and off . . . . . . . . . . . . . . . . . . . 13 83 5.4 De-randomization . . . . . . . . . . . . . . . . . . . . . . 14 84 5.5 Cap Drop Adjustment . . . . . . . . . . . . . . . . . . . . 15 85 6. Implementation Cost . . . . . . . . . . . . . . . . . . . . . . 15 86 7. Future Research . . . . . . . . . . . . . . . . . . . . . . . . 16 87 8. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 16 88 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 17 89 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 17 90 10.1 Normative References . . . . . . . . . . . . . . . . . . . 17 91 10.2 Informative References . . . . . . . . . . . . . . . . . . 17 92 10.3 Other References . . . . . . . . . . . . . . . . . . . . . 17 93 [CBQ] Cisco White Paper, 94 "http://www.cisco.com/en/US/docs/12_0t/12_0tfeature/guide/cbwfq.html". n 95 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 18 96 11. The Basic PIE pseudo Code . . . . . . . . . . . . . . . . . . 19 97 12. Pseudo code for PIE with optional enhancement . . . . . . . . 22 99 1. Introduction 101 The explosion of smart phones, tablets and video traffic in the 102 Internet brings about a unique set of challenges for congestion 103 control. To avoid packet drops, many service providers or data center 104 operators require vendors to put in as much buffer as possible. With 105 rapid decrease in memory chip prices, these requests are easily 106 accommodated to keep customers happy. While this solution succeeds in 107 assuring low packet loss and high TCP throughput, it suffers from a 108 major downside. The TCP protocol continuously increases its sending 109 rate and causes network buffers to fill up. TCP cuts its rate only 110 when it receives a packet drop or mark that is interpreted as a 111 congestion signal. However, drops and marks usually occur when 112 network buffers are full or almost full. As a result, excess buffers, 113 initially designed to avoid packet drops, would lead to highly 114 elevated queueing latency and jitter. It is a delicate balancing act 115 to design a queue management scheme that not only allows short-term 116 burst to smoothly pass, but also controls the average latency in the 117 presence of long-running greedy flows. 119 Active queue management (AQM) schemes, such as Random Early Detection 120 (RED), have been around for well over a decade. AQM schemes could 121 potentially solve the aforementioned problem. RFC 2309[RFC2309] 122 strongly recommends the adoption of AQM schemes in the network to 123 improve the performance of the Internet. RED is implemented in a wide 124 variety of network devices, both in hardware and software. 125 Unfortunately, due to the fact that RED needs careful tuning of its 126 parameters for various network conditions, most network operators 127 don't turn RED on. In addition, RED is designed to control the queue 128 length which would affect delay implicitly. It does not control 129 latency directly. Hence, the Internet today still lacks an effective 130 design that can control buffer latency to improve the quality of 131 experience to latency-sensitive applications. Notably, a recent IETF 132 AQM working group draft [IETF-AQM] calls for new methods of 133 controlling network latency. 135 New algorithms are beginning to emerge to control queueing latency 136 directly to address the bufferbloat problem [CoDel]. Along these 137 lines, PIE also aims to keep the benefits of RED: such as easy 138 implementation and scalability to high speeds. Similar to RED, PIE 139 randomly drops an incoming packet at the onset of the congestion. The 140 congestion detection, however, is based on the queueing latency 141 instead of the queue length like RED. Furthermore, PIE also uses the 142 derivative (rate of change) of the queueing latency to help determine 143 congestion levels and an appropriate response. The design parameters 144 of PIE are chosen via control theory stability analysis. While these 145 parameters can be fixed to work in various traffic conditions, they 146 could be made self-tuning to optimize system performance. 148 Separately, it is assumed that any delay-based AQM scheme would be 149 applied over a Fair Queueing (FQ) structure or one of its approximate 150 designs, Flow Queueing or Class Based Queueing (CBQ). FQ is one of 151 the most studied scheduling algorithms since it was first proposed in 152 1985 [RFC970]. CBQ has been a standard feature in most network 153 devices today[CBQ]. Any AQM scheme that is built on top of FQ or CBQ 154 could benefit from these advantages. Furthermore, these advantages 155 such as per flow/class fairness are orthogonal to the AQM design 156 whose primary goal is to control latency for a given queue. For flows 157 that are classified into the same class and put into the same queue, 158 one needs to ensure their latency is better controlled and their 159 fairness is not worse than those under the standard DropTail or RED 160 design. More details about the relationship between FQ and AQM can be 161 found in IETF draft [FQ-Implement]. 163 In October 2013, CableLabs' DOCSIS 3.1 specification [DOCSIS_3.1] 164 mandated that cable modems implement a specific variant of the PIE 165 design as the active queue management algorithm. In addition to cable 166 specific improvements, the PIE design in DOCSIS 3.1 [DOCSIS-PIE] has 167 improved the original design in several areas, including de- 168 randomization of coin tosses and enhanced burst protection. 170 This draft separates the PIE design into the basic elements that are 171 MUST to be implemented and optional SHOULD/MAY enhancement elements. 173 2. Terminology 175 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 176 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 177 document are to be interpreted as described in RFC 2119 [RFC2119]. 179 3. Design Goals 181 A queue management framework is designed to improve the performance 182 of interactive and delay-sensitive applications. It should follow the 183 general guidelines set by the AQM working group document "IETF 184 Recommendations Regarding Active Queue Management" [IETF-AQM]. More 185 specifically PIE design has the following basic criteria. 187 * First, queueing latency, instead of queue length, is 188 controlled. Queue sizes change with queue draining rates and 189 various flows' round trip times. Delay bloat is the real issue 190 that needs to be addressed as it impairs real time applications. 191 If latency can be controlled, bufferbloat is not an issue. In 192 fact, once latency is under control it frees up buffers for 193 sporadic bursts. 195 * Secondly, PIE aims to attain high link utilization. The goal 196 of low latency shall be achieved without suffering link under- 197 utilization or losing network efficiency. An early congestion 198 signal could cause TCP to back off and avoid queue building up. 199 On the other hand, however, TCP's rate reduction could result in 200 link under-utilization. There is a delicate balance between 201 achieving high link utilization and low latency. 203 * Furthermore, the scheme should be simple to implement and 204 easily scalable in both hardware and software. PIE strives to 205 maintain similar design simplicity to RED, which has been 206 implemented in a wide variety of network devices. 208 * Finally, the scheme should ensure system stability for various 209 network topologies and scale well across an arbitrary number of 210 streams. Design parameters shall be set automatically. Users 211 only need to set performance-related parameters such as target 212 queue delay, not design parameters. 214 In the following, the design of PIE and its operation are described in 215 detail. 217 4. The Basic PIE Scheme 219 As illustrated in Fig. 1, PIE conceptually comprises three simple MUST 220 components: a) random dropping at enqueueing; b) periodic drop 221 probability update; c) latency calculation. When a packet arrives, a 222 random decision is made regarding whether to drop the packet. The drop 223 probability is updated periodically based on how far the current delay 224 is away from the target and whether the queueing delay is currently 225 trending up or down. The queueing delay can be obtained using direct 226 measurements or using estimations calculated from the queue length and 227 the dequeue rate. 229 The detailed definition of parameters can be found in the pseudo code 230 section of this document (Section 11). Any state variables that PIE 231 maintains are noted using "PIE->". For full description of the 232 algorithm, one can refer to the full paper [HPSR-PIE]. 234 Random Drop 235 / -------------- 236 -------/ --------------> | | | | | --------------> 237 /|\ | | | | | 238 | -------------- 239 | Queue Buffer \ 240 | | \ 241 | |queue \ 242 | |length \ 243 | | \ 244 | \|/ \/ 245 | ----------------- ------------------- 246 | | Drop | | | 247 -----<-----| Probability |<---| Latency | 248 | Calculation | | Calculation | 249 ----------------- ------------------- 251 Figure 1. The PIE Structure 253 4.1 Random Dropping(ECN Support is described later in this document) 255 PIE MUST drop a packet upon its arrival to a queue according to a drop 256 probability, PIE->drop_prob_, that is obtained from the drop- 257 probability-calculation component. The random drop is triggered by a 258 packet arrival before enqueueing into a queue. 260 * Upon a packet enqueue, PIE MUST: 262 randomly drop the packet with a probability PIE->drop_prob_. 264 To ensure that PIE is work conserving, we MAY bypass the random drop if 265 the delay sample, PIE->qdelay_old_, is smaller than half of QDELAY_REF 266 when the drop probability is not too high, PIE->drop_prob_ < 0.2; or if 267 the queue has less than a couple of packets. 269 * Upon a packet enqueue, PIE MAY: 271 //Safeguard PIE to be work conserving 272 if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 0.2) 273 || (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) { 274 return ENQUE; 275 else 276 randomly drop the packet with a probability PIE->drop_prob_. 278 PIE optionally supports ECN and see Section 5.1. 280 4.2 Drop Probability Calculation 282 The PIE algorithm periodically updates the drop probability based on the 283 delay samples: not only the current delay sample but also the trend 284 where the delay is going, up or down. This is the classical Proportional 285 Integral (PI) controller method which is known for eliminating steady 286 state errors. This type of controller has been studied before for 287 controlling the queue length [PI, QCN]. PIE adopts the Proportional 288 Integral controller for controlling delay. The algorithm also auto- 289 adjusts the control parameters based on how heavy the congestion is, 290 which is reflected in the current drop probability. Note that the 291 current drop probability is a direct measure of current congestion 292 level, no need to measure the arrival rate and departure rate 293 mismatches. 295 When a congestion period goes away, we might be left with a high drop 296 probability with light packet arrivals. Hence, the PIE algorithm MUST 297 include a mechanism by which the drop probability decay exponentially 298 (rather than linearly) when the system is not congested. This would help 299 the drop probability converge to 0 faster while the PI controller 300 ensures that it would eventually reaches zero. The decay parameter of 2% 301 gives us around 750ms time constant, a few RTT. 303 Specifically, the PIE algorithm MUST periodically adjust the drop 304 probability every T_UPDATE interval: 306 * MUST calculate drop probability PIE->drop_prob_ and auto-tune it 307 as: 309 p = alpha*(current_qdelay-QDELAY_REF) + 310 beta*(current_qdelay-PIE->qdelay_old_); 312 if (PIE->drop_prob_ < 0.000001) { 313 p /= 2048; 314 } else if (PIE->drop_prob_ < 0.00001) { 315 p /= 512; 316 } else if (PIE->drop_prob_ < 0.0001) { 317 p /= 128; 318 } else if (PIE->drop_prob_ < 0.001) { 319 p /= 32; 320 } else if (PIE->drop_prob_ < 0.01) { 321 p /= 8; 322 } else if (PIE->drop_prob_ < 0.1) { 323 p /= 2; 324 } else { 325 p = p; 327 } 328 PIE->drop_prob_ += p; 330 * MUST decay the drop probability exponentially: 332 if (current_qdelay == 0 && PIE->qdelay_old_ == 0) { 334 PIE->drop_prob_ = PIE->drop_prob_*0.98; //1- 1/64 335 //is sufficient 337 } 339 * MUST bound the drop probability 340 if (PIE->drop_prob_ < 0) 341 PIE->drop_prob_ = 0.0 342 if (PIE->drop_prob_ > 1) 343 PIE->drop_prob_ = 1.0 345 * MUST store current delay value: 347 PIE->qdelay_old_ = current_qdelay. 349 The update interval, T_UPDATE, is defaulted to be 15ms. It MAY be 350 reduced on high speed links in order to provide smoother response. The 351 target delay value, QDELAY_REF, SHOULD be set to 15ms. Variables, 352 current_qdelay and PIE->qdelay_old_ represent the current and previous 353 samples of the queueing delay, which are calculated by the "Latency 354 Calculation" component (see Section 4.3). The variable current_qdelay is 355 actually a temporary variable while PIE->qdelay_old_ is a state variable 356 that PIE keeps. The drop probability is a value between 0 and 1. 357 However, implementations can certainly use integers. 359 The controller parameters, alpha and beta(in the unit of hz) are 360 designed using feedback loop analysis where TCP's behaviors are modeled 361 using the results from well-studied prior art[TCP-Models]. Note that the 362 above adjustment of p effectively scales the alpha and beta parameters 363 based on current congestion level indicated by the drop probability. 365 The theoretical analysis of PIE can be found in [HPSR-PIE]. As a rule of 366 thumb, to keep the same feedback loop dynamics, if we cut T_UPDATE in 367 half, we should also cut alpha by half and increase beta by alpha/4. If 368 the target delay is reduced, e.g. for data center use, the values of 369 alpha and beta SHOULD be increased by the same order of magnitude that 370 the target latency is reduced. For example, if QDELAY_REF is reduced 371 changed from 15ms to 150us, a reduction of two orders of magnitude, then 372 alpha and beta values should be increased to alpha*100 and beta*100. 374 4.3 Latency Calculation 376 The PIE algorithm MUST use latency to calculate drop probability. 378 * It MAY estimate current queueing delay using Little's law: 380 current_qdelay = queue_.byte_length()/dequeue_rate; 382 Details can be found in Section 5.2. 384 * or MAY use other techniques for calculating queueing delay, ex: 385 timestamp packets at enqueue and use the same to calculate delay 386 during dequeue. 388 4.4 Burst Tolerance 390 PIE MUST also NOT penalize short-term packet bursts [IETF-AQM]. PIE MUST 391 allow bursts of traffic that create finite-duration events in which 392 current queueing delay exceeds the QDELAY_REF, without triggering packet 393 drops. A parameter, MAX_BURST, is introduced that defines the burst 394 duration that will be protected. By default, the parameter SHOULD be set 395 to be 150ms. For simplicity, the PIE algorithm MAY effectively round 396 MAX_BURST up to an integer multiple of T_UPDATE. 398 To implement the burst tolerance function, two basic components of PIE 399 are involved: "random dropping" and "drop probability calculation". The 400 PIE algorithm MUST do the following: 402 * In "Random Dropping" block and upon a packet arrival , PIE MUST 403 check: 405 Upon a packet enqueue: 406 if PIE->burst_allowance_ > 0 enqueue packet; 407 else randomly drop a packet with a probability PIE->drop_prob_. 409 if (PIE->drop_prob_ == 0 and current_qdelay < QDELAY_REF/2 and 410 PIE->qdelay_old_ < QDELAY_REF/2) 411 PIE->burst_allowance_ = MAX_BURST; 413 * In "Drop Probability Calculation" block, PIE MUST additionally 414 calculate: 416 PIE->burst_allowance_ = max(0,PIE->burst_allowance_ - 417 T_UPDATE); 419 The burst allowance, noted by PIE->burst_allowance_, is initialized to 420 MAX_BURST. As long as PIE->burst_allowance_ is above zero, an incoming 421 packet will be enqueued bypassing the random drop process. During each 422 update instance, the value of PIE->burst_allowance_ is decremented by 423 the update period, T_UPDATE and is bottomed at 0. When the congestion 424 goes away, defined here as PIE->drop_prob_ equals 0 and both the current 425 and previous samples of estimated delay are less than half of 426 QDELAY_REF, PIE->burst_allowance_ is reset to MAX_BURST. 428 5. Optional Design Elements of PIE 430 The above forms the basic MUST have elements of the PIE algorithm. There 431 are several enhancements that are added to further augment the 432 performance of the basic algorithm. For clarity purposes, they are 433 included in this section. 435 5.1 ECN Support 437 PIE SHOULD support ECN by marking (rather than dropping) ECN capable 438 packets [IETF-ECN]. However, as a safeguard, an additional threshold, 439 mark_ecnth, is introduced. If the calculated drop probability exceeds 440 mark_ecnth, PIE MUST revert to packet drop for ECN capable packets. The 441 variable mark_ecnth SHOULD be set at 0.1(10%). 443 * To support ECN, the "random drop with a probability 444 PIE->drop_prob_" function in "Random Dropping" block SHOULD be 445 changed to the following: 447 * Upon a packet enqueue: 449 if rand() < PIE->drop_prob_: 451 if PIE->drop_prob_ < mark_ecnth && ecn_capable_packet == TRUE: 453 mark packet; 455 else: 457 drop packet; 459 5.2 Departure Rate Estimation 461 One way to calculate latency is to obtain the departure rate. The 462 draining rate of a queue in the network often varies either because 463 other queues are sharing the same link, or the link capacity fluctuates. 465 Rate fluctuation is particularly common in wireless networks. One MAY 466 measure directly at the dequeue operation. Short, non-persistent bursts 467 of packets result in empty queues from time to time, this would make the 468 measurement less accurate. PIE SHOULD measure when a sufficient data in 469 the buffer, i.e., when the queue length is over a certain threshold 470 (DQ_THRESHOLD). PIE measures how long it takes to drain DQ_THRESHOLD of 471 packets. More specifically, PIE MAY implement the rate estimation as 472 follows: 474 current_qdelay = queue_.byte_length() * 475 PIE->avg_dq_time_/DQ_THRESHOLD; 477 * Upon a packet deque: 479 if PIE->in_measurement_ == FALSE and queue.byte_length() > 480 DQ_THRESHOLD: 481 PIE->in_measurement_ = TRUE; 482 PIE->measurement_start_ = now; 483 PIE->dq_count_ = 0; 485 if PIE->in_measurement_ == TRUE: 486 PIE->dq_count_ = PIE->dq_count_ + deque_pkt_size; 487 if PIE->dq_count_ > DQ_THRESHOLD then 488 weight = DQ_THRESHOLD/2^16 489 PIE->avg_dq_time_ = (now-PIE->measurement_start_)*weight 490 + PIE->avg_dq_time_*(1-weight); 491 PIE->dq_count_=0; 492 PIE->measurement_start_ = now 494 The parameter, PIE->dq_count_, represents the number of bytes departed 495 since the last measurement. Once PIE->dq_count_ is over DQ_THRESHOLD, a 496 measurement sample is obtained. The threshold is recommended to be set 497 to 16KB assuming a typical packet size of around 1KB or 1.5KB. This 498 threshold would allow sufficient data to obtain an average draining rate 499 but also fast enough (< 64KB) to reflect sudden changes in the draining 500 rate. IF DQ_THRESHOLD is smaller than 64KB, a small weight is used to 501 smooth out the dequeue time and obtain PIE->avg_dq_time_. The dequeue 502 rate is simply DQ_THRESHOLD divided by PIE->avg_dq_time_. This threshold 503 is not crucial for the system's stability. Please note that the update 504 interval for calculating the drop probability is different from the rate 505 measurement cycle. The drop probability calculation is done periodically 506 per section 4.2 and it is done even when the algorithm is not in a 507 measurement cycle; in this case the previously latched value of PIE- 508 >avg_dq_time_ is used. 510 Random Drop 511 / -------------- 512 -------/ --------------------> | | | | | --------------> 513 /|\ | | | | | | 514 | | -------------- 515 | | Queue Buffer 516 | | | 517 | | |queue 518 | | |length 519 | | | 520 | \|/ \|/ 521 | ------------------------------ 522 | | Departure Rate | 523 -----<-----| & Drop Probability | 524 | Calculation | 525 ------------------------------ 527 Figure 2. The Enqueue-based PIE Structure 529 In some platforms, enqueueing and dequeueing functions belong to 530 different modules that are independent of each other. In such 531 situations, a pure enqueue-based design MAY be designed. As shown in 532 Figure 2, an enqueue-based design is depicted. The departure rate is 533 deduced from the number of packets enqueued and the queue length. The 534 design is based on the following key observation: over a certain time 535 interval, the number of departure packets = the number of enqueued 536 packets - the number of remaining packets in queue. In this design, 537 everything can be triggered by a packet arrival including the background 538 update process. The design complexity here is similar to the original 539 design. 541 5.3 Turning PIE on and off 543 Traffic naturally fluctuates in a network. It would be preferable not to 544 unnecessarily drop packets due to a spurious uptick in queueing latency. 545 PIE can be optionally turned on and off. It SHOULD only be turned on 546 (from off) when the buffer occupancy is over a certain threshold, which 547 SHOULD be set to 1/3 of the tail drop threshold. If it is on, PIE SHOULD 548 be turned off when congestion is over, i.e. when the drop probability 549 reaches 0, current and previous delay samples are all below half of 550 QDELAY_REF. 552 Ideally PIE should be turned on or off based on the latency. However, 553 calculating latency when PIE is off would introduce unnecessary packet 554 processing overhead. Weighing the trade-offs, it is decided to compare 555 against tail drop threshold to keep things simple. 557 When PIE is optionally turned on and off, the burst protection logic in 558 Section 4.4 MAY be modified as follows: 560 * "Random Dropping" block, PIE MAY add: 562 Upon packet arrival: 564 if PIE->active_ == FALSE && queue_length >= TAIL_DROP/3: 565 PIE->active_ = TRUE; 566 PIE->burst_allowance_ = MAX_BURST; 568 if PIE->burst_allowance_ > 0 enqueue packet; 569 else randomly drop a packet with a probability PIE->drop_prob_. 571 if (PIE->drop_prob_ == 0 and current_qdelay < QDELAY_REF/2 and 572 PIE->qdelay_old_ < QDELAY_REF/2) 573 PIE->active_ = FALSE; 574 PIE->burst_allowance_ = MAX_BURST; 576 * "Drop Probability Calculation" block, PIE MAY do the following: 577 if PIE->active_ == TRUE: 578 PIE->burst_allowance_ = max(0,PIE->burst_allowance_ - T_UPDATE); 580 5.4 De-randomization 582 Although PIE adopts random dropping to achieve latency control, 583 independent coin tosses could introduce outlier situations where packets 584 are dropped too close to each other or too far from each other. This 585 would cause real drop percentage to temporarily deviate from the 586 intended value PIE->drop_prob_. In certain scenarios, such as small 587 number of simultaneous TCP flows, these deviations can cause significant 588 deviations in link utilization and queueing latency. PIE MAY introduce a 589 de-randomization mechanism to avoid such scenarios. A parameter, called 590 PIE->accu_prob_, is reset to 0 after a drop. Upon a packet arrival, PIE- 591 >accu_prob_ is incremented by the amount of drop probability, PIE- 592 >drop_prob_. If PIE->accu_prob_ is less than a low threshold, e.g. 0.85, 593 the arriving packet is enqueued; on the other hand, if PIE->accu_prob_ 594 is more than a high threshold, e.g. 8.5, a packet is forced to be 595 dropped. A packet is only randomly dropped if PIE->accu_prob_ falls in 596 between the two thresholds. Since PIE->accu_prob_ is reset to 0 after a 597 drop, another drop will not happen until 0.85/PIE->drop_prob_ packets 598 later. This avoids packets being dropped too close to each other. In the 599 other extreme case where 8.5/PIE->drop_prob_ packets have been enqueued 600 without incurring a drop, PIE would force a drop in order to prevent the 601 drops from being spaced too far apart. Further analysis can be found in 602 [DOCSIS-PIE]. 604 5.5 Cap Drop Adjustment 606 In the case of one single TCP flow during slow start phase in the 607 system, queue could quickly increase during slow start and demands high 608 drop probability. In some environments such as Cable Modem Speed Test, 609 one could not afford triggering timeout and lose throughput as 610 throughput is shown to customers who are testing his/her connection 611 speed. We MAY cap the maximum drop probability increase in each step. 613 * "Drop Probability Calculation" block, PIE MAY add: 615 if (PIE->drop_prob_ >= 0.1 && p > 0.02) { 617 p = 0.02; 619 } 621 6. Implementation Cost 623 PIE can be applied to existing hardware or software solutions. There are 624 three steps involved in PIE as discussed in Section 4. Their 625 complexities are examined below. 627 Upon packet arrival, the algorithm simply drops a packet randomly based 628 on the drop probability. This step is straightforward and requires no 629 packet header examination and manipulation. If the implementation 630 doesn't rely on packet timestamps for calculating latency, PIE does not 631 require extra memory. Furthermore, the input side of a queue is 632 typically under software control while the output side of a queue is 633 hardware based. Hence, a drop at enqueueing can be readily retrofitted 634 into existing hardware or software implementations. 636 The drop probability calculation is done in the background and it occurs 637 every T_UPDATE interval. Given modern high speed links, this period 638 translates into once every tens, hundreds or even thousands of packets. 639 Hence the calculation occurs at a much slower time scale than packet 640 processing time, at least an order of magnitude slower. The calculation 641 of drop probability involves multiplications using alpha and beta. Since 642 PIE's control law is robust to minor changes in alpha and beta values, 643 an implementation MAY choose these values to the closest multiples of 2 644 or 1/2 (ex: alpha=1/8, beta=1 + 1/4) such that the multiplications can 645 be done using simple adds and shifts. As no complicated functions are 646 required, PIE can be easily implemented in both hardware and software. 647 The state requirement is only one variables per queue: PIE->qdelay_old_. 648 Hence the memory overhead is small. 650 If one chooses to implement the departure rate estimation, PIE uses a 651 counter to keep track of the number of bytes departed for the current 652 interval. This counter is incremented per packet departure. Every 653 T_UPDATE, PIE calculates latency using the departure rate, which can be 654 implemented using a multiplication. Note that many network devices keep 655 track of an interface's departure rate. In this case, PIE might be able 656 to reuse this information, simply skip the third step of the algorithm 657 and hence incurs no extra cost. If platform already leverages packet 658 timestamps for other purposes, PIE MAY make use of these packet 659 timestamps for latency calculation instead of estimating departure rate. 661 Since the PIE design is separated into data path and control path, if 662 control path is implemented in software, any further improvement in 663 control path can be easily accommodated. 665 SFQ can also be combined with PIE to further improve latency for various 666 flows with different priorities. If the timestamp is used to obtain 667 queueing latency, PIE can be adopted directly to each individual queue. 668 If the latency is obtained via the deque rate calculation, we recommend 669 one PIE instance using the overall queue length divided by the overall 670 deque rate. Then the overall PIE->drop_prob_ is modified using each 671 individual queue divided by the maximum individual queue length: PIE- 672 >drop_prob_(i)=queue_.byte_length(i)/max(queue_.byte_length(i)). 674 In summary, PIE is simple enough to be implemented in both software and 675 hardware. 677 7. Future Research 679 The design of the PIE algorithm is presented in this document. It 680 effectively controls the average queueing latency to a target value. The 681 following areas can be further studied: 683 * Autotuning of target delay without losing utilization; 685 * Autotuning for average RTT of traffic; 687 8. Incremental Deployment 689 PIE scheme can be independently deployed and managed without any 690 need for interoperability. 692 Although all network nodes cannot be changed altogether to adopt 693 latency-based AQM schemes, a gradual adoption would eventually lead 694 to end-to-end low latency service for all applications. 696 9. IANA Considerations 698 There are no actions for IANA. 700 10. References 702 10.1 Normative References 704 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 705 Requirement Levels", BCP 14, RFC 2119, March 1997. 707 10.2 Informative References 709 [RFC970] Nagle, J., "On Packet Switches With Infinite 710 Storage",RFC970, December 1985. 712 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 713 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 714 Patridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., 715 Wroclawski, J. and Zhang, L., "Recommendations on Queue 716 Management and Congestion Avoidance in the Internet", 717 April, 1998. 719 10.3 Other References 721 [CBQ] Cisco White Paper, 722 "http://www.cisco.com/en/US/docs/12_0t/12_0tfeature/guide/cbwfq.html". 724 [CoDel] Nichols, K., Jacobson, V., "Controlling Queue Delay", 725 ACM Queue. ACM Publishing. doi:10.1145/2209249.22W.09264. 727 [DOCSIS_3.1] http://www.cablelabs.com/wp-content/uploads/specdocs 728 /CM-SP-MULPIv3.1-I01-131029.pdf. 730 [DOCSIS-PIE] White, G. and Pan, R., "A PIE-Based AQM for DOCSIS 731 Cable Modems", IETF draft-white-aqm-docsis-pie-02. 733 [FQ-Implement] Baker, F. and Pan, R. "On Queueing, Marking and 734 Dropping", IETF draft-ietf-aqm-fq-implementation. 736 [HPSR-PIE] Pan, R., Natarajan, P. Piglione, C., Prabhu, M.S., 737 Subramanian, V., Baker, F. Steeg and B. V., "PIE: A Lightweight 738 Control Scheme to Address the Bufferbloat Problem", IEEE HPSR 2013. 740 https://www.researchgate.net/publication/261134127_PIE_A_lightweight 741 _control_scheme_to_address_the_bufferbloat_problem?origin=mail 743 [IETF-AQM] Baker, F. and Fairhurst, G., "IETF Recommendations 744 Regarding Active Queue Management", draft-ietf-aqm-recommendation-11. 746 [IETF-ECN] Briscoe, B. Kaippallimalil, J and Phaler, P., 747 "Guidelines for Adding Congestion Notification to Protocols that 748 Encapsulate IP", draft-ietf-tsvwg-ecn-encap-guidelines. [AQM DOCSIS] 749 http://www.cablelabs.com/wp-content/uploads/2014/06/DOCSIS- 750 AQM_May2014.pdf 752 [PI] Hollot, C.V., Misra, V., Towsley, D. and Gong, W., "On 753 Designing Improved Controller for AQM Routers Supporting TCP Flows", 754 Infocom 2001. 756 [QCN] "Data Center Bridging - Congestion Notification", 757 http://www.ieee802.org/1/pages/802.1au.html. 759 [TCP-Models] Misra, V., Gong, W., and Towsley, D., "Fluid-base 760 Analysis of a Network of AQM Routers Supporting TCP Flows with an 761 Application to RED", SIGCOMM 2000 763 Authors' Addresses 765 Rong Pan 766 Cisco Systems 767 3625 Cisco Way, 768 San Jose, CA 95134, USA 769 Email: ropan@cisco.com 771 Preethi Natarajan, 772 Cisco Systems 773 725 Alder Drive, 774 Milpitas, CA 95035, USA 775 Email: prenatar@cisco.com 777 Fred Baker 778 Cisco Systems 779 725 Alder Drive, 780 Milpitas, CA 95035, USA 781 Email: fred@cisco.com 783 Bill Ver Steeg 784 Cisco Systems 785 5030 Sugarloaf Parkway 786 Lawrenceville, GA, 30044, USA 787 Email: versteb@cisco.com 789 Mythili Prabhu* 790 Akamai Technologies 791 3355 Scott Blvd 792 Santa Clara, CA - 95054 793 Email: mythili@akamai.com 795 Chiara Piglione* 796 Broadcom Corporation 797 3151 Zanker Road 798 San Jose, CA 95134 799 Email: chiara@broadcom.com 801 Vijay Subramanian* 802 PLUMgrid, Inc. 803 350 Oakmead Parkway, 804 Suite 250 805 Sunnyvale, CA 94085 806 Email: vns@plumgrid.com 808 Greg White 809 CableLabs 810 858 Coal Creek Circle 811 Louisville, CO 80027, USA 812 Email: g.white@cablelabs.com 814 * Formerly at Cisco Systems 816 11. The Basic PIE pseudo Code 818 Configurable Parameters: 819 - QDELAY_REF. AQM Latency Target (default: 15ms) 820 - MAX_BURST. AQM Max Burst Allowance (default: 150ms) 822 Internal Parameters: 823 - Weights in the drop probability calculation (1/s): 824 alpha (default: 1/8), beta(default: 1 + 1/4) 825 - T_UPDATE: a period to calculate drop probability (default:15ms) 827 Table which stores status variables (ending with "_"): 828 - burst_allowance_: current burst allowance 829 - drop_prob_: The current packet drop probability. reset to 0 830 - qdelay_old_: The previous queue delay. reset to 0 832 Public/system functions: 833 - queue_. Holds the pending packets. 834 - drop(packet). Drops/discards a packet 835 - now(). Returns the current time 836 - random(). Returns a uniform r.v. in the range 0 ~ 1 837 - queue_.byte_length(). Returns current queue_ length in bytes 838 - queue_.enque(packet). Adds packet to tail of queue_ 839 - queue_.deque(). Returns the packet from the head of queue_ 840 - packet.size(). Returns size of packet 841 - packet.timestamp_delay(). Returns timestamped packet latency 843 ============================ 845 //called on each packet arrival 846 enque(Packet packet) { 847 if (PIE->drop_prob_ == 0 && current_qdelay < QDELAY_REF 848 && PIE->qdelay_old_ < QDELAY_REF) { 849 PIE->burst_allowance_ = MAX_BURST; 850 } 851 if (PIE->burst_allowance_ == 0 && drop_early() == DROP) { 852 drop(packet); 853 } else { 854 queue_.enque(packet); 855 } 856 } 858 =========================== 860 drop_early() { 862 //Safeguard PIE to be work conserving 863 if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 0.2) 864 || (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) { 865 return ENQUE; 866 } 868 double u = random(); 869 if (u < PIE->drop_prob_) { 870 return DROP; 871 } else { 872 return ENQUE; 873 } 874 } 876 =========================== 877 //we choose the timestamp option of obtaining latency for clarity 878 //rate estimation method can be found in the extended PIE pseudo code 880 deque(Packet packet) { 882 current_qdelay = packet.timestamp_delay(); 884 } 886 ============================ 887 //update periodically, T_UPDATE = 15ms 889 calculate_drop_prob() { 891 //can be implemented using integer multiply, 893 p = alpha*(current_qdelay - QDELAY_REF) + \ 894 beta*(current_qdelay-PIE->qdelay_old_); 896 if (PIE->drop_prob_ < 0.000001) { 897 p /= 2048; 898 } else if (PIE->drop_prob_ < 0.00001) { 899 p /= 512; 900 } else if (PIE->drop_prob_ < 0.0001) { 901 p /= 128; 902 } else if (PIE->drop_prob_ < 0.001) { 903 p /= 32; 904 } else if (PIE->drop_prob_ < 0.01) { 905 p /= 8; 906 } else if (PIE->drop_prob_ < 0.1) { 907 p /= 2; 908 } else { 909 p = p; 910 } 912 PIE->drop_prob_ += p; 914 //Exponentially decay drop prob when congestion goes away 915 if (current_qdelay == 0 && PIE->qdelay_old_ == 0) { 916 PIE->drop_prob_ *= 0.98; //1- 1/64 is sufficient 917 } 919 //bound drop probability 920 if (PIE->drop_prob_ < 0) 921 PIE->drop_prob_ = 0.0 922 if (PIE->drop_prob_ > 1) 923 PIE->drop_prob_ = 1.0 925 PIE->qdelay_old_ = current_qdelay; 927 PIE->burst_allowance_ = max(0,PIE->burst_allowance_ - T_UPDATE); 929 } 930 } 932 12. Pseudo code for PIE with optional enhancement 934 Configurable Parameters: 935 - QDELAY_REF. AQM Latency Target (default: 15ms) 936 - MAX_BURST. AQM Max Burst Allowance (default: 150ms) 937 - MAX_ECNTH. AQM Max ECN Marking Threshold (default: 10%) 939 Internal Parameters: 940 - Weights in the drop probability calculation (1/s): 941 alpha (default: 1/8), beta(default: 1+1/4) 942 - DQ_THRESHOLD: (in bytes, default: 2^14 (in a power of 2) ) 943 - T_UPDATE: a period to calculate drop probability (default:15ms) 944 - TAIL_DROP: each queue has a tail drop threshold, pass it to PIE 946 Table which stores status variables (ending with "_"): 947 - active_: INACTIVE/ACTIVE 948 - burst_allowance_: current burst allowance 949 - drop_prob_: The current packet drop probability. reset to 0 950 - accu_prob_: Accumulated drop probability. reset to 0 951 - qdelay_old_: The previous queue delay estimate. reset to 0 952 - last_timestamp_: Timestamp of previous status update 953 - dq_count_, measurement_start_, in_measurement_, 954 avg_dq_time_. variables for measuring average dequeue rate. 956 Public/system functions: 957 - queue_. Holds the pending packets. 958 - drop(packet). Drops/discards a packet 959 - mark(packet). Marks ECN for a packet 960 - now(). Returns the current time 961 - random(). Returns a uniform r.v. in the range 0 ~ 1 962 - queue_.byte_length(). Returns current queue_ length in bytes 963 - queue_.enque(packet). Adds packet to tail of queue_ 964 - queue_.deque(). Returns the packet from the head of queue_ 965 - packet.size(). Returns size of packet 966 - packet.ecn(). Returns whether packet is ECN capable or not 968 ============================ 969 //called on each packet arrival 970 enque(Packet packet) { 971 if (queue_.byte_length()+packet.size() > TAIL_DROP) { 972 drop(packet); 973 PIE->accu_prob_ = 0; 974 } else if (PIE->active_ == TRUE && drop_early() == DROP 975 && PIE->burst_allowance_ == 0) { 976 if (PIE->drop_prob_ < MAX_ECNTH && packet.ecn() == TRUE) 977 mark(packet); 978 else 979 drop(packet); 980 PIE->accu_prob_ = 0; 981 } else { 982 queue_.enque(packet); 983 } 985 //If the queue is over a certain threshold, turn on PIE 986 if (PIE->active_ == INACTIVE 987 && queue_.byte_length() >= TAIL_DROP/3) { 988 PIE->active_ = ACTIVE; 989 PIE->qdelay_old_ = 0; 990 PIE->drop_prob_ = 0; 991 PIE->in_measurement_ = TRUE; 992 PIE->dq_count_ = 0; 993 PIE->avg_dq_time_ = 0; 994 PIE->last_timestamp_ = now; 995 PIE->burst_allowance_ = MAX_BURST; 996 PIE->accu_prob_ = 0; 997 PIE->measurement_start_ = now; 998 } 1000 //If the queue has been idle for a while, turn off PIE 1001 //reset counters when accessing the queue after some idle 1002 //period if PIE was active before 1003 if ( PIE->drop_prob_ == 0 && PIE->qdelay_old_ == 0 1004 && queue_.byte_length() == 0) { 1005 PIE->active_ = INACTIVE; 1006 PIE->in_measurement_ = FALSE; 1007 } 1008 } 1010 =========================== 1011 drop_early() { 1013 //PIE is active but the queue is not congested, return ENQUE 1014 if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 0.2) 1015 || (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) { 1016 return ENQUE; 1017 } 1019 if (PIE->drop_prob_ == 0) { 1020 PIE->accu_prob_ = 0; 1021 } 1023 //For practical reasons, drop probability can be further scaled 1024 //according to packet size. but need to set a bound to 1025 //avoid unnecessary bias 1027 //Random drop 1028 PIE->accu_prob_ += PIE->drop_prob_; 1029 if (PIE->accu_prob_ < 0.85) 1030 return ENQUE; 1031 if (PIE->accu_prob_ >= 8.5) 1032 return DROP; 1033 double u = random(); 1034 if (u < PIE->drop_prob_) { 1035 PIE->accu_prob_ = 0; 1036 return DROP; 1037 } else { 1038 return ENQUE; 1039 } 1040 } 1042 ============================ 1043 //update periodically, T_UPDATE = 15ms 1044 calculate_drop_prob() { 1045 if ( (now - PIE->last_timestamp_) >= T_UPDATE && 1046 PIE->active_ == ACTIVE) { 1047 //can be implemented using integer multiply, 1048 //DQ_THRESHOLD is power of 2 value 1049 current_qdelay = queue_.byte_length() * PIE- 1050 >avg_dq_time_/DQ_THRESHOLD; 1051 p = alpha*(current_qdelay - QDELAY_REF) + \ 1052 beta*(current_qdelay-PIE->qdelay_old_); 1054 if (PIE->drop_prob_ < 0.000001) { 1055 p /= 2048; 1056 } else if (PIE->drop_prob_ < 0.00001) { 1057 p /= 512; 1058 } else if (PIE->drop_prob_ < 0.0001) { 1059 p /= 128; 1060 } else if (PIE->drop_prob_ < 0.001) { 1061 p /= 32; 1062 } else if (PIE->drop_prob_ < 0.01) { 1063 p /= 8; 1064 } else if (PIE->drop_prob_ < 0.1) { 1065 p /= 2; 1066 } else { 1067 p = p; 1068 } 1070 if (PIE->drop_prob_ >= 0.1 && p > 0.02) { 1071 p = 0.02; 1072 } 1073 PIE->drop_prob_ += p; 1075 //Exponentially decay drop prob when congestion goes away 1076 if (current_qdelay == 0 && PIE->qdelay_old_ == 0) { 1077 PIE->drop_prob_ *= 0.98; //1- 1/64 is sufficient 1078 } 1080 //bound drop probability 1081 if (PIE->drop_prob_ < 0) 1082 PIE->drop_prob_ = 0 1083 if (PIE->drop_prob_ > 1) 1084 PIE->drop_prob_ = 1 1086 PIE->qdelay_old_ = current_qdelay; 1087 PIE->last_timestamp_ = now; 1088 PIE->burst_allowance_ = max(0,PIE->burst_allowance_ - T_UPDATE); 1089 } 1090 } 1092 ========================== 1093 //called on each packet departure 1094 deque(Packet packet) { 1096 //deque rate estimation 1097 if (PIE->in_measurement_ == TRUE) { 1098 PIE->dq_count_ = packet.size() + PIE->dq_count_; 1099 //start a new measurement cycle if we have enough packets 1100 if ( PIE->dq_count_ >= DQ_THRESHOLD) { 1101 dq_time = now - PIE->measurement_start_; 1102 if(PIE->avg_dq_time_ == 0) { 1103 PIE->avg_dq_time_ = dq_time; 1104 } else { 1105 weight = DQ_THRESHOLD/2^16 1106 PIE->avg_dq_time_ = dq_time*weight + PIE->avg_dq_time_*(1- 1107 weight); 1108 } 1109 PIE->in_measurement_ = FALSE; 1110 } 1111 } 1113 //start a measurement if we have enough data in the queue: 1114 if (queue_.byte_length() >= DQ_THRESHOLD && 1115 PIE->in_measurement_ == FALSE) { 1116 PIE->in_measurement_ = TRUE; 1117 PIE->measurement_start_ = now; 1118 PIE->dq_count_ = 0; 1119 } 1120 }