idnits 2.17.1 draft-ietf-tsvwg-aqm-dualq-coupled-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (July 3, 2017) is 2488 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '0' on line 1095 -- Looks like a reference, but probably isn't: '1' on line 1095 == Outdated reference: A later version (-07) exists of draft-ietf-tcpm-cubic-04 == Outdated reference: A later version (-10) exists of draft-ietf-tcpm-dctcp-08 == Outdated reference: A later version (-08) exists of draft-ietf-tsvwg-ecn-experimentation-00 == Outdated reference: A later version (-29) exists of draft-ietf-tsvwg-ecn-l4s-id-00 == Outdated reference: A later version (-20) exists of draft-ietf-tsvwg-l4s-arch-00 -- Obsolete informational reference (is this intentional?): RFC 2309 (Obsoleted by RFC 7567) Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Transport Area working group (tsvwg) K. De Schepper 3 Internet-Draft Nokia Bell Labs 4 Intended status: Experimental B. Briscoe, Ed. 5 Expires: January 4, 2018 O. Bondarenko 6 Simula Research Lab 7 I. Tsang 8 Nokia Bell Labs 9 July 3, 2017 11 DualQ Coupled AQM for Low Latency, Low Loss and Scalable Throughput 12 draft-ietf-tsvwg-aqm-dualq-coupled-01 14 Abstract 16 Data Centre TCP (DCTCP) was designed to provide predictably low 17 queuing latency, near-zero loss, and throughput scalability using 18 explicit congestion notification (ECN) and an extremely simple 19 marking behaviour on switches. However, DCTCP does not co-exist with 20 existing TCP traffic---DCTCP is so aggressive that existing TCP 21 algorithms approach starvation. So, until now, DCTCP could only be 22 deployed where a clean-slate environment could be arranged, such as 23 in private data centres. This specification defines `DualQ Coupled 24 Active Queue Management (AQM)' to allow scalable congestion controls 25 like DCTCP to safely co-exist with classic Internet traffic. The 26 Coupled AQM ensures that a flow runs at about the same rate whether 27 it uses DCTCP or TCP Reno/Cubic, but without inspecting transport 28 layer flow identifiers. When tested in a residential broadband 29 setting, DCTCP achieved sub-millisecond average queuing delay and 30 zero congestion loss under a wide range of mixes of DCTCP and 31 `Classic' broadband Internet traffic, without compromising the 32 performance of the Classic traffic. The solution also reduces 33 network complexity and eliminates network configuration. 35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at http://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on January 4, 2018. 51 Copyright Notice 53 Copyright (c) 2017 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (http://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. Code Components extracted from this document must 62 include Simplified BSD License text as described in Section 4.e of 63 the Trust Legal Provisions and are provided without warranty as 64 described in the Simplified BSD License. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 69 1.1. Problem and Scope . . . . . . . . . . . . . . . . . . . . 3 70 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 71 1.3. Features . . . . . . . . . . . . . . . . . . . . . . . . 5 72 2. DualQ Coupled AQM Algorithm . . . . . . . . . . . . . . . . . 7 73 2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 7 74 2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 8 75 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 8 76 2.4. Normative Requirements . . . . . . . . . . . . . . . . . 8 77 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 78 4. Security Considerations . . . . . . . . . . . . . . . . . . . 10 79 4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 10 80 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11 81 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 82 6.1. Normative References . . . . . . . . . . . . . . . . . . 11 83 6.2. Informative References . . . . . . . . . . . . . . . . . 11 84 Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 14 85 A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 15 86 A.2. Pass #2: Overload Details . . . . . . . . . . . . . . . . 18 87 Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 21 88 Appendix C. Guidance on Controlling Throughput Equivalence . . . 26 89 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 28 91 1. Introduction 92 1.1. Problem and Scope 94 Latency is becoming the critical performance factor for many (most?) 95 applications on the public Internet, e.g. interactive Web, Web 96 services, voice, conversational video, interactive video, interactive 97 remote presence, instant messaging, online gaming, remote desktop, 98 cloud-based applications, and video-assisted remote control of 99 machinery and industrial processes. In the developed world, further 100 increases in access network bit-rate offer diminishing returns, 101 whereas latency is still a multi-faceted problem. In the last decade 102 or so, much has been done to reduce propagation time by placing 103 caches or servers closer to users. However, queuing remains a major 104 component of latency. 106 The Diffserv architecture provides Expedited Forwarding [RFC3246], so 107 that low latency traffic can jump the queue of other traffic. 108 However, on access links dedicated to individual sites (homes, small 109 enterprises or mobile devices), often all traffic at any one time 110 will be latency-sensitive. Then Diffserv is of little use. Instead, 111 we need to remove the causes of any unnecessary delay. 113 The bufferbloat project has shown that excessively-large buffering 114 (`bufferbloat') has been introducing significantly more delay than 115 the underlying propagation time. These delays appear only 116 intermittently--only when a capacity-seeking (e.g. TCP) flow is long 117 enough for the queue to fill the buffer, making every packet in other 118 flows sharing the buffer sit through the queue. 120 Active queue management (AQM) was originally developed to solve this 121 problem (and others). Unlike Diffserv, which gives low latency to 122 some traffic at the expense of others, AQM controls latency for _all_ 123 traffic in a class. In general, AQMs introduce an increasing level 124 of discard from the buffer the longer the queue persists above a 125 shallow threshold. This gives sufficient signals to capacity-seeking 126 (aka. greedy) flows to keep the buffer empty for its intended 127 purpose: absorbing bursts. However, RED [RFC2309] and other 128 algorithms from the 1990s were sensitive to their configuration and 129 hard to set correctly. So, AQM was not widely deployed. 131 More recent state-of-the-art AQMs, e.g. 132 fq_CoDel [I-D.ietf-aqm-fq-codel], PIE [RFC8033], Adaptive 133 RED [ARED01], are easier to configure, because they define the 134 queuing threshold in time not bytes, so it is invariant for different 135 link rates. However, no matter how good the AQM, the sawtoothing 136 rate of TCP will either cause queuing delay to vary or cause the link 137 to be under-utilized. Even with a perfectly tuned AQM, the 138 additional queuing delay will be of the same order as the underlying 139 speed-of-light delay across the network. Flow-queuing can isolate 140 one flow from another, but it cannot isolate a TCP flow from the 141 delay variations it inflicts on itself, and it has other problems - 142 it overrides the flow rate decisions of variable rate video 143 applications, it does not recognise the flows within IPSec VPN 144 tunnels and it is relatively expensive to implement. 146 It seems that further changes to the network alone will now yield 147 diminishing returns. Data Centre TCP (DCTCP [I-D.ietf-tcpm-dctcp]) 148 teaches us that a small but radical change to TCP is needed to cut 149 two major outstanding causes of queuing delay variability: 151 1. the `sawtooth' varying rate of TCP itself; 153 2. the smoothing delay deliberately introduced into AQMs to permit 154 bursts without triggering losses. 156 The former causes a flow's round trip time (RTT) to vary from about 1 157 to 2 times the base RTT between the machines in question. The latter 158 delays the system's response to change by a worst-case 159 (transcontinental) RTT, which could be hundreds of times the actual 160 RTT of typical traffic from localized CDNs. 162 Latency is not our only concern: 164 3. It was known when TCP was first developed that it would not scale 165 to high bandwidth-delay products. 167 Given regular broadband bit-rates over WAN distances are 168 already [RFC3649] beyond the scaling range of `classic' TCP Reno, 169 `less unscalable' Cubic [I-D.ietf-tcpm-cubic] and 170 Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been 171 successfully deployed. However, these are now approaching their 172 scaling limits. Unfortunately, fully scalable TCPs such as DCTCP 173 cause `classic' TCP to starve itself, which is why they have been 174 confined to private data centres or research testbeds (until now). 176 This document specifies a `DualQ Coupled AQM' extension that solves 177 the problem of coexistence between scalable and classic flows, 178 without having to inspect flow identifiers. The AQM is not like 179 flow-queuing approaches [I-D.ietf-aqm-fq-codel] that classify packets 180 by flow identifier into numerous separate queues in order to isolate 181 sparse flows from the higher latency in the queues assigned to 182 heavier flow. In contrast, the AQM exploits the behaviour of 183 scalable congestion controls like DCTCP so that every packet in every 184 flow sharing the queue for DCTCP-like traffic can be served with very 185 low latency. 187 This AQM extension can be combined with any single queue AQM that 188 generates a statistical or deterministic mark/drop probability driven 189 by the queue dynamics. In many cases it simplifies the basic control 190 algorithm, and requires little extra processing. Therefore it is 191 believed the Coupled AQM would be applicable and easy to deploy in 192 all types of buffers; buffers in cost-reduced mass-market residential 193 equipment; buffers in end-system stacks; buffers in carrier-scale 194 equipment including remote access servers, routers, firewalls and 195 Ethernet switches; buffers in network interface cards, buffers in 196 virtualized network appliances, hypervisors, and so on. 198 The overall L4S architecture is described in 199 [I-D.ietf-tsvwg-l4s-arch]. The supporting papers [PI216] and 200 [DCttH15] give the full rationale for the AQM's design, both 201 discursively and in more precise mathematical form. 203 1.2. Terminology 205 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 206 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 207 document are to be interpreted as described in [RFC2119]. In this 208 document, these words will appear with that interpretation only when 209 in ALL CAPS. Lower case uses of these words are not to be 210 interpreted as carrying RFC-2119 significance. 212 The DualQ Coupled AQM uses two queues for two services. Each of the 213 following terms identifies both the service and the queue that 214 provides the service: 216 Classic (denoted by subscript C): The `Classic' service is intended 217 for all the behaviours that currently co-exist with TCP Reno (TCP 218 Cubic, Compound, SCTP, etc). 220 Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L): 221 The `L4S' service is intended for a set of congestion controls 222 with scalable properties such as DCTCP (e.g. 223 Relentless [Mathis09]). 225 Either service can cope with a proportion of unresponsive or less- 226 responsive traffic as well (e.g. DNS, VoIP, etc), just as a single 227 queue AQM can. The DualQ Coupled AQM behaviour is similar to a 228 single FIFO queue with respect to unresponsive and overload traffic. 230 1.3. Features 232 The AQM couples marking and/or dropping across the two queues such 233 that a flow will get roughly the same throughput whichever it uses. 234 Therefore both queues can feed into the full capacity of a link and 235 no rates need to be configured for the queues. The L4S queue enables 236 scalable congestion controls like DCTCP to give stunningly low and 237 predictably low latency, without compromising the performance of 238 competing 'Classic' Internet traffic. Thousands of tests have been 239 conducted in a typical fixed residential broadband setting. Typical 240 experiments used base round trip delays up to 100ms between the data 241 centre and home network, and large amounts of background traffic in 242 both queues. For every L4S packet, the AQM kept the average queuing 243 delay below 1ms (or 2 packets if serialization delay is bigger for 244 slow links), and no losses at all were introduced by the AQM. 245 Details of the extensive experiments will be made available [PI216] 246 [DCttH15]. 248 Subjective testing was also conducted using a demanding panoramic 249 interactive video application run over a stack with DCTCP enabled and 250 deployed on the testbed. Each user could pan or zoom their own high 251 definition (HD) sub-window of a larger video scene from a football 252 match. Even though the user was also downloading large amounts of 253 L4S and Classic data, latency was so low that the picture appeared to 254 stick to their finger on the touchpad (all the L4S data achieved the 255 same ultra-low latency). With an alternative AQM, the video 256 noticeably lagged behind the finger gestures. 258 Unlike Diffserv Expedited Forwarding, the L4S queue does not have to 259 be limited to a small proportion of the link capacity in order to 260 achieve low delay. The L4S queue can be filled with a heavy load of 261 capacity-seeking flows like DCTCP and still achieve low delay. The 262 L4S queue does not rely on the presence of other traffic in the 263 Classic queue that can be 'overtaken'. It gives low latency to L4S 264 traffic whether or not there is Classic traffic, and the latency of 265 Classic traffic does not suffer when a proportion of the traffic is 266 L4S. The two queues are only necessary because DCTCP-like flows 267 cannot keep latency predictably low and keep utilization high if they 268 are mixed with legacy TCP flows, 270 The experiments used the Linux implementation of DCTCP that is 271 deployed in private data centres, without any modification despite 272 its known deficiencies. Nonetheless, certain modifications will be 273 necessary before DCTCP is safe to use on the Internet, which are 274 recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id]. However, the 275 focus of this specification is to get the network service in place. 276 Then, without any management intervention, applications can exploit 277 it by migrating to scalable controls like DCTCP, which can then 278 evolve _while_ their benefits are being enjoyed by everyone on the 279 Internet. 281 2. DualQ Coupled AQM Algorithm 283 There are two main aspects to the algorithm: 285 o the Coupled AQM that addresses throughput equivalence between 286 Classic (e.g. Reno, Cubic) flows and L4S (e.g. DCTCP) flows 288 o the Dual Queue structure that provides latency separation for L4S 289 flows to isolate them from the typically large Classic queue. 291 2.1. Coupled AQM 293 In the 1990s, the `TCP formula' was derived for the relationship 294 between TCP's congestion window, cwnd, and its drop probability, p. 295 To a first order approximation, cwnd of TCP Reno is inversely 296 proportional to the square root of p. TCP Cubic implements a Reno- 297 compatibility mode, which is the only relevant mode for typical RTTs 298 under 20ms, while the throughput of a single flow is less than about 299 500Mb/s. Therefore we can assume that Cubic traffic behaves similar 300 to Reno (but with a slightly different constant of proportionality), 301 and we shall use the term 'Classic' for the collection of Reno and 302 Cubic in Reno mode. 304 In our supporting paper [PI216], we derive the equivalent rate 305 equation for DCTCP, for which cwnd is inversely proportional to p 306 (not the square root), where in this case p is the ECN marking 307 probability. DCTCP is not the only congestion control that behaves 308 like this, so we use the term 'L4S' traffic for all similar 309 behaviour. 311 In order to make a DCTCP flow run at roughly the same rate as a Reno 312 TCP flow (all other factors being equal), we make the drop or marking 313 probability for Classic traffic, p_C distinct from the marking 314 probability for L4S traffic, p_L (in contrast to RFC3168 which 315 requires them to be the same). We make the Classic drop probability 316 p_C proportional to the square of the L4S marking probability p_L. 317 This is because we need to make the Reno flow rate equal the DCTCP 318 flow rate, so we have to square the square root of p_C in the Reno 319 rate equation to make it the same as the straight p_L in the DCTCP 320 rate equation. 322 There is a really simple way to implement the square of a probability 323 - by testing the queue against two random numbers not one. This is 324 the approach adopted in Appendix A and Appendix B. 326 Stating this as a formula, the relation between Classic drop 327 probability, p_C, and L4S marking probability, p_L needs to take the 328 form: 330 p_C = ( p_L / k )^2 (1) 332 where k is the constant of proportionality. 334 2.2. Dual Queue 336 Classic traffic builds a large queue, so a separate queue is provided 337 for L4S traffic, and it is scheduled with strict priority. 338 Nonetheless, coupled marking ensures that giving priority to L4S 339 traffic still leaves the right amount of spare scheduling time for 340 Classic flows to each get equivalent throughput to DCTCP flows (all 341 other factors such as RTT being equal). The algorithm achieves this 342 without having to inspect flow identifiers. 344 2.3. Traffic Classification 346 Both the Coupled AQM and DualQ mechanisms need an identifier to 347 distinguish L4S and C packets. A separate draft 348 [I-D.ietf-tsvwg-ecn-l4s-id] recommends using the ECT(1) codepoint of 349 the ECN field as this identifier, having assessed various 350 alternatives. 352 Given L4S work is currently on the experimental track, but the 353 definition of the ECN field is on the standards track [RFC3168], 354 another standards track document has proved necessary to make the 355 ECT(1) codepoint available for experimentation 356 [I-D.ietf-tsvwg-ecn-experimentation]. 358 2.4. Normative Requirements 360 In the Dual Queue, L4S packets MUST be given priority over Classic, 361 although strict priority MAY not be appropriate. 363 All L4S traffic MUST be ECN-capable, although some Classic traffic 364 MAY also be ECN-capable. 366 Whatever identifier is used for L4S traffic, it will still be 367 necessary to agree on the meaning of an ECN marking on L4S traffic, 368 relative to a drop of Classic traffic. In order to prevent 369 starvation of Classic traffic by scalable L4S traffic (e.g. DCTCP) 370 the drop probability of Classic traffic MUST be proportional to the 371 square of the marking probability of L4S traffic, In other words, the 372 power to which p_L is raised in Eqn. (1) MUST be 2. 374 The constant of proportionality, k, in Eqn (1) determines the 375 relative flow rates of Classic and L4S flows when the AQM concerned 376 is the bottleneck (all other factors being equal). k does not have to 377 be standardized because differences do not prevent interoperability. 379 However, k has to take some value, and each operator can make that 380 choice. 382 A value of k=2 is currently RECOMMENDED as the default for Internet 383 access networks. Assuming scalable congestion controls for the 384 Internet will be as aggressive as DCTCP, this will ensure their 385 congestion window will be roughly the same as that of a standards 386 track TCP congestion control (Reno) [RFC5681] and other so-called 387 TCP-friendly controls such as TCP Cubic in its TCP-friendly mode. 389 The requirements for scalable congestion controls on the Internet 390 (termed the TCP Prague requirements) [I-D.ietf-tsvwg-ecn-l4s-id] are 391 not necessarily final. If the aggressiveness of DCTCP is not defined 392 as the benchmark for scalable controls on the Internet, the 393 recommended value of k will also be subject to change. 395 Whatever value is recommended, the choice of k is a matter of 396 operator policy, and operators MAY choose a different value using 397 Table 1 and the guidelines in Appendix C. 399 Typically, access network operators isolate customers from each other 400 with some form of layer-2 multiplexing (TDM in DOCSIS, CDMA in 3G) or 401 L3 scheduling (WRR in broadband), rather than relying on TCP to share 402 capacity between customers [RFC0970]. In such cases, the choice of k 403 will solely affect relative flow rates within each customer's access 404 capacity, not between customers. Also, k will not affect relative 405 flow rates at any times when all flows are Classic or all L4S, and it 406 will not affect small flows. 408 Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are 409 given in Appendix A and Appendix B. Either example AQM can be used 410 to couple packet marking and dropping across a dual Q. Curvy RED 411 requires less operations per packet than RED and can be used if the 412 range of RTTs is limited. DualPI2 is a simplification of PIE with 413 stable Proportional-Integral control for both Classic and L4S 414 congestion controls. Nonetheless, it would be possible to control 415 the queues with other alternative AQMs, as long as the above 416 normative requirements (those expressed in capitals) are observed, 417 which are intended to be independent of the specific AQM. 419 {ToDo: Add management and monitoring requirements} 421 3. IANA Considerations 423 This specification contains no IANA considerations. 425 4. Security Considerations 427 4.1. Overload Handling 429 Where the interests of users or flows might conflict, it could be 430 necessary to police traffic to isolate any harm to performance. This 431 is a policy issue that needs to be separable from a basic AQM, but an 432 AQM does need to handle overload. A trade-off needs to be made 433 between complexity and the risk of either class harming the other. 434 It is an operator policy to define what must happen if the service 435 time of the classic queue becomes too great. In the following 436 subsections three optional non-exclusive overload protections are 437 defined. Their objective is for the overload behaviour of the DualQ 438 AQM to be similar to a single queue AQM. The example implementation 439 in Appendix A implements the 'delay on overload' policy. Other 440 overload protections can be envisaged: 442 Minimum throughput service: By replacing the priority scheduler 443 with a weighted round robin scheduler, a minimum throughput 444 service can be guaranteed for Classic traffic. Typically the 445 scheduling weight of the Classic queue will be small (e.g. 5%) to 446 avoid interference with the coupling but big enough to avoid 447 complete starvation of Classic traffic. 449 Delay on overload: To control milder overload of responsive traffic, 450 particularly when close to the maximum congestion signal, delay 451 can be used as an alternative congestion control mechanism. The 452 Dual Queue Coupled AQM can be made to behave like a single First- 453 In First-Out (FIFO) queue with different service times by 454 replacing the priority scheduler with a very simple scheduler that 455 could be called a "time-shifted FIFO", which is the same as the 456 Modifier Earliest Deadline First (MEDF) scheduler of [MEDF]. The 457 scheduler adds T_m to the queue delay of the next L4S packet, 458 before comparing it with the queue delay of the next Classic 459 packet, then it selects the packet with the greater adjusted queue 460 delay. Under regular conditions, this time-shifted FIFO scheduler 461 behaves just like a strict priority scheduler. But under moderate 462 or high overload it prevents starvation of the Classic queue, 463 because the time-shift defines the maximum extra queuing delay 464 (T_m) of Classic packets relative to L4S. 466 Drop on overload: On severe overload, e.g. due to non responsive 467 traffic, queues will typically overflow and packet drop will be 468 unavoidable. It is important to avoid unresponsive ECN traffic 469 (either Classic or L4S) driving the AQM to 100% drop and mark 470 probability. Congestion controls that have a minimum congestion 471 window will become unresponsive to ECN marking when the marking 472 probability is high. This situation can be avoided by applying 473 the drop probability to all packets of all traffic types when it 474 exceeds a certain threshold or by limiting the drop and marking 475 probabilities to a lower maximum value (up to where fairnes 476 between the different traffic types is still guaranteed) and rely 477 on delay to control temporary high congestion and eventually queue 478 overflow. If the classic drop probability is applied to all types 479 of traffic when it is higher than a threshold probability the 480 queueing delay can be controlled up to any overload situation, and 481 no further measures are required. If a maximum classic and 482 coupled L4S probability of less than 100% is used, both queues 483 need scheduling opportunities and should eventually experience 484 drop. This can be achieved with a scheduler that guarantees a 485 minimum throughput for each queue, such as a weighted round robin 486 or time-shifted FIFO scheduler. In that case a common queue limit 487 can be configured that will drop packets of both types of traffic. 489 To keep the throughput of both L4S and Classic flows equal over the 490 full load range, a different control strategy needs to be defined 491 above the point where one congestion control first saturates to a 492 probability of 100% (if k>1, L4S will saturate first). Possible 493 strategies include: also dropping L4S; increasing the queueing delay 494 for both; or ensuring that L4S traffic still responds to marking 495 below a window of 2 segments (see [I-D.ietf-tsvwg-ecn-l4s-id]). 497 5. Acknowledgements 499 Thanks to Anil Agarwal for detailed review comments and suggestions 500 on how to make our explanation clearer. 502 The authors' contributions are part-funded by the European Community 503 under its Seventh Framework Programme through the Reducing Internet 504 Transport Latency (RITE) project (ICT-317700). The views expressed 505 here are solely those of the authors. 507 6. References 509 6.1. Normative References 511 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 512 Requirement Levels", BCP 14, RFC 2119, 513 DOI 10.17487/RFC2119, March 1997, 514 . 516 6.2. Informative References 518 [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An 519 Algorithm for Increasing the Robustness of RED's Active 520 Queue Management", ACIRI Technical Report , August 2001, 521 . 523 [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", 524 ACM Queue 10(5), May 2012, 525 . 527 [CRED_Insights] 528 Briscoe, B., "Insights from Curvy RED (Random Early 529 Detection)", BT Technical Report TR-TUB8-2015-003, July 530 2015, 531 . 533 [DCttH15] De Schepper, K., Bondarenko, O., Briscoe, B., and I. 534 Tsang, "`Data Centre to the Home': Ultra-Low Latency for 535 All", 2015, . 538 (Under submission) 540 [I-D.ietf-aqm-fq-codel] 541 Hoeiland-Joergensen, T., McKenney, P., 542 dave.taht@gmail.com, d., Gettys, J., and E. Dumazet, "The 543 FlowQueue-CoDel Packet Scheduler and Active Queue 544 Management Algorithm", draft-ietf-aqm-fq-codel-06 (work in 545 progress), March 2016. 547 [I-D.ietf-tcpm-cubic] 548 Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and 549 R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", 550 draft-ietf-tcpm-cubic-04 (work in progress), February 551 2017. 553 [I-D.ietf-tcpm-dctcp] 554 Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 555 and G. Judd, "Datacenter TCP (DCTCP): TCP Congestion 556 Control for Datacenters", draft-ietf-tcpm-dctcp-08 (work 557 in progress), June 2017. 559 [I-D.ietf-tsvwg-ecn-experimentation] 560 Black, D., "Explicit Congestion Notification (ECN) 561 Experimentation", draft-ietf-tsvwg-ecn-experimentation-00 562 (work in progress), November 2016. 564 [I-D.ietf-tsvwg-ecn-l4s-id] 565 Schepper, K., Briscoe, B., and I. Tsang, "Identifying 566 Modified Explicit Congestion Notification (ECN) Semantics 567 for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s- 568 id-00 (work in progress), November 2016. 570 [I-D.ietf-tsvwg-l4s-arch] 571 Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, 572 Low Loss, Scalable Throughput (L4S) Internet Service: 573 Architecture", draft-ietf-tsvwg-l4s-arch-00 (work in 574 progress), November 2016. 576 [I-D.sridharan-tcpm-ctcp] 577 Sridharan, M., Tan, K., Bansal, D., and D. Thaler, 578 "Compound TCP: A New TCP Congestion Control for High-Speed 579 and Long Distance Networks", draft-sridharan-tcpm-ctcp-02 580 (work in progress), November 2008. 582 [Mathis09] 583 Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , 584 May 2009, . 587 [MEDF] Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a 588 simple scheduling algorithm for two real-time transport 589 service classes with application in the UTRAN", Proc. IEEE 590 Conference on Computer Communications (INFOCOM'03) Vol.2 591 pp.1116-1122, March 2003. 593 [PI216] De Schepper, K., Bondarenko, O., Briscoe, B., and I. 594 Tsang, "PI2: A Linearized AQM for both Classic and 595 Scalable TCP", ACM CoNEXT'16 , December 2016, 596 . 599 (To appear) 601 [RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", 602 RFC 970, DOI 10.17487/RFC0970, December 1985, 603 . 605 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 606 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 607 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 608 S., Wroclawski, J., and L. Zhang, "Recommendations on 609 Queue Management and Congestion Avoidance in the 610 Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, 611 . 613 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 614 of Explicit Congestion Notification (ECN) to IP", 615 RFC 3168, DOI 10.17487/RFC3168, September 2001, 616 . 618 [RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, 619 J., Courtney, W., Davari, S., Firoiu, V., and D. 620 Stiliadis, "An Expedited Forwarding PHB (Per-Hop 621 Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, 622 . 624 [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", 625 RFC 3649, DOI 10.17487/RFC3649, December 2003, 626 . 628 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 629 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 630 . 632 [RFC7567] Baker, F., Ed. and G. Fairhurst, Ed., "IETF 633 Recommendations Regarding Active Queue Management", 634 BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015, 635 . 637 [RFC8033] Pan, R., Natarajan, P., Baker, F., and G. White, 638 "Proportional Integral Controller Enhanced (PIE): A 639 Lightweight Control Scheme to Address the Bufferbloat 640 Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017, 641 . 643 Appendix A. Example DualQ Coupled PI2 Algorithm 645 As a first concrete example, the pseudocode below gives the DualPI2 646 algorithm, which is a DualQ Coupled AQM algorithm based on the PI2 647 algorithm [PI216] for the Classic AQM. Pi2 is an improved variant of 648 the PIE AQM [RFC8033]. 650 We will introduce the pseudocode in two passes. The first pass 651 explains the core concepts, deferring handling of overload to the 652 second pass. The first pass also uses regular arithmetic, whereas 653 some integer arithmetic more suitable for kernel operations appears 654 in the second pass. To aid comparison, line numbers are kept in step 655 between the two passes by using letter suffixes where the longer code 656 needs extra lines. 658 A full open source implementation for Linux is available at: 659 https://github.com/olgabo/dualpi2. 661 A.1. Pass #1: Core Concepts 663 The pseudocode manipulates three main structures of variables: the 664 packet (pkt), the L4S queue (lq) and the Classic queue (cq). The 665 pseudocode consists of the following four functions: 667 o initialization code (Figure 1) that sets parameter defaults (the 668 API for setting non-default values is omitted for brevity) 670 o enqueue code (Figure 2) 672 o dequeue code (Figure 3) 674 o code to regularly update the base probability (p) used in the 675 dequeue code (Figure 4). 677 The base probability (p) is an internal variable from which the 678 marking and dropping probabilities for L4S and Classic traffic (p_L 679 and p_C) are derived. Specifically, p_L = p * k and p_C = p^2, from 680 Equation (1) (see lines 4a and 4b in Figure 4). 682 In our experiments so far (building on experiments with PIE) on 683 broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs 684 from 5 ms to 100 ms, DualPI2 achieves good results with the default 685 parameters in Figure 1. 687 1: dualpi2_params_init(...) { % Set parameter defaults 688 2: target = 15 ms % PI AQM Classic queue delay target 689 3: tshift = 2 * target % Scheduler time bias 690 4: Tupdate = 16 ms % PI Classic queue sampling interval 691 5: alpha = 10 Hz^2 % PI integral gain 692 6: beta = 100 Hz^2 % PI proportional gain 693 7: alpha_U = alpha *Tupdate % PI integral gain per update interval 694 8: beta_U = beta * Tupdate % PI prop'nal gain per update interval 695 9: T_time = 1 ms % L4S marking threshold in time 696 10: T_len = 2 * MTU % Min L4S marking threshold in bytes 697 11: k = 2 % Coupling factor 698 12: limit = MAX_LINK_RATE * 250 ms % Dual buffer size 699 13: p_Cmax = 1/4 % Max Classic drop/mark prob 700 14: p_Lmax = min(k*sqrt(p_Cmax), 1) % Max L4S marking prob 701 15: } 703 Figure 1: Example Header Pseudocode for DualQ Coupled PI2 AQM 705 1: dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq 706 2: stamp(pkt) % attach arrival time to packet 707 3: if ( lq.len() + cq.len() > limit ) 708 4: drop(pkt) % drop packet if buffer is full 709 5: else { % Packet classifier 710 6: if ( ecn(pkt) modulo 2 == 1 ) % ECN bits = ECT(1) or CE 711 7: lq.enqueue(pkt) 712 8: else % ECN bits = not-ECT or ECT(0) 713 9: cq.enqueue(pkt) 714 10: } 715 11: } 717 Figure 2: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM 719 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues 720 2: while ( lq.len() + cq.len() > 0 ) 721 3: if ( lq.time() + tshift >= cq.time() ) { % time-shifted FIFO 722 4: lq.dequeue(pkt) 723 5: if ( ((pkt.time() > T_time) % step marking ... 724 6: AND (lq.len > T_len)) 725 7: OR (p_L > rand()) ) % ...or linear marking 726 8: mark(pkt) 727 9: } else { 728 10: cq.dequeue(pkt) 729 11: if ( p_C > rand() ) { % probability p^2 730 12: if ( ecn(pkt) == 0 ) { % if ECN field = not-ECT 731 13: drop(pkt) % squared drop 732 14: continue % continue to the top of the while loop 733 15: } 734 16: mark(pkt) % squared mark 735 17: } 736 18: } 737 19: return(pkt) % return the packet and stop 738 20: } 739 21: return(NULL) % no packet to dequeue 740 22: } 742 Figure 3: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM 744 1: dualpi2_update(lq, cq, target) { % Update p every Tupdate 745 2: curq = cq.time() % use queuing time of first-in Classic packet 746 3: p = p + alpha_U * (curq - target) + beta_U * (curq - prevq) 747 4a: p_L = p * k % L4S prob = base prob * coupling factor 748 4b: p_C = p^2 % Classic prob = (base prob)^2 749 5: prevq = curq 750 6: } 752 Figure 4: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM 754 When packets arrive, first a common queue limit is checked as shown 755 in line 3 of the enqueuing pseudocode in Figure 2. Note that the 756 limit is deliberately tested before enqueue to avoid any bias against 757 larger packets (so the actual buffer has to be one packet larger than 758 limit). If limit is not exceeded, the packet will be classified and 759 enqueued to the Classic or L4S queue dependent on the least 760 significant bit of the ECN field in the IP header (line 6). Packets 761 with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be 762 enqueued in the Classic queue. Otherwise, ECT(1) and CE packets will 763 be enqueued in the L4S queue. Optional packet classification 764 flexibility is omitted for brevity. 766 The dequeue pseudocode schedules one packet for dequeuing (or zero if 767 the queue is empty). It also makes all the AQM decisions on dropping 768 and marking. It is contained within a large while loop so that if it 769 decides to drop a packet, it will continue until it selects a packet 770 to schedule. Line 3 of the dequeue pseudocode implements time- 771 shifted FIFO scheduling. It takes the packet that waited the 772 longest, biased against the Classic traffic by a time-shift of 773 tshift. 775 o If an L4S packet is scheduled, lines 5 to 8 mark the packet if 776 either the L4S threshold is exceeded, or if a random marking 777 decision is drawn according to k times the probability p 778 (maintained by the dualpi2_update() function discussed below). 779 The L4S threshold is usually in units of time (default T_time = 1 780 ms). However, on slow links the packet serialization time can 781 approach the threshold T_time, so line 6 sets a floor of 2 MTU to 782 the threshold. 784 o If a Classic packet is scheduled, lines 10 to 17 drop or mark the 785 packet based on the squared probability p_C = p^2 (hence the name 786 PI2 for Classic traffic). 788 The probability p is kept up to date by the core PI algorithm in 789 Figure 4 which is executed every Tupdate ([RFC8033] now recommends 790 16ms). The algorithm centres on line 3, which is a classical 791 Proportional-Integral (PI) controller that alters p dependent on a) 792 the error between the current queuing delay (curq) and the target 793 queuing delay (target) as defined in [RFC8033] and b) the change in 794 queuing delay since the last sample. The name 'PI' represents the 795 fact that the second factor is proportional to load while the first 796 is the integral of the load (so it removes any standing queue). In 797 corner cases, p can overflow the range [0,1] so the resulting value 798 of p has to be bounded (omitted from the pseudocode). 800 Alpha_U and beta_U are gain factors chosen based on stability 801 analysis to convert changes in queueing delay into changes in 802 probability. They are therefore in units of 'per second of delay' or 803 Hz. The suffix '_U' represents 'per update time' (Tupdate). If a 804 briefer update time is configured, alpha_U and beta_U also have to be 805 reduced to ensure that the same response is given over time, so that 806 a smaller Tupdate will only result in a response with finer steps, 807 not a more aggressive response. Therefore, alpha_U and beta_U are 808 derived from alpha and beta, which each represent a 'gain factor per 809 second of update time', so they can be configured independently of 810 the update time (see lines 7 and 8 of Figure 1). 812 Unlike in PIE, alpha_U and beta_U do not need to be tuned every 813 Tupdate dependent on p. Instead, in PI2 alpha_U and beta_U can be 814 constants because the squaring applied to Classic traffic tunes them 815 inherently, as explained in [PI216]. 817 Note that p solely depends on the queuing time in the Classic queue. 818 In line 2, the current queuing delay is evaluated by inspecting the 819 timestamp of the next packet to schedule in the Classic queue. The 820 function cq.time() subtracts the time stamped at enqueue from the 821 current time and implicitly takes the current queuing delay as 0 if 822 the queue is empty. 824 Because the L4S marking probability (p_L) is factored up by k, the 825 dynamic gain parameters alpha and beta are also inherently factored 826 up by k for the L4S queue, which is necessary to ensure that Classic 827 TCP and DCTCP controls have the same stability. So, if alpha is 10 828 Hz^2, the effective gain factor for the L4S queue is k*alpha, which 829 is 20 Hz^2 with the default coupling factor of k=2. 831 A.2. Pass #2: Overload Details 833 Figure 5 repeats the dequeue function of Figure 3, but with overload 834 details added. Similarly Figure 6 repeats the core PI algorithm of 835 Figure 4 with overload details added. The initialization and enqueue 836 functions are unchanged. 838 In line 13 of the initialization function (Figure 1), the default 839 maximum Classic drop probability p_Cmax = 1/4 or 25%. This is the 840 point at which it is deemed that the Classic queue has become 841 persistently overloaded, so it switches to using solely drop, even 842 for ECN-capable packets. This protects the queue against any 843 unresponsive traffic that falsely claims that it is responsive to ECN 844 marking, as required by [RFC3168] and [RFC7567]. 846 Line 14 translates this into a maximum L4S marking probability 847 (p_Lmax) by rearranging Equation (1). With the default coupling 848 factor of k=2, this translates to a maximum L4S marking probability 849 of 1 or 100%. This is intended to ensure that the L4S queue starts to 850 introduce dropping once marking saturates and can rise no further. 851 The 'TCP Prague' requirements [I-D.ietf-tsvwg-ecn-l4s-id] require 852 that, when an L4S congestion control detects a drop, it falls back to 853 a response that coexists with 'Classic' TCP. So it is correct that 854 the L4S queue drops packets proportional to p^2, as if they are 855 Classic packets. 857 Both these switch-overs are triggered by the tests for overload 858 introduced in lines 4b and 12b of the dequeue function (Figure 5). 859 Lines 8c to 8g drop L4S packets with probability p^2 by comparing p 860 against two random numbers. Lines 8h to 8i mark the remaining 861 packets with probability p_L. 863 Lines 2c to 2d in the core PI algorithm (Figure 6) deal with overload 864 of the L4S queue when there is no Classic traffic. This is 865 necessary, because the core PI algorithm maintains the right 866 probability of drop to regulate overload, but it depends on the 867 length of the Classic queue. If there is no Classic queue the naive 868 algorithm in Figure 4 drops nothing, even if the L4S queue is 869 overloaded - so tail drop takes over (lines 3 and 4 of Figure 2). 871 If the test at line 2a finds that the Classic queue is empty, line 2d 872 measures the current queue delay using the L4S queue not the Classic 873 queue, While the L4S queue is not overloaded, its delay will always 874 be tiny compared to the target Classic queue delay. So p_L will be 875 driven to zero, and the L4S queue will naturally be governed solely 876 by threshold marking (lines 5 and 6 of the dequeue algorithm in 877 Figure 5). But, if unresponsive L4S source(s) cause overload, the 878 DualQ transitions smoothly to L4S marking based on the PI algorithm. 879 And as overload increases, it naturally transitions from marking to 880 dropping by the switch-over mechanism already described. 882 1: dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 883 2: while ( lq.len() + cq.len() > 0 ) 884 3: if ( lq.time() + tshift >= cq.time() ) { % time-shifted FIFO 885 4a: lq.dequeue(pkt) 886 4b: if ( p_L < p_Lmax ) { % Check for overload saturation 887 5: if ( ((pkt.time() > T_time) % step marking ... 888 6: AND (lq.len > T_len)) 889 7: OR (p_L > rand()) ) % ...or linear marking 890 8a: mark(pkt) 891 8b: } else { % overload saturation 892 8c: if ( p > max(rand(), rand()) ) { % probability p^2 893 8e: drop(pkt) % revert to Classic drop due to overload 894 8f: continue % continue to the top of the while loop 895 8g: } 896 8h: if ( p_L > rand() ) % probability p_L = k * p 897 8i: mark(pkt) % linear marking of remaining packets 898 8j: } 899 9: } else { 900 10: cq.dequeue(pkt) 901 11: if ( p > max(rand(), rand()) ) { % probability p^2 902 12a: if ( (ecn(pkt) == 0) % ECN field = not-ECT 903 12b: OR (p_L >= p_Lmax) ) { % Overload disables ECN 904 13: drop(pkt) % squared drop, redo loop 905 14: continue % continue to the top of the while loop 906 15: } 907 16: mark(pkt) % squared mark 908 17: } 909 18: } 910 19: return(pkt) % return the packet and stop 911 20: } 912 21: return(NULL) % no packet to dequeue 913 22: } 915 Figure 5: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM 916 (Including Integer Arithmetic and Overload Code) 918 1: dualpi2_update(lq, cq, target) { % Update p every Tupdate 919 2a: if ( cq.len() > 0 ) 920 2b: curq = cq.time() %use queuing time of first-in Classic packet 921 2c: else % Classic queue empty 922 2d: curq = lq.time() % use queuing time of first-in L4S packet 923 3: p = p + alpha_U * (curq - target) + beta_U * (curq - prevq) 924 4: p_L = p * k % L4S prob = base prob * coupling factor 925 5: prevq = curq 926 6: } 928 Figure 6: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM 929 (Including Overload Code) 931 Appendix B. Example DualQ Coupled Curvy RED Algorithm 933 As another example of a DualQ Coupled AQM algorithm, the pseudocode 934 below gives the Curvy RED based algorithm we used and tested. 935 Although we designed the AQM to be efficient in integer arithmetic, 936 to aid understanding it is first given using real-number arithmetic. 937 Then, one possible optimization for integer arithmetic is given, also 938 in pseudocode. To aid comparison, the line numbers are kept in step 939 between the two by using letter suffixes where the longer code needs 940 extra lines. 942 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 943 2: if ( lq.dequeue(pkt) ) { 944 3a: p_L = cq.sec() / 2^S_L 945 3b: if ( lq.byt() > T ) 946 3c: mark(pkt) 947 3d: elif ( p_L > maxrand(U) ) 948 4: mark(pkt) 949 5: return(pkt) % return the packet and stop here 950 6: } 951 7: while ( cq.dequeue(pkt) ) { 952 8a: alpha = 2^(-f_C) 953 8b: Q_C = alpha * pkt.sec() + (1-alpha)* Q_C % Classic Q EWMA 954 9a: sqrt_p_C = Q_C / 2^S_C 955 9b: if ( sqrt_p_C > maxrand(2*U) ) 956 10: drop(pkt) % Squared drop, redo loop 957 11: else 958 12: return(pkt) % return the packet and stop here 959 13: } 960 14: return(NULL) % no packet to dequeue 961 15: } 963 16: maxrand(u) { % return the max of u random numbers 964 17: maxr=0 965 18: while (u-- > 0) 966 19: maxr = max(maxr, rand()) % 0 <= rand() < 1 967 20: return(maxr) 968 21: } 970 Figure 7: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM 972 Packet classification code is not shown, as it is no different from 973 Figure 2. Classic ECN handling is not shown. Potential 974 classification schemes are discussed in Section 2. The Curvy RED 975 algorithm has not been maintained to the same degree as the DualPI2 976 algorithm. Some ideas used in DualPI2 could be translated into Curvy 977 RED, such as i) the time-shifted FIFO scheduler ii) the time-based 978 L4S threshold; iii) overload protection. {ToDo}. 980 At the outer level, the structure of dualq_dequeue() implements 981 strict priority scheduling. The code is written assuming the AQM is 982 applied on dequeue (Note 1) . Every time dualq_dequeue() is called, 983 the if-block in lines 2-6 determines whether there is an L4S packet 984 to dequeue by calling lq.dequeue(pkt), and otherwise the while-block 985 in lines 7-13 determines whether there is a Classic packet to 986 dequeue, by calling cq.dequeue(pkt). (Note 2) 988 In the lower priority Classic queue, a while loop is used so that, if 989 the AQM determines that a classic packet should be dropped, it 990 continues to test for classic packets deciding whether to drop each 991 until it actually forwards one. Thus, every call to dualq_dequeue() 992 returns one packet if at least one is present in either queue, 993 otherwise it returns NULL at line 14. (Note 3) 995 Within each queue, the decision whether to drop or mark is taken as 996 follows (to simplify the explanation, it is assumed that U=1): 998 L4S: If the test at line 2 determines there is an L4S packet to 999 dequeue, the tests at lines 3a and 3c determine whether to mark 1000 it. The first is a simple test of whether the L4S queue (lq.byt() 1001 in bytes) is greater than a step threshold T in bytes (Note 4). 1002 The second test is similar to the random ECN marking in RED, but 1003 with the following differences: i) the marking function does not 1004 start with a plateau of zero marking until a minimum threshold, 1005 rather the marking probability starts to increase as soon as the 1006 queue is positive; ii) marking depends on queuing time, not bytes, 1007 in order to scale for any link rate without being reconfigured; 1008 iii) marking of the L4S queue does not depend on itself, it 1009 depends on the queuing time of the _other_ (Classic) queue, where 1010 cq.sec() is the queuing time of the packet at the head of the 1011 Classic queue (zero if empty); iv) marking depends on the 1012 instantaneous queuing time (of the other Classic queue), not a 1013 smoothed average; v) the queue is compared with the maximum of U 1014 random numbers (but if U=1, this is the same as the single random 1015 number used in RED). 1017 Specifically, in line 3a the marking probability p_L is set to the 1018 Classic queueing time qc.sec() in seconds divided by the L4S 1019 scaling parameter 2^S_L, which represents the queuing time (in 1020 seconds) at which marking probability would hit 100%. Then in line 1021 3d (if U=1) the result is compared with a uniformly distributed 1022 random number between 0 and 1, which ensures that marking 1023 probability will linearly increase with queueing time. The 1024 scaling parameter is expressed as a power of 2 so that division 1025 can be implemented as a right bit-shift (>>) in line 3 of the 1026 integer variant of the pseudocode (Figure 8). 1028 Classic: If the test at line 7 determines that there is at least one 1029 Classic packet to dequeue, the test at line 9b determines whether 1030 to drop it. But before that, line 8b updates Q_C, which is an 1031 exponentially weighted moving average (Note 5) of the queuing time 1032 in the Classic queue, where pkt.sec() is the instantaneous 1033 queueing time of the current Classic packet and alpha is the EWMA 1034 constant for the classic queue. In line 8a, alpha is represented 1035 as an integer power of 2, so that in line 8 of the integer code 1036 the division needed to weight the moving average can be 1037 implemented by a right bit-shift (>> f_C). 1039 Lines 9a and 9b implement the drop function. In line 9a the 1040 averaged queuing time Q_C is divided by the Classic scaling 1041 parameter 2^S_C, in the same way that queuing time was scaled for 1042 L4S marking. This scaled queuing time is given the variable name 1043 sqrt_p_C because it will be squared to compute Classic drop 1044 probability, so before it is squared it is effectively the square 1045 root of the drop probability. The squaring is done by comparing 1046 it with the maximum out of two random numbers (assuming U=1). 1047 Comparing it with the maximum out of two is the same as the 1048 logical `AND' of two tests, which ensures drop probability rises 1049 with the square of queuing time (Note 6). Again, the scaling 1050 parameter is expressed as a power of 2 so that division can be 1051 implemented as a right bit-shift in line 9 of the integer 1052 pseudocode. 1054 The marking/dropping functions in each queue (lines 3 & 9) are two 1055 cases of a new generalization of RED called Curvy RED, motivated as 1056 follows. When we compared the performance of our AQM with fq_CoDel 1057 and PIE, we came to the conclusion that their goal of holding queuing 1058 delay to a fixed target is misguided [CRED_Insights]. As the number 1059 of flows increases, if the AQM does not allow TCP to increase queuing 1060 delay, it has to introduce abnormally high levels of loss. Then loss 1061 rather than queuing becomes the dominant cause of delay for short 1062 flows, due to timeouts and tail losses. 1064 Curvy RED constrains delay with a softened target that allows some 1065 increase in delay as load increases. This is achieved by increasing 1066 drop probability on a convex curve relative to queue growth (the 1067 square curve in the Classic queue, if U=1). Like RED, the curve hugs 1068 the zero axis while the queue is shallow. Then, as load increases, 1069 it introduces a growing barrier to higher delay. But, unlike RED, it 1070 requires only one parameter, the scaling, not three. The diadvantage 1071 of Curvy RED is that it is not adapted to a wide range of RTTs. 1072 Curvy RED can be used as is when the RTT range to support is limited 1073 otherwise an adaptation mechanism is required. 1075 There follows a summary listing of the two parameters used for each 1076 of the two queues: 1078 Classic: 1080 S_C : The scaling factor of the dropping function scales Classic 1081 queuing times in the range [0, 2^(S_C)] seconds into a dropping 1082 probability in the range [0,1]. To make division efficient, it 1083 is constrained to be an integer power of two; 1085 f_C : To smooth the queuing time of the Classic queue and make 1086 multiplication efficient, we use a negative integer power of 1087 two for the dimensionless EWMA constant, which we define as 1088 alpha = 2^(-f_C). 1090 L4S : 1092 S_L (and k'): As for the Classic queue, the scaling factor of 1093 the L4S marking function scales Classic queueing times in the 1094 range [0, 2^(S_L)] seconds into a probability in the range 1095 [0,1]. Note that S_L = S_C + k', where k' is the coupling 1096 between the queues. So S_L and k' count as only one parameter; 1097 k' is related to k in Equation (1) (Section 2.1) by k=2^k', 1098 where both k and k' are constants. Then implementations can 1099 avoid costly division by shifting p_L by k' bits to the right. 1101 T : The queue size in bytes at which step threshold marking 1102 starts in the L4S queue. 1104 {ToDo: These are the raw parameters used within the algorithm. A 1105 configuration front-end could accept more meaningful parameters and 1106 convert them into these raw parameters.} 1108 From our experiments so far, recommended values for these parameters 1109 are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs 1110 typical on the public Internet. [CRED_Insights] explains why these 1111 parameters are applicable whatever rate link this AQM implementation 1112 is deployed on and how the parameters would need to be adjusted for a 1113 scenario with a different range of RTTs (e.g. a data centre) {ToDo 1114 incorporate a summary of that report into this draft}. The setting of 1115 k depends on policy (see Section 2.4 and Appendix C respectively for 1116 its recommended setting and guidance on alternatives). 1118 There is also a cUrviness parameter, U, which is a small positive 1119 integer. It is likely to take the same hard-coded value for all 1120 implementations, once experiments have determined a good value. We 1121 have solely used U=1 in our experiments so far, but results might be 1122 even better with U=2 or higher. 1124 Note that the dropping function at line 9 calls maxrand(2*U), which 1125 gives twice as much curviness as the call to maxrand(U) in the 1126 marking function at line 3. This is the trick that implements the 1127 square rule in equation (1) (Section 2.1). This is based on the fact 1128 that, given a number X from 1 to 6, the probability that two dice 1129 throws will both be less than X is the square of the probability that 1130 one throw will be less than X. So, when U=1, the L4S marking 1131 function is linear and the Classic dropping function is squared. If 1132 U=2, L4S would be a square function and Classic would be quartic. 1133 And so on. 1135 The maxrand(u) function in lines 16-21 simply generates u random 1136 numbers and returns the maximum (Note 7). Typically, maxrand(u) 1137 could be run in parallel out of band. For instance, if U=1, the 1138 Classic queue would require the maximum of two random numbers. So, 1139 instead of calling maxrand(2*U) in-band, the maximum of every pair of 1140 values from a pseudorandom number generator could be generated out- 1141 of-band, and held in a buffer ready for the Classic queue to consume. 1143 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 1144 2: if ( lq.dequeue(pkt) ) { 1145 3: if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U))) 1146 4: mark(pkt) 1147 5: return(pkt) % return the packet and stop here 1148 6: } 1149 7: while ( cq.dequeue(pkt) ) { 1150 8: Q_C += (pkt.ns() - Q_C) >> f_C % Classic Q EWMA 1151 9: if ( (Q_C >> (S_C-2) ) > maxrand(2*U) ) 1152 10: drop(pkt) % Squared drop, redo loop 1153 11: else 1154 12: return(pkt) % return the packet and stop here 1155 13: } 1156 14: return(NULL) % no packet to dequeue 1157 15: } 1159 Figure 8: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM 1160 using Integer Arithmetic 1162 Notes: 1164 1. The drain rate of the queue can vary if it is scheduled relative 1165 to other queues, or to cater for fluctuations in a wireless 1166 medium. To auto-adjust to changes in drain rate, the queue must 1167 be measured in time, not bytes or packets [CoDel]. In our Linux 1168 implementation, it was easiest to measure queuing time at 1169 dequeue. Queuing time can be estimated when a packet is enqueued 1170 by measuring the queue length in bytes and dividing by the recent 1171 drain rate. 1173 2. An implementation has to use priority queueing, but it need not 1174 implement strict priority. 1176 3. If packets can be enqueued while processing dequeue code, an 1177 implementer might prefer to place the while loop around both 1178 queues so that it goes back to test again whether any L4S packets 1179 arrived while it was dropping a Classic packet. 1181 4. In order not to change too many factors at once, for now, we keep 1182 the marking function for DCTCP-only traffic as similar as 1183 possible to DCTCP. However, unlike DCTCP, all processing is at 1184 dequeue, so we determine whether to mark a packet at the head of 1185 the queue by the byte-length of the queue _behind_ it. We plan 1186 to test whether using queuing time will work in all 1187 circumstances, and if we find that the step can cause 1188 oscillations, we will investigate replacing it with a steep 1189 random marking curve. 1191 5. An EWMA is only one possible way to filter bursts; other more 1192 adaptive smoothing methods could be valid and it might be 1193 appropriate to decrease the EWMA faster than it increases. 1195 6. In practice at line 10 the Classic queue would probably test for 1196 ECN capability on the packet to determine whether to drop or mark 1197 the packet. However, for brevity such detail is omitted. All 1198 packets classified into the L4S queue have to be ECN-capable, so 1199 no dropping logic is necessary at line 3. Nonetheless, L4S 1200 packets could be dropped by overload code (see Section 4.1). 1202 7. In the integer variant of the pseudocode (Figure 8) real numbers 1203 are all represented as integers scaled up by 2^32. In lines 3 & 1204 9 the function maxrand() is arranged to return an integer in the 1205 range 0 <= maxrand() < 2^32. Queuing times are also scaled up by 1206 2^32, but in two stages: i) In lines 3 and 8 queuing times 1207 cq.ns() and pkt.ns() are returned in integer nanoseconds, making 1208 the values about 2^30 times larger than when the units were 1209 seconds, ii) then in lines 3 and 9 an adjustment of -2 to the 1210 right bit-shift multiplies the result by 2^2, to complete the 1211 scaling by 2^32. 1213 Appendix C. Guidance on Controlling Throughput Equivalence 1214 +---------------+------+-------+ 1215 | RTT_C / RTT_L | Reno | Cubic | 1216 +---------------+------+-------+ 1217 | 1 | k'=1 | k'=0 | 1218 | 2 | k'=2 | k'=1 | 1219 | 3 | k'=2 | k'=2 | 1220 | 4 | k'=3 | k'=2 | 1221 | 5 | k'=3 | k'=3 | 1222 +---------------+------+-------+ 1224 Table 1: Value of k' for which DCTCP throughput is roughly the same 1225 as Reno or Cubic, for some example RTT ratios 1227 k' is related to k in Equation (1) (Section 2.1) by k=2^k'. 1229 To determine the appropriate policy, the operator first has to judge 1230 whether it wants DCTCP flows to have roughly equal throughput with 1231 Reno or with Cubic (because, even in its Reno-compatibility mode, 1232 Cubic is about 1.4 times more aggressive than Reno). Then the 1233 operator needs to decide at what ratio of RTTs it wants DCTCP and 1234 Classic flows to have roughly equal throughput. For example choosing 1235 k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same 1236 as Cubic, _if their RTTs are the same_. 1238 However, even if the base RTTs are the same, the actual RTTs are 1239 unlikely to be the same, because Classic (Cubic or Reno) traffic 1240 needs a large queue to avoid under-utilization and excess drop, 1241 whereas L4S (DCTCP) does not. The operator might still choose this 1242 policy if it judges that DCTCP throughput should be rewarded for 1243 keeping its own queue short. 1245 On the other hand, the operator will choose one of the higher values 1246 for k', if it wants to slow DCTCP down to roughly the same throughput 1247 as Classic flows, to compensate for Classic flows slowing themselves 1248 down by causing themselves extra queuing delay. 1250 The values for k' in the table are derived from the formulae, which 1251 was developed in [DCttH15]: 1253 2^k' = 1.64 (RTT_reno / RTT_dc) (2) 1254 2^k' = 1.19 (RTT_cubic / RTT_dc ) (3) 1256 For localized traffic from a particular ISP's data centre, we used 1257 the measured RTTs to calculate that a value of k'=3 (equivalant to 1258 k=8) would achieve throughput equivalence, and our experiments 1259 verified the formula very closely. 1261 For a typical mix of RTTs from local data centres and across the 1262 general Internet, a value of k'=1 (equivalent to k=2) is recommended 1263 as a good workable compromise. 1265 Authors' Addresses 1267 Koen De Schepper 1268 Nokia Bell Labs 1269 Antwerp 1270 Belgium 1272 Email: koen.de_schepper@nokia.com 1273 URI: https://www.bell-labs.com/usr/koen.de_schepper 1275 Bob Briscoe (editor) 1276 Simula Research Lab 1278 Email: ietf@bobbriscoe.net 1279 URI: http://bobbriscoe.net/ 1281 Olga Bondarenko 1282 Simula Research Lab 1283 Lysaker 1284 Norway 1286 Email: olgabnd@gmail.com 1287 URI: https://www.simula.no/people/olgabo 1289 Ing-jyh Tsang 1290 Nokia Bell Labs 1291 Antwerp 1292 Belgium 1294 Email: ing-jyh.tsang@nokia.com