idnits 2.17.1 draft-ietf-tsvwg-aqm-dualq-coupled-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (May 10, 2017) is 2542 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '0' on line 926 -- Looks like a reference, but probably isn't: '1' on line 926 == Outdated reference: A later version (-07) exists of draft-ietf-tcpm-cubic-04 == Outdated reference: A later version (-10) exists of draft-ietf-tcpm-dctcp-05 == Outdated reference: A later version (-08) exists of draft-ietf-tsvwg-ecn-experimentation-00 == Outdated reference: A later version (-29) exists of draft-ietf-tsvwg-ecn-l4s-id-00 == Outdated reference: A later version (-20) exists of draft-ietf-tsvwg-l4s-arch-00 -- Obsolete informational reference (is this intentional?): RFC 2309 (Obsoleted by RFC 7567) Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Transport Area working group (tsvwg) K. De Schepper 3 Internet-Draft Nokia Bell Labs 4 Intended status: Experimental B. Briscoe, Ed. 5 Expires: November 11, 2017 O. Bondarenko 6 Simula Research Lab 7 I. Tsang 8 Nokia Bell Labs 9 May 10, 2017 11 DualQ Coupled AQM for Low Latency, Low Loss and Scalable Throughput 12 draft-ietf-tsvwg-aqm-dualq-coupled-00 14 Abstract 16 Data Centre TCP (DCTCP) was designed to provide predictably low 17 queuing latency, near-zero loss, and throughput scalability using 18 explicit congestion notification (ECN) and an extremely simple 19 marking behaviour on switches. However, DCTCP does not co-exist with 20 existing TCP traffic---throughput starves. So, until now, DCTCP 21 could only be deployed where a clean-slate environment could be 22 arranged, such as in private data centres. This specification 23 defines `DualQ Coupled Active Queue Management (AQM)' to allow 24 scalable congestion controls like DCTCP to safely co-exist with 25 classic Internet traffic. The Coupled AQM ensures that a flow runs 26 at about the same rate whether it uses DCTCP or TCP Reno/Cubic, but 27 without inspecting transport layer flow identifiers. When tested in 28 a residential broadband setting, DCTCP achieved sub-millisecond 29 average queuing delay and zero congestion loss under a wide range of 30 mixes of DCTCP and `Classic' broadband Internet traffic, without 31 compromising the performance of the Classic traffic. The solution 32 also reduces network complexity and eliminates network configuration. 34 Status of This Memo 36 This Internet-Draft is submitted in full conformance with the 37 provisions of BCP 78 and BCP 79. 39 Internet-Drafts are working documents of the Internet Engineering 40 Task Force (IETF). Note that other groups may also distribute 41 working documents as Internet-Drafts. The list of current Internet- 42 Drafts is at http://datatracker.ietf.org/drafts/current/. 44 Internet-Drafts are draft documents valid for a maximum of six months 45 and may be updated, replaced, or obsoleted by other documents at any 46 time. It is inappropriate to use Internet-Drafts as reference 47 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on November 11, 2017. 50 Copyright Notice 52 Copyright (c) 2017 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (http://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 Table of Contents 67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 68 1.1. Problem and Scope . . . . . . . . . . . . . . . . . . . . 2 69 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 70 1.3. Features . . . . . . . . . . . . . . . . . . . . . . . . 5 71 2. DualQ Coupled AQM Algorithm . . . . . . . . . . . . . . . . . 6 72 2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 7 73 2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 8 74 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 8 75 2.4. Normative Requirements . . . . . . . . . . . . . . . . . 8 76 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 77 4. Security Considerations . . . . . . . . . . . . . . . . . . . 9 78 4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 10 79 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11 80 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 81 6.1. Normative References . . . . . . . . . . . . . . . . . . 11 82 6.2. Informative References . . . . . . . . . . . . . . . . . 11 83 Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 14 84 Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 17 85 Appendix C. Guidance on Controlling Throughput Equivalence . . . 23 86 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 88 1. Introduction 90 1.1. Problem and Scope 92 Latency is becoming the critical performance factor for many (most?) 93 applications on the public Internet, e.g. interactive Web, Web 94 services, voice, conversational video, interactive video, interactive 95 remote presence, instant messaging, online gaming, remote desktop, 96 cloud-based applications, and video-assisted remote control of 97 machinery and industrial processes. In the developed world, further 98 increases in access network bit-rate offer diminishing returns, 99 whereas latency is still a multi-faceted problem. In the last decade 100 or so, much has been done to reduce propagation time by placing 101 caches or servers closer to users. However, queuing remains a major 102 component of latency. 104 The Diffserv architecture provides Expedited Forwarding [RFC3246], so 105 that low latency traffic can jump the queue of other traffic. 106 However, on access links dedicated to individual sites (homes, small 107 enterprises or mobile devices), often all traffic at any one time 108 will be latency-sensitive. Then Diffserv is of little use. Instead, 109 we need to remove the causes of any unnecessary delay. 111 The bufferbloat project has shown that excessively-large buffering 112 (`bufferbloat') has been introducing significantly more delay than 113 the underlying propagation time. These delays appear only 114 intermittently--only when a capacity-seeking (e.g. TCP) flow is long 115 enough for the queue to fill the buffer, making every packet in other 116 flows sharing the buffer sit through the queue. 118 Active queue management (AQM) was originally developed to solve this 119 problem (and others). Unlike Diffserv, which gives low latency to 120 some traffic at the expense of others, AQM controls latency for _all_ 121 traffic in a class. In general, AQMs introduce an increasing level 122 of discard from the buffer the longer the queue persists above a 123 shallow threshold. This gives sufficient signals to capacity-seeking 124 (aka. greedy) flows to keep the buffer empty for its intended 125 purpose: absorbing bursts. However, RED [RFC2309] and other 126 algorithms from the 1990s were sensitive to their configuration and 127 hard to set correctly. So, AQM was not widely deployed. 129 More recent state-of-the-art AQMs, e.g. 130 fq_CoDel [I-D.ietf-aqm-fq-codel], PIE [I-D.ietf-aqm-pie], Adaptive 131 RED [ARED01], are easier to configure, because they define the 132 queuing threshold in time not bytes, so it is invariant for different 133 link rates. However, no matter how good the AQM, the sawtoothing 134 rate of TCP will either cause queuing delay to vary or cause the link 135 to be under-utilized. Even with a perfectly tuned AQM, the 136 additional queuing delay will be of the same order as the underlying 137 speed-of-light delay across the network. Flow-queuing can isolate 138 one flow from another, but it cannot isolate a TCP flow from the 139 delay variations it inflicts on itself, and it has other problems - 140 it overrides the flow rate decisions of variable rate video 141 applications, it does not recognise the flows within IPSec VPN 142 tunnels and it is relatively expensive to implement. 144 It seems that further changes to the network alone will now yield 145 diminishing returns. Data Centre TCP (DCTCP [I-D.ietf-tcpm-dctcp]) 146 teaches us that a small but radical change to TCP is needed to cut 147 two major outstanding causes of queuing delay variability: 149 1. the `sawtooth' varying rate of TCP itself; 151 2. the smoothing delay deliberately introduced into AQMs to permit 152 bursts without triggering losses. 154 The former causes a flow's round trip time (RTT) to vary from about 1 155 to 2 times the base RTT between the machines in question. The latter 156 delays the system's response to change by a worst-case 157 (transcontinental) RTT, which could be hundreds of times the actual 158 RTT of typical traffic from localized CDNs. 160 Latency is not our only concern: 162 3. It was known when TCP was first developed that it would not scale 163 to high bandwidth-delay products. 165 Given regular broadband bit-rates over WAN distances are 166 already [RFC3649] beyond the scaling range of `classic' TCP Reno, 167 `less unscalable' Cubic [I-D.ietf-tcpm-cubic] and 168 Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been 169 successfully deployed. However, these are now approaching their 170 scaling limits. Unfortunately, fully scalable TCPs such as DCTCP 171 cause `classic' TCP to starve itself, which is why they have been 172 confined to private data centres or research testbeds (until now). 174 This document specifies a `DualQ Coupled AQM' extension that solves 175 the problem of coexistence between scalable and classic flows, 176 without having to inspect flow identifiers. The AQM is not like 177 flow-queuing approaches [I-D.ietf-aqm-fq-codel] that classify packets 178 by flow identifier into numerous separate queues in order to isolate 179 sparse flows from the higher latency in the queues assigned to 180 heavier flow. In contrast, the AQM exploits the behaviour of 181 scalable congestion controls like DCTCP so that every packet in every 182 flow sharing the queue for DCTCP-like traffic can be served with very 183 low latency. 185 This AQM extension can be combined with any single qeueu AQM that 186 generates a statistical or deterministic mark/drop probability driven 187 by the queue dynamics. In many cases it simplifies the basic control 188 algorithm, and requires little extra processing. Therefore it is 189 believed the Coupled AQM would be applicable and easy to deploy in 190 all types of buffers; buffers in cost-reduced mass-market residential 191 equipment; buffers in end-system stacks; buffers in carrier-scale 192 equipment including remote access servers, routers, firewalls and 193 Ethernet switches; buffers in network interface cards, buffers in 194 virtualized network appliances, hypervisors, and so on. 196 The overall L4S architecture is described in 197 [I-D.ietf-tsvwg-l4s-arch]. The supporting papers [PI216] and 198 [DCttH15] give the full rationale for the AQM's design, both 199 discursively and in more precise mathematical form. 201 1.2. Terminology 203 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 204 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 205 document are to be interpreted as described in [RFC2119]. In this 206 document, these words will appear with that interpretation only when 207 in ALL CAPS. Lower case uses of these words are not to be 208 interpreted as carrying RFC-2119 significance. 210 The DualQ Coupled AQM uses two queues for two services. Each of the 211 following terms identifies both the service and the queue that 212 provides the service: 214 Classic (denoted by subscript C): The `Classic' service is intended 215 for all the behaviours that currently co-exist with TCP Reno (TCP 216 Cubic, Compound, SCTP, etc). 218 Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L): 219 The `L4S' service is intended for a set of congestion controls 220 with scalable properties such as DCTCP (e.g. 221 Relentless [Mathis09]). 223 Either service can cope with a proportion of unresponsive or less- 224 responsive traffic as well (e.g. DNS, VoIP, etc), just as a single 225 queue AQM can. The DualQ Coupled AQM behaviour is similar to a 226 single FIFO queue with respect to unresponsive and overload traffic. 228 1.3. Features 230 The AQM couples marking and/or dropping across the two queues such 231 that a flow will get roughly the same throughput whichever it uses. 232 Therefore both queues can feed into the full capacity of a link and 233 no rates need to be configured for the queues. The L4S queue enables 234 scalable congestion controls like DCTCP to give stunningly low and 235 predictably low latency, without compromising the performance of 236 competing 'Classic' Internet traffic. Thousands of tests have been 237 conducted in a typical fixed residential broadband setting. Typical 238 experiments used base round trip delays up to 100ms between the data 239 centre and home network, and large amounts of background traffic in 240 both queues. For every L4S packet, the AQM kept the average queuing 241 delay below 1ms (or 2 packets if serialization delay is bigger for 242 slow links), and no losses at all were introduced by the AQM. 243 Details of the extensive experiments will be made available [PI216] 244 [DCttH15]. 246 Subjective testing was also conducted using a demanding panoramic 247 interactive video application run over a stack with DCTCP enabled and 248 deployed on the testbed. Each user could pan or zoom their own high 249 definition (HD) sub-window of a larger video scene from a football 250 match. Even though the user was also downloading large amounts of 251 L4S and Classic data, latency was so low that the picture appeared to 252 stick to their finger on the touchpad (all the L4S data achieved the 253 same ultra-low latency). With an alternative AQM, the video 254 noticeably lagged behind the finger gestures. 256 Unlike Diffserv Expedited Forwarding, the L4S queue does not have to 257 be limited to a small proportion of the link capacity in order to 258 achieve low delay. The L4S queue can be filled with a heavy load of 259 capacity-seeking flows like DCTCP and still achieve low delay. The 260 L4S queue does not rely on the presence of other traffic in the 261 Classic queue that can be 'overtaken'. It gives low latency to L4S 262 traffic whether or not there is Classic traffic, and the latency of 263 Classic traffic does not suffer when a proportion of the traffic is 264 L4S. The two queues are only necessary because DCTCP-like flows 265 cannot keep latency predictably low and keep utilization high if they 266 are mixed with legacy TCP flows, 268 The experiments used the Linux implementation of DCTCP that is 269 deployed in private data centres, without any modification despite 270 its known deficiencies. Nonetheless, certain modifications will be 271 necessary before DCTCP is safe to use on the Internet, which are 272 recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id]. However, the 273 focus of this specification is to get the network service in place. 274 Then, without any management intervention, applications can exploit 275 it by migrating to scalable controls like DCTCP, which can then 276 evolve _while_ their benefits are being enjoyed by everyone on the 277 Internet. 279 2. DualQ Coupled AQM Algorithm 281 There are two main aspects to the algorithm: 283 o the Coupled AQM that addresses throughput equivalence between 284 Classic (e.g. Reno, Cubic) flows and L4S (e.g. DCTCP) flows 286 o the Dual Queue structure that provides latency separation for L4S 287 flows to isolate them from the typically large Classic queue. 289 2.1. Coupled AQM 291 In the 1990s, the `TCP formula' was derived for the relationship 292 between TCP's congestion window, cwnd, and its drop probability, p. 293 To a first order approximation, cwnd of TCP Reno is inversely 294 proportional to the square root of p. TCP Cubic implements a Reno- 295 compatibility mode, which is the only relevant mode for typical RTTs 296 under 20ms, while the throughput of a single flow is less than about 297 500Mb/s. Therefore we can assume that Cubic traffic behaves similar 298 to Reno (but with a slightly different constant of proportionality), 299 and we shall use the term 'Classic' for the collection of Reno and 300 Cubic in Reno mode. 302 In our supporting paper [PI216], we derive the equivalent rate 303 equation for DCTCP, for which cwnd is inversely proportional to p 304 (not the square root), where in this case p is the ECN marking 305 probability. DCTCP is not the only congestion control that behaves 306 like this, so we use the term 'L4S' traffic for all similar 307 behaviour. 309 In order to make a DCTCP flow run at roughly the same rate as a Reno 310 TCP flow (all other factors being equal), we make the drop or marking 311 probability for Classic traffic, p_C distinct from the marking 312 probability for L4S traffic, p_L (in contrast to RFC3168 which 313 requires them to be the same). We make the Classic drop probability 314 p_C proportional to the square of the L4S marking probability p_L. 315 This is because we need to make the Reno flow rate equal the DCTCP 316 flow rate, so we have to square the square root of p_C in the Reno 317 rate equation to make it the same as the straight p_L in the DCTCP 318 rate equation. 320 There is a really simple way to implement the square of a probability 321 - by testing the queue against two random numbers not one. This is 322 the approach adopted in Appendix A and Appendix B. 324 Stating this as a formula, the relation between Classic drop 325 probability, p_C, and L4S marking probability, p_L needs to take the 326 form: 328 p_C = ( p_L / k )^2 (1) 330 where k is the constant of proportionality. Optionally, k can be 331 expressed as a power of 2, so k=2^k', where k' is another constant. 332 Then implementations can avoid costly division by shifting p_L by k' 333 bits to the right. 335 2.2. Dual Queue 337 Classic traffic builds a large queue, so a separate queue is provided 338 for L4S traffic, and it is scheduled with strict priority. 339 Nonetheless, coupled marking ensures that giving priority to L4S 340 traffic still leaves the right amount of spare scheduling time for 341 Classic flows to each get equivalent throughput to DCTCP flows (all 342 other factors such as RTT being equal). The algorithm achieves this 343 without having to inspect flow identifiers. 345 2.3. Traffic Classification 347 Both the Coupled AQM and DualQ mechanisms need an identifier to 348 distinguish L4S and C packets. A separate draft 349 [I-D.ietf-tsvwg-ecn-l4s-id] recommends using the ECT(1) codepoint of 350 the ECN field as this identifier, having assessed various 351 alternatives. 353 Given L4S work is currently on the experimental track, but the 354 definition of the ECN field is on the standards track [RFC3168], 355 another standards track document has proved necessary to make the 356 ECT(1) codepoint available for experimentation 357 [I-D.ietf-tsvwg-ecn-experimentation]. 359 2.4. Normative Requirements 361 In the Dual Queue, L4S packets MUST be given priority over Classic, 362 although strict priority MAY not be appropriate. 364 All L4S traffic MUST be ECN-capable, although some Classic traffic 365 MAY also be ECN-capable. 367 Whatever identifier is used for L4S traffic, it will still be 368 necessary to agree on the meaning of an ECN marking on L4S traffic, 369 relative to a drop of Classic traffic. In order to prevent 370 starvation of Classic traffic by scalable L4S traffic (e.g. DCTCP) 371 the drop probability of Classic traffic MUST be proportional to the 372 square of the marking probability of L4S traffic, In other words, the 373 power to which p_L is raised in Eqn. (1) MUST be 2. 375 The constant of proportionality, k, in Eqn (1) determines the 376 relative flow rates of Classic and L4S flows when the AQM concerned 377 is the bottleneck (all other factors being equal). k does not have to 378 be standardized because differences do not prevent interoperability. 379 However, k has to take some value, and each operator can make that 380 choice. 382 A value of k=2 is currently RECOMMENDED as the default for Internet 383 access networks. Assuming scalable congestion controls for the 384 Internet will be as aggressive as DCTCP, this will ensure their 385 congestion window will be roughly the same as that of a standards 386 track TCP congestion control (Reno) [RFC5681] and other so-called 387 TCP-friendly controls such as TCP Cubic in its TCP-friendly mode. 389 The requirements for scalable congestion controls on the Internet 390 (termed the TCP Prague requirements) [I-D.ietf-tsvwg-ecn-l4s-id] are 391 not necessarily final. If the aggressiveness of DCTCP is not defined 392 as the benchmark for scalable controls on the Internet, the 393 recommended value of k will also be subject to change. 395 Whatever value is recommended, the choice of k is a matter of 396 operator policy, and operators MAY choose a different value using 397 Table 1 and the guidelines in Appendix C. 399 Typically, access network operators isolate customers from each other 400 with some form of layer-2 multiplexing (TDM in DOCSIS, CDMA in 3G) or 401 L3 scheduling (WRR in broadband), rather than relying on TCP to share 402 capacity between customers [RFC0970]. In such cases, the choice of k 403 will solely affect relative flow rates within each customer's access 404 capacity, not between customers. Also, k will not affect relative 405 flow rates at any times when all flows are Classic or all L4S, and it 406 will not affect small flows. 408 Example DualQ Coupled AQM algorithms called PI2 and Curvy RED are 409 given in Appendix A and Appendix B. Either example AQM can be used 410 to couple packet marking and dropping across a dual Q. Curvy RED 411 requires less operations per packet than RED and can be used if the 412 range of RTTs is limited. PI2 is a simplification of PIE with stable 413 Proportional-Integral control for both Classic and L4S congestion 414 controls. Nonetheless, it would be possible to control the queues 415 with other alternative AQMs, as long as the above normative 416 requirements (those expressed in capitals) are observed, which are 417 intended to be independent of the specific AQM. 419 {ToDo: Add management and monitoring requirements} 421 3. IANA Considerations 423 This specification contains no IANA considerations. 425 4. Security Considerations 426 4.1. Overload Handling 428 Where the interests of users or flows might conflict, it could be 429 necessary to police traffic to isolate any harm to performance. This 430 is a policy issue that needs to be separable from a basic AQM, but an 431 AQM does need to handle overload. A trade-off needs to be made 432 between complexity and the risk of either class harming the other. 433 It is an operator policy to define what must happen if the service 434 time of the classic queue becomes too great. In the following 435 subsections three optional non-exclusive overload protections are 436 defined. Their objective is for the overload behaviour of the DualQ 437 AQM to be similar to a single queue AQM. The example implementation 438 in Appendix A implements the 'delay on overload' policy. Other 439 overload protections can be envisaged: 441 Minimum throughput service: By replacing the priority scheduler 442 with a weighted round robin scheduler, a minimum throughput 443 service can be guaranteed for Classic traffic. Typically the 444 scheduling weight of the Classic queue will be small (e.g. 5%) to 445 avoid interference with the coupling but big enough to avoid 446 complete starvation of Classic traffic. 448 Delay on overload: To control milder overload of responsive traffic, 449 particularly when close to the maximum congestion signal, delay 450 can be used as an alternative congestion control mechanism. The 451 Dual Queue Coupled AQM can be made to behave like a single First- 452 In First-Out (FIFO) queue with different service times by 453 replacing the priority scheduler with a very simple scheduler that 454 could be called a "time-shifted FIFO", which is the same as the 455 Modifier Earliest Deadline First (MEDF) scheduler of [MEDF]. The 456 scheduler adds T_m to the queue delay of the next L4S packet, 457 before comparing it with the queue delay of the next Classic 458 packet, then it selects the packet with the greater adjusted queue 459 delay. Under regular conditions, this time-shifted FIFO scheduler 460 behaves just like a strict priority scheduler. But under moderate 461 or high overload it prevents starvation of the Classic queue, 462 because the time-shift defines the maximum extra queuing delay 463 (T_m) of Classic packets relative to L4S. 465 Drop on overload: On severe overload, e.g. due to non responsive 466 traffic, queues will typically overflow and packet drop will be 467 unavoidable. It is important to avoid unresponsive ECN traffic 468 (either Classic or L4S) driving the AQM to 100% drop and mark 469 probability. Congestion controls that have a minimum congestion 470 window will become unresponsive to ECN marking when the marking 471 probability is high. This situation can be avoided by applying 472 the drop probability to all packets of all traffic types when it 473 exceeds a certain threshold or by limiting the drop and marking 474 probabilities to a lower maximum value (up to where fairnes 475 between the different traffic types is still guaranteed) and rely 476 on delay to control temporary high congestion and eventually queue 477 overflow. If the classic drop probability is applied to all types 478 of traffic when it is higher than a threshold probability the 479 queueing delay can be controlled up to any overload situation, and 480 no further measures are required. If a maximum classic and 481 coupled L4S probability of less than 100% is used, both queues 482 need scheduling opportunities and should eventually experience 483 drop. This can be achieved with a scheduler that guarantees a 484 minimum throughput for each queue, such as a weighted round robin 485 or time-shifted FIFO scheduler. In that case a common queue limit 486 can be configured that will drop packets of both types of traffic. 488 To keep the throughput of both L4S and Classic flows equal over the 489 full load range, a different control strategy needs to be defined 490 above the point where one congestion control first saturates to a 491 probability of 100% (if k>1, L4S will saturate first). Possible 492 strategies include: also dropping L4S; increasing the queueing delay 493 for both; or ensuring that L4S traffic still responds to marking 494 below a window of 2 segments (see [I-D.ietf-tsvwg-ecn-l4s-id]). 496 5. Acknowledgements 498 Thanks to Anil Agarwal for detailed review comments and suggestions 499 on how to make our explanation clearer. 501 The authors' contributions are part-funded by the European Community 502 under its Seventh Framework Programme through the Reducing Internet 503 Transport Latency (RITE) project (ICT-317700). The views expressed 504 here are solely those of the authors. 506 6. References 508 6.1. Normative References 510 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 511 Requirement Levels", BCP 14, RFC 2119, 512 DOI 10.17487/RFC2119, March 1997, 513 . 515 6.2. Informative References 517 [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An 518 Algorithm for Increasing the Robustness of RED's Active 519 Queue Management", ACIRI Technical Report , August 2001, 520 . 522 [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", 523 ACM Queue 10(5), May 2012, 524 . 526 [CRED_Insights] 527 Briscoe, B., "Insights from Curvy RED (Random Early 528 Detection)", BT Technical Report TR-TUB8-2015-003, July 529 2015, 530 . 532 [DCttH15] De Schepper, K., Bondarenko, O., Briscoe, B., and I. 533 Tsang, "`Data Centre to the Home': Ultra-Low Latency for 534 All", 2015, . 537 (Under submission) 539 [I-D.ietf-aqm-fq-codel] 540 Hoeiland-Joergensen, T., McKenney, P., 541 dave.taht@gmail.com, d., Gettys, J., and E. Dumazet, "The 542 FlowQueue-CoDel Packet Scheduler and Active Queue 543 Management Algorithm", draft-ietf-aqm-fq-codel-06 (work in 544 progress), March 2016. 546 [I-D.ietf-aqm-pie] 547 Pan, R., Natarajan, P., Baker, F., and G. White, "PIE: A 548 Lightweight Control Scheme To Address the Bufferbloat 549 Problem", draft-ietf-aqm-pie-10 (work in progress), 550 September 2016. 552 [I-D.ietf-tcpm-cubic] 553 Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and 554 R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", 555 draft-ietf-tcpm-cubic-04 (work in progress), February 556 2017. 558 [I-D.ietf-tcpm-dctcp] 559 Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 560 and G. Judd, "Datacenter TCP (DCTCP): TCP Congestion 561 Control for Datacenters", draft-ietf-tcpm-dctcp-05 (work 562 in progress), March 2017. 564 [I-D.ietf-tsvwg-ecn-experimentation] 565 Black, D., "Explicit Congestion Notification (ECN) 566 Experimentation", draft-ietf-tsvwg-ecn-experimentation-00 567 (work in progress), November 2016. 569 [I-D.ietf-tsvwg-ecn-l4s-id] 570 Schepper, K., Briscoe, B., and I. Tsang, "Identifying 571 Modified Explicit Congestion Notification (ECN) Semantics 572 for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s- 573 id-00 (work in progress), November 2016. 575 [I-D.ietf-tsvwg-l4s-arch] 576 Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, 577 Low Loss, Scalable Throughput (L4S) Internet Service: 578 Architecture", draft-ietf-tsvwg-l4s-arch-00 (work in 579 progress), November 2016. 581 [I-D.sridharan-tcpm-ctcp] 582 Sridharan, M., Tan, K., Bansal, D., and D. Thaler, 583 "Compound TCP: A New TCP Congestion Control for High-Speed 584 and Long Distance Networks", draft-sridharan-tcpm-ctcp-02 585 (work in progress), November 2008. 587 [Mathis09] 588 Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , 589 May 2009, . 592 [MEDF] Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a 593 simple scheduling algorithm for two real-time transport 594 service classes with application in the UTRAN", Proc. IEEE 595 Conference on Computer Communications (INFOCOM'03) Vol.2 596 pp.1116-1122, March 2003. 598 [PI216] De Schepper, K., Bondarenko, O., Briscoe, B., and I. 599 Tsang, "PI2: A Linearized AQM for both Classic and 600 Scalable TCP", ACM CoNEXT'16 , December 2016, 601 . 604 (To appear) 606 [RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", 607 RFC 970, DOI 10.17487/RFC0970, December 1985, 608 . 610 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 611 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 612 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 613 S., Wroclawski, J., and L. Zhang, "Recommendations on 614 Queue Management and Congestion Avoidance in the 615 Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, 616 . 618 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 619 of Explicit Congestion Notification (ECN) to IP", 620 RFC 3168, DOI 10.17487/RFC3168, September 2001, 621 . 623 [RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, 624 J., Courtney, W., Davari, S., Firoiu, V., and D. 625 Stiliadis, "An Expedited Forwarding PHB (Per-Hop 626 Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, 627 . 629 [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", 630 RFC 3649, DOI 10.17487/RFC3649, December 2003, 631 . 633 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 634 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 635 . 637 Appendix A. Example DualQ Coupled PI2 Algorithm 639 As a first concrete example, the pseudocode below gives the DualQ 640 Coupled AQM algorithm based on the PI2 Classic AQM, we used and 641 tested. For this example only the pseudo code is given. An open 642 source implementation for Linux is available at: 643 https://github.com/olgabo/dualpi2. 645 1: dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq 646 2: stamp(pkt) % attach arrival time to packet 647 3: if ( lq.len() + cq.len() > limit ) 648 4: drop(pkt) % drop packet if q is full 649 5: else { 650 6: if ( ecn(pkt) modulo 2 == 0 ) % ECN bits = not-ect or ect(0) 651 7: cq.enqueue(pkt) 652 8: else % ECN bits = ect(1) or ce 653 9: lq.enqueue(pkt) 654 10: } 655 11: } 657 Figure 1: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM 659 1: dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 660 2: while ( lq.len() + cq.len() > 0 ) 661 3: if ( lq.time() + tshift >= cq.time() ) { 662 4: lq.dequeue(pkt) 663 5: if ( (pkt.time() > T) or (p * k > rand()) ) % coupling here 664 6: mark(pkt) 665 7: return(pkt) % return the packet and stop here 666 8: } else { 667 9: cq.dequeue(pkt) 668 10: if ( p > max(rand(), rand()) ) % only square part of (p/k)^2 669 11: if ( ecn(pkt) == 0 ) % ECN field = not-ect 670 12: drop(pkt) % squared drop, redo loop 671 13: else { 672 14: mark(pkt) % squared mark 673 15: return(pkt) % return the packet and stop here 674 16: } 675 17: else 676 18: return(pkt) % return the packet and stop here 677 19: } 678 20: } 679 21: return(NULL) % no packet to dequeue 680 22: } 682 Figure 2: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM 684 1: dualpi2_update(lq, cq) { % Update p every Tupdate 685 2: curq = cq.time() % use queuing time of first-in Classic packet 686 3: alpha_U = alpha * Tupdate % done once when parameters are set 687 4: beta_U = beta * Tupdate % done once when parameters are set 688 5: p = p + alpha_U * (curq - target) + beta_U * (curq - prevq) 689 6: prevq = curq 690 7: } 692 Figure 3: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM 694 When packets arrive, first a common queue limit is checked as shown 695 in line 3 of the enqueuing pseudocode in Figure 1. Note that the 696 limit is deliberately tested before enqueue to avoid any bias against 697 larger packets (so the actual buffer has to be one packet larger than 698 limit). If limit is not exceeded, the packet will be classified and 699 enqueued to the Classic or L4S queue dependent on the least 700 significant bit of the ECN field in the IP header (line 6). Packets 701 with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be 702 enqueued in the Classic queue. Otherwise, ECT(1) and CE packets will 703 be enqueued in the L4S queue. 705 The pseudocode in Figure 2 summarises the per packet dequeue 706 implementation of the DualPI2 code. Line 3 implements the time- 707 shifted FIFO scheduling. It takes the packet that waited the 708 longest, biased by a time-shift of tshift for the Classic traffic. 709 If an L4S packet is scheduled, lines 5 and 6 mark the packet if 710 either the L4S threshold T is exceeded, or if a random marking 711 decision is drawn according to k times the probability p (maintained 712 by the dualpi2_update() function discussed below). The coupling 713 factor is applied here for determining the L4S marking probability so 714 that Classic TCP control is independent from the L4S coupling factor. 715 If a Classic packet is scheduled, lines 10 to 16 drop or mark the 716 packet based on 2 random decisions resulting in the squared 717 probability p^2 (hence the name PI2 for Classic traffic). Note that 718 p is not reduced here by the factor k, as p has already been 719 multiplied by the factor k when it was used to mark the L4S traffic. 720 The coupling factor gives Classic TCP and DCTCP traffic equal 721 throughput; Because L4S marking is factored up by k, the dynamic gain 722 parameters alpha and beta also have to be factored up by k for the 723 L4S queue, which is necessary to ensure that Classic TCP and DCTCP 724 controls have the same stability. 726 The probability p is kept up to date by the core PI algorithm in 727 Figure 3 which is executed every Tupdate ([I-D.ietf-aqm-pie] now 728 recommends 16ms, but in our testing so far we have used the earlier 729 recommendation of 32ms). Note that p solely depends on the queuing 730 time in the Classic queue. In line 2, the current queuing delay is 731 evaluated by inspecting the timestamp of the next packet to schedule 732 in the Classic queue. The function cq.time() subtracts the time 733 stamped at enqueue from the current time and implicitly takes the 734 current queuing delay as 0 if the queue is empty. Line 3 and 4 only 735 need to be executed when the configuration parameters are changed. 736 Alpha and beta in Hz are gain factors per 1 second. If a briefer 737 update time is configured, alpha_U and beta_U (_U = per Tupdate) also 738 have to be reduced, to ensure that the same response is given over 739 time. As such, a smaller Tupdate will only result in a response with 740 smaller and finer steps, not a more aggressive response. The new 741 probability is calculated in line 5, where target is the target 742 queuing delay, as defined in [I-D.ietf-aqm-pie]. In corner cases, p 743 can overflow the range [0,1] so the resulting value of p has to be 744 bounded (omitted from the pseudocode). Unlike PIE, alpha_U and 745 beta_U are not tuned dependent on p, every Tupdate. Instead, in PI2 746 alpha_U and beta_U can be constants because the squaring applied to 747 Classic traffic tunes them inherently, as explained in [PI216]. 749 In our experiments so far (building on experiments with PIE) on 750 broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs 751 from 5 ms to 100 ms, PI2 achieves good results with the following 752 parameters: 754 tshift = 40ms 755 T = max(1ms, serialization time of 2 MTU) 757 target = 20ms 759 Tupdate = 32ms 761 k = 2 763 alpha = 10Hz (alpha*k = 20Hz for L4S) 765 beta = 100Hz (beta*k = 200Hz for L4S) 767 Appendix B. Example DualQ Coupled Curvy RED Algorithm 769 As another example, the pseudocode below gives the Curvy RED based 770 DualQ Coupled AQM algorithm we used and tested. Although we designed 771 the AQM to be efficient in integer arithmetic, to aid understanding 772 it is first given using real-number arithmetic. Then, one possible 773 optimization for integer arithmetic is given, also in pseudocode. To 774 aid comparison, the line numbers are kept in step between the two by 775 using letter suffixes where the longer code needs extra lines. 777 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 778 2: if ( lq.dequeue(pkt) ) { 779 3a: p_L = cq.sec() / 2^S_L 780 3b: if ( lq.byt() > T ) 781 3c: mark(pkt) 782 3d: elif ( p_L > maxrand(U) ) 783 4: mark(pkt) 784 5: return(pkt) % return the packet and stop here 785 6: } 786 7: while ( cq.dequeue(pkt) ) { 787 8a: alpha = 2^(-f_C) 788 8b: Q_C = alpha * pkt.sec() + (1-alpha)* Q_C % Classic Q EWMA 789 9a: sqrt_p_C = Q_C / 2^S_C 790 9b: if ( sqrt_p_C > maxrand(2*U) ) 791 10: drop(pkt) % Squared drop, redo loop 792 11: else 793 12: return(pkt) % return the packet and stop here 794 13: } 795 14: return(NULL) % no packet to dequeue 796 15: } 798 16: maxrand(u) { % return the max of u random numbers 799 17: maxr=0 800 18: while (u-- > 0) 801 19: maxr = max(maxr, rand()) % 0 <= rand() < 1 802 20: return(maxr) 803 21: } 805 Figure 4: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM 807 Packet classification code is not shown, as it is no different from 808 Figure 1. Potential classification schemes are discussed in 809 Section 2. Overload protection code will be included in a future 810 draft {ToDo}. 812 At the outer level, the structure of dualq_dequeue() implements 813 strict priority scheduling. The code is written assuming the AQM is 814 applied on dequeue (Note 1) . Every time dualq_dequeue() is called, 815 the if-block in lines 2-6 determines whether there is an L4S packet 816 to dequeue by calling lq.dequeue(pkt), and otherwise the while-block 817 in lines 7-13 determines whether there is a Classic packet to 818 dequeue, by calling cq.dequeue(pkt). (Note 2) 820 In the lower priority Classic queue, a while loop is used so that, if 821 the AQM determines that a classic packet should be dropped, it 822 continues to test for classic packets deciding whether to drop each 823 until it actually forwards one. Thus, every call to dualq_dequeue() 824 returns one packet if at least one is present in either queue, 825 otherwise it returns NULL at line 14. (Note 3) 827 Within each queue, the decision whether to drop or mark is taken as 828 follows (to simplify the explanation, it is assumed that U=1): 830 L4S: If the test at line 2 determines there is an L4S packet to 831 dequeue, the tests at lines 3a and 3c determine whether to mark 832 it. The first is a simple test of whether the L4S queue (lq.byt() 833 in bytes) is greater than a step threshold T in bytes (Note 4). 834 The second test is similar to the random ECN marking in RED, but 835 with the following differences: i) the marking function does not 836 start with a plateau of zero marking until a minimum threshold, 837 rather the marking probability starts to increase as soon as the 838 queue is positive; ii) marking depends on queuing time, not bytes, 839 in order to scale for any link rate without being reconfigured; 840 iii) marking of the L4S queue does not depend on itself, it 841 depends on the queuing time of the _other_ (Classic) queue, where 842 cq.sec() is the queuing time of the packet at the head of the 843 Classic queue (zero if empty); iv) marking depends on the 844 instantaneous queuing time (of the other Classic queue), not a 845 smoothed average; v) the queue is compared with the maximum of U 846 random numbers (but if U=1, this is the same as the single random 847 number used in RED). 849 Specifically, in line 3a the marking probability p_L is set to the 850 Classic queueing time qc.sec() in seconds divided by the L4S 851 scaling parameter 2^S_L, which represents the queuing time (in 852 seconds) at which marking probability would hit 100%. Then in line 853 3d (if U=1) the result is compared with a uniformly distributed 854 random number between 0 and 1, which ensures that marking 855 probability will linearly increase with queueing time. The 856 scaling parameter is expressed as a power of 2 so that division 857 can be implemented as a right bit-shift (>>) in line 3 of the 858 integer variant of the pseudocode (Figure 5). 860 Classic: If the test at line 7 determines that there is at least one 861 Classic packet to dequeue, the test at line 9b determines whether 862 to drop it. But before that, line 8b updates Q_C, which is an 863 exponentially weighted moving average (Note 5) of the queuing time 864 in the Classic queue, where pkt.sec() is the instantaneous 865 queueing time of the current Classic packet and alpha is the EWMA 866 constant for the classic queue. In line 8a, alpha is represented 867 as an integer power of 2, so that in line 8 of the integer code 868 the division needed to weight the moving average can be 869 implemented by a right bit-shift (>> f_C). 871 Lines 9a and 9b implement the drop function. In line 9a the 872 averaged queuing time Q_C is divided by the Classic scaling 873 parameter 2^S_C, in the same way that queuing time was scaled for 874 L4S marking. This scaled queuing time is given the variable name 875 sqrt_p_C because it will be squared to compute Classic drop 876 probability, so before it is squared it is effectively the square 877 root of the drop probability. The squaring is done by comparing 878 it with the maximum out of two random numbers (assuming U=1). 879 Comparing it with the maximum out of two is the same as the 880 logical `AND' of two tests, which ensures drop probability rises 881 with the square of queuing time (Note 6). Again, the scaling 882 parameter is expressed as a power of 2 so that division can be 883 implemented as a right bit-shift in line 9 of the integer 884 pseudocode. 886 The marking/dropping functions in each queue (lines 3 & 9) are two 887 cases of a new generalization of RED called Curvy RED, motivated as 888 follows. When we compared the performance of our AQM with fq_CoDel 889 and PIE, we came to the conclusion that their goal of holding queuing 890 delay to a fixed target is misguided [CRED_Insights]. As the number 891 of flows increases, if the AQM does not allow TCP to increase queuing 892 delay, it has to introduce abnormally high levels of loss. Then loss 893 rather than queuing becomes the dominant cause of delay for short 894 flows, due to timeouts and tail losses. 896 Curvy RED constrains delay with a softened target that allows some 897 increase in delay as load increases. This is achieved by increasing 898 drop probability on a convex curve relative to queue growth (the 899 square curve in the Classic queue, if U=1). Like RED, the curve hugs 900 the zero axis while the queue is shallow. Then, as load increases, 901 it introduces a growing barrier to higher delay. But, unlike RED, it 902 requires only one parameter, the scaling, not three. The diadvantage 903 of Curvy RED is that it is not adapted to a wide range of RTTs. 904 Curvy RED can be used as is when the RTT range to support is limited 905 otherwise an adaptation mechanism is required. 907 There follows a summary listing of the two parameters used for each 908 of the two queues: 910 Classic: 912 S_C : The scaling factor of the dropping function scales Classic 913 queuing times in the range [0, 2^(S_C)] seconds into a dropping 914 probability in the range [0,1]. To make division efficient, it 915 is constrained to be an integer power of two; 917 f_C : To smooth the queuing time of the Classic queue and make 918 multiplication efficient, we use a negative integer power of 919 two for the dimensionless EWMA constant, which we define as 920 2^(-f_C). 922 L4S : 924 S_L (and k): As for the Classic queue, the scaling factor of the 925 L4S marking function scales Classic queueing times in the range 926 [0, 2^(S_L)] seconds into a probability in the range [0,1]. 927 Note that S_L = S_C + k, where k is the coupling between the 928 queues (Section 2.1). So S_L and k count as only one 929 parameter; 931 T : The queue size in bytes at which step threshold marking 932 starts in the L4S queue. 934 {ToDo: These are the raw parameters used within the algorithm. A 935 configuration front-end could accept more meaningful parameters and 936 convert them into these raw parameters.} 938 From our experiments so far, recommended values for these parameters 939 are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs 940 typical on the public Internet. [CRED_Insights] explains why these 941 parameters are applicable whatever rate link this AQM implementation 942 is deployed on and how the parameters would need to be adjusted for a 943 scenario with a different range of RTTs (e.g. a data centre) {ToDo 944 incorporate a summary of that report into this draft}. The setting of 945 k depends on policy (see Section 2.4 and Appendix C respectively for 946 its recommended setting and guidance on alternatives). 948 There is also a cUrviness parameter, U, which is a small positive 949 integer. It is likely to take the same hard-coded value for all 950 implementations, once experiments have determined a good value. We 951 have solely used U=1 in our experiments so far, but results might be 952 even better with U=2 or higher. 954 Note that the dropping function at line 9 calls maxrand(2*U), which 955 gives twice as much curviness as the call to maxrand(U) in the 956 marking function at line 3. This is the trick that implements the 957 square rule in equation (1) (Section 2.1). This is based on the fact 958 that, given a number X from 1 to 6, the probability that two dice 959 throws will both be less than X is the square of the probability that 960 one throw will be less than X. So, when U=1, the L4S marking 961 function is linear and the Classic dropping function is squared. If 962 U=2, L4S would be a square function and Classic would be quartic. 963 And so on. 965 The maxrand(u) function in lines 16-21 simply generates u random 966 numbers and returns the maximum (Note 7). Typically, maxrand(u) 967 could be run in parallel out of band. For instance, if U=1, the 968 Classic queue would require the maximum of two random numbers. So, 969 instead of calling maxrand(2*U) in-band, the maximum of every pair of 970 values from a pseudorandom number generator could be generated out- 971 of-band, and held in a buffer ready for the Classic queue to consume. 973 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 974 2: if ( lq.dequeue(pkt) ) { 975 3: if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U))) 976 4: mark(pkt) 977 5: return(pkt) % return the packet and stop here 978 6: } 979 7: while ( cq.dequeue(pkt) ) { 980 8: Q_C += (pkt.ns() - Q_C) >> f_C % Classic Q EWMA 981 9: if ( (Q_C >> (S_C-2) ) > maxrand(2*U) ) 982 10: drop(pkt) % Squared drop, redo loop 983 11: else 984 12: return(pkt) % return the packet and stop here 985 13: } 986 14: return(NULL) % no packet to dequeue 987 15: } 989 Figure 5: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM 990 using Integer Arithmetic 992 Notes: 994 1. The drain rate of the queue can vary if it is scheduled relative 995 to other queues, or to cater for fluctuations in a wireless 996 medium. To auto-adjust to changes in drain rate, the queue must 997 be measured in time, not bytes or packets [CoDel]. In our Linux 998 implementation, it was easiest to measure queuing time at 999 dequeue. Queuing time can be estimated when a packet is enqueued 1000 by measuring the queue length in bytes and dividing by the recent 1001 drain rate. 1003 2. An implementation has to use priority queueing, but it need not 1004 implement strict priority. 1006 3. If packets can be enqueued while processing dequeue code, an 1007 implementer might prefer to place the while loop around both 1008 queues so that it goes back to test again whether any L4S packets 1009 arrived while it was dropping a Classic packet. 1011 4. In order not to change too many factors at once, for now, we keep 1012 the marking function for DCTCP-only traffic as similar as 1013 possible to DCTCP. However, unlike DCTCP, all processing is at 1014 dequeue, so we determine whether to mark a packet at the head of 1015 the queue by the byte-length of the queue _behind_ it. We plan 1016 to test whether using queuing time will work in all 1017 circumstances, and if we find that the step can cause 1018 oscillations, we will investigate replacing it with a steep 1019 random marking curve. 1021 5. An EWMA is only one possible way to filter bursts; other more 1022 adaptive smoothing methods could be valid and it might be 1023 appropriate to decrease the EWMA faster than it increases. 1025 6. In practice at line 10 the Classic queue would probably test for 1026 ECN capability on the packet to determine whether to drop or mark 1027 the packet. However, for brevity such detail is omitted. All 1028 packets classified into the L4S queue have to be ECN-capable, so 1029 no dropping logic is necessary at line 3. Nonetheless, L4S 1030 packets could be dropped by overload code (see Section 4.1). 1032 7. In the integer variant of the pseudocode (Figure 5) real numbers 1033 are all represented as integers scaled up by 2^32. In lines 3 & 1034 9 the function maxrand() is arranged to return an integer in the 1035 range 0 <= maxrand() < 2^32. Queuing times are also scaled up by 1036 2^32, but in two stages: i) In lines 3 and 8 queuing times 1037 cq.ns() and pkt.ns() are returned in integer nanoseconds, making 1038 the values about 2^30 times larger than when the units were 1039 seconds, ii) then in lines 3 and 9 an adjustment of -2 to the 1040 right bit-shift multiplies the result by 2^2, to complete the 1041 scaling by 2^32. 1043 Appendix C. Guidance on Controlling Throughput Equivalence 1045 +---------------+------+-------+ 1046 | RTT_C / RTT_L | Reno | Cubic | 1047 +---------------+------+-------+ 1048 | 1 | k=1 | k=0 | 1049 | 2 | k=2 | k=1 | 1050 | 3 | k=2 | k=2 | 1051 | 4 | k=3 | k=2 | 1052 | 5 | k=3 | k=3 | 1053 +---------------+------+-------+ 1055 Table 1: Value of k for which DCTCP throughput is roughly the same as 1056 Reno or Cubic, for some example RTT ratios 1058 To determine the appropriate policy, the operator first has to judge 1059 whether it wants DCTCP flows to have roughly equal throughput with 1060 Reno or with Cubic (because, even in its Reno-compatibility mode, 1061 Cubic is about 1.4 times more aggressive than Reno). Then the 1062 operator needs to decide at what ratio of RTTs it wants DCTCP and 1063 Classic flows to have roughly equal throughput. For example choosing 1064 the recommended value of k=0 will make DCTCP throughput roughly the 1065 same as Cubic, _if their RTTs are the same_. 1067 However, even if the base RTTs are the same, the actual RTTs are 1068 unlikely to be the same, because Classic (Cubic or Reno) traffic 1069 needs a large queue to avoid under-utilization and excess drop, 1070 whereas L4S (DCTCP) does not. The operator might still choose this 1071 policy if it judges that DCTCP throughput should be rewarded for 1072 keeping its own queue short. 1074 On the other hand, the operator will choose one of the higher values 1075 for k, if it wants to slow DCTCP down to roughly the same throughput 1076 as Classic flows, to compensate for Classic flows slowing themselves 1077 down by causing themselves extra queuing delay. 1079 The values for k in the table are derived from the formulae, which 1080 was developed in [DCttH15]: 1082 2^k = 1.64 (RTT_reno / RTT_dc) (2) 1083 2^k = 1.19 (RTT_cubic / RTT_dc ) (3) 1085 For localized traffic from a particular ISP's data centre, we used 1086 the measured RTTs to calculate that a value of k=3 would achieve 1087 throughput equivalence, and our experiments verified the formula very 1088 closely. 1090 Authors' Addresses 1092 Koen De Schepper 1093 Nokia Bell Labs 1094 Antwerp 1095 Belgium 1097 Email: koen.de_schepper@nokia.com 1098 URI: https://www.bell-labs.com/usr/koen.de_schepper 1100 Bob Briscoe (editor) 1101 Simula Research Lab 1103 Email: ietf@bobbriscoe.net 1104 URI: http://bobbriscoe.net/ 1105 Olga Bondarenko 1106 Simula Research Lab 1107 Lysaker 1108 Norway 1110 Email: olgabnd@gmail.com 1111 URI: https://www.simula.no/people/olgabo 1113 Ing-jyh Tsang 1114 Nokia Bell Labs 1115 Antwerp 1116 Belgium 1118 Email: ing-jyh.tsang@nokia.com