idnits 2.17.1 draft-ietf-tsvwg-aqm-dualq-coupled-08.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (November 04, 2018) is 2001 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '0' on line 1697 -- Looks like a reference, but probably isn't: '1' on line 1697 == Outdated reference: A later version (-02) exists of draft-briscoe-tsvwg-l4s-diffserv-00 == Outdated reference: A later version (-29) exists of draft-ietf-tsvwg-ecn-l4s-id-02 == Outdated reference: A later version (-20) exists of draft-ietf-tsvwg-l4s-arch-02 -- Obsolete informational reference (is this intentional?): RFC 2309 (Obsoleted by RFC 7567) -- Obsolete informational reference (is this intentional?): RFC 8312 (Obsoleted by RFC 9438) Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Transport Area working group (tsvwg) K. De Schepper 3 Internet-Draft Nokia Bell Labs 4 Intended status: Experimental B. Briscoe, Ed. 5 Expires: May 8, 2019 CableLabs 6 O. Bondarenko 7 Simula Research Lab 8 I. Tsang 9 Nokia 10 November 04, 2018 12 DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput 13 (L4S) 14 draft-ietf-tsvwg-aqm-dualq-coupled-08 16 Abstract 18 The Low Latency Low Loss Scalable Throughput (L4S) architecture 19 allows data flows over the public Internet to predictably achieve 20 ultra-low queuing latency, generally zero congestion loss and scaling 21 of per-flow throughput without the problems of traditional TCP. To 22 achieve this, L4S data flows use a 'scalable' congestion control 23 similar to Data Centre TCP (DCTCP) and a form of Explicit Congestion 24 Notification (ECN) with modified behaviour. However, until now, 25 scalable congestion controls did not co-exist with existing TCP Reno/ 26 Cubic traffic---scalable controls are so aggressive that 'Classic' 27 TCP algorithms drive themselves to starvation. Therefore, until now, 28 L4S controls could only be deployed where a clean-slate environment 29 could be arranged, such as in private data centres (hence the name 30 DCTCP). This specification defines `DualQ Coupled Active Queue 31 Management (AQM)', which enables these scalable congestion controls 32 to safely co-exist with Classic Internet traffic. 34 The Coupled AQM ensures that a flow runs at about the same rate 35 whether it uses DCTCP or TCP Reno/Cubic. It achieves this 36 indirectly, without having to inspect transport layer flow 37 identifiers, When tested in a residential broadband setting, DCTCP 38 also achieves sub-millisecond average queuing delay and zero 39 congestion loss under a wide range of mixes of DCTCP and `Classic' 40 broadband Internet traffic, without compromising the performance of 41 the Classic traffic. The solution also reduces network complexity 42 and eliminates network configuration. 44 Status of This Memo 46 This Internet-Draft is submitted in full conformance with the 47 provisions of BCP 78 and BCP 79. 49 Internet-Drafts are working documents of the Internet Engineering 50 Task Force (IETF). Note that other groups may also distribute 51 working documents as Internet-Drafts. The list of current Internet- 52 Drafts is at https://datatracker.ietf.org/drafts/current/. 54 Internet-Drafts are draft documents valid for a maximum of six months 55 and may be updated, replaced, or obsoleted by other documents at any 56 time. It is inappropriate to use Internet-Drafts as reference 57 material or to cite them other than as "work in progress." 59 This Internet-Draft will expire on May 8, 2019. 61 Copyright Notice 63 Copyright (c) 2018 IETF Trust and the persons identified as the 64 document authors. All rights reserved. 66 This document is subject to BCP 78 and the IETF Trust's Legal 67 Provisions Relating to IETF Documents 68 (https://trustee.ietf.org/license-info) in effect on the date of 69 publication of this document. Please review these documents 70 carefully, as they describe your rights and restrictions with respect 71 to this document. Code Components extracted from this document must 72 include Simplified BSD License text as described in Section 4.e of 73 the Trust Legal Provisions and are provided without warranty as 74 described in the Simplified BSD License. 76 Table of Contents 78 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 79 1.1. Problem and Scope . . . . . . . . . . . . . . . . . . . . 3 80 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 81 1.3. Features . . . . . . . . . . . . . . . . . . . . . . . . 6 82 2. DualQ Coupled AQM . . . . . . . . . . . . . . . . . . . . . . 7 83 2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 8 84 2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 9 85 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 9 86 2.4. Overall DualQ Coupled AQM Structure . . . . . . . . . . . 10 87 2.5. Normative Requirements for a DualQ Coupled AQM . . . . . 12 88 2.5.1. Functional Requirements . . . . . . . . . . . . . . . 12 89 2.5.1.1. Requirements in Unexpected Cases . . . . . . . . 13 90 2.5.2. Management Requirements . . . . . . . . . . . . . . . 15 91 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 16 92 4. Security Considerations . . . . . . . . . . . . . . . . . . . 16 93 4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 16 94 4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput 95 or Delay? . . . . . . . . . . . . . . . . . . . . . . 17 96 4.1.2. Congestion Signal Saturation: Introduce L4S Drop or 97 Delay? . . . . . . . . . . . . . . . . . . . . . . . 18 98 4.1.3. Protecting against Unresponsive ECN-Capable Traffic . 19 99 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 19 100 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 20 101 6.1. Normative References . . . . . . . . . . . . . . . . . . 20 102 6.2. Informative References . . . . . . . . . . . . . . . . . 20 103 Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 23 104 A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 23 105 A.2. Pass #2: Overload Details . . . . . . . . . . . . . . . . 30 106 Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 33 107 Appendix C. Guidance on Controlling Throughput Equivalence . . . 39 108 Appendix D. Open Issues . . . . . . . . . . . . . . . . . . . . 40 109 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 41 111 1. Introduction 113 1.1. Problem and Scope 115 Latency is becoming the critical performance factor for many (most?) 116 applications on the public Internet, e.g. interactive Web, Web 117 services, voice, conversational video, interactive video, interactive 118 remote presence, instant messaging, online gaming, remote desktop, 119 cloud-based applications, and video-assisted remote control of 120 machinery and industrial processes. In the developed world, further 121 increases in access network bit-rate offer diminishing returns, 122 whereas latency is still a multi-faceted problem. In the last decade 123 or so, much has been done to reduce propagation time by placing 124 caches or servers closer to users. However, queuing remains a major 125 intermittent component of latency. 127 The Diffserv architecture provides Expedited Forwarding [RFC3246], so 128 that low latency traffic can jump the queue of other traffic. 129 However, on access links dedicated to individual sites (homes, small 130 enterprises or mobile devices), often all traffic at any one time 131 will be latency-sensitive and, if all the traffic on a link is marked 132 as EF, Diffserv cannot reduce the delay of any of it. In contrast, 133 the Low Latency Low Loss Scalable throughput (L4S) approach removes 134 the causes of any unnecessary queuing delay. 136 The bufferbloat project has shown that excessively-large buffering 137 (`bufferbloat') has been introducing significantly more delay than 138 the underlying propagation time. These delays appear only 139 intermittently--only when a capacity-seeking (e.g. TCP) flow is long 140 enough for the queue to fill the buffer, making every packet in other 141 flows sharing the buffer sit through the queue. 143 Active queue management (AQM) was originally developed to solve this 144 problem (and others). Unlike Diffserv, which gives low latency to 145 some traffic at the expense of others, AQM controls latency for _all_ 146 traffic in a class. In general, AQMs introduce an increasing level 147 of discard from the buffer the longer the queue persists above a 148 shallow threshold. This gives sufficient signals to capacity-seeking 149 (aka. greedy) flows to keep the buffer empty for its intended 150 purpose: absorbing bursts. However, RED [RFC2309] and other 151 algorithms from the 1990s were sensitive to their configuration and 152 hard to set correctly. So, AQM was not widely deployed in the 1990s. 154 More recent state-of-the-art AQMs, e.g. fq_CoDel [RFC8290], 155 PIE [RFC8033], Adaptive RED [ARED01], are easier to configure, 156 because they define the queuing threshold in time not bytes, so it is 157 invariant for different link rates. However, no matter how good the 158 AQM, the sawtoothing rate of TCP will either cause queuing delay to 159 vary or cause the link to be under-utilized. Even with a perfectly 160 tuned AQM, the additional queuing delay will be of the same order as 161 the underlying speed-of-light delay across the network. Flow-queuing 162 can isolate one flow from another, but it cannot isolate a TCP flow 163 from the delay variations it inflicts on itself, and it has other 164 problems - it overrides the flow rate decisions of variable rate 165 video applications, it does not recognise the flows within IPSec VPN 166 tunnels and it is relatively expensive to implement. 168 It seems that further changes to the network alone will now yield 169 diminishing returns. Data Centre TCP (DCTCP [RFC8257]) teaches us 170 that a small but radical change to TCP is needed to cut two major 171 outstanding causes of queuing delay variability: 173 1. the `sawtooth' varying rate of TCP itself; 175 2. the smoothing delay deliberately introduced into AQMs to permit 176 bursts without triggering losses. 178 The former causes a flow's round trip time (RTT) to vary from about 1 179 to 2 times the base RTT between the machines in question. The latter 180 delays the system's response to change by a worst-case 181 (transcontinental) RTT, which could be hundreds of times the actual 182 RTT of typical traffic from localized CDNs. 184 Latency is not our only concern: 186 3. It was known when TCP was first developed that it would not scale 187 to high bandwidth-delay products [TCP-CA]. 189 Given regular broadband bit-rates over WAN distances are 190 already [RFC3649] beyond the scaling range of `classic' TCP Reno, 191 `less unscalable' Cubic [RFC8312] and 192 Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been 193 successfully deployed. However, these are now approaching their 194 scaling limits. Unfortunately, fully scalable TCPs such as DCTCP 195 cause `classic' TCP to starve itself, which is why they have been 196 confined to private data centres or research testbeds (until now). 198 This document specifies a `DualQ Coupled AQM' extension that solves 199 the problem of coexistence between scalable and classic flows, 200 without having to inspect flow identifiers. The AQM is not like 201 flow-queuing approaches [RFC8290] that classify packets by flow 202 identifier into numerous separate queues in order to isolate sparse 203 flows from the higher latency in the queues assigned to heavier 204 flows. In contrast, the AQM exploits the behaviour of scalable 205 congestion controls like DCTCP so that every packet in every flow 206 sharing the queue for DCTCP-like traffic can be served with very low 207 latency. 209 This AQM extension can be combined with any AQM designed for a single 210 queue that generates a statistical or deterministic mark/drop 211 probability driven by the queue dynamics. In many cases it 212 simplifies the basic control algorithm, and requires little extra 213 processing. Therefore it is believed the Coupled AQM would be 214 applicable and easy to deploy in all types of buffers; buffers in 215 cost-reduced mass-market residential equipment; buffers in end-system 216 stacks; buffers in carrier-scale equipment including remote access 217 servers, routers, firewalls and Ethernet switches; buffers in network 218 interface cards, buffers in virtualized network appliances, 219 hypervisors, and so on. 221 For the public Internet, nearly all the benefit will typically be 222 achieved by deploying the Coupled AQM into either end of the access 223 link between a 'site' and the Internet, which is invariably the 224 bottleneck. Here, the term 'site' is used loosely to mean a home, an 225 office, a campus or mobile user equipment. 227 The overall L4S architecture [I-D.ietf-tsvwg-l4s-arch] gives more 228 detail, including on wider deployment aspects such as coexistence in 229 bottlenecks where a DualQ Coupled AQM has not been deployed. The 230 supporting papers [PI2] and [DCttH15] give the full rationale for the 231 AQM's design, both discursively and in more precise mathematical 232 form. 234 1.2. Terminology 236 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 237 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 238 document are to be interpreted as described in [RFC2119] when, and 239 only when, they appear in all capitals, as shown here. 241 The DualQ Coupled AQM uses two queues for two services. Each of the 242 following terms identifies both the service and the queue that 243 provides the service: 245 Classic (denoted by subscript C): The `Classic' service is intended 246 for all the behaviours that currently co-exist with TCP Reno (TCP 247 Cubic, Compound, SCTP, etc). 249 Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L): 250 The `L4S' service is intended for a set of congestion controls 251 with scalable properties (e.g. DCTCP [RFC8257], Relentless 252 TCP [Mathis09], the L4S variant of SCREAM for real-time 253 media {ToDo: ref}). For the public Internet a scalable control 254 has to comply with the requirements in [I-D.ietf-tsvwg-ecn-l4s-id] 255 (aka. the 'TCP Prague requirements'). 257 Either service can cope with a proportion of unresponsive or less- 258 responsive traffic as well, as long (e.g. DNS, VoIP, game sync 259 datagrams, etc), just as a single queue AQM can if this traffic makes 260 minimal contribution to queuing. The DualQ Coupled AQM behaviour 261 below is defined to be similar to a single FIFO queue with respect to 262 unresponsive and overload traffic. 264 1.3. Features 266 The AQM couples marking and/or dropping across the two queues such 267 that a flow will get roughly the same throughput whichever it uses. 268 Therefore both queues can feed into the full capacity of a link and 269 no rates need to be configured for the queues. The L4S queue enables 270 scalable congestion controls like DCTCP to give stunningly low and 271 predictably low latency, without compromising the performance of 272 competing 'Classic' Internet traffic. Thousands of tests have been 273 conducted in a typical fixed residential broadband setting. Typical 274 experiments used base round trip delays up to 100ms between the data 275 centre and home network, and large amounts of background traffic in 276 both queues. For every L4S packet, the AQM kept the average queuing 277 delay below 1ms (or 2 packets if serialization delay is bigger for 278 slow links), and no losses at all were introduced by the AQM. 279 Details of the extensive experiments are available [PI2] [DCttH15]. 281 Subjective testing was also conducted by multiple people all 282 simultaneously using very demanding high bandwidth low latency 283 applications over a single shared access link [L4Sdemo16]. In one 284 application, each user could use finger gestures to pan or zoom their 285 own high definition (HD) sub-window of a larger video scene generated 286 on the fly in 'the cloud' from a football match. Another user 287 wearing VR goggles was remotely receiving a feed from a 360-degree 288 camera in a racing car, again with the sub-window in their field of 289 vision generated on the fly in 'the cloud' dependent on their head 290 movements. Even though other users were also downloading large 291 amounts of L4S and Classic data, playing a gaming benchmark and 292 watchings videos over the same 40Mb/s downstream broadband link, 293 latency was so low that the football picture appeared to stick to the 294 user's finger on the touchpad and the experience fed from the remote 295 camera did not noticeably lag head movements. All the L4S data (even 296 including the downloads) achieved the same ultra-low latency. With 297 an alternative AQM, the video noticeably lagged behind the finger 298 gestures and head movements. 300 Unlike Diffserv Expedited Forwarding, the L4S queue does not have to 301 be limited to a small proportion of the link capacity in order to 302 achieve low delay. The L4S queue can be filled with a heavy load of 303 capacity-seeking flows like DCTCP and still achieve low delay. The 304 L4S queue does not rely on the presence of other traffic in the 305 Classic queue that can be 'overtaken'. It gives low latency to L4S 306 traffic whether or not there is Classic traffic, and the latency of 307 Classic traffic does not suffer when a proportion of the traffic is 308 L4S. The two queues are only necessary because DCTCP-like flows 309 cannot keep latency predictably low and keep utilization high if they 310 are mixed with legacy TCP flows, 312 The experiments used the Linux implementation of DCTCP that is 313 deployed in private data centres, without any modification despite 314 its known deficiencies. Nonetheless, certain modifications will be 315 necessary before DCTCP is safe to use on the Internet, which are 316 recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id]. However, the 317 focus of this specification is to get the network service in place. 318 Then, without any management intervention, applications can exploit 319 it by migrating to scalable controls like DCTCP, which can then 320 evolve _while_ their benefits are being enjoyed by everyone on the 321 Internet. 323 2. DualQ Coupled AQM 325 There are two main aspects to the approach: 327 o the Coupled AQM that addresses throughput equivalence between 328 Classic (e.g. Reno, Cubic) flows and L4S flows (that satisfy the 329 TCP Prague requirements). 331 o the Dual Queue structure that provides latency separation for L4S 332 flows to isolate them from the typically large Classic queue. 334 2.1. Coupled AQM 336 In the 1990s, the `TCP formula' was derived for the relationship 337 between TCP's congestion window, cwnd, and its drop probability, p. 338 To a first order approximation, cwnd of TCP Reno is inversely 339 proportional to the square root of p. 341 We focus on Reno as the worst case, because if we do not harm Reno, 342 we will not harm Cubic. Nonetheless, TCP Cubic implements a Reno- 343 compatibility mode, which is the only relevant mode for typical RTTs 344 under 20ms as long as the throughput of a single flow is less than 345 about 500Mb/s. Therefore it can be assumed that Cubic traffic 346 behaves similarly to Reno (but with a slightly different constant of 347 proportionality). The term 'Classic' will be used for the collection 348 of Reno-friendly traffic including Cubic in Reno mode. 350 The supporting paper [PI2] includes the derivation of the equivalent 351 rate equation for DCTCP, for which cwnd is inversely proportional to 352 p (not the square root), where in this case p is the ECN marking 353 probability. DCTCP is not the only congestion control that behaves 354 like this, so the term 'L4S' traffic will be used for all similar 355 behaviour. 357 For safe co-existence, under stationary conditions, a DCTCP flow has 358 to run at roughly the same rate as a Reno TCP flow (all other factors 359 being equal). So the drop or marking probability for Classic 360 traffic, p_C has to be distinct from the marking probability for L4S 361 traffic, p_L. [RFC8311] updates the original ECN specification 362 [RFC3168] to allow these probabilities to be distinct, because RFC 363 3168 required them to be the same. 365 Also, to remain stable, Classic sources need the network to smooth 366 p_C so it changes relatively slowly. In contrast, L4S avoids 367 smoothing in the network, because it delays all signals for a worst- 368 case RTT. So instead, L4S sources smooth the ECN marking probability 369 themselves, so they expect the network to generate ECN marks with a 370 probability p_L that tracks the instantaneous unsmoothed queue. 372 The Coupled AQM achieves safe coexistence by making the Classic drop 373 probability p_C proportional to the square of the coupled L4S 374 probability p_CL. p_CL is an input to the instantaneous L4S marking 375 probability p_L but it changes as slowly as p_C. This makes the Reno 376 flow rate roughly equal the DCTCP flow rate, because the squaring of 377 p_CL counterbalances the square root of p_C in the Classic 'TCP 378 formula'. 380 Stating this as a formula, the relation between Classic drop 381 probability, p_C, and the coupled L4S probability p_CL needs to take 382 the form: 384 p_C = ( p_CL / k )^2 (1) 386 where k is the constant of proportionality, which we shall call the 387 coupling factor. 389 2.2. Dual Queue 391 Classic traffic typically builds a large queue to prevent under- 392 utilization. Therefore a separate queue is provided for L4S traffic, 393 and it is scheduled with priority over Classic. Priority is 394 conditional to prevent starvation of Classic traffic. 396 Nonetheless, coupled marking ensures that giving priority to L4S 397 traffic still leaves the right amount of spare scheduling time for 398 Classic flows to each get equivalent throughput to DCTCP flows (all 399 other factors such as RTT being equal). 401 2.3. Traffic Classification 403 Both the Coupled AQM and DualQ mechanisms need an identifier to 404 distinguish L and C packets. Then the coupling algorithm can achieve 405 coexistence without having to inspect flow identifiers, because it 406 can apply the appropriate marking or dropping probability to all 407 flows of each type. A separate 408 specification [I-D.ietf-tsvwg-ecn-l4s-id] requires the sender to use 409 the ECT(1) codepoint of the ECN field as this identifier, having 410 assessed various alternatives. An additional process document has 411 proved necessary to make the ECT(1) codepoint available for 412 experimentation [RFC8311]. 414 For policy reasons, an operator might choose to steer certain packets 415 (e.g. from certain flows or with certain addresses) out of the L 416 queue, even though they identify themselves as L4S by their ECN 417 codepoints. In such cases, the device MUST NOT alter the ECN field, 418 so that it is preserved end-to-end. The aim is that each operator 419 can choose how it treats L4S traffic locally, but an individual 420 operator does not alter the identification of L4S packets, which 421 would prevent other operators downstream from making their own 422 choices on how to treat L4S traffic. 424 In addition, other identifiers could be used to classify certain 425 additional packet types into the L queue, that are deemed not to risk 426 harming the L4S service. For instance addresses of specific 427 applications or hosts (see [I-D.ietf-tsvwg-ecn-l4s-id]), specific 428 Diffserv codepoints such as EF (Expedited Forwarding) and Voice-Admit 429 service classes (see [I-D.briscoe-tsvwg-l4s-diffserv]) or certain 430 protocols (e.g. ARP, DNS). 432 Note that the mechanism only reads these classifiers, it MUST NOT re- 433 mark or alter these identifiers (except for marking the ECN field 434 with the CE codepoint - with increasing frequency to indicate 435 increasing congestion). 437 2.4. Overall DualQ Coupled AQM Structure 439 Figure 1 shows the overall structure that any DualQ Coupled AQM is 440 likely to have. This schematic is intended to aid understanding of 441 the current designs of DualQ Coupled AQMs. However, it is not 442 intended to preclude other innovative ways of satisfying the 443 normative requirements in Section 2.5 that minimally define a DualQ 444 Coupled AQM. 446 The classifier on the left separates incoming traffic between the two 447 queues (L and C). Each queue has its own AQM that determines the 448 likelihood of marking or dropping (p_L and p_C). It has been 449 proved [PI2] that it is preferable to control load with a linear 450 controller, then square the output before applying it as a drop 451 probability to TCP (because TCP decreases its load proportional to 452 the square-root of the increase in drop). So, the AQM for Classic 453 traffic needs to be implemented in two stages: i) a base stage that 454 outputs an internal probability p' (pronounced p-prime); and ii) a 455 squaring stage that outputs p_C, where 457 p_C = (p')^2. (2) 459 Substituting for p_C in Eqn (1) gives: 461 p' = p_CL / k 463 So the slow-moving input to ECN marking in the L queue (the coupled 464 L4S probability) is: 466 p_CL = k*p', (3) 468 where k is the constant coupling factor (see Appendix C). 470 It can be seen that these two transformations of p' implement the 471 required coupling given in equation (1) earlier. 473 The actual probability p_L that we apply to the L queue needs to 474 track the immediate L queue delay, as well as track p_CL under 475 stationary conditions. So we use a native AQM in the L queue that 476 calculates a probability p'_L as a function of the instantaneous L 477 queue. And, given the L queue has conditional strict priority over 478 the C queue, whenever the L queue grows, we should apply marking 479 probability p'_L, but p_L should not fall below p_CL. This suggests: 481 p_L = max(p'_L, p_CL), (4) 483 which has also been found to work very well in practice. 485 _________ 486 | | ,------. 487 L4S queue | |===>| ECN | 488 ,'| _______|_| |marker|\ 489 <' | | `------'\\ 490 //`' v ^ p_L \\ 491 // ,-------. | \\ 492 // |Native |p'_L | \\,. 493 // | L4S |--->(MAX) < | ___ 494 ,----------.// | AQM | ^ p_CL `\|.'Cond-`. 495 | IP-ECN |/ `-------' | / itional \ 496 ==>|Classifier| ,-------. (k*p') [ priority]==> 497 | |\ | Base | | \scheduler/ 498 `----------'\\ | AQM |---->: ,'|`-.___.-' 499 \\ | |p' | <' | 500 \\ `-------' (p'^2) //`' 501 \\ ^ | // 502 \\,. | v p_C // 503 < | _________ .------.// 504 `\| | | | Drop |/ 505 Classic |queue |===>|/mark | 506 __|______| `------' 508 Legend: ===> traffic flow; ---> control dependency. 510 Figure 1: DualQ Coupled AQM Schematic 512 After the AQMs have applied their dropping or marking, the scheduler 513 forwards their packets to the link, giving priority to L4S traffic. 514 Priority has to be conditional in some way (see Section 4.1). Simple 515 strict priority is inappropriate otherwise it could lead the L4S 516 queue to starve the Classic queue. For example, consider the case 517 where a continually busy L4S queue blocks a DNS request in the 518 Classic queue, arbitrarily delaying the start of a new Classic flow. 520 Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are 521 given in Appendix A and Appendix B. Either example AQM can be used 522 to couple packet marking and dropping across a dual Q. 524 DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM. 525 Indeed, this Base AQM with just the squared output and no L4S queue 526 can be used as a drop-in replacement for PIE [RFC8033], in which case 527 we call it just PI2 [PI2]. PI2 is a principled simplification of PIE 528 that is both more responsive and more stable in the face of 529 dynamically varying load. 531 Curvy RED is derived from RED [RFC2309], but its configuration 532 parameters are insensitive to link rate and it requires less 533 operations per packet. However, DualPI2 is more responsive and 534 stable over a wider range of RTTs than Curvy RED. As a consequence, 535 DualPI2 has attracted more development attention than Curvy RED, 536 leaving the Curvy RED design incomplete and not so fully evaluated. 538 Both AQMs regulate their queue in units of time not bytes. As 539 already explained, this ensures configuration can be invariant for 540 different drain rates. With AQMs in a dualQ structure this is 541 particularly important because the drain rate of each queue can vary 542 rapidly as flows for the two queues arrive and depart, even if the 543 combined link rate is constant. 545 It would be possible to control the queues with other alternative 546 AQMs, as long as the normative requirements (those expressed in 547 capitals) in Section 2.5 are observed. 549 2.5. Normative Requirements for a DualQ Coupled AQM 551 The following requirements are intended to capture only the essential 552 aspects of a DualQ Coupled AQM. They are intended to be independent 553 of the particular AQMs used for each queue. 555 2.5.1. Functional Requirements 557 A Dual Queue Coupled AQM implementation MUST utilize two queues, each 558 with an AQM algorithm. The two queues can be part of a larger 559 queuing hierarchy [I-D.briscoe-tsvwg-l4s-diffserv]. 561 The AQM algorithm for the low latency (L) queue MUST apply ECN 562 marking. 564 The scheduler draining the two queues MUST give L4S packets priority 565 over Classic, although priority MUST be bounded in order not to 566 starve Classic traffic. 568 [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on 569 L4S traffic, relative to drop of Classic traffic. In order to 570 prevent starvation of Classic traffic by scalable L4S traffic, it 571 says, "The likelihood that an AQM drops a Not-ECT Classic packet 572 (p_C) MUST be roughly proportional to the square of the likelihood 573 that it would have marked it if it had been an L4S packet (p_L)." 574 The term 'likelihood' is used to allow for marking and dropping to be 575 either probabilistic or deterministic. 577 For the current specification, this translates into the following 578 requirement. A DualQ Coupled AQM MUST apply ECN marking to traffic 579 in the L queue that is no lower than that derived from the likelihood 580 of drop (or ECN marking) in the Classic queue using Eqn. (1). 582 The constant of proportionality, k, in Eqn (1) determines the 583 relative flow rates of Classic and L4S flows when the AQM concerned 584 is the bottleneck (all other factors being equal). 585 [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality 586 (k) does not have to be standardised for interoperability, but a 587 value of 2 is RECOMMENDED." 589 Assuming scalable congestion controls for the Internet will be as 590 aggressive as DCTCP, this will ensure their congestion window will be 591 roughly the same as that of a standards track TCP congestion control 592 (Reno) [RFC5681] and other so-called TCP-friendly controls, such as 593 TCP Cubic in its TCP-friendly mode. 595 The choice of k is a matter of operator policy, and operators MAY 596 choose a different value using Table 1 and the guidelines in 597 Appendix C. 599 If multiple users share capacity at a bottleneck (e.g. in the 600 Internet access link of a campus network), the operator's choice of k 601 will determine capacity sharing between the flows of different users. 602 However, on the public Internet, access network operators typically 603 isolate customers from each other with some form of layer-2 604 multiplexing (OFDM(A) in DOCSIS3.1, CDMA in 3G, SC-FDMA in LTE) or L3 605 scheduling (WRR in DSL), rather than relying on TCP to share capacity 606 between customers [RFC0970]. In such cases, the choice of k will 607 solely affect relative flow rates within each customer's access 608 capacity, not between customers. Also, k will not affect relative 609 flow rates at any times when all flows are Classic or all L4S, and it 610 will not affect the relative throughput of small flows. 612 2.5.1.1. Requirements in Unexpected Cases 614 The flexibility to allow operator-specific classifiers (Section 2.3) 615 leads to the need to specify what the AQM in each queue ought to do 616 with packets that do not carry the ECN field expected for that queue. 617 It is recommended that the AQM in each queue inspects the ECN field 618 to determine what sort of congestion notification to signal, then 619 decides whether to apply congestion notification to this particular 620 packet, as follows: 622 o If a packet that does not carry an ECT(1) or CE codepoint is 623 classified into the L queue: 625 * if the packet is ECT(0), the L AQM SHOULD apply CE-marking 626 using a probability appropriate to Classic congestion control 627 and appropriate to the target delay in the L queue 629 * if the packet is Not-ECT, the appropriate action depends on 630 whether some other function is protecting the L queue from 631 misbehaving flows (e.g. per-flow queue protection or latency 632 policing): 634 + If separate queue protection is provided, the L AQM SHOULD 635 ignore the packet and forward it unchanged, meaning it 636 should not calculate whether to apply congestion 637 notification and it should neither drop nor CE-mark the 638 packet (for instance, the operator might classify EF traffic 639 that is unresponsive to drop into the L queue, alongside 640 responsive L4S-ECN traffic) 642 + if separate queue protection is not provided, the L AQM 643 SHOULD apply drop using a drop probability appropriate to 644 Classic congestion control and appropriate to the target 645 delay in the L queue 647 o If a packet that carries an ECT(1) codepoint is classified into 648 the C queue: 650 * the C AQM SHOULD apply CE-marking using the coupled AQM 651 probability p_CL (= k*p'). 653 If the DualQ Coupled AQM has detected overload, it will signal 654 congestion solely using drop, irrespective of the ECN field. 656 The above requirements are worded as "SHOULDs", because operator- 657 specific classifiers are for flexibility, by definition. Therefore, 658 alternative actions might be appropriate in the operator's specific 659 circumstances. An example would be where the operator knows that 660 certain legacy traffic marked with one codepoint actually has a 661 congestion response associated with another codepoint. 663 2.5.2. Management Requirements 665 By default, a DualQ Coupled AQM SHOULD NOT need any configuration for 666 use at a bottleneck on the public Internet [RFC7567]. The following 667 parameters MAY be operator-configurable, e.g. to tune for non- 668 Internet settings: 670 o Optional packet classifier(s) to use in addition to the ECN field 671 (see Section 2.3); 673 o Expected typical RTT (a parameter for typical or target queuing 674 delay in each queue might be configurable instead; if so it MUST 675 be expressed in units of time); 677 o Expected maximum RTT (a stability parameter that depends on 678 maximum RTT might be configurable instead); 680 o Coupling factor, k; 682 o The limit to the conditional priority of L4S (scheduler-dependent, 683 e.g. the scheduler weight for WRR, or the time-shift for time- 684 shifted FIFO); 686 o The maximum Classic ECN marking probability, p_Cmax, before 687 switching over to drop. 689 An experimental DualQ Coupled AQM SHOULD allow the operator to 690 monitor each of the following operational statistics on demand, per 691 queue and per configurable sample interval, for performance 692 monitoring and perhaps also for accounting in some cases: 694 o Bits forwarded, from which utilization can be calculated; 696 o Total packets arriving, enqueued and dequeued to distinguish tail 697 discard from proactive AQM discard; 699 o ECN packets marked, non-ECN packets dropped, ECN packets dropped, 700 from which marking and dropping probabilities can be calculated; 702 o Queue delay (not including serialization delay of the head packet 703 or medium acquisition delay) - see further notes below. 705 Unlike the other statistics, queue delay cannot be captured in a 706 simple accumulating counter. Therefore the type of queue delay 707 statistics produced (mean, percentiles, etc.) will depend on 708 implementation constraints. To facilitate comparative evaluation 709 of different implementations and approaches, an implementation 710 SHOULD allow mean and 99th percentile queue delay to be derived 711 (per queue per sample interval). A relatively simple way to do 712 this would be to store a coarse-grained histogram of queue delay. 713 This could be done with a small number of bins with configurable 714 edges that represent contiguous ranges of queue delay. Then, over 715 a sample interval, each bin would accumulate a count of the number 716 of packets that had fallen within each range. The maximum queue 717 delay per queue per interval MAY also be recorded. 719 An experimental DualQ Coupled AQM SHOULD asynchronously report the 720 following data about anomalous conditions: 722 o Start-time and duration of overload state. 724 A hysteresis mechanism SHOULD be used to prevent flapping in and 725 out of overload causing an event storm. For instance, exit from 726 overload state could trigger one report, but also latch a timer. 727 Then, during that time, if the AQM enters and exits overload state 728 any number of times, the duration in overload state is accumulated 729 but no new report is generated until the first time the AQM is out 730 of overload once the timer has expired. 732 [RFC5706] suggests that deployment, coexistence and scaling should 733 also be covered as management requirements. The raison d'etre of the 734 DualQ Couple AQM is to enable deployment and coexistence of scalable 735 congestion controls - as incremental replacements for today's TCP- 736 friendly controls that do not scale with bandwidth-delay product. 737 Therefore, these motivating issues are explained in the Introduction 738 and detailed in the L4S architecture [I-D.ietf-tsvwg-l4s-arch]. 739 Also, the descriptions of specific DualQ Coupled AQM algorithms in 740 the appendices cover scaling of their configuration parameters, e.g. 741 with respect to RTT and sampling frequency. 743 3. IANA Considerations 745 This specification contains no IANA considerations. 747 4. Security Considerations 749 4.1. Overload Handling 751 Where the interests of users or flows might conflict, it could be 752 necessary to police traffic to isolate any harm to the performance of 753 individual flows. However it is hard to avoid unintended side- 754 effects with policing, and in a trusted environment policing is not 755 necessary. Therefore per-flow policing needs to be separable from a 756 basic AQM, as an option under policy control. 758 However, a basic DualQ AQM does at least need to handle overload. A 759 useful objective would be for the overload behaviour of the DualQ AQM 760 to be at least no worse than a single queue AQM. However, a trade- 761 off needs to be made between complexity and the risk of either 762 traffic class harming the other. In each of the following three 763 subsections, an overload issue specific to the DualQ is described, 764 followed by proposed solution(s). 766 Under overload the higher priority L4S service will have to sacrifice 767 some aspect of its performance. Alternative solutions are provided 768 below that each relax a different factor: e.g. throughput, delay, 769 drop. These choices need to be made either by the developer or by 770 operator policy, rather than by the IETF. 772 4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay? 774 Priority of L4S is required to be conditional to avoid total 775 throughput starvation of Classic by heavy L4S traffic. This raises 776 the question of whether to sacrifice L4S throughput or L4S delay (or 777 some other policy) to mitigate starvation of Classic: 779 Sacrifice L4S throughput: By using weighted round robin as the 780 conditional priority scheduler, the L4S service can sacrifice some 781 throughput during overload to guarantee a minimum throughput 782 service for Classic traffic. The scheduling weight of the Classic 783 queue should be small (e.g. 1/16). Then, in most traffic 784 scenarios the scheduler will not interfere and it will not need to 785 - the coupling mechanism and the end-systems will share out the 786 capacity across both queues as if it were a single pool. However, 787 because the congestion coupling only applies in one direction 788 (from C to L), if L4S traffic is over-aggressive or unresponsive, 789 the scheduler weight for Classic traffic will at least be large 790 enough to ensure it does not starve. 792 In cases where the ratio of L4S to Classic flows (e.g. 19:1) is 793 greater than the ratio of their scheduler weights (e.g. 15:1), the 794 L4S flows will get less than an equal share of the capacity, but 795 only slightly. For instance, with the example numbers given, each 796 L4S flow will get (15/16)/19 = 4.9% when ideally each would get 797 1/20=5%. In the rather specific case of an unresponsive flow 798 taking up a large part of the capacity set aside for L4S, using 799 WRR could significantly reduce the capacity left for any 800 responsive L4S flows. 802 Sacrifice L4S Delay: To control milder overload of responsive 803 traffic, particularly when close to the maximum congestion signal, 804 the operator could choose to control overload of the Classic queue 805 by allowing some delay to 'leak' across to the L4S queue. The 806 scheduler can be made to behave like a single First-In First-Out 807 (FIFO) queue with different service times by implementing a very 808 simple conditional priority scheduler that could be called a 809 "time-shifted FIFO" (see the Modifier Earliest Deadline First 810 (MEDF) scheduler of [MEDF]). This scheduler adds tshift to the 811 queue delay of the next L4S packet, before comparing it with the 812 queue delay of the next Classic packet, then it selects the packet 813 with the greater adjusted queue delay. Under regular conditions, 814 this time-shifted FIFO scheduler behaves just like a strict 815 priority scheduler. But under moderate or high overload it 816 prevents starvation of the Classic queue, because the time-shift 817 (tshift) defines the maximum extra queuing delay of Classic 818 packets relative to L4S. 820 The example implementation in Appendix A can implement either policy. 822 4.1.2. Congestion Signal Saturation: Introduce L4S Drop or Delay? 824 To keep the throughput of both L4S and Classic flows roughly equal 825 over the full load range, a different control strategy needs to be 826 defined above the point where one AQM first saturates to a 827 probability of 100% leaving no room to push back the load any harder. 828 If k>1, L4S will saturate first, even though saturation could be 829 caused by unresponsive traffic in either queue. 831 The term 'unresponsive' includes cases where a flow becomes 832 temporarily unresponsive, for instance, a real-time flow that takes a 833 while to adapt its rate in response to congestion, or a TCP-like flow 834 that is normally responsive, but above a certain congestion level it 835 will not be able to reduce its congestion window below the minimum of 836 2 segments [RFC5681], effectively becoming unresponsive. (Note that 837 L4S traffic ought to remain responsive below a window of 2 segments 838 (see [I-D.ietf-tsvwg-ecn-l4s-id]). 840 Saturation raises the question of whether to relieve congestion by 841 introducing some drop into the L4S queue or by allowing delay to grow 842 in both queues (which could eventually lead to tail drop too): 844 Drop on Saturation: Saturation can be avoided by setting a maximum 845 threshold for L4S ECN marking (assuming k>1) before saturation 846 starts to make the flow rates of the different traffic types 847 diverge. Above that the drop probability of Classic traffic is 848 applied to all packets of all traffic types. Then experiments 849 have shown that queueing delay can be kept at the target in any 850 overload situation, including with unresponsive traffic, and no 851 further measures are required [DualQ-Test]. 853 Delay on Saturation: When L4S marking saturates, instead of 854 switching to drop, the drop and marking probabilities could be 855 capped. Beyond that, delay will grow either solely in the queue 856 with unresponsive traffic (if WRR is used), or in both queues (if 857 time-shifted FIFO is used). In either case, the higher delay 858 ought to control temporary high congestion. If the overload is 859 more persistent, eventually the combined DualQ will overflow and 860 tail drop will control congestion. 862 The example implementation in Appendix A solely applies the "drop on 863 saturation" policy. 865 4.1.3. Protecting against Unresponsive ECN-Capable Traffic 867 Unresponsive traffic has a greater advantage if it is also ECN- 868 capable. The advantage is undetectable at normal low levels of drop/ 869 marking, but it becomes significant with the higher levels of drop/ 870 marking typical during overload. This is an issue whether the ECN- 871 capable traffic is L4S or Classic. 873 This raises the question of whether and when to switch off ECN 874 marking and use solely drop instead, as required by both Section 7 of 875 [RFC3168] and Section 4.2.1 of [RFC7567]. 877 Experiments with the DualPI2 AQM (Appendix A) have shown that 878 introducing 'drop on saturation' at 100% L4S marking addresses this 879 problem with unresponsive ECN as well as addressing the saturation 880 problem. It leaves only a small range of congestion levels where 881 unresponsive traffic gains any advantage from using the ECN 882 capability, and the advantage is hardly detectable [DualQ-Test]. 884 5. Acknowledgements 886 Thanks to Anil Agarwal, Sowmini Varadhan's and Gabi Bracha for 887 detailed review comments particularly of the appendices and 888 suggestions on how to make our explanation clearer. Thanks also to 889 Greg White for improving the normative requirements and both Greg and 890 Tom Henderson for insights on the choice of schedulers, queue delay 891 measurement techniques. 893 The authors' contributions were originally part-funded by the 894 European Community under its Seventh Framework Programme through the 895 Reducing Internet Transport Latency (RITE) project (ICT-317700). Bob 896 Briscoe's contribution was also part-funded by the Research Council 897 of Norway through the TimeIn project. The views expressed here are 898 solely those of the authors. 900 6. References 902 6.1. Normative References 904 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 905 Requirement Levels", BCP 14, RFC 2119, 906 DOI 10.17487/RFC2119, March 1997, 907 . 909 6.2. Informative References 911 [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An 912 Algorithm for Increasing the Robustness of RED's Active 913 Queue Management", ACIRI Technical Report , August 2001, 914 . 916 [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", 917 ACM Queue 10(5), May 2012, 918 . 920 [CRED_Insights] 921 Briscoe, B., "Insights from Curvy RED (Random Early 922 Detection)", BT Technical Report TR-TUB8-2015-003, July 923 2015, 924 . 926 [DCttH15] De Schepper, K., Bondarenko, O., Briscoe, B., and I. 927 Tsang, "`Data Centre to the Home': Ultra-Low Latency for 928 All", 2015, . 931 (Under submission) 933 [DualQ-Test] 934 Steen, H., "Destruction Testing: Ultra-Low Delay using 935 Dual Queue Coupled Active Queue Management", Masters 936 Thesis, Dept of Informatics, Uni Oslo , May 2017. 938 [I-D.briscoe-tsvwg-l4s-diffserv] 939 Briscoe, B., "Interactions between Low Latency, Low Loss, 940 Scalable Throughput (L4S) and Differentiated Services", 941 draft-briscoe-tsvwg-l4s-diffserv-00 (work in progress), 942 March 2018. 944 [I-D.ietf-tsvwg-ecn-l4s-id] 945 Schepper, K., Briscoe, B., and I. Tsang, "Identifying 946 Modified Explicit Congestion Notification (ECN) Semantics 947 for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s- 948 id-02 (work in progress), March 2018. 950 [I-D.ietf-tsvwg-l4s-arch] 951 Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, 952 Low Loss, Scalable Throughput (L4S) Internet Service: 953 Architecture", draft-ietf-tsvwg-l4s-arch-02 (work in 954 progress), March 2018. 956 [I-D.sridharan-tcpm-ctcp] 957 Sridharan, M., Tan, K., Bansal, D., and D. Thaler, 958 "Compound TCP: A New TCP Congestion Control for High-Speed 959 and Long Distance Networks", draft-sridharan-tcpm-ctcp-02 960 (work in progress), November 2008. 962 [L4Sdemo16] 963 Bondarenko, O., De Schepper, K., Tsang, I., and B. 964 Briscoe, "Ultra-Low Delay for All: Live Experience, Live 965 Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016, 966 . 970 [Mathis09] 971 Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , 972 May 2009, . 975 [MEDF] Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a 976 simple scheduling algorithm for two real-time transport 977 service classes with application in the UTRAN", Proc. IEEE 978 Conference on Computer Communications (INFOCOM'03) Vol.2 979 pp.1116-1122, March 2003. 981 [PI2] De Schepper, K., Bondarenko, O., Briscoe, B., and I. 982 Tsang, "PI2: A Linearized AQM for both Classic and 983 Scalable TCP", ACM CoNEXT'16 , December 2016, 984 . 987 (To appear) 989 [RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", 990 RFC 970, DOI 10.17487/RFC0970, December 1985, 991 . 993 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 994 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 995 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 996 S., Wroclawski, J., and L. Zhang, "Recommendations on 997 Queue Management and Congestion Avoidance in the 998 Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, 999 . 1001 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 1002 of Explicit Congestion Notification (ECN) to IP", 1003 RFC 3168, DOI 10.17487/RFC3168, September 2001, 1004 . 1006 [RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, 1007 J., Courtney, W., Davari, S., Firoiu, V., and D. 1008 Stiliadis, "An Expedited Forwarding PHB (Per-Hop 1009 Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, 1010 . 1012 [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", 1013 RFC 3649, DOI 10.17487/RFC3649, December 2003, 1014 . 1016 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1017 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 1018 . 1020 [RFC5706] Harrington, D., "Guidelines for Considering Operations and 1021 Management of New Protocols and Protocol Extensions", 1022 RFC 5706, DOI 10.17487/RFC5706, November 2009, 1023 . 1025 [RFC7567] Baker, F., Ed. and G. Fairhurst, Ed., "IETF 1026 Recommendations Regarding Active Queue Management", 1027 BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015, 1028 . 1030 [RFC8033] Pan, R., Natarajan, P., Baker, F., and G. White, 1031 "Proportional Integral Controller Enhanced (PIE): A 1032 Lightweight Control Scheme to Address the Bufferbloat 1033 Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017, 1034 . 1036 [RFC8034] White, G. and R. Pan, "Active Queue Management (AQM) Based 1037 on Proportional Integral Controller Enhanced PIE) for 1038 Data-Over-Cable Service Interface Specifications (DOCSIS) 1039 Cable Modems", RFC 8034, DOI 10.17487/RFC8034, February 1040 2017, . 1042 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 1043 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 1044 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 1045 October 2017, . 1047 [RFC8290] Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys, 1048 J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler 1049 and Active Queue Management Algorithm", RFC 8290, 1050 DOI 10.17487/RFC8290, January 2018, 1051 . 1053 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 1054 Notification (ECN) Experimentation", RFC 8311, 1055 DOI 10.17487/RFC8311, January 2018, 1056 . 1058 [RFC8312] Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and 1059 R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", 1060 RFC 8312, DOI 10.17487/RFC8312, February 2018, 1061 . 1063 [TCP-CA] Jacobson, V. and M. Karels, "Congestion Avoidance and 1064 Control", Laurence Berkeley Labs Technical Report , 1065 November 1988, . 1067 Appendix A. Example DualQ Coupled PI2 Algorithm 1069 As a first concrete example, the pseudocode below gives the DualPI2 1070 algorithm. DualPI2 follows the structure of the DualQ Coupled AQM 1071 framework in Figure 1. A simple step threshold (in units of queuing 1072 time) is used for the Native L4S AQM, but a ramp is also described as 1073 an alternative. And the PI2 algorithm [PI2] is used for the Classic 1074 AQM. PI2 is an improved variant of the PIE AQM [RFC8033]. 1076 We will introduce the pseudocode in two passes. The first pass 1077 explains the core concepts, deferring handling of overload to the 1078 second pass. To aid comparison, line numbers are kept in step 1079 between the two passes by using letter suffixes where the longer code 1080 needs extra lines. 1082 A full open source implementation for Linux is available at: 1083 https://github.com/olgabo/dualpi2. 1085 A.1. Pass #1: Core Concepts 1087 The pseudocode manipulates three main structures of variables: the 1088 packet (pkt), the L4S queue (lq) and the Classic queue (cq). The 1089 pseudocode consists of the following five functions: 1091 o initialization code (Figure 2) that sets parameter defaults (the 1092 API for setting non-default values is omitted for brevity) 1094 o enqueue code (Figure 3) 1096 o dequeue code (Figure 4) 1098 o a ramp function (Figure 5) used to calculate the ECN-marking 1099 probability for the L4S queue 1101 o code to regularly update the base probability (p) used in the 1102 dequeue code (Figure 6). 1104 It also uses the following functions that are not shown in full here: 1106 o scheduler(), which selects between the head packets of the two 1107 queues; the choice of scheduler technology is discussed later; 1109 o cq.len() or lq.len() returns the current length (aka. backlog) of 1110 the relevant queue in bytes; 1112 o cq.time() or lq.time() returns the current queuing delay (aka. 1113 sojourn time or service time) of the relevant queue in units of 1114 time; 1116 Queuing delay could be measured directly by storing a per-packet 1117 time-stamp as each packet is enqueued, and subtracting this from the 1118 system time when the packet is dequeued. If time-stamping is not 1119 easy to introduce with certain hardware, queuing delay could be 1120 predicted indirectly by dividing the size of the queue by the 1121 predicted departure rate, which might be known precisely for some 1122 link technologies (see for example [RFC8034]). 1124 In our experiments so far (building on experiments with PIE) on 1125 broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs 1126 from 5 ms to 100 ms, DualPI2 achieves good results with the default 1127 parameters in Figure 2. The parameters are categorised by whether 1128 they relate to the Base PI2 AQM, the L4S AQM or the framework 1129 coupling them together. Variables derived from these parameters are 1130 also included at the end of each category. Each parameter is 1131 explained as it is encountered in the walk-through of the pseudocode 1132 below. 1134 1: dualpi2_params_init(...) { % Set input parameter defaults 1135 2: % PI2 AQM parameters 1136 3: target = 15 ms % PI AQM Classic queue delay target 1137 4: Tupdate = 16 ms % PI Classic queue sampling interval 1138 5: alpha = 10 Hz^2 % PI integral gain 1139 6: beta = 100 Hz^2 % PI proportional gain 1140 7: p_Cmax = 1/4 % Max Classic drop/mark prob 1141 8: % Constants derived from PI2 AQM parameters 1142 9: alpha_U = alpha *Tupdate % PI integral gain per update interval 1143 10: beta_U = beta * Tupdate % PI prop'nal gain per update interval 1144 11: 1145 12: % DualQ Coupled framework parameters 1146 13: k = 2 % Coupling factor 1147 14: % scheduler weight or equival't parameter (scheduler-dependent) 1148 15: limit = MAX_LINK_RATE * 250 ms % Dual buffer size 1149 16: 1150 17: % L4S ramp AQM parameters 1151 18: minTh = 475 us % L4S min marking threshold in time units 1152 19: range = 525 us % Range of L4S ramp in time units 1153 20: Th_len = 2 * MTU % Min L4S marking threshold in bytes 1154 21: % Constants derived from L4S AQM parameters 1155 22: p_Lmax = min(k*sqrt(p_Cmax), 1) % Max L4S marking prob 1156 23: floor = Th_len * 8 / MIN_LINK_RATE % MIN_LINK_RATE is in Mb/s 1157 24: if (minTh < floor) { 1158 25: % Adjust ramp to exceed serialization time of 2 MTU 1159 26: range = max(range - (floor-minTh), 1) % 1us avoids /0 error 1160 27: minTh = floor 1161 28: } 1162 29: maxTh = minTh+range % L4S min marking threshold in time units 1163 30: } 1165 Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM 1167 For brevity the pseudocode shows some parameters in units of 1168 microseconds (us), but a real implementation would probably use 1169 nanoseconds. 1171 The overall goal of the code is to maintain the base probability (p), 1172 which is an internal variable from which the marking and dropping 1173 probabilities for L4S and Classic traffic (p_L and p_C) are derived. 1174 The variable named p in the pseudocode and in this walk-through is 1175 the same as p' (p-prime) in Section 2.4. The probabilities p_L and 1176 p_C are derived in lines 3, 4 and 5 of the dualpi2_update() function 1177 (Figure 6) then used in the dualpi2_dequeue() function (Figure 4). 1178 The code walk-through below builds up to explaining that part of the 1179 code eventually, but it starts from packet arrival. 1181 1: dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq 1182 2: if ( lq.len() + cq.len() > limit ) 1183 3: drop(pkt) % drop packet if buffer is full 1184 4: else { % Packet classifier 1185 5: if ( ecn(pkt) modulo 2 == 1 ) % ECN bits = ECT(1) or CE 1186 6: lq.enqueue(pkt) 1187 7: else % ECN bits = not-ECT or ECT(0) 1188 8: cq.enqueue(pkt) 1189 9: } 1190 10: } 1192 Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM 1194 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues 1195 2: while ( lq.len() + cq.len() > 0 ) 1196 3: if ( scheduler() == lq ) { 1197 4: lq.dequeue(pkt) % Scheduler chooses lq 1198 5: p'_L = laqm(lq.time()) % Native L4S AQM 1199 6: p_L = max(p'_L, p_CL) % Combining function 1200 7: if ( p_L > rand() ) % Linear marking 1201 8: mark(pkt) 1202 9: } else { 1203 10: cq.dequeue(pkt) % Scheduler chooses cq 1204 11: if ( p_C > rand() ) { % probability p_C = p^2 1205 12: if ( ecn(pkt) == 0 ) { % if ECN field = not-ECT 1206 13: drop(pkt) % squared drop 1207 14: continue % continue to the top of the while loop 1208 15: } 1209 16: mark(pkt) % squared mark 1210 17: } 1211 18: } 1212 19: return(pkt) % return the packet and stop 1213 20: } 1214 21: return(NULL) % no packet to dequeue 1215 22: } 1217 Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM 1219 When packets arrive, first a common queue limit is checked as shown 1220 in line 2 of the enqueuing pseudocode in Figure 3. Note that the 1221 limit is deliberately tested before enqueue to avoid any bias against 1222 larger packets (so depending whether the implementation stores a 1223 packet while testing whether to drop it from the tail, it might be 1224 necessary for the actual buffer memory to be one MTU larger than 1225 limit). 1227 Line 2 assumes an implementation where lq and cq share common buffer 1228 memory. An alternative implementation could use separate buffers for 1229 each queue, in which case the arriving packet would have to be 1230 classified first to determine which buffer to check for available 1231 space. The choice is a trade off; a shared buffer can use less 1232 memory whereas separate buffers isolate the L4S queue from tail-drop 1233 due to large bursts of Classic traffic (e.g. a Classic TCP during 1234 slow-start over a long RTT). 1236 Returning to the shared buffer case, if limit is not exceeded, the 1237 packet will be classified and enqueued to the Classic or L4S queue 1238 dependent on the least significant bit of the ECN field in the IP 1239 header (line 5). Packets with a codepoint having an LSB of 0 (Not- 1240 ECT and ECT(0)) will be enqueued in the Classic queue. Otherwise, 1241 ECT(1) and CE packets will be enqueued in the L4S queue. Optional 1242 additional packet classification flexibility is omitted for brevity 1243 (see [I-D.ietf-tsvwg-ecn-l4s-id]). 1245 The dequeue pseudocode (Figure 4) is repeatedly called whenever the 1246 lower layer is ready to forward a packet. It schedules one packet 1247 for dequeuing (or zero if the queue is empty) then returns control to 1248 the caller, so that it does not block while that packet is being 1249 forwarded. While making this dequeue decision, it also makes the 1250 necessary AQM decisions on dropping or marking. The alternative of 1251 applying the AQMs at enqueue would shift some processing from the 1252 critical time when each packet is dequeued. However, it would also 1253 add a whole queue of delay to the control signals, making the control 1254 loop very sloppy. 1256 All the dequeue code is contained within a large while loop so that 1257 if it decides to drop a packet, it will continue until it selects a 1258 packet to schedule. Line 3 of the dequeue pseudocode is where the 1259 scheduler chooses between the L4S queue (lq) and the Classic queue 1260 (cq). Detailed implementation of the scheduler is not shown (see 1261 discussion later). 1263 o If an L4S packet is scheduled, lines 7 and 8 ECN-mark the packet 1264 if a random marking decision is drawn according to p_L. Line 6 1265 calculates p_L as the maximum of the coupled L4S probability p_CL 1266 and the probability from the native L4S AQM p'_L. This implements 1267 the max() function shown in Figure 1 to couple the outputs of the 1268 two AQMs together. Of the two probabilities input to p_L in line 1269 6: 1271 * p'_L is calculated per packet in line 5 by the laqm() function 1272 (see Figure 5), 1274 * whereas p_CL is maintained by the dualpi2_update() function 1275 which runs every Tupdate (default 16ms) (see Figure 2). 1277 o If a Classic packet is scheduled, lines 10 to 17 drop or mark the 1278 packet based on the squared probability p_C. 1280 The Native L4S AQM algorithm (Figure 5) is a ramp function, similar 1281 to the RED algorithm, but simpler due to the following differences: 1283 o The min and max of the ramp are defined in units of queuing delay, 1284 not bytes, so that configuration remains invariant as the queue 1285 departure rate varies. 1287 o It uses instantaneous queueing delay to remove smoothing delay 1288 (L4S senders smooth incoming ECN feedback when necessary). 1290 o The ramp rises linearly directly from 0 to 1, not to a an 1291 intermediate value of p'_L as RED would, because there is no need 1292 to keep ECN marking probability low. 1294 o Marking does not have to be randomized. Determinism is being 1295 experimented with instead of randomness; to reduce the delay 1296 necessary to smooth out the noise of randomness from the signal. 1297 In this case, for each packet, the algorithm would accumulate p_L 1298 in a counter and mark the packet that took the counter over 1, 1299 then subtract 1 from the counter and continue. 1301 This ramp function requires two configuration parameters, the minimum 1302 threshold (minTh) and the width of the ramp (range), both in units of 1303 queuing time), as shown in the parameter initialization code in 1304 Figure 2. A minimum marking threshold parameter (Th_len) in 1305 transmission units (default 2 MTU) is also necessary to ensure that 1306 the ramp does not trigger excessive marking on slow links. The code 1307 in lines 23-28 of Figure 2 converts 2 MTU into time units and adjusts 1308 the ramp thresholds to be no shallower than this floor. 1310 An operator can effectively turn the ramp into a step function, as 1311 used by DCTCP, by setting the range to its minimum value (e.g. 1 ns). 1312 Then the condition for the ramp calculation will hardly ever arise. 1313 There is some concern that using the step function of DCTCP for the 1314 Native L4S AQM requires end-systems to smooth the signal for an 1315 unnecessarily large number of round trips to ensure sufficient 1316 fidelity. A ramp seems to be no worse than a step in initial 1317 experiments with existing DCTCP. Therefore, it is recommended that a 1318 ramp is configured in place of a step, which will allow congestion 1319 control algorithms to investigate faster smoothing algorithms. 1321 1: laqm(qdelay) { % Returns native L4S AQM probability 1322 2: if (qdelay >= maxTh) 1323 3: return 1 1324 4: else if (qdelay > minTh) 1325 5: return (qdelay - minTh)/range % Divide would use a bit-shift 1326 6: else 1327 7: return 0 1328 8: } 1330 Figure 5: Example Pseudocode for the Native L4S AQM 1332 1: dualpi2_update(lq, cq, target) { % Update p every Tupdate 1333 2: curq = cq.time() % use queuing time of first-in Classic packet 1334 3: p = p + alpha_U * (curq - target) + beta_U * (curq - prevq) 1335 4: p_CL = p * k % Coupled L4S prob = base prob * coupling factor 1336 5: p_C = p^2 % Classic prob = (base prob)^2 1337 6: prevq = curq 1338 7: } 1340 Figure 6: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM 1342 p_CL depends on the base probability (p), which is kept up to date by 1343 the core PI algorithm in Figure 6 executed every Tupdate. 1345 Note that p solely depends on the queuing time in the Classic queue. 1346 In line 2, the current queuing delay (curq) is evaluated from how 1347 long the head packet was in the Classic queue (cq). The function 1348 cq.time() (not shown) subtracts the time stamped at enqueue from the 1349 current time and implicitly takes the current queuing delay as 0 if 1350 the queue is empty. 1352 The algorithm centres on line 3, which is a classical Proportional- 1353 Integral (PI) controller that alters p dependent on: a) the error 1354 between the current queuing delay (curq) and the target queuing delay 1355 ('target' - see [RFC8033]); and b) the change in queuing delay since 1356 the last sample. The name 'PI' represents the fact that the second 1357 factor (how fast the queue is growing) is _P_roportional to load 1358 while the first is the _I_ntegral of the load (so it removes any 1359 standing queue in excess of the target). 1361 The two 'gain factors' in line 3, alpha_U and beta_U, respectively 1362 weight how strongly each of these elements ((a) and (b)) alters p. 1363 They are in units of 'per second of delay' or Hz, because they 1364 transform differences in queueing delay into changes in probability. 1366 alpha_U and beta_U are derived from the input parameters alpha and 1367 beta (see lines 5 and 6 of Figure 2). These recommended values of 1368 alpha and beta come from the stability analysis in [PI2] so that the 1369 AQM can change p as fast as possible in response to changes in load 1370 without over-compensating and therefore causing oscillations in the 1371 queue. 1373 alpha and beta determine how much p ought to change if it was updated 1374 every second. It is best to update p as frequently as possible, but 1375 the update interval (Tupdate) will probably be constrained by 1376 hardware performance. For link rates from 4 - 200 Mb/s, we found 1377 Tupdate=16ms (as recommended in [RFC8033]) is sufficient. However 1378 small the chosen value of Tupdate, p should change by the same amount 1379 per second, but in finer more frequent steps. So the gain factors 1380 used for updating p in Figure 6 need to be scaled by (Tupdate/1s), 1381 which is done in lines 9 and 10 of Figure 2). The suffix '_U' 1382 represents 'per update time' (Tupdate). 1384 In corner cases, p can overflow the range [0,1] so the resulting 1385 value of p has to be bounded (omitted from the pseudocode). Then, as 1386 already explained, the coupled and Classic probabilities are derived 1387 from the new p in lines 4 and 5 as p_CL = k*p and p_C = p^2. 1389 Because the coupled L4S marking probability (p_CL) is factored up by 1390 k, the dynamic gain parameters alpha and beta are also inherently 1391 factored up by k for the L4S queue, which is necessary to ensure that 1392 Classic TCP and DCTCP controls have the same stability. So, if alpha 1393 is 10 Hz^2, the effective gain factor for the L4S queue is k*alpha, 1394 which is 20 Hz^2 with the default coupling factor of k=2. 1396 Unlike in PIE [RFC8033], alpha_U and beta_U do not need to be tuned 1397 every Tupdate dependent on p. Instead, in PI2, alpha_U and beta_U 1398 are independent of p because the squaring applied to Classic traffic 1399 tunes them inherently. This is explained in [PI2], which also 1400 explains why this more principled approach removes the need for most 1401 of the heuristics that had to be added to PIE. 1403 {ToDo: Scaling beta with Tupdate and scaling both alpha & beta with 1404 RTT} 1406 A.2. Pass #2: Overload Details 1408 Figure 7 repeats the dequeue function of Figure 4, but with overload 1409 details added. Similarly Figure 8 repeats the core PI algorithm of 1410 Figure 6 with overload details added. The initialization, enqueue 1411 and L4S AQM functions are unchanged. 1413 In line 7 of the initialization function (Figure 2), the default 1414 maximum Classic drop probability p_Cmax = 1/4 or 25%. This is the 1415 point at which it is deemed that the Classic queue has become 1416 persistently overloaded, so it switches to using solely drop, even 1417 for ECN-capable packets. This protects the queue against any 1418 unresponsive traffic that falsely claims that it is responsive to ECN 1419 marking, as required by [RFC3168] and [RFC7567]. 1421 Line 22 of the initialization function translates this into a maximum 1422 L4S marking probability (p_Lmax) by rearranging Equation (1). With a 1423 coupling factor of k=2 (the default) or greater, this translates to a 1424 maximum L4S marking probability of 1 (or 100%). This is intended to 1425 ensure that the L4S queue starts to introduce dropping once marking 1426 saturates and can rise no further. The 'TCP Prague' requirements 1427 [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion 1428 control detects a drop, it falls back to a response that coexists 1429 with 'Classic' TCP. So it is correct that the L4S queue drops 1430 packets proportional to p^2, as if they are Classic packets. 1432 Both these switch-overs are triggered by the tests for overload 1433 introduced in lines 4b and 12b of the dequeue function (Figure 7). 1434 Lines 8c to 8g drop L4S packets with probability p^2. Lines 8h to 8i 1435 mark the remaining packets with probability p_CL. If p_Lmax = 1, 1436 which is the suggested default configuration, all remaining packets 1437 will be marked because, to have reached the else block at line 8b, 1438 p_CL >= 1. 1440 Lines 2c to 2d in the core PI algorithm (Figure 8) deal with overload 1441 of the L4S queue when there is no Classic traffic. This is 1442 necessary, because the core PI algorithm maintains the appropriate 1443 drop probability to regulate overload, but it depends on the length 1444 of the Classic queue. If there is no Classic queue the naive 1445 algorithm in Figure 6 drops nothing, even if the L4S queue is 1446 overloaded - so tail drop would have to take over (lines 3 and 4 of 1447 Figure 3). 1449 If the test at line 2a finds that the Classic queue is empty, line 2d 1450 measures the current queue delay using the L4S queue instead. While 1451 the L4S queue is not overloaded, its delay will always be tiny 1452 compared to the target Classic queue delay. So p_L will be driven to 1453 zero, and the L4S queue will naturally be governed solely by 1454 threshold marking (lines 5 and 6 of the dequeue algorithm in 1455 Figure 7). But, if unresponsive L4S source(s) cause overload, the 1456 DualQ transitions smoothly to L4S marking based on the PI algorithm. 1457 And as overload increases, it naturally transitions from marking to 1458 dropping by the switch-over mechanism already described. 1460 1: dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 1461 2: while ( lq.len() + cq.len() > 0 ) 1462 3: if ( scheduler() == lq ) { 1463 4a: lq.dequeue(pkt) 1464 4b: if ( p_CL < p_Lmax ) { % Check for overload saturation 1465 5: p'_L = laqm(lq.time()) % Native L4S AQM 1466 6: p_L = max(p'_L, p_CL) % Combining function 1467 7: if ( p_L > rand() ) % Linear marking 1468 8a: mark(pkt) 1469 8b: } else { % overload saturation 1470 8c: if ( p_C > rand() ) { % probability p_C = p^2 1471 8e: drop(pkt) % revert to Classic drop due to overload 1472 8f: continue % continue to the top of the while loop 1473 8g: } 1474 8h: if ( p_CL > rand() ) % probability p_CL = k * p 1475 8i: mark(pkt) % linear marking of remaining packets 1476 8j: } 1477 9: } else { 1478 10: cq.dequeue(pkt) 1479 11: if ( p_C > rand() ) { % probability p_C = p^2 1480 12a: if ( (ecn(pkt) == 0) % ECN field = not-ECT 1481 12b: OR (p_C >= p_Cmax) ) { % Overload disables ECN 1482 13: drop(pkt) % squared drop, redo loop 1483 14: continue % continue to the top of the while loop 1484 15: } 1485 16: mark(pkt) % squared mark 1486 17: } 1487 18: } 1488 19: return(pkt) % return the packet and stop 1489 20: } 1490 21: return(NULL) % no packet to dequeue 1491 22: } 1493 Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM 1494 (Including Integer Arithmetic and Overload Code) 1496 1: dualpi2_update(lq, cq, target) { % Update p every Tupdate 1497 2a: if ( cq.len() > 0 ) 1498 2b: curq = cq.time() %use queuing time of first-in Classic packet 1499 2c: else % Classic queue empty 1500 2d: curq = lq.time() % use queuing time of first-in L4S packet 1501 3: p = p + alpha_U * (curq - target) + beta_U * (curq - prevq) 1502 4: p_CL = p * k % Coupled L4S prob = base prob * coupling factor 1503 5: p_C = p^2 % Classic prob = (base prob)^2 1504 6: prevq = curq 1505 7: } 1507 Figure 8: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM 1508 (Including Overload Code) 1510 The choice of scheduler technology is critical to overload protection 1511 (see Section 4.1). 1513 o A well-understood weighted scheduler such as weighted round robin 1514 (WRR) is recommended. The scheduler weight for Classic should be 1515 low, e.g. 1/16. 1517 o Alternatively, a time-shifted FIFO could be used. This is a very 1518 simple scheduler, but it does not fully isolate latency in the L4S 1519 queue from uncontrolled bursts in the Classic queue. It works by 1520 selecting the head packet that has waited the longest, biased 1521 against the Classic traffic by a time-shift of tshift. To 1522 implement time-shifted FIFO, the "if (scheduler() == lq )" test in 1523 line 3 of the dequeue code would simply be replaced by "if ( 1524 lq.time() + tshift >= cq.time() )". For the public Internet a 1525 good value for tshift is 50ms. For private networks with smaller 1526 diameter, about 4*target would be reasonable. 1528 o A strict priority scheduler would be inappropriate, because it 1529 would starve Classic if L4S was overloaded. 1531 Appendix B. Example DualQ Coupled Curvy RED Algorithm 1533 As another example of a DualQ Coupled AQM algorithm, the pseudocode 1534 below gives the Curvy RED based algorithm we used and tested. 1535 Although we designed the AQM to be efficient in integer arithmetic, 1536 to aid understanding it is first given using real-number arithmetic. 1537 Then, one possible optimization for integer arithmetic is given, also 1538 in pseudocode. To aid comparison, the line numbers are kept in step 1539 between the two by using letter suffixes where the longer code needs 1540 extra lines. 1542 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 1543 2: if ( lq.dequeue(pkt) ) { 1544 3a: p_L = cq.sec() / 2^S_L 1545 3b: if ( lq.byt() > T ) 1546 3c: mark(pkt) 1547 3d: elif ( p_L > maxrand(U) ) 1548 4: mark(pkt) 1549 5: return(pkt) % return the packet and stop here 1550 6: } 1551 7: while ( cq.dequeue(pkt) ) { 1552 8a: alpha = 2^(-f_C) 1553 8b: Q_C = alpha * pkt.sec() + (1-alpha)* Q_C % Classic Q EWMA 1554 9a: sqrt_p_C = Q_C / 2^S_C 1555 9b: if ( sqrt_p_C > maxrand(2*U) ) 1556 10: drop(pkt) % Squared drop, redo loop 1557 11: else 1558 12: return(pkt) % return the packet and stop here 1559 13: } 1560 14: return(NULL) % no packet to dequeue 1561 15: } 1563 16: maxrand(u) { % return the max of u random numbers 1564 17: maxr=0 1565 18: while (u-- > 0) 1566 19: maxr = max(maxr, rand()) % 0 <= rand() < 1 1567 20: return(maxr) 1568 21: } 1570 Figure 9: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM 1572 Packet classification code is not shown, as it is no different from 1573 Figure 3. Potential classification schemes are discussed in 1574 Section 2.3. The Curvy RED algorithm has not been maintained to the 1575 same degree as the DualPI2 algorithm. Some ideas used in DualPI2 1576 would need to be translated into Curvy RED, such as i) the 1577 conditional priority scheduler instead of strict priority ii) the 1578 time-based L4S threshold; iii) turning off ECN as overload 1579 protection; iv) Classic ECN support. These are not shown in the 1580 Curvy RED pseudocode, but would need to be implemented for 1581 production. {ToDo} 1583 At the outer level, the structure of dualq_dequeue() implements 1584 strict priority scheduling. The code is written assuming the AQM is 1585 applied on dequeue (Note 1) . Every time dualq_dequeue() is called, 1586 the if-block in lines 2-6 determines whether there is an L4S packet 1587 to dequeue by calling lq.dequeue(pkt), and otherwise the while-block 1588 in lines 7-13 determines whether there is a Classic packet to 1589 dequeue, by calling cq.dequeue(pkt). (Note 2) 1590 In the lower priority Classic queue, a while loop is used so that, if 1591 the AQM determines that a classic packet should be dropped, it 1592 continues to test for classic packets deciding whether to drop each 1593 until it actually forwards one. Thus, every call to dualq_dequeue() 1594 returns one packet if at least one is present in either queue, 1595 otherwise it returns NULL at line 14. (Note 3) 1597 Within each queue, the decision whether to drop or mark is taken as 1598 follows (to simplify the explanation, it is assumed that U=1): 1600 L4S: If the test at line 2 determines there is an L4S packet to 1601 dequeue, the tests at lines 3a and 3c determine whether to mark 1602 it. The first is a simple test of whether the L4S queue (lq.byt() 1603 in bytes) is greater than a step threshold T in bytes (Note 4). 1604 The second test is similar to the random ECN marking in RED, but 1605 with the following differences: i) the marking function does not 1606 start with a plateau of zero marking until a minimum threshold, 1607 rather the marking probability starts to increase as soon as the 1608 queue is positive; ii) marking depends on queuing time, not bytes, 1609 in order to scale for any link rate without being reconfigured; 1610 iii) marking of the L4S queue does not depend on itself, it 1611 depends on the queuing time of the _other_ (Classic) queue, where 1612 cq.sec() is the queuing time of the packet at the head of the 1613 Classic queue (zero if empty); iv) marking depends on the 1614 instantaneous queuing time (of the other Classic queue), not a 1615 smoothed average; v) the queue is compared with the maximum of U 1616 random numbers (but if U=1, this is the same as the single random 1617 number used in RED). 1619 Specifically, in line 3a the marking probability p_L is set to the 1620 Classic queueing time qc.sec() in seconds divided by the L4S 1621 scaling parameter 2^S_L, which represents the queuing time (in 1622 seconds) at which marking probability would hit 100%. Then in line 1623 3d (if U=1) the result is compared with a uniformly distributed 1624 random number between 0 and 1, which ensures that marking 1625 probability will linearly increase with queueing time. The 1626 scaling parameter is expressed as a power of 2 so that division 1627 can be implemented as a right bit-shift (>>) in line 3 of the 1628 integer variant of the pseudocode (Figure 10). 1630 Classic: If the test at line 7 determines that there is at least one 1631 Classic packet to dequeue, the test at line 9b determines whether 1632 to drop it. But before that, line 8b updates Q_C, which is an 1633 exponentially weighted moving average (Note 5) of the queuing time 1634 in the Classic queue, where pkt.sec() is the instantaneous 1635 queueing time of the current Classic packet and alpha is the EWMA 1636 constant for the classic queue. In line 8a, alpha is represented 1637 as an integer power of 2, so that in line 8 of the integer code 1638 the division needed to weight the moving average can be 1639 implemented by a right bit-shift (>> f_C). 1641 Lines 9a and 9b implement the drop function. In line 9a the 1642 averaged queuing time Q_C is divided by the Classic scaling 1643 parameter 2^S_C, in the same way that queuing time was scaled for 1644 L4S marking. This scaled queuing time is given the variable name 1645 sqrt_p_C because it will be squared to compute Classic drop 1646 probability, so before it is squared it is effectively the square 1647 root of the drop probability. The squaring is done by comparing 1648 it with the maximum out of two random numbers (assuming U=1). 1649 Comparing it with the maximum out of two is the same as the 1650 logical `AND' of two tests, which ensures drop probability rises 1651 with the square of queuing time (Note 6). Again, the scaling 1652 parameter is expressed as a power of 2 so that division can be 1653 implemented as a right bit-shift in line 9 of the integer 1654 pseudocode. 1656 The marking/dropping functions in each queue (lines 3 & 9) are two 1657 cases of a new generalization of RED called Curvy RED, motivated as 1658 follows. When we compared the performance of our AQM with fq_CoDel 1659 and PIE, we came to the conclusion that their goal of holding queuing 1660 delay to a fixed target is misguided [CRED_Insights]. As the number 1661 of flows increases, if the AQM does not allow TCP to increase queuing 1662 delay, it has to introduce abnormally high levels of loss. Then loss 1663 rather than queuing becomes the dominant cause of delay for short 1664 flows, due to timeouts and tail losses. 1666 Curvy RED constrains delay with a softened target that allows some 1667 increase in delay as load increases. This is achieved by increasing 1668 drop probability on a convex curve relative to queue growth (the 1669 square curve in the Classic queue, if U=1). Like RED, the curve hugs 1670 the zero axis while the queue is shallow. Then, as load increases, 1671 it introduces a growing barrier to higher delay. But, unlike RED, it 1672 requires only one parameter, the scaling, not three. The diadvantage 1673 of Curvy RED is that it is not adapted to a wide range of RTTs. 1674 Curvy RED can be used as is when the RTT range to support is limited 1675 otherwise an adaptation mechanism is required. 1677 There follows a summary listing of the two parameters used for each 1678 of the two queues: 1680 Classic: 1682 S_C : The scaling factor of the dropping function scales Classic 1683 queuing times in the range [0, 2^(S_C)] seconds into a dropping 1684 probability in the range [0,1]. To make division efficient, it 1685 is constrained to be an integer power of two; 1687 f_C : To smooth the queuing time of the Classic queue and make 1688 multiplication efficient, we use a negative integer power of 1689 two for the dimensionless EWMA constant, which we define as 1690 alpha = 2^(-f_C). 1692 L4S : 1694 S_L (and k'): As for the Classic queue, the scaling factor of 1695 the L4S marking function scales Classic queueing times in the 1696 range [0, 2^(S_L)] seconds into a probability in the range 1697 [0,1]. Note that S_L = S_C + k', where k' is the coupling 1698 between the queues. So S_L and k' count as only one parameter; 1699 k' is related to k in Equation (1) (Section 2.1) by k=2^k', 1700 where both k and k' are constants. Then implementations can 1701 avoid costly division by shifting p_L by k' bits to the right. 1703 T : The queue size in bytes at which step threshold marking 1704 starts in the L4S queue. 1706 {ToDo: These are the raw parameters used within the algorithm. A 1707 configuration front-end could accept more meaningful parameters and 1708 convert them into these raw parameters.} 1710 From our experiments so far, recommended values for these parameters 1711 are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs 1712 typical on the public Internet. [CRED_Insights] explains why these 1713 parameters are applicable whatever rate link this AQM implementation 1714 is deployed on and how the parameters would need to be adjusted for a 1715 scenario with a different range of RTTs (e.g. a data centre) {ToDo 1716 incorporate a summary of that report into this draft}. The setting of 1717 k depends on policy (see Section 2.5 and Appendix C respectively for 1718 its recommended setting and guidance on alternatives). 1720 There is also a cUrviness parameter, U, which is a small positive 1721 integer. It is likely to take the same hard-coded value for all 1722 implementations, once experiments have determined a good value. We 1723 have solely used U=1 in our experiments so far, but results might be 1724 even better with U=2 or higher. 1726 Note that the dropping function at line 9 calls maxrand(2*U), which 1727 gives twice as much curviness as the call to maxrand(U) in the 1728 marking function at line 3. This is the trick that implements the 1729 square rule in equation (1) (Section 2.1). This is based on the fact 1730 that, given a number X from 1 to 6, the probability that two dice 1731 throws will both be less than X is the square of the probability that 1732 one throw will be less than X. So, when U=1, the L4S marking 1733 function is linear and the Classic dropping function is squared. If 1734 U=2, L4S would be a square function and Classic would be quartic. 1735 And so on. 1737 The maxrand(u) function in lines 16-21 simply generates u random 1738 numbers and returns the maximum (Note 7). Typically, maxrand(u) 1739 could be run in parallel out of band. For instance, if U=1, the 1740 Classic queue would require the maximum of two random numbers. So, 1741 instead of calling maxrand(2*U) in-band, the maximum of every pair of 1742 values from a pseudorandom number generator could be generated out- 1743 of-band, and held in a buffer ready for the Classic queue to consume. 1745 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 1746 2: if ( lq.dequeue(pkt) ) { 1747 3: if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U))) 1748 4: mark(pkt) 1749 5: return(pkt) % return the packet and stop here 1750 6: } 1751 7: while ( cq.dequeue(pkt) ) { 1752 8: Q_C += (pkt.ns() - Q_C) >> f_C % Classic Q EWMA 1753 9: if ( (Q_C >> (S_C-2) ) > maxrand(2*U) ) 1754 10: drop(pkt) % Squared drop, redo loop 1755 11: else 1756 12: return(pkt) % return the packet and stop here 1757 13: } 1758 14: return(NULL) % no packet to dequeue 1759 15: } 1761 Figure 10: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM 1762 using Integer Arithmetic 1764 Notes: 1766 1. The drain rate of the queue can vary if it is scheduled relative 1767 to other queues, or to cater for fluctuations in a wireless 1768 medium. To auto-adjust to changes in drain rate, the queue must 1769 be measured in time, not bytes or packets [CoDel]. In our Linux 1770 implementation, it was easiest to measure queuing time at 1771 dequeue. Queuing time can be estimated when a packet is enqueued 1772 by measuring the queue length in bytes and dividing by the recent 1773 drain rate. 1775 2. An implementation has to use priority queueing, but it need not 1776 implement strict priority. 1778 3. If packets can be enqueued while processing dequeue code, an 1779 implementer might prefer to place the while loop around both 1780 queues so that it goes back to test again whether any L4S packets 1781 arrived while it was dropping a Classic packet. 1783 4. In order not to change too many factors at once, for now, we keep 1784 the marking function for DCTCP-only traffic as similar as 1785 possible to DCTCP. However, unlike DCTCP, all processing is at 1786 dequeue, so we determine whether to mark a packet at the head of 1787 the queue by the byte-length of the queue _behind_ it. We plan 1788 to test whether using queuing time will work in all 1789 circumstances, and if we find that the step can cause 1790 oscillations, we will investigate replacing it with a steep 1791 random marking curve. 1793 5. An EWMA is only one possible way to filter bursts; other more 1794 adaptive smoothing methods could be valid and it might be 1795 appropriate to decrease the EWMA faster than it increases. 1797 6. In practice at line 10 the Classic queue would probably test for 1798 ECN capability on the packet to determine whether to drop or mark 1799 the packet. However, for brevity such detail is omitted. All 1800 packets classified into the L4S queue have to be ECN-capable, so 1801 no dropping logic is necessary at line 3. Nonetheless, L4S 1802 packets could be dropped by overload code (see Section 4.1). 1804 7. In the integer variant of the pseudocode (Figure 10) real numbers 1805 are all represented as integers scaled up by 2^32. In lines 3 & 1806 9 the function maxrand() is arranged to return an integer in the 1807 range 0 <= maxrand() < 2^32. Queuing times are also scaled up by 1808 2^32, but in two stages: i) In lines 3 and 8 queuing times 1809 cq.ns() and pkt.ns() are returned in integer nanoseconds, making 1810 the values about 2^30 times larger than when the units were 1811 seconds, ii) then in lines 3 and 9 an adjustment of -2 to the 1812 right bit-shift multiplies the result by 2^2, to complete the 1813 scaling by 2^32. 1815 Appendix C. Guidance on Controlling Throughput Equivalence 1817 +---------------+------+-------+ 1818 | RTT_C / RTT_L | Reno | Cubic | 1819 +---------------+------+-------+ 1820 | 1 | k'=1 | k'=0 | 1821 | 2 | k'=2 | k'=1 | 1822 | 3 | k'=2 | k'=2 | 1823 | 4 | k'=3 | k'=2 | 1824 | 5 | k'=3 | k'=3 | 1825 +---------------+------+-------+ 1827 Table 1: Value of k' for which DCTCP throughput is roughly the same 1828 as Reno or Cubic, for some example RTT ratios 1830 k' is related to k in Equation (1) (Section 2.1) by k=2^k'. 1832 To determine the appropriate policy, the operator first has to judge 1833 whether it wants DCTCP flows to have roughly equal throughput with 1834 Reno or with Cubic (because, even in its Reno-compatibility mode, 1835 Cubic is about 1.4 times more aggressive than Reno). Then the 1836 operator needs to decide at what ratio of RTTs it wants DCTCP and 1837 Classic flows to have roughly equal throughput. For example choosing 1838 k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same 1839 as Cubic, _if their RTTs are the same_. 1841 However, even if the base RTTs are the same, the actual RTTs are 1842 unlikely to be the same, because Classic (Cubic or Reno) traffic 1843 needs a large queue to avoid under-utilization and excess drop, 1844 whereas L4S (DCTCP) does not. The operator might still choose this 1845 policy if it judges that DCTCP throughput should be rewarded for 1846 keeping its own queue short. 1848 On the other hand, the operator will choose one of the higher values 1849 for k', if it wants to slow DCTCP down to roughly the same throughput 1850 as Classic flows, to compensate for Classic flows slowing themselves 1851 down by causing themselves extra queuing delay. 1853 The values for k' in the table are derived from the formulae, which 1854 was developed in [DCttH15]: 1856 2^k' = 1.64 (RTT_reno / RTT_dc) (2) 1857 2^k' = 1.19 (RTT_cubic / RTT_dc ) (3) 1859 For localized traffic from a particular ISP's data centre, we used 1860 the measured RTTs to calculate that a value of k'=3 (equivalant to 1861 k=8) would achieve throughput equivalence, and our experiments 1862 verified the formula very closely. 1864 For a typical mix of RTTs from local data centres and across the 1865 general Internet, a value of k'=1 (equivalent to k=2) is recommended 1866 as a good workable compromise. 1868 Appendix D. Open Issues 1870 Most of the following open issues are also tagged '{ToDo}' at the 1871 appropriate point in the document: 1873 PI2 appendix: scaling of alpha & beta, esp. dependence of beta_U 1874 on Tupdate 1876 Curvy RED appendix: complete the unfinished parts 1878 Authors' Addresses 1880 Koen De Schepper 1881 Nokia Bell Labs 1882 Antwerp 1883 Belgium 1885 Email: koen.de_schepper@nokia.com 1886 URI: https://www.bell-labs.com/usr/koen.de_schepper 1888 Bob Briscoe (editor) 1889 CableLabs 1890 UK 1892 Email: ietf@bobbriscoe.net 1893 URI: http://bobbriscoe.net/ 1895 Olga Bondarenko 1896 Simula Research Lab 1897 Lysaker 1898 Norway 1900 Email: olgabnd@gmail.com 1901 URI: https://www.simula.no/people/olgabo 1903 Ing-jyh Tsang 1904 Nokia 1905 Antwerp 1906 Belgium 1908 Email: ing-jyh.tsang@nokia.com