idnits 2.17.1 draft-briscoe-aqm-dualq-coupled-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (August 07, 2015) is 3184 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 788 -- Looks like a reference, but probably isn't: '1' on line 788 == Unused Reference: 'RFC4774' is defined on line 620, but no explicit reference was found in the text == Outdated reference: A later version (-06) exists of draft-ietf-aqm-fq-codel-01 == Outdated reference: A later version (-10) exists of draft-ietf-aqm-pie-01 -- Obsolete informational reference (is this intentional?): RFC 2309 (Obsoleted by RFC 7567) Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Active Queue Management (aqm) K. De Schepper 3 Internet-Draft Bell Labs 4 Intended status: Standards Track B. Briscoe, Ed. 5 Expires: February 8, 2016 Independent 6 O. Bondarenko 7 Simula Research Lab 8 I. Tsang 9 Bell Labs 10 August 07, 2015 12 DualQ Coupled AQM for Low Latency, Low Loss and Scalable Throughput 13 draft-briscoe-aqm-dualq-coupled-00 15 Abstract 17 Data Centre TCP (DCTCP) was designed to provide predictably low 18 queuing latency, near-zero loss, and throughput scalability using 19 explicit congestion notification (ECN) and an extremely simple 20 marking behaviour on switches. However, DCTCP does not co-exist with 21 existing TCP traffic---throughput starves. So, until now, DCTCP 22 could only be deployed where a clean-slate environment could be 23 arranged, such as in private data centres. This specification 24 defines `DualQ Coupled Active Queue Management (AQM)' to allow 25 scalable congestion controls like DCTCP to safely co-exist with 26 classic Internet traffic. The Coupled AQM ensures that a flow runs 27 at about the same rate whether it uses DCTCP or TCP Reno/Cubic, but 28 without inspecting transport layer flow identifiers. When tested in 29 a residential broadband setting, DCTCP achieved sub-millisecond 30 average queuing delay and zero congestion loss under a wide range of 31 mixes of DCTCP and `Classic' broadband Internet traffic, without 32 compromising the performance of the Classic traffic. The solution 33 also reduces network complexity and eliminates network configuration. 35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at http://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on February 8, 2016. 51 Copyright Notice 53 Copyright (c) 2015 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (http://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. Code Components extracted from this document must 62 include Simplified BSD License text as described in Section 4.e of 63 the Trust Legal Provisions and are provided without warranty as 64 described in the Simplified BSD License. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 69 1.1. Problem and Scope . . . . . . . . . . . . . . . . . . . . 2 70 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 71 1.3. Features . . . . . . . . . . . . . . . . . . . . . . . . 5 72 2. DualQ Coupled AQM Algorithm . . . . . . . . . . . . . . . . . 6 73 2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 6 74 2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 7 75 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 7 76 2.4. Normative Requirements . . . . . . . . . . . . . . . . . 9 77 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 78 4. Security Considerations . . . . . . . . . . . . . . . . . . . 10 79 4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 10 80 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11 81 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 82 6.1. Normative References . . . . . . . . . . . . . . . . . . 11 83 6.2. Informative References . . . . . . . . . . . . . . . . . 11 84 Appendix A. Example DualQ Coupled Algorithm . . . . . . . . . . 14 85 Appendix B. Guidance on Controlling Throughput Equivalence . . . 20 86 Appendix C. DCTCP Safety Enhancements . . . . . . . . . . . . . 21 87 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 22 89 1. Introduction 91 1.1. Problem and Scope 93 Latency is becoming the critical performance factor for many (most?) 94 applications on the public Internet, e.g. Web, voice, conversational 95 video, gaming and finance apps. In the developed world, further 96 increases in access network bit-rate offer diminishing returns, 97 whereas latency is still a multi-faceted problem. In the last decade 98 or so, much has been done to reduce propagation time by placing 99 caches or servers closer to users. However, queuing remains a major 100 component of latency. 102 The Diffserv architecture provides Expedited Forwarding [RFC3246], so 103 that low latency traffic can jump the queue of other traffic. 104 However, on access links dedicated to individual sites (homes, small 105 enterprises or mobile devices), often all traffic at any one time 106 will be latency-sensitive. Then Diffserv is of little use. Instead, 107 we need to remove the causes of any unnecessary delay. 109 The bufferbloat project has shown that excessively-large buffering 110 (`bufferbloat') has been introducing significantly more delay than 111 the underlying propagation time. These delays appear only 112 intermittently--only when a capacity-seeking (e.g. TCP) flow is long 113 enough for the queue to fill the buffer, making every packet in other 114 flows sharing the buffer sit through the queue. 116 Active queue management (AQM) was originally developed to solve this 117 problem (and others). Unlike Diffserv, AQM controls latency for 118 _all_ traffic in a class. In general, AQMs introduce an increasing 119 level of discard from the buffer the longer the queue persists above 120 a shallow threshold. This gives sufficient signals to capacity- 121 seeking (aka. greedy) flows to keep the buffer empty for its intended 122 purpose: absorbing bursts. However, RED [RFC2309] and other 123 algorithms from the 1990s were sensitive to their configuration and 124 hard to set correctly. So, AQM was not widely deployed. More recent 125 state-of-the-art AQMs, e.g. fq_CoDel [I-D.ietf-aqm-fq-codel], 126 PIE [I-D.ietf-aqm-pie], Adaptive RED [ARED01], define the threshold 127 in time not bytes, so it is invariant for different link rates. 129 It seems that further changes to the network alone will now yield 130 diminishing returns. Data Centre TCP 131 (DCTCP [I-D.bensley-tcpm-dctcp]) teaches us that a small but radical 132 change to TCP is needed to cut two major outstanding causes of 133 queuing delay variability: 135 1. the `sawtooth' varying rate of TCP itself; 137 2. the smoothing delay deliberately introduced into AQMs to permit 138 bursts without triggering losses. 140 The former causes a flow's round trip time (RTT) to vary from about 1 141 to 2 times the base RTT between the machines in question. The latter 142 delays the system's response to change by a worst-case 143 (transcontinental) RTT, which could be hundreds of times the actual 144 RTT of typical traffic from localized CDNs. 146 Latency is not our only concern: 148 3. It was known when TCP was first developed that it would not scale 149 to high bandwidth-delay products. 151 Given regular broadband bit-rates over WAN distances are 152 already [RFC3649] beyond the scaling range of `classic' TCP Reno, 153 `less unscalable' Cubic [I-D.zimmermann-tcpm-cubic] and 154 Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been 155 successfully deployed. However, these are now approaching their 156 scaling limits. Unfortunately, fully scalable TCPs such as DCTCP 157 cause `classic' TCP to starve itself, which is why they have been 158 confined to private data centres or research testbeds (until now). 160 This document specifies a `DualQ Coupled AQM' that solves the problem 161 of coexistence between DCTCP and classic flows, without having to 162 inspect flow identifiers. The AQM is not like flow-queuing 163 approaches [I-D.ietf-aqm-fq-codel] that classify packets by flow 164 identifier into numerous separate queues in order to isolate sparse 165 flows from the higher latency in the queues assigned to heavier flow. 166 In contrast, the AQM exploits the behaviour of scalable congestion 167 controls like DCTCP so that every packet in every flow sharing the 168 queue for DCTCP-like traffic can be served with very low latency. 170 The AQM needs fewer operations per packet than RED uses. Also, no 171 network configuration is needed for a wide range of scenarios where 172 the range of RTTs is typical for the public Internet. Therefore it 173 is believed the Coupled AQM would be applicable and easy to deploy in 174 all types of buffers; buffers in cost-reduced mass-market residential 175 equipment; buffers in end-system stacks; buffers in carrier-scale 176 equipment including remote access servers, routers, firewalls and 177 Ethernet switches; buffers in network interface cards, buffers in 178 virtualized network appliances, hypervisors, and so on. 180 The supporting paper [DCttH15] gives the full rationale for the AQM's 181 design, both discursively and in more precise mathematical form. 183 1.2. Terminology 185 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 186 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 187 document are to be interpreted as described in [RFC2119]. In this 188 document, these words will appear with that interpretation only when 189 in ALL CAPS. Lower case uses of these words are not to be 190 interpreted as carrying RFC-2119 significance. 192 The DualQ Coupled AQM uses two queues for two services. Each of the 193 following terms identifies both the service and the queue that 194 provides the service: 196 Classic (denoted by subscript C): The `Classic' service is intended 197 for all the behaviours that currently co-exist with TCP Reno (TCP 198 Cubic, Compound, SCTP, etc). 200 Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L): 201 The `L4S' service is intended for DCTCP traffic but it is also 202 more general--it will allow a set of congestion controls with 203 similar scaling properties to DCTCP (e.g. Relentless [Mathis09]) 204 to evolve. 206 Either service can cope with a proportion of unresponsive or less- 207 responsive traffic as well (e.g. DNS, VoIP, etc). 209 1.3. Features 211 The AQM couples marking and/or dropping across the two queues such 212 that a flow will get roughly the same throughput whichever it uses. 213 Therefore both queues can feed into the full capacity of a link and 214 no rates need to be configured for the queues. The L4S queue enables 215 scalable congestion controls like DCTCP to give stunningly low and 216 predictably low latency, without compromising the performance of 217 competing 'Classic' Internet traffic. Thousands of tests have been 218 conducted in a typical fixed residential broadband setting. Typical 219 experiments used a base round trip delay of 7ms between the data 220 centre and home network, and large amounts of background traffic in 221 both queues. For every L4S packet, the AQM kept the 99th percentile 222 of queuing delay to about 1ms, and no losses at all were introduced 223 by the AQM. Details of the extensive experiments will be made 224 available [DCttH15]. 226 Subjective testing was also conducted using a demanding panoramic 227 interactive video application run over a stack with DCTCP enabled and 228 deployed on the testbed. Each user could pan or zoom their own high 229 definition (HD) sub-window of a larger video scene from a football 230 match. Even though the user was also downloading large amounts of 231 L4S and Classic data, latency was so low that the picture appeared to 232 stick to their finger on the touchpad (all the L4S data achieved the 233 same ultra-low latency). With an alternative AQM, the video 234 noticeably lagged behind the finger gestures. 236 Unlike Diffserv Expedited Forwarding, the L4S queue does not have to 237 be limited to a small proportion of the link capacity in order to 238 achieve low delay. The L4S queue can be filled with a heavy load of 239 capacity-seeking flows like DCTCP and still achieve low delay. The 240 L4S queue does not rely on the presence of other traffic in the 241 Classic queue that can be 'overtaken'. It gives low latency to L4S 242 traffic whether or not there is Classic traffic, and the latency of 243 Classic traffic does not suffer when a proportion of the traffic is 244 L4S. The two queues are only necessary because DCTCP-like flows 245 cannot keep latency predictably low and keep utilization high if they 246 are mixed with legacy TCP flows, 248 The experiments used the Linux implementation of DCTCP that is 249 deployed in private data centres, without any modification despite 250 its known deficiencies. Nonetheless, certain modifications will be 251 necessary before DCTCP is safe to use on the Internet, which are 252 recorded for now in Appendix C. However, the focus of this 253 specification is to get the network service in place. Then, without 254 any management intervention, applications can exploit it by migrating 255 to scalable controls like DCTCP, which can then evolve _while_ their 256 benefits are being enjoyed by everyone on the Internet. 258 2. DualQ Coupled AQM Algorithm 260 There are two main aspects to the algorithm: 262 o the Coupled AQM that addresses throughput equivalence between 263 Classic (e.g. Reno, Cubic) flows and L4S (e.g. DCTCP) flows 265 o the Dual Queue structure that provides latency separation for L4S 266 flows to isolate them from the typically large Classic queue. 268 2.1. Coupled AQM 270 In the 1990s, the `TCP formula' was derived for the relationship 271 between TCP's congestion window, cwnd, and its drop probability, p. 272 To a first order approximation, cwnd of TCP Reno is inversely 273 proportional to the square root of p. TCP Cubic implements a Reno- 274 compatibility mode, which is the only relevant mode for typical RTTs 275 under 20ms, while the throughput of a single flow is less than about 276 500Mb/s. Therefore we can assume that Cubic traffic behaves similar 277 to Reno (but with a slightly different constant of proportionality), 278 and we shall use the term 'Classic' for the collection of Reno and 279 Cubic in Reno mode. 281 In our supporting paper [DCttH15], we derive the equivalent rate 282 equation for DCTCP, for which cwnd is inversely proportional to p 283 (not the square root), where in this case p is the ECN marking 284 probability. DCTCP is not the only congestion control that behaves 285 like this, so we use the term 'L4S' traffic for all similar 286 behaviour. 288 In order to make a DCTCP flow run at roughly the same rate as a Reno 289 TCP flow (all other factors being equal), we make the drop 290 probability for Classic traffic, p_C distinct from the marking 291 probability for L4S traffic, p_L (in contrast to RFC3168 which 292 requires them to be the same). We make the Classic drop probability 293 p_C proportional to the square of the L4S marking probability p_L. 294 This is because we need to make the Reno flow rate equal the DCTCP 295 flow rate, so we have to square the square root of p_C in the Reno 296 rate equation to make it the same as the straight p_L in the DCTCP 297 rate equation. 299 There is a really simple way to implement the square of a probability 300 - by testing the queue against two random numbers not one. This is 301 the approach adopted in Appendix A. 303 Stating this as a formula, the relation between Classic drop 304 probability, p_C, and L4S marking probability, p_L needs to take the 305 form: 307 p_C = ( p_L / 2^k )^2 (1) 309 where 2^k is the constant of proportionality, which is expressed as a 310 power of 2 so that implementations can avoid costly division by 311 shifting p_L by k bits to the right. 313 2.2. Dual Queue 315 Classic traffic builds a large queue, so a separate queue is provided 316 for L4S traffic, and it is scheduled with strict priority. 317 Nonetheless, coupled marking ensures that giving priority to L4S 318 traffic still leaves the right amount of spare scheduling time for 319 Classic flows to each get equivalent throughput to DCTCP flows (all 320 other factors such as RTT being equal). The algorithm achieves this 321 without having to inspect flow identifiers. 323 2.3. Traffic Classification 325 Both the Coupled AQM and DualQ mechanisms need an identifier to 326 distinguish L4S and C packets, which will need to be standardized. 327 In our tests we used a cleared ECN field to indicate C packets and 328 L4S otherwise. The ECN specification [RFC3168] currently defines a 329 mark as equivalent to a drop. However, it says 331 "An environment where all end nodes were ECN-Capable could allow 332 new criteria to be developed for setting the CE codepoint, and new 333 congestion control mechanisms for end-node reaction to CE packets. 334 However, this is a research issue, and as such is not addressed in 335 this document." 337 and [RFC4774]} gives valid ways to alter ECN's semantics without 338 harming interoperability. 340 Since publication in 2001,deployment of RFC3168 ECN has been dogged 341 by bugs and misunderstandings. In recent years RFC3168 ECN has been 342 deployed quite successfully on servers [ECN_Deploy], and until 343 recently it was deployed but not enabled on a fair proportion of user 344 machines. Recently one major developer of client devices has 345 configured ECN on-by-default in its beta releases. However although 346 some network equipment vendors and developers have implemented ECN, 347 there is little evidence that any public network operator is 348 considering or has deployed ECN-capable AQMs on network equipment 349 yet. 351 A number of private data centre operators have deployed ECN, but not 352 RFC3168 ECN. Instead, they are using DCTCP to get predictable ultra- 353 low latency, and they are either ensuring that there is no non-DCTCP 354 traffic [I-D.bensley-tcpm-dctcp], or they are segregating such 355 traffic from DCTCP using Diffserv [DCTCP_Pitfalls]. The RFC3168 356 approach merely prevents drop, whereas the DCTCP approach provides 357 scalable throughput and ultra-low latency as well as avoiding drop. 358 Consequently it has been questioned whether the RFC3168 approach 359 offers enough performance improvement for an operator to countenance 360 the cost and risk of deployment. There has been some discussions at 361 the IETF on changing the meaning of an ECN mark to move towards the 362 DCTCP approach.The performance results from our experiments with 363 DCTCP for broadband residential users are certainly significant 364 enough to warrant interest from operators. 366 For those who have managed to get classic ECN widely deployed on end- 367 systems, moving the goalposts at this stage would be harsh. If the 368 meaning of ECN cannot be changed from "equivalent to drop", it would 369 be possible to identify the L4S service in another way, e.g. a 370 combination of ECN and Diffserv, or using the ECT(1) codepoint. The 371 Diffserv codepoint is not ideal, because L4S is an end-to-end service 372 and a DSCP is not preserved end-to-end. However, combining ECN and 373 Diffserv may be sufficient for initial deployment, while confined to 374 controlled sets of networks, during which time any users of classic 375 ECN can upgrade to L4S. The ECT(1) codepoint is perhaps less ideal, 376 because two separate uses of ECN really need two codepoints each, and 377 anyway it could be argued that the last ECN codepoint should not be 378 burned when the current one is not being used. 380 This draft does not currently recommend an approach for identifying 381 for the L4S service, which is initially left open for discussion 382 within the IETF. 384 2.4. Normative Requirements 386 In the Dual Queue, L4S packets MUST be given priority over Classic, 387 although strict priority MAY not be appropriate. 389 All L4S traffic MUST be ECN-capable, although some Classic traffic 390 MAY also be ECN-capable. 392 Whatever identifier is used for L4S traffic, it will still be 393 necessary to agree on the meaning of an ECN marking on L4S traffic, 394 relative to a drop of Classic traffic. In order to prevent 395 starvation of Classic traffic by scalable L4S traffic (e.g. DCTCP) 396 the drop probability of Classic traffic MUST be proportional to the 397 square of the marking probability of L4S traffic, In other words, the 398 power to which p_L is raised in Eqn. (1) MUST be 2. 400 The constant of proportionality, k, in Eqn (1) determines the 401 relative flow rates of Classic and L4S flows when the AQM concerned 402 is the bottleneck (all other factors being equal). k does not have to 403 be standardized because differences do not prevent interoperability. 404 However, k has to take some value, and each operator can make that 405 choice. 407 A value of k=0 is RECOMMENDED as the default for public Internet 408 access networks, assuming the DCTCP algorithm remains similar to that 409 in [I-D.bensley-tcpm-dctcp]. Nonetheless choice of k is a matter of 410 operator policy, and operators MAY choose a different value using 411 Table 1 and the guidelines in Appendix B. 413 Typically, access network operators isolate customers from each other 414 with some form of layer-2 multiplexing (TDM in DOCSIS, CDMA in 3G) or 415 L3 scheduling (WRR in broadband), rather than relying on TCP to share 416 capacity between customers [RFC0970]. In such cases, the choice of k 417 will solely affect relative flow rates within the customer's access 418 capacity, not between customers. Also, k would not affect rates of 419 small flows, nor long flows at any times when they are all Classic or 420 all L4S. 422 An example DualQ Coupled AQM algorithm is given in Appendix A. 423 Marking and dropping in each queue is based on an AQM called Curvy 424 RED, which is intended to improve on RED, PIE and CoDel. We have 425 found that Curvy RED offers good performance, requires less 426 operations per packet than RED and is insensitive to configuration. 427 Nonetheless, it would be possible to control each queue with an 428 alternative AQM, as long as the above normative requirements (those 429 expressed in capitals) are observed, which are intended to be 430 independent of the specific AQM. 432 {ToDo: Add management and monitoring requirements} 434 3. IANA Considerations 436 This specification contains no IANA considerations. 438 4. Security Considerations 440 4.1. Overload Handling 442 Where the interests of users or flows might conflict, it could be 443 necessary to police traffic to isolate any harm to performance. This 444 is a policy issue that needs to be separable from a basic AQM, but 445 the scheme does need to handle overload. A trade-off needs to be 446 made between complexity and the risk of either class harming the 447 other. It is an operator policy to define what must happen if the 448 service time of the classic queue becomes too great. In the 449 following subsections three optional non-exclusive overload 450 protections are defined. Their objective is for the overload 451 behaviour of the DualQ AQM to be similar to a single queue AQM. 452 Other overload protections can be envisaged: 454 Minimum throughput service: By replacing the priority scheduler 455 with a weighted round robin scheduler, a minimum throughput 456 service can be guaranteed for Classic traffic. Typically the 457 scheduling weight of the Classic queue will be small (e.g. 5%) to 458 avoid interference with the coupling but big enough to avoid 459 complete starvation of Classic traffic. 461 Drop on overload: On severe overload, e.g. due to non responsive 462 traffic, queues will typically overflow and packet drop will be 463 unavoidable when the queues reach their limits. The drop-limit of 464 each queue should be configured by specifying the maximum 465 supported load and determining the expected maximum size of each 466 queue when that load is separately applied to each queue. The 467 Classic queue limit will typically be larger than the L4S queue 468 limit. Overflow of one traffic type will automatically result in 469 drop in its respective queue. Both traffic types will get a high 470 congestion signal, due to the coupled marking, which will result 471 in similar starvation of responsive traffic in both queues. Thus, 472 the behaviour will be like a single queue AQM. To further improve 473 the arrival fairness of a single queue an extra overall AQM limit 474 can be applied, which is a limit to the sum of both queues. To be 475 effective, it should be configured to be less than the sum of the 476 limits of both queues, but greater than the maximum individual 477 queue limit. It ensures that the drop probability of unresponsive 478 traffic will be independent of its traffic type. 480 Delay on overload: To control milder overload of responsive traffic, 481 particularly when close to the maximum congestion signal, delay 482 can be used as an alternative congestion control mechanism. The 483 Dual Queue Coupled AQM can be made to behave like a single FIFO 484 queue with differentiated service times by replacing the priority 485 scheduler with a very simple "biased longest sojourn time first 486 scheduler". The bias is defined as a maximum sojourn time 487 difference (T_m) between the Classic and L4S packets. The 488 scheduler adds T_m to the sojourn time of the next L4S packet, 489 before comparing it with the timestamp of the next Classic packet, 490 then it selects the packet with the greater adjusted sojourn time. 491 This time shifted FIFO queue behaves just like a single FIFO queue 492 under moderate and high overload. 494 5. Acknowledgements 496 Thanks to Anil Agarwal for detailed review comments and suggestions 497 on how to make our explanation clearer. 499 The authors' contributions are part-funded by the European Community 500 under its Seventh Framework Programme through the Reducing Internet 501 Transport Latency (RITE) project (ICT-317700). The views expressed 502 here are solely those of the authors. 504 6. References 506 6.1. Normative References 508 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 509 Requirement Levels", BCP 14, RFC 2119, 510 DOI 10.17487/RFC2119, March 1997, 511 . 513 6.2. Informative References 515 [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An 516 Algorithm for Increasing the Robustness of RED's Active 517 Queue Management", ACIRI Technical Report , August 2001, 518 . 520 [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", 521 ACM Queue 10(5), May 2012, 522 . 524 [CRED_Insights] 525 Briscoe, B., "Insights from Curvy RED (Random Early 526 Detection)", BT Technical Report TR-TUB8-2015-003, July 527 2015, 528 . 530 [DCTCP_Pitfalls] 531 Judd, G., "Attaining the Promise and Avoiding the Pitfalls 532 of TCP in the Datacenter", 12th USENIX Symposium on 533 Networked Systems Design and Implementation (NSDI 15) 534 145--157, May 2015, 535 . 538 [DCttH15] De Schepper, K., Bondarenko, O., Briscoe, B., and I. 539 Tsang, "`Data Centre to the Home': Ultra-Low Latency for 540 All", 2015, . 543 (Under submission) 545 [ECN_Deploy] 546 Trammell, B., Kuehlewind, M., Boppart, D., Learmonth, I., 547 Fairhurst, G., and R. Scheffenegger, "Enabling Internet- 548 Wide Deployment of Explicit Congestion Notification", Proc 549 Passive & Active Measurement (PAM'15) Conference , 2015, 550 . 552 [I-D.bensley-tcpm-dctcp] 553 Bensley, S., Eggert, L., Thaler, D., Balasubramanian, P., 554 and G. Judd, "Microsoft's Datacenter TCP (DCTCP): TCP 555 Congestion Control for Datacenters", draft-bensley-tcpm- 556 dctcp-05 (work in progress), July 2015. 558 [I-D.ietf-aqm-fq-codel] 559 Hoeiland-Joergensen, T., McKenney, P., 560 dave.taht@gmail.com, d., Gettys, J., and E. Dumazet, 561 "FlowQueue-Codel", draft-ietf-aqm-fq-codel-01 (work in 562 progress), July 2015. 564 [I-D.ietf-aqm-pie] 565 Pan, R., Natarajan, P., Baker, F., and G. White, "PIE: A 566 Lightweight Control Scheme To Address the Bufferbloat 567 Problem", draft-ietf-aqm-pie-01 (work in progress), March 568 2015. 570 [I-D.ietf-tcpm-accecn-reqs] 571 Kuehlewind, M., Scheffenegger, R., and B. Briscoe, 572 "Problem Statement and Requirements for a More Accurate 573 ECN Feedback", draft-ietf-tcpm-accecn-reqs-08 (work in 574 progress), March 2015. 576 [I-D.sridharan-tcpm-ctcp] 577 Sridharan, M., Tan, K., Bansal, D., and D. Thaler, 578 "Compound TCP: A New TCP Congestion Control for High-Speed 579 and Long Distance Networks", draft-sridharan-tcpm-ctcp-02 580 (work in progress), November 2008. 582 [I-D.zimmermann-tcpm-cubic] 583 Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and 584 R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", 585 draft-zimmermann-tcpm-cubic-01 (work in progress), April 586 2015. 588 [Mathis09] 589 Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , 590 May 2009, . 593 [RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", 594 RFC 970, DOI 10.17487/RFC0970, December 1985, 595 . 597 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 598 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 599 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 600 S., Wroclawski, J., and L. Zhang, "Recommendations on 601 Queue Management and Congestion Avoidance in the 602 Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, 603 . 605 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 606 of Explicit Congestion Notification (ECN) to IP", 607 RFC 3168, DOI 10.17487/RFC3168, September 2001, 608 . 610 [RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, 611 J., Courtney, W., Davari, S., Firoiu, V., and D. 612 Stiliadis, "An Expedited Forwarding PHB (Per-Hop 613 Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, 614 . 616 [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", 617 RFC 3649, DOI 10.17487/RFC3649, December 2003, 618 . 620 [RFC4774] Floyd, S., "Specifying Alternate Semantics for the 621 Explicit Congestion Notification (ECN) Field", BCP 124, 622 RFC 4774, DOI 10.17487/RFC4774, November 2006, 623 . 625 [TCP-sub-mss-w] 626 Briscoe, B. and K. De Schepper, "Scaling TCP's Congestion 627 Window for Small Round Trip Times", BT Technical Report 628 TR-TUB8-2015-002, May 2015, 629 . 632 Appendix A. Example DualQ Coupled Algorithm 634 As a concrete example, the pseudocode below gives the DualQ Coupled 635 AQM algorithm we used in testing. Although we designed the AQM to be 636 efficient in integer arithmetic, to aid understanding it is first 637 given using real-number arithmetic. Then, one possible optimization 638 for integer arithmetic is given, also in pseudocode. To aid 639 comparison, the line numbers are kept in step between the two by 640 using letter suffixes where the longer code needs extra lines. 642 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 643 2: if ( lq.dequeue(pkt) ) { 644 3a: p_L = cq.sec() / 2^S_L 645 3b: if ( lq.byt() > T ) 646 3c: mark(pkt) 647 3d: elif ( p_L > maxrand(U) ) 648 4: mark(pkt) 649 5: return(pkt) % return the packet and stop here 650 6: } 651 7: while ( cq.dequeue(pkt) ) { 652 8a: alpha = 2^(-f_C) 653 8b: Q_C = alpha * pkt.sec() + (1-alpha)* Q_C % Classic Q EWMA 654 9a: sqrt_p_C = Q_C / 2^S_C 655 9b: if ( sqrt_p_C > maxrand(2*U) ) 656 10: drop(pkt) % Squared drop, redo loop 657 11: else 658 12: return(pkt) % return the packet and stop here 659 13: } 660 14: return(NULL) % no packet to dequeue 661 15: } 663 16: maxrand(u) { % return the max of u random numbers 664 17: maxr=0 665 18: while (u-- > 0) 666 19: maxr = max(maxr, rand()) % 0 <= rand() < 1 667 20: return(maxr) 668 21: } 670 Figure 1: Example Dequeue Pseudocode for Coupled DualQ AQM 672 Packet classification code is not shown, as it is no different from 673 regular packet classification. Potential classification schemes are 674 discussed in Section 2. Overload protection code will be included in 675 a future draft {ToDo}. 677 At the outer level, the structure of dualq_dequeue() implements 678 strict priority scheduling. The code is written assuming the AQM is 679 applied on dequeue (Note 1) . Every time dualq_dequeue() is called, 680 the if-block in lines 2-6 determines whether there is an L4S packet 681 to dequeue by calling lq.dequeue(pkt), and otherwise the while-block 682 in lines 7-13 determines whether there is a Classic packet to 683 dequeue, by calling cq.dequeue(pkt). (Note 2) 685 In the lower priority Classic queue, a while loop is used so that, if 686 the AQM determines that a classic packet should be dropped, it 687 continues to test for classic packets deciding whether to drop each 688 until it actually forwards one. Thus, every call to dualq_dequeue() 689 returns one packet if at least one is present in either queue, 690 otherwise it returns NULL at line 14. (Note 3) 692 Within each queue, the decision whether to drop or mark is taken as 693 follows (to simplify the explanation, it is assumed that U=1): 695 L4S: If the test at line 2 determines there is an L4S packet to 696 dequeue, the tests at lines 3a and 3c determine whether to mark 697 it. The first is a simple test of whether the L4S queue (lq.byt() 698 in bytes) is greater than a step threshold T in bytes (Note 4). 699 The second test is similar to the random ECN marking in RED, but 700 with the following differences: i) the marking function does not 701 start with a plateau of zero marking until a minimum threshold, 702 rather the marking probability starts to increase as soon as the 703 queue is positive; ii) marking depends on queuing time, not bytes, 704 in order to scale for any link rate without being reconfigured; 705 iii) marking of the L4S queue does not depend on itself, it 706 depends on the queuing time of the _other_ (Classic) queue, where 707 cq.sec() is the queuing time of the packet at the head of the 708 Classic queue (zero if empty); iv) marking depends on the 709 instantaneous queuing time (of the other queue), not a smoothed 710 average; v) the queue is compared with the maximum of U random 711 numbers (but if U=1, this is the same as the single random number 712 used in RED). 714 Specifically, in line 3a the marking probability p_L is set to the 715 Classic queueing time qc.sec() in seconds divided by the L4S 716 scaling parameter 2^S_L, which represents the queuing time (in 717 seconds) at which marking probability would hit 100%. Then in line 718 3d (if U=1) the result is compared with a uniformly distributed 719 random number between 0 and 1, which ensures that marking 720 probability will linearly increase with queueing time. The 721 scaling parameter is expressed as a power of 2 so that division 722 can be implemented as a right bit-shift (>>) in line 3 of the 723 integer variant of the pseudocode (Figure 2). 725 Classic: If the test at line 7 determines that there is at least one 726 Classic packet to dequeue, the test at line 9b determines whether 727 to drop it. But before that, line 8b updates Q_C, which is an 728 exponentially weighted moving average (Note 5) of the queuing time 729 in the Classic queue, where pkt.sec() is the instantaneous 730 queueing time of the current Classic packet and alpha is the EWMA 731 constant for the classic queue. In line 8a, alpha is represented 732 as an integer power of 2, so that in line 8 of the integer code 733 the division needed to weight the moving average can be 734 implemented by a right bit-shift (>> f_C). 736 Lines 9a and 9b implement the drop function. In line 9a the 737 averaged queuing time Q_C is divided by the Classic scaling 738 parameter 2^S_C, in the same way that queuing time was scaled for 739 L4S marking. This scaled queuing time is given the variable name 740 sqrt_p_C because it will be squared to compute Classic drop 741 probability, so before it is squared it is effectively the square 742 root of the drop probability. The squaring is done by comparing 743 it with the maximum out of two random numbers (assuming U=1). 744 Comparing it with the maximum out of two is the same as the 745 logical `AND' of two tests, which ensures drop probability rises 746 with the square of queuing time (Note 6). Again, the scaling 747 parameter is expressed as a power of 2 so that division can be 748 implemented as a right bit-shift in line 9 of the integer 749 pseudocode. 751 The marking/dropping functions in each queue (lines 3 & 9) are two 752 cases of a new generalization of RED called Curvy RED, motivated as 753 follows. When we compared the performance of our AQM with fq_CoDel 754 and PIE, we came to the conclusion that their goal of holding queuing 755 delay to a fixed target is misguided [CRED_Insights]. As the number 756 of flows increases, if the AQM does not allow TCP to increase queuing 757 delay, it has to introduce abnormally high levels of loss. Then loss 758 rather than queuing becomes the dominant cause of delay for short 759 flows, due to timeouts and tail losses. 761 Curvy RED constrains delay with a softened target that allows some 762 increase in delay as load increases. This is achieved by increasing 763 drop probability on a convex curve relative to queue growth (the 764 square curve in the Classic queue, if U=1). Like RED, the curve hugs 765 the zero axis while the queue is shallow. Then, as load increases, 766 it introduces a growing barrier to higher delay. But, unlike RED, it 767 requires only one parameter, the scaling, not three. 769 There follows a summary listing of the two parameters used for each 770 of the two queues: 772 Classic: 774 S_C : The scaling factor of the dropping function scales Classic 775 queuing times in the range [0, 2^(S_C)] seconds into a dropping 776 probability in the range [0,1]. To make division efficient, it 777 is constrained to be an integer power of two; 779 f_C : To smooth the queuing time of the Classic queue and make 780 multiplication efficient, we use a negative integer power of 781 two for the dimensionless EWMA constant, which we define as 782 2^(-f_C). 784 L4S : 786 S_L (and k): As for the Classic queue, the scaling factor of the 787 L4S marking function scales Classic queueing times in the range 788 [0, 2^(S_L)] seconds into a probability in the range [0,1]. 789 Note that S_L = S_C + k, where k is the coupling between the 790 queues (Section 2.1). So S_L and k count as only one 791 parameter; 793 T : The queue size in bytes at which step threshold marking 794 starts in the L4S queue. 796 {ToDo: These are the raw parameters used within the algorithm. A 797 configuration front-end could accept more meaningful parameters and 798 convert them into these raw parameters.} 800 From our experiments so far, recommended values for these parameters 801 are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs 802 typical on the public Internet. [CRED_Insights] explains why these 803 parameters are applicable whatever rate link this AQM implementation 804 is deployed on and how the parameters would need to be adjusted for a 805 scenario with a different range of RTTs (e.g. a data centre) {ToDo 806 incorporate a summary of that report into this draft}. The setting of 807 k depends on policy (see Section 2.4 and Appendix B respectively for 808 its recommended setting and guidance on alternatives). 810 There is also a cUrviness parameter, U, which is a small positive 811 integer. It is likely to take the same hard-coded value for all 812 implementations, once experiments have determined a good value. We 813 have solely used U=1 in our experiments so far, but results might be 814 even better with U=2 or higher. 816 Note that the dropping function at line 9 calls maxrand(2*U), which 817 gives twice as much curviness as the call to maxrand(U) in the 818 marking function at line 3. This is the trick that implements the 819 square rule in equation (1) (Section 2.1). This is based on the fact 820 that, given a number X from 1 to 6, the probability that two dice 821 throws will both be less than X is the square of the probability that 822 one throw will be less than X. So, when U=1, the L4S marking 823 function is linear and the Classic dropping function is squared. If 824 U=2, L4S would be a square function and Classic would be quartic. 825 And so on. 827 The maxrand(u) function in lines 16-21 simply generates u random 828 numbers and returns the maximum (Note 7). Typically, maxrand(u) 829 could be run in parallel out of band. For instance, if U=1, the 830 Classic queue would require the maximum of two random numbers. So, 831 instead of calling maxrand(2*U) in-band, the maximum of every pair of 832 values from a pseudorandom number generator could be generated out- 833 of-band, and held in a buffer ready for the Classic queue to consume. 835 1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq 836 2: if ( lq.dequeue(pkt) ) { 837 3: if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U))) 838 4: mark(pkt) 839 5: return(pkt) % return the packet and stop here 840 6: } 841 7: while ( cq.dequeue(pkt) ) { 842 8: Q_C += (pkt.ns() - Q_C) >> f_C % Classic Q EWMA 843 9: if ( (Q_C >> (S_C-2) ) > maxrand(2*U) ) 844 10: drop(pkt) % Squared drop, redo loop 845 11: else 846 12: return(pkt) % return the packet and stop here 847 13: } 848 14: return(NULL) % no packet to dequeue 849 15: } 851 Figure 2: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM 852 using Integer Arithmetic 854 Notes: 856 1. The drain rate of the queue can vary if it is scheduled relative 857 to other queues, or to cater for fluctuations in a wireless 858 medium. To auto-adjust to changes in drain rate, the queue must 859 be measured in time, not bytes or packets [CoDel]. In our Linux 860 implementation, it was easiest to measure queuing time at 861 dequeue. Queuing time can be estimated when a packet is enqueued 862 by measuring the queue length in bytes and dividing by the recent 863 drain rate. 865 2. An implementation has to use priority queueing, but it need not 866 implement strict priority. 868 3. If packets can be enqueued while processing dequeue code, an 869 implementer might prefer to place the while loop around both 870 queues so that it goes back to test again whether any L4S packets 871 arrived while it was dropping a Classic packet. 873 4. In order not to change too many factors at once, for now, we keep 874 the marking function for DCTCP-only traffic as similar as 875 possible to DCTCP. However, unlike DCTCP, all processing is at 876 dequeue, so we determine whether to mark a packet at the head of 877 the queue by the byte-length of the queue _behind_ it. We plan 878 to test whether using queuing time will work in all 879 circumstances, and if we find that the step can cause 880 oscillations, we will investigate replacing it with a steep 881 random marking curve. 883 5. An EWMA is only one possible way to filter bursts; other more 884 adaptive smoothing methods could be valid and it might be 885 appropriate to decrease the EWMA faster than it increases. 887 6. In practice at line 10 the Classic queue would probably test for 888 ECN capability on the packet to determine whether to drop or mark 889 the packet. However, for brevity such detail is omitted. All 890 packets classified into the L4S queue have to be ECN-capable, so 891 no dropping logic is necessary at line 3. Nonetheless, L4S 892 packets could be dropped by overload code (see Section 4.1). 894 7. In the integer variant of the pseudocode (Figure 2) real numbers 895 are all represented as integers scaled up by 2^32. In lines 3 & 896 9 the function maxrand() is arranged to return an integer in the 897 range 0 <= maxrand() < 2^32. Queuing times are also scaled up by 898 2^32, but in two stages: i) In lines 3 and 8 queuing times 899 cq.ns() and pkt.ns() are returned in integer nanoseconds, making 900 the values about 2^30 times larger than when the units were 901 seconds, ii) then in lines 3 and 9 an adjustment of -2 to the 902 right bit-shift multiplies the result by 2^2, to complete the 903 scaling by 2^32. 905 Appendix B. Guidance on Controlling Throughput Equivalence 907 +---------------+------+-------+ 908 | RTT_C / RTT_L | Reno | Cubic | 909 +---------------+------+-------+ 910 | 1 | k=1 | k=0 | 911 | 2 | k=2 | k=1 | 912 | 3 | k=2 | k=2 | 913 | 4 | k=3 | k=2 | 914 | 5 | k=3 | k=3 | 915 +---------------+------+-------+ 917 Table 1: Value of k for which DCTCP throughput is roughly the same as 918 Reno or Cubic, for some example RTT ratios 920 To determine the appropriate policy, the operator first has to judge 921 whether it wants DCTCP flows to have roughly equal throughput with 922 Reno or with Cubic (because, even in its Reno-compatibility mode, 923 Cubic is about 1.4 times more aggressive than Reno). Then the 924 operator needs to decide at what ratio of RTTs it wants DCTCP and 925 Classic flows to have roughly equal throughput. For example choosing 926 the recommended value of k=0 will make DCTCP throughput roughly the 927 same as Cubic, _if their RTTs are the same_. 929 However, even if the base RTTs are the same, the actual RTTs are 930 unlikely to be the same, because Classic (Cubic or Reno) traffic 931 needs a large queue to avoid under-utilization and excess drop, 932 whereas L4S (DCTCP) does not. The operator might still choose this 933 policy if it judges that DCTCP throughput should be rewarded for 934 keeping its own queue short. 936 On the other hand, the operator will choose one of the higher values 937 for k, if it wants to slow DCTCP down to roughly the same throughput 938 as Classic flows, to compensate for Classic flows slowing themselves 939 down by causing themselves extra queuing delay. 941 The values for k in the table are derived from the formulae, which 942 was developed in [DCttH15]: 944 2^k = 1.64 (RTT_reno / RTT_dc) (2) 945 2^k = 1.19 (RTT_cubic / RTT_dc ) (3) 947 For localized traffic from a particular ISP's data centre, we used 948 the measured RTTs to calculate that a value of k=3 would achieve 949 throughput equivalence, and our experiments verified the formula very 950 closely. 952 Appendix C. DCTCP Safety Enhancements 954 This Appendix is informational not normative. It records changes 955 needed to DCTCP implementations so they can co-exist safely alongside 956 other traffic sources. They are recorded here until a more 957 appropriate draft is available to hold them. 959 Proposed changes are listed in rough order of criticality. Therefore 960 those later in the list may not be necessary: 962 o Negotiate its altered feedback semantics, which conveys the extent 963 of ECN marking, not just its existence, and this feedback needs to 964 be robust to loss [I-D.ietf-tcpm-accecn-reqs]; 966 o fall back to Reno or Cubic behaviour on loss; 968 o use a packet identifier associated with the L4S service; 970 o average ECN feedback over its own RTT, not the hard-coded RTT 971 suitable only for data-centres, perhaps like Relentless 972 TCP [Mathis09]; 974 o handle a window of less than 2 when the RTT is low, rather than 975 increase the queue [TCP-sub-mss-w]. 977 o test heuristically whether ECN marking is emanating from an 978 RFC3168 AQM. 980 Other, non-essential enhancements to DCTCP can be envisaged. 982 Authors' Addresses 984 Koen De Schepper 985 Bell Labs 986 Antwerp 987 Belgium 989 Email: koen.de_schepper@alcatel-lucent.com 990 URI: https://www.bell-labs.com/usr/koen.de_schepper 992 Bob Briscoe (editor) 993 Independent 995 Email: ietf@bobbriscoe.net 996 URI: http://bobbriscoe.net/ 998 Olga Bondarenko 999 Simula Research Lab 1000 Lysaker 1001 Norway 1003 Email: olgabnd@gmail.com 1004 URI: https://www.simula.no/people/olgabo 1006 Ing-jyh Tsang 1007 Bell Labs 1008 Antwerp 1009 Belgium 1011 Email: ing-jyh.tsang@alcatel-lucent.com