
2	Transport Area working group (tsvwg)                      K. De Schepper
3	Internet-Draft                                           Nokia Bell Labs
4	Intended status: Experimental                            B. Briscoe, Ed.
5	Expires: July 28, 2018                                         CableLabs
6	                                                           O. Bondarenko
7	                                                     Simula Research Lab
8	                                                                I. Tsang
9	                                                                   Nokia
10	                                                        January 24, 2018

12	  DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput
13	                                 (L4S)
14	                 draft-ietf-tsvwg-aqm-dualq-coupled-03

16	Abstract

18	   Data Centre TCP (DCTCP) was designed to provide predictably low
19	   queuing latency, near-zero loss, and throughput scalability using
20	   explicit congestion notification (ECN) and an extremely simple
21	   marking behaviour on switches.  However, DCTCP does not co-exist with
22	   existing TCP traffic---DCTCP is so aggressive that existing TCP
23	   algorithms approach starvation.  So, until now, DCTCP could only be
24	   deployed where a clean-slate environment could be arranged, such as
25	   in private data centres.  This specification defines `DualQ Coupled
26	   Active Queue Management (AQM)' to allow scalable congestion controls
27	   like DCTCP to safely co-exist with classic Internet traffic.  The
28	   Coupled AQM ensures that a flow runs at about the same rate whether
29	   it uses DCTCP or TCP Reno/Cubic, but without inspecting transport
30	   layer flow identifiers.  When tested in a residential broadband
31	   setting, DCTCP achieved sub-millisecond average queuing delay and
32	   zero congestion loss under a wide range of mixes of DCTCP and
33	   `Classic' broadband Internet traffic, without compromising the
34	   performance of the Classic traffic.  The solution also reduces
35	   network complexity and eliminates network configuration.

37	Status of This Memo

39	   This Internet-Draft is submitted in full conformance with the
40	   provisions of BCP 78 and BCP 79.

42	   Internet-Drafts are working documents of the Internet Engineering
43	   Task Force (IETF).  Note that other groups may also distribute
44	   working documents as Internet-Drafts.  The list of current Internet-
45	   Drafts is at https://datatracker.ietf.org/drafts/current/.

47	   Internet-Drafts are draft documents valid for a maximum of six months
48	   and may be updated, replaced, or obsoleted by other documents at any
49	   time.  It is inappropriate to use Internet-Drafts as reference
50	   material or to cite them other than as "work in progress."

52	   This Internet-Draft will expire on July 28, 2018.

54	Copyright Notice

56	   Copyright (c) 2018 IETF Trust and the persons identified as the
57	   document authors.  All rights reserved.

59	   This document is subject to BCP 78 and the IETF Trust's Legal
60	   Provisions Relating to IETF Documents
61	   (https://trustee.ietf.org/license-info) in effect on the date of
62	   publication of this document.  Please review these documents
63	   carefully, as they describe your rights and restrictions with respect
64	   to this document.  Code Components extracted from this document must
65	   include Simplified BSD License text as described in Section 4.e of
66	   the Trust Legal Provisions and are provided without warranty as
67	   described in the Simplified BSD License.

69	Table of Contents

71	   1.  Introduction
72	     1.1.  Problem and Scope
73	     1.2.  Terminology
74	     1.3.  Features
75	   2.  DualQ Coupled AQM
76	     2.1.  Coupled AQM
77	     2.2.  Dual Queue
78	     2.3.  Traffic Classification
79	     2.4.  Overall DualQ Coupled AQM Structure
80	     2.5.  Normative Requirements for a DualQ Coupled AQM
81	       2.5.1.  Functional Requirements
82	       2.5.2.  Management Requirements
83	   3.  IANA Considerations
84	   4.  Security Considerations
85	     4.1.  Overload Handling
86	       4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput
87	               or Delay?
88	       4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or
89	               Delay?
90	       4.1.3.  Protecting against Unresponsive ECN-Capable Traffic
91	   5.  Acknowledgements
92	   6.  References
93	     6.1.  Normative References
94	     6.2.  Informative References
95	   Appendix A.  Example DualQ Coupled PI2 Algorithm
96	     A.1.  Pass #1: Core Concepts
97	     A.2.  Pass #2: Overload Details
98	   Appendix B.  Example DualQ Coupled Curvy RED Algorithm
99	   Appendix C.  Guidance on Controlling Throughput Equivalence
100	   Appendix D.  Open Issues
101	   Authors' Addresses

103	1.  Introduction

105	1.1.  Problem and Scope

107	   Latency is becoming the critical performance factor for many (most?)
108	   applications on the public Internet, e.g. interactive Web, Web
109	   services, voice, conversational video, interactive video, interactive
110	   remote presence, instant messaging, online gaming, remote desktop,
111	   cloud-based applications, and video-assisted remote control of
112	   machinery and industrial processes.  In the developed world, further
113	   increases in access network bit-rate offer diminishing returns,
114	   whereas latency is still a multi-faceted problem.  In the last decade
115	   or so, much has been done to reduce propagation time by placing
116	   caches or servers closer to users.  However, queuing remains a major
117	   component of latency.

119	   The Diffserv architecture provides Expedited Forwarding [RFC3246], so
120	   that low latency traffic can jump the queue of other traffic.
121	   However, on access links dedicated to individual sites (homes, small
122	   enterprises or mobile devices), often all traffic at any one time
123	   will be latency-sensitive and, if all the traffic on a link is marked
124	   as EF, Diffserv cannot reduce the delay of any of it.  In contrast,
125	   the Low Latency Low Loss Scalable throughput (L4S) approach removes
126	   the causes of any unnecessary queuing delay.

128	   The bufferbloat project has shown that excessively-large buffering
129	   (`bufferbloat') has been introducing significantly more delay than
130	   the underlying propagation time.  These delays appear only
131	   intermittently--only when a capacity-seeking (e.g. TCP) flow is long
132	   enough for the queue to fill the buffer, making every packet in other
133	   flows sharing the buffer sit through the queue.

135	   Active queue management (AQM) was originally developed to solve this
136	   problem (and others).  Unlike Diffserv, which gives low latency to
137	   some traffic at the expense of others, AQM controls latency for _all_
138	   traffic in a class.  In general, AQMs introduce an increasing level
139	   of discard from the buffer the longer the queue persists above a
140	   shallow threshold.  This gives sufficient signals to capacity-seeking
141	   (aka. greedy) flows to keep the buffer empty for its intended
142	   purpose: absorbing bursts.  However, RED [RFC2309] and other
143	   algorithms from the 1990s were sensitive to their configuration and
144	   hard to set correctly.  So, AQM was not widely deployed.

146	   More recent state-of-the-art AQMs, e.g. fq_CoDel [RFC8290],
147	   PIE [RFC8033], Adaptive RED [ARED01], are easier to configure,
148	   because they define the queuing threshold in time not bytes, so it is
149	   invariant for different link rates.  However, no matter how good the
150	   AQM, the sawtoothing rate of TCP will either cause queuing delay to
151	   vary or cause the link to be under-utilized.  Even with a perfectly
152	   tuned AQM, the additional queuing delay will be of the same order as
153	   the underlying speed-of-light delay across the network.  Flow-queuing
154	   can isolate one flow from another, but it cannot isolate a TCP flow
155	   from the delay variations it inflicts on itself, and it has other
156	   problems - it overrides the flow rate decisions of variable rate
157	   video applications, it does not recognise the flows within IPSec VPN
158	   tunnels and it is relatively expensive to implement.

160	   It seems that further changes to the network alone will now yield
161	   diminishing returns.  Data Centre TCP (DCTCP [RFC8257]) teaches us
162	   that a small but radical change to TCP is needed to cut two major
163	   outstanding causes of queuing delay variability:

165	   1.  the `sawtooth' varying rate of TCP itself;

167	   2.  the smoothing delay deliberately introduced into AQMs to permit
168	       bursts without triggering losses.

170	   The former causes a flow's round trip time (RTT) to vary from about 1
171	   to 2 times the base RTT between the machines in question.  The latter
172	   delays the system's response to change by a worst-case
173	   (transcontinental) RTT, which could be hundreds of times the actual
174	   RTT of typical traffic from localized CDNs.

176	   Latency is not our only concern:

178	   3.  It was known when TCP was first developed that it would not scale
179	       to high bandwidth-delay products.

181	   Given regular broadband bit-rates over WAN distances are
182	   already [RFC3649] beyond the scaling range of `classic' TCP Reno,
183	   `less unscalable' Cubic [I-D.ietf-tcpm-cubic] and
184	   Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been
185	   successfully deployed.  However, these are now approaching their
186	   scaling limits.  Unfortunately, fully scalable TCPs such as DCTCP
187	   cause `classic' TCP to starve itself, which is why they have been
188	   confined to private data centres or research testbeds (until now).

190	   This document specifies a `DualQ Coupled AQM' extension that solves
191	   the problem of coexistence between scalable and classic flows,
192	   without having to inspect flow identifiers.  The AQM is not like
193	   flow-queuing approaches [RFC8290] that classify packets by flow
194	   identifier into numerous separate queues in order to isolate sparse
195	   flows from the higher latency in the queues assigned to heavier flows.
196	   In contrast, the AQM exploits the behaviour of scalable congestion
197	   controls like DCTCP so that every packet in every flow sharing the
198	   queue for DCTCP-like traffic can be served with very low latency.

200	   This AQM extension can be combined with any single queue AQM that
201	   generates a statistical or deterministic mark/drop probability driven
202	   by the queue dynamics.  In many cases it simplifies the basic control
203	   algorithm, and requires little extra processing.  Therefore it is
204	   believed the Coupled AQM would be applicable and easy to deploy in
205	   all types of buffers: buffers in cost-reduced mass-market residential
206	   equipment; buffers in end-system stacks; buffers in carrier-scale
207	   equipment including remote access servers, routers, firewalls and
208	   Ethernet switches; buffers in network interface cards; buffers in
209	   virtualized network appliances, hypervisors, and so on.

211	   The overall L4S architecture is described in
212	   [I-D.ietf-tsvwg-l4s-arch].  The supporting papers [PI2] and [DCttH15]
213	   give the full rationale for the AQM's design, both discursively and
214	   in more precise mathematical form.

216	1.2.  Terminology

218	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
219	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
220	   document are to be interpreted as described in [RFC2119].  In this
221	   document, these words will appear with that interpretation only when
222	   in ALL CAPS.  Lower case uses of these words are not to be
223	   interpreted as carrying RFC-2119 significance.

225	   The DualQ Coupled AQM uses two queues for two services.  Each of the
226	   following terms identifies both the service and the queue that
227	   provides the service:

229	   Classic (denoted by subscript C):  The `Classic' service is intended
230	      for all the behaviours that currently co-exist with TCP Reno (TCP
231	      Cubic, Compound, SCTP, etc.).

233	   Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L):
234	      The `L4S' service is intended for a set of congestion controls
235	      with scalable properties, such as DCTCP or
236	      Relentless [Mathis09].

238	   Either service can cope with a proportion of unresponsive or less-
239	   responsive traffic as well (e.g. DNS, VoIP), just as a single
240	   queue AQM can.  The DualQ Coupled AQM behaviour is similar to a
241	   single FIFO queue with respect to unresponsive and overload traffic.

243	1.3.  Features

245	   The AQM couples marking and/or dropping across the two queues such
246	   that a flow will get roughly the same throughput whichever it uses.
247	   Therefore both queues can feed into the full capacity of a link and
248	   no rates need to be configured for the queues.  The L4S queue enables
249	   scalable congestion controls like DCTCP to give very low and
250	   predictably low latency, without compromising the performance of
251	   competing 'Classic' Internet traffic.  Thousands of tests have been
252	   conducted in a typical fixed residential broadband setting.  Typical
253	   experiments used base round trip delays up to 100ms between the data
254	   centre and home network, and large amounts of background traffic in
255	   both queues.  For every L4S packet, the AQM kept the average queuing
256	   delay below 1ms (or 2 packets if serialization delay is bigger for
257	   slow links), and no losses at all were introduced by the AQM.
258	   Details of the extensive experiments will be made available [PI2]
259	   [DCttH15].

261	   Subjective testing was also conducted using a demanding panoramic
262	   interactive video application run over a stack with DCTCP enabled and
263	   deployed on the testbed.  Each user could pan or zoom their own high
264	   definition (HD) sub-window of a larger video scene from a football
265	   match.  Even though the user was also downloading large amounts of
266	   L4S and Classic data, latency was so low that the picture appeared to
267	   stick to their finger on the touchpad (all the L4S data achieved the
268	   same ultra-low latency).  With an alternative AQM, the video
269	   noticeably lagged behind the finger gestures.

271	   Unlike Diffserv Expedited Forwarding, the L4S queue does not have to
272	   be limited to a small proportion of the link capacity in order to
273	   achieve low delay.  The L4S queue can be filled with a heavy load of
274	   capacity-seeking flows like DCTCP and still achieve low delay.  The
275	   L4S queue does not rely on the presence of other traffic in the
276	   Classic queue that can be 'overtaken'.  It gives low latency to L4S
277	   traffic whether or not there is Classic traffic, and the latency of
278	   Classic traffic does not suffer when a proportion of the traffic is
279	   L4S.  The two queues are only necessary because DCTCP-like flows
280	   cannot keep latency predictably low and keep utilization high if they
281	   are mixed with legacy TCP flows.

283	   The experiments used the Linux implementation of DCTCP that is
284	   deployed in private data centres, without any modification despite
285	   its known deficiencies.  Nonetheless, certain modifications will be
286	   necessary before DCTCP is safe to use on the Internet, which are
287	   recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id].  However, the
288	   focus of this specification is to get the network service in place.
289	   Then, without any management intervention, applications can exploit
290	   it by migrating to scalable controls like DCTCP, which can then
291	   evolve _while_ their benefits are being enjoyed by everyone on the
292	   Internet.

294	2.  DualQ Coupled AQM

296	   There are two main aspects to the approach:

298	   o  the Coupled AQM that addresses throughput equivalence between
299	      Classic (e.g. Reno, Cubic) flows and L4S (e.g. DCTCP) flows;

301	   o  the Dual Queue structure that provides latency separation for L4S
302	      flows to isolate them from the typically large Classic queue.

304	2.1.  Coupled AQM

306	   In the 1990s, the `TCP formula' was derived for the relationship
307	   between TCP's congestion window, cwnd, and its drop probability, p.
308	   To a first order approximation, cwnd of TCP Reno is inversely
309	   proportional to the square root of p.

311	   TCP Cubic implements a Reno-compatibility mode, which is the only
312	   relevant mode for typical RTTs under 20ms as long as the throughput
313	   of a single flow is less than about 500Mb/s.  Therefore it can be
314	   assumed that Cubic traffic behaves similarly to Reno (but with a
315	   slightly different constant of proportionality), and the term
316	   'Classic' will be used for the collection of Reno-friendly traffic
317	   including Cubic in Reno mode.

319	   The supporting paper [PI2] includes the derivation of the equivalent
320	   rate equation for DCTCP, for which cwnd is inversely proportional to
321	   p (not the square root), where in this case p is the ECN marking
322	   probability.  DCTCP is not the only congestion control that behaves
323	   like this, so the term 'L4S' traffic will be used for all similar
324	   behaviour.

326	   In order to make a DCTCP flow run at roughly the same rate as a Reno
327	   TCP flow (all other factors being equal), the drop or marking
328	   probability for Classic traffic, p_C, has to be distinct from the
329	   marking probability for L4S traffic, p_L (in contrast to RFC 3168,
330	   which requires them to be the same).  It is necessary to make the
331	   Classic drop probability p_C proportional to the square of the L4S
332	   marking probability p_L.  This makes the Reno flow rate roughly equal
333	   the DCTCP flow rate: the square root of p_C in the Reno rate equation
334	   then becomes proportional to p_L itself, which is the term that
335	   appears in the DCTCP rate equation.

337	   Stating this as a formula, the relation between Classic drop
338	   probability, p_C, and L4S marking probability, p_L, needs to take the
339	   form:

341	       p_C = ( p_L / k )^2                  (1)

343	   where k is the constant of proportionality.
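
   As an illustrative sketch only (not part of this specification), the
   following Python checks numerically that equation (1) removes the
   dependence of the window ratio on the congestion level.  The function
   names and the constants a and b in the idealized window laws are
   placeholder assumptions; only the form of the two laws and of
   equation (1) is taken from the text above.

```python
import math


def classic_prob(p_l: float, k: float) -> float:
    """Equation (1): Classic drop probability p_C from L4S marking probability p_L."""
    return (p_l / k) ** 2


def reno_window(p_c: float, a: float = 1.0) -> float:
    """Idealized Reno law: cwnd inversely proportional to sqrt(p_C).

    The constant 'a' is a placeholder assumption, not a value from the draft.
    """
    return a / math.sqrt(p_c)


def scalable_window(p_l: float, b: float = 1.0) -> float:
    """Idealized DCTCP-like law: cwnd inversely proportional to p_L.

    The constant 'b' is a placeholder assumption, not a value from the draft.
    """
    return b / p_l


def window_ratio(p_l: float, k: float) -> float:
    """Classic window divided by L4S window when p_C is coupled to p_L by equation (1)."""
    return reno_window(classic_prob(p_l, k)) / scalable_window(p_l)


if __name__ == "__main__":
    # The ratio is the same at every congestion level; only k (and a, b) set it.
    for p_l in (0.01, 0.05, 0.2):
        print(p_l, window_ratio(p_l, k=2.0))
```

   With the placeholder constants a = b = 1, the ratio works out to
   a*k/b = k at every value of p_L.  Real congestion controls have
   different constants, which is why the choice of k is discussed
   separately in Appendix C.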

345	2.2.  Dual Queue

347	   Classic traffic typically builds a large queue to prevent under-
348	   utilization.  Therefore a separate queue is provided for L4S traffic,
349	   and it is scheduled with priority over Classic.  Priority is
350	   conditional to prevent starvation of Classic traffic.

352	   Nonetheless, coupled marking ensures that giving priority to L4S
353	   traffic still leaves the right amount of spare scheduling time for
354	   Classic flows to each get equivalent throughput to DCTCP flows (all
355	   other factors such as RTT being equal).  The algorithm achieves this
356	   without having to inspect flow identifiers.

358	2.3.  Traffic Classification

360	   Both the Coupled AQM and DualQ mechanisms need an identifier to
361	   distinguish L and C packets.  A separate draft
362	   [I-D.ietf-tsvwg-ecn-l4s-id] recommends using the ECT(1) codepoint of
363	   the ECN field as this identifier, having assessed various
364	   alternatives.  An additional process document has proved necessary to
365	   make the ECT(1) codepoint available for experimentation [RFC8311].

367	2.4.  Overall DualQ Coupled AQM Structure

369	   Figure 1 shows the overall structure that any DualQ Coupled AQM is
370	   likely to have.  This schematic is intended to aid understanding of
371	   the current designs of DualQ Coupled AQMs.  However, it is not
372	   intended to preclude other innovative ways of satisfying the
373	   normative requirements in Section 2.5 that minimally define a DualQ
374	   Coupled AQM.

376	   The classifier on the left separates incoming traffic between the two
377	   queues (L and C).  Each queue has its own AQM that determines the
378	   likelihood of dropping or marking (p_L and p_C).  Nonetheless, the
379	   AQM for Classic traffic is implemented in two stages: i) a base stage
380	   that outputs an internal probability p' (pronounced p-prime); and ii)
381	   a squaring stage that outputs p_C, where

383	       p_C = (p')^2.                        (2)

385	   This allows p_L to be coupled to p_C by marking L4S traffic
386	   proportionately to the intermediate output from the first stage.
387	   Specifically, the output of the base AQM, p', is scaled by the
388	   coupling factor k before being coupled across to the L queue:

390	       p_CL = k*p',                         (3)

392	   where k is the constant coupling factor (see Appendix C) and p_CL is
393	   the output from the coupling between the C queue and the L queue.

395	   It can be seen in the following that these two transformations of p'
396	   implement the required coupling given in equation (1) earlier.
397	   Substituting for p' from equation (3) into (2):

399	      p_C = ( p_CL / k )^2.

401	   The actual L4S marking probability p_L is the maximum of the coupled
402	   output (p_CL) and the output of a native L4S AQM (p'L), shown as
403	   '(MAX)' in the schematic.  While the output of the native L4S AQM is
404	   high (p'L > p_CL), it will dominate the way L traffic is marked.
405	   When the native L4S AQM output is lower, the way L traffic is marked
406	   will be driven by the coupling, that is, p_L = p_CL.  So, whenever
407	   the coupling is needed, the relation required by equation (1) holds:

409	      p_C = ( p_L / k )^2.

411	                           _________
412	                                  | |    ,------.
413	                        L4S queue | |===>| ECN  |
414	                       ,'| _______|_|    |marker|\
415	                     <'  |         |     `------'\\
416	                      //`'         v        ^ p_L \\
417	                     //        ,-------.    |      \\
418	                    //         |Native |p'L |       \\,.
419	                   //          |  L4S  |-->(MAX)    <  |   ___
420	      ,----------.//           |  AQM  |    ^ p_CL   `\|.'Cond-`.
421	      |  IP-ECN  |/            `-------'    |          / itional \
422	   ==>|Classifier|             ,-------.  (k*p')       [ priority]==>
423	      |          |\            |  Base |    |          \scheduler/
424	      `----------'\\           |  AQM  |--->:        ,'|`-.___.-'
425	                   \\          |       |p'  |      <'  |
426	                    \\         `-------'  (p'^2)    //`'
427	                     \\            ^        |      //
428	                      \\,.         |        v p_C //
429	                      <  | _________     .------.//
430	                       `\|   |      |    | Drop |/
431	                     Classic |queue |===>|/mark |
432	                           __|______|    `------'

434	   Legend: ===> traffic flow; ---> control dependency.

436	                   Figure 1: DualQ Coupled AQM Schematic
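
   The probability transformations in the schematic can be sketched as
   follows.  This is a minimal illustration under stated assumptions,
   not the algorithm of Appendix A or Appendix B: the function and
   variable names are hypothetical, and clamping p_CL to 1 is an
   assumption made here so that the result remains a probability.  The
   default k = 2 is the value that Section 2.5.1 quotes as RECOMMENDED.

```python
def coupled_probabilities(p_base: float, p_native_l: float, k: float = 2.0):
    """Return (p_C, p_CL, p_L) for one update of the schematic in Figure 1.

    p_base     -- p', the internal output of the base AQM
    p_native_l -- p'L, the output of the native L4S AQM
    k          -- coupling factor
    """
    p_c = p_base ** 2              # equation (2): squaring stage for Classic
    p_cl = min(k * p_base, 1.0)    # equation (3); the clamp to 1 is an assumption
    p_l = max(p_native_l, p_cl)    # '(MAX)' in the schematic
    return p_c, p_cl, p_l


if __name__ == "__main__":
    # Coupling drives L4S marking when the native L4S AQM output is low...
    print(coupled_probabilities(p_base=0.1, p_native_l=0.0))
    # ...and the native L4S AQM dominates when its output is higher.
    print(coupled_probabilities(p_base=0.1, p_native_l=0.5))
```

   Whenever the coupled branch is the larger input to the maximum and
   the clamp is not active, p_L = k*p' and p_C = (p')^2, so p_C =
   (p_L / k)^2 as in equation (1).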

438	   After the AQMs have applied their dropping or marking, the scheduler
439	   forwards their packets to the link, giving priority to L4S traffic.
440	   Priority has to be conditional in some way (see Section 4.1).  Simple
441	   strict priority is inappropriate because it could lead the L4S
442	   queue to starve the Classic queue.  For example, consider the case
443	   where a continually busy L4S queue blocks a DNS request in the
444	   Classic queue, arbitrarily delaying the start of a new Classic flow.
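
   One way to make priority conditional is a time-shifted FIFO, which
   Section 2.5.2 mentions as a possible scheduler.  The sketch below is
   a hedged illustration of the idea, not a normative scheduler: the
   packet representation, the function name and the exact comparison
   rule are assumptions introduced here.

```python
from collections import deque
from typing import Deque, Optional, Tuple

# Hypothetical packet representation: (arrival_time_seconds, label).
Packet = Tuple[float, str]


def select_queue(lq: Deque[Packet], cq: Deque[Packet],
                 now: float, tshift: float) -> Optional[str]:
    """Choose the queue to serve next: 'L', 'C', or None if both are empty.

    The L4S head packet is treated as if it had already waited an extra
    'tshift' seconds, so it normally wins.  A Classic head packet that has
    waited more than 'tshift' longer than the L4S head is served first,
    which bounds how long the Classic queue can be held back.
    """
    if not lq and not cq:
        return None
    if not cq:
        return "L"
    if not lq:
        return "C"
    l_wait = now - lq[0][0]  # time the head L4S packet has been queued
    c_wait = now - cq[0][0]  # time the head Classic packet has been queued
    return "L" if l_wait + tshift >= c_wait else "C"


if __name__ == "__main__":
    l4s = deque([(0.90, "l4s-data")])
    classic = deque([(0.00, "dns-request")])
    # The DNS request has waited far longer than the time-shift, so it is served.
    print(select_queue(l4s, classic, now=1.0, tshift=0.05))
```

   In the DNS example from the paragraph above, the long-waiting
   Classic packet is selected even though the L4S queue is busy, which
   is the behaviour that strict priority would not provide.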

446	   Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are
447	   given in Appendix A and Appendix B.  Either example AQM can be used
448	   to couple packet marking and dropping across a dual queue.

450	   DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM.
451	   Indeed, this Base AQM with just the squared output and no L4S queue
452	   can be used as a drop-in replacement for PIE [RFC8033], in which case
453	   we call it just PI2 [PI2].  PI2 is a principled simplification of PIE
454	   that is both more responsive and more stable in the face of
455	   dynamically varying load.

457	   Curvy RED is derived from RED [RFC2309], but its configuration
458	   parameters are insensitive to link rate and it requires fewer
459	   operations per packet.  However, DualPI2 is more responsive and
460	   stable over a wider range of RTTs than Curvy RED.  As a consequence,
461	   DualPI2 has attracted more development attention than Curvy RED,
462	   leaving the Curvy RED design incomplete and less fully evaluated.

464	   Both AQMs regulate their queue in units of time not bytes.  As
465	   already explained, this ensures configuration can be invariant for
466	   different drain rates.  With AQMs in a dualQ structure this is
467	   particularly important because the drain rate of each queue can vary
468	   rapidly as flows for the two queues arrive and depart, even if the
469	   combined link rate is constant.
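
   The arithmetic behind this point can be illustrated with a short
   sketch.  The function name and the example backlog and rates are
   assumptions chosen for illustration; an implementation could equally
   measure delay directly, e.g. from per-packet timestamps.

```python
def queuing_delay_seconds(backlog_bytes: int, drain_rate_bps: float) -> float:
    """Time needed to drain 'backlog_bytes' at 'drain_rate_bps' bits per second."""
    return backlog_bytes * 8 / drain_rate_bps


if __name__ == "__main__":
    backlog = 15_000  # bytes; an arbitrary example backlog
    for rate in (10e6, 100e6):
        delay_ms = queuing_delay_seconds(backlog, rate) * 1000
        print(f"{rate / 1e6:.0f} Mb/s -> {delay_ms:.2f} ms")
```

   The same backlog in bytes corresponds to a tenfold difference in
   delay between the two example rates, so a byte-based threshold would
   need retuning whenever the drain rate of a queue changed, whereas a
   time-based threshold would not.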

471	   It would be possible to control the queues with alternative
472	   AQMs, as long as the normative requirements (those expressed in
473	   capitals) in Section 2.5 are observed.

475	2.5.  Normative Requirements for a DualQ Coupled AQM

477	   The following requirements are intended to capture only the essential
478	   aspects of a DualQ Coupled AQM.  They are intended to be independent
479	   of the particular AQMs used for each queue.

481	2.5.1.  Functional Requirements

483	   In the Dual Queue, L4S packets MUST be given priority over Classic,
484	   although priority MUST be bounded in order not to starve Classic
485	   traffic.

487	   All L4S traffic MUST be ECN-capable.  Some Classic traffic might also
488	   be ECN-capable.

490	   Whatever identifier is used for L4S experiments,
491	   [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on
492	   L4S traffic, relative to drop of Classic traffic.  In order to
493	   prevent starvation of Classic traffic by scalable L4S traffic, it
494	   says, "The likelihood that an AQM drops a Not-ECT Classic packet
495	   (p_C) MUST be roughly proportional to the square of the likelihood
496	   that it would have marked it if it had been an L4S packet (p_L)."  In
497	   other words, in any DualQ Coupled AQM, the power to which p_L is
498	   raised in equation (1) MUST be 2.  'Likelihood' is used here to allow
499	   for marking and dropping to be either probabilistic or deterministic.

501	   The constant of proportionality, k, in equation (1) determines the
502	   relative flow rates of Classic and L4S flows when the AQM concerned
503	   is the bottleneck (all other factors being equal).
504	   [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality
505	   (k) does not have to be standardised for interoperability, but a
506	   value of 2 is RECOMMENDED."

508	   Assuming scalable congestion controls for the Internet will be as
509	   aggressive as DCTCP, this will ensure their congestion window will be
510	   roughly the same as that of a standards track TCP congestion control
511	   (Reno) [RFC5681] and other so-called TCP-friendly controls, such as
512	   TCP Cubic in its TCP-friendly mode.

514	   {ToDo: The requirements for scalable congestion controls on the
515	   Internet (termed the TCP Prague requirements)
516	   [I-D.ietf-tsvwg-ecn-l4s-id] are not necessarily final.  If the
517	   aggressiveness of DCTCP is not defined as the benchmark for scalable
518	   controls on the Internet, the recommended value of k will also be
519	   subject to change.}

521	   The choice of k is a matter of operator policy, and operators MAY
522	   choose a different value using Table 1 and the guidelines in
523	   Appendix C.

525	   If multiple users share capacity at a bottleneck (e.g. in the
526	   Internet access link of a campus network), the operator's choice of k
527	   will determine capacity sharing between the flows of different users.
528	   However, on the public Internet, access network operators typically
529	   isolate customers from each other with some form of layer-2
530	   multiplexing (TDM in DOCSIS, CDMA in 3G) or L3 scheduling (WRR in
531	   DSL), rather than relying on TCP to share capacity between customers
532	   [RFC0970].  In such cases, the choice of k will solely affect
533	   relative flow rates within each customer's access capacity, not
534	   between customers.  Also, k will not affect relative flow rates at
535	   any time when all flows are Classic or all L4S, and it will not
536	   affect small flows.

538	2.5.2.  Management Requirements

540	   By default, a DualQ Coupled AQM SHOULD NOT need any configuration for
541	   use at a bottleneck on the public Internet [RFC7567].  The following
542	   parameters MAY be operator-configurable, e.g. to tune for non-
543	   Internet settings:

545	   o  Optional packet classifier(s) to use in addition to the ECN field
546	      {ToDo: e.g.  ARP};

548	   o  Expected typical RTT (a parameter for typical or target queuing
549	      delay in each queue might be configurable instead);

551	   o  Expected maximum RTT (a stability parameter that depends on
552	      maximum RTT might be configurable instead);

554	   o  Coupling factor, k;

556	   o  The limit to the conditional priority of L4S (scheduler-dependent,
557	      e.g. the scheduler weight for WRR, or the time-shift for time-
558	      shifted FIFO);

560	   o  The maximum Classic ECN marking probability, p_Cmax, before
561	      switching over to drop.

563	   An experimental DualQ Coupled AQM SHOULD allow the operator to
564	   monitor the following operational statistics:

566	   o  Bits forwarded (total and per queue per sample interval), from
567	      which utilization can be calculated

569	   o  Q delay (per queue over sample interval)

571	   o  Total packets arriving, enqueued and dequeued (per queue per
572	      sample interval)

574	   o  ECN packets marked, non-ECN packets dropped, ECN packets dropped
575	      (per queue per sample interval), from which marking and dropping
576	      probabilities can be calculated

578	   o  Time and duration of each overload event.

580	   The type of statistics produced for variables like Q delay (mean,
581	   percentiles, etc.) will depend on implementation constraints.

583	3.  IANA Considerations

585	   This specification contains no IANA considerations.

587	4.  Security Considerations

589	4.1.  Overload Handling

591	   Where the interests of users or flows might conflict, it could be
592	   necessary to police traffic to isolate any harm to the performance of
593	   individual flows.  However, it is hard to avoid unintended side-
594	   effects with policing, and in a trusted environment policing is not
595	   necessary.  Therefore per-flow policing needs to be separable from a
596	   basic AQM, as an option under policy control.

598	   However, a basic DualQ AQM does at least need to handle overload.  A
599	   useful objective would be for the overload behaviour of the DualQ AQM
600	   to be at least no worse than a single queue AQM.  However, a trade-
601	   off needs to be made between complexity and the risk of either
602	   traffic class harming the other.  In each of the following three
603	   subsections, an overload issue specific to the DualQ is described,
604	   followed by proposed solution(s).

606	   Under overload the higher priority L4S service will have to sacrifice
607	   some aspect of its performance.  Alternative solutions are provided
608	   below that each relax a different factor: e.g. throughput, delay,
609	   drop.  Some of these choices might need to be determined by operator
610	   policy or by the developer, rather than by the IETF. {ToDo: Reach
611	   consensus on which it is to be in each case.}

613	4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay?

615	   Priority of L4S is required to be conditional to avoid total
616	   throughput starvation of Classic by heavy L4S traffic.  This raises
617	   the question of whether to sacrifice L4S throughput or L4S delay (or
618	   some other policy) to mitigate starvation of Classic:

620	   Sacrifice L4S throughput:   By using weighted round robin as the
621	      conditional priority scheduler, the L4S service can sacrifice some
622	      throughput during overload to guarantee a minimum throughput
623	      service for Classic traffic.  The scheduling weight of the Classic
624	      queue should be small (e.g. 1/16).  Then, in most traffic
625	      scenarios, the scheduler will not need to intervene, because the
626	      coupling mechanism and the end-systems will share out the
627	      capacity across both queues as if it were a single pool.  However,
628	      because the congestion coupling only applies in one direction
629	      (from C to L), if L4S traffic is over-aggressive or unresponsive,
630	      the scheduler weight for Classic traffic will at least be large
631	      enough to ensure it does not starve.

633	      In cases where the ratio of L4S to Classic flows (e.g. 19:1) is
634	      greater than the ratio of their scheduler weights (e.g. 15:1), the
635	      L4S flows will get less than an equal share of the capacity, but
636	      only slightly.  For instance, with the example numbers given, each
637	      L4S flow will get (15/16)/19 = 4.9% when ideally each would get
638	      1/20=5%. In the rather specific case of an unresponsive flow
639	      taking up a large part of the capacity set aside for L4S, using
640	      WRR could significantly reduce the capacity left for any
641	      responsive L4S flows.

643	   Sacrifice L4S Delay:  To control milder overload of responsive
644	      traffic, particularly when close to the maximum congestion signal,
645	      the operator could choose to control overload of the Classic queue
646	      by allowing some delay to 'leak' across to the L4S queue.  The
647	      scheduler can be made to behave like a single First-In First-Out
648	      (FIFO) queue with different service times by implementing a very
649	      simple conditional priority scheduler that could be called a
650	      "time-shifted FIFO" (see the Modified Earliest Deadline First
651	      (MEDF) scheduler of [MEDF]).  This scheduler adds tshift to the
652	      queue delay of the next L4S packet, before comparing it with the
653	      queue delay of the next Classic packet, then it selects the packet
654	      with the greater adjusted queue delay.  Under regular conditions,
655	      this time-shifted FIFO scheduler behaves just like a strict
656	      priority scheduler.  But under moderate or high overload it
657	      prevents starvation of the Classic queue, because the time-shift
658	      (tshift) defines the maximum extra queuing delay of Classic
659	      packets relative to L4S.

661	   The example implementation in Appendix A can implement either policy.
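
   As an illustration only, the two policies above can be sketched in
   Python as follows.  The function names, the rule that ties go to
   L4S, and the 50 ms value of tshift in the usage note are assumptions
   made for this sketch; they are not taken from Appendix A.

```python
from typing import Optional


def l4s_share_per_flow(n_l4s_flows: int, classic_weight: float) -> float:
    """Fraction of link capacity per L4S flow when the WRR scheduler is
    the binding constraint, i.e. L4S is held to (1 - classic_weight)."""
    return (1.0 - classic_weight) / n_l4s_flows


def ts_fifo_select(l_delay: Optional[float], c_delay: Optional[float],
                   tshift: float) -> Optional[str]:
    """Time-shifted FIFO: pick the queue whose head packet has the
    greater adjusted queue delay.

    l_delay / c_delay are the queue delays (seconds) of the head L4S and
    Classic packets, or None if that queue is empty.  Returns "L", "C",
    or None when both queues are empty.
    """
    if l_delay is None and c_delay is None:
        return None
    if c_delay is None:
        return "L"
    if l_delay is None:
        return "C"
    # tshift is added to the L4S delay before the comparison.
    # Assumption of this sketch: a tie is resolved in favour of L4S.
    return "L" if l_delay + tshift >= c_delay else "C"
```

   With 19 L4S flows and a Classic weight of 1/16,
   l4s_share_per_flow(19, 1/16) evaluates to about 0.049, matching the
   4.9% quoted above.  With a hypothetical tshift of 50 ms, a Classic
   head packet that has waited 80 ms is served ahead of an L4S head
   packet that has waited 1 ms, whereas one that has waited only 10 ms
   is not.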

663	4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or Delay?

665	   To keep the throughput of both L4S and Classic flows roughly equal
666	   over the full load range, a different control strategy needs to be
667	   defined above the point where one AQM first saturates to a
668	   probability of 100% leaving no room to push back the load any harder.
669	   If k>1, L4S will saturate first, but saturation can be caused by
670	   unresponsive traffic in either queue.

672	   The term 'unresponsive' includes cases where a flow becomes
673	   temporarily unresponsive, for instance, a real-time flow that takes a
674	   while to adapt its rate in response to congestion, or a TCP-like flow
675	   that is normally responsive, but above a certain congestion level it
676	   will not be able to reduce its congestion window below the minimum of
677	   2 segments, effectively becoming unresponsive.  (Note that L4S
678	   traffic ought to remain responsive below a window of 2 segments; see
679	   [I-D.ietf-tsvwg-ecn-l4s-id].)

681	   Saturation raises the question of whether to relieve congestion by
682	   introducing some drop into the L4S queue or by allowing delay to grow
683	   in both queues (which could eventually lead to tail drop too):

685	   Drop on Saturation:  Saturation can be avoided by setting a maximum
686	      threshold for L4S ECN marking (assuming k>1) before saturation
687	      starts to make the flow rates of the different traffic types
688	      diverge.  Above that the drop probability of Classic traffic is
689	      applied to all packets of all traffic types.  Then experiments
690	      have shown that queueing delay can be kept at the target in any
691	      overload situation, including with unresponsive traffic, and no
692	      further measures are required.

694	   Delay on Saturation:  When L4S marking saturates, instead of
695	      switching to drop, the drop and marking probabilities could be
696	      capped.  Beyond that, delay will grow either solely in the queue
697	      with unresponsive traffic (if WRR is used), or in both queues (if
698	      time-shifted FIFO is used).  In either case, the higher delay
699	      ought to control temporary high congestion.  If the overload is
700	      more persistent, eventually the combined DualQ will overflow and
701	      tail drop will control congestion.

703	   The example implementation in Appendix A applies only the "drop on
704	   saturation" policy.

706	4.1.3.  Protecting against Unresponsive ECN-Capable Traffic

708	   Unresponsive traffic has a greater advantage if it is also ECN-
709	   capable.  The advantage is undetectable at normal low levels of drop/
710	   marking, but it becomes significant with the higher levels of drop/
711	   marking typical during overload.  This is an issue whether the ECN-
712	   capable traffic is L4S or Classic.

714	   This raises the question of whether and when to switch off ECN
715	   marking and use solely drop instead, as required by both Section 7 of
716	   [RFC3168] and Section 4.2.1 of [RFC7567].

718	   Experiments with the DualPI2 AQM (Appendix A) have shown that
719	   introducing 'drop on saturation' at 100% L4S marking addresses this
720	   problem with unresponsive ECN as well as addressing the saturation
721	   problem.  It leaves only a small range of congestion levels where
722	   unresponsive traffic gains any advantage from using the ECN
723	   capability, and the advantage is hardly detectable [DualQ-Test].

725	5.  Acknowledgements

727	   Thanks to Anil Agarwal, Sowmini Varadhan and Gabi Bracha for
728	   detailed review comments particularly of the appendices and
729	   suggestions on how to make our explanation clearer.  Thanks also to
730	   Greg White and Tom Henderson for insights on the choice of schedulers
731	   and queue delay measurement techniques.

733	   The authors' contributions were originally part-funded by the
734	   European Community under its Seventh Framework Programme through the
735	   Reducing Internet Transport Latency (RITE) project (ICT-317700).  Bob
736	   Briscoe's contribution was also part-funded by the Research Council
737	   of Norway through the TimeIn project.  The views expressed here are
738	   solely those of the authors.

740	6.  References
741	6.1.  Normative References

743	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
744	              Requirement Levels", BCP 14, RFC 2119,
745	              DOI 10.17487/RFC2119, March 1997,
746	              <https://www.rfc-editor.org/info/rfc2119>.

748	6.2.  Informative References

750	   [ARED01]   Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An
751	              Algorithm for Increasing the Robustness of RED's Active
752	              Queue Management", ACIRI Technical Report , August 2001,
753	              <http://www.icir.org/floyd/red.html>.

755	   [CoDel]    Nichols, K. and V. Jacobson, "Controlling Queue Delay",
756	              ACM Queue 10(5), May 2012,
757	              <http://queue.acm.org/issuedetail.cfm?issue=2208917>.

759	   [CRED_Insights]
760	              Briscoe, B., "Insights from Curvy RED (Random Early
761	              Detection)", BT Technical Report TR-TUB8-2015-003, July
762	              2015,
763	              <http://www.bobbriscoe.net/projects/latency/credi_tr.pdf>.

765	   [DCttH15]  De Schepper, K., Bondarenko, O., Briscoe, B., and I.
766	              Tsang, "`Data Centre to the Home': Ultra-Low Latency for
767	              All", 2015, <http://www.bobbriscoe.net/projects/latency/
768	              dctth_preprint.pdf>.

770	              (Under submission)

772	   [DualQ-Test]
773	              Steen, H., "Destruction Testing: Ultra-Low Delay using
774	              Dual Queue Coupled Active Queue Management", Masters
775	              Thesis, Dept of Informatics, Uni Oslo , May 2017.

777	   [I-D.ietf-tcpm-cubic]
778	              Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
779	              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
780	              draft-ietf-tcpm-cubic-07 (work in progress), November
781	              2017.

783	   [I-D.ietf-tsvwg-ecn-l4s-id]
784	              De Schepper, K., Briscoe, B., and I. Tsang, "Identifying
785	              Modified Explicit Congestion Notification (ECN) Semantics
786	              for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s-
787	              id-00 (work in progress), November 2016.

789	   [I-D.ietf-tsvwg-l4s-arch]
790	              Briscoe, B., De Schepper, K., and M. Bagnulo, "Low
791	              Latency, Low Loss, Scalable Throughput (L4S) Internet
792	              Service: Architecture", draft-ietf-tsvwg-l4s-arch-00 (work
793	              in progress), November 2016.

795	   [I-D.sridharan-tcpm-ctcp]
796	              Sridharan, M., Tan, K., Bansal, D., and D. Thaler,
797	              "Compound TCP: A New TCP Congestion Control for High-Speed
798	              and Long Distance Networks", draft-sridharan-tcpm-ctcp-02
799	              (work in progress), November 2008.

801	   [Mathis09]
802	              Mathis, M., "Relentless Congestion Control", PFLDNeT'09 ,
803	              May 2009, <http://www.hpcc.jp/pfldnet2009/
804	              Program_files/1569198525.pdf>.

806	   [MEDF]     Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a
807	              simple scheduling algorithm for two real-time transport
808	              service classes with application in the UTRAN", Proc. IEEE
809	              Conference on Computer Communications (INFOCOM'03) Vol.2
810	              pp.1116-1122, March 2003.

812	   [PI2]      De Schepper, K., Bondarenko, O., Briscoe, B., and I.
813	              Tsang, "PI2: A Linearized AQM for both Classic and
814	              Scalable TCP", ACM CoNEXT'16 , December 2016,
815	              <https://riteproject.files.wordpress.com/2015/10/
816	              pi2_conext.pdf>.

818	              (To appear)

820	   [RFC0970]  Nagle, J., "On Packet Switches With Infinite Storage",
821	              RFC 970, DOI 10.17487/RFC0970, December 1985,
822	              <https://www.rfc-editor.org/info/rfc970>.

824	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
825	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
826	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
827	              S., Wroclawski, J., and L. Zhang, "Recommendations on
828	              Queue Management and Congestion Avoidance in the
829	              Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998,
830	              <https://www.rfc-editor.org/info/rfc2309>.

832	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
833	              of Explicit Congestion Notification (ECN) to IP",
834	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
835	              <https://www.rfc-editor.org/info/rfc3168>.

837	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
838	              J., Courtney, W., Davari, S., Firoiu, V., and D.
839	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
840	              Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002,
841	              <https://www.rfc-editor.org/info/rfc3246>.

843	   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
844	              RFC 3649, DOI 10.17487/RFC3649, December 2003,
845	              <https://www.rfc-editor.org/info/rfc3649>.

847	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
848	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
849	              <https://www.rfc-editor.org/info/rfc5681>.

851	   [RFC7567]  Baker, F., Ed. and G. Fairhurst, Ed., "IETF
852	              Recommendations Regarding Active Queue Management",
853	              BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015,
854	              <https://www.rfc-editor.org/info/rfc7567>.

856	   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
857	              "Proportional Integral Controller Enhanced (PIE): A
858	              Lightweight Control Scheme to Address the Bufferbloat
859	              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
860	              <https://www.rfc-editor.org/info/rfc8033>.

862	   [RFC8034]  White, G. and R. Pan, "Active Queue Management (AQM) Based
863	              on Proportional Integral Controller Enhanced PIE) for
864	              Data-Over-Cable Service Interface Specifications (DOCSIS)
865	              Cable Modems", RFC 8034, DOI 10.17487/RFC8034, February
866	              2017, <https://www.rfc-editor.org/info/rfc8034>.

868	   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
869	              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
870	              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
871	              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

873	   [RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
874	              J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler
875	              and Active Queue Management Algorithm", RFC 8290,
876	              DOI 10.17487/RFC8290, January 2018,
877	              <https://www.rfc-editor.org/info/rfc8290>.

879	   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
880	              Notification (ECN) Experimentation", RFC 8311,
881	              DOI 10.17487/RFC8311, January 2018,
882	              <https://www.rfc-editor.org/info/rfc8311>.

884	Appendix A.  Example DualQ Coupled PI2 Algorithm

886	   As a first concrete example, the pseudocode below gives the DualPI2
887	   algorithm.  DualPI2 follows the structure of the DualQ Coupled AQM
888	   framework in Figure 1.  A simple step threshold (in units of queuing
889	   time) is used for the Native L4S AQM, but a ramp is also described as
890	   an alternative.  And the PI2 algorithm [PI2] is used for the Classic
891	   AQM.  PI2 is an improved variant of the PIE AQM [RFC8033].

893	   We will introduce the pseudocode in two passes.  The first pass
894	   explains the core concepts, deferring handling of overload to the
895	   second pass.  To aid comparison, line numbers are kept in step
896	   between the two passes by using letter suffixes where the longer code
897	   needs extra lines.

899	   A full open source implementation for Linux is available at:
900	   https://github.com/olgabo/dualpi2.

902	A.1.  Pass #1: Core Concepts

904	   The pseudocode manipulates three main structures of variables: the
905	   packet (pkt), the L4S queue (lq) and the Classic queue (cq).  The
906	   pseudocode consists of the following four functions:

908	   o  initialization code (Figure 2) that sets parameter defaults (the
909	      API for setting non-default values is omitted for brevity)

911	   o  enqueue code (Figure 3)

913	   o  dequeue code (Figure 4)

915	   o  code to regularly update the base probability (p) used in the
916	      dequeue code (Figure 5).

918	   It also uses the following functions that are not shown in full here:

920	   o  scheduler(), which selects between the head packets of the two
921	      queues; the choice of scheduler technology is discussed later;

923	   o  cq.len() or lq.len() returns the current length (aka. backlog) of
924	      the relevant queue in bytes;

926	   o  cq.time() or lq.time() returns the current queuing delay (aka.
927	      sojourn time or service time) of the relevant queue in units of
928	      time.

930	   Queuing delay could be measured directly by storing a per-packet
931	   time-stamp as each packet is enqueued, and subtracting this from the
932	   system time when the packet is dequeued.  If time-stamping is not
933	   easy to introduce with certain hardware, queuing delay could be
934	   predicted indirectly by dividing the size of the queue by the
935	   predicted departure rate, which might be known precisely for some
936	   link technologies (see for example [RFC8034]).
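
   The two ways of obtaining queuing delay described above might look
   roughly like this in Python.  This is a sketch only; the function
   names and the figures in the usage note are hypothetical.

```python
def measured_delay(enqueue_timestamp: float, now: float) -> float:
    """Direct measurement: time-stamp stored at enqueue, subtracted from
    the system time at dequeue (both in seconds)."""
    return now - enqueue_timestamp


def predicted_delay(backlog_bytes: int, departure_rate_bps: float) -> float:
    """Indirect estimate: queue size divided by the predicted departure
    rate.  The backlog is converted from bytes to bits first."""
    return backlog_bytes * 8 / departure_rate_bps
```

   For instance, a 15,000 byte backlog draining at a predicted 40 Mb/s
   gives predicted_delay(15000, 40e6), which is about 3 ms.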

938	   In our experiments so far (building on experiments with PIE) on
939	   broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs
940	   from 5 ms to 100 ms, DualPI2 achieves good results with the default
941	   parameters in Figure 2.  The parameters are categorised by whether
942	   they relate to the Base PI2 AQM, the L4S AQM or the framework
943	   coupling them together.  Variables derived from these parameters are
944	   also included at the end of each category.  Each parameter is
945	   explained as it is encountered in the walk-through of the pseudocode
946	   below.

948	   1:  dualpi2_params_init(...) {         % Set input parameter defaults
949	   2:    % PI2 AQM parameters
950	   3:    target = 15 ms              % PI AQM Classic queue delay target
951	   4:    Tupdate = 16 ms            % PI Classic queue sampling interval
952	   5:    alpha = 10 Hz^2                              % PI integral gain
953	   6:    beta = 100 Hz^2                          % PI proportional gain
954	   7:    p_Cmax = 1/4                       % Max Classic drop/mark prob
955	   8:    % Derived PI2 AQM variables
956	   9:    alpha_U = alpha *Tupdate % PI integral gain per update interval
957	   10:   beta_U = beta * Tupdate  % PI prop'nal gain per update interval
958	   11:
959	   12:   % DualQ Coupled framework parameters
960	   13:   k = 2                                         % Coupling factor
961	   14:   % scheduler weight or equival't parameter (scheduler-dependent)
962	   15:   limit = MAX_LINK_RATE * 250 ms               % Dual buffer size
963	   16:
964	   17:   % L4S AQM parameters
965	   18:   T_time = 1 ms                   % L4S marking threshold in time
966	   19:   T_len = 2 * MTU            % Min L4S marking threshold in bytes
967	   20:   % Derived L4S AQM variables
968	   21:   p_Lmax = min(k*sqrt(p_Cmax), 1)          % Max L4S marking prob
969	   22: }

971	       Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM
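
   For readers who prefer executable notation, the defaults in Figure 2
   could be transcribed into Python roughly as below.  The pseudocode
   leaves MTU and MAX_LINK_RATE undefined, so the 1500 byte MTU and the
   200 Mb/s link rate used here are assumptions of this sketch, as are
   the class and field names.  Times are in seconds.

```python
import math
from dataclasses import dataclass


@dataclass
class DualPI2Params:
    # PI2 AQM parameters
    target: float = 0.015      # Classic queue delay target (15 ms)
    tupdate: float = 0.016     # Classic queue sampling interval (16 ms)
    alpha: float = 10.0        # PI integral gain (Hz^2)
    beta: float = 100.0        # PI proportional gain (Hz^2)
    p_cmax: float = 0.25       # max Classic drop/mark probability
    # DualQ Coupled framework parameters
    k: float = 2.0             # coupling factor
    max_link_rate_bps: float = 200e6   # ASSUMPTION: not set in Figure 2
    # L4S AQM parameters
    t_time: float = 0.001      # L4S marking threshold in time (1 ms)
    mtu: int = 1500            # ASSUMPTION: not set in Figure 2

    @property
    def alpha_u(self) -> float:
        """Integral gain per update interval (Hz)."""
        return self.alpha * self.tupdate

    @property
    def beta_u(self) -> float:
        """Proportional gain per update interval (Hz)."""
        return self.beta * self.tupdate

    @property
    def limit_bytes(self) -> float:
        """Dual buffer size: 250 ms at the maximum link rate."""
        return self.max_link_rate_bps * 0.250 / 8

    @property
    def t_len(self) -> int:
        """Minimum L4S marking threshold in bytes (2 * MTU)."""
        return 2 * self.mtu

    @property
    def p_lmax(self) -> float:
        """Maximum L4S marking probability."""
        return min(self.k * math.sqrt(self.p_cmax), 1.0)
```

   With the defaults, alpha_u is 0.16 Hz, beta_u is 1.6 Hz, and p_lmax
   is 1 because k * sqrt(1/4) = 1.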

973	   The overall goal of the code is to maintain the base probability (p),
974	   which is an internal variable from which the marking and dropping
975	   probabilities for L4S and Classic traffic (p_L and p_C) are derived.
976	   The variable named p in the pseudocode and in this walk-through is
977	   the same as p' (p-prime) in Section 2.4.  The probabilities p_CL and
978	   p_C are derived from p in lines 4 and 5 of dualpi2_update()
979	   (Figure 5), then used in the dualpi2_dequeue() function (Figure 4).

981	   The code walk-through below builds up to explaining that part of the
982	   code eventually, but it starts from packet arrival.

984	   1:  dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq
985	   2:    if ( lq.len() + cq.len() > limit )
986	   3:      drop(pkt)                     % drop packet if buffer is full
987	   4:    else {                                      % Packet classifier
988	   5:      if ( ecn(pkt) modulo 2 == 1 )       % ECN bits = ECT(1) or CE
989	   6:        lq.enqueue(pkt)
990	   7:      else                           % ECN bits = not-ECT or ECT(0)
991	   8:        cq.enqueue(pkt)
992	   9:    }
993	   10: }

995	      Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM

997	   1:  dualpi2_dequeue(lq, cq, pkt) {     % Couples L4S & Classic queues
998	   2:    while ( lq.len() + cq.len() > 0 ) {
999	   3:      if ( scheduler() == lq ) {
1000	   4:        lq.dequeue(pkt)                      % Scheduler chooses lq
1001	   5:        if ( ((lq.time() > T_time)              % step marking ...
1002	   6:              AND (lq.len() > T_len))
1003	   7:            OR (p_CL > rand()) )             % ...or linear marking
1004	   8:          mark(pkt)
1005	   9:      } else {
1006	   10:       cq.dequeue(pkt)                      % Scheduler chooses cq
1007	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1008	   12:         if ( ecn(pkt) == 0 ) {           % if ECN field = not-ECT
1009	   13:           drop(pkt)                                % squared drop
1010	   14:           continue        % continue to the top of the while loop
1011	   15:         }
1012	   16:         mark(pkt)                                  % squared mark
1013	   17:       }
1014	   18:     }
1015	   19:     return(pkt)                      % return the packet and stop
1016	   20:   }
1017	   21:   return(NULL)                             % no packet to dequeue
1018	   22: }

1020	      Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM

1022	   When packets arrive, first a common queue limit is checked as shown
1023	   in line 2 of the enqueuing pseudocode in Figure 3.  Note that the
1024	   limit is deliberately tested before enqueue to avoid any bias against
1025	   larger packets (so the actual buffer has to be one MTU larger than
1026	   limit).  If limit is not exceeded, the packet will be classified and
1027	   enqueued to the Classic or L4S queue dependent on the least
1028	   significant bit of the ECN field in the IP header (line 5).  Packets
1029	   with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be
1030	   enqueued in the Classic queue.  Otherwise, ECT(1) and CE packets will
1031	   be enqueued in the L4S queue.  Optional additional packet
1032	   classification flexibility is omitted for brevity.
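
   The limit test and the classifier described above can be mimicked
   with a few lines of Python.  This is a minimal sketch, assuming
   packets are represented as hypothetical (length, ECN field) tuples
   and queues as deques; it is not the Linux implementation.

```python
from collections import deque
from typing import Deque, Tuple

# ECN codepoints of the 2-bit ECN field in the IP header [RFC3168]
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

Packet = Tuple[int, int]  # (length in bytes, ECN field), hypothetical


def backlog(queue: Deque[Packet]) -> int:
    """Current queue length in bytes."""
    return sum(length for length, _ in queue)


def enqueue(lq: Deque[Packet], cq: Deque[Packet], pkt: Packet,
            limit: int) -> str:
    """Return "dropped", "L" or "C" for the arriving packet."""
    # The limit is tested before the packet is added, so the outcome
    # does not depend on the size of the arriving packet.
    if backlog(lq) + backlog(cq) > limit:
        return "dropped"
    if pkt[1] & 1:          # LSB of ECN field set: ECT(1) or CE
        lq.append(pkt)
        return "L"
    cq.append(pkt)          # LSB clear: Not-ECT or ECT(0)
    return "C"
```

   Because the test precedes the enqueue, the combined backlog can
   exceed limit by up to one packet, which is why the actual buffer
   has to be one MTU larger than limit.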

1034	   The dequeue pseudocode (Figure 4) schedules one packet for dequeuing
1035	   (or none if both queues are empty).  It also makes all the AQM
1036	   decisions on dropping and marking.  The alternative of applying the
1037	   AQMs at enqueue would shift some processing from the critical time
1038	   when each packet is dequeued.  However, it would also add a whole
1039	   queue of delay to the control signals, making the loop very sloppy.

1041	   All the dequeue code is contained within a large while loop so that
1042	   if it decides to drop a packet, it will continue until it selects a
1043	   packet to schedule.  Line 3 of the dequeue pseudocode is where the
1044	   scheduler chooses between the L4S queue (lq) and the Classic queue
1045	   (cq).  Detailed implementation of the scheduler is not shown (see
1046	   discussion later).

1048	   o  If an L4S packet is scheduled, lines 5 to 8 mark the packet if
1049	      either the L4S threshold (T_time) is exceeded, or if a random
1050	      marking decision is drawn according to p_CL (maintained by the
1051	      dualpi2_update() function discussed below).  This logical 'OR' on
1052	      a per-packet basis implements the max() function shown in Figure 1
1053	      to couple the outputs of the two AQMs together.  The L4S threshold
1054	      is usually in units of time (default T_time = 1 ms).  However, on
1055	      slow links the packet serialization time can approach the
1056	      threshold T_time, so line 6 sets a floor of T_len (=2 MTU) to the
1057	      threshold, otherwise marking is always too frequent on slow links.

1059	   o  If a Classic packet is scheduled, lines 10 to 17 drop or mark the
1060	      packet based on the squared probability p_C.
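
   The two per-packet decisions in the bullets above can be isolated as
   small Python functions.  This is a sketch only: the function names
   are hypothetical, and the random source is passed in as a parameter
   purely so that the behaviour can be checked deterministically.

```python
import random
from typing import Callable


def l4s_should_mark(l_delay: float, l_backlog: int, p_cl: float,
                    t_time: float = 0.001, t_len: int = 3000,
                    rng: Callable[[], float] = random.random) -> bool:
    """Lines 5 to 8 of Figure 4: step marking OR coupled marking."""
    # Native L4S AQM: the time threshold, floored by a byte threshold
    # so that slow links are not marked too frequently.
    step = l_delay > t_time and l_backlog > t_len
    # The per-packet OR implements the max() of the two AQM outputs.
    return step or p_cl > rng()


def classic_action(p_c: float, ecn_capable: bool,
                   rng: Callable[[], float] = random.random) -> str:
    """Lines 11 to 17 of Figure 4: squared drop or squared mark."""
    if p_c > rng():
        return "mark" if ecn_capable else "drop"
    return "forward"
```

   The t_len default of 3000 bytes assumes an MTU of 1500 bytes, which
   Figure 2 does not specify.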

1062	   There is some concern that using a step function for the Native L4S
1063	   AQM requires end-systems to smooth the signal for a lot longer -
1064	   until its fidelity is sufficient.  The latency benefits of a ramp are
1065	   being investigated as a simple alternative to the step.  This ramp
1066	   would be similar to the RED algorithm, with the following
1067	   differences:

1069	   o  The min and max of the ramp are defined in units of queuing delay,
1070	      not bytes, so that configuration remains invariant as the queue
1071	      departure rate varies.

1073	   o  It uses instantaneous queueing delay without smoothing (smoothing
1074	      is done in the end-systems).

1076	   o  Determinism is being experimented with instead of randomness, to
1077	      reduce the delay necessary to smooth out the noise of randomness
1078	      from the signal.  For each packet, the algorithm would accumulate
1079	      p'_L in a counter and mark the packet that took the counter over
1080	      1, then subtract 1 from the counter and continue.

1082	   o  The ramp rises linearly directly from 0 to 1, not to an
1083	      intermediate value of p'_L as RED would, because there is no need
1084	      to keep ECN marking probability low.

1086	   This ramp algorithm would require two configuration parameters (min
1087	   and max threshold in units of queuing time), in contrast to the
1088	   single parameter of a step.
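
   A possible shape for such a ramp, including the deterministic
   counter, is sketched below in Python.  The names and the threshold
   values in the usage note are hypothetical.  The text does not say
   what happens when the counter lands exactly on 1, so marking in that
   case is an assumption of this sketch.

```python
def ramp_probability(delay: float, min_th: float, max_th: float) -> float:
    """Marking probability p'_L rising linearly from 0 at min_th to 1
    at max_th, using instantaneous queuing delay (seconds)."""
    if delay <= min_th:
        return 0.0
    if delay >= max_th:
        return 1.0
    return (delay - min_th) / (max_th - min_th)


class DeterministicMarker:
    """Accumulates p'_L per packet instead of drawing a random number."""

    def __init__(self) -> None:
        self.counter = 0.0

    def should_mark(self, p: float) -> bool:
        self.counter += p
        # ASSUMPTION: reaching exactly 1 also triggers a mark.
        if self.counter >= 1.0:
            self.counter -= 1.0
            return True
        return False
```

   With hypothetical thresholds of 1 ms and 3 ms, a queuing delay of
   2 ms gives a probability of about 0.5.  At a constant probability of
   0.25 the marker marks exactly every fourth packet, with none of the
   burstiness a random draw would add.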

1090	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1091	   2:    curq = cq.time()  % use queuing time of first-in Classic packet
1092	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1093	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1094	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1095	   6:    prevq = curq
1096	   7:  }

1098	     Figure 5: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM

1100	   The base probability (p) is kept up to date by the core PI algorithm
1101	   in Figure 5, which is executed every Tupdate.

1103	   Note that p solely depends on the queuing time in the Classic queue.
1104	   In line 2, the current queuing delay (curq) is evaluated from how
1105	   long the head packet was in the Classic queue (cq).  The function
1106	   cq.time() (not shown) subtracts the time stamped at enqueue from the
1107	   current time and implicitly takes the current queuing delay as 0 if
1108	   the queue is empty.

1110	   The algorithm centres on line 3, which is a classical Proportional-
1111	   Integral (PI) controller that alters p dependent on: a) the error
1112	   between the current queuing delay (curq) and the target queuing delay
1113	   ('target' - see [RFC8033]); and b) the change in queuing delay since
1114	   the last sample.  The name 'PI' represents the fact that the second
1115	   factor (how fast the queue is growing) is _P_roportional to load
1116	   while the first is the _I_ntegral of the load (so it removes any
1117	   standing queue in excess of the target).

1119	   The two 'gain factors' in line 3, alpha_U and beta_U, respectively
1120	   weight how strongly each of these elements ((a) and (b)) alters p.
1121	   They are in units of 'per second of delay' or Hz, because they
1122	   transform differences in queueing delay into changes in probability.

1124	   alpha_U and beta_U are derived from the input parameters alpha and
1125	   beta (see lines 5 and 6 of Figure 2).  These recommended values of
1126	   alpha and beta come from the stability analysis in [PI2] so that the
1127	   AQM can change p as fast as possible in response to changes in load
1128	   without over-compensating and therefore causing oscillations in the
1129	   queue.

1131	   alpha and beta determine how much p ought to change if it was updated
1132	   every second.  It is best to update p as frequently as possible, but
1133	   the update interval (Tupdate) will probably be constrained by
1134	   hardware performance.  For link rates from 4 to 200 Mb/s, we found
1135	   Tupdate=16ms (as recommended in [RFC8033]) is sufficient.  However
1136	   small the chosen value of Tupdate, p should change by the same amount
1137	   per second, but in finer more frequent steps.  So the gain factors
1138	   used for updating p in Figure 5 need to be scaled by (Tupdate/1s),
1139	   which is done in lines 9 and 10 of Figure 2.  The suffix '_U'
1140	   represents 'per update time' (Tupdate).

1142	   In corner cases, p can overflow the range [0,1] so the resulting
1143	   value of p has to be bounded (omitted from the pseudocode).  Then, as
1144	   already explained, the coupled and Classic probabilities are derived
1145	   from the new p in lines 4 and 5 as p_CL = k*p and p_C = p^2.
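
   Putting the update steps together, including the bounding of p that
   the pseudocode omits, gives something like the Python sketch below.
   The class name, the initial values of p and prevq (both zero), and
   the exact form of the clamp are assumptions of this sketch.

```python
from typing import Tuple


class PI2Core:
    """Core PI update of Figure 5 with p bounded to [0, 1]."""

    def __init__(self, target: float = 0.015, tupdate: float = 0.016,
                 alpha: float = 10.0, beta: float = 100.0,
                 k: float = 2.0) -> None:
        self.target = target
        self.alpha_u = alpha * tupdate   # gains scaled per update
        self.beta_u = beta * tupdate
        self.k = k
        self.p = 0.0        # ASSUMPTION: start with no congestion signal
        self.prevq = 0.0    # ASSUMPTION: start from an empty queue

    def update(self, curq: float) -> Tuple[float, float]:
        """Run once every Tupdate with the Classic queuing delay curq
        (seconds).  Returns (p_CL, p_C)."""
        self.p += (self.alpha_u * (curq - self.target)
                   + self.beta_u * (curq - self.prevq))
        # Bounding step, omitted from the pseudocode.
        self.p = min(max(self.p, 0.0), 1.0)
        self.prevq = curq
        # p_CL may exceed 1 when k > 1; compared with a random number
        # in [0, 1) that simply means every L4S packet is marked.
        return self.k * self.p, self.p ** 2
```

   Starting from an idle queue, one update with curq = 25 ms changes p
   by 0.16 * 0.010 + 1.6 * 0.025 = 0.0416, so p_CL is about 0.083 and
   p_C is about 0.0017.  An update with an empty queue would drive p
   negative, so the clamp holds it at 0.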

1147	   Because the coupled L4S marking probability (p_CL) is factored up by
1148	   k, the dynamic gain parameters alpha and beta are also inherently
1149	   factored up by k for the L4S queue, which is necessary to ensure that
1150	   Classic TCP and DCTCP controls have the same stability.  So, if alpha
1151	   is 10 Hz^2, the effective gain factor for the L4S queue is k*alpha,
1152	   which is 20 Hz^2 with the default coupling factor of k=2.

1154	   Unlike in PIE [RFC8033], alpha_U and beta_U do not need to be tuned
1155	   every Tupdate dependent on p.  Instead, in PI2, alpha_U and beta_U
1156	   are independent of p because the squaring applied to Classic traffic
1157	   tunes them inherently.  This is explained in [PI2], which also
1158	   explains why this more principled approach removes the need for most
1159	   of the heuristics that had to be added to PIE.

1161	   {ToDo: Scaling beta with Tupdate and scaling both alpha & beta with
1162	   RTT}

1164	A.2.  Pass #2: Overload Details

1166	   Figure 6 repeats the dequeue function of Figure 4, but with overload
1167	   details added.  Similarly Figure 7 repeats the core PI algorithm of
1168	   Figure 5 with overload details added.  The initialization and enqueue
1169	   functions are unchanged.

1171	   In line 7 of the initialization function (Figure 2), the default
1172	   maximum Classic drop probability p_Cmax = 1/4 or 25%. This is the
1173	   point at which it is deemed that the Classic queue has become
1174	   persistently overloaded, so it switches to using solely drop, even
1175	   for ECN-capable packets.  This protects the queue against any
1176	   unresponsive traffic that falsely claims that it is responsive to ECN
1177	   marking, as required by [RFC3168] and [RFC7567].

1179	   Line 21 of the initialization function translates this into a maximum
1180	   L4S marking probability (p_Lmax) by rearranging Equation (1).  With a
1181	   coupling factor of k=2 (the default) or greater, this translates to a
1182	   maximum L4S marking probability of 1 (or 100%).  This is intended to
1183	   ensure that the L4S queue starts to introduce dropping once marking
1184	   saturates and can rise no further.  The 'TCP Prague' requirements
1185	   [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion
1186	   control detects a drop, it falls back to a response that coexists
1187	   with 'Classic' TCP.  So it is correct that the L4S queue drops
1188	   packets proportional to p^2, as if they are Classic packets.
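
   As a rough check of this translation, the following Python sketch
   combines p_CL = k*p and p_C = p^2 to give p_CL = k*sqrt(p_C).  The
   function name is hypothetical, and the cap at 1 is inferred from the
   statement above that k=2 or greater yields 100%; it is not copied
   from Figure 2.

```python
import math


def max_l4s_marking_prob(p_cmax, k):
    """p_Lmax: the coupled L4S probability at which Classic drop hits p_Cmax.

    Derived from p_CL = k*p and p_C = p^2.  The cap at 1 is an assumption
    inferred from the surrounding text.
    """
    return min(k * math.sqrt(p_cmax), 1.0)


print(max_l4s_marking_prob(0.25, 2))   # 1.0
print(max_l4s_marking_prob(0.25, 1))   # 0.5
```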

1190	   Both these switch-overs are triggered by the tests for overload
1191	   introduced in lines 4b and 12b of the dequeue function (Figure 6).
1192	   Lines 8c to 8g drop L4S packets with probability p^2.  Lines 8h to 8i
1193	   mark the remaining packets with probability p_CL.

1195	   Lines 2c to 2d in the core PI algorithm (Figure 7) deal with overload
1196	   of the L4S queue when there is no Classic traffic.  This is
1197	   necessary, because the core PI algorithm maintains the appropriate
1198	   drop probability to regulate overload, but it depends on the length
1199	   of the Classic queue.  If there is no Classic queue the naive
1200	   algorithm in Figure 5 drops nothing, even if the L4S queue is
1201	   overloaded - so tail drop would have to take over (lines 3 and 4 of
1202	   Figure 3).

1204	   If the test at line 2a finds that the Classic queue is empty, line 2d
1205	   measures the current queue delay using the L4S queue instead.  While
1206	   the L4S queue is not overloaded, its delay will always be tiny
1207	   compared to the target Classic queue delay.  So p_CL will be driven
1208	   to zero, and the L4S queue will naturally be governed solely by
1209	   threshold marking (lines 5 and 6 of the dequeue algorithm in
1210	   Figure 6).  But, if unresponsive L4S source(s) cause overload, the
1211	   DualQ transitions smoothly to L4S marking based on the PI algorithm.
1212	   And as overload increases, it naturally transitions from marking to
1213	   dropping by the switch-over mechanism already described.

1215	   1:  dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq
1216	   2:    while ( lq.len() + cq.len() > 0 ) {
1217	   3:      if ( scheduler() == lq ) {
1218	   4a:       lq.dequeue(pkt)
1219	   4b:       if ( p_CL < p_Lmax ) {      % Check for overload saturation
1220	   5:          if ( ((lq.time() > T_time)             % step marking ...
1221	   6:                AND (lq.len() > T_len))
1222	   7:              OR (p_CL > rand()) )           % ...or linear marking
1223	   8a:            mark(pkt)
1224	   8b:       } else {                              % overload saturation
1225	   8c:         if ( p_C > rand() ) {             % probability p_C = p^2
1226	   8e:           drop(pkt)      % revert to Classic drop due to overload
1227	   8f:           continue        % continue to the top of the while loop
1228	   8g:         }
1229	   8h:         if ( p_CL > rand() )           % probability p_CL = k * p
1230	   8i:           mark(pkt)         % linear marking of remaining packets
1231	   8j:       }
1232	   9:      } else {
1233	   10:       cq.dequeue(pkt)
1234	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1235	   12a:        if ( (ecn(pkt) == 0)                % ECN field = not-ECT
1236	   12b:             OR (p_C >= p_Cmax) ) {       % Overload disables ECN
1237	   13:           drop(pkt)                     % squared drop, redo loop
1238	   14:           continue        % continue to the top of the while loop
1239	   15:         }
1240	   16:         mark(pkt)                                  % squared mark
1241	   17:       }
1242	   18:     }
1243	   19:     return(pkt)                      % return the packet and stop
1244	   20:   }
1245	   21:   return(NULL)                             % no packet to dequeue
1246	   22: }

1248	      Figure 6: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM
1249	             (Including Integer Arithmetic and Overload Code)
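
   For readers who want to experiment, the dequeue logic of Figure 6
   can be transcribed into Python roughly as follows.  This is a sketch
   under stated assumptions, not a reference implementation: the
   TimedQueue class, the dict-based packets and state, and the
   pick_l4s callback (standing in for scheduler()) are all
   hypothetical, and the L4S queuing time is assumed to be sampled from
   the head packet just before it is removed.

```python
import random
from collections import deque


class TimedQueue:
    """FIFO that records each packet's enqueue time (hypothetical helper)."""

    def __init__(self):
        self._q = deque()

    def __len__(self):
        return len(self._q)

    def enqueue(self, pkt, now):
        self._q.append((now, pkt))

    def time(self, now):
        """Queuing time of the head packet in seconds (0 if empty)."""
        return now - self._q[0][0] if self._q else 0.0

    def dequeue(self):
        return self._q.popleft()[1]


def dualpi2_dequeue(lq, cq, st, now, pick_l4s, rand=random.random):
    """Return the next packet to forward, or None if both queues are empty.

    st holds p_CL, p_C, p_Lmax, p_Cmax, T_time and T_len.  pick_l4s() is
    only consulted when both queues hold packets.
    """
    while len(lq) + len(cq) > 0:
        if len(lq) > 0 and (len(cq) == 0 or pick_l4s()):
            delay = lq.time(now)          # sampled before the head is removed
            pkt = lq.dequeue()
            if st["p_CL"] < st["p_Lmax"]:             # no overload saturation
                step = delay > st["T_time"] and len(lq) > st["T_len"]
                if step or st["p_CL"] > rand():       # step or linear marking
                    pkt["ce"] = True
            else:                                     # overload saturation
                if st["p_C"] > rand():                # Classic drop, p^2
                    continue                          # packet is discarded
                if st["p_CL"] > rand():               # mark what remains
                    pkt["ce"] = True
        else:
            pkt = cq.dequeue()
            if st["p_C"] > rand():
                if not pkt["ect"] or st["p_C"] >= st["p_Cmax"]:
                    continue                          # squared drop
                pkt["ce"] = True                      # squared mark
        return pkt
    return None


# Saturated example: p = 0.6 and k = 2, so p_CL = 1.2 and p_C = 0.36.
# T_time and T_len are placeholder values chosen for this sketch.
st = {"p_CL": 1.2, "p_C": 0.36, "p_Lmax": 1.0, "p_Cmax": 0.25,
      "T_time": 0.001, "T_len": 1}
lq, cq = TimedQueue(), TimedQueue()
lq.enqueue({"ect": True, "ce": False}, now=0.0)
pkt = dualpi2_dequeue(lq, cq, st, now=0.0005, pick_l4s=lambda: True,
                      rand=lambda: 0.5)
print(pkt)   # forwarded and marked, because 0.36 <= 0.5 < 1.2
```

   Passing rand as a parameter makes each branch reproducible in a
   test; a real implementation would draw fresh random numbers.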

1251	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1252	   2a:   if ( cq.len() > 0 )
1253	   2b:     curq = cq.time() %use queuing time of first-in Classic packet
1254	   2c:   else                                      % Classic queue empty
1255	   2d:     curq = lq.time()    % use queuing time of first-in L4S packet
1256	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1257	   4:    p_CL = p * k           % L4S prob = base prob * coupling factor
1258	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1259	   6:    prevq = curq
1260	   7:  }

1262	     Figure 7: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM
1263	                         (Including Overload Code)
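
   The update step of Figure 7 can likewise be sketched in Python.  The
   dict-based state is hypothetical, the clamp of p to [0,1] is the
   bounding step that the pseudocode omits, and the numeric values
   (15 ms target, alpha_U = 0.16, beta_U = 1.6) are assumptions chosen
   only to make the example concrete.

```python
def dualpi2_update(st, cq_len, cq_delay, lq_delay):
    """One PI update, run every Tupdate; st is a dict of AQM state."""
    # Lines 2a-2d: fall back to the L4S queue delay if Classic is empty.
    curq = cq_delay if cq_len > 0 else lq_delay
    p = (st["p"] + st["alpha_U"] * (curq - st["target"])
         + st["beta_U"] * (curq - st["prevq"]))
    st["p"] = min(max(p, 0.0), 1.0)   # bounding step omitted from Figure 7
    st["p_CL"] = st["p"] * st["k"]    # L4S prob = base prob * coupling factor
    st["p_C"] = st["p"] ** 2          # Classic prob = (base prob)^2
    st["prevq"] = curq
    return st


# Assumed values: target 15 ms, alpha_U = 0.16, beta_U = 1.6, k = 2.
st = {"p": 0.0, "prevq": 0.015, "target": 0.015,
      "alpha_U": 0.16, "beta_U": 1.6, "k": 2}

# Classic queue empty, L4S queue overloaded with 20 ms of delay.
dualpi2_update(st, cq_len=0, cq_delay=0.0, lq_delay=0.020)
print(st["p"], st["p_CL"], st["p_C"])   # p is about 0.0088
```

   Calling the function again with a small L4S delay (for example
   0.5 ms) pushes p below zero before the clamp, so p and p_CL return
   to zero and threshold marking alone governs the L4S queue.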

1265	   The choice of scheduler technology is critical to overload protection
1266	   (see Section 4.1).

1268	   o  A well-understood weighted scheduler such as weighted round robin
1269	      (WRR) is recommended.  The scheduler weight for Classic should be
1270	      low, e.g. 1/16.

1272	   o  Alternatively, a time-shifted FIFO could be used.  This is a very
1273	      simple scheduler, but it does not fully isolate latency in the L4S
1274	      queue from uncontrolled bursts in the Classic queue.  It works by
1275	      selecting the head packet that has waited the longest, biased
1276	      against the Classic traffic by a time-shift of tshift.  To
1277	      implement time-shifted FIFO, the "if (scheduler() == lq )" test in
1278	      line 3 of the dequeue code would simply be replaced by "if (
1279	      lq.time() + tshift >= cq.time() )".  For the public Internet a
1280	      good value for tshift is 50ms.  For private networks with smaller
1281	      diameter, about 4*target would be reasonable.

1283	   o  A strict priority scheduler would be inappropriate, because it
1284	      would starve Classic if L4S was overloaded.
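
   The time-shifted FIFO test described in the list above is small
   enough to sketch directly.  The function name is hypothetical, and
   the sketch assumes the caller has already checked that the chosen
   queue is non-empty.

```python
def tsfifo_picks_l4s(lq_time, cq_time, tshift):
    """Return True if the L4S head packet should be served next.

    lq_time and cq_time are the queuing times (in seconds) of the head
    packets of the L4S and Classic queues.  The caller is assumed to
    handle empty queues separately.
    """
    return lq_time + tshift >= cq_time


TSHIFT = 0.050   # 50 ms, the value suggested for the public Internet

# Classic head has waited 40 ms: L4S still goes first.
print(tsfifo_picks_l4s(0.001, 0.040, TSHIFT))   # True
# Classic head has waited 60 ms: Classic is served.
print(tsfifo_picks_l4s(0.001, 0.060, TSHIFT))   # False
```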

1286	Appendix B.  Example DualQ Coupled Curvy RED Algorithm

1288	   As another example of a DualQ Coupled AQM algorithm, the pseudocode
1289	   below gives the Curvy RED based algorithm we used and tested.
1290	   Although we designed the AQM to be efficient in integer arithmetic,
1291	   to aid understanding it is first given using real-number arithmetic.
1292	   Then, one possible optimization for integer arithmetic is given, also
1293	   in pseudocode.  To aid comparison, the line numbers are kept in step
1294	   between the two by using letter suffixes where the longer code needs
1295	   extra lines.

1297	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1298	   2:    if ( lq.dequeue(pkt) ) {
1299	   3a:     p_L = cq.sec() / 2^S_L
1300	   3b:     if ( lq.byt() > T )
1301	   3c:       mark(pkt)
1302	   3d:     elif ( p_L > maxrand(U) )
1303	   4:        mark(pkt)
1304	   5:      return(pkt)                % return the packet and stop here
1305	   6:    }
1306	   7:    while ( cq.dequeue(pkt) ) {
1307	   8a:     alpha = 2^(-f_C)
1308	   8b:     Q_C = alpha * pkt.sec() + (1-alpha)* Q_C    % Classic Q EWMA
1309	   9a:     sqrt_p_C = Q_C / 2^S_C
1310	   9b:     if ( sqrt_p_C > maxrand(2*U) )
1311	   10:       drop(pkt)                        % Squared drop, redo loop
1312	   11:     else
1313	   12:       return(pkt)              % return the packet and stop here
1314	   13:   }
1315	   14:   return(NULL)                           % no packet to dequeue
1316	   15: }

1318	   16: maxrand(u) {                % return the max of u random numbers
1319	   17:     maxr=0
1320	   18:     while (u-- > 0)
1321	   19:         maxr = max(maxr, rand())               % 0 <= rand() < 1
1322	   20:     return(maxr)
1323	   21: }

1325	   Figure 8: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM

1327	   Packet classification code is not shown, as it is no different from
1328	   Figure 3.  Potential classification schemes are discussed in
1329	   Section 2.  The Curvy RED algorithm has not been maintained to the
1330	   same degree as the DualPI2 algorithm.  Some ideas used in DualPI2
1331	   would need to be translated into Curvy RED, such as i) the
1332	   conditional priority scheduler instead of strict priority; ii) the
1333	   time-based L4S threshold; iii) turning off ECN as overload
1334	   protection; iv) Classic ECN support.  These are not shown in the
1335	   Curvy RED pseudocode, but would need to be implemented for
1336	   production. {ToDo}

1338	   At the outer level, the structure of dualq_dequeue() implements
1339	   strict priority scheduling.  The code is written assuming the AQM is
1340	   applied on dequeue (Note 1).  Every time dualq_dequeue() is called,
1341	   the if-block in lines 2-6 determines whether there is an L4S packet
1342	   to dequeue by calling lq.dequeue(pkt), and otherwise the while-block
1343	   in lines 7-13 determines whether there is a Classic packet to
1344	   dequeue, by calling cq.dequeue(pkt).  (Note 2)
1345	   In the lower priority Classic queue, a while loop is used so that, if
1346	   the AQM determines that a Classic packet should be dropped, it
1347	   continues to test Classic packets, deciding whether to drop each
1348	   one, until it actually forwards one.  Thus, every call to
1349	   dualq_dequeue() returns one packet if at least one is present in
1350	   either queue, otherwise it returns NULL at line 14.  (Note 3)

1352	   Within each queue, the decision whether to drop or mark is taken as
1353	   follows (to simplify the explanation, it is assumed that U=1):

1355	   L4S:  If the test at line 2 determines there is an L4S packet to
1356	      dequeue, the tests at lines 3b and 3d determine whether to mark
1357	      it.  The first is a simple test of whether the L4S queue (lq.byt()
1358	      in bytes) is greater than a step threshold T in bytes (Note 4).
1359	      The second test is similar to the random ECN marking in RED, but
1360	      with the following differences: i) the marking function does not
1361	      start with a plateau of zero marking until a minimum threshold,
1362	      rather the marking probability starts to increase as soon as the
1363	      queue is positive; ii) marking depends on queuing time, not bytes,
1364	      in order to scale for any link rate without being reconfigured;
1365	      iii) marking of the L4S queue does not depend on itself, it
1366	      depends on the queuing time of the _other_ (Classic) queue, where
1367	      cq.sec() is the queuing time of the packet at the head of the
1368	      Classic queue (zero if empty); iv) marking depends on the
1369	      instantaneous queuing time (of the other Classic queue), not a
1370	      smoothed average; v) the queue is compared with the maximum of U
1371	      random numbers (but if U=1, this is the same as the single random
1372	      number used in RED).

1374	      Specifically, in line 3a the marking probability p_L is set to the
1375	      Classic queueing time cq.sec() in seconds divided by the L4S
1376	      scaling parameter 2^S_L, which represents the queuing time (in
1377	      seconds) at which marking probability would hit 100%. Then in line
1378	      3d (if U=1) the result is compared with a uniformly distributed
1379	      random number between 0 and 1, which ensures that marking
1380	      probability will linearly increase with queueing time.  The
1381	      scaling parameter is expressed as a power of 2 so that division
1382	      can be implemented as a right bit-shift (>>) in line 3 of the
1383	      integer variant of the pseudocode (Figure 9).

1385	   Classic:  If the test at line 7 determines that there is at least one
1386	      Classic packet to dequeue, the test at line 9b determines whether
1387	      to drop it.  But before that, line 8b updates Q_C, which is an
1388	      exponentially weighted moving average (Note 5) of the queuing time
1389	      in the Classic queue, where pkt.sec() is the instantaneous
1390	      queueing time of the current Classic packet and alpha is the EWMA
1391	      constant for the Classic queue.  In line 8a, alpha is represented
1392	      as an integer power of 2, so that in line 8 of the integer code
1393	      the division needed to weight the moving average can be
1394	      implemented by a right bit-shift (>> f_C).

1396	      Lines 9a and 9b implement the drop function.  In line 9a the
1397	      averaged queuing time Q_C is divided by the Classic scaling
1398	      parameter 2^S_C, in the same way that queuing time was scaled for
1399	      L4S marking.  This scaled queuing time is given the variable name
1400	      sqrt_p_C because it will be squared to compute Classic drop
1401	      probability, so before it is squared it is effectively the square
1402	      root of the drop probability.  The squaring is done by comparing
1403	      it with the maximum out of two random numbers (assuming U=1).
1404	      Comparing it with the maximum out of two is the same as the
1405	      logical `AND' of two tests, which ensures drop probability rises
1406	      with the square of queuing time (Note 6).  Again, the scaling
1407	      parameter is expressed as a power of 2 so that division can be
1408	      implemented as a right bit-shift in line 9 of the integer
1409	      pseudocode.
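
   The Classic branch described above (lines 8a to 9b of Figure 8) can
   be sketched in Python as follows.  The function name and the
   injectable rand parameter are hypothetical; S_C = -1 and f_C = 5 are
   the values recommended later in this appendix.

```python
import random


def classic_should_drop(q_c, pkt_delay, f_c, s_c, u=1, rand=random.random):
    """Return (drop?, updated Q_C) for one Classic packet.

    q_c and pkt_delay are queuing times in seconds.
    """
    alpha = 2.0 ** (-f_c)                             # line 8a
    q_c = alpha * pkt_delay + (1 - alpha) * q_c       # line 8b: EWMA
    sqrt_p_c = q_c / 2.0 ** s_c                       # line 9a
    maxr = max(rand() for _ in range(2 * u))          # maxrand(2*U)
    return sqrt_p_c > maxr, q_c                       # line 9b


# With S_C = -1, a smoothed delay of 100 ms gives sqrt_p_C = 0.1 / 0.5 = 0.2.
drop, q_c = classic_should_drop(0.1, 0.1, f_c=5, s_c=-1, rand=lambda: 0.1)
print(drop)   # True: both random numbers (0.1) are below 0.2
drop, q_c = classic_should_drop(0.1, 0.1, f_c=5, s_c=-1, rand=lambda: 0.3)
print(drop)   # False: the maximum (0.3) is above 0.2
```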

1411	   The marking/dropping functions in each queue (lines 3 & 9) are two
1412	   cases of a new generalization of RED called Curvy RED, motivated as
1413	   follows.  When we compared the performance of our AQM with fq_CoDel
1414	   and PIE, we came to the conclusion that their goal of holding queuing
1415	   delay to a fixed target is misguided [CRED_Insights].  As the number
1416	   of flows increases, if the AQM does not allow TCP to increase queuing
1417	   delay, it has to introduce abnormally high levels of loss.  Then loss
1418	   rather than queuing becomes the dominant cause of delay for short
1419	   flows, due to timeouts and tail losses.

1421	   Curvy RED constrains delay with a softened target that allows some
1422	   increase in delay as load increases.  This is achieved by increasing
1423	   drop probability on a convex curve relative to queue growth (the
1424	   square curve in the Classic queue, if U=1).  Like RED, the curve hugs
1425	   the zero axis while the queue is shallow.  Then, as load increases,
1426	   it introduces a growing barrier to higher delay.  But, unlike RED, it
1427	   requires only one parameter, the scaling, not three.  The
1428	   disadvantage of Curvy RED is that it is not adapted to a wide range
1429	   of RTTs.  Curvy RED can be used as is when the RTT range to support
1430	   is limited; otherwise, an adaptation mechanism is required.

1432	   There follows a summary listing of the two parameters used for each
1433	   of the two queues:

1435	   Classic:

1437	      S_C :   The scaling factor of the dropping function scales Classic
1438	         queuing times in the range [0, 2^(S_C)] seconds into a dropping
1439	         probability in the range [0,1].  To make division efficient, it
1440	         is constrained to be an integer power of two;

1442	      f_C :  To smooth the queuing time of the Classic queue and make
1443	         multiplication efficient, we use a negative integer power of
1444	         two for the dimensionless EWMA constant, which we define as
1445	         alpha = 2^(-f_C).

1447	   L4S :

1449	      S_L (and k'):   As for the Classic queue, the scaling factor of
1450	         the L4S marking function scales Classic queueing times in the
1451	         range [0, 2^(S_L)] seconds into a probability in the range
1452	         [0,1].  Note that S_L = S_C + k', where k' is the coupling
1453	         between the queues.  So S_L and k' count as only one parameter;
1454	         k' is related to k in Equation (1) (Section 2.1) by k=2^k',
1455	         where both k and k' are constants.  Then implementations can
1456	         avoid costly division by shifting p_L by k' bits to the right.

1458	      T :  The queue size in bytes at which step threshold marking
1459	         starts in the L4S queue.

1461	   {ToDo: These are the raw parameters used within the algorithm.  A
1462	   configuration front-end could accept more meaningful parameters and
1463	   convert them into these raw parameters.}

1465	   From our experiments so far, recommended values for these parameters
1466	   are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs
1467	   typical on the public Internet.  [CRED_Insights] explains why these
1468	   parameters apply whatever the rate of the link on which this AQM is
1469	   deployed and how the parameters would need to be adjusted for a
1470	   scenario with a different range of RTTs (e.g. a data centre) {ToDo
1471	   incorporate a summary of that report into this draft}.  The setting
1472	   of k depends on policy (see Section 2.5 and Appendix C respectively
1473	   for its recommended setting and guidance on alternatives).

1475	   There is also a cUrviness parameter, U, which is a small positive
1476	   integer.  It is likely to take the same hard-coded value for all
1477	   implementations, once experiments have determined a good value.  We
1478	   have solely used U=1 in our experiments so far, but results might be
1479	   even better with U=2 or higher.

1481	   Note that the dropping function at line 9 calls maxrand(2*U), which
1482	   gives twice as much curviness as the call to maxrand(U) in the
1483	   marking function at line 3.  This is the trick that implements the
1484	   square rule in Equation (1) (Section 2.1).  This is based on the fact
1485	   that, given a number X from 1 to 6, the probability that two dice
1486	   throws will both be less than X is the square of the probability that
1487	   one throw will be less than X.  So, when U=1, the L4S marking
1488	   function is linear and the Classic dropping function is squared.  If
1489	   U=2, L4S would be a square function and Classic would be quartic.
1490	   And so on.
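
   The dice argument can be checked empirically.  The Python sketch
   below is a simple Monte Carlo estimate, not part of the algorithm:
   the hit_rate() helper, the fixed seed and the trial count are
   choices made for this illustration.  It compares a fixed value
   x = 0.3 against the maximum of one and of two random numbers.

```python
import random


def maxrand(u, rng):
    """Return the maximum of u uniform random numbers in [0, 1)."""
    return max(rng.random() for _ in range(u))


def hit_rate(x, u, trials, rng):
    """Fraction of trials in which x exceeds maxrand(u)."""
    return sum(x > maxrand(u, rng) for _ in range(trials)) / trials


rng = random.Random(1)                       # fixed seed for repeatability
linear = hit_rate(0.3, 1, 200_000, rng)      # expected to be near 0.3
squared = hit_rate(0.3, 2, 200_000, rng)     # expected to be near 0.3^2
print(round(linear, 2), round(squared, 2))
```

   With u=1 the hit rate tracks x itself (linear marking), and with u=2
   it tracks x squared, which is the relationship the Classic dropping
   function relies on when U=1.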

1492	   The maxrand(u) function in lines 16-21 simply generates u random
1493	   numbers and returns the maximum (Note 7).  Typically, maxrand(u)
1494	   could be run in parallel out of band.  For instance, if U=1, the
1495	   Classic queue would require the maximum of two random numbers.  So,
1496	   instead of calling maxrand(2*U) in-band, the maximum of every pair of
1497	   values from a pseudorandom number generator could be generated out-
1498	   of-band, and held in a buffer ready for the Classic queue to consume.

1500	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1501	   2:     if ( lq.dequeue(pkt) ) {
1502	   3:        if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U)))
1503	   4:           mark(pkt)
1504	   5:        return(pkt)              % return the packet and stop here
1505	   6:     }
1506	   7:     while ( cq.dequeue(pkt) ) {
1507	   8:         Q_C += (pkt.ns() - Q_C) >> f_C           % Classic Q EWMA
1508	   9:        if ( (Q_C >> (S_C-2) ) > maxrand(2*U) )
1509	   10:          drop(pkt)                     % Squared drop, redo loop
1510	   11:       else
1511	   12:          return(pkt)           % return the packet and stop here
1512	   13:    }
1513	   14:    return(NULL)                           % no packet to dequeue
1514	   15: }

1516	   Figure 9: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM
1517	                         using Integer Arithmetic

1519	   Notes:

1521	   1.  The drain rate of the queue can vary if it is scheduled relative
1522	       to other queues, or to cater for fluctuations in a wireless
1523	       medium.  To auto-adjust to changes in drain rate, the queue must
1524	       be measured in time, not bytes or packets [CoDel].  In our Linux
1525	       implementation, it was easiest to measure queuing time at
1526	       dequeue.  Queuing time can be estimated when a packet is enqueued
1527	       by measuring the queue length in bytes and dividing by the recent
1528	       drain rate.

1530	   2.  An implementation has to use priority queueing, but it need not
1531	       implement strict priority.

1533	   3.  If packets can be enqueued while processing dequeue code, an
1534	       implementer might prefer to place the while loop around both
1535	       queues so that it goes back to test again whether any L4S packets
1536	       arrived while it was dropping a Classic packet.

1538	   4.  In order not to change too many factors at once, for now, we keep
1539	       the marking function for DCTCP-only traffic as similar as
1540	       possible to DCTCP.  However, unlike DCTCP, all processing is at
1541	       dequeue, so we determine whether to mark a packet at the head of
1542	       the queue by the byte-length of the queue _behind_ it.  We plan
1543	       to test whether using queuing time will work in all
1544	       circumstances, and if we find that the step can cause
1545	       oscillations, we will investigate replacing it with a steep
1546	       random marking curve.

1548	   5.  An EWMA is only one possible way to filter bursts; other more
1549	       adaptive smoothing methods could be valid and it might be
1550	       appropriate to decrease the EWMA faster than it increases.

1552	   6.  In practice at line 10 the Classic queue would probably test for
1553	       ECN capability on the packet to determine whether to drop or mark
1554	       the packet.  However, for brevity such detail is omitted.  All
1555	       packets classified into the L4S queue have to be ECN-capable, so
1556	       no dropping logic is necessary at line 3.  Nonetheless, L4S
1557	       packets could be dropped by overload code (see Section 4.1).

1559	   7.  In the integer variant of the pseudocode (Figure 9) real numbers
1560	       are all represented as integers scaled up by 2^32.  In lines 3 &
1561	       9 the function maxrand() is arranged to return an integer in the
1562	       range 0 <= maxrand() < 2^32.  Queuing times are also scaled up by
1563	       2^32, but in two stages: i) In lines 3 and 8 queuing times
1564	       cq.ns() and pkt.ns() are returned in integer nanoseconds, making
1565	       the values about 2^30 times larger than when the units were
1566	       seconds, ii) then in lines 3 and 9 an adjustment of -2 to the
1567	       right bit-shift multiplies the result by 2^2, to complete the
1568	       scaling by 2^32.
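
   The two-stage scaling in Note 7 can be illustrated with a short
   Python sketch.  The helper names are hypothetical.  One point is an
   assumption rather than something stated in this document: with the
   recommended S_C = -1, the shift count S_C-2 is negative, which
   neither C nor Python accepts for a right shift, so the sketch treats
   a negative count as a left shift.

```python
SCALE = 2 ** 32


def shift_right(x, n):
    """x >> n, treating a negative n as a left shift (an assumption)."""
    return x >> n if n >= 0 else x << -n


def scaled_delay_real(seconds, s):
    """Real-number form: (seconds / 2^s), scaled up by 2^32."""
    return seconds / 2.0 ** s * SCALE


def scaled_delay_int(ns, s):
    """Integer form used in Figure 9: nanoseconds shifted by (s - 2)."""
    return shift_right(ns, s - 2)


exact = scaled_delay_real(0.25, -1)          # 0.5 * 2^32
approx = scaled_delay_int(250_000_000, -1)   # 250 ms in ns, shifted left by 3
ratio = approx / exact                       # 10^9 / 2^30, about 0.93
print(exact, approx, ratio)
```

   The ratio is not exactly 1 because a second contains 10^9
   nanoseconds rather than 2^30, which is the approximation Note 7
   refers to as "about 2^30".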

1570	Appendix C.  Guidance on Controlling Throughput Equivalence

1572	                     +---------------+------+-------+
1573	                     | RTT_C / RTT_L | Reno | Cubic |
1574	                     +---------------+------+-------+
1575	                     |             1 | k'=1 | k'=0  |
1576	                     |             2 | k'=2 | k'=1  |
1577	                     |             3 | k'=2 | k'=2  |
1578	                     |             4 | k'=3 | k'=2  |
1579	                     |             5 | k'=3 | k'=3  |
1580	                     +---------------+------+-------+

1582	    Table 1: Value of k' for which DCTCP throughput is roughly the same
1583	               as Reno or Cubic, for some example RTT ratios

1585	   k' is related to k in Equation (1) (Section 2.1) by k=2^k'.

1587	   To determine the appropriate policy, the operator first has to judge
1588	   whether it wants DCTCP flows to have roughly equal throughput with
1589	   Reno or with Cubic (because, even in its Reno-compatibility mode,
1590	   Cubic is about 1.4 times more aggressive than Reno).  Then the
1591	   operator needs to decide at what ratio of RTTs it wants DCTCP and
1592	   Classic flows to have roughly equal throughput.  For example choosing
1593	   k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same
1594	   as Cubic, _if their RTTs are the same_.

1596	   However, even if the base RTTs are the same, the actual RTTs are
1597	   unlikely to be the same, because Classic (Cubic or Reno) traffic
1598	   needs a large queue to avoid under-utilization and excess drop,
1599	   whereas L4S (DCTCP) does not.  The operator might still choose this
1600	   policy if it judges that DCTCP throughput should be rewarded for
1601	   keeping its own queue short.

1603	   On the other hand, the operator will choose one of the higher values
1604	   for k', if it wants to slow DCTCP down to roughly the same throughput
1605	   as Classic flows, to compensate for Classic flows slowing themselves
1606	   down by causing themselves extra queuing delay.

1608	   The values for k' in the table are derived from the following
1609	   formulae, which were developed in [DCttH15]:

1611	       2^k' = 1.64 (RTT_reno / RTT_dc)                  (2)
1612	       2^k' = 1.19 (RTT_cubic / RTT_dc )                (3)
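
   As a cross-check, the following Python sketch recomputes Table 1
   from formulae (2) and (3).  The function name is hypothetical, and
   rounding k' to the nearest integer is an assumption; this document
   does not say how the table values were rounded, although this choice
   does reproduce them.

```python
import math


def k_prime(rtt_ratio, factor):
    """Solve 2^k' = factor * rtt_ratio for k', rounded to the nearest
    integer (the rounding rule is an assumption of this sketch)."""
    return round(math.log2(factor * rtt_ratio))


RENO_FACTOR = 1.64    # from formula (2)
CUBIC_FACTOR = 1.19   # from formula (3)

for ratio in range(1, 6):
    print(ratio, k_prime(ratio, RENO_FACTOR), k_prime(ratio, CUBIC_FACTOR))
```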

1614	   For localized traffic from a particular ISP's data centre, we used
1615	   the measured RTTs to calculate that a value of k'=3 (equivalent to
1616	   k=8) would achieve throughput equivalence, and our experiments
1617	   verified the formula very closely.

1619	   For a typical mix of RTTs from local data centres and across the
1620	   general Internet, a value of k'=1 (equivalent to k=2) is recommended
1621	   as a good workable compromise.

1623	Appendix D.  Open Issues

1625	   Most of the following open issues are also tagged '{ToDo}' at the
1626	   appropriate point in the document:

1628	      Operational guidance to monitor L4S experiment

1630	      Interaction between Diffserv & L4S

1632	      Define additional classifier flexibility more clearly

1633	      PI2 appendix: scaling of alpha & beta, esp. dependence of beta_U
1634	      on Tupdate

1636	      Curvy RED appendix: complete the unfinished parts

1638	Authors' Addresses

1640	   Koen De Schepper
1641	   Nokia Bell Labs
1642	   Antwerp
1643	   Belgium

1645	   Email: koen.de_schepper@nokia.com
1646	   URI:   https://www.bell-labs.com/usr/koen.de_schepper

1648	   Bob Briscoe (editor)
1649	   CableLabs
1650	   UK

1652	   Email: ietf@bobbriscoe.net
1653	   URI:   http://bobbriscoe.net/

1655	   Olga Bondarenko
1656	   Simula Research Lab
1657	   Lysaker
1658	   Norway

1660	   Email: olgabnd@gmail.com
1661	   URI:   https://www.simula.no/people/olgabo

1663	   Ing-jyh Tsang
1664	   Nokia
1665	   Antwerp
1666	   Belgium

1668	   Email: ing-jyh.tsang@nokia.com