idnits 2.17.1 

draft-ietf-tsvwg-aqm-dualq-coupled-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 2, 2018) is 2124 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '0' on line 1528

  -- Looks like a reference, but probably isn't: '1' on line 1528

  == Outdated reference: A later version (-02) exists of
     draft-briscoe-tsvwg-l4s-diffserv-00

  == Outdated reference: A later version (-29) exists of
     draft-ietf-tsvwg-ecn-l4s-id-02

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-02

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)

  -- Obsolete informational reference (is this intentional?): RFC 8312
     (Obsoleted by RFC 9438)


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Transport Area working group (tsvwg)                      K. De Schepper
3	Internet-Draft                                           Nokia Bell Labs
4	Intended status: Experimental                            B. Briscoe, Ed.
5	Expires: January 3, 2019                                       CableLabs
6	                                                           O. Bondarenko
7	                                                     Simula Research Lab
8	                                                                I. Tsang
9	                                                                   Nokia
10	                                                            July 2, 2018

12	  DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput
13	                                 (L4S)
14	                 draft-ietf-tsvwg-aqm-dualq-coupled-05

16	Abstract

18	   Data Centre TCP (DCTCP) was designed to provide predictably low
19	   queuing latency, near-zero loss, and throughput scalability using
20	   explicit congestion notification (ECN) and an extremely simple
21	   marking behaviour on switches.  However, DCTCP does not co-exist with
22	   existing TCP traffic: DCTCP is so aggressive that existing TCP
23	   algorithms approach starvation.  So, until now, DCTCP could only be
24	   deployed where a clean-slate environment could be arranged, such as
25	   in private data centres.  This specification defines `DualQ Coupled
26	   Active Queue Management (AQM)' to allow scalable congestion controls
27	   like DCTCP to safely co-exist with classic Internet traffic.  The
28	   Coupled AQM ensures that a flow runs at about the same rate whether
29	   it uses DCTCP or TCP Reno/Cubic, but without inspecting transport
30	   layer flow identifiers.  When tested in a residential broadband
31	   setting, DCTCP achieved sub-millisecond average queuing delay and
32	   zero congestion loss under a wide range of mixes of DCTCP and
33	   `Classic' broadband Internet traffic, without compromising the
34	   performance of the Classic traffic.  The solution also reduces
35	   network complexity and eliminates network configuration.

37	Status of This Memo

39	   This Internet-Draft is submitted in full conformance with the
40	   provisions of BCP 78 and BCP 79.

42	   Internet-Drafts are working documents of the Internet Engineering
43	   Task Force (IETF).  Note that other groups may also distribute
44	   working documents as Internet-Drafts.  The list of current Internet-
45	   Drafts is at https://datatracker.ietf.org/drafts/current/.

47	   Internet-Drafts are draft documents valid for a maximum of six months
48	   and may be updated, replaced, or obsoleted by other documents at any
49	   time.  It is inappropriate to use Internet-Drafts as reference
50	   material or to cite them other than as "work in progress."

52	   This Internet-Draft will expire on January 3, 2019.

54	Copyright Notice

56	   Copyright (c) 2018 IETF Trust and the persons identified as the
57	   document authors.  All rights reserved.

59	   This document is subject to BCP 78 and the IETF Trust's Legal
60	   Provisions Relating to IETF Documents
61	   (https://trustee.ietf.org/license-info) in effect on the date of
62	   publication of this document.  Please review these documents
63	   carefully, as they describe your rights and restrictions with respect
64	   to this document.  Code Components extracted from this document must
65	   include Simplified BSD License text as described in Section 4.e of
66	   the Trust Legal Provisions and are provided without warranty as
67	   described in the Simplified BSD License.

69	Table of Contents

71	   1.  Introduction
72	     1.1.  Problem and Scope
73	     1.2.  Terminology
74	     1.3.  Features
75	   2.  DualQ Coupled AQM
76	     2.1.  Coupled AQM
77	     2.2.  Dual Queue
78	     2.3.  Traffic Classification
79	     2.4.  Overall DualQ Coupled AQM Structure
80	     2.5.  Normative Requirements for a DualQ Coupled AQM
81	       2.5.1.  Functional Requirements
82	         2.5.1.1.  Requirements in Unexpected Cases
83	       2.5.2.  Management Requirements
84	   3.  IANA Considerations
85	   4.  Security Considerations
86	     4.1.  Overload Handling
87	       4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput
88	               or Delay?
89	       4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or
90	               Delay?
91	       4.1.3.  Protecting against Unresponsive ECN-Capable Traffic
92	   5.  Acknowledgements
93	   6.  References
94	     6.1.  Normative References
95	     6.2.  Informative References
96	   Appendix A.  Example DualQ Coupled PI2 Algorithm
97	     A.1.  Pass #1: Core Concepts
98	     A.2.  Pass #2: Overload Details
99	   Appendix B.  Example DualQ Coupled Curvy RED Algorithm
100	   Appendix C.  Guidance on Controlling Throughput Equivalence
101	   Appendix D.  Open Issues
102	   Authors' Addresses

104	1.  Introduction

106	1.1.  Problem and Scope

108	   Latency is becoming the critical performance factor for many (most?)
109	   applications on the public Internet, e.g. interactive Web, Web
110	   services, voice, conversational video, interactive video, interactive
111	   remote presence, instant messaging, online gaming, remote desktop,
112	   cloud-based applications, and video-assisted remote control of
113	   machinery and industrial processes.  In the developed world, further
114	   increases in access network bit-rate offer diminishing returns,
115	   whereas latency is still a multi-faceted problem.  In the last decade
116	   or so, much has been done to reduce propagation time by placing
117	   caches or servers closer to users.  However, queuing remains a major
118	   component of latency.

120	   The Diffserv architecture provides Expedited Forwarding [RFC3246], so
121	   that low latency traffic can jump the queue of other traffic.
122	   However, on access links dedicated to individual sites (homes, small
123	   enterprises or mobile devices), often all traffic at any one time
124	   will be latency-sensitive and, if all the traffic on a link is marked
125	   as EF, Diffserv cannot reduce the delay of any of it.  In contrast,
126	   the Low Latency Low Loss Scalable throughput (L4S) approach removes
127	   the causes of any unnecessary queuing delay.

129	   The bufferbloat project has shown that excessively-large buffering
130	   (`bufferbloat') has been introducing significantly more delay than
131	   the underlying propagation time.  These delays appear only
132	   intermittently--only when a capacity-seeking (e.g.  TCP) flow is long
133	   enough for the queue to fill the buffer, making every packet in other
134	   flows sharing the buffer sit through the queue.

136	   Active queue management (AQM) was originally developed to solve this
137	   problem (and others).  Unlike Diffserv, which gives low latency to
138	   some traffic at the expense of others, AQM controls latency for _all_
139	   traffic in a class.  In general, AQMs introduce an increasing level
140	   of discard from the buffer the longer the queue persists above a
141	   shallow threshold.  This gives sufficient signals to capacity-seeking
142	   (a.k.a. greedy) flows to keep the buffer empty for its intended
143	   purpose: absorbing bursts.  However, RED [RFC2309] and other
144	   algorithms from the 1990s were sensitive to their configuration and
145	   hard to set correctly.  So, AQM was not widely deployed.

147	   More recent state-of-the-art AQMs, e.g. fq_CoDel [RFC8290],
148	   PIE [RFC8033], Adaptive RED [ARED01], are easier to configure,
149	   because they define the queuing threshold in time not bytes, so it is
150	   invariant for different link rates.  However, no matter how good the
151	   AQM, the sawtoothing rate of TCP will either cause queuing delay to
152	   vary or cause the link to be under-utilized.  Even with a perfectly
153	   tuned AQM, the additional queuing delay will be of the same order as
154	   the underlying speed-of-light delay across the network.  Flow-queuing
155	   can isolate one flow from another, but it cannot isolate a TCP flow
156	   from the delay variations it inflicts on itself, and it has other
157	   problems - it overrides the flow rate decisions of variable rate
158	   video applications, it does not recognise the flows within IPSec VPN
159	   tunnels and it is relatively expensive to implement.

161	   It seems that further changes to the network alone will now yield
162	   diminishing returns.  Data Centre TCP (DCTCP [RFC8257]) teaches us
163	   that a small but radical change to TCP is needed to cut two major
164	   outstanding causes of queuing delay variability:

166	   1.  the `sawtooth' varying rate of TCP itself;

168	   2.  the smoothing delay deliberately introduced into AQMs to permit
169	       bursts without triggering losses.

171	   The former causes a flow's round trip time (RTT) to vary from about 1
172	   to 2 times the base RTT between the machines in question.  The latter
173	   delays the system's response to change by a worst-case
174	   (transcontinental) RTT, which could be hundreds of times the actual
175	   RTT of typical traffic from localized CDNs.

177	   Latency is not our only concern:

179	   3.  It was known when TCP was first developed that it would not scale
180	       to high bandwidth-delay products.

182	   Given regular broadband bit-rates over WAN distances are
183	   already [RFC3649] beyond the scaling range of `classic' TCP Reno,
184	   `less unscalable' Cubic [RFC8312] and
185	   Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been
186	   successfully deployed.  However, these are now approaching their
187	   scaling limits.  Unfortunately, fully scalable TCPs such as DCTCP
188	   cause `classic' TCP to starve itself, which is why they have been
189	   confined to private data centres or research testbeds (until now).

191	   This document specifies a `DualQ Coupled AQM' extension that solves
192	   the problem of coexistence between scalable and classic flows,
193	   without having to inspect flow identifiers.  The AQM is not like
194	   flow-queuing approaches [RFC8290] that classify packets by flow
195	   identifier into numerous separate queues in order to isolate sparse
196	   flows from the higher latency in the queues assigned to heavier flows.
197	   In contrast, the AQM exploits the behaviour of scalable congestion
198	   controls like DCTCP so that every packet in every flow sharing the
199	   queue for DCTCP-like traffic can be served with very low latency.

201	   This AQM extension can be combined with any single queue AQM that
202	   generates a statistical or deterministic mark/drop probability driven
203	   by the queue dynamics.  In many cases it simplifies the basic control
204	   algorithm, and requires little extra processing.  Therefore it is
205	   believed the Coupled AQM would be applicable and easy to deploy in
206	   all types of buffers: buffers in cost-reduced mass-market residential
207	   equipment; buffers in end-system stacks; buffers in carrier-scale
208	   equipment including remote access servers, routers, firewalls and
209	   Ethernet switches; buffers in network interface cards; buffers in
210	   virtualized network appliances, hypervisors, and so on.

212	   The overall L4S architecture is described in
213	   [I-D.ietf-tsvwg-l4s-arch].  The supporting papers [PI2] and [DCttH15]
214	   give the full rationale for the AQM's design, both discursively and
215	   in more precise mathematical form.

217	1.2.  Terminology

219	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
220	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
221	   document are to be interpreted as described in [RFC2119].  In this
222	   document, these words will appear with that interpretation only when
223	   in ALL CAPS.  Lower case uses of these words are not to be
224	   interpreted as carrying RFC-2119 significance.

226	   The DualQ Coupled AQM uses two queues for two services.  Each of the
227	   following terms identifies both the service and the queue that
228	   provides the service:

230	   Classic (denoted by subscript C):  The `Classic' service is intended
231	      for all the behaviours that currently co-exist with TCP Reno (TCP
232	      Cubic, Compound, SCTP, etc).

234	   Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L):
235	      The `L4S' service is intended for a set of congestion controls
236	      with scalable properties such as DCTCP (e.g.
237	      Relentless [Mathis09]).

239	   Either service can cope with a proportion of unresponsive or less-
240	   responsive traffic as well (e.g.  DNS, VoIP, etc), just as a single
241	   queue AQM can.  The DualQ Coupled AQM behaviour is similar to a
242	   single FIFO queue with respect to unresponsive and overload traffic.

244	1.3.  Features

246	   The AQM couples marking and/or dropping across the two queues such
247	   that a flow will get roughly the same throughput whichever it uses.
248	   Therefore both queues can feed into the full capacity of a link and
249	   no rates need to be configured for the queues.  The L4S queue enables
250	   scalable congestion controls like DCTCP to give stunningly low and
251	   predictably low latency, without compromising the performance of
252	   competing 'Classic' Internet traffic.  Thousands of tests have been
253	   conducted in a typical fixed residential broadband setting.  Typical
254	   experiments used base round trip delays up to 100ms between the data
255	   centre and home network, and large amounts of background traffic in
256	   both queues.  For every L4S packet, the AQM kept the average queuing
257	   delay below 1ms (or 2 packets if serialization delay is bigger for
258	   slow links), and no losses at all were introduced by the AQM.
259	   Details of the extensive experiments will be made available [PI2]
260	   [DCttH15].

262	   Subjective testing was also conducted using a demanding panoramic
263	   interactive video application run over a stack with DCTCP enabled and
264	   deployed on the testbed.  Each user could pan or zoom their own high
265	   definition (HD) sub-window of a larger video scene from a football
266	   match.  Even though the user was also downloading large amounts of
267	   L4S and Classic data, latency was so low that the picture appeared to
268	   stick to their finger on the touchpad (all the L4S data achieved the
269	   same ultra-low latency).  With an alternative AQM, the video
270	   noticeably lagged behind the finger gestures.

272	   Unlike Diffserv Expedited Forwarding, the L4S queue does not have to
273	   be limited to a small proportion of the link capacity in order to
274	   achieve low delay.  The L4S queue can be filled with a heavy load of
275	   capacity-seeking flows like DCTCP and still achieve low delay.  The
276	   L4S queue does not rely on the presence of other traffic in the
277	   Classic queue that can be 'overtaken'.  It gives low latency to L4S
278	   traffic whether or not there is Classic traffic, and the latency of
279	   Classic traffic does not suffer when a proportion of the traffic is
280	   L4S.  The two queues are only necessary because DCTCP-like flows
281	   cannot keep latency predictably low and keep utilization high if they
282	   are mixed with legacy TCP flows.

284	   The experiments used the Linux implementation of DCTCP that is
285	   deployed in private data centres, without any modification despite
286	   its known deficiencies.  Nonetheless, certain modifications will be
287	   necessary before DCTCP is safe to use on the Internet, which are
288	   recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id].  However, the
289	   focus of this specification is to get the network service in place.
290	   Then, without any management intervention, applications can exploit
291	   it by migrating to scalable controls like DCTCP, which can then
292	   evolve _while_ their benefits are being enjoyed by everyone on the
293	   Internet.

295	2.  DualQ Coupled AQM

297	   There are two main aspects to the approach:

299	   o  the Coupled AQM that addresses throughput equivalence between
300	      Classic (e.g.  Reno, Cubic) flows and L4S (e.g.  DCTCP) flows

302	   o  the Dual Queue structure that provides latency separation for L4S
303	      flows to isolate them from the typically large Classic queue.

305	2.1.  Coupled AQM

307	   In the 1990s, the `TCP formula' was derived for the relationship
308	   between TCP's congestion window, cwnd, and its drop probability, p.
309	   To a first order approximation, cwnd of TCP Reno is inversely
310	   proportional to the square root of p.

312	   TCP Cubic implements a Reno-compatibility mode, which is the only
313	   relevant mode for typical RTTs under 20ms as long as the throughput
314	   of a single flow is less than about 500Mb/s.  Therefore it can be
315	   assumed that Cubic traffic behaves similarly to Reno (but with a
316	   slightly different constant of proportionality), and the term
317	   'Classic' will be used for the collection of Reno-friendly traffic
318	   including Cubic in Reno mode.

320	   The supporting paper [PI2] includes the derivation of the equivalent
321	   rate equation for DCTCP, for which cwnd is inversely proportional to
322	   p (not the square root), where in this case p is the ECN marking
323	   probability.  DCTCP is not the only congestion control that behaves
324	   like this, so the term 'L4S' traffic will be used for all similar
325	   behaviour.

327	   In order to make a DCTCP flow run at roughly the same rate as a Reno
328	   TCP flow (all other factors being equal), the drop or marking
329	   probability for Classic traffic, p_C has to be distinct from the
330	   marking probability for L4S traffic, p_L (in contrast to RFC3168
331	   which requires them to be the same).  It is necessary to make the
332	   Classic drop probability p_C proportional to the square of the L4S
333	   marking probability p_L.  This makes the Reno flow rate roughly equal
334	   the DCTCP flow rate, because it squares the square root of p_C in the
335	   Reno rate equation to make it proportional to the straight p_L in the
336	   DCTCP rate equation.

338	   Stating this as a formula, the relation between Classic drop
339	   probability, p_C, and L4S marking probability, p_L, needs to take the
340	   form:

342	       p_C = ( p_L / k )^2                  (1)

344	   where k is the constant of proportionality.
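
As a minimal sketch of equation (1), the following Python computes the Classic drop probability from the L4S marking probability and compares the resulting steady-state windows. The function names are hypothetical, the default k = 2 is only an illustrative choice, and the window constants 1.22 (Reno) and 2 (DCTCP) are commonly quoted approximations assumed here for illustration; they are not taken from this section.

```python
import math


def classic_drop_prob(p_l: float, k: float = 2.0) -> float:
    """Equation (1): p_C = (p_L / k)^2, capped at 1."""
    if not 0.0 <= p_l <= 1.0:
        raise ValueError("p_l must be a probability in [0, 1]")
    return min((p_l / k) ** 2, 1.0)


def reno_window(p_c: float) -> float:
    """Assumed approximation: Reno cwnd is about 1.22 / sqrt(p_C)."""
    return 1.22 / math.sqrt(p_c)


def dctcp_window(p_l: float) -> float:
    """Assumed approximation: DCTCP cwnd is about 2 / p_L."""
    return 2.0 / p_l


if __name__ == "__main__":
    p_l = 0.10                      # 10% ECN marking on L4S traffic
    p_c = classic_drop_prob(p_l)    # squared and scaled for Classic traffic
    # Both windows scale as 1/p_L, so their ratio is independent of p_L.
    print(f"p_C = {p_c:.4f}")
    print(f"Reno cwnd  ~ {reno_window(p_c):.1f} packets")
    print(f"DCTCP cwnd ~ {dctcp_window(p_l):.1f} packets")
```

Under these assumed constants the two windows differ only by a fixed ratio that does not depend on the level of congestion, which is the property the squaring is meant to provide.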

346	2.2.  Dual Queue

348	   Classic traffic typically builds a large queue to prevent under-
349	   utilization.  Therefore a separate queue is provided for L4S traffic,
350	   and it is scheduled with priority over Classic.  Priority is
351	   conditional to prevent starvation of Classic traffic.

353	   Nonetheless, coupled marking ensures that giving priority to L4S
354	   traffic still leaves the right amount of spare scheduling time for
355	   Classic flows to each get equivalent throughput to DCTCP flows (all
356	   other factors such as RTT being equal).  The algorithm achieves this
357	   without having to inspect flow identifiers.

359	2.3.  Traffic Classification

361	   Both the Coupled AQM and DualQ mechanisms need an identifier to
362	   distinguish L and C packets.  A separate draft
363	   [I-D.ietf-tsvwg-ecn-l4s-id] recommends using the ECT(1) codepoint of
364	   the ECN field as this identifier, having assessed various
365	   alternatives.  An additional process document has proved necessary to
366	   make the ECT(1) codepoint available for experimentation [RFC8311].

368	   For policy reasons, an operator might choose to steer certain packets
369	   (e.g. from certain flows or with certain addresses) out of the L
370	   queue, even though they identify themselves as L4S by their ECN
371	   codepoints.  In such cases, the classifier MUST NOT alter the ECN
372	   field, so that it is preserved end-to-end.  The aim is that each
373	   operator can choose how it treats L4S traffic locally, but an
374	   individual operator does not alter the identification of L4S packets,
375	   which would prevent other operators downstream from making their own
376	   choices on how to treat L4S traffic.

378	   In addition, other identifiers could be used to classify certain
379	   additional packet types into the L queue, that are deemed not to risk
380	   harming the L4S service.  For instance addresses of specific
381	   applications or hosts (see [I-D.ietf-tsvwg-ecn-l4s-id]), specific
382	   Diffserv codepoints such as EF (Expedited Forwarding) and Voice-Admit
383	   service classes (see [I-D.briscoe-tsvwg-l4s-diffserv]) or certain
384	   protocols (e.g.  ARP, DNS).

386	   Note that the DualQ Coupled AQM only reads these classifiers; it MUST
387	   NOT re-mark or alter these identifiers (except for marking the ECN
388	   field with the CE codepoint - with increasing frequency to indicate
389	   increasing congestion).
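
The default classification described in this section can be sketched as follows. This is an illustrative Python fragment, not part of the specification: the function name is hypothetical, operator-specific policy overrides are omitted, and sending CE-marked packets to the L queue is an assumption based on the ECN field expected for that queue. The codepoint values are those of the two-bit ECN field in the IP header.

```python
# Two-bit ECN field values (low-order bits of the IPv4 TOS / IPv6 Traffic Class octet).
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def classify(traffic_class_octet: int) -> str:
    """Return "L" or "C" for a packet, reading but never altering the ECN field."""
    ecn = traffic_class_octet & 0x03   # mask off the Diffserv codepoint bits
    if ecn in (ECT_1, CE):             # assumption: CE is treated as L4S
        return "L"
    return "C"                         # Not-ECT and ECT(0) go to the Classic queue


if __name__ == "__main__":
    for name, value in (("Not-ECT", NOT_ECT), ("ECT(1)", ECT_1),
                        ("ECT(0)", ECT_0), ("CE", CE)):
        print(name, "->", classify(value))
```

The function only inspects the octet and returns a queue label, which mirrors the requirement that the classifier leaves the identifiers untouched.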

391	2.4.  Overall DualQ Coupled AQM Structure

393	   Figure 1 shows the overall structure that any DualQ Coupled AQM is
394	   likely to have.  This schematic is intended to aid understanding of
395	   the current designs of DualQ Coupled AQMs.  However, it is not
396	   intended to preclude other innovative ways of satisfying the
397	   normative requirements in Section 2.5 that minimally define a DualQ
398	   Coupled AQM.

400	   The classifier on the left separates incoming traffic between the two
401	   queues (L and C).  Each queue has its own AQM that determines the
402	   likelihood of dropping or marking (p_L and p_C).  Nonetheless, the
403	   AQM for Classic traffic is implemented in two stages: i) a base stage
404	   that outputs an internal probability p' (pronounced p-prime); and ii)
405	   a squaring stage that outputs p_C, where

407	       p_C = (p')^2.                        (2)

409	   This allows p_L to be coupled to p_C by marking L4S traffic
410	   proportionately to the intermediate output from the first stage.
411	   Specifically, the output of the base AQM, p', is scaled linearly
412	   before it is coupled across to the L queue:

414	       p_CL = k*p',                         (3)

416	   where k is the constant coupling factor (see Appendix C) and p_CL is
417	   the output from the coupling between the C queue and the L queue.

419	   It can be seen in the following that these two transformations of p'
420	   implement the required coupling given in equation (1) earlier.
421	   Substituting for p' from equation (3) into (2):

423	      p_C = ( p_CL / k )^2.

425	   The actual L4S marking probability p_L is the maximum of the coupled
426	   output (p_CL) and the output of a native L4S AQM (p'L), shown as
427	   '(MAX)' in the schematic.  While the output of the Native L4S AQM is
428	   high (p'L > p_CL) it will dominate the way L traffic is marked.  When
429	   the native L4S AQM output is lower, the way L traffic is marked will
430	   be driven by the coupling, that is p_L = p_CL.  So, whenever the
431	   coupling is needed, as required from equation (1):

433	      p_C = ( p_L / k )^2.

435	                           _________
436	                                  | |    ,------.
437	                        L4S queue | |===>| ECN  |
438	                       ,'| _______|_|    |marker|\
439	                     <'  |         |     `------'\\
440	                      //`'         v        ^ p_L \\
441	                     //        ,-------.    |      \\
442	                    //         |Native |p'L |       \\,.
443	                   //          |  L4S  |-->(MAX)    <  |   ___
444	      ,----------.//           |  AQM  |    ^ p_CL   `\|.'Cond-`.
445	      |  IP-ECN  |/            `-------'    |          / itional \
446	   ==>|Classifier|             ,-------.  (k*p')       [ priority]==>
447	      |          |\            |  Base |    |          \scheduler/
448	      `----------'\\           |  AQM  |--->:        ,'|`-.___.-'
449	                   \\          |       |p'  |      <'  |
450	                    \\         `-------'  (p'^2)    //`'
451	                     \\            ^        |      //
452	                      \\,.         |        v p_C //
453	                      <  | _________     .------.//
454	                       `\|   |      |    | Drop |/
455	                     Classic |queue |===>|/mark |
456	                           __|______|    `------'

458	   Legend: ===> traffic flow; ---> control dependency.

460	                   Figure 1: DualQ Coupled AQM Schematic
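
The probability calculations in the schematic, equations (2) and (3) followed by the '(MAX)' step, can be sketched in Python as below. The function and parameter names are hypothetical, the default k = 2 is illustrative, and capping the coupled probability at 1 is an assumption added so that the result remains a valid probability.

```python
def coupled_probabilities(p_prime: float, p_native_l: float, k: float = 2.0):
    """Return (p_C, p_L) from the base AQM output p' and the native L4S AQM output."""
    p_c = p_prime ** 2                 # equation (2): squaring stage for Classic
    p_cl = min(k * p_prime, 1.0)       # equation (3): coupled output, capped (assumption)
    p_l = max(p_native_l, p_cl)        # '(MAX)': the larger signal drives L4S marking
    return p_c, p_l


if __name__ == "__main__":
    # Coupling dominates: the native L4S AQM is idle, so p_L = p_CL = k * p'.
    print(coupled_probabilities(p_prime=0.05, p_native_l=0.0))
    # Native AQM dominates: the L queue itself is building, so p_L = p'L.
    print(coupled_probabilities(p_prime=0.05, p_native_l=0.5))
```

When the coupling dominates, the returned pair satisfies p_C = (p_L / k)^2, which is the relation required by equation (1).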

462	   After the AQMs have applied their dropping or marking, the scheduler
463	   forwards their packets to the link, giving priority to L4S traffic.
464	   Priority has to be conditional in some way (see Section 4.1).  Simple
465	   strict priority is inappropriate because it could lead the L4S
466	   queue to starve the Classic queue.  For example, consider the case
467	   where a continually busy L4S queue blocks a DNS request in the
468	   Classic queue, arbitrarily delaying the start of a new Classic flow.
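
One possible way to make priority conditional is sketched below in Python. It is only an illustration under stated assumptions: the class name, the `max_l_burst` parameter and the rule itself (serve a waiting Classic packet after a bounded run of consecutive L4S packets) are hypothetical and are not the mechanism defined by this document, which defers the details to Section 4.1.

```python
from collections import deque


class ConditionalPriorityScheduler:
    """L4S goes first, but a waiting Classic packet is served after a bounded L4S run."""

    def __init__(self, max_l_burst: int = 16):
        self.l_queue = deque()
        self.c_queue = deque()
        self.max_l_burst = max_l_burst   # hypothetical bound on consecutive L4S packets
        self._l_run = 0                  # L4S packets sent since Classic was last served

    def enqueue(self, packet, queue: str) -> None:
        (self.l_queue if queue == "L" else self.c_queue).append(packet)

    def dequeue(self):
        # Serve Classic if L4S is empty or has used up its run of priority.
        serve_c = bool(self.c_queue) and (
            not self.l_queue or self._l_run >= self.max_l_burst)
        if serve_c:
            self._l_run = 0
            return self.c_queue.popleft()
        if self.l_queue:
            self._l_run += 1
            return self.l_queue.popleft()
        return None                      # both queues are empty


if __name__ == "__main__":
    sched = ConditionalPriorityScheduler(max_l_burst=16)
    for i in range(40):
        sched.enqueue(("L4S", i), "L")
    sched.enqueue("dns-request", "C")
    order = [sched.dequeue() for _ in range(41)]
    # The DNS request is not held back until the whole L4S backlog has drained.
    print("Classic packet sent at position", order.index("dns-request"))
```

With strict priority the Classic DNS request in this example would leave last; with the bounded run it leaves after at most `max_l_burst` L4S packets.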

470	   Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are
471	   given in Appendix A and Appendix B.  Either example AQM can be used
472	   to couple packet marking and dropping across a dual Q.

474	   DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM.
475	   Indeed, this Base AQM with just the squared output and no L4S queue
476	   can be used as a drop-in replacement for PIE [RFC8033], in which case
477	   we call it just PI2 [PI2].  PI2 is a principled simplification of PIE
478	   that is both more responsive and more stable in the face of
479	   dynamically varying load.

481	   Curvy RED is derived from RED [RFC2309], but its configuration
482	   parameters are insensitive to link rate and it requires fewer
483	   operations per packet.  However, DualPI2 is more responsive and
484	   stable over a wider range of RTTs than Curvy RED.  As a consequence,
485	   DualPI2 has attracted more development attention than Curvy RED,
486	   leaving the Curvy RED design incomplete and less fully evaluated.

488	   Both AQMs regulate their queue in units of time not bytes.  As
489	   already explained, this ensures configuration can be invariant for
490	   different drain rates.  With AQMs in a dualQ structure this is
491	   particularly important because the drain rate of each queue can vary
492	   rapidly as flows for the two queues arrive and depart, even if the
493	   combined link rate is constant.
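
The benefit of a time-based target can be illustrated with a short Python sketch. The function names and the 15 ms target are illustrative assumptions; the point is only that one delay target corresponds to very different byte thresholds as the drain rate changes.

```python
def queue_delay_s(backlog_bytes: float, drain_rate_bps: float) -> float:
    """Time needed to drain the current backlog at the given rate (bits per second)."""
    return backlog_bytes * 8 / drain_rate_bps


def equivalent_byte_threshold(target_delay_s: float, drain_rate_bps: float) -> float:
    """Backlog in bytes that corresponds to the delay target at this drain rate."""
    return target_delay_s * drain_rate_bps / 8


if __name__ == "__main__":
    target = 0.015  # assumed 15 ms delay target, the same for every rate
    for rate in (2e6, 10e6, 100e6):
        threshold = equivalent_byte_threshold(target, rate)
        print(f"{rate / 1e6:>5.0f} Mb/s -> {threshold:>9.0f} bytes")
```

A byte-based threshold would have to be retuned for each of these rates, and in a DualQ it would go stale whenever the share of the link drained by each queue shifts.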

495	   It would be possible to control the queues with other alternative
496	   AQMs, as long as the normative requirements (those expressed in
497	   capitals) in Section 2.5 are observed.

499	2.5.  Normative Requirements for a DualQ Coupled AQM

501	   The following requirements are intended to capture only the essential
502	   aspects of a DualQ Coupled AQM.  They are intended to be independent
503	   of the particular AQMs used for each queue.

505	2.5.1.  Functional Requirements

507	   In the Dual Queue, L4S packets MUST be given priority over Classic,
508	   although priority MUST be bounded in order not to starve Classic
509	   traffic.

511	   Whatever identifier is used for L4S experiments,
512	   [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on
513	   L4S traffic, relative to drop of Classic traffic.  In order to
514	   prevent starvation of Classic traffic by scalable L4S traffic, it
515	   says, "The likelihood that an AQM drops a Not-ECT Classic packet
516	   (p_C) MUST be roughly proportional to the square of the likelihood
517	   that it would have marked it if it had been an L4S packet (p_L)."  In
518	   other words, in any DualQ Coupled AQM, the power to which p_L is
519	   raised in Eqn. (1) MUST be 2.  The term 'likelihood' is used to allow
520	   for marking and dropping to be either probabilistic or deterministic.

522	   The constant of proportionality, k, in Eqn (1) determines the
523	   relative flow rates of Classic and L4S flows when the AQM concerned
524	   is the bottleneck (all other factors being equal).

526	   [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality
527	   (k) does not have to be standardised for interoperability, but a
528	   value of 2 is RECOMMENDED."

530	   Assuming scalable congestion controls for the Internet will be as
531	   aggressive as DCTCP, this will ensure their congestion window will be
532	   roughly the same as that of a standards track TCP congestion control
533	   (Reno) [RFC5681] and other so-called TCP-friendly controls, such as
534	   TCP Cubic in its TCP-friendly mode.

536	   {ToDo: The requirements for scalable congestion controls on the
537	   Internet (termed the TCP Prague requirements)
538	   [I-D.ietf-tsvwg-ecn-l4s-id] are not necessarily final.  If the
539	   aggressiveness of DCTCP is not defined as the benchmark for scalable
540	   controls on the Internet, the recommended value of k will also be
541	   subject to change.}

543	   The choice of k is a matter of operator policy, and operators MAY
544	   choose a different value using Table 1 and the guidelines in
545	   Appendix C.

547	   If multiple users share capacity at a bottleneck (e.g. in the
548	   Internet access link of a campus network), the operator's choice of k
549	   will determine capacity sharing between the flows of different users.
550	   However, on the public Internet, access network operators typically
551	   isolate customers from each other with some form of layer-2
552	   multiplexing (TDM in DOCSIS, CDMA in 3G) or layer-3 scheduling (WRR in
553	   DSL), rather than relying on TCP to share capacity between customers
554	   [RFC0970].  In such cases, the choice of k will solely affect
555	   relative flow rates within each customer's access capacity, not
556	   between customers.  Also, k will not affect relative flow rates at
557	   any times when all flows are Classic or all L4S, and it will not
558	   affect small flows.

560	2.5.1.1.  Requirements in Unexpected Cases

562	   The flexibility to allow operator-specific classifiers (Section 2.3)
563	   leads to the need to specify what the AQM in each queue ought to do
564	   with packets that do not carry the ECN field expected for that queue.
565	   It is recommended that the AQM in each queue inspects the ECN field
566	   to determine what sort of congestion notification to signal, then
567	   decides whether to apply congestion notification to this particular
568	   packet, as follows:

570	   o  If a packet that does not carry an ECT(1) or CE codepoint is
571	      classified into the L queue:

573	      *  if the packet is ECT(0), the L AQM SHOULD apply CE-marking as
574	         if the packet were ECT(1)

576	      *  if the packet is Not-ECT, the appropriate action depends on
577	         whether some other function is protecting the L queue from
578	         misbehaving flows (e.g. per-flow queue protection or policing):

580	         +  If separate queue protection is provided, the L AQM SHOULD
581	            ignore the packet and forward it unchanged, meaning it
582	            should not calculate whether to apply congestion
583	            notification and it should neither drop nor CE-mark the
584	            packet (for instance, the operator might classify EF traffic
585	            that is unresponsive to drop into the L queue, alongside
586	            responsive L4S-ECN traffic)

588	         +  if separate queue protection is not provided, the L AQM MUST
589	            apply drop using the drop probability appropriate to the C
590	            queue

592	   o  If a packet that carries an ECT(1) or CE codepoint is classified
593	      into the C queue:

595	      *  the C AQM SHOULD apply CE-marking as if the packet were ECT(0).

597	   If the DualQ Coupled AQM has detected overload, it will signal
598	   congestion solely using drop, irrespective of the ECN field.

600	   Most of the above requirements are worded as "SHOULDs", because
601	   operator-specific classifiers are for flexibility, by definition.
602	   Therefore, alternative actions might be appropriate in the operator's
603	   specific circumstances.
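
   The rules of this subsection can be sketched as a single decision
   function.  This is a minimal illustration, not a normative
   restatement: the names Ecn, Signal and signal_for are hypothetical,
   and testing for overload before anything else is an assumption about
   ordering that the text does not spell out.

```python
from enum import Enum


class Ecn(Enum):
    NOT_ECT = 0b00
    ECT1 = 0b01
    ECT0 = 0b10
    CE = 0b11


class Signal(Enum):
    NONE = "forward unchanged, no AQM decision"
    L4S_MARK = "CE-mark with the L4S marking probability"
    CLASSIC_MARK = "CE-mark with the Classic probability"
    CLASSIC_DROP = "drop with the Classic probability"
    OVERLOAD_DROP = "drop, irrespective of the ECN field"


def signal_for(queue: str, ecn: Ecn, queue_protection: bool = False,
               overload: bool = False) -> Signal:
    """Pick the kind of congestion signal for a packet in queue 'L' or 'C'."""
    if queue not in ("L", "C"):
        raise ValueError("queue must be 'L' or 'C'")
    if overload:
        # Assumption: the overload rule takes precedence over all others.
        return Signal.OVERLOAD_DROP
    if queue == "L":
        if ecn is Ecn.NOT_ECT:
            # Ignore the packet only if something else protects the L queue.
            return Signal.NONE if queue_protection else Signal.CLASSIC_DROP
        # ECT(1) and CE are expected here; ECT(0) is treated as if ECT(1).
        return Signal.L4S_MARK
    if ecn is Ecn.NOT_ECT:
        return Signal.CLASSIC_DROP
    # ECT(0) is expected here; ECT(1) and CE are treated as if ECT(0).
    return Signal.CLASSIC_MARK
```

   The function only selects the type of signal.  Whether a given
   packet is actually marked or dropped is still decided by the AQM of
   the queue concerned.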

605	2.5.2.  Management Requirements

607	   By default, a DualQ Coupled AQM SHOULD NOT need any configuration for
608	   use at a bottleneck on the public Internet [RFC7567].  The following
609	   parameters MAY be operator-configurable, e.g. to tune for non-
610	   Internet settings:

612	   o  Optional packet classifier(s) to use in addition to the ECN field
613	      (see Section 2.3);

615	   o  Expected typical RTT (a parameter for typical or target queuing
616	      delay in each queue might be configurable instead);

618	   o  Expected maximum RTT (a stability parameter that depends on
619	      maximum RTT might be configurable instead);

621	   o  Coupling factor, k;

623	   o  The limit to the conditional priority of L4S (scheduler-dependent,
624	      e.g. the scheduler weight for WRR, or the time-shift for time-
625	      shifted FIFO);

627	   o  The maximum Classic ECN marking probability, p_Cmax, before
628	      switching over to drop.

630	   An experimental DualQ Coupled AQM SHOULD allow the operator to
631	   monitor the following operational statistics:

633	   o  Bits forwarded (total and per queue per sample interval), from
634	      which utilization can be calculated

636	   o  Q delay (per queue over sample interval)

638	   o  Total packets arriving, enqueued and dequeued (per queue per
639	      sample interval)

641	   o  ECN packets marked, non-ECN packets dropped, ECN packets dropped
642	      (per queue per sample interval), from which marking and dropping
643	      probabilities can be calculated

645	   o  Time and duration of each overload event.

647	   The type of statistics produced for variables like Q delay (mean,
648	   percentiles, etc.) will depend on implementation constraints.

650	3.  IANA Considerations

652	   This specification contains no IANA considerations.

654	4.  Security Considerations

656	4.1.  Overload Handling

658	   Where the interests of users or flows might conflict, it could be
659	   necessary to police traffic to isolate any harm to the performance of
660	   individual flows.  However, it is hard to avoid unintended side-
661	   effects with policing, and in a trusted environment policing is not
662	   necessary.  Therefore per-flow policing needs to be separable from a
663	   basic AQM, as an option under policy control.

665	   However, a basic DualQ AQM does at least need to handle overload.  A
666	   useful objective would be for the overload behaviour of the DualQ AQM
667	   to be at least no worse than a single queue AQM.  However, a trade-
668	   off needs to be made between complexity and the risk of either
669	   traffic class harming the other.  In each of the following three
670	   subsections, an overload issue specific to the DualQ is described,
671	   followed by proposed solution(s).

673	   Under overload the higher priority L4S service will have to sacrifice
674	   some aspect of its performance.  Alternative solutions are provided
675	   below that each relax a different factor: e.g. throughput, delay,
676	   drop.  Some of these choices might need to be determined by operator
677	   policy or by the developer, rather than by the IETF. {ToDo: Reach
678	   consensus on which it is to be in each case.}

680	4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay?

682	   Priority of L4S is required to be conditional to avoid total
683	   throughput starvation of Classic by heavy L4S traffic.  This raises
684	   the question of whether to sacrifice L4S throughput or L4S delay (or
685	   some other policy) to mitigate starvation of Classic:

687	   Sacrifice L4S throughput:   By using weighted round robin as the
688	      conditional priority scheduler, the L4S service can sacrifice some
689	      throughput during overload to guarantee a minimum throughput
690	      service for Classic traffic.  The scheduling weight of the Classic
691	      queue should be small (e.g. 1/16).  Then, in most traffic
692	      scenarios the scheduler will not interfere and will not need to,
693	      as the coupling mechanism and the end-systems will share out the
694	      capacity across both queues as if it were a single pool.  However,
695	      because the congestion coupling only applies in one direction
696	      (from C to L), if L4S traffic is over-aggressive or unresponsive,
697	      the scheduler weight for Classic traffic will at least be large
698	      enough to ensure it does not starve.

700	      In cases where the ratio of L4S to Classic flows (e.g. 19:1) is
701	      greater than the ratio of their scheduler weights (e.g. 15:1), the
702	      L4S flows will get less than an equal share of the capacity, but
703	      only slightly.  For instance, with the example numbers given, each
704	      L4S flow will get (15/16)/19 = 4.9% when ideally each would get
705	      1/20=5%. In the rather specific case of an unresponsive flow
706	      taking up a large part of the capacity set aside for L4S, using
707	      WRR could significantly reduce the capacity left for any
708	      responsive L4S flows.

710	   Sacrifice L4S Delay:  To control milder overload of responsive
711	      traffic, particularly when close to the maximum congestion signal,
712	      the operator could choose to control overload of the Classic queue
713	      by allowing some delay to 'leak' across to the L4S queue.  The
714	      scheduler can be made to behave like a single First-In First-Out
715	      (FIFO) queue with different service times by implementing a very
716	      simple conditional priority scheduler that could be called a
717	      "time-shifted FIFO" (see the Modifier Earliest Deadline First
718	      (MEDF) scheduler of [MEDF]).  This scheduler adds tshift to the
719	      queue delay of the next L4S packet, before comparing it with the
720	      queue delay of the next Classic packet, then it selects the packet
721	      with the greater adjusted queue delay.  Under regular conditions,
722	      this time-shifted FIFO scheduler behaves just like a strict
723	      priority scheduler.  But under moderate or high overload it
724	      prevents starvation of the Classic queue, because the time-shift
725	      (tshift) defines the maximum extra queuing delay of Classic
726	      packets relative to L4S.
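
   The selection rule of the time-shifted FIFO described above can be
   sketched as follows.  The function name tsfifo_select, the use of
   None for an empty queue, the tie-break in favour of L4S and the
   50 ms time-shift in the usage line are all assumptions made for
   illustration.

```python
from typing import Optional


def tsfifo_select(l_delay: Optional[float], c_delay: Optional[float],
                  tshift: float) -> Optional[str]:
    """Return 'L', 'C' or None (both queues empty).

    l_delay and c_delay are the queue delays, in seconds, of the head
    packet of each queue, or None if that queue is empty.
    """
    if l_delay is None and c_delay is None:
        return None
    if c_delay is None:
        return "L"
    if l_delay is None:
        return "C"
    # Add the time-shift to the L4S packet's delay, then serve whichever
    # head packet has the greater adjusted delay (ties go to L4S here).
    return "L" if l_delay + tshift >= c_delay else "C"


if __name__ == "__main__":
    # Hypothetical tshift of 50 ms: Classic wins only once it has waited
    # more than 50 ms longer than the L4S head packet.
    print(tsfifo_select(0.001, 0.020, 0.050))
    print(tsfifo_select(0.001, 0.060, 0.050))
```

   While the Classic head packet has waited less than tshift longer
   than the L4S head packet, the result is always 'L', which is the
   strict-priority behaviour under regular conditions.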

728	   The example implementation in Appendix A can implement either policy.
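
   The capacity-share arithmetic in the weighted round robin example
   above (19 L4S flows, 1 Classic flow, Classic weight 1/16) can be
   reproduced with the sketch below.  It assumes that all flows are
   responsive and long-running and that, without the scheduler, the
   coupling would give every flow an equal share.  The function name
   per_flow_shares is hypothetical.

```python
from fractions import Fraction


def per_flow_shares(l4s_flows: int, classic_flows: int,
                    classic_weight: Fraction) -> dict:
    """Per-flow shares of link capacity when WRR caps the L4S queue."""
    total = l4s_flows + classic_flows
    ideal = Fraction(1, total)                # equal share for every flow
    l4s_cap = 1 - classic_weight              # most the L queue can be served
    l4s_demand = Fraction(l4s_flows, total)   # what equal sharing would need
    l4s_total = min(l4s_demand, l4s_cap)      # the scheduler caps the L queue
    return {
        "ideal": ideal,
        "l4s": l4s_total / l4s_flows,
        "classic": (1 - l4s_total) / classic_flows,
    }


if __name__ == "__main__":
    shares = per_flow_shares(19, 1, Fraction(1, 16))
    for name, share in shares.items():
        print(f"{name}: {float(share) * 100:.1f}%")
```

   With these inputs each L4S flow gets (15/16)/19 of the capacity
   against an ideal of 1/20, and the single Classic flow gets the 1/16
   reserved by its scheduler weight.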

730	4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or Delay?

732	   To keep the throughput of both L4S and Classic flows roughly equal
733	   over the full load range, a different control strategy needs to be
734	   defined above the point where one AQM first saturates to a
735	   probability of 100% leaving no room to push back the load any harder.
736	   If k>1, L4S will saturate first, but saturation can be caused by
737	   unresponsive traffic in either queue.

739	   The term 'unresponsive' includes cases where a flow becomes
740	   temporarily unresponsive, for instance, a real-time flow that takes a
741	   while to adapt its rate in response to congestion, or a TCP-like flow
742	   that is normally responsive, but above a certain congestion level it
743	   will not be able to reduce its congestion window below the minimum of
744	   2 segments, effectively becoming unresponsive.  (Note that L4S
745	   traffic ought to remain responsive below a window of 2 segments (see
746	   [I-D.ietf-tsvwg-ecn-l4s-id]).)

748	   Saturation raises the question of whether to relieve congestion by
749	   introducing some drop into the L4S queue or by allowing delay to grow
750	   in both queues (which could eventually lead to tail drop too):

752	   Drop on Saturation:  Saturation can be avoided by setting a maximum
753	      threshold for L4S ECN marking (assuming k>1) before saturation
754	      starts to make the flow rates of the different traffic types
755	      diverge.  Above that, the drop probability of Classic traffic is
756	      applied to all packets of all traffic types.  Experiments have
757	      shown that queueing delay can then be kept at the target in any
758	      overload situation, including with unresponsive traffic, and no
759	      further measures are required.

761	   Delay on Saturation:  When L4S marking saturates, instead of
762	      switching to drop, the drop and marking probabilities could be
763	      capped.  Beyond that, delay will grow either solely in the queue
764	      with unresponsive traffic (if WRR is used), or in both queues (if
765	      time-shifted FIFO is used).  In either case, the higher delay
766	      ought to control temporary high congestion.  If the overload is
767	      more persistent, eventually the combined DualQ will overflow and
768	      tail drop will control congestion.

770	   The example implementation in Appendix A applies only the "drop on
771	   saturation" policy.
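
   A minimal sketch of the "drop on saturation" policy is given below.
   It is not the Appendix A code.  The function name saturated_signal
   is hypothetical, the step threshold of the native L4S AQM is left
   out, and the treatment of packets that survive the drop decision
   (L4S packets still marked, Classic packets forwarded) is an
   assumption.

```python
import random


def saturated_signal(is_l4s: bool, ecn_capable: bool, p_cl: float,
                     p_c: float, p_lmax: float = 1.0,
                     rng=random.random) -> str:
    """Return 'drop', 'mark' or 'forward' for one dequeued packet."""
    if p_cl < p_lmax:
        # Normal operation: L4S is marked, Classic is marked or dropped.
        if is_l4s:
            return "mark" if p_cl > rng() else "forward"
        if p_c > rng():
            return "mark" if ecn_capable else "drop"
        return "forward"
    # L4S marking has saturated: the Classic drop probability is applied
    # to all packets of all traffic types.
    if p_c > rng():
        return "drop"
    # Assumption: surviving L4S packets are still marked, because the
    # L4S marking probability is at its maximum.
    return "mark" if is_l4s else "forward"
```

   Passing a fixed rng, as in saturated_signal(True, True, 1.0, 0.25,
   rng=lambda: 0.1), makes the decision repeatable for testing.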

773	4.1.3.  Protecting against Unresponsive ECN-Capable Traffic

775	   Unresponsive traffic has a greater advantage if it is also ECN-
776	   capable.  The advantage is undetectable at normal low levels of drop/
777	   marking, but it becomes significant with the higher levels of drop/
778	   marking typical during overload.  This is an issue whether the ECN-
779	   capable traffic is L4S or Classic.

781	   This raises the question of whether and when to switch off ECN
782	   marking and use solely drop instead, as required by both Section 7 of
783	   [RFC3168] and Section 4.2.1 of [RFC7567].

785	   Experiments with the DualPI2 AQM (Appendix A) have shown that
786	   introducing 'drop on saturation' at 100% L4S marking addresses this
787	   problem with unresponsive ECN as well as addressing the saturation
788	   problem.  It leaves only a small range of congestion levels where
789	   unresponsive traffic gains any advantage from using the ECN
790	   capability, and the advantage is hardly detectable [DualQ-Test].

792	5.  Acknowledgements

794	   Thanks to Anil Agarwal, Sowmini Varadhan and Gabi Bracha for
795	   detailed review comments particularly of the appendices and
796	   suggestions on how to make our explanation clearer.  Thanks also to
797	   Greg White and Tom Henderson for insights on the choice of schedulers
798	   and queue delay measurement techniques.

800	   The authors' contributions were originally part-funded by the
801	   European Community under its Seventh Framework Programme through the
802	   Reducing Internet Transport Latency (RITE) project (ICT-317700).  Bob
803	   Briscoe's contribution was also part-funded by the Research Council
804	   of Norway through the TimeIn project.  The views expressed here are
805	   solely those of the authors.

807	6.  References
808	6.1.  Normative References

810	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
811	              Requirement Levels", BCP 14, RFC 2119,
812	              DOI 10.17487/RFC2119, March 1997,
813	              <https://www.rfc-editor.org/info/rfc2119>.

815	6.2.  Informative References

817	   [ARED01]   Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An
818	              Algorithm for Increasing the Robustness of RED's Active
819	              Queue Management", ACIRI Technical Report , August 2001,
820	              <http://www.icir.org/floyd/red.html>.

822	   [CoDel]    Nichols, K. and V. Jacobson, "Controlling Queue Delay",
823	              ACM Queue 10(5), May 2012,
824	              <http://queue.acm.org/issuedetail.cfm?issue=2208917>.

826	   [CRED_Insights]
827	              Briscoe, B., "Insights from Curvy RED (Random Early
828	              Detection)", BT Technical Report TR-TUB8-2015-003, July
829	              2015,
830	              <http://www.bobbriscoe.net/projects/latency/credi_tr.pdf>.

832	   [DCttH15]  De Schepper, K., Bondarenko, O., Briscoe, B., and I.
833	              Tsang, "`Data Centre to the Home': Ultra-Low Latency for
834	              All", 2015, <http://www.bobbriscoe.net/projects/latency/
835	              dctth_preprint.pdf>.

837	              (Under submission)

839	   [DualQ-Test]
840	              Steen, H., "Destruction Testing: Ultra-Low Delay using
841	              Dual Queue Coupled Active Queue Management", Masters
842	              Thesis, Dept of Informatics, Uni Oslo , May 2017.

844	   [I-D.briscoe-tsvwg-l4s-diffserv]
845	              Briscoe, B., "Interactions between Low Latency, Low Loss,
846	              Scalable Throughput (L4S) and Differentiated Services",
847	              draft-briscoe-tsvwg-l4s-diffserv-00 (work in progress),
848	              March 2018.

850	   [I-D.ietf-tsvwg-ecn-l4s-id]
851	              Schepper, K., Briscoe, B., and I. Tsang, "Identifying
852	              Modified Explicit Congestion Notification (ECN) Semantics
853	              for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s-
854	              id-02 (work in progress), March 2018.

856	   [I-D.ietf-tsvwg-l4s-arch]
857	              Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency,
858	              Low Loss, Scalable Throughput (L4S) Internet Service:
859	              Architecture", draft-ietf-tsvwg-l4s-arch-02 (work in
860	              progress), March 2018.

862	   [I-D.sridharan-tcpm-ctcp]
863	              Sridharan, M., Tan, K., Bansal, D., and D. Thaler,
864	              "Compound TCP: A New TCP Congestion Control for High-Speed
865	              and Long Distance Networks", draft-sridharan-tcpm-ctcp-02
866	              (work in progress), November 2008.

868	   [Mathis09]
869	              Mathis, M., "Relentless Congestion Control", PFLDNeT'09 ,
870	              May 2009, <http://www.hpcc.jp/pfldnet2009/
871	              Program_files/1569198525.pdf>.

873	   [MEDF]     Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a
874	              simple scheduling algorithm for two real-time transport
875	              service classes with application in the UTRAN", Proc. IEEE
876	              Conference on Computer Communications (INFOCOM'03) Vol.2
877	              pp.1116-1122, March 2003.

879	   [PI2]      De Schepper, K., Bondarenko, O., Briscoe, B., and I.
880	              Tsang, "PI2: A Linearized AQM for both Classic and
881	              Scalable TCP", ACM CoNEXT'16 , December 2016,
882	              <https://riteproject.files.wordpress.com/2015/10/
883	              pi2_conext.pdf>.

885	              (To appear)

887	   [RFC0970]  Nagle, J., "On Packet Switches With Infinite Storage",
888	              RFC 970, DOI 10.17487/RFC0970, December 1985,
889	              <https://www.rfc-editor.org/info/rfc970>.

891	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
892	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
893	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
894	              S., Wroclawski, J., and L. Zhang, "Recommendations on
895	              Queue Management and Congestion Avoidance in the
896	              Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998,
897	              <https://www.rfc-editor.org/info/rfc2309>.

899	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
900	              of Explicit Congestion Notification (ECN) to IP",
901	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
902	              <https://www.rfc-editor.org/info/rfc3168>.

904	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
905	              J., Courtney, W., Davari, S., Firoiu, V., and D.
906	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
907	              Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002,
908	              <https://www.rfc-editor.org/info/rfc3246>.

910	   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
911	              RFC 3649, DOI 10.17487/RFC3649, December 2003,
912	              <https://www.rfc-editor.org/info/rfc3649>.

914	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
915	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
916	              <https://www.rfc-editor.org/info/rfc5681>.

918	   [RFC7567]  Baker, F., Ed. and G. Fairhurst, Ed., "IETF
919	              Recommendations Regarding Active Queue Management",
920	              BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015,
921	              <https://www.rfc-editor.org/info/rfc7567>.

923	   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
924	              "Proportional Integral Controller Enhanced (PIE): A
925	              Lightweight Control Scheme to Address the Bufferbloat
926	              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
927	              <https://www.rfc-editor.org/info/rfc8033>.

929	   [RFC8034]  White, G. and R. Pan, "Active Queue Management (AQM) Based
930	              on Proportional Integral Controller Enhanced PIE) for
931	              Data-Over-Cable Service Interface Specifications (DOCSIS)
932	              Cable Modems", RFC 8034, DOI 10.17487/RFC8034, February
933	              2017, <https://www.rfc-editor.org/info/rfc8034>.

935	   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
936	              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
937	              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
938	              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

940	   [RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
941	              J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler
942	              and Active Queue Management Algorithm", RFC 8290,
943	              DOI 10.17487/RFC8290, January 2018,
944	              <https://www.rfc-editor.org/info/rfc8290>.

946	   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
947	              Notification (ECN) Experimentation", RFC 8311,
948	              DOI 10.17487/RFC8311, January 2018,
949	              <https://www.rfc-editor.org/info/rfc8311>.

951	   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
952	              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
953	              RFC 8312, DOI 10.17487/RFC8312, February 2018,
954	              <https://www.rfc-editor.org/info/rfc8312>.

956	Appendix A.  Example DualQ Coupled PI2 Algorithm

958	   As a first concrete example, the pseudocode below gives the DualPI2
959	   algorithm.  DualPI2 follows the structure of the DualQ Coupled AQM
960	   framework in Figure 1.  A simple step threshold (in units of queuing
961	   time) is used for the Native L4S AQM, but a ramp is also described as
962	   an alternative.  And the PI2 algorithm [PI2] is used for the Classic
963	   AQM.  PI2 is an improved variant of the PIE AQM [RFC8033].

965	   We will introduce the pseudocode in two passes.  The first pass
966	   explains the core concepts, deferring handling of overload to the
967	   second pass.  To aid comparison, line numbers are kept in step
968	   between the two passes by using letter suffixes where the longer code
969	   needs extra lines.

971	   A full open source implementation for Linux is available at:
972	   https://github.com/olgabo/dualpi2.

974	A.1.  Pass #1: Core Concepts

976	   The pseudocode manipulates three main structures of variables: the
977	   packet (pkt), the L4S queue (lq) and the Classic queue (cq).  The
978	   pseudocode consists of the following four functions:

980	   o  initialization code (Figure 2) that sets parameter defaults (the
981	      API for setting non-default values is omitted for brevity)

983	   o  enqueue code (Figure 3)

985	   o  dequeue code (Figure 4)

987	   o  code to regularly update the base probability (p) used in the
988	      dequeue code (Figure 5).

990	   It also uses the following functions that are not shown in full here:

992	   o  scheduler(), which selects between the head packets of the two
993	      queues; the choice of scheduler technology is discussed later;

995	   o  cq.len() or lq.len() returns the current length (aka. backlog) of
996	      the relevant queue in bytes;

998	   o  cq.time() or lq.time() returns the current queuing delay (aka.
999	      sojourn time or service time) of the relevant queue in units of
1000	      time.

1002	   Queuing delay could be measured directly by storing a per-packet
1003	   time-stamp as each packet is enqueued, and subtracting this from the
1004	   system time when the packet is dequeued.  If time-stamping is not
1005	   easy to introduce with certain hardware, queuing delay could be
1006	   predicted indirectly by dividing the size of the queue by the
1007	   predicted departure rate, which might be known precisely for some
1008	   link technologies (see for example [RFC8034]).
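
   Both ways of obtaining queuing delay can be sketched with a small
   queue class.  The class name TimedQueue and the method
   estimated_time() are hypothetical; len() and time() mirror the
   functions named above.  Here time() reads the time-stamp of the head
   packet rather than waiting for the dequeue itself.

```python
import time
from collections import deque


class TimedQueue:
    """FIFO that stores an enqueue time-stamp with every packet."""

    def __init__(self, clock=time.monotonic):
        self._q = deque()          # entries: (enqueue_time, size_bytes, pkt)
        self._bytes = 0
        self._clock = clock

    def enqueue(self, pkt, size_bytes: int) -> None:
        self._q.append((self._clock(), size_bytes, pkt))
        self._bytes += size_bytes

    def dequeue(self):
        _, size_bytes, pkt = self._q.popleft()
        self._bytes -= size_bytes
        return pkt

    def len(self) -> int:
        """Current backlog in bytes."""
        return self._bytes

    def time(self) -> float:
        """Measured delay of the head packet in seconds (0 if empty)."""
        if not self._q:
            return 0.0
        return self._clock() - self._q[0][0]

    def estimated_time(self, departure_rate_bps: float) -> float:
        """Delay predicted from backlog and a known departure rate."""
        return self._bytes * 8 / departure_rate_bps
```

   For example, a backlog of one 1500 byte packet on a link draining at
   12 Mb/s gives a predicted delay of 1 ms.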

1010	   In our experiments so far (building on experiments with PIE) on
1011	   broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs
1012	   from 5 ms to 100 ms, DualPI2 achieves good results with the default
1013	   parameters in Figure 2.  The parameters are categorised by whether
1014	   they relate to the Base PI2 AQM, the L4S AQM or the framework
1015	   coupling them together.  Variables derived from these parameters are
1016	   also included at the end of each category.  Each parameter is
1017	   explained as it is encountered in the walk-through of the pseudocode
1018	   below.

1020	   1:  dualpi2_params_init(...) {         % Set input parameter defaults
1021	   2:    % PI2 AQM parameters
1022	   3:    target = 15 ms              % PI AQM Classic queue delay target
1023	   4:    Tupdate = 16 ms            % PI Classic queue sampling interval
1024	   5:    alpha = 10 Hz^2                              % PI integral gain
1025	   6:    beta = 100 Hz^2                          % PI proportional gain
1026	   7:    p_Cmax = 1/4                       % Max Classic drop/mark prob
1027	   8:    % Derived PI2 AQM variables
1028	   9:    alpha_U = alpha *Tupdate % PI integral gain per update interval
1029	   10:   beta_U = beta * Tupdate  % PI prop'nal gain per update interval
1030	   11:
1031	   12:   % DualQ Coupled framework parameters
1032	   13:   k = 2                                         % Coupling factor
1033	   14:   % scheduler weight or equival't parameter (scheduler-dependent)
1034	   15:   limit = MAX_LINK_RATE * 250 ms               % Dual buffer size
1035	   16:
1036	   17:   % L4S AQM parameters
1037	   18:   T_time = 1 ms                   % L4S marking threshold in time
1038	   19:   T_len = 2 * MTU            % Min L4S marking threshold in bytes
1039	   20:   % Derived L4S AQM variables
1040	   21:   p_Lmax = min(k*sqrt(p_Cmax), 1)          % Max L4S marking prob
1041	   22: }

1043	       Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM
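
   The derived variables in Figure 2 can be checked numerically with
   the sketch below.  The class name DualPI2Params is hypothetical,
   times are expressed in seconds, and limit and T_len are left out
   because they depend on the link rate and MTU of the deployment.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class DualPI2Params:
    target: float = 0.015    # Classic queue delay target (s)
    tupdate: float = 0.016   # Classic queue sampling interval (s)
    alpha: float = 10.0      # PI integral gain (Hz^2)
    beta: float = 100.0      # PI proportional gain (Hz^2)
    p_cmax: float = 0.25     # max Classic drop/mark probability
    k: float = 2.0           # coupling factor
    t_time: float = 0.001    # L4S marking threshold (s)

    @property
    def alpha_u(self) -> float:
        """Integral gain per update interval."""
        return self.alpha * self.tupdate

    @property
    def beta_u(self) -> float:
        """Proportional gain per update interval."""
        return self.beta * self.tupdate

    @property
    def p_lmax(self) -> float:
        """Max L4S marking probability, capped at 1."""
        return min(self.k * sqrt(self.p_cmax), 1.0)


if __name__ == "__main__":
    d = DualPI2Params()
    print(d.alpha_u, d.beta_u, d.p_lmax)
```

   With the defaults, alpha_U is about 0.16, beta_U is about 1.6 and
   p_Lmax is 1, so L4S marking saturates exactly when the Classic
   probability reaches p_Cmax.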

1045	   The overall goal of the code is to maintain the base probability (p),
1046	   which is an internal variable from which the marking and dropping
1047	   probabilities for L4S and Classic traffic (p_L and p_C) are derived.
1048	   The variable named p in the pseudocode and in this walk-through is
1049	   the same as p' (p-prime) in Section 2.4.  The probabilities p_L and
1050	   p_C are derived in lines 3, 4 and 5 of the dualpi2_update() function
1051	   (Figure 5) then used in the dualpi2_dequeue() function (Figure 4).
1052	   The code walk-through below builds up to explaining that part of the
1053	   code eventually, but it starts from packet arrival.

1055	   1:  dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq
1056	   2:    if ( lq.len() + cq.len() > limit )
1057	   3:      drop(pkt)                     % drop packet if buffer is full
1058	   4:    else {                                      % Packet classifier
1059	   5:      if ( ecn(pkt) modulo 2 == 1 )       % ECN bits = ECT(1) or CE
1060	   6:        lq.enqueue(pkt)
1061	   7:      else                           % ECN bits = not-ECT or ECT(0)
1062	   8:        cq.enqueue(pkt)
1063	   9:    }
1064	   10: }

1066	      Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM

1068	   1:  dualpi2_dequeue(lq, cq, pkt) {     % Couples L4S & Classic queues
1069	   2:    while ( lq.len() + cq.len() > 0 ) {
1070	   3:      if ( scheduler() == lq ) {
1071	   4:        lq.dequeue(pkt)                      % Scheduler chooses lq
1072	   5:        if ( ((lq.time() > T_time)              % step marking ...
1073	   6:              AND (lq.len() > T_len))
1074	   7:            OR (p_CL > rand()) )             % ...or linear marking
1075	   8:          mark(pkt)
1076	   9:      } else {
1077	   10:       cq.dequeue(pkt)                      % Scheduler chooses cq
1078	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1079	   12:         if ( ecn(pkt) == 0 ) {           % if ECN field = not-ECT
1080	   13:           drop(pkt)                                % squared drop
1081	   14:           continue        % continue to the top of the while loop
1082	   15:         }
1083	   16:         mark(pkt)                                  % squared mark
1084	   17:       }
1085	   18:     }
1086	   19:     return(pkt)                      % return the packet and stop
1087	   20:   }
1088	   21:   return(NULL)                             % no packet to dequeue
1089	   22: }

1091	      Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM

1093	   When packets arrive, first a common queue limit is checked as shown
1094	   in line 2 of the enqueuing pseudocode in Figure 3.  Note that the
1095	   limit is deliberately tested before enqueue to avoid any bias against
1096	   larger packets (so the actual buffer has to be one MTU larger than
1097	   limit).  If limit is not exceeded, the packet will be classified and
1098	   enqueued to the Classic or L4S queue dependent on the least
1099	   significant bit of the ECN field in the IP header (line 5).  Packets
1100	   with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be
1101	   enqueued in the Classic queue.  Otherwise, ECT(1) and CE packets will
1102	   be enqueued in the L4S queue.  Optional additional packet
1103	   classification flexibility is omitted for brevity (see
1104	   [I-D.ietf-tsvwg-ecn-l4s-id]).
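
   The classification step can be sketched as below.  The two-bit ECN
   codepoints are those of [RFC3168]; the function names ecn_field,
   classify and enqueue_decision are hypothetical, and queue lengths
   and limit are taken to be in bytes.

```python
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def ecn_field(tos_byte: int) -> int:
    """ECN field: the two least significant bits of the IPv4 TOS or
    IPv6 Traffic Class octet."""
    return tos_byte & 0x03


def classify(ecn: int) -> str:
    """'L' if the least significant bit of the ECN field is 1, else 'C'."""
    return "L" if ecn & 1 else "C"


def enqueue_decision(ecn: int, lq_len: int, cq_len: int, limit: int) -> str:
    """Mirror Figure 3: test the shared limit first, then classify."""
    if lq_len + cq_len > limit:
        return "drop"
    return classify(ecn)


if __name__ == "__main__":
    for name, cp in (("Not-ECT", NOT_ECT), ("ECT(0)", ECT0),
                     ("ECT(1)", ECT1), ("CE", CE)):
        print(f"{name:8s} -> {classify(cp)} queue")
```

   Because the limit is tested against the backlog before the arriving
   packet is added, the decision does not depend on the size of that
   packet.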

1106	   The dequeue pseudocode (Figure 4) is repeatedly called whenever the
1107	   lower layer is ready to forward a packet.  It schedules one packet
1108	   for dequeuing (or zero if the queue is empty) then returns control to
1109	   the caller, so that it does not block while that packet is being
1110	   forwarded.  While making this dequeue decision, it also makes the
1111	   necessary AQM decisions on dropping or marking.  The alternative of
1112	   applying the AQMs at enqueue would shift some processing from the
1113	   critical time when each packet is dequeued.  However, it would also
1114	   add a whole queue of delay to the control signals, making the control
1115	   loop very sloppy.

1117	   All the dequeue code is contained within a large while loop so that
1118	   if it decides to drop a packet, it will continue until it selects a
1119	   packet to schedule.  Line 3 of the dequeue pseudocode is where the
1120	   scheduler chooses between the L4S queue (lq) and the Classic queue
1121	   (cq).  Detailed implementation of the scheduler is not shown (see
1122	   discussion later).

1124	   o  If an L4S packet is scheduled, lines 5 to 8 mark the packet if
1125	      either the L4S threshold (T_time) is exceeded, or if a random
1126	      marking decision is drawn according to p_CL (maintained by the
1127	      dualpi2_update() function discussed below).  This logical 'OR' on
1128	      a per-packet basis implements the max() function shown in Figure 1
1129	      to couple the outputs of the two AQMs together.  The L4S threshold
1130	      is usually in units of time (default T_time = 1 ms).  However, on
1131	      slow links the packet serialization time can approach the
1132	      threshold T_time, so line 6 sets a floor of T_len (=2 MTU) to the
1133	      threshold, otherwise marking is always too frequent on slow links.

1135	   o  If a Classic packet is scheduled, lines 10 to 17 drop or mark the
1136	      packet based on the squared probability p_C.
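
   The L4S marking test in lines 5 to 7 of Figure 4 can be sketched as
   follows.  The function name l4s_mark is hypothetical, and the
   1500 byte MTU used for the default T_len is an assumption.

```python
import random


def l4s_mark(lq_time: float, lq_len: int, p_cl: float,
             t_time: float = 0.001, t_len: int = 2 * 1500,
             rng=random.random) -> bool:
    """Decide whether to CE-mark the L4S packet just dequeued.

    lq_time is the L queue delay in seconds, lq_len its backlog in bytes.
    """
    # Native L4S AQM: step threshold in time, with a floor in bytes so
    # that slow links are not marked on serialization delay alone.
    step = lq_time > t_time and lq_len > t_len
    # Coupled signal from the Classic AQM, applied probabilistically.
    coupled = p_cl > rng()
    return step or coupled
```

   Because the step output is either 0 or 1, the per-packet OR yields a
   marking probability of 1 when the step is exceeded and p_CL
   otherwise, which equals max(step, p_CL).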

1138	   There is some concern that using a step function for the Native L4S
1139	   AQM requires end-systems to smooth the signal for a lot longer -
1140	   until its fidelity is sufficient.  The latency benefits of a ramp are
1141	   being investigated as a simple alternative to the step.  This ramp
1142	   would be similar to the RED algorithm, with the following
1143	   differences:

1145	   o  The min and max of the ramp are defined in units of queuing delay,
1146	      not bytes, so that configuration remains invariant as the queue
1147	      departure rate varies.

1149	   o  It uses instantaneous queueing delay without smoothing (smoothing
1150	      is done in the end-systems).

1152	   o  Determinism is being experimented with instead of randomness, to
1153	      reduce the delay necessary to smooth out the noise of randomness
1154	      from the signal.  For each packet, the algorithm would accumulate
1155	      p'_L in a counter and mark the packet that took the counter over
1156	      1, then subtract 1 from the counter and continue.

1158	   o  The ramp rises linearly directly from 0 to 1, not to an
1159	      intermediate value of p'_L as RED would, because there is no need
1160	      to keep ECN marking probability low.

1162	   This ramp algorithm would require two configuration parameters (min
1163	   and max threshold in units of queuing time), in contrast to the
1164	   single parameter of a step.
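
   As a rough illustration of the ramp idea, the following Python
   sketch combines a linear ramp on instantaneous queuing delay with
   the deterministic counter described above.  All names are
   hypothetical.  The text says the packet that takes the counter
   "over 1" is marked; the sketch uses ">=" so that a probability of
   exactly 1 marks every packet, which is an assumption.

```python
def ramp_probability(qdelay, min_th, max_th):
    """Linear ramp from 0 at min_th to 1 at max_th (all in seconds)."""
    if qdelay <= min_th:
        return 0.0
    if qdelay >= max_th:
        return 1.0
    return (qdelay - min_th) / (max_th - min_th)


class DeterministicMarker:
    """Marks packets by accumulating probability instead of using rand()."""

    def __init__(self):
        self.counter = 0.0

    def should_mark(self, p):
        self.counter += p            # accumulate p'_L for this packet
        if self.counter >= 1.0:      # assumption: '>=' rather than '>'
            self.counter -= 1.0      # subtract 1 and carry on
            return True
        return False
```

   With a constant probability of 0.25, this marks every fourth
   packet exactly, rather than one in four on average.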

1166	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1167	   2:    curq = cq.time()  % use queuing time of first-in Classic packet
1168	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1169	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1170	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1171	   6:    prevq = curq
1172	   7:  }

1174	     Figure 5: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM

1176	   The base probability (p) is kept up to date by the core PI algorithm
1177	   in Figure 5, which is executed every Tupdate.

1179	   Note that p solely depends on the queuing time in the Classic queue.
1180	   In line 2, the current queuing delay (curq) is evaluated from how
1181	   long the head packet was in the Classic queue (cq).  The function
1182	   cq.time() (not shown) subtracts the time stamped at enqueue from the
1183	   current time and implicitly takes the current queuing delay as 0 if
1184	   the queue is empty.

1186	   The algorithm centres on line 3, which is a classical Proportional-
1187	   Integral (PI) controller that alters p dependent on: a) the error
1188	   between the current queuing delay (curq) and the target queuing delay
1189	   ('target' - see [RFC8033]); and b) the change in queuing delay since
1190	   the last sample.  The name 'PI' represents the fact that the second
1191	   factor (how fast the queue is growing) is _P_roportional to load
1192	   while the first is the _I_ntegral of the load (so it removes any
1193	   standing queue in excess of the target).

1195	   The two 'gain factors' in line 3, alpha_U and beta_U, respectively
1196	   weight how strongly each of these elements ((a) and (b)) alters p.
1197	   They are in units of 'per second of delay' or Hz, because they
1198	   transform differences in queueing delay into changes in probability.

1200	   alpha_U and beta_U are derived from the input parameters alpha and
1201	   beta (see lines 5 and 6 of Figure 2).  These recommended values of
1202	   alpha and beta come from the stability analysis in [PI2] so that the
1203	   AQM can change p as fast as possible in response to changes in load
1204	   without over-compensating and therefore causing oscillations in the
1205	   queue.

1207	   alpha and beta determine how much p ought to change if it was updated
1208	   every second.  It is best to update p as frequently as possible, but
1209	   the update interval (Tupdate) will probably be constrained by
1210	   hardware performance.  For link rates from 4 - 200 Mb/s, we found
1211	   Tupdate=16ms (as recommended in [RFC8033]) is sufficient.  However
1212	   small the chosen value of Tupdate, p should change by the same amount
1213	   per second, but in finer more frequent steps.  So the gain factors
1214	   used for updating p in Figure 5 need to be scaled by (Tupdate/1s),
1215	   which is done in lines 9 and 10 of Figure 2.  The suffix '_U'
1216	   represents 'per update time' (Tupdate).

1218	   In corner cases, p can overflow the range [0,1] so the resulting
1219	   value of p has to be bounded (omitted from the pseudocode).  Then, as
1220	   already explained, the coupled and Classic probabilities are derived
1221	   from the new p in lines 4 and 5 as p_CL = k*p and p_C = p^2.
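
   The update step, including the bounding of p that the pseudocode
   omits, might look as follows in Python.  This is a sketch under
   stated assumptions: the function names are hypothetical, the gains
   are scaled by Tupdate exactly as the text describes, and clamping
   p to [0,1] is just one straightforward way to bound it.

```python
def scale_gains(alpha, beta, t_update):
    """Convert per-second gains into per-update gains (suffix '_U')."""
    return alpha * t_update, beta * t_update


def pi2_update(p, curq, prevq, target, alpha_U, beta_U, k=2.0):
    """One PI update; returns the new (p, p_CL, p_C).

    curq, prevq and target are queuing delays in seconds.
    """
    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
    p = min(max(p, 0.0), 1.0)   # bound p to [0, 1] (omitted in Figure 5)
    p_CL = p * k                # coupled L4S probability
    p_C = p * p                 # Classic probability (squared)
    return p, p_CL, p_C
```

   Note that p_CL = k*p can exceed 1 when p > 1/k.  The sketch leaves
   that to the overload logic rather than clamping p_CL here.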

1223	   Because the coupled L4S marking probability (p_CL) is factored up by
1224	   k, the dynamic gain parameters alpha and beta are also inherently
1225	   factored up by k for the L4S queue, which is necessary to ensure that
1226	   Classic TCP and DCTCP controls have the same stability.  So, if alpha
1227	   is 10 Hz^2, the effective gain factor for the L4S queue is k*alpha,
1228	   which is 20 Hz^2 with the default coupling factor of k=2.

1230	   Unlike in PIE [RFC8033], alpha_U and beta_U do not need to be tuned
1231	   every Tupdate dependent on p.  Instead, in PI2, alpha_U and beta_U
1232	   are independent of p because the squaring applied to Classic traffic
1233	   tunes them inherently.  This is explained in [PI2], which also
1234	   explains why this more principled approach removes the need for most
1235	   of the heuristics that had to be added to PIE.

1237	   {ToDo: Scaling beta with Tupdate and scaling both alpha & beta with
1238	   RTT}

1240	A.2.  Pass #2: Overload Details

1242	   Figure 6 repeats the dequeue function of Figure 4, but with overload
1243	   details added.  Similarly Figure 7 repeats the core PI algorithm of
1244	   Figure 5 with overload details added.  The initialization and enqueue
1245	   functions are unchanged.

1247	   In line 7 of the initialization function (Figure 2), the default
1248	   maximum Classic drop probability p_Cmax = 1/4 or 25%. This is the
1249	   point at which it is deemed that the Classic queue has become
1250	   persistently overloaded, so it switches to using solely drop, even
1251	   for ECN-capable packets.  This protects the queue against any
1252	   unresponsive traffic that falsely claims that it is responsive to ECN
1253	   marking, as required by [RFC3168] and [RFC7567].

1255	   Line 21 of the initialization function translates this into a maximum
1256	   L4S marking probability (p_Lmax) by rearranging Equation (1).  With a
1257	   coupling factor of k=2 (the default) or greater, this translates to a
1258	   maximum L4S marking probability of 1 (or 100%).  This is intended to
1259	   ensure that the L4S queue starts to introduce dropping once marking
1260	   saturates and can rise no further.  The 'TCP Prague' requirements
1261	   [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion
1262	   control detects a drop, it falls back to a response that coexists
1263	   with 'Classic' TCP.  So it is correct that the L4S queue drops
1264	   packets proportional to p^2, as if they are Classic packets.

1266	   Both these switch-overs are triggered by the tests for overload
1267	   introduced in lines 4b and 12b of the dequeue function (Figure 6).
1268	   Lines 8c to 8g drop L4S packets with probability p^2.  Lines 8h to 8i
1269	   mark the remaining packets with probability p_CL.

1271	   Lines 2c to 2d in the core PI algorithm (Figure 7) deal with overload
1272	   of the L4S queue when there is no Classic traffic.  This is
1273	   necessary, because the core PI algorithm maintains the appropriate
1274	   drop probability to regulate overload, but it depends on the length
1275	   of the Classic queue.  If there is no Classic queue the naive
1276	   algorithm in Figure 5 drops nothing, even if the L4S queue is
1277	   overloaded - so tail drop would have to take over (lines 3 and 4 of
1278	   Figure 3).

1280	   If the test at line 2a finds that the Classic queue is empty, line 2d
1281	   measures the current queue delay using the L4S queue instead.  While
1282	   the L4S queue is not overloaded, its delay will always be tiny
1283	   compared to the target Classic queue delay.  So p_L will be driven to
1284	   zero, and the L4S queue will naturally be governed solely by
1285	   threshold marking (lines 5 and 6 of the dequeue algorithm in
1286	   Figure 6).  But, if unresponsive L4S source(s) cause overload, the
1287	   DualQ transitions smoothly to L4S marking based on the PI algorithm.
1288	   And as overload increases, it naturally transitions from marking to
1289	   dropping by the switch-over mechanism already described.

1291	   1:  dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq
1292	   2:    while ( lq.len() + cq.len() > 0 ) {
1293	   3:      if ( scheduler() == lq ) {
1294	   4a:       lq.dequeue(pkt)
1295	   4b:       if ( p_CL < p_Lmax ) {      % Check for overload saturation
1296	   5:          if ( ((lq.time() > T_time)             % step marking ...
1297	   6:                AND (lq.len() > T_len))
1298	   7:              OR (p_CL > rand()) )           % ...or linear marking
1299	   8a:            mark(pkt)
1300	   8b:       } else {                              % overload saturation
1301	   8c:         if ( p_C > rand() ) {             % probability p_C = p^2
1302	   8e:           drop(pkt)      % revert to Classic drop due to overload
1303	   8f:           continue        % continue to the top of the while loop
1304	   8g:         }
1305	   8h:         if ( p_CL > rand() )           % probability p_CL = k * p
1306	   8i:           mark(pkt)         % linear marking of remaining packets
1307	   8j:       }
1308	   9:      } else {
1309	   10:       cq.dequeue(pkt)
1310	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1311	   12a:        if ( (ecn(pkt) == 0)                % ECN field = not-ECT
1312	   12b:             OR (p_C >= p_Cmax) ) {       % Overload disables ECN
1313	   13:           drop(pkt)                     % squared drop, redo loop
1314	   14:           continue        % continue to the top of the while loop
1315	   15:         }
1316	   16:         mark(pkt)                                  % squared mark
1317	   17:       }
1318	   18:     }
1319	   19:     return(pkt)                      % return the packet and stop
1320	   20:   }
1321	   21:   return(NULL)                             % no packet to dequeue
1322	   22: }

1324	      Figure 6: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM
1325	             (Including Integer Arithmetic and Overload Code)

1327	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1328	   2a:   if ( cq.len() > 0 )
1329	   2b:     curq = cq.time() %use queuing time of first-in Classic packet
1330	   2c:   else                                      % Classic queue empty
1331	   2d:     curq = lq.time()    % use queuing time of first-in L4S packet
1332	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1333	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1334	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1335	   6:    prevq = curq
1336	   7:  }

1338	     Figure 7: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM
1339	                         (Including Overload Code)

1341	   The choice of scheduler technology is critical to overload protection
1342	   (see Section 4.1).

1344	   o  A well-understood weighted scheduler such as weighted round robin
1345	      (WRR) is recommended.  The scheduler weight for Classic should be
1346	      low, e.g. 1/16.

1348	   o  Alternatively, a time-shifted FIFO could be used.  This is a very
1349	      simple scheduler, but it does not fully isolate latency in the L4S
1350	      queue from uncontrolled bursts in the Classic queue.  It works by
1351	      selecting the head packet that has waited the longest, biased
1352	      against the Classic traffic by a time-shift of tshift.  To
1353	      implement time-shifted FIFO, the "if (scheduler() == lq )" test in
1354	      line 3 of the dequeue code would simply be replaced by "if (
1355	      lq.time() + tshift >= cq.time() )".  For the public Internet a
1356	      good value for tshift is 50ms.  For private networks with smaller
1357	      diameter, about 4*target would be reasonable.

1359	   o  A strict priority scheduler would be inappropriate, because it
1360	      would starve Classic if L4S was overloaded.
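
   The time-shifted FIFO test described above can be sketched in
   Python as follows.  The function name and return values are
   hypothetical.  The explicit handling of empty queues is an
   assumption added for the sketch: without it, an empty L4S queue
   (queuing time taken as 0) could be selected whenever the Classic
   head packet has waited less than tshift.

```python
TSHIFT = 0.050  # 50 ms, the value suggested for the public Internet


def select_queue(lq_time, cq_time, lq_len, cq_len, tshift=TSHIFT):
    """Return 'lq', 'cq' or None for the queue to serve next.

    lq_time, cq_time: waiting time of each head packet, in seconds
    lq_len, cq_len:   queue lengths (only tested against zero)
    """
    if lq_len == 0 and cq_len == 0:
        return None
    if cq_len == 0:
        return 'lq'
    if lq_len == 0:
        return 'cq'
    # Serve the head packet that has waited longest, with the Classic
    # queue handicapped by tshift.
    return 'lq' if lq_time + tshift >= cq_time else 'cq'
```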

1362	Appendix B.  Example DualQ Coupled Curvy RED Algorithm

1364	   As another example of a DualQ Coupled AQM algorithm, the pseudocode
1365	   below gives the Curvy RED based algorithm we used and tested.
1366	   Although we designed the AQM to be efficient in integer arithmetic,
1367	   to aid understanding it is first given using real-number arithmetic.
1368	   Then, one possible optimization for integer arithmetic is given, also
1369	   in pseudocode.  To aid comparison, the line numbers are kept in step
1370	   between the two by using letter suffixes where the longer code needs
1371	   extra lines.

1373	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1374	   2:    if ( lq.dequeue(pkt) ) {
1375	   3a:     p_L = cq.sec() / 2^S_L
1376	   3b:     if ( lq.byt() > T )
1377	   3c:       mark(pkt)
1378	   3d:     elif ( p_L > maxrand(U) )
1379	   4:        mark(pkt)
1380	   5:      return(pkt)                % return the packet and stop here
1381	   6:    }
1382	   7:    while ( cq.dequeue(pkt) ) {
1383	   8a:     alpha = 2^(-f_C)
1384	   8b:     Q_C = alpha * pkt.sec() + (1-alpha)* Q_C    % Classic Q EWMA
1385	   9a:     sqrt_p_C = Q_C / 2^S_C
1386	   9b:     if ( sqrt_p_C > maxrand(2*U) )
1387	   10:       drop(pkt)                        % Squared drop, redo loop
1388	   11:     else
1389	   12:       return(pkt)              % return the packet and stop here
1390	   13:   }
1391	   14:   return(NULL)                           % no packet to dequeue
1392	   15: }

1394	   16: maxrand(u) {                % return the max of u random numbers
1395	   17:     maxr=0
1396	   18:     while (u-- > 0)
1397	   19:         maxr = max(maxr, rand())               % 0 <= rand() < 1
1398	   20:     return(maxr)
1399	   21: }

1401	   Figure 8: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM

1403	   Packet classification code is not shown, as it is no different from
1404	   Figure 3.  Potential classification schemes are discussed in
1405	   Section 2.3.  The Curvy RED algorithm has not been maintained to the
1406	   same degree as the DualPI2 algorithm.  Some ideas used in DualPI2
1407	   would need to be translated into Curvy RED, such as i) the
1408	   conditional priority scheduler instead of strict priority; ii) the
1409	   time-based L4S threshold; iii) turning off ECN as overload
1410	   protection; iv) Classic ECN support.  These are not shown in the
1411	   Curvy RED pseudocode, but would need to be implemented for
1412	   production. {ToDo}

1414	   At the outer level, the structure of dualq_dequeue() implements
1415	   strict priority scheduling.  The code is written assuming the AQM is
1416	   applied on dequeue (Note 1).  Every time dualq_dequeue() is called,
1417	   the if-block in lines 2-6 determines whether there is an L4S packet
1418	   to dequeue by calling lq.dequeue(pkt), and otherwise the while-block
1419	   in lines 7-13 determines whether there is a Classic packet to
1420	   dequeue, by calling cq.dequeue(pkt).  (Note 2)
1421	   In the lower priority Classic queue, a while loop is used so that, if
1422	   the AQM determines that a Classic packet should be dropped, it
1423	   continues to test for Classic packets, deciding whether to drop each
1424	   until it actually forwards one.  Thus, every call to dualq_dequeue()
1425	   returns one packet if at least one is present in either queue,
1426	   otherwise it returns NULL at line 14.  (Note 3)

1428	   Within each queue, the decision whether to drop or mark is taken as
1429	   follows (to simplify the explanation, it is assumed that U=1):

1431	   L4S:  If the test at line 2 determines there is an L4S packet to
1432	      dequeue, the tests at lines 3b and 3d determine whether to mark
1433	      it.  The first is a simple test of whether the L4S queue (lq.byt()
1434	      in bytes) is greater than a step threshold T in bytes (Note 4).
1435	      The second test is similar to the random ECN marking in RED, but
1436	      with the following differences: i) the marking function does not
1437	      start with a plateau of zero marking until a minimum threshold,
1438	      rather the marking probability starts to increase as soon as the
1439	      queue is positive; ii) marking depends on queuing time, not bytes,
1440	      in order to scale for any link rate without being reconfigured;
1441	      iii) marking of the L4S queue does not depend on itself, it
1442	      depends on the queuing time of the _other_ (Classic) queue, where
1443	      cq.sec() is the queuing time of the packet at the head of the
1444	      Classic queue (zero if empty); iv) marking depends on the
1445	      instantaneous queuing time (of the other Classic queue), not a
1446	      smoothed average; v) the queue is compared with the maximum of U
1447	      random numbers (but if U=1, this is the same as the single random
1448	      number used in RED).

1450	      Specifically, in line 3a the marking probability p_L is set to the
1451	      Classic queueing time cq.sec() in seconds divided by the L4S
1452	      scaling parameter 2^S_L, which represents the queuing time (in
1453	      seconds) at which marking probability would hit 100%. Then in line
1454	      3d (if U=1) the result is compared with a uniformly distributed
1455	      random number between 0 and 1, which ensures that marking
1456	      probability will linearly increase with queueing time.  The
1457	      scaling parameter is expressed as a power of 2 so that division
1458	      can be implemented as a right bit-shift (>>) in line 3 of the
1459	      integer variant of the pseudocode (Figure 9).

1461	   Classic:  If the test at line 7 determines that there is at least one
1462	      Classic packet to dequeue, the test at line 9b determines whether
1463	      to drop it.  But before that, line 8b updates Q_C, which is an
1464	      exponentially weighted moving average (Note 5) of the queuing time
1465	      in the Classic queue, where pkt.sec() is the instantaneous
1466	      queueing time of the current Classic packet and alpha is the EWMA
1467	      constant for the Classic queue.  In line 8a, alpha is represented
1468	      as an integer power of 2, so that in line 8 of the integer code
1469	      the division needed to weight the moving average can be
1470	      implemented by a right bit-shift (>> f_C).

1472	      Lines 9a and 9b implement the drop function.  In line 9a the
1473	      averaged queuing time Q_C is divided by the Classic scaling
1474	      parameter 2^S_C, in the same way that queuing time was scaled for
1475	      L4S marking.  This scaled queuing time is given the variable name
1476	      sqrt_p_C because it will be squared to compute Classic drop
1477	      probability, so before it is squared it is effectively the square
1478	      root of the drop probability.  The squaring is done by comparing
1479	      it with the maximum out of two random numbers (assuming U=1).
1480	      Comparing it with the maximum out of two is the same as the
1481	      logical 'AND' of two tests, which ensures drop probability rises
1482	      with the square of queuing time (Note 6).  Again, the scaling
1483	      parameter is expressed as a power of 2 so that division can be
1484	      implemented as a right bit-shift in line 9 of the integer
1485	      pseudocode.
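
   To illustrate why alpha is chosen as a power of two, the following
   Python sketch compares the real-number EWMA of line 8b with the
   shift-based form used in line 8 of the integer code.  The function
   names are hypothetical.  In Python, a right shift of a negative
   integer rounds towards minus infinity, so the integer form can
   differ from the real-number form by up to one unit per update.

```python
def ewma_real(q_c, sample, f_c):
    """Real-number EWMA: Q_C = alpha*sample + (1-alpha)*Q_C."""
    alpha = 2.0 ** (-f_c)
    return alpha * sample + (1.0 - alpha) * q_c


def ewma_int(q_c, sample_ns, f_c):
    """Integer EWMA: Q_C += (sample - Q_C) >> f_C, with times in ns."""
    return q_c + ((sample_ns - q_c) >> f_c)
```

   For example, with f_C = 5 (alpha = 1/32) a sample of 3200 ns moves
   an average of 0 up to 100 ns in both forms.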

1487	   The marking/dropping functions in each queue (lines 3 & 9) are two
1488	   cases of a new generalization of RED called Curvy RED, motivated as
1489	   follows.  When we compared the performance of our AQM with fq_CoDel
1490	   and PIE, we came to the conclusion that their goal of holding queuing
1491	   delay to a fixed target is misguided [CRED_Insights].  As the number
1492	   of flows increases, if the AQM does not allow TCP to increase queuing
1493	   delay, it has to introduce abnormally high levels of loss.  Then loss
1494	   rather than queuing becomes the dominant cause of delay for short
1495	   flows, due to timeouts and tail losses.

1497	   Curvy RED constrains delay with a softened target that allows some
1498	   increase in delay as load increases.  This is achieved by increasing
1499	   drop probability on a convex curve relative to queue growth (the
1500	   square curve in the Classic queue, if U=1).  Like RED, the curve hugs
1501	   the zero axis while the queue is shallow.  Then, as load increases,
1502	   it introduces a growing barrier to higher delay.  But, unlike RED, it
1503	   requires only one parameter, the scaling, not three.  The disadvantage
1504	   of Curvy RED is that it is not adapted to a wide range of RTTs.
1505	   Curvy RED can be used as is when the RTT range to support is limited;
1506	   otherwise an adaptation mechanism is required.

1508	   There follows a summary listing of the two parameters used for each
1509	   of the two queues:

1511	   Classic:

1513	      S_C :   The scaling factor of the dropping function scales Classic
1514	         queuing times in the range [0, 2^(S_C)] seconds into a dropping
1515	         probability in the range [0,1].  To make division efficient, it
1516	         is constrained to be an integer power of two;

1518	      f_C :  To smooth the queuing time of the Classic queue and make
1519	         multiplication efficient, we use a negative integer power of
1520	         two for the dimensionless EWMA constant, which we define as
1521	         alpha = 2^(-f_C).

1523	   L4S :

1525	      S_L (and k'):   As for the Classic queue, the scaling factor of
1526	         the L4S marking function scales Classic queueing times in the
1527	         range [0, 2^(S_L)] seconds into a probability in the range
1528	         [0,1].  Note that S_L = S_C + k', where k' is the coupling
1529	         between the queues.  So S_L and k' count as only one parameter;
1530	         k' is related to k in Equation (1) (Section 2.1) by k=2^k',
1531	         where both k and k' are constants.  Then implementations can
1532	         avoid costly division by shifting p_L by k' bits to the right.

1534	      T :  The queue size in bytes at which step threshold marking
1535	         starts in the L4S queue.

1537	   {ToDo: These are the raw parameters used within the algorithm.  A
1538	   configuration front-end could accept more meaningful parameters and
1539	   convert them into these raw parameters.}

1541	   From our experiments so far, recommended values for these parameters
1542	   are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs
1543	   typical on the public Internet.  [CRED_Insights] explains why these
1544	   parameters apply whatever the rate of the link on which this AQM
1545	   is deployed, and how the parameters would need to be adjusted for a
1546	   scenario with a different range of RTTs (e.g. a data centre) {ToDo
1547	   incorporate a summary of that report into this draft}. The setting of
1548	   k depends on policy (see Section 2.5 and Appendix C respectively for
1549	   its recommended setting and guidance on alternatives).

1551	   There is also a cUrviness parameter, U, which is a small positive
1552	   integer.  It is likely to take the same hard-coded value for all
1553	   implementations, once experiments have determined a good value.  We
1554	   have solely used U=1 in our experiments so far, but results might be
1555	   even better with U=2 or higher.

1557	   Note that the dropping function at line 9 calls maxrand(2*U), which
1558	   gives twice as much curviness as the call to maxrand(U) in the
1559	   marking function at line 3.  This is the trick that implements the
1560	   square rule in equation (1) (Section 2.1).  This is based on the fact
1561	   that, given a number X from 1 to 6, the probability that two dice
1562	   throws will both be less than X is the square of the probability that
1563	   one throw will be less than X.  So, when U=1, the L4S marking
1564	   function is linear and the Classic dropping function is squared.  If
1565	   U=2, L4S would be a square function and Classic would be quartic.
1566	   And so on.
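
   The dice argument can be checked numerically.  The Python sketch
   below implements maxrand() as in lines 16-21 and estimates, by
   simulation, how often a value x exceeds the maximum of u random
   numbers.  The helper estimate_probability() and its trial count
   and seed are assumptions made purely for illustration.

```python
import random


def maxrand(u, rnd=random.random):
    """Return the maximum of u uniform random numbers in [0, 1)."""
    maxr = 0.0
    for _ in range(u):
        maxr = max(maxr, rnd())
    return maxr


def estimate_probability(x, u, trials=100_000, seed=1):
    """Estimate P(x > maxrand(u)), which should be close to x**u."""
    rng = random.Random(seed)   # seeded so that runs are repeatable
    hits = sum(1 for _ in range(trials) if x > maxrand(u, rng.random))
    return hits / trials
```

   With x = 0.5, the estimate is close to 0.5 for u=1 (linear) and
   close to 0.25 for u=2 (squared), matching the L4S and Classic
   cases when U=1.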

1568	   The maxrand(u) function in lines 16-21 simply generates u random
1569	   numbers and returns the maximum (Note 7).  Typically, maxrand(u)
1570	   could be run in parallel out of band.  For instance, if U=1, the
1571	   Classic queue would require the maximum of two random numbers.  So,
1572	   instead of calling maxrand(2*U) in-band, the maximum of every pair of
1573	   values from a pseudorandom number generator could be generated out-
1574	   of-band, and held in a buffer ready for the Classic queue to consume.

1576	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1577	   2:     if ( lq.dequeue(pkt) ) {
1578	   3:        if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U)))
1579	   4:           mark(pkt)
1580	   5:        return(pkt)              % return the packet and stop here
1581	   6:     }
1582	   7:     while ( cq.dequeue(pkt) ) {
1583	   8:        Q_C += (pkt.ns() - Q_C) >> f_C            % Classic Q EWMA
1584	   9:        if ( (Q_C >> (S_C-2) ) > maxrand(2*U) )
1585	   10:          drop(pkt)                     % Squared drop, redo loop
1586	   11:       else
1587	   12:          return(pkt)           % return the packet and stop here
1588	   13:    }
1589	   14:    return(NULL)                           % no packet to dequeue
1590	   15: }

1592	   Figure 9: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM
1593	                         using Integer Arithmetic

1595	   Notes:

1597	   1.  The drain rate of the queue can vary if it is scheduled relative
1598	       to other queues, or to cater for fluctuations in a wireless
1599	       medium.  To auto-adjust to changes in drain rate, the queue must
1600	       be measured in time, not bytes or packets [CoDel].  In our Linux
1601	       implementation, it was easiest to measure queuing time at
1602	       dequeue.  Queuing time can be estimated when a packet is enqueued
1603	       by measuring the queue length in bytes and dividing by the recent
1604	       drain rate.

1606	   2.  An implementation has to use priority queueing, but it need not
1607	       implement strict priority.

1609	   3.  If packets can be enqueued while processing dequeue code, an
1610	       implementer might prefer to place the while loop around both
1611	       queues so that it goes back to test again whether any L4S packets
1612	       arrived while it was dropping a Classic packet.

1614	   4.  In order not to change too many factors at once, for now, we keep
1615	       the marking function for DCTCP-only traffic as similar as
1616	       possible to DCTCP.  However, unlike DCTCP, all processing is at
1617	       dequeue, so we determine whether to mark a packet at the head of
1618	       the queue by the byte-length of the queue _behind_ it.  We plan
1619	       to test whether using queuing time will work in all
1620	       circumstances, and if we find that the step can cause
1621	       oscillations, we will investigate replacing it with a steep
1622	       random marking curve.

1624	   5.  An EWMA is only one possible way to filter bursts; other more
1625	       adaptive smoothing methods could be valid and it might be
1626	       appropriate to decrease the EWMA faster than it increases.

1628	   6.  In practice at line 10 the Classic queue would probably test for
1629	       ECN capability on the packet to determine whether to drop or mark
1630	       the packet.  However, for brevity such detail is omitted.  All
1631	       packets classified into the L4S queue have to be ECN-capable, so
1632	       no dropping logic is necessary at line 3.  Nonetheless, L4S
1633	       packets could be dropped by overload code (see Section 4.1).

1635	   7.  In the integer variant of the pseudocode (Figure 9) real numbers
1636	       are all represented as integers scaled up by 2^32.  In lines 3 &
1637	       9 the function maxrand() is arranged to return an integer in the
1638	       range 0 <= maxrand() < 2^32.  Queuing times are also scaled up by
1639	       2^32, but in two stages: i) In lines 3 and 8 queuing times
1640	       cq.ns() and pkt.ns() are returned in integer nanoseconds, making
1641	       the values about 2^30 times larger than when the units were
1642	       seconds, ii) then in lines 3 and 9 an adjustment of -2 to the
1643	       right bit-shift multiplies the result by 2^2, to complete the
1644	       scaling by 2^32.
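
   The two-stage scaling in Note 7 can be illustrated in Python.  The
   sketch below shifts a queuing time in nanoseconds right by (S - 2)
   bits and compares the result with the exact scaled value.  The
   function names are hypothetical.  Treating a negative shift count
   as a left shift is an assumption: with the recommended S_C = -1
   the count S_C - 2 is negative, and the pseudocode does not say how
   that case is handled.

```python
def scaled_queue_time(ns, s):
    """Approximate (seconds / 2^s) * 2^32 from integer nanoseconds."""
    shift = s - 2
    if shift >= 0:
        return ns >> shift
    return ns << -shift   # assumption: negative right-shift = left-shift


def exact_scaled_queue_time(ns, s):
    """Exact (seconds / 2^s) * 2^32, for comparison only."""
    return (ns / 1e9) / (2.0 ** s) * 2 ** 32
```

   Because 10^9 is only approximately 2^30, the shifted value is
   about 7% below the exact one.  For instance, 0.5 s with S = -1
   gives 4,000,000,000 rather than 2^32 = 4,294,967,296.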

1646	Appendix C.  Guidance on Controlling Throughput Equivalence

1648	                     +---------------+------+-------+
1649	                     | RTT_C / RTT_L | Reno | Cubic |
1650	                     +---------------+------+-------+
1651	                     |             1 | k'=1 | k'=0  |
1652	                     |             2 | k'=2 | k'=1  |
1653	                     |             3 | k'=2 | k'=2  |
1654	                     |             4 | k'=3 | k'=2  |
1655	                     |             5 | k'=3 | k'=3  |
1656	                     +---------------+------+-------+

1658	    Table 1: Value of k' for which DCTCP throughput is roughly the same
1659	               as Reno or Cubic, for some example RTT ratios

1661	   k' is related to k in Equation (1) (Section 2.1) by k=2^k'.

1663	   To determine the appropriate policy, the operator first has to judge
1664	   whether it wants DCTCP flows to have roughly equal throughput with
1665	   Reno or with Cubic (because, even in its Reno-compatibility mode,
1666	   Cubic is about 1.4 times more aggressive than Reno).  Then the
1667	   operator needs to decide at what ratio of RTTs it wants DCTCP and
1668	   Classic flows to have roughly equal throughput.  For example choosing
1669	   k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same
1670	   as Cubic, _if their RTTs are the same_.

1672	   However, even if the base RTTs are the same, the actual RTTs are
1673	   unlikely to be the same, because Classic (Cubic or Reno) traffic
1674	   needs a large queue to avoid under-utilization and excess drop,
1675	   whereas L4S (DCTCP) does not.  The operator might still choose this
1676	   policy if it judges that DCTCP throughput should be rewarded for
1677	   keeping its own queue short.

1679	   On the other hand, the operator will choose one of the higher values
1680	   for k', if it wants to slow DCTCP down to roughly the same throughput
1681	   as Classic flows, to compensate for Classic flows slowing themselves
1682	   down by causing themselves extra queuing delay.

1684	   The values for k' in the table are derived from the formulae below,
1685	   which were developed in [DCttH15]:

1687	       2^k' = 1.64 (RTT_reno / RTT_dc)                  (2)
1688	       2^k' = 1.19 (RTT_cubic / RTT_dc )                (3)
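
   As a worked check, the Python sketch below solves formulae (2) and
   (3) for k' and rounds to the nearest integer.  The function name
   is hypothetical and the rounding rule is an assumption, although
   it reproduces every entry of Table 1.

```python
import math


def k_prime(rtt_ratio, classic="reno"):
    """Integer k' for a given RTT_C / RTT_L ratio.

    classic: "reno" uses the 1.64 factor of formula (2),
             "cubic" uses the 1.19 factor of formula (3).
    """
    factor = {"reno": 1.64, "cubic": 1.19}[classic]
    return round(math.log2(factor * rtt_ratio))
```

   For example, k_prime(1, "reno") gives 1 (k=2), and k_prime(5,
   "cubic") gives 3 (k=8).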

1690	   For localized traffic from a particular ISP's data centre, we used
1691	   the measured RTTs to calculate that a value of k'=3 (equivalent to
1692	   k=8) would achieve throughput equivalence, and our experiments
1693	   verified the formula very closely.

1695	   For a typical mix of RTTs from local data centres and across the
1696	   general Internet, a value of k'=1 (equivalent to k=2) is recommended
1697	   as a good workable compromise.

1699	Appendix D.  Open Issues

1701	   Most of the following open issues are also tagged '{ToDo}' at the
1702	   appropriate point in the document:

1704	      Operational guidance to monitor L4S experiment

1706	      PI2 appendix: scaling of alpha & beta, esp. dependence of beta_U
1707	      on Tupdate

1709	      Curvy RED appendix: complete the unfinished parts

1711	Authors' Addresses

1713	   Koen De Schepper
1714	   Nokia Bell Labs
1715	   Antwerp
1716	   Belgium

1718	   Email: koen.de_schepper@nokia.com
1719	   URI:   https://www.bell-labs.com/usr/koen.de_schepper

1721	   Bob Briscoe (editor)
1722	   CableLabs
1723	   UK

1725	   Email: ietf@bobbriscoe.net
1726	   URI:   http://bobbriscoe.net/

1728	   Olga Bondarenko
1729	   Simula Research Lab
1730	   Lysaker
1731	   Norway

1733	   Email: olgabnd@gmail.com
1734	   URI:   https://www.simula.no/people/olgabo

1736	   Ing-jyh Tsang
1737	   Nokia
1738	   Antwerp
1739	   Belgium

1741	   Email: ing-jyh.tsang@nokia.com