idnits 2.17.1 

draft-ietf-tsvwg-aqm-dualq-coupled-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (March 5, 2018) is 2243 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '0' on line 1473

  -- Looks like a reference, but probably isn't: '1' on line 1473

  == Outdated reference: A later version (-02) exists of
     draft-briscoe-tsvwg-l4s-diffserv-00

  == Outdated reference: A later version (-29) exists of
     draft-ietf-tsvwg-ecn-l4s-id-02

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-02

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Transport Area working group (tsvwg)                      K. De Schepper
3	Internet-Draft                                           Nokia Bell Labs
4	Intended status: Experimental                            B. Briscoe, Ed.
5	Expires: September 6, 2018                                     CableLabs
6	                                                           O. Bondarenko
7	                                                     Simula Research Lab
8	                                                                I. Tsang
9	                                                                   Nokia
10	                                                           March 5, 2018

12	  DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput
13	                                 (L4S)
14	                 draft-ietf-tsvwg-aqm-dualq-coupled-04

16	Abstract

18	   Data Centre TCP (DCTCP) was designed to provide predictably low
19	   queuing latency, near-zero loss, and throughput scalability using
20	   explicit congestion notification (ECN) and an extremely simple
21	   marking behaviour on switches.  However, DCTCP does not co-exist with
22	   existing TCP traffic: DCTCP is so aggressive that existing TCP
23	   algorithms approach starvation.  So, until now, DCTCP could only be
24	   deployed where a clean-slate environment could be arranged, such as
25	   in private data centres.  This specification defines `DualQ Coupled
26	   Active Queue Management (AQM)' to allow scalable congestion controls
27	   like DCTCP to safely co-exist with classic Internet traffic.  The
28	   Coupled AQM ensures that a flow runs at about the same rate whether
29	   it uses DCTCP or TCP Reno/Cubic, but without inspecting transport
30	   layer flow identifiers.  When tested in a residential broadband
31	   setting, DCTCP achieved sub-millisecond average queuing delay and
32	   zero congestion loss under a wide range of mixes of DCTCP and
33	   `Classic' broadband Internet traffic, without compromising the
34	   performance of the Classic traffic.  The solution also reduces
35	   network complexity and eliminates network configuration.

37	Status of This Memo

39	   This Internet-Draft is submitted in full conformance with the
40	   provisions of BCP 78 and BCP 79.

42	   Internet-Drafts are working documents of the Internet Engineering
43	   Task Force (IETF).  Note that other groups may also distribute
44	   working documents as Internet-Drafts.  The list of current Internet-
45	   Drafts is at https://datatracker.ietf.org/drafts/current/.

47	   Internet-Drafts are draft documents valid for a maximum of six months
48	   and may be updated, replaced, or obsoleted by other documents at any
49	   time.  It is inappropriate to use Internet-Drafts as reference
50	   material or to cite them other than as "work in progress."

52	   This Internet-Draft will expire on September 6, 2018.

54	Copyright Notice

56	   Copyright (c) 2018 IETF Trust and the persons identified as the
57	   document authors.  All rights reserved.

59	   This document is subject to BCP 78 and the IETF Trust's Legal
60	   Provisions Relating to IETF Documents
61	   (https://trustee.ietf.org/license-info) in effect on the date of
62	   publication of this document.  Please review these documents
63	   carefully, as they describe your rights and restrictions with respect
64	   to this document.  Code Components extracted from this document must
65	   include Simplified BSD License text as described in Section 4.e of
66	   the Trust Legal Provisions and are provided without warranty as
67	   described in the Simplified BSD License.

69	Table of Contents

71	   1.  Introduction
72	     1.1.  Problem and Scope
73	     1.2.  Terminology
74	     1.3.  Features
75	   2.  DualQ Coupled AQM
76	     2.1.  Coupled AQM
77	     2.2.  Dual Queue
78	     2.3.  Traffic Classification
79	     2.4.  Overall DualQ Coupled AQM Structure
80	     2.5.  Normative Requirements for a DualQ Coupled AQM
81	       2.5.1.  Functional Requirements
82	       2.5.2.  Management Requirements
83	   3.  IANA Considerations
84	   4.  Security Considerations
85	     4.1.  Overload Handling
86	       4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput
87	               or Delay?
88	       4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or
89	               Delay?
90	       4.1.3.  Protecting against Unresponsive ECN-Capable Traffic
91	   5.  Acknowledgements
92	   6.  References
93	     6.1.  Normative References
94	     6.2.  Informative References
95	   Appendix A.  Example DualQ Coupled PI2 Algorithm
96	     A.1.  Pass #1: Core Concepts
97	     A.2.  Pass #2: Overload Details
98	   Appendix B.  Example DualQ Coupled Curvy RED Algorithm
99	   Appendix C.  Guidance on Controlling Throughput Equivalence
100	   Appendix D.  Open Issues
101	   Authors' Addresses

103	1.  Introduction

105	1.1.  Problem and Scope

107	   Latency is becoming the critical performance factor for many (most?)
108	   applications on the public Internet, e.g. interactive Web, Web
109	   services, voice, conversational video, interactive video, interactive
110	   remote presence, instant messaging, online gaming, remote desktop,
111	   cloud-based applications, and video-assisted remote control of
112	   machinery and industrial processes.  In the developed world, further
113	   increases in access network bit-rate offer diminishing returns,
114	   whereas latency is still a multi-faceted problem.  In the last decade
115	   or so, much has been done to reduce propagation time by placing
116	   caches or servers closer to users.  However, queuing remains a major
117	   component of latency.

119	   The Diffserv architecture provides Expedited Forwarding [RFC3246], so
120	   that low latency traffic can jump the queue of other traffic.
121	   However, on access links dedicated to individual sites (homes, small
122	   enterprises or mobile devices), often all traffic at any one time
123	   will be latency-sensitive and, if all the traffic on a link is marked
124	   as EF, Diffserv cannot reduce the delay of any of it.  In contrast,
125	   the Low Latency Low Loss Scalable throughput (L4S) approach removes
126	   the causes of any unnecessary queuing delay.

128	   The bufferbloat project has shown that excessively-large buffering
129	   (`bufferbloat') has been introducing significantly more delay than
130	   the underlying propagation time.  These delays appear only
131	   intermittently--only when a capacity-seeking (e.g.  TCP) flow is long
132	   enough for the queue to fill the buffer, making every packet in other
133	   flows sharing the buffer sit through the queue.

135	   Active queue management (AQM) was originally developed to solve this
136	   problem (and others).  Unlike Diffserv, which gives low latency to
137	   some traffic at the expense of others, AQM controls latency for _all_
138	   traffic in a class.  In general, AQMs introduce an increasing level
139	   of discard from the buffer the longer the queue persists above a
140	   shallow threshold.  This gives sufficient signals to capacity-seeking
141	   (aka. greedy) flows to keep the buffer empty for its intended
142	   purpose: absorbing bursts.  However, RED [RFC2309] and other
143	   algorithms from the 1990s were sensitive to their configuration and
144	   hard to set correctly.  So, AQM was not widely deployed.

146	   More recent state-of-the-art AQMs, e.g. fq_CoDel [RFC8290],
147	   PIE [RFC8033], Adaptive RED [ARED01], are easier to configure,
148	   because they define the queuing threshold in time not bytes, so it is
149	   invariant for different link rates.  However, no matter how good the
150	   AQM, the sawtoothing rate of TCP will either cause queuing delay to
151	   vary or cause the link to be under-utilized.  Even with a perfectly
152	   tuned AQM, the additional queuing delay will be of the same order as
153	   the underlying speed-of-light delay across the network.  Flow-queuing
154	   can isolate one flow from another, but it cannot isolate a TCP flow
155	   from the delay variations it inflicts on itself, and it has other
156	   problems - it overrides the flow rate decisions of variable rate
157	   video applications, it does not recognise the flows within IPSec VPN
158	   tunnels and it is relatively expensive to implement.

160	   It seems that further changes to the network alone will now yield
161	   diminishing returns.  Data Centre TCP (DCTCP [RFC8257]) teaches us
162	   that a small but radical change to TCP is needed to cut two major
163	   outstanding causes of queuing delay variability:

165	   1.  the `sawtooth' varying rate of TCP itself;

167	   2.  the smoothing delay deliberately introduced into AQMs to permit
168	       bursts without triggering losses.

170	   The former causes a flow's round trip time (RTT) to vary from about 1
171	   to 2 times the base RTT between the machines in question.  The latter
172	   delays the system's response to change by a worst-case
173	   (transcontinental) RTT, which could be hundreds of times the actual
174	   RTT of typical traffic from localized CDNs.

176	   Latency is not our only concern:

178	   3.  It was known when TCP was first developed that it would not scale
179	       to high bandwidth-delay products.

181	   Given regular broadband bit-rates over WAN distances are
182	   already [RFC3649] beyond the scaling range of `classic' TCP Reno,
183	   `less unscalable' Cubic [I-D.ietf-tcpm-cubic] and
184	   Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been
185	   successfully deployed.  However, these are now approaching their
186	   scaling limits.  Unfortunately, fully scalable TCPs such as DCTCP
187	   cause `classic' TCP to starve itself, which is why they have been
188	   confined to private data centres or research testbeds (until now).

190	   This document specifies a `DualQ Coupled AQM' extension that solves
191	   the problem of coexistence between scalable and classic flows,
192	   without having to inspect flow identifiers.  The AQM is not like
193	   flow-queuing approaches [RFC8290] that classify packets by flow
194	   identifier into numerous separate queues in order to isolate sparse
195	   flows from the higher latency in the queues assigned to heavier flows.
196	   In contrast, the AQM exploits the behaviour of scalable congestion
197	   controls like DCTCP so that every packet in every flow sharing the
198	   queue for DCTCP-like traffic can be served with very low latency.

200	   This AQM extension can be combined with any single queue AQM that
201	   generates a statistical or deterministic mark/drop probability driven
202	   by the queue dynamics.  In many cases it simplifies the basic control
203	   algorithm, and requires little extra processing.  Therefore it is
204	   believed the Coupled AQM would be applicable and easy to deploy in
205	   all types of buffers: buffers in cost-reduced mass-market residential
206	   equipment; buffers in end-system stacks; buffers in carrier-scale
207	   equipment including remote access servers, routers, firewalls and
208	   Ethernet switches; buffers in network interface cards; and buffers in
209	   virtualized network appliances, hypervisors, and so on.

211	   The overall L4S architecture is described in
212	   [I-D.ietf-tsvwg-l4s-arch].  The supporting papers [PI2] and [DCttH15]
213	   give the full rationale for the AQM's design, both discursively and
214	   in more precise mathematical form.

216	1.2.  Terminology

218	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
219	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
220	   document are to be interpreted as described in [RFC2119].  In this
221	   document, these words will appear with that interpretation only when
222	   in ALL CAPS.  Lower case uses of these words are not to be
223	   interpreted as carrying RFC-2119 significance.

225	   The DualQ Coupled AQM uses two queues for two services.  Each of the
226	   following terms identifies both the service and the queue that
227	   provides the service:

229	   Classic (denoted by subscript C):  The `Classic' service is intended
230	      for all the behaviours that currently co-exist with TCP Reno (TCP
231	      Cubic, Compound, SCTP, etc.).

233	   Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L):
234	      The `L4S' service is intended for a set of congestion controls
235	      with scalable properties such as DCTCP (e.g.
236	      Relentless [Mathis09]).

238	   Either service can cope with a proportion of unresponsive or less-
239	   responsive traffic as well (e.g.  DNS, VoIP, etc.), just as a single
240	   queue AQM can.  The DualQ Coupled AQM behaviour is similar to a
241	   single FIFO queue with respect to unresponsive and overload traffic.

243	1.3.  Features

245	   The AQM couples marking and/or dropping across the two queues such
246	   that a flow will get roughly the same throughput whichever it uses.
247	   Therefore both queues can feed into the full capacity of a link and
248	   no rates need to be configured for the queues.  The L4S queue enables
249	   scalable congestion controls like DCTCP to give very low and
250	   predictable queuing latency, without compromising the performance of
251	   competing 'Classic' Internet traffic.  Thousands of tests have been
252	   conducted in a typical fixed residential broadband setting.  Typical
253	   experiments used base round trip delays up to 100ms between the data
254	   centre and home network, and large amounts of background traffic in
255	   both queues.  For every L4S packet, the AQM kept the average queuing
256	   delay below 1ms (or 2 packets if serialization delay is bigger for
257	   slow links), and no losses at all were introduced by the AQM.
258	   Details of the extensive experiments will be made available [PI2]
259	   [DCttH15].

261	   Subjective testing was also conducted using a demanding panoramic
262	   interactive video application run over a stack with DCTCP enabled and
263	   deployed on the testbed.  Each user could pan or zoom their own high
264	   definition (HD) sub-window of a larger video scene from a football
265	   match.  Even though the user was also downloading large amounts of
266	   L4S and Classic data, latency was so low that the picture appeared to
267	   stick to their finger on the touchpad (all the L4S data achieved the
268	   same ultra-low latency).  With an alternative AQM, the video
269	   noticeably lagged behind the finger gestures.

271	   Unlike Diffserv Expedited Forwarding, the L4S queue does not have to
272	   be limited to a small proportion of the link capacity in order to
273	   achieve low delay.  The L4S queue can be filled with a heavy load of
274	   capacity-seeking flows like DCTCP and still achieve low delay.  The
275	   L4S queue does not rely on the presence of other traffic in the
276	   Classic queue that can be 'overtaken'.  It gives low latency to L4S
277	   traffic whether or not there is Classic traffic, and the latency of
278	   Classic traffic does not suffer when a proportion of the traffic is
279	   L4S.  The two queues are only necessary because DCTCP-like flows
280	   cannot keep latency predictably low and keep utilization high if they
281	   are mixed with legacy TCP flows.

283	   The experiments used the Linux implementation of DCTCP that is
284	   deployed in private data centres, without any modification despite
285	   its known deficiencies.  Nonetheless, certain modifications will be
286	   necessary before DCTCP is safe to use on the Internet, which are
287	   recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id].  However, the
288	   focus of this specification is to get the network service in place.
289	   Then, without any management intervention, applications can exploit
290	   it by migrating to scalable controls like DCTCP, which can then
291	   evolve _while_ their benefits are being enjoyed by everyone on the
292	   Internet.

294	2.  DualQ Coupled AQM

296	   There are two main aspects to the approach:

298	   o  the Coupled AQM that addresses throughput equivalence between
299	      Classic (e.g.  Reno, Cubic) flows and L4S (e.g.  DCTCP) flows

301	   o  the Dual Queue structure that provides latency separation for L4S
302	      flows to isolate them from the typically large Classic queue.

304	2.1.  Coupled AQM

306	   In the 1990s, the `TCP formula' was derived for the relationship
307	   between TCP's congestion window, cwnd, and its drop probability, p.
308	   To a first order approximation, cwnd of TCP Reno is inversely
309	   proportional to the square root of p.

311	   TCP Cubic implements a Reno-compatibility mode, which is the only
312	   relevant mode for typical RTTs under 20ms as long as the throughput
313	   of a single flow is less than about 500Mb/s.  Therefore it can be
314	   assumed that Cubic traffic behaves similarly to Reno (but with a
315	   slightly different constant of proportionality), and the term
316	   'Classic' will be used for the collection of Reno-friendly traffic
317	   including Cubic in Reno mode.

319	   The supporting paper [PI2] includes the derivation of the equivalent
320	   rate equation for DCTCP, for which cwnd is inversely proportional to
321	   p (not the square root), where in this case p is the ECN marking
322	   probability.  DCTCP is not the only congestion control that behaves
323	   like this, so the term 'L4S' traffic will be used for all similar
324	   behaviour.

326	   In order to make a DCTCP flow run at roughly the same rate as a Reno
327	   TCP flow (all other factors being equal), the drop or marking
328	   probability for Classic traffic, p_C, has to be distinct from the
329	   marking probability for L4S traffic, p_L (in contrast to RFC3168,
330	   which requires them to be the same).  It is necessary to make the
331	   Classic drop probability p_C proportional to the square of the L4S
332	   marking probability p_L.  This makes the Reno flow rate roughly equal
333	   to the DCTCP flow rate: the Reno rate varies inversely with the
334	   square root of p_C, so it varies inversely with p_L itself, as the
335	   DCTCP rate does.

337	   Stating this as a formula, the relation between Classic drop
338	   probability, p_C, and L4S marking probability, p_L, needs to take the
339	   form:

341	       p_C = ( p_L / k )^2                  (1)

343	   where k is the constant of proportionality.
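
   As an illustrative sketch (not part of the specification), equation
   (1) can be checked numerically against the two first-order window
   relationships described above.  The constants 1.22 and 2 are assumed
   approximations for Reno and DCTCP, and all function names are
   hypothetical; only the inverse-square-root and inverse-linear
   dependence on p comes from the text.

```python
import math

RENO_CONST = 1.22   # assumed constant in cwnd ~ RENO_CONST / sqrt(p_C)
DCTCP_CONST = 2.0   # assumed constant in cwnd ~ DCTCP_CONST / p_L


def classic_prob(p_l, k=2.0):
    """Equation (1): Classic drop probability from L4S marking probability."""
    return (p_l / k) ** 2


def reno_cwnd(p_c):
    """First-order Reno window: inversely proportional to sqrt(p_C)."""
    return RENO_CONST / math.sqrt(p_c)


def dctcp_cwnd(p_l):
    """First-order DCTCP window: inversely proportional to p_L."""
    return DCTCP_CONST / p_l


def window_ratio(p_l, k=2.0):
    """Ratio of Reno to DCTCP window when p_C is coupled to p_L via (1)."""
    return reno_cwnd(classic_prob(p_l, k)) / dctcp_cwnd(p_l)


if __name__ == "__main__":
    for p_l in (0.01, 0.05, 0.2):
        print(p_l, classic_prob(p_l), window_ratio(p_l))
```

   Under these assumed constants the ratio does not depend on p_L.  The
   squaring removes the dependence on the congestion level and leaves
   only a constant factor that is set by the choice of k.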

345	2.2.  Dual Queue

347	   Classic traffic typically builds a large queue to prevent under-
348	   utilization.  Therefore a separate queue is provided for L4S traffic,
349	   and it is scheduled with priority over Classic.  Priority is
350	   conditional to prevent starvation of Classic traffic.

352	   Nonetheless, coupled marking ensures that giving priority to L4S
353	   traffic still leaves the right amount of spare scheduling time for
354	   Classic flows to each get equivalent throughput to DCTCP flows (all
355	   other factors such as RTT being equal).  The algorithm achieves this
356	   without having to inspect flow identifiers.

358	2.3.  Traffic Classification

360	   Both the Coupled AQM and DualQ mechanisms need an identifier to
361	   distinguish L and C packets.  A separate draft
362	   [I-D.ietf-tsvwg-ecn-l4s-id] recommends using the ECT(1) codepoint of
363	   the ECN field as this identifier, having assessed various
364	   alternatives.  An additional process document has proved necessary to
365	   make the ECT(1) codepoint available for experimentation [RFC8311].

367	   In addition (not instead), other identifiers could be used to
368	   classify certain additional packet types into the L queue, that are
369	   deemed not to risk harming the L4S service.  For instance addresses
370	   of specific applications or hosts (see [I-D.ietf-tsvwg-ecn-l4s-id]),
371	   specific Diffserv codepoints such as EF (Expedited Forwarding), CS5
372	   (Application Signalling) and Voice-Admit service classes (see
373	   [I-D.briscoe-tsvwg-l4s-diffserv]) or certain protocols (e.g.  ARP,
374	   DNS).

376	   Note that the DualQ Coupled AQM only reads these classifiers, it MUST
377	   NOT re-mark or alter these identifiers (except for marking the ECN
378	   field with the CE codepoint - with increasing frequency to indicate
379	   increasing congestion).
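
   A minimal sketch of such a read-only classifier follows.  The
   function name, the optional DSCP allow-list and the ce_is_l4s switch
   are assumptions made for illustration; the text above only names
   ECT(1) as the L4S identifier and leaves additional classifiers to
   the operator.  The two-bit ECN values are those of RFC 3168.

```python
# ECN field codepoints (two low-order bits of the former ToS octet).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def classify(ecn, dscp=None, extra_l4s_dscps=frozenset(), ce_is_l4s=True):
    """Return 'L' or 'C' for a packet without altering any header field.

    ecn             -- two-bit ECN field of the packet
    dscp            -- optional Diffserv codepoint of the packet
    extra_l4s_dscps -- hypothetical operator-supplied DSCPs sent to L
    ce_is_l4s       -- assumption: treat already-marked CE packets as L
    """
    if ecn == ECT1:
        return "L"
    if ecn == CE and ce_is_l4s:
        return "L"
    if dscp is not None and dscp in extra_l4s_dscps:
        return "L"
    return "C"


if __name__ == "__main__":
    print(classify(ECT1))                                   # L
    print(classify(ECT0))                                   # C
    print(classify(NOT_ECT, dscp=46, extra_l4s_dscps={46}))  # L
```

   The function only reads its inputs, reflecting the requirement that
   the AQM does not re-mark the identifiers it classifies on.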

381	2.4.  Overall DualQ Coupled AQM Structure

383	   Figure 1 shows the overall structure that any DualQ Coupled AQM is
384	   likely to have.  This schematic is intended to aid understanding of
385	   the current designs of DualQ Coupled AQMs.  However, it is not
386	   intended to preclude other innovative ways of satisfying the
387	   normative requirements in Section 2.5 that minimally define a DualQ
388	   Coupled AQM.

390	   The classifier on the left separates incoming traffic between the two
391	   queues (L and C).  Each queue has its own AQM that determines the
392	   likelihood of dropping or marking (p_L and p_C).  Nonetheless, the
393	   AQM for Classic traffic is implemented in two stages: i) a base stage
394	   that outputs an internal probability p' (pronounced p-prime); and ii)
395	   a squaring stage that outputs p_C, where

397	       p_C = (p')^2.                        (2)

399	   This allows p_L to be coupled to p_C by marking L4S traffic
400	   proportionately to the intermediate output from the first stage.
401	   Specifically, the output of the base AQM, p', is scaled by the
402	   coupling factor before being applied to the L queue:

404	       p_CL = k*p',                         (3)

406	   where k is the constant coupling factor (see Appendix C) and p_CL is
407	   the output from the coupling between the C queue and the L queue.

409	   It can be seen in the following that these two transformations of p'
410	   implement the required coupling given in equation (1) earlier.
411	   Substituting for p' from equation (3) into (2):

413	      p_C = ( p_CL / k )^2.

415	   The actual L4S marking probability p_L is the maximum of the coupled
416	   output (p_CL) and the output of a native L4S AQM (p'L), shown as
417	   '(MAX)' in the schematic.  While the output of the Native L4S AQM is
418	   high (p'L > p_CL) it will dominate the way L traffic is marked.  When
419	   the native L4S AQM output is lower, the way L traffic is marked will
420	   be driven by the coupling, that is p_L = p_CL.  So, whenever the
421	   coupling is needed, as required from equation (1):

423	      p_C = ( p_L / k )^2.

425	                           _________
426	                                  | |    ,------.
427	                        L4S queue | |===>| ECN  |
428	                       ,'| _______|_|    |marker|\
429	                     <'  |         |     `------'\\
430	                      //`'         v        ^ p_L \\
431	                     //        ,-------.    |      \\
432	                    //         |Native |p'L |       \\,.
433	                   //          |  L4S  |-->(MAX)    <  |   ___
434	      ,----------.//           |  AQM  |    ^ p_CL   `\|.'Cond-`.
435	      |  IP-ECN  |/            `-------'    |          / itional \
436	   ==>|Classifier|             ,-------.  (k*p')       [ priority]==>
437	      |          |\            |  Base |    |          \scheduler/
438	      `----------'\\           |  AQM  |--->:        ,'|`-.___.-'
439	                   \\          |       |p'  |      <'  |
440	                    \\         `-------'  (p'^2)    //`'
441	                     \\            ^        |      //
442	                      \\,.         |        v p_C //
443	                      <  | _________     .------.//
444	                       `\|   |      |    | Drop |/
445	                     Classic |queue |===>|/mark |
446	                           __|______|    `------'

448	   Legend: ===> traffic flow; ---> control dependency.

450	                   Figure 1: DualQ Coupled AQM Schematic
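
   The relationships in equations (2) and (3) and the '(MAX)' stage of
   Figure 1 can be sketched as a single function.  This is only an
   illustration: the function name is hypothetical, and clamping the
   coupled output to 1 is an assumption added here so that the result
   remains a valid probability.

```python
def dualq_probabilities(p_prime, p_native_l, k=2.0):
    """Map base AQM output p' and native L4S AQM output p'L to (p_L, p_C).

    p_C  = (p')^2               -- equation (2)
    p_CL = k * p'               -- equation (3), clamped to 1 (assumption)
    p_L  = max(p'L, p_CL)       -- the (MAX) stage in Figure 1
    """
    if not 0.0 <= p_prime <= 1.0 or not 0.0 <= p_native_l <= 1.0:
        raise ValueError("probabilities must lie in [0, 1]")
    p_c = p_prime ** 2
    p_cl = min(k * p_prime, 1.0)
    p_l = max(p_native_l, p_cl)
    return p_l, p_c


if __name__ == "__main__":
    # Coupling dominates: the native L4S AQM is idle.
    print(dualq_probabilities(0.1, 0.0))
    # Native L4S AQM dominates: L traffic alone is building a queue.
    print(dualq_probabilities(0.1, 0.5))
```

   When the coupled term dominates, the returned pair satisfies
   p_C = (p_L / k)^2, which is equation (1).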

452	   After the AQMs have applied their dropping or marking, the scheduler
453	   forwards their packets to the link, giving priority to L4S traffic.
454	   Priority has to be conditional in some way (see Section 4.1).  Simple
455	   strict priority is inappropriate, because it could lead the L4S
456	   queue to starve the Classic queue.  For example, consider the case
457	   where a continually busy L4S queue blocks a DNS request in the
458	   Classic queue, arbitrarily delaying the start of a new Classic flow.
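
   One possible way to make priority conditional is a time-shifted
   FIFO, in which the L4S queue is served first unless the packet at
   the head of the Classic queue has waited more than a fixed
   time-shift longer than the packet at the head of the L4S queue.  The
   sketch below assumes that rule, uses integer milliseconds for
   timestamps, and uses hypothetical class and method names; it is not
   taken from the normative text.

```python
from collections import deque


class TimeShiftedFifo:
    """Conditional-priority scheduler sketch; times are integer ms."""

    def __init__(self, tshift_ms):
        self.tshift_ms = tshift_ms
        self.lq = deque()  # (arrival_ms, packet) for the L4S queue
        self.cq = deque()  # (arrival_ms, packet) for the Classic queue

    def enqueue(self, packet, is_l4s, now_ms):
        (self.lq if is_l4s else self.cq).append((now_ms, packet))

    def dequeue(self, now_ms):
        """Return the next packet to send, or None if both queues are empty."""
        if not self.lq and not self.cq:
            return None
        if not self.cq:
            return self.lq.popleft()[1]
        if not self.lq:
            return self.cq.popleft()[1]
        l_wait = now_ms - self.lq[0][0]
        c_wait = now_ms - self.cq[0][0]
        # L4S wins unless the Classic head has waited more than
        # tshift_ms longer than the L4S head.
        if l_wait + self.tshift_ms >= c_wait:
            return self.lq.popleft()[1]
        return self.cq.popleft()[1]
```

   With a 50 ms time-shift, a Classic packet that has waited 100 ms is
   served ahead of an L4S packet that has waited 10 ms, so a busy L4S
   queue cannot delay the Classic queue indefinitely.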

460	   Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are
461	   given in Appendix A and Appendix B.  Either example AQM can be used
462	   to couple packet marking and dropping across a dual queue.

464	   DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM.
465	   Indeed, this Base AQM with just the squared output and no L4S queue
466	   can be used as a drop-in replacement for PIE [RFC8033], in which case
467	   we call it just PI2 [PI2].  PI2 is a principled simplification of PIE
468	   that is both more responsive and more stable in the face of
469	   dynamically varying load.

471	   Curvy RED is derived from RED [RFC2309], but its configuration
472	   parameters are insensitive to link rate and it requires fewer
473	   operations per packet.  However, DualPI2 is more responsive and
474	   stable over a wider range of RTTs than Curvy RED.  As a consequence,
475	   DualPI2 has attracted more development attention than Curvy RED,
476	   leaving the Curvy RED design incomplete and less fully evaluated.

478	   Both AQMs regulate their queue in units of time not bytes.  As
479	   already explained, this ensures configuration can be invariant for
480	   different drain rates.  With AQMs in a dualQ structure this is
481	   particularly important because the drain rate of each queue can vary
482	   rapidly as flows for the two queues arrive and depart, even if the
483	   combined link rate is constant.

485	   It would be possible to control the queues with alternative
486	   AQMs, as long as the normative requirements (those expressed in
487	   capitals) in Section 2.5 are observed.

489	2.5.  Normative Requirements for a DualQ Coupled AQM

491	   The following requirements are intended to capture only the essential
492	   aspects of a DualQ Coupled AQM.  They are intended to be independent
493	   of the particular AQMs used for each queue.

495	2.5.1.  Functional Requirements

497	   In the Dual Queue, L4S packets MUST be given priority over Classic,
498	   although priority MUST be bounded in order not to starve Classic
499	   traffic.

501	   All L4S traffic MUST be ECN-capable.  Some Classic traffic might also
502	   be ECN-capable.

504	   Whatever identifier is used for L4S experiments,
505	   [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on
506	   L4S traffic, relative to drop of Classic traffic.  In order to
507	   prevent starvation of Classic traffic by scalable L4S traffic, it
508	   says, "The likelihood that an AQM drops a Not-ECT Classic packet
509	   (p_C) MUST be roughly proportional to the square of the likelihood
510	   that it would have marked it if it had been an L4S packet (p_L)."  In
511	   other words, in any DualQ Coupled AQM, the power to which p_L is
512	   raised in Eqn. (1) MUST be 2.  The term 'likelihood' is used to allow
513	   for marking and dropping to be either probabilistic or deterministic.

515	   The constant of proportionality, k, in Eqn. (1) determines the
516	   relative flow rates of Classic and L4S flows when the AQM concerned
517	   is the bottleneck (all other factors being equal).
518	   [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality
519	   (k) does not have to be standardised for interoperability, but a
520	   value of 2 is RECOMMENDED."

522	   Assuming scalable congestion controls for the Internet will be as
523	   aggressive as DCTCP, this will ensure their congestion window will be
524	   roughly the same as that of a standards track TCP congestion control
525	   (Reno) [RFC5681] and other so-called TCP-friendly controls, such as
526	   TCP Cubic in its TCP-friendly mode.

528	   {ToDo: The requirements for scalable congestion controls on the
529	   Internet (termed the TCP Prague requirements)
530	   [I-D.ietf-tsvwg-ecn-l4s-id] are not necessarily final.  If the
531	   aggressiveness of DCTCP is not defined as the benchmark for scalable
532	   controls on the Internet, the recommended value of k will also be
533	   subject to change.}

535	   The choice of k is a matter of operator policy, and operators MAY
536	   choose a different value using Table 1 and the guidelines in
537	   Appendix C.

539	   If multiple users share capacity at a bottleneck (e.g. in the
540	   Internet access link of a campus network), the operator's choice of k
541	   will determine capacity sharing between the flows of different users.
542	   However, on the public Internet, access network operators typically
543	   isolate customers from each other with some form of layer-2
544	   multiplexing (TDM in DOCSIS, CDMA in 3G) or L3 scheduling (WRR in
545	   DSL), rather than relying on TCP to share capacity between customers
546	   [RFC0970].  In such cases, the choice of k will solely affect
547	   relative flow rates within each customer's access capacity, not
548	   between customers.  Also, k will not affect relative flow rates at
549	   any times when all flows are Classic or all L4S, and it will not
550	   affect small flows.

552	2.5.2.  Management Requirements

554	   By default, a DualQ Coupled AQM SHOULD NOT need any configuration for
555	   use at a bottleneck on the public Internet [RFC7567].  The following
556	   parameters MAY be operator-configurable, e.g. to tune for non-
557	   Internet settings:

559	   o  Optional packet classifier(s) to use in addition to the ECN field
560	      (see Section 2.3);

562	   o  Expected typical RTT (a parameter for typical or target queuing
563	      delay in each queue might be configurable instead);

565	   o  Expected maximum RTT (a stability parameter that depends on
566	      maximum RTT might be configurable instead);

568	   o  Coupling factor, k;

570	   o  The limit to the conditional priority of L4S (scheduler-dependent,
571	      e.g. the scheduler weight for WRR, or the time-shift for time-
572	      shifted FIFO);

574	   o  The maximum Classic ECN marking probability, p_Cmax, before
575	      switching over to drop.

577	   An experimental DualQ Coupled AQM SHOULD allow the operator to
578	   monitor the following operational statistics:

580	   o  Bits forwarded (total and per queue per sample interval), from
581	      which utilization can be calculated

583	   o  Q delay (per queue over sample interval)

585	   o  Total packets arriving, enqueued and dequeued (per queue per
586	      sample interval)

588	   o  ECN packets marked, non-ECN packets dropped, ECN packets dropped
589	      (per queue per sample interval), from which marking and dropping
590	      probabilities can be calculated

592	   o  Time and duration of each overload event.

594	   The type of statistics produced for variables like Q delay (mean,
595	   percentiles, etc.) will depend on implementation constraints.

597	3.  IANA Considerations

599	   This specification contains no IANA considerations.

601	4.  Security Considerations

603	4.1.  Overload Handling

605	   Where the interests of users or flows might conflict, it could be
606	   necessary to police traffic to isolate any harm to the performance of
607	   individual flows.  However, it is hard to avoid unintended
608	   side-effects with policing, and in a trusted environment policing
609	   is not necessary.  Therefore, per-flow policing needs to be separable
610	   from a basic AQM, as an option under policy control.

612	   However, a basic DualQ AQM does at least need to handle overload.  A
613	   useful objective would be for the overload behaviour of the DualQ AQM
614	   to be at least no worse than a single queue AQM.  However, a trade-
615	   off needs to be made between complexity and the risk of either
616	   traffic class harming the other.  In each of the following three
617	   subsections, an overload issue specific to the DualQ is described,
618	   followed by proposed solution(s).

620	   Under overload the higher priority L4S service will have to sacrifice
621	   some aspect of its performance.  Alternative solutions are provided
622	   below that each relax a different factor: e.g. throughput, delay,
623	   drop.  Some of these choices might need to be determined by operator
624	   policy or by the developer, rather than by the IETF. {ToDo: Reach
625	   consensus on which it is to be in each case.}

627	4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay?

629	   Priority of L4S is required to be conditional to avoid total
630	   throughput starvation of Classic by heavy L4S traffic.  This raises
631	   the question of whether to sacrifice L4S throughput or L4S delay (or
632	   some other policy) to mitigate starvation of Classic:

634	   Sacrifice L4S throughput:   By using weighted round robin as the
635	      conditional priority scheduler, the L4S service can sacrifice some
636	      throughput during overload to guarantee a minimum throughput
637	      service for Classic traffic.  The scheduling weight of the Classic
638	      queue should be small (e.g. 1/16).  Then, in most traffic
639	      scenarios the scheduler will not interfere, and it will not need
640	      to, because the coupling mechanism and the end-systems will share
641	      out the capacity across both queues as if it were a single pool.
642	      However, because the congestion coupling only applies in one
643	      direction (from C to L), if L4S traffic is over-aggressive or
644	      unresponsive, the scheduler weight for Classic traffic will at
645	      least be large enough to ensure it does not starve.

647	      In cases where the ratio of L4S to Classic flows (e.g. 19:1) is
648	      greater than the ratio of their scheduler weights (e.g. 15:1), the
649	      L4S flows will get less than an equal share of the capacity, but
650	      only slightly.  For instance, with the example numbers given, each
651	      L4S flow will get (15/16)/19 = 4.9% when ideally each would get
652	      1/20=5%. In the rather specific case of an unresponsive flow
653	      taking up a large part of the capacity set aside for L4S, using
654	      WRR could significantly reduce the capacity left for any
655	      responsive L4S flows.

657	   Sacrifice L4S Delay:  To control milder overload of responsive
658	      traffic, particularly when close to the maximum congestion signal,
659	      the operator could choose to control overload of the Classic queue
660	      by allowing some delay to 'leak' across to the L4S queue.  The
661	      scheduler can be made to behave like a single First-In First-Out
662	      (FIFO) queue with different service times by implementing a very
663	      simple conditional priority scheduler that could be called a
664	      "time-shifted FIFO" (see the Modifier Earliest Deadline First
665	      (MEDF) scheduler of [MEDF]).  This scheduler adds tshift to the
666	      queue delay of the next L4S packet, before comparing it with the
667	      queue delay of the next Classic packet, then it selects the packet
668	      with the greater adjusted queue delay.  Under regular conditions,
669	      this time-shifted FIFO scheduler behaves just like a strict
670	      priority scheduler.  But under moderate or high overload it
671	      prevents starvation of the Classic queue, because the time-shift
672	      (tshift) defines the maximum extra queuing delay of Classic
673	      packets relative to L4S.

675	   The example implementation in Appendix A can implement either policy.
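
   The time-shifted FIFO selection rule described above could be
   sketched in Python as follows.  The function name, the use of None
   for an empty queue, and the tie-break in favour of L4S are
   assumptions made for illustration; the scheduler of the example
   implementation is not shown in this document.

```python
from typing import Optional


def time_shifted_fifo(l4s_delay: Optional[float],
                      classic_delay: Optional[float],
                      tshift: float) -> Optional[str]:
    """Choose which queue to serve next: "L", "C" or None.

    l4s_delay and classic_delay are the queue delays (in seconds) of the
    head packet of each queue, or None if that queue is empty (assumed
    convention).  tshift is added to the L4S delay before comparison.
    """
    if l4s_delay is None and classic_delay is None:
        return None                      # nothing to dequeue
    if l4s_delay is None:
        return "C"
    if classic_delay is None:
        return "L"
    # Select the packet with the greater adjusted queue delay.
    # Ties go to L4S (an assumption; the text does not specify this).
    return "L" if l4s_delay + tshift >= classic_delay else "C"
```

   For example, with an assumed tshift of 50 ms, an L4S head packet
   that has waited 1 ms is served ahead of a Classic head packet that
   has waited 20 ms, but not ahead of one that has waited 60 ms.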

677	4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or Delay?

679	   To keep the throughput of both L4S and Classic flows roughly equal
680	   over the full load range, a different control strategy needs to be
681	   defined above the point where one AQM first saturates to a
682	   probability of 100% leaving no room to push back the load any harder.
683	   If k>1, L4S will saturate first, but saturation can be caused by
684	   unresponsive traffic in either queue.
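
   The point at which coupled L4S marking saturates can be worked out
   from the coupling.  The Python sketch below assumes p_CL = k * p and
   p_C = p^2 as in Figure 5, and ignores the native L4S AQM.  The
   function name is hypothetical.

```python
def classic_prob_at_l4s_saturation(k: float) -> float:
    """Classic probability p_C at which coupled L4S marking reaches 100%.

    With p_CL = k * p, L4S marking saturates at base probability
    p = 1 / k, where p_C = p^2 = 1 / k^2.  For k <= 1 the coupled L4S
    probability cannot reach 100% before the Classic probability does,
    so 1.0 is returned.
    """
    if k <= 0.0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / (k * k))
```

   With the recommended k = 2, this gives a Classic probability of 25%,
   which matches the default p_Cmax = 1/4 in Figure 2.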

686	   The term 'unresponsive' includes cases where a flow becomes
687	   temporarily unresponsive, for instance, a real-time flow that takes a
688	   while to adapt its rate in response to congestion, or a TCP-like flow
689	   that is normally responsive, but above a certain congestion level it
690	   will not be able to reduce its congestion window below the minimum of
691	   2 segments, effectively becoming unresponsive.  (Note that L4S
692	   traffic ought to remain responsive below a window of 2 segments (see
693	   [I-D.ietf-tsvwg-ecn-l4s-id]).)

695	   Saturation raises the question of whether to relieve congestion by
696	   introducing some drop into the L4S queue or by allowing delay to grow
697	   in both queues (which could eventually lead to tail drop too):

699	   Drop on Saturation:  Saturation can be avoided by setting a maximum
700	      threshold for L4S ECN marking (assuming k>1) before saturation
701	      starts to make the flow rates of the different traffic types
702	      diverge.  Above that threshold, the drop probability of Classic
703	      traffic is applied to all packets of all traffic types.
704	      Experiments have shown that queueing delay can then be kept at
705	      the target in any overload situation, including with unresponsive
706	      traffic, and no further measures are required.

708	   Delay on Saturation:  When L4S marking saturates, instead of
709	      switching to drop, the drop and marking probabilities could be
710	      capped.  Beyond that, delay will grow either solely in the queue
711	      with unresponsive traffic (if WRR is used), or in both queues (if
712	      time-shifted FIFO is used).  In either case, the higher delay
713	      ought to control temporary high congestion.  If the overload is
714	      more persistent, eventually the combined DualQ will overflow and
715	      tail drop will control congestion.

717	   The example implementation in Appendix A applies only the "drop on
718	   saturation" policy.

720	4.1.3.  Protecting against Unresponsive ECN-Capable Traffic

722	   Unresponsive traffic has a greater advantage if it is also ECN-
723	   capable.  The advantage is undetectable at normal low levels of drop/
724	   marking, but it becomes significant with the higher levels of drop/
725	   marking typical during overload.  This is an issue whether the ECN-
726	   capable traffic is L4S or Classic.

728	   This raises the question of whether and when to switch off ECN
729	   marking and use solely drop instead, as required by both Section 7 of
730	   [RFC3168] and Section 4.2.1 of [RFC7567].

732	   Experiments with the DualPI2 AQM (Appendix A) have shown that
733	   introducing 'drop on saturation' at 100% L4S marking addresses this
734	   problem with unresponsive ECN as well as addressing the saturation
735	   problem.  It leaves only a small range of congestion levels where
736	   unresponsive traffic gains any advantage from using the ECN
737	   capability, and the advantage is hardly detectable [DualQ-Test].

739	5.  Acknowledgements

741	   Thanks to Anil Agarwal, Sowmini Varadhan and Gabi Bracha for
742	   detailed review comments particularly of the appendices and
743	   suggestions on how to make our explanation clearer.  Thanks also to
744	   Greg White and Tom Henderson for insights on the choice of schedulers
745	   and queue delay measurement techniques.

747	   The authors' contributions were originally part-funded by the
748	   European Community under its Seventh Framework Programme through the
749	   Reducing Internet Transport Latency (RITE) project (ICT-317700).  Bob
750	   Briscoe's contribution was also part-funded by the Research Council
751	   of Norway through the TimeIn project.  The views expressed here are
752	   solely those of the authors.

754	6.  References
755	6.1.  Normative References

757	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
758	              Requirement Levels", BCP 14, RFC 2119,
759	              DOI 10.17487/RFC2119, March 1997,
760	              <https://www.rfc-editor.org/info/rfc2119>.

762	6.2.  Informative References

764	   [ARED01]   Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An
765	              Algorithm for Increasing the Robustness of RED's Active
766	              Queue Management", ACIRI Technical Report , August 2001,
767	              <http://www.icir.org/floyd/red.html>.

769	   [CoDel]    Nichols, K. and V. Jacobson, "Controlling Queue Delay",
770	              ACM Queue 10(5), May 2012,
771	              <http://queue.acm.org/issuedetail.cfm?issue=2208917>.

773	   [CRED_Insights]
774	              Briscoe, B., "Insights from Curvy RED (Random Early
775	              Detection)", BT Technical Report TR-TUB8-2015-003, July
776	              2015,
777	              <http://www.bobbriscoe.net/projects/latency/credi_tr.pdf>.

779	   [DCttH15]  De Schepper, K., Bondarenko, O., Briscoe, B., and I.
780	              Tsang, "`Data Centre to the Home': Ultra-Low Latency for
781	              All", 2015, <http://www.bobbriscoe.net/projects/latency/
782	              dctth_preprint.pdf>.

784	              (Under submission)

786	   [DualQ-Test]
787	              Steen, H., "Destruction Testing: Ultra-Low Delay using
788	              Dual Queue Coupled Active Queue Management", Masters
789	              Thesis, Dept of Informatics, Uni Oslo , May 2017.

791	   [I-D.briscoe-tsvwg-l4s-diffserv]
792	              Briscoe, B., "Interactions between Low Latency, Low Loss,
793	              Scalable Throughput (L4S) and Differentiated Services",
794	              draft-briscoe-tsvwg-l4s-diffserv-00 (work in progress),
795	              March 2018.

797	   [I-D.ietf-tcpm-cubic]
798	              Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
799	              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
800	              draft-ietf-tcpm-cubic-07 (work in progress), November
801	              2017.

803	   [I-D.ietf-tsvwg-ecn-l4s-id]
804	              Schepper, K., Briscoe, B., and I. Tsang, "Identifying
805	              Modified Explicit Congestion Notification (ECN) Semantics
806	              for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s-
807	              id-02 (work in progress), March 2018.

809	   [I-D.ietf-tsvwg-l4s-arch]
810	              Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency,
811	              Low Loss, Scalable Throughput (L4S) Internet Service:
812	              Architecture", draft-ietf-tsvwg-l4s-arch-02 (work in
813	              progress), March 2018.

815	   [I-D.sridharan-tcpm-ctcp]
816	              Sridharan, M., Tan, K., Bansal, D., and D. Thaler,
817	              "Compound TCP: A New TCP Congestion Control for High-Speed
818	              and Long Distance Networks", draft-sridharan-tcpm-ctcp-02
819	              (work in progress), November 2008.

821	   [Mathis09]
822	              Mathis, M., "Relentless Congestion Control", PFLDNeT'09 ,
823	              May 2009, <http://www.hpcc.jp/pfldnet2009/
824	              Program_files/1569198525.pdf>.

826	   [MEDF]     Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a
827	              simple scheduling algorithm for two real-time transport
828	              service classes with application in the UTRAN", Proc. IEEE
829	              Conference on Computer Communications (INFOCOM'03) Vol.2
830	              pp.1116-1122, March 2003.

832	   [PI2]      De Schepper, K., Bondarenko, O., Briscoe, B., and I.
833	              Tsang, "PI2: A Linearized AQM for both Classic and
834	              Scalable TCP", ACM CoNEXT'16 , December 2016,
835	              <https://riteproject.files.wordpress.com/2015/10/
836	              pi2_conext.pdf>.

838	              (To appear)

840	   [RFC0970]  Nagle, J., "On Packet Switches With Infinite Storage",
841	              RFC 970, DOI 10.17487/RFC0970, December 1985,
842	              <https://www.rfc-editor.org/info/rfc970>.

844	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
845	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
846	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
847	              S., Wroclawski, J., and L. Zhang, "Recommendations on
848	              Queue Management and Congestion Avoidance in the
849	              Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998,
850	              <https://www.rfc-editor.org/info/rfc2309>.

852	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
853	              of Explicit Congestion Notification (ECN) to IP",
854	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
855	              <https://www.rfc-editor.org/info/rfc3168>.

857	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
858	              J., Courtney, W., Davari, S., Firoiu, V., and D.
859	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
860	              Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002,
861	              <https://www.rfc-editor.org/info/rfc3246>.

863	   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
864	              RFC 3649, DOI 10.17487/RFC3649, December 2003,
865	              <https://www.rfc-editor.org/info/rfc3649>.

867	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
868	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
869	              <https://www.rfc-editor.org/info/rfc5681>.

871	   [RFC7567]  Baker, F., Ed. and G. Fairhurst, Ed., "IETF
872	              Recommendations Regarding Active Queue Management",
873	              BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015,
874	              <https://www.rfc-editor.org/info/rfc7567>.

876	   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
877	              "Proportional Integral Controller Enhanced (PIE): A
878	              Lightweight Control Scheme to Address the Bufferbloat
879	              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
880	              <https://www.rfc-editor.org/info/rfc8033>.

882	   [RFC8034]  White, G. and R. Pan, "Active Queue Management (AQM) Based
883	              on Proportional Integral Controller Enhanced PIE) for
884	              Data-Over-Cable Service Interface Specifications (DOCSIS)
885	              Cable Modems", RFC 8034, DOI 10.17487/RFC8034, February
886	              2017, <https://www.rfc-editor.org/info/rfc8034>.

888	   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
889	              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
890	              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
891	              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

893	   [RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
894	              J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler
895	              and Active Queue Management Algorithm", RFC 8290,
896	              DOI 10.17487/RFC8290, January 2018,
897	              <https://www.rfc-editor.org/info/rfc8290>.

899	   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
900	              Notification (ECN) Experimentation", RFC 8311,
901	              DOI 10.17487/RFC8311, January 2018,
902	              <https://www.rfc-editor.org/info/rfc8311>.

904	Appendix A.  Example DualQ Coupled PI2 Algorithm

906	   As a first concrete example, the pseudocode below gives the DualPI2
907	   algorithm.  DualPI2 follows the structure of the DualQ Coupled AQM
908	   framework in Figure 1.  A simple step threshold (in units of queuing
909	   time) is used for the Native L4S AQM, but a ramp is also described as
910	   an alternative.  And the PI2 algorithm [PI2] is used for the Classic
911	   AQM.  PI2 is an improved variant of the PIE AQM [RFC8033].

913	   We will introduce the pseudocode in two passes.  The first pass
914	   explains the core concepts, deferring handling of overload to the
915	   second pass.  To aid comparison, line numbers are kept in step
916	   between the two passes by using letter suffixes where the longer code
917	   needs extra lines.

919	   A full open source implementation for Linux is available at:
920	   https://github.com/olgabo/dualpi2.

922	A.1.  Pass #1: Core Concepts

924	   The pseudocode manipulates three main structures of variables: the
925	   packet (pkt), the L4S queue (lq) and the Classic queue (cq).  The
926	   pseudocode consists of the following four functions:

928	   o  initialization code (Figure 2) that sets parameter defaults (the
929	      API for setting non-default values is omitted for brevity)

931	   o  enqueue code (Figure 3)

933	   o  dequeue code (Figure 4)

935	   o  code to regularly update the base probability (p) used in the
936	      dequeue code (Figure 5).

938	   It also uses the following functions that are not shown in full here:

940	   o  scheduler(), which selects between the head packets of the two
941	      queues; the choice of scheduler technology is discussed later;

943	   o  cq.len() or lq.len() returns the current length (aka. backlog) of
944	      the relevant queue in bytes;

946	   o  cq.time() or lq.time() returns the current queuing delay (aka.
947	      sojourn time or service time) of the relevant queue in units of
948	      time.

950	   Queuing delay could be measured directly by storing a per-packet
951	   time-stamp as each packet is enqueued, and subtracting this from the
952	   system time when the packet is dequeued.  If time-stamping is not
953	   easy to introduce with certain hardware, queuing delay could be
954	   predicted indirectly by dividing the size of the queue by the
955	   predicted departure rate, which might be known precisely for some
956	   link technologies (see for example [RFC8034]).
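
   The two ways of obtaining queuing delay described above might look
   as follows in Python.  This is a minimal sketch: both function names
   and the choice of seconds, bytes and bits per second as units are
   assumptions, and neither function is taken from the Linux
   implementation.

```python
def sojourn_time(enqueue_timestamp: float, now: float) -> float:
    """Direct measurement: time the packet has spent in the queue.

    enqueue_timestamp is stored with the packet at enqueue; now is the
    system time at dequeue.  Both are in seconds.
    """
    return max(0.0, now - enqueue_timestamp)


def estimated_delay(backlog_bytes: int, departure_rate_bps: float) -> float:
    """Indirect estimate: queue size divided by predicted departure rate.

    backlog_bytes is the current queue length in bytes and
    departure_rate_bps is the predicted drain rate in bits per second.
    """
    if departure_rate_bps <= 0.0:
        raise ValueError("departure rate must be positive")
    return backlog_bytes * 8 / departure_rate_bps
```

   For instance, a backlog of 15000 bytes draining at 12 Mb/s gives an
   estimated queuing delay of 10 ms.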

958	   In our experiments so far (building on experiments with PIE) on
959	   broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs
960	   from 5 ms to 100 ms, DualPI2 achieves good results with the default
961	   parameters in Figure 2.  The parameters are categorised by whether
962	   they relate to the Base PI2 AQM, the L4S AQM or the framework
963	   coupling them together.  Variables derived from these parameters are
964	   also included at the end of each category.  Each parameter is
965	   explained as it is encountered in the walk-through of the pseudocode
966	   below.

968	   1:  dualpi2_params_init(...) {         % Set input parameter defaults
969	   2:    % PI2 AQM parameters
970	   3:    target = 15 ms              % PI AQM Classic queue delay target
971	   4:    Tupdate = 16 ms            % PI Classic queue sampling interval
972	   5:    alpha = 10 Hz^2                              % PI integral gain
973	   6:    beta = 100 Hz^2                          % PI proportional gain
974	   7:    p_Cmax = 1/4                       % Max Classic drop/mark prob
975	   8:    % Derived PI2 AQM variables
976	   9:    alpha_U = alpha *Tupdate % PI integral gain per update interval
977	   10:   beta_U = beta * Tupdate  % PI prop'nal gain per update interval
978	   11:
979	   12:   % DualQ Coupled framework parameters
980	   13:   k = 2                                         % Coupling factor
981	   14:   % scheduler weight or equival't parameter (scheduler-dependent)
982	   15:   limit = MAX_LINK_RATE * 250 ms               % Dual buffer size
983	   16:
984	   17:   % L4S AQM parameters
985	   18:   T_time = 1 ms                   % L4S marking threshold in time
986	   19:   T_len = 2 * MTU            % Min L4S marking threshold in bytes
987	   20:   % Derived L4S AQM variables
988	   21:   p_Lmax = min(k*sqrt(p_Cmax), 1)          % Max L4S marking prob
989	   22: }

991	       Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM

993	   The overall goal of the code is to maintain the base probability (p),
994	   which is an internal variable from which the marking and dropping
995	   probabilities for L4S and Classic traffic (p_L and p_C) are derived.
996	   The variable named p in the pseudocode and in this walk-through is
997	   the same as p' (p-prime) in Section 2.4.  The probabilities p_L and
998	   p_C are derived in lines 3, 4 and 5 of the dualpi2_update() function
999	   (Figure 5) then used in the dualpi2_dequeue() function (Figure 4).
1000	   The code walk-through below builds up to explaining that part of the
1001	   code eventually, but it starts from packet arrival.

1003	   1:  dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq
1004	   2:    if ( lq.len() + cq.len() > limit )
1005	   3:      drop(pkt)                     % drop packet if buffer is full
1006	   4:    else {                                      % Packet classifier
1007	   5:      if ( ecn(pkt) modulo 2 == 1 )       % ECN bits = ECT(1) or CE
1008	   6:        lq.enqueue(pkt)
1009	   7:      else                           % ECN bits = not-ECT or ECT(0)
1010	   8:        cq.enqueue(pkt)
1011	   9:    }
1012	   10: }

1014	      Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM

1016	   1:  dualpi2_dequeue(lq, cq, pkt) {     % Couples L4S & Classic queues
1017	   2:    while ( lq.len() + cq.len() > 0 ) {
1018	   3:      if ( scheduler() == lq ) {
1019	   4:        lq.dequeue(pkt)                      % Scheduler chooses lq
1020	   5:        if ( ((lq.time() > T_time)              % step marking ...
1021	   6:              AND (lq.len() > T_len))
1022	   7:            OR (p_CL > rand()) )             % ...or linear marking
1023	   8:          mark(pkt)
1024	   9:      } else {
1025	   10:       cq.dequeue(pkt)                      % Scheduler chooses cq
1026	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1027	   12:         if ( ecn(pkt) == 0 ) {           % if ECN field = not-ECT
1028	   13:           drop(pkt)                                % squared drop
1029	   14:           continue        % continue to the top of the while loop
1030	   15:         }
1031	   16:         mark(pkt)                                  % squared mark
1032	   17:       }
1033	   18:     }
1034	   19:     return(pkt)                      % return the packet and stop
1035	   20:   }
1036	   21:   return(NULL)                             % no packet to dequeue
1037	   22: }

1039	      Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM

1041	   When packets arrive, first a common queue limit is checked as shown
1042	   in line 2 of the enqueuing pseudocode in Figure 3.  Note that the
1043	   limit is deliberately tested before enqueue to avoid any bias against
1044	   larger packets (so the actual buffer has to be one MTU larger than
1045	   limit).  If limit is not exceeded, the packet will be classified and
1046	   enqueued to the Classic or L4S queue dependent on the least
1047	   significant bit of the ECN field in the IP header (line 5).  Packets
1048	   with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be
1049	   enqueued in the Classic queue.  Otherwise, ECT(1) and CE packets will
1050	   be enqueued in the L4S queue.  Optional additional packet
1051	   classification flexibility is omitted for brevity (see
1052	   [I-D.ietf-tsvwg-ecn-l4s-id]).
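
   The enqueue logic just described can be sketched in Python as
   follows.  The ECN codepoint values are those of [RFC3168]; the
   function names and the representation of each queue as a deque of
   packet lengths in bytes are assumptions made for illustration.

```python
from collections import deque

# ECN field codepoints from RFC 3168
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def classify(ecn: int) -> str:
    """Return "L" if the least significant ECN bit is 1, else "C"."""
    return "L" if ecn & 1 else "C"


def enqueue(lq: deque, cq: deque, pkt_len: int, ecn: int, limit: int) -> str:
    """Test the common limit, then classify; returns "L", "C" or "drop".

    The limit is tested before the packet is added, so the combined
    backlog can exceed limit by up to one packet.
    """
    if sum(lq) + sum(cq) > limit:
        return "drop"
    queue_id = classify(ecn)
    (lq if queue_id == "L" else cq).append(pkt_len)
    return queue_id
```

   Because the test precedes the enqueue, a packet that arrives when
   the backlog exactly equals limit is still accepted, which is why the
   actual buffer has to be one MTU larger than limit.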

1054	   The dequeue pseudocode (Figure 4) schedules one packet for dequeuing
1055	   (or zero if the queue is empty).  It also makes all the AQM decisions
1056	   on dropping and marking.  The alternative of applying the AQMs at
1057	   enqueue would shift some processing from the critical time when each
1058	   packet is dequeued.  However, it would also add a whole queue of
1059	   delay to the control signals, making the control loop very sloppy.

1061	   All the dequeue code is contained within a large while loop so that
1062	   if it decides to drop a packet, it will continue until it selects a
1063	   packet to schedule.  Line 3 of the dequeue pseudocode is where the
1064	   scheduler chooses between the L4S queue (lq) and the Classic queue
1065	   (cq).  Detailed implementation of the scheduler is not shown (see
1066	   discussion later).

1068	   o  If an L4S packet is scheduled, lines 5 to 8 mark the packet if
1069	      either the L4S threshold (T_time) is exceeded, or if a random
1070	      marking decision is drawn according to p_CL (maintained by the
1071	      dualpi2_update() function discussed below).  This logical 'OR' on
1072	      a per-packet basis implements the max() function shown in Figure 1
1073	      to couple the outputs of the two AQMs together.  The L4S threshold
1074	      is usually in units of time (default T_time = 1 ms).  However, on
1075	      slow links the packet serialization time can approach the
1076	      threshold T_time, so line 6 sets a floor of T_len (=2 MTU) to the
1077	      threshold; otherwise, marking would be too frequent on slow links.

1079	   o  If a Classic packet is scheduled, lines 10 to 17 drop or mark the
1080	      packet based on the squared probability p_C.
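
   The two per-packet decisions above can be sketched in Python as
   follows.  The function names, the MTU of 1500 bytes and the
   injectable rand argument (used to make the sketch testable) are
   assumptions; the thresholds default to the values in Figure 2.

```python
import random
from typing import Callable

MTU = 1500  # bytes; an assumed value for illustration


def l4s_mark(q_delay: float, q_len: int, p_cl: float,
             t_time: float = 0.001, t_len: int = 2 * MTU,
             rand: Callable[[], float] = random.random) -> bool:
    """Mark an L4S packet if the step threshold OR the coupled test fires.

    q_delay is the L4S queue delay in seconds, q_len its length in bytes.
    The per-packet OR approximates taking the max of the two AQM outputs.
    """
    step = q_delay > t_time and q_len > t_len
    return step or p_cl > rand()


def classic_action(ecn: int, p_c: float,
                   rand: Callable[[], float] = random.random) -> str:
    """Return "drop", "mark" or "forward" for a Classic packet.

    With probability p_c the packet is dropped if it is Not-ECT
    (ecn == 0) and marked otherwise.
    """
    if p_c > rand():
        return "drop" if ecn == 0 else "mark"
    return "forward"
```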

1082	   There is some concern that using a step function for the Native L4S
1083	   AQM requires end-systems to smooth the signal for much longer,
1084	   until its fidelity is sufficient.  The latency benefits of a ramp are
1085	   being investigated as a simple alternative to the step.  This ramp
1086	   would be similar to the RED algorithm, with the following
1087	   differences:

1089	   o  The min and max of the ramp are defined in units of queuing delay,
1090	      not bytes, so that configuration remains invariant as the queue
1091	      departure rate varies.

1093	   o  It uses instantaneous queueing delay without smoothing (smoothing
1094	      is done in the end-systems).

1096	   o  Determinism is being experimented with instead of randomness, to
1097	      reduce the delay necessary to smooth out the noise of randomness
1098	      from the signal.  For each packet, the algorithm would accumulate
1099	      p'_L in a counter and mark the packet that took the counter over
1100	      1, then subtract 1 from the counter and continue.

1102	   o  The ramp rises linearly directly from 0 to 1, not to an
1103	      intermediate value of p'_L as RED would, because there is no need
1104	      to keep ECN marking probability low.

1106	   This ramp algorithm would require two configuration parameters (min
1107	   and max threshold in units of queuing time), in contrast to the
1108	   single parameter of a step.
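
   A ramp with deterministic marking, as outlined in the list above,
   might be sketched in Python as follows.  This is not the authors'
   code: the names are hypothetical, and marking when the counter
   reaches (rather than strictly exceeds) 1 is an assumption chosen so
   that a probability of 1 marks every packet.

```python
def ramp_prob(q_delay: float, min_th: float, max_th: float) -> float:
    """Linear ramp from 0 at min_th to 1 at max_th.

    All arguments are instantaneous queuing delays in the same unit.
    """
    if max_th <= min_th:
        raise ValueError("max_th must be greater than min_th")
    frac = (q_delay - min_th) / (max_th - min_th)
    return min(1.0, max(0.0, frac))


class DeterministicMarker:
    """Accumulate the marking probability and mark on reaching 1."""

    def __init__(self) -> None:
        self.counter = 0.0

    def should_mark(self, p: float) -> bool:
        self.counter += p
        if self.counter >= 1.0:
            self.counter -= 1.0
            return True
        return False
```

   With a constant probability of 0.5, this marker marks every second
   packet, so the marking pattern carries no random noise for the
   end-system to smooth out.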

1110	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1111	   2:    curq = cq.time()  % use queuing time of first-in Classic packet
1112	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1113	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1114	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1115	   6:    prevq = curq
1116	   7:  }

1118	     Figure 5: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM

1120	   The base probability (p) is kept up to date by the core PI algorithm
1121	   in Figure 5, which is executed every Tupdate.

1123	   Note that p solely depends on the queuing time in the Classic queue.
1124	   In line 2, the current queuing delay (curq) is evaluated from how
1125	   long the head packet was in the Classic queue (cq).  The function
1126	   cq.time() (not shown) subtracts the time stamped at enqueue from the
1127	   current time and implicitly takes the current queuing delay as 0 if
1128	   the queue is empty.

1130	   The algorithm centres on line 3, which is a classical Proportional-
1131	   Integral (PI) controller that alters p dependent on: a) the error
1132	   between the current queuing delay (curq) and the target queuing delay
1133	   ('target' - see [RFC8033]); and b) the change in queuing delay since
1134	   the last sample.  The name 'PI' represents the fact that the second
1135	   factor (how fast the queue is growing) is _P_roportional to load
1136	   while the first is the _I_ntegral of the load (so it removes any
1137	   standing queue in excess of the target).

1139	   The two 'gain factors' in line 3, alpha_U and beta_U, respectively
1140	   weight how strongly each of these elements ((a) and (b)) alters p.
1141	   They are in units of 'per second of delay' or Hz, because they
1142	   transform differences in queueing delay into changes in probability.

1144	   alpha_U and beta_U are derived from the input parameters alpha and
1145	   beta (see lines 5 and 6 of Figure 2).  These recommended values of
1146	   alpha and beta come from the stability analysis in [PI2] so that the
1147	   AQM can change p as fast as possible in response to changes in load
1148	   without over-compensating and therefore causing oscillations in the
1149	   queue.

1151	   alpha and beta determine how much p ought to change if it was updated
1152	   every second.  It is best to update p as frequently as possible, but
1153	   the update interval (Tupdate) will probably be constrained by
1154	   hardware performance.  For link rates from 4 - 200 Mb/s, we found
1155	   Tupdate=16ms (as recommended in [RFC8033]) is sufficient.  However
1156	   small the chosen value of Tupdate, p should change by the same amount
1157	   per second, but in finer more frequent steps.  So the gain factors
1158	   used for updating p in Figure 5 need to be scaled by (Tupdate/1s),
1159	   which is done in lines 9 and 10 of Figure 2.  The suffix '_U'
1160	   represents 'per update time' (Tupdate).
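
   The effect of this scaling can be checked with a small Python
   sketch.  It assumes that both gains are simply multiplied by Tupdate
   in seconds, as the text above describes, and it looks only at the
   integral term under a constant delay error.  The function names are
   hypothetical.

```python
def scaled_gains(alpha, beta, tupdate):
    """Per-update gains from per-second gains (Tupdate in seconds)."""
    return alpha * tupdate, beta * tupdate

def p_change_over(duration, tupdate, alpha, error):
    """Integral-term change in p over `duration` s at a constant delay error."""
    alpha_u, _ = scaled_gains(alpha, 0.0, tupdate)
    updates = round(duration / tupdate)
    return updates * alpha_u * error
```

   With alpha = 10 Hz^2 and a constant error of 5 ms, updating every
   16 ms or every 4 ms gives the same total change in p over 160 ms;
   the shorter interval just takes four times as many smaller steps.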

1162	   In corner cases, p can overflow the range [0,1] so the resulting
1163	   value of p has to be bounded (omitted from the pseudocode).  Then, as
1164	   already explained, the coupled and Classic probabilities are derived
1165	   from the new p in lines 4 and 5 as p_CL = k*p and p_C = p^2.
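
   A possible form of the bounding that the pseudocode omits is
   sketched below in Python.  Clamping with min/max is an assumption;
   the draft does not say how the bound is applied.  Note that p_CL is
   not clamped here and can exceed 1 when k > 1.

```python
def derive_probs(p, k=2):
    """Clamp base probability to [0, 1], then apply lines 4 and 5 of Figure 5."""
    p = min(max(p, 0.0), 1.0)
    p_cl = k * p          # coupled L4S probability
    p_c = p * p           # Classic probability
    return p, p_cl, p_c
```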

1167	   Because the coupled L4S marking probability (p_CL) is factored up by
1168	   k, the dynamic gain parameters alpha and beta are also inherently
1169	   factored up by k for the L4S queue, which is necessary to ensure that
1170	   Classic TCP and DCTCP controls have the same stability.  So, if alpha
1171	   is 10 Hz^2, the effective gain factor for the L4S queue is k*alpha,
1172	   which is 20 Hz^2 with the default coupling factor of k=2.

1174	   Unlike in PIE [RFC8033], alpha_U and beta_U do not need to be tuned
1175	   every Tupdate dependent on p.  Instead, in PI2, alpha_U and beta_U
1176	   are independent of p because the squaring applied to Classic traffic
1177	   tunes them inherently.  This is explained in [PI2], which also
1178	   explains why this more principled approach removes the need for most
1179	   of the heuristics that had to be added to PIE.

1181	   {ToDo: Scaling beta with Tupdate and scaling both alpha & beta with
1182	   RTT}

1184	A.2.  Pass #2: Overload Details

1186	   Figure 6 repeats the dequeue function of Figure 4, but with overload
1187	   details added.  Similarly Figure 7 repeats the core PI algorithm of
1188	   Figure 5 with overload details added.  The initialization and enqueue
1189	   functions are unchanged.

1191	   Line 7 of the initialization function (Figure 2) sets the default
1192	   maximum Classic drop probability p_Cmax = 1/4 or 25%.  This is the
1193	   point at which it is deemed that the Classic queue has become
1194	   persistently overloaded, so it switches to using solely drop, even
1195	   for ECN-capable packets.  This protects the queue against any
1196	   unresponsive traffic that falsely claims that it is responsive to ECN
1197	   marking, as required by [RFC3168] and [RFC7567].

1199	   Line 21 of the initialization function translates this into a maximum
1200	   L4S marking probability (p_Lmax) by rearranging Equation (1).  With a
1201	   coupling factor of k=2 (the default) or greater, this translates to a
1202	   maximum L4S marking probability of 1 (or 100%).  This is intended to
1203	   ensure that the L4S queue starts to introduce dropping once marking
1204	   saturates and can rise no further.  The 'TCP Prague' requirements
1205	   [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion
1206	   control detects a drop, it falls back to a response that coexists
1207	   with 'Classic' TCP.  So it is correct that the L4S queue drops
1208	   packets proportional to p^2, as if they are Classic packets.
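
   The rearrangement can be worked through in a few lines of Python.
   This sketch assumes Equation (1) has the form p_C = (p_CL / k)^2,
   which is what lines 4 and 5 of Figure 5 imply, and that the result
   is capped at 1.  The function name p_l_max is illustrative.

```python
import math

def p_l_max(p_c_max, k):
    """Rearrange p_C = (p_CL / k)^2 for p_CL at p_C = p_Cmax, capped at 1."""
    return min(k * math.sqrt(p_c_max), 1.0)
```

   With p_Cmax = 1/4, any k of 2 or more gives a maximum of 1, whereas
   k = 1 would give 0.5.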

1210	   Both these switch-overs are triggered by the tests for overload
1211	   introduced in lines 4b and 12b of the dequeue function (Figure 6).
1212	   Lines 8c to 8g drop L4S packets with probability p^2.  Lines 8h to 8i
1213	   mark the remaining packets with probability p_CL.

1215	   Lines 2c to 2d in the core PI algorithm (Figure 7) deal with overload
1216	   of the L4S queue when there is no Classic traffic.  This is
1217	   necessary, because the core PI algorithm maintains the appropriate
1218	   drop probability to regulate overload, but it depends on the length
1219	   of the Classic queue.  If there is no Classic queue the naive
1220	   algorithm in Figure 5 drops nothing, even if the L4S queue is
1221	   overloaded - so tail drop would have to take over (lines 3 and 4 of
1222	   Figure 3).

1224	   If the test at line 2a finds that the Classic queue is empty, line 2d
1225	   measures the current queue delay using the L4S queue instead.  While
1226	   the L4S queue is not overloaded, its delay will always be tiny
1227	   compared to the target Classic queue delay.  So p_L will be driven to
1228	   zero, and the L4S queue will naturally be governed solely by
1229	   threshold marking (lines 5 and 6 of the dequeue algorithm in
1230	   Figure 6).  But, if unresponsive L4S source(s) cause overload, the
1231	   DualQ transitions smoothly to L4S marking based on the PI algorithm.

1233	   And as overload increases, it naturally transitions from marking to
1234	   dropping by the switch-over mechanism already described.

1236	   1:  dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq
1237	   2:    while ( lq.len() + cq.len() > 0 )
1238	   3:      if ( scheduler() == lq ) {
1239	   4a:       lq.dequeue(pkt)
1240	   4b:       if ( p_CL < p_Lmax ) {      % Check for overload saturation
1241	   5:          if ( ((lq.time() > T_time)             % step marking ...
1242	   6:                AND (lq.len() > T_len))
1243	   7:              OR (p_CL > rand()) )           % ...or linear marking
1244	   8a:            mark(pkt)
1245	   8b:       } else {                              % overload saturation
1246	   8c:         if ( p_C > rand() ) {             % probability p_C = p^2
1247	   8e:           drop(pkt)      % revert to Classic drop due to overload
1248	   8f:           continue        % continue to the top of the while loop
1249	   8g:         }
1250	   8h:         if ( p_CL > rand() )           % probability p_CL = k * p
1251	   8i:           mark(pkt)         % linear marking of remaining packets
1252	   8j:       }
1253	   9:      } else {
1254	   10:       cq.dequeue(pkt)
1255	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1256	   12a:        if ( (ecn(pkt) == 0)                % ECN field = not-ECT
1257	   12b:             OR (p_C >= p_Cmax) ) {       % Overload disables ECN
1258	   13:           drop(pkt)                     % squared drop, redo loop
1259	   14:           continue        % continue to the top of the while loop
1260	   15:         }
1261	   16:         mark(pkt)                                  % squared mark
1262	   17:       }
1263	   18:     }
1264	   19:     return(pkt)                      % return the packet and stop
1265	   20:   }
1266	   21:   return(NULL)                             % no packet to dequeue
1267	   22: }

1269	      Figure 6: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM
1270	             (Including Integer Arithmetic and Overload Code)
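
   The L4S branch of Figure 6 (lines 4b to 8j) can be sketched in
   Python as a pure decision function.  This is an illustration only:
   the function name, the string return values and the injected rand
   callable are assumptions, and the caller would still have to loop
   to the next packet after a drop, as the pseudocode does.

```python
def l4s_action(p_cl, p_c, p_l_max, lq_time, lq_len, t_time, t_len, rand):
    """Return 'drop', 'mark' or 'forward' for a dequeued L4S packet.

    Mirrors lines 4b to 8j of Figure 6; `rand` is a callable returning a
    uniform value in [0, 1), injected so that the logic can be tested.
    """
    if p_cl < p_l_max:                                   # line 4b: not saturated
        step = lq_time > t_time and lq_len > t_len       # lines 5-6
        if step or p_cl > rand():                        # line 7
            return "mark"
        return "forward"
    if p_c > rand():                                     # line 8c: Classic drop
        return "drop"
    if p_cl > rand():                                    # line 8h
        return "mark"
    return "forward"
```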

1272	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1273	   2a:   if ( cq.len() > 0 )
1274	   2b:     curq = cq.time() %use queuing time of first-in Classic packet
1275	   2c:   else                                      % Classic queue empty
1276	   2d:     curq = lq.time()    % use queuing time of first-in L4S packet
1277	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1278	   4:    p_CL = p * k           % L4S prob = base prob * coupling factor
1279	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1280	   6:    prevq = curq
1281	   7:  }

1283	     Figure 7: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM
1284	                         (Including Overload Code)
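
   Figure 7 can be rendered in Python roughly as follows.  The state
   dictionary, the argument names and the explicit clamp of p to [0,1]
   are assumptions added to make the sketch self-contained; queue
   delays are passed in as plain numbers in seconds.

```python
def dualpi2_update(state, cq_len, cq_time, lq_time, target, alpha_u, beta_u, k=2):
    """One Tupdate step; `state` is a dict holding 'p' and 'prevq'."""
    # Lines 2a-2d: fall back to the L4S queue delay if the Classic queue is empty
    curq = cq_time if cq_len > 0 else lq_time
    # Line 3: PI update, bounded to [0, 1] (the bounding is omitted in Figure 7)
    p = state["p"] + alpha_u * (curq - target) + beta_u * (curq - state["prevq"])
    state["p"] = min(max(p, 0.0), 1.0)
    state["prevq"] = curq                      # line 6
    return k * state["p"], state["p"] ** 2     # lines 4 and 5: p_CL, p_C
```

   With an empty Classic queue and a lightly loaded L4S queue, p stays
   pinned at zero.  Once the L4S delay rises well above the target, p
   (and therefore p_CL) becomes positive even though there is no
   Classic traffic.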

1286	   The choice of scheduler technology is critical to overload protection
1287	   (see Section 4.1).

1289	   o  A well-understood weighted scheduler such as weighted round robin
1290	      (WRR) is recommended.  The scheduler weight for Classic should be
1291	      low, e.g. 1/16.

1293	   o  Alternatively, a time-shifted FIFO could be used.  This is a very
1294	      simple scheduler, but it does not fully isolate latency in the L4S
1295	      queue from uncontrolled bursts in the Classic queue.  It works by
1296	      selecting the head packet that has waited the longest, biased
1297	      against the Classic traffic by a time-shift of tshift.  To
1298	      implement time-shifted FIFO, the "if (scheduler() == lq )" test in
1299	      line 3 of the dequeue code would simply be replaced by "if (
1300	      lq.time() + tshift >= cq.time() )".  For the public Internet a
1301	      good value for tshift is 50ms.  For private networks with smaller
1302	      diameter, about 4*target would be reasonable.

1304	   o  A strict priority scheduler would be inappropriate, because it
1305	      would starve Classic if L4S was overloaded.
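
   The time-shifted FIFO test can be sketched in Python as below.  The
   comparison is the one quoted above; the explicit checks for empty
   queues are an added assumption, because an empty queue reports a
   delay of 0 and the bare comparison could otherwise select it.  The
   function name and return values are hypothetical.

```python
def select_queue(lq_len, lq_time, cq_len, cq_time, tshift=0.050):
    """Return 'lq' or 'cq' for the next dequeue, or None if both are empty."""
    if lq_len == 0 and cq_len == 0:
        return None
    if cq_len == 0:
        return "lq"
    if lq_len == 0:
        return "cq"
    # Serve the head packet that has waited longest, after shifting L4S by tshift
    return "lq" if lq_time + tshift >= cq_time else "cq"
```

   With tshift = 50 ms, an L4S packet that has waited 1 ms is served
   ahead of a Classic packet that has waited 40 ms, but not ahead of
   one that has waited 60 ms.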

1307	Appendix B.  Example DualQ Coupled Curvy RED Algorithm

1309	   As another example of a DualQ Coupled AQM algorithm, the pseudocode
1310	   below gives the Curvy RED based algorithm we used and tested.
1311	   Although we designed the AQM to be efficient in integer arithmetic,
1312	   to aid understanding it is first given using real-number arithmetic.
1313	   Then, one possible optimization for integer arithmetic is given, also
1314	   in pseudocode.  To aid comparison, the line numbers are kept in step
1315	   between the two by using letter suffixes where the longer code needs
1316	   extra lines.

1318	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1319	   2:    if ( lq.dequeue(pkt) ) {
1320	   3a:     p_L = cq.sec() / 2^S_L
1321	   3b:     if ( lq.byt() > T )
1322	   3c:       mark(pkt)
1323	   3d:     elif ( p_L > maxrand(U) )
1324	   4:        mark(pkt)
1325	   5:      return(pkt)                % return the packet and stop here
1326	   6:    }
1327	   7:    while ( cq.dequeue(pkt) ) {
1328	   8a:     alpha = 2^(-f_C)
1329	   8b:     Q_C = alpha * pkt.sec() + (1-alpha)* Q_C    % Classic Q EWMA
1330	   9a:     sqrt_p_C = Q_C / 2^S_C
1331	   9b:     if ( sqrt_p_C > maxrand(2*U) )
1332	   10:       drop(pkt)                        % Squared drop, redo loop
1333	   11:     else
1334	   12:       return(pkt)              % return the packet and stop here
1335	   13:   }
1336	   14:   return(NULL)                           % no packet to dequeue
1337	   15: }

1339	   16: maxrand(u) {                % return the max of u random numbers
1340	   17:     maxr=0
1341	   18:     while (u-- > 0)
1342	   19:         maxr = max(maxr, rand())               % 0 <= rand() < 1
1343	   20:     return(maxr)
1344	   21: }

1346	   Figure 8: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM

1348	   Packet classification code is not shown, as it is no different from
1349	   Figure 3.  Potential classification schemes are discussed in
1350	   Section 2.3.  The Curvy RED algorithm has not been maintained to the
1351	   same degree as the DualPI2 algorithm.  Some ideas used in DualPI2
1352	   would need to be translated into Curvy RED, such as i) the
1353	   conditional priority scheduler instead of strict priority; ii) the
1354	   time-based L4S threshold; iii) turning off ECN as overload
1355	   protection; iv) Classic ECN support.  These are not shown in the
1356	   Curvy RED pseudocode, but would need to be implemented for
1357	   production. {ToDo}

1359	   At the outer level, the structure of dualq_dequeue() implements
1360	   strict priority scheduling.  The code is written assuming the AQM is
1361	   applied on dequeue (Note 1).  Every time dualq_dequeue() is called,
1362	   the if-block in lines 2-6 determines whether there is an L4S packet
1363	   to dequeue by calling lq.dequeue(pkt), and otherwise the while-block
1364	   in lines 7-13 determines whether there is a Classic packet to
1365	   dequeue, by calling cq.dequeue(pkt).  (Note 2)
1366	   In the lower priority Classic queue, a while loop is used so that, if
1367	   the AQM determines that a Classic packet should be dropped, it
1368	   continues to test for Classic packets, deciding whether to drop each
1369	   until it actually forwards one.  Thus, every call to dualq_dequeue()
1370	   returns one packet if at least one is present in either queue,
1371	   otherwise it returns NULL at line 14.  (Note 3)

1373	   Within each queue, the decision whether to drop or mark is taken as
1374	   follows (to simplify the explanation, it is assumed that U=1):

1376	   L4S:  If the test at line 2 determines there is an L4S packet to
1377	      dequeue, the tests at lines 3b and 3d determine whether to mark
1378	      it.  The first is a simple test of whether the L4S queue (lq.byt()
1379	      in bytes) is greater than a step threshold T in bytes (Note 4).
1380	      The second test is similar to the random ECN marking in RED, but
1381	      with the following differences: i) the marking function does not
1382	      start with a plateau of zero marking until a minimum threshold,
1383	      rather the marking probability starts to increase as soon as the
1384	      queue is positive; ii) marking depends on queuing time, not bytes,
1385	      in order to scale for any link rate without being reconfigured;
1386	      iii) marking of the L4S queue does not depend on itself, it
1387	      depends on the queuing time of the _other_ (Classic) queue, where
1388	      cq.sec() is the queuing time of the packet at the head of the
1389	      Classic queue (zero if empty); iv) marking depends on the
1390	      instantaneous queuing time (of the other Classic queue), not a
1391	      smoothed average; v) the queue is compared with the maximum of U
1392	      random numbers (but if U=1, this is the same as the single random
1393	      number used in RED).

1395	      Specifically, in line 3a the marking probability p_L is set to the
1396	      Classic queueing time cq.sec() in seconds divided by the L4S
1397	      scaling parameter 2^S_L, which represents the queuing time (in
1398	      seconds) at which marking probability would hit 100%. Then in line
1399	      3d (if U=1) the result is compared with a uniformly distributed
1400	      random number between 0 and 1, which ensures that marking
1401	      probability will linearly increase with queueing time.  The
1402	      scaling parameter is expressed as a power of 2 so that division
1403	      can be implemented as a right bit-shift (>>) in line 3 of the
1404	      integer variant of the pseudocode (Figure 9).

1406	   Classic:  If the test at line 7 determines that there is at least one
1407	      Classic packet to dequeue, the test at line 9b determines whether
1408	      to drop it.  But before that, line 8b updates Q_C, which is an
1409	      exponentially weighted moving average (Note 5) of the queuing time
1410	      in the Classic queue, where pkt.sec() is the instantaneous
1411	      queueing time of the current Classic packet and alpha is the EWMA
1412	      constant for the Classic queue.  In line 8a, alpha is represented
1413	      as an integer power of 2, so that in line 8 of the integer code
1414	      the division needed to weight the moving average can be
1415	      implemented by a right bit-shift (>> f_C).

1417	      Lines 9a and 9b implement the drop function.  In line 9a the
1418	      averaged queuing time Q_C is divided by the Classic scaling
1419	      parameter 2^S_C, in the same way that queuing time was scaled for
1420	      L4S marking.  This scaled queuing time is given the variable name
1421	      sqrt_p_C because it will be squared to compute Classic drop
1422	      probability, so before it is squared it is effectively the square
1423	      root of the drop probability.  The squaring is done by comparing
1424	      it with the maximum out of two random numbers (assuming U=1).
1425	      Comparing it with the maximum out of two is the same as the
1426	      logical `AND' of two tests, which ensures drop probability rises
1427	      with the square of queuing time (Note 6).  Again, the scaling
1428	      parameter is expressed as a power of 2 so that division can be
1429	      implemented as a right bit-shift in line 9 of the integer
1430	      pseudocode.
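
   The equivalence between the real-number EWMA (line 8b of Figure 8)
   and the shift-based update (line 8 of Figure 9) can be illustrated
   in Python.  The function names are hypothetical.  The integer form
   relies on Python's >> rounding towards minus infinity for negative
   operands; other languages may behave differently.

```python
def ewma_real(q_c, sample, f_c):
    """Line 8b of Figure 8 with alpha = 2^(-f_C)."""
    alpha = 2.0 ** (-f_c)
    return alpha * sample + (1 - alpha) * q_c

def ewma_int(q_c, sample, f_c):
    """Line 8 of Figure 9: the same update using a right bit-shift."""
    return q_c + ((sample - q_c) >> f_c)
```

   With f_C = 5, each sample moves the average by 1/32 of the
   difference between the sample and the current average.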

1432	   The marking/dropping functions in each queue (lines 3 & 9) are two
1433	   cases of a new generalization of RED called Curvy RED, motivated as
1434	   follows.  When we compared the performance of our AQM with fq_CoDel
1435	   and PIE, we came to the conclusion that their goal of holding queuing
1436	   delay to a fixed target is misguided [CRED_Insights].  As the number
1437	   of flows increases, if the AQM does not allow TCP to increase queuing
1438	   delay, it has to introduce abnormally high levels of loss.  Then loss
1439	   rather than queuing becomes the dominant cause of delay for short
1440	   flows, due to timeouts and tail losses.

1442	   Curvy RED constrains delay with a softened target that allows some
1443	   increase in delay as load increases.  This is achieved by increasing
1444	   drop probability on a convex curve relative to queue growth (the
1445	   square curve in the Classic queue, if U=1).  Like RED, the curve hugs
1446	   the zero axis while the queue is shallow.  Then, as load increases,
1447	   it introduces a growing barrier to higher delay.  But, unlike RED, it
1448	   requires only one parameter, the scaling, not three.  The
1449	   disadvantage of Curvy RED is that it is not adapted to a wide range
1450	   of RTTs.  Curvy RED can be used as is when the range of RTTs to
1451	   support is limited; otherwise an adaptation mechanism is required.

1453	   There follows a summary listing of the two parameters used for each
1454	   of the two queues:

1456	   Classic:

1458	      S_C :   The scaling factor of the dropping function scales Classic
1459	         queuing times in the range [0, 2^(S_C)] seconds into a dropping
1460	         probability in the range [0,1].  To make division efficient, it
1461	         is constrained to be an integer power of two;

1463	      f_C :  To smooth the queuing time of the Classic queue and make
1464	         multiplication efficient, we use a negative integer power of
1465	         two for the dimensionless EWMA constant, which we define as
1466	         alpha = 2^(-f_C).

1468	   L4S :

1470	      S_L (and k'):   As for the Classic queue, the scaling factor of
1471	         the L4S marking function scales Classic queueing times in the
1472	         range [0, 2^(S_L)] seconds into a probability in the range
1473	         [0,1].  Note that S_L = S_C - k', where k' is the coupling
1474	         between the queues.  So S_L and k' count as only one parameter;
1475	         k' is related to k in Equation (1) (Section 2.1) by k=2^k',
1476	         where both k and k' are constants.  Then implementations can
1477	         avoid costly division by shifting p_L by k' bits to the right.

1479	      T :  The queue size in bytes at which step threshold marking
1480	         starts in the L4S queue.

1482	   {ToDo: These are the raw parameters used within the algorithm.  A
1483	   configuration front-end could accept more meaningful parameters and
1484	   convert them into these raw parameters.}

1486	   From our experiments so far, recommended values for these parameters
1487	   are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs
1488	   typical on the public Internet.  [CRED_Insights] explains why these
1489	   parameters are applicable whatever the rate of the link this AQM
1490	   is deployed on and how the parameters would need to be adjusted for a
1491	   scenario with a different range of RTTs (e.g. a data centre) {ToDo
1492	   incorporate a summary of that report into this draft}. The setting of
1493	   k depends on policy (see Section 2.5 and Appendix C respectively for
1494	   its recommended setting and guidance on alternatives).

1496	   There is also a cUrviness parameter, U, which is a small positive
1497	   integer.  It is likely to take the same hard-coded value for all
1498	   implementations, once experiments have determined a good value.  We
1499	   have solely used U=1 in our experiments so far, but results might be
1500	   even better with U=2 or higher.

1502	   Note that the dropping function at line 9 calls maxrand(2*U), which
1503	   gives twice as much curviness as the call to maxrand(U) in the
1504	   marking function at line 3.  This is the trick that implements the
1505	   square rule in equation (1) (Section 2.1).  This is based on the fact
1506	   that, given a number X from 1 to 6, the probability that two dice
1507	   throws will both be less than X is the square of the probability that
1508	   one throw will be less than X.  So, when U=1, the L4S marking
1509	   function is linear and the Classic dropping function is squared.  If
1510	   U=2, L4S would be a square function and Classic would be quartic.
1511	   And so on.
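
   The dice argument can be verified exactly by enumeration.  The
   Python sketch below is illustrative only (the function name is
   hypothetical).  It relies on the fact that "every throw is less
   than X" is the same event as "the maximum throw is less than X",
   which is why comparing against maxrand() raises the probability to
   a power.

```python
from fractions import Fraction
from itertools import product

def prob_all_below(x, throws, faces=6):
    """Exact probability that every one of `throws` dice shows less than x."""
    outcomes = list(product(range(1, faces + 1), repeat=throws))
    hits = sum(1 for o in outcomes if max(o) < x)
    return Fraction(hits, len(outcomes))
```

   For X = 4, one throw is below X with probability 1/2 and two throws
   are both below X with probability 1/4.  Four throws give the quartic
   case that corresponds to the Classic queue when U=2.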

1513	   The maxrand(u) function in lines 16-21 simply generates u random
1514	   numbers and returns the maximum (Note 7).  Typically, maxrand(u)
1515	   could be run in parallel out of band.  For instance, if U=1, the
1516	   Classic queue would require the maximum of two random numbers.  So,
1517	   instead of calling maxrand(2*U) in-band, the maximum of every pair of
1518	   values from a pseudorandom number generator could be generated out-
1519	   of-band, and held in a buffer ready for the Classic queue to consume.

1521	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1522	   2:     if ( lq.dequeue(pkt) ) {
1523	   3:        if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U)))
1524	   4:           mark(pkt)
1525	   5:        return(pkt)              % return the packet and stop here
1526	   6:     }
1527	   7:     while ( cq.dequeue(pkt) ) {
1528	   8:         Q_C += (pkt.ns() - Q_C) >> f_C           % Classic Q EWMA
1529	   9:        if ( (Q_C >> (S_C-2) ) > maxrand(2*U) )
1530	   10:          drop(pkt)                     % Squared drop, redo loop
1531	   11:       else
1532	   12:          return(pkt)           % return the packet and stop here
1533	   13:    }
1534	   14:    return(NULL)                           % no packet to dequeue
1535	   15: }

1537	   Figure 9: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM
1538	                         using Integer Arithmetic

1540	   Notes:

1542	   1.  The drain rate of the queue can vary if it is scheduled relative
1543	       to other queues, or to cater for fluctuations in a wireless
1544	       medium.  To auto-adjust to changes in drain rate, the queue must
1545	       be measured in time, not bytes or packets [CoDel].  In our Linux
1546	       implementation, it was easiest to measure queuing time at
1547	       dequeue.  Queuing time can be estimated when a packet is enqueued
1548	       by measuring the queue length in bytes and dividing by the recent
1549	       drain rate.

1551	   2.  An implementation has to use priority queueing, but it need not
1552	       implement strict priority.

1554	   3.  If packets can be enqueued while processing dequeue code, an
1555	       implementer might prefer to place the while loop around both
1556	       queues so that it goes back to test again whether any L4S packets
1557	       arrived while it was dropping a Classic packet.

1559	   4.  In order not to change too many factors at once, for now, we keep
1560	       the marking function for DCTCP-only traffic as similar as
1561	       possible to DCTCP.  However, unlike DCTCP, all processing is at
1562	       dequeue, so we determine whether to mark a packet at the head of
1563	       the queue by the byte-length of the queue _behind_ it.  We plan
1564	       to test whether using queuing time will work in all
1565	       circumstances, and if we find that the step can cause
1566	       oscillations, we will investigate replacing it with a steep
1567	       random marking curve.

1569	   5.  An EWMA is only one possible way to filter bursts; other more
1570	       adaptive smoothing methods could be valid and it might be
1571	       appropriate to decrease the EWMA faster than it increases.

1573	   6.  In practice at line 10 the Classic queue would probably test for
1574	       ECN capability on the packet to determine whether to drop or mark
1575	       the packet.  However, for brevity such detail is omitted.  All
1576	       packets classified into the L4S queue have to be ECN-capable, so
1577	       no dropping logic is necessary at line 3.  Nonetheless, L4S
1578	       packets could be dropped by overload code (see Section 4.1).

1580	   7.  In the integer variant of the pseudocode (Figure 9) real numbers
1581	       are all represented as integers scaled up by 2^32.  In lines 3 &
1582	       9 the function maxrand() is arranged to return an integer in the
1583	       range 0 <= maxrand() < 2^32.  Queuing times are also scaled up by
1584	       2^32, but in two stages: i) In lines 3 and 8 queuing times
1585	       cq.ns() and pkt.ns() are returned in integer nanoseconds, making
1586	       the values about 2^30 times larger than when the units were
1587	       seconds, ii) then in lines 3 and 9 an adjustment of -2 to the
1588	       right bit-shift multiplies the result by 2^2, to complete the
1589	       scaling by 2^32.
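
   The two-stage scaling in Note 7 can be worked through in Python.
   The helper names are hypothetical.  A helper is used for the shift
   because, with the recommended S_C = -1, the shift count S_C-2 is
   negative, which Python's >> operator does not accept; treating a
   negative right shift as a left shift is an assumption about the
   intended meaning.

```python
def shift_right(value, bits):
    """Right shift that treats a negative count as a left shift."""
    return value >> bits if bits >= 0 else value << -bits

def scaled_prob_int(q_ns, s):
    """Queuing time in ns scaled as in lines 3 and 9 of Figure 9."""
    return shift_right(q_ns, s - 2)

def scaled_prob_real(q_sec, s):
    """Real-number equivalent from Figure 8, as a fraction of 1."""
    return q_sec / 2.0 ** s
```

   For a 50 ms queuing time and S_C = -1, the real-number form gives
   0.1, while the integer form gives 400,000,000 out of 2^32, which is
   about 0.093.  The difference of roughly 7% arises because 10^9
   nanoseconds per second is only approximately 2^30.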

1591	Appendix C.  Guidance on Controlling Throughput Equivalence

1593	                     +---------------+------+-------+
1594	                     | RTT_C / RTT_L | Reno | Cubic |
1595	                     +---------------+------+-------+
1596	                     |             1 | k'=1 | k'=0  |
1597	                     |             2 | k'=2 | k'=1  |
1598	                     |             3 | k'=2 | k'=2  |
1599	                     |             4 | k'=3 | k'=2  |
1600	                     |             5 | k'=3 | k'=3  |
1601	                     +---------------+------+-------+

1603	    Table 1: Value of k' for which DCTCP throughput is roughly the same
1604	               as Reno or Cubic, for some example RTT ratios

1606	   k' is related to k in Equation (1) (Section 2.1) by k=2^k'.

1608	   To determine the appropriate policy, the operator first has to judge
1609	   whether it wants DCTCP flows to have roughly equal throughput with
1610	   Reno or with Cubic (because, even in its Reno-compatibility mode,
1611	   Cubic is about 1.4 times more aggressive than Reno).  Then the
1612	   operator needs to decide at what ratio of RTTs it wants DCTCP and
1613	   Classic flows to have roughly equal throughput.  For example choosing
1614	   k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same
1615	   as Cubic, _if their RTTs are the same_.

1617	   However, even if the base RTTs are the same, the actual RTTs are
1618	   unlikely to be the same, because Classic (Cubic or Reno) traffic
1619	   needs a large queue to avoid under-utilization and excess drop,
1620	   whereas L4S (DCTCP) does not.  The operator might still choose this
1621	   policy if it judges that DCTCP throughput should be rewarded for
1622	   keeping its own queue short.

1624	   On the other hand, the operator will choose one of the higher values
1625	   for k', if it wants to slow DCTCP down to roughly the same throughput
1626	   as Classic flows, to compensate for Classic flows slowing themselves
1627	   down by causing themselves extra queuing delay.

1629	   The values for k' in the table are derived from the following
1630	   formulae, which were developed in [DCttH15]:

1632	       2^k' = 1.64 (RTT_reno / RTT_dc)                  (2)
1633	       2^k' = 1.19 (RTT_cubic / RTT_dc )                (3)
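
   As a check, the following Python sketch solves equations (2) and (3)
   for k' and rounds to the nearest integer.  The rounding rule is an
   assumption, but it reproduces every entry in Table 1.  The function
   name k_prime is illustrative.

```python
import math

def k_prime(rtt_ratio, classic="reno"):
    """Nearest integer k' satisfying equation (2) or (3) for RTT_C / RTT_L."""
    coeff = {"reno": 1.64, "cubic": 1.19}[classic]
    return round(math.log2(coeff * rtt_ratio))

# Maps RTT_C / RTT_L to (k' for Reno, k' for Cubic), as in Table 1
table = {r: (k_prime(r, "reno"), k_prime(r, "cubic")) for r in range(1, 6)}
```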

1635	   For localized traffic from a particular ISP's data centre, we used
1636	   the measured RTTs to calculate that a value of k'=3 (equivalent to
1637	   k=8) would achieve throughput equivalence, and our experiments
1638	   verified the formula very closely.

1640	   For a typical mix of RTTs from local data centres and across the
1641	   general Internet, a value of k'=1 (equivalent to k=2) is recommended
1642	   as a good workable compromise.

1644	Appendix D.  Open Issues

1646	   Most of the following open issues are also tagged '{ToDo}' at the
1647	   appropriate point in the document:

1649	      Operational guidance to monitor L4S experiment

1651	      Define additional classifier flexibility more clearly

1653	      PI2 appendix: scaling of alpha & beta, esp. dependence of beta_U
1654	      on Tupdate

1655	      Curvy RED appendix: complete the unfinished parts

1657	Authors' Addresses

1659	   Koen De Schepper
1660	   Nokia Bell Labs
1661	   Antwerp
1662	   Belgium

1664	   Email: koen.de_schepper@nokia.com
1665	   URI:   https://www.bell-labs.com/usr/koen.de_schepper

1667	   Bob Briscoe (editor)
1668	   CableLabs
1669	   UK

1671	   Email: ietf@bobbriscoe.net
1672	   URI:   http://bobbriscoe.net/

1674	   Olga Bondarenko
1675	   Simula Research Lab
1676	   Lysaker
1677	   Norway

1679	   Email: olgabnd@gmail.com
1680	   URI:   https://www.simula.no/people/olgabo

1682	   Ing-jyh Tsang
1683	   Nokia
1684	   Antwerp
1685	   Belgium

1687	   Email: ing-jyh.tsang@nokia.com