idnits 2.17.1 

draft-ietf-tsvwg-aqm-dualq-coupled-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 18, 2018) is 2101 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '0' on line 1535

  -- Looks like a reference, but probably isn't: '1' on line 1535

  == Outdated reference: A later version (-02) exists of
     draft-briscoe-tsvwg-l4s-diffserv-00

  == Outdated reference: A later version (-29) exists of
     draft-ietf-tsvwg-ecn-l4s-id-02

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-02

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)

  -- Obsolete informational reference (is this intentional?): RFC 8312
     (Obsoleted by RFC 9438)


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Transport Area working group (tsvwg)                      K. De Schepper
3	Internet-Draft                                           Nokia Bell Labs
4	Intended status: Experimental                            B. Briscoe, Ed.
5	Expires: January 19, 2019                                      CableLabs
6	                                                           O. Bondarenko
7	                                                     Simula Research Lab
8	                                                                I. Tsang
9	                                                                   Nokia
10	                                                           July 18, 2018

12	  DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput
13	                                 (L4S)
14	                 draft-ietf-tsvwg-aqm-dualq-coupled-06

16	Abstract

18	   Data Centre TCP (DCTCP) was designed to provide predictably low
19	   queuing latency, near-zero loss, and throughput scalability using
20	   explicit congestion notification (ECN) and an extremely simple
21	   marking behaviour on switches.  However, DCTCP does not co-exist with
22	   existing TCP traffic---DCTCP is so aggressive that existing TCP
23	   algorithms approach starvation.  So, until now, DCTCP could only be
24	   deployed where a clean-slate environment could be arranged, such as
25	   in private data centres.  This specification defines `DualQ Coupled
26	   Active Queue Management (AQM)' to allow scalable congestion controls
27	   like DCTCP to safely co-exist with classic Internet traffic.  The
28	   Coupled AQM ensures that a flow runs at about the same rate whether
29	   it uses DCTCP or TCP Reno/Cubic, but without inspecting transport
30	   layer flow identifiers.  When tested in a residential broadband
31	   setting, DCTCP achieved sub-millisecond average queuing delay and
32	   zero congestion loss under a wide range of mixes of DCTCP and
33	   `Classic' broadband Internet traffic, without compromising the
34	   performance of the Classic traffic.  The solution also reduces
35	   network complexity and eliminates network configuration.

37	Status of This Memo

39	   This Internet-Draft is submitted in full conformance with the
40	   provisions of BCP 78 and BCP 79.

42	   Internet-Drafts are working documents of the Internet Engineering
43	   Task Force (IETF).  Note that other groups may also distribute
44	   working documents as Internet-Drafts.  The list of current Internet-
45	   Drafts is at https://datatracker.ietf.org/drafts/current/.

47	   Internet-Drafts are draft documents valid for a maximum of six months
48	   and may be updated, replaced, or obsoleted by other documents at any
49	   time.  It is inappropriate to use Internet-Drafts as reference
50	   material or to cite them other than as "work in progress."

52	   This Internet-Draft will expire on January 19, 2019.

54	Copyright Notice

56	   Copyright (c) 2018 IETF Trust and the persons identified as the
57	   document authors.  All rights reserved.

59	   This document is subject to BCP 78 and the IETF Trust's Legal
60	   Provisions Relating to IETF Documents
61	   (https://trustee.ietf.org/license-info) in effect on the date of
62	   publication of this document.  Please review these documents
63	   carefully, as they describe your rights and restrictions with respect
64	   to this document.  Code Components extracted from this document must
65	   include Simplified BSD License text as described in Section 4.e of
66	   the Trust Legal Provisions and are provided without warranty as
67	   described in the Simplified BSD License.

69	Table of Contents

71	   1.  Introduction
72	     1.1.  Problem and Scope
73	     1.2.  Terminology
74	     1.3.  Features
75	   2.  DualQ Coupled AQM
76	     2.1.  Coupled AQM
77	     2.2.  Dual Queue
78	     2.3.  Traffic Classification
79	     2.4.  Overall DualQ Coupled AQM Structure
80	     2.5.  Normative Requirements for a DualQ Coupled AQM
81	       2.5.1.  Functional Requirements
82	         2.5.1.1.  Requirements in Unexpected Cases
83	       2.5.2.  Management Requirements
84	   3.  IANA Considerations
85	   4.  Security Considerations
86	     4.1.  Overload Handling
87	       4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput
88	               or Delay?
89	       4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or
90	               Delay?
91	       4.1.3.  Protecting against Unresponsive ECN-Capable Traffic
92	   5.  Acknowledgements
93	   6.  References
94	     6.1.  Normative References
95	     6.2.  Informative References
96	   Appendix A.  Example DualQ Coupled PI2 Algorithm
97	     A.1.  Pass #1: Core Concepts
98	     A.2.  Pass #2: Overload Details
99	   Appendix B.  Example DualQ Coupled Curvy RED Algorithm
100	   Appendix C.  Guidance on Controlling Throughput Equivalence
101	   Appendix D.  Open Issues
102	   Authors' Addresses

104	1.  Introduction

106	1.1.  Problem and Scope

108	   Latency is becoming the critical performance factor for many (most?)
109	   applications on the public Internet, e.g. interactive Web, Web
110	   services, voice, conversational video, interactive video, interactive
111	   remote presence, instant messaging, online gaming, remote desktop,
112	   cloud-based applications, and video-assisted remote control of
113	   machinery and industrial processes.  In the developed world, further
114	   increases in access network bit-rate offer diminishing returns,
115	   whereas latency is still a multi-faceted problem.  In the last decade
116	   or so, much has been done to reduce propagation time by placing
117	   caches or servers closer to users.  However, queuing remains a major
118	   component of latency.

120	   The Diffserv architecture provides Expedited Forwarding [RFC3246], so
121	   that low latency traffic can jump the queue of other traffic.
122	   However, on access links dedicated to individual sites (homes, small
123	   enterprises or mobile devices), often all traffic at any one time
124	   will be latency-sensitive and, if all the traffic on a link is marked
125	   as EF, Diffserv cannot reduce the delay of any of it.  In contrast,
126	   the Low Latency Low Loss Scalable throughput (L4S) approach removes
127	   the causes of any unnecessary queuing delay.

129	   The bufferbloat project has shown that excessively-large buffering
130	   (`bufferbloat') has been introducing significantly more delay than
131	   the underlying propagation time.  These delays appear only
132	   intermittently--only when a capacity-seeking (e.g.  TCP) flow is long
133	   enough for the queue to fill the buffer, making every packet in other
134	   flows sharing the buffer sit through the queue.

136	   Active queue management (AQM) was originally developed to solve this
137	   problem (and others).  Unlike Diffserv, which gives low latency to
138	   some traffic at the expense of others, AQM controls latency for _all_
139	   traffic in a class.  In general, AQMs introduce an increasing level
140	   of discard from the buffer the longer the queue persists above a
141	   shallow threshold.  This gives sufficient signals to capacity-seeking
142	   (a.k.a. greedy) flows to keep the buffer empty for its intended
143	   purpose: absorbing bursts.  However, RED [RFC2309] and other
144	   algorithms from the 1990s were sensitive to their configuration and
145	   hard to set correctly.  So, AQM was not widely deployed.

147	   More recent state-of-the-art AQMs, e.g. fq_CoDel [RFC8290],
148	   PIE [RFC8033], Adaptive RED [ARED01], are easier to configure,
149	   because they define the queuing threshold in time not bytes, so it is
150	   invariant for different link rates.  However, no matter how good the
151	   AQM, the sawtoothing rate of TCP will either cause queuing delay to
152	   vary or cause the link to be under-utilized.  Even with a perfectly
153	   tuned AQM, the additional queuing delay will be of the same order as
154	   the underlying speed-of-light delay across the network.  Flow-queuing
155	   can isolate one flow from another, but it cannot isolate a TCP flow
156	   from the delay variations it inflicts on itself, and it has other
157	   problems - it overrides the flow rate decisions of variable rate
158	   video applications, it does not recognise the flows within IPSec VPN
159	   tunnels and it is relatively expensive to implement.

161	   It seems that further changes to the network alone will now yield
162	   diminishing returns.  Data Centre TCP (DCTCP [RFC8257]) teaches us
163	   that a small but radical change to TCP is needed to cut two major
164	   outstanding causes of queuing delay variability:

166	   1.  the `sawtooth' varying rate of TCP itself;

168	   2.  the smoothing delay deliberately introduced into AQMs to permit
169	       bursts without triggering losses.

171	   The former causes a flow's round trip time (RTT) to vary from about 1
172	   to 2 times the base RTT between the machines in question.  The latter
173	   delays the system's response to change by a worst-case
174	   (transcontinental) RTT, which could be hundreds of times the actual
175	   RTT of typical traffic from localized CDNs.

177	   Latency is not our only concern:

179	   3.  It was known when TCP was first developed that it would not scale
180	       to high bandwidth-delay products.

182	   Given regular broadband bit-rates over WAN distances are
183	   already [RFC3649] beyond the scaling range of `classic' TCP Reno,
184	   `less unscalable' Cubic [RFC8312] and
185	   Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been
186	   successfully deployed.  However, these are now approaching their
187	   scaling limits.  Unfortunately, fully scalable TCPs such as DCTCP
188	   cause `classic' TCP to starve itself, which is why they have been
189	   confined to private data centres or research testbeds (until now).

191	   This document specifies a `DualQ Coupled AQM' extension that solves
192	   the problem of coexistence between scalable and classic flows,
193	   without having to inspect flow identifiers.  The AQM is not like
194	   flow-queuing approaches [RFC8290] that classify packets by flow
195	   identifier into numerous separate queues in order to isolate sparse
196	   flows from the higher latency in the queues assigned to heavier flows.
197	   In contrast, the AQM exploits the behaviour of scalable congestion
198	   controls like DCTCP so that every packet in every flow sharing the
199	   queue for DCTCP-like traffic can be served with very low latency.

201	   This AQM extension can be combined with any single queue AQM that
202	   generates a statistical or deterministic mark/drop probability driven
203	   by the queue dynamics.  In many cases it simplifies the basic control
204	   algorithm, and requires little extra processing.  Therefore it is
205	   believed the Coupled AQM would be applicable and easy to deploy in
206	   all types of buffers: buffers in cost-reduced mass-market residential
207	   equipment; buffers in end-system stacks; buffers in carrier-scale
208	   equipment including remote access servers, routers, firewalls and
209	   Ethernet switches; buffers in network interface cards; buffers in
210	   virtualized network appliances and hypervisors; and so on.

212	   The overall L4S architecture is described in
213	   [I-D.ietf-tsvwg-l4s-arch].  The supporting papers [PI2] and [DCttH15]
214	   give the full rationale for the AQM's design, both discursively and
215	   in more precise mathematical form.

217	1.2.  Terminology

219	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
220	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
221	   document are to be interpreted as described in [RFC2119].  In this
222	   document, these words will appear with that interpretation only when
223	   in ALL CAPS.  Lower case uses of these words are not to be
224	   interpreted as carrying RFC-2119 significance.

226	   The DualQ Coupled AQM uses two queues for two services.  Each of the
227	   following terms identifies both the service and the queue that
228	   provides the service:

230	   Classic (denoted by subscript C):  The `Classic' service is intended
231	      for all the behaviours that currently co-exist with TCP Reno (TCP
232	      Cubic, Compound, SCTP, etc).

234	   Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L):
235	      The `L4S' service is intended for a set of congestion controls
236	      with scalable properties such as DCTCP (e.g.
237	      Relentless [Mathis09]).

239	   Either service can cope with a proportion of unresponsive or less-
240	   responsive traffic as well (e.g.  DNS, VoIP, etc), just as a single
241	   queue AQM can.  The DualQ Coupled AQM behaviour is similar to a
242	   single FIFO queue with respect to unresponsive and overload traffic.

244	1.3.  Features

246	   The AQM couples marking and/or dropping across the two queues such
247	   that a flow will get roughly the same throughput whichever it uses.
248	   Therefore both queues can feed into the full capacity of a link and
249	   no rates need to be configured for the queues.  The L4S queue enables
250	   scalable congestion controls like DCTCP to give very low and
251	   predictably low latency, without compromising the performance of
252	   competing 'Classic' Internet traffic.  Thousands of tests have been
253	   conducted in a typical fixed residential broadband setting.  Typical
254	   experiments used base round trip delays up to 100ms between the data
255	   centre and home network, and large amounts of background traffic in
256	   both queues.  For every L4S packet, the AQM kept the average queuing
257	   delay below 1ms (or 2 packets if serialization delay is bigger for
258	   slow links), and no losses at all were introduced by the AQM.
259	   Details of the extensive experiments will be made available [PI2]
260	   [DCttH15].

262	   Subjective testing was also conducted using a demanding panoramic
263	   interactive video application run over a stack with DCTCP enabled and
264	   deployed on the testbed.  Each user could pan or zoom their own high
265	   definition (HD) sub-window of a larger video scene from a football
266	   match.  Even though the user was also downloading large amounts of
267	   L4S and Classic data, latency was so low that the picture appeared to
268	   stick to their finger on the touchpad (all the L4S data achieved the
269	   same ultra-low latency).  With an alternative AQM, the video
270	   noticeably lagged behind the finger gestures.

272	   Unlike Diffserv Expedited Forwarding, the L4S queue does not have to
273	   be limited to a small proportion of the link capacity in order to
274	   achieve low delay.  The L4S queue can be filled with a heavy load of
275	   capacity-seeking flows like DCTCP and still achieve low delay.  The
276	   L4S queue does not rely on the presence of other traffic in the
277	   Classic queue that can be 'overtaken'.  It gives low latency to L4S
278	   traffic whether or not there is Classic traffic, and the latency of
279	   Classic traffic does not suffer when a proportion of the traffic is
280	   L4S.  The two queues are only necessary because DCTCP-like flows
281	   cannot keep latency predictably low and keep utilization high if they
282	   are mixed with legacy TCP flows.

284	   The experiments used the Linux implementation of DCTCP that is
285	   deployed in private data centres, without any modification despite
286	   its known deficiencies.  Nonetheless, certain modifications will be
287	   necessary before DCTCP is safe to use on the Internet, which are
288	   recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id].  However, the
289	   focus of this specification is to get the network service in place.
290	   Then, without any management intervention, applications can exploit
291	   it by migrating to scalable controls like DCTCP, which can then
292	   evolve _while_ their benefits are being enjoyed by everyone on the
293	   Internet.

295	2.  DualQ Coupled AQM

297	   There are two main aspects to the approach:

299	   o  the Coupled AQM that addresses throughput equivalence between
300	      Classic (e.g.  Reno, Cubic) flows and L4S (e.g.  DCTCP) flows

302	   o  the Dual Queue structure that provides latency separation for L4S
303	      flows to isolate them from the typically large Classic queue.

305	2.1.  Coupled AQM

307	   In the 1990s, the `TCP formula' was derived for the relationship
308	   between TCP's congestion window, cwnd, and its drop probability, p.
309	   To a first order approximation, cwnd of TCP Reno is inversely
310	   proportional to the square root of p.

312	   TCP Cubic implements a Reno-compatibility mode, which is the only
313	   relevant mode for typical RTTs under 20ms as long as the throughput
314	   of a single flow is less than about 500Mb/s.  Therefore it can be
315	   assumed that Cubic traffic behaves similarly to Reno (but with a
316	   slightly different constant of proportionality), and the term
317	   'Classic' will be used for the collection of Reno-friendly traffic
318	   including Cubic in Reno mode.

320	   The supporting paper [PI2] includes the derivation of the equivalent
321	   rate equation for DCTCP, for which cwnd is inversely proportional to
322	   p (not the square root), where in this case p is the ECN marking
323	   probability.  DCTCP is not the only congestion control that behaves
324	   like this, so the term 'L4S' traffic will be used for all similar
325	   behaviour.

327	   In order to make a DCTCP flow run at roughly the same rate as a Reno
328	   TCP flow (all other factors being equal), the drop or marking
329	   probability for Classic traffic, p_C, has to be distinct from the
330	   marking probability for L4S traffic, p_L (in contrast to RFC3168
331	   which requires them to be the same).  It is necessary to make the
332	   Classic drop probability p_C proportional to the square of the L4S
333	   marking probability p_L.  This makes the Reno flow rate roughly equal
334	   the DCTCP flow rate, because it squares the square root of p_C in the
335	   Reno rate equation to make it proportional to the straight p_L in the
336	   DCTCP rate equation.

338	   Stating this as a formula, the relation between Classic drop
339	   probability, p_C, and L4S marking probability, p_L, needs to take the
340	   form:

342	       p_C = ( p_L / k )^2                  (1)

344	   where k is the constant of proportionality.
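As a rough numerical check of equation (1), the sketch below compares the steady-state windows of a Reno flow and a DCTCP flow under the coupling.  The rate-equation constants (1.22 for Reno, 2 for DCTCP) are commonly quoted approximations and are assumptions here, not values stated in this section; the function names are hypothetical.

```python
import math


def classic_drop_probability(p_l, k=2.0):
    """Equation (1): p_C = (p_L / k)^2."""
    return (p_l / k) ** 2


def reno_cwnd(p_c):
    # Assumed first-order approximation: cwnd ~ 1.22 / sqrt(p_C).
    return 1.22 / math.sqrt(p_c)


def dctcp_cwnd(p_l):
    # Assumed approximation for probabilistic marking: cwnd ~ 2 / p_L.
    return 2.0 / p_l


if __name__ == "__main__":
    for p_l in (0.02, 0.1, 0.4):
        p_c = classic_drop_probability(p_l)
        ratio = reno_cwnd(p_c) / dctcp_cwnd(p_l)
        print(f"p_L={p_l:.2f}  p_C={p_c:.4f}  Reno/DCTCP cwnd ratio={ratio:.2f}")
```

Under these assumed constants the window ratio does not depend on p_L, which is the purpose of the squaring.  With k = 2 the ratio is about 1.22; a k of about 1.64 would make it 1.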

346	2.2.  Dual Queue

348	   Classic traffic typically builds a large queue to prevent under-
349	   utilization.  Therefore a separate queue is provided for L4S traffic,
350	   and it is scheduled with priority over Classic.  Priority is
351	   conditional to prevent starvation of Classic traffic.

353	   Nonetheless, coupled marking ensures that giving priority to L4S
354	   traffic still leaves the right amount of spare scheduling time for
355	   Classic flows to each get equivalent throughput to DCTCP flows (all
356	   other factors such as RTT being equal).  The algorithm achieves this
357	   without having to inspect flow identifiers.

359	2.3.  Traffic Classification

361	   Both the Coupled AQM and DualQ mechanisms need an identifier to
362	   distinguish L and C packets.  A separate draft
363	   [I-D.ietf-tsvwg-ecn-l4s-id] recommends using the ECT(1) codepoint of
364	   the ECN field as this identifier, having assessed various
365	   alternatives.  An additional process document has proved necessary to
366	   make the ECT(1) codepoint available for experimentation [RFC8311].

368	   For policy reasons, an operator might choose to steer certain packets
369	   (e.g. from certain flows or with certain addresses) out of the L
370	   queue, even though they identify themselves as L4S by their ECN
371	   codepoints.  In such cases, the classifier MUST NOT alter the ECN
372	   field, so that it is preserved end-to-end.  The aim is that each
373	   operator can choose how it treats L4S traffic locally, but an
374	   individual operator does not alter the identification of L4S packets,
375	   which would prevent other operators downstream from making their own
376	   choices on how to treat L4S traffic.

378	   In addition, other identifiers could be used to classify certain
379	   additional packet types into the L queue, that are deemed not to risk
380	   harming the L4S service.  For instance addresses of specific
381	   applications or hosts (see [I-D.ietf-tsvwg-ecn-l4s-id]), specific
382	   Diffserv codepoints such as EF (Expedited Forwarding) and Voice-Admit
383	   service classes (see [I-D.briscoe-tsvwg-l4s-diffserv]) or certain
384	   protocols (e.g.  ARP, DNS).

386	   Note that the DualQ Coupled AQM only reads these classifiers; it MUST
387	   NOT re-mark or alter these identifiers (except for marking the ECN
388	   field with the CE codepoint - with increasing frequency to indicate
389	   increasing congestion).
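The default classification described in this section can be sketched as follows.  The codepoint values are those of the 2-bit ECN field in the IP header; sending CE packets to the L queue is an assumption inferred from the later text on unexpected cases, and the function name is hypothetical.  The classifier only reads the field and never rewrites it.

```python
# 2-bit ECN field codepoints (low-order two bits of the former ToS octet).
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def classify(tos_octet):
    """Return "L" or "C" for a packet, given its ToS / Traffic Class octet.

    Assumption: ECT(1) and CE go to the L queue, everything else to the
    C queue.  The octet is read only; it is never modified here.
    """
    ecn = tos_octet & 0x03
    return "L" if ecn in (ECT_1, CE) else "C"


if __name__ == "__main__":
    for octet in (0x00, 0x01, 0x02, 0x03):
        print(f"ECN={octet & 0x03:02b} -> queue {classify(octet)}")
```

An operator-specific classifier, such as one matching addresses or Diffserv codepoints, would add further tests before the ECN test and would likewise leave the ECN field unchanged.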

391	2.4.  Overall DualQ Coupled AQM Structure

393	   Figure 1 shows the overall structure that any DualQ Coupled AQM is
394	   likely to have.  This schematic is intended to aid understanding of
395	   the current designs of DualQ Coupled AQMs.  However, it is not
396	   intended to preclude other innovative ways of satisfying the
397	   normative requirements in Section 2.5 that minimally define a DualQ
398	   Coupled AQM.

400	   The classifier on the left separates incoming traffic between the two
401	   queues (L and C).  Each queue has its own AQM that determines the
402	   likelihood of dropping or marking (p_L and p_C).  Nonetheless, the
403	   AQM for Classic traffic is implemented in two stages: i) a base stage
404	   that outputs an internal probability p' (pronounced p-prime); and ii)
405	   a squaring stage that outputs p_C, where

407	       p_C = (p')^2.                        (2)

409	   This allows p_L to be coupled to p_C by marking L4S traffic
410	   proportionately to the intermediate output from the first stage.
411	   Specifically, the probability coupled across to the L queue, p_CL, is
412	   proportional to the output of the base AQM, p':

414	       p_CL = k*p',                         (3)

416	   where k is the constant coupling factor (see Appendix C) and p_CL is
417	   the output from the coupling between the C queue and the L queue.

419	   It can be seen in the following that these two transformations of p'
420	   implement the required coupling given in equation (1) earlier.
421	   Substituting for p' from equation (3) into (2):

423	      p_C = ( p_CL / k )^2.

425	   The actual L4S marking probability p_L is the maximum of the coupled
426	   output (p_CL) and the output of a native L4S AQM (p'L), shown as
427	   '(MAX)' in the schematic.  While the output of the Native L4S AQM is
428	   high (p'L > p_CL) it will dominate the way L traffic is marked.  When
429	   the native L4S AQM output is lower, the way L traffic is marked will
430	   be driven by the coupling, that is p_L = p_CL.  So, whenever the
431	   coupling is needed, as required from equation (1):

433	      p_C = ( p_L / k )^2.

435	                           _________
436	                                  | |    ,------.
437	                        L4S queue | |===>| ECN  |
438	                       ,'| _______|_|    |marker|\
439	                     <'  |         |     `------'\\
440	                      //`'         v        ^ p_L \\
441	                     //        ,-------.    |      \\
442	                    //         |Native |p'L |       \\,.
443	                   //          |  L4S  |-->(MAX)    <  |   ___
444	      ,----------.//           |  AQM  |    ^ p_CL   `\|.'Cond-`.
445	      |  IP-ECN  |/            `-------'    |          / itional \
446	   ==>|Classifier|             ,-------.  (k*p')       [ priority]==>
447	      |          |\            |  Base |    |          \scheduler/
448	      `----------'\\           |  AQM  |--->:        ,'|`-.___.-'
449	                   \\          |       |p'  |      <'  |
450	                    \\         `-------'  (p'^2)    //`'
451	                     \\            ^        |      //
452	                      \\,.         |        v p_C //
453	                      <  | _________     .------.//
454	                       `\|   |      |    | Drop |/
455	                     Classic |queue |===>|/mark |
456	                           __|______|    `------'

458	   Legend: ===> traffic flow; ---> control dependency.

460	                   Figure 1: DualQ Coupled AQM Schematic
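The probability transformations in equations (2) and (3) and the '(MAX)' stage of Figure 1 can be sketched as below.  This is a minimal illustration, not one of the example algorithms in the appendices; the function name is hypothetical, and clamping the coupled probability to 1 is an assumption added so that the result remains a valid probability.

```python
def dualq_probabilities(p_base, p_native_l, k=2.0):
    """Map the base AQM output p' to (p_C, p_L).

    p_base     -- p', output of the base AQM on the Classic queue
    p_native_l -- p'L, output of the native L4S AQM
    k          -- coupling factor
    """
    p_c = p_base ** 2                # equation (2): squaring stage
    p_cl = min(k * p_base, 1.0)      # equation (3), clamped (assumption)
    p_l = max(p_native_l, p_cl)      # '(MAX)' in Figure 1
    return p_c, p_l


if __name__ == "__main__":
    # Coupling dominates: native L4S AQM is idle.
    print(dualq_probabilities(p_base=0.1, p_native_l=0.0))
    # Native L4S AQM dominates.
    print(dualq_probabilities(p_base=0.1, p_native_l=0.5))
```

When the coupling dominates and no clamping occurs, the returned values satisfy p_C = (p_L / k)^2, as required by equation (1).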

462	   After the AQMs have applied their dropping or marking, the scheduler
463	   forwards their packets to the link, giving priority to L4S traffic.
464	   Priority has to be conditional in some way (see Section 4.1).  Simple
465	   strict priority is inappropriate, because it could lead the L4S
466	   queue to starve the Classic queue.  For example, consider the case
467	   where a continually busy L4S queue blocks a DNS request in the
468	   Classic queue, arbitrarily delaying the start of a new Classic flow.
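One possible way to make priority conditional is to bound how many L4S packets may be served consecutively while a Classic packet is waiting.  The sketch below illustrates that idea only; it is not the scheduler specified by this document, and the class name and the l_burst parameter are hypothetical.

```python
from collections import deque


class ConditionalPriorityScheduler:
    """Serve L first, but at most l_burst L packets while C is waiting."""

    def __init__(self, l_burst=16):
        self.l_queue = deque()
        self.c_queue = deque()
        self.l_burst = l_burst   # hypothetical bound on L priority
        self._l_run = 0          # consecutive L packets sent while C waited

    def enqueue(self, packet, is_l4s):
        (self.l_queue if is_l4s else self.c_queue).append(packet)

    def dequeue(self):
        l_waiting = bool(self.l_queue)
        c_waiting = bool(self.c_queue)
        if l_waiting and (not c_waiting or self._l_run < self.l_burst):
            # Count the run only while a Classic packet is being held back.
            self._l_run = self._l_run + 1 if c_waiting else 0
            return self.l_queue.popleft()
        if c_waiting:
            self._l_run = 0
            return self.c_queue.popleft()
        return None  # both queues empty


if __name__ == "__main__":
    sched = ConditionalPriorityScheduler(l_burst=2)
    for i in range(5):
        sched.enqueue(f"L{i}", is_l4s=True)
    sched.enqueue("dns-request", is_l4s=False)
    print([sched.dequeue() for _ in range(6)])
```

In the DNS example above, the Classic packet waits for at most l_burst L4S packets instead of waiting for the L4S queue to drain.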

470	   Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are
471	   given in Appendix A and Appendix B.  Either example AQM can be used
472	   to couple packet marking and dropping across a dual Q.

474	   DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM.
475	   Indeed, this Base AQM with just the squared output and no L4S queue
476	   can be used as a drop-in replacement for PIE [RFC8033], in which case
477	   we call it just PI2 [PI2].  PI2 is a principled simplification of PIE
478	   that is both more responsive and more stable in the face of
479	   dynamically varying load.

481	   Curvy RED is derived from RED [RFC2309], but its configuration
482	   parameters are insensitive to link rate and it requires fewer
483	   operations per packet.  However, DualPI2 is more responsive and
484	   stable over a wider range of RTTs than Curvy RED.  As a consequence,
485	   DualPI2 has attracted more development attention than Curvy RED,
486	   leaving the Curvy RED design incomplete and less fully evaluated.

488	   Both AQMs regulate their queue in units of time not bytes.  As
489	   already explained, this ensures configuration can be invariant for
490	   different drain rates.  With AQMs in a dualQ structure this is
491	   particularly important because the drain rate of each queue can vary
492	   rapidly as flows for the two queues arrive and depart, even if the
493	   combined link rate is constant.
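The benefit of a time-based target can be seen with a small calculation.  The sketch below converts a backlog in bytes into queuing delay at a given drain rate; the function name and the example figures are illustrative assumptions.

```python
def queuing_delay_s(backlog_bytes, drain_rate_bps):
    """Time needed to drain backlog_bytes at drain_rate_bps (bits/s)."""
    if drain_rate_bps <= 0:
        raise ValueError("drain rate must be positive")
    return backlog_bytes * 8 / drain_rate_bps


if __name__ == "__main__":
    backlog = 15_000  # bytes, i.e. ten 1500-byte packets
    for rate in (120e6, 12e6):
        delay_ms = queuing_delay_s(backlog, rate) * 1e3
        print(f"{backlog} B at {rate / 1e6:.0f} Mb/s -> {delay_ms:.1f} ms")
```

The same byte backlog represents ten times more delay when the drain rate of a queue falls by a factor of ten, so a threshold fixed in bytes would need re-tuning, whereas a threshold fixed in time would not.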

495	   It would be possible to control the queues with alternative
496	   AQMs, as long as the normative requirements (those expressed in
497	   capitals) in Section 2.5 are observed.

499	2.5.  Normative Requirements for a DualQ Coupled AQM

501	   The following requirements are intended to capture only the essential
502	   aspects of a DualQ Coupled AQM.  They are intended to be independent
503	   of the particular AQMs used for each queue.

505	2.5.1.  Functional Requirements

507	   In the Dual Queue, L4S packets MUST be given priority over Classic,
508	   although priority MUST be bounded in order not to starve Classic
509	   traffic.

511	   Whatever identifier is used for L4S experiments,
512	   [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on
513	   L4S traffic, relative to drop of Classic traffic.  In order to
514	   prevent starvation of Classic traffic by scalable L4S traffic, it
515	   says, "The likelihood that an AQM drops a Not-ECT Classic packet
516	   (p_C) MUST be roughly proportional to the square of the likelihood
517	   that it would have marked it if it had been an L4S packet (p_L)."  In
518	   other words, in any DualQ Coupled AQM, the power to which p_L is
519	   raised in Eqn. (1) MUST be 2.  The term 'likelihood' is used to allow
520	   for marking and dropping to be either probabilistic or deterministic.

522	   The constant of proportionality, k, in Eqn. (1) determines the
523	   relative flow rates of Classic and L4S flows when the AQM concerned
524	   is the bottleneck (all other factors being equal).

526	   [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality
527	   (k) does not have to be standardised for interoperability, but a
528	   value of 2 is RECOMMENDED."

530	   Assuming scalable congestion controls for the Internet will be as
531	   aggressive as DCTCP, this will ensure their congestion window will be
532	   roughly the same as that of a standards track TCP congestion control
533	   (Reno) [RFC5681] and other so-called TCP-friendly controls, such as
534	   TCP Cubic in its TCP-friendly mode.

536	   {ToDo: The requirements for scalable congestion controls on the
537	   Internet (termed the TCP Prague requirements)
538	   [I-D.ietf-tsvwg-ecn-l4s-id] are not necessarily final.  If the
539	   aggressiveness of DCTCP is not defined as the benchmark for scalable
540	   controls on the Internet, the recommended value of k will also be
541	   subject to change.}

543	   The choice of k is a matter of operator policy, and operators MAY
544	   choose a different value using Table 1 and the guidelines in
545	   Appendix C.

547	   If multiple users share capacity at a bottleneck (e.g. in the
548	   Internet access link of a campus network), the operator's choice of k
549	   will determine capacity sharing between the flows of different users.
550	   However, on the public Internet, access network operators typically
551	   isolate customers from each other with some form of layer-2
552	   multiplexing (TDM in DOCSIS, CDMA in 3G) or L3 scheduling (WRR in
553	   DSL), rather than relying on TCP to share capacity between customers
554	   [RFC0970].  In such cases, the choice of k will solely affect
555	   relative flow rates within each customer's access capacity, not
556	   between customers.  Also, k will not affect relative flow rates at
557	   any times when all flows are Classic or all L4S, and it will not
558	   affect small flows.

560	2.5.1.1.  Requirements in Unexpected Cases

562	   The flexibility to allow operator-specific classifiers (Section 2.3)
563	   leads to the need to specify what the AQM in each queue ought to do
564	   with packets that do not carry the ECN field expected for that queue.
565	   It is recommended that the AQM in each queue inspects the ECN field
566	   to determine what sort of congestion notification to signal, then
567	   decides whether to apply congestion notification to this particular
568	   packet, as follows:

570	   o  If a packet that does not carry an ECT(1) or CE codepoint is
571	      classified into the L queue:

573	      *  if the packet is ECT(0), the L AQM SHOULD apply drop using a
574	         drop probability appropriate to Classic congestion control and
575	         appropriate to the target delay in the L queue

577	      *  if the packet is Not-ECT, the appropriate action depends on
578	         whether some other function is protecting the L queue from
579	         misbehaving flows (e.g. per-flow queue protection or latency
580	         policing):

582	         +  if separate queue protection is provided, the L AQM SHOULD
583	            ignore the packet and forward it unchanged, meaning it
584	            should not calculate whether to apply congestion
585	            notification and it should neither drop nor CE-mark the
586	            packet (for instance, the operator might classify EF traffic
587	            that is unresponsive to drop into the L queue, alongside
588	            responsive L4S-ECN traffic)

590	         +  if separate queue protection is not provided, the L AQM
591	            SHOULD apply drop using a drop probability appropriate to
592	            Classic congestion control and appropriate to the target
593	            delay in the L queue

595	   o  If a packet that carries an ECT(1) codepoint is classified into
596	      the C queue:

598	      *  the C AQM SHOULD apply CE-marking using the coupled AQM
599	         probability p_CL (= k*p').

601	   If the DualQ Coupled AQM has detected overload, it will signal
602	   congestion solely using drop, irrespective of the ECN field.
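
   The decision list above could be sketched in Python as follows.
   This is only an illustration of the recommended ("SHOULD") actions.
   The function name, the returned labels and the treatment of cases
   the list does not mention are assumptions made for the example.

```python
from enum import Enum


class Ecn(Enum):
    NOT_ECT = 0b00
    ECT1 = 0b01
    ECT0 = 0b10
    CE = 0b11


def unexpected_case_action(queue: str, ecn: Ecn,
                           queue_protection: bool = False,
                           overload: bool = False) -> str:
    """Return a label for the type of congestion signal to apply."""
    if overload:
        return "drop-only"                 # drop, whatever the ECN field
    if queue == "L":
        if ecn is Ecn.ECT0:
            return "classic-drop"          # Classic drop prob, L target delay
        if ecn is Ecn.NOT_ECT:
            # Depends on whether something else protects the L queue.
            return "forward-unchanged" if queue_protection else "classic-drop"
        return "normal"                    # ECT(1) or CE: expected in L
    if queue == "C":
        if ecn is Ecn.ECT1:
            return "ce-mark-coupled"       # CE-mark with p_CL = k * p'
        # Not-ECT and ECT(0) are expected here; CE is not covered by
        # the list, so this sketch leaves it to the normal Classic AQM.
        return "normal"
    raise ValueError("queue must be 'L' or 'C'")
```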

604	   The above requirements are worded as "SHOULDs", because operator-
605	   specific classifiers are for flexibility, by definition.  Therefore,
606	   alternative actions might be appropriate in the operator's specific
607	   circumstances.  An example would be where the operator knows that
608	   certain legacy traffic marked with one codepoint actually has a
609	   congestion response associated with another codepoint.

611	2.5.2.  Management Requirements

613	   By default, a DualQ Coupled AQM SHOULD NOT need any configuration for
614	   use at a bottleneck on the public Internet [RFC7567].  The following
615	   parameters MAY be operator-configurable, e.g. to tune for non-
616	   Internet settings:

618	   o  Optional packet classifier(s) to use in addition to the ECN field
619	      (see Section 2.3);

621	   o  Expected typical RTT (a parameter for typical or target queuing
622	      delay in each queue might be configurable instead);

624	   o  Expected maximum RTT (a stability parameter that depends on
625	      maximum RTT might be configurable instead);

627	   o  Coupling factor, k;

629	   o  The limit to the conditional priority of L4S (scheduler-dependent,
630	      e.g. the scheduler weight for WRR, or the time-shift for time-
631	      shifted FIFO);

633	   o  The maximum Classic ECN marking probability, p_Cmax, before
634	      switching over to drop.

636	   An experimental DualQ Coupled AQM SHOULD allow the operator to
637	   monitor the following operational statistics:

639	   o  Bits forwarded (total and per queue per sample interval), from
640	      which utilization can be calculated

642	   o  Q delay (per queue over sample interval)

644	   o  Total packets arriving, enqueued and dequeued (per queue per
645	      sample interval)

647	   o  ECN packets marked, non-ECN packets dropped, ECN packets dropped
648	      (per queue per sample interval), from which marking and dropping
649	      probabilities can be calculated

651	   o  Time and duration of each overload event.

653	   The type of statistics produced for variables like Q delay (mean,
654	   percentiles, etc.) will depend on implementation constraints.
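
   As an illustration of how the counters listed above might be turned
   into utilization and marking or dropping fractions, consider the
   Python sketch below.  The record layout and the choice of arriving
   packets as the denominator are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """Counters for one queue over one sample interval (hypothetical)."""
    bits_forwarded: int
    pkts_arrived: int
    ecn_marked: int
    non_ecn_dropped: int
    ecn_dropped: int


def utilization(s: QueueSample, link_rate_bps: float, interval_s: float) -> float:
    """Fraction of the link capacity used during the interval."""
    return s.bits_forwarded / (link_rate_bps * interval_s)


def marking_fraction(s: QueueSample) -> float:
    """Marked packets as a fraction of arriving packets."""
    return s.ecn_marked / s.pkts_arrived if s.pkts_arrived else 0.0


def dropping_fraction(s: QueueSample) -> float:
    """Dropped packets (ECN and non-ECN) as a fraction of arrivals."""
    dropped = s.non_ecn_dropped + s.ecn_dropped
    return dropped / s.pkts_arrived if s.pkts_arrived else 0.0
```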

656	3.  IANA Considerations

658	   This specification contains no IANA considerations.

660	4.  Security Considerations

662	4.1.  Overload Handling

664	   Where the interests of users or flows might conflict, it could be
665	   necessary to police traffic to isolate any harm to the performance of
666	   individual flows.  However it is hard to avoid unintended side-
667	   effects with policing, and in a trusted environment policing is not
668	   necessary.  Therefore per-flow policing needs to be separable from a
669	   basic AQM, as an option under policy control.

671	   However, a basic DualQ AQM does at least need to handle overload.  A
672	   useful objective would be for the overload behaviour of the DualQ AQM
673	   to be at least no worse than a single queue AQM.  However, a trade-
674	   off needs to be made between complexity and the risk of either
675	   traffic class harming the other.  In each of the following three
676	   subsections, an overload issue specific to the DualQ is described,
677	   followed by proposed solution(s).

679	   Under overload the higher priority L4S service will have to sacrifice
680	   some aspect of its performance.  Alternative solutions are provided
681	   below that each relax a different factor: e.g. throughput, delay,
682	   drop.  Some of these choices might need to be determined by operator
683	   policy or by the developer, rather than by the IETF. {ToDo: Reach
684	   consensus on which it is to be in each case.}

686	4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay?

688	   Priority of L4S is required to be conditional to avoid total
689	   throughput starvation of Classic by heavy L4S traffic.  This raises
690	   the question of whether to sacrifice L4S throughput or L4S delay (or
691	   some other policy) to mitigate starvation of Classic:

693	   Sacrifice L4S throughput:   By using weighted round robin as the
694	      conditional priority scheduler, the L4S service can sacrifice some
695	      throughput during overload to guarantee a minimum throughput
696	      service for Classic traffic.  The scheduling weight of the Classic
697	      queue should be small (e.g. 1/16).  Then, in most traffic
698	      scenarios, the scheduler will not interfere and will not need to,
699	      because the coupling mechanism and the end-systems will share out
700	      the capacity across both queues as if it were a single pool.
701	      However, because the congestion coupling only applies in one
702	      direction (from C to L), if L4S traffic is over-aggressive or
703	      unresponsive, the scheduler weight for Classic traffic will at
704	      least be large enough to ensure it does not starve.

706	      In cases where the ratio of L4S to Classic flows (e.g. 19:1) is
707	      greater than the ratio of their scheduler weights (e.g. 15:1), the
708	      L4S flows will get less than an equal share of the capacity, but
709	      only slightly.  For instance, with the example numbers given, each
710	      L4S flow will get (15/16)/19 = 4.9% when ideally each would get
711	      1/20 = 5%.  In the rather specific case of an unresponsive flow
712	      taking up a large part of the capacity set aside for L4S, using
713	      WRR could significantly reduce the capacity left for any
714	      responsive L4S flows.

716	   Sacrifice L4S Delay:  To control milder overload of responsive
717	      traffic, particularly when close to the maximum congestion signal,
718	      the operator could choose to control overload of the Classic queue
719	      by allowing some delay to 'leak' across to the L4S queue.  The
720	      scheduler can be made to behave like a single First-In First-Out
721	      (FIFO) queue with different service times by implementing a very
722	      simple conditional priority scheduler that could be called a
723	      "time-shifted FIFO" (see the Modified Earliest Deadline First
724	      (MEDF) scheduler of [MEDF]).  This scheduler adds tshift to the
725	      queue delay of the next L4S packet, before comparing it with the
726	      queue delay of the next Classic packet, then it selects the packet
727	      with the greater adjusted queue delay.  Under regular conditions,
728	      this time-shifted FIFO scheduler behaves just like a strict
729	      priority scheduler.  But under moderate or high overload it
730	      prevents starvation of the Classic queue, because the time-shift
731	      (tshift) defines the maximum extra queuing delay of Classic
732	      packets relative to L4S.

734	   The example implementation in Appendix A can implement either policy.
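
   Both policies can be sketched in a few lines of Python.  The first
   function reproduces the WRR arithmetic of the worked example above,
   and the second selects a queue in the manner of the time-shifted
   FIFO.  The function names, the tie-break in favour of L4S and the
   tshift value used in the usage lines are assumptions.

```python
from fractions import Fraction
from typing import Optional


def wrr_l4s_flow_share(n_l4s_flows: int, classic_weight: Fraction) -> Fraction:
    """Per-flow share when L4S flows are held to the L4S WRR weight."""
    return (1 - classic_weight) / n_l4s_flows


def tshift_fifo_select(l_delay: Optional[float], c_delay: Optional[float],
                       tshift: float) -> Optional[str]:
    """Pick 'L' or 'C' from the head-packet queue delays (in seconds).

    A delay of None means the queue is empty.  tshift is added to the
    L4S delay before the comparison, and the larger value wins.
    """
    if l_delay is None and c_delay is None:
        return None
    if c_delay is None:
        return "L"
    if l_delay is None:
        return "C"
    return "L" if l_delay + tshift >= c_delay else "C"


# 19 L4S flows sharing 15/16 of the link: about 4.9% each.
share = wrr_l4s_flow_share(19, Fraction(1, 16))

# With a hypothetical tshift of 50 ms, Classic is served only once its
# head packet has waited more than 50 ms longer than the L4S head packet.
choice_normal = tshift_fifo_select(0.001, 0.020, tshift=0.050)
choice_overload = tshift_fifo_select(0.001, 0.080, tshift=0.050)
```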

736	4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or Delay?

738	   To keep the throughput of both L4S and Classic flows roughly equal
739	   over the full load range, a different control strategy needs to be
740	   defined above the point where one AQM first saturates to a
741	   probability of 100% leaving no room to push back the load any harder.
742	   If k>1, L4S will saturate first, but saturation can be caused by
743	   unresponsive traffic in either queue.

745	   The term 'unresponsive' includes cases where a flow becomes
746	   temporarily unresponsive, for instance, a real-time flow that takes a
747	   while to adapt its rate in response to congestion, or a TCP-like flow
748	   that is normally responsive, but above a certain congestion level it
749	   will not be able to reduce its congestion window below the minimum of
750	   2 segments, effectively becoming unresponsive.  (Note that L4S
751	   traffic ought to remain responsive below a window of 2 segments (see
752	   [I-D.ietf-tsvwg-ecn-l4s-id]).)

754	   Saturation raises the question of whether to relieve congestion by
755	   introducing some drop into the L4S queue or by allowing delay to grow
756	   in both queues (which could eventually lead to tail drop too):

758	   Drop on Saturation:  Saturation can be avoided by setting a maximum
759	      threshold for L4S ECN marking (assuming k>1) before saturation
760	      starts to make the flow rates of the different traffic types
761	      diverge.  Above that threshold, the drop probability of Classic
762	      traffic is applied to all packets of all traffic types.
763	      Experiments have shown that queueing delay can then be kept at the
764	      target in any overload situation, including with unresponsive
765	      traffic, and no further measures are required.

767	   Delay on Saturation:  When L4S marking saturates, instead of
768	      switching to drop, the drop and marking probabilities could be
769	      capped.  Beyond that, delay will grow either solely in the queue
770	      with unresponsive traffic (if WRR is used), or in both queues (if
771	      time-shifted FIFO is used).  In either case, the higher delay
772	      ought to control temporary high congestion.  If the overload is
773	      more persistent, eventually the combined DualQ will overflow and
774	      tail drop will control congestion.

776	   The example implementation in Appendix A applies only the "drop on
777	   saturation" policy.
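
   A minimal Python sketch of the "drop on saturation" policy follows.
   It assumes that the maximum L4S marking threshold is derived from a
   maximum Classic probability as min(k * sqrt(p_Cmax), 1), and that
   the base probability p gives p_CL = k * p and p_C = p^2.  The
   function name and the returned structure are hypothetical.

```python
import math


def congestion_signals(p: float, k: float = 2.0, p_cmax: float = 0.25) -> dict:
    """Return the signals to apply for base probability p.

    Below the L4S marking threshold, the two queues are signalled
    normally.  At or above it, the Classic drop probability is applied
    to all packets of all traffic types.
    """
    p_lmax = min(k * math.sqrt(p_cmax), 1.0)   # max L4S marking prob
    p_cl = k * p                               # coupled L4S probability
    p_c = p * p                                # Classic probability
    if p_cl < p_lmax:
        return {"overload": False, "l4s_mark": p_cl, "classic": p_c}
    return {"overload": True, "drop_all": p_c}
```

   With these defaults, p = 0.5 is the point where L4S marking reaches
   100% and the sketch switches to dropping 25% of all packets.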

779	4.1.3.  Protecting against Unresponsive ECN-Capable Traffic

781	   Unresponsive traffic has a greater advantage if it is also ECN-
782	   capable.  The advantage is undetectable at normal low levels of drop/
783	   marking, but it becomes significant with the higher levels of drop/
784	   marking typical during overload.  This is an issue whether the ECN-
785	   capable traffic is L4S or Classic.

787	   This raises the question of whether and when to switch off ECN
788	   marking and use solely drop instead, as required by both Section 7 of
789	   [RFC3168] and Section 4.2.1 of [RFC7567].

791	   Experiments with the DualPI2 AQM (Appendix A) have shown that
792	   introducing 'drop on saturation' at 100% L4S marking addresses this
793	   problem with unresponsive ECN as well as addressing the saturation
794	   problem.  It leaves only a small range of congestion levels where
795	   unresponsive traffic gains any advantage from using the ECN
796	   capability, and the advantage is hardly detectable [DualQ-Test].

798	5.  Acknowledgements

800	   Thanks to Anil Agarwal, Sowmini Varadhan and Gabi Bracha for
801	   detailed review comments particularly of the appendices and
802	   suggestions on how to make our explanation clearer.  Thanks also to
803	   Greg White and Tom Henderson for insights on the choice of schedulers
804	   and queue delay measurement techniques.

806	   The authors' contributions were originally part-funded by the
807	   European Community under its Seventh Framework Programme through the
808	   Reducing Internet Transport Latency (RITE) project (ICT-317700).  Bob
809	   Briscoe's contribution was also part-funded by the Research Council
810	   of Norway through the TimeIn project.  The views expressed here are
811	   solely those of the authors.

813	6.  References

815	6.1.  Normative References

817	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
818	              Requirement Levels", BCP 14, RFC 2119,
819	              DOI 10.17487/RFC2119, March 1997,
820	              <https://www.rfc-editor.org/info/rfc2119>.

822	6.2.  Informative References

824	   [ARED01]   Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An
825	              Algorithm for Increasing the Robustness of RED's Active
826	              Queue Management", ACIRI Technical Report, August 2001,
827	              <http://www.icir.org/floyd/red.html>.

829	   [CoDel]    Nichols, K. and V. Jacobson, "Controlling Queue Delay",
830	              ACM Queue 10(5), May 2012,
831	              <http://queue.acm.org/issuedetail.cfm?issue=2208917>.

833	   [CRED_Insights]
834	              Briscoe, B., "Insights from Curvy RED (Random Early
835	              Detection)", BT Technical Report TR-TUB8-2015-003, July
836	              2015,
837	              <http://www.bobbriscoe.net/projects/latency/credi_tr.pdf>.

839	   [DCttH15]  De Schepper, K., Bondarenko, O., Briscoe, B., and I.
840	              Tsang, "`Data Centre to the Home': Ultra-Low Latency for
841	              All", 2015, <http://www.bobbriscoe.net/projects/latency/
842	              dctth_preprint.pdf>.

844	              (Under submission)

846	   [DualQ-Test]
847	              Steen, H., "Destruction Testing: Ultra-Low Delay using
848	              Dual Queue Coupled Active Queue Management", Masters
849	              Thesis, Dept of Informatics, Uni Oslo, May 2017.

851	   [I-D.briscoe-tsvwg-l4s-diffserv]
852	              Briscoe, B., "Interactions between Low Latency, Low Loss,
853	              Scalable Throughput (L4S) and Differentiated Services",
854	              draft-briscoe-tsvwg-l4s-diffserv-00 (work in progress),
855	              March 2018.

857	   [I-D.ietf-tsvwg-ecn-l4s-id]
858	              De Schepper, K., Briscoe, B., and I. Tsang, "Identifying
859	              Modified Explicit Congestion Notification (ECN) Semantics
860	              for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s-
861	              id-02 (work in progress), March 2018.

863	   [I-D.ietf-tsvwg-l4s-arch]
864	              Briscoe, B., De Schepper, K., and M. Bagnulo, "Low
865	              Latency, Low Loss, Scalable Throughput (L4S) Internet
866	              Service: Architecture", draft-ietf-tsvwg-l4s-arch-02
867	              (work in progress), March 2018.

869	   [I-D.sridharan-tcpm-ctcp]
870	              Sridharan, M., Tan, K., Bansal, D., and D. Thaler,
871	              "Compound TCP: A New TCP Congestion Control for High-Speed
872	              and Long Distance Networks", draft-sridharan-tcpm-ctcp-02
873	              (work in progress), November 2008.

875	   [Mathis09]
876	              Mathis, M., "Relentless Congestion Control", PFLDNeT'09,
877	              May 2009, <http://www.hpcc.jp/pfldnet2009/
878	              Program_files/1569198525.pdf>.

880	   [MEDF]     Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a
881	              simple scheduling algorithm for two real-time transport
882	              service classes with application in the UTRAN", Proc. IEEE
883	              Conference on Computer Communications (INFOCOM'03) Vol.2
884	              pp.1116-1122, March 2003.

886	   [PI2]      De Schepper, K., Bondarenko, O., Briscoe, B., and I.
887	              Tsang, "PI2: A Linearized AQM for both Classic and
888	              Scalable TCP", ACM CoNEXT'16, December 2016,
889	              <https://riteproject.files.wordpress.com/2015/10/
890	              pi2_conext.pdf>.

892	              (To appear)

894	   [RFC0970]  Nagle, J., "On Packet Switches With Infinite Storage",
895	              RFC 970, DOI 10.17487/RFC0970, December 1985,
896	              <https://www.rfc-editor.org/info/rfc970>.

898	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
899	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
900	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
901	              S., Wroclawski, J., and L. Zhang, "Recommendations on
902	              Queue Management and Congestion Avoidance in the
903	              Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998,
904	              <https://www.rfc-editor.org/info/rfc2309>.

906	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
907	              of Explicit Congestion Notification (ECN) to IP",
908	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
909	              <https://www.rfc-editor.org/info/rfc3168>.

911	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
912	              J., Courtney, W., Davari, S., Firoiu, V., and D.
913	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
914	              Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002,
915	              <https://www.rfc-editor.org/info/rfc3246>.

917	   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
918	              RFC 3649, DOI 10.17487/RFC3649, December 2003,
919	              <https://www.rfc-editor.org/info/rfc3649>.

921	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
922	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
923	              <https://www.rfc-editor.org/info/rfc5681>.

925	   [RFC7567]  Baker, F., Ed. and G. Fairhurst, Ed., "IETF
926	              Recommendations Regarding Active Queue Management",
927	              BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015,
928	              <https://www.rfc-editor.org/info/rfc7567>.

930	   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
931	              "Proportional Integral Controller Enhanced (PIE): A
932	              Lightweight Control Scheme to Address the Bufferbloat
933	              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
934	              <https://www.rfc-editor.org/info/rfc8033>.

936	   [RFC8034]  White, G. and R. Pan, "Active Queue Management (AQM) Based
937	              on Proportional Integral Controller Enhanced (PIE) for
938	              Data-Over-Cable Service Interface Specifications (DOCSIS)
939	              Cable Modems", RFC 8034, DOI 10.17487/RFC8034, February
940	              2017, <https://www.rfc-editor.org/info/rfc8034>.

942	   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
943	              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
944	              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
945	              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

947	   [RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
948	              J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler
949	              and Active Queue Management Algorithm", RFC 8290,
950	              DOI 10.17487/RFC8290, January 2018,
951	              <https://www.rfc-editor.org/info/rfc8290>.

953	   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
954	              Notification (ECN) Experimentation", RFC 8311,
955	              DOI 10.17487/RFC8311, January 2018,
956	              <https://www.rfc-editor.org/info/rfc8311>.

958	   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
959	              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
960	              RFC 8312, DOI 10.17487/RFC8312, February 2018,
961	              <https://www.rfc-editor.org/info/rfc8312>.

963	Appendix A.  Example DualQ Coupled PI2 Algorithm

965	   As a first concrete example, the pseudocode below gives the DualPI2
966	   algorithm.  DualPI2 follows the structure of the DualQ Coupled AQM
967	   framework in Figure 1.  A simple step threshold (in units of queuing
968	   time) is used for the Native L4S AQM, but a ramp is also described as
969	   an alternative.  And the PI2 algorithm [PI2] is used for the Classic
970	   AQM.  PI2 is an improved variant of the PIE AQM [RFC8033].

972	   We will introduce the pseudocode in two passes.  The first pass
973	   explains the core concepts, deferring handling of overload to the
974	   second pass.  To aid comparison, line numbers are kept in step
975	   between the two passes by using letter suffixes where the longer code
976	   needs extra lines.

978	   A full open source implementation for Linux is available at:
979	   https://github.com/olgabo/dualpi2.

981	A.1.  Pass #1: Core Concepts

983	   The pseudocode manipulates three main structures of variables: the
984	   packet (pkt), the L4S queue (lq) and the Classic queue (cq).  The
985	   pseudocode consists of the following four functions:

987	   o  initialization code (Figure 2) that sets parameter defaults (the
988	      API for setting non-default values is omitted for brevity)

990	   o  enqueue code (Figure 3)

992	   o  dequeue code (Figure 4)

994	   o  code to regularly update the base probability (p) used in the
995	      dequeue code (Figure 5).

997	   It also uses the following functions that are not shown in full here:

999	   o  scheduler(), which selects between the head packets of the two
1000	      queues; the choice of scheduler technology is discussed later;

1002	   o  cq.len() or lq.len() returns the current length (aka. backlog) of
1003	      the relevant queue in bytes;

1005	   o  cq.time() or lq.time() returns the current queuing delay (aka.
1006	      sojourn time or service time) of the relevant queue in units of
1007	      time.

1009	   Queuing delay could be measured directly by storing a per-packet
1010	   time-stamp as each packet is enqueued, and subtracting this from the
1011	   system time when the packet is dequeued.  If time-stamping is not
1012	   easy to introduce with certain hardware, queuing delay could be
1013	   predicted indirectly by dividing the size of the queue by the
1014	   predicted departure rate, which might be known precisely for some
1015	   link technologies (see for example [RFC8034]).
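
   The two ways of obtaining queuing delay described above might look
   as follows in Python.  The class and function names are
   hypothetical, and the clock is injectable only so that the sketch
   can be exercised without real time passing.

```python
import time
from collections import deque


class TimestampedQueue:
    """FIFO that stores an enqueue time-stamp with each packet."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._q = deque()
        self._bytes = 0

    def enqueue(self, pkt: bytes) -> None:
        self._q.append((self._clock(), pkt))
        self._bytes += len(pkt)

    def dequeue(self):
        """Return (packet, measured queuing delay in seconds)."""
        stamp, pkt = self._q.popleft()
        self._bytes -= len(pkt)
        return pkt, self._clock() - stamp

    def len(self) -> int:
        """Current backlog in bytes."""
        return self._bytes

    def time(self) -> float:
        """Queuing delay so far of the head packet (0 if empty)."""
        return self._clock() - self._q[0][0] if self._q else 0.0


def estimated_delay(backlog_bytes: int, departure_rate_bps: float) -> float:
    """Indirect estimate: backlog divided by predicted departure rate."""
    return backlog_bytes * 8 / departure_rate_bps
```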

1017	   In our experiments so far (building on experiments with PIE) on
1018	   broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs
1019	   from 5 ms to 100 ms, DualPI2 achieves good results with the default
1020	   parameters in Figure 2.  The parameters are categorised by whether
1021	   they relate to the Base PI2 AQM, the L4S AQM or the framework
1022	   coupling them together.  Variables derived from these parameters are
1023	   also included at the end of each category.  Each parameter is
1024	   explained as it is encountered in the walk-through of the pseudocode
1025	   below.

1027	   1:  dualpi2_params_init(...) {         % Set input parameter defaults
1028	   2:    % PI2 AQM parameters
1029	   3:    target = 15 ms              % PI AQM Classic queue delay target
1030	   4:    Tupdate = 16 ms            % PI Classic queue sampling interval
1031	   5:    alpha = 10 Hz^2                              % PI integral gain
1032	   6:    beta = 100 Hz^2                          % PI proportional gain
1033	   7:    p_Cmax = 1/4                       % Max Classic drop/mark prob
1034	   8:    % Derived PI2 AQM variables
1035	   9:    alpha_U = alpha *Tupdate % PI integral gain per update interval
1036	   10:   beta_U = beta * Tupdate  % PI prop'nal gain per update interval
1037	   11:
1038	   12:   % DualQ Coupled framework parameters
1039	   13:   k = 2                                         % Coupling factor
1040	   14:   % scheduler weight or equival't parameter (scheduler-dependent)
1041	   15:   limit = MAX_LINK_RATE * 250 ms               % Dual buffer size
1042	   16:
1043	   17:   % L4S AQM parameters
1044	   18:   T_time = 1 ms                   % L4S marking threshold in time
1045	   19:   T_len = 2 * MTU            % Min L4S marking threshold in bytes
1046	   20:   % Derived L4S AQM variables
1047	   21:   p_Lmax = min(k*sqrt(p_Cmax), 1)          % Max L4S marking prob
1048	   22: }

1050	       Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM
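
   For readers who prefer runnable code, the defaults in Figure 2 could
   be transcribed into Python roughly as below, using seconds and bytes
   as units.  The MTU and maximum link rate are not given in Figure 2,
   so the values used here are assumptions, as is the class name.

```python
import math
from dataclasses import dataclass


@dataclass
class DualPI2Params:
    # PI2 AQM parameters
    target: float = 0.015      # Classic queue delay target [s]
    tupdate: float = 0.016     # Classic queue sampling interval [s]
    alpha: float = 10.0        # PI integral gain [Hz^2]
    beta: float = 100.0        # PI proportional gain [Hz^2]
    p_cmax: float = 0.25       # max Classic drop/mark probability
    # DualQ Coupled framework parameters
    k: float = 2.0             # coupling factor
    # L4S AQM parameters
    t_time: float = 0.001      # L4S marking threshold [s]
    # Assumed values, not taken from Figure 2
    mtu: int = 1500                    # [bytes]
    max_link_rate: float = 100e6 / 8   # [bytes/s], i.e. 100 Mb/s

    @property
    def alpha_u(self) -> float:        # integral gain per update interval
        return self.alpha * self.tupdate

    @property
    def beta_u(self) -> float:         # proportional gain per update interval
        return self.beta * self.tupdate

    @property
    def limit(self) -> float:          # dual buffer size [bytes]
        return self.max_link_rate * 0.250

    @property
    def t_len(self) -> int:            # min L4S marking threshold [bytes]
        return 2 * self.mtu

    @property
    def p_lmax(self) -> float:         # max L4S marking probability
        return min(self.k * math.sqrt(self.p_cmax), 1.0)
```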

1052	   The overall goal of the code is to maintain the base probability (p),
1053	   which is an internal variable from which the marking and dropping
1054	   probabilities for L4S and Classic traffic (p_L and p_C) are derived.
1055	   The variable named p in the pseudocode and in this walk-through is
1056	   the same as p' (p-prime) in Section 2.4.  The probabilities p_L and
1057	   p_C are derived in lines 3, 4 and 5 of the dualpi2_update() function
1058	   (Figure 5) then used in the dualpi2_dequeue() function (Figure 4).
1059	   The code walk-through below builds up to explaining that part of the
1060	   code eventually, but it starts from packet arrival.

1062	   1:  dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq
1063	   2:    if ( lq.len() + cq.len() > limit )
1064	   3:      drop(pkt)                     % drop packet if buffer is full
1065	   4:    else {                                      % Packet classifier
1066	   5:      if ( ecn(pkt) modulo 2 == 1 )       % ECN bits = ECT(1) or CE
1067	   6:        lq.enqueue(pkt)
1068	   7:      else                           % ECN bits = not-ECT or ECT(0)
1069	   8:        cq.enqueue(pkt)
1070	   9:    }
1071	   10: }

1073	      Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM

1075	   1:  dualpi2_dequeue(lq, cq, pkt) {     % Couples L4S & Classic queues
1076	   2:    while ( lq.len() + cq.len() > 0 ) {
1077	   3:      if ( scheduler() == lq ) {
1078	   4:        lq.dequeue(pkt)                      % Scheduler chooses lq
1079	   5:        if ( ((lq.time() > T_time)              % step marking ...
1080	   6:              AND (lq.len() > T_len))
1081	   7:            OR (p_CL > rand()) )             % ...or linear marking
1082	   8:          mark(pkt)
1083	   9:      } else {
1084	   10:       cq.dequeue(pkt)                      % Scheduler chooses cq
1085	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1086	   12:         if ( ecn(pkt) == 0 ) {           % if ECN field = not-ECT
1087	   13:           drop(pkt)                                % squared drop
1088	   14:           continue        % continue to the top of the while loop
1089	   15:         }
1090	   16:         mark(pkt)                                  % squared mark
1091	   17:       }
1092	   18:     }
1093	   19:     return(pkt)                      % return the packet and stop
1094	   20:   }
1095	   21:   return(NULL)                             % no packet to dequeue
1096	   22: }

1098	      Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM

1100	   When packets arrive, first a common queue limit is checked as shown
1101	   in line 2 of the enqueuing pseudocode in Figure 3.  Note that the
1102	   limit is deliberately tested before enqueue to avoid any bias against
1103	   larger packets (so the actual buffer has to be one MTU larger than
1104	   limit).  If limit is not exceeded, the packet will be classified and
1105	   enqueued to the Classic or L4S queue dependent on the least
1106	   significant bit of the ECN field in the IP header (line 5).  Packets
1107	   with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be
1108	   enqueued in the Classic queue.  Otherwise, ECT(1) and CE packets will
1109	   be enqueued in the L4S queue.  Optional additional packet
1110	   classification flexibility is omitted for brevity (see
1111	   [I-D.ietf-tsvwg-ecn-l4s-id]).
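
   The enqueue logic of Figure 3 can be sketched in Python as follows.
   Packets are modelled as (ecn, size) tuples purely for illustration,
   and the function names are hypothetical.  The ECN codepoint values
   are those of the two-bit ECN field in the IP header.

```python
from collections import deque

NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def classify(ecn: int) -> str:
    """Classify on the least significant bit of the ECN field."""
    return "L" if ecn & 0b01 else "C"


def dualq_enqueue(lq: deque, cq: deque, pkt: tuple, limit: int) -> str:
    """Enqueue pkt = (ecn, size_in_bytes) or drop it if the buffer is full.

    The limit is tested before the packet is added, as in Figure 3.
    """
    backlog = sum(size for _, size in lq) + sum(size for _, size in cq)
    if backlog > limit:
        return "dropped"
    queue = classify(pkt[0])
    (lq if queue == "L" else cq).append(pkt)
    return queue
```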

1113	   The dequeue pseudocode (Figure 4) is repeatedly called whenever the
1114	   lower layer is ready to forward a packet.  It schedules one packet
1115	   for dequeuing (or zero if the queue is empty) then returns control to
1116	   the caller, so that it does not block while that packet is being
1117	   forwarded.  While making this dequeue decision, it also makes the
1118	   necessary AQM decisions on dropping or marking.  The alternative of
1119	   applying the AQMs at enqueue would shift some processing from the
1120	   critical time when each packet is dequeued.  However, it would also
1121	   add a whole queue of delay to the control signals, making the control
1122	   loop very sloppy.

1124	   All the dequeue code is contained within a large while loop so that
1125	   if it decides to drop a packet, it will continue until it selects a
1126	   packet to schedule.  Line 3 of the dequeue pseudocode is where the
1127	   scheduler chooses between the L4S queue (lq) and the Classic queue
1128	   (cq).  Detailed implementation of the scheduler is not shown (see
1129	   discussion later).

1131	   o  If an L4S packet is scheduled, lines 5 to 8 mark the packet if
1132	      either the L4S threshold (T_time) is exceeded, or if a random
1133	      marking decision is drawn according to p_CL (maintained by the
1134	      dualpi2_update() function discussed below).  This logical 'OR' on
1135	      a per-packet basis implements the max() function shown in Figure 1
1136	      to couple the outputs of the two AQMs together.  The L4S threshold
1137	      is usually in units of time (default T_time = 1 ms).  However, on
1138	      slow links the packet serialization time can approach the
1139	      threshold T_time, so line 6 sets a floor of T_len (=2 MTU) to the
1140	      threshold, otherwise marking is always too frequent on slow links.

1142	   o  If a Classic packet is scheduled, lines 10 to 17 drop or mark the
1143	      packet based on the squared probability p_C.
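
   The two per-packet decisions just described might be written in
   Python as below.  This is a sketch of the decision logic only, not
   of the whole dequeue loop.  The function names and the injectable
   random source are assumptions, and the default thresholds assume a
   1500-byte MTU.

```python
import random


def l4s_should_mark(l_delay: float, l_backlog: int, p_cl: float,
                    t_time: float = 0.001, t_len: int = 3000,
                    rng=random.random) -> bool:
    """Step marking OR coupled marking for a packet from the L queue.

    l_delay is in seconds and l_backlog in bytes.  The step applies
    only if the backlog also exceeds the floor t_len.
    """
    step = l_delay > t_time and l_backlog > t_len
    return step or p_cl > rng()


def classic_action(ecn: int, p_c: float, rng=random.random) -> str:
    """Squared drop or mark decision for a packet from the C queue."""
    if p_c > rng():
        return "drop" if ecn == 0 else "mark"   # ECN field 0 = Not-ECT
    return "forward"
```

   Evaluating the decisions as a per-packet logical OR has the same
   effect as taking the maximum of the two marking probabilities.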

1145	   There is some concern that using a step function for the Native L4S
1146	   AQM requires end-systems to smooth the signal for a lot longer -
1147	   until its fidelity is sufficient.  The latency benefits of a ramp are
1148	   being investigated as a simple alternative to the step.  This ramp
1149	   would be similar to the RED algorithm, with the following
1150	   differences:

1152	   o  The min and max of the ramp are defined in units of queuing delay,
1153	      not bytes, so that configuration remains invariant as the queue
1154	      departure rate varies.

1156	   o  It uses instantaneous queueing delay without smoothing (smoothing
1157	      is done in the end-systems).

1159	   o  Deterministic marking is being tried instead of random
1160	      marking, to reduce the delay necessary to smooth out the noise
1161	      of randomness from the signal.  For each packet, the algorithm
1162	      would accumulate p'_L in a counter and mark the packet that took
1163	      the counter over 1, then subtract 1 from the counter and continue.

1165	   o  The ramp rises linearly directly from 0 to 1, not to an
1166	      intermediate value of p'_L as RED would, because there is no need
1167	      to keep ECN marking probability low.

1169	   This ramp algorithm would require two configuration parameters (min
1170	   and max threshold in units of queuing time), in contrast to the
1171	   single parameter of a step.
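
The ramp and the deterministic marking counter described above could
be sketched as follows.  This is only one possible reading of the
text: the names ramp_probability and DeterministicMarker are
hypothetical, and the thresholds are passed in as plain numbers in
units of queuing time.

```python
def ramp_probability(qdelay, min_th, max_th):
    """Linear ramp from 0 at min_th to 1 at max_th (all in units of time)."""
    if qdelay <= min_th:
        return 0.0
    if qdelay >= max_th:
        return 1.0
    return (qdelay - min_th) / (max_th - min_th)


class DeterministicMarker:
    """Accumulate the marking probability instead of drawing random numbers."""

    def __init__(self):
        self.counter = 0.0

    def should_mark(self, p):
        # Mark the packet that takes the counter over 1, then subtract 1.
        self.counter += p
        if self.counter > 1.0:
            self.counter -= 1.0
            return True
        return False
```

With a constant probability of 0.25, this marker marks one packet in
every four once the counter has first passed 1, with no random noise
for the end-system to smooth out.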

1173	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1174	   2:    curq = cq.time()  % use queuing time of first-in Classic packet
1175	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1176	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1177	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1178	   6:    prevq = curq
1179	   7:  }

1181	     Figure 5: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM

1183	   The base probability (p) is kept up to date by the core PI algorithm
1184	   in Figure 5, which is executed every Tupdate.

1186	   Note that p solely depends on the queuing time in the Classic queue.
1187	   In line 2, the current queuing delay (curq) is evaluated from how
1188	   long the head packet was in the Classic queue (cq).  The function
1189	   cq.time() (not shown) subtracts the time stamped at enqueue from the
1190	   current time and implicitly takes the current queuing delay as 0 if
1191	   the queue is empty.

1193	   The algorithm centres on line 3, which is a classical Proportional-
1194	   Integral (PI) controller that alters p dependent on: a) the error
1195	   between the current queuing delay (curq) and the target queuing delay
1196	   ('target' - see [RFC8033]); and b) the change in queuing delay since
1197	   the last sample.  The name 'PI' represents the fact that the second
1198	   factor (how fast the queue is growing) is _P_roportional to load
1199	   while the first is the _I_ntegral of the load (so it removes any
1200	   standing queue in excess of the target).

1202	   The two 'gain factors' in line 3, alpha_U and beta_U, respectively
1203	   weight how strongly each of these elements ((a) and (b)) alters p.
1204	   They are in units of 'per second of delay' or Hz, because they
1205	   transform differences in queueing delay into changes in probability.

1207	   alpha_U and beta_U are derived from the input parameters alpha and
1208	   beta (see lines 5 and 6 of Figure 2).  These recommended values of
1209	   alpha and beta come from the stability analysis in [PI2] so that the
1210	   AQM can change p as fast as possible in response to changes in load
1211	   without over-compensating and therefore causing oscillations in the
1212	   queue.

1214	   alpha and beta determine how much p ought to change if it was updated
1215	   every second.  It is best to update p as frequently as possible, but
1216	   the update interval (Tupdate) will probably be constrained by
1217	   hardware performance.  For link rates from 4 to 200 Mb/s, we found
1218	   Tupdate=16ms (as recommended in [RFC8033]) is sufficient.  However
1219	   small the chosen value of Tupdate, p should change by the same amount
1220	   per second, but in finer more frequent steps.  So the gain factors
1221	   used for updating p in Figure 5 need to be scaled by (Tupdate/1s),
1222	   which is done in lines 9 and 10 of Figure 2.  The suffix '_U'
1223	   represents 'per update time' (Tupdate).

1225	   In corner cases, p can overflow the range [0,1] so the resulting
1226	   value of p has to be bounded (omitted from the pseudocode).  Then, as
1227	   already explained, the coupled and Classic probabilities are derived
1228	   from the new p in lines 4 and 5 as p_CL = k*p and p_C = p^2.
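
A minimal Python sketch of the update in Figure 5 follows, with the
bounding of p that the pseudocode omits.  The class name and the
default values are assumptions for illustration: alpha = 10 is the
example value used in the text, Tupdate = 16 ms is the interval
quoted above, while beta = 100 and target = 15 ms are merely
plausible placeholders.

```python
class DualPI2:
    """Core PI update with gains scaled by Tupdate and p bounded to [0, 1]."""

    def __init__(self, alpha=10.0, beta=100.0, tupdate=0.016,
                 target=0.015, k=2.0):
        self.alpha_U = alpha * tupdate   # gain per update interval
        self.beta_U = beta * tupdate     # gain per update interval
        self.target = target             # target queuing delay in seconds
        self.k = k                       # coupling factor
        self.p = 0.0                     # base probability
        self.p_CL = 0.0                  # coupled L4S probability
        self.p_C = 0.0                   # Classic probability
        self.prevq = 0.0                 # previous queuing delay sample

    def update(self, curq):
        """Run once every Tupdate with the current Classic queuing delay."""
        p = (self.p + self.alpha_U * (curq - self.target)
             + self.beta_U * (curq - self.prevq))
        self.p = min(max(p, 0.0), 1.0)   # bound p, as the text requires
        self.p_CL = self.p * self.k      # not bounded here; can exceed 1
        self.p_C = self.p ** 2
        self.prevq = curq
```

Halving tupdate halves alpha_U and beta_U, so p changes by the same
amount per second but in finer, more frequent steps.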

1230	   Because the coupled L4S marking probability (p_CL) is factored up by
1231	   k, the dynamic gain parameters alpha and beta are also inherently
1232	   factored up by k for the L4S queue, which is necessary to ensure that
1233	   Classic TCP and DCTCP controls have the same stability.  So, if alpha
1234	   is 10 Hz^2, the effective gain factor for the L4S queue is k*alpha,
1235	   which is 20 Hz^2 with the default coupling factor of k=2.

1237	   Unlike in PIE [RFC8033], alpha_U and beta_U do not need to be tuned
1238	   every Tupdate dependent on p.  Instead, in PI2, alpha_U and beta_U
1239	   are independent of p because the squaring applied to Classic traffic
1240	   tunes them inherently.  This is explained in [PI2], which also
1241	   explains why this more principled approach removes the need for most
1242	   of the heuristics that had to be added to PIE.

1244	   {ToDo: Scaling beta with Tupdate and scaling both alpha & beta with
1245	   RTT}

1247	A.2.  Pass #2: Overload Details

1249	   Figure 6 repeats the dequeue function of Figure 4, but with overload
1250	   details added.  Similarly Figure 7 repeats the core PI algorithm of
1251	   Figure 5 with overload details added.  The initialization and enqueue
1252	   functions are unchanged.

1254	   Line 7 of the initialization function (Figure 2) sets the default
1255	   maximum Classic drop probability to p_Cmax = 1/4 or 25%.  This is
1256	   the point at which it is deemed that the Classic queue has become
1257	   persistently overloaded, so it switches to using solely drop, even
1258	   for ECN-capable packets.  This protects the queue against any
1259	   unresponsive traffic that falsely claims that it is responsive to ECN
1260	   marking, as required by [RFC3168] and [RFC7567].

1262	   Line 21 of the initialization function translates this into a maximum
1263	   L4S marking probability (p_Lmax) by rearranging Equation (1).  With a
1264	   coupling factor of k=2 (the default) or greater, this translates to a
1265	   maximum L4S marking probability of 1 (or 100%).  This is intended to
1266	   ensure that the L4S queue starts to introduce dropping once marking
1267	   saturates and can rise no further.  The 'TCP Prague' requirements
1268	   [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion
1269	   control detects a drop, it falls back to a response that coexists
1270	   with 'Classic' TCP.  So it is correct that the L4S queue drops
1271	   packets proportional to p^2, as if they are Classic packets.
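
The translation from p_Cmax to p_Lmax can be checked with a short
calculation.  Given p_CL = k*p and p_C = p^2, rearranging gives
p_CL = k*sqrt(p_C).  The sketch below assumes the result is capped at
1 because it is a probability; the function name is hypothetical.

```python
import math


def p_lmax(p_cmax, k):
    """Maximum L4S marking probability implied by p_Cmax and coupling k."""
    return min(k * math.sqrt(p_cmax), 1.0)
```

With the defaults p_Cmax = 1/4 and k = 2 this evaluates to 1, and it
stays at 1 for any larger k, whereas k = 1 would give 0.5.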

1273	   Both these switch-overs are triggered by the tests for overload
1274	   introduced in lines 4b and 12b of the dequeue function (Figure 6).
1275	   Lines 8c to 8g drop L4S packets with probability p^2.  Lines 8h to 8i
1276	   mark the remaining packets with probability p_CL.

1278	   Lines 2c to 2d in the core PI algorithm (Figure 7) deal with overload
1279	   of the L4S queue when there is no Classic traffic.  This is
1280	   necessary, because the core PI algorithm maintains the appropriate
1281	   drop probability to regulate overload, but it depends on the length
1282	   of the Classic queue.  If there is no Classic queue the naive
1283	   algorithm in Figure 5 drops nothing, even if the L4S queue is
1284	   overloaded - so tail drop would have to take over (lines 3 and 4 of
1285	   Figure 3).

1287	   If the test at line 2a finds that the Classic queue is empty, line 2d
1288	   measures the current queue delay using the L4S queue instead.  While
1289	   the L4S queue is not overloaded, its delay will always be tiny
1290	   compared to the target Classic queue delay.  So p_CL will be driven
1291	   to zero, and the L4S queue will naturally be governed solely by
1292	   threshold marking (lines 5 and 6 of the dequeue algorithm in
1293	   Figure 6).  But, if unresponsive L4S source(s) cause overload, the
1294	   DualQ transitions smoothly to L4S marking based on the PI algorithm.
1295	   And as overload increases, it naturally transitions from marking to
1296	   dropping by the switch-over mechanism already described.

1298	   1:  dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq
1299	   2:    while ( lq.len() + cq.len() > 0 )
1300	   3:      if ( scheduler() == lq ) {
1301	   4a:       lq.dequeue(pkt)
1302	   4b:       if ( p_CL < p_Lmax ) {      % Check for overload saturation
1303	   5:          if ( ((lq.time() > T_time)             % step marking ...
1304	   6:                AND (lq.len() > T_len))
1305	   7:              OR (p_CL > rand()) )           % ...or linear marking
1306	   8a:            mark(pkt)
1307	   8b:       } else {                              % overload saturation
1308	   8c:         if ( p_C > rand() ) {             % probability p_C = p^2
1309	   8e:           drop(pkt)      % revert to Classic drop due to overload
1310	   8f:           continue        % continue to the top of the while loop
1311	   8g:         }
1312	   8h:         if ( p_CL > rand() )           % probability p_CL = k * p
1313	   8i:           mark(pkt)         % linear marking of remaining packets
1314	   8j:       }
1315	   9:      } else {
1316	   10:       cq.dequeue(pkt)
1317	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1318	   12a:        if ( (ecn(pkt) == 0)                % ECN field = not-ECT
1319	   12b:             OR (p_C >= p_Cmax) ) {       % Overload disables ECN
1320	   13:           drop(pkt)                     % squared drop, redo loop
1321	   14:           continue        % continue to the top of the while loop
1322	   15:         }
1323	   16:         mark(pkt)                                  % squared mark
1324	   17:       }
1325	   18:     }
1326	   19:     return(pkt)                      % return the packet and stop
1327	   20:   }
1328	   21:   return(NULL)                             % no packet to dequeue
1329	   22: }

1331	      Figure 6: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM
1332	             (Including Integer Arithmetic and Overload Code)

1334	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1335	   2a:   if ( cq.len() > 0 )
1336	   2b:     curq = cq.time() %use queuing time of first-in Classic packet
1337	   2c:   else                                      % Classic queue empty
1338	   2d:     curq = lq.time()    % use queuing time of first-in L4S packet
1339	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1340	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1341	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1342	   6:    prevq = curq
1343	   7:  }

1345	     Figure 7: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM
1346	                         (Including Overload Code)

1348	   The choice of scheduler technology is critical to overload protection
1349	   (see Section 4.1).

1351	   o  A well-understood weighted scheduler such as weighted round robin
1352	      (WRR) is recommended.  The scheduler weight for Classic should be
1353	      low, e.g. 1/16.

1355	   o  Alternatively, a time-shifted FIFO could be used.  This is a very
1356	      simple scheduler, but it does not fully isolate latency in the L4S
1357	      queue from uncontrolled bursts in the Classic queue.  It works by
1358	      selecting the head packet that has waited the longest, biased
1359	      against the Classic traffic by a time-shift of tshift.  To
1360	      implement time-shifted FIFO, the "if (scheduler() == lq )" test in
1361	      line 3 of the dequeue code would simply be replaced by "if (
1362	      lq.time() + tshift >= cq.time() )".  For the public Internet a
1363	      good value for tshift is 50ms.  For private networks with smaller
1364	      diameter, about 4*target would be reasonable.

1366	   o  A strict priority scheduler would be inappropriate, because it
1367	      would starve Classic if L4S was overloaded.
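
The time-shifted FIFO test could be sketched as below.  The function
name and the explicit handling of empty queues are assumptions added
for illustration; the text above only gives the comparison itself.

```python
TSHIFT = 0.050   # time-shift in seconds (value suggested for the Internet)


def select_queue(lq_time, cq_time, lq_empty=False, cq_empty=False,
                 tshift=TSHIFT):
    """Return 'lq' or 'cq' according to a time-shifted FIFO.

    lq_time and cq_time are the queuing times of the two head packets.
    """
    if lq_empty:          # assumed handling: nothing to serve from lq
        return "cq"
    if cq_empty:          # assumed handling: nothing to serve from cq
        return "lq"
    return "lq" if lq_time + tshift >= cq_time else "cq"
```

A Classic head packet is therefore served only once it has waited
more than tshift longer than the L4S head packet.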

1369	Appendix B.  Example DualQ Coupled Curvy RED Algorithm

1371	   As another example of a DualQ Coupled AQM algorithm, the pseudocode
1372	   below gives the Curvy RED based algorithm we used and tested.
1373	   Although we designed the AQM to be efficient in integer arithmetic,
1374	   to aid understanding it is first given using real-number arithmetic.
1375	   Then, one possible optimization for integer arithmetic is given, also
1376	   in pseudocode.  To aid comparison, the line numbers are kept in step
1377	   between the two by using letter suffixes where the longer code needs
1378	   extra lines.

1380	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1381	   2:    if ( lq.dequeue(pkt) ) {
1382	   3a:     p_L = cq.sec() / 2^S_L
1383	   3b:     if ( lq.byt() > T )
1384	   3c:       mark(pkt)
1385	   3d:     elif ( p_L > maxrand(U) )
1386	   4:        mark(pkt)
1387	   5:      return(pkt)                % return the packet and stop here
1388	   6:    }
1389	   7:    while ( cq.dequeue(pkt) ) {
1390	   8a:     alpha = 2^(-f_C)
1391	   8b:     Q_C = alpha * pkt.sec() + (1-alpha)* Q_C    % Classic Q EWMA
1392	   9a:     sqrt_p_C = Q_C / 2^S_C
1393	   9b:     if ( sqrt_p_C > maxrand(2*U) )
1394	   10:       drop(pkt)                        % Squared drop, redo loop
1395	   11:     else
1396	   12:       return(pkt)              % return the packet and stop here
1397	   13:   }
1398	   14:   return(NULL)                           % no packet to dequeue
1399	   15: }

1401	   16: maxrand(u) {                % return the max of u random numbers
1402	   17:     maxr=0
1403	   18:     while (u-- > 0)
1404	   19:         maxr = max(maxr, rand())               % 0 <= rand() < 1
1405	   20:     return(maxr)
1406	   21: }

1408	   Figure 8: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM

1410	   Packet classification code is not shown, as it is no different from
1411	   Figure 3.  Potential classification schemes are discussed in
1412	   Section 2.3.  The Curvy RED algorithm has not been maintained to the
1413	   same degree as the DualPI2 algorithm.  Some ideas used in DualPI2
1414	   would need to be translated into Curvy RED, such as i) the
1415	   conditional priority scheduler instead of strict priority; ii) the
1416	   time-based L4S threshold; iii) turning off ECN as overload
1417	   protection; iv) Classic ECN support.  These are not shown in the
1418	   Curvy RED pseudocode, but would need to be implemented for
1419	   production. {ToDo}

1421	   At the outer level, the structure of dualq_dequeue() implements
1422	   strict priority scheduling.  The code is written assuming the AQM is
1423	   applied on dequeue (Note 1).  Every time dualq_dequeue() is called,
1424	   the if-block in lines 2-6 determines whether there is an L4S packet
1425	   to dequeue by calling lq.dequeue(pkt), and otherwise the while-block
1426	   in lines 7-13 determines whether there is a Classic packet to
1427	   dequeue, by calling cq.dequeue(pkt).  (Note 2)
1428	   In the lower priority Classic queue, a while loop is used so that, if
1429	   the AQM determines that a classic packet should be dropped, it
1430	   continues to test for classic packets deciding whether to drop each
1431	   until it actually forwards one.  Thus, every call to dualq_dequeue()
1432	   returns one packet if at least one is present in either queue,
1433	   otherwise it returns NULL at line 14.  (Note 3)

1435	   Within each queue, the decision whether to drop or mark is taken as
1436	   follows (to simplify the explanation, it is assumed that U=1):

1438	   L4S:  If the test at line 2 determines there is an L4S packet to
1439	      dequeue, the tests at lines 3b and 3d determine whether to mark
1440	      it.  The first is a simple test of whether the L4S queue (lq.byt()
1441	      in bytes) is greater than a step threshold T in bytes (Note 4).
1442	      The second test is similar to the random ECN marking in RED, but
1443	      with the following differences: i) the marking function does not
1444	      start with a plateau of zero marking until a minimum threshold,
1445	      rather the marking probability starts to increase as soon as the
1446	      queue is positive; ii) marking depends on queuing time, not bytes,
1447	      in order to scale for any link rate without being reconfigured;
1448	      iii) marking of the L4S queue does not depend on itself, it
1449	      depends on the queuing time of the _other_ (Classic) queue, where
1450	      cq.sec() is the queuing time of the packet at the head of the
1451	      Classic queue (zero if empty); iv) marking depends on the
1452	      instantaneous queuing time (of the other Classic queue), not a
1453	      smoothed average; v) the queue is compared with the maximum of U
1454	      random numbers (but if U=1, this is the same as the single random
1455	      number used in RED).

1457	      Specifically, in line 3a the marking probability p_L is set to the
1458	      Classic queueing time cq.sec() in seconds divided by the L4S
1459	      scaling parameter 2^S_L, which represents the queuing time (in
1460	      seconds) at which marking probability would hit 100%. Then in line
1461	      3d (if U=1) the result is compared with a uniformly distributed
1462	      random number between 0 and 1, which ensures that marking
1463	      probability will linearly increase with queueing time.  The
1464	      scaling parameter is expressed as a power of 2 so that division
1465	      can be implemented as a right bit-shift (>>) in line 3 of the
1466	      integer variant of the pseudocode (Figure 9).

1468	   Classic:  If the test at line 7 determines that there is at least one
1469	      Classic packet to dequeue, the test at line 9b determines whether
1470	      to drop it.  But before that, line 8b updates Q_C, which is an
1471	      exponentially weighted moving average (Note 5) of the queuing time
1472	      in the Classic queue, where pkt.sec() is the instantaneous
1473	      queueing time of the current Classic packet and alpha is the EWMA
1474	      constant for the classic queue.  In line 8a, alpha is represented
1475	      as an integer power of 2, so that in line 8 of the integer code
1476	      the division needed to weight the moving average can be
1477	      implemented by a right bit-shift (>> f_C).

1479	      Lines 9a and 9b implement the drop function.  In line 9a the
1480	      averaged queuing time Q_C is divided by the Classic scaling
1481	      parameter 2^S_C, in the same way that queuing time was scaled for
1482	      L4S marking.  This scaled queuing time is given the variable name
1483	      sqrt_p_C because it will be squared to compute Classic drop
1484	      probability, so before it is squared it is effectively the square
1485	      root of the drop probability.  The squaring is done by comparing
1486	      it with the maximum out of two random numbers (assuming U=1).
1487	      Comparing it with the maximum out of two is the same as the
1488	      logical 'AND' of two tests, which ensures drop probability rises
1489	      with the square of queuing time (Note 6).  Again, the scaling
1490	      parameter is expressed as a power of 2 so that division can be
1491	      implemented as a right bit-shift in line 9 of the integer
1492	      pseudocode.
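
The Classic smoothing and scaling steps can be sketched in Python as
follows, in both real-number and integer form.  The function names
are hypothetical, and the integer version works on whatever integer
time unit the caller supplies.

```python
def ewma_real(q_c, sample, f_c):
    """Line 8a/8b: EWMA with alpha = 2^(-f_C), real-number arithmetic."""
    alpha = 2.0 ** (-f_c)
    return alpha * sample + (1 - alpha) * q_c


def ewma_int(q_c, sample, f_c):
    """Integer form: the weighting becomes a right bit-shift by f_C."""
    return q_c + ((sample - q_c) >> f_c)


def sqrt_p_c(q_c_sec, s_c):
    """Line 9a: scale the averaged queuing time (seconds) by 2^S_C."""
    return q_c_sec / 2.0 ** s_c
```

For example, with S_C = -1 an averaged queuing time of 0.25 s gives
sqrt_p_C = 0.5, so comparing it against the maximum of two random
numbers drops with probability 0.25.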

1494	   The marking/dropping functions in each queue (lines 3 & 9) are two
1495	   cases of a new generalization of RED called Curvy RED, motivated as
1496	   follows.  When we compared the performance of our AQM with fq_CoDel
1497	   and PIE, we came to the conclusion that their goal of holding queuing
1498	   delay to a fixed target is misguided [CRED_Insights].  As the number
1499	   of flows increases, if the AQM does not allow TCP to increase queuing
1500	   delay, it has to introduce abnormally high levels of loss.  Then loss
1501	   rather than queuing becomes the dominant cause of delay for short
1502	   flows, due to timeouts and tail losses.

1504	   Curvy RED constrains delay with a softened target that allows some
1505	   increase in delay as load increases.  This is achieved by increasing
1506	   drop probability on a convex curve relative to queue growth (the
1507	   square curve in the Classic queue, if U=1).  Like RED, the curve hugs
1508	   the zero axis while the queue is shallow.  Then, as load increases,
1509	   it introduces a growing barrier to higher delay.  But, unlike RED, it
1510	   requires only one parameter, the scaling, not three.  The
1511	   disadvantage of Curvy RED is that it is not adapted to a wide range
1512	   of RTTs.  Curvy RED can be used as is when the range of RTTs to
1513	   support is limited; otherwise an adaptation mechanism is required.

1515	   There follows a summary listing of the two parameters used for each
1516	   of the two queues:

1518	   Classic:

1520	      S_C :   The scaling factor of the dropping function scales Classic
1521	         queuing times in the range [0, 2^(S_C)] seconds into a dropping
1522	         probability in the range [0,1].  To make division efficient, it
1523	         is constrained to be an integer power of two;

1525	      f_C :  To smooth the queuing time of the Classic queue and make
1526	         multiplication efficient, we use a negative integer power of
1527	         two for the dimensionless EWMA constant, which we define as
1528	         alpha = 2^(-f_C).

1530	   L4S :

1532	      S_L (and k'):   As for the Classic queue, the scaling factor of
1533	         the L4S marking function scales Classic queueing times in the
1534	         range [0, 2^(S_L)] seconds into a probability in the range
1535	         [0,1].  Note that S_L = S_C + k', where k' is the coupling
1536	         between the queues.  So S_L and k' count as only one parameter;
1537	         k' is related to k in Equation (1) (Section 2.1) by k=2^k',
1538	         where both k and k' are constants.  Then implementations can
1539	         avoid costly division by shifting p_L by k' bits to the right.

1541	      T :  The queue size in bytes at which step threshold marking
1542	         starts in the L4S queue.

1544	   {ToDo: These are the raw parameters used within the algorithm.  A
1545	   configuration front-end could accept more meaningful parameters and
1546	   convert them into these raw parameters.}

1548	   From our experiments so far, recommended values for these parameters
1549	   are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs
1550	   typical on the public Internet.  [CRED_Insights] explains why these
1551	   parameters are applicable whatever rate link this AQM implementation
1552	   is deployed on and how the parameters would need to be adjusted for a
1553	   scenario with a different range of RTTs (e.g. a data centre) {ToDo
1554	   incorporate a summary of that report into this draft}. The setting of
1555	   k depends on policy (see Section 2.5 and Appendix C respectively for
1556	   its recommended setting and guidance on alternatives).

1558	   There is also a cUrviness parameter, U, which is a small positive
1559	   integer.  It is likely to take the same hard-coded value for all
1560	   implementations, once experiments have determined a good value.  We
1561	   have solely used U=1 in our experiments so far, but results might be
1562	   even better with U=2 or higher.

1564	   Note that the dropping function at line 9 calls maxrand(2*U), which
1565	   gives twice as much curviness as the call to maxrand(U) in the
1566	   marking function at line 3.  This is the trick that implements the
1567	   square rule in equation (1) (Section 2.1).  This is based on the fact
1568	   that, given a number X from 1 to 6, the probability that two dice
1569	   throws will both be less than X is the square of the probability that
1570	   one throw will be less than X.  So, when U=1, the L4S marking
1571	   function is linear and the Classic dropping function is squared.  If
1572	   U=2, L4S would be a square function and Classic would be quartic.
1573	   And so on.
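
The dice argument can be verified by exhaustive enumeration, and the
same reasoning applies to maxrand(): a value x exceeds the maximum of
n random numbers only if it exceeds every one of them, which happens
with probability x^n.  The sketch below is illustrative; the helper
prob_all_less and the injectable rand argument are not part of the
pseudocode.

```python
import random
from fractions import Fraction
from itertools import product


def maxrand(u, rand=random.random):
    """Return the maximum of u random numbers, each in [0, 1)."""
    maxr = 0.0
    for _ in range(u):
        maxr = max(maxr, rand())
    return maxr


def prob_all_less(x, throws):
    """Exact probability that every one of `throws` dice is less than x."""
    outcomes = list(product(range(1, 7), repeat=throws))
    hits = sum(1 for o in outcomes if all(d < x for d in o))
    return Fraction(hits, len(outcomes))
```

Comparing prob_all_less(x, 2) with prob_all_less(x, 1) squared for
each x from 1 to 6 confirms the square rule; using four throws gives
the quartic case that corresponds to U=2 in the Classic queue.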

1575	   The maxrand(u) function in lines 16-21 simply generates u random
1576	   numbers and returns the maximum (Note 7).  Typically, maxrand(u)
1577	   could be run in parallel out of band.  For instance, if U=1, the
1578	   Classic queue would require the maximum of two random numbers.  So,
1579	   instead of calling maxrand(2*U) in-band, the maximum of every pair of
1580	   values from a pseudorandom number generator could be generated out-
1581	   of-band, and held in a buffer ready for the Classic queue to consume.

1583	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1584	   2:     if ( lq.dequeue(pkt) ) {
1585	   3:        if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U)))
1586	   4:           mark(pkt)
1587	   5:        return(pkt)              % return the packet and stop here
1588	   6:     }
1589	   7:     while ( cq.dequeue(pkt) ) {
1590	   8:         Q_C += (pkt.ns() - Q_C) >> f_C           % Classic Q EWMA
1591	   9:        if ( (Q_C >> (S_C-2) ) > maxrand(2*U) )
1592	   10:          drop(pkt)                     % Squared drop, redo loop
1593	   11:       else
1594	   12:          return(pkt)           % return the packet and stop here
1595	   13:    }
1596	   14:    return(NULL)                           % no packet to dequeue
1597	   15: }

1599	   Figure 9: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM
1600	                         using Integer Arithmetic

1602	   Notes:

1604	   1.  The drain rate of the queue can vary if it is scheduled relative
1605	       to other queues, or to cater for fluctuations in a wireless
1606	       medium.  To auto-adjust to changes in drain rate, the queue must
1607	       be measured in time, not bytes or packets [CoDel].  In our Linux
1608	       implementation, it was easiest to measure queuing time at
1609	       dequeue.  Queuing time can be estimated when a packet is enqueued
1610	       by measuring the queue length in bytes and dividing by the recent
1611	       drain rate.

1613	   2.  An implementation has to use priority queueing, but it need not
1614	       implement strict priority.

1616	   3.  If packets can be enqueued while processing dequeue code, an
1617	       implementer might prefer to place the while loop around both
1618	       queues so that it goes back to test again whether any L4S packets
1619	       arrived while it was dropping a Classic packet.

1621	   4.  In order not to change too many factors at once, for now, we keep
1622	       the marking function for DCTCP-only traffic as similar as
1623	       possible to DCTCP.  However, unlike DCTCP, all processing is at
1624	       dequeue, so we determine whether to mark a packet at the head of
1625	       the queue by the byte-length of the queue _behind_ it.  We plan
1626	       to test whether using queuing time will work in all
1627	       circumstances, and if we find that the step can cause
1628	       oscillations, we will investigate replacing it with a steep
1629	       random marking curve.

1631	   5.  An EWMA is only one possible way to filter bursts; other more
1632	       adaptive smoothing methods could be valid and it might be
1633	       appropriate to decrease the EWMA faster than it increases.

1635	   6.  In practice at line 10 the Classic queue would probably test for
1636	       ECN capability on the packet to determine whether to drop or mark
1637	       the packet.  However, for brevity such detail is omitted.  All
1638	       packets classified into the L4S queue have to be ECN-capable, so
1639	       no dropping logic is necessary at line 3.  Nonetheless, L4S
1640	       packets could be dropped by overload code (see Section 4.1).

1642	   7.  In the integer variant of the pseudocode (Figure 9) real numbers
1643	       are all represented as integers scaled up by 2^32.  In lines 3 &
1644	       9 the function maxrand() is arranged to return an integer in the
1645	       range 0 <= maxrand() < 2^32.  Queuing times are also scaled up by
1646	       2^32, but in two stages: i) In lines 3 and 8 queuing times
1647	       cq.ns() and pkt.ns() are returned in integer nanoseconds, making
1648	       the values about 2^30 times larger than when the units were
1649	       seconds, ii) then in lines 3 and 9 an adjustment of -2 to the
1650	       right bit-shift multiplies the result by 2^2, to complete the
1651	       scaling by 2^32.
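
The two-stage scaling in Note 7 can be checked numerically.  The
sketch below is an illustration only: shift_right() is a hypothetical
helper that treats a negative shift count as a left shift, which is
what the expression needs when S_C - 2 is negative (for example with
S_C = -1).

```python
def shift_right(value, bits):
    """Right shift that treats a negative count as a left shift."""
    return value >> bits if bits >= 0 else value << -bits


def scaled_int(queue_ns, s):
    """Integer form used in lines 3 and 9: queuing time in ns >> (S - 2)."""
    return shift_right(queue_ns, s - 2)


def prob_real(queue_sec, s):
    """Real-number form: queuing time in seconds divided by 2^S."""
    return queue_sec / 2.0 ** s
```

For a 100 ms queue and S_C = -1, scaled_int(100_000_000, -1) / 2^32
is about 0.186, against 0.2 from prob_real(0.1, -1).  The ratio is
4*10^9 / 2^32 (about 0.93), which reflects the approximation of 10^9
by 2^30 in the first stage of the scaling.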

1653	Appendix C.  Guidance on Controlling Throughput Equivalence

1655	                     +---------------+------+-------+
1656	                     | RTT_C / RTT_L | Reno | Cubic |
1657	                     +---------------+------+-------+
1658	                     |             1 | k'=1 | k'=0  |
1659	                     |             2 | k'=2 | k'=1  |
1660	                     |             3 | k'=2 | k'=2  |
1661	                     |             4 | k'=3 | k'=2  |
1662	                     |             5 | k'=3 | k'=3  |
1663	                     +---------------+------+-------+

1665	    Table 1: Value of k' for which DCTCP throughput is roughly the same
1666	               as Reno or Cubic, for some example RTT ratios

1668	   k' is related to k in Equation (1) (Section 2.1) by k=2^k'.

1670	   To determine the appropriate policy, the operator first has to judge
1671	   whether it wants DCTCP flows to have roughly equal throughput with
1672	   Reno or with Cubic (because, even in its Reno-compatibility mode,
1673	   Cubic is about 1.4 times more aggressive than Reno).  Then the
1674	   operator needs to decide at what ratio of RTTs it wants DCTCP and
1675	   Classic flows to have roughly equal throughput.  For example choosing
1676	   k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same
1677	   as Cubic, _if their RTTs are the same_.

1679	   However, even if the base RTTs are the same, the actual RTTs are
1680	   unlikely to be the same, because Classic (Cubic or Reno) traffic
1681	   needs a large queue to avoid under-utilization and excess drop,
1682	   whereas L4S (DCTCP) does not.  The operator might still choose this
1683	   policy if it judges that DCTCP throughput should be rewarded for
1684	   keeping its own queue short.

1686	   On the other hand, the operator will choose one of the higher values
1687	   for k', if it wants to slow DCTCP down to roughly the same throughput
1688	   as Classic flows, to compensate for Classic flows slowing themselves
1689	   down by causing themselves extra queuing delay.

1691	   The values for k' in the table are derived from the formulae below,
1692	   which were developed in [DCttH15]:

1694	       2^k' = 1.64 (RTT_reno / RTT_dc)                  (2)
1695	       2^k' = 1.19 (RTT_cubic / RTT_dc )                (3)

1697	   For localized traffic from a particular ISP's data centre, we used
1698	   the measured RTTs to calculate that a value of k'=3 (equivalent to
1699	   k=8) would achieve throughput equivalence, and our experiments
1700	   verified the formula very closely.

1702	   For a typical mix of RTTs from local data centres and across the
1703	   general Internet, a value of k'=1 (equivalent to k=2) is recommended
1704	   as a good workable compromise.
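
The entries of Table 1 can be reproduced from formulae (2) and (3) by
solving for k' and rounding to the nearest integer.  The rounding
rule is an assumption that happens to match the table; the function
and constant names are hypothetical.

```python
import math

RENO_COEFF = 1.64    # coefficient from formula (2)
CUBIC_COEFF = 1.19   # coefficient from formula (3)


def k_prime(rtt_ratio, coeff):
    """Solve 2^k' = coeff * (RTT_C / RTT_L) for k', rounded to an integer."""
    return round(math.log2(coeff * rtt_ratio))


reno_column = [k_prime(r, RENO_COEFF) for r in range(1, 6)]
cubic_column = [k_prime(r, CUBIC_COEFF) for r in range(1, 6)]
```

An operator could call k_prime() with its own measured RTT ratio to
obtain a starting point for k', then derive k = 2^k'.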

1706	Appendix D.  Open Issues

1708	   Most of the following open issues are also tagged '{ToDo}' at the
1709	   appropriate point in the document:

1711	      Operational guidance to monitor L4S experiment

1713	      PI2 appendix: scaling of alpha & beta, esp. dependence of beta_U
1714	      on Tupdate

1716	      Curvy RED appendix: complete the unfinished parts

1718	Authors' Addresses

1720	   Koen De Schepper
1721	   Nokia Bell Labs
1722	   Antwerp
1723	   Belgium

1725	   Email: koen.de_schepper@nokia.com
1726	   URI:   https://www.bell-labs.com/usr/koen.de_schepper

1728	   Bob Briscoe (editor)
1729	   CableLabs
1730	   UK

1732	   Email: ietf@bobbriscoe.net
1733	   URI:   http://bobbriscoe.net/

1735	   Olga Bondarenko
1736	   Simula Research Lab
1737	   Lysaker
1738	   Norway

1740	   Email: olgabnd@gmail.com
1741	   URI:   https://www.simula.no/people/olgabo

1743	   Ing-jyh Tsang
1744	   Nokia
1745	   Antwerp
1746	   Belgium

1748	   Email: ing-jyh.tsang@nokia.com