idnits 2.17.1 

draft-ietf-tsvwg-aqm-dualq-coupled-08.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (November 04, 2018) is 2001 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '0' on line 1697

  -- Looks like a reference, but probably isn't: '1' on line 1697

  == Outdated reference: A later version (-02) exists of
     draft-briscoe-tsvwg-l4s-diffserv-00

  == Outdated reference: A later version (-29) exists of
     draft-ietf-tsvwg-ecn-l4s-id-02

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-02

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)

  -- Obsolete informational reference (is this intentional?): RFC 8312
     (Obsoleted by RFC 9438)


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Transport Area working group (tsvwg)                      K. De Schepper
3	Internet-Draft                                           Nokia Bell Labs
4	Intended status: Experimental                            B. Briscoe, Ed.
5	Expires: May 8, 2019                                           CableLabs
6	                                                           O. Bondarenko
7	                                                     Simula Research Lab
8	                                                                I. Tsang
9	                                                                   Nokia
10	                                                       November 04, 2018

12	  DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput
13	                                 (L4S)
14	                 draft-ietf-tsvwg-aqm-dualq-coupled-08

16	Abstract

18	   The Low Latency Low Loss Scalable Throughput (L4S) architecture
19	   allows data flows over the public Internet to predictably achieve
20	   ultra-low queuing latency, generally zero congestion loss and scaling
21	   of per-flow throughput without the problems of traditional TCP.  To
22	   achieve this, L4S data flows use a 'scalable' congestion control
23	   similar to Data Centre TCP (DCTCP) and a form of Explicit Congestion
24	   Notification (ECN) with modified behaviour.  However, until now,
25	   scalable congestion controls did not co-exist with existing TCP Reno/
26	   Cubic traffic---scalable controls are so aggressive that 'Classic'
27	   TCP algorithms drive themselves to starvation.  Therefore, until now,
28	   L4S controls could only be deployed where a clean-slate environment
29	   could be arranged, such as in private data centres (hence the name
30	   DCTCP).  This specification defines `DualQ Coupled Active Queue
31	   Management (AQM)', which enables these scalable congestion controls
32	   to safely co-exist with Classic Internet traffic.

34	   The Coupled AQM ensures that a flow runs at about the same rate
35	   whether it uses DCTCP or TCP Reno/Cubic.  It achieves this
36	   indirectly, without having to inspect transport layer flow
37	   identifiers, When tested in a residential broadband setting, DCTCP
38	   also achieves sub-millisecond average queuing delay and zero
39	   congestion loss under a wide range of mixes of DCTCP and `Classic'
40	   broadband Internet traffic, without compromising the performance of
41	   the Classic traffic.  The solution also reduces network complexity
42	   and eliminates network configuration.

44	Status of This Memo

46	   This Internet-Draft is submitted in full conformance with the
47	   provisions of BCP 78 and BCP 79.

49	   Internet-Drafts are working documents of the Internet Engineering
50	   Task Force (IETF).  Note that other groups may also distribute
51	   working documents as Internet-Drafts.  The list of current Internet-
52	   Drafts is at https://datatracker.ietf.org/drafts/current/.

54	   Internet-Drafts are draft documents valid for a maximum of six months
55	   and may be updated, replaced, or obsoleted by other documents at any
56	   time.  It is inappropriate to use Internet-Drafts as reference
57	   material or to cite them other than as "work in progress."

59	   This Internet-Draft will expire on May 8, 2019.

61	Copyright Notice

63	   Copyright (c) 2018 IETF Trust and the persons identified as the
64	   document authors.  All rights reserved.

66	   This document is subject to BCP 78 and the IETF Trust's Legal
67	   Provisions Relating to IETF Documents
68	   (https://trustee.ietf.org/license-info) in effect on the date of
69	   publication of this document.  Please review these documents
70	   carefully, as they describe your rights and restrictions with respect
71	   to this document.  Code Components extracted from this document must
72	   include Simplified BSD License text as described in Section 4.e of
73	   the Trust Legal Provisions and are provided without warranty as
74	   described in the Simplified BSD License.

76	Table of Contents

78	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
79	     1.1.  Problem and Scope . . . . . . . . . . . . . . . . . . . .   3
80	     1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   5
81	     1.3.  Features  . . . . . . . . . . . . . . . . . . . . . . . .   6
82	   2.  DualQ Coupled AQM . . . . . . . . . . . . . . . . . . . . . .   7
83	     2.1.  Coupled AQM . . . . . . . . . . . . . . . . . . . . . . .   8
84	     2.2.  Dual Queue  . . . . . . . . . . . . . . . . . . . . . . .   9
85	     2.3.  Traffic Classification  . . . . . . . . . . . . . . . . .   9
86	     2.4.  Overall DualQ Coupled AQM Structure . . . . . . . . . . .  10
87	     2.5.  Normative Requirements for a DualQ Coupled AQM  . . . . .  12
88	       2.5.1.  Functional Requirements . . . . . . . . . . . . . . .  12
89	         2.5.1.1.  Requirements in Unexpected Cases  . . . . . . . .  13
90	       2.5.2.  Management Requirements . . . . . . . . . . . . . . .  15
91	   3.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  16
92	   4.  Security Considerations . . . . . . . . . . . . . . . . . . .  16
93	     4.1.  Overload Handling . . . . . . . . . . . . . . . . . . . .  16
94	       4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput
95	               or Delay? . . . . . . . . . . . . . . . . . . . . . .  17
96	       4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or
97	               Delay?  . . . . . . . . . . . . . . . . . . . . . . .  18
98	       4.1.3.  Protecting against Unresponsive ECN-Capable Traffic .  19
99	   5.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  19
100	   6.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  20
101	     6.1.  Normative References  . . . . . . . . . . . . . . . . . .  20
102	     6.2.  Informative References  . . . . . . . . . . . . . . . . .  20
103	   Appendix A.  Example DualQ Coupled PI2 Algorithm  . . . . . . . .  23
104	     A.1.  Pass #1: Core Concepts  . . . . . . . . . . . . . . . . .  23
105	     A.2.  Pass #2: Overload Details . . . . . . . . . . . . . . . .  30
106	   Appendix B.  Example DualQ Coupled Curvy RED Algorithm  . . . . .  33
107	   Appendix C.  Guidance on Controlling Throughput Equivalence . . .  39
108	   Appendix D.  Open Issues  . . . . . . . . . . . . . . . . . . . .  40
109	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  41

111	1.  Introduction

113	1.1.  Problem and Scope

115	   Latency is becoming the critical performance factor for many (most?)
116	   applications on the public Internet, e.g. interactive Web, Web
117	   services, voice, conversational video, interactive video, interactive
118	   remote presence, instant messaging, online gaming, remote desktop,
119	   cloud-based applications, and video-assisted remote control of
120	   machinery and industrial processes.  In the developed world, further
121	   increases in access network bit-rate offer diminishing returns,
122	   whereas latency is still a multi-faceted problem.  In the last decade
123	   or so, much has been done to reduce propagation time by placing
124	   caches or servers closer to users.  However, queuing remains a major
125	   intermittent component of latency.

127	   The Diffserv architecture provides Expedited Forwarding [RFC3246], so
128	   that low latency traffic can jump the queue of other traffic.
129	   However, on access links dedicated to individual sites (homes, small
130	   enterprises or mobile devices), often all traffic at any one time
131	   will be latency-sensitive and, if all the traffic on a link is marked
132	   as EF, Diffserv cannot reduce the delay of any of it.  In contrast,
133	   the Low Latency Low Loss Scalable throughput (L4S) approach removes
134	   the causes of any unnecessary queuing delay.

136	   The bufferbloat project has shown that excessively-large buffering
137	   (`bufferbloat') has been introducing significantly more delay than
138	   the underlying propagation time.  These delays appear only
139	   intermittently--only when a capacity-seeking (e.g.  TCP) flow is long
140	   enough for the queue to fill the buffer, making every packet in other
141	   flows sharing the buffer sit through the queue.

143	   Active queue management (AQM) was originally developed to solve this
144	   problem (and others).  Unlike Diffserv, which gives low latency to
145	   some traffic at the expense of others, AQM controls latency for _all_
146	   traffic in a class.  In general, AQMs introduce an increasing level
147	   of discard from the buffer the longer the queue persists above a
148	   shallow threshold.  This gives sufficient signals to capacity-seeking
149	   (aka. greedy) flows to keep the buffer empty for its intended
150	   purpose: absorbing bursts.  However, RED [RFC2309] and other
151	   algorithms from the 1990s were sensitive to their configuration and
152	   hard to set correctly.  So, AQM was not widely deployed in the 1990s.

154	   More recent state-of-the-art AQMs, e.g. fq_CoDel [RFC8290],
155	   PIE [RFC8033], Adaptive RED [ARED01], are easier to configure,
156	   because they define the queuing threshold in time not bytes, so it is
157	   invariant for different link rates.  However, no matter how good the
158	   AQM, the sawtoothing rate of TCP will either cause queuing delay to
159	   vary or cause the link to be under-utilized.  Even with a perfectly
160	   tuned AQM, the additional queuing delay will be of the same order as
161	   the underlying speed-of-light delay across the network.  Flow-queuing
162	   can isolate one flow from another, but it cannot isolate a TCP flow
163	   from the delay variations it inflicts on itself, and it has other
164	   problems - it overrides the flow rate decisions of variable rate
165	   video applications, it does not recognise the flows within IPSec VPN
166	   tunnels and it is relatively expensive to implement.

168	   It seems that further changes to the network alone will now yield
169	   diminishing returns.  Data Centre TCP (DCTCP [RFC8257]) teaches us
170	   that a small but radical change to TCP is needed to cut two major
171	   outstanding causes of queuing delay variability:

173	   1.  the `sawtooth' varying rate of TCP itself;

175	   2.  the smoothing delay deliberately introduced into AQMs to permit
176	       bursts without triggering losses.

178	   The former causes a flow's round trip time (RTT) to vary from about 1
179	   to 2 times the base RTT between the machines in question.  The latter
180	   delays the system's response to change by a worst-case
181	   (transcontinental) RTT, which could be hundreds of times the actual
182	   RTT of typical traffic from localized CDNs.

184	   Latency is not our only concern:

186	   3.  It was known when TCP was first developed that it would not scale
187	       to high bandwidth-delay products [TCP-CA].

189	   Given regular broadband bit-rates over WAN distances are
190	   already [RFC3649] beyond the scaling range of `classic' TCP Reno,
191	   `less unscalable' Cubic [RFC8312] and
192	   Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been
193	   successfully deployed.  However, these are now approaching their
194	   scaling limits.  Unfortunately, fully scalable TCPs such as DCTCP
195	   cause `classic' TCP to starve itself, which is why they have been
196	   confined to private data centres or research testbeds (until now).

198	   This document specifies a `DualQ Coupled AQM' extension that solves
199	   the problem of coexistence between scalable and classic flows,
200	   without having to inspect flow identifiers.  The AQM is not like
201	   flow-queuing approaches [RFC8290] that classify packets by flow
202	   identifier into numerous separate queues in order to isolate sparse
203	   flows from the higher latency in the queues assigned to heavier
204	   flows.  In contrast, the AQM exploits the behaviour of scalable
205	   congestion controls like DCTCP so that every packet in every flow
206	   sharing the queue for DCTCP-like traffic can be served with very low
207	   latency.

209	   This AQM extension can be combined with any AQM designed for a single
210	   queue that generates a statistical or deterministic mark/drop
211	   probability driven by the queue dynamics.  In many cases it
212	   simplifies the basic control algorithm, and requires little extra
213	   processing.  Therefore it is believed the Coupled AQM would be
214	   applicable and easy to deploy in all types of buffers; buffers in
215	   cost-reduced mass-market residential equipment; buffers in end-system
216	   stacks; buffers in carrier-scale equipment including remote access
217	   servers, routers, firewalls and Ethernet switches; buffers in network
218	   interface cards, buffers in virtualized network appliances,
219	   hypervisors, and so on.

221	   For the public Internet, nearly all the benefit will typically be
222	   achieved by deploying the Coupled AQM into either end of the access
223	   link between a 'site' and the Internet, which is invariably the
224	   bottleneck.  Here, the term 'site' is used loosely to mean a home, an
225	   office, a campus or mobile user equipment.

227	   The overall L4S architecture [I-D.ietf-tsvwg-l4s-arch] gives more
228	   detail, including on wider deployment aspects such as coexistence in
229	   bottlenecks where a DualQ Coupled AQM has not been deployed.  The
230	   supporting papers [PI2] and [DCttH15] give the full rationale for the
231	   AQM's design, both discursively and in more precise mathematical
232	   form.

234	1.2.  Terminology

236	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
237	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
238	   document are to be interpreted as described in [RFC2119] when, and
239	   only when, they appear in all capitals, as shown here.

241	   The DualQ Coupled AQM uses two queues for two services.  Each of the
242	   following terms identifies both the service and the queue that
243	   provides the service:

245	   Classic (denoted by subscript C):  The `Classic' service is intended
246	      for all the behaviours that currently co-exist with TCP Reno (TCP
247	      Cubic, Compound, SCTP, etc).

249	   Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L):
250	      The `L4S' service is intended for a set of congestion controls
251	      with scalable properties (e.g.  DCTCP [RFC8257], Relentless
252	      TCP [Mathis09], the L4S variant of SCREAM for real-time
253	      media {ToDo: ref}).  For the public Internet a scalable control
254	      has to comply with the requirements in [I-D.ietf-tsvwg-ecn-l4s-id]
255	      (aka. the 'TCP Prague requirements').

257	   Either service can cope with a proportion of unresponsive or less-
258	   responsive traffic as well, as long (e.g.  DNS, VoIP, game sync
259	   datagrams, etc), just as a single queue AQM can if this traffic makes
260	   minimal contribution to queuing.  The DualQ Coupled AQM behaviour
261	   below is defined to be similar to a single FIFO queue with respect to
262	   unresponsive and overload traffic.

264	1.3.  Features

266	   The AQM couples marking and/or dropping across the two queues such
267	   that a flow will get roughly the same throughput whichever it uses.
268	   Therefore both queues can feed into the full capacity of a link and
269	   no rates need to be configured for the queues.  The L4S queue enables
270	   scalable congestion controls like DCTCP to give stunningly low and
271	   predictably low latency, without compromising the performance of
272	   competing 'Classic' Internet traffic.  Thousands of tests have been
273	   conducted in a typical fixed residential broadband setting.  Typical
274	   experiments used base round trip delays up to 100ms between the data
275	   centre and home network, and large amounts of background traffic in
276	   both queues.  For every L4S packet, the AQM kept the average queuing
277	   delay below 1ms (or 2 packets if serialization delay is bigger for
278	   slow links), and no losses at all were introduced by the AQM.
279	   Details of the extensive experiments are available [PI2] [DCttH15].

281	   Subjective testing was also conducted by multiple people all
282	   simultaneously using very demanding high bandwidth low latency
283	   applications over a single shared access link [L4Sdemo16].  In one
284	   application, each user could use finger gestures to pan or zoom their
285	   own high definition (HD) sub-window of a larger video scene generated
286	   on the fly in 'the cloud' from a football match.  Another user
287	   wearing VR goggles was remotely receiving a feed from a 360-degree
288	   camera in a racing car, again with the sub-window in their field of
289	   vision generated on the fly in 'the cloud' dependent on their head
290	   movements.  Even though other users were also downloading large
291	   amounts of L4S and Classic data, playing a gaming benchmark and
292	   watchings videos over the same 40Mb/s downstream broadband link,
293	   latency was so low that the football picture appeared to stick to the
294	   user's finger on the touchpad and the experience fed from the remote
295	   camera did not noticeably lag head movements.  All the L4S data (even
296	   including the downloads) achieved the same ultra-low latency.  With
297	   an alternative AQM, the video noticeably lagged behind the finger
298	   gestures and head movements.

300	   Unlike Diffserv Expedited Forwarding, the L4S queue does not have to
301	   be limited to a small proportion of the link capacity in order to
302	   achieve low delay.  The L4S queue can be filled with a heavy load of
303	   capacity-seeking flows like DCTCP and still achieve low delay.  The
304	   L4S queue does not rely on the presence of other traffic in the
305	   Classic queue that can be 'overtaken'.  It gives low latency to L4S
306	   traffic whether or not there is Classic traffic, and the latency of
307	   Classic traffic does not suffer when a proportion of the traffic is
308	   L4S.  The two queues are only necessary because DCTCP-like flows
309	   cannot keep latency predictably low and keep utilization high if they
310	   are mixed with legacy TCP flows,

312	   The experiments used the Linux implementation of DCTCP that is
313	   deployed in private data centres, without any modification despite
314	   its known deficiencies.  Nonetheless, certain modifications will be
315	   necessary before DCTCP is safe to use on the Internet, which are
316	   recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id].  However, the
317	   focus of this specification is to get the network service in place.
318	   Then, without any management intervention, applications can exploit
319	   it by migrating to scalable controls like DCTCP, which can then
320	   evolve _while_ their benefits are being enjoyed by everyone on the
321	   Internet.

323	2.  DualQ Coupled AQM

325	   There are two main aspects to the approach:

327	   o  the Coupled AQM that addresses throughput equivalence between
328	      Classic (e.g.  Reno, Cubic) flows and L4S flows (that satisfy the
329	      TCP Prague requirements).

331	   o  the Dual Queue structure that provides latency separation for L4S
332	      flows to isolate them from the typically large Classic queue.

334	2.1.  Coupled AQM

336	   In the 1990s, the `TCP formula' was derived for the relationship
337	   between TCP's congestion window, cwnd, and its drop probability, p.
338	   To a first order approximation, cwnd of TCP Reno is inversely
339	   proportional to the square root of p.

341	   We focus on Reno as the worst case, because if we do not harm Reno,
342	   we will not harm Cubic.  Nonetheless, TCP Cubic implements a Reno-
343	   compatibility mode, which is the only relevant mode for typical RTTs
344	   under 20ms as long as the throughput of a single flow is less than
345	   about 500Mb/s.  Therefore it can be assumed that Cubic traffic
346	   behaves similarly to Reno (but with a slightly different constant of
347	   proportionality).  The term 'Classic' will be used for the collection
348	   of Reno-friendly traffic including Cubic in Reno mode.

350	   The supporting paper [PI2] includes the derivation of the equivalent
351	   rate equation for DCTCP, for which cwnd is inversely proportional to
352	   p (not the square root), where in this case p is the ECN marking
353	   probability.  DCTCP is not the only congestion control that behaves
354	   like this, so the term 'L4S' traffic will be used for all similar
355	   behaviour.

357	   For safe co-existence, under stationary conditions, a DCTCP flow has
358	   to run at roughly the same rate as a Reno TCP flow (all other factors
359	   being equal).  So the drop or marking probability for Classic
360	   traffic, p_C has to be distinct from the marking probability for L4S
361	   traffic, p_L.  [RFC8311] updates the original ECN specification
362	   [RFC3168] to allow these probabilities to be distinct, because RFC
363	   3168 required them to be the same.

365	   Also, to remain stable, Classic sources need the network to smooth
366	   p_C so it changes relatively slowly.  In contrast, L4S avoids
367	   smoothing in the network, because it delays all signals for a worst-
368	   case RTT.  So instead, L4S sources smooth the ECN marking probability
369	   themselves, so they expect the network to generate ECN marks with a
370	   probability p_L that tracks the instantaneous unsmoothed queue.

372	   The Coupled AQM achieves safe coexistence by making the Classic drop
373	   probability p_C proportional to the square of the coupled L4S
374	   probability p_CL. p_CL is an input to the instantaneous L4S marking
375	   probability p_L but it changes as slowly as p_C.  This makes the Reno
376	   flow rate roughly equal the DCTCP flow rate, because the squaring of
377	   p_CL counterbalances the square root of p_C in the Classic 'TCP
378	   formula'.

380	   Stating this as a formula, the relation between Classic drop
381	   probability, p_C, and the coupled L4S probability p_CL needs to take
382	   the form:

384	       p_C = ( p_CL / k )^2                  (1)

386	   where k is the constant of proportionality, which we shall call the
387	   coupling factor.

389	2.2.  Dual Queue

391	   Classic traffic typically builds a large queue to prevent under-
392	   utilization.  Therefore a separate queue is provided for L4S traffic,
393	   and it is scheduled with priority over Classic.  Priority is
394	   conditional to prevent starvation of Classic traffic.

396	   Nonetheless, coupled marking ensures that giving priority to L4S
397	   traffic still leaves the right amount of spare scheduling time for
398	   Classic flows to each get equivalent throughput to DCTCP flows (all
399	   other factors such as RTT being equal).

401	2.3.  Traffic Classification

403	   Both the Coupled AQM and DualQ mechanisms need an identifier to
404	   distinguish L and C packets.  Then the coupling algorithm can achieve
405	   coexistence without having to inspect flow identifiers, because it
406	   can apply the appropriate marking or dropping probability to all
407	   flows of each type.  A separate
408	   specification [I-D.ietf-tsvwg-ecn-l4s-id] requires the sender to use
409	   the ECT(1) codepoint of the ECN field as this identifier, having
410	   assessed various alternatives.  An additional process document has
411	   proved necessary to make the ECT(1) codepoint available for
412	   experimentation [RFC8311].

414	   For policy reasons, an operator might choose to steer certain packets
415	   (e.g. from certain flows or with certain addresses) out of the L
416	   queue, even though they identify themselves as L4S by their ECN
417	   codepoints.  In such cases, the device MUST NOT alter the ECN field,
418	   so that it is preserved end-to-end.  The aim is that each operator
419	   can choose how it treats L4S traffic locally, but an individual
420	   operator does not alter the identification of L4S packets, which
421	   would prevent other operators downstream from making their own
422	   choices on how to treat L4S traffic.

424	   In addition, other identifiers could be used to classify certain
425	   additional packet types into the L queue, that are deemed not to risk
426	   harming the L4S service.  For instance addresses of specific
427	   applications or hosts (see [I-D.ietf-tsvwg-ecn-l4s-id]), specific
428	   Diffserv codepoints such as EF (Expedited Forwarding) and Voice-Admit
429	   service classes (see [I-D.briscoe-tsvwg-l4s-diffserv]) or certain
430	   protocols (e.g.  ARP, DNS).

432	   Note that the mechanism only reads these classifiers, it MUST NOT re-
433	   mark or alter these identifiers (except for marking the ECN field
434	   with the CE codepoint - with increasing frequency to indicate
435	   increasing congestion).

437	2.4.  Overall DualQ Coupled AQM Structure

439	   Figure 1 shows the overall structure that any DualQ Coupled AQM is
440	   likely to have.  This schematic is intended to aid understanding of
441	   the current designs of DualQ Coupled AQMs.  However, it is not
442	   intended to preclude other innovative ways of satisfying the
443	   normative requirements in Section 2.5 that minimally define a DualQ
444	   Coupled AQM.

446	   The classifier on the left separates incoming traffic between the two
447	   queues (L and C).  Each queue has its own AQM that determines the
448	   likelihood of marking or dropping (p_L and p_C).  It has been
449	   proved [PI2] that it is preferable to control load with a linear
450	   controller, then square the output before applying it as a drop
451	   probability to TCP (because TCP decreases its load proportional to
452	   the square-root of the increase in drop).  So, the AQM for Classic
453	   traffic needs to be implemented in two stages: i) a base stage that
454	   outputs an internal probability p' (pronounced p-prime); and ii) a
455	   squaring stage that outputs p_C, where

457	       p_C = (p')^2.                         (2)

459	   Substituting for p_C in Eqn (1) gives:

461	       p' = p_CL / k

463	   So the slow-moving input to ECN marking in the L queue (the coupled
464	   L4S probability) is:

466	       p_CL = k*p',                          (3)

468	   where k is the constant coupling factor (see Appendix C).

470	   It can be seen that these two transformations of p' implement the
471	   required coupling given in equation (1) earlier.

473	   The actual probability p_L that we apply to the L queue needs to
474	   track the immediate L queue delay, as well as track p_CL under
475	   stationary conditions.  So we use a native AQM in the L queue that
476	   calculates a probability p'_L as a function of the instantaneous L
477	   queue.  And, given the L queue has conditional strict priority over
478	   the C queue, whenever the L queue grows, we should apply marking
479	   probability p'_L, but p_L should not fall below p_CL.  This suggests:

481	       p_L = max(p'_L, p_CL),                (4)

483	   which has also been found to work very well in practice.

485	                           _________
486	                                  | |    ,------.
487	                        L4S queue | |===>| ECN  |
488	                       ,'| _______|_|    |marker|\
489	                     <'  |         |     `------'\\
490	                      //`'         v        ^ p_L \\
491	                     //       ,-------.     |      \\
492	                    //        |Native |p'_L |       \\,.
493	                   //         |  L4S  |--->(MAX)    <  |   ___
494	      ,----------.//          |  AQM  |     ^ p_CL   `\|.'Cond-`.
495	      |  IP-ECN  |/           `-------'     |          / itional \
496	   ==>|Classifier|            ,-------.   (k*p')       [ priority]==>
497	      |          |\           |  Base |     |          \scheduler/
498	      `----------'\\          |  AQM  |---->:        ,'|`-.___.-'
499	                   \\         |       |p'   |      <'  |
500	                    \\        `-------'   (p'^2)    //`'
501	                     \\            ^        |      //
502	                      \\,.         |        v p_C //
503	                      <  | _________     .------.//
504	                       `\|   |      |    | Drop |/
505	                     Classic |queue |===>|/mark |
506	                           __|______|    `------'

508	   Legend: ===> traffic flow; ---> control dependency.

510	                   Figure 1: DualQ Coupled AQM Schematic

512	   After the AQMs have applied their dropping or marking, the scheduler
513	   forwards their packets to the link, giving priority to L4S traffic.
514	   Priority has to be conditional in some way (see Section 4.1).  Simple
515	   strict priority is inappropriate otherwise it could lead the L4S
516	   queue to starve the Classic queue.  For example, consider the case
517	   where a continually busy L4S queue blocks a DNS request in the
518	   Classic queue, arbitrarily delaying the start of a new Classic flow.

520	   Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are
521	   given in Appendix A and Appendix B.  Either example AQM can be used
522	   to couple packet marking and dropping across a dual Q.

524	   DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM.
525	   Indeed, this Base AQM with just the squared output and no L4S queue
526	   can be used as a drop-in replacement for PIE [RFC8033], in which case
527	   we call it just PI2 [PI2].  PI2 is a principled simplification of PIE
528	   that is both more responsive and more stable in the face of
529	   dynamically varying load.

531	   Curvy RED is derived from RED [RFC2309], but its configuration
532	   parameters are insensitive to link rate and it requires less
533	   operations per packet.  However, DualPI2 is more responsive and
534	   stable over a wider range of RTTs than Curvy RED.  As a consequence,
535	   DualPI2 has attracted more development attention than Curvy RED,
536	   leaving the Curvy RED design incomplete and not so fully evaluated.

538	   Both AQMs regulate their queue in units of time not bytes.  As
539	   already explained, this ensures configuration can be invariant for
540	   different drain rates.  With AQMs in a dualQ structure this is
541	   particularly important because the drain rate of each queue can vary
542	   rapidly as flows for the two queues arrive and depart, even if the
543	   combined link rate is constant.

545	   It would be possible to control the queues with other alternative
546	   AQMs, as long as the normative requirements (those expressed in
547	   capitals) in Section 2.5 are observed.

549	2.5.  Normative Requirements for a DualQ Coupled AQM

551	   The following requirements are intended to capture only the essential
552	   aspects of a DualQ Coupled AQM.  They are intended to be independent
553	   of the particular AQMs used for each queue.

555	2.5.1.  Functional Requirements

557	   A Dual Queue Coupled AQM implementation MUST utilize two queues, each
558	   with an AQM algorithm.  The two queues can be part of a larger
559	   queuing hierarchy [I-D.briscoe-tsvwg-l4s-diffserv].

561	   The AQM algorithm for the low latency (L) queue MUST apply ECN
562	   marking.

564	   The scheduler draining the two queues MUST give L4S packets priority
565	   over Classic, although priority MUST be bounded in order not to
566	   starve Classic traffic.

568	   [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on
569	   L4S traffic, relative to drop of Classic traffic.  In order to
570	   prevent starvation of Classic traffic by scalable L4S traffic, it
571	   says, "The likelihood that an AQM drops a Not-ECT Classic packet
572	   (p_C) MUST be roughly proportional to the square of the likelihood
573	   that it would have marked it if it had been an L4S packet (p_L)."
574	   The term 'likelihood' is used to allow for marking and dropping to be
575	   either probabilistic or deterministic.

577	   For the current specification, this translates into the following
578	   requirement.  A DualQ Coupled AQM MUST apply ECN marking to traffic
579	   in the L queue that is no lower than that derived from the likelihood
580	   of drop (or ECN marking) in the Classic queue using Eqn.  (1).

582	   The constant of proportionality, k, in Eqn (1) determines the
583	   relative flow rates of Classic and L4S flows when the AQM concerned
584	   is the bottleneck (all other factors being equal).
585	   [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality
586	   (k) does not have to be standardised for interoperability, but a
587	   value of 2 is RECOMMENDED."

589	   Assuming scalable congestion controls for the Internet will be as
590	   aggressive as DCTCP, this will ensure their congestion window will be
591	   roughly the same as that of a standards track TCP congestion control
592	   (Reno) [RFC5681] and other so-called TCP-friendly controls, such as
593	   TCP Cubic in its TCP-friendly mode.

595	   The choice of k is a matter of operator policy, and operators MAY
596	   choose a different value using Table 1 and the guidelines in
597	   Appendix C.

599	   If multiple users share capacity at a bottleneck (e.g. in the
600	   Internet access link of a campus network), the operator's choice of k
601	   will determine capacity sharing between the flows of different users.
602	   However, on the public Internet, access network operators typically
603	   isolate customers from each other with some form of layer-2
604	   multiplexing (OFDM(A) in DOCSIS3.1, CDMA in 3G, SC-FDMA in LTE) or L3
605	   scheduling (WRR in DSL), rather than relying on TCP to share capacity
606	   between customers [RFC0970].  In such cases, the choice of k will
607	   solely affect relative flow rates within each customer's access
608	   capacity, not between customers.  Also, k will not affect relative
609	   flow rates at any times when all flows are Classic or all L4S, and it
610	   will not affect the relative throughput of small flows.

612	2.5.1.1.  Requirements in Unexpected Cases

614	   The flexibility to allow operator-specific classifiers (Section 2.3)
615	   leads to the need to specify what the AQM in each queue ought to do
616	   with packets that do not carry the ECN field expected for that queue.
617	   It is recommended that the AQM in each queue inspects the ECN field
618	   to determine what sort of congestion notification to signal, then
619	   decides whether to apply congestion notification to this particular
620	   packet, as follows:

622	   o  If a packet that does not carry an ECT(1) or CE codepoint is
623	      classified into the L queue:

625	      *  if the packet is ECT(0), the L AQM SHOULD apply CE-marking
626	         using a probability appropriate to Classic congestion control
627	         and appropriate to the target delay in the L queue

629	      *  if the packet is Not-ECT, the appropriate action depends on
630	         whether some other function is protecting the L queue from
631	         misbehaving flows (e.g. per-flow queue protection or latency
632	         policing):

634	         +  If separate queue protection is provided, the L AQM SHOULD
635	            ignore the packet and forward it unchanged, meaning it
636	            should not calculate whether to apply congestion
637	            notification and it should neither drop nor CE-mark the
638	            packet (for instance, the operator might classify EF traffic
639	            that is unresponsive to drop into the L queue, alongside
640	            responsive L4S-ECN traffic)

642	         +  if separate queue protection is not provided, the L AQM
643	            SHOULD apply drop using a drop probability appropriate to
644	            Classic congestion control and appropriate to the target
645	            delay in the L queue

647	   o  If a packet that carries an ECT(1) codepoint is classified into
648	      the C queue:

650	      *  the C AQM SHOULD apply CE-marking using the coupled AQM
651	         probability p_CL (= k*p').

653	   If the DualQ Coupled AQM has detected overload, it will signal
654	   congestion solely using drop, irrespective of the ECN field.

656	   The above requirements are worded as "SHOULDs", because operator-
657	   specific classifiers are for flexibility, by definition.  Therefore,
658	   alternative actions might be appropriate in the operator's specific
659	   circumstances.  An example would be where the operator knows that
660	   certain legacy traffic marked with one codepoint actually has a
661	   congestion response associated with another codepoint.

663	2.5.2.  Management Requirements

665	   By default, a DualQ Coupled AQM SHOULD NOT need any configuration for
666	   use at a bottleneck on the public Internet [RFC7567].  The following
667	   parameters MAY be operator-configurable, e.g. to tune for non-
668	   Internet settings:

670	   o  Optional packet classifier(s) to use in addition to the ECN field
671	      (see Section 2.3);

673	   o  Expected typical RTT (a parameter for typical or target queuing
674	      delay in each queue might be configurable instead; if so it MUST
675	      be expressed in units of time);

677	   o  Expected maximum RTT (a stability parameter that depends on
678	      maximum RTT might be configurable instead);

680	   o  Coupling factor, k;

682	   o  The limit to the conditional priority of L4S (scheduler-dependent,
683	      e.g. the scheduler weight for WRR, or the time-shift for time-
684	      shifted FIFO);

686	   o  The maximum Classic ECN marking probability, p_Cmax, before
687	      switching over to drop.

689	   An experimental DualQ Coupled AQM SHOULD allow the operator to
690	   monitor each of the following operational statistics on demand, per
691	   queue and per configurable sample interval, for performance
692	   monitoring and perhaps also for accounting in some cases:

694	   o  Bits forwarded, from which utilization can be calculated;

696	   o  Total packets arriving, enqueued and dequeued to distinguish tail
697	      discard from proactive AQM discard;

699	   o  ECN packets marked, non-ECN packets dropped, ECN packets dropped,
700	      from which marking and dropping probabilities can be calculated;

702	   o  Queue delay (not including serialization delay of the head packet
703	      or medium acquisition delay) - see further notes below.

705	      Unlike the other statistics, queue delay cannot be captured in a
706	      simple accumulating counter.  Therefore the type of queue delay
707	      statistics produced (mean, percentiles, etc.) will depend on
708	      implementation constraints.  To facilitate comparative evaluation
709	      of different implementations and approaches, an implementation
710	      SHOULD allow mean and 99th percentile queue delay to be derived
711	      (per queue per sample interval).  A relatively simple way to do
712	      this would be to store a coarse-grained histogram of queue delay.
713	      This could be done with a small number of bins with configurable
714	      edges that represent contiguous ranges of queue delay.  Then, over
715	      a sample interval, each bin would accumulate a count of the number
716	      of packets that had fallen within each range.  The maximum queue
717	      delay per queue per interval MAY also be recorded.

719	   An experimental DualQ Coupled AQM SHOULD asynchronously report the
720	   following data about anomalous conditions:

722	   o  Start-time and duration of overload state.

724	      A hysteresis mechanism SHOULD be used to prevent flapping in and
725	      out of overload causing an event storm.  For instance, exit from
726	      overload state could trigger one report, but also latch a timer.
727	      Then, during that time, if the AQM enters and exits overload state
728	      any number of times, the duration in overload state is accumulated
729	      but no new report is generated until the first time the AQM is out
730	      of overload once the timer has expired.

732	   [RFC5706] suggests that deployment, coexistence and scaling should
733	   also be covered as management requirements.  The raison d'etre of the
734	   DualQ Couple AQM is to enable deployment and coexistence of scalable
735	   congestion controls - as incremental replacements for today's TCP-
736	   friendly controls that do not scale with bandwidth-delay product.
737	   Therefore, these motivating issues are explained in the Introduction
738	   and detailed in the L4S architecture [I-D.ietf-tsvwg-l4s-arch].
739	   Also, the descriptions of specific DualQ Coupled AQM algorithms in
740	   the appendices cover scaling of their configuration parameters, e.g.
741	   with respect to RTT and sampling frequency.

743	3.  IANA Considerations

745	   This specification contains no IANA considerations.

747	4.  Security Considerations

749	4.1.  Overload Handling

751	   Where the interests of users or flows might conflict, it could be
752	   necessary to police traffic to isolate any harm to the performance of
753	   individual flows.  However it is hard to avoid unintended side-
754	   effects with policing, and in a trusted environment policing is not
755	   necessary.  Therefore per-flow policing needs to be separable from a
756	   basic AQM, as an option under policy control.

758	   However, a basic DualQ AQM does at least need to handle overload.  A
759	   useful objective would be for the overload behaviour of the DualQ AQM
760	   to be at least no worse than a single queue AQM.  However, a trade-
761	   off needs to be made between complexity and the risk of either
762	   traffic class harming the other.  In each of the following three
763	   subsections, an overload issue specific to the DualQ is described,
764	   followed by proposed solution(s).

766	   Under overload the higher priority L4S service will have to sacrifice
767	   some aspect of its performance.  Alternative solutions are provided
768	   below that each relax a different factor: e.g. throughput, delay,
769	   drop.  These choices need to be made either by the developer or by
770	   operator policy, rather than by the IETF.

772	4.1.1.  Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay?

774	   Priority of L4S is required to be conditional to avoid total
775	   throughput starvation of Classic by heavy L4S traffic.  This raises
776	   the question of whether to sacrifice L4S throughput or L4S delay (or
777	   some other policy) to mitigate starvation of Classic:

779	   Sacrifice L4S throughput:   By using weighted round robin as the
780	      conditional priority scheduler, the L4S service can sacrifice some
781	      throughput during overload to guarantee a minimum throughput
782	      service for Classic traffic.  The scheduling weight of the Classic
783	      queue should be small (e.g. 1/16).  Then, in most traffic
784	      scenarios the scheduler will not interfere and it will not need to
785	      - the coupling mechanism and the end-systems will share out the
786	      capacity across both queues as if it were a single pool.  However,
787	      because the congestion coupling only applies in one direction
788	      (from C to L), if L4S traffic is over-aggressive or unresponsive,
789	      the scheduler weight for Classic traffic will at least be large
790	      enough to ensure it does not starve.

792	      In cases where the ratio of L4S to Classic flows (e.g. 19:1) is
793	      greater than the ratio of their scheduler weights (e.g. 15:1), the
794	      L4S flows will get less than an equal share of the capacity, but
795	      only slightly.  For instance, with the example numbers given, each
796	      L4S flow will get (15/16)/19 = 4.9% when ideally each would get
797	      1/20=5%. In the rather specific case of an unresponsive flow
798	      taking up a large part of the capacity set aside for L4S, using
799	      WRR could significantly reduce the capacity left for any
800	      responsive L4S flows.

802	   Sacrifice L4S Delay:  To control milder overload of responsive
803	      traffic, particularly when close to the maximum congestion signal,
804	      the operator could choose to control overload of the Classic queue
805	      by allowing some delay to 'leak' across to the L4S queue.  The
806	      scheduler can be made to behave like a single First-In First-Out
807	      (FIFO) queue with different service times by implementing a very
808	      simple conditional priority scheduler that could be called a
809	      "time-shifted FIFO" (see the Modifier Earliest Deadline First
810	      (MEDF) scheduler of [MEDF]).  This scheduler adds tshift to the
811	      queue delay of the next L4S packet, before comparing it with the
812	      queue delay of the next Classic packet, then it selects the packet
813	      with the greater adjusted queue delay.  Under regular conditions,
814	      this time-shifted FIFO scheduler behaves just like a strict
815	      priority scheduler.  But under moderate or high overload it
816	      prevents starvation of the Classic queue, because the time-shift
817	      (tshift) defines the maximum extra queuing delay of Classic
818	      packets relative to L4S.

820	   The example implementation in Appendix A can implement either policy.

822	4.1.2.  Congestion Signal Saturation: Introduce L4S Drop or Delay?

824	   To keep the throughput of both L4S and Classic flows roughly equal
825	   over the full load range, a different control strategy needs to be
826	   defined above the point where one AQM first saturates to a
827	   probability of 100% leaving no room to push back the load any harder.
828	   If k>1, L4S will saturate first, even though saturation could be
829	   caused by unresponsive traffic in either queue.

831	   The term 'unresponsive' includes cases where a flow becomes
832	   temporarily unresponsive, for instance, a real-time flow that takes a
833	   while to adapt its rate in response to congestion, or a TCP-like flow
834	   that is normally responsive, but above a certain congestion level it
835	   will not be able to reduce its congestion window below the minimum of
836	   2 segments [RFC5681], effectively becoming unresponsive.  (Note that
837	   L4S traffic ought to remain responsive below a window of 2 segments
838	   (see [I-D.ietf-tsvwg-ecn-l4s-id]).

840	   Saturation raises the question of whether to relieve congestion by
841	   introducing some drop into the L4S queue or by allowing delay to grow
842	   in both queues (which could eventually lead to tail drop too):

844	   Drop on Saturation:  Saturation can be avoided by setting a maximum
845	      threshold for L4S ECN marking (assuming k>1) before saturation
846	      starts to make the flow rates of the different traffic types
847	      diverge.  Above that the drop probability of Classic traffic is
848	      applied to all packets of all traffic types.  Then experiments
849	      have shown that queueing delay can be kept at the target in any
850	      overload situation, including with unresponsive traffic, and no
851	      further measures are required [DualQ-Test].

853	   Delay on Saturation:  When L4S marking saturates, instead of
854	      switching to drop, the drop and marking probabilities could be
855	      capped.  Beyond that, delay will grow either solely in the queue
856	      with unresponsive traffic (if WRR is used), or in both queues (if
857	      time-shifted FIFO is used).  In either case, the higher delay
858	      ought to control temporary high congestion.  If the overload is
859	      more persistent, eventually the combined DualQ will overflow and
860	      tail drop will control congestion.

862	   The example implementation in Appendix A solely applies the "drop on
863	   saturation" policy.

865	4.1.3.  Protecting against Unresponsive ECN-Capable Traffic

867	   Unresponsive traffic has a greater advantage if it is also ECN-
868	   capable.  The advantage is undetectable at normal low levels of drop/
869	   marking, but it becomes significant with the higher levels of drop/
870	   marking typical during overload.  This is an issue whether the ECN-
871	   capable traffic is L4S or Classic.

873	   This raises the question of whether and when to switch off ECN
874	   marking and use solely drop instead, as required by both Section 7 of
875	   [RFC3168] and Section 4.2.1 of [RFC7567].

877	   Experiments with the DualPI2 AQM (Appendix A) have shown that
878	   introducing 'drop on saturation' at 100% L4S marking addresses this
879	   problem with unresponsive ECN as well as addressing the saturation
880	   problem.  It leaves only a small range of congestion levels where
881	   unresponsive traffic gains any advantage from using the ECN
882	   capability, and the advantage is hardly detectable [DualQ-Test].

884	5.  Acknowledgements

886	   Thanks to Anil Agarwal, Sowmini Varadhan's and Gabi Bracha for
887	   detailed review comments particularly of the appendices and
888	   suggestions on how to make our explanation clearer.  Thanks also to
889	   Greg White for improving the normative requirements and both Greg and
890	   Tom Henderson for insights on the choice of schedulers, queue delay
891	   measurement techniques.

893	   The authors' contributions were originally part-funded by the
894	   European Community under its Seventh Framework Programme through the
895	   Reducing Internet Transport Latency (RITE) project (ICT-317700).  Bob
896	   Briscoe's contribution was also part-funded by the Research Council
897	   of Norway through the TimeIn project.  The views expressed here are
898	   solely those of the authors.

900	6.  References

902	6.1.  Normative References

904	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
905	              Requirement Levels", BCP 14, RFC 2119,
906	              DOI 10.17487/RFC2119, March 1997,
907	              <https://www.rfc-editor.org/info/rfc2119>.

909	6.2.  Informative References

911	   [ARED01]   Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An
912	              Algorithm for Increasing the Robustness of RED's Active
913	              Queue Management", ACIRI Technical Report , August 2001,
914	              <http://www.icir.org/floyd/red.html>.

916	   [CoDel]    Nichols, K. and V. Jacobson, "Controlling Queue Delay",
917	              ACM Queue 10(5), May 2012,
918	              <http://queue.acm.org/issuedetail.cfm?issue=2208917>.

920	   [CRED_Insights]
921	              Briscoe, B., "Insights from Curvy RED (Random Early
922	              Detection)", BT Technical Report TR-TUB8-2015-003, July
923	              2015,
924	              <http://www.bobbriscoe.net/projects/latency/credi_tr.pdf>.

926	   [DCttH15]  De Schepper, K., Bondarenko, O., Briscoe, B., and I.
927	              Tsang, "`Data Centre to the Home': Ultra-Low Latency for
928	              All", 2015, <http://www.bobbriscoe.net/projects/latency/
929	              dctth_preprint.pdf>.

931	              (Under submission)

933	   [DualQ-Test]
934	              Steen, H., "Destruction Testing: Ultra-Low Delay using
935	              Dual Queue Coupled Active Queue Management", Masters
936	              Thesis, Dept of Informatics, Uni Oslo , May 2017.

938	   [I-D.briscoe-tsvwg-l4s-diffserv]
939	              Briscoe, B., "Interactions between Low Latency, Low Loss,
940	              Scalable Throughput (L4S) and Differentiated Services",
941	              draft-briscoe-tsvwg-l4s-diffserv-00 (work in progress),
942	              March 2018.

944	   [I-D.ietf-tsvwg-ecn-l4s-id]
945	              Schepper, K., Briscoe, B., and I. Tsang, "Identifying
946	              Modified Explicit Congestion Notification (ECN) Semantics
947	              for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s-
948	              id-02 (work in progress), March 2018.

950	   [I-D.ietf-tsvwg-l4s-arch]
951	              Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency,
952	              Low Loss, Scalable Throughput (L4S) Internet Service:
953	              Architecture", draft-ietf-tsvwg-l4s-arch-02 (work in
954	              progress), March 2018.

956	   [I-D.sridharan-tcpm-ctcp]
957	              Sridharan, M., Tan, K., Bansal, D., and D. Thaler,
958	              "Compound TCP: A New TCP Congestion Control for High-Speed
959	              and Long Distance Networks", draft-sridharan-tcpm-ctcp-02
960	              (work in progress), November 2008.

962	   [L4Sdemo16]
963	              Bondarenko, O., De Schepper, K., Tsang, I., and B.
964	              Briscoe, "Ultra-Low Delay for All: Live Experience, Live
965	              Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016,
966	              <http://dl.acm.org/citation.cfm?doid=2910017.2910633
967	              (videos of demos: https://riteproject.eu/
968	              dctth/#1511dispatchwg )>.

970	   [Mathis09]
971	              Mathis, M., "Relentless Congestion Control", PFLDNeT'09 ,
972	              May 2009, <http://www.hpcc.jp/pfldnet2009/
973	              Program_files/1569198525.pdf>.

975	   [MEDF]     Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a
976	              simple scheduling algorithm for two real-time transport
977	              service classes with application in the UTRAN", Proc. IEEE
978	              Conference on Computer Communications (INFOCOM'03) Vol.2
979	              pp.1116-1122, March 2003.

981	   [PI2]      De Schepper, K., Bondarenko, O., Briscoe, B., and I.
982	              Tsang, "PI2: A Linearized AQM for both Classic and
983	              Scalable TCP", ACM CoNEXT'16 , December 2016,
984	              <https://riteproject.files.wordpress.com/2015/10/
985	              pi2_conext.pdf>.

987	              (To appear)

989	   [RFC0970]  Nagle, J., "On Packet Switches With Infinite Storage",
990	              RFC 970, DOI 10.17487/RFC0970, December 1985,
991	              <https://www.rfc-editor.org/info/rfc970>.

993	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
994	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
995	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
996	              S., Wroclawski, J., and L. Zhang, "Recommendations on
997	              Queue Management and Congestion Avoidance in the
998	              Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998,
999	              <https://www.rfc-editor.org/info/rfc2309>.

1001	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
1002	              of Explicit Congestion Notification (ECN) to IP",
1003	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
1004	              <https://www.rfc-editor.org/info/rfc3168>.

1006	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
1007	              J., Courtney, W., Davari, S., Firoiu, V., and D.
1008	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
1009	              Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002,
1010	              <https://www.rfc-editor.org/info/rfc3246>.

1012	   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
1013	              RFC 3649, DOI 10.17487/RFC3649, December 2003,
1014	              <https://www.rfc-editor.org/info/rfc3649>.

1016	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1017	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
1018	              <https://www.rfc-editor.org/info/rfc5681>.

1020	   [RFC5706]  Harrington, D., "Guidelines for Considering Operations and
1021	              Management of New Protocols and Protocol Extensions",
1022	              RFC 5706, DOI 10.17487/RFC5706, November 2009,
1023	              <https://www.rfc-editor.org/info/rfc5706>.

1025	   [RFC7567]  Baker, F., Ed. and G. Fairhurst, Ed., "IETF
1026	              Recommendations Regarding Active Queue Management",
1027	              BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015,
1028	              <https://www.rfc-editor.org/info/rfc7567>.

1030	   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
1031	              "Proportional Integral Controller Enhanced (PIE): A
1032	              Lightweight Control Scheme to Address the Bufferbloat
1033	              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
1034	              <https://www.rfc-editor.org/info/rfc8033>.

1036	   [RFC8034]  White, G. and R. Pan, "Active Queue Management (AQM) Based
1037	              on Proportional Integral Controller Enhanced PIE) for
1038	              Data-Over-Cable Service Interface Specifications (DOCSIS)
1039	              Cable Modems", RFC 8034, DOI 10.17487/RFC8034, February
1040	              2017, <https://www.rfc-editor.org/info/rfc8034>.

1042	   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
1043	              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
1044	              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
1045	              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

1047	   [RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
1048	              J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler
1049	              and Active Queue Management Algorithm", RFC 8290,
1050	              DOI 10.17487/RFC8290, January 2018,
1051	              <https://www.rfc-editor.org/info/rfc8290>.

1053	   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
1054	              Notification (ECN) Experimentation", RFC 8311,
1055	              DOI 10.17487/RFC8311, January 2018,
1056	              <https://www.rfc-editor.org/info/rfc8311>.

1058	   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
1059	              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
1060	              RFC 8312, DOI 10.17487/RFC8312, February 2018,
1061	              <https://www.rfc-editor.org/info/rfc8312>.

1063	   [TCP-CA]   Jacobson, V. and M. Karels, "Congestion Avoidance and
1064	              Control", Laurence Berkeley Labs Technical Report ,
1065	              November 1988, <http://ee.lbl.gov/papers/congavoid.pdf>.

1067	Appendix A.  Example DualQ Coupled PI2 Algorithm

1069	   As a first concrete example, the pseudocode below gives the DualPI2
1070	   algorithm.  DualPI2 follows the structure of the DualQ Coupled AQM
1071	   framework in Figure 1.  A simple step threshold (in units of queuing
1072	   time) is used for the Native L4S AQM, but a ramp is also described as
1073	   an alternative.  And the PI2 algorithm [PI2] is used for the Classic
1074	   AQM.  PI2 is an improved variant of the PIE AQM [RFC8033].

1076	   We will introduce the pseudocode in two passes.  The first pass
1077	   explains the core concepts, deferring handling of overload to the
1078	   second pass.  To aid comparison, line numbers are kept in step
1079	   between the two passes by using letter suffixes where the longer code
1080	   needs extra lines.

1082	   A full open source implementation for Linux is available at:
1083	   https://github.com/olgabo/dualpi2.

1085	A.1.  Pass #1: Core Concepts

1087	   The pseudocode manipulates three main structures of variables: the
1088	   packet (pkt), the L4S queue (lq) and the Classic queue (cq).  The
1089	   pseudocode consists of the following five functions:

1091	   o  initialization code (Figure 2) that sets parameter defaults (the
1092	      API for setting non-default values is omitted for brevity)

1094	   o  enqueue code (Figure 3)

1096	   o  dequeue code (Figure 4)

1098	   o  a ramp function (Figure 5) used to calculate the ECN-marking
1099	      probability for the L4S queue

1101	   o  code to regularly update the base probability (p) used in the
1102	      dequeue code (Figure 6).

1104	   It also uses the following functions that are not shown in full here:

1106	   o  scheduler(), which selects between the head packets of the two
1107	      queues; the choice of scheduler technology is discussed later;

1109	   o  cq.len() or lq.len() returns the current length (aka. backlog) of
1110	      the relevant queue in bytes;

1112	   o  cq.time() or lq.time() returns the current queuing delay (aka.
1113	      sojourn time or service time) of the relevant queue in units of
1114	      time;

1116	   Queuing delay could be measured directly by storing a per-packet
1117	   time-stamp as each packet is enqueued, and subtracting this from the
1118	   system time when the packet is dequeued.  If time-stamping is not
1119	   easy to introduce with certain hardware, queuing delay could be
1120	   predicted indirectly by dividing the size of the queue by the
1121	   predicted departure rate, which might be known precisely for some
1122	   link technologies (see for example [RFC8034]).

1124	   In our experiments so far (building on experiments with PIE) on
1125	   broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs
1126	   from 5 ms to 100 ms, DualPI2 achieves good results with the default
1127	   parameters in Figure 2.  The parameters are categorised by whether
1128	   they relate to the Base PI2 AQM, the L4S AQM or the framework
1129	   coupling them together.  Variables derived from these parameters are
1130	   also included at the end of each category.  Each parameter is
1131	   explained as it is encountered in the walk-through of the pseudocode
1132	   below.

1134	   1:  dualpi2_params_init(...) {         % Set input parameter defaults
1135	   2:    % PI2 AQM parameters
1136	   3:    target = 15 ms              % PI AQM Classic queue delay target
1137	   4:    Tupdate = 16 ms            % PI Classic queue sampling interval
1138	   5:    alpha = 10 Hz^2                              % PI integral gain
1139	   6:    beta = 100 Hz^2                          % PI proportional gain
1140	   7:    p_Cmax = 1/4                       % Max Classic drop/mark prob
1141	   8:    % Constants derived from PI2 AQM parameters
1142	   9:    alpha_U = alpha *Tupdate % PI integral gain per update interval
1143	   10:   beta_U = beta * Tupdate  % PI prop'nal gain per update interval
1144	   11:
1145	   12:   % DualQ Coupled framework parameters
1146	   13:   k = 2                                         % Coupling factor
1147	   14:   % scheduler weight or equival't parameter (scheduler-dependent)
1148	   15:   limit = MAX_LINK_RATE * 250 ms               % Dual buffer size
1149	   16:
1150	   17:   % L4S ramp AQM parameters
1151	   18:   minTh = 475 us        % L4S min marking threshold in time units
1152	   19:   range = 525 us                % Range of L4S ramp in time units
1153	   20:   Th_len = 2 * MTU           % Min L4S marking threshold in bytes
1154	   21:   % Constants derived from L4S AQM parameters
1155	   22:   p_Lmax = min(k*sqrt(p_Cmax), 1)          % Max L4S marking prob
1156	   23:   floor = Th_len * 8 / MIN_LINK_RATE   % MIN_LINK_RATE is in Mb/s
1157	   24:   if (minTh < floor) {
1158	   25:     % Adjust ramp to exceed serialization time of 2 MTU
1159	   26:     range = max(range - (floor-minTh), 1)   % 1us avoids /0 error
1160	   27:     minTh = floor
1161	   28:   }
1162	   29:   maxTh = minTh+range   % L4S min marking threshold in time units
1163	   30: }

1165	       Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM

1167	   For brevity the pseudocode shows some parameters in units of
1168	   microseconds (us), but a real implementation would probably use
1169	   nanoseconds.

1171	   The overall goal of the code is to maintain the base probability (p),
1172	   which is an internal variable from which the marking and dropping
1173	   probabilities for L4S and Classic traffic (p_L and p_C) are derived.
1174	   The variable named p in the pseudocode and in this walk-through is
1175	   the same as p' (p-prime) in Section 2.4.  The probabilities p_L and
1176	   p_C are derived in lines 3, 4 and 5 of the dualpi2_update() function
1177	   (Figure 6) then used in the dualpi2_dequeue() function (Figure 4).
1178	   The code walk-through below builds up to explaining that part of the
1179	   code eventually, but it starts from packet arrival.

1181	   1:  dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq
1182	   2:    if ( lq.len() + cq.len() > limit )
1183	   3:      drop(pkt)                     % drop packet if buffer is full
1184	   4:    else {                                      % Packet classifier
1185	   5:      if ( ecn(pkt) modulo 2 == 1 )       % ECN bits = ECT(1) or CE
1186	   6:        lq.enqueue(pkt)
1187	   7:      else                           % ECN bits = not-ECT or ECT(0)
1188	   8:        cq.enqueue(pkt)
1189	   9:    }
1190	   10: }

1192	      Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM

1194	   1:  dualpi2_dequeue(lq, cq, pkt) {     % Couples L4S & Classic queues
1195	   2:    while ( lq.len() + cq.len() > 0 )
1196	   3:      if ( scheduler() == lq ) {
1197	   4:        lq.dequeue(pkt)                      % Scheduler chooses lq
1198	   5:        p'_L = laqm(lq.time())                     % Native L4S AQM
1199	   6:        p_L = max(p'_L, p_CL)                  % Combining function
1200	   7:        if ( p_L > rand() )                        % Linear marking
1201	   8:          mark(pkt)
1202	   9:      } else {
1203	   10:       cq.dequeue(pkt)                      % Scheduler chooses cq
1204	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1205	   12:         if ( ecn(pkt) == 0 ) {           % if ECN field = not-ECT
1206	   13:           drop(pkt)                                % squared drop
1207	   14:           continue        % continue to the top of the while loop
1208	   15:         }
1209	   16:         mark(pkt)                                  % squared mark
1210	   17:       }
1211	   18:     }
1212	   19:     return(pkt)                      % return the packet and stop
1213	   20:   }
1214	   21:   return(NULL)                             % no packet to dequeue
1215	   22: }

1217	      Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM

1219	   When packets arrive, first a common queue limit is checked as shown
1220	   in line 2 of the enqueuing pseudocode in Figure 3.  Note that the
1221	   limit is deliberately tested before enqueue to avoid any bias against
1222	   larger packets (so depending whether the implementation stores a
1223	   packet while testing whether to drop it from the tail, it might be
1224	   necessary for the actual buffer memory to be one MTU larger than
1225	   limit).

1227	   Line 2 assumes an implementation where lq and cq share common buffer
1228	   memory.  An alternative implementation could use separate buffers for
1229	   each queue, in which case the arriving packet would have to be
1230	   classified first to determine which buffer to check for available
1231	   space.  The choice is a trade off; a shared buffer can use less
1232	   memory whereas separate buffers isolate the L4S queue from tail-drop
1233	   due to large bursts of Classic traffic (e.g. a Classic TCP during
1234	   slow-start over a long RTT).

1236	   Returning to the shared buffer case, if limit is not exceeded, the
1237	   packet will be classified and enqueued to the Classic or L4S queue
1238	   dependent on the least significant bit of the ECN field in the IP
1239	   header (line 5).  Packets with a codepoint having an LSB of 0 (Not-
1240	   ECT and ECT(0)) will be enqueued in the Classic queue.  Otherwise,
1241	   ECT(1) and CE packets will be enqueued in the L4S queue.  Optional
1242	   additional packet classification flexibility is omitted for brevity
1243	   (see [I-D.ietf-tsvwg-ecn-l4s-id]).

1245	   The dequeue pseudocode (Figure 4) is repeatedly called whenever the
1246	   lower layer is ready to forward a packet.  It schedules one packet
1247	   for dequeuing (or zero if the queue is empty) then returns control to
1248	   the caller, so that it does not block while that packet is being
1249	   forwarded.  While making this dequeue decision, it also makes the
1250	   necessary AQM decisions on dropping or marking.  The alternative of
1251	   applying the AQMs at enqueue would shift some processing from the
1252	   critical time when each packet is dequeued.  However, it would also
1253	   add a whole queue of delay to the control signals, making the control
1254	   loop very sloppy.

1256	   All the dequeue code is contained within a large while loop so that
1257	   if it decides to drop a packet, it will continue until it selects a
1258	   packet to schedule.  Line 3 of the dequeue pseudocode is where the
1259	   scheduler chooses between the L4S queue (lq) and the Classic queue
1260	   (cq).  Detailed implementation of the scheduler is not shown (see
1261	   discussion later).

1263	   o  If an L4S packet is scheduled, lines 7 and 8 ECN-mark the packet
1264	      if a random marking decision is drawn according to p_L.  Line 6
1265	      calculates p_L as the maximum of the coupled L4S probability p_CL
1266	      and the probability from the native L4S AQM p'_L.  This implements
1267	      the max() function shown in Figure 1 to couple the outputs of the
1268	      two AQMs together.  Of the two probabilities input to p_L in line
1269	      6:

1271	      *  p'_L is calculated per packet in line 5 by the laqm() function
1272	         (see Figure 5),

1274	      *  whereas p_CL is maintained by the dualpi2_update() function
1275	         which runs every Tupdate (default 16ms) (see Figure 2).

1277	   o  If a Classic packet is scheduled, lines 10 to 17 drop or mark the
1278	      packet based on the squared probability p_C.

1280	   The Native L4S AQM algorithm (Figure 5) is a ramp function, similar
1281	   to the RED algorithm, but simpler due to the following differences:

1283	   o  The min and max of the ramp are defined in units of queuing delay,
1284	      not bytes, so that configuration remains invariant as the queue
1285	      departure rate varies.

1287	   o  It uses instantaneous queueing delay to remove smoothing delay
1288	      (L4S senders smooth incoming ECN feedback when necessary).

1290	   o  The ramp rises linearly directly from 0 to 1, not to a an
1291	      intermediate value of p'_L as RED would, because there is no need
1292	      to keep ECN marking probability low.

1294	   o  Marking does not have to be randomized.  Determinism is being
1295	      experimented with instead of randomness; to reduce the delay
1296	      necessary to smooth out the noise of randomness from the signal.
1297	      In this case, for each packet, the algorithm would accumulate p_L
1298	      in a counter and mark the packet that took the counter over 1,
1299	      then subtract 1 from the counter and continue.

1301	   This ramp function requires two configuration parameters, the minimum
1302	   threshold (minTh) and the width of the ramp (range), both in units of
1303	   queuing time), as shown in the parameter initialization code in
1304	   Figure 2.  A minimum marking threshold parameter (Th_len) in
1305	   transmission units (default 2 MTU) is also necessary to ensure that
1306	   the ramp does not trigger excessive marking on slow links.  The code
1307	   in lines 23-28 of Figure 2 converts 2 MTU into time units and adjusts
1308	   the ramp thresholds to be no shallower than this floor.

1310	   An operator can effectively turn the ramp into a step function, as
1311	   used by DCTCP, by setting the range to its minimum value (e.g. 1 ns).
1312	   Then the condition for the ramp calculation will hardly ever arise.
1313	   There is some concern that using the step function of DCTCP for the
1314	   Native L4S AQM requires end-systems to smooth the signal for an
1315	   unnecessarily large number of round trips to ensure sufficient
1316	   fidelity.  A ramp seems to be no worse than a step in initial
1317	   experiments with existing DCTCP.  Therefore, it is recommended that a
1318	   ramp is configured in place of a step, which will allow congestion
1319	   control algorithms to investigate faster smoothing algorithms.

1321	   1:  laqm(qdelay) {               % Returns native L4S AQM probability
1322	   2:    if (qdelay >= maxTh)
1323	   3:      return 1
1324	   4:    else if (qdelay > minTh)
1325	   5:      return (qdelay - minTh)/range  % Divide would use a bit-shift
1326	   6:    else
1327	   7:      return 0
1328	   8:  }

1330	            Figure 5: Example Pseudocode for the Native L4S AQM

1332	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1333	   2:    curq = cq.time()  % use queuing time of first-in Classic packet
1334	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1335	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1336	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1337	   6:    prevq = curq
1338	   7:  }

1340	     Figure 6: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM

1342	   p_CL depends on the base probability (p), which is kept up to date by
1343	   the core PI algorithm in Figure 6 executed every Tupdate.

1345	   Note that p solely depends on the queuing time in the Classic queue.
1346	   In line 2, the current queuing delay (curq) is evaluated from how
1347	   long the head packet was in the Classic queue (cq).  The function
1348	   cq.time() (not shown) subtracts the time stamped at enqueue from the
1349	   current time and implicitly takes the current queuing delay as 0 if
1350	   the queue is empty.

1352	   The algorithm centres on line 3, which is a classical Proportional-
1353	   Integral (PI) controller that alters p dependent on: a) the error
1354	   between the current queuing delay (curq) and the target queuing delay
1355	   ('target' - see [RFC8033]); and b) the change in queuing delay since
1356	   the last sample.  The name 'PI' represents the fact that the second
1357	   factor (how fast the queue is growing) is _P_roportional to load
1358	   while the first is the _I_ntegral of the load (so it removes any
1359	   standing queue in excess of the target).

1361	   The two 'gain factors' in line 3, alpha_U and beta_U, respectively
1362	   weight how strongly each of these elements ((a) and (b)) alters p.
1363	   They are in units of 'per second of delay' or Hz, because they
1364	   transform differences in queueing delay into changes in probability.

1366	   alpha_U and beta_U are derived from the input parameters alpha and
1367	   beta (see lines 5 and 6 of Figure 2).  These recommended values of
1368	   alpha and beta come from the stability analysis in [PI2] so that the
1369	   AQM can change p as fast as possible in response to changes in load
1370	   without over-compensating and therefore causing oscillations in the
1371	   queue.

1373	   alpha and beta determine how much p ought to change if it was updated
1374	   every second.  It is best to update p as frequently as possible, but
1375	   the update interval (Tupdate) will probably be constrained by
1376	   hardware performance.  For link rates from 4 - 200 Mb/s, we found
1377	   Tupdate=16ms (as recommended in [RFC8033]) is sufficient.  However
1378	   small the chosen value of Tupdate, p should change by the same amount
1379	   per second, but in finer more frequent steps.  So the gain factors
1380	   used for updating p in Figure 6 need to be scaled by (Tupdate/1s),
1381	   which is done in lines 9 and 10 of Figure 2).  The suffix '_U'
1382	   represents 'per update time' (Tupdate).

1384	   In corner cases, p can overflow the range [0,1] so the resulting
1385	   value of p has to be bounded (omitted from the pseudocode).  Then, as
1386	   already explained, the coupled and Classic probabilities are derived
1387	   from the new p in lines 4 and 5 as p_CL = k*p and p_C = p^2.

1389	   Because the coupled L4S marking probability (p_CL) is factored up by
1390	   k, the dynamic gain parameters alpha and beta are also inherently
1391	   factored up by k for the L4S queue, which is necessary to ensure that
1392	   Classic TCP and DCTCP controls have the same stability.  So, if alpha
1393	   is 10 Hz^2, the effective gain factor for the L4S queue is k*alpha,
1394	   which is 20 Hz^2 with the default coupling factor of k=2.

1396	   Unlike in PIE [RFC8033], alpha_U and beta_U do not need to be tuned
1397	   every Tupdate dependent on p.  Instead, in PI2, alpha_U and beta_U
1398	   are independent of p because the squaring applied to Classic traffic
1399	   tunes them inherently.  This is explained in [PI2], which also
1400	   explains why this more principled approach removes the need for most
1401	   of the heuristics that had to be added to PIE.

1403	   {ToDo: Scaling beta with Tupdate and scaling both alpha & beta with
1404	   RTT}

1406	A.2.  Pass #2: Overload Details

1408	   Figure 7 repeats the dequeue function of Figure 4, but with overload
1409	   details added.  Similarly Figure 8 repeats the core PI algorithm of
1410	   Figure 6 with overload details added.  The initialization, enqueue
1411	   and L4S AQM functions are unchanged.

1413	   In line 7 of the initialization function (Figure 2), the default
1414	   maximum Classic drop probability p_Cmax = 1/4 or 25%. This is the
1415	   point at which it is deemed that the Classic queue has become
1416	   persistently overloaded, so it switches to using solely drop, even
1417	   for ECN-capable packets.  This protects the queue against any
1418	   unresponsive traffic that falsely claims that it is responsive to ECN
1419	   marking, as required by [RFC3168] and [RFC7567].

1421	   Line 22 of the initialization function translates this into a maximum
1422	   L4S marking probability (p_Lmax) by rearranging Equation (1).  With a
1423	   coupling factor of k=2 (the default) or greater, this translates to a
1424	   maximum L4S marking probability of 1 (or 100%).  This is intended to
1425	   ensure that the L4S queue starts to introduce dropping once marking
1426	   saturates and can rise no further.  The 'TCP Prague' requirements
1427	   [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion
1428	   control detects a drop, it falls back to a response that coexists
1429	   with 'Classic' TCP.  So it is correct that the L4S queue drops
1430	   packets proportional to p^2, as if they are Classic packets.

1432	   Both these switch-overs are triggered by the tests for overload
1433	   introduced in lines 4b and 12b of the dequeue function (Figure 7).
1434	   Lines 8c to 8g drop L4S packets with probability p^2.  Lines 8h to 8i
1435	   mark the remaining packets with probability p_CL.  If p_Lmax = 1,
1436	   which is the suggested default configuration, all remaining packets
1437	   will be marked because, to have reached the else block at line 8b,
1438	   p_CL >= 1.

1440	   Lines 2c to 2d in the core PI algorithm (Figure 8) deal with overload
1441	   of the L4S queue when there is no Classic traffic.  This is
1442	   necessary, because the core PI algorithm maintains the appropriate
1443	   drop probability to regulate overload, but it depends on the length
1444	   of the Classic queue.  If there is no Classic queue the naive
1445	   algorithm in Figure 6 drops nothing, even if the L4S queue is
1446	   overloaded - so tail drop would have to take over (lines 3 and 4 of
1447	   Figure 3).

1449	   If the test at line 2a finds that the Classic queue is empty, line 2d
1450	   measures the current queue delay using the L4S queue instead.  While
1451	   the L4S queue is not overloaded, its delay will always be tiny
1452	   compared to the target Classic queue delay.  So p_L will be driven to
1453	   zero, and the L4S queue will naturally be governed solely by
1454	   threshold marking (lines 5 and 6 of the dequeue algorithm in
1455	   Figure 7).  But, if unresponsive L4S source(s) cause overload, the
1456	   DualQ transitions smoothly to L4S marking based on the PI algorithm.
1457	   And as overload increases, it naturally transitions from marking to
1458	   dropping by the switch-over mechanism already described.

1460	   1:  dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq
1461	   2:    while ( lq.len() + cq.len() > 0 )
1462	   3:      if ( scheduler() == lq ) {
1463	   4a:       lq.dequeue(pkt)
1464	   4b:       if ( p_CL < p_Lmax ) {      % Check for overload saturation
1465	   5:          p'_L = laqm(lq.time())                   % Native L4S AQM
1466	   6:          p_L = max(p'_L, p_CL)                % Combining function
1467	   7:          if ( p_L > rand() )                      % Linear marking
1468	   8a:           mark(pkt)
1469	   8b:       } else {                              % overload saturation
1470	   8c:         if ( p_C > rand() ) {             % probability p_C = p^2
1471	   8e:           drop(pkt)      % revert to Classic drop due to overload
1472	   8f:           continue        % continue to the top of the while loop
1473	   8g:         }
1474	   8h:         if ( p_CL > rand() )           % probability p_CL = k * p
1475	   8i:           mark(pkt)         % linear marking of remaining packets
1476	   8j:       }
1477	   9:      } else {
1478	   10:       cq.dequeue(pkt)
1479	   11:       if ( p_C > rand() ) {               % probability p_C = p^2
1480	   12a:        if ( (ecn(pkt) == 0)                % ECN field = not-ECT
1481	   12b:             OR (p_C >= p_Cmax) ) {       % Overload disables ECN
1482	   13:           drop(pkt)                     % squared drop, redo loop
1483	   14:           continue        % continue to the top of the while loop
1484	   15:         }
1485	   16:         mark(pkt)                                  % squared mark
1486	   17:       }
1487	   18:     }
1488	   19:     return(pkt)                      % return the packet and stop
1489	   20:   }
1490	   21:   return(NULL)                             % no packet to dequeue
1491	   22: }

1493	      Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM
1494	             (Including Integer Arithmetic and Overload Code)

1496	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
1497	   2a:   if ( cq.len() > 0 )
1498	   2b:     curq = cq.time() %use queuing time of first-in Classic packet
1499	   2c:   else                                      % Classic queue empty
1500	   2d:     curq = lq.time()    % use queuing time of first-in L4S packet
1501	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
1502	   4:    p_CL = p * k   % Coupled L4S prob = base prob * coupling factor
1503	   5:    p_C = p^2                        % Classic prob = (base prob)^2
1504	   6:    prevq = curq
1505	   7:  }

1507	     Figure 8: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM
1508	                         (Including Overload Code)

1510	   The choice of scheduler technology is critical to overload protection
1511	   (see Section 4.1).

1513	   o  A well-understood weighted scheduler such as weighted round robin
1514	      (WRR) is recommended.  The scheduler weight for Classic should be
1515	      low, e.g. 1/16.

1517	   o  Alternatively, a time-shifted FIFO could be used.  This is a very
1518	      simple scheduler, but it does not fully isolate latency in the L4S
1519	      queue from uncontrolled bursts in the Classic queue.  It works by
1520	      selecting the head packet that has waited the longest, biased
1521	      against the Classic traffic by a time-shift of tshift.  To
1522	      implement time-shifted FIFO, the "if (scheduler() == lq )" test in
1523	      line 3 of the dequeue code would simply be replaced by "if (
1524	      lq.time() + tshift >= cq.time() )".  For the public Internet a
1525	      good value for tshift is 50ms.  For private networks with smaller
1526	      diameter, about 4*target would be reasonable.

1528	   o  A strict priority scheduler would be inappropriate, because it
1529	      would starve Classic if L4S was overloaded.

1531	Appendix B.  Example DualQ Coupled Curvy RED Algorithm

1533	   As another example of a DualQ Coupled AQM algorithm, the pseudocode
1534	   below gives the Curvy RED based algorithm we used and tested.
1535	   Although we designed the AQM to be efficient in integer arithmetic,
1536	   to aid understanding it is first given using real-number arithmetic.
1537	   Then, one possible optimization for integer arithmetic is given, also
1538	   in pseudocode.  To aid comparison, the line numbers are kept in step
1539	   between the two by using letter suffixes where the longer code needs
1540	   extra lines.

1542	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1543	   2:    if ( lq.dequeue(pkt) ) {
1544	   3a:     p_L = cq.sec() / 2^S_L
1545	   3b:     if ( lq.byt() > T )
1546	   3c:       mark(pkt)
1547	   3d:     elif ( p_L > maxrand(U) )
1548	   4:        mark(pkt)
1549	   5:      return(pkt)                % return the packet and stop here
1550	   6:    }
1551	   7:    while ( cq.dequeue(pkt) ) {
1552	   8a:     alpha = 2^(-f_C)
1553	   8b:     Q_C = alpha * pkt.sec() + (1-alpha)* Q_C    % Classic Q EWMA
1554	   9a:     sqrt_p_C = Q_C / 2^S_C
1555	   9b:     if ( sqrt_p_C > maxrand(2*U) )
1556	   10:       drop(pkt)                        % Squared drop, redo loop
1557	   11:     else
1558	   12:       return(pkt)              % return the packet and stop here
1559	   13:   }
1560	   14:   return(NULL)                           % no packet to dequeue
1561	   15: }

1563	   16: maxrand(u) {                % return the max of u random numbers
1564	   17:     maxr=0
1565	   18:     while (u-- > 0)
1566	   19:         maxr = max(maxr, rand())               % 0 <= rand() < 1
1567	   20:     return(maxr)
1568	   21: }

1570	   Figure 9: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM

1572	   Packet classification code is not shown, as it is no different from
1573	   Figure 3.  Potential classification schemes are discussed in
1574	   Section 2.3.  The Curvy RED algorithm has not been maintained to the
1575	   same degree as the DualPI2 algorithm.  Some ideas used in DualPI2
1576	   would need to be translated into Curvy RED, such as i) the
1577	   conditional priority scheduler instead of strict priority ii) the
1578	   time-based L4S threshold; iii) turning off ECN as overload
1579	   protection; iv) Classic ECN support.  These are not shown in the
1580	   Curvy RED pseudocode, but would need to be implemented for
1581	   production. {ToDo}

1583	   At the outer level, the structure of dualq_dequeue() implements
1584	   strict priority scheduling.  The code is written assuming the AQM is
1585	   applied on dequeue (Note 1) . Every time dualq_dequeue() is called,
1586	   the if-block in lines 2-6 determines whether there is an L4S packet
1587	   to dequeue by calling lq.dequeue(pkt), and otherwise the while-block
1588	   in lines 7-13 determines whether there is a Classic packet to
1589	   dequeue, by calling cq.dequeue(pkt).  (Note 2)
1590	   In the lower priority Classic queue, a while loop is used so that, if
1591	   the AQM determines that a classic packet should be dropped, it
1592	   continues to test for classic packets deciding whether to drop each
1593	   until it actually forwards one.  Thus, every call to dualq_dequeue()
1594	   returns one packet if at least one is present in either queue,
1595	   otherwise it returns NULL at line 14.  (Note 3)

1597	   Within each queue, the decision whether to drop or mark is taken as
1598	   follows (to simplify the explanation, it is assumed that U=1):

1600	   L4S:  If the test at line 2 determines there is an L4S packet to
1601	      dequeue, the tests at lines 3a and 3c determine whether to mark
1602	      it.  The first is a simple test of whether the L4S queue (lq.byt()
1603	      in bytes) is greater than a step threshold T in bytes (Note 4).
1604	      The second test is similar to the random ECN marking in RED, but
1605	      with the following differences: i) the marking function does not
1606	      start with a plateau of zero marking until a minimum threshold,
1607	      rather the marking probability starts to increase as soon as the
1608	      queue is positive; ii) marking depends on queuing time, not bytes,
1609	      in order to scale for any link rate without being reconfigured;
1610	      iii) marking of the L4S queue does not depend on itself, it
1611	      depends on the queuing time of the _other_ (Classic) queue, where
1612	      cq.sec() is the queuing time of the packet at the head of the
1613	      Classic queue (zero if empty); iv) marking depends on the
1614	      instantaneous queuing time (of the other Classic queue), not a
1615	      smoothed average; v) the queue is compared with the maximum of U
1616	      random numbers (but if U=1, this is the same as the single random
1617	      number used in RED).

1619	      Specifically, in line 3a the marking probability p_L is set to the
1620	      Classic queueing time qc.sec() in seconds divided by the L4S
1621	      scaling parameter 2^S_L, which represents the queuing time (in
1622	      seconds) at which marking probability would hit 100%. Then in line
1623	      3d (if U=1) the result is compared with a uniformly distributed
1624	      random number between 0 and 1, which ensures that marking
1625	      probability will linearly increase with queueing time.  The
1626	      scaling parameter is expressed as a power of 2 so that division
1627	      can be implemented as a right bit-shift (>>) in line 3 of the
1628	      integer variant of the pseudocode (Figure 10).

1630	   Classic:  If the test at line 7 determines that there is at least one
1631	      Classic packet to dequeue, the test at line 9b determines whether
1632	      to drop it.  But before that, line 8b updates Q_C, which is an
1633	      exponentially weighted moving average (Note 5) of the queuing time
1634	      in the Classic queue, where pkt.sec() is the instantaneous
1635	      queueing time of the current Classic packet and alpha is the EWMA
1636	      constant for the classic queue.  In line 8a, alpha is represented
1637	      as an integer power of 2, so that in line 8 of the integer code
1638	      the division needed to weight the moving average can be
1639	      implemented by a right bit-shift (>> f_C).

1641	      Lines 9a and 9b implement the drop function.  In line 9a the
1642	      averaged queuing time Q_C is divided by the Classic scaling
1643	      parameter 2^S_C, in the same way that queuing time was scaled for
1644	      L4S marking.  This scaled queuing time is given the variable name
1645	      sqrt_p_C because it will be squared to compute Classic drop
1646	      probability, so before it is squared it is effectively the square
1647	      root of the drop probability.  The squaring is done by comparing
1648	      it with the maximum out of two random numbers (assuming U=1).
1649	      Comparing it with the maximum out of two is the same as the
1650	      logical `AND' of two tests, which ensures drop probability rises
1651	      with the square of queuing time (Note 6).  Again, the scaling
1652	      parameter is expressed as a power of 2 so that division can be
1653	      implemented as a right bit-shift in line 9 of the integer
1654	      pseudocode.

1656	   The marking/dropping functions in each queue (lines 3 & 9) are two
1657	   cases of a new generalization of RED called Curvy RED, motivated as
1658	   follows.  When we compared the performance of our AQM with fq_CoDel
1659	   and PIE, we came to the conclusion that their goal of holding queuing
1660	   delay to a fixed target is misguided [CRED_Insights].  As the number
1661	   of flows increases, if the AQM does not allow TCP to increase queuing
1662	   delay, it has to introduce abnormally high levels of loss.  Then loss
1663	   rather than queuing becomes the dominant cause of delay for short
1664	   flows, due to timeouts and tail losses.

1666	   Curvy RED constrains delay with a softened target that allows some
1667	   increase in delay as load increases.  This is achieved by increasing
1668	   drop probability on a convex curve relative to queue growth (the
1669	   square curve in the Classic queue, if U=1).  Like RED, the curve hugs
1670	   the zero axis while the queue is shallow.  Then, as load increases,
1671	   it introduces a growing barrier to higher delay.  But, unlike RED, it
1672	   requires only one parameter, the scaling, not three.  The diadvantage
1673	   of Curvy RED is that it is not adapted to a wide range of RTTs.
1674	   Curvy RED can be used as is when the RTT range to support is limited
1675	   otherwise an adaptation mechanism is required.

1677	   There follows a summary listing of the two parameters used for each
1678	   of the two queues:

1680	   Classic:

1682	      S_C :   The scaling factor of the dropping function scales Classic
1683	         queuing times in the range [0, 2^(S_C)] seconds into a dropping
1684	         probability in the range [0,1].  To make division efficient, it
1685	         is constrained to be an integer power of two;

1687	      f_C :  To smooth the queuing time of the Classic queue and make
1688	         multiplication efficient, we use a negative integer power of
1689	         two for the dimensionless EWMA constant, which we define as
1690	         alpha = 2^(-f_C).

1692	   L4S :

1694	      S_L (and k'):   As for the Classic queue, the scaling factor of
1695	         the L4S marking function scales Classic queueing times in the
1696	         range [0, 2^(S_L)] seconds into a probability in the range
1697	         [0,1].  Note that S_L = S_C + k', where k' is the coupling
1698	         between the queues.  So S_L and k' count as only one parameter;
1699	         k' is related to k in Equation (1) (Section 2.1) by k=2^k',
1700	         where both k and k' are constants.  Then implementations can
1701	         avoid costly division by shifting p_L by k' bits to the right.

1703	      T :  The queue size in bytes at which step threshold marking
1704	         starts in the L4S queue.

1706	   {ToDo: These are the raw parameters used within the algorithm.  A
1707	   configuration front-end could accept more meaningful parameters and
1708	   convert them into these raw parameters.}

1710	   From our experiments so far, recommended values for these parameters
1711	   are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs
1712	   typical on the public Internet.  [CRED_Insights] explains why these
1713	   parameters are applicable whatever rate link this AQM implementation
1714	   is deployed on and how the parameters would need to be adjusted for a
1715	   scenario with a different range of RTTs (e.g. a data centre) {ToDo
1716	   incorporate a summary of that report into this draft}. The setting of
1717	   k depends on policy (see Section 2.5 and Appendix C respectively for
1718	   its recommended setting and guidance on alternatives).

1720	   There is also a cUrviness parameter, U, which is a small positive
1721	   integer.  It is likely to take the same hard-coded value for all
1722	   implementations, once experiments have determined a good value.  We
1723	   have solely used U=1 in our experiments so far, but results might be
1724	   even better with U=2 or higher.

1726	   Note that the dropping function at line 9 calls maxrand(2*U), which
1727	   gives twice as much curviness as the call to maxrand(U) in the
1728	   marking function at line 3.  This is the trick that implements the
1729	   square rule in equation (1) (Section 2.1).  This is based on the fact
1730	   that, given a number X from 1 to 6, the probability that two dice
1731	   throws will both be less than X is the square of the probability that
1732	   one throw will be less than X.  So, when U=1, the L4S marking
1733	   function is linear and the Classic dropping function is squared.  If
1734	   U=2, L4S would be a square function and Classic would be quartic.
1735	   And so on.

1737	   The maxrand(u) function in lines 16-21 simply generates u random
1738	   numbers and returns the maximum (Note 7).  Typically, maxrand(u)
1739	   could be run in parallel out of band.  For instance, if U=1, the
1740	   Classic queue would require the maximum of two random numbers.  So,
1741	   instead of calling maxrand(2*U) in-band, the maximum of every pair of
1742	   values from a pseudorandom number generator could be generated out-
1743	   of-band, and held in a buffer ready for the Classic queue to consume.

1745	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1746	   2:     if ( lq.dequeue(pkt) ) {
1747	   3:        if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U)))
1748	   4:           mark(pkt)
1749	   5:        return(pkt)              % return the packet and stop here
1750	   6:     }
1751	   7:     while ( cq.dequeue(pkt) ) {
1752	   8:         Q_C += (pkt.ns() - Q_C) >> f_C           % Classic Q EWMA
1753	   9:        if ( (Q_C >> (S_C-2) ) > maxrand(2*U) )
1754	   10:          drop(pkt)                     % Squared drop, redo loop
1755	   11:       else
1756	   12:          return(pkt)           % return the packet and stop here
1757	   13:    }
1758	   14:    return(NULL)                           % no packet to dequeue
1759	   15: }

1761	   Figure 10: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM
1762	                         using Integer Arithmetic

1764	   Notes:

1766	   1.  The drain rate of the queue can vary if it is scheduled relative
1767	       to other queues, or to cater for fluctuations in a wireless
1768	       medium.  To auto-adjust to changes in drain rate, the queue must
1769	       be measured in time, not bytes or packets [CoDel].  In our Linux
1770	       implementation, it was easiest to measure queuing time at
1771	       dequeue.  Queuing time can be estimated when a packet is enqueued
1772	       by measuring the queue length in bytes and dividing by the recent
1773	       drain rate.

1775	   2.  An implementation has to use priority queueing, but it need not
1776	       implement strict priority.

1778	   3.  If packets can be enqueued while processing dequeue code, an
1779	       implementer might prefer to place the while loop around both
1780	       queues so that it goes back to test again whether any L4S packets
1781	       arrived while it was dropping a Classic packet.

1783	   4.  In order not to change too many factors at once, for now, we keep
1784	       the marking function for DCTCP-only traffic as similar as
1785	       possible to DCTCP.  However, unlike DCTCP, all processing is at
1786	       dequeue, so we determine whether to mark a packet at the head of
1787	       the queue by the byte-length of the queue _behind_ it.  We plan
1788	       to test whether using queuing time will work in all
1789	       circumstances, and if we find that the step can cause
1790	       oscillations, we will investigate replacing it with a steep
1791	       random marking curve.

1793	   5.  An EWMA is only one possible way to filter bursts; other more
1794	       adaptive smoothing methods could be valid and it might be
1795	       appropriate to decrease the EWMA faster than it increases.

1797	   6.  In practice at line 10 the Classic queue would probably test for
1798	       ECN capability on the packet to determine whether to drop or mark
1799	       the packet.  However, for brevity such detail is omitted.  All
1800	       packets classified into the L4S queue have to be ECN-capable, so
1801	       no dropping logic is necessary at line 3.  Nonetheless, L4S
1802	       packets could be dropped by overload code (see Section 4.1).

1804	   7.  In the integer variant of the pseudocode (Figure 10) real numbers
1805	       are all represented as integers scaled up by 2^32.  In lines 3 &
1806	       9 the function maxrand() is arranged to return an integer in the
1807	       range 0 <= maxrand() < 2^32.  Queuing times are also scaled up by
1808	       2^32, but in two stages: i) In lines 3 and 8 queuing times
1809	       cq.ns() and pkt.ns() are returned in integer nanoseconds, making
1810	       the values about 2^30 times larger than when the units were
1811	       seconds, ii) then in lines 3 and 9 an adjustment of -2 to the
1812	       right bit-shift multiplies the result by 2^2, to complete the
1813	       scaling by 2^32.

1815	Appendix C.  Guidance on Controlling Throughput Equivalence

1817	                     +---------------+------+-------+
1818	                     | RTT_C / RTT_L | Reno | Cubic |
1819	                     +---------------+------+-------+
1820	                     |             1 | k'=1 | k'=0  |
1821	                     |             2 | k'=2 | k'=1  |
1822	                     |             3 | k'=2 | k'=2  |
1823	                     |             4 | k'=3 | k'=2  |
1824	                     |             5 | k'=3 | k'=3  |
1825	                     +---------------+------+-------+

1827	    Table 1: Value of k' for which DCTCP throughput is roughly the same
1828	               as Reno or Cubic, for some example RTT ratios

1830	   k' is related to k in Equation (1) (Section 2.1) by k=2^k'.

1832	   To determine the appropriate policy, the operator first has to judge
1833	   whether it wants DCTCP flows to have roughly equal throughput with
1834	   Reno or with Cubic (because, even in its Reno-compatibility mode,
1835	   Cubic is about 1.4 times more aggressive than Reno).  Then the
1836	   operator needs to decide at what ratio of RTTs it wants DCTCP and
1837	   Classic flows to have roughly equal throughput.  For example choosing
1838	   k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same
1839	   as Cubic, _if their RTTs are the same_.

1841	   However, even if the base RTTs are the same, the actual RTTs are
1842	   unlikely to be the same, because Classic (Cubic or Reno) traffic
1843	   needs a large queue to avoid under-utilization and excess drop,
1844	   whereas L4S (DCTCP) does not.  The operator might still choose this
1845	   policy if it judges that DCTCP throughput should be rewarded for
1846	   keeping its own queue short.

1848	   On the other hand, the operator will choose one of the higher values
1849	   for k', if it wants to slow DCTCP down to roughly the same throughput
1850	   as Classic flows, to compensate for Classic flows slowing themselves
1851	   down by causing themselves extra queuing delay.

1853	   The values for k' in the table are derived from the formulae, which
1854	   was developed in [DCttH15]:

1856	       2^k' = 1.64 (RTT_reno / RTT_dc)                  (2)
1857	       2^k' = 1.19 (RTT_cubic / RTT_dc )                (3)

1859	   For localized traffic from a particular ISP's data centre, we used
1860	   the measured RTTs to calculate that a value of k'=3 (equivalant to
1861	   k=8) would achieve throughput equivalence, and our experiments
1862	   verified the formula very closely.

1864	   For a typical mix of RTTs from local data centres and across the
1865	   general Internet, a value of k'=1 (equivalent to k=2) is recommended
1866	   as a good workable compromise.

1868	Appendix D.  Open Issues

1870	   Most of the following open issues are also tagged '{ToDo}' at the
1871	   appropriate point in the document:

1873	      PI2 appendix: scaling of alpha & beta, esp. dependence of beta_U
1874	      on Tupdate

1876	      Curvy RED appendix: complete the unfinished parts

1878	Authors' Addresses

1880	   Koen De Schepper
1881	   Nokia Bell Labs
1882	   Antwerp
1883	   Belgium

1885	   Email: koen.de_schepper@nokia.com
1886	   URI:   https://www.bell-labs.com/usr/koen.de_schepper

1888	   Bob Briscoe (editor)
1889	   CableLabs
1890	   UK

1892	   Email: ietf@bobbriscoe.net
1893	   URI:   http://bobbriscoe.net/

1895	   Olga Bondarenko
1896	   Simula Research Lab
1897	   Lysaker
1898	   Norway

1900	   Email: olgabnd@gmail.com
1901	   URI:   https://www.simula.no/people/olgabo

1903	   Ing-jyh Tsang
1904	   Nokia
1905	   Antwerp
1906	   Belgium

1908	   Email: ing-jyh.tsang@nokia.com