idnits 2.17.1 

draft-ietf-tsvwg-aqm-dualq-coupled-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 3, 2017) is 2488 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '0' on line 1095

  -- Looks like a reference, but probably isn't: '1' on line 1095

  == Outdated reference: A later version (-07) exists of
     draft-ietf-tcpm-cubic-04

  == Outdated reference: A later version (-10) exists of
     draft-ietf-tcpm-dctcp-08

  == Outdated reference: A later version (-08) exists of
     draft-ietf-tsvwg-ecn-experimentation-00

  == Outdated reference: A later version (-29) exists of
     draft-ietf-tsvwg-ecn-l4s-id-00

  == Outdated reference: A later version (-20) exists of
     draft-ietf-tsvwg-l4s-arch-00

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)


     Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Transport Area working group (tsvwg)                      K. De Schepper
3	Internet-Draft                                           Nokia Bell Labs
4	Intended status: Experimental                            B. Briscoe, Ed.
5	Expires: January 4, 2018                                   O. Bondarenko
6	                                                     Simula Research Lab
7	                                                                I. Tsang
8	                                                         Nokia Bell Labs
9	                                                            July 3, 2017

11	  DualQ Coupled AQM for Low Latency, Low Loss and Scalable Throughput
12	                 draft-ietf-tsvwg-aqm-dualq-coupled-01

14	Abstract

16	   Data Centre TCP (DCTCP) was designed to provide predictably low
17	   queuing latency, near-zero loss, and throughput scalability using
18	   explicit congestion notification (ECN) and an extremely simple
19	   marking behaviour on switches.  However, DCTCP does not co-exist with
20	   existing TCP traffic---DCTCP is so aggressive that existing TCP
21	   algorithms approach starvation.  So, until now, DCTCP could only be
22	   deployed where a clean-slate environment could be arranged, such as
23	   in private data centres.  This specification defines `DualQ Coupled
24	   Active Queue Management (AQM)' to allow scalable congestion controls
25	   like DCTCP to safely co-exist with classic Internet traffic.  The
26	   Coupled AQM ensures that a flow runs at about the same rate whether
27	   it uses DCTCP or TCP Reno/Cubic, but without inspecting transport
28	   layer flow identifiers.  When tested in a residential broadband
29	   setting, DCTCP achieved sub-millisecond average queuing delay and
30	   zero congestion loss under a wide range of mixes of DCTCP and
31	   `Classic' broadband Internet traffic, without compromising the
32	   performance of the Classic traffic.  The solution also reduces
33	   network complexity and eliminates network configuration.

35	Status of This Memo

37	   This Internet-Draft is submitted in full conformance with the
38	   provisions of BCP 78 and BCP 79.

40	   Internet-Drafts are working documents of the Internet Engineering
41	   Task Force (IETF).  Note that other groups may also distribute
42	   working documents as Internet-Drafts.  The list of current Internet-
43	   Drafts is at http://datatracker.ietf.org/drafts/current/.

45	   Internet-Drafts are draft documents valid for a maximum of six months
46	   and may be updated, replaced, or obsoleted by other documents at any
47	   time.  It is inappropriate to use Internet-Drafts as reference
48	   material or to cite them other than as "work in progress."
49	   This Internet-Draft will expire on January 4, 2018.

51	Copyright Notice

53	   Copyright (c) 2017 IETF Trust and the persons identified as the
54	   document authors.  All rights reserved.

56	   This document is subject to BCP 78 and the IETF Trust's Legal
57	   Provisions Relating to IETF Documents
58	   (http://trustee.ietf.org/license-info) in effect on the date of
59	   publication of this document.  Please review these documents
60	   carefully, as they describe your rights and restrictions with respect
61	   to this document.  Code Components extracted from this document must
62	   include Simplified BSD License text as described in Section 4.e of
63	   the Trust Legal Provisions and are provided without warranty as
64	   described in the Simplified BSD License.

66	Table of Contents

68	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
69	     1.1.  Problem and Scope . . . . . . . . . . . . . . . . . . . .   3
70	     1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   5
71	     1.3.  Features  . . . . . . . . . . . . . . . . . . . . . . . .   5
72	   2.  DualQ Coupled AQM Algorithm . . . . . . . . . . . . . . . . .   7
73	     2.1.  Coupled AQM . . . . . . . . . . . . . . . . . . . . . . .   7
74	     2.2.  Dual Queue  . . . . . . . . . . . . . . . . . . . . . . .   8
75	     2.3.  Traffic Classification  . . . . . . . . . . . . . . . . .   8
76	     2.4.  Normative Requirements  . . . . . . . . . . . . . . . . .   8
77	   3.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
78	   4.  Security Considerations . . . . . . . . . . . . . . . . . . .  10
79	     4.1.  Overload Handling . . . . . . . . . . . . . . . . . . . .  10
80	   5.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  11
81	   6.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  11
82	     6.1.  Normative References  . . . . . . . . . . . . . . . . . .  11
83	     6.2.  Informative References  . . . . . . . . . . . . . . . . .  11
84	   Appendix A.  Example DualQ Coupled PI2 Algorithm  . . . . . . . .  14
85	     A.1.  Pass #1: Core Concepts  . . . . . . . . . . . . . . . . .  15
86	     A.2.  Pass #2: Overload Details . . . . . . . . . . . . . . . .  18
87	   Appendix B.  Example DualQ Coupled Curvy RED Algorithm  . . . . .  21
88	   Appendix C.  Guidance on Controlling Throughput Equivalence . . .  26
89	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  28

91	1.  Introduction
92	1.1.  Problem and Scope

94	   Latency is becoming the critical performance factor for many (most?)
95	   applications on the public Internet, e.g. interactive Web, Web
96	   services, voice, conversational video, interactive video, interactive
97	   remote presence, instant messaging, online gaming, remote desktop,
98	   cloud-based applications, and video-assisted remote control of
99	   machinery and industrial processes.  In the developed world, further
100	   increases in access network bit-rate offer diminishing returns,
101	   whereas latency is still a multi-faceted problem.  In the last decade
102	   or so, much has been done to reduce propagation time by placing
103	   caches or servers closer to users.  However, queuing remains a major
104	   component of latency.

106	   The Diffserv architecture provides Expedited Forwarding [RFC3246], so
107	   that low latency traffic can jump the queue of other traffic.
108	   However, on access links dedicated to individual sites (homes, small
109	   enterprises or mobile devices), often all traffic at any one time
110	   will be latency-sensitive.  Then Diffserv is of little use.  Instead,
111	   we need to remove the causes of any unnecessary delay.

113	   The bufferbloat project has shown that excessively-large buffering
114	   (`bufferbloat') has been introducing significantly more delay than
115	   the underlying propagation time.  These delays appear only
116	   intermittently--only when a capacity-seeking (e.g.  TCP) flow is long
117	   enough for the queue to fill the buffer, making every packet in other
118	   flows sharing the buffer sit through the queue.

120	   Active queue management (AQM) was originally developed to solve this
121	   problem (and others).  Unlike Diffserv, which gives low latency to
122	   some traffic at the expense of others, AQM controls latency for _all_
123	   traffic in a class.  In general, AQMs introduce an increasing level
124	   of discard from the buffer the longer the queue persists above a
125	   shallow threshold.  This gives sufficient signals to capacity-seeking
126	   (aka. greedy) flows to keep the buffer empty for its intended
127	   purpose: absorbing bursts.  However, RED [RFC2309] and other
128	   algorithms from the 1990s were sensitive to their configuration and
129	   hard to set correctly.  So, AQM was not widely deployed.

131	   More recent state-of-the-art AQMs, e.g.
132	   fq_CoDel [I-D.ietf-aqm-fq-codel], PIE [RFC8033], Adaptive
133	   RED [ARED01], are easier to configure, because they define the
134	   queuing threshold in time not bytes, so it is invariant for different
135	   link rates.  However, no matter how good the AQM, the sawtoothing
136	   rate of TCP will either cause queuing delay to vary or cause the link
137	   to be under-utilized.  Even with a perfectly tuned AQM, the
138	   additional queuing delay will be of the same order as the underlying
139	   speed-of-light delay across the network.  Flow-queuing can isolate
140	   one flow from another, but it cannot isolate a TCP flow from the
141	   delay variations it inflicts on itself, and it has other problems -
142	   it overrides the flow rate decisions of variable rate video
143	   applications, it does not recognise the flows within IPSec VPN
144	   tunnels and it is relatively expensive to implement.

146	   It seems that further changes to the network alone will now yield
147	   diminishing returns.  Data Centre TCP (DCTCP [I-D.ietf-tcpm-dctcp])
148	   teaches us that a small but radical change to TCP is needed to cut
149	   two major outstanding causes of queuing delay variability:

151	   1.  the `sawtooth' varying rate of TCP itself;

153	   2.  the smoothing delay deliberately introduced into AQMs to permit
154	       bursts without triggering losses.

156	   The former causes a flow's round trip time (RTT) to vary from about 1
157	   to 2 times the base RTT between the machines in question.  The latter
158	   delays the system's response to change by a worst-case
159	   (transcontinental) RTT, which could be hundreds of times the actual
160	   RTT of typical traffic from localized CDNs.

162	   Latency is not our only concern:

164	   3.  It was known when TCP was first developed that it would not scale
165	       to high bandwidth-delay products.

167	   Given regular broadband bit-rates over WAN distances are
168	   already [RFC3649] beyond the scaling range of `classic' TCP Reno,
169	   `less unscalable' Cubic [I-D.ietf-tcpm-cubic] and
170	   Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been
171	   successfully deployed.  However, these are now approaching their
172	   scaling limits.  Unfortunately, fully scalable TCPs such as DCTCP
173	   cause `classic' TCP to starve itself, which is why they have been
174	   confined to private data centres or research testbeds (until now).

176	   This document specifies a `DualQ Coupled AQM' extension that solves
177	   the problem of coexistence between scalable and classic flows,
178	   without having to inspect flow identifiers.  The AQM is not like
179	   flow-queuing approaches [I-D.ietf-aqm-fq-codel] that classify packets
180	   by flow identifier into numerous separate queues in order to isolate
181	   sparse flows from the higher latency in the queues assigned to
182	   heavier flow.  In contrast, the AQM exploits the behaviour of
183	   scalable congestion controls like DCTCP so that every packet in every
184	   flow sharing the queue for DCTCP-like traffic can be served with very
185	   low latency.

187	   This AQM extension can be combined with any single queue AQM that
188	   generates a statistical or deterministic mark/drop probability driven
189	   by the queue dynamics.  In many cases it simplifies the basic control
190	   algorithm, and requires little extra processing.  Therefore it is
191	   believed the Coupled AQM would be applicable and easy to deploy in
192	   all types of buffers; buffers in cost-reduced mass-market residential
193	   equipment; buffers in end-system stacks; buffers in carrier-scale
194	   equipment including remote access servers, routers, firewalls and
195	   Ethernet switches; buffers in network interface cards, buffers in
196	   virtualized network appliances, hypervisors, and so on.

198	   The overall L4S architecture is described in
199	   [I-D.ietf-tsvwg-l4s-arch].  The supporting papers [PI216] and
200	   [DCttH15] give the full rationale for the AQM's design, both
201	   discursively and in more precise mathematical form.

203	1.2.  Terminology

205	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
206	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
207	   document are to be interpreted as described in [RFC2119].  In this
208	   document, these words will appear with that interpretation only when
209	   in ALL CAPS.  Lower case uses of these words are not to be
210	   interpreted as carrying RFC-2119 significance.

212	   The DualQ Coupled AQM uses two queues for two services.  Each of the
213	   following terms identifies both the service and the queue that
214	   provides the service:

216	   Classic (denoted by subscript C):  The `Classic' service is intended
217	      for all the behaviours that currently co-exist with TCP Reno (TCP
218	      Cubic, Compound, SCTP, etc).

220	   Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L):
221	      The `L4S' service is intended for a set of congestion controls
222	      with scalable properties such as DCTCP (e.g.
223	      Relentless [Mathis09]).

225	   Either service can cope with a proportion of unresponsive or less-
226	   responsive traffic as well (e.g.  DNS, VoIP, etc), just as a single
227	   queue AQM can.  The DualQ Coupled AQM behaviour is similar to a
228	   single FIFO queue with respect to unresponsive and overload traffic.

230	1.3.  Features

232	   The AQM couples marking and/or dropping across the two queues such
233	   that a flow will get roughly the same throughput whichever it uses.
234	   Therefore both queues can feed into the full capacity of a link and
235	   no rates need to be configured for the queues.  The L4S queue enables
236	   scalable congestion controls like DCTCP to give stunningly low and
237	   predictably low latency, without compromising the performance of
238	   competing 'Classic' Internet traffic.  Thousands of tests have been
239	   conducted in a typical fixed residential broadband setting.  Typical
240	   experiments used base round trip delays up to 100ms between the data
241	   centre and home network, and large amounts of background traffic in
242	   both queues.  For every L4S packet, the AQM kept the average queuing
243	   delay below 1ms (or 2 packets if serialization delay is bigger for
244	   slow links), and no losses at all were introduced by the AQM.
245	   Details of the extensive experiments will be made available [PI216]
246	   [DCttH15].

248	   Subjective testing was also conducted using a demanding panoramic
249	   interactive video application run over a stack with DCTCP enabled and
250	   deployed on the testbed.  Each user could pan or zoom their own high
251	   definition (HD) sub-window of a larger video scene from a football
252	   match.  Even though the user was also downloading large amounts of
253	   L4S and Classic data, latency was so low that the picture appeared to
254	   stick to their finger on the touchpad (all the L4S data achieved the
255	   same ultra-low latency).  With an alternative AQM, the video
256	   noticeably lagged behind the finger gestures.

258	   Unlike Diffserv Expedited Forwarding, the L4S queue does not have to
259	   be limited to a small proportion of the link capacity in order to
260	   achieve low delay.  The L4S queue can be filled with a heavy load of
261	   capacity-seeking flows like DCTCP and still achieve low delay.  The
262	   L4S queue does not rely on the presence of other traffic in the
263	   Classic queue that can be 'overtaken'.  It gives low latency to L4S
264	   traffic whether or not there is Classic traffic, and the latency of
265	   Classic traffic does not suffer when a proportion of the traffic is
266	   L4S.  The two queues are only necessary because DCTCP-like flows
267	   cannot keep latency predictably low and keep utilization high if they
268	   are mixed with legacy TCP flows,

270	   The experiments used the Linux implementation of DCTCP that is
271	   deployed in private data centres, without any modification despite
272	   its known deficiencies.  Nonetheless, certain modifications will be
273	   necessary before DCTCP is safe to use on the Internet, which are
274	   recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id].  However, the
275	   focus of this specification is to get the network service in place.
276	   Then, without any management intervention, applications can exploit
277	   it by migrating to scalable controls like DCTCP, which can then
278	   evolve _while_ their benefits are being enjoyed by everyone on the
279	   Internet.

281	2.  DualQ Coupled AQM Algorithm

283	   There are two main aspects to the algorithm:

285	   o  the Coupled AQM that addresses throughput equivalence between
286	      Classic (e.g.  Reno, Cubic) flows and L4S (e.g.  DCTCP) flows

288	   o  the Dual Queue structure that provides latency separation for L4S
289	      flows to isolate them from the typically large Classic queue.

291	2.1.  Coupled AQM

293	   In the 1990s, the `TCP formula' was derived for the relationship
294	   between TCP's congestion window, cwnd, and its drop probability, p.
295	   To a first order approximation, cwnd of TCP Reno is inversely
296	   proportional to the square root of p.  TCP Cubic implements a Reno-
297	   compatibility mode, which is the only relevant mode for typical RTTs
298	   under 20ms, while the throughput of a single flow is less than about
299	   500Mb/s.  Therefore we can assume that Cubic traffic behaves similar
300	   to Reno (but with a slightly different constant of proportionality),
301	   and we shall use the term 'Classic' for the collection of Reno and
302	   Cubic in Reno mode.

304	   In our supporting paper [PI216], we derive the equivalent rate
305	   equation for DCTCP, for which cwnd is inversely proportional to p
306	   (not the square root), where in this case p is the ECN marking
307	   probability.  DCTCP is not the only congestion control that behaves
308	   like this, so we use the term 'L4S' traffic for all similar
309	   behaviour.

311	   In order to make a DCTCP flow run at roughly the same rate as a Reno
312	   TCP flow (all other factors being equal), we make the drop or marking
313	   probability for Classic traffic, p_C distinct from the marking
314	   probability for L4S traffic, p_L (in contrast to RFC3168 which
315	   requires them to be the same).  We make the Classic drop probability
316	   p_C proportional to the square of the L4S marking probability p_L.
317	   This is because we need to make the Reno flow rate equal the DCTCP
318	   flow rate, so we have to square the square root of p_C in the Reno
319	   rate equation to make it the same as the straight p_L in the DCTCP
320	   rate equation.

322	   There is a really simple way to implement the square of a probability
323	   - by testing the queue against two random numbers not one.  This is
324	   the approach adopted in Appendix A and Appendix B.

326	   Stating this as a formula, the relation between Classic drop
327	   probability, p_C, and L4S marking probability, p_L needs to take the
328	   form:

330	       p_C = ( p_L / k )^2                  (1)

332	   where k is the constant of proportionality.

334	2.2.  Dual Queue

336	   Classic traffic builds a large queue, so a separate queue is provided
337	   for L4S traffic, and it is scheduled with strict priority.
338	   Nonetheless, coupled marking ensures that giving priority to L4S
339	   traffic still leaves the right amount of spare scheduling time for
340	   Classic flows to each get equivalent throughput to DCTCP flows (all
341	   other factors such as RTT being equal).  The algorithm achieves this
342	   without having to inspect flow identifiers.

344	2.3.  Traffic Classification

346	   Both the Coupled AQM and DualQ mechanisms need an identifier to
347	   distinguish L4S and C packets.  A separate draft
348	   [I-D.ietf-tsvwg-ecn-l4s-id] recommends using the ECT(1) codepoint of
349	   the ECN field as this identifier, having assessed various
350	   alternatives.

352	   Given L4S work is currently on the experimental track, but the
353	   definition of the ECN field is on the standards track [RFC3168],
354	   another standards track document has proved necessary to make the
355	   ECT(1) codepoint available for experimentation
356	   [I-D.ietf-tsvwg-ecn-experimentation].

358	2.4.  Normative Requirements

360	   In the Dual Queue, L4S packets MUST be given priority over Classic,
361	   although strict priority MAY not be appropriate.

363	   All L4S traffic MUST be ECN-capable, although some Classic traffic
364	   MAY also be ECN-capable.

366	   Whatever identifier is used for L4S traffic, it will still be
367	   necessary to agree on the meaning of an ECN marking on L4S traffic,
368	   relative to a drop of Classic traffic.  In order to prevent
369	   starvation of Classic traffic by scalable L4S traffic (e.g.  DCTCP)
370	   the drop probability of Classic traffic MUST be proportional to the
371	   square of the marking probability of L4S traffic, In other words, the
372	   power to which p_L is raised in Eqn. (1) MUST be 2.

374	   The constant of proportionality, k, in Eqn (1) determines the
375	   relative flow rates of Classic and L4S flows when the AQM concerned
376	   is the bottleneck (all other factors being equal). k does not have to
377	   be standardized because differences do not prevent interoperability.

379	   However, k has to take some value, and each operator can make that
380	   choice.

382	   A value of k=2 is currently RECOMMENDED as the default for Internet
383	   access networks.  Assuming scalable congestion controls for the
384	   Internet will be as aggressive as DCTCP, this will ensure their
385	   congestion window will be roughly the same as that of a standards
386	   track TCP congestion control (Reno) [RFC5681] and other so-called
387	   TCP-friendly controls such as TCP Cubic in its TCP-friendly mode.

389	   The requirements for scalable congestion controls on the Internet
390	   (termed the TCP Prague requirements) [I-D.ietf-tsvwg-ecn-l4s-id] are
391	   not necessarily final.  If the aggressiveness of DCTCP is not defined
392	   as the benchmark for scalable controls on the Internet, the
393	   recommended value of k will also be subject to change.

395	   Whatever value is recommended, the choice of k is a matter of
396	   operator policy, and operators MAY choose a different value using
397	   Table 1 and the guidelines in Appendix C.

399	   Typically, access network operators isolate customers from each other
400	   with some form of layer-2 multiplexing (TDM in DOCSIS, CDMA in 3G) or
401	   L3 scheduling (WRR in broadband), rather than relying on TCP to share
402	   capacity between customers [RFC0970].  In such cases, the choice of k
403	   will solely affect relative flow rates within each customer's access
404	   capacity, not between customers.  Also, k will not affect relative
405	   flow rates at any times when all flows are Classic or all L4S, and it
406	   will not affect small flows.

408	   Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are
409	   given in Appendix A and Appendix B.  Either example AQM can be used
410	   to couple packet marking and dropping across a dual Q.  Curvy RED
411	   requires less operations per packet than RED and can be used if the
412	   range of RTTs is limited.  DualPI2 is a simplification of PIE with
413	   stable Proportional-Integral control for both Classic and L4S
414	   congestion controls.  Nonetheless, it would be possible to control
415	   the queues with other alternative AQMs, as long as the above
416	   normative requirements (those expressed in capitals) are observed,
417	   which are intended to be independent of the specific AQM.

419	   {ToDo: Add management and monitoring requirements}

421	3.  IANA Considerations

423	   This specification contains no IANA considerations.

425	4.  Security Considerations

427	4.1.  Overload Handling

429	   Where the interests of users or flows might conflict, it could be
430	   necessary to police traffic to isolate any harm to performance.  This
431	   is a policy issue that needs to be separable from a basic AQM, but an
432	   AQM does need to handle overload.  A trade-off needs to be made
433	   between complexity and the risk of either class harming the other.
434	   It is an operator policy to define what must happen if the service
435	   time of the classic queue becomes too great.  In the following
436	   subsections three optional non-exclusive overload protections are
437	   defined.  Their objective is for the overload behaviour of the DualQ
438	   AQM to be similar to a single queue AQM.  The example implementation
439	   in Appendix A implements the 'delay on overload' policy.  Other
440	   overload protections can be envisaged:

442	   Minimum throughput service:   By replacing the priority scheduler
443	      with a weighted round robin scheduler, a minimum throughput
444	      service can be guaranteed for Classic traffic.  Typically the
445	      scheduling weight of the Classic queue will be small (e.g. 5%) to
446	      avoid interference with the coupling but big enough to avoid
447	      complete starvation of Classic traffic.

449	   Delay on overload:  To control milder overload of responsive traffic,
450	      particularly when close to the maximum congestion signal, delay
451	      can be used as an alternative congestion control mechanism.  The
452	      Dual Queue Coupled AQM can be made to behave like a single First-
453	      In First-Out (FIFO) queue with different service times by
454	      replacing the priority scheduler with a very simple scheduler that
455	      could be called a "time-shifted FIFO", which is the same as the
456	      Modifier Earliest Deadline First (MEDF) scheduler of [MEDF].  The
457	      scheduler adds T_m to the queue delay of the next L4S packet,
458	      before comparing it with the queue delay of the next Classic
459	      packet, then it selects the packet with the greater adjusted queue
460	      delay.  Under regular conditions, this time-shifted FIFO scheduler
461	      behaves just like a strict priority scheduler.  But under moderate
462	      or high overload it prevents starvation of the Classic queue,
463	      because the time-shift defines the maximum extra queuing delay
464	      (T_m) of Classic packets relative to L4S.

466	   Drop on overload:  On severe overload, e.g. due to non responsive
467	      traffic, queues will typically overflow and packet drop will be
468	      unavoidable.  It is important to avoid unresponsive ECN traffic
469	      (either Classic or L4S) driving the AQM to 100% drop and mark
470	      probability.  Congestion controls that have a minimum congestion
471	      window will become unresponsive to ECN marking when the marking
472	      probability is high.  This situation can be avoided by applying
473	      the drop probability to all packets of all traffic types when it
474	      exceeds a certain threshold or by limiting the drop and marking
475	      probabilities to a lower maximum value (up to where fairnes
476	      between the different traffic types is still guaranteed) and rely
477	      on delay to control temporary high congestion and eventually queue
478	      overflow.  If the classic drop probability is applied to all types
479	      of traffic when it is higher than a threshold probability the
480	      queueing delay can be controlled up to any overload situation, and
481	      no further measures are required.  If a maximum classic and
482	      coupled L4S probability of less than 100% is used, both queues
483	      need scheduling opportunities and should eventually experience
484	      drop.  This can be achieved with a scheduler that guarantees a
485	      minimum throughput for each queue, such as a weighted round robin
486	      or time-shifted FIFO scheduler.  In that case a common queue limit
487	      can be configured that will drop packets of both types of traffic.

489	   To keep the throughput of both L4S and Classic flows equal over the
490	   full load range, a different control strategy needs to be defined
491	   above the point where one congestion control first saturates to a
492	   probability of 100% (if k>1, L4S will saturate first).  Possible
493	   strategies include: also dropping L4S; increasing the queueing delay
494	   for both; or ensuring that L4S traffic still responds to marking
495	   below a window of 2 segments (see [I-D.ietf-tsvwg-ecn-l4s-id]).

497	5.  Acknowledgements

499	   Thanks to Anil Agarwal for detailed review comments and suggestions
500	   on how to make our explanation clearer.

502	   The authors' contributions are part-funded by the European Community
503	   under its Seventh Framework Programme through the Reducing Internet
504	   Transport Latency (RITE) project (ICT-317700).  The views expressed
505	   here are solely those of the authors.

507	6.  References

509	6.1.  Normative References

511	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
512	              Requirement Levels", BCP 14, RFC 2119,
513	              DOI 10.17487/RFC2119, March 1997,
514	              <http://www.rfc-editor.org/info/rfc2119>.

516	6.2.  Informative References

518	   [ARED01]   Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An
519	              Algorithm for Increasing the Robustness of RED's Active
520	              Queue Management", ACIRI Technical Report , August 2001,
521	              <http://www.icir.org/floyd/red.html>.

523	   [CoDel]    Nichols, K. and V. Jacobson, "Controlling Queue Delay",
524	              ACM Queue 10(5), May 2012,
525	              <http://queue.acm.org/issuedetail.cfm?issue=2208917>.

527	   [CRED_Insights]
528	              Briscoe, B., "Insights from Curvy RED (Random Early
529	              Detection)", BT Technical Report TR-TUB8-2015-003, July
530	              2015,
531	              <http://www.bobbriscoe.net/projects/latency/credi_tr.pdf>.

533	   [DCttH15]  De Schepper, K., Bondarenko, O., Briscoe, B., and I.
534	              Tsang, "`Data Centre to the Home': Ultra-Low Latency for
535	              All", 2015, <http://www.bobbriscoe.net/projects/latency/
536	              dctth_preprint.pdf>.

538	              (Under submission)

540	   [I-D.ietf-aqm-fq-codel]
541	              Hoeiland-Joergensen, T., McKenney, P.,
542	              dave.taht@gmail.com, d., Gettys, J., and E. Dumazet, "The
543	              FlowQueue-CoDel Packet Scheduler and Active Queue
544	              Management Algorithm", draft-ietf-aqm-fq-codel-06 (work in
545	              progress), March 2016.

547	   [I-D.ietf-tcpm-cubic]
548	              Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
549	              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
550	              draft-ietf-tcpm-cubic-04 (work in progress), February
551	              2017.

553	   [I-D.ietf-tcpm-dctcp]
554	              Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
555	              and G. Judd, "Datacenter TCP (DCTCP): TCP Congestion
556	              Control for Datacenters", draft-ietf-tcpm-dctcp-08 (work
557	              in progress), June 2017.

559	   [I-D.ietf-tsvwg-ecn-experimentation]
560	              Black, D., "Explicit Congestion Notification (ECN)
561	              Experimentation", draft-ietf-tsvwg-ecn-experimentation-00
562	              (work in progress), November 2016.

564	   [I-D.ietf-tsvwg-ecn-l4s-id]
565	              Schepper, K., Briscoe, B., and I. Tsang, "Identifying
566	              Modified Explicit Congestion Notification (ECN) Semantics
567	              for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s-
568	              id-00 (work in progress), November 2016.

570	   [I-D.ietf-tsvwg-l4s-arch]
571	              Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency,
572	              Low Loss, Scalable Throughput (L4S) Internet Service:
573	              Architecture", draft-ietf-tsvwg-l4s-arch-00 (work in
574	              progress), November 2016.

576	   [I-D.sridharan-tcpm-ctcp]
577	              Sridharan, M., Tan, K., Bansal, D., and D. Thaler,
578	              "Compound TCP: A New TCP Congestion Control for High-Speed
579	              and Long Distance Networks", draft-sridharan-tcpm-ctcp-02
580	              (work in progress), November 2008.

582	   [Mathis09]
583	              Mathis, M., "Relentless Congestion Control", PFLDNeT'09 ,
584	              May 2009, <http://www.hpcc.jp/pfldnet2009/
585	              Program_files/1569198525.pdf>.

587	   [MEDF]     Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a
588	              simple scheduling algorithm for two real-time transport
589	              service classes with application in the UTRAN", Proc. IEEE
590	              Conference on Computer Communications (INFOCOM'03) Vol.2
591	              pp.1116-1122, March 2003.

593	   [PI216]    De Schepper, K., Bondarenko, O., Briscoe, B., and I.
594	              Tsang, "PI2: A Linearized AQM for both Classic and
595	              Scalable TCP", ACM CoNEXT'16 , December 2016,
596	              <https://riteproject.files.wordpress.com/2015/10/
597	              pi2_conext.pdf>.

599	              (To appear)

601	   [RFC0970]  Nagle, J., "On Packet Switches With Infinite Storage",
602	              RFC 970, DOI 10.17487/RFC0970, December 1985,
603	              <http://www.rfc-editor.org/info/rfc970>.

605	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
606	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
607	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
608	              S., Wroclawski, J., and L. Zhang, "Recommendations on
609	              Queue Management and Congestion Avoidance in the
610	              Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998,
611	              <http://www.rfc-editor.org/info/rfc2309>.

613	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
614	              of Explicit Congestion Notification (ECN) to IP",
615	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
616	              <http://www.rfc-editor.org/info/rfc3168>.

618	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
619	              J., Courtney, W., Davari, S., Firoiu, V., and D.
620	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
621	              Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002,
622	              <http://www.rfc-editor.org/info/rfc3246>.

624	   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
625	              RFC 3649, DOI 10.17487/RFC3649, December 2003,
626	              <http://www.rfc-editor.org/info/rfc3649>.

628	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
629	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
630	              <http://www.rfc-editor.org/info/rfc5681>.

632	   [RFC7567]  Baker, F., Ed. and G. Fairhurst, Ed., "IETF
633	              Recommendations Regarding Active Queue Management",
634	              BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015,
635	              <http://www.rfc-editor.org/info/rfc7567>.

637	   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
638	              "Proportional Integral Controller Enhanced (PIE): A
639	              Lightweight Control Scheme to Address the Bufferbloat
640	              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
641	              <http://www.rfc-editor.org/info/rfc8033>.

643	Appendix A.  Example DualQ Coupled PI2 Algorithm

645	   As a first concrete example, the pseudocode below gives the DualPI2
646	   algorithm, which is a DualQ Coupled AQM algorithm based on the PI2
647	   algorithm [PI216] for the Classic AQM.  Pi2 is an improved variant of
648	   the PIE AQM [RFC8033].

650	   We will introduce the pseudocode in two passes.  The first pass
651	   explains the core concepts, deferring handling of overload to the
652	   second pass.  The first pass also uses regular arithmetic, whereas
653	   some integer arithmetic more suitable for kernel operations appears
654	   in the second pass.  To aid comparison, line numbers are kept in step
655	   between the two passes by using letter suffixes where the longer code
656	   needs extra lines.

658	   A full open source implementation for Linux is available at:
659	   https://github.com/olgabo/dualpi2.

661	A.1.  Pass #1: Core Concepts

663	   The pseudocode manipulates three main structures of variables: the
664	   packet (pkt), the L4S queue (lq) and the Classic queue (cq).  The
665	   pseudocode consists of the following four functions:

667	   o  initialization code (Figure 1) that sets parameter defaults (the
668	      API for setting non-default values is omitted for brevity)

670	   o  enqueue code (Figure 2)

672	   o  dequeue code (Figure 3)

674	   o  code to regularly update the base probability (p) used in the
675	      dequeue code (Figure 4).

677	   The base probability (p) is an internal variable from which the
678	   marking and dropping probabilities for L4S and Classic traffic (p_L
679	   and p_C) are derived.  Specifically, p_L = p * k and p_C = p^2, from
680	   Equation (1) (see lines 4a and 4b in Figure 4).

682	   In our experiments so far (building on experiments with PIE) on
683	   broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs
684	   from 5 ms to 100 ms, DualPI2 achieves good results with the default
685	   parameters in Figure 1.

687	   1:  dualpi2_params_init(...) {               % Set parameter defaults
688	   2:    target = 15 ms              % PI AQM Classic queue delay target
689	   3:    tshift = 2 * target                       % Scheduler time bias
690	   4:    Tupdate = 16 ms            % PI Classic queue sampling interval
691	   5:    alpha = 10 Hz^2                              % PI integral gain
692	   6:    beta = 100 Hz^2                          % PI proportional gain
693	   7:    alpha_U = alpha *Tupdate % PI integral gain per update interval
694	   8:    beta_U = beta * Tupdate  % PI prop'nal gain per update interval
695	   9:    T_time = 1 ms                   % L4S marking threshold in time
696	   10:   T_len = 2 * MTU            % Min L4S marking threshold in bytes
697	   11:   k = 2                                         % Coupling factor
698	   12:   limit = MAX_LINK_RATE * 250 ms               % Dual buffer size
699	   13:   p_Cmax = 1/4                       % Max Classic drop/mark prob
700	   14:   p_Lmax = min(k*sqrt(p_Cmax), 1)          % Max L4S marking prob
701	   15: }

703	       Figure 1: Example Header Pseudocode for DualQ Coupled PI2 AQM

705	   1:  dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq
706	   2:    stamp(pkt)                      % attach arrival time to packet
707	   3:    if ( lq.len() + cq.len() > limit )
708	   4:      drop(pkt)                     % drop packet if buffer is full
709	   5:    else {                                      % Packet classifier
710	   6:      if ( ecn(pkt) modulo 2 == 1 )       % ECN bits = ECT(1) or CE
711	   7:        lq.enqueue(pkt)
712	   8:      else                           % ECN bits = not-ECT or ECT(0)
713	   9:        cq.enqueue(pkt)
714	   10:   }
715	   11: }

717	      Figure 2: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM

719	   1:  dualpi2_dequeue(lq, cq, pkt) {     % Couples L4S & Classic queues
720	   2:    while ( lq.len() + cq.len() > 0 )
721	   3:      if ( lq.time() + tshift >= cq.time() ) {  % time-shifted FIFO
722	   4:        lq.dequeue(pkt)
723	   5:        if ( ((pkt.time() > T_time)              % step marking ...
724	   6:              AND (lq.len > T_len))
725	   7:            OR (p_L > rand()) )              % ...or linear marking
726	   8:          mark(pkt)
727	   9:      } else {
728	   10:       cq.dequeue(pkt)
729	   11:       if ( p_C > rand() ) {                     % probability p^2
730	   12:         if ( ecn(pkt) == 0 ) {           % if ECN field = not-ECT
731	   13:           drop(pkt)                                % squared drop
732	   14:           continue        % continue to the top of the while loop
733	   15:         }
734	   16:         mark(pkt)                                  % squared mark
735	   17:       }
736	   18:     }
737	   19:     return(pkt)                      % return the packet and stop
738	   20:   }
739	   21:   return(NULL)                             % no packet to dequeue
740	   22: }

742	      Figure 3: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM

744	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
745	   2:    curq = cq.time()  % use queuing time of first-in Classic packet
746	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
747	   4a:   p_L = p * k            % L4S prob = base prob * coupling factor
748	   4b:   p_C = p^2                        % Classic prob = (base prob)^2
749	   5:    prevq = curq
750	   6:  }

752	     Figure 4: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM

754	   When packets arrive, first a common queue limit is checked as shown
755	   in line 3 of the enqueuing pseudocode in Figure 2.  Note that the
756	   limit is deliberately tested before enqueue to avoid any bias against
757	   larger packets (so the actual buffer has to be one packet larger than
758	   limit).  If limit is not exceeded, the packet will be classified and
759	   enqueued to the Classic or L4S queue dependent on the least
760	   significant bit of the ECN field in the IP header (line 6).  Packets
761	   with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be
762	   enqueued in the Classic queue.  Otherwise, ECT(1) and CE packets will
763	   be enqueued in the L4S queue.  Optional packet classification
764	   flexibility is omitted for brevity.

766	   The dequeue pseudocode schedules one packet for dequeuing (or zero if
767	   the queue is empty).  It also makes all the AQM decisions on dropping
768	   and marking.  It is contained within a large while loop so that if it
769	   decides to drop a packet, it will continue until it selects a packet
770	   to schedule.  Line 3 of the dequeue pseudocode implements time-
771	   shifted FIFO scheduling.  It takes the packet that waited the
772	   longest, biased against the Classic traffic by a time-shift of
773	   tshift.

775	   o  If an L4S packet is scheduled, lines 5 to 8 mark the packet if
776	      either the L4S threshold is exceeded, or if a random marking
777	      decision is drawn according to k times the probability p
778	      (maintained by the dualpi2_update() function discussed below).
779	      The L4S threshold is usually in units of time (default T_time = 1
780	      ms).  However, on slow links the packet serialization time can
781	      approach the threshold T_time, so line 6 sets a floor of 2 MTU to
782	      the threshold.

784	   o  If a Classic packet is scheduled, lines 10 to 17 drop or mark the
785	      packet based on the squared probability p_C = p^2 (hence the name
786	      PI2 for Classic traffic).

788	   The probability p is kept up to date by the core PI algorithm in
789	   Figure 4 which is executed every Tupdate ([RFC8033] now recommends
790	   16ms).  The algorithm centres on line 3, which is a classical
791	   Proportional-Integral (PI) controller that alters p dependent on a)
792	   the error between the current queuing delay (curq) and the target
793	   queuing delay (target) as defined in [RFC8033] and b) the change in
794	   queuing delay since the last sample.  The name 'PI' represents the
795	   fact that the second factor is proportional to load while the first
796	   is the integral of the load (so it removes any standing queue).  In
797	   corner cases, p can overflow the range [0,1] so the resulting value
798	   of p has to be bounded (omitted from the pseudocode).

800	   Alpha_U and beta_U are gain factors chosen based on stability
801	   analysis to convert changes in queueing delay into changes in
802	   probability.  They are therefore in units of 'per second of delay' or
803	   Hz.  The suffix '_U' represents 'per update time' (Tupdate).  If a
804	   briefer update time is configured, alpha_U and beta_U also have to be
805	   reduced to ensure that the same response is given over time, so that
806	   a smaller Tupdate will only result in a response with finer steps,
807	   not a more aggressive response.  Therefore, alpha_U and beta_U are
808	   derived from alpha and beta, which each represent a 'gain factor per
809	   second of update time', so they can be configured independently of
810	   the update time (see lines 7 and 8 of Figure 1).

812	   Unlike in PIE, alpha_U and beta_U do not need to be tuned every
813	   Tupdate dependent on p.  Instead, in PI2 alpha_U and beta_U can be
814	   constants because the squaring applied to Classic traffic tunes them
815	   inherently, as explained in [PI216].

817	   Note that p solely depends on the queuing time in the Classic queue.
818	   In line 2, the current queuing delay is evaluated by inspecting the
819	   timestamp of the next packet to schedule in the Classic queue.  The
820	   function cq.time() subtracts the time stamped at enqueue from the
821	   current time and implicitly takes the current queuing delay as 0 if
822	   the queue is empty.

824	   Because the L4S marking probability (p_L) is factored up by k, the
825	   dynamic gain parameters alpha and beta are also inherently factored
826	   up by k for the L4S queue, which is necessary to ensure that Classic
827	   TCP and DCTCP controls have the same stability.  So, if alpha is 10
828	   Hz^2, the effective gain factor for the L4S queue is k*alpha, which
829	   is 20 Hz^2 with the default coupling factor of k=2.

831	A.2.  Pass #2: Overload Details

833	   Figure 5 repeats the dequeue function of Figure 3, but with overload
834	   details added.  Similarly Figure 6 repeats the core PI algorithm of
835	   Figure 4 with overload details added.  The initialization and enqueue
836	   functions are unchanged.

838	   In line 13 of the initialization function (Figure 1), the default
839	   maximum Classic drop probability p_Cmax = 1/4 or 25%. This is the
840	   point at which it is deemed that the Classic queue has become
841	   persistently overloaded, so it switches to using solely drop, even
842	   for ECN-capable packets.  This protects the queue against any
843	   unresponsive traffic that falsely claims that it is responsive to ECN
844	   marking, as required by [RFC3168] and [RFC7567].

846	   Line 14 translates this into a maximum L4S marking probability
847	   (p_Lmax) by rearranging Equation (1).  With the default coupling
848	   factor of k=2, this translates to a maximum L4S marking probability
849	   of 1 or 100%. This is intended to ensure that the L4S queue starts to
850	   introduce dropping once marking saturates and can rise no further.
851	   The 'TCP Prague' requirements [I-D.ietf-tsvwg-ecn-l4s-id] require
852	   that, when an L4S congestion control detects a drop, it falls back to
853	   a response that coexists with 'Classic' TCP.  So it is correct that
854	   the L4S queue drops packets proportional to p^2, as if they are
855	   Classic packets.

857	   Both these switch-overs are triggered by the tests for overload
858	   introduced in lines 4b and 12b of the dequeue function (Figure 5).
859	   Lines 8c to 8g drop L4S packets with probability p^2 by comparing p
860	   against two random numbers.  Lines 8h to 8i mark the remaining
861	   packets with probability p_L.

863	   Lines 2c to 2d in the core PI algorithm (Figure 6) deal with overload
864	   of the L4S queue when there is no Classic traffic.  This is
865	   necessary, because the core PI algorithm maintains the right
866	   probability of drop to regulate overload, but it depends on the
867	   length of the Classic queue.  If there is no Classic queue the naive
868	   algorithm in Figure 4 drops nothing, even if the L4S queue is
869	   overloaded - so tail drop takes over (lines 3 and 4 of Figure 2).

871	   If the test at line 2a finds that the Classic queue is empty, line 2d
872	   measures the current queue delay using the L4S queue not the Classic
873	   queue, While the L4S queue is not overloaded, its delay will always
874	   be tiny compared to the target Classic queue delay.  So p_L will be
875	   driven to zero, and the L4S queue will naturally be governed solely
876	   by threshold marking (lines 5 and 6 of the dequeue algorithm in
877	   Figure 5).  But, if unresponsive L4S source(s) cause overload, the
878	   DualQ transitions smoothly to L4S marking based on the PI algorithm.
879	   And as overload increases, it naturally transitions from marking to
880	   dropping by the switch-over mechanism already described.

882	   1:  dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq
883	   2:    while ( lq.len() + cq.len() > 0 )
884	   3:      if ( lq.time() + tshift >= cq.time() ) {  % time-shifted FIFO
885	   4a:       lq.dequeue(pkt)
886	   4b:       if ( p_L < p_Lmax ) {       % Check for overload saturation
887	   5:          if ( ((pkt.time() > T_time)            % step marking ...
888	   6:                AND (lq.len > T_len))
889	   7:              OR (p_L > rand()) )            % ...or linear marking
890	   8a:            mark(pkt)
891	   8b:       } else {                              % overload saturation
892	   8c:         if ( p > max(rand(), rand()) ) {        % probability p^2
893	   8e:           drop(pkt)      % revert to Classic drop due to overload
894	   8f:           continue        % continue to the top of the while loop
895	   8g:         }
896	   8h:         if ( p_L > rand() )             % probability p_L = k * p
897	   8i:           mark(pkt)         % linear marking of remaining packets
898	   8j:       }
899	   9:      } else {
900	   10:       cq.dequeue(pkt)
901	   11:       if ( p > max(rand(), rand()) ) {          % probability p^2
902	   12a:        if ( (ecn(pkt) == 0)                % ECN field = not-ECT
903	   12b:             OR (p_L >= p_Lmax) ) {       % Overload disables ECN
904	   13:           drop(pkt)                     % squared drop, redo loop
905	   14:           continue        % continue to the top of the while loop
906	   15:         }
907	   16:         mark(pkt)                                  % squared mark
908	   17:       }
909	   18:     }
910	   19:     return(pkt)                      % return the packet and stop
911	   20:   }
912	   21:   return(NULL)                             % no packet to dequeue
913	   22: }

915	      Figure 5: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM
916	             (Including Integer Arithmetic and Overload Code)

918	   1:  dualpi2_update(lq, cq, target) {         % Update p every Tupdate
919	   2a:   if ( cq.len() > 0 )
920	   2b:     curq = cq.time() %use queuing time of first-in Classic packet
921	   2c:   else                                      % Classic queue empty
922	   2d:     curq = lq.time()    % use queuing time of first-in L4S packet
923	   3:    p = p + alpha_U * (curq - target) + beta_U * (curq - prevq)
924	   4:    p_L = p * k            % L4S prob = base prob * coupling factor
925	   5:    prevq = curq
926	   6:  }

928	     Figure 6: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM
929	                         (Including Overload Code)

931	Appendix B.  Example DualQ Coupled Curvy RED Algorithm

933	   As another example of a DualQ Coupled AQM algorithm, the pseudocode
934	   below gives the Curvy RED based algorithm we used and tested.
935	   Although we designed the AQM to be efficient in integer arithmetic,
936	   to aid understanding it is first given using real-number arithmetic.
937	   Then, one possible optimization for integer arithmetic is given, also
938	   in pseudocode.  To aid comparison, the line numbers are kept in step
939	   between the two by using letter suffixes where the longer code needs
940	   extra lines.

942	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
943	   2:    if ( lq.dequeue(pkt) ) {
944	   3a:     p_L = cq.sec() / 2^S_L
945	   3b:     if ( lq.byt() > T )
946	   3c:       mark(pkt)
947	   3d:     elif ( p_L > maxrand(U) )
948	   4:        mark(pkt)
949	   5:      return(pkt)                % return the packet and stop here
950	   6:    }
951	   7:    while ( cq.dequeue(pkt) ) {
952	   8a:     alpha = 2^(-f_C)
953	   8b:     Q_C = alpha * pkt.sec() + (1-alpha)* Q_C    % Classic Q EWMA
954	   9a:     sqrt_p_C = Q_C / 2^S_C
955	   9b:     if ( sqrt_p_C > maxrand(2*U) )
956	   10:       drop(pkt)                        % Squared drop, redo loop
957	   11:     else
958	   12:       return(pkt)              % return the packet and stop here
959	   13:   }
960	   14:   return(NULL)                           % no packet to dequeue
961	   15: }

963	   16: maxrand(u) {                % return the max of u random numbers
964	   17:     maxr=0
965	   18:     while (u-- > 0)
966	   19:         maxr = max(maxr, rand())               % 0 <= rand() < 1
967	   20:     return(maxr)
968	   21: }

970	   Figure 7: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM

972	   Packet classification code is not shown, as it is no different from
973	   Figure 2.  Classic ECN handling is not shown.  Potential
974	   classification schemes are discussed in Section 2.  The Curvy RED
975	   algorithm has not been maintained to the same degree as the DualPI2
976	   algorithm.  Some ideas used in DualPI2 could be translated into Curvy
977	   RED, such as i) the time-shifted FIFO scheduler ii) the time-based
978	   L4S threshold; iii) overload protection.  {ToDo}.

980	   At the outer level, the structure of dualq_dequeue() implements
981	   strict priority scheduling.  The code is written assuming the AQM is
982	   applied on dequeue (Note 1) . Every time dualq_dequeue() is called,
983	   the if-block in lines 2-6 determines whether there is an L4S packet
984	   to dequeue by calling lq.dequeue(pkt), and otherwise the while-block
985	   in lines 7-13 determines whether there is a Classic packet to
986	   dequeue, by calling cq.dequeue(pkt).  (Note 2)

988	   In the lower priority Classic queue, a while loop is used so that, if
989	   the AQM determines that a classic packet should be dropped, it
990	   continues to test for classic packets deciding whether to drop each
991	   until it actually forwards one.  Thus, every call to dualq_dequeue()
992	   returns one packet if at least one is present in either queue,
993	   otherwise it returns NULL at line 14.  (Note 3)

995	   Within each queue, the decision whether to drop or mark is taken as
996	   follows (to simplify the explanation, it is assumed that U=1):

998	   L4S:  If the test at line 2 determines there is an L4S packet to
999	      dequeue, the tests at lines 3a and 3c determine whether to mark
1000	      it.  The first is a simple test of whether the L4S queue (lq.byt()
1001	      in bytes) is greater than a step threshold T in bytes (Note 4).
1002	      The second test is similar to the random ECN marking in RED, but
1003	      with the following differences: i) the marking function does not
1004	      start with a plateau of zero marking until a minimum threshold,
1005	      rather the marking probability starts to increase as soon as the
1006	      queue is positive; ii) marking depends on queuing time, not bytes,
1007	      in order to scale for any link rate without being reconfigured;
1008	      iii) marking of the L4S queue does not depend on itself, it
1009	      depends on the queuing time of the _other_ (Classic) queue, where
1010	      cq.sec() is the queuing time of the packet at the head of the
1011	      Classic queue (zero if empty); iv) marking depends on the
1012	      instantaneous queuing time (of the other Classic queue), not a
1013	      smoothed average; v) the queue is compared with the maximum of U
1014	      random numbers (but if U=1, this is the same as the single random
1015	      number used in RED).

1017	      Specifically, in line 3a the marking probability p_L is set to the
1018	      Classic queueing time qc.sec() in seconds divided by the L4S
1019	      scaling parameter 2^S_L, which represents the queuing time (in
1020	      seconds) at which marking probability would hit 100%. Then in line
1021	      3d (if U=1) the result is compared with a uniformly distributed
1022	      random number between 0 and 1, which ensures that marking
1023	      probability will linearly increase with queueing time.  The
1024	      scaling parameter is expressed as a power of 2 so that division
1025	      can be implemented as a right bit-shift (>>) in line 3 of the
1026	      integer variant of the pseudocode (Figure 8).

1028	   Classic:  If the test at line 7 determines that there is at least one
1029	      Classic packet to dequeue, the test at line 9b determines whether
1030	      to drop it.  But before that, line 8b updates Q_C, which is an
1031	      exponentially weighted moving average (Note 5) of the queuing time
1032	      in the Classic queue, where pkt.sec() is the instantaneous
1033	      queueing time of the current Classic packet and alpha is the EWMA
1034	      constant for the classic queue.  In line 8a, alpha is represented
1035	      as an integer power of 2, so that in line 8 of the integer code
1036	      the division needed to weight the moving average can be
1037	      implemented by a right bit-shift (>> f_C).

1039	      Lines 9a and 9b implement the drop function.  In line 9a the
1040	      averaged queuing time Q_C is divided by the Classic scaling
1041	      parameter 2^S_C, in the same way that queuing time was scaled for
1042	      L4S marking.  This scaled queuing time is given the variable name
1043	      sqrt_p_C because it will be squared to compute Classic drop
1044	      probability, so before it is squared it is effectively the square
1045	      root of the drop probability.  The squaring is done by comparing
1046	      it with the maximum out of two random numbers (assuming U=1).
1047	      Comparing it with the maximum out of two is the same as the
1048	      logical `AND' of two tests, which ensures drop probability rises
1049	      with the square of queuing time (Note 6).  Again, the scaling
1050	      parameter is expressed as a power of 2 so that division can be
1051	      implemented as a right bit-shift in line 9 of the integer
1052	      pseudocode.

1054	   The marking/dropping functions in each queue (lines 3 & 9) are two
1055	   cases of a new generalization of RED called Curvy RED, motivated as
1056	   follows.  When we compared the performance of our AQM with fq_CoDel
1057	   and PIE, we came to the conclusion that their goal of holding queuing
1058	   delay to a fixed target is misguided [CRED_Insights].  As the number
1059	   of flows increases, if the AQM does not allow TCP to increase queuing
1060	   delay, it has to introduce abnormally high levels of loss.  Then loss
1061	   rather than queuing becomes the dominant cause of delay for short
1062	   flows, due to timeouts and tail losses.

1064	   Curvy RED constrains delay with a softened target that allows some
1065	   increase in delay as load increases.  This is achieved by increasing
1066	   drop probability on a convex curve relative to queue growth (the
1067	   square curve in the Classic queue, if U=1).  Like RED, the curve hugs
1068	   the zero axis while the queue is shallow.  Then, as load increases,
1069	   it introduces a growing barrier to higher delay.  But, unlike RED, it
1070	   requires only one parameter, the scaling, not three.  The diadvantage
1071	   of Curvy RED is that it is not adapted to a wide range of RTTs.
1072	   Curvy RED can be used as is when the RTT range to support is limited
1073	   otherwise an adaptation mechanism is required.

1075	   There follows a summary listing of the two parameters used for each
1076	   of the two queues:

1078	   Classic:

1080	      S_C :   The scaling factor of the dropping function scales Classic
1081	         queuing times in the range [0, 2^(S_C)] seconds into a dropping
1082	         probability in the range [0,1].  To make division efficient, it
1083	         is constrained to be an integer power of two;

1085	      f_C :  To smooth the queuing time of the Classic queue and make
1086	         multiplication efficient, we use a negative integer power of
1087	         two for the dimensionless EWMA constant, which we define as
1088	         alpha = 2^(-f_C).

1090	   L4S :

1092	      S_L (and k'):   As for the Classic queue, the scaling factor of
1093	         the L4S marking function scales Classic queueing times in the
1094	         range [0, 2^(S_L)] seconds into a probability in the range
1095	         [0,1].  Note that S_L = S_C + k', where k' is the coupling
1096	         between the queues.  So S_L and k' count as only one parameter;
1097	         k' is related to k in Equation (1) (Section 2.1) by k=2^k',
1098	         where both k and k' are constants.  Then implementations can
1099	         avoid costly division by shifting p_L by k' bits to the right.

1101	      T :  The queue size in bytes at which step threshold marking
1102	         starts in the L4S queue.

1104	   {ToDo: These are the raw parameters used within the algorithm.  A
1105	   configuration front-end could accept more meaningful parameters and
1106	   convert them into these raw parameters.}

1108	   From our experiments so far, recommended values for these parameters
1109	   are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs
1110	   typical on the public Internet.  [CRED_Insights] explains why these
1111	   parameters are applicable whatever rate link this AQM implementation
1112	   is deployed on and how the parameters would need to be adjusted for a
1113	   scenario with a different range of RTTs (e.g. a data centre) {ToDo
1114	   incorporate a summary of that report into this draft}. The setting of
1115	   k depends on policy (see Section 2.4 and Appendix C respectively for
1116	   its recommended setting and guidance on alternatives).

1118	   There is also a cUrviness parameter, U, which is a small positive
1119	   integer.  It is likely to take the same hard-coded value for all
1120	   implementations, once experiments have determined a good value.  We
1121	   have solely used U=1 in our experiments so far, but results might be
1122	   even better with U=2 or higher.

1124	   Note that the dropping function at line 9 calls maxrand(2*U), which
1125	   gives twice as much curviness as the call to maxrand(U) in the
1126	   marking function at line 3.  This is the trick that implements the
1127	   square rule in equation (1) (Section 2.1).  This is based on the fact
1128	   that, given a number X from 1 to 6, the probability that two dice
1129	   throws will both be less than X is the square of the probability that
1130	   one throw will be less than X.  So, when U=1, the L4S marking
1131	   function is linear and the Classic dropping function is squared.  If
1132	   U=2, L4S would be a square function and Classic would be quartic.
1133	   And so on.

1135	   The maxrand(u) function in lines 16-21 simply generates u random
1136	   numbers and returns the maximum (Note 7).  Typically, maxrand(u)
1137	   could be run in parallel out of band.  For instance, if U=1, the
1138	   Classic queue would require the maximum of two random numbers.  So,
1139	   instead of calling maxrand(2*U) in-band, the maximum of every pair of
1140	   values from a pseudorandom number generator could be generated out-
1141	   of-band, and held in a buffer ready for the Classic queue to consume.

1143	   1:  dualq_dequeue(lq, cq) {  % Couples L4S & Classic queues, lq & cq
1144	   2:     if ( lq.dequeue(pkt) ) {
1145	   3:        if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U)))
1146	   4:           mark(pkt)
1147	   5:        return(pkt)              % return the packet and stop here
1148	   6:     }
1149	   7:     while ( cq.dequeue(pkt) ) {
1150	   8:         Q_C += (pkt.ns() - Q_C) >> f_C           % Classic Q EWMA
1151	   9:        if ( (Q_C >> (S_C-2) ) > maxrand(2*U) )
1152	   10:          drop(pkt)                     % Squared drop, redo loop
1153	   11:       else
1154	   12:          return(pkt)           % return the packet and stop here
1155	   13:    }
1156	   14:    return(NULL)                           % no packet to dequeue
1157	   15: }

1159	   Figure 8: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM
1160	                         using Integer Arithmetic

1162	   Notes:

1164	   1.  The drain rate of the queue can vary if it is scheduled relative
1165	       to other queues, or to cater for fluctuations in a wireless
1166	       medium.  To auto-adjust to changes in drain rate, the queue must
1167	       be measured in time, not bytes or packets [CoDel].  In our Linux
1168	       implementation, it was easiest to measure queuing time at
1169	       dequeue.  Queuing time can be estimated when a packet is enqueued
1170	       by measuring the queue length in bytes and dividing by the recent
1171	       drain rate.

1173	   2.  An implementation has to use priority queueing, but it need not
1174	       implement strict priority.

1176	   3.  If packets can be enqueued while processing dequeue code, an
1177	       implementer might prefer to place the while loop around both
1178	       queues so that it goes back to test again whether any L4S packets
1179	       arrived while it was dropping a Classic packet.

1181	   4.  In order not to change too many factors at once, for now, we keep
1182	       the marking function for DCTCP-only traffic as similar as
1183	       possible to DCTCP.  However, unlike DCTCP, all processing is at
1184	       dequeue, so we determine whether to mark a packet at the head of
1185	       the queue by the byte-length of the queue _behind_ it.  We plan
1186	       to test whether using queuing time will work in all
1187	       circumstances, and if we find that the step can cause
1188	       oscillations, we will investigate replacing it with a steep
1189	       random marking curve.

1191	   5.  An EWMA is only one possible way to filter bursts; other more
1192	       adaptive smoothing methods could be valid and it might be
1193	       appropriate to decrease the EWMA faster than it increases.

1195	   6.  In practice at line 10 the Classic queue would probably test for
1196	       ECN capability on the packet to determine whether to drop or mark
1197	       the packet.  However, for brevity such detail is omitted.  All
1198	       packets classified into the L4S queue have to be ECN-capable, so
1199	       no dropping logic is necessary at line 3.  Nonetheless, L4S
1200	       packets could be dropped by overload code (see Section 4.1).

1202	   7.  In the integer variant of the pseudocode (Figure 8) real numbers
1203	       are all represented as integers scaled up by 2^32.  In lines 3 &
1204	       9 the function maxrand() is arranged to return an integer in the
1205	       range 0 <= maxrand() < 2^32.  Queuing times are also scaled up by
1206	       2^32, but in two stages: i) In lines 3 and 8 queuing times
1207	       cq.ns() and pkt.ns() are returned in integer nanoseconds, making
1208	       the values about 2^30 times larger than when the units were
1209	       seconds, ii) then in lines 3 and 9 an adjustment of -2 to the
1210	       right bit-shift multiplies the result by 2^2, to complete the
1211	       scaling by 2^32.

1213	Appendix C.  Guidance on Controlling Throughput Equivalence
1214	                     +---------------+------+-------+
1215	                     | RTT_C / RTT_L | Reno | Cubic |
1216	                     +---------------+------+-------+
1217	                     |             1 | k'=1 | k'=0  |
1218	                     |             2 | k'=2 | k'=1  |
1219	                     |             3 | k'=2 | k'=2  |
1220	                     |             4 | k'=3 | k'=2  |
1221	                     |             5 | k'=3 | k'=3  |
1222	                     +---------------+------+-------+

1224	    Table 1: Value of k' for which DCTCP throughput is roughly the same
1225	               as Reno or Cubic, for some example RTT ratios

1227	   k' is related to k in Equation (1) (Section 2.1) by k=2^k'.

1229	   To determine the appropriate policy, the operator first has to judge
1230	   whether it wants DCTCP flows to have roughly equal throughput with
1231	   Reno or with Cubic (because, even in its Reno-compatibility mode,
1232	   Cubic is about 1.4 times more aggressive than Reno).  Then the
1233	   operator needs to decide at what ratio of RTTs it wants DCTCP and
1234	   Classic flows to have roughly equal throughput.  For example choosing
1235	   k'=0 (equivalent to k=1) will make DCTCP throughput roughly the same
1236	   as Cubic, _if their RTTs are the same_.

1238	   However, even if the base RTTs are the same, the actual RTTs are
1239	   unlikely to be the same, because Classic (Cubic or Reno) traffic
1240	   needs a large queue to avoid under-utilization and excess drop,
1241	   whereas L4S (DCTCP) does not.  The operator might still choose this
1242	   policy if it judges that DCTCP throughput should be rewarded for
1243	   keeping its own queue short.

1245	   On the other hand, the operator will choose one of the higher values
1246	   for k', if it wants to slow DCTCP down to roughly the same throughput
1247	   as Classic flows, to compensate for Classic flows slowing themselves
1248	   down by causing themselves extra queuing delay.

1250	   The values for k' in the table are derived from the formulae, which
1251	   was developed in [DCttH15]:

1253	       2^k' = 1.64 (RTT_reno / RTT_dc)                  (2)
1254	       2^k' = 1.19 (RTT_cubic / RTT_dc )                (3)

1256	   For localized traffic from a particular ISP's data centre, we used
1257	   the measured RTTs to calculate that a value of k'=3 (equivalant to
1258	   k=8) would achieve throughput equivalence, and our experiments
1259	   verified the formula very closely.

1261	   For a typical mix of RTTs from local data centres and across the
1262	   general Internet, a value of k'=1 (equivalent to k=2) is recommended
1263	   as a good workable compromise.

1265	Authors' Addresses

1267	   Koen De Schepper
1268	   Nokia Bell Labs
1269	   Antwerp
1270	   Belgium

1272	   Email: koen.de_schepper@nokia.com
1273	   URI:   https://www.bell-labs.com/usr/koen.de_schepper

1275	   Bob Briscoe (editor)
1276	   Simula Research Lab

1278	   Email: ietf@bobbriscoe.net
1279	   URI:   http://bobbriscoe.net/

1281	   Olga Bondarenko
1282	   Simula Research Lab
1283	   Lysaker
1284	   Norway

1286	   Email: olgabnd@gmail.com
1287	   URI:   https://www.simula.no/people/olgabo

1289	   Ing-jyh Tsang
1290	   Nokia Bell Labs
1291	   Antwerp
1292	   Belgium

1294	   Email: ing-jyh.tsang@nokia.com