

2	Internet Congestion Control Research Group (ICCRG)        K. De Schepper
3	Internet-Draft                                                O. Tilmans
4	Intended status: Experimental                            Nokia Bell Labs
5	Expires: September 10, 2021                              B. Briscoe, Ed.
6	                                                             Independent
7	                                                           March 9, 2021

9	                       Prague Congestion Control
10	            draft-briscoe-iccrg-prague-congestion-control-00

12	Abstract

14	   This specification defines the Prague congestion control scheme,
15	   which is derived from DCTCP and adapted for Internet traffic by
16	   implementing the Prague L4S requirements.  Over paths with L4S
17	   support at the bottleneck, it adapts the DCTCP mechanisms to achieve
18	   consistently low latency and full throughput.  It is defined
19	   independently of any particular transport protocol or operating
20	   system, but notes are added that highlight issues specific to certain
21	   transports and OSs.  It is mainly based on the current default
22	   options of the reference Linux implementation of TCP Prague, but it
23	   includes experience from other implementations where available.  It
24	   separately describes non-default and optional parts, as well as
25	   future plans.

27	   The implementation does not satisfy all the Prague requirements (yet)
28	   and the IETF might decide that certain requirements need to be
29	   relaxed as an outcome of the process of trying to satisfy them all.
30	   In two cases, research code is replaced by placeholders until full
31	   evaluation is complete.

33	Status of This Memo

35	   This Internet-Draft is submitted in full conformance with the
36	   provisions of BCP 78 and BCP 79.

38	   Internet-Drafts are working documents of the Internet Engineering
39	   Task Force (IETF).  Note that other groups may also distribute
40	   working documents as Internet-Drafts.  The list of current Internet-
41	   Drafts is at https://datatracker.ietf.org/drafts/current/.

43	   Internet-Drafts are draft documents valid for a maximum of six months
44	   and may be updated, replaced, or obsoleted by other documents at any
45	   time.  It is inappropriate to use Internet-Drafts as reference
46	   material or to cite them other than as "work in progress."

48	   This Internet-Draft will expire on September 10, 2021.

50	Copyright Notice

52	   Copyright (c) 2021 IETF Trust and the persons identified as the
53	   document authors.  All rights reserved.

55	   This document is subject to BCP 78 and the IETF Trust's Legal
56	   Provisions Relating to IETF Documents
57	   (https://trustee.ietf.org/license-info) in effect on the date of
58	   publication of this document.  Please review these documents
59	   carefully, as they describe your rights and restrictions with respect
60	   to this document.  Code Components extracted from this document must
61	   include Simplified BSD License text as described in Section 4.e of
62	   the Trust Legal Provisions and are provided without warranty as
63	   described in the Simplified BSD License.

65	Table of Contents

67	   1.  Introduction
68	     1.1.  Motivation: Low Queuing Delay /and/ Full Throughput
69	     1.2.  Document Purpose
70	     1.3.  Maturity Status (To be Removed Before Publication)
71	     1.4.  Terminology
72	   2.  Prague Congestion Control
73	     2.1.  The Prague L4S Requirements
74	     2.2.  Packet Identification
75	     2.3.  Detecting and Measuring Congestion
76	       2.3.1.  Accurate ECN Feedback
77	         2.3.1.1.  Accurate ECN Feedback with TCP & Derivatives
78	         2.3.1.2.  Accurate ECN Feedback with Other Modern Transports
80	       2.3.2.  Moving Average of ECN Feedback
81	       2.3.3.  Scaling Loss Detection with Flow Rate
82	     2.4.  Congestion Response Algorithm
83	       2.4.1.  Fall-Back on Loss
84	       2.4.2.  Multiplicative Decrease on ECN Feedback
85	       2.4.3.  Additive Increase and ECN Feedback
86	       2.4.4.  Reduced RTT-Dependence
87	       2.4.5.  Flow Start or Restart
88	     2.5.  Packet Sending
89	       2.5.1.  Packet Pacing
90	       2.5.2.  Segmentation Offload
91	   3.  Variants and Future Work
92	     3.1.  Getting up to Speed Faster
93	       3.1.1.  Flow Start (or Restart)
94	       3.1.2.  Faster than Additive Increase
95	       3.1.3.  Remove Lag in Congestion Response
96	     3.2.  Combining Congestion Metrics
97	       3.2.1.  ECN with Loss
98	       3.2.2.  ECN with Delay
99	     3.3.  Fall-Back on Classic ECN
100	     3.4.  Further Reduced RTT-Dependence
101	     3.5.  Scaling Down to Fractional Windows
102	   4.  IANA Considerations
103	   5.  Security Considerations
104	   6.  Acknowledgements
105	   7.  Comments and Contributions Solicited (To be removed before
106	       Publication)
107	   8.  Contributors
108	   9.  References
109	     9.1.  Normative References
110	     9.2.  Informative References
111	   Authors' Addresses

113	1.  Introduction

115	   This document defines the Prague congestion control.  It is defined
116	   independent of any particular transport protocol or operating system,
117	   but notes are added that highlight issues specific to certain
118	   transports and OSs.  The authors are most familiar with the reference
119	   implementation of Prague on Linux over TCP.  So that forms the basis
120	   of the large majority of platform-specific notes.  Nonetheless,
121	   wherever possible, experience from implementers on other platforms is
122	   included, and the intention is to gather more into this document
123	   during the drafting process.

125	   The Prague CC is intended to maintain consistently low queuing delay
126	   over network paths that offer L4S support at the bottleneck.  Where
127	   the bottleneck does not support L4S, the Prague CC is intended to
128	   fall back to behaving like a conventional 'Classic' congestion
129	   control.  L4S stands for Low Latency, Low Loss Scalable throughput.
130	   L4S support in the network involves Active Queue Management (AQM)
131	   with a very shallow target queueing delay (of the order of a
132	   millisecond) that applies immediate Explicit Congestion Notification
133	   (ECN).  'Immediate ECN' means that the network applies ECN marking
134	   based on the instantaneous queue, without any smoothing or filtering.
135	   The Prague CC takes on the job of smoothing and filtering the
136	   congestion signals from the network.

138	   The Prague CC is a particular instance of a scalable congestion
139	   control, which is defined in Section 1.4.  Scalable congestion
140	   control is the part of the L4S architecture that does the actual work
141	   of maintaining low queuing delay and ensuring that the delay and
142	   throughput properties scale with flow rate.

144	   The L4S architecture [I-D.ietf-tsvwg-l4s-arch] places the host
145	   congestion control in the context of the other parts of the system,
147	   in particular the different types of L4S AQM in the network and the
148	   codepoints in the IP-ECN field that convey to the network that the
149	   host supports the L4S form of ECN.  The architecture document also
150	   covers other issues such as: incremental deployment; protection of
151	   low latency queues against accidental or malicious disruption; and
152	   the relationship of L4S to other low latency technologies.  The
153	   specification of the L4S ECN Protocol [I-D.ietf-tsvwg-ecn-l4s-id]
154	   sets down the requirements that the Prague CC has to follow (called
155	   the Prague L4S Requirements - see Section 2.1 for a summary).

157	   Links to implementations of the Prague CC and other scalable
158	   congestion controls (all open source) can be found via the L4S
159	   landing page [L4S-home], which also links to numerous other L4S-
160	   related resources.  A (slightly dated) paper on the specific
161	   implementation of the Prague CC in Linux over TCP is also available
162	   [PragueLinux].

164	1.1.  Motivation: Low Queuing Delay /and/ Full Throughput

166	   The Prague CC is capable of keeping queuing delay consistently low
167	   while fully utilizing available capacity.  In contrast, Classic
168	   congestion controls need to induce a reasonably large queue
169	   (approaching a bandwidth-delay product) in order to fully utilize
170	   capacity.  Therefore, prior to scalable CCs like DCTCP and Prague, it
171	   was believed that very low delay was only possible by limiting
172	   throughput and isolating the low delay traffic from capacity-seeking
173	   traffic.

175	   The Prague CC uses additive increase multiplicative decrease (AIMD),
176	   in which it increases its window until an ECN mark (or loss) is
177	   detected, then yields in a continual sawtooth pattern.  The key to
178	   keeping queuing delay low without under-utilizing capacity is to keep
179	   the sawteeth tiny.  For example the average duration of a Prague CC
180	   sawtooth is of the order of a round trip, whereas a classic
181	   congestion control sawtooths over hundreds of round trips.  For
182	   instance, over an RTT of 36 ms, at 100 Mb/s Cubic takes about 106
183	   round trips to recover, and at 800 Mb/s its recovery time triples to
184	   over 340 round trips, or still more than 12 seconds (Reno would take
185	   57 seconds).
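
The Reno figure quoted above can be approximated with a back-of-envelope
calculation.  The sketch below is illustrative only: it assumes 1500-byte
packets and an idealized Reno sawtooth that halves the window on each
congestion event and then adds one packet per round trip, with the average
window equal to the bandwidth-delay product.  The function name and these
assumptions are not taken from this document.

```python
def reno_recovery_time(rate_bps, rtt_s, packet_bytes=1500):
    """Return (round trips, seconds) for an idealized Reno sawtooth to recover.

    Assumptions (not from the draft): fixed packet size, the window ramps
    linearly from W_max/2 to W_max, and its average equals the
    bandwidth-delay product (BDP).
    """
    bdp_packets = rate_bps * rtt_s / (packet_bytes * 8)
    # The average of a linear ramp from W_max/2 to W_max is 3/4 * W_max.
    w_max = bdp_packets * 4 / 3
    # One packet is added per round trip, from W_max/2 back up to W_max.
    recovery_rtts = w_max / 2
    return recovery_rtts, recovery_rtts * rtt_s


rtts, seconds = reno_recovery_time(800e6, 0.036)
print(f"Reno at 800 Mb/s, 36 ms RTT: {rtts:.0f} round trips, {seconds:.1f} s")
```

Under these assumptions the recovery takes about 1600 round trips, which
is between 57 and 58 seconds at a 36 ms RTT.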

187	   Keeping the sawtooth amplitude down keeps queue variation down and
188	   utilization up.  Keeping the duration of the sawteeth down ensures
189	   control remains tight.  The definition of a scalable CC is that the
190	   duration between congestion marks does not increase as flow rate
191	   scales, all other factors being equal.  This is important, because it
192	   means that the sawteeth will always stay tiny.  So queue delay will
193	   remain very low, and control will remain very tight.

195	   The tip of each sawtooth occurs when the bottleneck link emits a
196	   congestion signal.  Therefore such small sawteeth are only feasible
197	   when ECN is used for the congestion signals.  If loss were used, the
198	   loss level would be prohibitively high.  This is why L4S-ECN has to
199	   depart from the requirement of Classic ECN [RFC3168] that an ECN mark
200	   is equivalent to a loss.  Otherwise, the response to the high
201	   level of ECN marking would have to be as great as the response to an
202	   equivalent level of loss.

204	   The Prague CC is derived from Data Center TCP (DCTCP [RFC8257]).
205	   DCTCP is confined to controlled environments like data centres
206	   precisely because it uses such small sawteeth, which induce such a
207	   high level of congestion marking.  For a CC using Classic ECN, this
208	   would be interpreted as equivalent to the same, very high, loss
209	   level.  The Classic CC would then continually drive its own rate down
210	   in the face of such an apparently high level of congestion.

212	   This is why coexistence with existing traffic is important for the
213	   Prague CC.  It has to be able to detect whether it is sharing the
214	   bottleneck with Classic traffic, and if so fall back to behaving in a
215	   Classic way.  If the bottleneck does not support ECN at all, that is
216	   easy - the Prague CC just responds in the Classic way to loss (see
217	   Section 2.4.1).  But if it is sharing the bottleneck with Classic ECN
218	   traffic, this is more difficult to detect (see Section 3.3).  Because
219	   the Prague CC removes most of the queue, it also addresses RTT-
220	   dependence.  Otherwise, at low base RTTs, it would become far more
221	   RTT-dependent than Classic CCs.

223	1.2.  Document Purpose

225	   There is no 'One True Prague CC'.  L4S is intended to enable
226	   development of any scalable CC that meets the L4S Prague requirements
227	   [I-D.ietf-tsvwg-ecn-l4s-id].  This document attempts to describe a
228	   reference implementation and attempts to generalize it to different
229	   transports and OS platforms.  The implementation does not satisfy all
230	   the Prague requirements (yet), and the IETF might decide that certain
231	   requirements need to be relaxed as an outcome of the process of
232	   trying to satisfy them all.

234	1.3.  Maturity Status (To be Removed Before Publication)

236	   The field of congestion control is always a work in progress.
237	   However, there are areas of the Prague CC that are still just
238	   placeholders while separate research code is evaluated.  And in other
239	   implementations of the Prague CC, other areas are incomplete.  In the
240	   Linux reference implementation of TCP Prague, interim code is used in
241	   the incomplete areas, which are:

243	   o  Flow start and restart (standard slow start is used, even though
244	      it often exits early in L4S environments where ECN marking tends to
245	      be frequent);

247	   o  Faster than additive increase (standard additive increase is used,
248	      which makes the flow particularly sluggish if it has dropped out
249	      of slow start early).

251	   The body of this document describes the Prague CC as implemented.
252	   Any non-default options or any planned improvements are separated out
253	   into Section 3 on "Variants and Future Work".  As each of the above
254	   areas is addressed, it will be removed from this section and its
255	   description in the body of the document will be updated.  Once all
256	   areas are complete, this section will be removed.  Prague CC will
257	   then still be a work in progress, but only on a similar footing as
258	   all other congestion controls.

260	1.4.  Terminology

262	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
263	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
264	   document are to be interpreted as described in [RFC2119] when, and
265	   only when, they appear in all capitals, as shown here.

267	   Definitions of terms:

269	   Classic Congestion Control:  A congestion control behaviour that can
270	      co-exist with standard TCP Reno [RFC5681] without causing
271	      significantly negative impact on its flow rate [RFC5033].  With
272	      Classic congestion controls, as flow rate scales, the number of
273	      round trips between congestion signals (losses or ECN marks) rises
274	      with the flow rate.  So it takes longer and longer to recover
275	      after each congestion event.  Therefore control of queuing and
276	      utilization becomes very slack, and the slightest disturbance
277	      prevents a high rate from being attained [RFC3649].

279	   Scalable Congestion Control:  A congestion control where the average
280	      time from one congestion signal to the next (the recovery time)
281	      remains invariant as the flow rate scales, all other factors being
282	      equal.  This maintains the same degree of control over queueing
283	      and utilization whatever the flow rate, as well as ensuring that
284	      high throughput is robust to disturbances.  For instance, DCTCP
285	      averages 2 congestion signals per round-trip whatever the flow
286	      rate.  For the public Internet a Scalable transport has to comply
287	      with the requirements in Section 4 of [I-D.ietf-tsvwg-ecn-l4s-id]
288	      (a.k.a. the 'Prague L4S requirements').

290	   Response function:  The relationship between the window (cwnd) of a
291	      congestion control and the congestion signalling probability, p,
292	      in steady state.  A general response function has the form cwnd =
293	      K/p^B, where K and B are constants.  In an approximation of the
294	      response function of the standard Reno CC, B=1/2.  For a scalable
295	      congestion control B=1, so its response function takes the form
296	      cwnd = K/p.  The number of congestion signals per round is p*cwnd,
297	      which equates to the constant, K, for a scalable CC.  Hence the
298	      definition of a scalable CC above.
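
The following sketch evaluates the general response function for the
two cases named above, to show why p*cwnd is constant only when B=1.
The value K = sqrt(3/2) used for Reno is the commonly quoted
approximation and K = 2 for the scalable case follows the DCTCP example
above; both constants and the function names are assumptions made for
illustration.

```python
import math


def cwnd_from_p(p, K, B):
    """Steady-state window for signalling probability p: cwnd = K / p**B."""
    return K / p ** B


def signals_per_round(p, K, B):
    """Expected number of congestion signals per round trip: p * cwnd."""
    return p * cwnd_from_p(p, K, B)


RENO_K = math.sqrt(3 / 2)   # assumed constant for the Reno approximation (B = 1/2)
SCALABLE_K = 2.0            # assumed: 2 signals per round, as quoted for DCTCP (B = 1)

for p in (1e-2, 1e-4, 1e-6):
    reno = signals_per_round(p, RENO_K, 0.5)
    scalable = signals_per_round(p, SCALABLE_K, 1.0)
    print(f"p={p:g}: Reno {reno:.5f} signals/round, "
          f"scalable {scalable:.1f} signals/round")
```

As p falls (that is, as the window grows), the Reno figure shrinks in
proportion to sqrt(p), so signals become ever rarer, whereas the
scalable figure stays at K.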

300	   Reno-friendly:  The subset of Classic traffic that excludes
301	      unresponsive traffic and excludes experimental congestion controls
302	      intended to coexist with Reno but without always being strictly
303	      friendly to it (as allowed by [RFC5033]).  Reno-friendly is used
304	      in place of 'TCP-friendly', given that the TCP protocol is used
305	      with many different congestion control behaviours.

307	   Classic ECN:  The original Explicit Congestion Notification (ECN)
308	      protocol [RFC3168], which requires ECN signals to be treated the
309	      same as drops, both when generated in the network and when
310	      responded to by the sender.

312	      The names used for the four codepoints of the 2-bit IP-ECN field
313	      are as defined in [RFC3168]: Not ECT, ECT(0), ECT(1) and CE, where
314	      ECT stands for ECN-Capable Transport and CE stands for Congestion
315	      Experienced.

317	      A packet marked with the CE codepoint is termed 'ECN-marked' or
318	      sometimes just 'marked' where the context makes ECN obvious.

320	   CC:  Congestion Control

322	   ACK:  an ACKnowledgement, or to ACKnowledge

324	   EWMA:  Exponentially Weighted Moving Average

326	   RTT:  Round Trip Time

328	   Definitions of Parameters and Variables:

330	   MTU_BITS:  Maximum transmission unit [b]

332	   cwnd:  Congestion window [B]

334	   ssthresh:  Slow start threshold [B]

336	   inflight:  The amount of data that the sender has sent but not yet
337	      received ACKs for [B]

339	   p:  Steady-state probability of drop or marking []

341	   alpha:  EWMA of the ECN marking fraction []

343	   acked_sacked:  the amount of new data acknowledged by an ACK [B]

345	   ece_delta:  the amount of newly acknowledged data that was ECN-marked
346	      [B]

348	   ai_per_rtt:  additive increase to apply per RTT [B]

350	   srtt:  Smoothed round trip time [s]

352	   MAX_BURST_DELAY:  Maximum allowed bottleneck queuing delay due to
353	      segmentation offload bursts [s] (default 0.25 ms for the public
354	      Internet)

356	2.  Prague Congestion Control

358	2.1.  The Prague L4S Requirements

360	   The beneficial properties of L4S traffic (low queuing delay, etc.)
361	   depend on all L4S sources satisfying a set of conditions called the
362	   Prague L4S Requirements.  The name is after an ad hoc meeting of
363	   about thirty people co-located with the IETF in Prague in July 2015,
364	   the day after the first public demonstration of L4S.

366	   The meeting agreed a list of modifications to DCTCP [RFC8257] to
367	   focus activity on a variant that would be safe to use over the public
368	   Internet.  It was suggested that this could be called TCP Prague to
369	   distinguish it from DCTCP.  This list was adopted by the IETF, and
370	   has continued to evolve (see section 4 of
371	   [I-D.ietf-tsvwg-ecn-l4s-id]).  The requirements are no longer TCP-
372	   specific, applying irrespective of wire-protocol (TCP, QUIC, RTP,
373	   SCTP, etc).

375	   This unusual start to the life of the project led to the unusual
376	   development process of a reference implementation that had to resolve
377	   a number of ambitious requirements, already known to be in tension
378	   [Tensions17].

380	   DCTCP already implements a scalable congestion control.  So most of
381	   the changes to make it usable over the Internet seemed trivial, some
382	   'merely' involving adoption of other parallel developments like
383	   Accurate ECN TCP feedback or RACK.  Others have been more challenging
384	   (e.g. RTT-independence).  And others that seemed trivial became
385	   challenging given the complex set of bugs and behaviours that
386	   characterize today's Internet and the Linux stack.

388	   The more critical implementation challenges are highlighted in the
389	   following sections, in the hope we can prevent mistakes being
390	   repeated (see for instance Section 2.3.2, Section 2.4.2).  There was
391	   also a set of five intertwined 'bugs' - all masking each other, but
392	   causing unpredictable or poor performance as different code
393	   modifications unmasked them.  A draft write-up about these has been
394	   prepared, which is longer than the whole of the present document, so
395	   it will be included by reference once published.

397	   During the development process, we have unearthed fundamental aspects
398	   of the implementation and indeed the design of DCTCP and Prague that
399	   have still not caught up with the paradigm shift from existence-based
400	   to extent-based congestion response.  Some have been implemented by
401	   default, e.g. not suppressing additive increase for a round trip
402	   after a congestion event (Section 2.4.3).  Others have been
403	   implemented but not fully evaluated, e.g. removing the 1-2
404	   unnecessary round trips of lag in feedback processing (Section 3.1.3),
405	   and yet others are still future plans, e.g. further RTT-independence
406	   (Section 3.4) and exploiting combined congestion metrics in more
407	   cases (Section 3.2).

409	   The requirements are categorized into those that would impact other
410	   flows if not handled properly and performance optimizations that are
411	   important but optional from the IETF's point of view, because they
412	   only affect the flow itself.  The list below maps the order of the
413	   requirements in [I-D.ietf-tsvwg-ecn-l4s-id] to the order in this
414	   document (which is by functional categories and code status):

416	   Mandatory or Advisory Requirements:

418	      *  L4S-ECN packet identification: use of ECT(1) (Section 2.2)

420	      *  Accurate ECN feedback (Section 2.3.1)

422	      *  Reno-friendly response to a loss (Section 2.4.1)

424	      *  Detection of a classic ECN AQM (Section 3.3)

426	      *  Reduced RTT dependence (Section 2.4.4)

428	      *  Scaling down to a fractional window (no longer mandatory, see
429	         Section 3.5)

431	      *  Detecting loss in units of time (Section 2.3.3)

433	      *  Minimizing bursts (Section 2.5.1)

435	   Optional performance optimizations:

437	      *  ECN-capable control packets (Section 2.2)

439	      *  Faster flow start (Section 3.1.1)

441	      *  Faster than additive increase (Section 3.1.2)

443	      *  Segmentation offload (Section 2.5.2)

445	2.2.  Packet Identification

447	   On the public Internet, a sender using the Prague CC MUST set the
448	   ECT(1) codepoint on all the packets it sends, in order to identify
449	   itself as an L4S-capable congestion control (Req 4.1
450	   [I-D.ietf-tsvwg-ecn-l4s-id]).

452	   This applies whatever the transport protocol, whether TCP, QUIC, RTP,
453	   etc.  In the case of TCP, unlike an RFC 3168 TCP ECN transport, a
454	   sender can set all packets as ECN-capable, including TCP control
455	   packets and retransmissions [RFC8311],
456	   [I-D.ietf-tcpm-generalized-ecn].

458	   The Prague CC SHOULD optionally be configurable to use the ECT(0)
459	   codepoint in private networks, such as data centres, which might be
460	   necessary for backward compatibility with DCTCP deployments where
461	   ECT(1) might already have another usage.

463	   Implementation note:

465	   TCP Prague in Linux kernel:  The kernel was updated to allow the
466	      ECT(1) flag to be set from within a CC module.  The Prague CC then
467	      has full control over the ECN code point it uses at any one time.
468	      In this way it enforces the use of ECT(1) (or optionally ECT(0))
469	      and non-ECT when required.
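
As an illustration of the codepoint choice described in this section,
the sketch below selects the IP-ECN codepoint for outgoing packets.  The
codepoint values are those of [RFC3168]; the function names, the
'private_network_ect0' option and the choice of Not-ECT whenever the
scalable behaviour is not in use are hypothetical simplifications, not
the interface of any particular implementation.

```python
# IP-ECN codepoints (2-bit field) as defined in RFC 3168.
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def select_ecn_codepoint(scalable_mode, private_network_ect0=False):
    """Pick the codepoint a sender sets on the packets it sends.

    scalable_mode: True while the sender behaves as a Prague CC.
    private_network_ect0: hypothetical knob for private networks
    (e.g. data centres) where ECT(1) already has another use.
    """
    if not scalable_mode:
        # Simplification: this sketch does not model Classic ECN.
        return NOT_ECT
    return ECT_0 if private_network_ect0 else ECT_1


def apply_ecn(tos_byte, codepoint):
    """Replace the two least-significant bits of the former ToS byte."""
    return (tos_byte & 0xFC) | codepoint


print(bin(select_ecn_codepoint(True)))
print(hex(apply_ecn(0xB8, ECT_1)))
```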

471	2.3.  Detecting and Measuring Congestion

473	2.3.1.  Accurate ECN Feedback

475	   When feedback of ECN markings was added to TCP [RFC3168], it was
476	   decided not to report any more than one mark per RTT.  L4S-capable
477	   congestion controls need to know the extent, not just the existence,
478	   of congestion (Req 4.2 [I-D.ietf-tsvwg-ecn-l4s-id]).  Recently
479	   defined transports (DCCP, QUIC, etc) typically already satisfy this
480	   requirement.  So they are dealt with separately below, while TCP and
481	   derivatives such as SCTP [RFC4960] are covered first.

483	2.3.1.1.  Accurate ECN Feedback with TCP & Derivatives

485	   The TCP wire protocol is being updated to allow more accurate
486	   feedback (AccECN [I-D.ietf-tcpm-accurate-ecn]).  Therefore, in the
487	   case where a sender uses the Prague CC over TCP, whether as client or
488	   server:

490	   o  it MUST itself support AccECN;

492	   o  to support AccECN it also has to check that its peer supports
493	      AccECN during the handshake.

495	   If the peer does not support accurate ECN feedback, the sender MUST
496	   fall back to a Reno-friendly CC behaviour for the rest of the
497	   connection.  The non-Prague TCP sender MUST then no longer set ECT(1)
498	   on the packets it sends.  Note that the peer only needs to support
499	   AccECN; there is no need (and no way) to find out whether the peer is
500	   using an L4S-capable congestion control.

502	   Note that a sending TCP client that uses the Prague CC can set ECT(1)
503	   on the SYN prior to checking whether the other peer supports AccECN
504	   (as long as it follows the procedure in
505	   [I-D.ietf-tcpm-generalized-ecn] if it discovers the peer does not
506	   support AccECN).
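
The fall-back rule above can be summarized as a small decision made
once the handshake has completed.  This is a minimal sketch, assuming
the outcome of AccECN negotiation is available as a boolean; the type
and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    cc_mode: str     # "prague" or "reno-friendly"
    set_ect1: bool   # whether the sender sets ECT(1) on the packets it sends


def after_handshake(peer_supports_accecn):
    """Choose the behaviour for the rest of the connection."""
    if peer_supports_accecn:
        return SenderState(cc_mode="prague", set_ect1=True)
    # Without accurate ECN feedback the sender falls back to a
    # Reno-friendly behaviour and no longer sets ECT(1).
    return SenderState(cc_mode="reno-friendly", set_ect1=False)


print(after_handshake(True))
print(after_handshake(False))
```

Only the peer's AccECN support is an input; the peer's own congestion
control is deliberately not considered, as noted above.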

508	   Implementation note:

510	   TCP Prague in Linux kernel:  The kernel had been updated to support
511	      AccECN independently of the CC module in use.  So the kernel tries
512	      to negotiate AccECN exchange whichever congestion control module
513	      is selected.  An additional check is provided to verify that the
514	      kernel actually does support AccECN, based on which the Prague CC
515	      module will decide to proceed using scalable CC or fall back to a
516	      Classic CC (Reno in the current implementation).

518	      A system-wide option is available to disable AccECN negotiation,
519	      but the Prague CC module will always override this setting, as it
520	      depends on AccECN.  Then, solely in this case, AccECN will only be
521	      active for TCP flows using the Prague CC.

523	2.3.1.2.  Accurate ECN Feedback with Other Modern Transports

525	   Transport protocols specified recently, e.g. DCCP [RFC4340], QUIC
526	   [I-D.ietf-quic-transport], are unambiguously suitable for Prague CCs,
527	   because they were designed from the start with accurate ECN feedback.

529	   In the case of RTP/RTCP, ECN feedback was added in [RFC6679], which
530	   is sufficient for the Prague CC.  However, it is preferable to use
531	   the most recent improvements to ECN feedback in
532	   [I-D.ietf-avtcore-cc-feedback-message], as used in the implementation
533	   of the L4S variant of SCReAM [RFC8298].

535	2.3.2.  Moving Average of ECN Feedback

537	   The Prague CC currently maintains a moving average of ECN feedback in
538	   a similar way to DCTCP.  This section is provided mainly because
539	   performance has proved to be sensitive to implementation precision in
540	   this area.  So first, some background is necessary.

542	   The Prague CC triggers update of its moving average once per RTT by
543	   recording the packet it sent after the previous update, then watching
544	   for the ACK of that packet to return.  To maintain its moving
545	   average, it measures the fraction, frac, of ACKed bytes that carried
546	   ECN feedback over the previous round trip.  It then updates an
547	   exponentially weighted moving average (EWMA) of this fraction, called
548	   alpha, using the following algorithm:

550	      alpha += g * (frac - alpha);

552	   where g is the gain of the EWMA (default 1/16).
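
A minimal floating-point sketch of this per-RTT update is given below.
It accumulates the acked_sacked and ece_delta values defined in
Section 1.4 over one round and then applies the EWMA.  The class name,
the method names and the initial value of alpha are assumptions made
for illustration, and detection of the end of a round is left to the
caller.

```python
class AlphaEwma:
    """Per-round EWMA of the fraction of ACKed bytes that were ECN-marked."""

    def __init__(self, gain=1 / 16, alpha=0.0):
        self.g = gain          # EWMA gain g (default 1/16)
        self.alpha = alpha     # initial value is an arbitrary choice here
        self.acked_bytes = 0
        self.marked_bytes = 0

    def on_ack(self, acked_sacked, ece_delta):
        """Accumulate newly ACKed bytes and those that were ECN-marked."""
        self.acked_bytes += acked_sacked
        self.marked_bytes += ece_delta

    def on_round_end(self):
        """Apply alpha += g * (frac - alpha), then start a new round."""
        if self.acked_bytes > 0:
            frac = self.marked_bytes / self.acked_bytes
            self.alpha += self.g * (frac - self.alpha)
        self.acked_bytes = 0
        self.marked_bytes = 0
        return self.alpha


ewma = AlphaEwma()
for i in range(10):
    # Ten ACKs of 1000 bytes each; only the first reports ECN marks.
    ewma.on_ack(acked_sacked=1000, ece_delta=1000 if i == 0 else 0)
alpha = ewma.on_round_end()
print(alpha)
```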

554	   Implementation notes:

556	   Rounding problems in DCTCP:  Alpha is a fraction between 0 and 1, and
557	      it needs to be represented with high resolution because the larger
558	      the bandwidth-delay product (BDP) of a flow, the smaller the value
559	      that alpha converges to (in steady state alpha = 2/cwnd).  In
560	      principle, Linux DCTCP maintains the moving average 'alpha' using
561	      the same formula as Prague CC uses (as above).  Linux represents
562	      alpha with a 10-bit integer (with resolution 1/1024).  However, up
563	      to kernel release 3.19, Linux used integer arithmetic that could
564	      not reduce alpha below 15/1024.  Then it was patched so that any
565	      value below 16/1024 was rounded down to zero [patch-alpha-zero].
566	      For a flow with a higher BDP than 128 segments, this means that
567	      alpha flip-flops.  Once it has flopped down to zero DCTCP becomes
568	      unresponsive until it has built sufficient queue to flip up to
569	      16/1024.  For larger BDPs, this causes DCTCP to induce larger
570	      sawteeth, which loses the low-queuing-delay and high-utilization
571	      intent of the algorithm.

573	   Upscaled alpha in Prague CC:  To resolve the above problem the
574	      implementation of TCP Prague in Linux maintains upscaled_alpha =
575	      alpha/g instead of alpha:

577	         upscaled_alpha += frac - g * upscaled_alpha;

579	      This technique is the same as Linux uses for the retransmission
580	      timer variables, srtt and mdev.  Prague CC also uses 20 bits for
581	      alpha.
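
   The difference between the two representations can be sketched as
   follows.  This is a simplified model of the integer arithmetic, not
   the Linux kernel code: the function names, the shift-based gain and
   the byte counts passed in are assumptions made for illustration.

```python
G_SHIFT = 4  # EWMA gain g = 1/16, applied as a right shift


def update_alpha_10bit(alpha, ce_bytes, acked_bytes):
    """One per-RTT update with alpha stored directly, scaled by 2**10."""
    # g * frac, in units of 1/1024; small fractions truncate to zero here.
    frac_times_g = (ce_bytes << (10 - G_SHIFT)) // acked_bytes
    # alpha >> G_SHIFT is zero for alpha < 16, so alpha cannot decay further.
    return alpha - (alpha >> G_SHIFT) + frac_times_g


def update_upscaled_alpha(upscaled_alpha, ce_bytes, acked_bytes):
    """One per-RTT update of upscaled_alpha = alpha/g, alpha scaled by 2**20."""
    frac = (ce_bytes << 20) // acked_bytes
    # upscaled_alpha += frac - g * upscaled_alpha
    return upscaled_alpha + frac - (upscaled_alpha >> G_SHIFT)


def run(update, state, ce_bytes, acked_bytes, rounds):
    """Apply the same per-round feedback for a number of rounds."""
    for _ in range(rounds):
        state = update(state, ce_bytes, acked_bytes)
    return state
```

   In this model, with no marking at all the 10-bit alpha decays from
   1024 and then sticks at 15, whereas the upscaled value decays until
   alpha (upscaled_alpha >> G_SHIFT) reads as zero.  With 2 marked
   units per 1000 ACKed, the 10-bit update never moves alpha off zero,
   while the upscaled form settles at the marked fraction.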

583	   Currently the above per-RTT update to the moving average, which was
584	   inherited from DCTCP, is the default in the Prague CC.  However,
585	   another approach is being investigated because these per-RTT updates
586	   introduce 1-2 rounds of delay into the congestion response on top of
587	   the inherent round of feedback delay (see Section 3.1.3 in the
588	   section on variants and future work).

590	2.3.3.  Scaling Loss Detection with Flow Rate

592	   After an ACK leaves a gap in the sequence space, a Prague CC is meant
593	   to deem that a loss has occurred using 'time-based units' (Req 4.3.
594	   [I-D.ietf-tsvwg-ecn-l4s-id]).  This is in contrast to the traditional
595	   approach that counts a hard-coded number of duplicate ACKs, e.g. the
596	   3 Dup-ACKs specified in [RFC5681].  Counting packets rather than time
597	   unnecessarily tightens the time within which parallelized links have
598	   to keep packets in sequence as flow rate scales over the years.

600	   To satisfy this requirement, a Prague CC SHOULD wait until a certain
601	   fraction of the RTT has elapsed before it deems that the gap is due
602	   to packet loss.  The reference implementation of TCP Prague in Linux
603	   uses RACK [I-D.ietf-tcpm-rack] to address this requirement.  An
604	   approach similar to TCP RACK is also used in QUIC.

606	   At the start of a connection, RACK counts 3 DupACKs to detect loss
607	   because the initial smoothed RTT estimate can be inaccurate.  This
608	   would depend indirectly on time as long as the initial window (IW) is
609	   paced over a round trip (see Section 2.4.5).  For instance, if the
610	   initial window of 10 segments was paced evenly across the initial RTT
611	   then, in the next round, an implementation that deems there has been
612	   a loss after (say) 1/4 of an RTT can count 1/4 of 10 = 3 DupACKs
613	   (rounded up).  Subsequently, as the window grows, RACK shifts to
614	   using a fraction of the RTT for loss detection.
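
   The relationship between a time-based threshold and an equivalent
   DupACK count can be sketched as below.  This is only an illustration
   of the arithmetic in the example above, not the RACK algorithm
   itself; the function names and the use of an integer divisor (4 for
   1/4 of an RTT) are assumptions.

```python
def gap_deemed_lost(gap_age_us, rtt_us, rtt_divisor=4):
    """True once a sequence gap has persisted for RTT/rtt_divisor."""
    # Multiply rather than divide so that integer arithmetic stays exact.
    return gap_age_us * rtt_divisor >= rtt_us


def equivalent_dupacks(window_segments, rtt_divisor=4):
    """DupACK count that spans RTT/rtt_divisor when the window is paced
    evenly over one round trip (rounded up)."""
    return -(-window_segments // rtt_divisor)  # ceiling division
```

   For an initial window of 10 segments and a threshold of 1/4 of an
   RTT, equivalent_dupacks(10) gives 3, matching the example above.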

616	2.4.  Congestion Response Algorithm

618	   In congestion avoidance phase, a Prague CC uses a similar additive
619	   increase multiplicative decrease (AIMD) algorithm to DCTCP, but with
620	   the following differences:

622	2.4.1.  Fall-Back on Loss

624	   A Prague CC has to fall back to Reno-friendly behaviour on detection
625	   of a loss (Req 4.3.  [I-D.ietf-tsvwg-ecn-l4s-id]).  DCTCP falls back
626	   to Reno for the round trip after a loss, and the Linux reference
627	   implementation of TCP Prague inherits this behaviour.

629	   If a Prague CC has already reduced the congestion window due to ECN
630	   feedback less than a round trip before it detects a loss, it MAY
631	   reduce the congestion window by a smaller amount due to the loss, as
632	   long as the reductions due to ECN and the loss are Reno-friendly when
633	   taken together.

635	   See Section 3.2 for discussion of future work on congestion control
636	   using a combination of delay, ECN and loss.

638	   Implementation note:

640	   DCTCP bug prior to v5.1:  A Prague CC cannot rely on the fall-back-
641	      on-loss behaviour of the DCTCP code in the Linux kernel prior to
642	      v5.1, due to a previous bug in the fast retransmit code (but not
643	      in the retransmission timeout code) [patch-loss-react].

645	2.4.2.  Multiplicative Decrease on ECN Feedback

647	   The Prague CC currently responds to ECN feedback in a similar way to
648	   DCTCP.  This section is provided mainly because performance has
649	   proved to be sensitive to implementation details in this area.  So
650	   the following recap of the congestion response is needed first.

652	   As explained in Section 2.3.2, the Prague CC (like DCTCP) clocks its
653	   moving average of ECN-marking, alpha, once per round trip throughout
654	   a connection.  Nonetheless, it only triggers a multiplicative
655	   decrease to its congestion window when it actually receives an ACK
656	   carrying ECN feedback.  Then it suppresses any further decreases for
657	   one round trip, even if it receives further ECN feedback.  This is
658	   termed Congestion Window Reduced or CWR state.

660	   The Prague CC (like DCTCP) ensures that the average recovery time
661	   remains invariant as flow rate scales (Req 4.3 of
662	   [I-D.ietf-tsvwg-ecn-l4s-id]) by making the multiplicative decrease
663	   depend on the prevailing value of alpha as follows:

665	      ssthresh = (1 - alpha/2) * cwnd;

667	   Implementation notes:

669	   Upscaled alpha:  With reference to the earlier discussion of integer
670	      arithmetic precision (Section 2.3.2), alpha = g * upscaled_alpha.

672	   Carry of fractional cwnd remainder:  Typically the absolute reduction
673	      in the window is only a small number of segments.  So, if the
674	      Prague CC implementation counts the window in integer segments (as
675	      in the Linux reference code), delay can be made significantly less
676	      jumpy by tracking a fractional value alongside the integer window
677	      and carrying over any fractional remainder to the next reduction.
678	      Also, integer rounding bias ought to be removed from the
679	      multiplicative decrease calculation.
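
   One possible way to carry the fractional remainder is sketched
   below.  It assumes alpha is held as an integer scaled by 2**20 and
   that the window is counted in whole segments; the names
   decrease_on_ecn, carry and MIN_CWND are hypothetical and are not
   taken from the Linux reference code.

```python
ALPHA_SHIFT = 20                 # alpha is an integer scaled by 2**20
ONE_SEGMENT = 1 << ALPHA_SHIFT   # one whole segment in fixed-point units
MIN_CWND = 2                     # assumed floor for the window, in segments


def decrease_on_ecn(cwnd, carry, alpha):
    """Apply cwnd -= cwnd * alpha / 2 to an integer window.

    carry holds the fraction of a segment left over from earlier
    reductions, in units of 1/2**20 of a segment.
    Returns the new (cwnd, carry) pair.
    """
    total = (cwnd * alpha) // 2 + carry        # desired reduction, fixed point
    whole_segments = total >> ALPHA_SHIFT      # part that can be applied now
    carry = total & (ONE_SEGMENT - 1)          # remainder kept for next time
    return max(cwnd - whole_segments, MIN_CWND), carry
```

   For example, with cwnd = 96 and alpha = 1/64 the ideal reduction is
   0.75 of a segment.  Without a carry the integer window would never
   shrink; with it, the first reduction leaves cwnd at 96 and the
   second takes it to 95.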

681	   In dynamic scenarios, as flows find a new operating point, alpha will
682	   have often tailed away to near-nothing before the onset of
683	   congestion.  Then DCTCP's tiny reduction followed by no further
684	   response for a round is precisely the wrong way for a CC to respond.
685	   A solution to this problem is being evaluated as part of the work
686	   already mentioned to improve Prague's responsiveness (see
687	   Section 3.1.3 in the section on variants and future work).

689	2.4.3.  Additive Increase and ECN Feedback

691	   Unlike DCTCP, the Prague CC does not suppress additive increase for
692	   one round trip after a congestion window reduction (while in CWR
693	   state).  Instead, a Prague CC applies additive increase irrespective
694	   of its CWR state, but only for bytes that have been ACK'd without ECN
695	   feedback.  Specifically, on each ACK,

697	       cwnd += (acked_sacked - ece_delta) * ai_per_rtt / cwnd;

699	   where:

701	      acked_sacked is the number of new bytes acknowledged by the ACK;

703	      ece_delta is the number of newly acknowledged ECN-marked bytes;

705	      ai_per_rtt is a scaling factor that is typically 1 SMSS except for
706	      small RTTs (see Section 2.4.4).
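
   As a minimal sketch of the per-ACK update above, assuming that cwnd,
   acked_sacked, ece_delta and ai_per_rtt are all counted in bytes and
   that floating point is acceptable for illustration (a real stack
   would use integer arithmetic):

```python
def additive_increase(cwnd, acked_sacked, ece_delta, ai_per_rtt):
    """Return cwnd after one ACK; only unmarked bytes earn an increase."""
    unmarked = acked_sacked - ece_delta
    return cwnd + unmarked * ai_per_rtt / cwnd
```

   With a window of 14600 bytes and ai_per_rtt = 1460, an ACK for 1460
   unmarked bytes adds 146 bytes, so a whole round of unmarked ACKs
   adds roughly one segment.  An ACK whose bytes were all ECN-marked
   leaves the window unchanged.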

708	   Superficially, the traditional suppression of additive increase for
709	   the round after a decrease seems to make sense.  However, DCTCP and
710	   Prague are designed to induce an average of 2 congestion marks per
711	   RTT in steady state, which leaves very little space for any increase
712	   between the end of one round of CWR and the next mark.  In tests,
713	   when a test version of Prague CC is configured to completely suppress
714	   additive increase during CWR (like Reno and DCTCP), its sawteeth
715	   become more irregular, which is its way of making some decreases
716	   large enough to open up enough space for an increase.  This
717	   irregularity tends to reduce link utilization.  Therefore, the
718	   reference Prague CC continues additive increase irrespective of CWR
719	   state.

721	   Nonetheless, rather than continue additive increase regardless of
722	   congestion, it is safer to only increase on those ACKs that do not
723	   feed back congestion.  This approach reduces additive increase as the
724	   marking probability increases, which tends to keep the marking level
725	   unsaturated (below 100%) (see Section 3.1 of [Tensions17]).  Under
726	   stable conditions, Prague's congestion window then becomes
727	   proportional to (1-p)/p, rather than 1/p.

729	   See also 'Faster than Additive Increase' (Section 3.1.2).

731	2.4.4.  Reduced RTT-Dependence

733	   The window-based AIMD described so far was inherited from Reno via
734	   DCTCP.  When many long-running Reno flows share a link, their
735	   relative packet rates become roughly inversely proportional to RTT
736	   (packet rate =~ 1/RTT).  Then a flow with very small RTT will
737	   dominate any flows with larger RTTs.

739	   Queuing delay sets a lower limit to the smallest possible RTT.  So,
740	   prior to the extremely low queuing delay of L4S, extreme cases of RTT
741	   dependence had never been apparent.  Now that L4S has removed most of
742	   the queuing delay, we have to address the root-cause of RTT-
743	   dependence, which the Prague CC is required to do, at least when the
744	   RTT is small (see the 'Reduced RTT bias' aspect of Req 4.3.
745	   [I-D.ietf-tsvwg-ecn-l4s-id]).  Here, a small RTT is defined as below
746	   the typical RTT for the intended deployment environment.

748	   A Prague CC reduces RTT bias by using a reference RTT (RTT_ref)
749	   rather than the actual round trip (RTT) for all three of: the window
750	   update period; the EWMA update period; and the duration of CWR state
751	   after a decrease.  As the actual window (cwnd) is still sent within 1
752	   actual RTT, we also need to use a (conceptual) reference window,
753	   cwnd_ref.  For instance, if RTT_ref = 25 ms then, when the actual RTT
754	   is 5 ms, there are RTT_ref/RTT = 5 times more packets in cwnd_ref,
755	   than in the actual window, cwnd, because it spans 5 actual round
756	   trips.  We define M as the ratio RTT_ref/RTT.

758	   In the Linux implementation of TCP Prague, RTT_ref is a function of
759	   the actual RTT. 3 functions have been implemented: RTT_ref = max(RTT,
760	   RTT_REF_MIN); RTT_ref = RTT + AdditionalRTT; RTT_ref = ...  {ToDo}.
761	   The current default is RTT_ref = max(RTT, 25ms), which addresses the
762	   main Prague requirement for when the RTT is smaller than typical.

764	   In Reno or DCTCP, additive increase is implemented by dividing the
765	   desired increase of 1 segment per round over the cwnd packets in the
766	   round.  This requires an increase of 1/cwnd per packet.  In the Linux
767	   implementation of TCP Prague, the aim is to increase the reference
768	   window by 1 segment over a reference round.  However, in practice the
769	   increase is applied to the actual window, cwnd, which is M times
770	   smaller than cwnd_ref.  So cwnd has to be increased by only 1/M
771	   segments over RTT_ref.  But again, in practice, the increase is
772	   applied over an actual window of packets spanning an actual RTT,
773	   which is also M times smaller than the reference RTT.  So the desired
774	   increase in cwnd is only 1/M^2 segments over an actual round trip
775	   containing cwnd packets.  Therefore, the increase in cwnd per packet
776	   has to be (1/M^2) * (1/cwnd).
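
   The scaling derived above can be checked with a short sketch.  It
   assumes the default RTT_ref = max(RTT, 25ms) and uses exact
   fractions to avoid rounding; the function names are illustrative
   only.

```python
from fractions import Fraction

RTT_REF_MIN_US = 25_000  # 25 ms, the default minimum reference RTT


def rtt_ref_us(rtt_us):
    """Reference RTT for the default function RTT_ref = max(RTT, 25ms)."""
    return max(rtt_us, RTT_REF_MIN_US)


def increase_per_rtt(rtt_us):
    """Segments added to cwnd over one actual round trip: 1/M^2."""
    m = Fraction(rtt_ref_us(rtt_us), rtt_us)  # M = RTT_ref / RTT
    return 1 / (m * m)


def increase_per_packet(rtt_us, cwnd):
    """Segments added to cwnd per ACKed packet: (1/M^2) * (1/cwnd)."""
    return increase_per_rtt(rtt_us) / cwnd
```

   For an actual RTT of 5 ms, M = 5 and the window grows by 1/25 of a
   segment per actual round trip.  For any RTT of 25 ms or more, M = 1
   and the increase is the usual 1 segment per round trip.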

778	   Unless a flow lasts long enough for rates to converge, equal rates
779	   will not be relevant.  So, the Reduced RTT-Dependence algorithm only
780	   comes into effect after D rounds, where D is configurable (current
781	   default 500).  Continuing the previous example, if actual RTT=5 ms
782	   and RTT_ref = 25 ms, then Prague would stop using its RTT-dependent
783	   algorithm after 500*5ms = 2.5s and instead it would start to converge
784	   to equal rates using the Reduced RTT-Dependence algorithm.  If the
785	   actual RTT were higher (e.g. 20ms), it would stay in RTT-dependent
786	   mode for longer (10s), but this would be mitigated by its RTT being
787	   closer to the reference (20ms vs. 25ms).

789	   This approach prevents reduced RTT-dependence from making the flow
790	   less responsive at start-up and ensures that its early throughput
791	   share is based on its actual RTT.  The benefit is that short flows
792	   (mice) give themselves priority over longer flows (elephants), and
793	   shorter RTTs will still converge faster than longer RTTs.
794	   Nonetheless, the throughput still converges to equal rates after D
795	   rounds.

797	   It is planned to reset the algorithm to be RTT-dependent after an
798	   idle, not just at flow start, as discussed under Future Work in
799	   Section 3.4.

801	   Section 3.4 also discusses extending the reduction in RTT-dependence
802	   to longer RTTs than RTT_ref.  The current Prague implementation
803	   does not support this.

805	2.4.5.  Flow Start or Restart

807	   Currently the Linux reference implementation of TCP Prague uses the
808	   standard Linux slow start code.  Slow start is exited once a single
809	   mark is detected.

811	   When other flows are actively filling the link, regular marks are
812	   expected, causing slow start of new flows to end prematurely.  This
813	   is clearly not ideal, so other approaches are being worked on (see
814	   Section 3.1.1).  However, slow start has been left as the default
815	   until a properly matured solution is completed.

817	2.5.  Packet Sending

819	2.5.1.  Packet Pacing

821	   The Prague CC SHOULD pace the packets it sends to avoid the queuing
822	   delay and under-utilization that would otherwise be caused by bursts
823	   of packets that can occur, for example, when a jump in the
824	   acknowledgement number opens up cwnd.  Prague does this in a similar
825	   way to the base Linux TCP stack, by spacing out the window of packets
826	   evenly over the round trip time, using the following calculation of
827	   the pacing rate [b/s]:

829	      pacing_rate = MTU_BITS * max(cwnd, inflight) / srtt;

831	   During slow start, as in the base Linux TCP stack, Prague factors up
832	   pacing_rate by 2, so that it paces out packets twice as fast as they
833	   are acknowledged.  This keeps up with the doubling of cwnd, but still
834	   prevents bursts in response to any larger transient jumps in cwnd.

836	       if (cwnd < ssthresh / 2)
837	           pacing_rate *= 2;

839	   During congestion avoidance, the Linux TCP Prague implementation does
840	   not factor up pacing_rate at all.  This contrasts with the base Linux
841	   TCP stack, which currently factors up pacing_rate by a ratio
842	   parameter set to 1.2.  The developers of the base Linux stack
843	   confirmed that this factor of 1.2 was only introduced in case it
844	   improved performance, but there were no scenarios where it was known
845	   to be needed.  In testing of Prague, this factor was found to cause
846	   queue delay spikes whenever cwnd jumped more than usual.  And
847	   throughput was no worse without it.  So it was removed from the TCP
848	   Prague CC.
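
   The pacing calculation described in this section can be sketched as
   follows.  It assumes cwnd, inflight and ssthresh are counted in
   packets, srtt is given in microseconds and MTU_BITS corresponds to
   a 1500-byte MTU; these units and the function name are assumptions
   for illustration, not the Linux code.

```python
MTU_BITS = 12_000  # assumed 1500-byte MTU


def pacing_rate_bps(cwnd, inflight, srtt_us, ssthresh):
    """Pacing rate in b/s: one window spread evenly over one smoothed RTT."""
    rate = MTU_BITS * max(cwnd, inflight) * 1_000_000 // srtt_us
    if cwnd < ssthresh // 2:
        rate *= 2  # slow start: pace twice as fast as packets are ACKed
    # Congestion avoidance: no extra factor (the base stack's 1.2 is omitted).
    return rate
```

   For example, 10 packets over a 10 ms SRTT gives 12 Mb/s in
   congestion avoidance, and 24 Mb/s while cwnd is below half of
   ssthresh.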

850	   The Prague CC can use alternatives to the traditional slow-start
851	   algorithm, which use different pacing (see Section 2.4.5).

853	2.5.2.  Segmentation Offload

855	   In the absence of hardware pacing, it becomes increasingly difficult
856	   for a machine to scale to higher flow rates unless it is allowed to
857	   send packets in larger bursts, for instance using segmentation
858	   offload.  Happily, as flow rate scales up, proportionately more
859	   packets can be allowed in a burst for the same amount of queuing
860	   delay at the bottleneck.

862	   Therefore, the Prague CC sends packets in a burst as long as it will
863	   not induce more than MAX_BURST_DELAY of queuing at the bottleneck.

865	   From this constant and the current pacing_rate, it calculates how
866	   many MTU-sized packets to allow in a burst:

868	      max_burst = pacing_rate * MAX_BURST_DELAY / MTU_BITS

870	   The current default in the Linux TCP Prague for MAX_BURST_DELAY is
871	   250us which supports marking thresholds starting from about 500us
872	   without underutilization.  This approach is similar to that in the
873	   Linux TCP stack, except there MAX_BURST_DELAY is 1ms.
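
   As a worked illustration of the burst calculation, the sketch below
   uses integer arithmetic with the rate in b/s and the delay in
   microseconds.  The floor of one packet per burst and the 1500-byte
   MTU are assumptions added here, not something stated above.

```python
def max_burst_packets(pacing_rate_bps, max_burst_delay_us=250,
                      mtu_bits=12_000):
    """MTU-sized packets that fit within max_burst_delay at the paced rate."""
    burst_bits = pacing_rate_bps * max_burst_delay_us // 1_000_000
    # Assumed floor: always allow at least one packet to be sent.
    return max(1, burst_bits // mtu_bits)
```

   At 120 Mb/s the 250us default allows 2 packets per burst, at 1 Gb/s
   it allows 20, and with a 1ms limit a 120 Mb/s flow could send 10.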

875	3.  Variants and Future Work

877	3.1.  Getting up to Speed Faster

879	   Appendix A.2. of [I-D.ietf-tsvwg-ecn-l4s-id] outlines the performance
880	   optimizations needed when transplanting DCTCP from a DC environment
881	   to a wide area network.  The following subsections address two of
882	   those points: faster flow startup and faster than additive increase.
883	   Then Section 3.1.3 covers the flip side, in which established flows
884	   have to yield faster to make room, otherwise queuing will result.

886	3.1.1.  Flow Start (or Restart)

888	   For faster flow start, two approaches are currently being
889	   investigated in parallel:

891	   Modified Slow Start:  The traditional exponential slow start can be
892	      modified both at the start and the end, with the aim of reducing
893	      the risk of queuing due to bursts and overshoot:

895	      Pacing IW:  A Prague CC can use an initial window of 10 (IW10
896	         [RFC6928]), but pacing of this Initial Window is recommended to
897	         try to avoid the pulse of queuing that could otherwise occur.
898	         Pacing IW10 also spreads the ACKs over the round trip so that
899	         subsequent rounds consist of ten subsets of packets (with 2, 4,
900	         8 etc.  per round in each subset), rather than a single set
901	         with 20, 40, 80 etc. in each round.  Then, if a queue builds
902	         during a round (e.g. due to other unexpected traffic arriving)
903	         it can drain in the gap before the next subset, rather than the
904	         whole set backing up in a much larger queue.

906	         In the Linux reference implementation of TCP Prague, IW pacing
907	         can be optionally enabled, but it is off by default, because it
908	         is yet to be fully evaluated.  It currently paces IW over half
909	         the initial smoothed round trip time (SRTT) measured during the
910	         handshake.  SRTT is halved because the RTT often reduces after
911	         the initial handshake.  For example: i) some CDNs move the flow
912	         to a closer server after establishment; ii) the initial RTT
913	         from a server can include the time to wake a sleeping handset
914	         battery; iii) some uplink technologies take a link-level round
915	         trip to request a scheduling slot.

917	         It is planned to exploit any cached knowledge of the path RTT
918	         to improve the initial estimate, for instance using the Linux
919	         per-destination cache.  It is also planned to allow the
920	         application to give an RTT hint (by setting sk_max_pacing_rate
921	         in Linux) if the developer has reason to believe that the
922	         application has a better estimate.

924	      Exiting slow start more gracefully:  In the wide area Internet (in
925	         contrast to data centres), bottleneck access links tend to have
926	         much less capacity than the line rate of the sender.  With a
927	         shallow immediate ECN threshold at this bottleneck, the
928	         slightest burst can tend to induce an ECN mark, which
929	         traditionally causes slow start to exit.  A more gradual exit
930	         is being investigated for a Prague CC using the extent of
931	         marking, not just the existence of a single mark.  This will be
932	         more consistent with the extent-based marking that scalable
933	         congestion controls use during congestion avoidance.  Delay
934	         measurements (similar to Hystart++
935	         [I-D.ietf-tcpm-hystartplusplus]) can also be used to complement
936	         the ECN signals.

938	   Paced Chirping:  In this approach, the aim is to both increase more
939	      rapidly than exponential slow-start and to greatly reduce any
940	      overshoot.  It is primarily a delay-based approach, but the aim is
941	      also to exploit ECN signals when present (while not forgetting
942	      loss either).  Therefore Paced Chirping is generally usable for
943	      any congestion control - not solely for Prague CC and L4S.

945	      Instead of only aiming to detect capacity overshoot at the end of
946	      flow-start, brief trains of rapidly decreasing inter-packet
947	      spacing called chirps are used to test many rates with as few
948	      packets and as little load as possible.  A full description is
949	      beyond the scope of this document.  [LinuxPacedChirping]
950	      introduces the concepts and the code as well as citing the main
951	      papers on Paced Chirping.

953	      Paced chirping works well over continuous links such as Ethernet
954	      and DSL.  But better averaging and noise filtering are necessary
955	      over discontinuous link technologies such as WiFi, LTE cellular
956	      radio, passive optical networks (PON) and data over cable
957	      (DOCSIS).  This is the current focus of this work.

959	      The current Linux implementation of TCP Prague does not include
960	      Paced Chirping, but research code is available separately in Linux
961	      and ns3.  It is accessible via the L4S landing page [L4S-home].

963	3.1.2.  Faster than Additive Increase

965	   The Prague CC has a startup phase and congestion avoidance phase like
966	   traditional CCs.  In steady-state during congestion avoidance, like
967	   all scalable congestion controls, it induces frequent ECN marks, with
968	   the same average recovery time between ECN marks, no matter how much
969	   the flow rate scales.

971	   If available capacity suddenly increases, e.g. other flow(s) depart
972	   or the link capacity increases, these regular ECN marks will stop.
973	   Therefore after a few rounds of silence (no ECN marks) in congestion
974	   avoidance phase, the Prague CC can assume that available capacity has
975	   increased, and switch to using the techniques from its startup phase
976	   (Section 3.1.1) to rapidly find the new, faster operating point.
977	   Then it can shift back into its congestion avoidance behaviour.

979	   That is the theory.  But, as explained in Section 3.1.1, the startup
980	   techniques, specifically paced chirping, are still being developed
981	   for discontinuous link types.  Once the startup behaviour is
982	   available, the Linux implementation of the Prague CC will also have a
983	   faster than additive increase behaviour.  Section 3.2.3 of
984	   [PragueLinux] gives a brief preview of the performance of this
985	   approach over an Ethernet link type in ns3.

987	3.1.3.  Remove Lag in Congestion Response

989	   To keep queuing delay low, new flows can only push in fast if
990	   established flows yield fast.  It has recently been realized that the
991	   design of the Prague EWMA and congestion response introduces 1-2
992	   rounds of lag (on top of the inherent round of feedback delay due to
993	   the speed of light).  These lags were inherited from the design of
994	   DCTCP (see Section 2.3.2 and Section 2.4.2), where a couple of extra
995	   hundred microseconds was less noticeable.  But congestion control in
996	   the wide area Internet cannot afford up to 2 round trips of extra
997	   lag.

999	   To be clear, lag means delay before any response at all starts.  That
1000	   is qualitatively different from the smoothing gain of an EWMA,
1001	   which /reduces/ the response by the gain factor (1/16 by default) in
1002	   case a change in congestion does not persist.  Smoothing gain can
1003	   always be increased.  But 1-2 rounds of lag means that, when a new
1004	   flow tries to push in, the sender of an established flow will not
1005	   respond /at all/ for 1-2 rounds after it first receives congestion
1006	   feedback.

1008	   The Prague CC spends the first round trip of this lag gathering
1009	   feedback to measure frac before it is input into the EWMA algorithm
1010	   (see Section 2.3.2).  Then there is up to one further round of delay
1011	   because the implementations of DCTCP and Prague did not fully adopt
1012	   the paradigm shift to extent-based marking - the timing of the
1013	   decrease is still based on Reno.

1015	   Both Reno and DCTCP/Prague respond immediately on the first sign of
1016	   congestion.  Reno's response is large, so it waits a round in CWR
1017	   state to allow the response to take effect.  DCTCP's response is tiny
1018	   (extent-based), but then it still waits a round in CWR state.  So it
1019	   does next-to-nothing for a round.

1021	   New EWMA and response algorithms to remove these 1-2 extra rounds of
1022	   lag are described in [PerAckEWMA].  They have been implemented in
1023	   Linux and an iterative process of evaluation and redesign is in
1024	   progress.  The EWMA is updated per-ACK, but it still changes as if it
1025	   is clocked per round trip.  The congestion response is still
1026	   triggered by the first indication of ECN feedback, but it proceeds
1027	   over the subsequent round trip so that it can take into account
1028	   further incoming feedback as the EWMA evolves.  The reduction is
1029	   applied per-ACK but sized to result as if it had been a single
1030	   response per round trip.

1032	3.2.  Combining Congestion Metrics

1034	   Ultimately, it would be preferable to take an integrated approach and
1035	   use a combination of ECN, loss and delay metrics to drive congestion
1036	   control.  For instance, using a downward trend in ECN marking and/or
1037	   delay as a heuristic to temper the response to loss.  Such ideas are
1038	   not in the immediate plans for the Linux TCP Prague, but some more
1039	   specific ideas are highlighted in the following subsections.

1041	3.2.1.  ECN with Loss

1043	   If the bottleneck is ECN-capable, a loss due to congestion is very
1044	   likely to have been preceded by a period of ECN marking.  When the
1045	   current Linux TCP Prague CC detects a loss, like DCTCP, it halves
1046	   cwnd, even if it has already reduced cwnd in the same round trip due
1047	   to ECN marking.  This double reduction can end up factoring down cwnd
1048	   to as little as 1/4 in one round trip.

1050	   On a loss while in CWR state following an ECN reduction, it would be
1051	   possible to factor down cwnd by 1/(2-alpha), which would compound
1052	   with the previous decrease factor of (1-alpha/2) to result in: (1 -
1053	   alpha/2) / (2-alpha) = 1/2.  In integer arithmetic, this division
1054	   would be possible but relatively expensive.  A less expensive
1055	   alternative would be multiplication by (2+alpha)/4, which
1056	   approximates to a compounded decrease factor of 1/2 for typical low
1057	   values of alpha, even up to 30%. The compound decrease factor is
1058	   never greater than 1/2 and in the worst case, if alpha was 100%, it
1059	   would factor cwnd down by 3/8.
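
   The compounded factors quoted above can be verified with exact
   fractions.  The sketch below is only a check of the arithmetic; the
   function names are illustrative.

```python
from fractions import Fraction


def ecn_factor(alpha):
    """Decrease factor already applied on ECN feedback: 1 - alpha/2."""
    return 1 - alpha / 2


def exact_loss_factor(alpha):
    """Loss factor that compounds with the ECN decrease to exactly 1/2."""
    return 1 / (2 - alpha)


def approx_loss_factor(alpha):
    """Cheaper multiplicative approximation: (2 + alpha)/4."""
    return (2 + alpha) / 4


def compound(alpha, loss_factor):
    """Overall factor applied to cwnd by the ECN and loss reductions."""
    return ecn_factor(alpha) * loss_factor(alpha)
```

   With the approximation the compound factor works out as
   (4 - alpha^2)/8: it is 1/2 at alpha = 0, 391/800 (about 0.489) at
   alpha = 30%, and 3/8 at alpha = 100%.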

1061	3.2.2.  ECN with Delay

1063	   Section 3.1.2 described the plans to shift between using ECN when
1064	   close to the operating point and using delay by injecting paced
1065	   chirps to find a new operating point after the ECN signal goes silent
1066	   for a few rounds.  Paced chirping shifts more slowly to the new
1067	   operating point the more noise there is in the delay measurements.
1068	   Work is ongoing on treating any ECN marking as a complementary
1069	   metric.  The resulting less noisy combined metric should then allow
1070	   the controller to shift more rapidly to each new operating point.

1072	   An alternative would be to combine ECN with the BBR approach, which
1073	   induces a much less noisy delay signal by using less frequent but
1074	   more pronounced delay spikes.  The approach currently being taken is
1075	   to adapt the chirp length to the degree of noise, so the chirps only
1076	   become longer and/or more pronounced when necessary, for instance
1077	   when faced with a discontinuous link technology such as WiFi.  With
1078	   multiple chirps per round, the noise can still be filtered out by
1079	   averaging over them all, rather than trying to remove noise from each
1080	   spike.  This keeps the 'self-harm' to the minimum necessary, and
1081	   ensures that capacity is always being sampled, which removes the risk
1082	   of going stale.

1084	3.3.  Fall-Back on Classic ECN

1086	   The implementation of TCP Prague CC in Linux includes an algorithm to
1087	   detect a Classic ECN AQM and fall back to Reno as a result, as
1088	   required by the 'Coexistence with Classic ECN' aspect of the Prague
1089	   Req 4.3.  [I-D.ietf-tsvwg-ecn-l4s-id].

1091	   The algorithm currently used (v2) is relatively simple, but rather
1092	   than describe it here, full rationale, pseudocode and explanation can
1093	   be found in the technical report about it [ecn-fallback].  This also
1094	   includes a selection of the evaluation results and a link to
1095	   visualizations of the full results online.  The current algorithm
1096	   nearly always detects a Classic ECN AQM, and in the majority of the
1097	   wide range of scenarios tested it is good at detecting an L4S AQM.
1098	   However, it wrongly identifies an L4S AQM as Classic in a
1099	   significant minority of cases when the link rate is low, or the RTT
1100	   is high.  The report gives ideas on how to improve detection in these
1101	   scenarios, but in the mean time the algorithm has been disabled by
1102	   default.

1104	   Recently, the report has been updated to include new ideas on other
1105	   ways to distinguish Classic from L4S AQMs.  The interested reader can
1106	   access it themselves, so this living document will not be further
1107	   summarized here.

1109	3.4.  Further Reduced RTT-Dependence

1111	   The algorithm to reduce RTT dependence is only relevant for long-
1112	   running flows.  So in the current TCP Prague implementation it
1113	   remains disabled for a certain number of round trips after the start
1114	   of a flow, as explained in Section 2.4.4.  It would be possible to
1115	   make RTT_ref gradually move from the actual RTT to the target
1116	   reference RTT, or perhaps depend on other parameters of the flow.
1117	   Nonetheless, just switching in the algorithm after a number of rounds
1118	   works well enough.  It is planned to also disable the algorithm for a
1119	   similar duration if a flow becomes idle then restarts, but this is
1120	   yet to be evaluated.

1122	   Prague Req 4.3 in [I-D.ietf-tsvwg-ecn-l4s-id] only requires reduced
1123	   RTT bias "in the range between the minimum likely RTT and typical
1124	   RTTs expected in the intended deployment scenario".  The current TCP
1125	   Prague implementation satisfies this requirement (Section 2.4.4).
1126	   Nonetheless, it would be preferable to be able to reduce the RTT bias
1127	   for high RTT flows as well.

1129	   If a step AQM is used, the congestion episodes of flows with
1130	   different RTTs tend to synchronize, which exacerbates RTT bias.  To
1131	   prevent this, two candidate approaches will need to be investigated:
1132	   i) It might be sufficient to deprecate step AQMs for L4S (they are
1133	   not the preferred recommendation in
1134	   [I-D.ietf-tsvwg-aqm-dualq-coupled]); or ii) the reference RTT
1135	   approach of Section 2.4.4 might be usable for higher than typical
1136	   RTTs as well as lower.  In this latter case, (RTT/RTT_ref)^2 segments
1137	   would need to be added to the window per actual RTT.  The current TCP
1138	   Prague implementation does not support this faster AI for RTTs higher
1139	   than RTT_ref, due to the expected (but unverified) impact on latency
1140	   overshoot and responsiveness.
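
   The following minimal sketch illustrates the additive increase
   discussed in approach ii).  The function and parameter names are
   hypothetical, and the fall-back to one segment per round trip for
   RTTs above RTT_ref is an assumed stand-in for "no faster AI"; the
   sketch is not the TCP Prague code.

```python
def additive_increase_per_rtt(rtt, rtt_ref, allow_faster_ai=False):
    """Segments to add to the congestion window per actual round trip.

    rtt and rtt_ref must be positive and in the same unit (e.g. ms).
    """
    if rtt <= 0 or rtt_ref <= 0:
        raise ValueError("RTTs must be positive")
    scale = (rtt / rtt_ref) ** 2
    if scale > 1.0 and not allow_faster_ai:
        # Assumption: without faster AI, flows with RTT above RTT_ref
        # keep the usual increase of one segment per round trip.
        return 1.0
    return scale


if __name__ == "__main__":
    rtt_ref = 25  # ms, example value only
    for rtt in (12.5, 25, 50):
        print(rtt,
              additive_increase_per_rtt(rtt, rtt_ref),
              additive_increase_per_rtt(rtt, rtt_ref, allow_faster_ai=True))
```

   The reasoning behind the squared ratio: adding (RTT/RTT_ref)^2
   segments every RTT grows the window by RTT/RTT_ref^2 segments per
   unit time, so the sending rate (window divided by RTT) grows by
   1/RTT_ref^2 per unit time squared, which does not depend on the
   flow's own RTT.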

1142	3.5.  Scaling Down to Fractional Windows

1144	   A modification to v5.0 of the Linux TCP stack that scales down to
1145	   sub-packet windows is available for research purposes via the L4S
1146	   landing page [L4S-home].  The L4S Prague Requirements in section 4.3
1147	   of [I-D.ietf-tsvwg-ecn-l4s-id] recommend but no longer mandate
1148	   scaling down to sub-packet windows.  This is because becoming
1149	   unresponsive at a minimum window is a tradeoff between protecting
1150	   against other unresponsive flows and the extra queue you induce by
1151	   becoming unresponsive yourself.  So this code is not maintained as
1152	   part of the Linux implementation of TCP Prague.

1154	   Firstly, the stack has to be modified to maintain a fractional
1155	   congestion window.  Then, because the ACK clock cannot work below 1
1156	   packet per RTT, the code sets the time to send each packet, then
1157	   readjusts the timing as each ACK arrives (otherwise any queuing
1158	   accumulates a burst in subsequent rounds).  Also, additive increase
1159	   of one segment does not scale below a 1-segment window.  So instead
1160	   of a constant additive increase, the code uses a logarithmically
1161	   scaled additive increase that slowly adapts the additive increase
1162	   constant to the slow start threshold.  Despite these quite radical
1163	   changes, the diff is surprisingly small.  The design and
1164	   implementation are explained in [Ahmed19], which also includes
1165	   evaluation results.
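
   To illustrate only the timing aspect described above, the sketch
   below spaces packets by RTT/cwnd when the window is below one
   segment, and re-derives the next send time when an ACK brings a new
   RTT measurement.  The class, its methods and the use of the raw RTT
   sample without smoothing are assumptions made for this sketch; it
   does not reproduce the code in [Ahmed19], and the logarithmically
   scaled additive increase is deliberately not attempted here.

```python
class FractionalWindowPacer:
    """Hypothetical timer-based sender for a window below one segment."""

    def __init__(self, cwnd, rtt):
        if cwnd <= 0 or rtt <= 0:
            raise ValueError("cwnd and rtt must be positive")
        self.cwnd = cwnd          # congestion window in segments, may be < 1
        self.rtt = rtt            # latest RTT estimate (any time unit)
        self.last_send = None     # time the previous packet was sent
        self.next_send = None     # earliest time the next packet may go

    def interval(self):
        # cwnd segments per RTT means one segment every rtt / cwnd.
        return self.rtt / self.cwnd

    def on_send(self, now):
        # Schedule the following packet relative to this departure.
        self.last_send = now
        self.next_send = now + self.interval()

    def on_ack(self, rtt_sample, cwnd):
        # Readjust the schedule with fresh information, so that queuing
        # seen in this round does not accumulate into a later burst.
        self.rtt = rtt_sample
        self.cwnd = cwnd
        if self.last_send is not None:
            self.next_send = self.last_send + self.interval()


if __name__ == "__main__":
    pacer = FractionalWindowPacer(cwnd=0.5, rtt=20)   # times in ms
    pacer.on_send(now=100)
    print(pacer.next_send)
    pacer.on_ack(rtt_sample=30, cwnd=0.25)
    print(pacer.next_send)
```

   With a window of 0.5 segments, this sketch sends one packet every two
   round trips, which an ACK clock alone could not achieve.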

1167	4.  IANA Considerations

1169	   This specification contains no IANA considerations.

1171	5.  Security Considerations

1173	   Section 3.5 on scaling down to fractional windows discusses the
1174	   tradeoff in becoming unresponsive at a minimum window, which causes a
1175	   queue to build (harm to self and to others) but protects oneself
1176	   against other unresponsive flows (whether malicious or accidental).

1178	   This draft inherits the security considerations discussed in
1179	   [I-D.ietf-tsvwg-ecn-l4s-id] and in the L4S architecture
1180	   [I-D.ietf-tsvwg-l4s-arch].  In particular, these cover the
1181	   self-interest incentive to be responsive and minimize queuing
1182	   delay, and protections against those interested in disrupting the
1183	   low queuing delay of others.

1185	6.  Acknowledgements

1187	   Bob Briscoe's contribution was part-funded by the Comcast Innovation
1188	   Fund.  The views expressed here are solely those of the authors.

1190	7.  Comments and Contributions Solicited (To be removed before
1191	    Publication)

1193	   Comments and questions are encouraged and very welcome.  They can be
1194	   addressed to the IRTF Internet Congestion Control Research Group's
1195	   mailing list <iccrg@irtf.org>, and/or to the authors via <draft-
1196	   briscoe-iccrg-congestion-control@ietf.org>.  Contributions of design
1197	   ideas and/or code are also encouraged and welcome.

1199	8.  Contributors

1201	   The following contributed implementations and evaluations that
1202	   validated and helped to improve this specification:

1204	      Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> of Nokia
1205	      Bell Labs, Belgium, prepared and maintains the Linux
1206	      implementation of TCP Prague.

1208	      Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> of Nokia
1209	      Bell Labs, Belgium, contributed to the Linux implementation of TCP
1210	      Prague.

1212	      Joakim Misund <joakim.misund@gmail.com> of Uni Oslo, Norway, wrote
1213	      the Linux paced chirping code.

1215	      Asad Sajjad Ahmed <me@asadsa.com>, Independent, Norway, wrote the
1216	      Linux code that maintains a sub-packet window.

1218	9.  References

1220	9.1.  Normative References

1222	   [I-D.ietf-tcpm-accurate-ecn]
1223	              Briscoe, B., Kuehlewind, M., and R. Scheffenegger, "More
1224	              Accurate ECN Feedback in TCP", draft-ietf-tcpm-accurate-
1225	              ecn-13 (work in progress), November 2020.

1227	   [I-D.ietf-tsvwg-ecn-l4s-id]
1228	              Schepper, K. and B. Briscoe, "Identifying Modified
1229	              Explicit Congestion Notification (ECN) Semantics for
1230	              Ultra-Low Queuing Delay (L4S)", draft-ietf-tsvwg-ecn-l4s-
1231	              id-12 (work in progress), November 2020.

1233	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1234	              Requirement Levels", BCP 14, RFC 2119,
1235	              DOI 10.17487/RFC2119, March 1997,
1236	              <https://www.rfc-editor.org/info/rfc2119>.

1238	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
1239	              of Explicit Congestion Notification (ECN) to IP",
1240	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
1241	              <https://www.rfc-editor.org/info/rfc3168>.

1243	   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
1244	              Notification (ECN) Experimentation", RFC 8311,
1245	              DOI 10.17487/RFC8311, January 2018,
1246	              <https://www.rfc-editor.org/info/rfc8311>.

1248	9.2.  Informative References

1250	   [Ahmed19]  Ahmed, A., "Extending TCP for Low Round Trip Delay",
1251	              Masters Thesis, Uni Oslo, August 2019,
1252	              <https://www.duo.uio.no/handle/10852/70966>.

1254	   [ecn-fallback]
1255	              Briscoe, B. and A. Ahmed, "TCP Prague Fall-back on
1256	              Detection of a Classic ECN AQM", bobbriscoe.net Technical
1257	              Report TR-BB-2019-002, April 2020,
1258	              <https://arxiv.org/abs/1911.00710>.

1260	   [I-D.ietf-avtcore-cc-feedback-message]
1261	              Sarker, Z., Perkins, C., Singh, V., and M. Ramalho, "RTP
1262	              Control Protocol (RTCP) Feedback for Congestion Control",
1263	              draft-ietf-avtcore-cc-feedback-message-09 (work in
1264	              progress), November 2020.

1266	   [I-D.ietf-quic-transport]
1267	              Iyengar, J. and M. Thomson, "QUIC: A UDP-Based Multiplexed
1268	              and Secure Transport", draft-ietf-quic-transport-34 (work
1269	              in progress), January 2021.

1271	   [I-D.ietf-tcpm-generalized-ecn]
1272	              Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit
1273	              Congestion Notification (ECN) to TCP Control Packets",
1274	              draft-ietf-tcpm-generalized-ecn-06 (work in progress),
1275	              October 2020.

1277	   [I-D.ietf-tcpm-hystartplusplus]
1278	              Balasubramanian, P., Huang, Y., and M. Olson, "HyStart++:
1279	              Modified Slow Start for TCP", draft-ietf-tcpm-
1280	              hystartplusplus-01 (work in progress), January 2021.

1282	   [I-D.ietf-tcpm-rack]
1283	              Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "The
1284	              RACK-TLP loss detection algorithm for TCP", draft-ietf-
1285	              tcpm-rack-15 (work in progress), December 2020.

1287	   [I-D.ietf-tsvwg-aqm-dualq-coupled]
1288	              Schepper, K., Briscoe, B., and G. White, "DualQ Coupled
1289	              AQMs for Low Latency, Low Loss and Scalable Throughput
1290	              (L4S)", draft-ietf-tsvwg-aqm-dualq-coupled-13 (work in
1291	              progress), November 2020.

1293	   [I-D.ietf-tsvwg-l4s-arch]
1294	              Briscoe, B., Schepper, K., Bagnulo, M., and G. White, "Low
1295	              Latency, Low Loss, Scalable Throughput (L4S) Internet
1296	              Service: Architecture", draft-ietf-tsvwg-l4s-arch-08 (work
1297	              in progress), November 2020.

1299	   [L4S-home]
1300	              "L4S: Ultra-Low Queuing Delay for All",
1301	              <https://riteproject.eu/dctth/#code>.

1303	   [LinuxPacedChirping]
1304	              Misund, J. and B. Briscoe, "Paced Chirping - Rethinking
1305	              TCP start-up", Proc. Linux Netdev 0x13, March 2019,
1306	              <https://www.netdevconf.org/0x13/session.html?talk-chirp>.

1308	   [patch-alpha-zero]
1309	              Shewmaker, A., "tcp: allow dctcp alpha to drop to zero",
1310	              Linux GitHub patch; Commit: c80dbe0, October 2015,
1311	              <https://github.com/torvalds/linux/commits/master/net/
1312	              ipv4/tcp_dctcp.c>.

1314	   [patch-loss-react]
1315	              De Schepper, K., "tcp: Ensure DCTCP reacts to losses",
1316	              Linux GitHub patch; Commit: aecfde2, April 2019,
1317	              <https://github.com/torvalds/linux/commits/master/net/
1318	              ipv4/tcp_dctcp.c>.

1320	   [PerAckEWMA]
1321	              Briscoe, B., "Improving DCTCP/Prague Congestion Control
1322	              Responsiveness", Technical Report TR-BB-2020-002, January
1323	              2021, <https://arxiv.org/abs/2101.07727>.

1325	   [PragueLinux]
1326	              Briscoe, B., De Schepper, K., Albisser, O., Misund, J.,
1327	              Tilmans, O., Kuehlewind, M., and A. Ahmed, "Implementing
1328	              the `TCP Prague' Requirements for Low Latency Low Loss
1329	              Scalable Throughput (L4S)", Proc. Linux Netdev 0x13,
1330	              March 2019, <https://www.netdevconf.org/0x13/
1331	              session.html?talk-tcp-prague-l4s>.

1333	   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
1334	              RFC 3649, DOI 10.17487/RFC3649, December 2003,
1335	              <https://www.rfc-editor.org/info/rfc3649>.

1337	   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
1338	              Congestion Control Protocol (DCCP)", RFC 4340,
1339	              DOI 10.17487/RFC4340, March 2006,
1340	              <https://www.rfc-editor.org/info/rfc4340>.

1342	   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
1343	              RFC 4960, DOI 10.17487/RFC4960, September 2007,
1344	              <https://www.rfc-editor.org/info/rfc4960>.

1346	   [RFC5033]  Floyd, S. and M. Allman, "Specifying New Congestion
1347	              Control Algorithms", BCP 133, RFC 5033,
1348	              DOI 10.17487/RFC5033, August 2007,
1349	              <https://www.rfc-editor.org/info/rfc5033>.

1351	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1352	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
1353	              <https://www.rfc-editor.org/info/rfc5681>.

1355	   [RFC6679]  Westerlund, M., Johansson, I., Perkins, C., O'Hanlon, P.,
1356	              and K. Carlberg, "Explicit Congestion Notification (ECN)
1357	              for RTP over UDP", RFC 6679, DOI 10.17487/RFC6679, August
1358	              2012, <https://www.rfc-editor.org/info/rfc6679>.

1360	   [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
1361	              "Increasing TCP's Initial Window", RFC 6928,
1362	              DOI 10.17487/RFC6928, April 2013,
1363	              <https://www.rfc-editor.org/info/rfc6928>.

1365	   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
1366	              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
1367	              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
1368	              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

1370	   [RFC8298]  Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation
1371	              for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December
1372	              2017, <https://www.rfc-editor.org/info/rfc8298>.

1374	   [Tensions17]
1375	              Briscoe, B. and K. De Schepper, "Resolving Tensions
1376	              between Congestion Control Scaling Requirements", Simula
1377	              Technical Report TR-CS-2016-001; arXiv:1904.07605, July
1378	              2017, <https://arxiv.org/abs/1904.07605>.

1380	Authors' Addresses

1382	   Koen De Schepper
1383	   Nokia Bell Labs
1384	   Antwerp
1385	   Belgium

1387	   Email: koen.de_schepper@nokia.com
1388	   URI:   https://www.bell-labs.com/usr/koen.de_schepper
1389	   Olivier Tilmans
1390	   Nokia Bell Labs
1391	   Antwerp
1392	   Belgium

1394	   Email: olivier.tilmans@nokia-bell-labs.com

1396	   Bob Briscoe (editor)
1397	   Independent
1398	   UK

1400	   Email: ietf@bobbriscoe.net
1401	   URI:   http://bobbriscoe.net/