Prague Congestion Control

Nokia Bell Labs, Antwerp, Belgium
koen.de_schepper@nokia.com
https://www.bell-labs.com/usr/koen.de_schepper

Nokia Bell Labs, Antwerp, Belgium
olivier.tilmans@nokia-bell-labs.com

Independent, UK
ietf@bobbriscoe.net
http://bobbriscoe.net/

IRTF
Internet Congestion Control Research Group (ICCRG)
Internet-Draft (I-D)

This specification defines the Prague congestion control scheme,
which is derived from DCTCP and adapted for Internet traffic by
implementing the Prague L4S requirements. Over paths with L4S support at
the bottleneck, it adapts the DCTCP mechanisms to achieve consistently
low latency and full throughput. It is defined independently of any
particular transport protocol or operating system, but notes are added
that highlight issues specific to certain transports and OSs. It is
mainly based on the current default options of the reference Linux
implementation of TCP Prague, but it includes experience from other
implementations where available. It separately describes non-default and
optional parts, as well as future plans.

The implementation does not satisfy all the Prague requirements (yet)
and the IETF might decide that certain requirements need to be relaxed
as an outcome of the process of trying to satisfy them all. In two
cases, research code is replaced by placeholders until full evaluation
is complete.

This document defines the Prague congestion control. It is defined
independent of any particular transport protocol or operating system,
but notes are added that highlight issues specific to certain transports
and OSs. The authors are most familiar with the reference implementation
of Prague on Linux over TCP. So that forms the basis of the large
majority of platform-specific notes. Nonetheless, wherever possible,
experience from implementers on other platforms is included, and the
intention is to gather more into this document during the drafting
process.

The Prague CC is intended to maintain consistently low queuing delay
over network paths that offer L4S support at the bottleneck. Where the
bottleneck does not support L4S, the Prague CC is intended to fall back
to behaving like a conventional 'Classic' congestion control. L4S stands
for Low Latency, Low Loss Scalable throughput. L4S support in the
network involves Active Queue Management (AQM) with a very shallow
target queueing delay (of the order of a millisecond) that applies
immediate Explicit Congestion Notification (ECN). 'Immediate ECN' means
that the network applies ECN marking based on the instantaneous queue,
without any smoothing or filtering. The Prague CC takes on the job of
smoothing and filtering the congestion signals from the network.

The Prague CC is a particular instance of a scalable congestion
control, which is defined in .
Scalable congestion control is the part of the L4S architecture that
does the actual work of maintaining low queuing delay and ensuring that
the delay and throughput properties scale with flow rate.

The L4S architecture places
the host congestion control in the context of the other parts of the
system, in particular the different types of L4S AQM in the network and
the codepoints in the IP-ECN field that convey to the network that the
host supports the L4S form of ECN. The architecture document also covers
other issues such as: incremental deployment; protection of low latency
queues against accidental or malicious disruption; and the relationship
of L4S to other low latency technologies. The specification of the L4S
ECN Protocol sets down the
requirements that the Prague CC has to follow (called the Prague L4S
Requirements - see for a
summary).

Links to implementations of the Prague CC and other scalable
congestion controls (all open source) can be found via the L4S landing
page , which also links to numerous other
L4S-related resources. A (slightly dated) paper on the specific
implementation of the Prague CC in Linux over TCP is also available
.

The Prague CC is capable of keeping queuing delay consistently low
while fully utilizing available capacity. In contrast, Classic
congestion controls need to induce a reasonably large queue
(approaching a bandwidth-delay product) in order to fully utilize
capacity. Therefore, prior to scalable CCs like DCTCP and Prague, it
was believed that very low delay was only possible by limiting
throughput and isolating the low delay traffic from capacity-seeking
traffic.

The Prague CC uses additive increase multiplicative decrease
(AIMD), in which it increases its window until an ECN mark (or loss)
is detected, then yields in a continual sawtooth pattern. The key to
keeping queuing delay low without under-utilizing capacity is to keep
the sawteeth tiny. For example, the average duration of a Prague CC
sawtooth is of the order of a round trip, whereas a Classic congestion
control sawtooths over hundreds of round trips. For instance, over an
RTT of 36 ms, at 100 Mb/s Cubic takes about 106 round trips to recover,
and at 800 Mb/s its recovery time triples to over 340 round trips, or
still more than 12 seconds (Reno would take 57 seconds).

Keeping the sawtooth amplitude down keeps queue variation down and
utilization up. Keeping the duration of the sawteeth down ensures
control remains tight. The definition of a scalable CC is that the
duration between congestion marks does not increase as flow rate
scales, all other factors being equal. This is important, because it
means that the sawteeth will always stay tiny. So queue delay will
remain very low, and control will remain very tight.

The tip of each sawtooth occurs when the bottleneck link emits a
congestion signal. Therefore such small sawteeth are only feasible
when ECN is used for the congestion signals. If loss were used, the
loss level would be prohibitively high. This is why L4S-ECN has to
depart from the requirement of Classic ECN
that an ECN mark is equivalent to a loss. Otherwise, the
response to the high level of ECN marking would have to be as great as
the response to an equivalent level of loss.

The Prague CC is derived from Data Center TCP (DCTCP ). DCTCP is confined to controlled environments like
data centres precisely because it uses such small sawteeth, which
induce such a high level of congestion marking. For a CC using Classic
ECN, this would be interpreted as equivalent to the same, very high,
loss level. The Classic CC would then continually drive its own rate
down in the face of such an apparently high level of congestion.

This is why coexistence with existing traffic is important for the
Prague CC. It has to be able to detect whether it is sharing the
bottleneck with Classic traffic, and if so fall back to behaving in a
Classic way. If the bottleneck does not support ECN at all, that is
easy - the Prague CC just responds in the Classic way to loss (see
). But if it is sharing the
bottleneck with Classic ECN traffic, this is more difficult to detect
(see ). Because the Prague CC
removes most of the queue, it also addresses RTT-dependence.
Otherwise, at low base RTTs, it would become far more RTT-dependent
than Classic CCs.

There is no 'One True Prague CC'. L4S is intended to enable
development of any scalable CC that meets the L4S Prague requirements
. This document attempts to
describe a reference implementation and attempts to generalize it to
different transports and OS platforms. The implementation does not
satisfy all the Prague requirements (yet), and the IETF might decide
that certain requirements need to be relaxed as an outcome of the
process of trying to satisfy them all.

The field of congestion control is always a work in progress.
However, there are areas of the Prague CC that are still just
placeholders while separate research code is evaluated. And in other
implementations of the Prague CC, other areas are incomplete. In the
Linux reference implementation of TCP Prague, interim code is used in
the incomplete areas, which are:

*  Flow start and restart (standard slow start is used, even
   though it often exits early in L4S environments where ECN marking
   tends to be frequent);

*  Faster than additive increase (standard additive increase is
   used, which makes the flow particularly sluggish if it has dropped
   out of slow start early).

The body of this document describes the Prague CC as
implemented. Any non-default options or any planned improvements are
separated out into the section on "Variants and
Future Work". As each of the above areas is addressed, it will be
removed from this section and its description in the body of the
document will be updated. Once all areas are complete, this section
will be removed. Prague CC will then still be a work in progress, but
only on a similar footing as all other congestion controls.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in when, and only when, they appear in all capitals,
as shown here.

Definitions of terms:

Classic Congestion Control:  A congestion control behaviour that can
   co-exist with standard TCP Reno without causing significantly
   negative impact on its flow rate . With Classic congestion
   controls, as flow rate scales, the number of round trips between
   congestion signals (losses or ECN marks) rises with the flow rate.
   So it takes longer and longer to recover after each congestion
   event. Therefore control of queuing and utilization becomes very
   slack, and the slightest disturbance prevents a high rate from
   being attained .

Scalable Congestion Control:  A congestion control where the average
   time from one congestion signal to the next (the recovery time)
   remains invariant as the flow rate scales, all other factors being
   equal. This maintains the same degree of control over queueing and
   utilization whatever the flow rate, as well as ensuring that high
   throughput is robust to disturbances. For instance, DCTCP averages
   2 congestion signals per round-trip whatever the flow rate. For the
   public Internet a Scalable transport has to comply with the
   requirements in Section 4 of (aka. the 'Prague L4S requirements').

Response function:  The relationship between the window (cwnd) of a
   congestion control and the congestion signalling probability, p, in
   steady state. A general response function has the form
   cwnd = K/p^B, where K and B are constants. In an approximation of
   the response function of the standard Reno CC, B=1/2. For a
   scalable congestion control B=1, so its response function takes the
   form cwnd = K/p. The number of congestion signals per round is
   p*cwnd, which equates to the constant, K, for a scalable CC. Hence
   the definition of a scalable CC above.

Reno-friendly:  The subset of Classic traffic that excludes
   unresponsive traffic and excludes experimental congestion controls
   intended to coexist with Reno but without always being strictly
   friendly to it (as allowed by ). Reno-friendly is used in place of
   'TCP-friendly', given that the TCP protocol is used with many
   different congestion control behaviours.

Classic ECN:  The original Explicit Congestion Notification (ECN)
   protocol , which requires ECN signals to be treated the same as
   drops, both when generated in the network and when responded to by
   the sender.

ECN codepoints:  The names used for the four codepoints of the 2-bit
   IP-ECN field are as defined in : Not ECT, ECT(0), ECT(1) and CE,
   where ECT stands for ECN-Capable Transport and CE stands for
   Congestion Experienced.

ECN-marked:  A packet marked with the CE codepoint is termed
   'ECN-marked' or sometimes just 'marked' where the context makes ECN
   obvious.

CC:  Congestion Control

ACK:  an ACKnowledgement, or to ACKnowledge

EWMA:  Exponentially Weighted Moving Average

RTT:  Round Trip Time

Definitions of Parameters and Variables:

MTU_BITS:  Maximum transmission unit [b]

cwnd:  Congestion window [B]

ssthresh:  Slow start threshold [B]

inflight:  The amount of data that the sender has sent but not yet
   received ACKs for [B]

p:  Steady-state probability of drop or marking []

alpha:  EWMA of the ECN marking fraction []

acked_sacked:  the amount of new data acknowledged by an ACK [B]

ece_delta:  the amount of newly acknowledged data that was ECN-marked
   [B]

ai_per_rtt:  additive increase to apply per RTT [B]

srtt:  Smoothed round trip time [s]

MAX_BURST_DELAY:  Maximum allowed bottleneck queuing delay due to
   segmentation offload bursts [s] (default 0.25 ms for the public
   Internet)

The beneficial properties of L4S traffic (low queuing delay, etc.)
depend on all L4S sources satisfying a set of conditions called the
Prague L4S Requirements. The name is after an ad hoc meeting of about
thirty people co-located with the IETF in Prague in July 2015, the day
after the first public demonstration of L4S.

The meeting agreed a list of modifications to DCTCP to focus activity on a variant that would be safe
to use over the public Internet. It was suggested that this could be
called TCP Prague to distinguish it from DCTCP. This list was adopted
by the IETF, and has continued to evolve (see section 4 of ). The requirements are no longer
TCP-specific, applying irrespective of wire-protocol (TCP, QUIC, RTP,
SCTP, etc).

This unusual start to the life of the project led to the unusual
development process of a reference implementation that had to resolve
a number of ambitious requirements, already known to be in tension
.

DCTCP already implements a scalable congestion control. So most of
the changes to make it usable over the Internet seemed trivial, some
'merely' involving adoption of other parallel developments like
Accurate ECN TCP feedback or RACK. Others have been more challenging
(e.g. RTT-independence). And others that seemed trivial became
challenging given the complex set of bugs and behaviours that
characterize today's Internet and the Linux stack.

The more critical implementation challenges are highlighted in the
following sections, in the hope we can prevent mistakes being repeated
(see for instance , ). There was also a set of five intertwined 'bugs'
- all masking each other, but causing unpredictable or poor
performance as different code modifications unmasked them. A draft
write-up about these has been prepared, which is longer than the whole
of the present document, so it will be included by reference once
published.

During the development process, we have unearthed fundamental
aspects of the implementation and indeed the design of DCTCP and
Prague that have still not caught up with the paradigm shift from
existence-based to extent-based congestion response. Some have been
implemented by default, e.g. not suppressing additive increase for a
round trip after a congestion event ().
Others have been implemented but not fully evaluated, e.g. removing
the 1-2 unnecessary round trips of lag in feedback processing () and yet others are still future
plans, e.g. further RTT-independence () and exploiting combined
congestion metrics in more cases ().

The requirements are categorized into those that would impact other
flows if not handled properly and performance optimizations that are
important but optional from the IETF's point of view, because they
only affect the flow itself. The list below maps the order of the
requirements in to the
order in this document (which is by functional categories and code
status):

*  L4S-ECN packet identification: use of ECT(1) ()

*  Accurate ECN feedback ()

*  Reno-friendly response to a loss ()

*  Detection of a classic ECN AQM ()

*  Reduced RTT dependence ()

*  Scaling down to a fractional window (no longer mandatory,
   see )

*  Detecting loss in units of time ()

*  Minimizing bursts ()

*  ECN-capable control packets ()

*  Faster flow start ()

*  Faster than additive increase ()

*  Segmentation offload ()

On the public Internet, a sender using the Prague CC MUST set the
ECT(1) codepoint on all the packets it sends, in order to identify
itself as an L4S-capable congestion control (Req 4.1 ).

This applies whatever the transport protocol, whether TCP, QUIC,
RTP, etc. In the case of TCP, unlike an RFC 3168 TCP ECN transport, a
sender can set all packets as ECN-capable, including TCP control
packets and retransmissions , .

The Prague CC SHOULD optionally be configurable to use the ECT(0)
codepoint in private networks, such as data centres, which might be
necessary for backward compatibility with DCTCP deployments where
ECT(1) might already have another usage.

Implementation note: The Linux kernel was updated
to allow the ECT(1) flag to be set from within a CC module. The
Prague CC then has full control over the ECN code point it uses at
any one time. In this way it enforces the use of ECT(1) (or
optionally ECT(0)) and non-ECT when required.

When feedback of ECN markings was added to TCP , it was decided not to report any more than one
mark per RTT. L4S-capable congestion controls need to know the
extent, not just the existence of congestion (Req 4.2. ). Recently defined transports
(DCCP, QUIC, etc) typically already satisfy this requirement. So
they are dealt with separately below, while TCP and derivatives such
as SCTP are covered first.

The TCP wire protocol is being updated to allow more accurate
feedback (AccECN ).
Therefore, in the case where a sender uses the Prague CC over TCP,
whether as client or server:

*  it MUST itself support AccECN;

*  to support AccECN it also has to check that its peer
   supports AccECN during the handshake.

If the peer does not support accurate ECN feedback, the
sender MUST fall back to a Reno-friendly CC behaviour for the rest
of the connection. The non-Prague TCP sender MUST then no longer
set ECT(1) on the packets it sends. Note that the peer only needs
to support AccECN; there is no need (and no way) to find out
whether the peer is using an L4S-capable congestion control.

Note that a sending TCP client that uses the Prague CC can set
ECT(1) on the SYN prior to checking whether the other peer
supports AccECN (as long as it follows the procedure in if it discovers the peer
does not support AccECN).

Implementation note: The Linux kernel has been
updated to support AccECN independently of the CC module in use.
So the kernel tries to negotiate AccECN exchange whichever
congestion control module is selected. An additional check is
provided to verify that the kernel actually does support
AccECN, based on which the Prague CC module will decide to
proceed using scalable CC or fall back to a Classic CC (Reno
in the current implementation).

A
system-wide option is available to disable AccECN negotiation,
but the Prague CC module will always override this setting, as
it depends on AccECN. Then, solely in this case, AccECN will
only be active for TCP flows using the Prague CC.

Transport protocols specified recently, e.g. DCCP , QUIC ,
are unambiguously suitable for Prague CCs, because they were
designed from the start with accurate ECN feedback.

In the case of RTP/RTCP, ECN feedback was added in , which is sufficient for the Prague CC.
However, it is preferable to use the most recent improvements to
ECN feedback in , as used in the
implementation of the L4S variant of SCReAM .

The Prague CC currently maintains a moving average of ECN
feedback in a similar way to DCTCP. This section is provided mainly
because performance has proved to be sensitive to implementation
precision in this area. So first, some background is necessary.

The Prague CC triggers an update of its moving average once per RTT
by recording the packet it sent after the previous update, then
watching for the ACK of that packet to return. To maintain its
moving average, it measures the fraction, frac, of ACKed bytes that
carried ECN feedback over the previous round trip. It then updates
an exponentially weighted moving average (EWMA) of this fraction,
called alpha, using the following algorithm:

   alpha += g * (frac - alpha);

where g is the gain of the EWMA (default 1/16).

Implementation notes:

Alpha is a fraction
between 0 and 1, and it needs to be represented with high
resolution because the larger the bandwidth-delay product (BDP)
of a flow, the smaller the value that alpha converges to (in
steady state alpha = 2/cwnd). In principle, Linux DCTCP
maintains the moving average 'alpha' using the same formula as
Prague CC uses (as above). Linux represents alpha with a 10-bit
integer (with resolution 1/1024). However, up to kernel release
3.19, Linux used integer arithmetic that could not reduce alpha
below 15/1024. Then it was patched so that any value below
16/1024 was rounded down to zero . For a flow with a higher BDP than
128 segments, this means that alpha flip-flops. Once it has
flopped down to zero DCTCP becomes unresponsive until it has
built sufficient queue to flip up to 16/1024. For larger BDPs,
this causes DCTCP to induce larger sawteeth, which loses the
low-queuing-delay and high-utilization intent of the
algorithm.

To resolve the above
problem the implementation of TCP Prague in Linux maintains
upscaled_alpha = alpha/g instead of alpha:

   upscaled_alpha += frac - g * upscaled_alpha;

This technique is the same as Linux uses for the
retransmission timer variables, srtt and mdev. Prague CC also
uses 20 bits for alpha.

Currently the above per-RTT update to the moving average, which
was inherited from DCTCP, is the default in the Prague CC. However,
another approach is being investigated because these per-RTT updates
introduce 1-2 rounds of delay into the congestion response on top
of the inherent round of feedback delay (see in the section on variants and
future work).

After an ACK leaves a gap in the sequence space, a Prague CC is
meant to deem that a loss has occurred using 'time-based units' (Req
4.3. ). This is in
contrast to the traditional approach that counts a hard-coded number
of duplicate ACKs, e.g. the 3 Dup-ACKs specified in . Counting packets rather than time unnecessarily
tightens the time within which parallelized links have to keep
packets in sequence as flow rate scales over the years.

To satisfy this requirement, a Prague CC SHOULD wait until a
certain fraction of the RTT has elapsed before it deems that the gap
is due to packet loss. The reference implementation of TCP Prague in
Linux uses RACK to address this
requirement. An approach similar to TCP RACK is also used in
QUIC.

At the start of a connection, RACK counts 3 DupACKs to detect
loss because the initial smoothed RTT estimate can be inaccurate.
This would depend indirectly on time as long as the initial window
(IW) is paced over a round trip (see ). For instance, if the initial window of
10 segments was paced evenly across the initial RTT then, in the
next round, an implementation that deems there has been a loss after
(say) 1/4 of an RTT can count 1/4 of 10 = 3 DupACKs (rounded up).
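
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under the assumptions already stated (a window paced evenly over one round trip, and a loss-detection delay of 1/4 of an RTT). It is not the RACK algorithm itself, and the function names are hypothetical.

```python
import math


def dupack_threshold(window_segments: int, rtt_fraction: float) -> int:
    """DupACK count equivalent to waiting rtt_fraction of a round trip,
    assuming the window was paced evenly over that round trip."""
    return math.ceil(window_segments * rtt_fraction)


def loss_deemed(gap_age_s: float, srtt_s: float,
                rtt_fraction: float = 0.25) -> bool:
    """Time-based rule: deem a sequence gap to be a loss once it has
    persisted for rtt_fraction of the smoothed RTT."""
    return gap_age_s >= rtt_fraction * srtt_s


if __name__ == "__main__":
    # Initial window of 10 segments, loss deemed after 1/4 of an RTT.
    print(dupack_threshold(10, 0.25))
    # A gap that is 10 ms old on a path with a 36 ms smoothed RTT.
    print(loss_deemed(0.010, 0.036))
```

For the 10-segment initial window, dupack_threshold(10, 0.25) gives the 3 DupACKs of the example. With a larger window the same fraction of the RTT corresponds to proportionately more DupACKs, which is why a fixed count becomes a steadily tighter constraint as flow rate scales.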
Subsequently, as the window grows, RACK shifts to using a fraction
of the RTT for loss detection.

In congestion avoidance phase, a Prague CC uses a similar additive
increase multiplicative decrease (AIMD) algorithm to DCTCP, but with
the following differences:

A Prague CC has to fall back to Reno-friendly behaviour on
detection of a loss (Req 4.3. ). DCTCP falls back to Reno for
the round trip after a loss, and the Linux reference implementation
of TCP Prague inherits this behaviour.

If a Prague CC has already reduced the congestion window due to
ECN feedback less than a round trip before it detects a loss, it MAY
reduce the congestion window by a smaller amount due to the loss, as
long as the reductions due to ECN and the loss are Reno-friendly
when taken together.

See for discussion of
future work on congestion control using a combination of delay, ECN
and loss.

Implementation note: A Prague CC cannot rely
on the fall-back-on-loss behaviour of the DCTCP code in the
Linux kernel prior to v5.1, due to a previous bug in the fast
retransmit code (but not in the retransmission timeout code)
.

The Prague CC currently responds to ECN feedback in a similar way
to DCTCP. This section is provided mainly because performance has
proved to be sensitive to implementation details in this area. So
the following recap of the congestion response is needed first.

As explained in , the Prague CC
(like DCTCP) clocks its moving average of ECN-marking, alpha, once
per round trip throughout a connection. Nonetheless, it only
triggers a multiplicative decrease to its congestion window when it
actually receives an ACK carrying ECN feedback. Then it suppresses
any further decreases for one round trip, even if it receives
further ECN feedback. This is termed Congestion Window Reduced or
CWR state.

The Prague CC (like DCTCP) ensures that the average recovery time
remains invariant as flow rate scales (Req 4.3 of ) by making the multiplicative
decrease depend on the prevailing value of alpha as follows:

   ssthresh = (1 - alpha/2) * cwnd;

Implementation notes:

With reference to the earlier
discussion of integer arithmetic precision (), alpha = g * upscaled_alpha.

Typically the
absolute reduction in the window is only a small number of
segments. So, if the Prague CC implementation counts the window
in integer segments (as in the Linux reference code), delay can
be made significantly less jumpy by tracking a fractional value
alongside the integer window and carrying over any fractional
remainder to the next reduction. Also, integer rounding bias
ought to be removed from the multiplicative decrease
calculation.

In dynamic scenarios, as flows find a new operating point, alpha
will have often tailed away to near-nothing before the onset of
congestion. Then DCTCP's tiny reduction followed by no further
response for a round is precisely the wrong way for a CC to respond.
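
The effect can be illustrated with a small Python sketch that applies the EWMA update and the multiplicative decrease given earlier. It is a floating-point illustration only, not the integer arithmetic of the Linux code. The function names, the starting value of alpha (0.5), the 64 unmarked rounds and the 100-segment window are all assumptions chosen for the example.

```python
G = 1 / 16  # EWMA gain (the default stated earlier)


def update_alpha(alpha: float, frac: float, g: float = G) -> float:
    """Per-RTT EWMA update: alpha += g * (frac - alpha)."""
    return alpha + g * (frac - alpha)


def reduced_window(cwnd: float, alpha: float) -> float:
    """Multiplicative decrease: ssthresh = (1 - alpha/2) * cwnd."""
    return (1 - alpha / 2) * cwnd


def alpha_after_unmarked_rounds(alpha: float, rounds: int) -> float:
    """Let alpha decay over a number of round trips with no ECN marks."""
    for _ in range(rounds):
        alpha = update_alpha(alpha, 0.0)
    return alpha


if __name__ == "__main__":
    cwnd = 100.0                                    # assumed window, in segments
    alpha = alpha_after_unmarked_rounds(0.5, 64)    # assumed quiet period
    # The decrease is triggered by the first marked ACK, so it uses the
    # prevailing (decayed) alpha, before the next per-RTT update.
    print(f"alpha = {alpha:.4f}")
    print(f"reduction = {cwnd - reduced_window(cwnd, alpha):.2f} segments")
```

With these assumed numbers alpha decays to below 1%, so the first decrease removes less than half a segment from a 100-segment window, whereas a Reno-style halving (alpha = 1) would remove 50.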
A solution to this problem is being evaluated as part of the work
already mentioned to improve Prague's responsiveness (see in the section on variants and
future work).

Unlike DCTCP, the Prague CC does not suppress additive increase
for one round trip after a congestion window reduction (while in CWR
state). Instead, a Prague CC applies additive increase irrespective
of its CWR state, but only for bytes that have been ACK'd without
ECN feedback. Specifically, on each ACK,

   cwnd += (acked_sacked - ece_delta) * ai_per_rtt / cwnd;

where:

*  acked_sacked is the number of new bytes acknowledged by the
   ACK;

*  ece_delta is the number of newly acknowledged ECN-marked
   bytes;

*  ai_per_rtt is a scaling factor that is typically 1 SMSS
   except for small RTTs (see )

Superficially, the traditional suppression of additive increase
for the round after a decrease seems to make sense. However, DCTCP
and Prague are designed to induce an average of 2 congestion marks
per RTT in steady state, which leaves very little space for any
increase between the end of one round of CWR and the next mark. In
tests, when a test version of Prague CC is configured to completely
suppress additive increase during CWR (like Reno and DCTCP), its
sawteeth become more irregular, which is its way of making some
decreases large enough to open up enough space for an increase. This
irregularity tends to reduce link utilization. Therefore, the
reference Prague CC continues additive increase irrespective of CWR
state.

Nonetheless, rather than continue additive increase regardless of
congestion, it is safer to only increase on those ACKs that do not
feed back congestion. This approach reduces additive increase as the
marking probability increases, which tends to keep the marking level
unsaturated (below 100%) (see Section 3.1 of ). Under stable conditions, Prague's congestion
window then becomes proportional to (1-p)/p, rather than 1/p.

See also 'Faster than Additive Increase' ()

The window-based AIMD described so far was inherited from Reno
via DCTCP. When many long-running Reno flows share a link, their
relative packet rates become roughly inversely proportional to RTT
(packet rate =~ 1/RTT). Then a flow with very small RTT will
dominate any flows with larger RTTs.

Queuing delay sets a lower limit to the smallest possible RTT.
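
To illustrate how queuing delay can mask this RTT dependence, the following Python sketch uses the approximation above (packet rate =~ 1/RTT), taking each flow's RTT as its base RTT plus a shared bottleneck queuing delay. The base RTTs (1 ms and 100 ms) and the queuing delays (20 ms and 1 ms) are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def rate_ratio(base_rtt_short_s: float, base_rtt_long_s: float,
               queue_delay_s: float) -> float:
    """Packet rate of the short-RTT flow divided by that of the long-RTT
    flow, assuming rate ~ 1/RTT and RTT = base RTT + shared queuing delay."""
    rtt_short = base_rtt_short_s + queue_delay_s
    rtt_long = base_rtt_long_s + queue_delay_s
    return rtt_long / rtt_short


if __name__ == "__main__":
    # Two flows with assumed base RTTs of 1 ms and 100 ms.
    for queue_delay_s in (0.020, 0.001):
        ratio = rate_ratio(0.001, 0.100, queue_delay_s)
        print(f"queue {queue_delay_s * 1e3:.0f} ms -> ratio {ratio:.1f}")
```

Under these assumptions, shrinking the queue from 20 ms to 1 ms raises the rate imbalance between the two flows from under 6:1 to over 50:1.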
So, prior to the extremely low queuing delay of L4S, extreme cases
of RTT dependence had never been apparent. Now that L4S has removed
most of the queuing delay, we have to address the root-cause of
RTT-dependence, which the Prague CC is required to do, at least when
the RTT is small (see the 'Reduced RTT bias' aspect of Req 4.3.
). Here, a small RTT is
defined as below the typical RTT for the intended deployment
environment.

A Prague CC reduces RTT bias by using a reference RTT (RTT_ref)
rather than the actual round trip (RTT) for all three of: the window
update period; the EWMA update period; and the duration of CWR state
after a decrease. As the actual window (cwnd) is still sent within 1
actual RTT, we also need to use a (conceptual) reference window,
cwnd_ref. For instance, if RTT_ref = 25 ms then, when the actual RTT
is 5 ms, there are RTT_ref/RTT = 5 times more packets in cwnd_ref,
than in the actual window, cwnd, because it spans 5 actual round
trips. We define M as the ratio RTT_ref/RTT.

In the Linux implementation of TCP Prague, RTT_ref is a function
of the actual RTT. Three functions have been implemented: RTT_ref =
max(RTT, RTT_REF_MIN); RTT_ref = RTT + AdditionalRTT; RTT_ref = ...
{ToDo}. The current default is RTT_ref = max(RTT, 25ms), which
addresses the main Prague requirement for when the RTT is smaller
than typical.

In Reno or DCTCP, additive increase is implemented by dividing
the desired increase of 1 segment per round over the cwnd packets in
the round. This requires an increase of 1/cwnd per packet. In the
Linux implementation of TCP Prague, the aim is to increase the
reference window by 1 segment over a reference round. However, in
practice the increase is applied to the actual window, cwnd, which
is M times smaller than cwnd_ref. So cwnd has to be increased by
only 1/M segments over RTT_ref. But again, in practice, the increase
is applied over an actual window of packets spanning an actual RTT,
which is also M times smaller than the reference RTT. So the desired
increase in cwnd is only 1/M^2 segments over an actual round trip
containing cwnd packets. Therefore, the increase in cwnd per packet
has to be (1/M^2) * (1/cwnd).

Unless a flow lasts long enough for rates to converge, equal
rates will not be relevant. So, the Reduced RTT-Dependence algorithm
only comes into effect after D rounds, where D is configurable
(current default 500). Continuing the previous example, if actual
RTT=5 ms and RTT_ref = 25 ms, then Prague would stop using its
RTT-dependent algorithm after 500*5ms = 2.5s and instead it would
start to converge to equal rates using the Reduced RTT-Dependence
algorithm. If the actual RTT were higher (e.g. 20ms), it would stay
in RTT-dependent mode for longer (10s), but this would be mitigated
by its RTT being closer to the reference (20ms vs. 25ms).

This approach prevents reduced RTT-dependence from making the
flow less responsive at start-up and ensures that its early
throughput share is based on its actual RTT. The benefit is that
short flows (mice) give themselves priority over longer flows
(elephants), and shorter RTTs will still converge faster than longer
RTTs. Nonetheless, the throughput still converges to equal rates
after D rounds.

It is planned to reset the algorithm to be RTT-dependent after an
idle, not just at flow start, as discussed under Future Work in
. also discusses
extending the reduction in RTT-dependence to longer RTTs than
RTT_ref. The current Prague implementation does not support
this.

Currently the Linux reference implementation of TCP Prague uses
the standard Linux slow start code. Slow start is exited once a
single mark is detected.

When other flows are actively filling the link, regular marks are
expected, causing slow start of new flows to end prematurely. This
is clearly not ideal, so other approaches are being worked on (see
). However, slow start has
been left as the default until a properly matured solution is
completed.

The Prague CC SHOULD pace the packets it sends to avoid the
queuing delay and under-utilization that would otherwise be caused
by bursts of packets that can occur, for example, when a jump in the
acknowledgement number opens up cwnd. Prague does this in a similar
way to the base Linux TCP stack, by spacing out the window of
packets evenly over the round trip time, using the following
calculation of the pacing rate [b/s]:

   pacing_rate = MTU_BITS * max(cwnd, inflight) / srtt;

During slow start, as in the base Linux TCP stack, Prague factors
up pacing_rate by 2, so that it paces out packets twice as fast as
they are acknowledged. This keeps up with the doubling of cwnd, but
still prevents bursts in response to any larger transient jumps in
cwnd.

During congestion avoidance, the Linux TCP Prague implementation
does not factor up pacing_rate at all. This contrasts with the base
Linux TCP stack, which currently factors up pacing_rate by a ratio
parameter set to 1.2. The developers of the base Linux stack
confirmed that this factor of 1.2 was only introduced in case it
improved performance, but there were no scenarios where it was known
to be needed. In testing of Prague, this factor was found to cause
queue delay spikes whenever cwnd jumped more than usual, and
throughput was no worse without it. So it was removed from the TCP
Prague CC.

The Prague CC can use alternatives to the traditional slow-start
algorithm, which use different pacing.

In the absence of hardware pacing, it becomes increasingly
difficult for a machine to scale to higher flow rates unless it is
allowed to send packets in larger bursts, for instance using
segmentation offload. Happily, as flow rate scales up,
proportionately more packets can be allowed in a burst for the same
amount of queuing delay at the bottleneck.

Therefore, the Prague CC sends packets in a burst as long as it
will not induce more than MAX_BURST_DELAY of queuing at the
bottleneck. From this constant and the current pacing_rate, it
calculates how many MTU-sized packets to allow in a burst:

    max_burst = pacing_rate * MAX_BURST_DELAY / MTU_BITS

The current default in the Linux TCP Prague for MAX_BURST_DELAY is
250us, which supports marking thresholds starting from about 500us
without underutilization. This approach is similar to that in the
Linux TCP stack, except there MAX_BURST_DELAY is 1ms.

Appendix A.2. of
outlines the performance optimizations needed when transplanting DCTCP
from a DC environment to a wide area network. The following
subsections address two of those points: faster flow startup and
faster than additive increase. A further subsection then covers the
flip side, in which established flows have to yield faster to make
room, otherwise queuing will result.

For faster flow start, two approaches are currently being
investigated in parallel:

The traditional exponential slow start can be modified both at the
start and the end, with the aim of reducing the risk of queuing due
to bursts and overshoot:

A Prague CC can use an initial
window of 10 (IW10), but pacing of
this Initial Window is recommended to try to avoid the pulse
of queuing that could otherwise occur. Pacing IW10 also
spreads the ACKs over the round trip so that subsequent
rounds consist of ten subsets of packets (with 2, 4, 8 etc.
per round in each subset), rather than a single set with 20,
40, 80 etc. in each round. Then, if a queue builds during a
round (e.g. due to other unexpected traffic arriving) it can
drain in the gap before the next subset, rather than the
whole set backing up in a much larger queue.

In the Linux reference implementation of TCP
Prague, IW pacing can be optionally enabled, but it is off
by default, because it is yet to be fully evaluated. It
currently paces IW over half the initial smoothed round trip
time (SRTT) measured during the handshake. SRTT is halved
because the RTT often reduces after the initial handshake.
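
As a rough illustration of the IW pacing just described, the following Python sketch spreads an initial window evenly over half the handshake SRTT. It is not the Linux code: the function names, the use of microseconds, the 1500-byte packet size (MTU_BITS = 12000) and the division of the pacing period into IW equal gaps are all assumptions made for illustration.

```python
MTU_BITS = 1500 * 8  # assumed packet size in bits (illustrative only)


def iw_pacing_interval_us(srtt_us: int, iw_packets: int = 10) -> float:
    """Gap between packet departures when IW is paced over SRTT/2.

    Assumes the pacing period is divided into iw_packets equal gaps.
    """
    if srtt_us <= 0 or iw_packets < 1:
        raise ValueError("srtt_us and iw_packets must be positive")
    return srtt_us / (2 * iw_packets)


def iw_pacing_rate_bps(srtt_us: int, iw_packets: int = 10) -> int:
    """Equivalent pacing rate [b/s]: MTU_BITS * IW / (SRTT / 2)."""
    if srtt_us <= 0 or iw_packets < 1:
        raise ValueError("srtt_us and iw_packets must be positive")
    # Integer arithmetic: SRTT is in microseconds, hence the 1e6 factor.
    return MTU_BITS * iw_packets * 2 * 1_000_000 // srtt_us


if __name__ == "__main__":
    # Handshake SRTT of 40ms with IW10.
    print(iw_pacing_interval_us(40_000))
    print(iw_pacing_rate_bps(40_000))
```

Under these assumptions, a handshake SRTT of 40ms with IW10 gives one packet every 2ms, which is equivalent to pacing at 6Mb/s.
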
Examples of the RTT reducing after the handshake include: i) some
CDNs move the flow to a closer server after establishment; ii) the
initial RTT from a server can include the time to wake a handset
from a battery-saving sleep state; iii) some uplink technologies
take a link-level round trip to request a scheduling slot.

It is
planned to exploit any cached knowledge of the path RTT to
improve the initial estimate, for instance using the Linux
per-destination cache. It is also planned to allow the
application to give an RTT hint (by setting
sk_max_pacing_rate in Linux) if the developer has reason to
believe that the application has a better estimate.

In the
wide area Internet (in contrast to data centres), bottleneck
access links tend to have much less capacity than the line
rate of the sender. With a shallow immediate ECN threshold
at this bottleneck, the slightest burst can tend to induce
an ECN mark, which traditionally causes slow start to exit.
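
The sensitivity to bursts can be seen with some simple arithmetic. The Python sketch below estimates the queuing delay seen by the last packet of a back-to-back burst at a bottleneck. It assumes that the sender's line rate is so much higher than the bottleneck rate that the burst arrives effectively all at once into an empty queue, that packets are 1500 bytes, and that the AQM applies a step threshold of 500us (an illustrative value only). None of the names are taken from any implementation.

```python
MTU_BITS = 1500 * 8  # assumed packet size in bits (illustrative only)


def burst_queue_delay_us(burst_packets: int, bottleneck_bps: int) -> float:
    """Queuing delay [us] of the last packet in a back-to-back burst.

    The first packet starts serializing immediately; each later packet
    waits for the serialization of all the packets ahead of it.
    """
    if burst_packets < 1 or bottleneck_bps <= 0:
        raise ValueError("burst_packets and bottleneck_bps must be positive")
    return (burst_packets - 1) * MTU_BITS * 1_000_000 / bottleneck_bps


def burst_exceeds_threshold(burst_packets: int, bottleneck_bps: int,
                            threshold_us: float = 500.0) -> bool:
    """True if the burst alone pushes queue delay over the step threshold."""
    return burst_queue_delay_us(burst_packets, bottleneck_bps) > threshold_us


if __name__ == "__main__":
    for rate_bps in (10_000_000, 1_000_000_000):
        print(rate_bps, burst_queue_delay_us(2, rate_bps),
              burst_exceeds_threshold(2, rate_bps))
```

Under these assumptions, on a 10Mb/s access link just two back-to-back packets leave the second one queued for 1.2ms, well above the assumed threshold, whereas at 1Gb/s the same pair queues for only 12us.
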
A more gradual exit is being investigated for a Prague CC
using the extent of marking, not just the existence of a
single mark. This will be more consistent with the
extent-based marking that scalable congestion controls use
during congestion avoidance. Delay measurements (similar to
Hystart++) can also be used to complement the ECN signals.

In this approach, the aim is to
both increase more rapidly than exponential slow-start and to
greatly reduce any overshoot. It is primarily a delay-based
approach, but the aim is also to exploit ECN signals when
present (while not forgetting loss either). Therefore Paced
Chirping is generally usable for any congestion control - not
solely for Prague CC and L4S.

Instead of
only aiming to detect capacity overshoot at the end of
flow-start, brief trains of rapidly decreasing inter-packet
spacing called chirps are used to test many rates with as few
packets and as little load as possible. A full description is
beyond the scope of this document. The concepts and the code are
introduced elsewhere, along with citations of the main papers on
Paced Chirping.

Paced chirping works well over continuous links
such as Ethernet and DSL. But better averaging and noise
filtering are necessary over discontinuous link technologies
such as WiFi, LTE cellular radio, passive optical networks (PON)
and data over cable (DOCSIS). This is the current focus of this
work.

The current Linux implementation of
TCP Prague does not include Paced Chirping, but research code is
available separately in Linux and ns3. It is accessible via the
L4S landing page.

The Prague CC has a startup phase and congestion avoidance phase
like traditional CCs. In steady-state during congestion avoidance,
like all scalable congestion controls, it induces frequent ECN
marks, with the same average recovery time between ECN marks, no
matter how much the flow rate scales.

If available capacity suddenly increases, e.g. other flow(s)
depart or the link capacity increases, these regular ECN marks will
stop. Therefore after a few rounds of silence (no ECN marks) in
congestion avoidance phase, the Prague CC can assume that available
capacity has increased, and switch to using the techniques from its
startup phase to rapidly
find the new, faster operating point. Then it can shift back into
its congestion avoidance behaviour.

That is the theory. But, as explained earlier, the startup techniques,
specifically paced chirping, are still being developed for
discontinuous link types. Once the startup behaviour is available,
the Linux implementation of the Prague CC will also have a faster
than additive increase behaviour. S.3.2.3 of gives a brief preview
of the performance of this approach over an Ethernet link type in
ns3.

To keep queuing delay low, new flows can only push in fast if
established flows yield fast. It has recently been realized that the
design of the Prague EWMA and congestion response introduces 1-2
rounds of lag (on top of the inherent round of feedback delay due to
the speed of light). These lags were inherited from the design of
DCTCP, where a couple of hundred extra microseconds
was less noticeable. But congestion control in the wide area
Internet cannot afford up to 2 round trips of extra lag.

To be clear, lag means delay before any response at all starts.
That is qualitatively different from the smoothing gain of an
EWMA, which /reduces/ the response by the gain factor (1/16 by
default) in case a change in congestion does not persist. Smoothing
gain can always be increased. But 1-2 rounds of lag means that, when
a new flow tries to push in, the sender of an established flow will
not respond /at all/ for 1-2 rounds after it first receives
congestion feedback.

The Prague CC spends the first round trip of this lag gathering
feedback to measure frac before it is input into the EWMA algorithm.
Then there is up to one
further round of delay because the implementations of DCTCP and
Prague did not fully adopt the paradigm shift to extent-based
marking - the timing of the decrease is still based on Reno.

Both Reno and DCTCP/Prague respond immediately on the first sign
of congestion. Reno's response is large, so it waits a round in CWR
state to allow the response to take effect. DCTCP's response is tiny
(extent-based), but then it still waits a round in CWR state. So it
does next-to-nothing for a round.

New EWMA and response algorithms to remove these 1-2 extra rounds
of lag are described separately. They have been
implemented in Linux and an iterative process of evaluation and
redesign is in progress. The EWMA is updated per-ACK, but it still
changes as if it is clocked per round trip. The congestion response
is still triggered by the first indication of ECN feedback, but it
proceeds over the subsequent round trip so that it can take into
account further incoming feedback as the EWMA evolves. The reduction
is applied per-ACK but sized to result as if it had been a single
response per round trip.

Ultimately, it would be preferable to take an integrated approach
and use a combination of ECN, loss and delay metrics to drive
congestion control. For instance, using a downward trend in ECN
marking and/or delay as a heuristic to temper the response to loss.
Such ideas are not in the immediate plans for the Linux TCP Prague,
but some more specific ideas are highlighted in the following
subsections.

If the bottleneck is ECN-capable, a loss due to congestion is
very likely to have been preceded by a period of ECN marking. When
the current Linux TCP Prague CC detects a loss, like DCTCP, it
halves cwnd, even if it has already reduced cwnd in the same round
trip due to ECN marking. This double reduction can end up factoring
down cwnd to as little as 1/4 in one round trip.

On a loss while in CWR state following an ECN reduction, it would
be possible to factor down cwnd by 1/(2-alpha), which would compound
with the previous decrease factor of (1-alpha/2) to result in:
(1-alpha/2) / (2-alpha) = 1/2. In integer arithmetic, this division
would be possible but relatively expensive. A less expensive
alternative would be multiplication by (2+alpha)/4, which gives a
compound decrease factor of (1-alpha/2) * (2+alpha)/4 =
(4-alpha^2)/8. This approximates to 1/2 for typical low values of
alpha, even up to 30% (where it is about 0.49). The compound
decrease factor is never greater than 1/2 and, in the worst case,
if alpha was 100%, cwnd would be factored down to 3/8 of its value
before the ECN reduction.

An earlier section described the plans to shift
between using ECN when close to the operating point and using delay
by injecting paced chirps to find a new operating point after the ECN
signal goes silent for a few rounds. Paced chirping shifts more
slowly to the new operating point the more noise there is in the
delay measurements. Work is ongoing on treating any ECN marking as a
complementary metric. The resulting less noisy combined metric
should then allow the controller to shift more rapidly to each new
operating point.

An alternative would be to combine ECN with the BBR approach,
which induces a much less noisy delay signal by using less frequent
but more pronounced delay spikes. The approach currently being taken
is to adapt the chirp length to the degree of noise, so the chirps
only become longer and/or more pronounced when necessary, for
instance when faced with a discontinuous link technology such as
WiFi. With multiple chirps per round, the noise can still be
filtered out by averaging over them all, rather than trying to
remove noise from each spike. This keeps the 'self-harm' to the
minimum necessary, and ensures that capacity is always being
sampled, which removes the risk of going stale.

The implementation of TCP Prague CC in Linux includes an algorithm
to detect a Classic ECN AQM and fall back to Reno as a result, as
required by the 'Coexistence with Classic ECN' aspect of the Prague
Req 4.3.

The algorithm currently used (v2) is relatively simple, but rather
than describe it here, full rationale, pseudocode and explanation can
be found in the technical report about it. This also includes a
selection of the
evaluation results and a link to visualizations of the full results
online. The current algorithm nearly always detects a Classic ECN AQM,
and in the majority of the wide range of scenarios tested it is good
at detecting an L4S AQM. However, it wrongly identifies an L4S AQM as
Classic in a significant minority of cases when the link rate is low,
or the RTT is high. The report gives ideas on how to improve detection
in these scenarios, but in the meantime the algorithm has been
disabled by default.

Recently, the report has been updated to include new ideas on other
ways to distinguish Classic from L4S AQMs. The interested reader can
access it themselves, so this living document will not be further
summarized here.

The algorithm to reduce RTT dependence is only relevant for
long-running flows. So in the current TCP Prague implementation it
remains disabled for a certain number of round trips after the start
of a flow, as explained earlier. It
would be possible to make RTT_ref gradually move from the actual RTT
to the target reference RTT, or perhaps depend on other parameters of
the flow. Nonetheless, just switching in the algorithm after a number
of rounds works well enough. It is planned to also disable the
algorithm for a similar duration if a flow becomes idle then restarts,
but this is yet to be evaluated.

Prague Req 4.3. only
requires reduced RTT bias "in the range between the minimum likely RTT
and typical RTTs expected in the intended deployment scenario". The
current TCP Prague implementation satisfies this requirement.
Nonetheless, it would be
preferable to be able to reduce the RTT bias for high RTT flows as
well.

If a step AQM is used, the congestion episodes of flows with
different RTTs tend to synchronize, which exacerbates RTT bias. To
prevent this, two candidate approaches will need to be investigated:
i) it might be sufficient to deprecate step AQMs for L4S (they are
not the preferred recommendation); or ii) the reference RTT
approach might be usable
for higher than typical RTTs as well as lower. In this latter case,
(RTT/RTT_ref)^2 segments would need to be added to the window per
actual RTT. The current TCP Prague implementation does not support
this faster AI for RTTs higher than RTT_ref, due to the expected (but
unverified) impact on latency overshoot and responsiveness.

A modification to v5.0 of the Linux TCP stack that scales down to
sub-packet windows is available for research purposes via the L4S
landing page. The L4S Prague Requirements (section 4.3) recommend
but no longer mandate scaling down to sub-packet windows. This is
because becoming unresponsive at a minimum window is a tradeoff
between protecting against other unresponsive flows and the extra
queue you induce by becoming unresponsive yourself. So this code is
not maintained as part of the Linux implementation of TCP Prague.

Firstly, the stack has to be modified to maintain a fractional
congestion window. Then, because the ACK clock cannot work below 1
packet per RTT, the code sets the time to send each packet, then
readjusts the timing as each ACK arrives (otherwise any queuing
accumulates a burst in subsequent rounds). Also, additive increase of
one segment does not scale below a 1-segment window. So instead of a
constant additive increase, the code uses a logarithmically scaled
additive increase that slowly adapts the additive increase constant to
the slow start threshold. Despite these quite radical changes, the
diff is surprisingly small. The design and implementation are
explained elsewhere, together with evaluation results.

This specification contains no IANA considerations.

The earlier discussion on scaling down to
fractional windows covers the tradeoff in becoming unresponsive at a
minimum window, which causes a queue to build (harm to self and to
others) but protects oneself against other unresponsive flows (whether
malicious or accidental).

This draft inherits the security considerations discussed for L4S,
including those in the L4S architecture. In particular, the
self-interest
incentive to be responsive and minimize queuing delay, and protections
against those interested in disrupting the low queuing delay of
others.

Bob Briscoe's contribution was part-funded by the Comcast Innovation
Fund. The views expressed here are solely those of the authors.

Comments and questions are encouraged and very welcome. They can be
addressed to the IRTF Internet Congestion Control Research Group's
mailing list <iccrg@irtf.org>, and/or to the authors via
<draft-briscoe-iccrg-congestion-control@ietf.org>. Contributions
of design ideas and/or code are also encouraged and welcome.

The following contributed implementations and evaluations that
validated and helped to improve this specification:

Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> of
Nokia Bell Labs, Belgium, prepared and maintains the Linux
implementation of TCP Prague.

Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> of
Nokia Bell Labs, Belgium, contributed to the Linux implementation of
TCP Prague.

Joakim Misund <joakim.misund@gmail.com> of Uni Oslo,
Norway, wrote the Linux paced chirping code.

Asad Sajjad Ahmed <me@asadsa.com>, Independent, Norway,
wrote the Linux code that maintains a sub-packet window.

Implementing the `TCP Prague' Requirements for Low Latency Low Loss
Scalable Throughput (L4S) (Independent; Nokia Bell Labs; Simula
Research Lab; Simula Research Lab; Nokia Bell Labs; ETH Zurich;
Simula Research Lab)

Improving DCTCP/Prague Congestion Control Responsiveness
(Independent)

tcp: allow dctcp alpha to drop to zero

tcp: Ensure DCTCP reacts to losses

Resolving Tensions between Congestion Control Scaling Requirements

L4S: Ultra-Low Queuing Delay for All

Paced Chirping - Rethinking TCP start-up (University of Oslo;
Independent)

TCP Prague Fall-back on Detection of a Classic ECN AQM
(Independent; Simula and Uni Oslo)

Extending TCP for Low Round Trip Delay (Simula and Uni Oslo)