idnits 2.17.1 draft-briscoe-iccrg-prague-congestion-control-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 465 has weird spacing: '...n Linux kerne...' == Line 510 has weird spacing: '...n Linux kerne...' -- The document date (March 9, 2021) is 1143 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Missing Reference: 'B' is mentioned on line 348, but not defined == Outdated reference: A later version (-28) exists of draft-ietf-tcpm-accurate-ecn-13 == Outdated reference: A later version (-29) exists of draft-ietf-tsvwg-ecn-l4s-id-12 == Outdated reference: A later version (-15) exists of draft-ietf-tcpm-generalized-ecn-06 == Outdated reference: A later version (-14) exists of draft-ietf-tcpm-hystartplusplus-01 == Outdated reference: A later version (-25) exists of draft-ietf-tsvwg-aqm-dualq-coupled-13 == Outdated reference: A later version (-20) exists of draft-ietf-tsvwg-l4s-arch-08 -- Obsolete informational reference (is this intentional?): RFC 4960 (Obsoleted by RFC 9260) Summary: 0 errors (**), 0 flaws (~~), 10 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 2 Internet Congestion Control Research Group (ICCRG) K. De Schepper 3 Internet-Draft O. Tilmans 4 Intended status: Experimental Nokia Bell Labs 5 Expires: September 10, 2021 B. Briscoe, Ed. 6 Independent 7 March 9, 2021 9 Prague Congestion Control 10 draft-briscoe-iccrg-prague-congestion-control-00 12 Abstract 14 This specification defines the Prague congestion control scheme, 15 which is derived from DCTCP and adapted for Internet traffic by 16 implementing the Prague L4S requirements. Over paths with L4S 17 support at the bottleneck, it adapts the DCTCP mechanisms to achieve 18 consistently low latency and full throughput. It is defined 19 independently of any particular transport protocol or operating 20 system, but notes are added that highlight issues specific to certain 21 transports and OSs. It is mainly based on the current default 22 options of the reference Linux implementation of TCP Prague, but it 23 includes experience from other implementations where available. It 24 separately describes non-default and optional parts, as well as 25 future plans. 27 The implementation does not satisfy all the Prague requirements (yet) 28 and the IETF might decide that certain requirements need to be 29 relaxed as an outcome of the process of trying to satisfy them all. 30 In two cases, research code is replaced by placeholders until full 31 evaluation is complete. 33 Status of This Memo 35 This Internet-Draft is submitted in full conformance with the 36 provisions of BCP 78 and BCP 79. 38 Internet-Drafts are working documents of the Internet Engineering 39 Task Force (IETF). Note that other groups may also distribute 40 working documents as Internet-Drafts. The list of current Internet- 41 Drafts is at https://datatracker.ietf.org/drafts/current/. 
43 Internet-Drafts are draft documents valid for a maximum of six months 44 and may be updated, replaced, or obsoleted by other documents at any 45 time. It is inappropriate to use Internet-Drafts as reference 46 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on September 10, 2021. 50 Copyright Notice 52 Copyright (c) 2021 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (https://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 Table of Contents 67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 68 1.1. Motivation: Low Queuing Delay /and/ Full Throughput . . . 4 69 1.2. Document Purpose . . . . . . . . . . . . . . . . . . . . 5 70 1.3. Maturity Status (To be Removed Before Publication) . . . 5 71 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 6 72 2. Prague Congestion Control . . . . . . . . . . . . . . . . . . 8 73 2.1. The Prague L4S Requirements . . . . . . . . . . . . . . . 8 74 2.2. Packet Identification . . . . . . . . . . . . . . . . . . 10 75 2.3. Detecting and Measuring Congestion . . . . . . . . . . . 10 76 2.3.1. Accurate ECN Feedback . . . . . . . . . . . . . . . . 10 77 2.3.1.1. Accurate ECN Feedback with TCP & Derivatives . . 11 78 2.3.1.2. Accurate ECN Feedback with Other Modern 79 Transports . . . . . . . . . . . . . . . . . . . 11 80 2.3.2. Moving Average of ECN Feedback . . . . . . . . . . . 12 81 2.3.3. 
Scaling Loss Detection with Flow Rate . . . . . . . . 13 82 2.4. Congestion Response Algorithm . . . . . . . . . . . . . . 13 83 2.4.1. Fall-Back on Loss . . . . . . . . . . . . . . . . . . 13 84 2.4.2. Multiplicative Decrease on ECN Feedback . . . . . . . 14 85 2.4.3. Additive Increase and ECN Feedback . . . . . . . . . 15 86 2.4.4. Reduced RTT-Dependence . . . . . . . . . . . . . . . 16 87 2.4.5. Flow Start or Restart . . . . . . . . . . . . . . . . 17 88 2.5. Packet Sending . . . . . . . . . . . . . . . . . . . . . 18 89 2.5.1. Packet Pacing . . . . . . . . . . . . . . . . . . . . 18 90 2.5.2. Segmentation Offload . . . . . . . . . . . . . . . . 18 91 3. Variants and Future Work . . . . . . . . . . . . . . . . . . 19 92 3.1. Getting up to Speed Faster . . . . . . . . . . . . . . . 19 93 3.1.1. Flow Start (or Restart) . . . . . . . . . . . . . . . 19 94 3.1.2. Faster than Additive Increase . . . . . . . . . . . . 21 95 3.1.3. Remove Lag in Congestion Response . . . . . . . . . . 21 96 3.2. Combining Congestion Metrics . . . . . . . . . . . . . . 22 97 3.2.1. ECN with Loss . . . . . . . . . . . . . . . . . . . . 22 98 3.2.2. ECN with Delay . . . . . . . . . . . . . . . . . . . 23 99 3.3. Fall-Back on Classic ECN . . . . . . . . . . . . . . . . 23 100 3.4. Further Reduced RTT-Dependence . . . . . . . . . . . . . 24 101 3.5. Scaling Down to Fractional Windows . . . . . . . . . . . 24 102 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25 103 5. Security Considerations . . . . . . . . . . . . . . . . . . . 25 104 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 25 105 7. Comments and Contributions Solicited (To be removed before 106 Publication) . . . . . . . . . . . . . . . . . . . . . . . . 25 107 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 26 108 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 26 109 9.1. Normative References . . . . . . . . . . . . . . . . . . 26 110 9.2. Informative References . 
. . . . . . . . . . . . . . . . 27 111 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 29 113 1. Introduction 115 This document defines the Prague congestion control. It is defined 116 independent of any particular transport protocol or operating system, 117 but notes are added that highlight issues specific to certain 118 transports and OSs. The authors are most familiar with the reference 119 implementation of Prague on Linux over TCP. So that forms the basis 120 of the large majority of platform-specific notes. Nonetheless, 121 wherever possible, experience from implementers on other platforms is 122 included, and the intention is to gather more into this document 123 during the drafting process. 125 The Prague CC is intended to maintain consistently low queuing delay 126 over network paths that offer L4S support at the bottleneck. Where 127 the bottleneck does not support L4S, the Prague CC is intended to 128 fall back to behaving like a conventional 'Classic' congestion 129 control. L4S stands for Low Latency, Low Loss Scalable throughput. 130 L4S support in the network involves Active Queue Management (AQM) 131 with a very shallow target queueing delay (of the order of a 132 millisecond) that applies immediate Explicit Congestion Notification 133 (ECN). 'Immediate ECN' means that the network applies ECN marking 134 based on the instantaneous queue, without any smoothing or filtering. 135 The Prague CC takes on the job of smoothing and filtering the 136 congestion signals from the network. 138 The Prague CC is a particular instance of a scalable congestion 139 control, which is defined in Section 1.4. Scalable congestion 140 control is the part of the L4S architecture that does the actual work 141 of maintaining low queuing delay and ensuring that the delay and 142 throughput properties scale with flow rate. 
144 The L4S architecture [I-D.ietf-tsvwg-l4s-arch] places the host 145 congestion control in the context of the other parts of the system, 147 in particular the different types of L4S AQM in the network and the 148 codepoints in the IP-ECN field that convey to the network that the 149 host supports the L4S form of ECN. The architecture document also 150 covers other issues such as: incremental deployment; protection of 151 low latency queues against accidental or malicious disruption; and 152 the relationship of L4S to other low latency technologies. The 153 specification of the L4S ECN Protocol [I-D.ietf-tsvwg-ecn-l4s-id] 154 sets down the requirements that the Prague CC has to follow (called 155 the Prague L4S Requirements - see Section 2.1 for a summary). 157 Links to implementations of the Prague CC and other scalable 158 congestion controls (all open source) can be found via the L4S 159 landing page [L4S-home], which also links to numerous other L4S- 160 related resources. A (slightly dated) paper on the specific 161 implementation of the Prague CC in Linux over TCP is also available 162 [PragueLinux]. 164 1.1. Motivation: Low Queuing Delay /and/ Full Throughput 166 The Prague CC is capable of keeping queuing delay consistently low 167 while fully utilizing available capacity. In contrast, Classic 168 congestion controls need to induce a reasonably large queue 169 (approaching a bandwidth-delay product) in order to fully utilize 170 capacity. Therefore, prior to scalable CCs like DCTCP and Prague, it 171 was believed that very low delay was only possible by limiting 172 throughput and isolating the low delay traffic from capacity-seeking 173 traffic. 175 The Prague CC uses additive increase multiplicative decrease (AIMD), 176 in which it increases its window until an ECN mark (or loss) is 177 detected, then yields in a continual sawtooth pattern. The key to 178 keeping queuing delay low without under-utilizing capacity is to keep 179 the sawteeth tiny. 
For example, the average duration of a Prague CC 180 sawtooth is of the order of a round trip, whereas a classic 181 congestion control sawtooths over hundreds of round trips. For 182 instance, over an RTT of 36 ms, at 100 Mb/s Cubic takes about 106 round 183 trips to recover, and at 800 Mb/s its recovery time triples to over 184 340 round trips, or still more than 12 seconds (Reno would take 57 185 seconds). 187 Keeping the sawtooth amplitude down keeps queue variation down and 188 utilization up. Keeping the duration of the sawteeth down ensures 189 control remains tight. The definition of a scalable CC is that the 190 duration between congestion marks does not increase as flow rate 191 scales, all other factors being equal. This is important, because it 192 means that the sawteeth will always stay tiny. So queue delay will 193 remain very low, and control will remain very tight. 195 The tip of each sawtooth occurs when the bottleneck link emits a 196 congestion signal. Therefore such small sawteeth are only feasible 197 when ECN is used for the congestion signals. If loss were used, the 198 loss level would be prohibitively high. This is why L4S-ECN has to 199 depart from the requirement of Classic ECN [RFC3168] that an ECN mark 200 is equivalent to a loss. Otherwise, the response to the high 201 level of ECN marking would have to be as great as the response to an 202 equivalent level of loss. 204 The Prague CC is derived from Data Center TCP (DCTCP [RFC8257]). 205 DCTCP is confined to controlled environments like data centres 206 precisely because it uses such small sawteeth, which induce such a 207 high level of congestion marking. For a CC using Classic ECN, this 208 would be interpreted as equivalent to the same, very high, loss 209 level. The Classic CC would then continually drive its own rate down 210 in the face of such an apparently high level of congestion. 212 This is why coexistence with existing traffic is important for the 213 Prague CC. 
It has to be able to detect whether it is sharing the 214 bottleneck with Classic traffic, and if so fall back to behaving in a 215 Classic way. If the bottleneck does not support ECN at all, that is 216 easy - the Prague CC just responds in the Classic way to loss (see 217 Section 2.4.1). But if it is sharing the bottleneck with Classic ECN 218 traffic, this is more difficult to detect (see Section 3.3). Because 219 the Prague CC removes most of the queue, it also addresses RTT- 220 dependence. Otherwise, at low base RTTs, it would become far more 221 RTT-dependent than Classic CCs. 223 1.2. Document Purpose 225 There is not 'One True Prague CC'. L4S is intended to enable 226 development of any scalable CC that meets the L4S Prague requirements 227 [I-D.ietf-tsvwg-ecn-l4s-id]. This document attempts to describe a 228 reference implementation and attempts to generalize it to different 229 transports and OS platforms. The implementation does not satisfy all 230 the Prague requirements (yet), and the IETF might decide that certain 231 requirements need to be relaxed as an outcome of the process of 232 trying to satisfy them all. 234 1.3. Maturity Status (To be Removed Before Publication) 236 The field of congestion control is always a work in progress. 237 However, there are areas of the Prague CC that are still just 238 placeholders while separate research code is evaluated. And in other 239 implementations of the Prague CC, other areas are incomplete. In the 240 Linux reference implementation of TCP Prague, interim code is used in 241 the incomplete areas, which are: 243 o Flow start and restart (standard slow start is used, even though 244 it often exits early in L4S environments where ECN marking tends to 245 be frequent); 247 o Faster than additive increase (standard additive increase is used, 248 which makes the flow particularly sluggish if it has dropped out 249 of slow start early). 251 The body of this document describes the Prague CC as implemented. 
252 Any non-default options or any planned improvements are separated out 253 into Section 3 on "Variants and Future Work". As each of the above 254 areas is addressed, it will be removed from this section and its 255 description in the body of the document will be updated. Once all 256 areas are complete, this section will be removed. Prague CC will 257 then still be a work in progress, but only on a similar footing as 258 all other congestion controls. 260 1.4. Terminology 262 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 263 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 264 document are to be interpreted as described in [RFC2119] when, and 265 only when, they appear in all capitals, as shown here. 267 Definitions of terms: 269 Classic Congestion Control: A congestion control behaviour that can 270 co-exist with standard TCP Reno [RFC5681] without causing 271 significantly negative impact on its flow rate [RFC5033]. With 272 Classic congestion controls, as flow rate scales, the number of 273 round trips between congestion signals (losses or ECN marks) rises 274 with the flow rate. So it takes longer and longer to recover 275 after each congestion event. Therefore control of queuing and 276 utilization becomes very slack, and the slightest disturbance 277 prevents a high rate from being attained [RFC3649]. 279 Scalable Congestion Control: A congestion control where the average 280 time from one congestion signal to the next (the recovery time) 281 remains invariant as the flow rate scales, all other factors being 282 equal. This maintains the same degree of control over queueing 283 and utilization whatever the flow rate, as well as ensuring that 284 high throughput is robust to disturbances. For instance, DCTCP 285 averages 2 congestion signals per round-trip whatever the flow 286 rate. 
For the public Internet a Scalable transport has to comply 287 with the requirements in Section 4 of [I-D.ietf-tsvwg-ecn-l4s-id] 288 (aka. the 'Prague L4S requirements'). 290 Response function: The relationship between the window (cwnd) of a 291 congestion control and the congestion signalling probability, p, 292 in steady state. A general response function has the form cwnd = 293 K/p^B, where K and B are constants. In an approximation of the 294 response function of the standard Reno CC, B=1/2. For a scalable 295 congestion control B=1, so its response function takes the form 296 cwnd = K/p. The number of congestion signals per round is p*cwnd, 297 which equates to the constant, K, for a scalable CC. Hence the 298 definition of a scalable CC above. 300 Reno-friendly: The subset of Classic traffic that excludes 301 unresponsive traffic and excludes experimental congestion controls 302 intended to coexist with Reno but without always being strictly 303 friendly to it (as allowed by [RFC5033]). Reno-friendly is used 304 in place of 'TCP-friendly', given that the TCP protocol is used 305 with many different congestion control behaviours. 307 Classic ECN: The original Explicit Congestion Notification (ECN) 308 protocol [RFC3168], which requires ECN signals to be treated the 309 same as drops, both when generated in the network and when 310 responded to by the sender. 312 The names used for the four codepoints of the 2-bit IP-ECN field 313 are as defined in [RFC3168]: Not ECT, ECT(0), ECT(1) and CE, where 314 ECT stands for ECN-Capable Transport and CE stands for Congestion 315 Experienced. 317 A packet marked with the CE codepoint is termed 'ECN-marked' or 318 sometimes just 'marked' where the context makes ECN obvious. 
320 CC: Congestion Control 322 ACK: an ACKnowledgement, or to ACKnowledge 324 EWMA: Exponentially Weighted Moving Average 326 RTT: Round Trip Time 328 Definitions of Parameters and Variables: 330 MTU_BITS: Maximum transmission unit [b] 332 cwnd: Congestion window [B] 334 ssthresh: Slow start threshold [B] 336 inflight: The amount of data that the sender has sent but not yet 337 received ACKs for [B] 339 p: Steady-state probability of drop or marking [] 341 alpha: EWMA of the ECN marking fraction [] 343 acked_sacked: the amount of new data acknowledged by an ACK [B] 345 ece_delta: the amount of newly acknowledged data that was ECN-marked 346 [B] 348 ai_per_rtt: additive increase to apply per RTT [B] 350 srtt: Smoothed round trip time [s] 352 MAX_BURST_DELAY: Maximum allowed bottleneck queuing delay due to 353 segmentation offload bursts [s] (default 0.25 ms for the public 354 Internet) 356 2. Prague Congestion Control 358 2.1. The Prague L4S Requirements 360 The beneficial properties of L4S traffic (low queuing delay, etc.) 361 depend on all L4S sources satisfying a set of conditions called the 362 Prague L4S Requirements. The name is after an ad hoc meeting of 363 about thirty people co-located with the IETF in Prague in July 2015, 364 the day after the first public demonstration of L4S. 366 The meeting agreed a list of modifications to DCTCP [RFC8257] to 367 focus activity on a variant that would be safe to use over the public 368 Internet. It was suggested that this could be called TCP Prague to 369 distinguish it from DCTCP. This list was adopted by the IETF, and 370 has continued to evolve (see section 4 of 371 [I-D.ietf-tsvwg-ecn-l4s-id]). The requirements are no longer TCP- 372 specific, applying irrespective of wire-protocol (TCP, QUIC, RTP, 373 SCTP, etc). 
375 This unusual start to the life of the project led to the unusual 376 development process of a reference implementation that had to resolve 377 a number of ambitious requirements, already known to be in tension 378 [Tensions17]. 380 DCTCP already implements a scalable congestion control. So most of 381 the changes to make it usable over the Internet seemed trivial, some 382 'merely' involving adoption of other parallel developments like 383 Accurate ECN TCP feedback or RACK. Others have been more challenging 384 (e.g. RTT-independence). And others that seemed trivial became 385 challenging given the complex set of bugs and behaviours that 386 characterize today's Internet and the Linux stack. 388 The more critical implementation challenges are highlighted in the 389 following sections, in the hope we can prevent mistakes being 390 repeated (see for instance Section 2.3.2, Section 2.4.2). There was 391 also a set of five intertwined 'bugs' - all masking each other, but 392 causing unpredictable or poor performance as different code 393 modifications unmasked them. A draft write-up about these has been 394 prepared, which is longer than the whole of the present document, so 395 it will be included by reference once published. 397 During the development process, we have unearthed fundamental aspects 398 of the implementation and indeed the design of DCTCP and Prague that 399 have still not caught up with the paradigm shift from existence to 400 extent-based congestion response. Some have been implemented by 401 default, e.g. not suppressing additive increase for a round trip 402 after a congestion event (Section 2.4.3). Others have been 403 implemented but not fully evaluated, e.g. removing the 1-2 404 unnecessary round trips of lag in feedback processing (Section 3.1.3) 405 and yet others are still future plans, e.g. further RTT-independence 406 (Section 3.4) and exploiting combined congestion metrics in more 407 cases (Section 3.2). 
409 The requirements are categorized into those that would impact other 410 flows if not handled properly and performance optimizations that are 411 important but optional from the IETF's point of view, because they 412 only affect the flow itself. The list below maps the order of the 413 requirements in [I-D.ietf-tsvwg-ecn-l4s-id] to the order in this 414 document (which is by functional categories and code status): 416 Mandatory or Advisory Requirements: 418 * L4S-ECN packet identification: use of ECT(1) (Section 2.2) 420 * Accurate ECN feedback (Section 2.3.1) 422 * Reno-friendly response to a loss (Section 2.4.1) 424 * Detection of a classic ECN AQM (Section 3.3) 426 * Reduced RTT dependence (Section 2.4.4) 428 * Scaling down to a fractional window (no longer mandatory, see 429 Section 3.5) 431 * Detecting loss in units of time (Section 2.3.3) 433 * Minimizing bursts (Section 2.5.1) 435 Optional performance optimizations: 437 * ECN-capable control packets (Section 2.2) 439 * Faster flow start (Section 3.1.1) 441 * Faster than additive increase (Section 3.1.2) 443 * Segmentation offload (Section 2.5.2) 445 2.2. Packet Identification 447 On the public Internet, a sender using the Prague CC MUST set the 448 ECT(1) codepoint on all the packets it sends, in order to identify 449 itself as an L4S-capable congestion control (Req 4.1 450 [I-D.ietf-tsvwg-ecn-l4s-id]). 452 This applies whatever the transport protocol, whether TCP, QUIC, RTP, 453 etc. In the case of TCP, unlike an RFC 3168 TCP ECN transport, a 454 sender can set all packets as ECN-capable, including TCP control 455 packets and retransmissions [RFC8311], 456 [I-D.ietf-tcpm-generalized-ecn]. 458 The Prague CC SHOULD optionally be configurable to use the ECT(0) 459 codepoint in private networks, such as data centres, which might be 460 necessary for backward compatibility with DCTCP deployments where 461 ECT(1) might already have another usage. 
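As a minimal illustration of the packet identification rule in Section 2.2, the following Python sketch shows how a user-space transport running over a datagram socket might place ECT(1), or optionally ECT(0), in the two ECN bits of the IP TOS byte. It is not taken from any Prague implementation: the helper names tos_with_ecn and mark_socket are hypothetical, and it assumes a platform that exposes the IP_TOS socket option. It does not apply to kernel TCP sockets, where the stack itself typically manages the ECN bits.

```python
import socket

# Codepoints of the 2-bit IP-ECN field, as named in RFC 3168.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def tos_with_ecn(tos: int, ecn: int) -> int:
    """Return the TOS byte with its two low-order (ECN) bits replaced.

    The six high-order bits (the DSCP) are left untouched.
    """
    return (tos & 0xFC) | (ecn & 0x03)


def mark_socket(sock: socket.socket, use_ect0: bool = False) -> None:
    """Hypothetical helper: request ECT(1) (default) or ECT(0) on a socket.

    Assumes an IPv4 datagram socket on a platform that honours IP_TOS.
    """
    codepoint = ECT_0 if use_ect0 else ECT_1
    current = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS,
                    tos_with_ecn(current, codepoint))
```

Keeping the bit manipulation in a pure function makes the ECT(1)/ECT(0) choice easy to check separately from any socket handling.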
463 Implementation note: 465 TCP Prague in Linux kernel: The kernel was updated to allow the 466 ECT(1) flag to be set from within a CC module. The Prague CC then 467 has full control over the ECN code point it uses at any one time. 468 In this way it enforces the use of ECT(1) (or optionally ECT(0)) 469 and non-ECT when required. 471 2.3. Detecting and Measuring Congestion 473 2.3.1. Accurate ECN Feedback 475 When feedback of ECN markings was added to TCP [RFC3168], it was 476 decided not to report any more than one mark per RTT. L4S-capable 477 congestion controls need to know the extent, not just the existence 478 of congestion (Req 4.2. [I-D.ietf-tsvwg-ecn-l4s-id]). Recently 479 defined transports (DCCP, QUIC, etc) typically already satisfy this 480 requirement. So they are dealt with separately below, while TCP and 481 derivatives such as SCTP [RFC4960] are covered first. 483 2.3.1.1. Accurate ECN Feedback with TCP & Derivatives 485 The TCP wire protocol is being updated to allow more accurate 486 feedback (AccECN [I-D.ietf-tcpm-accurate-ecn]). Therefore, in the 487 case where a sender uses the Prague CC over TCP, whether as client or 488 server: 490 o it MUST itself support AccECN; 492 o to support AccECN it also has to check that its peer supports 493 AccECN during the handshake. 495 If the peer does not support accurate ECN feedback, the sender MUST 496 fall back to a Reno-friendly CC behaviour for the rest of the 497 connection. The non-Prague TCP sender MUST then no longer set ECT(1) 498 on the packets it sends. Note that the peer only needs to support 499 AccECN; there is no need (and no way) to find out whether the peer is 500 using an L4S-capable congestion control. 502 Note that a sending TCP client that uses the Prague CC can set ECT(1) 503 on the SYN prior to checking whether the other peer supports AccECN 504 (as long as it follows the procedure in 505 [I-D.ietf-tcpm-generalized-ecn] if it discovers the peer does not 506 support AccECN). 
508 Implementation note: 510 TCP Prague in Linux kernel: The kernel had been updated to support 511 AccECN independent of the CC module in use. So the kernel tries 512 to negotiate AccECN exchange whichever congestion control module 513 is selected. An additional check is provided to verify that the 514 kernel actually does support AccECN, based on which the Prague CC 515 module will decide to proceed using scalable CC or fall back to a 516 Classic CC (Reno in the current implementation). 518 A system wide option is available to disable AccECN negotiation, 519 but the Prague CC module will always override this setting, as it 520 depends on AccECN. Then, solely in this case, AccECN will only be 521 active for TCP flows using the Prague CC. 523 2.3.1.2. Accurate ECN Feedback with Other Modern Transports 525 Transport protocols specified recently, e.g. DCCP [RFC4340], QUIC 526 [I-D.ietf-quic-transport], are unambiguously suitable for Prague CCs, 527 because they were designed from the start with accurate ECN feedback. 529 In the case of RTP/RTCP, ECN feedback was added in [RFC6679], which 530 is sufficient for the Prague CC. However, it is preferable to use 531 the most recent improvements to ECN feedback in 532 [I-D.ietf-avtcore-cc-feedback-message], as used in the implementation 533 of the L4S variant of SCReAM [RFC8298]. 535 2.3.2. Moving Average of ECN Feedback 537 The Prague CC currently maintains a moving average of ECN feedback in 538 a similar way to DCTCP. This section is provided mainly because 539 performance has proved to be sensitive to implementation precision in 540 this area. So first, some background is necessary. 542 The Prague CC triggers update of its moving average once per RTT by 543 recording the packet it sent after the previous update, then watching 544 for the ACK of that packet to return. To maintain its moving 545 average, it measures the fraction, frac, of ACKed bytes that carried 546 ECN feedback over the previous round trip. 
It then updates an 547 exponentially weighted moving average (EWMA) of this fraction, called 548 alpha, using the following algorithm: 550 alpha += g * (frac - alpha); 552 where g is the gain of the EWMA (default 1/16). 554 Implementation notes: 556 Rounding problems in DCTCP: Alpha is a fraction between 0 and 1, and 557 it needs to be represented with high resolution because the larger 558 the bandwidth-delay product (BDP) of a flow, the smaller the value 559 that alpha converges to (in steady state alpha = 2/cwnd). In 560 principle, Linux DCTCP maintains the moving average 'alpha' using 561 the same formula as Prague CC uses (as above). Linux represents 562 alpha with a 10-bit integer (with resolution 1/1024). However, up 563 to kernel release 3.19, Linux used integer arithmetic that could 564 not reduce alpha below 15/1024. Then it was patched so that any 565 value below 16/1024 was rounded down to zero [patch-alpha-zero]. 566 For a flow with a higher BDP than 128 segments, this means that 567 alpha flip-flops. Once it has flopped down to zero, DCTCP becomes 568 unresponsive until it has built sufficient queue to flip up to 569 16/1024. For larger BDPs, this causes DCTCP to induce larger 570 sawteeth, which loses the low-queuing-delay and high-utilization 571 intent of the algorithm. 573 Upscaled alpha in Prague CC: To resolve the above problem the 574 implementation of TCP Prague in Linux maintains upscaled_alpha = 575 alpha/g instead of alpha: 577 upscaled_alpha += frac - g * upscaled_alpha; 579 This technique is the same as Linux uses for the retransmission 580 timer variables, srtt and mdev. Prague CC also uses 20 bits for 581 alpha. 583 Currently the above per-RTT update to the moving average, which was 584 inherited from DCTCP, is the default in the Prague CC. 
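The rounding problem and the upscaled-alpha remedy described in the implementation notes of Section 2.3.2 can be illustrated with a small integer-arithmetic sketch in Python. This is a simplified model for illustration only, not the Linux kernel code: the function names, the use of right shifts for multiplication by g = 1/16, and the test marking fraction of about 0.2% are all assumptions made here.

```python
G_SHIFT = 4  # g = 1/16, applied as a right shift by 4 bits


def coarse_update(alpha: int, frac: int) -> int:
    """alpha += g * (frac - alpha), with alpha and frac in units of 1/1024.

    Both products with g are truncated, so small values are lost.
    """
    return alpha - (alpha >> G_SHIFT) + (frac >> G_SHIFT)


def upscaled_update(upscaled_alpha: int, frac: int) -> int:
    """upscaled_alpha += frac - g * upscaled_alpha, in units of 2**-20.

    frac is added without being multiplied by g, so it is never truncated.
    """
    return upscaled_alpha - (upscaled_alpha >> G_SHIFT) + frac


def settle(update, state: int, frac: int, rounds: int = 2000) -> int:
    """Apply one update per round trip with a constant marking fraction."""
    for _ in range(rounds):
        state = update(state, frac)
    return state


# A marking fraction of about 0.2%, in each representation (assumed value).
FRAC_10BIT = 2       # roughly 0.002 * 2**10
FRAC_20BIT = 2097    # roughly 0.002 * 2**20

stuck_alpha = settle(coarse_update, 1024, FRAC_10BIT)
tracked_alpha = settle(upscaled_update, 0, FRAC_20BIT) >> G_SHIFT
```

In this model the coarse form starting from alpha = 1 settles at 15/1024 and cannot go lower, whatever the small marking fraction, whereas the upscaled form settles on the marking fraction it is fed.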
However, 585 another approach is being investigated because these per-RTT updates 586 introduce 1--2 rounds of delay into the congestion response on top of 587 the inherent round of feedback delay (see Section 3.1.3 in the 588 section on variants and future work). 590 2.3.3. Scaling Loss Detection with Flow Rate 592 After an ACK leaves a gap in the sequence space, a Prague CC is meant 593 to deem that a loss has occurred using 'time-based units' (Req 4.3. 594 [I-D.ietf-tsvwg-ecn-l4s-id]). This is in contrast to the traditional 595 approach that counts a hard-coded number of duplicate ACKs, e.g. the 596 3 Dup-ACKs specified in [RFC5681]. Counting packets rather than time 597 unnecessarily tightens the time within which parallelized links have 598 to keep packets in sequence as flow rate scales over the years. 600 To satisfy this requirement, a Prague CC SHOULD wait until a certain 601 fraction of the RTT has elapsed before it deems that the gap is due 602 to packet loss. The reference implementation of TCP Prague in Linux 603 uses RACK [I-D.ietf-tcpm-rack] to address this requirement. An 604 approach similar to TCP RACK is also used in QUIC. 606 At the start of a connection, RACK counts 3 DupACKs to detect loss 607 because the initial smoothed RTT estimate can be inaccurate. This 608 would depend indirectly on time as long as the initial window (IW) is 609 paced over a round trip (see Section 2.4.5). For instance, if the 610 initial window of 10 segments was paced evenly across the initial RTT 611 then, in the next round, an implementation that deems there has been 612 a loss after (say) 1/4 of an RTT can count 1/4 of 10 = 3 DupACKs 613 (rounded up). Subsequently, as the window grows, RACK shifts to 614 using a fraction of the RTT for loss detection. 616 2.4. 
Congestion Response Algorithm 618 In congestion avoidance phase, a Prague CC uses a similar additive 619 increase multiplicative decrease (AIMD) algorithm to DCTCP, but with 620 the following differences: 622 2.4.1. Fall-Back on Loss 624 A Prague CC has to fall back to Reno-friendly behaviour on detection 625 of a loss (Req 4.3. [I-D.ietf-tsvwg-ecn-l4s-id]). DCTCP falls back 626 to Reno for the round trip after a loss, and the Linux reference 627 implementation of TCP Prague inherits this behaviour. 629 If a Prague CC has already reduced the congestion window due to ECN 630 feedback less than a round trip before it detects a loss, it MAY 631 reduce the congestion window by a smaller amount due to the loss, as 632 long as the reductions due to ECN and the loss are Reno-friendly when 633 taken together. 635 See Section 3.2 for discussion of future work on congestion control 636 using a combination of delay, ECN and loss. 638 Implementation note: 640 DCTCP bug prior to v5.1: A Prague CC cannot rely on the fall-back- 641 on-loss behaviour of the DCTCP code in the Linux kernel prior to 642 v5.1, due to a previous bug in the fast retransmit code (but not 643 in the retransmission timeout code) [patch-loss-react]. 645 2.4.2. Multiplicative Decrease on ECN Feedback 647 The Prague CC currently responds to ECN feedback in a similar way to 648 DCTCP. This section is provided mainly because performance has 649 proved to be sensitive to implementation details in this area. So 650 the following recap of the congestion response is needed first. 652 As explained in Section 2.3.2, the Prague CC (like DCTCP) clocks its 653 moving average of ECN-marking, alpha, once per round trip throughout 654 a connection. Nonetheless, it only triggers a multiplicative 655 decrease to its congestion window when it actually receives an ACK 656 carrying ECN feedback. Then it suppresses any further decreases for 657 one round trip, even if it receives further ECN feedback. 
This is 658 termed Congestion Window Reduced or CWR state. 660 The Prague CC (like DCTCP) ensures that the average recovery time 661 remains invariant as flow rate scales (Req 4.3 of 662 [I-D.ietf-tsvwg-ecn-l4s-id]) by making the multiplicative decrease 663 depend on the prevailing value of alpha as follows: 665 ssthresh = (1 - alpha/2) * cwnd; 667 Implementation notes: 669 Upscaled alpha: With reference to the earlier discussion of integer 670 arithmetic precision (Section 2.3.2), alpha = g * upscaled_alpha. 672 Carry of fractional cwnd remainder: Typically the absolute reduction 673 in the window is only a small number of segments. So, if the 674 Prague CC implementation counts the window in integer segments (as 675 in the Linux reference code), delay can be made significantly less 676 jumpy by tracking a fractional value alongside the integer window 677 and carrying over any fractional remainder to the next reduction. 678 Also, integer rounding bias ought to be removed from the 679 multiplicative decrease calculation. 681 In dynamic scenarios, as flows find a new operating point, alpha will 682 have often tailed away to near-nothing before the onset of 683 congestion. Then DCTCP's tiny reduction followed by no further 684 response for a round is precisely the wrong way for a CC to respond. 685 A solution to this problem is being evaluated as part of the work 686 already mentioned to improve Prague's responsiveness (see 687 Section 3.1.3 in the section on variants and future work). 689 2.4.3. Additive Increase and ECN Feedback 691 Unlike DCTCP, the Prague CC does not suppress additive increase for 692 one round trip after a congestion window reduction (while in CWR 693 state). Instead, a Prague CC applies additive increase irrespective 694 of its CWR state, but only for bytes that have been ACK'd without ECN 695 feedback. 
Specifically, on each ACK, 697 cwnd += (acked_sacked - ece_delta) * ai_per_rtt / cwnd; 699 where: 701 acked_sacked is the number of new bytes acknowledged by the ACK; 703 ece_delta is the number of newly acknowledged ECN-marked bytes; 705 ai_per_rtt is a scaling factor that is typically 1 SMSS except for 706 small RTTs (see Section 2.4.4). 708 Superficially, the traditional suppression of additive increase for 709 the round after a decrease seems to make sense. However, DCTCP and 710 Prague are designed to induce an average of 2 congestion marks per 711 RTT in steady state, which leaves very little space for any increase 712 between the end of one round of CWR and the next mark. In tests, 713 when a test version of Prague CC is configured to completely suppress 714 additive increase during CWR (like Reno and DCTCP), its sawteeth 715 become more irregular, which is its way of making some decreases 716 large enough to open up enough space for an increase. This 717 irregularity tends to reduce link utilization. Therefore, the 718 reference Prague CC continues additive increase irrespective of CWR 719 state. 721 Nonetheless, rather than continue additive increase regardless of 722 congestion, it is safer to only increase on those ACKs that do not 723 feed back congestion. This approach reduces additive increase as the 724 marking probability increases, which tends to keep the marking level 725 unsaturated (below 100%) (see Section 3.1 of [Tensions17]). Under 726 stable conditions, Prague's congestion window then becomes 727 proportional to (1-p)/p, rather than 1/p, where p is the marking probability. 729 See also 'Faster than Additive Increase' (Section 3.1.2). 731 2.4.4. Reduced RTT-Dependence 733 The window-based AIMD described so far was inherited from Reno via 734 DCTCP. When many long-running Reno flows share a link, their 735 relative packet rates become roughly inversely proportional to RTT 736 (packet rate =~ 1/RTT). Then a flow with very small RTT will 737 dominate any flows with larger RTTs.
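To make the window arithmetic of Sections 2.4.2 and 2.4.3 concrete before it is modified for reduced RTT-dependence, the following Python sketch applies the per-ACK additive increase and the alpha-dependent decrease to a window held in bytes. It is a sketch under stated assumptions, not the reference implementation: the class PragueWindow, the helper grow_one_round and the SMSS value are hypothetical, CWR state is left to the caller, and setting cwnd to ssthresh after a decrease is assumed rather than taken from the text.

```python
SMSS = 1448  # assumed sender maximum segment size in bytes


class PragueWindow:
    """Hypothetical holder for cwnd; a float stands in for the fractional carry."""

    def __init__(self, cwnd_bytes, ai_per_rtt=SMSS):
        self.cwnd = float(cwnd_bytes)
        self.ssthresh = float("inf")
        self.ai_per_rtt = ai_per_rtt  # typically 1 SMSS

    def on_ack(self, acked_sacked, ece_delta):
        # Additive increase only for bytes ACKed without ECN feedback,
        # applied irrespective of CWR state.
        unmarked = max(acked_sacked - ece_delta, 0)
        self.cwnd += unmarked * self.ai_per_rtt / self.cwnd

    def on_ecn_decrease(self, alpha):
        # Multiplicative decrease; the caller is assumed to invoke this at
        # most once per round trip (CWR state is not modelled here).
        self.ssthresh = (1 - alpha / 2) * self.cwnd
        self.cwnd = self.ssthresh  # assumption: cwnd is set to ssthresh


def grow_one_round(window, marked_fraction, n_acks=10):
    """Feed one round of SMSS-sized ACKs, each with the given marked share."""
    start = window.cwnd
    for _ in range(n_acks):
        window.on_ack(SMSS, int(SMSS * marked_fraction))
    return window.cwnd - start
```

With a 10-segment window, one unmarked round grows cwnd by a little under one SMSS, while a fully marked round leaves it unchanged, which is the behaviour that makes the steady-state window fall off as the marking probability rises.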
739 Queuing delay sets a lower limit to the smallest possible RTT. So, 740 prior to the extremely low queuing delay of L4S, extreme cases of RTT 741 dependence had never been apparent. Now that L4S has removed most of 742 the queuing delay, we have to address the root-cause of RTT- 743 dependence, which the Prague CC is required to do, at least when the 744 RTT is small (see the 'Reduced RTT bias' aspect of Req 4.3. 745 [I-D.ietf-tsvwg-ecn-l4s-id]). Here, a small RTT is defined as below 746 the typical RTT for the intended deployment environment. 748 A Prague CC reduces RTT bias by using a reference RTT (RTT_ref) 749 rather than the actual round trip (RTT) for all three of: the window 750 update period; the EWMA update period; and the duration of CWR state 751 after a decrease. As the actual window (cwnd) is still sent within 1 752 actual RTT, we also need to use a (conceptual) reference window, 753 cwnd_ref. For instance, if RTT_ref = 25 ms then, when the actual RTT 754 is 5 ms, there are RTT_ref/RTT = 5 times more packets in cwnd_ref, 755 than in the actual window, cwnd, because it spans 5 actual round 756 trips. We define M as the ratio RTT_ref/RTT. 758 In the Linux implementation of TCP Prague, RTT_ref is a function of 759 the actual RTT. 3 functions have been implemented: RTT_ref = max(RTT, 760 RTT_REF_MIN); RTT_ref = RTT + AdditionalRTT; RTT_ref = ... {ToDo}. 761 The current default is RTT_ref = max(RTT, 25ms), which addresses the 762 main Prague requirement for when the RTT is smaller than typical. 764 In Reno or DCTCP, additive increase is implemented by dividing the 765 desired increase of 1 segment per round over the cwnd packets in the 766 round. This requires an increase of 1/cwnd per packet. In the Linux 767 implementation of TCP Prague, the aim is to increase the reference 768 window by 1 segment over a reference round. However, in practice the 769 increase is applied to the actual window, cwnd, which is M times 770 smaller than cwnd_ref. 
So cwnd has to be increased by only 1/M 771 segments over RTT_ref. But again, in practice, the increase is 772 applied over an actual window of packets spanning an actual RTT, 773 which is also M times smaller than the reference RTT. So the desired 774 increase in cwnd is only 1/M^2 segments over an actual round trip 775 containing cwnd packets. Therefore, the increase in cwnd per packet 776 has to be (1/M^2) * (1/cwnd). 778 Unless a flow lasts long enough for rates to converge, equal rates 779 will not be relevant. So, the Reduced RTT-Dependence algorithm only 780 comes into effect after D rounds, where D is configurable (current 781 default 500). Continuing the previous example, if actual RTT = 5 ms 782 and RTT_ref = 25 ms, then Prague would stop using its RTT-dependent 783 algorithm after 500*5ms = 2.5s and instead it would start to converge 784 to equal rates using the Reduced RTT-Dependence algorithm. If the 785 actual RTT were higher (e.g. 20ms), it would stay in RTT-dependent 786 mode for longer (10s), but this would be mitigated by its RTT being 787 closer to the reference (20ms vs. 25ms). 789 This approach prevents reduced RTT-dependence from making the flow 790 less responsive at start-up and ensures that its early throughput 791 share is based on its actual RTT. The benefit is that short flows 792 (mice) give themselves priority over longer flows (elephants), and 793 shorter RTTs will still converge faster than longer RTTs. 794 Nonetheless, the throughput still converges to equal rates after D 795 rounds. 797 It is planned to reset the algorithm to be RTT-dependent after an 798 idle period, not just at flow start, as discussed under Future Work in 799 Section 3.4. 801 Section 3.4 also discusses extending the reduction in RTT-dependence 802 to longer RTTs than RTT_ref. The current Prague implementation 803 does not support this. 805 2.4.5.
Flow Start or Restart 807 Currently the Linux reference implementation of TCP Prague uses the 808 standard Linux slow start code. Slow start is exited once a single 809 mark is detected. 811 When other flows are actively filling the link, regular marks are 812 expected, causing slow start of new flows to end prematurely. This 813 is clearly not ideal, so other approaches are being worked on (see 814 Section 3.1.1). However, slow start has been left as the default 815 until a properly matured solution is completed. 817 2.5. Packet Sending 819 2.5.1. Packet Pacing 821 The Prague CC SHOULD pace the packets it sends to avoid the queuing 822 delay and under-utilization that would otherwise be caused by bursts 823 of packets that can occur, for example, when a jump in the 824 acknowledgement number opens up cwnd. Prague does this in a similar 825 way to the base Linux TCP stack, by spacing out the window of packets 826 evenly over the round trip time, using the following calculation of 827 the pacing rate [b/s]: 829 pacing_rate = MTU_BITS * max(cwnd, inflight) / srtt; 831 During slow start, as in the base Linux TCP stack, Prague factors up 832 pacing_rate by 2, so that it paces out packets twice as fast as they 833 are acknowledged. This keeps up with the doubling of cwnd, but still 834 prevents bursts in response to any larger transient jumps in cwnd. 836 if (cwnd < ssthresh / 2) 837 pacing_rate *= 2; 839 During congestion avoidance, the Linux TCP Prague implementation does 840 not factor up pacing_rate at all. This contrasts with the base Linux 841 TCP stack, which currently factors up pacing_rate by a ratio 842 parameter set to 1.2. The developers of the base Linux stack 843 confirmed that this factor of 1.2 was only introduced in case it 844 improved performance, but there were no scenarios where it was known 845 to be needed. In testing of Prague, this factor was found to cause 846 queue delay spikes whenever cwnd jumped more than usual. 
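The pacing calculation described above can be sketched in Python as follows. This is only an illustration: the function names and the 1500-byte MTU are assumptions, and cwnd, inflight and ssthresh are taken to be counted in packets.

```python
MTU_BITS = 1500 * 8  # assumed MTU of 1500 bytes


def pacing_rate_bps(cwnd, inflight, srtt_s, ssthresh):
    """Pacing rate in b/s, with cwnd, inflight and ssthresh in packets."""
    rate = MTU_BITS * max(cwnd, inflight) / srtt_s
    if cwnd < ssthresh / 2:
        rate *= 2  # slow start: pace twice as fast as packets are ACKed
    return rate    # no extra factor (such as 1.2) in congestion avoidance


def inter_packet_gap_s(rate_bps):
    """Spacing between MTU-sized packets at the given pacing rate."""
    return MTU_BITS / rate_bps
```

For example, 100 packets over a 20 ms smoothed RTT in congestion avoidance gives 60 Mb/s, or one 1500-byte packet every 200 us.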
Throughput 847 was no worse without it, so the factor was removed from the TCP 848 Prague CC. 850 The Prague CC can use alternatives to the traditional slow-start 851 algorithm, which use different pacing (see Section 2.4.5). 853 2.5.2. Segmentation Offload 855 In the absence of hardware pacing, it becomes increasingly difficult 856 for a machine to scale to higher flow rates unless it is allowed to 857 send packets in larger bursts, for instance using segmentation 858 offload. Happily, as flow rate scales up, proportionately more 859 packets can be allowed in a burst for the same amount of queuing 860 delay at the bottleneck. 862 Therefore, the Prague CC sends packets in a burst as long as it will 863 not induce more than MAX_BURST_DELAY of queuing at the bottleneck. 865 From this constant and the current pacing_rate, it calculates how 866 many MTU-sized packets to allow in a burst: 868 max_burst = pacing_rate * MAX_BURST_DELAY / MTU_BITS 870 The current default in the Linux TCP Prague for MAX_BURST_DELAY is 871 250us, which supports marking thresholds starting from about 500us 872 without underutilization. This approach is similar to that in the 873 Linux TCP stack, except there MAX_BURST_DELAY is 1ms. 875 3. Variants and Future Work 877 3.1. Getting up to Speed Faster 879 Appendix A.2 of [I-D.ietf-tsvwg-ecn-l4s-id] outlines the performance 880 optimizations needed when transplanting DCTCP from a DC environment 881 to a wide area network. The following subsections address two of 882 those points: faster flow startup and faster than additive increase. 883 Then Section 3.1.3 covers the flip side, in which established flows 884 have to yield faster to make room, otherwise queuing will result. 886 3.1.1.
Flow Start (or Restart) 888 For faster flow start, two approaches are 889 currently being investigated in parallel: 891 Modified Slow Start: The traditional exponential slow start can be 892 modified both at the start and the end, with the aim of reducing 893 the risk of queuing due to bursts and overshoot: 895 Pacing IW: A Prague CC can use an initial window of 10 (IW10 896 [RFC6928]), but pacing of this Initial Window is recommended to 897 try to avoid the pulse of queuing that could otherwise occur. 898 Pacing IW10 also spreads the ACKs over the round trip so that 899 subsequent rounds consist of ten subsets of packets (with 2, 4, 900 8 etc. per round in each subset), rather than a single set 901 with 20, 40, 80 etc. in each round. Then, if a queue builds 902 during a round (e.g. due to other unexpected traffic arriving), 903 it can drain in the gap before the next subset, rather than the 904 whole set backing up in a much larger queue. 906 In the Linux reference implementation of TCP Prague, IW pacing 907 can be optionally enabled, but it is off by default, because it 908 is yet to be fully evaluated. It currently paces IW over half 909 the initial smoothed round trip time (SRTT) measured during the 910 handshake. SRTT is halved because the RTT often reduces after 911 the initial handshake. For example: i) some CDNs move the flow 912 to a closer server after establishment; ii) the initial RTT 913 from a server can include the time to wake a sleeping handset 914 battery; iii) some uplink technologies take a link-level round 915 trip to request a scheduling slot. 917 It is planned to exploit any cached knowledge of the path RTT 918 to improve the initial estimate, for instance using the Linux 919 per-destination cache. It is also planned to allow the 920 application to give an RTT hint (by setting sk_max_pacing_rate 921 in Linux) if the developer has reason to believe that the 922 application has a better estimate.
924 Exiting slow start more gracefully: In the wide area Internet (in 925 contrast to data centres), bottleneck access links tend to have 926 much less capacity than the line rate of the sender. With a 927 shallow immediate ECN threshold at this bottleneck, the 928 slightest burst can tend to induce an ECN mark, which 929 traditionally causes slow start to exit. A more gradual exit 930 is being investigated for a Prague CC using the extent of 931 marking, not just the existence of a single mark. This will be 932 more consistent with the extent-based marking that scalable 933 congestion controls use during congestion avoidance. Delay 934 measurements (similar to Hystart++ 935 [I-D.ietf-tcpm-hystartplusplus]) can also be used to complement 936 the ECN signals. 938 Paced Chirping: In this approach, the aim is to both increase more 939 rapidly than exponential slow-start and to greatly reduce any 940 overshoot. It is primarily a delay-based approach, but the aim is 941 also to exploit ECN signals when present (while not forgetting 942 loss either). Therefore Paced Chirping is generally usable for 943 any congestion control - not solely for Prague CC and L4S. 945 Instead of only aiming to detect capacity overshoot at the end of 946 flow-start, brief trains of rapidly decreasing inter-packet 947 spacing called chirps are used to test many rates with as few 948 packets and as little load as possible. A full description is 949 beyond the scope of this document. [LinuxPacedChirping] 950 introduces the concepts and the code as well as citing the main 951 papers on Paced Chirping. 953 Paced chirping works well over continuous links such as Ethernet 954 and DSL. But better averaging and noise filtering are necessary 955 over discontinuous link technologies such as WiFi, LTE cellular 956 radio, passive optical networks (PON) and data over cable 957 (DOCSIS). This is the current focus of this work. 
959 The current Linux implementation of TCP Prague does not include 960 Paced Chirping, but research code is available separately in Linux 961 and ns3. It is accessible via the L4S landing page [L4S-home]. 963 3.1.2. Faster than Additive Increase 965 The Prague CC has a startup phase and congestion avoidance phase like 966 traditional CCs. In steady-state during congestion avoidance, like 967 all scalable congestion controls, it induces frequent ECN marks, with 968 the same average recovery time between ECN marks, no matter how much 969 the flow rate scales. 971 If available capacity suddenly increases, e.g. other flow(s) depart 972 or the link capacity increases, these regular ECN marks will stop. 973 Therefore, after a few rounds of silence (no ECN marks) in congestion 974 avoidance phase, the Prague CC can assume that available capacity has 975 increased, and switch to using the techniques from its startup phase 976 (Section 3.1.1) to rapidly find the new, faster operating point. 977 Then it can shift back into its congestion avoidance behaviour. 979 That is the theory. But, as explained in Section 3.1.1, the startup 980 techniques, specifically paced chirping, are still being developed 981 for discontinuous link types. Once the startup behaviour is 982 available, the Linux implementation of the Prague CC will also have a 983 faster than additive increase behaviour. Section 3.2.3 of [PragueLinux] 984 gives a brief preview of the performance of this approach over an 985 Ethernet link type in ns3. 987 3.1.3. Remove Lag in Congestion Response 989 To keep queuing delay low, new flows can only push in fast if 990 established flows yield fast. It has recently been realized that the 991 design of the Prague EWMA and congestion response introduces 1-2 992 rounds of lag (on top of the inherent round of feedback delay due to 993 the speed of light).
These lags were inherited from the design of 994 DCTCP (see Section 2.3.2 and Section 2.4.2), where a couple of hundred 995 extra microseconds was less noticeable. But congestion control in 996 the wide area Internet cannot afford up to 2 round trips of extra 997 lag. 999 To be clear, lag means delay before any response at all starts. That 1000 is qualitatively different from the smoothing gain of an EWMA, 1001 which /reduces/ the response by the gain factor (1/16 by default) in 1002 case a change in congestion does not persist. Smoothing gain can 1003 always be increased. But 1-2 rounds of lag means that, when a new 1004 flow tries to push in, the sender of an established flow will not 1005 respond /at all/ for 1-2 rounds after it first receives congestion 1006 feedback. 1008 The Prague CC spends the first round trip of this lag gathering 1009 feedback to measure frac before it is input into the EWMA algorithm 1010 (see Section 2.3.2). Then there is up to one further round of delay 1011 because the implementations of DCTCP and Prague did not fully adopt 1012 the paradigm shift to extent-based marking - the timing of the 1013 decrease is still based on Reno. 1015 Both Reno and DCTCP/Prague respond immediately on the first sign of 1016 congestion. Reno's response is large, so it waits a round in CWR 1017 state to allow the response to take effect. DCTCP's response is tiny 1018 (extent-based), but then it still waits a round in CWR state. So it 1019 does next-to-nothing for a round. 1021 New EWMA and response algorithms to remove these 1-2 extra rounds of 1022 lag are described in [PerAckEWMA]. They have been implemented in 1023 Linux and an iterative process of evaluation and redesign is in 1024 progress. The EWMA is updated per-ACK, but it still changes as if it 1025 is clocked per round trip.
The congestion response is still 1026 triggered by the first indication of ECN feedback, but it proceeds 1027 over the subsequent round trip so that it can take into account 1028 further incoming feedback as the EWMA evolves. The reduction is 1029 applied per-ACK but sized to result as if it had been a single 1030 response per round trip. 1032 3.2. Combining Congestion Metrics 1034 Ultimately, it would be preferable to take an integrated approach and 1035 use a combination of ECN, loss and delay metrics to drive congestion 1036 control. For instance, a downward trend in ECN marking and/or 1037 delay could be used as a heuristic to temper the response to loss. Such ideas are 1038 not in the immediate plans for the Linux TCP Prague, but some more 1039 specific ideas are highlighted in the following subsections. 1041 3.2.1. ECN with Loss 1043 If the bottleneck is ECN-capable, a loss due to congestion is very 1044 likely to have been preceded by a period of ECN marking. When the 1045 current Linux TCP Prague CC detects a loss, like DCTCP, it halves 1046 cwnd, even if it has already reduced cwnd in the same round trip due 1047 to ECN marking. This double reduction can end up factoring down cwnd 1048 to as little as 1/4 in one round trip. 1050 On a loss while in CWR state following an ECN reduction, it would be 1051 possible to factor down cwnd by 1/(2-alpha), which would compound 1052 with the previous decrease factor of (1-alpha/2) to result in: (1 - 1053 alpha/2) / (2-alpha) = 1/2. In integer arithmetic, this division 1054 would be possible but relatively expensive. A less expensive 1055 alternative would be multiplication by (2+alpha)/4, which gives a compound factor of (1-alpha/2) * (2+alpha)/4 = (4 - alpha^2)/8. This 1056 approximates to a compounded decrease factor of 1/2 for typical low 1057 values of alpha, even up to 30%. The compound decrease factor is 1058 never greater than 1/2 and in the worst case, if alpha was 100%, it 1059 would factor cwnd down to 3/8. 1061 3.2.2.
ECN with Delay 1063 Section 3.1.2 described the plans to shift between using ECN when 1064 close to the operating point and using delay by injecting paced 1065 chirps to find a new operating point after the ECN signal goes silent for a 1066 few rounds. Paced chirping shifts more slowly to the new operating 1067 point the more noise there is in the delay measurements. Work is 1068 ongoing on treating any ECN marking as a complementary metric. The 1069 resulting less noisy combined metric should then allow the controller 1070 to shift more rapidly to each new operating point. 1072 An alternative would be to combine ECN with the BBR approach, which 1073 induces a much less noisy delay signal by using less frequent but 1074 more pronounced delay spikes. The approach currently being taken is 1075 to adapt the chirp length to the degree of noise, so the chirps only 1076 become longer and/or more pronounced when necessary, for instance 1077 when faced with a discontinuous link technology such as WiFi. With 1078 multiple chirps per round, the noise can still be filtered out by 1079 averaging over them all, rather than trying to remove noise from each 1080 spike. This keeps the 'self-harm' to the minimum necessary, and 1081 ensures that capacity is always being sampled, which removes the risk 1082 of going stale. 1084 3.3. Fall-Back on Classic ECN 1086 The implementation of TCP Prague CC in Linux includes an algorithm to 1087 detect a Classic ECN AQM and fall back to Reno as a result, as 1088 required by the 'Coexistence with Classic ECN' aspect of the Prague 1089 Req 4.3. [I-D.ietf-tsvwg-ecn-l4s-id]. 1091 The algorithm currently used (v2) is relatively simple, but rather 1092 than describe it here, full rationale, pseudocode and explanation can 1093 be found in the technical report about it [ecn-fallback]. This also 1094 includes a selection of the evaluation results and a link to 1095 visualizations of the full results online.
The current algorithm 1096 nearly always detects a Classic ECN AQM, and in the majority of the 1097 wide range of scenarios tested it is good at detecting an L4S AQM. 1098 However, it wrongly identifies an L4S AQM as Classic in a 1099 significant minority of cases when the link rate is low, or the RTT 1100 is high. The report gives ideas on how to improve detection in these 1101 scenarios, but in the meantime the algorithm has been disabled by 1102 default. 1104 Recently, the report has been updated to include new ideas on other 1105 ways to distinguish Classic from L4S AQMs. The interested reader can 1106 access it themselves, so this living document will not be further 1107 summarized here. 1109 3.4. Further Reduced RTT-Dependence 1111 The algorithm to reduce RTT dependence is only relevant for long- 1112 running flows. So in the current TCP Prague implementation it 1113 remains disabled for a certain number of round trips after the start 1114 of a flow, as explained in Section 2.4.4. It would be possible to 1115 make RTT_ref gradually move from the actual RTT to the target 1116 reference RTT, or perhaps depend on other parameters of the flow. 1117 Nonetheless, just switching in the algorithm after a number of rounds 1118 works well enough. It is planned to also disable the algorithm for a 1119 similar duration if a flow becomes idle then restarts, but this is 1120 yet to be evaluated. 1122 Prague Req 4.3. in [I-D.ietf-tsvwg-ecn-l4s-id] only requires reduced 1123 RTT bias "in the range between the minimum likely RTT and typical 1124 RTTs expected in the intended deployment scenario". The current TCP 1125 Prague implementation satisfies this requirement (Section 2.4.4). 1126 Nonetheless, it would be preferable to be able to reduce the RTT bias 1127 for high RTT flows as well. 1129 If a step AQM is used, the congestion episodes of flows with 1130 different RTTs tend to synchronize, which exacerbates RTT bias.
To 1131 prevent this, two candidate approaches will need to be investigated: 1132 i) it might be sufficient to deprecate step AQMs for L4S (they are 1133 not the preferred recommendation in 1134 [I-D.ietf-tsvwg-aqm-dualq-coupled]); or ii) the reference RTT 1135 approach of Section 2.4.4 might be usable for higher than typical 1136 RTTs as well as lower. In this latter case, (RTT/RTT_ref)^2 segments 1137 would need to be added to the window per actual RTT. The current TCP 1138 Prague implementation does not support this faster AI for RTTs higher 1139 than RTT_ref, due to the expected (but unverified) impact on latency 1140 overshoot and responsiveness. 1142 3.5. Scaling Down to Fractional Windows 1144 A modification to v5.0 of the Linux TCP stack that scales down to 1145 sub-packet windows is available for research purposes via the L4S 1146 landing page [L4S-home]. The L4S Prague Requirements in section 4.3 1147 of [I-D.ietf-tsvwg-ecn-l4s-id] recommend but no longer mandate 1148 scaling down to sub-packet windows. This is because becoming 1149 unresponsive at a minimum window is a tradeoff between protecting 1150 against other unresponsive flows and the extra queue you induce by 1151 becoming unresponsive yourself. So this code is not maintained as 1152 part of the Linux implementation of TCP Prague. 1154 Firstly, the stack has to be modified to maintain a fractional 1155 congestion window. Then, because the ACK clock cannot work below 1 1156 packet per RTT, the code sets the time to send each packet, then 1157 readjusts the timing as each ACK arrives (otherwise any queuing 1158 accumulates a burst in subsequent rounds). Also, additive increase 1159 of one segment does not scale below a 1-segment window. So instead 1160 of a constant additive increase, the code uses a logarithmically 1161 scaled additive increase that slowly adapts the additive increase 1162 constant to the slow start threshold. Despite these quite radical 1163 changes, the diff is surprisingly small.
The design and 1164 implementation are explained in [Ahmed19], which also includes 1165 evaluation results. 1167 4. IANA Considerations 1169 This specification contains no IANA considerations. 1171 5. Security Considerations 1173 Section 3.5 on scaling down to fractional windows discusses the 1174 tradeoff in becoming unresponsive at a minimum window, which causes a 1175 queue to build (harm to self and to others) but protects oneself 1176 against other unresponsive flows (whether malicious or accidental). 1178 This draft inherits the security considerations discussed in 1179 [I-D.ietf-tsvwg-ecn-l4s-id] and in the L4S architecture 1180 [I-D.ietf-tsvwg-l4s-arch]. In particular, these cover the self-interest 1181 incentive to be responsive and minimize queuing delay, and 1182 protections against those interested in disrupting the low queuing 1183 delay of others. 1185 6. Acknowledgements 1187 Bob Briscoe's contribution was part-funded by the Comcast Innovation 1188 Fund. The views expressed here are solely those of the authors. 1190 7. Comments and Contributions Solicited (To be removed before 1191 Publication) 1193 Comments and questions are encouraged and very welcome. They can be 1194 addressed to the IRTF Internet Congestion Control Research Group's 1195 mailing list , and/or to the authors via . Contributions of design 1197 ideas and/or code are also encouraged and welcome. 1199 8. Contributors 1201 The following contributed implementations and evaluations that 1202 validated and helped to improve this specification: 1204 Olivier Tilmans of Nokia 1205 Bell Labs, Belgium, prepared and maintains the Linux 1206 implementation of TCP Prague. 1208 Koen De Schepper of Nokia 1209 Bell Labs, Belgium, contributed to the Linux implementation of TCP 1210 Prague. 1212 Joakim Misund of Uni Oslo, Norway, wrote 1213 the Linux paced chirping code. 1215 Asad Sajjad Ahmed, Independent, Norway, wrote the 1216 Linux code that maintains a sub-packet window. 1218 9. References 1220 9.1.
Normative References

1222 [I-D.ietf-tcpm-accurate-ecn]
1223 Briscoe, B., Kuehlewind, M., and R. Scheffenegger, "More
1224 Accurate ECN Feedback in TCP", draft-ietf-tcpm-accurate-
1225 ecn-13 (work in progress), November 2020.

1227 [I-D.ietf-tsvwg-ecn-l4s-id]
1228 Schepper, K. and B. Briscoe, "Identifying Modified
1229 Explicit Congestion Notification (ECN) Semantics for
1230 Ultra-Low Queuing Delay (L4S)", draft-ietf-tsvwg-ecn-l4s-
1231 id-12 (work in progress), November 2020.

1233 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1234 Requirement Levels", BCP 14, RFC 2119,
1235 DOI 10.17487/RFC2119, March 1997,
1236 .

1238 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
1239 of Explicit Congestion Notification (ECN) to IP",
1240 RFC 3168, DOI 10.17487/RFC3168, September 2001,
1241 .

1243 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion
1244 Notification (ECN) Experimentation", RFC 8311,
1245 DOI 10.17487/RFC8311, January 2018,
1246 .

1248 9.2. Informative References

1250 [Ahmed19] Ahmed, A., "Extending TCP for Low Round Trip Delay",
1251 Masters Thesis, Uni Oslo , August 2019,
1252 .

1254 [ecn-fallback]
1255 Briscoe, B. and A. Ahmed, "TCP Prague Fall-back on
1256 Detection of a Classic ECN AQM", bobbriscoe.net Technical
1257 Report TR-BB-2019-002, April 2020,
1258 .

1260 [I-D.ietf-avtcore-cc-feedback-message]
1261 Sarker, Z., Perkins, C., Singh, V., and M. Ramalho, "RTP
1262 Control Protocol (RTCP) Feedback for Congestion Control",
1263 draft-ietf-avtcore-cc-feedback-message-09 (work in
1264 progress), November 2020.

1266 [I-D.ietf-quic-transport]
1267 Iyengar, J. and M. Thomson, "QUIC: A UDP-Based Multiplexed
1268 and Secure Transport", draft-ietf-quic-transport-34 (work
1269 in progress), January 2021.

1271 [I-D.ietf-tcpm-generalized-ecn]
1272 Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit
1273 Congestion Notification (ECN) to TCP Control Packets",
1274 draft-ietf-tcpm-generalized-ecn-06 (work in progress),
1275 October 2020.

1277 [I-D.ietf-tcpm-hystartplusplus]
1278 Balasubramanian, P., Huang, Y., and M. Olson, "HyStart++:
1279 Modified Slow Start for TCP", draft-ietf-tcpm-
1280 hystartplusplus-01 (work in progress), January 2021.

1282 [I-D.ietf-tcpm-rack]
1283 Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "The
1284 RACK-TLP loss detection algorithm for TCP", draft-ietf-
1285 tcpm-rack-15 (work in progress), December 2020.

1287 [I-D.ietf-tsvwg-aqm-dualq-coupled]
1288 Schepper, K., Briscoe, B., and G. White, "DualQ Coupled
1289 AQMs for Low Latency, Low Loss and Scalable Throughput
1290 (L4S)", draft-ietf-tsvwg-aqm-dualq-coupled-13 (work in
1291 progress), November 2020.

1293 [I-D.ietf-tsvwg-l4s-arch]
1294 Briscoe, B., Schepper, K., Bagnulo, M., and G. White, "Low
1295 Latency, Low Loss, Scalable Throughput (L4S) Internet
1296 Service: Architecture", draft-ietf-tsvwg-l4s-arch-08 (work
1297 in progress), November 2020.

1299 [L4S-home]
1300 "L4S: Ultra-Low Queuing Delay for All",
1301 .

1303 [LinuxPacedChirping]
1304 Misund, J. and B. Briscoe, "Paced Chirping - Rethinking
1305 TCP start-up", Proc. Linux Netdev 0x13 , March 2019,
1306 .

1308 [patch-alpha-zero]
1309 Shewmaker, A., "tcp: allow dctcp alpha to drop to zero",
1310 Linux GitHub patch; Commit: c80dbe0, October 2015,
1311 .

1314 [patch-loss-react]
1315 De Schepper, K., "tcp: Ensure DCTCP reacts to losses",
1316 Linux GitHub patch; Commit: aecfde2, April 2019,
1317 .

1320 [PerAckEWMA]
1321 Briscoe, B., "Improving DCTCP/Prague Congestion Control
1322 Responsiveness", Technical Report TR-BB-2020-002, January
1323 2021, .

1325 [PragueLinux]
1326 Briscoe, B., De Schepper, K., Albisser, O., Misund, J.,
1327 Tilmans, O., Kuehlewind, M., and A. Ahmed, "Implementing
1328 the `TCP Prague' Requirements for Low Latency Low Loss
1329 Scalable Throughput (L4S)", Proc. Linux Netdev 0x13 ,
1330 March 2019, .

1333 [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows",
1334 RFC 3649, DOI 10.17487/RFC3649, December 2003,
1335 .

1337 [RFC4340] Kohler, E., Handley, M., and S. Floyd, "Datagram
1338 Congestion Control Protocol (DCCP)", RFC 4340,
1339 DOI 10.17487/RFC4340, March 2006,
1340 .

1342 [RFC4960] Stewart, R., Ed., "Stream Control Transmission Protocol",
1343 RFC 4960, DOI 10.17487/RFC4960, September 2007,
1344 .

1346 [RFC5033] Floyd, S. and M. Allman, "Specifying New Congestion
1347 Control Algorithms", BCP 133, RFC 5033,
1348 DOI 10.17487/RFC5033, August 2007,
1349 .

1351 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1352 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
1353 .

1355 [RFC6679] Westerlund, M., Johansson, I., Perkins, C., O'Hanlon, P.,
1356 and K. Carlberg, "Explicit Congestion Notification (ECN)
1357 for RTP over UDP", RFC 6679, DOI 10.17487/RFC6679, August
1358 2012, .

1360 [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
1361 "Increasing TCP's Initial Window", RFC 6928,
1362 DOI 10.17487/RFC6928, April 2013,
1363 .

1365 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
1366 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
1367 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
1368 October 2017, .

1370 [RFC8298] Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation
1371 for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December
1372 2017, .

1374 [Tensions17]
1375 Briscoe, B. and K. De Schepper, "Resolving Tensions
1376 between Congestion Control Scaling Requirements", Simula
1377 Technical Report TR-CS-2016-001; arXiv:1904.07605, July
1378 2017, .

1380 Authors' Addresses

1382 Koen De Schepper
1383 Nokia Bell Labs
1384 Antwerp
1385 Belgium

1387 Email: koen.de_schepper@nokia.com
1388 URI: https://www.bell-labs.com/usr/koen.de_schepper
1389 Olivier Tilmans
1390 Nokia Bell Labs
1391 Antwerp
1392 Belgium

1394 Email: olivier.tilmans@nokia-bell-labs.com

1396 Bob Briscoe (editor)
1397 Independent
1398 UK

1400 Email: ietf@bobbriscoe.net
1401 URI: http://bobbriscoe.net/