TCP Maintenance Working Group                                 M. Mathis
Internet-Draft                                             N. Dukkipati
Obsoletes: 6937 (if approved)                                  Y. Cheng
Intended status: Standards Track                           Google, Inc.
Expires: 4 May 2021                                     31 October 2020

                  Proportional Rate Reduction for TCP
                    draft-mathis-tcpm-rfc6937bis-00

Abstract

This document updates the Proportional Rate Reduction (PRR) algorithm, described as experimental in RFC 6937, to standards track. PRR potentially replaces the Fast Recovery and Rate-Halving algorithms. All of these algorithms regulate the amount of data sent by TCP or other transport protocols during loss recovery.
PRR more accurately regulates the actual flight size through recovery such that at the end of recovery it will be as close as possible to ssthresh, as determined by the congestion control algorithm. This document obsoletes RFC 6937.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 May 2021.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Algorithms
       3.1.  Examples
   4.  Properties
   5.  Measurements
   6.  Conclusion and Recommendations
   7.  Acknowledgements
   8.  Security Considerations
   9.  Normative References
   10. Informative References
   Appendix A.  Strong Packet Conservation Bound
   Authors' Addresses

1. Introduction

This document updates the Proportional Rate Reduction (PRR) algorithm described in [RFC6937] from experimental to standards track. PRR accurately regulates the amount of data sent during loss recovery, such that at the end of recovery the flight size will be as close as possible to ssthresh, as determined by the congestion control algorithm. PRR has been deployed in at least 3 major operating systems, covering the vast majority of today's web traffic. There have been no changes to PRR as documented in the experimental RFC. The descriptions here have been [will be] updated to normative standards language. For a tutorial description of the algorithms and the rationale behind them, please see the original RFC.
The experimental RFC describes two different Reduction Bound algorithms to limit the total window reduction due to all mechanisms, including transient application stalls and the losses themselves: Conservative Reduction Bound (CRB), which is strictly packet conserving, and Slow Start Reduction Bound (SSRB), which is more aggressive than CRB by at most 1 segment per ACK. [RFC6937] left the choice of Reduction Bound to the discretion of the implementer.

The paper "An Internet-Wide Analysis of Traffic Policing" [Flach2016] uncovered a crucial situation where the Reduction Bound mattered. Under certain configurations, token bucket traffic policers [token_bucket] can suddenly start discarding a large fraction of the traffic. This happens without warning when policers run out of tokens. The transport congestion control has no opportunity to measure the token rate, and sets ssthresh based on the recently observed path performance. This value for ssthresh may be substantially larger than can be sustained by the token rate, potentially causing persistent high loss. Under these conditions, both Reduction Bounds perform very poorly. PRR-CRB is too timid, sometimes causing very long recovery times at smaller than necessary windows, and PRR-SSRB is too aggressive, often causing many retransmissions to be lost multiple times.

Investigating these environments led to the development of a heuristic to dynamically switch between Reduction Bounds: use PRR-SSRB only while snd.una is advancing without additional losses, and use PRR-CRB otherwise. (A non-normative sketch of this heuristic appears at the end of this section.)

This heuristic is only invoked for what should be a rare corner case: when losses or other events cause the flight size to fall below ssthresh. The extreme loss rates that make the heuristic important are only common in the presence of poorly configured token bucket policers, which are pathologically wasteful and inefficient. In these environments the heuristic serves to salvage a bad situation, and any reasonable implementation of it performs far better than either bound by itself.

The algorithm below is identical to the algorithm presented in [RFC6937]. The "conservative" parameter MAY be replaced by the heuristic described above.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

[All text below is copied from RFC 6937; it will be revised after this document is adopted as a tcpm work item.]
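As a concrete illustration of the bound-switching heuristic introduced above, the following Python sketch shows one way it might be wired into a per-ACK path. It is non-normative: the names (prev_snd_una, newly_marked_lost, conservative) are ours, not part of the algorithm specified below.

   # Non-normative sketch of the bound-switching heuristic.
   # state.prev_snd_una is assumed to be initialized to snd.una when
   # recovery starts; state.conservative feeds the algorithm below.

   def choose_reduction_bound(state, snd_una, newly_marked_lost):
       # Use PRR-SSRB only while snd.una advances without additional
       # losses being detected; otherwise fall back to PRR-CRB.
       if snd_una > state.prev_snd_una and newly_marked_lost == 0:
           state.conservative = False   # PRR-SSRB
       else:
           state.conservative = True    # PRR-CRB
       state.prev_snd_una = snd_una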
2. Definitions

The following terms, parameters, and state variables are used as they are defined in earlier documents:

   RFC 793:  snd.una (send unacknowledged)

   RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size
             (SMSS)

   RFC 6675: covered (as in "covered sequence numbers")

Voluntary window reductions: choosing not to send data in response to some ACKs, for the purpose of reducing the sending window size and data rate.

We define some additional variables:

SACKd: The total number of bytes that the scoreboard indicates have been delivered to the receiver. This can be computed by scanning the scoreboard and counting the total number of bytes covered by all SACK blocks. If SACK is not in use, SACKd is not defined.

DeliveredData: The total number of bytes that the current ACK indicates have been delivered to the receiver. When not in recovery, DeliveredData is the change in snd.una. With SACK, DeliveredData can be computed precisely as the change in snd.una, plus the (signed) change in SACKd. In recovery without SACK, DeliveredData is estimated to be 1 SMSS on duplicate acknowledgements, and on a subsequent partial or full ACK, DeliveredData is estimated to be the change in snd.una, minus 1 SMSS for each preceding duplicate ACK.

Note that DeliveredData is robust; for TCP using SACK, DeliveredData can be precisely computed anywhere in the network just by inspecting the returning ACKs. The consequence of missing ACKs is that later ACKs will show a larger DeliveredData. Furthermore, for any TCP (with or without SACK), the sum of DeliveredData must agree with the forward progress over the same time interval.

We introduce a local variable "sndcnt", which indicates exactly how many bytes should be sent in response to each ACK. Note that the decision of which data to send (e.g., retransmit missing data or send more new data) is out of scope for this document.

3. Algorithms

At the beginning of recovery, initialize PRR state. This assumes a modern congestion control algorithm, CongCtrlAlg(), that might set ssthresh to something other than FlightSize/2:

   ssthresh = CongCtrlAlg()      // Target cwnd after recovery
   prr_delivered = 0             // Total bytes delivered during recovery
   prr_out = 0                   // Total bytes sent during recovery
   RecoverFS = snd.nxt-snd.una   // FlightSize at the start of recovery

                                Figure 1

On every ACK during recovery compute:

   DeliveredData = change_in(snd.una) + change_in(SACKd)
   prr_delivered += DeliveredData
   pipe = (RFC 6675 pipe algorithm)
   if (pipe > ssthresh) {
       // Proportional Rate Reduction
       sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
   } else {
       // Two versions of the Reduction Bound
       if (conservative) {    // PRR-CRB
           limit = prr_delivered - prr_out
       } else {               // PRR-SSRB
           limit = MAX(prr_delivered - prr_out, DeliveredData) + MSS
       }
       // Attempt to catch up, as permitted by limit
       sndcnt = MIN(ssthresh - pipe, limit)
   }

                                Figure 2

On any data transmission or retransmission:

   prr_out += (data sent)    // strictly less than or equal to sndcnt

                                Figure 3
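To make Figures 1-3 concrete, the following self-contained Python sketch combines them into a single per-ACK routine. It is illustrative only: the names are ours, byte counting is assumed, and the caller is assumed to compute DeliveredData and the RFC 6675 pipe estimate as defined above.

   import math

   SMSS = 1460  # assumed sender maximum segment size, in bytes

   class PRRState:
       """Non-normative sketch of Figures 1-3."""

       def __init__(self, ssthresh, flight_size):
           # Figure 1: initialize PRR state at the start of recovery.
           self.ssthresh = ssthresh        # target cwnd from CongCtrlAlg()
           self.prr_delivered = 0          # bytes delivered in recovery
           self.prr_out = 0                # bytes sent in recovery
           self.recover_fs = flight_size   # snd.nxt - snd.una at entry

       def on_ack(self, delivered_data, pipe, conservative=False):
           # Figure 2: delivered_data is change_in(snd.una) +
           # change_in(SACKd); pipe is the RFC 6675 pipe estimate;
           # conservative may be driven by the Section 1 heuristic.
           self.prr_delivered += delivered_data
           if pipe > self.ssthresh:
               # Proportional Rate Reduction
               sndcnt = (math.ceil(self.prr_delivered * self.ssthresh
                                   / self.recover_fs) - self.prr_out)
           else:
               if conservative:   # PRR-CRB
                   limit = self.prr_delivered - self.prr_out
               else:              # PRR-SSRB
                   limit = max(self.prr_delivered - self.prr_out,
                               delivered_data) + SMSS
               # Attempt to catch up, as permitted by limit
               sndcnt = min(self.ssthresh - pipe, limit)
           # Clamp (our addition): a non-positive result means that
           # nothing should be sent on this ACK.
           return max(sndcnt, 0)

       def on_send(self, data_sent):
           # Figure 3: account for every transmission/retransmission.
           self.prr_out += data_sent

A caller would invoke on_ack() once per ACK during recovery and on_send() for every transmission or retransmission it actually makes.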
"Sent" indicates how much 'N'ew or 230 'R'etransmitted data would be sent. Note that the algorithms for 231 deciding which data to send are out of scope of this document. 233 When there is a single loss, PRR with either of the Reduction Bound 234 algorithms has the same behavior. We show "RB", a flag indicating 235 which Reduction Bound subexpression ultimately determined the value 236 of sndcnt. When there are minimal losses, "limit" (both algorithms) 237 will always be larger than ssthresh - pipe, so the sndcnt will be 238 ssthresh - pipe, indicated by "s" in the "RB" row. 240 RFC 6675 241 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 242 cwnd: 20 20 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 243 pipe: 19 19 18 18 17 16 15 14 13 12 11 10 10 10 10 10 10 10 10 244 sent: N N R N N N N N N N N 246 Rate-Halving (Linux) 247 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 248 cwnd: 20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 249 pipe: 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 250 sent: N N R N N N N N N N N 252 PRR 253 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 254 pipe: 19 19 18 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 10 255 sent: N N R N N N N N N N N 256 RB: s s 257 Cwnd is not shown because PRR does not use it. 259 Key for RB 260 s: sndcnt = ssthresh - pipe // from ssthresh 261 b: sndcnt = prr_delivered - prr_out + SMSS // from banked 262 d: sndcnt = DeliveredData + SMSS // from DeliveredData 263 (Sometimes, more than one applies.) 265 Figure 4 267 Note that all 3 algorithms send the same total amount of data. RFC 268 6675 experiences a "half window of silence", while the Rate-Halving 269 and PRR spread the voluntary window reduction across an entire RTT. 271 Next, we consider the same initial conditions when the first 15 272 packets (0-14) are lost. During the remainder of the lossy RTT, only 273 5 ACKs are returned to the sender. We examine each of these 274 algorithms in succession. 276 RFC 6675 277 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 278 cwnd: 20 20 11 11 11 279 pipe: 19 19 4 10 10 280 sent: N N 7R R R 282 Rate-Halving (Linux) 283 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 284 cwnd: 20 20 5 5 5 285 pipe: 19 19 4 4 4 286 sent: N N R R R 288 PRR-CRB 289 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 290 pipe: 19 19 4 4 4 291 sent: N N R R R 292 RB: b b b 294 PRR-SSRB 295 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 296 pipe: 19 19 4 5 6 297 sent: N N 2R 2R 2R 298 RB: bd d d 300 Figure 5 302 In this specific situation, RFC 6675 is more aggressive because once 303 Fast Retransmit is triggered (on the ACK for segment 17), TCP 304 immediately retransmits sufficient data to bring pipe up to cwnd. 305 Our measurement data (see Section 5) indicates that RFC 6675 306 significantly outperforms Rate-Halving, PRR-CRB, and some other 307 similarly conservative algorithms that we tested, showing that it is 308 significantly common for the actual losses to exceed the window 309 reduction determined by the congestion control algorithm. 311 The Linux implementation of Rate-Halving includes an early version of 312 the Conservative Reduction Bound [RHweb]. In this situation, the 5 313 ACKs trigger exactly 1 transmission each (2 new data, 3 old data), 314 and cwnd is set to 5. At a window size of 5, it takes 3 round trips 315 to retransmit all 15 lost segments. Rate-Halving does not raise the 316 window at all during recovery, so when recovery finally completes, 317 TCP will slow start cwnd from 5 up to 10. 
In this example, TCP operates at half of the window chosen by the congestion control for more than 3 RTTs, increasing the elapsed time and exposing it to timeouts in the event that there are additional losses.

PRR-CRB implements a Conservative Reduction Bound. Since the total losses bring pipe below ssthresh, data is sent such that the total data transmitted, prr_out, follows the total data delivered to the receiver as reported by returning ACKs. Transmission is controlled by the sending limit, which is set to prr_delivered - prr_out. This is indicated by the RB:b tagging in the figure. In this case, PRR-CRB is exposed to exactly the same problems as Rate-Halving; the excess window reduction causes it to take excessively long to recover the losses and exposes it to additional timeouts.

PRR-SSRB increases the window by exactly 1 segment per ACK until pipe rises to ssthresh during recovery. This is accomplished by setting limit to one greater than the data reported to have been delivered to the receiver on this ACK, implementing slow start during recovery, as indicated by the RB:d tagging in the figure. Although increasing the window during recovery seems ill advised, it is important to remember that this is actually less aggressive than permitted by RFC 5681, which sends the same quantity of additional data as a single burst in response to the ACK that triggered Fast Retransmit.

For less extreme events, where the total losses are smaller than the difference between FlightSize and ssthresh, PRR-CRB and PRR-SSRB have identical behaviors.

4. Properties

The following properties are common to both PRR-CRB and PRR-SSRB, except as noted:

PRR maintains TCP's ACK clocking across most recovery events, including burst losses. RFC 6675 can send large unclocked bursts following burst losses.

Normally, PRR will spread voluntary window reductions out evenly across a full RTT. This has the potential to generally reduce the burstiness of Internet traffic, and could be considered to be a type of soft pacing. Hypothetically, any pacing increases the probability that different flows are interleaved, reducing the opportunity for ACK compression and other phenomena that increase traffic burstiness. However, these effects have not been quantified.

If there are minimal losses, PRR will converge to exactly the target window chosen by the congestion control algorithm. Note that as TCP approaches the end of recovery, prr_delivered will approach RecoverFS and sndcnt will be computed such that prr_out approaches ssthresh.

Implicit window reductions, due to multiple isolated losses during recovery, cause later voluntary reductions to be skipped. For small numbers of losses, the window size ends at exactly the window chosen by the congestion control algorithm.

For burst losses, earlier voluntary window reductions can be undone by sending extra segments in response to ACKs arriving later during recovery. Note that as long as some voluntary window reductions are not undone, the final value for pipe will be the same as ssthresh, the target cwnd value chosen by the congestion control algorithm. (A numeric check of this convergence appears below.)
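As an illustration only (not part of the specification), the convergence can be checked by direct arithmetic. Counting in segments, with ssthresh = 10 and RecoverFS = 20 as in the examples above, and assuming pipe stays above ssthresh so that only the proportional branch of Figure 2 runs:

   import math

   # Once everything outstanding at the start of recovery has been
   # delivered (prr_delivered == RecoverFS), the total sent under the
   # proportional branch is exactly ssthresh.
   ssthresh, recover_fs = 10, 20       # segments, as in the examples
   prr_delivered, prr_out = 0, 0
   for _ in range(recover_fs):         # one ACK per delivered segment
       prr_delivered += 1
       sndcnt = math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out
       prr_out += sndcnt
   assert prr_out == ssthresh          # final flight size == ssthresh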
PRR with either Reduction Bound improves the situation when there are application stalls, e.g., when the sending application does not queue data for transmission quickly enough or the receiver stops advancing rwnd (receiver window). When there is an application stall early during recovery, prr_out will fall behind the sum of the transmissions permitted by sndcnt. The missed opportunities to send due to stalls are treated like banked voluntary window reductions; specifically, they cause prr_delivered - prr_out to be significantly positive. If the application catches up while TCP is still in recovery, TCP will send a partial window burst to catch up to exactly where it would have been had the application never stalled. (A small worked example of this banking appears at the end of this section.) Although this burst might be viewed as being hard on the network, this is exactly what happens every time there is a partial RTT application stall while not in recovery. We have made the partial RTT stall behavior uniform in all states. Changing this behavior is out of scope for this document.

PRR with either Reduction Bound is less sensitive to errors in the pipe estimator. While in recovery, pipe is intrinsically an estimator, using incomplete information to estimate if un-SACKed segments are actually lost or merely out of order in the network. Under some conditions, pipe can have significant errors; for example, pipe is underestimated when a burst of reordered data is prematurely assumed to be lost and marked for retransmission. If the transmissions are regulated directly by pipe, as they are with RFC 6675, a step discontinuity in the pipe estimator causes a burst of data, which cannot be retracted once the pipe estimator is corrected a few ACKs later. For PRR, pipe merely determines which algorithm, PRR or the Reduction Bound, is used to compute sndcnt from DeliveredData. While pipe is underestimated, the algorithms differ by at most 1 segment per ACK. Once pipe is updated, they converge to the same final window at the end of recovery.

Under all conditions and sequences of events during recovery, PRR-CRB strictly bounds the data transmitted to be equal to or less than the amount of data delivered to the receiver. We claim that this Strong Packet Conservation Bound is the most aggressive algorithm that does not lead to additional forced losses in some environments. It has the property that if there is a standing queue at a bottleneck with no cross traffic, the queue will maintain exactly constant length for the duration of the recovery, except for +1/-1 fluctuation due to differences in packet arrival and exit times. See Appendix A for a detailed discussion of this property.

Although the Strong Packet Conservation Bound is very appealing for a number of reasons, our measurements summarized in Section 5 demonstrate that it is less aggressive and does not perform as well as RFC 6675, which permits bursts of data when there are bursts of losses. PRR-SSRB is a compromise that permits TCP to send 1 extra segment per ACK as compared to the Packet Conserving Bound. From the perspective of a strict Packet Conserving Bound, PRR-SSRB does indeed open the window during recovery; however, it is significantly less aggressive than RFC 6675 in the presence of burst losses.
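Returning to the application-stall discussion above, a toy calculation (ours, counting in segments and using the PRR-CRB limit for simplicity) illustrates the banked catch-up:

   # Illustrative numbers, not from the RFC: pipe sagged to 4 while an
   # application stall kept prr_out from keeping pace with deliveries.
   ssthresh, pipe = 10, 4
   prr_delivered, prr_out = 8, 2     # 6 sending opportunities banked
   limit = prr_delivered - prr_out   # PRR-CRB limit
   sndcnt = min(ssthresh - pipe, limit)
   assert sndcnt == 6                # one catch-up burst, bounded by
                                     # ssthresh - pipe, then back in step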
5. Measurements

In a companion IMC11 paper [IMC11], we describe some measurements comparing the various strategies for reducing the window during recovery. The experiments were performed on servers carrying Google production traffic and are briefly summarized here.

The various window reduction algorithms and extensive instrumentation were all implemented in Linux 2.6. We used the uniform set of algorithms present in the base Linux implementation, including CUBIC [CUBIC], Limited Transmit [RFC3042], threshold transmit (Section 3.1 in [FACK]) (this algorithm was not present in RFC 3517, but a similar algorithm has been added to RFC 6675), and lost retransmission detection algorithms. We confirmed that the behaviors of Rate-Halving (the Linux default), RFC 3517, and PRR were authentic to their respective specifications and that performance and features were comparable to the kernels in production use. All of the different window reduction algorithms were present in a common kernel and could be selected with a sysctl, such that we had an absolutely uniform baseline for comparing them.

Our experiments included an additional algorithm, PRR with an unlimited bound (PRR-UB), which sends ssthresh - pipe bursts when pipe falls below ssthresh. This behavior parallels RFC 3517.

An important detail of this configuration is that CUBIC only reduces the window by 30%, as opposed to the 50% reduction used by traditional congestion control algorithms. This accentuates the tendency for RFC 3517 and PRR-UB to send a burst at the point when Fast Retransmit gets triggered, because pipe is likely to already be below ssthresh. Precisely this condition was observed for 32% of the recovery events: pipe fell below ssthresh before Fast Retransmit was triggered, thus the various PRR algorithms started in the Reduction Bound phase, and RFC 3517 sent bursts of segments with the Fast Retransmit.

In the companion paper, we observe that PRR-SSRB spends the least time in recovery of all the algorithms tested, largely because it experiences fewer timeouts once it is already in recovery.

RFC 3517 experiences 29% more detected lost retransmissions and 2.6% more timeouts (presumably due to undetected lost retransmissions) than PRR-SSRB. These results are representative of PRR-UB and other algorithms that send bursts when pipe falls below ssthresh.

Rate-Halving experiences 5% more timeouts and significantly smaller final cwnd values at the end of recovery. The smaller cwnd sometimes causes the recovery itself to take extra round trips. These results are representative of PRR-CRB and other algorithms that implement strict packet conservation during recovery.

6. Conclusion and Recommendations

Although the Strong Packet Conservation Bound used in PRR-CRB is very appealing for a number of reasons, our measurements show that it is less aggressive and does not perform as well as RFC 3517 (and by implication RFC 6675), which permits bursts of data when there are bursts of losses. RFC 3517 and RFC 6675 are conservative in the original sense of Van Jacobson's packet conservation principle, which included the assumption that presumed lost segments have indeed left the network.
PRR-CRB makes no such assumption, following instead a Strong Packet Conservation Bound in which only packets that have actually arrived at the receiver are considered to have left the network. PRR-SSRB is a compromise that permits TCP to send 1 extra segment per ACK relative to the Strong Packet Conservation Bound, to partially compensate for excess losses.

From the perspective of the Strong Packet Conservation Bound, PRR-SSRB does indeed open the window during recovery; however, it is significantly less aggressive than RFC 3517 (and RFC 6675) in the presence of burst losses. Even so, it often outperforms RFC 3517 (and presumably RFC 6675) because it avoids some of the self-inflicted losses caused by bursts.

At this time, we see no reason not to test and deploy PRR-SSRB on a large scale. Implementers worried about any potential impact of raising the window during recovery may want to optionally support PRR-CRB (which is actually simpler to implement) for comparison studies. Furthermore, there is one minor detail of PRR that can be improved by replacing pipe with total_pipe, as defined by Laminar TCP [Laminar].

One final comment about terminology: we expect that common usage will drop "Slow Start Reduction Bound" from the algorithm name. This document needed to be pedantic about having distinct names for PRR and every variant of the Reduction Bound. However, we do not anticipate any future exploration of the alternative Reduction Bounds.

7. Acknowledgements

This document is based in part on previous incomplete work by Matt Mathis, Jeff Semke, and Jamshid Mahdavi [RHID] and influenced by several discussions with John Heffner.

Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the experiments.

Ilpo Jarvinen reviewed the code.

Mark Allman improved the document through his insightful review.

Neal Cardwell reviewed and tested the patch.

8. Security Considerations

PRR does not change the risk profile for TCP.

Implementers that change PRR from counting bytes to segments have to be cautious about the effects of ACK splitting attacks [Savage99], where the receiver acknowledges partial segments for the purpose of confusing the sender's congestion accounting. (A small illustration follows.)
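The distinction can be shown with a toy calculation (ours, not the RFC's), in which a misbehaving receiver splits one segment's worth of acknowledgement into ten slivers:

   SMSS = 1000   # assumed segment size, in bytes, for illustration

   # The receiver acknowledges one SMSS-sized segment in ten
   # 100-byte partial acknowledgements [Savage99].
   acks = [100] * 10

   byte_credit = sum(acks)             # byte counting: 1000 bytes
   segment_credit = len(acks) * SMSS   # naive segment counting

   assert byte_credit == SMSS          # DeliveredData grows at the
                                       # true delivery rate
   assert segment_credit == 10 * SMSS  # inflated tenfold by the attack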
9. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981,
              <https://www.rfc-editor.org/info/rfc793>.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018,
              DOI 10.17487/RFC2018, October 1996,
              <https://www.rfc-editor.org/info/rfc2018>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
              <https://www.rfc-editor.org/info/rfc5681>.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
              and Y. Nishida, "A Conservative Loss Recovery Algorithm
              Based on Selective Acknowledgment (SACK) for TCP",
              RFC 6675, DOI 10.17487/RFC6675, August 2012,
              <https://www.rfc-editor.org/info/rfc6675>.

10. Informative References

   [CUBIC]    Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-speed
              TCP variant", PFLDnet 2005, February 2005.

   [FACK]     Mathis, M. and J. Mahdavi, "Forward Acknowledgment:
              Refining TCP Congestion Control", ACM SIGCOMM,
              August 1996.

   [Flach2016]
              Flach, T., Papageorge, P., Terzis, A., Pedrosa, L.,
              Cheng, Y., Karim, T., Katz-Bassett, E., and R. Govindan,
              "An Internet-Wide Analysis of Traffic Policing",
              ACM SIGCOMM, August 2016.

   [IMC11]    Dukkipati, N., Mathis, M., Cheng, Y., and M. Ghobadi,
              "Proportional Rate Reduction for TCP", Proceedings of the
              11th ACM SIGCOMM Conference on Internet Measurement 2011,
              Berlin, Germany, November 2011.

   [Jacobson88]
              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
              Comput. Commun. Rev. 18(4), August 1988.

   [Laminar]  Mathis, M., "Laminar TCP and the case for refactoring TCP
              congestion control", Work in Progress, 16 July 2012.

   [RFC3042]  Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
              TCP's Loss Recovery Using Limited Transmit", RFC 3042,
              DOI 10.17487/RFC3042, January 2001,
              <https://www.rfc-editor.org/info/rfc3042>.

   [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
              Conservative Selective Acknowledgment (SACK)-based Loss
              Recovery Algorithm for TCP", RFC 3517,
              DOI 10.17487/RFC3517, April 2003,
              <https://www.rfc-editor.org/info/rfc3517>.

   [RFC6937]  Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
              Rate Reduction for TCP", RFC 6937, DOI 10.17487/RFC6937,
              May 2013, <https://www.rfc-editor.org/info/rfc6937>.

   [RHID]     Mathis, M., Semke, J., and J. Mahdavi, "The Rate-Halving
              Algorithm for TCP Congestion Control", Work in Progress,
              August 1999.

   [RHweb]    Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
              Parameters", Web publication, December 1997.

   [Savage99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
              "TCP congestion control with a misbehaving receiver",
              SIGCOMM Comput. Commun. Rev. 29(5), October 1999.

Appendix A. Strong Packet Conservation Bound

PRR-CRB is based on a conservative, philosophically pure, and aesthetically appealing Strong Packet Conservation Bound, described here. Although inspired by Van Jacobson's packet conservation principle [Jacobson88], it differs in how it treats segments that are missing and presumed lost. Under all conditions and sequences of events during recovery, PRR-CRB strictly bounds the data transmitted to be equal to or less than the amount of data delivered to the receiver. Note that the effects of presumed losses are included in the pipe calculation, but do not affect the outcome of PRR-CRB, once pipe has fallen below ssthresh.

We claim that this Strong Packet Conservation Bound is the most aggressive algorithm that does not lead to additional forced losses in some environments. It has the property that if there is a standing queue at a bottleneck that is carrying no other traffic, the queue will maintain exactly constant length for the entire duration of the recovery, except for +1/-1 fluctuation due to differences in packet arrival and exit times. Any less aggressive algorithm will result in a declining queue at the bottleneck. Any more aggressive algorithm will result in an increasing queue or additional losses if it is a full drop tail queue.

We demonstrate this property with a little thought experiment:

Imagine a network path that has insignificant delays in both directions, except for the processing time and queue at a single bottleneck in the forward path. By insignificant delay, we mean when a packet is "served" at the head of the bottleneck queue, the following events happen in much less than one bottleneck packet time: the packet arrives at the receiver; the receiver sends an ACK that arrives at the sender; the sender processes the ACK and sends some data; the data is queued at the bottleneck.
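Before walking through the argument, a toy rendition of this setup in Python (illustrative only; unit-sized segments, instantaneous ACKs) checks the constant-queue claim developed in the next paragraph:

   from collections import deque

   queue = deque([1] * 20)        # standing queue at the bottleneck
   for _ in range(100):           # serve 100 segments
       queue.popleft()            # head-of-queue segment departs...
       delivered_data = 1         # ...and is ACKed almost immediately
       sndcnt = delivered_data    # Strong Packet Conservation Bound
       queue.extend([1] * sndcnt) # the new data lands in the queue
       assert len(queue) == 20    # queue length is exactly constant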
If sndcnt is set to DeliveredData and nothing else is inhibiting sending data, then clearly the data arriving at the bottleneck queue will exactly replace the data that was served at the head of the queue, so the queue will have a constant length. If the queue is drop tail and full, then the queue will stay exactly full. Losses or reordering on the ACK path only cause wider fluctuations in the queue size, but do not raise its peak size, independent of whether the data is in order or out of order (including loss recovery from an earlier RTT). Any more aggressive algorithm that sends additional data will overflow the drop tail queue and cause loss. Any less aggressive algorithm will under-fill the queue. Therefore, setting sndcnt to DeliveredData is the most aggressive algorithm that does not cause forced losses in this simple network. Relaxing the assumptions (e.g., making delays more authentic and adding more flows, delayed ACKs, etc.) is likely to increase the fine-grained fluctuations in queue size but does not change its basic behavior.

Note that the congestion control algorithm implements a broader notion of optimal that includes appropriately sharing the network. Typical congestion control algorithms are likely to reduce the data sent relative to the Packet Conserving Bound implemented by PRR, bringing TCP's actual window down to ssthresh.

Authors' Addresses

   Matt Mathis
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   United States of America

   Email: mattmathis@google.com

   Nandita Dukkipati
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   United States of America

   Email: nanditad@google.com

   Yuchung Cheng
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   United States of America

   Email: ycheng@google.com