2 TCP Maintenance Working Group M. Mathis 3 Internet-Draft N. Dukkipati 4 Obsoletes: 6937 (if approved) Y. Cheng 5 Intended status: Standards Track Google, Inc. 6 Expires: 16 July 2021 12 January 2021 8 Proportional Rate Reduction for TCP 9 draft-ietf-tcpm-prr-rfc6937bis-00 11 Abstract 13 This document updates the Proportional Rate Reduction (PRR) algorithm 14 described as experimental in RFC 6937 to standards track. PRR 15 potentially replaces the Fast Recovery and Rate-Halving algorithms. 16 All of these algorithms regulate the amount of data sent by TCP or 17 other transport protocol during loss recovery. PRR more accurately 18 regulates the actual flight size through recovery such that at the 19 end of recovery it will be as close as possible to the ssthresh, as 20 determined by the congestion control algorithm. 22 Status of This Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts.
The list of current Internet- 30 Drafts is at https://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on 16 July 2021. 39 Copyright Notice 41 Copyright (c) 2021 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 46 license-info) in effect on the date of publication of this document. 47 Please review these documents carefully, as they describe your rights 48 and restrictions with respect to this document. Code Components 49 extracted from this document must include Simplified BSD License text 50 as described in Section 4.e of the Trust Legal Provisions and are 51 provided without warranty as described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction to bis . . . . . . . . . . . . . . . . . . . . . 2 56 1.1. Document and WG Information . . . . . . . . . . . . . . . 3 57 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 58 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 6 59 4. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 7 60 4.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . 8 61 5. Properties . . . . . . . . . . . . . . . . . . . . . . . . . 11 62 6. Measurements . . . . . . . . . . . . . . . . . . . . . . . . 13 63 7. Conclusion and Recommendations . . . . . . . . . . . . . . . 14 64 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 15 65 9. Security Considerations . . . . . . . . . . . . . . . . . . . 15 66 10. Normative References . . . . . . . . . . . . . . . . . . . . 15 67 11. Informative References . . . . . . . . . . . . . . . . . . . 16 68 Appendix A. Strong Packet Conservation Bound . . . . . . . . . . 17 69 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 18 71 1. Introduction to bis 73 This document updates the Proportional Rate Reduction (PRR) algorithm 74 described in [RFC6937] from experimental to standards track. PRR 75 accurately regulates the amount of data sent during loss recovery, such 76 that at the end of recovery the flight size will be as close as 77 possible to the ssthresh, as determined by the congestion control 78 algorithm. PRR has been deployed in at least 3 major operating 79 systems covering the vast majority of today's web traffic. There 80 have been no changes to PRR as documented in the experimental RFC. 81 The descriptions here have been [will be] updated to normative 82 standards language. For a tutorial description of the algorithms and 83 the rationale behind them, please see the original RFC. 85 The experimental RFC describes two different reduction bound 86 algorithms to limit the total window reduction due to all mechanisms, 87 including transient application stalls and the losses themselves: 88 Conservative Reduction Bound (CRB), which is strictly packet 89 conserving; and a Slow Start Reduction Bound (SSRB), which is more 90 aggressive than CRB by at most 1 segment per ACK. [RFC6937] left the 91 choice of Reduction Bound to the discretion of the implementer.
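To make the difference between the two bounds concrete (this is a paraphrase of the algorithm in Section 4, not new normative text): once pipe has fallen below ssthresh, PRR-CRB limits what may be sent on each ACK to the bytes delivered so far in recovery minus the bytes already sent (prr_delivered - prr_out), while PRR-SSRB raises that limit to at least the bytes newly reported delivered by this ACK plus one SMSS. The extra SMSS per ACK is what lets SSRB rebuild the flight size toward ssthresh in the manner of slow start, and it is the sole source of SSRB's additional aggressiveness relative to CRB.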
93 The paper "An Internet-Wide Analysis of Traffic Policing" [Flach et 94 al] uncovered a crucial situation where the Reduction Bound mattered. 95 Under certain configurations, token bucket traffic policers 96 [token_bucket] can suddenly start discarding a large fraction of the 97 traffic. This happens without warning to the end systems when 98 policers run out of tokens. The transport congestion control has no 99 opportunity to measure the token rate, and sets ssthresh based on the 100 recently observed path performance. This value for ssthresh may be 101 substantially larger than can be sustained by the token rate, 102 potentially causing persistent high loss. Under these conditions, 103 both reduction bounds perform very poorly. PRR-CRB is too timid, 104 sometimes causing very long recovery times at smaller than necessary 105 windows, and PRR-SSRB is too aggressive, often causing many 106 retransmissions to be lost multiple times. 108 Investigating these environments led to the development of a 109 heuristic to dynamically switch between Reduction Bounds: use PRR- 110 SSRB only while snd.una is advancing without evidence of ongoing 111 congestion and use PRR-CRB otherwise. 113 This heuristic is only invoked for what should be a rare corner case: 114 when losses or other events cause the flight size to fall below 115 ssthresh. The extreme loss rates that make the heuristic important 116 are only common in the presence of poorly configured token bucket 117 policers, which are pathologically wasteful and inefficient. In 118 these environments, the heuristic serves to salvage a bad situation, 119 and any reasonable implementation of the heuristic performs far 120 better than either bound by itself. 122 The algorithm below is identical to the algorithm presented in 123 [RFC6937]. The "conservative" parameter MAY be replaced by the 124 heuristic described above. 126 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 127 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 128 document are to be interpreted as described in [RFC2119]. 130 1.1. Document and WG Information 132 Please send all comments, questions, and feedback to tcpm@ietf.org. 134 Formatted: 2021-01-13 15:31:40-08:00 136 About this revision: 138 Text above (the "Introduction to bis") is drawn from draft-mathis- 139 tcpm-rfc6937bis-00. In this -00, all of the text below is copied 140 verbatim from RFC 6937, to facilitate comparison between RFC 6937 and 141 this document as it evolves. 143 2. Introduction 145 This document describes an experimental algorithm, PRR, to improve 146 the accuracy of the amount of data sent by TCP during loss recovery. 148 Standard congestion control [RFC5681] requires that TCP (and other 149 protocols) reduce their congestion window (cwnd) in response to 150 losses. Fast Recovery, described in the same document, is the 151 reference algorithm for making this adjustment. Its stated goal is 152 to recover TCP's self clock by relying on returning ACKs during 153 recovery to clock more data into the network. Fast Recovery 154 typically adjusts the window by waiting for one half round-trip time 155 (RTT) of ACKs to pass before sending any data. It is fragile because 156 it cannot compensate for the implicit window reduction caused by the 157 losses themselves. 159 RFC 6675 [RFC6675] makes Fast Recovery with Selective Acknowledgement 160 (SACK) [RFC2018] more accurate by computing "pipe", a sender side 161 estimate of the number of bytes still outstanding in the network.
162 With RFC 6675, Fast Recovery is implemented by sending data as 163 necessary on each ACK to prevent pipe from falling below slow-start 164 threshold (ssthresh), the window size as determined by the congestion 165 control algorithm. This protects Fast Recovery from timeouts in many 166 cases where there are heavy losses, although not if the entire second 167 half of the window of data or ACKs are lost. However, a single ACK 168 carrying a SACK option that implies a large quantity of missing data 169 can cause a step discontinuity in the pipe estimator, which can cause 170 Fast Retransmit to send a burst of data. 172 The Rate-Halving algorithm sends data on alternate ACKs during 173 recovery, such that after 1 RTT the window has been halved. Rate- 174 Halving is implemented in Linux after only being informally published 175 [RHweb], including an uncompleted document [RHID]. Rate-Halving also 176 does not adequately compensate for the implicit window reduction 177 caused by the losses and assumes a net 50% window reduction, which 178 was completely standard at the time it was written but not 179 appropriate for modern congestion control algorithms, such as CUBIC 180 [CUBIC], which reduce the window by less than 50%. As a consequence, 181 Rate-Halving often allows the window to fall further than necessary, 182 reducing performance and increasing the risk of timeouts if there are 183 additional losses. 185 PRR avoids these excess window adjustments such that at the end of 186 recovery the actual window size will be as close as possible to 187 ssthresh, the window size as determined by the congestion control 188 algorithm. It is patterned after Rate-Halving, but using the 189 fraction that is appropriate for the target window chosen by the 190 congestion control algorithm. During PRR, one of two additional 191 Reduction Bound algorithms limits the total window reduction due to 192 all mechanisms, including transient application stalls and the losses 193 themselves. 195 We describe two slightly different Reduction Bound algorithms: 196 Conservative Reduction Bound (CRB), which is strictly packet 197 conserving; and a Slow Start Reduction Bound (SSRB), which is more 198 aggressive than CRB by, at most, 1 segment per ACK. PRR-CRB meets 199 the Strong Packet Conservation Bound described in Appendix A; 200 however, in real networks it does not perform as well as the 201 algorithms described in RFC 6675, which prove to be more aggressive 202 in a significant number of cases. SSRB offers a compromise by 203 allowing TCP to send 1 additional segment per ACK relative to CRB in 204 some situations. Although SSRB is less aggressive than RFC 6675 205 (transmitting fewer segments or taking more time to transmit them), 206 it outperforms it, due to the lower probability of additional losses 207 during recovery. 209 The Strong Packet Conservation Bound on which PRR and both Reduction 210 Bounds are based is patterned after Van Jacobson's packet 211 conservation principle: segments delivered to the receiver are used 212 as the clock to trigger sending the same number of segments back into 213 the network. As much as possible, PRR and the Reduction Bound 214 algorithms rely on this self clock process, and are only slightly 215 affected by the accuracy of other estimators, such as pipe [RFC6675] 216 and cwnd. This is what gives the algorithms their precision in the 217 presence of events that cause uncertainty in other estimators. 
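As a brief worked illustration of this scaled self clock (the numbers here are ours, not taken from the RFC): if a CUBIC sender enters recovery with RecoverFS = 20 segments and ssthresh = 0.7 * RecoverFS = 14 segments, then while pipe remains above ssthresh the algorithm in Section 4 allows roughly 0.7 * DeliveredData to be sent per ACK, i.e., about 7 segments transmitted for every 10 segments reported delivered. The window reduction is thereby spread across the recovery RTT instead of being taken as a half-window pause or as a single burst.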
219 The original definition of the packet conservation principle 220 [Jacobson88] treated packets that are presumed to be lost (e.g., 221 marked as candidates for retransmission) as having left the network. 222 This idea is reflected in the pipe estimator defined in RFC 6675 and 223 used here, but it is distinct from the Strong Packet Conservation 224 Bound as described in Appendix A, which is defined solely on the 225 basis of data arriving at the receiver. 227 We evaluated these and other algorithms in a large scale measurement 228 study presented in a companion paper [IMC11] and summarized in 229 Section 6. This measurement study was based on RFC 3517 [RFC3517], 230 which has since been superseded by RFC 6675. Since there are slight 231 differences between the two specifications, and we were meticulous 232 about our implementation of RFC 3517, we are not comfortable 233 unconditionally asserting that our measurement results apply to RFC 234 6675, although we believe this to be the case. We have instead 235 chosen to be pedantic about describing measurement results relative 236 to RFC 3517, on which they were actually based. General discussions 237 of algorithms and their properties have been updated to refer to RFC 238 6675. 240 We found that for authentic network traffic, PRR-SSRB outperforms 241 both RFC 3517 and Linux Rate-Halving even though it is less 242 aggressive than RFC 3517. We believe that these results apply to RFC 243 6675 as well. 245 The algorithms are described as modifications to RFC 5681 [RFC5681], 246 "TCP Congestion Control", using concepts drawn from the pipe 247 algorithm [RFC6675]. They are most accurate and more easily 248 implemented with SACK [RFC2018], but do not require SACK. 250 3. Definitions 252 The following terms, parameters, and state variables are used as they 253 are defined in earlier documents: 255 RFC 793: snd.una (send unacknowledged) 257 RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size 258 (SMSS) 260 RFC 6675: covered (as in "covered sequence numbers") 262 Voluntary window reductions: choosing not to send data in response to 263 some ACKs, for the purpose of reducing the sending window size and 264 data rate 266 We define some additional variables: 268 SACKd: The total number of bytes that the scoreboard indicates have 269 been delivered to the receiver. This can be computed by scanning the 270 scoreboard and counting the total number of bytes covered by all sack 271 blocks. If SACK is not in use, SACKd is not defined. 273 DeliveredData: The total number of bytes that the current ACK 274 indicates have been delivered to the receiver. When not in recovery, 275 DeliveredData is the change in snd.una. With SACK, DeliveredData can 276 be computed precisely as the change in snd.una, plus the (signed) 277 change in SACKd. In recovery without SACK, DeliveredData is 278 estimated to be 1 SMSS on duplicate acknowledgements, and on a 279 subsequent partial or full ACK, DeliveredData is estimated to be the 280 change in snd.una, minus 1 SMSS for each preceding duplicate ACK. 282 Note that DeliveredData is robust; for TCP using SACK, DeliveredData 283 can be precisely computed anywhere in the network just by inspecting 284 the returning ACKs. The consequence of missing ACKs is that later 285 ACKs will show a larger DeliveredData. Furthermore, for any TCP 286 (with or without SACK), the sum of DeliveredData must agree with the 287 forward progress over the same time interval. 
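To make the SACK-based computation of DeliveredData concrete, the following C sketch may be useful. It is illustrative only and not part of the specification: the structure, field, and function names (prr_sock, snd_una, sackd, delivered_data) are invented for this example, and maintaining the SACK scoreboard itself is assumed to happen elsewhere.

   /* Illustrative sketch (not part of the specification) of per-ACK
    * DeliveredData bookkeeping for a SACK-enabled sender, following the
    * definition above.  Names are invented for this example; updating the
    * SACK scoreboard is assumed to happen elsewhere. */
   #include <stdint.h>
   #include <stdio.h>

   struct prr_sock {
       uint64_t snd_una;   /* highest cumulatively acknowledged sequence number */
       uint64_t sackd;     /* total bytes currently covered by SACK blocks      */
   };

   /* Called once per arriving ACK, after the scoreboard has been updated;
    * new_una and new_sackd are the post-update values of snd.una and SACKd. */
   int64_t delivered_data(struct prr_sock *tp, uint64_t new_una,
                          uint64_t new_sackd)
   {
       /* Change in snd.una plus the (signed) change in SACKd. */
       int64_t delivered = (int64_t)(new_una - tp->snd_una) +
                           ((int64_t)new_sackd - (int64_t)tp->sackd);
       tp->snd_una = new_una;
       tp->sackd = new_sackd;
       return delivered;
   }

   int main(void)
   {
       struct prr_sock tp = { .snd_una = 1000, .sackd = 0 };

       /* A SACK-only duplicate ACK: snd.una is unchanged and one new
        * 1000-byte SACK block appears, so 1000 bytes were delivered.   */
       printf("%lld\n", (long long)delivered_data(&tp, 1000, 1000));

       /* A partial ACK advancing snd.una by 2000 bytes that also covers
        * the earlier SACK block, so SACKd shrinks by 1000: the ACK
        * reports 2000 - 1000 = 1000 newly delivered bytes.             */
       printf("%lld\n", (long long)delivered_data(&tp, 3000, 0));
       return 0;
   }

The second ACK in the example shows why the change in SACKd is signed: bytes that move below snd.una leave the SACK scoreboard and enter the cumulative acknowledgement, so they are not counted twice.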
289 We introduce a local variable "sndcnt", which indicates exactly how 290 many bytes should be sent in response to each ACK. Note that the 291 decision of which data to send (e.g., retransmit missing data or send 292 more new data) is out of scope for this document. 294 4. Algorithms 296 At the beginning of recovery, initialize PRR state. This assumes a 297 modern congestion control algorithm, CongCtrlAlg(), that might set 298 ssthresh to something other than FlightSize/2: 300 ssthresh = CongCtrlAlg() // Target cwnd after recovery 301 prr_delivered = 0 // Total bytes delivered during recovery 302 prr_out = 0 // Total bytes sent during recovery 303 RecoverFS = snd.nxt-snd.una // FlightSize at the start of recovery 305 Figure 1 307 On every ACK during recovery compute: 308 DeliveredData = change_in(snd.una) + change_in(SACKd) 309 prr_delivered += DeliveredData 310 pipe = (RFC 6675 pipe algorithm) 311 if (pipe > ssthresh) { 312 // Proportional Rate Reduction 313 sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out 314 } else { 315 // Two versions of the Reduction Bound 316 if (conservative) { // PRR-CRB 317 limit = prr_delivered - prr_out 318 } else { // PRR-SSRB 319 limit = MAX(prr_delivered - prr_out, DeliveredData) + MSS 320 } 321 // Attempt to catch up, as permitted by limit 322 sndcnt = MIN(ssthresh - pipe, limit) 323 } 325 Figure 2 327 On any data transmission or retransmission: 328 prr_out += (data sent) // strictly less than or equal to sndcnt 330 Figure 3 332 4.1. Examples 334 We illustrate these algorithms by showing their different behaviors 335 for two scenarios: TCP experiencing either a single loss or a burst 336 of 15 consecutive losses. In all cases we assume bulk data (no 337 application pauses), standard Additive Increase Multiplicative 338 Decrease (AIMD) congestion control, and cwnd = FlightSize = pipe = 20 339 segments, so ssthresh will be set to 10 at the beginning of recovery. 340 We also assume standard Fast Retransmit and Limited Transmit 341 [RFC3042], so TCP will send 2 new segments followed by 1 retransmit 342 in response to the first 3 duplicate ACKs following the losses. 344 Each of the diagrams below shows the per ACK response to the first 345 round trip for the various recovery algorithms when the zeroth 346 segment is lost. The top line indicates the transmitted segment 347 number triggering the ACKs, with an X for the lost segment. "cwnd" 348 and "pipe" indicate the values of these algorithms after processing 349 each returning ACK. "Sent" indicates how much 'N'ew or 350 'R'etransmitted data would be sent. Note that the algorithms for 351 deciding which data to send are out of scope of this document. 353 When there is a single loss, PRR with either of the Reduction Bound 354 algorithms has the same behavior. We show "RB", a flag indicating 355 which Reduction Bound subexpression ultimately determined the value 356 of sndcnt. When there are minimal losses, "limit" (both algorithms) 357 will always be larger than ssthresh - pipe, so the sndcnt will be 358 ssthresh - pipe, indicated by "s" in the "RB" row. 
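Before walking through the per-ACK traces, the following C sketch may help in reproducing them. It is an illustrative transcription of Figures 1-3, not normative text: the struct layout, the byte-based MSS constant, the clamping of a negative sndcnt to zero, and the assumption that the caller supplies pipe and DeliveredData (per RFC 6675 and Section 3) are conventions of the sketch rather than requirements of this document.

   /* Illustrative transcription of Figures 1-3; see the caveats above. */
   #include <stdbool.h>
   #include <stdint.h>

   #define MSS 1460              /* sender maximum segment size, in bytes */

   struct prr {
       uint64_t ssthresh;        /* target cwnd after recovery, in bytes  */
       uint64_t prr_delivered;   /* total bytes delivered during recovery */
       uint64_t prr_out;         /* total bytes sent during recovery      */
       uint64_t recover_fs;      /* FlightSize at the start of recovery   */
   };

   /* Figure 1: initialize PRR state at the beginning of recovery.
    * ssthresh comes from CongCtrlAlg(); flight_size is snd.nxt - snd.una
    * and is assumed to be non-zero. */
   void prr_init(struct prr *p, uint64_t ssthresh, uint64_t flight_size)
   {
       p->ssthresh = ssthresh;
       p->prr_delivered = 0;
       p->prr_out = 0;
       p->recover_fs = flight_size;
   }

   /* Figure 2: per-ACK computation of sndcnt, the number of bytes that may
    * be sent in response to this ACK.  delivered_data and pipe are computed
    * by the caller as described in Section 3 and RFC 6675. */
   uint64_t prr_on_ack(struct prr *p, uint64_t delivered_data,
                       uint64_t pipe, bool conservative)
   {
       int64_t sndcnt;

       p->prr_delivered += delivered_data;

       if (pipe > p->ssthresh) {
           /* Proportional Rate Reduction:
            * CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out */
           uint64_t target = (p->prr_delivered * p->ssthresh +
                              p->recover_fs - 1) / p->recover_fs;
           sndcnt = (int64_t)target - (int64_t)p->prr_out;
       } else {
           /* Two versions of the Reduction Bound. */
           int64_t banked = (int64_t)p->prr_delivered - (int64_t)p->prr_out;
           int64_t limit;
           if (conservative) {                       /* PRR-CRB  */
               limit = banked;
           } else {                                  /* PRR-SSRB */
               limit = (banked > (int64_t)delivered_data ?
                        banked : (int64_t)delivered_data) + MSS;
           }
           /* Attempt to catch up, as permitted by limit. */
           int64_t deficit = (int64_t)p->ssthresh - (int64_t)pipe;
           sndcnt = deficit < limit ? deficit : limit;
       }
       /* A non-positive value means nothing may be sent on this ACK. */
       return sndcnt > 0 ? (uint64_t)sndcnt : 0;
   }

   /* Figure 3: account for any data transmission or retransmission.
    * n is the number of bytes actually sent (at most sndcnt). */
   void prr_on_send(struct prr *p, uint64_t n)
   {
       p->prr_out += n;
   }

The traces below can be reproduced by driving these routines with the per-ACK DeliveredData and pipe values implied by each scenario.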
360 RFC 6675 361 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 362 cwnd: 20 20 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 363 pipe: 19 19 18 18 17 16 15 14 13 12 11 10 10 10 10 10 10 10 10 364 sent: N N R N N N N N N N N 366 Rate-Halving (Linux) 367 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 368 cwnd: 20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 369 pipe: 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 370 sent: N N R N N N N N N N N 372 PRR 373 ack# X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 374 pipe: 19 19 18 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 10 375 sent: N N R N N N N N N N N 376 RB: s s 377 Cwnd is not shown because PRR does not use it. 379 Key for RB 380 s: sndcnt = ssthresh - pipe // from ssthresh 381 b: sndcnt = prr_delivered - prr_out + SMSS // from banked 382 d: sndcnt = DeliveredData + SMSS // from DeliveredData 383 (Sometimes, more than one applies.) 385 Figure 4 387 Note that all 3 algorithms send the same total amount of data. RFC 388 6675 experiences a "half window of silence", while the Rate-Halving 389 and PRR spread the voluntary window reduction across an entire RTT. 391 Next, we consider the same initial conditions when the first 15 392 packets (0-14) are lost. During the remainder of the lossy RTT, only 393 5 ACKs are returned to the sender. We examine each of these 394 algorithms in succession. 396 RFC 6675 397 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 398 cwnd: 20 20 11 11 11 399 pipe: 19 19 4 10 10 400 sent: N N 7R R R 402 Rate-Halving (Linux) 403 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 404 cwnd: 20 20 5 5 5 405 pipe: 19 19 4 4 4 406 sent: N N R R R 408 PRR-CRB 409 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 410 pipe: 19 19 4 4 4 411 sent: N N R R R 412 RB: b b b 414 PRR-SSRB 415 ack# X X X X X X X X X X X X X X X 15 16 17 18 19 416 pipe: 19 19 4 5 6 417 sent: N N 2R 2R 2R 418 RB: bd d d 420 Figure 5 422 In this specific situation, RFC 6675 is more aggressive because once 423 Fast Retransmit is triggered (on the ACK for segment 17), TCP 424 immediately retransmits sufficient data to bring pipe up to cwnd. 425 Our measurement data (see Section 6) indicates that RFC 6675 426 significantly outperforms Rate-Halving, PRR-CRB, and some other 427 similarly conservative algorithms that we tested, showing that it is 428 significantly common for the actual losses to exceed the window 429 reduction determined by the congestion control algorithm. 431 The Linux implementation of Rate-Halving includes an early version of 432 the Conservative Reduction Bound [RHweb]. In this situation, the 5 433 ACKs trigger exactly 1 transmission each (2 new data, 3 old data), 434 and cwnd is set to 5. At a window size of 5, it takes 3 round trips 435 to retransmit all 15 lost segments. Rate-Halving does not raise the 436 window at all during recovery, so when recovery finally completes, 437 TCP will slow start cwnd from 5 up to 10. In this example, TCP 438 operates at half of the window chosen by the congestion control for 439 more than 3 RTTs, increasing the elapsed time and exposing it to 440 timeouts in the event that there are additional losses. 442 PRR-CRB implements a Conservative Reduction Bound. Since the total 443 losses bring pipe below ssthresh, data is sent such that the total 444 data transmitted, prr_out, follows the total data delivered to the 445 receiver as reported by returning ACKs. Transmission is controlled 446 by the sending limit, which is set to prr_delivered - prr_out. 
This 447 is indicated by the RB:b tagging in the figure. In this case, PRR- 448 CRB is exposed to exactly the same problems as Rate-Halving; the 449 excess window reduction causes it to take excessively long to recover 450 the losses and exposes it to additional timeouts. 452 PRR-SSRB increases the window by exactly 1 segment per ACK until pipe 453 rises to ssthresh during recovery. This is accomplished by setting 454 limit to one greater than the data reported to have been delivered to 455 the receiver on this ACK, implementing slow start during recovery, 456 and indicated by RB:d tagging in the figure. Although increasing the 457 window during recovery seems to be ill advised, it is important to 458 remember that this is actually less aggressive than permitted by RFC 459 5681, which sends the same quantity of additional data as a single 460 burst in response to the ACK that triggered Fast Retransmit. 462 For less extreme events, where the total losses are smaller than the 463 difference between FlightSize and ssthresh, PRR-CRB and PRR-SSRB have 464 identical behaviors. 466 5. Properties 468 The following properties are common to both PRR-CRB and PRR-SSRB, 469 except as noted: 471 PRR maintains TCP's ACK clocking across most recovery events, 472 including burst losses. RFC 6675 can send large unclocked bursts 473 following burst losses. 475 Normally, PRR will spread voluntary window reductions out evenly 476 across a full RTT. This has the potential to generally reduce the 477 burstiness of Internet traffic, and could be considered to be a type 478 of soft pacing. Hypothetically, any pacing increases the probability 479 that different flows are interleaved, reducing the opportunity for 480 ACK compression and other phenomena that increase traffic burstiness. 481 However, these effects have not been quantified. 483 If there are minimal losses, PRR will converge to exactly the target 484 window chosen by the congestion control algorithm. Note that as TCP 485 approaches the end of recovery, prr_delivered will approach RecoverFS 486 and sndcnt will be computed such that prr_out approaches ssthresh. 488 Implicit window reductions, due to multiple isolated losses during 489 recovery, cause later voluntary reductions to be skipped. For small 490 numbers of losses, the window size ends at exactly the window chosen 491 by the congestion control algorithm. 493 For burst losses, earlier voluntary window reductions can be undone 494 by sending extra segments in response to ACKs arriving later during 495 recovery. Note that as long as some voluntary window reductions are 496 not undone, the final value for pipe will be the same as ssthresh, 497 the target cwnd value chosen by the congestion control algorithm. 499 PRR with either Reduction Bound improves the situation when there are 500 application stalls, e.g., when the sending application does not queue 501 data for transmission quickly enough or the receiver stops advancing 502 rwnd (receiver window). When there is an application stall early 503 during recovery, prr_out will fall behind the sum of the 504 transmissions permitted by sndcnt. The missed opportunities to send 505 due to stalls are treated like banked voluntary window reductions; 506 specifically, they cause prr_delivered - prr_out to be significantly 507 positive. If the application catches up while TCP is still in 508 recovery, TCP will send a partial window burst to catch up to exactly 509 where it would have been had the application never stalled. 
Although 510 this burst might be viewed as being hard on the network, this is 511 exactly what happens every time there is a partial RTT application 512 stall while not in recovery. We have made the partial RTT stall 513 behavior uniform in all states. Changing this behavior is out of 514 scope for this document. 516 PRR with Reduction Bound is less sensitive to errors in the pipe 517 estimator. While in recovery, pipe is intrinsically an estimator, 518 using incomplete information to estimate if un-SACKed segments are 519 actually lost or merely out of order in the network. Under some 520 conditions, pipe can have significant errors; for example, pipe is 521 underestimated when a burst of reordered data is prematurely assumed 522 to be lost and marked for retransmission. If the transmissions are 523 regulated directly by pipe as they are with RFC 6675, a step 524 discontinuity in the pipe estimator causes a burst of data, which 525 cannot be retracted once the pipe estimator is corrected a few ACKs 526 later. For PRR, pipe merely determines which algorithm, PRR or the 527 Reduction Bound, is used to compute sndcnt from DeliveredData. While 528 pipe is underestimated, the algorithms are different by at most 1 529 segment per ACK. Once pipe is updated, they converge to the same 530 final window at the end of recovery. 532 Under all conditions and sequences of events during recovery, PRR-CRB 533 strictly bounds the data transmitted to be equal to or less than the 534 amount of data delivered to the receiver. We claim that this Strong 535 Packet Conservation Bound is the most aggressive algorithm that does 536 not lead to additional forced losses in some environments. It has 537 the property that if there is a standing queue at a bottleneck with 538 no cross traffic, the queue will maintain exactly constant length for 539 the duration of the recovery, except for +1/-1 fluctuation due to 540 differences in packet arrival and exit times. See Appendix A for a 541 detailed discussion of this property. 543 Although the Strong Packet Conservation Bound is very appealing for a 544 number of reasons, our measurements summarized in Section 6 545 demonstrate that it is less aggressive and does not perform as well 546 as RFC 6675, which permits bursts of data when there are bursts of 547 losses. PRR-SSRB is a compromise that permits TCP to send 1 extra 548 segment per ACK as compared to the Packet Conserving Bound. From the 549 perspective of a strict Packet Conserving Bound, PRR-SSRB does indeed 550 open the window during recovery; however, it is significantly less 551 aggressive than RFC 6675 in the presence of burst losses. 553 6. Measurements 555 In a companion IMC11 paper [IMC11], we describe some measurements 556 comparing the various strategies for reducing the window during 557 recovery. The experiments were performed on servers carrying Google 558 production traffic and are briefly summarized here. 560 The various window reduction algorithms and extensive instrumentation 561 were all implemented in Linux 2.6. We used the uniform set of 562 algorithms present in the base Linux implementation, including CUBIC 563 [CUBIC], Limited Transmit [RFC3042], threshold transmit (Section 3.1 564 in [FACK]) (this algorithm was not present in RFC 3517, but a similar 565 algorithm has been added to RFC 6675), and lost retransmission 566 detection algorithms. 
We confirmed that the behaviors of Rate- 567 Halving (the Linux default), RFC 3517, and PRR were authentic to 568 their respective specifications and that performance and features 569 were comparable to the kernels in production use. All of the 570 different window reduction algorithms were all present in a common 571 kernel and could be selected with a sysctl, such that we had an 572 absolutely uniform baseline for comparing them. 574 Our experiments included an additional algorithm, PRR with an 575 unlimited bound (PRR-UB), which sends ssthresh-pipe bursts when pipe 576 falls below ssthresh. This behavior parallels RFC 3517. 578 An important detail of this configuration is that CUBIC only reduces 579 the window by 30%, as opposed to the 50% reduction used by 580 traditional congestion control algorithms. This accentuates the 581 tendency for RFC 3517 and PRR-UB to send a burst at the point when 582 Fast Retransmit gets triggered because pipe is likely to already be 583 below ssthresh. Precisely this condition was observed for 32% of the 584 recovery events: pipe fell below ssthresh before Fast Retransmit was 585 triggered, thus the various PRR algorithms started in the Reduction 586 Bound phase, and RFC 3517 sent bursts of segments with the Fast 587 Retransmit. 589 In the companion paper, we observe that PRR-SSRB spends the least 590 time in recovery of all the algorithms tested, largely because it 591 experiences fewer timeouts once it is already in recovery. 593 RFC 3517 experiences 29% more detected lost retransmissions and 2.6% 594 more timeouts (presumably due to undetected lost retransmissions) 595 than PRR-SSRB. These results are representative of PRR-UB and other 596 algorithms that send bursts when pipe falls below ssthresh. 598 Rate-Halving experiences 5% more timeouts and significantly smaller 599 final cwnd values at the end of recovery. The smaller cwnd sometimes 600 causes the recovery itself to take extra round trips. These results 601 are representative of PRR-CRB and other algorithms that implement 602 strict packet conservation during recovery. 604 7. Conclusion and Recommendations 606 Although the Strong Packet Conservation Bound used in PRR-CRB is very 607 appealing for a number of reasons, our measurements show that it is 608 less aggressive and does not perform as well as RFC 3517 (and by 609 implication RFC 6675), which permits bursts of data when there are 610 bursts of losses. RFC 3517 and RFC 6675 are conservative in the 611 original sense of Van Jacobson's packet conservation principle, which 612 included the assumption that presumed lost segments have indeed left 613 the network. PRR-CRB makes no such assumption, following instead a 614 Strong Packet Conservation Bound in which only packets that have 615 actually arrived at the receiver are considered to have left the 616 network. PRR-SSRB is a compromise that permits TCP to send 1 extra 617 segment per ACK relative to the Strong Packet Conservation Bound, to 618 partially compensate for excess losses. 620 From the perspective of the Strong Packet Conservation Bound, PRR- 621 SSRB does indeed open the window during recovery; however, it is 622 significantly less aggressive than RFC 3517 (and RFC 6675) in the 623 presence of burst losses. Even so, it often outperforms RFC 3517 624 (and presumably RFC 6675) because it avoids some of the self- 625 inflicted losses caused by bursts. 627 At this time, we see no reason not to test and deploy PRR-SSRB on a 628 large scale. 
Implementers worried about any potential impact of 629 raising the window during recovery may want to optionally support 630 PRR-CRB (which is actually simpler to implement) for comparison 631 studies. Furthermore, there is one minor detail of PRR that can be 632 improved by replacing pipe by total_pipe, as defined by Laminar TCP 633 [Laminar]. 635 One final comment about terminology: we expect that common usage will 636 drop "Slow Start Reduction Bound" from the algorithm name. This 637 document needed to be pedantic about having distinct names for PRR 638 and every variant of the Reduction Bound. However, we do not 639 anticipate any future exploration of the alternative Reduction 640 Bounds. 642 8. Acknowledgements 644 This document is based in part on previous incomplete work by Matt 645 Mathis, Jeff Semke, and Jamshid Mahdavi [RHID] and influenced by 646 several discussions with John Heffner. 648 Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the 649 experiments. 651 Ilpo Jarvinen reviewed the code. 653 Mark Allman improved the document through his insightful review. 655 Neal Cardwell reviewed and tested the patch. 657 9. Security Considerations 659 PRR does not change the risk profile for TCP. 661 Implementers that change PRR from counting bytes to segments have to 662 be cautious about the effects of ACK splitting attacks [Savage99], 663 where the receiver acknowledges partial segments for the purpose of 664 confusing the sender's congestion accounting. 666 10. Normative References 668 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 669 RFC 793, DOI 10.17487/RFC0793, September 1981, 670 <https://www.rfc-editor.org/info/rfc793>. 672 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 673 Selective Acknowledgment Options", RFC 2018, 674 DOI 10.17487/RFC2018, October 1996, 675 <https://www.rfc-editor.org/info/rfc2018>. 677 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 678 Requirement Levels", BCP 14, RFC 2119, 679 DOI 10.17487/RFC2119, March 1997, 680 <https://www.rfc-editor.org/info/rfc2119>. 682 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 683 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 684 <https://www.rfc-editor.org/info/rfc5681>. 686 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 687 and Y. Nishida, "A Conservative Loss Recovery Algorithm 688 Based on Selective Acknowledgment (SACK) for TCP", 689 RFC 6675, DOI 10.17487/RFC6675, August 2012, 690 <https://www.rfc-editor.org/info/rfc6675>. 692 11. Informative References 694 [CUBIC] Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-speed 695 TCP variant", PFLDnet 2005, February 2005. 697 [FACK] Mathis, M. and J. Mahdavi, "Forward Acknowledgment: 698 Refining TCP Congestion Control", ACM SIGCOMM SIGCOMM96, 699 August 1996. 701 [IMC11] Dukkipati, N., Mathis, M., Cheng, Y., and M. Ghobadi, 702 "Proportional Rate Reduction for TCP", Proceedings of the 703 11th ACM SIGCOMM Conference on Internet Measurement 704 2011, Berlin, Germany, November 2011. 706 [Jacobson88] 707 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 708 Comput. Commun. Rev. 18(4), August 1988. 710 [Laminar] Mathis, M., "Laminar TCP and the case for refactoring TCP 711 congestion control", Work in Progress, 16 July 2012. 713 [RFC3042] Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing 714 TCP's Loss Recovery Using Limited Transmit", RFC 3042, 715 DOI 10.17487/RFC3042, January 2001, 716 <https://www.rfc-editor.org/info/rfc3042>. 718 [RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A 719 Conservative Selective Acknowledgment (SACK)-based Loss 720 Recovery Algorithm for TCP", RFC 3517, 721 DOI 10.17487/RFC3517, April 2003, 722 <https://www.rfc-editor.org/info/rfc3517>.
724 [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional 725 Rate Reduction for TCP", RFC 6937, DOI 10.17487/RFC6937, 726 May 2013, <https://www.rfc-editor.org/info/rfc6937>. 728 [RHID] Mathis, M., Semke, J., and J. Mahdavi, "The Rate-Halving 729 Algorithm for TCP Congestion Control", Work in Progress, 730 August 1999. 732 [RHweb] Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding 733 Parameters", Web publication, December 1997, 734 . 736 [Savage99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, 737 "TCP congestion control with a misbehaving receiver", 738 SIGCOMM Comput. Commun. Rev. 29(5), October 1999. 740 Appendix A. Strong Packet Conservation Bound 742 PRR-CRB is based on a conservative, philosophically pure, and 743 aesthetically appealing Strong Packet Conservation Bound, described 744 here. Although inspired by Van Jacobson's packet conservation 745 principle [Jacobson88], it differs in how it treats segments that are 746 missing and presumed lost. Under all conditions and sequences of 747 events during recovery, PRR-CRB strictly bounds the data transmitted 748 to be equal to or less than the amount of data delivered to the 749 receiver. Note that the effects of presumed losses are included in 750 the pipe calculation, but do not affect the outcome of PRR-CRB, once 751 pipe has fallen below ssthresh. 753 We claim that this Strong Packet Conservation Bound is the most 754 aggressive algorithm that does not lead to additional forced losses 755 in some environments. It has the property that if there is a 756 standing queue at a bottleneck that is carrying no other traffic, the 757 queue will maintain exactly constant length for the entire duration 758 of the recovery, except for +1/-1 fluctuation due to differences in 759 packet arrival and exit times. Any less aggressive algorithm will 760 result in a declining queue at the bottleneck. Any more aggressive 761 algorithm will result in an increasing queue or additional losses if 762 it is a full drop tail queue. 764 We demonstrate this property with a little thought experiment: 766 Imagine a network path that has insignificant delays in both 767 directions, except for the processing time and queue at a single 768 bottleneck in the forward path. By insignificant delay, we mean when 769 a packet is "served" at the head of the bottleneck queue, the 770 following events happen in much less than one bottleneck packet time: 771 the packet arrives at the receiver; the receiver sends an ACK that 772 arrives at the sender; the sender processes the ACK and sends some 773 data; the data is queued at the bottleneck. 775 If sndcnt is set to DeliveredData and nothing else is inhibiting 776 sending data, then clearly the data arriving at the bottleneck queue 777 will exactly replace the data that was served at the head of the 778 queue, so the queue will have a constant length. If queue is drop 779 tail and full, then the queue will stay exactly full. Losses or 780 reordering on the ACK path only cause wider fluctuations in the queue 781 size, but do not raise its peak size, independent of whether the data 782 is in order or out of order (including loss recovery from an earlier 783 RTT). Any more aggressive algorithm that sends additional data will 784 overflow the drop tail queue and cause loss. Any less aggressive 785 algorithm will under-fill the queue. Therefore, setting sndcnt to 786 DeliveredData is the most aggressive algorithm that does not cause 787 forced losses in this simple network.
Relaxing the assumptions 788 (e.g., making delays more authentic and adding more flows, delayed 789 ACKs, etc.) is likely to increase the fine grained fluctuations in 790 queue size but does not change its basic behavior. 792 Note that the congestion control algorithm implements a broader 793 notion of optimal that includes appropriately sharing the network. 794 Typical congestion control algorithms are likely to reduce the data 795 sent relative to the Packet Conserving Bound implemented by PRR, 796 bringing TCP's actual window down to ssthresh. 798 Authors' Addresses 800 Matt Mathis 801 Google, Inc. 802 1600 Amphitheatre Parkway 803 Mountain View, California 94043 804 United States of America 806 Email: mattmathis@google.com 807 Nandita Dukkipati 808 Google, Inc. 809 1600 Amphitheatre Parkway 810 Mountain View, California 94043 811 United States of America 813 Email: nanditad@google.com 815 Yuchung Cheng 816 Google, Inc. 817 1600 Amphitheatre Parkway 818 Mountain View, California 94043 819 United States of America 821 Email: ycheng@google.com