TCP Maintenance Working Group                                  M. Mathis
Internet-Draft                                              N. Dukkipati
Intended status: Experimental                                   Y. Cheng
Expires: January 12, 2012                                    Google, Inc
                                                           July 11, 2011

                  Proportional Rate Reduction for TCP
          draft-mathis-tcpm-proportional-rate-reduction-01.txt

Abstract

   This document describes an experimental algorithm, Proportional Rate
   Reduction (PRR), and related algorithms that improve the accuracy of
   the amount of data sent by TCP during loss recovery.  Standard
   congestion control requires that TCP and other protocols reduce
   their congestion window in response to losses.  This window
   reduction naturally occurs in the same round trip as the data
   retransmissions that repair the losses, and is implemented by
   choosing not to transmit any data in response to some ACKs arriving
   from the receiver.  Two widely deployed algorithms are used to
   implement this window reduction: Fast Recovery and Rate Halving.
   Both algorithms are needlessly fragile under a number of conditions,
   particularly when there is a burst of losses such that the number of
   ACKs returning to the sender is so small that the effective window
   falls below the target congestion window chosen by the congestion
   control algorithm.  Proportional Rate Reduction avoids these excess
   window reductions such that at the end of recovery the actual window
   size will be as close as possible to the window size determined by
   the congestion control algorithm.
   It is patterned after Rate Halving, but using the fraction that is
   appropriate for the target window chosen by the congestion control
   algorithm.  In addition, we propose two slightly different
   algorithms that bound the total window reduction due to all
   mechanisms, including application stalls and the losses themselves,
   and inhibit further window reductions when possible.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 12, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Algorithms
       3.1.  Examples
   4.  Properties
   5.  Measurements
   6.  Conclusion and Recommendations
   7.  Acknowledgements
   8.  Security Considerations
   9.  IANA Considerations
   10. References
   Appendix A.  Packet Conservation Bound
   Authors' Addresses

1.  Introduction

   This document describes an experimental algorithm, Proportional Rate
   Reduction (PRR), and two slightly different reduction bound
   algorithms that improve the accuracy of the amount of data sent by
   TCP during loss recovery.

   Standard congestion control [RFC 5681] requires that TCP (and other
   protocols) reduce their congestion window in response to losses.
   Fast Recovery, described in the same document, is the reference
   algorithm for making this adjustment.  Its stated goal is to recover
   TCP's self clock by relying on returning ACKs during recovery to
   clock more data into the network.  Fast Recovery adjusts the window
   by waiting for one half RTT of ACKs to pass before sending any data.
   It is fragile because it cannot compensate for the implicit window
   reduction caused by the losses themselves, and it is exposed to
   timeouts.  For example, if half of the data or ACKs are lost, Fast
   Recovery's expected behavior would be to reduce the window by not
   sending in response to the first half window of ACKs, but then it
   would not receive any additional ACKs and would timeout because it
   failed to send anything at all.

   The Rate-Halving algorithm improves this situation by sending data
   on alternate ACKs during recovery, such that after one RTT the
   window has been halved.  Rate-Halving is implemented in Linux,
   although it was only informally published [RHweb], including in an
   uncompleted Internet-Draft [RHID].  Rate-Halving also does not
   adequately compensate for the implicit window reduction caused by
   the losses, and it assumes a 50% window reduction, which was
   completely standard at the time it was written (several modern
   congestion control algorithms, such as CUBIC [CUBIC], can sometimes
   reduce the window by much less than 50%).  As a consequence,
   Rate-Halving often allows the window to fall further than
   necessary, reducing performance and increasing the risk of timeouts
   if there are any additional losses.

   Proportional Rate Reduction (PRR) avoids these excess window
   reductions such that at the end of recovery the actual window size
   will be as close as possible to the window size determined by the
   congestion control algorithm.  It is patterned after Rate Halving,
   but using the fraction that is appropriate for the target window
   chosen by the congestion control algorithm.  During PRR, one of two
   additional reduction bound algorithms monitors the total window
   reduction due to all mechanisms, including application stalls and
   the losses themselves, and attempts to inhibit further window
   reductions.

   We describe two slightly different reduction bound algorithms: a
   conservative reduction bound (CRB), which meets a strict segment
   conserving correctness criterion; and a slow start reduction bound
   (SSRB), which is more aggressive than CRB by at most one segment
   per ACK.  CRB meets a conservative, philosophically pure, and
   aesthetically appealing notion of correctness; however, in real
   networks it does not perform as well as the algorithms described in
   RFC 3517, which prove to be non-conservative in a statistically
   significant number of cases.  SSRB offers a compromise by allowing
   TCP to send one additional segment per ACK relative to CRB in some
   situations.  Although SSRB is less aggressive than RFC 3517
   (transmitting fewer segments or transmitting them later), it
   slightly outperforms it, due to the slightly lower probability of
   additional losses during recovery.

   All three algorithms are based on common design principles, derived
   from Van Jacobson's packet conservation principle: segments
   delivered to the receiver are used as the clock to trigger sending
   additional segments into the network.  As much as possible,
   Proportional Rate Reduction and the reduction bounds rely on this
   self clock process, and are only slightly affected by the accuracy
   of other estimators, such as pipe [RFC 3517] and cwnd.  This is
   what gives the algorithms their precision in the presence of events
   that cause uncertainty in other estimators.
   In Section 5, we summarize a companion paper [Recovery] with some
   measurement experiments: PRR+SSRB outperforms both RFC 3517 and
   PRR+CRB under authentic network traffic.

   The algorithms are described as modifications to RFC 5681, TCP
   Congestion Control, using concepts drawn from the pipe algorithm
   [RFC 3517].  They are most accurate and more easily implemented
   with SACK [RFC 2018], but they do not require SACK.

2.  Definitions

   The following terms, parameters, and state variables are used as
   they are defined in earlier documents:

   RFC 3517: covered (as in "covered sequence numbers")

   RFC 5681: duplicate ACK, FlightSize, Sender Maximum Segment Size
   (SMSS)

   Voluntary Window Reductions: choosing not to send data in response
   to some ACKs, for the purpose of reducing the sending window size
   or data rate.

   We define some additional variables:

   SACKd: The total number of bytes that the scoreboard indicates have
   been delivered to the receiver.  This can be computed by scanning
   the scoreboard and counting the total number of bytes covered by
   all SACK blocks.

   DeliveredData: The total number of bytes that the current ACK
   indicates have been delivered to the receiver, relative to all past
   ACKs.  When not in recovery, DeliveredData is the change in
   snd.una.  With SACK, DeliveredData is not an estimator and can be
   computed precisely as the change in snd.una plus the change in
   SACKd.  Note that if there are SACK blocks and snd.una advances,
   the change in SACKd is typically negative.  In recovery without
   SACK, DeliveredData is estimated to be 1 SMSS on duplicate
   acknowledgements, and on a subsequent partial or full ACK,
   DeliveredData is estimated to be the change in snd.una, minus one
   SMSS for each preceding duplicate ACK.

   Note that DeliveredData is robust: for TCP using SACK,
   DeliveredData can be precisely computed anywhere in the network
   just by inspecting the returning ACKs.  The consequence of missing
   ACKs is that later ACKs will show a larger DeliveredData.
   Furthermore, for any TCP (with or without SACK) the sum of
   DeliveredData must agree with the forward progress over the same
   time interval.

   We introduce a local variable "sndcnt", which indicates exactly how
   many bytes should be sent in response to each ACK.  Note that the
   decision of which data to send (e.g., retransmit missing data or
   send more new data) is out of scope for this document.
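   To make the SACK case concrete, the following is a minimal sketch
   (in C) of computing DeliveredData from the per-ACK changes in
   snd.una and SACKd.  It is illustrative only and not taken from any
   implementation; the structure and names (prr_sock, snd_una, sackd,
   delivered_data) are our own assumptions.

      /* Illustrative sketch, not from any real stack:
       * DeliveredData = delta(snd.una) + delta(SACKd), per ACK. */
      #include <stdint.h>

      struct prr_sock {
          uint32_t snd_una;  /* first unacknowledged sequence number */
          uint32_t sackd;    /* total bytes covered by SACK blocks */
      };

      /* Called once per ACK with the values implied by that ACK;
       * returns DeliveredData and updates the stored state. */
      static uint32_t delivered_data(struct prr_sock *tp,
                                     uint32_t ack_una,
                                     uint32_t ack_sackd)
      {
          /* Bytes newly acknowledged cumulatively. */
          uint32_t delta_una = ack_una - tp->snd_una;
          /* Change in SACKd; typically negative when snd.una
           * advances past previously SACKed data, so keep it
           * signed. */
          int32_t delta_sackd = (int32_t)(ack_sackd - tp->sackd);

          tp->snd_una = ack_una;
          tp->sackd = ack_sackd;
          /* Per the definition above, the sum is never negative. */
          return delta_una + (uint32_t)delta_sackd;
      }

   Without SACK, the same interface could instead return the SMSS-
   based estimates described above.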
3.  Algorithms

   Summary: If pipe (the estimated amount of data in flight) is larger
   than ssthresh (the target cwnd at the end of recovery), then
   Proportional Rate Reduction spreads the voluntary window reductions
   across a full RTT, such that as prr_delivered approaches RecoverFS
   (at the end of recovery) prr_out approaches ssthresh, the target
   value for cwnd.  If there are excess losses such that pipe falls
   below ssthresh, the selected reduction bound algorithm tries to
   hold pipe at ssthresh by sending at most "limit" segments per ACK
   to catch up.  For both reduction bound algorithms the limit is
   first set to past voluntary window reductions (prr_delivered -
   prr_out), permitting single ACKs to trigger sending multiple
   segments.  With PRR+CRB (conservative==True), if there are too many
   losses then prr_delivered - prr_out will be exactly the same as
   DeliveredData for the current ACK, resulting in
   sndcnt=DeliveredData; therefore there will be no further Voluntary
   Window Reductions.  With PRR+SSRB (conservative==False) the same
   situation results in sndcnt=DeliveredData+1, which ultimately
   causes TCP to slowstart up to ssthresh.

   At the beginning of recovery, initialize the PRR state.  This
   assumes a modern congestion control algorithm, CongCtrlAlg(), that
   might set ssthresh to something other than FlightSize/2:

      ssthresh = CongCtrlAlg()      // Target cwnd after recovery
      prr_delivered = 0             // Total bytes delivered during recovery
      prr_out = 0                   // Total bytes sent during recovery
      RecoverFS = snd.nxt - snd.una // FlightSize at the start of recovery

   On every ACK during recovery compute:

      DeliveredData = delta(snd.una) + delta(SACKd)
      prr_delivered += DeliveredData
      pipe = (RFC 3517 pipe algorithm)
      if (pipe > ssthresh) {
         // Proportional Rate Reduction
         sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
      } else {
         // Two versions of the reduction bound
         if (conservative) {        // PRR+CRB
            limit = prr_delivered - prr_out
         } else {                   // PRR+SSRB
            limit = MAX(prr_delivered - prr_out, DeliveredData) + 1
         }
         sndcnt = MIN(ssthresh - pipe, limit)
      }
      sndcnt = MAX(sndcnt, 0)       // Ensure a non-negative value

   On any data transmission or retransmission:

      prr_out += (data sent)        // strictly less than or equal to sndcnt
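   The per-ACK computation above transcribes almost directly into
   code.  The following is a self-contained sketch in C, not an
   excerpt from any implementation; pipe is passed in by the caller
   because the RFC 3517 scoreboard is out of scope here, and the names
   (struct prr, prr_on_ack) are our own.

      /* Sketch of the per-ACK PRR computation above; illustrative
       * only.  Byte counting is assumed. */
      #include <stdbool.h>
      #include <stdint.h>

      struct prr {
          uint64_t ssthresh;       /* target cwnd after recovery */
          uint64_t prr_delivered;  /* bytes delivered during recovery */
          uint64_t prr_out;        /* bytes sent during recovery */
          uint64_t recover_fs;     /* FlightSize at start of recovery */
          bool conservative;       /* true: PRR+CRB, false: PRR+SSRB */
      };

      /* Returns sndcnt: bytes that may be sent for this ACK. */
      static uint64_t prr_on_ack(struct prr *p,
                                 uint64_t delivered_data,
                                 uint64_t pipe)
      {
          int64_t sndcnt, limit;

          p->prr_delivered += delivered_data;
          if (pipe > p->ssthresh) {
              /* Proportional Rate Reduction:
               * CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out,
               * with integer ceiling division. */
              sndcnt = (int64_t)((p->prr_delivered * p->ssthresh +
                                  p->recover_fs - 1) / p->recover_fs)
                       - (int64_t)p->prr_out;
          } else {
              /* Reduction bound: catch up toward ssthresh, within
               * the past voluntary reductions (plus one for SSRB). */
              limit = (int64_t)p->prr_delivered - (int64_t)p->prr_out;
              if (!p->conservative) {               /* PRR+SSRB */
                  if (limit < (int64_t)delivered_data)
                      limit = (int64_t)delivered_data;
                  limit += 1;
              }
              sndcnt = (int64_t)p->ssthresh - (int64_t)pipe;
              if (sndcnt > limit)
                  sndcnt = limit;
          }
          return sndcnt > 0 ? (uint64_t)sndcnt : 0;
      }

   On each transmission or retransmission the sender would add the
   bytes actually sent to prr_out, mirroring the last step of the
   pseudocode.  For instance, with ssthresh = 10 and RecoverFS = 20
   (the single-loss example in Section 3.1, counting segments), after
   10 ACKs prr_delivered = 10 and the cumulative allowance is
   CEIL(10 * 10 / 20) = 5 segments: half of what was delivered, as
   intended.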
   The following examples will make these algorithms much clearer.

3.1.  Examples

   We illustrate these algorithms by showing their different behaviors
   for two scenarios: TCP experiencing either a single loss or a burst
   of 15 consecutive losses.  In all cases we assume bulk data,
   standard AIMD congestion control (ssthresh is set to FlightSize/2),
   and cwnd = FlightSize = pipe = 20 segments, so ssthresh will be set
   to 10 at the beginning of recovery.  We also assume standard Fast
   Retransmit and Limited Transmit, so we send two new segments
   followed by one retransmit on the first 3 duplicate ACKs after the
   losses.

   Each of the diagrams below shows the per ACK response to the first
   round trip for the various recovery algorithms when the zeroth
   segment is lost.  The top line indicates the transmitted segment
   number triggering the ACKs, with an X for the lost segment.  "cwnd"
   and "pipe" indicate the values of these variables after processing
   each returning ACK.  "Sent" indicates how much "N"ew or
   "R"etransmitted data would be sent.  Note that the algorithms for
   deciding which data should be sent are out of scope of this
   document.

   When there is a single loss, PRR with either of the reduction bound
   algorithms has the same behavior.  We show "RB", a flag indicating
   which reduction bound subexpression ultimately determined the value
   of sndcnt.  When losses are minimal, "limit" (under both
   algorithms) will always be larger than ssthresh - pipe, so sndcnt
   will be ssthresh - pipe, indicated by "s" in the "RB" row.  PRR
   does not use cwnd during recovery.

   RFC 3517
   ack#  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
   cwnd:    20 20 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
   pipe:    19 19 18 18 17 16 15 14 13 12 11 10 10 10 10 10 10 10 10
   sent:     N  N  R                          N  N  N  N  N  N  N  N

   Rate Halving (Linux)
   ack#  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
   cwnd:    20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11
   pipe:    19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10
   sent:     N  N  R     N     N     N     N     N     N     N     N

   PRR
   ack#  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
   pipe:    19 19 18 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 10
   sent:     N  N  R     N     N     N     N     N     N     N     N
   RB:                                                         s  s

   Note that all three algorithms send the same total amount of data.
   RFC 3517 experiences a "half-window of silence", while Rate Halving
   and PRR spread the voluntary window reduction across an entire RTT.

   Next we consider the same initial conditions when the first 15
   packets (0-14) are lost.  During the remainder of the lossy RTT,
   only 5 ACKs are returned to the sender.  We examine each of these
   algorithms in succession.

   RFC 3517
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   cwnd:                                              20 20 11 11 11
   pipe:                                              19 19  4 10 10
   sent:                                               N  N 7R  R  R

   Rate Halving (Linux)
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   cwnd:                                              20 20  5  5  5
   pipe:                                              19 19  4  4  4
   sent:                                               N  N  R  R  R

   PRR-CRB
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   pipe:                                              19 19  4  4  4
   sent:                                               N  N  R  R  R
   RB:                                                       f  f  f

   PRR-SSRB
   ack#  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X 15 16 17 18 19
   pipe:                                              19 19  4  5  6
   sent:                                               N  N 2R 2R 2R
   RB:                                                       d  d  d

   In this situation, RFC 3517 is very non-conservative, because once
   fast retransmit is triggered (on the ACK for segment 17) TCP
   immediately retransmits sufficient data to bring pipe up to cwnd.
   Our measurement data (see Section 5) indicates that RFC 3517
   significantly outperforms Rate Halving, PRR-CRB, and some other
   similarly conservative algorithms that we tested, suggesting that
   it is significantly common for the actual losses to exceed the
   window reduction determined by the congestion control algorithm.

   The Linux implementation of Rate Halving includes an early version
   of the conservative reduction bound [RHweb].  In this situation the
   five ACKs trigger exactly 5 transmissions in total (2 new data, 3
   old data), and cwnd is set to 5.  At a window size of 5, it takes
   three round trips to retransmit the 15 lost segments.  Rate Halving
   does not raise the window during recovery, so when recovery finally
   completes, TCP will slowstart cwnd from 5 up to 10.  In this
   example, TCP operates at half of the window chosen by the
   congestion control for more than three RTTs, increasing the elapsed
   time and exposing it to timeouts if there are additional losses.

   PRR-CRB implements a conservative reduction bound.  Since the total
   losses bring pipe below ssthresh, data is sent such that the total
   data transmitted, prr_out, follows the total data delivered to the
   receiver as reported by returning ACKs.  Transmissions are
   controlled by the sending limit, which is set to prr_delivered -
   prr_out.  This is indicated by the RB:f tagging in the figure.  In
   this case PRR-CRB is exposed to exactly the same problems as Rate
   Halving: taking excessively long to recover from the losses and
   being exposed to additional timeouts.
   PRR-SSRB increases the window by exactly 1 segment per ACK until
   pipe rises to ssthresh during recovery.  This is accomplished by
   setting limit to one greater than the data reported to have been
   delivered to the receiver on this ACK, effectively implementing a
   slowstart during recovery, as indicated by the RB:d tagging in the
   figure.  Although increasing the window during recovery seems to be
   ill advised, it is important to remember that this is actually less
   aggressive than the current standard, which permits sending the
   same quantity of extra data as a single burst in response to the
   ACK that triggered Fast Retransmit.

   Under less extreme conditions, when the total losses are smaller
   than the difference between FlightSize and ssthresh, PRR-CRB and
   PRR-SSRB have identical behaviors.

4.  Properties

   The following properties are common to both PRR-CRB and PRR-SSRB:

   Normally Proportional Rate Reduction will spread Voluntary Window
   Reductions out evenly across a full RTT.  This has the potential to
   generally reduce the burstiness of Internet traffic, and could be
   considered to be a type of soft pacing.  Theoretically any pacing
   increases the probability that different flows are interleaved,
   reducing the opportunity for ACK compression and other phenomena
   that increase traffic burstiness.  However these effects have not
   been quantified.

   If there are minimal losses, Proportional Rate Reduction will
   converge to exactly the target window chosen by the congestion
   control algorithm.  Note that as TCP approaches the end of
   recovery, prr_delivered will approach RecoverFS and sndcnt will be
   computed such that prr_out approaches ssthresh.

   Implicit window reductions, due to multiple isolated losses during
   recovery, cause later Voluntary Reductions to be skipped.  For
   small numbers of losses the window size ends at exactly the window
   chosen by the congestion control algorithm.

   For burst losses, earlier Voluntary Window Reductions can be undone
   by sending extra segments in response to ACKs arriving later during
   recovery.  Note that as long as some Voluntary Window Reductions
   are not undone, the final value for pipe will be the same as
   ssthresh, the target cwnd value chosen by the congestion control
   algorithm.

   Proportional Rate Reduction with either reduction bound improves
   the situation when there are application stalls (e.g., when the
   sending application does not queue data for transmission quickly
   enough or the receiver stops advancing rwnd).  When there is an
   application stall early during recovery, prr_out will fall behind
   the sum of the transmissions permitted by sndcnt.  The missed
   opportunities to send due to stalls are treated like banked
   Voluntary Window Reductions: specifically they cause
   prr_delivered - prr_out to be significantly positive.  If the
   application catches up while TCP is still in recovery, TCP will
   send a partial window burst to catch up to exactly where it would
   have been, had the application never stalled.  Although this burst
   might be viewed as being hard on the network, this is exactly what
   happens every time there is a partial RTT application stall while
   not in recovery.  We have made the partial RTT stall behavior
   uniform in all states.  Changing this behavior is out of scope for
   this document.
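   To illustrate the banked reductions, the following hypothetical
   trace (building on the prr_on_ack() sketch in Section 3, counting
   segments for readability) shows a stalled sender accumulating
   prr_delivered - prr_out, and a later ACK permitting a multi-segment
   catch-up burst.

      /* Hypothetical trace using the prr_on_ack() sketch from
       * Section 3; all values count segments for readability. */
      static void stall_demo(void)
      {
          struct prr p = { .ssthresh = 10, .recover_fs = 20,
                           .conservative = false };

          /* Five ACKs arrive while the application has nothing to
           * send: prr_delivered grows to 5 but prr_out stays 0, so
           * five missed sending opportunities are "banked". */
          for (uint64_t i = 0; i < 5; i++)
              (void)prr_on_ack(&p, 1, 19 - i);  /* pipe drains */

          /* When the application catches up, the next ACK permits
           * CEIL(6 * 10 / 20) - 0 = 3 segments at once, bringing the
           * sender to exactly where it would have been without the
           * stall. */
          uint64_t burst = prr_on_ack(&p, 1, 14);
          (void)burst;  /* burst == 3 */
      }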
   Proportional Rate Reduction with a Reduction Bound is significantly
   less sensitive to errors of the pipe estimator.  While in recovery,
   pipe is intrinsically an estimator, using incomplete information to
   guess if un-SACKed segments are actually lost or merely out-of-
   order in the network.  Under some conditions pipe can have
   significant errors; for example, when a burst of reordered data is
   presumed to be lost and is retransmitted, but then the original
   data arrives before the retransmission.  If the transmissions are
   regulated directly by pipe as they are in RFC 3517, then errors and
   discontinuities in the value of the pipe estimator can cause
   significant errors in the amount of data sent.  With Proportional
   Rate Reduction with a Reduction Bound, pipe merely determines how
   sndcnt is computed from DeliveredData.  Since short term errors in
   pipe are smoothed out across multiple ACKs, and both Proportional
   Rate Reduction and the reduction bound converge to the same final
   window, errors in the pipe estimator have less impact on the final
   outcome.  (This needs to be tested better.)

   Under all conditions and sequences of events during recovery,
   PRR-CRB strictly bounds the data transmitted to be equal to or less
   than the amount of data delivered to the receiver.  We claim that
   this packet conservation bound is the most aggressive algorithm
   that does not lead to additional forced losses in some
   environments.  It has the property that if there is a standing
   queue at a bottleneck that is carrying no other traffic, the queue
   will maintain exactly constant length for the entire recovery
   duration (except for +1/-1 fluctuation due to differences in packet
   arrival and exit times).  See Appendix A for a detailed discussion
   of this property.

   Although the Packet Conserving Bound is very appealing for a
   number of reasons, our measurements summarized in Section 5
   demonstrate that it is less aggressive and does not perform as well
   as RFC 3517, which permits large bursts of data when there are
   bursts of losses.  PRR-SSRB is a compromise that permits TCP to
   send one extra segment per ACK as compared to the packet conserving
   bound.  From the perspective of the packet conserving bound,
   PRR-SSRB does indeed open the window during recovery; however, it
   is significantly less aggressive than RFC 3517 in the presence of
   burst losses.

5.  Measurements

   In a (to be published) companion paper [Recovery] we describe some
   measurements comparing the various strategies for reducing the
   window during recovery.  The results presented in that paper are
   summarized here.

   The various window reduction algorithms and extensive
   instrumentation were all implemented in a modified Linux 2.6.34
   kernel.  For all experiments we used a uniform subset of the non-
   standard algorithms present in the base Linux implementation.
   Specifically we disabled threshold retransmit [FACK], which
   triggers Fast Retransmit earlier than the standard algorithm.  We
   left enabled CUBIC [CUBIC], limited transmit [LT], and the lost
   retransmission detection algorithms.  This subset was a compromise
   chosen such that the behaviors of both Rate Halving (the Linux
   default) and RFC 3517 mode were authentic to their respective
   specifications, while at the same time the performance and features
   were comparable to the kernels in production use.
   The different window reduction algorithms were all present in the
   same kernel and could be selected with a sysctl, such that we had
   an absolutely uniform baseline for comparing RFC 3517, Rate
   Halving, and PRR with various reduction bounds.

   Our experiments included an additional algorithm, PRR with an
   unlimited bound (PRR-UB), which sends ssthresh-pipe bursts when
   pipe falls below ssthresh.  This behavior parallels RFC 3517.

   An important detail of this configuration is that CUBIC only
   reduces the window by 30%, as opposed to the 50% reduction used by
   traditional congestion control algorithms.  This, in conjunction
   with using only standard algorithms to trigger Fast Retransmit,
   accentuates the tendency for RFC 3517 and PRR-UB to send a burst at
   the point when Fast Retransmit gets triggered, if pipe is already
   below ssthresh.

   All experiments were performed on servers carrying production
   traffic for multiple Google services.

   In this configuration it is observed that for 32% of the recovery
   events, pipe falls below ssthresh before Fast Retransmit is
   triggered; thus the various PRR algorithms start in the reduction
   bound phase, and both PRR-UB and RFC 3517 send bursts of segments
   with the fast retransmit.

   In the companion paper we observe that PRR-SSRB spends the least
   time in recovery of all the algorithms tested, largely because it
   experiences fewer timeouts once it is already in recovery.

   RFC 3517 experiences 29% more detected lost retransmissions and
   2.6% more timeouts (presumably due to undetected lost
   retransmissions) than PRR-SSRB.  These results are representative
   of PRR-UB and other algorithms that send bursts when pipe falls
   below ssthresh.

   Rate Halving experiences 5% more timeouts and significantly smaller
   final cwnd values at the end of recovery.  The smaller cwnd
   sometimes causes the recovery itself to take extra round trips.
   These results are representative of PRR-CRB and other algorithms
   that implement strict packet conservation during recovery.

6.  Conclusion and Recommendations

   [This text assumes standards track.  Experimental status would be
   somewhat more reserved.]

   Although the packet conserving bound is very appealing for a number
   of reasons, our measurements summarized in Section 5 demonstrate
   that it is less aggressive and does not perform as well as RFC
   3517, which permits large bursts of data when there are bursts of
   losses.  PRR-SSRB is a compromise that permits TCP to send one
   extra segment per ACK as compared to the packet conserving bound.
   From the perspective of the packet conserving bound, PRR-SSRB does
   indeed open the window during recovery; however, it is
   significantly less aggressive than RFC 3517 in the presence of
   burst losses.

   All TCP implementations SHOULD implement both PRR-CRB and PRR-SSRB,
   with a control to select which algorithm is used.  It is
   RECOMMENDED that PRR-SSRB be the default algorithm.

7.  Acknowledgements

   This draft is based in part on previous incomplete work by Matt
   Mathis, Jeff Semke, and Jamshid Mahdavi [RHID] and influenced by
   several discussions with John Heffner.

   Monia Ghobadi and Sivasankar Radhakrishnan helped analyze the
   experiments.

8.  Security Considerations

   Proportional Rate Reduction does not change the risk profile for
   TCP.
   Implementers that change PRR from counting bytes to counting
   segments have to be cautious about the effects of ACK splitting
   attacks [SPLIT], where the receiver acknowledges partial segments
   for the purpose of confusing the sender's congestion accounting.

9.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as
   an RFC.

10.  References

   TODO: A proper reference section.

   [RFC 2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow,
               "TCP Selective Acknowledgment Options", October 1996.

   [RFC 3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
               Conservative Selective Acknowledgment (SACK)-based
               Loss Recovery Algorithm for TCP", April 2003.

   [RFC 5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
               Control", September 2009.

   [LT]        Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
               TCP's Loss Recovery Using Limited Transmit", RFC 3042,
               January 2001.

   [SPLIT]     Savage, S., Cardwell, N., Wetherall, D., and T.
               Anderson, "TCP Congestion Control with a Misbehaving
               Receiver", ACM Computer Communication Review, October
               1999.

   [RHweb]     Mathis, M. and J. Mahdavi, "TCP Rate-Halving with
               Bounding Parameters",
               http://www.psc.edu/networking/papers/FACKnotes/971219/,
               December 1997.

   [RHID]      Mathis, M., Semke, J., Mahdavi, J., and K. Lahey, "The
               Rate-Halving Algorithm for TCP Congestion Control",
               http://www.psc.edu/networking/ftp/papers/
               draft-ratehalving.txt, Work in progress, last updated
               June 1999.

   [CUBIC]     Rhee, I. and L. Xu, "CUBIC: A new TCP-friendly high-
               speed TCP variant", PFLDnet, February 2005.

   [FACK]      Mathis, M. and J. Mahdavi, "Forward Acknowledgment:
               Refining TCP Congestion Control", Proceedings of
               SIGCOMM '96, Stanford, CA, August 1996.

   [Recovery]  Dukkipati, N., Mathis, M., and Y. Cheng, "Improving TCP
               loss recovery", to be published, 2011.

Appendix A.  Packet Conservation Bound

   Under all conditions and sequences of events during recovery,
   PRR-CRB strictly bounds the data transmitted to be equal to or less
   than the amount of data delivered to the receiver.  We claim that
   this packet conservation bound is the most aggressive algorithm
   that does not lead to additional forced losses in some
   environments.  It has the property that if there is a standing
   queue at a bottleneck that is carrying no other traffic, the queue
   will maintain exactly constant length for the entire recovery
   duration (except for +1/-1 fluctuation due to differences in packet
   arrival and exit times).  Any less aggressive algorithm will result
   in a declining queue at the bottleneck.  Any more aggressive
   algorithm will result in an increasing queue or additional losses
   at the bottleneck.

   We demonstrate this property with a little thought experiment:

   Imagine a network path that has insignificant delays in both
   directions, except for the processing time and queue at a single
   bottleneck in the forward path.  By insignificant delay, we mean
   that when a packet is "served" at the head of the bottleneck queue,
   the following events happen in much less than one packet time at
   the bottleneck: the packet arrives at the receiver; the receiver
   sends an ACK, which arrives at the sender; the sender processes the
   ACK and sends some data; the data is queued at the bottleneck.

   If sndcnt is set to DeliveredData and nothing else is inhibiting
   sending data, then clearly the data arriving at the bottleneck
   queue will exactly replace the data that was served at the head of
   the queue, so the queue will have a constant length.  If the queue
   is drop tail and full, then the queue will stay exactly full, even
   in the presence of losses or reordering on the ACK path, and
   independent of whether the data is in order or out-of-order (e.g.,
   simple reordering or loss recovery from an earlier RTT).  Any more
   aggressive algorithm sending additional data will cause a queue
   overflow and loss.  Any less aggressive algorithm will underfill
   the queue.  Therefore setting sndcnt to DeliveredData is the most
   aggressive algorithm that does not cause forced losses in this
   simple network.  Relaxing the assumptions (e.g., making delays more
   authentic and adding more flows, delayed ACKs, etc.) increases the
   noise (jitter) in the system but does not change its basic
   behavior.
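   The constant-queue behavior is easy to check mechanically.  The toy
   loop below is a sketch under the idealized assumptions above, with
   entirely hypothetical names and setup: it serves packets from a
   full drop-tail queue and lets each resulting ACK trigger
   sndcnt = DeliveredData new arrivals, and the queue length never
   changes.

      /* Toy check of the constant-queue claim under the idealized
       * assumptions above; names and setup are hypothetical. */
      #include <stdio.h>

      int main(void)
      {
          int queue = 8;                /* standing bottleneck queue */
          for (int t = 0; t < 20; t++) {
              queue--;                  /* head packet served, ACKed */
              int delivered_data = 1;   /* ACK reports one packet */
              int sndcnt = delivered_data;  /* PRR-CRB bound */
              queue += sndcnt;          /* response joins the queue */
          }
          printf("queue after 20 served packets: %d\n", queue); /* 8 */
          return 0;
      }

   Replacing sndcnt = delivered_data with anything larger overflows
   the full queue; anything smaller drains it, matching the argument
   above.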
   Note that the congestion control algorithm implements a broader
   notion of optimal that includes appropriate sharing of the network.
   PRR-CRB will choose to send the lesser of the data permitted by
   this packet conserving bound and the amount determined by the
   congestion control algorithm, as PRR brings TCP's actual window
   down to ssthresh.

Authors' Addresses

   Matt Mathis
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: mattmathis@google.com

   Nandita Dukkipati
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: nanditad@google.com

   Yuchung Cheng
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: ycheng@google.com