idnits 2.17.1 

draft-mathis-tcp-ratehalving-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 971 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 3 instances of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** The abstract seems to contain references ([Hoe95]), which it shouldn't. 
     Please replace those with straight textual mentions of the documents in
     question.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 116: '...nd.  At any given time, a TCP MUST NOT...'
     RFC 2119 keyword, line 838: '...afe to relax the SHOULD in RFC2018, su...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 1999) is 9020 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Hoe95'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Hoe96'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'MM96'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'MM97'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NS'

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 2481 (Obsoleted by RFC 3168)

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Obsolete normative reference: RFC 2582 (Obsoleted by RFC 3782)


     Summary: 13 errors (**), 0 flaws (~~), 2 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force                         Matt Mathis
2	INTERNET DRAFT                                           Jeff Semke
3	draft-mathis-tcp-ratehalving-00.txt                        PSC
4	                                                    Jamshid Mahdavi
5	                                                             Novell
6	                                                        August 1999
7	                                            Expires:  February 2000

9	           The Rate-Halving Algorithm for TCP Congestion Control

11	                          Status of this Memo

13	   This document is an Internet-Draft and is in full conformance with
14	   all provisions of Section 10 of RFC2026.  Internet-Drafts are working
15	   documents of the Internet Engineering Task Force (IETF), its areas,
16	   and its working groups.  Note that other groups may also distribute
17	   working documents as Internet-Drafts.

19	   Internet-Drafts are draft documents valid for a maximum of six months
20	   and may be updated, replaced, or obsoleted by other documents at any
21	   time.  It is inappropriate to use Internet- Drafts as reference
22	   material or to cite them other than as "work in progress."

24	   The list of current Internet-Drafts can be accessed at
25	   http://www.ietf.org/ietf/1id-abstracts.txt

27	   The list of Internet-Draft Shadow Directories can be accessed at
28	   http://www.ietf.org/shadow.html.

30	Abstract

32	   This draft provides a detailed description of the Rate-Halving
33	   algorithm.  As specified by RFC2581, Fast Recovery adjusts the
34	   congestion window (cwnd) by transmitting new data only during the
35	   second half of the recovery interval.  The Rate-Halving algorithm
36	   adjusts the congestion window by spacing transmissions at the rate of
37	   one data segment per two segments acknowledged over the entire
38	   recovery period, thereby sustaining the self-clocking of TCP and
39	   avoiding a burst.  Since it is largely independent of the details of
40	   the data retransmission strategy, the Rate-Halving algorithm can be
41	   used with several standard and experimental TCP implementations:
42	   NewReno, SACK, and ECN.

44	   This algorithm was first proposed by Janey Hoe [Hoe95].

46	1. Introduction
47	===============

49	   All "Reno TCP" implementations include TCP Fast Retransmit and Fast
50	   Recovery algorithms [RFC2581].  Fast retransmit relies on three
51	   duplicate acknowledgements to trigger the retransmission of a single
52	   lost segment.  Once the Fast Retransmit has occurred, TCP then waits
53	   for enough additional duplicate ACKs to arrive, indicating that half
54	   of the data in flight has left the network.  Only when this has
55	   occurred will TCP send additional new data.

57	   The consequence of this delay is that the entire new window of data
58	   is transmitted in one half of one Round Trip Time (RTT). This burst
59	   can cause repeated bursts in successive RTTs following the recovery,
60	   which can result in overall additional burstiness on the network.

62	   Hoe [Hoe95] suggested that during Fast Recovery the TCP data sender
63	   space out retransmissions and new data on alternate acknowledgements
64	   across the entire recovery RTT.  (Note that this eliminates the half
65	   RTT lull in sending which occurs in Reno TCP.)

67	   This document describes a Rate-Halving algorithm which implements
68	   Hoe's idea.  It explains how to implement the algorithm under
69	   NewReno, SACK, and ECN-style TCP implementations.  This document does
70	   not discuss the other suggestions in [Hoe95] and [Hoe96], such as a
71	   change to the ssthresh parameter during Slow-Start, or the NewReno
72	   proposal [RFC2582].

74	   Rate-Halving has a number of other useful properties as well.  It
75	   results in a slightly lower final value for cwnd following recovery,
76	   which has been suggested by Floyd and others as the more correct
77	   value.  The Rate-Halving algorithm provides proper adjustments to the
78	   congestion window in response to congestion signals such as a lost
79	   segment or an ECN-Echo bit [RFC2481].  These adjustments are largely
80	   independent of the strategy used to retransmit missing segments,
81	   allowing Rate-Halving to be extended for use in other TCP
82	   implementations or even non-TCP transport protocols.

84	   Definitions of terms and variables that are used throughout the rest
85	   of the document can be found in Section 2.  Section 3 provides an
86	   overview of the Rate-Halving algorithm.  Section 4 presents a
87	   detailed specification of the algorithm.  Section 5 contains related
88	   commentary, and Section 6 discusses interactions between Rate-Halving
89	   and other components of TCP.

91	2. Definitions
92	==============

94	   The following terms are used throughout the document, and are defined
95	   here for clarity.  The definitions within quotations in this section
96	   are included from the cited documents.

98	   SENDER MAXIMUM SEGMENT SIZE (SMSS):  "The SMSS is the size of the
99	      largest segment that the sender can transmit.  This value can be
100	      based on the maximum transmission unit of the network, the path
101	      MTU discovery [RFC1191] algorithm, RMSS, or other factors.  The
102	      size does not include the TCP/IP headers and options." from
103	      [RFC2581]

105	   ADJUSTMENT INTERVAL:  The time during which TCP reduces the
106	      congestion window after experiencing congestion.

108	   REPAIR INTERVAL:  The time during which TCP retransmits lost
109	      segments.

111	   RATE-HALVING STATE (rh_state):  A state variable indicating what
112	      form of adjustments are being performed during adjustment
113	      intervals.  The Rate-Halving states are described in Section 4.1.

115	   CONGESTION WINDOW (cwnd):  "A TCP state variable that limits the
116	      amount of data a TCP can send.  At any given time, a TCP MUST NOT
117	      send data with a sequence number higher than the sum of the
118	      highest acknowledged sequence number and the minimum of cwnd and
119	      rwnd." from [RFC2581].  Cwnd, as used by Reno and NewReno TCPs, is
120	      used as an offset from snd_una (the highest-sequence segment
121	      acknowledged by the receiver, indicating that all segments less
122	      than snd_una have been received).  During recovery, Reno and
123	      NewReno artificially inflate cwnd as dupacks are received.

125	   RATE-HALVING CONGESTION WINDOW (rhcwnd): In order to differentiate
126	      the usage of cwnd within Rate-Halving from the cwnd used by Reno
127	      and NewReno [RFC2582], we introduce a new but comparable
128	      congestion window variable called rhcwnd.  Rhcwnd is not inflated
129	      by dupacks during congestion episodes.  It represents the amount
130	      of data that the sender is allowed to have outstanding.  It is not
131	      considered an offset to snd_una, but rather a count of the data in
132	      flight, regardless of whether the segments contain new or
133	      retransmitted data.

135	   PRIOR_RHCWND (prior_rhcwnd): The value of rhcwnd saved when the
136	      adjustment interval begins.

138	   LOSS WINDOW (LW):  "The loss window is the size of the congestion
139	      window after a TCP sender detects loss using its retransmission
140	      timer." from [RFC2581]

142	   MAX_SEQ (max_seq): The maximum sequence number transmitted for
143	      the current connection.

145	   PRIOR_MAX_SEQ (prior_max_seq): The maximum sequence number
146	      transmitted before the start of an adjustment interval.

148	   FORWARD ACKNOWLEDGEMENT (fack): The highest sequence number known to
149	      have reached the receiver, plus one.  (This includes SACK
150	      information, and is different than the variable snd.una when
151	      losses have occurred).  The use of the fack state variable (and
152	      the retran_data state variable, below) was originally described by
153	      Mathis [MM96].  The details for setting fack can be found in
154	      Section 6.1.2.

156	   DUPLICATE ACKNOWLEDGMENTS (dupacks): Acknowledgments that carry the
157	      same sequence number as an earlier acknowledgment.  In order to
158	      qualify as a dupack, the ack must carry no data, and must not be a
159	      window update.  Dupacks provide an indication that the segment
160	      following the sequence number in the dupack may have been lost,
161	      but that some future segment has been received.  The state
162	      variable "dupacks" is a count of the current number of
163	      "outstanding" dupacks.  By "outstanding", we mean dupacks which
164	      have been received, and which represent data not yet covered by
165	      snd.una.

167	   PARTIAL ACKNOWLEDGMENTS: Following congestion, these are "ACKs which
168	      cover new data, but not all the data outstanding when loss was
169	      detected" from [RFC2582].

171	   FULL ACKNOWLEDGMENT: Following congestion, an ACK which covers all
172	      data outstanding when a loss was detected (i.e. an ack that is
173	      greater than or equal to prior_max_seq.)

175	   RETRANSMITTED DATA (retran_data):  The amount of retransmitted data
176	      outstanding in the network.

178	   NUM_RETRANS (num_retrans): The number of bytes retransmitted during
179	      the current repair interval.

181	   ECN-ECHO ACK PACKET:  "If the sender receives an ECN-Echo ACK packet
182	      (that is, an ACK packet with the ECN-Echo flag set in the TCP
183	      header), then the sender knows that congestion was encountered in
184	      the network on the path from the sender to the receiver.  The
185	      indication of congestion should be treated just as a congestion
186	      loss in non-ECN-Capable TCP." from [RFC2481].

188	   In addition to the above definitions, we use the following commonly-
189	   used TCP state variables and terms:  ssthresh [RFC2581], snd.una,
190	   snd.nxt [RFC793], and segment [RFC2581].

192	3. Overview of the Rate-Halving Algorithm
193	=========================================

195	   The Rate-Halving (RH) algorithm is designed to reduce the congestion
196	   window following the detection of congestion in the network.
197	   Rate-Halving reduces rhcwnd over one round trip by transmitting one
198	   new segment for every two segments acknowledged by the receiver.  The
199	   Rate-Halving algorithm also detects the end of this round trip,
200	   employing several different checks to perform this detection most
201	   accurately.  The choice of which segment to send at any given point
202	   in time is completely independent of the Rate-Halving algorithm,
203	   allowing Rate-Halving to be used simultaneously with ECN (where no
204	   retransmissions occur) [RFC2481], NewReno (where retransmissions are
205	   spread out one per RTT) [RFC2582], or SACK (where holes are filled
206	   soon after they are detected) [RFC2018].

208	   As a result of the separation of window reduction from data
209	   retransmission, the term "recovery" ceases to have a clear meaning.
210	   We replace it with two terms: "adjustment interval" (referring to the
211	   round trip containing the window adjustment) and "repair interval"
212	   (referring to the retransmission of missing data).  The adjustment
213	   interval is entered immediately upon indication of the possible onset
214	   of congestion.  Two possible indications of congestion are presence
215	   of the ECN-Echo bit, or indications of a packet loss via duplicate
216	   ACK or via presence of a SACK block.

218	   During the adjustment interval, the window is reduced by sending one
219	   data segment for each two segments which are acknowledged, as per the
220	   Rate-Halving algorithm (described in detail below).  At the end of
221	   the adjustment interval, additional checks are made to update
222	   ssthresh and to ensure that the final value of rhcwnd is appropriate.
223	   Due to the additional window reduction for the lost data, the final
224	   value for rhcwnd is 1/2 of the correctly delivered portion of the
225	   window of data which was in the network at the time of detection.

227	4. Complete Rate-Halving Algorithm
228	==================================

230	   This section describes the complete Rate-Halving algorithm.  Most of
231	   the complexity of the specification is a result of the need for
232	   backwards compatibility with hosts that do not support SACK or ECN.

234	   For all subsections of this section, additional discussion and
235	   explanation may be found in Section 5 under the same subsection
236	   number.

238	4.1 Rate-Halving states

240	   Rate-Halving is specified as a state machine with the following
241	   states:

243	   RH_INCR - Rate-Halving is not performing any downward window
244	      adjustment.   This corresponds to TCP being in either Slow-start
245	      or the linear increase region.

247	   RH_EXACT - Rate-Halving is reducing the window with accurate knowledge
248	      of what data has left the network.   This state normally follows
249	      congestion indicated by ECN or SACK.

251	   RH_EST - Rate-Halving is reducing the window while estimating the
252	      amount of data that has left the network.   This state is
253	      analagous to Reno's Fast Recovery.

255	   RH_EST_REPAIR - Rate-Halving is holding the window constant while
256	      repairing multiple holes in a manner similar to NewReno on
257	      successive round trips after the window adjustment has completed.

259	        +----------------+                     +----------------+
260	        | RH_INCR        |-------------------->| RH_EXACT       |
261	        |                | 4.4 New SACK or ECN |                |
262	        | 4.2 Grow rhcwnd|                     | 4.6 Reduce     |
263	        |                |<--------------------|     rhcwnd     |
264	        | 4.3 Transmit   | 4.8 Minor reorder   |                |
265	        |                |                     | 4.3 Transmit   |
266	        |                |<--------------------|                |
267	        |                | 4.10 New data ACKed |                |
268	        |                | 4.14 Bound          |                |
269	        |                |                     |                |
270	        |                |<--------------------|                |
271	        |                | 4.15 Retransmit     |                |
272	        |                |        Timeout      |                |
273	        +----------------+                     +----------------+
274	          ^  ^  ^  ^  |                                       |
275	          |  |  |  |  |                                       |
276	          |  |  |  |  |                                       v
277	          |  |  |  |  |                                       |
278	      4.14|  |  |  |  v                           +-----------+
279	     Bound|  |  |  |  |                           | 4.5 Dupacks
280	          |  |  |  |  +---------------------------+
281	      4.13|  |  |  |                  4.5 Dupacks |
282	    Adjust|  |  |  +---------------+              |
283	          |  |  |                  |              |
284	      4.11|  |  +---------------+  |              |
285	      Full|  |                  |  |              |
286	       Ack|  +---------------+  |  |              |
287	          |                  |  |  |              v
288	        +----------------+   |  |  |           +----------------+
289	        | RH_EST_REPAIR  |   |  |  +-<---------| RH_EST         |
290	        |                |   |  |  4.8 Reorder |                |
291	        | 4.9 Partial ACK|   |  |              | 4.7 Reduce     |
292	        |                |   |  +-<------------|     rhcwnd     |
293	        | 4.3 Transmit   |   |    4.11 Full Ack|                |
294	        |                |   |    4.13 Adjust  | 4.3 Transmit   |
295	        |                |   ^    4.14 Bound   |                |
296	        |                |   |                 | 4.9 Partial ACK|
297	        |                |->-+---------<-------|                |
298	        |                | 4.15 Timeout        |                |
299	        |                |                     |                |
300	        |                |<--------------------|                |
301	        |                | 4.12 Exit if rhcwnd |                |
302	        |                |  <= prior_rhcwnd/2  |                |
303	        |                |                     |                |
304	        |                |-------------------->|                |
305	        |                | 4.5 Re-enter        |                |
306	        +----------------+                     +----------------+

308	4.2 Increasing Rhcwnd

310	   Upon receipt of a new ACK in RH_INCR, apply Congestion Avoidance or
311	   Slow-start per RFC2581 only when:

313	      snd.nxt - fack + retran_data + SMSS >= rhcwnd

315	4.3 Data Transmission

317	   While honoring all other sending constraints (such as receiver window),
318	   in any state, transmit data of length "len" if

320	      (snd.nxt - fack + retran_data + len) < rhcwnd

322	   If the data is a retransmission,

324	      retran_data += len

326	4.4 Entering RH_EXACT State

328	   Transition from RH_INCR to RH_EXACT if the ACK either carries an
329	   ECN-Echo bit, or indicates a new hole via a SACK option.

331	   Store the following information:

333	      prior_rhcwnd = rhcwnd
334	      prior_max_seq = max_seq

336	4.5 Entering RH_EST State

338	   From either RH_INCR or RH_EXACT, transition to RH_EST upon receipt of
339	   a dupack with no SACK options.

341	   From the RH_EST_REPAIR state, transition to RH_EST upon receipt of an
342	   ACK with the ECN-Echo bit set.

344	   If the transition is from RH_INCR or RH_EST_REPAIR to RH_EST, store
345	   the following information:

347	      prior_rhcwnd = rhcwnd
348	      prior_max_seq = max_seq

350	4.6 Reduction of Rhcwnd in RH_EXACT State

352	   For each ACK received while in state RH_EXACT, reduce rhcwnd by 1/2
353	   the distance fack advances plus 1/2 the size of any new holes.

355	4.7 Reduction of Rhcwnd in RH_EST State

357	   For each dupack received while in state RH_EST:

359	      rhcwnd  -= SMSS/2
360	      fack     = min((fack + SMSS), max_seq)
361	      dupacks += 1

363	4.8 Detection of Reordering; Return to RH_INCR

365	   Reordering detection applies only if an ECN-Echo has not been received
366	   and a retransmission has not been sent since the adjustment interval
367	   began.

369	   In RH_EXACT, reordering has occurred if an ACK arrives carrying no
370	   ECN-Echo bit and no SACK blocks.

372	   In RH_EST, reordering has occurred if a partial or full ACK arrives.

374	   If reordering is detected, transition to RH_INCR and restore the
375	   value of rhcwnd, without performing the bounding checks of 4.14:

377	      rhcwnd = prior_rhcwnd;

379	4.9 Processing Partial Acks in RH_EST / RH_EST_REPAIR

381	   For each partial acknowledgement received in RH_EST or RH_EST_REPAIR:

383	   1. Set retran_data to 0.
384	   2. Reduce dupacks by [(bytes of data acked)/SMSS - 1].
385	   3. For the purpose of data transmission, note that a new hole exists
386	      at snd.una.  The new hole is eligible for retransmission if
387	      dupacks > 3, after being reduced in step 2.

389	4.10 Completion of RH_EXACT; Return to RH_INCR

391	   In RH_EXACT, Rate Halving is complete if one of the following occurs
392	   (indicating that a round trip time has passed):

394	   1. An ACK or SACK arrives which acknowledges data beyond
395	      prior_max_seq.

397	   2. An ACK or SACK arrives which acknowledges a segment retransmitted
398	      after entering the RH_EXACT state.

400	   Transition to state RH_INCR and perform the bounding checks in 4.14.

402	4.11 Completion of RH_EST and RH_EST_REPAIR; Return to RH_INCR

404	   In state RH_EST or RH_EST_REPAIR, adjustment and repair are both
405	   complete when a full acknowledgement is received (i.e. snd.una >=
406	   prior_max_seq).

408	   When this occurs, transition to state RH_INCR, perform the estimation
409	   adjustment in 4.13, followed by the bounding checks in 4.14.

411	4.12 "NewReno" Case: Transition from RH_EST to RH_EST_REPAIR

413	   In state RH_EST, adjustment (but not repair) is complete when

415	      rhcwnd <= prior_rhcwnd / 2

417	   Transition from state RH_EST to state RH_EST_REPAIR.

419	4.13 Corrections for Estimation upon return to RH_INCR

421	   When adjustment and repair are complete after an estimated recovery,
422	   perform the following adjustment to compute the correct value for
423	   rhcwnd:

425	      rhcwnd = (prior_rhcwnd - num_retrans) / 2;

427	4.14 Application of Bounding Paremeters

429	   When transitioning back to RH_INCR, update the TCP state variables
430	   as follows:

432	      if (rhcwnd > prior_rhcwnd/2) rhcwnd = prior_rhcwnd/2;
433	      ssthresh = rhcwnd;
434	      if (ssthresh < prior_rhcwnd/4) ssthresh = prior_rhcwnd/4;

436	   Check for a new condition that would cause entry into another
437	   adjustment interval (e.g. the presence of SACK blocks reporting a new
438	   hole (beyond prior_max_seq).

440	4.15 Retransmission Timeouts

442	   If a timeout occurs while in a non-RH_INCR state, set

444	      ssthresh = prior_rhcwnd / 2
445	      rhcwnd = LW

447	   and return to the RH_INCR.

449	   Otherwise, if a timeout occurs while in RH_INCR, set

451	      ssthresh = rhcwnd /2
452	      rhcwnd = LW

454	5. Discussion of Rate-Halving Algorithm
455	=======================================

457	   Each of the subsections of Section 5 provide discussion for the
458	   corresponding subsection of Section 4.

460	5.1 Rate-Halving States

462	   The Rate-Halving algorithm may be viewed as a finite state machine
463	   with four states.  These states are independent of whether the TCP
464	   connection is in Congestion Avoidance or Slow-start.

466	5.2 Increasing Rhcwnd

468	   The left hand term in 4.2 includes all outstanding transmitted and
469	   retransmitted segments that are still in the network, plus one
470	   additional segment to account for the effects of rounding due to
471	   segmentation.

473	   This restriction assures that rhcwnd only grows as long as TCP
474	   actually succeeds in injecting enough data into the network to test
475	   the path.  If TCP is throttled by something other than rhcwnd, then
476	   there is no assurance that rhcwnd of data will actually fit into the
477	   network.

479	   This test should be done prior to taking any data which was ACKed or
480	   SACKed in the incoming segment into account.  This may be
481	   accomplished either by performing this check immediately upon
482	   receiving an ACK, or by remembering the total amount of new and
483	   retransmitted data ACKed or SACKed by the segment, and adding this to
484	   the term on the left hand side.

486	   Since it is never safe to open the window while estimating what data
487	   has left the network, Reno and NewReno can never open the window
488	   while dupacks are present - e.g during any adjustment or repair
489	   interval.

491	5.3 Data Transmission

493	   Rhcwnd always specifies an amount of data allowed to be in flight in
494	   the network. The expression (snd.nxt - fack + retran_data + len) is a
495	   measure of the outstanding data.  Data may be transmitted whenever
496	   the amount of outstanding data is less than the amount allowed by
497	   rhcwnd.

499	   Contrast this to Reno, where cwnd is used as an offset to snd.una,
500	   specifying which sequence numbers may be sent.  As a consequence the
501	   cwnd variable is overloaded during recovery (e.g. setting cwnd=SMSS
502	   to force Fast Retransmit.)

504	   To illustrate that the new semantics for rhcwnd in Rate-Halving are
505	   compatible with those of cwnd in Reno, observe that in Reno the
506	   boundary between being able to send data and not send data is:

508	      snd.nxt = snd.una + cwnd

510	   where
511	      cwnd = rhcwnd - retran_data + dupacks*SMSS
512	   and
513	      retran_data = 0 or 1 depending on whether a retransmission has
514	                    been sent.

516	   Substituting and rearranging the Reno equations yields equation 1:

518	      (1) rhcwnd = snd.nxt - snd.una - dupacks*SMSS + retran_data

520	   In Rate-Halving, the boundary occurs at:

522	      awnd = rhcwnd,

524	   where
525	      awnd = snd.nxt - fack + retran_data
526	   and
527	      fack = snd.una + losses + SACKed data

529	   Substituting and rearranging the Rate-Halving equations yields:

531	      (2) rhcwnd = snd.nxt - snd.una - losses - SACKed data +
532	                                                retran_data

534	   It can be seen that equations 1 and 2 are quite similar.  The losses
535	   known by Rate-Halving (via SACK options) are not known to Reno.  In
536	   addition, the amount of data that has left the network (indicated by
537	   SACK options to Rate-Halving) must be estimated by Reno as
538	   dupacks*SMSS.  Reno only allows one retransmitted segment per round
539	   trip, while Rate-Halving permits (and keeps track of) more.

541	   When SACK options are not used by the receiver, the Rate-Halving
542	   sender must estimate by using equation 1.

544	   Similar analysis can be applied to The "pipe" algorithm used for SACK
545	   in the "ns" [NS] simulator and some SACK implementations.

547	5.4 Entering RH_EXACT State

549	   Note that the sender can infer congestion by receiving a segment with
550	   the ECN-Echo bit set, or by receiving an Ack carrying a SACK option
551	   which indicates a new hole.  In both of these cases, the amount of
552	   data that has left the network is known, either because no data was
553	   lost, or the amount of data that was lost is reported by SACK
554	   options.

556	   In the case of a suspected loss, enter the adjustment interval
557	   immediately.  While this may seem to be a major change to TCP, in
558	   fact we retain the "3 Dupacks" check (and make it more robust) which
559	   is included in Reno TCP.  Segment retransmissions are determined by
560	   the SACK scoreboard, as discussed in section 6.2.1.  During the early
561	   portion of the adjustment interval, one new segment will generally be
562	   transmitted prior to retransmitting the lost segment which triggered
563	   the adjustment.

565	   The algorithm will detect if the segments have merely been reordered,
566	   and the adjustment is terminated (see section 5.8 below).

568	   Prior_rhcwnd and prior_max_seq will be used for end-of-adjustment
569	   checks.

571	5.5 Entering RH_EST State

573	   The significance of this state is that the amount of data in the
574	   network is estimated, since exact information is not available
575	   through SACK options.

577	   If the prior state was RH_EST_REPAIR, then this is a new reduction
578	   triggered by detecting a new ECN prior to the completion of a
579	   NewReno-style repair.

581	   The sender can also infer congestion by receiving duplicate
582	   acknowledgements.  In this case, it is known that a segment was lost
583	   or misordered, but the number of missing segments is not known.  The
584	   number can only be estimated by counting the number of duplicate ACKs
585	   received.  Even if RH_EXACT was entered with an ECN-Echo, the RH_EST
586	   state can be entered from RH_EXACT if a dupack is later received.

588	5.6 Reduction of Rhcwnd in RH_EXACT State

590	   Rather than counting received ACKs, the distance (in bytes) that fack
591	   advances is explicitly used.  This ensures that the window is
592	   properly reduced even when the receiver is sending delayed ACKs
593	   during congestion episodes (such as with ECN).

595	   Each SACK option that indicates a new hole triggers a reduction of
596	   rhcwnd by all of data that was lost plus 1/2 the data that was
597	   received.

599	   During RH, we reduce rhcwnd as data leaves the network.  The net
600	   effect is to cause the transmission of one new segment for every two
601	   segments reported by the receiver.

603	5.7 Reduction of Rhcwnd in RH_EST State

605	   Assume that each dupack is for a segment of size SMSS, and that the
606	   rate is being halved.  This estimate is designed to result in rhcwnd
607	   and fack values that are reasonable, given that more exact
608	   information is not available about which segments have left the
609	   network.  These estimates are comparable to (and no worse than) the
610	   estimates that Reno uses during Fast Recovery.  When the repair
611	   interval ends, final adjustments recalibrate rhcwnd and fack to their
612	   correct values.

614	5.8 Detection of Reordering; Return to RH_INCR

616	   Note that there is a choice of what do with the congestion window
617	   when segment reordering is detected.  In principle, rhcwnd could be
618	   set to an arbitrary function of prior_rhcwnd.

620	   For example, one could choose to penalize networks which reorder
621	   segments by forcibly reducing rhcwnd:

623	      rhcwnd = prior_rhcwnd / 2;

625	   A gentler form of reordering penalty would be to leave rchwnd as
626	   updated by Rate-Halving (Reduce rhcwnd as indicated in section 4.6 or
627	   4.7 until the reordering has been detected).

629	   The choice to restore rhcwnd results in no penalty for reordering:

631	      rhcwnd = prior_rhcwnd;

633	   This is functionally equivalent to today's Reno implementations, and
634	   may be less bursty in some cases.  (RH is less bursty in cases in
635	   which a single new segment is transmitted prior to detecting the
636	   reordering, resulting in a burst which is one segment smaller than
637	   the equivalent Reno burst).

639	   The optimal action is probably between these extremes, and should be
640	   the subject of future research.

642	5.9 Processing Partial Acks in RH_EST / RH_EST_REPAIR

644	   A partial ACK indicates that one hole has been filled (since it is
645	   now covered by an ACK) and that another one exists.  In step 1,
646	   retran_data is cleared since the retransmitted segments left the
647	   network when the partial ACK was generated.

649	   Some of the segments that generated duplicate ACKs are acknowledged
650	   by the partial ACK.  These segments must be accounted for in the
651	   count of dupacks, so step 2 reduces dupacks by the amount of data
652	   that has just been acknowledged.

654	   A new hole is indicated by the partial ack one round trip after the
655	   retransmission for the previous hole was sent.  The only way in which
656	   the segment indicated by the partial ack would be falsely
657	   retransmitted would be if the segment was reordered (delayed) by more
658	   than a round trip time.

660	   If the previous hole was caused by a reordered segment, then the
661	   partial ack may appear without sending a retransmission (and with
662	   dupacks < 3).  In this case, the new hole is not eligible for
663	   retransmission until dupacks >= 3.

665	5.10 Completion of RH_EXACT; Return to RH_INCR

667	   Under severe reordering (after retransmission) condition 2 may cause
668	   a premature exit from the adjustment interval.  This would leave
669	   rhcwnd too large.  However the bounding check in 4.14 will force the
670	   required 1/2 window reduction.

672	   Case 2 might be particularly tricky to implement for cases in which
673	   SACK, ECN, and NewReno are all supported.  Here are some guidelines
674	   for implementation:

676	   ACKs or SACKs which show the arrival of retransmitted data may not be
677	   reliable indicators of case 2.  This is because it is possible (and
678	   even likely) for the data repair interval to continue well beyond the
679	   end of the adjustment interval.  Should a second adjustment interval
680	   begin as a result of subsequent lost data, it is possible for holes
681	   to be filled which don't satisfy case 2.  Therefore, a more rigorous
682	   test is needed for retransmitted segments.

684	   Assuming that there is a SACK scoreboard, one possible implementation
685	   is to store an adjustment interval ID number with each
686	   retransmission.  This ID will clearly indicate during which
687	   adjustment interval a retransmission occurred.  (The ID may also be
688	   "zero" indicating the retransmissions occurred outside of an
689	   adjustment interval).  ACKs and SACKs which are received for
690	   retransmissions can then be definitively tested against condition 2.

692	5.11 Completion of RH_EST and RH_EST_REPAIR; Return to RH_INCR

694	   A full ack indicates that all segments that were outstanding at the
695	   first indication of congestion have been successfully received.
696	   Therefore, the adjustment and repair intervals should be completed,
697	   but final adjustments and checks must be made to ensure the validity
698	   of state variables.

700	5.12 "NewReno" Case: Transition from RH_EST to RH_EST_REPAIR

702	   When the window has been reduced by half, but a full ack has not yet
703	   been received, it is necessary to end the adjustment interval,
704	   holding the window constant while the repair interval completes.

706	   In the RH_EST_REPAIR state, rhcwnd is not increased or decreased.
707	   While in this state, retransmissions will be sent once per round trip
708	   time, as in NewReno until a full ACK ends the repair interval.

710	5.13 Corrections for Estimation upon return to RH_INCR

712	   The final adjustment ensures that the resulting congestion window is
713	   half of the data that successfully passed through the network during
714	   the congested round trip.

716	5.14 Application of Bounding Paremeters

718	   At the end of the RH_EXACT state, rhcwnd has been set to
719	   (prior_rhcwnd-loss)/2, where loss is the number of bytes of data that
720	   have been lost during the current adjustment interval.

722	   In some cases, it is possible for rhcwnd to be set to an invalid
723	   result (such as severe reordering mentioned above or improper
724	   acknowledgement strategies of the receiver).

726	   Rate-Halving includes a set of Bounding Parameters which are used to
727	   guarantee that the Rate-Halving algorithm will make appropriate
728	   adjustments to the window.  The prior_rhcwnd/2 check guarantees that,
729	   no matter what else happens, rhcwnd will always be reduced by a
730	   factor of two.

732	   The prior_rhcwnd/4 check ensures that the TCP connection is not
733	   unduly harmed by extreme network conditions.  The choice to set
734	   ssthresh (rather than rhcwnd) to prior_rhcwnd/4 prevents sending a
735	   burst into the network as a result of suddenly increasing rhcwnd.  A
736	   short period of Slow-start will follow, allowing rhcwnd to quickly
737	   reach the bounding parameter of prior_rhcwnd/4.

739	   Note that it is possible to exit the adjustment interval on an ACK,
740	   and re-enter a new adjustment interval as a result of the same
741	   acknowledgement.  For example, an ACK may carry both acknowledgement
742	   of retransmitted data, causing an exit per 4.10 case 2, and an
743	   ECN-Echo bit (indicating that the retransmitted segment was marked),
744	   causing a new adjustment interval per 4.4.

746	5.15 Retransmission Timeouts

748	   Timeout behavior is nearly identical to Reno.  Since rhcwnd is
749	   reduced during non-RH_INCR states, it is necessary to use
750	   prior_rhcwnd to prevent an overly agressive adjustment in the
751	   ssthresh calculations.

753	   See section 6.2.2 for information about SACK reneging (SACK receiver
754	   discards data that it has already sent SACK blocks for [RFC2018]).

756	   Do not clear the exponential backoff multiplier after the timeout if
757	   the ECN-Echo bit is set on the ACK that results from the
758	   retransmission.

760	6. Additional Implementation Notes
761	==================================

763	6.1 ACK processing

765	6.1.1 Order of actions to be taken

767	   Upon receiving an acknowledgement, the following actions should be
768	   taken:

770	   1. Save a copy of the following state variables for use during
771	      processing of this ACK: snd.una, fack, retran_data.
772	   2. Update the scoreboard, snd.una, fack, and retran_data.
773	   3. Note if the ECN-Echo bit is set.
774	   4. Check for exit conditions from RH states (4.8, 4.10, 4.11).
775	   5. Check for transition to RH_EST_REPAIR state (4.12).
776	   6. Process partial ACKs (4.9).
777	   7. Check for entrance conditions for RH (4.4, 4.5).
778	   8. Adjust the window using state variables as they existed at the
779	      time the ACK was received (4.2, 4.6, 4.7).
780	   9. If the window allows (4.3), send a new segment or a retransmission
781	      as determined by the scoreboard (6.2.1).

783	6.1.2 Setting FACK

785	      When in the RH_EXACT or RH_INCR state:
786	         When each ACK is received, set the fack state variable to be
787	         the maximum of: (1) the highest sequence number which has been
788	         acknowledged plus one, and (2) the highest sequence number
789	         which has appeared in a SACK block plus one.

791	      When in the RH_EST or RH_EST_REPAIR state:
792	         When each ACK is received, set the fack state variable to be
793	         snd_una + (1 + dupacks) * SMSS.

795	6.2 Data Recovery

797	6.2.1 SACK Scoreboard

799	   A SACK scoreboard keeps track of data blocks that have been received
800	   after some data has been lost, as indicated by SACK options.  The
801	   data transmission rule of 4.3 indicates only WHEN to send data (based
802	   on congestion control), not WHAT data to send (based on reliable
803	   delivery).  The scoreboard keeps track of what data has been
804	   received, and indicates what data is eligible for retransmission.

806	   Reno TCP uses a threshold of 3 dupacks to determine that a missing
807	   segment has been lost, not just reordered.  By immediately entering
808	   the adjustment interval, not only can a Rate-Halving implementation
809	   apply the dupack threshold to the first loss, but also to all
810	   subsequent losses.  Mathis and Mahdavi presented a method for
811	   implementing this logic, termed thresholded retransmissions, in the
812	   scoreboard [MM97].  A data block is eligible for retransmission only
813	   after the hole has been observed at least three times, as indicated
814	   by dupacks, or SACK options with sequence numbers that are higher
815	   than the data block in question.

817	   If no data is eligible for retransmission, new data may be sent when
818	   allowed by the congestion window and the receiver's advertised window
819	   (4.3).

821	   When no SACK information is available, such as in states RH_EST or
822	   RH_EST_REPAIR, a single-element scoreboard may be used to store loss
823	   and retransmission state information.

825	6.2.2 Retransmission timeout/SACK reneging

827	   RFC2018 recommends (with a SHOULD) that the scoreboard be cleared
828	   when a timeout is experienced, since the timeout might indicate that
829	   the SACK receiver may have reneged (discarded data that it has
830	   already sent a SACK option for).  If this RFC2018 recommendation is
831	   followed, then perform the following actions in addition to the
832	   actions specified in 4.15:

834	      snd.nxt = snd.una
835	      fack    = snd.una
836	      retran_data = 0

838	   It is believed to be safe to relax the SHOULD in RFC2018, such that
839	   when a retransmission timeout is experienced, the scoreboard can be
840	   kept in order to retain the information about which segments have
841	   already been received.  When allowing the scoreboard to be maintained
842	   through a timeout (conflicting with RFC2018), the following actions
843	   must be performed in addition to the actions specified in 4.15:

845	      snd.nxt = fack
846	      retran_data = 0

848	   In this case, snd.nxt is pulled back to fack, which is the highest
849	   sequence number known to have reached the receiver.  Normal
850	   transmissions will begin at this sequence number, and retransmissions
851	   will continue to be issued by the scoreboard.  (Following the
852	   timeout, fack will advance to the highest ACK or SACK block received.
853	   snd.nxt should be pulled ahead any time that fack > snd.nxt.)

855	   If the scoreboard is not discarded during a timeout, the sender must
856	   be able to detect that the receiver has reneged.  If snd.una advances
857	   to a block stored in the scoreboard, the sender knows that a SACK
858	   renege has occurred, and must clear the scoreboard, setting

860	      fack = snd.una
861	      snd.nxt = snd.una
862	      retran_data = 0

864	   There are three options for what should happen to the window when the
865	   sender detects that the receiver has reneged: 1) force a retransmit
866	   timeout, 2) cut the window in half, or 3) do nothing to the window.
867	   More experimentation is needed to determine the correct approach.

869	7. Acknowledgements
870	====================

872	   The authors would like to acknowledge Janie Hoe's work, which
873	   introduced the concepts of sending new data during recovery and
874	   spacing out data segments during the round trip following a loss.
875	   The authors would also like to thank Van Jacobson and Sally Floyd for
876	   inspirational conversations about many of the ideas in this document.
877	   In addition, the authors are grateful to Mark Allman, Jitendra
878	   Padhye, and Khrys Myrddin for their constructive feedback about
879	   earlier drafts of this document.

881	8. References
882	==============

884	   [Hoe95] J. Hoe, Startup Dynamics of TCP's Congestion Control and
885	   Avoidance Schemes. Master's Thesis, MIT, 1995.  URL "http://ana-
886	   www.lcs.mit.edu/anaweb/ps-papers/hoe-thesis.ps".

888	   [Hoe96] J. Hoe, Improving the Start-up Behavior of a Congestion
889	   Control Scheme for TCP.  In ACM SIGCOMM, August 1996.  URL
890	   "http://www.acm.org/sigcomm/sigcomm96/program.html".

892	   [MM96] M. Mathis, J. Mahdavi, "Forward Acknowledgement: Refining TCP
893	   Congestion Control," SIGCOMM 96, August 1996.

895	   [MM97] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding
896	   Parameters," December 1997.
897	   URL "http://www.psc.edu/networking/papers/FACKnotes/current/".

899	   [NS] The UCB/LBNL/VINT Network Simulator (NS). URL "http://www-
900	   mash.cs.berkeley.edu/ns/".

902	   [RFC793] J. Postel, "Transmission Control Protocol," RFC793,
903	   September 1981.

905	   [RFC1191] J. Mogul, and S. Deering, "Path MTU Discovery," RFC1191,
906	   November 1990.

908	   [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP
909	   Selective Acknowledgment Options," RFC2018, October 1996.

911	   [RFC2481] K. Ramakrishnan, and S. Floyd, "A Proposal to add Explicit
912	   Congestion Notification (ECN) to IP," RFC2481, January 1999.

914	   [RFC2581] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion
915	   Control," RFC2581, April 1999.

917	   [RFC2582] S. Floyd, and T. Henderson, "The NewReno Modification to
918	   TCP's Fast Recovery Algorithm," RFC2582, April 1999.

920	9. Security Considerations
921	==========================

923	   RFC 2581 discusses general security considerations concerning TCP
924	   congestion control.  This document describes a specific algorithm
925	   that conforms with the congestion control requirements of RFC 2581,
926	   and so those considerations apply to this algorithm, too.  There are
927	   no known additional security concerns for this specific algorithm.

929	AUTHORS' ADDRESSES

931	   Matthew Mathis and Jeffrey Semke
932	   Pittsburgh Supercomputing Center
933	   4400 Fifth Avenue
934	   Pittsburgh, PA  15213

936	   EMail: mathis@psc.edu, semke@psc.edu

938	   Jamshid Mahdavi
939	   Novell, Inc.
940	   Mailstop: SJF-B492
941	   2211 North First Street
942	   San Jose, CA  95131

944	   EMail: mahdavi@novell.com, kml@novell.com