idnits 2.17.1 

draft-hurtig-tcpm-rtorestart-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (October 22, 2012) is 4197 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Missing Reference: 'SEG 1' is mentioned on line 164, but not defined

  == Missing Reference: 'SEG 2' is mentioned on line 165, but not defined

  == Missing Reference: 'SEG 3' is mentioned on line 170, but not defined

  ** Obsolete normative reference: RFC 4960 (Obsoleted by RFC 9260)

  == Outdated reference: A later version (-01) exists of
     draft-dukkipati-tcpm-tcp-loss-probe-00


     Summary: 1 error (**), 0 flaws (~~), 6 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	TCP Maintenance and Minor Extensions                           P. Hurtig
3	(tcpm)                                               Karlstad University
4	Internet-Draft                                                A. Petlund
5	Intended status: Experimental              Simula Research Laboratory AS
6	Expires: April 25, 2013                                         M. Welzl
7	                                                      University of Oslo
8	                                                        October 22, 2012

10	                        TCP and SCTP RTO Restart
11	                    draft-hurtig-tcpm-rtorestart-03

13	Abstract

15	   This document describes a modified algorithm for managing the TCP and
16	   SCTP retransmission timers that provides faster loss recovery when a
17	   connection's amount of outstanding data is small.  The modification
18	   allows the transport to restart its retransmission timer more
19	   aggressively in situations where fast retransmit cannot be used.
20	   This enables faster loss detection and recovery for connections that
21	   are short-lived or application-limited.

23	Status of this Memo

25	   This Internet-Draft is submitted in full conformance with the
26	   provisions of BCP 78 and BCP 79.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF).  Note that other groups may also distribute
30	   working documents as Internet-Drafts.  The list of current Internet-
31	   Drafts is at http://datatracker.ietf.org/drafts/current/.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   This Internet-Draft will expire on April 25, 2013.

40	Copyright Notice

42	   Copyright (c) 2012 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (http://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with respect
50	   to this document.  Code Components extracted from this document must
51	   include Simplified BSD License text as described in Section 4.e of
52	   the Trust Legal Provisions and are provided without warranty as
53	   described in the Simplified BSD License.

55	1.  Introduction

57	   TCP uses two mechanisms to detect segment loss.  First, if a segment
58	   is not acknowledged within a certain amount of time, a retransmission
59	   timeout (RTO) occurs, and the segment is retransmitted [RFC6298].
60	   While the RTO is based on measured round-trip times (RTTs) between
61	   the sender and receiver, it also has a conservative lower bound of 1
62	   second to ensure that delayed segments are not mistaken as lost.
63	   Second, when a sender receives duplicate acknowledgments, the fast
64	   retransmit algorithm infers segment loss and triggers a
65	   retransmission.  Duplicate acknowledgments are generated by a
66	   receiver when out-of-order segments arrive.  As both segment loss and
67	   segment reordering cause out-of-order arrival, fast retransmit waits
68	   for three duplicate acknowledgments before considering the segment as
69	   lost.  In some situations, however, the number of outstanding
70	   segments is not enough to trigger three duplicate acknowledgments,
71	   and the sender must rely on lengthy RTOs for loss recovery.
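
   As a rough illustration of why a small number of outstanding
   segments rules out fast retransmit, the following Python sketch
   counts the duplicate acknowledgments that a single loss can produce.
   The function names are hypothetical, and the sketch assumes exactly
   one lost segment, no reordering, no ACK loss, and no new data sent
   after the loss.

```python
DUPTHRESH = 3  # duplicate ACKs required by standard fast retransmit


def max_dupacks(outstanding_segments: int, lost_position: int) -> int:
    """Duplicate ACKs produced when one outstanding segment is lost.

    lost_position is 1-based.  Each segment that arrives after the lost
    one is out of order and triggers one duplicate ACK (assumption:
    single loss, no reordering, no ACK loss).
    """
    return outstanding_segments - lost_position


def fast_retransmit_possible(outstanding_segments: int,
                             lost_position: int = 1) -> bool:
    """True if enough duplicate ACKs can arrive to reach DUPTHRESH."""
    return max_dupacks(outstanding_segments, lost_position) >= DUPTHRESH


# Best case for fast retransmit: the first outstanding segment is lost.
for n in range(1, 6):
    print(n, max_dupacks(n, 1), fast_retransmit_possible(n))
```

   Under these assumptions, fewer than four outstanding segments can
   never yield three duplicate acknowledgments, which leaves the RTO as
   the only recovery mechanism.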

73	   The number of outstanding segments can be small for several reasons:

75	   (1)  The connection is limited by the congestion control when the
76	        path has a low total capacity (bandwidth-delay product) or the
77	        connection's share of the capacity is small.  It is also limited
78	        by the congestion control in the first RTTs of a connection or
79	        after an RTO when the available capacity is probed using slow-
80	        start.

82	   (2)  The connection is limited by the receiver's available buffer
83	        space.

85	   (3)  The connection is limited by the application if the available
86	        capacity of the path is not fully utilized (e.g. interactive
87	        applications), or at the end of a transfer, which is frequent if
88	        the total amount of data is small (e.g. web traffic).

90	   The first two situations can occur for any flow, as external factors
91	   at the network and/or host level cause them.  The third situation
92	   primarily affects flows that are short or have a low transmission
93	   rate.  Typical examples of applications that produce short flows are
94	   web servers.  [RJ10] shows that 70% of all web objects found at the
95	   top 500 sites are too small for fast retransmit to work.  [BPS98]
96	   shows that about 56% of all retransmissions sent by a busy web server
97	   are sent after RTO expiry.  Although the experiments did not use SACK
98	   [RFC2018], SACK could have avoided only 4% of the RTO-based
99	   retransmissions.  Applications have a low transmission rate when
100	   data is sent in response to actions, or as a reaction to real life
101	   events.  Typical examples of such applications are stock trading
102	   systems, remote computer operations and online games.  What is
103	   special about this class of applications is that they are time-
104	   dependent, and extra latency can reduce the application service level
105	   [P09].  Although such applications may represent a small amount of
106	   data sent on the network, a considerable number of flows have such
107	   properties and the importance of low latency is high.

109	   The RTO restart approach outlined in this document makes the RTO
110	   slightly more aggressive when the number of outstanding segments is
111	   small, in an attempt to enable faster loss recovery for all segments
112	   while being robust to reordering.  While it still conforms to the
113	   requirement in [RFC6298] that segments must not be retransmitted
114	   earlier than RTO seconds after their original transmission, it could
115	   increase the chance for a spurious timeout, which could degrade
116	   performance when the congestion window (cwnd) is large -- for
117	   example, when an application sends enough data to reach a cwnd
118	   covering 100 segments and then stops.  The likelihood and potential
119	   impact of this problem as well as possible mitigation strategies are
120	   currently under investigation.

122	   While this document focuses on TCP, the described changes are also
123	   valid for the Stream Control Transmission Protocol (SCTP) [RFC4960]
124	   which has similar loss recovery and congestion control algorithms.

126	1.1.  Requirements Language

128	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
129	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
130	   document are to be interpreted as described in RFC 2119 [RFC2119].

132	2.  RTO Restart Overview

134	   The RTO management algorithm described in [RFC6298] recommends that
135	   the retransmission timer be restarted when an acknowledgment (ACK)
136	   that acknowledges new data is received and there is still outstanding
137	   data.  The restart is conducted to guarantee that unacknowledged
138	   segments will be retransmitted after approximately RTO seconds.
139	   However, by restarting the timer on each incoming acknowledgment,
140	   retransmissions are not typically triggered RTO seconds after their
141	   previous transmission but rather RTO seconds after the last ACK
142	   arrived.  The duration of this extra delay depends on several factors
143	   but is in most cases approximately one RTT.  Hence, in most
144	   situations the time before a retransmission is triggered is equal to
145	   "RTO + RTT".

147	   The extra delay can be significant, especially for applications that
148	   use a lower RTOmin than the standard of 1 second and/or in
149	   environments with high RTTs, e.g. mobile networks.  The restart
150	   approach is illustrated in Figure 1 where a TCP sender transmits
151	   three segments to a receiver.  The arrival of the first and second
152	   segment triggers a delayed ACK [RFC1122], which restarts the RTO
153	   timer at the sender.  The RTO restart is performed approximately one
154	   RTT after the transmission of the third segment.  Thus, if the third
155	   segment is lost, as indicated in Figure 1, the effective loss
156	   detection time is "RTO + RTT" seconds.  In some situations, the
157	   effective loss detection time becomes even longer.  Consider a
158	   scenario where only two segments are outstanding.  If the second
159	   segment is lost, the time to expire the delayed ACK timer will also
160	   be included in the effective loss detection time.

162	             Sender                               Receiver
163	                           ...
164	             DATA [SEG 1] ----------------------> (ack delayed)
165	             DATA [SEG 2] ----------------------> (send ack)
166	             DATA [SEG 3] ----X         /-------- ACK
167	             (restart RTO)  <----------/
168	                           ...
169	             (RTO expiry)
170	             DATA [SEG 3] ---------------------->

172	                       Figure 1: RTO restart example
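
   The two scenarios described above can be worked through numerically.
   The following Python sketch is illustrative only: the function name
   is hypothetical, and the 1000 ms RTO, 200 ms RTT, and 40 ms delayed
   ACK timer are assumed values, not measurements from this document.

```python
def loss_detection_time_ms(rto_ms: int, rtt_ms: int, delack_ms: int = 0) -> int:
    """Effective loss detection time for the last outstanding segment
    under the standard restart rule.

    The timer is restarted when the ACK for the earlier segments
    arrives, roughly one RTT (plus any delayed-ACK wait) after the lost
    segment was sent, and it expires one RTO after that restart.
    """
    ack_arrival_ms = rtt_ms + delack_ms
    return ack_arrival_ms + rto_ms


# Three segments outstanding, third lost (Figure 1): the second segment
# triggers an immediate ACK, so no delayed-ACK wait is added.
three_segments = loss_detection_time_ms(rto_ms=1000, rtt_ms=200)

# Two segments outstanding, second lost: the receiver holds the ACK for
# the first segment until its delayed-ACK timer fires.
two_segments = loss_detection_time_ms(rto_ms=1000, rtt_ms=200, delack_ms=40)

print(three_segments, two_segments)
```

   With these assumed values the loss is detected after 1200 ms in the
   first scenario and 1240 ms in the second, compared with the nominal
   RTO of 1000 ms.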

174	   During normal TCP bulk transfer the current RTO restart approach is
175	   not a problem.  Actually, as long as enough segments arrive at a
176	   receiver to enable fast retransmit, RTO-based loss recovery should be
177	   avoided.  RTOs should only be used as a last resort, as they
178	   drastically lower the congestion window compared to fast retransmit,
179	   and the current approach can therefore be beneficial -- it is
180	   described in [EL04] to act as a "safety margin" that compensates for
181	   some of the problems that the authors have identified with the
182	   standard RTO calculation.  Notably, the authors of [EL04] also state
183	   that "this safety margin does not exist for highly interactive
184	   applications where often only a single packet is in flight."

186	   There are only a few situations where timeouts are appropriate, or
187	   the only choice.  For example, if the network is severely congested
188	   and no segments arrive, RTO-based recovery should be used.  In this
189	   situation, the time to recover from the loss(es) will not be the
190	   performance bottleneck.  Furthermore, for connections that do not
191	   utilize enough capacity to enable fast retransmit, RTO is the only
192	   choice.  The time needed for loss detection in such scenarios can
193	   become a serious performance bottleneck.

195	3.  RTO Restart Algorithm

197	   To enable faster loss recovery for connections that are unable to use
198	   fast retransmit, an alternative RTO restart can be used.  By
199	   resetting the timer to "RTO - T_earliest", where T_earliest is the
200	   time elapsed since the earliest outstanding segment was transmitted,
201	   retransmissions will always occur after exactly RTO seconds.  This
202	   approach makes the RTO more aggressive than the standardized approach
203	   in [RFC6298] but still conforms to the requirement in [RFC6298] that
204	   segments must not be retransmitted earlier than RTO seconds after
205	   their original transmission.

207	   This document specifies the following update of step 5.3 in Section 5
208	   of [RFC6298] (and a similar update in Section 6.3.2 of [RFC4960] for
209	   SCTP):

211	      When an ACK is received that acknowledges new data:

213	      (1)  Set T_earliest = 0.

215	      (2)  If the following two conditions hold:

217	           (a)  The number of outstanding segments is less than four.

219	           (b)  There is no unsent data ready for transmission or the
220	                receiver's advertised window does not permit
221	                transmission.

223	           set T_earliest to the time elapsed since the earliest
224	           outstanding segment was sent.

226	      (3)  Restart the retransmission timer so that it will expire after
227	           "RTO - T_earliest" seconds (for the current value of RTO).
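
   The three steps above can be sketched in Python as follows.  This is
   a minimal illustration rather than a reference implementation: the
   function and parameter names are hypothetical, times are expressed
   in milliseconds for convenience, and the clamp at zero is an added
   guard that the steps themselves do not specify.

```python
def restart_timer_ms(rto_ms: int,
                     outstanding_segments: int,
                     unsent_data_ready: bool,
                     rwnd_permits_transmission: bool,
                     ms_since_earliest_sent: int) -> int:
    """Return the timer value to arm when an ACK acknowledges new data."""
    # Step (1): default to the standard restart (full RTO).
    t_earliest_ms = 0

    # Step (2): conditions (a) and (b).
    few_outstanding = outstanding_segments < 4
    cannot_send_new = (not unsent_data_ready) or (not rwnd_permits_transmission)
    if few_outstanding and cannot_send_new:
        t_earliest_ms = ms_since_earliest_sent

    # Step (3): expire after "RTO - T_earliest".  Clamping at zero is an
    # assumption of this sketch, not part of the steps above.
    return max(0, rto_ms - t_earliest_ms)


# Three outstanding segments, nothing more to send, earliest sent 200 ms ago.
print(restart_timer_ms(1000, 3, False, True, 200))   # modified restart
# Four outstanding segments: the standard restart applies.
print(restart_timer_ms(1000, 4, False, True, 200))
```

   In the first call the timer is armed for 800 ms, so the earliest
   outstanding segment is retransmitted 1000 ms after it was sent.  In
   the second call the full 1000 ms is used.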

229	   The update requires TCP implementations to track the time elapsed
230	   since the transmission of the earliest outstanding segment
231	   (T_earliest).  As the alternative restart is used only when the
232	   number of outstanding segments is less than four, only four segments
233	   need to be tracked.  Furthermore, some implementations of TCP (e.g.
234	   Linux TCP) already track the transmission times of all segments.
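
   One possible way to track T_earliest is sketched below in Python.
   The class and method names are hypothetical.  For simplicity the
   sketch records the send time of every outstanding segment instead of
   bounding the history, counts acknowledgments in whole segments, and
   ignores retransmissions.

```python
from collections import deque


class SendTimeTracker:
    """Tracks send times of outstanding segments, oldest first."""

    def __init__(self) -> None:
        self._sent_at_ms: deque = deque()

    def on_send(self, now_ms: int) -> None:
        """Record the transmission time of a newly sent segment."""
        self._sent_at_ms.append(now_ms)

    def on_cumulative_ack(self, segments_acked: int) -> None:
        """Drop the oldest entries covered by a cumulative ACK."""
        for _ in range(min(segments_acked, len(self._sent_at_ms))):
            self._sent_at_ms.popleft()

    def outstanding(self) -> int:
        """Number of segments sent but not yet acknowledged."""
        return len(self._sent_at_ms)

    def t_earliest_ms(self, now_ms: int) -> int:
        """Time elapsed since the earliest outstanding segment was sent."""
        if not self._sent_at_ms:
            return 0
        return now_ms - self._sent_at_ms[0]


tracker = SendTimeTracker()
for send_time in (0, 5, 10):          # three segments sent back to back
    tracker.on_send(send_time)
tracker.on_cumulative_ack(2)          # ACK covering the first two segments
print(tracker.outstanding(), tracker.t_earliest_ms(now_ms=210))
```

   After the ACK, one segment sent at 10 ms remains outstanding, so at
   210 ms T_earliest is 200 ms.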

236	4.  Discussion

238	   The currently standardized algorithm has been shown to add at least
239	   one RTT to the loss recovery process in TCP [LS00] and SCTP
240	   [HB08][PBP09].  Applications that have strict timing requirements
241	   (e.g. telephony signaling and gaming) rather than throughput
242	   requirements may want to use a lower RTOmin than the standard of 1
243	   second [RFC4166].  For such applications the modified restart
244	   approach could be important as the RTT and also the delayed ACK timer
245	   of receivers will be large components of the effective loss recovery
246	   time.  Measurements in [HB08] have shown that the total transfer time
247	   of a lost segment (including the original transmission time and the
248	   loss recovery time) can be reduced by up to 35% using the suggested
249	   approach.  These results match those presented in [PGH06][PBP09],
250	   where the modified restart approach is shown to significantly reduce
251	   retransmission latency.

253	   There are several proposals that address the problem of not having
254	   enough ACKs for loss recovery.  In what follows, we explain why the
255	   mechanism described here is complementary to these approaches:

257	   The limited transmit mechanism [RFC3042] allows a TCP sender to
258	   transmit a previously unsent segment for each of the first two
259	   duplicate acknowledgments.  By transmitting new segments, the sender
260	   attempts to generate additional duplicate acknowledgments to enable
261	   fast retransmit.  However, limited transmit does not help if no
262	   previously unsent data is ready for transmission or if the receiver
263	   is out of buffer space.  [RFC5827] specifies an early retransmit
264	   algorithm to enable fast loss recovery in such situations.  By
265	   dynamically lowering the number of duplicate acknowledgments needed
266	   for fast retransmit (dupthresh), based on the number of outstanding
267	   segments, a smaller number of duplicate acknowledgments are needed to
268	   trigger a retransmission.  In some situations, however, the algorithm
269	   is of no use or might not work properly.  First, if a single segment
270	   is outstanding, and lost, it is impossible to use early retransmit.
271	   Second, if ACKs are lost, early retransmit cannot help.  Third,
272	   if the network path reorders segments, the algorithm might cause more
273	   unnecessary retransmissions than fast retransmit.
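
   The dynamic dupthresh used by early retransmit can be sketched as
   follows.  The Python code is loosely modeled on the segment-based
   variant in [RFC5827], in which the threshold becomes the number of
   outstanding segments minus one.  The function name is hypothetical,
   and the handling of a single outstanding segment (falling back to
   the standard threshold) is an assumption made for this sketch.

```python
STANDARD_DUPTHRESH = 3


def early_retransmit_dupthresh(outstanding_segments: int,
                               no_new_data_can_be_sent: bool) -> int:
    """Duplicate-ACK threshold with an early-retransmit style reduction.

    The threshold is lowered only when fewer than four segments are
    outstanding and no new data can be sent.  With a single outstanding
    segment no duplicate ACK can arrive, so the standard value is kept.
    """
    if no_new_data_can_be_sent and 2 <= outstanding_segments < 4:
        return outstanding_segments - 1
    return STANDARD_DUPTHRESH


for oseg in (1, 2, 3, 4):
    print(oseg, early_retransmit_dupthresh(oseg, True))
```

   With two or three outstanding segments the threshold drops to one or
   two duplicate acknowledgments.  With a single outstanding segment
   nothing changes, which is the first limitation noted above.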

275	   TCP-NCR [RFC4653] sets the dupthresh to three or more, to better
276	   disambiguate reordered and lost segments.  In addition, early
277	   retransmit lowers the dupthresh when the amount of outstanding data
278	   is small, to enable faster loss recovery.  The reasons why the RTO
279	   restart procedure described in this document does not take dynamic
280	   dupthresh considerations into account are twofold.  First, if a
281	   larger dupthresh is used, the RTO restart approach could be used when
282	   the congestion window, and the amount of outstanding data, is larger.
283	   However, in such situations the actual amount of outstanding data can
284	   significantly impact the RTT of the connection, making it potentially
285	   dangerous to be more aggressive.  Second, if a smaller dupthresh is
286	   used, the amount of outstanding data needed for a restart is smaller.
287	   However, as the congestion window is already small, it does not
288	   matter if a retransmission is due to a fast retransmit or an RTO.
289	   The resulting congestion window will still be very small, and the
290	   only difference is how quickly TCP infers segment loss.

292	   Tail Loss Probe [TLP] is a proposal to send up to two "probe
293	   segments" when a timer fires which is set to a value smaller than the
294	   RTO.  A "probe segment" is a new segment if new data is available,
295	   else a retransmission.  The intention is to compensate for sluggish
296	   RTO behavior in situations where the RTO greatly exceeds the RTT,
297	   which, according to measurements reported in [TLP], is not uncommon.
298	   The Probe timeout (PTO) is at least 2 RTTs, and is scheduled only
299	   if the RTO expires later than the PTO.  A spurious PTO is less risky
300	   than a spurious RTO, as it would not have the same negative effects
301	   (clearing the scoreboard and restarting with slow-start).  In
302	   contrast, RTO restart is trying to make the RTO more appropriate in
303	   cases where there is no need to be overly cautious.
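
   The PTO scheduling rule summarized above can be sketched in Python.
   The sketch uses only the two properties stated in this section (a
   PTO of at least two RTTs, scheduled only when the RTO lies further
   in the future).  It is not the full algorithm from [TLP], and the
   function name and the use of a smoothed RTT in milliseconds are
   assumptions.

```python
from typing import Optional


def schedule_probe_timeout_ms(srtt_ms: int, rto_ms: int) -> Optional[int]:
    """Return the PTO to arm, or None if only the RTO should be used.

    Simplification: the PTO is taken to be exactly two RTTs, the lower
    bound mentioned in the text.
    """
    pto_ms = 2 * srtt_ms
    if rto_ms > pto_ms:
        return pto_ms
    return None


# RTO far above the RTT: a probe is scheduled before the RTO.
print(schedule_probe_timeout_ms(srtt_ms=100, rto_ms=1000))
# RTO already close to the RTT: no probe is scheduled.
print(schedule_probe_timeout_ms(srtt_ms=100, rto_ms=200))
```

   The first call schedules a probe after 200 ms.  The second call
   schedules none, because the RTO would expire no later than the PTO.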

305	   TLP could trigger in situations where RTO restart does not apply, and
306	   it could overrule (yielding a similar general behavior, but with a
307	   lower timeout) RTO restart in cases where the number of outstanding
308	   segments is smaller than 4 and no new segments are available for
309	   transmission.  The shorter RTO from RTO restart also reduces the
310	   probability that TLP is activated because PTO might be farther than
311	   RTO.  This could make RTO restart more aggressive than the algorithm
312	   in [TLP] when:

314	   (1)  no data has been sent in an interval exceeding the RTO

316	   (2)  the number of outstanding segments is 3

318	   (3)  the restart window RW (defined in [RFC5681]) is at least 3

320	   because, under these conditions, in accordance with [RFC5681], 3
321	   packets can immediately be retransmitted, whereas TLP only allows up
322	   to two consecutive PTOs.

324	5.  IANA Considerations

326	   This memo includes no request to IANA.

328	6.  Security Considerations

330	   This document discusses a change in how to set the retransmission
331	   timer's value when restarted.  This change does not raise any new
332	   security issues with TCP or SCTP.

334	7.  References

336	7.1.  Normative References

338	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
339	              Communication Layers", STD 3, RFC 1122, October 1989.

341	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
342	              Selective Acknowledgment Options", RFC 2018, October 1996.

344	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
345	              Requirement Levels", BCP 14, RFC 2119, March 1997.

347	   [RFC3042]  Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
348	              TCP's Loss Recovery Using Limited Transmit", RFC 3042,
349	              January 2001.

351	   [RFC4166]  Coene, L. and J. Pastor-Balbas, "Telephony Signalling
352	              Transport over Stream Control Transmission Protocol (SCTP)
353	              Applicability Statement", RFC 4166, February 2006.

355	   [RFC4653]  Bhandarkar, S., Reddy, A., Allman, M., and E. Blanton,
356	              "Improving the Robustness of TCP to Non-Congestion
357	              Events", RFC 4653, August 2006.

359	   [RFC4960]  Stewart, R., "Stream Control Transmission Protocol",
360	              RFC 4960, September 2007.

362	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
363	              Control", RFC 5681, September 2009.

365	   [RFC5827]  Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and
366	              P. Hurtig, "Early Retransmit for TCP and Stream Control
367	              Transmission Protocol (SCTP)", RFC 5827, May 2010.

369	   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
370	              "Computing TCP's Retransmission Timer", RFC 6298,
371	              June 2011.

373	7.2.  Informative References

375	   [BPS98]    Balakrishnan, H., Padmanabhan, V., Seshan, S., Stemm, M.,
376	              and R. Katz, "TCP Behavior of a Busy Web Server: Analysis
377	              and Improvements", Proc. IEEE INFOCOM Conf., March 1998.

379	   [EL04]     Ekstroem, H. and R. Ludwig, "The Peak-Hopper: A New End-
380	              to-End Retransmission Timer for Reliable Unicast
381	              Transport", IEEE INFOCOM 2004, March 2004.

383	   [HB08]     Hurtig, P. and A. Brunstrom, "SCTP: designed for timely
384	              message delivery?", Springer Telecommunication Systems,
385	              May 2010.

387	   [LS00]     Ludwig, R. and K. Sklower, "The Eifel retransmission
388	              timer", ACM SIGCOMM Comput. Commun. Rev., 30(3),
389	              July 2000.

391	   [P09]      Petlund, A., "Improving latency for interactive, thin-
392	              stream applications over reliable transport", Unipub PhD
393	              Thesis, Oct 2009.

395	   [PBP09]    Petlund, A., Beskow, P., Pedersen, J., Paaby, E., Griwodz,
396	              C., and P. Halvorsen, "Improving SCTP Retransmission
397	              Delays for Time-Dependent Thin Streams",
398	              Springer Multimedia Tools and Applications, 45(1-3), 2009.

400	   [PGH06]    Pedersen, J., Griwodz, C., and P. Halvorsen,
401	              "Considerations of SCTP Retransmission Delays for Thin
402	              Streams", IEEE LCN 2006, November 2006.

404	   [RJ10]     Ramachandran, S., "Web metrics: Size and number of
405	              resources", Google http://code.google.com/speed/articles/
406	              web-metrics.html, May 2010.

408	   [TLP]      Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis,
409	              "TCP Loss Probe (TLP): An Algorithm for Fast Recovery of
410	              Tail Losses", draft-dukkipati-tcpm-tcp-loss-probe-00.txt
411	              (work in progress), July 2012.

413	Authors' Addresses

415	   Per Hurtig
416	   Karlstad University
417	   Universitetsgatan 2
418	   Karlstad,   651 88
419	   Sweden

421	   Phone: +46 54 700 23 35
422	   Email: per.hurtig@kau.se

424	   Andreas Petlund
425	   Simula Research Laboratory AS
426	   P.O. Box 134
427	   Lysaker,   1325
428	   Norway

430	   Phone: +47 67 82 82 00
431	   Email: apetlund@simula.no

433	   Michael Welzl
434	   University of Oslo
435	   PO Box 1080 Blindern
436	   Oslo,   N-0316
437	   Norway

439	   Phone: +47 22 85 24 20
440	   Email: michawe@ifi.uio.no