idnits 2.17.1 

draft-zimmermann-tcp-lcd-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (August 26, 2009) is 5356 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Outdated reference: A later version (-21) exists of
     draft-ietf-tcpm-1323bis-01

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298)

  -- Obsolete informational reference (is this intentional?): RFC 2629
     (Obsoleted by RFC 7749)


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                            A. Zimmermann
3	Internet-Draft                                              A. Hannemann
4	Intended status: Experimental                     RWTH Aachen University
5	Expires: February 27, 2010                               August 26, 2009

7	         Make TCP more Robust to Long Connectivity Disruptions
8	                      draft-zimmermann-tcp-lcd-02

10	Status of this Memo

12	   This Internet-Draft is submitted to IETF in full conformance with the
13	   provisions of BCP 78 and BCP 79.

15	   Internet-Drafts are working documents of the Internet Engineering
16	   Task Force (IETF), its areas, and its working groups.  Note that
17	   other groups may also distribute working documents as Internet-
18	   Drafts.

20	   Internet-Drafts are draft documents valid for a maximum of six months
21	   and may be updated, replaced, or obsoleted by other documents at any
22	   time.  It is inappropriate to use Internet-Drafts as reference
23	   material or to cite them other than as "work in progress."

25	   The list of current Internet-Drafts can be accessed at
26	   http://www.ietf.org/ietf/1id-abstracts.txt.

28	   The list of Internet-Draft Shadow Directories can be accessed at
29	   http://www.ietf.org/shadow.html.

31	   This Internet-Draft will expire on February 27, 2010.

33	Copyright Notice

35	   Copyright (c) 2009 IETF Trust and the persons identified as the
36	   document authors.  All rights reserved.

38	   This document is subject to BCP 78 and the IETF Trust's Legal
39	   Provisions Relating to IETF Documents in effect on the date of
40	   publication of this document (http://trustee.ietf.org/license-info).
41	   Please review these documents carefully, as they describe your rights
42	   and restrictions with respect to this document.

44	Abstract

46	   Disruptions in end-to-end path connectivity which last longer than
47	   one retransmission timeout cause suboptimal TCP performance.  The
48	   reason for the performance degradation is that TCP interprets segment
49	   loss induced by connectivity disruptions as a sign of congestion,
50	   resulting in repeated backoffs of the retransmission timer.  This
51	   leads in turn to a deferred detection of the re-establishment of the
52	   connection since TCP waits until the next retransmission timeout
53	   occurs before attempting the retransmission.

55	   This document describes how standard ICMP messages can be exploited
56	   to disambiguate true congestion loss from non-congestion loss caused
57	   by long connectivity disruptions.  Moreover, a revert strategy of the
58	   retransmission timer is specified that enables a more prompt
59	   detection of whether the connectivity to a previously disconnected
60	   peer node has been restored or not.  The specified algorithm is a TCP
61	   sender-only modification that effectively improves TCP performance in
62	   presence of connectivity disruptions.

64	Table of Contents

66	   1.  Terminology
67	   2.  Introduction
68	   3.  Connectivity Disruption Indication
69	   4.  Connectivity Disruption Reaction
70	     4.1.  Basic Idea
71	     4.2.  The Algorithm
72	     4.3.  Discussion
73	     4.4.  Protecting Against Misbehaving Routers (the Safe
74	           Variant)
75	   5.  Related Work
76	   6.  IANA Considerations
77	   7.  Security Considerations
78	   8.  Acknowledgments
79	   9.  References
80	     9.1.  Normative References
81	     9.2.  Informative References
82	   Appendix A.  TODO list
83	   Appendix B.  Changes from previous versions of the draft
84	     B.1.  Changes from draft-zimmermann-tcp-lcd-01
85	     B.2.  Changes from draft-zimmermann-tcp-lcd-00
86	   Authors' Addresses

88	1.  Terminology

90	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
91	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
92	   document are to be interpreted as described in [RFC2119].

94	   As defined in [RFC0793], the term "acceptable acknowledgment (ACK)"
95	   refers to a TCP segment that acknowledges previously unacknowledged
96	   data.  The Transmission Control Protocol (TCP) sender state variable
97	   "SND.UNA" and the current segment variable "SEG.SEQ" are used as
98	   defined in [RFC0793].  SND.UNA holds the segment sequence number of
99	   the earliest segment that has not been acknowledged by the TCP
100	   receiver (the oldest outstanding segment).  SEG.SEQ is the segment
101	   sequence number of a given segment.

103	   We use both the term "retransmission timer" and the term
104	   "retransmission timeout (RTO)" as defined in [RFC2988].

106	2.  Introduction

108	   Connectivity disruptions can occur in many different situations.  The
109	   frequency of connectivity disruptions depends on the properties of
110	   the end-to-end path between the communicating hosts.
111	   While connectivity disruptions can occur in traditional wired
112	   networks too, e.g., simply due to an unplugged network cable, the
113	   likelihood of occurrence is significantly higher in wireless (multi-
114	   hop) networks.  In particular, end-host mobility, network topology
115	   changes, and wireless interference are crucial factors.  In the case
116	   of the Transmission Control Protocol (TCP) [RFC0793], the performance
117	   of the connection can exhibit a significant reduction compared to a
118	   permanently connected path [SESB05].  This is because TCP, which was
119	   originally designed to operate in fixed and wired networks, generally
120	   assumes that the end-to-end path connectivity is relatively stable
121	   over the connection's lifetime.

123	   According to Schuetz et al. [I-D.schuetz-tcpm-tcp-rlci],
124	   connectivity disruptions can be classified into two groups: "short"
125	   and "long" connectivity disruptions.  A connectivity disruption is
126	   short if connectivity returns before the retransmission timer fires
127	   for the first time.  In this case, TCP recovers lost data segments
128	   through Fast Retransmit and lost acknowledgments (ACK) through
129	   successfully delivered later ACKs.  Connectivity disruptions are
130	   declared as "long" for a given TCP connection, if the retransmission
131	   timer fires at least once before connectivity returns.  Whether or
132	   not path characteristics like the round trip time (RTT) or the
133	   available bandwidth have changed when the connectivity returns after
134	   a disruption is another important aspect for TCP's retransmission
135	   scheme [I-D.schuetz-tcpm-tcp-rlci].

137	   This document focuses on TCP's behavior in the face of long
138	   connectivity disruptions in the time "before" connectivity is
139	   restored.  In particular, this memo does not describe any additional
140	   modification to detect if the path characteristics remain unchanged
141	   in order to improve TCP's behavior "after" connectivity is restored.
142	   Therefore, TCP's congestion control mechanisms
143	   [I-D.ietf-tcpm-rfc2581bis] remain unchanged.

145	   When a long connectivity disruption occurs on a TCP connection, the
146	   TCP sender stops receiving acknowledgments.  After the retransmission
147	   timer expires, the TCP sender enters the timeout-based loss recovery
148	   and declares the oldest outstanding segment (SND.UNA) as lost.  Since
149	   TCP tightly couples reliability and congestion control, the
150	   retransmission of SND.UNA is triggered together with the reduction of
151	   the sending rate, which is based on the assumption that loss is an
152	   indication of congestion [I-D.ietf-tcpm-rfc2581bis].  As long as the
153	   connectivity disruption persists, TCP will repeat the procedure until
154	   the oldest outstanding segment is successfully acknowledged, or the
155	   connection times out.  TCP implementations that follow the
156	   recommended retransmission timeout (RTO) management of RFC 2988
157	   [RFC2988] double the RTO after each retransmission attempt.  However,
158	   the RTO growth may be bounded by an upper limit, the maximum RTO,
159	   which is at least 60s, but may be longer: Linux for example uses
160	   120s.  If the connectivity is restored between two retransmission
161	   attempts, TCP still has to wait until the retransmission timer
162	   expires before resuming transmission, since it simply does not have
163	   any means to know when the connectivity is re-established.
164	   Therefore, depending on when connectivity becomes available again,
165	   this can waste up to one maximum RTO of possible transmission time.
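
   The exponential growth of the RTO described above can be illustrated
   with a short sketch.  It is not part of the specification; the
   function name "rto_schedule" and the chosen values (initial RTO of
   1s, maximum RTO of 60s) are assumptions made for illustration only.

```python
def rto_schedule(initial_rto, max_rto, attempts):
    """Return the RTO in effect after each successive timeout.

    Hypothetical helper: models only the doubling of the RTO and the
    upper bound (maximum RTO), nothing else of the RTO management.
    """
    schedule = []
    rto = initial_rto
    for _ in range(attempts):
        rto = min(rto * 2, max_rto)  # back off, bounded by the maximum RTO
        schedule.append(rto)
    return schedule


# Eight consecutive timeouts, starting from an RTO of 1 second.
print(rto_schedule(initial_rto=1.0, max_rto=60.0, attempts=8))
# prints [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0, 60.0]
```

   Once the RTO has reached its maximum, a sender in this example
   probes the path only once every 60 seconds, so connectivity that
   returns right after a retransmission remains unused for almost that
   long.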

167	   This retransmission behavior is not efficient, especially in
168	   scenarios or networks like wireless (multi-hop) networks where
169	   connectivity disruptions are frequent.  In the ideal case, TCP would
170	   attempt a retransmission as soon as connectivity to its peer is re-
171	   established.  This document describes how the standard Internet
172	   Control Message Protocol (ICMP) can be exploited to identify non-
173	   congestion loss caused by connectivity disruptions.  A revert
174	   strategy of the retransmission timer is specified that enables, due
175	   to higher-frequency retransmissions, a prompt detection of whether
176	   connectivity to a previously disconnected peer node has been
177	   restored.  The specified scheme is a TCP sender-only modification,
178	   i.e., neither intermediate routers nor the TCP receiver have to be
179	   modified.  Furthermore, in the case the network allows, i.e., no
180	   congestion is present, the proposed algorithm approaches the ideal
181	   behavior.

183	3.  Connectivity Disruption Indication

185	   As long as the queue of an intermediate router experiencing a link
186	   outage is deep enough, i.e., it can buffer all incoming packets, a
187	   connectivity disruption will only cause variation in delay which is
188	   handled well by contemporary TCP implementations with the help of
189	   Eifel [RFC3522] or forward RTO (F-RTO) [I-D.ietf-tcpm-rfc4138bis].
190	   However, if the link outage lasts too long, the router experiencing
191	   the link outage is forced to drop packets and finally to discard the
192	   corresponding route.  Means to detect such link outages include
193	   reacting to failed Address Resolution Protocol (ARP) [RFC0826]
194	   queries, unsuccessful link sensing, and the like.  However, this is
195	   solely the responsibility of the respective router.

197	      Note: The focus of this memo is on introducing a method by which
198	      ICMP messages may be exploited to improve TCP's performance; how
199	      different physical and link layer mechanisms underneath the
200	      network layer may trigger ICMP destination unreachable messages
201	      is out of scope of this memo.

203	   The removal of the route usually goes along with a notification to
204	   the corresponding TCP sender about the dropped packets via ICMP
205	   destination unreachable messages of code 0 (net unreachable) or code
206	   1 (host unreachable) [RFC1812].  Therefore, since ICMP destination
207	   unreachable messages of these codes provide evidence that packets
208	   were dropped due to a link outage, they can be used by a TCP as an
209	   indication of a connectivity disruption.

211	   Note that there are also other ICMP destination unreachable messages
212	   with different codes.  Some of them are candidates for connectivity
213	   disruption indications too, but need further investigation.  Examples
214	   are ICMP destination unreachable messages with code 5 (source
215	   route failed), code 11 (net unreachable for TOS), or code 12 (host
216	   unreachable for TOS) [RFC1812].  On the other hand, codes that flag
217	   hard errors are of no use for the proposed scheme, since TCP should
218	   abort the connection when those are received [RFC1122].  In the
219	   following, the term "ICMP unreachable message" is used as a synonym
220	   for ICMP destination unreachable messages of code 0 or code 1.
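
   The classification of ICMP messages described in this section can be
   sketched as follows.  This is an illustration only: the function
   name "classify_icmp" and the returned labels are assumptions of this
   sketch and are not defined by this memo.  ICMP type 3 is the
   destination unreachable message type.

```python
ICMP_DEST_UNREACHABLE = 3  # ICMP type "destination unreachable"

# Codes this memo treats as connectivity disruption indications.
DISRUPTION_CODES = {0, 1}      # net unreachable, host unreachable
# Codes named as candidates that need further investigation.
CANDIDATE_CODES = {5, 11, 12}  # source route failed, net/host unreachable for TOS


def classify_icmp(icmp_type, icmp_code):
    """Classify an ICMP message for the purposes of the scheme.

    Hypothetical helper; the labels are illustrative:
      "disruption" - an "ICMP unreachable message" in the sense of this memo
      "candidate"  - possible indication, not used by the scheme
      "ignored"    - any other destination unreachable code
      "other"      - not a destination unreachable message at all
    """
    if icmp_type != ICMP_DEST_UNREACHABLE:
        return "other"
    if icmp_code in DISRUPTION_CODES:
        return "disruption"
    if icmp_code in CANDIDATE_CODES:
        return "candidate"
    return "ignored"


print(classify_icmp(3, 1))   # prints disruption
print(classify_icmp(3, 11))  # prints candidate
```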

222	   The accurate interpretation of ICMP unreachable messages as a
223	   connectivity disruption indication is complicated by the following
224	   two peculiarities of ICMP messages.  Firstly, they do not necessarily
225	   operate on the same timescale as the packets, i.e., in the given case
226	   TCP segments, which elicited them.  When a router drops a packet due
227	   to a missing route it will not necessarily send an ICMP unreachable
228	   message immediately, but rather queues it for later delivery.
229	   Secondly, ICMP messages are subject to rate limiting, e.g., when a
230	   router drops a whole window of data due to a link outage, it will
231	   hardly send as many ICMP unreachable messages as it dropped TCP
232	   segments.  Depending on the load of the router it may even send no
233	   ICMP unreachable messages at all.  Both peculiarities originate from
234	   [RFC1812].

236	   Fortunately, according to [RFC0792] ICMP unreachable messages are
237	   obliged to contain in their body the Internet Protocol (IP) header
238	   [RFC0791] of the datagram eliciting the ICMP unreachable messages
239	   plus the first 64 bits of the payload of that datagram.  Hence, in
240	   case of TCP both port numbers and the sequence number are included.
241	   This allows the originating TCP to identify the connection which an
242	   ICMP unreachable message is reporting an error about.  Moreover, it
243	   allows the originating TCP to identify which segment of the
244	   respective connection triggered the ICMP unreachable message,
245	   provided that there are not several segments in flight with the same
246	   sequence number.  Several such segments may very well be in flight
247	   when TCP is recovering lost segments (see Section 4.3).
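
   How the port numbers and the sequence number can be recovered from
   the body of an ICMP unreachable message is sketched below.  The
   sketch assumes IPv4 and assumes that the input starts at the ICMP
   header (any outer IP header already removed); the function name
   "extract_tcp_info" is hypothetical.

```python
import struct


def extract_tcp_info(icmp_message: bytes):
    """Return (src_port, dst_port, seq) of the TCP segment quoted in an
    ICMP destination unreachable message.

    Layout assumed: 8-byte ICMP header, then the IP header of the
    offending datagram, then the first 64 bits of its payload.
    """
    if len(icmp_message) < 8:
        raise ValueError("truncated ICMP header")
    icmp_type = icmp_message[0]
    if icmp_type != 3:
        raise ValueError("not a destination unreachable message")

    inner = icmp_message[8:]            # quoted IP header + 64 bits of payload
    if len(inner) < 20:
        raise ValueError("truncated IP header")
    ihl = (inner[0] & 0x0F) * 4         # IP header length in bytes
    if inner[9] != 6:                   # IP protocol number 6 is TCP
        raise ValueError("quoted datagram is not TCP")
    if len(inner) < ihl + 8:
        raise ValueError("truncated TCP header")

    # The first 64 bits of a TCP header: source port, destination port,
    # and sequence number, all in network byte order.
    src_port, dst_port, seq = struct.unpack("!HHI", inner[ihl:ihl + 8])
    return src_port, dst_port, seq
```

   The returned port pair, together with the addresses in the quoted IP
   header, identifies the connection; the sequence number is what step
   (6) of the algorithm in Section 4.2 compares against SND.UNA.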

249	   A connectivity disruption indication in the form of an ICMP
250	   unreachable message associated with a presumably lost TCP segment
251	   provides strong evidence that the segment was not dropped due to
252	   congestion but instead was successfully delivered to the temporary
253	   end-point of the employed path, i.e., the reporting router.  The
254	   segment therefore did not witness any congestion, at least on that
255	   part of the path which was traveled by both the TCP segment eliciting
256	   the ICMP unreachable message and the ICMP unreachable message itself.

258	4.  Connectivity Disruption Reaction

260	   In Section 4.1 the basic idea of the algorithm is given.  The
261	   complete algorithm is specified in Section 4.2.  In Section 4.3 the
262	   algorithm is discussed in detail.

264	4.1.  Basic Idea

266	   The goal of the algorithm is the prompt detection of when
267	   connectivity to a previously disconnected peer node has been restored
268	   after a long connectivity disruption while retaining appropriate
269	   behavior in case of congestion.  The proposed algorithm exploits
270	   standard ICMP unreachable messages to increase the TCP's
271	   retransmission frequency during timeout-based loss recovery by
272	   undoing one retransmission timer backoff whenever an ICMP unreachable
273	   message reports on a presumably lost retransmission.

275	   This approach has the advantage of appropriately reducing the probing
276	   rate in case of congestion.  If either the (re-)transmission itself,
277	   or the corresponding ICMP message is dropped the conventional backoff
278	   is performed and not undone, effectively halving the probing rate.

280	4.2.  The Algorithm

282	   A TCP sender using RFC 2988 [RFC2988] to compute TCP's retransmission
283	   timer MAY employ the following scheme to avoid over-conservative
284	   backoffs of the retransmission timer in case of long connectivity
285	   disruptions.  If a TCP sender does implement the scheme, the
286	   following steps MUST be taken, but only upon initiation of a timeout-
287	   based loss recovery, i.e., upon the first timeout of the oldest
288	   outstanding segment (SND.UNA).  The algorithm MUST NOT be re-
289	   initiated after a timeout-based loss recovery has already been
290	   started but not completed.  In particular, it must not be re-
291	   initiated upon subsequent timeouts for the same segment.

293	   A TCP sender that does not employ RFC 2988 [RFC2988] to compute TCP's
294	   retransmission timer SHOULD NOT use the scheme.  We envision that the
295	   scheme could be easily adapted to other algorithms than RFC 2988.
296	   However, we leave this as future work.

298	   The scheme specified in this document uses the "Backoff_cnt"
299	   variable, whose initial value is zero.  The variable is used to count
300	   the number of performed retransmission timer backoffs during one
301	   timeout-based loss recovery.  Moreover, the "RTO_base" variable is
302	   used to recover the previous RTO in case the retransmission timer
303	   backoff was unnecessary.  The variable is initialized with the RTO
304	   upon initiation of timeout-based loss recovery.

306	   (1)  Before the variable RTO gets updated when timeout-based loss
307	        recovery is initiated, set the variable "Backoff_cnt" and the
308	        variable "RTO_base" as follows:

310	           Backoff_cnt := 0;
311	           RTO_base := RTO.

313	        Proceed to step (R).

315	   (R)  This is a placeholder for the behavior that a standard TCP must
316	        execute at this point in case the retransmission timer has
317	        expired.  In particular, if RFC 2988 [RFC2988] is used, steps
318	        (5.4) - (5.6) of that algorithm go here.  Proceed to step (2).

320	   (2)  If the retransmission timer was backed off in the previous step
321	        (R), then increment the variable "Backoff_cnt" by one to account
322	        for the new backoff

324	           Backoff_cnt := Backoff_cnt + 1.

326	   (3)  Wait either

328	           for the expiration of the retransmission timer.  When the
329	           retransmission timer expires, proceed to step (R);

331	           or for the arrival of an acceptable ACK.  When an acceptable
332	           ACK arrives, proceed to step (A);

334	           or for the arrival of an ICMP unreachable message.  When the
335	           ICMP unreachable message ICMP_DU arrives, proceed to step
336	           (4).

338	   (4)  If "Backoff_cnt > 0", i.e., an undoing of the last
339	        retransmission timer backoff is allowed, then

341	           proceed to step (5);

343	        else

345	           proceed to step (3).

347	   (5)  Extract the TCP segment header included in the ICMP destination
348	        unreachable message ICMP_DU

350	           SEG := Extract(ICMP_DU).

352	   (6)  If "SEG.SEQ == SND.UNA", i.e., the ICMP unreachable message
353	        ICMP_DU reports on the oldest outstanding segment, then undo the
354	        last retransmission timer backoff

356	           Backoff_cnt := Backoff_cnt - 1;
357	           RTO := RTO_base * 2^(Backoff_cnt).

359	   (7)  If the retransmission timer expires due to the undoing in the
360	        previous step (6), then

362	           proceed to step (R);

364	        else

366	           proceed to step (3).

368	   (A)  This is a placeholder for the standard TCP behavior that must be
369	        executed at this point in the case an acceptable ACK has
370	        arrived.  No further processing.

372	   When a TCP in steady-state detects a segment loss using the
373	   retransmission timer it enters the timeout-based loss recovery and
374	   initiates the algorithm (step 1).  It adjusts the slow start
375	   threshold (ssthresh), sets the congestion window (CWND) to one
376	   segment, backs off the retransmission timer and retransmits the first
377	   unacknowledged segment (step R) [I-D.ietf-tcpm-rfc2581bis] [RFC2988].

379	   In case the retransmission timer expires again (step 3a) a TCP will
380	   repeat the retransmission of the first unacknowledged segment and
381	   back off the retransmission timer once more (step R).  If a maximum
382	   value is placed on the RTO (rule 2.5 in [RFC2988]) and that maximum
383	   value is already reached the TCP will not back off the retransmission
384	   timer in this step and thus "Backoff_cnt" MUST NOT be incremented.
385	   However, the "last step" to reach this maximum RTO is still
386	   considered as a backoff in the scope of this algorithm and
387	   "Backoff_cnt" MUST be incremented, even if the RTO is not strictly
388	   doubled.

390	   If the first received packet after the retransmission(s) is an
391	   acceptable ACK (step 3b), a TCP will proceed as normal, i.e., slow
392	   start the connection and terminate the algorithm (step A).  Later
393	   ICMP unreachable messages from the just terminated timeout-based loss
394	   recovery are of no use and therefore ignored since the ACK clock is
395	   already restarting due to the successful retransmission.

397	   On the other hand, if the first received packet after the
398	   retransmission(s) is an ICMP unreachable message (step 3c), a TCP
399	   SHOULD, if allowed (step 4), undo one backoff for each ICMP
400	   unreachable message reporting an error on a retransmission.  To
401	   decide if an ICMP unreachable message reports on a retransmission,
402	   the sequence number therein is exploited (step 5, step 6).  The undo
403	   is done by re-calculating the RTO with the previously reduced
404	   "Backoff_cnt".  This calculation explicitly matches the exponential
405	   backoff specified in [RFC2988] (rule 5.5).

407	   Upon receipt of an ICMP unreachable message which legitimately undoes
408	   one backoff there is the possibility that the newly started
409	   retransmission timer has expired already (step 7).  Then, a TCP
410	   SHOULD retransmit immediately, i.e., an ICMP message clocked
411	   retransmission.  In case the newly started retransmission timer has
412	   not expired yet, TCP MUST wait accordingly.
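
   The steps of the algorithm can be sketched as an event-driven state
   object.  This is a minimal sketch under stated assumptions, not a
   definitive implementation: the class name "LcdState", its method
   names, the "elapsed" parameter (time since the retransmission timer
   was last started), and the default maximum RTO of 60s are all
   assumptions of this sketch.  The actual retransmission, the
   congestion control actions, and the timer itself are left to the
   caller.

```python
class LcdState:
    """Tracks Backoff_cnt and RTO_base during timeout-based loss recovery."""

    def __init__(self, rto, max_rto=60.0):
        self.rto = rto
        self.max_rto = max_rto
        self.backoff_cnt = 0
        self.rto_base = rto
        self.in_recovery = False

    def on_timeout(self):
        """Steps (1), (R) and (2): the retransmission timer expired."""
        if not self.in_recovery:
            # Step (1): only upon initiation of timeout-based loss recovery.
            self.in_recovery = True
            self.backoff_cnt = 0
            self.rto_base = self.rto
        # Step (R): standard behavior; only the backoff is modeled here.
        backed_off = self.rto < self.max_rto
        self.rto = min(self.rto * 2, self.max_rto)
        # Step (2): count the backoff. The last step up to the maximum
        # RTO counts; a timeout at the maximum RTO does not.
        if backed_off:
            self.backoff_cnt += 1

    def on_icmp_unreachable(self, seg_seq, snd_una, elapsed):
        """Steps (4)-(7). Returns True if the caller should treat the
        timer as expired and proceed to step (R) immediately."""
        if not self.in_recovery or self.backoff_cnt == 0:
            return False                      # step (4): nothing to undo
        if seg_seq != snd_una:
            return False                      # step (6): not the oldest segment
        self.backoff_cnt -= 1                 # step (6): undo one backoff
        self.rto = self.rto_base * 2 ** self.backoff_cnt
        return elapsed >= self.rto            # step (7)

    def on_acceptable_ack(self):
        """Step (A): recovery ends; standard TCP processing continues."""
        self.in_recovery = False


state = LcdState(rto=1.0)
state.on_timeout()                                   # RTO 2.0, Backoff_cnt 1
print(state.on_icmp_unreachable(1000, 1000, 0.1))    # prints False
print(state.rto, state.backoff_cnt)                  # prints 1.0 0
```

   In this example the ICMP unreachable message arrives 0.1s after the
   retransmission, so the backoff is undone and the sender waits for
   the remainder of the reverted RTO of 1s instead of the backed-off
   2s.  As long as every retransmission elicits such a message, the RTO
   in effect alternates between 2s and 1s rather than growing
   exponentially.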

414	4.3.  Discussion

416	   It is important to note that the proposed algorithm only reacts to
417	   connectivity disruption indications in the form of ICMP destination
418	   unreachable messages during the phase of RTO induced loss recovery.
419	   That is, TCP's behavior is not altered when no ICMP unreachable
420	   messages are received, or the retransmission timer of the TCP sender
421	   did not yet expire since the last successfully received ACK.  Thereby
422	   the algorithm is by definition only triggered in the case of long
423	   connectivity disruptions.

425	   Only those ICMP unreachable messages which report on the
426	   sequence number of the retransmission (SND.UNA) are evaluated by the
427	   proposed algorithm.  All other ICMP unreachable messages are ignored.
428	   If an ICMP unreachable message arrives for a retransmission it
429	   provides evidence that neither the retransmission nor the
430	   corresponding ICMP unreachable message itself did experience any
431	   congestion.  In other words, it has been proved that the
432	   retransmission was not lost due to congestion, but due to a
433	   connectivity disruption instead.

435	   One could argue that if an ICMP unreachable message arrives for an
436	   RTO induced retransmission, the RTO should be reset, and the next
437	   retransmission sent out immediately, similar to what is done when an
438	   ACK arrives after an RTO induced recovery phase.  This would allow
439	   for a much higher probing frequency based on the round trip time to
440	   the router where the connectivity is disrupted.  However, we consider
441	   our proposed scheme a good trade-off between conservative behavior
442	   and a fast detection of connectivity re-establishment.

444	   Of course there is an ambiguity on which (re-)transmission an ICMP
445	   unreachable message reports.  However, for our purposes it is not
446	   considered to be a problem, because the assumption that such an ICMP
447	   message provides evidence that one link loss was wrongly considered
448	   as a congestion loss still holds.  There is also the option to make
449	   use of the Timestamps option to obtain a stricter mapping between
450	   segments and ICMP messages (see Section 4.4).

452	   Besides the ambiguity of whether the first unacknowledged sequence
453	   number refers to the original transmission or to any of the
454	   retransmissions, there is another source of ambiguity about the
455	   sequence numbers contained in the ICMP unreachable messages.  For
456	   high bandwidth paths like modern gigabit links the sequence space may
457	   wrap rather quickly, thereby allowing the possibility that a late
458	   ICMP unreachable message reporting on an old error may coincidentally
459	   fit as input in the scheme explained above.  As a result, the scheme
460	   would wrongly undo one backoff.  Chances for this to happen are
461	   minuscule, since a particular ICMP message would need to contain the
462	   exact sequence number of SND.UNA, while at the same time TCP is
463	   coincidentally in timeout-based loss recovery.  Moreover, as the
464	   scheme is tailored most conservatively, no threat to the network can
465	   arise from this issue.
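
   How quickly the sequence space can wrap is a matter of simple
   arithmetic: TCP sequence numbers are 32 bits wide and count bytes,
   so the space covers 2^32 bytes.  The sketch below computes the
   minimum wrap time under the assumption that the sender transmits
   continuously at the full link rate; the function name
   "seq_wrap_time" is hypothetical.

```python
SEQ_SPACE_BYTES = 2 ** 32  # 32-bit sequence numbers, one per byte


def seq_wrap_time(bandwidth_bps):
    """Seconds until the sequence space wraps at a sustained rate
    of bandwidth_bps bits per second."""
    return SEQ_SPACE_BYTES * 8 / bandwidth_bps


# At a sustained 1 Gbit/s the sequence space wraps in roughly 34 seconds.
print(round(seq_wrap_time(1e9), 1))  # prints 34.4
```

   A wrap time of about 34 seconds is shorter than the maximum RTO of
   60s mentioned in Section 2, which is why a late ICMP unreachable
   message could in principle match SND.UNA by coincidence.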

467	   Finally, the scheme explicitly does not call for a differentiation of
468	   ICMP unreachable messages originating from different routers, as the
469	   evidence of no congestion still holds even if the reporting router
470	   changed.

472	   Another exploitation of ICMP unreachable messages in the context of
473	   TCP congestion control might seem appropriate in case the ICMP
474	   unreachable message is received while TCP is in steady-state and the
475	   message refers to a segment from within the current window of data.
476	   As the RTT up to the router which generates the ICMP unreachable
477	   message is likely to be substantially shorter than the overall RTT to
478	   the destination, the ICMP unreachable message may very well reach the
479	   originating TCP while it is transmitting the current window of data.
480	   In case the remaining window is large, it might seem appropriate to
481	   refrain from transmitting the remaining window as there is timely
482	   evidence that it will only trigger further ICMP unreachable messages
483	   at that very router.  Although this might seem appropriate from a
484	   wastage perspective, it may be counterproductive from a security
485	   perspective since ICMP messages are easy to spoof, thereby allowing
486	   an easy attack on TCP by simply forging such ICMP messages.

488	   An additional consideration is the following: in the presence of
489	   multi-path routing even the receipt of a legitimate ICMP unreachable
490	   message cannot be exploited accurately because it is possible
491	   that only one of the multiple paths to the destination is suffering
492	   from a connectivity disruption which causes ICMP unreachable messages
493	   to be sent.  Then however, there is the possibility that the path
494	   along which the connectivity disruption occurred contributed
495	   considerably to the overall bandwidth, such that a congestion
496	   response may well be reasonable.  However, this is not necessarily
497	   the case.  Therefore, a TCP has no means except for its inherent
498	   congestion control to decide on this matter.  All in all, it seems
499	   that for a connection in steady-state, i.e., not in RTO induced
500	   recovery, reacting to ICMP unreachable messages in regard to
501	   congestion control is not appropriate.  For the case of RTO-based
502	   retransmissions, however, there is a reasonable congestion response,
503	   which is skipping further backoffs of the retransmission timer
504	   because there is no congestion indication - as described above.

506	4.4.  Protecting Against Misbehaving Routers (the Safe Variant)

508	   Given that the TCP Timestamps option [I-D.ietf-tcpm-1323bis] is
509	   enabled for a connection, a TCP sender MAY use the following
510	   algorithm to protect against misbehaving routers.

512	5.  Related Work

514	   In the literature there are several methods that address TCP's
515	   problems in the presence of connectivity disruptions.  Some of them
516	   try to improve TCP's performance by modifying lower layers.  For
518	   example, [SM03] introduces a "smart link layer" that buffers one
519	   segment for each ongoing connection and replays these segments on
520	   connectivity re-establishment.  This approach has a serious drawback:
521	   previously stateless intermediate routers have to be modified in
522	   order to inspect TCP headers, to track the end-to-end connection, and
523	   to provide additional buffer space.  All in all, this leads to an
524	   additional need for memory and processing power.

526	   On the other hand, stateless link layer schemes, as proposed in
527	   [RFC3819], which unconditionally buffer some small number of packets
528	   may have another problem: if a packet is buffered longer than the
529	   maximum segment lifetime (MSL) of 2 min [RFC0793], i.e., the
530	   disconnection lasts longer than MSL, TCP's assumption that such
531	   segments will never be received will no longer be true, violating
532	   TCP's semantics [I-D.eggert-tcpm-tcp-retransmit-now].
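
   The MSL condition described above can be illustrated with a minimal
   sketch.  It is not taken from [RFC3819] or from any cited scheme; the
   class name, the capacity of four packets, and the explicit "now"
   timestamps are assumptions made for illustration only.  The sketch
   shows a buffer that, on connectivity re-establishment, releases only
   packets that were held for less than the MSL.

```python
from collections import deque

MSL_SECONDS = 120.0  # maximum segment lifetime of 2 min, per RFC 793


class BoundedLifetimeBuffer:
    """Hypothetical link-layer buffer that never releases packets held >= MSL."""

    def __init__(self, capacity=4, max_age=MSL_SECONDS):
        self.max_age = max_age
        # A bounded deque: when full, appending discards the oldest entry.
        self.queue = deque(maxlen=capacity)

    def enqueue(self, packet, now):
        """Store a packet together with the time it entered the buffer."""
        self.queue.append((now, packet))

    def flush(self, now):
        """Return packets buffered for less than max_age and empty the buffer."""
        fresh = [pkt for (t, pkt) in self.queue if now - t < self.max_age]
        self.queue.clear()
        return fresh


# Usage: the first packet has been held for 130 s and is discarded,
# the second has been held for 30 s and is released.
buf = BoundedLifetimeBuffer()
buf.enqueue(b"seg1", now=0.0)
buf.enqueue(b"seg2", now=100.0)
released = buf.flush(now=130.0)
```

   The sketch measures only the time a packet spends in this buffer, so
   it gives an optimistic bound on the age of the segment in the
   network.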

534	   Other approaches like TCP-F [CRVP01] or the Explicit Link Failure
535	   Notification (ELFN) [HV02] inform the TCP sender about a disrupted
536	   path by special messages generated by intermediate routers.  In case
537	   of a link failure, the sender stops sending segments and freezes
538	   TCP's retransmission timers.  TCP-F stays in this state and remains
539	   silent until either a "route establishment notification" is received
540	   or an internal timer expires.  In contrast, ELFN periodically probes
541	   the network to detect connectivity re-establishment.  Both proposals
542	   rely on changes to intermediate routers, whereas the scheme proposed
543	   in this document is a sender-only modification.  Moreover, ELFN does
544	   not consider congestion and may impose serious additional load on the
545	   network, depending on the probe interval.

547	   The authors of ATCP [LS01] propose enhancements to identify different
548	   types of packet loss by introducing a layer between TCP and IP.  They
549	   utilize ICMP destination unreachable messages to set TCP's receiver
550	   advertised window to zero, thereby forcing the TCP sender to perform
551	   zero window probing with an exponential backoff.  ICMP destination
552	   unreachable messages that arrive during this probing period are
553	   ignored.  This approach is nearly orthogonal to this document, which
554	   exploits ICMP messages to undo a retransmission timer backoff when
555	   TCP is already probing.  In principle, both mechanisms could be
556	   combined; however, due to security considerations, it does not seem
557	   appropriate to adopt ATCP's reaction, as discussed in Section 4.3.

559	   Schuetz et al. describe in [I-D.schuetz-tcpm-tcp-rlci] a set of TCP
560	   extensions that improve TCP's behavior when transmitting over paths
561	   whose characteristics can change on short time-scales.  Their
562	   proposed extensions modify the local behavior of TCP and introduce a
563	   new TCP option to signal locally received connectivity-change
564	   indications (CCIs) to remote peers.  Upon reception of a CCI, they
565	   re-probe the path characteristics either by performing a speculative
566	   retransmission or by sending a single segment of new data, depending
567	   on whether the connection is currently stalled in exponential backoff
568	   or transmitting in steady-state, respectively.  The authors focus on
569	   specifying TCP response mechanisms; nevertheless, underlying layers
570	   would have to be modified to explicitly send CCIs to make these
571	   immediate responses possible.

573	6.  IANA Considerations

575	   This memo includes no request to IANA.

577	7.  Security Considerations

579	   The proposed algorithm is considered to be secure.  For example, an
580	   attacker cannot make a TCP modified with the proposed scheme flood
581	   the network just by sending forged ICMP unreachable messages in an
582	   attempt to maliciously shorten the retransmission timer.  An attacker
583	   would need to guess the correct sequence number of the current
584	   retransmission, which seems very unlikely.  Even in the case of an
585	   omniscient attacker, the impact on network load would be low, since
586	   the retransmission frequency is limited by the RTO that was computed
587	   before TCP entered timeout-based loss recovery.  (The highest
588	   probing frequency is expected to be even lower than once per minimum
589	   RTO, that is, 1 s as specified by [RFC2988].)
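
   The sequence number check mentioned above might look like the
   following sketch.  It assumes IPv4 and relies on the fact that an
   ICMP error message quotes the IP header plus at least the first 8
   bytes of the original datagram [RFC0792], which for TCP cover the
   source port, the destination port, and the sequence number.  The
   function names and the "conn" dictionary with its keys are
   hypothetical and are not part of the algorithm specified in this
   document.

```python
import struct

IPPROTO_TCP = 6


def quoted_tcp_fields(icmp_payload):
    """Return (src_port, dst_port, seq) of the TCP segment quoted in an ICMP error.

    icmp_payload is assumed to start at the quoted IPv4 header.
    Returns None if the quote is too short or does not carry TCP.
    """
    if len(icmp_payload) < 20:
        return None
    ihl = (icmp_payload[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or len(icmp_payload) < ihl + 8:
        return None
    if icmp_payload[9] != IPPROTO_TCP:  # protocol field of the quoted header
        return None
    return struct.unpack("!HHI", icmp_payload[ihl:ihl + 8])


def icmp_matches_retransmission(icmp_payload, conn):
    """Accept the message only if it quotes the segment currently retransmitted."""
    fields = quoted_tcp_fields(icmp_payload)
    if fields is None:
        return False
    src_port, dst_port, seq = fields
    return (src_port == conn["local_port"]
            and dst_port == conn["remote_port"]
            and seq == conn["retransmitted_seq"])


# Usage with a hand-built quote: 20-byte IPv4 header (protocol 6)
# followed by the first 8 bytes of a TCP header.
ip_header = bytes([0x45]) + bytes(8) + bytes([IPPROTO_TCP]) + bytes(10)
conn = {"local_port": 40000, "remote_port": 80, "retransmitted_seq": 123456}
genuine = ip_header + struct.pack("!HHI", 40000, 80, 123456)
forged = ip_header + struct.pack("!HHI", 40000, 80, 999)
accepted = icmp_matches_retransmission(genuine, conn)
rejected = icmp_matches_retransmission(forged, conn)
```

   With an exact match on the 32-bit sequence number, an off-path
   attacker who also knows both port numbers still has to hit one
   specific value per forged message.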

591	8.  Acknowledgments

593	   We would like to thank Timothy Shepard and Joe Touch for feedback on
594	   earlier versions of this draft.  We also thank Michael Faber, Daniel
595	   Schaffrath, and Damian Lukowski for implementing and testing the
596	   algorithm in Linux.  Special thanks go to Ilpo Jarvinen, who gave
597	   valuable feedback regarding the Linux implementation.

599	   This document was written with the xml2rfc tool described in
600	   [RFC2629].

602	9.  References

604	9.1.  Normative References

606	   [I-D.ietf-tcpm-1323bis]
607	              Borman, D., Braden, R., and V. Jacobson, "TCP Extensions
608	              for High Performance", draft-ietf-tcpm-1323bis-01 (work in
609	              progress), March 2009.

611	   [I-D.ietf-tcpm-rfc2581bis]
612	              Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
613	              Control", draft-ietf-tcpm-rfc2581bis-07 (work in
614	              progress), July 2009.

616	   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
617	              RFC 792, September 1981.

619	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
620	              RFC 793, September 1981.

622	   [RFC1812]  Baker, F., "Requirements for IP Version 4 Routers",
623	              RFC 1812, June 1995.

625	   [RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission
626	              Timer", RFC 2988, November 2000.

628	   [RFC4443]  Conta, A., Deering, S., and M. Gupta, "Internet Control
629	              Message Protocol (ICMPv6) for the Internet Protocol
630	              Version 6 (IPv6) Specification", RFC 4443, March 2006.

632	9.2.  Informative References

634	   [CRVP01]   Chandran, K., Raghunathan, S., Venkatesan, S., and R.
635	              Prakash, "A feedback-based scheme for improving TCP
636	              performance in ad hoc wireless networks", IEEE Personal
637	              Communications vol. 8, no. 1, pp. 34-39, February 2001.

639	   [HV02]     Holland, G. and N. Vaidya, "Analysis of TCP performance
640	              over mobile ad hoc networks", Wireless Networks vol. 8,
641	              no. 2-3, pp. 275-288, March 2002.

643	   [I-D.eggert-tcpm-tcp-retransmit-now]
644	              Eggert, L., "TCP Extensions for Immediate
645	              Retransmissions", draft-eggert-tcpm-tcp-retransmit-now-02
646	              (work in progress), June 2005.

648	   [I-D.ietf-tcpm-rfc4138bis]
649	              Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
650	              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
651	              Spurious Retransmission Timeouts with TCP",
652	              draft-ietf-tcpm-rfc4138bis-04 (work in progress),
653	              October 2008.

655	   [I-D.schuetz-tcpm-tcp-rlci]
656	              Schuetz, S., Koutsianas, N., Eggert, L., Eddy, W., Swami,
657	              Y., and K. Le, "TCP Response to Lower-Layer Connectivity-
658	              Change Indications", draft-schuetz-tcpm-tcp-rlci-03 (work
659	              in progress), February 2008.

661	   [LS01]     Liu, J. and S. Singh, "ATCP: TCP for mobile ad hoc
662	              networks", IEEE Journal on Selected Areas in
663	              Communications vol. 19, no. 7, pp. 1300-1315, July 2001.

665	   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
666	              September 1981.

668	   [RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or
669	              converting network protocol addresses to 48.bit Ethernet
670	              address for transmission on Ethernet hardware", STD 37,
671	              RFC 826, November 1982.

673	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
674	              Communication Layers", STD 3, RFC 1122, October 1989.

676	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
677	              Requirement Levels", BCP 14, RFC 2119, March 1997.

679	   [RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
680	              June 1999.

682	   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
683	              for TCP", RFC 3522, April 2003.

685	   [RFC3819]  Karn, P., Bormann, C., Fairhurst, G., Grossman, D.,
686	              Ludwig, R., Mahdavi, J., Montenegro, G., Touch, J., and L.
687	              Wood, "Advice for Internet Subnetwork Designers", BCP 89,
688	              RFC 3819, July 2004.

690	   [RFC4884]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro,
691	              "Extended ICMP to Support Multi-Part Messages", RFC 4884,
692	              April 2007.

694	   [SESB05]   Schuetz, S., Eggert, L., Schmid, S., and M. Brunner,
695	              "Protocol enhancements for intermittently connected
696	              hosts", SIGCOMM Computer Communication Review vol. 35, no.
697	              3, pp. 5-18, December 2005.

699	   [SM03]     Scott, J. and G. Mapp, "Link layer-based TCP optimisation
700	              for disconnecting networks", SIGCOMM Computer
701	              Communication Review vol. 33, no. 5, pp. 31-42,
702	              October 2003.

704	Appendix A.  TODO list

706	   o  Extend the Security Sections 4.4 and 7.

708	   o  Extend the discussion in Section 4.3:

710	      *  ICMPv6.  See [RFC4443] and [RFC4884].

712	      *  Explicit Congestion Notification (ECN).

714	      *  More about congestion in general.

716	   o  Mention the possible side-effect on TCP implementations that
717	      measure the thresholds R1 and R2 (Section 4.2.3.5 of [RFC1122]) as
718	      a count of retransmissions instead of time units.

720	   o  Discuss the influence of packet duplication on the algorithm
721	      (Thanks to Ilpo).

723	Appendix B.  Changes from previous versions of the draft

725	B.1.  Changes from draft-zimmermann-tcp-lcd-01

727	   o  The algorithm in Section 4.2 was slightly changed.  Instead of
728	      reverting the RTO by halving it, it is recalculated with the help
729	      of the "Backoff_cnt" variable.  This fixes an issue that occurred
730	      when the retransmission timer was backed off but bounded by a
731	      maximum value.  The algorithm in the previous version of the draft
732	      would have "reverted" to half of that maximum value instead of
733	      using the value the RTO had before it was doubled (and then
734	      bounded).

736	   o  Miscellaneous editorial changes.

738	   o  Extended the TODO list (Appendix A).
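
   The difference between reverting by halving and recalculating from
   "Backoff_cnt", as described in the first item above, can be shown
   with a small worked example.  This is a sketch under stated
   assumptions and not the algorithm text of Section 4.2: the function
   names, the base RTO of 12 s, and the 60 s upper bound are chosen for
   illustration only.

```python
RTO_MAX = 60.0  # assumed upper bound on the RTO, in seconds


def backed_off_rto(rto_base, backoff_cnt, rto_max=RTO_MAX):
    """RTO after backoff_cnt doublings of rto_base, bounded by rto_max."""
    return min(rto_base * (2 ** backoff_cnt), rto_max)


def revert_by_halving(current_rto):
    """Previous approach: halve the current, possibly bounded, RTO."""
    return current_rto / 2.0


def revert_by_recalculation(rto_base, backoff_cnt, rto_max=RTO_MAX):
    """Current approach: undo one backoff and recompute the RTO.

    The counter never drops below zero, so it is not possible to
    perform more reverts than backoffs.
    """
    backoff_cnt = max(backoff_cnt - 1, 0)
    return backed_off_rto(rto_base, backoff_cnt, rto_max), backoff_cnt


# Base RTO 12 s and three backoffs: 24 s, 48 s, then 96 s bounded to 60 s.
rto_base, backoff_cnt = 12.0, 3
current = backed_off_rto(rto_base, backoff_cnt)                      # 60.0
halved = revert_by_halving(current)                                  # 30.0
recalculated, cnt = revert_by_recalculation(rto_base, backoff_cnt)   # 48.0, 2
```

   Halving yields 30 s, a value the timer never had, whereas the
   recalculation returns to 48 s, the value in effect before the last
   doubling.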

740	B.2.  Changes from draft-zimmermann-tcp-lcd-00

742	   o  Miscellaneous editorial changes in Sections 1, 2, and 3.

744	   o  The document was restructured in Sections 1, 2, and 3 for easier
745	      reading.  The motivation for the algorithm now reflects TCP's
746	      problem of disambiguating congestion from non-congestion loss.

748	   o  Added Section 4.1.

750	   o  The algorithm in Section 4.2 was restructured and simplified:

752	      *  The special case of the first received ICMP destination
753	         unreachable message after an RTO was removed.

755	      *  The "Backoff_cnt" variable was introduced so it is no longer
756	         possible to perform more reverts than backoffs.

758	   o  The discussion in Section 4.3 was improved and expanded according
759	      to the algorithm changes.

761	   o  Added Section 4.4.

763	Authors' Addresses

765	   Alexander Zimmermann
766	   RWTH Aachen University
767	   Ahornstrasse 55
768	   Aachen,   52074
769	   Germany

771	   Phone: +49 241 80 21422
772	   Email: zimmermann@cs.rwth-aachen.de

774	   Arnd Hannemann
775	   RWTH Aachen University
776	   Ahornstrasse 55
777	   Aachen,   52074
778	   Germany

780	   Phone: +49 241 80 21423
781	   Email: hannemann@nets.rwth-aachen.de