Network Working Group                                      Reiner Ludwig
INTERNET-DRAFT                                             Michael Meyer
Expires: January 2003                                  Ericsson Research
                                                              July, 2002


                 The Eifel Detection Algorithm for TCP
                <draft-ietf-tsvwg-tcp-eifel-alg-04.txt>


Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/lid-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html


Abstract

   The Eifel detection algorithm allows the TCP sender to detect a
   posteriori whether it has entered loss recovery unnecessarily. It
   already determines this in the beginning of loss recovery when the
   first acceptable ACK arrives after the timeout-based retransmit or
   fast retransmit has been sent. The algorithm requires that the TCP
   Timestamps option defined in RFC1323 is enabled for a connection. The
   idea is to use the timestamps echoed in returning ACKs to eliminate
   the retransmission ambiguity in TCP. The Eifel detection algorithm
   provides a basis for future TCP enhancements. This includes response
   algorithms to back out of loss recovery by restoring a TCP sender's
   congestion control state.


Ludwig & Meyer                                                  [Page 1]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


Terminology

   The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
   SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in this
   document, are to be interpreted as described in [RFC2119].

   We refer to the first-time transmission of an octet as the 'original
   transmit'. A subsequent transmission of the same octet is referred to
   as a 'retransmit'. In most cases this terminology can likewise be
   applied to "data segments" as opposed to "octets". However, with
   repacketization a segment can contain both first-time transmissions
   and retransmissions of octets. In that case this terminology is only
   consistent when applied to "octets". For the Eifel detection
   algorithm this makes no difference as it also operates correctly when
   repacketization occurs.

   We use the term 'acceptable ACK' as defined in [RFC793]. That is an
   ACK that acknowledges previously unacknowledged data. We use the term
   'duplicate ACK', and the variable 'dupacks' as defined in [WS95]. The
   variable 'dupacks' is a counter of duplicate ACKs that have already
   been received by the TCP sender before the fast retransmit is sent.
   We use the variable 'DupThresh' to refer to the so-called duplicate
   acknowledgement threshold, i.e., the number of duplicate ACKs that
   need to arrive at the TCP sender to trigger a fast retransmit.
   Currently, DupThresh is specified as a fixed value of three
   [RFC2581]. Future TCPs might implement an adaptive DupThresh.


1. Introduction

   The retransmission ambiguity problem [KP87] is the TCP sender's
   inability to distinguish whether the first acceptable ACK that
   arrives after a retransmit, was sent in response to the original
   transmit or the retransmit. This problem occurs after a timeout-based
   retransmit and after a fast retransmit. The Eifel detection algorithm
   uses the TCP Timestamps option defined in [RFC1323] to eliminate the
   retransmission ambiguity. It thereby allows the TCP sender to detect
   a posteriori whether it has entered loss recovery unnecessarily.

   This added capability of the TCP sender is useful in environments
   where TCP's loss recovery and congestion control algorithms may often
   get falsely triggered. This can be caused by packet reordering,
   packet duplication, the loss of a flight of ACKs, or a sudden delay
   increase in the data or the ACK path that results in a spurious
   timeout. For example, such sudden delay increases can often occur in
   wide-area wireless access networks due to handovers, resource
   preemption due to higher priority traffic (e.g., voice), or because
   the mobile transmitter traverses through a radio coverage hole (e.g.,
   see [Gu01]). In such wireless networks, the often unnecessary
   go-back-N retransmits that typically occur after a spurious timeout
   create a serious problem. They decrease end-to-end throughput, are


Ludwig & Meyer                                                  [Page 2]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


   useless load upon a potentially congested network, and waste
   transmission (battery) power.

   Based on the Eifel detection algorithm, the TCP sender may then
   choose to implement dedicated response algorithms. One goal of such a
   response algorithm would be to alleviate the consequences of a
   falsely triggered loss recovery. This may include restoring the TCP
   sender's congestion control state, and avoiding the mentioned
   unnecessary go-back-N retransmits. Another goal would be to adapt
   protocol parameters such as the duplicate acknowledgement threshold
   [RFC2581], and the RTT estimators [RFC2988]. This is to reduce the
   risk of falsely triggering TCP's loss recovery again as the
   connection progresses. However, such response algorithms are outside
   the scope of this document. Note: The original proposal, the "Eifel
   algorithm" [LK00], comprises both a detection and a response
   algorithm. This document only defines the detection part. The
   response part is defined in [LG02].

   A key feature of the Eifel detection algorithm is that it already
   detects upon the first acceptable ACK that arrives during loss
   recovery whether a fast retransmit or a timeout was spurious. This is
   crucial to be able to avoid the mentioned go-back-N retransmits.
   Another feature is that the Eifel detection algorithm is fairly
   robust against the loss of ACKs.


2. Events that Falsely Trigger TCP Loss Recovery

   The following events falsely trigger a TCP sender's loss recovery and
   congestion control algorithms. This causes a so-called spurious
   retransmit, and an unnecessary reduction of the TCP sender's
   congestion window and slow start threshold [RFC2581].

     - Spurious timeout

     - Packet reordering

     - Packet duplication

   The common understanding of a spurious timeout is a timeout that
   would not have occurred had the sender "waited longer". This may be
   caused by increased delay that suddenly occurs in the data and/or the
   ACK path. That in turn might cause an acceptable ACK to arrive too
   late, i.e., only after the TCP sender's retransmission timer has
   expired. For the purpose of specifying the algorithm in Section 3, we
   define this case as SPUR_TO (equal 1). However, in this document, we
   stretch the definition of a spurious timeout. We also speak of a
   spurious timeout when the timeout occurred because an entire flight
   of ACKs was lost while in fact the original transmit of the oldest
   outstanding segment had arrived at the TCP receiver. This case is
   also considered to fall under the definition of SPUR_TO since a TCP


Ludwig & Meyer                                                  [Page 3]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


   sender is not able to distinguish it from the other case mentioned
   above. Such a timeout is certainly unavoidable, i.e., it would not
   have made a difference if the TCP sender had "waited longer". Still,
   the triggering of loss recovery was spurious since that original
   transmit was not lost, i.e., there had been no congestion in the data
   path. Hence, we use the term 'spurious timeout' in the sense that
   initiating the associated loss recovery was unnecessary. Yet, we
   exclude from the definition of a spurious timeout the case where the
   retransmit arrives at the TCP receiver before the corresponding
   original transmit.

        Note: There is another case where a timeout would not have
        occurred had the sender "waited longer": the retransmission
        timer expires, and afterwards the TCP sender receives the
        duplicate ACK that would have triggered a fast retransmit of the
        oldest outstanding segment. We call this a "fast timeout" since
        in competition with the fast retransmit algorithm the timeout
        was faster. However, a fast timeout is not spurious since
        apparently a segment was in fact lost, i.e., loss recovery was
        initiating rightfully. In this document, we do not consider fast
        timeouts. This case is addressed in an independent document
        [Lu02].

   Packet reordering in the network may occur because IP [RFC791] does
   not guarantee in-order delivery of packets. Additionally, a TCP
   receiver generates a duplicate ACK for each segment that arrives out-
   of-order. This results in a spurious fast retransmit if three or more
   data segments arrive out-of-order at the TCP receiver, and at least
   three of the resulting duplicate ACKs arrive at the TCP sender. This
   assumes that the duplicate acknowledgement threshold is set to three
   as defined in [RFC2581].

   Packet duplication may occur because a receiving IP does not (cannot)
   remove packets that have been duplicated in the network. A TCP
   receiver in turn also generates a duplicate ACK for each duplicate
   segment. As with packet reordering, this results in a spurious fast
   retransmit if duplication of data segments or ACKs results in three
   or more duplicate ACKs to arrive at the TCP sender. Again, this
   assumes that the duplicate acknowledgement threshold is set to three.

   The negative impact on TCP performance caused by packet reordering
   and packet duplication is commonly the same: a single spurious
   retransmit (the fast retransmit), and the unnecessary halving of the
   TCP sender's congestion window as a result of the subsequent fast
   recovery phase [RFC2581].

   The negative impact on TCP performance caused by a spurious timeout
   is more severe. First, the timeout event itself causes a single
   spurious retransmit, and unnecessarily forces the TCP sender into
   slow start [RFC2581]. Then, as the connection progresses, a chain
   reaction gets triggered that further decreases TCP's performance.


Ludwig & Meyer                                                  [Page 4]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


   Since the timeout was spurious, at least some ACKs for original
   transmits will typically arrive at the TCP sender before the ACK for
   the retransmit arrives. (This is unless severe packet reordering
   coincided with the spurious timeout in such a way that the ACK for
   the retransmit is the first acceptable ACK to arrive at the TCP
   sender.) Those ACKs for original transmits will then trigger an
   implicit go-back-N loss recovery at the TCP sender. Assuming that
   none of the outstanding segments and none of the corresponding ACKs
   were lost, all outstanding segments will get retransmitted
   unnecessarily. In fact, during this phase the TCP sender breaks
   'packet conservation' [Jac88]. This is because the unnecessary
   go-back-N retransmits are sent during slow start, i.e., for each
   original packet leaving the network, two useless retransmits are sent
   into the network. Moreover, some TCPs will in addition suffer from a
   spurious fast retransmit. This is because the spurious go-back-N
   retransmits will arrive as duplicates at the TCP receiver, which in
   turn triggers a series of duplicate ACKs. Note that this last
   spurious fast retransmit could be avoided with the careful variant of
   'bug fix' [RFC2582].

   More detailed explanations including TCP trace plots that visualize
   the effects of spurious timeouts and packet reordering can be found
   in [LK00].


3. The Eifel Detection Algorithm

3.1 The Idea

   The goal of the Eifel detection algorithm is to allow the TCP sender
   to detect a posteriori whether it has entered loss recovery
   unnecessarily. Furthermore, the TCP sender should be able to make
   this decision upon the first acceptable ACK that arrives after the
   timeout-based retransmit or the fast retransmit has been sent. This
   in turn requires extra information in every ACK by which the TCP
   sender can unambiguously distinguish whether that first acceptable
   ACK was sent in response to the original transmit or the retransmit.
   Such extra information is provided by the TCP Timestamps option
   [RFC1323]. Generally speaking, timestamps are monotonously increasing
   "serial numbers" added into every segment that are then echoed within
   the corresponding ACKs. This is exploited by the Eifel detection
   algorithm in the following way.

   Given that timestamps are enabled for a connection, the TCP sender
   always stores the timestamp of the retransmit sent in the beginning
   of loss recovery, i.e., the timestamp of the timeout-based retransmit
   or the fast retransmit. If the timestamp of the first acceptable ACK,
   arriving after the retransmit was sent, is smaller then the stored
   timestamp of that retransmit, then that ACK must have been sent in
   response to an original transmit. Hence, the TCP sender must have
   entered loss recovery unnecessarily.


Ludwig & Meyer                                                  [Page 5]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


   The fact that the Eifel detection algorithm decides upon the first
   acceptable ACK is crucial to allow future response algorithms to
   avoid the spurious go-back-N retransmits that typically occur after a
   spurious timeout. Also, if loss recovery was entered unnecessarily, a
   window worth of ACKs are outstanding that all carry a timestamp that
   is smaller than the stored timestamp of the retransmit. The arrival
   of any one of those ACKs suffices the Eifel detection algorithm to
   work. Hence, the solution is fairly robust against ACK losses. Even
   the ACK sent in response to the retransmit, i.e., the one that
   carries the stored timestamp, may get lost.

        Note: Also the DSACK option [RFC2883] can be used to detect a
        posteriori whether the TCP sender has entered loss recovery
        unnecessarily. However, the first ACK carrying a DSACK option
        usually arrives at the TCP sender only after loss recovery has
        already terminated. Thus, the DSACK option cannot be used to
        eliminate the retransmission ambiguity. Consequently, it cannot
        be used to avoid the mentioned spurious go-back-N retransmits.
        Moreover, a DSACK-based detection algorithm is less robust
        against ACK losses.


3.2 The Algorithm

   Given that the TCP Timestamps option [RFC1323] is enabled for a
   connection, a TCP sender MAY use the Eifel detection algorithm as
   defined in this subsection.

   If the Eifel detection algorithm is used, the following steps MUST be
   taken by the TCP sender, but only upon initiation of loss recovery,
   i.e., when either the timeout-based retransmit or the fast retransmit
   is sent. The Eifel detection algorithm MUST NOT be reinitiated after
   loss recovery has already started. In particular, it may not be
   reinitiated upon subsequent timeouts for the same segment, and not
   upon retransmitting segments other than the oldest outstanding
   segment, e.g., during selective loss recovery.

      (1)     Set a "RetransmitTS" variable to the value of the
               Timestamp Value field of the Timestamps option included
               in the retransmit sent when loss recovery is initiated. A
               TCP sender must ensure that RetransmitTS does not get
               overwritten as loss recovery progresses, e.g., in case of
               a second timeout and subsequent second retransmit of the
               same octet.

      (2)     Set a "SpuriousRecovery" variable to FALSE.

      (3)     Wait for the arrival of an acceptable ACK. If an
               acceptable ACK has arrived, then proceed to step (4).


Ludwig & Meyer                                                  [Page 6]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


      (4)     If the value of the Timestamp Echo Reply field of the
               acceptable ACK's Timestamps option is smaller than the
               value of the variable RetransmitTS, then proceed to step
               (5),

               else proceed to step (DONE).

      (5)     If the loss recovery has been initiated with a timeout-
               based retransmit, then set
                    SpuriousRecovery <- SPUR_TO,

               else set
                    SpuriousRecovery <- dupacks+1

      (RESP)  Do nothing (Placeholder for a response algorithm).

      (DONE)  No further processing.

   The comparison "smaller than" in step (4) is conservative. In theory,
   when the "timestamp clock" is slow or the network is fast,
   RetransmitTS could at most be equal to the timestamp echoed by an ACK
   sent in response to an original transmit. In that case, it is assumed
   that the loss recovery was not falsely triggered.


3.3 Non-RFC1323-compliant TCP Receivers Mask Certain Spurious Timeouts

   Not all TCP implementations strictly follow RFC1323. There are
   differences concerning which timestamp a TCP receiver echoes when a
   duplicate segment arrives. The relevant case to consider in this
   context is a spurious timeout that occurred because an entire flight
   of ACKs got lost (recall the definition of a spurious timeout from
   Section 2). The question is which timestamp the TCP receiver echoes
   in response to the spurious retransmit that typically arrives as a
   duplicate segment. RFC1323-compliant TCP receivers (e.g., LINUX
   kernel 2.4.x) will echo the timestamp of the last segment that
   arrived in-sequence, while some non-RFC1323-compliant TCP receivers
   will echo the timestamp of the retransmit.

   The Eifel detection algorithm will only detect such spurious timeouts
   in case the TCP receiver is RFC1323-compliant in this respect.
   Otherwise, the Eifel detection algorithm will simply not kick in.
   This is not any different than from running the TCP sender without
   the Eifel algorithm.


3.4 Protecting Against Misbehaving TCP Receivers (the Safe Variant)

   A TCP receiver can easily make a genuine retransmit appear to the TCP
   sender as a spurious retransmit by forging echoed timestamps. This
   may pose a security concern.


Ludwig & Meyer                                                  [Page 7]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


   Fortunately, there is a way to modify the Eifel detection algorithm
   in a way that makes it robust against lying TCP receivers. The idea
   is to use timestamps as a "segment's secret" that the TCP receiver
   only gets to know if it receives the segment. Conversely, a TCP
   receiver will not know the timestamp of a segment that was lost.
   Hence, to "prove" that it received the original transmit of a segment
   that the TCP sender retransmitted, the TCP receiver would need to
   return the timestamp of that original transmit. The Eifel detection
   algorithm could then be modified to only decide that loss recovery
   has been unnecessarily entered if the first acceptable ACK echoes the
   timestamp of the original transmit.

   Hence, implementers may choose to implement the algorithm with the
   following modifications.

   Step (1) is replaced with step (1'):

      (1')    Set a "RetransmitTS" variable to the value of the
               Timestamp Value field of the Timestamps option that was
               included in the original transmit corresponding to the
               retransmit. Note: This step requires that the TCP sender
               stores the timestamps of all outstanding original
               transmits.

   Step (4) is replaced with step (4'):

      (4')    If the value of the Timestamp Echo Reply field of the
               acceptable ACK's Timestamps option is equal to the value
               of the variable RetransmitTS, then proceed to step (5),

               else proceed to step (DONE).

   These modifications come at a cost: the modified algorithm is fairly
   sensitive against ACK losses since it relies on the arrival of the
   acceptable ACK that corresponds to the original transmit.

        Note: The first acceptable ACK that arrives after loss recovery
        has been unnecessarily entered, should echo the timestamp of the
        original transmit. This assumes that the ACK corresponding to
        the original transmit was not lost, that that ACK was not
        reordered in the network, and that the TCP receiver does not
        forge timestamps but complies with RFC1323. In case of a
        spurious fast retransmit, this is implied by the rules for
        generating ACKs for data segments that fill in all or part of a
        gap in the sequence space (see section 4.2 of [RFC2581]) and by
        the rules for echoing timestamps in that case (see rule (C) in
        section 3.4 of [RFC1323]). In case of a spurious timeout, it is
        likely that the delay that has caused the spurious timeout has
        also caused the TCP receiver's delayed ACK timer [RFC1122] to
        expire before the original transmit arrives. Also, in this case


Ludwig & Meyer                                                  [Page 8]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


        the rules for generating ACKs and the rules for echoing
        timestamps (see rule (A) in section 3.4 of [RFC1323]) ensure
        that the original transmit's timestamp is echoed.

   A remaining problem is that a TCP receiver might guess a lost
   segment's timestamp from observing the timestamps of segments that
   arrived earlier. This could be avoided by having the TCP sender add a
   "random counter" to the timestamp of every segment it sends, and then
   increment that counter by a random value, e.g., between 1 and 100.
   This ensures that timestamps remain monotonously increasing while
   making it difficult for a TCP receiver to guess the timestamp of a
   lost segment.


4. Security Considerations

   There do not seem to be any security considerations associated with
   the Eifel detection algorithm. This is because the Eifel detection
   algorithm does not alter existing protocol state at the TCP sender
   nor at the receiver. That is, no value of a TCP sender's state
   variable is changed.

   Moreover, a variant of the Eifel detection algorithm has been
   proposed in Section 3.4 that makes it robust against lying TCP
   receivers.

Acknowledgments

   Many thanks to Keith Sklower, Randy Katz, Stephan Baucke, Sally
   Floyd, Vern Paxson, Mark Allman, Ethan Blanton, Andrei Gurtov, Pasi
   Sarolahti, and Alexey Kuznetsov for useful discussions that
   contributed to this work.

References

   [RFC2581] M. Allman, V. Paxson, W. Stevens, TCP Congestion Control,
             RFC 2581, April 1999.

   [RFC2119] S. Bradner, Key words for use in RFCs to Indicate
             Requirement Levels, RFC 2119, March 1997.

   [RFC2582] S. Floyd, T. Henderson, The NewReno Modification to TCP's
             Fast Recovery Algorithm, RFC 2582, April 1999.

   [RFC2883] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, A. Romanow,
             An Extension to the Selective Acknowledgement (SACK) Option
             for TCP, RFC 2883, July 2000.

   [Gu01]    A. Gurtov, Effect of Delays on TCP Performance, In
             Proceedings of IFIP Personal Wireless Communications,
             August 2001.


Ludwig & Meyer                                                  [Page 9]


INTERNET-DRAFT           TCP - Eifel Detection               July, 2002


   [Jac88]   V. Jacobson, Congestion Avoidance and Control, In
             Proceedings of ACM SIGCOMM 88.

   [RFC1323] V. Jacobson, R. Braden, D. Borman, TCP Extensions for High
             Performance, RFC 1323, May 1992.

   [KP87]    P. Karn, C. Partridge, Improving Round-Trip Time Estimates
             in Reliable Transport Protocols, In Proceedings of ACM
             SIGCOMM 87.

   [LK00]    R. Ludwig, R. H. Katz, The Eifel Algorithm: Making TCP
             Robust Against Spurious Retransmissions, ACM Computer
             Communication Review, Vol. 30, No. 1, January 2000,
             available at http://www.acm.org/sigcomm/ccr/archive/2000/
             jan00/ccr-200001-ludwig.html (easier studied when
             viewed/printed in color).

   [LG02]    R. Ludwig, A. Gurtov, The Eifel Response Algorithm for TCP,
             work in progress, July 2002.

   [Lu02]    R. Ludwig, Responding to Fast Timeouts in TCP, work in
             progress, July 2002.

   [RFC2988] V. Paxson, M. Allman, Computing TCP's Retransmission Timer,
             RFC 2988, November 2000.

   [RFC791]  J. Postel, Internet Protocol, RFC 791, September 1981.

   [RFC793]  J. Postel, Transmission Control Protocol, RFC793, September
             1981.

   [WS95]    G. R. Wright, W. R. Stevens, TCP/IP Illustrated, Volume 2
             (The Implementation), Addison Wesley, January 1995.

Author's Address

     Reiner Ludwig
     Ericsson Research
     Ericsson Allee 1
     52134 Herzogenrath, Germany
     Email: Reiner.Ludwig@eed.ericsson.se

     Michael Meyer
     Ericsson Research
     Ericsson Allee 1
     52134 Herzogenrath, Germany
     Email: Michael.Meyer@eed.ericsson.se


This Internet-Draft expires in January 2003.


Ludwig & Meyer                                                 [Page 10]