Internet Engineering Task Force Ethan Blanton
INTERNET DRAFT Ohio University
File: draft-allman-tcp-sack-12.txt Mark Allman
BBN/NASA GRC
Kevin Fall
Intel Research
July, 2002
Expires: January, 2003A new Request for Comments is now available in online RFC libraries.
RFC 3517
Title: A Conservative SACK-based Selective Acknowledgment
(SACK)-based Loss Recovery Algorithm for TCP
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of [RFC2026].
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
Author(s): E. Blanton, M. Allman, K. Fall, L. Wang
Status: Standards Track
Date: April 2003
Mailbox: eblanton@cs.purdue.edu, mallman@bbn.com,
kfall@intel-research.net, lwang0@uky.edu
Pages: 13
Characters: 27855
Updates/Obsoletes/SeeAlso: None
I-D Tag: draft-allman-tcp-sack-13.txt
URL: ftp://ftp.rfc-editor.org/in-notes/rfc3517.txt
This document presents a conservative loss recovery algorithm
for TCP that is based on the use of the selective acknowledgment
(SACK) TCP option. The algorithm presented in this document conforms
to the spirit of the current congestion control specification
[RFC2581],
(RFC 2581), but allows TCP senders to recover more effectively
when multiple segments are lost from a single flight of data.
Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
1 Introduction
This document presents a conservative loss recovery algorithm for
TCP that is based on the use of the selective acknowledgment TCP
option. While the TCP selective acknowledgment (SACK) option
[RFC2018] is being steadily deployed in the Internet [All00] there
is evidence that hosts are not using the SACK information when
making retransmission and congestion control decisions [PF01]. The
goal of this document is to outline one straightforward method for
TCP implementations to use SACK information to increase performance.
[RFC2581] allows advanced loss recovery algorithms to be used by TCP
[RFC793] provided that they follow the spirit of TCP's congestion
control algorithms [RFC2581,RFC2914]. [RFC2582] outlines one such
advanced recovery algorithm called NewReno.
This document outlines
a loss recovery algorithm that uses the selective acknowledgment
(SACK) [RFC2018] TCP option to enhance TCP's loss recovery. The
algorithm outlined in this document, heavily based on the algorithm
detailed in [FF96], is a conservative replacement product of the fast
recovery algorithm [Jac90,RFC2581]. The algorithm specified in this
document is a straightforward SACK-based loss recovery strategy that
follows the guidelines set in [RFC2581] and can safely be used in
TCP implementations. Alternate SACK-based loss recovery methods can
be used in TCP as implementers see fit (as long as the alternate
algorithms follow the guidelines provided in [RFC2581]). Please
note, however, that the SACK-based decisions in this document (such
as what segments are to be sent at what time) are largely decoupled
from the congestion control algorithms, and as such can be treated
as separate issues if so desired.
2 Definitions
The reader is expected to be familiar with the definitions given in
[RFC2581].
The reader is assumed to be familiar with selective acknowledgments
as specified in [RFC2018].
For the purposes of explaining the SACK-based loss recovery
algorithm we define four variables that a TCP sender stores:
``HighACK'' is the sequence number of the highest byte of
data that has been cumulatively ACKed at a given point.
``HighData'' is the highest sequence number transmitted at a
given point.
``HighRxt'' is the highest sequence number which has been
retransmitted during the current loss recovery phase.
``Pipe'' is a sender's estimate of the number Transport Area Working Group of bytes
outstanding in the network.
IETF.
This is used during recovery
for limiting the sender's sending rate. The pipe variable
allows TCP to use a fundamentally different congestion
control than specified in [RFC2581]. The algorithm is often
referred to as the ``pipe algorithm''.
For the purposes of this specification we define a ``duplicate
acknowledgment'' as an acknowledgment (ACK) whose cumulative ACK
number is equal to the current value of HighACK, as described in
[RFC2581].
We define a variable ``DupThresh'' that holds the number of
duplicate acknowledgments required to trigger a retransmission. Per
[RFC2581] this threshold is defined to be 3 duplicate
acknowledgments. However, implementers should consult any updates
to [RFC2581] to determine the current value for DupThresh (or method
for determining its value).
Finally, a range of sequence numbers [A,B] is said to ``cover''
sequence number S if A <= S <= B.
3 Keeping Track of SACK Information
For now a TCP sender to implement the algorithm defined in the next
section it must keep a data structure to store incoming
selective acknowledgment information on a per connection basis.
Such a data structure is commonly called the ``scoreboard''.
The specifics of the scoreboard data structure are out of scope
for this document (as long as the implementation can perform all
functions required by this specification).
Note that while this document speaks of marking and keeping
track of octets, a real world implementation would probably want
to keep track of octet ranges or otherwise collapse the data
while ensuring that arbitrary ranges are still markable.
4 Processing and Acting Upon SACK Information
For the purposes of the algorithm defined in this document the
scoreboard SHOULD implement the following functions:
Update ():
Given the information provided in an ACK, each octet that is
cumulatively ACKed or SACKed should be marked accordingly in
the scoreboard data structure, and the total number of
octets SACKed should be recorded.
Note: SACK information is advisory and therefore SACKed data
MUST NOT be removed from TCP's retransmission buffer until the
data is cumulatively acknowledged [RFC2018].
IsLost (SeqNum):
This routine returns whether the given sequence number is
considered to be lost. The routine returns true when either
DupThresh discontiguous SACKed sequences have arrived above
'SeqNum' or DupThresh * SMSS bytes with sequence numbers greater
than 'SeqNum' have been SACKed. Otherwise, the routine returns
false.
SetPipe (): Proposed Standard Protocol.
This routine traverses the sequence space from HighACK to
HighData and MUST set the ``pipe'' variable to document specifies an estimate of
the number of octets that are currently in transit between the
TCP sender and the TCP receiver. After initializing pipe to
zero the following steps are taken for each octet 'S1' in the
sequence space between HighACK and HighData that has not been
SACKed:
(a) The pipe variable is incremented by 1 octet.
(b) If S1 <= HighRxt and IsLost (S1) returns false:
Pipe is incremented by 1 octet.
The effect of this condition is that pipe is incremented for
both the original transmission and the retransmission of the
octet because neither has been determined to have left the
network at this point.
NextSeg ():
This routine uses the scoreboard data structure maintained by
the Update() function to determine what to transmit based on
the SACK information that has arrived from the data receiver
(and hence been marked in the scoreboard). NextSeg () MUST
return the sequence number range of the next segment that is
to be transmitted, per the following rules:
(1) If there exists a smallest unSACKed sequence number 'S2'
that meets the following three criteria Internet standards track protocol for determining loss
the sequence range of one segment of up to SMSS octets
starting with S2 MUST be returned.
(1.a) S2 is greater than HighRxt.
(1.b) S2 is less than the highest octet convered by any
received SACK.
(1.c) IsLost (S2) returns true.
(2) If no sequence number 'S2' per rule (1) exists but there
exists available unsent data Internet community, and the receiver's advertised
window allows, the sequence range of one segment of up to
SMSS octets of previously unsent data starting with sequence
number HighData+1 MUST be returned.
(3) If the conditions for rules (1) requests discussion and (2) fail, but there
exists an unSACKed sequence number 'S3' that meets the
criteria suggestions
for detecting loss given in steps (1.a) and (1.b)
above (specifically excluding step (1.c)) then one segment
of up improvements. Please refer to SMSS octets starting with S3 MUST be returned.
(4) If the conditions for each of (1), (2), and (3) are not
met, then NextSeg () MUST indicate failure, and no segment
is returned.
Note: The SACK-based loss recovery algorithm outlined in this
document requires more computational resources than previous TCP
loss recovery strategies. However, we believe the scoreboard data
structure can be implemented in a reasonably efficient manner (both
in terms of computation complexity and memory usage) in most TCP
implementations.
5 Algorithm Details
Upon the receipt of any ACK containing SACK information, the
scoreboard MUST be updated via the Update () routine.
Upon the receipt current edition of the first (DupThresh -
"Internet Official Protocol Standards" (STD 1) duplicate ACKs, the
scoreboard is to be updated as normal. Note: The first and second
duplicate ACKs can also be used to trigger the transmission of
previously unsent segments using the Limited Transmit algorithm
[RFC3042].
When a TCP sender receives the duplicate ACK corresponding to
DupThresh ACKs, the scoreboard MUST be updated with the new SACK
information (via Update ()). If no previous loss event has
occurred on the connection or the cumulative acknowledgement point
is beyond the last value of RecoveryPoint, a loss recovery phase
SHOULD be initiated, per the fast retransmit algorithm outlined in
[RFC2581]. The following steps MUST be taken:
(1) RecoveryPoint = HighData
When the TCP sender receives a cumulative ACK for this data
octet the loss recovery phase is terminated.
(2) ssthresh = cwnd = (FlightSize / 2)
The congestion window (cwnd)
standardization state and slow start threshold
(ssthresh) are reduced to half status of FlightSize per [RFC2581].
(3) Retransmit the first data segment presumed dropped -- the
segment starting with sequence number HighACK + 1. To
prevent repeated retransmission of the same data, set
HighRxt to the highest sequence number in the retransmitted
segment.
(4) Run SetPipe ()
Set a ``pipe'' variable to the number this protocol. Distribution
of outstanding octets
currently ``in the pipe''; this memo is the data which has been
sent by the TCP sender but for which no cumulative or
selective acknowledgment has been received and the data has
not been determined to have been dropped in the network. unlimited.
This data announcement is assumed to be still traversing the network
path.
(5) In order to take advantage of potential additional available
cwnd, proceed to step (C) below.
Once a TCP is in the loss recovery phase the following procedure
MUST be used for each arriving ACK:
(A) An incoming cumulative ACK for a sequence number greater than
RecoveryPoint signals the end of loss recovery and the loss
recovery phase MUST be terminated. Any information contained in
the scoreboard for sequence numbers greater than the new value
of HighACK SHOULD NOT be cleared when leaving the loss recovery
phase.
(B) Upon receipt of an ACK that does not cover RecoveryPoint the
following actions MUST be taken:
(B.1) Use Update () to record the new SACK information conveyed
by the incoming ACK.
(B.2) Use SetPipe () to re-calculate the number of octets still
in the network.
(C) If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more
segments as follows:
(C.1) The scoreboard MUST be queried via NextSeg () for the
sequence number range of the next segment to transmit (if
any), and the given segment sent.
(C.2) If any of the data octets sent in (C.1) are below
HighData, HighRxt MUST be set to the highest sequence number
of the segment retransmitted.
(C.3) If any of IETF list and the data octets sent in (C.1) are above
HighData, HighData must be updated RFC-DIST list.
Requests to reflect the
transmission of previously unsent data.
(C.4) The estimate of the amount of data outstanding in the
network must be updated by incrementing pipe by the
number of octets transmitted in (C.1).
(C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
5.1 Retransmission Timeouts
In order to avoid memory deadlocks, the TCP receiver is allowed added to
discard data that has already been selectively acknowledged. As a
result, [RFC2018] suggests that a TCP sender SHOULD expunge the
SACK information gathered or deleted from a receiver upon a retransmission
timeout ``since the timeout might indicate that the data receiver
has reneged.'' Additionally, a TCP sender MUST ``ignore prior SACK
information in determining which data to retransmit.'' However, a
SACK TCP sender SHOULD still use all SACK information made
available during the slow start phase of loss recovery following
an RTO.
If an RTO occurs during loss recovery as specified in this document,
RecoveryPoint MUST be preserved and the loss recovery algorithm
outlined in this document MUST be terminated. In addition, a new
recovery phase (as described in section 5) MUST NOT IETF distribution list
should be initiated
until HighACK is greater than or equal sent to RecoveryPoint.
As described in Sections 4 and 5, Update () SHOULD continue IETF-REQUEST@IETF.ORG. Requests to be
used appropriately upon receipt of ACKs. This will allow the slow
start recovery period to benefit from all available information
provided by the receiver, despite the fact that SACK information was
expunged due
added to the RTO.
If there are segments missing or deleted from the receiver's buffer following
processing of the retransmitted segment, the corresponding ACK will
contain SACK information. In this case, a TCP sender SHOULD use
this SACK information when determining what data RFC-DIST distribution list should
be sent in
each segment of the slow start. The exact algorithm for this
selection is not specified in this document (specifically NextSeg ()
is inappropriate during slow start after an RTO). A relatively
straightforward approach to ``filling in'' the sequence space
reported as missing should be a reasonable approach.
6 Managing the RTO Timer
The standard TCP RTO estimator is defined in [RFC2988]. Due to
the fact that the SACK algorithm in this document can have an
impact RFC-DIST-REQUEST@RFC-EDITOR.ORG.
Details on the behavior of the estimator, implementers obtaining RFCs via FTP or EMAIL may wish
to consider how the timer is managed. [RFC2988] calls for the
RTO timer to be re-armed each time obtained by sending
an ACK arrives that advances
the cumulative ACK point. Because the algorithm presented in
this document can keep the ACK clock going through a fairly
significant loss event, (comparatively longer than the algorithm
described in [RFC2581]), on some networks the loss event could
last longer than the RTO. In this case the RTO timer would
expire prematurely and a segment that need not be retransmitted
would be resent.
Therefore we give implementers the latitude EMAIL message to use the standard
[RFC2988] style RTO management or, optionally, a more careful
variant that re-arms rfc-info@RFC-EDITOR.ORG with the RTO timer on each retransmission that
is sent during recovery MAY be used. This provides a more
conservative timer than specified in [RFC2988], and so may not
always message body
help: ways_to_get_rfcs. For example:
To: rfc-info@RFC-EDITOR.ORG
Subject: getting rfcs
help: ways_to_get_rfcs
Requests for special distribution should be an attractive alternative. However, in some cases it
may prevent needless retransmissions, go-back-N transmission and
further reduction of the congestion window.
7 Research
The algorithm specified in this document is analyzed in [FF96],
which shows that the above algorithm is effective in reducing
transfer time over standard TCP Reno [RFC2581] when multiple
segments are dropped from a window of data (especially as addressed to either the number
author of drops increases). [AHKO97] shows that the algorithm defined in
this document can greatly improve throughput in connections
traversing satellite channels.
8 Security Considerations
The algorithm presented RFC in this paper shares security considerations
with [RFC2581]. A key difference is that an algorithm based on
SACKs is more robust against attackers forging duplicate ACKs to
force the TCP sender to reduce cwnd. With SACKs, TCP senders have an
additional check on whether question, or not a particular ACK is legitimate.
While not fool-proof, SACK does provide some amount of protection in
this area.
Acknowledgments
The authors wish to thank Sally Floyd for encouraging this
document and commenting on early drafts. The algorithm
described in this document is loosely based RFC-Manager@RFC-EDITOR.ORG. Unless
specifically noted otherwise on an algorithm
outlined by Kevin Fall and Sally Floyd in [FF96], although the
authors of this document assume responsibility RFC itself, all RFCs are for any mistakes
in the above text. Murali Bashyam, Ken Calvert, Tom Henderson,
Reiner Ludwig, Jamshid Mahdavi, Matt Mathis, Shawn Ostermann,
Vern Paxson, Venkat Venkatsubra and Lili Wang provided valuable
feedback on earlier versions of this document. Finally, we
thank Matt Mathis and Jamshid Mahdavi
unlimited distribution.echo
Submissions for implementing the
scoreboard in ns and hence guiding our thinking in keeping track
of SACK state.
Normative References
[RFC793] Jon Postel, Transmission Control Protocol, STD 7, RFC 793,
September 1981.
[RFC2018] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow. TCP Selective
Acknowledgment Options. RFC 2018, October 1996
[RFC2026] Scott Bradner. The Internet Standards Process -- Revision
3, RFC 2026, October 1996
[RFC2119] Bradner, S., "Key words Requests for use in RFCs Comments should be sent to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2581] Mark Allman, Vern Paxson, W. Richard Stevens, TCP
Congestion Control,
RFC-EDITOR@RFC-EDITOR.ORG. Please consult RFC 2581, April 1999.
Non-Normative References
[AHKO97] Mark Allman, Chris Hayes, Hans Kruse, Shawn Ostermann. TCP
Performance Over Satellite Links. Proceedings of the Fifth
International Conference on Telecommunications Systems,
Nashville, TN, March, 1997.
[All00] Mark Allman. A Web Server's View of the Transport Layer. ACM
Computer Communication Review, 30(5), October 2000.
[FF96] Kevin Fall and Sally Floyd. Simulation-based Comparisons of
Tahoe, Reno and SACK TCP. Computer Communication Review, July
1996.
[Jac90] Van Jacobson. Modified TCP Congestion Avoidance Algorithm.
Technical Report, LBL, April 1990.
[PF01] Jitendra Padhye, Sally Floyd. Identifying the TCP Behavior
of Web Servers, ACM SIGCOMM, August 2001.
[RFC2582] Sally Floyd and Tom Henderson. The NewReno Modification 2223, Instructions to TCP's Fast Recovery Algorithm, RFC 2582, April 1999.
[RFC2914] Sally Floyd. Congestion Control Principles, RFC 2914,
September 2000.
[RFC2988] Vern Paxson, Mark Allman. Computing TCP's Retransmission
Timer, RFC 2988, November 2000.
[RFC3042] Mark Allman, Hari Balkrishnan, Sally Floyd. Enhancing
TCP's Loss Recovery Using Limited Transmit. RFC 3042,
January 2001
Author's Addresses:
Ethan Blanton
Ohio University Internetworking Research Lab
Stocker Center
Athens, OH 45701
eblanton@irg.cs.ohiou.edu
Mark Allman
BBN Technologies/NASA Glenn Research Center
Lewis Field
21000 Brookpark Rd. MS 54-5
Cleveland, OH 44135
Phone: 216-433-6586
Fax: 216-433-8705
mallman@bbn.com
http://roland.grc.nasa.gov/~mallman
Kevin Fall
Intel Research
2150 Shattuck Ave., PH Suite
Berkeley, CA 94704
kfall@intel-research.net
Authors, for further information.