TCP Maintenance Working Group                              N. Dukkipati
Internet-Draft                                              N. Cardwell
Intended status: Experimental                                  Y. Cheng
Expires: August 29, 2013                                      M. Mathis
                                                            Google, Inc
                                                      February 25, 2013

  Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses
                draft-dukkipati-tcpm-tcp-loss-probe-01.txt
Abstract

   Retransmission timeouts are detrimental to application latency,
   especially for short transfers such as Web transactions where
   timeouts can often take longer than all of the rest of a transaction.
   The primary cause of retransmission timeouts is lost segments at the
   tail of transactions.  This document describes an experimental
   algorithm for TCP to quickly recover lost segments at the end of
   transactions or when an entire window of data or acknowledgments are
   lost.  Tail Loss Probe (TLP) is a sender-only algorithm that allows
   the transport to recover tail losses through fast recovery as opposed
   to lengthy retransmission timeouts.  If a connection is not receiving
   any acknowledgments for a certain period of time, TLP transmits the
   last unacknowledged segment (loss probe).  In the event of a tail
   loss in the original transmissions, the acknowledgment from the loss
   probe triggers SACK/FACK based fast recovery.  TLP effectively avoids
   long timeouts and thereby improves TCP performance.
Status of this Memo

skipping to change at page 1, line 46
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 29, 2013.
Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  5
   2.  Loss probe algorithm . . . . . . . . . . . . . . . . . . . . .  5
     2.1.  Pseudocode . . . . . . . . . . . . . . . . . . . . . . . .  6
     2.2.  FACK threshold based recovery  . . . . . . . . . . . . . .  8
   3.  Detecting recovered losses . . . . . . . . . . . . . . . . . .  9
     3.1.  TLP Loss Detection: The Basic Idea . . . . . . . . . . . .  9
     3.2.  TLP Loss Detection: Algorithm Details  . . . . . . . . . .  9
   4.  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 11
     4.1.  Unifying loss recoveries . . . . . . . . . . . . . . . . . 12
     4.2.  Recovery of any N-degree tail loss . . . . . . . . . . . . 12
   5.  Experiments with TLP . . . . . . . . . . . . . . . . . . . . . 14
   6.  Related work . . . . . . . . . . . . . . . . . . . . . . . . . 16
   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 17
   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 17
   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 18
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19
1.  Introduction

   Retransmission timeouts are detrimental to application latency,
   especially for short transfers such as Web transactions where
   timeouts can often take longer than all of the rest of a transaction.
   This document describes an experimental algorithm, Tail Loss Probe
   (TLP), to invoke fast recovery for losses that would otherwise be
   only recoverable through timeouts.

   The Transmission Control Protocol (TCP) has two methods for
   recovering lost segments.  First, the fast retransmit algorithm
   relies on incoming duplicate acknowledgments (ACKs), which indicate
   that the receiver is missing some data.  After a required number of
   duplicate ACKs have arrived at the sender, it retransmits the first
   unacknowledged segment and continues with a loss recovery algorithm
   such as the SACK-based loss recovery [RFC6675].  If the fast
   retransmit algorithm fails for any reason, TCP uses a retransmission
   timeout as the last resort mechanism to recover lost segments.  If an
   ACK for a given segment is not received in a certain amount of time
   called retransmission timeout (RTO), the segment is resent [RFC6298].

   Timeouts can occur in a number of situations, such as the following:

   (1) Drop tail at the end of transactions.  Example: consider a
   transfer of five segments sent on a connection that has a congestion
   window of ten.  Any degree of loss in the tail, such as segments four
skipping to change at page 4, line 22

   other indication of losses at the sender [IMC11PRR].  Early
   retransmit and fast recovery have no hope of repairing losses without
   these indications.  Efficiently addressing situations that would
   cause timeouts without any prior indication of losses is a
   significant opportunity for additional improvements to loss recovery.

   To get a sense of just how long the RTOs are in relation to
   connection RTTs, following is the distribution of RTO/RTT values on
   Google Web servers.  [percentile, RTO/RTT]: [50th percentile, 4.3];
   [75th percentile, 11.3]; [90th percentile, 28.9]; [95th percentile,
   53.9]; [99th percentile, 214].  Large RTOs, typically caused by
   variance in measured RTTs, can be a result of intermediate queuing
   and service variability in mobile channels.  Such large RTOs make a
   huge contribution to the long tail on the latency statistics of short
   flows.  Note that simply reducing the length of RTO does not address
   the latency problem for two reasons: first, it increases the chances
   of spurious retransmissions.  Second and more importantly, an RTO
   reduces TCP's congestion window to one and forces a slow start.
   Recovery of losses without relying primarily on the RTO mechanism is
   beneficial for short TCP transfers.
   The question we address in this document is: Can a TCP sender recover
   tail losses of transactions through fast recovery and thereby avoid
   lengthy retransmission timeouts?  We specify an algorithm, Tail Loss
   Probe (TLP), which sends probe segments to trigger duplicate ACKs
   with the intent of invoking fast recovery more quickly than an RTO at
   the end of a transaction.  TLP is applicable only for connections in
   Open state, wherein a sender is receiving in-sequence ACKs and has
   not detected any lost segments.  TLP can be implemented by modifying
   only the TCP sender, and does not require any TCP options or changes
   to the receiver for its operation.  For convenience, this document
   mostly refers to TCP, but the algorithms and other discussion are
   valid for Stream Control Transmission Protocol (SCTP) as well.
   This document is organized as follows.  Section 2 describes the basic
   Loss Probe algorithm.  Section 3 outlines an algorithm to detect the
   cases when TLP plugs a hole in the sender.  The algorithm makes the
   sender aware that a loss had occurred so it performs the appropriate
   congestion window reduction.  Section 4 discusses the interaction of
   TLP with early retransmit in being able to recover any degree of tail
   losses.  Section 5 discusses the experimental results with TLP on
   Google Web servers.  Section 6 discusses related work, and Section 7
   discusses the security considerations.
1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].
2.  Loss probe algorithm

   The Loss probe algorithm is designed for a sender to quickly detect

skipping to change at page 6, line 20
   senders to use the algorithm described in section 3 to infer whether
   any segments were lost.

2.1.  Pseudocode

   We define the terminology used in specifying the TLP algorithm:

   FlightSize: amount of outstanding data in the network as defined in
   [RFC5681].
RTO: The transport's retransmission timeout (RTO) is based on
measured round-trip times (RTT) between the sender and receiver, as
specified in [RFC6298] for TCP.
   PTO: Probe timeout is a timer event indicating that an ACK is
   overdue.  Its value is constrained to be smaller than or equal to an
   RTO.
   SRTT: smoothed round-trip time computed like in [RFC6298].

   Open state: the sender has so far received in-sequence ACKs with no
   SACK blocks, and no other indications (such as retransmission
   timeout) that a loss may have occurred.

   Consecutive PTOs: back-to-back PTOs all scheduled for the same tail
   packets in a flight.  The (N+1)st PTO is scheduled after transmitting
   the probe segment for the Nth PTO.
   The TLP algorithm works as follows:

   (1) Schedule PTO after transmission of new data in Open state:

   Check for conditions to schedule PTO outlined in step 2 below.

   FlightSize > 1: schedule PTO in max(2*SRTT, 10ms).
   FlightSize == 1: schedule PTO in max(2*SRTT, 1.5*SRTT+WCDelAckT).
If RTO is earlier, schedule PTO in its place: PTO = min(RTO, PTO).
   WCDelAckT stands for worst case delayed ACK timer.  When FlightSize
   is 1, PTO is inflated additionally by WCDelAckT time to compensate
   for a potential long delayed ACK timer at the receiver.  The
   RECOMMENDED value for WCDelAckT is 200ms.
A PTO value of 2*SRTT allows a sender to wait long enough to know
that an ACK is overdue. Under normal circumstances, i.e. no losses,
an ACK typically arrives in one RTT. But choosing PTO to be exactly
an RTT is likely to generate spurious probes given that even end-
system timings can easily push an ACK to be above an RTT. We chose
PTO to be the next integral value of RTT. If RTO is smaller than the
computed value for PTO, then a probe is scheduled to be sent at the
RTO time. The RTO timer is rearmed at the time of sending the probe,
as is shown in Step 3 below. This ensures that a PTO is always sent
prior to a connection experiencing an RTO.
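   The scheduling rules above can be expressed as a short non-normative
   sketch (the function and constant names below are illustrative, not
   taken from the draft; times are in seconds):

```python
# Non-normative sketch of the TLP probe timeout (PTO) calculation.
WC_DEL_ACK_T = 0.200  # worst-case delayed ACK timer, RECOMMENDED 200ms
MIN_PTO = 0.010       # 10ms floor when more than one segment is in flight

def schedule_pto(srtt, rto, flight_size):
    """Return the PTO value, or None if nothing is outstanding."""
    if flight_size > 1:
        pto = max(2 * srtt, MIN_PTO)
    elif flight_size == 1:
        # Inflate the PTO to cover a potentially long delayed ACK
        # timer at the receiver.
        pto = max(2 * srtt, 1.5 * srtt + WC_DEL_ACK_T)
    else:
        return None  # no data in flight, nothing to probe
    # If RTO would fire earlier, schedule the probe at the RTO time,
    # so a probe always precedes a retransmission timeout.
    return min(rto, pto)
```

   For example, with SRTT = 100ms and RTO = 300ms, a flight of several
   segments yields a PTO of 200ms.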
   (2) Conditions for scheduling PTO:

   (a) Connection is in Open state.

   (b) Connection is either cwnd limited or application limited.

   (c) Number of consecutive PTOs <= 2.

   (d) Connection is SACK enabled.
   Implementations MAY use one or two consecutive PTOs.
   (3) When PTO fires:

   (a) If a new previously unsent segment exists:
       -> Transmit new segment.
       -> FlightSize += SMSS.  cwnd remains unchanged.

   (b) If no new segment exists:
       -> Retransmit the last segment.

   (c) Increment statistics counter for loss probes.
(d) If conditions in (2) are satisfied:
-> Reschedule next PTO.
Else:
-> Rearm RTO to fire at epoch 'now+RTO'.
   The reason for retransmitting the last segment in Step (b) is so that
   the ACK will carry SACK blocks and trigger either SACK-based loss
   recovery [RFC6675] or FACK threshold based fast recovery [FACK].  On
   transmission of a TLP, a MIB counter is incremented to keep track of
   the total number of loss probes sent.
   (4) During ACK processing:

       Cancel any existing PTO.
       If conditions in (2) allow:
       -> Reschedule PTO relative to the ACK receipt time.
   Following is an example of TLP.  All events listed are at a TCP
   sender.

   (1) Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10.  There is
   no more new data to transmit.  A PTO is scheduled to fire in 2 RTTs,
   after the transmission of the 10th segment.

   (2) Receives acknowledgements (ACKs) for segments 1-5; segments 6-10
   are lost and no ACKs are received.  Note that the sender
   (re)schedules its PTO timer relative to the last received ACK, which
   is the ACK for segment 5 in this case.  The sender sets the PTO
   interval using the calculation described in step (1) of the
   algorithm.

   (3) When PTO fires, sender retransmits segment 10.

   (4) After an RTT, SACK for packet 10 arrives.  The ACK also carries
   SACK holes for segments 6, 7, 8 and 9.  This triggers FACK threshold
   based recovery.

   (5) Connection enters fast recovery and retransmits remaining lost
   segments.
2.2. FACK threshold based recovery
   At the core of TLP is its reliance on the FACK threshold based
   algorithm to invoke Fast Recovery.  In this section we specify this
   algorithm.

   Section 3.1 of the Forward Acknowledgement (FACK) paper [FACK]
   describes an alternate algorithm for triggering fast retransmit,
   based on the extent of the SACK scoreboard.  Its goal is to trigger
   fast retransmit as soon as the receiver's reassembly queue is larger
   than the dupack threshold, as indicated by the difference between the
   forward-most SACK block edge and SND.UNA.  This algorithm quickly and
   reliably triggers fast retransmit in the presence of burst losses --
   often on the first SACK following such a loss.  Such a threshold
   based algorithm also triggers fast retransmit immediately in the
   presence of any reordering with extent greater than the dupack
   threshold.
FACK threshold based recovery works by introducing a new TCP state
variable at the sender called SND.FACK. SND.FACK reflects the
forward-most data held by the receiver and is updated when a SACK
block is received acknowledging data with a higher sequence number
than the current value of SND.FACK. SND.FACK reflects the highest
sequence number known to have been received plus one. Note that in
non-recovery states, SND.FACK is the same as SND.UNA. The following
snippet is the pseudocode for FACK threshold based recovery.
If (SND.FACK - SND.UNA) > dupack threshold:
-> Invoke Fast Retransmit and Fast Recovery.
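   The pseudocode above can be made concrete with a small non-normative
   sketch.  It assumes the threshold is measured in bytes, as the
   dupack threshold (in segments) times SMSS; the variable and function
   names are illustrative:

```python
# Non-normative sketch of FACK threshold based recovery.
# snd_una:  oldest unacknowledged sequence number (SND.UNA).
# snd_fack: highest sequence number known to have been received by
#           the receiver, plus one (SND.FACK).
DUPACK_THRESH = 3  # standard duplicate-ACK threshold, in segments
SMSS = 1460        # sender maximum segment size, in bytes

def update_fack(snd_fack, sack_block_end):
    """Advance SND.FACK when a SACK block covers newer data;
    otherwise leave it unchanged."""
    return max(snd_fack, sack_block_end)

def should_fast_retransmit(snd_una, snd_fack):
    """Invoke Fast Retransmit/Fast Recovery once the receiver's
    reassembly queue exceeds the dupack threshold."""
    return (snd_fack - snd_una) > DUPACK_THRESH * SMSS
```

   In the example of Section 2.1, the SACK for the probed segment 10
   moves SND.FACK four segments ahead of SND.UNA at once, so a single
   ACK is enough to cross the threshold.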
3.  Detecting recovered losses

   If the only loss was the last segment, there is the risk that the
   loss probe itself might repair the loss, effectively masking it from
   congestion control.  To avoid interfering with mandatory congestion
   control [RFC5681] it is imperative that TLP include a mechanism to
   detect when the probe might have masked a loss and to properly reduce
   the congestion window (cwnd).  An algorithm to examine subsequent
   ACKs to determine whether the original segment was lost is described
   here.
skipping to change at page 11, line 8
   (d) the ACK does not advance SND.UNA

   (e) the segment contains no data

   (f) the segment is not a window update

   If all of those conditions are met, then the sender executes the
   following:

       TLPRtxOut--;
   (b) Marking the end of a TLP episode and detecting losses

   If an incoming ACK is after TLPHighRxt, then the sender deems the TLP
   episode over.  At that time, the TLP sender executes the following:
       isLoss = (TLPRtxOut > 0) &&
                (segment does not carry a DSACK for TLP retransmission);
       TLPRtxOut = 0
       if (isLoss)
          EnterRecovery();
   In other words, if the sender detects an ACK for data beyond the TLP
   loss probe retransmission then (in the absence of reordering on the
   return path of ACKs) it should have received any ACKs that will
   indicate whether the original or any loss probe retransmissions were
   lost.  An exception is the case when the segment carries a Duplicate
   SACK (DSACK) for the TLP retransmission.  If the TLPRtxOut count is
   still non-zero and thus indicates that some TLP probe segments remain
   unacknowledged, then the sender should presume that at least one
   segment was lost, so it should enter fast recovery using the
   proportional rate reduction algorithm [IMC11PRR].
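   The episode-end logic can be sketched as a non-normative helper
   (names mirror the draft's state variables; the function and its
   return convention are illustrative):

```python
# Non-normative sketch of marking the end of a TLP episode.
# tlp_rtx_out:  count of unacknowledged TLP retransmissions (TLPRtxOut).
# ack_seq:      cumulative ACK number of the incoming segment.
# tlp_high_rxt: highest sequence retransmitted as a loss probe
#               (TLPHighRxt).
# has_dsack_for_tlp: the ACK carries a DSACK covering the probe.

def tlp_episode_end(tlp_rtx_out, ack_seq, tlp_high_rxt,
                    has_dsack_for_tlp):
    """Return (new_tlp_rtx_out, enter_recovery)."""
    if ack_seq <= tlp_high_rxt:
        return tlp_rtx_out, False  # episode not over yet
    # ACK is beyond TLPHighRxt: the episode is over.  If a probe is
    # still unacknowledged and no DSACK explains it, the probe masked
    # a real loss and congestion control must respond.
    is_loss = tlp_rtx_out > 0 and not has_dsack_for_tlp
    return 0, is_loss
```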
   (5) Senders must only send a TLP loss probe retransmission if all the
   conditions from section 2.1 are met and the following condition also
   holds:

       (TLPRtxOut == 0) || (SND.NXT == TLPHighRxt)

   This ensures that there is at most one sequence range with
   outstanding TLP retransmissions.  The sender maintains this invariant
   so that there is at most one TLP retransmission "episode" happening
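   The invariant can be written as a small predicate (non-normative;
   variable names mirror the draft's state variables):

```python
def may_send_tlp_retransmission(tlp_rtx_out, snd_nxt, tlp_high_rxt):
    """Permit a probe retransmission only when no probe is outstanding,
    or when no new data has been sent since the last probe, so that at
    most one sequence range carries outstanding TLP retransmissions."""
    return tlp_rtx_out == 0 or snd_nxt == tlp_high_rxt
```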
skipping to change at page 14, line 27
5.  Experiments with TLP

   In this section we describe experiments and measurements with TLP
   performed on Google Web servers using Linux 2.6.  The experiments
   were performed over several weeks and measurements were taken across
   a wide range of Google applications.  The main goal of the
   experiments is to instrument and measure TLP's performance relative
   to the baseline.  The experiment and baseline were using the same
   kernels with an on/off switch to enable TLP.
   Our experiments include both the basic TLP algorithm of Section 2 and
   its loss detection component in Section 3.  All other algorithms such
   as early retransmit and FACK threshold based recovery are present in
   both the experiment and baseline.  There are three primary metrics we
   are interested in: impact on TCP latency (average and tail or 99th
   percentile latency), retransmission statistics, and the overhead of
   probe segments relative to the total number of transmitted segments.
   TCP latency is the time elapsed between the server transmitting the
   first byte of the response to it receiving an ACK for the last byte.
The table below shows the percentiles and average latency improvement
of key Web applications, including even those responses without
losses, measured over a period of one week. The key takeaway is: the
average response time improved up to 7% and the 99th percentile
improved by 10%. Nearly all of the improvement for TLP is in the
tail latency (post-90th percentile). The varied improvements across
services are due to different response-size distributions and traffic
patterns. For example, TLP helps the most for Images, as these are
served by multiple concurrently active TCP connections which increase
the chances of tail segment losses.
Application           Average      99%
Google Web Search       -3%       -5%
Google Maps             -5%      -10%

Google Images           -7%      -10%

TLP also improved performance in mobile networks -- by 7.2% for Web
search and Instant and 7.6% for Images transferred over the Verizon
network. To see why and where the latency improvements are coming
from, we measured the retransmission statistics. We broke down the
retransmission stats based on the nature of the retransmission --
timeout retransmission or fast recovery. TLP reduced the number of
timeouts by 15% compared to the baseline, i.e. (timeouts_baseline -
timeouts_tlp) / timeouts_baseline = 15%. Instead, these losses were
either recovered via fast recovery or by the loss probe
retransmission itself. The largest reduction in timeouts is when the
sender is in the Open state, in which it receives only in-sequence
ACKs and no duplicate ACKs, likely because of tail losses.
Correspondingly, the retransmissions occurring in the slow start
phase after RTO reduced by 46% relative to baseline. Note that it is
not always possible for TLP to convert 100% of the timeouts into fast
recovery episodes because a probe itself may be lost. Also notable
in our experiments is a significant decrease in the number of
spurious timeouts -- the experiment had 61% fewer congestion window
undo events. The Linux TCP sender uses either DSACK or timestamps to
determine if retransmissions are spurious and employs techniques for
undoing congestion window reductions. We also note that the total
number of retransmissions decreased 7% with TLP because of the
decrease in spurious retransmissions, and because the TLP probe
itself plugs a hole.
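For concreteness, the relative-reduction metric used throughout these
results can be written as a one-line helper. This is purely an
illustration of the arithmetic, not part of the TLP algorithm; the
sample counts are hypothetical, chosen only to reproduce the 15%
timeout-reduction figure.

```python
def pct_reduction(baseline, experiment):
    """Relative reduction of a counter versus baseline, in percent,
    e.g. (timeouts_baseline - timeouts_tlp) / timeouts_baseline * 100."""
    return 100.0 * (baseline - experiment) / baseline

# Hypothetical counts mirroring the 15% timeout reduction reported above.
print(pct_reduction(100, 85))  # -> 15.0
```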
We also quantified the overhead of probe packets. The probes
accounted for 0.48% of all outgoing segments, i.e. (number of probe
segments / number of outgoing segments)*100 = 0.48%. This is a
reasonable overhead when contrasted with the overall retransmission
rate of 3.2%. 10% of the probes sent are new segments and the rest
are retransmissions, which is unsurprising given that short Web
responses often don't have new data to send. We also found that in
about 33% of the cases, the probes themselves plugged the only hole
at the receiver and the loss detection algorithm reduced the
congestion window. 37% of the probes were not necessary and resulted
in a duplicate acknowledgment.
Besides the macro level latency and retransmission statistics, we
report some measurements from TCP's internal state variables at the
point when a probe segment is transmitted. The following
distribution shows the FlightSize and congestion window values when a
PTO is scheduled. We note that cwnd is not the limiting factor and
that nearly all of the probe segments are sent within the congestion
window.
percentile      10%   25%   50%   75%   90%   99%
FlightSize        1     1     2     3    10    20
cwnd              5    10    10    10    17    44
We have also experimented with a few variations of TLP: multiple
probe segments versus a single probe for the same tail loss episode,
and several values for WCDelAckT. Our experiments show that sending
just one probe suffices to get most (~90%) of the latency benefits.
The experiment results reported in this section and our current
implementation limit the number of probes to one, although the draft
itself allows up to two consecutive probes. We chose the worst-case
delayed ACK timer to be 200ms. When FlightSize equals 1 it is
important to account for the delayed ACK timer in the PTO value, in
order to bring down the number of unnecessary probe segments. With
delays of 0ms and 50ms, the probe overhead jumped from 0.48% to 3.1%
and 2.2% respectively. We have also experimented with transmitting
1-byte probe retransmissions as opposed to an entire-MSS
retransmission probe. While this scheme has the advantage of not
requiring the loss detection algorithm outlined in Section 3, it
turned out to be problematic to implement correctly in certain TCP
stacks. Additionally, a 1-byte probe costs one more RTT to recover
single-packet tail losses, which is detrimental for short transfer
latency.
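To illustrate how WCDelAckT enters the timer calculation when only
one segment is outstanding, the following sketch computes a PTO along
the lines described earlier in this draft. The constants (the 2*SRTT
base, 10ms floor, 1.5*SRTT + WCDelAckT pad, and the RTO clamp) are
our reading of the PTO definition; treat this as an illustrative
sketch rather than a normative implementation.

```python
def probe_timeout_ms(srtt_ms, rto_ms, flight_size, wcdelackt_ms=200.0):
    """Illustrative PTO sketch (times in milliseconds)."""
    if flight_size > 1:
        # Multiple segments in flight: ACKs are expected promptly.
        pto = max(2.0 * srtt_ms, 10.0)
    else:
        # A lone segment's ACK may be held by the receiver's delayed
        # ACK timer; without this pad most probes would be spurious.
        pto = max(2.0 * srtt_ms, 1.5 * srtt_ms + wcdelackt_ms)
    # Never schedule the probe later than the retransmission timeout.
    return min(pto, rto_ms)
```

With SRTT = 50ms and RTO = 1s, this yields a 100ms PTO for
FlightSize > 1 but 275ms for FlightSize = 1, showing why lowering
WCDelAckT to 0ms or 50ms inflates the probe overhead as reported
above.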
6. Related work

TCP's long and conservative RTO recovery has long been identified as
the major performance bottleneck for latency-demanding applications.
A well-studied example is online gaming, which requires reliability
and low latency but little bandwidth. [GRIWODZ06] shows that
repeated long RTOs are the dominating performance bottleneck for game
responsiveness. The authors in [PETLUND08] propose to use a linear
RTO to improve performance, which has been incorporated in the Linux
kernel as a non-default socket option for such thin streams.
[MONDAL08] further argues that exponential RTO backoff should be
removed because it is not necessary for the stability of the
Internet. In contrast, TLP does not change the RTO timer calculation
or the exponential backoff. TLP's approach is to keep the behavior
after an RTO conservative for stability, but to allow a few timely
probes before concluding that the network is badly congested and cwnd
should fall to 1.
As noted earlier in the Introduction, the F-RTO [RFC5682] algorithm
reduces the number of spurious timeout retransmissions, and the Early
Retransmit [RFC5827] mechanism reduces timeouts when a connection has
received a certain number of duplicate ACKs. Both are complementary
to TLP and can work alongside it. Rescue retransmission, introduced
in [RFC6675], deals with loss events such as AL*SL* (using the same
notation as section 4). TLP covers a wider range of events, such as
AL*. We experimented with rescue retransmission on Google Web
servers, but did not observe much performance improvement. When the
last segment is lost, it is more likely that a number of contiguous
segments preceding the segment are also lost, i.e. AL* is common.
Timeouts that occur in fast recovery are rare.
[HURTIG13] proposes to offset the elapsed time of the pending packet
when re-arming the RTO timer. It is possible to apply the same idea
to the TLP timer as well. We have not yet tested such a change to
TLP.
Tail Loss Probe is one of several algorithms designed to maximize the
robustness of TCP's self-clock in the presence of losses. It follows
the same principles as Proportional Rate Reduction [IMC11PRR] and TCP
Laminar [Laminar].
On a final note, Tail Loss Probe does not eliminate 100% of all RTOs;
RTOs still remain the dominant mode of loss recovery for short
transfers. Future work should proceed along the following lines:
transmitting multiple loss probes before finally resorting to RTOs;
maintaining ACK clocking for short transfers in the absence of new
data by clocking out old data in response to incoming ACKs; and
taking cues from applications that indicate the end of a transaction,
using them for smarter tail loss recovery.
7. Security Considerations

The security considerations outlined in [RFC5681] apply to this
document. At this time we did not find any additional security
problems with Tail Loss Probe.
8. IANA Considerations

This document makes no request of IANA.

Note to RFC Editor: this section may be removed on publication as an
RFC.
9. References

[RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo,
           M., and Y. Nishida, "A Conservative Loss Recovery
           Algorithm Based on Selective Acknowledgment (SACK) for
           TCP", RFC 6675, August 2012.

[RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
           "Computing TCP's Retransmission Timer", RFC 6298,
           June 2011.

[RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
           Hurtig, "Early Retransmit for TCP and Stream Control
           Transmission Protocol (SCTP)", RFC 5827, April 2010.

[RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
           "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
           Spurious Retransmission Timeouts with TCP", RFC 5682,
           September 2009.
[PETLUND08]
           Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen,
           "TCP enhancements for interactive thin-stream
           applications", NOSSDAV, 2008.
[MONDAL08]
           Mondal, A. and A. Kuzmanovic, "Removing Exponential
           Backoff from TCP", ACM SIGCOMM Computer Communication
           Review, 2008.
[Laminar]  Mathis, M., "Laminar TCP and the case for refactoring TCP
           congestion control", July 2012.
[HURTIG13]
Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP
and SCTP RTO Restart", draft-ietf-tcpm-rtorestart-00 (work
in progress), February 2013.
Authors' Addresses

   Nandita Dukkipati
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 93117
   USA

   Email: nanditad@google.com