| draft-dukkipati-tcpm-tcp-loss-probe-00.txt | draft-dukkipati-tcpm-tcp-loss-probe-01.txt | |||
|---|---|---|---|---|
| TCP Maintenance Working Group N. Dukkipati | TCP Maintenance Working Group N. Dukkipati | |||
| Internet-Draft N. Cardwell | Internet-Draft N. Cardwell | |||
| Intended status: Experimental Y. Cheng | Intended status: Experimental Y. Cheng | |||
| Expires: January 10, 2013 M. Mathis | Expires: August 29, 2013 M. Mathis | |||
| Google, Inc | Google, Inc | |||
| July 09, 2012 | February 25, 2013 | |||
| TCP Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses | Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses | |||
| draft-dukkipati-tcpm-tcp-loss-probe-00.txt | draft-dukkipati-tcpm-tcp-loss-probe-01.txt | |||
| Abstract | Abstract | |||
| Retransmission timeouts are detrimental to application latency, | Retransmission timeouts are detrimental to application latency, | |||
| especially for short transfers such as Web transactions where | especially for short transfers such as Web transactions where | |||
| timeouts can often take longer than all of the rest of a transaction. | timeouts can often take longer than all of the rest of a transaction. | |||
| The primary cause of retransmission timeouts is lost segments at the | The primary cause of retransmission timeouts is lost segments at the | |||
| tail of transactions. This document describes an experimental | tail of transactions. This document describes an experimental | |||
| algorithm for TCP to quickly recover lost segments at the end of | algorithm for TCP to quickly recover lost segments at the end of | |||
| transactions or when an entire window of data or acknowledgments are | transactions or when an entire window of data or acknowledgments are | |||
| lost. TCP Loss Probe (TLP) is a sender-only algorithm that allows | lost. Tail Loss Probe (TLP) is a sender-only algorithm that allows | |||
| the transport to recover tail losses through fast recovery as opposed | the transport to recover tail losses through fast recovery as opposed | |||
| to lengthy retransmission timeouts. If a connection is not receiving | to lengthy retransmission timeouts. If a connection is not receiving | |||
| any acknowledgments for a certain period of time, TLP transmits the | any acknowledgments for a certain period of time, TLP transmits the | |||
| last unacknowledged segment (loss probe). In the event of a tail | last unacknowledged segment (loss probe). In the event of a tail | |||
| loss in the original transmissions, the acknowledgment from the loss | loss in the original transmissions, the acknowledgment from the loss | |||
| probe triggers SACK/FACK based fast recovery. TLP effectively avoids | probe triggers SACK/FACK based fast recovery. TLP effectively avoids | |||
| long timeouts and thereby improves TCP performance. | long timeouts and thereby improves TCP performance. | |||
| Status of this Memo | Status of this Memo | |||
| skipping to change at page 1, line 46 ¶ | skipping to change at page 1, line 46 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on January 10, 2013. | This Internet-Draft will expire on August 29, 2013. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2012 IETF Trust and the persons identified as the | Copyright (c) 2013 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 2. Loss probe algorithm . . . . . . . . . . . . . . . . . . . . . 5 | 2. Loss probe algorithm . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 2.1. Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . 6 | 2.1. Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 3. Detecting recovered losses . . . . . . . . . . . . . . . . . . 7 | 2.2. FACK threshold based recovery . . . . . . . . . . . . . . 8 | |||
| 3.1. TLP Loss Detection: The Basic Idea . . . . . . . . . . . . 8 | 3. Detecting recovered losses . . . . . . . . . . . . . . . . . . 9 | |||
| 3.2. TLP Loss Detection: Algorithm Details . . . . . . . . . . 8 | 3.1. TLP Loss Detection: The Basic Idea . . . . . . . . . . . . 9 | |||
| 4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 10 | 3.2. TLP Loss Detection: Algorithm Details . . . . . . . . . . 9 | |||
| 4.1. Unifying loss recoveries . . . . . . . . . . . . . . . . . 10 | 4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 11 | |||
| 4.2. Recovery of any N-degree tail loss . . . . . . . . . . . . 11 | 4.1. Unifying loss recoveries . . . . . . . . . . . . . . . . . 12 | |||
| 5. Experiments with TLP . . . . . . . . . . . . . . . . . . . . . 13 | 4.2. Recovery of any N-degree tail loss . . . . . . . . . . . . 12 | |||
| 6. Related work . . . . . . . . . . . . . . . . . . . . . . . . . 15 | 5. Experiments with TLP . . . . . . . . . . . . . . . . . . . . . 14 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 15 | 6. Related work . . . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15 | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 17 | |||
| 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16 | 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 17 | 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19 | ||||
| 1. Introduction | 1. Introduction | |||
| Retransmission timeouts are detrimental to application latency, | Retransmission timeouts are detrimental to application latency, | |||
| especially for short transfers such as Web transactions where | especially for short transfers such as Web transactions where | |||
| timeouts can often take longer than all of the rest of a transaction. | timeouts can often take longer than all of the rest of a transaction. | |||
| This document describes an experimental algorithm, TCP Loss Probe | This document describes an experimental algorithm, Tail Loss Probe | |||
| (TLP), to invoke fast recovery for losses that would otherwise be | (TLP), to invoke fast recovery for losses that would otherwise be | |||
| only recoverable through timeouts. | only recoverable through timeouts. | |||
| The Transmission Control Protocol (TCP) has two methods for | The Transmission Control Protocol (TCP) has two methods for | |||
| recovering lost segments. First, the fast retransmit algorithm | recovering lost segments. First, the fast retransmit algorithm | |||
| relies on incoming duplicate acknowledgments (ACKs), which indicate | relies on incoming duplicate acknowledgments (ACKs), which indicate | |||
| that the receiver is missing some data. After a required number of | that the receiver is missing some data. After a required number of | |||
| duplicate ACKs have arrived at the sender, it retransmits the first | duplicate ACKs have arrived at the sender, it retransmits the first | |||
| unacknowledged segment and continues with a loss recovery algorithm | unacknowledged segment and continues with a loss recovery algorithm | |||
| such as the SACK-based loss recovery [RFC3517]. If the fast | such as the SACK-based loss recovery [RFC6675]. If the fast | |||
| retransmit algorithm fails for any reason, TCP uses a retransmission | retransmit algorithm fails for any reason, TCP uses a retransmission | |||
| timeout as the last resort mechanism to recover lost segments. If an | timeout as the last resort mechanism to recover lost segments. If an | |||
| ACK for a given segment is not received in a certain amount of time | ACK for a given segment is not received in a certain amount of time | |||
| called retransmission timeout (RTO), the segment is resent [RFC6298]. | called retransmission timeout (RTO), the segment is resent [RFC6298]. | |||
| Timeouts can occur in a number of situations, such as the following: | Timeouts can occur in a number of situations, such as the following: | |||
| (1) Drop tail at the end of transactions. Example: consider a | (1) Drop tail at the end of transactions. Example: consider a | |||
| transfer of five segments sent on a connection that has a congestion | transfer of five segments sent on a connection that has a congestion | |||
| window of ten. Any degree of loss in the tail, such as segments four | window of ten. Any degree of loss in the tail, such as segments four | |||
| skipping to change at page 4, line 22 ¶ | skipping to change at page 4, line 22 ¶ | |||
| other indication of losses at the sender [IMC11PRR]. Early | other indication of losses at the sender [IMC11PRR]. Early | |||
| retransmit and fast recovery have no hope of repairing losses without | retransmit and fast recovery have no hope of repairing losses without | |||
| these indications. Efficiently addressing situations that would | these indications. Efficiently addressing situations that would | |||
| cause timeouts without any prior indication of losses is a | cause timeouts without any prior indication of losses is a | |||
| significant opportunity for additional improvements to loss recovery. | significant opportunity for additional improvements to loss recovery. | |||
| To get a sense of just how long the RTOs are in relation to | To get a sense of just how long the RTOs are in relation to | |||
| connection RTTs, following is the distribution of RTO/RTT values on | connection RTTs, following is the distribution of RTO/RTT values on | |||
| Google Web servers. [percentile, RTO/RTT]: [50th percentile, 4.3]; | Google Web servers. [percentile, RTO/RTT]: [50th percentile, 4.3]; | |||
| [75th percentile, 11.3]; [90th percentile, 28.9]; [95th percentile, | [75th percentile, 11.3]; [90th percentile, 28.9]; [95th percentile, | |||
| 53.9]; [99th percentile, 214]. Such large RTOs make a huge | 53.9]; [99th percentile, 214]. Large RTOs, typically caused by | |||
| contribution to the long tail on the latency statistics of short | variance in measured RTTs, can be a result of intermediate queuing, | |||
| and service variability in mobile channels. Such large RTOs make a | ||||
| huge contribution to the long tail on the latency statistics of short | ||||
| flows. Note that simply reducing the length of RTO does not address | flows. Note that simply reducing the length of RTO does not address | |||
| the latency problem for two reasons: first, it increases the chances | the latency problem for two reasons: first, it increases the chances | |||
| of spurious retransmissions. Second and more importantly, an RTO | of spurious retransmissions. Second and more importantly, an RTO | |||
| reduces TCP's congestion window to one and forces a slow start. | reduces TCP's congestion window to one and forces a slow start. | |||
| Recovery of losses without relying primarily on the RTO mechanism is | Recovery of losses without relying primarily on the RTO mechanism is | |||
| beneficial for short TCP transfers. | beneficial for short TCP transfers. | |||
| The question we address in this document is: Can a TCP sender recover | The question we address in this document is: Can a TCP sender recover | |||
| tail losses of transactions through fast recovery and thereby avoid | tail losses of transactions through fast recovery and thereby avoid | |||
| lengthy retransmission timeouts? We specify an algorithm, TCP Loss | lengthy retransmission timeouts? We specify an algorithm, Tail Loss | |||
| Probe (TLP), which sends probe segments to trigger duplicate ACKs | Probe (TLP), which sends probe segments to trigger duplicate ACKs | |||
| with the intent of invoking fast recovery more quickly than an RTO at | with the intent of invoking fast recovery more quickly than an RTO at | |||
| the end of a transaction. TLP is applicable only for connections in | the end of a transaction. TLP is applicable only for connections in | |||
| Open state, wherein a sender is receiving in-sequence ACKs and has | Open state, wherein a sender is receiving in-sequence ACKs and has | |||
| not detected any lost segments. TLP can be implemented by modifying | not detected any lost segments. TLP can be implemented by modifying | |||
| only the TCP sender, and does not require any TCP options or changes | only the TCP sender, and does not require any TCP options or changes | |||
| to the receiver for its operation. For convenience, this document | to the receiver for its operation. For convenience, this document | |||
| mostly refers to TCP, but the algorithms and other discussion are | mostly refers to TCP, but the algorithms and other discussion are | |||
| valid for Stream Control Transmission Protocol (SCTP) as well. | valid for Stream Control Transmission Protocol (SCTP) as well. | |||
| This document is organized as follows. Section 2 describes the basic | This document is organized as follows. Section 2 describes the basic | |||
| Loss Probe algorithm. Section 3 outlines an algorithm to detect the | Loss Probe algorithm. Section 3 outlines an algorithm to detect the | |||
| cases when TLP plugs a hole in the sender. The algorithm makes the | cases when TLP plugs a hole in the sender. The algorithm makes the | |||
| sender aware that a loss has occurred so it can perform the appropriate | sender aware that a loss has occurred so it can perform the appropriate | |||
| congestion window reduction. Section 4 discusses the interaction of | congestion window reduction. Section 4 discusses the interaction of | |||
| TLP with early retransmit. Section 5 discusses the experimental | TLP with early retransmit in being able to recover any degree of tail | |||
| results with TLP on Google Web servers. Section 6 discusses related | losses. Section 5 discusses the experimental results with TLP on | |||
| work, and Section 7 discusses the security considerations. | Google Web servers. Section 6 discusses related work, and Section 7 | |||
| discusses the security considerations. | ||||
| 1.1. Terminology | 1.1. Terminology | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
| 2. Loss probe algorithm | 2. Loss probe algorithm | |||
| The Loss probe algorithm is designed for a sender to quickly detect | The Loss probe algorithm is designed for a sender to quickly detect | |||
| skipping to change at page 6, line 16 ¶ | skipping to change at page 6, line 20 ¶ | |||
| senders to use the algorithm described in section 3 to infer whether | senders to use the algorithm described in section 3 to infer whether | |||
| any segments were lost. | any segments were lost. | |||
| 2.1. Pseudocode | 2.1. Pseudocode | |||
| We define the terminology used in specifying the TLP algorithm: | We define the terminology used in specifying the TLP algorithm: | |||
| FlightSize: amount of outstanding data in the network as defined in | FlightSize: amount of outstanding data in the network as defined in | |||
| [RFC5681]. | [RFC5681]. | |||
| RTO: The transport's retransmission timeout (RTO) is based on | ||||
| measured round-trip times (RTT) between the sender and receiver, as | ||||
| specified in [RFC6298] for TCP. | ||||
| PTO: Probe timeout is a timer event indicating that an ACK is | PTO: Probe timeout is a timer event indicating that an ACK is | |||
| overdue. Its value is constrained to be smaller than an RTO. | overdue. Its value is constrained to be smaller than or equal to an | |||
| RTO. | ||||
| SRTT: smoothed round-trip time computed like in [RFC6298]. | SRTT: smoothed round-trip time computed like in [RFC6298]. | |||
| Open state: the sender has so far received in-sequence ACKs with no | Open state: the sender has so far received in-sequence ACKs with no | |||
| SACK blocks, and no other indications (such as retransmission | SACK blocks, and no other indications (such as retransmission | |||
| timeout) that a loss may have occurred. | timeout) that a loss may have occurred. | |||
| Consecutive PTOs: back-to-back PTOs all scheduled for the same tail | Consecutive PTOs: back-to-back PTOs all scheduled for the same tail | |||
| packets in a flight. The (N+1)st PTO is scheduled after transmitting | packets in a flight. The (N+1)st PTO is scheduled after transmitting | |||
| the probe segment for the Nth PTO. | the probe segment for the Nth PTO. | |||
| The TLP algorithm works as follows: | The TLP algorithm works as follows: | |||
| (1) Schedule PTO after transmission of new data in Open state: | (1) Schedule PTO after transmission of new data in Open state: | |||
| Check for conditions to schedule PTO outlined in step 2 below. | Check for conditions to schedule PTO outlined in step 2 below. | |||
| FlightSize > 1: schedule PTO in max(2*SRTT, 10ms). | FlightSize > 1: schedule PTO in max(2*SRTT, 10ms). | |||
| FlightSize == 1: schedule PTO in max(2*SRTT, 1.5*SRTT+WCDelAckT). | FlightSize == 1: schedule PTO in max(2*SRTT, 1.5*SRTT+WCDelAckT). | |||
| If RTO is earlier, schedule PTO in its place: PTO = min(RTO, PTO). | ||||
| WCDelAckT stands for worst case delayed ACK timer. When FlightSize | WCDelAckT stands for worst case delayed ACK timer. When FlightSize | |||
| is 1, PTO is inflated additionally by WCDelAckT time to compensate | is 1, PTO is inflated additionally by WCDelAckT time to compensate | |||
| for a potential long delayed ACK timer at the receiver. The | for a potential long delayed ACK timer at the receiver. The | |||
| RECOMMENDED value for WCDelAckT is 200ms. | RECOMMENDED value for WCDelAckT is 200ms. | |||
| A PTO value of 2*SRTT allows a sender to wait long enough to know | ||||
| that an ACK is overdue. Under normal circumstances, i.e. no losses, | ||||
| an ACK typically arrives in one RTT. But choosing PTO to be exactly | ||||
| an RTT is likely to generate spurious probes given that even end- | ||||
| system timings can easily push an ACK to be above an RTT. We chose | ||||
| PTO to be the next integral value of RTT. If RTO is smaller than the | ||||
| computed value for PTO, then a probe is scheduled to be sent at the | ||||
| RTO time. The RTO timer is rearmed at the time of sending the probe, | ||||
| as is shown in Step 3 below. This ensures that a PTO is always sent | ||||
| prior to a connection experiencing an RTO. | ||||
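As a minimal illustration (not part of the draft), the PTO arithmetic in step (1), including the min with RTO described above, can be sketched in Python. Times are in seconds, and `wc_del_ack_t` defaults to the RECOMMENDED 200 ms:

```python
def schedule_pto(srtt, rto, flight_size, wc_del_ack_t=0.200):
    """Compute the probe timeout (PTO) per step (1) of the TLP algorithm.

    srtt, rto, wc_del_ack_t are in seconds; returns the PTO in seconds.
    """
    if flight_size > 1:
        # Wait roughly two RTTs, with a 10 ms floor for very short RTTs.
        pto = max(2 * srtt, 0.010)
    else:
        # FlightSize == 1: also cover the receiver's delayed ACK timer.
        pto = max(2 * srtt, 1.5 * srtt + wc_del_ack_t)
    # Never schedule the probe later than the RTO would fire.
    return min(pto, rto)
```

With SRTT = 100 ms and RTO = 1 s, a flight of ten segments yields a 200 ms PTO, while a single-segment flight yields 350 ms; if RTO were only 150 ms, the probe would be scheduled at the RTO time instead.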
| (2) Conditions for scheduling PTO: | (2) Conditions for scheduling PTO: | |||
| (a) Connection is in Open state. | (a) Connection is in Open state. | |||
| (b) RTO is farther than PTO. | (b) Connection is either cwnd limited or application limited. | |||
| (c) Connection is either cwnd limited or application limited. | (c) Number of consecutive PTOs <= 2. | |||
| (d) Number of consecutive PTOs <= 2. | (d) Connection is SACK enabled. | |||
| (e) Connection is SACK enabled. | ||||
| Implementations MAY use one or two consecutive PTOs. | Implementations MAY use one or two consecutive PTOs. | |||
| (3) When PTO fires: | (3) When PTO fires: | |||
| (a) If a new previously unsent segment exists: | (a) If a new previously unsent segment exists: | |||
| -> Transmit new segment. | -> Transmit new segment. | |||
| -> FlightSize += SMSS. cwnd remains unchanged. | -> FlightSize += SMSS. cwnd remains unchanged. | |||
| (b) If no new segment exists: | (b) If no new segment exists: | |||
| -> Retransmit the last segment. | -> Retransmit the last segment. | |||
| (c) Reschedule next PTO if conditions in (2) allow. | (c) Increment statistics counter for loss probes. | |||
| (d) If conditions in (2) are satisfied: | ||||
| -> Reschedule next PTO. | ||||
| Else: | ||||
| -> Rearm RTO to fire at epoch 'now+RTO'. | ||||
| The reason for retransmitting the last segment in Step (b) is so that | The reason for retransmitting the last segment in Step (b) is so that | |||
| the ACK will carry SACK blocks and trigger either SACK-based loss | the ACK will carry SACK blocks and trigger either SACK-based loss | |||
| recovery [RFC3517] or FACK-based fast recovery [FACK]. | recovery [RFC6675] or FACK threshold based fast recovery [FACK]. On | |||
| transmission of a TLP, a MIB counter is incremented to keep track of | ||||
| the total number of loss probes sent. | ||||
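The steps above can be sketched with a toy sender object (illustrative Python only; `TlpSender` and its fields are hypothetical stand-ins for real sender state, and the timer machinery is omitted):

```python
class TlpSender:
    """Toy sender state for illustrating step (3); not a real TCP stack."""
    SMSS = 1460

    def __init__(self, unsent_segments, max_consecutive_ptos=2):
        self.unsent = list(unsent_segments)  # new, never-sent segments
        self.last_sent = None
        self.flight_size = 0
        self.loss_probes_sent = 0            # MIB counter from step (c)
        self.consecutive_ptos = 0
        self.max_consecutive_ptos = max_consecutive_ptos
        self.rto_armed = False

    def on_pto_fire(self):
        if self.unsent:
            # (a) prefer new data as the probe; cwnd is left unchanged.
            self.last_sent = self.unsent.pop(0)
            self.flight_size += self.SMSS
        # (b) otherwise self.last_sent is retransmitted as-is to elicit
        #     SACK blocks from the receiver.
        # (c) count the probe.
        self.loss_probes_sent += 1
        self.consecutive_ptos += 1
        # (d) reschedule PTO while step (2) allows, else rearm the RTO.
        if self.consecutive_ptos <= self.max_consecutive_ptos:
            pass  # schedule the next PTO (timer machinery omitted)
        else:
            self.rto_armed = True
```

After the allowed number of consecutive PTOs the sender falls back to the RTO, so a probe is always attempted before a connection experiences an RTO.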
| (4) During ACK processing: | (4) During ACK processing: | |||
| Cancel any existing PTO. | Cancel any existing PTO. | |||
| If conditions in (2) allow: | If conditions in (2) allow: | |||
| -> Reschedule the PTO relative to the | -> Reschedule PTO relative to the ACK receipt time. | |||
| time at which the ACK is received. | ||||
| Following is an example of TLP. All events listed are at a TCP | Following is an example of TLP. All events listed are at a TCP | |||
| sender. | sender. | |||
| (1) Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is | (1) Sender transmits segments 1-10: 1, 2, 3, ..., 8, 9, 10. There is | |||
| no more new data to transmit. A PTO is scheduled to fire in 2 RTTs, | no more new data to transmit. A PTO is scheduled to fire in 2 RTTs, | |||
| after the transmission of the 10th segment. | after the transmission of the 10th segment. | |||
| (2) Receives acknowledgements (ACKs) for segments 1-5; segments 6-10 | (2) Receives acknowledgements (ACKs) for segments 1-5; segments 6-10 | |||
| are lost and no ACKs are received. Note that the sender | are lost and no ACKs are received. Note that the sender | |||
| (re)schedules its PTO timer relative to the last received ACK, which | (re)schedules its PTO timer relative to the last received ACK, which | |||
| is the ACK for segment 5 in this case. The sender sets the PTO | is the ACK for segment 5 in this case. The sender sets the PTO | |||
| interval using the calculation described in step (1) of the | interval using the calculation described in step (1) of the | |||
| algorithm. | algorithm. | |||
| (3) When PTO fires, sender retransmits segment 10. | (3) When PTO fires, sender retransmits segment 10. | |||
| (4) After an RTT, SACK for packet 10 arrives. The ACK also carries | (4) After an RTT, SACK for packet 10 arrives. The ACK also carries | |||
| SACK holes for segments 6, 7, 8 and 9. This triggers FACK-based | SACK holes for segments 6, 7, 8 and 9. This triggers FACK threshold | |||
| recovery. | based recovery. | |||
| (5) Connection enters fast recovery and retransmits remaining lost | (5) Connection enters fast recovery and retransmits remaining lost | |||
| segments. | segments. | |||
| 2.2. FACK threshold based recovery | ||||
| At the core of TLP is its reliance on the FACK threshold based | ||||
| algorithm to invoke Fast Recovery. This section specifies that | ||||
| algorithm. | ||||
| Section 3.1 of the Forward Acknowledgement (FACK) Paper [FACK] | ||||
| describes an alternate algorithm for triggering fast retransmit, | ||||
| based on the extent of the SACK scoreboard. Its goal is to trigger | ||||
| fast retransmit as soon as the receiver's reassembly queue is larger | ||||
| than the dupack threshold, as indicated by the difference between the | ||||
| forward most SACK block edge and SND.UNA. This algorithm quickly and | ||||
| reliably triggers fast retransmit in the presence of burst losses -- | ||||
| often on the first SACK following such a loss. Such a threshold | ||||
| based algorithm also triggers fast retransmit immediately in the | ||||
| presence of any reordering with extent greater than the dupack | ||||
| threshold. | ||||
| FACK threshold based recovery works by introducing a new TCP state | ||||
| variable at the sender called SND.FACK. SND.FACK reflects the | ||||
| forward-most data held by the receiver and is updated when a SACK | ||||
| block is received acknowledging data with a higher sequence number | ||||
| than the current value of SND.FACK. SND.FACK reflects the highest | ||||
| sequence number known to have been received plus one. Note that in | ||||
| non-recovery states, SND.FACK is the same as SND.UNA. The following | ||||
| snippet is the pseudocode for FACK threshold based recovery. | ||||
| If (SND.FACK - SND.UNA) > dupack threshold: | ||||
| -> Invoke Fast Retransmit and Fast Recovery. | ||||
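The SND.FACK update and the threshold test can be illustrated as follows (a Python sketch; expressing the dupack threshold in bytes as three SMSS-sized segments is our assumption, matching the standard fast-retransmit threshold of three duplicate ACKs):

```python
def update_fack(snd_fack, sack_blocks):
    """Advance SND.FACK to the forward-most SACKed right edge.

    Each SACK block covers the half-open byte range [left, right).
    """
    for left, right in sack_blocks:
        snd_fack = max(snd_fack, right)
    return snd_fack

def fack_triggers_recovery(snd_fack, snd_una, smss, dupack_thresh=3):
    """True when the receiver's reassembly queue exceeds the threshold,
    i.e. (SND.FACK - SND.UNA) > dupack threshold, measured in bytes."""
    return (snd_fack - snd_una) > dupack_thresh * smss
```

In the example of Section 2.1, the SACK for the probe of segment 10 advances SND.FACK four segments past SND.UNA, so the test fires on the first SACK following the burst loss.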
| 3. Detecting recovered losses | 3. Detecting recovered losses | |||
| If the only loss was the last segment, there is the risk that the | If the only loss was the last segment, there is the risk that the | |||
| loss probe itself might repair the loss, effectively masking it from | loss probe itself might repair the loss, effectively masking it from | |||
| congestion control. To avoid interfering with mandatory congestion | congestion control. To avoid interfering with mandatory congestion | |||
| control [RFC5681] it is imperative that TLP include a mechanism to | control [RFC5681] it is imperative that TLP include a mechanism to | |||
| detect when the probe might have masked a loss and to properly reduce | detect when the probe might have masked a loss and to properly reduce | |||
| the congestion window (cwnd). An algorithm to examine subsequent | the congestion window (cwnd). An algorithm to examine subsequent | |||
| ACKs to determine whether the original segment was lost is described | ACKs to determine whether the original segment was lost is described | |||
| here. | here. | |||
| skipping to change at page 10, line 4 ¶ | skipping to change at page 11, line 8 ¶ | |||
| (d) the ACK does not advance SND.UNA | (d) the ACK does not advance SND.UNA | |||
| (e) the segment contains no data | (e) the segment contains no data | |||
| (f) the segment is not a window update | (f) the segment is not a window update | |||
| If all of those conditions are met, then the sender executes the | If all of those conditions are met, then the sender executes the | |||
| following: | following: | |||
| TLPRtxOut--; | TLPRtxOut--; | |||
| (b) Marking the end of a TLP episode and detecting losses | (b) Marking the end of a TLP episode and detecting losses | |||
| If an incoming ACK is after TLPHighRxt, then the sender deems the TLP | If an incoming ACK is after TLPHighRxt, then the sender deems the TLP | |||
| episode over. At that time, the TLP sender executes the following: | episode over. At that time, the TLP sender executes the following: | |||
| isLoss = (TLPRtxOut > 0); | isLoss = (TLPRtxOut > 0) && | |||
| (segment does not carry a DSACK for TLP retransmission); | ||||
| TLPRtxOut = 0 | TLPRtxOut = 0 | |||
| if (isLoss) | if (isLoss) | |||
| EnterRecovery(); | EnterRecovery(); | |||
| In other words, if the sender detects an ACK for data beyond the TLP | In other words, if the sender detects an ACK for data beyond the TLP | |||
| loss probe retransmission then (in the absence of reordering on the | loss probe retransmission then (in the absence of reordering on the | |||
| return path of ACKs) it should have received any ACKs that will | return path of ACKs) it should have received any ACKs that will | |||
| indicate whether the original or any loss probe retransmissions were | indicate whether the original or any loss probe retransmissions were | |||
| lost. If the TLPRtxOut count is still non-zero and thus indicates | lost. An exception is the case when the segment carries a Duplicate | |||
| that some TLP probe segments remain unacknowledged, then the sender | SACK (DSACK) for the TLP retransmission. If the TLPRtxOut count is | |||
| should presume that at least one segment was lost, so it should enter | still non-zero and thus indicates that some TLP probe segments remain | |||
| fast recovery using the proportional rate reduction algorithm | unacknowledged, then the sender should presume that at least one | |||
| [IMC11PRR]. | segment was lost, so it should enter fast recovery using the | |||
| proportional rate reduction algorithm [IMC11PRR]. | ||||
| (5) Senders must only send a TLP loss probe retransmission if all the | (5) Senders must only send a TLP loss probe retransmission if all the | |||
| conditions from section 2.1 are met and the following condition also | conditions from section 2.1 are met and the following condition also | |||
| holds: | holds: | |||
| (TLPRtxOut == 0) || (SND.NXT == TLPHighRxt) | (TLPRtxOut == 0) || (SND.NXT == TLPHighRxt) | |||
| This ensures that there is at most one sequence range with | This ensures that there is at most one sequence range with | |||
| outstanding TLP retransmissions. The sender maintains this invariant | outstanding TLP retransmissions. The sender maintains this invariant | |||
| so that there is at most one TLP retransmission "episode" happening | so that there is at most one TLP retransmission "episode" happening | |||
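The episode-end decision in (b) and the invariant in (5) reduce to two small predicates (illustrative Python; the function and parameter names are ours, not the draft's):

```python
def tlp_episode_is_loss(tlp_rtx_out, ack_has_dsack_for_probe):
    """Step (b): at an ACK beyond TLPHighRxt the episode ends. A loss is
    presumed if probe retransmissions remain unacknowledged (TLPRtxOut > 0)
    and the ACK carries no DSACK for the TLP retransmission; the sender
    then enters fast recovery and resets TLPRtxOut to zero."""
    return tlp_rtx_out > 0 and not ack_has_dsack_for_probe

def may_send_tlp_probe(snd_nxt, tlp_high_rxt, tlp_rtx_out):
    """Condition (5): permit a TLP retransmission only when no probe is
    outstanding, or when it would cover the same sequence range, so at
    most one TLP "episode" is in flight at a time."""
    return tlp_rtx_out == 0 or snd_nxt == tlp_high_rxt
```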
| skipping to change at page 13, line 27 ¶ | skipping to change at page 14, line 27 ¶ | |||
5.  Experiments with TLP

In this section we describe experiments and measurements with TLP
performed on Google Web servers using Linux 2.6.  The experiments
were performed over several weeks and measurements were taken across
a wide range of Google applications.  The main goal of the
experiments is to instrument and measure TLP's performance relative
to the baseline.  The experiment and baseline used the same kernels,
with an on/off switch to enable TLP.
Our experiments include both the basic TLP algorithm of Section 2 and
its loss detection component in Section 3.  All other algorithms,
such as early retransmit and FACK threshold based recovery, are
present in both the experiment and the baseline.  There are three
primary metrics we are interested in: impact on TCP latency (average
and tail or 99th percentile latency), retransmission statistics, and
the overhead of probe segments relative to the total number of
transmitted segments.  TCP latency is the time elapsed from the
server transmitting the first byte of the response to it receiving an
ACK for the last byte.
The table below shows the percentile and average latency improvements
for key Web applications, including even those responses without
losses, measured over a period of one week.  The key takeaway:
average response time improved by up to 7% and the 99th percentile by
up to 10%.  Nearly all of TLP's improvement is in the tail latency
(post-90th percentile).  The varied improvements across services are
due to different response-size distributions and traffic patterns.
For example, TLP helps the most for Images, as these are served by
multiple concurrently active TCP connections, which increase the
chances of tail segment losses.
   Application          Average    99%
   Google Web Search      -3%      -5%
   Google Maps            -5%     -10%
   Google Images          -7%     -10%
TLP also improved performance in mobile networks -- by 7.2% for Web
search and Instant and 7.6% for Images transferred over the Verizon
network.  To see why and where the latency improvements come from, we
measured the retransmission statistics, broken down by the nature of
the retransmission -- timeout retransmission or fast recovery.  TLP
reduced the number of timeouts by 15% compared to the baseline, i.e.
(timeouts_baseline - timeouts_tlp) / timeouts_baseline = 15%.
Instead, these losses were either recovered via fast recovery or by
the loss probe retransmission itself.  The largest reduction in
timeouts is when the sender is in the Open state, in which it
receives only in-sequence ACKs and no duplicate ACKs, likely because
of tail losses.  Correspondingly, the retransmissions occurring in
the slow start phase after RTO reduced by 46% relative to the
baseline.  Note that it is not always possible for TLP to convert
100% of the timeouts into fast recovery episodes because a probe
itself may be lost.  Also notable in our experiments is a significant
decrease in the number of spurious timeouts -- the experiment had 61%
fewer congestion window undo events.  The Linux TCP sender uses
either DSACK or timestamps to determine if retransmissions are
spurious, and employs techniques for undoing congestion window
reductions.  We also note that the total number of retransmissions
decreased by 7% with TLP, because of the decrease in spurious
retransmissions and because the TLP probe itself plugs a hole.
We also quantified the overhead of probe packets.  The probes
accounted for 0.48% of all outgoing segments, i.e. (number of probe
segments / number of outgoing segments) * 100 = 0.48%.  This is a
reasonable overhead when contrasted with the overall retransmission
rate of 3.2%.  10% of the probes sent were new segments and the rest
were retransmissions, which is unsurprising given that short Web
responses often have no new data to send.  We also found that in
about 33% of the cases, the probes themselves plugged the only hole
at the receiver, and the loss detection algorithm reduced the
congestion window.  37% of the probes were not necessary and resulted
in a duplicate acknowledgment.

Besides the macro level latency and retransmission statistics, we
report some measurements from TCP's internal state variables at the
point when a probe segment is transmitted.  The following
distribution shows the FlightSize and congestion window values when a
PTO is scheduled.  We note that cwnd is not the limiting factor and
that nearly all of the probe segments are sent within the congestion
window.
   percentile    10%   25%   50%   75%   90%   99%
   FlightSize     1     1     2     3    10    20
   cwnd           5    10    10    10    17    44
We have also experimented with a few variations of TLP: multiple
probe segments versus a single probe for the same tail loss episode,
and several values for WCDelAckT.  Our experiments show that sending
just one probe suffices to get most (~90%) of the latency benefits.
The experiment results reported in this section and our current
implementation limit the number of probes to one, although the draft
itself allows up to two consecutive probes.  We chose the worst-case
delayed ACK timer (WCDelAckT) to be 200ms.  When FlightSize equals 1,
it is important to account for the delayed ACK timer in the PTO
value, in order to bring down the number of unnecessary probe
segments.  With delays of 0ms and 50ms, the probe overhead jumped
from 0.48% to 3.1% and 2.2% respectively.  We have also experimented
with transmitting 1-byte probe retransmissions as opposed to an
entire-MSS retransmission probe.  While this scheme has the advantage
of not requiring the loss detection algorithm outlined in Section 3,
it turned out to be problematic to implement correctly in certain TCP
stacks.  Additionally, retransmitting a 1-byte probe costs one more
RTT to recover single-packet tail losses, which is detrimental for
short transfer latency.
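The WCDelAckT adjustment above can be sketched as follows.  This
paraphrases the PTO computation of Section 2, which is not reproduced
in this excerpt; the 2*SRTT base, the 10ms floor, and the 1.5*SRTT
term are assumptions based on that section, and only the 200ms
WCDelAckT value is stated here.

```python
WC_DEL_ACK_T = 0.200  # worst-case delayed ACK time (seconds), per the text

def probe_timeout(srtt, flight_size, rto):
    # Base PTO: assumed 2*SRTT with a small floor (per Section 2,
    # which is not shown in this excerpt).
    pto = max(2 * srtt, 0.010)
    # With a single segment in flight, the ACK may be held back by the
    # receiver's delayed ACK timer, so inflate the PTO to avoid
    # needless probes.
    if flight_size == 1:
        pto = max(pto, 1.5 * srtt + WC_DEL_ACK_T)
    # Never schedule the probe later than the RTO itself.
    return min(pto, rto)
```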
6.  Related work

TCP's long and conservative RTO recovery has long been identified as
a major performance bottleneck for latency-demanding applications.  A
well-studied example is online gaming, which requires reliability and
low latency but little bandwidth.  [GRIWODZ06] shows that repeated
long RTOs are the dominating performance bottleneck for game
responsiveness.  The authors in [PETLUND08] propose to use a linear
RTO to improve performance, which has been incorporated in the Linux
kernel as a non-default socket option for such thin streams.
[MONDAL08] further argues that exponential RTO backoff should be
removed because it is not necessary for the stability of the
Internet.  In contrast, TLP does not change the RTO timer calculation
or the exponential backoff.  TLP's approach is to keep the behavior
after RTO conservative for stability, but to allow a few timely
probes before concluding that the network is badly congested and cwnd
should fall to 1.
As noted earlier in the Introduction, the F-RTO [RFC5682] algorithm
reduces the number of spurious timeout retransmissions and the Early
Retransmit [RFC5827] mechanism reduces timeouts when a connection has
received a certain number of duplicate ACKs.  Both are complementary
to TLP and can work alongside it.  The rescue retransmission
introduced in [RFC6675] deals with loss events such as AL*SL* (using
the same notation as section 4), while TLP covers a wider range of
events, such as AL*.  We experimented with rescue retransmission on
Google Web servers, but did not observe much performance improvement.
When the last segment is lost, it is more likely that a number of
contiguous segments preceding it are also lost, i.e. AL* is common.
Timeouts that occur in fast recovery are rare.
[HURTIG13] proposes to offset the RTO timer by the elapsed time of
the pending packet when re-arming it.  It is possible to apply the
same idea to the TLP timer as well; we have not yet tested such a
change to TLP.
Tail Loss Probe is one of several algorithms designed to maximize the
robustness of TCP's self clock in the presence of losses.  It follows
the same principles as Proportional Rate Reduction [IMC11PRR] and TCP
Laminar [Laminar].
On a final note, Tail Loss Probe does not eliminate 100% of all RTOs;
RTOs still remain the dominant mode of loss recovery for short
transfers.  Future work should explore the following directions:
transmitting multiple loss probes prior to finally resorting to an
RTO; maintaining ACK clocking for short transfers in the absence of
new data by clocking out old data in response to incoming ACKs; and
taking cues from applications that indicate the end of transactions,
and using them for smarter tail loss recovery.
7.  Security Considerations

The security considerations outlined in [RFC5681] apply to this
document.  At this time we have not found any additional security
problems with Tail Loss Probe.
8.  IANA Considerations

This document makes no request of IANA.

Note to RFC Editor: this section may be removed on publication as an
RFC.
9.  References

[RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
           and Y. Nishida, "A Conservative Loss Recovery Algorithm
           Based on Selective Acknowledgment (SACK) for TCP",
           RFC 6675, August 2012.

[RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
           "Computing TCP's Retransmission Timer", RFC 6298,
           June 2011.

[RFC5827]  Allman, M., Ayesta, U., Wang, L., Blanton, J., and P.
           Hurtig, "Early Retransmit for TCP and Stream Control
           Transmission Protocol (SCTP)", RFC 5827, April 2010.

[RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
           "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
           Spurious Retransmission Timeouts with TCP", RFC 5682,
           September 2009.
[PETLUND08]
           Petlund, A., Evensen, K., Griwodz, C., and P. Halvorsen,
           "TCP enhancements for interactive thin-stream
           applications", NOSSDAV, 2008.

[MONDAL08]
           Mondal, A. and A. Kuzmanovic, "Removing Exponential
           Backoff from TCP", ACM SIGCOMM Computer Communication
           Review, 2008.

[Laminar]  Mathis, M., "Laminar TCP and the case for refactoring TCP
           congestion control", draft-mathis-tcpm-tcp-laminar (work
           in progress), July 2012.

[HURTIG13]
           Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP
           and SCTP RTO Restart", draft-ietf-tcpm-rtorestart-00 (work
           in progress), February 2013.
Authors' Addresses

Nandita Dukkipati
Google, Inc
1600 Amphitheater Parkway
Mountain View, California  93117
USA

Email: nanditad@google.com