INTERNET-DRAFT                     Amy Hughes, Joe Touch, John Heidemann
draft-ietf-tcpimpl-restart-00.txt                                    ISI
                                                          March 30, 1998
Expires: Sept. 30, 1998

              Issues in TCP Slow-Start Restart After Idle

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.''

Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft.

The distribution of this document is unlimited.

Abstract

This draft discusses variations in the TCP 'slow-start restart' (SSR) algorithm, and the unintended failure of some variations to properly restart in some environments. SSR is intended to avoid line-rate bursts after idle periods, where TCP accumulates permission to send in the form of ACKs, but does not consume that permission immediately. SSR's original "restart after send is idle" is commonly implemented as "restart after receive is idle". The latter unintentionally fails to restart for bidirectional connections where the sender's burst is triggered by a reverse-path data packet, such as in persistent HTTP. Both the former and latter are shown to permit bursts in other circumstances. Three solutions are discussed, and their implementations evaluated.

This document is a product of the LSAM project at ISI. Comments are solicited and should be addressed to the authors.

Introduction

Slow-Start Restart (SSR) describes one TCP behavior to respond to long sending pauses in an open connection.
When a sender becomes idle, the normal ack-clocking mechanism that regulates traffic is no longer present, and the sender may introduce a burst of packets into the network as large as the current congestion window (CWND). Such a burst may be too large for the intermediate routers to handle and may also be too large for the receiver to handle at one time.

A send timer was first proposed [JK90] to detect idle sending periods; the recommended response is to close the congestion window and perform a new slow-start. However, a footnote to this first proposed solution noted that send/receive symmetry on the channel meant that a receive timer could be used instead to achieve the same results. As this second solution takes advantage of a timer that is already required (to detect packet loss), it was implemented by Jacobson and Karels. This solution has been repeated in implementations that derive from their work.

Bursty connections, such as the persistent connections required in HTTP/1.1 [FGMFB97], have been found to interact in meaningful ways with SSR [Hei97]. In fact, it was discovered that SSR never occurs with HTTP/1.1 [Poo97]. This is because a new request resets the receive timer (as suggested in the footnote in [JK90]), so the sending pause is not detected [Tou97]. Further, both timer solutions depend on the retransmit timeout (RTO) and cannot detect send pauses that are shorter than this duration. In such cases, the sender may transmit a burst as large as the full congestion window.

Burst Detection

There are several ways of determining whether a connection is at risk of sending a burst of packets into the channel. We discuss each method below, from the least radical to the most radical.

Receive Timer: The use of a receive timer is the most common burst detection method. It is attractive because it is simple and makes use of an existing timer.
However, a receive timer does not properly detect bursts in HTTP/1.1 because the timer is cancelled when the request packet is received. Further, when the connection is idle for less than a full RTO, a burst cannot be detected. Such a burst can happen when the connection is "nearly idle" or when acks are lost or reordered.

Send Timer: A send timer is the reciprocal solution to using a receive timer. While it requires a new timestamp field to be maintained, it clearly detects send pauses and corrects the problem presented by HTTP/1.1. However, as with the receive timer, it cannot detect bursts that could happen before a full RTO.

Packet Counting: An alternative method examines the unused portion of the congestion window to determine if the capacity to burst exists. This method is simple, uses existing information to make its decision, and solves both the HTTP/1.1 problem and the RTO problem. In addition, it addresses the problem that needs to be solved (bursts) instead of a specific circumstance where the problem could happen (send pauses). However, where timer detection avoids defining a burst (it defines idle periods instead), here a burst must be defined before it can be detected. One possible definition is the situation where the available portion of the sending window is some proportion of the entire congestion window, say 50%. Another definition places a numerical limit on the available portion of the congestion window, say 4 or CWND-1 packets.

Burst Response

Once a burst is detected, there are several different ways to take action. The different possibilities are listed below, again from least to most radical.

Full Restart: Reducing the congestion window to one packet and re-entering slow-start, as in the original slow-start restart, is one response. This was the solution proposed by Jacobson and Karels [JK90]. This is a very conservative response and it defeats most of the speedup that HTTP/1.1 provides [HOT97].
Current proposals [FAP97] have suggested increasing the initial window from 1 packet to 4 packets. Further, depending on the method of burst detection, Full Restart can be far more punitive than it should be. Coupled with a timer, Full Restart is most likely to respond to a completely empty congestion window. Coupled with Packet Counting, the response could close the window too far, even below the amount of outstanding data.

Window Limiting: This is a modified version of Full Restart which solves the problem created by using Packet Counting to detect bursts. With this type of response, the congestion window is reduced to the amount of outstanding data plus the slow-start initial window (1, 2, or 4). It works exactly like Full Restart in the idle case, but is successful at controlling bursts in an active connection. Further, in an active connection, it effectively implements a leaky bucket of the initial window size for the accumulation of send opportunity based on the receipt of acks. This solution is fairly conservative, especially as it defaults to Full Restart, but more importantly, sending opportunity is simply lost if not used, and is not available for paced output. Also, it forces negative congestion feedback on the congestion window.

Burst Size Limitation: When a burst is detected, its effects are limited: the sender may not send more than a preset number of packets into the network. It is less conservative than the first two responses in that it does not affect the size of the congestion window. It is also simple to implement: count the packets sent and stop when the limit is reached. Whether to wait for an ack or some other signal to resume sending is an implementation detail. Lastly, this burst response can be performed after each ack or with each send. The behavior is slightly different in each case.
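As a rough illustration of Burst Size Limitation combined with Packet Counting, the following C sketch computes how many packets a sender may emit back-to-back. It is a sketch under stated assumptions, not code from any implementation discussed in this draft: the function names, the packet-based units, and the max_burst parameter are all hypothetical.

```c
/* Sketch only: names and packet-based units are assumptions made for
 * illustration; they are not taken from any TCP implementation. */

/* Packet Counting: unused portion of the congestion window, in packets. */
static unsigned available_window(unsigned cwnd, unsigned outstanding)
{
    return (cwnd > outstanding) ? cwnd - outstanding : 0;
}

/* Burst Size Limitation: cap one back-to-back send at max_burst packets
 * while leaving cwnd itself untouched. */
static unsigned packets_to_send(unsigned cwnd, unsigned outstanding,
                                unsigned max_burst)
{
    unsigned avail = available_window(cwnd, outstanding);
    return (avail > max_burst) ? max_burst : avail;
}
```

With a window of 16 packets, nothing outstanding, and a limit of 4, packets_to_send(16, 0, 4) yields 4. With 14 packets outstanding it yields 2, because only two packets' worth of window remains available.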

Pacing: When a burst is detected, packets are dribbled into the network until the sender starts receiving acks and normal maintenance can be resumed [VH97]. This solution is very easy on the network and scales well in cases of high bandwidth-delay product. However, it requires a new timer, and its parameter tuning requires more research.

Implemented Solutions

Now we will examine combinations of the different detection and response methods presented above. Each of the solutions below has been implemented in some form.

BSD Implementation (Jacobson and Karels)

The most common implementation uses a receive timer coupled with Full Restart. This is the implementation that causes the interaction problems with HTTP/1.1.

The obvious alternative is to implement a send timer as originally intended and use Full Restart. There are several drawbacks to this solution. First, a send timer adds additional state and serves no purpose other than to correct the bursting behavior after send pauses. Second, forcing a slow-start in this situation is problematic for HTTP/1.1. A slow-start for each new user request adds a delay burden to characteristically small HTTP responses. Further, the HTTP user request pattern is unpredictable. It is possible for the user to make a new request before the send timer expires, triggering a burst that would defeat such a timer.

Maximum Burst Limitation (Floyd)

Floyd has proposed a coupling of Packet Counting with Burst Size Limitation. This solution has been implemented in ns, and it prevents the sender from transmitting a series of back-to-back packets larger than the user-configured burst limit (suggested to be 4 packets) [NS97].

There are several issues involved with recovering from a burst, and the ns implementation doesn't address them consistently. First, it is not clear when the sender is allowed to send again after sending the first limited burst of packets. One implementation requires the sender to wait for the burst timer to expire.
Another seems to allow a series of short bursts. Another issue is how the simulation implementation and usage translate to a live network situation. The implementation of this solution can range from simple to more complex.

Congestion Window Monitoring (Hughes, Touch, and Heidemann)

Our proposed solution combines Packet Counting with Window Limiting. Whenever (CWND - outstanding data > 4), we reduce CWND to (outstanding data + 4). The choice of 4 packets is discussed with the implementation details below.

Congestion Window Monitoring (CWM) allows the congestion window to grow normally but shrinks the congestion window as the sender becomes idle. It also prevents the sender from transmitting any bursts larger than 4 packets in response to a new request. Because CWM is not dependent on any timers, the loss of an ack or a nearly idle connection cannot cause any bursts.

CWM is similar to Burst Limitation, but avoids the burst by reducing CWND, rather than by inhibiting the sends directly. As a result, we avoid the potential problem of sequential calls to TCP_output, which could cause bursts under Burst Limitation but not under CWM. CWM also causes TCP to use the feedback of 'not using the CWND fast enough', which results in a decrease in the CWND.

CWM effectively imposes a leaky-bucket limitation on the congestion window. The window is allowed to grow and be managed normally, but the sender is not allowed to save up any sending opportunities. Any opportunity that is not used is lost. This property of CWM forces interleaved reception of acks and processing of sends.

Rate Based Pacing (Visweswaraiah and Heidemann)

Rate Based Pacing (RBP) combines the Pacing response with either a Send Timer or Packet Counting. It avoids slow-start when resuming after sending pauses and allows the normal clocking of packets to be gracefully restarted. When a burst potential is detected, the algorithm meters a small burst of packets into the channel [VH97].
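To make the idea of metering packets concrete, the following C sketch spaces a window of packets evenly over one round-trip time instead of sending them back-to-back. Even spacing over one RTT is an assumption made purely for illustration; it is not the rate estimate used by RBP in [VH97], and the function names and microsecond units are hypothetical.

```c
#include <assert.h>

/* Sketch only: even spacing over one RTT is an assumption made here for
 * illustration; it is not the rate estimate used by RBP in [VH97]. */

/* Interval between paced packets, in microseconds, when `window` packets
 * are spread evenly across one round-trip time. */
static unsigned long pacing_interval_us(unsigned long rtt_us, unsigned window)
{
    assert(window > 0);   /* an empty window has nothing to pace */
    return rtt_us / window;
}

/* Send time of the k-th paced packet (k = 0 is sent immediately),
 * in microseconds from the moment pacing starts. */
static unsigned long paced_send_time_us(unsigned long rtt_us, unsigned window,
                                        unsigned k)
{
    return (unsigned long)k * pacing_interval_us(rtt_us, window);
}
```

With a 100 ms round-trip time and a window of 8 packets, the interval is 12500 microseconds, so the fourth packet (k = 3) would leave 37500 microseconds after the first. A real sender would drive these send times from the new timer that the Pacing response requires.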

RBP is the least conservative solution to the bursting problem because it continues to make use of the pre-pause congestion window. If network conditions have changed significantly, maintaining the previous window could cause the paced connection to be overly aggressive as compared to other connections. (Although some work suggests congestion windows are stable over multi-minute timeframes [BSSK97].) More recently, pacing has been suggested for use in wireless networking scenarios [BPK97], and for satellite connections.

Experimental Comparisons

Packet traces of the current FreeBSD implementation of SSR (using the receive timer), of a modified version of FreeBSD using a send timer, and of CWM with HTTP/1.1 support the above observations. In all of the traces, the response pattern for the first request is the same with each method. This shows that CWM allows the congestion window to grow normally. Because of the different actions taken by the three algorithms, the response pattern for the second request differs, as would be expected.

[We have graphs available upon request]

When the second request arrives at the server after the retransmission timeout (RTO), normal FreeBSD allows the server to respond with a burst of packets. FreeBSD using a send timer responds by entering slow-start. CWM allows a 4-packet burst.

When the second request arrives at the server before the RTO, both timer implementations allow a burst. CWM again limits the burst to 4 packets.

Note that RTO is the common timer limit, but any value would have the same results, depending on when the second request was presented in relation to the timer.

Implementation of Congestion Window Monitoring

Congestion Window Monitoring requires a simple modification to existing TCP output routines. The changes required replace the current idle detection code.
Replace the existing 3 lines of code:

      idle = (snd_max == snd_una);
      if (idle && now - lastrcv >= rto)
          cwnd = 1;

with the following 3 lines of code:

      maxwin = 4 + snd_nxt - snd_una;
      if (cwnd > maxwin)
          cwnd = maxwin;

Packet Counting is implemented by line 1. Lines 2 and 3 implement Window Limiting.

The choice of limiting the available congestion window to 4 packets is based on the normal operation of TCP. An ACK received by the sender may be in response to the receipt of 2 packets, allowing another 2 to be sent. Further, normal window growth may require the sending of a third packet. Lastly, in slow-start with delayed ACKs, the receipt of an ACK can trigger the sending of 4 packets. Thus, 4 packets is a reasonable burst to send into the network.

Increasing the initial window in slow-start to 4 packets has already been proposed [FAP97]. The effects of this change have been explored in simulation in [PN98] and in practice in [AHO97]. Such a modification to TCP would cause the same behavior as our solution in the cases where the pause timer has expired. It does not address the pre-timeout bursting situation we are concerned with.

Conclusions

At this time, we propose CWM as a simple, minimal, and effective fix to the 'bug' in current TCP implementations that is exploited by HTTP/1.1. Modifications can be made to TCP to solve the slow-start restart problem that are consistent with the original congestion avoidance specifications (i.e., a send timer). However, we feel that the original intended behavior is not appropriate to some current applications, specifically HTTP.

Thus, we recommend Congestion Window Monitoring to prevent bursts into the network. Not only does this solution solve the current problem in a simple way, it will prevent bursting in any other situation that might arise. The 4-packet bursts which we allow are consistent with congestion window growth algorithms and with Floyd's conclusion about increasing the initial window size.
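The behavior summarized in the experimental comparisons can be reproduced with a small model. The following C sketch is a toy illustration under stated assumptions, not the FreeBSD code: it works in packet and clock-tick units, and the struct, enum, and function names are hypothetical. It models a new request arriving on an idle connection, where the request itself refreshes the receive timestamp.

```c
/* Sketch only: a toy model in packet and clock-tick units. The struct,
 * enum, and function names are hypothetical, not taken from FreeBSD. */

enum policy { RECV_TIMER, SEND_TIMER, CWM };

struct conn {
    unsigned cwnd;        /* congestion window, in packets */
    unsigned outstanding; /* unacknowledged packets (snd_nxt - snd_una) */
    unsigned last_snd;    /* tick at which the sender last transmitted */
};

/* Packets the sender may emit back-to-back when a new request arrives
 * at tick `now`, under the given restart policy. */
static unsigned burst_on_request(struct conn c, enum policy p,
                                 unsigned now, unsigned rto)
{
    int idle = (c.outstanding == 0);
    unsigned last_rcv = now;  /* the request refreshes the receive timestamp */

    switch (p) {
    case RECV_TIMER:          /* cannot fire for rto > 0: now - last_rcv is 0 */
        if (idle && now - last_rcv >= rto)
            c.cwnd = 1;
        break;
    case SEND_TIMER:          /* fires only after a full RTO without sending */
        if (idle && now - c.last_snd >= rto)
            c.cwnd = 1;
        break;
    case CWM:                 /* clamp cwnd to outstanding data plus 4 */
        if (c.cwnd > 4 + c.outstanding)
            c.cwnd = 4 + c.outstanding;
        break;
    }
    return (c.cwnd > c.outstanding) ? c.cwnd - c.outstanding : 0;
}
```

For an idle connection with a 16-packet window that last sent at tick 0 and an RTO of 3 ticks, a request at tick 10 yields bursts of 16, 1, and 4 packets for the receive timer, send timer, and CWM respectively. A request at tick 2, before the RTO, yields 16, 16, and 4.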

CWM, as well as the other solutions listed, needs to be re-evaluated within emerging TCP implementations, e.g., SACK [JB88]. In general, TCP has no rate pacing and uses congestion control to avoid bursts in current implementations. A more explicit mechanism, such as RBP or similar proposals, may be desirable in the future.

Security Implications

CWM presents no security problems.

References

[AHO97] Mark Allman, Chris Hayes, and Shawn Ostermann. An Evaluation of TCP Slow Start Modifications, July 1997. (Submitted to CCR, draft available from http://jarok.cs.ohiou.edu/papers/)

[BPK97] Hari Balakrishnan, Venkata N. Padmanabhan, and Randy H. Katz. The Effects of Asymmetry on TCP Performance. In Proceedings of the ACM/IEEE Mobicom, Budapest, Hungary, ACM. September, 1997.

[BSSK97] Hari Balakrishnan, Srinivasan Seshan, Mark Stemm, and Randy H. Katz. Analyzing Stability in Wide-Area Network Performance. In Proceedings of the ACM SIGMETRICS, Seattle WA, USA, ACM. June, 1997.

[FGMFB97] R. Fielding, Jim Gettys, Jeffrey C. Mogul, H. Frystyk, and Tim Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1, January 1997. RFC 2068.

[FAP97] Sally Floyd, Mark Allman, and Craig Partridge. Increasing TCP's Initial Window, July 1997. Internet Draft draft-floyd-incr-init-win-01.txt

[Hei97] John Heidemann. Performance Interactions Between P-HTTP and TCP Implementations. ACM Computer Communications Review, 27(2), 65-73, April 1997.

[HOT97] John Heidemann, Katia Obraczka, and Joe Touch. Modeling the Performance of HTTP Over Several Transport Protocols. ACM/IEEE Transactions on Networking 5(5), 616-630, October, 1997.

[JB88] Van Jacobson and R.T. Braden. TCP extensions for long-delay paths, October 1988. RFC 1072.

[JK90] Van Jacobson and Michael J. Karels. Congestion Avoidance and Control. ACM Computer Communication Review, 18(4):314-329, August 1990. Revised version of the SIGCOMM '88 paper.

[NS97] ns Network Simulator. http://www-mash.cs.berkeley.edu/ns/, 1997.

[PN98] K. Poduri and K. Nichols. Simulation Studies of Increased Initial TCP Window Size, February 1998. Internet Draft draft-ietf-tcpimpl-poduri-00.txt

[Poo97] Kacheong Poon, Sun Microsystems, tcp-implementors mailing list, August, 1997.

[Tou97] Joe Touch, ISI, tcp-implementors mailing list, August 12, 1997.

[VH97] Vikram Visweswaraiah and John Heidemann. Improving Restart of Idle TCP Connections. Technical Report 97-661, University of Southern California, November 1997.

Authors' Addresses

Amy Hughes, Joe Touch, John Heidemann
University of Southern California/Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292-6695
USA
Phone: +1 310-822-1511
Fax: +1 310-823-6714
URLs: http://www.isi.edu/~ahughes
      http://www.isi.edu/~touch
      http://www.isi.edu/~johnh
Email: ahughes@isi.edu
       touch@isi.edu
       johnh@isi.edu