idnits 2.17.1 draft-hughes-restart-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-24) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** The abstract seems to contain references ([AHO98], [BSSK97], [VH97,PK98], [JK92], [JB88], [BPK97], [VH97], [NS97], [Hei97], [APS99], [PN98], [Tou97], [FAP97], [HOT97], [Poo97], [FGMFB97]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The "Author's Address" (or "Authors' Addresses") section title is misspelled. == Line 593 has weird spacing: '... int ack_r...' == Line 594 has weird spacing: '... int bucke...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Possible downref: Non-RFC (?) normative reference: ref. 'AHO98' ** Obsolete normative reference: RFC 2581 (ref. 'APS99') (Obsoleted by RFC 5681) -- Possible downref: Non-RFC (?) normative reference: ref. 'BPK97' -- Possible downref: Non-RFC (?) normative reference: ref. 'BSSK97' -- Possible downref: Non-RFC (?) normative reference: ref. 'FGMFB97' ** Obsolete normative reference: RFC 2414 (ref. 'FAP97') (Obsoleted by RFC 3390) -- Possible downref: Non-RFC (?) normative reference: ref. 'Hei97' -- Possible downref: Non-RFC (?) normative reference: ref. 'HOT97' ** Obsolete normative reference: RFC 1072 (ref. 'JB88') (Obsoleted by RFC 1323, RFC 2018, RFC 6247) -- Possible downref: Non-RFC (?) normative reference: ref. 'Jac88' -- Possible downref: Non-RFC (?) normative reference: ref. 'JK92' -- Possible downref: Non-RFC (?) normative reference: ref. 'NS97' -- Possible downref: Non-RFC (?) normative reference: ref. 'PK98' ** Downref: Normative reference to an Informational RFC: RFC 2415 (ref. 'PN98') -- Possible downref: Non-RFC (?) normative reference: ref. 'Poo97' -- Possible downref: Non-RFC (?) normative reference: ref. 'Tou97' -- Possible downref: Non-RFC (?) normative reference: ref. 'VH97' Summary: 11 errors (**), 0 flaws (~~), 4 warnings (==), 16 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 INTERNET-DRAFT Amy S. Hughes, Joe Touch, John Heidemann 3 draft-hughes-restart-00.txt ISI 4 Dec. 1, 2001 5 Expires: Jun. 1, 2002 7 Issues in TCP Slow-Start Restart After Idle 9 Status of this Memo 11 This document is an Internet-Draft and is NOT offered in accordance 12 with Section 10 of RFC2026, and the author does not provide the IETF 13 with any rights other than to publish as an Internet-Draft. 14 Internet-Drafts are working documents of the Internet Engineering 15 Task Force (IETF), its areas, and its working groups. Note that 16 other groups may also distribute working documents as Internet- 17 Drafts. 19 Internet-Drafts are draft documents valid for a maximum of six months 20 and may be updated, replaced, or obsoleted by other documents at any 21 time. It is inappropriate to use Internet-Drafts as reference 22 material or to cite them other than as "work in progress." 24 The list of current Internet-Drafts can be accessed at 25 http://www.ietf.org/1id-abstracts.html 27 The list of Internet-Draft Shadow Directories can be accessed at 28 http://www.ietf.org/shadow.html 30 The distribution of this document is unlimited. 32 Abstract 34 This draft discusses variations in the TCP 'slow-start restart' (SSR) 35 algorithm, and the unintended failure of some variations to properly 36 restart in some environments. SSR is intended to avoid line-rate 37 bursts after idle periods, where TCP accumulates permission to send 38 in the form of ACKs, but does not consume that permission 39 immediately. SSR's original "restart after send is idle" is commonly 40 implemented as "restart after receive is idle". The latter 41 unintentionally fails to restart for bidirectional connections where 42 the sender's burst is triggered by a reverse-path data packet, such 43 as in persistent HTTP. Both the former and latter are shown to permit 44 bursts in other circumstances. Several solutions are discussed, and 45 their implementations evaluated. 47 This document updates draft-ietf-tcpimpl-restart-01.txt. It is a 48 product of the LSAM, X-Bone, and DynaBone projects at ISI. Comments 49 are solicited and should be addressed to the authors. 51 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 53 Introduction 55 Slow-Start Restart (SSR) describes one TCP behavior to respond to 56 long sending pauses in an open connection. When a sender becomes 57 idle, the normal ACK-clocking mechanism which regulates traffic is no 58 longer present and the sender may introduce a burst of packets into 59 the network as large as the current congestion window (CWND). Such a 60 burst may be too large for the intermediate routers to handle and may 61 be too large for the receiver to handle at one time as well. 63 A send timer was first proposed [JK92] to detect idle sending 64 periods; the recommended response is to close the congestion window 65 and perform a new slow-start. However, a footnote to this first 66 proposed solution noted that send/receive symmetry on the channel 67 meant that a receive timer could be used instead to achieve the same 68 results. As this second solution takes advantage of a timer that is 69 already required (to detect packet loss) it was implemented by 70 Jacobson and Karels. This solution has been repeated in 71 implementations which derive from their work. 73 Bursty connections, such as the persistent connections required in 74 HTTP/1.1 [FGMFB97] have been found to interact in meaningful ways 75 with SSR [Hei97]. In fact, it was discovered that SSR never occurs 76 with HTTP/1.1 [Poo97]. This is because a new request will reset the 77 receive timer (as suggested in the footnote in [JK92]) and the 78 sending pause will not be detected [Tou97]. 80 Further, both timer solutions depend on the retransmit timeout (RTO) 81 and cannot detect send pauses that are shorter than this duration. 82 In such cases, the sender may transmit a burst as large as the full 83 congestion window. 85 Burst Detection. 87 There are several ways of determining whether a connection is at risk 88 of sending a burst of packets into the channel. We will discuss each 89 method below, from the least radical to the most radical. 91 Receive Timer: 93 The use of a receive timer is the most common burst detection method. 94 It is attractive because it is simple and makes use of an existing 95 timer. However, a receive timer does not properly detect bursts in 96 HTTP/1.1 because the timer is cancelled when the request packet is 97 received. Further, when the connection is idle for less than a full 98 RTO, a burst cannot be detected. Such a burst can happen when the 99 connection is "nearly idle" or when ACKs are lost or reordered. 101 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 103 Send Timer: 105 A send timer is the reciprocal solution to using a receive timer. 106 While it requires a new timestamp field to be maintained, it clearly 107 detects send pauses and corrects the problem presented by HTTP/1.1. 108 However, as with the receive timer, it cannot detect bursts that 109 could happen before a full RTO. 111 Packet Counting: 113 An alternative method examines the unused portion of the congestion 114 window to determine if the capacity to burst exists. This method is 115 simple, it uses existing information to make its decision, and it 116 solves both the HTTP/1.1. problem as well as the RTO problem. In 117 addition, it addresses the problem that needs to be solved (bursts) 118 instead of a specific circumstance where the problem could happen 119 (send pauses). However, where timer detection avoids defining a 120 burst (it defines idle periods instead), here a burst must be defined 121 before it can be detected. One possible definition is the situation 122 where the available portion of the sending window is some proportion 123 of the entire congestion window, say 50%. Another definition places 124 a numerical limit on the available portion of the congestion window, 125 say 4 or CWND-1 packets. 127 Burst Response 129 Once a burst is detected, there are several different ways to take 130 action. The different possibilities are listed below, again from 131 least to most radical. 133 Full Restart: 135 Reducing the congestion window to one packet and re-entering slow- 136 start, the original slow-start restart, is one response. This was 137 the solution proposed by Jacobson and Karels. This is a very 138 conservative response and it defeats most of the speedup that 139 HTTP/1.1 provides [HOT97]. Current proposals [FAP97] have suggested 140 increasing the initial window from 1 packet to 4 packets. Further, 141 depending on the method of burst detection, Full Restart can be far 142 more punitive than it should be. Coupled with a timer, full restart 143 is most likely to respond to a completely empty congestion window. 144 Coupled with Packet Counting, the response could close the window too 145 far, even smaller than the amount of outstanding data. 147 Window Limiting: 149 This is a modified version of Full Restart which solves the problem 150 created by using Packet Counting to detect bursts. With this type of 152 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 154 response, the congestion window is reduced to the amount of 155 outstanding data plus the slow-start initial window (1, 2, or 4). It 156 works exactly like Full Restart in the idle case, but is successful 157 at controlling bursts in an active connection. Further, in an active 158 connection, it effectively implements a leaky bucket of the initial 159 window size for the accumulation of send opportunity based on the 160 receipt of ACKs. This solution is fairly conservative, especially as 161 it defaults to Full Restart, but more importantly, sending 162 opportunity is simply lost if not used, and is not available for 163 paced output. Also, it forces negative congestion feedback on the 164 congestion window. 166 Burst Size Limitation: 168 When a burst is detected, its effects are limited, the sender may not 169 send any more than a preset number of packets into the network. This 170 is less conservative than the first two responses in that it does not 171 affect the size of the congestion window, and it is simple to 172 implement, simply count up the number of packets you can send and 173 stop when you reach the limit. The burst count is reset according to 174 some policy, such as when ACKs are received, each time send is 175 called, or when some timer triggers. The behavior is determined by 176 how these parameters are combined to reset the count. 178 Pacing: 180 When a burst is detected, packets are periodically sent into the 181 network until the sender starts receiving acks and normal maintenance 182 can be resumed [VH97,PK98]. This solution is very easy on the 183 network and scales well in cases of high bw/delay. However, it 184 requires a new timer and additional research for parameter tuning. 186 Implemented Solutions 188 Now we will examine combinations of the different detection and 189 response methods presented above. Each of the solutions below have 190 been implemented in some form. 192 BSD Implementation (Jacobson and Karels) 194 The most common implementation uses a receive timer coupled with Full 195 Restart. This is the implementation that causes the interaction 196 problems with HTTP/1.1. The obvious alternative is to implement a 197 send timer as originally intended and use Full Restart. There are 198 several drawbacks to this solution. First, a send timer adds 199 additional state and serves no purpose other than to correct the 200 bursting behavior after send pauses. Second, forcing a slow-start in 201 this situation is problematic for HTTP/1.1. A slow-start for each 203 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 205 new user request adds a delay burden to characteristically small HTTP 206 responses. Further, the HTTP user request pattern is unpredictable. 207 It is possible for the user to make a new request before the send 208 timer expires, triggering a burst that would defeat such a timer. 210 Maximum Burst Limitation (Floyd) 212 Floyd has proposed a coupling of Packet Counting with Burst Size 213 Limitation. It was developed to avoid bursts caused by gaps in the 214 ACK stream. Such gaps cause the send window to slide forward in 215 jumps, rather than smoothly by 1-2 MTUs; subsequent send calls can 216 thus generate bursts. This solution has been implemented in "ns" and 217 it prevents the sender from transmitting a series of back-to-back 218 packets larger than the user configured burst limit (suggested to be 219 5 packets) [NS97]. 221 Maxburst defines a new function, send_much, with a parameter, 222 maxburst, which is called every time an external event occurs, such 223 as ACK is received or a timeout occurs. Each time send_much is 224 called, at most maxburst packets are emitted. This forces interleaved 225 processing of ACKs and sending of data. Sends initiated by the 226 arrival of data in the buffer are not affected; as a result, if there 227 is less than RTO of delay, and the send buffer is filled by the 228 application, Maxburst can still send a burst the size of an entire 229 window. 231 Maxburst does not explicitly address the sending of bursts after an 232 idle period; it relies on the existing mechanism to collapse CWND 233 after an RTO, because send_much does not replace send as the 234 application interface to transmitting data. 236 The limit of 5 packets is designed to allow non-sequental ACK losses, 237 and allow the CWND to grow normally under slow-start. Here is a list 238 of the possible burst limits, and the need for each: 240 1 packet per ACK or nothing gets done 242 2 packets per ACK if ACKs are compressed 244 3 packets during window growth; 2 for the compressed ACK, 1 for 245 the growth 247 4 packets per ACK received if ACKs are lost (not back-to-back), 248 and ACKs are compressed 250 5 packets per ACK received if ACKs are lost (not back-to-back), 251 ACKs are compressed, and the window is growing 253 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 255 Use it or lose it (UI/LI) (Hughes, Touch, and Heidemann) 257 One proposed solution combines Packet Counting with Window Limiting, 258 and the IETF community refers to it as "Use-It-or-Lose-It" (it was 259 originally called Congestion Window Monitoring, CWM, but we will use 260 the acronym UI/LI, pronounced 'wee-lee', here). Whenever (CWND - 261 outstanding data > 4), we reduce CWND to (outstanding data + 4). The 262 choice of 4 packets is discussed in with the implementation details 263 below. UI/LI allows the congestion window to grow normally but 264 shrinks the congestion window as the sender becomes idle. It also 265 prevents the sender from transmitting any bursts larger than 4 266 packets in response to a new request. Because UI/LI is not dependent 267 on any timers, the loss of an ACK or a nearly idle connection cannot 268 cause any bursts. UI/LI is similar to Maxburst, but avoids the burst 269 by reducing CWND, rather than by inhibiting the sends directly. As a 270 result, we avoid the potential problem of sequential calls to 271 "tcp_output", which would cause bursts in Maxburst. UI/LI also 272 causes TCP to use the feedback of 'not using the CWND fast enough', 273 which results in a decrease in the CWND. 275 UI/LI effectively imposes a leaky bucket type limitation on the 276 congestion window. The window is allowed to grow and be managed 277 normally but the sender is not allowed to save up any sending 278 opportunities. Any opportunity that is not used is lost. This 279 property of UI/LI forces interleaved reception of ACKs and processing 280 of sends. 282 Burst-or-lose (Touch, Floyd) 284 We have considered a modified version of Maxburst, in which all 285 sends, including those initiated by data arrival from the 286 application, limit their burst size each time they are called, and 287 reset their flag at that time. We call this burst-or-lose (BOL). The 288 result is very similar to UI/LI, but the bucket reflects only the 289 permission to send created as the result of an ACK receipt, rather 290 than the overall limit of the congestion window. 292 In this case, it is unncessary to collapse CWND after an RTO. The 293 primary reason for the collapse is to avoid bursts; BOL avoids bursts 294 completely, but recovers as quickly as the ACK clocking can be 295 restored. 297 In the case when there is no data to send, multiple ACK arrivals do 298 not increase the bucket available for sending. This avoids the bursts 299 after idle periods. In the case when ACKs are lost, the remaining ACK 300 generates permission to send only a fixed amount of data. The 301 application cannot defeat this bucket by calling send multiple times. 303 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 305 Rate Based Pacing (Visweswaraiah and Heidemann) 307 Rate Based Pacing (RBP) combines the Pacing response with either a 308 Send Timer or Packet Counting. It avoids slow-start when resuming 309 after sending pauses and allows the normal clocking of packets to be 310 gracefully restarted. When a burst potential is detected, the 311 algorithm meters a small burst of packets into the channel [VH97]. 312 RBP is the least conservative solution to the bursting problem 313 because it continues to make use of the pre-pause congestion window. 314 If network conditions have changed significantly, maintaining the 315 previous window could cause the paced connection to be overly 316 aggressive as compared to other connections. (Although some work 317 suggests congestion windows are stable over multi-minute timeframes 318 [BSSK97].) More recently pacing has been suggested for use in 319 wireless networking scenarios [BPK97], and for satellite connections. 321 Solution Comparison 323 Below we present a comparison of the solutions presented above, plus 324 the original Send Timer solution. The Send Timer solution is not in 325 use, but we implemented it in FreeBSD for comparison purposes. 327 We contrast each solution on four criteria. A "yes" in the "HTTP 328 Burst" column indicates that this solution will solve the bursting 329 problem created by HTTP/1.1. A "yes" in the "Other Burst" column 330 indicates that the solution will prevent bursts in other situations 331 as well. A "yes" in the "Adj CWND" column indicates that the 332 solution involves adjusting the value of CWND. The final column, 333 "Code Size", gives the number of lines of code needed to correctly 334 implement the solution. 336 | 337 | HTTP Burst Other Burst Adj CWND Code Size 338 -----------+------------------------------------------------- 339 Send Timer | yes no yes ~5 lines 340 -----------+------------------------------------------------- 341 Rcv Timer | no no yes 3 lines 342 -----------+------------------------------------------------- 343 Maxburst | yes yes no ~5 lines 344 -----------+------------------------------------------------- 345 UI/LI | yes yes yes 3 lines 346 -----------+------------------------------------------------- 347 BOL | yes yes no ~5 lines 348 -----------+------------------------------------------------- 349 RBP | yes yes no ~30 lines 351 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 353 There are other conditions under which to compare these algorithms, 354 including behavior after a single ACK loss, behavior after multiple ACK 355 losses (sequential and non-sequential), and the window regrowth proper- 356 ties. These are outside the scope of this document. 358 | 359 | Recovery rate 360 -----------+------------------------------------------------- 361 Send Timer | SS then CA 362 -----------+------------------------------------------------- 363 Rcv Timer | SS then CA 364 -----------+------------------------------------------------- 365 Maxburst | (not applicable) 366 -----------+------------------------------------------------- 367 UI/LI | SS and or CA, depending on SSTHRESH 368 -----------+------------------------------------------------- 369 BOL | (not applicable) 370 -----------+------------------------------------------------- 371 RBP | 1 RTT 373 The only significant difference between UI/LI and BOL is whether CWND is 374 affected, and thus the recovery period after the loss of an ACK. UI/LI 375 interprets ACK loss as an event warranting traditional TCP window recov- 376 ery, whereas BOL and Maxburst do not. Both UI/LI and BOL avoid the need 377 for idle detection for the purpose of CWND adjustment, and implement a 378 type of leaky bucket. Although UI/LI was intended as a type of leaky- 379 bucket system, BOL is much more directly analogous. 381 Experimental Comparisons 383 Packet traces of the current FreeBSD implementation of SSR (using the 384 receive timer), of a modified version of FreeBSD using a send timer, and 385 of UI/LI with HTTP/1.1 support the above observations. The graphs below 386 show traces of two HTTP/1.1 requests. The first request opens the CWND. 387 How the server responds to the packet containing the second request 388 reveals the differences between implementations. 390 The first graph shows the trace with FreeBSD v2.2.6, but the relevant 391 portions of the code have not been altered in more current releases. In 392 response to the first request, the server performs slow-start normally. 393 After 10 seconds, and after the retransmission timeout (RTO), the second 394 request arrives. At this point, receipt of the second request resets 395 the receive timer. The server, under the belief that communication has 396 not gone idle, sends a CWND-sized burst of packets into the network. 398 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 400 Idle trace with Receive Timer 401 sequence number (in bytes) 402 +------+-------+------+-------+------+-------+------+D-A---+ 403 + + + + + + + DDAA | 404 100000 ++ Client_ACK A DDAA ++ 405 | Client_SEG C D A A | 406 | Server_SEG D D A | 407 80000 ++ D A ++ 408 | D A | 409 | D | 410 60000 ++ D AA D ++ 411 | D A | 412 | D A | 413 | D ADDAA | 414 40000 ++ D A ++ 415 | D A | 416 | D AAD A | 417 20000 ++ D A ++ 418 | D A A | 419 + +D AAD A+ + + + + + | 420 0 ++A-D-A+-------+------+-------+------+-C-----+------+-----++ 421 0 2 4 6 8 10 12 14 422 time since first SYN (in seconds) 424 The second graph shows the trace with FreeBSD v2.2.6 altered to use a 425 send timer to detect idle communication. As expected, the server 426 responds to the first request with normal slow-start. Now, the receipt 427 of the second request comes in after the RTO and the idle period is cor- 428 rectly detected. In response, the server re-enters slow-start for the 429 second response. 431 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 433 Idle trace with Send Timer 434 sequence number (in bytes) 435 +-------------+-------------+------------+-------------+DAAA 436 + + + + D AAAA| 437 100000 ++ Client_ACK A DADA ++ 438 | Client_SEG C DA A | 439 | Server_SEG D D AAD | 440 80000 ++ DAAA ++ 441 | D AAD | 442 | D AAD | 443 60000 ++ D AA D AA AA ++ 444 | D AA | 445 | DAAA | 446 | D AAAA | 447 40000 ++ D AA ++ 448 | DADA | 449 | D AAAA | 450 20000 ++ DADA ++ 451 | DDAAD | 452 + DAAAA A + + + | 453 0 ++----AADAA---+-------------+--------CC--+-------------+--++ 454 0 5 10 15 20 455 time since first SYN (in seconds) 457 The third graph shows the trace with FreeBSD altered to use UI/LI to 458 limit bursts. Again, the first request and response are normal. This 459 shows that UI/LI allows the congestion window to grow normally. With 460 the second request we see the main difference of UI/LI. Again, the sec- 461 ond request comes after the RTO, and here we see the main difference 462 with UI/LI. In response to the second request, the server does not send 463 a burst the size of CWND or enter slow-start. Instead, it sends several 464 small 4-packet bursts, with the CWND growing normally from this point. 466 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 468 Idle trace with UI/LI 469 sequence number (in bytes) 470 +-------+------+-------+------+-------+------+------DD-A---+ 471 + + + + + + + D+A | 472 100000 ++ Client_ACK A D A A ++ 473 | Client_SEG C DDAA | 474 | Server_SEG D D A | 475 80000 ++ DD AD ++ 476 | D AD A | 477 | D AD AA | 478 60000 ++ D A D ++ 479 | DDA A | 480 | D A | 481 | D AD A | 482 40000 ++ DDAA ++ 483 | D A | 484 | DDAAA | 485 20000 ++ D AD A ++ 486 | D A | 487 + D A D AAA + + + + + | 488 0 ++A-D-A-D------+-------+------+---C---+------+-------+----++ 489 0 2 4 6 8 10 12 14 490 time since first SYN (in seconds) 492 Note, RTO is the usual timer limit, but any value would have the same 493 results, depending on when the second request was presented in relation 494 to the timer. If the second request arrives after the timer expires, 495 the reponse will behave as presented in these graphs. If the second 496 request arrives before the timer expires and after all outstanding pack- 497 ets have been acknowledged, systems with a send or receive timer will 498 send a CWND-sized burst. If the second request arrives before the last 499 ACK but after the last packet sent, the server has the potential to send 500 a burst no larger than CWND. UI/LI will limit any burst to 4 packets 501 regardless of the timing of the second request. 503 Implementation of UI/LI 505 UI/LI requires a simple modification to existing TCP output routines. 506 The changes required replace the current idle detection code. Below is 507 the suggested pseudocode from Appendix C of [JK92]: 509 int idle = (snd_max == snd_una); 510 if (idle && now - lastrcv >= rto) 511 cwnd = 1; 513 UI/LI would replace that code as below: 515 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 517 int idle = (snd_max == snd_una); 518 int maxwin = 4 + snd_nxt - snd_una; 519 if (cwnd > maxwin) 520 cwnd = maxwin; 522 Packet counting is implemented by line 2. Lines 3 and 4 implement 523 Window Limitation. Note that the comparison in line 1 (defining 524 "idle" [JK92]) is required for later Silly Window Syndrome avoidance 525 and could now be moved there. 527 We have implemented UI/LI in FreeBSD. The affected code is in the 528 file /sys/netinet/tcp_output.c. The lines: 530 idle = (tp->snd_max == tp->snd_una); 531 if (idle && tp->t_idle >= tp->t_rxtcur) 532 tp->snd_cwnd = tp->t_maxseg; 534 Are replaced with: 536 idle = (tp->snd_max == tp->snd_una); 537 maxwin = (4 * tp->t_maxseg) + tp->snd_nxt - tp->snd_una; 538 if (tp->snd_cwnd > maxwin) 539 tp->snd_cwnd = maxwin; 541 The choice of limiting the available congestion window to 4 packets 542 is based on the normal operation of TCP. An ACK received by the 543 sender may be in response to the receipt of 2 packets, allowing 544 another 2 to be sent. Further, normal window growth may require the 545 sending of a third packet. Lastly, in slow-start with delayed ACKs, 546 the receipt of an ACK can trigger the sending of 4 packets. Thus, 4 547 packets is a reasonable burst to send into the network. 549 Increasing the initial window in slow-start to 4 packets has already 550 been proposed [FAP97]. The effects of this change have been explored 551 in simulation in [PN98] and in practice in [AHO98]. Such a modifica- 552 tion to TCP would cause the same behavior as our solution in the 553 cases where the pause timer has expired. It does not address the 554 pre-timeout bursting situation we are concerned with. 556 Implementation of BOL 558 BOL can be implemented with a small amount of additional state, and a 559 small number of lines in TCP's input, output, and timer routines. The 560 state indicates the number of packets that can be emitted before the 561 next timer or ACK receipt. The other lines increment the state when a 562 notable event occurs, and decrement it when packets are sent. 564 Our implementation uses two state variables: bucket - the number of 566 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 568 packets that can be immediately sent, and ack_ratio - the number of 569 packets per ACK (upper bound). ack_ratio is implied by RFC-2581, as 570 at least one ACK "SHOULD" be sent for every 2 MSSs received, i.e., 571 ACK compression [APS99]. This is our default; we parameterize it so 572 that when others modify TCP to use less frequent ACKs, our portion of 573 the code will operate correctly. It should be noted that less fre- 574 quent ACKs will, by definition, increase the burstiness of the source 575 as a result. We also assume 'initial_win' is defined, currently 2 576 MSS, but may be configurable as per Experimental RFC-2414 [FAP97]. 578 There are two events that set the value of bucket. When an ACK is 579 received, we set bucket to ack_ratio * 2 + 1. This is for the same 580 reasons as Maxburst uses '5' - to allow for TCP to run at full rate 581 in the presence of isolated ACK losses. On timeout events, the bucket 582 is set to init_win. 584 When packets are emitted in the existing tcp_output routine, the 585 bucket is decremented. The bucket does not become negative; if it 586 would, that data is not sent, and the routine returns. This 'refusal 587 to send' behavior is identical to not having sufficient send window; 588 the application will retry, typically due to some timer event. 590 The code is as follows: 592 /* define the state for BOL */ 593 int ack_ratio, /* max number of packets per ack */ 594 int bucket, /* max burst size */ 596 #DEFINE init_win 2 598 /* set the bucket on ACK receipt */ 599 bucket = ack_ratio * 2 + 1; 601 /* set bucket on timeout */ 602 bucket = init_win; 604 /* decrement the bucket accordingly */ 605 can_send = min(bucket - want_send, 0); 606 bucket -= can_send; 607 do_send(can_send); 608 return can_send; 610 Local connections 612 We recently discovered cases where SSR adversely affected local con- 613 nections, i.e., connections within the same subnet. There is already 614 code in the current BSD-derived TCP implementation that disables ini- 615 tializing the congestion window (CWND) to a small value, when both 617 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 619 ends of a connection share a subnet. This code, from FreeBSD 620 tcp_input.c, is outlined below, where inp = tp->t_inpcb. 622 /* 623 * Don't force slow-start on local network. 624 */ 625 if (!in_localaddr(inp->inp_faddr)) 626 tp->snd_cwnd = mss; 628 RFC-2414 observes that there are 3 types of window sizes used in BSD 629 TCP: 631 1. intital window 632 2. restart after timeout 633 3. restart after loss 635 Recent work indicates that these windows should be treated differ- 636 ently; e.g., the restart-after-loss window should always be set to 1, 637 and the initial window can be set to 4 or not used for local connec- 638 tions [FAP97]. We proposed that it is consistent to extend this local 639 exception to the restart-after-timeout window, as shown below as a 640 FreeBSD code segment: 642 idle = (tp->snd_max == tp->snd_una); 643 maxwin = (4 * tp->t_maxseg) + tp->snd_nxt - tp->snd_una; 644 if ((tp->snd_cwnd > maxwin) && 645 !in_localaddr(tcp->t_inpcb->inp_faddr)) 646 tp->snd_cwnd = maxwin; 648 This modification is considered reasonable because it corresponds to 649 the initial window determination. There are other cases where TCP 650 uses "local" rules, such as MSS determination, and piggybacking. 652 Performance Issues 654 The purpose of these modifications is to avoid line-rate bursts, pri- 655 marily as the result of the source being idle, but also due to ACK 656 losses. A result of many of the proposed solutions is to limit the 657 source, either by reducing the congestion window, or limiting the 658 transmission of bursts. The performance effects of congestion window 659 reduction are well-studied, and are not addressed here. BOL can limit 660 the source without modifying the congestion window, and there are 661 concerns that this may affect the performance of TCP. 663 In particular, some implementations of TCP send ACKs in a bursty 664 fashion, either by reducing the send/receive context switching, or by 665 reducing the overall number of ACKs. These modifications are used to 666 increase the performance of TCP, but can also increase the burstiness 668 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 670 of the source. BOL, by reducing the burstiness of the source, is con- 671 sidered to potentially limit the effectiveness of these enhancements. 673 TCP currently recommends that an ACK "SHOULD" be sent for at least 674 every two MSS's received. We assume that this recommendation implies 675 an interleaving factor for the source as well. If this factor is 676 encoded in a TCB variable (ack_ratio), then BOL can allow for corre- 677 sponding bursts consistent with reasonable (non-adjacent) ACK losses. 679 We assume that TCPs should not delay ACK processing to aggregate the 680 receive-side timeslices, in order to reduce send/receive interleav- 681 ing, unless this is also encoded in their ack_ratio variable. Our 682 initial measurements indicate that even the recommended interleaving 683 does not substantially affect the performance of TCP. 685 In addition, we recommend that any burst-avoidance modification be 686 disabled for direct LAN connections. For such connections, the con- 687 gestion window mechanism is already disabled in several TCP implemen- 688 tations. The resulting burstiness is handled by the link layer and 689 receiver buffers. 691 Conclusions 693 At this time, we propose BOL as a simple, minimal and effective fix 694 to the 'bug' in current TCP implementations that is exploited by 695 HTTP/1.1. Modifications can be made to TCP to solve the slow-start 696 restart problem that are consistent with the original congestion 697 avoidance specifications (i.e. a send timer). However, we feel that 698 the original intended behavior is not appropriate to some current 699 applications, specifically HTTP. Thus, we recommend BOL to prevent 700 bursts into the network. Not only does this solution solve the cur- 701 rent problem in a simple way, it will prevent bursting in any other 702 situation that might arise. The packet bursts which we allow are con- 703 sistent with congestion window growth algorithms and with Floyd's 704 conclusion about increasing the initial window size. 706 BOL, as well as the other solutions listed, need to be re-evaluated 707 within emerging TCP implementations, e.g., SACK [JB88]. In general, 708 TCP has no rate pacing and uses congestion control to avoid bursts in 709 current implementations. A more explicit mechanism, such as RBP or 710 similar proposals may be desirable in the future. 712 Security implications 714 There are security risks associated with these algorithms that can 715 result in denial of service attacks. Intermediate routers can delay 716 data or ACKs in ways that can cause the restarting party to restart 717 more conservatively. They also result in less probability of the 719 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 721 restarting party generating a line-rate burst. 723 In Send-Timer and Maxburst algorithms, as well as in a no-restart 724 system, an attacker (either an intermediate router or data receiver) 725 can clump the delivery of ACKs, which can result in line-rate bursts. 727 The Receive-Timer, Rate-Based Pacing, UI/LI, and BOL algorithms use 728 data or ACK arrival to moderate outgoing data and avoid bursts. If 729 the attacker attempts to deliver data or ACKs so as to cause a burst, 730 these algorithms successfully avoid creating a burst. In the process, 731 their throughput is reduced. 733 We believe the latter algorithms should be considered more robust 734 than the former, because slow-start restart focuses on avoiding 735 bursts, rather than maximizing throughput. In the recommended algo- 736 rithms (BOL now, RBP for the future), the system succesfully avoids 737 generating a burst as the result of an attacker. 739 Acknowledgements 741 We would like to acknowledge Kacheong Poon, Sally Floyd, members of 742 the End-to-End-Interest Mailing List, and members of the IETF TCP 743 Implementation Working Group for their feedback and discussions on 744 this document. We would also like to thank Ruth Aydt, Steve Tuecke, 745 and Carl Kesselman for their observation of the local connection 746 issue. 748 References 750 [AHO98] Mark Allman, Chris Hayes, and Shawn Ostermann. An Evaluation 751 of TCP with Larger Initial Windows. ACM Computer Communications 752 Review, 28(3), 41-52 , July 1998. 754 [APS99] M. Allman, V. Paxson, W. Stevens. TCP Congestion Control, 755 April 1999. RFC-2581. 757 [BPK97] Hari Balakrishnan, Venkata N. Padmanabhan, and Randy H. Katz. 758 The Effects of Asymmetry on TCP Performance. In Proceedings of 759 the ACM/IEEE Mobicom, Budapest, Hungary, ACM. September, 1997. 761 [BSSK97] Hari Balakrishnan, Srinivasan Seshan, Mark Stemm, and Randy 762 H. Katz. Analyzing Stability in Wide-Area Network Performance. 763 In Proceedings of the ACM SIGMETRICS, Seattle WA, USA, ACM. 764 June, 1997. 766 [FGMFB97] R. Fielding, Jim Gettys, Jeffrey C. Mogul, H. Frystyk, and 767 Tim Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1, 769 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 771 January 1997. RFC-2068. 773 [FAP97] Mark Allman, Sally Floyd, and Craig Partridge. Increasing 774 TCP's Initial Window, September 1998. RFC-2414. 776 [Hei97] John Heidemann. Performance Interactions Between P-HTTP and 777 TCP Implementations. ACM Computer Communications Review, 27(2), 778 65-73, April 1997. 780 [HOT97] John Heidemann, Katia Obraczka, and Joe Touch. Modeling the 781 Performance of HTTP Over Several Transport Protocols. ACM/IEEE 782 Transactions on Networking 5(5), 616-630, October, 1997. 784 [JB88] Van Jacobson and R.T. Braden. TCP extensions for long-delay 785 paths, October 1988. RFC 1072. 787 [Jac88] Van Jacobson. Congestion Avoidance and Control. Proc. SIG- 788 COMM '88, in ACM Computer Communication Review, 18(4):314-329, 789 August 1988. NOTE: This paper lacks the appendix that describes 790 slow-start restart, which appears only in the unpublished [JK92]. 792 [JK92] Van Jacobson and Mike Karels. Congestion Avoidance and Con- 793 trol. This is a revised version of [Jac88], which adds Karels as 794 co-author and includes an additional appendix. The revised ver- 795 sion has not been published, but is available at 796 ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z 798 [NS97] ns Network Simulator. http://www-mash.cs.berkeley.edu/ns/, 799 1997. 801 [PK98] Venkata N. Padmanabhan and Randy H. Katz. TCP Fast Start: A 802 Technique for Speeding Up Web Transfers. In Proceedings of the 803 IEEE GLOBECOM Internet Mini-Conference, Sydney, Australia, Novem- 804 ber, 1998. 806 [PN98] K. Poduri and K. Nichols. Simulation Studies of Increased Ini- 807 tial TCP Window Size, September 1998. RFC-2415. 809 [Poo97] Kacheong Poon, Sun Microsystems, tcp-implementors mailing 810 list, August, 1997. 812 [Tou97] Joe Touch, ISI, tcp-implementors mailing list, August 12, 813 1997. 815 [VH97] Vikram Visweswaraiah and John Heidemann. Improving Restart of 816 Idle TCP Connections. Technical Report 97-661, University of 817 Southern California, November 1997. 819 Hughes et al. Issues in TCP Slow-Start Restart Dec. 1, 2001 821 Authors/ Address 823 Amy S. Hughes, Joe Touch, John Hiedemann 824 University of Southern California/Information Sciences Institute 825 4676 Admiralty Way 826 Marina del Rey, CA 90292-6695 827 USA 828 Phone: +1 310-822-1511 829 Fax: +1 310-823-6714 830 URLs: http://www.isi.edu/~ahughes 831 http://www.isi.edu/touch 832 http://www.isi.edu/~johnh 833 Email: ahughes@isi.edu 834 touch@isi.edu 835 johnh@isi.edu