Network Working Group                                 V. Paxson, Editor
Internet Draft                                                M. Allman
                                                              S. Dawson
                                                              W. Fenner
                                                              J. Griner
                                                             I. Heavens
                                                               K. Lahey
                                                               J. Semke
                                                                B. Volz
Expiration Date: May 1999                                 November 1998

                   Known TCP Implementation Problems

1. Status of this Memo

   This document is an Internet Draft.  Internet Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet Drafts.

   Internet Drafts are draft documents valid for a maximum of six
   months, and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet Drafts as reference
   material or to cite them other than as ``work in progress''.

   To view the entire list of current Internet-Drafts, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
   Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
   Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.

ID               Known TCP Implementation Problems        November 1998

Table of Contents

   1. STATUS OF THIS MEMO
   2. INTRODUCTION
   3. KNOWN IMPLEMENTATION PROBLEMS
      3.1  No initial slow start
      3.2  No slow start after retransmission timeout
      3.3  Uninitialized CWND
      3.4  Inconsistent retransmission
      3.5  Failure to retain above-sequence data
      3.6  Extra additive constant in congestion avoidance
      3.7  Initial RTO too low
      3.8  Failure of window deflation after loss recovery
      3.9  Excessively short keepalive connection timeout
      3.10 Failure to back off retransmission timeout
      3.11 Insufficient interval between keepalives
      3.12 Window probe deadlock
      3.13 Stretch ACK violation
      3.14 Retransmission sends multiple packets
      3.15 Failure to send FIN notification promptly
      3.16 Failure to send a RST after Half Duplex Close
      3.17 Failure to RST on close with data pending
      3.18 Options missing from TCP MSS calculation
   4. SECURITY CONSIDERATIONS
   5. ACKNOWLEDGEMENTS
   6. REFERENCES
   7. AUTHORS' ADDRESSES

2. Introduction

   This memo catalogs a number of known TCP implementation problems.
   The goal in doing so is to improve conditions in the existing
   Internet by enhancing the quality of current TCP/IP implementations.
   It is hoped that both performance and correctness issues can be
   resolved by making implementors aware of the problems and their
   solutions.  In the long term, it is hoped that this will provide a
   reduction in unnecessary traffic on the network, the rate of
   connection failures due to protocol errors, and load on network
   servers due to time spent processing both unsuccessful connections
   and retransmitted data.  This will help to ensure the stability of
   the global Internet.

   Each problem is defined as follows:

Name of Problem
   The name associated with the problem.  In this memo, the name is
   given as a subsection heading.

Classification
   One or more categories into which the problem is classified:
   "congestion control", "performance", "reliability", "resource
   management".

Description
   A definition of the problem, succinct but including necessary
   background material.

Significance
   A brief summary of the sorts of environments for which the
   problem is significant.

Implications
   Why the problem is viewed as a problem.

Relevant RFCs
   The RFCs defining the TCP specification with which the problem
   conflicts.  These RFCs often qualify behavior using terms such
   as MUST, SHOULD, MAY, and others written capitalized.  See RFC
   2119 for the exact interpretation of these terms.

Trace file demonstrating the problem
   One or more ASCII trace files demonstrating the problem, if
   applicable.

Trace file demonstrating correct behavior
   One or more examples of how correct behavior appears in a trace,
   if applicable.

References
   References that further discuss the problem.

How to detect
   How to test an implementation to see if it exhibits the problem.
   This discussion may include difficulties and subtleties
   associated with causing the problem to manifest itself, and with
   interpreting traces to detect the presence of the problem (if
   applicable).

How to fix
   For known causes of the problem, how to correct the
   implementation.

3. Known implementation problems

3.1.

Name of Problem
   No initial slow start

Classification
   Congestion control

Description
   When a TCP begins transmitting data, it is required by RFC 1122,
   4.2.2.15, to engage in a "slow start" by initializing its
   congestion window, cwnd, to one packet (one segment of the maximum
   size).  (Note that an experimental change to TCP, documented in
   [RFC2414], allows an initial value somewhat larger than one
   packet.)  It subsequently increases cwnd by one packet for each ACK
   it receives for new data.  The minimum of cwnd and the receiver's
   advertised window bounds the highest sequence number the TCP can
   transmit.  A TCP that fails to initialize and increment cwnd in
   this fashion exhibits "No initial slow start".

Significance
   In congested environments, detrimental to the performance of other
   connections, and possibly to the connection itself.

Implications
   A TCP failing to slow start when beginning a connection results in
   traffic bursts that can stress the network, leading to excessive
   queueing delays and packet loss.

   Implementations exhibiting this problem might do so because they
   suffer from the general problem of not including the required
   congestion window.  These implementations will also suffer from "No
   slow start after retransmission timeout".

   There are different shades of "No initial slow start".
   From the perspective of stressing the network, the worst is a
   connection that simply always sends based on the receiver's
   advertised window, with no notion of a separate congestion window.
   Another form is described in "Uninitialized CWND" below.

Relevant RFCs
   RFC 1122 requires use of slow start.  RFC 2001 gives the specifics
   of slow start.

Trace file demonstrating it
   Made using tcpdump [Jacobson89] recording at the connection
   responder.  No losses reported by the packet filter.

   10:40:42.244503 B > A: S 1168512000:1168512000(0) win 32768
                           (DF) [tos 0x8]
   10:40:42.259908 A > B: S 3688169472:3688169472(0)
                           ack 1168512001 win 32768
   10:40:42.389992 B > A: . ack 1 win 33580 (DF) [tos 0x8]
   10:40:42.664975 A > B: P 1:513(512) ack 1 win 32768
   10:40:42.700185 A > B: . 513:1973(1460) ack 1 win 32768
   10:40:42.718017 A > B: . 1973:3433(1460) ack 1 win 32768
   10:40:42.762945 A > B: . 3433:4893(1460) ack 1 win 32768
   10:40:42.811273 A > B: . 4893:6353(1460) ack 1 win 32768
   10:40:42.829149 A > B: . 6353:7813(1460) ack 1 win 32768
   10:40:42.853687 B > A: . ack 1973 win 33580 (DF) [tos 0x8]
   10:40:42.864031 B > A: . ack 3433 win 33580 (DF) [tos 0x8]

   After the third packet, the connection is established.  A, the
   connection responder, begins transmitting to B, the connection
   initiator.  Host A quickly sends 6 packets comprising 7812 bytes,
   even though the SYN exchange agreed upon an MSS of 1460 bytes (so
   an initial congestion window of one segment corresponds to 1460
   bytes), and so A should have sent at most 1460 bytes.

   The ACKs sent by B to A in the last two lines indicate that this
   trace is not a measurement error (slow start really occurring but
   the corresponding ACKs having been dropped by the packet filter).

   A second trace confirmed that the problem is repeatable.
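   The slow start rule that host A violates above can be condensed
   into a small executable model (an illustrative sketch only; the
   constant and function names are ours, not taken from any
   particular TCP stack):

```python
MSS = 1460  # sender's maximum segment size, from the SYN exchange

def initial_cwnd():
    # RFC 1122, 4.2.2.15: slow start begins with cwnd = 1 segment.
    # ([RFC2414] experimentally allows a somewhat larger start.)
    return 1 * MSS

def on_ack_for_new_data(cwnd):
    # During slow start, each ACK for new data opens cwnd by one segment.
    return cwnd + MSS

def send_limit(cwnd, rcv_window):
    # At most min(cwnd, advertised window) bytes may be outstanding.
    return min(cwnd, rcv_window)

cwnd = initial_cwnd()
for _ in range(2):              # two ACKs for new data arrive
    cwnd = on_ack_for_new_data(cwnd)
print(send_limit(cwnd, 33580))  # 4380: three segments, not the whole window
```

   Under this model, A could have had at most 1460 bytes outstanding
   before its first ACK arrived, rather than the 7812 bytes seen in
   the trace.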
Trace file demonstrating correct behavior
   Made using tcpdump recording at the connection originator.  No
   losses reported by the packet filter.

   12:35:31.914050 C > D: S 1448571845:1448571845(0) win 4380
   12:35:32.068819 D > C: S 1755712000:1755712000(0) ack 1448571846 win 4096
   12:35:32.069341 C > D: . ack 1 win 4608
   12:35:32.075213 C > D: P 1:513(512) ack 1 win 4608
   12:35:32.286073 D > C: . ack 513 win 4096
   12:35:32.287032 C > D: . 513:1025(512) ack 1 win 4608
   12:35:32.287506 C > D: . 1025:1537(512) ack 1 win 4608
   12:35:32.432712 D > C: . ack 1537 win 4096
   12:35:32.433690 C > D: . 1537:2049(512) ack 1 win 4608
   12:35:32.434481 C > D: . 2049:2561(512) ack 1 win 4608
   12:35:32.435032 C > D: . 2561:3073(512) ack 1 win 4608
   12:35:32.594526 D > C: . ack 3073 win 4096
   12:35:32.595465 C > D: . 3073:3585(512) ack 1 win 4608
   12:35:32.595947 C > D: . 3585:4097(512) ack 1 win 4608
   12:35:32.596414 C > D: . 4097:4609(512) ack 1 win 4608
   12:35:32.596888 C > D: . 4609:5121(512) ack 1 win 4608
   12:35:32.733453 D > C: . ack 4097 win 4096

References
   This problem is documented in [Paxson97].

How to detect
   For implementations always manifesting this problem, it shows up
   immediately in a packet trace or a sequence plot, as illustrated
   above.

How to fix
   If the root problem is that the implementation lacks a notion of a
   congestion window, then unfortunately this requires significant
   work to fix.  However, doing so is important, as such
   implementations also exhibit "No slow start after retransmission
   timeout".

3.2.
Name of Problem
   No slow start after retransmission timeout

Classification
   Congestion control

Description
   When a TCP experiences a retransmission timeout, it is required by
   RFC 1122, 4.2.2.15, to engage in "slow start" by initializing its
   congestion window, cwnd, to one packet (one segment of the maximum
   size).  It subsequently increases cwnd by one packet for each ACK
   it receives for new data until it reaches the "congestion
   avoidance" threshold, ssthresh, at which point the congestion
   avoidance algorithm for updating the window takes over.  A TCP that
   fails to enter slow start upon a timeout exhibits "No slow start
   after retransmission timeout".

Significance
   In congested environments, severely detrimental to the performance
   of other connections, and also the connection itself.

Implications
   Entering slow start upon timeout forms one of the cornerstones of
   Internet congestion stability, as outlined in [Jacobson88].  If
   TCPs fail to do so, the network is at risk of suffering
   "congestion collapse" [RFC896].

Relevant RFCs
   RFC 1122 requires use of slow start after loss.  RFC 2001 gives the
   specifics of how to implement slow start.  RFC 896 describes
   congestion collapse.

   The retransmission timeout discussed here should not be confused
   with the separate "fast recovery" retransmission mechanism
   discussed in RFC 2001.

Trace file demonstrating it
   Made using tcpdump recording at the sending TCP (A).  No losses
   reported by the packet filter.

   10:40:59.090612 B > A: . ack 357125 win 33580 (DF) [tos 0x8]
   10:40:59.222025 A > B: . 357125:358585(1460) ack 1 win 32768
   10:40:59.868871 A > B: . 357125:358585(1460) ack 1 win 32768
   10:41:00.016641 B > A: . ack 364425 win 33580 (DF) [tos 0x8]
   10:41:00.036709 A > B: . 364425:365885(1460) ack 1 win 32768
   10:41:00.045231 A > B: . 365885:367345(1460) ack 1 win 32768
   10:41:00.053785 A > B: . 367345:368805(1460) ack 1 win 32768
   10:41:00.062426 A > B: . 368805:370265(1460) ack 1 win 32768
   10:41:00.071074 A > B: . 370265:371725(1460) ack 1 win 32768
   10:41:00.079794 A > B: . 371725:373185(1460) ack 1 win 32768
   10:41:00.089304 A > B: . 373185:374645(1460) ack 1 win 32768
   10:41:00.097738 A > B: . 374645:376105(1460) ack 1 win 32768
   10:41:00.106409 A > B: . 376105:377565(1460) ack 1 win 32768
   10:41:00.115024 A > B: . 377565:379025(1460) ack 1 win 32768
   10:41:00.123576 A > B: . 379025:380485(1460) ack 1 win 32768
   10:41:00.132016 A > B: . 380485:381945(1460) ack 1 win 32768
   10:41:00.141635 A > B: . 381945:383405(1460) ack 1 win 32768
   10:41:00.150094 A > B: . 383405:384865(1460) ack 1 win 32768
   10:41:00.158552 A > B: . 384865:386325(1460) ack 1 win 32768
   10:41:00.167053 A > B: . 386325:387785(1460) ack 1 win 32768
   10:41:00.175518 A > B: . 387785:389245(1460) ack 1 win 32768
   10:41:00.210835 A > B: . 389245:390705(1460) ack 1 win 32768
   10:41:00.226108 A > B: . 390705:392165(1460) ack 1 win 32768
   10:41:00.241524 B > A: . ack 389245 win 8760 (DF) [tos 0x8]

   The first packet indicates the ack point is 357125.  130 msec after
   receiving the ACK, A transmits the packet after the ACK point,
   357125:358585.  640 msec after this transmission, it retransmits
   357125:358585, in an apparent retransmission timeout.  At this
   point, A's cwnd should be one MSS, or 1460 bytes, as A enters slow
   start.  The trace is consistent with this possibility.

   B replies with an ACK of 364425, indicating that A has filled a
   sequence hole.  At this point, A's cwnd should be 1460*2 = 2920
   bytes, since in slow start receiving an ACK advances cwnd by MSS.
   However, A then launches 19 consecutive packets, which is
   inconsistent with slow start.
   A second trace confirmed that the problem is repeatable.

Trace file demonstrating correct behavior
   Made using tcpdump recording at the sending TCP (C).  No losses
   reported by the packet filter.

   12:35:48.442538 C > D: P 465409:465921(512) ack 1 win 4608
   12:35:48.544483 D > C: . ack 461825 win 4096
   12:35:48.703496 D > C: . ack 461825 win 4096
   12:35:49.044613 C > D: . 461825:462337(512) ack 1 win 4608
   12:35:49.192282 D > C: . ack 465921 win 2048
   12:35:49.192538 D > C: . ack 465921 win 4096
   12:35:49.193392 C > D: P 465921:466433(512) ack 1 win 4608
   12:35:49.194726 C > D: P 466433:466945(512) ack 1 win 4608
   12:35:49.350665 D > C: . ack 466945 win 4096
   12:35:49.351694 C > D: . 466945:467457(512) ack 1 win 4608
   12:35:49.352168 C > D: . 467457:467969(512) ack 1 win 4608
   12:35:49.352643 C > D: . 467969:468481(512) ack 1 win 4608
   12:35:49.506000 D > C: . ack 467969 win 3584

   After C transmits the first packet shown to D, it takes no action
   in response to D's ACKs for 461825, because the first packet
   already reached the advertised window limit of 4096 bytes above
   461825.  600 msec after transmitting the first packet, C
   retransmits 461825:462337, presumably due to a timeout.  Its
   congestion window is now MSS (512 bytes).

   D acks 465921, indicating that C's retransmission filled a sequence
   hole.  This ACK advances C's cwnd from 512 to 1024.  Very shortly
   after, D acks 465921 again in order to update the offered window
   from 2048 to 4096.  This ACK does not advance cwnd since it is not
   for new data.  Very shortly after, C responds to the newly enlarged
   window by transmitting two packets.  D acks both, advancing cwnd
   from 1024 to 1536.  C in turn transmits three packets.

References
   This problem is documented in [Paxson97].
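   The timeout reaction that RFC 2001 specifies, and against which
   the traces above can be checked, can be sketched as a simplified
   model (variable and function names are ours, chosen for
   illustration):

```python
def on_retransmission_timeout(flight_size, mss):
    # RFC 2001: on timeout, ssthresh becomes half the outstanding
    # data (but at least two segments), and cwnd collapses to one
    # segment, re-entering slow start.
    ssthresh = max(flight_size // 2, 2 * mss)
    cwnd = 1 * mss
    return cwnd, ssthresh

def on_ack_for_new_data(cwnd, ssthresh, mss):
    if cwnd < ssthresh:
        # slow start: one additional segment per ACK for new data
        return cwnd + mss
    # congestion avoidance: roughly one segment per round trip
    return cwnd + mss * mss // cwnd

cwnd, ssthresh = on_retransmission_timeout(flight_size=8 * 1460, mss=1460)
print(cwnd, ssthresh)   # 1460 5840
cwnd = on_ack_for_new_data(cwnd, ssthresh, 1460)
print(cwnd)             # 2920
```

   After the first ACK following the timeout, the model permits two
   segments in flight; the 19 consecutive packets in the faulty trace
   above are far outside this progression.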
How to detect
   Packet loss is common enough in the Internet that generally it is
   not difficult to find an Internet path that will force
   retransmission due to packet loss.

   If the effective window prior to loss is large enough, however,
   then the TCP may retransmit using the "fast recovery" mechanism
   described in RFC 2001.  In a packet trace, the signature of fast
   recovery is that the packet retransmission occurs in response to
   the receipt of three duplicate ACKs, and subsequent duplicate ACKs
   may lead to the transmission of new data, above both the ack point
   and the highest sequence transmitted so far.  An absence of three
   duplicate ACKs prior to retransmission suffices to distinguish
   between timeout and fast recovery retransmissions.  When only fast
   recovery retransmissions are observed, generally it is not
   difficult to repeat the data transfer until a timeout
   retransmission occurs.

   Once armed with a trace exhibiting a timeout retransmission,
   determining whether the TCP follows slow start is done by computing
   the correct progression of cwnd and comparing it to the amount of
   data transmitted by the TCP subsequent to the timeout
   retransmission.

How to fix
   If the root problem is that the implementation lacks a notion of a
   congestion window, then unfortunately this requires significant
   work to fix.  However, doing so is critical, for reasons outlined
   above.

3.3.

Name of Problem
   Uninitialized CWND

Classification
   Congestion control

Description
   As described above for "No initial slow start", when a TCP
   connection begins cwnd is initialized to one segment (or perhaps a
   few segments, if experimenting with [RFC2414]).  One particular
   form of "No initial slow start", worth separate mention as the bug
   is fairly widely deployed, is "Uninitialized CWND".
   That is, while the TCP implements the proper slow start mechanism,
   it fails to initialize cwnd properly, so slow start in fact fails
   to occur.

   One way the bug can occur is if, during the connection
   establishment handshake, the SYN ACK packet arrives without an MSS
   option.  The faulty implementation uses receipt of the MSS option
   to initialize cwnd to one segment; if the option fails to arrive,
   then cwnd is instead initialized to a very large value.

Significance
   In congested environments, detrimental to the performance of other
   connections, and likely to the connection itself.  The burst can be
   so large (see below) that it has deleterious effects even in
   uncongested environments.

Implications
   A TCP exhibiting this behavior is stressing the network with a
   large burst of packets, which can cause loss in the network.

Relevant RFCs
   RFC 1122 requires use of slow start.  RFC 2001 gives the specifics
   of slow start.

Trace file demonstrating it
   This trace was made using tcpdump running on host A.  Host A is the
   sender and host B is the receiver.  The advertised window and
   timestamp options have been omitted for clarity, except for the
   first segment sent by host A.  Note that A sends an MSS option in
   its initial SYN but B does not include one in its reply.

   16:56:02.226937 A > B: S 237585307:237585307(0) win 8192
   16:56:02.557135 B > A: S 1617216000:1617216000(0)
                           ack 237585308 win 16384
   16:56:02.557788 A > B: . ack 1 win 8192
   16:56:02.566014 A > B: . 1:537(536) ack 1
   16:56:02.566557 A > B: . 537:1073(536) ack 1
   16:56:02.567120 A > B: . 1073:1609(536) ack 1
   16:56:02.567662 A > B: P 1609:2049(440) ack 1
   16:56:02.568349 A > B: . 2049:2585(536) ack 1
   16:56:02.568909 A > B: . 2585:3121(536) ack 1

   [54 additional burst segments deleted for brevity]

   16:56:02.936638 A > B: . 32065:32601(536) ack 1
   16:56:03.018685 B > A: . ack 1

   After the three-way handshake, host A bursts 61 segments into the
   network, before duplicate ACKs on the first segment cause a
   retransmission to occur.  Since host A did not wait for the ACK on
   the first segment before sending additional segments, it is
   exhibiting "Uninitialized CWND".

Trace file demonstrating correct behavior
   See the example for "No initial slow start".

References
   This problem is documented in [Paxson97].

How to detect
   This problem can be detected by examining a packet trace recorded
   at either the sender or the receiver.  However, the bug can be
   difficult to induce because it requires finding a remote TCP peer
   that does not send an MSS option in its SYN ACK.

How to fix
   This problem can be fixed by ensuring that cwnd is initialized upon
   receipt of a SYN ACK, even if the SYN ACK does not contain an MSS
   option.

3.4.

Name of Problem
   Inconsistent retransmission

Classification
   Reliability

Description
   If, for a given sequence number, a sending TCP retransmits
   different data than previously sent for that sequence number, then
   a strong possibility arises that the receiving TCP will reconstruct
   a different byte stream than that sent by the sending application,
   depending on which instance of the sequence number it accepts.
   Such a sending TCP exhibits "Inconsistent retransmission".

Significance
   Critical for all environments.

Implications
   Reliable delivery of data is a fundamental property of TCP.

Relevant RFCs
   RFC 793, section 1.5, discusses the central role of reliability in
   TCP operation.

Trace file demonstrating it
   Made using tcpdump recording at the receiving TCP (B).  No losses
   reported by the packet filter.
   12:35:53.145503 A > B: FP 90048435:90048461(26) ack 393464682 win 4096
                    4500 0042 9644 0000
                    3006 e4c2 86b1 0401 83f3 010a b2a4 0015
                    055e 07b3 1773 cb6a 5019 1000 68a9 0000
   data starts here>504f 5254 2031 3334 2c31 3737*2c34 2c31
                    2c31 3738 2c31 3635 0d0a
   12:35:53.146479 B > A: R 393464682:393464682(0) win 8192
   12:35:53.851714 A > B: FP 90048429:90048463(34) ack 393464682 win 4096
                    4500 004a 965b 0000
                    3006 e4a3 86b1 0401 83f3 010a b2a4 0015
                    055e 07ad 1773 cb6a 5019 1000 8bd3 0000
   data starts here>5041 5356 0d0a 504f 5254 2031 3334 2c31
                    3737*2c31 3035 2c31 3431 2c34 2c31 3539
                    0d0a

   The sequence numbers shown in this trace are absolute and not
   adjusted to reflect the ISN.  The 4-digit hex values show a dump of
   the packet's IP and TCP headers, as well as payload.  A first sends
   to B data for 90048435:90048461.  The corresponding data begins
   with hex words 504f, 5254, etc.

   B responds with a RST.  Since the recording location was local to
   B, it is unknown whether A received the RST.

   A then sends 90048429:90048463, which includes six sequence
   positions below the earlier transmission, all 26 positions of the
   earlier transmission, and two additional sequence positions.

   The retransmission disagrees starting just after sequence 90048447,
   annotated above with a leading '*'.  These two bytes were
   originally transmitted as hex 2c34 but retransmitted as hex 2c31.
   Subsequent positions disagree as well.

   This behavior has been observed in other traces involving different
   hosts.  It is unknown how to repeat it.

   In this instance, no corruption would occur, since B has already
   indicated it will not accept further packets from A.

   A second example illustrates a slightly different instance of the
   problem.  The tracing again was made with tcpdump at the receiving
   TCP (D).
   22:23:58.645829 C > D: P 185:212(27) ack 565 win 4096
                    4500 0043 90a3 0000
                    3306 0734 cbf1 9eef 83f3 010a 0525 0015
                    a3a2 faba 578c 70a4 5018 1000 9a53 0000
   data starts here>504f 5254 2032 3033 2c32 3431 2c31 3538
                    2c32 3339 2c35 2c34 330d 0a
   22:23:58.646805 D > C: . ack 184 win 8192
                    4500 0028 beeb 0000
                    3e06 ce06 83f3 010a cbf1 9eef 0015 0525
                    578c 70a4 a3a2 fab9 5010 2000 342f 0000
   22:31:36.532244 C > D: FP 186:213(27) ack 565 win 4096
                    4500 0043 9435 0000
                    3306 03a2 cbf1 9eef 83f3 010a 0525 0015
                    a3a2 fabb 578c 70a4 5019 1000 9a51 0000
   data starts here>504f 5254 2032 3033 2c32 3431 2c31 3538
                    2c32 3339 2c35 2c34 330d 0a

   In this trace, sequence numbers are relative.  C sends 185:212, but
   D only sends an ACK for 184 (so sequence number 184 is missing).  C
   then sends 186:213.  The packet payload is identical to the
   previous payload, but the base sequence number is one higher,
   resulting in an inconsistent retransmission.

   Neither trace exhibits checksum errors.

Trace file demonstrating correct behavior
   (Omitted, as presumably correct behavior is obvious.)

References
   None known.

How to detect
   This problem unfortunately can be very difficult to detect, since
   available experience indicates it is quite rare that it is
   manifested.  No "trigger" has been identified that can be used to
   reproduce the problem.

How to fix
   In the absence of a known "trigger", we cannot always assess how to
   fix the problem.
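   Although no trigger is known, traces can at least be screened for
   the problem mechanically, by comparing the payload bytes of any
   two segments whose sequence ranges overlap.  A minimal sketch (the
   function name and framing are ours; the payload below is the
   decoded hex data from the second trace):

```python
def overlap_mismatch(seg1, seg2):
    """Given (start_seq, payload_bytes) for two segments, return True
    if the overlapping sequence range carries different data, i.e. an
    inconsistent (re)transmission."""
    s1, p1 = seg1
    s2, p2 = seg2
    lo = max(s1, s2)
    hi = min(s1 + len(p1), s2 + len(p2))
    if lo >= hi:
        return False  # sequence ranges do not overlap
    return p1[lo - s1:hi - s1] != p2[lo - s2:hi - s2]

# The second trace above: an identical payload re-sent one sequence
# number higher, so the bytes in the overlapping range disagree.
first  = (185, b"PORT 203,241,158,239,5,43\r\n")
second = (186, b"PORT 203,241,158,239,5,43\r\n")
print(overlap_mismatch(first, second))   # True
```

   The same check applied to a consistent retransmission (same start
   sequence, same payload) returns False.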
624 In one implementation (not the one illustrated above), the problem
625 manifested itself when (1) the sender received a zero window and
626 stalled; (2) eventually an ACK arrived that offered a window larger
627 than that in effect at the time of the stall; (3) the sender
628 transmitted out of the buffer of data it held at the time of the
629 stall, but (4) failed to limit this transfer to the buffer length,
630 instead using the newly advertised (and larger) offered window.
631 Consequently, in addition to the valid buffer contents, it sent
632 whatever garbage values followed the end of the buffer. If it then
633 retransmitted the corresponding sequence numbers, at that point it
634 sent the correct data, resulting in an inconsistent retransmission.
635 Note that this instance of the problem reflects a more general
636 problem, that of initially transmitting incorrect data.

638 3.5.

640 Name of Problem
641 Failure to retain above-sequence data

643 Classification
644 Congestion control, performance

646 Description
647 When a TCP receives an "above sequence" segment, meaning one with a
648 sequence number exceeding RCV.NXT but below RCV.NXT+RCV.WND, it
649 SHOULD queue the segment for later delivery (RFC 1122, 4.2.2.20).
650 (See RFC 793 for the definition of RCV.NXT and RCV.WND.) A TCP
651 that fails to do so is said to exhibit "Failure to retain above-
652 sequence data".

654 It may sometimes be appropriate for a TCP to discard above-sequence
655 data to reclaim memory. If it does so only rarely, then we would
656 not consider it to exhibit this problem. Instead, the particular
657 concern is with TCPs that always discard above-sequence data.

659 Significance
660 In environments prone to packet loss, detrimental to the
661 performance of both other connections and the connection itself.
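The SHOULD of RFC 1122, 4.2.2.20 amounts to holding above-sequence segments until the gap fills. A minimal sketch of a conforming receiver follows (illustrative only; overlapping segments, sequence-number wraparound, and window updates are ignored, and apart from RCV.NXT and RCV.WND the names are not taken from RFC 793).

```python
class Reassembly:
    """Receiver sketch that queues above-sequence segments, per the
    RFC 1122 SHOULD.  Simplified: overlapping segments, sequence-number
    wraparound, and window updates are ignored."""

    def __init__(self, rcv_nxt, rcv_wnd):
        self.rcv_nxt = rcv_nxt      # next sequence number expected
        self.rcv_wnd = rcv_wnd      # receive window, in bytes
        self.queue = {}             # seq -> payload held for later delivery

    def receive(self, seq, payload):
        """Accept a segment; return whatever data is now in-order."""
        if not (self.rcv_nxt <= seq < self.rcv_nxt + self.rcv_wnd):
            return b""              # outside the window: discard
        self.queue[seq] = payload   # retain even if above-sequence
        delivered = b""
        while self.rcv_nxt in self.queue:   # gap filled? drain the queue
            data = self.queue.pop(self.rcv_nxt)
            delivered += data
            self.rcv_nxt += len(data)
        return delivered
```

A TCP exhibiting the problem instead discards the above-sequence segment in the queueing step, so that upon retransmission of the missing segment it can acknowledge only that one segment rather than all of the data previously received.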
663 Implications 665 ID Known TCP Implementation Problems November 1998 667 In times of congestion, a failure to retain above-sequence data 668 will lead to numerous otherwise-unnecessary retransmissions, 669 aggravating the congestion and potentially reducing performance by 670 a large factor. 672 Relevant RFCs 673 RFC 1122 revises RFC 793 by upgrading the latter's MAY to a SHOULD 674 on this issue. 676 Trace file demonstrating it 677 Made using tcpdump recording at the receiving TCP. No losses 678 reported by the packet filter. 680 B is the TCP sender, A the receiver. A exhibits failure to retain 681 above sequence data: 683 10:38:10.164860 B > A: . 221078:221614(536) ack 1 win 33232 [tos 0x8] 684 10:38:10.170809 B > A: . 221614:222150(536) ack 1 win 33232 [tos 0x8] 685 10:38:10.177183 B > A: . 222150:222686(536) ack 1 win 33232 [tos 0x8] 686 10:38:10.225039 A > B: . ack 222686 win 25800 688 Here B has sent up to (relative) sequence 222686 in-sequence, and A 689 accordingly acknowledges. 691 10:38:10.268131 B > A: . 223222:223758(536) ack 1 win 33232 [tos 0x8] 692 10:38:10.337995 B > A: . 223758:224294(536) ack 1 win 33232 [tos 0x8] 693 10:38:10.344065 B > A: . 224294:224830(536) ack 1 win 33232 [tos 0x8] 694 10:38:10.350169 B > A: . 224830:225366(536) ack 1 win 33232 [tos 0x8] 695 10:38:10.356362 B > A: . 225366:225902(536) ack 1 win 33232 [tos 0x8] 696 10:38:10.362445 B > A: . 225902:226438(536) ack 1 win 33232 [tos 0x8] 697 10:38:10.368579 B > A: . 226438:226974(536) ack 1 win 33232 [tos 0x8] 698 10:38:10.374732 B > A: . 226974:227510(536) ack 1 win 33232 [tos 0x8] 699 10:38:10.380825 B > A: . 227510:228046(536) ack 1 win 33232 [tos 0x8] 700 10:38:10.387027 B > A: . 228046:228582(536) ack 1 win 33232 [tos 0x8] 701 10:38:10.393053 B > A: . 228582:229118(536) ack 1 win 33232 [tos 0x8] 702 10:38:10.399193 B > A: . 229118:229654(536) ack 1 win 33232 [tos 0x8] 703 10:38:10.405356 B > A: . 
229654:230190(536) ack 1 win 33232 [tos 0x8] 705 A now receives 13 additional packets from B. These are above- 706 sequence because 222686:223222 was dropped. The packets do however 707 fit within the offered window of 25800. A does not generate any 708 duplicate ACKs for them. 710 The trace contributor (V. Paxson) verified that these 13 packets 711 had valid IP and TCP checksums. 713 10:38:11.917728 B > A: . 222686:223222(536) ack 1 win 33232 [tos 0x8] 714 10:38:11.930925 A > B: . ack 223222 win 32232 716 ID Known TCP Implementation Problems November 1998 718 B times out for 222686:223222 and retransmits it. Upon receiving 719 it, A only acknowledges 223222. Had it retained the valid above- 720 sequence packets, it would instead have ack'd 230190. 722 10:38:12.048438 B > A: . 223222:223758(536) ack 1 win 33232 [tos 0x8] 723 10:38:12.054397 B > A: . 223758:224294(536) ack 1 win 33232 [tos 0x8] 724 10:38:12.068029 A > B: . ack 224294 win 31696 726 B retransmits two more packets, and A only acknowledges them. This 727 pattern continues as B retransmits the entire set of previously- 728 received packets. 730 A second trace confirmed that the problem is repeatable. 732 Trace file demonstrating correct behavior 733 Made using tcpdump recording at the receiving TCP (C). No losses 734 reported by the packet filter. 736 09:11:25.790417 D > C: . 33793:34305(512) ack 1 win 61440 737 09:11:25.791393 D > C: . 34305:34817(512) ack 1 win 61440 738 09:11:25.792369 D > C: . 34817:35329(512) ack 1 win 61440 739 09:11:25.792369 D > C: . 35329:35841(512) ack 1 win 61440 740 09:11:25.793345 D > C: . 36353:36865(512) ack 1 win 61440 741 09:11:25.794321 C > D: . ack 35841 win 59904 743 A sequence hole occurs because 35841:36353 has been dropped. 745 09:11:25.794321 D > C: . 36865:37377(512) ack 1 win 61440 746 09:11:25.794321 C > D: . ack 35841 win 59904 747 09:11:25.795297 D > C: . 37377:37889(512) ack 1 win 61440 748 09:11:25.795297 C > D: . 
ack 35841 win 59904 749 09:11:25.796273 C > D: . ack 35841 win 61440 750 09:11:25.798225 D > C: . 37889:38401(512) ack 1 win 61440 751 09:11:25.799201 C > D: . ack 35841 win 61440 752 09:11:25.807009 D > C: . 38401:38913(512) ack 1 win 61440 753 09:11:25.807009 C > D: . ack 35841 win 61440 754 (many additional lines omitted) 755 09:11:25.884113 D > C: . 52737:53249(512) ack 1 win 61440 756 09:11:25.884113 C > D: . ack 35841 win 61440 758 Each additional, above-sequence packet C receives from D elicits a 759 duplicate ACK for 35841. 761 09:11:25.887041 D > C: . 35841:36353(512) ack 1 win 61440 762 09:11:25.887041 C > D: . ack 53249 win 44032 764 D retransmits 35841:36353 and C acknowledges receipt of data all 766 ID Known TCP Implementation Problems November 1998 768 the way up to 53249. 770 References 771 This problem is documented in [Paxson97]. 773 How to detect 774 Packet loss is common enough in the Internet that generally it is 775 not difficult to find an Internet path that will result in some 776 above-sequence packets arriving. A TCP that exhibits "Failure to 777 retain ..." may not generate duplicate ACKs for these packets. 778 However, some TCPs that do retain above-sequence data also do not 779 generate duplicate ACKs, so failure to do so does not definitively 780 identify the problem. Instead, the key observation is whether upon 781 retransmission of the dropped packet, data that was previously 782 above-sequence is acknowledged. 784 Two considerations in detecting this problem using a packet trace 785 are that it is easiest to do so with a trace made at the TCP 786 receiver, in order to unambiguously determine which packets arrived 787 successfully, and that such packets may still be correctly 788 discarded if they arrive with checksum errors. 
The latter can be
789 tested by capturing the entire packet contents and performing the
790 IP and TCP checksum algorithms to verify their integrity; or by
791 confirming that the packets arrive with the same checksum and
792 contents as those with which they were sent, with a presumption that
793 the sending TCP correctly calculates checksums for the packets it
794 transmits.

796 It is considerably easier to verify that an implementation does NOT
797 exhibit this problem. This can be done by recording a trace at the
798 data sender, and observing that sometimes after a retransmission
799 the receiver acknowledges a higher sequence number than just that
800 which was retransmitted.

802 How to fix
803 If the root problem is that the implementation lacks sufficient
804 buffer space, then unfortunately this requires significant work to
805 fix. However, doing so is important, for reasons outlined above.

807 ID Known TCP Implementation Problems November 1998

809 3.6.

811 Name of Problem
812 Extra additive constant in congestion avoidance

814 Classification
815 Congestion control / performance

817 Description
818 RFC 1122 section 4.2.2.15 states that TCP MUST implement Jacobson's
819 "congestion avoidance" algorithm [Jacobson88], which calls for
820 increasing the congestion window, cwnd, by:

822 MSS * MSS / cwnd

824 for each ACK received for new data [RFC2001]. This has the effect
825 of increasing cwnd by approximately one segment in each round trip
826 time.

828 Some TCP implementations add an additional fraction of a segment
829 (typically MSS/8) to cwnd for each ACK received for new data
830 [Stevens94, Wright95]:

832 (MSS * MSS / cwnd) + MSS/8

834 These implementations exhibit "Extra additive constant in
835 congestion avoidance".

837 Significance
838 May be detrimental to performance even in completely uncongested
839 environments (see Implications).

841 In congested environments, may also be detrimental to the
842 performance of other connections.
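The gulf between the two increase rules shows up quickly in a small simulation. The sketch below is illustrative and deliberately simplified: one ACK per in-flight segment rather than delayed ACKs, integer arithmetic, and the MSS and initial window taken from the trace later in this section.

```python
MSS = 4312                    # segment size used in the trace

def cwnd_after_round_trips(rtts, extra_constant):
    """Apply the congestion-avoidance increase once per ACK, one ACK
    per in-flight segment, starting from a 31-segment window.  Returns
    cwnd (in segments) after each simulated round trip."""
    cwnd = 31 * MSS
    history = []
    for _ in range(rtts):
        for _ in range(cwnd // MSS):          # one ACK per segment in flight
            increase = MSS * MSS // cwnd      # standard linear increase
            if extra_constant:
                increase += MSS // 8          # the faulty extra term
            cwnd += increase
        history.append(cwnd // MSS)
    return history
```

With the extra term, cwnd grows by roughly cwnd/8 segments per round trip once the window is large, i.e. quadratically over time; without it, growth stays near one segment per round trip.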
844 Implications 845 The extra additive term allows a TCP to more aggressively open its 846 congestion window (quadratic rather than linear increase). For 847 congested networks, this can increase the loss rate experienced by 848 all connections sharing a bottleneck with the aggressive TCP. 850 However, even for completely uncongested networks, the extra 851 additive term can lead to diminished performance, as follows. In 852 congestion avoidance, a TCP sender probes the network path to 854 ID Known TCP Implementation Problems November 1998 856 determine its available capacity, which often equates to the number 857 of buffers available at a bottleneck link. With linear congestion 858 avoidance, the TCP only probes for sufficient capacity (buffer) to 859 hold one extra packet per RTT. 861 Thus, when it exceeds the available capacity, generally only one 862 packet will be lost (since on the previous RTT it already found 863 that the path could sustain a window with one less packet in 864 flight). If the congestion window is sufficiently large, then the 865 TCP will recover from this single loss using fast retransmission 866 and avoid an expensive (in terms of performance) retransmission 867 timeout. 869 However, when the additional additive term is used, then cwnd can 870 increase by more than one packet per RTT, in which case the TCP 871 probes more aggressively. If in the previous RTT it had reached 872 the available capacity of the path, then the excess due to the 873 extra increase will again be lost, but now this will result in 874 multiple losses from the flight instead of a single loss. TCPs 875 that do not utilize SACK [RFC2018] generally will not recover from 876 multiple losses without incurring a retransmission timeout 877 [Fall96,Hoe96], significantly diminishing performance. 879 Relevant RFCs 880 RFC 1122 requires use of the "congestion avoidance" algorithm. RFC 881 2001 outlines the fast retransmit/fast recovery algorithms. 
RFC 882 2018 discusses the SACK option. 884 Trace file demonstrating it 886 Recorded using tcpdump running on the same FDDI LAN as host A. 887 Host A is the sender and host B is the receiver. The connection 888 establishment specified an MSS of 4,312 bytes and a window scale 889 factor of 4. We omit the establishment and the first 2.5 MB of 890 data transfer, as the problem is best demonstrated when the window 891 has grown to a large value. At the beginning of the trace excerpt, 892 the congestion window is 31 packets. The connection is never 893 receiver-window limited, so we omit window advertisements from the 894 trace for clarity. 896 11:42:07.697951 B > A: . ack 2383006 897 11:42:07.699388 A > B: . 2508054:2512366(4312) 898 11:42:07.699962 A > B: . 2512366:2516678(4312) 899 11:42:07.700012 B > A: . ack 2391630 900 11:42:07.701081 A > B: . 2516678:2520990(4312) 901 11:42:07.701656 A > B: . 2520990:2525302(4312) 903 ID Known TCP Implementation Problems November 1998 905 11:42:07.701739 B > A: . ack 2400254 906 11:42:07.702685 A > B: . 2525302:2529614(4312) 907 11:42:07.703257 A > B: . 2529614:2533926(4312) 908 11:42:07.703295 B > A: . ack 2408878 909 11:42:07.704414 A > B: . 2533926:2538238(4312) 910 11:42:07.704989 A > B: . 2538238:2542550(4312) 911 11:42:07.705040 B > A: . ack 2417502 912 11:42:07.705935 A > B: . 2542550:2546862(4312) 913 11:42:07.706506 A > B: . 2546862:2551174(4312) 914 11:42:07.706544 B > A: . ack 2426126 915 11:42:07.707480 A > B: . 2551174:2555486(4312) 916 11:42:07.708051 A > B: . 2555486:2559798(4312) 917 11:42:07.708088 B > A: . ack 2434750 918 11:42:07.709030 A > B: . 2559798:2564110(4312) 919 11:42:07.709604 A > B: . 2564110:2568422(4312) 920 11:42:07.710175 A > B: . 2568422:2572734(4312) * 922 11:42:07.710215 B > A: . ack 2443374 923 11:42:07.710799 A > B: . 2572734:2577046(4312) 924 11:42:07.711368 A > B: . 2577046:2581358(4312) 925 11:42:07.711405 B > A: . ack 2451998 926 11:42:07.712323 A > B: . 
2581358:2585670(4312) 927 11:42:07.712898 A > B: . 2585670:2589982(4312) 928 11:42:07.712938 B > A: . ack 2460622 929 11:42:07.713926 A > B: . 2589982:2594294(4312) 930 11:42:07.714501 A > B: . 2594294:2598606(4312) 931 11:42:07.714547 B > A: . ack 2469246 932 11:42:07.715747 A > B: . 2598606:2602918(4312) 933 11:42:07.716287 A > B: . 2602918:2607230(4312) 934 11:42:07.716328 B > A: . ack 2477870 935 11:42:07.717146 A > B: . 2607230:2611542(4312) 936 11:42:07.717717 A > B: . 2611542:2615854(4312) 937 11:42:07.717762 B > A: . ack 2486494 938 11:42:07.718754 A > B: . 2615854:2620166(4312) 939 11:42:07.719331 A > B: . 2620166:2624478(4312) 940 11:42:07.719906 A > B: . 2624478:2628790(4312) ** 942 11:42:07.719958 B > A: . ack 2495118 943 11:42:07.720500 A > B: . 2628790:2633102(4312) 944 11:42:07.721080 A > B: . 2633102:2637414(4312) 945 11:42:07.721739 B > A: . ack 2503742 946 11:42:07.722348 A > B: . 2637414:2641726(4312) 947 11:42:07.722918 A > B: . 2641726:2646038(4312) 948 11:42:07.769248 B > A: . ack 2512366 950 The receiver's acknowledgment policy is one ACK per two packets 951 received. Thus, for each ACK arriving at host A, two new packets 952 are sent, except when cwnd increases due to congestion avoidance, 954 ID Known TCP Implementation Problems November 1998 956 in which case three new packets are sent. 958 With an ack-every-two-packets policy, cwnd should only increase one 959 MSS per 2 RTT. However, at the point marked "*" the window 960 increases after 7 ACKs have arrived, and then again at "**" after 6 961 more ACKs. 963 While we do not have space to show the effect, this trace suffered 964 from repeated timeout retransmissions due to multiple packet losses 965 during a single RTT. 967 Trace file demonstrating correct behavior 969 Made using the same host and tracing setup as above, except now A's 970 TCP has been modified to remove the MSS/8 additive constant. 
971 Tcpdump reported 77 packet drops; the excerpt below is fully self- 972 consistent so it is unlikely that any of these occurred during the 973 excerpt. 975 We again begin when cwnd is 31 packets (this occurs significantly 976 later in the trace, because the congestion avoidance is now less 977 aggressive with opening the window). 979 14:22:21.236757 B > A: . ack 5194679 980 14:22:21.238192 A > B: . 5319727:5324039(4312) 981 14:22:21.238770 A > B: . 5324039:5328351(4312) 982 14:22:21.238821 B > A: . ack 5203303 983 14:22:21.240158 A > B: . 5328351:5332663(4312) 984 14:22:21.240738 A > B: . 5332663:5336975(4312) 985 14:22:21.270422 B > A: . ack 5211927 986 14:22:21.271883 A > B: . 5336975:5341287(4312) 987 14:22:21.272458 A > B: . 5341287:5345599(4312) 988 14:22:21.279099 B > A: . ack 5220551 989 14:22:21.280539 A > B: . 5345599:5349911(4312) 990 14:22:21.281118 A > B: . 5349911:5354223(4312) 991 14:22:21.281183 B > A: . ack 5229175 992 14:22:21.282348 A > B: . 5354223:5358535(4312) 993 14:22:21.283029 A > B: . 5358535:5362847(4312) 994 14:22:21.283089 B > A: . ack 5237799 995 14:22:21.284213 A > B: . 5362847:5367159(4312) 996 14:22:21.284779 A > B: . 5367159:5371471(4312) 997 14:22:21.285976 B > A: . ack 5246423 998 14:22:21.287465 A > B: . 5371471:5375783(4312) 999 14:22:21.288036 A > B: . 5375783:5380095(4312) 1000 14:22:21.288073 B > A: . ack 5255047 1001 14:22:21.289155 A > B: . 5380095:5384407(4312) 1002 14:22:21.289725 A > B: . 5384407:5388719(4312) 1004 ID Known TCP Implementation Problems November 1998 1006 14:22:21.289762 B > A: . ack 5263671 1007 14:22:21.291090 A > B: . 5388719:5393031(4312) 1008 14:22:21.291662 A > B: . 5393031:5397343(4312) 1009 14:22:21.291701 B > A: . ack 5272295 1010 14:22:21.292870 A > B: . 5397343:5401655(4312) 1011 14:22:21.293441 A > B: . 5401655:5405967(4312) 1012 14:22:21.293481 B > A: . ack 5280919 1013 14:22:21.294476 A > B: . 5405967:5410279(4312) 1014 14:22:21.295053 A > B: . 
5410279:5414591(4312) 1015 14:22:21.295106 B > A: . ack 5289543 1016 14:22:21.296306 A > B: . 5414591:5418903(4312) 1017 14:22:21.296878 A > B: . 5418903:5423215(4312) 1018 14:22:21.296917 B > A: . ack 5298167 1019 14:22:21.297716 A > B: . 5423215:5427527(4312) 1020 14:22:21.298285 A > B: . 5427527:5431839(4312) 1021 14:22:21.298324 B > A: . ack 5306791 1022 14:22:21.299413 A > B: . 5431839:5436151(4312) 1023 14:22:21.299986 A > B: . 5436151:5440463(4312) 1024 14:22:21.303696 B > A: . ack 5315415 1025 14:22:21.305177 A > B: . 5440463:5444775(4312) 1026 14:22:21.305755 A > B: . 5444775:5449087(4312) 1027 14:22:21.308032 B > A: . ack 5324039 1028 14:22:21.309525 A > B: . 5449087:5453399(4312) 1029 14:22:21.310101 A > B: . 5453399:5457711(4312) 1030 14:22:21.310144 B > A: . ack 5332663 *** 1032 14:22:21.311615 A > B: . 5457711:5462023(4312) 1033 14:22:21.312198 A > B: . 5462023:5466335(4312) 1034 14:22:21.341876 B > A: . ack 5341287 1035 14:22:21.343451 A > B: . 5466335:5470647(4312) 1036 14:22:21.343985 A > B: . 5470647:5474959(4312) 1037 14:22:21.350304 B > A: . ack 5349911 1038 14:22:21.351852 A > B: . 5474959:5479271(4312) 1039 14:22:21.352430 A > B: . 5479271:5483583(4312) 1040 14:22:21.352484 B > A: . ack 5358535 1041 14:22:21.353574 A > B: . 5483583:5487895(4312) 1042 14:22:21.354149 A > B: . 5487895:5492207(4312) 1043 14:22:21.354205 B > A: . ack 5367159 1044 14:22:21.355467 A > B: . 5492207:5496519(4312) 1045 14:22:21.356039 A > B: . 5496519:5500831(4312) 1046 14:22:21.357361 B > A: . ack 5375783 1047 14:22:21.358855 A > B: . 5500831:5505143(4312) 1048 14:22:21.359424 A > B: . 5505143:5509455(4312) 1049 14:22:21.359465 B > A: . ack 5384407 1050 14:22:21.360605 A > B: . 5509455:5513767(4312) 1051 14:22:21.361181 A > B: . 5513767:5518079(4312) 1052 14:22:21.361225 B > A: . ack 5393031 1053 14:22:21.362485 A > B: . 5518079:5522391(4312) 1055 ID Known TCP Implementation Problems November 1998 1057 14:22:21.363057 A > B: . 
5522391:5526703(4312) 1058 14:22:21.363096 B > A: . ack 5401655 1059 14:22:21.364236 A > B: . 5526703:5531015(4312) 1060 14:22:21.364810 A > B: . 5531015:5535327(4312) 1061 14:22:21.364867 B > A: . ack 5410279 1062 14:22:21.365819 A > B: . 5535327:5539639(4312) 1063 14:22:21.366386 A > B: . 5539639:5543951(4312) 1064 14:22:21.366427 B > A: . ack 5418903 1065 14:22:21.367586 A > B: . 5543951:5548263(4312) 1066 14:22:21.368158 A > B: . 5548263:5552575(4312) 1067 14:22:21.368199 B > A: . ack 5427527 1068 14:22:21.369189 A > B: . 5552575:5556887(4312) 1069 14:22:21.369758 A > B: . 5556887:5561199(4312) 1070 14:22:21.369803 B > A: . ack 5436151 1071 14:22:21.370814 A > B: . 5561199:5565511(4312) 1072 14:22:21.371398 A > B: . 5565511:5569823(4312) 1073 14:22:21.375159 B > A: . ack 5444775 1074 14:22:21.376658 A > B: . 5569823:5574135(4312) 1075 14:22:21.377235 A > B: . 5574135:5578447(4312) 1076 14:22:21.379303 B > A: . ack 5453399 1077 14:22:21.380802 A > B: . 5578447:5582759(4312) 1078 14:22:21.381377 A > B: . 5582759:5587071(4312) 1079 14:22:21.381947 A > B: . 5587071:5591383(4312) **** 1081 "***" marks the end of the first round trip. Note that cwnd did 1082 not increase (as evidenced by each ACK eliciting two new data 1083 packets). Only at "****", which comes near the end of the second 1084 round trip, does cwnd increase by one packet. 1086 This trace did not suffer any timeout retransmissions. It 1087 transferred the same amount of data as the first trace in about 1088 half as much time. This difference is repeatable between hosts A 1089 and B. 1091 References 1092 [Stevens94] and [Wright95] discuss this problem. The problem of 1093 Reno TCP failing to recover from multiple losses except via a 1094 retransmission timeout is discussed in [Fall96,Hoe96]. 1096 How to detect 1097 If source code is available, that is generally the easiest way to 1098 detect this problem. 
Search for each modification to the cwnd 1099 variable; (at least) one of these will be for congestion avoidance, 1100 and inspection of the related code should immediately identify the 1101 problem if present. 1103 ID Known TCP Implementation Problems November 1998 1105 The problem can also be detected by closely examining packet traces 1106 taken near the sender. During congestion avoidance, cwnd will 1107 increase by an additional segment upon the receipt of (typically) 1108 eight acknowledgements without a loss. This increase is in 1109 addition to the one segment increase per round trip time (or two 1110 round trip times if the receiver is using delayed ACKs). 1112 Furthermore, graphs of the sequence number vs. time, taken from 1113 packet traces, are normally linear during congestion avoidance. 1114 When viewing packet traces of transfers from senders exhibiting 1115 this problem, the graphs appear quadratic instead of linear. 1117 Finally, the traces will show that, with sufficiently large 1118 windows, nearly every loss event results in a timeout. 1120 How to fix 1121 This problem may be corrected by removing the "+ MSS/8" term from 1122 the congestion avoidance code that increases cwnd each time an ACK 1123 of new data is received. 1125 3.7. 1127 Name of Problem 1128 Initial RTO too low 1130 Classification 1131 Performance 1133 Description 1134 When a TCP first begins transmitting data, it lacks the RTT 1135 measurements necessary to have computed an adaptive retransmission 1136 timeout (RTO). RFC 1122, 4.2.3.1, states that a TCP SHOULD 1137 initialize RTO to 3 seconds. A TCP that uses a lower value 1138 exhibits "Initial RTO too low". 1140 Significance 1141 In environments with large RTTs (where "large" means any value 1142 larger than the initial RTO), TCPs will experience very poor 1143 performance. 
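The interaction between the initial RTO and the path RTT can be sketched numerically. The model below is illustrative only: it assumes correct exponential backoff of the timer, ignores loss and queueing, and counts the unnecessary retransmissions of a connection's first segment when its ACK arrives one RTT after the initial send.

```python
def spurious_retransmissions(rtt, initial_rto, max_rto=64.0):
    """Count unnecessary retransmissions of a connection's first
    segment.  Its ACK arrives one RTT after the initial send; each
    time the retransmit timer expires before that, the segment is
    resent and the RTO doubles (exponential backoff).  All times are
    in seconds."""
    rto, elapsed, count = initial_rto, 0.0, 0
    while elapsed + rto < rtt:    # timer fires before the ACK arrives
        elapsed += rto
        count += 1
        rto = min(2 * rto, max_rto)
    return count
```

With a 650 msec RTT and a 200 msec initial RTO, the first segment goes out three times (two spurious retransmissions), whereas the 3 second initial value recommended by RFC 1122 produces none.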
1145 Implications 1146 Whenever RTO < RTT, very poor performance can result as packets are 1148 ID Known TCP Implementation Problems November 1998 1150 unnecessarily retransmitted (because RTO will expire before an ACK 1151 for the packet can arrive) and the connection enters slow start and 1152 congestion avoidance. Generally, the algorithms for computing RTO 1153 avoid this problem by adding a positive term to the estimated RTT. 1154 However, when a connection first begins it must use some estimate 1155 for RTO, and if it picks a value less than RTT, the above problems 1156 will arise. 1158 Furthermore, when the initial RTO < RTT, it can take a long time 1159 for the TCP to correct the problem by adapting the RTT estimate, 1160 because the use of Karn's algorithm (mandated by RFC 1122, 4.2.3.1) 1161 will discard many of the candidate RTT measurements made after the 1162 first timeout, since they will be measurements of retransmitted 1163 segments. 1165 Relevant RFCs 1166 RFC 1122 states that TCPs SHOULD initialize RTO to 3 seconds and 1167 MUST implement Karn's algorithm. 1169 Trace file demonstrating it 1170 The following trace file was taken using tcpdump at host A, the 1171 data sender. The advertised window and SYN options have been 1172 omitted for clarity. 1174 07:52:39.870301 A > B: S 2786333696:2786333696(0) 1175 07:52:40.548170 B > A: S 130240000:130240000(0) ack 2786333697 1176 07:52:40.561287 A > B: P 1:513(512) ack 1 1177 07:52:40.753466 A > B: . 1:513(512) ack 1 1178 07:52:41.133687 A > B: . 1:513(512) ack 1 1179 07:52:41.458529 B > A: . ack 513 1180 07:52:41.458686 A > B: . 513:1025(512) ack 1 1181 07:52:41.458797 A > B: P 1025:1537(512) ack 1 1182 07:52:41.541633 B > A: . ack 513 1183 07:52:41.703732 A > B: . 513:1025(512) ack 1 1184 07:52:42.044875 B > A: . ack 513 1185 07:52:42.173728 A > B: . 513:1025(512) ack 1 1186 07:52:42.330861 B > A: . ack 1537 1187 07:52:42.331129 A > B: . 
1537:2049(512) ack 1 1188 07:52:42.331262 A > B: P 2049:2561(512) ack 1 1189 07:52:42.623673 A > B: . 1537:2049(512) ack 1 1190 07:52:42.683203 B > A: . ack 1537 1191 07:52:43.044029 B > A: . ack 1537 1192 07:52:43.193812 A > B: . 1537:2049(512) ack 1 1194 Note from the SYN/SYN-ACK exchange, the RTT is over 600 msec. 1195 However, from the elapsed time between the third and fourth lines 1197 ID Known TCP Implementation Problems November 1998 1199 (the first packet being sent and then retransmitted), it is 1200 apparent the RTO was initialized to under 200 msec. The next line 1201 shows that this value has doubled to 400 msec (correct exponential 1202 backoff of RTO), but that still does not suffice to avoid an 1203 unnecessary retransmission. 1205 Finally, an ACK from B arrives for the first segment. Later two 1206 more duplicate ACKs for 513 arrive, indicating that both the 1207 original and the two retransmissions arrived at B. (Indeed, a 1208 concurrent trace at B showed that no packets were lost during the 1209 entire connection). This ACK opens the congestion window to two 1210 packets, which are sent back-to-back, but at 07:52:41.703732 RTO 1211 again expires after a little over 200 msec, leading to an 1212 unnecessary retransmission, and the pattern repeats. By the end of 1213 the trace excerpt above, 1536 bytes have been successfully 1214 transmitted from A to B, over an interval of more than 2 seconds, 1215 reflecting terrible performance. 1217 Trace file demonstrating correct behavior 1218 The following trace file was taken using tcpdump at host C, the 1219 data sender. The advertised window and SYN options have been 1220 omitted for clarity. 1222 17:30:32.090299 C > D: S 2031744000:2031744000(0) 1223 17:30:32.900325 D > C: S 262737964:262737964(0) ack 2031744001 1224 17:30:32.900326 C > D: . ack 1 1225 17:30:32.910326 C > D: . 1:513(512) ack 1 1226 17:30:34.150355 D > C: . ack 513 1227 17:30:34.150356 C > D: . 
513:1025(512) ack 1 1228 17:30:34.150357 C > D: . 1025:1537(512) ack 1 1229 17:30:35.170384 D > C: . ack 1025 1230 17:30:35.170385 C > D: . 1537:2049(512) ack 1 1231 17:30:35.170386 C > D: . 2049:2561(512) ack 1 1232 17:30:35.320385 D > C: . ack 1537 1233 17:30:35.320386 C > D: . 2561:3073(512) ack 1 1234 17:30:35.320387 C > D: . 3073:3585(512) ack 1 1235 17:30:35.730384 D > C: . ack 2049 1237 The initial SYN/SYN-ACK exchange shows that RTT is more than 800 1238 msec, and for some subsequent packets it rises above 1 second, but 1239 C's retransmit timer does not ever expire. 1241 References 1242 This problem is documented in [Paxson97]. 1244 ID Known TCP Implementation Problems November 1998 1246 How to detect 1247 This problem is readily detected by inspecting a packet trace of 1248 the startup of a TCP connection made over a long-delay path. It 1249 can be diagnosed from either a sender-side or receiver-side trace. 1250 Long-delay paths can often be found by locating remote sites on 1251 other continents. 1253 How to fix 1254 As this problem arises from a faulty initialization, one hopes 1255 fixing it requires a one-line change to the TCP source code. 1257 3.8. 1259 Name of Problem 1260 Failure of window deflation after loss recovery 1262 Classification 1263 Congestion control / performance 1265 Description 1266 The fast recovery algorithm allows TCP senders to continue to 1267 transmit new segments during loss recovery. First, fast 1268 retransmission is initiated after a TCP sender receives three 1269 duplicate ACKs. At this point, a retransmission is sent and cwnd 1270 is halved. The fast recovery algorithm then allows additional 1271 segments to be sent when sufficient additional duplicate ACKs 1272 arrive. 
Some implementations of fast recovery compute when to send 1273 additional segments by artificially incrementing cwnd, first by 1274 three segments to account for the three duplicate ACKs that 1275 triggered fast retransmission, and subsequently by 1 MSS for each 1276 new duplicate ACK that arrives. When cwnd allows, the sender 1277 transmits new data segments. 1279 When an ACK arrives that covers new data, cwnd is to be reduced by 1280 the amount by which it was artificially increased. However, some 1281 TCP implementations fail to "deflate" the window, causing an 1282 inappropriate amount of data to be sent into the network after 1283 recovery. One cause of this problem is the "header prediction" 1284 code, which is used to handle incoming segments that require little 1285 work. In some implementations of TCP, the header prediction code 1286 does not check to make sure cwnd has not been artificially 1287 inflated, and therefore does not reduce the artificially increased 1288 cwnd when appropriate. 1290 ID Known TCP Implementation Problems November 1998 1292 Significance 1293 TCP senders that exhibit this problem will transmit a burst of data 1294 immediately after recovery, which can degrade performance, as well 1295 as network stability. Effectively, the sender does not reduce the 1296 size of cwnd as much as it should (to half its value when loss was 1297 detected), if at all. This can harm the performance of the TCP 1298 connection itself, as well as competing TCP flows. 1300 Implications 1301 A TCP sender exhibiting this problem does not reduce cwnd 1302 appropriately in times of congestion, and therefore may contribute 1303 to congestive collapse. 1305 Relevant RFCs 1306 RFC 2001 outlines the fast retransmit/fast recovery algorithms. 1307 [Brakmo95] outlines this implementation problem and offers a fix. 1309 Trace file demonstrating it 1310 The following trace file was taken using tcpdump at host A, the 1311 data sender. 
The advertised window (which never changed) has been 1312 omitted for clarity, except for the first packet sent by each host. 1314 08:22:56.825635 A.7505 > B.7505: . 29697:30209(512) ack 1 win 4608 1315 08:22:57.038794 B.7505 > A.7505: . ack 27649 win 4096 1316 08:22:57.039279 A.7505 > B.7505: . 30209:30721(512) ack 1 1317 08:22:57.321876 B.7505 > A.7505: . ack 28161 1318 08:22:57.322356 A.7505 > B.7505: . 30721:31233(512) ack 1 1319 08:22:57.347128 B.7505 > A.7505: . ack 28673 1320 08:22:57.347572 A.7505 > B.7505: . 31233:31745(512) ack 1 1321 08:22:57.347782 A.7505 > B.7505: . 31745:32257(512) ack 1 1322 08:22:57.936393 B.7505 > A.7505: . ack 29185 1323 08:22:57.936864 A.7505 > B.7505: . 32257:32769(512) ack 1 1324 08:22:57.950802 B.7505 > A.7505: . ack 29697 win 4096 1325 08:22:57.951246 A.7505 > B.7505: . 32769:33281(512) ack 1 1326 08:22:58.169422 B.7505 > A.7505: . ack 29697 1327 08:22:58.638222 B.7505 > A.7505: . ack 29697 1328 08:22:58.643312 B.7505 > A.7505: . ack 29697 1329 08:22:58.643669 A.7505 > B.7505: . 29697:30209(512) ack 1 1330 08:22:58.936436 B.7505 > A.7505: . ack 29697 1331 08:22:59.002614 B.7505 > A.7505: . ack 29697 1332 08:22:59.003026 A.7505 > B.7505: . 33281:33793(512) ack 1 1333 08:22:59.682902 B.7505 > A.7505: . ack 33281 1334 08:22:59.683391 A.7505 > B.7505: P 33793:34305(512) ack 1 1335 08:22:59.683748 A.7505 > B.7505: P 34305:34817(512) ack 1 *** 1336 08:22:59.684043 A.7505 > B.7505: P 34817:35329(512) ack 1 1338 ID Known TCP Implementation Problems November 1998 1340 08:22:59.684266 A.7505 > B.7505: P 35329:35841(512) ack 1 1341 08:22:59.684567 A.7505 > B.7505: P 35841:36353(512) ack 1 1342 08:22:59.684810 A.7505 > B.7505: P 36353:36865(512) ack 1 1343 08:22:59.685094 A.7505 > B.7505: P 36865:37377(512) ack 1 1345 The first 12 lines of the trace show incoming ACKs clocking out a 1346 window of data segments. At this point in the transfer, cwnd is 7 1347 segments. 
    The next 4 lines of the trace show 3 duplicate ACKs arriving from
    the receiver, followed by a retransmission from the sender.  At
    this point, cwnd is halved (to 3 segments) and artificially
    incremented by the three duplicate ACKs that have arrived, making
    cwnd 6 segments.  The next two lines show 2 more duplicate ACKs
    arriving, each of which increases cwnd by 1 segment.  So, after
    these two duplicate ACKs arrive the cwnd is 8 segments and the
    sender has permission to send 1 new segment (since there are 7
    segments outstanding).  The next line in the trace shows this new
    segment being transmitted.  The next packet shown in the trace is
    an ACK from host B that covers the first 7 outstanding segments
    (all but the new segment sent during recovery).  This should
    cause cwnd to be reduced to 3 segments and 2 segments to be
    transmitted (since there is already 1 outstanding segment in the
    network).  However, as shown by the last 7 lines of the trace,
    cwnd is not reduced, causing a line-rate burst of 7 new segments.

Trace file demonstrating correct behavior
    The trace would appear identical to the one above, only it would
    stop after the line marked "***", because at this point host A
    would correctly reduce cwnd after recovery, allowing only 2
    segments to be transmitted, rather than producing a burst of 7
    segments.

References
    This problem is documented and the performance implications
    analyzed in [Brakmo95].

How to detect
    Failure of window deflation after loss recovery can be found by
    examining sender-side packet traces recorded during periods of
    moderate loss (so cwnd can grow large enough to allow for fast
    recovery when loss occurs).
How to fix
    When this bug is caused by incorrect header prediction, the fix
    is to add a predicate to the header prediction test that checks
    to see whether cwnd is inflated; if so, the header prediction
    test fails and the usual ACK processing occurs, which (in this
    case) takes care to deflate the window.  See [Brakmo95] for
    details.

3.9.

Name of Problem
    Excessively short keepalive connection timeout

Classification
    Reliability

Description
    Keep-alive is a mechanism for checking whether an idle connection
    is still alive.  According to RFC 1122, keepalive should only be
    invoked in server applications that might otherwise hang
    indefinitely and consume resources unnecessarily if a client
    crashes or aborts a connection during a network failure.

    RFC 1122 also specifies that if a keep-alive mechanism is
    implemented it MUST NOT interpret failure to respond to any
    specific probe as a dead connection.  The RFC does not specify a
    particular mechanism for timing out a connection when no response
    is received for keepalive probes.  However, if the mechanism does
    not allow ample time for recovery from network congestion or
    delay, connections may be timed out unnecessarily.

Significance
    In congested networks, can lead to unwarranted termination of
    connections.

Implications
    It is possible for the network connection between two peer
    machines to become congested or to exhibit packet loss at the
    time that a keep-alive probe is sent on a connection.  If the
    keep-alive mechanism does not allow sufficient time before
    dropping connections in the face of unacknowledged probes,
    connections may be dropped even when both peers of a connection
    are still alive.

Relevant RFCs
    RFC 1122 specifies that the keep-alive mechanism may be provided.
    It does not specify a mechanism for determining dead connections
    when keepalive probes are not acknowledged.

Trace file demonstrating it
    Made using the Orchestra tool at the peer of the machine using
    keep-alive.  After connection establishment, incoming keep-alives
    were dropped by Orchestra to simulate a dead connection.

    22:11:12.040000 A > B: 22666019:0 win 8192 datasz 4 SYN
    22:11:12.060000 B > A: 2496001:22666020 win 4096 datasz 4 SYN ACK
    22:11:12.130000 A > B: 22666020:2496002 win 8760 datasz 0 ACK
    (more than two hours elapse)
    00:23:00.680000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
    00:23:01.770000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
    00:23:02.870000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
    00:23:03.970000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
    00:23:05.070000 A > B: 22666019:2496002 win 8760 datasz 1 ACK

    The initial three packets are the SYN exchange for connection
    setup.  About two hours later, the keepalive timer fires because
    the connection has been idle.  Keepalive probes are transmitted a
    total of 5 times, with a 1 second spacing between probes, after
    which the connection is dropped.  This is problematic because a 5
    second network outage at the time of the first probe results in
    the connection being killed.

Trace file demonstrating correct behavior
    Made using the Orchestra tool at the peer of the machine using
    keep-alive.  After connection establishment, incoming keep-alives
    were dropped by Orchestra to simulate a dead connection.
    16:01:52.130000 A > B: 1804412929:0 win 4096 datasz 4 SYN
    16:01:52.360000 B > A: 16512001:1804412930 win 4096 datasz 4 SYN ACK
    16:01:52.410000 A > B: 1804412930:16512002 win 4096 datasz 0 ACK
    (two hours elapse)
    18:01:57.170000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:03:12.220000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:04:27.270000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:05:42.320000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:06:57.370000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:08:12.420000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:09:27.480000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:10:43.290000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:11:57.580000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
    18:13:12.630000 A > B: 1804412929:16512002 win 4096 datasz 0 RST ACK

    In this trace, when the keep-alive timer expires, 9 keepalive
    probes are sent at 75 second intervals.  75 seconds after the
    last probe is sent, a final RST segment is sent indicating that
    the connection has been closed.  This implementation waits about
    11 minutes before timing out the connection, while the first
    implementation shown allows only 5 seconds.

References
    This problem is documented in [Dawson97].

How to detect
    For implementations manifesting this problem, it shows up on a
    packet trace after the keepalive timer fires if the peer machine
    receiving the keepalive does not respond.  Usually the keepalive
    timer will fire at least two hours after keepalive is turned on,
    but it may be sooner if the timer value has been configured
    lower, or if the keepalive mechanism violates the specification
    (see the "Insufficient interval between keepalives" problem).
    In this example, suppressing the response of the peer to
    keepalive probes was accomplished using the Orchestra toolkit,
    which can be configured to drop packets.  It could also have been
    done by creating a connection, turning on keepalive, and
    disconnecting the network connection at the receiver machine.

How to fix
    This problem can be fixed by using a different method for timing
    out keepalives that allows a longer period of time to elapse
    before dropping the connection.  For example, the algorithm for
    timing out on dropped data could be used.  Another possibility is
    an algorithm such as the one shown in the trace above, which
    sends 9 probes at 75 second intervals and then waits an
    additional 75 seconds for a response before closing the
    connection.

3.10.

Name of Problem
    Failure to back off retransmission timeout

Classification
    Congestion control / reliability

Description
    The retransmission timeout is used to determine when a packet has
    been dropped in the network.  When this timeout has expired
    without the arrival of an ACK, the segment is retransmitted.
    Each time a segment is retransmitted, the timeout is adjusted
    according to an exponential backoff algorithm, doubling each
    time.  If a TCP fails to receive an ACK after numerous attempts
    at retransmitting the same segment, it terminates the connection.
    A TCP that fails to double its retransmission timeout upon
    repeated timeouts is said to exhibit "Failure to back off
    retransmission timeout".

Significance
    Backing off the retransmission timer is a cornerstone of network
    stability in the presence of congestion.  Consequently, this bug
    can have severe adverse effects in congested networks.  It also
    affects TCP reliability in congested networks, as discussed in
    the next section.
Implications
    It is possible for the network connection between two TCP peers
    to become congested or to exhibit packet loss at the time that a
    retransmission is sent on a connection.  If the retransmission
    mechanism does not allow sufficient time before dropping
    connections in the face of unacknowledged segments, connections
    may be dropped even when, by waiting longer, the connection could
    have continued.

Relevant RFCs
    RFC 1122 specifies mandatory exponential backoff of the
    retransmission timeout, and the termination of connections after
    some period of time (at least 100 seconds).

Trace file demonstrating it
    Made using tcpdump on an intermediate host:

    16:51:12.671727 A > B: S 510878852:510878852(0) win 16384
    16:51:12.672479 B > A: S 2392143687:2392143687(0) ack 510878853 win 16384
    16:51:12.672581 A > B: . ack 1 win 16384
    16:51:15.244171 A > B: P 1:3(2) ack 1 win 16384
    16:51:15.244933 B > A: . ack 3 win 17518 (DF)

    16:51:19.381176 A > B: P 3:5(2) ack 1 win 16384
    16:51:20.162016 A > B: P 3:5(2) ack 1 win 16384
    16:51:21.161936 A > B: P 3:5(2) ack 1 win 16384
    16:51:22.161914 A > B: P 3:5(2) ack 1 win 16384
    16:51:23.161914 A > B: P 3:5(2) ack 1 win 16384
    16:51:24.161879 A > B: P 3:5(2) ack 1 win 16384
    16:51:25.161857 A > B: P 3:5(2) ack 1 win 16384
    16:51:26.161836 A > B: P 3:5(2) ack 1 win 16384
    16:51:27.161814 A > B: P 3:5(2) ack 1 win 16384
    16:51:28.161791 A > B: P 3:5(2) ack 1 win 16384
    16:51:29.161769 A > B: P 3:5(2) ack 1 win 16384
    16:51:30.161750 A > B: P 3:5(2) ack 1 win 16384
    16:51:31.161727 A > B: P 3:5(2) ack 1 win 16384

    16:51:32.161701 A > B: R 5:5(0) ack 1 win 16384

    The initial three packets are the SYN exchange for connection
    setup, then a single data packet, to verify that data can be
    transferred.
    Then the connection to the destination host was disconnected, and
    more data sent.  Retransmissions occur every second for 12
    seconds, and then the connection is terminated with a RST.  This
    is problematic because a 12 second pause in connectivity could
    result in the termination of a connection.

Trace file demonstrating correct behavior
    Again, a tcpdump taken from a third host:

    16:59:05.398301 A > B: S 2503324757:2503324757(0) win 16384
    16:59:05.399673 B > A: S 2492674648:2492674648(0) ack 2503324758 win 16384
    16:59:05.399866 A > B: . ack 1 win 17520
    16:59:06.538107 A > B: P 1:3(2) ack 1 win 17520
    16:59:06.540977 B > A: . ack 3 win 17518 (DF)

    16:59:13.121542 A > B: P 3:5(2) ack 1 win 17520
    16:59:14.010928 A > B: P 3:5(2) ack 1 win 17520
    16:59:16.010979 A > B: P 3:5(2) ack 1 win 17520
    16:59:20.011229 A > B: P 3:5(2) ack 1 win 17520
    16:59:28.011896 A > B: P 3:5(2) ack 1 win 17520
    16:59:44.013200 A > B: P 3:5(2) ack 1 win 17520
    17:00:16.015766 A > B: P 3:5(2) ack 1 win 17520
    17:01:20.021308 A > B: P 3:5(2) ack 1 win 17520
    17:02:24.027752 A > B: P 3:5(2) ack 1 win 17520
    17:03:28.034569 A > B: P 3:5(2) ack 1 win 17520
    17:04:32.041567 A > B: P 3:5(2) ack 1 win 17520
    17:05:36.048264 A > B: P 3:5(2) ack 1 win 17520
    17:06:40.054900 A > B: P 3:5(2) ack 1 win 17520

    17:07:44.061306 A > B: R 5:5(0) ack 1 win 17520

    In this trace, when the retransmission timer expires, 12
    retransmissions are sent at exponentially-increasing intervals,
    until the interval value reaches 64 seconds, at which time the
    interval stops growing.  64 seconds after the last
    retransmission, a final RST segment is sent indicating that the
    connection has been closed.
    This implementation waits about 9 minutes before timing out the
    connection, while the first implementation shown allows only 12
    seconds.

References
    None known.

How to detect
    A simple transfer can be easily interrupted by disconnecting the
    receiving host from the network.  tcpdump or another appropriate
    tool should show the retransmissions being sent.  Several trials
    in a low-RTT environment may be required to demonstrate the bug.

How to fix
    For one of the implementations studied, this problem seemed to be
    the result of an error introduced with the addition of the
    Brakmo-Peterson RTO algorithm [Brakmo95], which can return a
    value of zero where the older Jacobson algorithm always returns a
    positive value.  Brakmo and Peterson specified an additional step
    of min(rtt + 2, RTO) to avoid problems with this.  Unfortunately,
    in the implementation this step was omitted when calculating the
    exponential backoff for the RTO.  This results in an RTO of 0
    seconds being multiplied by the backoff, yielding again zero, and
    then being subjected to a later MAX operation that increases it
    to 1 second, regardless of the backoff factor.

    A similar TCP persist failure has the same cause.

3.11.

Name of Problem
    Insufficient interval between keepalives

Classification
    Reliability

Description
    Keep-alive is a mechanism for checking whether an idle connection
    is still alive.  According to RFC 1122, keep-alive may be
    included in an implementation.  If it is included, the interval
    between keep-alive packets MUST be configurable, and MUST default
    to no less than two hours.

Significance
    In congested networks, can lead to unwarranted termination of
    connections.
Implications
    According to RFC 1122, keep-alive is not required of
    implementations because it could: (1) cause perfectly good
    connections to break during transient Internet failures; (2)
    consume unnecessary bandwidth ("if no one is using the
    connection, who cares if it is still good?"); and (3) cost money
    for an Internet path that charges for packets.  Regarding this
    last point, we note that in addition the presence of
    dial-on-demand links in the route can greatly magnify the cost
    penalty of excess keepalives, potentially forcing a full-time
    connection on a link that would otherwise only be connected a few
    minutes a day.

    If keepalive is provided the RFC states that the required
    inter-keepalive distance MUST default to no less than two hours.
    If it does not, the probability of connections breaking
    increases, the bandwidth used due to keepalives increases, and
    cost increases over paths which charge per packet.

Relevant RFCs
    RFC 1122 specifies that the keep-alive mechanism may be provided.
    It also specifies the two hour minimum for the default interval
    between keepalive probes.

Trace file demonstrating it
    Made using the Orchestra tool at the peer of the machine using
    keep-alive.  Machine A was configured to use default settings for
    the keepalive timer.
    11:36:32.910000 A > B: 3288354305:0 win 28672 datasz 4 SYN
    11:36:32.930000 B > A: 896001:3288354306 win 4096 datasz 4 SYN ACK
    11:36:32.950000 A > B: 3288354306:896002 win 28672 datasz 0 ACK

    11:50:01.190000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
    11:50:01.210000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

    12:03:29.410000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
    12:03:29.430000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

    12:16:57.630000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
    12:16:57.650000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

    12:30:25.850000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
    12:30:25.870000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

    12:43:54.070000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
    12:43:54.090000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

    The initial three packets are the SYN exchange for connection
    setup.  About 13 minutes later, the keepalive timer fires because
    the connection is idle.  The keepalive is acknowledged, and the
    timer fires again in about 13 more minutes.  This behavior
    continues indefinitely until the connection is closed, and is a
    violation of the specification.

Trace file demonstrating correct behavior
    Made using the Orchestra tool at the peer of the machine using
    keep-alive.  Machine A was configured to use default settings for
    the keepalive timer.
    17:37:20.500000 A > B: 34155521:0 win 4096 datasz 4 SYN
    17:37:20.520000 B > A: 6272001:34155522 win 4096 datasz 4 SYN ACK
    17:37:20.540000 A > B: 34155522:6272002 win 4096 datasz 0 ACK

    19:37:25.430000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
    19:37:25.450000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

    21:37:30.560000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
    21:37:30.570000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

    23:37:35.580000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
    23:37:35.600000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

    01:37:40.620000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
    01:37:40.640000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

    03:37:45.590000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
    03:37:45.610000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

    The initial three packets are the SYN exchange for connection
    setup.  Just over two hours later, the keepalive timer fires
    because the connection is idle.  The keepalive is acknowledged,
    and the timer fires again just over two hours later.  This
    behavior continues indefinitely until the connection is closed.

References
    This problem is documented in [Dawson97].

How to detect
    For implementations manifesting this problem, it shows up on a
    packet trace.  If the connection is left idle, the keepalive
    probes will arrive closer together than the two hour minimum.

3.12.

Name of Problem
    Window probe deadlock

Classification
    Reliability

Description
    When an application reads a single byte from a full window, the
    window should not be updated, in order to avoid Silly Window
    Syndrome (SWS; see [RFC813]).  If the remote peer uses a single
    byte of data to probe the window, that byte can be accepted into
    the buffer.
    In some implementations, at this point a negative argument to a
    signed comparison causes all further new data to be considered
    outside the window; consequently, it is discarded (after sending
    an ACK to resynchronize).  These discards include the ACKs for
    the data packets sent by the local TCP, so the TCP will consider
    the data unacknowledged.

    Consequently, the application may be unable to complete sending
    new data to the remote peer, because it has exhausted the
    transmit buffer available to its local TCP, and buffer space is
    never being freed because incoming ACKs that would do so are
    being discarded.  If the application does not read any more data,
    which may happen due to its failure to complete such sends, then
    deadlock results.

Significance
    It's relatively rare for applications to use TCP in a manner that
    can exercise this problem.  Most applications only transmit bulk
    data if they know the other end is prepared to receive the data.
    However, if a client fails to consume data, putting the server in
    persist mode, and then consumes a small amount of data, it can
    mistakenly compute a negative window.  At this point the client
    will discard all further packets from the server, including ACKs
    of the client's own data, since they are not inside the
    (impossibly-sized) window.  If subsequently the client consumes
    enough data to then send a window update to the server, the
    situation will be rectified.  That is, this situation can only
    happen if the client consumes 1 < N < MSS bytes, so as not to
    cause a window update, and then starts its own transmission
    towards the server of more than a window's worth of data.

Implications
    TCP connections will hang and eventually time out.

Relevant RFCs
    RFC 793 describes zero window probing.  RFC 813 describes Silly
    Window Syndrome.
Trace file demonstrating it
    Trace made from a version of tcpdump modified to print out the
    sequence number attached to an ACK even if it's dataless.  An
    unmodified tcpdump would not print seq:seq(0); however, for this
    bug, the sequence number in the ACK is important for
    unambiguously determining how the TCP is behaving.

    [ Normal connection startup and data transmission from B to A.
      Options, including MSS of 16344 in both directions, omitted
      for clarity. ]
    16:07:32.327616 A > B: S 65360807:65360807(0) win 8192
    16:07:32.327304 B > A: S 65488807:65488807(0) ack 65360808 win 57344
    16:07:32.327425 A > B: . 1:1(0) ack 1 win 57344
    16:07:32.345732 B > A: P 1:2049(2048) ack 1 win 57344
    16:07:32.347013 B > A: P 2049:16385(14336) ack 1 win 57344
    16:07:32.347550 B > A: P 16385:30721(14336) ack 1 win 57344
    16:07:32.348683 B > A: P 30721:45057(14336) ack 1 win 57344
    16:07:32.467286 A > B: . 1:1(0) ack 45057 win 12288
    16:07:32.467854 B > A: P 45057:57345(12288) ack 1 win 57344

    [ B fills up A's offered window ]
    16:07:32.667276 A > B: . 1:1(0) ack 57345 win 0

    [ B probes A's window with a single byte ]
    16:07:37.467438 B > A: . 57345:57346(1) ack 1 win 57344

    [ A resynchronizes without accepting the byte ]
    16:07:37.467678 A > B: . 1:1(0) ack 57345 win 0

    [ B probes A's window again ]
    16:07:45.467438 B > A: . 57345:57346(1) ack 1 win 57344

    [ A resynchronizes and accepts the byte (per the ack field) ]
    16:07:45.667250 A > B: . 1:1(0) ack 57346 win 0

    [ The application on A has started generating data.  The first
      packet A sends is small due to a memory allocation bug. ]
    16:07:51.358459 A > B: P 1:2049(2048) ack 57346 win 0

    [ B acks A's first packet ]
    16:07:51.467239 B > A: . 57346:57346(0) ack 2049 win 57344

    [ This looks as though A accepted B's ACK and is sending
      another packet in response to it.  In fact, A is trying
      to resynchronize with B, and happens to have data to send
      and can send it because the first small packet didn't use
      up cwnd. ]
    16:07:51.467698 A > B: . 2049:14337(12288) ack 57346 win 0

    [ B acks all of the data that A has sent ]
    16:07:51.667283 B > A: . 57346:57346(0) ack 14337 win 57344

    [ A tries to resynchronize.  Notice that by the packets
      seen on the network, A and B *are* in fact synchronized;
      A only thinks that they aren't. ]
    16:07:51.667477 A > B: . 14337:14337(0) ack 57346 win 0

    [ A's retransmit timer fires, and B acks all of the data.
      A once again tries to resynchronize. ]
    16:07:52.467682 A > B: . 1:14337(14336) ack 57346 win 0
    16:07:52.468166 B > A: . 57346:57346(0) ack 14337 win 57344
    16:07:52.468248 A > B: . 14337:14337(0) ack 57346 win 0

    [ A's retransmit timer fires again, and B acks all of the data.
      A once again tries to resynchronize. ]
    16:07:55.467684 A > B: . 1:14337(14336) ack 57346 win 0
    16:07:55.468172 B > A: . 57346:57346(0) ack 14337 win 57344
    16:07:55.468254 A > B: . 14337:14337(0) ack 57346 win 0

Trace file demonstrating correct behavior
    Made between the same two hosts after applying the bug fix
    mentioned below (and using the same modified tcpdump).

    [ Connection starts up with data transmission from B to A.
      Note that due to a separate bug (the fact that A and B
      are communicating over a loopback driver), B erroneously
      skips slow start. ]
    17:38:09.510854 A > B: S 3110066585:3110066585(0) win 16384
    17:38:09.510926 B > A: S 3110174850:3110174850(0) ack 3110066586 win 57344
    17:38:09.510953 A > B: . 1:1(0) ack 1 win 57344
    17:38:09.512956 B > A: P 1:2049(2048) ack 1 win 57344
    17:38:09.513222 B > A: P 2049:16385(14336) ack 1 win 57344
    17:38:09.513428 B > A: P 16385:30721(14336) ack 1 win 57344
    17:38:09.513638 B > A: P 30721:45057(14336) ack 1 win 57344
    17:38:09.519531 A > B: . 1:1(0) ack 45057 win 12288
    17:38:09.519638 B > A: P 45057:57345(12288) ack 1 win 57344

    [ B fills up A's offered window ]
    17:38:09.719526 A > B: . 1:1(0) ack 57345 win 0

    [ B probes A's window with a single byte.  A resynchronizes
      without accepting the byte ]
    17:38:14.499661 B > A: . 57345:57346(1) ack 1 win 57344
    17:38:14.499724 A > B: . 1:1(0) ack 57345 win 0

    [ B probes A's window again.  A resynchronizes and accepts
      the byte, as indicated by the ack field ]
    17:38:19.499764 B > A: . 57345:57346(1) ack 1 win 57344
    17:38:19.519731 A > B: . 1:1(0) ack 57346 win 0

    [ B probes A's window with a single byte.  A resynchronizes
      without accepting the byte ]
    17:38:24.499865 B > A: . 57346:57347(1) ack 1 win 57344
    17:38:24.499934 A > B: . 1:1(0) ack 57346 win 0

    [ The application on A has started generating data.
      B acks A's data and A accepts the ACKs and the
      data transfer continues ]
    17:38:28.530265 A > B: P 1:2049(2048) ack 57346 win 0
    17:38:28.719914 B > A: . 57346:57346(0) ack 2049 win 57344

    17:38:28.720023 A > B: . 2049:16385(14336) ack 57346 win 0
    17:38:28.720089 A > B: . 16385:30721(14336) ack 57346 win 0
    17:38:28.720370 B > A: . 57346:57346(0) ack 30721 win 57344

    17:38:28.720462 A > B: . 30721:45057(14336) ack 57346 win 0
    17:38:28.720526 A > B: P 45057:59393(14336) ack 57346 win 0
    17:38:28.720824 A > B: P 59393:73729(14336) ack 57346 win 0
    17:38:28.721124 B > A: . 57346:57346(0) ack 73729 win 47104

    17:38:28.721198 A > B: P 73729:88065(14336) ack 57346 win 0
    17:38:28.721379 A > B: P 88065:102401(14336) ack 57346 win 0
    17:38:28.721557 A > B: P 102401:116737(14336) ack 57346 win 0
    17:38:28.721863 B > A: . 57346:57346(0) ack 116737 win 36864

References
    None known.

How to detect
    Initiate a connection from a client to a server.  Have the server
    continuously send data until its buffers have been full for long
    enough to exhaust the window.  Next, have the client read 1 byte
    and then delay for long enough that the server TCP sends a window
    probe.  Now have the client start sending data.  At this point,
    if it ignores the server's ACKs, then the client's TCP suffers
    from the problem.

How to fix
    In one implementation known to exhibit the problem (derived from
    4.3-Reno), the problem was introduced when the macro MAX() was
    replaced by the function call max() for computing the amount of
    space in the receive window:

        tp->rcv_wnd = max(win, (int)(tp->rcv_adv - tp->rcv_nxt));

    When data has been received into a window beyond what has been
    advertised to the other side, rcv_nxt > rcv_adv, making this
    negative.  It's clear from the (int) cast that this is intended,
    but the unsigned max() function sign-extends so the negative
    number is "larger".  The fix is to change max() to imax():

        tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));

    4.3-Tahoe and before did not have this bug, since it used the
    macro MAX() for this calculation.

3.13.
Name of Problem
    Stretch ACK violation

Classification
    Congestion Control/Performance

Description
    To improve efficiency (both computer and network) a data receiver
    may refrain from sending an ACK for each incoming segment,
    according to [RFC1122].  However, an ACK should not be delayed an
    inordinate amount of time.  Specifically, ACKs SHOULD be sent for
    every second full-sized segment that arrives.  If a second
    full-sized segment does not arrive within a given timeout (of no
    more than 0.5 seconds), an ACK should be transmitted, according
    to [RFC1122].  A TCP receiver which does not generate an ACK for
    every second full-sized segment exhibits a "Stretch ACK
    Violation".

Significance
    TCP receivers exhibiting this behavior will cause TCP senders to
    generate burstier traffic, which can degrade performance in
    congested environments.  In addition, generating fewer ACKs
    increases the amount of time needed by the slow start algorithm
    to open the congestion window to an appropriate point, which
    diminishes performance in environments with large
    bandwidth-delay products.  Finally, generating fewer ACKs may
    cause needless retransmission timeouts in lossy environments, as
    it increases the possibility that an entire window of ACKs is
    lost, forcing a retransmission timeout.

Implications
    When not in loss recovery, every ACK received by a TCP sender
    triggers the transmission of new data segments.  The burst size
    is determined by the number of previously unacknowledged segments
    each ACK covers.  Therefore, a TCP receiver ack'ing more than 2
    segments at a time causes the sending TCP to generate a larger
    burst of traffic upon receipt of the ACK.
    This large burst of traffic can overwhelm an intervening gateway,
    leading to higher drop rates for both the connection and other
    connections passing through the congested gateway.

    In addition, the TCP slow start algorithm increases the
    congestion window by 1 segment for each ACK received.  Therefore,
    increasing the ACK interval (thus decreasing the rate at which
    ACKs are transmitted) increases the amount of time it takes slow
    start to increase the congestion window to an appropriate
    operating point, and the connection consequently suffers from
    reduced performance.  This is especially true for connections
    using large windows.

Relevant RFCs
    RFC 1122 outlines delayed ACKs as a recommended mechanism.

Trace file demonstrating it
    Trace file taken using tcpdump at host B, the data receiver (and
    ACK originator).  The advertised window (which never changed) and
    timestamp options have been omitted for clarity, except for the
    first packet sent by A:

    12:09:24.820187 A.1174 > B.3999: . 2049:3497(1448) ack 1
        win 33580 [tos 0x8]
    12:09:24.824147 A.1174 > B.3999: . 3497:4945(1448) ack 1
    12:09:24.832034 A.1174 > B.3999: . 4945:6393(1448) ack 1
    12:09:24.832222 B.3999 > A.1174: . ack 6393
    12:09:24.934837 A.1174 > B.3999: . 6393:7841(1448) ack 1
    12:09:24.942721 A.1174 > B.3999: . 7841:9289(1448) ack 1
    12:09:24.950605 A.1174 > B.3999: . 9289:10737(1448) ack 1
    12:09:24.950797 B.3999 > A.1174: . ack 10737
    12:09:24.958488 A.1174 > B.3999: . 10737:12185(1448) ack 1
    12:09:25.052330 A.1174 > B.3999: . 12185:13633(1448) ack 1
    12:09:25.060216 A.1174 > B.3999: . 13633:15081(1448) ack 1
    12:09:25.060405 B.3999 > A.1174: . ack 15081

    This portion of the trace clearly shows that the receiver (host
    B) sends an ACK for every third full-sized packet received.
   Further investigation of this implementation found that the cause
   of the increased ACK interval was the TCP options being used.
   The implementation sent an ACK after it was holding 2*MSS worth
   of unacknowledged data.  In the above case, the MSS is 1460 bytes
   so the receiver transmits an ACK after it is holding at least
   2920 bytes of unacknowledged data.  However, the length of the
   TCP options being used [RFC1323] took 12 bytes away from the data
   portion of each packet.  This produced packets containing 1448
   bytes of data.  But the additional bytes used by the options in
   the header were not taken into account when determining when to
   trigger an ACK.  Therefore, it took 3 data segments before the
   data receiver was holding enough unacknowledged data (>= 2*MSS,
   or 2920 bytes in the above example) to transmit an ACK.

Trace file demonstrating correct behavior
   Trace file taken using tcpdump at host B, the data receiver (and
   ACK originator), again with window and timestamp information
   omitted except for the first packet:

   12:06:53.627320 A.1172 > B.3999: . 1449:2897(1448) ack 1
       win 33580 [tos 0x8]
   12:06:53.634773 A.1172 > B.3999: . 2897:4345(1448) ack 1
   12:06:53.634961 B.3999 > A.1172: . ack 4345
   12:06:53.737326 A.1172 > B.3999: . 4345:5793(1448) ack 1
   12:06:53.744401 A.1172 > B.3999: . 5793:7241(1448) ack 1
   12:06:53.744592 B.3999 > A.1172: . ack 7241
   12:06:53.752287 A.1172 > B.3999: . 7241:8689(1448) ack 1
   12:06:53.847332 A.1172 > B.3999: . 8689:10137(1448) ack 1
   12:06:53.847525 B.3999 > A.1172: . ack 10137

   This trace shows the TCP receiver (host B) ack'ing every second
   full-sized packet, according to [RFC1122].
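The accounting error described above can be illustrated with a small sketch (function name invented for this example; the 12-byte figure is the RFC 1323 timestamp option from the traces): the buggy receiver compares the bytes it is holding against 2*MSS, while the fix compares against twice the actual per-segment payload (MSS minus option length).

```python
# Sketch of the ACK-trigger miscomputation described above.  The
# buggy threshold uses 2*MSS even though options shrink each
# segment's payload; the fixed threshold accounts for the options.

def segments_until_ack(mss, option_len, buggy):
    payload = mss - option_len            # e.g. 1460 - 12 = 1448 bytes
    threshold = 2 * mss if buggy else 2 * payload
    held = 0                              # unacknowledged bytes buffered
    segments = 0
    while held < threshold:
        held += payload
        segments += 1
    return segments

# MSS 1460 with a 12-byte timestamp option, as in the traces:
assert segments_until_ack(1460, 12, buggy=True) == 3   # stretch ACK
assert segments_until_ack(1460, 12, buggy=False) == 2  # RFC 1122 behavior
```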
   This is the same implementation shown above, with slight
   modifications that allow the receiver to take the length of the
   options into account when deciding when to transmit an ACK.

References
   This problem is documented in [Allman97] and [Paxson97].

How to detect
   Stretch ACK violations show up immediately in receiver-side
   packet traces of bulk transfers, as shown above.  However, packet
   traces made on the sender side of the TCP connection may lead to
   ambiguities when diagnosing this problem due to the possibility
   of lost ACKs.

3.14.

Name of Problem
   Retransmission sends multiple packets

Classification
   Congestion control

Description
   When a TCP retransmits a segment due to a timeout expiration or
   beginning a fast retransmission sequence, it should only transmit
   a single segment.  A TCP that transmits more than one segment
   exhibits "Retransmission Sends Multiple Packets".

   Instances of this problem have been known to occur due to
   miscomputations involving the use of TCP options.  TCP options
   increase the TCP header beyond its usual size of 20 bytes.  The
   total size of the header must be taken into account when
   retransmitting a packet.  If a TCP sender does not account for
   the length of the TCP options when determining how much data to
   retransmit, it will send too much data to fit into a single
   packet.  In this case, the correct retransmission will be
   followed by a short segment (tinygram) containing data that may
   not need to be retransmitted.

   A specific case is a TCP using the RFC 1323 timestamp option,
   which adds 12 bytes to the standard 20-byte TCP header.  On
   retransmission of a packet, the 12-byte option is incorrectly
   interpreted as part of the data portion of the segment.
   A standard TCP header and a new 12-byte option are added to the
   data, which yields a transmission of 12 bytes more data than
   contained in the original segment.  This overflow causes a
   smaller packet, with 12 data bytes, to be transmitted.

Significance
   This problem is somewhat serious for congested environments
   because the TCP implementation injects more packets into the
   network than is appropriate.  However, since a tinygram is only
   sent in response to a fast retransmit or a timeout, it does not
   affect the sustained sending rate.

Implications
   A TCP exhibiting this behavior is stressing the network with more
   traffic than appropriate, and stressing routers by increasing the
   number of packets they must process.  The redundant tinygram will
   also elicit a duplicate ACK from the receiver, resulting in yet
   another unnecessary transmission.

Relevant RFCs
   RFC 1122 requires use of slow start after loss; RFC 2001
   explicates slow start; RFC 1323 describes the timestamp option
   that has been observed to lead to some implementations exhibiting
   this problem.

Trace file demonstrating it
   Made using tcpdump recording at a machine on the same subnet as
   Host A.  Host A is the sender and Host B is the receiver.  The
   advertised window and timestamp options have been omitted for
   clarity, except for the first segment sent by host A.  In
   addition, portions of the trace file not pertaining to the packet
   in question have been removed (missing packets are denoted by
   ``[...]'' in the trace).

   11:55:22.701668 A > B: . 7361:7821(460) ack 1 win 49324
   11:55:22.702109 A > B: . 7821:8281(460) ack 1

   [...]

   11:55:23.112405 B > A: . ack 7821
   11:55:23.113069 A > B: . 12421:12881(460) ack 1
   11:55:23.113511 A > B: . 12881:13341(460) ack 1
   11:55:23.333077 B > A: . ack 7821
   11:55:23.336860 B > A: . ack 7821
   11:55:23.340638 B > A: . ack 7821
   11:55:23.341290 A > B: . 7821:8281(460) ack 1
   11:55:23.341317 A > B: . 8281:8293(12) ack 1
   11:55:23.498242 B > A: . ack 7821
   11:55:23.506850 B > A: . ack 7821
   11:55:23.510630 B > A: . ack 7821

   [...]

   11:55:23.746649 B > A: . ack 10581

   The second line of the above trace shows the original
   transmission of a segment which is later dropped.  After 3
   duplicate ACKs, line 9 of the trace shows the dropped packet
   (7821:8281), with a 460-byte payload, being retransmitted.
   Immediately following this retransmission, a packet with a
   12-byte payload is unnecessarily sent.

Trace file demonstrating correct behavior
   The trace file would be identical to the one above, with a single
   line:

   11:55:23.341317 A > B: . 8281:8293(12) ack 1

   omitted.

References
   [Brakmo95]

How to detect
   This problem can be detected by examining a packet trace of the
   TCP connections of a machine using TCP options, during which a
   packet is retransmitted.

3.15.

Name of Problem
   Failure to send FIN notification promptly

Classification
   Performance

Description
   When an application closes a connection, the corresponding TCP
   should send the FIN notification promptly to its peer (unless
   prevented by the congestion window).  If a TCP implementation
   delays in sending the FIN notification, for example due to
   waiting until unacknowledged data has been acknowledged, then it
   is said to exhibit "Failure to send FIN notification promptly".

   Also, while not strictly required, FIN segments should include
   the PSH flag to ensure expedited delivery of any pending data at
   the receiver.
Significance
   The greatest impact occurs for short-lived connections, since for
   these the additional time required to close the connection
   introduces the greatest relative delay.

   The additional time can be significant in the common case of the
   sender waiting for an ACK that is delayed by the receiver.

Implications
   Can diminish total throughput as seen at the application layer,
   because connection termination takes longer to complete.

Relevant RFCs
   RFC 793 indicates that a receiver should treat an incoming FIN
   flag as implying the push function.

Trace file demonstrating it
   Made using tcpdump (no losses reported by the packet filter).

   10:04:38.68 A > B: S 1031850376:1031850376(0) win 4096 (DF)
   10:04:38.71 B > A: S 596916473:596916473(0) ack 1031850377
       win 8760 (DF)
   10:04:38.73 A > B: . ack 1 win 4096 (DF)
   10:04:41.98 A > B: P 1:4(3) ack 1 win 4096 (DF)
   10:04:42.15 B > A: . ack 4 win 8757 (DF)
   10:04:42.23 A > B: P 4:7(3) ack 1 win 4096 (DF)
   10:04:42.25 B > A: P 1:11(10) ack 7 win 8754 (DF)
   10:04:42.32 A > B: . ack 11 win 4096 (DF)
   10:04:42.33 B > A: P 11:51(40) ack 7 win 8754 (DF)
   10:04:42.51 A > B: . ack 51 win 4096 (DF)
   10:04:42.53 B > A: F 51:51(0) ack 7 win 8754 (DF)
   10:04:42.56 A > B: FP 7:7(0) ack 52 win 4096 (DF)
   10:04:42.58 B > A: . ack 8 win 8754 (DF)

   Machine B in the trace above does not send out a FIN notification
   promptly if there is any data outstanding.  It instead waits for
   all unacknowledged data to be acknowledged before sending the FIN
   segment.  The connection was closed at 10:04:42.33 after
   requesting 40 bytes to be sent.  However, the FIN notification
   isn't sent until 10:04:42.53, after the (delayed) acknowledgement
   of the 40 bytes of data.
Trace file demonstrating correct behavior
   Made using tcpdump (no losses reported by the packet filter).

   10:27:53.85 C > D: S 419744533:419744533(0) win 4096 (DF)
   10:27:53.92 D > C: S 10082297:10082297(0) ack 419744534
       win 8760 (DF)
   10:27:53.95 C > D: . ack 1 win 4096 (DF)
   10:27:54.42 C > D: P 1:4(3) ack 1 win 4096 (DF)
   10:27:54.62 D > C: . ack 4 win 8757 (DF)
   10:27:54.76 C > D: P 4:7(3) ack 1 win 4096 (DF)
   10:27:54.89 D > C: P 1:11(10) ack 7 win 8754 (DF)
   10:27:54.90 D > C: FP 11:51(40) ack 7 win 8754 (DF)
   10:27:54.92 C > D: . ack 52 win 4096 (DF)
   10:27:55.01 C > D: FP 7:7(0) ack 52 win 4096 (DF)
   10:27:55.09 D > C: . ack 8 win 8754 (DF)

   Here, Machine D sends a FIN with 40 bytes of data even before the
   original 10 octets have been acknowledged.  This is correct
   behavior as it provides for the highest performance.

References
   This problem is documented in [Dawson97].

How to detect
   For implementations manifesting this problem, it shows up on a
   packet trace.

3.16.

Name of Problem
   Failure to send a RST after Half Duplex Close

Classification
   Resource management

Description
   RFC 1122 4.2.2.13 states that a TCP SHOULD send a RST if data is
   received after "half duplex close", i.e. if it cannot be
   delivered to the application.  A TCP that fails to do so is said
   to exhibit "Failure to send a RST after Half Duplex Close".

Significance
   Potentially serious for TCP endpoints that manage large numbers
   of connections, due to exhaustion of memory and/or process slots
   available for managing connection state.

Implications
   Failure to send the RST can lead to permanently hung TCP
   connections.
   This problem has been demonstrated when HTTP clients abort
   connections, common when users move on to a new page before the
   current page has finished downloading.  The HTTP client closes by
   transmitting a FIN while the server is transmitting images, text,
   etc.  The server TCP receives the FIN, but its application does
   not close the connection until all data has been queued for
   transmission.  Since the server will not transmit a FIN until all
   the preceding data has been transmitted, deadlock results if the
   client TCP does not consume the pending data or tear down the
   connection: the window decreases to zero, since the client cannot
   pass the data to the application, and the server sends probe
   segments.  The client acknowledges the probe segments with a zero
   window.  As mandated in RFC 1122 4.2.2.17, the probe segments are
   transmitted forever.  Server connection state remains in
   CLOSE_WAIT, and eventually server processes are exhausted.

   Note that there are two bugs.  First, probe segments should be
   ignored if the window can never subsequently increase.  Second, a
   RST should be sent when data is received after half duplex close.
   Fixing the first bug, but not the second, results in the probe
   segments eventually timing out the connection, but the server
   remains in CLOSE_WAIT for a significant and unnecessary period.

Relevant RFCs
   RFC 1122 sections 4.2.2.13 and 4.2.2.17.

Trace file demonstrating it
   Made using an unknown network analyzer.  No drop information
   available.
   client.1391 > server.8080: S 0:1(0) ack: 0 win: 2000
   server.8080 > client.1391: SA 8c01:8c02(0) ack: 1 win: 8000
   client.1391 > server.8080: PA
   client.1391 > server.8080: PA 1:1c2(1c1) ack: 8c02 win: 2000
   server.8080 > client.1391: [DF] PA 8c02:8cde(dc) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A 8cde:9292(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A 9292:9846(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A 9846:9dfa(5b4) ack: 1c2 win: 8000
   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A 9dfa:a3ae(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A a3ae:a962(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A a962:af16(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A af16:b4ca(5b4) ack: 1c2 win: 8000
   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A b4ca:ba7e(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A b4ca:ba7e(5b4) ack: 1c2 win: 8000
   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A ba7e:bdfa(37c) ack: 1c2 win: 8000
   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c2 win: 8000
   client.1391 > server.8080: PA

   [ HTTP client aborts and enters FIN_WAIT_1 ]

   client.1391 > server.8080: FPA

   [ server ACKs the FIN and enters CLOSE_WAIT ]

   server.8080 > client.1391: [DF] A

   [ client enters FIN_WAIT_2 ]

   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000

   [ server continues to try to send its data ]

   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >

   [ ... repeat ad exhaustium ... ]

Trace file demonstrating correct behavior
   Made using an unknown network analyzer.  No drop information
   available.

   client > server D=80 S=59500 Syn Seq=337 Len=0 Win=8760
   server > client D=59500 S=80 Syn Ack=338 Seq=80153 Len=0 Win=8760
   client > server D=80 S=59500 Ack=80154 Seq=338 Len=0 Win=8760

   [ ... normal data omitted ... ]

   client > server D=80 S=59500 Ack=14559 Seq=596 Len=0 Win=8760
   server > client D=59500 S=80 Ack=596 Seq=114559 Len=1460 Win=8760

   [ client closes connection ]

   client > server D=80 S=59500 Fin Seq=596 Len=0 Win=8760
   server > client D=59500 S=80 Ack=597 Seq=116019 Len=1460 Win=8760

   [ client sends RST (RFC 1122 4.2.2.13) ]

   client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
   server > client D=59500 S=80 Ack=597 Seq=117479 Len=1460 Win=8760
   client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
   server > client D=59500 S=80 Ack=597 Seq=118939 Len=1460 Win=8760
   client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
   server > client D=59500 S=80 Ack=597 Seq=120399 Len=892 Win=8760
   client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
   server > client D=59500 S=80 Ack=597 Seq=121291 Len=1460 Win=8760
   client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0

   "client" sends a number of RSTs, one in response to each incoming
   packet from "server".  One might wonder why "server" keeps
   sending data packets after it has received a RST from "client";
   the explanation is that "server" had already transmitted all five
   of the data packets before receiving the first RST from "client",
   so it is too late to avoid transmitting them.
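The required behavior can be sketched as a small receive-path check (illustrative only; the function name is invented, and only the decision is modeled, not a real TCP input routine): once the local application has performed a half duplex close, arriving data can no longer be delivered, so a conforming TCP answers it with a RST.

```python
# Illustrative sketch of RFC 1122, 4.2.2.13: data received after a
# half duplex close cannot be delivered to the application and
# should elicit a RST rather than a zero-window ACK.

def on_data_after_close(app_can_read, payload_len):
    """Return the action a conforming TCP takes for an arriving
    data segment once the local application has closed."""
    if payload_len > 0 and not app_can_read:
        return "RST"    # data is undeliverable: reset (RFC 1122)
    return "ACK"        # otherwise acknowledge normally

# The buggy client in the first trace instead acknowledged the
# server's probe segments with a zero window forever, leaving the
# server stuck in CLOSE_WAIT.
```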
How to detect
   The problem can be detected by inspecting packet traces of a
   large, interrupted bulk transfer.

3.17.

Name of Problem
   Failure to RST on close with data pending

Classification
   Resource management

Description
   When an application closes a connection in such a way that it can
   no longer read any received data, the TCP SHOULD, per section
   4.2.2.13 of RFC 1122, send a RST if there is any unread received
   data, or if any new data is received.  A TCP that fails to do so
   exhibits "Failure to RST on close with data pending".

   Note that, for some TCPs, this situation can be caused by an
   application "crashing" while a peer is sending data.

   We have observed a number of TCPs that exhibit this problem.  The
   problem is less serious if any subsequent data sent to the
   now-closed connection endpoint elicits a RST (see illustration
   below).

Significance
   This problem is most significant for endpoints that engage in
   large numbers of connections, as their ability to do so will be
   curtailed as they leak away resources.

Implications
   Failure to reset the connection can lead to permanently hung
   connections, in which the remote endpoint takes no further action
   to tear down the connection because it is waiting on the local
   TCP to first take some action.  This is particularly the case if
   the local TCP also allows the advertised window to go to zero,
   and fails to tear down the connection when the remote TCP engages
   in "persist" probes (see example below).

Relevant RFCs
   RFC 1122 section 4.2.2.13.  Also, 4.2.2.17 for the zero-window
   probing discussion below.

Trace file demonstrating it
   Made using tcpdump.  No drop information available.
   13:11:46.04 A > B: S 458659166:458659166(0) win 4096 (DF)
   13:11:46.04 B > A: S 792320000:792320000(0) ack 458659167
       win 4096
   13:11:46.04 A > B: . ack 1 win 4096 (DF)
   13:11:55.80 A > B: . 1:513(512) ack 1 win 4096 (DF)
   13:11:55.80 A > B: . 513:1025(512) ack 1 win 4096 (DF)
   13:11:55.83 B > A: . ack 1025 win 3072
   13:11:55.84 A > B: . 1025:1537(512) ack 1 win 4096 (DF)
   13:11:55.84 A > B: . 1537:2049(512) ack 1 win 4096 (DF)
   13:11:55.85 A > B: . 2049:2561(512) ack 1 win 4096 (DF)
   13:11:56.03 B > A: . ack 2561 win 1536
   13:11:56.05 A > B: . 2561:3073(512) ack 1 win 4096 (DF)
   13:11:56.06 A > B: . 3073:3585(512) ack 1 win 4096 (DF)
   13:11:56.06 A > B: . 3585:4097(512) ack 1 win 4096 (DF)
   13:11:56.23 B > A: . ack 4097 win 0
   13:11:58.16 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
   13:11:58.16 B > A: . ack 4097 win 0
   13:12:00.16 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
   13:12:00.16 B > A: . ack 4097 win 0
   13:12:02.16 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
   13:12:02.16 B > A: . ack 4097 win 0
   13:12:05.37 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
   13:12:05.37 B > A: . ack 4097 win 0
   13:12:06.36 B > A: F 1:1(0) ack 4097 win 0
   13:12:06.37 A > B: . ack 2 win 4096 (DF)
   13:12:11.78 A > B: . 4096:4097(1) ack 2 win 4096 (DF)
   13:12:11.78 B > A: . ack 4097 win 0
   13:12:24.59 A > B: . 4096:4097(1) ack 2 win 4096 (DF)
   13:12:24.60 B > A: . ack 4097 win 0
   13:12:50.22 A > B: . 4096:4097(1) ack 2 win 4096 (DF)
   13:12:50.22 B > A: . ack 4097 win 0

   Machine B in the trace above does not drop received data when the
   socket is "closed" by the application (in this case, the
   application process was terminated).  This occurred at
   approximately 13:12:06.36 and resulted in the FIN being sent in
   response to the close.
   However, because there is no longer an application to deliver the
   data to, the TCP should have instead sent a RST.

   Note: Machine A's zero-window probing is also broken.  It is
   resending old data, rather than new data.  Section 3.7 in RFC 793
   and Section 4.2.2.17 in RFC 1122 discuss zero-window probing.

Trace file demonstrating better behavior
   Made using tcpdump.  No drop information available.

   Better, but still not fully correct, behavior, per the discussion
   below.  We show this behavior because it has been observed for a
   number of different TCP implementations.

   13:48:29.24 C > D: S 73445554:73445554(0) win 4096 (DF)
   13:48:29.24 D > C: S 36050296:36050296(0) ack 73445555
       win 4096 (DF)
   13:48:29.25 C > D: . ack 1 win 4096 (DF)
   13:48:30.78 C > D: . 1:1461(1460) ack 1 win 4096 (DF)
   13:48:30.79 C > D: . 1461:2921(1460) ack 1 win 4096 (DF)
   13:48:30.80 D > C: . ack 2921 win 1176 (DF)
   13:48:32.75 C > D: . 2921:4097(1176) ack 1 win 4096 (DF)
   13:48:32.82 D > C: . ack 4097 win 0 (DF)
   13:48:34.76 C > D: . 4096:4097(1) ack 1 win 4096 (DF)
   13:48:34.84 D > C: . ack 4097 win 0 (DF)
   13:48:36.34 D > C: FP 1:1(0) ack 4097 win 4096 (DF)
   13:48:36.34 C > D: . 4097:5557(1460) ack 2 win 4096 (DF)
   13:48:36.34 D > C: R 36050298:36050298(0) win 24576
   13:48:36.34 C > D: . 5557:7017(1460) ack 2 win 4096 (DF)
   13:48:36.34 D > C: R 36050298:36050298(0) win 24576

   In this trace, the application process is terminated on Machine D
   at approximately 13:48:36.34.  Its TCP sends the FIN with the
   window opened again (since it discarded the previously received
   data).  Machine C promptly sends more data, causing Machine D to
   reset the connection since it cannot deliver the data to the
   application.  Ideally, Machine D SHOULD send a RST instead of
   dropping the data and re-opening the receive window.
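The close-path half of the rule can be sketched the same way (hypothetical name; only the decision per RFC 1122, 4.2.2.13 is modeled): if the application closes, or crashes, while received data remains unread, the TCP should discard the data and signal the close with a RST rather than a FIN.

```python
# Illustrative sketch of RFC 1122, 4.2.2.13 on close: unread
# received data means the close must be signalled with a RST,
# not a FIN.

def segment_on_close(unread_bytes):
    """Choose the segment a conforming TCP emits when the
    application closes with data possibly left unread."""
    if unread_bytes > 0:
        return "RST"   # data can never be delivered: reset at once
    return "FIN"       # normal close: no pending received data

# Machine B in the first trace sent a FIN despite unread data; the
# "better" trace sends the FIN, then a RST only once more data
# arrives; fully correct behavior is an immediate RST.
```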
   Note: Machine C's zero-window probing is broken, the same as in
   the example above.

Trace file demonstrating correct behavior
   Made using tcpdump.  No losses reported by the packet filter.

   14:12:02.19 E > F: S 1143360000:1143360000(0) win 4096
   14:12:02.19 F > E: S 1002988443:1002988443(0) ack 1143360001
       win 4096 (DF)
   14:12:02.19 E > F: . ack 1 win 4096
   14:12:10.43 E > F: . 1:513(512) ack 1 win 4096
   14:12:10.61 F > E: . ack 513 win 3584 (DF)
   14:12:10.61 E > F: . 513:1025(512) ack 1 win 4096
   14:12:10.61 E > F: . 1025:1537(512) ack 1 win 4096
   14:12:10.81 F > E: . ack 1537 win 2560 (DF)
   14:12:10.81 E > F: . 1537:2049(512) ack 1 win 4096
   14:12:10.81 E > F: . 2049:2561(512) ack 1 win 4096
   14:12:10.81 E > F: . 2561:3073(512) ack 1 win 4096
   14:12:11.01 F > E: . ack 3073 win 1024 (DF)
   14:12:11.01 E > F: . 3073:3585(512) ack 1 win 4096
   14:12:11.01 E > F: . 3585:4097(512) ack 1 win 4096
   14:12:11.21 F > E: . ack 4097 win 0 (DF)
   14:12:15.88 E > F: . 4097:4098(1) ack 1 win 4096
   14:12:16.06 F > E: . ack 4097 win 0 (DF)
   14:12:20.88 E > F: . 4097:4098(1) ack 1 win 4096
   14:12:20.91 F > E: . ack 4097 win 0 (DF)
   14:12:21.94 F > E: R 1002988444:1002988444(0) win 4096

   When the application terminates at 14:12:21.94, F immediately
   sends a RST.

   Note: Machine E's zero-window probing is (finally) correct.

How to detect
   The problem can often be detected by inspecting packet traces of
   a transfer in which the receiving application terminates
   abnormally.  When doing so, there can be an ambiguity (if only
   looking at the trace) as to whether the receiving TCP did indeed
   have unread data that it could now no longer deliver.
   To provoke this to happen, it may help to suspend the receiving
   application so that it fails to consume any data, eventually
   exhausting the advertised window.  At this point, since the
   advertised window is zero, we know that the receiving TCP has
   undelivered data buffered up.  Terminating the application
   process then should suffice to test the correctness of the TCP's
   behavior.

3.18.

Name of Problem
   Options missing from TCP MSS calculation

Classification
   Reliability / performance

Description
   When a TCP determines how much data to send per packet, it
   calculates a segment size based on the MTU of the path.  It must
   then subtract from that MTU the size of the IP and TCP headers in
   the packet.  If IP options and TCP options are not taken into
   account correctly in this calculation, the resulting segment size
   may be too large.  TCPs that do so are said to exhibit "Options
   missing from TCP MSS calculation".

Significance
   In some implementations, this causes the transmission of
   strangely fragmented packets.  In some implementations with Path
   MTU (PMTU) discovery [RFC1191], this problem can actually result
   in a total failure to transmit any data at all, regardless of the
   environment (see below).

   Arguably, IP options appear only rarely in normal operation,
   especially since the wide deployment of firewalls.

Implications
   In implementations using PMTU discovery, this problem can result
   in packets that are too large for the output interface, and that
   have the DF (don't fragment) bit set in the IP header.  Thus, the
   IP layer on the local machine is not allowed to fragment the
   packet to send it out the interface.
   It instead informs the TCP layer of the correct MTU size of the
   interface; the TCP layer again miscomputes the MSS by failing to
   take into account the size of IP options; and the problem
   repeats, with no data flowing.

Relevant RFCs
   RFC 1122 describes the calculation of the effective send MSS.
   RFC 1191 describes Path MTU discovery.

Trace file demonstrating it
   Trace file taken using tcpdump on host C.  The first trace
   demonstrates the fragmentation that occurs without path MTU
   discovery:

   13:55:25.488728 A.65528 > C.discard:
       P 567833:569273(1440) ack 1 win 17520
       (frag 20828:1472@0+)
       (ttl 62, optlen=8 LSRR{B#} NOP)

   13:55:25.488943 A > C:
       (frag 20828:8@1472)
       (ttl 62, optlen=8 LSRR{B#} NOP)

   13:55:25.489052 C.discard > A.65528:
       . ack 566385 win 60816 (DF)
       (ttl 60, id 41266)

   Host A repeatedly sends 1440-octet data segments, but these are
   fragmented into two packets, one with 1432 octets of data, and
   another with 8 octets of data.

   The second trace demonstrates the failure to send any data
   segments, sometimes seen with hosts doing path MTU discovery:

   13:55:44.332219 A.65527 > C.discard:
       S 1018235390:1018235390(0) win 16384 (DF)
       (ttl 62, id 20912, optlen=8 LSRR{B#} NOP)

   13:55:44.333015 C.discard > A.65527:
       S 1271629000:1271629000(0) ack 1018235391 win 60816 (DF)
       (ttl 60, id 41427)

   13:55:44.333206 C.discard > A.65527:
       S 1271629000:1271629000(0) ack 1018235391 win 60816 (DF)
       (ttl 60, id 41427)

   This is all of the activity seen on this connection.  Eventually
   host C will time out attempting to establish the connection.
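The required calculation can be written out.  RFC 1122 (Section 4.2.2.6) gives the effective send MSS as min(SendMSS + 20, MMS_S) - TCPhdrsize - IPoptionsize, where TCPhdrsize includes any TCP options in use.  The sketch below implements that formula with hypothetical variable names:

```python
# Effective send MSS per RFC 1122, 4.2.2.6:
#   Eff.snd.MSS = min(SendMSS + 20, MMS_S) - TCPhdrsize - IPoptionsize
# The buggy implementations above effectively omit tcp_option_len
# and/or ip_option_len, producing segments too large for the packet.

def effective_send_mss(send_mss, mms_s, tcp_option_len=0, ip_option_len=0):
    tcp_hdr_size = 20 + tcp_option_len   # base TCP header plus options
    return min(send_mss + 20, mms_s) - tcp_hdr_size - ip_option_len

# Ethernet path, MMS_S = 1480 (1500-byte MTU minus 20-byte IP header),
# peer MSS 1460: a 12-byte timestamp option gives a 1448-byte payload,
# and an 8-byte IP option (e.g. LSRR plus NOP) reduces it by 8 more.
```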
How to detect
   The "netcat" utility [Hobbit96] is useful for generating source
   routed packets:

   1% nc C discard
   (interactive typing)
   ^C
   2% nc C discard < /dev/zero
   ^C
   3% nc -g B C discard
   (interactive typing)
   ^C
   4% nc -g B C discard < /dev/zero
   ^C

   Lines 1 through 3 should generate appropriate packets, which can
   be verified using tcpdump.  If the problem is present, line 4
   should generate one of the two kinds of packet traces shown.

How to fix
   The implementation should ensure that the effective send MSS
   calculation includes a term for the IP and TCP options, as
   mandated by RFC 1122.

4. Security Considerations

   This memo does not discuss any specific security-related TCP
   implementation problems, as the working group decided to pursue
   documenting those in a separate document.  Some of the
   implementation problems discussed here, however, can be used for
   denial-of-service attacks.  Those classified as congestion
   control present opportunities to subvert TCPs used for legitimate
   data transfer into excessively loading network elements.  Those
   classified as "performance", "reliability" and "resource
   management" may be exploitable for launching surreptitious
   denial-of-service attacks against the user of the TCP.  Both of
   these types of attacks can be extremely difficult to detect
   because in most respects they look identical to legitimate
   network traffic.

5. Acknowledgements

   Thanks to numerous correspondents on the tcp-impl mailing list
   for their input: Steve Alexander, Larry Backman, Jerry Chu, Alan
   Cox, Kevin Fall, Richard Fox, Jim Gettys, Rick Jones, Allison
   Mankin, Neal McBurnett, Perry Metzger, der Mouse, Thomas Narten,
   Andras Olah, Steve Parker, Francesco Potorti`, Luigi Rizzo, Allyn
   Romanow, Al Smith, Jerry Toporek, Joe Touch, and Curtis
   Villamizar.

   Thanks also to Josh Cohen for the traces documenting the "Failure
   to send a RST after Half Duplex Close" problem; and to John
   Polstra, who analyzed the "Window probe deadlock" problem.

6. References

   [Allman97]
      M. Allman, "Fixing Two BSD TCP Bugs," Technical Report
      CR-204151, NASA Lewis Research Center, Oct. 1997.
      http://gigahertz.lerc.nasa.gov/~mallman/papers/bug.ps

   [RFC2414]
      M. Allman, S. Floyd and C. Partridge, "Increasing TCP's
      Initial Window," Sep. 1998.

   [RFC1122]
      R. Braden, Editor, "Requirements for Internet Hosts --
      Communication Layers," Oct. 1989.

   [RFC2119]
      S. Bradner, "Key words for use in RFCs to Indicate Requirement
      Levels," Mar. 1997.

   [Brakmo95]
      L. Brakmo and L. Peterson, "Performance Problems in BSD4.4
      TCP," ACM Computer Communication Review, 25(5):69-86, 1995.

   [RFC813]
      D. Clark, "Window and Acknowledgement Strategy in TCP," Jul.
      1982.

   [Dawson97]
      S. Dawson, F. Jahanian, and T. Mitton, "Experiments on Six
      Commercial TCP Implementations Using a Software Fault
      Injection Tool," to appear in Software Practice & Experience,
      1997.  A technical report version of this paper can be
      obtained at
      ftp://rtcl.eecs.umich.edu/outgoing/sdawson/CSE-TR-298-96.ps.gz.

   [Fall96]
      K. Fall and S. Floyd, "Simulation-based Comparisons of Tahoe,
      Reno, and SACK TCP," ACM Computer Communication Review,
      26(3):5-21, 1996.
   [Hobbit96]
      Hobbit, Avian Research, netcat, available via anonymous ftp to
      ftp.avian.org, 1996.

   [Hoe96]
      J. Hoe, "Improving the Start-up Behavior of a Congestion
      Control Scheme for TCP," Proc. SIGCOMM '96.

   [Jacobson88]
      V. Jacobson, "Congestion Avoidance and Control," Proc. SIGCOMM
      '88.  ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z

   [Jacobson89]
      V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via
      anonymous ftp to ftp.ee.lbl.gov, Jun. 1989.

   [RFC2018]
      M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective
      Acknowledgement Options," Oct. 1996.

   [RFC1191]
      J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990.

   [RFC896]
      J. Nagle, "Congestion Control in IP/TCP Internetworks," Jan.
      1984.

   [Paxson97]
      V. Paxson, "Automated Packet Trace Analysis of TCP
      Implementations," Proc. SIGCOMM '97, available from
      ftp://ftp.ee.lbl.gov/papers/vp-tcpanaly-sigcomm97.ps.Z.

   [RFC793]
      J. Postel, Editor, "Transmission Control Protocol," Sep. 1981.

   [RFC2001]
      W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
      Retransmit, and Fast Recovery Algorithms," Jan. 1997.

   [Stevens94]
      W. Stevens, "TCP/IP Illustrated, Volume 1", Addison-Wesley
      Publishing Company, Reading, Massachusetts, 1994.

   [Wright95]
      G. Wright and W. Stevens, "TCP/IP Illustrated, Volume 2",
      Addison-Wesley Publishing Company, Reading, Massachusetts,
      1995.

7. Authors' Addresses

   Vern Paxson
   Network Research Group
   Lawrence Berkeley National Laboratory
   Berkeley, CA 94720
   USA
   Phone: +1 510/486-7504

   Mark Allman
   NASA Lewis Research Center/Sterling Software
   21000 Brookpark Road
   MS 54-2
   Cleveland, OH 44135
   USA
   Phone: +1 216/433-6586

   Scott Dawson
   Real-Time Computing Laboratory
   EECS Building
   University of Michigan
   Ann Arbor, MI 48109-2122
   USA
   Phone: +1 313/763-5363

   William C. Fenner
   Xerox PARC
   3333 Coyote Hill Road
   Palo Alto, CA 94304
   USA
   Phone: +1 650/812-4816

   Jim Griner
   NASA Lewis Research Center
   21000 Brookpark Road
   MS 54-2
   Cleveland, OH 44135
   USA
   Phone: +1 216/433-5787

   Ian Heavens
   Spider Software Ltd.
   8 John's Place, Leith
   Edinburgh EH6 7EL
   UK
   Phone: +44 131/475-7015

   Kevin Lahey
   NASA Ames Research Center/MRJ
   MS 258-6
   Moffett Field, CA 94035
   USA
   Phone: +1 650/604-4334

   Jeff Semke
   Pittsburgh Supercomputing Center
   4400 Fifth Ave
   Pittsburgh, PA 15213
   USA
   Phone: +1 412/268-4960

   Bernie Volz
   Process Software Corporation
   959 Concord Street
   Framingham, MA 01701
   USA
   Phone: +1 508/879-6994