idnits 2.17.1

draft-ietf-tcpimpl-pmtud-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing document type: Expected "INTERNET-DRAFT" in the upper left hand
     corner of the first page

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents.

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 16 pages

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Abstract section.

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** There are 19 instances of too long lines in the document, the longest
     one being 7 characters in excess of 72.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 80: '... MUST, SHOULD, MAY, and others wri...'
     RFC 2119 keyword, line 303: '... SHOULD, not a MUST....'
  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you can
     ignore this comment. If not, you may need to add the pre-RFC5378
     disclaimer. (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2401' is mentioned on line 128, but not defined

  ** Obsolete undefined reference: RFC 2401 (Obsoleted by RFC 4301)

  == Missing Reference: 'RFC2003' is mentioned on line 128, but not defined

  == Unused Reference: 'RFC2581' is defined on line 598, but no explicit
     reference was found in the text

  == Unused Reference: 'Paxson97' is defined on line 451, but no explicit
     reference was found in the text

  == Unused Reference: 'Paxson96' is defined on line 624, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2001' is defined on line 639, but no explicit
     reference was found in the text

  ** Downref: Normative reference to an Informational RFC: RFC 2525

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Paxson97'

  ** Obsolete normative reference: RFC 813 (Obsoleted by RFC 7805)

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Jacobson89'

  ** Downref: Normative reference to an Informational RFC: RFC 1435

  ** Obsolete normative reference: RFC 1981 (Obsoleted by RFC 8201)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Paxson96'

  ** Obsolete normative reference: RFC 879 (Obsoleted by RFC 7805, RFC 9293)

  ** Obsolete normative reference: RFC 2001 (Obsoleted by RFC 2581)

     Summary: 17 errors (**), 0 flaws (~~), 9 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1 Network Working Group K. Lahey
2 Expires: November 2000

4 TCP Problems with Path MTU Discovery
5

7 1. Status of this Memo

9 This document is an Internet-Draft and is in full conformance with
10 all provisions of Section 10 of RFC2026.

12 This document is an Internet Draft. Internet Drafts are working
13 documents of the Internet Engineering Task Force (IETF), its areas,
14 and its working groups. Note that other groups may also distribute
15 working documents as Internet Drafts.

17 Internet Drafts are draft documents valid for a maximum of six
18 months, and may be updated, replaced, or obsoleted by other documents
19 at any time. It is inappropriate to use Internet Drafts as reference
20 material or to cite them other than as ``work in progress''.

22 The list of current Internet-Drafts can be accessed at
23 http://www.ietf.org/ietf/1id-abstracts.txt

25 The list of Internet-Draft Shadow Directories can be accessed at
26 http://www.ietf.org/shadow.html

28 To view the entire list of current Internet-Drafts, please check the
29 "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
30 Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
31 Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
32 Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

34 This memo provides information for the Internet community.
This memo
35 does not specify an Internet standard of any kind. Distribution of
36 this memo is unlimited.

38 ^L

40 2. Introduction

42 This memo catalogs several known TCP implementation problems dealing
43 with Path MTU Discovery [RFC1191], including the long-standing black
44 hole problem, stretch ACKs due to confusion between MSS and segment
45 size, and MSS advertisement based on PMTU. The goal in doing so is
46 to improve conditions in the existing Internet by enhancing the
47 quality of current TCP/IP implementations.

49 While Path MTU Discovery (PMTUD) can be used with any upper-layer
50 protocol, it is most commonly used by TCP; this document does not
51 attempt to treat problems encountered by other upper-layer protocols.
52 Path MTU Discovery for IPv6 [RFC1981] treats only IPv6-dependent
53 issues, but not the TCP issues brought up in this document.

55 Each problem is defined as follows:

57 Name of Problem
58 The name associated with the problem. In this memo, the name is
59 given as a subsection heading.

61 Classification
62 One or more problem categories for which the problem is classified:
63 "congestion control", "performance", "reliability",
64 "non-interoperation -- connectivity failure".

66 Description
67 A definition of the problem, succinct but including necessary
68 background material.

70 Significance
71 A brief summary of the sorts of environments for which the problem
72 is significant.

74 Implications
75 Why the problem is viewed as a problem.

77 Relevant RFCs
78 The RFCs defining the TCP specification with which the problem
79 conflicts. These RFCs often qualify behavior using terms such as
80 MUST, SHOULD, MAY, and others written capitalized. See RFC 2119
81 for the exact interpretation of these terms.

83 ^L

85 Trace file demonstrating the problem
86 One or more ASCII trace files demonstrating the problem, if
87 applicable.
89 Trace file demonstrating correct behavior
90 One or more examples of how correct behavior appears in a trace, if
91 applicable.

93 References
94 References that further discuss the problem.

96 How to detect
97 How to test an implementation to see if it exhibits the problem.
98 This discussion may include difficulties and subtleties associated
99 with causing the problem to manifest itself, and with interpreting
100 traces to detect the presence of the problem (if applicable).

102 How to fix
103 For known causes of the problem, how to correct the implementation.

105 3. Known implementation problems

107 3.1.

109 Name of Problem
110 Black Hole Detection

112 Classification
113 Non-interoperation -- connectivity failure

115 Description
116 A host performs Path MTU Discovery by sending out as large a packet
117 as possible, with the Don't Fragment (DF) bit set in the IP header.
118 If the packet is too large for a router to forward on to a
119 particular link, the router must send an ICMP Destination
120 Unreachable -- Fragmentation Needed message to the source address.
121 The host then adjusts the packet size based on the ICMP message.

123 ^L
124 As was pointed out in [RFC1435], routers don't always do this
125 correctly -- many routers fail to send the ICMP messages, for a
126 variety of reasons ranging from kernel bugs to configuration
127 problems. Firewalls are often misconfigured to suppress all ICMP
128 messages. IPsec [RFC2401] and IP-in-IP [RFC2003] tunnels shouldn't
129 cause these sorts of problems, if the implementations follow the
130 advice in the appropriate documents.

132 PMTUD, as documented in [RFC1191], fails when the appropriate ICMP
133 messages are not received by the originating host. The upper-layer
134 protocol continues to try to send large packets and, without the
135 ICMP messages, never discovers that it needs to reduce the size of
136 those packets. Its packets are disappearing into a PMTUD black
137 hole.
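
The discovery loop in the Description above, and the way it stalls when the ICMP message never arrives, can be sketched as a toy simulation in Python. This is only an illustration under stated assumptions, not a TCP implementation: the function name `discover_pmtu`, the `icmp_delivered` flag, and the retry limit are hypothetical, and the path is modeled as a plain list of link MTUs.

```python
from typing import List, Optional


def discover_pmtu(first_hop_mtu: int, link_mtus: List[int],
                  icmp_delivered: bool, max_tries: int = 5) -> Optional[int]:
    """Return the packet size that finally crosses the path, or None if the
    sender never learns to shrink its packets (a PMTUD black hole).

    All names and the retry limit are illustrative assumptions.
    """
    size = first_hop_mtu                 # start with as large a packet as possible
    for _ in range(max_tries):
        # First link on the path that cannot forward a DF packet of this size.
        blocker = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocker is None:
            return size                  # the packet reached the destination
        if not icmp_delivered:
            continue                     # dropped silently; sender retries same size
        size = blocker                   # ICMP Fragmentation Needed reported the MTU
    return None


print(discover_pmtu(1500, [1500, 1024, 1500], icmp_delivered=True))
print(discover_pmtu(1500, [1500, 1024, 1500], icmp_delivered=False))
```

With the ICMP message delivered the sender settles on 1024; with it suppressed the sender keeps retransmitting the 1500-octet packet and the function returns None, which is the black hole case.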
139 Significance
140 When PMTUD fails due to the lack of ICMP messages, TCP will also
141 completely fail under some conditions.

143 Implications
144 This failure is especially difficult to debug, as pings and some
145 interactive TCP connections to the destination host work. Bulk
146 transfers fail with the first large packet and the connection
147 eventually times out.

149 These situations can almost always be blamed on a misconfiguration
150 within the network, which should be corrected. However, it seems
151 inappropriate for some TCP implementations to suffer
152 interoperability failures over paths which do not affect other TCP
153 implementations (i.e. those without PMTUD). This creates a market
154 disincentive for deploying TCP implementations with PMTUD enabled.

156 Relevant RFCs
157 RFC 1191 describes Path MTU Discovery. RFC 1435 provides an early
158 description of these sorts of problems.

160 Trace file demonstrating the problem
161 Made using tcpdump [Jacobson89] recording at an intermediate host.

163 20:12:11.951321 A > B: S 1748427200:1748427200(0)
164 win 49152
165 20:12:11.951829 B > A: S 1001927984:1001927984(0)

167 ^L
168 ack 1748427201 win 16384
169 20:12:11.955230 A > B: . ack 1 win 49152 (DF)
170 20:12:11.959099 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
171 20:12:13.139074 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
172 20:12:16.188685 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
173 20:12:22.290483 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
174 20:12:34.491856 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
175 20:12:58.896405 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
176 20:13:47.703184 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
177 20:14:52.780640 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
178 20:15:57.856037 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
179 20:17:02.932431 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
180 20:18:08.009337 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
181 20:19:13.090521 A > B: .
1:1461(1460) ack 1 win 49152 (DF)
182 20:20:18.168066 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
183 20:21:23.242761 A > B: R 1461:1461(0) ack 1 win 49152 (DF)

185 The short SYN packet has no trouble traversing the network, due to
186 its small size. Similarly, ICMP echo packets used to diagnose
187 connectivity problems will succeed.

189 Large data packets fail to traverse the network. Eventually the
190 connection times out. This can be especially confusing when the
191 application starts out with a very small write, which succeeds,
192 following up with many large writes, which then fail.

194 Trace file demonstrating correct behavior

196 Made using tcpdump recording at an intermediate host.

198 16:48:42.659115 A > B: S 271394446:271394446(0)
199 win 8192 (DF)
200 16:48:42.672279 B > A: S 2837734676:2837734676(0)
201 ack 271394447 win 16384
202 16:48:42.676890 A > B: . ack 1 win 8760 (DF)
203 16:48:42.870574 A > B: . 1:1461(1460) ack 1 win 8760 (DF)
204 16:48:42.871799 A > B: . 1461:2921(1460) ack 1 win 8760 (DF)
205 16:48:45.786814 A > B: . 1:1461(1460) ack 1 win 8760 (DF)
206 16:48:51.794676 A > B: . 1:1461(1460) ack 1 win 8760 (DF)
207 16:49:03.808912 A > B: . 1:537(536) ack 1 win 8760
208 16:49:04.016476 B > A: . ack 537 win 16384
209 16:49:04.021245 A > B: . 537:1073(536) ack 1 win 8760
210 16:49:04.021697 A > B: . 1073:1609(536) ack 1 win 8760
211 16:49:04.120694 B > A: . ack 1609 win 16384
212 16:49:04.126142 A > B: . 1609:2145(536) ack 1 win 8760

214 In this case, the sender sees four packets fail to traverse the

216 ^L
217 network (using a two-packet initial send window) and turns off
218 PMTUD. All subsequent packets have the DF flag turned off, and the
219 size set to the default value of 536 [RFC1122].

221 References
222 This problem has been discussed extensively on the tcp-impl mailing
223 list; the name "black hole" has been in use for many years.
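
The fallback visible in the correct-behavior trace for this problem (a few retransmissions with DF set, then 536-octet segments with DF cleared) can be sketched as a small piece of per-connection state. This is a minimal sketch, not the behavior of any particular stack: the class name `PmtudState`, its fields, and the threshold of three consecutive timeouts are assumptions chosen for illustration, and the later re-probe of the path is left out.

```python
from dataclasses import dataclass

DEFAULT_MSS = 536  # default segment size the text above cites from RFC 1122


@dataclass
class PmtudState:
    """Per-connection PMTUD state; all names here are hypothetical."""
    segment_size: int
    df: bool = True          # Don't Fragment is set while PMTUD is active
    timeouts: int = 0        # consecutive retransmission timeouts seen so far
    max_timeouts: int = 3    # assumed threshold for suspecting a black hole

    def on_timeout(self) -> None:
        """Count a retransmission timeout and fall back at the threshold."""
        self.timeouts += 1
        if self.df and self.timeouts >= self.max_timeouts:
            self.df = False  # stop depending on ICMP feedback
            self.segment_size = min(self.segment_size, DEFAULT_MSS)

    def on_ack(self) -> None:
        """New data was acknowledged, so the path is passing packets."""
        self.timeouts = 0


state = PmtudState(segment_size=1460)
for _ in range(3):
    state.on_timeout()
print(state.df, state.segment_size)
```

After the third timeout the sketch clears DF and drops to 536-octet segments. A fuller version would also log the suspected black hole and restore DF after a delay to probe the path again.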
225 How to detect

227 This shows up as a TCP connection which hangs (fails to make
228 progress) until closed by timeout (this often manifests itself as a
229 connection that connects and starts to transfer, then eventually
230 terminates after 15 minutes with zero bytes transferred). This is
231 particularly annoying with an application like ftp, which will work
232 perfectly while it uses small packets for control information, and
233 then fail on bulk transfers.

235 A series of ICMP echo packets will show that the two end hosts are
236 still capable of passing packets, a series of MTU-sized ICMP echo
237 packets will show some fragmentation, and a series of MTU-sized
238 ICMP echo packets with DF set will fail. This can be confusing for
239 network engineers trying to diagnose the problem.

241 There are several traceroute implementations that do PMTUD, and can
242 demonstrate the problem.

244 How to fix
245 TCP should notice that the connection is timing out. After several
246 timeouts, TCP should attempt to send smaller packets, perhaps
247 turning off the DF flag for each packet. If this succeeds, it
248 should continue to turn off PMTUD for the connection for some
249 reasonable period of time, after which it should probe again to try
250 to determine if the path has changed.

252 Note that, under IPv6, there is no DF bit -- it is implicitly on at
253 all times. Fragmentation is not allowed in routers, only at the
254 originating host. Fortunately, the minimum supported MTU for IPv6
255 is 1280 octets, which is significantly larger than the 68 octet
256 minimum in IPv4. This should make it more reasonable for IPv6 TCP
257 implementations to fall back to 1280 octet packets, when IPv4
258 implementations will probably have to turn off DF to respond to
259 black hole detection.

261 Ideally, the ICMP black holes should be fixed when they are found.
263 ^L
264 If hosts start to implement black hole detection, it may be that
265 these problems will go unnoticed and unfixed. This is especially
266 unfortunate, since detection can take several seconds each time,
267 and these delays could result in a significant, hidden degradation
268 of performance. Hosts that implement black hole detection should
269 probably log detected black holes, so that they can be fixed.

271 3.2.

273 Name of Problem
274 Stretch ACK due to PMTUD

276 Classification
277 Congestion Control / Performance

279 Description
280 When a naively implemented TCP stack communicates with a PMTUD
281 equipped stack, it will try to generate an ACK for every second
282 full-sized segment. If it determines the full-sized segment based
283 on the advertised MSS, this can degrade badly in the face of PMTUD.

285 The PMTU can wind up being a small fraction of the advertised MSS;
286 in this case, an ACK would be generated only very infrequently.

288 Significance

290 Stretch ACKs have a variety of unfortunate effects, more fully
291 outlined in [RFC2525]. Most of these have to do with encouraging a
292 more bursty connection, due to the infrequent arrival of ACKs.
293 They can also impede congestion window growth.

295 Implications

297 The complete implications of stretch ACKs are outlined in
298 [RFC2525].

300 Relevant RFCs
301 RFC 1122 outlines the requirements for frequency of ACK generation.
302 [RFC2581] expands on this and clarifies that delayed ACK is a
303 SHOULD, not a MUST.

305 ^L

307 Trace file demonstrating it

309 Made using tcpdump recording at an intermediate host. The
310 timestamp options from all but the first two packets have been
311 removed for clarity.

313 18:16:52.976657 A > B: S 3183102292:3183102292(0) win 16384
314 (DF)
315 18:16:52.979580 B > A: S 2022212745:2022212745(0) ack 3183102293 win 49152
316 (DF)
317 18:16:52.979738 A > B: . ack 1 win 17248 (DF)
318 18:16:52.982473 A > B: .
1:4301(4300) ack 1 win 17248 (DF)
319 18:16:52.982557 C > A: icmp: B unreachable -
320 need to frag (mtu 1500)! (DF)
321 18:16:52.985839 B > A: . ack 1 win 32768 (DF)
322 18:16:54.129928 A > B: . 1:1449(1448) ack 1 win 17248 (DF)
323 .
324 .
325 .
326 18:16:58.507078 A > B: . 1463941:1465389(1448) ack 1 win 17248 (DF)
327 18:16:58.507200 A > B: . 1465389:1466837(1448) ack 1 win 17248 (DF)
328 18:16:58.507326 A > B: . 1466837:1468285(1448) ack 1 win 17248 (DF)
329 18:16:58.507439 A > B: . 1468285:1469733(1448) ack 1 win 17248 (DF)
330 18:16:58.524763 B > A: . ack 1452357 win 32768 (DF)
331 18:16:58.524986 B > A: . ack 1461045 win 32768 (DF)
332 18:16:58.525138 A > B: . 1469733:1471181(1448) ack 1 win 17248 (DF)
333 18:16:58.525268 A > B: . 1471181:1472629(1448) ack 1 win 17248 (DF)
334 18:16:58.525393 A > B: . 1472629:1474077(1448) ack 1 win 17248 (DF)
335 18:16:58.525516 A > B: . 1474077:1475525(1448) ack 1 win 17248 (DF)
336 18:16:58.525642 A > B: . 1475525:1476973(1448) ack 1 win 17248 (DF)
337 18:16:58.525766 A > B: . 1476973:1478421(1448) ack 1 win 17248 (DF)
338 18:16:58.526063 A > B: . 1478421:1479869(1448) ack 1 win 17248 (DF)
339 18:16:58.526187 A > B: . 1479869:1481317(1448) ack 1 win 17248 (DF)
340 18:16:58.526310 A > B: . 1481317:1482765(1448) ack 1 win 17248 (DF)
341 18:16:58.526432 A > B: . 1482765:1484213(1448) ack 1 win 17248 (DF)
342 18:16:58.526561 A > B: . 1484213:1485661(1448) ack 1 win 17248 (DF)
343 18:16:58.526671 A > B: . 1485661:1487109(1448) ack 1 win 17248 (DF)
344 18:16:58.537944 B > A: . ack 1478421 win 32768 (DF)
345 18:16:58.538328 A > B: . 1487109:1488557(1448) ack 1 win 17248 (DF)

347 Note that the interval between ACKs is significantly larger than
348 two times the segment size; it works out to be almost exactly two
349 times the advertised MSS. This transfer was long enough that it
350 could be verified that the stretch ACK was not the result of lost
351 ACK packets.
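
The ACK spacing in the trace above can be checked with a little arithmetic. The sketch below takes the three ACK numbers that B sends in the trace and expresses each gap in units of the 1448-octet segments that A is sending; for comparison it does the same for the first three ACKs shown in the correct-behavior trace for this problem, where segments carry 1460 octets. The helper name `segments_per_ack` is an assumption made for illustration. The advertised MSS itself is not visible in the trace as reproduced here, so the comparison is made in segments rather than against the MSS.

```python
from typing import List


def segments_per_ack(acks: List[int], segment_size: int) -> List[float]:
    """Number of sender segments covered by each ACK after the first."""
    return [(cur - prev) / segment_size for prev, cur in zip(acks, acks[1:])]


# ACK numbers copied from the stretch ACK trace (1448-octet segments).
STRETCH_ACKS = [1452357, 1461045, 1478421]
# First three ACK numbers from the correct-behavior trace (1460-octet segments).
NORMAL_ACKS = [1481901, 1484821, 1487741]

print(segments_per_ack(STRETCH_ACKS, 1448))
print(segments_per_ack(NORMAL_ACKS, 1460))
```

The stretch ACK trace yields 6 and 12 segments per ACK (8688 and 17376 octets), while the correct trace yields 2 segments per ACK (2920 octets), which is the every-second-segment behavior the receiver is meant to exhibit.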
353 ^L

355 Trace file demonstrating correct behavior

357 Made using tcpdump recording at an intermediate host. The
358 timestamp options from all but the first two packets have been
359 removed for clarity.

361 18:13:32.287965 A > B: S 2972697496:2972697496(0)
362 win 16384 (DF)
363 18:13:32.290785 B > A: S 245639054:245639054(0)
364 ack 2972697497 win 34496 (DF)
365 18:13:32.290941 A > B: . ack 1 win 17248 (DF)
366 18:13:32.293774 A > B: . 1:4313(4312) ack 1 win 17248 (DF)
367 18:13:32.293856 C > A: icmp: B unreachable -
368 need to frag (mtu 1500)! (DF)
369 18:13:33.637338 A > B: . 1:1461(1460) ack 1 win 17248 (DF)
370 .
371 .
372 .
373 18:13:35.561691 A > B: . 1514021:1515481(1460) ack 1 win 17248 (DF)
374 18:13:35.561814 A > B: . 1515481:1516941(1460) ack 1 win 17248 (DF)
375 18:13:35.561938 A > B: . 1516941:1518401(1460) ack 1 win 17248 (DF)
376 18:13:35.562059 A > B: . 1518401:1519861(1460) ack 1 win 17248 (DF)
377 18:13:35.562174 A > B: . 1519861:1521321(1460) ack 1 win 17248 (DF)
378 18:13:35.564008 B > A: . ack 1481901 win 64680 (DF)
379 18:13:35.564383 A > B: . 1521321:1522781(1460) ack 1 win 17248 (DF)
380 18:13:35.564499 A > B: . 1522781:1524241(1460) ack 1 win 17248 (DF)
381 18:13:35.615576 B > A: . ack 1484821 win 64680 (DF)
382 18:13:35.615646 B > A: . ack 1487741 win 64680 (DF)
383 18:13:35.615716 B > A: . ack 1490661 win 64680 (DF)
384 18:13:35.615784 B > A: . ack 1493581 win 64680 (DF)
385 18:13:35.615856 B > A: . ack 1496501 win 64680 (DF)
386 18:13:35.615952 A > B: . 1524241:1525701(1460) ack 1 win 17248 (DF)
387 18:13:35.615966 B > A: . ack 1499421 win 64680 (DF)
388 18:13:35.616088 A > B: . 1525701:1527161(1460) ack 1 win 17248 (DF)
389 18:13:35.616105 B > A: . ack 1502341 win 64680 (DF)
390 18:13:35.616211 A > B: . 1527161:1528621(1460) ack 1 win 17248 (DF)
391 18:13:35.616228 B > A: . ack 1505261 win 64680 (DF)
392 18:13:35.616327 A > B: . 1528621:1530081(1460) ack 1 win 17248 (DF)
393 18:13:35.616349 B > A: .
ack 1508181 win 64680 (DF)
394 18:13:35.616448 A > B: . 1530081:1531541(1460) ack 1 win 17248 (DF)
395 18:13:35.616565 A > B: . 1531541:1533001(1460) ack 1 win 17248 (DF)
396 18:13:35.616891 A > B: . 1533001:1534461(1460) ack 1 win 17248 (DF)

398 In this trace, an ACK is generated for every two segments that
399 arrive. (The segment size is slightly larger in this trace, even
400 though the source hosts are the same, because of the lack of
401 timestamp options in this trace.)

403 ^L

405 How to detect
406 This condition can be observed in a packet trace when the
407 advertised MSS is significantly larger than the actual PMTU of a
408 connection.

410 How to fix
411 Several solutions for this problem have been proposed:

413 A simple solution is to ACK every other packet, regardless of size.
414 This has the drawback of generating large numbers of ACKs in the
415 face of lots of very small packets; this shows up with
416 applications like the X Window System.

418 A slightly more complex solution would monitor the size of incoming
419 segments and try to determine what segment size the sender is
420 using. This requires slightly more state in the receiver, but has
421 the advantage of making receiver silly window syndrome avoidance
422 computations more accurate [RFC813].

424 3.3.

426 Name of Problem
427 Determining MSS from PMTU

429 Classification
430 Performance

432 Description
433 The MSS advertised at the start of a connection should be based on
434 the MTU of the interfaces on the system. (For efficiency and other
435 reasons this may not be the largest MSS possible.) Some systems
436 use PMTUD determined values to determine the MSS to advertise.

438 This results in an advertised MSS that is smaller than the largest
439 MTU the system can receive.

441 Significance
442 The advertised MSS is an indication to the remote system about the
443 largest TCP segment that can be received [RFC879].
If this value
444 is too small, the remote system will be forced to use a smaller

446 ^L
447 segment size when sending, purely because the local system found a
448 particular PMTU earlier.

450 Given the asymmetric nature of many routes on the Internet
451 [Paxson97], it seems entirely possible that the return PMTU is
452 different from the sending PMTU. Limiting the segment size in this
453 way can reduce performance and frustrate the PMTUD algorithm.

455 Even if the route were symmetric, setting this artificially lowered
456 limit on segment size will make it impossible to probe later to
457 determine if the PMTU has changed.

459 Implications
460 The whole point of PMTUD is to send as large a segment as possible.
461 If long-running connections cannot successfully probe for larger
462 PMTU, then potential performance gains will be impossible to
463 realize. This destroys the whole point of PMTUD.

465 Relevant RFCs
466 RFC 1191. [RFC879] provides a complete discussion of MSS
467 calculations and appropriate values. Note that this practice does
468 not violate any of the specifications in these RFCs.

470 Trace file demonstrating it
471 This trace was made using tcpdump running on an intermediate host.
472 Host A initiates two separate consecutive connections, A1 and A2,
473 to host B. Router C is the location of the MTU bottleneck. As
474 usual, TCP options are removed from all non-SYN packets.

476 22:33:32.305912 A1 > B: S 1523306220:1523306220(0)
477 win 8760 (DF)
478 22:33:32.306518 B > A1: S 729966260:729966260(0)
479 ack 1523306221 win 16384
480 22:33:32.310307 A1 > B: . ack 1 win 8760 (DF)
481 22:33:32.323496 A1 > B: P 1:1461(1460) ack 1 win 8760 (DF)
482 22:33:32.323569 C > A1: icmp: 129.99.238.5 unreachable -
483 need to frag (mtu 1024) (DF) (ttl 255, id 20666)
484 22:33:32.783694 A1 > B: . 1:985(984) ack 1 win 8856 (DF)
485 22:33:32.840817 B > A1: . ack 985 win 16384
486 22:33:32.845651 A1 > B: .
1461:2445(984) ack 1 win 8856 (DF)
487 22:33:32.846094 B > A1: . ack 985 win 16384
488 22:33:33.724392 A1 > B: . 985:1969(984) ack 1 win 8856 (DF)
489 22:33:33.724893 B > A1: . ack 2445 win 14924
490 22:33:33.728591 A1 > B: . 2445:2921(476) ack 1 win 8856 (DF)
491 22:33:33.729161 A1 > B: . ack 1 win 8856 (DF)

493 ^L
494 22:33:33.840758 B > A1: . ack 2921 win 16384

496 [...]

498 22:33:34.238659 A1 > B: F 7301:8193(892) ack 1 win 8856 (DF)
499 22:33:34.239036 B > A1: . ack 8194 win 15492
500 22:33:34.239303 B > A1: F 1:1(0) ack 8194 win 16384
501 22:33:34.242971 A1 > B: . ack 2 win 8856 (DF)
502 22:33:34.454218 A2 > B: S 1523591299:1523591299(0)
503 win 8856 (DF)
504 22:33:34.454617 B > A2: S 732408874:732408874(0)
505 ack 1523591300 win 16384
506 22:33:34.457516 A2 > B: . ack 1 win 8856 (DF)
507 22:33:34.470683 A2 > B: P 1:985(984) ack 1 win 8856 (DF)
508 22:33:34.471144 B > A2: . ack 985 win 16384
509 22:33:34.476554 A2 > B: . 985:1969(984) ack 1 win 8856 (DF)
510 22:33:34.477580 A2 > B: P 1969:2953(984) ack 1 win 8856 (DF)

512 [...]

514 Notice that the SYN packet for session A2 specifies an MSS of 984.

516 Trace file demonstrating correct behavior

518 As before, this trace was made using tcpdump running on an
519 intermediate host. Host A initiates two separate consecutive
520 connections, A1 and A2, to host B. Router C is the location of the
521 MTU bottleneck. As usual, TCP options are removed from all non-SYN
522 packets.

524 22:36:58.828602 A1 > B: S 3402991286:3402991286(0) win 32768
525 (DF)

527 22:36:58.844040 B > A1: S 946999880:946999880(0)
528 ack 3402991287 win 16384
529
530 22:36:58.848058 A1 > B: . ack 1 win 32768 (DF)
531 22:36:58.851514 A1 > B: P 1:1025(1024) ack 1 win 32768 (DF)
532 22:36:58.851584 C > A1: icmp: 129.99.238.5 unreachable -
533 need to frag (mtu 1024) (DF)
534 22:36:58.855885 A1 > B: . 1:969(968) ack 1 win 32768 (DF)
535 22:36:58.856378 A1 > B: . 969:985(16) ack 1 win 32768 (DF)
536 22:36:59.036309 B > A1: .
ack 985 win 16384
537 22:36:59.039255 A1 > B: FP 985:1025(40) ack 1 win 32768 (DF)
538 22:36:59.039623 B > A1: . ack 1026 win 16344
539 22:36:59.039828 B > A1: F 1:1(0) ack 1026 win 16384

541 ^L
542 22:36:59.043037 A1 > B: . ack 2 win 32768 (DF)
543 22:37:01.436032 A2 > B: S 3404812097:3404812097(0) win 32768
544 (DF)

546 22:37:01.436424 B > A2: S 949814769:949814769(0)
547 ack 3404812098 win 16384
548
549 22:37:01.440147 A2 > B: . ack 1 win 32768 (DF)
550 22:37:01.442736 A2 > B: . 1:969(968) ack 1 win 32768 (DF)
551 22:37:01.442894 A2 > B: P 969:985(16) ack 1 win 32768 (DF)
552 22:37:01.443283 B > A2: . ack 985 win 16384
553 22:37:01.446068 A2 > B: P 985:1025(40) ack 1 win 32768 (DF)
554 22:37:01.446519 B > A2: . ack 1025 win 16384
555 22:37:01.448465 A2 > B: F 1025:1025(0) ack 1 win 32768 (DF)
556 22:37:01.448837 B > A2: . ack 1026 win 16384
557 22:37:01.449007 B > A2: F 1:1(0) ack 1026 win 16384
558 22:37:01.452201 A2 > B: . ack 2 win 32768 (DF)

560 Note that the same MSS was used for both session A1 and session A2.

562 How to detect

564 This can be detected using a packet trace of two separate
565 connections; the first should invoke PMTUD; the second should
566 start soon enough after the first that the PMTU value does not time
567 out.

569 How to fix
570 The MSS should be determined based on the MTUs of the interfaces on
571 the system, as outlined in [RFC1122] and [RFC1191].

573 4. Security Considerations

575 The one security concern raised by this memo is that ICMP black holes
576 are often caused by over-zealous security administrators who block
577 all ICMP messages. It is vitally important that those who design and
578 deploy security systems understand the impact of strict filtering on
579 upper-layer protocols. The safest web site in the world is worthless
580 if most TCP implementations cannot transfer data from it. It would
581 be far nicer to have all of the black holes fixed rather than fixing
582 all of the TCP implementations.

584 ^L

586 5.
Acknowledgements

588 Thanks to Mark Allman, Vern Paxson, and Jamshid Mahdavi for generous
589 help reviewing the document, and to Matt Mathis for early suggestions
590 of various mechanisms that can cause PMTUD black holes, as well as
591 review. The structure for describing TCP problems, and the early
592 description of that structure is from [RFC2525]. Special thanks to
593 Amy Bock, who helped perform the PMTUD tests which discovered these
594 bugs.

596 6. References

598 [RFC2581]
599 M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control",
600 April 1999.

602 [RFC1122]
603 R. Braden, Editor, "Requirements for Internet Hosts --
604 Communication Layers," Oct. 1989.

606 [RFC813]
607 D. Clark, "Window and Acknowledgement Strategy in TCP," July 1982.

609 [Jacobson89]
610 V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via
611 anonymous ftp to ftp.ee.lbl.gov, Jun. 1989.

613 [RFC1435]
614 S. Knowles, "IESG Advice from Experience with Path MTU Discovery,"
615 March 1993.

617 [RFC1191]
618 J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990.

620 [RFC1981]
621 J. McCann, S. Deering & J. Mogul, "Path MTU Discovery for IP
622 version 6", August 1996.

624 [Paxson96]
625 V. Paxson, "End-to-End Routing Behavior in the Internet", IEEE/ACM
626 Transactions on Networking (5), pp. 601-615, Oct. 1997.

628 [RFC2525]
629 V. Paxson, Editor, M. Allman, S. Dawson, W. Fenner, J. Griner, I.

631 ^L
632 Heavens, K. Lahey, J. Semke, and B. Volz, "Known TCP Implementation
633 Problems", March 1999.

635 [RFC879]
636 J. Postel, "The TCP Maximum Segment Size and Related Topics,"
637 November, 1983.

639 [RFC2001]
640 W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit,
641 and Fast Recovery Algorithms," Jan. 1997.

643 6.1. Author's Address

645 Kevin Lahey