| < draft-ietf-tcpimpl-pmtud-03.txt | draft-ietf-tcpimpl-pmtud-04.txt > | |||
|---|---|---|---|---|
| Network Working Group K. Lahey | Network Working Group K. Lahey | |||
| Expires: October 2000 | Internet Draft June 2000 | |||
| Expires: November 2000 | ||||
| TCP Problems with Path MTU Discovery | TCP Problems with Path MTU Discovery | |||
| <draft-ietf-tcpimpl-pmtud-03.txt> | <draft-ietf-tcpimpl-pmtud-04.txt> | |||
| 1. Status of this Memo | 1. Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| This document is an Internet Draft. Internet Drafts are working | This document is an Internet Draft. Internet Drafts are working | |||
| documents of the Internet Engineering Task Force (IETF), its areas, | documents of the Internet Engineering Task Force (IETF), its areas, | |||
| and its working groups. Note that other groups may also distribute | and its working groups. Note that other groups may also distribute | |||
| working documents as Internet Drafts. | working documents as Internet Drafts. | |||
| skipping to change at page 2, line 5 ¶ | skipping to change at page 2, line 5 ¶ | |||
| To view the entire list of current Internet-Drafts, please check the | To view the entire list of current Internet-Drafts, please check the | |||
| "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow | "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow | |||
| Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern | Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern | |||
| Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific | Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific | |||
| Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). | Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). | |||
| This memo provides information for the Internet community. This memo | This memo provides information for the Internet community. This memo | |||
| does not specify an Internet standard of any kind. Distribution of | does not specify an Internet standard of any kind. Distribution of | |||
| this memo is unlimited. | this memo is unlimited. | |||
| 2. Introduction | 2. Introduction | |||
| This memo catalogs several known TCP implementation problems dealing | This memo catalogs several known TCP implementation problems dealing | |||
| with Path MTU Discovery [RFC1191], including the long-standing black | with Path MTU Discovery [RFC1191], including the long-standing black | |||
| hole problem, stretch ACKs due to confusion between MSS and segment | hole problem, stretch ACKs due to confusion between MSS and segment | |||
| size, and MSS advertisement based on PMTU. The goal in doing so is | size, and MSS advertisement based on PMTU. The goal in doing so is | |||
| to improve conditions in the existing Internet by enhancing the | to improve conditions in the existing Internet by enhancing the | |||
| quality of current TCP/IP implementations. | quality of current TCP/IP implementations. | |||
| While Path MTU Discovery (PMTUD) can be used with any upper-layer | While Path MTU Discovery (PMTUD) can be used with any upper-layer | |||
| skipping to change at page 2, line 28 ¶ | skipping to change at page 2, line 30 ¶ | |||
| issues, but not the TCP issues brought up in this document. | issues, but not the TCP issues brought up in this document. | |||
| Each problem is defined as follows: | Each problem is defined as follows: | |||
| Name of Problem | Name of Problem | |||
| The name associated with the problem. In this memo, the name is | The name associated with the problem. In this memo, the name is | |||
| given as a subsection heading. | given as a subsection heading. | |||
| Classification | Classification | |||
| One or more problem categories for which the problem is classified: | One or more problem categories for which the problem is classified: | |||
| "congestion control", "performance", "reliability", "non- | "congestion control", "performance", "reliability", | |||
| interoperation -- connectivity failure". | "non-interoperation -- connectivity failure". | |||
| Description | Description | |||
| A definition of the problem, succinct but including necessary | A definition of the problem, succinct but including necessary | |||
| background material. | background material. | |||
| Significance | Significance | |||
| A brief summary of the sorts of environments for which the problem | A brief summary of the sorts of environments for which the problem | |||
| is significant. | is significant. | |||
| Implications | Implications | |||
| Why the problem is viewed as a problem. | Why the problem is viewed as a problem. | |||
| Relevant RFCs | Relevant RFCs | |||
| The RFCs defining the TCP specification with which the problem | The RFCs defining the TCP specification with which the problem | |||
| conflicts. These RFCs often qualify behavior using terms such as | conflicts. These RFCs often qualify behavior using terms such as | |||
| MUST, SHOULD, MAY, and others written capitalized. See RFC 2119 | MUST, SHOULD, MAY, and others written capitalized. See RFC 2119 | |||
| for the exact interpretation of these terms. | for the exact interpretation of these terms. | |||
| Trace file demonstrating the problem | Trace file demonstrating the problem | |||
| One or more ASCII trace files demonstrating the problem, if | One or more ASCII trace files demonstrating the problem, if | |||
| applicable. | applicable. | |||
| Trace file demonstrating correct behavior | Trace file demonstrating correct behavior | |||
| One or more examples of how correct behavior appears in a trace, if | One or more examples of how correct behavior appears in a trace, if | |||
| applicable. | applicable. | |||
| References | References | |||
| References that further discuss the problem. | References that further discuss the problem. | |||
| skipping to change at page 4, line 5 ¶ | skipping to change at page 4, line 5 ¶ | |||
| Non-interoperation -- connectivity failure | Non-interoperation -- connectivity failure | |||
| Description | Description | |||
| A host performs Path MTU Discovery by sending out as large a packet | A host performs Path MTU Discovery by sending out as large a packet | |||
| as possible, with the Don't Fragment (DF) bit set in the IP header. | as possible, with the Don't Fragment (DF) bit set in the IP header. | |||
| If the packet is too large for a router to forward on to a | If the packet is too large for a router to forward on to a | |||
| particular link, the router must send an ICMP Destination | particular link, the router must send an ICMP Destination | |||
| Unreachable -- Fragmentation Needed message to the source address. | Unreachable -- Fragmentation Needed message to the source address. | |||
| The host then adjusts the packet size based on the ICMP message. | The host then adjusts the packet size based on the ICMP message. | |||
| As was pointed out in [RFC1435], routers don't always do this | As was pointed out in [RFC1435], routers don't always do this | |||
| correctly -- many routers fail to send the ICMP messages, for a | correctly -- many routers fail to send the ICMP messages, for a | |||
| variety of reasons ranging from kernel bugs to configuration | variety of reasons ranging from kernel bugs to configuration | |||
| problems. Firewalls are often misconfigured to suppress all ICMP | problems. Firewalls are often misconfigured to suppress all ICMP | |||
| messages. IPsec [RFC2401] and IP-in-IP [RFC2003] tunnels shouldn't | messages. IPsec [RFC2401] and IP-in-IP [RFC2003] tunnels shouldn't | |||
| cause these sorts of problems, if the implementations follow the | cause these sorts of problems, if the implementations follow the | |||
| advice in the appropriate documents. | advice in the appropriate documents. | |||
| PMTUD, as documented in [RFC1191], fails when the appropriate ICMP | PMTUD, as documented in [RFC1191], fails when the appropriate ICMP | |||
| messages are not received by the originating host. The upper-layer | messages are not received by the originating host. The upper-layer | |||
| skipping to change at page 5, line 4 ¶ | skipping to change at page 5, line 4 ¶ | |||
| Relevant RFCs | Relevant RFCs | |||
| RFC1191 describes Path MTU Discovery. RFC 1435 provides an early | RFC1191 describes Path MTU Discovery. RFC 1435 provides an early | |||
| description of these sorts of problems. | description of these sorts of problems. | |||
| Trace file demonstrating the problem | Trace file demonstrating the problem | |||
| Made using tcpdump [Jacobson89] recording at an intermediate host. | Made using tcpdump [Jacobson89] recording at an intermediate host. | |||
| 20:12:11.951321 A > B: S 1748427200:1748427200(0) | 20:12:11.951321 A > B: S 1748427200:1748427200(0) | |||
| win 49152 <mss 1460> | win 49152 <mss 1460> | |||
| 20:12:11.951829 B > A: S 1001927984:1001927984(0) | 20:12:11.951829 B > A: S 1001927984:1001927984(0) | |||
| ack 1748427201 win 16384 <mss 65240> | ack 1748427201 win 16384 <mss 65240> | |||
| 20:12:11.955230 A > B: . ack 1 win 49152 (DF) | 20:12:11.955230 A > B: . ack 1 win 49152 (DF) | |||
| 20:12:11.959099 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:12:11.959099 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| 20:12:13.139074 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:12:13.139074 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| 20:12:16.188685 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:12:16.188685 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| 20:12:22.290483 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:12:22.290483 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| 20:12:34.491856 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:12:34.491856 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| 20:12:58.896405 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:12:58.896405 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| 20:13:47.703184 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:13:47.703184 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| 20:14:52.780640 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | 20:14:52.780640 A > B: . 1:1461(1460) ack 1 win 49152 (DF) | |||
| skipping to change at page 6, line 4 ¶ | skipping to change at page 6, line 4 ¶ | |||
| 16:48:45.786814 A > B: . 1:1461(1460) ack 1 win 8760 (DF) | 16:48:45.786814 A > B: . 1:1461(1460) ack 1 win 8760 (DF) | |||
| 16:48:51.794676 A > B: . 1:1461(1460) ack 1 win 8760 (DF) | 16:48:51.794676 A > B: . 1:1461(1460) ack 1 win 8760 (DF) | |||
| 16:49:03.808912 A > B: . 1:537(536) ack 1 win 8760 | 16:49:03.808912 A > B: . 1:537(536) ack 1 win 8760 | |||
| 16:49:04.016476 B > A: . ack 537 win 16384 | 16:49:04.016476 B > A: . ack 537 win 16384 | |||
| 16:49:04.021245 A > B: . 537:1073(536) ack 1 win 8760 | 16:49:04.021245 A > B: . 537:1073(536) ack 1 win 8760 | |||
| 16:49:04.021697 A > B: . 1073:1609(536) ack 1 win 8760 | 16:49:04.021697 A > B: . 1073:1609(536) ack 1 win 8760 | |||
| 16:49:04.120694 B > A: . ack 1609 win 16384 | 16:49:04.120694 B > A: . ack 1609 win 16384 | |||
| 16:49:04.126142 A > B: . 1609:2145(536) ack 1 win 8760 | 16:49:04.126142 A > B: . 1609:2145(536) ack 1 win 8760 | |||
| In this case, the sender sees four packets fail to traverse the | In this case, the sender sees four packets fail to traverse the | |||
| network (using a two-packet initial send window) and turns off | network (using a two-packet initial send window) and turns off | |||
| PMTUD. All subsequent packets have the DF flag turned off, and the | PMTUD. All subsequent packets have the DF flag turned off, and the | |||
| size set to the default value of 536 [RFC1122]. | size set to the default value of 536 [RFC1122]. | |||
| References | References | |||
| This problem has been discussed extensively on the tcp-impl mailing | This problem has been discussed extensively on the tcp-impl mailing | |||
| list; the name "black hole" has been in use for many years. | list; the name "black hole" has been in use for many years. | |||
| How to detect | How to detect | |||
| skipping to change at page 7, line 5 ¶ | skipping to change at page 7, line 5 ¶ | |||
| all times. Fragmentation is not allowed in routers, only at the | all times. Fragmentation is not allowed in routers, only at the | |||
| originating host. Fortunately, the minimum supported MTU for IPv6 | originating host. Fortunately, the minimum supported MTU for IPv6 | |||
| is 1280 octets, which is significantly larger than the 68 octet | is 1280 octets, which is significantly larger than the 68 octet | |||
| minimum in IPv4. This should make it more reasonable for IPv6 TCP | minimum in IPv4. This should make it more reasonable for IPv6 TCP | |||
| implementations to fall back to 1280 octet packets, when IPv4 | implementations to fall back to 1280 octet packets, when IPv4 | |||
| implementations will probably have to turn off DF to respond to | implementations will probably have to turn off DF to respond to | |||
| black hole detection. | black hole detection. | |||
| Ideally, the ICMP black holes should be fixed when they are found. | Ideally, the ICMP black holes should be fixed when they are found. | |||
| If hosts start to implement black hole detection, it may be that | If hosts start to implement black hole detection, it may be that | |||
| these problems will go unnoticed and unfixed. This is especially | these problems will go unnoticed and unfixed. This is especially | |||
| unfortunate, since detection can take several seconds each time, | unfortunate, since detection can take several seconds each time, | |||
| and these delays could result in a significant, hidden degradation | and these delays could result in a significant, hidden degradation | |||
| of performance. Hosts that implement black hole detection should | of performance. Hosts that implement black hole detection should | |||
| probably log detected black holes, so that they can be fixed. | probably log detected black holes, so that they can be fixed. | |||
| 3.2. | 3.2. | |||
| Name of Problem | Name of Problem | |||
| Stretch ACK due to PMTUD | Stretch ACK due to PMTUD | |||
| Classification | Classification | |||
| Congestion Control / Performance | Congestion Control / Performance | |||
| Description | Description | |||
| When a naively implemented TCP stack communicates with a PMTUD- | When a naively implemented TCP stack communicates with a PMTUD | |||
| equipped stack, it will try to generate an ACK for every second | equipped stack, it will try to generate an ACK for every second | |||
| full-sized segment. If it determines the full-sized segment based | full-sized segment. If it determines the full-sized segment based | |||
| on the advertised MSS, this can degrade badly in the face of PMTUD. | on the advertised MSS, this can degrade badly in the face of PMTUD. | |||
| The PMTU can wind up being a small fraction of the advertised MSS; | The PMTU can wind up being a small fraction of the advertised MSS; | |||
| in this case, an ACK would be generated only very infrequently. | in this case, an ACK would be generated only very infrequently. | |||
| Significance | Significance | |||
| Stretch ACKs have a variety of unfortunate effects, more fully | Stretch ACKs have a variety of unfortunate effects, more fully | |||
| skipping to change at page 8, line 5 ¶ | skipping to change at page 8, line 5 ¶ | |||
| Implications | Implications | |||
| The complete implications of stretch ACKs are outlined in | The complete implications of stretch ACKs are outlined in | |||
| [RFC2525]. | [RFC2525]. | |||
| Relevant RFCs | Relevant RFCs | |||
| RFC 1122 outlines the requirements for frequency of ACK generation. | RFC 1122 outlines the requirements for frequency of ACK generation. | |||
| [RFC2581] expands on this and clarifies that delayed ACK is a | [RFC2581] expands on this and clarifies that delayed ACK is a | |||
| SHOULD, not a MUST. | SHOULD, not a MUST. | |||
| Trace file demonstrating it | Trace file demonstrating it | |||
| Made using tcpdump recording at an intermediate host. The | Made using tcpdump recording at an intermediate host. The | |||
| timestamp options from all but the first two packets have been | timestamp options from all but the first two packets have been | |||
| removed for clarity. | removed for clarity. | |||
| 18:16:52.976657 A > B: S 3183102292:3183102292(0) win 16384 | 18:16:52.976657 A > B: S 3183102292:3183102292(0) win 16384 | |||
| <mss 4312,nop,wscale 0,nop,nop,timestamp 12128 0> (DF) () | <mss 4312,nop,wscale 0,nop,nop,timestamp 12128 0> (DF) | |||
| 18:16:52.979580 B > A: S 2022212745:2022212745(0) ack 3183102293 win 49152 | 18:16:52.979580 B > A: S 2022212745:2022212745(0) ack 3183102293 win 49152 | |||
| <mss 4312,nop,wscale 1,nop,nop,timestamp 1592957 12128> (DF) () | <mss 4312,nop,wscale 1,nop,nop,timestamp 1592957 12128> (DF) | |||
| 18:16:52.979738 A > B: . ack 1 win 17248 (DF) () | 18:16:52.979738 A > B: . ack 1 win 17248 (DF) | |||
| 18:16:52.982473 A > B: . 1:4301(4300) ack 1 win 17248 (DF) () | 18:16:52.982473 A > B: . 1:4301(4300) ack 1 win 17248 (DF) | |||
| 18:16:52.982557 C > A: icmp: B unreachable - | 18:16:52.982557 C > A: icmp: B unreachable - | |||
| need to frag (mtu 1500)! (DF) () | need to frag (mtu 1500)! (DF) | |||
| 18:16:52.985839 B > A: . ack 1 win 32768 (DF) () | 18:16:52.985839 B > A: . ack 1 win 32768 (DF) | |||
| 18:16:54.129928 A > B: . 1:1449(1448) ack 1 win 17248 (DF) () | 18:16:54.129928 A > B: . 1:1449(1448) ack 1 win 17248 (DF) | |||
| . | . | |||
| . | . | |||
| . | . | |||
| 18:16:58.507078 A > B: . 1463941:1465389(1448) ack 1 win 17248 (DF) () | 18:16:58.507078 A > B: . 1463941:1465389(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.507200 A > B: . 1465389:1466837(1448) ack 1 win 17248 (DF) () | 18:16:58.507200 A > B: . 1465389:1466837(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.507326 A > B: . 1466837:1468285(1448) ack 1 win 17248 (DF) () | 18:16:58.507326 A > B: . 1466837:1468285(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.507439 A > B: . 1468285:1469733(1448) ack 1 win 17248 (DF) () | 18:16:58.507439 A > B: . 1468285:1469733(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.524763 B > A: . ack 1452357 win 32768 (DF) () | 18:16:58.524763 B > A: . ack 1452357 win 32768 (DF) | |||
| 18:16:58.524986 B > A: . ack 1461045 win 32768 (DF) () | 18:16:58.524986 B > A: . ack 1461045 win 32768 (DF) | |||
| 18:16:58.525138 A > B: . 1469733:1471181(1448) ack 1 win 17248 (DF) () | 18:16:58.525138 A > B: . 1469733:1471181(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.525268 A > B: . 1471181:1472629(1448) ack 1 win 17248 (DF) () | 18:16:58.525268 A > B: . 1471181:1472629(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.525393 A > B: . 1472629:1474077(1448) ack 1 win 17248 (DF) () | 18:16:58.525393 A > B: . 1472629:1474077(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.525516 A > B: . 1474077:1475525(1448) ack 1 win 17248 (DF) () | 18:16:58.525516 A > B: . 1474077:1475525(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.525642 A > B: . 1475525:1476973(1448) ack 1 win 17248 (DF) () | 18:16:58.525642 A > B: . 1475525:1476973(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.525766 A > B: . 1476973:1478421(1448) ack 1 win 17248 (DF) () | 18:16:58.525766 A > B: . 1476973:1478421(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.526063 A > B: . 1478421:1479869(1448) ack 1 win 17248 (DF) () | 18:16:58.526063 A > B: . 1478421:1479869(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.526187 A > B: . 1479869:1481317(1448) ack 1 win 17248 (DF) () | 18:16:58.526187 A > B: . 1479869:1481317(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.526310 A > B: . 1481317:1482765(1448) ack 1 win 17248 (DF) () | 18:16:58.526310 A > B: . 1481317:1482765(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.526432 A > B: . 1482765:1484213(1448) ack 1 win 17248 (DF) () | 18:16:58.526432 A > B: . 1482765:1484213(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.526561 A > B: . 1484213:1485661(1448) ack 1 win 17248 (DF) () | 18:16:58.526561 A > B: . 1484213:1485661(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.526671 A > B: . 1485661:1487109(1448) ack 1 win 17248 (DF) () | 18:16:58.526671 A > B: . 1485661:1487109(1448) ack 1 win 17248 (DF) | |||
| 18:16:58.537944 B > A: . ack 1478421 win 32768 (DF) () | 18:16:58.537944 B > A: . ack 1478421 win 32768 (DF) | |||
| 18:16:58.538328 A > B: . 1487109:1488557(1448) ack 1 win 17248 (DF) () | 18:16:58.538328 A > B: . 1487109:1488557(1448) ack 1 win 17248 (DF) | |||
| Note that the interval between ACKs is significantly larger than | Note that the interval between ACKs is significantly larger than | |||
| two times the segment size; it works out to be almost exactly two | two times the segment size; it works out to be almost exactly two | |||
| times the advertised MSS. This transfer was long enough that it | times the advertised MSS. This transfer was long enough that it | |||
| could be verified that the stretch ACK was not the result of lost | could be verified that the stretch ACK was not the result of lost | |||
| ACK packets. | ACK packets. | |||
| Trace file demonstrating correct behavior | Trace file demonstrating correct behavior | |||
| Made using tcpdump recording at an intermediate host. The | Made using tcpdump recording at an intermediate host. The | |||
| timestamp options from all but the first two packets have been | timestamp options from all but the first two packets have been | |||
| removed for clarity. | removed for clarity. | |||
| 18:13:32.287965 A > B: S 2972697496:2972697496(0) | 18:13:32.287965 A > B: S 2972697496:2972697496(0) | |||
| win 16384 <mss 4312,nop,wscale 0,nop,nop,timestamp 11326 0> (DF) | win 16384 <mss 4312,nop,wscale 0,nop,nop,timestamp 11326 0> (DF) | |||
| 18:13:32.290785 B > A: S 245639054:245639054(0) | 18:13:32.290785 B > A: S 245639054:245639054(0) | |||
| ack 2972697497 win 34496 <mss 4312> (DF) | ack 2972697497 win 34496 <mss 4312> (DF) | |||
| skipping to change at page 10, line 5 ¶ | skipping to change at page 10, line 5 ¶ | |||
| 18:13:35.616349 B > A: . ack 1508181 win 64680 (DF) | 18:13:35.616349 B > A: . ack 1508181 win 64680 (DF) | |||
| 18:13:35.616448 A > B: . 1530081:1531541(1460) ack 1 win 17248 (DF) | 18:13:35.616448 A > B: . 1530081:1531541(1460) ack 1 win 17248 (DF) | |||
| 18:13:35.616565 A > B: . 1531541:1533001(1460) ack 1 win 17248 (DF) | 18:13:35.616565 A > B: . 1531541:1533001(1460) ack 1 win 17248 (DF) | |||
| 18:13:35.616891 A > B: . 1533001:1534461(1460) ack 1 win 17248 (DF) | 18:13:35.616891 A > B: . 1533001:1534461(1460) ack 1 win 17248 (DF) | |||
| In this trace, an ACK is generated for every two segments that | In this trace, an ACK is generated for every two segments that | |||
| arrive. (The segment size is slightly larger in this trace, even | arrive. (The segment size is slightly larger in this trace, even | |||
| though the source hosts are the same, because of the lack of | though the source hosts are the same, because of the lack of | |||
| timestamp options in this trace.) | timestamp options in this trace.) | |||
| How to detect | How to detect | |||
| This condition can be observed in a packet trace when the | This condition can be observed in a packet trace when the | |||
| advertised MSS is significantly larger than the actual PMTU of a | advertised MSS is significantly larger than the actual PMTU of a | |||
| connection. | connection. | |||
| How to fix | How to fix | |||
| Several solutions for this problem have been proposed: | Several solutions for this problem have been proposed: | |||
| A simple solution is to ACK every other packet, regardless of size. | A simple solution is to ACK every other packet, regardless of size. | |||
| This has the drawback of generating large numbers of ACKs in the | This has the drawback of generating large numbers of ACKs in the | |||
| face of lots of very small packets; this shows up with | face of lots of very small packets; this shows up with | |||
| applications like the X Window System. | applications like the X Window System. | |||
| A slightly more complex solution would monitor the size of incoming | A slightly more complex solution would monitor the size of incoming | |||
| segments and try to determine what segment size the sender is | segments and try to determine what segment size the sender is | |||
| using. This requires slightly more state in the receiver, but has | using. This requires slightly more state in the receiver, but has | |||
| the advantage of making receiver silly window syndrome avoidance | the advantage of making receiver silly window syndrome avoidance | |||
| computations more accurate. | computations more accurate [RFC813]. | |||
| 3.3. | 3.3. | |||
| Name of Problem | Name of Problem | |||
| Determining MSS from PMTU | Determining MSS from PMTU | |||
| Classification | Classification | |||
| Performance | Performance | |||
| Description | Description | |||
| The MSS advertised at the start of a connection should be based on | The MSS advertised at the start of a connection should be based on | |||
| the MTU of the interfaces on the system. Some systems use PMTUD | the MTU of the interfaces on the system. (For efficiency and other | |||
| determined values to determine the MSS to advertise. | reasons this may not be the largest MSS possible.) Some systems | |||
| use PMTUD determined values to determine the MSS to advertise. | ||||
| This results in an advertised MSS that is smaller than the largest | This results in an advertised MSS that is smaller than the largest | |||
| MTU the system can receive. | MTU the system can receive. | |||
| Significance | Significance | |||
| The advertised MSS is an indication to the remote system about the | The advertised MSS is an indication to the remote system about the | |||
| largest TCP segment that can be received [RFC879]. If this value | largest TCP segment that can be received [RFC879]. If this value | |||
| is too small, the remote system will be forced to use a smaller | is too small, the remote system will be forced to use a smaller | |||
| segment size when sending, purely because the local system found a | segment size when sending, purely because the local system found a | |||
| particular PMTU earlier. | particular PMTU earlier. | |||
| Given the asymmetric nature of many routes on the Internet | Given the asymmetric nature of many routes on the Internet | |||
| [Paxson97], it seems entirely possible that the return PMTU is | [Paxson97], it seems entirely possible that the return PMTU is | |||
| different from the sending PMTU. Limiting the segment size in this | different from the sending PMTU. Limiting the segment size in this | |||
| way can reduce performance and frustrate the PMTUD algorithm. | way can reduce performance and frustrate the PMTUD algorithm. | |||
| Even if the route was symmetric, setting this artificially lowered | Even if the route was symmetric, setting this artificially lowered | |||
| limit on segment size will make it impossible to probe later to | limit on segment size will make it impossible to probe later to | |||
| determine if the PMTU has changed. | determine if the PMTU has changed. | |||
| Implications | Implications | |||
| The whole point of PMTUD is to send as large a segment as possible. | The whole point of PMTUD is to send as large a segment as possible. | |||
| If long-running connections cannot successfully probe for larger | If long-running connections cannot successfully probe for larger | |||
| PMTU, then potential performance gains will be impossible to | PMTU, then potential performance gains will be impossible to | |||
| realize. This destroys the whole point of PMTUD. | realize. This destroys the whole point of PMTUD. | |||
| Relevant RFCs | Relevant RFCs | |||
| RFC 1191. [RFC897] provides a complete discussion of MSS | RFC 1191. [RFC879] provides a complete discussion of MSS | |||
| calculations and appropriate values. Note that this practice does | calculations and appropriate values. Note that this practice does | |||
| not violate any of the specifications in these RFCs. | not violate any of the specifications in these RFCs. | |||
| Trace file demonstrating it | Trace file demonstrating it | |||
| This trace was made using tcpdump running on an intermediate host. | This trace was made using tcpdump running on an intermediate host. | |||
| Host A initiates two separate consecutive connections, A1 and A2, | Host A initiates two separate consecutive connections, A1 and A2, | |||
| to host B. Router C is the location of the MTU bottleneck. As | to host B. Router C is the location of the MTU bottleneck. As | |||
| usual, TCP options are removed from all non-SYN packets. | usual, TCP options are removed from all non-SYN packets. | |||
| 22:33:32.305912 A1 > B: S 1523306220:1523306220(0) | 22:33:32.305912 A1 > B: S 1523306220:1523306220(0) | |||
| skipping to change at page 11, line 48 ¶ | skipping to change at page 12, line 4 ¶ | |||
| 22:33:32.323569 C > A1: icmp: 129.99.238.5 unreachable - | 22:33:32.323569 C > A1: icmp: 129.99.238.5 unreachable - | |||
| need to frag (mtu 1024) (DF) (ttl 255, id 20666) | need to frag (mtu 1024) (DF) (ttl 255, id 20666) | |||
| 22:33:32.783694 A1 > B: . 1:985(984) ack 1 win 8856 (DF) | 22:33:32.783694 A1 > B: . 1:985(984) ack 1 win 8856 (DF) | |||
| 22:33:32.840817 B > A1: . ack 985 win 16384 | 22:33:32.840817 B > A1: . ack 985 win 16384 | |||
| 22:33:32.845651 A1 > B: . 1461:2445(984) ack 1 win 8856 (DF) | 22:33:32.845651 A1 > B: . 1461:2445(984) ack 1 win 8856 (DF) | |||
| 22:33:32.846094 B > A1: . ack 985 win 16384 | 22:33:32.846094 B > A1: . ack 985 win 16384 | |||
| 22:33:33.724392 A1 > B: . 985:1969(984) ack 1 win 8856 (DF) | 22:33:33.724392 A1 > B: . 985:1969(984) ack 1 win 8856 (DF) | |||
| 22:33:33.724893 B > A1: . ack 2445 win 14924 | 22:33:33.724893 B > A1: . ack 2445 win 14924 | |||
| 22:33:33.728591 A1 > B: . 2445:2921(476) ack 1 win 8856 (DF) | 22:33:33.728591 A1 > B: . 2445:2921(476) ack 1 win 8856 (DF) | |||
| 22:33:33.729161 A1 > B: . ack 1 win 8856 (DF) | 22:33:33.729161 A1 > B: . ack 1 win 8856 (DF) | |||
| 22:33:33.840758 B > A1: . ack 2921 win 16384 | 22:33:33.840758 B > A1: . ack 2921 win 16384 | |||
| [...] | [...] | |||
| 22:33:34.238659 A1 > B: F 7301:8193(892) ack 1 win 8856 (DF) | 22:33:34.238659 A1 > B: F 7301:8193(892) ack 1 win 8856 (DF) | |||
| 22:33:34.239036 B > A1: . ack 8194 win 15492 | 22:33:34.239036 B > A1: . ack 8194 win 15492 | |||
| 22:33:34.239303 B > A1: F 1:1(0) ack 8194 win 16384 | 22:33:34.239303 B > A1: F 1:1(0) ack 8194 win 16384 | |||
| 22:33:34.242971 A1 > B: . ack 2 win 8856 (DF) | 22:33:34.242971 A1 > B: . ack 2 win 8856 (DF) | |||
| 22:33:34.454218 A2 > B: S 1523591299:1523591299(0) | 22:33:34.454218 A2 > B: S 1523591299:1523591299(0) | |||
| win 8856 <mss 984> (DF) | win 8856 <mss 984> (DF) | |||
| skipping to change at page 12, line 49 ¶ | skipping to change at page 13, line 4 ¶ | |||
| 22:36:58.848058 A1 > B: . ack 1 win 32768 (DF) | 22:36:58.848058 A1 > B: . ack 1 win 32768 (DF) | |||
| 22:36:58.851514 A1 > B: P 1:1025(1024) ack 1 win 32768 (DF) | 22:36:58.851514 A1 > B: P 1:1025(1024) ack 1 win 32768 (DF) | |||
| 22:36:58.851584 C > A1: icmp: 129.99.238.5 unreachable - | 22:36:58.851584 C > A1: icmp: 129.99.238.5 unreachable - | |||
| need to frag (mtu 1024) (DF) | need to frag (mtu 1024) (DF) | |||
| 22:36:58.855885 A1 > B: . 1:969(968) ack 1 win 32768 (DF) | 22:36:58.855885 A1 > B: . 1:969(968) ack 1 win 32768 (DF) | |||
| 22:36:58.856378 A1 > B: . 969:985(16) ack 1 win 32768 (DF) | 22:36:58.856378 A1 > B: . 969:985(16) ack 1 win 32768 (DF) | |||
| 22:36:59.036309 B > A1: . ack 985 win 16384 | 22:36:59.036309 B > A1: . ack 985 win 16384 | |||
| 22:36:59.039255 A1 > B: FP 985:1025(40) ack 1 win 32768 (DF) | 22:36:59.039255 A1 > B: FP 985:1025(40) ack 1 win 32768 (DF) | |||
| 22:36:59.039623 B > A1: . ack 1026 win 16344 | 22:36:59.039623 B > A1: . ack 1026 win 16344 | |||
| 22:36:59.039828 B > A1: F 1:1(0) ack 1026 win 16384 | 22:36:59.039828 B > A1: F 1:1(0) ack 1026 win 16384 | |||
| 22:36:59.043037 A1 > B: . ack 2 win 32768 (DF) | 22:36:59.043037 A1 > B: . ack 2 win 32768 (DF) | |||
| 22:37:01.436032 A2 > B: S 3404812097:3404812097(0) win 32768 | 22:37:01.436032 A2 > B: S 3404812097:3404812097(0) win 32768 | |||
| <mss 4312,wscale 0,nop,timestamp 1123372916 0, | <mss 4312,wscale 0,nop,timestamp 1123372916 0, | |||
| echo 1123372916> (DF) | echo 1123372916> (DF) | |||
| 22:37:01.436424 B > A2: S 949814769:949814769(0) | 22:37:01.436424 B > A2: S 949814769:949814769(0) | |||
| ack 3404812098 win 16384 | ack 3404812098 win 16384 | |||
| <mss 65240,nop,wscale 0,nop,nop,timestamp 429562 1123372916> | <mss 65240,nop,wscale 0,nop,nop,timestamp 429562 1123372916> | |||
| 22:37:01.440147 A2 > B: . ack 1 win 32768 (DF) | 22:37:01.440147 A2 > B: . ack 1 win 32768 (DF) | |||
| 22:37:01.442736 A2 > B: . 1:969(968) ack 1 win 32768 (DF) | 22:37:01.442736 A2 > B: . 1:969(968) ack 1 win 32768 (DF) | |||
| 22:37:01.442894 A2 > B: P 969:985(16) ack 1 win 32768 (DF) | 22:37:01.442894 A2 > B: P 969:985(16) ack 1 win 32768 (DF) | |||
| skipping to change at page 14, line 5 ¶ | skipping to change at page 14, line 5 ¶ | |||
| The one security concern raised by this memo is that ICMP black holes | The one security concern raised by this memo is that ICMP black holes | |||
| are often caused by over-zealous security administrators who block | are often caused by over-zealous security administrators who block | |||
| all ICMP messages. It is vitally important that those who design and | all ICMP messages. It is vitally important that those who design and | |||
| deploy security systems understand the impact of strict filtering on | deploy security systems understand the impact of strict filtering on | |||
| upper-layer protocols. The safest web site in the world is worthless | upper-layer protocols. The safest web site in the world is worthless | |||
| if most TCP implementations cannot transfer data from it. It would | if most TCP implementations cannot transfer data from it. It would | |||
| be far nicer to have all of the black holes fixed rather than fixing | be far nicer to have all of the black holes fixed rather than fixing | |||
| all of the TCP implementations. | all of the TCP implementations. | |||
| 5. Acknowledgements | 5. Acknowledgements | |||
| Thanks to Mark Allman, Vern Paxson, and Jamshid Mahdavi for generous | Thanks to Mark Allman, Vern Paxson, and Jamshid Mahdavi for generous | |||
| help reviewing the document, and to Matt Mathis for early suggestions | help reviewing the document, and to Matt Mathis for early suggestions | |||
| of various mechanisms that can cause PMTUD black holes, as well as | of various mechanisms that can cause PMTUD black holes, as well as | |||
| review. The structure for describing TCP problems, and the early | review. The structure for describing TCP problems, and the early | |||
| description of that structure is from [RFC2525]. Special thanks to | description of that structure is from [RFC2525]. Special thanks to | |||
| Amy Bock, who helped perform the PMTUD tests which discovered these | Amy Bock, who helped perform the PMTUD tests which discovered these | |||
| bugs. | bugs. | |||
| 6. References | 6. References | |||
| [RFC2581] | [RFC2581] | |||
| M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control", | M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control", | |||
| April 1999. | April 1999. | |||
| [RFC1122] | [RFC1122] | |||
| R. Braden, Editor, "Requirements for Internet Hosts -- | R. Braden, Editor, "Requirements for Internet Hosts -- | |||
| Communication Layers," Oct. 1989. | Communication Layers," Oct. 1989. | |||
| [RFC813] | ||||
| D. Clark, "Window and Acknowledgement Strategy in TCP," July 1982. | ||||
| [Jacobson89] | [Jacobson89] | |||
| V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via | V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via | |||
| anonymous ftp to ftp.ee.lbl.gov, Jun. 1989. | anonymous ftp to ftp.ee.lbl.gov, Jun. 1989. | |||
| [RFC1435] | [RFC1435] | |||
| S. Knowles, "IESG Advice from Experience with Path MTU Discovery," | S. Knowles, "IESG Advice from Experience with Path MTU Discovery," | |||
| March 1993. | March 1993. | |||
| [RFC1191] | [RFC1191] | |||
| J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990. | J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990. | |||
| skipping to change at page 14, line 46 ¶ | skipping to change at page 15, line 4 ¶ | |||
| [RFC1981] | [RFC1981] | |||
| J. McCann, S. Deering & J. Mogul, "Path MTU Discovery for IP | J. McCann, S. Deering & J. Mogul, "Path MTU Discovery for IP | |||
| version 6", August 1996. | version 6", August 1996. | |||
| [Paxson96] | [Paxson96] | |||
| V. Paxson, "End-to-End Routing Behavior in the Internet", IEEE/ACM | V. Paxson, "End-to-End Routing Behavior in the Internet", IEEE/ACM | |||
| Transactions on Networking (5), pp. 601-615, Oct. 1997. | Transactions on Networking (5), pp. 601-615, Oct. 1997. | |||
| [RFC2525] | [RFC2525] | |||
| V. Paxson, Editor, M. Allman, S. Dawson, W. Fenner, J. Griner, I. | V. Paxson, Editor, M. Allman, S. Dawson, W. Fenner, J. Griner, I. | |||
| Heavens, K. Lahey, J. Semke, and B. Volz, "Known TCP Implementation | Heavens, K. Lahey, J. Semke, and B. Volz, "Known TCP Implementation | |||
| Problems", March 1999. | Problems", March 1999. | |||
| [RFC879] | [RFC879] | |||
| J. Postel, "The TCP Maximum Segment Size and Related Topics," | J. Postel, "The TCP Maximum Segment Size and Related Topics," | |||
| November, 1983. | November, 1983. | |||
| [RFC2001] | [RFC2001] | |||
| W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, | W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, | |||
| and Fast Recovery Algorithms," Jan. 1997. | and Fast Recovery Algorithms," Jan. 1997. | |||
| 6.1. Author's Address | 6.1. Author's Address | |||
| Kevin Lahey <kml@logictier.com> | Kevin Lahey <kml@logictier.com> | |||
| LogicTier, Inc. | LogicTier, Inc. | |||
| Suite 100 | Suite 100 | |||
| 2 Waters Park Drive | 2 Waters Park Drive | |||
| San Mateo, CA 94403 | San Mateo, CA 94403 | |||
| USA | USA | |||
| Phone: +1 650/678-7033 | Phone: +1 650/678-7033 | |||
| 7. Full Copyright Statement | 7. Full Copyright Statement | |||
| Copyright (C) The Internet Society (1999). All Rights Reserved. | Copyright (C) The Internet Society (1999). All Rights Reserved. | |||
| skipping to change at page 15, line 47 ¶ | skipping to change at page 16, line 4 ¶ | |||
| copyrights defined in the Internet Standards process must be | copyrights defined in the Internet Standards process must be | |||
| followed, or as required to translate it into languages other than | followed, or as required to translate it into languages other than | |||
| English. | English. | |||
| The limited permissions granted above are perpetual and will not be | The limited permissions granted above are perpetual and will not be | |||
| revoked by the Internet Society or its successors or assigns. | revoked by the Internet Society or its successors or assigns. | |||
| This document and the information contained herein is provided on an | This document and the information contained herein is provided on an | |||
| "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | |||
| TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | |||
| BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | |||
| HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | |||
| MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | |||
| This draft was created in May 2000. | This draft was created in June 2000. | |||
| It expires in October 2000. | It expires in November 2000. | |||
| End of changes. 30 change blocks. | ||||
| 37 lines changed or deleted | 71 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/