< draft-ietf-tcpimpl-pmtud-03.txt   draft-ietf-tcpimpl-pmtud-04.txt >
Network Working Group K. Lahey Network Working Group K. Lahey
Expires: October 2000 Internet Draft June 2000
Expires: November 2000
TCP Problems with Path MTU Discovery TCP Problems with Path MTU Discovery
<draft-ietf-tcpimpl-pmtud-03.txt> <draft-ietf-tcpimpl-pmtud-04.txt>
1. Status of this Memo 1. Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
This document is an Internet Draft. Internet Drafts are working This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute and its working groups. Note that other groups may also distribute
working documents as Internet Drafts. working documents as Internet Drafts.
skipping to change at page 2, line 5 skipping to change at page 2, line 5
To view the entire list of current Internet-Drafts, please check the To view the entire list of current Internet-Drafts, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast). Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).
This memo provides information for the Internet community. This memo This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of does not specify an Internet standard of any kind. Distribution of
this memo is unlimited. this memo is unlimited.
2. Introduction 2. Introduction
This memo catalogs several known TCP implementation problems dealing This memo catalogs several known TCP implementation problems dealing
with Path MTU Discovery [RFC1191], including the long-standing black with Path MTU Discovery [RFC1191], including the long-standing black
hole problem, stretch ACKs due to confusion between MSS and segment hole problem, stretch ACKs due to confusion between MSS and segment
size, and MSS advertisement based on PMTU. The goal in doing so is size, and MSS advertisement based on PMTU. The goal in doing so is
to improve conditions in the existing Internet by enhancing the to improve conditions in the existing Internet by enhancing the
quality of current TCP/IP implementations. quality of current TCP/IP implementations.
While Path MTU Discovery (PMTUD) can be used with any upper-layer While Path MTU Discovery (PMTUD) can be used with any upper-layer
skipping to change at page 2, line 28 skipping to change at page 2, line 30
issues, but not the TCP issues brought up in this document. issues, but not the TCP issues brought up in this document.
Each problem is defined as follows: Each problem is defined as follows:
Name of Problem Name of Problem
The name associated with the problem. In this memo, the name is The name associated with the problem. In this memo, the name is
given as a subsection heading. given as a subsection heading.
Classification Classification
One or more problem categories for which the problem is classified: One or more problem categories for which the problem is classified:
"congestion control", "performance", "reliability", "non- "congestion control", "performance", "reliability",
interoperation -- connectivity failure". "non-interoperation -- connectivity failure".
Description Description
A definition of the problem, succinct but including necessary A definition of the problem, succinct but including necessary
background material. background material.
Significance Significance
A brief summary of the sorts of environments for which the problem A brief summary of the sorts of environments for which the problem
is significant. is significant.
Implications Implications
Why the problem is viewed as a problem. Why the problem is viewed as a problem.
Relevant RFCs Relevant RFCs
The RFCs defining the TCP specification with which the problem The RFCs defining the TCP specification with which the problem
conflicts. These RFCs often qualify behavior using terms such as conflicts. These RFCs often qualify behavior using terms such as
MUST, SHOULD, MAY, and others written capitalized. See RFC 2119 MUST, SHOULD, MAY, and others written capitalized. See RFC 2119
for the exact interpretation of these terms. for the exact interpretation of these terms.
Trace file demonstrating the problem Trace file demonstrating the problem
One or more ASCII trace files demonstrating the problem, if One or more ASCII trace files demonstrating the problem, if
applicable. applicable.
Trace file demonstrating correct behavior Trace file demonstrating correct behavior
One or more examples of how correct behavior appears in a trace, if One or more examples of how correct behavior appears in a trace, if
applicable. applicable.
References References
References that further discuss the problem. References that further discuss the problem.
skipping to change at page 4, line 5 skipping to change at page 4, line 5
Non-interoperation -- connectivity failure Non-interoperation -- connectivity failure
Description Description
A host performs Path MTU Discovery by sending out as large a packet A host performs Path MTU Discovery by sending out as large a packet
as possible, with the Don't Fragment (DF) bit set in the IP header. as possible, with the Don't Fragment (DF) bit set in the IP header.
If the packet is too large for a router to forward on to a If the packet is too large for a router to forward on to a
particular link, the router must send an ICMP Destination particular link, the router must send an ICMP Destination
Unreachable -- Fragmentation Needed message to the source address. Unreachable -- Fragmentation Needed message to the source address.
The host then adjusts the packet size based on the ICMP message. The host then adjusts the packet size based on the ICMP message.
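The adjustment step described above can be sketched as follows. This is a minimal illustration, not taken from any implementation: the function name adjust_path_mtu and its parameters are hypothetical, and the sketch assumes the ICMP message carries a usable next-hop MTU value.

```python
IPV4_MIN_MTU = 68  # minimum IPv4 MTU, as noted later in this memo


def adjust_path_mtu(current_pmtu: int, next_hop_mtu: int) -> int:
    """Return the new PMTU estimate after a Fragmentation Needed message.

    next_hop_mtu is the MTU reported in the ICMP message. A report of 0
    (a router that does not fill in the field) is not handled here.
    """
    if next_hop_mtu <= 0:
        raise ValueError("ICMP message did not report a usable next-hop MTU")
    # Never raise the estimate on an ICMP error, and never go below the floor.
    return max(IPV4_MIN_MTU, min(current_pmtu, next_hop_mtu))


if __name__ == "__main__":
    # Hypothetical starting estimate of 4352 bytes; the router reports 1500.
    print(adjust_path_mtu(4352, 1500))
```

The sketch only ever lowers the estimate. Probing for a larger PMTU later is a separate mechanism and is not shown.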
As was pointed out in [RFC1435], routers don't always do this As was pointed out in [RFC1435], routers don't always do this
correctly -- many routers fail to send the ICMP messages, for a correctly -- many routers fail to send the ICMP messages, for a
variety of reasons ranging from kernel bugs to configuration variety of reasons ranging from kernel bugs to configuration
problems. Firewalls are often misconfigured to suppress all ICMP problems. Firewalls are often misconfigured to suppress all ICMP
messages. IPsec [RFC2401] and IP-in-IP [RFC2003] tunnels shouldn't messages. IPsec [RFC2401] and IP-in-IP [RFC2003] tunnels shouldn't
cause these sorts of problems, if the implementations follow the cause these sorts of problems, if the implementations follow the
advice in the appropriate documents. advice in the appropriate documents.
PMTUD, as documented in [RFC1191], fails when the appropriate ICMP PMTUD, as documented in [RFC1191], fails when the appropriate ICMP
messages are not received by the originating host. The upper-layer messages are not received by the originating host. The upper-layer
skipping to change at page 5, line 4 skipping to change at page 5, line 4
Relevant RFCs Relevant RFCs
RFC1191 describes Path MTU Discovery. RFC 1435 provides an early RFC1191 describes Path MTU Discovery. RFC 1435 provides an early
description of these sorts of problems. description of these sorts of problems.
Trace file demonstrating the problem Trace file demonstrating the problem
Made using tcpdump [Jacobson89] recording at an intermediate host. Made using tcpdump [Jacobson89] recording at an intermediate host.
20:12:11.951321 A > B: S 1748427200:1748427200(0) 20:12:11.951321 A > B: S 1748427200:1748427200(0)
win 49152 <mss 1460> win 49152 <mss 1460>
20:12:11.951829 B > A: S 1001927984:1001927984(0) 20:12:11.951829 B > A: S 1001927984:1001927984(0)
ack 1748427201 win 16384 <mss 65240> ack 1748427201 win 16384 <mss 65240>
20:12:11.955230 A > B: . ack 1 win 49152 (DF) 20:12:11.955230 A > B: . ack 1 win 49152 (DF)
20:12:11.959099 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:12:11.959099 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
20:12:13.139074 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:12:13.139074 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
20:12:16.188685 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:12:16.188685 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
20:12:22.290483 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:12:22.290483 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
20:12:34.491856 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:12:34.491856 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
20:12:58.896405 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:12:58.896405 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
20:13:47.703184 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:13:47.703184 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
20:14:52.780640 A > B: . 1:1461(1460) ack 1 win 49152 (DF) 20:14:52.780640 A > B: . 1:1461(1460) ack 1 win 49152 (DF)
skipping to change at page 6, line 4 skipping to change at page 6, line 4
16:48:45.786814 A > B: . 1:1461(1460) ack 1 win 8760 (DF) 16:48:45.786814 A > B: . 1:1461(1460) ack 1 win 8760 (DF)
16:48:51.794676 A > B: . 1:1461(1460) ack 1 win 8760 (DF) 16:48:51.794676 A > B: . 1:1461(1460) ack 1 win 8760 (DF)
16:49:03.808912 A > B: . 1:537(536) ack 1 win 8760 16:49:03.808912 A > B: . 1:537(536) ack 1 win 8760
16:49:04.016476 B > A: . ack 537 win 16384 16:49:04.016476 B > A: . ack 537 win 16384
16:49:04.021245 A > B: . 537:1073(536) ack 1 win 8760 16:49:04.021245 A > B: . 537:1073(536) ack 1 win 8760
16:49:04.021697 A > B: . 1073:1609(536) ack 1 win 8760 16:49:04.021697 A > B: . 1073:1609(536) ack 1 win 8760
16:49:04.120694 B > A: . ack 1609 win 16384 16:49:04.120694 B > A: . ack 1609 win 16384
16:49:04.126142 A > B: . 1609:2145(536) ack 1 win 8760 16:49:04.126142 A > B: . 1609:2145(536) ack 1 win 8760
In this case, the sender sees four packets fail to traverse the In this case, the sender sees four packets fail to traverse the
network (using a two-packet initial send window) and turns off network (using a two-packet initial send window) and turns off
PMTUD. All subsequent packets have the DF flag turned off, and the PMTUD. All subsequent packets have the DF flag turned off, and the
size set to the default value of 536 [RFC1122]. size set to the default value of 536 [RFC1122].
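The fallback visible in this trace might be modelled roughly as below. This is a sketch under stated assumptions: the class PmtudSender, its fields, and the threshold of four timeouts are hypothetical and chosen only to mirror the trace; real stacks may count failures differently.

```python
from dataclasses import dataclass

DEFAULT_MSS = 536  # default segment size cited from RFC 1122 in the text above


@dataclass
class PmtudSender:
    mss: int                    # segment size currently in use
    df: bool = True             # whether DF is set on outgoing packets
    failure_threshold: int = 4  # hypothetical: timeouts tolerated before fallback
    timeouts: int = 0           # consecutive retransmission timeouts seen

    def on_timeout(self) -> None:
        """Record a timeout; turn off PMTUD if a black hole is suspected."""
        self.timeouts += 1
        if self.df and self.timeouts >= self.failure_threshold:
            self.df = False                        # stop setting DF
            self.mss = min(self.mss, DEFAULT_MSS)  # fall back to 536

    def on_ack(self) -> None:
        """Forward progress resets the failure count."""
        self.timeouts = 0
```

A sender started with mss=1460 keeps DF set through three timeouts and switches to 536-byte segments without DF on the fourth.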
References References
This problem has been discussed extensively on the tcp-impl mailing This problem has been discussed extensively on the tcp-impl mailing
list; the name "black hole" has been in use for many years. list; the name "black hole" has been in use for many years.
How to detect How to detect
skipping to change at page 7, line 5 skipping to change at page 7, line 5
all times. Fragmentation is not allowed in routers, only at the all times. Fragmentation is not allowed in routers, only at the
originating host. Fortunately, the minimum supported MTU for IPv6 originating host. Fortunately, the minimum supported MTU for IPv6
is 1280 octets, which is significantly larger than the 68 octet is 1280 octets, which is significantly larger than the 68 octet
minimum in IPv4. This should make it more reasonable for IPv6 TCP minimum in IPv4. This should make it more reasonable for IPv6 TCP
implementations to fall back to 1280 octet packets, when IPv4 implementations to fall back to 1280 octet packets, when IPv4
implementations will probably have to turn off DF to respond to implementations will probably have to turn off DF to respond to
black hole detection. black hole detection.
Ideally, the ICMP black holes should be fixed when they are found. Ideally, the ICMP black holes should be fixed when they are found.
If hosts start to implement black hole detection, it may be that If hosts start to implement black hole detection, it may be that
these problems will go unnoticed and unfixed. This is especially these problems will go unnoticed and unfixed. This is especially
unfortunate, since detection can take several seconds each time, unfortunate, since detection can take several seconds each time,
and these delays could result in a significant, hidden degradation and these delays could result in a significant, hidden degradation
of performance. Hosts that implement black hole detection should of performance. Hosts that implement black hole detection should
probably log detected black holes, so that they can be fixed. probably log detected black holes, so that they can be fixed.
3.2. 3.2.
Name of Problem Name of Problem
Stretch ACK due to PMTUD Stretch ACK due to PMTUD
Classification Classification
Congestion Control / Performance Congestion Control / Performance
Description Description
When a naively implemented TCP stack communicates with a PMTUD- When a naively implemented TCP stack communicates with a PMTUD
equipped stack, it will try to generate an ACK for every second equipped stack, it will try to generate an ACK for every second
full-sized segment. If it determines the full-sized segment based full-sized segment. If it determines the full-sized segment based
on the advertised MSS, this can degrade badly in the face of PMTUD. on the advertised MSS, this can degrade badly in the face of PMTUD.
The PMTU can wind up being a small fraction of the advertised MSS; The PMTU can wind up being a small fraction of the advertised MSS;
in this case, an ACK would be generated only very infrequently. in this case, an ACK would be generated only very infrequently.
Significance Significance
Stretch ACKs have a variety of unfortunate effects, more fully Stretch ACKs have a variety of unfortunate effects, more fully
skipping to change at page 8, line 5 skipping to change at page 8, line 5
Implications Implications
The complete implications of stretch ACKs are outlined in The complete implications of stretch ACKs are outlined in
[RFC2525]. [RFC2525].
Relevant RFCs Relevant RFCs
RFC 1122 outlines the requirements for frequency of ACK generation. RFC 1122 outlines the requirements for frequency of ACK generation.
[RFC2581] expands on this and clarifies that delayed ACK is a [RFC2581] expands on this and clarifies that delayed ACK is a
SHOULD, not a MUST. SHOULD, not a MUST.
Trace file demonstrating it Trace file demonstrating it
Made using tcpdump recording at an intermediate host. The Made using tcpdump recording at an intermediate host. The
timestamp options from all but the first two packets have been timestamp options from all but the first two packets have been
removed for clarity. removed for clarity.
18:16:52.976657 A > B: S 3183102292:3183102292(0) win 16384 18:16:52.976657 A > B: S 3183102292:3183102292(0) win 16384
<mss 4312,nop,wscale 0,nop,nop,timestamp 12128 0> (DF) () <mss 4312,nop,wscale 0,nop,nop,timestamp 12128 0> (DF)
18:16:52.979580 B > A: S 2022212745:2022212745(0) ack 3183102293 win 49152 18:16:52.979580 B > A: S 2022212745:2022212745(0) ack 3183102293 win 49152
<mss 4312,nop,wscale 1,nop,nop,timestamp 1592957 12128> (DF) () <mss 4312,nop,wscale 1,nop,nop,timestamp 1592957 12128> (DF)
18:16:52.979738 A > B: . ack 1 win 17248 (DF) () 18:16:52.979738 A > B: . ack 1 win 17248 (DF)
18:16:52.982473 A > B: . 1:4301(4300) ack 1 win 17248 (DF) () 18:16:52.982473 A > B: . 1:4301(4300) ack 1 win 17248 (DF)
18:16:52.982557 C > A: icmp: B unreachable - 18:16:52.982557 C > A: icmp: B unreachable -
need to frag (mtu 1500)! (DF) () need to frag (mtu 1500)! (DF)
18:16:52.985839 B > A: . ack 1 win 32768 (DF) () 18:16:52.985839 B > A: . ack 1 win 32768 (DF)
18:16:54.129928 A > B: . 1:1449(1448) ack 1 win 17248 (DF) () 18:16:54.129928 A > B: . 1:1449(1448) ack 1 win 17248 (DF)
. .
. .
. .
18:16:58.507078 A > B: . 1463941:1465389(1448) ack 1 win 17248 (DF) () 18:16:58.507078 A > B: . 1463941:1465389(1448) ack 1 win 17248 (DF)
18:16:58.507200 A > B: . 1465389:1466837(1448) ack 1 win 17248 (DF) () 18:16:58.507200 A > B: . 1465389:1466837(1448) ack 1 win 17248 (DF)
18:16:58.507326 A > B: . 1466837:1468285(1448) ack 1 win 17248 (DF) () 18:16:58.507326 A > B: . 1466837:1468285(1448) ack 1 win 17248 (DF)
18:16:58.507439 A > B: . 1468285:1469733(1448) ack 1 win 17248 (DF) () 18:16:58.507439 A > B: . 1468285:1469733(1448) ack 1 win 17248 (DF)
18:16:58.524763 B > A: . ack 1452357 win 32768 (DF) () 18:16:58.524763 B > A: . ack 1452357 win 32768 (DF)
18:16:58.524986 B > A: . ack 1461045 win 32768 (DF) () 18:16:58.524986 B > A: . ack 1461045 win 32768 (DF)
18:16:58.525138 A > B: . 1469733:1471181(1448) ack 1 win 17248 (DF) () 18:16:58.525138 A > B: . 1469733:1471181(1448) ack 1 win 17248 (DF)
18:16:58.525268 A > B: . 1471181:1472629(1448) ack 1 win 17248 (DF) () 18:16:58.525268 A > B: . 1471181:1472629(1448) ack 1 win 17248 (DF)
18:16:58.525393 A > B: . 1472629:1474077(1448) ack 1 win 17248 (DF) () 18:16:58.525393 A > B: . 1472629:1474077(1448) ack 1 win 17248 (DF)
18:16:58.525516 A > B: . 1474077:1475525(1448) ack 1 win 17248 (DF) () 18:16:58.525516 A > B: . 1474077:1475525(1448) ack 1 win 17248 (DF)
18:16:58.525642 A > B: . 1475525:1476973(1448) ack 1 win 17248 (DF) () 18:16:58.525642 A > B: . 1475525:1476973(1448) ack 1 win 17248 (DF)
18:16:58.525766 A > B: . 1476973:1478421(1448) ack 1 win 17248 (DF) () 18:16:58.525766 A > B: . 1476973:1478421(1448) ack 1 win 17248 (DF)
18:16:58.526063 A > B: . 1478421:1479869(1448) ack 1 win 17248 (DF) () 18:16:58.526063 A > B: . 1478421:1479869(1448) ack 1 win 17248 (DF)
18:16:58.526187 A > B: . 1479869:1481317(1448) ack 1 win 17248 (DF) () 18:16:58.526187 A > B: . 1479869:1481317(1448) ack 1 win 17248 (DF)
18:16:58.526310 A > B: . 1481317:1482765(1448) ack 1 win 17248 (DF) () 18:16:58.526310 A > B: . 1481317:1482765(1448) ack 1 win 17248 (DF)
18:16:58.526432 A > B: . 1482765:1484213(1448) ack 1 win 17248 (DF) () 18:16:58.526432 A > B: . 1482765:1484213(1448) ack 1 win 17248 (DF)
18:16:58.526561 A > B: . 1484213:1485661(1448) ack 1 win 17248 (DF) () 18:16:58.526561 A > B: . 1484213:1485661(1448) ack 1 win 17248 (DF)
18:16:58.526671 A > B: . 1485661:1487109(1448) ack 1 win 17248 (DF) () 18:16:58.526671 A > B: . 1485661:1487109(1448) ack 1 win 17248 (DF)
18:16:58.537944 B > A: . ack 1478421 win 32768 (DF) () 18:16:58.537944 B > A: . ack 1478421 win 32768 (DF)
18:16:58.538328 A > B: . 1487109:1488557(1448) ack 1 win 17248 (DF) () 18:16:58.538328 A > B: . 1487109:1488557(1448) ack 1 win 17248 (DF)
Note that the interval between ACKs is significantly larger than Note that the interval between ACKs is significantly larger than
two times the segment size; it works out to be almost exactly two two times the segment size; it works out to be almost exactly two
times the advertised MSS. This transfer was long enough that it times the advertised MSS. This transfer was long enough that it
could be verified that the stretch ACK was not the result of lost could be verified that the stretch ACK was not the result of lost
ACK packets. ACK packets.
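The arithmetic behind this observation can be checked with a short calculation. The model is an assumption: the receiver is taken to send an ACK once it has accumulated at least two "full-sized" segments' worth of data, where full-sized is measured by the advertised MSS. The function name segments_per_ack is hypothetical.

```python
import math


def segments_per_ack(advertised_mss: int, segment_size: int) -> int:
    """Segments that must arrive before an ACK is sent, if the receiver
    waits for 2 * advertised_mss bytes rather than two actual segments."""
    return math.ceil(2 * advertised_mss / segment_size)


if __name__ == "__main__":
    advertised_mss = 4312  # from the SYN packets in the trace
    segment_size = 1448    # data segment size after PMTUD
    n = segments_per_ack(advertised_mss, segment_size)
    print(n, n * segment_size)
```

With the trace values this gives 6 segments, or 8688 bytes, per ACK. That matches the gap between the ACKs for 1452357 and 1461045 in the trace (8688 bytes), against 2 * 4312 = 8624.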
Trace file demonstrating correct behavior Trace file demonstrating correct behavior
Made using tcpdump recording at an intermediate host. The Made using tcpdump recording at an intermediate host. The
timestamp options from all but the first two packets have been timestamp options from all but the first two packets have been
removed for clarity. removed for clarity.
18:13:32.287965 A > B: S 2972697496:2972697496(0) 18:13:32.287965 A > B: S 2972697496:2972697496(0)
win 16384 <mss 4312,nop,wscale 0,nop,nop,timestamp 11326 0> (DF) win 16384 <mss 4312,nop,wscale 0,nop,nop,timestamp 11326 0> (DF)
18:13:32.290785 B > A: S 245639054:245639054(0) 18:13:32.290785 B > A: S 245639054:245639054(0)
ack 2972697497 win 34496 <mss 4312> (DF) ack 2972697497 win 34496 <mss 4312> (DF)
skipping to change at page 10, line 5 skipping to change at page 10, line 5
18:13:35.616349 B > A: . ack 1508181 win 64680 (DF) 18:13:35.616349 B > A: . ack 1508181 win 64680 (DF)
18:13:35.616448 A > B: . 1530081:1531541(1460) ack 1 win 17248 (DF) 18:13:35.616448 A > B: . 1530081:1531541(1460) ack 1 win 17248 (DF)
18:13:35.616565 A > B: . 1531541:1533001(1460) ack 1 win 17248 (DF) 18:13:35.616565 A > B: . 1531541:1533001(1460) ack 1 win 17248 (DF)
18:13:35.616891 A > B: . 1533001:1534461(1460) ack 1 win 17248 (DF) 18:13:35.616891 A > B: . 1533001:1534461(1460) ack 1 win 17248 (DF)
In this trace, an ACK is generated for every two segments that In this trace, an ACK is generated for every two segments that
arrive. (The segment size is slightly larger in this trace, even arrive. (The segment size is slightly larger in this trace, even
though the source hosts are the same, because of the lack of though the source hosts are the same, because of the lack of
timestamp options in this trace.) timestamp options in this trace.)
How to detect How to detect
This condition can be observed in a packet trace when the This condition can be observed in a packet trace when the
advertised MSS is significantly larger than the actual PMTU of a advertised MSS is significantly larger than the actual PMTU of a
connection. connection.
How to fix How to fix
Several solutions for this problem have been proposed: Several solutions for this problem have been proposed:
A simple solution is to ACK every other packet, regardless of size. A simple solution is to ACK every other packet, regardless of size.
This has the drawback of generating large numbers of ACKs in the This has the drawback of generating large numbers of ACKs in the
face of lots of very small packets; this shows up with face of lots of very small packets; this shows up with
applications like the X Window System. applications like the X Window System.
A slightly more complex solution would monitor the size of incoming A slightly more complex solution would monitor the size of incoming
segments and try to determine what segment size the sender is segments and try to determine what segment size the sender is
using. This requires slightly more state in the receiver, but has using. This requires slightly more state in the receiver, but has
the advantage of making receiver silly window syndrome avoidance the advantage of making receiver silly window syndrome avoidance
computations more accurate. computations more accurate [RFC813].
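The second approach could look something like the sketch below. It is illustrative only: the class name SegmentSizeEstimator and the heuristic of using the largest segment seen so far as the estimate are assumptions, not a method described in this memo. A delayed ACK timer, which a real receiver would still need, is omitted.

```python
class SegmentSizeEstimator:
    """Receiver-side ACK policy based on observed, not advertised, size."""

    def __init__(self) -> None:
        self.estimate = 0  # largest data segment seen so far (hypothetical heuristic)
        self.unacked = 0   # bytes received since the last ACK

    def on_segment(self, length: int) -> bool:
        """Account for one data segment; return True if an ACK should be sent."""
        if length <= 0:
            return False  # pure ACKs carry no data and do not count
        self.estimate = max(self.estimate, length)
        self.unacked += length
        # ACK once two estimated full-sized segments have arrived.
        if self.unacked >= 2 * self.estimate:
            self.unacked = 0
            return True
        return False
```

Fed the 1448-byte segments from the trace above, this policy returns True on every second segment, regardless of the MSS that was advertised.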
3.3. 3.3.
Name of Problem Name of Problem
Determining MSS from PMTU Determining MSS from PMTU
Classification Classification
Performance Performance
Description Description
The MSS advertised at the start of a connection should be based on The MSS advertised at the start of a connection should be based on
the MTU of the interfaces on the system. Some systems use PMTUD the MTU of the interfaces on the system. (For efficiency and other
determined values to determine the MSS to advertise. reasons this may not be the largest MSS possible.) Some systems
use PMTUD determined values to determine the MSS to advertise.
This results in an advertised MSS that is smaller than the largest This results in an advertised MSS that is smaller than the largest
MTU the system can receive. MTU the system can receive.
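The distinction can be sketched as two separate calculations. The function names are hypothetical, and the sketch assumes IPv4 and TCP headers without options (20 octets each), following the MSS discussion in RFC 879.

```python
IP_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20  # TCP header without options


def advertised_mss(interface_mtu: int) -> int:
    """MSS to advertise: derived from the local interface MTU only."""
    return interface_mtu - IP_HEADER - TCP_HEADER


def sending_segment_size(peer_mss: int, path_mtu: int) -> int:
    """Segment size to use when sending: limited by the peer's MSS and
    by the PMTU currently known for this direction of the path."""
    return min(peer_mss, path_mtu - IP_HEADER - TCP_HEADER)


if __name__ == "__main__":
    print(advertised_mss(1500))               # local interface MTU of 1500
    print(sending_segment_size(65240, 1024))  # peer MSS 65240, PMTU 1024
```

The discovered PMTU appears only in sending_segment_size. Feeding it into advertised_mss is the behavior this section describes as a problem.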
Significance Significance
The advertised MSS is an indication to the remote system about the The advertised MSS is an indication to the remote system about the
largest TCP segment that can be received [RFC879]. If this value largest TCP segment that can be received [RFC879]. If this value
is too small, the remote system will be forced to use a smaller is too small, the remote system will be forced to use a smaller
segment size when sending, purely because the local system found a segment size when sending, purely because the local system found a
particular PMTU earlier. particular PMTU earlier.
Given the asymmetric nature of many routes on the Internet Given the asymmetric nature of many routes on the Internet
[Paxson97], it seems entirely possible that the return PMTU is [Paxson97], it seems entirely possible that the return PMTU is
different from the sending PMTU. Limiting the segment size in this different from the sending PMTU. Limiting the segment size in this
way can reduce performance and frustrate the PMTUD algorithm. way can reduce performance and frustrate the PMTUD algorithm.
Even if the route was symmetric, setting this artificially lowered Even if the route was symmetric, setting this artificially lowered
limit on segment size will make it impossible to probe later to limit on segment size will make it impossible to probe later to
determine if the PMTU has changed. determine if the PMTU has changed.
Implications Implications
The whole point of PMTUD is to send as large a segment as possible. The whole point of PMTUD is to send as large a segment as possible.
If long-running connections cannot successfully probe for larger If long-running connections cannot successfully probe for larger
PMTU, then potential performance gains will be impossible to PMTU, then potential performance gains will be impossible to
realize. This destroys the whole point of PMTUD. realize. This destroys the whole point of PMTUD.
Relevant RFCs Relevant RFCs
RFC 1191. [RFC897] provides a complete discussion of MSS RFC 1191. [RFC879] provides a complete discussion of MSS
calculations and appropriate values. Note that this practice does calculations and appropriate values. Note that this practice does
not violate any of the specifications in these RFCs. not violate any of the specifications in these RFCs.
Trace file demonstrating it Trace file demonstrating it
This trace was made using tcpdump running on an intermediate host. This trace was made using tcpdump running on an intermediate host.
Host A initiates two separate consecutive connections, A1 and A2, Host A initiates two separate consecutive connections, A1 and A2,
to host B. Router C is the location of the MTU bottleneck. As to host B. Router C is the location of the MTU bottleneck. As
usual, TCP options are removed from all non-SYN packets. usual, TCP options are removed from all non-SYN packets.
22:33:32.305912 A1 > B: S 1523306220:1523306220(0) 22:33:32.305912 A1 > B: S 1523306220:1523306220(0)
skipping to change at page 11, line 48 skipping to change at page 12, line 4
22:33:32.323569 C > A1: icmp: 129.99.238.5 unreachable - 22:33:32.323569 C > A1: icmp: 129.99.238.5 unreachable -
need to frag (mtu 1024) (DF) (ttl 255, id 20666) need to frag (mtu 1024) (DF) (ttl 255, id 20666)
22:33:32.783694 A1 > B: . 1:985(984) ack 1 win 8856 (DF) 22:33:32.783694 A1 > B: . 1:985(984) ack 1 win 8856 (DF)
22:33:32.840817 B > A1: . ack 985 win 16384 22:33:32.840817 B > A1: . ack 985 win 16384
22:33:32.845651 A1 > B: . 1461:2445(984) ack 1 win 8856 (DF) 22:33:32.845651 A1 > B: . 1461:2445(984) ack 1 win 8856 (DF)
22:33:32.846094 B > A1: . ack 985 win 16384 22:33:32.846094 B > A1: . ack 985 win 16384
22:33:33.724392 A1 > B: . 985:1969(984) ack 1 win 8856 (DF) 22:33:33.724392 A1 > B: . 985:1969(984) ack 1 win 8856 (DF)
22:33:33.724893 B > A1: . ack 2445 win 14924 22:33:33.724893 B > A1: . ack 2445 win 14924
22:33:33.728591 A1 > B: . 2445:2921(476) ack 1 win 8856 (DF) 22:33:33.728591 A1 > B: . 2445:2921(476) ack 1 win 8856 (DF)
22:33:33.729161 A1 > B: . ack 1 win 8856 (DF) 22:33:33.729161 A1 > B: . ack 1 win 8856 (DF)
22:33:33.840758 B > A1: . ack 2921 win 16384 22:33:33.840758 B > A1: . ack 2921 win 16384
[...] [...]
22:33:34.238659 A1 > B: F 7301:8193(892) ack 1 win 8856 (DF) 22:33:34.238659 A1 > B: F 7301:8193(892) ack 1 win 8856 (DF)
22:33:34.239036 B > A1: . ack 8194 win 15492 22:33:34.239036 B > A1: . ack 8194 win 15492
22:33:34.239303 B > A1: F 1:1(0) ack 8194 win 16384 22:33:34.239303 B > A1: F 1:1(0) ack 8194 win 16384
22:33:34.242971 A1 > B: . ack 2 win 8856 (DF) 22:33:34.242971 A1 > B: . ack 2 win 8856 (DF)
22:33:34.454218 A2 > B: S 1523591299:1523591299(0) 22:33:34.454218 A2 > B: S 1523591299:1523591299(0)
win 8856 <mss 984> (DF) win 8856 <mss 984> (DF)
skipping to change at page 12, line 49 skipping to change at page 13, line 4
22:36:58.848058 A1 > B: . ack 1 win 32768 (DF) 22:36:58.848058 A1 > B: . ack 1 win 32768 (DF)
22:36:58.851514 A1 > B: P 1:1025(1024) ack 1 win 32768 (DF) 22:36:58.851514 A1 > B: P 1:1025(1024) ack 1 win 32768 (DF)
22:36:58.851584 C > A1: icmp: 129.99.238.5 unreachable - 22:36:58.851584 C > A1: icmp: 129.99.238.5 unreachable -
need to frag (mtu 1024) (DF) need to frag (mtu 1024) (DF)
22:36:58.855885 A1 > B: . 1:969(968) ack 1 win 32768 (DF) 22:36:58.855885 A1 > B: . 1:969(968) ack 1 win 32768 (DF)
22:36:58.856378 A1 > B: . 969:985(16) ack 1 win 32768 (DF) 22:36:58.856378 A1 > B: . 969:985(16) ack 1 win 32768 (DF)
22:36:59.036309 B > A1: . ack 985 win 16384 22:36:59.036309 B > A1: . ack 985 win 16384
22:36:59.039255 A1 > B: FP 985:1025(40) ack 1 win 32768 (DF) 22:36:59.039255 A1 > B: FP 985:1025(40) ack 1 win 32768 (DF)
22:36:59.039623 B > A1: . ack 1026 win 16344 22:36:59.039623 B > A1: . ack 1026 win 16344
22:36:59.039828 B > A1: F 1:1(0) ack 1026 win 16384 22:36:59.039828 B > A1: F 1:1(0) ack 1026 win 16384
22:36:59.043037 A1 > B: . ack 2 win 32768 (DF) 22:36:59.043037 A1 > B: . ack 2 win 32768 (DF)
22:37:01.436032 A2 > B: S 3404812097:3404812097(0) win 32768 22:37:01.436032 A2 > B: S 3404812097:3404812097(0) win 32768
<mss 4312,wscale 0,nop,timestamp 1123372916 0, <mss 4312,wscale 0,nop,timestamp 1123372916 0,
echo 1123372916> (DF) echo 1123372916> (DF)
22:37:01.436424 B > A2: S 949814769:949814769(0) 22:37:01.436424 B > A2: S 949814769:949814769(0)
ack 3404812098 win 16384 ack 3404812098 win 16384
<mss 65240,nop,wscale 0,nop,nop,timestamp 429562 1123372916> <mss 65240,nop,wscale 0,nop,nop,timestamp 429562 1123372916>
22:37:01.440147 A2 > B: . ack 1 win 32768 (DF) 22:37:01.440147 A2 > B: . ack 1 win 32768 (DF)
22:37:01.442736 A2 > B: . 1:969(968) ack 1 win 32768 (DF) 22:37:01.442736 A2 > B: . 1:969(968) ack 1 win 32768 (DF)
22:37:01.442894 A2 > B: P 969:985(16) ack 1 win 32768 (DF) 22:37:01.442894 A2 > B: P 969:985(16) ack 1 win 32768 (DF)
skipping to change at page 14, line 5 skipping to change at page 14, line 5
The one security concern raised by this memo is that ICMP black holes The one security concern raised by this memo is that ICMP black holes
are often caused by over-zealous security administrators who block are often caused by over-zealous security administrators who block
all ICMP messages. It is vitally important that those who design and all ICMP messages. It is vitally important that those who design and
deploy security systems understand the impact of strict filtering on deploy security systems understand the impact of strict filtering on
upper-layer protocols. The safest web site in the world is worthless upper-layer protocols. The safest web site in the world is worthless
if most TCP implementations cannot transfer data from it. It would if most TCP implementations cannot transfer data from it. It would
be far nicer to have all of the black holes fixed rather than fixing be far nicer to have all of the black holes fixed rather than fixing
all of the TCP implementations. all of the TCP implementations.
5. Acknowledgements 5. Acknowledgements
Thanks to Mark Allman, Vern Paxson, and Jamshid Mahdavi for generous Thanks to Mark Allman, Vern Paxson, and Jamshid Mahdavi for generous
help reviewing the document, and to Matt Mathis for early suggestions help reviewing the document, and to Matt Mathis for early suggestions
of various mechanisms that can cause PMTUD black holes, as well as of various mechanisms that can cause PMTUD black holes, as well as
review. The structure for describing TCP problems, and the early review. The structure for describing TCP problems, and the early
description of that structure is from [RFC2525]. Special thanks to description of that structure is from [RFC2525]. Special thanks to
Amy Bock, who helped perform the PMTUD tests which discovered these Amy Bock, who helped perform the PMTUD tests which discovered these
bugs. bugs.
6. References 6. References
[RFC2581] [RFC2581]
M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control", M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control",
April 1999. April 1999.
[RFC1122] [RFC1122]
R. Braden, Editor, "Requirements for Internet Hosts -- R. Braden, Editor, "Requirements for Internet Hosts --
Communication Layers," Oct. 1989. Communication Layers," Oct. 1989.
[RFC813]
D. Clark, "Window and Acknowledgement Strategy in TCP," July 1982.
[Jacobson89] [Jacobson89]
V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via V. Jacobson, C. Leres, and S. McCanne, tcpdump, available via
anonymous ftp to ftp.ee.lbl.gov, Jun. 1989. anonymous ftp to ftp.ee.lbl.gov, Jun. 1989.
[RFC1435] [RFC1435]
S. Knowles, "IESG Advice from Experience with Path MTU Discovery," S. Knowles, "IESG Advice from Experience with Path MTU Discovery,"
March 1993. March 1993.
[RFC1191] [RFC1191]
J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990. J. Mogul and S. Deering, "Path MTU discovery," Nov. 1990.
skipping to change at page 14, line 46 skipping to change at page 15, line 4
[RFC1981] [RFC1981]
J. McCann, S. Deering & J. Mogul, "Path MTU Discovery for IP J. McCann, S. Deering & J. Mogul, "Path MTU Discovery for IP
version 6", August 1996. version 6", August 1996.
[Paxson96] [Paxson96]
V. Paxson, "End-to-End Routing Behavior in the Internet", IEEE/ACM V. Paxson, "End-to-End Routing Behavior in the Internet", IEEE/ACM
Transactions on Networking (5), pp. 601-615, Oct. 1997. Transactions on Networking (5), pp. 601-615, Oct. 1997.
[RFC2525] [RFC2525]
V. Paxson, Editor, M. Allman, S. Dawson, W. Fenner, J. Griner, I. V. Paxson, Editor, M. Allman, S. Dawson, W. Fenner, J. Griner, I.
Heavens, K. Lahey, J. Semke, and B. Volz, "Known TCP Implementation Heavens, K. Lahey, J. Semke, and B. Volz, "Known TCP Implementation
Problems", March 1999. Problems", March 1999.
[RFC879] [RFC879]
J. Postel, "The TCP Maximum Segment Size and Related Topics," J. Postel, "The TCP Maximum Segment Size and Related Topics,"
November, 1983. November, 1983.
[RFC2001] [RFC2001]
W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit, W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast Retransmit,
and Fast Recovery Algorithms," Jan. 1997. and Fast Recovery Algorithms," Jan. 1997.
6.1. Author's Address 6.1. Author's Address
Kevin Lahey <kml@logictier.com> Kevin Lahey <kml@logictier.com>
LogicTier, Inc. LogicTier, Inc.
Suite 100 Suite 100
2 Waters Park Drive 2 Waters Park Drive
San Mateo, CA 94403 San Mateo, CA 94403
USA USA
Phone: +1 650/678-7033 Phone: +1 650/678-7033
7. Full Copyright Statement 7. Full Copyright Statement
Copyright (C) The Internet Society (1999). All Rights Reserved. Copyright (C) The Internet Society (1999). All Rights Reserved.
skipping to change at page 15, line 47 skipping to change at page 16, line 4
copyrights defined in the Internet Standards process must be copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than followed, or as required to translate it into languages other than
English. English.
The limited permissions granted above are perpetual and will not be The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns. revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
This draft was created in May 2000. This draft was created in June 2000.
It expires in October 2000. It expires in November 2000.
 End of changes. 30 change blocks. 
37 lines changed or deleted 71 lines changed or added
