Path MTU Discovery (pmtud) Charter

2.8.11 Path MTU Discovery (pmtud)

Last Modified: 2003-08-20

Chair(s):

Matt Mathis <mathis@psc.edu>

Transport Area Director(s):

Allison Mankin <mankin@psg.com>
Jon Peterson <jon.peterson@neustar.biz>

Transport Area Advisor:

Allison Mankin <mankin@psg.com>

Mailing Lists:

General Discussion: mtu@psc.edu
To Subscribe: majordomo@psc.edu with
Archive: http://www.psc.edu/~mathis/MTU/mbox.txt

Description of Working Group:

The goal of the PMTUD working group is to specify a robust method for
determining the IP Maximum Transmission Unit supported over an
end-to-end path. This new method is expected to update most uses of
RFC1191 and RFC1981, the current standards track protocols for this
purpose. Various weaknesses in the current methods are documented in
RFC2923, and have proven to be a chronic impediment to the deployment
of new technologies that alter the path MTU, such as tunnels and new
types of link layers.

The proposed new method does not rely on ICMP or other messages from
the network. It finds the proper MTU by starting a connection using
relatively small packets (e.g. TCP segments) and searching upwards by
probing with progressively larger test packets (containing application
data). If a probe packet is successfully delivered, then the path MTU
is raised. The isolated loss of a probe packet (with or without an
ICMP "can't fragment" message) is treated as an indication of an MTU
limit, and not a congestion indicator.
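As a rough illustration of the upward search described above, the following Python sketch raises an MTU estimate while probes of each larger size are delivered and stops at the first lost probe. The function name, the `probe_ok` callback, the candidate sizes, and the simulated 1500-byte path are all assumptions made for illustration; they are not taken from the input draft.

```python
from typing import Callable, Sequence


def search_path_mtu(probe_ok: Callable[[int], bool],
                    candidates: Sequence[int],
                    start_mtu: int) -> int:
    """Raise the MTU estimate while probes of each larger size get through."""
    mtu = start_mtu
    for size in sorted(candidates):
        if size <= mtu:
            continue  # sizes at or below the current estimate need no probe
        if probe_ok(size):
            mtu = size  # probe delivered: raise the path MTU
        else:
            break  # isolated probe loss: treated as the MTU limit, not congestion
    return mtu


# Simulated path whose true MTU is 1500 bytes (hypothetical value).
path_limit = 1500
found = search_path_mtu(lambda size: size <= path_limit,
                        candidates=[1024, 1280, 1500, 4352, 9000],
                        start_mtu=576)
print(found)
```

A real packetization layer would send the probes as live data segments and would need the loss-handling rules that the working group intends to specify; the sketch only shows the search order.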

The working group will specify the method for use in TCP and SCTP, and
will outline what is necessary to support the method in transports
such as DCCP. It will particularly describe the precise conditions
under which lost packets are not treated as congestion indications.
The work will pay particular attention to details that affect
robustness and security.

Path MTU discovery has the potential to interact with many other parts
of the Internet, including all link, transport, encapsulation and
tunnel protocols. Therefore, this working group will particularly
encourage input from a wide cross section of the IETF to help to
maximize the robustness of path MTU discovery in the presence of
pathological behaviors from other components.

Input draft:

Packetization Layer Path MTU Discovery
draft-mathis-plpmtud-00.txt

Goals and Milestones:

Jul 03		Reorganized Internet-Draft. Solicit implementation and field experience.
Dec 03		Update Internet-Draft incorporating implementers' experience; actively solicit input from stakeholders - all communities that might be affected by changing PMTUD.
Feb 04		Submit completed Internet-Draft and a PMTUD MIB draft for Proposed Standard.

No Current Internet-Drafts

No Request For Comments

Current Meeting Report

Path Maximum Transmission Unit Discovery Pre-WG (pmtud)
Thursday, 17-July-2003 from 13:00 to 15:00

=======================================================


The meeting was chaired by Matt Mathis and Matt Zekauskas. Al Morton, 
Itojun Hagino and Matt Z. took notes that were assembled into these 
minutes by the chairs.


Agenda
------
* Preliminaries: Blue sheets, Note takers, etc
* WG Status
* Short history and work to date
* Robustness Issues
* Other Stakeholders
* Plans


Matt Mathis led off the meeting presenting the new co-chair, the agenda, the 
changes to the proposed charter, and the aggressive milestones.

The group's status is that some parts of the administrative 
preparation did not get done, but the IESG has approved the group, hence 
"Pre-WG".  Development will be fast; silence will be taken as acceptance 
(at the start, as sections are integrated).  PMTUD is a re-activation of 
the path MTU WG, which was a very similar effort.  To participate, you 
must subscribe to the IETF list (pmtud@ietf.org).  Matt Zekauskas 
volunteered to co-chair the group.  The charter was broadened so as not to 
restrict the work to a single method.  The milestones are aggressive, and 
the need for implementations to test is clear.

No one disagreed, nor were there suggestions for other methods to study.  
Matt Mathis will make another editorial pass at the document (which was a 
rewrite, starting from RFC 1981, instead of an update to the previous BOF 
input).  Sections will be added based on mailing list comments and any 
input from stakeholder communities.

Matt Mathis sketched the previous algorithm and noted some of the 
problems.  He then sketched the new algorithm, noting that there is just a 
small amount of MUST/SHOULD language: under what circumstances can losses be 
ignored as a congestion signal.  The rest is heuristics; it doesn't need to 
be the same for every application, and permits vendor diversity.

Question: when can the algorithm be used by TCP?  Just after 3-way 
handshake, or before real communication?  Matt responded that it uses live 
payload data, and the draft has a recommendation not to attempt the 
algorithm unless the congestion window is at least twenty packets, so the 
connection is well established before the algorithm starts.  Thus, this 
could slow down tiny files -- the exact algorithm is a heuristic, so you 
could choose to perform it differently.  There are tradeoffs.
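To give a feel for the tradeoff with tiny files, the sketch below counts how many packets an idealized slow start sends before the congestion window reaches the twenty-packet threshold mentioned above. The doubling-per-round-trip model, the initial window of 2 packets, and the function name are assumptions for illustration; real TCP stacks behave differently in detail.

```python
def packets_before_probing(initial_cwnd: int = 2, min_cwnd: int = 20) -> int:
    """Count packets sent in idealized slow start before cwnd >= min_cwnd."""
    if initial_cwnd < 1:
        raise ValueError("initial_cwnd must be at least 1 packet")
    cwnd, sent = initial_cwnd, 0
    while cwnd < min_cwnd:
        sent += cwnd  # one full window is sent per round trip
        cwnd *= 2     # idealized slow start: the window doubles each round trip
    return sent


# Packets delivered at the initial (small) size before probing may begin.
print(packets_before_probing())
```

Under these assumptions a transfer that fits in a few dozen small packets finishes before probing ever starts, which is the cost the minutes describe for tiny files.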

The request for collaborators in the IPsec & security area led to a big 
discussion on tunnel issues.  People were positive about the method, but 
there are corner cases to consider.  Input has been promised to the 
mailing list.  (In the IPv4 world, lack of PMTUD is noted as a major 
problem with IPsec VPNs and providing services.)

This discussion started around slide 11, "Plans for the Next Draft".  One 
group of potential collaborators is folks from the multicast area; one 
possibility is a generalization of the algorithm for reliable delivery.  
This would avoid the ICMP implosion problem that arises if the current MTU 
discovery technique is used.  Dave Thaler noted that the behavior as 
specified in IPv4 and IPv6 is different: in v6 you respond; in v4 you don't.

Another collaborating group would be IPsec; currently the security 
architecture document has major sections dealing with the 
interaction of MTU discovery and IPsec (because tunnels are created); the 
new technique might obsolete many of those sections.

Itojun Hagino noted that the interaction between IPsec and TCP 
depends on whether the TCP stack is aware of IPsec.  If the TCP stack does not 
take care of the IPsec header size, the algorithm would need to be 
revised.

Matt M. responded that the detail in the draft needs to be resolved in a 
consistent way.  You can count the IPsec header as part of the IP header or 
TCP header.  The really nasty cases involve additional layers; for 
example, with IPsec on a VPN, ICMP messages could go back to the wrong place.

Michael Richardson expanded on this as an IPsec implementer.  The worst 
case is common at meetings such as this -- you have a corporate address on 
your laptop, and a VPN back to the corporate space, so all traffic goes 
back to HQ.  Try to visit a bank, and they have ICMP filters.  Your 
gateway is sending out ICMP messages to the bank, and they drop them.  This 
proposed algorithm should work really well.  Many times VPNs are blamed 
(since they are the newest element in the path), when the problem is 
really a bad ICMP filter. 



However, there is a problem: if you raise the MTU and the tunnels do not 
toss large messages but fragment them anyway, you will end up always 
fragmenting.  Michael noted that his Linux implementation (he's the 
FreeS/WAN technical lead) did not honor the DF bit by default. Having a 
poor(er) performing implementation was better than one that didn't work at 
all.  ("Poor performance is better than no security.") Perhaps there could be 
a heuristic that worked for a short term solution so these mechanisms 
don't interact badly... the endpoint would need to be updated for this 
algorithm, so IPsec tunneling could be updated at the same time.  This 
behavior is often a kernel option, too.  [In reviewing the minutes, 
Michael related that:
"The key point is that I, the IPsec developer can't control:
     1) the ICMP filter.
     2) the TCP on the machine behind it.
I can have *some* influence on the TCP at the receiving end of the flow, but 
not a lot. *If* the IPsec tunnel terminates on the same machine as the TCP, 
then in theory, the TCP can learn about the reduced MTU, and set the MSS 
appropriately. In practice, probably only KAME, Microsoft and Sun are well 
enough integrated to do this right now. Probably Linux 2.6 will be able to do 
so as well.  We do have an option to hack the MSS on the 
encapsulator's side already, alas."]

Perhaps you could fragment into the tunnel but retain the DF bit, and if it 
is set, don't do anything weird.  Itojun related KAME experience; there they 
ended up not setting the DF bit on the output header when IPsec tunnels are 
created.

Another point was that IPv6-in-IPv4 tunnels have the same issue. IPv6 
tunnels should have an MTU of 1280 by default so a minimum MTU can be 
maintained.

Matt M. mentioned that he's aware that a large number of tunneling 
implementations don't copy the DF bit from inner packet to outer header.  
He's not yet sure if the document needs a specific section covering 
tunnels and tunnel migration; an intermediate ground that works is to let 
tunnels behave this way in the interim, and discourage a mode where end 
systems ignore "can't fragment" messages.

Dave Thaler noted that mobility might add additional headers; so a tunnel 
MTU of 1280 might not be enough; 1380 would be better.

Itojun stated that he was thinking of configured tunnels and not mobile 
IPv6.  If you send a packet with mobile headers, the TCP stack needs to be 
aware of the size of the mobile IP headers and reduce MSS 
appropriately -- maintain the total MTU size.
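The header arithmetic behind this point can be sketched as follows. The 40-byte IPv6 base header and the 20-byte option-free TCP header are standard sizes; the function name and the 24-byte figure for extra headers are hypothetical and chosen only to show the effect.

```python
IPV6_HEADER = 40  # bytes, fixed IPv6 base header
TCP_HEADER = 20   # bytes, TCP header without options


def tcp_mss(path_mtu: int, extra_headers: int = 0) -> int:
    """Largest TCP payload that keeps the whole packet within path_mtu."""
    mss = path_mtu - IPV6_HEADER - TCP_HEADER - extra_headers
    if mss <= 0:
        raise ValueError("headers leave no room for payload")
    return mss


print(tcp_mss(1280))      # plain IPv6 over a 1280-byte tunnel
print(tcp_mss(1280, 24))  # same tunnel with 24 bytes of hypothetical extra headers
```

If the TCP stack does not know about the extra headers, it will pick the larger value and the resulting packets will exceed the tunnel MTU.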

Matt Mathis noted that there was definitely a subgroup interested in 
considering tunneling and MTU discovery; he encouraged folks to join the 
mailing list and contribute the various circumstances where there are 
potential problems.

Lars Eggert mentioned that RFC2003 specifies some things related to MTU 
discovery, and RFC2401 specifically prohibits some of the mechanisms in 
RFC2003 for security reasons.  Joe Touch also has a relevant draft: 
draft-touch-ipsec-vpn.

As to other transport protocols, Matt Z. reported that he had quickly 
skimmed SCTP and DCCP documents, and that SCTP looked possible, but DCCP 
says specifically that the MTU can't be raised. No one who claimed to be an 
SCTP expert was in the room (or at least no one commented negatively on the 
applicability to SCTP). Eddie Kohler noted that this behavior was 
revised in the DCCP WG meeting this week.   Matt Z. prompted him to send 
some DCCP text.

Matt M. emphasized that the point with getting a draft done early is to 
encourage implementation as soon as possible.  The algorithm will use 
specific details of other protocols, and we're dependent on the 
uniformity of implementation of certain features.  We need to learn what 
implementations really do; ideally get a custom implementation run on 
servers and real field data to feed back into the document.

Matt M also noted two cases that he's worried about (although these are 
just examples; others are encouraged to consider other cases, or report 
back implementation experience).  First, what happens if a path is 
striped across multiple links, and the MTU is not the same across the 
stripes?  You can require that the MTU is not raised until a certain 
number of segments are received successfully.  You need to understand the 
interaction between random losses and whether the MTU is or is not 
raised.  Second, what happens if there is a parametric failure -- when 
raising the MTU causes the error rate to increase?  An actual case is one 
particular 10G GBIC; it was error free with 1500 byte packets, but not with 
9000 byte packets.  There is an opportunity for different heuristics 
here; for example, use a smaller MSS if you cannot fill a window.
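One way to picture the "certain number of segments" idea for striped paths is a small gate that raises the MTU only after several consecutive larger segments are delivered. This is a sketch under assumptions: the class name, the `required` threshold, and the reset-on-loss rule are illustrative choices, not behavior specified by the draft.

```python
from typing import Optional


class MtuRaiseGate:
    """Raise the MTU only after several consecutive larger segments arrive."""

    def __init__(self, mtu: int, required: int = 4) -> None:
        self.mtu = mtu
        self.required = required  # hypothetical threshold, not from the draft
        self.candidate: Optional[int] = None
        self.successes = 0

    def report(self, size: int, delivered: bool) -> int:
        """Record the fate of one segment of `size` bytes; return the MTU."""
        if size <= self.mtu:
            return self.mtu  # not a probe of a larger size
        if not delivered:
            # A lost probe restarts the count for this candidate size.
            self.candidate, self.successes = None, 0
        elif size != self.candidate:
            # First success at a new candidate size starts a fresh count.
            self.candidate, self.successes = size, 1
        else:
            self.successes += 1
        if self.successes >= self.required:
            self.mtu = size
            self.candidate, self.successes = None, 0
        return self.mtu


gate = MtuRaiseGate(mtu=1500, required=3)
for outcome in (True, True, False, True, True, True):
    current = gate.report(9000, outcome)
print(current)
```

With stripes of unequal MTU, a single delivered probe may simply have taken the larger stripe; requiring a run of successes makes that less likely to mislead, at the cost of slower convergence.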

Michael thought a "brokenness test page" was needed -- a good 
testbench.

For hard, repeated timeouts, the first thing you want to do is reduce the 
congestion variables, then reduce the MTU.  At some point you want to 
restart the checks to increase the MTU.

There are other possible protocol interactions, too: for example, SCTP can 
use multiple endpoints.  What if it changes addresses, and the new path has a 
smaller MTU?

Michael felt that it was important to focus where the production 
environments hurt most with the current MTU scheme.  Matt M noted that 
different things hurt in different environments.  Michael expanded that the 
most frequent case will likely be large port 80 responses to a client.  And 
it's the client that would decide that the path is stupid or broken, and 
that another A record should be tried. The web server is getting the 
timeouts, not the client.  This won't deploy if we can't solve the web 
server case.

Matt M noted another case he had thought about, but not seen: what 
happens if raising the MTU causes link stability problems (as opposed to 
hard failures) -- say the link "goes away" for 10 seconds and then 
returns.  He's thought about using a state machine to catch this case... the 
link is broken, and we don't necessarily want to fix it with an MTU 
discovery algorithm.

On ignoring DF bits, so that a tunnel fragments large packets: Matt M 
contended that it is worse for a 1500-byte tunnel to fragment a 9000-byte 
packet than for a tunnel to fragment a 1500-byte packet because of the 
tunnel overhead. Michael didn't understand this at first; Matt explained 
that with many fragments the odds are greater that you lose a fragment, 
and hence the whole packet, than if there are only two fragments.
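The fragment-count argument can be checked with a little arithmetic. The sketch below assumes IPv4 with a 20-byte header and no options, independent loss of each fragment with the same probability, and a hypothetical 50 bytes of tunnel overhead; none of these numbers come from the meeting.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options


def fragment_count(packet_size: int, link_mtu: int) -> int:
    """Number of IPv4 fragments needed to carry packet_size over link_mtu."""
    if packet_size <= link_mtu:
        return 1
    payload = packet_size - IPV4_HEADER
    # Fragment offsets are expressed in 8-byte units, so round the
    # per-fragment payload down to a multiple of 8.
    per_fragment = (link_mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


def packet_loss(fragments: int, fragment_loss: float) -> float:
    """Probability that at least one fragment, and so the packet, is lost."""
    return 1.0 - (1.0 - fragment_loss) ** fragments


many = fragment_count(9000, 1500)      # 9000-byte packet into a 1500-byte tunnel
few = fragment_count(1500, 1500 - 50)  # 1500-byte packet, 50 bytes of overhead
print(many, packet_loss(many, 0.01))
print(few, packet_loss(few, 0.01))
```

Under these assumptions the 9000-byte packet splits into seven fragments and the 1500-byte packet into two, so the larger packet is several times more likely to be lost for the same per-fragment loss rate.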

In thinking about other stakeholders, Magnus Westerlund felt that the 
algorithm would work for RTP over UDP with the use of RTCP extensions for 
packet loss vectors.

Itojun said that we should contact rrs@cisco.com for SCTP.

In the multicast case, Dave Thaler argued that this algorithm might cause an 
ACK implosion that is worse than an ICMP-message-too-big implosion, since 
there are typically far more receivers than there are 
bottleneck-MTU links. 


Magnus commented that the document as written is very TCP specific. The 
algorithm should be better separated from actual deployment. Matt M said 
that's the intention.


Another question was what, exactly, is the definition of MTU?  
End-to-end or link-specific?  Matt M said that we were talking about IP MTU 
when using a particular link layer; how the IETF uses it, not what 
hardware specifications say.


Dave Thaler argued that IPv6 might not need this at all; the algorithm 
could arguably make performance worse (since the MSS size would ramp up 
instead of being decided once by ICMP-message-too-big).  Matt M. said the 
new algorithm would protect against implementation or 
configuration bugs and also work in the cases where L2 MTUs were 
different on a switch.


One audience member said that this should be documented in detail.


Dave Thaler mentioned that filtering ICMPv6 has larger problems.  IPv6 
neighbor discovery uses ICMPv6, so if ICMP is filtered you won't get 
connectivity.  In addition, since v6 has no DF bit (but implied DF on all 
packets) blocking ICMP definitely leads to black hole problems in the 
network.  Thus, there is already a natural incentive to allow PMTU using 
ICMP.


Matt M noted that some stacks have IPv4 mimic IPv6 -- they always set DF, 
even on fragments, and attempt to fragment only at the endpoints.


However, there's no requirement that routers send the too big messages in 
v4, but there is a requirement in v6.


Another comment was that the MTU in Router Advertisement messages should 
solve this problem.  If operational experience says this isn't 
happening, it should be reflected to the v6 working group.


Matt M said that in all cases the problems with path MTU discovery are 
bugs.  There is a large set of problems.

Slides

Path Maximum Transmission Unit Discovery

Presentation 1