2.7.11 Path MTU Discovery (pmtud)

NOTE: This charter is a snapshot of the 62nd IETF Meeting in Minneapolis, MN USA. It may now be out-of-date.

Last Modified: 2005-01-07


Matt Mathis <mathis@psc.edu>
Matthew Zekauskas <matt@internet2.edu>

Transport Area Director(s):

Allison Mankin <mankin@psg.com>
Jon Peterson <jon.peterson@neustar.biz>

Transport Area Advisor:

Allison Mankin <mankin@psg.com>

Mailing Lists:

General Discussion: pmtud@ietf.org
To Subscribe: pmtud-request@ietf.org
In Body: subscribe email_address
Archive: http://www.ietf.org/mail-archive/web/pmtud/index.html

Description of Working Group:

The goal of the PMTUD working group is to specify a robust method for
determining the IP Maximum Transmission Unit supported over an
end-to-end path. This new method is expected to update most uses of
RFC1191 and RFC1981, the current standards track protocols for this
purpose. Various weaknesses in the current methods are documented in
RFC2923, and have proven to be a chronic impediment to the deployment
of new technologies that alter the path MTU, such as tunnels and new
types of link layers.
The proposed new method does not rely on ICMP or other messages from
the network. It finds the proper MTU by starting a connection using
relatively small packets (e.g. TCP segments) and searching upwards by
probing with progressively larger test packets (containing application
data). If a probe packet is successfully delivered, then the path MTU
is raised. The isolated loss of a probe packet (with or without an
ICMP "can't fragment" message) is treated as an indication of an MTU
limit, and not a congestion indicator.
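
A minimal sketch of the upward search described above. The starting size, cap, and doubling schedule here are illustrative assumptions, not values from the draft, and `path_mtu` stands in for actually sending a probe and waiting for acknowledgement:

```python
def discover_pmtu(path_mtu, start=512, max_size=9000):
    """Search upward for the largest deliverable packet size.

    A probe "succeeds" when its size fits the (unknown) path MTU;
    an isolated probe loss is read as an MTU limit, not congestion,
    so the congestion window is not reduced.
    """
    current = start
    probe = current * 2
    while probe <= max_size:
        if probe <= path_mtu:   # probe delivered: raise the PMTU estimate
            current = probe
            probe *= 2
        else:                   # probe lost: treat as an MTU limit and stop
            break
    return current
```

With this coarse doubling schedule the estimate can undershoot the true path MTU (e.g. settling on 1024 for a 1500-byte path); finer search strategies are discussed in the minutes below.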
The working group will specify the method for use in TCP and SCTP,
and will outline what is necessary to support the method in transports
such as DCCP. It will particularly describe the precise conditions
under which lost packets are not treated as congestion indications.
The work will pay particular attention to details that affect
robustness and security.
Path MTU discovery has the potential to interact with many other parts
of the Internet, including all link, transport, encapsulation and
tunnel protocols. Therefore this working group will particularly
encourage input from a wide cross section of the IETF to help to
maximize the robustness of path MTU discovery in the presence of
pathological behaviors from other components.
Input draft:
                Packetization Layer Path MTU Discovery

Goals and Milestones:

Done  Reorganized Internet-Draft. Solicit implementation and field experience.
Done  Update Internet-Draft incorporating implementers' experience.
Feb 05  Submit completed Internet-draft and a PMTUD MIB draft for Proposed Standard.


  • draft-ietf-pmtud-method-04.txt

    No Request For Comments

    Current Meeting Report

    Path MTU Discovery WG (pmtud)
    Tuesday, March 8, 15:30--16:30

    CHAIRS: Matt Mathis <mathis@psc.edu>
    Matt Zekauskas <matt@internet2.edu>

    Due to a lack of volunteers, these minutes were generated from notes taken by Matt Zekauskas. Any audio record can be used for corroboration.


    1. Agenda bashing, milestones review
    2. PMTUD method draft update and issue summary
    3. PMTUD method draft implementation experience
    4. PMTUD method draft issues

    1. Agenda bashing, milestones review

    Matt Z opened the meeting reviewing the agenda, and milestones. We think we're in the middle of the second milestone (revise draft based on stakeholder feedback and implementation experience).

    2. PMTUD method draft overview
    --Matt Mathis

    Matt Mathis started with a one-page summary of what we're trying to do, and an overview of the changes for this draft revision (draft-ietf-pmtud-method-04.txt) (see slides). The changes were mainly editorial, and some of the feedback was missed (as noted on the mailing list); a new revision is promised soon. In particular, Matt noted that the draft is intended to be interoperable with a full range of ICMP handling techniques, from believing the messages implicitly, through treating them as advisory, to ignoring them altogether. Some text will be added in response to the ICMP vulnerability discussion on the list.

    Matt remarked that as Stanislav Shalunov observed in the August meeting last year, this algorithm is parallel to TCP congestion control, and it fits into the Internet architecture in the same way. TCP uses losses to vary properties of the protocol stream, as this does. Thus we have well understood security profiles, and understand how it interacts with middleboxes.

    Matt then introduced John Heffner to talk about a recent implementation.

    3. PMTUD method draft implementation experience
    --John Heffner

    John Heffner related his experience implementing the method in Linux 2.6. There was a mad dash before the meeting -- it has been working on his laptop day-to-day for about a week now, though. Source is available for others to try out. One deliberate choice was a very simple ("stupid") method of probing for the MTU size, to make the algorithm do a lot of work (it starts with a 256-byte MTU and works up to 9000 bytes in this example). John is also tracking the 2.6 kernel mainline, so it could be included in a future release.

    John showed a sequence number trace of the algorithm in action: TCP segment size versus time for an iperf flow. It artificially started out with a 256-byte MSS, and then moved from 512 to 1024 and on up to 8k (on a 9k-MTU path). It takes about 1 ms to converge on a gigabit LAN.

    Issues that came up during the implementation included
    * what is the right search strategy; keeping to factors of 2 may help performance depending on OS
    * deciding when to probe; possibly delaying sending some data so that you can probe (because you can't probe into a full window)
    * verification phase duration: is there a smaller size than what the draft currently recommends that would work well (longer increases the time to "full MTU")
    * a Linux-specific issue with buffer partitioning that raises the time before you can utilize a larger MTU
    * should any of the information be cached (basically how much can you keep around for more connections)
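
On the search-strategy question above, one alternative to naive doubling is a binary search between the largest confirmed size and the smallest failed size. This is only an illustrative sketch, not the implementation's strategy; the hypothetical `probe_ok` callback stands in for sending a probe and waiting for its acknowledgement:

```python
def search_pmtu(probe_ok, low=256, high=9000):
    """Binary-search the path MTU between a confirmed lower bound
    and a failed upper bound.

    probe_ok(size) should return True iff a probe of `size` bytes
    was acknowledged (i.e. it fit the path MTU).
    """
    if probe_ok(high):          # fast path: the largest size already works
        return high
    while high - low > 1:
        mid = (low + high) // 2
        if probe_ok(mid):
            low = mid           # delivered: raise the confirmed bound
        else:
            high = mid          # lost: lower the failed bound
    return low
```

Binary search converges in a logarithmic number of probes but does not keep to powers of two, which (as noted above) may matter for OS buffer handling.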

    Discussion started with the time for the algorithm to converge: you don't want to slow single-page web transfers (the picture showed it took ~1 ms to converge on a gigabit LAN; it will take longer on longer paths). A commenter asked whether we should be talking in terms of RTTs rather than wall-clock time. John said that he had not done extensive testing, and it varies; further, the way Linux computes its send buffer size means that the time will be more than just a factor of RTT. However, it should be approximately a certain number of RTTs. John was asked why 256 bytes was chosen as a starting point; it was done for testing purposes only.

    Next the discussion turned to small MTU issues. Fred Templin related a discussion of 81-byte packets in the low-power BOF earlier, and noted that the minimum IPv4 MTU is in fact 68, not 576. John said that in this implementation the floor is a variable that can be modified by sysctl; setting it to 68 might not be a good idea. Matt Mathis mentioned that he was only aware of in-service equipment with an MTU of 1006 (the old SLIP standard); as a practical matter, there is no link layer with an MTU of less than 576 bytes, and when the natural MTU is really small people have used adaptation layers on top of it. Fred noted that RFC 3150 mentions dialup links with an MTU of 296 bytes. Matt said he needed to think a bit about this; if the end-system can observe the small MTU then there is no problem. We might only have a problem if there were an intermediate link with a tiny MTU. Fred said he hoped all such occurrences would be at the edge. There are lots of low-end devices, and perhaps we could fine-scan if there is a small MTU.

    John related an idea that might speed convergence: start out with a large MSS and go straight into the verification phase. If a timeout is seen, fall back to searching upward for the MSS. He also mentioned that the probe strategy was "really dumb" in this implementation: start small and probe up by doubling the size. (There can be a performance benefit to ending up with an MSS that is an integral factor of the page size: either the page size is a multiple of the MSS, or the MSS is a multiple of the page size.)
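
The start-large idea might look like the following sketch; `verify` and `search_up` are hypothetical stand-ins for the verification phase and the upward search, and the default sizes are illustrative only:

```python
def fast_start(verify, search_up, large_mss=8960, safe_mss=512):
    """Try the largest plausible MSS first; on a verification timeout,
    fall back to searching upward from a safe size.

    verify(mss) -> True if mss-sized segments were acked in time.
    search_up(start, limit) -> largest workable MSS found by probing up.
    """
    if verify(large_mss):
        return large_mss        # converged in a single verification phase
    return search_up(safe_mss, large_mss)
```

The win is that on clean high-MTU paths the search is skipped entirely; the cost is one timeout on paths where the large size does not fit.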

    The probing strategy contains the idea of "wait to probe": if we expect to probe in the immediate future, don't send additional data. This avoids a problem that occurs when the window fills: once the window is full you do not get to probe. So, if you are limited by the window (filling cwnd) and you get an ack for one packet but want to probe with two, wait for the next ack and then send the probe. This idea is still under investigation: are there cases where the tests are too strict and you never end up probing, or cases where nothing comes in and you stall? It would be good for others to think about this and provide feedback. Matt noted that it is critically important that the probe not raise the window; you don't want the probe to overflow a queue.
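
The "wait to probe" rule can be sketched as a per-ack decision. The byte accounting here is a simplification, and the assumption that a probe occupies two MSS of window is taken from the two-packet example above:

```python
def next_action(cwnd_bytes, in_flight_bytes, mss, probe_pending):
    """Decide, on each ack, whether to send data, send a probe, or wait.

    If a probe is pending but does not yet fit inside the window,
    withhold ordinary data so that acks open enough room, rather than
    filling cwnd and never getting a chance to probe. The probe must
    never raise the window on its own.
    """
    probe_bytes = 2 * mss       # simplifying assumption: probe spans two segments
    if probe_pending:
        if in_flight_bytes + probe_bytes <= cwnd_bytes:
            return "probe"
        return "wait"           # hold back data until the window opens
    if in_flight_bytes + mss <= cwnd_bytes:
        return "send"
    return "wait"
```

The open question noted above maps onto this sketch directly: if acks stop arriving while the sender is in the "wait" branch, it stalls unless a timer forces progress.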

    The verification phase also contains a deviation from the suggestion in the draft. Instead of counting a cwnd's worth of data, only full-size packets are counted, stopping at 10. Ten seemed to be good enough to see whether the path is striped over dissimilar-MTU links. If the verification phase is too long, the algorithm becomes more fragile to verification failures; if it is too short, it might declare a false success.
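
The modified verification rule (count only full-size segments, stop at ten) can be sketched as a small counter. The threshold of 10 is from this implementation report, not from the draft:

```python
class Verifier:
    """Verify a candidate MSS by counting acked full-size segments."""

    FULL_SEGMENTS_NEEDED = 10   # per the implementation report, not the draft

    def __init__(self, mss):
        self.mss = mss
        self.full_count = 0

    def on_acked_segment(self, size):
        """Feed each acked segment size; returns True once verified."""
        if size >= self.mss:    # only full-size segments count
            self.full_count += 1
        return self.full_count >= self.FULL_SEGMENTS_NEEDED
```

Counting only full-size segments means that a connection sending mostly small writes never completes verification, which connects to the short-connection issue raised later in the discussion.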

    A peculiarity of Linux that slows the MSS search is that Linux segments data into MSS-sized chunks as it copies the data from the user application, so you have to wait for the current send buffer to drain before the new MSS is tried. If a larger MSS could be probed ahead of verification, probing could proceed in parallel with verification to speed the process up.

    There were a few items that were not implemented in favor of getting the basic functionality working (see slides). In particular, nothing was done about ICMP attack detection mentioned on both the TCPM and PMTUD lists.

    Some open issues (see the slides): short connections can't do probing. Do we want to cache any of the MSS information? A false positive would pass on a penalty to later connections, but you'd like to pass on as much as possible. Also, you might pay a high penalty on SMP machines depending on how access to the cache data structure is granted (locking issues). The issue of ICMP attacks (or just bad information) deserves some thought.

    A question came up on how the algorithm would behave on slow speed lines with a large RTT. A lot of web transactions are short lived, and a big MTU would help. For those, you could cache some of the state, or start with a large packet in parallel. Perhaps the defaults might be different for a large web server where you expect lots of ICMP black holes. Matt went back to the Linux buffer issue: if the MSS is tiny, the send buffer is "drained in teaspoons". If you overlap the probing, you could do the algorithm nearly as fast as slow start. In fact, you could do slow-start by doubling the packet size rather than the number of packets. John noted that a lot of web servers send one buffer and that's it. Matt noted that the issue is how conservative do you want to be at the start when the number of packets in flight is small.

    4. PMTUD method draft issues discussion, part II
    --Matt Mathis

    Matt Mathis then took over the presentation, talking about what has been learned and planned changes. Insights from the implementation: starting in the verification phase would speed convergence, as would a combined transition-and-verification phase. More thought is needed about shared state: what part is in the protocol, and what part is shared. This is tied to the MIB design work to be done; you don't want a separate MIB for every protocol that uses the algorithm. This pushes state toward the IP layer; then again, if you have shared state you also need to worry about lock contention on high-speed machines. Matt will be looking carefully at this issue as John's implementation matures and other implementations are done.

    The ICMP issues brought up on the list may also spawn a few changes. One of the ideas there, already standard in some stacks, is to do additional checking on the validity of the ICMP payload. Checking the TCP sequence and acknowledgement numbers, when they are available, is a one-sentence addition to the current draft. Other issues from the Gont draft are equivalent to the state machine we already have for searching for the MSS, and don't add to what is already done. There is another opportunity: since we know which packets are probes, we can check other fields and match more of the packet. This is lightweight and easy, could gain a fair amount of robustness against a range of attacks, and is also protocol independent... as long as the protocol tells IP that it is probing.
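
The sequence-number check mentioned above can be sketched as follows. This simplified version ignores 32-bit sequence wraparound, which a real implementation must handle:

```python
def icmp_quote_plausible(quoted_seq, snd_una, snd_nxt):
    """Accept an ICMP error only if the TCP sequence number quoted in
    its payload falls within the currently outstanding send window
    (between the oldest unacknowledged byte and the next byte to send).

    Ignores 32-bit sequence-number wraparound for clarity; a real
    stack must compare modulo 2**32.
    """
    return snd_una <= quoted_seq < snd_nxt
```

An off-path attacker who cannot see the connection must then guess a sequence number inside the outstanding window for a forged ICMP message to be accepted.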

    A new revision of the document is planned to address all the existing concerns; then there will be work on a MIB, more push on the sample implementation, and an attempt to get an implementers list to capture feedback to go to the final version of the draft.

    A final question from the audience asked whether a formal proof could be done to show that packetization-layer PMTUD doesn't interact badly with congestion control. Matt Mathis pointed to the single page that has all the MUST directives; the questioner was going to go off, read that in detail, and come back with further thoughts.

    No one was interested in a more detailed rehash of the current algorithm ("Algorithm Review" in the slides), and the meeting was closed.


    Path MTU Discovery Update, part I
    Implementation Report
    Path MTU Discovery Update, part II