Internet-Draft                                               Matt Mathis
                                                            John Heffner
                                                                     PSC
                                                             Kevin Lahey
                                                               Freelance
                                                           June 21, 2003

                           Path MTU Discovery
                     draft-mathis-pmtud-method-00.txt


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


Abstract

   [@@ To be rewritten]

   This document describes Path MTU Discovery for the Internet.  It is
   largely derived from RFC 1191 and RFC 1981, which describe ICMP based
   Path MTU Discovery for IP versions 4 and 6, plus a robust new
   algorithm.


Mathis, et al                                                   [Page 1]


Internet-Draft Expires Feb 2004                            June 21, 2003


   The general strategy of the new algorithm is to start with a small
   MTU and probe upward, testing successively larger MTUs by probing
   with single packets.  If the probe is successfully delivered, then
   the MTU is raised.  If the probe is lost, it is treated as an MTU
   limitation and not as a congestion signal.


Table of Contents

   TBD


1. Introduction


   When one Internet node has a large amount of data to send to another
   node, the data is transmitted in a series of IP packets.  It is
   usually preferable that these packets be of the largest size that can
   successfully traverse the path from the source node to the
   destination node.  This packet size is referred to as the Path MTU
   (PMTU), and it is equal to the minimum link MTU of all the links in a
   path.

   This document describes a path MTU discovery (PMTUD) method based on
   the earlier methods described in the standards track documents,
   RFC1191 and RFC1981, with the addition of a new algorithm that
   searches for the proper MTU by probing with successively larger
   packets.  Large sections of this document are taken directly from
   RFC1191 and RFC1981.

   The methods described in this document apply to IPv4, IPv6, TCP, and
   other transport protocols.   This document does not define a
   protocol, but rather a method to use features of existing protocols
   to discover the path MTU.  It does not require cooperation from the
   lower layers (except that they are consistent about what packet sizes
   are acceptable) or the far node.  Variants in implementations will
   not cause problems with interoperability.

   [[As a consequence people are encouraged to start developing
   experimental implementations as soon as the requirements section is
   stable.   All other sections are recommendations only.]]

   For the sake of clarity we uniformly prefer TCP and IPv6
   terminology.  In the terminology section we also present the
   analogous IPv4 terms and concepts for the IPv6 terminology.  In a
   few situations we describe specific details that are different
   between IPv4 and IPv6.

   [[This document still bears markup notes, indicated with square
   brackets [] or @@@@ signs.]]


2. Terminology

   IP          - Either IPv4 [IPv4-SPEC] or IPv6 [IPv6-SPEC].

   node        - a device that implements IP.

   router      - a node that forwards IP packets not explicitly
                 addressed to itself.

   host        - any node that is not a router.

   upper layer - a protocol layer immediately above IP.  Examples are
                 transport protocols such as TCP and UDP, control
                 protocols such as ICMP, routing protocols such as OSPF,
                 and Internet or lower-layer protocols being "tunneled"
                 over (i.e., encapsulated in) IP such as IPX,
                 AppleTalk, or IP itself.

   link        - a communication facility or medium over which nodes can
                 communicate at the link layer, i.e., the layer
                 immediately below IPv6.  Examples are Ethernets (simple
                 or bridged); PPP links; X.25, Frame Relay, or ATM
                 networks; and Internet (or higher) layer "tunnels",
                 such as tunnels over IPv4 or IPv6 itself.

   interface   - a node's attachment to a link.

   address     - an IP-layer identifier for an interface or a set of
                 interfaces.

   packet      - an IP header plus payload.

   MTU         - Maximum Transmission Unit, the size in bytes of the
                 largest packet that can be transmitted on a link or
                 path.   Note that this could more properly be called
                 the IP MTU, to be consistent with how other standards
                 organizations use the term.  Beware that the definition
                 used in this and other IETF documents is not the same
                 as the definition used in other contexts.

   link MTU    - the Maximum Transmission Unit, i.e., maximum packet
                 size in octets, that can be conveyed in one piece over
                 a link.

   path        - the set of links traversed by a packet between a source
                 node and a destination node.

   path MTU    - the minimum link MTU of all the links in a path between
                 a source node and a destination node.

   PMTU        - path MTU

   Path MTU Discovery,
   PMTUD       - process by which a node learns the PMTU of a path

   Packet Too Big message
               - An ICMP message reporting that an IP packet is too
                 large to forward.  This is the IPv6 term that
                 corresponds to the IPv4 "ICMP Can't fragment" message.

   flow id     - a combination of a source address and a non-zero
                 IPv6 flow label.

   L3 MTU      - the maximum available IP payload size, usually over a
                 specific path.  This is the maximum layer 3 transmission
                 unit (e.g., a TCP message, including all TCP headers and
                 data, but not IP or link headers).

   segment size- the L3 payload size (from TCP usage).

   probe packet- A packet which is being used to test for a larger MTU.

   probe size  - The size of a packet being used to probe for a larger MTU.

   successful probe
               - The probe packet was delivered through the network.

   inconclusive probe
               - The probe packet was not delivered, but there were other lost
                 packets too close to the probe.   By implication the probe
                 might have been lost due to something other than MTU, so the
                 results are inconclusive.

   failed probe
               - The probe packet was not delivered and there were not other
                 lost packets close to the probe.

   probe gap   - The L3 payload data that will need to be retransmitted if the
                 probe is not delivered.

[[Deprecated terms - these terms should only appear in very specific parts of
the document.

ICMP

Can't fragment messages

lower layers

@@@ remove as the document matures]]


3. Overview

   This document describes a technique to dynamically discover the MTU
   of a path.  These procedures are applicable to TCP and other
   transport- or application-level Packetization protocols which
   implement similar features.

   The general strategy of the new procedure is to find the proper MTU
   by starting a connection using relatively small packets and then
   probing with progressively larger packets (containing application
   data).  If a probe packet is successfully delivered, then the path
   MTU is raised.  The isolated loss of a probe packet (with or without
   a Packet Too Big message) is treated as an indication of an MTU
   limit, and not as a congestion indicator.

   PMTUD can optionally process Packet Too Big messages for faster
   convergence in exchange for a slight decrease in robustness.
   Processing malicious or erroneous Packet Too Big messages can cause
   PMTU discovery to arrive at the incorrect MTU for a path, which is
   likely to reduce protocol performance.  The document describes three
   options for processing Packet Too Big messages: completely ignore
   them, only accept them in response to probes or accept all Packet Too
   Big messages (the previous approach).

   In addition, PMTUD can be extended with heuristics to use alternate
   criteria to select PMTU.  For example, on a path that is so congested
   that the fair share window is too small (smaller than 5 kB), TCP may
   be better behaved with 512-byte packets than with 1500-byte packets
   since with the larger packets the window would be too small to
   trigger Fast Retransmit.
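
   As a rough illustration of this heuristic, the sketch below checks
   whether a window holds enough packets for Fast Retransmit to fire.
   It assumes the usual threshold of three duplicate acknowledgments,
   reads "5 kB" as 5000 bytes, and ignores header overhead.  The
   function and constant names are hypothetical.

```python
DUPACK_THRESHOLD = 3  # duplicate ACKs needed to trigger Fast Retransmit


def can_trigger_fast_retransmit(window_bytes, packet_bytes):
    """Return True if one lost packet can still draw three duplicate ACKs.

    A lost packet must be followed by at least DUPACK_THRESHOLD delivered
    packets in the same window, so the window needs one more than that.
    """
    packets_in_window = window_bytes // packet_bytes
    return packets_in_window >= DUPACK_THRESHOLD + 1
```

   Under these assumptions a 5000-byte window holds three 1500-byte
   packets, which is too few, but nine 512-byte packets, which is
   enough.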

   Relatively few details of this procedure affect interoperability with
   other standards or Internet protocols.  These details are specified
   in RFC2119 standards language in the requirements section.  The vast
   majority of the implementation details are recommendations based on
   experiences with earlier versions of path MTU discovery.  These are
   motivated by a desire to maximize robustness in the presence of less
   than ideal implementations as they exist in the field.


4. Requirements

   [This section is written in RFC2119 standards language: MUST,
   SHOULD, etc.]

   All Internet nodes SHOULD implement Path MTU Discovery in order to
   discover and take advantage of the largest MTU supported along the
   Internet path.

   Nodes not implementing Path MTU Discovery must use a default MTU as
   specified by the respective IP protocols.  For IPv6 the default MTU
   is 1280 bytes, the minimum link MTU as defined in [IPv6-SPEC].  For
   IPv4 it is 576 bytes, as specified in [IPv4-SPEC].


   Links MUST NOT deliver packets that are larger than their true MTU.
   Links that have parametric limitations (e.g. MTU bounds due to
   limited clock stability) MUST include explicit mechanisms to
   consistently reject packets that might otherwise be
   nondeterministically delivered.


   When a packet is too large to traverse a link, the attached router,
   if any, SHOULD send a Packet Too Big message (IPv6) or an ICMP "Can't
   fragment" message (IPv4 with DF set), as appropriate.


   The requirements below only apply to those implementations that
   include Path MTU Discovery.


   Before a probe can be sent the connection MUST have at least the
   candidate MSS worth of pending data and MUST be using the current
   MSS, as defined by having received at least one acknowledgment for a
   recent non-probe segment at the current MSS.  This implicitly limits
   successful probes to once per two round trips.  [Making the algorithm
   robust in the presence of multi-path routing is likely to require an
   additional RTT.]  @@@ generalize

   Failed and inconclusive probes must be more widely spaced than the
   normal Additive Increase Multiplicative Decrease (AIMD) congestion
   interval for the current average window size.  This is enforced by
   keeping a "probe countdown" which is decremented on each non-probe
   segment sent.  Probes MUST NOT be sent before the probe countdown
   reaches zero.  @@@ generalize

   The candidate MSS MUST be strictly smaller than three times the
   current MSS.  Thus the probe segment fully covers at most one
   subsequent segment.  The second subsequent segment is at most
   partially covered by the probe segment.  This guarantees that the
   segments following the probe segment will cause at most one
   superfluous duplicate acknowledgment.  @@@ generalize
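
   The three preconditions above can be collected into a single gate,
   as in the following sketch.  The record layout and all names are
   hypothetical.  The sketch also assumes that a candidate MSS is
   always larger than the current MSS.

```python
from dataclasses import dataclass


@dataclass
class ProbeGate:
    current_mss: int             # bytes
    pending_bytes: int           # data queued and not yet sent
    acked_at_current_mss: bool   # ACK seen for a recent non-probe segment
    probe_countdown: int = 0     # non-probe segments still to send


def may_send_probe(gate, candidate_mss):
    """Return True only if every precondition for probing holds."""
    # Candidate must exceed the current MSS but stay below three times it.
    if not (gate.current_mss < candidate_mss < 3 * gate.current_mss):
        return False
    # At least one candidate MSS worth of data must be pending.
    if gate.pending_bytes < candidate_mss:
        return False
    # The connection must be known to be running at the current MSS.
    if not gate.acked_at_current_mss:
        return False
    # Probes are not sent before the countdown reaches zero.
    return gate.probe_countdown == 0


def on_non_probe_segment_sent(gate):
    """Decrement the probe countdown, stopping at zero."""
    if gate.probe_countdown > 0:
        gate.probe_countdown -= 1
```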

   The TCP MUST be using Fast-Retransmit and SACK or NewReno, such that
   isolated lost segments will normally be retransmitted without the
   spurious retransmission of any additional segments.

   During the probe, all of the normal retransmission, recovery and
   congestion control machinery is in effect, with one exception: when
   only the probe gap is retransmitted (and no other segments), the
   normal multiplicative cwnd reduction is suppressed.  If any other
   segments are retransmitted, all normal cwnd reductions MUST take
   place.
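
   A minimal sketch of this rule follows.  Segments are identified by
   hypothetical sequence numbers, and halving cwnd merely stands in for
   whatever multiplicative reduction the TCP implementation normally
   applies.

```python
def cwnd_after_recovery(cwnd_bytes, retransmitted, probe_gap):
    """Return cwnd after a recovery episode that overlapped a probe.

    retransmitted is the set of sequence numbers that were resent and
    probe_gap is the sequence number at which the probe gap starts.
    """
    if not retransmitted:
        return cwnd_bytes          # nothing was lost
    if retransmitted == {probe_gap}:
        return cwnd_bytes          # only the probe gap: no reduction
    return cwnd_bytes // 2         # stand-in for the normal reduction
```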


   If the probe was successful, the current MSS is updated to the
   candidate MSS.  If cwnd and other congestion state variables are kept
   in packets, they MUST be rescaled by the change in MSS, to preserve
   the current window size in bytes.  @@@ generalize
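
   For implementations that keep cwnd in packets, the rescaling can be
   sketched as below.  Rounding down is an assumption of this sketch,
   chosen so that the window never grows in bytes as a side effect of
   the MSS change.  The function name is hypothetical.

```python
def rescale_packet_count(count_packets, old_mss, new_mss):
    """Convert a window kept in packets from old_mss to new_mss units."""
    window_bytes = count_packets * old_mss
    return window_bytes // new_mss
```

   For example, 20 packets at an MSS of 1024 bytes become 10 packets at
   an MSS of 2048 bytes.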


5. Implementation Issues


   This section discusses a number of issues related to the
   implementation of Path MTU Discovery.  This is not a specification,
   but rather a set of notes provided as an aid for implementers.

   The issues include:

   - What layer or layers implement Path MTU Discovery?

   - Accounting for headers

   - How is the PMTU information cached?

   - How are ICMP messages processed?

   - How is stale PMTU information removed?

   - How to implement PMTUD with TCP?

   - What should other transport and higher layers do?

   - What should tunnels above IP do?


5.1. Layering

   In the IP architecture, the choice of what size packet to send is
   made by a protocol at a layer above IP.  This memo refers to such a
   protocol as a "packetization protocol".  Packetization protocols are
   usually transport protocols (for example, TCP) but can also be
   higher-layer protocols (for example, protocols built on top of UDP).

   Implementing Path MTU Discovery in the packetization layers
   simplifies some of the inter-layer issues, but has several drawbacks:
   the implementation may have to be redone for each packetization
   protocol, it becomes hard to share PMTU information between different
   packetization layers, and the connection-oriented state maintained by
   some packetization layers may not easily extend to save PMTU
   information for long periods.

   It is therefore suggested that the IP layer store PMTU information
   and that the ICMP layer process received Packet Too Big messages.
   The packetization layers may respond to changes in the PMTU by
   changing the size of the messages they send.  To support this
   layering, packetization layers require a way to learn of changes in
   the value of MMS_S, the "maximum send transport-message size".  The
   MMS_S is derived from the Path MTU by subtracting the size of the
   IPv6 header plus space reserved by the IP layer for additional
   headers (if any).
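
   The MMS_S derivation is simple arithmetic, sketched here for IPv6.
   The 40-byte figure is the fixed IPv6 header size.  The parameter for
   space reserved by the IP layer and the function name are
   hypothetical.

```python
IPV6_HEADER_BYTES = 40  # fixed IPv6 header, without extension headers


def mms_s(path_mtu, reserved_header_bytes=0):
    """Return the maximum send transport-message size for a path."""
    return path_mtu - IPV6_HEADER_BYTES - reserved_header_bytes
```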


   It is possible that a packetization layer, perhaps a UDP application
   outside the kernel, is unable to change the size of messages it
   sends.  This may result in a packet size that exceeds the Path MTU.

   To accommodate such situations, IPv6 defines a mechanism that allows
   large payloads to be divided into fragments, with each fragment sent
   in a separate packet (see [IPv6-SPEC] section "Fragment Header").
   However, packetization layers are encouraged to avoid sending
   messages that will require fragmentation (for the case against
   fragmentation, see [FRAG]).

   To accommodate such situations, it is recommended that IPv4 use a
   mechanism that parallels the IPv6 mechanism and only fragment in the
   end systems.  Also set DF on the fragments.  @@@more

5.2. Accounting for headers

   [[@@@To be written

   IP MTU is the payload size of the lower layer (should be "lower layer
   MTU minus link headers", but this is a different use of "MTU").  @@@
   more, clarify

   L3 MTU is IP MTU minus IP headers @@@ more

   MSS is L3 MTU minus TCP headers @@@ more

   This document does not take a position on the placement of IPsec,
   which logically sits at the boundary between IP and TCP or other
   packetization layer.  IPsec can be treated either as part of IP or as
   part of the packetization layer, as long as the accounting is
   consistent within any given implementation.

   If IPsec is treated as part of the IP layer, then each security
   association that contributes a different length security header may
   need to be treated as a separate path.  If IPsec is treated as part
   of the packetization layer, then the MSS to L3 MTU calculation must
   include the IPsec header size.

   ]]
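
   Pending the full text of this section, the accounting in the notes
   above might be sketched as follows.  Header sizes are passed in by
   the caller because they depend on IP version and options.  The
   ipsec_bytes parameter and both function names are hypothetical.
   Whichever layer IPsec is charged to, its overhead is subtracted
   exactly once.

```python
def l3_mtu(ip_mtu, ip_header_bytes):
    """L3 MTU is the IP MTU minus the IP headers."""
    return ip_mtu - ip_header_bytes


def mss(ip_mtu, ip_header_bytes, tcp_header_bytes, ipsec_bytes=0):
    """MSS is the L3 MTU minus TCP headers and any IPsec overhead."""
    return l3_mtu(ip_mtu, ip_header_bytes) - ipsec_bytes - tcp_header_bytes
```

   With a 1500-byte IP MTU and no options, this gives an MSS of 1460
   bytes over IPv4 (20-byte IP header, 20-byte TCP header) and 1440
   bytes over IPv6 (40-byte IP header).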

5.3. Storing PMTU information

   Ideally, a PMTU value should be associated with a specific path
   traversed by packets exchanged between the source and destination
   nodes.  However, in most cases a node will not have enough
   information to completely and accurately identify such a path.
   Rather, a node must associate a PMTU value with some local
   representation of a path.  It is left to the implementation to select
   the local representation of a path.

   In the case of a multicast destination address, copies of a packet
   may traverse many different paths to reach many different nodes.  The
   local representation of the "path" to a multicast destination must in
   fact represent a potentially large set of paths.

   Minimally, an implementation could maintain a single PMTU value to be
   used for all packets originated from the node.  This PMTU value would
   be the minimum PMTU learned across the set of all paths in use by the
   node.  This approach is likely to result in the use of smaller
   packets than is necessary for many paths.

   An implementation could use the destination address as the local
   representation of a path.  The PMTU value associated with a
   destination would be the minimum PMTU learned across the set of all
   paths in use to that destination.  The set of paths in use to a
   particular destination is expected to be small, in many cases
   consisting of a single path.  This approach will result in the use of
   optimally sized packets on a per-destination basis.  This approach
   integrates nicely with the conceptual model of a host as described in
   [ND]: a PMTU value could be stored with the corresponding entry in
   the destination cache.

   If IPv6 flows [IPv6-SPEC] are in use, an implementation could use the
   flow id as the local representation of a path.  Packets sent to a
   particular destination but belonging to different flows may use
   different paths, with the choice of path depending on the flow id.
   This approach will result in the use of optimally sized packets on a
   per-flow basis, providing finer granularity than PMTU values
   maintained on a per-destination basis.

   For source routed packets (i.e. packets containing an IPv6 Routing
   header [IPv6-SPEC]), the source route may further qualify the local
   representation of a path.  In particular, a packet containing a type
   0 Routing header in which all bits in the Strict/Loose Bit Map are
   equal to 1 contains a complete path specification.  An implementation
   could use source route information in the local representation of a
   path.

   Note: Some paths may be further distinguished by different security
   classifications.  The details of such classifications are beyond the
   scope of this memo.    @@@ this should be in scope
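
   One possible shape for such a cache is sketched below.  It keys
   entries by flow id when a non-zero flow label is present and by
   destination address otherwise.  The class, its methods, and the
   choice of the first-hop MTU as the value for unknown paths are
   assumptions of this sketch, not requirements.

```python
class PmtuCache:
    """PMTU estimates keyed by a local representation of a path."""

    def __init__(self, first_hop_mtu):
        self.first_hop_mtu = first_hop_mtu
        self._pmtu = {}

    @staticmethod
    def path_key(dst_addr, src_addr=None, flow_label=0):
        """Use the flow id if there is one, else the destination."""
        if src_addr is not None and flow_label != 0:
            return ("flow", src_addr, flow_label)
        return ("dst", dst_addr)

    def lookup(self, key):
        """Return the estimate for a path, or the first-hop MTU."""
        return self._pmtu.get(key, self.first_hop_mtu)

    def lower(self, key, pmtu):
        """Keep the minimum PMTU learned; True if the estimate fell."""
        if pmtu < self.lookup(key):
            self._pmtu[key] = pmtu
            return True
        return False
```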

5.4. Probing method using TCP

   A new "candidate MSS" is tested by sending one "probe segment", which
   is larger than the current MSS.

   After a probe segment has been sent (of size candidate MSS), the
   subsequent segment(s) may be sent as though the probe segment were
   not oversized.  Thus if the probe segment is lost, it will leave a
   hole that is exactly one current MSS long.  We refer to this
   potential hole as the probe gap.  Note that the length of the probe
   segment is determined by the candidate MSS under consideration, but
   the length of the probe gap is the current MSS.  [This has been shown
   to be more restrictive than necessary.]

   The probe is completed when the acknowledgment sequence advances
   past the probe gap.  If the probe gap was not retransmitted, the
   probe was successful.  If the probe gap was retransmitted and there
   were no other retransmissions, the candidate MSS failed.  If there
   were any other retransmissions, the probe was inconclusive.

   If the probe was successful, the current MSS is updated to the
   candidate MSS.  @@@ add robustness language re: more losses

   If the probe failed or was inconclusive the probe countdown is set to
   COUNTDOWN_SCALE times the square of the current window size in
   packets.
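
   The completion rules above can be sketched as follows.  The sketch
   treats a probe whose gap was never retransmitted as successful even
   if other segments were resent, which matches the definitions in the
   terminology section.  COUNTDOWN_SCALE is passed as a parameter since
   no value is given here.  All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResultState:
    current_mss: int
    probe_countdown: int = 0


def classify_probe(gap_retransmitted, other_retransmissions):
    """Classify a completed probe from what had to be retransmitted."""
    if not gap_retransmitted:
        return "successful"
    if other_retransmissions == 0:
        return "failed"
    return "inconclusive"


def on_probe_complete(state, candidate_mss, cwnd_packets,
                      gap_retransmitted, other_retransmissions,
                      countdown_scale):
    """Update the MSS or the probe countdown once a probe completes."""
    outcome = classify_probe(gap_retransmitted, other_retransmissions)
    if outcome == "successful":
        state.current_mss = candidate_mss
    else:
        # Failed and inconclusive probes both delay the next probe.
        state.probe_countdown = countdown_scale * cwnd_packets ** 2
    return outcome
```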

   If a Packet Too Big message is received, it can be used to compute
   an MSS limit by deducting the TCP/IP header sizes (including options)
   from the MTU reported in the ICMP message.  If the MSS limit is
   between the current MSS and candidate MSS, the current MSS is updated
   to the MSS limit; otherwise the message is ignored.  If the current
   MSS is updated, then the probe strategy is forced into the monitor
   state described below.  @@@ update
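
   A sketch of this check follows.  The text does not say whether
   "between" includes the endpoints; the sketch assumes the limit may
   equal the current MSS but must be smaller than the candidate MSS.
   The function name and return convention are hypothetical.

```python
def apply_packet_too_big(current_mss, candidate_mss, reported_mtu,
                         ip_header_bytes, tcp_header_bytes):
    """Return (new_current_mss, enter_monitor_state).

    Header sizes include any options in use on the connection.
    """
    mss_limit = reported_mtu - ip_header_bytes - tcp_header_bytes
    if current_mss <= mss_limit < candidate_mss:
        return mss_limit, True
    # Outside the probed range: the message is ignored.
    return current_mss, False
```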

5.5. Probing method using SCTP

   @@@@ to be written

5.6. General probing methods

   @@@@ to be written

5.7. Probe strategy

   The probe strategy described here is a recommended baseline
   algorithm.  It is not presented in formal standards language because
   the probe strategy can include heuristics to help select an optimal
   MSS for a given path.  As a consequence there is opportunity for
   future improvements to this algorithm.

   The probing strategy has three major states: search, monitor and
   suspend.  During the search state, it sequentially searches for the
   largest MSS that the path can support.  Once the path MSS has been
   discovered, the probing algorithm enters the monitor state where it
   probes infrequently to detect if the path MSS has become larger.

   If the MSS probing persistently fails it may be desirable to suspend
   path MSS probing and heuristically select one of the common default
   MSSs: 576, 1280, or 1500 Bytes.

   5.7.1. Search

   The recommended search strategy is a multi-phase scan: First, a
   coarse scan for the approximate path MSS using factor of 2 steps
   starting at 1024 Bytes until a probe fails, followed by successively
   finer scans between the largest previously successful and
   unsuccessful probes.

          Table 1: Recommended MSS scanning sequence
          (Coarse scan down column 1, fine scan across each row)
          512, [Use only after repeated timeouts]
          1024,  1492, 2002
          2048
          4096, 4352
          8192, 9000
          16384, 17914
          32768
          64512
          ((Additional values needed))

   During the scan it is recommended that the MSS not be raised if cwnd
   is too small as determined by a heuristic.  The recommended heuristic
   is that the MSS is only raised when the cwnd is larger than 20
   segments.
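
   The scan over Table 1 might be sketched as below.  The probe_ok
   callback is hypothetical and stands for sending one probe and
   reporting whether it succeeded.  The spacing of probes, inconclusive
   results, and the 512-byte fallback row are deliberately left out.

```python
# Table 1 as (coarse step, fine steps in the same row).
TABLE_1 = [
    (1024, [1492, 2002]),
    (2048, []),
    (4096, [4352]),
    (8192, [9000]),
    (16384, [17914]),
    (32768, []),
    (64512, []),
]


def may_raise_mss(cwnd_segments):
    """Recommended heuristic: only raise the MSS with a large cwnd."""
    return cwnd_segments > 20


def search_mss(probe_ok):
    """Return the largest MSS in Table 1 that the scan finds usable."""
    row = 0
    current = TABLE_1[0][0]
    # Coarse scan down column 1 until a probe fails.
    while row + 1 < len(TABLE_1) and probe_ok(TABLE_1[row + 1][0]):
        row += 1
        current = TABLE_1[row][0]
    # Fine scan across the row of the last successful coarse step.
    for candidate in TABLE_1[row][1]:
        if not probe_ok(candidate):
            break
        current = candidate
    return current
```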

   5.7.2. Monitor

   Once the scan has found an appropriate MSS, the probe strategy enters
   the monitor state, where it re-probes the most recent failed MTU
   once every MONITOR_INTERVAL seconds.  If the probe fails, it remains
   in the monitor state.  If it succeeds, it enters the search state.

   If the network becomes too congested during either the scan or the
   monitor states it is recommended that the MSS be reduced to a smaller
   size as determined by a heuristic.  The recommended heuristic is to
   reduce the MSS if ssthresh is reduced to 5 segments or smaller.  The
   recommended reduction is to the next smaller major MSS step in table
   1.


   When there are repeated timeouts (MAX_TIMO or more retransmissions,
   without any received ACKs), it is presumed that the connection was
   re-routed onto a link with a smaller MSS, and that ICMP messages are
   not being delivered.  The MSS probing algorithm is reset by pulling
   back the MSS to 1024 Bytes, rescaling the congestion control
   variables and reentering the search state.
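
   The two fallback heuristics of this section can be sketched as
   follows.  The sketch takes the "major" steps to be column 1 of
   Table 1 without the 512-byte row, which is an assumption.  MAX_TIMO
   is passed as a parameter since no value is given here, and the
   rescaling of congestion control variables is not shown.

```python
MAJOR_STEPS = [1024, 2048, 4096, 8192, 16384, 32768, 64512]
RESET_MSS = 1024


def reduce_on_congestion(current_mss, ssthresh_segments):
    """Drop to the next smaller major step when ssthresh is 5 or less."""
    if ssthresh_segments > 5:
        return current_mss
    smaller = [step for step in MAJOR_STEPS if step < current_mss]
    return smaller[-1] if smaller else current_mss


def on_timeouts(current_mss, state, retransmissions_without_ack, max_timo):
    """Return (mss, state) after a run of retransmission timeouts."""
    if retransmissions_without_ack >= max_timo:
        return RESET_MSS, "search"
    return current_mss, state
```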

   5.7.3. Suspend

   If there is a timeout, and cwnd prior to the timeout was smaller than
   6 packets, then the probe strategy can enter the suspend state and
   set the MSS to 512 (1280) Bytes.  This has the effect of reducing the
   minimum data rate that TCP can stably manage.


5.8.  Processing Packet Too Big messages

   @@@ Add language re: optional processing

   When a Packet Too Big message is received, the node determines which
   path the message applies to based on the contents of the Packet Too
   Big message.  For example, if the destination address is used as the
   local representation of a path, the destination address from the
   original packet would be used to determine which path the message
   applies to.

      Note: if the original packet contained an IPv6 Routing header, the
      Routing header should be used to determine the location of the
      destination address within the original packet.  If Segments Left
      is equal to zero, the destination address is in the Destination
      Address field in the IPv6 header.  If Segments Left is greater
      than zero, the destination address is the last address
      (Address[n]) in the Routing header.

      If the original packet contained an IPv4 Source Route Option .....
      @@@@ write

   The node then uses the value in the MTU field in the Packet Too Big
   message as a tentative PMTU value, and compares the tentative PMTU to
   the existing PMTU.  If the tentative PMTU is less than the existing
   PMTU estimate, the tentative PMTU replaces the existing PMTU as the
   PMTU value for the path.

   The packetization layers must be notified about decreases in the
   PMTU.  Any packetization layer instance (for example, a TCP
   connection) that is actively using the path must be notified if the
   PMTU estimate is decreased.
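
   The comparison and notification steps can be sketched as below.  The
   dictionary of per-path estimates and the notify callback are
   hypothetical stand-ins for whatever cache and inter-layer interface
   an implementation provides.

```python
def process_packet_too_big(pmtu_by_path, path, reported_mtu, notify):
    """Apply a Packet Too Big report; True if the estimate decreased."""
    existing = pmtu_by_path[path]
    if reported_mtu >= existing:
        return False                   # not smaller: estimate unchanged
    pmtu_by_path[path] = reported_mtu  # tentative PMTU replaces estimate
    notify(path, reported_mtu)         # tell users of this path
    return True
```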


      Note: even if the Packet Too Big message contains an Original
      Packet Header that refers to a UDP packet, the TCP layer must be
      notified if any of its connections use the given path.

   Also, the instance that sent the packet that elicited the Packet Too
   Big message should be notified that its packet has been dropped, even
   if the PMTU estimate has not changed, so that it may retransmit the
   dropped data.

      Note: An implementation can avoid the use of an asynchronous
      notification mechanism for PMTU decreases by postponing
      notification until the next attempt to send a packet larger than
      the PMTU estimate.  In this approach, when an attempt is made to
      SEND a packet that is larger than the PMTU estimate, the SEND
      function should fail and return a suitable error indication.  This
      approach may be more suitable to a connectionless packetization
      layer (such as one using UDP), which (in some implementations) may
      be hard to "notify" from the ICMP layer.  In this case, the normal
      timeout-based retransmission mechanisms would be used to recover
      from the dropped packets.    @@@@ why "SEND"?

   It is important to understand that the notification of the
   packetization layer instances using the path about the change in the
   PMTU is distinct from the notification of a specific instance that a
   packet has been dropped.  The latter should be done as soon as
   practical (i.e., asynchronously from the point of view of the
   packetization layer instance), while the former may be delayed until
   a packetization layer instance wants to create a packet.
   Retransmission should be done for only those packets that are known
   to be dropped, as indicated by a Packet Too Big message.

5.9. Purging stale PMTU information

   @@@ update

   Internetwork topology is dynamic; routes change over time.  While the
   local representation of a path may remain constant, the actual
   path(s) in use may change.  Thus, PMTU information cached by a node
   can become stale.

   If the stale PMTU value is too large, this will be discovered almost
   immediately once a large enough packet is sent on the path.  No such
   mechanism exists for realizing that a stale PMTU value is too small,
   so an implementation should "age" cached values.  When a PMTU value
   has not been decreased for a while (on the order of 10 minutes), the
   PMTU estimate should be set to the MTU of the first-hop link, and the
   packetization layers should be notified of the change.  This will
   cause the complete Path MTU Discovery process to take place again.


      Note: an implementation should provide a means for changing the
      timeout duration, including setting it to "infinity".  For
      example, nodes attached to an FDDI link which is then attached to
      the rest of the Internet via a small MTU serial line are never
      going to discover a new non-local PMTU, so they should not have to
      put up with dropped packets every 10 minutes.

   An upper layer must not retransmit data in response to an increase in
   the PMTU estimate, since this increase never comes in response to an
   indication of a dropped packet.

   One approach to implementing PMTU aging is to associate a timestamp
   field with a PMTU value.  This field is initialized to a "reserved"
   value, indicating that the PMTU is equal to the MTU of the first hop
   link.  Whenever the PMTU is decreased in response to a Packet Too Big
   message, the timestamp is set to the current time.

   Once a minute, a timer-driven procedure runs through all cached PMTU
   values, and for each PMTU whose timestamp is not "reserved" and is
   older than the timeout interval:

   - The PMTU estimate is set to the MTU of the first hop link.

   - The timestamp is set to the "reserved" value.

   - Packetization layers using this path are notified of the increase.
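
   The timestamp-based aging procedure described above can be sketched
   as follows.  This is only an illustration under stated assumptions:
   the names PmtuCache, PmtuEntry, decrease, age and the notify callback
   are hypothetical, paths are identified by a plain string, and None
   stands in for the "reserved" timestamp value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

RESERVED = None           # "reserved" timestamp: PMTU equals first-hop MTU
DEFAULT_TIMEOUT = 600.0   # seconds; "on the order of 10 minutes"


@dataclass
class PmtuEntry:
    first_hop_mtu: int
    pmtu: int
    timestamp: Optional[float] = RESERVED


class PmtuCache:
    """Hypothetical per-path PMTU cache with timestamp-based aging."""

    def __init__(self, notify: Callable[[str, int], None],
                 timeout: float = DEFAULT_TIMEOUT) -> None:
        self.entries: Dict[str, PmtuEntry] = {}
        self.notify = notify      # called as notify(path, new_pmtu)
        self.timeout = timeout    # float("inf") disables aging

    def add_path(self, path: str, first_hop_mtu: int) -> None:
        # A new entry starts at the first-hop MTU with a reserved stamp.
        self.entries[path] = PmtuEntry(first_hop_mtu, first_hop_mtu)

    def decrease(self, path: str, new_pmtu: int, now: float) -> None:
        """Record a decrease caused by a Packet Too Big message."""
        entry = self.entries[path]
        if new_pmtu < entry.pmtu:
            entry.pmtu = new_pmtu
            entry.timestamp = now
            self.notify(path, new_pmtu)

    def age(self, now: float) -> None:
        """Timer-driven pass, intended to run about once a minute."""
        for path, entry in self.entries.items():
            if entry.timestamp is RESERVED:
                continue
            if now - entry.timestamp > self.timeout:
                entry.pmtu = entry.first_hop_mtu
                entry.timestamp = RESERVED
                self.notify(path, entry.pmtu)
```

   Passing timeout=float("inf") in this sketch corresponds to setting
   the timeout duration to "infinity", since no timestamp can then be
   older than the timeout interval.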

5.10.  TCP layer actions

   The TCP layer must track the PMTU for the path(s) in use by a
   connection; it should not send segments that would result in packets
   larger than the PMTU except to probe the path MTU.  A simple
   implementation could ask the IP layer for this value each time it
   creates a new segment, but this could be inefficient.  Moreover, TCP
   implementations that follow the "slow-start" congestion-avoidance
   algorithm [CONG] typically calculate and cache several other values
   derived from the PMTU.  It may be simpler to receive asynchronous
   notification when the PMTU changes, so that these variables may be
   updated.

   A TCP implementation must also store the MSS value received from its
   peer, and must not send any segment larger than this MSS, regardless
   of the PMTU.  In 4.xBSD-derived implementations, this may require
   adding an additional field to the TCP state record.
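
   As a rough illustration of how the two limits combine, the largest
   payload for an ordinary (non-probe) segment is the smaller of the
   peer's MSS and the room left in a PMTU-sized packet.  The function
   name and the default header sizes below are assumptions made for the
   sketch: a 40-octet IPv6 header with no extension headers and a
   20-octet TCP header with no options.

```python
IPV6_HEADER = 40   # octets, fixed IPv6 header without extension headers
TCP_HEADER = 20    # octets, TCP header without options


def max_segment_payload(pmtu: int, peer_mss: int,
                        ip_header: int = IPV6_HEADER,
                        tcp_header: int = TCP_HEADER) -> int:
    """Largest TCP payload to send, excluding path MTU probes."""
    # Room left for data in a packet that exactly fills the PMTU.
    pmtu_limit = pmtu - ip_header - tcp_header
    # The peer's MSS is a hard cap regardless of the PMTU.
    return max(0, min(pmtu_limit, peer_mss))
```

   For example, with a PMTU of 1280 and a peer MSS of 1440 this sketch
   yields 1220, while with a PMTU of 9000 it still yields 1440.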

   The value sent in the TCP MSS option is independent of the PMTU.
   This MSS option value is used by the other end of the connection,
   which may be using an unrelated PMTU value.  See [IPv6-SPEC] sections
   "Packet Size Issues" and "Maximum Upper-Layer Payload Size" for
   information on selecting a value for the TCP MSS option.

   When a Packet Too Big message is received, it implies that a packet
   was dropped by the node that sent the ICMP message.  It is sufficient
   to treat this as any other dropped segment, and wait until the
   retransmission timer expires to cause retransmission of the segment.
   If the Path MTU Discovery process requires several steps to find the
   PMTU of the full path, this could delay the connection by many round-
   trip times.

   @@@ Add IPv4 text

   [@@@deprecate?  Alternatively, the retransmission could be done in
   immediate response to a notification that the Path MTU has changed,
   but only for the specific connection specified by the Packet Too Big
   message.  The packet size used in the retransmission should be no
   larger than the new PMTU. ]

      Note: A packetization layer must not retransmit in response to
      every Packet Too Big message, since a burst of several oversized
      segments will give rise to several such messages and hence several
      retransmissions of the same data.  If the new estimated PMTU is
      still wrong, the process repeats, and there is an exponential
      growth in the number of superfluous segments sent.

      This means that the TCP layer must be able to recognize when a
      Packet Too Big notification actually decreases the PMTU that it
      has already used to send a packet on the given connection, and
      should ignore any other notifications.
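
   The filtering rule in the note above might be sketched as follows.
   ConnState, pmtu_in_use and should_retransmit are hypothetical names;
   the sketch assumes each connection remembers the PMTU it has already
   used to send, so that repeated Packet Too Big messages from one burst
   of oversized segments trigger at most one retransmission.

```python
from dataclasses import dataclass


@dataclass
class ConnState:
    pmtu_in_use: int   # PMTU already used to send on this connection


def should_retransmit(conn: ConnState, new_pmtu: int) -> bool:
    """True only if the notification lowers the PMTU already in use."""
    if new_pmtu < conn.pmtu_in_use:
        # Remember the lower value so that later notifications carrying
        # the same (or a larger) MTU are ignored.
        conn.pmtu_in_use = new_pmtu
        return True
    return False
```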

   Many TCP implementations incorporate "congestion avoidance" and
   "slow-start" algorithms to improve performance [CONG].  Unlike a
   retransmission caused by a TCP retransmission timeout, a
   retransmission caused by a Packet Too Big message should not change
   the congestion window.  It should, however, trigger the slow-start
   mechanism (i.e., only one segment should be retransmitted until
   acknowledgments begin to arrive again).

   TCP performance can be reduced if the sender's maximum window size is
   not an exact multiple of the segment size in use (this is not the
   congestion window size, which is always a multiple of the segment
   size).  In many systems (such as those derived from 4.2BSD), the
   segment size is often set to 1024 octets, and the maximum window size
   (the "send space") is usually a multiple of 1024 octets, so the
   proper relationship holds by default.  If Path MTU Discovery is used,
   however, the segment size may not be a sub-multiple of the send
   space, and it may change during a connection; this means that the TCP
   layer may need to change the transmission window size when Path MTU
   Discovery changes the PMTU value.  The maximum window size should be
   set to the greatest multiple of the segment size that is less than or
   equal to the sender's buffer space size.
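
   The window adjustment amounts to rounding the buffer space down to a
   whole number of segments.  A minimal sketch, in which the function
   and parameter names are chosen only for illustration:

```python
def max_window(buffer_space: int, segment_size: int) -> int:
    """Greatest multiple of segment_size that fits in buffer_space."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    # Integer division drops the partial segment at the end.
    return (buffer_space // segment_size) * segment_size
```

   With a 4096-octet send space, a 1024-octet segment size leaves the
   window at 4096, whereas a 1460-octet segment size reduces it to 2920.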

5.11.  Issues for other transport protocols

   Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
   repacketize when doing a retransmission.  That is, once an attempt is
   made to transmit a segment of a certain size, the transport cannot
   split the contents of the segment into smaller segments for
   retransmission.  In such a case, the original segment can be
   fragmented by the IP layer during retransmission.  Subsequent
   segments, when transmitted for the first time, should be no larger
   than allowed by the Path MTU.

   The Sun Network File System (NFS) uses a Remote Procedure Call (RPC)
   protocol [RPC] that, when used over UDP, in many cases will generate
   payloads that must be fragmented even for the first-hop link.  This
   might improve performance in certain cases, but it is known to cause
   reliability and performance problems, especially when the client and
   server are separated by routers.

   It is recommended that NFS implementations use Path MTU Discovery
   whenever routers are involved.  Most NFS implementations allow the
   RPC datagram size to be changed at mount-time (indirectly, by
   changing the effective file system block size), but might require
   some modification to support changes later on.

   Also, since a single NFS operation cannot be split across several UDP
   datagrams, certain operations (primarily, those operating on file
   names and directories) require a minimum payload size that, if sent
   in one packet, would exceed the PMTU.  NFS implementations should
   not reduce the payload size below this threshold, even if Path MTU
   Discovery suggests a lower value.  In this case the payload will be
   fragmented by the IP layer.
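
   The payload rule for NFS can be illustrated with a small sketch.  The
   names are hypothetical; pmtu_payload is assumed to be the largest UDP
   payload that fits in one PMTU-sized packet, and min_op_payload the
   minimum payload the operation requires.

```python
def nfs_payload_size(pmtu_payload: int, min_op_payload: int) -> int:
    """Payload size to use: never below the per-operation minimum."""
    return max(pmtu_payload, min_op_payload)


def will_fragment(payload: int, pmtu_payload: int) -> bool:
    """True if the IP layer will have to fragment this datagram."""
    return payload > pmtu_payload
```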

5.12.  Issues for tunnels

   @@@ to be written

5.13.  Diagnostic tools

   All implementations MUST include a mechanism for implementing
   diagnostic tools that do not rely on the operating system's
   implementation of Path MTU Discovery.  This requires a mechanism by
   which an application can send oversized packets that are not subject
   to the operating system's notion of the current path MTU, up to the
   physical MTU limit supported by the network interface, as well as a
   mechanism to collect any Packet Too Big messages.

5.14.  Management interface

   It is suggested that an implementation provide a way for a system
   utility program to:

   - Specify that Path MTU Discovery not be done on a given path.

   - Change the PMTU value associated with a given path.

   - Set global controls on ICMP processing.

   - Set per-connection or per-application controls on ICMP processing.

   The first of these can be accomplished by associating a flag with the
   path; when a packet is sent on a path with this flag set, the IP
   layer does not send packets larger than the IPv6 minimum link MTU.

   These features might be used to work around an anomalous situation,
   or by a routing protocol implementation that is able to obtain Path
   MTU values.

   The implementation should also provide a way to change the timeout
   period for aging stale PMTU information.


6. Normative references

 [RFC1191]  Path MTU discovery. J.C. Mogul, S.E. Deering. November 1990.
            (Format: TXT=47936 bytes) (Obsoletes RFC1063) (Status: DRAFT
            STANDARD)

 [RFC1435]  IESG Advice from Experience with Path MTU Discovery. S.
            Knowles. March 1993. (Format: TXT=2708 bytes) (Status:
            INFORMATIONAL)

 [RFC1981]  Path MTU Discovery for IP version 6. J. McCann, S. Deering,
            J. Mogul. August 1996. (Format: TXT=34088 bytes) (Status:
            PROPOSED STANDARD)

 [RFC2923]  TCP Problems with Path MTU Discovery. K. Lahey. September
            2000. (Format: TXT=30976 bytes) (Status: INFORMATIONAL)


7. Informative references

 [RFC1063]  IP MTU discovery options. J.C. Mogul, C.A. Kent, C.
            Partridge, K. McCloghrie. July 1988. (Format: TXT=27121
            bytes) (Obsoleted by RFC1191)

 [RFC1626]  Default IP MTU for use over ATM AAL5. R. Atkinson. May 1994.
            (Format: TXT=11841 bytes) (Obsoleted by RFC2225) (Status:
            PROPOSED STANDARD)

 [RFC1791]  TCP And UDP Over IPX Networks With Fixed Path MTU. T. Sung.
            April 1995. (Format: TXT=22347 bytes) (Status: EXPERIMENTAL)


8. Security considerations

   Since the MTU reported in the ICMP messages is constrained to be
   between the old MTU and the candidate MTU, this algorithm is more
   difficult to attack through fraudulent ICMP messages.
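
   The range check implied above might look like the following sketch.
   The function name is hypothetical, and the exact bounds are an
   assumption: a reported MTU is taken to be plausible when it is at
   least the old (known working) MTU and strictly below the candidate
   MTU that was being probed.

```python
def accept_reported_mtu(reported: int, old_mtu: int,
                        candidate_mtu: int) -> bool:
    """Accept an ICMP-reported MTU only inside the probed range."""
    # Assumed bounds: old_mtu is known to work, candidate_mtu did not.
    return old_mtu <= reported < candidate_mtu
```

   Under these assumptions a forged message reporting a very small MTU,
   such as 576 while the old MTU is 1280, would simply be ignored.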

   Furthermore, since this algorithm can function properly without ICMP
   messages, ICMP processing can be disabled for additional robustness
   in hostile environments.

9. IANA considerations


10. Contributors

11. Acknowledgements

   Matt Mathis and John Heffner are supported by a grant from Cisco
   Systems, Inc.

12. Authors' addresses

   Please send comments and suggestions to mtu@psc.edu.

   Matt Mathis and John Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Ave.
   Pittsburgh, PA 15213
   mathis@psc.edu
   jheffner@psc.edu

   Kevin Lahey
   Freelance
   kml@patheticgeek.net


13. Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.  Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11.  Copies of
   claims of rights made available for publication and any assurances
   of licenses to be made available, or the result of an attempt made
   to obtain a general license or permission for the use of such
   proprietary rights by implementers or users of this specification
   can be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF Executive
   Director.


14. Full copyright statement

   Copyright (C) The Internet Society June 21, 2003. All Rights
   Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

