idnits 2.17.1 

draft-ietf-pmtud-method-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1.a on line 18.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1962.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1939.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1946.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1952.

  ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure
     Acknowledgement. 

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate
     instead of verbatim RFC 3978 boilerplate.  After 6 May 2005, submission
     of drafts without verbatim RFC 3978 boilerplate is not accepted.

     The following non-3978 patterns matched text found in the document. 
     That text should be removed or replaced:

        This document is an Internet-Draft and is subject to all provisions of
        Section 3 of RFC 3667.

        By submitting this Internet-Draft, each author represents that any
        applicable patent or other IPR claims of which he or she is aware
        have been or will be disclosed, and any of which he or she
        becomes aware will be disclosed, in accordance with Section 6 of
        BCP 79.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 865 has weird spacing: '...retried  after...'

  == Line 1095 has weird spacing: '...irement  has p...'

  == Line 1293 has weird spacing: '...ill not  yield...'

  == Line 1396 has weird spacing: '...address  many ...'

  == Line 1445 has weird spacing: '...portant  that ...'

  == (1 more instance...)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 20, 2005) is 6998 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'RFC 2461' on line 633

  -- Looks like a reference, but probably isn't: 'ISOTP' on line 1173

  -- Looks like a reference, but probably isn't: 'RFC2960' on line 1479

  == Unused Reference: '10' is defined on line 1835, but no explicit
     reference was found in the text

  == Unused Reference: '11' is defined on line 1838, but no explicit
     reference was found in the text

  == Unused Reference: '13' is defined on line 1844, but no explicit
     reference was found in the text

  == Unused Reference: '15' is defined on line 1850, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 1981 (ref. '3') (Obsoleted by RFC 8201)

  ** Obsolete normative reference: RFC 2401 (ref. '5') (Obsoleted by RFC 4301)

  ** Obsolete normative reference: RFC 2414 (ref. '6') (Obsoleted by RFC 3390)

  ** Obsolete normative reference: RFC 2460 (ref. '7') (Obsoleted by RFC 8200)

  ** Obsolete normative reference: RFC 2960 (ref. '9') (Obsoleted by RFC 4960)

  -- Obsolete informational reference (is this intentional?): RFC 1063 (ref.
     '10') (Obsoleted by RFC 1191)

  -- Obsolete informational reference (is this intentional?): RFC 1626 (ref.
     '12') (Obsoleted by RFC 2225)

  == Outdated reference: A later version (-16) exists of
     draft-ietf-tsvwg-sctpimpguide-10


     Summary: 11 errors (**), 0 flaws (~~), 13 warnings (==), 12 comments
     (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                                          M. Mathis
2	Internet-Draft                                                J. Heffner
3	Expires: August 21, 2005                                             PSC
4	                                                                K. Lahey
5	                                                               Freelance
6	                                                       February 20, 2005

8	                           Path MTU Discovery
9	                       draft-ietf-pmtud-method-04

11	Status of this Memo

13	   This document is an Internet-Draft and is subject to all provisions
14	   of section 3 of RFC 3667.  By submitting this Internet-Draft, each
15	   author represents that any applicable patent or other IPR claims of
16	   which he or she is aware have been or will be disclosed, and any of
17	   which he or she become aware will be disclosed, in accordance with
18	   RFC 3668.

20	   Internet-Drafts are working documents of the Internet Engineering
21	   Task Force (IETF), its areas, and its working groups.  Note that
22	   other groups may also distribute working documents as
23	   Internet-Drafts.

25	   Internet-Drafts are draft documents valid for a maximum of six months
26	   and may be updated, replaced, or obsoleted by other documents at any
27	   time.  It is inappropriate to use Internet-Drafts as reference
28	   material or to cite them other than as "work in progress."

30	   The list of current Internet-Drafts can be accessed at
31	   http://www.ietf.org/ietf/1id-abstracts.txt.

33	   The list of Internet-Draft Shadow Directories can be accessed at
34	   http://www.ietf.org/shadow.html.

36	   This Internet-Draft will expire on August 21, 2005.

38	Copyright Notice

40	   Copyright (C) The Internet Society (2005).

42	Abstract

44	   This document describes a robust method for Path MTU Discovery that
45	   relies on TCP or some other Packetization Layer to probe an Internet
46	   path with progressively larger packets.  This method is described as
47	   an extension to RFC 1191 and RFC 1981, which specify ICMP based Path
48	   MTU Discovery for IP versions 4 and 6, respectively.

50	   The general strategy of the new algorithm is to start with a small
51	   MTU and search upward, testing successively larger MTUs by probing
52	   with single packets.  If the probe is successfully delivered and
53	   satisfies a subsequent verification phase then the MTU is raised.  If
54	   the probe is lost, it is treated as an MTU limitation and not as a
55	   congestion signal.

57	   There are several options for integrating PLPMTUD with classical path
58	   MTU discovery.  PLPMTUD can be minimally configured to perform ICMP
59	   black hole recovery to increase the robustness of classical path MTU
60	   discovery, or ICMP processing can be completely disabled, and PLPMTUD
61	   can completely replace classical path MTU discovery.

63	   In the latter configuration, PLPMTUD exactly parallels congestion
64	   control.  An end-to-end transport protocol adjusts non-protocol
65	   properties of the data stream (window size or packet size) while
66	   using packet losses to deduce the appropriateness of the adjustments.
67	   This technique seems to be more philosophically consistent with the
68	   end-to-end principle than relying on ICMP messages containing
69	   transcribed headers of multiple protocol layers.

71	Table of Contents

73	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  5
74	     1.1   Revision History . . . . . . . . . . . . . . . . . . . . .  5
75	       1.1.1   Changes since version -02, July 19th 2004 (IETF 60)  .  6
76	   2.  Overview . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
77	   3.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  9
78	   4.  Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 12
79	   5.  Layering . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
80	     5.1   Accounting for Header Sizes  . . . . . . . . . . . . . . . 14
81	     5.2   Storing PMTU information . . . . . . . . . . . . . . . . . 15
82	     5.3   Accounting for IPsec . . . . . . . . . . . . . . . . . . . 16
83	     5.4   Measuring path MTU . . . . . . . . . . . . . . . . . . . . 16
84	   6.  The Probing Sequence and Lower Layers  . . . . . . . . . . . . 17
85	     6.1   Normal sequence of events to raise the MTU . . . . . . . . 17
86	     6.2   Processing MTU Indications . . . . . . . . . . . . . . . . 18
87	       6.2.1   Processing ICMP PTB messages . . . . . . . . . . . . . 18
88	       6.2.2   Packetization Layer Detects Lost Packets . . . . . . . 19
89	       6.2.3   Packetization Layer Retransmission Timeout . . . . . . 21
90	       6.2.4   Packetization Layer Full Stop Timeout  . . . . . . . . 21
91	     6.3   Probing Intervals  . . . . . . . . . . . . . . . . . . . . 22
92	     6.4   Host fragmentation . . . . . . . . . . . . . . . . . . . . 24
93	     6.5   Multicast  . . . . . . . . . . . . . . . . . . . . . . . . 25
94	   7.  Common Packetization Properties  . . . . . . . . . . . . . . . 25
95	     7.1   Mechanism to detect loss . . . . . . . . . . . . . . . . . 25
96	     7.2   Generating Probes  . . . . . . . . . . . . . . . . . . . . 26
97	     7.3   Mechanism to support provisional MTUs  . . . . . . . . . . 26
98	     7.4   Selecting the initial MPS  . . . . . . . . . . . . . . . . 27
99	     7.5   Common MPS Search Strategy . . . . . . . . . . . . . . . . 28
100	       7.5.1   Fine Scans . . . . . . . . . . . . . . . . . . . . . . 29
101	     7.6   Congestion Control and Window Management . . . . . . . . . 30
102	   8.  Specific Packetization Layers  . . . . . . . . . . . . . . . . 31
103	     8.1   Probing method using TCP . . . . . . . . . . . . . . . . . 31
104	     8.2   Probing method using SCTP  . . . . . . . . . . . . . . . . 32
105	     8.3   Probing method for IP fragmentation  . . . . . . . . . . . 34
106	     8.4   Probing method for applications  . . . . . . . . . . . . . 35
107	   9.  Operational Integration  . . . . . . . . . . . . . . . . . . . 36
108	     9.1   Interoperation with prior algorithms . . . . . . . . . . . 37
109	     9.2   Operation over subnets with dissimilar MTUs  . . . . . . . 37
110	     9.3   Interoperation with tunnels  . . . . . . . . . . . . . . . 38
111	     9.4   Diagnostic tools . . . . . . . . . . . . . . . . . . . . . 38
112	     9.5   Management interface . . . . . . . . . . . . . . . . . . . 39
113	   10.   References . . . . . . . . . . . . . . . . . . . . . . . . . 40
114	   10.1  Normative References . . . . . . . . . . . . . . . . . . . . 40
115	   10.2  Informative References . . . . . . . . . . . . . . . . . . . 40
116	       Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 41
117	   A.  Security Considerations  . . . . . . . . . . . . . . . . . . . 41
118	   B.  IANA considerations  . . . . . . . . . . . . . . . . . . . . . 42
119	   C.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 42
120	       Intellectual Property and Copyright Statements . . . . . . . . 43

122	1.  Introduction

124	   This document describes a method for Packetization Layer Path MTU
125	   Discovery (PLPMTUD) which is an extension to existing Path MTU
126	   discovery methods as described in RFC 1191 [2] and RFC 1981 [3].  The
127	   proper MTU is determined by starting with small packets and probing
128	   with successively larger packets.  The bulk of the algorithm is
129	   implemented above IP, in the transport layer (e.g.  TCP) or other
130	   "Packetization Protocol" that is responsible for determining packet
131	   boundaries.

133	   This document draws heavily on RFC 1191 [2] and RFC 1981 [3] for
134	   terminology, ideas and some of the text.

136	   This document describes methods to discover the path MTU using
137	   features of existing protocols.  The methods apply to IPv4 and IPv6,
138	   and many transport protocols.  They do not require cooperation from
139	   the lower layers (except that they must be consistent about what
140	   packet sizes are acceptable) or the far node.  Variants in
141	   implementations will not cause interoperability problems.

143	   The methods described in this document are carefully designed to
144	   maximize robustness in the presence of less than ideal
145	   implementations of other protocols or Internet components.

147	   For the sake of clarity we uniformly prefer TCP and IPv6 terminology.
148	   In the terminology section we also present the analogous IPv4 terms
149	   and concepts for the IPv6 terminology.  In a few situations we
150	   describe specific details that are different between IPv4 and IPv6.

152	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
153	   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
154	   document are to be interpreted as described in RFC 2119 [4].

156	   This draft is a product of the Path MTU Discovery (pmtud) working
157	   group of the IETF.  Please send comments and suggestions to
158	   pmtud@ietf.org.  Interim drafts and other useful information will be
159	   posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .

161	1.1  Revision History

163	   These are all recent substantive changes, in reverse chronological
164	   order.  This section will be removed prior to publication as an RFC.
165	   Note that there are still some missing details that need to be
166	   resolved.  These are flagged by @@@@.  None of the missing details
167	   are serious.

169	1.1.1  Changes since version -02, July 19th 2004 (IETF 60)

171	   Many minor updates throughout the document.

173	   Added a section describing the interactions between PLPMTUD and
174	   congestion control.

176	   Removed a difficult to implement requirement for future data to
177	   transmit.

179	   Added "IP Fragmentation" and "Application protocol" as Packetization
180	   Layers.

182	   Clarified interactions between TCP SACK and MTU.

184	   Updated SCTP section to reflect new probing method using "PAD
185	   chunks".

187	   Distilled the protocol specific material into separate subsections
188	   for each protocol.

190	   Added a section on common requirements and functions for all
191	   Packetization Layers.  More accurately characterized the
192	   "bidirectional" (and other) requirements of the PL protocol.  Updated
193	   the search strategy in this new section.

195	   Changed "ICMP can't fragment" and "packet too big" to uniformly use
196	   "ICMP PTB message" everywhere.

198	   Added Stanislav Shalunov's observation that PLPMTUD parallels
199	   congestion control.

201	   Better described the range of interoperability with classical pMTUd
202	   in the introduction.

204	   Removed vague language about "not being a protocol" and "excessive
205	   Loss".

207	   Slightly redefined flow: the granularity of PLPMTUD within a path.

209	   Many English NITs and clarifications per Gorry Fairhurst and others.
210	   Passes strict xml2rfc checking.

212	   Added a paragraph encouraging interface MTUs that are optimal for
213	   the NIC, rather than standard for the media.

215	   Added a revision history section.

217	2.  Overview

219	   This document describes a method for TCP or other packetization
220	   protocols to dynamically discover the MTU of a path without relying
221	   on explicit signals from the network.  These procedures are
222	   applicable to TCP and other transport- or application-level protocols
223	   that are responsible for choosing packet boundaries (e.g.  segment
224	   sizes) and have an acknowledgement structure that delivers to the
225	   sender accurate and timely indications of which packets were lost.

227	   The general strategy of the new procedure is for the packetization
228	   layer to find an appropriate path MTU by probing with progressively
229	   larger packets.  A "probe sequence" consists of a single "probe
230	   packet", which initiates a "probe phase", followed by a "transition
231	   phase" and a "verification phase".

233	   If a probe packet is successfully delivered, then the path MTU is
234	   provisionally raised to the probe size during the transition phase.
235	   If there are no losses during the subsequent verification phase, then
236	   the path MTU is confirmed (verified) to be at least as large as the
237	   provisional MTU.  Each conclusive probe sequence narrows the MTU
238	   search range, converging toward the true path MTU.
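
The narrowing of the search range can be sketched as follows.  This is
a minimal illustration only: the names SearchRange, search_low,
search_high and simulate are hypothetical, and plain bisection is
swapped in for the actual search strategy, which is the subject of
section 7.5.

```python
from dataclasses import dataclass


@dataclass
class SearchRange:
    """Hypothetical per-flow bookkeeping for the MTU search."""
    search_low: int   # largest MTU verified so far, in bytes
    search_high: int  # smallest size assumed too large (exclusive bound)

    def next_probe_size(self) -> int:
        # Bisection is an assumption made for illustration only.
        return (self.search_low + self.search_high) // 2

    def record(self, probe_size: int, verified: bool) -> None:
        # Only conclusive results belong here; an inconclusive probe
        # leaves the range unchanged and is retried later.
        if verified:
            self.search_low = max(self.search_low, probe_size)
        else:
            self.search_high = min(self.search_high, probe_size)

    def converged(self) -> bool:
        return self.search_high - self.search_low <= 1


def simulate(true_path_mtu: int, low: int, high: int) -> int:
    """Toy driver: a probe is verified iff it fits the hidden path MTU."""
    rng = SearchRange(low, high)
    while not rng.converged():
        size = rng.next_probe_size()
        rng.record(size, verified=size <= true_path_mtu)
    return rng.search_low
```

In this toy model every probe is conclusive; a real flow would also
see inconclusive probes, which do not move either bound.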

240	   The verification phase is used to detect some situations where
241	   raising the MTU raises the packet loss rate.  For example, if a link
242	   is striped across multiple physical channels with inconsistent MTUs,
243	   it is possible that a probe will be delivered even if it is too large
244	   for some of the physical channels.  In such cases raising the path
245	   MTU to the probe size will cause severe periodic loss and abysmal
246	   performance.  The verification phase is designed to prevent the path
247	   MTU from being raised if doing so causes excessive packet losses.

249	   A conservative implementation of PLPMTUD would use a full round trip
250	   time for the verification phase.  In this case the entire probe
251	   sequence takes three full round trip times.  It takes one round trip
252	   for the probe phase, during which the probe propagates to the far
253	   node and an acknowledgment is returned.  The second round trip is the
254	   transition phase, during which data packets using the provisional
255	   MTU propagate to the far node and are acknowledged.  During the third
256	   and final round trip time, it is verified that raising the MTU did
257	   not cause any additional losses.
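
The three round trips of a conservative probe sequence can be sketched
as a small state machine.  This is a simplified model under stated
assumptions: each phase lasts exactly one round trip, the caller
reports whether each round trip was free of loss, and the names Phase
and run_probe_sequence are hypothetical.

```python
from enum import Enum, auto


class Phase(Enum):
    PROBE = auto()         # probe in flight, waiting for its acknowledgment
    TRANSITION = auto()    # data at the provisional MTU propagates and is acked
    VERIFICATION = auto()  # watching for additional losses
    VERIFIED = auto()      # provisional MTU confirmed


_NEXT = {
    Phase.PROBE: Phase.TRANSITION,
    Phase.TRANSITION: Phase.VERIFICATION,
    Phase.VERIFICATION: Phase.VERIFIED,
}


def run_probe_sequence(round_trips_clean):
    """Advance one phase per loss-free round trip.

    round_trips_clean: iterable of bool, one entry per round trip,
    True when no loss or timeout was seen in that round trip.
    Returns (phase reached, round trips consumed).  A loss stops the
    sequence in the phase where it occurred.
    """
    phase = Phase.PROBE
    rtts = 0
    for clean in round_trips_clean:
        rtts += 1
        if not clean:
            return phase, rtts
        phase = _NEXT[phase]
        if phase is Phase.VERIFIED:
            break
    return phase, rtts
```

With three clean round trips the sketch reaches VERIFIED after three
round trip times, matching the conservative timing described above.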

259	   The isolated loss of a probe packet (with or without an ICMP PTB
260	   message) is treated as an indication of an MTU limit, and not as a
261	   congestion indicator.  In this case alone, the packetization protocol
262	   is permitted to retransmit any missing data without adjusting the
263	   congestion window.

265	   If there is a timeout, or additional packets are lost during any of
266	   the three phases, the loss is treated as a congestion indication as
267	   well as an indication of some sort of failure of the PLPMTUD process.
268	   The congestion indication is treated like any other congestion
269	   indication: window or rate adjustments are mandatory per the relevant
270	   congestion control standards [8].  Probing can resume after a delay
271	   which is determined by the nature of the detected failure.
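
The distinction between an isolated probe loss and any other loss can
be sketched as a single decision function.  The names Reaction and
react_to_probe are hypothetical, and the inputs are assumed to come
from the Packetization Layer's own loss detection.

```python
from enum import Enum


class Reaction(Enum):
    RAISE_PROVISIONAL_MTU = "probe delivered; enter transition phase"
    RETRANSMIT_ONLY = "MTU limit; retransmit without window reduction"
    CONGESTION_RESPONSE = "congestion; adjust window, delay next probe"


def react_to_probe(probe_lost: bool, other_losses: int,
                   timeout: bool) -> Reaction:
    # A timeout or any loss beyond the probe itself is a congestion
    # indication and also a failure of the PLPMTUD process.
    if timeout or other_losses > 0:
        return Reaction.CONGESTION_RESPONSE
    # Only the isolated loss of the probe is read as an MTU limit.
    if probe_lost:
        return Reaction.RETRANSMIT_ONLY
    return Reaction.RAISE_PROVISIONAL_MTU
```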

273	   The most likely (and least serious) PLPMTUD failure is the link
274	   experiencing congestion related losses while probing.  In this case
275	   it is appropriate to retry a probe of the same size as soon as the
276	   packetization layer has fully adapted to the congestion and recovered
277	   from the losses.

279	   In other cases, additional losses or timeouts indicate problems with
280	   the link or packetization layer.  In these situations it is desirable
281	   to use longer delays depending on the severity of the error.

283	   There is a range of options for integrating PLPMTUD with classical
284	   path MTU discovery.  In the most conservative configuration, from a
285	   deployment point of view, classical path MTU discovery is fully
286	   functional (all correct ICMP PTB messages are unconditionally
287	   processed) and PLPMTUD is invoked only to recover from ICMP black
288	   holes.

290	   In the most conservative configuration, from a security point of
291	   view, all ICMP PTB messages are ignored, and PLPMTUD is the sole
292	   method used to discover the path MTU.  This protects against
293	   malicious or erroneous ICMP PTB messages which might otherwise cause
294	   MTU discovery to arrive at the incorrect MTU for a path.

296	   Note that in the latter configuration, PLPMTUD parallels congestion
297	   control.  An end-to-end transport protocol adjusts non-protocol
298	   properties of the data stream (window size or packet size) while
299	   using packet losses to deduce the appropriateness of the adjustments.
300	   This technique seems to be more philosophically consistent with the
301	   end-to-end principle of the Internet than relying on ICMP messages
302	   containing transcribed headers of multiple protocol layers.

304	   We advocate a compromise, in which ICMP PTB messages are only
305	   processed in conjunction with probing (described in section 6.2.1),
306	   and Packetization Layer timeouts (described in section 6.2.3), and
307	   ignored in all other situations.
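
The three ways of handling ICMP PTB messages discussed above can be
sketched as a policy check.  The names PtbPolicy and
should_process_ptb are hypothetical, and validating that a PTB message
is well formed is assumed to happen elsewhere.

```python
from enum import Enum


class PtbPolicy(Enum):
    CLASSICAL = "process every correct PTB message"
    IGNORE_ALL = "ignore every PTB message; PLPMTUD only"
    COMPROMISE = "process PTB only while probing or after a timeout"


def should_process_ptb(policy: PtbPolicy, probe_outstanding: bool,
                       timeout_pending: bool) -> bool:
    if policy is PtbPolicy.CLASSICAL:
        return True   # most conservative from a deployment point of view
    if policy is PtbPolicy.IGNORE_ALL:
        return False  # most conservative from a security point of view
    # Compromise: accept PTB messages only in conjunction with probing
    # or with a Packetization Layer timeout.
    return probe_outstanding or timeout_pending
```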

309	   Most of the difficulty in implementing PLPMTUD arises because it
310	   needs to be implemented in several different places within a single
311	   node.  In general, each packetization protocol needs to have its own
312	   implementation of PLPMTUD.  Furthermore, the natural mechanism to
313	   share path MTU information between concurrent or subsequent
314	   connections over the same path is a path information cache in the IP
315	   layer.  The various packetization protocols need to have the means to
316	   access and update the shared cache in the IP layer.  This memo
317	   describes PLPMTUD in terms of its primary subsystems without fully
318	   describing how they are assembled into a complete implementation.

320	   Section 3 provides a complete glossary of terms.

322	   Relatively few details of PLPMTUD affect interoperability with other
323	   standards or Internet protocols.  These details are specified in
324	   RFC2119 standards language in section 4.  The vast majority of the
325	   implementation details described in this document are recommendations
326	   based on experiences with earlier versions of path MTU discovery.
327	   These recommendations are motivated by a desire to maximize
328	   robustness of PLPMTUD in the presence of less than ideal network
329	   conditions as they exist in the field.

331	   Section 5 describes how to partition PLPMTUD into layers, and how to
332	   manage the "path information cache" in the IP layer.

334	   Section 6 describes the details of a probe sequence, including how
335	   to process MTU and error indications, necessary to raise the MTU by
336	   one step.

338	   Section 7 describes the general search strategy and Packetization
339	   Layer features needed to implement PLPMTUD.

341	   Section 8 discusses implementation details for specific protocols,
342	   including TCP.

344	   Section 9 describes ways to minimize deployment problems for
345	   PLPMTUD, by including a number of good management features.  It also
346	   addresses some potentially serious interactions with nodes that do
347	   not honor the IPv4 DF bit.

349	3.  Terminology

351	   We use the following terms in this document:

353	   IP: Either IPv4 [1] or IPv6 [7].

355	   Node: A device that implements IP.

357	   Router: A node that forwards IP packets not explicitly addressed to
358	      itself.

360	   Host: Any node that is not a router.

362	   Upper layer: A protocol layer immediately above IP.  Examples are
363	      transport protocols such as TCP and UDP, control protocols such as
364	      ICMP, routing protocols such as OSPF, and Internet or lower-layer
365	      protocols being "tunneled" over (i.e., encapsulated in) IP such as
366	      IPX, AppleTalk, or IP itself.

368	   Link: A communication facility or medium over which nodes can
369	      communicate at the link layer, i.e., the layer immediately below
370	      IP.  Examples are Ethernets (simple or bridged); PPP links; X.25,
371	      Frame Relay, or ATM networks; and Internet (or higher) layer
372	      "tunnels", such as tunnels over IPv4 or IPv6.  Occasionally we use
373	      the slightly more general term "lower layer" for this concept.

375	   Interface: A node's attachment to a link.

377	   Address: An IP-layer identifier for an interface or a set of
378	      interfaces.

380	   Packet: An IP header plus payload.

382	   MTU: Maximum Transmission Unit, the size in bytes of the largest IP
383	      packet, including the IP header and payload, that can be
384	      transmitted on a link or path.  Note that this could more properly
385	      be called the IP MTU, to be consistent with how other standards
386	      organizations use the acronym MTU.

388	   Link MTU: The Maximum Transmission Unit, i.e., maximum IP packet size
389	      in bytes, that can be conveyed in one piece over a link.  Beware
390	      that this definition differs from the definition used by other
391	      standards organizations.

393	      For IETF documents, link MTU is uniformly defined as the IP MTU
394	      over the link.  This includes the IP header, but excludes link
395	      layer headers and other framing which is not part of IP or the IP
396	      payload.

398	      Be aware that other standards organizations generally define link
399	      MTU to include the link layer headers.

401	   Path: The set of links traversed by a packet between a source node
402	      and a destination node.

404	   Path MTU, or pMTU: The minimum link MTU of all the links in a path
405	      between a source node and a destination node.

407	   Classical path MTU discovery: Process described in RFC 1191 and RFC
408	      1981, in which nodes rely on ICMP "Packet Too Big" (PTB) messages
409	      to learn the MTU of a path.

411	   Packetization Layer: The layer of the network stack which segments
412	      data into packets.

414	   PLPMTUD: Packetization Layer Path MTU Discovery, the method described
415	      in this document, which is an extension to classical PMTU
416	      discovery.

418	   PTB (Packet Too Big) message: An ICMP message reporting that an IP
419	      packet is too large to forward.  This is the IPv6 term that
420	      corresponds to the IPv4 "ICMP Can't fragment" message.

422	   Flow: A context in which MTU discovery algorithms can be invoked.
423	      This is naturally an instance of the packetization protocol, e.g.
424	      one side of a TCP connection.

426	   MPS: The maximum IP payload size available over a specific path.
427	      Typically this is the path MTU minus the IP header.  As an
428	      example, this is the maximum TCP packet size, including TCP
429	      payload and headers but not including IP headers.  This has also
430	      been called the "Layer 3 MTU".

432	   MSS: The TCP Maximum Segment Size, the maximum payload size available
433	      to the TCP layer.  This is typically the path MPS minus the size
434	      of the TCP header.

436	   Probe packet: A packet which is being used to test a path for a
437	      larger MTU.

439	   Probe size: The size of a packet being used to probe for a larger
440	      MTU.

442	   Successful probe: The probe packet was delivered through the network
443	      and acknowledged by the Packetization Layer on the far node.

445	   Inconclusive probe: The probe packet was not delivered, but there
446	      were other lost packets close enough to the probe that it cannot
447	      be presumed that the probe was lost because it was larger than the
448	      path MTU.  By implication the probe might have been lost due to
449	      something other than MTU (such as congestion), so the results are
450	      inconclusive.

452	   Failed probe: The probe packet was not delivered and there were no
453	      other lost packets close to the probe.  This is taken as an
454	      indication that the probe was larger than the path MTU, and future
455	      probes should be smaller.

457	   Errored probe: There were losses or timeouts during the verification
458	      phase which suggest a potentially disruptive failure or network
459	      condition.  These are generally retried only after substantially
460	      longer intervals.

462	   Probe gap: The payload data that will be lost and need to be
463	      retransmitted if the probe is not delivered.

465	   Probe phase: The interval (time or protocol events) between when a
466	      probe is sent and when it is determined that the probe
467	      succeeded, failed or was inconclusive.

469	   Verification phase: An additional interval during which the new path
470	      MTU is considered provisional.  Packet losses or timeouts are
471	      treated as an indication that there may be a problem with the
472	      provisional MTU.

474	   Transition phase: The interval between the probe phase and the
475	      verification phase, during which packets using the new MTU
476	      propagate to the far node and the acknowledgment propagates back.

478	   Probe sequence: The sequence of events to raise the MTU by one step,
479	      starting with the transmission of a probe packet followed by
480	      probe, transition and verification phases.

482	   Search strategy: The heuristics used to choose successive probe sizes
483	      to converge on the proper path MTU, as described in section 7.5.

485	   Full stop timeout: A timeout where none of the packets transmitted
486	      after some event are acknowledged by the receiver, including any
487	      retransmissions.  This is taken as an indication of some failure
488	      condition in the network, such as a routing change onto a link
489	      with a smaller MTU.  For the sake of PLPMTUD we suggest the
490	      following definition of a full stop timeout: either the loss of
491	      one full window of data and at least one retransmission, or the
492	      loss of at least 6 consecutive packets including at least 2
493	      retransmissions (along with two retransmission timer
494	      expirations).  [@@@ This probably needs some experimentation.]
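
The relationship between the size terms defined above (link MTU, path
MTU, MPS and MSS) can be illustrated with a short calculation.  The
function names are hypothetical, and the header sizes assume IPv4 and
TCP headers without options and an IPv6 header without extension
headers.

```python
IPV4_HEADER = 20  # bytes, no IP options (assumption)
IPV6_HEADER = 40  # bytes, no extension headers (assumption)
TCP_HEADER = 20   # bytes, no TCP options (assumption)


def path_mtu(link_mtus):
    """Path MTU: the minimum link MTU of all links in the path."""
    return min(link_mtus)


def mps(pmtu, ip_header):
    """MPS: the path MTU minus the IP header."""
    return pmtu - ip_header


def mss(path_mps, tcp_header=TCP_HEADER):
    """MSS: the path MPS minus the TCP header."""
    return path_mps - tcp_header


pmtu = path_mtu([1500, 1480, 9000])       # 1480
ipv4_mss = mss(mps(pmtu, IPV4_HEADER))    # 1480 - 20 - 20 = 1440
ipv6_mss = mss(mps(pmtu, IPV6_HEADER))    # 1480 - 40 - 20 = 1420
```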

496	4.  Requirements

498	   All Internet nodes SHOULD implement PLPMTUD in order to discover and
499	   take advantage of the largest MTU supported along the Internet path.

501	   Links MUST NOT deliver packets that are larger than their MTU.  Links
502	   that have parametric limitations (e.g.  MTU bounds due to limited
503	   clock stability) MUST include explicit mechanisms to consistently
504	   reject packets that might otherwise be nondeterministically
505	   delivered.

507	   All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
508	   functionality.  All fragmentation SHOULD be done on the host, and all
509	   IPv4 packets, including fragments, SHOULD have the DF bit set such
510	   that they will not be fragmented (again) in the network.  See Section
511	   6.4.

513	   The requirements below only apply to those implementations that
514	   include PLPMTUD.

516	   To use PLPMTUD a Packetization Layer MUST have a loss reporting
517	   mechanism that provides the sender with timely and accurate
518	   indications of which packets were lost in the network.

520	   Normal congestion control algorithms MUST remain in effect under all
521	   conditions except when only an isolated probe packet is detected as
522	   lost.  In this case alone the normal congestion (window or data rate)
523	   reduction MAY be suppressed.  If any other data loss is detected,
524	   standard congestion control MUST take place.

526	   Suppressed congestion control (as above) MUST be rate limited such
527	   that it occurs less frequently than the worst case loss rate for TCP
528	   congestion control at a comparable data rate over the same path (i.e.
529	   less than the "TCP-friendly" loss rate [@@]).  This SHOULD be
530	   enforced by requiring a minimum headway between a suppressed
531	   congestion adjustment (due to a failed probe) and the next attempted
532	   probe, which is equal to one round trip time for each packet
533	   permitted by the congestion window.  Alternatively this may be
534	   enforced by not suppressing congestion control if a 2nd probe is lost
535	   too soon after the 1st lost probe.  This and other issues relating to
536	   congestion control are discussed in section 7.6.

538	   Whenever the MTU is raised, the congestion state variables MUST be
539	   rescaled so as not to raise the window size in bytes (or data rate in
540	   bytes per second).

542	   Whenever the MTU is reduced (e.g., when processing ICMP PTB messages)
543	   the congestion state variables SHOULD be rescaled so as not to raise
544	   the window size in packets.
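
   As an illustration of the two rescaling rules above, the following
   Python sketch shows one way a sender that keeps its congestion window
   in packets might rescale it.  The function name, its parameters and
   the one-packet floor are assumptions made for this example; they are
   not specified by this document.

```python
def rescale_cwnd_packets(cwnd_packets: int, old_size: int, new_size: int) -> int:
    """Rescale a packet-counted congestion window after a packet size change.

    old_size and new_size are packet sizes in bytes (hypothetical parameters).
    """
    if cwnd_packets < 1 or old_size <= 0 or new_size <= 0:
        raise ValueError("window and packet sizes must be positive")
    if new_size > old_size:
        # MTU raised: keep the window in bytes from growing, so round down.
        # Assumption: never drop below one packet, so the sender can still
        # transmit; at that floor the byte window can exceed the old value.
        return max(1, (cwnd_packets * old_size) // new_size)
    # MTU reduced (or unchanged): do not raise the window in packets.
    return cwnd_packets
```

   For example, a window of 30 packets at 1000 bytes becomes 20 packets
   at 1500 bytes, while a reduction from 1500 to 1000 bytes leaves the
   packet count unchanged.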

546	   If PLPMTUD updates the MTU for a particular path, all Packetization
547	   Layer sessions that share the path representation SHOULD be notified
548	   to make use of the new MTU and make the required congestion
549	   adjustments.

551	   All implementations MUST include a mechanism that allows diagnostic
552	   tools to be built without relying on the operating system's
553	   implementation of path MTU discovery.  This specifically requires the
554	   ability to send packets that are larger than the known MTU for the
555	   path, and to collect any resultant ICMP error messages.  See section
556	   9.4 for further discussion of MTU diagnostics.

558	5.  Layering

560	   Packetization Layer Path MTU Discovery is most easily implemented by
561	   splitting its functions between layers.  The IP layer is the best
562	   place to keep shared state, collect the ICMP messages, track IP
563	   header sizes and manage MTU information provided by the link layer
564	   interfaces.  However the procedures that PLPMTUD uses for probing,
565	   verification and scanning for the path MTU are very tightly coupled
566	   to the data recovery and congestion control state machines in the
567	   Packetization Layers.  The most difficult part of implementing
568	   PLPMTUD is properly splitting the implementation between the layers.

570	   Note that this layering approach is consistent with the advice in the
571	   current PMTUD specifications [2][3].  Many implementations of
572	   classical PMTU Discovery are already split along these same layers.

574	5.1  Accounting for Header Sizes

576	   Early implementation of PLPMTUD revealed that it is critically
577	   important to have a clean mechanism for accounting for header sizes
578	   at all layers.  This is because each Packetization Layer does its
579	   calculations in its own natural data units, which are almost always a
580	   reflection of the service that the Packetization Layer provides to
581	   the application or other upper layers.  For example, TCP naturally
582	   performs all of its calculations in terms of sequence numbers and
583	   segment sizes.  However, the MTU size being probed, MTU size reported
584	   in ICMP PTB messages, etc., are measures of full packets, which not
585	   only include the TCP payload (measured in sequence space) but also
586	   include fixed TCP and IP headers, and may include IPv6 extension
587	   headers or IPv4 options, TCP options and even IPsec AH or ESP
588	   headers.

590	   PLPMTUD requires frequent translation between these two domains: the
591	   Packetization Layer's natural data unit and full IP packet sizes.
592	   While there are a number of possible ways to accurately implement
593	   dual size measures, our experience has been that it is best if the
594	   IP layer and the Packetization Layer communicate across their
595	   boundary in terms of the IP Maximum Payload Size or MPS.  The MPS is
596	   the only size measure that is common to both layers because it
597	   exactly matches the boundary between the layers.  The IP Layer is
598	   responsible for adding or deducting its own headers when translating
599	   between MTU and MPS.  Likewise the Packetization Layer is responsible
600	   for adding or deducting its own headers when doing calculations in
601	   its natural data units.  For example, the MPS and TCP's MSS differ by
602	   the TCP header size.
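
   The translation between the two domains can be sketched as follows.
   This is a minimal Python illustration that assumes the fixed header
   sizes of IPv4 (20 bytes), IPv6 (40 bytes) and TCP (20 bytes); the
   function names and the option-length parameters are hypothetical and
   are not defined by this document.

```python
IPV4_HEADER = 20  # fixed IPv4 header, without options
IPV6_HEADER = 40  # fixed IPv6 header, without extension headers
TCP_HEADER = 20   # fixed TCP header, without options


def mtu_to_mps(mtu: int, ip_header: int = IPV4_HEADER, ip_extra: int = 0) -> int:
    """IP layer: deduct its own headers (plus options or extension headers)."""
    return mtu - ip_header - ip_extra


def mps_to_mtu(mps: int, ip_header: int = IPV4_HEADER, ip_extra: int = 0) -> int:
    """IP layer: add its own headers back to obtain a full packet size."""
    return mps + ip_header + ip_extra


def mps_to_tcp_payload(mps: int, tcp_options: int = 0) -> int:
    """Packetization Layer (TCP here): deduct its own header and options."""
    return mps - TCP_HEADER - tcp_options
```

   With a 1500 byte MTU over IPv4 the MPS is 1480 bytes, and TCP can
   carry 1460 bytes of payload without options, or 1448 bytes when 12
   bytes of TCP options are present.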

604	   Be aware that a casual reading of this document might give the
605	   impression that MTU, MPS and Packetization Layer data size (e.g.  TCP
606	   MSS) are used interchangeably.  They are not.  Our choice of
607	   terminology is consistent with the protocol layer being discussed in
608	   the surrounding context.  All implementors must pay attention to the
609	   distinction between these terms and include all necessary
610	   conversions, even when they are not explicitly indicated in this
611	   document.

613	5.2  Storing PMTU information

615	   The IP layer is the best place to store cached MPS values and other
616	   shared state such as MTU values reported by ICMP PTB messages.
617	   Ideally this shared state should be associated with a specific path
618	   traversed by packets exchanged between the source and destination
619	   nodes.  However, in most cases a node will not have enough
620	   information to completely and accurately identify such a path.
621	   Rather, a node must associate a MPS value with some local
622	   representation of a path.  It is left to the implementation to select
623	   the local representation of a path.

625	   An implementation could use the destination address as the local
626	   representation of a path.  The MPS value associated with a
627	   destination would be the minimum MPS learned across the set of all
628	   paths in use to that destination.  The set of paths in use to a
629	   particular destination is expected to be small, in many cases
630	   consisting of a single path.  This approach will result in the use of
631	   optimally sized packets on a per-destination basis.  This approach
632	   integrates nicely with the conceptual model of a host as described in
633	   [RFC 2461]: an MPS value could be stored with the corresponding entry
634	   in the destination cache.  However, NAT and other forms of middle
635	   boxes may exhibit differing MTUs at a single IP address.

637	   Note that network or subnet numbers are not suitable to use as
638	   representations of a path, because there is not a general mechanism
639	   to determine the network mask at the remote host.

641	   If IPv6 flows are in use, an implementation could use the IPv6 flow
642	   id [7][14] as the local representation of a path.  Packets sent to a
643	   particular destination but belonging to different flows may use
644	   different paths, with the choice of path depending on the flow id.
645	   This approach will result in the use of optimally sized packets on a
646	   per-flow basis, providing finer granularity than MPS values
647	   maintained on a per-destination basis.

649	   For source routed packets, i.e.  packets containing an IPv6 routing
650	   header, or IPv4 LSRR or SSRR options, the source route may further
651	   qualify the local representation of a path.  An implementation could
652	   use source route information in the local representation of a path.

654	5.3  Accounting for IPsec

656	   This document does not take a stance on the placement of IPsec, which
657	   logically sits between IP and the Packetization Layer.  As far as
658	   PLPMTUD is concerned IPsec can be treated either as part of IP or as
659	   part of the Packetization Layer, as long as the accounting is
660	   consistent within the implementation.  If IPsec is treated as part of
661	   the IP layer, then each security association to a remote node may
662	   need to be treated as a separate path, i.e., the security
663	   association is used to represent the path.  If IPsec is treated as
664	   part of the packetization layer, the IPsec header size has to be
665	   included in the Packetization Layer's header size calculations.

667	5.4  Measuring path MTU

669	   This memo uses the concept of a "flow" to define the scope of the
670	   path MTU discovery algorithms.  For many implementations, a flow
671	   would naturally correspond to an instance of each protocol, i.e.,
672	   each connection or session.  In such implementations the algorithms
673	   described in this document are performed within each session for each
674	   protocol.  The observed MPS can be shared between different flows
675	   sharing a common path representation.

677	   Alternatively, PLPMTUD could be implemented such that the complete
678	   PLPMTUD state is associated with the path representations.  Such an
679	   implementation could use multiple connections or sessions for each
680	   probe sequence.  For example, one connection could do the initial
681	   probe and set the provisional MTU and one or more subsequent
682	   connections could verify the MTU.  This approach may converge much
683	   more quickly in some environments such as when the application uses
684	   many small connections, each of which is too short to complete a
685	   probe sequence.

687	   These approaches are not mutually exclusive.  However, due to
688	   differing constraints on generating probes (Section 7.2) and
689	   the MPS searching algorithm (Section 7.5), it may not be
690	   feasible for different packetization layer protocols to share PLPMTUD
691	   state.  This suggests that it may be possible for some protocols to
692	   share probing state, but not others.  In this case, the different
693	   protocols can still share the observed MPS but they will have
694	   differing convergence properties.

696	6.  The Probing Sequence and Lower Layers

698	   This section describes the details of a probe sequence, including how
699	   to process MTU and error indications, necessary to raise the MTU by
700	   one step.

702	6.1  Normal sequence of events to raise the MTU

704	   If the probe size is smaller than the actual path MTU and there are
705	   no other losses, the normal sequence of events to raise the MTU is:
706	   1.  Confirm probing preconditions: no outstanding Packetization Layer
707	       losses, sufficient congestion window per section 7.6, and
708	       sufficient elapsed time since the previous probe per section 6.3.
709	       If the candidate MPS has not been set from the ICMP MPS, then
710	       compute the candidate MPS per the search strategy in section 7.5.

712	   2.  A new MTU is tested by sending one "probe packet", of size "probe
713	       size" (computed from the candidate MPS).  The probe is sent,
714	       followed by additional packets at the current MTU.  By definition
715	       PLPMTUD enters the probe phase.  The probe propagates through the
716	       network and the far node acknowledges it (or possibly later
717	       data, if acknowledgments are cumulative and delayed
718	       acknowledgment is in effect).

720	   3.  The acknowledgment for the probe reaches the data sender.  By
721	       definition, this ends the probe phase.

723	   4.  The packetization layer provisionally raises the MTU to the probe
724	       size.  PLPMTUD enters the transition phase when it starts
725	       sending data using the provisional MTU.

727	       Note that implementations that use packet counts for congestion
728	       accounting (e.g.  keep cwnd in units of packets) must re-scale
729	       their congestion accounting such that raising the MTU does not
730	       raise the data rate (bytes/second) or the total congestion window
731	       in bytes, as required in section 4 and discussed in 7.6.

733	       If the implementation packetizes the data at the application
734	       programming interface, it may transmit already queued data at the
735	       current MTU before raising the MTU.  In this case this data is
736	       not part of either the probing or transition phases, because all
737	       of the packets in flight fit within the current MTU.

739	   5.  Once the first packet of the transition phase is acknowledged,
740	       PLPMTUD enters the verification phase.  In principle the
741	       verification phase can be of arbitrary duration; however, at this
742	       time we recommend one full window of data (i.e., one full
743	       round trip time) for most Packetization Layers.

745	   6.  Once there has been sufficient data delivered and acknowledged
746	       the provisional MTU is considered verified and the path MTU is
747	       updated.  PLPMTUD can then probe for an even larger MTU, as
748	       described in the searching strategy in section 7.5.

750	   Other events described in the next section are treated as exceptions
751	   and alter or cancel some of the steps above.

753	6.2  Processing MTU Indications

755	   When the probe sequence fails to raise the MTU, it will be due to one
756	   of three broad classes of outcomes: the probe was inconclusive,
757	   failed or errored.  If the probe was inconclusive, it means that
758	   there were other losses seemingly unrelated to the probe, such that
759	   the probe outcome was ambiguous.  Inconclusive probes should be
760	   retried with the same probe size.  If the probe failed, this is an
761	   indication that the probe size was larger than the path MTU, and
762	   probing should continue with a smaller size, as selected by the MTU
763	   searching algorithm.  In some situations there can be indications
764	   that the probing sequence caused some unexpected event.  In these
765	   error conditions, it is desirable to use progressively longer delays
766	   between probes to minimize the possible impact on the network.

768	6.2.1  Processing ICMP PTB messages

770	   Classical PMTU discovery specifies the generation of ICMP PTB
771	   Messages if an over-sized packet (e.g.  a probe) encounters a link
772	   that has a smaller MTU.  Since these messages can not be
773	   authenticated they introduce a number of well documented attacks
774	   against classical PMTUD [5].

776	   With PLPMTUD these messages are not required for correct operation,
777	   and in principle can be summarily ignored at the expense of slower
778	   convergence to the proper MTU.  However, we believe that a slightly
779	   better approach is to save the reported PTB size (computed from the
780	   ICMP MTU) in the path information cache and act on it only in
781	   conjunction with a lost PLPMTUD probe or a full-stop timeout.

783	   Every ICMP PTB Message should be subjected to the following checks:
784	   o  If globally forbidden then discard the message.

786	   o  If forbidden by the application then discard the message.

788	   o  If this path has been tagged "bogus ICMP messages" then discard
789	      the message.

791	   o  If the reported MTU fails consistency checks then set the "bogus
792	      ICMP messages" flag for this path and discard the message.  These
793	      consistency checks include:
794	      *  unrecognized or unparseable enclosed header, or
795	      *  reported MTU is larger than the size indicated by the enclosed
796	         header, or
797	      *  reported MTU is larger than the current MTU, provisional MTU or
798	         probe size as appropriate, or
799	      *  fails an ICMP consistency check specific to the Packetization
800	         Layer (e.g., the SCTP Verification-Tag mechanism [9][16]).
801	      To ease migration, it is suggested that implementations may
802	      include global controls to emulate legacy operation by suppressing
803	      some or all of the consistency checks.

805	   If the ICMP PTB message is acceptable under all of these checks then
806	   save the "ICMP MPS" computed from the MTU field in the ICMP message.
807	   If the global configuration switch is set to emulate classical path
808	   MTU discovery then process the message immediately, i.e., set the
809	   path MPS to the ICMP MPS and invoke any protocol specific actions.
810	   Otherwise, the saved ICMP MPS will be acted upon if and only if there
811	   are other PLPMTUD events, such as lost probes, as indicated in the
812	   next section.  This delayed processing of ICMP PTB messages makes it
813	   more difficult for an attacker to interfere with correct PLPMTUD
814	   operation by injecting fraudulent ICMP PTB messages.

816	   In either case if the Packetization Layer calls for specific actions
817	   in response to a PTB message, that action should be invoked only at
818	   the point when the path MPS is updated from the ICMP MPS.
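
   The checks in this section can be sketched in Python as follows.
   This is an illustration only: the PathState record, the function
   name and its parameters are hypothetical, the Packetization Layer
   specific check is reduced to a boolean supplied by the caller, and
   the conversion from the ICMP MTU to an ICMP MPS is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathState:
    bogus_icmp: bool = False              # "bogus ICMP messages" tag for this path
    saved_icmp_mtu: Optional[int] = None  # saved for later, not acted upon here


def handle_ptb(path: PathState, reported_mtu: int,
               enclosed_size: Optional[int], size_limit: int,
               globally_forbidden: bool = False,
               app_forbidden: bool = False,
               pl_check_ok: bool = True) -> bool:
    """Return True if the PTB message was accepted and its MTU saved.

    enclosed_size is None when the enclosed header cannot be parsed.
    size_limit is the current MTU, provisional MTU or probe size, as
    appropriate for the phase the caller is in.
    """
    if globally_forbidden or app_forbidden or path.bogus_icmp:
        return False  # discard without further processing
    consistent = (
        enclosed_size is not None
        and reported_mtu <= enclosed_size
        and reported_mtu <= size_limit
        and pl_check_ok
    )
    if not consistent:
        path.bogus_icmp = True  # tag the path, then discard the message
        return False
    path.saved_icmp_mtu = reported_mtu
    return True
```

   Note that an accepted message is only saved.  Acting on the saved
   value is deferred until a lost probe or a full stop timeout, unless
   the implementation is configured to emulate classical path MTU
   discovery.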

820	6.2.2  Packetization Layer Detects Lost Packets

822	   Each packetization protocol has its own mechanism to detect lost
823	   packets and request the retransmission of missing data.  The primary
824	   signals used by PLPMTUD are these protocol-specific loss indications.
825	   The packetization layer is responsible for retransmitting the lost
826	   data if necessary, and notifying PLPMTUD that there was a loss.
827	   o  If the probe itself was lost, and there were no other losses
828	      during the probe phase (the RTT between when the probe was sent
829	      and the loss detected) then it is taken as an indication that the
830	      path MTU is smaller than the probe size.  In this specific
831	      situation, the Packetization Layer may choose not to treat this
832	      loss as a congestion signal, and continue with the same congestion
833	      window or data transmission rate.

835	      If an accepted ICMP PTB message was received after the probe was
836	      sent, and it passes the additional checks that the ICMP MTU value
837	      is less than the probe size, and corresponds to an MPS greater
838	      than that in use for the path, then set the candidate MPS from the
839	      ICMP MTU value, and restart the probe sequence from step 1 in
840	      section 6.1.

842	      If there was not an accepted PTB Message, then the indicated event
843	      is a "probe failure", which can be retried with a smaller probe
844	      size after a suitable delay for a probe_fail_event.  See section
845	      6.3 for more complete descriptions of failure events.

847	   o  If there are losses during the probe phase yet the probe was
848	      acknowledged as received, then the probe was successful.  However,
849	      since additional losses have the potential to spoil the
850	      verification phase, it is important that PLPMTUD not progress into
851	      the transition phase (step 4 above) until after the Packetization
852	      Layer has fully recovered from the losses and completed the
853	      congestion window (or rate) adjustment.

855	   o  If there are losses during the probe phase and the probe was also
856	      lost, the outcome depends on the presence of an ICMP MTU set by an
857	      acceptable PTB message.

859	      If there was an accepted PTB message received after the probe was
860	      sent, it should be treated in the same manner as if there were no
861	      other losses (see above).

863	      If there was not an acceptable ICMP PTB message, then the probe is
864	      inconclusive because the lost probe might have been caused by
865	      congestion.  The probe can be retried after a suitable delay for
866	      a probe_inconclusive_event.

868	   o  It is unlikely that losses during the transition phase are caused
869	      by PLPMTUD; however, the presence of loss does potentially
870	      complicate the verification phase.  Note that we are referring to
871	      losses that are bracketed by acknowledgment of packets that were
872	      sent at the old MTU, while the transition to the provisional MTU
873	      is still propagating through the network.  The first
874	      acknowledgment from the provisional MTU (and the transition to the
875	      verification phase) is most likely going to occur during the
876	      recovery of the losses in transition phase.  It is important that
877	      the Packetization Layer retransmission machinery distinguish
878	      between losses at the old MTU (transition phase) and the
879	      provisional MTU (the verification phase, discussed next).

881	   o  Losses during the verification phase are taken as an indication
882	      that the path may have a non-uniform MTU or other condition such
883	      that raising the MTU raises the loss rate.  If so, this is
884	      potentially a very serious problem.  The provisional MTU is
885	      considered unsuitable, and the cached path MTU is set back to the
886	      previously verified MTU.

888	      Packet loss during the verification phase might also be due to
889	      coincidental congestion on the path, unrelated to the probe, so it
890	      would seem desirable to re-probe the path.  The risk is that this
891	      effectively raises the tolerated loss threshold because even
892	      though raising the MTU seemed to cause additional loss, there is a
893	      statistical chance that repeated attempts to verify a new MTU may
894	      yield a false pass.  The compromise is to re-probe once with the
895	      same probe size (after delay probe_inconclusive_event), and if
896	      this also fails, then the probe may not be retried until after a
897	      suitable delay for a verification_error_event, which exponentially
898	      increases on each successive failure.
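
   The outcomes for the probe phase described above can be summarized
   by the following Python sketch.  The enumeration and function names
   are hypothetical.  For brevity the sketch compares sizes as MTUs
   rather than converting to MPS, and it assumes that an accepted PTB
   message which fails the additional size checks is treated as if no
   PTB message had been received, a case this document leaves open.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "probe acknowledged, no other losses"
    SUCCESS_AFTER_RECOVERY = "probe acknowledged, wait for loss recovery"
    RESTART_FROM_ICMP = "set candidate from ICMP MTU, restart at step 1"
    PROBE_FAIL = "probe_fail_event"
    PROBE_INCONCLUSIVE = "probe_inconclusive_event"


def classify_probe(probe_lost: bool, other_losses: bool,
                   probe_size: int, current_mtu: int,
                   icmp_mtu: Optional[int] = None) -> Outcome:
    """Classify the result of the probe phase.

    icmp_mtu is the MTU from an accepted PTB message received after the
    probe was sent, or None if there was no such message.
    """
    if not probe_lost:
        # Losses of other packets do not change the probe result, but the
        # transition phase must wait until recovery has completed.
        return Outcome.SUCCESS_AFTER_RECOVERY if other_losses else Outcome.SUCCESS
    # Additional checks: smaller than the probe, larger than what is in use.
    if icmp_mtu is not None and current_mtu < icmp_mtu < probe_size:
        return Outcome.RESTART_FROM_ICMP
    # Without a usable PTB message, other losses make the result ambiguous.
    return Outcome.PROBE_INCONCLUSIVE if other_losses else Outcome.PROBE_FAIL
```

   Losses during the transition and verification phases are handled
   separately, as described in the last two items of the list above.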

900	6.2.3  Packetization Layer Retransmission Timeout

902	   Note that we do not make distinctions between the various methods
903	   that different Packetization Layers might use for detecting and
904	   retransmitting lost packets.  It is preferable that the Packetization
905	   Layer use a recovery mechanism similar to TCP SACK or fast
906	   retransmit, designed to detect and report losses and to recover as
907	   quickly as possible.

909	   Under some conditions the Packetization Layer may have to rely on
910	   retransmission timeouts or other fairly disruptive techniques to
911	   detect and recover from losses.  Since these greatly increase the
912	   cost of failed probes, it is recommended that PLPMTUD use even longer
913	   delays before re-probing.  In these situations replace
914	   probe_fail_event with probe_timeout_event.

916	6.2.4  Packetization Layer Full Stop Timeout

918	   Under all conditions (not just during MTU probing) a full stop
919	   timeout should be taken as an indication of some significantly
920	   disruptive event in the network, such as a router failure or a
921	   routing change to a path with a smaller MTU.

923	   If an ICMP PTB message was recently received, even if its MTU
924	   value was less than the current path MTU value in use, then the path
925	   MTU can be reduced to the ICMP MTU.  A full stop timeout is the only
926	   situation outside of a probe in which we recommend that the path MTU
927	   be set from the ICMP MTU.  (In section 9.1 we relax this
928	   recommendation to facilitate migration to PLPMTUD in exchange for
929	   slightly less protection from corrupt ICMP PTB messages.)

931	   Note that whenever a problem with the path causes a full-stop
932	   timeout (also known as a "persistent timeout" in other documents),
933	   several different path restart/recovery algorithms may be invoked at
934	   different layers in the stack.  Some device drivers may be restarted
935	   [@@], router discovery [@@], ES-IS [@@] and so forth.  We recommend
936	   that in most situations the first action should be to reset the path
937	   MTU down.  Note that this recommendation is really beyond the scope
938	   of this document, and may require substantial additional research.

940	   If there is a full stop timeout and there was not an ICMP message
941	   indicating a reason (PTB, Net unreachable, etc., or the ICMP message
942	   was ignored for some reason), we suggest that the first recovery
943	   action should be to set the path MTU down to a safe minimum "restart
944	   MTU" value, and to reset the PLPMTUD search state, so PLPMTUD will
945	   start over again searching for the proper MTU.  The default IPv4
946	   restart_MTU should be the minimum MTU as specified by IPv4 (576
947	   Bytes) [1].  The default IPv6 restart_MTU should be the minimum MTU
948	   as specified by IPv6 (1280 Bytes) [7].  These defaults apply unless
949	   overridden by some global control (see section 9.5).

951	   If, and only if, the full stop timeout happens during the probe or
952	   transition phases, e.g., after sending data using the provisional MTU
953	   but before any of it is acknowledged, is it considered likely that
954	   raising the MTU caused the full stop timeout.  If so, this situation
955	   is likely to be cyclic, because resetting the PLPMTUD search state
956	   is likely to eventually cause re-probing the same problematic MTU.
957	   It is tempting to define additional states to detect recurrent full
958	   stop timeouts.  However in today's hostile network environment, there
959	   is little tolerance for nodes that are so fragile that they can be
960	   disrupted by something as simple as oversized packets.  Therefore, we
961	   do not feel that it is worth the overhead of specifying a state
962	   machine that is capable of automatically detecting these situations
963	   and disabling PLPMTUD.  However, it is important that there be a
964	   manual way to disable or limit probing on specific paths.  See
965	   section 9.5.

967	6.3  Probing Intervals

969	   The previous sections describe a number of events that prevent a
970	   probe sequence from raising the path MTU.  In all cases the basic
971	   response is the same: to wait some time interval (dependent on the
972	   specific event and possibly the history) and then to probe again.
973	   For events that are "inconclusive," it is generally appropriate to
974	   re-probe with the same probe size.  For events that are identified as
975	   "failed probes," it is generally appropriate to re-probe with a
976	   smaller probe size.  The search strategy described in section 7.5 is
977	   used to select probe sizes.

979	   Many of the intervals described below are specified in terms of
980	   elapsed round trips relative to the current congestion window.  This
981	   is because TCP and other Packetization Layer protocols tend to
982	   exhibit periodic losses which cause periodic variations of the
983	   congestion window and possibly the data rate.  It is preferable that
984	   the PLPMTUD probes be scheduled near the low point of these cycles to
985	   minimize ambiguities caused by congestion losses.

987	   In order from least to most serious:
988	   probe_converge_event: The candidate probe size has already been
989	      probed so there is no need for further searching.  Delay 5 minutes
990	      and then re-probe last SEARCH_HIGH.

992	   probe_inconclusive_event: Other lost packets near the lost probe made
993	      the probe result ambiguous.  Since the loss of non-probe packets
994	      requires a window (or data rate) reduction, it is desirable to
995	      schedule the re-probe (at the same probe size) roughly one round
996	      trip time after the end of the loss recovery.  This will be almost
997	      the minimum congestion window size, with a small cushion to
998	      minimize the chances that correlated losses caused by some other
999	      bursty connection spoil another probe.

1001	   probe_fail_event: A probe fail event is the one situation under which
1002	      the Packetization layer is permitted not to treat loss as a
1003	      congestion signal.  Because there is some small risk that
1004	      suppressing congestion control might have unanticipated
1005	      consequences (even for one isolated loss), we require that probe
1006	      fail events be less frequent than the normal period for losses
1007	      under standard congestion control.  Specifically after a probe
1008	      fail event and suppressed congestion control, PLPMTUD may not
1009	      probe again until an interval which is comparable to the expected
1010	      interval between congestion control events.  This is required in
1011	      section 4 and discussed further in section 7.6.

1013	      The simplest estimate of the interval to the next congestion event
1014	      is the same number of round trips as the current window in
1015	      packets.

1017	   probe_timeout_event: Since this event was detected by a timeout, it
1018	      is relatively disruptive to protocol operation.  Furthermore,
1019	      since the event indirectly includes a window adjustment that may
1020	      have been caused by the MTU probe, it is important that the probe
1021	      not be repeated until congestion control has had more than
1022	      sufficient time to recover from the loss.  Therefore we recommend
1023	      five times the probe_fail_event interval, i.e., five times as many
1024	      round trips as the current congestion window in packets.

1026	   verification_error_event: A verification error event indicates that a
1027	      probe was delivered and the verification phase failed twice
1028	      separated by a congestion adjustment (so the second verification
1029	      phase was at a low point in the congestion control cycle).  This
1030	      is an indication that one of the following three things might have
1031	      happened: repeated losses unrelated to PLPMTUD; the path is
1032	      striped across links with dissimilar MTUs, or the link layer has
1033	      some parametric limitation such that raising the MTU greatly
1034	      increases the random error rate.

1036	      The optimal method of responding to this situation is an open
1037	      research question.  We believe that the correct response is some
1038	      combination of exponentially lengthening back-offs, e.g., starting
1039	      at 1 minute and quadrupling on each repeat, and implicitly
1040	      treating the situation as a probe fail (and choosing a smaller
1041	      probe size) after some threshold number of repeated
1042	      verification_error_events.
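
   The intervals above can be collected into a single helper, sketched
   here in Python.  The function name and parameters are hypothetical.
   The sketch returns a delay in seconds, converts round trips using a
   caller-supplied RTT estimate, and takes the suggested example values
   for verification_error_event (1 minute, quadrupling on each repeat)
   at face value even though the text leaves that policy open.

```python
def reprobe_delay_seconds(event: str, rtt: float,
                          cwnd_packets: int = 0, repeats: int = 0) -> float:
    """Delay before the next probe, in seconds.

    rtt is the round trip time estimate in seconds, cwnd_packets the
    current congestion window in packets, and repeats the number of
    earlier verification_error_events for this probe size.
    """
    if event == "probe_converge_event":
        return 5 * 60.0
    if event == "probe_inconclusive_event":
        # Roughly one round trip, measured from the end of loss recovery.
        return rtt
    if event == "probe_fail_event":
        # As many round trips as the current window in packets.
        return cwnd_packets * rtt
    if event == "probe_timeout_event":
        # Five times the probe_fail_event interval.
        return 5 * cwnd_packets * rtt
    if event == "verification_error_event":
        # Example back-off only: start at 1 minute, quadruple per repeat.
        return 60.0 * 4 ** repeats
    raise ValueError("unknown event: " + event)
```

   For example, with a 0.5 second RTT and a window of 20 packets, a
   probe_fail_event delays the next probe by 10 seconds and a
   probe_timeout_event delays it by 50 seconds.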

1044	6.4  Host fragmentation

1046	   Packetization layers are encouraged to avoid sending messages that
1047	   will require fragmentation.  (For the case against fragmentation, see
1048	   [17], [18]).  However, entirely preventing fragmentation is not
1049	   always possible.  Some packetization layers, such as a UDP
1050	   application outside the kernel, may be unable to change the size of
1051	   the messages they send, resulting in datagram sizes that exceed the
1052	   path MTU.

1054	   IPv4 permitted such applications to send packets without the DF bit
1055	   set.  Oversized packets without the DF bit set would be fragmented in
1056	   the network or sending host when they encountered a link with an MTU
1057	   smaller than the packet.  In some cases, packets could be fragmented
1058	   more than once if there were cascaded links with progressively
1059	   smaller MTUs.

1061	   This approach is no longer recommended.  We now recommend that IPv4
1062	   implementations use a strategy that mimics IPv6 functionality.  When
1063	   an application sends datagrams that are larger than the known path
1064	   MTU, they should be fragmented to the path MTU in the host IP layer
1065	   even if they are smaller than the link MTU of the first network hop
1066	   directly attached to the host.  The DF bit should be set on the
1067	   fragments, so they will not be fragmented again in the network.
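
	   The host fragmentation rule above can be illustrated with a small
	   Python sketch.  The function name is hypothetical, and a 20 byte
	   IPv4 header without options is assumed.  Each resulting fragment
	   would be sent with the DF bit set.

```python
def fragment_sizes(datagram_len, path_mtu, ip_header_len=20):
    """Split an IPv4 datagram into fragments that each fit path_mtu.

    datagram_len is the total length including the IP header.  Returns a
    list of (payload_offset, payload_len, more_fragments) tuples.
    """
    payload = datagram_len - ip_header_len
    # Every fragment except the last must carry a multiple of 8 payload
    # bytes, because the IPv4 fragment offset field counts 8-byte units.
    max_payload = (path_mtu - ip_header_len) // 8 * 8
    if max_payload <= 0:
        raise ValueError("path MTU too small to carry any payload")
    fragments = []
    offset = 0
    while offset < payload:
        size = min(max_payload, payload - offset)
        more = offset + size < payload
        fragments.append((offset, size, more))
        offset += size
    return fragments
```

	   For example, an 8220 byte datagram sent over a path with a known
	   MTU of 1500 bytes yields six fragments, the first five carrying
	   1480 payload bytes each and the last carrying 800.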

1069	   This technique will minimize future surprises as the Internet
1070	   migrates to IPv6.  Otherwise, the potential exists for widely
1071	   deployed applications or services relying on IPv4 fragmentation in a
1072	   way that cannot be implemented in IPv6.  At least one major operating
1073	   system already uses this strategy.

1075	   Note that IP fragmentation divides data into packets, so it is
1076	   minimally a Packetization Layer.  However, it does not have a
1077	   mechanism to detect lost packets, so it cannot support a native
1078	   implementation of PLPMTUD.  Fragmentation-based PLPMTUD requires an
1079	   adjunct protocol as described in section 8.3.

1081	6.5  Multicast

1083	   In the case of a multicast destination address, copies of a packet
1084	   may traverse many different paths to reach many different nodes.  The
1085	   local representation of the "path" to a multicast destination must in
1086	   fact represent a potentially large set of paths.

1088	   Minimally, an implementation could maintain a single MPS value to be
1089	   used for all packets originated from the node.  This MPS value would
1090	   be the minimum MPS learned across the set of all paths in use by the
1091	   node.  This approach is likely to result in the use of smaller
1092	   packets than is necessary for many paths.

1094	   If the application using multicast gets complete delivery reports
1095	   (unlikely because this requirement has poor scaling properties),
1096	   PLPMTUD could be implemented in multicast protocols.

1098	7.  Common Packetization Properties

1100	   This section describes general Packetization Layer properties and
1101	   characteristics needed to implement PLPMTUD.  It also describes some
1102	   implementation issues that are common to all Packetization Layers.

1104	7.1  Mechanism to detect loss

1106	   It is important that the Packetization Layer has a timely and robust
1107	   mechanism for detecting and reporting losses.  PLPMTUD makes MTU
1108	   adjustments on the basis of detected losses.  Any delay or
1109	   inaccuracy in loss notification is likely to result in incorrect MTU
1110	   decisions or slow convergence.

1112	   It is best if Packetization Protocols use fairly explicit loss
1113	   notification such as Selective acknowledgments, although implicit
1114	   mechanisms such as TCP Reno style duplicate acknowledgments counting
1115	   are sufficient.  It is important that the mechanism can robustly
1116	   distinguish between the isolated loss of just a probe and other
1117	   combinations of losses.

1119	   Many protocol implementations have complicated mechanisms such as SACK
1120	   scoreboards to distinguish between real losses and temporarily missing
1121	   data due to reordering in the network.  In these implementations it is
1122	   desirable to signal losses to PLPMTUD as a side effect of the data
1123	   retransmission.  This approach offers the maximum protection from
1124	   confusing signals due to reordering and other events that might mimic
1125	   losses.

1127	   PLPMTUD can also be implemented in protocols that rely on timeouts as
1128	   their primary mechanism for loss recovery, although this should be
1129	   used only when there are no other alternatives.

1131	7.2  Generating Probes

1133	   There are several possible ways to alter packetization layers to
1134	   generate probes.  The different techniques incur different overheads
1135	   in three areas: difficulty in generating the probe packet (in terms
1136	   of packetization layer implementation complexity and extra data
1137	   motion), possible additional network capacity consumed by the probes,
1138	   and the overhead of recovering from failed probes (both network and
1139	   protocol overheads).

1141	   Some protocols might be extended to allow arbitrary padding with
1142	   dummy data.  This greatly simplifies the implementation because the
1143	   probing can be performed without participation from higher layers and
1144	   if the probe fails, the missing data (the "probe gap") is assured to
1145	   fit within the current MTU when it is retransmitted.  This is
1146	   probably the most appropriate method for protocols that support
1147	   arbitrary length options or multiplexing within the protocol itself.

1149	   Many Packetization Layer protocols can carry pure control messages
1150	   (without any data from higher protocol layers) which can be padded to
1151	   arbitrary lengths.  For example, the SCTP HEARTBEAT message can be
1152	   used in this manner (see section 8.2).  This approach has the
1153	   advantage that nothing needs to be retransmitted if the probe is
1154	   lost.

1156	   These techniques do not work for TCP, because there is not a separate
1157	   length field or other mechanism to differentiate between padding and
1158	   real payload data.  With TCP the only approach is to send additional
1159	   payload data in an over-sized segment.  There are at least two
1160	   variants of this approach, discussed in section 8.1.

1162	   In a few cases there may be no reasonable mechanism to generate probes
1163	   within the Packetization Layer protocol itself.  As a last resort it
1164	   may be possible to rely on an adjunct protocol, such as ICMP ECHO
1165	   (aka "ping"), to send probe packets.  See section 8.3 for further
1166	   discussion of this approach.

1168	7.3  Mechanism to support provisional MTUs

1170	   The verification phase requires a mechanism to provisionally raise the
1171	   MPS and, if there are additional losses, restore the old MPS.  While
1172	   this is not difficult for most potential Packetization Layers, there
1173	   are a few (e.g., ISO TP4 [ISOTP]) that are not allowed to
1174	   re-packetize when doing a retransmission.  That is, once an attempt
1175	   is made to transmit a segment of a certain size, the transport cannot
1176	   split the contents of the segment into smaller segments for
1177	   retransmission.  In such a case, the original segment can be
1178	   fragmented by the IP layer during retransmission as described in
1179	   section 6.4.  Subsequent segments, when transmitted for the first
1180	   time, should be no larger than allowed by the path MTU.

1182	   Note that while padding is an appropriate mechanism for probing, it
1183	   is too wasteful for use during the verification phase.

1185	   Unresolved problem: if two Packetization Layers are using the same
1186	   path and one can only verify constrained sizes (e.g., blocks plus
1187	   headers), then the verified MTU might be the actual packet size for
1188	   the constrained Packetization Layer, not the probed size.  @@@@

1190	   Unresolved problem: what to do about very short flows?  No
1191	   verification phase?  @@@@@

1193	7.4  Selecting the initial MPS

1195	   If there is already a cached MPS value for this path, PLPMTUD may
1196	   use the saved MPS value.  Unless it is very recent (how recent?
1197	   @@@@@) SEARCH_HIGH should be set to SEARCH_MAX, to restart the search
1198	   process from the old MPS.

1200	   Note that there are tradeoffs in how long path information cache
1201	   entries are retained when they are not being used by any flows.  If
1202	   they are kept for too long they waste memory; if too short, it will
1203	   cause frequent re-probing.  We suggest an adjustable Least Recently
1204	   Used algorithm to purge old entries.  @@@@ This belongs some place else.

1206	   When the PLPMTUD process is started the recommended initial MPS
1207	   should normally be set such that the Packetization Layer can carry 1
1208	   kByte data segments.  This initial MPS would be 1 kByte plus space
1209	   for Packetization Layer headers (see section 5 on accounting for
1210	   headers).  With this MPS, RFC2414 [6] allows TCP and other
1211	   transport protocols to start with an initial window of 4 packets.

1213	   [We suspect, but have not confirmed, that] TCP completes sooner for
1214	   short connections when started with four 1kB packets rather than
1215	   three 1500 byte packets because the 2nd ACK occurs one round trip
1216	   earlier.
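
	   The packet counts quoted above follow from the RFC2414 upper
	   bound on the initial window, min(4*MSS, max(2*MSS, 4380 bytes)).
	   A short Python check is given below.  The function name is
	   hypothetical, and the 1460 byte MSS assumes a 1500 byte packet
	   with 40 bytes of IPv4 and TCP headers and no options.

```python
def initial_window_segments(mss):
    """Whole segments allowed by the RFC 2414 initial window bound."""
    iw_bytes = min(4 * mss, max(2 * mss, 4380))
    return iw_bytes // mss


# 1 kByte data segments: the window is 4096 bytes, i.e. four packets.
four_small = initial_window_segments(1024)
# 1500 byte packets (MSS 1460): the window is 4380 bytes, i.e. three packets.
three_large = initial_window_segments(1460)
```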

1218	   This initial MPS should also be configurable.  One of the
1219	   configuration options should be to mimic classical PMTUD behavior by
1220	   setting the initial MPS from the interface MTU.  This option
1221	   facilitates using PLPMTUD in a mode that mimics classical PMTU
1222	   discovery.  (See section 9.1.)

1224	7.5  Common MPS Search Strategy

1226	   The MPS search strategy described here is only a rough guide for
1227	   implementors.  It is difficult to imagine a completely standard
1228	   algorithm because the strategy can include many Packetization Layer
1229	   specific heuristics to optimize MPS selection.  There is significant
1230	   opportunity for future improvements to this portion of PLPMTUD.

1232	   The search strategy is trying to find the largest "candidate MPS"
1233	   that meets the constraints of both the Packetization and the link
1234	   layers.  Although this algorithm is primarily described in terms of
1235	   MPS, it needs to use knowledge about link layer MTUs and
1236	   Packetization Layer buffer sizes.

1238	   The search strategy uses three variables:
1239	      SEARCH_MAX is the largest MPS that a Packetization Layer might be
1240	      able to use.  It is determined by such considerations as interface
1241	      MTU, widths of protocol length fields, and possibly other
1242	      protocol-dependent values, such as the TCP MSS option.  In
1243	      many cases it would be the same as the classical MTU discovery
1244	      initial MTU, minus the IP layer headers.
1245	      SEARCH_LOW is the largest validated MPS, the same as the current
1246	      MPS in use by the packetization layer.  The initial value for
1247	      SEARCH_LOW is described in section 7.4.
1248	      SEARCH_HIGH is the least invalidated MPS.  In most cases it will
1249	      be the most recent failed candidate MPS.  When PLPMTUD is
1250	      initialized, SEARCH_HIGH should be set to SEARCH_MAX, indicating
1251	      that there have been no failed probes.

1253	   For many Packetization Layer protocols, the cost for a failed probe
1254	   is significantly higher than the cost of a successful probe due to
1255	   the additional time and overhead needed for retransmission and
1256	   recovery.  For this reason it is often desirable to bias the search
1257	   strategy to take more, smaller steps.

1259	   The search strategy first computes an initial candidate MPS using one
1260	   of these methods:
1261	      If SEARCH_HIGH >= SEARCH_MAX, there have been no recent failed
1262	      probes so use a coarse (geometric doubling) scan.  Set
1263	      candidate MPS = MIN(2 * SEARCH_LOW, SEARCH_MAX).  Otherwise use
1264	      one of several possible fine scan candidate MPS values:
1265	      Select a candidate MPS that corresponds to a common MTU, possibly
1266	      minus common tunnel header sizes, between SEARCH_LOW and
1267	      SEARCH_HIGH.  There is a fine scan heuristic described in section
1268	      7.5.1 that might be used.
1269	      Use a simple weighted binary search by selecting the candidate MPS
1270	      some prorated distance between SEARCH_LOW and SEARCH_HIGH.  E.g.,
1271	      set
1272	      candidate MPS = SEARCH_LOW * (1 - alpha) + SEARCH_HIGH * alpha,
1273	      for some alpha between 0 and 1.  If you choose an alpha slightly
1274	      less than 0.5, PLPMTUD will tend to converge from below,
1275	      minimizing the number of failed probes.  Alternatively, alpha can
1276	      be selected to optimally converge for some common MTUs, such as
1277	      1500 bytes.
1278	   If the Packetization Layer has preferred data sizes (e.g.  carries
1279	   block data), optionally round the candidate MPS to an efficient size
1280	   for the Packetization Layer.  The rounded candidate MPS would
1281	   typically be a multiple of the optimal data block size plus space for
1282	   Packetization Layer headers.  The MPS can be rounded up or down, but
1283	   should avoid selecting previously probed values if possible, per the
1284	   convergence test below.  Packetization Layers that do not have
1285	   intrinsically preferred data sizes may still choose to round the
1286	   candidate MPS to some convenient increment such as 4 or 8 bytes, to
1287	   prevent excessive hunting.  Note that this step is intrinsically
1288	   Packetization Layer dependent, and may be different for different
1289	   Packetization Layers.

1291	   If the resulting candidate MPS is not between SEARCH_LOW and
1292	   SEARCH_HIGH, then the probe process has converged and further probing
1293	   will not yield a better value for the MPS for this protocol.  To
1294	   detect if a routing change has raised the path MTU, the path should
1295	   be re-probed after a suitable delay as indicated by a
1296	   probe_converge_event (See section 6.3).  If the probe succeeds, then
1297	   SEARCH_HIGH should be set to SEARCH_MAX to restart the probing
1298	   process from the current MPS.
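
	   The candidate selection, rounding and convergence test described
	   above might be sketched in Python as follows.  This is one
	   reading of the text and not a definitive algorithm.  The function
	   name, the default alpha of 0.4, the choice to round down, and the
	   decision to let a coarse scan probe SEARCH_MAX itself are all
	   assumptions of this example.

```python
def next_candidate_mps(search_low, search_high, search_max,
                       alpha=0.4, block=None, header=0):
    """Return the next candidate MPS, or None once the search converged.

    block and header are optional: when block is given, the candidate is
    rounded down to a whole number of data blocks plus header bytes.
    """
    no_failed_probes = search_high >= search_max
    if no_failed_probes:
        # Coarse scan: geometric doubling, capped at SEARCH_MAX.
        candidate = min(2 * search_low, search_max)
    else:
        # Weighted binary search; alpha < 0.5 converges from below.
        candidate = int(search_low * (1 - alpha) + search_high * alpha)
    if block:
        candidate = (candidate - header) // block * block + header
    if no_failed_probes:
        # SEARCH_MAX has not failed, so it is still worth probing.
        in_range = search_low < candidate <= search_max
    else:
        in_range = search_low < candidate < search_high
    return candidate if in_range else None
```

	   With SEARCH_LOW = 2048, SEARCH_HIGH = 4096 and SEARCH_MAX = 9000
	   this yields a candidate of 2867 bytes.  Setting SEARCH_HIGH equal
	   to SEARCH_LOW makes the function return None, which matches the
	   implicit way of disabling the search.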

1300	   MPS searching can be implicitly disabled by setting SEARCH_HIGH
1301	   to SEARCH_LOW.

1303	   Note that if two different Packetization Layers are sharing a path,
1304	   they may choose different MPSs due to differences in the protocols.
1305	   It is even possible for one of the Packetization Protocols to consider
1306	   the process converged, while the other continues to probe.  In this
1307	   case one of the Packetization Layers may choose not to use the
1308	   full MPS, and instead choose some slightly smaller but more
1309	   efficient packet size.

1311	7.5.1  Fine Scans

1313	   If SEARCH_LOW does not correspond to a common link MTU, and there is
1314	   a common link MTU between SEARCH_LOW and SEARCH_HIGH, set the
1315	   candidate MPS from the most common link MTU between SEARCH_LOW and
1316	   SEARCH_HIGH.

1318	   If SEARCH_LOW does not correspond to a common link MTU, and there is
1319	   not a common link MTU between SEARCH_LOW and SEARCH_HIGH, then set
1320	   the candidate MPS to either the weighted binary search between
1321	   SEARCH_LOW and SEARCH_HIGH or to SEARCH_HIGH, reduced by a reasonable
1322	   increment for tunnel headers.

1324	   If SEARCH_LOW corresponds to a common link MTU, set the candidate MPS
1325	   to SEARCH_LOW plus some small delta.  If this fails, we have found the
1326	   proper MPS; otherwise we need to keep searching.
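
	   A minimal Python sketch of these three rules is given below.  The
	   table of common sizes, its ordering from most to least common,
	   the delta of 4 bytes and the alpha of 0.4 are placeholders
	   assumed for illustration only, since this document does not yet
	   list them.  The sketch also assumes that all sizes are already
	   expressed in the same units, and it omits the alternative of
	   probing SEARCH_HIGH reduced by a tunnel header increment.

```python
# Assumed, illustrative values ordered from most to least common.
COMMON_SIZES = (1500, 1492, 9000, 4352)


def fine_scan_candidate(search_low, search_high, alpha=0.4,
                        common=COMMON_SIZES, delta=4):
    """Pick a fine scan candidate between search_low and search_high."""
    if search_low in common:
        # Probe slightly above a common size; a failure confirms it.
        return search_low + delta
    for size in common:
        if search_low < size < search_high:
            # Most common size inside the open search interval.
            return size
    # No common size in range: fall back to weighted binary search.
    return int(search_low * (1 - alpha) + search_high * alpha)
```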

1328	   @@@@@ common link MTUs are: 1500......  ?

1330	   @@@@@ common tunnel header sizes are....

1332	7.6  Congestion Control and Window Management

1334	   PLPMTUD and congestion control share the same slice of the protocol
1335	   stack.  Both algorithms nominally run inside of a transport protocol
1336	   and rely on packet losses as their primary signal to adjust
1337	   parameters of the data stream (packet size or window size).
1338	   Furthermore both push up the controlled parameter until the onset of
1339	   packet losses, and then back off to a smaller value.  Due to the
1340	   close proximity of these two algorithms there is the potential for
1341	   side effects and unexpected interactions between them.

1343	   This section describes potential interactions between PLPMTUD and
1344	   congestion control.  In general PLPMTUD is designed to minimize its
1345	   potential impact on congestion control.  This is appropriate because
1346	   correctly functioning congestion control is critical to the overall
1347	   operation of the Internet.

1349	   The requirements in section 4 protect congestion control from
1350	   PLPMTUD.  It is important that MTU changes do not raise the
1351	   congestion window.  Given that we do not know a priori the nature of
1352	   the network bottleneck, PLPMTUD should not raise either the data rate
1353	   (bytes per second) or the packet rate (packets per second).

1355	   Since there is a risk that lost probes might actually be congestion
1356	   losses, and not MTU losses at all, we limit the maximum allowed rate
1357	   for suppressing congestion control to less than the loss rate
1358	   required to throttle the flow to the "TCP friendly" rate.  This
1359	   guarantees that the losses due to PLPMTUD are less than the losses
1360	   needed for normal congestion control.

1362	   If there is some node which is accounting queue length in bytes
1363	   (rather than packets), there is even the possibility that a probe
1364	   might cause a loss due to driving the queue over some threshold and
1365	   into congestion.  For this reason it is recommended that all PLPMTUD
1366	   implementations use some strategy to slightly depress the actual
1367	   window during the probe process.  It may be sufficient to require
1368	   that the excess data in the probe packet fits within the current
1369	   congestion control window.

1371	   If a probe is carrying real application data that must be
1372	   retransmitted, it is important to suppress (or restore) all of the
1373	   congestion control state changes normally associated with the
1374	   retransmission.  For example if a TCP connection is in slow-start
1375	   when a probe is lost, it is important that ssthresh is not changed as
1376	   a side effect of the probing.  It is for this reason that it is
1377	   strongly recommended that packetization protocols use some
1378	   combination of out-of-band echo message and padding, if at all
1379	   possible.  Lost probes that do not carry any real application data do
1380	   not need to be retransmitted.

1382	   It is recommended that TCP should not probe a new MPS if that MPS
1383	   will likely result in a cwnd of less than 5 segments.

1385	   If the network becomes too congested, it is recommended that the MPS
1386	   be reduced to a smaller size as determined by a heuristic.  The
1387	   recommended heuristic is to reduce the MPS by half if ssthresh is
1388	   reduced to 5 segments or smaller, with a minimum MPS of 512 bytes.
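
	   The two TCP recommendations above can be expressed as a short
	   Python sketch.  The function names are hypothetical, and the
	   sketch assumes that cwnd and ssthresh are kept in bytes and are
	   converted to segments by integer division by the MPS.

```python
MIN_SEGMENTS = 5   # from the text: 5 segments
MIN_MPS = 512      # from the text: minimum MPS of 512 bytes


def may_probe(cwnd_bytes, candidate_mps):
    """Allow a probe only if cwnd still holds at least 5 segments."""
    return cwnd_bytes // candidate_mps >= MIN_SEGMENTS


def mps_after_congestion(mps, ssthresh_bytes):
    """Halve the MPS when ssthresh has dropped to 5 segments or fewer."""
    if ssthresh_bytes // mps <= MIN_SEGMENTS:
        # Never go below 512 bytes, and never raise an MPS that is
        # already smaller than that floor.
        return max(mps // 2, min(mps, MIN_MPS))
    return mps
```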

1390	8.  Specific Packetization Layers

1392	   This section discusses specific implementation details for different
1393	   protocols that can be used as Packetization Layer protocols.  All
1394	   Packetization Layer protocols must consider all of the issues
1395	   discussed in section 7.  For most protocols it is self-evident how
1396	   to address many of these issues.  It is hoped that the protocols
1397	   described here will be sufficient illustration for implementors to
1398	   adapt other protocols.

1400	8.1  Probing method using TCP

1402	   TCP has no mechanism that could be used to distinguish between real
1403	   application data and some other form of padding that might be used to
1404	   fill out probe packets.  Therefore, TCP must generate probes by
1405	   sending oversized segments that are carrying real data from upper
1406	   layers.  There are two approaches that TCP might use to minimize the
1407	   overheads associated with the probing sequence.

1409	   A TCP implementation of PLPMTUD can elect to send subsequent segments
1410	   overlapping the probe as though the probe segment was not oversized.

1412	   This has the advantage that TCP only needs to retransmit one segment
1413	   at the current MTU to recover from failed probes.  However, the
1414	   duplicate data in the probe does consume network resources and will
1415	   cause duplicate acknowledgments.  It is important that these extra
1416	   duplicate acknowledgments not trigger Fast Retransmit.  This can be
1417	   guaranteed by limiting the largest probe segment size to twice the
1418	   current segment size (causing at most 1 duplicate acknowledgment) or
1419	   three times the current segment size (causing at most 2 duplicate
1420	   acknowledgments).
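
	   The relationship between probe size and duplicate
	   acknowledgments can be illustrated as follows.  The function
	   names are hypothetical.  The sketch assumes that each following
	   segment that lies entirely within the probe elicits exactly one
	   duplicate acknowledgment, and that Fast Retransmit is triggered
	   by the third duplicate acknowledgment.

```python
DUPACK_THRESHOLD = 3  # duplicate ACKs that trigger Fast Retransmit


def overlap_dupacks(probe_size, segment_size):
    """Upper bound on duplicate ACKs caused by overlapping segments."""
    return max(probe_size // segment_size - 1, 0)


def max_overlapping_probe(segment_size, allowed_dupacks=1):
    """Largest probe that causes at most allowed_dupacks duplicate ACKs."""
    if not 0 <= allowed_dupacks < DUPACK_THRESHOLD:
        raise ValueError("allowed_dupacks must stay below the threshold")
    return (allowed_dupacks + 1) * segment_size
```

	   With 1460 byte segments, this limits the probe to 2920 bytes for
	   one duplicate acknowledgment or 4380 bytes for two.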

1422	   The other approach is to send non-overlapping segments following the
1423	   probe.  Although this is cleaner from a protocol architecture
1424	   standpoint, it clashes with many of the optimizations used to improve
1425	   the efficiency of data motion within many operating systems.  In
1426	   particular, many implementations divide the data into segments and
1427	   pre-compute checksums as the data is copied out of application
1428	   buffers.  In these implementations it can be relatively expensive to
1429	   adjust segment boundaries after the data is already queued.

1431	   If TCP is using SACK or any other variable length headers, the
1432	   headers on the probe and verification packets should be padded to the
1433	   maximum possible length.  Otherwise, unexpected options on
1434	   bidirectional data may cause IP packets that are larger than
1435	   the tested MTU.

1437	   At the point when TCP is ready to start the verification phase, it is
1438	   permitted to transmit already queued data at the old MTU rather than
1439	   re-packetizing it.  This postpones the verification process by the
1440	   time required to send the queued data.

1442	   If the verification phase experiences any segment losses, TCP is
1443	   required to pull back to the prior MSS.  Since failing the
1444	   verification phase should be an infrequent error condition, it is less
1445	   important that this be as efficient as probing.

1447	8.2  Probing method using SCTP

1449	   In the SCTP protocol [9][16] the application writes messages to SCTP
1450	   and SCTP "chunkifies" them into smaller pieces suitable for
1451	   transmission through the network.  Once a message has been
1452	   chunkified, its chunks are assigned TSNs.  Once some TSNs have been
1453	   transmitted, SCTP cannot change the chunk sizes.  SCTP multi-path
1454	   support normally requires SCTP to chunkify its messages to fit the
1455	   smallest MPS (maximum payload size, same as MTU - IP headers) of all
1456	   paths.  Although not required, implementations may bundle multiple
1457	   data chunks together to make larger IP packets to allow for support
1458	   for larger MPSs on different paths.  Note that SCTP must
1459	   independently probe and verify the MPS on each path to the peer.

1461	   The recommended method for generating probes is to add a chunk
1462	   consisting only of padding to an SCTP message.  There are two methods
1463	   to implement this padding.

1465	   In method 1, the message is padded with an SCTP heartbeat (HB) of
1466	   the necessary size to construct an IP packet of the desired probe
1467	   size.  The peer SCTP implementation will acknowledge a successful
1468	   probe without delay by returning the same Heartbeat as a
1469	   HEARTBEAT-ACK.  This method is fully compatible with current SCTP
1470	   standards and implementations, but is exposed to MPS limitations on
1471	   the return path, which might cause the HEARTBEAT-ACK to be lost.

1473	   In method 2, a new "PAD" chunk type would have to be defined.  This
1474	   chunk would be silently discarded by the peer.  The PAD chunk could be
1475	   attached to another message (either a minimum length HB or other
1476	   application data which will be acknowledged by the peer) to build a
1477	   probe packet.  The default action for unknown chunk types in the
1478	   range 128 to 190 (high bits = 10) is to "Skip this chunk and
1479	   continue processing" [RFC2960] - exactly the required behavior for a
1480	   PAD chunk.  Any currently unused type in this range will work for a
1481	   PAD chunk type.  This method is fully compatible with all current
1482	   SCTP implementations, but requires adding a new type to the current
1483	   standards.  It has the advantage that restrictions due to the return
1484	   path MPS are not applied to the forward path.
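
	   As an illustration of method 2, the following Python sketch
	   builds a padding chunk using the generic SCTP chunk layout of a
	   one byte type, one byte of flags and a two byte length that
	   includes the four byte chunk header.  The type value 0x84 and
	   both function names are assumptions chosen for this example; any
	   unused type in the range described above would serve equally
	   well.

```python
import struct

PAD_CHUNK_TYPE = 0x84  # assumed for illustration; high bits are 10


def unknown_chunk_action(chunk_type):
    """Receiver action for an unrecognized chunk type (upper two bits)."""
    actions = {
        0b00: "stop processing and discard the packet",
        0b01: "stop processing, discard the packet and report",
        0b10: "skip this chunk and continue processing",
        0b11: "skip this chunk, continue processing and report",
    }
    return actions[chunk_type >> 6]


def build_pad_chunk(total_len, chunk_type=PAD_CHUNK_TYPE):
    """Return a padding chunk of exactly total_len bytes, zero filled."""
    if total_len < 4 or total_len % 4 or total_len > 0xFFFC:
        raise ValueError("length must be a multiple of 4 in 4..65532")
    header = struct.pack("!BBH", chunk_type, 0, total_len)
    return header + bytes(total_len - 4)
```

	   A probe packet would then bundle such a chunk with a minimum
	   length HB or with application data, sized so that the resulting
	   IP packet matches the desired probe size.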

1486	   The verification phase is most efficiently implemented by picking a
1487	   new chunk size such that the new MPS and all of the old multi-path
1488	   MPSs are larger than different multiples of the new chunk size, by at
1489	   least the required header sizes.  This approach permits chunks from
1490	   SCTP application messages to be assembled into packets that are
1491	   suitable for any path to the peer at either the old or new MPS.  This
1492	   is the easiest method to permit the provisional MPS to be withdrawn,
1493	   if there are losses during the verification phase.

1495	   Once each of the old path MPSs has been updated to a new verified MPS,
1496	   SCTP may be able to pick a new larger chunk size that will fit into
1497	   all paths.  However, if the MPS is later reduced (say due to a
1498	   routing change and subsequent ICMP PTB message) SCTP will be forced
1499	   to use IP fragmentation to transmit application messages that are
1500	   already chunkified, as described in section 7.3.

1502	   The constraints on efficiently choosing chunk sizes are complicated
1503	   enough to make it difficult if not impossible to efficiently support
1504	   arbitrary combinations of old and new MPSs.  It greatly simplifies the
1505	   implementation to add constraints, such as making the chunk size
1506	   itself a multiple of some common size, such as 512 bytes.  This in
1507	   turn constrains the searching algorithm to test MPSs that are
1508	   multiples of 512 bytes, plus the appropriate headers.  Clearly the
1509	   PLPMTUD search heuristic for SCTP must be constrained to pick
1510	   candidate MPSs that are consistent with the limitations of the
1511	   algorithm for choosing appropriate chunk sizes.

1513	   The SCTP Verification Tag is designed to increase SCTP's robustness in
1514	   the presence of a number of attacks, including forged ICMP messages.
1515	   It relies on a 32-bit Verification Tag which is initialized to a
1516	   random value during connection establishment and placed in the first
1517	   64 bits of all SCTP messages.  All subsequent messages (including
1518	   ICMP messages, which copy at least the first 64 bits of the message)
1519	   must match the original Verification Tag, or they are rejected as
1520	   being likely attacks against the connection.

1522	   It is believed that the Verification Tag mechanism is strong enough
1523	   that SCTP could unconditionally process ICMP PTB messages that would
1524	   reduce the path MPS at arbitrary times.  As written, this document
1525	   does not encourage this method.  The PLPMTUD ICMP validity checks are
1526	   cascaded with the SCTP checks, such that the messages are processed
1527	   only if they meet all consistency checks for both protocols.  In
1528	   particular, PLPMTUD only uses the ICMP MPS value following a probe,
1529	   during MPS verification, or following a full stop timeout.

1531	   Alternatively, an SCTP implementation could suppress some of the
1532	   checks in section 6.2.1.

1534	8.3  Probing method for IP fragmentation

1536	   As mentioned in section 6.4, datagram protocols (such as UDP) might
1537	   rely on IP fragmentation as a packetization layer.  However,
1538	   implementing PLPMTUD with IP fragmentation is problematic because the
1539	   IP layer has no mechanism to determine if the packets are
1540	   ultimately delivered properly to the far node, without participation
1541	   by the application.

1543	   To support IP fragmentation as a packetization layer under an
1544	   unmodified application, we propose the use of an adjunct MTU
1545	   measurement protocol (ICMP ECHO) and a separate path MTU discovery
1546	   daemon (described here) to perform PLPMTUD and update the stored path
1547	   MTU information.

1549	   For IP fragmentation the initial MPS should be selected as described
1550	   in section 7.4, except with a separate global control for the default
1551	   initial MPS for connectionless protocols.  Since connectionless
1552	   protocols may not keep enough state to effectively diagnose MTU black
1553	   holes, it would be more robust to err on the side of using too
1554	   small an initial MTU (e.g., 1 kByte or less) prior to initiating
1555	   probing of the path to measure the MTU.

1557	   Since many protocols that rely on IP fragmentation are
1558	   connectionless, there is an additional problem with the path
1559	   information cache: there are no events corresponding to connection
1560	   establishment and tear-down to use to manage the cache itself.  We
1561	   take this approach: if there is no entry in the path information
1562	   cache for a particular packet being transmitted, it uses an immutable
1563	   cache entry for the "default path", which has an MPS that is fixed at
1564	   the initial value.  A new path cache entry is not created until there
1565	   is an attempt to set the MPS.
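
	   One possible shape for such a cache is sketched below in Python.
	   The class and method names are hypothetical, and a destination
	   key stands in for whatever representation of a path the
	   implementation actually uses.

```python
class PathCache:
    """Hypothetical path information cache with a fixed default entry."""

    def __init__(self, initial_mps):
        # The "default path" MPS is fixed at the initial value.
        self._default_mps = initial_mps
        self._entries = {}

    def get_mps(self, dest):
        """Look up the MPS; a lookup never creates a cache entry."""
        return self._entries.get(dest, self._default_mps)

    def set_mps(self, dest, mps):
        """Create or update an entry; the only way entries appear."""
        self._entries[dest] = mps

    def __len__(self):
        return len(self._entries)
```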

1567	   The path MTU discovery daemon should be triggered as a side effect of
1568	   IP fragmentation.  Once the number of fragmented datagrams via any
1569	   particular path reaches some configurable threshold (say 5
1570	   datagrams), the daemon can start probing the path with ICMP ECHO
1571	   packets.  These probes must use the diagnostic interface described in
1572	   section 9.4 and have DF set.  The daemon can implement all of the
1573	   PLPMTUD probe sequence and search strategy, collect all of the ICMP
1574	   responses (ECHO REPLY, ICMP PTB, etc.) and update only the saved path
1575	   MTU information in the path information cache in the IP layer.

1577	   Alternatively, most of the PLPMTUD state machinery can be implemented
1578	   within the path information cache in the IP layer, which can
1579	   specifically invoke the path MTU discovery daemon to perform
1580	   specified measurements on specific paths and report the results back
1581	   to the IP layer.

1583	   Using ICMP ECHO to measure the MTU has a number of potential
1584	   robustness problems.  Note that the most likely failures are due to
1585	   losses unrelated to MTU (e.g.  nodes that discriminate on the basis
1586	   of protocol type).  These non-MTU losses can prevent PLPMTUD from
1587	   raising the MTU, forcing the Packetization Layer protocol to use a
1588	   smaller MTU than necessary.  Since these failures are not likely to
1589	   cause interoperability problems, they are relatively benign.

1591	   However, there do exist other more serious failure modes, such as
1592	   layer 3 or 4 routers choosing different paths for different protocol
1593	   types or sessions.  In such environments, adjunct protocols may
1594	   experience different MTUs than the primary protocol.  If the adjunct
1595	   protocol has a larger MTU than the primary protocol, PLPMTUD will
1596	   select a non-functional MTU.  This does not seem to be a likely
1597	   situation.

1599	8.4  Probing method for applications

1601	   The disadvantages of probing with ICMP ECHO can be overcome by
1602	   implementing the path MTU discovery daemon within the application
1603	   itself, using the application's own protocol.

1605	   The application must have some suitable method for generating probes.
1606	   The ideal situation is a lightweight echo function that confirms
1607	   message delivery, plus a mechanism for padding the messages out to
1608	   the desired MTU, such that the padding is not echoed.  This
1609	   combination (akin to the SCTP HB plus PAD) is preferred because
1610	   you can send large probes that cause small acknowledgments.  For
1611	   protocols that cannot implement these messages directly there are
1612	   often alternate methods for generating probes.  E.g., the protocol may
1613	   have a variable length echo (that measures both the forward and
1614	   return path) or, if there is no echo function, there may be a way to
1615	   add padding to regular messages carrying real application data.
1616	   There may be other ways to generate probes.  As a last resort, it
1617	   may be feasible to extend the protocol with new message types to
1618	   support MTU discovery.
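
   As an illustration of the preferred combination (a large padded
   probe answered by a small acknowledgment), the following is a
   minimal sketch in Python.  The wire format, the message type values
   and the function names are assumptions made up for this example;
   they are not part of any real protocol.  Note that total_size here
   is the size of the application message, so a caller would still
   have to subtract the IP and transport header sizes to hit a given
   MTU.

```python
import struct

# Hypothetical wire format (an assumption for illustration only):
#   1 byte message type, 4 byte probe id, 2 byte padding length,
#   followed by that many bytes of zero padding.
PROBE_REQ = 1
PROBE_ACK = 2
HEADER = struct.Struct("!BIH")  # network byte order, 7 bytes, no alignment


def build_probe(probe_id: int, total_size: int) -> bytes:
    """Return a probe request padded so the whole message is total_size bytes."""
    pad_len = total_size - HEADER.size
    if pad_len < 0:
        raise ValueError("total_size is smaller than the probe header")
    if pad_len > 0xFFFF:
        raise ValueError("padding does not fit in the 16-bit length field")
    return HEADER.pack(PROBE_REQ, probe_id, pad_len) + bytes(pad_len)


def build_ack(request: bytes) -> bytes:
    """Acknowledge a probe by echoing only its id; padding is not returned."""
    msg_type, probe_id, _pad_len = HEADER.unpack_from(request)
    if msg_type != PROBE_REQ:
        raise ValueError("not a probe request")
    return HEADER.pack(PROBE_ACK, probe_id, 0)
```

   Because the acknowledgment carries no padding, only the forward
   path is exercised by the large message, which is the property the
   text above asks for.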

1620	   Probing within an application introduces one new issue: many
1621	   applications do not currently concern themselves with MTU and rely on
1622	   IP fragmentation to deliver datagrams that just happen to be larger
1623	   than the path MTU.  PLPMTUD requires that the protocol can send
1624	   probes that are larger than the IP layer's current notion of the path
1625	   MTU, but are marked not to be fragmented.  This requires an alternate
1626	   method for sending these datagrams.

1628	   As with ICMP MTU probing, there is considerable flexibility in how
1629	   the PLPMTUD algorithms can be divided between the Application and the
1630	   path information cache.

1632	   Some applications send large datagrams no matter what the link size,
1633	   and rely on IP fragmentation to deliver the datagrams.  It has been
1634	   known for a long time that this has some undesirable consequences
1635	   [@@harm1].  Recently it has come to light that IPv4 fragmentation is
1636	   not sufficiently robust for general use in today's Internet.  The
1637	   16-bit IP identification field is not large enough to prevent
1638	   frequent misassociated IP fragments and the TCP and UDP checksums are
1639	   insufficient to prevent the resulting corrupted data from being
1640	   delivered to higher protocol layers [@@harm2].
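
   To give a feel for why a 16-bit identification field is small, the
   following sketch estimates how long it takes a sender to cycle
   through all 65536 ID values.  It assumes, for illustration only,
   that every packet consumes one ID from a single counter and that
   the link is kept full of equal-sized packets; the function name is
   hypothetical.

```python
def ip_id_wrap_seconds(link_bits_per_second: float, packet_bytes: int) -> float:
    """Seconds until a 16-bit IP ID counter wraps at a sustained packet rate."""
    id_space = 2 ** 16  # number of distinct values in a 16-bit field
    packets_per_second = link_bits_per_second / (8 * packet_bytes)
    return id_space / packets_per_second


# At 1 Gb/s with 1500 byte packets the counter wraps in under a second.
wrap = ip_id_wrap_seconds(1e9, 1500)
print(f"{wrap:.3f} s")
```

   If fragments of a lost datagram can remain in a reassembly queue
   for longer than this interval, a later datagram can reuse the same
   ID and be combined with the stale fragments.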

1642	   Nonetheless, there are a number of higher layer protocols, such as
1643	   NFS [@@NFS], which use IP fragmentation as a mechanism to reduce CPU
1644	   load.  NFS typically sends fragmented 8 kByte datagrams over all
1645	   link types, no matter what the link MTU.  The other common case, in
1646	   which the application wants to use the largest possible datagram that
1647	   fits within the MTU, is most easily treated as a special case of the
1648	   fragmenting case.

1650	9.  Operational Integration

1652	   This section describes ways to minimize deployment problems for
1653	   PLPMTUD, by including a number of good management features:
1654	   mechanisms to diagnose problems with path MTU discovery, and
1655	   configuration controls such that the more risky properties can be
1656	   progressively deployed.  We also address some potentially serious
1657	   interactions with nodes that do not honor the DF bit.

1659	9.1  Interoperation with prior algorithms

1661	   Properly functioning Path MTU discovery is critical to the robust and
1662	   efficient operation of the Internet.  Any major change (as described
1663	   in this document) has the potential to be very disruptive if it
1664	   contains any errors or oversights.  Therefore, we offer a deployment
1665	   strategy in which classical PMTUD operation as described in RFC 1191
1666	   and RFC 1981 is unmodified and PLPMTUD is only invoked following a
1667	   full stop timeout, presumably due to an "ICMP black hole".  To do
1668	   this:
1669	   o  Relax the ICMP checks in section 6.2.1 specifically to allow an
1670	      ICMP Packet Too Large message to reduce the MTU at arbitrary
1671	      times.
1672	   o  When there is no cached MTU, use the Interface MTU as specified by
1673	      classical PMTU discovery, rather than the initial MTU as specified
1674	      in section 7.4.
1675	   o  MTU searching as described in section 7.5 is disabled by setting
1676	      SEARCH_HIGH equal to SEARCH_LOW and the initial MPS.
1677	   o  A full stop timeout is processed as described in section 6.2.4.
1678	      This becomes the only mechanism to invoke the rest of PLPMTUD.

1680	   When configured in this manner, PLPMTUD will increase the robustness
1681	   of classical PMTU discovery in the presence of ICMP black holes and
1682	   other ICMP problems, with minimal exposure to unanticipated problems
1683	   during deployment.  Since this configuration does not help robustness
1684	   in the presence of malicious or erroneous ICMP messages, it is not
1685	   recommended for the long term.
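
   The configuration steps listed in this section can be sketched as
   follows.  This is only an illustration: the class, field and
   function names are hypothetical, and the sketch treats MTU and MPS
   as the same number, ignoring header overhead.  The full stop
   timeout processing of section 6.2.4 is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathState:
    """Hypothetical per-path state for the compatibility mode."""
    search_low: int
    search_high: int
    initial_mps: int
    ptb_reduces_mtu_anytime: bool  # relaxed ICMP checks (first bullet)


def classical_compat_state(interface_mtu: int,
                           cached_mtu: Optional[int] = None) -> PathState:
    """Build path state in which MTU searching is disabled.

    With no cached MTU the Interface MTU is used, as in classical
    PMTUD.  Setting search_high equal to search_low and the initial
    MPS leaves no range for the search to explore.
    """
    start = cached_mtu if cached_mtu is not None else interface_mtu
    return PathState(search_low=start, search_high=start,
                     initial_mps=start, ptb_reduces_mtu_anytime=True)


def searching_enabled(state: PathState) -> bool:
    """Searching is possible only if there is a range left to probe."""
    return state.search_high > state.search_low
```

   In this mode a full stop timeout would be the only event that
   widens the search range again.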

1687	9.2  Operation over subnets with dissimilar MTUs

1689	   With classical PMTUD, the ingress router to a subnet is responsible
1690	   for knowing what size packets can be delivered to every node attached
1691	   to that subnet.  For most subnet types, this requires that the
1692	   entire subnet has a single MTU which is common to every attached
1693	   node.  (For a few subnet types, such as ATM [12], the nodes on a
1694	   subnet can negotiate the MTU on a pairwise basis, and the ingress
1695	   router is responsible for knowing the MTU to each of its peers.)

1697	   This requirement has proven to be a major impediment to deploying
1698	   larger MTUs in the operational Internet.  Often one single node which
1699	   does not support a larger MTU effectively vetoes raising the MTU on a
1700	   subnet, because the ingress router does not have a mechanism to
1701	   generate the proper ICMP PTB message for the one attached node with a
1702	   smaller MTU.

1704	   With PLPMTUD, this requirement is completely relaxed.  As long as
1705	   oversized packets addressed to nodes with the smaller MTU are
1706	   reliably discarded, PLPMTUD will find the proper MTU for these nodes.

1708	   Once there is sufficient field experience to demonstrate that PLPMTUD
1709	   is robust, we recommend that OS vendors consider updating default MTUs
1710	   for Network Interface Cards.  It would raise the overall performance
1711	   of the Internet if all NICs were configured to default to the MTU
1712	   which is most efficient for the NIC (lowest overhead per byte),
1713	   rather than the standard MTU for the media or switch.  This is most
1714	   likely to be the largest MTU supported by the NIC chip set or some
1715	   other logical boundary, such as memory page sizes.

1717	9.3  Interoperation with tunnels

1719	   PLPMTUD is specifically designed to solve many of the problems that
1720	   people are experiencing today due to poor interactions between
1721	   classical MTU discovery, IPsec, and various sorts of tunnels [5].  As
1722	   long as the tunnel reliably discards packets that are too large,
1723	   PLPMTUD will discover an appropriate MTU for the path.

1725	   Unfortunately, due to the pervasive problems with classical PMTU
1726	   discovery, many manufacturers of various types of VPN/tunneling
1727	   equipment have resorted to ignoring the DF bit under some conditions.
1728	   This not only violates the IP standard and many recommendations to
1729	   the contrary [17][18], it also violates the only requirement that
1730	   PLPMTUD places on the link layer: that oversized packets are reliably
1731	   discarded.  It is imperative that people understand the impact of
1732	   ignoring the DF bit both to applications and to PLPMTUD.

1734	   We do understand the reality of the situation.  It is important that
1735	   vendors who are building devices that violate the DF specification
1736	   understand that PLPMTUD requires that probe packets be discarded, and
1737	   that sending ICMP PTB messages alone is insufficient to prevent
1738	   wholesale fragmentation if the probe packets are delivered.

1740	   Therefore, it is imperative that devices that do not honor DF include
1741	   packet size history caches and other heuristics to robustly detect
1742	   and discard probe packets, if delivering them would require
1743	   fragmentation.

1745	9.4  Diagnostic tools

1747	   All implementations MUST include facilities for MTU discovery
1748	   diagnostic tools that implement PLPMTUD or other MTU discovery
1749	   algorithms in user mode without help or interference by the PMTUD
1750	   algorithm present in the operating system.  This requires a
1751	   mechanism where a diagnostic application can send packets that are
1752	   larger than the operating system's notion of the current path MTU and
1753	   for the diagnostic application to collect any resulting ICMP PTB
1754	   messages or other ICMP messages.  For IPv4, the diagnostic
1755	   application must be able to set the DF bit.

1757	   At this time nearly all operating systems support two modes for
1758	   sending UDP datagrams: one which silently fragments packets that are
1759	   too large, and another that rejects packets that are too large.
1760	   Neither of these modes is suitable for efficiently diagnosing
1761	   problems with MTU discovery, such as routers that return ICMP PTB
1762	   messages containing incorrect size information.

1764	9.5  Management interface

1766	   It is suggested that an implementation provide a way for a system
1767	   utility program to:
1768	   o  Globally disable all ICMP Packet Too Large message processing.
1769	   o  Globally suppress some or all ICMP consistency checks described in
1770	      section 6.2.1.  Setting this option forgoes some possible
1771	      security improvements, in exchange for making PLPMTUD behave more
1772	      like classical PMTU discovery.  (See section 9.1.)
1773	   o  Globally permit ICMP Packet Too Large messages to unconditionally
1774	      reduce the MTU, even if there were no lost packets.  Setting this
1775	      option forgoes some possible security improvements, in exchange
1776	      for making PLPMTUD behave more like classical PMTU discovery.
1777	      (See section 9.1.)
1778	   o  Globally adjust timer intervals for specific classes of probe
1779	      failures.

1781	   In addition, it is important that there be a mechanism to permit per
1782	   path controls to override specific parts of the PLPMTUD algorithm.
1783	   All of these per path controls should be preset from similar global
1784	   controls:
1785	   o  Disable MTU searching on a given path, such that new MTU values
1786	      are never probed.
1787	   o  Set the initial MTU for a given path.  This could be used to speed
1788	      convergence in relatively static environments.  There should be an
1789	      option to cause PLPMTUD to choose the same initial value as would
1790	      be chosen by classical PMTU discovery (i.e., typically the
1791	      Interface MTU).  This is used in the mode described in section 9.1
1792	      where PLPMTUD is used only for black hole detection in classical
1793	      PMTU discovery.
1794	   o  Limit the maximum probed MTU for a given path.  This permits a
1795	      manual configuration to work around a link that spuriously
1796	      delivers packets that are larger than the useful path MTU.

1798	   o  Per path and per application controls to disable ICMP processing,
1799	      to further limit possible damage from malicious ICMP PTB messages
1800	      (in addition to the global controls).
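
   One way to picture the relationship between the global and per path
   controls is a set of global defaults from which each path's
   controls are preset and then selectively overridden.  The sketch
   below is only an illustration under that assumption; the class,
   field and function names are hypothetical and do not correspond to
   any existing management interface.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Controls:
    """Hypothetical control set, usable globally or for a single path."""
    process_icmp_ptb: bool = True          # False ignores all ICMP PTB
    search_enabled: bool = True            # False never probes new MTUs
    initial_mtu: Optional[int] = None      # None means use the default
    max_probed_mtu: Optional[int] = None   # None means no manual limit


def controls_for_path(global_controls: Controls, **overrides) -> Controls:
    """Preset a path's controls from the globals, then apply overrides."""
    return replace(global_controls, **overrides)


def clamp_probe_size(controls: Controls, candidate_mtu: int) -> int:
    """Apply the per path limit on the largest MTU that may be probed."""
    if controls.max_probed_mtu is None:
        return candidate_mtu
    return min(candidate_mtu, controls.max_probed_mtu)


# Example: one path works around a link that spuriously delivers
# oversized packets, and also ignores ICMP PTB messages.
global_defaults = Controls()
path = controls_for_path(global_defaults, max_probed_mtu=1500,
                         process_icmp_ptb=False)
```

   Keeping the per path record a copy of the global one means that a
   path with no overrides behaves exactly as the global settings say.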

1802	10.  References

1804	10.1  Normative References

1806	   [1]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

1808	   [2]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
1809	        November 1990.

1811	   [3]  McCann, J., Deering, S. and J. Mogul, "Path MTU Discovery for IP
1812	        version 6", RFC 1981, August 1996.

1814	   [4]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
1815	        Levels", BCP 14, RFC 2119, March 1997.

1817	   [5]  Kent, S. and R. Atkinson, "Security Architecture for the
1818	        Internet Protocol", RFC 2401, November 1998.

1820	   [6]  Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
1821	        Initial Window", RFC 2414, September 1998.

1823	   [7]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
1824	        Specification", RFC 2460, December 1998.

1826	   [8]  Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914,
1827	        September 2000.

1829	   [9]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
1830	        H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson,
1831	        "Stream Control Transmission Protocol", RFC 2960, October 2000.

1833	10.2  Informative References

1835	   [10]  Mogul, J., Kent, C., Partridge, C. and K. McCloghrie, "IP MTU
1836	         discovery options", RFC 1063, July 1988.

1838	   [11]  Knowles, S., "IESG Advice from Experience with Path MTU
1839	         Discovery", RFC 1435, March 1993.

1841	   [12]  Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626,
1842	         May 1994.

1844	   [13]  Sung, T., "TCP And UDP Over IPX Networks With Fixed Path MTU",
1845	         RFC 1791, April 1995.

1847	   [14]  Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809,
1848	         June 1995.

1850	   [15]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923,
1851	         September 2000.

1853	   [16]  Stewart, R., "Stream Control Transmission Protocol (SCTP)
1854	         Implementors Guide", draft-ietf-tsvwg-sctpimpguide-10 (work in
1855	         progress), December 2003.

1857	   [17]  Kent, C. and J. Mogul, "Fragmentation considered harmful",
1858	         Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

1860	   [18]  Mathis, M., Heffner, J. and B. Chandler, "Fragmentation
1861	         Considered Very Harmful", draft-mathis-frag-harmful-00 (work in
1862	         progress), July 2004.

1864	Authors' Addresses

1866	   Matt Mathis
1867	   Pittsburgh Supercomputing Center
1868	   4400 Fifth Avenue
1869	   Pittsburgh, PA  15213
1870	   US

1872	   Phone: 412-268-3319
1873	   EMail: mathis@psc.edu

1875	   John W. Heffner
1876	   Pittsburgh Supercomputing Center
1877	   4400 Fifth Avenue
1878	   Pittsburgh, PA  15213
1879	   US

1881	   Phone: 412-268-2329
1882	   EMail: jheffner@psc.edu

1884	   Kevin Lahey
1885	   Freelance

1887	   EMail: kml@patheticgeek.net

1889	Appendix A.  Security Considerations

1891	   Under all conditions the PLPMTUD procedure described in this document
1892	   is at least as secure as the current standard path MTU discovery
1893	   procedures described in RFC 1191 [2] and RFC 1981 [3].

1895	   In the recommended configuration, PLPMTUD is significantly harder to
1896	   attack than current procedures, because ICMP messages are cached and
1897	   only processed in connection with lost packets.  This effectively
1898	   prevents blind attacks on the path MTU discovery system.

1900	   Furthermore, since this algorithm is designed for robust operation
1901	   without any ICMP (or other messages from the network), it can be
1902	   configured to ignore all ICMP messages (globally or on a per
1903	   application basis).  In this configuration it cannot be attacked,
1904	   unless the attacker can identify and selectively cause probe packets
1905	   to be lost.

1907	Appendix B.  IANA considerations

1909	   None.

1911	Appendix C.  Acknowledgements

1913	   Many ideas and even some of the text come directly from RFC 1191 and
1914	   RFC 1981.

1916	   Many people made significant contributions to this document,
1917	   including: Randall Stewart for SCTP text, Michael Richardson for
1918	   material from an earlier ID on tunnels that ignore DF, Stanislav
1919	   Shalunov for the idea that pure PLPMTUD parallels congestion control,
1920	   and Matt Zekauskas for maintaining focus during the meetings.  Thanks
1921	   to the early implementors: Kevin Lahey, John Heffner and Rao Shoaib
1922	   who provided concrete feedback on weaknesses in earlier drafts.
1923	   Thanks also to all of the people who made constructive comments in
1924	   the working group meetings and on the mailing list.  I am sure I have
1925	   missed many deserving people.

1927	   Matt Mathis and John Heffner are supported in this work by a grant
1928	   from Cisco Systems, Inc.

1930	Intellectual Property Statement

1932	   The IETF takes no position regarding the validity or scope of any
1933	   Intellectual Property Rights or other rights that might be claimed to
1934	   pertain to the implementation or use of the technology described in
1935	   this document or the extent to which any license under such rights
1936	   might or might not be available; nor does it represent that it has
1937	   made any independent effort to identify any such rights.  Information
1938	   on the procedures with respect to rights in RFC documents can be
1939	   found in BCP 78 and BCP 79.

1941	   Copies of IPR disclosures made to the IETF Secretariat and any
1942	   assurances of licenses to be made available, or the result of an
1943	   attempt made to obtain a general license or permission for the use of
1944	   such proprietary rights by implementers or users of this
1945	   specification can be obtained from the IETF on-line IPR repository at
1946	   http://www.ietf.org/ipr.

1948	   The IETF invites any interested party to bring to its attention any
1949	   copyrights, patents or patent applications, or other proprietary
1950	   rights that may cover technology that may be required to implement
1951	   this standard.  Please address the information to the IETF at
1952	   ietf-ipr@ietf.org.

1954	Disclaimer of Validity

1956	   This document and the information contained herein are provided on an
1957	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1958	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1959	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1960	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1961	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1962	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1964	Copyright Statement

1966	   Copyright (C) The Internet Society (2005).  This document is subject
1967	   to the rights, licenses and restrictions contained in BCP 78, and
1968	   except as set forth therein, the authors retain all their rights.

1970	Acknowledgment

1972	   Funding for the RFC Editor function is currently provided by the
1973	   Internet Society.