idnits 2.17.1 

draft-ietf-pmtud-method-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1.a on line 18.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1978.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1955.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1962.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1968.

  ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure
     Acknowledgement. 

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate
     instead of verbatim RFC 3978 boilerplate.  After 6 May 2005, submission
     of drafts without verbatim RFC 3978 boilerplate is not accepted.

     The following non-3978 patterns matched text found in the document. 
     That text should be removed or replaced:

        This document is an Internet-Draft and is subject to all provisions of
        Section 3 of RFC 3667.

        By submitting this Internet-Draft, each author represents that any
        applicable patent or other IPR claims of which he or she is aware
        have been or will be disclosed, and any of which he or she
        becomes aware will be disclosed, in accordance with Section 6 of
        BCP 79.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 883 has weird spacing: '...retried  after...'

  == Line 941 has weird spacing: '...ecently  set, ...'

  == Line 1111 has weird spacing: '...irement  has p...'

  == Line 1310 has weird spacing: '...ill not  yield...'

  == Line 1413 has weird spacing: '...address  many ...'

  == (2 more instances...)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 24, 2004) is 7116 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'RFC 2461' on line 650

  -- Looks like a reference, but probably isn't: 'ISOTP' on line 1189

  -- Looks like a reference, but probably isn't: 'RFC2960' on line 1495

  == Unused Reference: '10' is defined on line 1851, but no explicit
     reference was found in the text

  == Unused Reference: '11' is defined on line 1854, but no explicit
     reference was found in the text

  == Unused Reference: '13' is defined on line 1860, but no explicit
     reference was found in the text

  == Unused Reference: '15' is defined on line 1866, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 1981 (ref. '3') (Obsoleted by RFC 8201)

  ** Obsolete normative reference: RFC 2401 (ref. '5') (Obsoleted by RFC 4301)

  ** Obsolete normative reference: RFC 2414 (ref. '6') (Obsoleted by RFC 3390)

  ** Obsolete normative reference: RFC 2460 (ref. '7') (Obsoleted by RFC 8200)

  ** Obsolete normative reference: RFC 2960 (ref. '9') (Obsoleted by RFC 4960)

  -- Obsolete informational reference (is this intentional?): RFC 1063 (ref.
     '10') (Obsoleted by RFC 1191)

  -- Obsolete informational reference (is this intentional?): RFC 1626 (ref.
     '12') (Obsoleted by RFC 2225)

  == Outdated reference: A later version (-16) exists of
     draft-ietf-tsvwg-sctpimpguide-10


     Summary: 11 errors (**), 0 flaws (~~), 13 warnings (==), 12 comments
     (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                                          M. Mathis
2	Internet-Draft                                                J. Heffner
3	Expires: April 24, 2005                                              PSC
4	                                                                K. Lahey
5	                                                               Freelance
6	                                                        October 24, 2004

8	                           Path MTU Discovery
9	                       draft-ietf-pmtud-method-03

11	Status of this Memo

13	   This document is an Internet-Draft and is subject to all provisions
14	   of section 3 of RFC 3667.  By submitting this Internet-Draft, each
15	   author represents that any applicable patent or other IPR claims of
16	   which he or she is aware have been or will be disclosed, and any of
17	   which he or she becomes aware will be disclosed, in accordance with
18	   RFC 3668.

20	   Internet-Drafts are working documents of the Internet Engineering
21	   Task Force (IETF), its areas, and its working groups.  Note that
22	   other groups may also distribute working documents as
23	   Internet-Drafts.

25	   Internet-Drafts are draft documents valid for a maximum of six months
26	   and may be updated, replaced, or obsoleted by other documents at any
27	   time.  It is inappropriate to use Internet-Drafts as reference
28	   material or to cite them other than as "work in progress."

30	   The list of current Internet-Drafts can be accessed at
31	   http://www.ietf.org/ietf/1id-abstracts.txt.

33	   The list of Internet-Draft Shadow Directories can be accessed at
34	   http://www.ietf.org/shadow.html.

36	   This Internet-Draft will expire on April 24, 2005.

38	Copyright Notice

40	   Copyright (C) The Internet Society (2004).

42	Abstract

44	   This document describes a robust method for Path MTU Discovery that
45	   relies on TCP or some other Packetization Layer to probe an Internet
46	   path with progressively larger packets.  This method is described as
47	   an extension to RFC 1191 and RFC 1981, which specify ICMP based Path
48	   MTU Discovery for IP versions 4 and 6, respectively.

50	   The general strategy of the new algorithm is to start with a small
51	   MTU and search upward, testing successively larger MTUs by probing
52	   with single packets.  If the probe is successfully delivered and
53	   satisfies a subsequent verification phase then the MTU is raised.  If
54	   the probe is lost, it is treated as an MTU limitation and not as a
55	   congestion signal.

57	   There are several options for integrating PLPMTUD with classical path
58	   MTU discovery.  PLPMTUD can be minimally configured to perform ICMP
59	   black hole recovery to increase the robustness of classical path MTU
60	   discovery, or ICMP processing can be completely disabled, and PLPMTUD
61	   can completely replace classical path MTU discovery.

63	   In the latter configuration, PLPMTUD exactly parallels congestion
64	   control.  An end-to-end transport protocol adjusts non-protocol
65	   properties of the data stream (window size or packet size) while
66	   using packet losses to deduce the appropriateness of the adjustments.
67	   This technique seems to be more philosophically consistent with the
68	   end-to-end principle than relying on ICMP messages containing
69	   transcribed headers of multiple protocol layers.

71	Table of Contents

73	   1.  Introduction
74	     1.1   Revision History
75	       1.1.1   Changes since version -02, July 19th 2004 (IETF 60)
76	   2.  Overview
77	   3.  Terminology
78	   4.  Requirements
79	   5.  Layering
80	     5.1   Accounting for Header Sizes
81	     5.2   Storing PMTU information
82	     5.3   Accounting for IPsec
83	     5.4   Measuring path MTU
84	   6.  The Probing Sequence and Lower Layers
85	     6.1   Normal sequence of events to raise the MTU
86	     6.2   Processing MTU Indications
87	       6.2.1   Processing ICMP PTB messages
88	       6.2.2   Packetization Layer Detects Lost Packets
89	       6.2.3   Packetization Layer Retransmission Timeout
90	       6.2.4   Packetization Layer Full Stop Timeout
91	     6.3   Probing Intervals
92	     6.4   Host fragmentation
93	     6.5   Multicast
94	   7.  Common Packetization Properties
95	     7.1   Mechanism to detect loss
96	     7.2   Generating Probes
97	     7.3   Mechanism to support provisional MTUs
98	     7.4   Selecting the initial MPS
99	     7.5   Common MPS Search Strategy
100	       7.5.1   Fine Scans
101	     7.6   Congestion Control and Window Management
102	   8.  Specific Packetization Layers
103	     8.1   Probing method using TCP
104	     8.2   Probing method using SCTP
105	     8.3   Probing method for IP fragmentation
106	     8.4   Probing method for applications
107	   9.  Operational Integration
108	     9.1   Interoperation with prior algorithms
109	     9.2   Operation over subnets with dissimilar MTUs
110	     9.3   Interoperation with tunnels
111	     9.4   Diagnostic tools
112	     9.5   Management interface
113	   10.   References
114	   10.1  Normative References
115	   10.2  Informative References
116	       Authors' Addresses
117	   A.  Security Considerations
118	   B.  IANA considerations
119	   C.  Acknowledgements
120	       Intellectual Property and Copyright Statements

122	1.  Introduction

124	   This document describes a method for Packetization Layer Path MTU
125	   Discovery (PLPMTUD) which is an extension to existing Path MTU
126	   discovery methods as described in RFC 1191 [2] and RFC 1981 [3].  The
127	   proper MTU is determined by starting with small packets and probing
128	   with successively larger packets.  The bulk of the algorithm is
129	   implemented above IP, in the transport layer (e.g.  TCP) or other
130	   "Packetization Protocol" that is responsible for determining packet
131	   boundaries.

133	   This document draws heavily on RFC 1191 [2] and RFC 1981 [3] for
134	   terminology, ideas and some of the text.

136	   This document describes methods to discover the path MTU using
137	   features of existing protocols.  The methods apply to IPv4 and IPv6,
138	   and many transport protocols.  They do not require cooperation from
139	   the lower layers (beyond being consistent about which packet sizes
140	   are acceptable) or from the far node.  Variations among
141	   implementations will not cause interoperability problems.

143	   The methods described in this document are carefully designed to
144	   maximize robustness in the presence of less than ideal
145	   implementations of other protocols or Internet components.

147	   For the sake of clarity we uniformly prefer TCP and IPv6 terminology.
148	   In the terminology section we also present the analogous IPv4 terms
149	   and concepts for the IPv6 terminology.  In a few situations we
150	   describe specific details that differ between IPv4 and IPv6.

152	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
153	   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
154	   document are to be interpreted as described in RFC 2119 [4].

156	   This draft is a product of the Path MTU Discovery (pmtud) working
157	   group of the IETF.  Please send comments and suggestions to
158	   pmtud@ietf.org.  Interim drafts and other useful information will be
159	   posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .

161	1.1  Revision History

163	   These are all recent substantive changes, in reverse chronological
164	   order.  This section will be removed prior to publication as an RFC.
165	   Note that there are still some missing details that need to be
166	   resolved.  These are flagged by @@@@.  None of the missing details
167	   are serious.

169	1.1.1  Changes since version -02, July 19th 2004 (IETF 60)

171	   Many minor updates throughout the document.

173	   Added a section describing the interactions between PLPMTUD and
174	   congestion control.

176	   Removed a difficult to implement requirement for future data to
177	   transmit.

179	   Added "IP Fragmentation" and "Application protocol" as Packetization
180	   Layers.

182	   Clarified interactions between TCP SACK and MTU.

184	   Updated SCTP section to reflect new probing method using "PAD
185	   chunks".

187	   Distilled the protocol specific material into separate subsections
188	   for each protocol.

190	   Added a section on common requirements and functions for all
191	   Packetization Layers.  More accurately characterized the
192	   "bidirectional" (and other) requirements of the PL protocol.  Updated
193	   the search strategy in this new section.

195	   Changed "ICMP can't fragment" and "packet too big" to uniformly use
196	   "ICMP PTB message" everywhere.

198	   Added Stanislav Shalunov's observation that PLPMTUD parallels
199	   congestion control.

201	   Better described the range of interoperability with classical pMTUd
202	   in the introduction.

204	   Removed vague language about "not being a protocol" and "excessive
205	   Loss".

207	   Slightly redefined flow: the granularity of PLPMTUD within a path.

209	   Many English NITs and clarifications per Gorry Fairhurst and others.
210	   Passes strict xml2rfc checking.

212	   Added a paragraph encouraging interface MTUs that are optimal for
213	   the NIC, rather than standard for the media.

215	   Added a revision history section.

217	2.  Overview

219	   This document describes a method for TCP or other packetization
220	   protocols to dynamically discover the MTU of a path without relying
221	   on explicit signals from the network.  These procedures are
222	   applicable to TCP and other transport- or application-level protocols
223	   that are responsible for choosing packet boundaries (e.g., segment
224	   sizes) and have an acknowledgement structure that gives the sender
225	   accurate and timely indications of which packets were lost.

227	   The general strategy of the new procedure is for the packetization
228	   layer to find the proper path MTU by probing with progressively
229	   larger packets.  A "probe sequence" consists of a single "probe
230	   packet", which initiates a "probe phase", followed by a "transition
231	   phase" and a "verification phase".

233	   If a probe packet is successfully delivered, then the path MTU is
234	   provisionally raised during the transition phase.  If there are no
235	   additional losses during the subsequent verification phase, then the
236	   path MTU is confirmed (verified) to be at least as large as the
237	   provisional MTU.  Each probe sequence raises the estimated path MTU
238	   by one step.  A search strategy uses heuristics to select optimal MTU
239	   steps, until PLPMTUD converges to the correct path MTU.

241	   The verification phase is used to detect some situations where
242	   raising the MTU raises the packet loss rate.  For example, if a link
243	   is striped across multiple physical channels with inconsistent MTUs,
244	   it is possible that a probe will be delivered even if it is too large
245	   for some of the physical channels.  In such cases raising the path
246	   MTU to the probe size will cause severe periodic loss and abysmal
247	   performance.  The verification phase is designed to prevent the path
248	   MTU from being raised if doing so causes excessive packet losses.

250	   A conservative implementation of PLPMTUD would use a full round trip
251	   time for the verification phase.  In this case the entire probe
252	   sequence takes three full round trip times.  It takes one round trip
253	   for the probe phase, during which the probe propagates to the far
254	   node and an acknowledgment is returned.  The second round trip is the
255	   transitional phase, during which data packets using the provisional
256	   MTU propagate to the far node and are acknowledged.  During the third
257	   and final round trip time, it is verified that raising the MTU did
258	   not cause any additional losses.

260	   The isolated loss of a probe packet (with or without an ICMP PTB
261	   message) is treated as an indication of an MTU limit, and not as a
262	   congestion indicator.  In this case alone, the packetization protocol
263	   is permitted to retransmit any missing data without adjusting the
264	   congestion window.

266	   If there is a timeout or any additional lost packets during any of
267	   the three phases, the loss is treated as a congestion indication as
268	   well as an indication of some sort of failure of the PLPMTUD process.
269	   The congestion indication is treated like any other congestion
270	   indication: window or rate adjustments are mandatory per the relevant
271	   congestion control standards [8].  Probing can resume with some new
272	   probe size after a delay which is determined by the nature of the
273	   detected failure.
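
The probe sequence described above can be sketched as a small state
machine.  This is only an illustration under simplifying assumptions:
each phase is taken to last exactly one round trip (the conservative
case), and the names Phase, ProbeSequence and on_round_trip are
hypothetical, not part of this document.

```python
from enum import Enum, auto


class Phase(Enum):
    PROBE = auto()         # probe in flight, waiting for its acknowledgment
    TRANSITION = auto()    # data at the provisional MTU is propagating
    VERIFICATION = auto()  # watching for losses caused by the new MTU
    DONE = auto()


class ProbeSequence:
    """One attempt to raise the path MTU by a single step (illustrative)."""

    def __init__(self, confirmed_mtu, probe_size):
        if probe_size <= confirmed_mtu:
            raise ValueError("a probe must be larger than the confirmed MTU")
        self.confirmed_mtu = confirmed_mtu
        self.probe_size = probe_size
        self.provisional_mtu = None
        self.congestion_signal = False
        self.phase = Phase.PROBE

    def on_round_trip(self, probe_lost=False, other_losses=False):
        """Advance by one round trip, given what was observed during it."""
        if self.phase is Phase.DONE:
            raise RuntimeError("probe sequence already finished")
        if other_losses:
            # Additional losses or a timeout in any phase: a congestion
            # indication and a PLPMTUD failure; the MTU is not raised.
            self.congestion_signal = True
            self.provisional_mtu = None
            self.phase = Phase.DONE
        elif self.phase is Phase.PROBE:
            if probe_lost:
                # Isolated probe loss: an MTU limit, not congestion.
                self.phase = Phase.DONE
            else:
                self.provisional_mtu = self.probe_size
                self.phase = Phase.TRANSITION
        elif self.phase is Phase.TRANSITION:
            self.phase = Phase.VERIFICATION
        else:
            # Verification finished without loss: confirm the new MTU.
            self.confirmed_mtu = self.provisional_mtu
            self.provisional_mtu = None
            self.phase = Phase.DONE
        return self.phase
```

Three loss-free calls to on_round_trip() confirm the probe size as the
new MTU.  An isolated probe loss ends the sequence without setting
congestion_signal, while any other loss sets it and leaves the MTU
unchanged.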

275	   The most likely (and least serious) PLPMTUD failure is the link
276	   experiencing congestion related losses at about the same time as the
277	   probe.  In this case it is appropriate to retry the probe at the same
278	   probe size as soon as the packetization layer has fully adapted to
279	   the congestion and recovered from the losses.

281	   In other cases, additional losses or timeouts indicate problems with
282	   the link or packetization layer.  In these situations it is desirable
283	   to use longer delays depending on the severity of the error.

285	   There is a range of options for integrating PLPMTUD with classical
286	   path MTU discovery.  In the most conservative configuration, from a
287	   deployment point of view, classical path MTU discovery is fully
288	   functional (all correct ICMP PTB messages are unconditionally
289	   processed) and PLPMTUD is invoked only to recover from ICMP black
290	   holes.

292	   In the most conservative configuration, from a security point of
293	   view, all ICMP PTB messages are ignored, and PLPMTUD is the sole
294	   method used to discover the path MTU.  This protects against
295	   malicious or erroneous ICMP PTB messages which might otherwise cause
296	   MTU discovery to arrive at the incorrect MTU for a path.

298	   Note that in the latter configuration, PLPMTUD exactly parallels
299	   congestion control.  An end-to-end transport protocol adjusts
300	   non-protocol properties of the data stream (window size or packet
301	   size) while using packet losses to deduce the appropriateness of the
302	   adjustments.  This technique seems to be more philosophically
303	   consistent with the end-to-end principle of the Internet than relying
304	   on ICMP messages containing transcribed headers of multiple protocol
305	   layers.

307	   We advocate a compromise in which ICMP PTB messages are processed
308	   only in conjunction with probing (described in section 6.2.1) and
309	   Packetization Layer timeouts (described in section 6.2.3), and are
310	   ignored in all other situations.
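
The three ways of integrating PLPMTUD with classical path MTU discovery
described above can be summarized as a single decision function.  This
is a rough sketch: PtbPolicy, should_process_ptb and the two boolean
inputs are hypothetical names, and the validation of whether a PTB
message is "correct" is assumed to happen elsewhere.

```python
from enum import Enum


class PtbPolicy(Enum):
    CLASSICAL = "process every PTB; PLPMTUD only recovers from black holes"
    PLPMTUD_ONLY = "ignore every PTB; PLPMTUD is the sole discovery method"
    COMPROMISE = "process PTBs only while probing or after a PL timeout"


def should_process_ptb(policy, probe_outstanding, in_pl_timeout):
    """Decide whether an (already validated) ICMP PTB message is acted on."""
    if policy is PtbPolicy.CLASSICAL:
        return True
    if policy is PtbPolicy.PLPMTUD_ONLY:
        return False
    # Compromise: honor the message only in conjunction with a probe or
    # a Packetization Layer timeout, and ignore it otherwise.
    return probe_outstanding or in_pl_timeout
```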

312	   Most of the difficulty in implementing PLPMTUD arises because it
313	   needs to be implemented in several different places within a single
314	   node.  In general each packetization protocol needs to have its own
315	   implementation of PLPMTUD.  Furthermore, the natural mechanism to
316	   share path MTU information between concurrent or subsequent
317	   connections over the same path is a path information cache in the IP
318	   layer.  The various packetization protocols need to have the means to
319	   access and update the shared cache in the IP layer.  This memo
320	   describes PLPMTUD in terms of its primary subsystems without fully
321	   describing how they are assembled into a complete implementation.

323	   Section 3 provides a complete glossary of terms.

325	   Relatively few details of PLPMTUD affect interoperability with other
326	   standards or Internet protocols.  These details are specified in
327	   RFC 2119 standards language in section 4.  The vast majority of the
328	   implementation details described in this document are recommendations
329	   based on experiences with earlier versions of path MTU discovery.
330	   These recommendations are motivated by a desire to maximize
331	   robustness of PLPMTUD in the presence of less than ideal
332	   implementations as they exist in the field.

334	   Section 5 describes how to partition PLPMTUD into layers, and how to
335	   manage the "path information cache" in the IP layer.

337	   Section 6 describes the details of a probe sequence, including how
338	   to process MTU and error indications, necessary to raise the MTU by
339	   one step.

341	   Section 7 describes the general search strategy and Packetization
342	   Layer features needed to implement PLPMTUD.

344	   Section 8 discusses specific implementation details for a couple of
345	   specific protocols, such as TCP.

347	   Section 9 describes ways to minimize deployment problems for
348	   PLPMTUD, by including a number of good management features.  It also
349	   addresses some potentially serious interactions with nodes that do
350	   not honor the IPv4 DF bit.

352	3.  Terminology

354	   We use the following terms in this document:
355	   IP Either IPv4 [1] or IPv6 [7].

357	   node A device that implements IP.

359	   router A node that forwards IP packets not explicitly addressed to
360	      itself.

362	   host Any node that is not a router.

364	   upper layer A protocol layer immediately above IP.  Examples are
365	      transport protocols such as TCP and UDP, control protocols such as
366	      ICMP, routing protocols such as OSPF, and Internet or lower-layer
367	      protocols being "tunneled" over (i.e., encapsulated in) IP such as
368	      IPX, AppleTalk, or IP itself.

370	   link A communication facility or medium over which nodes can
371	      communicate at the link layer, i.e., the layer immediately below
372	      IP.  Examples are Ethernets (simple or bridged); PPP links; X.25,
373	      Frame Relay, or ATM networks; and Internet (or higher) layer
374	      "tunnels", such as tunnels over IPv4 or IPv6.  Occasionally we use
375	      the slightly more general term "lower layer" for this concept.

377	   interface A node's attachment to a link.

379	   address An IP-layer identifier for an interface or a set of
380	      interfaces.

382	   packet An IP header plus payload.

384	   MTU Maximum Transmission Unit, the size in bytes of the largest IP
385	      packet, including the IP header and payload, that can be
386	      transmitted on a link or path.  Note that this could more properly
387	      be called the IP MTU, to be consistent with how other standards
388	      organizations use the acronym MTU.

390	   link MTU The Maximum Transmission Unit, i.e., maximum IP packet size
391	      in bytes, that can be conveyed in one piece over a link.  Beware
392	      that this definition differs from the definition used by other
393	      standards organizations.

395	      For IETF documents, link MTU is uniformly defined as the IP MTU
396	      over the link.  This includes the IP header, but excludes link
397	      layer headers and other framing which is not part of IP or the IP
398	      payload.

400	      Be aware that other standards organizations generally define link
401	      MTU to include the link layer headers.

403	   path The set of links traversed by a packet between a source node and
404	      a destination node.

406	   pMTU, path MTU The minimum link MTU of all the links in a path
407	      between a source node and a destination node.

409	   classical path MTU discovery The process described in RFC 1191 and
410	      RFC 1981, in which nodes rely on ICMP "Packet Too Big" (PTB)
411	      messages to learn the MTU of a path.

413	   PL, Packetization Layer The layer of the network stack which segments
414	      data into packets.

416	   PLPMTUD Packetization Layer Path MTU Discovery, the method described
417	      in this document, which is an extension to classical PMTU
418	      discovery.

420	   PTB (Packet Too Big) message An ICMP message reporting that an IP
421	      packet is too large to forward.  This is the IPv6 term that
422	      corresponds to the IPv4 "ICMP Can't fragment" message.

424	   flow A context in which MTU discovery algorithms can be invoked.
425	      This is naturally an instance of the packetization protocol, e.g.
426	      one side of a TCP connection.

428	   MPS The maximum IP payload size available over a specific path.  This
429	      is typically the path MTU minus the IP header.  As an example,
430	      this is the maximum TCP packet size, including TCP payload and
431	      headers but not including IP headers.  This has also been called
432	      the "L3 MTU".

434	   MSS The TCP Maximum Segment Size, the maximum payload size available
435	      to the TCP layer.  This is typically the path MPS minus the size
436	      of the TCP header.

438	   probe packet A packet which is being used to test a path for a larger
439	      MTU.

441	   probe size The size of a packet being used to probe for a larger MTU.

443	   successful probe The probe packet was delivered through the network
444	      and acknowledged by the Packetization Layer on the far node.

446	   inconclusive probe The probe packet was not delivered, but there were
447	      other lost packets close enough to the probe that it cannot be
448	      presumed that the probe was lost because it was larger than the
449	      path MTU.  By implication the probe might have been lost due to
450	      something other than MTU (such as congestion), so the results are
451	      inconclusive.

453	   failed probe The probe packet was not delivered and there were no
454	      other lost packets close to the probe.  This is taken as an
455	      indication that the probe was larger than the path MTU, and future
456	      probes should generally be for smaller sizes.

458	   errored probe There were losses or timeouts during the verification
459	      phase which suggest a potentially disruptive failure or network
460	      condition.  These are generally retried only after substantially
461	      longer intervals.

463	   probe gap The payload data that will be lost and need to be
464	      retransmitted if the probe is not delivered.

466	   probe phase The interval (time or protocol events) between when a
467	      probe is sent and when it is determined that the probe
468	      succeeded, failed or was inconclusive.

470	   verification phase An additional interval during which the new path
471	      MTU is considered provisional.  Packet losses or timeouts are
472	      treated as an indication that there may be a problem with the
473	      provisional MTU.

475	   transition phase The interval between the probe phase and the
476	      verification phase, during which packets using the new MTU
477	      propagate to the far node and the acknowledgment propagates back.

479	   probe sequence The sequence of events to raise the MTU by one step,
480	      starting with the transmission of a probe packet followed by
481	      probe, transition and verification phases.

483	   search strategy The heuristics used to choose successive probe sizes
484	      to converge to the proper path MTU, as described in section 7.5.

486	   full stop timeout A timeout where none of the packets transmitted
487	      after some event are acknowledged by the receiver, including any
488	      retransmissions.  This is taken as an indication of some failure
489	      condition in the network, such as a routing change onto a link
490	      with a smaller MTU.  For the sake of PLPMTUD we suggest the
491	      following definition of a full stop timeout:  the loss of one full
492	      window of data and at least one retransmission or at least 6
493	      consecutive packets including at least 2 retransmissions (along
494	      with two retransmission timer expirations).  [@@@ This probably
495	      needs some experimentation.]
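
Several of the definitions above can be made concrete with a little
arithmetic.  The sketch below assumes minimum-size headers (20 bytes
for IPv4 and for TCP, 40 bytes for the fixed IPv6 header) with no
options or extension headers.  The function names are illustrative,
and treating the four probe outcomes as mutually exclusive is a
simplification made for this example.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options
IPV6_HEADER_BYTES = 40  # fixed IPv6 header, no extension headers
TCP_HEADER_BYTES = 20   # minimum TCP header, no options


def path_mtu(link_mtus):
    """pMTU: the minimum link MTU of all the links in the path."""
    return min(link_mtus)


def mps(pmtu, ip_header_bytes):
    """MPS: the path MTU minus the IP header."""
    return pmtu - ip_header_bytes


def mss(mps_bytes, tcp_header_bytes=TCP_HEADER_BYTES):
    """MSS: the path MPS minus the TCP header."""
    return mps_bytes - tcp_header_bytes


def classify_probe(delivered, nearby_losses, verification_losses=False):
    """Map observations onto the probe outcomes defined in the glossary."""
    if not delivered:
        # Lost together with other packets: cannot blame the MTU alone.
        return "inconclusive" if nearby_losses else "failed"
    return "errored" if verification_losses else "successful"


links = [9000, 1500, 4352]              # link MTUs along a made-up path
pmtu = path_mtu(links)                  # 1500
ipv6_mps = mps(pmtu, IPV6_HEADER_BYTES) # 1460
ipv6_mss = mss(ipv6_mps)                # 1440
```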

497	4.  Requirements

499	   All Internet nodes SHOULD implement PLPMTUD in order to discover and
500	   take advantage of the largest MTU supported along the Internet path.

502	   Links MUST NOT deliver packets that are larger than their MTU.  Links
503	   that have parametric limitations (e.g.  MTU bounds due to limited
504	   clock stability) MUST include explicit mechanisms to consistently
505	   reject packets that might otherwise be nondeterministically
506	   delivered.

508	   All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
509	   functionality.  All fragmentation SHOULD be done on the host, and all
510	   IPv4 packets, including fragments, SHOULD have the DF bit set such
511	   that they will not be fragmented (again) in the network.  See Section
512	   6.4.

514	   The requirements below only apply to those implementations that
515	   include PLPMTUD.

517	   To use PLPMTUD a Packetization Layer MUST have a loss reporting
518	   mechanism that provides the sender with timely and accurate
519	   indications of which packets were lost in the network.

521	   Normal congestion control algorithms MUST remain in effect under all
522	   conditions except when only an isolated probe packet is detected to
523	   be lost.  In this case alone the normal congestion (window or data
524	   rate) reduction MAY be suppressed.  If any other data loss is
525	   detected, all normal congestion control MUST take place.

527	   Suppressed congestion control (as above) MUST be rate limited such
528	   that it occurs less frequently than the worst case loss rate for TCP
529	   congestion control at a comparable data rate over the same path (i.e.
530	   less than the "TCP-friendly" loss rate [@@]).  This SHOULD be
531	   enforced by requiring a minimum headway between a suppressed
532	   congestion adjustment (due to a failed probe) and the next attempted
533	   probe, which is equal to one round trip time for each packet
534	   permitted by the congestion window.  Alternatively this may be
535	   enforced by not suppressing congestion control if a 2nd probe is lost
536	   too soon after the 1st lost probe.  This and other issues relating to
537	   congestion control are discussed in section 7.6.

539	   Whenever the MTU is raised, the congestion state variables MUST be
540	   rescaled so as not to raise the window size in bytes (or data rate in
541	   bytes per second).

543	   Whenever the MTU is reduced (e.g.  when processing ICMP PTB messages)
544	   the congestion state variables SHOULD be rescaled so as not to raise
545	   the window size in packets.
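
For an implementation that keeps its congestion window in packets, the
two rescaling rules might look like the sketch below. The function name
and the decision to reject a window that cannot hold one packet at the
new size are our assumptions, not part of the requirements.

```python
def rescale_cwnd_packets(cwnd_packets: int, old_size: int,
                         new_size: int) -> int:
    """Rescale a packet-counted congestion window after an MTU change.

    Sizes are in bytes per packet. Hypothetical helper for illustration.
    """
    if new_size > old_size:
        # MTU raised: round down so the window in bytes does not grow.
        scaled = (cwnd_packets * old_size) // new_size
        if scaled < 1:
            # Assumption: the caller should not have probed with a
            # window this small (see the probing preconditions).
            raise ValueError("window too small for one packet at new size")
        return scaled
    # MTU reduced or unchanged: keep the packet count, so the window in
    # packets does not grow (the window in bytes shrinks).
    return cwnd_packets
```

For example, 20 packets of 1500 bytes become 3 packets of 9000 bytes
(27000 bytes, not more than the original 30000).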

547	   If PLPMTUD updates the MTU for a particular path, all Packetization
548	   Layer sessions that share the path representation SHOULD be notified
549	   to make use of the new MTU and make the required congestion
550	   adjustments.

552	   All implementations MUST include a mechanism to implement diagnostic
553	   tools that do not rely on the operating system's implementation of
554	   path MTU discovery.  This specifically requires the ability to send
555	   packets that are larger than the known MTU for the path, and to
556	   collect any resultant ICMP error messages.  See section 9.4 for
557	   further discussion of MTU diagnostics.

559	5.  Layering

561	   Packetization Layer Path MTU Discovery is most easily implemented by
562	   splitting its functions between layers.  The IP layer is in the best
563	   place to keep shared state, collect the ICMP messages, track IP
564	   header sizes and manage MTU information provided by the link layer
565	   interfaces.  However the procedures that PLPMTUD uses for probing,
566	   verification and scanning for the path MTU are very tightly coupled
567	   to the data recovery and congestion control state machines in the
568	   Packetization Layers.  The most difficult part of implementing
569	   PLPMTUD is properly splitting the implementation between the layers.

571	   Note that this layering is consistent with the advice in the current
572	   PMTUD specifications [2][3].  Many implementations of classical PMTU
573	   Discovery are already split along these same layers.

575	5.1  Accounting for Header Sizes

577	   Early implementation of PLPMTUD revealed that it is critically
578	   important to have a good clean mechanism for accounting for header
579	   sizes at all layers.  This is because each Packetization Layer does
580	   its calculations in its own natural data units, which are almost
581	   always a reflection of the service that the Packetization Layer
582	   provides to the application or other upper layers.  For example, TCP
583	   naturally performs all of its calculations in terms of sequence
584	   numbers and segment sizes.  However, the MTU size being probed, MTU
585	   size reported in ICMP PTB messages, etc., are measures of full
586	   packets, which not only include the TCP payload (measured in sequence
587	   space) but also include fixed TCP and IP headers, and may include
588	   IPv6 extension headers or IPv4 options, TCP options and even IPsec AH
589	   or ESP headers.

591	   PLPMTUD requires frequent translation between these two domains: the
592	   Packetization Layer's natural data unit and full IP packet sizes.
593	   While there are a number of possible ways to accurately implement
594	   dual size measures, our experience has been that it is best if the IP
595	   layer and the Packetization Layer communicate across their boundary
596	   in terms of the IP Maximum Payload Size or MPS.  The MPS is the only
597	   size measure that is common to both layers because it exactly matches
598	   the boundary between the layers.  The IP Layer is responsible for
599	   adding or deducting its own headers when translating between MTU and
600	   MPS.  Likewise the Packetization Layer is responsible for adding or
601	   deducting its own headers when doing calculations in its natural data
602	   units.  E.g.  for TCP, the MPS and MSS differ by the TCP header
603	   size.
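
The division of responsibility can be illustrated with a small sketch.
The function names are hypothetical, and the default header sizes
assume headers without IP options, extension headers, TCP options or
IPsec; a real stack would pass in the actual header lengths.

```python
IPV4_HEADER = 20   # bytes, without options
IPV6_HEADER = 40   # bytes, without extension headers
TCP_HEADER = 20    # bytes, without options


def mtu_to_mps(mtu: int, ip_header_len: int = IPV4_HEADER) -> int:
    """IP layer: deduct its own headers to get the Maximum Payload Size."""
    return mtu - ip_header_len


def mps_to_mtu(mps: int, ip_header_len: int = IPV4_HEADER) -> int:
    """IP layer: add its own headers back to get a full packet size."""
    return mps + ip_header_len


def mps_to_tcp_mss(mps: int, tcp_header_len: int = TCP_HEADER) -> int:
    """TCP: deduct its own header to get its natural data unit."""
    return mps - tcp_header_len
```

With these defaults a 1500 byte MTU gives an MPS of 1480 and a TCP MSS
of 1460 over IPv4, or an MPS of 1460 and an MSS of 1440 over IPv6.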

605	   Be aware that a casual reading of this document might give the
606	   impression that MTU, MPS and Packetization Layer data size (e.g.  TCP
607	   MSS) are used interchangeably.  They are not.  Our choice of
608	   terminology is consistent with the protocol layer being discussed in
609	   the surrounding context.  All implementors must pay attention to the
610	   distinction between these terms and include all necessary
611	   conversions, even when they are not explicitly indicated in this
612	   document.

614	   [Note that although existing implementations of classical path MTU
615	   discovery typically include some sort of path information cache
616	   structured as described here, none keep the cached information in
617	   MPS.  All known path information caches keep path information in terms
618	   of IP MTU, which results in layering (and probable scope) violations
619	   at every point where upper protocol layers need to make decisions
620	   about message sizes.  Early PLPMTUD implementations avoided
621	   redefining the path information cache, and as a consequence were
622	   fraught with problems relating to implementing MTU to MPS to payload
623	   size calculations in other parts of PLPMTUD.  We encourage all
624	   implementors (and potential implementors) to start by changing the
625	   path information cache to use MPS.  This change is quite possibly the
626	   most difficult step in implementing PLPMTUD, because it requires
627	   changes in many places throughout the protocol stack.  @@@@ remove
628	   before RFC status?]

630	5.2  Storing PMTU information

632	   The IP layer is the best place to store cached MPS values and other
633	   shared state such as MTU values reported by ICMP PTB messages.
634	   Ideally this shared state should be associated with a specific path
635	   traversed by packets exchanged between the source and destination
636	   nodes.  However, in most cases a node will not have enough
637	   information to completely and accurately identify such a path.
638	   Rather, a node must associate an MPS value with some local
639	   representation of a path.  It is left to the implementation to select
640	   the local representation of a path.

642	   An implementation could use the destination address as the local
643	   representation of a path.  The MPS value associated with a
644	   destination would be the minimum MPS learned across the set of all
645	   paths in use to that destination.  The set of paths in use to a
646	   particular destination is expected to be small, in many cases
647	   consisting of a single path.  This approach will result in the use of
648	   optimally sized packets on a per-destination basis.  This approach
649	   integrates nicely with the conceptual model of a host as described in
650	   [RFC 2461]: an MPS value could be stored with the corresponding entry
651	   in the destination cache.  However, NAT and other forms of middle
652	   boxes may exhibit differing MTUs at a single IP address.

654	   Note that network or subnet numbers are not suitable to use as
655	   representations of a path, because there is not a general mechanism
656	   to determine the network mask at the remote host.

658	   If IPv6 flows are in use, an implementation could use the IPv6 flow
659	   id [7][14] as the local representation of a path.  Packets sent to a
660	   particular destination but belonging to different flows may use
661	   different paths, with the choice of path depending on the flow id.
662	   This approach will result in the use of optimally sized packets on a
663	   per-flow basis, providing finer granularity than MPS values
664	   maintained on a per-destination basis.

666	   For source routed packets (i.e.  packets containing an IPv6 Routing
667	   header, or IPv4 LSRR or SSRR options), the source route may further
668	   qualify the local representation of a path.  An implementation could
669	   use source route information in the local representation of a path.
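
One way to picture a local path representation is as a cache key that
always contains the destination and may be further qualified by a flow
id or a source route. The sketch below is illustrative only: the class
names, fields and methods are our assumptions, and the choice of key
remains an implementation decision.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class PathKey(NamedTuple):
    """Hypothetical local representation of a path."""
    destination: str
    flow_label: Optional[int] = None       # IPv6 flow id, if in use
    source_route: Tuple[str, ...] = ()     # hops of a source route, if any


class PathCache:
    """Cache of MPS values keyed by the local path representation."""

    def __init__(self, default_mps: int) -> None:
        self.default_mps = default_mps
        self._mps: Dict[PathKey, int] = {}

    def lookup(self, key: PathKey) -> int:
        return self._mps.get(key, self.default_mps)

    def lower(self, key: PathKey, mps: int) -> int:
        """Keep the minimum MPS learned for every path sharing this key."""
        self._mps[key] = min(self.lookup(key), mps)
        return self._mps[key]

    def set_verified(self, key: PathKey, mps: int) -> None:
        """Store an MPS that a completed probe sequence has verified."""
        self._mps[key] = mps
```

Two flows to the same destination with different flow ids get separate
entries, which is the finer granularity described above.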

671	5.3  Accounting for IPsec

673	   This document does not take a stance on the placement of IPsec, which
674	   logically sits between IP and the Packetization Layer.  As far as
675	   PLPMTUD is concerned IPsec can be treated either as part of IP or as
676	   part of the Packetization Layer, as long as the accounting is
677	   consistent within the implementation.  If IPsec is treated as part of
678	   the IP layer, then each security association to a remote node may
679	   need to be treated as a separate path.  I.e.  the security
680	   association is used to represent the path.  If IPsec is treated as
681	   part of the packetization layer, the IPsec header size has to be
682	   included in the Packetization Layer's header size calculations.

684	5.4  Measuring path MTU

686	   This memo uses the concept of a "flow" to define the scope of the
687	   path MTU discovery algorithms.  For many implementations, a flow
688	   would naturally correspond to an instance of each protocol (i.e.
689	   each connection or session).  In such implementations the algorithms
690	   described in this document are performed within each session for each
691	   protocol.  The observed MPS can be shared between different flows
692	   sharing a common path representation.

694	   Alternatively, PLPMTUD could be implemented such that the complete
695	   PLPMTUD state is associated with the path representations.  Such an
696	   implementation could use multiple connections or sessions for each
697	   probe sequence.  E.g.  one connection could do the initial probe and
698	   set the provisional MTU and one or more subsequent connections
699	   could verify the MTU.  This approach may converge much more quickly
700	   in some environments such as when the application uses many small
701	   connections, each of which is too short to complete a probe sequence.

703	   These approaches are not mutually exclusive.  Due to differing
704	   constraints on generating probes (Section 7.2) and the MPS
705	   searching algorithm (Section 7.5), it may not be feasible for
706	   different packetization layer protocols to share PLPMTUD state.  This
707	   suggests that it may make sense to share state within some protocols,
708	   but not with others.  In this case, the different protocols can still
709	   share the observed MPS but they will have differing convergence
710	   properties.

712	6.  The Probing Sequence and Lower Layers

714	   This section describes the details of a probe sequence, including how
715	   to process MTU and error indications, necessary to raise the MTU by
716	   one step.

718	6.1  Normal sequence of events to raise the MTU

720	   If the probe size is smaller than the actual path MTU and there are
721	   no other losses, the normal sequence of events to raise the MTU is:
722	   1.  Confirm probing preconditions: no outstanding Packetization Layer
723	       losses, sufficient congestion window per section 7.6, and
724	       sufficient elapsed time since previous probe per section 6.3.  If
725	       the candidate MPS has not been set from ICMP MPS, then compute
726	       the candidate MPS per the MPS search strategy in section 7.5.

728	   2.  A new MTU is tested by sending one "probe packet", of size "probe
729	       size" (computed from the candidate MPS).  The probe is sent,
730	       followed by additional packets at the current MTU.  By definition
731	       PLPMTUD enters the probe phase.  The probe propagates through the
732	       network and the far node acknowledges it (or possibly later
733	       data, if acknowledgements are cumulative and delayed
734	       acknowledgement is in effect).

736	   3.  The acknowledgement for the probe reaches the data sender.  By
737	       definition, this ends the probe phase.

739	   4.  The packetization layer provisionally raises the MTU to the probe
740	       size.  PLPMTUD enters the transition phase when it starts
741	       sending data using the provisional MTU.

743	       Note that implementations that use packet counts for congestion
744	       accounting (e.g.  keep cwnd in units of packets) must re-scale
745	       their congestion accounting such that raising the MTU does not
746	       raise the data rate (bytes/second) or the total congestion window
747	       in bytes, as required in section 4 and discussed in 7.6.

749	       If the implementation packetizes the data at the application
750	       programming interface, it may transmit already queued data at the
751	       current MTU before raising the MTU.  In this case this data is
752	       not part of either the probing or transition phases, because all
753	       of the packets in flight fit within the current MTU.

755	   5.  Once the first packet of the transition phase is acknowledged,
756	       PLPMTUD enters the verification phase.  In principle the
757	       verification phase can be of arbitrary duration, however at this
758	       time we are recommending one full window of data (i.e.  one full
759	       round trip time) for most Packetization Layers.

761	   6.  Once there has been sufficient data delivered and acknowledged
762	       the provisional MTU is considered verified and the path MTU is
763	       updated.  PLPMTUD can then probe for an even larger MTU, as
764	       described in the searching strategy in section 7.5.

766	   Other events described in the next section are treated as exceptions
767	   and alter or cancel some of the steps above.
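
The normal sequence in section 6.1 can be pictured as a small state
machine. This is a minimal sketch under our own assumptions: the class
and method names are hypothetical, the preconditions of step 1 and all
exception handling from section 6.2 are left out, and steps 3 and 4 are
collapsed into a single event.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    PROBE = auto()
    TRANSITION = auto()
    VERIFICATION = auto()


class ProbeSequence:
    """Tracks one attempt to raise the MTU by one step."""

    def __init__(self, mtu: int) -> None:
        self.mtu = mtu            # last verified MTU
        self.probe_size = None    # doubles as the provisional MTU
        self.phase = Phase.IDLE

    def _expect(self, phase: Phase) -> None:
        if self.phase is not phase:
            raise RuntimeError("unexpected event in " + self.phase.name)

    def send_probe(self, probe_size: int) -> None:      # step 2
        self._expect(Phase.IDLE)
        if probe_size <= self.mtu:
            raise ValueError("probe must be larger than the current MTU")
        self.probe_size = probe_size
        self.phase = Phase.PROBE

    def probe_acked(self) -> None:                      # steps 3 and 4
        self._expect(Phase.PROBE)
        self.phase = Phase.TRANSITION

    def provisional_data_acked(self) -> None:           # step 5
        self._expect(Phase.TRANSITION)
        self.phase = Phase.VERIFICATION

    def verified(self) -> None:                         # step 6
        self._expect(Phase.VERIFICATION)
        self.mtu = self.probe_size
        self.probe_size = None
        self.phase = Phase.IDLE
```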

769	6.2  Processing MTU Indications

771	   When the probe sequence fails to raise the MTU, it will be due to one
772	   of three broad classes of outcomes: the probe was inconclusive,
773	   failed or errored.  If the probe was inconclusive, it means that
774	   there were other losses seemingly unrelated to the probe, such that
775	   the probe outcome was ambiguous.  Inconclusive probes should be
776	   retried with the same probe size.  If the probe failed, there was an
777	   indication that the probe size was larger than the path MTU, and the
778	   probe should be retried with a smaller size, as selected by the MTU
779	   searching algorithm.  In some situations there are indications that
780	   the probing sequence caused some unexpected event.  In these "error"
781	   conditions it is desirable to use progressively longer delays to
782	   minimize the possible impact to the network.

784	6.2.1  Processing ICMP PTB messages

786	   Classical PMTU discovery specifies the generation of ICMP PTB
787	   Messages if an over-sized packet (e.g.  a probe) encounters a link
788	   that has a smaller MTU.  Since these messages can not be
789	   authenticated they introduce a number of well documented attacks
790	   against classical PMTUD [5].

792	   With PLPMTUD these messages are not required for correct operation,
793	   and in principle can be summarily ignored at the expense of slower
794	   convergence to the proper MTU.  However we believe that a slightly
795	   better compromise is to save the reported PTB size (computed from the
796	   ICMP MTU) in the path information cache and act on it only in two
797	   specific contexts: in conjunction with a lost PLPMTUD probe or a
798	   full-stop timeout.

800	   Every ICMP PTB Message should be subjected to the following checks:
801	   o  If globally forbidden then discard the message.

803	   o  If forbidden by the application then discard the message.

805	   o  If this path has been tagged "bogus ICMP messages" then discard
806	      the message.

808	   o  If the reported MTU fails consistency checks then set "bogus ICMP
809	      messages" flag for this path and discard the message.  These
810	      consistency checks include:
811	      *  unrecognized or unparseable enclosed header, or
812	      *  reported MTU is larger than the size indicated by the enclosed
813	         header, or
814	      *  reported MTU is larger than the current MTU, provisional MTU or
815	         probe size as appropriate, or
816	      *  fails an ICMP consistency check specific to the Packetization
817	         Layer.  (E.g.  The SCTP Verification-Tag mechanism [9][16])
818	      To ease migration, it is suggested that implementations may
819	      include global controls to emulate legacy operation by suppressing
820	      some or all of the consistency checks.

822	   If the ICMP PTB message is acceptable under all of these checks then
823	   save the "ICMP MPS" computed from the MTU field in the ICMP message.
824	   If the global configuration switch is set to emulate classical path
825	   MTU discovery then process the message immediately (i.e.  set the
826	   path MPS to the ICMP MPS and invoke any protocol specific actions).
827	   Otherwise, the saved ICMP MPS will be acted upon if and only if there
828	   are other PLPMTUD events such as lost probes, etc as indicated in the
829	   next section.  This delayed processing of ICMP PTB messages makes it
830	   more difficult for an attacker to interfere with correct PLPMTUD
831	   operation by injecting fraudulent ICMP PTB messages.

833	   In either case if the Packetization Layer calls for specific actions
834	   in response to a PTB message, that action should be invoked only at
835	   the point when the path MPS is updated from the ICMP MPS.
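
The checks in section 6.2.1 might be arranged as in the sketch below.
Every name here is hypothetical, the Packetization Layer specific check
is reduced to a flag, and the sketch saves the reported MTU as is; a
real implementation would convert it to an MPS and would also honor the
global switch that emulates classical path MTU discovery.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathState:
    size_limit: int                 # current MTU, provisional MTU or
                                    # probe size, as appropriate
    bogus_icmp: bool = False        # "bogus ICMP messages" tag
    icmp_mtu: Optional[int] = None  # saved for delayed processing


def process_ptb(path: PathState, reported_mtu: int, enclosed_size: int,
                *, header_ok: bool = True, layer_check_ok: bool = True,
                globally_forbidden: bool = False,
                app_forbidden: bool = False) -> bool:
    """Return True if the PTB message was accepted and its MTU saved."""
    if globally_forbidden or app_forbidden or path.bogus_icmp:
        return False                # discard without further action
    consistent = (header_ok
                  and reported_mtu <= enclosed_size
                  and reported_mtu <= path.size_limit
                  and layer_check_ok)
    if not consistent:
        path.bogus_icmp = True      # tag the path and discard
        return False
    path.icmp_mtu = reported_mtu    # acted upon only on later events
    return True
```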

837	6.2.2  Packetization Layer Detects Lost Packets

839	   Each packetization protocol has its own mechanism to detect lost
840	   packets and request the retransmission of missing data.  The primary
841	   signals used by PLPMTUD are these protocol specific loss indications.
842	   The packetization layer is responsible for retransmitting the lost
843	   data and notifying PLPMTUD that there was a loss.
844	   o  If the probe itself was lost, and there were no other losses
845	      during the probe phase (the RTT between when the probe was sent
846	      and the loss detected) then it is taken as an indication that the
847	      path MTU is smaller than the probe size.  In this situation alone
848	      the Packetization Layer is permitted to retransmit the missing
849	      data (the "probe gap") without adjusting its congestion window or
850	      data transmission rate.

852	      If an accepted ICMP PTB message was received after the probe was
853	      sent, and it passes the additional checks that the ICMP MTU is
854	      greater than the current MTU and less than the probe size, then
855	      set the candidate MPS from the ICMP MTU, and restart the probe
856	      sequence from step 1 in section 6.1.

858	      If there was not an accepted PTB Message, then the indicated event
859	      is a "probe failure", which can be retried with a smaller probe
860	      size after a suitable delay for a probe_fail_event.  See section
861	      6.3 for more complete descriptions of failure events.

863	   o  If there are losses during the probe phase and the probe was not
864	      lost, then the probe was successful.  However, since additional
865	      losses have the potential to spoil the verification phase, it is
866	      important that PLPMTUD not progress into the transition phase
867	      (step 4 above) until after the Packetization Layer has fully
868	      recovered from the losses and completed the congestion window (or
869	      rate) adjustment.

871	   o  If there are losses during the probe phase and the probe was also
872	      lost, the outcome depends on the presence of an ICMP MTU set by an
873	      acceptable PTB message.

875	      If there was an accepted PTB message received since the probe was
876	      sent, and it passes the additional checks that the ICMP MTU is
877	      greater than the current MTU and less than the probe size, then
878	      set the candidate MPS from the ICMP MTU, and restart the probe
879	      sequence from step 1 in section 6.1.

881	      If there was not an acceptable ICMP PTB message, then the probe is
882	      inconclusive because the lost probe might have been caused by
883	      congestion.  The probe can be retried after a suitable delay for
884	      a probe_inconclusive_event.

886	   o  It is unlikely that losses during the transition phase are caused
887	      by PLPMTUD, however they do potentially complicate the
888	      verification phase.  Note that we are referring to losses that are
889	      bracketed by acknowledgement of packets that were sent at the old
890	      MTU, while the transition to the provisional MTU is still
891	      propagating through the network.  The first acknowledgement from
892	      the provisional MTU (and the transition to the verification phase)
893	      is most likely going to occur during the recovery of the losses in
894	      transition phase.  It is important that the Packetization Layer
895	      retransmission machinery distinguish between losses at the old MTU
896	      (transition phase) and the provisional MTU (the verification
897	      phase, discussed next).

899	   o  Losses during the verification phase are taken as an indication
900	      that the path may have a non-uniform MTU or some other problems
901	      such that raising the MTU raises the loss rate.  If so, this is
902	      potentially a very serious problem, so the provisional MTU is
903	      considered to have errored and the path MTU is set back to the
904	      previously verified MTU.

906	      Packet loss during the verification phase might also be due to
907	      coincidental congestion on the path, unrelated to the probe, so it
908	      would seem to be desirable to re-probe the path.  The risk is that
909	      this effectively raises the tolerated loss threshold because even
910	      though raising the MTU seemed to cause additional loss, there is a
911	      statistical chance that repeated attempts to verify a new MTU may
912	      yield a false pass.  The compromise is to re-probe once with the
913	      same probe size (after delay probe_inconclusive_event), and if
914	      this also fails, then the probe may not be retried until after a
915	      suitable delay for a verification_error_event, which exponentially
916	      increases on each successive failure.
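
The probe phase cases in section 6.2.2 reduce to a small decision
table. The sketch below covers only the probe phase (not losses in the
transition or verification phases); the enum and function names are
our own, and icmp_mtu is assumed to be None unless an accepted PTB
message arrived after the probe was sent.

```python
from enum import Enum, auto
from typing import Optional


class Outcome(Enum):
    SUCCESS = auto()                 # proceed to the transition phase
    SUCCESS_AFTER_RECOVERY = auto()  # proceed once losses are recovered
    RESTART_WITH_ICMP_MTU = auto()   # new candidate from the ICMP MTU
    PROBE_FAIL = auto()              # retry smaller, probe_fail_event
    INCONCLUSIVE = auto()            # retry, probe_inconclusive_event


def classify_probe_phase(probe_lost: bool, other_losses: bool,
                         current_mtu: int, probe_size: int,
                         icmp_mtu: Optional[int] = None) -> Outcome:
    if not probe_lost:
        if other_losses:
            return Outcome.SUCCESS_AFTER_RECOVERY
        return Outcome.SUCCESS
    # The probe was lost: an accepted ICMP MTU strictly between the
    # current MTU and the probe size becomes the next candidate.
    if icmp_mtu is not None and current_mtu < icmp_mtu < probe_size:
        return Outcome.RESTART_WITH_ICMP_MTU
    if other_losses:
        return Outcome.INCONCLUSIVE  # loss may have been congestion
    return Outcome.PROBE_FAIL        # isolated probe loss
```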

918	6.2.3  Packetization Layer Retransmission Timeout

920	   Note that we do not make distinctions between the various methods
921	   that different Packetization Layers might use for detecting and
922	   retransmitting lost packets.  It is preferable that the Packetization
923	   Layer uses a recovery mechanism similar to TCP SACK or fast
924	   retransmit designed to detect and report losses to recover as quickly
925	   as possible.

927	   Under some conditions the Packetization Layer may have to rely on
928	   retransmission timeouts or other fairly disruptive techniques to
929	   detect and recover from losses.  Since these greatly increase the
930	   cost of failed probes, it is recommended that PLPMTUD use even longer
931	   delays before re-probing.  In these situations replace
932	   probe_fail_event with probe_timeout_event.

934	6.2.4  Packetization Layer Full Stop Timeout

936	   Under all conditions (not just during MTU probing) a full stop
937	   timeout should be taken as an indication of some significantly
938	   disruptive event in the network, such as a router failure or a
939	   routing change to a path with a smaller MTU.

941	   If the ICMP MPS was recently set, and it is less than the current
942	   MTU (or provisional MTU during the transition phase), then the path
943	   MTU can be reduced to the ICMP MTU.  A full stop timeout is the only
944	   situation outside of a probe in which we recommend that the path MTU
945	   be set from the ICMP MTU.  (In section 9.1 we relax this
946	   recommendation to facilitate migration to PLPMTUD in exchange for
947	   slightly less protection from corrupt ICMP PTB messages).

949	   Note that whenever a problem with the path causes a full-stop
950	   timeout (also known as a "persistent timeout" in other documents),
951	   several different path restart/recovery algorithms may be invoked at
952	   different layers in the stack.  Some device drivers may be restarted
953	   [@@], router discovery [@@], ES-IS [@@] and so forth.  We recommend
954	   that in most situations the first action should be to set the path MTU
955	   down.  Note that this recommendation is really beyond the scope of
956	   this document, and may require substantial additional research.

958	   If there is a full stop timeout and there was not an ICMP message
959	   indicating a reason (PTB, Net unreachable, etc, or the ICMP message
960	   was ignored for some reason), we suggest that the first recovery
961	   action should be to set the path MTU down to a safe minimum "restart
962	   MTU" value, and then reset the PLPMTUD search state, so PLPMTUD will
963	   start over again searching for the proper MTU.  The default IPv4
964	   restart_MTU should be the minimum MTU as specified by IPv4 (576
965	   Bytes)[1].  The default IPv6 restart_MTU should be the minimum MTU as
966	   specified by IPv6 (1280 Bytes) [7].  These defaults apply unless
967	   overridden by some global control (see section 9.5).
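
The recovery choice after a full stop timeout could be sketched as
follows. The function name, the returned tuple, and the use of min() to
make sure the MTU is never raised are our assumptions; the caller is
assumed to pass icmp_mtu as None when no ICMP MTU was recently set or
when the message was ignored.

```python
from typing import Optional, Tuple

IPV4_RESTART_MTU = 576    # default from the text above
IPV6_RESTART_MTU = 1280   # default from the text above


def mtu_after_full_stop(current_mtu: int,
                        icmp_mtu: Optional[int] = None,
                        ipv6: bool = False,
                        restart_mtu: Optional[int] = None
                        ) -> Tuple[int, bool]:
    """Return (new path MTU, whether to reset the search state)."""
    if icmp_mtu is not None and icmp_mtu < current_mtu:
        # A recently saved ICMP MTU explains the timeout: use it.
        return icmp_mtu, False
    if restart_mtu is None:       # no global override configured
        restart_mtu = IPV6_RESTART_MTU if ipv6 else IPV4_RESTART_MTU
    # Assumption: never raise the MTU as part of recovery.
    return min(current_mtu, restart_mtu), True
```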

969	   If and only if the full stop timeout happens during the probe or
970	   transition phases (e.g.  after sending data using the provisional
971	   MTU but before any of it is acknowledged) is it considered likely
972	   that raising the MTU caused the full stop timeout.  If so this
973	   situation is likely to be cyclic, because resetting the PLPMTUD
974	   search state is likely to eventually cause re-probing the same
975	   problematic MTU.  It is tempting to define additional states to
976	   detect recurrent full stop timeouts.  However in today's hostile
977	   network environment, there is little tolerance for nodes that are so
978	   fragile that they can be disrupted by something as simple as
979	   oversized packets.  Therefore we do not feel that it is worth the
980	   overhead of specifying a state machine that is capable of
981	   automatically detecting these situations and disabling PLPMTUD.
982	   However, it is important that there be a manual way to disable or
983	   limit probing on specific paths.  See section 9.5.

985	6.3  Probing Intervals

987	   The previous sections describe a number of events that prevent a
988	   probe sequence from raising the path MTU.  In all cases the basic
989	   response is the same: to wait some time interval (dependent on the
990	   specific event and possibly the history) and then to probe again.
991	   For events that are "inconclusive", it is generally appropriate to
992	   re-probe with the same probe size.  For events that are identified as
993	   "failed probes" it is generally appropriate to re-probe with a
994	   smaller probe size.  The search strategy described in section 7.5 is
995	   used to select probe sizes.

997	   Many of the intervals below are specified in terms of elapsed round
998	   trips relative to the current congestion window.  This is because TCP
999	   and other Packetization Layer protocols tend to exhibit periodic
1000	   losses which cause periodic variations of the congestion window and
1001	   possibly the data rate.  It is preferable that the PLPMTUD probes are
1002	   scheduled near the low point of these cycles to minimize ambiguities
1003	   caused by congestion losses.

1005	   In order from least to most serious:
1006	   probe_converge_event The candidate probe size has already been probed
1007	      so there is no further searching.  Delay 5 minutes and then
1008	      re-probe last SEARCH_HIGH.

1010	   probe_inconclusive_event Other lost packets near the lost probe made
1011	      the probe result ambiguous.  Since the loss of non-probe packets
1012	      requires a window (or data rate) reduction, it is desirable to
1013	      schedule the re-probe (at the same probe size) roughly one round
1014	      trip time after the end of the loss recovery.  This will be almost
1015	      the minimum congestion window size, with a small cushion to
1016	      minimize the chances that correlated losses caused by some other
1017	      bursty connection spoil another probe.

1019	   probe_fail_event A probe fail event is the one situation under which
1020	      the Packetization layer is permitted not to treat loss as a
1021	      congestion signal.  Because there is some small risk that
1022	      suppressing congestion control might have unanticipated
1023	      consequences (even for one isolated loss), we require that probe
1024	      fail events be less frequent than the normal period for losses
1025	      under standard congestion control.  Specifically after a probe
1026	      fail event and suppressed congestion control, PLPMTUD may not
1027	      probe again until an interval which is comparable to the expected
1028	      interval between congestion control events.  This is required in
1029	      section 4 and discussed further in section 7.6.

1031	      The simplest estimate of the interval to the next congestion event
1032	      is the same number of round trips as the current window in
1033	      packets.

1035	   probe_timeout_event Since this event was detected by a timeout, it is
1036	      relatively disruptive to protocol operation.  Furthermore, since
1037	      the event indirectly includes a window adjustment that may have
1038	      been caused by the MTU probe, it is important that the probe not
1039	      be repeated until congestion control has had more than sufficient
1040	      time to recover from the loss.  Therefore we recommend five times
1041	      the probe_fail_event interval.  I.e.  five times as many round
1042	      trips as the current congestion window in packets.

1044	   verification_error_event A verification error event indicates that a
1045	      probe was delivered and the verification phase failed twice
1046	      separated by a congestion adjustment (so the second verification
1047	      phase was at a low point in the congestion control cycle).  This
1048	      is an indication that one of the following three things might have
1049	      happened: repeated losses unrelated to PLPMTUD; the path is
1050	      striped across links with dissimilar MTUs, or the link layer has
1051	      some parametric limitation such that raising the MTU greatly
1052	      increases the random error rate.

1054	      The optimal method of responding to this situation is an open
1055	      research question.  We believe that the correct response is some
1056	      combination of exponentially lengthening back-offs (e.g.  starting
1057	      at 1 minute and quadrupling on each repeat) and implicitly
1058	      treating the situation as a probe fail (and choosing a smaller
1059	      probe size) after some threshold number of repeated
1060	      verification_error_events.
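
   The back-off intervals suggested above can be sketched as follows.
   This is only an illustration: the function names are hypothetical,
   the window is assumed to be counted in packets, and the 1 minute
   base and factor of 4 are the example values from the text, not
   requirements.

```python
def probe_fail_backoff_rtts(cwnd_packets: int) -> int:
    """Round trips to wait after a probe_fail_event.

    Simplest estimate from the text: as many round trips as the
    current window holds packets.
    """
    return cwnd_packets


def probe_timeout_backoff_rtts(cwnd_packets: int) -> int:
    """Round trips to wait after a probe_timeout_event (five times longer)."""
    return 5 * probe_fail_backoff_rtts(cwnd_packets)


def verification_error_backoff_seconds(repeat: int,
                                       base_seconds: int = 60,
                                       factor: int = 4) -> int:
    """Seconds to wait after the repeat-th verification_error_event.

    repeat = 0 is the first event; each repeat multiplies the wait by
    `factor` (the text's example: start at 1 minute and quadruple).
    """
    return base_seconds * factor ** repeat
```

   For a window of 10 packets this gives 10 round trips after a probe
   fail and 50 after a probe timeout; successive verification errors
   would wait 60, 240 and 960 seconds.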

1062	6.4  Host fragmentation

1064	   Packetization layers are encouraged to avoid sending messages that
1065	   will require fragmentation (for the case against fragmentation, see
1066	   [17][18]).  However this is not always possible.  Some packetization
1067	   layers, such as a UDP application outside the kernel, may be unable
1068	   to change the size of the messages they send.  This may result in
1069	   packet sizes that exceed the Path MTU.

1071	   IPv4 permitted such applications to send packets without DF set.
1072	   Oversized packets without DF would be fragmented in the network or
1073	   sending host when they encountered a link with an MTU smaller than the
1074	   packet.  In some cases, packets could be fragmented more than once if
1075	   there were cascaded links with progressively smaller MTUs.

1077	   This approach is no longer recommended.  We now recommend that IPv4
1078	   implementations use a strategy that mimics IPv6 functionality.  When
1079	   an application sends datagrams that are larger than the known path
1080	   MTU, they should be fragmented to the path MTU in the host IP layer
1081	   even if they are smaller than the link MTU of the first network hop
1082	   directly attached to the host.  The DF bit should be set on the
1083	   fragments, so they will not be fragmented again in the network.

1085	   This technique will minimize future surprises as the Internet
1086	   migrates to IPv6.  Otherwise there is the potential for widely
1087	   deployed applications or services relying on IPv4 fragmentation in a
1088	   way that cannot be implemented in IPv6.  At least one major
1089	   operating system already uses this strategy.

1091	   Note that IP fragmentation divides data into packets, so it is
1092	   minimally a Packetization Layer.  However it does not have a
1093	   mechanism to detect lost packets, so it cannot support a native
1094	   implementation of PLPMTUD.  PLPMTUD has to be implemented with an
1095	   adjunct protocol as described in section 8.3.

1097	6.5  Multicast

1099	   In the case of a multicast destination address, copies of a packet
1100	   may traverse many different paths to reach many different nodes.  The
1101	   local representation of the "path" to a multicast destination must in
1102	   fact represent a potentially large set of paths.

1104	   Minimally, an implementation could maintain a single MPS value to be
1105	   used for all packets originated from the node.  This MPS value would
1106	   be the minimum MPS learned across the set of all paths in use by the
1107	   node.  This approach is likely to result in the use of smaller
1108	   packets than is necessary for many paths.

1110	   If the application using multicast gets complete delivery reports
1111	   (unlikely because this requirement has poor scaling properties),
1112	   PLPMTUD could be implemented in multicast protocols.

1114	7.  Common Packetization Properties

1116	   This section describes general Packetization Layer properties and
1117	   characteristics needed to implement PLPMTUD.  It also describes some
1118	   implementation issues that are common to all Packetization Layers.

1120	7.1  Mechanism to detect loss

1122	   It is important that the Packetization Layer has a timely and robust
1123	   mechanism for detecting and reporting losses.  PLPMTUD makes MTU
1124	   adjustments on the basis of detected losses.  Any delay or
1125	   inaccuracy in loss notification is likely to result in incorrect MTU
1126	   decisions or slow convergence.

1128	   It is best if Packetization Protocols use fairly explicit loss
1129	   notification such as selective acknowledgements, although implicit
1130	   mechanisms such as TCP Reno style duplicate acknowledgement counting
1131	   are sufficient.  It is important that the mechanism can robustly
1132	   distinguish between the isolated loss of just a probe and other
1133	   combinations of losses.

1135	   Many protocol implementations have complicated mechanisms such as SACK
1136	   scoreboards to distinguish between real losses and temporarily missing
1137	   data due to reordering in the network.  In these implementations it is
1138	   desirable to signal losses to PLPMTUD as a side effect of the data
1139	   retransmission.  This approach offers the maximum protection from
1140	   confusing signals due to reordering and other events that might mimic
1141	   losses.

1143	   PLPMTUD can also be implemented in protocols that rely on timeouts as
1144	   their primary mechanism for loss recovery, although this should be
1145	   used only when there are no other alternatives.

1147	7.2  Generating Probes

1149	   There are several possible ways to alter packetization layers to
1150	   generate probes.  The different techniques incur different overheads
1151	   in three areas: difficulty in generating the probe packet (in terms
1152	   of packetization layer implementation complexity and extra data
1153	   motion), possible additional network capacity consumed by the probes,
1154	   and the overhead of recovering from failed probes (both network and
1155	   protocol overheads).

1157	   Some protocols might be extended to allow arbitrary padding with
1158	   dummy data.  This greatly simplifies the implementation because the
1159	   probing can be performed without participation from higher layers and
1160	   if the probe fails, the missing data (the "probe gap") is assured to
1161	   fit within the current MTU when it is retransmitted.  This is
1162	   probably the most appropriate method for protocols that support
1163	   arbitrary length options or multiplexing within the protocol itself.

1165	   Many Packetization Layer protocols can carry pure control messages
1166	   (without any data from higher protocol layers) which can be padded to
1167	   arbitrary lengths.  For example the SCTP HEARTBEAT message can be
1168	   used in this manner (see section 8.2).  This approach has the
1169	   advantage that nothing needs to be retransmitted if the probe is
1170	   lost.

1172	   These techniques do not work for TCP, because there is not a separate
1173	   length field or other mechanism to differentiate between padding and
1174	   real payload data.  With TCP the only approach is to send additional
1175	   payload data in an over-sized segment.  There are at least two
1176	   variants of this approach, discussed in section 8.1.

1178	   In a few cases there may be no reasonable mechanism to generate probes
1179	   within the Packetization Layer protocol itself.  As a last resort it
1180	   may be possible to rely on an adjunct protocol, such as ICMP ECHO
1181	   (aka "ping"), to send probe packets.  See section 8.3 for further
1182	   discussion of this approach.

1184	7.3  Mechanism to support provisional MTUs

1186	   The verification phase requires a mechanism to provisionally raise the
1187	   MPS and, if there are additional losses, restore the old MPS.  While
1188	   this is not difficult for most potential Packetization Layers, there
1189	   are a few (e.g., ISO TP4 [ISOTP]) that are not allowed to repacketize
1190	   when doing a retransmission.  That is, once an attempt is made to
1191	   transmit a segment of a certain size, the transport cannot split the
1192	   contents of the segment into smaller segments for retransmission.  In
1193	   such a case, the original segment can be fragmented by the IP layer
1194	   during retransmission as described in section 6.4.  Subsequent
1195	   segments, when transmitted for the first time, should be no larger
1196	   than allowed by the path MTU.

1198	   Note that while padding is an appropriate mechanism for probing, it
1199	   is too wasteful for use during the verification phase.

1201	   Unresolved problem: if two PLs are using the same path and one can only
1202	   verify constrained sizes (e.g., blocks+headers) then the verified MTU
1203	   might be the actual packet size for the constrained PL, not the
1204	   probed size.  @@@@

1206	   Unresolved problem: what to do about very short flows?  No
1207	   verification phase?  @@@@@

1209	7.4  Selecting the initial MPS

1211	   If there is already a cached MPS value for this path, PLPMTUD may
1212	   use the saved MPS value.  Unless it is very recent (how recent?
1213	   @@@@@) SEARCH_HIGH should be set to SEARCH_MAX, to restart the search
1214	   process from the old MPS.

1216	   Note that there are tradeoffs to how long the path information cache
1217	   entries are retained when they are not being used by any flows.  If
1218	   they are kept for too long they waste memory; if too short, frequent
1219	   re-probing results.  We suggest an adjustable Least Recently Used
1220	   algorithm to purge old entries.  @@@@ This belongs some place else.

1222	   When the PLPMTUD process is started the recommended initial MPS
1223	   should normally be set such that the Packetization Layer can carry 1
1224	   kByte data segments.  This initial MPS would be 1 kByte plus space
1225	   for Packetization Layer headers (see section 5 on accounting for
1226	   headers).  With this MPS, RFC2414 [6] allows TCP and other
1227	   transport protocols to start with an initial window of 4 packets.

1229	   [We suspect, but have not confirmed, that] TCP completes sooner for
1230	   short connections when started with four 1kB packets rather than
1231	   three 1500 byte packets because the 2nd ACK occurs one round trip
1232	   earlier.
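
   The packet counts above follow from the RFC 2414 upper bound on the
   initial window, min(4*MSS, max(2*MSS, 4380 bytes)).  The sketch
   below works the arithmetic.  It assumes that "1 kByte" means 1024
   bytes of data and that a 1500 byte packet carries a 1460 byte
   segment (40 bytes of IPv4 and TCP headers, no options); the
   function names are illustrative only.

```python
def initial_window_bytes(mss: int) -> int:
    """Upper bound on the initial window from RFC 2414, in bytes."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_segments(mss: int) -> int:
    """Number of full-sized segments that fit in the initial window."""
    return initial_window_bytes(mss) // mss


# 1 kByte segments: min(4096, max(2048, 4380)) = 4096 bytes, 4 segments.
# 1460 byte segments: min(5840, max(2920, 4380)) = 4380 bytes, 3 segments.
```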

1234	   This initial MPS should also be configurable.  One of the
1235	   configuration options should be to mimic classical PMTUD behavior by
1236	   setting the initial MPS from the interface MTU.  This option
1237	   facilitates using PLPMTUD in a mode that mimics classical PMTU
1238	   discovery.  (See section 9.1)

1240	7.5  Common MPS Search Strategy

1242	   The MPS search strategy described here is only a rough guide for
1243	   implementors.  It is difficult to imagine a completely standard
1244	   algorithm because the strategy can include many Packetization Layer
1245	   specific heuristics to optimize MPS selection.  There is significant
1246	   opportunity for future improvements to this portion of PLPMTUD.

1248	   The search strategy is trying to find the largest "candidate MPS"
1249	   that meets the constraints of both the Packetization and the link
1250	   layers.  Although this algorithm is primarily described in terms of
1251	   MPS, it needs to use knowledge about link layer MTUs and
1252	   Packetization Layer buffer sizes.

1254	   The search strategy uses three variables:
1255	      SEARCH_MAX is the largest MPS that a Packetization Layer might be
1256	      able to use.  It is determined by such considerations as interface
1257	      MTU, widths of protocol length fields, and possibly other
1258	      protocol-dependent values, such as the TCP MSS option.  In
1259	      many cases it would be the same as the classical MTU discovery
1260	      initial MTU, minus the IP layer headers.
1261	      SEARCH_LOW is the largest validated MPS, the same as the current
1262	      MPS in use by the packetization layer.  The initial value for
1263	      SEARCH_LOW is described in section 7.4.

1265	      SEARCH_HIGH is the least invalidated MPS.  In most cases it will
1266	      be the most recent failed candidate MPS.  When PLPMTUD is
1267	      initialized SEARCH_HIGH should be set to SEARCH_MAX, indicating
1268	      that there have been no failed probes.

1270	   For many Packetization Layer protocols, the cost for a failed probe
1271	   is significantly higher than the cost of a successful probe due to
1272	   the additional time and overhead needed for retransmission and
1273	   recovery.  For this reason it is often desirable to bias the search
1274	   strategy toward taking more, smaller steps.

1276	   The search strategy first computes an initial candidate MPS using one
1277	   of these methods:
1278	      If SEARCH_HIGH >= SEARCH_MAX, there have been no recent failed
1279	      probes so use a coarse (geometric doubling) scan.  Set
1280	      candidate MPS = MIN(2 * SEARCH_LOW, SEARCH_MAX).  Otherwise use
1281	      one of several possible fine scan candidate MPS values:
1282	      Select a candidate MPS that corresponds to a common MTU, possibly
1283	      minus common tunnel header sizes, between SEARCH_LOW and
1284	      SEARCH_HIGH.  There is a fine scan heuristic described in section
1285	      7.5.1 that might be used.
1286	      Use a simple weighted binary search by selecting the candidate MPS
1287	      some prorated distance between SEARCH_LOW and SEARCH_HIGH, e.g.,
1288	      set
1289	      candidate MPS = SEARCH_LOW * (1 - alpha) + SEARCH_HIGH * alpha,
1290	      for some alpha between 0 and 1.  If you choose an alpha slightly
1291	      less than 0.5, PLPMTUD will tend to converge from below,
1292	      minimizing the number of failed probes.  Alternatively alpha can
1293	      be selected to optimally converge for some common MTUs, such as
1294	      1500 bytes.
1295	   If the Packetization Layer has preferred data sizes (e.g.  carries
1296	   block data), optionally round the candidate MPS to an efficient size
1297	   for the Packetization Layer.  The rounded candidate MPS would
1298	   typically be a multiple of the optimal data block size plus space for
1299	   Packetization Layer headers.  The MPS can be rounded up or down, but
1300	   should avoid selecting previously probed values if possible, per the
1301	   convergence test below.  Packetization Layers that do not have
1302	   intrinsically preferred data sizes may still choose to round the
1303	   candidate MPS to some convenient increment such as 4 or 8 bytes, to
1304	   prevent excessive hunting.  Note that this step is intrinsically
1305	   Packetization Layer dependent, and may be different for different
1306	   Packetization Layers.

1308	   If the resulting candidate MPS is not between SEARCH_LOW and
1309	   SEARCH_HIGH, then the probe process has converged and further probing
1310	   will not yield a better value for the MPS for this protocol.  To
1311	   detect if a routing change has raised the path MTU, the path should
1312	   be re-probed after a suitable delay as indicated by a
1313	   probe_converge_event (see section 6.3).  If the probe succeeds, then
1314	   SEARCH_HIGH should be set to SEARCH_MAX to restart the probing
1315	   process from the current MPS.

1317	   MPS searching can be implicitly disabled by setting SEARCH_HIGH
1318	   to SEARCH_LOW.
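
   A minimal sketch of the candidate selection and convergence test
   described in this section is shown below.  It is one possible
   reading, not a reference algorithm.  The function name, the default
   alpha of 0.45 and the rounding increment of 4 bytes are assumptions
   chosen for illustration.  The sketch also assumes that during the
   coarse scan a candidate equal to SEARCH_HIGH is still allowed
   (SEARCH_HIGH is then only the initial SEARCH_MAX, not a failed
   size), and that rounding is always downward.

```python
from typing import Optional


def next_candidate_mps(search_low: int, search_high: int, search_max: int,
                       alpha: float = 0.45,
                       round_to: int = 4) -> Optional[int]:
    """Return the next MPS to probe, or None if the search has converged."""
    coarse = search_high >= search_max
    if coarse:
        # No recent failed probes: geometric doubling, capped at SEARCH_MAX.
        candidate = min(2 * search_low, search_max)
    else:
        # Weighted binary search; alpha < 0.5 converges from below.
        candidate = int(search_low * (1 - alpha) + search_high * alpha)

    # Round down to a convenient increment to prevent excessive hunting.
    candidate = candidate // round_to * round_to

    # Convergence test: the candidate must lie above the validated size
    # and below the smallest size known to fail.
    if candidate <= search_low:
        return None
    if candidate > search_high or (not coarse and candidate == search_high):
        return None
    return candidate
```

   With SEARCH_LOW = 1024 and no failed probes (SEARCH_HIGH =
   SEARCH_MAX = 1460) the sketch probes 1460 next.  After a failure at
   1460 with SEARCH_MAX = 9000 it probes 1220.  Setting SEARCH_HIGH
   equal to SEARCH_LOW always returns None, which matches the implicit
   disable described above.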

1320	   Note that if two different Packetization Layers are sharing a path,
1321	   they may choose different MPSs due to differences in the protocols.
1322	   It is even possible for one of the Packetization Protocols to consider
1323	   the process converged, while the other continues to probe.  In this
1324	   case one of the Packetization Layers may choose not to use the
1325	   full MPS, and instead choose some slightly smaller but more
1326	   efficient packet size.

1328	7.5.1  Fine Scans

1330	   If SEARCH_LOW does not correspond to a common link MTU, and there is
1331	   a common link MTU between SEARCH_LOW and SEARCH_HIGH, set the
1332	   candidate MPS from the most common link MTU between SEARCH_LOW and
1333	   SEARCH_HIGH.

1335	   If SEARCH_LOW does not correspond to a common link MTU, and there is
1336	   not a common link MTU between SEARCH_LOW and SEARCH_HIGH, then set
1337	   the candidate MPS to either the weighted binary search between
1338	   SEARCH_LOW and SEARCH_HIGH or to SEARCH_HIGH, reduced by a reasonable
1339	   increment for tunnel headers.

1341	   If SEARCH_LOW corresponds to a common link MTU, set the candidate MPS
1342	   to SEARCH_LOW plus some small delta.  If this fails, we found the
1343	   proper MPS, otherwise we need to keep searching.
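
   The three cases above might be coded along the following lines.
   This is a sketch under several assumptions: the list of common
   sizes is supplied by the caller, ordered from most to least common
   and already converted from link MTU to MPS; the second case uses
   the weighted binary search rather than SEARCH_HIGH minus a tunnel
   header increment; and the delta of 8 bytes and alpha of 0.45 are
   arbitrary illustrative values.  This document does not yet specify
   the common sizes, so any values passed in are placeholders.

```python
from typing import Sequence


def fine_scan_candidate(search_low: int, search_high: int,
                        common_mps: Sequence[int],
                        delta: int = 8, alpha: float = 0.45) -> int:
    """Pick a fine scan candidate MPS between SEARCH_LOW and SEARCH_HIGH."""
    if search_low in common_mps:
        # Already at a common size: test just above it.  If that probe
        # fails, SEARCH_LOW was the proper MPS.
        return search_low + delta

    # Common sizes strictly inside the open search interval, in the
    # caller's order (most common first).
    between = [m for m in common_mps if search_low < m < search_high]
    if between:
        return between[0]

    # No common size in range: fall back to a weighted binary search.
    return int(search_low * (1 - alpha) + search_high * alpha)
```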

1345	   @@@@@ common link MTUs are: 1500......  ?

1347	   @@@@@ common tunnel header sizes are....

1349	7.6  Congestion Control and Window Management

1351	   PLPMTUD and congestion control share the same slice of the protocol
1352	   stack.  Both algorithms nominally run inside of a transport protocol
1353	   and rely on packet losses as their primary signal to adjust
1354	   parameters of the data stream (packet size or window size).
1355	   Furthermore both push up the controlled parameter until the onset of
1356	   packet losses, and then back off to a smaller value.  Due to the
1357	   close proximity of these two algorithms there is the potential for
1358	   side effects and unexpected interactions between them.

1360	   This section describes potential interactions between PLPMTUD and
1361	   congestion control.  In general PLPMTUD is designed to minimize its
1362	   potential impact on congestion control.  This is appropriate because
1363	   correctly functioning congestion control is critical to the overall
1364	   operation of the Internet.

1366	   The requirements in section 4 protect congestion control from
1367	   PLPMTUD.  It is important that MTU changes do not raise the
1368	   congestion window.  Given that we do not know a priori the nature of
1369	   the network bottleneck, PLPMTUD should not raise either the data rate
1370	   (bytes per second) or the packet rate (packets per second).

1372	   Since there is a risk that lost probes might actually be congestion
1373	   losses, and not MTU losses at all, we limit the maximum allowed rate
1374	   for suppressing congestion control to less than the loss rate
1375	   required to throttle the flow to the "TCP friendly" rate.  This
1376	   guarantees that the losses due to PLPMTUD are less than the losses
1377	   needed for normal congestion control.

1379	   If there is some node which is accounting queue length in bytes
1380	   (rather than packets), there is even the possibility that a probe
1381	   might cause a loss due to driving the queue over some threshold and
1382	   into congestion.  For this reason it is recommended that all PLPMTUD
1383	   implementations use some strategy to slightly depress the actual
1384	   window during the probe process.  It may be sufficient to require
1385	   that the excess data in the probe packet fits within the current
1386	   congestion control window.

1388	   If a probe is carrying real application data that must be
1389	   retransmitted, it is important to suppress (or restore) all of the
1390	   congestion control state changes normally associated with the
1391	   retransmission.  For example if a TCP connection is in slow-start
1392	   when a probe is lost, it is important that ssthresh is not changed as
1393	   a side effect of the probing.  It is for this reason that it is
1394	   strongly recommended that packetization protocols use some
1395	   combination of out-of-band echo message and padding, if at all
1396	   possible.  Lost probes that do not carry any real application data do
1397	   not need to be retransmitted.

1399	   It is recommended that TCP should not probe a new MPS if that MPS
1400	   will likely result in a cwnd of less than 5 segments.

1402	   If the network becomes too congested, it is recommended that the MPS
1403	   be reduced to a smaller size as determined by a heuristic.  The
1404	   recommended heuristic is to reduce the MPS by half if ssthresh is
1405	   reduced to 5 segments or smaller, with a minimum MPS of 512 bytes.
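
   The two TCP recommendations above can be sketched as follows.  The
   sketch assumes that cwnd and ssthresh are kept in bytes and that
   one segment occupies one MPS (header accounting is ignored for
   simplicity); the function and constant names are illustrative.

```python
MIN_SEGMENTS = 5   # threshold from the text, in segments
MIN_MPS = 512      # minimum MPS from the text, in bytes


def may_probe(cwnd_bytes: int, candidate_mps: int) -> bool:
    """True if probing candidate_mps would leave cwnd at 5 or more segments."""
    return cwnd_bytes // candidate_mps >= MIN_SEGMENTS


def reduce_mps_on_congestion(mps: int, ssthresh_bytes: int) -> int:
    """Halve the MPS when ssthresh has fallen to 5 segments or fewer."""
    if mps <= MIN_MPS:
        return mps  # already at or below the floor; never raise it here
    if ssthresh_bytes // mps <= MIN_SEGMENTS:
        return max(mps // 2, MIN_MPS)
    return mps
```

   For example, with a 14600 byte window a candidate MPS of 1460 is
   acceptable (10 segments) but 4000 is not (3 segments), and an MPS
   of 1460 with ssthresh at 7300 bytes is reduced to 730.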

1407	8.  Specific Packetization Layers

1409	   This section discusses specific implementation details for different
1410	   protocols that can be used as Packetization Layer protocols.  All
1411	   Packetization Layer protocols must consider all of the issues
1412	   discussed in section 7.  For most protocols it is self
1413	   evident how to address many of these issues.  It is hoped that the
1414	   protocols described here will be sufficient illustration for
1415	   implementors to adapt other protocols.

1417	8.1  Probing method using TCP

1419	   TCP has no mechanism that could be used to distinguish between real
1420	   application data and some other form of padding that might be used to
1421	   fill out probe packets.  Therefore, TCP must generate probes by
1422	   sending oversized segments that are carrying real data from upper
1423	   layers.  There are two approaches that TCP might use to minimize the
1424	   overheads associated with the probing sequence.

1426	   A TCP implementation of PLPMTUD can elect to send subsequent segments
1427	   overlapping the probe as though the probe segment was not oversized.
1428	   This has the advantage that TCP only needs to retransmit one segment
1429	   at the current MTU to recover from failed probes.  However the
1430	   duplicate data in the probe does consume network resources and will
1431	   cause duplicate acknowledgments.  It is important that these extra
1432	   duplicate acknowledgments not trigger Fast Retransmit.  This can be
1433	   guaranteed by limiting the largest probe segment size to twice the
1434	   current segment size (causing at most 1 duplicate acknowledgment) or
1435	   three times the current segment size (causing at most 2 duplicate
1436	   acknowledgments).
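
   The relation between probe size and duplicate acknowledgments can
   be illustrated with the sketch below.  It assumes in-order
   delivery, a receiver that acknowledges every segment, and a Fast
   Retransmit threshold of three duplicate acknowledgments.  Under
   these assumptions each following segment that lies entirely within
   the probe produces one duplicate acknowledgment.  The function
   names are hypothetical.

```python
DUPACK_THRESHOLD = 3  # duplicate ACKs that trigger Fast Retransmit


def duplicate_acks_from_probe(probe_size: int, segment_size: int) -> int:
    """Duplicate ACKs caused by overlapping segments after a delivered probe.

    The first segment's worth of the probe advances the ACK point;
    every further segment fully covered by the probe is a duplicate.
    """
    return max(probe_size // segment_size - 1, 0)


def max_overlapping_probe_size(segment_size: int, max_dupacks: int = 2) -> int:
    """Largest probe that causes at most max_dupacks duplicate ACKs."""
    if max_dupacks >= DUPACK_THRESHOLD:
        raise ValueError("would allow Fast Retransmit to trigger")
    return (max_dupacks + 1) * segment_size
```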

1438	   The other approach is to send non-overlapping segments following the
1439	   probe.  Although this is cleaner from a protocol architecture
1440	   standpoint, it clashes with many of the optimizations used to improve
1441	   the efficiency of data motion within many operating systems.  In
1442	   particular many implementations divide the data into segments and
1443	   pre-compute checksums as the data is copied out of application
1444	   buffers.  In these implementations it can be relatively expensive to
1445	   adjust segment boundaries after the data is already queued.

1447	   If TCP is using SACK or any other variable length headers, the
1448	   headers on the probe and verification packets should be padded to the
1449	   maximum possible length.  Otherwise, unexpected options on
1450	   bidirectional data may cause IP packets that are larger than
1451	   the tested MTU.

1453	   At the point when TCP is ready to start the verification phase, it is
1454	   permitted to transmit already queued data at the old MTU rather than
1455	   re-packetizing it.  This postpones the verification process by the
1456	   time required to send the queued data.

1458	   If the verification phase experiences any segment losses, TCP is
1459	   required to pull back to the prior MSS.  Since failing the
1460	   verification phase should be an infrequent error condition, it is less
1461	   important that this be as efficient as probing.

1463	8.2  Probing method using SCTP

1465	   In the SCTP protocol [9][16] the application writes messages to SCTP
1466	   and SCTP "chunkifies" them into smaller pieces suitable for
1467	   transmission through the network.  Once a message has been
1468	   chunkified, its chunks are assigned TSNs.  Once some TSNs have been
1469	   transmitted SCTP cannot change the chunk sizes.  SCTP multi-path
1470	   support normally requires SCTP to chunkify its messages to fit the
1471	   smallest MPS (maximum payload size, same as MTU - IP headers) of all
1472	   paths.  Although not required, implementations may bundle multiple
1473	   data chunks together to make larger IP packets to allow for support
1474	   for larger MPSs on different paths.  Note that SCTP must
1475	   independently probe and verify the MPS on each path to the peer.

1477	   The recommended method for generating probes is to add a chunk
1478	   consisting only of padding to an SCTP message.  There are two methods
1479	   to implement this padding.

1481	   In method 1, the message is padded with an SCTP heartbeat (HB) of
1482	   the necessary size to construct an IP packet of the desired probe size.
1483	   The peer SCTP implementation will acknowledge a successful probe
1484	   without delay by returning the same Heartbeat as a HEARTBEAT-ACK.
1485	   This method is fully compatible with current SCTP standards and
1486	   implementations, but is exposed to MPS limitation on the return path,
1487	   which might cause the HEARTBEAT-ACK to be lost.

1489	   In method 2, a new "PAD" chunk type would have to be defined.  This
1490	   chunk would be silently discarded by the peer.  The PAD chunk could be
1491	   attached to another message (either a minimum length HB or other
1492	   application data which will be acknowledged by the peer) to build a
1493	   probe packet.  The default action for unknown chunk types in the
1494	   range 128 to 190 (high bits = 10) is to "Skip this chunk and
1495	   continue processing" [RFC2960] - exactly the required behavior for a
1496	   PAD chunk.  Any currently unused type in this range will work for a
1497	   PAD chunk type.  This method is fully compatible with all current
1498	   SCTP implementations, but requires adding a new type to the current
1499	   standards.  It has the advantage that restrictions due to the return
1500	   path MPS are not applied to the forward path.
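
   The handling of unrecognized chunk types in RFC 2960 is selected by
   the two highest-order bits of the chunk type.  The sketch below
   decodes them to show why a type in the range 128 to 190 behaves as
   a PAD chunk at an unmodified peer.  The function names and the
   paraphrased action strings are illustrative, and no particular PAD
   type value is implied.

```python
UNKNOWN_CHUNK_ACTIONS = {
    0b00: "stop processing the packet and discard it",
    0b01: "stop processing the packet, discard it, and report",
    0b10: "skip this chunk and continue processing",
    0b11: "skip this chunk, continue processing, and report",
}


def unknown_chunk_action(chunk_type: int) -> str:
    """Action a receiver takes for an unrecognized 8-bit chunk type."""
    if not 0 <= chunk_type <= 255:
        raise ValueError("chunk type is an 8-bit value")
    return UNKNOWN_CHUNK_ACTIONS[chunk_type >> 6]


def usable_as_pad(chunk_type: int) -> bool:
    """True if the type lies in the 128 to 190 range named in the text."""
    return 128 <= chunk_type <= 190
```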

1502	   The verification phase is most efficiently implemented by picking a
1503	   new chunk size such that the new MPS and all of the old multi-path
1504	   MPSs are larger than different multiples of the new chunk size, by at
1505	   least the required header sizes.  This approach permits chunks from
1506	   SCTP application messages to be assembled into packets that are
1507	   suitable for any path to the peer at either the old or new MPS.  This
1508	   is the easiest method to permit the provisional MPS to be withdrawn,
1509	   if there are losses during the verification phase.

1511	   Once each of the old path MPSs has been updated to a new verified MPS,
1512	   SCTP may be able to pick a new larger chunk size that will fit into
1513	   all paths.  However, if the MPS is later reduced (say due to a
1514	   routing change and subsequent ICMP PTB message) SCTP will be forced
1515	   to use IP fragmentation to transmit application messages that are
1516	   already chunkified, as described in section 7.3.

1518	   The constraints on efficiently choosing chunk sizes are complicated
1519	   enough to make it difficult if not impossible to efficiently support
1520	   arbitrary combinations of old and new MPSs.  It greatly simplifies the
1521	   implementation to add constraints, such as making the chunk size
1522	   itself a multiple of some common size, such as 512 bytes.  This in
1523	   turn constrains the searching algorithm to test MPSs that are
1524	   multiples of 512 bytes, plus the appropriate headers.  Clearly the
1525	   PLPMTUD search heuristic for SCTP must be constrained to pick
1526	   candidate MPSs that are consistent with the limitations of the
1527	   algorithm for choosing appropriate chunk sizes.
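
   One way to apply such a constraint is to round every candidate MPS
   down to a whole number of 512 byte blocks plus headers, as sketched
   below.  The header allowance of 28 bytes (a 12 byte SCTP common
   header plus a 16 byte DATA chunk header) is an assumption for a
   packet carrying a single DATA chunk; the function name is
   hypothetical.

```python
from typing import Optional


def sctp_candidate_mps(raw_candidate: int, header_bytes: int = 28,
                       block: int = 512) -> Optional[int]:
    """Round a candidate MPS down to k * block + header_bytes, k >= 1.

    Returns None when not even one block fits in raw_candidate.
    """
    blocks = (raw_candidate - header_bytes) // block
    if blocks < 1:
        return None
    return blocks * block + header_bytes
```

   A raw candidate of 1480 bytes becomes 1052 (two blocks), and 9000
   becomes 8732 (seventeen blocks).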

1529	   The SCTP Verification-Tag is designed to increase SCTP's robustness in
1530	   the presence of a number of attacks, including forged ICMP messages.
1531	   It relies on a 32 bit Verification Tag which is initialized to a
1532	   random value during connection establishment and placed in the first
1533	   64 bits of all SCTP messages.  All subsequent messages (including
1534	   ICMP messages, which copy at least the first 64 bits of the message)
1535	   must match the original Verification Tag, or they are rejected as
1536	   being likely attacks against the connection.

1538	   It is believed that the Verification Tag mechanism is strong enough
1539	   that SCTP could unconditionally process ICMP PTB messages that would
1540	   reduce the path MPS at arbitrary times.  As written, this document
1541	   does not encourage this method.  The PLPMTUD ICMP validity checks are
1542	   cascaded with the SCTP checks, such that the messages are processed
1543	   only if they meet all consistency checks for both protocols.  In
1544	   particular, PLPMTUD only uses the ICMP MPS value following a probe,
1545	   during MPS verification, or following a full stop timeout.

1547	   Alternatively, an SCTP implementation could suppress some of the
1548	   checks in section 6.2.1.

1550	8.3  Probing method for IP fragmentation

1552	   As mentioned in section 6.4, datagram protocols (such as UDP) might
1553	   rely on IP fragmentation as a packetization layer.  However,
1554	   implementing PLPMTUD with IP fragmentation is problematic because the
1555	   IP layer has no mechanism to determine if the packets are
1556	   ultimately delivered properly to the far node, without participation
1557	   by the application.

1559	   To support IP fragmentation as a packetization layer under an
1560	   unmodified application, we propose the use of an adjunct MTU
1561	   measurement protocol (ICMP ECHO) and a separate path MTU discovery
1562	   daemon (described here) to perform PLPMTUD and update the stored path
1563	   MTU information.

1565	   For IP fragmentation the initial MPS should be selected as described
1566	   in section 7.4, except with a separate global control for the default
1567	   initial MPS for connectionless protocols.  Since connectionless
1568	   protocols may not keep enough state to effectively diagnose MTU black
1569	   holes, it would be more robust to err on the side of using too
1570	   small an initial MTU (e.g., 1 kByte or less) prior to initiating
1571	   probing of the path to measure the MTU.

1573	   Since many protocols that rely on IP fragmentation are
1574	   connectionless, there is an additional problem with the path
1575	   information cache: there are no events corresponding to connection
1576	   establishment and tear-down to use to manage the cache itself.  We
1577	   take this approach: if there is no entry in the path information
1578	   cache for a particular packet being transmitted, it uses an immutable
1579	   cache entry for the "default path", which has an MPS that is fixed at
1580	   the initial value.  A new path cache entry is not created until there
1581	   is an attempt to set the MPS.
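
   The cache behavior described above can be sketched as a small data
   structure.  The class and method names are hypothetical, paths are
   keyed by destination address string for simplicity, and the 1024
   byte default is only an example of a configured initial MPS.  Cache
   purging is not shown.

```python
class PathInfoCache:
    """Path information cache with an immutable default entry."""

    def __init__(self, initial_mps: int = 1024) -> None:
        self._default_mps = initial_mps   # fixed at the initial value
        self._entries: dict[str, int] = {}

    def get_mps(self, destination: str) -> int:
        # Lookups never create entries; unknown paths share the default.
        return self._entries.get(destination, self._default_mps)

    def set_mps(self, destination: str, mps: int) -> None:
        # A per-path entry is created only when an MPS is actually set.
        self._entries[destination] = mps

    def __len__(self) -> int:
        return len(self._entries)
```

   Sending to a previously unseen destination therefore leaves the
   cache empty; only a probe result that sets an MPS allocates state.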

1583	   The path MTU discovery daemon should be triggered as a side effect of
1584	   IP fragmentation.  Once the number of fragmented datagrams via any
1585	   particular path reaches some configurable threshold (say 5
1586	   datagrams), the daemon can start probing the path with ICMP ECHO
1587	   packets.  These probes must use the diagnostic interface described in
1588	   section 9.4 and have DF set.  The daemon can implement all of the
1589	   PLPMTUD probe sequence and search strategy, collect all of the ICMP
1590	   responses (ECHO REPLY, ICMP PTB, etc.) and only update the saved PTB in the
1591	   path information cache in the IP layer.

1593	   Alternatively, most of the PLPMTUD state machinery can be implemented
1594	   within the path information cache in the IP layer, which can
1595	   specifically invoke the path MTU discovery daemon to perform
1596	   specified measurements on specific paths and report the results back
1597	   to the IP layer.

1599	   Using ICMP ECHO to measure the MTU has a number of potential
1600	   robustness problems.  Note that the most likely failures are due to
1601	   losses unrelated to MTU (e.g.  nodes that discriminate on the basis
1602	   of protocol type).  These non-MTU losses can prevent PLPMTUD from
1603	   raising the MTU, forcing the Packetization Layer protocol to use a
1604	   smaller MTU than necessary.  Since these failures are not likely to
1605	   cause interoperability problems, they are relatively benign.

1607	   However there do exist other more serious failure modes, such as
1608	   layer 3 or 4 routers choosing different paths for different protocol
1609	   types or sessions.  In such environments, adjunct protocols may
1610	   experience different MTUs than the primary protocol.  If the adjunct
1611	   protocol has a larger MTU than the primary protocol, PLPMTUD will
1612	   select a non-functional MTU.  This does not seem to be a likely
1613	   situation.

1615	8.4  Probing method for applications

1617	   The disadvantages of probing with ICMP ECHO can be overcome by
1618	   implementing the path MTU discovery daemon within the application
1619	   itself, using the application's own protocol.

1621	   The application must have some suitable method for generating probes.
1622	   The ideal situation is a lightweight echo function that confirms
1623	   message delivery, plus a mechanism for padding the messages out to
1624	   the desired MTU, such that the padding is not echoed.  This
1625	   combination (akin to the SCTP HB plus PAD) is preferred because
1626	   it can send large probes that cause small acknowledgements.  For
1627	   protocols that cannot implement these messages directly, there are
1628	   often alternate methods for generating probes.  For example, the
1629	   protocol may have a variable length echo (that measures both the
1630	   forward and return path) or, if there is no echo function, there may
1631	   be a way to add padding to regular messages carrying real application
1632	   data.  There may be other ways to generate probes.  As a last resort,
1633	   it may be feasible to extend the protocol with new message types to
1634	   support MTU discovery.
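
   As an illustration of the preferred combination, the sketch below
   builds a padded probe and a small reply that echoes only a token.
   The wire layout (type, token, padding length) and the function names
   are assumptions made for this example; they are not taken from SCTP
   or from any existing protocol.

```python
import struct

# Hypothetical layout: 1-byte type, 4-byte token, 2-byte padding length.
HEADER = struct.Struct("!BIH")
PROBE, REPLY = 1, 2


def build_probe(token, target_size):
    """Return a probe message of exactly target_size bytes (header + padding)."""
    pad_len = target_size - HEADER.size
    if pad_len < 0:
        raise ValueError("target_size is smaller than the probe header")
    return HEADER.pack(PROBE, token, pad_len) + bytes(pad_len)


def build_reply(probe):
    """Echo the token only; the padding is deliberately not returned."""
    msg_type, token, _pad_len = HEADER.unpack_from(probe)
    if msg_type != PROBE:
        raise ValueError("not a probe message")
    return HEADER.pack(REPLY, token, 0)
```

   Here target_size is the size of the application message; the caller
   is assumed to subtract IP and transport header sizes from the MTU
   being probed before calling build_probe().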

1636	   Probing within an application introduces one new issue: many
1637	   applications do not currently concern themselves with MTU and rely on
1638	   IP fragmentation to deliver datagrams that just happen to be larger
1639	   than the path MTU.  PLPMTUD requires that the protocol can send
1640	   probes that are larger than the IP layer's current notion of the path
1641	   MTU, but are marked not to be fragmented.  This requires an alternate
1642	   method for sending these datagrams.

1644	   As with ICMP MTU probing, there is considerable flexibility in how
1645	   the PLPMTUD algorithms can be divided between the application and the
1646	   path information cache.

1648	   Some applications send large datagrams no matter what the link size,
1649	   and rely on IP fragmentation to deliver the datagrams.  It has been
1650	   known for a long time that this has some undesirable consequences
1651	   [@@harm1].  Recently it has come to light that IPv4 fragmentation is
1652	   not sufficiently robust for general use in today's Internet.  The
1653	   16-bit IP identification field is not large enough to prevent
1654	   frequent misassociated IP fragments, and the TCP and UDP checksums
1655	   are insufficient to prevent the resulting corrupted data from being
1656	   delivered to higher protocol layers [@@harm2].
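
   The limitation of the 16-bit identification field can be made
   concrete with a rough calculation of how quickly the field wraps.
   The data rate and datagram size used below are assumptions chosen
   for illustration, and the sketch assumes that every datagram between
   one pair of hosts consumes one identification value.

```python
def ip_id_wrap_seconds(rate_bps, datagram_bytes, id_bits=16):
    """Seconds until the IP identification field repeats at a steady rate."""
    datagrams_per_second = rate_bps / (datagram_bytes * 8)
    return (2 ** id_bits) / datagrams_per_second


# Assumed example: 1 Gb/s of 1500-byte datagrams.
wrap = ip_id_wrap_seconds(1e9, 1500)
print(f"ID field wraps after about {wrap:.2f} s")
```

   Under these assumptions the field wraps in well under one second.
   If a receiver holds incomplete fragments for longer than that, a
   stale fragment can be matched with a newer datagram that reuses the
   same identification value.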

1658	   Nonetheless, there are a number of higher layer protocols, such as
1659	   NFS [@@NFS], which use IP fragmentation as a mechanism to reduce CPU
1660	   load.  NFS typically sends fragmented 8 kByte datagrams over all
1661	   link types, no matter what the link MTU.  The other common case, in
1662	   which the application wants to use the largest possible datagram that
1663	   fits within the MTU, is most easily treated as a special case of the
1664	   fragmenting case.

1666	9.  Operational Integration

1668	   This section describes ways to minimize deployment problems for
1669	   PLPMTUD, by including a number of good management features:
1670	   mechanisms to diagnose problems with path MTU discovery, and
1671	   configuration controls such that the more risky properties can be
1672	   progressively deployed.  We also address some potentially serious
1673	   interactions with nodes that do not honor the DF bit.

1675	9.1  Interoperation with prior algorithms

1677	   Properly functioning Path MTU discovery is critical to the robust and
1678	   efficient operation of the Internet.  Any major change (as described
1679	   in this document) has the potential to be very disruptive if it
1680	   contains any errors or oversights.  Therefore, we offer a deployment
1681	   strategy in which classical PMTUD operation as described in RFC 1191
1682	   and RFC 1981 is unmodified and PLPMTUD is only invoked following a
1683	   full stop timeout, presumably due to an "ICMP black hole".  To do
1684	   this:
1685	   o  Relax the ICMP checks in section 6.2.1 specifically to allow an
1686	      ICMP Packet Too Large message to reduce the MTU at arbitrary
1687	      times.
1688	   o  When there is no cached MTU, use the Interface MTU as specified by
1689	      classical PMTU discovery, rather than the initial MTU as specified
1690	      in section 7.4.
1691	   o  MTU searching as described in section 7.5 is disabled by setting
1692	      SEARCH_HIGH equal to SEARCH_LOW and the initial MPS.
1693	   o  A full stop timeout is processed as described in section 6.2.4.
1694	      This becomes the only mechanism to invoke the rest of PLPMTUD.
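
   The four settings above might be collected into a single
   configuration object, as in the following sketch.  The class, field,
   and function names are hypothetical, and the sketch treats the
   initial MPS as equal to the starting MTU, which is a simplification
   made only for this example.

```python
from dataclasses import dataclass


@dataclass
class PlpmtudConfig:
    relax_icmp_checks: bool           # let an ICMP PTB reduce the MTU at any time
    initial_mtu: int                  # starting MTU for the path
    search_low: int                   # SEARCH_LOW
    search_high: int                  # SEARCH_HIGH
    probe_on_full_stop_timeout: bool  # the only remaining way into PLPMTUD

    @property
    def searching_enabled(self):
        """Searching is disabled when SEARCH_HIGH equals SEARCH_LOW."""
        return self.search_high > self.search_low


def classical_compat_config(interface_mtu, cached_mtu=None):
    """Build the conservative deployment configuration described above."""
    start = cached_mtu if cached_mtu is not None else interface_mtu
    return PlpmtudConfig(
        relax_icmp_checks=True,
        initial_mtu=start,
        search_low=start,   # SEARCH_HIGH == SEARCH_LOW disables searching
        search_high=start,
        probe_on_full_stop_timeout=True,
    )
```

   With this configuration, classical_compat_config(1500) starts at the
   Interface MTU with searching disabled, so the rest of PLPMTUD runs
   only after a full stop timeout.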

1696	   When configured in this manner, PLPMTUD will increase the robustness
1697	   of classical PMTU discovery in the presence of ICMP black holes and
1698	   other ICMP problems, with minimal exposure to unanticipated problems
1699	   during deployment.  Since this configuration does not help robustness
1700	   in the presence of malicious or erroneous ICMP messages, it is not
1701	   recommended for the long term.

1703	9.2  Operation over subnets with dissimilar MTUs

1705	   With classical PMTUD, the ingress router to a subnet is responsible
1706	   for knowing what size packets can be delivered to every node attached
1707	   to that subnet.  For most subnet types, this requires that the
1708	   entire subnet has a single MTU which is common to every attached
1709	   node.  (For a few subnet types, such as ATM [12], the nodes on a
1710	   subnet can negotiate the MTU on a pairwise basis, and the ingress
1711	   router is responsible for knowing the MTU to each of its peers.)

1713	   This requirement has proven to be a major impediment to deploying
1714	   larger MTUs in the operational Internet.  Often one single node which
1715	   does not support a larger MTU effectively vetoes raising the MTU on a
1716	   subnet, because the ingress router does not have a mechanism to
1717	   generate the proper ICMP PTB message for the one attached node with a
1718	   smaller MTU.

1720	   With PLPMTUD, this requirement is completely relaxed.  As long as
1721	   oversized packets addressed to nodes with the smaller MTU are
1722	   reliably discarded, PLPMTUD will find the proper MTU for these nodes.

1724	   Once there is sufficient field experience to demonstrate that
1725	   PLPMTUD is robust, we recommend that OS vendors consider updating
1726	   default MTUs for Network Interface Cards.  It would raise the overall
1727	   performance of the Internet if all NICs were configured to default to
1728	   the MTU which is most efficient for the NIC (lowest overhead per
1729	   byte), rather than the standard MTU for the media or switch.  This is
1730	   most likely to be the largest MTU supported by the NIC chip set or
1731	   some other logical boundary, such as memory page sizes.

1733	9.3  Interoperation with tunnels

1735	   PLPMTUD is specifically designed to solve many of the problems that
1736	   people are experiencing today due to poor interactions between
1737	   classical MTU discovery, IPsec, and various sorts of tunnels [5].  As
1738	   long as the tunnel reliably discards packets that are too large,
1739	   PLPMTUD will discover an appropriate MTU for the path.

1741	   Unfortunately, due to the pervasive problems with classical PMTU
1742	   discovery, many manufacturers of various types of VPN/tunneling
1743	   equipment have resorted to ignoring the DF bit under some conditions.

1745	   This not only violates the IP standard and many recommendations to
1746	   the contrary [17][18], it also violates the only requirement that
1747	   PLPMTUD places on the link layer: that oversized packets are reliably
1748	   discarded.  It is imperative that people understand the impact of
1749	   ignoring the DF bit both to applications and to PLPMTUD.

1751	   We do understand the reality of the situation.  It is important that
1752	   vendors who are building devices that violate the DF specification
1753	   understand that PLPMTUD requires that probe packets be discarded, and
1754	   that sending ICMP PTB messages alone is insufficient to prevent
1755	   wholesale fragmentation if the probe packets are delivered.

1757	   Therefore, it is imperative that devices that do not honor DF include
1758	   packet size history caches and other heuristics to robustly detect
1759	   and discard probe packets, if delivering them would require
1760	   fragmentation.

1762	9.4  Diagnostic tools

1764	   All implementations MUST include facilities for MTU discovery
1765	   diagnostic tools that implement PLPMTUD or other MTU discovery
1766	   algorithms in user mode without help or interference by the PMTUD
1767	   algorithm present in the operating system.  This requires a
1768	   mechanism where a diagnostic application can send packets that are
1769	   larger than the operating system's notion of the current path MTU and
1770	   for the diagnostic application to collect any resulting ICMP PTB
1771	   messages or other ICMP messages.  For IPv4, the diagnostic
1772	   application must be able to set the DF bit.

1774	   At this time nearly all operating systems support two modes for
1775	   sending UDP datagrams: one which silently fragments packets that are
1776	   too large, and another that rejects packets that are too large.
1777	   Neither of these modes is suitable for efficiently diagnosing
1778	   problems with MTU discovery, such as routers that return ICMP PTB
1779	   messages containing incorrect size information.

1781	9.5  Management interface

1783	   It is suggested that an implementation provide a way for a system
1784	   utility program to:
1785	   o  Globally disable all ICMP Packet Too Large message processing.
1786	   o  Globally suppress some or all ICMP consistency checks described in
1787	      section 6.2.1.  Setting this option forgoes some possible
1788	      security improvements, in exchange for making PLPMTUD behave more
1789	      like classical PMTU discovery.  (See section 9.1.)
1790	   o  Globally permit ICMP Packet Too Large messages to unconditionally
1791	      reduce the MTU, even if there were no lost packets.  Setting this
1792	      option forgoes some possible security improvements, in exchange
1793	      for making PLPMTUD behave more like classical PMTU discovery.
1794	      (See section 9.1.)
1795	   o  Globally adjust timer intervals for specific classes of probe
1796	      failures.

1798	   In addition, it is important that there be a mechanism to permit per
1799	   path controls to override specific parts of the PLPMTUD algorithm.
1800	   All of these per path controls should be preset from similar global
1801	   controls:
1802	   o  Disable MTU searching on a given path, such that new MTU values
1803	      are never probed.
1804	   o  Set the initial MTU for a given path.  This could be used to speed
1805	      convergence in relatively static environments.  There should be an
1806	      option to cause PLPMTUD to choose the same initial value as would
1807	      be chosen by classical PMTU discovery (i.e., typically the
1808	      Interface MTU).  This is used in the mode described in section 9.1
1809	      where PLPMTUD is used only for black hole detection in classical
1810	      PMTU discovery.
1811	   o  Limit the maximum probed MTU for a given path.  This permits a
1812	      manual configuration to work around a link that spuriously
1813	      delivers packets that are larger than the useful path MTU.
1814	   o  Per path and per application controls to disable ICMP processing,
1815	      to further limit possible damage from malicious ICMP PTB messages
1816	      (in addition to the global controls).
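
   One possible shape for these controls is a table of per path
   settings that are preset from a global default, as sketched below.
   All names are hypothetical and only the controls listed above are
   modeled; this is not a defined management API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PathControls:
    search_enabled: bool = True           # False: new MTU values are never probed
    initial_mtu: Optional[int] = None     # None: let PLPMTUD choose
    max_probed_mtu: Optional[int] = None  # None: no manual upper limit
    icmp_processing: bool = True          # False: ignore ICMP PTB on this path


class ControlTable:
    """Per path controls that are preset from the global controls."""

    def __init__(self, global_defaults=PathControls()):
        self.global_defaults = global_defaults
        self.per_path = {}

    def set_path(self, path, **overrides):
        """Override selected controls for one path, keeping the rest."""
        base = self.per_path.get(path, self.global_defaults)
        self.per_path[path] = replace(base, **overrides)

    def lookup(self, path):
        return self.per_path.get(path, self.global_defaults)

    def clamp_probe(self, path, probe_size):
        """Apply the per path limit on the maximum probed MTU, if any."""
        limit = self.lookup(path).max_probed_mtu
        return probe_size if limit is None else min(probe_size, limit)
```

   A path with no explicit entry simply inherits the global controls,
   so an administrator only needs to configure the exceptions.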

1818	10.  References

1820	10.1  Normative References

1822	   [1]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

1824	   [2]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
1825	        November 1990.

1827	   [3]  McCann, J., Deering, S. and J. Mogul, "Path MTU Discovery for IP
1828	        version 6", RFC 1981, August 1996.

1830	   [4]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
1831	        Levels", BCP 14, RFC 2119, March 1997.

1833	   [5]  Kent, S. and R. Atkinson, "Security Architecture for the
1834	        Internet Protocol", RFC 2401, November 1998.

1836	   [6]  Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
1837	        Initial Window", RFC 2414, September 1998.

1839	   [7]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
1840	        Specification", RFC 2460, December 1998.

1842	   [8]  Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914,
1843	        September 2000.

1845	   [9]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
1846	        H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson,
1847	        "Stream Control Transmission Protocol", RFC 2960, October 2000.

1849	10.2  Informative References

1851	   [10]  Mogul, J., Kent, C., Partridge, C. and K. McCloghrie, "IP MTU
1852	         discovery options", RFC 1063, July 1988.

1854	   [11]  Knowles, S., "IESG Advice from Experience with Path MTU
1855	         Discovery", RFC 1435, March 1993.

1857	   [12]  Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626,
1858	         May 1994.

1860	   [13]  Sung, T., "TCP And UDP Over IPX Networks With Fixed Path MTU",
1861	         RFC 1791, April 1995.

1863	   [14]  Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809,
1864	         June 1995.

1866	   [15]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923,
1867	         September 2000.

1869	   [16]  Stewart, R., "Stream Control Transmission Protocol (SCTP)
1870	         Implementors Guide", draft-ietf-tsvwg-sctpimpguide-10 (work in
1871	         progress), December 2003.

1873	   [17]  Kent, C. and J. Mogul, "Fragmentation considered harmful",
1874	         Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

1876	   [18]  Mathis, M., Heffner, J. and B. Chandler, "Fragmentation
1877	         Considered Very Harmful", draft-mathis-frag-harmful-00 (work in
1878	         progress), July 2004.

1880	Authors' Addresses

1882	   Matt Mathis
1883	   Pittsburgh Supercomputing Center
1884	   4400 Fifth Avenue
1885	   Pittsburgh, PA  15213
1886	   US

1888	   Phone: 412-268-3319
1889	   EMail: mathis@psc.edu

1891	   John W. Heffner
1892	   Pittsburgh Supercomputing Center
1893	   4400 Fifth Avenue
1894	   Pittsburgh, PA  15213
1895	   US

1897	   Phone: 412-268-2329
1898	   EMail: jheffner@psc.edu

1900	   Kevin Lahey
1901	   Freelance

1903	   EMail: kml@patheticgeek.net

1905	Appendix A.  Security Considerations

1907	   Under all conditions the PLPMTUD procedure described in this document
1908	   is at least as secure as the current standard path MTU discovery
1909	   procedures described in RFC 1191 [2] and RFC 1981 [3].

1911	   In the recommended configuration, PLPMTUD is significantly harder to
1912	   attack than current procedures, because ICMP messages are cached and
1913	   only processed in connection with lost packets.  This effectively
1914	   prevents blind attacks on the path MTU discovery system.

1916	   Furthermore, since this algorithm is designed for robust operation
1917	   without any ICMP (or other messages from the network), it can be
1918	   configured to ignore all ICMP messages (globally or on a per
1919	   application basis).  In this configuration it cannot be attacked,
1920	   unless the attacker can identify and selectively cause probe packets
1921	   to be lost.

1923	Appendix B.  IANA Considerations

1925	   None.

1927	Appendix C.  Acknowledgements

1929	   Many ideas and even some of the text come directly from RFC 1191 and
1930	   RFC 1981.

1932	   Many people made significant contributions to this document,
1933	   including: Randall Stewart for SCTP text, Michael Richardson for
1934	   material from an earlier ID on tunnels that ignore DF, Stanislav
1935	   Shalunov for the idea that pure PLPMTUD parallels congestion control,
1936	   and Matt Zekauskas for maintaining focus during the meetings.  Thanks
1937	   to the early implementors: Kevin Lahey, John Heffner and Rao Shoaib
1938	   who provided concrete feedback on weaknesses in earlier drafts.
1939	   Thanks also to all of the people who made constructive comments in
1940	   the working group meetings and on the mailing list.  I am sure I have
1941	   missed many deserving people.

1943	   Matt Mathis and John Heffner are supported in this work by a grant
1944	   from Cisco Systems, Inc.

1946	Intellectual Property Statement

1948	   The IETF takes no position regarding the validity or scope of any
1949	   Intellectual Property Rights or other rights that might be claimed to
1950	   pertain to the implementation or use of the technology described in
1951	   this document or the extent to which any license under such rights
1952	   might or might not be available; nor does it represent that it has
1953	   made any independent effort to identify any such rights.  Information
1954	   on the procedures with respect to rights in RFC documents can be
1955	   found in BCP 78 and BCP 79.

1957	   Copies of IPR disclosures made to the IETF Secretariat and any
1958	   assurances of licenses to be made available, or the result of an
1959	   attempt made to obtain a general license or permission for the use of
1960	   such proprietary rights by implementers or users of this
1961	   specification can be obtained from the IETF on-line IPR repository at
1962	   http://www.ietf.org/ipr.

1964	   The IETF invites any interested party to bring to its attention any
1965	   copyrights, patents or patent applications, or other proprietary
1966	   rights that may cover technology that may be required to implement
1967	   this standard.  Please address the information to the IETF at
1968	   ietf-ipr@ietf.org.

1970	Disclaimer of Validity

1972	   This document and the information contained herein are provided on an
1973	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1974	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1975	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1976	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1977	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1978	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1980	Copyright Statement

1982	   Copyright (C) The Internet Society (2004).  This document is subject
1983	   to the rights, licenses and restrictions contained in BCP 78, and
1984	   except as set forth therein, the authors retain all their rights.

1986	Acknowledgment

1988	   Funding for the RFC Editor function is currently provided by the
1989	   Internet Society.