idnits 2.17.1 

draft-ietf-pmtud-method-08.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 15.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1528.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1505.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1512.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1518.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 1, 2006) is 6479 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 1981 (Obsoleted by RFC 8201)

  ** Obsolete normative reference: RFC 2460 (Obsoleted by RFC 8200)

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 2960 (Obsoleted by RFC 4960)

  -- Obsolete informational reference (is this intentional?): RFC 2401
     (Obsoleted by RFC 4301)

  -- Obsolete informational reference (is this intentional?): RFC 2461
     (Obsoleted by RFC 4861)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC  896
     (Obsoleted by RFC 7805)

  == Outdated reference: A later version (-05) exists of
     draft-heffner-frag-harmful-01

  == Outdated reference: A later version (-01) exists of
     draft-tuexen-tsvwg-sctp-padding-00


     Summary: 7 errors (**), 0 flaws (~~), 4 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                          M. Mathis
3	Internet-Draft                                                J. Heffner
4	Expires: February 2, 2007                                            PSC
5	                                                          August 1, 2006

7	                 Packetization Layer Path MTU Discovery
8	                       draft-ietf-pmtud-method-08

10	Status of this Memo

12	   By submitting this Internet-Draft, each author represents that any
13	   applicable patent or other IPR claims of which he or she is aware
14	   have been or will be disclosed, and any of which he or she becomes
15	   aware will be disclosed, in accordance with Section 6 of BCP 79.

17	   Internet-Drafts are working documents of the Internet Engineering
18	   Task Force (IETF), its areas, and its working groups.  Note that
19	   other groups may also distribute working documents as Internet-
20	   Drafts.

22	   Internet-Drafts are draft documents valid for a maximum of six months
23	   and may be updated, replaced, or obsoleted by other documents at any
24	   time.  It is inappropriate to use Internet-Drafts as reference
25	   material or to cite them other than as "work in progress."

27	   The list of current Internet-Drafts can be accessed at
28	   http://www.ietf.org/ietf/1id-abstracts.txt.

30	   The list of Internet-Draft Shadow Directories can be accessed at
31	   http://www.ietf.org/shadow.html.

33	   This Internet-Draft will expire on February 2, 2007.

35	Copyright Notice

37	   Copyright (C) The Internet Society (2006).

39	Abstract

41	   This document describes a robust method for Path MTU Discovery
42	   (PMTUD) that relies on TCP or some other Packetization Layer to probe
43	   an Internet path with progressively larger packets.  This method is
44	   described as an extension to RFC 1191 and RFC 1981, which specify
45	   ICMP based Path MTU Discovery for IP versions 4 and 6, respectively.

47	   The general strategy of the new algorithm is to start with a small
48	   MTU and search upward, testing successively larger MTUs by probing
49	   with single packets.  If a probe is successfully delivered then the
50	   MTU can be raised.  If the probe is lost, it is treated as an MTU
51	   limitation and not as a congestion signal.

53	   Packetization Layer PMTUD (PLPMTUD) introduces some flexibility in
54	   the implementation of classical Path MTU discovery.  It can be
55	   configured to perform just ICMP black hole recovery to increase the
56	   robustness of classical Path MTU Discovery, or at the other extreme,
57	   all ICMP processing can be disabled and PLPMTUD can completely
58	   replace classical Path MTU Discovery.

60	   In the latter configuration, PLPMTUD exactly parallels congestion
61	   control.  An end-to-end transport protocol adjusts properties of the
62	   data stream (window size or packet size) while using packet losses to
63	   deduce the appropriateness of the adjustments.  This technique is
64	   more philosophically consistent with the end-to-end principle than
65	   relying on ICMP messages containing transcribed headers of multiple
66	   protocol layers.

68	Table of Contents

70	   1.  Introduction
71	     1.1.  Revision History
72	       1.1.1.  Changes since version -07, July 2006 (IETF 66)
73	       1.1.2.  Changes since version -06, March 2006 (IETF 65)
74	       1.1.3.  Changes since version -05, November 2005 (IETF 64)
75	       1.1.4.  Changes since version -04, February 2005 (IETF 62)
76	       1.1.5.  Changes since version -03, October 2004 (IETF 61)
77	       1.1.6.  Changes since version -02, July 19th 2004 (IETF 60)
78	   2.  Overview
79	   3.  Terminology
80	   4.  Requirements
81	   5.  Layering
82	     5.1.  Accounting for header sizes
83	     5.2.  Storing PMTU information
84	     5.3.  Accounting for IPsec
85	     5.4.  Multicast
86	   6.  Common Packetization Properties
87	     6.1.  Mechanism to detect loss
88	     6.2.  Generating probes
89	   7.  The Probing Method
90	     7.1.  Packet size ranges
91	     7.2.  Selecting initial values
92	     7.3.  Selecting probe size
93	     7.4.  Probing preconditions
94	     7.5.  Conducting a probe
95	     7.6.  Response to probe results
96	       7.6.1.  Probe success
97	       7.6.2.  Probe failure
98	       7.6.3.  Probe timeout failure
99	       7.6.4.  Probe inconclusive
100	     7.7.  Full stop timeout
101	     7.8.  MTU verification
102	   8.  Host Fragmentation
103	   9.  Application Probing
104	   10. Specific Packetization Layers
105	     10.1. Probing method using TCP
106	     10.2. Probing method using SCTP
107	     10.3. Probing method for IP fragmentation
108	     10.4. Probing method using applications
109	   11. Security Considerations
110	   12. IANA Considerations
111	   13. References
112	     13.1. Normative references
113	     13.2. Informative references
114	   Appendix A.  Acknowledgements
115	   Authors' Addresses
116	   Intellectual Property and Copyright Statements

118	1.  Introduction

120	   This document describes a method for Packetization Layer Path MTU
121	   Discovery (PLPMTUD) which is an extension to existing Path MTU
122	   Discovery methods described in [RFC1191] and [RFC1981].  In the
123	   absence of ICMP messages, the proper MTU is determined by starting
124	   with small packets and probing with successively larger packets.  The
125	   bulk of the algorithm is implemented above IP, in the transport layer
126	   (e.g., TCP) or other "Packetization Protocol" that is responsible for
127	   determining packet boundaries.

129	   The methods described in this document rely on features of existing
130	   protocols.  They apply to many transport protocols over IPv4 and
131	   IPv6.  They do not require cooperation from the lower layers (except
132	   that they are consistent about what packet sizes are acceptable), or
133	   from peers.  As the methods apply only to senders, variants in
134	   implementations will not cause interoperability problems.

136	   For the sake of clarity, we uniformly prefer TCP and IPv6 terminology.
137	   In the terminology section we also present the analogous IPv4 terms
138	   and concepts for the IPv6 terminology.  In a few situations we
139	   describe specific details that are different between IPv4 and IPv6.

141	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
142	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
143	   document are to be interpreted as described in [RFC2119].

145	   This document is a product of the Path MTU Discovery (pmtud) working
146	   group of the IETF and draws heavily on RFC1191 and RFC1981 for
147	   terminology, ideas, and some of the text.

149	1.1.  Revision History

151	   NOTE TO RFC EDITOR: this section to be removed before publication.

153	   These are all recent substantive changes, in reverse chronological
154	   order.  This section will be removed prior to publication as an RFC.

156	   Please send comments and suggestions to pmtud@ietf.org.  Interim
157	   drafts and other useful information will be posted at
158	   http://www.psc.edu/~mathis/MTU/pmtud/index.html .

160	1.1.1.  Changes since version -07, July 2006 (IETF 66)

162	   Last call comments from Gorry Fairhurst, Ivan Beschastnikh, and Mark
163	   Allman.  Nits and clarifications.

165	   Changed MAY to SHOULD for suppressing the congestion control response
166	   on a failed probe.

168	1.1.2.  Changes since version -06, March 2006 (IETF 65)

170	   Changed the title to include "Packetization Layer".

172	   Renamed "Diagnostic Interface" section to "Application Probing" and
173	   broadened the language to include other uses.

175	   Clarifications to sections "packet size ranges", "host
176	   fragmentation", and "probing using applications".

178	   Language nits.

180	1.1.3.  Changes since version -05, November 2005 (IETF 64)

182	   Re-worked probing method sections for TCP and SCTP.  The SCTP section
183	   reflects the new PAD chunk type, and contains some text from Michael
184	   Tuexen.

186	   Made a number of language clarification and consistency improvements,
187	   largely from comments by Gorry Fairhurst.

189	   Added appropriate citations, and removed the last of the "@@" TODO
190	   items.

192	1.1.4.  Changes since version -04, February 2005 (IETF 62)

194	   General restructuring and rewriting of some sections based on new
195	   experience.  Relaxed and generalized a lot of over-specified
196	   language, for example, the search strategy description.

198	   Decoupled verification from probing, and relaxed its specification.

200	   Removed all specified changes to ICMP processing.  We decided this
201	   was out of scope for this particular document.

203	   Changed all language to refer to MTU rather than MPS.

205	1.1.5.  Changes since version -03, October 2004 (IETF 61)

207	   A number of minor style and grammar edits.

209	1.1.6.  Changes since version -02, July 19th 2004 (IETF 60)

211	   Many minor updates throughout the document.

213	   Added a section describing the interactions between PLPMTUD and
214	   congestion control.

216	   Removed a difficult to implement requirement for future data to
217	   transmit.

219	   Added "IP Fragmentation" and "Application protocol" as Packetization
220	   Layers.

222	   Clarified interactions between TCP SACK and MTU.

224	   Updated SCTP section to reflect new probing method using "PAD
225	   chunks".

227	   Distilled the protocol specific material into separate subsections
228	   for each protocol.

230	   Added a section on common requirements and functions for all
231	   Packetization Layers.  More accurately characterized the
232	   "bidirectional" (and other) requirements of the PL protocol.  Updated
233	   the search strategy in this new section.

235	   Changed "ICMP can't fragment" and "packet too big" to uniformly use
236	   "ICMP PTB message" everywhere.

238	   Added Stanislav Shalunov's observation that PLPMTUD parallels
239	   congestion control.

241	   Better described the range of interoperability with classical pMTUd
242	   in the introduction.

244	   Removed vague language about "not being a protocol" and "excessive
245	   Loss".

247	   Slightly redefined flow: the granularity of PLPMTUD within a path.

249	   Many English NITs and clarifications per Gorry Fairhurst and others.
250	   Passes strict xml2rfc checking.

252	   Added a paragraph encouraging interface MTUs that are optimal for
253	   the NIC, rather than standard for the media.

255	   Added a revision history section.

257	2.  Overview

259	   Packetization Layer Path MTU Discovery (PLPMTUD) is a method for TCP
260	   or other Packetization Protocols to dynamically discover the MTU of a
261	   path by probing with progressively larger packets.  It is most
262	   efficient when used in conjunction with the ICMP based Path MTU
263	   Discovery mechanism as specified in RFC 1191 and RFC 1981, but
264	   resolves many of the robustness problems of the classical techniques
265	   since it does not depend on the delivery of ICMP messages.

267	   This method is applicable to TCP and other transport- or application-
268	   level protocols that are responsible for choosing packet boundaries
269	   (e.g., segment sizes) and have an acknowledgment structure that
270	   delivers to the sender accurate and timely indications of which
271	   packets were lost.

273	   The general strategy is for the Packetization Layer to find an
274	   appropriate Path MTU by probing the path with progressively larger
275	   packets.  If a probe packet is successfully delivered, then the
276	   effective Path MTU is raised to the probe size.

278	   The isolated loss of a probe packet (with or without an ICMP Packet
279	   Too Big message) is treated as an indication of an MTU limit, and not
280	   as a congestion indicator.  In this case alone, the Packetization
281	   Protocol is permitted to retransmit any missing data without
282	   adjusting the congestion window.

284	   If there is a timeout or additional packets are lost during the
285	   probing process, the probe is considered to be inconclusive (e.g.,
286	   the lost probe does not necessarily indicate that the probe exceeded
287	   the Path MTU).  Furthermore, the losses are treated like any other
288	   congestion indication: window or rate adjustments are mandatory per
289	   the relevant congestion control standards [RFC2914].  Probing can
290	   resume after a delay which is determined by the nature of the
291	   detected failure.
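
   The probe outcomes described above can be sketched as a small
   classifier.  This is an illustrative sketch only, not part of the
   specification: the names ProbeResult and classify_probe are
   hypothetical, and treating a delivered probe as a success even when
   other packets were lost is an assumption made for the example.

```python
from enum import Enum
from typing import Tuple


class ProbeResult(Enum):
    SUCCESS = "success"            # probe delivered: effective PMTU may be raised
    FAILURE = "failure"            # isolated probe loss: treated as an MTU limit
    INCONCLUSIVE = "inconclusive"  # other losses or a timeout: retry after a delay


def classify_probe(probe_delivered: bool,
                   other_losses: int,
                   timed_out: bool) -> Tuple[ProbeResult, bool]:
    """Return (result, apply_congestion_control) for a single probe.

    Hypothetical helper: the inputs are assumed to come from the
    Packetization Layer's own loss detection.
    """
    # Any loss other than the probe itself, or a timeout, is an ordinary
    # congestion signal and requires the normal window or rate reduction.
    congestion = timed_out or other_losses > 0
    if probe_delivered:
        return ProbeResult.SUCCESS, congestion
    if congestion:
        return ProbeResult.INCONCLUSIVE, True
    # Only the probe was lost: MTU limit, no congestion window adjustment.
    return ProbeResult.FAILURE, False
```

   The second element of the returned pair is False only when the probe
   is the sole loss, which is the one case in which the congestion
   response is skipped.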

293	   PLPMTUD uses a searching technique to find the Path MTU.  Each
294	   conclusive probe narrows the MTU search range, either by raising the
295	   lower limit on a successful probe or lowering the upper limit on a
296	   failed probe, converging toward the true Path MTU.  For most
297	   transport layers, the search should be stopped once the range is
298	   narrow enough that the benefit of a larger effective Path MTU is
299	   smaller than the search overhead of finding it.
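
   As a minimal sketch of this narrowing, assume a plain bisection
   strategy.  This document does not mandate bisection (see
   Section 7.3), and the names search_pmtu, search_low, probe_ok, and
   stop_threshold are hypothetical; probe_ok stands in for sending one
   probe of the given size and learning whether it was delivered.

```python
from typing import Callable


def search_pmtu(search_low: int,
                search_high: int,
                probe_ok: Callable[[int], bool],
                stop_threshold: int = 8) -> int:
    """Bisect between a known-good size and an upper limit.

    search_low is the largest size known to work and search_high the
    largest size that might still work.  Searching stops once the range
    is no wider than stop_threshold bytes.
    """
    while search_high - search_low > stop_threshold:
        # Round up so the probe is always larger than search_low.
        probe_size = (search_low + search_high + 1) // 2
        if probe_ok(probe_size):
            search_low = probe_size        # success raises the lower limit
        else:
            search_high = probe_size - 1   # failure lowers the upper limit
    return search_low
```

   For example, search_pmtu(1024, 9000, lambda size: size <= 1500)
   returns a value within stop_threshold bytes of 1500 without ever
   exceeding it.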

301	   The most likely (and least serious) probe failure is the link
302	   experiencing congestion related losses while probing.  In this case
303	   it is appropriate to retry a probe of the same size as soon as the
304	   Packetization Layer has fully adapted to the congestion and recovered
305	   from the losses.  In other cases, additional losses or timeouts
306	   indicate problems with the link or Packetization Layer.  In these
307	   situations it is desirable to use longer delays depending on the
308	   severity of the error.

310	   An optional verification process can be used to detect some
311	   situations where raising the MTU raises the packet loss rate.  For
312	   example, if a link is striped across multiple physical channels with
313	   inconsistent MTUs, it is possible that a probe will be delivered even
314	   if it is too large for some of the physical channels.  In such cases,
315	   raising the Path MTU to the probe size can cause severe packet loss
316	   and abysmal performance.  After raising the MTU, the new MTU size can
317	   be verified by monitoring the loss rate.

319	   PLPMTUD introduces some flexibility in the implementation of
320	   classical Path MTU discovery, which is subject to protocol failures
321	   (connection hangs) if ICMP Packet Too Big (PTB) messages are not
322	   delivered or processed for some reason [RFC2923].  With PLPMTUD,
323	   classical Path MTU Discovery can include additional consistency
324	   checks (e.g., validating additional fields in the transcribed header)
325	   without increasing the risk of connection hangs due to spurious
326	   failures of the added checks.  Such changes to classical Path MTU
327	   Discovery are beyond the scope of this document.

329	   In the limiting case, all ICMP PTB messages might be unconditionally
330	   ignored, and PLPMTUD can be used as the sole method used to discover
331	   the Path MTU.  In this configuration, PLPMTUD parallels congestion
332	   control.  An end-to-end transport protocol adjusts properties of the
333	   data stream (window size or packet size) while using packet losses to
334	   deduce the appropriateness of the adjustments.  This technique seems
335	   to be more philosophically consistent with the end-to-end principle
336	   of the Internet than relying on ICMP messages containing transcribed
337	   headers of multiple protocol layers.

339	   Most of the difficulty in implementing PLPMTUD arises because it
340	   needs to be implemented in several different places within a single
341	   node.  In general, each Packetization Protocol needs to have its own
342	   implementation of PLPMTUD.  Furthermore, the natural mechanism to
343	   share Path MTU information between concurrent or subsequent
344	   connections over the same path is a path information cache in the IP
345	   layer.  The various Packetization Protocols need to have the means to
346	   access and update the shared cache in the IP layer.  This memo
347	   describes PLPMTUD in terms of its primary subsystems without fully
348	   describing how they are assembled into a complete implementation.

350	   The vast majority of the implementation details described in this
351	   document are recommendations based on experiences with earlier
352	   versions of Path MTU Discovery.  These recommendations are motivated
353	   by a desire to maximize robustness of PLPMTUD in the presence of less
354	   than ideal network conditions as they exist in the field.

356	   Section 3 provides a complete glossary of terms.

358	   Section 4 describes the details of PLPMTUD that affect
359	   interoperability with other standards or Internet protocols.

361	   Section 5 describes how to partition PLPMTUD into layers, and how to
362	   manage the "path information cache" in the IP layer.

364	   Section 6 describes the general Packetization Layer properties and
365	   features needed to implement PLPMTUD.

367	   Section 7 describes how to use probes to search for the Path MTU.

369	   Section 8 recommends using IPv4 fragmentation in a configuration that
370	   mimics IPv6 functionality, to minimize future problems migrating to
371	   IPv6.

373	   Section 9 describes a programming interface for implementing PLPMTUD
374	   in applications that choose their own packet boundaries and for tools
375	   to be able to diagnose path problems that interfere with Path MTU
376	   Discovery.

378	   Section 10 discusses implementation details for specific protocols,
379	   including TCP.

381	3.  Terminology

383	   We use the following terms in this document:

385	   IP: Either IPv4 [RFC0791] or IPv6 [RFC2460].

387	   Node: A device that implements IP.

389	   Router: A node that forwards IP packets not explicitly addressed to
390	      itself.

392	   Host: Any node that is not a router.

394	   Upper layer: A protocol layer immediately above IP.  Examples are
395	      transport protocols such as TCP and UDP, control protocols such as
396	      ICMP, routing protocols such as OSPF, and Internet or lower-layer
397	      protocols being "tunneled" over (i.e., encapsulated in) IP such as
398	      IPX, AppleTalk, or IP itself.

400	   Link: A communication facility or medium over which nodes can
401	      communicate at the link layer, i.e., the layer immediately below
402	      IP.  Examples are Ethernets (simple or bridged); PPP links; X.25,
403	      Frame Relay, or ATM networks; and Internet (or higher) layer
404	      "tunnels", such as tunnels over IPv4 or IPv6.  Occasionally we use
405	      the slightly more general term "lower layer" for this concept.

407	   Interface: A node's attachment to a link.

409	   Address: An IP-layer identifier for an interface or a set of
410	      interfaces.

412	   Packet: An IP header plus payload.

414	   MTU: Maximum Transmission Unit, the size in bytes of the largest IP
415	      packet, including the IP header and payload, that can be
416	      transmitted on a link or path.  Note that this could more properly
417	      be called the IP MTU, to be consistent with how other standards
418	      organizations use the acronym MTU.

420	   Link MTU: The Maximum Transmission Unit, i.e., maximum IP packet size
421	      in bytes, that can be conveyed in one piece over a link.  Beware
422	      that this definition is different from the definition used by
423	      other standards organizations.

425	      For IETF documents, link MTU is uniformly defined as the IP MTU
426	      over the link.  This includes the IP header, but excludes link
427	      layer headers and other framing which is not part of IP or the IP
428	      payload.

430	      Be aware that other standards organizations generally define link
431	      MTU to include the link layer headers.

433	   Path: The set of links traversed by a packet between a source node
434	      and a destination node.

436	   Path MTU, or PMTU: The minimum link MTU of all the links in a path
437	      between a source node and a destination node.

439	   Classical Path MTU Discovery: Process described in RFC 1191 and RFC
440	      1981, in which nodes rely on ICMP "Packet Too Big" (PTB) messages
441	      to learn the MTU of a path.

443	   Packetization Layer: The layer of the network stack which segments
444	      data into packets.

446	   Effective PMTU: The current estimated value for PMTU used by a
447	      Packetization Layer for segmentation.

449	   PLPMTUD: Packetization Layer Path MTU Discovery, the method described
450	      in this document, which is an extension to classical PMTU
451	      discovery.

453	   PTB (Packet Too Big) message: An ICMP message reporting that an IP
454	      packet is too large to forward.  This is the IPv6 term that
455	      corresponds to the IPv4 "ICMP Can't fragment" message.

457	   Flow: A context in which MTU discovery algorithms can be invoked.
458	      This is naturally an instance of a Packetization Protocol, for
459	      example, one side of a TCP connection.

461	   MSS: The TCP Maximum Segment Size [RFC0793], the maximum payload size
462	      available to the TCP layer.  This is typically the Path MTU minus
463	      the size of the IP and TCP headers.

465	   Probe packet: A packet which is being used to test a path for a
466	      larger MTU.

468	   Probe size: The size of a packet being used to probe for a larger
469	      MTU, including IP headers.

471	   Probe gap: The payload data that will be lost and need to be
472	      retransmitted if the probe is not delivered.

474	   Leading window: Any unacknowledged data in a flow at the time a probe
475	      is sent.

477	   Trailing window: Any data in a flow sent after a probe, but before
478	      the probe is acknowledged.

480	   Search strategy: The heuristics used to choose successive probe sizes
481	      to converge on the proper Path MTU, as described in
482	      Section 7.3.

484	   Full stop timeout: A timeout where none of the packets transmitted
485	      after some event are acknowledged by the receiver, including any
486	      retransmissions.  This is taken as an indication of some failure
487	      condition in the network, such as a routing change onto a link
488	      with a smaller MTU.  This is described in more detail in
489	      Section 7.7.
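
   The Path MTU and MSS definitions above reduce to simple arithmetic,
   sketched below.  The function names are hypothetical, and the default
   header sizes are assumptions: 40 bytes for a fixed IPv6 header and 20
   bytes for a TCP header without options.

```python
from typing import Iterable


def path_mtu(link_mtus: Iterable[int]) -> int:
    """Path MTU: the minimum link MTU over all links in the path."""
    return min(link_mtus)


def tcp_mss(pmtu: int, ip_header: int = 40, tcp_header: int = 20) -> int:
    """MSS: the Path MTU minus the IP and TCP header sizes.

    The defaults assume IPv6 with no extension headers and TCP with no
    options; pass ip_header=20 for IPv4 without options.
    """
    return pmtu - ip_header - tcp_header
```

   With link MTUs of 1500, 1480, and 9000 bytes, the Path MTU is 1480
   bytes and the resulting IPv6 MSS is 1420 bytes.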

491	4.  Requirements

493	   All Internet nodes SHOULD implement PLPMTUD in order to discover and
494	   take advantage of the largest MTU supported along the Internet path.

496	   All links MUST enforce their MTU: links that might non-
497	   deterministically deliver packets that are larger than their rated
498	   MTU MUST consistently discard such packets.

500	   All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
501	   functionality.  All fragmentation SHOULD be done on the host, and all
502	   IPv4 packets, including fragments, SHOULD have the DF bit set such
503	   that they will not be fragmented (again) in the network.  See
504	   Section 8.

506	   The requirements below only apply to those implementations that
507	   include PLPMTUD.

509	   To use PLPMTUD a Packetization Layer MUST have a loss reporting
510	   mechanism that provides the sender with timely and accurate
511	   indications of which packets were lost in the network.

513	   Normal congestion control algorithms MUST remain in effect under all
514	   conditions except when only an isolated probe packet is detected as
515	   lost.  In this case alone the normal congestion (window or data rate)
516	   reduction SHOULD be suppressed.  If any other data loss is detected,
517	   standard congestion control MUST take place.

519	   Suppressed congestion control (as above) MUST be rate limited such
520	   that it occurs less frequently than the worst case loss rate for TCP
521	   congestion control at a comparable data rate over the same path
522	   (i.e., less than the "TCP-friendly" loss rate [tcp-friendly]).  This
523	   SHOULD be enforced by requiring a minimum headway between a
524	   suppressed congestion adjustment (due to a failed probe) and the next
525	   attempted probe, which is equal to one round trip time for each
526	   packet permitted by the congestion window.  Alternatively, this may
527	   be enforced by not suppressing congestion control if a second probe
528	   is lost too soon after the first lost probe.  This is discussed
529	   further in Section 7.6.2.
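
   The headway rule above can be sketched as follows, assuming a
   byte-counted congestion window.  The names min_probe_headway and
   may_probe are hypothetical, and rounding the window down to whole
   packets is an assumption made for the example.

```python
def min_probe_headway(cwnd_bytes: int, mss_bytes: int, rtt_seconds: float) -> float:
    """One round trip time for each packet permitted by the congestion window."""
    packets_in_window = max(1, cwnd_bytes // mss_bytes)
    return packets_in_window * rtt_seconds


def may_probe(now: float, last_suppressed_at: float, headway: float) -> bool:
    """True once enough time has passed since the last suppressed adjustment."""
    return now - last_suppressed_at >= headway
```

   For a window of 8 packets and a 250 ms round trip time, the next
   probe would be held back for at least 2 seconds after a failed probe
   whose congestion response was suppressed.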

531	   Whenever the MTU is raised, the congestion state variables MUST be
532	   rescaled so as not to raise the window size in bytes (or data rate in
533	   bytes per second).

535	   Whenever the MTU is reduced (e.g., when processing ICMP PTB messages)
536	   the congestion state variables SHOULD be rescaled so as not to raise
537	   the window size in packets.
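
   The two rescaling rules above can be sketched for a byte-counted
   congestion window.  The function name rescale_cwnd_bytes is
   hypothetical, and expressing the MTU change through the old and new
   segment sizes is an assumption made for the example.

```python
def rescale_cwnd_bytes(cwnd_bytes: int, old_mss: int, new_mss: int) -> int:
    """Rescale a byte-counted congestion window after an MTU change."""
    if new_mss >= old_mss:
        # MTU raised: the window in bytes must not grow, so it is kept
        # unchanged and simply holds fewer, larger packets.
        return cwnd_bytes
    # MTU reduced: keep the same number of packets, so the window in
    # bytes shrinks.  The min() guards against growing a window that
    # was already smaller than one old segment.
    packets = max(1, cwnd_bytes // old_mss)
    return min(cwnd_bytes, packets * new_mss)
```

   For example, a 14600-byte window stays at 14600 bytes when the
   segment size grows from 1460 to 8960 bytes, and shrinks to 10000
   bytes when the segment size drops from 1460 to 1000 bytes.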

539	   If PLPMTUD updates the MTU for a particular path, all Packetization
540	   Layer sessions that share the path representation SHOULD be notified
541	   to make use of the new MTU and make the required congestion control
542	   adjustments.

544	   All implementations MUST include mechanisms for applications to
545	   selectively transmit packets larger than the current effective Path
546	   MTU (but smaller than the link MTU).  This is necessary to implement
547	   PLPMTUD within an application (using a connectionless protocol) and
548	   to implement diagnostic tools that do not rely on the operating
549	   system's implementation of Path MTU discovery.  See Section 9 for
550	   further discussion.

552	   Connectionless protocols and protocols that do not support PLPMTUD
553	   SHOULD have their own default value for the initial effective path
554	   MTU, which can be set to a more conservative (smaller) value than the
555	   initial value used by TCP and other protocols that are well suited to
556	   PLPMTUD.  Implementations MAY use different heuristics to select the
557	   initial effective path MTU for each protocol.  There SHOULD be per
558	   protocol and per route limits on the initial effective path MTU
559	   (eff_pmtu) and the upper searching limit (search_high).

561	5.  Layering

563	   Packetization Layer Path MTU Discovery is most easily implemented by
564	   splitting its functions between layers.  The IP layer is the best
565	   place to keep shared state, collect the ICMP messages, track IP
566	   header sizes and manage MTU information provided by the link layer
567	   interfaces.  However, the procedures that PLPMTUD uses for probing
568	   and verification of the Path MTU are very tightly coupled to features
569	   of the Packetization Layers, such as data recovery and congestion
570	   control state machines.

572	   Note that this layering approach is a direct extension of the advice
573	   in the current PMTUD specifications in RFC 1191 and RFC 1981.

575	5.1.  Accounting for header sizes

577	   The way in which PLPMTUD operates across multiple layers requires a
578	   mechanism for accounting header sizes at all layers between IP and
579	   the Packetization Layer (inclusive).  When transmitting non-probe
580	   packets, it is sufficient for the Packetization Layer to ensure an
581	   upper bound on final IP packet size, so as not to exceed the current
582	   effective Path MTU.  All Packetization Layers participating in
583	   classical Path MTU Discovery have this requirement already.  When
584	   conducting a probe, the Packetization Layer MUST determine the probe
585	   packet's final size including IP headers.  This requirement is
586	   specific to PLPMTUD, and satisfying it may require additional inter-
587	   layer communication in existing implementations.
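
   As a rough illustration of this accounting, the sketch below derives
   the payload length a Packetization Layer would need to generate for
   a probe of a given final IP packet size.  The function name and the
   idea of passing header lengths as a list are assumptions made for
   the example.

```python
from typing import Iterable


def probe_payload_len(probe_size: int, header_lens: Iterable[int]) -> int:
    """Payload bytes needed so the final IP packet is exactly probe_size.

    header_lens holds the size of every header from the outermost IP
    header down to the Packetization Layer header, inclusive.
    """
    payload = probe_size - sum(header_lens)
    if payload <= 0:
        raise ValueError("probe size does not leave room for any payload")
    return payload
```

   With a 20-byte IPv4 header and a 20-byte TCP header (no options in
   either), a 1500-byte probe carries 1460 bytes of payload.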

589	5.2.  Storing PMTU information

591	   This memo uses the concept of a "flow" to define the scope of the
592	   Path MTU discovery algorithms.  For many implementations, a flow
593	   would naturally correspond to an instance of each protocol (i.e.,
594	   each connection or session).  In such implementations, the algorithms
595	   described in this document are performed within each session for each
596	   protocol.  The observed PMTU (eff_pmtu in Section 7.1) can optionally
597	   be shared between different flows with a common path representation.

599	   Alternatively, PLPMTUD could be implemented such that its complete
600	   state is associated with the path representations.  Such an
601	   implementation could use multiple connections or sessions for each
602	   probe sequence.  This approach is likely to converge much more
603	   quickly in some environments, such as where an application uses many
604	   small connections, each of which is too short to complete the Path
605	   MTU Discovery process.

607	   Within a single implementation, different protocols can use either of
608	   these two approaches.  Due to protocol specific differences in
609	   constraints on generating probes (Section 6.2) and the MTU searching
610	   algorithm (Section 7.3), it may not be feasible for different
611	   Packetization Layer protocols to share PLPMTUD state.  This suggests
612	   that it may be possible for some protocols to share probing state,
613	   but other protocols can only share observed PMTU.  In this case, the
614	   different protocols will have different PMTU convergence properties.

616	   The IP layer is the best place to store cached PMTU values and other
617	   shared state such as MTU values reported by ICMP PTB messages.
618	   Ideally, this shared state should be associated with a specific path
619	   traversed by packets exchanged between the source and destination
620	   nodes.  However, in most cases a node will not have enough
621	   information to completely and accurately identify such a path.
622	   Rather, a node must associate a PMTU value with some local
623	   representation of a path.  It is left to the implementation to select
624	   the local representation of a path.

626	   An implementation could use the destination address as the local
627	   representation of a path.  The PMTU value associated with a
628	   destination would be the minimum PMTU learned across the set of all
629	   paths in use to that destination.  The set of paths in use to a
630	   particular destination is expected to be small, in many cases
631	   consisting of a single path.  This approach will result in the use of
632	   optimally sized packets on a per-destination basis, and integrates
633	   nicely with the conceptual model of a host as described in [RFC2461]:
634	   a PMTU value could be stored with the corresponding entry in the
635	   destination cache.  Storing the minimum value is suggested since NATs
636	   and other forms of middle boxes may exhibit differing PMTUs
637	   simultaneously at a single IP address.
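
   A minimal sketch of a destination-keyed cache that stores the
   minimum learned PMTU follows.  The class and method names are
   hypothetical, destinations are keyed by address string for
   simplicity, and aging of cached values is deliberately not modeled.

```python
from typing import Dict, Optional


class DestinationPmtuCache:
    """Keeps, per destination address, the minimum PMTU learned so far."""

    def __init__(self) -> None:
        self._pmtu: Dict[str, int] = {}

    def learn(self, dest: str, pmtu: int) -> None:
        # Store the minimum across all paths seen to this destination,
        # since several paths (or NATed hosts) may share one address.
        current = self._pmtu.get(dest)
        self._pmtu[dest] = pmtu if current is None else min(current, pmtu)

    def lookup(self, dest: str) -> Optional[int]:
        # None means nothing has been learned for this destination yet.
        return self._pmtu.get(dest)
```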

639	   Note that network or subnet numbers are not suitable to use as
640	   representations of a path, because there is not a general mechanism
641	   to determine the network mask at the remote host.

643	   If IPv6 flows are in use, an implementation could use the IPv6 flow
644	   id [RFC2460][RFC1809] as the local representation of a path.  Packets
645	   sent to a particular destination but belonging to different flows may
646	   use different paths, with the choice of path depending on the flow
647	   id.  This approach will result in the use of optimally sized packets
648	   on a per-flow basis, providing finer granularity than MTU values
649	   maintained on a per-destination basis.

651	   For source routed packets (i.e., packets containing an IPv6 routing
652	   header, or IPv4 LSRR or SSRR options), the source route may further
653	   qualify the local representation of a path.  An implementation could
654	   use source route information in the local representation of a path.

656	5.3.  Accounting for IPsec

658	   This document does not take a stance on the placement of IPsec
659	   [RFC2401], which logically sits between IP and the Packetization
660	   Layer.  The PLPMTUD implementation can treat IPsec either as part of
661	   IP or as part of the Packetization Layer, as long as the accounting
662	   is consistent within the implementation.  If IPsec is treated as part
663	   of the IP layer, then each security association to a remote node may
664	   need to be treated as a separate path.  If IPsec is treated as part
665	   of the Packetization Layer, the IPsec header size must be included in
666	   the Packetization Layer's header size calculations.

668	5.4.  Multicast

670	   In the case of a multicast destination address, copies of a packet
671	   may traverse many different paths to reach many different nodes.  The
672	   local representation of the "path" to a multicast destination must in
673	   fact represent a potentially large set of paths.

675	   Minimally, an implementation could maintain a single MTU value to be
676	   used for all packets originated from the node.  This MTU value would
677	   be the minimum MTU learned across the set of all paths in use by the
678	   node.  This approach is likely to result in the use of smaller
679	   packets than is necessary for many paths.

681	   If the application using multicast gets complete delivery reports
682	   (unlikely because this requirement has poor scaling properties),
683	   PLPMTUD could be implemented in multicast protocols.

685	6.  Common Packetization Properties

687	   This section describes general Packetization Layer properties and
688	   characteristics needed to implement PLPMTUD.  It also describes some
689	   implementation issues that are common to all Packetization Layers.

691	6.1.  Mechanism to detect loss

693	   It is important that the Packetization Layer has a timely and robust
694	   mechanism for detecting and reporting losses.  PLPMTUD makes MTU
695	   adjustments on the basis of detected losses.  Any delay or
696	   inaccuracy in loss notification is likely to result in incorrect MTU
697	   decisions or slow convergence.

699	   It is best if Packetization Protocols use fairly explicit loss
700	   notification such as selective acknowledgments, although implicit
701	   mechanisms such as TCP Reno style duplicate acknowledgments counting
702	   are sufficient.  It is important that the mechanism can robustly
703	   distinguish between the isolated loss of just a probe and other
704	   combinations of losses.

706	   Many protocol implementations have sophisticated mechanisms such as a
707	   SACK scoreboard [RFC3517] or ACK Vector [RFC4340] to distinguish real
708	   losses from reordered data.  In these implementations it is desirable
709	   to signal losses to PLPMTUD as a side effect of the data
710	   retransmission.  This approach offers the maximum protection from
711	   confusing signals due to reordering and other events that might mimic
712	   losses.

714	   PLPMTUD can also be implemented in protocols that rely on timeouts as
715	   their primary mechanism for loss recovery; however, timeouts should
716	   be used only when there are no other alternatives.

718	6.2.  Generating probes

720	   There are several possible ways to alter Packetization Layers to
721	   generate probes.  The different techniques incur different overheads
722	   in three areas: difficulty in generating the probe packet (in terms
723	   of Packetization Layer implementation complexity and extra data
724	   motion), possible additional network capacity consumed by the probes,
725	   and the overhead of recovering from failed probes (both network and
726	   protocol overheads).

728	   Some protocols might be extended to allow arbitrary padding with
729	   dummy data.  This greatly simplifies the implementation because the
730	   probing can be performed without participation from higher layers and
731	   if the probe fails, the missing data (the "probe gap") is assured to
732	   fit within the current MTU when it is retransmitted.  This is
733	   probably the most appropriate method for protocols that support
734	   arbitrary length options or multiplexing within the protocol itself.

736	   Many Packetization Layer protocols can carry pure control messages
737	   (without any data from higher protocol layers) which can be padded to
738	   arbitrary lengths.  For example, the SCTP PAD chunk can be used in
739	   this manner (see Section 10.2).  This approach has the advantage that
740	   nothing needs to be retransmitted if the probe is lost.

742	   These techniques do not work for TCP, because there is not a separate
743	   length field or other mechanism to differentiate between padding and
744	   real payload data.  With TCP the only approach is to send additional
745	   payload data in an over-sized segment.  There are at least two
746	   variants of this approach, discussed in Section 10.1.

748	   In a few cases, there may be no reasonable mechanisms to generate
749	   probes within the Packetization Layer protocol itself.  As a last
750	   resort, it may be possible to rely on an adjunct protocol, such as
751	   ICMP ECHO ("ping"), to send probe packets.  See Section 10.3 for
752	   further discussion of this approach.

754	7.  The Probing Method

756	   This section describes the details of the MTU probing method,
757	   including how to send probes and process error indications necessary
758	   to search for the Path MTU.

760	7.1.  Packet size ranges

762	   This document describes the probing method using three state
763	   variables:

765	   search_low: The smallest useful probe size, minus one.  The network
766	      is expected to be able to deliver packets of size search_low.

768	   search_high: The greatest useful probe size.  The network is expected
769	      not to be able to deliver packets of size search_high.

771	   eff_pmtu: The effective PMTU for this flow.  This is the largest non-
772	      probe packet permitted by PLPMTUD for the path.

774	               search_low          eff_pmtu         search_high
775	                   |                   |                  |
776	           ...------------------------->
777	               non-probe size range
778	                   <-------------------------------------->
779	                               probe size range

781	   Figure 1

783	   When transmitting non-probes, the Packetization Layer SHOULD create
784	   packets of size less than or equal to eff_pmtu.

786	   When transmitting probes, the Packetization Layer MUST select a probe
787	   size which is larger than search_low and smaller than or equal to
788	   search_high.

790	   When probing upward, eff_pmtu always equals search_low.  In other
791	   states, such as initial conditions, after ICMP PTB message processing
792	   or following PLPMTUD on another flow sharing the same path
793	   representation, eff_pmtu may be different from search_low.  Normally
794	   eff_pmtu will be greater than or equal to search_low and less than
795	   search_high.  It is generally expected but not required that probe
796	   size will be greater than eff_pmtu.

798	   For initial conditions when there is no information about the path,
799	   eff_pmtu may be greater than search_low.  The initial value of
800	   search_low should be conservatively low, but performance may be
801	   better if eff_pmtu starts at a higher, less conservative value.  See
802	   Section 7.2.

804	   If eff_pmtu is larger than search_low it is explicitly permitted to
805	   send non-probe packets larger than search_low.  When such a packet is
806	   acknowledged, it is effectively an "implicit probe" and search_low
807	   SHOULD be raised to the size of the acknowledged packet.  However, if
808	   an "implicit probe" is lost, it MUST NOT be treated as a probe
809	   failure as a true probe would be.  If eff_pmtu is too large, this
810	   condition will only be detected with ICMP PTB messages or black hole
811	   discovery (see Section 7.7).
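
   The three state variables and the "implicit probe" rule can be
   sketched as follows.  The class and method names are illustrative
   assumptions; only the variable names come from this memo.

```python
from dataclasses import dataclass


@dataclass
class PlpmtudState:
    search_low: int   # packets of this size are expected to be delivered
    search_high: int  # greatest useful probe size
    eff_pmtu: int     # largest non-probe packet permitted

    def valid_probe_size(self, size: int) -> bool:
        # Probes must be larger than search_low and no larger than
        # search_high.
        return self.search_low < size <= self.search_high

    def on_nonprobe_acked(self, size: int) -> None:
        # An acknowledged non-probe packet larger than search_low acts
        # as an implicit probe: raise search_low to its size.
        if self.search_low < size <= self.eff_pmtu:
            self.search_low = size

    def on_nonprobe_lost(self, size: int) -> None:
        # A lost implicit probe is NOT a probe failure: no state
        # changes.  Ordinary loss recovery handles the packet.
        return None
```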

813	7.2.  Selecting initial values

815	   The initial value for search_high should be the largest possible
816	   packet that might be supported by the flow.  This may be limited by
817	   the local interface MTU, by an explicit protocol mechanism such as
818	   the TCP MSS option, an intrinsic limit such as the size of a protocol
819	   length field, or by a configuration option to prevent probing above
820	   some maximum packet size.  Search_high is likely to be the same as
821	   the initial path MTU as computed by the classical path MTU discovery
822	   algorithm.

824	   It is recommended that search_low be initially set to an MTU size
825	   that is likely to work over a very wide range of environments.  Given
826	   today's technologies, a value of 512 bytes is probably safe.  For
827	   IPv6 flows, a value of 1280 bytes is appropriate.  The initial value
828	   for search_low SHOULD be configurable.

830	   Properly functioning Path MTU Discovery is critical to the robust and
831	   efficient operation of the Internet.  Any major change (as described
832	   in this document) has the potential to be very disruptive if it
833	   causes any unexpected changes in protocol behaviors.  The selection
834	   of the initial value for eff_pmtu determines to what extent a PLPMTUD
835	   implementation's behavior resembles classical PMTUD in cases where
836	   the classical method is sufficient.

838	   A conservative configuration would be to set eff_pmtu to search_high,
839	   and rely on ICMP PTB messages to set the eff_pmtu down as
840	   appropriate.  In this configuration classical PMTUD is fully
841	   functional and PLPMTUD is only invoked to recover from ICMP black
842	   holes through the procedure described in Section 7.7.

844	   In some cases where it is known that classical PMTUD is likely to
845	   fail, (for example, if ICMP PTB messages are administratively
846	   disabled for security reasons) using a small initial eff_pmtu will
847	   avoid the costly timeouts required for black hole detection.  The
848	   trade-off is that using a smaller than necessary initial eff_pmtu
849	   might cause reduced performance.

851	   Note that the initial eff_pmtu can be any value in the range
852	   search_low to search_high.  An initial eff_pmtu of 1400 bytes might
853	   be a good compromise because it would be safe for nearly all tunnels
854	   over all common networking gear, and yet close to the optimal MTU for
855	   the majority of paths in the Internet today.  This might be improved
856	   by using some statistics of other recent flows: for example the
857	   initial eff_pmtu for a flow might be set to the median of the probe
858	   size for all recent successful probes.

860	   Since the cost of PLPMTUD is dominated by the protocol specific
861	   overheads of generating and processing probes, it is probably
862	   desirable for each protocol to have its own heuristics to select the
863	   initial eff_pmtu.  It is especially important that connectionless
864	   protocols and other protocols that may not receive clear indications
865	   of ICMP black holes use conservative (smaller) initial values for
866	   eff_pmtu, as described in Section 10.3.

868	   There SHOULD be per-protocol and per-route configuration options to
869	   override initial values for eff_pmtu and other PLPMTUD state
870	   variables.
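
   One possible way to derive the initial values discussed in this
   section is sketched below.  The function name, its parameters, and
   the boolean switch between the two eff_pmtu policies are assumptions
   made for illustration; the numeric values (512, 1280, 1400) are the
   ones suggested in the text above.

```python
from typing import Optional, Tuple

IPV4_SEARCH_LOW = 512      # "probably safe" value suggested above
IPV6_SEARCH_LOW = 1280     # value suggested above for IPv6 flows
COMPROMISE_EFF_PMTU = 1400 # suggested compromise initial eff_pmtu


def initial_values(link_mtu: int,
                   protocol_limit: Optional[int] = None,
                   ipv6: bool = False,
                   classical: bool = True) -> Tuple[int, int, int]:
    """Return (search_low, search_high, eff_pmtu) for a new flow.

    protocol_limit is a hypothetical packet-size cap, for example one
    derived from the TCP MSS option.  classical=True starts eff_pmtu at
    search_high so behavior resembles classical PMTUD.
    """
    search_high = link_mtu if protocol_limit is None else min(link_mtu, protocol_limit)
    search_low = min(IPV6_SEARCH_LOW if ipv6 else IPV4_SEARCH_LOW, search_high)
    if classical:
        eff_pmtu = search_high
    else:
        # Keep eff_pmtu inside the range search_low..search_high.
        eff_pmtu = max(search_low, min(COMPROMISE_EFF_PMTU, search_high))
    return search_low, search_high, eff_pmtu
```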

872	7.3.  Selecting probe size

874	   The probe may have a size anywhere in the "probe size range"
875	   described above.  However, a number of factors affect the selection
876	   of an appropriate size.  A simple strategy might be to do a binary
877	   search halving the probe size range with each probe.  However, for
878	   some protocols such as TCP, failed probes are more expensive than
879	   successful ones, since data in a failed probe will need to be
880	   retransmitted.  For such protocols, a strategy using smaller probe
881	   sizes and "probing up" behaves better.  For many protocols, both at
882	   and above the Packetization Layer, the benefit of increasing MTU
883	   sizes may follow a step function such that it is not advantageous to
884	   probe within certain regions at all.

886	   As an optimization, it may be appropriate to probe at certain common
887	   or expected MTU sizes, for example, 1500 bytes for standard Ethernet,
888	   or 1500 bytes minus header sizes for tunnel protocols.

890	   Some protocols may use other mechanisms to choose the probe sizes.
891	   For example, protocols that have certain natural data block sizes
892	   might simply assemble messages from a number of blocks until the
893	   total size is smaller than search_high, and larger than search_low
894	   (if possible).

896	   Each Packetization Layer must determine when probing has converged,
897	   that is, when the probe size range is small enough that further
898	   probing is no longer worth its cost.  When probing has converged, a
899	   timer should be set.  When the timer expires, search_high should be
900	   reset to its initial value (described above) so that probing can
901	   resume.  Thus if the path changes, increasing the Path MTU, then the
902	   flow will eventually take advantage of it.  The value for this timer
903	   MUST NOT be less than 5 minutes, and is recommended to be 10 minutes,
904	   per RFC 1981.
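
   The following sketch combines two of the strategies mentioned in
   this section: it first tries common MTU sizes that fall inside the
   probe size range, then falls back to a binary search.  The table of
   common sizes and the convergence threshold are illustrative
   assumptions; a protocol with expensive failed probes, such as TCP,
   might prefer smaller upward steps instead.

```python
from typing import Optional

# Illustrative candidates: Ethernet, PPPoE (1500-8), IP-in-IP (1500-20),
# and GRE over IPv4 (1500-24).  This list is an assumption of the sketch.
COMMON_MTUS = (1500, 1492, 1480, 1476)

PROBE_RESTART_TIMER_SECONDS = 600  # recommended 10 minutes; never below 300


def select_probe_size(search_low: int, search_high: int,
                      min_gain: int = 8) -> Optional[int]:
    """Return the next probe size, or None once probing has converged."""
    if search_high - search_low < min_gain:
        return None  # range too small for further probing to pay off
    candidates = [m for m in COMMON_MTUS if search_low < m <= search_high]
    if candidates:
        return max(candidates)
    # Binary search: midpoint rounded up, so the result is always
    # larger than search_low and no larger than search_high.
    return (search_low + search_high + 1) // 2
```

   Because a successful probe raises search_low to the probe size and a
   failed probe lowers search_high below it, each common size is tried
   at most once.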

906	7.4.  Probing preconditions

908	   Before sending a probe, the flow must at least meet the following
909	   conditions:
910	   o  It has no outstanding probes or losses.
911	   o  If the last probe failed or was inconclusive, then the probe
912	      timeout has expired (see Section 7.6.2).
913	   o  The available window is greater than the probe size.
914	   o  For a protocol using in-band data for probing, enough data is
915	      available to send the probe.

917	   For protocols that probe with in-band data, when not enough data is
918	   available to probe, the protocol may wish to delay sending non-probes
919	   in order to accumulate enough data to send a probe.  A delayed
920	   sending algorithm such as Nagle [RFC0896] should be used to
921	   appropriately limit the time data is delayed.

923	   Some protocols may require additional packets after a loss to detect
924	   it promptly (e.g., TCP loss detection using duplicate
925	   acknowledgments).  Such a protocol should wait until sufficient data
926	   and window space is available so that it will be able to transmit
927	   enough data after the probe to trigger the loss detection mechanism
928	   in the event of a lost probe.
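
   The four listed preconditions can be checked with a sketch along
   these lines.  The FlowSnapshot fields are hypothetical names for
   state a Packetization Layer would already track; the additional
   requirement for trailing data to trigger loss detection is not
   modeled.

```python
from dataclasses import dataclass


@dataclass
class FlowSnapshot:
    outstanding_probes: int
    outstanding_losses: int
    last_probe_ok: bool          # False if it failed or was inconclusive
    probe_timeout_expired: bool
    available_window: int        # bytes
    uses_inband_data: bool
    queued_bytes: int            # data waiting to be sent


def may_probe(flow: FlowSnapshot, probe_size: int, payload_len: int) -> bool:
    """Return True if the flow meets the minimum probing preconditions."""
    if flow.outstanding_probes or flow.outstanding_losses:
        return False
    if not flow.last_probe_ok and not flow.probe_timeout_expired:
        return False
    if flow.available_window <= probe_size:
        return False  # the window must be greater than the probe size
    if flow.uses_inband_data and flow.queued_bytes < payload_len:
        return False  # not enough in-band data to fill the probe
    return True
```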

930	7.5.  Conducting a probe

932	   Once a probe size in the appropriate range has been selected, and the
933	   above preconditions have been met, the Packetization Layer may
934	   conduct a probe.  To do so, it creates a probe packet such that its
935	   size, including the outermost IP headers, is equal to the probe size.
936	   After sending the probe it awaits a response, which may have one of the
937	   following results:
938	   Success: The probe is acknowledged as having been received by the
939	      remote host.

941	   Failure: A protocol mechanism indicates that the probe was lost, but
942	      no packets in the leading or trailing window were lost.

944	   Timeout failure: A protocol mechanism indicates that the probe was
945	      lost, and no packets in the leading window were lost, but is
946	      unable to determine if any packets in the trailing window were
947	      lost.  For example, loss is detected by a timeout, and go-back-n
948	      retransmission is used.

950	   Inconclusive: The probe was lost in addition to other packets in the
951	      leading or trailing windows.
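
   The four outcomes can be distinguished with a small classifier such
   as the sketch below.  The function signature is an assumption; in
   particular, a trailing-window loss status of None stands for "cannot
   be determined", as with timeout detection and go-back-n
   retransmission.

```python
from enum import Enum
from typing import Optional


class ProbeResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT_FAILURE = "timeout failure"
    INCONCLUSIVE = "inconclusive"


def classify_probe(probe_acked: bool,
                   leading_lost: bool,
                   trailing_lost: Optional[bool]) -> ProbeResult:
    """Map loss observations around a probe to one of the four results."""
    if probe_acked:
        return ProbeResult.SUCCESS
    if leading_lost or trailing_lost is True:
        return ProbeResult.INCONCLUSIVE   # other packets were lost too
    if trailing_lost is None:
        return ProbeResult.TIMEOUT_FAILURE  # trailing window unknown
    return ProbeResult.FAILURE            # only the probe was lost
```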

953	7.6.  Response to probe results

955	   When a probe has completed, the result should be processed as
956	   follows, categorized by the probe's result type.

958	7.6.1.  Probe success

960	   When the probe is delivered, it is an indication that the Path MTU is
961	   at least as large as the probe size.  Set search_low to the probe
962	   size.  If the probe size is larger than the eff_pmtu, raise eff_pmtu
963	   to the probe size.  The probe size might be smaller than the eff_pmtu
964	   if the flow has not been using the full MTU of the path because it is
965	   subject to some other limitation, such as available data in an
966	   interactive session.

968	   Note that if a flow's packets are routed via multiple paths, or over
969	   a path with a non-deterministic MTU, delivery of a single probe
970	   packet does not indicate that all packets of that size will be
971	   delivered.  To be robust in such a case, the Packetization Layer
972	   should conduct MTU verification as described in Section 7.8.

974	7.6.2.  Probe failure

976	   When only the probe is lost, this is treated as an indication that
977	   the Path MTU is smaller than the probe size.  In this case alone, the
978	   loss SHOULD NOT be interpreted as a congestion signal.

980	   In the absence of other indications, set search_high to the probe
981	   size minus one.  The eff_pmtu might be larger than the probe size if
982	   the flow has not been using the full MTU of the path because it is
983	   subject to some other limitation, such as available data in an
984	   interactive session.  If eff_pmtu is larger than the probe size,
985	   eff_pmtu MUST be reduced to no larger than search_high, and SHOULD be
986	   reduced to search_low, as the eff_pmtu has been determined to be
987	   invalid, similar to after a full stop timeout (see Section 7.7).

989	   If an ICMP PTB message is received matching the probe packet, then
990	   search_high and eff_pmtu may be set from the MTU value indicated in
991	   the message.  Note that the ICMP message may be received either
992	   before or after the protocol loss indication.

994	   A probe failure event is the one situation under which the
995	   Packetization Layer is permitted not to treat loss as a congestion
996	   signal.  Because there is some small risk that suppressing congestion
997	   control might have unanticipated consequences (even for one isolated
998	   loss), it is REQUIRED that probe failure events be less frequent than
999	   the normal period for losses under standard congestion control.
1000	   Specifically, after a probe failure event and suppressed congestion
1001	   control, PLPMTUD MUST NOT probe again until an interval which is
1002	   comparable to the expected interval between congestion control
1003	   events.  See Section 4 for details.  The simplest estimate of the
1004	   interval to the next congestion event is the same number of round
1005	   trips as the current congestion window in packets.

1007	7.6.3.  Probe timeout failure

1009	   If the loss was detected with a timeout and repaired with go-back-n
1010	   retransmission, then congestion window reduction will be necessary.
1011	   The relatively high price of a failed probe in this case may merit a
1012	   longer timeout.  A timeout value of five times the non-timeout
1013	   failure case (Section 7.6.2) is recommended.

1015	7.6.4.  Probe inconclusive

1017	   The presence of other losses near the loss of the probe may indicate
1018	   that the probe was lost due to congestion rather than because of an
1019	   MTU limitation.  In this case, it is appropriate to update no state,
1020	   and simply probe again when the probing preconditions are met (i.e.,
1021	   when no recent losses have been observed).  At this point, it is
1022	   particularly appropriate to re-probe since the flow's congestion
1023	   window will be at its lowest point, minimizing the probability of
1024	   congestive losses.
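
   The state updates in Sections 7.6.1 through 7.6.4 can be sketched as
   follows.  The names are illustrative.  Two points are assumptions of
   this sketch rather than statements of this memo: a timeout failure
   updates the search state the same way as an ordinary failure, and
   eff_pmtu is reduced whenever it exceeds the new search_high.  ICMP
   PTB processing is not modeled.

```python
from dataclasses import dataclass


@dataclass
class SearchState:
    search_low: int
    search_high: int
    eff_pmtu: int


def apply_probe_result(state: SearchState, probe_size: int, result: str) -> None:
    """Update state for result in {"success", "failure",
    "timeout_failure", "inconclusive"}."""
    if result == "success":
        state.search_low = probe_size
        if probe_size > state.eff_pmtu:
            state.eff_pmtu = probe_size
    elif result in ("failure", "timeout_failure"):
        state.search_high = probe_size - 1
        if state.eff_pmtu > state.search_high:
            # MUST be no larger than search_high; SHOULD be search_low.
            state.eff_pmtu = state.search_low
    # "inconclusive": update no state and probe again later.


def reprobe_delay_rtts(cwnd_packets: int, result: str) -> int:
    """Round trips to wait before probing again after a failed probe."""
    base = max(cwnd_packets, 1)  # simplest estimate: cwnd in packets
    if result == "failure":
        return base
    if result == "timeout_failure":
        return 5 * base          # five times the non-timeout case
    return 0
```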

1026	7.7.  Full stop timeout

1028	   Under all conditions, a full stop timeout (also known as a
1029	   "persistent timeout" in other documents) should be taken as an
1030	   indication of some significantly disruptive event in the network,
1031	   such as a router failure or a routing change to a path with a smaller
1032	   MTU.  For TCP, this occurs when the R1 timeout threshold described by
1033	   [RFC1122] expires.

1035	   If there is a full stop timeout and there was not an ICMP message
1036	   indicating a reason (PTB, Net unreachable, etc., or the ICMP message
1037	   was ignored for some reason), the suggested first recovery action is
1038	   to treat this as a detected ICMP black hole as defined in [RFC2923].

1040	   The response to a detected black hole depends on the current values
1041	   for search_low and eff_pmtu.  If eff_pmtu is larger than search_low,
1042	   set eff_pmtu to search_low.  Otherwise, set both eff_pmtu and
1043	   search_low to the initial value for search_low.  Upon further
1044	   successive timeouts, search_low and eff_pmtu SHOULD be halved, with a
1045	   lower bound of 68 bytes for IPv4 and 1280 bytes for IPv6.
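
   One possible reading of the black hole response is sketched below.
   The text does not say exactly when a timeout counts as a "further
   successive" one; this sketch assumes halving begins once search_low
   is already at or below its initial value.  The function name and
   signature are hypothetical.

```python
from typing import Tuple


def on_black_hole(search_low: int, eff_pmtu: int,
                  initial_search_low: int,
                  ipv6: bool = False) -> Tuple[int, int]:
    """Return the new (search_low, eff_pmtu) after a full stop timeout."""
    floor = 1280 if ipv6 else 68
    if eff_pmtu > search_low:
        # First step: fall back to the last size known to work.
        return search_low, search_low
    if search_low > initial_search_low:
        # Next step: fall back to the initial search_low.
        return initial_search_low, initial_search_low
    # Further successive timeouts: halve, subject to the lower bound.
    halved = max(search_low // 2, floor)
    return halved, halved
```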

1047	7.8.  MTU verification

1049	   It is possible for a flow to simultaneously traverse multiple paths,
1050	   but it will only be able to keep a single path representation for the
1051	   flow.  If the paths have different MTUs, storing the minimum MTU of
1052	   all paths in the flow's path representation will result in correct
1053	   behavior.  If ICMP PTB messages are delivered, then classical PMTUD
1054	   will work correctly in this situation.

1056	   If ICMP delivery fails, breaking classical PMTUD, the connection will
1057	   rely solely on PLPMTUD.  However, in this case, PLPMTUD may fail as
1058	   well since its requirement that links MUST NOT deliver packets larger
1059	   than their MTU is violated.  A probe with a size greater than the
1060	   minimum but smaller than the maximum of the Path MTUs may be
1061	   successful.  However, upon raising the flow's effective PMTU, the
1062	   loss rate will significantly increase.  The flow may still make
1063	   progress, but the resultant loss rate may be unacceptable.  For
1064	   example, when using two-way round-robin striping, 50% of full-sized
1065	   packets would be dropped.

1067	   Striping in this manner is often operationally undesirable (e.g., due
1068	   to packet reordering), and is usually avoided by hashing flows to a
1069	   single path.  However, to increase robustness, an implementation
1070	   should implement some form of MTU verification, such that if
1071	   increasing eff_pmtu results in a sharp increase in loss rate, it will
1072	   fall back to using a lower MTU.

1074	   A recommended strategy would be to save the value of eff_pmtu before
1075	   raising it.  Then, if loss rate rises above a threshold for a period
1076	   of time (e.g., loss rate is higher than 10% over multiple RTO
1077	   intervals), then the new MTU is considered incorrect.  The saved
1078	   value of eff_pmtu can be restored, and search_high reduced in the
1079	   same manner as in a probe failure.  PLPMTUD implementations SHOULD
1080	   implement MTU verification.
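
   The recommended strategy might be sketched as follows.  The 10% loss
   threshold comes from the example above; the choice of three
   consecutive intervals for "multiple RTO intervals", and all names,
   are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MtuVerifier:
    saved_eff_pmtu: int           # value of eff_pmtu before it was raised
    raised_eff_pmtu: int          # value it was raised to
    loss_threshold: float = 0.10
    intervals_needed: int = 3     # hypothetical "multiple RTO intervals"
    bad_intervals: int = 0

    def observe(self, sent: int, lost: int) -> bool:
        """Record one RTO interval; return True if the new MTU looks wrong."""
        rate = lost / sent if sent else 0.0
        if rate > self.loss_threshold:
            self.bad_intervals += 1
        else:
            self.bad_intervals = 0
        return self.bad_intervals >= self.intervals_needed

    def fall_back(self, search_high: int) -> Tuple[int, int]:
        """Return (eff_pmtu, search_high) after verification fails."""
        # Restore the saved eff_pmtu and reduce search_high as for a
        # failed probe of size raised_eff_pmtu.
        return self.saved_eff_pmtu, min(search_high, self.raised_eff_pmtu - 1)
```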

1082	8.  Host Fragmentation

1084	   Packetization Layers are encouraged to avoid sending messages that
1085	   will require fragmentation [Kent87] [I-D.heffner-frag-harmful].
1086	   However, entirely preventing fragmentation is not always possible.
1087	   Some Packetization Layers, such as a UDP application outside the
1088	   kernel, may be unable to change the size of messages they send,
1089	   resulting in datagram sizes that exceed the Path MTU.

1091	   IPv4 permitted such applications to send packets without the DF bit
1092	   set.  Oversized packets without the DF bit set would be fragmented in
1093	   the network or sending host when they encountered a link with an MTU
1094	   smaller than the packet.  In some cases, packets could be fragmented
1095	   more than once if there were cascaded links with progressively
1096	   smaller MTUs.  This approach is not recommended.

1098	   It is recommended that IPv4 implementations use a strategy that
1099	   mimics IPv6 functionality.  When an application sends datagrams that
1100	   are larger than the effective Path MTU they should be fragmented to
1101	   the Path MTU in the host IP layer even if they are smaller than the
1102	   link MTU of the first network hop directly attached to the host.  The
1103	   DF bit should be set on the fragments, so they will not be fragmented
1104	   again in the network.  This technique will minimize the likelihood
1105	   that applications will rely on IPv4 fragmentation in a way that
1106	   cannot be implemented in IPv6.  At least one major operating system
1107	   already uses this strategy.  An exception to this rule is if the
1108	   application indicates that it is sending an oversized packet for
1109	   probing or diagnostic purposes, described in Section 9.
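
   As an illustration of fragmenting to the Path MTU in the host, the
   sketch below computes the fragment layout for an IPv4 datagram.  It
   assumes a fixed 20-byte IPv4 header without options; the function
   name is hypothetical.  Every fragment except the last must carry a
   multiple of 8 bytes, because IPv4 fragment offsets are expressed in
   8-byte units.

```python
from typing import List, Tuple


def fragment_plan(payload_len: int, path_mtu: int,
                  ip_header_len: int = 20) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs splitting an IP payload so that
    each fragment, with its IP header, fits within path_mtu.

    The sending host would set the DF bit on every fragment so that
    none is fragmented again in the network.
    """
    max_data = (path_mtu - ip_header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("path MTU too small to carry any data")
    plan = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        plan.append((offset, length))
        offset += length
    return plan
```

   For a 3000-byte IP payload and a 1500-byte Path MTU, this yields
   fragments of 1480, 1480, and 40 bytes.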

1111	   Since protocols that do not implement PLPMTUD are still subject to
1112	   the black hole problem, it may be desirable to present to these
1113	   protocols a "safe" MTU likely to work on any path (e.g., 1280 bytes).
1114	   Then, allow any protocol implementing PLPMTUD to operate in the full
1115	   range supported by the lower layer.

1117	   Note that IP fragmentation divides data into packets, so it is
1118	   minimally a Packetization Layer.  However, it does not have a
1119	   mechanism to detect lost packets, so it cannot support a native
1120	   implementation of PLPMTUD.  Fragmentation-based PLPMTUD requires an
1121	   adjunct protocol as described in Section 10.3.

1123	9.  Application Probing

1125	   All implementations MUST include a mechanism where applications using
1126	   connectionless protocols can send their own probes.  This is
1127	   necessary to implement PLPMTUD in an application protocol as
1128	   described in Section 10.4 or to implement diagnostic tools for
1129	   debugging problems with PMTUD.  There must be a mechanism that
1130	   permits an application to send datagrams that are larger than
1131	   eff_pmtu, the operating system's estimate of the path MTU, without
1132	   being fragmented.  If these are IPv4 packets, they MUST have the DF
1133	   bit set.

1135	   At this time, most operating systems support two modes for sending
1136	   datagrams: one which silently fragments packets that are too large,
1137	   and another that rejects packets that are too large.  Neither of
1138	   these modes is suitable for implementing PLPMTUD in an application or
1139	   diagnosing problems with path MTU discovery.  A third mode is needed
1140	   where the datagram is sent even if it is larger than the current
1141	   estimate of the path MTU.

1143	   Implementing PLPMTUD in an application also requires a mechanism
1144	   where the application can inform the operating system about the
1145	   outcome of the probe as described in Section 7.6, or directly update
1146	   search_low, search_high and eff_pmtu, described in Section 7.1.
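
   A minimal sketch of such a reporting interface follows.  It assumes
   that a delivered probe raises search_low and eff_pmtu, that a lost
   probe lowers search_high, and that the next probe size is the
   midpoint of the remaining range.  These are illustrative choices;
   PathState, next_probe_size and report_probe are hypothetical names,
   not an interface defined by this document.

```python
from dataclasses import dataclass


@dataclass
class PathState:
    search_low: int    # largest probe size known to be delivered
    search_high: int   # upper bound on sizes still worth probing
    eff_pmtu: int      # size currently used for normal traffic


def next_probe_size(state):
    """Pick the midpoint of the remaining search range (assumed strategy)."""
    return (state.search_low + state.search_high) // 2


def report_probe(state, probe_size, delivered):
    """Fold one application-reported probe outcome into the shared state."""
    if delivered:
        state.search_low = max(state.search_low, probe_size)
        state.eff_pmtu = max(state.eff_pmtu, probe_size)
    else:
        state.search_high = min(state.search_high, probe_size)
    return state
```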

1148	   Diagnostic applications are useful for finding PMTUD problems, such
1149	   as those that might be caused by a buggy router that returns ICMP PTB
1150	   messages with incorrect size information.  Such problems can be most
1151	   quickly located with a tool that can send probes of any specified
1152	   size, and collect and display all returned ICMP PTB messages.
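
   The display side of such a tool can be sketched as a small parser
   for the size field of a returned PTB message.  The field offsets
   follow the ICMPv4 "fragmentation needed" format of RFC 1191 and the
   ICMPv6 Packet Too Big format.  The function names and the
   plausibility check are hypothetical, and capturing the ICMP messages
   themselves (which normally needs a raw socket) is not shown.

```python
import struct


def ptb_reported_mtu(icmp, ipv6=False):
    """Return the MTU carried in an ICMP PTB message, or None if not a PTB.

    icmp holds the ICMP message starting at its type byte.
    """
    if len(icmp) < 8:
        return None
    msg_type, code = icmp[0], icmp[1]
    if ipv6:
        # ICMPv6 Packet Too Big: type 2, 32-bit MTU in bytes 4..7.
        if msg_type != 2:
            return None
        return struct.unpack("!I", icmp[4:8])[0]
    # ICMPv4 Destination Unreachable (3), Fragmentation Needed (4):
    # 16-bit Next-Hop MTU in bytes 6..7.
    if msg_type != 3 or code != 4:
        return None
    return struct.unpack("!H", icmp[6:8])[0]


def is_suspect(reported_mtu, probe_size, min_mtu=68):
    """Flag PTB sizes that cannot be right for a probe of probe_size bytes."""
    return reported_mtu < min_mtu or reported_mtu >= probe_size
```

   The default min_mtu of 68 is the IPv4 minimum; 1280 would be the
   corresponding value for IPv6.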

1154	10.  Specific Packetization Layers

1156	   This section discusses specific implementation details for different
1157	   protocols that can be used as Packetization Layer protocols.  All
1158	   Packetization Layer protocols must consider all of the issues
1159	   discussed in Section 6.  For most protocols, it is self-evident how
1160	   to address many of these issues.  It is hoped that the protocols
1161	   described here will be sufficient illustration for implementers to
1162	   adapt other protocols.

1164	10.1.  Probing method using TCP

1166	   TCP has no mechanism to distinguish in-band data from padding.
1167	   Therefore, TCP must generate probes by appropriately segmenting data.
1168	   There are two approaches to segmentation: overlapping and non-
1169	   overlapping.

1171	   In the non-overlapping method, data is segmented such that the probe
1172	   and any subsequent segments contain no overlapping data.  If the
1173	   probe is lost, the "probe gap" will be a full probe size minus
1174	   headers.  Data in the probe gap will need to be retransmitted with
1175	   multiple smaller segments.

1177	             TCP sequence number

1179	           t   <---->
1180	           i         <-------->           (probe)
1181	           m                   <---->
1182	           e
1183	                         .
1184	                         .                (probe lost)
1185	                         .

1187	                     <---->               (probe gap retransmitted)
1188	                           <-->

1190	   Figure 2
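
   The sequence-space arithmetic of Figure 2 can be sketched as follows.
   Ranges are half-open (start, end) pairs of payload sequence numbers,
   and the function names are hypothetical.

```python
def split(start, end, mss):
    """Split the sequence range [start, end) into segments of at most mss."""
    return [(s, min(s + mss, end)) for s in range(start, end, mss)]


def non_overlapping_plan(probe_start, probe_payload, mss):
    """Return (probe, following segment, retransmissions if probe is lost)."""
    probe = (probe_start, probe_start + probe_payload)
    following = (probe[1], probe[1] + mss)       # starts where the probe ends
    retransmit = split(probe[0], probe[1], mss)  # the whole probe gap
    return probe, following, retransmit
```

   For a 1500-byte probe payload and an MSS of 1000, the lost probe is
   recovered with one full segment and one 500-byte segment, matching
   the two retransmissions drawn in the figure.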

1192	   An alternate approach is to send subsequent data overlapping the
1193	   probe such that the probe gap is equal in length to the current MSS.
1194	   In the case of a successful probe, this has added overhead in that it
1195	   will send some data twice, but it will have to retransmit only one
1196	   segment after a lost probe.  When a probe succeeds, there will likely
1197	   be some duplicate acknowledgments generated due to the duplicate data
1198	   sent.  It is important that these duplicate acknowledgments not
1199	   trigger Fast Retransmit.  As such, an implementation using this
1200	   approach SHOULD limit the probe size to three times the current MSS
1201	   (causing at most 2 duplicate acknowledgments), or appropriately
1202	   adjust its duplicate acknowledgment threshold for data immediately
1203	   after a successful probe.

1205	             TCP sequence number

1207	           t   <---->
1208	           i         <-------->           (probe)
1209	           m               <---->
1210	           e                     <---->

1212	                         .
1213	                         .                (probe lost)
1214	                         .

1216	                     <---->               (probe gap retransmitted)

1218	   Figure 3
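
   The limit of three times the MSS can be checked with a simplified
   model of Figure 3.  The sketch assumes in-order delivery and a
   receiver that emits one duplicate acknowledgment for every segment
   lying entirely inside the already-delivered probe; delayed
   acknowledgments and SACK are ignored.  The function names are
   hypothetical.

```python
def overlapping_plan(probe_start, probe_payload, mss, count=4):
    """Return the probe, the segments sent after it, and the probe gap."""
    probe = (probe_start, probe_start + probe_payload)
    first = probe_start + mss                    # leave a gap of one MSS
    following = [(first + i * mss, first + (i + 1) * mss) for i in range(count)]
    gap = (probe_start, probe_start + mss)       # retransmitted if probe lost
    return probe, following, gap


def duplicate_acks_after_success(probe_payload, mss):
    """Count following segments that lie entirely inside a delivered probe."""
    return max(probe_payload // mss - 1, 0)
```

   Under this model a probe payload of 3 * MSS yields two duplicate
   acknowledgments, while 4 * MSS yields three, which would reach the
   usual Fast Retransmit threshold.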

1220	   The choice of which segmentation method to use should be based on
1221	   what is simplest and most efficient for a given TCP implementation.

1223	10.2.  Probing method using SCTP

1225	   In the SCTP protocol [RFC2960], the application writes messages to
1226	   SCTP, which "chunkifies" them into smaller pieces suitable for
1227	   transmission through the network.  Once a message has been
1228	   chunkified, it is assigned a Transmission Sequence Number (TSN).
1229	   Once a TSN has been transmitted, SCTP cannot change the chunk size.
1230	   SCTP multi-path support normally requires SCTP to chunkify its
1231	   messages to fit the smallest PMTU of all paths.  Although not
1232	   required, implementations may bundle multiple data chunks together to
1233	   make larger IP packets to send on paths with a larger PMTU.  Note
1234	   that SCTP must independently probe the PMTU on each path to the peer.

1236	   The recommended method for generating probes is to add a chunk
1237	   consisting only of padding to an SCTP message.  The PAD chunk defined
1238	   in [I-D.tuexen-tsvwg-sctp-padding] SHOULD be attached to a minimum
1239	   length HEARTBEAT chunk to build a probe packet.  This method is fully
1240	   compatible with all current SCTP implementations.
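
   The size of the PAD chunk needed for a given probe can be worked out
   from the fixed header sizes.  The sketch below assumes a 12-byte
   SCTP common header, 4-byte chunk and parameter headers, and chunks
   padded to 4-byte boundaries.  The default hb_info_len is a
   hypothetical size for the sender-specific Heartbeat Information, and
   the function names are illustrative only.

```python
SCTP_COMMON_HEADER = 12   # bytes
CHUNK_HEADER = 4          # chunk type, flags, length
PARAM_HEADER = 4          # Heartbeat Info parameter type and length


def round_up4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def pad_chunk_length(probe_size, ip_header=20, hb_info_len=8):
    """Length of the PAD chunk (header included) for a probe_size packet.

    probe_size is rounded down to a multiple of 4 because SCTP chunks are
    padded to 4-byte boundaries.
    """
    hb_chunk = round_up4(CHUNK_HEADER + PARAM_HEADER + hb_info_len)
    pad = (probe_size & ~3) - ip_header - SCTP_COMMON_HEADER - hb_chunk
    if pad < CHUNK_HEADER:
        raise ValueError("probe_size too small to hold HEARTBEAT plus PAD")
    return pad
```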

1242	   SCTP MAY also probe with a method similar to TCP's described above,
1243	   using inline data.  Using such a method has the advantage that
1244	   successful probes have no additional overhead; however, failed probes
1245	   will require retransmission of data, which may impact flow
1246	   performance.

1248	10.3.  Probing method for IP fragmentation

1250	   There are a few protocols and applications that normally send large
1251	   datagrams and rely on IP fragmentation to deliver them.  It has been
1252	   known for a long time that this has some undesirable consequences
1253	   [Kent87].  More recently it has come to light that IPv4 fragmentation
1254	   is not sufficiently robust for general use in today's Internet.  The
1255	   16-bit IP identification field is not large enough to prevent
1256	   frequent mis-associated IP fragments and the TCP and UDP checksums
1257	   are insufficient to prevent the resulting corrupted data from being
1258	   delivered to higher protocol layers [I-D.heffner-frag-harmful].

1260	   As mentioned in Section 8, datagram protocols (such as UDP) might
1261	   rely on IP fragmentation as a Packetization Layer.  However, using IP
1262	   fragmentation to implement PLPMTUD is problematic because the IP
1263	   layer has no mechanism to determine if the packets are ultimately
1264	   delivered to the far node, without direct participation by the
1265	   application.

1267	   To support IP fragmentation as a Packetization Layer under an
1268	   unmodified application, we propose to rely on the path MTU sharing
1269	   described in Section 5.2 plus an adjunct protocol to probe the path
1270	   MTU.  There are a number of protocols that might be used for the
1271	   purpose, such as ICMP ECHO and ECHO REPLY, or "traceroute" style UDP
1272	   datagrams that trigger ICMP messages.  Use of ICMP ECHO and ECHO
1273	   REPLY will probe both forward and return paths, so the sender will
1274	   only be able to take advantage of the minimum of the two.  Other
1275	   methods that probe only the forward path are preferred if available.

1277	   All of these approaches have a number of potential robustness
1278	   problems.  The most likely failures are due to losses unrelated to
1279	   MTU (e.g., nodes that discard some protocol types).  These non-MTU-
1280	   related losses can prevent PLPMTUD from raising the MTU, forcing IP
1281	   fragmentation to use a smaller MTU than necessary.  Since these
1282	   failures are not likely to cause interoperability problems they are
1283	   relatively benign.

1285	   However, there do exist other more serious failure modes, such as
1286	   might be caused by middle boxes or upper layer routers that choose
1287	   different paths for different protocol types or sessions.  In such
1288	   environments, adjunct protocols may legitimately experience a
1289	   different path MTU than the primary protocol.  If the adjunct
1290	   protocol finds a larger MTU than the primary protocol, PLPMTUD may
1291	   select an MTU that is not usable by the primary protocol.  Although
1292	   this is a potentially serious problem, this sort of situation is
1293	   likely to be viewed as broken by a large number of observers, and
1294	   thus there will be strong motivation to correct it.

1296	   Since connectionless protocols might not keep enough state to
1297	   effectively diagnose MTU black holes, it would be more robust to err
1298	   on the side of using too small an initial MTU (e.g., 1 kByte or
1299	   less) prior to probing a path to measure the MTU.  For this reason we
1300	   suggest that IP fragmentation use an initial eff_pmtu which is
1301	   selected as described in Section 7.2, except using a separate global
1302	   control for the default initial eff_pmtu for connectionless protocols.

1304	   Connectionless protocols also introduce an additional problem with
1305	   maintaining the path information cache: there are no events
1306	   corresponding to connection establishment and tear-down to use to
1307	   manage the cache itself.  A natural approach would be to keep an
1308	   immutable cache entry for the "default path", which has an eff_pmtu
1309	   that is fixed at the initial value for connectionless protocols.  The
1310	   adjunct path MTU discovery protocol would be invoked once the number
1311	   of fragmented datagrams to any particular destination reaches some
1312	   configurable threshold (e.g., 5 datagrams).  A new path cache entry
1313	   would be created when the adjunct protocol updates eff_pmtu, and
1314	   deleted on the basis of a timer or Least Recently Used cache
1315	   replacement algorithm.
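
   One way to picture this cache is the sketch below.  The class name,
   method names and default values are hypothetical.  The default path
   is modelled as a fixed fallback value and not as a stored entry, and
   only Least Recently Used replacement is shown (no timer).

```python
from collections import OrderedDict


class PathCache:
    """Per-destination eff_pmtu cache for connectionless protocols."""

    def __init__(self, default_pmtu=1024, threshold=5, capacity=256):
        self.default_pmtu = default_pmtu  # fixed "default path" value
        self.threshold = threshold        # fragmented datagrams before probing
        self.capacity = capacity          # max learned entries (LRU evicted)
        self.entries = OrderedDict()      # dest -> learned eff_pmtu
        self.frag_counts = {}             # dest -> fragmented datagrams seen

    def eff_pmtu(self, dest):
        if dest in self.entries:
            self.entries.move_to_end(dest)   # mark as most recently used
            return self.entries[dest]
        return self.default_pmtu

    def note_datagram(self, dest, size):
        """Record a send; return True when the adjunct probe should start."""
        if dest in self.entries or size <= self.default_pmtu:
            return False
        n = self.frag_counts.get(dest, 0) + 1
        self.frag_counts[dest] = n
        return n == self.threshold

    def probe_result(self, dest, pmtu):
        """Create or update an entry once the adjunct protocol reports."""
        self.entries[dest] = pmtu
        self.entries.move_to_end(dest)
        self.frag_counts.pop(dest, None)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop least recently used
```

   A real implementation would also need to bound frag_counts, which
   this sketch lets grow with the number of destinations seen.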

1317	10.4.  Probing method using applications

1319	   The disadvantages of relying on IP fragmentation and an adjunct
1320	   protocol to perform path MTU discovery can be overcome by
1321	   implementing path MTU discovery within the application itself, using
1322	   the application's own protocol.  The application must have some
1323	   suitable method for generating probes and have an accurate and timely
1324	   mechanism to determine if the probes were lost.

1326	   Ideally the application protocol includes a lightweight echo function
1327	   that confirms message delivery, plus a mechanism for padding the
1328	   messages out to the desired probe size, such that the padding is not
1329	   echoed.  This combination (akin to the SCTP HB plus PAD) is preferred
1330	   because an application can separately measure the MTU of each
1331	   direction on a path with asymmetrical MTUs.
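
   A minimal sketch of "echo plus pad" for a hypothetical application
   protocol follows.  The message type codes and the 3-byte
   type-and-length framing are invented for illustration and do not
   correspond to any existing protocol.

```python
import struct

ECHO_REQUEST, ECHO_REPLY, PAD = 1, 2, 3   # hypothetical message types
HEADER = struct.Struct("!BH")             # type (1 byte), body length (2 bytes)


def encode(msg_type, body):
    """Frame one message as type, length, body."""
    return HEADER.pack(msg_type, len(body)) + body


def build_probe(token, probe_payload_size):
    """ECHO_REQUEST carrying token, followed by PAD filling the datagram."""
    echo = encode(ECHO_REQUEST, token)
    pad_len = probe_payload_size - len(echo) - HEADER.size
    if pad_len < 0:
        raise ValueError("probe_payload_size too small for ECHO plus PAD")
    return echo + encode(PAD, b"\x00" * pad_len)


def build_reply(datagram):
    """Echo only the token; PAD messages are dropped, so the reply is small."""
    reply, offset = b"", 0
    while offset < len(datagram):
        msg_type, length = HEADER.unpack_from(datagram, offset)
        body = datagram[offset + HEADER.size: offset + HEADER.size + length]
        if msg_type == ECHO_REQUEST:
            reply += encode(ECHO_REPLY, body)
        offset += HEADER.size + length
    return reply
```

   Because the reply omits the padding, its delivery does not depend on
   the return path MTU, so the probe measures the forward direction
   only.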

1333	   For protocols that cannot implement PLPMTUD with "echo plus pad"
1334	   there are often alternate methods for generating probes.  For
1335	   example, the protocol may have a variable length echo that
1336	   effectively measures minimum MTU of both the forward and return path,
1337	   or there may be a way to add padding to regular messages carrying
1338	   real application data.  There may also be alternate ways to segment
1339	   application data to generate probes, or as a last resort, it may be
1340	   feasible to extend the protocol with new message types specifically
1341	   to support MTU discovery.

1343	   Note that if it is necessary to add new message types to support
1344	   PLPMTUD, the most general approach is to add ECHO and PAD messages,
1345	   which permit the greatest possible latitude in how an application
1346	   specific implementation of PLPMTUD interacts with other applications
1347	   and protocols on the same end system.

1349	   All application probing techniques require the ability to send
1350	   messages that are larger than the current eff_pmtu, as described in
1351	   Section 9.

1353	11.  Security Considerations

1355	   Under all conditions the PLPMTUD procedures described in this
1356	   document are at least as secure as the current standard Path MTU
1357	   Discovery procedures described in RFC 1191 and RFC 1981.

1359	   Since this algorithm is designed for robust operation without any
1360	   ICMP or other messages from the network, PLPMTUD could be configured
1361	   to ignore all ICMP messages, either globally or on a per application
1362	   basis.  In such a configuration, it cannot be attacked unless the
1363	   attacker can identify and cause probe packets to be lost.  Attacking
1364	   PLPMTUD reduces performance, but not as much as attacking congestion
1365	   control by causing arbitrary packets to be lost.  Such an attacker
1366	   might do far more damage by completely disrupting specific other
1367	   protocols, such as DNS.

1369	   Since packetization protocols may share state with each other, if one
1370	   packetization protocol (particularly an application) were hostile to
1371	   other protocols on the same host, it could harm performance in the
1372	   other protocols by reducing the effective MTU.  If a packetization
1373	   protocol is untrusted, it should not be allowed to write to shared
1374	   state.

1376	12.  IANA Considerations

1378	   None.

1380	13.  References

1382	13.1.  Normative references

1384	   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
1385	              September 1981.

1387	   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
1388	              November 1990.

1390	   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
1391	              for IP version 6", RFC 1981, August 1996.

1393	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1394	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1396	   [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
1397	              (IPv6) Specification", RFC 2460, December 1998.

1399	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
1400	              RFC 793, September 1981.

1402	   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
1403	              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
1404	              Zhang, L., and V. Paxson, "Stream Control Transmission
1405	              Protocol", RFC 2960, October 2000.

1407	13.2.  Informative references

1409	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
1410	              Communication Layers", STD 3, RFC 1122, October 1989.

1412	   [RFC1809]  Partridge, C., "Using the Flow Label Field in IPv6",
1413	              RFC 1809, June 1995.

1415	   [RFC2923]  Lahey, K., "TCP Problems with Path MTU Discovery",
1416	              RFC 2923, September 2000.

1418	   [RFC2401]  Kent, S. and R. Atkinson, "Security Architecture for the
1419	              Internet Protocol", RFC 2401, November 1998.

1421	   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
1422	              RFC 2914, September 2000.

1424	   [RFC2461]  Narten, T., Nordmark, E., and W. Simpson, "Neighbor
1425	              Discovery for IP Version 6 (IPv6)", RFC 2461,
1426	              December 1998.

1428	   [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
1429	              Conservative Selective Acknowledgment (SACK)-based Loss
1430	              Recovery Algorithm for TCP", RFC 3517, April 2003.

1432	   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
1433	              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

1435	   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
1436	              RFC 896, January 1984.

1438	   [Kent87]   Kent, C. and J. Mogul, "Fragmentation considered harmful",
1439	              Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

1441	   [tcp-friendly]
1442	              Mahdavi, J. and S. Floyd, "TCP-Friendly Unicast Rate-Based
1443	              Flow Control", Technical note sent to the end2end-interest
1444	              mailing list, January 1997,
1445	              <http://www.psc.edu/networking/papers/tcp_friendly.html>.

1447	   [I-D.heffner-frag-harmful]
1448	              Heffner, J., "Fragmentation Considered Very Harmful",
1449	              draft-heffner-frag-harmful-01 (work in progress),
1450	              April 2006.

1452	   [I-D.tuexen-tsvwg-sctp-padding]
1453	              Tuexen, M. and R. Stewart, "Padding Chunk and Parameter
1454	              for SCTP", draft-tuexen-tsvwg-sctp-padding-00 (work in
1455	              progress), February 2006.

1457	Appendix A.  Acknowledgements

1459	   Many ideas and even some of the text come directly from RFC 1191 and
1460	   RFC 1981.

1462	   Many people made significant contributions to this document,
1463	   including: Randall Stewart for SCTP text, Michael Richardson for
1464	   material from an earlier ID on tunnels that ignore DF, Stanislav
1465	   Shalunov for the idea that pure PLPMTUD parallels congestion control,
1466	   and Matt Zekauskas for maintaining focus during the meetings.  Thanks
1467	   to the early implementors: Kevin Lahey, John Heffner and Rao Shoaib
1468	   who provided concrete feedback on weaknesses in earlier drafts.
1469	   Thanks also to all of the people who made constructive comments in
1470	   the working group meetings and on the mailing list.  I am sure I have
1471	   missed many deserving people.

1473	   Matt Mathis and John Heffner are supported in this work by a grant
1474	   from Cisco Systems, Inc.

1476	Authors' Addresses

1478	   Matt Mathis
1479	   Pittsburgh Supercomputing Center
1480	   4400 Fifth Avenue
1481	   Pittsburgh, PA  15213
1482	   US

1484	   Phone: 412-268-3319
1485	   Email: mathis@psc.edu

1487	   John W. Heffner
1488	   Pittsburgh Supercomputing Center
1489	   4400 Fifth Avenue
1490	   Pittsburgh, PA  15213
1491	   US

1493	   Phone: 412-268-2329
1494	   Email: jheffner@psc.edu

1496	Intellectual Property Statement

1498	   The IETF takes no position regarding the validity or scope of any
1499	   Intellectual Property Rights or other rights that might be claimed to
1500	   pertain to the implementation or use of the technology described in
1501	   this document or the extent to which any license under such rights
1502	   might or might not be available; nor does it represent that it has
1503	   made any independent effort to identify any such rights.  Information
1504	   on the procedures with respect to rights in RFC documents can be
1505	   found in BCP 78 and BCP 79.

1507	   Copies of IPR disclosures made to the IETF Secretariat and any
1508	   assurances of licenses to be made available, or the result of an
1509	   attempt made to obtain a general license or permission for the use of
1510	   such proprietary rights by implementers or users of this
1511	   specification can be obtained from the IETF on-line IPR repository at
1512	   http://www.ietf.org/ipr.

1514	   The IETF invites any interested party to bring to its attention any
1515	   copyrights, patents or patent applications, or other proprietary
1516	   rights that may cover technology that may be required to implement
1517	   this standard.  Please address the information to the IETF at
1518	   ietf-ipr@ietf.org.

1520	Disclaimer of Validity

1522	   This document and the information contained herein are provided on an
1523	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1524	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1525	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1526	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1527	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1528	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1530	Copyright Statement

1532	   Copyright (C) The Internet Society (2006).  This document is subject
1533	   to the rights, licenses and restrictions contained in BCP 78, and
1534	   except as set forth therein, the authors retain all their rights.

1536	Acknowledgment

1538	   Funding for the RFC Editor function is currently provided by the
1539	   Internet Society.