idnits 2.17.1 

draft-ietf-pmtud-method-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 15.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1391.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1368.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1375.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1381.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (March 3, 2006) is 6626 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 1981 (ref. '3') (Obsoleted by RFC 8201)

  ** Obsolete normative reference: RFC 2460 (ref. '5') (Obsoleted by RFC 8200)

  ** Obsolete normative reference: RFC  793 (ref. '6') (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 2960 (ref. '7') (Obsoleted by RFC 4960)

  -- Obsolete informational reference (is this intentional?): RFC 2401 (ref.
     '11') (Obsoleted by RFC 4301)

  -- Obsolete informational reference (is this intentional?): RFC 2461 (ref.
     '13') (Obsoleted by RFC 4861)

  -- Obsolete informational reference (is this intentional?): RFC 3517 (ref.
     '14') (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC  896 (ref.
     '15') (Obsoleted by RFC 7805)

  == Outdated reference: A later version (-01) exists of
     draft-tuexen-tsvwg-sctp-padding-00


     Summary: 8 errors (**), 0 flaws (~~), 3 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                          M. Mathis
3	Internet-Draft                                                J. Heffner
4	Expires: September 4, 2006                                           PSC
5	                                                           March 3, 2006

7	                           Path MTU Discovery
8	                       draft-ietf-pmtud-method-06

10	Status of this Memo

12	   By submitting this Internet-Draft, each author represents that any
13	   applicable patent or other IPR claims of which he or she is aware
14	   have been or will be disclosed, and any of which he or she becomes
15	   aware will be disclosed, in accordance with Section 6 of BCP 79.

17	   Internet-Drafts are working documents of the Internet Engineering
18	   Task Force (IETF), its areas, and its working groups.  Note that
19	   other groups may also distribute working documents as Internet-
20	   Drafts.

22	   Internet-Drafts are draft documents valid for a maximum of six months
23	   and may be updated, replaced, or obsoleted by other documents at any
24	   time.  It is inappropriate to use Internet-Drafts as reference
25	   material or to cite them other than as "work in progress."

27	   The list of current Internet-Drafts can be accessed at
28	   http://www.ietf.org/ietf/1id-abstracts.txt.

30	   The list of Internet-Draft Shadow Directories can be accessed at
31	   http://www.ietf.org/shadow.html.

33	   This Internet-Draft will expire on September 4, 2006.

35	Copyright Notice

37	   Copyright (C) The Internet Society (2006).

39	Abstract

41	   This document describes a robust method for Path MTU Discovery that
42	   relies on TCP or some other Packetization Layer to probe an Internet
43	   path with progressively larger packets.  This method is described as
44	   an extension to RFC 1191 and RFC 1981, which specify ICMP based Path
45	   MTU Discovery for IP versions 4 and 6, respectively.

47	   The general strategy of the new algorithm is to start with a small
48	   MTU and search upward, testing successively larger MTUs by probing
49	   with single packets.  If the probe is successfully delivered and
50	   satisfies a subsequent verification phase then the MTU is raised.  If
51	   the probe is lost, it is treated as an MTU limitation and not as a
52	   congestion signal.

54	   There are several options for integrating PLPMTUD with classical Path
55	   MTU Discovery.  PLPMTUD can be minimally configured to perform ICMP
56	   black hole recovery to increase the robustness of classical Path MTU
57	   Discovery, or ICMP processing can be completely disabled, and PLPMTUD
58	   can completely replace classical Path MTU Discovery.

60	   In the latter configuration, PLPMTUD exactly parallels congestion
61	   control.  An end-to-end transport protocol adjusts non-protocol
62	   properties of the data stream (window size or packet size) while
63	   using packet losses to deduce the appropriateness of the adjustments.
64	   This technique seems to be more philosophically consistent with the
65	   end-to-end principle than relying on ICMP messages containing
66	   transcribed headers of multiple protocol layers.

68	Table of Contents

70	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
71	     1.1.  Revision History . . . . . . . . . . . . . . . . . . . . .  4
72	       1.1.1.  Changes since version -05, November 2005 (IETF 64) . .  5
73	       1.1.2.  Changes since version -04, February 2005 (IETF 62) . .  5
74	       1.1.3.  Changes since version -03, October 2004 (IETF 61)  . .  5
75	       1.1.4.  Changes since version -02, July 19th 2004 (IETF 60)  .  5
76	   2.  Overview . . . . . . . . . . . . . . . . . . . . . . . . . . .  6
77	   3.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  9
78	   4.  Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 11
79	   5.  Layering . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
80	     5.1.  Accounting for header sizes  . . . . . . . . . . . . . . . 13
81	     5.2.  Storing PMTU information . . . . . . . . . . . . . . . . . 13
82	     5.3.  Accounting for IPsec . . . . . . . . . . . . . . . . . . . 15
83	     5.4.  Multicast  . . . . . . . . . . . . . . . . . . . . . . . . 15
84	   6.  Common Packetization Properties  . . . . . . . . . . . . . . . 15
85	     6.1.  Mechanism to detect loss . . . . . . . . . . . . . . . . . 15
86	     6.2.  Generating probes  . . . . . . . . . . . . . . . . . . . . 16
87	   7.  Host Fragmentation . . . . . . . . . . . . . . . . . . . . . . 17
88	   8.  The Probing Method . . . . . . . . . . . . . . . . . . . . . . 17
89	     8.1.  Packet size ranges . . . . . . . . . . . . . . . . . . . . 18
90	     8.2.  Selecting initial values . . . . . . . . . . . . . . . . . 18
91	     8.3.  Selecting probe size . . . . . . . . . . . . . . . . . . . 19
92	     8.4.  Probing preconditions  . . . . . . . . . . . . . . . . . . 20
93	     8.5.  Conducting a probe . . . . . . . . . . . . . . . . . . . . 20
94	     8.6.  Response to probe results  . . . . . . . . . . . . . . . . 21
95	       8.6.1.  Probe success  . . . . . . . . . . . . . . . . . . . . 21
96	       8.6.2.  Probe failure  . . . . . . . . . . . . . . . . . . . . 21
97	       8.6.3.  Probe timeout failure  . . . . . . . . . . . . . . . . 22
98	       8.6.4.  Probe inconclusive . . . . . . . . . . . . . . . . . . 22
99	     8.7.  Full stop timeout  . . . . . . . . . . . . . . . . . . . . 22
100	     8.8.  MTU verification . . . . . . . . . . . . . . . . . . . . . 22
101	   9.  Diagnostic Interface . . . . . . . . . . . . . . . . . . . . . 23
102	   10. Specific Packetization Layers  . . . . . . . . . . . . . . . . 24
103	     10.1. Probing method using TCP . . . . . . . . . . . . . . . . . 24
104	     10.2. Probing method using SCTP  . . . . . . . . . . . . . . . . 24
105	     10.3. Probing method using IP fragmentation  . . . . . . . . . . 25
106	     10.4. Probing method using applications  . . . . . . . . . . . . 26
107	   11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 27
108	     11.1. Normative references . . . . . . . . . . . . . . . . . . . 27
109	     11.2. Informative references . . . . . . . . . . . . . . . . . . 28
110	   Appendix A.  Security Considerations . . . . . . . . . . . . . . . 29
111	   Appendix B.  IANA Considerations . . . . . . . . . . . . . . . . . 29
112	   Appendix C.  Acknowledgements  . . . . . . . . . . . . . . . . . . 29
113	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 30
114	   Intellectual Property and Copyright Statements . . . . . . . . . . 31

116	1.  Introduction

118	   This document describes a method for Packetization Layer Path MTU
119	   Discovery (PLPMTUD) which is an extension to existing Path MTU
120	   Discovery methods as described in RFC 1191 [2] and RFC 1981 [3].  The
121	   proper MTU is determined by starting with small packets and probing
122	   with successively larger packets.  The bulk of the algorithm is
123	   implemented above IP, in the transport layer (e.g., TCP) or other
124	   "Packetization Protocol" that is responsible for determining packet
125	   boundaries.

127	   This document draws heavily on RFC 1191 and RFC 1981 for terminology,
128	   ideas, and some of the text.

130	   This document describes methods to discover the Path MTU using
131	   features of existing protocols.  The methods apply to IPv4 and IPv6,
132	   and many transport protocols.  They do not require cooperation from
133	   the lower layers (except that they are consistent about what packet
134	   sizes are acceptable) or the far node.  Variants in implementations
135	   will not cause interoperability problems.

137	   The methods described in this document are carefully designed to
138	   maximize robustness in the presence of less than ideal
139	   implementations of other protocols or Internet components.

141	   For the sake of clarity we uniformly prefer TCP and IPv6 terminology.
142	   In the terminology section we also present the analogous IPv4 terms
143	   and concepts for the IPv6 terminology.  In a few situations we
144	   describe specific details that are different between IPv4 and IPv6.

146	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
147	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
148	   document are to be interpreted as described in RFC 2119 [4].

150	   This draft is a product of the Path MTU Discovery (pmtud) working
151	   group of the IETF.  Please send comments and suggestions to
152	   pmtud@ietf.org.  Interim drafts and other useful information will be
153	   posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .

155	1.1.  Revision History

157	   These are all recent substantive changes, in reverse chronological
158	   order.  This section will be removed prior to publication as an RFC.
159	   Note that there are still some missing details that need to be
160	   resolved.  These are flagged by @@@@.  None of the missing details
161	   are serious.

163	1.1.1.  Changes since version -05, November 2005 (IETF 64)

165	   Re-worked probing method sections for TCP and SCTP.  The SCTP section
166	   reflects the new PAD chunk type, and contains some text from Michael
167	   Tuexen.

169	   Made a number of language clarification and consistency improvements,
170	   largely from comments by Gorry Fairhurst.

172	   Added appropriate citations, and removed the last of the "@@" TODO
173	   items.

175	1.1.2.  Changes since version -04, February 2005 (IETF 62)

177	   General restructuring and rewriting of some sections based on new
178	   experience.  Relaxed and generalized a lot of over-specified
179	   language, for example, the search strategy description.

181	   Decoupled verification from probing, and relaxed its specification.

183	   Removed all specified changes to ICMP processing.  We decided this
184	   was out of scope for this particular document.

186	   Changed all language to refer to MTU rather than MPS.

188	1.1.3.  Changes since version -03, October 2004 (IETF 61)

190	   A number of minor style and grammar edits.

192	1.1.4.  Changes since version -02, July 19th 2004 (IETF 60)

194	   Many minor updates throughout the document.

196	   Added a section describing the interactions between PLPMTUD and
197	   congestion control.

199	   Removed a difficult to implement requirement for future data to
200	   transmit.

202	   Added "IP Fragmentation" and "Application protocol" as Packetization
203	   Layers.

205	   Clarified interactions between TCP SACK and MTU.

207	   Updated SCTP section to reflect new probing method using "PAD
208	   chunks".

210	   Distilled the protocol specific material into separate subsections
211	   for each protocol.

213	   Added a section on common requirements and functions for all
214	   Packetization Layers.  More accurately characterized the
215	   "bidirectional" (and other) requirements of the PL protocol.  Updated
216	   the search strategy in this new section.

218	   Changed "ICMP can't fragment" and "packet too big" to uniformly use
219	   "ICMP PTB message" everywhere.

221	   Added Stanislav Shalunov's observation that PLPMTUD parallels
222	   congestion control.

224	   Better described the range of interoperability with classical pMTUd
225	   in the introduction.

227	   Removed vague language about "not being a protocol" and "excessive
228	   Loss".

230	   Slightly redefined flow: the granularity of PLPMTUD within a path.

232	   Many English NITs and clarifications per Gorry Fairhurst and others.
233	   Passes strict xml2rfc checking.

235	   Added a paragraph encouraging interface MTUs that are optimal for
236	   the NIC, rather than standard for the media.

238	   Added a revision history section.

240	2.  Overview

242	   This document describes a method for TCP or other Packetization
243	   Protocols to dynamically discover the MTU of a path without explicit
244	   signals from the network.  This method is most efficient when used in
245	   conjunction with the current ICMP based Path MTU Discovery mechanism
246	   as specified in RFC 1191 and RFC 1981.  When used in such a way, it
247	   eliminates many robustness problems since it does not depend on the
248	   delivery of ICMP messages.

250	   These procedures are applicable to TCP and other transport- or
251	   application-level protocols that are responsible for choosing packet
252	   boundaries (e.g., segment sizes) and have an acknowledgment structure
253	   that delivers to the sender accurate and timely indications of which
254	   packets were lost.

256	   The general strategy is for the Packetization Layer to find an
257	   appropriate Path MTU by probing the path with progressively larger
258	   packets.  If a probe packet is successfully delivered, then the
259	   effective Path MTU is raised to the probe size.

261	   The isolated loss of a probe packet (with or without an ICMP Packet
262	   Too Big message) is treated as an indication of an MTU limit, and not
263	   as a congestion indicator.  In this case alone, the Packetization
264	   Protocol is permitted to retransmit any missing data without
265	   adjusting the congestion window.

267	   If there is a timeout or additional packets are lost during the
268	   probing process, the probe is considered to be inconclusive (i.e.,
269	   the lost probe does not necessarily indicate that the probe exceeded
270	   the Path MTU).  Furthermore the losses are treated like any other
271	   congestion indication: window or rate adjustments are mandatory per
272	   the relevant congestion control standards of RFC 2914 [12].  Probing
273	   can resume after a delay which is determined by the nature of the
274	   detected failure.
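
   As an illustration only, the outcome rules above can be sketched in
   Python.  The names ProbeResult and classify_probe are hypothetical
   and are not defined by this document; a real Packetization Layer
   would derive the inputs from its own loss detection machinery.  The
   sketch assumes that a delivered probe counts as a success even if
   other packets were lost, while those other losses still require the
   normal congestion response.

```python
from enum import Enum


class ProbeResult(Enum):
    SUCCESS = "success"            # probe delivered: MTU may be raised
    FAILURE = "failure"            # isolated probe loss: MTU limit
    INCONCLUSIVE = "inconclusive"  # probe lost alongside other trouble


def classify_probe(probe_acked, other_packets_lost, timed_out):
    """Return (result, congestion_response_required).

    probe_acked        -- True if the probe packet was acknowledged
    other_packets_lost -- count of non-probe packets lost while probing
    timed_out          -- True if a retransmission timeout occurred
    """
    # Any loss other than the isolated probe is an ordinary congestion
    # signal and must trigger the standard window or rate reduction.
    congestion_response_required = timed_out or other_packets_lost > 0

    if probe_acked:
        result = ProbeResult.SUCCESS
    elif congestion_response_required:
        # The probe may have been lost to congestion rather than size.
        result = ProbeResult.INCONCLUSIVE
    else:
        # Only the probe was lost: treat as an MTU limit, and the
        # congestion window need not be adjusted.
        result = ProbeResult.FAILURE
    return result, congestion_response_required
```

   Only the FAILURE case with no other losses permits retransmitting the
   missing data without a congestion window adjustment.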

276	   PLPMTUD uses a searching technique to find the Path MTU.  Each
277	   conclusive probe narrows the MTU search range, either by raising the
278	   low limit on a successful probe or lowering the high limit on a
279	   failed probe, until the search range converges toward the true Path
280	   MTU.  For most transport layers, it makes sense to abandon the search
281	   once the range is narrow enough that the likely gain from picking a
282	   larger effective Path MTU is smaller than the search overhead to find
283	   it.
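
   A minimal sketch of such a search follows.  It assumes plain
   bisection, which is only one possible search strategy and is not
   mandated here.  The callback probe_fits is a hypothetical stand-in
   for sending one probe of the given size and waiting for a conclusive
   result, and the threshold value is an arbitrary illustration.

```python
def search_path_mtu(probe_fits, search_low, search_high, threshold=8):
    """Narrow [search_low, search_high] and return the effective MTU.

    probe_fits(size) -- hypothetical callback: True if a probe of
                        `size` bytes was delivered, False if it failed
    search_low       -- a size already known to work
    search_high      -- the largest size worth trying
    threshold        -- stop once the range is this narrow (bytes)
    """
    while search_high - search_low > threshold:
        # Round up so the probe is always larger than search_low.
        probe_size = (search_low + search_high + 1) // 2
        if probe_fits(probe_size):
            search_low = probe_size        # success raises the low limit
        else:
            search_high = probe_size - 1   # failure lowers the high limit
    return search_low
```

   For example, with a path that carries at most 1500 bytes,
   search_path_mtu(lambda size: size <= 1500, 1024, 9000) returns a
   value within 8 bytes below 1500; the remaining gap is the gain that
   the search gives up in exchange for fewer probes.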

285	   The most likely (and least serious) PLPMTUD failure is the link
286	   experiencing congestion related losses while probing.  In this case
287	   it is appropriate to retry a probe of the same size as soon as the
288	   Packetization Layer has fully adapted to the congestion and recovered
289	   from the losses.  In other cases, additional losses or timeouts
290	   indicate problems with the link or Packetization Layer.  In these
291	   situations it is desirable to use longer delays depending on the
292	   severity of the error.

294	   An optional verification phase can be used to detect some situations
295	   where raising the MTU raises the packet loss rate.  For example, if a
296	   link is striped across multiple physical channels with inconsistent
297	   MTUs, it is possible that a probe will be delivered even if it is too
298	   large for some of the physical channels.  In such cases raising the
299	   Path MTU to the probe size can cause severe packet loss and abysmal
300	   performance.  After raising the MTU, the new MTU size can be verified
301	   by monitoring the loss rate.

303	   PLPMTUD introduces some flexibility in the implementation of
304	   classical Path MTU discovery, which is subject to protocol failures
305	   (connection hangs) if ICMP PTB messages are not delivered or
306	   processed for some reason.  With PLPMTUD, classical Path MTU
307	   Discovery can include additional consistency checks (e.g., validating
308	   additional fields in the transcribed header) without increasing the
309	   risk of connection hangs due to false failures of the added checks.
310	   Such changes to classical Path MTU Discovery are beyond the scope of
311	   this document.

313	   In the limiting case, all ICMP PTB messages might be unconditionally
314	   ignored, and PLPMTUD can be the sole method used to discover
315	   the Path MTU.  In this configuration, PLPMTUD parallels congestion
316	   control.  An end-to-end transport protocol adjusts non-protocol
317	   properties of the data stream (window size or packet size) while
318	   using packet losses to deduce the appropriateness of the adjustments.
319	   This technique seems to be more philosophically consistent with the
320	   end-to-end principle of the Internet than relying on ICMP messages
321	   containing transcribed headers of multiple protocol layers.

323	   Most of the difficulty in implementing PLPMTUD arises because it
324	   needs to be implemented in several different places within a single
325	   node.  In general, each Packetization Protocol needs to have its own
326	   implementation of PLPMTUD.  Furthermore, the natural mechanism to
327	   share Path MTU information between concurrent or subsequent
328	   connections over the same path is a path information cache in the IP
329	   layer.  The various Packetization Protocols need to have the means to
330	   access and update the shared cache in the IP layer.  This memo
331	   describes PLPMTUD in terms of its primary subsystems without fully
332	   describing how they are assembled into a complete implementation.

334	   Section 3 provides a complete glossary of terms.

336	   Relatively few details of PLPMTUD affect interoperability with other
337	   standards or Internet protocols.  These details are specified in RFC
338	   2119 standards language in Section 4.  The vast majority of the
339	   implementation details described in this document are recommendations
340	   based on experiences with earlier versions of Path MTU Discovery.
341	   These recommendations are motivated by a desire to maximize
342	   robustness of PLPMTUD in the presence of less than ideal network
343	   conditions as they exist in the field.

345	   Section 5 describes how to partition PLPMTUD into layers, and how to
346	   manage the "path information cache" in the IP layer.

348	   Section 6 describes the general Packetization Layer properties and
349	   features needed to implement PLPMTUD.

351	   Section 7 recommends using IPv4 fragmentation in a configuration that
352	   mimics IPv6 functionality, to minimize future problems migrating to
353	   IPv6.

355	   Section 8 describes the details of how to use probes to search for
356	   the Path MTU.

358	   Section 9 describes a programming interface for applications acting as
359	   Packetization Layers, and for tools to be able to diagnose path
360	   problems that interfere with Path MTU Discovery.

362	   Section 10 discusses implementation details for specific protocols,
363	   including TCP.

365	3.  Terminology

367	   We use the following terms in this document:

369	   IP: Either IPv4 [1] or IPv6 [5].

371	   Node: A device that implements IP.

373	   Router: A node that forwards IP packets not explicitly addressed to
374	      itself.

376	   Host: Any node that is not a router.

378	   Upper layer: A protocol layer immediately above IP.  Examples are
379	      transport protocols such as TCP and UDP, control protocols such as
380	      ICMP, routing protocols such as OSPF, and Internet or lower-layer
381	      protocols being "tunneled" over (i.e., encapsulated in) IP such as
382	      IPX, AppleTalk, or IP itself.

384	   Link: A communication facility or medium over which nodes can
385	      communicate at the link layer, i.e., the layer immediately below
386	      IP.  Examples are Ethernets (simple or bridged); PPP links; X.25,
387	      Frame Relay, or ATM networks; and Internet (or higher) layer
388	      "tunnels", such as tunnels over IPv4 or IPv6.  Occasionally we use
389	      the slightly more general term "lower layer" for this concept.

391	   Interface: A node's attachment to a link.

393	   Address: An IP-layer identifier for an interface or a set of
394	      interfaces.

396	   Packet: An IP header plus payload.

398	   MTU: Maximum Transmission Unit, the size in bytes of the largest IP
399	      packet, including the IP header and payload, that can be
400	      transmitted on a link or path.  Note that this could more properly
401	      be called the IP MTU, to be consistent with how other standards
402	      organizations use the acronym MTU.

404	   Link MTU: The Maximum Transmission Unit, i.e., maximum IP packet size
405	      in bytes, that can be conveyed in one piece over a link.  Beware
406	      that this definition differs from the definition used by other
407	      standards organizations.

409	      For IETF documents, link MTU is uniformly defined as the IP MTU
410	      over the link.  This includes the IP header, but excludes link
411	      layer headers and other framing which is not part of IP or the IP
412	      payload.

414	      Be aware that other standards organizations generally define link
415	      MTU to include the link layer headers.

417	   Path: The set of links traversed by a packet between a source node
418	      and a destination node.

420	   Path MTU, or pMTU: The minimum link MTU of all the links in a path
421	      between a source node and a destination node.

423	   Classical Path MTU Discovery: Process described in RFC 1191 and RFC
424	      1981, in which nodes rely on ICMP "Packet Too Big" (PTB) messages
425	      to learn the MTU of a path.

427	   Packetization Layer: The layer of the network stack which segments
428	      data into packets.

430	   Effective PMTU: The current estimated value for PMTU used by a
431	      Packetization Layer for segmentation.

433	   PLPMTUD: Packetization Layer Path MTU Discovery, the method described
434	      in this document, which is an extension to classical PMTU
435	      discovery.

437	   PTB (Packet Too Big) message: An ICMP message reporting that an IP
438	      packet is too large to forward.  This is the IPv6 term that
439	      corresponds to the IPv4 "ICMP Can't fragment" message.

441	   Flow: A context in which MTU discovery algorithms can be invoked.
442	      This is naturally an instance of a Packetization Protocol, for
443	      example, one side of a TCP connection.

445	   MSS: The TCP Maximum Segment Size [6], the maximum payload size
446	      available to the TCP layer.  This is typically the Path MTU minus
447	      the size of the IP and TCP headers.

449	   Probe packet: A packet which is being used to test a path for a
450	      larger MTU.

452	   Probe size: The size of a packet being used to probe for a larger
453	      MTU.

455	   Probe gap: The payload data that will be lost and need to be
456	      retransmitted if the probe is not delivered.

458	   Leading window: Any unacknowledged data in a flow at the time a probe
459	      is sent.

461	   Trailing window: Any data in a flow sent after a probe, but before
462	      the probe is acknowledged.

464	   Search strategy: The heuristics used to choose successive probe sizes
465	      to converge on the proper Path MTU, as described in
466	      Section 8.3.

468	   Full stop timeout: a timeout where none of the packets transmitted
469	      after some event are acknowledged by the receiver, including any
470	      retransmissions.  This is taken as an indication of some failure
471	      condition in the network, such as a routing change onto a link
472	      with a smaller MTU.  This is described in more detail in
473	      Section 8.7.
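
   The Path MTU and MSS definitions above reduce to simple arithmetic,
   sketched here in Python.  The function names are hypothetical.  The
   header constants assume an IPv4 header without options (20 bytes),
   the fixed IPv6 header without extension headers (40 bytes), and a TCP
   header without options (20 bytes).

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, assuming no extension headers
TCP_HEADER = 20   # bytes, assuming no TCP options


def path_mtu(link_mtus):
    """Path MTU: the minimum link MTU over all links in the path."""
    return min(link_mtus)


def tcp_mss(pmtu, ip_header=IPV6_HEADER, tcp_header=TCP_HEADER):
    """Typical MSS: the Path MTU minus the IP and TCP header sizes."""
    return pmtu - ip_header - tcp_header
```

   For a path whose links have MTUs of 9000, 1500 and 1480 bytes, the
   Path MTU is 1480, and the corresponding MSS over IPv6 is 1420.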

475	4.  Requirements

477	   All Internet nodes SHOULD implement PLPMTUD in order to discover and
478	   take advantage of the largest MTU supported along the Internet path.

480	   Links MUST NOT deliver packets that are larger than their MTU.  Links
481	   that have parametric limitations (e.g., MTU bounds due to limited
482	   clock stability) MUST include explicit mechanisms to consistently
483	   reject packets that might otherwise be nondeterministically
484	   delivered.

486	   All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
487	   functionality.  All fragmentation SHOULD be done on the host, and all
488	   IPv4 packets, including fragments, SHOULD have the DF bit set such
489	   that they will not be fragmented (again) in the network.  See
490	   Section 7.

492	   The requirements below only apply to those implementations that
493	   include PLPMTUD.

495	   To use PLPMTUD a Packetization Layer MUST have a loss reporting
496	   mechanism that provides the sender with timely and accurate
497	   indications of which packets were lost in the network.

499	   Normal congestion control algorithms MUST remain in effect under all
500	   conditions except when only an isolated probe packet is detected as
501	   lost.  In this case alone the normal congestion (window or data rate)
502	   reduction MAY be suppressed.  If any other data loss is detected,
503	   standard congestion control MUST take place.

505	   Suppressed congestion control (as above) MUST be rate limited such
506	   that it occurs less frequently than the worst case loss rate for TCP
507	   congestion control at a comparable data rate over the same path
508	   (i.e., less than the "TCP-friendly" loss rate [17]).  This SHOULD be
509	   enforced by requiring a minimum headway between a suppressed
510	   congestion adjustment (due to a failed probe) and the next attempted
511	   probe, which is equal to one round trip time for each packet
512	   permitted by the congestion window.  Alternatively this may be
513	   enforced by not suppressing congestion control if a second probe is
514	   lost too soon after the first lost probe.  This is discussed in
515	   Section 8.6.2.
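
   The headway rule can be sketched as follows.  This is an illustration
   under stated assumptions, not a normative algorithm: the function
   names are hypothetical, the congestion window is assumed to be
   available as a packet count, and times are in seconds.

```python
def min_probe_headway(cwnd_packets, rtt_seconds):
    """One round trip time for each packet permitted by the window."""
    return cwnd_packets * rtt_seconds


def may_send_probe(now, last_suppressed_at, cwnd_packets, rtt_seconds):
    """True if enough time has passed since the last suppressed
    congestion adjustment (None means there has been none yet)."""
    if last_suppressed_at is None:
        return True
    headway = min_probe_headway(cwnd_packets, rtt_seconds)
    return now - last_suppressed_at >= headway
```

   With a window of 8 packets and a round trip time of 0.25 seconds, the
   next probe would be held off for at least 2 seconds after a failed
   probe whose congestion response was suppressed.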

517	   Whenever the MTU is raised, the congestion state variables MUST be
518	   rescaled so as not to raise the window size in bytes (or data rate in
519	   bytes per second).

521	   Whenever the MTU is reduced (e.g., when processing ICMP PTB messages)
522	   the congestion state variables SHOULD be rescaled so as not to raise
523	   the window size in packets.
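
   The two rescaling rules can be illustrated with the sketch below.  It
   is a simplification under stated assumptions: the function names are
   hypothetical, packet size is approximated by the MTU rather than the
   payload size, and the floor of one packet is an implementation
   choice that this document does not specify.

```python
def rescale_on_mtu_raise(cwnd_packets, old_mtu, new_mtu):
    """Window kept in packets: shrink the packet count so that the
    window size in bytes is not raised by the larger MTU."""
    return max(1, (cwnd_packets * old_mtu) // new_mtu)


def rescale_on_mtu_reduce(cwnd_bytes, old_mtu, new_mtu):
    """Window kept in bytes: shrink the byte count so that the
    window size in packets is not raised by the smaller MTU."""
    packets = max(1, cwnd_bytes // old_mtu)
    return packets * new_mtu
```

   For example, raising the MTU from 1500 to 9000 bytes turns a window
   of 60 packets into 10 packets (90000 bytes in both cases), and
   reducing it from 9000 to 1500 bytes turns a window of 90000 bytes
   into 15000 bytes (10 packets in both cases).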

525	   If PLPMTUD updates the MTU for a particular path, all Packetization
526	   Layer sessions that share the path representation SHOULD be notified
527	   to make use of the new MTU and make the required congestion
528	   adjustments.

530	   All implementations MUST include a mechanism to implement diagnostic
531	   tools that do not rely on the operating system's implementation of
532	   Path MTU discovery.  This specifically requires the ability to send
533	   packets that are larger than the known MTU for the path, and to
534	   collect any resultant ICMP error message.  See Section 9 for
535	   further discussion of MTU diagnostics.

537	5.  Layering

538	   Packetization Layer Path MTU Discovery is most easily implemented by
539	   splitting its functions between layers.  The IP layer is the best
540	   place to keep shared state, collect the ICMP messages, track IP
541	   header sizes and manage MTU information provided by the link layer
542	   interfaces.  However, the procedures that PLPMTUD uses for probing,
543	   verification and scanning for the Path MTU are very tightly coupled
544	   to features of the Packetization Layers such as data recovery and
545	   congestion control state machines.

547	   Note that this layering approach is consistent with the advice in the
548	   current PMTUD specifications in RFC 1191 and RFC 1981.  Many
549	   implementations of classical PMTU Discovery are already split along
550	   these same layers.

552	5.1.  Accounting for header sizes

554	   The way in which PLPMTUD operates across multiple layers requires a
555	   mechanism for accounting for header sizes at all layers between IP
556	   and the Packetization Layer (inclusive).  When transmitting non-probe
557	   packets, it is sufficient for the Packetization Layer to ensure an
558	   upper bound on final IP packet size, so as not to exceed the current
559	   effective Path MTU.  All Packetization Layers participating in
560	   classical Path MTU Discovery have this requirement already.  When
561	   participating in PLPMTUD and transmitting a probe packet, the
562	   Packetization Layer MUST determine that packet's final size including
563	   IP headers.  This requirement is specific to PLPMTUD, and to satisfy
564	   it existing implementations may need additional inter-layer
565	   communication.
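The header accounting described above can be illustrated with a minimal
sketch.  The constant names, the function names, and the example header
sizes are assumptions made for illustration; real header sizes depend on
the options and intermediate layers actually in use.

```python
# Hypothetical fixed header sizes in bytes (no options); illustration only.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def final_ip_packet_size(payload_len, layer_headers, ip_header):
    """Return the size of the packet as it leaves the IP layer.

    layer_headers lists the header sizes of every layer between IP and
    the Packetization Layer, including the Packetization Layer itself.
    """
    return ip_header + sum(layer_headers) + payload_len


def payload_for_probe(probe_size, layer_headers, ip_header):
    """Return the payload length that yields a packet of exactly probe_size."""
    payload = probe_size - ip_header - sum(layer_headers)
    if payload < 0:
        raise ValueError("probe size is smaller than the combined headers")
    return payload
```

For non-probe packets only the upper bound matters, whereas a probe needs
the exact figure, which is why the second helper works backward from the
desired probe size.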

567	5.2.  Storing PMTU information

569	   This memo uses the concept of a "flow" to define the scope of the
570	   Path MTU discovery algorithms.  For many implementations, a flow
571	   would naturally correspond to an instance of each protocol, i.e.,
572	   each connection or session.  In such implementations the algorithms
573	   described in this document are performed within each session for each
574	   protocol.  The observed PMTU can optionally be shared between
575	   different flows sharing a common path representation.

577	   Alternatively, PLPMTUD could be implemented such that the complete
578	   PLPMTUD state is associated with the path representations.  Such an
579	   implementation could use multiple connections or sessions for each
580	   probe sequence.  This approach may converge much more quickly in some
581	   environments such as when an application uses many small connections,
582	   each of which may be too short to complete the Path MTU Discovery
583	   process.

585	   These approaches are not mutually exclusive.  However, due to
586	   differing constraints on generating probes (Section 6.2) and the MTU
587	   searching algorithm (Section 8.3), it may not be feasible for
588	   different Packetization Layer protocols to share PLPMTUD state.  This
589	   suggests that it may be possible for some protocols to share probing
590	   state, but not others.  In this case, the different protocols can
591	   still share the observed PMTU but they will have differing
592	   convergence properties.

594	   The IP layer is the best place to store cached PMTU values and other
595	   shared state such as MTU values reported by ICMP PTB messages.
596	   Ideally this shared state should be associated with a specific path
597	   traversed by packets exchanged between the source and destination
598	   nodes.  However, in most cases a node will not have enough
599	   information to completely and accurately identify such a path.
600	   Rather, a node must associate a PMTU value with some local
601	   representation of a path.  It is left to the implementation to select
602	   the local representation of a path.

604	   An implementation could use the destination address as the local
605	   representation of a path.  The PMTU value associated with a
606	   destination would be the minimum PMTU learned across the set of all
607	   paths in use to that destination.  The set of paths in use to a
608	   particular destination is expected to be small, in many cases
609	   consisting of a single path.  This approach will result in the use of
610	   optimally sized packets on a per-destination basis.  This approach
611	   integrates nicely with the conceptual model of a host as described in
612	   RFC 2461 [13]: a PMTU value could be stored with the corresponding
613	   entry in the destination cache.  Storing the minimum value is
614	   suggested since NATs and other forms of middle boxes may exhibit
615	   differing PMTUs at a single IP address.
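As a sketch of the per-destination approach, the following cache keeps
the minimum PMTU learned for each destination address.  The class and
method names are hypothetical, and aging of entries (needed so that a
larger PMTU can later be rediscovered) is deliberately omitted.

```python
class DestinationPmtuCache:
    """Maps a destination address to the minimum PMTU learned for it."""

    def __init__(self, default_pmtu):
        self._default = default_pmtu
        self._pmtu = {}

    def record(self, destination, pmtu):
        """Store pmtu only if it is smaller than the value already held."""
        current = self._pmtu.get(destination)
        if current is None or pmtu < current:
            self._pmtu[destination] = pmtu

    def lookup(self, destination):
        """Return the cached PMTU, or the default when nothing is cached."""
        return self._pmtu.get(destination, self._default)
```

Keeping the minimum is what protects flows when several paths, or a
middle box behind a single address, present different PMTUs.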

617	   Note that network or subnet numbers are not suitable to use as
618	   representations of a path, because there is not a general mechanism
619	   to determine the network mask at the remote host.

621	   If IPv6 flows are in use, an implementation could use the IPv6 flow
622	   id [5][9] as the local representation of a path.  Packets sent to a
623	   particular destination but belonging to different flows may use
624	   different paths, with the choice of path depending on the flow id.
625	   This approach will result in the use of optimally sized packets on a
626	   per-flow basis, providing finer granularity than MTU values
627	   maintained on a per-destination basis.

629	   For source routed packets, i.e., packets containing an IPv6 routing
630	   header, or IPv4 LSRR or SSRR options, the source route may further
631	   qualify the local representation of a path.  An implementation could
632	   use source route information in the local representation of a path.

634	5.3.  Accounting for IPsec

636	   This document does not take a stance on the placement of IPsec, which
637	   logically sits between IP and the Packetization Layer.  The PLPMTUD
638	   implementation can treat IPsec either as part of IP or as part of the
639	   Packetization Layer, as long as the accounting is consistent within
640	   the implementation.  If IPsec is treated as part of the IP layer,
641	   then each security association to a remote node may need to be
642	   treated as a separate path; i.e., the security association is used to
643	   represent the path.  If IPsec is treated as part of the Packetization
644	   Layer, the IPsec header size must be included in the Packetization
645	   Layer's header size calculations [11].

647	5.4.  Multicast

649	   In the case of a multicast destination address, copies of a packet
650	   may traverse many different paths to reach many different nodes.  The
651	   local representation of the "path" to a multicast destination must in
652	   fact represent a potentially large set of paths.

654	   Minimally, an implementation could maintain a single MTU value to be
655	   used for all packets originated from the node.  This MTU value would
656	   be the minimum MTU learned across the set of all paths in use by the
657	   node.  This approach is likely to result in the use of smaller
658	   packets than is necessary for many paths.

660	   If the application using multicast gets complete delivery reports
661	   (unlikely because this requirement has poor scaling properties),
662	   PLPMTUD could be implemented in multicast protocols.

664	6.  Common Packetization Properties

666	   This section describes general Packetization Layer properties and
667	   characteristics needed to implement PLPMTUD.  It also describes some
668	   implementation issues that are common to all Packetization Layers.

670	6.1.  Mechanism to detect loss

672	   It is important that the Packetization Layer has a timely and robust
673	   mechanism for detecting and reporting losses.  PLPMTUD makes MTU
674	   adjustments on the basis of detected losses.  Any delays or
675	   inaccuracy in loss notification is likely to result in incorrect MTU
676	   decisions or slow convergence.

678	   It is best if Packetization Protocols use fairly explicit loss
679	   notification such as selective acknowledgments, although implicit
680	   mechanisms such as TCP Reno style duplicate acknowledgments counting
681	   are sufficient.  It is important that the mechanism can robustly
682	   distinguish between the isolated loss of just a probe and other
683	   combinations of losses.

685	   Many protocol implementations have sophisticated mechanisms such as a
686	   SACK scoreboard [14] to distinguish real losses from reordered data.
687	   In these implementations it is desirable to signal losses to PLPMTUD
688	   as a side effect of the data retransmission.  This approach offers
689	   the maximum protection from confusing signals due to reordering and
690	   other events that might mimic losses.

692	   PLPMTUD can also be implemented in protocols that rely on timeouts as
693	   their primary mechanism for loss recovery; however, timeouts should
694	   be used only when there are no other alternatives.

696	6.2.  Generating probes

698	   There are several possible ways to alter Packetization Layers to
699	   generate probes.  The different techniques incur different overheads
700	   in three areas: difficulty in generating the probe packet (in terms
701	   of Packetization Layer implementation complexity and extra data
702	   motion), possible additional network capacity consumed by the probes,
703	   and the overhead of recovering from failed probes (both network and
704	   protocol overheads).

706	   Some protocols might be extended to allow arbitrary padding with
707	   dummy data.  This greatly simplifies the implementation because the
708	   probing can be performed without participation from higher layers and
709	   if the probe fails, the missing data (the "probe gap") is assured to
710	   fit within the current MTU when it is retransmitted.  This is
711	   probably the most appropriate method for protocols that support
712	   arbitrary length options or multiplexing within the protocol itself.

714	   Many Packetization Layer protocols can carry pure control messages
715	   (without any data from higher protocol layers) which can be padded to
716	   arbitrary lengths.  For example, the SCTP PAD chunk can be used in
717	   this manner (see Section 10.2).  This approach has the advantage that
718	   nothing needs to be retransmitted if the probe is lost.

720	   These techniques do not work for TCP, because there is not a separate
721	   length field or other mechanism to differentiate between padding and
722	   real payload data.  With TCP the only approach is to send additional
723	   payload data in an over-sized segment.  There are at least two
724	   variants of this approach, discussed in Section 10.1.

726	   In a few cases, there may be no reasonable mechanisms to generate
727	   probes within the Packetization Layer protocol itself.  As a last
728	   resort, it may be possible to rely on an adjunct protocol, such as
729	   ICMP ECHO ("ping"), to send probe packets.  See Section 10.3 for
730	   further discussion of this approach.

732	7.  Host Fragmentation

734	   Packetization Layers are encouraged to avoid sending messages that
735	   will require fragmentation [16] [18].  However, entirely preventing
736	   fragmentation is not always possible.  Some Packetization Layers,
737	   such as a UDP application outside the kernel, may be unable to change
738	   the size of messages they send, resulting in datagram sizes that
739	   exceed the Path MTU.

741	   IPv4 permitted such applications to send packets without the DF bit
742	   set.  Oversized packets without the DF bit set would be fragmented in
743	   the network or sending host when they encountered a link with an MTU
744	   smaller than the packet.  In some cases, packets could be fragmented
745	   more than once if there were cascaded links with progressively
746	   smaller MTUs.  This approach is not recommended.

748	   It is recommended that IPv4 implementations use a strategy that
749	   mimics IPv6 functionality.  When an application sends datagrams that
750	   are larger than the known Path MTU they should be fragmented to the
751	   Path MTU in the host IP layer even if they are smaller than the link
752	   MTU of the first network hop directly attached to the host.  The DF
753	   bit should be set on the fragments, so they will not be fragmented
754	   again in the network.

756	   This technique will minimize future surprises as the Internet
757	   migrates to IPv6.  Otherwise, the potential exists for widely
758	   deployed applications or services relying on IPv4 fragmentation in a
759	   way that cannot be implemented in IPv6.  At least one major operating
760	   system already uses this strategy.

762	   The ability to selectively transmit packets larger than the current
763	   effective Path MTU (but smaller than the link MTU) is REQUIRED, to be
764	   able to send probes generated by Packetization Layers participating
765	   in PLPMTUD, and to facilitate diagnostic utilities.

767	   Note that IP fragmentation divides data into packets, so it is
768	   minimally a Packetization Layer.  However, it does not have a
769	   mechanism to detect lost packets, so it cannot support a native
770	   implementation of PLPMTUD.  Fragmentation-based PLPMTUD requires an
771	   adjunct protocol as described in Section 10.3.

773	8.  The Probing Method
774	   This section describes the details of the MTU probing method,
775	   including how to send probes and process error indications necessary
776	   to search for the Path MTU.

778	8.1.  Packet size ranges

780	   This document describes the probing method using three state
781	   variables:
782	   search_low: The smallest available probe size, minus one.
783	   search_high: The greatest available probe size.
784	   eff_pmtu: The effective PMTU for this flow.

786	               search_low          eff_pmtu         search_high
787	                   |                   |                  |
788	           ...------------------------->
789	               non-probe size range
790	                   <-------------------------------------->
791	                               probe size range

793	   Figure 1

795	   When transmitting probes, the Packetization Layer MUST select the
796	   probe size from within the range "(search_low, search_high]".  When
797	   transmitting non-probes, it SHOULD create packets of size less than
798	   or equal to eff_pmtu.

800	   The eff_pmtu must be in the range "[search_low, search_high]".  When
801	   probing upward, eff_pmtu always equals search_low.  However, in other
802	   states this may not be the case, for example, due to initial
803	   conditions or after ICMP PTB message processing.
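The three state variables and the range rules above can be captured in a
small sketch.  The class name and method names are assumptions made for
illustration; only search_low, search_high and eff_pmtu come from this
document.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    search_low: int   # smallest available probe size, minus one
    search_high: int  # greatest available probe size
    eff_pmtu: int     # effective PMTU for this flow

    def is_consistent(self):
        """eff_pmtu must lie in [search_low, search_high]."""
        return self.search_low <= self.eff_pmtu <= self.search_high

    def valid_probe_size(self, size):
        """Probe sizes must lie in (search_low, search_high]."""
        return self.search_low < size <= self.search_high

    def valid_non_probe_size(self, size):
        """Non-probe packets should not exceed eff_pmtu."""
        return size <= self.eff_pmtu
```

Note the half-open probe range: a packet of exactly search_low bytes is
already known to fit, so it carries no new information as a probe.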

805	8.2.  Selecting initial values

807	   The initial value for search_high should be the largest possible
808	   packet supported by the flow.  This may be limited by the local
809	   interface MTU, by a protocol mechanism such as the TCP MSS option, or
810	   an intrinsic limit such as the protocol length field.

812	   It is recommended that search_low be initially set to a value likely
813	   to work over a large range of links.  Given today's technologies, a
814	   value of 512 bytes is likely to work.  For IPv6 flows, a value of
815	   1280 is appropriate.  The initial value for search_low SHOULD be
816	   configurable.

818	   Properly functioning Path MTU Discovery is critical to the robust and
819	   efficient operation of the Internet.  Any major change (as described
820	   in this document) has the potential to be very disruptive if it
821	   contains any errors or oversights.  The selection of initial values
822	   determines to what extent a PLPMTUD implementation's behavior differs
823	   from classical PMTUD in cases where MTU discovery is not needed, or
824	   where classical PMTUD is sufficient.

826	   It may be desirable to configure hosts in such a way that PLPMTUD
827	   only has an effect in cases where classical PMTUD fails.  Setting
828	   eff_pmtu = search_high and relying on black hole detection has this
829	   effect.  Using initial values of search_low = eff_pmtu = search_high
830	   effectively disables PLPMTUD, resorting to only classical PMTUD.
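The initial-value choices discussed in this section might be expressed
as follows.  The function name and the three mode names are hypothetical
labels chosen for this sketch; the numeric defaults are the values
suggested above.

```python
def initial_state(largest_packet, mode="probe", ipv6=False, search_low=None):
    """Return (search_low, eff_pmtu, search_high) for a new flow.

    mode "probe":          start small and probe upward.
    mode "blackhole_only": act only when classical PMTUD fails.
    mode "disabled":       classical PMTUD only.
    """
    if search_low is None:
        search_low = 1280 if ipv6 else 512
    # search_low can never exceed the largest packet the flow supports.
    search_low = min(search_low, largest_packet)
    search_high = largest_packet
    if mode == "probe":
        return (search_low, search_low, search_high)
    if mode == "blackhole_only":
        return (search_low, search_high, search_high)
    if mode == "disabled":
        return (search_high, search_high, search_high)
    raise ValueError("unknown mode: %r" % (mode,))
```

Exposing search_low as a parameter reflects the recommendation that its
initial value be configurable.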

832	   In some cases where it is known that classical PMTUD is likely to
833	   fail, using a conservatively small initial eff_pmtu may produce
834	   better results by avoiding the costly timeouts required for black
835	   hole detection.  The trade-off is that using a smaller initial
836	   eff_pmtu than necessary can cause reduced performance.  Appropriate
837	   initial values for PLPMTUD state variables may vary not only per host
838	   but per path.  As such, per-route configuration options for these
839	   values are desirable.

841	8.3.  Selecting probe size

843	   The probe may have a size anywhere in the "probe size range"
844	   described above.  However, a number of factors affect the selection
845	   of an appropriate size.  A simple strategy might be to do a binary
846	   search halving the probe size range with each probe.  However, for
847	   some protocols, data in a lost probe may require retransmission,
848	   making a failed probe more expensive than a successful probe.  For
849	   such protocols, a strategy using smaller probe sizes and "probing up"
850	   may behave better.  For many protocols, both at and above the
851	   Packetization Layer, the benefit of increasing MTU sizes may follow a
852	   step function such that it is not advantageous to probe within
853	   certain regions at all.

855	   As an optimization, it may be appropriate to probe at certain common
856	   or expected MTU sizes, for example, 1500 bytes for standard Ethernet,
857	   or 1500 bytes minus header sizes for tunnel protocols.
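One possible selection strategy, combining the common-size optimization
with a binary search, is sketched below.  It is only one of the
strategies this section permits.  The function name and the list of
common sizes are assumptions; in particular, 1480 assumes a tunnel that
adds a 20-byte header.

```python
# Hypothetical list of sizes worth trying first: standard Ethernet, and
# Ethernet minus an assumed 20-byte tunnel header.
COMMON_MTUS = (1500, 1480)


def next_probe_size(search_low, search_high, common_sizes=COMMON_MTUS):
    """Return the next probe size in (search_low, search_high], or None."""
    if search_high <= search_low:
        return None  # the probe size range is empty
    candidates = [s for s in common_sizes if search_low < s <= search_high]
    if candidates:
        return max(candidates)
    # Midpoint of the range, rounded up so the result is above search_low.
    return search_low + (search_high - search_low + 1) // 2
```

A protocol for which failed probes are expensive might instead step
upward from search_low in small increments.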

859	   Some protocols may not even "choose" probe sizes.  For protocols
860	   which have certain natural data block sizes, an effective strategy
861	   could be to simply treat blocks whose size falls in the probe size
862	   range as a probe.

864	   Each Packetization Layer must determine when probing is considered
865	   converged; that is, when the probe size range is considered small
866	   enough that further probing is no longer worth its cost.  When it is
867	   determined that searching has converged, a timer should be set.  When
868	   the timer expires, search_high should be reset to its initial value
869	   (described above) so that probing can resume.  This is so that if the
870	   path changes, and an increased Path MTU is available, then the flow
871	   will eventually be able to take advantage of it to send larger
872	   packets.  The recommended value for this timer is 10 minutes, per RFC
873	   1981.

875	8.4.  Probing preconditions

877	   Before sending a probe, the flow must at least meet the following
878	   conditions:
879	   o  The flow has no outstanding probes or losses.
880	   o  If the last probe failed or was inconclusive, then the probe
881	      timeout has expired (see Section 8.6.2).
882	   o  The available window is greater than the probe size.
883	   o  For a protocol using in-band data for probing, enough data is
884	      available to send the probe.
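The preconditions listed above amount to a single predicate.  The sketch
below is illustrative only: the function and parameter names are
hypothetical, and the window and probe size are assumed to be expressed
in the same units (bytes).

```python
def may_probe(*, outstanding_probes, outstanding_losses,
              last_probe_succeeded, probe_timeout_expired,
              available_window, probe_size,
              uses_inband_data=False, queued_bytes=0, probe_payload=0):
    """Return True only if every minimum precondition for probing holds."""
    if outstanding_probes or outstanding_losses:
        return False
    # After a failed or inconclusive probe, wait for the probe timeout.
    if not last_probe_succeeded and not probe_timeout_expired:
        return False
    # The available window must be strictly greater than the probe size.
    if available_window <= probe_size:
        return False
    # In-band probing needs enough queued data to fill the probe.
    if uses_inband_data and queued_bytes < probe_payload:
        return False
    return True
```

These are minimum conditions; a Packetization Layer may impose further
ones of its own.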

886	   For protocols which probe with in-band data, when not enough data is
887	   available to probe, the protocol may wish to delay sending non-probes
888	   in order to accumulate enough data to send a probe.  A delayed
889	   sending algorithm such as Nagle [15] should be used to appropriately
890	   limit the time data is delayed.

892	   Some protocols may require additional packets after the loss to
893	   detect it promptly (e.g., TCP loss detection using duplicate
894	   acknowledgments).  Such a protocol should wait until sufficient data
895	   and window space are available so that it will be able to transmit
896	   enough data after the probe to trigger the loss detection mechanism
897	   in the event of a lost probe.

899	8.5.  Conducting a probe

901	   Once a probe size in the appropriate range has been selected, and the
902	   above preconditions have been met, the Packetization Layer may
903	   conduct a probe.  To do so, it creates a probe packet such that its
904	   size, including the outermost IP headers, is equal to the probe size.
905	   After sending the probe it awaits a response, which will have one of
906	   the following results:
907	   Success: The probe is acknowledged as having been received by the
908	      remote host.

910	   Failure: A protocol mechanism indicates that the probe was lost, but
911	      no packets in the leading or trailing window were lost.

913	   Timeout failure: A protocol mechanism indicates that the probe was
914	      lost, and no packets in the leading window were lost, but is
915	      unable to determine if any packets in the trailing window were
916	      lost.  For example, loss is detected by a timeout, and go-back-n
917	      retransmission is used.

919	   Inconclusive: The probe was lost in addition to other packets in the
920	      leading or trailing windows.
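The four outcomes can be distinguished with a short decision procedure.
In this sketch the enumeration and function names are hypothetical, and
a trailing-window status of None is an assumed convention meaning that
the protocol cannot tell whether trailing packets were lost.

```python
from enum import Enum


class ProbeResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT_FAILURE = "timeout failure"
    INCONCLUSIVE = "inconclusive"


def classify_probe(probe_acked, leading_lost, trailing_lost):
    """Classify a completed probe.

    trailing_lost is True, False, or None when the loss status of the
    trailing window is unknown (for example, after a timeout followed by
    go-back-n retransmission).
    """
    if probe_acked:
        return ProbeResult.SUCCESS
    if leading_lost:
        return ProbeResult.INCONCLUSIVE
    if trailing_lost is None:
        return ProbeResult.TIMEOUT_FAILURE
    if trailing_lost:
        return ProbeResult.INCONCLUSIVE
    return ProbeResult.FAILURE
```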

922	8.6.  Response to probe results

924	   When a probe has completed, the result should be processed as
925	   follows, categorized by the probe's result type.

927	8.6.1.  Probe success

929	   When the probe is delivered, this is an indication that the Path MTU
930	   is at least as large as the probe size.  The Packetization Layer
931	   should set search_low to the probe size, and eff_pmtu to
932	   "max(eff_pmtu, probe size)".

934	   Note that if a flow's packets are routed via multiple paths, or over
935	   a path with a non-deterministic MTU, delivery of a single probe
936	   packet does not indicate that all packets of that size will be
937	   delivered.  To be robust in such a case, the Packetization Layer
938	   should conduct MTU verification as described in Section 8.8.

940	8.6.2.  Probe failure

942	   When only the probe is lost, this is treated as an indication that
943	   the Path MTU is smaller than the probe size.  In this case alone, the
944	   loss should not be interpreted as a congestion signal.

946	   In the absence of other indications, the Packetization Layer should
947	   set search_high to the probe size minus one, and eff_pmtu to
948	   "min(eff_pmtu, probe size)".

950	   If an ICMP PTB message is received matching the probe packet, then
951	   search_high and eff_pmtu may be set from the MTU value indicated in
952	   the message.  Note that the ICMP message may be received either
953	   before or after the protocol loss indication.
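The state updates for probe success (Section 8.6.1) and for probe failure
in the absence of other indications can be sketched as follows.  The
dictionary representation and the function names are assumptions, and
the optional handling of an ICMP PTB message is left out.

```python
def on_probe_success(state, probe_size):
    """The path carried probe_size bytes, so raise the lower bound."""
    state["search_low"] = probe_size
    state["eff_pmtu"] = max(state["eff_pmtu"], probe_size)


def on_probe_failure(state, probe_size):
    """Only the probe was lost, so lower the upper bound."""
    state["search_high"] = probe_size - 1
    state["eff_pmtu"] = min(state["eff_pmtu"], probe_size)
```

For example, starting from search_low = eff_pmtu = 512 and search_high =
1500, a successful 1000-byte probe followed by a failed 1250-byte probe
leaves search_low = eff_pmtu = 1000 and search_high = 1249.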

955	   A probe failure event is the one situation under which the
956	   Packetization Layer is permitted not to treat loss as a congestion
957	   signal.  Because there is some small risk that suppressing congestion
958	   control might have unanticipated consequences (even for one isolated
959	   loss), it is required that probe failure events be less frequent than
960	   the normal period for losses under standard congestion control.
961	   Specifically, after a probe failure event and suppressed congestion
962	   control, PLPMTUD may not probe again until an interval which is
963	   comparable to the expected interval between congestion control
964	   events.  This is required in Section 4.  The simplest estimate of the
965	   interval to the next congestion event is the same number of round
966	   trips as the current congestion window in packets.

968	8.6.3.  Probe timeout failure

970	   If the loss was detected with a timeout and repaired with go-back-n
971	   retransmission, then congestion window reduction will be necessary.
972	   The relatively high price of a failed probe in this case may merit a
973	   longer timeout.  A timeout value of five times the non-timeout
974	   failure case is recommended.
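The waiting periods described in Sections 8.6.2 and 8.6.3 can be
estimated with a sketch such as the following.  The function and
parameter names are hypothetical; the estimate of one round trip per
packet of congestion window and the factor of five are taken from the
text above.

```python
TIMEOUT_FAILURE_FACTOR = 5  # recommended multiple for timeout failures


def probe_backoff(cwnd_packets, rtt_seconds, timeout_failure=False):
    """Return the minimum delay in seconds before the next probe.

    The simplest estimate of the interval to the next congestion event
    is as many round trips as the congestion window holds packets.
    """
    if cwnd_packets < 1:
        raise ValueError("congestion window must be at least one packet")
    interval = cwnd_packets * rtt_seconds
    if timeout_failure:
        interval *= TIMEOUT_FAILURE_FACTOR
    return interval
```

With a window of 8 packets and a 250 ms round trip, this gives 2 seconds
after an ordinary probe failure and 10 seconds after a timeout failure.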

976	8.6.4.  Probe inconclusive

978	   The presence of other losses near the loss of the probe may indicate
979	   that the probe was lost due to congestion rather than because of an
980	   MTU limitation.  In this case it is appropriate to update no state,
981	   and simply probe again when the probing preconditions are met; i.e.,
982	   when no recent losses have been observed.  At this point, it is
983	   particularly appropriate to re-probe since the flow's congestion
984	   window will be at its lowest point, minimizing the probability of
985	   congestive losses.

987	8.7.  Full stop timeout

989	   Under all conditions a full stop timeout (also known as a "persistent
990	   timeout" in other documents) should be taken as an indication of some
991	   significantly disruptive event in the network, such as a router
992	   failure or a routing change to a path with a smaller MTU.  For TCP,
993	   this occurs when the R1 timeout threshold described by RFC 1122 [8]
994	   expires.

996	   If there is a full stop timeout and there was not an ICMP message
997	   indicating a reason (PTB, Net unreachable, etc., or the ICMP message
998	   was ignored for some reason), the suggested first recovery action is
999	   to treat this as a detected black hole as described in RFC 2923 [10].

1001	   The response to a detected black hole should be to set search_low to
1002	   its initial value, and set eff_pmtu to search_low.  Upon further
1003	   successive timeouts, search_low and eff_pmtu should be halved, with a
1004	   lower bound of 68 bytes for IPv4 and 1280 bytes for IPv6.
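The black hole response can be sketched as a function of the number of
successive full stop timeouts.  The function name and its parameters are
assumptions made for illustration.

```python
def black_hole_response(initial_search_low, timeouts, ipv6=False):
    """Return (search_low, eff_pmtu) after successive full stop timeouts.

    The first timeout resets both values to the initial search_low; each
    further timeout halves them, subject to the protocol's lower bound.
    """
    if timeouts < 1:
        raise ValueError("at least one timeout must have occurred")
    floor = 1280 if ipv6 else 68
    value = initial_search_low
    for _ in range(timeouts - 1):
        value = max(floor, value // 2)
    return (value, value)
```

For IPv4 with an initial search_low of 512, successive timeouts yield
512, 256, 128 and then 68, since halving 128 would fall below the bound.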

1006	8.8.  MTU verification

1008	   It is possible for a flow to simultaneously traverse multiple paths,
1009	   but it will only be able to keep a single path representation for the
1010	   flow.  If in such a case the paths have different MTUs, storing the
1011	   minimum MTU of all paths in the flow's path representation will
1012	   result in correct, though sub-optimal behavior.  If ICMP PTB messages
1013	   are delivered, then classical PMTUD will work correctly in this
1014	   situation.

1016	   If ICMP delivery fails, breaking classical PMTUD, the connection will
1017	   rely on PLPMTUD.  However, in this case, PLPMTUD may fail as well
1018	   since its requirement that links MUST NOT deliver packets larger than
1019	   their MTU is violated.  A probe with a size greater than the minimum
1020	   but smaller than the maximum of the Path MTUs may be successful.
1021	   However, upon raising the flow's effective PMTU, the loss rate may
1022	   significantly increase.  The flow may still make progress, but the
1023	   resultant loss rate may be unacceptable.  For example, when using
1024	   two-way round-robin striping, 50% of full-sized packets would be
1025	   lost.

1027	   Striping in this manner is often operationally undesirable (e.g., due
1028	   to packet reordering), and is usually avoided by hashing flows to a
1029	   single path.  However, to increase robustness, an implementation
1030	   should implement some form of MTU verification, such that if
1031	   increasing eff_pmtu results in a sharp increase in loss rate, it will
1032	   fall back to using a lower MTU.

1034	   A recommended strategy would be to save the value of eff_pmtu before
1035	   raising it.  Then, if loss rate rises above a threshold for a period
1036	   of time (e.g., loss rate is higher than 10% over multiple RTO
1037	   intervals), then the new MTU is considered incorrect.  The saved
1038	   value of eff_pmtu can be restored, and search_high reduced in the
1039	   same manner as in a probe failure.  PLPMTUD implementations SHOULD
1040	   implement MTU verification.
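The recommended verification strategy might look like the sketch below.
The function name and parameters are hypothetical, and the choice of
three consecutive RTO intervals is an assumption standing in for
"multiple RTO intervals"; the 10% threshold is the example given above.

```python
def verify_raise(saved_eff_pmtu, new_eff_pmtu, search_high,
                 interval_loss_rates, threshold=0.10, min_intervals=3):
    """Return (eff_pmtu, search_high) after observing recent loss rates.

    interval_loss_rates holds one loss rate (0.0 to 1.0) per RTO
    interval since eff_pmtu was raised, oldest first.
    """
    recent = interval_loss_rates[-min_intervals:]
    sustained = (len(recent) == min_intervals
                 and all(rate > threshold for rate in recent))
    if sustained:
        # Treat the raise like a failed probe of size new_eff_pmtu.
        return (saved_eff_pmtu, min(search_high, new_eff_pmtu - 1))
    return (new_eff_pmtu, search_high)
```

Requiring the loss rate to stay high over several intervals avoids
undoing a correct raise because of a brief burst of congestion.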

1042	9.  Diagnostic Interface

1044	   All implementations MUST include facilities for MTU discovery
1045	   diagnostic tools that implement PLPMTUD or other MTU discovery
1046	   algorithms in user mode without help or interference by the PMTUD
1047	   algorithm present in the operating system.  This requires a mechanism
1048	   where a diagnostic application can send packets that are larger than
1049	   the operating system's notion of the current Path MTU and for the
1050	   diagnostic application to collect any resulting ICMP PTB messages or
1051	   other ICMP messages.  For IPv4, the diagnostic application must be
1052	   able to set the DF bit.

1054	   At this time nearly all operating systems support two modes for
1055	   sending UDP datagrams: one which silently fragments packets that are
1056	   too large, and another that rejects packets that are too large.
1057	   Neither of these modes is suitable for efficiently diagnosing
1058	   problems with MTU discovery, such as routers that return ICMP PTB
1059	   messages containing incorrect size information.

1061	10.  Specific Packetization Layers

1063	   This section discusses specific implementation details for different
1064	   protocols that can be used as Packetization Layer protocols.  All
1065	   Packetization Layer protocols must consider all of the issues
1066	   discussed in Section 6.  For most protocols it is self evident how to
1067	   address many of these issues.  It is hoped that the protocols
1068	   described here will be sufficient illustration for implementors to
1069	   adapt other protocols.

1071	10.1.  Probing method using TCP

1073	   TCP has no mechanism that could be used to distinguish between real
1074	   application data and some other form of padding that might be used to
1075	   fill out probe packets.  Therefore, TCP must generate probes by
1076	   sending oversized segments that are carrying in-band data.  There are
1077	   two approaches to segmentation from which an implementation may
1078	   choose: overlapping or non-overlapping segments.

1080	   In the non-overlapping method, data is segmented such that the probe
1081	   and any subsequent segments contain no overlapping data.  If the
1082	   probe is lost, the "probe gap" will be a full probe size minus
1083	   headers.  Data in the probe gap will need to be retransmitted with
1084	   multiple smaller segments.

1086	   An alternate approach is to send data following the probe such that
1087	   the probe gap is equal in length to the current MSS.  In the case of
1088	   a successful probe, this has added overhead in that it will send some
1089	   data twice, but it will have to retransmit only one segment after a
1090	   lost probe.  When a probe succeeds, there will likely be some
1091	   duplicate acknowledgments generated due to the duplicate data sent.
1092	   It is important that these duplicate acknowledgments not trigger Fast
1093	   Retransmit.  As such, an implementation using this approach SHOULD
1094	   limit the probe size to three times the current MSS (causing at most
1095	   2 duplicate acknowledgments), or appropriately adjust its duplicate
1096	   acknowledgment threshold for data immediately after a successful
1097	   probe.
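The reasoning behind the three-MSS limit can be illustrated with a rough
count of duplicate acknowledgments.  This sketch assumes that the
segments following the probe are each one MSS long and begin one MSS
after the start of the probe, and that the receiver sends one duplicate
acknowledgment for every segment lying entirely within data it already
holds.  The function names are hypothetical, and sizes are payload
bytes.

```python
def duplicate_acks_after_probe(probe_len, mss):
    """Count following segments that lie wholly inside the probe's data."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return max(0, (probe_len - mss) // mss)


def max_overlapping_probe_len(mss, factor=3):
    """Largest probe payload under the limit of three times the MSS."""
    return factor * mss
```

With an MSS of 1000 bytes, a 3000-byte probe produces two wholly
duplicated segments, which stays below the usual Fast Retransmit
threshold of three duplicate acknowledgments.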

1099	   The choice of which segmentation method to use should be based on
1100	   what is simplest and most efficient for a given TCP implementation.

1102	10.2.  Probing method using SCTP

1104	   In the SCTP protocol [7] the application writes messages to SCTP and
1105	   SCTP "chunkifies" them into smaller pieces suitable for transmission
1106	   through the network.  Once a message has been chunkified, its chunks
1107	   are assigned Transmission Sequence Numbers (TSNs).  Once some TSNs
1108	   have been transmitted, SCTP cannot change the chunk sizes.  SCTP
1109	   multi-path support normally requires SCTP to chunkify its messages
1110	   to fit the smallest PMTU of all paths.  Although not required,
1111	   implementations may bundle multiple data chunks together to make
1112	   larger IP packets to send on paths with a larger PMTU.  Note that
1113	   SCTP must independently probe the PMTU on each path to the peer.

1115	   The recommended method for generating probes is to add a chunk
1116	   consisting only of padding to an SCTP message.  The PAD chunk defined
1117	   in [19] SHOULD be attached to a minimum length HEARTBEAT chunk to
1118	   build a probe packet.  This method is fully compatible with all
1119	   current SCTP implementations.

1121	   SCTP MAY also probe with a method similar to TCP's described above,
1122	   using inline data.  Using such a method has the advantage that
1123	   successful probes have no additional overhead; however, failed probes
1124	   will require retransmission of data, which may significantly impact
1125	   flow performance.

1127	10.3.  Probing method using IP fragmentation

1129	   As mentioned in Section 7, datagram protocols (such as UDP) might
1130	   rely on IP fragmentation as a Packetization Layer.  However,
1131	   implementing PLPMTUD with IP fragmentation is problematic because the
1132	   IP layer has no mechanism to determine if the packets are ultimately
1133	   delivered properly to the far node, without participation by the
1134	   application.

1136	   To support IP fragmentation as a Packetization Layer under an
1137	   unmodified application, we propose the use of an adjunct MTU
1138	   measurement protocol (ICMP ECHO) and a separate Path MTU Discovery
1139	   daemon (described here) to perform PLPMTUD and update the stored Path
1140	   MTU information.

1142	   For IP fragmentation the initial MTU should be selected as described
1143	   in Section 8.2, except with a separate global control for the
1144	   default initial MTU for connectionless protocols.  Since
1145	   connectionless protocols may not keep enough state to effectively
1146	   diagnose MTU black holes, it would be more robust to err on the side
1147	   of using too small an initial MTU (e.g., 1 kByte or less) prior to
1148	   initiating probing of the path to measure the MTU.

1150	   Since many protocols that rely on IP fragmentation are
1151	   connectionless, there is an additional problem with the path
1152	   information cache: there are no events corresponding to connection
1153	   establishment and tear-down to use to manage the cache itself.  If
1154	   there is no entry in the path information cache for a particular
1155	   packet being transmitted, it uses an immutable cache entry for the
1156	   "default path", which has an MTU that is fixed at the initial value.

1158	   A new path cache entry is not created until there is an attempt to
1159	   set the MTU.
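
   The cache behavior described above can be sketched as follows.  This
   is only an illustration, not part of the procedure itself: the class
   name PathCache, its method names, and the 1024-byte default are
   assumptions chosen for the example.

```python
class PathCache:
    """Toy path information cache for connectionless protocols.

    Hypothetical sketch: names and the default MTU are illustrative.
    """

    def __init__(self, default_mtu=1024):
        # The "default path" entry is immutable: its MTU stays fixed
        # at the initial value for the lifetime of the cache.
        self._default_mtu = default_mtu
        self._entries = {}

    def get_mtu(self, dest):
        # Lookups never create entries; unknown paths fall back to
        # the default path.
        return self._entries.get(dest, self._default_mtu)

    def set_mtu(self, dest, mtu):
        # A per-path entry is created only when something attempts
        # to set the MTU for that path.
        self._entries[dest] = mtu

    def __len__(self):
        return len(self._entries)
```

   With this arrangement, transmitting to a new destination costs no
   cache state until a measurement for that path is actually stored.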

1161	   The Path MTU Discovery daemon should be triggered as a side effect of
1162	   IP fragmentation.  Once the number of fragmented datagrams via any
1163	   particular path reaches some configurable threshold (e.g., 5
1164	   datagrams), the daemon can start probing the path with ICMP ECHO
1165	   packets.  These probes must use the diagnostic interface described in
1166	   Section 9 and have DF set.  The daemon can implement the PLPMTUD
1167	   probe sequence and search strategy, collect all of the ICMP
1168	   responses, and store results in the path information cache in the IP
1169	   layer.
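
   One possible shape for the trigger is sketched below.  It counts
   fragmented datagrams per path and signals once when the configurable
   threshold is reached.  The class and method names are hypothetical,
   and how the IP layer would notify the daemon is left out.

```python
from collections import defaultdict


class FragmentationTrigger:
    """Hypothetical per-path counter that decides when to start probing."""

    def __init__(self, threshold=5):
        self.threshold = threshold          # e.g., 5 datagrams
        self._counts = defaultdict(int)     # fragmented datagrams per path
        self._triggered = set()             # paths already handed to the daemon

    def on_fragmented_datagram(self, path):
        """Record one fragmented datagram.

        Returns True exactly once per path, when the count reaches the
        threshold and probing of that path should begin.
        """
        if path in self._triggered:
            return False
        self._counts[path] += 1
        if self._counts[path] >= self.threshold:
            self._triggered.add(path)
            return True
        return False
```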

1171	   Alternatively, most of the PLPMTUD state machinery can be implemented
1172	   within the path information cache in the IP layer, which can
1173	   specifically invoke the Path MTU Discovery daemon to perform
1174	   specified measurements on specific paths and report the results back
1175	   to the IP layer.

1177	   Using ICMP ECHO to measure the MTU has a number of potential
1178	   robustness problems.  Note that the most likely failures are due to
1179	   losses unrelated to MTU (e.g., nodes that discriminate on the basis
1180	   of protocol type).  These non-MTU-related losses can prevent PLPMTUD
1181	   from raising the MTU, forcing the Packetization Protocol to use a
1182	   smaller MTU than necessary.  Since these failures are not likely to
1183	   cause interoperability problems, they are relatively benign.

1185	   However, other more serious failure modes do exist, such as
1186	   layer 3 or 4 routers choosing different paths for different protocol
1187	   types or sessions.  In such environments, adjunct protocols may
1188	   experience different MTUs than the primary protocol.  If the adjunct
1189	   protocol has a larger MTU than the primary protocol, PLPMTUD will
1190	   select a non-functional MTU.  This does not seem to be a likely
1191	   situation.

1193	10.4.  Probing method using applications

1195	   The disadvantages of probing with ICMP ECHO can be overcome by
1196	   implementing the Path MTU Discovery daemon within the application
1197	   itself, using the application's own protocol.

1199	   The application must have some suitable method for generating probes.
1200	   The ideal situation is a lightweight echo function that confirms
1201	   message delivery, plus a mechanism for padding the messages out to
1202	   the desired MTU, such that the padding is not echoed.  This
1203	   combination (akin to the SCTP HB plus PAD) is preferred because you
1204	   can send large probes that cause small acknowledgments.  For
1205	   protocols that cannot implement these messages directly there are
1206	   often alternate methods for generating probes.  For example, the
1207	   protocol may have a variable length echo (that measures both the
1208	   forward and return path) or if there is no echo function, there may
1209	   be a way to add padding to regular messages carrying real application
1210	   data.  There may also be other ways to generate probes.  As a last
1211	   resort, it may be feasible to extend the protocol with new message
1212	   types to support MTU discovery.
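
   The preferred combination (an echo that confirms delivery, plus
   padding that is not echoed) might look like the following sketch.
   The message format is invented for this example: a 1-byte type and a
   4-byte nonce, followed by zero padding in probes only.  The sizes
   here are application payload sizes; the transport and IP header
   lengths would have to be added to obtain the MTU being probed.

```python
import struct

PROBE, ACK = 1, 2                 # hypothetical message types
HEADER = struct.Struct("!BI")     # type (1 byte), nonce (4 bytes)


def build_probe(nonce, total_size):
    """Return a probe message padded with zeros to total_size bytes."""
    if total_size < HEADER.size:
        raise ValueError("probe smaller than header")
    return HEADER.pack(PROBE, nonce) + bytes(total_size - HEADER.size)


def build_ack(probe):
    """Return a small acknowledgment that echoes only the nonce."""
    msg_type, nonce = HEADER.unpack_from(probe)
    if msg_type != PROBE:
        raise ValueError("not a probe")
    return HEADER.pack(ACK, nonce)
```

   Because the acknowledgment carries only the header, a large probe
   exercises the forward path without also requiring the return path to
   carry a packet of the same size.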

1214	   Probing within an application introduces one new issue: many
1215	   applications do not currently concern themselves with MTU and rely on
1216	   IP fragmentation to deliver datagrams that just happen to be larger
1217	   than the Path MTU.  PLPMTUD requires that the protocol be able to
1218	   send probes that are larger than the IP layer's current notion of the
1219	   Path MTU, but are marked not to be fragmented.  This requires an
1220	   alternate method for sending these datagrams.

1222	   As with ICMP MTU probing, there is considerable flexibility in how
1223	   the PLPMTUD algorithms can be divided between the Application and the
1224	   path information cache.

1226	   Some applications send large datagrams no matter what the link size,
1227	   and rely on IP fragmentation to deliver the datagrams.  It has been
1228	   known for a long time that this has some undesirable consequences
1229	   [16].  More recently it has come to light that IPv4 fragmentation is
1230	   not sufficiently robust for general use in today's Internet.  The 16-
1231	   bit IP identification field is not large enough to prevent frequent
1232	   misassociated IP fragments and the TCP and UDP checksums are
1233	   insufficient to prevent the resulting corrupted data from being
1234	   delivered to higher protocol layers [18].
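
   As a rough worked example of why a 16-bit identification field is
   too small, the sketch below estimates how long a sender takes to
   cycle through all 65536 values.  It assumes, for illustration only,
   that identification values are used sequentially for a single
   source, destination, and protocol, and that the sender transmits
   fixed-size packets at the full link rate.

```python
def ip_id_wrap_seconds(link_bps, packet_bytes):
    """Seconds until a 16-bit IP identification counter wraps.

    Assumes sequential IDs and back-to-back packets of packet_bytes
    each on a link of link_bps bits per second.
    """
    packets_per_second = link_bps / (packet_bytes * 8)
    return 65536 / packets_per_second


# 1 Gb/s with 1500-byte packets: roughly 0.79 seconds.
wrap = ip_id_wrap_seconds(1e9, 1500)
```

   Under these assumptions the identification space is reused in less
   than a second, which is well under reassembly timeouts on the order
   of tens of seconds, so stale fragments can still be waiting when an
   identification value is reused.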

1236	11.  References

1238	11.1.  Normative references

1240	   [1]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

1242	   [2]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
1243	        November 1990.

1245	   [3]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery for
1246	        IP version 6", RFC 1981, August 1996.

1248	   [4]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
1249	        Levels", BCP 14, RFC 2119, March 1997.

1251	   [5]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
1252	        Specification", RFC 2460, December 1998.

1254	   [6]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
1255	        September 1981.

1257	   [7]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
1258	        H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson,
1259	        "Stream Control Transmission Protocol", RFC 2960, October 2000.

1261	   [8]  Braden, R., "Requirements for Internet Hosts - Communication
1262	        Layers", STD 3, RFC 1122, October 1989.

1264	11.2.  Informative references

1266	   [9]   Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809,
1267	         June 1995.

1269	   [10]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923,
1270	         September 2000.

1272	   [11]  Kent, S. and R. Atkinson, "Security Architecture for the
1273	         Internet Protocol", RFC 2401, November 1998.

1275	   [12]  Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914,
1276	         September 2000.

1278	   [13]  Narten, T., Nordmark, E., and W. Simpson, "Neighbor Discovery
1279	         for IP Version 6 (IPv6)", RFC 2461, December 1998.

1281	   [14]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A Conservative
1282	         Selective Acknowledgment (SACK)-based Loss Recovery Algorithm
1283	         for TCP", RFC 3517, April 2003.

1285	   [15]  Nagle, J., "Congestion control in IP/TCP internetworks",
1286	         RFC 896, January 1984.

1288	   [16]  Kent, C. and J. Mogul, "Fragmentation considered harmful",
1289	         Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

1291	   [17]  Mahdavi, J. and S. Floyd, "TCP-Friendly Unicast Rate-Based Flow
1292	         Control", Technical note sent to the end2end-interest mailing
1293	         list, January 1997,
1294	         <http://www.psc.edu/networking/papers/tcp_friendly.html>.

1296	   [18]  Mathis, M., "Fragmentation Considered Very Harmful",
1297	         draft-mathis-frag-harmful-00 (work in progress), July 2004.

1299	   [19]  Tuexen, M. and R. Stewart, "Padding Chunk and Parameter for
1300	         SCTP", draft-tuexen-tsvwg-sctp-padding-00 (work in progress),
1301	         February 2006.

1303	Appendix A.  Security Considerations

1305	   Under all conditions the PLPMTUD procedure described in this document
1306	   is at least as secure as the current standard Path MTU Discovery
1307	   procedures described in RFC 1191 and RFC 1981.

1309	   Since this algorithm is designed for robust operation without any
1310	   ICMP (or other messages from the network), PLPMTUD could be
1311	   configured to ignore all ICMP messages (globally or on a per
1312	   application basis).  In this configuration, it cannot be attacked
1313	   unless the attacker can identify and selectively cause probe packets
1314	   to be lost.

1316	Appendix B.  IANA Considerations

1318	   None.

1320	Appendix C.  Acknowledgements

1322	   Many ideas and even some of the text come directly from RFC 1191 and
1323	   RFC 1981.

1325	   Many people made significant contributions to this document,
1326	   including: Randall Stewart for SCTP text, Michael Richardson for
1327	   material from an earlier ID on tunnels that ignore DF, Stanislav
1328	   Shalunov for the idea that pure PLPMTUD parallels congestion control,
1329	   and Matt Zekauskas for maintaining focus during the meetings.  Thanks
1330	   to the early implementors: Kevin Lahey, John Heffner and Rao Shoaib
1331	   who provided concrete feedback on weaknesses in earlier drafts.
1332	   Thanks also to all of the people who made constructive comments in
1333	   the working group meetings and on the mailing list.  I am sure I have
1334	   missed many deserving people.

1336	   Matt Mathis and John Heffner are supported in this work by a grant
1337	   from Cisco Systems, Inc.

1339	Authors' Addresses

1341	   Matt Mathis
1342	   Pittsburgh Supercomputing Center
1343	   4400 Fifth Avenue
1344	   Pittsburgh, PA  15213
1345	   US

1347	   Phone: 412-268-3319
1348	   Email: mathis@psc.edu

1350	   John W. Heffner
1351	   Pittsburgh Supercomputing Center
1352	   4400 Fifth Avenue
1353	   Pittsburgh, PA  15213
1354	   US

1356	   Phone: 412-268-2329
1357	   Email: jheffner@psc.edu

1359	Intellectual Property Statement

1361	   The IETF takes no position regarding the validity or scope of any
1362	   Intellectual Property Rights or other rights that might be claimed to
1363	   pertain to the implementation or use of the technology described in
1364	   this document or the extent to which any license under such rights
1365	   might or might not be available; nor does it represent that it has
1366	   made any independent effort to identify any such rights.  Information
1367	   on the procedures with respect to rights in RFC documents can be
1368	   found in BCP 78 and BCP 79.

1370	   Copies of IPR disclosures made to the IETF Secretariat and any
1371	   assurances of licenses to be made available, or the result of an
1372	   attempt made to obtain a general license or permission for the use of
1373	   such proprietary rights by implementers or users of this
1374	   specification can be obtained from the IETF on-line IPR repository at
1375	   http://www.ietf.org/ipr.

1377	   The IETF invites any interested party to bring to its attention any
1378	   copyrights, patents or patent applications, or other proprietary
1379	   rights that may cover technology that may be required to implement
1380	   this standard.  Please address the information to the IETF at
1381	   ietf-ipr@ietf.org.

1383	Disclaimer of Validity

1385	   This document and the information contained herein are provided on an
1386	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1387	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1388	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1389	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1390	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1391	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1393	Copyright Statement

1395	   Copyright (C) The Internet Society (2006).  This document is subject
1396	   to the rights, licenses and restrictions contained in BCP 78, and
1397	   except as set forth therein, the authors retain all their rights.

1399	Acknowledgment

1401	   Funding for the RFC Editor function is currently provided by the
1402	   Internet Society.