idnits 2.17.1 

draft-mathis-pmtud-method-00.txt:
-(846): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(867): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(899): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  == There are 4 instances of lines with non-ascii characters in the document.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 20
     longer pages, the longest (page 2) being 60 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 21 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 9 instances of too long lines in the document, the longest one
     being 6 characters in excess of 72.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 233: '...l Internet nodes SHOULD implement Path...'
     RFC 2119 keyword, line 242: '...   Links MUST not deliver packets that...'
     RFC 2119 keyword, line 244: '...clock stability) MUST include explicit...'
     RFC 2119 keyword, line 249: '...   if any, SHOULD send a Packet Too Bi...'
     RFC 2119 keyword, line 255: '...t the connection MUST have at least th...'
     (7 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 867 has weird spacing: '...imed to  per...'

  == Line 899 has weird spacing: '...for the  purpo...'

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     Links MUST not deliver packets that are larger than their true MTU.
     Links that have parametric limitations (e.g. MTU bounds due to limited
     clock stability) MUST include explicit mechanisms to consistently reject
     packets that might otherwise be nondeterministically delivered.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 21, 2003) is 7614 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'IPv4-SPEC' is mentioned on line 240, but not defined

  == Missing Reference: 'IPv6-SPEC' is mentioned on line 677, but not defined

  == Missing Reference: 'FRAG' is mentioned on line 350, but not defined

  == Missing Reference: 'ND' is mentioned on line 411, but not defined

  == Missing Reference: 'CONG' is mentioned on line 709, but not defined

  == Missing Reference: 'ISOTP' is mentioned on line 732, but not defined

  == Missing Reference: 'RPC' is mentioned on line 742, but not defined

  == Unused Reference: 'RFC1191' is defined on line 802, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1435' is defined on line 806, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1981' is defined on line 810, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2923' is defined on line 814, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1063' is defined on line 819, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1626' is defined on line 823, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1791' is defined on line 827, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 1063 (ref. 'RFC1191') (Obsoleted by
     RFC 1191)

  ** Downref: Normative reference to an Informational RFC: RFC 1435

  ** Obsolete normative reference: RFC 1981 (Obsoleted by RFC 8201)

  ** Downref: Normative reference to an Informational RFC: RFC 2923


     Summary: 8 errors (**), 0 flaws (~~), 21 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet-Draft                                               Matt Mathis
3	                                                            John Heffner
4	                                                                     PSC
5	                                                             Kevin Lahey
6	                                                               Freelance
7	                                                           June 21, 2003

9	                           Path MTU Discovery
10	                     draft-mathis-pmtud-method-00.txt

12	Status of this Memo

14	   This document is an Internet-Draft and is in full conformance with
15	   all provisions of Section 10 of RFC2026.

17	   Internet-Drafts are working documents of the Internet Engineering
18	   Task Force (IETF), its areas, and its working groups. Note that other
19	   groups may also distribute working documents as Internet-Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time. It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	Abstract

34	   [@@ To be rewritten]

36	   This document describes Path MTU Discovery for the Internet.  It is
37	   largely derived from RFC 1191 and RFC 1981, which describe ICMP based
38	   Path MTU Discovery for IP versions 4 and 6, plus a robust new
39	   algorithm.

41	   The general strategy of the new algorithm is to start with a small
42	   MTU and probe upward, testing successively larger MTUs by probing
43	   with single packets.  If the probe is successfully delivered, then
44	   the MTU is raised.  If the probe is lost, it is treated as an MTU
45	   limitation and not as a congestion signal.

47	Table of Contents

49	   TBD

51	1. Introduction

53	   When one Internet node has a large amount of data to send to another
54	   node, the data is transmitted in a series of IP packets.  It is
55	   usually preferable that these packets be of the largest size that can
56	   successfully traverse the path from the source node to the
57	   destination node.  This packet size is referred to as the Path MTU
58	   (PMTU), and it is equal to the minimum link MTU of all the links in a
59	   path.
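   As a minimal sketch of this definition (the function name and the
   example link MTUs are hypothetical, chosen only for illustration),
   the path MTU is simply the smallest link MTU along the path:

```python
def path_mtu(link_mtus):
    """Return the path MTU: the smallest link MTU along the path."""
    if not link_mtus:
        raise ValueError("a path must contain at least one link")
    return min(link_mtus)


# Hypothetical three-link path; the values are illustrative only.
example_links = [1500, 1480, 4352]
print(path_mtu(example_links))
```

   The sender cannot normally observe the individual link MTUs, which
   is why the minimum has to be discovered rather than computed.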

61	   This document describes a path MTU discovery (PMTUD) method based on
62	   the earlier methods described in the standards track documents,
63	   RFC1191 and RFC1981, with the addition of a new algorithm that
64	   searches for the proper MTU by probing with successively larger
65	   packets.  Large sections of this document are taken directly from
66	   RFC1191 and RFC1981.

68	   The methods described in this document apply to IPv4, IPv6, TCP, and
69	   other transport protocols.   This document does not define a
70	   protocol, but rather a method to use features of existing protocols
71	   to discover the path MTU.  It does not require cooperation from the
72	   lower layers (except that they are consistent about what packet sizes
73	   are acceptable) or the far node.  Variants in implementations will
74	   not cause problems with interoperability.

76	   [[As a consequence people are encouraged to start developing
77	   experimental implementations as soon as the requirements section is
78	   stable.   All other sections are recommendations only.]]

80	   For sake of clarity we uniformly prefer TCP and IPv6 terminology.  In
81	   the terminology section we also present the analogous IPv4 terms and
82	   concepts for the IPv6 terminology.  In a few situations we describe
83	   specific details that are different between IPv4 and IPv6.

85	   [[This document still bears markup notes, indicated with square
86	   brackets [] or @@@@ signs.]]

88	2. Terminology

90	   IP          - Either IPv4 [IPv4-SPEC] or IPv6 [IPv6-SPEC].

92	   node        - a device that implements IP.

94	   router      - a node that forwards IP packets not explicitly
95	                 addressed to itself.

97	   host        - any node that is not a router.

99	   upper layer - a protocol layer immediately above IP.  Examples are
100	                 transport protocols such as TCP and UDP, control
101	                 protocols such as ICMP, routing protocols such as OSPF,
102	                 and Internet or lower-layer protocols being "tunneled"
103	                 over (i.e., encapsulated in) IP such as IPX,
104	                 AppleTalk, or IP itself.

106	   link        - a communication facility or medium over which nodes can
107	                 communicate at the link layer, i.e., the layer
108	                 immediately below IPv6.  Examples are Ethernets (simple
109	                 or bridged); PPP links; X.25, Frame Relay, or ATM
110	                 networks; and Internet (or higher) layer "tunnels",
111	                 such as tunnels over IPv4 or IPv6 itself.

113	   interface   - a node's attachment to a link.

115	   address     - an IP-layer identifier for an interface or a set of
116	                 interfaces.

118	   packet      - an IP header plus payload.

120	   MTU         - Maximum Transmission Unit, the size in bytes of the
121	                 largest packet that can be transmitted on a link or
122	                 path.   Note that this could more properly be called
123	                 the IP MTU, to be consistent with how other standards
124	                 organizations use the term.  Beware that the definition
125	                 used in this and other IETF documents is not the same
126	                 as the definition used in other contexts.

128	   link MTU    - the Maximum Transmission Unit, i.e., maximum packet
129	                 size in octets, that can be conveyed in one piece over
130	                 a link.

132	   path        - the set of links traversed by a packet between a source
133	                 node and a destination node

135	   path MTU    - the minimum link MTU of all the links in a path between
136	                 a source node and a destination node.

138	   PMTU        - path MTU

140	   Path MTU Discovery,
141	   PMTUD       - process by which a node learns the PMTU of a path

143	   Packet Too Big message
144	               - An ICMP message reporting that an IP packet is too
145	                 large to forward.  This is the IPv6 term that
146	                 corresponds to the IPv4 "ICMP Can't fragment" message.

148	   flow id     - a combination of a source address and a non-zero
149	                 IPv6 flow label.

151	   L3 MTU      - the maximum available IP payload size, usually over a
152	                 specific path.  This is the maximum layer 3 transmission
153	                 unit (e.g., a TCP message, including all TCP headers and data,
154	                 but not IP or link headers).

156	   segment size- the L3 payload size (from TCP usage).

158	   probe packet- A packet which is being used to test for a larger MTU.

160	   probe size  - The size of a packet being used to probe for a larger MTU.

162	   successful probe
163	               - The probe packet was delivered through the network.

165	   inconclusive probe
166	               - The probe packet was not delivered, but there were other lost
167	                 packets too close to the probe.   By implication the probe
168	                 might have been lost due to something other than MTU, so the
169	                 results are inconclusive.

171	   failed probe
172	               - The probe packet was not delivered and there were no other
173	                 lost packets close to the probe.

175	   probe gap   - The L3 payload data that will need to be retransmitted if the
176	                 probe is not delivered.

178	[[Deprecated terms - these terms should only appear in very specific parts of
179	the document.

181	ICMP

183	Can't fragment messages

185	lower layers

187	@@@ remove as the document matures]]

189	3. Overview

191	   This document describes a technique to dynamically discover the MTU
192	   of a path.  These procedures are applicable to TCP and other
193	   transport- or application-level Packetization protocols which
194	   implement similar features.

196	   The general strategy of the new procedure is to find the proper MTU
197	   by starting a connection using relatively small packets and then
198	   probing with progressively larger packets (containing application
199	   data).  If a probe packet is successfully delivered, then the path
200	   MTU is raised.  The isolated loss of a probe packet (with or without
201	   a Packet Too Big message) is treated as an indication of an MTU
202	   limit, and not as a congestion indicator.

204	   PMTUD can optionally process Packet Too Big messages for faster
205	   convergence in exchange for a slight decrease in robustness.
206	   Processing malicious or erroneous Packet Too Big messages can cause
207	   PMTU discovery to arrive at the incorrect MTU for a path, which is
208	   likely to reduce protocol performance.  The document describes three
209	   options for processing Packet Too Big messages: completely ignore
210	   them, only accept them in response to probes or accept all Packet Too
211	   Big messages (the previous approach).

213	   In addition, PMTUD can be extended with heuristics to use alternate
214	   criteria to select PMTU.  For example, on a path that is so congested
215	   that the fair share window is too small (smaller than 5 kB), TCP may
216	   be better behaved with 512-byte packets than with 1500-byte packets
217	   since with the larger packets the window would be too small to
218	   trigger Fast Retransmit.
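   The arithmetic behind this example can be sketched as follows.
   This is an illustration under simplifying assumptions that are not
   in the text above: "5 kB" is taken as 5000 bytes, header overhead
   is ignored, and Fast Retransmit is assumed to need three duplicate
   acknowledgments, so the window must hold the lost packet plus at
   least three later ones.  All names are hypothetical.

```python
DUPACK_THRESHOLD = 3  # assumed standard Fast Retransmit threshold


def packets_in_window(window_bytes, packet_bytes):
    """Whole packets that fit in the window (headers ignored)."""
    return window_bytes // packet_bytes


def can_trigger_fast_retransmit(window_bytes, packet_bytes):
    # One lost packet plus DUPACK_THRESHOLD later packets, each of
    # which elicits a duplicate acknowledgment.
    needed = DUPACK_THRESHOLD + 1
    return packets_in_window(window_bytes, packet_bytes) >= needed


window = 5000  # "5 kB" taken as 5000 bytes for illustration
for size in (512, 1500):
    print(size, packets_in_window(window, size),
          can_trigger_fast_retransmit(window, size))
```

   Under these assumptions the 1500-byte case leaves only three
   packets in flight, one short of what loss recovery needs.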

220	   Relatively few details of this procedure affect interoperability with
221	   other standards or Internet protocols.  These details are specified
222	   in RFC2119 standards language in the requirements section.  The vast
223	   majority of the implementation details are recommendations based on
224	   experiences with earlier versions of path MTU discovery.  These are
225	   motivated by a desire to maximize robustness in the presence of less
226	   than ideal implementations as they exist in the field.

228	4. Requirements

230	   [This section is written in RFC 2119 standards language MUST/SHOULD,
231	   etc.]

233	   All Internet nodes SHOULD implement Path MTU Discovery in order to
234	   discover and take advantage of the largest MTU supported along the
235	   Internet path.

237	   Nodes not implementing Path MTU Discovery must use a default MTU as
238	   specified by the respective IP protocols.  For IPv6 the default MTU
239	   is 1280 bytes, the minimum link MTU as defined in [IPv6-SPEC].  For
240	   IPv4 it is 576 bytes, as specified in [IPv4-SPEC].

242	   Links MUST NOT deliver packets that are larger than their true MTU.
243	   Links that have parametric limitations (e.g. MTU bounds due to
244	   limited clock stability) MUST include explicit mechanisms to
245	   consistently reject packets that might otherwise be
246	   nondeterministically delivered.

248	   When a packet is too large to traverse a link, the attached router,
249	   if any, SHOULD send a Packet Too Big message (IPv6) or ICMP Can't
250	   fragment message (IPv4 with DF set), as appropriate.

252	   The requirements below only apply to those implementations that
253	   include Path MTU Discovery.

255	   Before a probe can be sent the connection MUST have at least the
256	   candidate MSS worth of pending data and MUST be using the current
257	   MSS, as defined by having received at least one acknowledgment for a
258	   recent non-probe segment at the current MSS.  This implicitly limits
259	   successful probes to once per two round trips.  [Making the algorithm
260	   robust in the presence of multi-path routing is likely to require an
261	   additional RTT.]  @@@ generalize

263	   Failed and inconclusive probes must be more widely spaced than the
264	   normal Additive Increase Multiplicative Decrease (AIMD) congestion
265	   interval for the current average window size.  This is enforced by
266	   keeping a "probe countdown" which is decremented on each non-probe
267	   segment sent.  Probes MUST NOT be sent before the probe countdown
268	   reaches zero.  @@@ generalize

270	   The candidate MSS MUST be strictly smaller than three times the
271	   current MSS.  Thus the probe segment fully covers at most one
272	   subsequent segment.  The second subsequent segment is at most
273	   partially covered by the probe segment.  This guarantees that the
274	   segments following the probe segment will cause at most one
275	   superfluous duplicate acknowledgment.  @@@ generalize
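   The probe preconditions in the preceding three paragraphs can be
   gathered into a single check.  The sketch below is one possible
   reading, not a definitive implementation; the ProbeState fields and
   the function name are hypothetical, and how a stack decides that an
   acknowledged segment was "recent" is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    current_mss: int
    pending_bytes: int          # unsent data queued on the connection
    acked_at_current_mss: bool  # ACK seen for a recent non-probe segment
    probe_countdown: int        # non-probe segments left before probing


def can_send_probe(state, candidate_mss):
    """Return True only if every stated precondition holds."""
    if candidate_mss <= state.current_mss:
        return False  # a probe tests a larger size
    if candidate_mss >= 3 * state.current_mss:
        return False  # must be strictly smaller than 3x current MSS
    if state.pending_bytes < candidate_mss:
        return False  # need a candidate MSS worth of pending data
    if not state.acked_at_current_mss:
        return False  # connection must be using the current MSS
    return state.probe_countdown == 0
```

   For example, with a current MSS of 1024 a candidate of 2048 is
   eligible, while 3072 is rejected by the three-times bound.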

277	   The TCP MUST be using Fast-Retransmit and SACK or new Reno, such that
278	   isolated lost segments will normally be retransmitted without the
279	   spurious retransmission of any additional segments.

281	   During the probe, all of the normal retransmission, recovery and
282	   congestion control machinery is in effect except when just the probe
283	   gap is retransmitted (and no other segments) the normal
284	   multiplicative cwnd reduction is suppressed.  If any other segments
285	   are retransmitted, all normal cwnd reductions MUST take place.

287	   If the probe was successful, the current MSS is updated to the
288	   candidate MSS.  If cwnd and other congestion state variables are kept
289	   in packets, they MUST be rescaled by the change in MSS, to preserve
290	   the current window size in bytes.  @@@ generalize
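   A minimal sketch of the rescaling, assuming congestion state is
   kept as a whole number of packets.  Rounding down and the floor of
   one packet are assumptions of this sketch, and the function name is
   hypothetical:

```python
def rescale_packet_count(count_packets, old_mss, new_mss):
    """Rescale a packet-denominated variable to keep its byte size."""
    window_bytes = count_packets * old_mss
    # Round down, but never below one packet (assumption).
    return max(1, window_bytes // new_mss)


# A cwnd of 40 packets at MSS 1024 is 40960 bytes, which is
# 20 packets at MSS 2048.
print(rescale_packet_count(40, 1024, 2048))
```

   The same function would be applied to every congestion variable
   kept in packets, for example both cwnd and ssthresh.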

292	5. Implementation Issues
293	   This section discusses a number of issues related to the
294	   implementation of Path MTU Discovery.  This is not a specification,
295	   but rather a set of notes provided as an aid for implementers.

297	   The issues include:

299	   - What layer or layers implement Path MTU Discovery?

301	   - Accounting for headers

303	   - How is the PMTU information cached?

305	   - How are ICMP messages processed?

307	   - How is stale PMTU information removed?

309	   - How to implement PMTUD with TCP?

311	   - What should other transport and higher layers do?

313	   - What should tunnels above IP do?

315	5.1. Layering

317	   In the IP architecture, the choice of what size packet to send is
318	   made by a protocol at a layer above IP.  This memo refers to such a
319	   protocol as a "packetization protocol".  Packetization protocols are
320	   usually transport protocols (for example, TCP) but can also be
321	   higher-layer protocols (for example, protocols built on top of UDP).

323	   Implementing Path MTU Discovery in the packetization layers
324	   simplifies some of the inter-layer issues, but has several drawbacks:
325	   the implementation may have to be redone for each packetization
326	   protocol, it becomes hard to share PMTU information between different
327	   packetization layers, and the connection-oriented state maintained by
328	   some packetization layers may not easily extend to save PMTU
329	   information for long periods.

331	   It is therefore suggested that the IP layer store PMTU information
332	   and that the ICMP layer process received Packet Too Big messages.
333	   The packetization layers may respond to changes in the PMTU, by
334	   changing the size of the messages they send.  To support this
335	   layering, packetization layers require a way to learn of changes in
336	   the value of MMS_S, the "maximum send transport-message size".  The
337	   MMS_S is derived from the Path MTU by subtracting the size of the
338	   IPv6 header plus space reserved by the IP layer for additional
339	   headers (if any).
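   The MMS_S derivation can be sketched as below.  The 40-byte
   constant is the fixed IPv6 header size; the function and parameter
   names are hypothetical, and the 8 bytes in the second example stand
   for a reserved IPv6 Fragment header.

```python
IPV6_HEADER_BYTES = 40  # fixed size of the IPv6 header


def mms_s(path_mtu, reserved_header_bytes=0):
    """Maximum send transport-message size for a given path MTU."""
    return path_mtu - IPV6_HEADER_BYTES - reserved_header_bytes


print(mms_s(1280))                           # minimum IPv6 link MTU
print(mms_s(1500, reserved_header_bytes=8))  # room kept for one header
```

   A packetization layer would re-read this value whenever it is told
   that the PMTU for its path has changed.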

341	   It is possible that a packetization layer, perhaps a UDP application
342	   outside the kernel, is unable to change the size of messages it
343	   sends.  This may result in a packet size that exceeds the Path MTU.

345	   To accommodate such situations, IPv6 defines a mechanism that allows
346	   large payloads to be divided into fragments, with each fragment sent
347	   in a separate packet (see [IPv6-SPEC] section "Fragment Header").
348	   However, packetization layers are encouraged to avoid sending
349	   messages that will require fragmentation (for the case against
350	   fragmentation, see [FRAG]).

352	   To accommodate such situations, it is recommended that IPv4 use a
353	   mechanism that parallels the IPv6 mechanism and only fragment in the
354	   end systems.  Also set DF on the fragments.  @@@more

356	5.2. Accounting for headers

358	   [[@@@To be written

360	   IP MTU is the payload size of the lower layer (should be "lower layer
361	   MTU minus link headers", but this is a different use of "MTU").  @@@
362	   more, clarify

364	   L3 MTU is IP MTU minus IP headers @@@ more

366	   MSS is L3 MTU minus TCP headers @@@ more

368	   This document does not take a position on the placement of IPsec,
369	   which logically sits at the boundary between IP and TCP or another
370	   packetization layer.  IPsec can be treated either as part of IP or as
371	   part of the packetization layer, as long as the accounting is
372	   consistent within any given implementation.

374	   If IPsec is treated as part of the IP layer, then each security
375	   association that contributes a different length security header may
376	   need to be treated as a separate path.  If IPsec is treated as part
377	   of the packetization layer, then the MSS to L3 MTU calculation must
378	   include the IPsec header size.

380	   ]]
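   The accounting chain noted in this section can be illustrated with
   a short sketch.  Because the section is still to be written, this
   is only one possible reading; the function names, the ipsec_bytes
   parameter, and the example header sizes (20-byte IPv4 and TCP
   headers without options, 24 bytes of IPsec overhead) are
   assumptions made for illustration.

```python
def l3_mtu(ip_mtu, ip_header_bytes):
    """L3 MTU: the IP MTU minus the IP headers."""
    return ip_mtu - ip_header_bytes


def mss(l3_mtu_bytes, tcp_header_bytes, ipsec_bytes=0):
    """MSS: the L3 MTU minus the TCP headers.

    ipsec_bytes is non-zero only when IPsec is accounted for in the
    packetization layer rather than in the IP layer.
    """
    return l3_mtu_bytes - tcp_header_bytes - ipsec_bytes


payload = l3_mtu(1500, ip_header_bytes=20)
print(mss(payload, tcp_header_bytes=20))
print(mss(payload, tcp_header_bytes=20, ipsec_bytes=24))
```

   Whichever layer absorbs the IPsec overhead, the point is that it is
   subtracted exactly once.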

382	5.3. Storing PMTU information

384	   Ideally, a PMTU value should be associated with a specific path
385	   traversed by packets exchanged between the source and destination
386	   nodes.  However, in most cases a node will not have enough
387	   information to completely and accurately identify such a path.
388	   Rather, a node must associate a PMTU value with some local
389	   representation of a path.  It is left to the implementation to select
390	   the local representation of a path.

392	   In the case of a multicast destination address, copies of a packet
393	   may traverse many different paths to reach many different nodes.  The
394	   local representation of the "path" to a multicast destination must in
395	   fact represent a potentially large set of paths.

397	   Minimally, an implementation could maintain a single PMTU value to be
398	   used for all packets originated from the node.  This PMTU value would
399	   be the minimum PMTU learned across the set of all paths in use by the
400	   node.  This approach is likely to result in the use of smaller
401	   packets than is necessary for many paths.

403	   An implementation could use the destination address as the local
404	   representation of a path.  The PMTU value associated with a
405	   destination would be the minimum PMTU learned across the set of all
406	   paths in use to that destination.  The set of paths in use to a
407	   particular destination is expected to be small, in many cases
408	   consisting of a single path.  This approach will result in the use of
409	   optimally sized packets on a per-destination basis.  This approach
410	   integrates nicely with the conceptual model of a host as described in
411	   [ND]: a PMTU value could be stored with the corresponding entry in
412	   the destination cache.

414	   If IPv6 flows [IPv6-SPEC] are in use, an implementation could use the
415	   flow id as the local representation of a path.  Packets sent to a
416	   particular destination but belonging to different flows may use
417	   different paths, with the choice of path depending on the flow id.
418	   This approach will result in the use of optimally sized packets on a
419	   per-flow basis, providing finer granularity than PMTU values
420	   maintained on a per-destination basis.
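   The per-destination and per-flow representations can share one
   data structure that differs only in its key.  The sketch below is
   illustrative: the class and method names are hypothetical, the
   addresses come from the IPv6 documentation prefix, and keeping the
   minimum learned value follows the per-destination description
   above.

```python
class PmtuCache:
    """PMTU store keyed by a local representation of a path."""

    def __init__(self, default_pmtu):
        self.default_pmtu = default_pmtu
        self._entries = {}

    def lookup(self, key):
        return self._entries.get(key, self.default_pmtu)

    def learn(self, key, pmtu):
        """Record a learned PMTU, keeping the minimum seen for key."""
        if key in self._entries:
            pmtu = min(pmtu, self._entries[key])
        self._entries[key] = pmtu


cache = PmtuCache(default_pmtu=1280)
# Per-destination key: the destination address.
cache.learn("2001:db8::1", 1500)
cache.learn("2001:db8::1", 1400)
# Per-flow key: (source address, non-zero flow label).
cache.learn(("2001:db8::a", 0x12345), 9000)
print(cache.lookup("2001:db8::1"))
```

   Ageing out stale entries and raising a value after a successful
   probe are deliberately left out of this sketch.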

422	   For source routed packets (i.e. packets containing an IPv6 Routing
423	   header [IPv6-SPEC]), the source route may further qualify the local
424	   representation of a path.  In particular, a packet containing a type
425	   0 Routing header in which all bits in the Strict/Loose Bit Map are
426	   equal to 1 contains a complete path specification.  An implementation
427	   could use source route information in the local representation of a
428	   path.

430	   Note: Some paths may be further distinguished by different security
431	   classifications.  The details of such classifications are beyond the
432	   scope of this memo.    @@@ this should be in scope

434	5.4. Probing method using TCP

436	   A new "candidate MSS" is tested by sending one "probe segment", which
437	   is larger than the current MSS.

439	   After a probe segment has been sent (of size candidate MSS), the
440	   subsequent segment(s) may be sent as though the probe segment was not
441	   oversized.  Thus if the probe segment is lost, it will leave a hole
442	   that is exactly one current MSS.  We refer to this potential hole as
443	   the probe gap.  Note that the length of the probe segment is
444	   determined by the candidate MSS under consideration, but the length
445	   of the probe gap is the current MSS.  [This has been shown to be more
446	   restrictive than necessary.]

448	   The probe is completed when the acknowledgment sequence advances
449	   past the probe gap.  If the probe gap was not retransmitted the probe
450	   was successful.  If the probe gap was retransmitted and there were no
451	   other retransmissions, the candidate MSS failed.  If there were any
452	   other retransmissions the probe was inconclusive.
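   The three outcomes can be expressed as a small classifier.  This
   sketch assumes the sender tracks whether the probe gap was
   retransmitted and how many other segments were retransmitted while
   the probe was outstanding; the names are hypothetical.  Following
   the terminology section, a probe whose gap was not retransmitted is
   treated as successful even if other segments were lost.

```python
from enum import Enum


class ProbeResult(Enum):
    SUCCESSFUL = "successful"
    FAILED = "failed"
    INCONCLUSIVE = "inconclusive"


def classify_probe(gap_retransmitted, other_retransmissions):
    """Classify a completed probe from its retransmission history."""
    if not gap_retransmitted:
        return ProbeResult.SUCCESSFUL
    if other_retransmissions == 0:
        return ProbeResult.FAILED
    # The probe was lost together with other segments, so the loss
    # may not have been caused by the MTU.
    return ProbeResult.INCONCLUSIVE
```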

454	   If the probe was successful, the current MSS is updated to the
455	   candidate MSS.  @@@ add robustness language re: more losses

457	   If the probe failed or was inconclusive the probe countdown is set to
458	   COUNTDOWN_SCALE times the square of the current window size in
459	   packets.
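   A sketch of the probe countdown follows.  The document does not
   give a value for COUNTDOWN_SCALE, so the value below is only a
   placeholder, and the class and method names are hypothetical.

```python
COUNTDOWN_SCALE = 2  # placeholder; no value is given in the text


class ProbeCountdown:
    def __init__(self):
        self.remaining = 0

    def on_probe_not_successful(self, window_packets):
        """Called after a failed or inconclusive probe."""
        self.remaining = COUNTDOWN_SCALE * window_packets ** 2

    def on_non_probe_segment_sent(self):
        if self.remaining > 0:
            self.remaining -= 1

    def probe_allowed(self):
        return self.remaining == 0
```

   Scaling with the square of the window keeps failed probes rarer
   than ordinary congestion losses, whose spacing under AIMD also
   grows roughly with the square of the window.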

461	   If a Packet Too Big message is received, it can be used to compute
462	   an MSS limit by deducting the TCP/IP header sizes (including options)
463	   from the MTU reported in the ICMP message.  If the MSS limit is
464	   between the current MSS and candidate MSS, the current MSS is updated
465	   to the MSS limit; otherwise the message is ignored.  If the
466	   current MSS is updated, then the probe strategy is forced into a
467	   monitor state described below.  @@@ update
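   The handling of a Packet Too Big message during a probe can be
   sketched as below.  The text does not say whether "between" is
   inclusive; this sketch assumes the limit may equal the current MSS
   but must be smaller than the candidate MSS.  Function and parameter
   names are hypothetical.

```python
def mss_limit(reported_mtu, ip_header_bytes, tcp_header_bytes):
    """MSS limit implied by the MTU reported in the ICMP message."""
    return reported_mtu - ip_header_bytes - tcp_header_bytes


def apply_packet_too_big(current_mss, candidate_mss, reported_mtu,
                         ip_header_bytes, tcp_header_bytes):
    """Return (new_current_mss, enter_monitor_state)."""
    limit = mss_limit(reported_mtu, ip_header_bytes, tcp_header_bytes)
    # Assumed bounds: current_mss <= limit < candidate_mss.
    if current_mss <= limit < candidate_mss:
        return limit, True
    return current_mss, False  # message ignored
```

   For example, while probing from 1024 to 2048 over IPv6 with a
   20-byte TCP header, a reported MTU of 1500 yields an MSS of 1440,
   whereas a reported MTU of 576 falls below the current MSS and is
   ignored.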

469	5.5. Probing method using SCTP

471	   @@@@ to be written

473	5.6. General probing methods

475	   @@@@ to be written

477	5.7. Probe strategy

479	   The probe strategy described here is a recommended baseline
480	   algorithm.  It is not presented in formal standards language because
481	   the probe strategy can include heuristics to help select an optimal
482	   MSS for a given path.  As a consequence there is opportunity for
483	   future improvements to this algorithm.

485	   The probing strategy has three major states: search, monitor and
486	   suspend.  During the search state, it sequentially searches for the
487	   largest MSS that the path can support.  Once the path MSS has been
488	   discovered, the probing algorithm enters the monitor state where it
489	   probes infrequently to detect if the path MSS has become larger.

491	   If the MSS probing persistently fails it may be desirable to suspend
492	   path MSS probing and heuristically select one of the common default
493	   MSSs: 576, 1280, or 1500 Bytes.

495	   5.7.1. Search

497	   The recommended search strategy is a multi-phase scan: First, a
498	   coarse scan for the approximate path MSS using factor of 2 steps
499	   starting at 1024 Bytes until a probe fails, followed by successively
500	   finer scans between the largest previously successful and
501	   unsuccessful probes.

503	          Table 1: Recommended MSS scanning sequence
504	          (Coarse scan down column 1, fine scan across each row)
505	          512, [Use only after repeated timeouts]
506	          1024,  1492, 2002
507	          2048
508	          4096, 4352
509	          8192, 9000
510	          16384, 17914
511	          32768
512	          64512
513	          ((Additional values needed))
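   One way to read Table 1 is sketched below: scan down the first
   column until a probe fails, then scan across the row of the largest
   value that succeeded.  This is an interpretation, not a definitive
   algorithm.  The probe_ok callback is hypothetical and stands for
   sending one probe and reporting whether it succeeded; the 512 row
   is omitted because the table reserves it for repeated timeouts.

```python
SCAN_TABLE = [
    [1024, 1492, 2002],
    [2048],
    [4096, 4352],
    [8192, 9000],
    [16384, 17914],
    [32768],
    [64512],
]


def search_mss(probe_ok, table=SCAN_TABLE):
    """Return the largest MSS whose probe succeeded, or None."""
    best, best_row = None, None
    # Coarse scan: down column 1 until a probe fails.
    for row in table:
        if not probe_ok(row[0]):
            break
        best, best_row = row[0], row
    if best is None:
        return None
    # Fine scan: across the row of the last successful coarse value.
    for size in best_row[1:]:
        if not probe_ok(size):
            break
        best = size
    return best


# Hypothetical path whose largest usable MSS is 9000 bytes.
print(search_mss(lambda size: size <= 9000))
```

   With the values currently in the table, a path limited to 1460
   bytes would settle at 1024, which illustrates why the table notes
   that additional values are needed.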

515	   During the scan it is recommended that the MSS not be raised if cwnd
516	   is too small as determined by a heuristic.  The recommended heuristic
517	   is that the MSS is only raised when the cwnd is larger than 20
518	   segments.

520	   5.7.2. Monitor

522	   Once the scan has found an appropriate MSS, the probe strategy enters
523	   the monitor state, where it re-probes the most recent failed MTU
524	   once every MONITOR_INTERVAL seconds.  If the probe fails, it remains
525	   in the monitor state.  If it succeeds, it enters the search state.

527	   If the network becomes too congested during either the scan or the
528	   monitor states it is recommended that the MSS be reduced to a smaller
529	   size as determined by a heuristic.  The recommended heuristic is to
530	   reduce the MSS if ssthresh is reduced to 5 segments or smaller.  The
531	   recommended reduction is to the next smaller major MSS step in table
532	   1.
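   The congestion heuristic can be sketched as follows.  Taking the
   "major" steps to be the first-column values of Table 1 from 1024
   upward, and leaving the MSS unchanged when no smaller step exists,
   are assumptions of this sketch; the names are hypothetical.

```python
MAJOR_STEPS = [1024, 2048, 4096, 8192, 16384, 32768, 64512]
SSTHRESH_LIMIT_SEGMENTS = 5


def reduce_mss_if_congested(current_mss, ssthresh_segments):
    """Step down one major MSS step when ssthresh is 5 or smaller."""
    if ssthresh_segments > SSTHRESH_LIMIT_SEGMENTS:
        return current_mss
    smaller = [step for step in MAJOR_STEPS if step < current_mss]
    return smaller[-1] if smaller else current_mss


print(reduce_mss_if_congested(9000, ssthresh_segments=5))
```

   After a reduction the congestion variables would also need to be
   rescaled if they are kept in packets.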

534	   When there are repeated timeouts (MAX_TIMO or more retransmissions,
535	   without any received ACKs), it is presumed that the connection was
536	   re-routed onto a link with a smaller MSS, and that ICMP messages are
537	   not being delivered.  The MSS probing algorithm is reset by pulling
538	   back the MSS to 1024 Bytes, rescaling the congestion control
539	   variables and reentering the search state.

541	   5.7.3. Suspend

543	   If there is a timeout, and cwnd prior to the timeout was smaller than
544	   6 packets, then the probe strategy can enter the suspend state and
545	   set the MSS to 512 (1280) Bytes.  This has the effect of reducing the
546	   minimum data rate that TCP can stably manage.

548	5.8.  Processing Packet Too Big messages

550	   @@@ Add language re: optional processing

552	   When a Packet Too Big message is received, the node determines which
553	   path the message applies to based on the contents of the Packet Too
554	   Big message.  For example, if the destination address is used as the
555	   local representation of a path, the destination address from the
556	   original packet would be used to determine which path the message
557	   applies to.

559	      Note: if the original packet contained an IPv6 Routing header, the
560	      Routing header should be used to determine the location of the
561	      destination address within the original packet.  If Segments Left
562	      is equal to zero, the destination address is in the Destination
563	      Address field in the IPv6 header.  If Segments Left is greater
564	      than zero, the destination address is the last address
565	      (Address[n]) in the Routing header.
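
      The rule in this note can be written as a short function.  This
      is a sketch only: the function and parameter names are
      hypothetical, and a missing Routing header is represented by
      segments_left=None.

```python
def final_destination(ipv6_dst, segments_left=None, routing_addresses=()):
    """Locate the final destination address of the original packet.

    ipv6_dst          -- Destination Address field of the IPv6 header
    segments_left     -- Segments Left from the Routing header, or None
                         if the packet carried no Routing header
    routing_addresses -- Address[1..n] from the Routing header, in order
    """
    if segments_left is None or segments_left == 0:
        # No Routing header, or it has been fully processed.
        return ipv6_dst
    if not routing_addresses:
        raise ValueError("Segments Left > 0 but no addresses present")
    # Still in transit: the final destination is Address[n].
    return routing_addresses[-1]
```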

567	      If the original packet contained an IPv4 Source Route Option .....
568	      @@@@ write

570	   The node then uses the value in the MTU field in the Packet Too Big
571	   message as a tentative PMTU value, and compares the tentative PMTU to
572	   the existing PMTU.  If the tentative PMTU is less than the existing
573	   PMTU estimate, the tentative PMTU replaces the existing PMTU as the
574	   PMTU value for the path.

576	   The packetization layers must be notified about decreases in the
577	   PMTU.  Any packetization layer instance (for example, a TCP
578	   connection) that is actively using the path must be notified if the
579	   PMTU estimate is decreased.

581	      Note: even if the Packet Too Big message contains an Original
582	      Packet Header that refers to a UDP packet, the TCP layer must be
583	      notified if any of its connections use the given path.

585	   Also, the instance that sent the packet that elicited the Packet Too
586	   Big message should be notified that its packet has been dropped, even
587	   if the PMTU estimate has not changed, so that it may retransmit the
588	   dropped data.
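
   The processing steps above might be sketched as follows.  The class
   and method names are hypothetical and stand in for whatever path and
   packetization layer records an implementation keeps; the sketch shows
   only that the PMTU decrease is reported to every instance on the
   path, while the drop is reported to the sending instance regardless
   of whether the estimate changed.

```python
class Instance:
    """Hypothetical packetization layer instance (e.g. a TCP connection)."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def pmtu_decreased(self, new_pmtu):
        self.events.append(("pmtu_decreased", new_pmtu))

    def packet_dropped(self):
        self.events.append(("packet_dropped",))


class PathState:
    """Hypothetical per-path record holding the current PMTU estimate."""

    def __init__(self, pmtu):
        self.pmtu = pmtu
        self.instances = []  # instances actively using this path


def process_packet_too_big(path, reported_mtu, sender):
    """Apply a Packet Too Big message; return True if the PMTU decreased."""
    decreased = reported_mtu < path.pmtu
    if decreased:
        path.pmtu = reported_mtu
        # Every instance using the path learns of the decrease,
        # not only the one whose packet triggered the message.
        for instance in path.instances:
            instance.pmtu_decreased(reported_mtu)
    if sender is not None:
        # The sender is told about the drop even if the PMTU is unchanged.
        sender.packet_dropped()
    return decreased
```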

590	      Note: An implementation can avoid the use of an asynchronous
591	      notification mechanism for PMTU decreases by postponing
592	      notification until the next attempt to send a packet larger than
593	      the PMTU estimate.  In this approach, when an attempt is made to
594	      SEND a packet that is larger than the PMTU estimate, the SEND
595	      function should fail and return a suitable error indication.  This
596	      approach may be more suitable to a connectionless packetization
597	      layer (such as one using UDP), which (in some implementations) may
598	      be hard to "notify" from the ICMP layer.  In this case, the normal
599	      timeout-based retransmission mechanisms would be used to recover
600	      from the dropped packets.    @@@@ why "SEND"?

602	   It is important to understand that the notification of the
603	   packetization layer instances using the path about the change in the
604	   PMTU is distinct from the notification of a specific instance that a
605	   packet has been dropped.  The latter should be done as soon as
606	   practical (i.e., asynchronously from the point of view of the
607	   packetization layer instance), while the former may be delayed until
608	   a packetization layer instance wants to create a packet.
609	   Retransmission should be done for only those packets that are known
610	   to be dropped, as indicated by a Packet Too Big message.

612	5.9. Purging stale PMTU information

614	   @@@ update

616	   Internetwork topology is dynamic; routes change over time.  While the
617	   local representation of a path may remain constant, the actual
618	   path(s) in use may change.  Thus, PMTU information cached by a node
619	   can become stale.

621	   If the stale PMTU value is too large, this will be discovered almost
622	   immediately once a large enough packet is sent on the path.  No such
623	   mechanism exists for realizing that a stale PMTU value is too small,
624	   so an implementation should "age" cached values.  When a PMTU value
625	   has not been decreased for a while (on the order of 10 minutes), the
626	   PMTU estimate should be set to the MTU of the first-hop link, and the
627	   packetization layers should be notified of the change.  This will
628	   cause the complete Path MTU Discovery process to take place again.

630	      Note: an implementation should provide a means for changing the
631	      timeout duration, including setting it to "infinity".  For
632	      example, nodes attached to an FDDI link which is then attached to
633	      the rest of the Internet via a small MTU serial line are never
634	      going to discover a new non-local PMTU, so they should not have to
635	      put up with dropped packets every 10 minutes.

637	   An upper layer must not retransmit data in response to an increase in
638	   the PMTU estimate, since this increase never comes in response to an
639	   indication of a dropped packet.

641	   One approach to implementing PMTU aging is to associate a timestamp
642	   field with a PMTU value.  This field is initialized to a "reserved"
643	   value, indicating that the PMTU is equal to the MTU of the first hop
644	   link.  Whenever the PMTU is decreased in response to a Packet Too Big
645	   message, the timestamp is set to the current time.

647	   Once a minute, a timer-driven procedure runs through all cached PMTU
648	   values, and for each PMTU whose timestamp is not "reserved" and is
649	   older than the timeout interval:

651	   - The PMTU estimate is set to the MTU of the first hop link.

653	   - The timestamp is set to the "reserved" value.

655	   - Packetization layers using this path are notified of the increase.
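
   A minimal sketch of this aging procedure is shown below.  The names
   are hypothetical; None plays the role of the "reserved" timestamp,
   and the current time and timeout are passed in explicitly so that the
   timeout can be set to float("inf") to disable aging.

```python
RESERVED = None  # timestamp meaning "PMTU equals the first-hop link MTU"


class PmtuEntry:
    def __init__(self, first_hop_mtu):
        self.first_hop_mtu = first_hop_mtu
        self.pmtu = first_hop_mtu
        self.timestamp = RESERVED

    def on_decrease(self, new_pmtu, now):
        """Record a decrease caused by a Packet Too Big message."""
        if new_pmtu < self.pmtu:
            self.pmtu = new_pmtu
            self.timestamp = now


def age_entries(entries, now, timeout, notify):
    """Timer-driven pass over a dict mapping path -> PmtuEntry.

    notify(path, new_pmtu) is called for each entry whose estimate is
    reset to the first-hop link MTU.
    """
    for path, entry in entries.items():
        if entry.timestamp is RESERVED:
            continue
        if now - entry.timestamp > timeout:
            entry.pmtu = entry.first_hop_mtu
            entry.timestamp = RESERVED
            notify(path, entry.pmtu)
```

   An implementation would typically call age_entries once a minute,
   passing a monotonic clock reading as the current time.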

657	5.10. TCP layer actions

659	   The TCP layer must track the PMTU for the path(s) in use by a
660	   connection; it should not send segments that would result in packets
661	   larger than the PMTU except to probe the path MTU.  A simple
662	   implementation could ask the IP layer for this value each time it
663	   created a new segment, but this could be inefficient.  Moreover, TCP
664	   implementations that follow the "slow-start" congestion-avoidance
665	   algorithm [CONG] typically calculate and cache several other values
666	   derived from the PMTU.  It may be simpler to receive asynchronous
667	   notification when the PMTU changes, so that these variables may be
668	   updated.

670	   A TCP implementation must also store the MSS value received from its
671	   peer, and must not send any segment larger than this MSS, regardless
672	   of the PMTU.  In 4.xBSD-derived implementations, this may require
673	   adding an additional field to the TCP state record.

675	   The value sent in the TCP MSS option is independent of the PMTU.
676	   This MSS option value is used by the other end of the connection,
677	   which may be using an unrelated PMTU value.  See [IPv6-SPEC] sections
678	   "Packet Size Issues" and "Maximum Upper-Layer Payload Size" for
679	   information on selecting a value for the TCP MSS option.  When a
680	   Packet Too Big message is received, it implies that a packet was
681	   dropped by the node that sent the ICMP message.  It is sufficient to
682	   treat this as any other dropped segment, and wait until the
683	   retransmission timer expires to cause retransmission of the segment.
684	   If the Path MTU Discovery process requires several steps to find the
685	   PMTU of the full path, this could delay the connection by many round-
686	   trip times.

688	   @@@ Add IPv4 text

690	   [@@@deprecate?  Alternatively, the retransmission could be done in
691	   immediate response to a notification that the Path MTU has changed,
692	   but only for the specific connection specified by the Packet Too Big
693	   message.  The packet size used in the retransmission should be no
694	   larger than the new PMTU. ]

696	      Note: A packetization layer must not retransmit in response to
697	      every Packet Too Big message, since a burst of several oversized
698	      segments will give rise to several such messages and hence several
699	      retransmissions of the same data.  If the new estimated PMTU is
700	      still wrong, the process repeats, and there is an exponential
701	      growth in the number of superfluous segments sent.

703	      This means that the TCP layer must be able to recognize when a
704	      Packet Too Big notification actually decreases the PMTU that it
705	      has already used to send a packet on the given connection, and
706	      should ignore any other notifications.
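
      One way to meet this requirement is to retransmit only when a
      notification lowers the PMTU that the connection is currently
      using, as in the sketch below.  The class and attribute names are
      hypothetical, and a counter stands in for the actual
      retransmission.

```python
class Connection:
    """Hypothetical TCP connection state relevant to PMTU notifications."""

    def __init__(self, pmtu):
        self.pmtu_in_use = pmtu   # PMTU already used to send packets
        self.retransmissions = 0  # stand-in for real retransmission logic

    def on_packet_too_big(self, new_pmtu):
        """Return True if the notification triggered a retransmission."""
        if new_pmtu >= self.pmtu_in_use:
            # Duplicate or stale notification from the same burst: ignore.
            return False
        self.pmtu_in_use = new_pmtu
        self.retransmissions += 1
        return True
```

      With this check, a burst of three oversized segments that elicits
      three identical Packet Too Big messages causes one retransmission
      rather than three.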

708	   Many TCP implementations incorporate "congestion avoidance" and
709	   "slow-start" algorithms to improve performance [CONG].  Unlike a
710	   retransmission caused by a TCP retransmission timeout, a
711	   retransmission caused by a Packet Too Big message should not change
712	   the congestion window.  It should, however, trigger the slow-start
713	   mechanism (i.e., only one segment should be retransmitted until
714	   acknowledgments begin to arrive again).

716	   TCP performance can be reduced if the sender's maximum window size is
717	   not an exact multiple of the segment size in use (this is not the
718	   congestion window size, which is always a multiple of the segment
719	   size).  In many systems (such as those derived from 4.2BSD), the
720	   segment size is often set to 1024 octets, and the maximum window size
721	   (the "send space") is usually a multiple of 1024 octets, so the
722	   proper relationship holds by default.  If Path MTU Discovery is used,
723	   however, the segment size may not be a sub-multiple of the send
724	   space, and it may change during a connection; this means that the TCP
725	   layer may need to change the transmission window size when Path MTU
726	   Discovery changes the PMTU value.  The maximum window size should be
727	   set to the greatest multiple of the segment size that is less than or
728	   equal to the sender's buffer space size.
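
   The window adjustment described in the last sentence is a simple
   rounding operation.  The function name below is hypothetical.

```python
def max_window(buffer_space, segment_size):
    """Greatest multiple of segment_size that is <= buffer_space."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return (buffer_space // segment_size) * segment_size
```

   For example, with 16384 octets of buffer space, a segment size of
   1024 octets leaves the window at 16384, while a segment size of 1460
   octets gives 11 * 1460 = 16060.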

730	5.11.  Issues for other transport protocols

732	   Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
733	   repacketize when doing a retransmission.  That is, once an attempt is
734	   made to transmit a segment of a certain size, the transport cannot
735	   split the contents of the segment into smaller segments for
736	   retransmission.  In such a case, the original segment can be
737	   fragmented by the IP layer during retransmission.  Subsequent
738	   segments, when transmitted for the first time, should be no larger
739	   than allowed by the Path MTU.

741	   The Sun Network File System (NFS) uses a Remote Procedure Call (RPC)
742	   protocol [RPC] that, when used over UDP, in many cases will generate
743	   payloads that must be fragmented even for the first-hop link.  This
744	   might improve performance in certain cases, but it is known to cause
745	   reliability and performance problems, especially when the client and
746	   server are separated by routers.

748	   It is recommended that NFS implementations use Path MTU Discovery
749	   whenever routers are involved.  Most NFS implementations allow the
750	   RPC datagram size to be changed at mount-time (indirectly, by
751	   changing the effective file system block size), but might require
752	   some modification to support changes later on.

754	   Also, since a single NFS operation cannot be split across several UDP
755	   datagrams, certain operations (primarily, those operating on file
756	   names and directories) require a minimum payload size that, if sent
757	   in a single packet, would exceed the PMTU.  NFS implementations
758	   should not reduce the payload size below this threshold, even if Path
759	   MTU Discovery suggests a lower value.  In this case the payload will
760	   be fragmented by the IP layer.

762	5.12.  Issues for tunnels

764	   @@@ to be written

766	5.13.  Diagnostic tools

768	   All implementations MUST include a mechanism to implement diagnostic
769	   tools that do not rely on the operating system's implementation of
770	   path MTU discovery.  This requires a mechanism by which an
771	   application can send oversized packets that are not subjected to the
772	   operating system's notion of the current path MTU, up to the physical
773	   MTU limit supported by the network interface, as well as a mechanism
774	   to collect any Packet Too Big messages.

776	5.14.  Management interface

778	   It is suggested that an implementation provide a way for a system
779	   utility program to:

781	   - Specify that Path MTU Discovery not be done on a given path.

783	   - Change the PMTU value associated with a given path.

785	   - Set global controls on ICMP processing.

787	   - Set per-connection or per-application controls on ICMP processing.

789	   The first of these can be accomplished by associating a flag with the
790	   path; when a packet is sent on a path with this flag set, the IP
791	   layer does not send packets larger than the IPv6 minimum link MTU.

793	   These features might be used to work around an anomalous situation,
794	   or by a routing protocol implementation that is able to obtain Path
795	   MTU values.

797	   The implementation should also provide a way to change the timeout
798	   period for aging stale PMTU information.

800	6. Normative references

802	 [RFC1191]  Path MTU discovery. J.C. Mogul, S.E. Deering. Nov-01-1990.
803	            (Format: TXT=47936 bytes) (Obsoletes RFC1063) (Status: DRAFT
804	            STANDARD)

806	 [RFC1435]  IESG Advice from Experience with Path MTU Discovery. S.
807	            Knowles. March 1993. (Format: TXT=2708 bytes) (Status:
808	            INFORMATIONAL)

810	 [RFC1981]  Path MTU Discovery for IP version 6. J. McCann, S. Deering,
811	            J. Mogul. August 1996. (Format: TXT=34088 bytes) (Status:
812	            PROPOSED STANDARD)

814	 [RFC2923]  TCP Problems with Path MTU Discovery. K. Lahey. September
815	            2000. (Format: TXT=30976 bytes) (Status: INFORMATIONAL)

817	7. Informative references

819	 [RFC1063]  IP MTU discovery options. J.C. Mogul, C.A. Kent, C.
820	            Partridge, K. McCloghrie. Jul-01-1988. (Format: TXT=27121
821	            bytes) (Obsoleted by RFC1191)

823	 [RFC1626]  Default IP MTU for use over ATM AAL5. R. Atkinson. May 1994.
824	            (Format: TXT=11841 bytes) (Obsoleted by RFC2225) (Status:
825	            PROPOSED STANDARD)

827	 [RFC1791]  TCP And UDP Over IPX Networks With Fixed Path MTU. T. Sung.
828	            April 1995. (Format: TXT=22347 bytes) (Status: EXPERIMENTAL)

830	8. Security considerations

832	   Since the MTU reported in the ICMP messages is constrained to be
833	   between the old MTU and the candidate MTU, this algorithm is more
834	   difficult to attack through fraudulent ICMP messages.

836	   Furthermore, since this algorithm can function properly without ICMP
837	   messages, that part of the algorithm can be disabled for additional
838	   robustness in hostile environments.

840	9. IANA considerations

842	10. Contributors

844	11. Acknowledgements

846	   Matt Mathis and John Heffner are supported by a grant from Cisco
847	   Systems, Inc.

849	12. Authors' addresses

851	   Please send comments and suggestions to mtu@psc.edu.

853	   Matt Mathis and John Heffner
854	   Pittsburgh Supercomputing Center
855	   4400 Fifth Ave.
856	   Pittsburgh, PA 15213
857	   mathis@psc.edu
858	   jheffner@psc.edu

860	   Kevin Lahey
861	   Freelance
862	   kml@patheticgeek.net

864	13. Intellectual Property

866	   The IETF takes no position regarding the validity or scope of any
867	   intellectual property or other rights that might be claimed to
868	   pertain to the implementation or use of the technology described in
869	   this document or the extent to which any license under such rights
870	   might or might not be available; neither does it represent that it
871	   has made any effort to identify any such rights.  Information on the
872	   IETF's procedures with respect to rights in standards-track and
873	   standards-related documentation can be found in BCP-11.  Copies of
874	   claims of rights made available for publication and any assurances of
875	   licenses to be made available, or the result of an attempt made to
876	   obtain a general license or permission for the use of such
877	   proprietary rights by implementers or users of this specification can
878	   be obtained from the IETF Secretariat.

880	   The IETF invites any interested party to bring to its attention any
881	   copyrights, patents or patent applications, or other proprietary
882	   rights which may cover technology that may be required to practice
883	   this standard.  Please address the information to the IETF Executive
884	   Director.

886	14. Full copyright statement

888	   Copyright (C) The Internet Society June 21, 2003. All Rights
889	   Reserved.

891	   This document and translations of it may be copied and furnished to
892	   others, and derivative works that comment on or otherwise explain it
893	   or assist in its implementation may be prepared, copied, published
894	   and distributed, in whole or in part, without restriction of any
895	   kind, provided that the above copyright notice and this paragraph are
896	   included on all such copies and derivative works.  However, this
897	   document itself may not be modified in any way, such as by removing
898	   the copyright notice or references to the Internet Society or other
899	   Internet organizations, except as needed for the purpose of
900	   developing Internet standards in which case the procedures for
901	   copyrights defined in the Internet Standards process must be
902	   followed, or as required to translate it into languages other than
903	   English.

904	   The limited permissions granted above are perpetual and will not be
905	   revoked by the Internet Society or its successors or assigns.