Network Working Group                                          M. Mathis
Internet-Draft                                                J. Heffner
Expires: April 24, 2005                                              PSC
                                                                K. Lahey
                                                               Freelance
                                                        October 24, 2004

                           Path MTU Discovery
                       draft-ietf-pmtud-method-03

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of section 3 of RFC 3667. By submitting this Internet-Draft, each
   author represents that any applicable patent or other IPR claims of
   which he or she is aware have been or will be disclosed, and any of
   which he or she becomes aware will be disclosed, in accordance with
   RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 24, 2005.

Copyright Notice

   Copyright (C) The Internet Society (2004).

Abstract

   This document describes a robust method for Path MTU Discovery that
   relies on TCP or some other Packetization Layer to probe an Internet
   path with progressively larger packets. This method is described as
   an extension to RFC 1191 and RFC 1981, which specify ICMP-based Path
   MTU Discovery for IP versions 4 and 6, respectively.

   The general strategy of the new algorithm is to start with a small
   MTU and search upward, testing successively larger MTUs by probing
   with single packets. If the probe is successfully delivered and
   satisfies a subsequent verification phase then the MTU is raised.
   If the probe is lost, it is treated as an MTU limitation and not as
   a congestion signal.

   There are several options for integrating PLPMTUD with classical path
   MTU discovery. PLPMTUD can be minimally configured to perform ICMP
   black hole recovery to increase the robustness of classical path MTU
   discovery, or ICMP processing can be completely disabled, and PLPMTUD
   can completely replace classical path MTU discovery.

   In the latter configuration, PLPMTUD exactly parallels congestion
   control. An end-to-end transport protocol adjusts non-protocol
   properties of the data stream (window size or packet size) while
   using packet losses to deduce the appropriateness of the adjustments.
   This technique seems to be more philosophically consistent with the
   end-to-end principle than relying on ICMP messages containing
   transcribed headers of multiple protocol layers.

Table of Contents

   1. Introduction
      1.1 Revision History
         1.1.1 Changes since version -02, July 19th 2004 (IETF 60)
   2. Overview
   3. Terminology
   4. Requirements
   5. Layering
      5.1 Accounting for Header Sizes
      5.2 Storing PMTU information
      5.3 Accounting for IPsec
      5.4 Measuring path MTU
   6. The Probing Sequence and Lower Layers
      6.1 Normal sequence of events to raise the MTU
      6.2 Processing MTU Indications
         6.2.1 Processing ICMP PTB messages
         6.2.2 Packetization Layer Detects Lost Packets
         6.2.3 Packetization Layer Retransmission Timeout
         6.2.4 Packetization Layer Full Stop Timeout
      6.3 Probing Intervals
      6.4 Host fragmentation
      6.5 Multicast
   7. Common Packetization Properties
      7.1 Mechanism to detect loss
      7.2 Generating Probes
      7.3 Mechanism to support provisional MTUs
      7.4 Selecting the initial MPS
      7.5 Common MPS Search Strategy
         7.5.1 Fine Scans
      7.6 Congestion Control and Window Management
   8. Specific Packetization Layers
      8.1 Probing method using TCP
      8.2 Probing method using SCTP
      8.3 Probing method for IP fragmentation
      8.4 Probing method for applications
   9. Operational Integration
      9.1 Interoperation with prior algorithms
      9.2 Operation over subnets with dissimilar MTUs
      9.3 Interoperation with tunnels
      9.4 Diagnostic tools
      9.5 Management interface
   10. References
      10.1 Normative References
      10.2 Informative References
   Authors' Addresses
   A. Security Considerations
   B. IANA considerations
   C. Acknowledgements
   Intellectual Property and Copyright Statements

1. Introduction

   This document describes a method for Packetization Layer Path MTU
   Discovery (PLPMTUD) which is an extension to existing Path MTU
   discovery methods as described in RFC 1191 [2] and RFC 1981 [3]. The
   proper MTU is determined by starting with small packets and probing
   with successively larger packets. The bulk of the algorithm is
   implemented above IP, in the transport layer (e.g. TCP) or other
   "Packetization Protocol" that is responsible for determining packet
   boundaries.

   This document draws heavily on RFC 1191 [2] and RFC 1981 [3] for
   terminology, ideas and some of the text.

   This document describes methods to discover the path MTU using
   features of existing protocols. The methods apply to IPv4 and IPv6,
   and many transport protocols. They do not require cooperation from
   the lower layers (except that they are consistent about what packet
   sizes are acceptable) or the far node. Variants in implementations
   will not cause interoperability problems.

   The methods described in this document are carefully designed to
   maximize robustness in the presence of less than ideal
   implementations of other protocols or Internet components.

   For the sake of clarity we uniformly prefer TCP and IPv6
   terminology. In the terminology section we also present the
   analogous IPv4 terms and concepts for the IPv6 terminology. In a few
   situations we describe specific details that are different between
   IPv4 and IPv6.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [4].

   This draft is a product of the Path MTU Discovery (pmtud) working
   group of the IETF. Please send comments and suggestions to
   pmtud@ietf.org. Interim drafts and other useful information will be
   posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .

1.1 Revision History

   These are all recent substantive changes, in reverse chronological
   order. This section will be removed prior to publication as an RFC.
   Note that there are still some missing details that need to be
   resolved. These are flagged by @@@@. None of the missing details
   are serious.

1.1.1 Changes since version -02, July 19th 2004 (IETF 60)

   Many minor updates throughout the document.

   Added a section describing the interactions between PLPMTUD and
   congestion control.

   Removed a difficult-to-implement requirement for future data to
   transmit.

   Added "IP Fragmentation" and "Application protocol" as Packetization
   Layers.

   Clarified interactions between TCP SACK and MTU.

   Updated SCTP section to reflect new probing method using "PAD
   chunks".

   Distilled the protocol specific material into separate subsections
   for each protocol.

   Added a section on common requirements and functions for all
   Packetization Layers. More accurately characterized the
   "bidirectional" (and other) requirements of the PL protocol. Updated
   the search strategy in this new section.

   Changed "ICMP can't fragment" and "packet too big" to uniformly use
   "ICMP PTB message" everywhere.

   Added Stanislav Shalunov's observation that PLPMTUD parallels
   congestion control.

   Better described the range of interoperability with classical pMTUd
   in the introduction.

   Removed vague language about "not being a protocol" and "excessive
   Loss".

   Slightly redefined flow: the granularity of PLPMTUD within a path.

   Many English NITs and clarifications per Gorry Fairhurst and others.
   Passes strict xml2rfc checking.

   Added a paragraph encouraging interface MTUs that are optimal for
   the NIC, rather than standard for the media.

   Added a revision history section.

2. Overview

   This document describes a method for TCP or other packetization
   protocols to dynamically discover the MTU of a path without relying
   on explicit signals from the network. These procedures are
   applicable to TCP and other transport- or application-level protocols
   that are responsible for choosing packet boundaries (e.g. segment
   sizes) and have an acknowledgement structure that delivers to the
   sender accurate and timely indications of which packets were lost.

   The general strategy of the new procedure is for the packetization
   layer to find the proper path MTU by probing with progressively
   larger packets. A "probe sequence" consists of a single "probe
   packet", which initiates a "probe phase", followed by a "transition
   phase" and a "verification phase".

   If a probe packet is successfully delivered, then the path MTU is
   provisionally raised during the transition phase. If there are no
   additional losses during the subsequent verification phase, then the
   path MTU is confirmed (verified) to be at least as large as the
   provisional MTU. Each probe sequence raises the estimated path MTU
   by one step. A search strategy uses heuristics to select optimal MTU
   steps, until PLPMTUD converges to the correct path MTU.

   The verification phase is used to detect some situations where
   raising the MTU raises the packet loss rate.
   For example, if a link is striped across multiple physical channels
   with inconsistent MTUs, it is possible that a probe will be
   delivered even if it is too large for some of the physical channels.
   In such cases raising the path MTU to the probe size will cause
   severe periodic loss and abysmal performance. The verification phase
   is designed to prevent the path MTU from being raised if doing so
   causes excessive packet losses.

   A conservative implementation of PLPMTUD would use a full round trip
   time for the verification phase. In this case the entire probe
   sequence takes three full round trip times. It takes one round trip
   for the probe phase, during which the probe propagates to the far
   node and an acknowledgment is returned. The second round trip is the
   transition phase, during which data packets using the provisional
   MTU propagate to the far node and are acknowledged. During the third
   and final round trip time, it is verified that raising the MTU did
   not cause any additional losses.

   The isolated loss of a probe packet (with or without an ICMP PTB
   message) is treated as an indication of an MTU limit, and not as a
   congestion indicator. In this case alone, the packetization protocol
   is permitted to retransmit any missing data without adjusting the
   congestion window.

   If there is a timeout or any additional lost packets during any of
   the three phases, the loss is treated as a congestion indication as
   well as an indication of some sort of failure of the PLPMTUD process.
   The congestion indication is treated like any other congestion
   indication: window or rate adjustments are mandatory per the relevant
   congestion control standards [8]. Probing can resume with some new
   probe size after a delay which is determined by the nature of the
   detected failure.

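   The probe sequence described above can be sketched as a small state
   machine. The Python below is an illustration only and is not part of
   the method specified in this document. The names ProbeSequence,
   Phase, Outcome and round_trip are hypothetical, each phase is
   assumed to last exactly one round trip (the conservative case), and
   falling back to the previously verified MTU after any failure is an
   assumption of the sketch.

```python
from enum import Enum, auto


class Phase(Enum):
    PROBE = auto()
    TRANSITION = auto()
    VERIFICATION = auto()
    DONE = auto()


class Outcome(Enum):
    VERIFIED = auto()    # probe size confirmed as the new path MTU
    MTU_LIMIT = auto()   # isolated probe loss: MTU limit, not congestion
    CONGESTION = auto()  # timeout or other loss: normal congestion response


class ProbeSequence:
    """One probe sequence, advanced one round trip at a time (hypothetical)."""

    def __init__(self, verified_mtu, probe_size):
        self.verified_mtu = verified_mtu
        self.probe_size = probe_size
        self.mtu_in_use = verified_mtu
        self.phase = Phase.PROBE
        self.outcome = None

    def _finish(self, outcome):
        # Assumption: whatever the result, continue with the verified MTU.
        self.phase = Phase.DONE
        self.outcome = outcome
        self.mtu_in_use = self.verified_mtu

    def round_trip(self, probe_lost=False, other_loss=False):
        """Report what happened during one round trip; return the outcome
        once the sequence has finished, otherwise None."""
        if self.phase is Phase.DONE:
            raise RuntimeError("probe sequence already finished")
        if other_loss:
            # A timeout or additional loss in any phase is a congestion
            # indication and a failure of the probe sequence.
            self._finish(Outcome.CONGESTION)
        elif self.phase is Phase.PROBE:
            if probe_lost:
                # Isolated probe loss: treated as an MTU limit only.
                self._finish(Outcome.MTU_LIMIT)
            else:
                # Probe acknowledged: raise the MTU provisionally.
                self.mtu_in_use = self.probe_size
                self.phase = Phase.TRANSITION
        elif self.phase is Phase.TRANSITION:
            self.phase = Phase.VERIFICATION
        else:
            # Verification round trip completed without loss.
            self.verified_mtu = self.probe_size
            self._finish(Outcome.VERIFIED)
        return self.outcome

    @property
    def congestion_response_required(self):
        return self.outcome is Outcome.CONGESTION
```

   In this sketch, three loss-free calls to round_trip() confirm the
   probe size, while an isolated probe loss on the first call ends the
   sequence without requiring a congestion response.
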
   The most likely (and least serious) PLPMTUD failure is the link
   experiencing congestion related losses at about the same time as the
   probe. In this case it is appropriate to retry the probe at the same
   probe size as soon as the packetization layer has fully adapted to
   the congestion and recovered from the losses.

   In other cases, additional losses or timeouts indicate problems with
   the link or packetization layer. In these situations it is desirable
   to use longer delays depending on the severity of the error.

   There is a range of options for integrating PLPMTUD with classical
   path MTU discovery. In the most conservative configuration, from a
   deployment point of view, classical path MTU discovery is fully
   functional (all correct ICMP PTB messages are unconditionally
   processed) and PLPMTUD is invoked only to recover from ICMP black
   holes.

   In the most conservative configuration, from a security point of
   view, all ICMP PTB messages are ignored, and PLPMTUD is the sole
   method used to discover the path MTU. This protects against
   malicious or erroneous ICMP PTB messages which might otherwise cause
   MTU discovery to arrive at the incorrect MTU for a path.

   Note that in the latter configuration, PLPMTUD exactly parallels
   congestion control. An end-to-end transport protocol adjusts
   non-protocol properties of the data stream (window size or packet
   size) while using packet losses to deduce the appropriateness of the
   adjustments. This technique seems to be more philosophically
   consistent with the end-to-end principle of the Internet than relying
   on ICMP messages containing transcribed headers of multiple protocol
   layers.

   We advocate a compromise, in which ICMP PTB messages are only
   processed in conjunction with probing (described in section 6.2.1),
   and Packetization Layer timeouts (described in section 6.2.3) and
   ignored in all other situations.

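   The three ways of combining PLPMTUD with ICMP PTB processing
   described above can be summarized as one decision function. This is
   a hypothetical sketch rather than specified behavior: PtbPolicy,
   should_process_ptb and its parameters are names invented for
   illustration, and validation of the PTB message itself is assumed to
   happen elsewhere.

```python
from enum import Enum


class PtbPolicy(Enum):
    PROCESS_ALL = "process_all"  # classical PMTUD fully functional
    IGNORE_ALL = "ignore_all"    # PLPMTUD is the sole discovery method
    COMPROMISE = "compromise"    # the configuration advocated above


def should_process_ptb(policy, probe_outstanding, pl_timeout_pending):
    """Decide whether an (already validated) ICMP PTB message is acted on.

    probe_outstanding: a probe has been sent and is not yet resolved.
    pl_timeout_pending: the Packetization Layer is handling a timeout.
    Both flags are assumptions about what state a stack would expose.
    """
    if policy is PtbPolicy.PROCESS_ALL:
        return True
    if policy is PtbPolicy.IGNORE_ALL:
        return False
    # Compromise: only in conjunction with probing or a PL timeout.
    return probe_outstanding or pl_timeout_pending
```

   Under the compromise policy, a PTB message that arrives while no
   probe is outstanding and no timeout is being handled is ignored.
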
   Most of the difficulty in implementing PLPMTUD arises because it
   needs to be implemented in several different places within a single
   node. In general each packetization protocol needs to have its own
   implementation of PLPMTUD. Furthermore, the natural mechanism to
   share path MTU information between concurrent or subsequent
   connections over the same path is a path information cache in the IP
   layer. The various packetization protocols need to have the means to
   access and update the shared cache in the IP layer. This memo
   describes PLPMTUD in terms of its primary subsystems without fully
   describing how they are assembled into a complete implementation.

   Section 3 provides a complete glossary of terms.

   Relatively few details of PLPMTUD affect interoperability with other
   standards or Internet protocols. These details are specified in
   RFC 2119 standards language in section 4. The vast majority of the
   implementation details described in this document are recommendations
   based on experiences with earlier versions of path MTU discovery.
   These recommendations are motivated by a desire to maximize
   robustness of PLPMTUD in the presence of less than ideal
   implementations as they exist in the field.

   Section 5 describes how to partition PLPMTUD into layers, and how to
   manage the "path information cache" in the IP layer.

   Section 6 describes the details of a probe sequence, including how
   to process MTU and error indications, necessary to raise the MTU by
   one step.

   Section 7 describes the general search strategy and Packetization
   Layer features needed to implement PLPMTUD.

   Section 8 discusses specific implementation details for a couple of
   specific protocols, such as TCP.

   Section 9 describes ways to minimize deployment problems for
   PLPMTUD, by including a number of good management features.
   It also addresses some potentially serious interactions with nodes
   that do not honor the IPv4 DF bit.

3. Terminology

   We use the following terms in this document:

   IP Either IPv4 [1] or IPv6 [7].

   node A device that implements IP.

   router A node that forwards IP packets not explicitly addressed to
      itself.

   host Any node that is not a router.

   upper layer A protocol layer immediately above IP. Examples are
      transport protocols such as TCP and UDP, control protocols such as
      ICMP, routing protocols such as OSPF, and Internet or lower-layer
      protocols being "tunneled" over (i.e., encapsulated in) IP such as
      IPX, AppleTalk, or IP itself.

   link A communication facility or medium over which nodes can
      communicate at the link layer, i.e., the layer immediately below
      IP. Examples are Ethernets (simple or bridged); PPP links; X.25,
      Frame Relay, or ATM networks; and Internet (or higher) layer
      "tunnels", such as tunnels over IPv4 or IPv6. Occasionally we use
      the slightly more general term "lower layer" for this concept.

   interface A node's attachment to a link.

   address An IP-layer identifier for an interface or a set of
      interfaces.

   packet An IP header plus payload.

   MTU Maximum Transmission Unit, the size in bytes of the largest IP
      packet, including the IP header and payload, that can be
      transmitted on a link or path. Note that this could more properly
      be called the IP MTU, to be consistent with how other standards
      organizations use the acronym MTU.

   link MTU The Maximum Transmission Unit, i.e., maximum IP packet size
      in bytes, that can be conveyed in one piece over a link. Beware
      that this definition differs from the definition used by other
      standards organizations.

      For IETF documents, link MTU is uniformly defined as the IP MTU
      over the link.
      This includes the IP header, but excludes link layer headers and
      other framing which is not part of IP or the IP payload.

      Be aware that other standards organizations generally define link
      MTU to include the link layer headers.

   path The set of links traversed by a packet between a source node and
      a destination node.

   pMTU, path MTU The minimum link MTU of all the links in a path
      between a source node and a destination node.

   classical path MTU discovery Process described in RFC 1191 and RFC
      1981, in which nodes rely on ICMP "Packet Too Big" (PTB) messages
      to learn the MTU of a path.

   PL, Packetization Layer The layer of the network stack which segments
      data into packets.

   PLPMTUD Packetization Layer Path MTU Discovery, the method described
      in this document, which is an extension to classical PMTU
      discovery.

   PTB (Packet Too Big) message An ICMP message reporting that an IP
      packet is too large to forward. This is the IPv6 term that
      corresponds to the IPv4 "ICMP Can't fragment" message.

   flow A context in which MTU discovery algorithms can be invoked.
      This is naturally an instance of the packetization protocol, e.g.
      one side of a TCP connection.

   MPS The maximum IP payload size available over a specific path. This
      is typically the path MTU minus the IP header. As an example,
      this is the maximum TCP packet size, including TCP payload and
      headers but not including IP headers. This has also been called
      the "L3 MTU".

   MSS The TCP Maximum Segment Size, the maximum payload size available
      to the TCP layer. This is typically the path MPS minus the size
      of the TCP header.

   probe packet A packet which is being used to test a path for a larger
      MTU.

   probe size The size of a packet being used to probe for a larger MTU.

   successful probe The probe packet was delivered through the network
      and acknowledged by the Packetization Layer on the far node.

   inconclusive probe The probe packet was not delivered, but there were
      other lost packets close enough to the probe that it cannot be
      presumed that the probe was lost because it was larger than the
      path MTU. By implication the probe might have been lost due to
      something other than MTU (such as congestion), so the results are
      inconclusive.

   failed probe The probe packet was not delivered and there were no
      other lost packets close to the probe. This is taken as an
      indication that the probe was larger than the path MTU, and future
      probes should generally be for smaller sizes.

   errored probe There were losses or timeouts during the verification
      phase which suggest a potentially disruptive failure or network
      condition. These are generally retried only after substantially
      longer intervals.

   probe gap The payload data that will be lost and need to be
      retransmitted if the probe is not delivered.

   probe phase The interval (time or protocol events) between when a
      probe is sent and when it is determined that the probe
      succeeded, failed or was inconclusive.

   verification phase An additional interval during which the new path
      MTU is considered provisional. Packet losses or timeouts are
      treated as an indication that there may be a problem with the
      provisional MTU.

   transition phase The interval between the probe phase and the
      verification phase, during which packets using the new MTU
      propagate to the far node and the acknowledgment propagates back.

   probe sequence The sequence of events to raise the MTU by one step,
      starting with the transmission of a probe packet followed by
      probe, transition and verification phases.

   search strategy The heuristics used to choose successive probe sizes
      to converge to the proper path MTU, as described in section 7.5.

   full stop timeout A timeout where none of the packets transmitted
      after some event are acknowledged by the receiver, including any
      retransmissions. This is taken as an indication of some failure
      condition in the network, such as a routing change onto a link
      with a smaller MTU. For the sake of PLPMTUD we suggest the
      following definition of a full stop timeout: the loss of one full
      window of data and at least one retransmission or at least 6
      consecutive packets including at least 2 retransmissions (along
      with two retransmission timer expirations). [@@@ This probably
      needs some experimentation.]

4. Requirements

   All Internet nodes SHOULD implement PLPMTUD in order to discover and
   take advantage of the largest MTU supported along the Internet path.

   Links MUST NOT deliver packets that are larger than their MTU. Links
   that have parametric limitations (e.g. MTU bounds due to limited
   clock stability) MUST include explicit mechanisms to consistently
   reject packets that might otherwise be nondeterministically
   delivered.

   All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
   functionality. All fragmentation SHOULD be done on the host, and all
   IPv4 packets, including fragments, SHOULD have the DF bit set such
   that they will not be fragmented (again) in the network. See Section
   6.4.

   The requirements below only apply to those implementations that
   include PLPMTUD.

   To use PLPMTUD a Packetization Layer MUST have a loss reporting
   mechanism that provides the sender with timely and accurate
   indications of which packets were lost in the network.

   Normal congestion control algorithms MUST remain in effect under all
   conditions except when only an isolated probe packet is detected to
   be lost. In this case alone the normal congestion (window or data
   rate) reduction MAY be suppressed.
   If any other data loss is detected, all normal congestion control
   MUST take place.

   Suppressed congestion control (as above) MUST be rate limited such
   that it occurs less frequently than the worst case loss rate for TCP
   congestion control at a comparable data rate over the same path (i.e.
   less than the "TCP-friendly" loss rate [@@]). This SHOULD be
   enforced by requiring a minimum headway between a suppressed
   congestion adjustment (due to a failed probe) and the next attempted
   probe, which is equal to one round trip time for each packet
   permitted by the congestion window. Alternatively this may be
   enforced by not suppressing congestion control if a second probe is
   lost too soon after the first lost probe. This and other issues
   relating to congestion control are discussed in section 7.6.

   Whenever the MTU is raised, the congestion state variables MUST be
   rescaled so as not to raise the window size in bytes (or data rate
   in bytes per second).

   Whenever the MTU is reduced (e.g. when processing ICMP PTB messages)
   the congestion state variables SHOULD be rescaled so as not to raise
   the window size in packets.

   If PLPMTUD updates the MTU for a particular path, all Packetization
   Layer sessions that share the path representation SHOULD be notified
   to make use of the new MTU and make the required congestion
   adjustments.

   All implementations MUST include a mechanism to implement diagnostic
   tools that do not rely on the operating system's implementation of
   path MTU discovery. This specifically requires the ability to send
   packets that are larger than the known MTU for the path, and to
   collect any resultant ICMP error messages. See section 9.4 for
   further discussion of MTU diagnostics.

5. Layering

   Packetization Layer Path MTU Discovery is most easily implemented by
   splitting its functions between layers.
   The IP layer is in the best place to keep shared state, collect the
   ICMP messages, track IP header sizes and manage MTU information
   provided by the link layer interfaces. However the procedures that
   PLPMTUD uses for probing, verification and scanning for the path MTU
   are very tightly coupled to the data recovery and congestion control
   state machines in the Packetization Layers. The most difficult part
   of implementing PLPMTUD is properly splitting the implementation
   between the layers.

   Note that this layering is consistent with the advice in the current
   PMTUD specifications [2][3]. Many implementations of classical PMTU
   Discovery are already split along these same layers.

5.1 Accounting for Header Sizes

   Early implementation of PLPMTUD revealed that it is critically
   important to have a good clean mechanism for accounting for header
   sizes at all layers. This is because each Packetization Layer does
   its calculations in its own natural data unit, which is almost always
   a reflection of the service that the Packetization Layer provides to
   the application or other upper layers. For example, TCP naturally
   performs all of its calculations in terms of sequence numbers and
   segment sizes. However, the MTU size being probed, the MTU size
   reported in ICMP PTB messages, etc. are measures of full packets,
   which not only include the TCP payload (measured in sequence space)
   but also include fixed TCP and IP headers, and may include IPv6
   extension headers or IPv4 options, TCP options and even IPsec AH or
   ESP headers.

   PLPMTUD requires frequent translation between these two domains: the
   Packetization Layer's natural data unit and full IP packet sizes.
   While there are a number of possible ways to accurately implement
   dual size measures, our experience has been that it is best if the
   IP layer and the Packetization Layer communicate across their
   boundary in terms of the IP Maximum Payload Size or MPS. The MPS is
   the only size measure that is common to both layers because it
   exactly matches the boundary between the layers. The IP Layer is
   responsible for adding or deducting its own headers when translating
   between MTU and MPS. Likewise the Packetization Layer is responsible
   for adding or deducting its own headers when doing calculations in
   its natural data units. E.g. for TCP, the MPS and MSS differ by the
   TCP header size.

   Be aware that a casual reading of this document might give the
   impression that MTU, MPS and Packetization Layer data size (e.g. TCP
   MSS) are used interchangeably. They are not. Our choice of
   terminology is consistent with the protocol layer being discussed in
   the surrounding context. All implementors must pay attention to the
   distinction between these terms and include all necessary
   conversions, even when they are not explicitly indicated in this
   document.

   [Note that although existing implementations of classical path MTU
   discovery typically include some sort of path information cache
   structured as described here, none keep the cached information in
   MPS. All known path information caches keep path information in
   terms of IP MTU, which results in layering (and probable scope)
   violations at every point where upper protocol layers need to make
   decisions about message sizes. Early PLPMTUD implementations avoided
   redefining the path information cache, and as a consequence were
   fraught with problems relating to implementing MTU to MPS to payload
   size calculations in other parts of PLPMTUD.
   We encourage all implementors (and potential implementors) to start
   by changing the path information cache to use MPS. This change is
   quite possibly the most difficult step in implementing PLPMTUD,
   because it requires changes in many places throughout the protocol
   stack. @@@@ remove before RFC status?]

5.2 Storing PMTU information

   The IP layer is the best place to store cached MPS values and other
   shared state such as MTU values reported by ICMP PTB messages.
   Ideally this shared state should be associated with a specific path
   traversed by packets exchanged between the source and destination
   nodes. However, in most cases a node will not have enough
   information to completely and accurately identify such a path.
   Rather, a node must associate an MPS value with some local
   representation of a path. It is left to the implementation to select
   the local representation of a path.

   An implementation could use the destination address as the local
   representation of a path. The MPS value associated with a
   destination would be the minimum MPS learned across the set of all
   paths in use to that destination. The set of paths in use to a
   particular destination is expected to be small, in many cases
   consisting of a single path. This approach will result in the use of
   optimally sized packets on a per-destination basis. This approach
   integrates nicely with the conceptual model of a host as described in
   [RFC 2461]: an MPS value could be stored with the corresponding entry
   in the destination cache. However, NAT and other forms of middle
   boxes may exhibit differing MTUs at a single IP address.

   Note that network or subnet numbers are not suitable to use as
   representations of a path, because there is not a general mechanism
   to determine the network mask at the remote host.
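
   The header accounting of section 5.1 and the per-destination path
   representation described above can be illustrated with a minimal
   sketch. It is not taken from any implementation: the names
   mtu_to_mps, mps_to_mss and PathCache are hypothetical, and the
   20-byte header sizes assume IPv4 and TCP without options. A real
   cache would also need a way to raise a stored value after a verified
   probe; the sketch only shows the "minimum across paths" rule.

```python
from dataclasses import dataclass, field
from typing import Dict

IPV4_HEADER = 20  # bytes; assumes no IPv4 options
TCP_HEADER = 20   # bytes; assumes no TCP options


def mtu_to_mps(mtu: int, ip_header: int = IPV4_HEADER) -> int:
    """IP layer side: deduct the IP headers to obtain the MPS."""
    return mtu - ip_header


def mps_to_mss(mps: int, pl_header: int = TCP_HEADER) -> int:
    """Packetization Layer side: deduct its own headers from the MPS."""
    return mps - pl_header


@dataclass
class PathCache:
    """Hypothetical path information cache that stores MPS, not MTU."""

    entries: Dict[str, int] = field(default_factory=dict)

    def record(self, destination: str, mps: int) -> None:
        # Keep the minimum MPS learned across all paths to this destination.
        current = self.entries.get(destination)
        self.entries[destination] = mps if current is None else min(current, mps)

    def lookup(self, destination: str, default: int) -> int:
        # Fall back to a caller-supplied default for unknown destinations.
        return self.entries.get(destination, default)
```

   With these assumptions a 1500-byte MTU corresponds to an MPS of 1480
   bytes and a TCP MSS of 1460 bytes. Only the MPS crosses the boundary
   between the two layers.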

   If IPv6 flows are in use, an implementation could use the IPv6 flow
   id [7][14] as the local representation of a path. Packets sent to a
   particular destination but belonging to different flows may use
   different paths, with the choice of path depending on the flow id.
   This approach will result in the use of optimally sized packets on a
   per-flow basis, providing finer granularity than MPS values
   maintained on a per-destination basis.

   For source routed packets (i.e. packets containing an IPv6 Routing
   header, or IPv4 LSRR or SSRR options), the source route may further
   qualify the local representation of a path. An implementation could
   use source route information in the local representation of a path.

5.3 Accounting for IPsec

   This document does not take a stance on the placement of IPsec, which
   logically sits between IP and the Packetization Layer. As far as
   PLPMTUD is concerned IPsec can be treated either as part of IP or as
   part of the Packetization Layer, as long as the accounting is
   consistent within the implementation. If IPsec is treated as part of
   the IP layer, then each security association to a remote node may
   need to be treated as a separate path, i.e. the security association
   is used to represent the path. If IPsec is treated as part of the
   Packetization Layer, the IPsec header size has to be included in the
   Packetization Layer's header size calculations.

5.4 Measuring path MTU

   This memo uses the concept of a "flow" to define the scope of the
   path MTU discovery algorithms. For many implementations, a flow
   would naturally correspond to an instance of each protocol (i.e.
   each connection or session). In such implementations the algorithms
   described in this document are performed within each session for each
   protocol. The observed MPS can be shared between different flows
   sharing a common path representation.

   Alternatively, PLPMTUD could be implemented such that the complete
   PLPMTUD state is associated with the path representations. Such an
   implementation could use multiple connections or sessions for each
   probe sequence. For example, one connection could do the initial
   probe and set the provisional MTU, and one or more subsequent
   connections could verify the MTU. This approach may converge much
   more quickly in some environments, such as when the application uses
   many small connections, each of which is too short to complete a
   probe sequence.

   These approaches are not mutually exclusive. Due to differing
   constraints on generating probes (section 7.2) and the MPS searching
   algorithm (section 7.5), it may not be feasible for different
   Packetization Layer protocols to share PLPMTUD state. This suggests
   that it may make sense to share state within some protocols, but not
   with others. In this case, the different protocols can still share
   the observed MPS but they will have differing convergence
   properties.

6. The Probing Sequence and Lower Layers

   This section describes the details of a probe sequence, including how
   to process MTU and error indications, necessary to raise the MTU by
   one step.

6.1 Normal sequence of events to raise the MTU

   If the probe size is smaller than the actual path MTU and there are
   no other losses, the normal sequence of events to raise the MTU is:

   1.  Confirm probing preconditions: no outstanding Packetization Layer
       losses, sufficient congestion window per section 7.6, and
       sufficient elapsed time since the previous probe per section 6.3.
       If the candidate MPS has not been set from the ICMP MPS, then
       compute the candidate MPS per the MPS search strategy in section
       7.5.

   2.  A new MTU is tested by sending one "probe packet", of size "probe
       size" (computed from the candidate MPS). The probe is sent,
       followed by additional packets at the current MTU.
       By definition PLPMTUD enters the probe phase. The probe
       propagates through the network and the far node acknowledges it
       (or possibly later data, if acknowledgements are cumulative and
       delayed acknowledgement is in effect).

   3.  The acknowledgement for the probe reaches the data sender. By
       definition, this ends the probe phase.

   4.  The Packetization Layer provisionally raises the MTU to the probe
       size. PLPMTUD enters the transitional phase when it starts
       sending data using the provisional MTU.

       Note that implementations that use packet counts for congestion
       accounting (e.g. keep cwnd in units of packets) must re-scale
       their congestion accounting such that raising the MTU does not
       raise the data rate (bytes/second) or the total congestion window
       in bytes, as required in section 4 and discussed in section 7.6.

       If the implementation packetizes the data at the application
       programming interface, it may transmit already queued data at the
       current MTU before raising the MTU. In this case this data is
       not part of either the probing or transition phases, because all
       of the packets in flight fit within the current MTU.

   5.  Once the first packet of the transitional phase is acknowledged,
       PLPMTUD enters the verification phase. In principle the
       verification phase can be of arbitrary duration, however at this
       time we are recommending one full window of data (i.e. one full
       round trip time) for most Packetization Layers.

   6.  Once there has been sufficient data delivered and acknowledged
       the provisional MTU is considered verified and the path MTU is
       updated. PLPMTUD can then probe for an even larger MTU, as
       described in the searching strategy in section 7.5.

   Other events described in the next section are treated as exceptions
   and alter or cancel some of the steps above.
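
   The loss-free sequence above can be sketched as a small state
   tracker. This is an illustration under stated assumptions, not a
   specified algorithm: the class ProbeSequence, its method names and
   the helper rescale_cwnd_packets are hypothetical, the window is kept
   in packets, and the rescaling uses full packet sizes for simplicity
   where a real Packetization Layer would use its own data units. None
   of the exception handling of section 6.2 is modeled.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()          # no probe outstanding
    PROBE = auto()         # probe sent, not yet acknowledged (steps 2-3)
    TRANSITION = auto()    # sending at the provisional MTU (step 4)
    VERIFICATION = auto()  # first provisional-size packet acked (step 5)


def rescale_cwnd_packets(cwnd_packets: int, old_size: int, new_size: int) -> int:
    """Convert a packet-count window so its size in bytes does not grow.

    May return 0 if the old window is smaller than one new-size packet;
    the step 1 precondition (sufficient congestion window) is expected
    to rule that case out.
    """
    if old_size <= 0 or new_size <= 0:
        raise ValueError("packet sizes must be positive")
    return (cwnd_packets * old_size) // new_size


class ProbeSequence:
    """Hypothetical tracker for the normal sequence of section 6.1."""

    def __init__(self, path_mtu: int, cwnd_packets: int) -> None:
        self.path_mtu = path_mtu
        self.cwnd_packets = cwnd_packets
        self.probe_size = None
        self.provisional_mtu = None
        self.phase = Phase.IDLE

    def _expect(self, phase: Phase) -> None:
        if self.phase is not phase:
            raise RuntimeError(f"expected {phase.name}, in {self.phase.name}")

    def send_probe(self, probe_size: int) -> None:
        # Step 2: one probe packet, followed by data at the current MTU.
        self._expect(Phase.IDLE)
        if probe_size <= self.path_mtu:
            raise ValueError("probe must be larger than the current MTU")
        self.probe_size = probe_size
        self.phase = Phase.PROBE

    def probe_acknowledged(self) -> None:
        # Steps 3-4: the probe phase ends; raise the MTU provisionally and
        # rescale the window so that its byte count does not increase.
        self._expect(Phase.PROBE)
        self.cwnd_packets = rescale_cwnd_packets(
            self.cwnd_packets, self.path_mtu, self.probe_size)
        self.provisional_mtu = self.probe_size
        self.phase = Phase.TRANSITION

    def provisional_packet_acknowledged(self) -> None:
        # Step 5: first packet sent at the provisional MTU was acknowledged.
        self._expect(Phase.TRANSITION)
        self.phase = Phase.VERIFICATION

    def verification_complete(self) -> None:
        # Step 6: enough data delivered; the provisional MTU is verified.
        self._expect(Phase.VERIFICATION)
        self.path_mtu = self.provisional_mtu
        self.provisional_mtu = None
        self.phase = Phase.IDLE
```

   For example, a sender with a window of 10 packets at a 1500-byte MTU
   that successfully probes 3000 bytes continues with a window of 5
   packets, so the window in bytes is unchanged.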

6.2 Processing MTU Indications

   When the probe sequence fails to raise the MTU, it will be due to one
   of three broad classes of outcomes: the probe was inconclusive,
   failed or errored. If the probe was inconclusive, it means that
   there were other losses seemingly unrelated to the probe, such that
   the probe outcome was ambiguous. Inconclusive probes should be
   retried with the same probe size. If the probe failed, there was an
   indication that the probe size was larger than the path MTU, and the
   probe should be retried with a smaller size, as selected by the MTU
   searching algorithm. In some situations there are indications that
   the probing sequence caused some unexpected event. In these "error"
   conditions it is desirable to use progressively longer delays to
   minimize the possible impact to the network.

6.2.1 Processing ICMP PTB messages

   Classical PMTU discovery specifies the generation of ICMP PTB
   Messages if an over-sized packet (e.g. a probe) encounters a link
   that has a smaller MTU. Since these messages can not be
   authenticated they introduce a number of well documented attacks
   against classical PMTUD [5].

   With PLPMTUD these messages are not required for correct operation,
   and in principle can be summarily ignored at the expense of slower
   convergence to the proper MTU. However we believe that a slightly
   better compromise is to save the reported PTB size (computed from the
   ICMP MTU) in the path information cache and act on it only in two
   specific contexts: in conjunction with a lost PLPMTUD probe or a
   full-stop timeout.

   Every ICMP PTB Message should be subjected to the following checks:

   o  If globally forbidden then discard the message.

   o  If forbidden by the application then discard the message.

   o  If this path has been tagged "bogus ICMP messages" then discard
      the message.

   o  If the reported MTU fails consistency checks then set the "bogus
      ICMP messages" flag for this path and discard the message. These
      consistency checks include:
      *  unrecognized or unparseable enclosed header, or
      *  reported MTU is larger than the size indicated by the enclosed
         header, or
      *  larger than the current MTU, provisional MTU or probe size as
         appropriate, or
      *  fails an ICMP consistency check specific to the Packetization
         Layer (e.g. the SCTP Verification-Tag mechanism [9][16]).
      To ease migration, it is suggested that implementations may
      include global controls to emulate legacy operation by suppressing
      some or all of the consistency checks.

   If the ICMP PTB message is acceptable under all of these checks then
   save the "ICMP MPS" computed from the MTU field in the ICMP message.
   If the global configuration switch is set to emulate classical path
   MTU discovery then process the message immediately (i.e. set the
   path MPS to the ICMP MPS and invoke any protocol specific actions).
   Otherwise, the saved ICMP MPS will be acted upon if and only if there
   are other PLPMTUD events such as lost probes, etc. as indicated in
   the next section. This delayed processing of ICMP PTB messages makes
   it more difficult for an attacker to interfere with correct PLPMTUD
   operation by injecting fraudulent ICMP PTB messages.

   In either case if the Packetization Layer calls for specific actions
   in response to a PTB message, that action should be invoked only at
   the point when the path MPS is updated from the ICMP MPS.

6.2.2 Packetization Layer Detects Lost Packets

   Each packetization protocol has its own mechanism to detect lost
   packets and request the retransmission of missing data. The primary
   signals used by PLPMTUD are these protocol specific loss indications.
   The Packetization Layer is responsible for retransmitting the lost
   data and notifying PLPMTUD that there was a loss.

   o  If the probe itself was lost, and there were no other losses
      during the probe phase (the RTT between when the probe was sent
      and the loss detected) then it is taken as an indication that the
      path MTU is smaller than the probe size. In this situation alone
      the Packetization Layer is permitted to retransmit the missing
      data (the "probe gap") without adjusting its congestion window or
      data transmission rate.

      If an accepted ICMP PTB message was received after the probe was
      sent, and it passes the additional checks that the ICMP MTU is
      greater than the current MTU and less than the probe size, then
      set the candidate MPS from the ICMP MTU, and restart the probe
      sequence from step 1 in section 6.1.

      If there was not an accepted PTB Message, then the indicated event
      is a "probe failure", which can be retried with a smaller probe
      size after a suitable delay for a probe_fail_event. See section
      6.3 for more complete descriptions of failure events.

   o  If there are losses during the probe phase and the probe was not
      lost, then the probe was successful. However, since additional
      losses have the potential to spoil the verification phase, it is
      important that PLPMTUD not progress into the transition phase
      (step 4 above) until after the Packetization Layer has fully
      recovered from the losses and completed the congestion window (or
      rate) adjustment.

   o  If there are losses during the probe phase and the probe was also
      lost, the outcome depends on the presence of an ICMP MTU set by an
      acceptable PTB message.

      If there was an accepted PTB message received since the probe was
      sent, and it passes the additional checks that the ICMP MTU is
      greater than the current MTU and less than the probe size, then
      set the candidate MPS from the ICMP MTU, and restart the probe
      sequence from step 1 in section 6.1.

      If there was not an acceptable ICMP PTB message, then the probe is
      inconclusive because the lost probe might have been caused by
      congestion. The probe can be retried after a suitable delay for
      a probe_inconclusive_event.

   o  It is unlikely that losses during the transition phase are caused
      by PLPMTUD, however they do potentially complicate the
      verification phase. Note that we are referring to losses that are
      bracketed by acknowledgement of packets that were sent at the old
      MTU, while the transition to the provisional MTU is still
      propagating through the network. The first acknowledgement from
      the provisional MTU (and the transition to the verification phase)
      is most likely going to occur during the recovery of the losses in
      the transition phase. It is important that the Packetization Layer
      retransmission machinery distinguish between losses at the old MTU
      (transition phase) and the provisional MTU (the verification
      phase, discussed next).

   o  Losses during the verification phase are taken as an indication
      that the path may have a non-uniform MTU or some other problem
      such that raising the MTU raises the loss rate. If so, this is
      potentially a very serious problem, so the provisional MTU is
      considered to have errored and the path MTU is set back to the
      previously verified MTU.

      Packet loss during the verification phase might also be due to
      coincidental congestion on the path, unrelated to the probe, so it
      would seem to be desirable to re-probe the path.
      The risk is that this effectively raises the tolerated loss
      threshold because even though raising the MTU seemed to cause
      additional loss, there is a statistical chance that repeated
      attempts to verify a new MTU may yield a false pass. The
      compromise is to re-probe once with the same probe size (after
      delay probe_inconclusive_event), and if this also fails, then the
      probe may not be retried until after a suitable delay for a
      verification_error_event, which exponentially increases on each
      successive failure.

6.2.3 Packetization Layer Retransmission Timeout

   Note that we do not make distinctions between the various methods
   that different Packetization Layers might use for detecting and
   retransmitting lost packets. It is preferable that the Packetization
   Layer uses a recovery mechanism similar to TCP SACK or fast
   retransmit, designed to detect and report losses and to recover as
   quickly as possible.

   Under some conditions the Packetization Layer may have to rely on
   retransmission timeouts or other fairly disruptive techniques to
   detect and recover from losses. Since these greatly increase the
   cost of failed probes, it is recommended that PLPMTUD use even longer
   delays before re-probing. In these situations replace
   probe_fail_event with probe_timeout_event.

6.2.4 Packetization Layer Full Stop Timeout

   Under all conditions (not just during MTU probing) a full stop
   timeout should be taken as an indication of some significantly
   disruptive event in the network, such as a router failure or a
   routing change to a path with a smaller MTU.

   If the ICMP MPS was recently set, and it is less than the current
   MTU (or provisional MTU during the transitional phase), then the path
   MTU can be reduced to the ICMP MTU. A full stop timeout is the only
   situation outside of a probe in which we recommend that the path MTU
   be set from the ICMP MTU.
   (In section 9.1 we relax this recommendation to facilitate migration
   to PLPMTUD in exchange for slightly less protection from corrupt ICMP
   PTB messages).

   Note that whenever a problem with the path causes a full-stop
   timeout (also known as a "persistent timeout" in other documents),
   several different path restart/recovery algorithms may be invoked at
   different layers in the stack. Some device drivers may be restarted
   [@@], router discovery [@@], ES-IS [@@] and so forth. We recommend
   that in most situations the first action should be to set the path
   MTU down. Note that this recommendation is really beyond the scope
   of this document, and may require substantial additional research.

   If there is a full stop timeout and there was not an ICMP message
   indicating a reason (PTB, Net unreachable, etc.), or the ICMP
   message was ignored for some reason, we suggest that the first
   recovery action should be to set the path MTU down to a safe minimum
   "restart MTU" value, and then reset the PLPMTUD search state, so
   PLPMTUD will start over again searching for the proper MTU. The
   default IPv4 restart_MTU should be the minimum MTU as specified by
   IPv4 (576 Bytes) [1]. The default IPv6 restart_MTU should be the
   minimum MTU as specified by IPv6 (1280 Bytes) [7]. These defaults
   apply unless the default MTU is overridden by some global control
   (see section 9.5).

   If and only if the full stop timeout happens during the probe or
   transition phases (e.g. after sending data using the provisional
   MTU but before any of it is acknowledged) is it considered likely
   that raising the MTU caused the full stop timeout. If so this
   situation is likely to be cyclic, because resetting the PLPMTUD
   search state is likely to eventually cause re-probing the same
   problematic MTU. It is tempting to define additional states to
   detect recurrent full stop timeouts.
   However in today's hostile network environment, there is little
   tolerance for nodes that are so fragile that they can be disrupted by
   something as simple as oversized packets. Therefore we do not feel
   that it is worth the overhead of specifying a state machine that is
   capable of automatically detecting these situations and disabling
   PLPMTUD. However, it is important that there be a manual way to
   disable or limit probing on specific paths. See section 9.5.

6.3 Probing Intervals

   The previous sections describe a number of events that prevent a
   probe sequence from raising the path MTU. In all cases the basic
   response is the same: to wait some time interval (dependent on the
   specific event and possibly the history) and then to probe again.
   For events that are "inconclusive", it is generally appropriate to
   re-probe with the same probe size. For events that are identified as
   "failed probes" it is generally appropriate to re-probe with a
   smaller probe size. The search strategy described in section 7.5 is
   used to select probe sizes.

   Many of the intervals below are specified in terms of elapsed round
   trips relative to the current congestion window. This is because TCP
   and other Packetization Layer protocols tend to exhibit periodic
   losses which cause periodic variations of the congestion window and
   possibly the data rate. It is preferable that the PLPMTUD probes are
   scheduled near the low point of these cycles to minimize ambiguities
   caused by congestion losses.

   In order from least to most serious:

   probe_converge_event  The candidate probe size has already been
      probed so there is no further searching. Delay 5 minutes and then
      re-probe last SEARCH_HIGH.

   probe_inconclusive_event  Other lost packets near the lost probe made
      the probe result ambiguous.
      Since the loss of non-probe packets requires a window (or data
      rate) reduction, it is desirable to schedule the re-probe (at the
      same probe size) roughly one round trip time after the end of the
      loss recovery. This will be almost the minimum congestion window
      size, with a small cushion to minimize the chances that correlated
      losses caused by some other bursty connection spoil another probe.

   probe_fail_event  A probe fail event is the one situation under which
      the Packetization Layer is permitted not to treat loss as a
      congestion signal. Because there is some small risk that
      suppressing congestion control might have unanticipated
      consequences (even for one isolated loss), we require that probe
      fail events be less frequent than the normal period for losses
      under standard congestion control. Specifically after a probe
      fail event and suppressed congestion control, PLPMTUD may not
      probe again until an interval which is comparable to the expected
      interval between congestion control events. This is required in
      section 4 and discussed further in section 7.6.

      The simplest estimate of the interval to the next congestion event
      is the same number of round trips as the current window in
      packets.

   probe_timeout_event  Since this event was detected by a timeout, it
      is relatively disruptive to protocol operation. Furthermore,
      since the event indirectly includes a window adjustment that may
      have been caused by the MTU probe, it is important that the probe
      not be repeated until congestion control has had more than
      sufficient time to recover from the loss. Therefore we recommend
      five times the probe_fail_event interval, i.e. five times as many
      round trips as the current congestion window in packets.

   verification_error_event  A verification fail event indicates that a
      probe was delivered and the verification phase failed twice
      separated by a congestion adjustment (so the second verification
      phase was at a low point in the congestion control cycle). This
      is an indication that one of the following three things might have
      happened: repeated losses unrelated to PLPMTUD; the path is
      striped across links with dissimilar MTUs; or the link layer has
      some parametric limitation such that raising the MTU greatly
      increases the random error rate.

      The optimal method of responding to this situation is an open
      research question. We believe that the correct response is some
      combination of exponentially lengthening back-offs (e.g. starting
      at 1 minute and quadrupling on each repeat) and implicitly
      treating the situation as a probe fail (and choosing a smaller
      probe size) after some threshold number of repeated
      verification_error_events.

6.4 Host fragmentation

   Packetization Layers are encouraged to avoid sending messages that
   will require fragmentation (for the case against fragmentation, see
   [17][18]). However this is not always possible. Some Packetization
   Layers, such as a UDP application outside the kernel, may be unable
   to change the size of the messages they send. This may result in
   packet sizes that exceed the Path MTU.

   IPv4 permitted such applications to send packets without DF set.
   Oversized packets without DF would be fragmented in the network or
   sending host when they encountered a link with an MTU smaller than
   the packet. In some cases, packets could be fragmented more than
   once if there were cascaded links with progressively smaller MTUs.

   This approach is no longer recommended. We now recommend that IPv4
   implementations use a strategy that mimics IPv6 functionality.
   When an application sends datagrams that are larger than the known
   path MTU they should be fragmented to the path MTU in the host IP
   layer even if they are smaller than the link MTU of the first network
   hop directly attached to the host. The DF bit should be set on the
   fragments, so they will not be fragmented again in the network.

   This technique will minimize future surprises as the Internet
   migrates to IPv6. Otherwise there is the potential for widely
   deployed applications or services relying on IPv4 fragmentation in a
   way that can not be implemented in IPv6. At least one major
   operating system already uses this strategy.

   Note that IP fragmentation divides data into packets, so it is
   minimally a Packetization Layer. However it does not have a
   mechanism to detect lost packets, so it can not support a native
   implementation of PLPMTUD. PLPMTUD has to be implemented with an
   adjunct protocol as described in section 8.3.

6.5 Multicast

   In the case of a multicast destination address, copies of a packet
   may traverse many different paths to reach many different nodes. The
   local representation of the "path" to a multicast destination must in
   fact represent a potentially large set of paths.

   Minimally, an implementation could maintain a single MPS value to be
   used for all packets originated from the node. This MPS value would
   be the minimum MPS learned across the set of all paths in use by the
   node. This approach is likely to result in the use of smaller
   packets than is necessary for many paths.

   If the application using multicast gets complete delivery reports
   (unlikely because this requirement has poor scaling properties),
   PLPMTUD could be implemented in multicast protocols.

7. Common Packetization Properties

   This section describes general Packetization Layer properties and
   characteristics needed to implement PLPMTUD.
   It also describes some implementation issues that are common to all
   Packetization Layers.

7.1 Mechanism to detect loss

   It is important that the Packetization Layer has a timely and robust
   mechanism for detecting and reporting losses. PLPMTUD makes MTU
   adjustments on the basis of detected losses. Any delays or
   inaccuracies in loss notification are likely to result in incorrect
   MTU decisions or slow convergence.

   It is best if Packetization Protocols use fairly explicit loss
   notification such as Selective Acknowledgements, although implicit
   mechanisms such as TCP Reno style duplicate acknowledgement counting
   are sufficient. It is important that the mechanism can robustly
   distinguish between the isolated loss of just a probe and other
   combinations of losses.

   Many protocol implementations have complicated mechanisms such as
   SACK scoreboards to distinguish between real losses and temporarily
   missing data due to reordering in the network. In these
   implementations it is desirable to signal losses to PLPMTUD as a side
   effect of the data retransmission. This approach offers the maximum
   protection from confusing signals due to reordering and other events
   that might mimic losses.

   PLPMTUD can also be implemented in protocols that rely on timeouts as
   their primary mechanism for loss recovery, although this should be
   used only when there are no other alternatives.

7.2 Generating Probes

   There are several possible ways to alter Packetization Layers to
   generate probes. The different techniques incur different overheads
   in three areas: difficulty in generating the probe packet (in terms
   of Packetization Layer implementation complexity and extra data
   motion), possible additional network capacity consumed by the probes,
   and the overhead of recovering from failed probes (both network and
   protocol overheads).

Some protocols might be extended to allow arbitrary padding with dummy data. This greatly simplifies the implementation because the probing can be performed without participation from higher layers and, if the probe fails, the missing data (the "probe gap") is assured to fit within the current MTU when it is retransmitted. This is probably the most appropriate method for protocols that support arbitrary length options or multiplexing within the protocol itself.

Many Packetization Layer protocols can carry pure control messages (without any data from higher protocol layers) which can be padded to arbitrary lengths. For example, the SCTP HEARTBEAT message can be used in this manner (see section 8.2). This approach has the advantage that nothing needs to be retransmitted if the probe is lost.

These techniques do not work for TCP, because there is no separate length field or other mechanism to differentiate between padding and real payload data. With TCP the only approach is to send additional payload data in an over-sized segment. There are at least two variants of this approach, discussed in section 8.1.

In a few cases there may be no reasonable mechanism to generate probes within the Packetization Layer protocol itself. As a last resort it may be possible to rely on an adjunct protocol, such as ICMP ECHO (aka "ping"), to send probe packets. See section 8.3 for further discussion of this approach.

7.3 Mechanism to support provisional MTUs

The verification phase requires a mechanism to provisionally raise the MPS and, if there are additional losses, restore the old MPS. While this is not difficult for most potential Packetization Layers, there are a few (e.g. ISO TP4 [ISOTP]) that are not allowed to repacketize when doing a retransmission.
That is, once an attempt is made to transmit a segment of a certain size, the transport cannot split the contents of the segment into smaller segments for retransmission. In such a case, the original segment can be fragmented by the IP layer during retransmission as described in section 6.4. Subsequent segments, when transmitted for the first time, should be no larger than allowed by the path MTU.

Note that while padding is an appropriate mechanism for probing, it is too wasteful for use during the verification phase.

Unresolved problem: if two Packetization Layers are using the same path and one can only verify constrained sizes (e.g. blocks plus headers), then the verified MTU might be the actual packet size for the constrained Packetization Layer, not the probed size. @@@@

Unresolved problem: what to do about very short flows? No verification phase? @@@@@

7.4 Selecting the initial MPS

If there is already a cached MPS value for this path, PLPMTUD may use the saved MPS value. Unless it is very recent (how recent? @@@@@) SEARCH_HIGH should be set to SEARCH_MAX, to restart the search process from the old MPS.

Note that there are tradeoffs in how long path information cache entries are retained when they are not being used by any flows. If they are kept for too long they waste memory; if they are purged too soon the result is frequent re-probing. We suggest an adjustable Least Recently Used algorithm to purge old entries. @@@@ This belongs some place else.

When the PLPMTUD process is started, the recommended initial MPS should normally be set such that the Packetization Layer can carry 1 kByte data segments. This initial MPS would be 1 kByte plus space for Packetization Layer headers (see section 5 on accounting for headers). With this MPS, RFC2414 [6] allows TCP and other transport protocols to start with an initial window of 4 packets.
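
As a rough illustration of why a 1 kByte data segment yields an initial window of 4 packets, the following sketch applies the initial window upper bound from RFC 2414, min(4*MSS, max(2*MSS, 4380 bytes)). The function name is hypothetical, and the MSS argument is assumed to count data bytes only, excluding headers.

```python
def initial_window_segments(mss: int) -> int:
    """Initial window in whole segments, using the RFC 2414 upper bound
    min(4*MSS, max(2*MSS, 4380 bytes)). Hypothetical helper name."""
    window_bytes = min(4 * mss, max(2 * mss, 4380))
    return window_bytes // mss


# 1 kByte data segments, as recommended for the initial MPS.
print(initial_window_segments(1024))
# A typical data segment on a 1500 byte MTU path (assumed 40 bytes of headers).
print(initial_window_segments(1460))
```

Under these assumptions the 1 kByte segment size starts with more packets in flight than a 1460 byte segment size does, although the larger segments still carry more bytes in total.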

[We suspect, but have not confirmed, that] TCP completes sooner for short connections when started with four 1 kByte packets rather than three 1500 byte packets, because the 2nd ACK occurs one round trip earlier.

This initial MPS should also be configurable. One of the configuration options should be to set the initial MPS from the interface MTU, which allows PLPMTUD to be used in a mode that mimics classical PMTU discovery (see section 9.1).

7.5 Common MPS Search Strategy

The MPS search strategy described here is only a rough guide for implementors. It is difficult to imagine a completely standard algorithm because the strategy can include many Packetization Layer specific heuristics to optimize MPS selection. There is significant opportunity for future improvements to this portion of PLPMTUD.

The search strategy is trying to find the largest "candidate MPS" that meets the constraints of both the Packetization and the link layers. Although this algorithm is primarily described in terms of MPS, it needs to use knowledge about link layer MTUs and Packetization Layer buffer sizes.

The search strategy uses three variables:

   SEARCH_MAX is the largest MPS that a Packetization Layer might be able to use. It is determined by such considerations as interface MTU, widths of protocol length fields, and possibly other protocol-dependent values, such as the TCP MSS option. In many cases it would be the same as the classical MTU discovery initial MTU, minus the IP layer headers.

   SEARCH_LOW is the largest validated MPS, the same as the current MPS in use by the packetization layer. The initial value for SEARCH_LOW is described in section 7.4.

   SEARCH_HIGH is the least invalidated MPS. In most cases it will be the most recent failed candidate MPS.
   When PLPMTUD is initialized, SEARCH_HIGH should be set to SEARCH_MAX, indicating that there have been no failed probes.

For many Packetization Layer protocols, the cost of a failed probe is significantly higher than the cost of a successful probe, due to the additional time and overhead needed for retransmission and recovery. For this reason it is often desirable to bias the search strategy toward taking a larger number of smaller steps.

The search strategy first computes an initial candidate MPS using one of these methods:

   If SEARCH_HIGH >= SEARCH_MAX, there have been no recent failed probes, so use a coarse (geometric doubling) scan. Set
      candidate MPS = MIN(2 * SEARCH_LOW, SEARCH_MAX).

   Otherwise use one of several possible fine scan candidate MPS values:

      Select a candidate MPS that corresponds to a common MTU, possibly minus common tunnel header sizes, between SEARCH_LOW and SEARCH_HIGH. There is a fine scan heuristic described in section 7.5.1 that might be used.

      Use a simple weighted binary search by selecting the candidate MPS some prorated distance between SEARCH_LOW and SEARCH_HIGH. E.g. set
         candidate MPS = SEARCH_LOW * (1 - alpha) + SEARCH_HIGH * alpha,
      for some alpha between 0 and 1. If you choose an alpha slightly less than 0.5, PLPMTUD will tend to converge from below, minimizing the number of failed probes. Alternatively, alpha can be selected to optimally converge for some common MTUs, such as 1500 bytes.

If the Packetization Layer has preferred data sizes (e.g. carries block data), optionally round the candidate MPS to an efficient size for the Packetization Layer. The rounded candidate MPS would typically be a multiple of the optimal data block size plus space for Packetization Layer headers.
The MPS can be rounded up or down, but should avoid selecting previously probed values if possible, per the convergence test below. Packetization Layers that do not have intrinsically preferred data sizes may still choose to round the candidate MPS to some convenient increment, such as 4 or 8 bytes, to prevent excessive hunting. Note that this step is intrinsically Packetization Layer dependent, and may be different for different Packetization Layers.

If the resulting candidate MPS is not between SEARCH_LOW and SEARCH_HIGH, then the probe process has converged and further probing will not yield a better value for the MPS for this protocol. To detect if a routing change has raised the path MTU, the path should be re-probed after a suitable delay as indicated by a probe_converge_event (see section 6.3). If the probe succeeds, then SEARCH_HIGH should be set to SEARCH_MAX to restart the probing process from the current MPS.

MPS searching can be implicitly disabled by setting SEARCH_HIGH to SEARCH_LOW.

Note that if two different Packetization Layers are sharing a path, they may choose different MPSs due to differences in the protocols. It is even possible for one of the Packetization Protocols to consider the process converged, while the other continues to probe. In this case one of the Packetization Layers may choose not to use the full MPS, and instead choose some slightly smaller but more efficient packet size.

7.5.1 Fine Scans

If SEARCH_LOW does not correspond to a common link MTU, and there is a common link MTU between SEARCH_LOW and SEARCH_HIGH, set the candidate MPS from the most common link MTU between SEARCH_LOW and SEARCH_HIGH.
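
One possible reading of the candidate selection in section 7.5, combined with the fine scan rule just stated, can be sketched as follows. This is only an illustration: the COMMON_LINK_MTUS table and its ordering, the 20 byte IPv4 header allowance, the default alpha of 0.4, and the function name are all assumptions made for the example. The other fine scan cases and the rounding step are not modeled.

```python
from typing import Optional, Sequence

IPV4_HEADER = 20  # assumed IPv4 header size without options

# Placeholder table of link MTUs, ordered from most to least common.
# This document does not specify the list; the values are illustrative.
COMMON_LINK_MTUS = (1500, 4352, 9000, 1492, 576)


def candidate_mps(search_low: int, search_high: int, search_max: int,
                  alpha: float = 0.4,
                  common_mtus: Sequence[int] = COMMON_LINK_MTUS) -> Optional[int]:
    """Return the next MPS to probe, or None once the search has converged."""
    if search_high >= search_max:
        # No recent failed probe: coarse scan by geometric doubling.
        candidate = min(2 * search_low, search_max)
        # Assumption: SEARCH_MAX itself may still be probed in this state.
        return candidate if candidate > search_low else None

    common_mps = [mtu - IPV4_HEADER for mtu in common_mtus]
    between = [m for m in common_mps if search_low < m < search_high]
    if search_low not in common_mps and between:
        # Fine scan: try the most common link MTU inside the interval.
        candidate = between[0]
    else:
        # Weighted binary search; alpha < 0.5 tends to converge from below.
        candidate = int(search_low * (1 - alpha) + search_high * alpha)

    # Convergence test: the candidate must lie strictly between the bounds.
    if search_low < candidate < search_high:
        return candidate
    return None


print(candidate_mps(1024, 8980, 8980))  # coarse scan
print(candidate_mps(1024, 2048, 8980))  # fine scan toward a common MTU
print(candidate_mps(1480, 1481, 8980))  # converged
```

Returning None models convergence, so a caller would wait for the re-probe delay before resetting SEARCH_HIGH to SEARCH_MAX. Calling the function with SEARCH_HIGH equal to SEARCH_LOW also returns None, which matches the implicit way of disabling the search.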

If SEARCH_LOW does not correspond to a common link MTU, and there is not a common link MTU between SEARCH_LOW and SEARCH_HIGH, then set the candidate MPS to either the weighted binary search between SEARCH_LOW and SEARCH_HIGH or to SEARCH_HIGH, reduced by a reasonable increment for tunnel headers.

If SEARCH_LOW corresponds to a common link MTU, set the candidate MPS to SEARCH_LOW plus some small delta. If this fails, we have found the proper MPS; otherwise we need to keep searching.

@@@@@ common link MTUs are: 1500...... ?

@@@@@ common tunnel header sizes are....

7.6 Congestion Control and Window Management

PLPMTUD and congestion control share the same slice of the protocol stack. Both algorithms nominally run inside of a transport protocol and rely on packet losses as their primary signal to adjust parameters of the data stream (packet size or window size). Furthermore, both push up the controlled parameter until the onset of packet losses, and then back off to a smaller value. Due to the close proximity of these two algorithms there is the potential for side effects and unexpected interactions between them.

This section describes potential interactions between PLPMTUD and congestion control. In general PLPMTUD is designed to minimize its potential impact on congestion control. This is appropriate because correctly functioning congestion control is critical to the overall operation of the Internet.

The requirements in section 4 protect congestion control from PLPMTUD. It is important that MTU changes do not raise the congestion window. Given that we do not know a priori the nature of the network bottleneck, PLPMTUD should not raise either the data rate (bytes per second) or the packet rate (packets per second).
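
A minimal sketch of one way to honor this constraint when the MPS changes, assuming the congestion window is tracked in bytes. The function name and the choice to clamp on both bytes and packets are assumptions made for illustration, not a rule stated by this document.

```python
def cwnd_after_mps_change(cwnd_bytes: int, old_mps: int, new_mps: int) -> int:
    """Return a window (in bytes) for the new MPS that raises neither the
    bytes in flight nor the packets in flight. Hypothetical helper."""
    old_packets = cwnd_bytes // old_mps
    # Keeping the byte count caps the data rate; keeping the packet count
    # caps the packet rate. Taking the smaller value satisfies both.
    return min(cwnd_bytes, old_packets * new_mps)


# Raising the MPS: the byte window is unchanged, so fewer packets are in flight.
print(cwnd_after_mps_change(8192, 1024, 1480))
# Lowering the MPS: the packet count is unchanged, so fewer bytes are in flight.
print(cwnd_after_mps_change(14800, 1480, 1024))
```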

Since there is a risk that lost probes might actually be congestion losses, and not MTU losses at all, we limit the maximum allowed rate for suppressing congestion control to less than the loss rate required to throttle the flow to the "TCP friendly" rate. This guarantees that the losses due to PLPMTUD are less than the losses needed for normal congestion control.

If there is some node which is accounting queue length in bytes (rather than packets), there is even the possibility that a probe might cause a loss by driving the queue over some threshold and into congestion. For this reason it is recommended that all PLPMTUD implementations use some strategy to slightly depress the actual window during the probe process. It may be sufficient to require that the excess data in the probe packet fits within the current congestion control window.

If a probe is carrying real application data that must be retransmitted, it is important to suppress (or restore) all of the congestion control state changes normally associated with the retransmission. For example, if a TCP connection is in slow-start when a probe is lost, it is important that ssthresh is not changed as a side effect of the probing. It is for this reason that it is strongly recommended that packetization protocols use some combination of out-of-band echo message and padding, if at all possible. Lost probes that do not carry any real application data do not need to be retransmitted.

It is recommended that TCP should not probe a new MPS if that MPS will likely result in a cwnd of less than 5 segments.

If the network becomes too congested, it is recommended that the MPS be reduced to a smaller size as determined by a heuristic. The recommended heuristic is to reduce the MPS by half if ssthresh is reduced to 5 segments or smaller, with a minimum MPS of 512 bytes.
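
The two recommendations above can be sketched as follows. The sketch assumes that cwnd and ssthresh are kept in bytes and that "segments" means units of the MPS in question. The function and constant names are hypothetical.

```python
MIN_MPS = 512       # minimum MPS in bytes, from the heuristic above
MIN_SEGMENTS = 5    # segment threshold used by both recommendations


def may_probe(cwnd_bytes: int, candidate_mps: int) -> bool:
    """True if the candidate MPS would still leave cwnd at 5 or more segments."""
    return cwnd_bytes // candidate_mps >= MIN_SEGMENTS


def reduce_mps_on_congestion(mps: int, ssthresh_bytes: int) -> int:
    """Halve the MPS once ssthresh is 5 segments or smaller, never going
    below 512 bytes and never raising an MPS that is already smaller."""
    if ssthresh_bytes <= MIN_SEGMENTS * mps:
        return max(mps // 2, min(mps, MIN_MPS))
    return mps


print(may_probe(8192, 1480))                 # 5 segments would remain
print(may_probe(4096, 1480))                 # only 2 segments would remain
print(reduce_mps_on_congestion(1480, 7400))  # ssthresh is exactly 5 segments
```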

8. Specific Packetization Layers

This section discusses specific implementation details for different protocols that can be used as Packetization Layer protocols. All Packetization Layer protocols must consider all of the issues discussed in section 7. For most protocols it is self-evident how to address many of these issues. It is hoped that the protocols described here will be sufficient illustration for implementors to adapt other protocols.

8.1 Probing method using TCP

TCP has no mechanism that could be used to distinguish between real application data and some other form of padding that might be used to fill out probe packets. Therefore, TCP must generate probes by sending oversized segments that are carrying real data from upper layers. There are two approaches that TCP might use to minimize the overheads associated with the probing sequence.

A TCP implementation of PLPMTUD can elect to send subsequent segments overlapping the probe as though the probe segment was not oversized. This has the advantage that TCP only needs to retransmit one segment at the current MTU to recover from a failed probe. However, the duplicate data in the probe does consume network resources and will cause duplicate acknowledgments. It is important that these extra duplicate acknowledgments not trigger Fast Retransmit. This can be guaranteed by limiting the largest probe segment size to twice the current segment size (causing at most 1 duplicate acknowledgment) or three times the current segment size (causing at most 2 duplicate acknowledgments).

The other approach is to send non-overlapping segments following the probe. Although this is cleaner from a protocol architecture standpoint, it clashes with many of the optimizations used to improve the efficiency of data motion within many operating systems.
In particular, many implementations divide the data into segments and pre-compute checksums as the data is copied out of application buffers. In these implementations it can be relatively expensive to adjust segment boundaries after the data is already queued.

If TCP is using SACK or any other variable length headers, the headers on the probe and verification packets should be padded to the maximum possible length. Otherwise, unexpected options on bidirectional data may cause IP packets that are larger than the tested MTU.

At the point when TCP is ready to start the verification phase, it is permitted to transmit already queued data at the old MTU rather than re-packetize it. This postpones the verification process by the time required to send the queued data.

If the verification phase experiences any segment losses, TCP is required to pull back to the prior MSS. Since failing the verification phase should be an infrequent error condition, it is less important that this be as efficient as probing.

8.2 Probing method using SCTP

In the SCTP protocol [9][16] the application writes messages to SCTP, and SCTP "chunkifies" them into smaller pieces suitable for transmission through the network. Once a message has been chunkified, its chunks are assigned TSNs. Once some TSNs have been transmitted, SCTP cannot change the chunk sizes. SCTP multi-path support normally requires SCTP to chunkify its messages to fit the smallest MPS (maximum payload size, same as MTU - IP headers) of all paths. Although not required, implementations may bundle multiple data chunks together to make larger IP packets, to allow for support for larger MPSs on different paths. Note that SCTP must independently probe and verify the MPS on each path to the peer.
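
As a hedged illustration of chunkifying to the smallest MPS of all paths, the sketch below computes the largest DATA chunk payload that fits in a single unbundled chunk on every path. It assumes the 12 byte SCTP common header and 16 byte DATA chunk header of RFC 2960. The function name and the example MPS values are hypothetical.

```python
from typing import Iterable

SCTP_COMMON_HEADER = 12  # bytes: ports, Verification Tag, checksum
DATA_CHUNK_HEADER = 16   # bytes: type, flags, length, TSN, stream fields, PPID


def max_chunk_payload(path_mps_values: Iterable[int]) -> int:
    """Largest DATA chunk payload (bytes) that fits on every path when a
    packet carries one unbundled chunk. Hypothetical helper."""
    room = min(path_mps_values) - SCTP_COMMON_HEADER - DATA_CHUNK_HEADER
    # Chunks are padded to a 4 byte boundary, so round down to keep the
    # padding from pushing the packet past the smallest MPS.
    return max(room - room % 4, 0)


# Three paths with different MPS values; the smallest one governs.
print(max_chunk_payload([1480, 1004, 8980]))
```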

The recommended method for generating probes is to add a chunk consisting only of padding to an SCTP message. There are two methods to implement this padding.

In method 1, the message is padded with an SCTP heartbeat (HB) of the size necessary to construct an IP packet of the desired probe size. The peer SCTP implementation will acknowledge a successful probe without delay by returning the same heartbeat as a HEARTBEAT-ACK. This method is fully compatible with current SCTP standards and implementations, but is exposed to MPS limitations on the return path, which might cause the HEARTBEAT-ACK to be lost.

In method 2, a new "PAD" chunk type would have to be defined. This chunk would be silently discarded by the peer. The PAD chunk could be attached to another message (either a minimum length HB or other application data which will be acknowledged by the peer) to build a probe packet. The default action for an unknown chunk type in the range 128 to 190 (high bits = 10) is to "Skip this chunk and continue processing" [RFC2960], which is exactly the required behavior for a PAD chunk. Any currently unused type in this range will work for a PAD chunk type. This method is fully compatible with all current SCTP implementations, but requires adding a new type to the current standards. It has the advantage that restrictions due to the return path MPS are not applied to the forward path.

The verification phase is most efficiently implemented by picking a new chunk size such that the new MPS and all of the old multi-path MPSs are larger than different multiples of the new chunk size, by at least the required header sizes. This approach permits chunks from SCTP application messages to be assembled into packets that are suitable for any path to the peer at either the old or new MPS.
This is the easiest method to permit the provisional MPS to be withdrawn if there are losses during the verification phase.

Once each of the old path MPSs has been updated to a new verified MPS, SCTP may be able to pick a new larger chunk size that will fit into all paths. However, if the MPS is later reduced (say due to a routing change and subsequent ICMP PTB message), SCTP will be forced to use IP fragmentation to transmit application messages that are already chunkified, as described in section 7.3.

The constraints on efficiently choosing chunk sizes are complicated enough to make it difficult, if not impossible, to efficiently support arbitrary combinations of old and new MPSs. It greatly simplifies the implementation to add constraints, such as making the chunk size itself a multiple of some common size, such as 512 bytes. This in turn constrains the searching algorithm to test MPSs that are multiples of 512 bytes, plus the appropriate headers. Clearly the PLPMTUD search heuristic for SCTP must be constrained to pick candidate MPSs that are consistent with the limitations of the algorithm for choosing appropriate chunk sizes.

The SCTP Verification Tag is designed to increase SCTP's robustness in the presence of a number of attacks, including forged ICMP messages. It is a 32 bit value which is initialized to a random value during connection establishment and placed in the first 64 bits of all SCTP messages. All subsequent messages (including ICMP messages, which copy at least the first 64 bits of the message) must match the original Verification Tag, or they are rejected as being likely attacks against the connection.

It is believed that the Verification Tag mechanism is strong enough that SCTP could unconditionally process ICMP PTB messages that would reduce the path MPS at arbitrary times.
As written, this document does not encourage this method. The PLPMTUD ICMP validity checks are cascaded with the SCTP checks, such that the messages are processed only if they meet all consistency checks for both protocols. In particular, PLPMTUD only uses the ICMP MPS value following a probe, during MPS verification, or following a full stop timeout.

Alternatively, an SCTP implementation could suppress some of the checks in section 6.2.1.

8.3 Probing method for IP fragmentation

As mentioned in section 6.4, datagram protocols (such as UDP) might rely on IP fragmentation as a packetization layer. However, implementing PLPMTUD with IP fragmentation is problematic because the IP layer has no mechanism to determine if the packets are ultimately delivered properly to the far node, without participation by the application.

To support IP fragmentation as a packetization layer under an unmodified application, we propose the use of an adjunct MTU measurement protocol (ICMP ECHO) and a separate path MTU discovery daemon (described here) to perform PLPMTUD and update the stored path MTU information.

For IP fragmentation the initial MPS should be selected as described in section 7.4, except with a separate global control for the default initial MPS for connectionless protocols. Since connectionless protocols may not keep enough state to effectively diagnose MTU black holes, it would be more robust to err on the side of using too small an initial MTU (e.g. 1 kByte or less) prior to initiating probing of the path to measure the MTU.

Since many protocols that rely on IP fragmentation are connectionless, there is an additional problem with the path information cache: there are no events corresponding to connection establishment and tear-down to use to manage the cache itself.
We take this approach: if there is no entry in the path information cache for a particular packet being transmitted, the IP layer uses an immutable cache entry for the "default path", which has an MPS that is fixed at the initial value. A new path cache entry is not created until there is an attempt to set the MPS.

The path MTU discovery daemon should be triggered as a side effect of IP fragmentation. Once the number of fragmented datagrams via any particular path reaches some configurable threshold (say 5 datagrams), the daemon can start probing the path with ICMP ECHO packets. These probes must use the diagnostic interface described in section 9.4 and have DF set. The daemon can implement all of the PLPMTUD probe sequence and search strategy, collect all of the ICMP responses (ECHO REPLY, ICMP PTB, etc.) and only update the saved path MTU in the path information cache in the IP layer.

Alternatively, most of the PLPMTUD state machinery can be implemented within the path information cache in the IP layer, which can specifically invoke the path MTU discovery daemon to perform specified measurements on specific paths and report the results back to the IP layer.

Using ICMP ECHO to measure the MTU has a number of potential robustness problems. Note that the most likely failures are due to losses unrelated to MTU (e.g. nodes that discriminate on the basis of protocol type). These non-MTU losses can prevent PLPMTUD from raising the MTU, forcing the Packetization Layer protocol to use a smaller MTU than necessary. Since these failures are not likely to cause interoperability problems, they are relatively benign.

However, there do exist other more serious failure modes, such as layer 3 or 4 routers choosing different paths for different protocol types or sessions.
In such environments, adjunct protocols may experience different MTUs than the primary protocol. If the adjunct protocol has a larger MTU than the primary protocol, PLPMTUD will select a non-functional MTU. This does not seem to be a likely situation.

8.4 Probing method for applications

The disadvantages of probing with ICMP ECHO can be overcome by implementing the path MTU discovery daemon within the application itself, using the application's own protocol.

The application must have some suitable method for generating probes. The ideal situation is a lightweight echo function that confirms message delivery, plus a mechanism for padding the messages out to the desired MTU, such that the padding is not echoed. This combination (akin to the SCTP HB plus PAD) is preferred because you can send large probes that cause small acknowledgements. For protocols that cannot implement these messages directly there are often alternate methods for generating probes. E.g. the protocol may have a variable length echo (that measures both the forward and return path) or, if there is no echo function, there may be a way to add padding to regular messages carrying real application data. There may be other ways to generate probes. As a last resort, it may be feasible to extend the protocol with new message types to support MTU discovery.

Probing within an application introduces one new issue: many applications do not currently concern themselves with MTU and rely on IP fragmentation to deliver datagrams that just happen to be larger than the path MTU. PLPMTUD requires that the protocol can send probes that are larger than the IP layer's current notion of the path MTU, but are marked not to be fragmented. This requires an alternate method for sending these datagrams.

As with ICMP MTU probing, there is considerable flexibility in how the PLPMTUD algorithms can be divided between the application and the path information cache.

Some applications send large datagrams no matter what the link size, and rely on IP fragmentation to deliver the datagrams. It has been known for a long time that this has some undesirable consequences [@@harm1]. Recently it has come to light that IPv4 fragmentation is not sufficiently robust for general use in today's Internet. The 16-bit IP identification field is not large enough to prevent frequent misassociated IP fragments, and the TCP and UDP checksums are insufficient to prevent the resulting corrupted data from being delivered to higher protocol layers. [@@harm2]

Nonetheless, there are a number of higher layer protocols, such as NFS [@@NFS], which use IP fragmentation as a mechanism to reduce CPU load. NFS typically sends fragmented 8 kByte datagrams over all link types, no matter what the link MTU. The other common case, in which the application wants to use the largest possible datagram that fits within the MTU, is most easily treated as a special case of the fragmenting case.

9. Operational Integration

This section describes ways to minimize deployment problems for PLPMTUD, by including a number of good management features: mechanisms to diagnose problems with path MTU discovery, and configuration controls such that the more risky properties can be progressively deployed. We also address some potentially serious interactions with nodes that do not honor the DF bit.

9.1 Interoperation with prior algorithms

Properly functioning Path MTU discovery is critical to the robust and efficient operation of the Internet. Any major change (as described in this document) has the potential to be very disruptive if it contains any errors or oversights.
Therefore, we offer a deployment strategy in which classical PMTUD operation as described in RFC 1191 and RFC 1981 is unmodified and PLPMTUD is only invoked following a full stop timeout, presumably due to an "ICMP black hole". To do this:

   o  Relax the ICMP checks in section 6.2.1 specifically to allow an ICMP Packet Too Big message to reduce the MTU at arbitrary times.

   o  When there is no cached MTU, use the Interface MTU as specified by classical PMTU discovery, rather than the initial MTU as specified in section 7.4.

   o  MTU searching as described in section 7.5 is disabled by setting SEARCH_HIGH equal to SEARCH_LOW and the initial MPS.

   o  A full stop timeout is processed as described in section 6.2.4. This becomes the only mechanism to invoke the rest of PLPMTUD.

When configured in this manner, PLPMTUD will increase the robustness of classical PMTU discovery in the presence of ICMP black holes and other ICMP problems, with minimal exposure to unanticipated problems during deployment. Since this configuration does not help robustness in the presence of malicious or erroneous ICMP messages, it is not recommended for the long term.

9.2 Operation over subnets with dissimilar MTUs

With classical PMTUD, the ingress router to a subnet is responsible for knowing what size packets can be delivered to every node attached to that subnet. For most subnet types, this requires that the entire subnet has a single MTU which is common to every attached node. (For a few subnet types, such as ATM [12], the nodes on a subnet can negotiate the MTU on a pairwise basis, and the ingress router is responsible for knowing the MTU to each of its peers.)

This requirement has proven to be a major impediment to deploying larger MTUs in the operational Internet.
Often one single node which does not support a larger MTU effectively vetoes raising the MTU on a subnet, because the ingress router does not have a mechanism to generate the proper ICMP PTB message for the one attached node with a smaller MTU.

With PLPMTUD, this requirement is completely relaxed. As long as oversized packets addressed to nodes with the smaller MTU are reliably discarded, PLPMTUD will find the proper MTU for these nodes.

Once there is sufficient field experience to demonstrate that PLPMTUD is robust, we recommend that OS vendors consider updating default MTUs for Network Interface Cards. It would raise the overall performance of the Internet if all NICs were configured to default to the MTU which is most efficient for the NIC (lowest overhead per byte), rather than the standard MTU for the media or switch. This is most likely to be the largest MTU supported by the NIC chip set or some other logical boundary, such as memory page sizes.

9.3 Interoperation with tunnels

PLPMTUD is specifically designed to solve many of the problems that people are experiencing today due to poor interactions between classical MTU discovery, IPsec, and various sorts of tunnels [5]. As long as the tunnel reliably discards packets that are too large, PLPMTUD will discover an appropriate MTU for the path.

Unfortunately, due to the pervasive problems with classical PMTU discovery, many manufacturers of various types of VPN/tunneling equipment have resorted to ignoring the DF bit under some conditions.

This not only violates the IP standard and many recommendations to the contrary [17][18], it also violates the only requirement that PLPMTUD places on the link layer: that oversized packets are reliably discarded. It is imperative that people understand the impact of ignoring the DF bit both to applications and to PLPMTUD.

   We do understand the reality of the situation.  It is important that
   vendors who are building devices that violate the DF specification
   understand that PLPMTUD requires that probe packets be discarded, and
   that sending ICMP PTB messages alone is insufficient to prevent
   wholesale fragmentation if the probe packets are delivered.

   Therefore, it is imperative that devices that do not honor DF include
   packet size history caches and other heuristics to robustly detect
   and discard probe packets, if delivering them would require
   fragmentation.

9.4 Diagnostic tools

   All implementations MUST include facilities for MTU discovery
   diagnostic tools that implement PLPMTUD or other MTU discovery
   algorithms in user mode without help or interference by the PMTUD
   algorithm present in the operating system.  This requires a
   mechanism where a diagnostic application can send packets that are
   larger than the operating system's notion of the current path MTU and
   for the diagnostic application to collect any resulting ICMP PTB
   messages or other ICMP messages.  For IPv4, the diagnostic
   application must be able to set the DF bit.

   At this time nearly all operating systems support two modes for
   sending UDP datagrams: one which silently fragments packets that are
   too large, and another that rejects packets that are too large.
   Neither of these modes is suitable for efficiently diagnosing
   problems with MTU discovery, such as routers that return ICMP PTB
   messages containing incorrect size information.

9.5 Management interface

   It is suggested that an implementation provide a way for a system
   utility program to:
   o  Globally disable all ICMP Packet Too Large message processing.
   o  Globally suppress some or all ICMP consistency checks described in
      section 6.2.1.
      Setting this option forgoes some possible security improvements,
      in exchange for making PLPMTUD behave more like classical PMTU
      discovery.  (See section 9.1)
   o  Globally permit ICMP Packet Too Large messages to unconditionally
      reduce the MTU, even if there were no lost packets.  Setting this
      option forgoes some possible security improvements, in exchange
      for making PLPMTUD behave more like classical PMTU discovery.
      (See section 9.1)
   o  Globally adjust timer intervals for specific classes of probe
      failures.

   In addition, it is important that there be a mechanism to permit per
   path controls to override specific parts of the PLPMTUD algorithm.
   All of these per path controls should be preset from similar global
   controls:
   o  Disable MTU searching on a given path, such that new MTU values
      are never probed.
   o  Set the initial MTU for a given path.  This could be used to speed
      convergence in relatively static environments.  There should be an
      option to cause PLPMTUD to choose the same initial value as would
      be chosen by classical PMTU discovery (i.e., typically the
      Interface MTU).  This is used in the mode described in section 9.1
      where PLPMTUD is used only for black hole detection in classical
      PMTU discovery.
   o  Limit the maximum probed MTU for a given path.  This permits a
      manual configuration to work around a link that spuriously
      delivers packets that are larger than the useful path MTU.
   o  Per path and per application controls to disable ICMP processing,
      to further limit possible damage from malicious ICMP PTB messages
      (in addition to the global controls).

10. References

10.1 Normative References

   [1]   Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

   [2]   Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
         November 1990.

   [3]   McCann, J., Deering, S. and J.
         Mogul, "Path MTU Discovery for IP version 6", RFC 1981,
         August 1996.

   [4]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [5]   Kent, S. and R. Atkinson, "Security Architecture for the
         Internet Protocol", RFC 2401, November 1998.

   [6]   Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
         Initial Window", RFC 2414, September 1998.

   [7]   Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
         Specification", RFC 2460, December 1998.

   [8]   Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914,
         September 2000.

   [9]   Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
         H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson,
         "Stream Control Transmission Protocol", RFC 2960, October 2000.

10.2 Informative References

   [10]  Mogul, J., Kent, C., Partridge, C. and K. McCloghrie, "IP MTU
         discovery options", RFC 1063, July 1988.

   [11]  Knowles, S., "IESG Advice from Experience with Path MTU
         Discovery", RFC 1435, March 1993.

   [12]  Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626,
         May 1994.

   [13]  Sung, T., "TCP And UDP Over IPX Networks With Fixed Path MTU",
         RFC 1791, April 1995.

   [14]  Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809,
         June 1995.

   [15]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923,
         September 2000.

   [16]  Stewart, R., "Stream Control Transmission Protocol (SCTP)
         Implementors Guide", draft-ietf-tsvwg-sctpimpguide-10 (work in
         progress), December 2003.

   [17]  Kent, C. and J. Mogul, "Fragmentation considered harmful",
         Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

   [18]  Mathis, M., Heffner, J. and B. Chandler, "Fragmentation
         Considered Very Harmful", draft-mathis-frag-harmful-00 (work in
         progress), July 2004.

Authors' Addresses

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA  15213
   US

   Phone: 412-268-3319
   EMail: mathis@psc.edu

   John W. Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA  15213
   US

   Phone: 412-268-2329
   EMail: jheffner@psc.edu

   Kevin Lahey
   Freelance

   EMail: kml@patheticgeek.net

Appendix A. Security Considerations

   Under all conditions the PLPMTUD procedure described in this document
   is at least as secure as the current standard path MTU discovery
   procedures described in RFC 1191 [2] and RFC 1981 [3].

   In the recommended configuration, PLPMTUD is significantly harder to
   attack than current procedures, because ICMP messages are cached and
   only processed in connection with lost packets.  This effectively
   prevents blind attacks on the path MTU discovery system.

   Furthermore, since this algorithm is designed for robust operation
   without any ICMP (or other messages from the network), it can be
   configured to ignore all ICMP messages (globally or on a per
   application basis).  In this configuration it cannot be attacked,
   unless the attacker can identify and selectively cause probe packets
   to be lost.

Appendix B. IANA considerations

   None.

Appendix C. Acknowledgements

   Many ideas and even some of the text come directly from RFC 1191 and
   RFC 1981.

   Many people made significant contributions to this document,
   including: Randall Stewart for SCTP text, Michael Richardson for
   material from an earlier ID on tunnels that ignore DF, Stanislav
   Shalunov for the idea that pure PLPMTUD parallels congestion control,
   and Matt Zekauskas for maintaining focus during the meetings.  Thanks
   to the early implementors: Kevin Lahey, John Heffner and Rao Shoaib
   who provided concrete feedback on weaknesses in earlier drafts.
   Thanks also to all of the people who made constructive comments in
   the working group meetings and on the mailing list.  I am sure I have
   missed many deserving people.

   Matt Mathis and John Heffner are supported in this work by a grant
   from Cisco Systems, Inc.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2004).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.