Network Working Group                                          M. Mathis
Internet-Draft                                                J. Heffner
Expires: November 30, 2004                                           PSC
                                                                K. Lahey
                                                               Freelance
                                                               June 2004


                           Path MTU Discovery
                       draft-ietf-pmtud-method-02

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   and any of which I become aware will be disclosed, in accordance with
   RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 30, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   This document describes a robust new method for Path MTU Discovery
   that relies on TCP or another Packetization Layer to probe an
   Internet path with progressively larger packets.  This method is
   described as an extension to RFC 1191 and RFC 1981, which specify
   ICMP based Path MTU Discovery for IP versions 4 and 6, respectively.
   This document does not define a protocol, but rather a method to use
   features of existing protocols to discover the path MTU.

   The general strategy of the new algorithm is to start with a small
   MTU and probe upward, testing successively larger MTUs by probing
   with single packets.  If the probe is successfully delivered, then
   the MTU is raised.  If the probe is lost, it is treated as an MTU
   limitation and not as a congestion signal.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Overview
   4.  Requirements
   5.  Implementation Issues
     5.1   Layering
       5.1.1   Accounting for Header Sizes
       5.1.2   Storing PMTU information
     5.2   Lower Layers
       5.2.1   Generating Probes
       5.2.2   Selecting the initial MTU
       5.2.3   Normal sequence of events to raise the MTU
       5.2.4   Processing MTU Indications
       5.2.5   Probing Intervals
       5.2.6   Host fragmentation
       5.2.7   Multicast
     5.3   Search Strategy
       5.3.1   Search
       5.3.2   Monitor
       5.3.3   Suspend
     5.4   Specific Packetization Layers
       5.4.1   Probing method using TCP
       5.4.2   Probing method using SCTP
       5.4.3   Probing Method for IP Fragmentation
       5.4.4   Issues for other transport protocols
     5.5   Operational Integration
       5.5.1   Interoperation with prior algorithms
       5.5.2   Interoperation over subnets with dissimilar MTUs
       5.5.3   Interoperation with tunnels
       5.5.4   Diagnostic tools
       5.5.5   Management interface
   6.  References
     6.1   Normative References
     6.2   Informative References
       Authors' Addresses
   A.  Security Considerations
   B.  IANA considerations
   C.  Acknowledgements
       Intellectual Property and Copyright Statements

1.  Introduction

   This document describes a method for Packetization Layer Path MTU
   Discovery (PLPMTUD) which is an extension to existing Path MTU
   discovery methods as described in RFC 1191 [2] and RFC 1981 [3].  The
   proper MTU is determined by starting with small packets and probing
   with successively larger packets.  The bulk of the algorithm is
   implemented above IP, in the transport layer (e.g. TCP) or other
   "Packetization Protocol" that is responsible for determining packet
   boundaries.

   This document draws heavily on RFC 1191 [2] and RFC 1981 [3] for
   terminology, ideas and some of the text.

   The methods described in this document apply to both IPv4 and IPv6,
   and to many transport protocols.  This document does not define a
   protocol, but rather a method to use features of existing protocols
   to discover the path MTU.  It does not require cooperation from the
   lower layers (except that they are consistent about what packet sizes
   are acceptable) or the far node.  Variants in implementations will
   not cause interoperability problems.

   The methods described in this document are carefully designed to
   maximize robustness in the presence of less than ideal
   implementations of other protocols or Internet components.

   For the sake of clarity we uniformly prefer TCP and IPv6 terminology.
   In the terminology section we also present the analogous IPv4 terms
   and concepts for the IPv6 terminology.  In a few situations we
   describe specific details that are different between IPv4 and IPv6.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [4].

   This draft is a product of the Path MTU Discovery (pmtud) working
   group of the IETF.  Please send comments and suggestions to
   pmtud@ietf.org.  Interim drafts and other useful information will be
   posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .

2.  Terminology

   IP  Either IPv4 [1] or IPv6 [7].

   node  A device that implements IP.

   router  A node that forwards IP packets not explicitly addressed to
      itself.

   host  Any node that is not a router.

   upper layer  A protocol layer immediately above IP.  Examples are
      transport protocols such as TCP and UDP, control protocols such as
      ICMP, routing protocols such as OSPF, and Internet or lower-layer
      protocols being "tunneled" over (i.e., encapsulated in) IP such as
      IPX, AppleTalk, or IP itself.

   link  A communication facility or medium over which nodes can
      communicate at the link layer, i.e., the layer immediately below
      IP.  Examples are Ethernets (simple or bridged); PPP links; X.25,
      Frame Relay, or ATM networks; and Internet (or higher) layer
      "tunnels", such as tunnels over IPv4 or IPv6.  Occasionally we use
      the slightly more general term "lower layer" for this concept.

   interface  A node's attachment to a link.

   address  An IP-layer identifier for an interface or a set of
      interfaces.

   packet  An IP header plus payload.

   MTU  Maximum Transmission Unit, the size in bytes of the largest IP
      packet, including the IP header and payload, that can be
      transmitted on a link or path.  Note that this could more properly
      be called the IP MTU, to be consistent with how other standards
      organizations use the acronym MTU.

   link MTU  The Maximum Transmission Unit, i.e., maximum IP packet size
      in bytes, that can be conveyed in one piece over a link.  Beware
      that this definition differs from the definition used by other
      standards organizations.

      For IETF documents, link MTU is uniformly defined as the IP MTU
      over the link.  This includes the IP header, but excludes link
      layer headers and other framing which is not part of IP or the IP
      payload.

      Beware that other standards organizations generally define link
      MTU to include the link layer headers.

   path  The set of links traversed by a packet between a source node
      and a destination node.

   PMTU, path MTU  The minimum link MTU of all the links in a path
      between a source node and a destination node.

   classical PMTU discovery  The process described in RFC 1191 and RFC
      1981, in which nodes rely on ICMP "Packet Too Big" messages to
      learn the MTU of a path.

   PL, packetization layer  The layer of the network stack which
      segments data into packets.

   PLPMTUD  Packetization Layer Path MTU Discovery, the method described
      in this document, which is an extension to classical PMTU
      discovery.

   Packet Too Big message  An ICMP message reporting that an IP packet
      is too large to forward.  This is the IPv6 term that corresponds
      to the IPv4 "ICMP Can't fragment" message.

   flow  A context in which MTU discovery is applied.  This is naturally
      an instance of the packetization protocol, e.g. one side of a TCP
      connection.

   MPS  The maximum IP payload size available over a specific path.
      This is typically the path MTU minus the IP header.
      As an example, this is the maximum TCP packet size, including TCP
      payload and headers but not including IP headers.  This has also
      been called the "L3 MTU".

   MSS  The TCP Maximum Segment Size, the maximum payload size available
      to the TCP layer.  This is typically the path MPS minus the size
      of the TCP header.

   probe packet  A packet which is being used to test a path for a
      larger MTU.

   probe size  The size of a packet being used to probe for a larger
      MTU.

   successful probe  The probe packet was delivered through the network
      and acknowledged by the Packetization Layer on the far node.

   inconclusive probe  The probe packet was not delivered, but there
      were other lost packets close enough to the probe that it cannot
      be presumed that the probe was lost because it was larger than the
      path MTU.  By implication the probe might have been lost due to
      something other than MTU (such as congestion), so the results are
      inconclusive.  Inconclusive probes are generally repeated at the
      same probe size, after a suitable delay.

   failed probe  The probe packet was not delivered and there were no
      other lost packets close to the probe.  This is taken as an
      indication that the probe was larger than the path MTU, and future
      probes should generally be at smaller sizes.

   errored probe  There were losses or timeouts during the verification
      phase which suggest a potentially disruptive failure or network
      condition.  These are generally retried only after substantially
      longer intervals.

   probe gap  The payload data that will be lost and need to be
      retransmitted if the probe is not delivered.

   probe phase  The interval (time or protocol events) between when a
      probe is sent, and when it is determined that the probe succeeded,
      failed or was inconclusive.

   verification phase  An additional interval during which the new path
      MTU is considered provisional.
      Packet losses or timeouts are treated as an indication that there
      may be a problem with the provisional MTU.

   transition phase  The interval between the probe phase and the
      verification phase, during which packets using the new MTU
      propagate to the far node and the acknowledgment propagates back.

   full stop timeout  A timeout where none of the packets transmitted
      after some event are acknowledged by the receiver, including any
      retransmissions.  This is taken as an indication of some failure
      condition in the network, such as a routing change onto a link
      with a smaller MTU.  For the sake of PLPMTUD we suggest the
      following definition of a full stop timeout: the loss of one full
      window of data and at least one retransmission or at least 6
      consecutive packets including at least 2 retransmissions (along
      with two retransmission timer expirations).  [@@@ This probably
      needs some experimentation.]

   search strategy  The heuristics used to choose successive probe sizes
      to converge to the proper path MTU, as described in Section 5.3.

3.  Overview

   This document describes a method for TCP or other packetization
   protocols to dynamically discover the MTU of a path without relying
   on explicit signals from the network.  These procedures are
   applicable to TCP and other transport- or application-level
   packetization protocols in which the receiver always reports to the
   sender complete information about which packets were lost in the
   network.

   The general strategy of the new procedure is for the packetization
   layer to find the proper MTU by probing with progressively larger
   packets, without disrupting its normal protocol operation.  If a
   probe packet is successfully delivered, then the path MTU is
   provisionally raised.
   If there are no additional losses during the subsequent verification
   phase, then the path MTU is confirmed (verified) to be at least as
   large as the provisional MTU.  PLPMTUD can then probe again with an
   even larger MTU, according to the MTU search strategy described in
   Section 5.3.

   The verification phase is used to detect some situations where
   raising the MTU raises the packet loss rate.  For example, if a link
   is striped across multiple physical channels with inconsistent MTUs,
   it is possible that a probe will be delivered even if it is too large
   for some of the physical channels.  In such cases raising the path
   MTU to the probe size will cause severe periodic loss and abysmal
   performance.  The verification phase is designed to prevent the path
   MTU from being raised if doing so causes excessive packet losses.

   A conservative implementation of PLPMTUD would use a full round trip
   time for the verification phase.  In this case each time PLPMTUD
   raises the MTU it takes three full round trip times to do so.  It
   takes one round trip for the probe phase, during which the probe
   propagates to the far node and an acknowledgment is returned.  The
   second round trip is the transition phase, during which data packets
   using the provisional MTU propagate to the far node and are
   acknowledged.  During the third and final round trip time, it is
   verified that raising the MTU does not cause excessive loss.

   The isolated loss of a probe packet (with or without a Packet Too Big
   message) is treated as an indication of an MTU limit, and not as a
   congestion indicator.  In this case alone, the packetization protocol
   is permitted to retransmit the probe gap without adjusting the
   congestion window.

   If there is a timeout or any additional lost packets during any of
   the three phases, the loss is treated as a congestion indication as
   well as an indication of some sort of failure of the PLPMTUD process.
   The congestion indication is treated like any other congestion
   indication: window or rate adjustments are mandatory per the relevant
   congestion control standards [8].  Probing can resume with some new
   probe size after a delay which is determined by the nature of the
   indicated failure.

   The most likely (and least serious) PLPMTUD failure is the link
   experiencing legitimate congestion related losses at about the same
   time as the probe.  In this case, it is appropriate to retry the
   probe (with the same probe size) as soon as the packetization layer
   has fully adapted to the congestion and recovered from the losses.

   In other cases, additional losses or timeouts indicate problems with
   the link or packetization layer, and that probes may be disruptive.
   In these situations it is desirable to use progressively longer
   delays depending on the severity of the failure and whether it
   persists.

   PLPMTUD can optionally process Packet Too Big messages to select the
   provisional MTU for faster convergence in exchange for a slight
   decrease in robustness.  Processing malicious or erroneous Packet Too
   Big messages can cause PLPMTUD to arrive at the incorrect MTU for a
   path, which is likely to reduce protocol performance.  There are
   several different options for processing Packet Too Big messages: at
   one extreme they could be completely ignored; at the other extreme,
   all of them could be accepted (fully implementing classical PMTUD
   within PLPMTUD).  We advocate a compromise, where Packet Too Big
   messages are only processed in conjunction with probes (described in
   Section 5.2.4.1), and Packetization Layer timeouts (described in
   Section 5.2.4.3).
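
   The probe outcomes defined in Section 2 and the loss handling
   described in this section can be sketched as a small classifier.
   This is only an illustration under stated assumptions: the names
   ProbeOutcome, ProbeObservation, classify and
   reduce_congestion_window are hypothetical and do not appear in this
   memo, and a real Packetization Layer would derive the three input
   flags from its own loss recovery state.

```python
from dataclasses import dataclass
from enum import Enum


class ProbeOutcome(Enum):
    SUCCESS = "success"            # probe delivered and acknowledged
    FAILED = "failed"              # isolated probe loss: taken as an MTU limit
    INCONCLUSIVE = "inconclusive"  # probe lost together with nearby packets
    ERRORED = "errored"            # loss or timeout during verification


@dataclass
class ProbeObservation:
    # Hypothetical summary of what the Packetization Layer observed.
    probe_acked: bool                   # far node acknowledged the probe
    other_losses: bool                  # other packets were lost near the probe
    verification_trouble: bool = False  # loss/timeout in the verification phase


def classify(obs: ProbeObservation) -> ProbeOutcome:
    """Map an observation onto the probe outcomes of Section 2."""
    if obs.probe_acked:
        if obs.verification_trouble:
            return ProbeOutcome.ERRORED
        return ProbeOutcome.SUCCESS
    if obs.other_losses:
        return ProbeOutcome.INCONCLUSIVE
    return ProbeOutcome.FAILED


def reduce_congestion_window(outcome: ProbeOutcome) -> bool:
    """Only an isolated probe loss is exempt from congestion response."""
    return outcome in (ProbeOutcome.INCONCLUSIVE, ProbeOutcome.ERRORED)
```

   In this sketch a failed probe leads to a smaller probe size without
   a window reduction, while inconclusive and errored probes always
   trigger the normal congestion response before probing resumes.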

   Relatively few details of this procedure affect interoperability with
   other standards or Internet protocols.  These details are specified
   in RFC 2119 standards language in Section 4.

   Most of the difficulty in implementing PLPMTUD arises because it
   needs to be implemented in several different places within a single
   node.  In general each packetization protocol needs to have its own
   implementation of PLPMTUD.  Furthermore, the natural mechanism to
   share path MTU information between concurrent or subsequent
   connections over the same path is a path information cache in the IP
   layer.  The various packetization protocols need to have the means to
   access and update the shared cache in the IP layer.  This memo
   describes PLPMTUD in terms of its primary subsystems without fully
   describing how they are assembled into a complete implementation.
   Section 5 describes: the separation into layers; the mechanics of
   probing from the point of view of the lower layers; Maximum Payload
   Size search heuristics; implementation in specific Packetization
   Layers; and operational integration issues.

   The vast majority of the implementation details are recommendations
   based on experiences with earlier versions of path MTU discovery.
   These are motivated by a desire to maximize robustness of PLPMTUD in
   the presence of less than ideal implementations as they exist in the
   field.

4.  Requirements

   All Internet nodes SHOULD implement PLPMTUD in order to discover and
   take advantage of the largest MTU supported along the Internet path.

   Links MUST NOT deliver packets that are larger than their MTU.  Links
   that have parametric limitations (e.g. MTU bounds due to limited
   clock stability) MUST include explicit mechanisms to consistently
   reject packets that might otherwise be nondeterministically
   delivered.

   All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
   functionality.
   All fragmentation SHOULD be done on the host, and all IPv4 packets,
   including fragments, SHOULD have the DF bit set such that they will
   not be fragmented (again) in the network.  See Section 5.2.6.

   The requirements below only apply to those implementations that
   include PLPMTUD.

   If the Packetization Layer uses application data to implement PLPMTUD
   it MUST use a loss reporting mechanism (e.g. TCP SACK) which avoids
   spurious retransmission of other data when a probe packet is lost.

   A Packetization Layer using application data for probes MUST NOT send
   a probe unless it has sufficient following data available to send
   such that a lost probe will trigger Fast Retransmit or a similar data
   recovery algorithm.

   A Packetization Layer using application data for probes SHOULD NOT
   send a probe packet unless the flow is expected to have at least the
   3 round trips' worth of data needed to successfully complete the
   probe, transition and verification phases.

   Normal congestion control algorithms MUST remain in effect under all
   conditions except when only an isolated probe packet is detected to
   be lost.  In this case alone the normal congestion (window or data
   rate) reduction can be suppressed.  If any other lost data is
   detected, all normal congestion control MUST take place.

   When a probe is lost and normal congestion control is suppressed as
   permitted above, then the Packetization Layer MUST NOT probe again
   until at least an interval equal to the normal congestion control
   cycle has elapsed.  For TCP and TCP friendly protocols this generally
   means one round trip of elapsed time for each packet permitted under
   the current congestion window.

   If PLPMTUD updates the MTU for a particular path, all Packetization
   Layer sessions that share the flow (path) must be notified.
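
   The probe eligibility requirements above can be sketched as three
   small checks.  This is a hedged illustration, not part of the
   method: all function and parameter names are hypothetical, the
   duplicate acknowledgment threshold of 3 is the conventional TCP Fast
   Retransmit value rather than something this memo specifies, and one
   round trip's worth of data is approximated here by one congestion
   window.

```python
# Assumption: conventional TCP Fast Retransmit fires on 3 duplicate ACKs.
DUPACK_THRESHOLD = 3


def probe_allowed(segments_ready_after_probe: int) -> bool:
    """MUST NOT probe unless enough following data exists that a lost
    probe would trigger Fast Retransmit (or similar recovery)."""
    return segments_ready_after_probe >= DUPACK_THRESHOLD


def probe_advisable(bytes_expected: int, cwnd_bytes: int,
                    round_trips_needed: int = 3) -> bool:
    """SHOULD NOT probe unless the flow is expected to carry data for the
    probe, transition and verification phases (3 round trips).
    Assumption: one round trip carries about one congestion window."""
    return bytes_expected >= round_trips_needed * cwnd_bytes


def min_reprobe_interval(cwnd_packets: int, rtt_seconds: float) -> float:
    """After a lost probe with congestion control suppressed, wait at
    least one round trip per packet in the current congestion window."""
    return cwnd_packets * rtt_seconds
```

   For example, with a congestion window of 10 packets and a 50 ms
   round trip time, the last function yields a minimum wait of roughly
   half a second before the next probe.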

   Whenever the MTU is raised, the congestion state variables must be
   rescaled so as not to raise the window size in bytes (or data rate in
   bytes per second).

   Whenever the MTU is reduced (e.g. when unconditionally processing
   ICMP Packet Too Big messages) the congestion state variables must be
   rescaled so as not to raise the window size in packets.

   All implementations MUST include a mechanism to implement diagnostic
   tools that do not rely on the operating system's implementation of
   path MTU discovery.  This specifically requires the ability to send
   packets that are larger than the known MTU for the path, and to
   collect any resultant ICMP error messages.  See Section 5.5.4.

5.  Implementation Issues

   This section discusses a number of issues related to the
   implementation of Path MTU Discovery.  This is not a specification,
   but rather a set of notes provided as an aid for implementers.

   The issues include:
   o  The separation into layers.
   o  The Mechanics of Probing, as seen by IP and below.
   o  Search Strategy.
   o  How to implement PLPMTUD in specific Packetization Layers.
   o  How to improve Operational Integration and deployment.

5.1  Layering

5.1.1  Accounting for Header Sizes

   Packetization Layer Path MTU Discovery is most easily implemented by
   splitting its functions between layers.  The IP layer is in the best
   place to keep shared state, collect the ICMP messages, track IP
   header sizes and manage MTU information from the link layer
   interfaces.  However, the procedures that PLPMTUD uses for probing,
   verification and scanning for the path MTU are very tightly coupled
   to the data recovery and congestion control state machines in the
   Packetization Layer.  The most difficult part of implementing
   PLPMTUD is properly splitting the implementation between the layers.

   Note that this layering is consistent with the advice in the current
   PMTUD specifications [2][3].
   Today, many implementations of classical PMTU Discovery are already
   split along these same layers.

   Early implementation of PLPMTUD revealed that it is critically
   important to have a good clean mechanism for accounting for header
   sizes at all layers.  This is because each Packetization Layer does
   its calculations in its own natural data units, which are almost
   always a reflection of the service that the Packetization Layer
   provides to the application or other upper layers.  For example, TCP
   naturally performs all of its calculations in terms of sequence
   numbers and segment sizes.  The size of the probe gap is the size of
   the data segment that was carried by the probe packet.  However, the
   MTU size being probed, ICMP MTU, etc. are measures of full packets,
   which not only include the TCP data (measured in sequence space) but
   also include fixed TCP and IP headers, and may include IPv6 extension
   headers or IPv4 options, TCP options and even IPsec AH or ESP headers
   as well.

   PLPMTUD requires frequent translation between these two domains: the
   Packetization Layer's natural data unit and full IP packet sizes.
   While there are a number of possible ways to accurately implement
   dual size measures, our experience has been that it is best if the IP
   layer and the Packetization Layer communicate across their boundary
   in terms of the IP Maximum Payload Size or MPS.  The MPS is the only
   size measure that is common to both the IP and Packetization Layers,
   because it exactly matches the boundary between the layers.  The IP
   Layer is responsible for adding or deducting its own headers when
   translating between MTU and MPS.  Likewise the Packetization Layer is
   responsible for adding or deducting its own headers when performing
   calculations in its natural data units.

   This document does not take a stance on the placement of IPsec, which
   logically sits between IP and the Packetization Layer.
   As far as PLPMTUD is concerned IPsec can be treated either as part of
   IP or as part of the Packetization Layer, as long as the accounting
   is consistent within any given implementation.  If IPsec is treated
   as part of the IP layer, then each security association to a remote
   node may need to be treated as a separate flow for PLPMTUD, if they
   have different length security headers.  If IPsec is treated as part
   of the packetization layer, the IPsec header size has to be included
   in the Packetization Layer's header size calculations.

5.1.2  Storing PMTU information

   This memo uses the concept of a "flow" to define the scope in which
   path MTU information is used.  Each flow locally stores its maximum
   payload size (MPS), which is used for packetizing data.
   Packetization Layers may communicate with the IP layer to store or
   access cached MPS values, providing a means by which similar flows
   may share information.  The IP layer also stores PMTU and derived MPS
   information when it receives Packet Too Big messages.

   Ideally, a PMTU value should be associated with a specific path
   traversed by packets exchanged between the source and destination
   nodes.  However, in most cases a node will not have enough
   information to completely and accurately identify such a path.
   Rather, a node must associate a PMTU value with some local
   representation of a path.  It is left to the implementation to select
   the local representation of a path.

   An implementation could use the destination address as the local
   representation of a path.  The PMTU value associated with a
   destination would be the minimum PMTU learned across the set of all
   paths in use to that destination.  The set of paths in use to a
   particular destination is expected to be small, in many cases
   consisting of a single path.  This approach will result in the use of
   optimally sized packets on a per-destination basis.
This approach 538 integrates nicely with the conceptual model of a host as described in 539 [ND@@@@]: a PMTU value could be stored with the corresponding entry 540 in the destination cache. However, NAT and other forms of middle 541 boxes may exhibit differing MTUs at a single IP address. 543 If IPv6 flows are in use, an implementation could use the IPv6 flow 544 id [7][14] as the local representation of a path. Packets sent to a 545 particular destination but belonging to different flows may use 546 different paths, with the choice of path depending on the flow id. 547 This approach will result in the use of optimally sized packets on a 548 per-flow basis, providing finer granularity than PMTU values 549 maintained on a per-destination basis. 551 For source routed packets (i.e. packets containing an IPv6 Routing 552 header, or IPv4 LSRR or SSRR options), the source route may further 553 qualify the local representation of a path. An implementation 554 could use source route information in the local representation of a 555 path. 557 If IPsec is in use, the security association can also be used to 558 represent a path. 560 5.2 Lower Layers 562 5.2.1 Generating Probes 564 A new candidate MTU is tested by sending one "probe packet", which is 565 larger than the current MTU. In this section we present a couple of 566 possible ways to alter packetization layers to generate probe 567 packets. The different techniques incur different overheads in 568 three areas: difficulty in generating the probe packet (in terms of 569 packetization layer implementation complexity and computational 570 overhead), possible additional network capacity consumed by the probes, 571 and the overhead of recovering from failed probes (both network and 572 protocol overhead). 574 For example, some protocols might be extended to allow padding with 575 dummy data within their packets.
This would greatly simplify the 576 implementation because the probing can be performed without 577 participation from the application and if the probe fails, the 578 missing data (the "probe gap") is assured to fit within the current 579 MTU when it is retransmitted. However, the padding does consume 580 network capacity without carrying any useful payload. 582 This technique does not work for TCP, because there is not a separate 583 length field or other mechanism to differentiate between padding and 584 real payload data. With TCP the natural approach is to send 585 additional payload data in an over-sized segment. There are several 586 variants which have different tradeoffs. 588 In one method, after a TCP probe segment has been sent the subsequent 589 segment(s) may be sent as though the probe segment was not 590 over-sized. Thus if the probe segment is lost, it will leave a gap 591 in the sequence space that is exactly the correct size to be filled 592 by one segment at the current MTU. Since this method generates 593 overlapping data, it will cause duplicate acknowledgments if the 594 probe is successfully delivered. The sender must be capable of 595 ignoring these expected duplicate acknowledgments in a manner which 596 will not cause unnecessary retransmission or congestion window 597 reduction. 599 In the second method, after a TCP probe segment has been sent, 600 subsequent TCP segments are sent in a non-overlapping manner. If the 601 probe segment is lost, it will leave a gap which will require 602 retransmission of multiple segments to fill. This method has lower 603 overhead for successful probes, but it requires more complexity in 604 the retransmit logic to correctly retransmit the missing data (the 605 "probe gap") with multiple segments that fit into the old MTU, while 606 properly suppressing the congestion adjustments for this one 607 situation and no others. 
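As a rough illustration of the trade-off between the two TCP methods above, the following Python sketch counts how many segments at the current size are needed to fill the probe gap after a lost probe. The function name and its parameters are hypothetical and are not defined by this memo; sizes are TCP payload bytes, with header accounting left out.

```python
import math


def segments_to_fill_probe_gap(probe_payload, current_mss, overlapping):
    """Return how many segments at the current MSS repair a lost probe.

    Hypothetical helper, for illustration only.

    overlapping=True models the first method: later segments were sent as
    though the probe carried only current_mss bytes, so the hole left by a
    lost probe is exactly one segment wide.

    overlapping=False models the second method: the whole probe payload is
    missing and must be re-sent in pieces that fit the old MTU.
    """
    if probe_payload <= current_mss:
        raise ValueError("a probe must be larger than the current segment size")
    if overlapping:
        return 1
    return math.ceil(probe_payload / current_mss)
```

For a 2048 byte probe over a 1024 byte segment size, the overlapping method needs one retransmitted segment and the non-overlapping method needs two; the price of the first method is the duplicated data when the probe succeeds.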
609 Several Packetization protocols may be best served by using an 610 adjunct protocol for MTU probing: a separate protocol (or protocol 611 feature) that does not carry any real application data. This greatly 612 simplifies implementation because nothing needs to be retransmitted 613 when the probe is lost, but it does consume network capacity without 614 delivering any useful payload. 616 Two important examples of this come to mind: SCTP [9], which might use 617 its existing HEARTBEAT facility padded with dummy data to fill out 618 the probe packet; and IP fragmentation, which is sometimes used as a 619 Packetization layer for carrying oversized datagrams as described in 620 Section 5.2.6. In the case of IP fragmentation an entirely separate 621 protocol is needed, which has to use the diagnostic interface described 622 in Section 5.5.4. 624 It should be clear that nearly all packetization layers can be 625 adapted to support PLPMTUD, possibly in more than one way. 627 5.2.2 Selecting the initial MTU 629 When the PLPMTUD process is started the initial MTU should normally 630 be set such that the Packetization Layer can carry 1 kByte data 631 segments. This initial MTU should be 1 kByte plus space for IP and 632 Packetization layer headers (see Section 5.1 on accounting for 633 headers). With this MTU, RFC2414 [6] allows TCP and other 634 transport protocols to start with an initial window of 4 packets. 636 We suspect, but have not confirmed, that TCP actually starts faster 637 (and completes sooner for small packets) with 1kB packets rather than 638 1500 byte packets because the 2nd data ACK occurs one round trip 639 earlier. 641 This initial MTU should also be configurable. One of the 642 configuration options should be to set it to default to the 643 interface's MTU, to mimic classical PMTUD behavior.
(See Section 5.5.1.) 645 5.2.3 Normal sequence of events to raise the MTU 647 If the probe size is smaller than the actual path MTU and there are 648 no other losses, the normal sequence of events to probe and raise the 649 MTU will be: 650 1. The probe is sent, followed by more packets at the current MTU. 651 By definition PLPMTUD enters the probe phase. The probe 652 propagates through the network and the far node acknowledges it 653 (or possibly later data, if acknowledgements are cumulative and 654 delayed acknowledgement is in effect). 656 2. The acknowledgement for the probe reaches the data sender. By 657 definition, this ends the probe phase. 659 3. The packetization layer provisionally raises the MTU to the probe 660 size. PLPMTUD enters the transitional phase when it starts 661 sending data using the provisional MTU. 663 Note that implementations that use packet counts for congestion 664 accounting (e.g. keep cwnd in units of packets) must re-scale 665 their congestion accounting such that raising the MTU does not 666 raise the data rate (bytes/second) or the total congestion window 667 in bytes. 669 If the implementation packetizes the data at the application 670 programming interface, it may transmit already queued data at the 671 current MTU before raising the MTU. In this case this data is not 672 part of either the probing or transition phases, because all of 673 the packets in flight fit within the current MTU. 675 4. Once the first packet of the transitional phase is acknowledged, 676 PLPMTUD enters the verification phase. In principle the 677 verification phase can be of arbitrary duration; however, at this 678 time we are recommending one full window of data (i.e. one full 679 round trip time) for most Packetization Layers. 681 5. Once sufficient data has been delivered and acknowledged, 682 the provisional MTU is considered verified and the path MTU is 683 updated.
PLPMTUD can then probe for an even larger MTU, as 684 described in the searching strategy in Section 5.3. 686 Other events described in the next section are treated as exceptions 687 and alter or cancel some of the steps above. 689 5.2.4 Processing MTU Indications 691 The descriptions below assume that the Packetization Layer protocol 692 has a TCP fast retransmit style mechanism to synchronously 693 detect the loss of a probe packet and trigger retransmission, without 694 loss of the protocol's self clock. If this fails, then some sort of 695 retransmission timeout will serve to catch the loss. They also 696 assume that there is some mechanism to detect full-stop timeouts. 698 If any of these events (or the receipt of an ICMP Packet Too Big 699 message) occurs during the above process to raise the MTU, then 700 it is processed as indicated in the following sections. 702 5.2.4.1 Processing Packet Too Big Messages 704 Classical PMTU discovery specifies the generation of Packet Too Big 705 Messages if an over-sized packet (e.g. a probe) encounters a link 706 that has a smaller MTU. Since these messages cannot be authenticated, 707 they introduce a number of well documented attacks against classical 708 PMTUD [5]. 710 With PLPMTUD these messages are not required for correct operation, 711 and in principle can be summarily ignored at the expense of slower 712 convergence to the proper MTU. However, we believe that a slightly 713 better compromise is to process Packet Too Big messages in two 714 specific contexts: in conjunction with a PLPMTUD probe or a full-stop 715 timeout. 717 Every Packet Too Big Message should be subjected to the following 718 checks: 719 o If globally forbidden then discard the message. 721 o If forbidden by the application then discard the message. 723 o If this path has been tagged "bogus ICMP messages" then discard 724 the message.
726 o If the reported MTU fails consistency checks then set the "bogus ICMP 727 messages" flag for this path and discard the message. These 728 consistency checks include: 729 * unrecognized or unparseable enclosed header, 730 * reported MTU is larger than the size indicated by the enclosed 731 header, 732 * reported MTU is larger than the current MTU, provisional MTU or probe size as 733 appropriate, or 734 * the message fails an ICMP consistency check specific to the 735 Packetization Layer (e.g. the SCTP Verification-Tag mechanism 736 [9][16]). 737 To ease migration, it is suggested that implementations may 738 include global controls to suppress some or all of the consistency 739 checks. 741 If the Packet Too Big Message is acceptable under all of these checks, 742 do one of two things, depending on a global configuration switch: 743 emulate classical path MTU discovery by processing the message 744 immediately (i.e. set the path MTU to the size indicated in the 745 message), or save the "ICMP MTU", pending another PLPMTUD event. In 746 the latter case the saved ICMP MTU will only be acted upon under 747 appropriate conditions if there are lost probes, verification packets 748 or a full stop timeout. This greatly reduces the impact of 749 fraudulent ICMP Packet Too Big messages. 751 In either case, if the Packetization Layer calls for specific actions 752 in response to a Packet Too Big message, that action should be 753 invoked only at the point when the path MTU is updated from the ICMP 754 MTU. 756 5.2.4.2 Packetization Layer Detects Lost Packets 758 Each packetization protocol has its own mechanism to detect lost 759 packets and request the retransmission of missing data. The primary 760 signals used by the packetization layer are these protocol specific 761 loss indications. The packetization layer is responsible for 762 retransmitting the lost data and notifying PLPMTUD that there was a 763 loss.
764 o If the probe itself was lost, and there were no other losses 765 during the probe phase (the RTT between when the probe was sent 766 and when the loss was detected), then it is taken as an indication that the 767 path MTU is smaller than the probe size. In this situation alone 768 the Packetization Layer is permitted to retransmit the missing 769 data (the "probe gap") without adjusting its congestion window or 770 data transmission rate. 772 If an accepted Packet Too Big Message was received after the probe 773 was sent, and it passes the additional checks that the ICMP MTU is 774 greater than the current MTU and less than the probe size, then 775 set the probe size to the ICMP MTU, and restart the probe process 776 from step 1 in Section 5.2.3. 778 If there was not an accepted Packet Too Big Message, then the 779 indicated event is a "probe failure", which can be retried with a 780 smaller probe size after a suitable delay for a probe_fail_event. 781 See Section 5.2.5 for more complete descriptions of failure 782 events. 784 o If there are losses during the probe phase and the probe was not 785 lost, then the probe was successful. However, since additional 786 losses have the potential to spoil the verification phase, it is 787 important that PLPMTUD not progress into the transition phase 788 (step 3 above) until after the Packetization Layer has fully 789 recovered from the losses and completed the congestion window (or 790 rate) adjustment. 792 o If there are losses during the probe phase and the probe was also 793 lost, the outcome depends on the presence of an ICMP MTU set by an 794 acceptable Packet Too Big Message.
796 If there was an accepted Packet Too Big Message received since the 797 probe was sent, and it passes the additional checks that the ICMP 798 MTU is greater than the current MTU and less than the probe size, 799 then set the probe size to the ICMP MTU, and once the 800 Packetization Layer completes the recovery from the losses then 801 restart the probe process from step 1 in Section 5.2.3. 803 If there was not an accepted Packet Too Big Message, then the 804 probe is inconclusive because the lost probe might have been 805 caused by congestion. The probe can be retried after a suitable 806 delay for a probe_inconclusive_event. 808 o It is unlikely that losses during the transition phase are caused 809 by PLPMTUD; however, they do potentially complicate the 810 verification phase. Note that we are referring to losses that are 811 followed by acknowledgement of packets that were sent at the old 812 MTU, while the transition to the provisional MTU is still 813 propagating through the network. The first acknowledgement of 814 the provisional MTU (and the transition to the verification phase) 815 is most likely going to occur during the recovery of the losses in 816 the transition phase. It is important that the Packetization Layer 817 retransmission machinery distinguish between losses at the old MTU 818 (transition phase) and the provisional MTU (the verification 819 phase, discussed next). 821 o Losses during the verification phase are taken as an indication 822 that the path may have a non-uniform MTU or some other problems 823 such that raising the MTU substantially raises the loss rate. If 824 so, this is potentially a very serious problem, so the provisional 825 MTU is considered to have errored and the path MTU is set back to 826 the previously verified MTU (the previously current MTU).
828 Packet loss during the verification phase might also be due to 829 coincidental congestion on the path, unrelated to the probe, so it 830 would seem to be desirable to re-probe the path. The risk is that 831 this effectively raises the tolerated loss threshold because even 832 though raising the MTU seemed to cause additional loss, there is a 833 statistical chance that repeated attempts to verify a new MTU may 834 yield a false pass. The compromise is to re-probe once with 835 the same probe size (after delay probe_inconclusive_event), and if 836 this also fails, then the probe may not be retried until after a 837 suitable delay for a verification_error_event, which exponentially 838 increases on each successive failure. 840 5.2.4.3 Packetization Layer Retransmission Timeout 842 Note that we do not make distinctions between the various methods 843 that different Packetization Layers might use for detecting and 844 retransmitting lost packets. It is preferable that the 845 Packetization Layer uses a recovery mechanism similar to TCP SACK or 846 fast retransmit (or other "synchronous" loss recovery mechanism) to 847 detect losses and recover as quickly as possible. 849 Under some conditions the Packetization Layer may have to rely on 850 retransmission timeouts or other fairly disruptive techniques to 851 recover from losses. Since these greatly increase the cost of 852 failed probes, it is recommended that PLPMTUD use even longer delays 853 before re-probing. In these situations replace probe_fail_event with 854 probe_timeout_event. 856 5.2.4.4 Packetization Layer Full Stop Timeout 858 Under all conditions (not just during MTU probing) a full stop 859 timeout should be taken as an indication of some significantly 860 disruptive event in the network, such as a router failure or a 861 routing change to a path with a smaller MTU.
863 If the ICMP MTU is set, and it is less than the current MTU (or 864 provisional MTU during the transitional phase), then the path MTU can 865 be reduced to the ICMP MTU. This is the only situation (a full stop 866 timeout) outside of a probe in which we recommend that the path MTU be 867 set from the ICMP MTU. (In Section 5.5.1 we relax this recommendation 868 to facilitate migration to PLPMTUD in exchange for slightly less 869 protection from corrupt Packet Too Big messages.) 871 Note that whenever a problem with the path causes a full-stop 872 timeout (also known as a "persistent timeout" in other documents), 873 several different path restart/recovery algorithms may be invoked at 874 different layers in the stack. Some device drivers may be restarted 875 [@@], router discovery [@@], ES-IS [@@] and so forth. We recommend 876 that in most situations the first action should be to set the path MTU 877 down. Note that this recommendation is really beyond the scope of 878 this document, and may require substantial additional research. 880 Therefore, if there is a full stop timeout and there was not an ICMP 881 message indicating a reason (Packet Too Big, Net unreachable, etc., or 882 the ICMP message was ignored for some reason), we suggest that the 883 first recovery action should be to set the path MTU down to a safe 884 minimum "restart MTU" value, and to reset the PLPMTUD search state, so 885 PLPMTUD will start over again searching for the proper MTU. The 886 default restart_MTU should be the minimum MTU as specified by IPv4 887 (576)[1] or IPv6 (1280) [7] as appropriate, unless overridden by some 888 global control (See Section 5.5.5). 890 If and only if the full stop timeout happens during the probe or 891 transition phases (e.g. after sending data using the provisional 892 MTU but before any of it is acknowledged) is it considered likely 893 that raising the MTU caused the full stop timeout.
If so, this 894 situation is likely to be cyclic, because resetting the PLPMTUD 895 search state is likely to eventually cause re-probing the same 896 problematic MTU. 898 It is tempting to define additional states to detect recurrent full 899 stop timeouts. However, in today's hostile network environment, there 900 is little tolerance for nodes that are so fragile that they can be 901 disrupted by something as simple as oversized packets. Therefore we 902 do not feel that it is worth the overhead of specifying a state 903 machine that is capable of automatically detecting these situations and 904 disabling PLPMTUD. However, it is important that there be a manual 905 way to disable or limit probing on specific paths. See Section 906 5.5.5. 908 5.2.5 Probing Intervals 910 Section 5.2.4.2 describes a number of probe failure events. In all 911 cases the basic response is the same: to wait some time interval 912 (dependent on the specific event and possibly the history) and then 913 to probe again. For events that are "inconclusive", it is generally 914 appropriate to re-probe with the same probe size. For events that 915 are identified as "failed probes" it is generally appropriate to 916 re-probe with a smaller probe size. The search strategy described 917 in Section 5.3 is used to select probe sizes. 919 Many of the intervals below are specified in terms of elapsed round 920 trips relative to the current congestion window. This is because 921 TCP and other Packetization Layer protocols tend to exhibit periodic 922 losses which cause periodic variations of the congestion window and 923 possibly the data rate. It is preferable that the PLPMTUD probes are 924 scheduled near the low point of these cycles to minimize ambiguities 925 caused by congestion losses. 927 In order from least to most serious: 928 probe_inconclusive_event Other lost packets near the lost probe made 929 the probe result ambiguous.
Since the loss of non-probe packets 930 requires a window (or data rate) reduction, it is desirable to 931 schedule the re-probe (at the same probe size) at one round trip 932 time after the end of the loss recovery. This will be almost the 933 minimum congestion window size, with a small cushion to minimize 934 the chances that correlated losses caused by some other bursty 935 connection spoil another probe. 937 probe_fail_event A probe fail event is the one situation under which 938 the Packetization layer is permitted not to treat loss as a 939 congestion signal. Because there is some small risk that 940 suppressing congestion control might have unanticipated 941 consequences (even for one isolated loss), we require that probe 942 fail events be less frequent than the normal period for losses 943 under standard congestion control. Specifically, after a probe 944 fail event and suppressed congestion control, PLPMTUD may not 945 probe again until an interval which is comparable to the expected 946 interval between congestion control events. See Section 4. 948 The simplest estimate of the interval to the next congestion event 949 is the same number of round trips as the current window in 950 packets. 952 probe_timeout_event Since this event was detected by a timeout, it is 953 relatively disruptive to protocol operation. Furthermore, since 954 the event indirectly includes a window adjustment that may have 955 been caused by the MTU probe, it is important that the probe not 956 be repeated until congestion has more than recovered from the 957 loss. Therefore we recommend five times the probe_fail_event 958 interval, i.e. five times as many round trips as the current 959 congestion window in packets. 961 verification_error_event A verification fail event indicates that a 962 probe was delivered and the verification phase failed twice, 963 separated by a congestion adjustment (so the second verification 964 phase was at a low point in the congestion control cycle).
This is 965 an indication that one of the following three things might have 966 happened: repeated losses unrelated to PLPMTUD; the path is 967 striped across links with dissimilar MTUs; or the link layer has 968 some parametric limitation such that raising the MTU greatly 969 increases the random error rate. 971 The optimal method of responding to this situation is an open 972 research question. We believe that the correct response is some 973 combination of exponentially lengthening backoffs (e.g. starting 974 at 1 minute and quadrupling on each repeat) and implicitly 975 treating the situation as a probe fail (and choosing a smaller 976 probe size) after some threshold number of repeated 977 verification_error_events. 979 5.2.6 Host fragmentation 981 Packetization layers are encouraged to avoid sending messages that 982 will require fragmentation (for the case against fragmentation, see 983 [17][18]). However, this is not always possible. Some packetization 984 layers, such as a UDP application outside the kernel, may be unable 985 to change the size of the messages they send. This may result in packet 986 sizes that exceed the Path MTU. 988 IPv4 permitted such applications to send packets without DF set. 989 Oversized packets without DF would be fragmented in the network or 990 sending host when they encountered a link with a small MTU. In some 991 cases, packets could be fragmented more than once if there were 992 cascaded links with progressively smaller MTUs. 994 This approach is no longer recommended. We now recommend that IPv4 995 implementations use a strategy that mimics IPv6 functionality. When 996 an application sends datagrams that are larger than the known path 997 MTU they should be fragmented to the path MTU in the host IP layer 998 even if they are smaller than the link MTU of the first hop networks 999 directly attached to the host. The DF bit should be set on the 1000 fragments, so they will not be fragmented again in the network.
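The host-side fragmentation recommended above can be sketched as follows. This is only an illustration: the function name and parameters are hypothetical, and it assumes an IPv4 header without options (20 bytes). The one protocol fact it relies on is that IPv4 fragment offsets are expressed in 8 byte units, so every fragment except the last must carry a multiple of 8 payload bytes.

```python
def host_fragment_sizes(payload_len, path_mtu, ip_header_len=20):
    """Split an IP payload into fragment payload sizes that fit path_mtu.

    Hypothetical helper, for illustration only. Each returned size is the
    payload of one fragment; the host would send every fragment with DF
    set so that it is never fragmented again in the network.
    """
    # Largest per-fragment payload that fits the path MTU, rounded down
    # to a multiple of 8 because fragment offsets count 8 byte units.
    per_fragment = (path_mtu - ip_header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("path MTU leaves no room for payload")
    sizes = []
    remaining = payload_len
    while remaining > per_fragment:
        sizes.append(per_fragment)
        remaining -= per_fragment
    sizes.append(remaining)  # the last fragment may be any length
    return sizes
```

For example, a 4000 byte payload sent over a path with a known MTU of 1500 is split into fragments carrying 1480, 1480 and 1040 bytes, regardless of how large the first hop link MTU is.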
1002 This technique will minimize future surprises as the Internet 1003 migrates to IPv6. Otherwise there is the potential for widely 1004 deployed applications or services relying on IPv4 fragmentation, in a 1005 way that cannot be implemented in IPv6. At least one major operating 1006 system already uses this strategy. 1008 Note that in principle the IP fragmentation layer is an example of a 1009 Packetization Layer, so it could implement full PLPMTUD in the 1010 fragmentation process. 1012 5.2.7 Multicast 1014 In the case of a multicast destination address, copies of a packet 1015 may traverse many different paths to reach many different nodes. The 1016 local representation of the "path" to a multicast destination must in 1017 fact represent a potentially large set of paths. 1019 Minimally, an implementation could maintain a single MPS value to be 1020 used for all packets originated from the node. This MPS value would 1021 be the minimum MPS learned across the set of all paths in use by the 1022 node. This approach is likely to result in the use of smaller 1023 packets than is necessary for many paths. 1025 Alternatively, if the application using multicast gets complete 1026 delivery reports (unlikely because this requirement has poor scaling 1027 properties), PLPMTUD could be implemented in multicast protocols. 1029 5.3 Search Strategy 1031 The search strategy described here is only a guide for implementors. 1032 A standard algorithm is not specified because the strategy can 1033 include many heuristics to optimize MPS selection for a given path. 1034 In particular, it may be appropriate for different protocols to follow 1035 different strategies. There is opportunity for future improvements 1036 to this algorithm. 1038 The search strategy uses three variables: 1039 SEARCH_MAX is the largest MPS that a flow might be able to use.
1040 It is determined by such considerations as interface MTU, widths 1041 of protocol length fields, and possibly other protocol-dependent 1042 values, such as the TCP MSS option. In many cases it would be 1043 the same as the classical MTU discovery initial MSS, minus the IP 1044 layer headers. 1045 SEARCH_LOW is the largest validated MPS, and should be used as the 1046 effective MPS by the packetization layer. It is the same as the 1047 current validated MTU minus the IP layer headers. The initial 1048 value for SEARCH_LOW should be a parameter, but a value of 1024 1049 may be a reasonable default. 1050 SEARCH_HIGH is the least invalidated MPS. In most cases it will 1051 be the most recent failed probe size minus the IP layer headers. 1052 When PLPMTUD is initialized SEARCH_HIGH should be set to 1053 SEARCH_MAX. 1055 There are three major states: Search, Monitor and Suspend. In the 1056 Search state, the algorithm incrementally searches for the largest MPS that the 1057 path can support, narrowing the difference between SEARCH_LOW and 1058 SEARCH_HIGH. Once this gap is sufficiently narrow, the probing 1059 algorithm enters the Monitor state where it probes infrequently to 1060 detect if the path MPS has become larger. 1062 If the MPS probing is determined to be harmful, perhaps by persistent probe 1063 failures, the flow may enter the Suspend state, completely disabling 1064 MPS probing. 1066 5.3.1 Search 1068 In the Search state, the strategy follows a multi-phase scan. If 1069 SEARCH_HIGH >= SEARCH_MAX, a coarse scan is used. In this mode, each 1070 probe's payload size should be MIN(2 * SEARCH_LOW, SEARCH_MAX). If 1071 SEARCH_HIGH < SEARCH_MAX, the fine scan mode should be used. 1073 The fine scan algorithm may pursue a number of different methods for 1074 choosing probe sizes. It may be useful to choose probe sizes so that 1075 the final IP packet will fit common link MTUs, for example 1500, 1076 4352, 9000, 17914.
Optionally, probes smaller than these values by 1077 common tunnel header sizes may be used. 1079 When using some protocols, the cost for a failed probe may be 1080 significantly higher than the cost of a successful probe due to 1081 retransmission and consequent delay jitter as seen by the 1082 application. For this reason, one possible approach to the fine scan 1083 could be to use probes of size SEARCH_LOW + d, for some increment d. 1084 It should enter the Monitor state when SEARCH_LOW + d >= SEARCH_HIGH. 1085 This will result in at most one additional probe failure. 1087 Another approach may be to use a simple binary search where each 1088 probe size is (SEARCH_LOW + SEARCH_HIGH) / 2, entering the Monitor 1089 state when SEARCH_LOW + s >= SEARCH_HIGH for some threshold s. This 1090 will converge quickly, but may have a higher number of probe 1091 failures. It is more appropriate for a protocol whose probes consist 1092 entirely of padding. 1094 5.3.2 Monitor 1096 In the Monitor state, a probe of size SEARCH_HIGH should be sent at 1097 most once every MONITOR_INTERVAL seconds. If the probe succeeds, 1098 then SEARCH_HIGH should be set to SEARCH_MAX, and the state should be 1099 set to Search. 1101 If there is evidence that no flow traffic is reaching its 1102 destination, such as repeated timeouts with no acknowledgements in 1103 TCP, it may be that the connection was re-routed to a path with a 1104 smaller MTU, and the Packet Too Big messages are ignored or filtered. 1105 In this case, SEARCH_LOW and SEARCH_HIGH should be set to initial 1106 values, and the Search state should be entered. 1108 5.3.3 Suspend 1110 In the Suspend state, probing is entirely disabled, and the MPS 1111 should be set to 512 bytes. The Suspend state should only be used if 1112 it is heuristically determined that probing is causing harmful 1113 failures.
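The Search state of Section 5.3.1 can be sketched in Python as a coarse scan followed by the binary-search variant of the fine scan. This is a minimal sketch under stated assumptions, not a specified algorithm: the function names, the `seen_failure` flag and the default threshold of 8 bytes are hypothetical. The memo selects the coarse scan with the test SEARCH_HIGH >= SEARCH_MAX; the sketch uses an explicit flag instead, so that a failed probe of exactly SEARCH_MAX does not repeat the same probe forever. The path is modelled as a single number, `path_mps`, which a real implementation does not know.

```python
def next_probe_size(low, high, search_max, seen_failure, threshold=8):
    """Return the next probe MPS, or None when Search should yield to Monitor."""
    if not seen_failure:
        # Coarse scan: double SEARCH_LOW, capped at SEARCH_MAX.
        if low >= search_max:
            return None
        return min(2 * low, search_max)
    # Fine scan, binary search variant: stop once the gap is narrow.
    if low + threshold >= high:
        return None
    return (low + high) // 2


def simulate_search(path_mps, search_low=1024, search_max=9000, threshold=8):
    """Run the Search state against a path that delivers MPS <= path_mps.

    Returns the final SEARCH_LOW and the list of probe sizes that were tried.
    """
    low, high, seen_failure = search_low, search_max, False
    probes = []
    while True:
        size = next_probe_size(low, high, search_max, seen_failure, threshold)
        if size is None:
            return low, probes
        probes.append(size)
        if size <= path_mps:
            low = size                        # probe verified: raise SEARCH_LOW
        else:
            high, seen_failure = size, True   # probe failed: lower SEARCH_HIGH
```

With a path MPS of 9000 the coarse scan alone finishes after probes of 2048, 4096, 8192 and 9000. With a path MPS of 1460 the first coarse probe (2048) fails and the fine scan then narrows the range until SEARCH_LOW is within the threshold of SEARCH_HIGH. The delays between probes required by Section 5.2.5 are deliberately left out of this sketch.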
1115 5.4 Specific Packetization Layers 1117 In this section we discuss specific implementation issues for different 1118 Packetization Layer protocols. 1120 5.4.1 Probing method using TCP 1122 TCP has no mechanism that could be used to distinguish between real 1123 application data and some other form of padding that might be used to 1124 fill out probe packets. Therefore, TCP must generate probes by 1125 sending oversized segments that are carrying real data from upper 1126 layers. As previously mentioned there are two approaches that TCP 1127 might use to minimize the overheads associated with the probing 1128 process. 1130 A TCP implementation of PLPMTUD can elect to send subsequent segments 1131 overlapping the probe as though the probe segment was not oversized. 1132 This has the advantage that TCP only needs to retransmit one segment 1133 at the current MTU to recover from failed probes. However, the 1134 duplicate data in the probe does consume network resources and will 1135 cause duplicate acknowledgments. It is important that these extra 1136 duplicate acknowledgments not trigger Fast Retransmit. This can be 1137 guaranteed by limiting the largest probe segment size to twice the 1138 current segment size (causing at most 1 duplicate acknowledgment) or 1139 three times the current segment size (causing at most 2 duplicate 1140 acknowledgments). 1142 The other approach is to send non-overlapping segments following the 1143 probe. Although this is cleaner from a protocol architecture 1144 standpoint, it clashes with many of the optimizations used to improve the 1145 efficiency of data motion within many operating systems. In 1146 particular, many implementations divide the data into segments and 1147 pre-compute checksums as the data is copied out of user space. In 1148 these implementations it can be very expensive to adjust segment 1149 boundaries after the data is already queued.
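The probe size limit for the overlapping method can be checked with a small Python sketch. The names below are hypothetical, and the duplicate acknowledgment count is a simple model chosen to match the two cases given above (twice the segment size gives at most 1, three times gives at most 2); it is not taken from a TCP specification. The threshold of three duplicate acknowledgments is the conventional Fast Retransmit trigger.

```python
import math

FAST_RETRANSMIT_DUPACKS = 3  # conventional TCP duplicate ACK threshold


def expected_duplicate_acks(probe_payload, current_mss):
    """Upper bound on duplicate ACKs caused by a delivered overlapping probe.

    Assumed model: each current-size segment that lies entirely inside the
    data already delivered by the probe can trigger one duplicate ACK.
    """
    return math.ceil(probe_payload / current_mss) - 1


def max_overlapping_probe(current_mss, dupack_threshold=FAST_RETRANSMIT_DUPACKS):
    """Largest probe payload that stays below the Fast Retransmit threshold."""
    return dupack_threshold * current_mss
```

Under this model a probe of up to three times the current segment size yields at most two duplicate acknowledgments, one short of the threshold, so a successful probe is not mistaken for a loss.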

   If TCP is using SACK or any other variable length headers, the
   headers on the probe and verification packets should be padded to the
   maximum possible length. Otherwise, future options may cause delivery
   problems if they produce IP packets that are larger than the MTU.

   Note that the header size and overhead calculations described in
   Section 5.1 apply here. TCP's natural data accounting units are
   sequence space and Maximum Segment Size. However, the PLPMTUD
   process is described in terms of total packet size, which is larger
   than the MSS by the size of all fixed and optional headers.

   At the point when TCP is ready to start the verification phase, it is
   permitted to transmit already queued data at the old MTU rather than
   re-packetize it. This postpones the verification process by the time
   required to send the queued data.

   If the verification phase experiences any segment losses, TCP is
   required to pull back to the prior MSS. Since failing the
   verification phase should be an infrequent error condition, it is
   less important that this be as efficient as probing.

5.4.1.1 Window management

   Some TCP implementations keep the congestion window in units of
   segments. When the segment size is increased during a connection, a
   conservative implementation should scale cwnd so that, in units of
   bytes, it remains unchanged.

   It is recommended that TCP not probe a new MPS if that MPS will
   likely result in a cwnd of less than 5 segments.

   If the network becomes too congested, it is recommended that the MPS
   be reduced to a smaller size as determined by a heuristic. The
   recommended heuristic is to reduce the MPS by half if ssthresh is
   reduced to 5 segments or smaller, with a minimum MPS of 512 bytes.

5.4.2 Probing method using SCTP

   In SCTP, packetization is the responsibility of the application or
   protocol above SCTP.
   The application writes a message to SCTP and SCTP will "chunkify" it
   into appropriately sized pieces. Some implementations MAY bundle
   multiple data chunks together, but this is NOT required
   implementation behavior. By implication, not all SCTP
   implementations can easily generate probes by sending additional
   application data. In particular, any implementation that does not
   implement data chunk bundling would not be able to generate such a
   probe.

   For SCTP the recommended method for generating probes is to pad SCTP
   HEARTBEAT messages to the desired probe size. A successful probe
   will be acknowledged without delay by the peer SCTP implementation
   returning the same HEARTBEAT as a HEARTBEAT-ACK. This assures that
   both directions will support the probed MTU size. [@@@@@ note that
   both sides of the path are tested]

   The verification phase is entered after a successful probe. For
   implementations that can bundle multiple DATA chunks, the
   verification phase completes when a window's worth of bundled DATA
   chunks has been exchanged at the new MTU value. An SCTP
   implementation SHOULD arrange its fragmentation point to be a
   suitable fraction of the new MTU size. For example, if the MTU size
   is 1500 bytes in IPv4, then a fragmentation point of 718 bytes might
   be selected during the verification phase. This would allow two
   bundled DATA chunks to be put together to exactly equal the proposed
   new PMTU (assuming no IP options: 1500 - 20 bytes of IPv4 header -
   12 bytes of SCTP common header - 2 x 16 bytes of DATA chunk headers
   = 1436 = 2 x 718). After verification is complete, the fragmentation
   point can then be set to the actual PMTU, assuming that this new
   value is the smallest MTU of all of the SCTP paths. An SCTP
   implementation is allowed to transmit previously queued, already
   fragmented DATA chunks that cannot be bundled together at the new
   MTU value.
   For implementations that do not allow DATA chunk bundling, three
   subsequent HEARTBEAT messages, padded to the new proposed MTU value,
   should be sent over the next XX@@ RTTs. If all of the HEARTBEATs
   are successful, then the new PMTU should be adopted for the path.

   [@@@@NOTE: it might be simpler to always use multiple HB's to prove
   in a PMTU during verification, I leave this up to you. One thing to
   keep in mind is that SCTP normally fragments its messages to the
   SMALLEST PMTU of all paths... since SCTP is multi-homed this makes it
   so any data chunk can fit on ANY path. Most implementations DO bundle
   data chunks for this very reason... it's easy to do and it allows
   larger PMTU's on different paths to be utilized. So using the HB may
   be more efficient... it's definitely simpler... I leave it to you to
   choose. We may also want to mention the ICMP issue with SCTP since a
   validated ICMP message with SCTP can always be trusted].

   The SCTP Verification Tag is designed to increase SCTP's robustness
   in the presence of a number of attacks, including forged ICMP
   messages. It is a 32-bit value which is initialized to a random
   value during connection establishment and placed in the first 64
   bits of all SCTP messages. All subsequent messages (including ICMP
   messages, which copy at least the first 64 bits of the message) must
   match the original Verification Tag, or they are rejected as being
   likely attacks against the connection [9][16].

   It is believed that the Verification Tag mechanism is strong enough
   that SCTP could unconditionally process Packet Too Big messages that
   would reduce the path MTU at arbitrary times. As written, this
   document does not encourage this method. The PLPMTUD ICMP validity
   checks are cascaded with the SCTP checks, such that the messages are
   processed only if they meet all consistency checks.
   In particular, PLPMTUD only uses the ICMP MTU value following a
   probe, during MTU verification, or following a hard stop timeout.

   To change this, an implementation would have to suppress some of the
   checks in Section 5.2.4.1 for SCTP.

5.4.3 Probing Method for IP Fragmentation

   As mentioned in Section 5.2.6, datagram protocols (such as UDP) can
   rely on IP fragmentation as a packetization layer. Since the IP
   layer does not have any way to determine whether the fragments were
   delivered, it cannot do the probing directly. The probing has to be
   done with an adjunct protocol that uses the diagnostic API (Section
   5.5.4) to send oversized probes, and some other API to update the
   MPS stored in the IP layer.

5.4.4 Issues for other transport protocols

   Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
   repacketize when doing a retransmission. That is, once an attempt is
   made to transmit a segment of a certain size, the transport cannot
   split the contents of the segment into smaller segments for
   retransmission. In such a case, the original segment can be
   fragmented by the IP layer during retransmission. Subsequent
   segments, when transmitted for the first time, should be no larger
   than allowed by the Path MTU.

5.5 Operational Integration

5.5.1 Interoperation with prior algorithms

   Properly functioning Path MTU discovery is critical to the robust and
   efficient operation of the Internet. Any major change (as described
   in this document) has the potential to be very disruptive if it
   contains any errors or oversights. Therefore, we offer a deployment
   strategy in which classical PMTUD operation as described in RFC 1191
   and RFC 1981 is unmodified and PLPMTUD is only invoked following a
   full stop timeout, presumably due to an "ICMP black hole".
   To do this:

   o  Relax the ICMP checks in Section 5.2.4.1 specifically to allow an
      ICMP Packet Too Big message to reduce the MTU at arbitrary times.
   o  When there is no cached MTU, use the Interface MTU as specified by
      classical PMTU discovery, rather than the initial MTU specified in
      Section 5.2.2.
   o  Disable MTU searching as described in Section 5.3 entirely, or
      start it in the Monitor state.
   o  Process a full stop timeout as described in Section 5.2.4.4. This
      becomes the only mechanism to invoke the rest of PLPMTUD.

   When configured in this manner, PLPMTUD will increase the robustness
   of classical PMTU discovery in the presence of ICMP black holes and
   other ICMP problems, with minimal exposure to unanticipated problems
   during deployment. Since this configuration does not help robustness
   in the presence of malicious or erroneous ICMP messages, it is not
   recommended for the long term.

5.5.2 Interoperation over subnets with dissimilar MTUs

   With classical PMTUD, the ingress router to a subnet is responsible
   for knowing what size packets can be delivered to every node attached
   to that subnet. For most subnet types, this requires that the entire
   subnet has a single MTU which is common to every attached node. (For
   a few subnet types, e.g. ATM [12], the nodes on a subnet can
   negotiate the MTU on a pairwise basis, and the ingress router is
   responsible for knowing the MTU to each of its peers.)

   This requirement has proven to be a major impediment to deploying
   larger MTUs in the operational Internet. Often a single node which
   does not support a larger MTU effectively vetoes raising the MTU on a
   subnet, because the ingress router does not have a mechanism to
   generate the proper Packet Too Big message for the one attached node
   with a smaller MTU.

   With PLPMTUD, this requirement is completely relaxed.
   As long as oversized packets addressed to the nodes with the smaller
   MTU are reliably discarded, PLPMTUD will find the proper MTU for
   these nodes.

5.5.3 Interoperation with tunnels

   PLPMTUD is specifically designed to solve many of the problems that
   people are experiencing today due to poor interactions between
   classical MTU discovery, IPsec, and various sorts of tunnels [5].
   As long as the tunnel reliably discards packets that are too large,
   PLPMTUD will discover an appropriate MTU for the path.

   Unfortunately, due to the pervasive problems with classical PMTU
   discovery, many manufacturers of various types of VPN/tunneling
   equipment have resorted to ignoring the DF bit. This not only
   violates the IP standard and many recommendations to the contrary
   [17][18], it also violates the only requirement that PLPMTUD places
   on the link layer: that oversized packets are reliably discarded.
   It is imperative that people understand the impact of ignoring the DF
   bit both on applications and on PLPMTUD.

   We do understand the reality of the situation. It is important that
   vendors who are building devices that violate the DF specification
   understand that PLPMTUD requires that probe packets be discarded, and
   that sending ICMP Packet Too Big messages alone is insufficient to
   prevent wholesale fragmentation if the probe packets are delivered.

   Therefore, it is imperative that devices that do not honor DF include
   packet size history caches and other heuristics to robustly detect
   and discard probe packets, if delivering them would require
   fragmentation.

5.5.4 Diagnostic tools

   All implementations MUST include facilities for MTU discovery
   diagnostic tools that implement PLPMTUD or other MTU discovery
   algorithms in user mode without help or interference by the PMTUD
   algorithm present in the operating system.
   This requires a mechanism where a diagnostic application can send
   packets that are larger than the operating system's notion of the
   current path MTU and collect any resulting Packet Too Big messages or
   other ICMP messages. For IPv4 the diagnostic application must be
   able to set the DF bit.

   At this time nearly all operating systems support two modes for
   sending UDP datagrams: one which silently fragments packets that are
   too large, and another that rejects packets that are too large.
   Neither of these modes is suitable for efficiently diagnosing
   problems with MTU discovery, such as routers that return Packet Too
   Big messages containing incorrect size information.

5.5.5 Management interface

   It is suggested that an implementation provide a way for a system
   utility program to:

   o  Globally disable all ICMP Packet Too Big message processing.
   o  Globally suppress some or all ICMP consistency checks described in
      Section 5.2.4.1. Setting this option forgoes some possible
      security improvements, in exchange for making PLPMTUD behave more
      like classical PMTU discovery. (See Section 5.5.1)
   o  Globally permit ICMP Packet Too Big messages to unconditionally
      reduce the MTU, even if there were no lost packets. Setting this
      option forgoes some possible security improvements, in exchange
      for making PLPMTUD behave more like classical PMTU discovery.
      (See Section 5.5.1)
   o  Globally adjust timer intervals for specific classes of probe
      failures.

   In addition, it is important that there be a mechanism to permit per
   path controls to override specific parts of the PLPMTUD algorithm.
   All of these per path controls can be preset from similar global
   controls.

   o  Disable MTU searching on a given path, such that new MTU values
      are never probed.
   o  Set the initial MTU for a given path.
      This could be used to speed convergence in relatively static
      environments. There should be an option to cause PLPMTUD to
      choose the same initial value as would be chosen by classical
      PMTU discovery (typically the Interface MTU). This is used in the
      mode described in Section 5.5.1, where PLPMTUD is used only for
      black hole detection in classical PMTU discovery.
   o  Limit the maximum probed MTU for a given path. This permits a
      manual configuration to work around a link that spuriously
      delivers packets that are larger than the useful path MTU.
   o  Per path and per application controls to disable ICMP processing,
      to further limit possible damage from malicious Packet Too Big
      messages (in addition to the global controls).

6. References

6.1 Normative References

   [1]   Postel, J., "Internet Protocol", STD 5, RFC 791, September
         1981.

   [2]   Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
         November 1990.

   [3]   McCann, J., Deering, S. and J. Mogul, "Path MTU Discovery for
         IP version 6", RFC 1981, August 1996.

   [4]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [5]   Kent, S. and R. Atkinson, "Security Architecture for the
         Internet Protocol", RFC 2401, November 1998.

   [6]   Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
         Initial Window", RFC 2414, September 1998.

   [7]   Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
         Specification", RFC 2460, December 1998.

   [8]   Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914,
         September 2000.

   [9]   Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
         H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson,
         "Stream Control Transmission Protocol", RFC 2960, October 2000.

6.2 Informative References

   [10]  Mogul, J., Kent, C., Partridge, C. and K.
         McCloghrie, "IP MTU discovery options", RFC 1063, July 1988.

   [11]  Knowles, S., "IESG Advice from Experience with Path MTU
         Discovery", RFC 1435, March 1993.

   [12]  Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626,
         May 1994.

   [13]  Sung, T., "TCP And UDP Over IPX Networks With Fixed Path MTU",
         RFC 1791, April 1995.

   [14]  Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809,
         June 1995.

   [15]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923,
         September 2000.

   [16]  Stewart, R., "Stream Control Transmission Protocol (SCTP)
         Implementors Guide", draft-ietf-tsvwg-sctpimpguide-10 (work in
         progress), December 2003.

   [17]  Kent, C. and J. Mogul, "Fragmentation considered harmful",
         Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

   [18]  Mathis, M., Heffner, J. and B. Chandler, "Fragmentation
         Considered Very Harmful", draft-mathis-frag-harmful-00 (work in
         progress), July 2004.

Authors' Addresses

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA 15213
   US

   Phone: 412-268-3319
   EMail: mathis@psc.edu

   John W. Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA 15213
   US

   Phone: 412-268-2329
   EMail: jheffner@psc.edu

   Kevin Lahey
   Freelance

   EMail: kml@patheticgeek.net

Appendix A. Security Considerations

   Under all conditions the PLPMTUD procedure described in this document
   is at least as secure as the current standard path MTU discovery
   procedures described in RFC 1191 [2] and RFC 1981 [3].

   In the recommended configuration, PLPMTUD is significantly harder to
   attack than current procedures, because ICMP messages are cached and
   only processed in connection with lost packets. This effectively
   prevents blind attacks on the path MTU discovery system.

   Furthermore, since this algorithm is designed for robust operation
   without any ICMP (or other messages from the network), it can be
   configured to ignore all ICMP messages (globally or on a per
   application basis). In this configuration it cannot be attacked,
   unless the attacker can identify and selectively cause probe packets
   to be lost.

Appendix B. IANA considerations

   None.

Appendix C. Acknowledgements

   Most of the SCTP text was contributed by Randall Stewart.

   Matt Mathis and John Heffner are supported in this work by a grant
   from Cisco Systems, Inc.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the IETF's procedures with respect to rights in IETF Documents can
   be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2004). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.