Network Working Group M. Mathis
Internet-Draft J.
Heffner
Expires: February 2, 2007 PSC
August 1, 2006

Packetization Layer Path MTU Discovery
draft-ietf-pmtud-method-08

Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on February 2, 2007.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

This document describes a robust method for Path MTU Discovery
(PMTUD) that relies on TCP or some other Packetization Layer to probe
an Internet path with progressively larger packets. This method is
described as an extension to RFC 1191 and RFC 1981, which specify
ICMP based Path MTU Discovery for IP versions 4 and 6, respectively.

The general strategy of the new algorithm is to start with a small
MTU and search upward, testing successively larger MTUs by probing
with single packets. If a probe is successfully delivered then the
MTU can be raised. If the probe is lost, it is treated as an MTU
limitation and not as a congestion signal.

Packetization Layer PMTUD (PLPMTUD) introduces some flexibility in
the implementation of classical Path MTU discovery. It can be
configured to perform just ICMP black hole recovery to increase the
robustness of classical Path MTU Discovery, or at the other extreme,
all ICMP processing can be disabled and PLPMTUD can completely
replace classical Path MTU Discovery.

In the latter configuration, PLPMTUD exactly parallels congestion
control. An end-to-end transport protocol adjusts properties of the
data stream (window size or packet size) while using packet losses to
deduce the appropriateness of the adjustments. This technique is
more philosophically consistent with the end-to-end principle than
relying on ICMP messages containing transcribed headers of multiple
protocol layers.

Table of Contents

1. Introduction
  1.1. Revision History
    1.1.1. Changes since version -07, July 2006 (IETF 66)
    1.1.2. Changes since version -06, March 2006 (IETF 65)
    1.1.3. Changes since version -05, November 2005 (IETF 64)
    1.1.4. Changes since version -04, February 2005 (IETF 62)
    1.1.5. Changes since version -03, October 2004 (IETF 61)
    1.1.6. Changes since version -02, July 19th 2004 (IETF 60)
2. Overview
3. Terminology
4. Requirements
5. Layering
  5.1. Accounting for header sizes
  5.2. Storing PMTU information
  5.3. Accounting for IPsec
  5.4. Multicast
6. Common Packetization Properties
  6.1. Mechanism to detect loss
  6.2. Generating probes
7. The Probing Method
  7.1. Packet size ranges
  7.2. Selecting initial values
  7.3. Selecting probe size
  7.4. Probing preconditions
  7.5. Conducting a probe
  7.6. Response to probe results
    7.6.1. Probe success
    7.6.2. Probe failure
    7.6.3. Probe timeout failure
    7.6.4. Probe inconclusive
  7.7. Full stop timeout
  7.8. MTU verification
8. Host Fragmentation
9. Application Probing
10. Specific Packetization Layers
  10.1. Probing method using TCP
  10.2. Probing method using SCTP
  10.3. Probing method for IP fragmentation
  10.4. Probing method using applications
11. Security Considerations
12. IANA Considerations
13. References
  13.1. Normative references
  13.2. Informative references
Appendix A. Acknowledgements
Authors' Addresses
Intellectual Property and Copyright Statements

1. Introduction

This document describes a method for Packetization Layer Path MTU
Discovery (PLPMTUD) which is an extension to existing Path MTU
Discovery methods described in [RFC1191] and [RFC1981]. In the
absence of ICMP messages, the proper MTU is determined by starting
with small packets and probing with successively larger packets. The
bulk of the algorithm is implemented above IP, in the transport layer
(e.g., TCP) or other "Packetization Protocol" that is responsible for
determining packet boundaries.

The methods described in this document rely on features of existing
protocols. They apply to many transport protocols over IPv4 and
IPv6. They do not require cooperation from the lower layers (except
that they are consistent about what packet sizes are acceptable), or
from peers. As the methods apply only to senders, variants in
implementations will not cause interoperability problems.

For the sake of clarity, we uniformly prefer TCP and IPv6 terminology.
In the terminology section we also present the analogous IPv4 terms
and concepts for the IPv6 terminology. In a few situations we
describe specific details that are different between IPv4 and IPv6.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].

This document is a product of the Path MTU Discovery (pmtud) working
group of the IETF and draws heavily on RFC 1191 and RFC 1981 for
terminology, ideas, and some of the text.

1.1. Revision History

NOTE TO RFC EDITOR: this section to be removed before publication.

These are all recent substantive changes, in reverse chronological
order.
This section will be removed prior to publication as an RFC.

Please send comments and suggestions to pmtud@ietf.org. Interim
drafts and other useful information will be posted at
http://www.psc.edu/~mathis/MTU/pmtud/index.html .

1.1.1. Changes since version -07, July 2006 (IETF 66)

Last call comments from Gorry Fairhurst, Ivan Beschastnikh, and Mark
Allman. Nits and clarifications.

Changed MAY to SHOULD suppress congestion control response on failed
probe.

1.1.2. Changes since version -06, March 2006 (IETF 65)

Changed the title to include "Packetization Layer".

Renamed "Diagnostic Interface" section to "Application Probing" and
broadened the language to include other uses.

Clarifications to sections "packet size ranges", "host
fragmentation", and "probing using applications".

Language nits.

1.1.3. Changes since version -05, November 2005 (IETF 64)

Re-worked probing method sections for TCP and SCTP. The SCTP section
reflects the new PAD chunk type, and contains some text from Michael
Tuexen.

Made a number of language clarification and consistency improvements,
largely from comments by Gorry Fairhurst.

Added appropriate citations, and removed the last of the "@@" TODO
items.

1.1.4. Changes since version -04, February 2005 (IETF 62)

General restructuring and rewriting of some sections based on new
experience. Relaxed and generalized a lot of over-specified
language, for example, the search strategy description.

Decoupled verification from probing, and relaxed its specification.

Removed all specified changes to ICMP processing. We decided this
was out of scope for this particular document.

Changed all language to refer to MTU rather than MPS.

1.1.5. Changes since version -03, October 2004 (IETF 61)

A number of minor style and grammar edits.

1.1.6. Changes since version -02, July 19th 2004 (IETF 60)

Many minor updates throughout the document.

Added a section describing the interactions between PLPMTUD and
congestion control.

Removed a difficult to implement requirement for future data to
transmit.

Added "IP Fragmentation" and "Application protocol" as Packetization
Layers.

Clarified interactions between TCP SACK and MTU.

Updated SCTP section to reflect new probing method using "PAD
chunks".

Distilled the protocol specific material into separate subsections
for each protocol.

Added a section on common requirements and functions for all
Packetization Layers. More accurately characterized the
"bidirectional" (and other) requirements of the PL protocol. Updated
the search strategy in this new section.

Changed "ICMP can't fragment" and "packet too big" to uniformly use
"ICMP PTB message" everywhere.

Added Stanislav Shalunov's observation that PLPMTUD parallels
congestion control.

Better described the range of interoperability with classical PMTUD
in the introduction.

Removed vague language about "not being a protocol" and "excessive
Loss".

Slightly redefined flow: the granularity of PLPMTUD within a path.

Many English NITs and clarifications per Gorry Fairhurst and others.
Passes strict xml2rfc checking.

Added a paragraph encouraging interface MTUs that are optimal for
the NIC, rather than standard for the media.

Added a revision history section.

2. Overview

Packetization Layer Path MTU Discovery (PLPMTUD) is a method for TCP
or other Packetization Protocols to dynamically discover the MTU of a
path by probing with progressively larger packets.
It is most efficient when used in conjunction with the ICMP based
Path MTU Discovery mechanism as specified in RFC 1191 and RFC 1981,
but resolves many of the robustness problems of the classical
techniques since it does not depend on the delivery of ICMP messages.

This method is applicable to TCP and other transport- or
application-level protocols that are responsible for choosing packet
boundaries (e.g., segment sizes) and have an acknowledgment structure
that delivers to the sender accurate and timely indications of which
packets were lost.

The general strategy is for the Packetization Layer to find an
appropriate Path MTU by probing the path with progressively larger
packets. If a probe packet is successfully delivered, then the
effective Path MTU is raised to the probe size.

The isolated loss of a probe packet (with or without an ICMP Packet
Too Big message) is treated as an indication of an MTU limit, and not
as a congestion indicator. In this case alone, the Packetization
Protocol is permitted to retransmit any missing data without
adjusting the congestion window.

If there is a timeout or additional packets are lost during the
probing process, the probe is considered to be inconclusive (e.g.,
the lost probe does not necessarily indicate that the probe exceeded
the Path MTU). Furthermore, the losses are treated like any other
congestion indication: window or rate adjustments are mandatory per
the relevant congestion control standards [RFC2914]. Probing can
resume after a delay which is determined by the nature of the
detected failure.

PLPMTUD uses a searching technique to find the Path MTU. Each
conclusive probe narrows the MTU search range, either by raising the
lower limit on a successful probe or lowering the upper limit on a
failed probe, converging toward the true Path MTU.
For most transport layers, the search should be stopped once the
range is narrow enough that the benefit of a larger effective Path
MTU is smaller than the search overhead of finding it.

The most likely (and least serious) probe failure is the link
experiencing congestion related losses while probing. In this case
it is appropriate to retry a probe of the same size as soon as the
Packetization Layer has fully adapted to the congestion and recovered
from the losses. In other cases, additional losses or timeouts
indicate problems with the link or Packetization Layer. In these
situations it is desirable to use longer delays depending on the
severity of the error.

An optional verification process can be used to detect some
situations where raising the MTU raises the packet loss rate. For
example, if a link is striped across multiple physical channels with
inconsistent MTUs, it is possible that a probe will be delivered even
if it is too large for some of the physical channels. In such cases,
raising the Path MTU to the probe size can cause severe packet loss
and abysmal performance. After raising the MTU, the new MTU size can
be verified by monitoring the loss rate.

PLPMTUD introduces some flexibility in the implementation of
classical Path MTU discovery, which is subject to protocol failures
(connection hangs) if ICMP Packet Too Big (PTB) messages are not
delivered or processed for some reason [RFC2923]. With PLPMTUD,
classical Path MTU Discovery can include additional consistency
checks (e.g., validating additional fields in the transcribed header)
without increasing the risk of connection hangs due to spurious
failures of the added checks. Such changes to classical Path MTU
Discovery are beyond the scope of this document.
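
As an illustration only, the searching technique described earlier in
this overview (each conclusive probe raises the lower limit or lowers
the upper limit) might be sketched as follows. This is a minimal
sketch and not part of the specification: the bisection rule, the
function name search_path_mtu, the probe_ok callback, and the stop_gap
threshold are assumptions made for the example, and inconclusive
probes, retry delays, and verification are not modeled.

```python
from typing import Callable


def search_path_mtu(probe_ok: Callable[[int], bool],
                    search_low: int,
                    search_high: int,
                    stop_gap: int = 1) -> int:
    """Narrow the MTU search range using conclusive probe results.

    probe_ok(size) is a hypothetical stand-in for sending one probe of
    `size` bytes and learning whether it was delivered.  search_low must
    already be known to work; search_high is the largest size worth
    trying.  The search stops once the range is narrower than stop_gap
    bytes (an assumed stand-in for "not worth the search overhead").
    """
    while search_low < search_high and search_high - search_low >= stop_gap:
        # Bisection is an assumption of this sketch, not a requirement.
        probe_size = (search_low + search_high + 1) // 2
        if probe_ok(probe_size):
            # Successful probe: raise the lower limit to the probe size.
            search_low = probe_size
        else:
            # Failed probe: sizes >= probe_size are ruled out.
            search_high = probe_size - 1
    # The lower limit is the largest size known to be deliverable.
    return search_low
```

For example, with a simulated path that delivers packets of up to
1500 bytes, search_path_mtu(lambda size: size <= 1500, 1024, 9000)
converges on 1500. A larger stop_gap ends the search earlier and
returns a value that is within stop_gap bytes below the true Path MTU.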

In the limiting case, all ICMP PTB messages might be unconditionally
ignored, and PLPMTUD can be used as the sole method used to discover
the Path MTU. In this configuration, PLPMTUD parallels congestion
control. An end-to-end transport protocol adjusts properties of the
data stream (window size or packet size) while using packet losses to
deduce the appropriateness of the adjustments. This technique seems
to be more philosophically consistent with the end-to-end principle
of the Internet than relying on ICMP messages containing transcribed
headers of multiple protocol layers.

Most of the difficulty in implementing PLPMTUD arises because it
needs to be implemented in several different places within a single
node. In general, each Packetization Protocol needs to have its own
implementation of PLPMTUD. Furthermore, the natural mechanism to
share Path MTU information between concurrent or subsequent
connections over the same path is a path information cache in the IP
layer. The various Packetization Protocols need to have the means to
access and update the shared cache in the IP layer. This memo
describes PLPMTUD in terms of its primary subsystems without fully
describing how they are assembled into a complete implementation.

The vast majority of the implementation details described in this
document are recommendations based on experiences with earlier
versions of Path MTU Discovery. These recommendations are motivated
by a desire to maximize robustness of PLPMTUD in the presence of less
than ideal network conditions as they exist in the field.

Section 3 provides a complete glossary of terms.

Section 4 describes the details of PLPMTUD that affect
interoperability with other standards or Internet protocols.

Section 5 describes how to partition PLPMTUD into layers, and how to
manage the "path information cache" in the IP layer.

Section 6 describes the general Packetization Layer properties and
features needed to implement PLPMTUD.

Section 7 describes how to use probes to search for the Path MTU.

Section 8 recommends using IPv4 fragmentation in a configuration that
mimics IPv6 functionality, to minimize future problems migrating to
IPv6.

Section 9 describes a programming interface for implementing PLPMTUD
in applications that choose their own packet boundaries and for tools
to be able to diagnose path problems that interfere with Path MTU
Discovery.

Section 10 discusses implementation details for specific protocols,
including TCP.

3. Terminology

We use the following terms in this document:

IP: Either IPv4 [RFC0791] or IPv6 [RFC2460].

Node: A device that implements IP.

Router: A node that forwards IP packets not explicitly addressed to
itself.

Host: Any node that is not a router.

Upper layer: A protocol layer immediately above IP. Examples are
transport protocols such as TCP and UDP, control protocols such as
ICMP, routing protocols such as OSPF, and Internet or lower-layer
protocols being "tunneled" over (i.e., encapsulated in) IP such as
IPX, AppleTalk, or IP itself.

Link: A communication facility or medium over which nodes can
communicate at the link layer, i.e., the layer immediately below
IP. Examples are Ethernets (simple or bridged); PPP links; X.25,
Frame Relay, or ATM networks; and Internet (or higher) layer
"tunnels", such as tunnels over IPv4 or IPv6. Occasionally we use
the slightly more general term "lower layer" for this concept.

Interface: A node's attachment to a link.

Address: An IP-layer identifier for an interface or a set of
interfaces.

Packet: An IP header plus payload.

MTU: Maximum Transmission Unit, the size in bytes of the largest IP
packet, including the IP header and payload, that can be
transmitted on a link or path.
Note that this could more properly
be called the IP MTU, to be consistent with how other standards
organizations use the acronym MTU.

Link MTU: The Maximum Transmission Unit, i.e., maximum IP packet size
in bytes, that can be conveyed in one piece over a link. Beware
that this definition is different from the definition used by
other standards organizations.

For IETF documents, link MTU is uniformly defined as the IP MTU
over the link. This includes the IP header, but excludes link
layer headers and other framing which is not part of IP or the IP
payload.

Be aware that other standards organizations generally define link
MTU to include the link layer headers.

Path: The set of links traversed by a packet between a source node
and a destination node.

Path MTU, or PMTU: The minimum link MTU of all the links in a path
between a source node and a destination node.

Classical Path MTU Discovery: Process described in RFC 1191 and RFC
1981, in which nodes rely on ICMP "Packet Too Big" (PTB) messages
to learn the MTU of a path.

Packetization Layer: The layer of the network stack which segments
data into packets.

Effective PMTU: The current estimated value for PMTU used by a
Packetization Layer for segmentation.

PLPMTUD: Packetization Layer Path MTU Discovery, the method described
in this document, which is an extension to classical PMTU
discovery.

PTB (Packet Too Big) message: An ICMP message reporting that an IP
packet is too large to forward. This is the IPv6 term that
corresponds to the IPv4 "ICMP Can't fragment" message.

Flow: A context in which MTU discovery algorithms can be invoked.
This is naturally an instance of a Packetization Protocol, for
example, one side of a TCP connection.

MSS: The TCP Maximum Segment Size [RFC0793], the maximum payload size
available to the TCP layer.
This is typically the Path MTU minus
the size of the IP and TCP headers.

Probe packet: A packet which is being used to test a path for a
larger MTU.

Probe size: The size of a packet being used to probe for a larger
MTU, including IP headers.

Probe gap: The payload data that will be lost and need to be
retransmitted if the probe is not delivered.

Leading window: Any unacknowledged data in a flow at the time a probe
is sent.

Trailing window: Any data in a flow sent after a probe, but before
the probe is acknowledged.

Search strategy: The heuristics used to choose successive probe sizes
to converge on the proper Path MTU, as described in Section 7.3.

Full stop timeout: A timeout where none of the packets transmitted
after some event are acknowledged by the receiver, including any
retransmissions. This is taken as an indication of some failure
condition in the network, such as a routing change onto a link
with a smaller MTU. This is described in more detail in
Section 7.7.

4. Requirements

All Internet nodes SHOULD implement PLPMTUD in order to discover and
take advantage of the largest MTU supported along the Internet path.

All links MUST enforce their MTU: links that might
non-deterministically deliver packets that are larger than their
rated MTU MUST consistently discard such packets.

All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
functionality. All fragmentation SHOULD be done on the host, and all
IPv4 packets, including fragments, SHOULD have the DF bit set such
that they will not be fragmented (again) in the network. See
Section 8.

The requirements below only apply to those implementations that
include PLPMTUD.

To use PLPMTUD a Packetization Layer MUST have a loss reporting
mechanism that provides the sender with timely and accurate
indications of which packets were lost in the network.
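
To illustrate why per-packet loss reports matter, the following
sketch classifies a probe outcome from such a report, following the
rules summarized in Section 2: an isolated probe loss indicates an MTU
limit, while a timeout or additional losses make a lost probe
inconclusive and call for normal congestion control. It is a minimal
sketch under stated assumptions: the names ProbeResult and
classify_probe, the use of integer packet identifiers, and the choice
to report a delivered probe as a success even when other packets were
lost are all hypothetical and not taken from this document.

```python
from enum import Enum
from typing import Iterable, Tuple


class ProbeResult(Enum):
    SUCCESS = "success"            # probe delivered
    FAILURE = "failure"            # isolated probe loss: an MTU limit
    INCONCLUSIVE = "inconclusive"  # timeout or other losses seen


def classify_probe(probe_id: int,
                   lost_ids: Iterable[int],
                   timed_out: bool = False) -> Tuple[ProbeResult, bool]:
    """Return (result, apply_congestion_control) for one probe.

    probe_id and lost_ids are hypothetical packet identifiers taken
    from the Packetization Layer's loss report.
    """
    lost = set(lost_ids)
    other_loss = bool(lost - {probe_id})
    if timed_out:
        # A timeout is always handled as congestion.
        return ProbeResult.INCONCLUSIVE, True
    if probe_id not in lost:
        # Assumption of this sketch: a delivered probe counts as a
        # success; losses of other packets still trigger the normal
        # congestion response.
        return ProbeResult.SUCCESS, other_loss
    if other_loss:
        # The probe may have been lost to congestion, not to the MTU.
        return ProbeResult.INCONCLUSIVE, True
    # Only the probe was lost: treat as an MTU limit and suppress the
    # congestion response for this loss alone.
    return ProbeResult.FAILURE, False
```

Without an accurate report of which packets were lost, the last two
cases cannot be told apart, and the sender could not safely suppress
its congestion response.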

Normal congestion control algorithms MUST remain in effect under all
conditions except when only an isolated probe packet is detected as
lost. In this case alone the normal congestion (window or data rate)
reduction SHOULD be suppressed. If any other data loss is detected,
standard congestion control MUST take place.

Suppressed congestion control (as above) MUST be rate limited such
that it occurs less frequently than the worst case loss rate for TCP
congestion control at a comparable data rate over the same path
(i.e., less than the "TCP-friendly" loss rate [tcp-friendly]). This
SHOULD be enforced by requiring a minimum headway between a
suppressed congestion adjustment (due to a failed probe) and the next
attempted probe, which is equal to one round trip time for each
packet permitted by the congestion window. Alternatively, this may
be enforced by not suppressing congestion control if a second probe
is lost too soon after the first lost probe. This is discussed
further in Section 7.6.2.

Whenever the MTU is raised, the congestion state variables MUST be
rescaled so as not to raise the window size in bytes (or data rate in
bytes per second).

Whenever the MTU is reduced (e.g., when processing ICMP PTB messages),
the congestion state variables SHOULD be rescaled so as not to raise
the window size in packets.

If PLPMTUD updates the MTU for a particular path, all Packetization
Layer sessions that share the path representation SHOULD be notified
to make use of the new MTU and make the required congestion control
adjustments.

All implementations MUST include mechanisms for applications to
selectively transmit packets larger than the current effective Path
MTU (but smaller than the link MTU).
This is necessary to implement
PLPMTUD within an application (using a connectionless protocol) and
to implement diagnostic tools that do not rely on the operating
system's implementation of Path MTU discovery. See Section 9 for
further discussion.

Connectionless protocols and protocols that do not support PLPMTUD
SHOULD have their own default value for the initial effective path
MTU, which can be set to a more conservative (smaller) value than the
initial value used by TCP and other protocols that are well suited to
PLPMTUD. Implementations MAY use different heuristics to select the
initial effective path MTU for each protocol. There SHOULD be per
protocol and per route limits on the initial effective path MTU
(eff_pmtu) and the upper searching limit (search_high).

5. Layering

Packetization Layer Path MTU Discovery is most easily implemented by
splitting its functions between layers. The IP layer is the best
place to keep shared state, collect the ICMP messages, track IP
header sizes and manage MTU information provided by the link layer
interfaces. However, the procedures that PLPMTUD uses for probing
and verification of the Path MTU are very tightly coupled to features
of the Packetization Layers, such as data recovery and congestion
control state machines.

Note that this layering approach is a direct extension of the advice
in the current PMTUD specifications in RFC 1191 and RFC 1981.

5.1. Accounting for header sizes

The way in which PLPMTUD operates across multiple layers requires a
mechanism for accounting header sizes at all layers between IP and
the Packetization Layer (inclusive). When transmitting non-probe
packets, it is sufficient for the Packetization Layer to ensure an
upper bound on final IP packet size, so as not to exceed the current
effective Path MTU.
All Packetization Layers participating in
classical Path MTU Discovery have this requirement already. When
conducting a probe, the Packetization Layer MUST determine the probe
packet's final size including IP headers. This requirement is
specific to PLPMTUD, and satisfying it may require additional
inter-layer communication in existing implementations.

5.2. Storing PMTU information

This memo uses the concept of a "flow" to define the scope of the
Path MTU discovery algorithms. For many implementations, a flow
would naturally correspond to an instance of each protocol (i.e.,
each connection or session). In such implementations, the algorithms
described in this document are performed within each session for each
protocol. The observed PMTU (eff_pmtu in Section 7.1) can optionally
be shared between different flows with a common path representation.

Alternatively, PLPMTUD could be implemented such that its complete
state is associated with the path representations. Such an
implementation could use multiple connections or sessions for each
probe sequence. This approach is likely to converge much more
quickly in some environments, such as where an application uses many
small connections, each of which is too short to complete the Path
MTU Discovery process.

Within a single implementation, different protocols can use either of
these two approaches. Due to protocol specific differences in
constraints on generating probes (Section 6.2) and the MTU searching
algorithm (Section 7.3), it may not be feasible for different
Packetization Layer protocols to share PLPMTUD state. This suggests
that it may be possible for some protocols to share probing state,
but other protocols can only share observed PMTU. In this case, the
different protocols will have different PMTU convergence properties.
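
As an illustration of the first approach, where only the observed
PMTU is shared, the sketch below shows one possible shape for a cache
of eff_pmtu values keyed by a path representation, with notification
of the flows that share a path as called for in Section 4. It is a
sketch under stated assumptions and not a prescribed design: the class
name PathInfoCache, its methods, the use of a string as the path
representation, and the initial_pmtu parameter are hypothetical.

```python
from typing import Callable, Dict, List

# Called with the new effective PMTU when a shared path is updated.
Listener = Callable[[int], None]


class PathInfoCache:
    """Effective PMTU shared by flows with a common path representation."""

    def __init__(self, initial_pmtu: int) -> None:
        # initial_pmtu is a hypothetical default for unknown paths.
        self._initial_pmtu = initial_pmtu
        self._eff_pmtu: Dict[str, int] = {}
        self._listeners: Dict[str, List[Listener]] = {}

    def lookup(self, path: str) -> int:
        """Return the cached eff_pmtu for a path, or the default."""
        return self._eff_pmtu.get(path, self._initial_pmtu)

    def subscribe(self, path: str, listener: Listener) -> None:
        """Register a flow to be told when this path's PMTU changes."""
        self._listeners.setdefault(path, []).append(listener)

    def update(self, path: str, eff_pmtu: int) -> None:
        """Record a new eff_pmtu and notify every flow on the path."""
        if eff_pmtu == self.lookup(path):
            return  # nothing changed, so no notification is needed
        self._eff_pmtu[path] = eff_pmtu
        for listener in self._listeners.get(path, []):
            # Each flow is expected to adopt the new MTU and rescale
            # its own congestion state in its callback.
            listener(eff_pmtu)
```

In this arrangement each protocol keeps its own probing state, and
only the result of probing is published through update(). A protocol
that cannot probe can still benefit by calling lookup() or by
subscribing to changes.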
616 The IP layer is the best place to store cached PMTU values and other 617 shared state such as MTU values reported by ICMP PTB messages. 618 Ideally, this shared state should be associated with a specific path 619 traversed by packets exchanged between the source and destination 620 nodes. However, in most cases a node will not have enough 621 information to completely and accurately identify such a path. 622 Rather, a node must associate a PMTU value with some local 623 representation of a path. It is left to the implementation to select 624 the local representation of a path. 626 An implementation could use the destination address as the local 627 representation of a path. The PMTU value associated with a 628 destination would be the minimum PMTU learned across the set of all 629 paths in use to that destination. The set of paths in use to a 630 particular destination is expected to be small, in many cases 631 consisting of a single path. This approach will result in the use of 632 optimally sized packets on a per-destination basis, and integrates 633 nicely with the conceptual model of a host as described in [RFC2461]: 634 a PMTU value could be stored with the corresponding entry in the 635 destination cache. Storing the minimum value is suggested since NATs 636 and other forms of middle boxes may exhibit differing PMTUs 637 simultaneously at a single IP address. 639 Note that network or subnet numbers are not suitable to use as 640 representations of a path, because there is not a general mechanism 641 to determine the network mask at the remote host. 643 If IPv6 flows are in use, an implementation could use the IPv6 flow 644 id [RFC2460][RFC1809] as the local representation of a path. Packets 645 sent to a particular destination but belonging to different flows may 646 use different paths, with the choice of path depending on the flow 647 id. 
This approach will result in the use of optimally sized packets 648 on a per-flow basis, providing finer granularity than MTU values 649 maintained on a per-destination basis. 651 For source routed packets (i.e., packets containing an IPv6 routing 652 header, or IPv4 LSRR or SSRR options), the source route may further 653 qualify the local representation of a path. An implementation could 654 use source route information in the local representation of a path. 656 5.3. Accounting for IPsec 658 This document does not take a stance on the placement of IPsec 659 [RFC2401], which logically sits between IP and the Packetization 660 Layer. The PLPMTUD implementation can treat IPsec either as part of 661 IP or as part of the Packetization Layer, as long as the accounting 662 is consistent within the implementation. If IPsec is treated as part 663 of the IP layer, then each security association to a remote node may 664 need to be treated as a separate path. If IPsec is treated as part 665 of the Packetization Layer, the IPsec header size must be included in 666 the Packetization Layer's header size calculations. 668 5.4. Multicast 670 In the case of a multicast destination address, copies of a packet 671 may traverse many different paths to reach many different nodes. The 672 local representation of the "path" to a multicast destination must in 673 fact represent a potentially large set of paths. 675 Minimally, an implementation could maintain a single MTU value to be 676 used for all packets originated from the node. This MTU value would 677 be the minimum MTU learned across the set of all paths in use by the 678 node. This approach is likely to result in the use of smaller 679 packets than is necessary for many paths. 681 If the application using multicast gets complete delivery reports 682 (unlikely because this requirement has poor scaling properties), 683 PLPMTUD could be implemented in multicast protocols. 685 6. 
Common Packetization Properties 687 This section describes general Packetization Layer properties and 688 characteristics needed to implement PLPMTUD. It also describes some 689 implementation issues that are common to all Packetization Layers. 691 6.1. Mechanism to detect loss 693 It is important that the Packetization Layer has a timely and robust 694 mechanism for detecting and reporting losses. PLPMTUD makes MTU 695 adjustments on the basis of detected losses. Any delays or 696 inaccuracy in loss notification is likely to result in incorrect MTU 697 decisions or slow convergence. 699 It is best if Packetization Protocols use fairly explicit loss 700 notification such as selective acknowledgments, although implicit 701 mechanisms such as TCP Reno style duplicate acknowledgments counting 702 are sufficient. It is important that the mechanism can robustly 703 distinguish between the isolated loss of just a probe and other 704 combinations of losses. 706 Many protocol implementations have sophisticated mechanisms such as a 707 SACK scoreboard [RFC3517] or ACK Vector [RFC4340] to distinguish real 708 losses from reordered data. In these implementations it is desirable 709 to signal losses to PLPMTUD as a side effect of the data 710 retransmission. This approach offers the maximum protection from 711 confusing signals due to reordering and other events that might mimic 712 losses. 714 PLPMTUD can also be implemented in protocols that rely on timeouts as 715 their primary mechanism for loss recovery; however, timeouts should 716 be used only when there are no other alternatives. 718 6.2. Generating probes 720 There are several possible ways to alter Packetization Layers to 721 generate probes. 
The different techniques incur different overheads 722 in three areas: difficulty in generating the probe packet (in terms 723 of Packetization Layer implementation complexity and extra data 724 motion), possible additional network capacity consumed by the probes, 725 and the overhead of recovering from failed probes (both network and 726 protocol overheads). 728 Some protocols might be extended to allow arbitrary padding with 729 dummy data. This greatly simplifies the implementation because the 730 probing can be performed without participation from higher layers and 731 if the probe fails, the missing data (the "probe gap") is assured to 732 fit within the current MTU when it is retransmitted. This is 733 probably the most appropriate method for protocols that support 734 arbitrary length options or multiplexing within the protocol itself. 736 Many Packetization Layer protocols can carry pure control messages 737 (without any data from higher protocol layers) which can be padded to 738 arbitrary lengths. For example, the SCTP PAD chunk can be used in 739 this manner (see Section 10.2). This approach has the advantage that 740 nothing needs to be retransmitted if the probe is lost. 742 These techniques do not work for TCP, because there is not a separate 743 length field or other mechanism to differentiate between padding and 744 real payload data. With TCP the only approach is to send additional 745 payload data in an over-sized segment. There are at least two 746 variants of this approach, discussed in Section 10.1. 748 In a few cases, there may be no reasonable mechanisms to generate 749 probes within the Packetization Layer protocol itself. As a last 750 resort, it may be possible to rely on an adjunct protocol, such as 751 ICMP ECHO ("ping"), to send probe packets. See Section 10.3 for 752 further discussion of this approach. 754 7.
The Probing Method 756 This section describes the details of the MTU probing method, 757 including how to send probes and process error indications necessary 758 to search for the Path MTU. 760 7.1. Packet size ranges 762 This document describes the probing method using three state 763 variables: 765 search_low: The smallest useful probe size, minus one. The network 766 is expected to be able to deliver packets of size search_low. 768 search_high: The greatest useful probe size. The network is expected 769 not to be able to deliver packets of size search_high. 771 eff_pmtu: The effective PMTU for this flow. This is the largest non- 772 probe packet permitted by PLPMTUD for the path. 774 search_low eff_pmtu search_high 775 | | | 776 ...-------------------------> 777 non-probe size range 778 <--------------------------------------> 779 probe size range 781 Figure 1 783 When transmitting non-probes, the Packetization Layer SHOULD create 784 packets of size less than or equal to eff_pmtu. 786 When transmitting probes, the Packetization Layer MUST select a probe 787 size which is larger than search_low and smaller than or equal to 788 search_high. 790 When probing upward, eff_pmtu always equals search_low. In other 791 states, such as initial conditions, after ICMP PTB message processing 792 or following PLPMTUD on another flow sharing the same path 793 representation, eff_pmtu may be different from search_low. Normally 794 eff_pmtu will be greater than or equal to search_low and less than 795 search_high. It is generally expected but not required that probe 796 size will be greater than eff_pmtu. 798 For initial conditions when there is no information about the path, 799 eff_pmtu may be greater than search_low. The initial value of 800 search_low should be conservatively low, but performance may be 801 better if eff_pmtu starts at a higher, less conservative value. See 802 Section 7.2.
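The three state variables and the two size rules above can be sketched as follows. This is an illustrative sketch only; the class and method names are assumptions of this example, while the field names and the comparisons follow the text of this section.

```python
from dataclasses import dataclass


@dataclass
class SearchState:
    """Hypothetical container for the per-flow state variables."""
    search_low: int   # packets of this size are expected to be delivered
    eff_pmtu: int     # largest non-probe packet permitted for the path
    search_high: int  # greatest useful probe size

    def non_probe_ok(self, size: int) -> bool:
        # Non-probes SHOULD be less than or equal to eff_pmtu.
        return size <= self.eff_pmtu

    def probe_ok(self, size: int) -> bool:
        # Probes MUST be larger than search_low and no larger
        # than search_high.
        return self.search_low < size <= self.search_high
```

For example, with search_low = 512, eff_pmtu = 1400 and search_high = 1500, a 1400-byte non-probe packet is allowed, a 512-byte probe is not useful, and a 1500-byte probe is within the probe size range.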
804 If eff_pmtu is larger than search_low it is explicitly permitted to 805 send non-probe packets larger than search_low. When such a packet is 806 acknowledged, it is effectively an "implicit probe" and search_low 807 SHOULD be raised to the size of the acknowledged packet. However, if 808 an "implicit probe" is lost, it MUST NOT be treated as a probe 809 failure as a true probe would be. If eff_pmtu is too large, this 810 condition will only be detected with ICMP PTB messages or black hole 811 discovery (see Section 7.7). 813 7.2. Selecting initial values 815 The initial value for search_high should be the largest possible 816 packet that might be supported by the flow. This may be limited by 817 the local interface MTU, by an explicit protocol mechanism such as 818 the TCP MSS option, an intrinsic limit such as the size of a protocol 819 length field, or by a configuration option to prevent probing above 820 some maximum packet size. Search_high is likely to be the same as 821 the initial path MTU as computed by the classical path MTU discovery 822 algorithm. 824 It is recommended that search_low be initially set to an MTU size 825 that is likely to work over a very wide range of environments. Given 826 today's technologies, a value of 512 bytes is probably safe. For 827 IPv6 flows, a value of 1280 bytes is appropriate. The initial value 828 for search_low SHOULD be configurable. 830 Properly functioning Path MTU Discovery is critical to the robust and 831 efficient operation of the Internet. Any major change (as described 832 in this document) has the potential to be very disruptive if it 833 causes any unexpected changes in protocol behaviors. The selection 834 of the initial value for eff_pmtu determines to what extent a PLPMTUD 835 implementation's behavior resembles classical PMTUD in cases where 836 the classical method is sufficient.
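The guidance above for the initial search_low and search_high might be sketched as follows. The function name and parameters are assumptions of this example. Clamping search_low so that it never exceeds search_high is also an assumption, since this document does not discuss that corner case.

```python
def initial_search_range(interface_mtu, is_ipv6,
                         peer_limit=None, config_max=None, config_low=None):
    """Return hypothetical initial (search_low, search_high) values.

    peer_limit stands for a packet size limit derived from an explicit
    protocol mechanism such as the TCP MSS option; config_max and
    config_low stand for administrator-supplied configuration options.
    """
    # search_high: the largest packet the flow might support, bounded by
    # the local interface MTU and any protocol or configured limit.
    limits = [interface_mtu]
    limits += [x for x in (peer_limit, config_max) if x is not None]
    search_high = min(limits)

    # search_low: 512 bytes is probably safe for IPv4 and 1280 bytes is
    # appropriate for IPv6; the value SHOULD be configurable.
    default_low = 1280 if is_ipv6 else 512
    search_low = config_low if config_low is not None else default_low

    # Assumption of this sketch: keep search_low <= search_high.
    return min(search_low, search_high), search_high
```

For example, an IPv4 flow on a 1500-byte interface with no other limits would start with search_low = 512 and search_high = 1500.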
838 A conservative configuration would be to set eff_pmtu to search_high, 839 and rely on ICMP PTB messages to set the eff_pmtu down as 840 appropriate. In this configuration classical PMTUD is fully 841 functional and PLPMTUD is only invoked to recover from ICMP black 842 holes through the procedure described in Section 7.7. 844 In some cases where it is known that classical PMTUD is likely to 845 fail (for example, if ICMP PTB messages are administratively 846 disabled for security reasons), using a small initial eff_pmtu will 847 avoid the costly timeouts required for black hole detection. The 848 trade-off is that using a smaller than necessary initial eff_pmtu 849 might cause reduced performance. 851 Note that the initial eff_pmtu can be any value in the range 852 search_low to search_high. An initial eff_pmtu of 1400 bytes might 853 be a good compromise because it would be safe for nearly all tunnels 854 over all common networking gear, and yet close to the optimal MTU for 855 the majority of paths in the Internet today. This might be improved 856 by using some statistics of other recent flows: for example the 857 initial eff_pmtu for a flow might be set to the median of the probe 858 size for all recent successful probes. 860 Since the cost of PLPMTUD is dominated by the protocol specific 861 overheads of generating and processing probes, it is probably 862 desirable for each protocol to have its own heuristics to select the 863 initial eff_pmtu. It is especially important that connectionless 864 protocols and other protocols that may not receive clear indications 865 of ICMP black holes use conservative (smaller) initial values for 866 eff_pmtu, as described in Section 10.3. 868 There SHOULD be per protocol and per route configuration options to 869 override initial values for eff_pmtu and other PLPMTUD state 870 variables. 872 7.3. Selecting probe size 874 The probe may have a size anywhere in the "probe size range" 875 described above.
However, a number of factors affect the selection 876 of an appropriate size. A simple strategy might be to do a binary 877 search halving the probe size range with each probe. However, for 878 some protocols such as TCP, failed probes are more expensive than 879 successful ones, since data in a failed probe will need to be 880 retransmitted. For such protocols, a strategy using smaller probe 881 sizes and "probing up" behaves better. For many protocols, both at 882 and above the Packetization Layer, the benefit of increasing MTU 883 sizes may follow a step function such that it is not advantageous to 884 probe within certain regions at all. 886 As an optimization, it may be appropriate to probe at certain common 887 or expected MTU sizes, for example, 1500 bytes for standard Ethernet, 888 or 1500 bytes minus header sizes for tunnel protocols. 890 Some protocols may use other mechanisms to choose the probe sizes. 891 For example, protocols that have certain natural data block sizes 892 might simply assemble messages from a number of blocks until the 893 total size is smaller than search_high, and larger than search_low 894 (if possible). 896 Each Packetization Layer must determine when probing has converged, 897 that is, when the probe size range is small enough that further 898 probing is no longer worth its cost. When probing has converged, a 899 timer should be set. When the timer expires, search_high should be 900 reset to its initial value (described above) so that probing can 901 resume. Thus if the path changes, increasing the Path MTU, then the 902 flow will eventually take advantage of it. The value for this timer 903 MUST NOT be less than 5 minutes, and is recommended to be 10 minutes, 904 per RFC 1981. 906 7.4. Probing preconditions 908 Before sending a probe, the flow must at least meet the following 909 conditions: 910 o It has no outstanding probes or losses. 
911 o If the last probe failed or was inconclusive, then the probe 912 timeout has expired (see Section 7.6.2). 913 o The available window is greater than the probe size. 914 o For a protocol using in-band data for probing, enough data is 915 available to send the probe. 917 For protocols that probe with in-band data, when not enough data is 918 available to probe, the protocol may wish to delay sending non-probes 919 in order to accumulate enough data to send a probe. A delayed 920 sending algorithm such as Nagle [RFC0896] should be used to 921 appropriately limit the time data is delayed. 923 Some protocols may require additional packets after a loss to detect 924 it promptly (e.g., TCP loss detection using duplicate 925 acknowledgments). Such a protocol should wait until sufficient data 926 and window space is available so that it will be able to transmit 927 enough data after the probe to trigger the loss detection mechanism 928 in the event of a lost probe. 930 7.5. Conducting a probe 932 Once a probe size in the appropriate range has been selected, and the 933 above preconditions have been met, the Packetization Layer may 934 conduct a probe. To do so, it creates a probe packet such that its 935 size, including the outermost IP headers, is equal to the probe size. 936 After sending the probe it awaits response, which may take the 937 following results: 938 Success: The probe is acknowledged as having been received by the 939 remote host. 941 Failure: A protocol mechanism indicates that the probe was lost, but 942 no packets in the leading or trailing window were lost. 944 Timeout failure: A protocol mechanism indicates that the probe was 945 lost, and no packets in the leading window were lost, but is 946 unable to determine if any packets in the trailing window were 947 lost. For example, loss is detected by a timeout, and go-back-n 948 retransmission is used. 
950 Inconclusive: The probe was lost in addition to other packets in the 951 leading or trailing windows. 953 7.6. Response to probe results 955 When a probe has completed, the result should be processed as 956 follows, categorized by the probe's result type. 958 7.6.1. Probe success 960 When the probe is delivered, it is an indication that the Path MTU is 961 at least as large as the probe size. Set search_low to the probe 962 size. If the probe size is larger than the eff_pmtu, raise eff_pmtu 963 to the probe size. The probe size might be smaller than the eff_pmtu 964 if the flow has not been using the full MTU of the path because it is 965 subject to some other limitation, such as available data in an 966 interactive session. 968 Note that if a flow's packets are routed via multiple paths, or over 969 a path with a non-deterministic MTU, delivery of a single probe 970 packet does not indicate that all packets of that size will be 971 delivered. To be robust in such a case, the Packetization Layer 972 should conduct MTU verification as described in Section 7.8. 974 7.6.2. Probe failure 976 When only the probe is lost, this is treated as an indication that 977 the Path MTU is smaller than the probe size. In this case alone, the 978 loss SHOULD NOT be interpreted as a congestion signal. 980 In the absence of other indications, set search_high to the probe 981 size minus one. The eff_pmtu might be larger than the probe size if 982 the flow has not been using the full MTU of the path because it is 983 subject to some other limitation, such as available data in an 984 interactive session. If eff_pmtu is larger than the probe size, 985 eff_pmtu MUST be reduced to no larger than search_high, and SHOULD be 986 reduced to search_low, as the eff_pmtu has been determined to be 987 invalid, similar to after a full stop timeout (see Section 7.7).
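The state updates for probe success and probe failure described above can be sketched as two small functions. The function names and the tuple-based interface are assumptions of this example. Congestion control interaction, probe timers and ICMP PTB processing are deliberately not modeled.

```python
def on_probe_success(search_low, eff_pmtu, search_high, probe_size):
    """Return updated (search_low, eff_pmtu, search_high) after success."""
    # The path delivered the probe, so raise search_low to the probe size.
    search_low = probe_size
    # Raise eff_pmtu only if the probe was larger than it.
    eff_pmtu = max(eff_pmtu, probe_size)
    return search_low, eff_pmtu, search_high


def on_probe_failure(search_low, eff_pmtu, search_high, probe_size):
    """Return updated (search_low, eff_pmtu, search_high) after failure."""
    # In the absence of other indications, the probe size is too large.
    search_high = probe_size - 1
    # eff_pmtu MUST end up no larger than search_high and SHOULD be
    # reduced to search_low. Assumption of this sketch: the comparison
    # against search_high also covers eff_pmtu equal to the probe size,
    # a case the text does not spell out.
    if eff_pmtu > search_high:
        eff_pmtu = search_low
    return search_low, eff_pmtu, search_high
```

Starting from (512, 512, 1500), a successful 1006-byte probe yields (1006, 1006, 1500), while a failed 1006-byte probe yields (512, 512, 1005).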
989 If an ICMP PTB message is received matching the probe packet, then 990 search_high and eff_pmtu may be set from the MTU value indicated in 991 the message. Note that the ICMP message may be received either 992 before or after the protocol loss indication. 994 A probe failure event is the one situation under which the 995 Packetization Layer is permitted not to treat loss as a congestion 996 signal. Because there is some small risk that suppressing congestion 997 control might have unanticipated consequences (even for one isolated 998 loss), it is REQUIRED that probe failure events be less frequent than 999 the normal period for losses under standard congestion control. 1000 Specifically, after a probe failure event and suppressed congestion 1001 control, PLPMTUD MUST NOT probe again until an interval which is 1002 comparable to the expected interval between congestion control 1003 events. See Section 4 for details. The simplest estimate of the 1004 interval to the next congestion event is the same number of round 1005 trips as the current congestion window in packets. 1007 7.6.3. Probe timeout failure 1009 If the loss was detected with a timeout and repaired with go-back-n 1010 retransmission, then congestion window reduction will be necessary. 1011 The relatively high price of a failed probe in this case may merit a 1012 longer timeout. A timeout value of five times the non-timeout 1013 failure case (Section 7.6.2) is recommended. 1015 7.6.4. Probe inconclusive 1017 The presence of other losses near the loss of the probe may indicate 1018 that the probe was lost due to congestion rather than because of an 1019 MTU limitation. In this case, it is appropriate to update no state, 1020 and simply probe again when the probing preconditions are met (i.e., 1021 when no recent losses have been observed). 
At this point, it is 1022 particularly appropriate to re-probe since the flow's congestion 1023 window will be at its lowest point, minimizing the probability of 1024 congestive losses. 1026 7.7. Full stop timeout 1028 Under all conditions, a full stop timeout (also known as a 1029 "persistent timeout" in other documents) should be taken as an 1030 indication of some significantly disruptive event in the network, 1031 such as a router failure or a routing change to a path with a smaller 1032 MTU. For TCP, this occurs when the R1 timeout threshold described by 1033 [RFC1122] expires. 1035 If there is a full stop timeout and there was not an ICMP message 1036 indicating a reason (PTB, Net unreachable, etc., or the ICMP message 1037 was ignored for some reason), the suggested first recovery action is 1038 to treat this as a detected ICMP black hole as defined in [RFC2923]. 1040 The response to a detected black hole depends on the current values 1041 for search_low and eff_pmtu. If eff_pmtu is larger than search_low, 1042 set eff_pmtu to search_low. Otherwise, set both eff_pmtu and 1043 search_low to the initial value for search_low. Upon further 1044 successive timeouts, search_low and eff_pmtu SHOULD be halved, with a 1045 lower bound of 68 bytes for IPv4 and 1280 bytes for IPv6. 1047 7.8. MTU verification 1049 It is possible for a flow to simultaneously traverse multiple paths, 1050 but it will only be able to keep a single path representation for the 1051 flow. If the paths have different MTUs, storing the minimum MTU of 1052 all paths in the flow's path representation will result in correct 1053 behavior. If ICMP PTB messages are delivered, then classical PMTUD 1054 will work correctly in this situation. 1056 If ICMP delivery fails, breaking classical PMTUD, the connection will 1057 rely solely on PLPMTUD. However, in this case, PLPMTUD may fail as 1058 well since its requirement that links MUST NOT deliver packets larger 1059 than their MTU is violated.
A probe with a size greater than the 1060 minimum but smaller than the maximum of the Path MTUs may be 1061 successful. However, upon raising the flow's effective PMTU, the 1062 loss rate will significantly increase. The flow may still make 1063 progress, but the resultant loss rate may be unacceptable. For 1064 example, when using two-way round-robin striping, 50% of full-sized 1065 packets would be dropped. 1067 Striping in this manner is often operationally undesirable (e.g., due 1068 to packet reordering), and is usually avoided by hashing flows to a 1069 single path. However, to increase robustness, an implementation 1070 should implement some form of MTU verification, such that if 1071 increasing eff_pmtu results in a sharp increase in loss rate, it will 1072 fall back to using a lower MTU. 1074 A recommended strategy would be to save the value of eff_pmtu before 1075 raising it. Then, if loss rate rises above a threshold for a period 1076 of time (e.g., loss rate is higher than 10% over multiple RTO 1077 intervals), then the new MTU is considered incorrect. The saved 1078 value of eff_pmtu can be restored, and search_high reduced in the 1079 same manner as in a probe failure. PLPMTUD implementations SHOULD 1080 implement MTU verification. 1082 8. Host Fragmentation 1084 Packetization Layers are encouraged to avoid sending messages that 1085 will require fragmentation [Kent87] [I-D.heffner-frag-harmful]. 1086 However, entirely preventing fragmentation is not always possible. 1087 Some Packetization Layers, such as a UDP application outside the 1088 kernel, may be unable to change the size of messages it sends, 1089 resulting in datagram sizes that exceed the Path MTU. 1091 IPv4 permitted such applications to send packets without the DF bit 1092 set. Oversized packets without the DF bit set would be fragmented in 1093 the network or sending host when they encountered a link with a MTU 1094 smaller than the packet. 
In some cases, packets could be fragmented 1095 more than once if there were cascaded links with progressively 1096 smaller MTUs. This approach is not recommended. 1098 It is recommended that IPv4 implementations use a strategy that 1099 mimics IPv6 functionality. When an application sends datagrams that 1100 are larger than the effective Path MTU they should be fragmented to 1101 the Path MTU in the host IP layer even if they are smaller than the 1102 link MTU of the first network hop directly attached to the host. The 1103 DF bit should be set on the fragments, so they will not be fragmented 1104 again in the network. This technique will minimize the likelihood 1105 that applications will rely on IPv4 fragmentation in a way that 1106 cannot be implemented in IPv6. At least one major operating system 1107 already uses this strategy. An exception to this rule is if the 1108 application indicates that it is sending an oversized packet for 1109 probing or diagnostic purposes, described in Section 9. 1111 Since protocols that do not implement PLPMTUD are still subject to 1112 the black hole problem, it may be desirable to present to these 1113 protocols a "safe" MTU likely to work on any path (e.g., 1280 bytes). 1114 Then, allow any protocol implementing PLPMTUD to operate in the full 1115 range supported by the lower layer. 1117 Note that IP fragmentation divides data into packets, so it is 1118 minimally a Packetization Layer. However, it does not have a 1119 mechanism to detect lost packets, so it cannot support a native 1120 implementation of PLPMTUD. Fragmentation-based PLPMTUD requires an 1121 adjunct protocol as described in Section 10.3. 1123 9. Application Probing 1125 All implementations MUST include a mechanism where applications using 1126 connectionless protocols can send their own probes.
This is 1127 necessary to implement PLPMTUD in an application protocol as 1128 described in Section 10.4 or to implement diagnostic tools for 1129 debugging problems with PMTUD. There must be a mechanism that 1130 permits an application to send datagrams that are larger than 1131 eff_pmtu, the operating system's estimate of the path MTU, without 1132 being fragmented. If these are IPv4 packets, they MUST have the DF 1133 bit set. 1135 At this time, most operating systems support two modes for sending 1136 datagrams: one which silently fragments packets that are too large, 1137 and another that rejects packets that are too large. Neither of 1138 these modes is suitable for implementing PLPMTUD in an application or 1139 diagnosing problems with path MTU discovery. A third mode is needed 1140 where the datagram is sent even if it is larger than the current 1141 estimate of the path MTU. 1143 Implementing PLPMTUD in an application also requires a mechanism 1144 where the application can inform the operating system about the 1145 outcome of the probe as described in Section 7.6, or directly update 1146 search_low, search_high and eff_pmtu, described in Section 7.1. 1148 Diagnostic applications are useful for finding PMTUD problems, such 1149 as those that might be caused by a buggy router that returns ICMP PTB 1150 messages with incorrect size information. Such problems can be most 1151 quickly located with a tool that can send probes of any specified 1152 size, and collect and display all returned ICMP PTB messages. 1154 10. Specific Packetization Layers 1156 This section discusses specific implementation details for different 1157 protocols that can be used as Packetization Layer protocols. All 1158 Packetization Layer protocols must consider all of the issues 1159 discussed in Section 6. For most protocols, it is self evident how 1160 to address many of these issues.
It is hoped that the protocols 1161 described here will be sufficient illustration for implementers to 1162 adapt other protocols. 1164 10.1. Probing method using TCP 1166 TCP has no mechanism to distinguish in-band data from padding. 1167 Therefore, TCP must generate probes by appropriately segmenting data. 1168 There are two approaches to segmentation: overlapping and non- 1169 overlapping. 1171 In the non-overlapping method, data is segmented such that the probe 1172 and any subsequent segments contain no overlapping data. If the 1173 probe is lost, the "probe gap" will be a full probe size minus 1174 headers. Data in the probe gap will need to be retransmitted with 1175 multiple smaller segments. 1177 TCP sequence number 1179 t <----> 1180 i <--------> (probe) 1181 m <----> 1182 e 1183 . 1184 . (probe lost) 1185 . 1187 <----> (probe gap retransmitted) 1188 <--> 1190 Figure 2 1192 An alternate approach is to send subsequent data overlapping the 1193 probe such that the probe gap is equal in length to the current MSS. 1194 In the case of a successful probe, this has added overhead in that it 1195 will send some data twice, but it will have to retransmit only one 1196 segment after a lost probe. When a probe succeeds, there will likely 1197 be some duplicate acknowledgments generated due to the duplicate data 1198 sent. It is important that these duplicate acknowledgments not 1199 trigger Fast Retransmit. As such, an implementation using this 1200 approach SHOULD limit the probe size to three times the current MSS 1201 (causing at most 2 duplicate acknowledgments), or appropriately 1202 adjust its duplicate acknowledgment threshold for data immediately 1203 after a successful probe. 1205 TCP sequence number 1207 t <----> 1208 i <--------> (probe) 1209 m <----> 1210 e <----> 1212 . 1213 . (probe lost) 1214 . 
1216 <----> (probe gap retransmitted) 1218 Figure 3 1220 The choice of which segmentation method to use should be based on 1221 what is simplest and most efficient for a given TCP implementation. 1223 10.2. Probing method using SCTP 1225 In the SCTP protocol [RFC2960], the application writes messages to 1226 SCTP, which "chunkifies" them into smaller pieces suitable for 1227 transmission through the network. Once a message has been 1228 chunkified, it is assigned a Transmission Sequence Number (TSN). 1229 Once a TSN has been transmitted, SCTP cannot change the chunk size. 1230 SCTP multi-path support normally requires SCTP to chunkify its 1231 messages to fit the smallest PMTU of all paths. Although not 1232 required, implementations may bundle multiple data chunks together to 1233 make larger IP packets to send on paths with a larger PMTU. Note 1234 that SCTP must independently probe the PMTU on each path to the peer. 1236 The recommended method for generating probes is to add a chunk 1237 consisting only of padding to an SCTP message. The PAD chunk defined 1238 in [I-D.tuexen-tsvwg-sctp-padding] SHOULD be attached to a minimum 1239 length HEARTBEAT chunk to build a probe packet. This method is fully 1240 compatible with all current SCTP implementations. 1242 SCTP MAY also probe with a method similar to TCP's described above, 1243 using inline data. Using such a method has the advantage that 1244 successful probes have no additional overhead; however, failed probes 1245 will require retransmission of data, which may impact flow 1246 performance. 1248 10.3. Probing method for IP fragmentation 1250 There are a few protocols and applications that normally send large 1251 datagrams and rely on IP fragmentation to deliver them. It has been 1252 known for a long time that this has some undesirable consequences 1253 [Kent87]. More recently it has come to light that IPv4 fragmentation 1254 is not sufficiently robust for general use in today's Internet.
   The 16-bit IP identification field is not large enough to prevent
   frequent mis-associated IP fragments, and the TCP and UDP checksums
   are insufficient to prevent the resulting corrupted data from being
   delivered to higher protocol layers [I-D.heffner-frag-harmful].

   As mentioned in Section 8, datagram protocols (such as UDP) might
   rely on IP fragmentation as a Packetization Layer. However, using IP
   fragmentation to implement PLPMTUD is problematic because the IP
   layer has no mechanism to determine if the packets are ultimately
   delivered to the far node, without direct participation by the
   application.

   To support IP fragmentation as a Packetization Layer under an
   unmodified application, we propose to rely on the path MTU sharing
   described in Section 5.2 plus an adjunct protocol to probe the path
   MTU. There are a number of protocols that might be used for the
   purpose, such as ICMP ECHO and ECHO REPLY, or "traceroute" style UDP
   datagrams that trigger ICMP messages. Use of ICMP ECHO and ECHO
   REPLY will probe both forward and return paths, so the sender will
   only be able to take advantage of the minimum of the two. Other
   methods that probe only the forward path are preferred if available.

   All of these approaches have a number of potential robustness
   problems. The most likely failures are due to losses unrelated to
   MTU (e.g., nodes that discard some protocol types). These
   non-MTU-related losses can prevent PLPMTUD from raising the MTU,
   forcing IP fragmentation to use a smaller MTU than necessary. Since
   these failures are not likely to cause interoperability problems,
   they are relatively benign.

   However, there do exist other, more serious failure modes, such as
   might be caused by middle boxes or upper layer routers that choose
   different paths for different protocol types or sessions.
   In such environments, adjunct protocols may legitimately experience a
   different path MTU than the primary protocol. If the adjunct
   protocol finds a larger MTU than the primary protocol, PLPMTUD may
   select an MTU that is not usable by the primary protocol. Although
   this is a potentially serious problem, this sort of situation is
   likely to be viewed as broken by a large number of observers, and
   thus there will be strong motivation to correct it.

   Since connectionless protocols might not keep enough state to
   effectively diagnose MTU black holes, it would be more robust to err
   on the side of using too small an initial MTU (e.g., 1 kByte or
   less) prior to probing a path to measure the MTU. For this reason we
   suggest that IP fragmentation use an initial eff_pmtu which is
   selected as described in Section 7.2, except using a separate global
   control for the default initial eff_pmtu for connectionless
   protocols.

   Connectionless protocols also introduce an additional problem with
   maintaining the path information cache: there are no events
   corresponding to connection establishment and tear-down to use to
   manage the cache itself. A natural approach would be to keep an
   immutable cache entry for the "default path", which has an eff_pmtu
   that is fixed at the initial value for connectionless protocols. The
   adjunct path MTU discovery protocol would be invoked once the number
   of fragmented datagrams to any particular destination reaches some
   configurable threshold (e.g., 5 datagrams). A new path cache entry
   would be created when the adjunct protocol updates eff_pmtu, and
   deleted on the basis of a timer or a Least Recently Used cache
   replacement algorithm.

10.4. Probing method using applications

   The disadvantages of relying on IP fragmentation and an adjunct
   protocol to perform path MTU discovery can be overcome by
   implementing path MTU discovery within the application itself, using
   the application's own protocol. The application must have some
   suitable method for generating probes and have an accurate and timely
   mechanism to determine if the probes were lost.

   Ideally the application protocol includes a lightweight echo function
   that confirms message delivery, plus a mechanism for padding the
   messages out to the desired probe size, such that the padding is not
   echoed. This combination (akin to the SCTP HB plus PAD) is preferred
   because an application can separately measure the MTU of each
   direction on a path with asymmetrical MTUs.

   For protocols that cannot implement PLPMTUD with "echo plus pad"
   there are often alternate methods for generating probes. For
   example, the protocol may have a variable length echo that
   effectively measures the minimum MTU of both the forward and return
   paths, or there may be a way to add padding to regular messages
   carrying real application data. There may also be alternate ways to
   segment application data to generate probes, or as a last resort, it
   may be feasible to extend the protocol with new message types
   specifically to support MTU discovery.

   Note that if it is necessary to add new message types to support
   PLPMTUD, the most general approach is to add ECHO and PAD messages,
   which permit the greatest possible latitude in how an
   application-specific implementation of PLPMTUD interacts with other
   applications and protocols on the same end system.

   All application probing techniques require the ability to send
   messages that are larger than the current eff_pmtu described in
   Section 9.

11. Security Considerations

   Under all conditions the PLPMTUD procedures described in this
   document are at least as secure as the current standard Path MTU
   Discovery procedures described in RFC 1191 and RFC 1981.

   Since this algorithm is designed for robust operation without any
   ICMP or other messages from the network, PLPMTUD could be configured
   to ignore all ICMP messages, either globally or on a per-application
   basis. In such a configuration, it cannot be attacked unless the
   attacker can identify and cause probe packets to be lost. Attacking
   PLPMTUD reduces performance, but not as much as attacking congestion
   control by causing arbitrary packets to be lost. Such an attacker
   might do far more damage by completely disrupting specific other
   protocols, such as DNS.

   Since packetization protocols may share state with each other, if one
   packetization protocol (particularly an application) were hostile to
   other protocols on the same host, it could harm performance in the
   other protocols by reducing the effective MTU. If a packetization
   protocol is untrusted, it should not be allowed to write to shared
   state.

12. IANA Considerations

   None.

13. References

13.1. Normative references

   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
              September 1981.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              November 1990.

   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
              for IP version 6", RFC 1981, August 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", RFC 2460, December 1998.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
              Zhang, L., and V. Paxson, "Stream Control Transmission
              Protocol", RFC 2960, October 2000.

13.2. Informative references

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1809]  Partridge, C., "Using the Flow Label Field in IPv6",
              RFC 1809, June 1995.

   [RFC2923]  Lahey, K., "TCP Problems with Path MTU Discovery",
              RFC 2923, September 2000.

   [RFC2401]  Kent, S. and R. Atkinson, "Security Architecture for the
              Internet Protocol", RFC 2401, November 1998.

   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
              RFC 2914, September 2000.

   [RFC2461]  Narten, T., Nordmark, E., and W. Simpson, "Neighbor
              Discovery for IP Version 6 (IPv6)", RFC 2461,
              December 1998.

   [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
              Conservative Selective Acknowledgment (SACK)-based Loss
              Recovery Algorithm for TCP", RFC 3517, April 2003.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
              RFC 896, January 1984.

   [Kent87]   Kent, C. and J. Mogul, "Fragmentation considered harmful",
              Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

   [tcp-friendly]
              Mahdavi, J. and S. Floyd, "TCP-Friendly Unicast Rate-Based
              Flow Control", Technical note sent to the end2end-interest
              mailing list, January 1997.

   [I-D.heffner-frag-harmful]
              Heffner, J., "Fragmentation Considered Very Harmful",
              draft-heffner-frag-harmful-01 (work in progress),
              April 2006.

   [I-D.tuexen-tsvwg-sctp-padding]
              Tuexen, M. and R.
              Stewart, "Padding Chunk and Parameter for SCTP",
              draft-tuexen-tsvwg-sctp-padding-00 (work in progress),
              February 2006.

Appendix A. Acknowledgements

   Many ideas and even some of the text come directly from RFC 1191 and
   RFC 1981.

   Many people made significant contributions to this document,
   including: Randall Stewart for SCTP text, Michael Richardson for
   material from an earlier ID on tunnels that ignore DF, Stanislav
   Shalunov for the idea that pure PLPMTUD parallels congestion control,
   and Matt Zekauskas for maintaining focus during the meetings. Thanks
   to the early implementers: Kevin Lahey, John Heffner and Rao Shoaib,
   who provided concrete feedback on weaknesses in earlier drafts.
   Thanks also to all of the people who made constructive comments in
   the working group meetings and on the mailing list. I am sure I have
   missed many deserving people.

   Matt Mathis and John Heffner are supported in this work by a grant
   from Cisco Systems, Inc.

Authors' Addresses

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA 15213
   US

   Phone: 412-268-3319
   Email: mathis@psc.edu

   John W. Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA 15213
   US

   Phone: 412-268-2329
   Email: jheffner@psc.edu

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.