idnits 2.17.1 draft-ietf-pmtud-method-11.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 15. -- Found old boilerplate from RFC 3978, Section 5.5 on line 1557. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1568. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1575. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1581. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but does not include the phrase in its RFC 2119 key words list. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. 
If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (December 5, 2006) is 6350 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 1981 (Obsoleted by RFC 8201) ** Obsolete normative reference: RFC 2460 (Obsoleted by RFC 8200) ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 3697 (Obsoleted by RFC 6437) ** Obsolete normative reference: RFC 2960 (Obsoleted by RFC 4960) == Outdated reference: A later version (-02) exists of draft-ietf-tsvwg-sctp-padding-00 -- Obsolete informational reference (is this intentional?): RFC 2401 (Obsoleted by RFC 4301) -- Obsolete informational reference (is this intentional?): RFC 2461 (Obsoleted by RFC 4861) -- Obsolete informational reference (is this intentional?): RFC 3517 (Obsoleted by RFC 6675) == Outdated reference: A later version (-05) exists of draft-heffner-frag-harmful-01 Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group M. Mathis 3 Internet-Draft J. 
Heffner 4 Intended status: Standards Track PSC 5 Expires: June 8, 2007 December 5, 2006 7 Packetization Layer Path MTU Discovery 8 draft-ietf-pmtud-method-11 10 Status of this Memo 12 By submitting this Internet-Draft, each author represents that any 13 applicable patent or other IPR claims of which he or she is aware 14 have been or will be disclosed, and any of which he or she becomes 15 aware will be disclosed, in accordance with Section 6 of BCP 79. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that 19 other groups may also distribute working documents as Internet- 20 Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six months 23 and may be updated, replaced, or obsoleted by other documents at any 24 time. It is inappropriate to use Internet-Drafts as reference 25 material or to cite them other than as "work in progress." 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt. 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 This Internet-Draft will expire on June 8, 2007. 35 Copyright Notice 37 Copyright (C) The Internet Society (2006). 39 Abstract 41 This document describes a robust method for Path MTU Discovery 42 (PMTUD) that relies on TCP or some other Packetization Layer to probe 43 an Internet path with progressively larger packets. This method is 44 described as an extension to RFC 1191 and RFC 1981, which specify 45 ICMP based Path MTU Discovery for IP versions 4 and 6, respectively. 47 Table of Contents
49 1. Introduction
50 1.1. Revision History
51 1.1.1. Changes since version -10, 25 September 2006
52 1.1.2. Changes since version -09, 29 August 2006
53 1.1.3. Changes since version -08, 1 August 2006
54 1.1.4. Changes since version -07, July 2006 (IETF 66)
55 1.1.5. Changes since version -06, March 2006 (IETF 65)
56 1.1.6. Changes since version -05, November 2005 (IETF 64)
57 1.1.7. Changes since version -04, February 2005 (IETF 62)
58 1.1.8. Changes since version -03, October 2004 (IETF 61)
59 1.1.9. Changes since version -02, July 19th 2004 (IETF 60)
60 2. Overview
61 3. Terminology
62 4. Requirements
63 5. Layering
64 5.1. Accounting for header sizes
65 5.2. Storing PMTU information
66 5.3. Accounting for IPsec
67 5.4. Multicast
68 6. Common Packetization Properties
69 6.1. Mechanism to detect loss
70 6.2. Generating probes
71 7. The Probing Method
72 7.1. Packet size ranges
73 7.2. Selecting initial values
74 7.3. Selecting probe size
75 7.4. Probing preconditions
76 7.5. Conducting a probe
77 7.6. Response to probe results
78 7.6.1. Probe success
79 7.6.2. Probe failure
80 7.6.3. Probe timeout failure
81 7.6.4. Probe inconclusive
82 7.7. Full-stop timeout
83 7.8. MTU verification
84 8. Host Fragmentation
85 9. Application Probing
86 10. Specific Packetization Layers
87 10.1. Probing method using TCP
88 10.2. Probing method using SCTP
89 10.3. Probing method for IP fragmentation
90 10.4. Probing method using applications
91 11. Security Considerations
92 12. IANA Considerations
93 13. References
94 13.1. Normative references
95 13.2. Informative references
96 Appendix A. Acknowledgments
97 Authors' Addresses
98 Intellectual Property and Copyright Statements
100 1. Introduction 102 This document describes a method for Packetization Layer Path MTU 103 Discovery (PLPMTUD) which is an extension to existing Path MTU 104 Discovery methods described in [RFC1191] and [RFC1981]. In the 105 absence of ICMP messages, the proper MTU is determined by starting 106 with small packets and probing with successively larger packets. The 107 bulk of the algorithm is implemented above IP, in the transport layer 108 (e.g., TCP) or other "Packetization Protocol" that is responsible for 109 determining packet boundaries. 111 This document does not update RFC1191 or RFC1981; however, since it 112 supports correct operation without ICMP, it implicitly relaxes some 113 of the requirements for the algorithms specified in those documents.
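The probing idea in the first paragraph of this introduction (start with a small packet size and probe with successively larger ones) can be illustrated with a minimal sketch. It is not the algorithm specified by this document. The function name discover_pmtu, the callback probe_ok, and the parameters search_low and min_gain are hypothetical names chosen for illustration, and the sketch assumes every probe result is conclusive, ignoring congestion, timeouts, and ICMP processing entirely.

```python
from typing import Callable


def discover_pmtu(probe_ok: Callable[[int], bool],
                  search_low: int,
                  search_high: int,
                  min_gain: int = 8) -> int:
    """Return the largest packet size found to be deliverable.

    probe_ok(size) is a hypothetical stand-in for sending one probe
    packet of `size` bytes and learning from the Packetization Layer
    whether it was delivered. search_low is assumed to be a size that
    is already known to work on the path.
    """
    eff_pmtu = search_low  # current estimate, starts small
    # Stop once the remaining range is too narrow to be worth probing.
    while search_high - eff_pmtu >= min_gain:
        # Probe halfway up the remaining range (always above eff_pmtu).
        probe_size = (eff_pmtu + search_high + 1) // 2
        if probe_ok(probe_size):
            eff_pmtu = probe_size         # delivered: raise the estimate
        else:
            search_high = probe_size - 1  # lost: lower the upper limit
    return eff_pmtu
```

For a simulated path that delivers packets of up to 1500 bytes, discover_pmtu(lambda size: size <= 1500, 1024, 9000, min_gain=1) returns 1500. With a larger min_gain the search stops earlier and returns a value slightly below the true limit, trading a little packet size for fewer probes.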
115 The methods described in this document rely on features of existing 116 protocols. They apply to many transport protocols over IPv4 and 117 IPv6. They do not require cooperation from the lower layers (except 118 that they are consistent about which packet sizes are acceptable), or 119 from peers. As the methods apply only to senders, variants in 120 implementations will not cause interoperability problems. 122 For the sake of clarity, we uniformly prefer TCP and IPv6 terminology. 123 In the terminology section we also present the analogous IPv4 terms 124 and concepts for the IPv6 terminology. In a few situations we 125 describe specific details that are different between IPv4 and IPv6. 127 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 128 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 129 document are to be interpreted as described in [RFC2119]. 131 This document is a product of the Path MTU Discovery (pmtud) working 132 group of the IETF and draws heavily on RFC1191 and RFC1981 for 133 terminology, ideas, and some of the text. 135 1.1. Revision History 137 NOTE TO RFC EDITOR: this section to be removed before publication. 139 These are all recent substantive changes, in reverse chronological 140 order. This section will be removed prior to publication as an RFC. 142 Please send comments and suggestions to pmtud@ietf.org. Interim 143 drafts and other useful information will be posted at 144 http://www.psc.edu/~mathis/MTU/pmtud/index.html . 146 1.1.1. Changes since version -10, 25 September 2006 148 These changes all follow the IESG and related reviews: 150 Drop over-ambitious self requirement language. 152 Clarify probing pre-conditions and why you cannot probe at the end of a 153 connection. Strengthened the suggestions for how to accumulate data 154 to meet probing pre-conditions. 156 Remove obsolete and unneeded definition of router.
158 Expand "Can't fragment" to "fragmentation needed and DF set" 160 Better explain text on "Links MUST enforce their MTU". 162 Tied "path representation" in the requirements section to the later 163 discussion. 165 Deleted incorrect comment about problems with multicast. 167 Updated IPv6 flow label usage to be consistent with RFC3697 169 Expand on the "sketch" nature of parts of this document. 171 Tighten the abstract. 173 1.1.2. Changes since version -09, 29 August 2006 175 Edits in response to AD review: Clarified the relationship to RFC1191 176 and RFC1981. Added RFC2119 language throughout the document. No 177 longer permit more frequent probing by not suppressing congestion 178 control. Slightly relaxed the lower bound on supported MTU sizes. 179 Corrected more nits. 181 1.1.3. Changes since version -08, 1 August 2006 183 Restore some lost edits that were supposed to have appeared in -07. 185 1.1.4. Changes since version -07, July 2006 (IETF 66) 187 Last call comments from Gorry Fairhurst, Ivan Beschastnikh, and Mark 188 Allman. Nits and clarifications. 190 Changed MAY to SHOULD suppress congestion control response on failed 191 probe. 193 1.1.5. Changes since version -06, March 2006 (IETF 65) 195 Changed the title to include "Packetization Layer". 197 Renamed "Diagnostic Interface" section to "Application Probing" and 198 broadened the language to include other uses. 200 Clarifications to sections "packet size ranges", "host 201 fragmentation", and "probing using applications". 203 Language nits. 205 1.1.6. Changes since version -05, November 2005 (IETF 64) 207 Re-worked probing method sections for TCP and SCTP. The SCTP section 208 reflects the new PAD chunk type, and contains some text from Michael 209 Tuexen. 211 Made a number of language clarification and consistency improvements, 212 largely from comments by Gorry Fairhurst. 214 Added appropriate citations, and removed the last of the "@@" TODO 215 items. 217 1.1.7. 
Changes since version -04, February 2005 (IETF 62) 219 General restructuring and rewriting of some sections based on new 220 experience. Relaxed and generalized a lot of over-specified 221 language, for example, the search strategy description. 223 Decoupled verification from probing, and relaxed its specification. 225 Removed all specified changes to ICMP processing. We decided this 226 was out of scope for this particular document. 228 Changed all language to refer to MTU rather than MPS. 230 1.1.8. Changes since version -03, October 2004 (IETF 61) 232 A number of minor style and grammar edits. 234 1.1.9. Changes since version -02, July 19th 2004 (IETF 60) 236 Many minor updates throughout the document. 238 Added a section describing the interactions between PLPMTUD and 239 congestion control. 241 Removed a difficult to implement requirement for future data to 242 transmit. 244 Added "IP Fragmentation" and "Application protocol" as Packetization 245 Layers. 247 Clarified interactions between TCP SACK and MTU. 249 Updated SCTP section to reflect new probing method using "PAD 250 chunks". 252 Distilled the protocol specific material into separate subsections 253 for each protocol. 255 Added a section on common requirements and functions for all 256 Packetization Layers. More accurately characterized the 257 "bidirectional" (and other) requirements of the PL protocol. Updated 258 the search strategy in this new section. 260 Change "ICMP can't fragment" and "packet too big" to uniformly use 261 "ICMP PTB message" everywhere. 263 Added Stanislav Shalunov's observation that PLPMTUD parallels 264 congestion control. 266 Better described the range of interoperability with classical pMTUd 267 in the introduction. 269 Removed vague language about "not being a protocol" and "excessive 270 Loss". 272 Slightly redefined flow: the granularity of PLPMTUD within a path. 274 Many English NITs and clarifications per Gorry Fairhurst and others. 275 Passes strict xml2rfc checking. 
277 Add a paragraph encouraging interface MTUs that are optimal for 278 the NIC, rather than standard for the media. 280 Added a revision history section. 282 2. Overview 284 Packetization Layer Path MTU Discovery (PLPMTUD) is a method for TCP 285 or other Packetization Protocols to dynamically discover the MTU of a 286 path by probing with progressively larger packets. It is most 287 efficient when used in conjunction with the ICMP based Path MTU 288 Discovery mechanism as specified in RFC 1191 and RFC 1981, but 289 resolves many of the robustness problems of the classical techniques 290 since it does not depend on the delivery of ICMP messages. 292 This method is applicable to TCP and other transport- or application- 293 level protocols that are responsible for choosing packet boundaries 294 (e.g., segment sizes) and have an acknowledgment structure that 295 delivers to the sender accurate and timely indications of which 296 packets were lost. 298 The general strategy is for the Packetization Layer to find an 299 appropriate Path MTU by probing the path with progressively larger 300 packets. If a probe packet is successfully delivered, then the 301 effective Path MTU is raised to the probe size. 303 The isolated loss of a probe packet (with or without an ICMP Packet 304 Too Big message) is treated as an indication of an MTU limit, and not 305 as a congestion indicator. In this case alone, the Packetization 306 Protocol is permitted to retransmit any missing data without 307 adjusting the congestion window. 309 If there is a timeout or additional packets are lost during the 310 probing process, the probe is considered to be inconclusive (e.g., 311 the lost probe does not necessarily indicate that the probe exceeded 312 the Path MTU). Furthermore, the losses are treated like any other 313 congestion indication: window or rate adjustments are mandatory per 314 the relevant congestion control standards [RFC2914].
Probing can 315 resume after a delay which is determined by the nature of the 316 detected failure. 318 PLPMTUD uses a searching technique to find the Path MTU. Each 319 conclusive probe narrows the MTU search range, either by raising the 320 lower limit on a successful probe or lowering the upper limit on a 321 failed probe, converging toward the true Path MTU. For most 322 transport layers, the search should be stopped once the range is 323 narrow enough that the benefit of a larger effective Path MTU is 324 smaller than the search overhead of finding it. 326 The most likely (and least serious) probe failure is due to the link 327 experiencing congestion related losses while probing. In this case 328 it is appropriate to retry a probe of the same size as soon as the 329 Packetization Layer has fully adapted to the congestion and recovered 330 from the losses. In other cases, additional losses or timeouts 331 indicate problems with the link or Packetization Layer. In these 332 situations it is desirable to use longer delays depending on the 333 severity of the error. 335 An optional verification process can be used to detect situations 336 where raising the MTU raises the packet loss rate. For example, if a 337 link is striped across multiple physical channels with inconsistent 338 MTUs, it is possible that a probe will be delivered even if it is too 339 large for some of the physical channels. In such cases, raising the 340 Path MTU to the probe size can cause severe packet loss and abysmal 341 performance. After raising the MTU, the new MTU size can be verified 342 by monitoring the loss rate. 344 Packetization Layer PMTUD (PLPMTUD) introduces some flexibility in 345 the implementation of classical Path MTU discovery. 
It can be 346 configured to perform just ICMP black hole recovery to increase the 347 robustness of classical Path MTU Discovery, or at the other extreme, 348 all ICMP processing can be disabled and PLPMTUD can completely 349 replace classical Path MTU Discovery. 351 Classical Path MTU discovery is subject to protocol failures 352 (connection hangs) if ICMP Packet Too Big (PTB) messages are not 353 delivered or processed for some reason [RFC2923]. With PLPMTUD, 354 classical Path MTU Discovery can be modified to include additional 355 consistency checks without increasing the risk of connection hangs 356 due to spurious failures of the additional checks. Such changes to 357 classical Path MTU Discovery are beyond the scope of this document. 359 In the limiting case, all ICMP PTB messages might be unconditionally 360 ignored, and PLPMTUD can be used as the sole method to discover the 361 Path MTU. In this configuration, PLPMTUD parallels congestion 362 control. An end-to-end transport protocol adjusts properties of the 363 data stream (window size or packet size) while using packet losses to 364 deduce the appropriateness of the adjustments. This technique seems 365 to be more philosophically consistent with the end-to-end principle 366 of the Internet than relying on ICMP messages containing transcribed 367 headers of multiple protocol layers. 369 Most of the difficulty in implementing PLPMTUD arises because it 370 needs to be implemented in several different places within a single 371 node. In general, each Packetization Protocol needs to have its own 372 implementation of PLPMTUD. Furthermore, the natural mechanism to 373 share Path MTU information between concurrent or subsequent 374 connections is a path information cache in the IP layer. The various 375 Packetization Protocols need to have the means to access and update 376 the shared cache in the IP layer. 
This memo describes PLPMTUD in 377 terms of its primary subsystems without fully describing how they are 378 assembled into a complete implementation. 380 The vast majority of the implementation details described in this 381 document are recommendations based on experiences with earlier 382 versions of Path MTU Discovery. These recommendations are motivated 383 by a desire to maximize robustness of PLPMTUD in the presence of less 384 than ideal network conditions as they exist in the field. 386 This document does not contain a complete description of an 387 implementation. It only sketches details that do not affect 388 interoperability with other implementations and have strong 389 externally imposed optimality criteria (e.g., the MTU searching and 390 caching heuristics). Other details are explicitly included because 391 there is an obvious alternative implementation that doesn't work well 392 in some (possibly subtle) case. 394 Section 3 provides a complete glossary of terms. 396 Section 4 describes the details of PLPMTUD that affect 397 interoperability with other standards or Internet protocols. 399 Section 5 describes how to partition PLPMTUD into layers, and how to 400 manage the path information cache in the IP layer. 402 Section 6 describes the general Packetization Layer properties and 403 features needed to implement PLPMTUD. 405 Section 7 describes how to use probes to search for the Path MTU. 407 Section 8 recommends using IPv4 fragmentation in a configuration that 408 mimics IPv6 functionality, to minimize future problems migrating to 409 IPv6. 411 Section 9 describes a programming interface for implementing PLPMTUD 412 in applications that choose their own packet boundaries and for tools 413 to be able to diagnose path problems that interfere with Path MTU 414 Discovery. 416 Section 10 discusses implementation details for specific protocols, 417 including TCP. 419 3.
Terminology 421 We use the following terms in this document: 423 IP: Either IPv4 [RFC0791] or IPv6 [RFC2460]. 425 Node: A device that implements IP. 427 Upper layer: A protocol layer immediately above IP. Examples are 428 transport protocols such as TCP and UDP, control protocols such as 429 ICMP, routing protocols such as OSPF, and Internet or lower-layer 430 protocols being "tunneled" over (i.e., encapsulated in) IP such as 431 IPX, AppleTalk, or IP itself. 433 Link: A communication facility or medium over which nodes can 434 communicate at the link layer, i.e., the layer immediately below 435 IP. Examples are Ethernets (simple or bridged); PPP links; X.25, 436 Frame Relay, or ATM networks; and Internet (or higher) layer 437 "tunnels", such as tunnels over IPv4 or IPv6. Occasionally we use 438 the slightly more general term "lower layer" for this concept. 440 Interface: A node's attachment to a link. 442 Address: An IP-layer identifier for an interface or a set of 443 interfaces. 445 Packet: An IP header plus payload. 447 MTU: Maximum Transmission Unit, the size in bytes of the largest IP 448 packet, including the IP header and payload, that can be 449 transmitted on a link or path. Note that this could more properly 450 be called the IP MTU, to be consistent with how other standards 451 organizations use the acronym MTU. 453 Link MTU: The Maximum Transmission Unit, i.e., maximum IP packet 454 size in bytes, that can be conveyed in one piece over a link. 455 Beware that this definition is different from the definition used 456 by other standards organizations. 458 For IETF documents, link MTU is uniformly defined as the IP MTU 459 over the link. This includes the IP header, but excludes link 460 layer headers and other framing which is not part of IP or the IP 461 payload. 463 Be aware that other standards organizations generally define link 464 MTU to include the link layer headers. 
466 Path: The set of links traversed by a packet between a source node 467 and a destination node. 469 Path MTU, or PMTU: The minimum link MTU of all the links in a path 470 between a source node and a destination node. 472 Classical Path MTU Discovery: Process described in RFC 1191 and RFC 473 1981, in which nodes rely on ICMP "Packet Too Big" (PTB) messages 474 to learn the MTU of a path. 476 Packetization Layer: The layer of the network stack which segments 477 data into packets. 479 Effective PMTU: The current estimated value for PMTU used by a 480 Packetization Layer for segmentation. 482 PLPMTUD: Packetization Layer Path MTU Discovery, the method 483 described in this document, which is an extension to classical 484 PMTU discovery. 486 PTB (Packet Too Big) message: An ICMP message reporting that an IP 487 packet is too large to forward. This is the IPv6 term that 488 corresponds to the IPv4 ICMP "fragmentation needed and DF set" 489 message. 491 Flow: A context in which MTU discovery algorithms can be invoked. 492 This is naturally an instance of a Packetization Protocol, for 493 example, one side of a TCP connection. 495 MSS: The TCP Maximum Segment Size [RFC0793], the maximum payload 496 size available to the TCP layer. This is typically the Path MTU 497 minus the size of the IP and TCP headers. 499 Probe packet: A packet which is being used to test a path for a 500 larger MTU. 502 Probe size: The size of a packet being used to probe for a larger 503 MTU, including IP headers. 505 Probe gap: The payload data that will be lost and need to be 506 retransmitted if the probe is not delivered. 508 Leading window: Any unacknowledged data in a flow at the time a 509 probe is sent. 511 Trailing window: Any data in a flow sent after a probe, but before 512 the probe is acknowledged. 514 Search strategy: The heuristics used to choose successive probe 515 sizes to converge on the proper Path MTU, as described in 516 Section 7.3.
518 Full-stop timeout: A timeout where none of the packets transmitted 519 after some event are acknowledged by the receiver, including any 520 retransmissions. This is taken as an indication of some failure 521 condition in the network, such as a routing change onto a link 522 with a smaller MTU. This is described in more detail in 523 Section 7.7. 525 4. Requirements 527 All links MUST enforce their MTU: links that might non- 528 deterministically deliver packets that are larger than their rated 529 MTU MUST consistently discard such packets. 531 In the distant past there were a small number of network devices that 532 did not enforce MTU, but could not reliably deliver oversized 533 packets. For example, some early bit-wise Ethernet repeaters would 534 forward arbitrarily sized packets, but could not do so reliably due to 535 finite hardware data clock stability. This is the only requirement 536 that PLPMTUD places on lower layers. It is important that this 537 requirement is explicit to forestall the future standardization or 538 deployment of technologies that might be incompatible with PLPMTUD. 540 All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6 541 functionality. All fragmentation SHOULD be done on the host, and all 542 IPv4 packets, including fragments, SHOULD have the DF bit set such 543 that they will not be fragmented (again) in the network. See 544 Section 8. 546 The requirements below only apply to those implementations that 547 include PLPMTUD. 549 To use PLPMTUD a Packetization Layer MUST have a loss reporting 550 mechanism that provides the sender with timely and accurate 551 indications of which packets were lost in the network. 553 Normal congestion control algorithms MUST remain in effect under all 554 conditions except when only an isolated probe packet is detected as 555 lost. In this case alone the normal congestion (window or data rate) 556 reduction SHOULD be suppressed.
If any other data loss is detected, 557 standard congestion control MUST take place. 559 Suppressed congestion control MUST be rate limited such that it 560 occurs less frequently than the worst case loss rate for TCP 561 congestion control at a comparable data rate over the same path 562 (i.e., less than the "TCP-friendly" loss rate [tcp-friendly]). This 563 SHOULD be enforced by requiring a minimum headway between a 564 suppressed congestion adjustment (due to a failed probe) and the next 565 attempted probe, which is equal to one round trip time for each 566 packet permitted by the congestion window. This is discussed further 567 in Section 7.6.2. 569 Whenever the MTU is raised, the congestion state variables MUST be 570 rescaled so as not to raise the window size in bytes (or data rate in 571 bytes per second). 573 Whenever the MTU is reduced (e.g., when processing ICMP PTB messages) 574 the congestion state variables SHOULD be rescaled so as not to raise the 575 window size in packets. 577 If PLPMTUD updates the MTU for a particular path, all Packetization 578 Layer sessions that share the path representation (as described in 579 Section 5.2) SHOULD be notified to make use of the new MTU and make 580 the required congestion control adjustments. 582 All implementations MUST include mechanisms for applications to 583 selectively transmit packets larger than the current effective Path 584 MTU, but smaller than the first hop link MTU. This is necessary to 585 implement PLPMTUD using a connectionless protocol within an 586 application and to implement diagnostic tools that do not rely on the 587 operating system's implementation of Path MTU discovery. See 588 Section 9 for further discussion. 590 Implementations MAY use different heuristics to select the initial 591 effective path MTU for each protocol.
Connectionless protocols and 592 protocols that do not support PLPMTUD SHOULD have their own default 593 value for the initial effective path MTU, which can be set to a more 594 conservative (smaller) value than the initial value used by TCP and 595 other protocols that are well suited to PLPMTUD. There SHOULD be per 596 protocol and per route limits on the initial effective path MTU 597 (eff_pmtu) and the upper searching limit (search_high). See 598 Section 7.2 for further discussion. 600 5. Layering 602 Packetization Layer Path MTU Discovery is most easily implemented by 603 splitting its functions between layers. The IP layer is the best 604 place to keep shared state, collect the ICMP messages, track IP 605 header sizes and manage MTU information provided by the link layer 606 interfaces. However, the procedures that PLPMTUD uses for probing 607 and verification of the Path MTU are very tightly coupled to features 608 of the Packetization Layers, such as data recovery and congestion 609 control state machines. 611 Note that this layering approach is a direct extension of the advice 612 in the current PMTUD specifications in RFC 1191 and RFC 1981. 614 5.1. Accounting for header sizes 616 The way in which PLPMTUD operates across multiple layers requires a 617 mechanism for accounting header sizes at all layers between IP and 618 the Packetization Layer (inclusive). When transmitting non-probe 619 packets, it is sufficient for the Packetization Layer to ensure an 620 upper bound on final IP packet size, so as not to exceed the current 621 effective Path MTU. All Packetization Layers participating in 622 classical Path MTU Discovery have this requirement already. When 623 conducting a probe, the Packetization Layer MUST determine the probe 624 packet's final size including IP headers. This requirement is 625 specific to PLPMTUD, and satisfying it may require additional inter- 626 layer communication in existing implementations. 628 5.2. 
Storing PMTU information 630 This memo uses the concept of a "flow" to define the scope of the 631 Path MTU discovery algorithms. For many implementations, a flow 632 would naturally correspond to an instance of each protocol (i.e., 633 each connection or session). In such implementations, the algorithms 634 described in this document are performed within each session for each 635 protocol. The observed PMTU (eff_pmtu in Section 7.1) MAY be shared 636 between different flows with a common path representation. 638 Alternatively, PLPMTUD could be implemented such that its complete 639 state is associated with the path representations. Such an 640 implementation could use multiple connections or sessions for each 641 probe sequence. This approach is likely to converge much more 642 quickly in some environments, such as where an application uses many 643 small connections, each of which is too short to complete the Path 644 MTU Discovery process. 646 Within a single implementation, different protocols can use either of 647 these two approaches. Due to protocol specific differences in 648 constraints on generating probes (Section 6.2) and the MTU searching 649 algorithm (Section 7.3), it may not be feasible for different 650 Packetization Layer protocols to share PLPMTUD state. This suggests 651 that it may be possible for some protocols to share probing state, 652 but other protocols can only share observed PMTU. In this case, the 653 different protocols will have different PMTU convergence properties. 655 The IP layer SHOULD be used to store the cached PMTU value and other 656 shared state such as MTU values reported by ICMP PTB messages. 657 Ideally, this shared state should be associated with a specific path 658 traversed by packets exchanged between the source and destination 659 nodes. However, in most cases a node will not have enough 660 information to completely and accurately identify such a path. 
661 Rather, a node must associate a PMTU value with some local 662 representation of a path. It is left to the implementation to select 663 the local representation of a path. 665 An implementation MAY use the destination address as the local 666 representation of a path. The PMTU value associated with a 667 destination would be the minimum PMTU learned across the set of all 668 paths in use to that destination. The set of paths in use to a 669 particular destination is expected to be small, in many cases 670 consisting of a single path. This approach will result in the use of 671 optimally sized packets on a per-destination basis, and integrates 672 nicely with the conceptual model of a host as described in [RFC2461]: 673 a PMTU value could be stored with the corresponding entry in the 674 destination cache. Since NATs and other forms of middle boxes may 675 exhibit differing PMTUs simultaneously at a single IP address, the 676 minimum value SHOULD be stored. 678 Network or subnet numbers MUST NOT be used as representations of a 679 path, because there is not a general mechanism to determine the 680 network mask at the remote host. 682 For source routed packets (i.e., packets containing an IPv6 routing 683 header, or IPv4 LSRR or SSRR options), the source route MAY further 684 qualify the local representation of a path. An implementation MAY 685 use source route information in the local representation of a path. 687 If IPv6 flows are in use, an implementation MAY use the 3-tuple 688 of the Flow label and the source and destination addresses 689 [RFC2460][RFC3697] as the local representation of a path. Such an 690 approach could theoretically result in the use of optimally sized 691 packets on a per-flow basis, providing finer granularity than MTU 692 values maintained on a per-destination basis. 694 5.3.
Accounting for IPsec 696 This document does not take a stance on the placement of IPsec 697 [RFC2401], which logically sits between IP and the Packetization 698 Layer. A PLPMTUD implementation can treat IPsec either as part of IP 699 or as part of the Packetization Layer, as long as the accounting is 700 consistent within the implementation. If IPsec is treated as part of 701 the IP layer, then each security association to a remote node may 702 need to be treated as a separate path. If IPsec is treated as part 703 of the Packetization Layer, the IPsec header size MUST be included in 704 the Packetization Layer's header size calculations. 706 5.4. Multicast 708 In the case of a multicast destination address, copies of a packet 709 may traverse many different paths to reach many different nodes. The 710 local representation of the "path" to a multicast destination must in 711 fact represent a potentially large set of paths. 713 Minimally, an implementation MAY maintain a single MTU value to be 714 used for all multicast packets originated from the node. This MTU 715 SHOULD be sufficiently small that it is expected to be less than the 716 path MTU of all paths comprising the multicast tree. If a path MTU 717 of less than the configured multicast MTU is learned via unicast 718 means, the multicast MTU MAY be reduced to this value. This approach 719 is likely to result in the use of smaller packets than is necessary 720 for many paths. 722 If the application using multicast gets complete delivery reports 723 (unlikely since this requirement has poor scaling properties), 724 PLPMTUD MAY be implemented in multicast protocols such that the 725 smallest path MTU learned across a group becomes the effective MTU 726 for that group. 728 6. Common Packetization Properties 730 This section describes general Packetization Layer properties and 731 characteristics needed to implement PLPMTUD. 
It also describes some 732 implementation issues that are common to all Packetization Layers. 734 6.1. Mechanism to detect loss 736 It is important that the Packetization Layer has a timely and robust 737 mechanism for detecting and reporting losses. PLPMTUD makes MTU 738 adjustments on the basis of detected losses. Any delay or 739 inaccuracy in loss notification is likely to result in incorrect MTU 740 decisions or slow convergence. It is important that the mechanism 741 can robustly distinguish between the isolated loss of just a probe 742 and other losses in the probe's leading and trailing windows. 744 It is best if Packetization Protocols use an explicit loss detection 745 mechanism such as a SACK scoreboard [RFC3517] or ACK Vector [RFC4340] 746 to distinguish real losses from reordered data, although implicit 747 mechanisms such as TCP Reno style duplicate acknowledgments counting 748 are sufficient. 750 PLPMTUD can also be implemented in protocols that rely on timeouts as 751 their primary mechanism for loss recovery; however, timeouts SHOULD 752 NOT be used as the primary mechanism for loss indication unless there 753 are no other alternatives. 755 6.2. Generating probes 757 There are several possible ways to alter Packetization Layers to 758 generate probes. The different techniques incur different overheads 759 in three areas: difficulty in generating the probe packet (in terms 760 of Packetization Layer implementation complexity and extra data 761 motion), possible additional network capacity consumed by the probes, 762 and the overhead of recovering from failed probes (both network and 763 protocol overheads). 765 Some protocols might be extended to allow arbitrary padding with 766 dummy data. This greatly simplifies the implementation because the 767 probing can be performed without participation from higher layers and 768 if the probe fails, the missing data (the "probe gap") is assured to 769 fit within the current MTU when it is retransmitted.
This is 770 probably the most appropriate method for protocols that support 771 arbitrary length options or multiplexing within the protocol itself. 773 Many Packetization Layer protocols can carry pure control messages 774 (without any data from higher protocol layers) which can be padded to 775 arbitrary lengths. For example, the SCTP PAD chunk can be used in 776 this manner (see Section 10.2). This approach has the advantage that 777 nothing needs to be retransmitted if the probe is lost. 779 These techniques do not work for TCP, because there is not a separate 780 length field or other mechanism to differentiate between padding and 781 real payload data. With TCP the only approach is to send additional 782 payload data in an over-sized segment. There are at least two 783 variants of this approach, discussed in Section 10.1. 785 In a few cases, there may be no reasonable mechanisms to generate 786 probes within the Packetization Layer protocol itself. As a last 787 resort, it may be possible to rely on an adjunct protocol, such as 788 ICMP ECHO ("ping"), to send probe packets. See Section 10.3 for 789 further discussion of this approach. 791 7. The Probing Method 793 This section describes the details of the MTU probing method, 794 including how to send probes and process error indications necessary 795 to search for the Path MTU. 797 7.1. Packet size ranges 799 This document describes the probing method using three state 800 variables: 802 search_low: The smallest useful probe size, minus one. The network 803 is expected to be able to deliver packets of size search_low. 805 search_high: The greatest useful probe size. The network is 806 expected not to be able to deliver packets of size search_high. 808 eff_pmtu: The effective PMTU for this flow. This is the largest 809 non-probe packet permitted by PLPMTUD for the path. 
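The three state variables defined above could be held in a small per-flow record. The following is a minimal sketch in Python, not part of this specification: the class name PmtuState and its two helper methods are hypothetical, and the numeric values in the usage note are illustrative only. The probe range check follows directly from the definitions (search_low is the smallest useful probe size minus one, and search_high is the greatest useful probe size).

```python
from dataclasses import dataclass


@dataclass
class PmtuState:
    """Hypothetical per-flow record of the three PLPMTUD state variables."""

    search_low: int   # smallest useful probe size, minus one
    search_high: int  # greatest useful probe size
    eff_pmtu: int     # largest non-probe packet permitted for the path

    def max_non_probe_size(self) -> int:
        # Non-probe packets are bounded by the effective PMTU.
        return self.eff_pmtu

    def is_valid_probe_size(self, size: int) -> bool:
        # A useful probe is larger than search_low and no larger
        # than search_high.
        return self.search_low < size <= self.search_high
```

For example, with PmtuState(search_low=1024, search_high=1500, eff_pmtu=1400), a probe of 1025 bytes falls inside the probe size range, while one of 1024 or 1501 bytes does not.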
811    search_low          eff_pmtu         search_high
812        |                   |                  |
813    ...------------------------->
814               non-probe size range
815        <-------------------------------------->
816                    probe size range
818                            Figure 1
820 When transmitting non-probes, the Packetization Layer SHOULD create 821 packets of size less than or equal to eff_pmtu. 823 When transmitting probes, the Packetization Layer MUST select a probe 824 size which is larger than search_low and smaller than or equal to 825 search_high. 827 When probing upward, eff_pmtu always equals search_low. In other 828 states, such as initial conditions, after ICMP PTB message processing 829 or following PLPMTUD on another flow sharing the same path 830 representation, eff_pmtu may be different from search_low. Normally 831 eff_pmtu will be greater than or equal to search_low and less than 832 search_high. It is generally expected but not required that probe 833 size will be greater than eff_pmtu. 835 For initial conditions when there is no information about the path, 836 eff_pmtu may be greater than search_low. The initial value of 837 search_low SHOULD be conservatively low, but performance may be 838 better if eff_pmtu starts at a higher, less conservative, value. See 839 Section 7.2. 841 If eff_pmtu is larger than search_low it is explicitly permitted to 842 send non-probe packets larger than search_low. When such a packet is 843 acknowledged, it is effectively an "implicit probe" and search_low 844 SHOULD be raised to the size of the acknowledged packet. However, if 845 an "implicit probe" is lost, it MUST NOT be treated as a probe 846 failure as a true probe would be. If eff_pmtu is too large, this 847 condition will only be detected with ICMP PTB messages or black hole 848 discovery (see Section 7.7). 850 7.2. Selecting initial values 852 The initial value for search_high SHOULD be the largest possible 853 packet that might be supported by the flow.
This may be limited by 854 the local interface MTU, by an explicit protocol mechanism such as 855 the TCP MSS option, or by an intrinsic limit such as the size of a protocol 856 length field. In addition the initial value for search_high MAY be 857 limited by a configuration option to prevent probing above some 858 maximum size. Search_high is likely to be the same as the initial 859 path MTU as computed by the classical path MTU discovery algorithm. 861 It is RECOMMENDED that search_low be initially set to an MTU size 862 that is likely to work over a very wide range of environments. Given 863 today's technologies, a value of 1024 bytes is probably safe enough. 864 The initial value for search_low SHOULD be configurable. 866 Properly functioning Path MTU Discovery is critical to the robust and 867 efficient operation of the Internet. Any major change (as described 868 in this document) has the potential to be very disruptive if it 869 causes any unexpected changes in protocol behaviors. The selection 870 of the initial value for eff_pmtu determines to what extent a PLPMTUD 871 implementation's behavior resembles classical PMTUD in cases where 872 the classical method is sufficient. 874 A conservative configuration would be to set eff_pmtu to search_high, 875 and rely on ICMP PTB messages to set the eff_pmtu down as 876 appropriate. In this configuration classical PMTUD is fully 877 functional and PLPMTUD is only invoked to recover from ICMP black 878 holes through the procedure described in Section 7.7. 880 In some cases where it is known that classical PMTUD is likely to 881 fail, (for example, if ICMP PTB messages are administratively 882 disabled for security reasons) using a small initial eff_pmtu will 883 avoid the costly timeouts required for black hole detection. The 884 trade-off is that using a smaller than necessary initial eff_pmtu 885 might cause reduced performance.
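The selection of initial values described so far in this section can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a normative procedure: the function and parameter names are hypothetical, protocol_limit is assumed to be already expressed as a full IP packet size, and the use of search_low as the "small initial eff_pmtu" is one possible choice among many.

```python
from typing import NamedTuple, Optional


class InitialValues(NamedTuple):
    search_low: int
    search_high: int
    eff_pmtu: int


def select_initial_values(
    interface_mtu: int,
    protocol_limit: Optional[int] = None,   # e.g., implied by the TCP MSS option
    configured_max: Optional[int] = None,   # optional cap on probing
    configured_search_low: int = 1024,      # suggested default, configurable
    expect_icmp_black_hole: bool = False,   # classical PMTUD likely to fail
) -> InitialValues:
    # search_high: the largest packet the flow might support, i.e. the
    # minimum of every limit that applies.
    limits = [interface_mtu]
    if protocol_limit is not None:
        limits.append(protocol_limit)
    if configured_max is not None:
        limits.append(configured_max)
    search_high = min(limits)

    # search_low: a conservative size, never above search_high.
    search_low = min(configured_search_low, search_high)

    # Conservative configuration: start eff_pmtu at search_high and rely
    # on ICMP PTB messages. Where black holes are expected, start small
    # (assumption: search_low is used as the small value).
    eff_pmtu = search_low if expect_icmp_black_hole else search_high
    return InitialValues(search_low, search_high, eff_pmtu)
```

With only a 1500 byte interface MTU known, this yields search_low = 1024 and search_high = eff_pmtu = 1500, which matches the conservative configuration in which classical PMTUD remains fully functional.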
887 Note that the initial eff_pmtu can be any value in the range 888 search_low to search_high. An initial eff_pmtu of 1400 bytes might 889 be a good compromise because it would be safe for nearly all tunnels 890 over all common networking gear, and yet close to the optimal MTU for 891 the majority of paths in the Internet today. This might be improved 892 by using some statistics of other recent flows: for example, the 893 initial eff_pmtu for a flow might be set to the median of the probe 894 size for all recent successful probes. 896 Since the cost of PLPMTUD is dominated by the protocol specific 897 overheads of generating and processing probes, it is probably 898 desirable for each protocol to have its own heuristics to select the 899 initial eff_pmtu. It is especially important that connectionless 900 protocols and other protocols that may not receive clear indications 901 of ICMP black holes use conservative (smaller) initial values for 902 eff_pmtu, as described in Section 10.3. 904 There SHOULD be per protocol and per route configuration options to 905 override initial values for eff_pmtu and other PLPMTUD state 906 variables. 908 7.3. Selecting probe size 910 The probe may have a size anywhere in the "probe size range" 911 described above. However, a number of factors affect the selection 912 of an appropriate size. A simple strategy might be to do a binary 913 search halving the probe size range with each probe. However, for 914 some protocols, such as TCP, failed probes are more expensive than 915 successful ones, since data in a failed probe will need to be 916 retransmitted. For such protocols, a strategy that raises the probe 917 size in smaller increments might have lower overhead. For many 918 protocols, both at and above the Packetization Layer, the benefit of 919 increasing MTU sizes may follow a step function such that it is not 920 advantageous to probe within certain regions at all.
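The two strategies mentioned above, binary search and raising the probe size in smaller increments, might be sketched as follows. This is a hedged Python illustration: the function names and the 64 byte default step are assumptions made for the example, and neither function is prescribed by this document. Both return a size that is larger than search_low and no larger than search_high, or None when the probe size range is empty.

```python
from typing import Optional


def binary_probe_size(search_low: int, search_high: int) -> Optional[int]:
    """Pick the midpoint of the probe size range (rounded up)."""
    if search_high <= search_low:
        return None  # range is empty, nothing left to probe
    return search_low + (search_high - search_low + 1) // 2


def incremental_probe_size(
    search_low: int, search_high: int, step: int = 64
) -> Optional[int]:
    """Raise the probe size by a small step, capped at search_high.

    The step value is an arbitrary illustrative choice.
    """
    if search_high <= search_low or step < 1:
        return None
    return min(search_low + step, search_high)
```

For search_low = 1024 and search_high = 1500, the binary strategy would probe at 1262 bytes, while the incremental strategy would probe at 1088 bytes, trading slower convergence for a lower chance that the probe fails and its data must be retransmitted.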
922 As an optimization, it may be appropriate to probe at certain common 923 or expected MTU sizes, for example, 1500 bytes for standard Ethernet, 924 or 1500 bytes minus header sizes for tunnel protocols. 926 Some protocols may use other mechanisms to choose the probe sizes. 927 For example, protocols that have certain natural data block sizes 928 might simply assemble messages from a number of blocks until the 929 total size is smaller than search_high, and if possible larger than 930 search_low. 932 Each Packetization Layer MUST determine when probing has converged, 933 that is, when the probe size range is small enough that further 934 probing is no longer worth its cost. When probing has converged, a 935 timer SHOULD be set. When the timer expires, search_high should be 936 reset to its initial value (described above) so that probing can 937 resume. Thus if the path changes, increasing the Path MTU, then the 938 flow will eventually take advantage of it. The value for this timer 939 MUST NOT be less than 5 minutes, and is recommended to be 10 minutes, 940 per RFC 1981. 942 7.4. Probing preconditions 944 Before sending a probe, the flow MUST at least meet the following 945 conditions: 946 o It has no outstanding probes or losses. 947 o If the last probe failed or was inconclusive, then the probe 948 timeout has expired (see Section 7.6.2). 949 o The available window is greater than the probe size. 950 o For a protocol using in-band data for probing, enough data is 951 available to send the probe. 953 In addition, the timely loss detection algorithms in most protocols 954 have pre-conditions which SHOULD be satisfied before sending a probe. 955 For example, TCP Fast Retransmit is not robust unless there are 956 sufficient segments following a probe, e.g. the sender SHOULD have 957 enough data queued and receiver window to send the probe plus at 958 least Tcprexmtthresh [RFC2760] additional segments. 
This may also 959 inhibit probing in some protocol states, such as too close to the end 960 of a connection, or when the window is too small. 962 Protocols MAY delay sending non-probes in order to accumulate enough 963 data to meet the pre-conditions for probing. The delayed sending 964 algorithm SHOULD use some self scaling technique to appropriately 965 limit the time that the data is delayed. For example the returning 966 ACKs can be used to prevent the window from falling by more than the 967 amount of data needed for the probe. 969 7.5. Conducting a probe 971 Once a probe size in the appropriate range has been selected, and the 972 above preconditions have been met, the Packetization Layer MAY 973 conduct a probe. To do so, it creates a probe packet such that its 974 size, including the outermost IP headers, is equal to the probe size. 975 After sending the probe it awaits a response, which will have one of 976 the following results: 977 Success: The probe is acknowledged as having been received by the 978 remote host. 980 Failure: A protocol mechanism indicates that the probe was lost, but 981 no packets in the leading or trailing window were lost. 983 Timeout failure: A protocol mechanism indicates that the probe was 984 lost, and no packets in the leading window were lost, but is 985 unable to determine if any packets in the trailing window were 986 lost. For example, loss is detected by a timeout, and go-back-n 987 retransmission is used. 989 Inconclusive: The probe was lost in addition to other packets in the 990 leading or trailing windows. 992 7.6. Response to probe results 994 When a probe has completed, the result SHOULD be processed as 995 follows, categorized by the probe's result type. 997 7.6.1. Probe success 999 When the probe is delivered, it is an indication that the Path MTU is 1000 at least as large as the probe size. Set search_low to the probe 1001 size. If the probe size is larger than the eff_pmtu, raise eff_pmtu 1002 to the probe size. 
The probe size might be smaller than the eff_pmtu 1003 if the flow has not been using the full MTU of the path because it is 1004 subject to some other limitation, such as available data in an 1005 interactive session. 1007 Note that if a flow's packets are routed via multiple paths, or over 1008 a path with a non-deterministic MTU, delivery of a single probe 1009 packet does not indicate that all packets of that size will be 1010 delivered. To be robust in such a case, the Packetization Layer 1011 SHOULD conduct MTU verification as described in Section 7.8. 1013 7.6.2. Probe failure 1015 When only the probe is lost, it is treated as an indication that the 1016 Path MTU is smaller than the probe size. In this case alone, the 1017 loss SHOULD NOT be interpreted as a congestion signal. 1019 In the absence of other indications, set search_high to the probe 1020 size minus one. The eff_pmtu might be larger than the probe size if 1021 the flow has not been using the full MTU of the path because it is 1022 subject to some other limitation, such as available data in an 1023 interactive session. If eff_pmtu is larger than the probe size, 1024 eff_pmtu MUST be reduced to no larger than search_high, and SHOULD be 1025 reduced to search_low, as the eff_pmtu has been determined to be 1026 invalid, similar to after a full-stop timeout (see Section 7.7). 1028 If an ICMP PTB message is received matching the probe packet, then 1029 search_high and eff_pmtu MAY be set from the MTU value indicated in 1030 the message. Note that the ICMP message may be received either 1031 before or after the protocol loss indication. 1033 A probe failure event is the one situation under which the 1034 Packetization Layer SHOULD ignore loss as a congestion signal.
1035 Because there is some small risk that suppressing congestion control 1036 might have unanticipated consequences (even for one isolated loss), 1037 it is REQUIRED that probe failure events be less frequent than the 1038 normal period for losses under standard congestion control. 1039 Specifically, after a probe failure event and suppressed congestion 1040 control, PLPMTUD MUST NOT probe again until an interval which is 1041 larger than the expected interval between congestion control events. 1042 See Section 4 for details. The simplest estimate of the interval to 1043 the next congestion event is the same number of round trips as the 1044 current congestion window in packets. 1046 7.6.3. Probe timeout failure 1048 If the loss was detected with a timeout and repaired with go-back-n 1049 retransmission, then congestion window reduction will be necessary. 1050 The relatively high price of a failed probe in this case may merit a 1051 longer time interval until the next probe. A time interval that is 1052 five times the non-timeout failure case (Section 7.6.2) is 1053 RECOMMENDED. 1055 7.6.4. Probe inconclusive 1057 The presence of other losses near the loss of the probe may indicate 1058 that the probe was lost due to congestion rather than due to an MTU 1059 limitation. In this case the state variables eff_pmtu, search_low 1060 and search_high SHOULD NOT be updated, and the same sized probe 1061 SHOULD be attempted again as soon as the probing preconditions are 1062 met (i.e., once the packetization layer has no outstanding 1063 unrecovered losses). At this point, it is particularly appropriate 1064 to re-probe since the flow's congestion window will be at its lowest 1065 point, minimizing the probability of congestive losses. 1067 7.7. 
Full-stop timeout 1069 Under all conditions, a full-stop timeout (also known as a 1070 "persistent timeout" in other documents) SHOULD be taken as an 1071 indication of some significantly disruptive event in the network, 1072 such as a router failure or a routing change to a path with a smaller 1073 MTU. For TCP, this occurs when the R1 timeout threshold described by 1074 [RFC1122] expires. 1076 If there is a full-stop timeout and there was not an ICMP message 1077 indicating a reason (PTB, Net unreachable, etc., or the ICMP message 1078 was ignored for some reason), the RECOMMENDED first recovery action 1079 is to treat this as a detected ICMP black hole as defined in 1080 [RFC2923]. 1082 The response to a detected black hole depends on the current values 1083 for search_low and eff_pmtu. If eff_pmtu is larger than search_low, 1084 set eff_pmtu to search_low. Otherwise, set both eff_pmtu and 1085 search_low to the initial value for search_low. Upon additional 1086 successive timeouts, search_low and eff_pmtu SHOULD be halved, with a 1087 lower bound of 68 bytes for IPv4 and 1280 bytes for IPv6. Even smaller 1088 lower bounds MAY be permitted to support limited operation over links 1089 with MTUs that are smaller than permitted by the IP specifications. 1091 7.8. MTU verification 1093 It is possible for a flow to simultaneously traverse multiple paths, 1094 but an implementation will only be able to keep a single path 1095 representation for the flow. If the paths have different MTUs, 1096 storing the minimum MTU of all paths in the flow's path 1097 representation will result in correct behavior. If ICMP PTB messages 1098 are delivered, then classical PMTUD will work correctly in this 1099 situation. 1101 If ICMP delivery fails, breaking classical PMTUD, the connection will 1102 rely solely on PLPMTUD. In this case, PLPMTUD may fail as well since 1103 it assumes a flow traverses a path with a single MTU.
A probe with a 1104 size greater than the minimum but smaller than the maximum of the 1105 Path MTUs may be successful. However, upon raising the flow's 1106 effective PMTU, the loss rate will significantly increase. The flow 1107 may still make progress, but the resultant loss rate is likely to be 1108 unacceptable. For example, when using two-way round-robin striping, 1109 50% of full-sized packets would be dropped. 1111 Striping in this manner is often operationally undesirable for other 1112 reasons (e.g., due to packet reordering), and is usually avoided by 1113 hashing each flow to a single path. However, to increase robustness, 1114 an implementation SHOULD implement some form of MTU verification, 1115 such that if increasing eff_pmtu results in a sharp increase in loss 1116 rate, it will fall back to using a lower MTU. 1118 A RECOMMENDED strategy would be to save the value of eff_pmtu before 1119 raising it. Then, if loss rate rises above a threshold for a period 1120 of time (e.g., loss rate is higher than 10% over multiple RTO 1121 intervals), then the new MTU is considered incorrect. The saved 1122 value of eff_pmtu SHOULD be restored, and search_high reduced in the 1123 same manner as in a probe failure. PLPMTUD implementations SHOULD 1124 implement MTU verification. 1126 8. Host Fragmentation 1128 Packetization Layers SHOULD avoid sending messages that will require 1129 fragmentation [Kent87] [I-D.heffner-frag-harmful]. However, entirely 1130 preventing fragmentation is not always possible. Some Packetization 1131 Layers, such as a UDP application outside the kernel, may be unable 1132 to change the size of messages it sends, resulting in datagram sizes 1133 that exceed the Path MTU. 1135 IPv4 permitted such applications to send packets without the DF bit 1136 set. Oversized packets without the DF bit set would be fragmented in 1137 the network or sending host when they encountered a link with a MTU 1138 smaller than the packet. 
In some cases, packets could be fragmented 1139 more than once if there were cascaded links with progressively 1140 smaller MTUs. This approach is NOT RECOMMENDED. 1142 It is RECOMMENDED that IPv4 implementations use a strategy that 1143 mimics IPv6 functionality. When an application sends datagrams that 1144 are larger than the effective Path MTU they SHOULD be fragmented to 1145 the Path MTU in the host IP layer even if they are smaller than the 1146 MTU of the first link, directly attached to the host. The DF bit 1147 SHOULD be set on the fragments, so they will not be fragmented again 1148 in the network. This technique will minimize the likelihood that 1149 applications will rely on IPv4 fragmentation in a way that cannot be 1150 implemented in IPv6. At least one major operating system already 1151 uses this strategy. Section 9 describes some exceptions to this rule 1152 when the application is sending oversized packets for probing or 1153 diagnostic purposes. 1155 Since protocols that do not implement PLPMTUD are still subject to 1156 problems due to ICMP black holes, it may be desirable to limit 1157 these protocols to "safe" MTUs likely to work on any path (e.g., 1280 1158 bytes). Allow any protocol implementing PLPMTUD to operate over the 1159 full range supported by the lower layer. 1161 Note that IP fragmentation divides data into packets, so it is 1162 minimally a Packetization Layer. However, it does not have a 1163 mechanism to detect lost packets, so it cannot support a native 1164 implementation of PLPMTUD. Fragmentation-based PLPMTUD requires an 1165 adjunct protocol as described in Section 10.3. 1167 9. Application Probing 1169 All implementations MUST include a mechanism where applications using 1170 connectionless protocols can send their own probes. This is 1171 necessary to implement PLPMTUD in an application protocol as 1172 described in Section 10.4 or to implement diagnostic tools for 1173 debugging problems with PMTUD.
There MUST be a mechanism that 1174 permits an application to send datagrams that are larger than 1175 eff_pmtu, the operating system's estimate of the path MTU, without 1176 being fragmented. If these are IPv4 packets, they MUST have the DF 1177 bit set. 1179 At this time, most operating systems support two modes for sending 1180 datagrams: one which silently fragments packets that are too large, 1181 and another that rejects packets that are too large. Neither of 1182 these modes is suitable for implementing PLPMTUD in an application or 1183 diagnosing problems with path MTU discovery. A third mode is 1184 REQUIRED where the datagram is sent even if it is larger than the 1185 current estimate of the path MTU. 1187 Implementing PLPMTUD in an application also requires a mechanism 1188 where the application can inform the operating system about the 1189 outcome of the probe as described in Section 7.6, or directly update 1190 search_low, search_high and eff_pmtu, described in Section 7.1. 1192 Diagnostic applications are useful for finding PMTUD problems, such 1193 as those that might be caused by a defective router that returns ICMP 1194 PTB messages with incorrect size information. Such problems can be 1195 most quickly located with a tool that can send probes of any 1196 specified size, and collect and display all returned ICMP PTB 1197 messages. 1199 10. Specific Packetization Layers 1201 All Packetization Layer protocols must consider all of the issues 1202 discussed in Section 6. For many protocols it is straightforward to 1203 address these issues. This section discusses specific details for 1204 implementing PLPMTUD with a couple of protocols. It is hoped that 1205 the descriptions here will be sufficient illustration for 1206 implementers to adapt to additional protocols. 1208 10.1. Probing method using TCP 1210 TCP has no mechanism to distinguish in-band data from padding. 1211 Therefore, TCP must generate probes by appropriately segmenting data.
1212 There are two approaches to segmentation: overlapping and non- 1213 overlapping. 1215 In the non-overlapping method, data is segmented such that the probe 1216 and any subsequent segments contain no overlapping data. If the 1217 probe is lost, the "probe gap" will be a full probe size minus 1218 headers. Data in the probe gap will need to be retransmitted with 1219 multiple smaller segments.
1221                   TCP sequence number
1223      t   <---->
1224      i         <-------->           (probe)
1225      m                   <---->
1226      e
1227                      .
1228                      .                  (probe lost)
1229                      .
1231                <---->               (probe gap retransmitted)
1232                      <-->
1234                          Figure 2
1236 An alternate approach is to send subsequent data overlapping the 1237 probe such that the probe gap is equal in length to the current MSS. 1238 In the case of a successful probe, this has added overhead in that it 1239 will send some data twice, but it will have to retransmit only one 1240 segment after a lost probe. When a probe succeeds, there will likely 1241 be some duplicate acknowledgments generated due to the duplicate data 1242 sent. It is important that these duplicate acknowledgments not 1243 trigger Fast Retransmit. As such, an implementation using this 1244 approach SHOULD limit the probe size to three times the current MSS 1245 (causing at most 2 duplicate acknowledgments), or appropriately 1246 adjust its duplicate acknowledgment threshold for data immediately 1247 after a successful probe.
1249                   TCP sequence number
1251      t   <---->
1252      i         <-------->           (probe)
1253      m               <---->
1254      e                     <---->
1256                      .
1257                      .                  (probe lost)
1258                      .
1260                <---->               (probe gap retransmitted)
1262                          Figure 3
1264 The choice of which segmentation method to use should be based on 1265 what is simplest and most efficient for a given TCP implementation. 1267 10.2. Probing method using SCTP 1269 In the SCTP protocol [RFC2960], the application writes messages to 1270 SCTP, which divides the data into smaller "chunks" suitable for 1271 transmission through the network.
   Each chunk is assigned a Transmission Sequence Number (TSN).  Once
   a TSN has been transmitted, SCTP cannot change the chunk size.
   SCTP multi-path support normally requires SCTP to choose a chunk
   size such that its messages fit the smallest PMTU of all paths.
   Although not required, implementations may bundle multiple data
   chunks together to make larger IP packets to send on paths with a
   larger PMTU.  Note that SCTP must independently probe the PMTU on
   each path to the peer.

   The RECOMMENDED method for generating probes is to add a chunk
   consisting only of padding to an SCTP message.  The PAD chunk
   defined in [I-D.ietf-tsvwg-sctp-padding] SHOULD be attached to a
   minimum length HEARTBEAT (HB) chunk to build a probe packet.  This
   method is fully compatible with all current SCTP implementations.

   SCTP MAY also probe with a method similar to TCP's described above,
   using inline data.  Using such a method has the advantage that
   successful probes have no additional overhead; however, failed
   probes will require retransmission of data, which may impact flow
   performance.

10.3.  Probing method for IP fragmentation

   There are a few protocols and applications that normally send large
   datagrams and rely on IP fragmentation to deliver them.  It has
   been known for a long time that this has some undesirable
   consequences [Kent87].  More recently, it has come to light that
   IPv4 fragmentation is not sufficiently robust for general use in
   today's Internet.  The 16-bit IP identification field is not large
   enough to prevent frequent mis-associated IP fragments, and the TCP
   and UDP checksums are insufficient to prevent the resulting
   corrupted data from being delivered to higher protocol layers
   [I-D.heffner-frag-harmful].

   As mentioned in Section 8, datagram protocols (such as UDP) might
   rely on IP fragmentation as a Packetization Layer.
   However, using IP fragmentation to implement PLPMTUD is problematic
   because the IP layer has no mechanism to determine if the packets
   are ultimately delivered to the far node, without direct
   participation by the application.

   To support IP fragmentation as a Packetization Layer under an
   unmodified application, an implementation SHOULD rely on the path
   MTU sharing described in Section 5.2 plus an adjunct protocol to
   probe the path MTU.  There are a number of protocols that might be
   used for the purpose, such as ICMP ECHO and ECHO REPLY, or
   "traceroute" style UDP datagrams that trigger ICMP messages.  Use
   of ICMP ECHO and ECHO REPLY will probe both forward and return
   paths, so the sender will only be able to take advantage of the
   minimum of the two.  Other methods that probe only the forward path
   are preferred if available.

   All of these approaches have a number of potential robustness
   problems.  The most likely failures are due to losses unrelated to
   MTU (e.g., nodes that discard some protocol types).  These
   non-MTU-related losses can prevent PLPMTUD from raising the MTU,
   forcing IP fragmentation to use a smaller MTU than necessary.
   Since these failures are not likely to cause interoperability
   problems, they are relatively benign.

   However, other more serious failure modes do exist, such as might
   be caused by middle boxes or upper layer routers that choose
   different paths for different protocol types or sessions.  In such
   environments, adjunct protocols may legitimately experience a
   different path MTU than the primary protocol.  If the adjunct
   protocol finds a larger MTU than the primary protocol, PLPMTUD may
   select an MTU that is not usable by the primary protocol.
   Although this is a potentially serious problem, this sort of
   situation is likely to be viewed as incorrect by a large number of
   observers, and thus there will be strong motivation to correct it.

   Since connectionless protocols might not keep enough state to
   effectively diagnose MTU black holes, it would be more robust to
   err on the side of using too small an initial MTU (e.g., 1 kByte or
   less) prior to probing a path to measure the MTU.  For this reason,
   implementations that use IP fragmentation SHOULD use an initial
   eff_pmtu that is selected as described in Section 7.2, except using
   a separate global control for the default initial eff_pmtu for
   connectionless protocols.

   Connectionless protocols also introduce an additional problem with
   maintaining the path information cache: there are no events
   corresponding to connection establishment and tear-down to use to
   manage the cache itself.  A natural approach would be to keep an
   immutable cache entry for the "default path", which has an eff_pmtu
   that is fixed at the initial value for connectionless protocols.
   The adjunct path MTU discovery protocol would be invoked once the
   number of fragmented datagrams to any particular destination
   reaches some configurable threshold (e.g., 5 datagrams).  A new
   path cache entry would be created when the adjunct protocol updates
   eff_pmtu, and deleted on the basis of a timer or a Least Recently
   Used cache replacement algorithm.

10.4.  Probing method using applications

   The disadvantages of relying on IP fragmentation and an adjunct
   protocol to perform path MTU discovery can be overcome by
   implementing path MTU discovery within the application itself,
   using the application's own protocol.  The application must have
   some suitable method for generating probes and have an accurate and
   timely mechanism to determine if the probes were lost.

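   A minimal sketch of the two requirements in the preceding
   paragraph, a way to generate probes and a timely signal for probe
   loss, is given below.  It assumes a hypothetical send_probe(size)
   callable, supplied by the application, that sends one probe of the
   given size over the application's own protocol and returns True
   only if delivery is confirmed before a timeout.  The bisection
   between search_low and search_high, the attempts parameter, and the
   function name discover_pmtu are illustrative choices for this
   example, not the search strategy this document specifies;
   search_low is assumed to be a size already known to work.

```python
from typing import Callable


def discover_pmtu(send_probe: Callable[[int], bool],
                  search_low: int, search_high: int,
                  attempts: int = 2) -> int:
    """Return the largest probe size in [search_low, search_high]
    whose delivery was confirmed by send_probe.

    send_probe is a hypothetical application callback: it sends one
    probe of the requested size and reports whether the peer confirmed
    it in time.  search_low is assumed to be known to work already.
    """
    if search_low > search_high:
        raise ValueError("search_low must not exceed search_high")
    while search_low < search_high:
        # Round up so that the interval shrinks on every iteration.
        size = (search_low + search_high + 1) // 2
        # Try each size a few times before concluding it is too large;
        # any() stops at the first confirmed probe.
        delivered = any(send_probe(size) for _ in range(attempts))
        if delivered:
            search_low = size       # size is usable, search above it
        else:
            search_high = size - 1  # treat size as too large
    return search_low
```

   For a simulated path that delivers only probes of 1400 bytes or
   less, discover_pmtu(lambda size: size <= 1400, 1024, 1500) returns
   1400, which the application would then use as its eff_pmtu.
   Retrying each size before treating it as too large reflects the
   fact that a probe can also be lost for reasons unrelated to MTU.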
   Ideally, the application protocol includes a lightweight echo
   function that confirms message delivery, plus a mechanism for
   padding the messages out to the desired probe size, such that the
   padding is not echoed.  This combination (akin to the SCTP HB plus
   PAD) is RECOMMENDED because an application can separately measure
   the MTU of each direction on a path with asymmetrical MTUs.

   For protocols that cannot implement PLPMTUD with "echo plus pad",
   there are often alternate methods for generating probes.  For
   example, the protocol may have a variable length echo that
   effectively measures the minimum MTU of both the forward and return
   paths, or there may be a way to add padding to regular messages
   carrying real application data.  There may also be alternate ways
   to segment application data to generate probes, or as a last
   resort, it may be feasible to extend the protocol with new message
   types specifically to support MTU discovery.

   Note that if it is necessary to add new message types to support
   PLPMTUD, the most general approach is to add ECHO and PAD messages,
   which permit the greatest possible latitude in how an
   application-specific implementation of PLPMTUD interacts with other
   applications and protocols on the same end system.

   All application probing techniques require the ability, described
   in Section 9, to send messages that are larger than the current
   eff_pmtu.

11.  Security Considerations

   Under all conditions, the PLPMTUD procedures described in this
   document are at least as secure as the current standard Path MTU
   Discovery procedures described in RFC 1191 and RFC 1981.

   Since PLPMTUD is designed for robust operation without any ICMP or
   other messages from the network, it can be configured to ignore all
   ICMP messages, either globally or on a per-application basis.
   In such a configuration, it cannot be attacked unless the attacker
   can identify and cause probe packets to be lost.  Attacking PLPMTUD
   reduces performance, but not as much as attacking congestion
   control by causing arbitrary packets to be lost.  Such an attacker
   might do far more damage by completely disrupting specific
   protocols, such as DNS.

   Since packetization protocols may share state with each other, if
   one packetization protocol (particularly an application) were
   hostile to other protocols on the same host, it could harm
   performance in the other protocols by reducing the effective MTU.
   If a packetization protocol is untrusted, it should not be allowed
   to write to shared state.

12.  IANA Considerations

   None.

13.  References

13.1.  Normative references

   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
              September 1981.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery",
              RFC 1191, November 1990.

   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU
              Discovery for IP version 6", RFC 1981, August 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", RFC 2460, December 1998.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC3697]  Rajahalme, J., Conta, A., Carpenter, B., and S. Deering,
              "IPv6 Flow Label Specification", RFC 3697, March 2004.

   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
              Zhang, L., and V. Paxson, "Stream Control Transmission
              Protocol", RFC 2960, October 2000.

   [I-D.ietf-tsvwg-sctp-padding]
              Tuexen, M., "Padding Chunk and Parameter for SCTP",
              draft-ietf-tsvwg-sctp-padding-00 (work in progress),
              June 2006.

13.2.  Informative references

   [RFC2760]  Allman, M., Dawkins, S., Glover, D., Griner, J., Tran,
              D., Henderson, T., Heidemann, J., Touch, J., Kruse, H.,
              Ostermann, S., Scott, K., and J. Semke, "Ongoing TCP
              Research Related to Satellites", RFC 2760,
              February 2000.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2923]  Lahey, K., "TCP Problems with Path MTU Discovery",
              RFC 2923, September 2000.

   [RFC2401]  Kent, S. and R. Atkinson, "Security Architecture for the
              Internet Protocol", RFC 2401, November 1998.

   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
              RFC 2914, September 2000.

   [RFC2461]  Narten, T., Nordmark, E., and W. Simpson, "Neighbor
              Discovery for IP Version 6 (IPv6)", RFC 2461,
              December 1998.

   [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
              Conservative Selective Acknowledgment (SACK)-based Loss
              Recovery Algorithm for TCP", RFC 3517, April 2003.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340,
              March 2006.

   [Kent87]   Kent, C. and J. Mogul, "Fragmentation considered
              harmful", Proc. SIGCOMM '87 vol. 17, No. 5,
              October 1987.

   [tcp-friendly]
              Mahdavi, J. and S. Floyd, "TCP-Friendly Unicast
              Rate-Based Flow Control", Technical note sent to the
              end2end-interest mailing list, January 1997.

   [I-D.heffner-frag-harmful]
              Heffner, J., "Fragmentation Considered Very Harmful",
              draft-heffner-frag-harmful-01 (work in progress),
              April 2006.

Appendix A.  Acknowledgments

   Many ideas and even some of the text come directly from RFC 1191
   and RFC 1981.

   Many people made significant contributions to this document,
   including: Randall Stewart for SCTP text, Michael Richardson for
   material from an earlier ID on tunnels that ignore DF, Stanislav
   Shalunov for the idea that pure PLPMTUD parallels congestion
   control, and Matt Zekauskas for maintaining focus during the
   meetings.  Thanks to the early implementors: Kevin Lahey, John
   Heffner and Rao Shoaib who provided concrete feedback on weaknesses
   in earlier drafts.  Thanks also to all of the people who made
   constructive comments in the working group meetings and on the
   mailing list.  I am sure I have missed many deserving people.

   Matt Mathis and John Heffner are supported in this work by a grant
   from Cisco Systems, Inc.

Authors' Addresses

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA  15213
   US

   Phone: 412-268-3319
   Email: mathis@psc.edu

   John W. Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA  15213
   US

   Phone: 412-268-2329
   Email: jheffner@psc.edu

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR
   ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
   PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).