Network Working Group                                          M. Mathis
Internet-Draft                                                J. Heffner
Expires: September 4, 2006                                           PSC
                                                           March 3, 2006

                           Path MTU Discovery
                       draft-ietf-pmtud-method-06

Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on September 4, 2006.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

This document describes a robust method for Path MTU Discovery that
relies on TCP or some other Packetization Layer to probe an Internet
path with progressively larger packets. This method is described as
an extension to RFC 1191 and RFC 1981, which specify ICMP based Path
MTU Discovery for IP versions 4 and 6, respectively.

The general strategy of the new algorithm is to start with a small
MTU and search upward, testing successively larger MTUs by probing
with single packets. If the probe is successfully delivered and
satisfies a subsequent verification phase then the MTU is raised. If
the probe is lost, it is treated as an MTU limitation and not as a
congestion signal.
There are several options for integrating PLPMTUD with classical Path
MTU Discovery. PLPMTUD can be minimally configured to perform ICMP
black hole recovery to increase the robustness of classical Path MTU
Discovery, or ICMP processing can be completely disabled, and PLPMTUD
can completely replace classical Path MTU Discovery.

In the latter configuration, PLPMTUD exactly parallels congestion
control. An end-to-end transport protocol adjusts non-protocol
properties of the data stream (window size or packet size) while
using packet losses to deduce the appropriateness of the adjustments.
This technique seems to be more philosophically consistent with the
end-to-end principle than relying on ICMP messages containing
transcribed headers of multiple protocol layers.

Table of Contents

1. Introduction
   1.1. Revision History
      1.1.1. Changes since version -05, November 2005 (IETF 64)
      1.1.2. Changes since version -04, February 2005 (IETF 62)
      1.1.3. Changes since version -03, October 2004 (IETF 61)
      1.1.4. Changes since version -02, July 19th 2004 (IETF 60)
2. Overview
3. Terminology
4. Requirements
5. Layering
   5.1. Accounting for header sizes
   5.2. Storing PMTU information
   5.3. Accounting for IPsec
   5.4. Multicast
6. Common Packetization Properties
   6.1. Mechanism to detect loss
   6.2. Generating probes
7. Host Fragmentation
8. The Probing Method
   8.1. Packet size ranges
   8.2. Selecting initial values
   8.3. Selecting probe size
   8.4. Probing preconditions
   8.5. Conducting a probe
   8.6. Response to probe results
      8.6.1. Probe success
      8.6.2. Probe failure
      8.6.3. Probe timeout failure
      8.6.4. Probe inconclusive
   8.7. Full stop timeout
   8.8. MTU verification
9. Diagnostic Interface
10. Specific Packetization Layers
   10.1. Probing method using TCP
   10.2. Probing method using SCTP
   10.3. Probing method using IP fragmentation
   10.4. Probing method using applications
11. References
   11.1. Normative references
   11.2. Informative references
Appendix A. Security Considerations
Appendix B. IANA Considerations
Appendix C. Acknowledgements
Authors' Addresses
Intellectual Property and Copyright Statements

1. Introduction

This document describes a method for Packetization Layer Path MTU
Discovery (PLPMTUD) which is an extension to existing Path MTU
Discovery methods as described in RFC 1191 [2] and RFC 1981 [3]. The
proper MTU is determined by starting with small packets and probing
with successively larger packets. The bulk of the algorithm is
implemented above IP, in the transport layer (e.g., TCP) or other
"Packetization Protocol" that is responsible for determining packet
boundaries.

This document draws heavily on RFC 1191 and RFC 1981 for terminology,
ideas and some of the text.

This document describes methods to discover the Path MTU using
features of existing protocols. The methods apply to IPv4 and IPv6,
and many transport protocols. They do not require cooperation from
the lower layers (except that they are consistent about what packet
sizes are acceptable) or the far node. Variants in implementations
will not cause interoperability problems.

The methods described in this document are carefully designed to
maximize robustness in the presence of less than ideal
implementations of other protocols or Internet components.

For the sake of clarity we uniformly prefer TCP and IPv6 terminology.
In the terminology section we also present the analogous IPv4 terms
and concepts for the IPv6 terminology. In a few situations we
describe specific details that are different between IPv4 and IPv6.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [4].

This draft is a product of the Path MTU Discovery (pmtud) working
group of the IETF. Please send comments and suggestions to
pmtud@ietf.org. Interim drafts and other useful information will be
posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .

1.1. Revision History

These are all recent substantive changes, in reverse chronological
order. This section will be removed prior to publication as an RFC.
Note that there are still some missing details that need to be
resolved. These are flagged by @@@@. None of the missing details
are serious.

1.1.1. Changes since version -05, November 2005 (IETF 64)

Re-worked probing method sections for TCP and SCTP. The SCTP section
reflects the new PAD chunk type, and contains some text from Michael
Tuexen.

Made a number of language clarification and consistency improvements,
largely from comments by Gorry Fairhurst.

Added appropriate citations, and removed the last of the "@@" TODO
items.

1.1.2. Changes since version -04, February 2005 (IETF 62)

General restructuring and rewriting of some sections based on new
experience. Relaxed and generalized a lot of over-specified
language, for example, the search strategy description.

Decoupled verification from probing, and relaxed its specification.

Removed all specified changes to ICMP processing. We decided this
was out of scope for this particular document.

Changed all language to refer to MTU rather than MPS.

1.1.3. Changes since version -03, October 2004 (IETF 61)

A number of minor style and grammar edits.

1.1.4. Changes since version -02, July 19th 2004 (IETF 60)

Many minor updates throughout the document.

Added a section describing the interactions between PLPMTUD and
congestion control.

Removed a difficult to implement requirement for future data to
transmit.

Added "IP Fragmentation" and "Application protocol" as Packetization
Layers.

Clarified interactions between TCP SACK and MTU.

Updated SCTP section to reflect new probing method using "PAD
chunks".

Distilled the protocol specific material into separate subsections
for each protocol.

Added a section on common requirements and functions for all
Packetization Layers. More accurately characterized the
"bidirectional" (and other) requirements of the PL protocol. Updated
the search strategy in this new section.

Changed "ICMP can't fragment" and "packet too big" to uniformly use
"ICMP PTB message" everywhere.

Added Stanislav Shalunov's observation that PLPMTUD parallels
congestion control.

Better described the range of interoperability with classical pMTUd
in the introduction.

Removed vague language about "not being a protocol" and "excessive
Loss".

Slightly redefined flow: the granularity of PLPMTUD within a path.

Many English NITs and clarifications per Gorry Fairhurst and others.
Passes strict xml2rfc checking.

Added a paragraph encouraging interface MTUs that are optimal for
the NIC, rather than standard for the media.

Added a revision history section.

2. Overview

This document describes a method for TCP or other Packetization
Protocols to dynamically discover the MTU of a path without explicit
signals from the network. This method is most efficient when used in
conjunction with the current ICMP based Path MTU Discovery mechanism
as specified in RFC 1191 and RFC 1981. When used in such a way, it
eliminates many robustness problems since it does not depend on the
delivery of ICMP messages.

These procedures are applicable to TCP and other transport- or
application-level protocols that are responsible for choosing packet
boundaries (e.g., segment sizes) and have an acknowledgment structure
that delivers to the sender accurate and timely indications of which
packets were lost.

The general strategy is for the Packetization Layer to find an
appropriate Path MTU by probing the path with progressively larger
packets. If a probe packet is successfully delivered, then the
effective Path MTU is raised to the probe size.
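
As a non-normative illustration, the general strategy above can be
sketched as a search between a size known to work and an upper bound.
This is only a sketch under stated assumptions: the names
search_path_mtu, send_probe, search_low, search_high and threshold are
hypothetical, a simple midpoint search stands in for whatever search
strategy an implementation actually selects, and inconclusive probes
and congestion control interactions are not modelled.

```python
from typing import Callable


def search_path_mtu(send_probe: Callable[[int], bool],
                    search_low: int,
                    search_high: int,
                    threshold: int = 8) -> int:
    """Return the largest probe size confirmed to traverse the path.

    send_probe(size) is a hypothetical callback: it returns True when a
    probe of `size` bytes is delivered and False when the probe alone
    is lost. search_low must be a size already known to work.
    """
    effective_pmtu = search_low
    # Stop once the remaining range is too small to be worth probing.
    while search_high - effective_pmtu > threshold:
        # Round up so the probe is always larger than the current value.
        probe_size = (effective_pmtu + search_high + 1) // 2
        if send_probe(probe_size):
            # Probe delivered: raise the effective Path MTU.
            effective_pmtu = probe_size
        else:
            # Isolated probe loss: treat it as an MTU limit.
            search_high = probe_size - 1
    return effective_pmtu
```

For example, with a simulated path that drops anything larger than
1500 bytes, search_path_mtu(lambda size: size <= 1500, 1024, 9000)
settles on a value within the threshold below 1500.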

The isolated loss of a probe packet (with or without an ICMP Packet
Too Big message) is treated as an indication of an MTU limit, and not
as a congestion indicator. In this case alone, the Packetization
Protocol is permitted to retransmit any missing data without
adjusting the congestion window.

If there is a timeout or additional packets are lost during the
probing process, the probe is considered to be inconclusive (e.g.,
the lost probe does not necessarily indicate that the probe exceeded
the Path MTU). Furthermore the losses are treated like any other
congestion indication: window or rate adjustments are mandatory per
the relevant congestion control standards of RFC 2914 [12]. Probing
can resume after a delay which is determined by the nature of the
detected failure.

PLPMTUD uses a searching technique to find the Path MTU. Each
conclusive probe narrows the MTU search range, either by raising the
low limit on a successful probe or lowering the high limit on a
failed probe, until the search range converges toward the true Path
MTU. For most transport layers, it makes sense to abandon the search
once the range is narrow enough that the likely gain from picking a
larger effective Path MTU is smaller than the search overhead to find
it.

The most likely (and least serious) PLPMTUD failure is the link
experiencing congestion related losses while probing. In this case
it is appropriate to retry a probe of the same size as soon as the
Packetization Layer has fully adapted to the congestion and recovered
from the losses. In other cases, additional losses or timeouts
indicate problems with the link or Packetization Layer. In these
situations it is desirable to use longer delays depending on the
severity of the error.

An optional verification phase can be used to detect some situations
where raising the MTU raises the packet loss rate. For example, if a
link is striped across multiple physical channels with inconsistent
MTUs, it is possible that a probe will be delivered even if it is too
large for some of the physical channels. In such cases raising the
Path MTU to the probe size can cause severe packet loss and abysmal
performance. After raising the MTU, the new MTU size can be verified
by monitoring the loss rate.

PLPMTUD introduces some flexibility in the implementation of
classical Path MTU discovery, which is subject to protocol failures
(connection hangs) if ICMP PTB messages are not delivered or
processed for some reason. With PLPMTUD, classical Path MTU
Discovery can include additional consistency checks (e.g., validating
additional fields in the transcribed header) without increasing the
risk of connection hangs due to false failures of the added checks.
Such changes to classical Path MTU Discovery are beyond the scope of
this document.

In the limiting case, all ICMP PTB messages might be unconditionally
ignored, and PLPMTUD can be used as the sole method to discover the
Path MTU. In this configuration, PLPMTUD parallels congestion
control. An end-to-end transport protocol adjusts non-protocol
properties of the data stream (window size or packet size) while
using packet losses to deduce the appropriateness of the adjustments.
This technique seems to be more philosophically consistent with the
end-to-end principle of the Internet than relying on ICMP messages
containing transcribed headers of multiple protocol layers.

Most of the difficulty in implementing PLPMTUD arises because it
needs to be implemented in several different places within a single
node. In general, each Packetization Protocol needs to have its own
implementation of PLPMTUD. Furthermore, the natural mechanism to
share Path MTU information between concurrent or subsequent
connections over the same path is a path information cache in the IP
layer. The various Packetization Protocols need to have the means to
access and update the shared cache in the IP layer. This memo
describes PLPMTUD in terms of its primary subsystems without fully
describing how they are assembled into a complete implementation.

Section 3 provides a complete glossary of terms.

Relatively few details of PLPMTUD affect interoperability with other
standards or Internet protocols. These details are specified in RFC
2119 standards language in Section 4. The vast majority of the
implementation details described in this document are recommendations
based on experiences with earlier versions of Path MTU Discovery.
These recommendations are motivated by a desire to maximize
robustness of PLPMTUD in the presence of less than ideal network
conditions as they exist in the field.

Section 5 describes how to partition PLPMTUD into layers, and how to
manage the "path information cache" in the IP layer.

Section 6 describes the general Packetization Layer properties and
features needed to implement PLPMTUD.

Section 7 recommends using IPv4 fragmentation in a configuration that
mimics IPv6 functionality, to minimize future problems migrating to
IPv6.

Section 8 describes the details of how to use probes to search for
the Path MTU.

Section 9 describes a programming interface for applications acting
as Packetization Layers, and for tools to be able to diagnose path
problems that interfere with Path MTU Discovery.

Section 10 discusses implementation details for specific protocols,
including TCP.

3. Terminology

We use the following terms in this document:

IP: Either IPv4 [1] or IPv6 [5].

Node: A device that implements IP.

Router: A node that forwards IP packets not explicitly addressed to
   itself.

Host: Any node that is not a router.

Upper layer: A protocol layer immediately above IP. Examples are
   transport protocols such as TCP and UDP, control protocols such as
   ICMP, routing protocols such as OSPF, and Internet or lower-layer
   protocols being "tunneled" over (i.e., encapsulated in) IP such as
   IPX, AppleTalk, or IP itself.

Link: A communication facility or medium over which nodes can
   communicate at the link layer, i.e., the layer immediately below
   IP. Examples are Ethernets (simple or bridged); PPP links; X.25,
   Frame Relay, or ATM networks; and Internet (or higher) layer
   "tunnels", such as tunnels over IPv4 or IPv6. Occasionally we use
   the slightly more general term "lower layer" for this concept.

Interface: A node's attachment to a link.

Address: An IP-layer identifier for an interface or a set of
   interfaces.

Packet: An IP header plus payload.

MTU: Maximum Transmission Unit, the size in bytes of the largest IP
   packet, including the IP header and payload, that can be
   transmitted on a link or path. Note that this could more properly
   be called the IP MTU, to be consistent with how other standards
   organizations use the acronym MTU.

Link MTU: The Maximum Transmission Unit, i.e., maximum IP packet size
   in bytes, that can be conveyed in one piece over a link. Beware
   that this definition differs from the definition used by other
   standards organizations.

   For IETF documents, link MTU is uniformly defined as the IP MTU
   over the link. This includes the IP header, but excludes link
   layer headers and other framing which is not part of IP or the IP
   payload.

   Be aware that other standards organizations generally define link
   MTU to include the link layer headers.

Path: The set of links traversed by a packet between a source node
   and a destination node.

Path MTU, or pMTU: The minimum link MTU of all the links in a path
   between a source node and a destination node.

Classical Path MTU Discovery: Process described in RFC 1191 and RFC
   1981, in which nodes rely on ICMP "Packet Too Big" (PTB) messages
   to learn the MTU of a path.

Packetization Layer: The layer of the network stack which segments
   data into packets.

Effective PMTU: The current estimated value for PMTU used by a
   Packetization Layer for segmentation.

PLPMTUD: Packetization Layer Path MTU Discovery, the method described
   in this document, which is an extension to classical PMTU
   discovery.

PTB (Packet Too Big) message: An ICMP message reporting that an IP
   packet is too large to forward. This is the IPv6 term that
   corresponds to the IPv4 "ICMP Can't fragment" message.

Flow: A context in which MTU discovery algorithms can be invoked.
   This is naturally an instance of a Packetization Protocol, for
   example, one side of a TCP connection.

MSS: The TCP Maximum Segment Size [6], the maximum payload size
   available to the TCP layer. This is typically the Path MTU minus
   the size of the IP and TCP headers.

Probe packet: A packet which is being used to test a path for a
   larger MTU.

Probe size: The size of a packet being used to probe for a larger
   MTU.

Probe gap: The payload data that will be lost and need to be
   retransmitted if the probe is not delivered.

Leading window: Any unacknowledged data in a flow at the time a probe
   is sent.

Trailing window: Any data in a flow sent after a probe, but before
   the probe is acknowledged.

Search strategy: The heuristics used to choose successive probe sizes
   to converge on the proper Path MTU, as described in Section 8.3.

Full stop timeout: A timeout where none of the packets transmitted
   after some event are acknowledged by the receiver, including any
   retransmissions. This is taken as an indication of some failure
   condition in the network, such as a routing change onto a link
   with a smaller MTU. This is described in more detail in
   Section 8.7.

4. Requirements

All Internet nodes SHOULD implement PLPMTUD in order to discover and
take advantage of the largest MTU supported along the Internet path.

Links MUST NOT deliver packets that are larger than their MTU. Links
that have parametric limitations (e.g., MTU bounds due to limited
clock stability) MUST include explicit mechanisms to consistently
reject packets that might otherwise be nondeterministically
delivered.

All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
functionality. All fragmentation SHOULD be done on the host, and all
IPv4 packets, including fragments, SHOULD have the DF bit set such
that they will not be fragmented (again) in the network. See
Section 7.

The requirements below only apply to those implementations that
include PLPMTUD.

To use PLPMTUD a Packetization Layer MUST have a loss reporting
mechanism that provides the sender with timely and accurate
indications of which packets were lost in the network.

Normal congestion control algorithms MUST remain in effect under all
conditions except when only an isolated probe packet is detected as
lost. In this case alone the normal congestion (window or data rate)
reduction MAY be suppressed. If any other data loss is detected,
standard congestion control MUST take place.

Suppressed congestion control (as above) MUST be rate limited such
that it occurs less frequently than the worst case loss rate for TCP
congestion control at a comparable data rate over the same path
(i.e., less than the "TCP-friendly" loss rate [17]). This SHOULD be
enforced by requiring a minimum headway between a suppressed
congestion adjustment (due to a failed probe) and the next attempted
probe, which is equal to one round trip time for each packet
permitted by the congestion window. Alternatively this may be
enforced by not suppressing congestion control if a second probe is
lost too soon after the first lost probe. This is discussed in
Section 8.6.2.

Whenever the MTU is raised, the congestion state variables MUST be
rescaled so as not to raise the window size in bytes (or data rate in
bytes per second).

Whenever the MTU is reduced (e.g., when processing ICMP PTB messages)
the congestion state variables SHOULD be rescaled so as not to raise
the window size in packets.

If PLPMTUD updates the MTU for a particular path, all Packetization
Layer sessions that share the path representation SHOULD be notified
to make use of the new MTU and make the required congestion
adjustments.

All implementations MUST include a mechanism to implement diagnostic
tools that do not rely on the operating system's implementation of
Path MTU discovery. This specifically requires the ability to send
packets that are larger than the known MTU for the path, and to
collect any resultant ICMP error message. See Section 9 for
further discussion of MTU diagnostics.

5. Layering

Packetization Layer Path MTU Discovery is most easily implemented by
splitting its functions between layers. The IP layer is the best
place to keep shared state, collect the ICMP messages, track IP
header sizes and manage MTU information provided by the link layer
interfaces. However, the procedures that PLPMTUD uses for probing,
verification and scanning for the Path MTU are very tightly coupled
to features of the Packetization Layers such as data recovery and
congestion control state machines.
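
As a non-normative illustration of this split, the shared state kept
at the IP layer can be pictured as a small cache that Packetization
Layers read, update and subscribe to, so that sessions sharing a path
representation learn of a new MTU. Everything in this sketch is an
assumption made for illustration: the class PathInfoCache, its method
names, the use of a destination address string as the path
representation, and the default value of 1280 bytes in the usage
example are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PathInfoCache:
    """Hypothetical IP-layer cache of effective Path MTU values.

    Keys are a local path representation; a destination address string
    is assumed here purely for illustration.
    """

    def __init__(self, default_mtu: int) -> None:
        self._default_mtu = default_mtu
        self._pmtu: Dict[str, int] = {}
        self._listeners: Dict[str, List[Callable[[int], None]]] = (
            defaultdict(list))

    def lookup(self, path: str) -> int:
        """Return the cached PMTU, or the default if none is stored."""
        return self._pmtu.get(path, self._default_mtu)

    def subscribe(self, path: str,
                  on_change: Callable[[int], None]) -> None:
        """Register a session callback for MTU changes on this path."""
        self._listeners[path].append(on_change)

    def update(self, path: str, pmtu: int) -> None:
        """Store a new PMTU and notify every session sharing the path."""
        if pmtu == self.lookup(path):
            return  # nothing changed, so no notification is sent
        self._pmtu[path] = pmtu
        for on_change in self._listeners[path]:
            on_change(pmtu)
```

A session might call cache.subscribe("2001:db8::1", handler) when it
starts, and a Packetization Layer that completes a successful probe
would then call cache.update("2001:db8::1", 1500); each handler is
expected to make its own congestion adjustments.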
547 Note that this layering approach is consistent with the advice in the 548 current PMTUD specifications in RFC 1191 and RFC 1981. Many 549 implementations of classical PMTU Discovery are already split along 550 these same layers. 552 5.1. Accounting for header sizes 554 The way in which PLPMTUD operates across multiple layers requires a 555 mechanism for accounting header sizes at all layers between IP and 556 the Packetization Layer (inclusive). When transmitting non-probe 557 packets, it is sufficient for the Packetization Layer to ensure an 558 upper bound on final IP packet size, so as not to exceed the current 559 effective Path MTU. All Packetization Layers participating in 560 classical Path MTU Discovery have this requirement already. When 561 participating in PLPMTUD and transmitting a probe packet, the 562 Packetization Layer MUST determine that packet's final size including 563 IP headers. This requirement is specific to PLPMTUD, and to satisfy 564 it existing implementations may need additional inter-layer 565 communication. 567 5.2. Storing PMTU information 569 This memo uses the concept of a "flow" to define the scope of the 570 Path MTU discovery algorithms. For many implementations, a flow 571 would naturally correspond to an instance of each protocol, i.e., 572 each connection or session. In such implementations the algorithms 573 described in this document are performed within each session for each 574 protocol. The observed PMTU can optionally be shared between 575 different flows sharing a common path representation. 577 Alternatively, PLPMTUD could be implemented such that the complete 578 PLPMTUD state is associated with the path representations. Such an 579 implementation could use multiple connections or sessions for each 580 probe sequence. 
This approach may converge much more quickly in some 581 environments such as when an application uses many small connections, 582 each of which may be too short to complete the Path MTU Discovery 583 process. 585 These approaches are not mutually exclusive. However, due to 586 differing constraints on generating probes (Section 6.2) and the MTU 587 searching algorithm (Section 8.3), it may not be feasible for 588 different Packetization Layer protocols to share PLPMTUD state. This 589 suggests that it may be possible for some protocols to share probing 590 state, but not others. In this case, the different protocols can 591 still share the observed PMTU but they will have differing 592 convergence properties. 594 The IP layer is the best place to store cached PMTU values and other 595 shared state such as MTU values reported by ICMP PTB messages. 596 Ideally this shared state should be associated with a specific path 597 traversed by packets exchanged between the source and destination 598 nodes. However, in most cases a node will not have enough 599 information to completely and accurately identify such a path. 600 Rather, a node must associate a PMTU value with some local 601 representation of a path. It is left to the implementation to select 602 the local representation of a path. 604 An implementation could use the destination address as the local 605 representation of a path. The PMTU value associated with a 606 destination would be the minimum PMTU learned across the set of all 607 paths in use to that destination. The set of paths in use to a 608 particular destination is expected to be small, in many cases 609 consisting of a single path. This approach will result in the use of 610 optimally sized packets on a per-destination basis. This approach 611 integrates nicely with the conceptual model of a host as described in 612 RFC 2461 [13]: a PMTU value could be stored with the corresponding 613 entry in the destination cache. 
   Storing the minimum value is
   suggested since NATs and other forms of middleboxes may exhibit
   differing PMTUs at a single IP address.

   Note that network or subnet numbers are not suitable to use as
   representations of a path, because there is not a general mechanism
   to determine the network mask at the remote host.

   If IPv6 flows are in use, an implementation could use the IPv6 flow
   id [5][9] as the local representation of a path. Packets sent to a
   particular destination but belonging to different flows may use
   different paths, with the choice of path depending on the flow id.
   This approach will result in the use of optimally sized packets on a
   per-flow basis, providing finer granularity than MTU values
   maintained on a per-destination basis.

   For source routed packets, i.e., packets containing an IPv6 routing
   header, or IPv4 LSRR or SSRR options, the source route may further
   qualify the local representation of a path. An implementation could
   use source route information in the local representation of a path.

5.3.  Accounting for IPsec

   This document does not take a stance on the placement of IPsec, which
   logically sits between IP and the Packetization Layer. The PLPMTUD
   implementation can treat IPsec either as part of IP or as part of the
   Packetization Layer, as long as the accounting is consistent within
   the implementation. If IPsec is treated as part of the IP layer,
   then each security association to a remote node may need to be
   treated as a separate path; i.e., the security association is used to
   represent the path. If IPsec is treated as part of the Packetization
   Layer, the IPsec header size must be included in the Packetization
   Layer's header size calculations [11].

5.4.  Multicast

   In the case of a multicast destination address, copies of a packet
   may traverse many different paths to reach many different nodes.
   The
   local representation of the "path" to a multicast destination must in
   fact represent a potentially large set of paths.

   Minimally, an implementation could maintain a single MTU value to be
   used for all packets originated from the node. This MTU value would
   be the minimum MTU learned across the set of all paths in use by the
   node. This approach is likely to result in the use of smaller
   packets than is necessary for many paths.

   If the application using multicast gets complete delivery reports
   (unlikely because this requirement has poor scaling properties),
   PLPMTUD could be implemented in multicast protocols.

6.  Common Packetization Properties

   This section describes general Packetization Layer properties and
   characteristics needed to implement PLPMTUD. It also describes some
   implementation issues that are common to all Packetization Layers.

6.1.  Mechanism to detect loss

   It is important that the Packetization Layer has a timely and robust
   mechanism for detecting and reporting losses. PLPMTUD makes MTU
   adjustments on the basis of detected losses. Any delay or
   inaccuracy in loss notification is likely to result in incorrect MTU
   decisions or slow convergence.

   It is best if Packetization Protocols use fairly explicit loss
   notification such as selective acknowledgments, although implicit
   mechanisms such as TCP Reno style duplicate acknowledgment counting
   are sufficient. It is important that the mechanism can robustly
   distinguish between the isolated loss of just a probe and other
   combinations of losses.

   Many protocol implementations have sophisticated mechanisms such as a
   SACK scoreboard [14] to distinguish real losses from reordered data.
   In these implementations it is desirable to signal losses to PLPMTUD
   as a side effect of the data retransmission.
   This approach offers
   the maximum protection from confusing signals due to reordering and
   other events that might mimic losses.

   PLPMTUD can also be implemented in protocols that rely on timeouts as
   their primary mechanism for loss recovery; however, timeouts should
   be used only when there are no other alternatives.

6.2.  Generating probes

   There are several possible ways to alter Packetization Layers to
   generate probes. The different techniques incur different overheads
   in three areas: difficulty in generating the probe packet (in terms
   of Packetization Layer implementation complexity and extra data
   motion), possible additional network capacity consumed by the probes,
   and the overhead of recovering from failed probes (both network and
   protocol overheads).

   Some protocols might be extended to allow arbitrary padding with
   dummy data. This greatly simplifies the implementation because the
   probing can be performed without participation from higher layers and
   if the probe fails, the missing data (the "probe gap") is assured to
   fit within the current MTU when it is retransmitted. This is
   probably the most appropriate method for protocols that support
   arbitrary length options or multiplexing within the protocol itself.

   Many Packetization Layer protocols can carry pure control messages
   (without any data from higher protocol layers) which can be padded to
   arbitrary lengths. For example, the SCTP PAD chunk can be used in
   this manner (see Section 10.2). This approach has the advantage that
   nothing needs to be retransmitted if the probe is lost.

   These techniques do not work for TCP, because there is not a separate
   length field or other mechanism to differentiate between padding and
   real payload data. With TCP the only approach is to send additional
   payload data in an over-sized segment.
   There are at least two
   variants of this approach, discussed in Section 10.1.

   In a few cases, there may be no reasonable mechanisms to generate
   probes within the Packetization Layer protocol itself. As a last
   resort, it may be possible to rely on an adjunct protocol, such as
   ICMP ECHO ("ping"), to send probe packets. See Section 10.3 for
   further discussion of this approach.

7.  Host Fragmentation

   Packetization Layers are encouraged to avoid sending messages that
   will require fragmentation [16] [18]. However, entirely preventing
   fragmentation is not always possible. Some Packetization Layers,
   such as a UDP application outside the kernel, may be unable to change
   the size of the messages they send, resulting in datagram sizes that
   exceed the Path MTU.

   IPv4 permitted such applications to send packets without the DF bit
   set. Oversized packets without the DF bit set would be fragmented in
   the network or sending host when they encountered a link with an MTU
   smaller than the packet. In some cases, packets could be fragmented
   more than once if there were cascaded links with progressively
   smaller MTUs. This approach is not recommended.

   It is recommended that IPv4 implementations use a strategy that
   mimics IPv6 functionality. When an application sends datagrams that
   are larger than the known Path MTU they should be fragmented to the
   Path MTU in the host IP layer even if they are smaller than the link
   MTU of the first network hop directly attached to the host. The DF
   bit should be set on the fragments, so they will not be fragmented
   again in the network.

   This technique will minimize future surprises as the Internet
   migrates to IPv6. Otherwise, the potential exists for widely
   deployed applications or services relying on IPv4 fragmentation in a
   way that cannot be implemented in IPv6. At least one major operating
   system already uses this strategy.
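
   As an illustration only, the host fragmentation strategy recommended
   above might be sketched as follows. The sketch is not part of this
   specification: the names Fragment and plan_host_fragments are
   hypothetical, and it assumes a 20-byte IPv4 header with no options.
   It only plans fragment boundaries and flags; it does not build or
   send packets.

```python
from dataclasses import dataclass
from typing import List

IPV4_HEADER_LEN = 20  # assumption: IPv4 header without options


@dataclass
class Fragment:
    offset_units: int     # fragment offset in 8-byte units
    payload_len: int      # bytes of the original IP payload carried
    more_fragments: bool  # MF flag
    dont_fragment: bool   # DF flag, set so routers never re-fragment


def plan_host_fragments(payload_len: int, path_mtu: int) -> List[Fragment]:
    """Plan fragments so that every packet fits within the known Path MTU."""
    if payload_len < 0:
        raise ValueError("payload_len must be non-negative")
    # Largest per-fragment payload; non-final fragments must carry a
    # multiple of 8 bytes because offsets are expressed in 8-byte units.
    max_payload = (path_mtu - IPV4_HEADER_LEN) // 8 * 8
    if max_payload <= 0:
        raise ValueError("path_mtu too small to carry any payload")
    # A datagram that already fits is sent whole, still with DF set.
    if payload_len + IPV4_HEADER_LEN <= path_mtu:
        return [Fragment(0, payload_len, False, True)]
    fragments = []
    offset = 0
    while offset < payload_len:
        chunk = min(max_payload, payload_len - offset)
        is_last = offset + chunk >= payload_len
        fragments.append(Fragment(offset // 8, chunk, not is_last, True))
        offset += chunk
    return fragments
```

   For a 3000-byte IP payload and a known Path MTU of 1400 bytes, this
   sketch yields three fragments carrying 1376, 1376, and 248 bytes,
   each with DF set, regardless of the MTU of the first-hop link.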

   The ability to selectively transmit packets larger than the current
   effective Path MTU (but smaller than the link MTU) is REQUIRED, to be
   able to send probes generated by Packetization Layers participating
   in PLPMTUD, and to facilitate diagnostic utilities.

   Note that IP fragmentation divides data into packets, so it is
   minimally a Packetization Layer. However, it does not have a
   mechanism to detect lost packets, so it cannot support a native
   implementation of PLPMTUD. Fragmentation-based PLPMTUD requires an
   adjunct protocol as described in Section 10.3.

8.  The Probing Method

   This section describes the details of the MTU probing method,
   including how to send probes and process error indications necessary
   to search for the Path MTU.

8.1.  Packet size ranges

   This document describes the probing method using three state
   variables:
   search_low:  The smallest available probe size, minus one.
   search_high:  The greatest available probe size.
   eff_pmtu:  The effective PMTU for this flow.

          search_low          eff_pmtu         search_high
              |                   |                  |
      ...------------------------->
             non-probe size range
              <-------------------------------------->
                          probe size range

                                 Figure 1

   When transmitting probes, the Packetization Layer MUST select the
   probe size from within the range "(search_low, search_high]". When
   transmitting non-probes, it SHOULD create packets of size less than
   or equal to eff_pmtu.

   The eff_pmtu must be in the range "[search_low, search_high]". When
   probing upward, eff_pmtu always equals search_low. However, in other
   states this may not be the case, for example, due to initial
   conditions or after ICMP PTB message processing.

8.2.  Selecting initial values

   The initial value for search_high should be the largest possible
   packet supported by the flow.
   This may be limited by the local
   interface MTU, by a protocol mechanism such as the TCP MSS option, or
   an intrinsic limit such as the protocol length field.

   It is recommended that search_low be initially set to a value likely
   to work over a large range of links. Given today's technologies, a
   value of 512 bytes is likely to work. For IPv6 flows, a value of
   1280 is appropriate. The initial value for search_low SHOULD be
   configurable.

   Properly functioning Path MTU Discovery is critical to the robust and
   efficient operation of the Internet. Any major change (as described
   in this document) has the potential to be very disruptive if it
   contains any errors or oversights. The selection of initial values
   determines to what extent a PLPMTUD implementation's behavior differs
   from classical PMTUD in cases where MTU discovery is not needed, or
   where classical PMTUD is sufficient.

   It may be desirable to configure hosts in such a way that PLPMTUD
   only has an effect in cases where classical PMTUD fails. Setting
   eff_pmtu = search_high and relying on black hole detection has this
   effect. Using initial values of search_low = eff_pmtu = search_high
   effectively disables PLPMTUD, resorting to only classical PMTUD.

   In some cases where it is known that classical PMTUD is likely to
   fail, using a conservatively small initial eff_pmtu may produce
   better results by avoiding the costly timeouts required for black
   hole detection. The trade-off is that using a smaller initial
   eff_pmtu than necessary can cause reduced performance. Appropriate
   initial values for PLPMTUD state variables may vary not only per host
   but per path. As such, per-route configuration options for these
   values are desirable.

8.3.  Selecting probe size

   The probe may have a size anywhere in the "probe size range"
   described above. However, a number of factors affect the selection
   of an appropriate size.
   A simple strategy might be to do a binary
   search halving the probe size range with each probe. However, for
   some protocols, data in a lost probe may require retransmission,
   making a failed probe more expensive than a successful probe. For
   such protocols, a strategy using smaller probe sizes and "probing up"
   may behave better. For many protocols, both at and above the
   Packetization Layer, the benefit of increasing MTU sizes may follow a
   step function such that it is not advantageous to probe within
   certain regions at all.

   As an optimization, it may be appropriate to probe at certain common
   or expected MTU sizes, for example, 1500 bytes for standard Ethernet,
   or 1500 bytes minus header sizes for tunnel protocols.

   Some protocols may not even "choose" probe sizes. For protocols
   which have certain natural data block sizes, an effective strategy
   could be to simply treat blocks whose size falls in the probe size
   range as a probe.

   Each Packetization Layer must determine when probing is considered
   converged; that is, when the probe size range is considered small
   enough that further probing is no longer worth its cost. When it is
   determined that searching has converged, a timer should be set. When
   the timer expires, search_high should be reset to its initial value
   (described above) so that probing can resume. This is so that if the
   path changes, and an increased Path MTU is available, then the flow
   will eventually be able to take advantage of it to send larger
   packets. The recommended value for this timer is 10 minutes, per RFC
   1981.

8.4.  Probing preconditions

   Before sending a probe, the flow must at least meet the following
   conditions:
   o  The flow has no outstanding probes or losses.
   o  If the last probe failed or was inconclusive, then the probe
      timeout has expired (see Section 8.6.2).
   o  The available window is greater than the probe size.
   o  For a protocol using in-band data for probing, enough data is
      available to send the probe.

   For protocols which probe with in-band data, when not enough data is
   available to probe, the protocol may wish to delay sending non-probes
   in order to accumulate enough data to send a probe. A delayed
   sending algorithm such as Nagle [15] should be used to appropriately
   limit the time data is delayed.

   Some protocols may require additional packets after the loss to
   detect it promptly (e.g., TCP loss detection using duplicate
   acknowledgments). Such a protocol should wait until sufficient data
   and window space are available so that it will be able to transmit
   enough data after the probe to trigger the loss detection mechanism
   in the event of a lost probe.

8.5.  Conducting a probe

   Once a probe size in the appropriate range has been selected, and the
   above preconditions have been met, the Packetization Layer may
   conduct a probe. To do so, it creates a probe packet such that its
   size, including the outermost IP headers, is equal to the probe size.
   After sending the probe it awaits a response, which may have one of
   the following results:
   Success:  The probe is acknowledged as having been received by the
      remote host.

   Failure:  A protocol mechanism indicates that the probe was lost, but
      no packets in the leading or trailing window were lost.

   Timeout failure:  A protocol mechanism indicates that the probe was
      lost, and no packets in the leading window were lost, but is
      unable to determine if any packets in the trailing window were
      lost. For example, loss is detected by a timeout, and go-back-n
      retransmission is used.

   Inconclusive:  The probe was lost in addition to other packets in the
      leading or trailing windows.

8.6.  Response to probe results

   When a probe has completed, the result should be processed as
   follows, categorized by the probe's result type.

8.6.1.
Probe success

   When the probe is delivered, this is an indication that the Path MTU
   is at least as large as the probe size. The Packetization Layer
   should set search_low to the probe size, and eff_pmtu to
   "max(eff_pmtu, probe size)".

   Note that if a flow's packets are routed via multiple paths, or over
   a path with a non-deterministic MTU, delivery of a single probe
   packet does not indicate that all packets of that size will be
   delivered. To be robust in such a case, the Packetization Layer
   should conduct MTU verification as described in Section 8.8.

8.6.2.  Probe failure

   When only the probe is lost, this is treated as an indication that
   the Path MTU is smaller than the probe size. In this case alone, the
   loss should not be interpreted as a congestion signal.

   In the absence of other indications, the Packetization Layer should
   set search_high to the probe size minus one, and eff_pmtu to
   "min(eff_pmtu, probe size)".

   If an ICMP PTB message is received matching the probe packet, then
   search_high and eff_pmtu may be set from the MTU value indicated in
   the message. Note that the ICMP message may be received either
   before or after the protocol loss indication.

   A probe failure event is the one situation under which the
   Packetization Layer is permitted not to treat loss as a congestion
   signal. Because there is some small risk that suppressing congestion
   control might have unanticipated consequences (even for one isolated
   loss), it is required that probe failure events be less frequent than
   the normal period for losses under standard congestion control.
   Specifically, after a probe failure event and suppressed congestion
   control, PLPMTUD must not probe again until an interval which is
   comparable to the expected interval between congestion control
   events. This is required in Section 4.
   The simplest estimate of the
   interval to the next congestion event is the same number of round
   trips as the current congestion window in packets.

8.6.3.  Probe timeout failure

   If the loss was detected with a timeout and repaired with go-back-n
   retransmission, then congestion window reduction will be necessary.
   The relatively high price of a failed probe in this case may merit a
   longer timeout. A timeout value of five times the non-timeout
   failure case is recommended.

8.6.4.  Probe inconclusive

   The presence of other losses near the loss of the probe may indicate
   that the probe was lost due to congestion rather than because of an
   MTU limitation. In this case it is appropriate to update no state,
   and simply probe again when the probing preconditions are met; i.e.,
   when no recent losses have been observed. At this point, it is
   particularly appropriate to re-probe since the flow's congestion
   window will be at its lowest point, minimizing the probability of
   congestive losses.

8.7.  Full stop timeout

   Under all conditions a full stop timeout (also known as a "persistent
   timeout" in other documents) should be taken as an indication of some
   significantly disruptive event in the network, such as a router
   failure or a routing change to a path with a smaller MTU. For TCP,
   this occurs when the R1 timeout threshold described by RFC 1122 [8]
   expires.

   If there is a full stop timeout and there was not an ICMP message
   indicating a reason (PTB, Net unreachable, etc., or the ICMP message
   was ignored for some reason), the suggested first recovery action is
   to treat this as a detected black hole as described in RFC 2923 [10].

   The response to a detected black hole should be to set search_low to
   its initial value, and set eff_pmtu to search_low.
   Upon further
   successive timeouts, search_low and eff_pmtu should be halved, with a
   lower bound of 68 bytes for IPv4 and 1280 bytes for IPv6.

8.8.  MTU verification

   It is possible for a flow to simultaneously traverse multiple paths,
   but it will only be able to keep a single path representation for the
   flow. If in such a case the paths have different MTUs, storing the
   minimum MTU of all paths in the flow's path representation will
   result in correct, though sub-optimal behavior. If ICMP PTB messages
   are delivered, then classical PMTUD will work correctly in this
   situation.

   If ICMP delivery fails, breaking classical PMTUD, the connection will
   rely on PLPMTUD. However, in this case, PLPMTUD may fail as well
   since its requirement that links MUST NOT deliver packets larger than
   their MTU is violated. A probe with a size greater than the minimum
   but smaller than the maximum of the Path MTUs may be successful.
   However, upon raising the flow's effective PMTU, the loss rate may
   significantly increase. The flow may still make progress, but the
   resultant loss rate may be unacceptable. For example, when using
   two-way round-robin striping, 50% of full-sized packets would be
   lost.

   Striping in this manner is often operationally undesirable (e.g., due
   to packet reordering), and is usually avoided by hashing flows to a
   single path. However, to increase robustness, an implementation
   should implement some form of MTU verification, such that if
   increasing eff_pmtu results in a sharp increase in loss rate, it will
   fall back to using a lower MTU.

   A recommended strategy would be to save the value of eff_pmtu before
   raising it. Then, if loss rate rises above a threshold for a period
   of time (e.g., loss rate is higher than 10% over multiple RTO
   intervals), then the new MTU is considered incorrect.
   The saved
   value of eff_pmtu can be restored, and search_high reduced in the
   same manner as in a probe failure. PLPMTUD implementations SHOULD
   implement MTU verification.

9.  Diagnostic Interface

   All implementations MUST include facilities for MTU discovery
   diagnostic tools that implement PLPMTUD or other MTU discovery
   algorithms in user mode without help or interference by the PMTUD
   algorithm present in the operating system. This requires a mechanism
   where a diagnostic application can send packets that are larger than
   the operating system's notion of the current Path MTU and for the
   diagnostic application to collect any resulting ICMP PTB messages or
   other ICMP messages. For IPv4, the diagnostic application must be
   able to set the DF bit.

   At this time nearly all operating systems support two modes for
   sending UDP datagrams: one which silently fragments packets that are
   too large, and another that rejects packets that are too large.
   Neither of these modes is suitable for efficiently diagnosing
   problems with MTU discovery, such as routers that return ICMP PTB
   messages containing incorrect size information.

10.  Specific Packetization Layers

   This section discusses specific implementation details for different
   protocols that can be used as Packetization Layer protocols. All
   Packetization Layer protocols must consider all of the issues
   discussed in Section 6. For most protocols it is self-evident how to
   address many of these issues. It is hoped that the protocols
   described here will be sufficient illustration for implementors to
   adapt other protocols.

10.1.  Probing method using TCP

   TCP has no mechanism that could be used to distinguish between real
   application data and some other form of padding that might be used to
   fill out probe packets.
   Therefore, TCP must generate probes by
   sending oversized segments that are carrying in-band data. There are
   two approaches to segmentation from which an implementation may
   choose: overlapping or non-overlapping segments.

   In the non-overlapping method, data is segmented such that the probe
   and any subsequent segments contain no overlapping data. If the
   probe is lost, the "probe gap" will be a full probe size minus
   headers. Data in the probe gap will need to be retransmitted with
   multiple smaller segments.

   An alternate approach is to send data following the probe such that
   the probe gap is equal in length to the current MSS. In the case of
   a successful probe, this has added overhead in that it will send some
   data twice, but it will have to retransmit only one segment after a
   lost probe. When a probe succeeds, there will likely be some
   duplicate acknowledgments generated due to the duplicate data sent.
   It is important that these duplicate acknowledgments not trigger Fast
   Retransmit. As such, an implementation using this approach SHOULD
   limit the probe size to three times the current MSS (causing at most
   2 duplicate acknowledgments), or appropriately adjust its duplicate
   acknowledgment threshold for data immediately after a successful
   probe.

   The choice of which segmentation method to use should be based on
   what is simplest and most efficient for a given TCP implementation.

10.2.  Probing method using SCTP

   In the SCTP protocol [7] the application writes messages to SCTP and
   SCTP "chunkifies" them into smaller pieces suitable for transmission
   through the network. Once a message has been chunkified, its chunks
   are assigned Transmission Sequence Numbers (TSNs). Once some TSNs
   have been transmitted SCTP cannot change the chunk sizes.
   SCTP multi-path
   support normally requires SCTP to chunkify its messages to fit
   the smallest PMTU of all paths. Although not required,
   implementations may bundle multiple data chunks together to make
   larger IP packets to send on paths with a larger PMTU. Note that
   SCTP must independently probe the PMTU on each path to the peer.

   The recommended method for generating probes is to add a chunk
   consisting only of padding to an SCTP message. The PAD chunk defined
   in [19] SHOULD be attached to a minimum length HEARTBEAT chunk to
   build a probe packet. This method is fully compatible with all
   current SCTP implementations.

   SCTP MAY also probe with a method similar to TCP's described above,
   using inline data. Using such a method has the advantage that
   successful probes have no additional overhead; however, failed probes
   will require retransmission of data, which may significantly impact
   flow performance.

10.3.  Probing method using IP fragmentation

   As mentioned in Section 7, datagram protocols (such as UDP) might
   rely on IP fragmentation as a Packetization Layer. However,
   implementing PLPMTUD with IP fragmentation is problematic because the
   IP layer has no mechanism to determine if the packets are ultimately
   delivered properly to the far node, without participation by the
   application.

   To support IP fragmentation as a Packetization Layer under an
   unmodified application, we propose the use of an adjunct MTU
   measurement protocol (ICMP ECHO) and a separate Path MTU Discovery
   daemon (described here) to perform PLPMTUD and update the stored Path
   MTU information.

   For IP fragmentation the initial MTU should be selected as described
   in Section 8.2, except with a separate global control for the
   default initial MTU for connectionless protocols.
   Since
   connectionless protocols may not keep enough state to effectively
   diagnose MTU black holes, it would be more robust to err on the side
   of using too small an initial MTU (e.g., 1 kByte or less) prior to
   initiating probing of the path to measure the MTU.

   Since many protocols that rely on IP fragmentation are
   connectionless, there is an additional problem with the path
   information cache: there are no events corresponding to connection
   establishment and tear-down to use to manage the cache itself. If
   there is no entry in the path information cache for a particular
   packet being transmitted, it uses an immutable cache entry for the
   "default path", which has an MTU that is fixed at the initial value.

   A new path cache entry is not created until there is an attempt to
   set the MTU.

   The Path MTU Discovery daemon should be triggered as a side effect of
   IP fragmentation. Once the number of fragmented datagrams via any
   particular path reaches some configurable threshold (e.g., 5
   datagrams), the daemon can start probing the path with ICMP ECHO
   packets. These probes must use the diagnostic interface described in
   Section 9 and have DF set. The daemon can implement the PLPMTUD
   probe sequence and search strategy, collect all of the ICMP
   responses, and store results in the path information cache in the IP
   layer.

   Alternatively, most of the PLPMTUD state machinery can be implemented
   within the path information cache in the IP layer, which can
   specifically invoke the Path MTU Discovery daemon to perform
   specified measurements on specific paths and report the results back
   to the IP layer.

   Using ICMP ECHO to measure the MTU has a number of potential
   robustness problems. Note that the most likely failures are due to
   losses unrelated to MTU (e.g., nodes that discriminate on the basis
   of protocol type).
   These non-MTU-related losses can prevent PLPMTUD
   from raising the MTU, forcing the Packetization Protocol to use a
   smaller MTU than necessary. Since these failures are not likely to
   cause interoperability problems they are relatively benign.

   However, there do exist other, more serious failure modes, such as
   layer 3 or 4 routers choosing different paths for different protocol
   types or sessions. In such environments, adjunct protocols may
   experience different MTUs than the primary protocol. If the adjunct
   protocol has a larger MTU than the primary protocol, PLPMTUD will
   select a non-functional MTU. This does not seem to be a likely
   situation.

10.4.  Probing method using applications

   The disadvantages of probing with ICMP ECHO can be overcome by
   implementing the Path MTU Discovery daemon within the application
   itself, using the application's own protocol.

   The application must have some suitable method for generating probes.
   The ideal situation is a lightweight echo function, that confirms
   message delivery, plus a mechanism for padding the messages out to
   the desired MTU, such that the padding is not echoed. This
   combination (akin to the SCTP HB plus PAD) is preferred because you
   can send large probes that cause small acknowledgments. For
   protocols that cannot implement these messages directly there are
   often alternate methods for generating probes. For example, the
   protocol may have a variable length echo (that measures both the
   forward and return path) or if there is no echo function, there may
   be a way to add padding to regular messages carrying real application
   data. There may also be other ways to generate probes. As a last
   resort, it may be feasible to extend the protocol with new message
   types to support MTU discovery.
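
   The echo-plus-padding combination described above might look like the
   following sketch for a UDP-based application. Everything specific in
   it is an assumption made for illustration and is not defined by this
   document: the message layout (a 1-byte type followed by a 4-byte
   probe identifier), the names build_probe and build_ack, and the
   28 bytes of overhead (IPv4 plus UDP headers without options).

```python
import struct

# Hypothetical wire format: 1-byte message type, 4-byte probe identifier,
# network byte order, no alignment padding (5 bytes in total).
PROBE_HEADER = struct.Struct("!BI")
TYPE_PROBE = 1
TYPE_ACK = 2
IPV4_UDP_OVERHEAD = 28  # assumption: 20-byte IPv4 header + 8-byte UDP header


def build_probe(probe_id: int, probe_size: int,
                overhead: int = IPV4_UDP_OVERHEAD) -> bytes:
    """Build a payload so that the final IP packet is probe_size bytes."""
    pad_len = probe_size - overhead - PROBE_HEADER.size
    if pad_len < 0:
        raise ValueError("probe_size too small for headers")
    # The padding carries no information; it only inflates the packet.
    return PROBE_HEADER.pack(TYPE_PROBE, probe_id) + b"\x00" * pad_len


def build_ack(probe_payload: bytes) -> bytes:
    """Acknowledge a probe by echoing its identifier but not its padding."""
    msg_type, probe_id = PROBE_HEADER.unpack_from(probe_payload)
    if msg_type != TYPE_PROBE:
        raise ValueError("not a probe message")
    return PROBE_HEADER.pack(TYPE_ACK, probe_id)
```

   With these assumptions a 1400-byte probe carries a 1372-byte UDP
   payload while its acknowledgment is only 5 bytes long, so the result
   reflects the forward path and is unlikely to be affected by the MTU
   of the return path. The sender would still need to transmit the
   probe with DF set, which is outside the scope of this sketch.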

   Probing within an application introduces one new issue: many
   applications do not currently concern themselves with MTU and rely on
   IP fragmentation to deliver datagrams that just happen to be larger
   than the Path MTU. PLPMTUD requires that the protocol be able to
   send probes that are larger than the IP layer's current notion of the
   Path MTU, but are marked not to be fragmented. This requires an
   alternate method for sending these datagrams.

   As with ICMP MTU probing, there is considerable flexibility in how
   the PLPMTUD algorithms can be divided between the Application and the
   path information cache.

   Some applications send large datagrams no matter what the link size,
   and rely on IP fragmentation to deliver the datagrams. It has been
   known for a long time that this has some undesirable consequences
   [16]. More recently it has come to light that IPv4 fragmentation is
   not sufficiently robust for general use in today's Internet. The
   16-bit IP identification field is not large enough to prevent frequent
   misassociated IP fragments and the TCP and UDP checksums are
   insufficient to prevent the resulting corrupted data from being
   delivered to higher protocol layers [18].

11.  References

11.1.  Normative references

   [1]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

   [2]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
        November 1990.

   [3]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery for
        IP version 6", RFC 1981, August 1996.

   [4]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [5]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
        Specification", RFC 2460, December 1998.

   [6]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
        September 1981.
1257 [7] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, 1258 H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, 1259 "Stream Control Transmission Protocol", RFC 2960, October 2000. 1261 [8] Braden, R., "Requirements for Internet Hosts - Communication 1262 Layers", STD 3, RFC 1122, October 1989. 1264 11.2. Informative references 1266 [9] Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809, 1267 June 1995. 1269 [10] Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923, 1270 September 2000. 1272 [11] Kent, S. and R. Atkinson, "Security Architecture for the 1273 Internet Protocol", RFC 2401, November 1998. 1275 [12] Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914, 1276 September 2000. 1278 [13] Narten, T., Nordmark, E., and W. Simpson, "Neighbor Discovery 1279 for IP Version 6 (IPv6)", RFC 2461, December 1998. 1281 [14] Blanton, E., Allman, M., Fall, K., and L. Wang, "A Conservative 1282 Selective Acknowledgment (SACK)-based Loss Recovery Algorithm 1283 for TCP", RFC 3517, April 2003. 1285 [15] Nagle, J., "Congestion control in IP/TCP internetworks", 1286 RFC 896, January 1984. 1288 [16] Kent, C. and J. Mogul, "Fragmentation considered harmful", 1289 Proc. SIGCOMM '87 vol. 17, No. 5, October 1987. 1291 [17] Mahdavi, J. and S. Floyd, "TCP-Friendly Unicast Rate-Based Flow 1292 Control", Technical note sent to the end2end-interest mailing 1293 list , January 1997, 1294 . 1296 [18] Mathis, M., "Fragmentation Considered Very Harmful", 1297 draft-mathis-frag-harmful-00 (work in progress), July 2004. 1299 [19] Tuexen, M. and R. Stewart, "Padding Chunk and Parameter for 1300 SCTP", draft-tuexen-tsvwg-sctp-padding-00 (work in progress), 1301 February 2006. 1303 Appendix A. Security Considerations 1305 Under all conditions the PLPMTUD procedure described in this document 1306 is at least as secure as the current standard Path MTU Discovery 1307 procedures described in RFC 1191 and RFC 1981. 
1309 Since this algorithm is designed for robust operation without any 1310 ICMP (or other messages from the network), PLPMTUD could be 1311 configured to ignore all ICMP messages (globally or on a per 1312 application basis). In this configuration, it cannot be attacked 1313 unless the attacker can identify and selectively cause probe packets 1314 to be lost. 1316 Appendix B. IANA Considerations 1318 None. 1320 Appendix C. Acknowledgements 1322 Many ideas and even some of the text come directly from RFC 1191 and 1323 RFC 1981. 1325 Many people made significant contributions to this document, 1326 including: Randall Stewart for SCTP text, Michael Richardson for 1327 material from an earlier ID on tunnels that ignore DF, Stanislav 1328 Shalunov for the idea that pure PLPMTUD parallels congestion control, 1329 and Matt Zekauskas for maintaining focus during the meetings. Thanks 1330 to the early implementors: Kevin Lahey, John Heffner and Rao Shoaib 1331 who provided concrete feedback on weaknesses in earlier drafts. 1332 Thanks also to all of the people who made constructive comments in 1333 the working group meetings and on the mailing list. I am sure I have 1334 missed many deserving people. 1336 Matt Mathis and John Heffner are supported in this work by a grant 1337 from Cisco Systems, Inc. 1339 Authors' Addresses 1341 Matt Mathis 1342 Pittsburgh Supercomputing Center 1343 4400 Fifth Avenue 1344 Pittsburgh, PA 15213 1345 US 1347 Phone: 412-268-3319 1348 Email: mathis@psc.edu 1350 John W. 
Heffner 1351 Pittsburgh Supercomputing Center 1352 4400 Fifth Avenue 1353 Pittsburgh, PA 15213 1354 US 1356 Phone: 412-268-2329 1357 Email: jheffner@psc.edu 1359 Intellectual Property Statement 1361 The IETF takes no position regarding the validity or scope of any 1362 Intellectual Property Rights or other rights that might be claimed to 1363 pertain to the implementation or use of the technology described in 1364 this document or the extent to which any license under such rights 1365 might or might not be available; nor does it represent that it has 1366 made any independent effort to identify any such rights. Information 1367 on the procedures with respect to rights in RFC documents can be 1368 found in BCP 78 and BCP 79. 1370 Copies of IPR disclosures made to the IETF Secretariat and any 1371 assurances of licenses to be made available, or the result of an 1372 attempt made to obtain a general license or permission for the use of 1373 such proprietary rights by implementers or users of this 1374 specification can be obtained from the IETF on-line IPR repository at 1375 http://www.ietf.org/ipr. 1377 The IETF invites any interested party to bring to its attention any 1378 copyrights, patents or patent applications, or other proprietary 1379 rights that may cover technology that may be required to implement 1380 this standard. Please address the information to the IETF at 1381 ietf-ipr@ietf.org. 1383 Disclaimer of Validity 1385 This document and the information contained herein are provided on an 1386 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 1387 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 1388 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 1389 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 1390 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 1391 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
1393 Copyright Statement 1395 Copyright (C) The Internet Society (2006). This document is subject 1396 to the rights, licenses and restrictions contained in BCP 78, and 1397 except as set forth therein, the authors retain all their rights. 1399 Acknowledgment 1401 Funding for the RFC Editor function is currently provided by the 1402 Internet Society.