idnits 2.17.1 draft-mathis-pmtud-method-00.txt: -(846): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(867): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(899): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. == There are 4 instances of lines with non-ascii characters in the document. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 20 longer pages, the longest (page 2) being 60 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 21 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 9 instances of too long lines in the document, the longest one being 6 characters in excess of 72. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 233: '...l Internet nodes SHOULD implement Path...' RFC 2119 keyword, line 242: '... Links MUST not deliver packets that...' RFC 2119 keyword, line 244: '...clock stability) MUST include explicit...' RFC 2119 keyword, line 249: '... if any, SHOULD send a Packet Too Bi...' 
RFC 2119 keyword, line 255: '...t the connection MUST have at least th...' (7 more instances...) Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 867 has weird spacing: '...imed to per...' == Line 899 has weird spacing: '...for the purpo...' == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: Links MUST not deliver packets that are larger than their true MTU. Links that have parametric limitations (e.g. MTU bounds due to limited clock stability) MUST include explicit mechanisms to consistently reject packets that might otherwise be nondeterministically delivered. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 21, 2003) is 7614 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'IPv4-SPEC' is mentioned on line 240, but not defined == Missing Reference: 'IPv6-SPEC' is mentioned on line 677, but not defined == Missing Reference: 'FRAG' is mentioned on line 350, but not defined == Missing Reference: 'ND' is mentioned on line 411, but not defined == Missing Reference: 'CONG' is mentioned on line 709, but not defined == Missing Reference: 'ISOTP' is mentioned on line 732, but not defined == Missing Reference: 'RPC' is mentioned on line 742, but not defined == Unused Reference: 'RFC1191' is defined on line 802, but no explicit reference was found in the text == Unused Reference: 'RFC1435' is defined on line 806, but no explicit reference was found in the text == Unused Reference: 'RFC1981' is defined on line 810, but no explicit reference was found in the text == Unused Reference: 'RFC2923' is defined on line 814, but no explicit reference was found in the text == Unused Reference: 'RFC1063' is defined on line 819, but no explicit reference was found in the text == Unused Reference: 'RFC1626' is defined on line 823, but no explicit reference was found in the text == Unused Reference: 'RFC1791' is defined on line 827, but no explicit reference was found in the text ** Obsolete normative reference: RFC 1063 (ref. 'RFC1191') (Obsoleted by RFC 1191) ** Downref: Normative reference to an Informational RFC: RFC 1435 ** Obsolete normative reference: RFC 1981 (Obsoleted by RFC 8201) ** Downref: Normative reference to an Informational RFC: RFC 2923 Summary: 8 errors (**), 0 flaws (~~), 21 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 2 Internet-Draft Matt Mathis 3 John Heffner 4 PSC 5 Kevin Lahey 6 Freelance 7 June 21, 2003 9 Path MTU Discovery 10 draft-mathis-pmtud-method-00.txt 12 Status of this Memo 14 This document is an Internet-Draft and is in full conformance with 15 all provisions of Section 10 of RFC2026. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that other 19 groups may also distribute working documents as Internet-Drafts. 21 Internet-Drafts are draft documents valid for a maximum of six months 22 and may be updated, replaced, or obsoleted by other documents at any 23 time. It is inappropriate to use Internet-Drafts as reference 24 material or to cite them other than as "work in progress." 26 The list of current Internet-Drafts can be accessed at 27 http://www.ietf.org/ietf/1id-abstracts.txt 29 The list of Internet-Draft Shadow Directories can be accessed at 30 http://www.ietf.org/shadow.html. 32 Abstract 34 [@@ To be rewritten] 36 This document describes Path MTU Discovery for the Internet. It is 37 largely derived from RFC 1191 and RFC 1981, which describe ICMP based 38 Path MTU Discovery for IP versions 4 and 6, plus a robust new 39 algorithm. 41 The general strategy of the new algorithm is to start with a small 42 MTU and probe upward, testing successively larger MTUs by probing 43 with single packets. If the probe is successfully delivered, then 44 the MTU is raised. If the probe is lost, it is treated as an MTU 45 limitation and not as a congestion signal. 47 Table of Contents 49 TBD 51 1. Introduction 53 When one Internet node has a large amount of data to send to another 54 node, the data is transmitted in a series of IP packets. It is 55 usually preferable that these packets be of the largest size that can 56 successfully traverse the path from the source node to the 57 destination node. 
This packet size is referred to as the Path MTU 58 (PMTU), and it is equal to the minimum link MTU of all the links in a 59 path. 61 This document describes a path MTU discovery (PMTUD) method based on 62 the earlier methods described in the standards track documents, 63 RFC1191 and RFC1981, with the addition of a new algorithm that 64 searches for the proper MTU by probing with successively larger 65 packets. Large sections of this document are taken directly from 66 RFC1191 and RFC1981. 68 The methods described in this document apply to IPv4, IPv6, TCP, and 69 other transport protocols. This document does not define a 70 protocol, but rather a method to use features of existing protocols 71 to discover the path MTU. It does not require cooperation from the 72 lower layers (except that they are consistent about what packet sizes 73 are acceptable) or the far node. Variations among implementations will 74 not cause problems with interoperability. 76 [[As a consequence people are encouraged to start developing 77 experimental implementations as soon as the requirements section is 78 stable. All other sections are recommendations only.]] 80 For the sake of clarity we uniformly prefer TCP and IPv6 terminology. In 81 the terminology section we also present the analogous IPv4 terms and 82 concepts for the IPv6 terminology. In a few situations we describe 83 specific details that are different between IPv4 and IPv6. 85 [[This document still bears markup notes, indicated with square 86 brackets [] or @@@@ signs.]] 88 2. Terminology 90 IP - Either IPv4 [IPv4-SPEC] or IPv6 [IPv6-SPEC]. 92 node - a device that implements IP. 94 router - a node that forwards IP packets not explicitly 95 addressed to itself. 97 host - any node that is not a router. 99 upper layer - a protocol layer immediately above IP. 
Examples are 100 transport protocols such as TCP and UDP, control 101 protocols such as ICMP, routing protocols such as OSPF, 102 and Internet or lower-layer protocols being "tunneled" 103 over (i.e., encapsulated in) IP such as IPX, 104 AppleTalk, IP itself. 106 link - a communication facility or medium over which nodes can 107 communicate at the link layer, i.e., the layer 108 immediately below IPv6. Examples are Ethernets (simple 109 or bridged); PPP links; X.25, Frame Relay, or ATM 110 networks; and Internet (or higher) layer "tunnels", 111 such as tunnels over IPv4 or IPv6 itself. 113 interface - a node's attachment to a link. 115 address - an IP-layer identifier for an interface or a set of 116 interfaces. 118 packet - an IP header plus payload. 120 MTU - Maximum Transmission Unit, the size in bytes of the 121 largest packet that can be transmitted on a link or 122 path. Note that this could more properly be called 123 the IP MTU, to be consistent with how other standards 124 organizations use the term. Beware that the definition 125 used in this and other IETF documents is not the same 126 as the definition used in other contexts. 128 link MTU - the Maximum Transmission Unit, i.e., maximum packet 129 size in octets, that can be conveyed in one piece over 130 a link. 132 path - the set of links traversed by a packet between a source 133 node and a destination node 135 path MTU - the minimum link MTU of all the links in a path between 136 a source node and a destination node. 138 PMTU - path MTU 140 Path MTU Discovery, 141 PMTUD - process by which a node learns the PMTU of a path 143 Packet Too Big message 144 - An ICMP message reporting that an IP packet is too 145 large to forward. This is the IPv6 term that 146 corresponds to the IPv4 "ICMP Can't fragment" message. 148 flow id - a combination of a source address and a non-zero 149 IPv6 flow label. 151 L3 MTU - the maximum available IP payload size, usually over a 152 specific path. 
This is the maximum layer 3 transmission 153 unit (e.g TCP message, including all TCP headers and data, 154 but not IP or link headers.) 156 segment size- the L3 payload size (from TCP usage). 158 probe packet- A packet which is being used to test for a larger MTU. 160 probe size - The size of a packet being used to probe for a larger MTU. 162 successful probe 163 - The probe packet was delivered through the network. 165 inconclusive probe 166 - The probe packet was not delivered, but there were other lost 167 packets too close to the probe. By implication the probe 168 might have been lost due to something other than MTU, so the 169 results are inconclusive. 171 failed probe 172 - The probe packet was not delivered and there were not other 173 lost packets close to the probe. 175 probe gap - The L3 payload data that will need to be retransmitted if the 176 probe is not delivered. 178 [[Deprecated terms - these terms should only appear in very specific parts of 179 the document. 181 ICMP 183 Can't fragment messages 185 lower layers 187 @@@ remove as the document matures]] 189 3. Overview 191 This document describes a technique to dynamically discover the MTU 192 of a path. These procedures are applicable to TCP and other 193 transport- or application-level Packetization protocols which 194 implement similar features. 196 The general strategy of the new procedure is to find the proper MTU 197 by starting a connection using relatively small packets and then 198 probing with progressively larger packets (containing application 199 data). If a probe packet is successfully delivered, then the path 200 MTU is raised. The isolated loss of a probe packet (with or without 201 a Packet Too Big message) is treated as an indication of an MTU 202 limit, and not as a congestion indicator. 204 PMTUD can optionally process Packet Too Big messages for faster 205 convergence in exchange for a slight decrease in robustness. 
206 Processing malicious or erroneous Packet Too Big messages can cause 207 PMTU discovery to arrive at the incorrect MTU for a path, which is 208 likely to reduce protocol performance. The document describes three 209 options for processing Packet Too Big messages: completely ignore 210 them, only accept them in response to probes or accept all Packet Too 211 Big messages (the previous approach). 213 In addition, PMTUD can be extended with heuristics to use alternate 214 criteria to select PMTU. For example, on a path that is so congested 215 that the fair share window is too small (smaller than 5 kB), TCP may 216 be better behaved with 512-byte packets than with 1500-byte packets 217 since with the larger packets the window would be too small to 218 trigger Fast Retransmit. 220 Relatively few details of this procedure affect interoperability with 221 other standards or Internet protocols. These details are specified 222 in RFC2026 standards language in the requirements section. The vast 223 majority of the implementation details are recommendations based on 224 experiences with earlier versions of path MTU discovery. These are 225 motivated by a desire to maximize robustness in the presence of less 226 than ideal implementations as they exist in the field. 228 4. Requirements 230 [This section is written in 2026 standards language MUST/SHOULD, 231 etc.] 233 All Internet nodes SHOULD implement Path MTU Discovery in order to 234 discover and take advantage of the largest MTU supported along the 235 Internet path. 237 Nodes not implementing Path MTU Discovery must use a default MTU as 238 specified by the respective IP protocols. For IPv6 the default MTU 239 is 1280 bytes, the minimum link MTU as defined in [IPv6-SPEC]. For 240 IPv4 it is 576 bytes, as specified in [IPv4-SPEC]. 242 Links MUST not deliver packets that are larger than their true MTU. 243 Links that have parametric limitations (e.g. 
MTU bounds due to 244 limited clock stability) MUST include explicit mechanisms to 245 consistently reject packets that might otherwise be 246 nondeterministically delivered. 248 When a packet is too large to traverse a link, the attached router, 249 if any, SHOULD send a Packet Too Big message (IPv6) or an ICMP "Can't 250 fragment" message (IPv4 with DF set), as appropriate. 252 The requirements below only apply to those implementations that 253 include Path MTU Discovery. 255 Before a probe can be sent the connection MUST have at least one 256 candidate MSS worth of pending data and MUST be using the current 257 MSS, as defined by having received at least one acknowledgment for a 258 recent non-probe segment at the current MSS. This implicitly limits 259 successful probes to once per two round trips. [Making the algorithm 260 robust in the presence of multi-path routing is likely to require an 261 additional RTT.] @@@ generalize 263 Failed and inconclusive probes must be more widely spaced than the 264 normal Additive Increase Multiplicative Decrease (AIMD) congestion 265 interval for the current average window size. This is enforced by 266 keeping a "probe countdown" which is decremented on each non-probe 267 segment sent. Probes MUST NOT be sent before the probe countdown 268 reaches zero. @@@ generalize 270 The candidate MSS MUST be strictly smaller than three times the 271 current MSS. Thus the probe segment fully covers at most one 272 subsequent segment. The second subsequent segment is at most 273 partially covered by the probe segment. This guarantees that the 274 segments following the probe segment will cause at most one 275 superfluous duplicate acknowledgment. @@@ generalize 277 The TCP MUST be using Fast Retransmit and either SACK or NewReno, such that 278 isolated lost segments will normally be retransmitted without the 279 spurious retransmission of any additional segments. 
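As a non-normative illustration, the probe preconditions stated above can be gathered into a single check. The following is a minimal sketch in Python, not part of the draft: the ProbeState record, its field names, and may_send_probe are hypothetical names introduced here, and the sketch assumes the implementation already tracks each of these quantities.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Hypothetical per-connection state; field names are illustrative only."""
    current_mss: int            # current MSS in bytes
    pending_bytes: int          # application data queued and not yet sent
    acked_at_current_mss: bool  # ACK seen for a recent non-probe segment at current_mss
    probe_countdown: int        # decremented on each non-probe segment sent
    loss_recovery_ok: bool      # Fast Retransmit with SACK or NewReno is in use


def may_send_probe(s: ProbeState, candidate_mss: int) -> bool:
    """Return True only if every precondition for sending a probe holds."""
    if not s.loss_recovery_ok:
        return False
    if candidate_mss <= s.current_mss:
        return False  # a probe segment is larger than the current MSS
    if candidate_mss >= 3 * s.current_mss:
        return False  # candidate MSS must be strictly smaller than 3x current MSS
    if s.pending_bytes < candidate_mss:
        return False  # need at least one candidate MSS worth of pending data
    if not s.acked_at_current_mss:
        return False  # connection must demonstrably be using the current MSS
    return s.probe_countdown <= 0  # no probes before the countdown reaches zero
```

Keeping the checks in one predicate makes it easier to audit an implementation against the MUST-level statements; how the countdown is decremented and when acked_at_current_mss is reset are left to the surrounding TCP code.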
281 During the probe, all of the normal retransmission, recovery and 282 congestion control machinery is in effect, except that when only the probe 283 gap is retransmitted (and no other segments), the normal 284 multiplicative cwnd reduction is suppressed. If any other segments 285 are retransmitted, all normal cwnd reductions MUST take place. 287 If the probe was successful, the current MSS is updated to the 288 candidate MSS. If cwnd and other congestion state variables are kept 289 in packets, they MUST be rescaled by the change in MSS, to preserve 290 the current window size in bytes. @@@ generalize 292 5. Implementation Issues 293 This section discusses a number of issues related to the 294 implementation of Path MTU Discovery. This is not a specification, 295 but rather a set of notes provided as an aid for implementers. 297 The issues include: 299 - What layer or layers implement Path MTU Discovery? 301 - Accounting for headers 303 - How is the PMTU information cached? 305 - How are ICMP messages processed? 307 - How is stale PMTU information removed? 309 - How to implement PMTUD with TCP? 311 - What should other transport and higher layers do? 313 - What should tunnels above IP do? 315 5.1. Layering 317 In the IP architecture, the choice of what size packet to send is 318 made by a protocol at a layer above IP. This memo refers to such a 319 protocol as a "packetization protocol". Packetization protocols are 320 usually transport protocols (for example, TCP) but can also be 321 higher-layer protocols (for example, protocols built on top of UDP). 
323 Implementing Path MTU Discovery in the packetization layers 324 simplifies some of the inter-layer issues, but has several drawbacks: 325 the implementation may have to be redone for each packetization 326 protocol, it becomes hard to share PMTU information between different 327 packetization layers, and the connection-oriented state maintained by 328 some packetization layers may not easily extend to save PMTU 329 information for long periods. 331 It is therefore suggested that the IP layer store PMTU information 332 and that the ICMP layer process received Packet Too Big messages. 333 The packetization layers may respond to changes in the PMTU, by 334 changing the size of the messages they send. To support this 335 layering, packetization layers require a way to learn of changes in 336 the value of MMS_S, the "maximum send transport-message size". The 337 MMS_S is derived from the Path MTU by subtracting the size of the 338 IPv6 header plus space reserved by the IP layer for additional 339 headers (if any). 341 It is possible that a packetization layer, perhaps a UDP application 342 outside the kernel, is unable to change the size of messages it 343 sends. This may result in a packet size that exceeds the Path MTU. 345 To accommodate such situations, IPv6 defines a mechanism that allows 346 large payloads to be divided into fragments, with each fragment sent 347 in a separate packet (see [IPv6-SPEC] section "Fragment Header"). 348 However, packetization layers are encouraged to avoid sending 349 messages that will require fragmentation (for the case against 350 fragmentation, see [FRAG]). 352 To accommodate such situations, it is recommended that IPv4 use a 353 mechanism that parallels the IPv6 mechanism and only fragment in the 354 end systems. Also set DF on the fragments. @@@more 356 5.2. 
Accounting for headers 358 [[@@@To be written 360 IP MTU is the payload size of the lower layer (should be "lower layer 361 MTU minus link headers", but this is a different use of "MTU"). @@@ 362 more, clarify 364 L3 MTU is IP MTU minus IP headers @@@ more 366 MSS is L3 MTU minus TCP headers @@@ more 368 This document does not take a position on the placement of IPsec, 369 which logically sits at the boundary between IP and TCP or other 370 packetization layer. IPsec can be treated either as part of IP or as 371 part of the packetization layer, as long as the accounting is 372 consistent within any given implementation. 374 If IPsec is treated as part of the IP layer, then each security 375 association that contributes a different-length security header may 376 need to be treated as a separate path. If IPsec is treated as part 377 of the packetization layer, then the MSS to L3 MTU calculation must 378 include the IPsec header size. 380 ]] 382 5.3. Storing PMTU information 384 Ideally, a PMTU value should be associated with a specific path 385 traversed by packets exchanged between the source and destination 386 nodes. However, in most cases a node will not have enough 387 information to completely and accurately identify such a path. 388 Rather, a node must associate a PMTU value with some local 389 representation of a path. It is left to the implementation to select 390 the local representation of a path. 392 In the case of a multicast destination address, copies of a packet 393 may traverse many different paths to reach many different nodes. The 394 local representation of the "path" to a multicast destination must in 395 fact represent a potentially large set of paths. 397 Minimally, an implementation could maintain a single PMTU value to be 398 used for all packets originated from the node. This PMTU value would 399 be the minimum PMTU learned across the set of all paths in use by the 400 node. 
This approach is likely to result in the use of smaller 401 packets than is necessary for many paths. 403 An implementation could use the destination address as the local 404 representation of a path. The PMTU value associated with a 405 destination would be the minimum PMTU learned across the set of all 406 paths in use to that destination. The set of paths in use to a 407 particular destination is expected to be small, in many cases 408 consisting of a single path. This approach will result in the use of 409 optimally sized packets on a per-destination basis. This approach 410 integrates nicely with the conceptual model of a host as described in 411 [ND]: a PMTU value could be stored with the corresponding entry in 412 the destination cache. 414 If IPv6 flows [IPv6-SPEC] are in use, an implementation could use the 415 flow id as the local representation of a path. Packets sent to a 416 particular destination but belonging to different flows may use 417 different paths, with the choice of path depending on the flow id. 418 This approach will result in the use of optimally sized packets on a 419 per-flow basis, providing finer granularity than PMTU values 420 maintained on a per-destination basis. 422 For source routed packets (i.e. packets containing an IPv6 Routing 423 header [IPv6-SPEC]), the source route may further qualify the local 424 representation of a path. In particular, a packet containing a type 425 0 Routing header in which all bits in the Strict/Loose Bit Map are 426 equal to 1 contains a complete path specification. An implementation 427 could use source route information in the local representation of a 428 path. 430 Note: Some paths may be further distinguished by different security 431 classifications. The details of such classifications are beyond the 432 scope of this memo. @@@ this should be in scope 434 5.4. Probing method using TCP 436 A new "candidate MSS" is tested by sending one "probe segment", which 437 is larger than the current MSS. 
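As a non-normative illustration of how a probe started this way is eventually resolved, the sketch below encodes the outcome rules given in the remainder of this section together with the definitions of successful, failed, and inconclusive probes in Section 2. It is a Python sketch under stated assumptions: classify_probe, after_probe, and ProbeResult are hypothetical names, and the value COUNTDOWN_SCALE = 1 is a placeholder, since this document names the constant without assigning it a value.

```python
from enum import Enum

# Assumed placeholder value; the draft names COUNTDOWN_SCALE but does not define it.
COUNTDOWN_SCALE = 1


class ProbeResult(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    INCONCLUSIVE = "inconclusive"


def classify_probe(gap_retransmitted: bool, other_retransmissions: int) -> ProbeResult:
    """Classify a probe once the ACK sequence has advanced past the probe gap."""
    if not gap_retransmitted:
        return ProbeResult.SUCCESS       # probe segment was delivered
    if other_retransmissions == 0:
        return ProbeResult.FAILED        # isolated loss of the probe: MTU limit
    return ProbeResult.INCONCLUSIVE      # other losses nearby: cause unknown


def after_probe(result: ProbeResult, current_mss: int, candidate_mss: int,
                window_packets: int, countdown: int) -> tuple:
    """Return (new_mss, new_probe_countdown) after a completed probe."""
    if result is ProbeResult.SUCCESS:
        # Current MSS is raised; the countdown is left unchanged (an assumption).
        return candidate_mss, countdown
    # Failed or inconclusive: back off for COUNTDOWN_SCALE * window^2 segments.
    return current_mss, COUNTDOWN_SCALE * window_packets ** 2
```

For example, with a window of 10 packets and the placeholder scale of 1, a failed probe from 1024 to 2048 bytes leaves the MSS at 1024 and sets the countdown to 100 non-probe segments.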
439 After a probe segment has been sent (of size candidate MSS), the 440 subsequent segment(s) may be sent as though the probe segment was not 441 oversized. Thus if the probe segment is lost, it will leave a hole 442 that is exactly one current MSS in size. We refer to this potential hole as 443 the probe gap. Note that the length of the probe segment is 444 determined by the candidate MSS under consideration, but the length 445 of the probe gap is the current MSS. [This has been shown to be more 446 restrictive than necessary.] 448 The probe is completed when the acknowledgment sequence advances 449 past the probe gap. If the probe gap was not retransmitted, the probe 450 was successful. If the probe gap was retransmitted and there were no 451 other retransmissions, the candidate MSS failed. If the probe gap was retransmitted together with any 452 other segments, the probe was inconclusive. 454 If the probe was successful, the current MSS is updated to the 455 candidate MSS. @@@ add robustness language re: more losses 457 If the probe failed or was inconclusive, the probe countdown is set to 458 COUNTDOWN_SCALE times the square of the current window size in 459 packets. 461 If a Packet Too Big message is received, it can be used to compute 462 an MSS limit by deducting the TCP/IP header sizes (including options) 463 from the MTU reported in the ICMP message. If the MSS limit is 464 between the current MSS and candidate MSS, the current MSS is updated 465 to the MSS limit; otherwise the message is ignored. If the 466 current MSS is updated, then the probe strategy is forced into the 467 monitor state described below. @@@ update 469 5.5. Probing method using SCTP 471 @@@@ to be written 473 5.6. General probing methods 475 @@@@ to be written 477 5.7. Probe strategy 479 The probe strategy described here is a recommended baseline 480 algorithm. It is not presented in formal standards language because 481 the probe strategy can include heuristics to help select an optimal 482 MSS for a given path. 
As a consequence there is an opportunity for 483 future improvements to this algorithm. 485 The probing strategy has three major states: search, monitor and 486 suspend. During the search state, it sequentially searches for the 487 largest MSS that the path can support. Once the path MSS has been 488 discovered, the probing algorithm enters the monitor state where it 489 probes infrequently to detect if the path MSS has become larger. 491 If the MSS probing persistently fails it may be desirable to suspend 492 path MSS probing and heuristically select one of the common default 493 MSSs: 576, 1280, or 1500 Bytes. 495 5.7.1. Search 497 The recommended search strategy is a multi-phase scan: First, a 498 coarse scan for the approximate path MSS using factor-of-2 steps 499 starting at 1024 Bytes until a probe fails, followed by successively 500 finer scans between the largest previously successful and 501 unsuccessful probes.
503 Table 1: Recommended MSS scanning sequence
504 (Coarse scan down column 1, fine scan across each row)
505 512, [Use only after repeated timeouts]
506 1024, 1492, 2002
507 2048
508 4096, 4352
509 8192, 9000
510 16384, 17914
511 32768
512 64512
513 ((Additional values needed))
515 During the scan it is recommended that the MSS not be raised if cwnd 516 is too small as determined by a heuristic. The recommended heuristic 517 is that the MSS is only raised when the cwnd is larger than 20 518 segments. 520 5.7.2. Monitor 522 Once the scan has found an appropriate MSS, the probe strategy enters 523 the monitor state, where it re-probes the most recent failed MTU 524 once every MONITOR_INTERVAL seconds. If the probe fails, it remains 525 in the monitor state. If it succeeds, it re-enters the search state. 527 If the network becomes too congested during either the search or the 528 monitor state it is recommended that the MSS be reduced to a smaller 529 size as determined by a heuristic. 
The recommended heuristic is to 530 reduce the MSS if ssthresh is reduced to 5 segments or fewer. The 531 recommended reduction is to the next smaller major MSS step in Table 532 1. 534 When there are repeated timeouts (MAX_TIMO or more retransmissions, 535 without any received ACKs), it is presumed that the connection was 536 re-routed onto a link with a smaller MSS, and that ICMP messages are 537 not being delivered. The MSS probing algorithm is reset by pulling 538 back the MSS to 1024 Bytes, rescaling the congestion control 539 variables and reentering the search state. 541 5.7.3. Suspend 543 If there is a timeout, and cwnd prior to the timeout was smaller than 544 6 packets, then the probe strategy can enter the suspend state and 545 set the MSS to 512 (1280) Bytes. This has the effect of reducing the 546 minimum data rate that TCP can stably manage. 548 5.8. Processing Packet Too Big messages 550 @@@ Add language re: optional processing 552 When a Packet Too Big message is received, the node determines which 553 path the message applies to based on the contents of the Packet Too 554 Big message. For example, if the destination address is used as the 555 local representation of a path, the destination address from the 556 original packet would be used to determine which path the message 557 applies to. 559 Note: if the original packet contained an IPv6 Routing header, the 560 Routing header should be used to determine the location of the 561 destination address within the original packet. If Segments Left 562 is equal to zero, the destination address is in the Destination 563 Address field in the IPv6 header. If Segments Left is greater 564 than zero, the destination address is the last address 565 (Address[n]) in the Routing header. 567 If the original packet contained an IPv4 Source Route Option ..... 
568 @@@@ write 570 The node then uses the value in the MTU field in the Packet Too Big 571 message as a tentative PMTU value, and compares the tentative PMTU to 572 the existing PMTU. If the tentative PMTU is less than the existing 573 PMTU estimate, the tentative PMTU replaces the existing PMTU as the 574 PMTU value for the path. 576 The packetization layers must be notified about decreases in the 577 PMTU. Any packetization layer instance (for example, a TCP 578 connection) that is actively using the path must be notified if the 579 PMTU estimate is decreased. 581 Note: even if the Packet Too Big message contains an Original 582 Packet Header that refers to a UDP packet, the TCP layer must be 583 notified if any of its connections use the given path. 585 Also, the instance that sent the packet that elicited the Packet Too 586 Big message should be notified that its packet has been dropped, even 587 if the PMTU estimate has not changed, so that it may retransmit the 588 dropped data. 590 Note: An implementation can avoid the use of an asynchronous 591 notification mechanism for PMTU decreases by postponing 592 notification until the next attempt to send a packet larger than 593 the PMTU estimate. In this approach, when an attempt is made to 594 SEND a packet that is larger than the PMTU estimate, the SEND 595 function should fail and return a suitable error indication. This 596 approach may be more suitable to a connectionless packetization 597 layer (such as one using UDP), which (in some implementations) may 598 be hard to "notify" from the ICMP layer. In this case, the normal 599 timeout-based retransmission mechanisms would be used to recover 600 from the dropped packets. @@@@ why "SEND"? 602 It is important to understand that the notification of the 603 packetization layer instances using the path about the change in the 604 PMTU is distinct from the notification of a specific instance that a 605 packet has been dropped. 
The latter should be done as soon as 606 practical (i.e., asynchronously from the point of view of the 607 packetization layer instance), while the former may be delayed until 608 a packetization layer instance wants to create a packet. 609 Retransmission should be done for only those packets that are known 610 to be dropped, as indicated by a Packet Too Big message. 612 5.9. Purging stale PMTU information 614 @@@ update 616 Internetwork topology is dynamic; routes change over time. While the 617 local representation of a path may remain constant, the actual 618 path(s) in use may change. Thus, PMTU information cached by a node 619 can become stale. 621 If the stale PMTU value is too large, this will be discovered almost 622 immediately once a large enough packet is sent on the path. No such 623 mechanism exists for realizing that a stale PMTU value is too small, 624 so an implementation should "age" cached values. When a PMTU value 625 has not been decreased for a while (on the order of 10 minutes), the 626 PMTU estimate should be set to the MTU of the first-hop link, and the 627 packetization layers should be notified of the change. This will 628 cause the complete Path MTU Discovery process to take place again. 630 Note: an implementation should provide a means for changing the 631 timeout duration, including setting it to "infinity". For 632 example, nodes attached to an FDDI link which is then attached to 633 the rest of the Internet via a small MTU serial line are never 634 going to discover a new non-local PMTU, so they should not have to 635 put up with dropped packets every 10 minutes. 637 An upper layer must not retransmit data in response to an increase in 638 the PMTU estimate, since this increase never comes in response to an 639 indication of a dropped packet. 641 One approach to implementing PMTU aging is to associate a timestamp 642 field with a PMTU value. 
This field is initialized to a "reserved" 643 value, indicating that the PMTU is equal to the MTU of the first hop 644 link. Whenever the PMTU is decreased in response to a Packet Too Big 645 message, the timestamp is set to the current time. 647 Once a minute, a timer-driven procedure runs through all cached PMTU 648 values, and for each PMTU whose timestamp is not "reserved" and is 649 older than the timeout interval: 651 - The PMTU estimate is set to the MTU of the first hop link. 653 - The timestamp is set to the "reserved" value. 655 - Packetization layers using this path are notified of the increase. 657 5.10. TCP layer actions 659 The TCP layer must track the PMTU for the path(s) in use by a 660 connection; it should not send segments that would result in packets 661 larger than the PMTU except to probe the path MTU. A simple 662 implementation could ask the IP layer for this value each time it 663 created a new segment, but this could be inefficient. Moreover, TCP 664 implementations that follow the "slow-start" congestion-avoidance 665 algorithm [CONG] typically calculate and cache several other values 666 derived from the PMTU. It may be simpler to receive asynchronous 667 notification when the PMTU changes, so that these variables may be 668 updated. 670 A TCP implementation must also store the MSS value received from its 671 peer, and must not send any segment larger than this MSS, regardless 672 of the PMTU. In 4.xBSD-derived implementations, this may require 673 adding an additional field to the TCP state record. 675 The value sent in the TCP MSS option is independent of the PMTU. 676 This MSS option value is used by the other end of the connection, 677 which may be using an unrelated PMTU value. See [IPv6-SPEC] sections 678 "Packet Size Issues" and "Maximum Upper-Layer Payload Size" for 679 information on selecting a value for the TCP MSS option. 
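As a rough illustration of the two independent limits just described, the following Python sketch computes the largest TCP payload a sender may place in one segment. The function name and the fixed header sizes (a 40-octet IPv6 header with no extension headers, and a 20-octet TCP header with no options) are assumptions made for this example, not requirements of this document.

```python
IPV6_HEADER = 40  # octets; assumes no extension headers
TCP_HEADER = 20   # octets; assumes no TCP options


def max_segment_payload(pmtu, peer_mss):
    """Largest TCP payload allowed by both the PMTU and the peer's MSS.

    The PMTU bounds the whole packet, so the headers are subtracted
    first; the MSS received from the peer bounds the payload directly
    and applies regardless of the PMTU.
    """
    pmtu_limit = pmtu - IPV6_HEADER - TCP_HEADER
    return min(pmtu_limit, peer_mss)
```

With these assumptions, a path with a PMTU of 1280 octets limits the payload to 1220 octets even if the peer advertised a larger MSS, while a small advertised MSS still applies on a path with a large PMTU.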
When a 680 Packet Too Big message is received, it implies that a packet was 681 dropped by the node that sent the ICMP message. It is sufficient to 682 treat this as any other dropped segment, and wait until the 683 retransmission timer expires to cause retransmission of the segment. 684 If the Path MTU Discovery process requires several steps to find the 685 PMTU of the full path, this could delay the connection by many round- 686 trip times. 688 @@@ Add IPv4 text 690 [@@@deprecate? Alternatively, the retransmission could be done in 691 immediate response to a notification that the Path MTU has changed, 692 but only for the specific connection specified by the Packet Too Big 693 message. The packet size used in the retransmission should be no 694 larger than the new PMTU. ] 696 Note: A packetization layer must not retransmit in response to 697 every Packet Too Big message, since a burst of several oversized 698 segments will give rise to several such messages and hence several 699 retransmissions of the same data. If the new estimated PMTU is 700 still wrong, the process repeats, and there is an exponential 701 growth in the number of superfluous segments sent. 703 This means that the TCP layer must be able to recognize when a 704 Packet Too Big notification actually decreases the PMTU that it 705 has already used to send a packet on the given connection, and 706 should ignore any other notifications. 708 Many TCP implementations incorporate "congestion avoidance" and 709 "slow-start" algorithms to improve performance [CONG]. Unlike a 710 retransmission caused by a TCP retransmission timeout, a 711 retransmission caused by a Packet Too Big message should not change 712 the congestion window. It should, however, trigger the slow-start 713 mechanism (i.e., only one segment should be retransmitted until 714 acknowledgments begin to arrive again). 
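The filtering rule in the note above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class, its field names, and the boolean return convention are hypothetical, and the actual retransmission machinery is left out.

```python
class Connection:
    """Hypothetical per-connection state; all names are illustrative."""

    def __init__(self, pmtu, cwnd):
        self.pmtu_used = pmtu        # PMTU this connection has used to send
        self.cwnd = cwnd             # congestion window; not modified below
        self.retransmit_one = False  # True: send one segment until ACKs resume

    def on_packet_too_big(self, reported_mtu):
        """Return True if this notification should trigger a retransmission."""
        if reported_mtu >= self.pmtu_used:
            # Does not decrease the PMTU already in use: ignore it, so a
            # burst of oversized segments causes only one retransmission.
            return False
        self.pmtu_used = reported_mtu
        # Slow-start style recovery without touching the congestion window.
        self.retransmit_one = True
        return True
```

If a burst of oversized segments elicits several Packet Too Big messages that all report the same MTU, only the first call returns True; the remaining notifications are ignored, which avoids the repeated retransmissions the note warns about.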
716 TCP performance can be reduced if the sender's maximum window size is 717 not an exact multiple of the segment size in use (this is not the 718 congestion window size, which is always a multiple of the segment 719 size). In many systems (such as those derived from 4.2BSD), the 720 segment size is often set to 1024 octets, and the maximum window size 721 (the "send space") is usually a multiple of 1024 octets, so the 722 proper relationship holds by default. If Path MTU Discovery is used, 723 however, the segment size may not be a sub-multiple of the send 724 space, and it may change during a connection; this means that the TCP 725 layer may need to change the transmission window size when Path MTU 726 Discovery changes the PMTU value. The maximum window size should be 727 set to the greatest multiple of the segment size that is less than or 728 equal to the sender's buffer space size. 730 5.11. Issues for other transport protocols 732 Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to 733 repacketize when doing a retransmission. That is, once an attempt is 734 made to transmit a segment of a certain size, the transport cannot 735 split the contents of the segment into smaller segments for 736 retransmission. In such a case, the original segment can be 737 fragmented by the IP layer during retransmission. Subsequent 738 segments, when transmitted for the first time, should be no larger 739 than allowed by the Path MTU. 741 The Sun Network File System (NFS) uses a Remote Procedure Call (RPC) 742 protocol [RPC] that, when used over UDP, in many cases will generate 743 payloads that must be fragmented even for the first-hop link. This 744 might improve performance in certain cases, but it is known to cause 745 reliability and performance problems, especially when the client and 746 server are separated by routers. 748 It is recommended that NFS implementations use Path MTU Discovery 749 whenever routers are involved. 
Most NFS implementations allow the 750 RPC datagram size to be changed at mount-time (indirectly, by 751 changing the effective file system block size), but might require 752 some modification to support changes later on. 754 Also, since a single NFS operation cannot be split across several UDP 755 datagrams, certain operations (primarily, those operating on file 756 names and directories) require a minimum payload size that, if sent in 757 a single packet, would exceed the PMTU. NFS implementations should 758 not reduce the payload size below this threshold, even if Path MTU 759 Discovery suggests a lower value. In this case the payload will be 760 fragmented by the IP layer. 762 5.12. Issues for tunnels 764 @@@ to be written 766 5.13. Diagnostic tools 768 All implementations MUST include a mechanism that supports diagnostic 769 tools that do not rely on the operating system's implementation of 770 Path MTU Discovery. This requires a mechanism by which an application 771 can send oversized packets that are not subject to the operating 772 system's notion of the current path MTU, up to the physical MTU limit 773 supported by the network interface, as well as a mechanism to 774 collect any Packet Too Big messages. 776 5.14. Management interface 778 It is suggested that an implementation provide a way for a system 779 utility program to: 781 - Specify that Path MTU Discovery not be done on a given path. 783 - Change the PMTU value associated with a given path. 785 - Set global controls on ICMP processing. 787 - Set per-connection or per-application controls on ICMP processing. 789 The first of these can be accomplished by associating a flag with the path; 790 when a packet is sent on a path with this flag set, the IP layer does 791 not send packets larger than the IPv6 minimum link MTU. 793 These features might be used to work around an anomalous situation, 794 or by a routing protocol implementation that is able to obtain Path 795 MTU values.
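One way to picture the per-path controls listed above, together with the aging procedure of Section 5.9, is the following Python sketch. It is a minimal illustration under stated assumptions: the class and function names are hypothetical, None stands in for the "reserved" timestamp value, the default timeout of 600 seconds reflects the "on the order of 10 minutes" guidance, and notification of the packetization layers is reduced to returning the affected entries.

```python
IPV6_MIN_LINK_MTU = 1280  # octets


class PathEntry:
    """Hypothetical cached PMTU state for one path."""

    def __init__(self, first_hop_mtu):
        self.first_hop_mtu = first_hop_mtu
        self.pmtu = first_hop_mtu
        self.timestamp = None       # None plays the role of "reserved"
        self.discovery_off = False  # management flag: no discovery on path

    def send_limit(self):
        """Largest packet the IP layer will send on this path."""
        if self.discovery_off:
            return min(self.pmtu, IPV6_MIN_LINK_MTU)
        return self.pmtu

    def decrease(self, new_pmtu, now):
        """Record a decrease reported by a Packet Too Big message."""
        if new_pmtu < self.pmtu:
            self.pmtu = new_pmtu
            self.timestamp = now


def age_entries(entries, now, timeout=600.0):
    """Reset stale entries; return those whose users must be notified."""
    increased = []
    for entry in entries:
        if entry.timestamp is not None and now - entry.timestamp > timeout:
            entry.pmtu = entry.first_hop_mtu
            entry.timestamp = None
            increased.append(entry)
    return increased
```

In this sketch a management utility would change the aging period by passing a different timeout; passing float("inf") disables aging entirely, since no timestamp can be older than an infinite interval.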
797 The implementation should also provide a way to change the timeout 798 period for aging stale PMTU information. 800 6. Normative references 802 [RFC1191] Path MTU discovery. J.C. Mogul, S.E. Deering. November 1990. 803 (Format: TXT=47936 bytes) (Obsoletes RFC1063) (Status: DRAFT 804 STANDARD) 806 [RFC1435] IESG Advice from Experience with Path MTU Discovery. S. 807 Knowles. March 1993. (Format: TXT=2708 bytes) (Status: 808 INFORMATIONAL) 810 [RFC1981] Path MTU Discovery for IP version 6. J. McCann, S. Deering, 811 J. Mogul. August 1996. (Format: TXT=34088 bytes) (Status: 812 PROPOSED STANDARD) 814 [RFC2923] TCP Problems with Path MTU Discovery. K. Lahey. September 815 2000. (Format: TXT=30976 bytes) (Status: INFORMATIONAL) 817 7. Informative references 819 [RFC1063] IP MTU discovery options. J.C. Mogul, C.A. Kent, C. Partridge, 820 K. McCloghrie. July 1988. (Format: TXT=27121 821 bytes) (Obsoleted by RFC1191) 823 [RFC1626] Default IP MTU for use over ATM AAL5. R. Atkinson. May 1994. 824 (Format: TXT=11841 bytes) (Obsoleted by RFC2225) (Status: 825 PROPOSED STANDARD) 827 [RFC1791] TCP And UDP Over IPX Networks With Fixed Path MTU. T. Sung. 828 April 1995. (Format: TXT=22347 bytes) (Status: EXPERIMENTAL) 830 8. Security considerations 832 Since the MTU reported in the ICMP messages is constrained to be 833 between the old MTU and the candidate MTU, this algorithm is more 834 difficult to attack through fraudulent ICMP messages. 836 Furthermore, since this algorithm can function properly without ICMP 837 messages, that part of the algorithm can be disabled for additional 838 robustness in hostile environments. 840 9. IANA considerations 842 10. Contributors 844 11. Acknowledgements 846 Matt Mathis and John Heffner are supported by a grant from Cisco Systems, 847 Inc. 849 12. Authors' addresses 851 Please send comments and suggestions to mtu@psc.edu. 853 Matt Mathis and John Heffner 854 Pittsburgh Supercomputing Center 855 4400 Fifth Ave.
856 Pittsburgh, PA 15213 857 mathis@psc.edu 858 jheffner@psc.edu 860 Kevin Lahey 861 Freelance 862 kml@patheticgeek.net 864 13. Intellectual Property 866 The IETF takes no position regarding the validity or scope of any 867 intellectual property or other rights that might be claimed to pertain 868 to the implementation or use of the technology described in this 869 document or the extent to which any license under such rights might 870 or might not be available; neither does it represent that it has made 871 any effort to identify any such rights. Information on the IETF's 872 procedures with respect to rights in standards-track and standards-related 873 documentation can be found in BCP-11. Copies of claims of 874 rights made available for publication and any assurances of licenses 875 to be made available, or the result of an attempt made to obtain a 876 general license or permission for the use of such proprietary rights 877 by implementers or users of this specification can be obtained from 878 the IETF Secretariat. 880 The IETF invites any interested party to bring to its attention any 881 copyrights, patents or patent applications, or other proprietary 882 rights which may cover technology that may be required to practice 883 this standard. Please address the information to the IETF Executive 884 Director. 886 14. Full copyright statement 888 Copyright (C) The Internet Society June 21, 2003. All Rights 889 Reserved. 891 This document and translations of it may be copied and furnished to 892 others, and derivative works that comment on or otherwise explain it 893 or assist in its implementation may be prepared, copied, published 894 and distributed, in whole or in part, without restriction of any 895 kind, provided that the above copyright notice and this paragraph are 896 included on all such copies and derivative works.
However, this 897 document itself may not be modified in any way, such as by removing 898 the copyright notice or references to the Internet Society or other 899 Internet organizations, except as needed for the purpose of developing 900 Internet standards in which case the procedures for copyrights 901 defined in the Internet Standards process must be followed, or as 902 required to translate it into languages other than English. 904 The limited permissions granted above are perpetual and will not be 905 revoked by the Internet Society or its successors or assigns.