idnits 2.17.1 draft-ietf-pmtud-method-00.txt: -(146): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(252): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(350): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(714): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(869): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. == There are 9 instances of lines with non-ascii characters in the document. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 19 longer pages, the longest (page 2) being 60 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 20 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an Authors' Addresses Section. ** There are 10 instances of too long lines in the document, the longest one being 6 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 865 has weird spacing: '...imed to per-...' 
== Line 896 has weird spacing: '...for the purpo...' == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: Links MUST not deliver packets that are larger than their true MTU. Links that have parametric limitations (e.g. MTU bounds due to limited clock stability) MUST include explicit mechanisms to consistently reject packets that might otherwise be nondeterministically delivered. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (Oct 19, 2003) is 7495 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC 2119' is mentioned on line 84, but not defined == Missing Reference: 'IPv4-SPEC' is mentioned on line 243, but not defined == Missing Reference: 'IPv6-SPEC' is mentioned on line 676, but not defined == Missing Reference: 'FRAG' is mentioned on line 344, but not defined == Missing Reference: 'ND' is mentioned on line 396, but not defined == Missing Reference: 'CONG' is mentioned on line 707, but not defined == Missing Reference: 'ISOTP' is mentioned on line 730, but not defined == Missing Reference: 'RPC' is mentioned on line 740, but not defined == Unused Reference: 'RFC1191' is defined on line 800, but no explicit reference was found in the text == Unused Reference: 'RFC1981' is defined on line 804, but no explicit reference was found in the text == Unused Reference: 'RFC2119' is defined on line 807, but no explicit reference was found in the text == Unused Reference: 'RFC1063' is defined on line 812, but no explicit reference was found in the text == Unused Reference: 'RFC1435' is defined on line 815, but no explicit reference was found in the text == Unused Reference: 'RFC1626' is defined on line 819, but no explicit reference was found in the text == Unused Reference: 'RFC1791' is defined on line 822, but no explicit reference was found in the text == Unused Reference: 'RFC2923' is defined on line 825, but no explicit reference was found in the text ** Obsolete normative reference: RFC 1063 (ref. 'RFC1191') (Obsoleted by RFC 1191) ** Obsolete normative reference: RFC 1981 (Obsoleted by RFC 8201) -- Obsolete informational reference (is this intentional?): RFC 1626 (Obsoleted by RFC 2225) Summary: 6 errors (**), 0 flaws (~~), 23 warnings (==), 3 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet-Draft Matt Mathis 3 John Heffner 4 PSC 5 Kevin Lahey 6 Freelance 7 Oct 19, 2003 9 Path MTU Discovery 10 draft-ietf-pmtud-method-00.txt 12 Status of this Memo 14 This document is an Internet-Draft and is in full conformance with 15 all provisions of Section 10 of RFC2026. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that other 19 groups may also distribute working documents as Internet-Drafts. 21 Internet-Drafts are draft documents valid for a maximum of six months 22 and may be updated, replaced, or obsoleted by other documents at any 23 time. It is inappropriate to use Internet-Drafts as reference 24 material or to cite them other than as "work in progress." 26 The list of current Internet-Drafts can be accessed at 27 http://www.ietf.org/ietf/1id-abstracts.txt 29 The list of Internet-Draft Shadow Directories can be accessed at 30 http://www.ietf.org/shadow.html. 32 Abstract 34 [@@ To be rewritten] 36 This document describes Path MTU Discovery for the Internet. It is 37 largely derived from RFC 1191 and RFC 1981, which describe ICMP based 38 Path MTU Discovery for IP versions 4 and 6, plus a robust new 39 algorithm. 41 The general strategy of the new algorithm is to start with a small 42 MTU and probe upward, testing successively larger MTUs by probing 43 with single packets. If the probe is successfully delivered, then 44 the MTU is raised. If the probe is lost, it is treated as an MTU 45 limitation and not as a congestion signal. 47 Table of Contents 49 TBD 51 1. Introduction 53 When one Internet node has a large amount of data to send to another 54 node, the data is transmitted in a series of IP packets. 
It is 55 usually preferable that these packets be of the largest size that can 56 successfully traverse the path from the source node to the 57 destination node. This packet size is referred to as the Path MTU 58 (PMTU), and it is equal to the minimum link MTU of all the links in a 59 path. 61 This document describes a path MTU discovery (PMTUD) method based on 62 the earlier methods described in the standards track documents, 63 RFC1191 and RFC1981, with the addition of a new algorithm that 64 searches for the proper MTU by probing with successively larger 65 packets. Large sections of this document are taken directly from 66 RFC1191 and RFC1981. 68 The methods described in this document apply to IPv4, IPv6, TCP, and 69 other transport protocols. This document does not define a 70 protocol, but rather a method to use features of existing protocols 71 to discover the path MTU. It does not require cooperation from the 72 lower layers (except that they are consistent about what packet sizes 73 are acceptable) or the far node. Variants in implementations will 74 not cause problems with interoperability. 76 For sake of clarity we uniformly prefer TCP and IPv6 terminology. In 77 the terminology section we also present the analogous IPv4 terms and 78 concepts for the IPv6 terminology. In a few situations we describe 79 specific details that are different between IPv4 and IPv6. 81 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 82 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 83 document are to be interpreted as described in [RFC 2119]. 85 [[This document still bears markup notes, indicated with square 86 brackets [] or @@@@ signs.]] 88 2. Terminology 90 IP - Either IPv4 [IPv4-SPEC] or IPv6 [IPv6-SPEC]. 92 node - A device that implements IP. 94 router - A node that forwards IP packets not explicitly 95 addressed to itself. 97 host - Any node that is not a router. 99 upper layer - A protocol layer immediately above IP. 
Examples are 100 transport protocols such as TCP and UDP, control 101 protocols such as ICMP, routing protocols such as OSPF, 102 and Internet or lower-layer protocols being "tunneled" 103 over (i.e., encapsulated in) IP such as IPX, 104 AppleTalk, IP itself. 106 link - A communication facility or medium over which nodes can 107 communicate at the link layer, i.e., the layer 108 immediately below IPv6. Examples are Ethernets (simple 109 or bridged); PPP links; X.25, Frame Relay, or ATM 110 networks; and Internet (or higher) layer "tunnels", 111 such as tunnels over IPv4 or IPv6 itself. 113 interface - A node's attachment to a link. 115 address - An IP-layer identifier for an interface or a set of 116 interfaces. 118 packet - An IP header plus payload. 120 MTU - Maximum Transmission Unit, the size in bytes of the 121 largest packet that can be transmitted on a link or 122 path. Note that this could more properly be called 123 the IP MTU, to be consistent with how other standards 124 organizations use the term. Beware that the definition 125 used in this and other IETF documents is not the same 126 as the definition used in other contexts. 128 link MTU - The Maximum Transmission Unit, i.e., maximum packet 129 size in octets, that can be conveyed in one piece over 130 a link. 132 path - The set of links traversed by a packet between a source 133 node and a destination node 135 path MTU - The minimum link MTU of all the links in a path between 136 a source node and a destination node. 138 PMTU - Path MTU 140 Path MTU Discovery, 141 PMTUD - Process by which a node learns the PMTU of a path 143 Packet Too Big message 144 - An ICMP message reporting that an IP packet is too 145 large to forward. This is the IPv6 term that 146 corresponds to the IPv4 "ICMP Can't fragment" message. 148 flow id - A combination of a source address and a non-zero 149 IPv6 flow label. 151 packetization protocol 152 - The layer of the network stack which segments data into 153 packets. 
155 flow - A context in which MTU discovery is applied. This is 156 naturally an instance of the packetization protocol, e.g. 157 half of a TCP connection. 159 MPS - The maximum payload size available to a flow, usually 160 over a specific path. As an example, this is the maximum 161 TCP segment size, including TCP headers but not including 162 IP headers. 164 probe packet - A packet which is being used to test for a larger MTU. 166 probe size - The size of a packet being used to probe for a larger MTU. 168 successful probe 169 - The probe packet was delivered through the network. 171 inconclusive probe 172 - The probe packet was not delivered, but there were other lost 173 packets too close to the probe. By implication the probe 174 might have been lost due to something other than MTU, so the 175 results are inconclusive. 177 failed probe 178 - The probe packet was not delivered and there were no other 179 lost packets close to the probe. 181 probe gap - The L3 payload data that will need to be retransmitted if the 182 probe is not delivered. 184 [[Deprecated terms - these terms should only appear in very specific parts of 185 the document. 187 ICMP 189 Can't fragment messages 191 lower layers 193 @@@ remove as the document matures]] 195 3. Overview 197 This document describes a technique to dynamically discover the MTU 198 of a path. These procedures are applicable to TCP and other 199 transport- or application-level packetization protocols which 200 implement similar features. 202 The general strategy of the new procedure is to find the proper MTU 203 by starting a connection using relatively small packets and then 204 probing with progressively larger packets (containing application 205 data). If a probe packet is successfully delivered, then the path 206 MTU is raised. The isolated loss of a probe packet (with or without 207 a Packet Too Big message) is treated as an indication of an MTU 208 limit, and not as a congestion indicator. 
210 PMTUD can optionally process Packet Too Big messages for faster 211 convergence in exchange for a slight decrease in robustness. 212 Processing malicious or erroneous Packet Too Big messages can cause 213 PMTU discovery to arrive at an incorrect MTU for a path, which is 214 likely to reduce protocol performance. The document describes three 215 options for processing Packet Too Big messages: completely ignore 216 them, only accept them in response to probes, or accept all Packet Too 217 Big messages (the previous approach). 219 In addition, PMTUD can be extended with heuristics to use alternate 220 criteria to select PMTU. For example, on a path that is so congested 221 that the fair share window is too small (smaller than 5 kB), TCP may 222 be better behaved with 512-byte packets than with 1500-byte packets 223 since with the larger packets the window would be too small to 224 trigger Fast Retransmit. 226 Relatively few details of this procedure affect interoperability with 227 other standards or Internet protocols. These details are specified 228 in RFC2119 standards language in the requirements section. The vast 229 majority of the implementation details are recommendations based on 230 experiences with earlier versions of path MTU discovery. These are 231 motivated by a desire to maximize robustness in the presence of less 232 than ideal implementations as they exist in the field. 234 4. Requirements 236 All Internet nodes SHOULD implement Path MTU Discovery in order to 237 discover and take advantage of the largest MTU supported along the 238 Internet path. 240 Nodes not implementing Path MTU Discovery must use a default MTU as 241 specified by the respective IP protocols. For IPv6 the default MTU 242 is 1280 bytes, the minimum link MTU as defined in [IPv6-SPEC]. For 243 IPv4 it is 576 bytes, as specified in [IPv4-SPEC]. 245 Links MUST NOT deliver packets that are larger than their true MTU. 246 Links that have parametric limitations (e.g. 
MTU bounds due to 247 limited clock stability) MUST include explicit mechanisms to 248 consistently reject packets that might otherwise be 249 nondeterministically delivered. 251 When a packet is too large to traverse a link, the attached router, 252 if any, SHOULD send a Packet Too Big message (IPv6) or ICMP "can't 253 fragment" message (IPv4 with DF set), as appropriate. 255 The requirements below only apply to those implementations that 256 include Path MTU Discovery. 258 A flow MUST NOT send a probe packet until at least one packet of its 259 full current MPS is acknowledged. This implicitly limits successful 260 probes to once per two round trips. To make the algorithm more 261 robust in the presence of multi-path routing, a flow SHOULD NOT send 262 a probe packet until at least a full window or an appropriately large 263 quantity of packets have been successfully acknowledged. 265 Before a probe can be sent, the flow MUST be able to produce a packet 266 containing a payload of at least the candidate MPS. That is, it must 267 have enough data or be able to pad the packet to the full desired 268 size. If the flow is able to send a probe with the exception of 269 having enough data to 271 Failed and inconclusive probes MUST NOT be sent more frequently than 272 the normal congestion interval for the current average window size. 274 A packetization protocol which does loss recovery MUST use a loss 275 detection mechanism which does not result in spurious retransmission 276 of any additional data when a probe packet is lost. 278 During the probe, the normal congestion control machinery should 279 remain in effect except when only the probe gap is detected as lost. 280 In this case the normal multiplicative congestion window reduction is 281 suppressed. If any other data is detected as lost, all normal 282 congestion control MUST take place. 284 If the probe is successful, the current MPS is updated to the 285 candidate MPS. 
If window and other congestion state variables are 286 kept in units of packets, they MUST be rescaled to preserve the 287 current window size in bytes. 289 5. Implementation Issues 291 This section discusses a number of issues related to the 292 implementation of Path MTU Discovery. This is not a specification, 293 but rather a set of notes provided as an aid for implementers. 295 The issues include: 297 - What layer or layers implement Path MTU Discovery? 299 - Accounting for headers 301 - How is the PMTU information cached? 303 - How are ICMP messages processed 305 - How is stale PMTU information removed? 307 - How to implement PMTUD with TCP? 309 - What should other transport and higher layers do? 311 - What should tunnels above IP do? 313 5.1. Layering 315 In the IP architecture, the choice of what size packet to send is 316 made by a protocol at a layer above IP. This memo refers to such a 317 protocol as a "packetization protocol". Packetization protocols are 318 usually transport protocols (for example, TCP) but can also be 319 higher-layer protocols (for example, protocols built on top of UDP). 321 This memo uses the concept of a "flow" to define the scope in which 322 path MTU information is used. Each flow locally stores its maximum 323 payload size (MPS), which is used for packetizing data. Flows may 324 communicate with the IP layer to store or access cached PMTU values, 325 providing a means by which similar flows may share information. To 326 do so, the flow must convert between these two values by adding or 327 subtracting the size of the IP header plus any additional 328 intermediate headers. The IP layer also stores PMTU information from 329 the ICMP layer when it receives Packet Too Big messages. 331 It is possible that a packetization layer, perhaps a UDP application 332 outside the kernel, is unable to change the size of messages it 333 sends. This may result in a packet size that exceeds the Path MTU. 
335 In such situations, the packets must be fragmented by the IP layer. 336 To accommodate this, IPv6 defines a mechanism that allows large 337 payloads to be divided into fragments, with each fragment sent in a 338 separate packet (see [IPv6-SPEC] section "Fragment Header"). It is 339 also recommended that IPv4 fragment the packets at the end system. 340 @@@ Should it also set the DF flag to mimic IPv6? @@@ 342 However, packetization layers are encouraged to avoid sending 343 messages that will require fragmentation (for the case against 344 fragmentation, see [FRAG]). 346 5.2. Accounting for headers 348 The packetization is done at or near the top of the protocol stack, 349 while the final packet size, only determined at the bottom of the stack, 350 is what determines the link's ability to transmit the packet. As 351 such, it is necessary for the lower layers to deterministically 352 accept all payloads of a uniform size, or for these layers to 353 communicate their header sizes to the upper layer prior to 354 packetization. 356 This document does not take a position on the layering boundaries of 357 IPsec, which logically sits between IP and TCP or another 358 packetization layer. IPsec can be treated either as part of IP or as 359 part of the packetization layer, as long as the accounting is 360 consistent within any given implementation. If IPsec is treated as 361 part of the IP layer, then each security association that contributes 362 a different length security header may need to be treated as a 363 separate path. If IPsec is treated as part of the packetization 364 layer, then the MPS to PMTU calculation must include the IPsec header 365 size for that flow. 367 5.3. Storing PMTU information 369 Ideally, a PMTU value should be associated with a specific path 370 traversed by packets exchanged between the source and destination 371 nodes. However, in most cases a node will not have enough 372 information to completely and accurately identify such a path. 
373 Rather, a node must associate a PMTU value with some local 374 representation of a path. It is left to the implementation to select 375 the local representation of a path. 377 In the case of a multicast destination address, copies of a packet 378 may traverse many different paths to reach many different nodes. The 379 local representation of the "path" to a multicast destination must in 380 fact represent a potentially large set of paths. 382 Minimally, an implementation could maintain a single PMTU value to be 383 used for all packets originated from the node. This PMTU value would 384 be the minimum PMTU learned across the set of all paths in use by the 385 node. This approach is likely to result in the use of smaller 386 packets than is necessary for many paths. 388 An implementation could use the destination address as the local 389 representation of a path. The PMTU value associated with a 390 destination would be the minimum PMTU learned across the set of all 391 paths in use to that destination. The set of paths in use to a 392 particular destination is expected to be small, in many cases 393 consisting of a single path. This approach will result in the use of 394 optimally sized packets on a per-destination basis. This approach 395 integrates nicely with the conceptual model of a host as described in 396 [ND]: a PMTU value could be stored with the corresponding entry in 397 the destination cache. 399 If IPv6 flows [IPv6-SPEC] are in use, an implementation could use the 400 flow id as the local representation of a path. Packets sent to a 401 particular destination but belonging to different flows may use 402 different paths, with the choice of path depending on the flow id. 403 This approach will result in the use of optimally sized packets on a 404 per-flow basis, providing finer granularity than PMTU values 405 maintained on a per-destination basis. 407 For source routed packets (i.e. 
packets containing an IPv6 Routing 408 header [IPv6-SPEC]), the source route may further qualify the local 409 representation of a path. In particular, a packet containing a type 410 0 Routing header in which all bits in the Strict/Loose Bit Map are 411 equal to 1 contains a complete path specification. An implementation 412 could use source route information in the local representation of a 413 path. 415 Note: Some paths may be further distinguished by different security 416 classifications. The details of such classifications are beyond the 417 scope of this memo. @@@ this should be in scope 419 5.4. Probing method using TCP 421 A new candidate MPS is tested by sending one "probe segment", which 422 is larger than the current MPS. We present here two possible probing 423 methods for TCP. 425 In the first method, after a probe segment has been sent (of size 426 candidate MPS), the subsequent segment(s) may be sent as though the 427 probe segment were not oversized. Thus if the probe segment is lost, 428 it will leave a gap in the sequence space that is exactly one current 429 MPS minus the TCP header size. We refer to this potential hole as 430 the probe gap. Note that the length of the probe segment is 431 determined by the candidate MPS under consideration, but the length 432 of the probe gap by the current MPS. If the probe segment is lost, 433 this gap can be filled by a single retransmitted segment. 435 This method will create duplicate acknowledgements if the probe is 436 successful. The sender must be capable of dealing with these 437 expected duplicate acknowledgements in a manner which will not cause 438 unnecessary retransmission or congestion window reduction. 440 In the second method, after a probe segment has been sent, subsequent 441 segments are sent in a non-overlapping manner. If the probe segment 442 is lost, it will leave a gap which will require retransmission of 443 multiple segments to fill. 
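The size relationships in the first method can be illustrated with a short sketch. This is not part of the method itself; the function names are hypothetical, and the header sizes assume an IPv6 fixed header and a TCP header without options.

```python
# Illustrative sketch only. Names are hypothetical; header sizes assume
# no IPv6 extension headers and no TCP options.
IPV6_HEADER = 40  # bytes, fixed IPv6 header
TCP_HEADER = 20   # bytes, TCP header without options


def mps_from_mtu(mtu, ip_header=IPV6_HEADER):
    """MPS as defined in Section 2: includes TCP headers, excludes IP headers."""
    return mtu - ip_header


def probe_gap(current_mps, tcp_header=TCP_HEADER):
    """Sequence space left unfilled if the probe segment is lost (first method)."""
    return current_mps - tcp_header


def probe_payload(candidate_mps, tcp_header=TCP_HEADER):
    """Data bytes carried by the probe segment itself."""
    return candidate_mps - tcp_header


current_mps = mps_from_mtu(1280)    # 1240
candidate_mps = mps_from_mtu(1500)  # 1460
gap = probe_gap(current_mps)        # 1220
payload = probe_payload(candidate_mps)  # 1440
overlap = payload - gap             # 220 bytes also covered by the next segment
```

With these assumed numbers, the segment following the probe overlaps the last 220 bytes of the probe's payload, which is why a successful probe produces the duplicate acknowledgements mentioned above.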
445 The probe is completed when the acknowledgment sequence advances past 446 the probe gap. If, when the probe is complete, the probe gap was not 447 retransmitted, the probe was successful. If the probe gap was 448 retransmitted and there were no other retransmissions, the candidate 449 MPS failed. If there were any other retransmissions the probe was 450 inconclusive. 452 If the probe was successful, the current MPS is updated to the 453 candidate MPS. @@@ add robustness language re: more losses 455 If the probe failed or was inconclusive the probe countdown is set to 456 COUNTDOWN_SCALE times the square of the current window size in 457 packets. 459 If a Packet Too Big message is received, it can be used to compute 460 an MPS limit by deducting the IP header size from the MTU reported in 461 the ICMP message. If the MPS limit is between the current MPS and 462 candidate MPS, the current MPS is updated from the MPS limit, 463 otherwise the message is ignored. If the current MPS is updated, 464 then the probe strategy is forced into the Monitor state described 465 below. 467 5.5. Probing method using SCTP 469 @@@@ to be written 471 5.6. General probing methods 473 @@@@ to be written 475 5.7. Probe strategy 477 The probe strategy described here is a recommended baseline 478 algorithm. It is not presented in formal standards language because 479 the probe strategy can include heuristics to help select an optimal 480 MSS for a given path. As a consequence there is opportunity for 481 future improvements to this algorithm. 483 The probing strategy has three major states: Search, Monitor and 484 Suspend. In the Search state, it sequentially searches for the 485 largest MSS that the path can support. Once the appropriate MPS has 486 been discovered, the probing algorithm enters the Monitor state where 487 it probes infrequently to detect if the path MPS has become larger. 
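The probe completion rules of Section 5.4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the text gives no value for COUNTDOWN_SCALE so it is passed in by the caller, and "between" is read here as inclusive of both bounds.

```python
# Illustrative sketch only; all names are hypothetical.

def classify_probe(gap_retransmitted, other_retransmissions):
    """Apply the three outcome rules in the order the text gives them."""
    if not gap_retransmitted:
        return "successful"
    if other_retransmissions == 0:
        return "failed"
    return "inconclusive"


def probe_countdown(cwnd_packets, countdown_scale):
    """COUNTDOWN_SCALE times the square of the window size in packets.

    The draft does not give a value for COUNTDOWN_SCALE, so the caller
    must supply one.
    """
    return countdown_scale * cwnd_packets ** 2


def mps_after_packet_too_big(current_mps, candidate_mps, reported_mtu, ip_header):
    """Return (new_mps, updated) after a Packet Too Big message.

    Assumption: the MPS limit is accepted when it lies in the closed
    interval [current_mps, candidate_mps]; otherwise it is ignored.
    """
    mps_limit = reported_mtu - ip_header
    if current_mps <= mps_limit <= candidate_mps:
        return mps_limit, True
    return current_mps, False
```

When the second function returns an updated MPS, the text above says the probe strategy moves to the Monitor state; that transition is left to the caller in this sketch.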
489 If the MPS probing persistently fails it may be desirable to suspend 490 MPS probing and heuristically select one of the common default MSSs: 491 576, 1240, or 1460 Bytes. 493 5.7.1. Search 495 The recommended search strategy is a multi-phase scan: First, a 496 coarse scan for the approximate MTU using factor of 2 steps starting 497 at 1024 Bytes until a probe fails, followed by successively finer 498 scans between the largest previously successful and unsuccessful 499 probes. The TCP should use its best knowledge of the lower layer 500 header sizes to appropriately determine the MPS from the MTUs listed 501 in the table below. 503 Table 1: Recommended MTU scanning sequence 504 (Coarse scan down column 1, fine scan across each row) 505 512, [Use only after repeated timeouts] 506 1024, 1492, 1500, 2002 507 2048 508 4096, 4352 509 8192, 9000 510 16384, 17914 511 32768 512 64512 513 ((Additional values needed)) 515 During the scan it is recommended that the MPS not be raised if cwnd 516 is too small as determined by a heuristic. The recommended heuristic 517 is that the MPS is only raised when the cwnd is larger than 20 518 segments. @@@ This may be too high. 520 5.7.2. Monitor 522 Once the scan has found an appropriate MPS, the probe strategy enters 523 the Monitor state, where it re-probes the most recent failed MTU 524 once every MONITOR_INTERVAL seconds. If the probe fails, it remains 525 in the Monitor state. If it succeeds, it re-enters the Search state. 527 If the network becomes too congested during either the Search or the 528 Monitor states, it is recommended that the MPS be reduced to a 529 smaller size as determined by a heuristic. The recommended heuristic 530 is to reduce the MSS if ssthresh is reduced to 5 segments or smaller. 531 The recommended reduction is to the next smaller coarse step in Table 532 1. 
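The scan over Table 1 and the "next smaller coarse step" reduction can be sketched as below. This is a simplified illustration, not the recommended algorithm in full: the function names and the probe_ok callback are hypothetical, the fine scan only walks the row of the last successful coarse value, and the cwnd and ssthresh heuristics are left out.

```python
# Illustrative sketch only; names are hypothetical. Values are the MTUs
# from Table 1, one list per row; the first entry of each row is the
# coarse step.
TABLE_1 = [
    [512],                     # use only after repeated timeouts
    [1024, 1492, 1500, 2002],
    [2048],
    [4096, 4352],
    [8192, 9000],
    [16384, 17914],
    [32768],
    [64512],
]


def search_mtu(probe_ok, start=1024):
    """Coarse scan down column 1, then fine scan across the last good row.

    probe_ok(mtu) is a caller-supplied function that returns True when a
    probe of that size was successful. Returns the largest MTU that
    probed successfully, or None if even the starting size failed.
    """
    best = None
    last_good_row = None
    for row in TABLE_1:
        if row[0] < start:
            continue
        if not probe_ok(row[0]):
            break
        best = row[0]
        last_good_row = row
    if last_good_row is None:
        return None
    for mtu in last_good_row[1:]:
        if not probe_ok(mtu):
            break
        best = mtu
    return best


def next_smaller_coarse_step(mtu):
    """Largest coarse value in Table 1 below mtu, or the smallest entry."""
    coarse = [row[0] for row in TABLE_1]
    smaller = [c for c in coarse if c < mtu]
    return max(smaller) if smaller else coarse[0]
```

For example, with a callback that accepts anything up to 1500 bytes, the coarse scan succeeds at 1024 and fails at 2048, and the fine scan then accepts 1492 and 1500 before failing at 2002.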
534 When there are repeated timeouts (MAX_TIMO or more retransmissions, 535 without any received ACKs), it is presumed that the connection was 536 re-routed onto a link with a smaller MSS, and that ICMP messages are 537 not being delivered. The MSS probing algorithm is reset by pulling 538 back the MSS to 1024 Bytes, rescaling the congestion control 539 variables and reentering the Search state. 541 5.7.3. Suspend 543 If there is a timeout, and cwnd prior to the timeout was smaller than 544 6 packets, then the probe strategy can enter the Suspend state and 545 set the MSS to 512 or 1240 Bytes. This has the effect of reducing 546 the minimum data rate that TCP can stably manage. 548 5.8. Processing Packet Too Big messages 550 @@@ Add language re: optional processing 551 When a Packet Too Big message is received, the node determines which 552 path the message applies to based on the contents of the Packet Too 553 Big message. For example, if the destination address is used as the 554 local representation of a path, the destination address from the 555 original packet would be used to determine which path the message 556 applies to. 558 Note: if the original packet contained an IPv6 Routing header, the 559 Routing header should be used to determine the location of the 560 destination address within the original packet. If Segments Left 561 is equal to zero, the destination address is in the Destination 562 Address field in the IPv6 header. If Segments Left is greater 563 than zero, the destination address is the last address 564 (Address[n]) in the Routing header. 566 If the original packet contained an IPv4 Source Route Option ..... 567 @@@@ write 569 The node then uses the value in the MTU field in the Packet Too Big 570 message as a tentative PMTU value, and compares the tentative PMTU to 571 the existing PMTU. If the tentative PMTU is less than the existing 572 PMTU estimate, the tentative PMTU replaces the existing PMTU as the 573 PMTU value for the path. 
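The comparison of tentative and existing PMTU can be sketched as follows, assuming the destination address is used as the local representation of a path. The cache layout and function name are hypothetical, and each cache entry is assumed to have been initialized already (for example to the first-hop link MTU).

```python
# Illustrative sketch only; the dictionary cache and function name are
# hypothetical. Keys are destination addresses, values are PMTU estimates.

def process_packet_too_big(pmtu_cache, destination, reported_mtu):
    """Lower the cached PMTU if the reported MTU is smaller.

    Returns True when the estimate decreased, in which case the
    packetization layers using this path need to be notified.
    """
    existing = pmtu_cache[destination]
    if reported_mtu < existing:
        pmtu_cache[destination] = reported_mtu
        return True
    return False


cache = {"2001:db8::1": 1500}
decreased = process_packet_too_big(cache, "2001:db8::1", 1400)
```

A message reporting an MTU equal to or larger than the current estimate leaves the cache unchanged in this sketch, so a Packet Too Big message can only lower the estimate.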
575 The packetization layers must be notified about decreases in the 576 PMTU. Any packetization layer instance (for example, a TCP 577 connection) that is actively using the path must be notified if the 578 PMTU estimate is decreased. 580 Note: even if the Packet Too Big message contains an Original 581 Packet Header that refers to a UDP packet, the TCP layer must be 582 notified if any of its connections use the given path. 584 Also, the instance that sent the packet that elicited the Packet Too 585 Big message should be notified that its packet has been dropped, even 586 if the PMTU estimate has not changed, so that it may retransmit the 587 dropped data. 589 Note: An implementation can avoid the use of an asynchronous 590 notification mechanism for PMTU decreases by postponing 591 notification until the next attempt to send a packet larger than 592 the PMTU estimate. In this approach, when an attempt is made to 593 SEND a packet that is larger than the PMTU estimate, the SEND 594 function should fail and return a suitable error indication. This 595 approach may be more suitable to a connectionless packetization 596 layer (such as one using UDP), which (in some implementations) may 597 be hard to "notify" from the ICMP layer. In this case, the normal 598 timeout-based retransmission mechanisms would be used to recover 599 from the dropped packets. @@@@ why "SEND"? 601 It is important to understand that the notification of the 602 packetization layer instances using the path about the change in the 603 PMTU is distinct from the notification of a specific instance that a 604 packet has been dropped. The latter should be done as soon as 605 practical (i.e., asynchronously from the point of view of the 606 packetization layer instance), while the former may be delayed until 607 a packetization layer instance wants to create a packet. 608 Retransmission should be done for only those packets that are known 609 to be dropped, as indicated by a Packet Too Big message. 
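The deferred notification approach from the note above, in which the send function fails instead of the ICMP layer notifying the packetization layer asynchronously, can be sketched as below. The class and function names are hypothetical and the send path is reduced to a size check.

```python
# Illustrative sketch only; PathState, PacketTooLarge and send are
# hypothetical names, not an existing API.

class PacketTooLarge(Exception):
    """Raised when a packet exceeds the current PMTU estimate."""

    def __init__(self, pmtu):
        super().__init__("packet exceeds PMTU estimate of %d bytes" % pmtu)
        self.pmtu = pmtu


class PathState:
    def __init__(self, pmtu):
        self.pmtu = pmtu  # current PMTU estimate for this path


def send(path, packet_len):
    """Fail with an error indication if the packet is larger than the PMTU.

    The caller learns about a PMTU decrease at its next oversized send,
    rather than through an asynchronous notification.
    """
    if packet_len > path.pmtu:
        raise PacketTooLarge(path.pmtu)
    return packet_len  # stand-in for handing the packet to the IP layer
```

A connectionless sender can catch the error, read the new estimate from it, and resize subsequent messages; recovery of the packet that was already dropped still relies on the normal timeout-based retransmission.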
5.9. Purging stale PMTU information

   @@@ update

   Internetwork topology is dynamic; routes change over time. While the
   local representation of a path may remain constant, the actual
   path(s) in use may change. Thus, PMTU information cached by a node
   can become stale.

   If the stale PMTU value is too large, this will be discovered almost
   immediately once a large enough packet is sent on the path. No such
   mechanism exists for realizing that a stale PMTU value is too small,
   so an implementation should "age" cached values. When a PMTU value
   has not been decreased for a while (on the order of 10 minutes), the
   PMTU estimate should be set to the MTU of the first-hop link, and the
   packetization layers should be notified of the change. This will
   cause the complete Path MTU Discovery process to take place again.

      Note: an implementation should provide a means for changing the
      timeout duration, including setting it to "infinity". For
      example, nodes attached to an FDDI link which is then attached to
      the rest of the Internet via a small MTU serial line are never
      going to discover a new non-local PMTU, so they should not have to
      put up with dropped packets every 10 minutes.

   An upper layer must not retransmit data in response to an increase in
   the PMTU estimate, since this increase never comes in response to an
   indication of a dropped packet.

   One approach to implementing PMTU aging is to associate a timestamp
   field with a PMTU value. This field is initialized to a "reserved"
   value, indicating that the PMTU is equal to the MTU of the first-hop
   link. Whenever the PMTU is decreased in response to a Packet Too Big
   message, the timestamp is set to the current time.
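   The timestamp scheme can be sketched in Python as follows. This is a
   hedged illustration rather than a definitive implementation: the
   names PmtuEntry, decrease and age are hypothetical, None stands in
   for the "reserved" value, the 600 second default is one reading of
   "on the order of 10 minutes", and the current time is passed in
   explicitly so that the example is deterministic. A periodic caller
   is assumed to invoke age() on each cached entry.

```python
import math

RESERVED = None          # "reserved" timestamp: PMTU equals the first-hop MTU
DEFAULT_TIMEOUT = 600.0  # seconds; assumed reading of "on the order of 10 minutes"


class PmtuEntry:
    def __init__(self, first_hop_mtu, timeout=DEFAULT_TIMEOUT):
        self.first_hop_mtu = first_hop_mtu
        self.pmtu = first_hop_mtu
        self.timestamp = RESERVED
        self.timeout = timeout  # math.inf disables aging ("infinity")

    def decrease(self, new_pmtu, now):
        # Called when a Packet Too Big message lowers the estimate.
        if new_pmtu < self.pmtu:
            self.pmtu = new_pmtu
            self.timestamp = now

    def age(self, now):
        """Reset a stale entry; return True if the estimate was raised."""
        if self.timestamp is RESERVED:
            return False
        if now - self.timestamp <= self.timeout:
            return False
        self.pmtu = self.first_hop_mtu
        self.timestamp = RESERVED
        return True  # the caller should notify the packetization layers


entry = PmtuEntry(1500)
entry.decrease(1280, now=0.0)
early = entry.age(now=300.0)
late = entry.age(now=601.0)

pinned = PmtuEntry(1500, timeout=math.inf)
pinned.decrease(1280, now=0.0)
never = pinned.age(now=1e9)
```

   With the timeout set to math.inf the comparison never succeeds, so
   the entry keeps its reduced value, which is the behaviour wanted for
   the FDDI example in the Note above.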
   Once a minute, a timer-driven procedure runs through all cached PMTU
   values, and for each PMTU whose timestamp is not "reserved" and is
   older than the timeout interval:

   - The PMTU estimate is set to the MTU of the first-hop link.

   - The timestamp is set to the "reserved" value.

   - Packetization layers using this path are notified of the increase.

5.10. TCP layer actions

   The TCP layer must track the PMTU for the path(s) in use by a
   connection; it should not send segments that would result in packets
   larger than the PMTU except to probe the path MTU. A simple
   implementation could ask the IP layer for this value each time it
   created a new segment, but this could be inefficient. Moreover, TCP
   implementations that follow the "slow-start" congestion-avoidance
   algorithm [CONG] typically calculate and cache several other values
   derived from the PMTU. It may be simpler to receive asynchronous
   notification when the PMTU changes, so that these variables may be
   updated.

   A TCP implementation must also store the MSS value received from its
   peer, and must not send any segment larger than this MSS, regardless
   of the PMTU. In 4.xBSD-derived implementations, this may require
   adding an additional field to the TCP state record.

   The value sent in the TCP MSS option is independent of the PMTU.
   This MSS option value is used by the other end of the connection,
   which may be using an unrelated PMTU value. See [IPv6-SPEC] sections
   "Packet Size Issues" and "Maximum Upper-Layer Payload Size" for
   information on selecting a value for the TCP MSS option. When a
   Packet Too Big message is received, it implies that a packet was
   dropped by the node that sent the ICMP message. It is sufficient to
   treat this as any other dropped segment, and wait until the
   retransmission timer expires to cause retransmission of the segment.
   If the Path MTU Discovery process requires several steps to find the
   PMTU of the full path, this could delay the connection by many
   round-trip times.

   @@@ Add IPv4 text

   [@@@deprecate? Alternatively, the retransmission could be done in
   immediate response to a notification that the Path MTU has changed,
   but only for the specific connection specified by the Packet Too Big
   message. The packet size used in the retransmission should be no
   larger than the new PMTU. ]

      Note: A packetization layer must not retransmit in response to
      every Packet Too Big message, since a burst of several oversized
      segments will give rise to several such messages and hence several
      retransmissions of the same data. If the new estimated PMTU is
      still wrong, the process repeats, and there is an exponential
      growth in the number of superfluous segments sent.

      This means that the TCP layer must be able to recognize when a
      Packet Too Big notification actually decreases the PMTU that it
      has already used to send a packet on the given connection, and
      should ignore any other notifications.

   Many TCP implementations incorporate "congestion avoidance" and
   "slow-start" algorithms to improve performance [CONG]. Unlike a
   retransmission caused by a TCP retransmission timeout, a
   retransmission caused by a Packet Too Big message should not change
   the congestion window. It should, however, trigger the slow-start
   mechanism (i.e., only one segment should be retransmitted until
   acknowledgments begin to arrive again).

   TCP performance can be reduced if the sender's maximum window size is
   not an exact multiple of the segment size in use (this is not the
   congestion window size, which is always a multiple of the segment
   size).
   In many systems (such as those derived from 4.2BSD), the
   segment size is often set to 1024 octets, and the maximum window size
   (the "send space") is usually a multiple of 1024 octets, so the
   proper relationship holds by default. If Path MTU Discovery is used,
   however, the segment size may not be a sub-multiple of the send
   space, and it may change during a connection; this means that the TCP
   layer may need to change the transmission window size when Path MTU
   Discovery changes the PMTU value. The maximum window size should be
   set to the greatest multiple of the segment size that is less than or
   equal to the sender's buffer space size.

5.11. Issues for other transport protocols

   Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
   repacketize when doing a retransmission. That is, once an attempt is
   made to transmit a segment of a certain size, the transport cannot
   split the contents of the segment into smaller segments for
   retransmission. In such a case, the original segment can be
   fragmented by the IP layer during retransmission. Subsequent
   segments, when transmitted for the first time, should be no larger
   than allowed by the Path MTU.

   The Sun Network File System (NFS) uses a Remote Procedure Call (RPC)
   protocol [RPC] that, when used over UDP, in many cases will generate
   payloads that must be fragmented even for the first-hop link. This
   might improve performance in certain cases, but it is known to cause
   reliability and performance problems, especially when the client and
   server are separated by routers.

   It is recommended that NFS implementations use Path MTU Discovery
   whenever routers are involved. Most NFS implementations allow the
   RPC datagram size to be changed at mount-time (indirectly, by
   changing the effective file system block size), but might require
   some modification to support changes later on.
   Also, since a single NFS operation cannot be split across several UDP
   datagrams, certain operations (primarily, those operating on file
   names and directories) require a minimum payload size that, if sent
   in a single packet, would exceed the PMTU. NFS implementations should
   not reduce the payload size below this threshold, even if Path MTU
   Discovery suggests a lower value. In this case the payload will be
   fragmented by the IP layer.

5.12. Issues for tunnels

   @@@ to be written

5.13. Diagnostic tools

   All implementations MUST include a mechanism to implement diagnostic
   tools that do not rely on the operating system's implementation of
   path MTU discovery. This requires a mechanism by which an
   application can send oversized packets that are not subjected to the
   operating system's notion of the current path MTU, up to the physical
   MTU limit as supported by the network interface, as well as a
   mechanism to collect any Packet Too Big messages.

5.14. Management interface

   It is suggested that an implementation provide a way for a system
   utility program to:

   - Specify that Path MTU Discovery not be done on a given path.

   - Change the PMTU value associated with a given path.

   - Apply global controls on ICMP processing.

   - Apply per-connection or per-application controls on ICMP
     processing.

   The first of these can be accomplished by associating a flag with
   the path; when a packet is sent on a path with this flag set, the IP
   layer does not send packets larger than the IPv6 minimum link MTU.

   These features might be used to work around an anomalous situation,
   or by a routing protocol implementation that is able to obtain Path
   MTU values.

   The implementation should also provide a way to change the timeout
   period for aging stale PMTU information.

6. Normative references

   [RFC1191] Path MTU discovery. J.C. Mogul, S.E. Deering. November 1990.
             (Format: TXT=47936 bytes) (Obsoletes RFC1063) (Status: DRAFT
             STANDARD)

   [RFC1981] Path MTU Discovery for IP version 6. J. McCann, S. Deering,
             J. Mogul. August 1996. (Status: PROPOSED STANDARD)

   [RFC2119] Key words for use in RFCs to Indicate Requirement Levels.
             S. Bradner. March 1997. (Status: BEST CURRENT PRACTICE)

7. Informative references

   [RFC1063] IP MTU discovery options. J.C. Mogul, C.A. Kent,
             C. Partridge, K. McCloghrie. July 1988. (Obsoleted by
             RFC1191)

   [RFC1435] IESG Advice from Experience with Path MTU Discovery.
             S. Knowles. March 1993. (Format: TXT=2708 bytes) (Status:
             INFORMATIONAL)

   [RFC1626] Default IP MTU for use over ATM AAL5. R. Atkinson. May 1994.
             (Status: PROPOSED STANDARD)

   [RFC1791] TCP And UDP Over IPX Networks With Fixed Path MTU. T. Sung.
             April 1995. (Status: EXPERIMENTAL)

   [RFC2923] TCP Problems with Path MTU Discovery. K. Lahey. September
             2000. (Status: INFORMATIONAL)

8. Security considerations

   Since the MTU reported in the ICMP messages is constrained to be
   between the old MTU and the candidate MTU, this algorithm is more
   difficult to attack through fraudulent ICMP messages.

   Furthermore, since this algorithm can function properly without ICMP
   messages, that part of the algorithm can be disabled for additional
   robustness in hostile environments.

9. IANA considerations

10. Contributors

11. Acknowledgements

   Matt Mathis and John Heffner are supported by a grant from Cisco
   Systems, Inc.

12. Authors' addresses

   Please send comments and suggestions to mtu@psc.edu.

   Matt Mathis and John Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Ave.
   Pittsburgh, PA 15213
   mathis@psc.edu
   jheffner@psc.edu

   Kevin Lahey
   Freelance
   kml@patheticgeek.net

13. Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11. Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementers or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

14. Full copyright statement

   Copyright (C) The Internet Society Oct 19, 2003. All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such as
   by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the procedures
   for copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.