Network Working Group                                          M. Mathis
Internet-Draft                                                J. Heffner
Expires: November 30, 2004                                           PSC
                                                                K. Lahey
                                                               Freelance
                                                               June 2004

                           Path MTU Discovery
                       draft-ietf-pmtud-method-02

Status of this Memo

By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on November 30, 2004.

Copyright Notice

Copyright (C) The Internet Society (2004). All Rights Reserved.

Abstract

This document describes a robust new method for Path MTU Discovery that relies on TCP or other Packetization Layer to probe an Internet path with progressively larger packets. This method is described as an extension to RFC 1191 and RFC 1981, which specify ICMP based Path MTU Discovery for IP versions 4 and 6, respectively.

This document does not define a protocol, but rather a method to use features of existing protocols to discover the path MTU.

Mathis, et al.          Expires November 30, 2004               [Page 1]

Internet-Draft             Path MTU Discovery                  June 2004

The general strategy of the new algorithm is to start with a small MTU and probe upward, testing successively larger MTUs by probing with single packets. If the probe is successfully delivered, then the MTU is raised.
If the probe is lost, it is treated as an MTU limitation and not as a congestion signal.

Table of Contents

1. Introduction
2. Terminology
3. Overview
4. Requirements
5. Implementation Issues
   5.1 Layering
       5.1.1 Accounting for Header Sizes
       5.1.2 Storing PMTU information
   5.2 Lower Layers
       5.2.1 Generating Probes
       5.2.2 Selecting the initial MTU
       5.2.3 Normal sequence of events to raise the MTU
       5.2.4 Processing MTU Indications
       5.2.5 Probing Intervals
       5.2.6 Host fragmentation
       5.2.7 Multicast
   5.3 Search Strategy
       5.3.1 Search
       5.3.2 Monitor
       5.3.3 Suspend
   5.4 Specific Packetization Layers
       5.4.1 Probing method using TCP
       5.4.2 Probing method using SCTP
       5.4.3 Probing Method for IP Fragmentation
       5.4.4 Issues for other transport protocols
   5.5 Operational Integration
       5.5.1 Interoperation with prior algorithms
       5.5.2 Interoperation over subnets with dissimilar MTUs
       5.5.3 Interoperation with tunnels
       5.5.4 Diagnostic tools
       5.5.5 Management interface
6. References
   6.1 Normative References
   6.2 Informative References
Authors' Addresses
A. Security Considerations
B. IANA considerations
C. Acknowledgements
Intellectual Property and Copyright Statements

1. Introduction

This document describes a method for Packetization Layer Path MTU Discovery (PLPMTUD), which is an extension to existing Path MTU discovery methods as described in RFC 1191 [2] and RFC 1981 [3]. The proper MTU is determined by starting with small packets and probing with successively larger packets. The bulk of the algorithm is implemented above IP, in the transport layer (e.g. TCP) or other "Packetization Protocol" that is responsible for determining packet boundaries.

This document draws heavily on RFC 1191 [2] and RFC 1981 [3] for terminology, ideas and some of the text.

The methods described in this document apply to both IPv4 and IPv6, and to many transport protocols. This document does not define a protocol, but rather a method to use features of existing protocols to discover the path MTU. It does not require cooperation from the lower layers (except that they are consistent about what packet sizes are acceptable) or the far node. Variants in implementations will not cause interoperability problems.

The methods described in this document are carefully designed to maximize robustness in the presence of less than ideal implementations of other protocols or Internet components.

For the sake of clarity we uniformly prefer TCP and IPv6 terminology. In the terminology section we also present the analogous IPv4 terms and concepts for the IPv6 terminology. In a few situations we describe specific details that are different between IPv4 and IPv6.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [4].

This draft is a product of the Path MTU Discovery (pmtud) working group of the IETF. Please send comments and suggestions to pmtud@ietf.org. Interim drafts and other useful information will be posted at http://www.psc.edu/~mathis/MTU/pmtud/index.html .

2. Terminology

IP
   Either IPv4 [1] or IPv6 [7].

node
   A device that implements IP.

router
   A node that forwards IP packets not explicitly addressed to itself.

host
   Any node that is not a router.

upper layer
   A protocol layer immediately above IP. Examples are transport protocols such as TCP and UDP, control protocols such as ICMP, routing protocols such as OSPF, and Internet or lower-layer protocols being "tunneled" over (i.e., encapsulated in) IP such as IPX, AppleTalk, or IP itself.

link
   A communication facility or medium over which nodes can communicate at the link layer, i.e., the layer immediately below IP. Examples are Ethernets (simple or bridged); PPP links; X.25, Frame Relay, or ATM networks; and Internet (or higher) layer "tunnels", such as tunnels over IPv4 or IPv6. Occasionally we use the slightly more general term "lower layer" for this concept.

interface
   A node's attachment to a link.

address
   An IP-layer identifier for an interface or a set of interfaces.

packet
   An IP header plus payload.

MTU
   Maximum Transmission Unit, the size in bytes of the largest IP packet, including the IP header and payload, that can be transmitted on a link or path.
   Note that this could more properly be called the IP MTU, to be consistent with how other standards organizations use the acronym MTU.

link MTU
   The Maximum Transmission Unit, i.e., maximum IP packet size in bytes, that can be conveyed in one piece over a link. Beware that this definition differs from the definition used by other standards organizations. For IETF documents, link MTU is uniformly defined as the IP MTU over the link. This includes the IP header, but excludes link layer headers and other framing which is not part of IP or the IP payload. Beware that other standards organizations generally define link MTU to include the link layer headers.

path
   The set of links traversed by a packet between a source node and a destination node.

PMTU, path MTU
   The minimum link MTU of all the links in a path between a source node and a destination node.

classical PMTU discovery
   The process described in RFC 1191 and RFC 1981, in which nodes rely on ICMP "Packet Too Big" messages to learn the MTU of a path.

PL, packetization layer
   The layer of the network stack which segments data into packets.

PLPMTUD
   Packetization Layer Path MTU Discovery, the method described in this document, which is an extension to classical PMTU discovery.

Packet Too Big message
   An ICMP message reporting that an IP packet is too large to forward. This is the IPv6 term that corresponds to the IPv4 "ICMP Can't fragment" message.

flow
   A context in which MTU discovery is applied. This is naturally an instance of the packetization protocol, e.g. one side of a TCP connection.

MPS
   The maximum IP payload size available over a specific path. This is typically the path MTU minus the IP header. As an example, this is the maximum TCP packet size, including TCP payload and headers but not including IP headers. This has also been called the "L3 MTU".

MSS
   The TCP Maximum Segment Size, the maximum payload size available to the TCP layer. This is typically the path MPS minus the size of the TCP header.

probe packet
   A packet which is being used to test a path for a larger MTU.

probe size
   The size of a packet being used to probe for a larger MTU.

successful probe
   The probe packet was delivered through the network and acknowledged by the Packetization Layer on the far node.

inconclusive probe
   The probe packet was not delivered, but there were other lost packets close enough to the probe that it cannot be presumed that the probe was lost because it was larger than the path MTU. By implication the probe might have been lost due to something other than MTU (such as congestion), so the results are inconclusive. Inconclusive probes are generally repeated at the same probe size, after a suitable delay.

failed probe
   The probe packet was not delivered and there were no other lost packets close to the probe. This is taken as an indication that the probe was larger than the path MTU, and future probes should generally be at smaller sizes.

errored probe
   There were losses or timeouts during the verification phase which suggest a potentially disruptive failure or network condition. These are generally retried only after substantially longer intervals.

probe gap
   The payload data that will be lost and need to be retransmitted if the probe is not delivered.

probe phase
   The interval (time or protocol events) between when a probe is sent, and when it is determined that the probe succeeded, failed or was inconclusive.

verification phase
   An additional interval during which the new path MTU is considered provisional. Packet losses or timeouts are treated as an indication that there may be a problem with the provisional MTU.

transition phase
   The interval between the probe phase and the verification phase, during which packets using the new MTU propagate to the far node and the acknowledgment propagates back.

full stop timeout
   A timeout where none of the packets transmitted after some event are acknowledged by the receiver, including any retransmissions. This is taken as an indication of some failure condition in the network, such as a routing change onto a link with a smaller MTU. For the sake of PLPMTUD we suggest the following definition of a full stop timeout: the loss of one full window of data and at least one retransmission or at least 6 consecutive packets including at least 2 retransmissions (along with two retransmission timer expirations). [@@@ This probably needs some experimentation.]

search strategy
   The heuristics used to choose successive probe sizes to converge to the proper path MTU, as described in Section 5.3.

3. Overview

This document describes a method for TCP or other packetization protocols to dynamically discover the MTU of a path without relying on explicit signals from the network. These procedures are applicable to TCP and other transport- or application-level packetization protocols in which the receiver always reports to the sender complete information about which packets were lost in the network.

The general strategy of the new procedure is for the packetization layer to find the proper MTU by probing with progressively larger packets, without disrupting its normal protocol operation. If a probe packet is successfully delivered, then the path MTU is provisionally raised. If there are no additional losses during the subsequent verification phase, then the path MTU is confirmed (verified) to be at least as large as the provisional MTU. PLPMTUD can then probe again with an even larger MTU, according to the MTU search strategy described in Section 5.3.

The verification phase is used to detect some situations where raising the MTU raises the packet loss rate. For example, if a link is striped across multiple physical channels with inconsistent MTUs, it is possible that a probe will be delivered even if it is too large for some of the physical channels. In such cases raising the path MTU to the probe size will cause severe periodic loss and abysmal performance. The verification phase is designed to prevent the path MTU from being raised if doing so causes excessive packet losses.

A conservative implementation of PLPMTUD would use a full round trip time for the verification phase. In this case each time PLPMTUD raises the MTU it takes three full round trip times to do so. It takes one round trip for the probe phase, during which the probe propagates to the far node and an acknowledgment is returned. The second round trip is the transitional phase, during which data packets using the provisional MTU propagate to the far node and are acknowledged. During the third and final round trip time, it is verified that raising the MTU does not cause excessive loss.

The isolated loss of a probe packet (with or without a Packet Too Big message) is treated as an indication of an MTU limit, and not as a congestion indicator. In this case alone, the packetization protocol is permitted to retransmit the probe gap without adjusting the congestion window.

If there is a timeout or any additional lost packets during any of the three phases, the loss is treated as a congestion indication as well as an indication of some sort of failure of the PLPMTUD process. The congestion indication is treated like any other congestion indication: window or rate adjustments are mandatory per the relevant congestion control standards [8]. Probing can resume with some new probe size after a delay which is determined by the nature of the indicated failure.

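As a non-normative illustration, the loss handling described above can be sketched as a small decision function. This is a minimal sketch in Python and not part of the method itself: the names ProbeOutcome and classify_probe, and the idea of summarizing the probe phase as three inputs, are assumptions made for this example. It also simplifies by labeling every non-isolated probe loss, including one accompanied by a timeout, as inconclusive.

```python
from enum import Enum


class ProbeOutcome(Enum):
    SUCCESS = "success"            # probe delivered and acknowledged
    FAILED = "failed"              # isolated probe loss: treated as an MTU limit
    INCONCLUSIVE = "inconclusive"  # probe lost together with other loss signals


def classify_probe(probe_delivered, other_losses, timed_out):
    """Return (outcome, reduce_cwnd) for one probe phase.

    probe_delivered, other_losses and timed_out are hypothetical inputs
    that a Packetization Layer would derive from its own loss reporting.
    """
    # Any timeout or additional lost packet is a congestion indication.
    congestion = timed_out or other_losses > 0
    if probe_delivered:
        # Other losses still require normal congestion control.
        return ProbeOutcome.SUCCESS, congestion
    if congestion:
        # The probe may have been lost to congestion rather than to MTU.
        return ProbeOutcome.INCONCLUSIVE, True
    # Isolated probe loss: the only case where the window reduction
    # may be suppressed.
    return ProbeOutcome.FAILED, False
```

Only the last branch returns reduce_cwnd as False, which mirrors the rule that an isolated probe loss is the one case where the congestion adjustment can be skipped.
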
The most likely (and least serious) PLPMTUD failure is the link experiencing legitimate congestion related losses at about the same time as the probe. In this case, it is appropriate to retry the probe (with the same probe size) as soon as the packetization layer has fully adapted to the congestion and recovered from the losses. In other cases, additional losses or timeouts indicate problems with the link or packetization layer, and that probes may be disruptive. In these situations it is desirable to use progressively longer delays depending on the severity of the failure and whether it persists.

PLPMTUD can optionally process Packet Too Big messages to select the provisional MTU for faster convergence in exchange for a slight decrease in robustness. Processing malicious or erroneous Packet Too Big messages can cause PLPMTUD to arrive at the incorrect MTU for a path, which is likely to reduce protocol performance. There are several different options for processing Packet Too Big messages: at one extreme they could be completely ignored; at the other extreme, all of them could be accepted (fully implementing classic PMTUD within PLPMTUD). We advocate a compromise, where Packet Too Big messages are only processed in conjunction with probes (described in Section 5.2.4.1), and Packetization Layer timeouts (described in Section 5.2.4.3).

Relatively few details of this procedure affect interoperability with other standards or Internet protocols. These details are specified in RFC 2119 standards language in Section 4.

Most of the difficulty in implementing PLPMTUD arises because it needs to be implemented in several different places within a single node. In general each packetization protocol needs to have its own implementation of PLPMTUD.
Furthermore, the natural mechanism to share path MTU information between concurrent or subsequent connections over the same path is a path information cache in the IP layer. The various packetization protocols need to have the means to access and update the shared cache in the IP layer. This memo describes PLPMTUD in terms of its primary subsystems without fully describing how they are assembled into a complete implementation.

Section 5 describes the separation into layers; the mechanics of probing from the point of view of the lower layers; Maximum Payload Size search heuristics; implementation in specific Packetization Layers; and operational integration issues.

The vast majority of the implementation details are recommendations based on experiences with earlier versions of path MTU discovery. These are motivated by a desire to maximize robustness of PLPMTUD in the presence of less than ideal implementations as they exist in the field.

4. Requirements

All Internet nodes SHOULD implement PLPMTUD in order to discover and take advantage of the largest MTU supported along the Internet path.

Links MUST NOT deliver packets that are larger than their MTU. Links that have parametric limitations (e.g. MTU bounds due to limited clock stability) MUST include explicit mechanisms to consistently reject packets that might otherwise be nondeterministically delivered.

All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6 functionality. All fragmentation SHOULD be done on the host, and all IPv4 packets, including fragments, SHOULD have the DF bit set such that they will not be fragmented (again) in the network. See Section 5.2.6.

The requirements below only apply to those implementations that include PLPMTUD.

If the Packetization Layer uses application data to implement PLPMTUD it MUST use a loss reporting mechanism (e.g.
TCP SACK) which avoids spurious retransmission of other data when a probe packet is lost.

A Packetization Layer using application data for probes MUST NOT send a probe unless it has sufficient following data available to send such that a lost probe will trigger Fast Retransmit or a similar data recovery algorithm.

A Packetization Layer using application data for probes SHOULD NOT send a probe packet unless the flow is expected to have at least the three round trips' worth of data needed to successfully complete the probe, transition and verification phases.

Normal congestion control algorithms MUST remain in effect under all conditions except when only an isolated probe packet is detected to be lost. In this case alone the normal congestion (window or data rate) reduction can be suppressed. If any other lost data is detected, all normal congestion control MUST take place.

When a probe is lost and normal congestion control is suppressed as permitted above, then the Packetization Layer MUST NOT probe again until at least an interval equal to the normal congestion control cycle. For TCP and TCP friendly protocols this generally means one round trip of elapsed time for each packet permitted under the current congestion window.

If PLPMTUD updates the MTU for a particular path, all Packetization Layer sessions that share the flow (path) must be notified. Whenever the MTU is raised, the congestion state variables must be rescaled so as not to raise the window size in bytes (or data rate in bytes per second). Whenever the MTU is reduced (e.g. when unconditionally processing ICMP Packet Too Big messages) the congestion state variables must be rescaled so as not to raise the window size in packets.

All implementations MUST include a mechanism to implement diagnostic tools that do not rely on the operating system's implementation of path MTU discovery.
This specifically requires the ability to send packets that are larger than the known MTU for the path, and to collect any resultant ICMP error message. See Section 5.5.4.

5. Implementation Issues

This section discusses a number of issues related to the implementation of Path MTU Discovery. This is not a specification, but rather a set of notes provided as an aid for implementers.

The issues include:

o  The separation into layers.

o  The mechanics of probing, as seen by IP and below.

o  Search strategy.

o  How to implement PLPMTUD in specific Packetization Layers.

o  How to improve operational integration and deployment.

5.1 Layering

5.1.1 Accounting for Header Sizes

Packetization Layer Path MTU Discovery is most easily implemented by splitting its functions between layers. The IP layer is in the best place to keep shared state, collect the ICMP messages, track IP header sizes and manage MTU information from the link layer interfaces. However, the procedures that PLPMTUD uses for probing, verification and scanning for the path MTU are very tightly coupled to the data recovery and congestion control state machines in the Packetization Layer. The most difficult part of implementing PLPMTUD is properly splitting the implementation between the layers.

Note that this layering is consistent with the advice in the current PMTUD specifications [2][3]. Today, many implementations of classical PMTU Discovery are already split along these same layers.

Early implementation of PLPMTUD revealed that it is critically important to have a good clean mechanism for accounting for header sizes at all layers. This is because each Packetization Layer does its calculations in its own natural data units, which are almost always a reflection of the service that the Packetization Layer provides to the application or other upper layers.
For example, TCP naturally performs all of its calculations in terms of sequence numbers and segment sizes. The size of the probe gap is the size of the data segment that was carried by the probe packet. However, the MTU size being probed, ICMP MTU, etc., are measures of full packets, which not only include the TCP data (measured in sequence space) but also include fixed TCP and IP headers, and may include IPv6 extension headers or IPv4 options, TCP options and even IPsec AH or ESP headers as well. PLPMTUD requires frequent translation between these two domains: the Packetization Layer's natural data unit and full IP packet sizes.

While there are a number of possible ways to accurately implement dual size measures, our experience has been that it is best if the IP layer and the Packetization Layer communicate across their boundary in terms of the IP Maximum Payload Size, or MPS. The MPS is the only size measure that is common to both the IP and Packetization Layers, because it exactly matches the boundary between the layers. The IP Layer is responsible for adding or deducting its own headers when translating between MTU and MPS. Likewise the Packetization Layer is responsible for adding or deducting its own headers when performing calculations in its natural data units.

This document does not take a stance on the placement of IPsec, which logically sits between IP and the Packetization Layer. As far as PLPMTUD is concerned IPsec can be treated either as part of IP or as part of the Packetization Layer, as long as the accounting is consistent within any given implementation. If IPsec is treated as part of the IP layer, then each security association to a remote node may need to be treated as a separate flow for PLPMTUD, if they have different length security headers. If IPsec is treated as part of the packetization layer, the IPsec header size has to be included in the Packetization Layer's header size calculations.

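The translation between the two size domains can be illustrated with a short worked example. The following is a minimal sketch in Python, assuming the fixed header sizes of 20 bytes for IPv4, 40 bytes for IPv6 and 20 bytes for TCP. The function names and the ip_extra and tcp_options parameters are hypothetical and only serve the illustration; IPsec overhead is not modeled.

```python
IPV4_HEADER = 20  # fixed IPv4 header, no options
IPV6_HEADER = 40  # fixed IPv6 header, no extension headers
TCP_HEADER = 20   # fixed TCP header, no options


def mtu_to_mps(mtu, ip_header=IPV6_HEADER, ip_extra=0):
    """IP layer: deduct its own headers to get the Maximum Payload Size.

    ip_extra stands for IPv4 options or IPv6 extension headers.
    """
    return mtu - ip_header - ip_extra


def mps_to_mtu(mps, ip_header=IPV6_HEADER, ip_extra=0):
    """IP layer: add its own headers back to get a full packet size."""
    return mps + ip_header + ip_extra


def mps_to_mss(mps, tcp_options=0):
    """TCP: deduct its own header and options to get the segment size."""
    return mps - TCP_HEADER - tcp_options


# IPv4 over a 1500 byte link MTU, TCP with 12 bytes of options:
mps = mtu_to_mps(1500, ip_header=IPV4_HEADER)  # 1480
mss = mps_to_mss(mps, tcp_options=12)          # 1448
```

In this arrangement only the MPS crosses the layer boundary, so neither layer needs to know the other's header sizes.
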
5.1.2 Storing PMTU information

This memo uses the concept of a "flow" to define the scope in which path MTU information is used. Each flow locally stores its maximum payload size (MPS), which is used for packetizing data. Packetization Layers may communicate with the IP layer to store or access cached MPS values, providing a means by which similar flows may share information. The IP layer also stores PMTU and derived MPS information when it receives Packet Too Big messages.

Ideally, a PMTU value should be associated with a specific path traversed by packets exchanged between the source and destination nodes. However, in most cases a node will not have enough information to completely and accurately identify such a path. Rather, a node must associate a PMTU value with some local representation of a path. It is left to the implementation to select the local representation of a path.

An implementation could use the destination address as the local representation of a path. The PMTU value associated with a destination would be the minimum PMTU learned across the set of all paths in use to that destination. The set of paths in use to a particular destination is expected to be small, in many cases consisting of a single path. This approach will result in the use of optimally sized packets on a per-destination basis. This approach integrates nicely with the conceptual model of a host as described in [ND@@@@]: a PMTU value could be stored with the corresponding entry in the destination cache. However, NAT and other forms of middle boxes may exhibit differing MTUs at a single IP address.

If IPv6 flows are in use, an implementation could use the IPv6 flow id [7][14] as the local representation of a path. Packets sent to a particular destination but belonging to different flows may use different paths, with the choice of path depending on the flow id.
This approach will result in the use of optimally sized packets on a per-flow basis, providing finer granularity than PMTU values maintained on a per-destination basis.

For source routed packets (i.e. packets containing an IPv6 Routing header, or IPv4 LSRR or SSRR options), the source route may further qualify the local representation of a path. An implementation could use source route information in the local representation of a path.

If IPsec is in use, the security association can also be used to represent a path.

5.2 Lower Layers

5.2.1 Generating Probes

A new candidate MTU is tested by sending one "probe packet", which is larger than the current MTU. In this section we present a couple of possible ways to alter packetization layers to generate probe packets. The different techniques incur different overheads in three areas: difficulty in generating the probe packet (in terms of packetization layer implementation complexity and computational overhead); possible additional network capacity consumed by the probes; and the overhead of recovering from failed probes (both network and protocol overhead).

For example, some protocols might be extended to allow padding with dummy data within their packets. This would greatly simplify the implementation because the probing can be performed without participation from the application and if the probe fails, the missing data (the "probe gap") is assured to fit within the current MTU when it is retransmitted. However, the padding does consume network capacity without carrying any useful payload.

This technique does not work for TCP, because there is not a separate length field or other mechanism to differentiate between padding and real payload data. With TCP the natural approach is to send additional payload data in an over-sized segment. There are several variants which have different tradeoffs.

In one method, after a TCP probe segment has been sent the subsequent segment(s) may be sent as though the probe segment was not over-sized. Thus if the probe segment is lost, it will leave a gap in the sequence space that is exactly the correct size to be filled by one segment at the current MTU. Since this method generates overlapping data, it will cause duplicate acknowledgments if the probe is successfully delivered. The sender must be capable of ignoring these expected duplicate acknowledgments in a manner which will not cause unnecessary retransmission or congestion window reduction.

In the second method, after a TCP probe segment has been sent, subsequent TCP segments are sent in a non-overlapping manner. If the probe segment is lost, it will leave a gap which will require retransmission of multiple segments to fill. This method has lower overhead for successful probes, but it requires more complexity in the retransmit logic to correctly retransmit the missing data (the "probe gap") with multiple segments that fit into the old MTU, while properly suppressing the congestion adjustments for this one situation and no others.

Several Packetization protocols may be best served by using an adjunct protocol for MTU probing: a separate protocol (or protocol feature) that does not carry any real application data. This greatly simplifies implementation because nothing needs to be retransmitted when the probe is lost, but it does consume network capacity without delivering any useful payload. Two important examples of this come to mind: SCTP [9], which might use its existing HEARTBEAT facility padded with dummy data to fill out the probe packet; and IP fragmentation, which is sometimes used as a Packetization layer for carrying oversized datagrams as described in Section 5.2.6.
In the case of IP fragmentation an entirely separate protocol is needed, which has to use the diagnostic interface described in Section 5.5.4.

It should be clear that nearly all packetization layers can be adapted to support PLPMTUD, possibly in more than one way.

5.2.2 Selecting the initial MTU

When the PLPMTUD process is started the initial MTU should normally be set such that the Packetization Layer can carry 1 kByte data segments. This initial MTU should be 1 kByte plus space for IP and Packetization layer headers. (See Section 5.1 on accounting for headers.) With this MTU, RFC 2414 [6] allows TCP and other transport protocols to start with an initial window of 4 packets.

We suspect, but have not confirmed, that TCP actually starts faster (and completes sooner for small packets) with 1 kByte packets rather than 1500 byte packets because the 2nd data ACK occurs one round trip earlier.

This initial MTU should also be configurable. One of the configuration options should be to set it to default to the interface's MTU, to mimic classical PMTUD behavior. (See Section 5.5.1.)

5.2.3 Normal sequence of events to raise the MTU

If the probe size is smaller than the actual path MTU and there are no other losses, the normal sequence of events to probe and raise the MTU will be:

1. The probe is sent, followed by more packets at the current MTU. By definition PLPMTUD enters the probe phase. The probe propagates through the network and the far node acknowledges it (or possibly later data, if acknowledgements are cumulative and delayed acknowledgement is in effect).

2. The acknowledgement for the probe reaches the data sender. By definition, this ends the probe phase.

3. The packetization layer provisionally raises the MTU to the probe size. PLPMTUD enters the transitional phase when it starts sending data using the provisional MTU. Note that implementations that use packet counts for congestion accounting (e.g.
keep cwnd in units of packets) must re-scale their congestion accounting such that raising the MTU does not raise the data rate (bytes/second) or the total congestion window in bytes. If the implementation packetizes the data at the application programming interface, it may transmit already queued data at the current MTU before raising the MTU. In this case this data is not part of either the probing or transition phases, because all of the packets in flight fit within the current MTU.

4. Once the first packet of the transitional phase is acknowledged, PLPMTUD enters the verification phase. In principle the verification phase can be of arbitrary duration; however, at this time we are recommending one full window of data (i.e. one full round trip time) for most Packetization Layers.

5. Once sufficient data has been delivered and acknowledged, the provisional MTU is considered verified and the path MTU is updated. PLPMTUD can then probe for an even larger MTU, as described in the search strategy in Section 5.3.

Other events described in the next section are treated as exceptions and alter or cancel some of the steps above.

5.2.4 Processing MTU Indications

The descriptions below assume that the Packetization Layer protocol has a TCP fast retransmit style mechanism to synchronously detect the loss of a probe packet and trigger retransmission, without loss of the protocol's self clock. If this fails, then some sort of retransmission timeout will serve to catch the loss. It is also assumed that there is some mechanism to detect full-stop timeouts. If any of these events (or the receipt of an ICMP Packet Too Big message) occurs during the above process to raise the MTU, then it is processed as indicated in the following sections.
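Returning briefly to step 3 of Section 5.2.3, the re-scaling requirement for implementations that keep cwnd in units of packets can be illustrated with a minimal sketch. The function name and the floor of one packet are our assumptions for illustration and are not specified by this document.

```python
def rescale_cwnd_packets(cwnd_packets: int, old_mss: int, new_mss: int) -> int:
    """Convert a packet-counted cwnd from old_mss to new_mss.

    The window in bytes must not grow when the MTU is raised, so the
    packet count is rounded down. The floor of one packet is an
    assumption of this sketch so that the sender can always transmit.
    """
    cwnd_bytes = cwnd_packets * old_mss
    return max(1, cwnd_bytes // new_mss)


# Hypothetical example: 10 packets of 1024 bytes, MSS raised to 1460 bytes.
new_cwnd = rescale_cwnd_packets(10, 1024, 1460)
```

Rounding down keeps the window in bytes at or below its previous value, which is the conservative reading of the requirement.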
5.2.4.1 Processing Packet Too Big Messages

Classical PMTU discovery specifies the generation of Packet Too Big Messages if an over-sized packet (e.g. a probe) encounters a link that has a smaller MTU. Since these messages cannot be authenticated, they introduce a number of well documented attacks against classical PMTUD [5].

With PLPMTUD these messages are not required for correct operation, and in principle can be summarily ignored at the expense of slower convergence to the proper MTU. However, we believe that a slightly better compromise is to process Packet Too Big Messages in two specific contexts: in conjunction with a PLPMTUD probe or a full-stop timeout.

Every Packet Too Big Message should be subjected to the following checks:

o If globally forbidden then discard the message.

o If forbidden by the application then discard the message.

o If this path has been tagged "bogus ICMP messages" then discard the message.

o If the reported MTU fails consistency checks then set the "bogus ICMP messages" flag for this path and discard the message. These consistency checks include:

   * unrecognized or unparseable enclosed header,

   * reported MTU is larger than the size indicated by the enclosed header or

   * larger than the current MTU, provisional MTU or probe size as appropriate,

   * or fails an ICMP consistency check specific to the Packetization Layer (e.g. the SCTP Verification-Tag mechanism [9][16]).

To ease migration, it is suggested that implementations may include global controls to suppress some or all of the consistency checks.

If the Packet Too Big Message is acceptable under all of these checks, do one of two things depending on a global configuration switch: emulate classical path MTU discovery by processing the message immediately (i.e. set the path MTU to the size indicated in the message), or save the "ICMP MTU", pending another PLPMTUD event.
In the latter case the saved ICMP MTU will only be acted upon under appropriate conditions: if there are lost probes, lost verification packets or a full stop timeout. This greatly reduces the impact of fraudulent ICMP Packet Too Big messages.

In either case if the Packetization Layer calls for specific actions in response to a Packet Too Big message, that action should be invoked only at the point when the path MTU is updated from the ICMP MTU.

5.2.4.2 Packetization Layer Detects Lost Packets

Each packetization protocol has its own mechanism to detect lost packets and request the retransmission of missing data. The primary signals used by the packetization layer are these protocol specific loss indications. The packetization layer is responsible for retransmitting the lost data and notifying PLPMTUD that there was a loss.

o If the probe itself was lost, and there were no other losses during the probe phase (the RTT between when the probe was sent and the loss detected), then it is taken as an indication that the path MTU is smaller than the probe size. In this situation alone the Packetization Layer is permitted to retransmit the missing data (the "probe gap") without adjusting its congestion window or data transmission rate. If an accepted Packet Too Big Message was received after the probe was sent, and it passes the additional checks that the ICMP MTU is greater than the current MTU and less than the probe size, then set the probe size to the ICMP MTU, and restart the probe process from step 1 in Section 5.2.3. If there was not an accepted Packet Too Big Message, then the indicated event is a "probe failure", which can be retried with a smaller probe size after a suitable delay for a probe_fail_event. See Section 5.2.5 for more complete descriptions of failure events.

o If there are losses during the probe phase and the probe was not lost, then the probe was successful.
However, since additional losses have the potential to spoil the verification phase, it is important that PLPMTUD not progress into the transition phase (step 3 above) until after the Packetization Layer has fully recovered from the losses and completed the congestion window (or rate) adjustment.

o If there are losses during the probe phase and the probe was also lost, the outcome depends on the presence of an ICMP MTU set by an acceptable Packet Too Big Message. If there was an accepted Packet Too Big Message received since the probe was sent, and it passes the additional checks that the ICMP MTU is greater than the current MTU and less than the probe size, then set the probe size to the ICMP MTU, and once the Packetization Layer completes the recovery from the losses, restart the probe process from step 1 in Section 5.2.3. If there was not an accepted Packet Too Big Message, then the probe is inconclusive because the lost probe might have been caused by congestion. The probe can be retried after a suitable delay for a probe_inconclusive_event.

o It is unlikely that losses during the transition phase are caused by PLPMTUD; however, they do potentially complicate the verification phase. Note that we are referring to losses that are followed by acknowledgement of packets that were sent at the old MTU, while the transition to the provisional MTU is still propagating through the network. The first acknowledgement of the provisional MTU (and the transition to the verification phase) is most likely going to occur during the recovery of the losses in the transition phase. It is important that the Packetization Layer retransmission machinery distinguish between losses at the old MTU (transition phase) and the provisional MTU (the verification phase, discussed next).
o Losses during the verification phase are taken as an indication that the path may have a non-uniform MTU or some other problem such that raising the MTU substantially raises the loss rate. If so, this is potentially a very serious problem, so the provisional MTU is considered to have errored and the path MTU is set back to the previously verified MTU (the previously current MTU). Packet loss during the verification phase might also be due to coincidental congestion on the path, unrelated to the probe, so it would seem to be desirable to re-probe the path. The risk is that this effectively raises the tolerated loss threshold because, even though raising the MTU seemed to cause additional loss, there is a statistical chance that repeated attempts to verify a new MTU may yield a false pass. The compromise is to re-probe once with the same probe size (after the delay for a probe_inconclusive_event), and if this also fails, then the probe may not be retried until after a suitable delay for a verification_error_event, which exponentially increases on each successive failure.

5.2.4.3 Packetization Layer Retransmission Timeout

Note that we do not make distinctions between the various methods that different Packetization Layers might use for detecting and retransmitting lost packets. It is preferable that the Packetization Layer uses a recovery mechanism similar to TCP SACK or fast retransmit (or other "synchronous" loss recovery mechanism) to detect losses and recover as quickly as possible.

Under some conditions the Packetization Layer may have to rely on retransmission timeouts or other fairly disruptive techniques to recover from losses. Since these greatly increase the cost of failed probes, it is recommended that PLPMTUD use even longer delays before re-probing. In these situations replace probe_fail_event with probe_timeout_event.
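The loss handling rules of Sections 5.2.4.2 and 5.2.4.3 can be summarized as a small decision function. The sketch below is illustrative only: the function name, its parameters and the returned strings for non-event outcomes (such as "reprobe_at_icmp_mtu") are hypothetical labels chosen here, and only the four *_event names come from this document.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    PROBE = 1
    TRANSITION = 2
    VERIFICATION = 3


def classify_loss(phase: Phase, probe_lost: bool, other_losses: bool,
                  by_timeout: bool, current_mtu: int, probe_size: int,
                  icmp_mtu: Optional[int] = None,
                  verification_failures: int = 0) -> str:
    """Map a reported loss to the PLPMTUD reaction described in the text.

    icmp_mtu is the saved "ICMP MTU" from an accepted Packet Too Big
    Message received since the probe was sent, or None.
    verification_failures counts earlier failed verifications of this
    probe size (an assumption about how the state is kept).
    """
    if phase is Phase.PROBE:
        if not probe_lost:
            # Probe delivered: succeed, but finish loss recovery first.
            return "probe_ok_after_recovery"
        if icmp_mtu is not None and current_mtu < icmp_mtu < probe_size:
            return "reprobe_at_icmp_mtu"
        if other_losses:
            # The probe may have been lost to congestion.
            return "probe_inconclusive_event"
        # Only the probe was lost: treat as an MTU limitation.
        return "probe_timeout_event" if by_timeout else "probe_fail_event"
    if phase is Phase.TRANSITION:
        # Losses at the old MTU are not attributed to PLPMTUD.
        return "recover_at_old_mtu"
    # Verification phase: the path MTU falls back to the verified MTU.
    if verification_failures == 0:
        return "probe_inconclusive_event"  # re-probe once, same size
    return "verification_error_event"
```

For example, a lone lost probe detected by fast retransmit maps to probe_fail_event, while the same loss detected by a retransmission timeout maps to probe_timeout_event.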
5.2.4.4 Packetization Layer Full Stop Timeout

Under all conditions (not just during MTU probing) a full stop timeout should be taken as an indication of some significantly disruptive event in the network, such as a router failure or a routing change to a path with a smaller MTU.

If the ICMP MTU is set, and it is less than the current MTU (or provisional MTU during the transitional phase), then the path MTU can be reduced to the ICMP MTU. This (a full stop timeout) is the only situation outside of a probe in which we recommend that the path MTU be set from the ICMP MTU. (In Section 5.5.1 we relax this recommendation to facilitate migration to PLPMTUD in exchange for slightly less protection from corrupt Packet Too Big messages.)

Note that whenever a problem with the path causes a full-stop timeout (also known as a "persistent timeout" in other documents), several different path restart/recovery algorithms may be invoked at different layers in the stack. Some device drivers may be restarted [@@], router discovery [@@], ES-IS [@@] and so forth. We recommend that in most situations the first action should be to set the path MTU down. Note that this recommendation is really beyond the scope of this document, and may require substantial additional research.

Therefore, if there is a full stop timeout and there was not an ICMP message indicating a reason (Packet Too Big, Net unreachable, etc., or the ICMP message was ignored for some reason), we suggest that the first recovery action should be to set the path MTU down to a safe minimum "restart MTU" value and to reset the PLPMTUD search state, so PLPMTUD will start over again searching for the proper MTU. The default restart_MTU should be the minimum MTU as specified by IPv4 (576) [1] or IPv6 (1280) [7] as appropriate, unless overridden by some global control (see Section 5.5.5).
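The full stop timeout handling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, the search state is left untouched when the ICMP MTU is used (the text does not say either way), and the min() guard that prevents the restart value from raising the MTU is our addition.

```python
from typing import Optional, Tuple

IPV4_MIN_MTU = 576    # default restart_MTU for IPv4, from the text
IPV6_MIN_MTU = 1280   # default restart_MTU for IPv6, from the text


def mtu_after_full_stop(effective_mtu: int, icmp_mtu: Optional[int],
                        ipv6: bool,
                        restart_mtu: Optional[int] = None) -> Tuple[int, bool]:
    """Return (new_path_mtu, reset_search_state) after a full stop timeout.

    effective_mtu is the current MTU, or the provisional MTU during the
    transitional phase. icmp_mtu is the saved ICMP MTU, or None.
    restart_mtu models the global override of the default restart MTU.
    """
    if icmp_mtu is not None and icmp_mtu < effective_mtu:
        # A saved, accepted ICMP MTU explains the stall: use it.
        return icmp_mtu, False
    if restart_mtu is None:
        restart_mtu = IPV6_MIN_MTU if ipv6 else IPV4_MIN_MTU
    # Never raise the MTU here; the guard is an assumption of this sketch.
    return min(restart_mtu, effective_mtu), True
```

With no usable ICMP MTU, an IPv4 path at 1500 bytes would fall back to 576 bytes and restart the search.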
If and only if the full stop timeout happens during the probe or transition phases (e.g. after sending data using the provisional MTU but before any of it is acknowledged) is it considered likely that raising the MTU caused the full stop timeout. If so, this situation is likely to be cyclic, because resetting the PLPMTUD search state is likely to eventually cause re-probing of the same problematic MTU.

It is tempting to define additional states to detect recurrent full stop timeouts. However, in today's hostile network environment, there is little tolerance for nodes that are so fragile that they can be disrupted by something as simple as oversized packets. Therefore we do not feel that it is worth the overhead of specifying a state machine that is capable of automatically detecting these situations and disabling PLPMTUD.

However, it is important that there be a manual way to disable or limit probing on specific paths. See Section 5.5.5.

5.2.5 Probing Intervals

Section 5.2.4.2 describes a number of probe failure events. In all cases the basic response is the same: to wait some time interval (dependent on the specific event and possibly the history) and then to probe again. For events that are "inconclusive", it is generally appropriate to re-probe with the same probe size. For events that are identified as "failed probes" it is generally appropriate to re-probe with a smaller probe size. The search strategy described in Section 5.3 is used to select probe sizes.

Many of the intervals below are specified in terms of elapsed round trips relative to the current congestion window. This is because TCP and other Packetization Layer protocols tend to exhibit periodic losses which cause periodic variations of the congestion window and possibly the data rate.
It is preferable that the PLPMTUD probes are scheduled near the low point of these cycles to minimize ambiguities caused by congestion losses.

In order from least to most serious:

probe_inconclusive_event
   Other lost packets near the lost probe made the probe result ambiguous. Since the loss of non-probe packets requires a window (or data rate) reduction, it is desirable to schedule the re-probe (at the same probe size) at one round trip time after the end of the loss recovery. This will be almost the minimum congestion window size, with a small cushion to minimize the chances that correlated losses caused by some other bursty connection spoil another probe.

probe_fail_event
   A probe fail event is the one situation under which the Packetization layer is permitted not to treat loss as a congestion signal. Because there is some small risk that suppressing congestion control might have unanticipated consequences (even for one isolated loss), we require that probe fail events be less frequent than the normal period for losses under standard congestion control. Specifically, after a probe fail event and suppressed congestion control, PLPMTUD may not probe again until an interval has elapsed which is comparable to the expected interval between congestion control events. See Section 4. The simplest estimate of the interval to the next congestion event is the same number of round trips as the current window in packets.

probe_timeout_event
   Since this event was detected by a timeout, it is relatively disruptive to protocol operation. Furthermore, since the event indirectly includes a window adjustment that may have been caused by the MTU probe, it is important that the probe not be repeated until congestion control has more than recovered from the loss. Therefore we recommend five times the probe_fail_event interval, i.e. five times as many round trips as the current congestion window in packets.
verification_error_event
   A verification fail event indicates that a probe was delivered and the verification phase failed twice, separated by a congestion adjustment (so the second verification phase was at a low point in the congestion control cycle). This is an indication that one of the following three things might have happened: repeated losses unrelated to PLPMTUD; the path is striped across links with dissimilar MTUs; or the link layer has some parametric limitation such that raising the MTU greatly increases the random error rate. The optimal method of responding to this situation is an open research question. We believe that the correct response is some combination of exponentially lengthening backoffs (e.g. starting at 1 minute and quadrupling on each repeat) and implicitly treating the situation as a probe fail (and choosing a smaller probe size) after some threshold number of repeated verification_error_events.

5.2.6 Host fragmentation

Packetization layers are encouraged to avoid sending messages that will require fragmentation (for the case against fragmentation, see [17][18]). However, this is not always possible. Some packetization layers, such as a UDP application outside the kernel, may be unable to change the size of the messages they send. This may result in packet sizes that exceed the Path MTU.

IPv4 permitted such applications to send packets without DF set. Oversized packets without DF would be fragmented in the network or sending host when they encountered a link with a small MTU. In some cases, packets could be fragmented more than once if there were cascaded links with progressively smaller MTUs. This approach is no longer recommended.

We now recommend that IPv4 implementations use a strategy that mimics IPv6 functionality.
When an application sends datagrams that are larger than the known path MTU, they should be fragmented to the path MTU in the host IP layer even if they are smaller than the link MTU of the first hop network directly attached to the host. The DF bit should be set on the fragments, so they will not be fragmented again in the network. This technique will minimize future surprises as the Internet migrates to IPv6. Otherwise there is the potential for widely deployed applications or services to rely on IPv4 fragmentation in a way that cannot be implemented in IPv6. At least one major operating system already uses this strategy.

Note that in principle the IP fragmentation layer is an example of a Packetization Layer, so it could implement full PLPMTUD in the fragmentation process.

5.2.7 Multicast

In the case of a multicast destination address, copies of a packet may traverse many different paths to reach many different nodes. The local representation of the "path" to a multicast destination must in fact represent a potentially large set of paths.

Minimally, an implementation could maintain a single MPS value to be used for all packets originated from the node. This MPS value would be the minimum MPS learned across the set of all paths in use by the node. This approach is likely to result in the use of smaller packets than is necessary for many paths.

Alternatively, if the application using multicast gets complete delivery reports (unlikely because this requirement has poor scaling properties), PLPMTUD could be implemented in multicast protocols.

5.3 Search Strategy

The search strategy described here is only a guide for implementors. A standard algorithm is not specified because the strategy can include many heuristics to optimize MPS selection for a given path. In particular, it may be appropriate for different protocols to follow different strategies.
There is opportunity for future improvements to this algorithm.

The search strategy uses three variables:

SEARCH_MAX
   is the largest MPS that a flow might be able to use. It is determined by such considerations as interface MTU, widths of protocol length fields, and possibly other protocol-dependent values, such as the TCP MSS option. In many cases it would be the same as the classical MTU discovery initial MSS, minus the IP layer headers.

SEARCH_LOW
   is the largest validated MPS, and should be used as the effective MPS by the packetization layer. It is the same as the current validated MTU minus the IP layer headers. The initial value for SEARCH_LOW should be a parameter, but a value of 1024 may be a reasonable default.

SEARCH_HIGH
   is the least invalidated MPS. In most cases it will be the most recent failed probe size minus the IP layer headers. When PLPMTUD is initialized, SEARCH_HIGH should be set to SEARCH_MAX.

There are three major states: Search, Monitor and Suspend. In the Search state, the algorithm incrementally searches for the largest MPS that the path can support, narrowing the difference between SEARCH_LOW and SEARCH_HIGH. Once this gap is sufficiently narrow, the probing algorithm enters the Monitor state, where it probes infrequently to detect if the path MPS has become larger. If MPS probing is determined to be harmful, perhaps by persistent probe failures, the flow may enter the Suspend state, completely disabling MPS probing.

5.3.1 Search

In the Search state, the strategy follows a multi-phase scan. If SEARCH_HIGH >= SEARCH_MAX, a coarse scan is used. In this mode, each probe's payload size should be MIN(2 * SEARCH_LOW, SEARCH_MAX).

If SEARCH_HIGH < SEARCH_MAX, the fine scan mode should be used. The fine scan algorithm may pursue a number of different methods for choosing probe sizes.
It may be useful to choose probe sizes so that the final IP packet will fit common link MTUs, for example 1500, 4352, 9000, 17914. Optionally, probes smaller than these values by common tunnel header sizes may be used.

When using some protocols, the cost for a failed probe may be significantly higher than the cost of a successful probe due to retransmission and consequent delay jitter as seen by the application. For this reason, one possible approach to the fine scan could be to use probes of size SEARCH_LOW + d, for some increment d. The search should enter the Monitor state when SEARCH_LOW + d >= SEARCH_HIGH. This will result in at most one additional probe failure.

Another approach may be to use a simple binary search where each probe size is (SEARCH_LOW + SEARCH_HIGH) / 2, entering the Monitor state when SEARCH_LOW + s >= SEARCH_HIGH for some threshold s. This will converge quickly, but may have a higher number of probe failures. It is more appropriate for a protocol whose probes consist entirely of padding.

5.3.2 Monitor

In the Monitor state, a probe of size SEARCH_HIGH should be sent at most once every MONITOR_INTERVAL seconds. If the probe succeeds, then SEARCH_HIGH should be set to SEARCH_MAX, and the state should be set to Search.

If there is evidence that no flow traffic is reaching its destination, such as repeated timeouts with no acknowledgements in TCP, it may be that the connection was re-routed to a path with a smaller MTU, and the Packet Too Big messages are ignored or filtered. In this case, SEARCH_LOW and SEARCH_HIGH should be set to initial values, and the Search state should be entered.

5.3.3 Suspend

In the Suspend state, probing is entirely disabled, and the MPS should be set to 512 bytes. The Suspend state should only be used if it is heuristically determined that probing is causing harmful failures.
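As an illustration of the Search state of Section 5.3.1, the sketch below combines the coarse scan with the binary search variant of the fine scan. It is one possible reading, not a specified algorithm. The class name, the threshold value of 8 and the seen_failure flag are assumptions of this sketch. The flag replaces the literal SEARCH_HIGH < SEARCH_MAX test so that a failed probe at exactly SEARCH_MAX still moves the search into the fine scan.

```python
from dataclasses import dataclass, field


@dataclass
class SearchState:
    search_max: int                 # SEARCH_MAX: largest MPS worth trying
    search_low: int = 1024          # SEARCH_LOW: largest validated MPS
    threshold: int = 8              # "s" in the text; the value is assumed
    search_high: int = field(init=False, default=0)
    seen_failure: bool = field(init=False, default=False)
    state: str = field(init=False, default="search")

    def __post_init__(self) -> None:
        # SEARCH_HIGH starts at SEARCH_MAX when PLPMTUD is initialized.
        self.search_high = self.search_max

    def next_probe(self) -> int:
        if not self.seen_failure:
            # Coarse scan: double the validated size, capped at the max.
            return min(2 * self.search_low, self.search_max)
        # Fine scan: binary search between validated and failed sizes.
        return (self.search_low + self.search_high) // 2

    def on_probe_result(self, size: int, delivered: bool) -> None:
        if delivered:
            self.search_low = size
        else:
            self.search_high = size
            self.seen_failure = True
        if self.search_low + self.threshold >= self.search_high:
            self.state = "monitor"
```

A caller would alternate next_probe() and on_probe_result() until the state becomes "monitor", then fall back to the infrequent probing of Section 5.3.2.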
5.4 Specific Packetization Layers

In this section we discuss specific implementation issues for different Packetization Layer protocols.

5.4.1 Probing method using TCP

TCP has no mechanism that could be used to distinguish between real application data and some other form of padding that might be used to fill out probe packets. Therefore, TCP must generate probes by sending oversized segments that are carrying real data from upper layers.

As previously mentioned, there are two approaches that TCP might use to minimize the overheads associated with the probing process.

A TCP implementation of PLPMTUD can elect to send subsequent segments overlapping the probe as though the probe segment was not oversized. This has the advantage that TCP only needs to retransmit one segment at the current MTU to recover from failed probes. However, the duplicate data in the probe does consume network resources and will cause duplicate acknowledgments. It is important that these extra duplicate acknowledgments not trigger Fast Retransmit. This can be guaranteed by limiting the largest probe segment size to twice the current segment size (causing at most 1 duplicate acknowledgment) or three times the current segment size (causing at most 2 duplicate acknowledgments).

The other approach is to send non-overlapping segments following the probe. Although this is cleaner from a protocol architecture standpoint, it clashes with many of the optimizations used to improve the efficiency of data motion within many operating systems. In particular, many implementations divide the data into segments and pre-compute checksums as the data is copied out of user space. In these implementations it can be very expensive to adjust segment boundaries after the data is already queued.
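The probe size limit for the overlapping approach described earlier in this section follows from simple arithmetic, sketched below. The function names are hypothetical. The sketch assumes the receiver returns one duplicate acknowledgment for every following segment that lies entirely within the data already delivered by the probe, and that Fast Retransmit triggers on the third duplicate acknowledgment.

```python
FAST_RETRANSMIT_DUPACKS = 3  # assumed standard TCP threshold


def expected_dup_acks(probe_payload: int, segment_size: int) -> int:
    """Duplicate ACKs caused by a delivered probe with overlapping data.

    Segments after the probe are sent as if the probe carried only
    segment_size bytes, so every following segment that ends at or
    before the end of the probe is a pure duplicate.
    """
    return max(0, probe_payload // segment_size - 1)


def max_overlapping_probe(segment_size: int, max_dup_acks: int = 2) -> int:
    """Largest probe payload that causes at most max_dup_acks duplicates."""
    return (max_dup_acks + 1) * segment_size


# A probe of twice the segment size causes one duplicate ACK; three
# times the segment size causes two, still below the threshold.
limit = max_overlapping_probe(1024)
```

Under these assumptions a probe of exactly three segments stays one duplicate acknowledgment short of triggering Fast Retransmit.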
If TCP is using SACK or any other variable length headers, the headers on the probe and verification packets should be padded to the maximum possible length. Otherwise, future options may cause delivery problems if they cause IP packets that are larger than the MTU.

Note that the header size and overhead calculations described in Section 5.1 apply here. TCP's natural data accounting units are sequence space and Maximum Segment Size. However, the PLPMTUD process is described in terms of total packet size, which is larger than the MSS by all fixed and optional headers.

At the point when TCP is ready to start the verification phase, it is permitted to transmit already queued data at the old MTU rather than re-packetize it. This postpones the verification process by the time required to send the queued data.

If the verification phase experiences any segment losses, TCP is required to pull back to the prior MSS. Since failing the verification phase should be an infrequent error condition, it is less important that this be as efficient as probing.

5.4.1.1 Window management

Some TCP implementations keep the congestion window in units of segments. When segment size is increased during a connection, a conservative implementation should scale cwnd so that, in units of bytes, it will remain unchanged.

It is recommended that TCP should not probe a new MPS if that MPS will likely result in a cwnd of less than 5 segments.

If the network becomes too congested, it is recommended that the MPS be reduced to a smaller size as determined by a heuristic. The recommended heuristic is to reduce the MPS by half if ssthresh is reduced to 5 segments or smaller, with a minimum MPS of 512 bytes.

5.4.2 Probing method using SCTP

In the SCTP protocol packetization is the responsibility of the application or protocol above SCTP.
The application writes a set message to SCTP and SCTP will "chunkify" it into appropriate sized pieces. Some implementations MAY bundle multiple data chunks together, but this is NOT required implementation behavior.

By implication not all SCTP implementations can easily generate probes by sending additional application data. In particular any implementation that does not implement data chunk bundling would not be able to implement a probe.

For SCTP the recommended method for generating probes is to pad SCTP HEARTBEAT messages to the desired probe size. A successful probe will be acknowledged without delay by the peer SCTP implementation returning the same HEARTBEAT as a HEARTBEAT-ACK. This assures that both directions will support the probed MTU size. [@@@@@ note that both sides of the path are tested]

The verification phase is entered after a successful probe. For implementations that can bundle multiple DATA chunks, the verification phase completes when a window's worth of bundled DATA chunks are exchanged at the new MTU value. An SCTP implementation SHOULD arrange its fragmentation point to be a suitable fraction of the new MTU size (e.g. if the MTU size is 1500 bytes in IPv4 then a fragmentation point of 718 bytes might be selected during the verification phase. This would allow the two bundled DATA chunks to be put together to exactly equal the proposed new PMTU. After verification is complete the fragmentation point can then be set to the actual PMTU, assuming that this new value is the smallest MTU of all of the SCTP paths). An SCTP implementation is allowed to transmit already fragmented DATA chunks that were previously queued and that cannot be bundled together at the new MTU value.

For implementations that do not allow DATA chunk bundling, three subsequent HEARTBEAT messages should be sent over the next XX@@ RTT's, padded to the new proposed MTU value. If all of the HEARTBEATs are successful then the new PMTU should be adopted for the path.
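The 718 byte fragmentation point in the example above can be reproduced with the following arithmetic. This is a sketch under our own assumptions about header sizes: a 20 byte IPv4 header without options, the 12 byte SCTP common header and a 16 byte DATA chunk header. The function name and the pad_to_word option are hypothetical. The option shows the slightly smaller figure that results if each chunk is also kept on a 4 byte boundary, which the example in the text appears not to account for.

```python
IPV4_HEADER = 20          # bytes, assuming no IP options
SCTP_COMMON_HEADER = 12   # source port, destination port, tag, checksum
DATA_CHUNK_HEADER = 16    # type, flags, length, TSN, stream, seq, PPID


def fragmentation_point(mtu: int, chunks: int = 2,
                        ip_header: int = IPV4_HEADER,
                        pad_to_word: bool = False) -> int:
    """User data bytes per DATA chunk so that `chunks` chunks fill one packet."""
    room = (mtu - ip_header - SCTP_COMMON_HEADER) // chunks
    if pad_to_word:
        # Keep each chunk (header plus data) a multiple of 4 bytes.
        room -= room % 4
    return room - DATA_CHUNK_HEADER


# (1500 - 20 - 12) / 2 - 16 = 718, matching the example in the text.
point = fragmentation_point(1500)
```

With two chunks of 718 data bytes each, the packet is 20 + 12 + 2 * (16 + 718) = 1500 bytes.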
[@@@@NOTE: it might be simpler to always use multiple HB's to prove in a PMTU during verification, I leave this up to you. One thing to keep in mind is that SCTP normally fragments its messages to the SMALLEST PMTU of all paths... since SCTP is multi-homed this makes it so any data chunk can fit on ANY path. Most implementations DO bundle data chunks for this very reason... it's easy to do and it allows larger PMTU's on different paths to be utilized. So using the HB may be more efficient... it's definitely simpler... I leave it to you to choose. We may also want to mention the ICMP issue with SCTP since a validated ICMP message with SCTP can always be trusted].

The SCTP Verification-Tag is designed to increase SCTP's robustness in the presence of a number of attacks, including forged ICMP messages. It relies on a 32 bit Verification Tag which is initialized to a random value during connection establishment and placed in the first 64 bits of all SCTP messages. All subsequent messages (including ICMP messages, which copy at least the first 64 bits of the message) must match the original Verification Tag, or they are rejected as being likely attacks against the connection [9][16].

It is believed that the Verification Tag mechanism is strong enough that SCTP could unconditionally process Packet Too Big messages that would reduce the path MTU at arbitrary times. As written, this document does not encourage this method. The PLPMTUD ICMP validity checks are cascaded with the SCTP checks, such that the messages are processed only if they meet all consistency checks. In particular, PLPMTUD only uses the ICMP MTU value following a probe, during MTU verification, or following a full stop timeout. To change this an implementation would have to suppress some of the checks in Section 5.2.4.1 for SCTP.
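A minimal sketch of the Verification Tag check applied to a Packet Too Big Message is shown below. It assumes that the ICMP message encloses at least the first 8 bytes of the SCTP packet that we sent, laid out as source port (16 bits), destination port (16 bits) and Verification Tag (32 bits). The function and parameter names are hypothetical, and a real implementation would first have to locate the association from the enclosed IP header.

```python
import struct


def icmp_matches_association(enclosed_sctp: bytes, local_port: int,
                             peer_port: int, peer_vtag: int) -> bool:
    """Check the start of the SCTP header quoted inside an ICMP message.

    enclosed_sctp is the quoted SCTP header of the packet we sent, so
    its source port is our local port and its tag is the one we place
    in every packet sent to the peer.
    """
    if len(enclosed_sctp) < 8:
        return False  # too short to contain the Verification Tag
    src_port, dst_port, vtag = struct.unpack("!HHI", enclosed_sctp[:8])
    return (src_port, dst_port, vtag) == (local_port, peer_port, peer_vtag)
```

A message that fails this check would be discarded under the Packetization Layer specific consistency check of Section 5.2.4.1.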
5.4.3 Probing Method for IP Fragmentation

As mentioned in Section 5.2.6, datagram protocols (such as UDP) can rely on IP fragmentation as a packetization layer. Since the IP layer does not have any way to determine if the fragments were delivered, it cannot do the probing directly. The probing has to be done with an adjunct protocol that uses the diagnostic API (Section 5.5.4) to send oversized probes, and some other API to update the MPS stored in the IP layer.

5.4.4 Issues for other transport protocols

Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to repacketize when doing a retransmission. That is, once an attempt is made to transmit a segment of a certain size, the transport cannot split the contents of the segment into smaller segments for retransmission. In such a case, the original segment can be fragmented by the IP layer during retransmission. Subsequent segments, when transmitted for the first time, should be no larger than allowed by the Path MTU.

5.5 Operational Integration

5.5.1 Interoperation with prior algorithms

Properly functioning Path MTU discovery is critical to the robust and efficient operation of the Internet. Any major change (as described in this document) has the potential to be very disruptive if it contains any errors or oversights. Therefore, we offer a deployment strategy in which classical PMTUD operation as described in RFC 1191 and RFC 1981 is unmodified and PLPMTUD is only invoked following a full stop timeout, presumably due to an "ICMP black hole". To do this:

o Relax the ICMP checks in Section 5.2.4.1 specifically to allow an ICMP Packet Too Big message to reduce the MTU at arbitrary times.
   o  When there is no cached MTU, use the Interface MTU as specified by classical PMTU discovery, rather than the initial MTU as specified in Section 5.2.2.

   o  MTU searching as described in Section 5.3 is disabled entirely or starts in the monitor state.

   o  A full stop timeout is processed as described in Section 5.2.4.4. This becomes the only mechanism to invoke the rest of PLPMTUD.

   When configured in this manner, PLPMTUD will increase the robustness of classical PMTU discovery in the presence of ICMP black holes and other ICMP problems, with minimal exposure to unanticipated problems during deployment. Since this configuration does not help robustness in the presence of malicious or erroneous ICMP messages, it is not recommended for the long term.

5.5.2 Interoperation over subnets with dissimilar MTUs

   With classical PMTUD, the ingress router to a subnet is responsible for knowing what size packets can be delivered to every node attached to that subnet. For most subnet types, this requires that the entire subnet has a single MTU which is common to every attached node. (For a few subnet types (e.g. ATM [12]), the nodes on a subnet can negotiate the MTU on a pairwise basis, and the ingress router is responsible for knowing the MTU to each of its peers.)

   This requirement has proven to be a major impediment to deploying larger MTUs in the operational Internet. Often a single node which does not support a larger MTU effectively vetoes raising the MTU on a subnet, because the ingress router does not have a mechanism to generate the proper Packet Too Big Message for the one attached node with a smaller MTU.

   With PLPMTUD, this requirement is completely relaxed. As long as oversized packets addressed to the nodes with the smaller MTU are reliably discarded, PLPMTUD will find the proper MTU for these nodes.
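   The claim above can be illustrated with a small simulation in which two nodes on the same subnet accept different maximum packet sizes and silently discard anything larger, without any Packet Too Big Message. The names make_node and discover_mtu, the example MTU values, and the use of a plain binary search are assumptions made for illustration; the search strategy actually specified by this document is the one in Section 5.3.

```python
def make_node(node_mtu):
    """Model a node that reliably discards oversized packets.

    Returns a function reporting whether a probe of the given size was
    delivered. No ICMP message is produced on failure, so the sender
    learns only from delivery or loss of the probe.
    """
    def delivers(packet_size):
        return packet_size <= node_mtu
    return delivers


def discover_mtu(delivers, known_good, upper_limit):
    """Find the largest deliverable size in [known_good, upper_limit].

    `known_good` is assumed to be a size that is already known to be
    deliverable. Binary search is used here only as a simple stand-in
    for a real search strategy.
    """
    low, high = known_good, upper_limit
    while low < high:
        probe = (low + high + 1) // 2  # round up so the loop terminates
        if delivers(probe):
            low = probe       # probe delivered: raise the MTU
        else:
            high = probe - 1  # probe lost: treat as an MTU limitation
    return low


# Two nodes on one subnet with dissimilar MTUs (illustrative values).
nodes = {"legacy-host": make_node(1500), "jumbo-host": make_node(9000)}
found = {name: discover_mtu(d, 1280, 9000) for name, d in nodes.items()}
```

   Each destination converges on its own MTU even though the ingress router in this model never reports anything, which is the property that lets a single small-MTU node coexist with larger-MTU neighbors.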
5.5.3 Interoperation with tunnels

   PLPMTUD is specifically designed to solve many of the problems that people are experiencing today due to poor interactions between classical MTU discovery, IPsec, and various sorts of tunnels [5]. As long as the tunnel reliably discards packets that are too large, PLPMTUD will discover an appropriate MTU for the path.

   Unfortunately, due to the pervasive problems with classical PMTU discovery, many manufacturers of various types of VPN/tunneling equipment have resorted to ignoring the DF bit. This not only violates the IP standard and many recommendations to the contrary [17][18], it also violates the only requirement that PLPMTUD places on the link layer: that oversized packets are reliably discarded. It is imperative that people understand the impact of ignoring the DF bit on both applications and PLPMTUD.

   We do understand the reality of the situation. It is important that vendors who are building devices that violate the DF specification understand that PLPMTUD requires that probe packets be discarded, and that sending ICMP Packet Too Big messages alone is insufficient to prevent wholesale fragmentation if the probe packets are delivered. Therefore, it is imperative that devices that do not honor DF include packet size history caches and other heuristics to robustly detect and discard probe packets, if delivering them would require fragmentation.

5.5.4 Diagnostic tools

   All implementations MUST include facilities for MTU discovery diagnostic tools that implement PLPMTUD or other MTU discovery algorithms in user mode without help or interference by the PMTUD algorithm present in the operating system. This requires a mechanism by which a diagnostic application can send packets that are larger than the operating system's notion of the current path MTU and collect any resulting Packet Too Big Messages or other ICMP messages.
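   As a hedged sketch of the collecting side of such a tool, the following Python fragment extracts the Next-Hop MTU from an ICMPv4 Destination Unreachable message with code 4 (fragmentation needed and DF set), using the message layout given in RFC 1191 [2]. The function names and the classification labels are hypothetical, and obtaining the raw ICMP bytes from the operating system is deliberately left out because that interface is system specific.

```python
import struct

ICMP_DEST_UNREACHABLE = 3  # ICMPv4 type
ICMP_FRAG_NEEDED = 4       # code: fragmentation needed and DF set


def parse_frag_needed(icmp):
    """Return the Next-Hop MTU field, or None for any other message.

    `icmp` is assumed to start at the ICMP header: type (8 bits),
    code (8 bits), checksum (16 bits), unused (16 bits),
    Next-Hop MTU (16 bits), followed by the quoted original packet.
    """
    if len(icmp) < 8:
        return None
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp[:8]
    )
    if icmp_type != ICMP_DEST_UNREACHABLE or code != ICMP_FRAG_NEEDED:
        return None
    return next_hop_mtu


def classify_report(next_hop_mtu, probe_size):
    """Label a reported MTU relative to the probe that triggered it."""
    if next_hop_mtu == 0:
        # Routers that predate RFC 1191 leave the field zero.
        return "no-mtu-reported"
    if next_hop_mtu >= probe_size:
        # The router claims the rejected probe should have fit.
        return "inconsistent"
    return "plausible"
```

   A tool built along these lines could log every "inconsistent" or "no-mtu-reported" result together with the reporting router's address, which is the kind of evidence needed to diagnose misbehaving routers.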
   For IPv4, the diagnostic application must be able to set the DF bit.

   At this time nearly all operating systems support two modes for sending UDP datagrams: one which silently fragments packets that are too large, and another that rejects packets that are too large. Neither of these modes is suitable for efficiently diagnosing problems with MTU discovery, such as routers that return Packet Too Big messages containing incorrect size information.

5.5.5 Management interface

   It is suggested that an implementation provide a way for a system utility program to:

   o  Globally disable all ICMP Packet Too Large message processing.

   o  Globally suppress some or all ICMP consistency checks described in Section 5.2.4.1. Setting this option forgoes some possible security improvements, in exchange for making PLPMTUD behave more like classical PMTU discovery. (See Section 5.5.1)

   o  Globally permit ICMP Packet Too Large messages to unconditionally reduce the MTU, even if there were no lost packets. Setting this option forgoes some possible security improvements, in exchange for making PLPMTUD behave more like classical PMTU discovery. (See Section 5.5.1)

   o  Globally adjust timer intervals for specific classes of probe failures.

   In addition, it is important that there be a mechanism to permit per path controls to override specific parts of the PLPMTUD algorithm. All of these per path controls can be preset from similar global controls.

   o  Disable MTU searching on a given path, such that new MTU values are never probed.

   o  Set the initial MTU for a given path. This could be used to speed convergence in relatively static environments. There should be an option to cause PLPMTUD to choose the same initial value as would be chosen by classical PMTU discovery, i.e. typically the Interface MTU.
      This is used in the mode described in Section 5.5.1 where PLPMTUD is used only for black hole detection in classical PMTU discovery.

   o  Limit the maximum probed MTU for a given path. This permits a manual configuration to work around a link that spuriously delivers packets that are larger than the useful path MTU.

   o  Per path and per application controls to disable ICMP processing, to further limit possible damage from malicious Packet Too Big messages (in addition to the global controls).

6. References

6.1 Normative References

   [1]   Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

   [2]   Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.

   [3]   McCann, J., Deering, S. and J. Mogul, "Path MTU Discovery for IP version 6", RFC 1981, August 1996.

   [4]   Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

   [5]   Kent, S. and R. Atkinson, "Security Architecture for the Internet Protocol", RFC 2401, November 1998.

   [6]   Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's Initial Window", RFC 2414, September 1998.

   [7]   Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

   [8]   Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914, September 2000.

   [9]   Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

6.2 Informative References

   [10]  Mogul, J., Kent, C., Partridge, C. and K. McCloghrie, "IP MTU discovery options", RFC 1063, July 1988.

   [11]  Knowles, S., "IESG Advice from Experience with Path MTU Discovery", RFC 1435, March 1993.

   [12]  Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626, May 1994.

   [13]  Sung, T., "TCP And UDP Over IPX Networks With Fixed Path MTU", RFC 1791, April 1995.
   [14]  Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809, June 1995.

   [15]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923, September 2000.

   [16]  Stewart, R., "Stream Control Transmission Protocol (SCTP) Implementors Guide", draft-ietf-tsvwg-sctpimpguide-10 (work in progress), December 2003.

   [17]  Kent, C. and J. Mogul, "Fragmentation considered harmful", Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

   [18]  Mathis, M., Heffner, J. and B. Chandler, "Fragmentation Considered Very Harmful", draft-mathis-frag-harmful-00 (work in progress), July 2004.

Authors' Addresses

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA 15213
   US

   Phone: 412-268-3319
   EMail: mathis@psc.edu

   John W. Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA 15213
   US

   Phone: 412-268-2329
   EMail: jheffner@psc.edu

   Kevin Lahey
   Freelance

   EMail: kml@patheticgeek.net

Appendix A. Security Considerations

   Under all conditions the PLPMTUD procedure described in this document is at least as secure as the current standard path MTU discovery procedures described in RFC 1191 [2] and RFC 1981 [3].

   In the recommended configuration, PLPMTUD is significantly harder to attack than current procedures, because ICMP messages are cached and only processed in connection with lost packets. This effectively prevents blind attacks on the path MTU discovery system.

   Furthermore, since this algorithm is designed for robust operation without any ICMP (or other messages from the network), it can be configured to ignore all ICMP messages (globally or on a per application basis). In this configuration it cannot be attacked, unless the attacker can identify and selectively cause probe packets to be lost.

Appendix B. IANA considerations

   None.

Appendix C. Acknowledgements

   Most of the SCTP text was contributed by Randall Stewart.

   Matt Mathis and John Heffner are supported in this work by a grant from Cisco Systems, Inc.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the IETF's procedures with respect to rights in IETF Documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2004).
   This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the Internet Society.