< draft-van-beijnum-multi-mtu-02.txt   draft-van-beijnum-multi-mtu-03.txt >
Network Working Group I. van Beijnum Network Working Group I. van Beijnum
Internet-Draft IMDEA Networks Internet-Draft IMDEA Networks
Intended status: Experimental February 24, 2008 Intended status: Experimental July 12, 2010
Expires: August 27, 2008 Expires: January 13, 2011
Extensions for Multi-MTU Subnets Extensions for Multi-MTU Subnets
draft-van-beijnum-multi-mtu-02 draft-van-beijnum-multi-mtu-03
Abstract
In the early days of the internet, many different link types with
many different maximum packet sizes were in use. For point-to-point
or point-to-multipoint links, there are still some other link types
(PPP, ATM, Packet over SONET), but multipoint subnets are now almost
exclusively implemented as ethernets. Even though the relevant
standards mandate a 1500 byte maximum packet size for ethernet, more
and more ethernet equipment is capable of handling packets bigger
than 1500 bytes. However, since this capability isn't standardized,
it is seldom used today, despite the potential performance benefits
of using larger packets. This document specifies mechanisms to
negotiate per-neighbor maximum packet sizes so that nodes on a
multipoint subnet may use the maximum mutually supported packet size
between them without being limited by nodes with smaller maximum
sizes on the same subnet.
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any This Internet-Draft is submitted in full conformance with the
applicable patent or other IPR claims of which he or she is aware provisions of BCP 78 and BCP 79.
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF). Note that other groups may also distribute
other groups may also distribute working documents as Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at This Internet-Draft will expire on January 13, 2011.
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on August 27, 2008.
Copyright Notice Copyright Notice
Copyright (C) The IETF Trust (2008). Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
Abstract
In the early days of the internet, many different link types with This document is subject to BCP 78 and the IETF Trust's Legal
many different maximum packet sizes were in use. For point-to-point Provisions Relating to IETF Documents
or point-to-multipoint links, there are still some other link types (http://trustee.ietf.org/license-info) in effect on the date of
(PPP, ATM, Packet over SONET), but shared subnets are now almost publication of this document. Please review these documents
exclusively implemented as ethernets. Even though the relevant carefully, as they describe your rights and restrictions with respect
standards mandate a 1500 byte maximum packet size for ethernet, more to this document. Code Components extracted from this document must
and more ethernet equipment is capable of handling packets bigger include Simplified BSD License text as described in Section 4.e of
than 1500 bytes. However, since this capability isn't standardized, the Trust Legal Provisions and are provided without warranty as
it's seldom used today, despite the potential performance benefits of described in the Simplified BSD License.
using larger packets. This document specifies a mechanism for
advertising a non-standard maximum packet size on a subnet. It also
specifies optional mechanisms to negotiate per-neighbor maximum
packet sizes so that nodes on a shared subnet may use the maximum
mutually supported packet size between them without being limited by
nodes with smaller maximum sizes on the same subnet.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Notational Conventions . . . . . . . . . . . . . . . . . . . . 4
3. Disadvantages of larger packets . . . . . . . . . . . . . . . 5 3. Protocol messages and options . . . . . . . . . . . . . . . . 4
3.1. Delay and jitter . . . . . . . . . . . . . . . . . . . . . 5 3.1. The ND/ARP NODEMTU option . . . . . . . . . . . . . . . . 4
3.2. Path MTU Discovery problems . . . . . . . . . . . . . . . 6 3.2. The IPv6 ND padding option . . . . . . . . . . . . . . . . 5
3.3. Packet loss through bit errors . . . . . . . . . . . . . . 6 3.3. IPv4 ethernet jumbo ARP message . . . . . . . . . . . . . 7
3.4. Undetected bit errors . . . . . . . . . . . . . . . . . . 7 3.4. Changes to the RA MTU option semantics . . . . . . . . . . 7
3.5. IEEE 802.3 compatibility . . . . . . . . . . . . . . . . . 8 4. Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 8 4.1. Managing neighbor MTUs . . . . . . . . . . . . . . . . . . 8
4. The protocol mechanisms . . . . . . . . . . . . . . . . . . . 9 4.2. Host-to-host keepalives . . . . . . . . . . . . . . . . . 9
4.1. The multi-MTU router advertisement option . . . . . . . . 9 4.3. Router-to-router keepalives . . . . . . . . . . . . . . . 10
4.2. General operation . . . . . . . . . . . . . . . . . . . . 10 4.4. Host-to-router keepalives . . . . . . . . . . . . . . . . 10
4.3. Determining the InterfaceMTU . . . . . . . . . . . . . . . 10 4.5. Router-to-host keepalives . . . . . . . . . . . . . . . . 10
4.4. Changes to the RA MTU option semantics . . . . . . . . . . 11 4.6. Determining the MTU . . . . . . . . . . . . . . . . . . . 11
4.5. The IPv6 neighbor discovery MTU option . . . . . . . . . . 11 4.7. Probe considerations . . . . . . . . . . . . . . . . . . . 11
4.6. The IPv6 neighbor discovery padding option . . . . . . . . 12 4.8. Neighbor MTU garbage collection . . . . . . . . . . . . . 11
4.7. Use of the MTU and padding options . . . . . . . . . . . . 13 5. The TCP MSS option . . . . . . . . . . . . . . . . . . . . . . 11
4.8. IPv4 ethernet jumbo ARP message . . . . . . . . . . . . . 14 6. IANA considerations . . . . . . . . . . . . . . . . . . . . . 12
4.9. Probe considerations . . . . . . . . . . . . . . . . . . . 14 7. Security considerations . . . . . . . . . . . . . . . . . . . 12
4.10. Neighbor MTU garbage collection . . . . . . . . . . . . . 15 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12
5. IANA considerations . . . . . . . . . . . . . . . . . . . . . 15 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6. Security considerations . . . . . . . . . . . . . . . . . . . 15 9.1. Normative References . . . . . . . . . . . . . . . . . . . 12
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16 9.2. Informative References . . . . . . . . . . . . . . . . . . 13
7.1. Normative References . . . . . . . . . . . . . . . . . . . 16 Appendix A. Document and discussion information . . . . . . . . . 13
7.2. Informative References . . . . . . . . . . . . . . . . . . 16 Appendix B. About of larger packets . . . . . . . . . . . . . . . 13
Appendix A. Document and discussion information . . . . . . . . . 16 B.1. Delay and jitter . . . . . . . . . . . . . . . . . . . . . 13
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 16 B.2. Path MTU Discovery problems . . . . . . . . . . . . . . . 14
Intellectual Property and Copyright Statements . . . . . . . . . . 17 B.3. Packet loss through bit errors . . . . . . . . . . . . . . 15
B.4. Undetected bit errors . . . . . . . . . . . . . . . . . . 15
B.5. Interaction TCP congestion control . . . . . . . . . . . . 16
B.6. IEEE 802.3 compatibility . . . . . . . . . . . . . . . . . 16
B.7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 17
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 17
1. Introduction 1. Introduction
Some protocols inherently generate small packets. Examples are VoIP, Some protocols inherently generate small packets. Examples are VoIP,
where it's necessary to send packets frequently before much data can where it's necessary to send packets frequently before much data can
be gathered to fill up the packet, and the DNS, where the queries are be gathered to fill up the packet, and the DNS, where the queries are
inherently small and the returned results also rarely fill up a full inherently small and the returned results also rarely fill up a full
1500-byte packet. However, most data that is transferred across the 1500-byte packet. However, most data that is transferred across the
internet and private networks is at least several kilobytes in size internet and private networks is part of long-lived sessions and
(often much larger) and requires segmentation by TCP or another requires segmentation by a transport protocol, which is almost always
transport protocol. These types of data transfer can benefit from TCP. These types of data transfers can benefit from larger packets
larger packets in several ways: in several ways:
1. A higher data-to-header ratio makes for fewer overhead bytes 1. A higher data-to-header ratio makes for fewer overhead bytes
2. Fewer packets means fewer per-packet operations on the source and 2. Fewer packets means fewer per-packet operations on the source and
destination hosts destination hosts
3. Fewer packets also means fewer per-packet operations in routers 3. Fewer packets also means fewer per-packet operations in routers
and middleboxes and middleboxes
4. TCP performance increases with larger packet sizes 4. TCP performance increases with larger packet sizes
Even though today, the capability to use larger packets (often called Even though today, the capability to use larger packets (often called
jumbo frames) is present in a lot of ethernet hardware, this jumboframes) is present in a lot of ethernet hardware, this
capability isn't used because IP assumes a common MTU size for all capability typically isn't used because IP assumes a common MTU size
nodes connected to a link or subnet. In practice, this means that for all nodes connected to a link or subnet. In practice, this means
using a larger MTU requires manual configuration of the non-standard that using a larger MTU requires manual configuration of the non-
MTU size on all hosts and routers and possibly on switches. Also, standard MTU size on all hosts and routers and possibly on switches
the MTU size for a subnet is limited to that of the least capable connected to a subnet. Also, the MTU size for a subnet is limited to
router, host or switch. that of the least capable router, host or switch.
In the future, when hosts support [RFC4821] in all relevant transport In the future, when hosts support packetization layer path MTU
protocols, it will be possible to simply ignore MTU limitations by discovery ([RFC4821], "Packetization Layer Path MTU Discovery") in
sending at the maximum locally supported size and determining the all relevant transport protocols, it will be possible to simply
maximum packet size towards a correspondent from acknowledgements ignore MTU limitations by sending at the maximum locally supported
that come back for packets of different sizes. However, [RFC4821] size and determining the maximum packet size towards a correspondent
must be implemented in every transport protocol, and there is a from acknowledgements that come back for packets of different sizes.
significant probability for failures if hosts implementing [RFC4821] However, [RFC4821] must be implemented in every transport protocol,
interact with hosts that don't implement this mechanism but do use a and problems arise in the case where hosts implementing [RFC4821]
interact with hosts that don't implement this mechanism, but do use a
larger than standard MTU. larger than standard MTU.
This document provides for a set of mechanisms that allow the use of This document provides for a set of mechanisms that allow the use of
larger packets between nodes that support them which interacts well larger packets between nodes that support them which interacts well
with both manually configured non-standard MTUs and expected future with both manually configured non-standard MTUs and expected future
[RFC4821] operation with larger MTUs. This is done using several new [RFC4821] operation with larger MTUs. This is done using several new
options and messages: options and messages for both IPv6 and IPv4:
1. An additional router advertisement Multi-MTU option to limit 1. A neighbor discovery option that allows nodes to inform their
higher maximum packet sizes neighbors of the maximum packet sizes they are prepared to
receive
2. A neighbor discovery option that allows nodes to inform their 2. An extension to the ARP packet format that allows nodes to inform
neighbors of the maximum packet size they support their neighbors of the maximum packet sizes they are prepared to
receive
3. A neighbor discovery option for padding messages to make them 3. A probe/verification message that allows nodes to determine
suitable for probing a neighbor's MTU and link-layer MTU whether jumboframes can be received successfully by the next hop
limitations
4. Padding for ARP messages to make them suitable for probing a Appendix B discusses several potential issues with larger packets,
neighbor's MTU and link-layer MTU limitations such as head-of-line blocking delays, path MTU discovery black holes
and the strength of the CRC32 with increasing packet sizes.
Only support of the Multi-MTU option is required to conform to 2. Notational Conventions
this specification, the neighbor discovery options and jumbo ARP are
optional.
2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
InterfaceMTU: The maximum packet size considered usable on an Note that this specification is not standards track, and as such,
interface, based on the physical MTU, the MTU and SPEED advertised can't overrule existing specifications. Whenever [RFC2119] language
by routers and administrative settings. is used, this must be interpreted within the context of this
specification: while the specification as a whole is optional and
non-standard, whenever it is implemented, such an implementation can
only function properly when all MUSTs are observed.
MTU: Maximum Transmission Unit. This is the maximum IP packet size 3. Protocol messages and options
in bytes supported on a link, towards a neighbor (or towards a
remote correspondent). In some cases, the term MRU (Maximum
Receive Unit) would be more appropriate, but for consistency, the
term MTU is used throughout this document.
NeighborMTU: The maximum packet size that may be used towards a 3.1. The ND/ARP NODEMTU option
given on-link neighbor.
Node: A host or router running IPv4 and/or IPv6. All MTU values are 32-bit unsigned integers in network byte order.
All other values are also unsigned and in network byte order.
Throughout this document, the term "MTU" is used to denote the maximum
packet size that can be sent or received. The term "MRU" (maximum
receive unit) is not used. The "standard MTU" or "standard maximum
size" refers to the MTU size specified in the IP-over-... or IPv6-
over-... document for the link used, which would be 1500 for
ethernet.
Oversized packet: A packet exceeding the Standard MTU size. The MTU size and two flags are exchanged as an IPv6 neighbor
discovery option. The new option, as well as the MTU value it
advertises, are named "NODEMTU". For IPv4 operation, the NODEMTU
option is appended to ARP messages, with optional padding between the
ARP message and the MTU option. Upon reception of ARP messages, the
receiving node checks whether the ARP message is 8 or more bytes
longer than a standard ARP message. If so, the NODEMTU option is
ignored if the Type and Length fields contain values other than the
ones listed below, or if the MTU is smaller than the standard value
for the link type.
PhysicalMTU: The MTU reported by the driver for an interface when 1 2 3
operating at a given link speed. 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length |R|L| Reserved |A|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NODEMTU |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Probe: An ARP or neighbor solicitation packet of a specific Type: TBD
(oversized) size sent for the purpose of determining whether a
neighbor can successfully receive packets of this size sent by the
local node.
SafeMTU: Maximum packet size that is supported by all nodes and all Length: 1
link layer devices on a link.
StandardMTU: For IPv4: the MTU for a link type defined in the R (router): Set to 1 if the node is a router, set to 0 if the node
relevant IP-over-... RFC. For IPv6: the minimum of the MTU for a is a host or routing functionality is currently disabled.
link type defined in the relevant IPv6-over-... RFC and the value
of the MTU option in router advertisements.
3. Disadvantages of larger packets L (large packet detect): Set to 1 if the node is capable of
determining the largest size of packets recently received from a
link address, set to 0 if the node requires explicit probe
messages.
Reserved: Set to 0 on transmission, MUST be ignored on reception.
A (acknowledgment): Set to 1 if the node received a packet larger
than the interface MTU from the node this packet is addressed to
in the last 10 seconds.
NODEMTU: The maximum packet size the node wishes to receive on this
interface at this time.
When a node's interface speed changes, it MAY advertise adjusted per-
neighbor MTUs, but it SHOULD remain prepared to receive packets of
the maximum size indicated to neighbors previously (if this maximum
size is larger than the newly adjusted one).
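The option layout above can be sketched in code. This is a minimal illustration, not part of the specification: the Type value is still TBD in the draft, so NODEMTU_TYPE below is a hypothetical placeholder, and the function names are invented for this example. The bit positions follow the diagram (R first, L second, A last in the 16-bit flags field).

```python
import struct

NODEMTU_TYPE = 253  # hypothetical placeholder; the draft lists Type as TBD
NODEMTU_LEN = 1     # option length in units of 8 bytes, as given in the draft

FLAG_R = 0x8000  # router flag: first bit of the 16-bit flags field
FLAG_L = 0x4000  # large packet detect flag: second bit
FLAG_A = 0x0001  # acknowledgment flag: last bit


def pack_nodemtu(mtu, router=False, detect=False, ack=False):
    """Build the 8-byte option; all fields are unsigned, network byte order."""
    flags = 0
    if router:
        flags |= FLAG_R
    if detect:
        flags |= FLAG_L
    if ack:
        flags |= FLAG_A
    return struct.pack("!BBHI", NODEMTU_TYPE, NODEMTU_LEN, flags, mtu)


def parse_nodemtu(data, standard_mtu=1500):
    """Return the decoded option, or None if it has to be ignored."""
    if len(data) != 8:
        return None
    opt_type, length, flags, mtu = struct.unpack("!BBHI", data)
    # Ignore unexpected Type/Length values and MTUs below the standard size.
    if opt_type != NODEMTU_TYPE or length != NODEMTU_LEN or mtu < standard_mtu:
        return None
    return {
        "mtu": mtu,
        "router": bool(flags & FLAG_R),
        "detect": bool(flags & FLAG_L),
        "ack": bool(flags & FLAG_A),
    }
```

For IPv4, the same 8 bytes would be the trailing bytes of the ARP message; reserved bits are simply not inspected on reception.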
3.2. The IPv6 ND padding option
The format of the neighbor discovery padding option is as follows:
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Padding |
~ ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type: TBD
Length: see below.
Reserved: set to 0 on transmission, ignored on reception.
Padding: 0 or more all-zero bytes.
There are two possible ways to determine the value of the length
field in the padding option:
1. Set it to 0. Since the option is in fact larger than 0, this
means that nodes that don't implement the option will silently
discard the packet. Setting the length to 0 makes it possible to
have packets with the padding option that aren't a multiple of 8
bytes long. Since there is now no way to determine where the
next option begins, if the length is set to 0, the padding option
MUST be the last option.
2. If the intended packet length allows a valid value for the length
field, the length field MAY be set to that value. The node MAY
reduce the size of the intended packet to accommodate the
requirement that the option size is a multiple of 8 bytes. For example,
if the intended packet size is 4470 bytes with 40 and 24 bytes
for the IPv6 and neighbor solicitation headers, respectively, the
padding option would have to be 4406 bytes long, which can't be
expressed in the length field. The node may choose to use a
packet size of 4464 instead, which results in a length field
value of 550. This of course means that subsequent data packets
MUST be no larger than 4464 bytes.
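The arithmetic in the second case can be written out as a small helper. This is only a sketch: the function name is invented here, and the default of 64 header bytes is the 40 + 24 bytes of IPv6 and neighbor solicitation headers used in the example above.

```python
def padded_probe_size(intended_size, header_bytes=64):
    """Round a probe down so the padding option is a multiple of 8 bytes.

    Returns (packet_size, length_field), where length_field counts
    units of 8 bytes.
    """
    option_bytes = intended_size - header_bytes
    if option_bytes < 8:
        raise ValueError("no room for a padding option")
    length_field = option_bytes // 8          # whole 8-byte units only
    packet_size = header_bytes + length_field * 8
    return packet_size, length_field
```

With an intended size of 4470 this yields a 4464-byte packet and a length field of 550, matching the worked example.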
Nodes that support probing MUST support reception of both types of
probes, but MAY be limited to generating only one type.
Since presumably, some equipment may react badly to a large number of
out-of-spec packets, it's important that nodes limit the number of
oversized packets to destinations that aren't yet known to be capable
of receiving them. An upper limit would be to allow only 5
unacknowledged oversized packets per 300 second period.
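One way to enforce such a limit is a small per-destination budget, sketched below. The class and method names are hypothetical, and treating an acknowledgment as clearing all outstanding packets is an assumption of this sketch rather than something the draft specifies.

```python
class OversizedProbeBudget:
    """Allow at most `limit` unacknowledged oversized packets per window."""

    def __init__(self, limit=5, window=300.0):
        self.limit = limit
        self.window = window   # seconds
        self.sent = []         # send times of unacknowledged oversized packets

    def try_send(self, now):
        """Return True and record the send if the budget allows it."""
        # Forget sends that have aged out of the window.
        self.sent = [t for t in self.sent if now - t < self.window]
        if len(self.sent) >= self.limit:
            return False
        self.sent.append(now)
        return True

    def acknowledged(self):
        """The destination proved it can receive oversized packets."""
        self.sent.clear()
```

A node would keep one such object per destination that is not yet known to accept oversized packets.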
3.3. IPv4 ethernet jumbo ARP message
Due to lack of neighbor discovery, with IPv4, it's necessary to use
ARP to probe for non-standard MTU capabilities. This is done by
simply probing with an ARP packet padded to the desired size. If a
reply comes back, the neighbor supports the probed MTU size. A
NODEMTU option MAY be present in the last 8 bytes of the
jumbo ARP message. Nodes MUST take care to include either a valid
NODEMTU option or bytes that can't be mistaken for a NODEMTU option.
3.4. Changes to the RA MTU option semantics
There may be an MTU option in IPv6 router advertisements. When this
option is present, hosts MUST use the value in this option only as a
replacement for the standard link MTU size. So for multicast packets
and packets sent to nodes for which there is no known NODEMTU, the
value in the MTU option is used as the maximum packet size. But if a
NODEMTU is known for a node on the link, the NODEMTU is used, NOT the
value in the RA MTU option.
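The resulting choice of maximum packet size can be sketched as follows. The function and parameter names are invented for this illustration; taking the minimum of the local and remote values when a NODEMTU is known follows the description of basic operation in this document.

```python
def send_mtu(local_mtu, neighbor_nodemtu=None, ra_mtu=None,
             standard_mtu=1500, multicast=False):
    """Pick the maximum packet size for one outgoing packet."""
    if multicast or neighbor_nodemtu is None:
        # The RA MTU option only replaces the standard link MTU.
        return ra_mtu if ra_mtu is not None else standard_mtu
    # A known NODEMTU takes precedence over the RA MTU option.
    return min(local_mtu, neighbor_nodemtu)
```

For example, a host with a local MTU of 9000 that learned a NODEMTU of 4470 from a neighbor sends up to 4470 bytes to that neighbor, even if routers advertise an RA MTU of 1400.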
4. Operation
Basic operation is as follows: nodes advertise their interface MTU in
a neighbor discovery option or in ARP messages. So for communication
between two nodes implementing this specification, each knows what
the maximum packet size is that the other node supports, so the
minimum of the local and the remote MTU is used when sending packets.
Unfortunately, there is the complication that layer 2 devices
(switches/bridges) may have a smaller maximum packet size, so packets
larger than the standard maximum size may be lost. In order to avoid
this issue, this memo specifies a "trust, but verify" approach:
whenever packets larger than the standard size are supported between
two nodes, each periodically verifies that the other is still capable
of receiving packets of the negotiated size. It does this by sending
a probe message of the maximum negotiated size, followed by a
verification message. If the verification fails, the node sends a
new neighbor advertisement with a reduced MTU size.
A number of optimizations reduce the amount of signaling traffic
where possible. Most of the optimizations are optional.
In the case of a host implementing packetization layer path MTU
discovery [RFC4821] for all transport protocols that can generate
packets larger than the standard size, the use of outgoing probe and
verification messages is unnecessary. However, such a host MUST
still process incoming probe and verification messages.
The first optimization is for senders that have the capability to
determine whether they sent packets that are larger than the standard
size, to only send a probe message and a verification message when a
data packet larger than the standard size was sent recently.
A further optimization may be applied when both the sender and the
receiver have the capability to determine whether they sent/received
data packets that are larger than the standard size. If both the
sender and the receiver have this capability, as indicated by flags
in the neighbor discovery option, no explicit MTU probe messages are
sent, just a verification message.
A final optimization applies between a host and a router. In that
case, the host may assume the responsibility for probing with large
packets in both directions. This reduces the control channel
processing on the router, and allows for the possibility to forego
probing when there are no active transport sessions that are capable
of generating larger than standard packets.
4.1. Managing neighbor MTUs
The following does not apply to hosts that support [RFC4821] or a
similar mechanism for all transport protocols that can send larger
than standard packets. For instance, if a host implements [RFC4821]
for TCP and limits UDP packets and packets using other transport
protocols to 1500 bytes on its ethernet interface, the host is not
required to perform any probing and per-destination path MTUs can be
maintained at the TCP level. It must still respond to incoming
probes. Routers are never exempt from what follows.
Along with neighbors' link addresses, a node caches an MTU value for
each neighbor. This value starts out being undefined. Whenever a
packet must be sent (this includes packets that are forwarded), the
node consults the neighbor MTU cache. If the cached value is
undefined, it applies the interface MTU that is in effect on MTU-
related actions such as fragmentation or the generation of "too big"
messages.
Whenever a value is entered into the neighbor MTU cache, this value
is marked as "tentative" and the node MUST start a clock that times
out after 500 milliseconds. After 500 milliseconds (or less,
depending on the implementation), if the value in the MTU cache is
still tentative, it reverts back to being undefined. Values enter
the neighbor cache after receiving the NODEMTU option in ND or ARP
messages. In this case, the cache is initialized with the minimum of
the local and remote NODEMTU values and the clock is started. If no
such option is present in ND or ARP messages, the node may insert the
standard MTU value + 1 rounded up to the nearest multiple of 8. In
this case, the clock is also started. If timer resources are not
available, the neighbor MTU value remains undefined.
At this point, the node sends a probe message that is the size of the
value in the neighbor MTU cache. If this probe message is answered,
the neighbor's MTU value in the neighbor MTU cache is marked as
"valid" and the timer is stopped.
If the probe wasn't answered, or probing started from just above the
standard MTU, after some time (such as 30 seconds) a new probe MAY be
sent. For unanswered probes, new probes are smaller; for answered
probes, the new probe is larger. If the new probe is answered, the
size of the probe is entered in the neighbor MTU cache as a "valid"
value. Probing MAY continue for several iterations, but implementers
are encouraged to limit probing rather than exhaustively search for
the exact supported neighbor MTU value.
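The cache states described in this section (undefined, tentative, valid) can be modeled roughly as below. This is a sketch under stated assumptions: the class and method names are hypothetical, callers pass in the current time instead of using a real timer, and a tentative entry is conservatively treated like an undefined one when sending, which the text does not spell out. Follow-up probing with other sizes is not modeled.

```python
UNDEFINED, TENTATIVE, VALID = "undefined", "tentative", "valid"


class NeighborMtuEntry:
    TENTATIVE_TIMEOUT = 0.5  # seconds (500 milliseconds)

    def __init__(self):
        self.state = UNDEFINED
        self.mtu = None
        self.deadline = None

    def learn(self, local_nodemtu, remote_nodemtu, now):
        """Enter a value from a received NODEMTU option; returns the probe size."""
        self.mtu = min(local_nodemtu, remote_nodemtu)
        self.state = TENTATIVE
        self.deadline = now + self.TENTATIVE_TIMEOUT
        return self.mtu

    def expire(self, now):
        """Revert a tentative value to undefined once the clock runs out."""
        if self.state == TENTATIVE and now >= self.deadline:
            self.state, self.mtu, self.deadline = UNDEFINED, None, None

    def probe_answered(self, now):
        """Mark the value valid if the answer arrived in time."""
        self.expire(now)
        if self.state == TENTATIVE:
            self.state, self.deadline = VALID, None

    def effective_mtu(self, interface_mtu, now):
        """MTU to apply for fragmentation or "too big" decisions."""
        self.expire(now)
        return self.mtu if self.state == VALID else interface_mtu
```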
4.2. Host-to-host keepalives
When hosts have active communication sessions with other hosts on the
same subnet, they send periodic probes to determine whether large
packets continue to be received. Hosts MUST NOT send probes when
there are no active communication sessions and SHOULD NOT send probes
when there are no active communication sessions that support larger
than standard packet sizes. For instance, if the host only supports
larger-than-standard packet sizes over TCP, and there are no TCP
sessions where the remote host indicated that it supports larger-
than-standard packet sizes through the MSS option, probing SHOULD NOT
be performed.
The probe interval is randomized between 8 and 10 seconds. A host
SHOULD NOT send probes if it has not sent any packets larger than the
interface MTU size during the previous probe interval. Probes are
ARP or neighbor solicitation messages padded to the cached neighbor
MTU size. A timer is initialized to 21 seconds (or a slightly larger
value, with a maximum of 35 seconds) when a probe is sent. The timer
is NOT reinitialized when new probes are sent. The timer is stopped
when a probe is answered by an ARP reply or neighbor advertisement.
This message does not have to be padded. If the timer is not stopped
by an incoming probe reply and it expires, the neighbor MTU cache is
cleared and becomes undefined. The next probe is sent no earlier
than 8 seconds after the last ARP or neighbor advertisement from the
neighbor has been received.
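The timing rules above can be sketched as follows. The class name and its methods are invented for this illustration, and the caller supplies timestamps; the key point shown is that the 21 second timer is started by the first unanswered probe and is not restarted by later probes.

```python
import random


class KeepaliveTimer:
    def __init__(self, timeout=21.0):
        self.timeout = timeout   # 21 s, or slightly more, up to 35 s
        self.deadline = None     # None means the timer is not running

    def next_probe_delay(self, rng=random):
        """Probe interval, randomized between 8 and 10 seconds."""
        return rng.uniform(8.0, 10.0)

    def probe_sent(self, now):
        # Only the first outstanding probe starts the timer;
        # later probes do NOT reinitialize it.
        if self.deadline is None:
            self.deadline = now + self.timeout

    def reply_received(self):
        """An ARP reply or neighbor advertisement stops the timer."""
        self.deadline = None

    def expired(self, now):
        """True when the neighbor MTU cache entry must become undefined."""
        return self.deadline is not None and now >= self.deadline
```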
The following is optional:
If the neighbor included the L flag set to 1 in its NODEMTU option,
the host MAY send probes as regular ARP or neighbor solicitation
packets, without padding. In this case, ARP replies or neighbor
advertisements are only considered valid probe replies when they have
a NODEMTU option with the A flag set to 1.
If the host detected that it received a packet larger than the
interface MTU in the last 7 seconds, it MAY send an unsolicited probe
reply, which consists of an ARP reply or a neighbor advertisement.
Replies to probes SHOULD and unsolicited probe replies MUST have a
NODEMTU option with the A bit set.
4.3. Router-to-router keepalives
The probe interval is randomized between 8 and 10 seconds. A router
SHOULD NOT send probes if it has not sent any packets larger than the
interface MTU size during the previous probe interval. Probes are
ARP or neighbor solicitation messages padded to the cached neighbor
MTU size. A timer is initialized to 21 seconds (or a slightly larger
value, with a maximum of 35 seconds) when a probe is sent. The timer
is NOT reinitialized when new probes are sent. The timer is stopped
when a probe is answered by an ARP reply or neighbor advertisement.
This message MUST NOT be padded. If the timer is not stopped by an
incoming probe reply and it expires, the neighbor MTU cache is
cleared and becomes undefined.
4.4. Host-to-router keepalives
The probe interval is randomized between 8 and 10 seconds. Probes
are ARP or neighbor solicitation messages padded to the cached
neighbor MTU size. The destination IP address is the host's own IP
address, the link address is the router's link address. A timer is
initialized to 21 seconds (or a slightly larger value, with a maximum
of 35 seconds) when a probe is sent. The timer is NOT reinitialized
when new probes are sent. The timer is stopped when a probe is
answered by an ARP reply or neighbor advertisement. This message
MUST be the padded message originally sent by the host itself. If
the timer is not stopped by an incoming probe reply and it expires,
the neighbor MTU cache is cleared and becomes undefined. Also, a
neighbor advertisement or ARP reply is sent with a NODEMTU option
that contains the current interface MTU. After some time, such as 15
minutes, the host MAY attempt probing for larger than standard MTU
sizes again.
4.5. Router-to-host keepalives
Routers do not send keepalives to hosts. Routers MUST adjust their
cached neighbor MTU value based on the NODEMTU option in unsolicited
neighbor advertisements or ARP replies.
4.6. Determining the MTU
Nodes SHOULD NOT blindly advertise the maximum MTU that their
hardware is capable of. On slow links, a large MTU can easily reduce
performance. In general, hosts SHOULD limit the MTU they advertise
and impose on packets they send to the standard MTU size on links
operating at speeds of 50 Mbps or slower. On links operating at
speeds of 500 Mbps and higher, MTUs of 9000 bytes or even as large as
64 kilobytes presumably won't cause problems. Between 50 and 500
Mbps, larger than standard MTUs SHOULD be used with care. For
instance, by limiting MTUs to 9000 bytes and only on full duplex
links with low bit error rates (which would exclude wireless links).
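As an illustration, the guidance above could be turned into a policy function like the one below. The name and parameters are hypothetical, and advertising the full hardware MTU at 500 Mbps and above is one possible reading of "presumably won't cause problems", not a rule from this document.

```python
def advertised_mtu(hardware_mtu, link_mbps, full_duplex=True,
                   low_error_rate=True, standard_mtu=1500):
    """Choose the MTU to advertise based on link speed and quality."""
    if link_mbps <= 50:
        # Slow links: stay at the standard MTU.
        return standard_mtu
    if link_mbps >= 500:
        # Fast links: large MTUs presumably cause no problems.
        return hardware_mtu
    # Between 50 and 500 Mbps: use larger MTUs with care.
    if full_duplex and low_error_rate:
        return min(hardware_mtu, 9000)
    return standard_mtu
```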
4.7. Probe considerations
In cases where the neighbor's MTU was advertised in a NODEMTU option,
it makes sense to try with this size or the local MTU, whichever is
smaller. If that probe fails or the neighbor's MTU is unknown, the
best choice for a probe size would be the smallest possible non-
standard MTU. This could be the StandardMTU + 1, or a slightly
larger value that represents the first larger size that is actually
useful, such as 1508 or 1520 for ethernet. Failure at this size
wastes relatively little bandwidth and indicates that further probes
are unnecessary. If this probe is successful, further choices for
the probe size may be common MTU sizes such as 1508, 1530, 1536,
1546, 1998, 2000, 2018, 4464, 4470, 8092, 8192, 9000, 9176, 9180,
9216, 16384, 17976, 64000 and 65280 bytes. (These values were
observed in vendor documentation and hands-on experience.)
A further consideration is that there is little value in sending many
probes to discover a few extra bytes of MTU, and using multiples of 8
bytes may streamline copying of data and make IPv6 probing easier.
So values to test if the NODEMTU fails but 1504 succeeds could be
1992, 4464, 8088, 9000, 16384 and 64000.
Probes MUST be sent as unicast.
4.8. Neighbor MTU garbage collection
The MTU size for a neighbor is garbage collected along with a
neighbor's link address in accordance with regular ARP and neighbor
discovery timeouts. Additionally, a neighbor's MTU size is reset to
unknown after dead neighbor detection declares a neighbor "dead".
5. The TCP MSS option
Hosts SHOULD advertise the maximum MTU size they are prepared to use
on a link in the TCP MSS value, even during times when probing has
failed: should larger neighbor MTUs be established later, it will not
be possible to adjust the MSS for ongoing sessions.
6. IANA considerations
IANA is requested to assign two neighbor discovery option type
values.
[TO BE REMOVED: This registration should take place at the following
location: http://www.iana.org/assignments/icmpv6-parameters ]
7. Security considerations
Generating false neighbor discovery and ARP packets with large MTUs
may lead to a denial-of-service condition, just like the advertisement
of other false link parameters.
8. Acknowledgements
This document benefited from feedback by Dave Thaler, Jari Arkko, Joe
Touch and others.
9. References
9.1. Normative References
[RFC0826] Plummer, D., "Ethernet Address Resolution Protocol: Or
converting network protocol addresses to 48.bit Ethernet
address for transmission on Ethernet hardware", STD 37,
RFC 826, November 1982.
[RFC0894] Hornig, C., "Standard for the transmission of IP datagrams
over Ethernet networks", STD 41, RFC 894, April 1984.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2461] Narten, T., Nordmark, E., and W. Simpson, "Neighbor
Discovery for IP Version 6 (IPv6)", RFC 2461,
December 1998.
[RFC2462] Thomson, S. and T. Narten, "IPv6 Stateless Address
Autoconfiguration", RFC 2462, December 1998.
[RFC2464] Crawford, M., "Transmission of IPv6 Packets over Ethernet
Networks", RFC 2464, December 1998.
[RFC3315] Droms, R., Bound, J., Volz, B., Lemon, T., Perkins, C.,
and M. Carney, "Dynamic Host Configuration Protocol for
IPv6 (DHCPv6)", RFC 3315, July 2003.
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007.
9.2. Informative References
[CRC] Jain, R., "Error Characteristics of Fiber Distributed Data
Interface (FDDI)", IEEE Transactions on Communications,
August 1990.
Appendix A. Document and discussion information
The latest version of this document will always be available at
http://www.muada.com/drafts/. Please direct questions and comments
to the int-area mailing list or directly to the author.
Appendix B. About of larger packets
Although often desirable, the use of larger packets isn't universally Although often desirable, the use of larger packets isn't universally
advantageous for the following reasons: advantageous for the following reasons:
1. Increased delay and jitter 1. Increased delay and jitter
2. Increased reliance on path MTU discovery 2. Increased reliance on path MTU discovery
3. Increased packet loss through bit errors 3. Increased packet loss through bit errors
4. Increased risk of undetected bit errors 4. Increased risk of undetected bit errors
3.1. Delay and jitter B.1. Delay and jitter
On low-bandwidth links, the additional time it takes to transmit On low-bandwidth links, the additional time it takes to transmit
larger packets may lead to unacceptable delays. For instance, larger packets may lead to unacceptable delays. For instance,
transmitting a 9000-byte packet takes 7.23 milliseconds at 10 Mbps, transmitting a 9000-byte packet takes 7.23 milliseconds at 10 Mbps,
while transmitting a 1500-byte packet takes only 1.23 ms. Once while transmitting a 1500-byte packet takes only 1.23 ms. Once
transmission of a packet has started, additional traffic must wait transmission of a packet has started, additional traffic must wait
for the transmission to finish, so a larger maximum packet size for the transmission to finish, so a larger maximum packet size
immediately leads to a higher worst-case head-of-line blocking delay, immediately leads to a higher worst-case head-of-line blocking delay,
and thus, to a bigger difference between the best and worst cases and thus, to a bigger difference between the best and worst cases
(jitter). The increase in average delay depends on the number of (jitter). The increase in average delay depends on the number of
packets that are buffered, the average packet size and the queuing packets that are buffered, the average packet size and the queuing
strategy in use. Buffer sizes vary greatly between implementations, strategy in use. Buffer sizes vary greatly between implementations,
from only a few buffers in some switches and on low-speed interfaces from only a few buffers in some switches and on low-speed interfaces
on routers, to hundreds of megabytes of buffer space on 10 Gbps in routers, to hundreds of megabytes of buffer space on 10 Gbps
interfaces on some routers. interfaces in some routers.
If we assume that the delays involved with 1500-byte packets on 100 If we assume that the delays involved with 1500-byte packets on 100
Mbps ethernet are acceptable for most, if not all, applications, then Mbps ethernet are acceptable for most, if not all, applications, then
the conclusion must be that 15000-byte packets on 1 Gbps ethernet the conclusion must be that 15000-byte packets on 1 Gbps ethernet
should also be acceptable, as the delay is the same. At 10 Gbps should also be acceptable, as the delay is the same. At 10 Gbps
ethernet, much larger packet sizes could be accommodated without ethernet, much larger packet sizes could be accommodated without
adverse impact on delay-sensitive applications. At 100 Mbps, and adverse impact on delay-sensitive applications. At below 100 Mbps,
certainly below that, larger packet sizes are probably not advisable. larger packet sizes are probably not advisable.
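The transmission times quoted above can be reproduced with a short
calculation. This is a sketch under one assumption that the text does
not state: a per-packet ethernet overhead of 38 bytes (preamble,
header, frame check sequence and inter-frame gap), which happens to
yield the 7.23 ms and 1.23 ms figures. The function name is
hypothetical.

```python
# Assumed per-packet ethernet overhead in bytes:
# preamble 8 + header 14 + FCS 4 + inter-frame gap 12.
ETHERNET_OVERHEAD = 38


def serialization_delay_ms(packet_bytes, link_mbps,
                           overhead_bytes=ETHERNET_OVERHEAD):
    """Time in milliseconds needed to put one packet on the wire."""
    bits = (packet_bytes + overhead_bytes) * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds * 1000


if __name__ == "__main__":
    for size in (1500, 9000):
        print(f"{size} bytes at 10 Mbps: "
              f"{serialization_delay_ms(size, 10):.2f} ms")
    # Ignoring overhead, a packet ten times larger on a link ten times
    # faster takes the same time to transmit.
    print(serialization_delay_ms(1500, 100, overhead_bytes=0))
    print(serialization_delay_ms(15000, 1000, overhead_bytes=0))
```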
3.2. Path MTU Discovery problems B.2. Path MTU Discovery problems
PMTUD issues arise when routers can't fragment packets in transit PMTUD issues arise when routers can't fragment packets in transit
because the DF bit is set or because the packet is IPv6, but the because the DF bit is set or because the packet is IPv6, but the
packet is too large to be forwarded over the next link, and the packet is too large to be forwarded over the next link, and the
resulting "packet too big" ICMP messages from the router don't make resulting "packet too big" ICMP messages from the router don't make
it back to the sending host. If there is a PMTUD black hole, this it back to the sending host. If there is a PMTUD black hole, this
will typically happen when there is an MTU bottleneck somewhere in will typically happen when there is an MTU bottleneck somewhere in
the middle of the path. If the MTU bottleneck is located at either the middle of the path. If the MTU bottleneck is located at either
end, the TCP MSS (maximum segment size) option makes sure that TCP end, the TCP MSS (maximum segment size) option makes sure that TCP
packets conform to the smallest MTU in the path. PMTUD problems are packets conform to the smallest MTU in the path. PMTUD problems are
of course possible with non-TCP protocols, but this is rare in of course possible with non-TCP protocols, but this is rare in
practice because non-TCP protocols are generally not capable of practice because non-TCP protocols are generally not capable of
adjusting their packet size on the fly and therefore use more adjusting their packet size on the fly and therefore use more
conservative packet sizes which won't trigger PMTUD issues. conservative packet sizes which won't trigger PMTUD issues.
Taking the delay and jitter issues to heart, maximum packet sizes Taking the delay and jitter issues to heart, maximum packet sizes
should be larger for faster links and smaller for slower links. This should be larger for faster links and smaller for slower links. This
means that in the majority of cases, the MTU bottleneck will tend to means that in the majority of cases, the MTU bottleneck will tend to
be at one of the ends of a path, rather than somewhere in the middle, be at, or close to, one of the ends of a path, rather than somewhere
as in today's internet, core of the network is quite fast, while in the middle, as in today's internet, the core of the network is
users usually connect at lower speeds. quite fast, while users usually connect to the core at lower speeds.
A crucial difference between PMTUD problems that result from MTUs A crucial difference between PMTUD problems that result from MTUs
smaller than the standard 1500 bytes and PMTUD problems that result smaller than the de facto standard 1500 bytes and PMTUD problems that
from MTUs larger than the standard 1500 bytes is that in the latter result from MTUs larger than 1500 bytes is that in the latter case,
case, only a party that's actually using the non-standard MTU is only the party that's actually using the non-standard MTU is
affected. This puts potential problems, the potential benefits and affected. This puts potential problems, the potential benefits and
the ability to solve any resulting problems in the same place so it's the ability to solve any resulting problems in the same place: it's
always possible to revert to a 1500-byte MTU if PMTUD problems can't always possible to revert to a 1500-byte MTU if PMTUD problems can't
be resolved otherwise. be resolved otherwise.
Considering the above and the work that's going on in the IETF to Considering the above and the work that's going on in the IETF to
resolve PMTUD issues as they exist today, means that increasing MTUs resolve PMTUD issues as they exist today, increasing MTUs where
where desired doesn't involve undue risks. desired doesn't involve undue risks.
3.3. Packet loss through bit errors B.3. Packet loss through bit errors
All transmission media are subject to bit errors. In many cases, a All transmission media are subject to bit errors. In many cases, a
bit error leads to a CRC failure, after which the packet is lost. In bit error leads to a CRC failure, after which the packet is lost. In
other cases, packets are retransmitted a number of times, but if other cases, packets are retransmitted a number of times, but if
error conditions are severe, packets may still be lost because an error conditions are severe, packets may still be lost because an
error occurred at every try. Using larger packets means that the error occurred at every try. Using larger packets means that the
chance of a packet being lost due to errors increases. And when a chance of a packet being lost due to errors increases. And when a
packet is lost, more data has to be retransmitted. packet is lost, more data has to be retransmitted.
Both per-packet overhead and loss through errors reduce the amount of Both per-packet overhead and loss through errors reduce the amount of
skipping to change at page 7, line 14 skipping to change at page 15, line 28
types of loss are equal. If we make the simplifying assumption that types of loss are equal. If we make the simplifying assumption that
the relationship between the bit error rate of a medium and the the relationship between the bit error rate of a medium and the
resulting number of lost packets is linear with packet size for resulting number of lost packets is linear with packet size for
reasonable bit error rates, the optimum packet size is computed as reasonable bit error rates, the optimum packet size is computed as
follows: follows:
packet size = sqrt( overhead bytes / bit error rate ) packet size = sqrt( overhead bytes / bit error rate )
According to this, the optimum packet size is one or more orders of According to this, the optimum packet size is one or more orders of
magnitude larger than what's commonly used today. For instance, the magnitude larger than what's commonly used today. For instance, the
minimum BER for 1000BASE-T is 10^-10, which implies an optimum packet maximum BER for 1000BASE-T is 10^-10, which implies an optimum packet
size of 312250 bytes with ethernet framing and IP overhead. size of 312250 bytes with ethernet framing and IP overhead.
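The 312250-byte figure can be reproduced as follows. This sketch rests
on two assumptions that are not spelled out in the text: the overhead
is taken to be 78 bytes (38 bytes of ethernet framing plus 40 bytes of
IP and TCP headers), and the bit error rate is converted to a per-byte
error rate by multiplying it by 8, since the packet size is expressed
in bytes. The function name is hypothetical.

```python
import math


def optimum_packet_size(overhead_bytes, bit_error_rate):
    """Packet size in bytes that balances overhead against error loss.

    Per-packet overhead costs a fraction overhead_bytes / size of the
    bandwidth, while loss through bit errors costs roughly
    size * 8 * bit_error_rate. The sum is smallest where both are equal.
    """
    byte_error_rate = 8 * bit_error_rate  # assumed unit conversion
    return math.sqrt(overhead_bytes / byte_error_rate)


if __name__ == "__main__":
    # Assumed overhead: 38 bytes ethernet framing + 40 bytes IP/TCP.
    print(round(optimum_packet_size(78, 1e-10)))
```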
3.4. Undetected bit errors B.4. Undetected bit errors
Nearly all link layers employ some kind of checksum to detect bit Nearly all link layers employ some kind of checksum to detect bit
errors so that packets with errors can be discarded. In the case of errors so that packets with errors can be discarded. In the case of
ethernet, this is a frame check sequence in the form of a 32-bit CRC. ethernet, this is a frame check sequence in the form of a 32-bit CRC.
Assuming a strong frame check sequence algorithm, this suggests that Assuming a strong frame check sequence algorithm, a 32-bit checksum
there is a 1 in 2^32 chance that a packet with one or more bit errors suggests that there is a 1 in 2^32 chance that a packet with one or
in it has the same CRC as the original packet, so the bit errors go more bit errors in it has the same checksum as the original packet,
undetected and data is corrupted. However, according to [CRC] the so the bit errors go undetected and data is corrupted. However,
CRC-32 that's used for FDDI and ethernet has the property that according to [CRC] the CRC-32 that's used for FDDI and ethernet has
packets between 376 and 11454 bytes long (including) have a Hamming the property that packets between 376 and 11454 bytes long
distance of 3. (Smaller packets have a larger Hamming distance, (including) have a Hamming distance of 3. (Smaller packets have a
larger packets a smaller Hamming distance.) As a result, all errors larger Hamming distance, larger packets a smaller Hamming distance.)
where only a single bit is flipped or two bits are flipped, will be As a result, all errors where only a single bit is flipped or two
detected, because they can't result in the same CRC as the original bits are flipped, will be detected, because they can't result in the
packet. The probability of a packet having undetected bit errors can same CRC as the original packet. The probability of a packet having
be approximated as follows for a 32-bit CRC: undetected bit errors can be approximated as follows for a 32-bit
CRC:
PER = (PL * BER) ^ H / 2^32 PER = (PL * BER) ^ H / 2^32
Where PER is the packet error rate, BER is the bit error rate, PL is Where PER is the packet error rate, BER is the bit error rate, PL is
the packet length in bits and H is the Hamming distance. Another the packet length in bits and H is the Hamming distance. Another
consideration is the impact of packet length on a multi-packet consideration is the impact of packet length on a multi-packet
transmission of a given size. This would be: transmission of a given size. This would be:
TER = transmission length / PL * PER TER = transmission length / PL * PER
So So
TER = transmission length * PL ^ (H - 1) * BER ^ H / 2^32 TER = transmission length * PL ^ (H - 1) * BER ^ H / 2^32
skipping to change at page 8, line 18 skipping to change at page 16, line 33
that given the low BER rates mandated for gigabit ethernet, packet that given the low BER rates mandated for gigabit ethernet, packet
sizes of up to 11454 bytes should be acceptable. sizes of up to 11454 bytes should be acceptable.
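The PER and TER approximations can be evaluated numerically. The
following is a minimal sketch with hypothetical function names. It
takes sizes in bytes and converts the packet length to bits for PER,
and it computes TER as the number of packets in the transmission
multiplied by PER.

```python
def undetected_packet_error_rate(packet_bytes, ber, hamming_distance,
                                 crc_bits=32):
    """PER = (PL * BER) ^ H / 2^32, with PL in bits."""
    pl_bits = packet_bytes * 8
    return (pl_bits * ber) ** hamming_distance / 2 ** crc_bits


def undetected_transmission_error_rate(transmission_bytes, packet_bytes,
                                       ber, hamming_distance):
    """TER = transmission length / PL * PER."""
    packets = transmission_bytes / packet_bytes
    return packets * undetected_packet_error_rate(
        packet_bytes, ber, hamming_distance)


if __name__ == "__main__":
    one_gigabyte = 10 ** 9
    for size in (1500, 9000):
        ter = undetected_transmission_error_rate(
            one_gigabyte, size, ber=1e-10, hamming_distance=3)
        print(size, ter)
```

With the Hamming distance held constant, TER grows with PL ^ (H - 1),
so going from 1500 to 9000 bytes at H = 3 raises it by a factor of 36.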
Additionally, unlike properties such as the packet length, the frame Additionally, unlike properties such as the packet length, the frame
check sequence can be made dependent on the physical media, so it check sequence can be made dependent on the physical media, so it
should be possible to define a stronger FCS in future ethernet should be possible to define a stronger FCS in future ethernet
standards, or to negotiate a stronger FCS between two stations on a standards, or to negotiate a stronger FCS between two stations on a
point-to-point ethernet link (i.e., a host and a switch or a router point-to-point ethernet link (i.e., a host and a switch or a router
and a switch). and a switch).
3.5. IEEE 802.3 compatibility B.5. Interaction with TCP congestion control
TCP performance is based on the inverse of the square root of the
packet loss probability. Using larger and thus fewer packets is
therefore a competitive advantage. Larger packets increase
burstiness, which
can be problematic in some circumstances. Larger packets also allow
TCP to ramp up its transmission speed faster, which is helpful on
fast links, where large packets will be more common. In general, it
would seem advantageous for an individual user to use larger packets,
but under some circumstances, users using smaller packets may be put
at a slight disadvantage.
B.6. IEEE 802.3 compatibility
According to the IEEE 802.3 standard, the field following the According to the IEEE 802.3 standard, the field following the
ethernet addresses is a length field. However, [RFC0894] uses this ethernet addresses is a length field. However, [RFC0894] uses this
field as a type field. Ambiguity is largely avoided by numbering field as a type field. Ambiguity is largely avoided by numbering
type codes above 2048. The mechanisms described in this memo only type codes above 2048. The mechanisms described in this memo only
apply to the standard [RFC0894] and [RFC2464] encapsulation of IPv4 apply to the standard [RFC0894] and [RFC2464] encapsulation of IPv4
and IPv6 in ethernet, not to possible encapsulations of IPv4 or IPv6 and IPv6 in ethernet, not to possible encapsulations of IPv4 or IPv6
in IEEE 802.3/IEEE 802.2 frames, so there is no change to the current in IEEE 802.3/IEEE 802.2 frames, so there is no change to the current
use of the ethernet length/type field. use of the ethernet length/type field.
3.6. Conclusion B.7. Conclusion
Larger packets aren't universally desirable. The factors that factor Larger packets aren't universally desirable. The factors that factor
into the decision to use larger packets include: into the decision to use larger packets include:
o A link's bit error rate o A link's bit error rate
o The number of bits per symbol on a link and hence the likelihood o The number of bits per symbol on a link and hence the likelihood
of multiple bit errors in a single packet of multiple bit errors in a single packet
o The strength of the frame check sequence o The strength of the frame check sequence
o The link speed o The link speed
o The number of buffers o The number of buffers
o Queuing strategy o Queuing strategy
This means that choosing a good maximum packet size is, initially at o Number of sessions on shared links and paths
least, the responsibility of hardware builders. On top of that,
robust mechanisms MUST be available to operators to further limit
maximum packet sizes where appropriate.
4. The protocol mechanisms
The new Multi-MTU router advertisement option lets IPv6 routers (and,
if desired, devices that aren't IPv6 routers) inform hosts of the
maximum packet sizes they should use, based on the link bandwidth of
the host and whether the host supports probing for support of
oversized packets.
4.1. The multi-MTU router advertisement option
Routers use this option to inform hosts on connected subnets about
the maximum allowed MTU for three ranges of link speeds.
1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length | Pri | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MAXMTU |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| SLOWMTU |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| SAFEMTU |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type: TBD
Length: 1
Pri: Priority. Values have the following meaning:
000: Vendor default
001: Local override of 000
010: Site default
011: Local override of 010
100: Subnet default
101: Local override of 100
110: Per-node setting
111: Local override of 110
Vendors may only use priority 000 in default configurations.
Site-wide administrative settings may only use 000 and 010.
Subnet-specific administrative settings may use 000, 010 or 100,
but not 001, 011, 101 or 111. Per-node configuration may use all
values.
Reserved: Set to 0 on transmission, ignored on reception.
MAXMTU: The absolute maximum packet size allowed on a link.
Packets larger than this size MUST NOT be sent.
SLOWMTU: The maximum packet size nodes operating at a link speed
below 600 Mbps (Mbps = 1000000 bps) may use.
SAFEMTU: The maximum packet size supported by all nodes on a link,
packets of this size can be sent without probing.
4.2. General operation
Hosts MUST recover the multi-MTU options from the router
advertisements of at least the router they select as a default
router, but it's encouraged (not required) to recover options from
multiple routers. The same option, or data constituting the same
information, may be learned from other sources, such as local
configuration and/or DHCPv6.
When a node's interface speed changes, it MAY reinitiate negotiation
of per-neighbor MTUs, but it SHOULD remain prepared to receive
packets of the maximum size indicated to neighbors previously.
Devices not acting as IPv6 routers that need to inform hosts on the
local subnet of MTU limitations MAY send out a router advertisement
with a Router Lifetime of 0 [RFC2461] and the pertinent information
in a Multi MTU option.
Routers and other systems generating router advertisements with a
Multi-MTU option SHOULD NOT advertise a MAXMTU, SLOWMTU or SAFEMTU
lower than the MTU defined in the relevant IP-over-... or
IPv6-over-... RFC.
DISCUSSION: Is it appropriate that IPv4 and IPv6 use the same MTU?
4.3. Determining the InterfaceMTU
If the node supports probing and there is positive knowledge that the
interface is currently operating at a speed of at least 600 Mbps, the
InterfaceMTU is set as follows:
InterfaceMTU = max(StandardMTU, min(MAXMTU, PhysicalMTU))
If the node supports probing and the interface is operating at a
speed below 600 Mbps, or the interface speed is unknown, the
InterfaceMTU is set as follows:
InterfaceMTU = max(StandardMTU, min(SLOWMTU, PhysicalMTU))
If the node doesn't support probing and there is positive knowledge
that the interface is currently operating at a speed of at least 600 Mbps,
the InterfaceMTU is set as follows:
InterfaceMTU = max(StandardMTU, min(MAXMTU, SAFEMTU, PhysicalMTU))
If none of the above rules apply, the InterfaceMTU is set as follows:
InterfaceMTU = max(StandardMTU, min(SLOWMTU, SAFEMTU, PhysicalMTU))
If InterfaceMTU is smaller than SAFEMTU, an error SHOULD be logged
but operation SHOULD continue.
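The four rules above can be collapsed into a single function. This is
a sketch only: the function and parameter names are chosen for the
example, and an interface speed of None stands for "speed unknown".

```python
def interface_mtu(standard_mtu, physical_mtu, maxmtu, slowmtu, safemtu,
                  supports_probing, speed_mbps=None):
    """Compute the InterfaceMTU from the Multi-MTU option fields.

    speed_mbps is None when the interface speed is unknown, which is
    treated the same way as a speed below 600 Mbps.
    """
    fast = speed_mbps is not None and speed_mbps >= 600
    speed_limit = maxmtu if fast else slowmtu
    if supports_probing:
        limit = min(speed_limit, physical_mtu)
    else:
        # Without probing, never go beyond what all nodes support.
        limit = min(speed_limit, safemtu, physical_mtu)
    return max(standard_mtu, limit)


if __name__ == "__main__":
    fields = dict(standard_mtu=1500, physical_mtu=9000,
                  maxmtu=9000, slowmtu=4470, safemtu=1500)
    print(interface_mtu(supports_probing=True, speed_mbps=1000, **fields))
    print(interface_mtu(supports_probing=True, speed_mbps=100, **fields))
    print(interface_mtu(supports_probing=False, speed_mbps=1000, **fields))
```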
4.4. Changes to the RA MTU option semantics
If in addition to a Multi-MTU option, there is also an MTU option in
a router advertisement, hosts MUST ignore the MTU option and use the
value of the SAFEMTU field in the Multi-MTU option as the default MTU
size on the interface. However, it may be necessary to incorporate
special case logic to allow for the use of larger packets than the
interface-wide MTU value that is set accordingly suggests. For
instance, if a node supports explicit probing as outlined below, or
[RFC4821] probing for some transport protocols, the transport
protocols in question may need to be aware of the possibility of
using packets larger than the SAFEMTU. For example, TCP should
probably advertise a maximum segment size based on the InterfaceMTU
rather than on the SAFEMTU in the MSS option.
4.5. The IPv6 neighbor discovery MTU option
In order to be able to use the largest packet sizes under the widest
range of circumstances, nodes SHOULD include a new MTU option in both
neighbor solicitation and neighbor advertisement messages [RFC2461].
The format of the neighbor discovery MTU option is as follows:
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length |R|T| Transport flags | Res |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MTU |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type: TBD
Length: 1
R: Reply flag. Set to 1 when the neighbor discovery packet is sent
in reply to a neighbor discovery packet containing a padding
option, otherwise set to 0.
T: TCP-MSS-override flag. If set to 1, the MTU field MAY overwrite
the maximum segment size that was advertised earlier, in the TCP
MSS option. (Note that the MSS option advertises a value that
doesn't include IP overhead; the MTU field is the size of an
entire IP packet, including the IP header.) If set to 0, the TCP
MSS option MUST be honored even if it's smaller than the
NeighborMTU.
Transport flags: Reserved for use with other transport protocols in
the same way as the T flag. Set to 0 on transmission, ignored
when receiving.
Res: Set to 0 on transmission, ignored on reception.
MTU: If the R flag is 0: the maximum packet size in bytes that the
node would like to receive. The minimum valid value is 1280.
However, the node MUST be prepared to receive packets up to the
SAFEMTU size. If the R flag is 1: the minimum of the maximum
packet size that the node would like to receive (as with R=0) and
the size of the packet that this packet is a reply to.
4.6. The IPv6 neighbor discovery padding option
The format of the neighbor discovery padding option is as follows:
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Padding |
~ ~
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type: TBD
Length: see below.
Reserved: set to 0 on transmission, ignored on reception.
Padding: 0 or more all-zero octets.
4.7. Use of the MTU and padding options
The MTU option is included in all neighbor advertisement and neighbor
solicitation messages.
Reception of a neighbor solicitation or a neighbor advertisement for
a neighbor for which no per-neighbor MTU is known triggers, in
addition to the normal response if it's a neighbor solicitation, the
sending of a neighbor solicitation message with the MTU and padding
options in it. The size of this message may vary between the IPv6
StandardMTU size + 1 for the link and the minimum of the local MTU
and the neighbor's MTU as advertised in the MTU option of the packet
received. See below for considerations about the packet sizes to
choose. The padding option is used to bring the neighbor
solicitation message to this size. The padding option MUST be the
last option in the packet.
There are two possible ways to determine the value of the length
field in the padding option:
1. Set it to 0. Since the option is in fact larger than 0, this
means that nodes that don't implement the option will silently
discard the packet. Setting the length to 0 makes it possible to
have packets with the padding option that aren't a multiple of 8
bytes long.
2. If the intended packet length allows a valid value for the length
field, the length field MAY be set to that value. The node MAY
reduce the size of the intended packet to accommodate the
requirement that the size field is a multiple of 8 bytes. I.e.,
if the intended packet size is 4470 bytes with 40 and 24 bytes
for the IPv6 and neighbor solicitation headers, respectively, the
padding option would have to be 4406 bytes long, which can't be
expressed in the length field. The node may choose to use a
packet size of 4464 instead, which results in a length field
value of 550. This of course means that subsequent data packets
MUST be no larger than 4464 bytes.
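The second method can be illustrated with the numbers from the
example. This is a sketch with hypothetical names; it assumes 40 and
24 bytes for the IPv6 and neighbor solicitation headers, as in the
example, and it does not check whether the resulting value fits the
width of the length field.

```python
IPV6_HEADER = 40  # bytes
NS_HEADER = 24    # bytes, neighbor solicitation without other options


def padding_option(intended_packet_size,
                   header_bytes=IPV6_HEADER + NS_HEADER):
    """Return (packet_size, option_bytes, length_field).

    The padding option is shrunk to the nearest multiple of 8 bytes,
    which may reduce the packet size below the intended size.
    """
    option_bytes = intended_packet_size - header_bytes
    option_bytes -= option_bytes % 8  # round down to a multiple of 8
    packet_size = header_bytes + option_bytes
    return packet_size, option_bytes, option_bytes // 8


if __name__ == "__main__":
    print(padding_option(4470))
```

For an intended size of 4470 bytes this gives a 4464-byte packet with
a 4400-byte padding option and a length field value of 550.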
A neighbor solicitation message with the padding option is always
sent in addition to a regular neighbor solicitation message, rather
than in place of one.
When a node receives a neighbor solicitation message with the padding
option, it stops evaluating options when it reaches the padding
option and returns a regular neighbor advertisement message, which
includes the MTU option with the R flag set to 1. Whenever the
neighbor advertisement is not the result of receiving a neighbor
solicitation with a padding option, the R flag is set to 0.
When a node receives a neighbor advertisement message, it must
determine whether the message is in reaction to a locally sent
neighbor solicitation with the padding option or not. If the MTU
option is included in the message received, an R flag of 1 indicates
that it is indeed a reply. If the message was a reply, the node sets
the NeighborMTU to the size of the MTU field in the received neighbor
discovery packet.
If no reply is received after some time, either the neighbor is
incapable of receiving packets of the size that was used, or a device
operating at the link layer was incapable of forwarding the frame.
(Incidental packet loss is also a possibility.) In order to
determine a workable MTU even in the presence of unknown limitations,
a node may repeat sending a solicitation with the padding option.
However, since presumably, some equipment may react badly to a large
number of out-of-spec packets, it's important that nodes limit the
number of oversized packets to destinations that aren't yet known to
be capable of receiving them. An upper limit would be to allow only
5 unacknowledged oversized packets per 300 second period.
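One way to enforce such a limit is a sliding window over the send
times of unacknowledged probes. The sketch below is an illustration
only: the class and method names are hypothetical, one limiter
instance per neighbor is assumed, and the current time is passed in
explicitly so that the behavior is easy to follow.

```python
from collections import deque


class OversizedProbeLimiter:
    """Allow at most max_unacked unacknowledged probes per window."""

    def __init__(self, max_unacked=5, window_seconds=300.0):
        self.max_unacked = max_unacked
        self.window = window_seconds
        self.sent = deque()  # send times of unacknowledged probes

    def may_send(self, now):
        # Forget probes that fell out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        return len(self.sent) < self.max_unacked

    def record_send(self, now):
        self.sent.append(now)

    def record_ack(self):
        # An acknowledged probe no longer counts against the limit.
        if self.sent:
            self.sent.popleft()


if __name__ == "__main__":
    limiter = OversizedProbeLimiter()
    for t in range(7):
        if limiter.may_send(t):
            limiter.record_send(t)
            print(f"t={t}: probe sent")
        else:
            print(f"t={t}: probe suppressed")
```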
Nodes that support probing MUST support reception of both types of
probes, but MAY be limited to generating only one type.
4.8. IPv4 ethernet jumbo ARP message
Due to lack of neighbor discovery, with IPv4, it's necessary to use
ARP to probe for non-standard MTU capabilities. This is done by
simply probing with an ARP packet padded to the desired size. If a
reply comes back, the neighbor supports the probed MTU size.
MAXMTU, SLOWMTU and SAFEMTU parameters advertised by IPv6 routers
MUST also be taken into account when probing and generating oversized
IPv4 packets.
4.9. Probe considerations
In cases where the neighbor's MTU was advertised in an MTU option, it
makes sense to try with this size. If that probe fails or the
neighbor's MTU is unknown, the best choice for a probe size would be
the smallest possible non-standard MTU. This could be the
StandardMTU + 1, or a slightly larger value that represents the first
larger size that is actually useful, such as 1508 or 1520 for
ethernet. Failure at this size wastes relatively little bandwidth
and indicates that further probes are unnecessary. If this probe is
successful, further choices for the probe size may be common MTU
sizes such as 1508, 1530, 1536, 1546, 1998, 2000, 2018, 4464, 4470,
8092, 8192, 9000, 9176, 9180, 9216, 17976, 64000 and 65280 bytes. A
useful heuristic would be to monitor all Multi-MTU options
advertised, regardless of their priority, and use the values in those
options as candidates for the largest supported packet size.
There is no requirement that a node tries a number of probes of
different sizes; only that a reply for a probe of that size or larger
MUST have been received from the neighbor in question before packets
larger than SAFEMTU are sent. A simple strategy would be to
initially send just one probe sized
at the InterfaceMTU size, and if unsuccessful, only send a second
probe when a probe from the neighbor is received. The second probe
is made the same size as the neighbor's probe.
Probes MUST be sent as unicast.
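The simple two-probe strategy and the rule for sending oversized
packets might be tracked per neighbor along the following lines. The
class, its methods and the send_probe callback are hypothetical; the
sketch treats "unsuccessful" as "no reply received so far" and caps
the second probe at the local InterfaceMTU, both of which are
assumptions.

```python
class NeighborProbeState:
    """Per-neighbor probing state for the simple two-probe strategy."""

    def __init__(self, interface_mtu, safe_mtu, send_probe):
        self.interface_mtu = interface_mtu
        self.safe_mtu = safe_mtu
        # Hypothetical callback: send_probe(size) sends a unicast probe.
        self.send_probe = send_probe
        self.confirmed = None  # largest probe size that was answered
        self.first_probe_sent = False
        self.second_probe_sent = False

    def start(self):
        """Send the single initial probe at the InterfaceMTU size."""
        if not self.first_probe_sent:
            self.first_probe_sent = True
            self.send_probe(self.interface_mtu)

    def on_probe_reply(self, size):
        """Record that the neighbor answered a probe of this size."""
        if self.confirmed is None or size > self.confirmed:
            self.confirmed = size

    def on_neighbor_probe(self, size):
        """Send the second probe, sized like the neighbor's probe."""
        if (self.first_probe_sent and self.confirmed is None
                and not self.second_probe_sent):
            self.second_probe_sent = True
            self.send_probe(min(size, self.interface_mtu))

    def may_send(self, packet_size):
        """Packets above SAFEMTU need a reply to an equal or larger probe."""
        if packet_size <= self.safe_mtu:
            return True
        return self.confirmed is not None and packet_size <= self.confirmed
```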
4.10. Neighbor MTU garbage collection
The MTU size for a neighbor is garbage collected along with a
neighbor's link address in accordance with regular ARP and neighbor
discovery timeouts. Additionally, a neighbor's MTU size is reset to
unknown after dead neighbor detection declares a neighbor "dead".
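A neighbor cache that follows these rules could look like the sketch
below. The class and method names are hypothetical; the point is only
that the MTU lives in the same entry as the link address, so it
disappears when the entry times out, and that dead neighbor detection
resets it to unknown (None) without removing the entry.

```python
class NeighborCache:
    """Keeps a per-neighbor MTU alongside the link address."""

    def __init__(self):
        # IP address -> {"link_addr": str, "mtu": int or None}
        self._entries = {}

    def learn(self, ip, link_addr):
        entry = self._entries.setdefault(
            ip, {"link_addr": link_addr, "mtu": None})
        entry["link_addr"] = link_addr

    def set_mtu(self, ip, mtu):
        self._entries[ip]["mtu"] = mtu

    def mtu(self, ip):
        entry = self._entries.get(ip)
        return entry["mtu"] if entry else None

    def expire(self, ip):
        """Regular ARP/ND timeout: link address and MTU go together."""
        self._entries.pop(ip, None)

    def mark_dead(self, ip):
        """Dead neighbor detection: reset the MTU to unknown."""
        if ip in self._entries:
            self._entries[ip]["mtu"] = None

    def __contains__(self, ip):
        return ip in self._entries
```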
5. IANA considerations
IANA is requested to assign a router advertisement option number and
two neighbor discovery options. In addition, IANA is requested to
start a registry for the transport flags. There are 10 flags,
numbered 18 to 27. Each flag may be assigned to a transport protocol
that communicates a maximum segment size in-band. See the discussion
of the T flag in Section 4.5.
6. Security considerations
Generating false router advertisements and neighbor discovery packets
with large MTUs may lead to a denial-of-service condition, just like
the advertisement of other false link parameters.
7. References
7.1. Normative References
[RFC0894] Hornig, C., "Standard for the transmission of IP datagrams
over Ethernet networks", STD 41, RFC 894, April 1984.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2461] Narten, T., Nordmark, E., and W. Simpson, "Neighbor
Discovery for IP Version 6 (IPv6)", RFC 2461,
December 1998.
[RFC2462] Thomson, S. and T. Narten, "IPv6 Stateless Address
Autoconfiguration", RFC 2462, December 1998.
[RFC2464] Crawford, M., "Transmission of IPv6 Packets over Ethernet
Networks", RFC 2464, December 1998.
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007.
7.2. Informative References
[CRC] Jain, R., "Error Characteristics of Fiber Distributed Data
Interface (FDDI)", IEEE Transactions on Communications,
August 1990.
Appendix A. Document and discussion information
The latest version of this document will always be available at
http://www.muada.com/drafts/. Please direct questions and comments
to the int-area mailing list or directly to the author.
This means that choosing a good maximum packet size is, initially at
least, the responsibility of hardware builders, and may also be of
interest to ISPs.
Author's Address
Iljitsch van Beijnum
IMDEA Networks
Avda. del Mar Mediterraneo, 22
Leganes, Madrid 28918
Spain
Email: iljitsch@muada.com
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgment
Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).