2 Network Working Group I.
van Beijnum 3 Internet-Draft Institute IMDEA Networks 4 Intended status: Experimental March 21, 2016 5 Expires: September 22, 2016 7 Extensions for Multi-MTU Subnets 8 draft-van-beijnum-multi-mtu-05 10 Abstract 12 In the early days of the internet, many different link types with 13 many different maximum packet sizes were in use. For point-to-point 14 or point-to-multipoint links, there are still some other link types 15 (PPP, ATM, Packet over SONET), but multipoint subnets are now almost 16 exclusively implemented as Ethernets. Even though the relevant 17 standards mandate a 1500 byte maximum packet size for Ethernet, more 18 and more Ethernet equipment is capable of handling packets bigger 19 than 1500 bytes. However, since this capability isn't standardized, 20 it is seldom used today, despite the potential performance benefits 21 of using larger packets. This document specifies mechanisms to 22 negotiate per-neighbor maximum packet sizes so that nodes on a 23 multipoint subnet may use the maximum mutually supported packet size 24 between them without being limited by nodes with smaller maximum 25 sizes on the same subnet. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at http://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on September 22, 2016. 44 Copyright Notice 46 Copyright (c) 2016 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 
49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 2. Notational Conventions . . . . . . . . . . . . . . . . . . . 4 63 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 64 4. Overview of operation . . . . . . . . . . . . . . . . . . . . 5 65 5. The ND NODEMTU option . . . . . . . . . . . . . . . . . . . . 6 66 6. The MTUTEST packet format . . . . . . . . . . . . . . . . . . 7 67 7. Changes to the RA MTU option semantics . . . . . . . . . . . 8 68 8. The TCP MSS option . . . . . . . . . . . . . . . . . . . . . 9 69 9. Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 9 70 9.1. Initialization . . . . . . . . . . . . . . . . . . . . . 9 71 9.2. Probing . . . . . . . . . . . . . . . . . . . . . . . . . 10 72 9.3. Monitoring . . . . . . . . . . . . . . . . . . . . . . . 14 73 9.4. Neighbor MTU garbage collection . . . . . . . . . . . . . 16 74 10. Applicability . . . . . . . . . . . . . . . . . . . . . . . . 16 75 11. IANA considerations . . . . . . . . . . . . . . . . . . . . . 16 76 12. Security considerations . . . . . . . . . . . . . . . . . . . 16 77 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 17 78 14. References . . . . . . . . . . . . . . . . . . . . . . . . . 17 79 14.1. Normative References . . . . . . . . . . . . . . . . . . 17 80 14.2. Informative References . . . . . . . . . . . . . . . . . 18 81 Appendix A. 
Document and discussion information . . . . . . . . 19 82 Appendix B. Advantages and disadvantages of larger packets . . . 19 83 B.1. Clock skew . . . . . . . . . . . . . . . . . . . . . . . 19 84 B.2. ECMP over paths with different MTUs . . . . . . . . . . . 20 85 B.3. Delay and jitter . . . . . . . . . . . . . . . . . . . . 20 86 B.4. Path MTU Discovery problems . . . . . . . . . . . . . . . 21 87 B.5. Packet loss through bit errors . . . . . . . . . . . . . 21 88 B.6. Undetected bit errors . . . . . . . . . . . . . . . . . . 22 89 B.7. Interaction TCP congestion control . . . . . . . . . . . 23 90 B.8. IEEE 802.3 compatibility . . . . . . . . . . . . . . . . 23 91 B.9. Conclusion . . . . . . . . . . . . . . . . . . . . . . . 24 92 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 24 94 1. Introduction 96 Some protocols inherently generate small packets. Examples are VoIP, where it is necessary to send packets frequently before much data can be gathered to fill up the packet, and the DNS, where the queries are inherently small and the returned results also often do not fill up a full 1500-byte packet. However, most data that is transferred across the internet and private networks is part of long-lived sessions and requires segmentation by a transport protocol, which is almost always TCP. These types of data transfers can benefit from larger packets in several ways: 106 1. A higher data-to-header ratio makes for fewer overhead bytes 108 2. Fewer packets means fewer per-packet operations for the source and destination hosts 111 3. Fewer packets also means fewer per-packet operations in routers and middleboxes 114 4.
TCP performance increases with larger packet sizes 116 Even though the capability to use larger packets (often called jumboframes) is present in much Ethernet hardware today, this capability typically isn't used because IP assumes a common MTU size for all nodes connected to a link or subnet. In practice, this means that using a larger MTU requires manual configuration of the non-standard MTU size on all hosts and routers and possibly on layer 2 switches connected to a subnet. Also, the MTU size for a subnet is limited to that of the least capable router, host or switch. 125 Perhaps in the future, when hosts support packetization layer path MTU discovery ([RFC4821], "Packetization Layer Path MTU Discovery") in all relevant transport protocols, it will be possible to simply ignore MTU limitations by sending at the maximum locally supported size and determining the maximum packet size towards a correspondent from acknowledgements that come back for packets of different sizes. However, [RFC4821] must be implemented in every transport protocol, and problems arise in the case where hosts implementing [RFC4821] interact with hosts that don't implement this mechanism, but do use a larger than standard MTU. 136 This document provides a set of mechanisms that allow the use of larger packets between nodes that support them, and that interact well with both manually configured non-standard MTUs and expected future [RFC4821] operation with larger MTUs. This is done using a new IPv6 Neighbor Discovery option and a new UDP-based protocol for exchanging MTU information and testing whether jumboframes can be transmitted successfully. 144 Appendix B discusses several potential issues with larger packets, such as head-of-line blocking delays, path MTU discovery black holes and the strength of the CRC32 with increasing packet sizes. 148 2.
Notational Conventions 150 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 151 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 152 document are to be interpreted as described in [RFC2119]. 154 Note that this specification is not standards track, and as such, 155 can't overrule existing specifications. Whenever [RFC2119] language 156 is used, this must be interpreted within the context of this 157 specification: while the specification as a whole is optional and 158 non-standard, whenever it is implemented, such an implementation can 159 only function properly when all MUSTs are observed. 161 3. Terminology 163 Advertised MTU: The MTU size announced by a node to other nodes on 164 the local subnet. 166 Confirmed MTU: The largest packet size successfully received from 167 the neighbor or the largest packet size sent to the neighbor for 168 which an acknowledgment was received; whichever size is greater. 170 Confirmed Time: When a packet the size of the confirmed MTU was last 171 received or acknowledged. 173 Local MTU: The MTU configured on an interface. By default, this is 174 the largest MTU size supported by the hardware, but the Local MTU 175 may be lowered administratively or automatically based on policy. 176 (For instance, the MTU may be set to the Standard MTU if the link 177 speed is below 1000 Mbps.) 179 MRU: Maximum Receive Unit. The size of the largest IP packet that 180 can be received on an interface. This document doesn't use the 181 term MRU, and assumes that the MRU is equal to the MTU. 183 MTU: Maximum Transfer Unit. The size of the largest IP packet that 184 can be transmitted on an interface, considering hardware (and 185 administrative) limitations. 187 Neighbor: Another node on a connected subnet. Neighbors are 188 identified by the combination of a link address and an IP version. 
190 The MTU may be set to different values for IPv4 and IPv6 administratively, but it is assumed that if a node has multiple IPv4 or IPv6 addresses, the MTU for each set of addresses is the same. 195 Neighbor MTU: The currently used MTU towards a neighboring node on a subnet. The Neighbor MTU reflects the current best understanding of the maximum packet size that can successfully be transmitted towards that neighbor. 200 Safe MTU: The maximum packet size that is assumed to work without testing. Defaults to the Standard MTU, but may be set to a subnet-wide higher or lower value administratively, or to a lower value using the MTU option in IPv6 Router Advertisements. 205 Standard MTU: The MTU specified in the relevant IPv4-over-... or IPv6-over-... document, which is 1500 for Ethernet ([RFC0894] and [RFC2464]). 209 4. Overview of operation 211 The mechanisms described in this document come into play when a node is connected to a subnet using an interface that supports an MTU size larger than the standard MTU size for that link type. 215 For each remote node connected to such a subnet, the local node maintains a neighbor MTU setting. The length of packets transmitted to a neighbor is always limited to the neighbor MTU size. 219 When a node starts communicating with another node on the same subnet, it uses the following procedure: 222 1. Initialization: the neighbor MTU is set to the local maximum MTU for the interface used to reach the neighbor. 225 2. Discovery: learning the other node's MTU. 227 3. Probing: determining the maximum packet size that can successfully be transmitted to and received from the other node, considering the (unknown) maximum packet size supported by the layer 2 infrastructure. 232 4. Monitoring: making sure that when large packets are transmitted, they are not silently discarded, for instance as the result of a layer 2 reconfiguration.
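The four-stage lifecycle above can be sketched as a per-neighbor state record (a minimal Python illustration; the NeighborState and Stage names are invented here and are not part of the specification):

```python
from dataclasses import dataclass
from enum import Enum, auto

STANDARD_MTU = 1500  # the safe MTU defaults to the standard Ethernet MTU

class Stage(Enum):
    DISCOVERY = auto()   # learning the neighbor's advertised MTU
    PROBING = auto()     # finding the largest size that actually works
    MONITORING = auto()  # watching for silent loss of large packets

@dataclass
class NeighborState:
    link_addr: bytes
    ip_version: int
    neighbor_mtu: int            # packets to this neighbor never exceed this
    safe_mtu: int = STANDARD_MTU
    stage: Stage = Stage.DISCOVERY

    @classmethod
    def initialize(cls, link_addr, ip_version, local_mtu):
        # Stage 1: the neighbor MTU starts at the local interface MTU
        return cls(link_addr, ip_version, neighbor_mtu=local_mtu)

    def monitoring_failed(self):
        # Large packets no longer get through: fall back to the safe
        # MTU and return to the probing stage
        self.neighbor_mtu = self.safe_mtu
        self.stage = Stage.PROBING
```

A node would keep one such record per combination of link-layer address and IP version, as described in the terminology section.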
236 During the discovery and probing stages, the neighbor MTU is adjusted as new information becomes available. The monitoring stage is ongoing. If during the monitoring stage it is determined that large packets aren't successfully exchanged with the neighboring node, the neighbor MTU is set to the safe MTU and the node returns to the probing stage. 243 Unless administrative configuration or policy specifies otherwise, the link, IPv4 and IPv6 MTU sizes are set to the maximum supported by the hardware. This means that when TCP sessions are created, they carry a maximum segment size (MSS) option that reflects the larger-than-standard MTU. 249 5. The ND NODEMTU option 251 All MTU values are 32-bit unsigned integers in network byte order. All other values are also unsigned and in network byte order. 254 The MTU size and an optional MTU hint are exchanged as an IPv6 Neighbor Discovery option. The new option, as well as the MTU value it advertises, are named "NODEMTU". 258 1 2 3 259 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 260 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 261 | Type | Length | Reserved | 262 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 263 | NodeMTU | 264 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 265 / HintMTU (optional) / 266 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 268 Type: TBD 270 Length: 1 or 2 272 Reserved: Set to 0 on transmission, ignored when received. 274 NodeMTU: The maximum packet size the node wishes to receive on this interface. 277 HintMTU: The maximum packet size the node believes it can successfully receive on this interface at this time. If the HintMTU is equal to the NodeMTU or no value for HintMTU is known, this field may be omitted and the Length field is set to 1. If the HintMTU field is present, the Length field is set to 2.
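As a non-normative illustration, the option could be encoded and decoded as follows (a Python sketch; the type value is a placeholder pending IANA assignment, and the four padding octets in the Length = 2 form are an assumption here, since ND option lengths are expressed in units of 8 octets):

```python
import struct

ND_OPT_NODEMTU = 0xFE  # placeholder: the real type value is TBD (IANA)

def build_nodemtu(node_mtu, hint_mtu=None):
    """Encode a NODEMTU option; Length is in units of 8 octets."""
    if hint_mtu is None or hint_mtu == node_mtu:
        # HintMTU omitted, Length = 1 (8 octets)
        return struct.pack("!BBHI", ND_OPT_NODEMTU, 1, 0, node_mtu)
    # HintMTU present, Length = 2; pad to 16 octets (assumption: the
    # diagram shows 12 octets, but ND options are 8-octet multiples)
    return struct.pack("!BBHII4x", ND_OPT_NODEMTU, 2, 0, node_mtu, hint_mtu)

def parse_nodemtu(data):
    """Decode NodeMTU and HintMTU; an absent HintMTU equals NodeMTU."""
    typ, length, _reserved, node_mtu = struct.unpack_from("!BBHI", data)
    hint_mtu = node_mtu
    if length == 2:
        (hint_mtu,) = struct.unpack_from("!I", data, 8)
    return node_mtu, hint_mtu
```

All multi-octet fields are in network byte order, matching the statement at the start of this section.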
283 When a node's interface speed changes, it MAY advertise a new MTU, 284 but it SHOULD remain prepared to receive packets of the maximum size 285 advertised to neighbors previously (if the old maximum size is larger 286 than the newly advertised one). 288 6. The MTUTEST packet format 290 The packets used to test whether large packets can be transmitted 291 successfully and communicate status are sent using UDP ([RFC0768]). 292 Their format is as follows: 294 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 295 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 296 | Source Port | Destination Port | 297 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 298 | Length | Checksum | 299 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 300 | 'M' | 'T' | 'U' | 'T' | 301 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 302 |R|B| Reserved | Nonce | 303 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 304 | NodeMTU | 305 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 306 | HintMTU | 307 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 308 | Padding | 309 ~ ~ 310 | | 311 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 313 Source port (UDP): For outgoing requests: an ephemeral port number. 314 For replies: 1022. (16 bits.) 316 Destination port (UDP): For outgoing requests: 1022. For replies: 317 the source port used in the request being replied to. (16 bits.) 319 Length (UDP): for IPv4 and IPv6 packets smaller than or equal to 320 65575 bytes, the length of the UDP segment. For IPv6 packets 321 larger than 65575 bytes, 0 (as per [RFC2675]). (16 bits.) 323 Checksum (UDP): the UDP checksum. (16 bits.) 325 R: reply request flag. If set to 0, no reply is sent. If set to 1, 326 the receiver is asked to send a reply. (1 bit.) 
328 MTUT: The value corresponding to the ASCII string "MTUT", used to 329 differentiate MTUTEST packets from other UDP packets that use port 330 1022. Packets with a value other than "MTUT" at the beginning of 331 the UDP payload MUST be ignored. (32 bits.) 333 B: big reply request flag. If set to 0, replies are not padded. If 334 set to 1, replies are padded to be the same size as the request. 335 (1 bit.) 337 Reserved: set to 0 on transmission, ignored on reception. (6 bits.) 339 Nonce: a hard-to-guess value. (24 bits.) 341 NodeMTU: The maximum packet size that the sender is prepared to 342 receive at this time. (32 bits.) 344 HintMTU: The maximum packet size that the sender believes it can 345 successfully receive at this time. (32 bits.) 347 Padding: Filled with 0 or more all-zero bytes on transmission, 348 ignored on reception. 350 In addition to the fields listed above, the following IP and link 351 layer fields are taken into consideration: 353 Source link-layer address: On transmission: set automatically by the 354 networking stack. On reception: used to identify a neighbor. 356 IP version: On transmission: set automatically by the networking 357 stack. On reception: used to identify a neighbor. (The IP 358 version may also be identified implicitly through the API without 359 directly observing the version field.) 361 Time To Live / Hop Limit: On transmission: set to 255. On 362 reception: if 255, the packet is processed. If other than 255, 363 the packet is silently discarded. (To enforce that the protocol 364 is only used within a local subnet.) 366 Source IP address: On transmission, for requests: set to the address 367 the node intends to use to communicate with the neighbor. For 368 replies: set to the destination IP address in the request being 369 replied to. On reception: used to identify a neighbor. 371 Destination IP address: On transmission, for requests: set to the 372 address the node intends to use to communicate with the neighbor. 
373 For replies: set to the source IP address in the request being replied to. 376 7. Changes to the RA MTU option semantics 378 Section 6.3.4 of [RFC4861] specifies: 380 "If the MTU option is present, hosts SHOULD copy the option's value into LinkMTU so long as the value is greater than or equal to the minimum link MTU and does not exceed the maximum LinkMTU value specified in the link-type-specific document" 385 This document changes the handling of the Router Advertisement MTU option such that it may also be used by routers to tell hosts that they SHOULD use an MTU larger than the LinkMTU and update their SafeMTU value. If multiple routers advertise different MTUs that are higher or lower than the standard MTU, behavior is undefined. MTU options containing the standard MTU SHOULD be ignored. 392 The ability to advertise a larger-than-standard MTU must be used with extreme care by network administrators, as advertising an MTU size that exceeds the capabilities of routers or the layer 2 infrastructure will lead to reachability problems. 397 If the advertised larger-than-standard MTU is ignored or not supported by some hosts connected to the subnet, TCP will presumably still work because the MSS option ([RFC0793]) limits the size of transmitted TCP segments to what the receiver supports. However, non-TCP protocols that use large packets will likely fail. The most prominent example of this is DNS over UDP with EDNS0 when requesting large records, such as those used for DNSSEC ([RFC6891]). 405 8. The TCP MSS option 407 Hosts SHOULD advertise the maximum MTU size they are prepared to use on a link in the TCP MSS value, even during times when probing has failed: should larger neighbor MTUs be established later, it will not be possible to adjust the MSS for ongoing sessions. 412 9. Operation 414 9.1.
Initialization 416 When an interface is activated, an appropriate local MTU is determined, based on hardware limitations and administrative settings. Additionally, a policy may be in place to constrain packet sizes when operating at lower bandwidths, to avoid excessive delays as queues of large packets build up and cause significant head-of-line blocking for subsequent time-sensitive packets. Also, layer 2 devices operating at lower interface speeds are less likely to support non-standard MTUs. 425 In the absence of operational experience, this document RECOMMENDS limiting the use of larger than standard MTUs to interfaces operating at 400 Mbps or faster; and if a larger MTU is used for interfaces operating at lower speeds, a "mini jumbo" size of 1982 bytes or less is used for Ethernets. 431 For IPv4, the local MTU is limited to 65535 bytes. For IPv6, if [RFC2675] jumbograms are not supported, the local MTU is limited to 65575 bytes. These limits apply even if the interface hardware supports a larger MTU. IPv6 nodes that implement [RFC2675] jumbograms MAY use MTU sizes larger than 65575 bytes. 437 When the interface speed changes, the local MTU MAY be changed to reflect the new speed. However, the node SHOULD remain prepared to receive packets of the size of a previously advertised MTU. 441 The local MTU MAY be different for IPv4 and IPv6. The local MTU is the size used to calculate the value of the TCP MSS option. The HintMTU is set to undefined. 445 When sending Neighbor Solicitations and Neighbor Advertisements, a node includes its local MTU in the NodeMTU field of the NODEMTU option. If the size of the HintMTU is known, it is also included. 449 9.2. Probing 451 When a node starts communicating with a new IPv4 or IPv6 neighbor, the probing procedure is started.
This can happen when ARP [RFC0826] or Neighbor Discovery messages are exchanged, or when an incoming TCP SYN is received. 456 The node sends an MTUTEST packet to the new neighbor and sets the neighbor MTU to the safe MTU. The MTUTEST packet has the local MTU in the NodeMTU field. If a hint MTU is known, it is included in the HintMTU field. The R and B flags are set to 0. No padding is included. 462 Upon reception of a Neighbor Solicitation or a Neighbor Advertisement with the NODEMTU option or an MTUTEST packet, the node determines if the packet is received from a known neighbor IP address and a known neighbor link layer address. If the values match the values stored for a known neighbor, no action occurs. 468 If the values match the values for a known link layer address and IP version, but an unknown IP address, the IP address is added to the list of IP addresses for the neighbor in question and the known neighbor MTU for the neighbor is applied to the new address. 473 If the NodeMTU matches the NodeMTU previously sent by a known neighbor but the HintMTU has a different non-zero value, the HintMTU is updated. 477 If the HintMTU sent by a known neighbor is 0, the neighbor MTU is set to the safe MTU, the HintMTU for the neighbor is set to unknown and the probing procedure is started. 481 If the combination of link layer address and IP version is unknown, the neighbor MTU is set to the safe MTU, the HintMTU is set to the HintMTU value in the packet and the probing procedure is started. 485 Before starting the probing procedure, a node compares its link layer address to the neighbor's link layer address. If the node's link layer address is numerically larger than the neighbor's link layer address, the node applies a waiting period before starting the probing procedure. The waiting period SHOULD be at least 250 milliseconds and at most 1 second.
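As a non-normative illustration of the packet format from Section 6, the UDP payload of such a probe could be built and parsed like this (a Python sketch; socket handling, port 1022 and the Hop Limit of 255 are left to the caller):

```python
import os
import struct

MTUTEST_PORT = 1022
MAGIC = b"MTUT"  # differentiates MTUTEST packets from other port-1022 traffic

def build_mtutest(node_mtu, hint_mtu, size=0, reply=False, big=False, nonce=None):
    """Build the UDP payload of an MTUTEST packet, padded to 'size' bytes."""
    if nonce is None:
        nonce = int.from_bytes(os.urandom(3), "big")  # 24-bit hard-to-guess value
    flags = (0x80 if reply else 0) | (0x40 if big else 0)  # R and B flags
    payload = MAGIC + struct.pack("!B3sII", flags, nonce.to_bytes(3, "big"),
                                  node_mtu, hint_mtu)
    if size > len(payload):
        payload += b"\x00" * (size - len(payload))  # all-zero padding
    return nonce, payload

def parse_mtutest(payload):
    """Return the decoded fields, or None for packets that MUST be ignored."""
    if payload[:4] != MAGIC:
        return None
    flags, nonce3, node_mtu, hint_mtu = struct.unpack_from("!B3sII", payload, 4)
    return {"R": bool(flags & 0x80), "B": bool(flags & 0x40),
            "nonce": int.from_bytes(nonce3, "big"),
            "node_mtu": node_mtu, "hint_mtu": hint_mtu}
```

For the initial probe described above, a node would call build_mtutest with reply and big left at 0 and no padding; the receiver matches the nonce and source port before accepting a reply.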
492 The following is pseudo-code for a probing procedure. Note that it differs from the one outlined in [RFC4821]. The latter favors conservative probing because lost probes can't easily be differentiated from congestion losses, so lost probes are expensive. For this specification, probes merely cost bandwidth and losses are less problematic, so more aggressive probing and failing quickly is more appropriate. 500 Neighbor.ConfirmedTime = UNDEFINED 502 if LocalMTU > Neighbor.AdvertisedMTU 503 let Max = Neighbor.AdvertisedMTU 504 else 505 let Max = LocalMTU 507 # test with maximum supported packet size first 508 # and finish probing upon success 509 test (Max) 510 if Success: 511 Neighbor.MTU = Max 512 return 514 # maximum size doesn't work, now find 515 # what does work 516 # assumption: 256 works for IPv4, 1280 for IPv6 517 let WorksNo = Max 518 if IPv6: 519 let Neighbor.ConfirmedMTU = 1280 520 if IPv4: 521 let Neighbor.ConfirmedMTU = 256 523 # test with the hinted size 524 # if successful, this becomes the minimum for further tests 525 # if unsuccessful, this becomes the maximum 526 test (HintMTU) 527 if Success: 528 let Neighbor.ConfirmedMTU = HintMTU 529 else 530 let WorksNo = HintMTU 532 # test the smallest usable size larger than 533 # the standard MTU (if that size is still 534 # in the range to be tested) so we avoid wasting 535 # time probing non-jumbo-capable nodes 536 if (StandardMTU + 8 > Neighbor.ConfirmedMTU and \ 537 StandardMTU + 8 < WorksNo) 538 test (StandardMTU + 8) 539 if Success: 540 let Neighbor.ConfirmedMTU = StandardMTU + 8 541 else 542 let WorksNo = StandardMTU + 8 544 # to establish an upper bound quickly, 545 # test 320, 640, 1280, 2560, 5120, 10240, 20480, 40960, ...
546 let Current = 320 547 while (Current < WorksNo) 548 if (Current > Neighbor.ConfirmedMTU) 549 test (Current) 550 if Success: 551 let Neighbor.ConfirmedMTU = Current 552 else 553 let WorksNo = Current 554 let Current = Current * 2 556 # we have now established that 557 # WorksNo =< Neighbor.ConfirmedMTU * 2 559 # further testing is based on a list of hints. 560 # there SHOULD be a mechanism for administrators 561 # to add hints. 562 # 563 # hint sources: 564 # 576: common PPP low delay 565 # 1492: PPP over Ethernet [RFC2516] 566 # 1500: Ethernet II 567 # 1982: IEEE Std 802.3as-2006 568 # 2304: IEEE 802.11 569 # 2482: Fibre Channel over Ethernet (FCoE) 570 # [CATALYST]: 571 # 9216, 8092, 1600, 1998, 2000, 1546, 1530, 17976, 2018 572 # sizes observed by the author: 574 # 576, 1982, 4070, 9000, 16384, 64000 575 let Hints = 576, 1492, 1530, 1982, 2304, 4070, 8092, 9000, \ 576 16384, 32000, 64000 578 foreach Size in Hints 579 if Size > Neighbor.ConfirmedMTU and Size < WorksNo 580 test (Size) 581 if Success: 582 let Neighbor.ConfirmedMTU = Size 583 else 584 let WorksNo = Size 586 # finished testing, maximum working packet size 587 # is now known to within about a factor 1.5, 588 # depending on the number of hints 590 if Neighbor.ConfirmedTime <> UNDEFINED 591 # we got at least one probe back, use discovered MTU 592 Neighbor.MTU = Neighbor.ConfirmedMTU 593 else 594 # we never got any probes back, neighbor probably does 595 # not implement MTUTEST protocol, so we use the safe MTU 596 Neighbor.MTU = SafeMTU 598 # done! 
599 return 601 # sending probes 602 function test (Size) 604 # wait 20 milliseconds between sending probes 605 let MsecSinceProbe = now () - ProbeTime 607 if (MsecSinceProbe < 20) 608 sleep (20 - MsecSinceProbe) 610 # create probe, request reply (but not a big one) 611 let Probe.TTL = 255 612 let Probe.ReplyFlag = 1 613 Let Probe.BigFlag = 0 614 Let Nonce = rand () 615 Let Probe.Nonce = Nonce 616 let Probe.NodeMTU = LocalMTU 617 let Probe.HintMTU = HintMTU 618 let Probe.Padding = pad (Size - sizeof (Probe)) 619 send (Probe) 621 let ProbeTime = now () 622 # wait 2000 milliseconds for reply 623 # (this also avoids sending packets that are too large more 624 # than once every two seconds) 625 let Success = receive (Reply, 2000) 627 if not Success 628 return false 630 if not (Reply.TTL = 255 and Reply.Nonce = Nonce 631 and Reply.LinkAddress = Neighbor.LinkAddress) 632 return false 634 # valid reply received 635 # note that Neighbor.MTU is not updated yet, 636 # this happens after probing has finished 637 Neighbor.ConfirmedMTU = Reply.NodeMTU 638 Neighbor.ConfirmedTime = now () 639 Neighbor.HintMTU = Reply.HintMTU; 640 if HintMTU < Size 641 HintMTU = Size 642 return true 644 If at any time an unsolicited packet arrives from the neighbor and 645 the ConfirmedMTU of that neighbor is smaller than the size of the 646 packet received, the HintMTU for the neighbor is set to the size of 647 the received packet and a probe of that size may be sent. However, 648 as the maximum size of incoming packets may be different than the 649 maximum supported size of outgoing packets, reception of a large 650 packet is not sufficient to update the ConfirmedMTU. The packets 651 that update the HintMTU do not have to be MTUTEST protocol packets. 653 There are no retransmissions. Both nodes run the probing procedure, 654 so there are two opportunities to succeed. 
However, if both fail to determine the maximum packet size that can be used because of lost packets, the hosts will have to use a smaller packet size. 658 It is assumed that the maximum packet size that A can send to B is the same as the maximum packet size that B can send to A. As such, the reception of a large packet is treated the same as receiving an acknowledgment for a sent large packet. 663 9.3. Monitoring 665 Once a working neighbor MTU is found, large packets can be exchanged. Presumably, this situation will persist indefinitely. However, it is possible that the network is reconfigured and then no longer supports the MTU used between two nodes. The aim of the monitoring phase is to detect this when it happens and establish a working MTU value before sessions time out. 672 For each neighbor (as defined by a unique combination of link layer address and IP version) with a neighbor MTU larger than the safe MTU, the ability to successfully send or receive large packets is monitored. In the monitoring phase, a node tracks whether it sends any packets larger than the safe MTU to a neighbor and whether it receives either acknowledgments for those packets or packets of length neighbor MTU from that neighbor. (So acknowledged outgoing packets don't have to be the maximum size supported to/from the neighbor, but incoming packets do.) 682 The ability to track acknowledgment of non-MTUTEST packets is not required. However, it is expected that hosts will be able to do this for TCP packets because the TCP state is readily available. 686 Monitoring happens in intervals. This document RECOMMENDS that this interval is between 25 and 35 seconds for hosts and between 35 and 45 seconds for routers. At the end of each monitoring interval, if acknowledgments or large packets were received, everything is fine and the neighbor confirmed time is updated.
   At the end of a monitoring interval, if no large packets were
   sent, everything is fine and nothing happens.

   At the end of a monitoring interval, if large packets were sent
   but no acknowledgments or incoming maximum size packets were
   seen, there may have been a network reconfiguration that has made
   it impossible for large packets to be transmitted successfully
   between the two nodes.  To determine whether this is the case,
   the node sends an MTUTEST packet with length neighbor MTU.  The R
   flag is set to 1 and the B flag SHOULD be set to 0.  A random
   nonce, the local MTU and the hint MTU are included.

   The node waits 2 seconds for a reply.  If there is no reply, the
   probe is retransmitted and the node waits 4 seconds for a reply.
   If after 4 seconds there is still no reply, the node sets the
   hint MTU to 0 and reinitializes all of the neighbor's MTU-related
   information to its initial values.  Most notably, this means that
   the neighbor MTU is set to the safe MTU.

   If the node sets its own hint MTU to 0 or receives a hint MTU of
   0 from a neighbor in an ND or MTUTEST packet, the node MAY start
   sending probes to other neighbors before the monitoring interval
   expires.  However, nodes SHOULD limit the number of probes for
   all neighbors combined to no more than one every two seconds.  If
   a node has many neighbors and sending probes at one every two
   seconds would take too long, it MAY reset the neighbor MTUs of
   all of its neighbors to the safe MTU without sending probes if at
   least two neighbors appear to be affected by a reduction of the
   maximum working packet size.

9.4.  Neighbor MTU garbage collection

   The MTU size for a neighbor is garbage collected along with the
   neighbor's link address in accordance with regular ARP and
   neighbor discovery timeouts.  Additionally, a neighbor's MTU size
   is reset to unknown after dead neighbor detection declares a
   neighbor "dead".
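   The end-of-interval logic of the monitoring phase above can be
   summarized in a short sketch.  This is illustrative only and not
   part of the protocol; the function name and the returned action
   names are hypothetical.

```python
def end_of_monitoring_interval(sent_large: bool, confirmed: bool) -> str:
    """Decide what to do when a neighbor's monitoring interval ends.

    sent_large -- packets larger than the safe MTU were sent to the
                  neighbor during the interval
    confirmed  -- acknowledgments for large packets, or incoming
                  packets of length neighbor MTU, were seen
    """
    if confirmed:
        # Everything is fine: update the neighbor confirmed time.
        return "update-confirmed-time"
    if not sent_large:
        # No large packets were sent, so there is nothing to confirm.
        return "no-action"
    # Large packets went unconfirmed: send an MTUTEST probe of
    # length neighbor MTU with the R flag set to 1 and B set to 0.
    return "send-probe"
```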
10.  Applicability

   As discussed in Appendix B, all larger packets, but especially
   very large packet sizes, have the potential to be problematic in
   various ways.  However, jumboframes of 9000 or 9216 bytes have
   been supported by various vendors for a long time.  As such, MTUs
   of up to 9 kilobytes seem safe enough for larger scale
   experimentation at this time, but experiments with packet sizes
   larger than 11 kilobytes are best done in confined and closely
   monitored settings.

11.  IANA considerations

   IANA is requested to assign a neighbor discovery option type
   value.

   [TO BE REMOVED: This registration should take place at the
   following location:
   http://www.iana.org/assignments/icmpv6-parameters]

   UDP port 1022 is used in accordance with [RFC4727].  Presumably,
   unlike an ND option type value, a UDP port would be relatively
   easy to change when experimentation makes way for production
   deployment.

12.  Security considerations

   Generating false neighbor discovery and MTUTEST packets with
   large MTUs may lead to a denial-of-service condition, just like
   the advertisement of other false link parameters.  Requests are
   large and replies typically short to avoid the MTUTEST protocol
   being used as an amplification vector.  The nonce is used
   together with the ephemeral UDP port number to make sure that
   malicious nodes cannot generate a reply to a request in the
   blind.  Enforcement of the value 255 for the Hop Limit makes sure
   that off-link attackers can't use the protocol to influence
   packet sizes remotely.

   A malicious node may negotiate the use of large packets and cause
   head-of-line blocking, especially on slower links.  However, this
   can only happen if the neighbor is prepared to use large packets
   in the first place.

13.  Acknowledgements

   This document benefited from feedback by Dave Thaler, Jari Arkko,
   Joe Touch, Pat Thaler, David Black, Brian Carpenter, Fred
   Templin, Jeffrey Hammond, Mikael Abrahamsson and others.

14.  References

14.1.  Normative References

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC0826]  Plummer, D., "An Ethernet Address Resolution Protocol:
              Or Converting Network Protocol Addresses to 48.bit
              Ethernet Address for Transmission on Ethernet
              Hardware", STD 37, RFC 826, DOI 10.17487/RFC0826,
              November 1982.

   [RFC0894]  Hornig, C., "A Standard for the Transmission of IP
              Datagrams over Ethernet Networks", STD 41, RFC 894,
              DOI 10.17487/RFC0894, April 1984.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2464]  Crawford, M., "Transmission of IPv6 Packets over
              Ethernet Networks", RFC 2464, DOI 10.17487/RFC2464,
              December 1998.

   [RFC2675]  Borman, D., Deering, S., and R. Hinden, "IPv6
              Jumbograms", RFC 2675, DOI 10.17487/RFC2675,
              August 1999.

   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
              Algorithm", RFC 2992, DOI 10.17487/RFC2992,
              November 2000.

   [RFC4727]  Fenner, B., "Experimental Values In IPv4, IPv6,
              ICMPv4, ICMPv6, UDP, and TCP Headers", RFC 4727,
              DOI 10.17487/RFC4727, November 2006.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path
              MTU Discovery", RFC 4821, DOI 10.17487/RFC4821,
              March 2007.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)",
              RFC 4861, DOI 10.17487/RFC4861, September 2007.

   [RFC6891]  Damas, J., Graff, M., and P. Vixie, "Extension
              Mechanisms for DNS (EDNS(0))", STD 75, RFC 6891,
              DOI 10.17487/RFC6891, April 2013.

   [ETHERNETII]
              Digital Equipment Corporation, Intel Corporation, and
              Xerox Corporation, "The Ethernet - A Local Area
              Network", September 1980.

14.2.  Informative References

   [RFC2516]  Mamakos, L., Lidl, K., Evarts, J., Carrel, D., Simone,
              D., and R. Wheeler, "A Method for Transmitting PPP
              Over Ethernet (PPPoE)", RFC 2516,
              DOI 10.17487/RFC2516, February 1999.

   [IEEE.802.3AS_2006]
              IEEE, "IEEE Standard for Information Technology
              Telecommunications and Information Exchange Between
              Systems Local and Metropolitan Area Networks Specific
              Requirements Part 3: Carrier Sense Multiple Access
              With Collision Detection (CSMA/CD) Access Method and
              Physical Layer Specifications Amendment 3: Frame
              Format Extensions", IEEE 802.3as-2006,
              DOI 10.1109/ieeestd.2006.248146, November 2006.

   [IEEE.802.3_2012]
              IEEE, "802.3-2012", IEEE 802.3-2012,
              DOI 10.1109/ieeestd.2012.6419735, January 2013.

   [CRC]      Jain, R., "Error Characteristics of Fiber Distributed
              Data Interface (FDDI)", IEEE Transactions on
              Communications, August 1990.

   [CATALYST] Cisco, "Jumbo/Giant Frame Support on Catalyst Switches
              Configuration Example".

Appendix A.  Document and discussion information

   The latest version of this document will always be available at
   http://www.muada.com/drafts/.  Please direct questions and
   comments to the int-area mailing list or directly to the author.

Appendix B.  Advantages and disadvantages of larger packets

   Although often desirable, the use of larger packets isn't
   universally advantageous, for the following reasons:

   1.  Clock skew

   2.  ECMP over paths with different MTUs

   3.  Increased delay and jitter

   4.  Increased reliance on path MTU discovery

   5.  Increased packet loss through bit errors

   6.  Increased risk of undetected bit errors

B.1.  Clock skew

   Ethernet hardware has to compensate for clocking differences
   between the sender and the receiver through a FIFO buffer.  As
   packets get larger, more buffer capacity is required.  This
   places a limit on packet sizes.

   As jumboframes have been widely supported since the introduction
   of Gigabit Ethernet, and in the absence of information to the
   contrary, it seems safe to assume that the packet sizes that may
   be set administratively fall within the capabilities of the
   hardware.  Administrators are encouraged to monitor the fraction
   of packets lost to different types of corruption and adjust MTU
   sizes accordingly.

B.2.  ECMP over paths with different MTUs

   Should Equal Cost Multipath [RFC2992] be in effect between two
   nodes implementing this specification, with the different paths
   having different MTUs, then there is a high risk that probing
   will detect the larger of the supported MTU sizes while some data
   packets flow over the path with the smaller MTU size.  In this
   situation, packets will be lost consistently and the protocol
   will not be able to recover.

   As such, configuring paths used for ECMP with different MTU sizes
   MUST be avoided.

B.3.  Delay and jitter

   On low-bandwidth links, the additional time it takes to transmit
   larger packets may lead to unacceptable delays.  For instance,
   transmitting a 9000-byte packet takes 7.23 milliseconds at 10
   Mbps, while transmitting a 1500-byte packet takes only 1.23 ms.
   Once transmission of a packet has started, additional traffic
   must wait for the transmission to finish, so a larger maximum
   packet size immediately leads to a higher worst-case head-of-line
   blocking delay, and thus, to a bigger difference between the best
   and worst cases (jitter).
   The increase in average delay depends on the number of packets
   that are buffered, the average packet size and the queuing
   strategy in use.  Buffer sizes vary greatly between
   implementations, from only a few buffers in some switches and on
   low-speed interfaces in routers, to hundreds of megabytes of
   buffer space on 10 Gbps interfaces in some routers.

   If we assume that the delays involved with 1500-byte packets on
   100 Mbps Ethernet are acceptable for most, if not all,
   applications, then the conclusion must be that 15000-byte packets
   on 1 Gbps Ethernet should also be acceptable, as the delay is the
   same.  On 10 Gbps Ethernet, much larger packet sizes could be
   accommodated without adverse impact on delay-sensitive
   applications.  Below 100 Mbps, larger packet sizes are probably
   not advisable.

   When very tight QoS bounds are required, it may be appropriate to
   limit MTU sizes and forego larger MTUs.  With IPv6 this can be
   accomplished by advertising a limited MTU size in Router
   Advertisements.  With IPv4, it is necessary to configure each
   node to limit its MTU size.

B.4.  Path MTU Discovery problems

   PMTUD issues arise when routers can't fragment packets in transit
   because the DF bit is set or because the packet is IPv6, but the
   packet is too large to be forwarded over the next link, and the
   resulting "packet too big" ICMP messages from the router don't
   make it back to the sending host.  If there is a PMTUD black
   hole, this will typically happen when there is an MTU bottleneck
   somewhere in the middle of the path.  If the MTU bottleneck is
   located at either end, the TCP MSS (maximum segment size) option
   makes sure that TCP packets conform to the smallest MTU in the
   path.
   PMTUD problems are of course possible with non-TCP protocols, but
   this is rare in practice because non-TCP protocols are generally
   not capable of adjusting their packet size on the fly and
   therefore use more conservative packet sizes that won't trigger
   PMTUD issues.

   Taking the delay and jitter issues to heart, maximum packet sizes
   should be larger for faster links and smaller for slower links.
   This means that in the majority of cases, the MTU bottleneck will
   tend to be at, or close to, one of the ends of a path rather than
   somewhere in the middle: in today's internet, the core of the
   network is quite fast, while users usually connect to the core at
   lower speeds.

   A crucial difference between PMTUD problems that result from MTUs
   smaller than the de facto standard 1500 bytes and PMTUD problems
   that result from MTUs larger than 1500 bytes is that in the
   latter case, only the party that's actually using the
   non-standard MTU is affected.  This puts the potential problems,
   the potential benefits and the ability to solve any resulting
   problems in the same place: it's always possible to revert to a
   1500-byte MTU if PMTUD problems can't be resolved otherwise.

   Considering the above and the work that's going on in the IETF to
   resolve PMTUD issues as they exist today, increasing MTUs where
   desired doesn't seem to involve undue risks.

B.5.  Packet loss through bit errors

   All transmission media are subject to bit errors.  In many cases,
   a bit error leads to a CRC failure, after which the packet is
   lost.  In other cases, packets are retransmitted a number of
   times, but if error conditions are severe, packets may still be
   lost because an error occurred at every try.  Using larger
   packets means that the chance of a packet being lost due to
   errors increases.  And when a packet is lost, more data has to be
   retransmitted.
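   To put rough numbers on the loss increase, assume independent bit
   errors: the probability that a packet of a given size is hit by
   at least one error is 1 - (1 - BER)^bits.  The sketch below is
   illustrative only; it uses the 1000BASE-T maximum BER of 10^-10
   that appears later in this appendix.

```python
def packet_error_probability(size_bytes: int, ber: float) -> float:
    """Probability that at least one bit of the packet is corrupted,
    assuming independent bit errors at rate ber."""
    bits = 8 * size_bytes
    return 1.0 - (1.0 - ber) ** bits

BER = 1e-10  # maximum bit error rate specified for 1000BASE-T

p_1500 = packet_error_probability(1500, BER)  # ~1.2e-6
p_9000 = packet_error_probability(9000, BER)  # ~7.2e-6
```

   A 9000-byte packet is thus about six times as likely to be lost
   as a 1500-byte packet, and each loss also forces six times as
   much data to be retransmitted.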
   Both per-packet overhead and loss through errors reduce the
   amount of usable data transferred.  The optimum tradeoff is
   reached when both types of loss are equal.  If we make the
   simplifying assumption that the packet loss probability is linear
   with packet size for reasonable bit error rates, setting the
   overhead fraction (overhead bytes divided by packet size) equal
   to the loss fraction (8 times the packet size in bytes times the
   bit error rate) gives the optimum packet size in bytes:

      packet size = sqrt( overhead bytes / ( 8 * bit error rate ) )

   According to this, the optimum packet size is one or more orders
   of magnitude larger than what's commonly used today.  For
   instance, the maximum BER for 1000BASE-T is 10^-10, which implies
   an optimum packet size of 312250 bytes with Ethernet framing and
   IP overhead.

B.6.  Undetected bit errors

   Nearly all link layers employ some kind of checksum to detect bit
   errors so that packets with errors can be discarded.  In the case
   of Ethernet, this is a frame check sequence in the form of a
   32-bit CRC.  Assuming a strong frame check sequence algorithm, a
   32-bit checksum suggests that there is a 1 in 2^32 chance that a
   packet with one or more bit errors has the same checksum as the
   original packet, so that the bit errors go undetected and data is
   corrupted.  However, according to [CRC], the CRC-32 used for FDDI
   and Ethernet has the property that packets between 375 and 11453
   bytes long (inclusive) have a Hamming distance of 4.  (Smaller
   packets have a larger Hamming distance, larger packets a smaller
   one.)  As a result, all errors where only one, two or three bits
   are flipped will be detected, because they can't result in the
   same CRC as the original packet.
   The probability of a packet having undetected bit errors can be
   approximated as follows for a 32-bit CRC:

      PER = (PL * BER) ^ H / 2^32

   Where PER is the packet error rate, BER is the bit error rate, PL
   is the packet length in bits and H is the Hamming distance.
   Another consideration is the impact of packet length on a
   multi-packet transmission of a given size.  This would be:

      TER = transmission length / PL * PER

   So:

      TER = transmission length * PL ^ (H - 1) * BER ^ H / 2^32

   Where TER is the transmission error rate.

   In the case of the Ethernet FCS and a Hamming distance of 4 for a
   large range of packet sizes, this means that the risk of
   undetected errors goes up with the cube of the packet length, but
   goes down with the fourth power of the bit error rate.  This
   suggests that for a given acceptable risk of undetected errors, a
   maximum packet size can be calculated from the expected bit error
   rate.  It also suggests that given the low bit error rates
   mandated for Gigabit Ethernet, packet sizes of up to 11453 bytes
   should be acceptable.

   Additionally, unlike properties such as the packet length, the
   frame check sequence can be made dependent on the physical media,
   so it should be possible to define a stronger FCS in future
   Ethernet standards, or to negotiate a stronger FCS between two
   stations on a point-to-point Ethernet link (i.e., a host and a
   switch or a router and a switch).

B.7.  Interaction with TCP congestion control

   TCP throughput is inversely proportional to the square root of
   the packet loss probability, and proportional to the segment
   size.  Using larger and thus fewer packets is therefore a
   competitive advantage.  Larger packets increase burstiness, which
   can be problematic in some circumstances.  Larger packets also
   allow TCP to ramp up its transmission speed faster, which is
   helpful on fast links, where large packets will be more common.
   In general, it would seem advantageous for an individual user to
   use larger packets, but under some circumstances, users using
   smaller packets may be put at a slight disadvantage.

B.8.  IEEE 802.3 compatibility

   According to the IEEE 802.3 standard ([IEEE.802.3_2012]), the
   field following the Ethernet addresses is a length field.
   However, [RFC0894] uses this field as a type field.  Ambiguity is
   largely avoided by numbering type codes from 1536 upward, above
   any valid length value.  The mechanisms described in this memo
   only apply to the standard [RFC0894] and [RFC2464] encapsulation
   of IPv4 and IPv6 in Ethernet, not to possible encapsulations of
   IPv4 or IPv6 in IEEE 802.3/IEEE 802.2 frames, so there is no
   change to the current use of the Ethernet length/type field.

   The 2006 revision of IEEE 802.3 ([IEEE.802.3AS_2006]) adds "frame
   expansion" to 2000 bytes (allowing for 1982-byte IP packets).  As
   a result, layer 2 networks supporting MTUs of 1982 bytes are
   becoming more common.  However, as [RFC0894] and [RFC2464]
   (encapsulation of IPv4 and IPv6 in Ethernet) are based on
   [ETHERNETII], the IEEE 802.3 standard has little bearing on the
   problem at hand.

B.9.  Conclusion

   Larger packets aren't universally desirable.  The factors that
   play into the decision to use larger packets include:

   o  A link's bit error rate

   o  The number of bits per symbol on a link, and hence the
      likelihood of multiple bit errors in a single packet

   o  The strength of the frame check sequence

   o  The link speed

   o  The number of buffers

   o  The queuing strategy

   o  The number of sessions on shared links and paths

   This means that choosing a good maximum packet size is, initially
   at least, the responsibility of hardware builders.  A
   conservative approach may be called for, but even under
   conservative assumptions, 9000-byte jumboframes on Gigabit
   Ethernet links seem reasonable.
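   The formulas in Appendices B.5 and B.6 can be checked
   numerically.  The sketch below is illustrative; the 78-byte
   overhead figure (Ethernet framing plus IPv4 and TCP headers) is
   an assumption, not a value given in the text, and the factor of 8
   converts bytes to bits.

```python
import math

def optimum_packet_size(overhead_bytes: float, ber: float) -> float:
    """B.5: packet size at which overhead loss equals error loss."""
    return math.sqrt(overhead_bytes / (8 * ber))

def packet_error_rate(pl_bits: float, ber: float, hamming: int = 4) -> float:
    """B.6: approximate undetected-error rate behind a 32-bit CRC."""
    return (pl_bits * ber) ** hamming / 2 ** 32

# 1000BASE-T maximum BER with assumed 78 bytes of total overhead:
size = optimum_packet_size(78, 1e-10)     # ~312250 bytes, as in B.5

# Undetected-error rate for a 9000-byte packet at the same BER:
per = packet_error_rate(8 * 9000, 1e-10)  # vanishingly small
```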
Author's Address

   Iljitsch van Beijnum
   Institute IMDEA Networks
   Avda. del Mar Mediterraneo, 22
   Leganes, Madrid  28918
   Spain

   Email: iljitsch@muada.com