IPv6 Operations (v6ops) Working Group                       E. Vasilenko
Internet Draft                                                   X. Xiao
Intended status: Informational                       Huawei Technologies
Expires: March 2022                                          D. Khaustov
                                                              Rostelecom
                                                      September 17, 2021

                     IPv6 Oversized Packets Analysis
            draft-vasilenko-v6ops-ipv6-oversized-analysis-01

Abstract

Several IETF initiatives rely on IPv6 Extension Headers added in transit (SRv6, iOAM). Additionally, some recent developments are overlays (SRv6, VxLAN, NVO3, L2TPv3, and LISP). Both can create oversized packets that need to be dealt with. This document analyzes the available standards for the resolution of oversized packet drops.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire in March 2022.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Terminology and prerequisites
   2. Problem statement
   3. Solutions
      3.1. Provision links with big enough MTU
      3.2. Frugal usage of Extension Headers
      3.3. Fragmentation and reassembly at the tunnel ends
      3.4. PMTUD by original packet source
      3.5. Packetization Layer MTU Discovery
   4. Conclusion
   5. Security Considerations
   6. IANA Considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments

1. Terminology and prerequisites

We assume good knowledge of, or frequent reference to, [PMTUD] and [IPv6 Tunneling]. Terminology is inherited from [PMTUD].

   Link MTU - the maximum transmission unit, i.e., the maximum packet
      size in octets that can be conveyed over a link.

   Path MTU (PMTU) - the minimum link MTU of all links in a path
      between a source node and a destination node.

   Path MTU Discovery (PMTUD) - the process by which a node learns the
      PMTU of a path.

   EMTU_S - Effective MTU for sending; used by upper-layer protocols to
      limit the size of IP packets they queue for sending.

   EMTU_R - Effective MTU for receiving; the largest packet that can be
      reassembled at the receiver.

   Packetization Layer - the layer of the network stack that segments
      data into packets.

   PLPMTUD - Packetization Layer Path MTU Discovery, the method of
      detecting the path MTU at the packetization layer; an extension
      of classical PMTU Discovery.

   PTB (Packet Too Big) message - an ICMPv6 message reporting that an
      IPv6 packet is too large to forward through some link.

   MSS - the TCP Maximum Segment Size, the maximum payload size
      available to the TCP layer. This is typically the Path MTU minus
      the size of the IP and TCP headers (e.g., 1500 - 40 - 20 = 1440
      octets for TCP over IPv6 on a 1500-octet path).

2. Problem statement

IPv6 is strict regarding fragmentation: it must NOT be done in transit (Section 4.5 of [IPv6]).

IPv6 has seen rapid development in recent years. A lot of additional functionality has been added, primarily by adding options to Extension Headers and/or by using overlay encapsulation. All of the above expands the packet size, which could lead to oversized packets being dropped on some links.
Massive parallelism in traffic delivery is an additional challenge that has developed over the last 10 years: ECMP fan-out on a single hop can reach 16 (or even more), which can create on the order of 64k end-to-end paths over just 5 hops (an example from a big production network). Different paths could carry a different set of Extension Headers and, as a result, have a different PMTU. The PMTU is effectively becoming dynamic: we can never know how many additional headers will be added at a particular time to a particular packet on a particular path.

The old, classical PMTUD problems are still with us: filtered ICMPv6 messages, and drops related to Extension Headers that happen before the next-hop MTU has been evaluated (so no Packet Too Big message is sent).

Standards give two important numbers that we need for this discussion:

o  [IPv6] Section 5 requires that every link have an MTU of 1280
   octets or greater (2^10 + 2^8, which probably explains the choice
   of this size).

o  [IPv6] requires a minimum EMTU_R (reassembly buffer) of 1500
   octets. An upper-layer protocol or application that depends on IPv6
   fragmentation to send packets larger than the MTU of a path should
   not send packets larger than 1500 octets unless it has assurance
   that the destination is capable of reassembling packets of that
   larger size.

The reassembly buffer is much larger than 1500B for the majority of desktop and server OSes. Windows 10 has a "Reassemblylimit" of almost 64MB (visible via "netsh interface ipv6 show global"). Different flavors of Linux have "ipfrag_high_thresh" between 256KB and 4MB (visible via "more /proc/sys/net/ipv4/ipfrag_high_thresh"). iOS has a "maxReceiveIPv6BufferSize" of 64KB.
Reassembly is not as good for embedded OSes. Of the four primary OSes for IoT (Contiki, FreeRTOS, Mbed OS, MicroPython), only Mbed OS has the capability (for 5 fragments) by default, and it is possible to activate reassembly on Contiki. In all cases, the buffer is just a few packets of 1280B or 1500B. IoT devices may not be capable of reassembling the packets that a server in the cloud would send to them. Hence, ICMPv6 PTB is still very important for some OSes.

The [IPv6] architecture has only one solution for the PMTU problem: decrease the packet size at the original source. This is workable down to the minimum IPv6 packet size (1280B). For a long time the typical transit link had an MTU not much bigger than 1500B; only space for a few additional MPLS labels was reserved. The remaining 220B (1500B minus 1280B) could be considered guaranteed headroom for additional functionality in Extension and encapsulation headers. It could be enough for the next decade if we take some precautions - see the discussion below.

[Huston-2016] and [Huston-2021] investigated a different topic (fragmentation), but they contain good statistics on MTU-related drops up to 1500B, showing a 5% drop rate for an MTU as small as 1455B. Additionally, [Huston-2016] found a big drop spike (69% of all drops!) at 1480B, 20B below 1500B, presumably due to IPv6 encapsulation into IPv4. [Huston-2021] has shown a fragmentation drop rate twice as big for bigger packets, with a peak at 1408 octets, especially in Asia. As can be seen, 1500B is not always available now, and the reason is not well understood. Hence, we do not have 220B for additional headers in all situations. We can be reasonably optimistic that this type of tunneling will disappear in the long term. Below, we stick to the optimistic assumption that 220B is available in most situations. A more pessimistic estimate (200B? 175B?) is still possible, looking at Huston's data.
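For illustration only, the sketch below (Python) shows how quickly encapsulation consumes this margin; the header sizes are the fixed sizes of the respective encapsulations, and the particular overhead stack is just an example, not a recommendation:

   # Fixed header sizes, in octets.
   IPV6_HDR  = 40   # outer IPv6 header
   UDP_HDR   = 8
   VXLAN_HDR = 8
   ETH_HDR   = 14   # inner Ethernet header carried by VXLAN

   def remaining_margin(link_mtu, added_headers, ipv6_min_mtu=1280):
       """Octets still available for further headers when a packet of
       the IPv6 minimum size (1280B) must cross a link of link_mtu
       after added_headers octets of encapsulation are attached."""
       return link_mtu - ipv6_min_mtu - sum(added_headers)

   # VXLAN encapsulation of a minimum-size packet over a 1500B link:
   vxlan_stack = [IPV6_HDR, UDP_HDR, VXLAN_HDR, ETH_HDR]   # 70 octets
   print(remaining_margin(1500, vxlan_stack))               # -> 150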
The hungriest protocol known is SRv6: it could add a 40B IPv6 underlay tunnel header (called the "outer IP header" in [SRH]), 16B for the SRH header itself, and additionally up to 10 IPv6 addresses in the SID stack (potentially even more). That is already 216B (40 + 16 + 10*16), very close to the 220B optimistic limit. It makes the introduction of any additional functionality quite challenging without a rigorous expansion of all links to a bigger MTU.

Initial SRv6 implementations that exceeded the safe 220B limit are the reason for recent activities in MTU problem research. We see many recent efforts to improve Path MTU Discovery (mentioned later in this document); let us find the rationale behind them.

3. Solutions

Let's consider the reassembly buffer problem first, as the simpler one.

The minimal buffer for packet reassembly (1500B) could potentially be increased in new standard updates, but then there would be a transition problem, because this limitation is already programmed into billions of IoT hosts; it would take a long time to be sure that no old implementations remain.

There is no good solution for the problem of headers bloating above the 220B limit for hosts. We need to keep headers below 220B for embedded OSes. Fortunately, we are still far from this problem: very limited additional functionality is implemented directly on hosts (like [PMTU by HbH] or APN6). This problem should be looked at again in a few years; it may be that in the future we would have to increase the default EMTU_R on all hosts to make room for new functionality.

Let's now return to our primary problem: insufficient PMTU.

There is a low probability that the Internet community would agree to decrease the minimal IPv6 packet size (1280B). Hence, the oversized-packet problem cannot be resolved in that direction.

It is possible to partially alleviate the MTU problem in network zones where all transit nodes have a big enough MTU. Transit nodes should delete extension headers before packets leave such a "high MTU network zone". The leakage of a big header to a host could overflow the EMTU_R buffer. The majority of RFCs recommend that carriers delete additional headers before forwarding traffic to the client; this practice should be strictly followed.

The SPRING working group is actively developing a compressed version of SRv6 that should leave space for other functionality, even on current transit routers that sometimes do not support much above 1500B.

All solutions for avoiding packet drops caused by oversized packets can be classified into a few classes. They are examined one by one below.

3.1. Provision links with big enough MTU

The MTU supported by a host's links is typically 1500 bytes. A backbone link's MTU could be up to 9000 bytes on modern hardware. In an ideal world, PMTUD would not be needed.

Reality is not that good:

o  Some old devices still support just a few additional MPLS labels
   above 1500B on Ethernet.
   It was historically a problem to cross 1536B, because the IEEE
   802.3 specification assumes that a bigger number in the Length/Type
   field means the Type of the payload.
o  Middleboxes that do not support an MTU much bigger than 1500B may
   stay in networks for a long time.
o  Ethernet is now very mature with respect to big MTU support, but a
   big MTU could be a challenge for other link-layer technologies (for
   example WiFi, satellite links, radio links, etc.).
o  Links could be rented from a third party, with no possibility to
   change the MTU.
o  A big MTU may negatively influence buffer efficiency - see below.
o  The majority of vendors set the default MTU to 1500B (with
   variations in what is counted inside the MTU). It is time-consuming
   to change the MTU on a production network.
o  Some hosts (especially for storage traffic in Data Centers) could
   use a 2500B or 9000B MTU, which challenges the possibility of
   always having a bigger MTU in the backbone.

Cost-optimized equipment architectures (especially common in switches, but applicable to many routers as well) may not split packets across buffer memory cells, so a small packet occupies the bigger buffer space reserved for a packet of maximum MTU size. This limitation effectively decreases the number of packets that can be buffered. Most host packets are still limited to 1500B, so a 9000B MTU would waste buffer memory, with an efficiency of 1/6. The average packet size is about half of that again, hence in the worst case buffer efficiency could drop to 1/12. Buffer memory is about 30% of the router cost, and increasing the buffer memory cost 12 times is not acceptable. Hence, in many cases it does not make sense to increase the MTU to the maximum supported by the switch or router. One should always check with the vendor the impact of a big MTU on buffering for the particular product. The MTU should be increased to a number bigger than the maximum MTU expected from hosts, plus the size of all possible network overhead, plus the underlay IPv6 header (if present); a short sketch later in this subsection illustrates this arithmetic.
A 9000B MTU makes sense in a DC or cross-DC environment, or on platforms that split packets into smaller cells in buffer memory.

[MTU issues in Tunneling] Section 3.3 discusses the opposite solution: decrease the MTU on links to hosts to be sure that a host always generates packets small enough for the backbone. This solution was possible for a small tunnel overhead, but now we are talking about situations where a 220B margin is not enough.

[L3VPN] and [EVPN] attach an additional label and could create oversized packets. Moreover, the MPLS header cannot point back to the original MPLS router that attached the service label. Additionally, a VPN IP packet could use private address space or no IP address at all (for EVPN). This blocks the possibility to properly organize the PMTUD process. Hence, [L3VPN] and [EVPN] have been developed under the assumption that all MTUs on the path are expanded by at least the 8 bytes needed for services over the MPLS data plane.
The recent [Generic Delivery Functions] may permit fragmentation for MPLS services, but it is still a personal draft.

[Pseudowire Fragmentation] is the rare case where fragmentation is available over MPLS, for one type of service.
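Returning to the MTU-sizing and buffer-efficiency discussion above, a minimal sketch (Python; the function names, the 40-octet underlay header, and the example figures are illustrative assumptions, not recommendations):

   def required_backbone_mtu(host_mtu, overlay_overhead, outer_ipv6=40):
       """A backbone link must carry the largest host packet plus every
       header the network itself may add (overlay/service headers and,
       if present, an underlay IPv6 header)."""
       return host_mtu + overlay_overhead + outer_ipv6

   def buffer_efficiency(avg_pkt_size, configured_mtu):
       """If the platform reserves one maximum-size cell per packet,
       only avg_pkt_size/configured_mtu of the buffer memory is
       actually used."""
       return avg_pkt_size / configured_mtu

   # Hosts limited to 1500B and 180B of expected overlay overhead:
   print(required_backbone_mtu(1500, 180))    # -> 1720
   # Average packet of ~750B against a 9000B per-packet reservation:
   print(buffer_efficiency(750, 9000))        # -> ~0.083 (about 1/12)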
[VxLAN] Section 4.3 uses the same approach: "it is RECOMMENDED that the MTUs across the physical network infrastructure be set to a value that accommodates the larger frame size due to the encapsulation".

Packet drop statistics and the considerable activity in the IETF prove that the PMTUD problem persists.

"Raise the MTU on transit links" is the best solution, when it is available.

3.2. Frugal usage of Extension Headers

Some new functionality (especially source routing with a big SID stack) could decrease header size without a big loss of functionality (for example, by using loose node appointment in the SID stack). Some functionality (like iFIT or iOAM) could be completely omitted in situations that would otherwise lead to packet drops. It is effectively a tradeoff of functionality against PMTU headroom.

The important point here is that a transit node attaching an additional header should be aware of all MTUs along the assumed packet path, to predict how big an addition is still acceptable (see the sketch at the end of this subsection).

[PMTUD] is readily available for tunneling interfaces - the tunnel source should be aware of the PMTU of the tunnel (through PTB feedback messages). But there are cases when this is not enough:

o  An SDN controller (or a management system in general) could assist
   in provisioning extension headers (including SFC, iOAM, BIER) and
   encapsulation headers (SRv6, VxLAN) - there should be a way to
   report MTUs to the controller.
o  ICMPv6 PTB would be directed to the transit control plane only in
   the case of problems inside the tunnel. PTB messages from outside
   of the tunnel would be directed to the source node. It is difficult
   to snoop PTB messages on transit nodes.

Hence, we see many initiatives to collect and manage MTU information in popular protocols for routing and traffic engineering: [PMTU by ISIS], [PMTU by BGP-LS], [PMTU by PCEP], [PMTU by SR-Policy].

Moreover, these protocol extensions would become even more useful in the future, when it will no longer be possible to squeeze all extension headers into 220B. Frugal attachment of new headers on transit nodes would increase the need for PMTU awareness; it should stimulate MTU collection by all other popular protocols (OSPF, normal BGP on peering borders).

This approach has a fundamental problem: full knowledge of all MTUs in the domain does not help to estimate the real path of a packet, because of the massive ECMP used by many networks (at least by all carriers). Non-routing protocols do not have a proper engine to estimate traffic paths and predict the PMTU either. Even more, if L2 ECMP is used or some links are rented from another carrier, it is again impossible to predict the exact path and the PMTU.

The second problem of this approach could be classified as "chicken and egg". We already have a much better solution for MTU drops: increase the MTU (see the previous section). We are looking for other solutions only because upgrading equipment (to a better MTU) is not possible for some reason. But introducing new protocols would also demand an equipment upgrade, making frugal headers less valuable. However, an upgrade of the control plane should be cheaper than an upgrade of the data plane, if the vendor supports such an approach.

Hence, the solution discussed in this section has only limited applicability.
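The kind of decision a PMTU-aware attaching node would have to make (mentioned at the start of this subsection) can be sketched as follows; this is a purely hypothetical policy in Python, with made-up header names and sizes, only to illustrate the tradeoff of functionality against the remaining PMTU budget:

   def select_optional_headers(pkt_len, predicted_pmtu, candidates):
       """Attach optional headers in priority order while the packet
       still fits the predicted PMTU; frugally drop whatever does not.
       candidates is a list of (name, size_in_octets) pairs."""
       attached, size = [], pkt_len
       for name, hdr_len in candidates:
           if size + hdr_len <= predicted_pmtu:
               attached.append(name)
               size += hdr_len
       return attached, size

   # Hypothetical example: a 1400B packet, a predicted PMTU of 1500B,
   # and three optional additions in decreasing order of importance.
   print(select_optional_headers(
       1400, 1500,
       [("outer IPv6 + SRH with 2 SIDs", 80),
        ("iOAM trace option", 32),
        ("extra telemetry", 24)]))
   # -> (['outer IPv6 + SRH with 2 SIDs'], 1480)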
3.3. Fragmentation and reassembly at the tunnel ends

The tunnel source behaves like a host with respect to the tunnel header. It is possible to properly adjust the PMTU for the tunnel by [PMTUD], so it is potentially possible to fragment all packets bigger than the PMTU.

[IP Encapsulation] is the earliest standard for IP-in-IP encapsulation. Its Section 5.1 discusses that it is possible to fragment IP packets before tunnel encapsulation, so there is no need to reassemble packets at the other tunnel end; reassembly can happen on the destination host. This has no additional cost implications for the tunnel ends. This approach worked for IPv4 when the "don't fragment" bit was cleared. It fully contradicts the IPv6 architecture, which does not permit fragmenting packets in transit; no standard has risked proposing such a solution for IPv6.

Some standards do propose IPv6 fragmentation (primarily for packets of 1280B and below), but fragmentation is recommended after encapsulation. This leads to packet reassembly at the other tunnel end, to hide the fact of transit fragmentation from the destination host. It minimizes the disruption to the IPv6 architecture.

Many standards discussed below ([MPLS Encapsulation], [L2TPv3], [VxLAN], [NVO3]) forgot to mention that packets of 1280B and below should be fragmented. This inaccuracy has not created any problem in production networks, because we typically have 220B for all headers, which is big enough for many tunnels nested into each other. The situation could change in the coming years because of the expansion of Extension Headers by different functions. It could create pressure to come back to many mature standards and clarify the situation: what to do when a 1280B packet cannot go through the tunnel.

Fragmentation has a few issues that make it unpopular:

o  Fragmentation could double buffer requirements (assuming a split
   into only 2 fragments). We can ignore the small additional buffer
   requirement for packets that may be lost and need to wait some time
   before reassembly; the Internet is not productive anyway after a
   few percent of packet drops. Buffer memory is about 30% of the
   router cost, and a 30% cost increase would not be accepted by the
   majority of owners. Albeit, some middleboxes already have enough
   buffer memory that could be reused for packet reassembly.
o  In general, the IPv6 architecture does not approve of fragmentation
   in transit in any standard (except the recent draft [IP Tunnels] -
   see below). [PMTUD] Section 5.1: "packetization layers are
   encouraged to avoid sending messages that will require
   fragmentation". This section discusses some situations when tunnel
   fragmentation is inevitable.
o  [Fragile Fragmentation] has a good collection of all the problems
   related to fragmentation (in addition to the above: it breaks ECMP,
   stateful processing, and policy routing, and has many security
   attack vectors). [Fragile Fragmentation] strongly recommends
   avoiding fragmentation, but does not deprecate it yet.

The primary RFC for tunneling is [IPv6 Tunneling]; it is the oldest standard and was later reused by many other standards (including the latest SRH). It permits fragmentation only when the original packet is already minimal (1280B or less) - see its Section 7.1. It mandates dropping the packet and signaling ICMPv6 PTB to the source (a request to decrease the packet size at the source) for all other cases.
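For that single permitted case, the fragment sizes follow directly from the IPv6 fragmentation rules of [IPv6]. A minimal sketch (Python; the 40-octet outer header and the function name are illustrative assumptions):

   def ipv6_fragment_sizes(fragmentable_len, pmtu, unfrag_len=40):
       """On-wire sizes of the IPv6 fragments carrying fragmentable_len
       octets of payload, given the path MTU and the length of the
       per-fragment unfragmentable part (here just the outer IPv6
       header).  Every fragment except the last carries a multiple of
       8 octets, plus an 8-octet Fragment header."""
       FRAG_HDR = 8
       max_chunk = (pmtu - unfrag_len - FRAG_HDR) // 8 * 8
       sizes, left = [], fragmentable_len
       while left > 0:
           chunk = min(max_chunk, left)
           sizes.append(unfrag_len + FRAG_HDR + chunk)
           left -= chunk
       return sizes

   # A 1280B original packet behind a 40B outer IPv6 header, crossing a
   # tunnel whose path MTU is itself only 1280B - the one case where
   # [IPv6 Tunneling] permits fragmentation:
   print(ipv6_fragment_sizes(1280, 1280))   # -> [1280, 96]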
[MPLS Encapsulation] Section 5.1 is titled "Preventing Fragmentation and Reassembly". It stresses again: "IPv6 intermediate nodes do not perform fragmentation in any event".

[L2TPv3] Section 4.1.4 has a similar comment: "Note that IPv6 does not support "in-flight" fragmentation of data packets".

[VxLAN] Section 4.3 is strict: "VTEPs MUST NOT fragment VXLAN packets."

[NVO3] Section 4.4.4 is strict too: "It is strongly RECOMMENDED that Path MTU Discovery ([PMTUD]) be used to prevent or minimize fragmentation."

[IPv6 GRE] Section 3.3 recommends fragmentation only for packets that are smaller than 1280B.

The most recent draft covering all types of tunnels is [IP Tunnels]. It is already referenced by many IETF documents. It is complicated to cover all use cases (any IP over any IP in any situation), but the net result is that a much bigger share of the traffic is proposed to be fragmented into the tunnel. Section 3.3: "The path between ingress and egress interfaces has a path MTU, but the endpoints can exchange messages as large as can be reassembled at the destination (egress interface), i.e., the EMTU_R of the egress interface".
A short explanation of the proposed functionality: the original host would try to transmit the biggest flows (by volume) at the maximum PMTU, which the tunnel source would not try to correct by PTB messages for sizes up to 1500B. Hence, the tunnel source would have no option except to fragment. The principal problem here is the absence of PTB messages for packet sizes between the PMTU and the statically appointed EMTU_R. Let's see how this has been formulated in more detail.
[IP Tunnels] introduces a new variable, "Tunnel MTU", that should not change as a result of PMTUD. The procedure to change the "Tunnel MTU" is out of the draft's scope; it is pushed to the specifications of particular tunnels in the last paragraph of Section 4.2.2. Moreover, it is even assumed that PLPMTUD could be used on the router for "Tunnel MTU" discovery, because this parameter is considered to be above the network layer (like the transport layer on a host). A separate Section 4.2.3 is dedicated to explaining that the newly introduced "Tunnel MTU" cannot be adjusted dynamically. There is a recommendation for the default "Tunnel MTU": the typical host EMTU_R (1500B) minus the tunnel outer header overhead. A good question could be: if it is so difficult to manage the "Tunnel MTU" dynamically, then why was this variable introduced?
The MTU of the tunnel is renamed the MAP (maximum atomic packet); the MAP should be corrected by PMTUD feedback from inside the tunnel. Section 4.2.2 states that everything up to the "Tunnel MTU" should be accepted into the tunnel, and one long packet (with inner and outer headers) should be created. It should then be split into fragments below the MAP size.
Initially, the "Tunnel MTU" and the MAP could be manually synchronized by the administrator (with the difference being the tunnel overhead). But any additional overhead on the tunnel path (a nested tunnel, the smallest Extension Header) would result in PMTUD decreasing the MAP while the "Tunnel MTU" stays unchanged. That would turn on fragmentation for all bulk traffic.
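A small numeric sketch (Python; the 40-octet overhead, the function name, and the example figures are assumptions for illustration) shows how this interaction turns fragmentation on for every full-size packet:

   import math

   def outer_fragments(inner_len, tunnel_mtu, map_size, overhead=40):
       """Fragments emitted by an [IP Tunnels] ingress: packets up to
       the static "Tunnel MTU" are accepted and encapsulated, and the
       resulting outer packet is split into fragments no larger than
       the MAP learned by PMTUD inside the tunnel."""
       if inner_len > tunnel_mtu:
           return 0                   # dropped, PTB sent to the source
       if inner_len + overhead <= map_size:
           return 1                   # fits, no fragmentation
       chunk = (map_size - overhead - 8) // 8 * 8   # 8B Fragment header
       return math.ceil(inner_len / chunk)

   # "Tunnel MTU" of 1460B (1500B host packets minus 40B overhead) and
   # a MAP that starts at 1500B: full-size packets pass untouched.
   print(outer_fragments(1460, 1460, 1500))   # -> 1
   # A nested tunnel shrinks the MAP to 1460B; the static "Tunnel MTU"
   # does not follow, so every full-size packet is now fragmented:
   print(outer_fragments(1460, 1460, 1460))   # -> 2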
This situation is quite probable now (see [Huston-2020] on the MTUs available on the Internet) and it will be even more probable in the future, when many additional extension headers are used. Hence, the requirement in Section 5.3.1, "do NOT try to deprecate fragmentation", is indeed important.
Section 3.6 takes the same approach as all other standards on the question of when fragmentation should happen: "this document assumes that only outer fragmentation is viable because it is the only approach that works for both IPv4 datagrams with DF=1 and IPv6".
A considerable increase in fragmentation is thus proposed for reasons of academic purity: the router part of the router should behave as a router, and the host part of the router should behave as a host, without any deviations.
The additional fragmentation would create all of the problems discussed in [Fragile Fragmentation] and substantially increase the cost of tunnel endpoints. There is a high probability that the draft [IP Tunnels] would be rejected by the market for cost reasons.

Additionally, we should point out that the drop statistics for fragmented packets on the Internet are still high (7%) - see [Huston-2021].

Fragmentation is the least probable solution for oversized packet drops.

3.4. PMTUD by original packet source

[PMTUD] is mandatory in the IPv6 architecture, because IPv6 does not have fragmentation in transit. Many RFCs recommend not to block ICMPv6 PTB completely (it may be rate-limited - see [ICMPv6] Section 2.4). [DPLPMTUD] Section 1.1 has a very good collection of reasons why a PTB message may not be delivered to the source; it is used as justification to augment PMTUD by [DPLPMTUD].

We should not see this problem for non-tunneling protocols in the majority of environments. ICMPv6 PTB should be delivered to the packet source, and the packet source would dynamically decrease the PMTU to adapt to the new reality. The PMTU could change dynamically because some transit node could introduce an additional extension header ad hoc, or ECMP could switch the flow to a different path.

[IPv6 Tunneling] mandates that tunnel ends relay ICMPv6 PTB messages received from inside the tunnel. [IPv6 Tunneling] does not use the "relay" terminology, but its Section 8 explains in detail how to reconstruct and retransmit ICMP messages to the original packet source (deleting all tunnel-related information).
[MTU issues in Tunneling] Section 3.2 discusses the same approach.
[L2TPv3] Section 4.1.4 refers to [IPv6 Tunneling]; we can treat it as a request for PTB message relay too.
[SRH] Section 5.4 confirms full adherence to the ICMPv6 PTB relay approach: "For IP packets encapsulated in an outer IPv6 header, ICMP error handling is as defined in [IPv6 Tunneling]".

[VxLAN] Section 4.3 proposes to use PMTUD: "Path MTU discovery MAY be used to address this requirement as well".
[NVO3] Section 4.4.4 assumes PMTUD too: "It is strongly RECOMMENDED that Path MTU Discovery ([PMTUD]) be used to prevent or minimize fragmentation".
[IPSec] Section 8.2.1 requests that the PMTU be maintained for the tunnel and signaled to the original packet source as soon as any new packet arrives.
[IPv6 GRE] Section 3.3 clearly instructs developers to drop oversized packets and send PTB for packets bigger than the tunnel MTU. The method of PMTU detection is fully IPv6 compliant: "the GMTU is equal to the PMTU associated with the path between the GRE ingress and the GRE egress, minus the GRE overhead".
[MPLS Encapsulation] Section 5.1 specifies the same approach: the tunnel head-end should use [PMTUD] to learn the tunnel MTU, and then "the packet will have to be discarded, but the tunnel head should send the IP source of the discarded packet the proper ICMP error message".

[VxLAN], [NVO3], [IPSec], [IPv6 GRE], and [MPLS Encapsulation] do not request that the tunnel endpoint relay PTB messages. PMTUD should be used to set the proper MTU for the tunnel; subsequent packets would then trigger PTB messages to the packet source. Compared with the original [IPv6 Tunneling] relay approach, this adds one round-trip delay for the first PTB message. This small deficiency could be partially explained by the desire of many standards to be universal for IPv6 as well as IPv4. As a reminder, IPv4 may not have enough information in the ICMP message to properly reconstruct a relay message (only 64 bits of the source packet, per RFC 792).

[IP Tunnels] is the only draft that contradicts [IPv6 Tunneling] (and every protocol built on top of it): it clearly prohibits relaying PTB messages. It states in Section 3.3: "When such messages (PTB) arrive at the ingress interface ("ingress interface" is the tunnel interface in this draft), they may affect the properties of that interface (e.g., its MTU), but they should never directly cause new ICMPs in the outer network". This idea is generalized in Section 5.1 as "ICMP messages MUST NOT be generated by the tunnel (as a link)". The motivation assumed in the draft is to fully mimic host behavior on the router's virtual (tunnel) interface, because a host would not retranslate PTB messages.

The "Flow Label" is gaining popularity. [IPv6 Tunneling] and [ICMPv6] do not have strong recommendations for the "Flow Label"; it was not an important topic at that time. The only small improvement that makes sense for [IPv6 Tunneling] is to recommend copying the "Flow Label" from the source packet to the tunnel packet and from the source packet to the ICMPv6 PTB message. It would permit load-balancing PTB messages onto the same path as the original traffic - see the problem described in [ICMPv6 PTB in ECMP] about hash-based load balancing between many hosts. Copying the "Flow Label" into the PTB message would contradict neither the IPv6 architecture nor any RFC; it is not mandatory to develop a special standard update for it.

[MTU issues in Tunneling] Section 3.2 raises the concern that, in the case of Lawful Intercept, additional encapsulation could produce PTB messages that would reveal this fact to the monitored host. It is not a very realistic concern, because the PMTU could change for many other reasons (especially with the proliferation of new protocols). If it is still a concern, then it makes sense to use another solution for this case: a bigger MTU (better) or even fragmentation.

[MTU issues in Tunneling] Section 3.2 also raises the question of the applicability of "MSS clamping". A transit node could snoop the transport layer and change the MSS exchanged between nodes. This "hack" is not recommended because it breaks the layered model of the IETF and OSI.
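For illustration of the relay behavior discussed above (including the Flow Label copy), a rough sketch using the Python scapy library is shown below. The 40-octet tunnel overhead and all names are assumptions; a real implementation would live in the router's data or control plane rather than in Python:

   from scapy.all import IPv6, ICMPv6PacketTooBig, Raw  # pip install scapy

   IPV6_MIN_MTU = 1280   # RFC 8200 minimum link MTU
   OUTER_HDR    = 40     # assumed tunnel overhead: one outer IPv6 header

   def relay_ptb(ptb_from_tunnel, tunnel_ingress_addr):
       """Rebuild a Packet Too Big message for the original source, in
       the spirit of [IPv6 Tunneling] section 8: strip the tunnel
       header from the quoted packet, reduce the reported MTU by the
       tunnel overhead, and copy the Flow Label of the original packet
       so the relayed PTB follows the same ECMP path as the flow it
       reports on (see [ICMPv6 PTB in ECMP])."""
       quoted   = bytes(ptb_from_tunnel[ICMPv6PacketTooBig].payload)
       original = quoted[OUTER_HDR:]            # original (inner) packet
       inner    = IPv6(original)
       new_mtu  = ptb_from_tunnel[ICMPv6PacketTooBig].mtu - OUTER_HDR
       return (IPv6(src=tunnel_ingress_addr, dst=inner.src, fl=inner.fl)
               / ICMPv6PacketTooBig(mtu=new_mtu)
               / Raw(original[:IPV6_MIN_MTU - 40 - 8]))  # stay below 1280B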
[PMTUD] is the only mechanism that is universal for all cases and fully compliant with the IPv6 architecture. Vendors just need to use it, despite some challenges in relaying PTB messages at tunnel ends. Moreover, it makes sense to standardize the relay of PTB messages at tunnel ends; it would improve the PMTUD time at the original traffic source by one round-trip time.
[IPv6] states: "It is strongly recommended that IPv6 nodes implement Path MTU Discovery [PMTUD]".

3.5. Packetization Layer MTU Discovery

[PLPMTUD] and [DPLPMTUD] have been developed considerably in recent years. The Packetization Layer (UDP/TCP) (1) has much more visibility (it can see the size of transport-layer buffers); (2) can operate in the absence of ICMPv6 PTB (too much filtering); and (3) can be very granular (per-flow). It does have its use cases.

Albeit, PLPMTUD/DPLPMTUD have their restrictions: they (1) are not universal for all transport protocols; (2) need more resources from the host; (3) make it challenging to share PMTU information between applications; (4) need many more round-trip times to find a suitable PMTU; and (5) do not work well on congested paths (it is difficult to understand the reason for packet loss).

Hence, PLPMTUD is not a replacement for PMTUD - both are needed. As a reminder, from [PLPMTUD]: "Packetization Layer Path MTU Discovery (PLPMTUD) is most efficient when used in conjunction with the ICMP-based Path MTU Discovery".

PLPMTUD could act as a replacement for PMTUD in the worst-case scenario (ICMP is filtered). It would lead to a PMTU decrease on the original host too. PLPMTUD could be considered a redundancy mechanism for PMTUD.

Strictly speaking, [PMTU by HbH] is a network-layer mechanism, not a packetization-layer one. It is mentioned in this section because its usage is very similar to PLPMTUD; [PMTU by HbH] could be considered to some degree an extension of PLPMTUD. It is not expected to fundamentally change the conclusions of this document.

4. Conclusion

It is better not to have a problem with oversized packets in the first place. One should upgrade all links to a bigger MTU, if possible.

A host could have an MTU as big as a transit node's. It will never be possible to deprecate PMTUD. It is important to follow the recommendations of [PMTUD] and [IPv6 Tunneling] for ICMPv6 PTB message delivery to the original traffic source. Tunnel sources should perform the relay function to make sure that the original traffic source gets the PTB message faster.

The temporary 220B limit for all headers pushes us toward a frugal implementation of new extension headers. This limit would be alleviated after all backbone links are upgraded to an MTU much bigger than 1500B. Additional protocols to collect MTU information could help, during the transition period, to attach additional headers frugally. This is true for all new protocols: SRv6, SFC, BIER, iOAM, APN6.

[PLPMTUD] and [DPLPMTUD] are not a replacement for [PMTUD], but they could help in some scenarios.

Fragmentation is not at all a solution for oversized packet drops.

5. Security Considerations

[PMTUD], [PLPMTUD], [DPLPMTUD], and [Fragile Fragmentation] discuss a number of attack vectors. This document does not introduce additional security vulnerabilities.

6. IANA Considerations

This document has no request to IANA.

7. References
7.1. Normative References

   [IPv6] S. Deering, R. Hinden, "Internet Protocol, Version 6 (IPv6)
      Specification", RFC 8200, DOI 10.17487/RFC8200, July 2017.

   [ICMPv6] A. Conta, S. Deering, M. Gupta, "Internet Control Message
      Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6)
      Specification", RFC 4443, DOI 10.17487/RFC4443, March 2006.

   [PMTUD] J. McCann, S. Deering, J. Mogul, R. Hinden, "Path MTU
      Discovery for IP version 6", RFC 8201, DOI 10.17487/RFC8201,
      July 2017.

   [IPv6 Tunneling] A. Conta, S. Deering, "Generic Packet Tunneling in
      IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473, December
      1998.

   [ICMPv6 PTB in ECMP] M. Byerly, M. Hite, J. Jaeggli, "Close
      Encounters of the ICMP Type 2 Kind", RFC 7690, DOI
      10.17487/RFC7690, January 2016.

   [MTU issues in Tunneling] P. Savola, "MTU and Fragmentation Issues
      with In-the-Network Tunneling", RFC 4459, DOI 10.17487/RFC4459,
      April 2006.

   [IP Tunnels] J. Touch, M. Townsley, "IP Tunnels in the Internet
      Architecture", draft-ietf-intarea-tunnels-10 (work in progress),
      September 2019.

   [IP Encapsulation] C. Perkins, "IP Encapsulation within IP", RFC
      2003, DOI 10.17487/RFC2003, October 1996.

   [IPSec] S. Kent, K. Seo, "Security Architecture for the Internet
      Protocol", RFC 4301, DOI 10.17487/RFC4301, December 2005.

   [IPv6 GRE] C. Pignataro, R. Bonica, S. Krishnan, "IPv6 Support for
      Generic Routing Encapsulation (GRE)", RFC 7676, DOI
      10.17487/RFC7676, October 2015.

   [MPLS Encapsulation] T. Worster, Y. Rekhter, E. Rosen,
      "Encapsulating MPLS in IP or Generic Routing Encapsulation
      (GRE)", RFC 4023, DOI 10.17487/RFC4023, March 2005.

   [L2TPv3] J. Lau, M. Townsley, I. Goyret, "Layer Two Tunneling
      Protocol - Version 3 (L2TPv3)", RFC 3931, DOI 10.17487/RFC3931,
      March 2005.

   [VxLAN] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T.
      Sridhar, M. Bursell, C. Wright, "Virtual eXtensible Local Area
      Network (VXLAN): A Framework for Overlaying Virtualized Layer 2
      Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348,
      August 2014.

   [NVO3] J. Gross, I. Ganga, T. Sridhar, "Geneve: Generic Network
      Virtualization Encapsulation", RFC 8926, DOI 10.17487/RFC8926,
      November 2020.

   [L3VPN] E. Rosen, Y. Rekhter, "BGP/MPLS IP Virtual Private Networks
      (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 2006.

   [EVPN] A. Sajassi, R. Aggarwal, N. Bitar, A. Isaac, J. Uttaro, J.
      Drake, W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432,
      DOI 10.17487/RFC7432, February 2015.

   [Huston-2016] Huston, G., "Fragmenting IPv6", Blog Post, 2016.

   [Huston-2021] Huston, G., "IPv6 Fragmentation Loss", Article, 2021.

   [Fragile Fragmentation] R. Bonica, F. Baker, G. Huston, R. Hinden,
      O. Troan, F. Gont, "IP Fragmentation Considered Fragile", RFC
      8900, DOI 10.17487/RFC8900, September 2020.

7.2. Informative References

   [PLPMTUD] M. Mathis, J. Heffner, "Packetization Layer Path MTU
      Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007.

   [DPLPMTUD] G. Fairhurst, T. Jones, M. Tuexen, I. Ruengeler, T.
      Voelker, "Packetization Layer Path MTU Discovery for Datagram
      Transports", RFC 8899, DOI 10.17487/RFC8899, March 2020.

   [SRH] C. Filsfils, D. Dukes, S. Previdi, J. Leddy, S. Matsushima,
      D. Voyer, "IPv6 Segment Routing Header (SRH)", RFC 8754, DOI
      10.17487/RFC8754, March 2020.
   [PMTU by HbH] R. Hinden, G. Fairhurst, "IPv6 Minimum Path MTU
      Hop-by-Hop Option", draft-hinden-6man-mtu-option-02 (work in
      progress), July 2019.

   [PMTU by ISIS] Z. Hu, Y. Zhu, Z. Li, L. Dai, "IS-IS Extensions for
      Path MTU", draft-hu-lsr-isis-path-mtu-00 (work in progress),
      June 2018.

   [PMTU by PCEP] S. Peng, C. Li, L. Han, "Support for Path MTU (PMTU)
      in the Path Computation Element (PCE) Communication Protocol
      (PCEP)", draft-li-pce-pcep-pmtu-03 (work in progress), October
      2020.

   [PMTU by BGP-LS] Y. Zhu, Z. Hu, G. Yan, J. Yao, "BGP-LS Extensions
      for Advertising Path MTU", draft-zhu-idr-bgp-ls-path-mtu-05
      (work in progress), November 2020.

   [PMTU by SR-Policy] C. Li, Y. Zhu, A. Sawaf, Z. Li, "Segment
      Routing Path MTU in BGP", draft-li-idr-sr-policy-path-mtu-03
      (work in progress), November 2019.

   [Generic Delivery Functions] Z. Zhang, R. Bonica, K. Kompella,
      "Generic Delivery Functions", draft-zzhang-intarea-generic-
      delivery-functions-01 (work in progress), April 2021.

   [Pseudowire Fragmentation] A. Malis, M. Townsley, "Pseudowire
      Emulation Edge-to-Edge (PWE3) Fragmentation and Reassembly",
      RFC 4623, DOI 10.17487/RFC4623, August 2006.

8. Acknowledgments

Thanks to the v6ops working group for the problem discussion.

Authors' Addresses

   Eduard Vasilenko
   Huawei Technologies
   17/4 Krylatskaya st, Moscow, Russia 121614
   Email: Vasilenko.Eduard@huawei.com

   Xiao Xipeng
   Huawei Technologies
   205 Hansaallee, 40549 Dusseldorf, Germany
   Email: Xipengxiao@huawei.com

   Dmitriy Khaustov
   Rostelecom
   13/2 Nikoloyamskaya st, Moscow, Russia 109240
   Email: Dmitriy.Khaustov@rt.ru