idnits 2.17.1 

draft-bonica-intarea-frag-fragile-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (March 4, 2018) is 2243 days in the past.  Is this
     intentional?


  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'Anderson2001' is defined on line 818, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 4960
     (Obsoleted by RFC 9260)


     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Area WG                                               R. Bonica
3	Internet-Draft                                          Juniper Networks
4	Intended status: Best Current Practice                          F. Baker
5	Expires: September 5, 2018                                  Unaffiliated
6	                                                               G. Huston
7	                                                                   APNIC
8	                                                               R. Hinden
9	                                                    Check Point Software
10	                                                                O. Troan
11	                                                                   Cisco
12	                                                                 F. Gont
13	                                                            SI6 Networks
14	                                                           March 4, 2018

16	                  IP Fragmentation Considered Fragile
17	                  draft-bonica-intarea-frag-fragile-01

19	Abstract

21	   This document provides an overview of IP fragmentation.  It explains
22	   how IP fragmentation works and why it is required.  As part of that
23	   explanation, this document also explains how IP fragmentation reduces
24	   the reliability of Internet communication.

26	   This document also proposes alternatives to IP fragmentation.
27	   Finally, it provides recommendations for application developers and
28	   network operators.

30	Status of This Memo

32	   This Internet-Draft is submitted in full conformance with the
33	   provisions of BCP 78 and BCP 79.

35	   Internet-Drafts are working documents of the Internet Engineering
36	   Task Force (IETF).  Note that other groups may also distribute
37	   working documents as Internet-Drafts.  The list of current Internet-
38	   Drafts is at https://datatracker.ietf.org/drafts/current/.

40	   Internet-Drafts are draft documents valid for a maximum of six months
41	   and may be updated, replaced, or obsoleted by other documents at any
42	   time.  It is inappropriate to use Internet-Drafts as reference
43	   material or to cite them other than as "work in progress."

45	   This Internet-Draft will expire on September 5, 2018.

47	Copyright Notice

49	   Copyright (c) 2018 IETF Trust and the persons identified as the
50	   document authors.  All rights reserved.

52	   This document is subject to BCP 78 and the IETF Trust's Legal
53	   Provisions Relating to IETF Documents
54	   (https://trustee.ietf.org/license-info) in effect on the date of
55	   publication of this document.  Please review these documents
56	   carefully, as they describe your rights and restrictions with respect
57	   to this document.  Code Components extracted from this document must
58	   include Simplified BSD License text as described in Section 4.e of
59	   the Trust Legal Provisions and are provided without warranty as
60	   described in the Simplified BSD License.

62	Table of Contents

64	   1.  Introduction
65	   2.  IP Fragmentation
66	     2.1.  Links, Paths, MTU and PMTU
67	     2.2.  Upper-layer Protocols
68	   3.  Requirements Language
69	   4.  IP Fragmentation Reduces Reliability
70	     4.1.  Middle Box Failures
71	     4.2.  Partial Filtering
72	     4.3.  Suboptimal Load Balancing
73	     4.4.  Security Vulnerabilities
74	     4.5.  Blackholing Due to ICMP Loss
75	     4.6.  Blackholing Due To Filtering
76	   5.  Alternatives to IP Fragmentation
77	     5.1.  Transport Layer Solutions
78	     5.2.  Application Layer Solutions
79	   6.  Applications That Rely on IPv6 Fragmentation
80	     6.1.  DNS
81	     6.2.  OSPFv3
82	     6.3.  IP Encapsulations
83	   7.  Recommendation
84	     7.1.  For Application Developers
85	     7.2.  For Network Operators
86	   8.  IANA Considerations
87	   9.  Security Considerations
88	   10. Acknowledgements
89	   11. References
90	     11.1.  Normative References
91	     11.2.  Informative References
92	   Appendix A.  Contributors' Address
93	   Authors' Addresses

95	1.  Introduction

97	   Operational experience [RFC7872] [Huston] reveals that IP
98	   fragmentation reduces the reliability of Internet communication.
99	   This document provides an overview of IP fragmentation.  It explains
100	   how IP fragmentation works and why it is required.  As part of that
101	   explanation, this document also explains how IP fragmentation reduces
102	   the reliability of Internet communication.

104	   This document also proposes alternatives to IP fragmentation.
105	   Finally, it provides recommendations for application developers and
106	   network operators.

108	2.  IP Fragmentation

110	2.1.  Links, Paths, MTU and PMTU

112	   An Internet path connects a source node to a destination node.  A
113	   path can contain links and intermediate systems.  If a path contains
114	   more than one link, the links are connected in series and an
115	   intermediate system connects each link to the next.  An intermediate
116	   system can be a router or a middle box.

118	   Internet paths are dynamic.  Assume that the path from one node to
119	   another contains a set of links and intermediate systems.  If the
120	   network topology changes, that path can also change so that it
121	   includes a different set of links and intermediate systems.

123	   Each link is constrained by the number of bytes that it can convey in
124	   a single IP packet.  This constraint is called the link Maximum
125	   Transmission Unit (MTU).  IPv4 [RFC0791] requires every link to have
126	   an MTU of 68 bytes or greater.  IPv6 [RFC8200] requires every link to
127	   have an MTU of 1280 bytes or greater.  These are called the IPv4 and
128	   IPv6 minimum link MTUs.

130	   Each Internet path is constrained by the number of bytes that it can
131	   convey in a single IP packet.  This constraint is called the Path MTU
132	   (PMTU).  For any given path, the PMTU is equal to the smallest of its
133	   link MTUs.  Because Internet paths are dynamic, PMTU is also
134	   dynamic.
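
   The relationship between link MTUs and PMTU can be illustrated with
   a minimal sketch.  The function name and the example link MTUs are
   assumptions chosen for illustration; they do not describe any real
   path.

```python
def path_mtu(link_mtus):
    """Return the PMTU (in bytes) of a path, given the MTU of each link."""
    if not link_mtus:
        raise ValueError("a path must contain at least one link")
    # The PMTU is the smallest link MTU along the path.
    return min(link_mtus)


# Hypothetical three-link path; the second link is the bottleneck.
print(path_mtu([1500, 1400, 9000]))  # 1400
```

   If a topology change replaces the 1400-byte link, the same
   computation over the new set of links yields a different PMTU.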

136	   For reasons described below, source nodes estimate the PMTU between
137	   themselves and destination nodes.  A source node can produce
138	   extremely conservative PMTU estimates in which:

140	   o  The estimate for each IPv4 path is equal to the IPv4 minimum link
141	      MTU (68 bytes).

143	   o  The estimate for each IPv6 path is equal to the IPv6 minimum link
144	      MTU (1280 bytes).

146	   While these conservative estimates are guaranteed to be less than or
147	   equal to the actual PMTU, they are likely to be much less than the
148	   actual PMTU.  This may adversely affect upper-layer protocol
149	   performance.

151	   By executing Path MTU Discovery (PMTUD) [RFC1191] [RFC8201]
152	   procedures, a source node can maintain a less conservative, running
153	   estimate of the PMTU between itself and a destination node.
154	   According to these procedures, the source node produces an initial
155	   PMTU estimate.  This initial estimate is equal to the MTU of the
156	   first link along the path to the destination node.  It can be greater
157	   than the actual PMTU.

159	   Having produced an initial PMTU estimate, the source node sends non-
160	   fragmentable IP packets to the destination node.  If one of these
161	   packets is larger than the actual PMTU, a downstream router will not
162	   be able to forward the packet through the next link along the path.
163	   Therefore, the downstream router drops the packet and sends an
164	   Internet Control Message Protocol (ICMP) [RFC0792] [RFC4443] Packet
165	   Too Big (PTB) message to the source node.  The ICMP PTB message
166	   indicates the MTU of the link through which the packet could not be
167	   forwarded.  The source node uses this information to refine its PMTU
168	   estimate.

170	   PMTUD produces a running estimate of the PMTU between a source node
171	   and a destination node.  Because PMTU is dynamic, at any given time,
172	   the PMTU estimate can differ from the actual PMTU.  In order to
173	   detect PMTU increases, PMTUD occasionally resets the PMTU estimate to
174	   the MTU of the first link along the path to the destination node.  It
175	   then repeats the procedure described above.
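
   The PMTUD procedure described above can be sketched as a small
   simulation.  This is an illustration only: the function names and
   the list-of-link-MTUs model are assumptions, and the sketch assumes
   that every ICMP PTB message reaches the source node.

```python
def send(packet_len, link_mtus):
    """Simulate forwarding one non-fragmentable packet along a path.

    Returns None if the packet is delivered.  Otherwise returns the MTU
    that the dropping router would report in its ICMP PTB message.
    """
    for mtu in link_mtus:
        if packet_len > mtu:
            return mtu  # router drops the packet and reports this MTU
    return None


def pmtud(link_mtus):
    """Return (final_estimate, history_of_estimates) for a static path."""
    estimate = link_mtus[0]  # initial estimate: MTU of the first link
    history = [estimate]
    while True:
        ptb_mtu = send(estimate, link_mtus)
        if ptb_mtu is None:
            return estimate, history
        estimate = ptb_mtu  # refine the estimate from the PTB message
        history.append(estimate)


print(pmtud([1500, 1400, 1280, 1500]))  # (1280, [1500, 1400, 1280])
```

   In this model, each PTB message lowers the estimate to the MTU of
   the next bottleneck link.  If PTB messages were lost, the loop would
   receive no feedback and the estimate would never be refined.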

177	   Furthermore, PMTUD has the following characteristics:

179	   o  It relies on the network's ability to deliver ICMP PTB messages to
180	      the source node.

182	   o  It is susceptible to attack because ICMP messages are easily
183	      forged [RFC5927].

185	   FOOTNOTE: According to RFC 0791, every IPv4 host must be capable of
186	   receiving a packet whose length is equal to 576 bytes.  However, the
187	   IPv4 minimum link MTU is not 576.  Section 3.2 of RFC 0791 explicitly
188	   states that the IPv4 minimum link MTU is 68 bytes.

190	   FOOTNOTE: In the paragraphs above, the term "non-fragmentable packet"
191	   is introduced.  A non-fragmentable packet can be fragmented at its
192	   source.  However, it cannot be fragmented by a downstream node.  An
193	   IPv4 packet whose DF-bit is set to zero is fragmentable.  An IPv4
194	   packet whose DF-bit is set to one is non-fragmentable.  All IPv6
195	   packets are also non-fragmentable.

197	   FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is
198	   introduced.  The ICMP PTB message has two instantiations.  In ICMPv4
199	   [RFC0792], the ICMP PTB message is a Destination Unreachable message
200	   with Code equal to (4) fragmentation needed and DF set.  This message
201	   was augmented by [RFC1191] to indicate the MTU of the link through
202	   which the packet could not be forwarded.  In ICMPv6 [RFC4443], the
203	   ICMP PTB message is a Packet Too Big Message with Code equal to (0).
204	   This message also indicates the MTU of the link through which the
205	   packet could not be forwarded.

207	2.2.  Upper-layer Protocols

209	   When an upper-layer protocol submits data to the underlying IP
210	   module, and the resulting IP packet's length is greater than the
211	   PMTU, IP fragmentation may be required.  IP fragmentation divides a
212	   packet into fragments.  Each fragment includes an IP header and a
213	   portion of the original packet.

215	   [RFC0791] describes IPv4 fragmentation procedures.  An IPv4 packet
216	   whose DF-bit is set to one can be fragmented at its source only; one
217	   whose DF-bit is set to zero can be fragmented at its source or by any
218	   downstream router.  [RFC8200] describes IPv6 fragmentation
219	   procedures.  IPv6 packets can be fragmented at the source node only.

221	   IPv4 fragmentation differs slightly from IPv6 fragmentation.
222	   However, in both IP versions, the upper-layer header appears in the
223	   first fragment only.  It does not appear in subsequent fragments.
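
   A minimal sketch of how an IPv4 packet's payload might be divided
   into fragments follows.  It assumes a 20-byte IPv4 header with no
   options, and the function name is hypothetical.  IPv6 is not
   modeled, because it carries fragmentation information in a separate
   Fragment extension header.

```python
IPV4_HEADER_LEN = 20  # assumption: IPv4 header without options


def fragment_ipv4(payload_len, mtu):
    """Split payload_len bytes of IP payload into fragments for one MTU.

    Returns a list of (offset_in_bytes, data_len, more_fragments).
    Every fragment except the last carries a multiple of 8 bytes,
    because the IPv4 Fragment Offset field counts 8-byte units.
    """
    max_data = (mtu - IPV4_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any fragment data")
    fragments = []
    offset = 0
    while offset < payload_len:
        data_len = min(max_data, payload_len - offset)
        more = offset + data_len < payload_len
        fragments.append((offset, data_len, more))
        offset += data_len
    return fragments


# A 3000-byte IP payload (for example, an 8-byte UDP header plus
# 2992 bytes of data) sent over a link whose MTU is 1500 bytes.
for frag in fragment_ipv4(3000, 1500):
    print(frag)
# (0, 1480, True)
# (1480, 1480, True)
# (2960, 40, False)
```

   Only the fragment at offset zero contains the first bytes of the
   payload, which is where the upper-layer header resides.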

225	   Upper-layer protocols can operate in the following modes:

227	   o  Do not rely on IP fragmentation.

229	   o  Rely on IP source fragmentation only (i.e., fragmentation at the
230	      source node).

232	   o  Rely on IP source fragmentation and downstream fragmentation
233	      (i.e., fragmentation at any node along the path).

235	   Upper-layer protocols running over IPv4 can operate in all of the
236	   above-mentioned modes.  Upper-layer protocols running over IPv6 can
237	   operate in the first and second modes only.

239	   Upper-layer protocols that operate in the first two modes (above)
240	   require access to the PMTU estimate.  In order to fulfil this
241	   requirement, they can:

243	   o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
244	      MTU.

246	   o  Access the estimate that PMTUD produced.

248	   o  Execute PMTUD procedures themselves.

250	   o  Execute Packetization Layer PMTUD (PLPMTUD) [RFC4821]
251	      [I-D.fairhurst-tsvwg-datagram-plpmtud] procedures.

253	   According to PLPMTUD procedures, the upper-layer protocol maintains a
254	   running PMTU estimate.  It does so by sending probe packets of
255	   various sizes to its peer and receiving acknowledgements.  This
256	   strategy differs from PMTUD in that it relies on acknowledgement of
257	   received messages, as opposed to ICMP PTB messages concerning dropped
258	   messages.  Therefore, PLPMTUD does not rely on the network's ability
259	   to deliver ICMP PTB messages to the source.
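
   The probing idea can be sketched as follows.  The binary search
   shown here is only one possible search strategy and is an assumption
   of this sketch, as are the function names.  The probe_acked()
   function is a stand-in for sending a probe packet and waiting for
   the peer's acknowledgement; a real implementation cannot
   distinguish a probe that was too large from one that was lost for
   other reasons.

```python
def probe_acked(size, actual_pmtu):
    """Stand-in for sending a probe of `size` bytes and awaiting an ACK."""
    return size <= actual_pmtu


def plpmtud_search(low, high, actual_pmtu):
    """Return the largest size in [low, high] whose probe is acknowledged.

    Assumes that packets of `low` bytes are already known to be
    delivered (for example, the IPv6 minimum link MTU).
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the range always shrinks
        if probe_acked(mid, actual_pmtu):
            low = mid        # probe acknowledged: estimate can grow
        else:
            high = mid - 1   # no acknowledgement: try smaller probes
    return low


print(plpmtud_search(1280, 1500, actual_pmtu=1400))  # 1400
```

   No ICMP PTB message is consulted at any point; the estimate is
   driven entirely by which probes are acknowledged.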

261	   An upper-layer protocol that does not rely on IP fragmentation never
262	   causes the underlying IP module to emit:

264	   o  A fragmentable IP packet (i.e., an IPv4 packet with the DF-bit set
265	      to zero).

267	   o  An IP fragment.

269	   o  A packet whose length is greater than the PMTU estimate.

271	   However, when the PMTU estimate is greater than the actual PMTU, the
272	   upper-layer protocol can cause the underlying IP module to emit a
273	   packet whose length is greater than the actual PMTU.  When this
274	   occurs, a downstream router drops the packet and the source node
275	   refines its PMTU estimate, employing either PMTUD or PLPMTUD
276	   procedures.

278	   When an upper-layer protocol that relies on IP source fragmentation
279	   only submits data to the underlying IP module, and the resulting
280	   packet is larger than the PMTU estimate, the underlying IP module
281	   fragments the packet and emits the fragments.  However, the upper-
282	   layer protocol never causes the underlying IP module to emit:

284	   o  A fragmentable IP packet.

286	   o  A packet whose length is greater than the PMTU estimate.

288	   When the PMTU estimate is greater than the actual PMTU, the upper-
289	   layer protocol can cause the underlying IP module to emit a packet
290	   whose length is greater than the actual PMTU.  When this occurs, a
291	   downstream router drops the packet and the source node refines its
292	   PMTU estimate, employing either PMTUD or PLPMTUD procedures.

294	   An upper-layer protocol that relies on IP source fragmentation and
295	   downstream fragmentation can cause the underlying IP module to emit:

297	   o  A fragmentable IP packet.

299	   o  An IP fragment.

301	   o  A packet whose length is greater than the PMTU estimate.

303	   A protocol that relies on IP source fragmentation and downstream
304	   fragmentation does not require access to the PMTU estimate.  For
305	   these protocols, the underlying IP module:

307	   o  Fragments all packets whose length exceeds the MTU of the first
308	      link along the path to the destination.

310	   o  Sets the DF-bit to zero, so that downstream nodes can fragment the
311	      packet.

313	3.  Requirements Language

315	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
316	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
317	   "OPTIONAL" in this document are to be interpreted as described in BCP
318	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
319	   capitals, as shown here.

321	4.  IP Fragmentation Reduces Reliability

323	   This section explains how IP fragmentation reduces the reliability of
324	   Internet communication.

326	4.1.  Middle Box Failures

328	   Many middle boxes require access to the transport-layer header.
329	   However, when a packet is divided into fragments, the transport-layer
330	   header appears in the first fragment only.  It does not appear in
331	   subsequent fragments.  This omission can prevent middle boxes from
332	   delivering their intended services.

334	   For example, assume that a router diverts selected packets from their
335	   normal path towards network appliances that support deep packet
336	   inspection and lawful intercept.  The router selects packets for
337	   diversion based upon the following 5-tuple:

339	   o  IP Source Address.

341	   o  IP Destination Address.

343	   o  IPv4 Protocol or IPv6 Next Header.

345	   o  transport-layer source port.

347	   o  transport-layer destination port.

349	   IP fragmentation causes this selection algorithm to behave
350	   suboptimally, because the transport-layer header appears only in the
351	   first fragment of each packet.

353	   In another example, a middle box remarks a packet's Differentiated
354	   Services Code Point [RFC2474] based upon the above-mentioned 5-tuple.
355	   IP fragmentation causes this process to behave suboptimally, because
356	   the transport-layer header appears only in the first fragment of each
357	   packet.

359	   In all of the above-mentioned examples, the middle box cannot deliver
360	   its intended service without reassembling fragmented packets.

362	4.2.  Partial Filtering

364	   IP fragments cause problems for firewalls whose filter rules include
365	   decision making based on TCP and UDP ports.  Because port information
366	   is absent from trailing fragments, the firewall has three options.
367	   It can accept all trailing fragments, which may admit certain classes
368	   of attack.  It can block all trailing fragments, which may block
369	   otherwise legitimate traffic.  Or it can reassemble all fragmented
370	   packets, which may be inefficient and negatively affect performance.

372	4.3.  Suboptimal Load Balancing

374	   Many stateless load-balancers require access to the transport-layer
375	   header.  Assume that a load-balancer distributes flows among parallel
376	   links.  In order to optimize load balancing, the load-balancer sends
377	   every packet or packet fragment belonging to a flow through the same
378	   link.

380	   In order to assign a packet or packet fragment to a link, the load-
381	   balancer executes an algorithm.  If the packet or packet fragment
382	   contains a transport-layer header, the load balancing algorithm
383	   accepts the following 5-tuple as input:

385	   o  IP Source Address.

387	   o  IP Destination Address.

389	   o  IPv4 Protocol or IPv6 Next Header.

391	   o  transport-layer source port.

393	   o  transport-layer destination port.

395	   However, if the packet or packet fragment does not contain a
396	   transport-layer header, the load balancing algorithm accepts only the
397	   following 3-tuple as input:

399	   o  IP Source Address.

401	   o  IP Destination Address.

403	   o  IPv4 Protocol or IPv6 Next Header.

405	   Therefore, non-fragmented packets belonging to a flow can be assigned
406	   to one link while fragmented packets belonging to the same flow can
407	   be divided between that link and another.  This can cause suboptimal
408	   load balancing.
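
   The effect can be sketched with a hash-based link selector.  The
   use of CRC-32 as the hash, the function name, the addresses and the
   number of links are all assumptions made for illustration; real
   load-balancers use their own algorithms.

```python
import zlib


def pick_link(src, dst, proto, sport=None, dport=None, n_links=4):
    """Choose an outgoing link index for a packet or packet fragment.

    Ports are passed only when the transport-layer header is present,
    that is, for unfragmented packets and for first fragments.
    """
    if sport is not None and dport is not None:
        key = (src, dst, proto, sport, dport)  # 5-tuple
    else:
        key = (src, dst, proto)                # 3-tuple fallback
    return zlib.crc32(repr(key).encode()) % n_links


src, dst, udp = "2001:db8::1", "2001:db8::2", 17

# Unfragmented packets and first fragments of one flow hash on the 5-tuple.
print(pick_link(src, dst, udp, sport=5000, dport=53))

# Trailing fragments of the same flow hash on the 3-tuple only, so they
# may be assigned to a different link than the packets above.
print(pick_link(src, dst, udp))
```

   Because the two calls hash different inputs, nothing guarantees
   that they select the same link, even though both belong to one
   flow.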

410	4.4.  Security Vulnerabilities

412	   Security researchers have documented several attacks that rely on IP
413	   fragmentation.  The following are examples:

415	   o  Overlapping fragment attack [RFC1858] [RFC5722]

417	   o  Resource exhaustion attacks (such as the Rose Attack)

419	   o  Attacks based on predictable fragment Identification values
420	      [RFC7739]

422	   o  Attacks based on bugs in the implementation of the fragment
423	      reassembly algorithm

425	   o  Evasion of Network Intrusion Detection Systems (NIDS) [Ptacek1998]

427	   In the overlapping fragment attack, an attacker constructs a series
428	   of packet fragments.  The first fragment contains an IP header, a
429	   transport-layer header, and some transport-layer payload.  This
430	   fragment complies with local security policy and is allowed to pass
431	   through a stateless firewall.  A second fragment, having a non-zero
432	   offset, overlaps with the first fragment.  The second fragment also
433	   passes through the stateless firewall.  When the packet is
434	   reassembled, the transport-layer header from the first fragment is
435	   overwritten by data from the second fragment.  The reassembled packet
436	   does not comply with local security policy.  Had it traversed the
437	   firewall in one piece, the firewall would have rejected it.

439	   A stateless firewall cannot protect against the overlapping fragment
440	   attack.  However, destination nodes can protect against the
441	   overlapping fragment attack by implementing the reassembly procedures
442	   described in RFC 1858 and RFC 8200.  These reassembly procedures
443	   detect the overlap and discard the packet.
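
   A minimal sketch of overlap detection during reassembly follows.
   It is not the procedure text of any RFC; the tuple representation
   of a fragment and the function name are assumptions.  The sketch
   discards the whole packet when any two fragments overlap.

```python
def reassemble(fragments):
    """Reassemble fragments given as (offset, data, more_fragments).

    Returns the reassembled payload as bytes, or None if the packet
    must be discarded (overlap, gap, or inconsistent last fragment).
    """
    ordered = sorted(fragments, key=lambda frag: frag[0])
    if not ordered:
        return None
    payload = bytearray()
    for index, (offset, data, more) in enumerate(ordered):
        if offset < len(payload):
            return None  # overlaps bytes already received: discard
        if offset > len(payload):
            return None  # gap: the packet is incomplete
        is_final = index == len(ordered) - 1
        if more == is_final:
            return None  # more-fragments flag contradicts position
        payload += data
    return bytes(payload)


benign = [(0, b"AAAAAAAA", True), (8, b"BBBB", False)]
attack = [(0, b"AAAAAAAA", True), (4, b"XXXXXXXX", False)]
print(reassemble(benign))  # b'AAAAAAAABBBB'
print(reassemble(attack))  # None
```

   In the second case, the fragment at offset 4 would overwrite bytes
   supplied by the first fragment, so the packet is discarded rather
   than reassembled.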

445	   The fragment reassembly algorithm is a stateful procedure for an
446	   otherwise stateless protocol.  As such, it can be exploited for
447	   resource exhaustion attacks.  An attacker can construct a series of
448	   fragmented packets, with one fragment missing from each packet such
449	   that the reassembly process cannot complete.  Thus, this attack
450	   causes resource exhaustion on the destination node, possibly denying
451	   reassembly services to other flows.  This type of attack can be
452	   mitigated by flushing fragment reassembly buffers when necessary, at
453	   the expense of possibly dropping legitimate fragments.
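
   One way to bound reassembly state is sketched below.  The class
   name, the eviction policy (flush the oldest incomplete packet) and
   the table size are assumptions; implementations may instead use
   timers or other policies.

```python
from collections import OrderedDict


class ReassemblyTable:
    """Holds fragments of incomplete packets, up to max_entries packets."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # packet key -> list of fragments

    def add_fragment(self, key, fragment):
        """Store a fragment; key is (source, destination, identification)."""
        if key not in self.entries and len(self.entries) >= self.max_entries:
            # Table is full: flush the oldest incomplete packet.  This
            # may drop legitimate fragments along with attack traffic.
            self.entries.popitem(last=False)
        self.entries.setdefault(key, []).append(fragment)


table = ReassemblyTable(max_entries=2)
table.add_fragment(("2001:db8::1", "2001:db8::2", 1), b"first")
table.add_fragment(("2001:db8::1", "2001:db8::2", 2), b"second")
table.add_fragment(("2001:db8::1", "2001:db8::2", 3), b"third")
print(len(table.entries))  # 2
```

   The table never grows beyond its limit, but the packet with
   Identification 1 can no longer be reassembled, even if its missing
   fragments arrive later.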

455	   An IP fragment contains an "Identification" field that, together with
456	   the IP Source Address and Destination Address of a packet, identifies
457	   fragments that correspond to the same original datagram, such that
458	   they can be reassembled together by the receiving host.  Many
459	   implementations have employed predictable values for the
460	   Identification field, thus making it easy for an attacker to forge
461	   malicious IP fragments that would cause the reassembly procedure for
462	   legitimate packets to fail.

464	   Over the years multiple IPv4 and IPv6 implementations have been found
465	   to have flaws in their implementation of the IP fragment reassembly
466	   algorithm, typically resulting in buffer overflows.  These buffer
467	   overflows have been exploitable for denial of service and remote code
468	   execution attacks.

470	   NIDS aims at identifying malicious activity by analyzing network
471	   traffic.  Ambiguity in the possible result of the fragment reassembly
472	   process may allow an attacker to evade these systems.  Many of these
473	   systems try to mitigate some of these evasion techniques by, e.g.,
474	   computing all possible outcomes of the fragment reassembly process,
475	   at the expense of increased processing requirements.

477	4.5.  Blackholing Due to ICMP Loss

479	   As stated above, an upper-layer protocol requires access to the PMTU
480	   estimate if it:

482	   o  Does not rely on IP fragmentation.

484	   o  Relies on IP source fragmentation only (i.e., fragmentation at the
485	      source node).

487	   In order to satisfy this requirement, the upper-layer protocol can:

489	   o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
490	      MTU.

492	   o  Access the estimate that PMTUD produced.

494	   o  Execute PMTUD procedures itself.

496	   o  Execute PLPMTUD procedures.

498	   PMTUD relies upon the network's ability to deliver ICMP PTB messages
499	   to the source node.  Therefore, if an upper-layer protocol relies on
500	   PMTUD for its PMTU estimate, it also relies on the network's ability
501	   to deliver ICMP PTB messages to the source node.

503	   [RFC4890] states that the PTB messages must not be filtered.
504	   However, ICMP delivery is not reliable.  It is subject to transient
505	   loss and, in some configurations, more persistent delivery issues.

507	   ICMP rate limiting, network congestion and packet corruption can
508	   cause transient loss.  The effect of rate limiting may be severe, as
509	   RFC 4443 recommends strict rate limiting of ICMPv6 traffic.

511	   While transient loss causes PMTUD to perform less efficiently, it
512	   does not cause PMTUD to fail completely.  When the conditions
513	   contributing to transient loss abate, the network regains its ability
514	   to deliver ICMP PTB messages and PMTUD regains its ability to
515	   function.

517	   By contrast, more persistent delivery issues cause PMTUD to fail
518	   completely.  Consider the following example:

520	   A DNS client sends a request to an anycast address.  The network
521	   routes that DNS request to the nearest instance of that anycast
522	   address (i.e., a DNS Server).  The DNS server generates a response
523	   and sends it back to the DNS client.  While the response does not
524	   exceed the DNS server's PMTU estimate, it does exceed the actual
525	   PMTU.

527	   A downstream router drops the packet and sends an ICMP PTB message to
528	   the packet's source (i.e., the anycast address).  The network routes
529	   the ICMP PTB message to the anycast instance closest to the
530	   downstream router.  Sadly, that anycast instance may not be the DNS
531	   server that originated the DNS response.  It may be another DNS
532	   server with the same anycast address.  The DNS server that originated
533	   the response may never receive the ICMP PTB message and may never
534	   update its PMTU estimate.

536	   The problem described in this section is specific to PMTUD.  It does
537	   not occur when the upper-layer protocol obtains its PMTU estimate
538	   from PLPMTUD or any other source.

540	   Furthermore, the problem described in this section occurs when the
541	   upper-layer protocol does not rely on IP fragmentation, as well as
542	   when the upper-layer protocol relies on IP source fragmentation only.

544	4.6.  Blackholing Due To Filtering

546	   In RFC 7872, researchers sampled Internet paths to determine whether
547	   they would convey packets that contain IPv6 extension headers.
548	   Sampled paths terminated at popular Internet sites (e.g., popular
549	   web, mail and DNS servers).

551	   The study revealed that at least 28% of the sampled paths did not
552	   convey packets containing the IPv6 Fragment extension header.  In
553	   most cases, fragments were dropped in the destination autonomous
554	   system.  In other cases, the fragments were dropped in transit
555	   autonomous systems.

557	   Another recent study [Huston] confirmed this finding.  It reported
558	   that 37% of sampled endpoints used IPv6-capable DNS resolvers that
559	   were incapable of receiving a fragmented IPv6 response.

561	   It is difficult to determine why network operators drop fragments.
562	   In some cases, packet drop may be caused by misconfiguration.  In
563	   other cases, network operators may consciously choose to drop IPv6
564	   fragments, in order to address the issues raised in Section 4.1
565	   through Section 4.5, above.

567	5.  Alternatives to IP Fragmentation
568	5.1.  Transport Layer Solutions

570	   The Transmission Control Protocol (TCP) [RFC0793] can be operated in
571	   a mode that does not require IP fragmentation.

573	   Applications submit a stream of data to TCP.  TCP divides that stream
574	   of data into segments, with no segment exceeding the TCP Maximum
575	   Segment Size (MSS).  Each segment is encapsulated in a TCP header and
576	   submitted to the underlying IP module.  The underlying IP module
577	   prepends an IP header and forwards the resulting packet.

579	   If the TCP MSS is sufficiently small, the underlying IP module never
580	   produces a packet whose length is greater than the actual PMTU.
581	   Therefore, IP fragmentation is not required.

583	   TCP offers the following mechanisms for MSS management:

585	   o  Manual configuration

587	   o  PMTUD

589	   o  PLPMTUD

591	   For IPv6 nodes, manual configuration is always applicable.  If the
592	   MSS is manually configured to 1220 bytes and the packet does not
593	   contain extension headers, the IP layer will never produce a packet
594	   whose length is greater than the IPv6 minimum link MTU (1280 bytes).
595	   However, manual configuration prevents TCP from taking advantage of
596	   larger link MTUs.
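
   The 1220-byte figure follows from simple arithmetic, sketched
   below.  The sketch assumes a 40-byte IPv6 header with no extension
   headers and a 20-byte TCP header with no options; the constant and
   function names are illustrative.

```python
IPV6_MIN_LINK_MTU = 1280  # bytes
IPV6_HEADER_LEN = 40      # fixed IPv6 header, no extension headers
TCP_HEADER_LEN = 20       # assumption: TCP header without options


def tcp_mss(mtu, ip_header_len=IPV6_HEADER_LEN, tcp_header_len=TCP_HEADER_LEN):
    """Return the largest TCP segment payload that fits in one packet."""
    return mtu - ip_header_len - tcp_header_len


print(tcp_mss(IPV6_MIN_LINK_MTU))  # 1220
```

   With an MSS of 1220 bytes, each packet is at most 1280 bytes long.
   Extension headers or TCP options would consume part of that budget.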

598	   RFC 8200 strongly recommends that IPv6 nodes implement PMTUD, in
599	   order to discover and take advantage of path MTUs greater than 1280
600	   bytes.  However, as mentioned in Section 2.1, PMTUD relies upon the
601	   network's ability to deliver ICMP PTB messages.  Therefore, PMTUD is
602	   applicable only in environments where the risk of ICMP PTB loss is
603	   acceptable.

605	   By contrast, PLPMTUD does not rely upon the network's ability to
606	   deliver ICMP PTB messages.  However, in many loss-based TCP
607	   congestion control algorithms, the dropping of a packet may cause the
608	   TCP congestion control algorithm to reduce the congestion window, or
609	   even restart the entire slow-start process.  For high-capacity,
610	   long-RTT, large-volume TCP streams, the deliberate probing with large
611	   packets and the consequent packet drop may impose too harsh a penalty
612	   on total TCP throughput for it to be a viable approach.  [RFC4821]
613	   defines PLPMTUD procedures for TCP.

615	   While TCP will never cause the underlying IP module to emit a packet
616	   that is larger than the PMTU estimate, it can cause the underlying IP
617	   module to emit a packet that is larger than the actual PMTU.  If this
618	   occurs, the packet is dropped, the PMTU estimate is updated, the
619	   segment is divided into smaller segments and each smaller segment is
620	   submitted to the underlying IP module.
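The sequence in the preceding paragraph (drop, update the estimate, divide the segment, resubmit) can be modeled as below. This is an illustrative sketch only. The function names and the `actual_pmtu` parameter are hypothetical, headers are assumed to be a 40-byte IPv6 header plus a 20-byte TCP header, and the estimate is assumed to converge to the actual PMTU in a single step, which real implementations may not do.

```python
from typing import List, Tuple

IPV6_HEADER_LEN = 40  # fixed IPv6 header, no extension headers (assumption)
TCP_HEADER_LEN = 20   # TCP header without options (assumption)
HEADERS = IPV6_HEADER_LEN + TCP_HEADER_LEN


def resegment(segment: bytes, mss: int) -> List[bytes]:
    """Divide one segment into pieces of at most mss bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [segment[i:i + mss] for i in range(0, len(segment), mss)]


def submit(segment: bytes, pmtu_estimate: int,
           actual_pmtu: int) -> Tuple[List[bytes], int]:
    """Model one send attempt.

    Returns the segments finally submitted to the IP module and the
    PMTU estimate that is in effect afterwards.
    """
    if len(segment) + HEADERS > pmtu_estimate:
        raise ValueError("TCP never exceeds its own PMTU estimate")
    if len(segment) + HEADERS <= actual_pmtu:
        return [segment], pmtu_estimate  # packet fits, nothing changes
    # The packet exceeds the actual PMTU and is dropped. The estimate is
    # lowered (modeled here as learning the actual value at once) and
    # the segment is divided into smaller segments.
    new_estimate = actual_pmtu
    return resegment(segment, new_estimate - HEADERS), new_estimate


if __name__ == "__main__":
    pieces, estimate = submit(bytes(1440), pmtu_estimate=1500, actual_pmtu=1280)
    print([len(p) for p in pieces], estimate)
```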

622	   The Datagram Congestion Control Protocol (DCCP) [RFC4340] and the
623	   Stream Control Transmission Protocol (SCTP) [RFC4960] can also be
624	   operated in a mode that does not require IP fragmentation.  They both
625	   accept data from an application and divide that data into segments,
626	   with no segment exceeding a maximum size.  Both DCCP and SCTP offer
627	   manual configuration, PMTUD and PLPMTUD as mechanisms for managing
628	   that maximum size.  [I-D.fairhurst-tsvwg-datagram-plpmtud] proposes
629	   PLPMTUD procedures for DCCP and SCTP.

631	5.2.  Application Layer Solutions

633	   [RFC8085] recognizes that IP fragmentation reduces the reliability of
634	   Internet communication.  Therefore, it offers the following advice
635	   regarding applications that run over the User Datagram Protocol (UDP)
636	   [RFC0768]:

638	   "An application SHOULD NOT send UDP datagrams that result in IP
639	   packets that exceed the Maximum Transmission Unit (MTU) along the
640	   path to the destination.  Consequently, an application SHOULD either
641	   use the path MTU information provided by the IP layer or implement
642	   Path MTU Discovery (PMTUD) itself to determine whether the path to a
643	   destination will support its desired message size without
644	   fragmentation."

646	   RFC 8085 continues:

648	   "Applications that do not follow the recommendation to do PMTU/
649	   PLPMTUD discovery SHOULD still avoid sending UDP datagrams that would
650	   result in IP packets that exceed the path MTU.  Because the actual
651	   path MTU is unknown, such applications SHOULD fall back to sending
652	   messages that are shorter than the default effective MTU for sending
653	   (EMTU_S in [RFC1122]).  For IPv4, EMTU_S is the smaller of 576 bytes
654	   and the first-hop MTU.  For IPv6, EMTU_S is 1280 bytes.  The
655	   effective PMTU for a directly connected destination (with no routers
656	   on the path) is the configured interface MTU, which could be less
657	   than the maximum link payload size.  Transmission of minimum-sized
658	   UDP datagrams is inefficient over paths that support a larger PMTU,
659	   which is a second reason to implement PMTU discovery."

661	   RFC 8085 assumes that for IPv4, an EMTU_S of 576 is sufficiently
662	   small, even though the IPv4 minimum link MTU is 68 bytes.

664	   This advice applies equally to applications that run directly over IP.
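The fallback sizes quoted from RFC 8085 translate into UDP payload limits as sketched below. This is an illustration only. The function names are hypothetical, and the sketch assumes an IPv4 header without options, an IPv6 header without extension headers, and an 8-byte UDP header.

```python
UDP_HEADER_LEN = 8
IPV4_HEADER_LEN = 20  # IPv4 header without options (assumption)
IPV6_HEADER_LEN = 40  # IPv6 header without extension headers (assumption)


def default_emtu_s(ip_version: int, first_hop_mtu: int) -> int:
    """Default effective MTU for sending, as quoted from RFC 8085."""
    if ip_version == 4:
        return min(576, first_hop_mtu)
    if ip_version == 6:
        return 1280
    raise ValueError("ip_version must be 4 or 6")


def max_udp_payload(ip_version: int, first_hop_mtu: int) -> int:
    """Largest UDP payload that stays within the default EMTU_S."""
    ip_header = IPV4_HEADER_LEN if ip_version == 4 else IPV6_HEADER_LEN
    return default_emtu_s(ip_version, first_hop_mtu) - ip_header - UDP_HEADER_LEN


if __name__ == "__main__":
    print(max_udp_payload(4, 1500))  # IPv4 over a 1500-byte first hop
    print(max_udp_payload(6, 1500))  # IPv6 over a 1500-byte first hop
```

An application that has no PMTU information and keeps its messages within these limits stays inside the default EMTU_S.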

666	6.  Applications That Rely on IPv6 Fragmentation

668	   The following applications rely on IPv6 fragmentation:

670	   o  DNS [RFC1035]

672	   o  OSPFv3 [RFC5340]

674	   o  IP Encapsulations

676	   Each of these applications relies on IPv6 fragmentation to a varying
677	   degree.  In some cases, that reliance is essential, and cannot be
678	   broken without fundamentally changing the protocol.  In other cases,
679	   that reliance is incidental, and most implementations already take
680	   appropriate steps to avoid fragmentation.

682	   This list is not comprehensive, and other protocols that rely on IPv6
683	   fragmentation may exist.  They are not specifically considered in the
684	   context of this document.

686	6.1.  DNS

688	   DNS can obtain transport services from either UDP or TCP.  Superior
689	   performance and scaling characteristics are observed when DNS runs
690	   over UDP.

692	   DNS servers that execute DNSSEC [RFC4035] procedures are more likely
693	   to generate large responses.  Therefore, when running over UDP, they
694	   are more likely to cause the generation of IPv6 fragments.  DNS's
695	   reliance upon IPv6 fragmentation is fundamental and cannot be broken
696	   without changing the DNS specification.

698	   DNS is an essential part of the Internet architecture.  Therefore,
699	   this issue is for further study and must be resolved before DNSSEC
700	   can be deployed successfully in IPv6-only networks.

702	6.2.  OSPFv3

704	   OSPFv3 implementations can emit messages large enough to cause IPv6
705	   fragmentation.  However, in keeping with the recommendations of
706	   RFC 8200, and in order to optimize performance, most OSPFv3
707	   implementations restrict their maximum message size to the IPv6
708	   minimum link MTU.

710	6.3.  IP Encapsulations

712	   In this document, IP encapsulations include IP-in-IP [RFC2003],
713	   Generic Routing Encapsulation (GRE) [RFC2784], GRE-in-UDP [RFC8086]
714	   and Generic Packet Tunneling in IPv6 [RFC2473].  The fragmentation
715	   strategy described for GRE in [RFC7588] has been deployed for all of
716	   the above-mentioned IP encapsulations.  This strategy does not rely
717	   on IPv6 fragmentation except in one corner case (see Section 3.3.2.2
718	   of RFC 7588 and Section 7.1 of RFC 2473).  Section 3.3 of [RFC7676]
719	   further describes this corner case.

721	7.  Recommendation

723	7.1.  For Application Developers

725	   Application developers SHOULD NOT develop applications that rely on
726	   IPv6 fragmentation.

728	   Application-layer protocols that depend upon IPv6 fragmentation
729	   SHOULD be updated to break that dependency.

731	7.2.  For Network Operators

733	   As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
734	   messages unless they are known to be forged or otherwise
735	   illegitimate.  As stated in Section 4.5, filtering ICMPv6 PTB packets
736	   causes PMTUD to fail.  Many upper-layer protocols rely on PMTUD.

738	8.  IANA Considerations

740	   This document makes no request of IANA.

742	9.  Security Considerations

744	   This document mitigates some of the security considerations
745	   associated with IP fragmentation by discouraging the use of IP
746	   fragmentation.  It does not introduce any new security
747	   vulnerabilities, because it does not introduce any new alternatives
748	   to IP fragmentation.  Instead, it recommends well-understood
749	   alternatives.

751	10.  Acknowledgements

753	   TBD

755	11.  References

757	11.1.  Normative References

759	   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
760	              DOI 10.17487/RFC0768, August 1980,
761	              <https://www.rfc-editor.org/info/rfc768>.

763	   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
764	              DOI 10.17487/RFC0791, September 1981,
765	              <https://www.rfc-editor.org/info/rfc791>.

767	   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
768	              RFC 792, DOI 10.17487/RFC0792, September 1981,
769	              <https://www.rfc-editor.org/info/rfc792>.

771	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
772	              RFC 793, DOI 10.17487/RFC0793, September 1981,
773	              <https://www.rfc-editor.org/info/rfc793>.

775	   [RFC1035]  Mockapetris, P., "Domain names - implementation and
776	              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
777	              November 1987, <https://www.rfc-editor.org/info/rfc1035>.

779	   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
780	              DOI 10.17487/RFC1191, November 1990,
781	              <https://www.rfc-editor.org/info/rfc1191>.

783	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
784	              Requirement Levels", BCP 14, RFC 2119,
785	              DOI 10.17487/RFC2119, March 1997,
786	              <https://www.rfc-editor.org/info/rfc2119>.

788	   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet
789	              Control Message Protocol (ICMPv6) for the Internet
790	              Protocol Version 6 (IPv6) Specification", STD 89,
791	              RFC 4443, DOI 10.17487/RFC4443, March 2006,
792	              <https://www.rfc-editor.org/info/rfc4443>.

794	   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
795	              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007,
796	              <https://www.rfc-editor.org/info/rfc4821>.

798	   [RFC8085]  Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
799	              Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
800	              March 2017, <https://www.rfc-editor.org/info/rfc8085>.

802	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
803	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
804	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

806	   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
807	              (IPv6) Specification", STD 86, RFC 8200,
808	              DOI 10.17487/RFC8200, July 2017,
809	              <https://www.rfc-editor.org/info/rfc8200>.

811	   [RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
812	              "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
813	              DOI 10.17487/RFC8201, July 2017,
814	              <https://www.rfc-editor.org/info/rfc8201>.

816	11.2.  Informative References

818	   [Anderson2001]
819	              Anderson, J., "An Analysis of Fragmentation Attacks",
820	              2001, <http://www.ouah.org/fragma.html>.

822	   [Huston]   Huston, G., "IPv6, Large UDP Packets and the DNS",
823	              August 2017,
824	              <http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html>.

826	   [I-D.fairhurst-tsvwg-datagram-plpmtud]
827	              Fairhurst, G., Jones, T., Tuexen, M., and I. Ruengeler,
828	              "Packetization Layer Path MTU Discovery for Datagram
829	              Transports", draft-fairhurst-tsvwg-datagram-plpmtud-02
830	              (work in progress), December 2017.

832	   [Ptacek1998]
833	              Ptacek, T. and T. Newsham, "Insertion, Evasion and Denial
834	              of Service: Eluding Network Intrusion Detection", 1998,
835	              <http://www.aciri.org/vern/Ptacek-Newsham-Evasion-98.ps>.

837	   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
838	              Communication Layers", STD 3, RFC 1122,
839	              DOI 10.17487/RFC1122, October 1989,
840	              <https://www.rfc-editor.org/info/rfc1122>.

842	   [RFC1858]  Ziemba, G., Reed, D., and P. Traina, "Security
843	              Considerations for IP Fragment Filtering", RFC 1858,
844	              DOI 10.17487/RFC1858, October 1995,
845	              <https://www.rfc-editor.org/info/rfc1858>.

847	   [RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
848	              DOI 10.17487/RFC2003, October 1996,
849	              <https://www.rfc-editor.org/info/rfc2003>.

851	   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
852	              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
853	              December 1998, <https://www.rfc-editor.org/info/rfc2473>.

855	   [RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black,
856	              "Definition of the Differentiated Services Field (DS
857	              Field) in the IPv4 and IPv6 Headers", RFC 2474,
858	              DOI 10.17487/RFC2474, December 1998,
859	              <https://www.rfc-editor.org/info/rfc2474>.

861	   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
862	              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
863	              DOI 10.17487/RFC2784, March 2000,
864	              <https://www.rfc-editor.org/info/rfc2784>.

866	   [RFC4035]  Arends, R., Austein, R., Larson, M., Massey, D., and S.
867	              Rose, "Protocol Modifications for the DNS Security
868	              Extensions", RFC 4035, DOI 10.17487/RFC4035, March 2005,
869	              <https://www.rfc-editor.org/info/rfc4035>.

871	   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
872	              Congestion Control Protocol (DCCP)", RFC 4340,
873	              DOI 10.17487/RFC4340, March 2006,
874	              <https://www.rfc-editor.org/info/rfc4340>.

876	   [RFC4890]  Davies, E. and J. Mohacsi, "Recommendations for Filtering
877	              ICMPv6 Messages in Firewalls", RFC 4890,
878	              DOI 10.17487/RFC4890, May 2007,
879	              <https://www.rfc-editor.org/info/rfc4890>.

881	   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
882	              RFC 4960, DOI 10.17487/RFC4960, September 2007,
883	              <https://www.rfc-editor.org/info/rfc4960>.

885	   [RFC5340]  Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF
886	              for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008,
887	              <https://www.rfc-editor.org/info/rfc5340>.

889	   [RFC5722]  Krishnan, S., "Handling of Overlapping IPv6 Fragments",
890	              RFC 5722, DOI 10.17487/RFC5722, December 2009,
891	              <https://www.rfc-editor.org/info/rfc5722>.

893	   [RFC5927]  Gont, F., "ICMP Attacks against TCP", RFC 5927,
894	              DOI 10.17487/RFC5927, July 2010,
895	              <https://www.rfc-editor.org/info/rfc5927>.

897	   [RFC7588]  Bonica, R., Pignataro, C., and J. Touch, "A Widely
898	              Deployed Solution to the Generic Routing Encapsulation
899	              (GRE) Fragmentation Problem", RFC 7588,
900	              DOI 10.17487/RFC7588, July 2015,
901	              <https://www.rfc-editor.org/info/rfc7588>.

903	   [RFC7676]  Pignataro, C., Bonica, R., and S. Krishnan, "IPv6 Support
904	              for Generic Routing Encapsulation (GRE)", RFC 7676,
905	              DOI 10.17487/RFC7676, October 2015,
906	              <https://www.rfc-editor.org/info/rfc7676>.

908	   [RFC7739]  Gont, F., "Security Implications of Predictable Fragment
909	              Identification Values", RFC 7739, DOI 10.17487/RFC7739,
910	              February 2016, <https://www.rfc-editor.org/info/rfc7739>.

912	   [RFC7872]  Gont, F., Linkova, J., Chown, T., and W. Liu,
913	              "Observations on the Dropping of Packets with IPv6
914	              Extension Headers in the Real World", RFC 7872,
915	              DOI 10.17487/RFC7872, June 2016,
916	              <https://www.rfc-editor.org/info/rfc7872>.

918	   [RFC8086]  Yong, L., Ed., Crabbe, E., Xu, X., and T. Herbert, "GRE-
919	              in-UDP Encapsulation", RFC 8086, DOI 10.17487/RFC8086,
920	              March 2017, <https://www.rfc-editor.org/info/rfc8086>.

922	Appendix A.  Contributors' Address

924	Authors' Addresses

926	   Ron Bonica
927	   Juniper Networks
928	   2251 Corporate Park Drive
929	   Herndon, Virginia  20171
930	   USA

932	   Email: rbonica@juniper.net

934	   Fred Baker
935	   Unaffiliated
936	   Santa Barbara, California  93117
937	   USA

939	   Email: FredBaker.IETF@gmail.com
940	   Geoff Huston
941	   APNIC
942	   6 Cordelia St
943	   Brisbane, 4101 QLD
944	   Australia

946	   Email: gih@apnic.net

948	   Robert M. Hinden
949	   Check Point Software
950	   959 Skyway Road
951	   San Carlos, California  94070
952	   USA

954	   Email: bob.hinden@gmail.com

956	   Ole Troan
957	   Cisco
958	   Philip Pedersens vei 1
959	   N-1366 Lysaker
960	   Norway

962	   Email: ot@cisco.com

964	   Fernando Gont
965	   SI6 Networks
966	   Evaristo Carriego 2644
967	   Haedo, Provincia de Buenos Aires
968	   Argentina

970	   Email: fgont@si6networks.com