idnits 2.17.1 

draft-bonica-intarea-frag-fragile-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 23, 2018) is 2103 days in the past.  Is this
     intentional?


  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-32) exists of
     draft-ietf-tsvwg-udp-options-05

  -- Obsolete informational reference (is this intentional?): RFC 4960
     (Obsoleted by RFC 9260)


     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Area WG                                               R. Bonica
3	Internet-Draft                                          Juniper Networks
4	Intended status: Best Current Practice                          F. Baker
5	Expires: January 24, 2019                                   Unaffiliated
6	                                                               G. Huston
7	                                                                   APNIC
8	                                                               R. Hinden
9	                                                    Check Point Software
10	                                                                O. Troan
11	                                                                   Cisco
12	                                                                 F. Gont
13	                                                            SI6 Networks
14	                                                           July 23, 2018

16	                  IP Fragmentation Considered Fragile
17	                  draft-bonica-intarea-frag-fragile-03

19	Abstract

21	   This document provides an overview of IP fragmentation.  It also
22	   explains how IP fragmentation reduces the reliability of Internet
23	   communication.

25	   Finally, this document proposes alternatives to IP fragmentation and
26	   provides recommendations for application developers and network
27	   operators.

29	Status of This Memo

31	   This Internet-Draft is submitted in full conformance with the
32	   provisions of BCP 78 and BCP 79.

34	   Internet-Drafts are working documents of the Internet Engineering
35	   Task Force (IETF).  Note that other groups may also distribute
36	   working documents as Internet-Drafts.  The list of current Internet-
37	   Drafts is at https://datatracker.ietf.org/drafts/current/.

39	   Internet-Drafts are draft documents valid for a maximum of six months
40	   and may be updated, replaced, or obsoleted by other documents at any
41	   time.  It is inappropriate to use Internet-Drafts as reference
42	   material or to cite them other than as "work in progress."

44	   This Internet-Draft will expire on January 24, 2019.

46	Copyright Notice

48	   Copyright (c) 2018 IETF Trust and the persons identified as the
49	   document authors.  All rights reserved.

51	   This document is subject to BCP 78 and the IETF Trust's Legal
52	   Provisions Relating to IETF Documents
53	   (https://trustee.ietf.org/license-info) in effect on the date of
54	   publication of this document.  Please review these documents
55	   carefully, as they describe your rights and restrictions with respect
56	   to this document.  Code Components extracted from this document must
57	   include Simplified BSD License text as described in Section 4.e of
58	   the Trust Legal Provisions and are provided without warranty as
59	   described in the Simplified BSD License.

61	Table of Contents

63	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
64	   2.  IP Fragmentation  . . . . . . . . . . . . . . . . . . . . . .   3
65	     2.1.  Links, Paths, MTU and PMTU  . . . . . . . . . . . . . . .   3
66	     2.2.  Upper-layer Protocols . . . . . . . . . . . . . . . . . .   5
67	   3.  Requirements Language . . . . . . . . . . . . . . . . . . . .   7
68	   4.  IP Fragmentation Reduces Reliability  . . . . . . . . . . . .   7
69	     4.1.  Middle Box Failures . . . . . . . . . . . . . . . . . . .   7
70	     4.2.  Partial Filtering . . . . . . . . . . . . . . . . . . . .   8
71	     4.3.  Telemetry and Monitoring Failures . . . . . . . . . . . .   8
72	     4.4.  Suboptimal Load Balancing . . . . . . . . . . . . . . . .   9
73	     4.5.  Security Vulnerabilities  . . . . . . . . . . . . . . . .   9
74	     4.6.  Blackholing Due to ICMP Loss  . . . . . . . . . . . . . .  11
75	       4.6.1.  Transient Loss  . . . . . . . . . . . . . . . . . . .  12
76	       4.6.2.  Incorrect Implementation of Security Policy . . . . .  12
77	       4.6.3.  Persistent Loss Caused By Anycast . . . . . . . . . .  13
78	     4.7.  Blackholing Due To Filtering  . . . . . . . . . . . . . .  13
79	   5.  Alternatives to IP Fragmentation  . . . . . . . . . . . . . .  14
80	     5.1.  Transport Layer Solutions . . . . . . . . . . . . . . . .  14
81	     5.2.  Application Layer Solutions . . . . . . . . . . . . . . .  15
82	   6.  Applications That Rely on IPv6 Fragmentation  . . . . . . . .  16
83	     6.1.  DNS . . . . . . . . . . . . . . . . . . . . . . . . . . .  16
84	     6.2.  OSPFv3  . . . . . . . . . . . . . . . . . . . . . . . . .  17
85	     6.3.  Packet-in-Packet Encapsulations . . . . . . . . . . . . .  17
86	   7.  Recommendations . . . . . . . . . . . . . . . . . . . . . . .  17
87	     7.1.  For Application Developers  . . . . . . . . . . . . . . .  17
88	     7.2.  For Network Operators . . . . . . . . . . . . . . . . . .  18
89	   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  18
90	   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  18
91	   10. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  18
92	   11. References  . . . . . . . . . . . . . . . . . . . . . . . . .  18
93	     11.1.  Normative References . . . . . . . . . . . . . . . . . .  18
94	     11.2.  Informative References . . . . . . . . . . . . . . . . .  20
95	   Appendix A.  Contributors' Address  . . . . . . . . . . . . . . .  22
96	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  22

98	1.  Introduction

100	   Operational experience [RFC7872] [Huston] reveals that IP
101	   fragmentation reduces the reliability of Internet communication.
102	   This document provides an overview of IP fragmentation.  It also
103	   explains how IP fragmentation reduces the reliability of Internet
104	   communication.

106	   Finally, this document proposes alternatives to IP fragmentation and
107	   provides recommendations for application developers and network
108	   operators.

110	2.  IP Fragmentation

112	2.1.  Links, Paths, MTU and PMTU

114	   An Internet path connects a source node to a destination node.  A
115	   path can contain links and intermediate systems.  If a path contains
116	   more than one link, the links are connected in series and an
117	   intermediate system connects each link to the next.  An intermediate
118	   system can be a router or a middle box.

120	   Internet paths are dynamic.  Assume that the path from one node to
121	   another contains a set of links and intermediate systems.  If the
122	   network topology changes, that path can also change so that it
123	   includes a different set of links and intermediate systems.

125	   Each link is constrained by the number of bytes that it can convey in
126	   a single IP packet.  This constraint is called the link Maximum
127	   Transmission Unit (MTU).  IPv4 [RFC0791] requires every link to have
128	   an MTU of 68 bytes or greater.  IPv6 [RFC8200] requires every link to
129	   have an MTU of 1280 bytes or greater.  These are called the IPv4 and
130	   IPv6 minimum link MTUs.

132	   Each Internet path is constrained by the number of bytes that it can
133	   convey in a single IP packet.  This constraint is called the Path MTU
134	   (PMTU).  For any given path, the PMTU is equal to the smallest of its
135	   link MTUs.  Because Internet paths are dynamic, PMTU is also
136	   dynamic.
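
   The relationship between link MTUs and the PMTU can be illustrated
   with a minimal Python sketch.  The link MTU values and the function
   name path_mtu are assumptions chosen for illustration; they do not
   come from any specification.

```python
def path_mtu(link_mtus):
    """Return the PMTU: the smallest link MTU along the path."""
    if not link_mtus:
        raise ValueError("a path needs at least one link")
    return min(link_mtus)


# Hypothetical link MTUs (in bytes) before and after a topology change.
old_path = [1500, 9000, 1500]
new_path = [1500, 1400, 1500]   # the path now crosses a 1400-byte link

print(path_mtu(old_path))  # 1500
print(path_mtu(new_path))  # 1400
```

   Because the path changed, the PMTU changed with it, even though the
   source and destination nodes did not.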

138	   For reasons described below, source nodes estimate the PMTU between
139	   themselves and destination nodes.  A source node can produce
140	   extremely conservative PMTU estimates in which:

142	   o  The estimate for each IPv4 path is equal to the IPv4 minimum link
143	      MTU.

145	   o  The estimate for each IPv6 path is equal to the IPv6 minimum link
146	      MTU.

148	   While these conservative estimates are guaranteed to be less than or
149	   equal to the actual PMTU, they are likely to be much less than the
150	   actual PMTU.  This may adversely affect upper-layer protocol
151	   performance.

153	   By executing Path MTU Discovery (PMTUD) [RFC1191] [RFC8201]
154	   procedures, a source node can maintain a less conservative, running
155	   estimate of the PMTU between itself and a destination node.
156	   According to these procedures, the source node produces an initial
157	   PMTU estimate.  This initial estimate is equal to the MTU of the
158	   first link along the path to the destination node.  It can be greater
159	   than the actual PMTU.

161	   Having produced an initial PMTU estimate, the source node sends non-
162	   fragmentable IP packets to the destination node.  If one of these
163	   packets is larger than the actual PMTU, a downstream router will not
164	   be able to forward the packet through the next link along the path.
165	   Therefore, the downstream router drops the packet and sends an
166	   Internet Control Message Protocol (ICMP) [RFC0792] [RFC4443] Packet
167	   Too Big (PTB) message to the source node.  The ICMP PTB message
168	   indicates the MTU of the link through which the packet could not be
169	   forwarded.  The source node uses this information to refine its PMTU
170	   estimate.

172	   PMTUD produces a running estimate of the PMTU between a source node
173	   and a destination node.  Because PMTU is dynamic, at any given time,
174	   the PMTU estimate can differ from the actual PMTU.  In order to
175	   detect PMTU increases, PMTUD occasionally resets the PMTU estimate to
176	   the MTU of the first link along the path to the destination node.  It
177	   then repeats the procedure described above.
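
   The PMTUD procedure described above can be sketched as a small
   simulation.  This is a simplified model, not an implementation of
   [RFC1191] or [RFC8201]: it assumes that every ICMP PTB message is
   delivered, and the function names are hypothetical.

```python
def send_non_fragmentable(packet_len, link_mtus):
    """Simulate forwarding one non-fragmentable packet along a path.

    Returns None if the packet is delivered, or the MTU reported in the
    ICMP PTB message sent by the router that dropped the packet.
    """
    for mtu in link_mtus:
        if packet_len > mtu:
            return mtu           # router drops the packet, reports this MTU
    return None


def discover_pmtu(link_mtus):
    """Refine a PMTU estimate until a full-sized packet is delivered."""
    estimate = link_mtus[0]      # initial estimate: MTU of the first link
    while True:
        reported_mtu = send_non_fragmentable(estimate, link_mtus)
        if reported_mtu is None:
            return estimate
        estimate = reported_mtu  # refine the estimate from the PTB message


print(discover_pmtu([1500, 1400, 1280, 1500]))  # 1280
```

   In this example the estimate is refined twice (1500, then 1400, then
   1280).  If the PTB messages were lost, the loop in this sketch would
   have no information with which to refine the estimate.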

179	   PMTUD has the following characteristics:

181	   o  It relies on the network's ability to deliver ICMP PTB messages to
182	      the source node.

184	   o  It is susceptible to attack because ICMP messages are easily
185	      forged [RFC5927].

187	   FOOTNOTE: According to RFC 0791, every IPv4 host must be capable of
188	   receiving a packet whose length is equal to 576 bytes.  However, the
189	   IPv4 minimum link MTU is not 576.  Section 3.2 of RFC 0791 explicitly
190	   states that the IPv4 minimum link MTU is 68 bytes.

192	   FOOTNOTE: In the paragraphs above, the term "non-fragmentable packet"
193	   is introduced.  A non-fragmentable packet can be fragmented at its
194	   source.  However, it cannot be fragmented by a downstream node.  An
195	   IPv4 packet whose DF-bit is set to zero is fragmentable.  An IPv4
196	   packet whose DF-bit is set to one is non-fragmentable.  All IPv6
197	   packets are also non-fragmentable.

199	   FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is
200	   introduced.  The ICMP PTB message has two instantiations.  In ICMPv4
201	   [RFC0792], the ICMP PTB message is a Destination Unreachable message
202	   with Code equal to (4) fragmentation needed and DF set.  This message
203	   was augmented by [RFC1191] to indicate the MTU of the link through
204	   which the packet could not be forwarded.  In ICMPv6 [RFC4443], the
205	   ICMP PTB message is a Packet Too Big Message with Code equal to (0).
206	   This message also indicates the MTU of the link through which the
207	   packet could not be forwarded.

209	2.2.  Upper-layer Protocols

211	   When an upper-layer protocol submits data to the underlying IP
212	   module, and the resulting IP packet's length is greater than the
213	   PMTU, IP fragmentation may be required.  IP fragmentation divides a
214	   packet into fragments.  Each fragment includes an IP header and a
215	   portion of the original packet.

217	   [RFC0791] describes IPv4 fragmentation procedures.  IPv4 packets
218	   whose DF-bit is set to one cannot be fragmented.  IPv4 packets whose
219	   DF-bit is set to zero can be fragmented at the source node or by any
220	   downstream router.  [RFC8200] describes IPv6 fragmentation
221	   procedures.  IPv6 packets can be fragmented at the source node only.

223	   IPv4 fragmentation differs slightly from IPv6 fragmentation.
224	   However, in both IP versions, the upper-layer header appears in the
225	   first fragment only.  It does not appear in subsequent fragments.
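
   The following Python sketch illustrates why the upper-layer header
   appears in the first fragment only.  It models IPv4 fragmentation in
   a simplified way: it assumes a 20-byte IPv4 header without options,
   represents each fragment as a tuple rather than as a real packet, and
   uses a made-up 8-byte string in place of a UDP header.

```python
IPV4_HEADER_LEN = 20  # bytes, assuming no IPv4 options


def fragment(payload, mtu):
    """Split an IP payload into (offset, more_fragments, data) tuples.

    Offsets are in bytes.  Every fragment except the last carries a
    multiple of 8 bytes, because IPv4 fragment offsets count 8-byte units.
    """
    max_data = (mtu - IPV4_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    for offset in range(0, len(payload), max_data):
        data = payload[offset:offset + max_data]
        more = offset + len(data) < len(payload)
        fragments.append((offset, more, data))
    return fragments


udp_header = b"UDPHDR!!"              # stand-in for an 8-byte UDP header
payload = udp_header + bytes(3000)    # upper-layer header plus 3000 data bytes
frags = fragment(payload, mtu=1500)
for offset, more, data in frags:
    print(offset, more, len(data), data.startswith(udp_header))
# 0 True 1480 True
# 1480 True 1480 False
# 2960 False 48 False
```

   Only the fragment at offset zero begins with the upper-layer header.
   A device that inspects the second or third fragment in isolation
   finds no port numbers there.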

227	   Upper-layer protocols can operate in the following modes:

229	   o  Do not rely on IP fragmentation.

231	   o  Rely on IP source fragmentation only (i.e., fragmentation at the
232	      source node).

234	   o  Rely on IP source fragmentation and downstream fragmentation
235	      (i.e., fragmentation at any node along the path).

237	   Upper-layer protocols running over IPv4 can operate in all of the
238	   above-mentioned modes.  Upper-layer protocols running over IPv6 can
239	   operate in the first and second modes only.

241	   Upper-layer protocols that operate in the first two modes (above)
242	   require access to the PMTU estimate.  In order to fulfil this
243	   requirement, they can:

245	   o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
246	      MTU.

248	   o  Access the estimate that PMTUD produced.

250	   o  Execute PMTUD procedures themselves.

252	   o  Execute Packetization Layer PMTUD (PLPMTUD) [RFC4821]
253	      [I-D.fairhurst-tsvwg-datagram-plpmtud] procedures.

255	   According to PLPMTUD procedures, the upper-layer protocol maintains a
256	   running PMTU estimate.  It does so by sending probe packets of
257	   various sizes to its peer and receiving acknowledgements.  This
258	   strategy differs from PMTUD in that it relies on acknowledgement of
259	   received messages, as opposed to ICMP PTB messages concerning dropped
260	   messages.  Therefore, PLPMTUD does not rely on the network's ability
261	   to deliver ICMP PTB messages to the source.
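
   The probing idea can be sketched as follows.  The binary search used
   here is an assumption made for illustration; [RFC4821] does not
   require any particular search strategy.  The function
   probe_acknowledged is a hypothetical stand-in for sending a probe and
   waiting for the peer's acknowledgement.

```python
def probe_acknowledged(size, actual_pmtu):
    """Stand-in for sending a probe and waiting for the peer's
    acknowledgement; a probe larger than the actual PMTU is lost."""
    return size <= actual_pmtu


def plpmtud_search(base, ceiling, actual_pmtu):
    """Find the largest acknowledged probe size in [base, ceiling].

    `base` is a size assumed to work (e.g., the IPv6 minimum link MTU);
    `ceiling` is the largest size worth probing (e.g., the first-link MTU).
    """
    low, high = base, ceiling
    while low < high:
        probe = (low + high + 1) // 2
        if probe_acknowledged(probe, actual_pmtu):
            low = probe         # acknowledged: the estimate can grow
        else:
            high = probe - 1    # no acknowledgement: treat probe as too big
    return low


print(plpmtud_search(base=1280, ceiling=9000, actual_pmtu=1400))  # 1400
```

   The sketch never consults an ICMP message.  A real implementation
   would also need to distinguish a lost probe from loss caused by
   congestion, which this model ignores.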

263	   An upper-layer protocol that does not rely on IP fragmentation never
264	   causes the underlying IP module to emit:

266	   o  A fragmentable IP packet (i.e., an IPv4 packet with the DF-bit set
267	      to zero).

269	   o  An IP fragment.

271	   o  A packet whose length is greater than the PMTU estimate.

273	   However, when the PMTU estimate is greater than the actual PMTU, the
274	   upper-layer protocol can cause the underlying IP module to emit a
275	   packet whose length is greater than the actual PMTU.  When this
276	   occurs, a downstream router drops the packet and the source node
277	   refines its PMTU estimate, employing either PMTUD or PLPMTUD
278	   procedures.

280	   When an upper-layer protocol that relies on IP source fragmentation
281	   only submits data to the underlying IP module, and the resulting
282	   packet is larger than the PMTU estimate, the underlying IP module
283	   fragments the packet and emits the fragments.  However, the upper-
284	   layer protocol never causes the underlying IP module to emit:

285	   o  A fragmentable IP packet.

287	   o  A packet whose length is greater than the PMTU estimate.

289	   When the PMTU estimate is greater than the actual PMTU, the upper-
290	   layer protocol can cause the underlying IP module to emit a packet
291	   whose length is greater than the actual PMTU.  When this occurs, a
292	   downstream router drops the packet and the source node refines its
293	   PMTU estimate, employing either PMTUD or PLPMTUD procedures.

295	   An upper-layer protocol that relies on IP source fragmentation and
296	   downstream fragmentation can cause the underlying IP module to emit:

298	   o  A fragmentable IP packet.

300	   o  An IP fragment.

302	   o  A packet whose length is greater than the PMTU estimate.

304	   A protocol that relies on IP source fragmentation and downstream
305	   fragmentation does not require access to the PMTU estimate.  For
306	   these protocols, the underlying IP module:

308	   o  Fragments all packets whose length exceeds the MTU of the first
309	      link along the path to the destination.

311	   o  Sets the DF-bit to zero, so that downstream nodes can fragment the
312	      packet.

314	3.  Requirements Language

316	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
317	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
318	   "OPTIONAL" in this document are to be interpreted as described in BCP
319	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
320	   capitals, as shown here.

322	4.  IP Fragmentation Reduces Reliability

324	   This section explains how IP fragmentation reduces the reliability of
325	   Internet communication.

327	4.1.  Middle Box Failures

329	   Many middle boxes require access to the transport-layer header.
330	   However, when a packet is divided into fragments, the transport-layer
331	   header appears in the first fragment only.  It does not appear in
332	   subsequent fragments.  This omission can prevent middle boxes from
333	   delivering their intended services.

335	   For example, assume that a router diverts selected packets from their
336	   normal path towards network appliances that support deep packet
337	   inspection and lawful intercept.  The router selects packets for
338	   diversion based upon the following 5-tuple:

340	   o  IP Source Address.

342	   o  IP Destination Address.

344	   o  IPv4 Protocol or IPv6 Next Header.

346	   o  transport-layer source port.

348	   o  transport-layer destination port.

350	   IP fragmentation causes this selection algorithm to behave
351	   suboptimally, because the transport-layer header appears only in the
352	   first fragment of each packet.

354	   In another example, a middle box remarks a packet's Differentiated
355	   Services Code Point [RFC2474] based upon the above-mentioned 5-tuple.
356	   IP fragmentation causes this process to behave suboptimally, because
357	   the transport-layer header appears only in the first fragment of each
358	   packet.

360	   In all of the above-mentioned examples, the middle box cannot deliver
361	   its intended service without reassembling fragmented packets.

363	4.2.  Partial Filtering

365	   IP fragments cause problems for firewalls whose filter rules include
366	   decision making based on TCP and UDP ports.  As the port information
367	   is not in the trailing fragments the firewall may elect to accept all
368	   trailing fragments, which may admit certain classes of attack, or may
369	   elect to block all trailing fragments, which may block otherwise
370	   legitimate traffic, or may elect to reassemble all fragmented
371	   packets, which may be inefficient and negatively affect performance.

373	4.3.  Telemetry and Monitoring Failures

375	   Stateless telemetry and monitoring strategies may require the
376	   transport-layer header to appear in every packet.  However, when a
377	   packet is divided into fragments, the transport-layer header appears
378	   in the first fragment only.  It does not appear in subsequent
379	   fragments.  This omission can prevent some stateless telemetry
380	   strategies from functioning correctly.

382	4.4.  Suboptimal Load Balancing

384	   Many stateless load-balancers require access to the transport-layer
385	   header.  Assume that a load-balancer distributes flows among parallel
386	   links.  In order to optimize load balancing, the load-balancer sends
387	   every packet or packet fragment belonging to a flow through the same
388	   link.

390	   In order to assign a packet or packet fragment to a link, the load-
391	   balancer executes an algorithm.  If the packet or packet fragment
392	   contains a transport-layer header, the load balancing algorithm
393	   accepts the following 5-tuple as input:

395	   o  IP Source Address.

397	   o  IP Destination Address.

399	   o  IPv4 Protocol or IPv6 Next Header.

401	   o  transport-layer source port.

403	   o  transport-layer destination port.

405	   However, if the packet or packet fragment does not contain a
406	   transport-layer header, the load balancing algorithm accepts only the
407	   following 3-tuple as input:

409	   o  IP Source Address.

411	   o  IP Destination Address.

413	   o  IPv4 Protocol or IPv6 Next Header.

415	   Therefore, non-fragmented packets belonging to a flow can be assigned
416	   to one link while fragmented packets belonging to the same flow can
417	   be divided between that link and another.  This can cause suboptimal
418	   load balancing.
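
   The difference between 5-tuple and 3-tuple hashing can be sketched in
   Python.  The hash function (SHA-256 over the field tuple), the packet
   representation, and the addresses are assumptions made for
   illustration; real load-balancers use their own algorithms.

```python
import hashlib


def pick_link(fields, num_links):
    """Map a tuple of header fields to a link index with a stable hash."""
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def link_for(packet, num_links):
    """Use the 5-tuple when ports are visible, else only the 3-tuple."""
    three_tuple = (packet["src"], packet["dst"], packet["proto"])
    if "sport" in packet:                      # transport-layer header present
        ports = (packet["sport"], packet["dport"])
        return pick_link(three_tuple + ports, num_links)
    return pick_link(three_tuple, num_links)   # trailing fragment


base = {"src": "192.0.2.1", "dst": "198.51.100.7", "proto": 17}
whole_packet = dict(base, sport=5000, dport=53)    # not fragmented
first_fragment = dict(base, sport=5000, dport=53)  # carries the UDP header
trailing_fragment = dict(base)                     # no UDP header

print(link_for(whole_packet, 4) == link_for(first_fragment, 4))  # True
print(link_for(first_fragment, 4), link_for(trailing_fragment, 4))
```

   The first fragment always follows the rest of its flow.  The trailing
   fragment is hashed on different input, so nothing ties it to the link
   chosen for the flow, and trailing fragments of every flow between the
   same two hosts share a single link.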

420	4.5.  Security Vulnerabilities

422	   Security researchers have documented several attacks that rely on IP
423	   fragmentation.  The following are examples:

425	   o  Overlapping fragment attack [RFC1858] [RFC3128] [RFC5722]

426	   o  Resource exhaustion attacks (such as the Rose Attack)

428	   o  Attacks based on predictable fragment identification values
429	      [RFC7739]

431	   o  Attacks based on bugs in the implementation of the fragment
432	      reassembly algorithm

434	   o  Evasion of Network Intrusion Detection Systems (NIDS) [Ptacek1998]

436	   In the overlapping fragment attack, an attacker constructs a series
437	   of packet fragments.  The first fragment contains an IP header, a
438	   transport-layer header, and some transport-layer payload.  This
439	   fragment complies with local security policy and is allowed to pass
440	   through a stateless firewall.  A second fragment, having a non-zero
441	   offset, overlaps with the first fragment.  The second fragment also
442	   passes through the stateless firewall.  When the packet is
443	   reassembled, the transport-layer header from the first fragment is
444	   overwritten by data from the second fragment.  The reassembled packet
445	   does not comply with local security policy.  Had it traversed the
446	   firewall in one piece, the firewall would have rejected it.

448	   A stateless firewall cannot protect against the overlapping fragment
449	   attack.  However, destination nodes can protect against the
450	   overlapping fragment attack by implementing the reassembly procedures
451	   described in RFC 1858, RFC 3128 and RFC 8200.  These reassembly
452	   procedures detect the overlap and discard the packet.
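
   Overlap detection at the destination node can be sketched as shown
   below.  This is an illustrative model only: fragments are reduced to
   hypothetical (offset, length) pairs, and exact duplicates are treated
   as overlaps, which a real implementation may choose to handle
   differently.

```python
def has_overlap(fragments):
    """Return True if any two fragments cover the same byte range.

    Each fragment is an (offset, length) pair, both in bytes.
    """
    end_of_previous = 0
    for offset, length in sorted(fragments):
        if offset < end_of_previous:
            return True          # this fragment starts inside the previous one
        end_of_previous = offset + length
    return False


legitimate = [(0, 1480), (1480, 1480), (2960, 48)]
attack = [(0, 1480), (8, 1480)]   # second fragment rewrites bytes 8 onward

print(has_overlap(legitimate))  # False
print(has_overlap(attack))      # True
```

   A node that detects an overlap in this way discards the whole packet
   instead of choosing which copy of the overlapping bytes to keep.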

454	   The fragment reassembly algorithm is a stateful procedure for an
455	   otherwise stateless protocol.  As such, it can be exploited for
456	   resource exhaustion attacks.  An attacker can construct a series of
457	   fragmented packets, with one fragment missing from each packet so
458	   that the reassembly process cannot complete.  Thus, this attack
459	   causes resource exhaustion on the destination node, possibly denying
460	   reassembly services to other flows.  This type of attack can be
461	   mitigated by flushing fragment reassembly buffers when necessary, at
462	   the expense of possibly dropping legitimate fragments.
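
   One way to bound reassembly state is sketched below.  The class name,
   the eviction rule (oldest entry first), and the timeout handling are
   assumptions made for illustration, not a description of any specific
   implementation.

```python
from collections import OrderedDict


class ReassemblyTable:
    """Bounded store of partially reassembled packets (hypothetical)."""

    def __init__(self, max_entries, timeout):
        self.max_entries = max_entries
        self.timeout = timeout          # seconds an entry may stay incomplete
        self.entries = OrderedDict()    # key -> (first_seen, fragments)

    def add_fragment(self, key, fragment, now):
        """Store a fragment; `key` identifies the original packet,
        e.g., (source, destination, identification)."""
        self.flush_expired(now)
        if key not in self.entries:
            if len(self.entries) >= self.max_entries:
                self.entries.popitem(last=False)   # evict the oldest entry
            self.entries[key] = (now, [])
        self.entries[key][1].append(fragment)

    def flush_expired(self, now):
        """Drop every entry that has been incomplete for too long."""
        expired = [k for k, (seen, _) in self.entries.items()
                   if now - seen >= self.timeout]
        for k in expired:
            del self.entries[k]


table = ReassemblyTable(max_entries=2, timeout=30)
table.add_fragment("packet_a", b"x", now=0)
table.add_fragment("packet_b", b"x", now=1)
table.add_fragment("packet_c", b"x", now=2)   # table full: packet_a evicted
print(list(table.entries))  # ['packet_b', 'packet_c']
```

   The eviction of packet_a shows the trade-off named above: if packet_a
   was legitimate, its remaining fragments can no longer complete the
   packet.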

464	   An IP fragment contains an "Identification" field that, together with
465	   the IP Source Address and Destination Address of a packet, identifies
466	   fragments that correspond to the same original datagram, so that they
467	   can be reassembled together by the receiving host.  Many
468	   implementations have employed predictable values for the
469	   Identification field, thus making it easy for an attacker to forge
470	   malicious IP fragments that would cause the reassembly procedure for
471	   legitimate packets to fail.

473	   Over the years multiple IPv4 and IPv6 implementations have been found
474	   to have flaws in their implementation of the IP fragment reassembly
475	   algorithm, typically resulting in buffer overflows.  These buffer
476	   overflows have been exploitable for denial of service and remote code
477	   execution attacks.

479	   NIDS aim to identify malicious activity by analyzing network
480	   traffic.  Ambiguity in the possible result of the fragment reassembly
481	   process may allow an attacker to evade these systems.  Many of these
482	   systems try to mitigate some of these evasion techniques (e.g., by
483	   computing all possible outcomes of the fragment reassembly process,
484	   at the expense of increased processing requirements).

486	4.6.  Blackholing Due to ICMP Loss

488	   As stated above, an upper-layer protocol requires access to the PMTU
489	   estimate if it:

491	   o  Does not rely on IP fragmentation.

493	   o  Relies on IP source fragmentation only (i.e., fragmentation at the
494	      source node).

496	   In order to satisfy this requirement, the upper-layer protocol can:

498	   o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
499	      MTU.

501	   o  Access the estimate that PMTUD produced.

503	   o  Execute PMTUD procedures itself.

505	   o  Execute PLPMTUD procedures.

507	   PMTUD relies upon the network's ability to deliver ICMP PTB messages
508	   to the source node.  Therefore, if an upper-layer protocol relies on
509	   PMTUD, it also relies on the network's ability to deliver ICMP PTB
510	   messages to the source node.

512	   According to [RFC4890], ICMP PTB messages must not be filtered.
513	   However, ICMP PTB delivery is not reliable.  It is subject to both
514	   transient and persistent loss.

516	   Transient loss of ICMP PTB messages causes PMTUD to perform less
517	   efficiently, but does not cause it to fail completely.  When the
518	   conditions contributing to transient loss abate, the network regains
519	   its ability to deliver ICMP PTB messages and PMTUD regains its
520	   ability to function.  Section 4.6.1 of this document describes
521	   conditions that lead to transient loss of ICMP PTB messages.

523	   However, persistent loss of ICMP PTB messages causes PMTUD to fail
524	   completely.  Section 4.6.2 and Section 4.6.3 of this document
525	   describe conditions that lead to persistent loss of ICMP PTB
526	   messages.

528	   The problem described in this section is specific to PMTUD.  It does
529	   not occur when the upper-layer protocol obtains its PMTU estimate
530	   from PLPMTUD or any other source.

532	4.6.1.  Transient Loss

534	   The following factors can contribute to transient loss of ICMP PTB
535	   messages:

537	   o  Network congestion.

539	   o  Packet corruption.

541	   o  Transient routing loops.

543	   o  ICMP rate limiting.

545	   The effect of rate limiting may be severe, as RFC 4443 recommends
546	   strict rate limiting of ICMPv6 traffic.

548	4.6.2.  Incorrect Implementation of Security Policy

550	   Incorrect implementation of security policy can cause persistent loss
551	   of ICMP PTB messages.

553	   Assume that a Customer Premises Equipment (CPE) router implements the
554	   following zone-based security policy:

556	   o  Allow any traffic to flow from the inside zone to the outside
557	      zone.

559	   o  Do not allow any traffic to flow from the outside zone to the
560	      inside zone unless it is part of an existing flow (i.e., it was
561	      elicited by an outbound packet).

563	   When a correct implementation of the above-mentioned security policy
564	   receives an ICMP PTB message, it examines the ICMP PTB payload in
565	   order to determine whether the original packet (i.e., the packet that
566	   elicited the ICMP PTB message) belonged to an existing flow.  If the
567	   original packet belonged to an existing flow, the implementation
568	   allows the ICMP PTB to flow from the outside zone to the inside zone.
569	   If not, the implementation discards the ICMP PTB message.

571	   When an incorrect implementation of the above-mentioned security
572	   policy receives an ICMP PTB message, it discards the packet because
573	   its source address is not associated with an existing flow.
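
   The difference between the two implementations can be sketched in
   Python.  This is a simplified model: flows are keyed by an address
   pair instead of a full 5-tuple, and the addresses, dictionary layout,
   and function names are hypothetical.

```python
# Flows opened from the inside zone: (inside address, outside address).
existing_flows = {("192.0.2.10", "198.51.100.1")}


def correct_policy(ptb):
    """Admit the PTB if the packet quoted in its payload belongs to a flow."""
    quoted = ptb["quoted_packet"]
    return (quoted["src"], quoted["dst"]) in existing_flows


def incorrect_policy(ptb):
    """Admit the PTB only if its own source address belongs to a flow."""
    return any(ptb["src"] == outside for _, outside in existing_flows)


# A router in the middle of the path (203.0.113.5) reports the drop.
ptb = {
    "src": "203.0.113.5",
    "dst": "192.0.2.10",
    "quoted_packet": {"src": "192.0.2.10", "dst": "198.51.100.1"},
}

print(correct_policy(ptb))    # True
print(incorrect_policy(ptb))  # False
```

   The ICMP PTB message originates at an intermediate router, so its
   source address never matches a flow that the inside host opened.
   Only the quoted original packet links the message to that flow.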

575	   The security policy described above is implemented incorrectly on
576	   many consumer CPE routers.

578	4.6.3.  Persistent Loss Caused By Anycast

580	   Anycast can cause persistent loss of ICMP PTB messages.  Consider the
581	   example below:

583	   A DNS client sends a request to an anycast address.  The network
584	   routes that DNS request to the nearest instance of that anycast
585	   address (i.e., a DNS Server).  The DNS server generates a response
586	   and sends it back to the DNS client.  While the response does not
587	   exceed the DNS server's PMTU estimate, it does exceed the actual
588	   PMTU.

590	   A downstream router drops the packet and sends an ICMP PTB message to
591	   the packet's source (i.e., the anycast address).  The network routes
592	   the ICMP PTB message to the anycast instance closest to the
593	   downstream router.  Sadly, that anycast instance may not be the DNS
594	   server that originated the DNS response.  It may be another DNS
595	   server with the same anycast address.  The DNS server that originated
596	   the response may never receive the ICMP PTB message and may never
597	   update its PMTU estimate.
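
   The example above can be modeled with a few lines of Python.  The hop
   counts and node names are invented for illustration, and "nearest" is
   reduced to the smallest hop count.

```python
# Hypothetical hop counts from each sender to each instance of one
# anycast address.
HOPS = {
    "dns_client":        {"server_a": 2, "server_b": 7},
    "downstream_router": {"server_a": 6, "server_b": 3},
}


def nearest_instance(sender):
    """Return the anycast instance that the network routes `sender` to."""
    distances = HOPS[sender]
    return min(distances, key=distances.get)


responder = nearest_instance("dns_client")             # answers the request
ptb_receiver = nearest_instance("downstream_router")   # receives the ICMP PTB

print(responder, ptb_receiver)  # server_a server_b
```

   In this model server_a sends the oversized response, but the ICMP PTB
   message reaches server_b, so server_a has no reason to lower its PMTU
   estimate.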

599	4.7.  Blackholing Due To Filtering

601	   In RFC 7872, researchers sampled Internet paths to determine whether
602	   they would convey packets that contain IPv6 extension headers.
603	   Sampled paths terminated at popular Internet sites (e.g., popular
604	   web, mail and DNS servers).

606	   The study revealed that at least 28% of the sampled paths did not
607	   convey packets containing the IPv6 Fragment extension header.  In
608	   most cases, fragments were dropped in the destination autonomous
609	   system.  In other cases, the fragments were dropped in transit
610	   autonomous systems.

612	   Another recent study [Huston] confirmed this finding.  It reported
613	   that 37% of sampled endpoints used IPv6-capable DNS resolvers that
614	   were incapable of receiving a fragmented IPv6 response.

616	   It is difficult to determine why network operators drop fragments.
617	   Possible causes follow:

619	   o  Hardware inability to process fragmented packets.

621	   o  Failure to change vendor defaults.

623	   o  Unintentional misconfiguration.

625	   o  Intentional configuration (e.g., network operators consciously
626	      choose to drop IPv6 fragments in order to address the issues
627	      raised in Section 4.1 through Section 4.6, above).

629	5.  Alternatives to IP Fragmentation

631	5.1.  Transport Layer Solutions

633	   The Transmission Control Protocol (TCP) [RFC0793] can be operated in
634	   a mode that does not require IP fragmentation.

636	   Applications submit a stream of data to TCP.  TCP divides that stream
637	   of data into segments, with no segment exceeding the TCP Maximum
638	   Segment Size (MSS).  Each segment is encapsulated in a TCP header and
639	   submitted to the underlying IP module.  The underlying IP module
640	   prepends an IP header and forwards the resulting packet.

642	   If the TCP MSS is sufficiently small, the underlying IP module never
643	   produces a packet whose length is greater than the actual PMTU.
644	   Therefore, IP fragmentation is not required.
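
   The segmentation described above can be illustrated with a minimal
   Python sketch.  The function names and the fixed 60-byte header
   overhead (a 40-byte IPv6 header plus a 20-byte TCP header without
   options) are assumptions made for illustration; they are not taken
   from any TCP specification or implementation.

```python
HEADER_OVERHEAD = 60  # assumed: 40-byte IPv6 header + 20-byte TCP header


def segment_stream(data: bytes, mss: int) -> list:
    """Divide a byte stream into segments of at most mss bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [data[i:i + mss] for i in range(0, len(data), mss)]


def needs_ip_fragmentation(segments, pmtu: int) -> bool:
    """Return True if any resulting packet would be longer than pmtu."""
    return any(len(s) + HEADER_OVERHEAD > pmtu for s in segments)


stream = bytes(3000)  # 3000 bytes of application data
print(needs_ip_fragmentation(segment_stream(stream, 1220), pmtu=1280))  # False
print(needs_ip_fragmentation(segment_stream(stream, 1460), pmtu=1280))  # True
```

   With an MSS of 1220 bytes, no packet exceeds a 1280-byte PMTU.  With
   an MSS of 1460 bytes, the first packet is 1520 bytes long and would
   require fragmentation on the same path.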

646	   TCP offers the following mechanisms for MSS management:

648	   o  Manual configuration

650	   o  PMTUD

652	   o  PLPMTUD

654	   For IPv6 nodes, manual configuration is always applicable.  If the
655	   MSS is manually configured to 1220 bytes and the packet does not
656	   contain extension headers, the IP layer will never produce a packet
657	   whose length is greater than the IPv6 minimum link MTU (1280 bytes).
658	   However, manual configuration prevents TCP from taking advantage of
659	   larger link MTUs.
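
   The 1220-byte figure follows from simple arithmetic, shown here as a
   short Python sketch.  The helper name is hypothetical, and the
   20-byte TCP header length assumes that no TCP options are present.

```python
IPV6_MIN_LINK_MTU = 1280  # bytes
IPV6_HEADER_LEN = 40      # fixed IPv6 header
TCP_HEADER_LEN = 20       # assumed: TCP header without options


def largest_safe_mss(mtu=IPV6_MIN_LINK_MTU, ext_header_len=0):
    """Largest MSS that keeps every packet within mtu bytes."""
    return mtu - IPV6_HEADER_LEN - ext_header_len - TCP_HEADER_LEN


print(largest_safe_mss())                  # 1220
print(largest_safe_mss(ext_header_len=8))  # 1212
print(largest_safe_mss(mtu=1500))          # 1440
```

   The second call shows why the guarantee holds only when the packet
   carries no extension headers: every byte of extension header reduces
   the room left for the TCP segment.  The third call shows the payload
   that a fixed 1220-byte MSS leaves unused on a 1500-byte link.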

661	   RFC 8200 strongly recommends that IPv6 nodes implement PMTUD, in
662	   order to discover and take advantage of path MTUs greater than 1280
663	   bytes.  However, as mentioned in Section 2.1, PMTUD relies upon the
664	   network's ability to deliver ICMP PTB messages.  Therefore, PMTUD is
665	   applicable only in environments where the risk of ICMP PTB loss is
666	   acceptable.

668	   By contrast, PLPMTUD does not rely upon the network's ability to
669	   deliver ICMP PTB messages.  However, in many loss-based TCP
670	   congestion control algorithms, the dropping of a packet may cause the
671	   TCP control algorithm to reduce the congestion window, or even to
672	   restart the entire slow start process.  For high capacity, long
673	   round-trip time, large volume TCP streams, the deliberate probing
674	   with large packets and the consequent packet drop may impose too
675	   harsh a penalty on total TCP throughput for it to be a viable
676	   approach.  [RFC4821] defines PLPMTUD procedures for TCP.
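
   The probing idea can be sketched as a search between a size that is
   known to work and a size that is hoped for.  The following Python
   fragment is a simplified illustration only.  The binary search
   strategy and the probe() callback are assumptions made for this
   sketch; [RFC4821] leaves the choice of search strategy to the
   implementation.

```python
def search_pmtu(probe, search_low, search_high):
    """Return the largest size in [search_low, search_high] that probe()
    reports as deliverable, assuming search_low itself is deliverable.

    probe(size) is a hypothetical callback that sends one packet of the
    given size and returns True if it was acknowledged.
    """
    while search_low < search_high:
        candidate = (search_low + search_high + 1) // 2
        if probe(candidate):
            search_low = candidate       # probe delivered: raise the floor
        else:
            search_high = candidate - 1  # probe lost: lower the ceiling
    return search_low


# Simulated path whose actual PMTU is 1400 bytes.
lost_probes = []


def simulated_probe(size, actual_pmtu=1400):
    delivered = size <= actual_pmtu
    if not delivered:
        lost_probes.append(size)
    return delivered


print(search_pmtu(simulated_probe, 1280, 1500))  # 1400
print(len(lost_probes))                          # 4
```

   Each entry in lost_probes corresponds to a deliberately lost packet,
   which is the cost that the paragraph above describes.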

678	   While TCP will never cause the underlying IP module to emit a packet
679	   that is larger than the PMTU estimate, it can cause the underlying IP
680	   module to emit a packet that is larger than the actual PMTU.  If this
681	   occurs, the packet is dropped, the PMTU estimate is updated, the
682	   segment is divided into smaller segments and each smaller segment is
683	   submitted to the underlying IP module.

685	   The Datagram Congestion Control Protocol (DCCP) [RFC4340] and the
686	   Stream Control Transmission Protocol (SCTP) [RFC4960] also can be
687	   operated in a mode that does not require IP fragmentation.  They both
688	   accept data from an application and divide that data into segments,
689	   with no segment exceeding a maximum size.  Both DCCP and SCTP offer
690	   manual configuration, PMTUD and PLPMTUD as mechanisms for managing
691	   that maximum size.  [I-D.fairhurst-tsvwg-datagram-plpmtud] proposes
692	   PLPMTUD procedures for DCCP and SCTP.

694	   Currently, the User Datagram Protocol (UDP) [RFC0768] lacks a
695	   fragmentation mechanism of its own and relies on IP fragmentation.
696	   However, [I-D.ietf-tsvwg-udp-options] proposes a fragmentation
697	   mechanism for UDP.

699	5.2.  Application Layer Solutions

701	   [RFC8085] recognizes that IP fragmentation reduces the reliability of
702	   Internet communication.  It also recognizes that UDP lacks a
703	   fragmentation mechanism of its own and relies on IP fragmentation.
704	   Therefore, [RFC8085] offers the following advice regarding
705	   applications that run over UDP:

707	   "An application SHOULD NOT send UDP datagrams that result in IP
708	   packets that exceed the Maximum Transmission Unit (MTU) along the
709	   path to the destination.  Consequently, an application SHOULD either
710	   use the path MTU information provided by the IP layer or implement
711	   Path MTU Discovery (PMTUD) itself to determine whether the path to a
712	   destination will support its desired message size without
713	   fragmentation."

715	   RFC 8085 continues:

717	   "Applications that do not follow the recommendation to do PMTU/
718	   PLPMTUD discovery SHOULD still avoid sending UDP datagrams that would
719	   result in IP packets that exceed the path MTU.  Because the actual
720	   path MTU is unknown, such applications SHOULD fall back to sending
721	   messages that are shorter than the default effective MTU for sending
722	   (EMTU_S in [RFC1122]).  For IPv4, EMTU_S is the smaller of 576 bytes
723	   and the first-hop MTU.  For IPv6, EMTU_S is 1280 bytes.  The
724	   effective PMTU for a directly connected destination (with no routers
725	   on the path) is the configured interface MTU, which could be less
726	   than the maximum link payload size.  Transmission of minimum-sized
727	   UDP datagrams is inefficient over paths that support a larger PMTU,
728	   which is a second reason to implement PMTU discovery."

730	   RFC 8085 assumes that for IPv4, an EMTU_S of 576 is sufficiently
731	   small, even though the IPv4 minimum link MTU is 68 bytes.
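
   The fallback sizes quoted above translate into UDP payload limits as
   sketched below in Python.  The function names are hypothetical, and
   the sketch assumes an IPv4 header without options and an IPv6 packet
   without extension headers.

```python
IPV4_HEADER_LEN = 20  # assumed: no IPv4 options
IPV6_HEADER_LEN = 40  # assumed: no extension headers
UDP_HEADER_LEN = 8


def default_emtu_s(ip_version, first_hop_mtu):
    """Default EMTU_S as described in the RFC 8085 text quoted above."""
    if ip_version == 4:
        return min(576, first_hop_mtu)
    if ip_version == 6:
        return 1280
    raise ValueError("ip_version must be 4 or 6")


def max_udp_payload(ip_version, first_hop_mtu):
    """Largest UDP payload whose IP packet stays within default EMTU_S."""
    ip_header = IPV4_HEADER_LEN if ip_version == 4 else IPV6_HEADER_LEN
    return default_emtu_s(ip_version, first_hop_mtu) - ip_header - UDP_HEADER_LEN


print(max_udp_payload(4, first_hop_mtu=1500))  # 548
print(max_udp_payload(6, first_hop_mtu=1500))  # 1232
```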

733	   This advice applies equally to applications that run directly over IP.

735	6.  Applications That Rely on IPv6 Fragmentation

737	   The following applications rely on IPv6 fragmentation:

739	   o  DNS [RFC1035]

741	   o  OSPFv3 [RFC5340]

743	   o  Packet-in-packet encapsulations

745	   Each of these applications relies on IPv6 fragmentation to a varying
746	   degree.  In some cases, that reliance is essential, and cannot be
747	   broken without fundamentally changing the protocol.  In other cases,
748	   that reliance is incidental, and most implementations already take
749	   appropriate steps to avoid fragmentation.

751	   This list is not comprehensive, and other protocols that rely on IPv6
752	   fragmentation may exist.  They are not specifically considered in the
753	   context of this document.

755	6.1.  DNS

757	   DNS relies on UDP for efficiency, and the consequence is the use of
758	   IP fragmentation for large responses, as permitted by the DNS EDNS(0)
759	   options in the query.  It is possible to mitigate the issue of
760	   fragmentation-based packet loss by having queries use smaller EDNS(0)
761	   UDP buffer sizes, but then the operational issue of the partial level
762	   of support for DNS over TCP over IPv6 becomes a limiting factor of
763	   the efficacy of this approach in an IPv6 context [Damas].
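
   As an illustration of the mitigation mentioned above, the following
   Python sketch builds a DNS query whose EDNS(0) OPT record advertises
   a small UDP buffer size.  The function name, the transaction ID and
   the 1232-byte value (1280 bytes less the IPv6 and UDP headers) are
   assumptions chosen for this example; this document does not
   recommend a particular buffer size.

```python
import struct


def build_query(name, qtype=1, udp_buffer_size=1232, txid=0x1234):
    """Build a DNS query message that carries an EDNS(0) OPT record."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 1)
    labels = [l for l in name.encode("ascii").split(b".") if l]
    qname = b"".join(bytes([len(l)]) + l for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS = IN
    # OPT pseudo-RR: root name, TYPE=41, CLASS carries the UDP buffer
    # size, TTL carries extended RCODE/version/flags, RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_buffer_size, 0, 0)
    return header + question + opt


query = build_query("example.com")
print(len(query))                             # 40
print(struct.unpack("!H", query[-8:-6])[0])   # 1232
```

   A responder that honors the advertised size truncates larger answers
   instead of fragmenting them, which is what moves the load onto DNS
   over TCP.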

765	   Larger DNS responses can normally be avoided by aggressively pruning
766	   the Additional section of DNS responses.  One scenario where such
767	   pruning is ineffective is in the use of DNSSEC, where large key sizes
768	   act to increase the response size to certain DNS queries.  There is
769	   no effective response to this situation within the DNS other than
770	   using smaller cryptographic keys and adoption of DNSSEC
771	   administrative practices that attempt to keep DNS response as short
772	   as possible.

774	6.2.  OSPFv3

776	   OSPFv3 implementations can emit messages large enough to cause IPv6
777	   fragmentation.  However, in keeping with the recommendations of
778	   RFC 8200, and in order to optimize performance, most OSPFv3
779	   implementations restrict their maximum message size to the IPv6
780	   minimum link MTU.

782	6.3.  Packet-in-Packet Encapsulations

784	   In this document, packet-in-packet encapsulations include IP-in-IP
785	   [RFC2003], Generic Routing Encapsulation (GRE) [RFC2784], GRE-in-UDP
786	   [RFC8086] and Generic Packet Tunneling in IPv6 [RFC2473].  [RFC4459]
787	   describes fragmentation issues associated with all of the above-
788	   mentioned encapsulations.

790	   The fragmentation strategy described for GRE in [RFC7588] has been
791	   deployed for all of the above-mentioned encapsulations.  This
792	   strategy does not rely on IPv6 fragmentation except in one corner
793	   case (see Section 3.3.2.2 of RFC 7588 and Section 7.1 of RFC 2473).
794	   Section 3.3 of [RFC7676] further describes this corner case.

796	7.  Recommendations

798	7.1.  For Application Developers

800	   Application developers SHOULD NOT develop applications that rely on
801	   IPv6 fragmentation.

803	   Application-layer protocols that depend upon IPv6 fragmentation
804	   SHOULD be updated to break that dependency.

806	7.2.  For Network Operators

808	   As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
809	   messages unless they are known to be forged or otherwise
810	   illegitimate.  As stated in Section 4.6, filtering ICMPv6 PTB packets
811	   causes PMTUD to fail.  Operators MUST ensure proper PMTUD operation
812	   in their network, including making sure the network generates PTB
813	   packets when dropping packets that are too large for the outgoing
814	   interface MTU.

816	   Many upper-layer protocols rely on PMTUD.

818	8.  IANA Considerations

820	   This document makes no request of IANA.

822	9.  Security Considerations

824	   This document mitigates some of the security considerations
825	   associated with IP fragmentation by discouraging the use of IP
826	   fragmentation.  It does not introduce any new security
827	   vulnerabilities, because it does not introduce any new alternatives
828	   to IP fragmentation.  Instead, it recommends well-understood
829	   alternatives.

831	10.  Acknowledgements

833	   Thanks to Mikael Abrahamsson, Lorenzo Colitti, Mike Heard, Tom
834	   Herbert, Tatuya Jinmei, Paolo Lucente, Eric Nygren, and Joe Touch for
835	   their comments.

837	11.  References

839	11.1.  Normative References

841	   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
842	              DOI 10.17487/RFC0768, August 1980,
843	              <https://www.rfc-editor.org/info/rfc768>.

845	   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
846	              DOI 10.17487/RFC0791, September 1981,
847	              <https://www.rfc-editor.org/info/rfc791>.

849	   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
850	              RFC 792, DOI 10.17487/RFC0792, September 1981,
851	              <https://www.rfc-editor.org/info/rfc792>.

853	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
854	              RFC 793, DOI 10.17487/RFC0793, September 1981,
855	              <https://www.rfc-editor.org/info/rfc793>.

857	   [RFC1035]  Mockapetris, P., "Domain names - implementation and
858	              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
859	              November 1987, <https://www.rfc-editor.org/info/rfc1035>.

861	   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
862	              DOI 10.17487/RFC1191, November 1990,
863	              <https://www.rfc-editor.org/info/rfc1191>.

865	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
866	              Requirement Levels", BCP 14, RFC 2119,
867	              DOI 10.17487/RFC2119, March 1997,
868	              <https://www.rfc-editor.org/info/rfc2119>.

870	   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet
871	              Control Message Protocol (ICMPv6) for the Internet
872	              Protocol Version 6 (IPv6) Specification", STD 89,
873	              RFC 4443, DOI 10.17487/RFC4443, March 2006,
874	              <https://www.rfc-editor.org/info/rfc4443>.

876	   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
877	              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007,
878	              <https://www.rfc-editor.org/info/rfc4821>.

880	   [RFC8085]  Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
881	              Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
882	              March 2017, <https://www.rfc-editor.org/info/rfc8085>.

884	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
885	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
886	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

888	   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
889	              (IPv6) Specification", STD 86, RFC 8200,
890	              DOI 10.17487/RFC8200, July 2017,
891	              <https://www.rfc-editor.org/info/rfc8200>.

893	   [RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
894	              "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
895	              DOI 10.17487/RFC8201, July 2017,
896	              <https://www.rfc-editor.org/info/rfc8201>.

898	11.2.  Informative References

900	   [Damas]    Damas, J. and G. Huston, "Measuring ATR", April 2018,
901	              <http://www.potaroo.net/ispcol/2018-04/atr.html>.

903	   [Huston]   Huston, G., "IPv6, Large UDP Packets and the DNS", August
904	              2017, <http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html>.

907	   [I-D.fairhurst-tsvwg-datagram-plpmtud]
908	              Fairhurst, G., Jones, T., Tuexen, M., and I. Ruengeler,
909	              "Packetization Layer Path MTU Discovery for Datagram
910	              Transports", draft-fairhurst-tsvwg-datagram-plpmtud-02
911	              (work in progress), December 2017.

913	   [I-D.ietf-tsvwg-udp-options]
914	              Touch, J., "Transport Options for UDP", draft-ietf-tsvwg-
915	              udp-options-05 (work in progress), July 2018.

917	   [Ptacek1998]
918	              Ptacek, T. and T. Newsham, "Insertion, Evasion and Denial
919	              of Service: Eluding Network Intrusion Detection", 1998,
920	              <http://www.aciri.org/vern/Ptacek-Newsham-Evasion-98.ps>.

922	   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
923	              Communication Layers", STD 3, RFC 1122,
924	              DOI 10.17487/RFC1122, October 1989,
925	              <https://www.rfc-editor.org/info/rfc1122>.

927	   [RFC1858]  Ziemba, G., Reed, D., and P. Traina, "Security
928	              Considerations for IP Fragment Filtering", RFC 1858,
929	              DOI 10.17487/RFC1858, October 1995,
930	              <https://www.rfc-editor.org/info/rfc1858>.

932	   [RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
933	              DOI 10.17487/RFC2003, October 1996,
934	              <https://www.rfc-editor.org/info/rfc2003>.

936	   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
937	              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
938	              December 1998, <https://www.rfc-editor.org/info/rfc2473>.

940	   [RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black,
941	              "Definition of the Differentiated Services Field (DS
942	              Field) in the IPv4 and IPv6 Headers", RFC 2474,
943	              DOI 10.17487/RFC2474, December 1998,
944	              <https://www.rfc-editor.org/info/rfc2474>.

946	   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
947	              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
948	              DOI 10.17487/RFC2784, March 2000,
949	              <https://www.rfc-editor.org/info/rfc2784>.

951	   [RFC3128]  Miller, I., "Protection Against a Variant of the Tiny
952	              Fragment Attack (RFC 1858)", RFC 3128,
953	              DOI 10.17487/RFC3128, June 2001,
954	              <https://www.rfc-editor.org/info/rfc3128>.

956	   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
957	              Congestion Control Protocol (DCCP)", RFC 4340,
958	              DOI 10.17487/RFC4340, March 2006,
959	              <https://www.rfc-editor.org/info/rfc4340>.

961	   [RFC4459]  Savola, P., "MTU and Fragmentation Issues with In-the-
962	              Network Tunneling", RFC 4459, DOI 10.17487/RFC4459, April
963	              2006, <https://www.rfc-editor.org/info/rfc4459>.

965	   [RFC4890]  Davies, E. and J. Mohacsi, "Recommendations for Filtering
966	              ICMPv6 Messages in Firewalls", RFC 4890,
967	              DOI 10.17487/RFC4890, May 2007,
968	              <https://www.rfc-editor.org/info/rfc4890>.

970	   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
971	              RFC 4960, DOI 10.17487/RFC4960, September 2007,
972	              <https://www.rfc-editor.org/info/rfc4960>.

974	   [RFC5340]  Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF
975	              for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008,
976	              <https://www.rfc-editor.org/info/rfc5340>.

978	   [RFC5722]  Krishnan, S., "Handling of Overlapping IPv6 Fragments",
979	              RFC 5722, DOI 10.17487/RFC5722, December 2009,
980	              <https://www.rfc-editor.org/info/rfc5722>.

982	   [RFC5927]  Gont, F., "ICMP Attacks against TCP", RFC 5927,
983	              DOI 10.17487/RFC5927, July 2010,
984	              <https://www.rfc-editor.org/info/rfc5927>.

986	   [RFC7588]  Bonica, R., Pignataro, C., and J. Touch, "A Widely
987	              Deployed Solution to the Generic Routing Encapsulation
988	              (GRE) Fragmentation Problem", RFC 7588,
989	              DOI 10.17487/RFC7588, July 2015,
990	              <https://www.rfc-editor.org/info/rfc7588>.

992	   [RFC7676]  Pignataro, C., Bonica, R., and S. Krishnan, "IPv6 Support
993	              for Generic Routing Encapsulation (GRE)", RFC 7676,
994	              DOI 10.17487/RFC7676, October 2015,
995	              <https://www.rfc-editor.org/info/rfc7676>.

997	   [RFC7739]  Gont, F., "Security Implications of Predictable Fragment
998	              Identification Values", RFC 7739, DOI 10.17487/RFC7739,
999	              February 2016, <https://www.rfc-editor.org/info/rfc7739>.

1001	   [RFC7872]  Gont, F., Linkova, J., Chown, T., and W. Liu,
1002	              "Observations on the Dropping of Packets with IPv6
1003	              Extension Headers in the Real World", RFC 7872,
1004	              DOI 10.17487/RFC7872, June 2016,
1005	              <https://www.rfc-editor.org/info/rfc7872>.

1007	   [RFC8086]  Yong, L., Ed., Crabbe, E., Xu, X., and T. Herbert, "GRE-
1008	              in-UDP Encapsulation", RFC 8086, DOI 10.17487/RFC8086,
1009	              March 2017, <https://www.rfc-editor.org/info/rfc8086>.

1011	Appendix A.  Contributors' Address

1013	Authors' Addresses

1015	   Ron Bonica
1016	   Juniper Networks
1017	   2251 Corporate Park Drive
1018	   Herndon, Virginia  20171
1019	   USA

1021	   Email: rbonica@juniper.net

1023	   Fred Baker
1024	   Unaffiliated
1025	   Santa Barbara, California  93117
1026	   USA

1028	   Email: FredBaker.IETF@gmail.com

1030	   Geoff Huston
1031	   APNIC
1032	   6 Cordelia St
1033	   Brisbane, 4101 QLD
1034	   Australia

1036	   Email: gih@apnic.net
1037	   Robert M. Hinden
1038	   Check Point Software
1039	   959 Skyway Road
1040	   San Carlos, California  94070
1041	   USA

1043	   Email: bob.hinden@gmail.com

1045	   Ole Troan
1046	   Cisco
1047	   Philip Pedersens vei 1
1048	   N-1366 Lysaker
1049	   Norway

1051	   Email: ot@cisco.com

1053	   Fernando Gont
1054	   SI6 Networks
1055	   Evaristo Carriego 2644
1056	   Haedo, Provincia de Buenos Aires
1057	   Argentina

1059	   Email: fgont@si6networks.com