IP Fragmentation Considered
Fragile
Juniper Networks
2251 Corporate Park Drive
Herndon
20171
Virginia
USA
rbonica@juniper.net
Unaffiliated
Santa Barbara
California
93117
USA
FredBaker.IETF@gmail.com
APNIC
6 Cordelia St
Brisbane
4101 QLD
Australia
gih@apnic.net
Check Point Software
959 Skyway Road
San Carlos
California
94070
USA
bob.hinden@gmail.com
Cisco
Philip Pedersens vei 1
N-1366 Lysaker
Norway
ot@cisco.com
Internet Area
Internet Area WG
IPv6
Fragmentation
This document provides an overview of IP fragmentation. It explains
how IP fragmentation works and why it is required. As part of that
explanation, this document also explains how IP fragmentation reduces
the reliability of Internet communication.
This document also proposes alternatives to IP fragmentation.
Finally, it provides recommendations for application developers and
network operators.
Operational experience reveals that IP fragmentation reduces the reliability
of Internet communication. This document provides an overview of IP
fragmentation. It explains how IP fragmentation works and why it is
required. As part of that explanation, this document also explains how
IP fragmentation reduces the reliability of Internet communication.
This document also proposes alternatives to IP fragmentation.
Finally, it provides recommendations for application developers and
network operators.
An Internet path connects a source node to a destination node. A
path can contain links and intermediate systems. If a path contains
more than one link, the links are connected in series and an
intermediate system connects each link to the next. An intermediate
system can be a router or a middle box.
Internet paths are dynamic. Assume that the path from one node to
another contains a set of links and intermediate systems. If the
network topology changes, that path can also change so that it
includes a different set of links and intermediate systems.
Each link is constrained by the number of bytes that it can convey
in a single IP packet. This constraint is called the link Maximum
Transmission Unit (MTU). IPv4 requires
every link to have an MTU of 68 bytes or greater. IPv6 requires every link to have an MTU of
1280 bytes or greater. These are called the IPv4 and IPv6 minimum link
MTUs.
Each Internet path is constrained by the number of bytes that it
can convey in a single IP packet. This constraint is called the Path
MTU (PMTU). For any given path, the PMTU is equal to the smallest of
its link MTUs. Because Internet paths are dynamic, PMTU is also
dynamic.
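The PMTU definition above can be illustrated with a minimal sketch. The function name and the sample link MTUs are hypothetical and serve only as an illustration.

```python
def path_mtu(link_mtus):
    """Return the PMTU (bytes) of a path, given the MTU of each of its links."""
    if not link_mtus:
        raise ValueError("a path contains at least one link")
    # The PMTU is the smallest link MTU along the path.
    return min(link_mtus)


# Hypothetical three-link path: the 1400-byte link constrains the whole path.
print(path_mtu([1500, 1400, 9000]))  # 1400
```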
For reasons described below, source nodes estimate the PMTU between
themselves and destination nodes. A source node can produce extremely
conservative PMTU estimates in which:
The estimate for each IPv4 path is equal to the IPv4 minimum link
MTU (68 bytes).
The estimate for each IPv6 path is equal to the IPv6 minimum
link MTU (1280 bytes).
While these conservative estimates are guaranteed to be less
than or equal to the actual PMTU, they are likely to be much less than
the actual PMTU. This may adversely affect upper-layer protocol
performance.
By executing Path MTU Discovery
(PMTUD) procedures, a source node can
maintain a less conservative, running estimate of the PMTU between
itself and a destination node. According to these procedures, the
source node produces an initial PMTU estimate. This initial estimate
is equal to the MTU of the first link along the path to the destination
node. It can be greater than the actual PMTU.
Having produced an initial PMTU estimate, the source node sends
non-fragmentable IP packets to the destination node. If one of these
packets is larger than the actual PMTU, a downstream router will not
be able to forward the packet through the next link along the path.
Therefore, the downstream router drops the packet and sends an Internet Control Message Protocol (ICMP) Packet Too Big (PTB) message to the source node.
The ICMP PTB message indicates the MTU of the link through which the
packet could not be forwarded. The source node uses this information
to refine its PMTU estimate.
PMTUD produces a running estimate of the PMTU between a source node
and a destination node. Because PMTU is dynamic, at any given time,
the PMTU estimate can differ from the actual PMTU. In order to detect
PMTU increases, PMTUD occasionally resets the PMTU estimate to the MTU
of the first link along the path to the destination node. It then repeats
the procedure described above.
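The PMTUD procedure described above can be sketched as a small simulation. This is an illustrative model only: the function name is hypothetical, the path is represented simply as a list of link MTUs, and every ICMP PTB message is assumed to reach the source node, which is not guaranteed in practice.

```python
def discover_pmtu(link_mtus):
    """Simulate PMTUD over a path given as a list of link MTUs (bytes).

    Returns (final_estimate, number_of_ptb_messages_received).
    Assumes every ICMP PTB message is delivered to the source node.
    """
    estimate = link_mtus[0]  # initial estimate: MTU of the first link
    ptb_count = 0
    while True:
        # The first link whose MTU is smaller than the packet blocks it.
        blocking = next((mtu for mtu in link_mtus if mtu < estimate), None)
        if blocking is None:
            return estimate, ptb_count  # the packet traversed the whole path
        # The router drops the packet and reports that link's MTU in a PTB.
        estimate = blocking
        ptb_count += 1


# Hypothetical four-link path; two PTB messages refine the estimate to 1280.
print(discover_pmtu([1500, 1400, 1280, 1500]))  # (1280, 2)
```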
Furthermore, PMTUD has the following characteristics:
It relies on the network's ability to deliver ICMP PTB messages
to the source node.
It is susceptible to attack because ICMP messages are easily
forged.
FOOTNOTE: According to RFC 0791, every IPv4 host must be capable of
receiving a packet whose length is equal to 576 bytes. However, the
IPv4 minimum link MTU is not 576. Section 3.2 of RFC 0791 explicitly
states that the IPv4 minimum link MTU is 68 bytes.
FOOTNOTE: In the paragraphs above, the term "non-fragmentable
packet" is introduced. A non-fragmentable packet can be fragmented at
its source. However, it cannot be fragmented by a downstream node. An
IPv4 packet whose DF-bit is set to zero is fragmentable. An IPv4
packet whose DF-bit is set to one is non-fragmentable. All IPv6
packets are also non-fragmentable.
FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is
introduced. The ICMP PTB message has two instantiations. In ICMPv4, the ICMP PTB message is a Destination
Unreachable message with Code equal to (4) fragmentation needed and DF
set. This message was augmented by RFC 1191 to
indicate the MTU of the link through which the packet could not be
forwarded. In ICMPv6, the ICMP PTB
message is a Packet Too Big Message with Code equal to (0). This
message also indicates the MTU of the link through which the packet
could not be forwarded.
When an upper-layer protocol submits data to the underlying IP
module, and the resulting IP packet's length is greater than the PMTU,
IP fragmentation may be required. IP fragmentation divides a packet
into fragments. Each fragment includes an IP header and a portion of
the original packet.
RFC 791 describes IPv4 fragmentation procedures.
IPv4 packets whose DF-bit is set to one cannot be fragmented. IPv4
packets whose DF-bit is set to zero can be fragmented at the source
node or by any downstream router. RFC 8200 describes
IPv6 fragmentation procedures. IPv6 packets can be fragmented at the
source node only.
IPv4 fragmentation differs slightly from IPv6 fragmentation.
However, in both IP versions, the upper-layer header appears in the
first fragment only. It does not appear in subsequent fragments.
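As a rough illustration of the paragraphs above, the following sketch divides an IPv4 payload into fragments. It is a simplified model, not a complete implementation: the function name is hypothetical, the IPv4 header is assumed to carry no options (20 bytes), and only the fragment offset (in 8-byte units), the payload size, and the More Fragments flag are computed.

```python
def fragment_ipv4(payload_len, mtu, header_len=20):
    """Split an IPv4 payload of payload_len bytes to fit a link MTU.

    Returns a list of (offset_in_8_byte_units, payload_bytes, more_fragments).
    Assumes a 20-byte IPv4 header with no options unless header_len is given.
    """
    # Every fragment except the last carries a multiple of 8 payload bytes.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    offset = 0
    while offset < payload_len:
        size = min(max_data, payload_len - offset)
        more = offset + size < payload_len
        fragments.append((offset // 8, size, more))
        offset += size
    return fragments


# A 3000-byte payload (upper-layer header included) over a 1500-byte link.
for frag in fragment_ipv4(3000, 1500):
    print(frag)
```

In this example the upper-layer header sits at the start of the payload, so it is carried only by the fragment whose offset is zero.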
Upper-layer protocols can operate in the following modes:
Do not rely on IP fragmentation.
Rely on IP source fragmentation only (i.e., fragmentation at
the source node).
Rely on IP source fragmentation and downstream fragmentation
(i.e., fragmentation at any node along the path).
Upper-layer protocols running over IPv4 can operate in the first
and third modes (above). Upper-layer protocols running over IPv6 can
operate in the first and second modes (above).
Upper-layer protocols that operate in the first two modes (above)
require access to the PMTU estimate. In order to fulfil this
requirement, they can
Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
MTU.
Access the estimate that PMTUD produced.
Execute PMTUD procedures themselves.
Execute Packetization Layer PMTUD
(PLPMTUD) procedures.
According to PLPMTUD procedures, the upper-layer protocol
maintains a running PMTU estimate. It does so by sending probe packets
of various sizes to its peer and receiving acknowledgements. This
strategy differs from PMTUD in that it relies on acknowledgement of
received messages, as opposed to ICMP PTB messages concerning dropped
messages. Therefore, PLPMTUD does not rely on the network's ability to
deliver ICMP PTB messages to the source.
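The probing idea can be sketched as a search over packet sizes. This is a deliberately simplified model: the function name and the `probe_acked` callback are hypothetical, the search strategy (binary search) is an assumption rather than something mandated by PLPMTUD, and real implementations must also distinguish probe loss from congestion loss.

```python
def plpmtud_search(probe_acked, low, high):
    """Return the largest size in [low, high] whose probe is acknowledged.

    probe_acked(size) is a hypothetical callback that sends a probe packet of
    the given size and returns True if the peer acknowledges it.
    Assumes that a probe of size `low` is always acknowledged.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe_acked(mid):
            low = mid        # the probe got through: raise the lower bound
        else:
            high = mid - 1   # the probe was lost: lower the upper bound
    return low


# Simulated path whose actual PMTU is 1400 bytes.
print(plpmtud_search(lambda size: size <= 1400, 1280, 1500))  # 1400
```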
An upper-layer protocol that does not rely on IP fragmentation
never causes the underlying IP module to emit
A fragmentable IP packet (i.e., an IPv4 packet with the DF-bit
set to zero).
An IP fragment.
A packet whose length is greater than the PMTU estimate.
However, when the PMTU estimate is greater than the actual
PMTU, the upper-layer protocol can cause the underlying IP module to
emit a packet whose length is greater than the actual PMTU. When this
occurs, a downstream router drops the packet and the source node
refines its PMTU estimate, employing either PMTUD or PLPMTUD
procedures.
When an upper-layer protocol that relies on IP source fragmentation
only submits data to the underlying IP module, and the resulting
packet is larger than the PMTU estimate, the underlying IP module
fragments the packet and emits the fragments. However, the upper-layer
protocol never causes the underlying IP module to emit
A fragmentable IP packet.
A packet whose length is greater than the PMTU estimate.
When the PMTU estimate is greater than the actual PMTU, the
upper-layer protocol can cause the underlying IP module to emit a
packet whose length is greater than the actual PMTU. When this occurs,
a downstream router drops the packet and the source node refines its
PMTU estimate, employing either PMTUD or PLPMTUD procedures.
An upper-layer protocol that relies on IP source fragmentation and
downstream fragmentation can cause the underlying IP module to
emit
A fragmentable IP packet.
An IP fragment.
A packet whose length is greater than the PMTU estimate.
A protocol that relies on IP source fragmentation and
downstream fragmentation does not require access to the PMTU estimate.
For these protocols, the underlying IP module:
Fragments all packets whose length exceeds the MTU of the first
link along the path to the destination.
Sets the DF-bit to zero, so that downstream nodes can fragment
the packet.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only
when, they appear in all capitals, as shown here.
This section explains how IP fragmentation reduces the reliability of
Internet communication.
Many middle boxes require access to the transport-layer header.
However, when a packet is divided into fragments, the transport-layer
header appears in the first fragment only. It does not appear in
subsequent fragments. This omission can prevent middle boxes from
delivering their intended services.
For example, assume that a router diverts selected packets from
their normal path towards network appliances that support deep packet
inspection and lawful intercept. The router selects packets for
diversion based upon the following 5-tuple:
IP Source Address.
IP Destination Address.
IPv4 Protocol or IPv6 Next Header.
transport-layer source port.
transport-layer destination port.
IP fragmentation causes this selection algorithm to behave
suboptimally, because the transport-layer header appears only in the
first fragment of each packet.
In another example, a middle box remarks a packet's Differentiated Services Code Point based upon
the above-mentioned 5-tuple. IP fragmentation causes this process to
behave suboptimally, because the transport-layer header appears only
in the first fragment of each packet.
In all of the above-mentioned examples, the middle box cannot
deliver its intended service without reassembling fragmented
packets.
IP fragments cause problems for firewalls whose filter rules
include decision making based on TCP and UDP ports. As the port
information is not in the trailing fragments, the firewall may elect to
accept all trailing fragments, which may admit certain classes of
attack, or may elect to block all trailing fragments, which may block
otherwise legitimate traffic, or may elect to reassemble all
fragmented packets, which may be inefficient and negatively affect
performance.
Many stateless load-balancers require access to the transport-layer
header. Assume that a load-balancer distributes flows among parallel
links. In order to optimize load balancing, the load-balancer sends
every packet or packet fragment belonging to a flow through the same
link.
In order to assign a packet or packet fragment to a link, the
load-balancer executes an algorithm. If the packet or packet fragment
contains a transport-layer header, the load balancing algorithm
accepts the following 5-tuple as input:
IP Source Address.
IP Destination Address.
IPv4 Protocol or IPv6 Next Header.
transport-layer source port.
transport-layer destination port.
However, if the packet or packet fragment does not contain a
transport-layer header, the load balancing algorithm accepts only the
following 3-tuple as input:
IP Source Address.
IP Destination Address.
IPv4 Protocol or IPv6 Next Header.
Therefore, non-fragmented packets belonging to a flow can be
assigned to one link while fragmented packets belonging to the same
flow can be divided between that link and another. This can cause
suboptimal load balancing.
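The behavior described above can be sketched as follows. The function name, the hash function (SHA-256) and the addresses are assumptions made for illustration; real load-balancers use their own hashing schemes.

```python
import hashlib


def pick_link(src, dst, proto, sport=None, dport=None, n_links=2):
    """Choose a link index by hashing the available header fields.

    Uses the 5-tuple when the transport-layer ports are visible, and only
    the 3-tuple otherwise (e.g., for a trailing fragment).
    """
    if sport is not None and dport is not None:
        key = (src, dst, proto, sport, dport)
    else:
        key = (src, dst, proto)
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links


# Hypothetical UDP flow between two documentation addresses.
whole_packet = pick_link("2001:db8::1", "2001:db8::2", 17, 40000, 53)
trailing_fragment = pick_link("2001:db8::1", "2001:db8::2", 17)
print(whole_packet, trailing_fragment)
```

Because the two calls hash different inputs, nothing guarantees that they select the same link, which is how packets of one flow can end up on different links.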
Security researchers have documented several attacks that rely on
IP fragmentation. The following are examples:
Overlapping fragment attack
Incomplete data attack (also known as the Rose Attack)
In the overlapping fragment attack, an attacker constructs a
series of packet fragments. The first fragment contains an IP header,
a transport-layer header, and some transport-layer payload. This
fragment complies with local security policy and is allowed to pass
through a stateless firewall. A second fragment, having a non-zero
offset, overlaps with the first fragment. The second fragment also
passes through the stateless firewall. When the packet is reassembled,
the transport-layer header from the first fragment is overwritten by
data from the second fragment. The reassembled packet does not comply
with local security policy. Had it traversed the firewall in one
piece, the firewall would have rejected it.
A stateless firewall cannot protect against the overlapping
fragment attack. However, destination nodes can protect against the
overlapping fragment attack by implementing the reassembly procedures
described in RFC 1858 and RFC 8200. These reassembly procedures detect
the overlap and discard the packet.
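A minimal sketch of the overlap check follows. It is not taken from RFC 1858 or RFC 8200; the function name and the representation of fragments as (offset, length) pairs in bytes are assumptions made for illustration.

```python
def has_overlap(fragments):
    """Return True if any two fragments cover the same byte range.

    fragments is an iterable of (offset_bytes, length_bytes) pairs, in any
    arrival order.
    """
    end = 0
    for offset, length in sorted(fragments):
        if offset < end:
            return True  # this fragment starts before the previous one ends
        end = offset + length
    return False


# A second fragment at offset 8 would overwrite part of the first fragment.
print(has_overlap([(0, 1480), (8, 100)]))      # True
print(has_overlap([(1480, 100), (0, 1480)]))   # False
```

A destination node that detects an overlap would discard the whole packet rather than reassemble it.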
The incomplete data attack is a denial of service attack in which
the attacker constructs a series of fragmented packets. However, one
fragment is missing from each packet so that no packet can be
reassembled. This attack causes resource exhaustion on the destination
node, possibly denying reassembly services to other flows. The
incomplete data attack can be mitigated by limiting reassembly
resources dedicated to a particular Source Address or flow.
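The mitigation mentioned above can be sketched as a per-source cap on packets awaiting reassembly. The class name, method names and default limit are hypothetical. A real implementation would also expire incomplete packets after a reassembly timeout, which this sketch omits.

```python
from collections import defaultdict


class ReassemblyLimiter:
    """Cap the number of partially reassembled packets per Source Address."""

    def __init__(self, max_pending_per_source=4):
        self.max_pending = max_pending_per_source
        self.pending = defaultdict(set)  # source address -> packet IDs

    def admit(self, src, packet_id):
        """Return True if a fragment of this packet may be buffered."""
        ids = self.pending[src]
        if packet_id in ids:
            return True   # reassembly for this packet is already in progress
        if len(ids) >= self.max_pending:
            return False  # this source has exhausted its share of resources
        ids.add(packet_id)
        return True

    def complete(self, src, packet_id):
        """Release the resources held by a reassembled or discarded packet."""
        self.pending[src].discard(packet_id)


limiter = ReassemblyLimiter(max_pending_per_source=2)
print(limiter.admit("192.0.2.1", 1))     # True
print(limiter.admit("192.0.2.1", 2))     # True
print(limiter.admit("192.0.2.1", 3))     # False: limit reached for this source
print(limiter.admit("198.51.100.7", 1))  # True: other sources are unaffected
```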
As stated above, an upper-layer protocol requires access to the PMTU
estimate if it:
Does not rely on IP fragmentation.
Relies on IP source fragmentation only (i.e., fragmentation at
the source node).
In order to satisfy this requirement, the upper-layer
protocol can:
Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
MTU.
Access the estimate that PMTUD produced.
Execute PMTUD procedures itself.
Execute PLPMTUD procedures.
PMTUD relies upon the network's ability to deliver ICMP PTB
messages to the source node. Therefore, if an upper-layer protocol
relies on PMTUD for its PMTU estimate, it also relies on the network's
ability to deliver ICMP PTB messages to the source node.
RFC 4890 states that PTB messages must not be
filtered. However, ICMP delivery is not reliable. It is subject to
transient loss and, in some configurations, more persistent delivery
issues.
ICMP rate limiting, network congestion and packet corruption can
cause transient loss. The effect of rate limiting may be severe, as
RFC 4443 recommends strict rate limiting of ICMPv6 traffic.
While transient loss causes PMTUD to perform less efficiently, it
does not cause PMTUD to fail completely. When the conditions
contributing to transient loss abate, the network regains its ability
to deliver ICMP PTB messages and PMTUD regains its ability to
function.
By contrast, more persistent delivery issues cause PMTUD to fail
completely. Consider the following example:
A DNS client sends a request to an anycast address. The network
routes that DNS request to the nearest instance of that anycast
address (i.e., a DNS Server). The DNS server generates a response and
sends it back to the DNS client. While the response does not exceed
the DNS server's PMTU estimate, it does exceed the actual PMTU.
A downstream router drops the packet and sends an ICMP PTB message
to the packet's source (i.e., the anycast address). The network routes
the ICMP PTB message to the anycast instance closest to the downstream
router. Sadly, that anycast instance may not be the DNS server that
originated the DNS response. It may be another DNS server with the
same anycast address. The DNS server that originated the response may
never receive the ICMP PTB message and may never update its PMTU
estimate.
The problem described in this section is specific to PMTUD. It does
not occur when the upper-layer protocol obtains its PMTU estimate from
PLPMTUD or any other source.
Furthermore, the problem described in this section occurs when the
upper-layer protocol does not rely on IP fragmentation, as well as
when the upper-layer protocol relies on IP source fragmentation
only.
In RFC 7872, researchers sampled Internet paths to determine
whether they would convey packets that contain IPv6 extension headers.
Sampled paths terminated at popular Internet sites (e.g., popular web,
mail and DNS servers).
The study revealed that at least 28% of the sampled paths did not
convey packets containing the IPv6 Fragment extension header. In most
cases, fragments were dropped in the destination autonomous system. In
other cases, the fragments were dropped in transit autonomous
systems.
Another recent study confirmed this
finding. It reported that 37% of sampled endpoints used IPv6-capable
DNS resolvers that were incapable of receiving a fragmented IPv6
response.
It is difficult to determine why network operators drop fragments.
In some cases, packet drop may be caused by misconfiguration. In other
cases, network operators may consciously choose to drop IPv6
fragments, in order to address the issues described in the preceding sections.
The Transmission Control Protocol (TCP)
can be operated in a mode that does not require IP fragmentation.
Applications submit a stream of data to TCP. TCP divides that
stream of data into segments, with no segment exceeding the TCP
Maximum Segment Size (MSS). Each segment is encapsulated in a TCP
header and submitted to the underlying IP module. The underlying IP
module prepends an IP header and forwards the resulting packet.
If the TCP MSS is sufficiently small, the underlying IP module
never produces a packet whose length is greater than the actual PMTU.
Therefore, IP fragmentation is not required.
TCP offers the following mechanisms for MSS management:
Manual configuration
PMTUD
PLPMTUD
For IPv6 nodes, manual configuration is always applicable. If the
MSS is manually configured to 1220 bytes and the packet does not
contain extension headers, the IP layer will never produce a packet
whose length is greater than the IPv6 minimum link MTU (1280 bytes).
However, manual configuration prevents TCP from taking advantage of
larger link MTUs.
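The 1220-byte figure follows from simple arithmetic, sketched below. The function and constant names are hypothetical. The sketch assumes a 40-byte IPv6 header with no extension headers and a 20-byte TCP header with no options.

```python
IPV6_HEADER_LEN = 40  # fixed IPv6 header, no extension headers
TCP_HEADER_LEN = 20   # TCP header without options


def tcp_mss(mtu, ip_header_len=IPV6_HEADER_LEN, tcp_header_len=TCP_HEADER_LEN):
    """Return the largest TCP segment payload that fits in one packet."""
    return mtu - ip_header_len - tcp_header_len


# The IPv6 minimum link MTU (1280 bytes) yields an MSS of 1220 bytes.
print(tcp_mss(1280))  # 1220
```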
RFC 8200 strongly recommends that IPv6 nodes implement PMTUD, in
order to discover and take advantage of path MTUs greater than 1280
bytes. However, as mentioned above, PMTUD relies
upon the network's ability to deliver ICMP PTB messages. Therefore,
PMTUD is applicable only in environments where the risk of ICMP PTB
loss is acceptable.
By contrast, PLPMTUD does not rely upon the network's ability to
deliver ICMP PTB messages. However, in many loss-based TCP congestion
control algorithms, the dropping of a packet may cause the TCP control
algorithm to reduce the congestion window, or even to restart the
entire slow start process. For high capacity, long RTT, large
volume TCP streams, the deliberate probing with large packets and the
consequent packet drop may impose too harsh a penalty on total TCP
throughput for it to be a viable approach.
RFC 4821 defines PLPMTUD procedures for TCP.
While TCP will never cause the underlying IP module to emit a
packet that is larger than the PMTU estimate, it can cause the
underlying IP module to emit a packet that is larger than the actual
PMTU. If this occurs, the packet is dropped, the PMTU estimate is
updated, the segment is divided into smaller segments and each smaller
segment is submitted to the underlying IP module.
The Datagram Congestion Control Protocol
(DCCP) and the Stream Control Transmission Protocol
(SCTP) also can be operated in a mode that does not require IP
fragmentation. They both accept data from an application and divide
that data into segments, with no segment exceeding a maximum size.
Both DCCP and SCTP offer manual configuration, PMTUD and PLPMTUD as
mechanisms for managing that maximum size. PLPMTUD procedures have
also been proposed for DCCP and SCTP.
RFC 8085 recognizes that IP fragmentation reduces
the reliability of Internet communication. Therefore, it offers the
following advice regarding applications that run over the User Datagram Protocol (UDP).
"An application SHOULD NOT send UDP datagrams that result in IP
packets that exceed the Maximum Transmission Unit (MTU) along the path
to the destination. Consequently, an application SHOULD either use the
path MTU information provided by the IP layer or implement Path MTU
Discovery (PMTUD) itself to determine whether the path to a
destination will support its desired message size without
fragmentation."
RFC 8085 continues:
"Applications that do not follow the recommendation to do
PMTU/PLPMTUD discovery SHOULD still avoid sending UDP datagrams that
would result in IP packets that exceed the path MTU. Because the
actual path MTU is unknown, such applications SHOULD fall back to
sending messages that are shorter than the default effective MTU for
sending (EMTU_S in [RFC1122]). For IPv4, EMTU_S is the
smaller of 576 bytes and the first-hop MTU. For IPv6, EMTU_S is 1280
bytes. The effective PMTU for a directly connected destination (with
no routers on the path) is the configured interface MTU, which could
be less than the maximum link payload size. Transmission of
minimum-sized UDP datagrams is inefficient over paths that support a
larger PMTU, which is a second reason to implement PMTU
discovery."
RFC 8085 assumes that for IPv4, an EMTU_S of 576 is sufficiently
small, even though the IPv4 minimum link MTU is 68 bytes.
This advice applies equally to applications that run directly over
IP.
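The fallback sizes quoted from RFC 8085 above translate into maximum UDP payload sizes as sketched below. The function names are hypothetical. The sketch assumes a 20-byte IPv4 header without options, a 40-byte IPv6 header without extension headers, and the 8-byte UDP header.

```python
UDP_HEADER_LEN = 8


def emtu_s(first_hop_mtu, ipv6):
    """Default effective MTU for sending, per the text quoted from RFC 8085."""
    return 1280 if ipv6 else min(576, first_hop_mtu)


def max_udp_payload(first_hop_mtu, ipv6):
    """Largest UDP payload that fits in one packet of size EMTU_S.

    Assumes no IPv4 options and no IPv6 extension headers.
    """
    ip_header_len = 40 if ipv6 else 20
    return emtu_s(first_hop_mtu, ipv6) - ip_header_len - UDP_HEADER_LEN


print(max_udp_payload(1500, ipv6=False))  # 548
print(max_udp_payload(1500, ipv6=True))   # 1232
```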
The following applications rely on IPv6 fragmentation:
DNS
OSPFv3
IP Encapsulations
Each of these applications relies on IPv6 fragmentation to a
varying degree. In some cases, that reliance is essential, and cannot be
broken without fundamentally changing the protocol. In other cases, that
reliance is incidental, and most implementations already take
appropriate steps to avoid fragmentation.
This list is not comprehensive, and other protocols that rely on IPv6
fragmentation may exist. They are not specifically considered in the
context of this document.
DNS can obtain transport services from either UDP or TCP. Superior
performance and scaling characteristics are observed when DNS runs
over UDP.
DNS Servers that execute DNSSEC
procedures are more likely to generate large responses. Therefore,
when running over UDP, they are more likely to cause the generation of
IPv6 fragments. DNS's reliance upon IPv6 fragmentation is fundamental
and cannot be broken without changing the DNS specification.
DNS is an essential part of the Internet architecture. Therefore,
this issue is for further study and must be resolved before DNSSEC can
be deployed successfully in IPv6 only networks.
OSPFv3 implementations can emit messages large enough to cause IPv6
fragmentation. However, in keeping with the recommendations of
RFC8200, and in order to optimize performance, most OSPFv3
implementations restrict their maximum message size to the IPv6
minimum link MTU.
In this document, IP encapsulations include IP-in-IP, Generic
Routing Encapsulation (GRE), GRE-in-UDP and Generic
Packet Tunneling in IPv6. The fragmentation strategy described
for GRE in RFC 7588 has been deployed for all of the
above-mentioned IP encapsulations. This strategy does not rely on IPv6
fragmentation except in one corner case (see Section 3.3.2.2 of RFC
7588 and Section 7.1 of RFC 2473). Section 3.3 of further describes this corner case.
Application developers SHOULD NOT develop applications that rely on
IPv6 fragmentation.
Application-layer protocols that depend upon IPv6 fragmentation
SHOULD be updated to break that dependency.
As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
messages unless they are known to be forged or otherwise illegitimate.
As stated above, filtering ICMPv6 PTB packets causes
PMTUD to fail. Many upper-layer protocols rely on PMTUD.
This document makes no request of IANA.
This document mitigates some of the security considerations
associated with IP fragmentation by discouraging the use of IP
fragmentation. It does not introduce any new security vulnerabilities,
because it does not introduce any new alternatives to IP fragmentation.
Instead, it recommends well-understood alternatives.
IPv6, Large UDP Packets and the DNS
(http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html)