idnits 2.17.1 draft-ietf-intarea-frag-fragile-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'SHOULD not' in this paragraph: Middle box developers SHOULD implement devices that support IP fragmentation. These boxes SHOULD not fail or cause failures when processing fragmented IP packets. -- The document date (October 10, 2018) is 2024 days in the past. Is this intentional? 
Checking references for intended status: Best Current Practice ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) == Outdated reference: A later version (-13) exists of draft-ietf-intarea-tunnels-09 == Outdated reference: A later version (-22) exists of draft-ietf-tsvwg-datagram-plpmtud-05 == Outdated reference: A later version (-32) exists of draft-ietf-tsvwg-udp-options-05 -- Obsolete informational reference (is this intentional?): RFC 4960 (Obsoleted by RFC 9260) Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Area WG R. Bonica 3 Internet-Draft Juniper Networks 4 Intended status: Best Current Practice F. Baker 5 Expires: April 13, 2019 Unaffiliated 6 G. Huston 7 APNIC 8 R. Hinden 9 Check Point Software 10 O. Troan 11 Cisco 12 F. Gont 13 SI6 Networks 14 October 10, 2018 16 IP Fragmentation Considered Fragile 17 draft-ietf-intarea-frag-fragile-01 19 Abstract 21 This document describes IP fragmentation and explains how it reduces 22 the reliability of Internet communication. 24 This document also proposes alternatives to IP fragmentation and 25 provides recommendations for developers and network operators. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at https://datatracker.ietf.org/drafts/current/. 
37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on April 13, 2019. 44 Copyright Notice 46 Copyright (c) 2018 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (https://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents
61 1. Introduction
62 2. IP Fragmentation
63   2.1. Links, Paths, MTU and PMTU
64   2.2. Fragmentation Procedures
65   2.3. Upper-Layer Reliance on IP Fragmentation
66 3. Requirements Language
67 4. Reduced Reliability
68   4.1. Policy-Based Routing
69   4.2. Network Address Translation (NAT)
70   4.3. Stateless Firewalls
71   4.4. Stateless Load Balancers
72   4.5. Security Vulnerabilities
73   4.6. Blackholing Due to ICMP Loss
74     4.6.1. Transient Loss
75     4.6.2. Incorrect Implementation of Security Policy
76     4.6.3. Persistent Loss Caused By Anycast
77   4.7. Blackholing Due To Filtering
78 5. Alternatives to IP Fragmentation
79   5.1. Transport Layer Solutions
80   5.2. Application Layer Solutions
81 6. Applications That Rely on IPv6 Fragmentation
82   6.1. DNS
83   6.2. OSPF
84   6.3. Packet-in-Packet Encapsulations
85 7. Recommendations
86   7.1. For Application Developers
87   7.2. For System Developers
88   7.3. For Middle Box Developers
89   7.4. For Network Operators
90 8. IANA Considerations
91 9. Security Considerations
92 10. Acknowledgements
93 11. References
94   11.1. Normative References
95   11.2. Informative References
96 Appendix A. Contributors' Address
97 Authors' Addresses
99 1. Introduction 101 Operational experience [Kent] [Huston] [RFC7872] reveals that IP 102 fragmentation reduces the reliability of Internet communication. 103 This document describes IP fragmentation and explains how it reduces 104 the reliability of Internet communication. This document also 105 proposes alternatives to IP fragmentation and provides 106 recommendations for developers and network operators.
108 While this document identifies issues associated with IP 109 fragmentation, it does not recommend deprecation. Some applications 110 (e.g., [I-D.ietf-intarea-tunnels]) require IP fragmentation. 112 Rather than deprecating IP fragmentation, this document recommends 113 that upper-layer protocols address the problem of fragmentation at 114 their layer, reducing their reliance on IP fragmentation to the 115 greatest degree possible. 117 2. IP Fragmentation 119 2.1. Links, Paths, MTU and PMTU 121 An Internet path connects a source node to a destination node. A 122 path can contain links and routers. If a path contains more than one 123 link, the links are connected in series and a router connects each 124 link to the next. 126 Internet paths are dynamic. Assume that the path from one node to 127 another contains a set of links and routers. If the network topology 128 changes, that path can also change so that it includes a different 129 set of links and routers. 131 Each link is constrained by the number of bytes that it can convey in 132 a single IP packet. This constraint is called the link Maximum 133 Transmission Unit (MTU). IPv4 [RFC0791] requires every link to 134 support a specified MTU (see footnote). IPv6 [RFC8200] requires 135 every link to support an MTU of 1280 bytes or greater. These are 136 called the IPv4 and IPv6 minimum link MTU's. 138 Likewise, each Internet path is constrained by the number of bytes 139 that it can convey in a single IP packet. This constraint is called 140 the Path MTU (PMTU). For any given path, the PMTU is equal to the 141 smallest of its link MTU's. Because Internet paths are dynamic, PMTU 142 is also dynamic. 144 For reasons described below, source nodes estimate the PMTU between 145 themselves and destination nodes. A source node can produce 146 extremely conservative PMTU estimates in which: 148 o The estimate for each IPv4 path is equal to the IPv4 minimum link 149 MTU.
151 o The estimate for each IPv6 path is equal to the IPv6 minimum link 152 MTU. 154 While these conservative estimates are guaranteed to be less than or 155 equal to the actual PMTU, they are likely to be much less than the 156 actual PMTU. This may adversely affect upper-layer protocol 157 performance. 159 By executing Path MTU Discovery (PMTUD) [RFC1191] [RFC8201] 160 procedures, a source node can maintain a less conservative estimate 161 of the PMTU between itself and a destination node. In PMTUD, the 162 source node produces an initial PMTU estimate. This initial estimate 163 is equal to the MTU of the first link along the path to the 164 destination node. It can be greater than the actual PMTU. 166 Having produced an initial PMTU estimate, the source node sends non- 167 fragmentable IP packets to the destination node. If one of these 168 packets is larger than the actual PMTU, a downstream router will not 169 be able to forward the packet through the next link along the path. 170 Therefore, the downstream router drops the packet and sends an 171 Internet Control Message Protocol (ICMP) [RFC0792] [RFC4443] Packet 172 Too Big (PTB) message to the source node. The ICMP PTB message 173 indicates the MTU of the link through which the packet could not be 174 forwarded. The source node uses this information to refine its PMTU 175 estimate. 177 PMTUD produces a running estimate of the PMTU between a source node 178 and a destination node. Because PMTU is dynamic, at any given time, 179 the PMTU estimate can differ from the actual PMTU. In order to 180 detect PMTU increases, PMTUD occasionally resets the PMTU estimate to 181 its initial value and repeats the procedure described above. 183 PMTUD has the following characteristics: 185 o It relies on the network's ability to deliver ICMP PTB messages to 186 the source node. 188 o It is susceptible to attack because ICMP messages are easily 189 forged [RFC5927]. 
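The PMTUD procedure described above can be sketched as a small simulation. This is a minimal illustration, not an implementation of [RFC1191] or [RFC8201]: the function names (actual_pmtu, send_non_fragmentable, discover_pmtu) and the example link MTUs are assumptions made for this sketch, and a return value stands in for the ICMP PTB message that a real router would send.

```python
from typing import List, Optional


def actual_pmtu(link_mtus: List[int]) -> int:
    """The PMTU of a path is the smallest of its link MTUs."""
    return min(link_mtus)


def send_non_fragmentable(size: int, link_mtus: List[int]) -> Optional[int]:
    """Walk the path link by link (hypothetical helper).

    Returns None if the packet reaches the destination. Otherwise returns
    the MTU of the first link that could not carry the packet, standing in
    for the value reported in an ICMP PTB message.
    """
    for mtu in link_mtus:
        if size > mtu:
            return mtu
    return None


def discover_pmtu(link_mtus: List[int]) -> int:
    """Refine the PMTU estimate until a packet of that size is delivered."""
    estimate = link_mtus[0]  # initial estimate: MTU of the first link
    while True:
        reported = send_non_fragmentable(estimate, link_mtus)
        if reported is None:
            return estimate
        estimate = reported  # refine the estimate from the PTB message


# Hypothetical path with four links (MTUs in bytes).
path = [9000, 1500, 1400, 1500]
assert discover_pmtu(path) == actual_pmtu(path) == 1400
```

The sketch assumes that every PTB message reaches the source. If send_non_fragmentable dropped the packet silently instead of reporting an MTU, the estimate would never be refined, which is the first characteristic listed above.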
191 FOOTNOTE: In IPv4, every host must be capable of receiving a packet 192 whose length is equal to 576 bytes. However, the IPv4 minimum link 193 MTU is not 576. Section 3.2 of RFC 791 explicitly states that the 194 IPv4 minimum link MTU is 68 bytes. But for practical purposes, many 195 network operators consider the IPv4 minimum link MTU to be 576 bytes. 196 So, for the purposes of this document, we assume that the IPv4 197 minimum link MTU is 576 bytes. 199 FOOTNOTE: In the paragraphs above, the term "non-fragmentable packet" 200 is introduced. A non-fragmentable packet can be fragmented at its 201 source. However, it cannot be fragmented by a downstream node. An 202 IPv4 packet whose DF-bit is set to zero is fragmentable. An IPv4 203 packet whose DF-bit is set to one is non-fragmentable. All IPv6 204 packets are also non-fragmentable. 206 FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is 207 introduced. The ICMP PTB message has two instantiations. In ICMPv4 208 [RFC0792], the ICMP PTB message is a Destination Unreachable message 209 with Code equal to (4) fragmentation needed and DF set. This message 210 was augmented by [RFC1191] to indicate the MTU of the link through 211 which the packet could not be forwarded. In ICMPv6 [RFC4443], the 212 ICMP PTB message is a Packet Too Big Message with Code equal to (0). 213 This message also indicates the MTU of the link through which the 214 packet could not be forwarded. 216 2.2. Fragmentation Procedures 218 When an upper-layer protocol submits data to the underlying IP 219 module, and the resulting IP packet's length is greater than the 220 PMTU, the packet can be divided into fragments. Each fragment 221 includes an IP header and a portion of the original packet. 223 [RFC0791] describes IPv4 fragmentation procedures. An IPv4 packet 224 whose DF-bit is set to one cannot be fragmented. An IPv4 packet 225 whose DF-bit is set to zero can be fragmented by the source node or 226 by any downstream router.
When an IPv4 packet is fragmented, all IP 227 options appear in the first fragment, but only options whose "copy" 228 bit is set to one appear in subsequent fragments. 230 [RFC8200] describes IPv6 fragmentation procedures. An IPv6 packet 231 can be fragmented at the source node only. When an IPv6 packet is 232 fragmented, all extension headers appear in the first fragment, but 233 only per-fragment headers appear in subsequent fragments. Per- 234 fragment headers include the following: 236 o The IPv6 header. 238 o The Hop-by-hop Options header (if present) 239 o The Destination Options header (if present and if it precedes a 240 Routing header) 242 o The Routing Header (if present) 244 o The Fragment Header 246 In both IPv4 and IPv6, the upper-layer header appears in the first 247 fragment only. It does not appear in subsequent fragments. 249 2.3. Upper-Layer Reliance on IP Fragmentation 251 Upper-layer protocols can operate in the following modes: 253 o Do not rely on IP fragmentation. 255 o Rely on IP fragmentation by the source node only. 257 o Rely on IP fragmentation by any node. 259 Upper-layer protocols running over IPv4 can operate in all of the 260 above-mentioned modes. Upper-layer protocols running over IPv6 can 261 operate in the first and second modes only. 263 Upper-layer protocols that operate in the first two modes (above) 264 require access to the PMTU estimate. In order to fulfil this 265 requirement, they can: 267 o Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link 268 MTU. 270 o Access the estimate that PMTUD produced. 272 o Execute PMTUD procedures themselves. 274 o Execute Packetization Layer PMTUD (PLPMTUD) [RFC4821] 275 [I-D.ietf-tsvwg-datagram-plpmtud] procedures. 277 According to PLPMTUD procedures, the upper-layer protocol maintains a 278 running PMTU estimate. It does so by sending probe packets of 279 various sizes to its upper-layer peer and receiving acknowledgements.
280 This strategy differs from PMTUD in that it relies on acknowledgement 281 of received messages, as opposed to ICMP PTB messages concerning 282 dropped messages. Therefore, PLPMTUD does not rely on the network's 283 ability to deliver ICMP PTB messages to the source. 285 3. Requirements Language 287 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 288 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 289 "OPTIONAL" in this document are to be interpreted as described in BCP 290 14 [RFC2119] [RFC8174] when, and only when, they appear in all 291 capitals, as shown here. 293 4. Reduced Reliability 295 This section explains how IP fragmentation reduces the reliability of 296 Internet communication. 298 4.1. Policy-Based Routing 300 IP fragmentation causes problems for routers that implement policy- 301 based routing. 303 When a router receives a packet, it identifies the next-hop on route 304 to the packet's destination and forwards the packet to that next-hop. 305 In order to identify the next-hop, the router interrogates a local 306 data structure called the Forwarding Information Base (FIB). 308 Normally, the FIB contains destination-based entries that map a 309 destination prefix to a next-hop. Policy-based routing allows 310 destination-based and policy-based entries to coexist in the same 311 FIB. A policy-based FIB entry maps multiple fields, drawn from 312 either the IP or transport-layer header, to a next-hop.
314    +-------+--------------+-----------------+------------+-------------+
315    | Entry | Type         | Dest. Prefix    | Next Hdr / | Next-Hop    |
316    |       |              |                 | Dest. Port |             |
317    +-------+--------------+-----------------+------------+-------------+
318    |       |              |                 |            |             |
319    | 1     | Destination- | 2001:db8::1/128 | Any / Any  | 2001:db8::2 |
320    |       | based        |                 |            |             |
321    |       |              |                 |            |             |
322    | 2     | Policy-      | 2001:db8::1/128 | TCP / 80   | 2001:db8::3 |
323    |       | based        |                 |            |             |
324    +-------+--------------+-----------------+------------+-------------+
326 Table 1: Policy-Based Routing FIB 328 Assume that a router maintains the FIB in Table 1. The first FIB 329 entry is destination-based. It maps a destination prefix 330 (2001:db8::1/128) to a next-hop (2001:db8::2). The second FIB entry 331 is policy-based. It maps the same destination prefix 332 (2001:db8::1/128) and a destination port (TCP / 80) to a different 333 next-hop (2001:db8::3). The second entry is more specific than the 334 first. 336 When the router receives the first fragment of a packet that is 337 destined for TCP port 80 on 2001:db8::1, it interrogates the FIB. 338 Both FIB entries satisfy the query. The router selects the second 339 FIB entry because it is more specific and forwards the packet to 340 2001:db8::3. 342 When the router receives the second fragment of the packet, it 343 interrogates the FIB again. This time, only the first FIB entry 344 satisfies the query, because the second fragment contains no 345 indication that the packet is destined for TCP port 80. Therefore, 346 the router selects the first FIB entry and forwards the packet to 347 2001:db8::2. 349 Policy-based routing is also known as filter-based-forwarding. 351 4.2. Network Address Translation (NAT) 353 IP fragmentation causes problems for Network Address Translation 354 (NAT) devices. When a NAT device detects a new, outbound flow, it 355 maps that flow's source port and IP address to another source port 356 and IP address. Having created that mapping, the NAT device 357 translates: 359 o The Source IP Address and Source Port on each outbound packet.
361 o The Destination IP Address and Destination Port on each inbound 362 packet. 364 A+P [RFC6346] and Carrier Grade NAT (CGN) [RFC6888] are two common 365 NAT strategies. In both approaches the NAT device must virtually 366 reassemble fragmented packets in order to translate and forward each 367 fragment. 369 Virtual reassembly in the network is problematic, because it is 370 computationally expensive and because it is prone to attacks 371 (Section 4.5). 373 4.3. Stateless Firewalls 375 IP fragmentation causes problems for stateless firewalls whose rules 376 include TCP and UDP ports. Because port information is not available 377 in the trailing fragments the firewall is limited to the following 378 options: 380 o Accept all trailing fragments, possibly admitting certain classes 381 of attack. 383 o Block all trailing fragments, possibly blocking legitimate 384 traffic. 386 Neither option is attractive. 388 This problem does not occur in stateful firewalls. 390 4.4. Stateless Load Balancers 392 IP fragmentation causes problems for stateless load balancers. In 393 order to assign a packet or packet fragment to a link, the load- 394 balancer executes an algorithm. If the packet or packet fragment 395 contains a transport-layer header, the load balancing algorithm 396 accepts the following 5-tuple as input: 398 o IP Source Address. 400 o IP Destination Address. 402 o IPv4 Protocol or IPv6 Next Header. 404 o transport-layer source port. 406 o transport-layer destination port. 408 If the packet or packet fragment does not contain a transport-layer 409 header, the load balancing algorithm accepts only the following 410 3-tuple as input: 412 o IP Source Address. 414 o IP Destination Address. 416 o IPv4 Protocol or IPv6 Next Header. 418 Therefore, non-fragmented packets belonging to a flow can be assigned 419 to one link while fragmented packets belonging to the same flow can 420 be divided between that link and another. This can cause suboptimal 421 load balancing. 
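The effect described above can be illustrated with a short sketch. It assumes a hypothetical stateless load balancer that hashes the available tuple with SHA-256 and selects one of a fixed number of links. The select_link function, the addresses, and the ports are illustrative assumptions and are not taken from any real device.

```python
import hashlib
from typing import Optional


def select_link(src: str, dst: str, proto: int,
                sport: Optional[int], dport: Optional[int],
                num_links: int) -> int:
    """Pick a link from the 5-tuple, or from the 3-tuple when the
    transport-layer header (and so the ports) is not available."""
    if sport is not None and dport is not None:
        key = f"{src}|{dst}|{proto}|{sport}|{dport}"
    else:
        key = f"{src}|{dst}|{proto}"
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_links


flow = ("2001:db8::10", "2001:db8::1", 6)  # 6 = TCP

# Non-fragmented packets and first fragments carry the ports.
with_ports = select_link(*flow, 49152, 80, num_links=4)
# Trailing fragments do not, so only the 3-tuple is hashed.
trailing = select_link(*flow, None, None, num_links=4)

print(with_ports, trailing)  # the two values are not guaranteed to match
```

Because the two calls hash different inputs, nothing ties their results together. Packets of one flow that carry a transport-layer header always map to one link, while trailing fragments of the same flow may map to another.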
423 4.5. Security Vulnerabilities 425 Security researchers have documented several attacks that exploit IP 426 fragmentation. The following are examples: 428 o Overlapping fragment attacks [RFC1858][RFC3128][RFC5722] 430 o Resource exhaustion attacks (such as the Rose Attack) 432 o Attacks based on predictable fragment identification values 433 [RFC7739] 435 o Evasion of Network Intrusion Detection Systems (NIDS) [Ptacek1998] 437 In the overlapping fragment attack, an attacker constructs a series 438 of packet fragments. The first fragment contains an IP header, a 439 transport-layer header, and some transport-layer payload. This 440 fragment complies with local security policy and is allowed to pass 441 through a stateless firewall. A second fragment, having a non-zero 442 offset, overlaps with the first fragment. The second fragment also 443 passes through the stateless firewall. When the packet is 444 reassembled, the transport layer header from the first fragment is 445 overwritten by data from the second fragment. The reassembled packet 446 does not comply with local security policy. Had it traversed the 447 firewall in one piece, the firewall would have rejected it. 449 A stateless firewall cannot protect against the overlapping fragment 450 attack. However, destination nodes can protect against the 451 overlapping fragment attack by implementing the procedures described 452 in RFC 1858, RFC 3128 and RFC 8200. These reassembly procedures 453 detect the overlap and discard the packet. 455 The fragment reassembly algorithm is a stateful procedure for an 456 otherwise stateless protocol. Therefore, it can be exploited by 457 resource exhaustion attacks. An attacker can construct a series of 458 fragmented packets, with one fragment missing from each packet so 459 that the reassembly is impossible. Thus, this attack causes resource 460 exhaustion on the destination node, possibly denying reassembly 461 services to other flows. 
This type of attack can be mitigated by 462 flushing fragment reassembly buffers when necessary, at the expense 463 of possibly dropping legitimate fragments. 465 Each IP fragment contains an "Identification" field that destination 466 nodes use to reassemble fragmented packets. Many implementations set 467 the Identification field to a predictable value, thus making it easy 468 for an attacker to forge malicious IP fragments that would cause the 469 reassembly procedure for legitimate packets to fail. 471 NIDS aims at identifying malicious activity by analyzing network 472 traffic. Ambiguity in the possible result of the fragment reassembly 473 process may allow an attacker to evade these systems. Many of these 474 systems try to mitigate some of these evasion techniques (e.g., by 475 computing all possible outcomes of the fragment reassembly process, 476 at the expense of increased processing requirements). 478 4.6. Blackholing Due to ICMP Loss 480 As mentioned in Section 2.3, upper-layer protocols can be configured 481 to rely on PMTUD. Because PMTUD relies upon the network to deliver 482 ICMP PTB messages, those protocols also rely on the network to 483 deliver ICMP PTB messages. 485 According to [RFC4890], ICMP PTB messages must not be filtered. 486 However, ICMP PTB delivery is not reliable. It is subject to both 487 transient and persistent loss. 489 Transient loss of ICMP PTB messages can cause transient black holes. 490 When the conditions contributing to transient loss abate, the network 491 regains its ability to deliver ICMP PTB messages and connectivity 492 between the source and destination nodes is restored. Section 4.6.1 493 of this document describes conditions that lead to transient loss of 494 ICMP PTB messages. 496 Persistent loss of ICMP PTB messages can cause persistent black 497 holes. Section 4.6.2 and Section 4.6.3 of this document describe 498 conditions that lead to persistent loss of ICMP PTB messages.
500 The problem described in this section is specific to PMTUD. It does 501 not occur when the upper-layer protocol obtains its PMTU estimate 502 from PLPMTUD or from any other source. 504 4.6.1. Transient Loss 506 The following factors can contribute to transient loss of ICMP PTB 507 messages: 509 o Network congestion. 511 o Packet corruption. 513 o Transient routing loops. 515 o ICMP rate limiting. 517 The effect of rate limiting may be severe, as RFC 4443 recommends 518 strict rate limiting of ICMPv6 traffic. 520 4.6.2. Incorrect Implementation of Security Policy 522 Incorrect implementation of security policy can cause persistent loss 523 of ICMP PTB messages. 525 Assume that a Customer Premises Equipment (CPE) router implements the 526 following zone-based security policy: 528 o Allow any traffic to flow from the inside zone to the outside 529 zone. 531 o Do not allow any traffic to flow from the outside zone to the 532 inside zone unless it is part of an existing flow (i.e., it was 533 elicited by an outbound packet). 535 When a correct implementation of the above-mentioned security policy 536 receives an ICMP PTB message, it examines the ICMP PTB payload in 537 order to determine whether the original packet (i.e., the packet that 538 elicited the ICMP PTB message) belonged to an existing flow. If the 539 original packet belonged to an existing flow, the implementation 540 allows the ICMP PTB to flow from the outside zone to the inside zone. 541 If not, the implementation discards the ICMP PTB message. 543 When an incorrect implementation of the above-mentioned security 544 policy receives an ICMP PTB message, it discards the packet because 545 its source address is not associated with an existing flow. 547 The security policy described above is implemented incorrectly on 548 many consumer CPE routers. 550 4.6.3. Persistent Loss Caused By Anycast 552 Anycast can cause persistent loss of ICMP PTB messages.
Consider the 553 example below: 555 A DNS client sends a request to an anycast address. The network 556 routes that DNS request to the nearest instance of that anycast 557 address (i.e., a DNS Server). The DNS server generates a response 558 and sends it back to the DNS client. While the response does not 559 exceed the DNS server's PMTU estimate, it does exceed the actual 560 PMTU. 562 A downstream router drops the packet and sends an ICMP PTB message 563 to the packet's source (i.e., the anycast address). The network routes 564 the ICMP PTB message to the anycast instance closest to the 565 downstream router. That anycast instance may not be the DNS server 566 that originated the DNS response. It may be another DNS server with 567 the same anycast address. The DNS server that originated the 568 response may never receive the ICMP PTB message and may never update 569 its PMTU estimate. 571 4.7. Blackholing Due To Filtering 573 In RFC 7872, researchers sampled Internet paths to determine whether 574 they would convey packets that contain IPv6 extension headers. 575 Sampled paths terminated at popular Internet sites (e.g., popular 576 web, mail and DNS servers). 578 The study revealed that at least 28% of the sampled paths did not 579 convey packets containing the IPv6 Fragment extension header. In 580 most cases, fragments were dropped in the destination autonomous 581 system. In other cases, the fragments were dropped in transit 582 autonomous systems. 584 Another recent study [Huston] confirmed this finding. It reported 585 that 37% of sampled endpoints used IPv6-capable DNS resolvers that 586 were incapable of receiving a fragmented IPv6 response. 588 It is difficult to determine why network operators drop fragments. 589 Possible causes follow: 591 o Hardware inability to process fragmented packets. 593 o Failure to change vendor defaults. 595 o Unintentional misconfiguration.
597 o Intentional configuration (e.g., network operators consciously 598 choose to drop IPv6 fragments in order to address the issues 599 raised in Section 4.1 through Section 4.6, above.) 601 5. Alternatives to IP Fragmentation 603 5.1. Transport Layer Solutions 605 The Transmission Control Protocol (TCP) [RFC0793] can be operated in a 606 mode that does not require IP fragmentation. 608 Applications submit a stream of data to TCP. TCP divides that stream 609 of data into segments, with no segment exceeding the TCP Maximum 610 Segment Size (MSS). Each segment is encapsulated in a TCP header and 611 submitted to the underlying IP module. The underlying IP module 612 prepends an IP header and forwards the resulting packet. 614 If the TCP MSS is sufficiently small, the underlying IP module never 615 produces a packet whose length is greater than the actual PMTU. 616 Therefore, IP fragmentation is not required. 618 TCP offers the following mechanisms for MSS management: 620 o Manual configuration 622 o PMTUD 624 o PLPMTUD 626 Manual configuration is always applicable. If the MSS is configured 627 to a sufficiently low value, the IP layer will never produce a packet 628 whose length is greater than the protocol minimum link MTU. However, 629 manual configuration prevents TCP from taking advantage of larger 630 link MTU's. 632 Upper-layer protocols can implement PMTUD in order to discover and 633 take advantage of larger path MTUs. However, as mentioned in 634 Section 2.1, PMTUD relies upon the network to deliver ICMP PTB 635 messages. Therefore, PMTUD is applicable only in environments where 636 the risk of ICMP PTB loss is acceptable. 638 By contrast, PLPMTUD does not rely upon the network's ability to 639 deliver ICMP PTB messages. However, in many loss-based TCP 640 congestion control algorithms, the dropping of a packet may cause the 641 TCP control algorithm to drop the congestion control window, or even 642 re-start with the entire slow start process.
For high capacity, long 643 round-trip time, large volume TCP streams, the deliberate probing 644 with large packets and the consequent packet drop may impose too 645 harsh a penalty on total TCP throughput for it to be a viable 646 approach. [RFC4821] defines PLPMTUD procedures for TCP. 648 While TCP will never cause the underlying IP module to emit a packet 649 that is larger than the PMTU estimate, it can cause the underlying IP 650 module to emit a packet that is larger than the actual PMTU. If this 651 occurs, the packet is dropped, the PMTU estimate is updated, the 652 segment is divided into smaller segments and each smaller segment is 653 submitted to the underlying IP module. 655 The Datagram Congestion Control Protocol (DCCP) [RFC4340] and the 656 Stream Control Transmission Protocol (SCTP) [RFC4960] also can be operated in a 657 mode that does not require IP fragmentation. They both accept data 658 from an application and divide that data into segments, with no 659 segment exceeding a maximum size. Both DCCP and SCTP offer manual 660 configuration, PMTUD and PLPMTUD as mechanisms for managing that 661 maximum size. [I-D.ietf-tsvwg-datagram-plpmtud] proposes PLPMTUD 662 procedures for DCCP and SCTP. 664 Currently, User Datagram Protocol (UDP) [RFC0768] lacks a fragmentation 665 mechanism of its own and relies on IP fragmentation. However, 666 [I-D.ietf-tsvwg-udp-options] proposes a fragmentation mechanism for 667 UDP. 669 5.2. Application Layer Solutions 671 [RFC8085] recognizes that IP fragmentation reduces the reliability of 672 Internet communication. It also recognizes that UDP lacks a 673 fragmentation mechanism of its own and relies on IP fragmentation. 674 Therefore, [RFC8085] offers the following advice regarding 675 applications that run over UDP. 677 "An application SHOULD NOT send UDP datagrams that result in IP 678 packets that exceed the Maximum Transmission Unit (MTU) along the 679 path to the destination.
      Consequently, an application SHOULD either use the path MTU
      information provided by the IP layer or implement Path MTU
      Discovery (PMTUD) itself to determine whether the path to a
      destination will support its desired message size without
      fragmentation."

   RFC 8085 continues:

      "Applications that do not follow the recommendation to do PMTU/
      PLPMTUD discovery SHOULD still avoid sending UDP datagrams that
      would result in IP packets that exceed the path MTU.  Because the
      actual path MTU is unknown, such applications SHOULD fall back to
      sending messages that are shorter than the default effective MTU
      for sending (EMTU_S in [RFC1122]).  For IPv4, EMTU_S is the
      smaller of 576 bytes and the first-hop MTU.  For IPv6, EMTU_S is
      1280 bytes.  The effective PMTU for a directly connected
      destination (with no routers on the path) is the configured
      interface MTU, which could be less than the maximum link payload
      size.  Transmission of minimum-sized UDP datagrams is inefficient
      over paths that support a larger PMTU, which is a second reason to
      implement PMTU discovery."

   RFC 8085 assumes that for IPv4, an EMTU_S of 576 bytes is
   sufficiently small, even though the IPv4 minimum link MTU is 68
   bytes.

   This advice applies equally to applications that run directly over
   IP.

6.  Applications That Rely on IPv6 Fragmentation

   The following applications rely on IPv6 fragmentation:

   o  DNS [RFC1035]

   o  OSPFv3 [RFC2328][RFC5340]

   o  Packet-in-packet encapsulations

   Each of these applications relies on IPv6 fragmentation to a varying
   degree.  In some cases, that reliance is essential, and cannot be
   broken without fundamentally changing the protocol.  In other cases,
   that reliance is incidental, and most implementations already take
   appropriate steps to avoid fragmentation.

   This list is not comprehensive, and other protocols that rely on IP
   fragmentation may exist.
   They are not specifically considered in the context of this document.

6.1.  DNS

   DNS relies on UDP for efficiency, and the consequence is the use of
   IP fragmentation for large responses, as permitted by the DNS EDNS(0)
   options in the query.  It is possible to mitigate the issue of
   fragmentation-based packet loss by having queries use smaller EDNS(0)
   UDP buffer sizes, or by having the DNS server limit the size of its
   UDP responses to some self-imposed maximum packet size that may be
   less than the preferred EDNS(0) UDP Buffer Size.  In both cases,
   large responses are truncated in the DNS, signalling to the client to
   re-query using TCP to obtain the complete response.  However, support
   for DNS over TCP is only partial, particularly where IPv6 transport
   is used, and this limits the efficacy of the approach [Damas].

   Larger DNS responses can normally be avoided by aggressively pruning
   the Additional section of DNS responses.  One scenario where such
   pruning is ineffective is in the use of DNSSEC, where large key sizes
   act to increase the response size to certain DNS queries.  There is
   no effective response to this situation within the DNS other than
   using smaller cryptographic keys and adopting DNSSEC administrative
   practices that attempt to keep DNS responses as short as possible.

6.2.  OSPF

   OSPF implementations can emit messages large enough to cause
   fragmentation.  However, in order to optimize performance, most OSPF
   implementations restrict their maximum message size to a value that
   will not cause fragmentation.

6.3.  Packet-in-Packet Encapsulations

   In this document, packet-in-packet encapsulations include IP-in-IP
   [RFC2003], Generic Routing Encapsulation (GRE) [RFC2784], GRE-in-UDP
   [RFC8086] and Generic Packet Tunneling in IPv6 [RFC2473].
   [RFC4459] describes fragmentation issues associated with all of the
   above-mentioned encapsulations.

   The fragmentation strategy described for GRE in [RFC7588] has been
   deployed for all of the above-mentioned encapsulations.  This
   strategy does not rely on IP fragmentation except in one corner case
   (see Section 3.3.2.2 of RFC 7588 and Section 7.1 of RFC 2473).
   Section 3.3 of [RFC7676] further describes this corner case.

7.  Recommendations

7.1.  For Application Developers

   Application developers SHOULD NOT develop new applications that rely
   on IP fragmentation.

   Application-layer protocols that depend upon IPv6 fragmentation
   SHOULD be updated to break that dependency.  This can be achieved by
   using a sufficiently small MTU (e.g., the protocol minimum link MTU),
   disabling fragmentation, and ensuring that the transport protocol in
   use adapts its segment size to that MTU.  This would avoid the
   problem of PMTUD failure described in Section 4.6.  Another approach
   is to use PLPMTUD in a way suitable for the transport protocol in use
   (e.g., [I-D.ietf-tsvwg-datagram-plpmtud] for UDP).

7.2.  For System Developers

   Software libraries SHOULD include provision for PLPMTUD for each
   supported transport protocol.

7.3.  For Middle Box Developers

   Middle box developers SHOULD implement devices that support IP
   fragmentation.  These boxes SHOULD NOT fail or cause failures when
   processing fragmented IP packets.

   For example, in order to support IP fragmentation, a load balancer
   might execute the following procedure:

   o  Receive a fragmented packet

   o  Identify a next-hop using information drawn from the first
      fragment

   o  Forward the first fragment and all subsequent fragments through
      the above-mentioned next-hop

7.4.  For Network Operators

   As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
   messages unless they are known to be forged or otherwise
   illegitimate.  As stated in Section 4.6, filtering ICMPv6 PTB packets
   causes PMTUD to fail.  Operators MUST ensure proper PMTUD operation
   in their network, including making sure that the network generates a
   PTB packet when it drops a packet that exceeds the outgoing interface
   MTU.

   Many upper-layer protocols rely on PMTUD.

8.  IANA Considerations

   This document makes no request of IANA.

9.  Security Considerations

   This document mitigates some of the security considerations
   associated with IP fragmentation by discouraging its use.  It does
   not introduce any new security vulnerabilities, because it does not
   introduce any new alternatives to IP fragmentation.  Instead, it
   recommends well-understood alternatives.

10.  Acknowledgements

   Thanks to Mikael Abrahamsson, Brian Carpenter, Silambu Chelvan,
   Lorenzo Colitti, Mike Heard, Tom Herbert, Tatuya Jinmei, Paolo
   Lucente, Manoj Nayak, Eric Nygren, and Joe Touch for their comments.

11.  References

11.1.  Normative References

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
              DOI 10.17487/RFC0791, September 1981.

   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
              RFC 792, DOI 10.17487/RFC0792, September 1981.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              DOI 10.17487/RFC1191, November 1990.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet
              Control Message Protocol (ICMPv6) for the Internet
              Protocol Version 6 (IPv6) Specification", STD 89,
              RFC 4443, DOI 10.17487/RFC4443, March 2006.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007.

   [RFC8085]  Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
              Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
              March 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", STD 86, RFC 8200,
              DOI 10.17487/RFC8200, July 2017.

   [RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
              "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
              DOI 10.17487/RFC8201, July 2017.

11.2.  Informative References

   [Damas]    Damas, J. and G. Huston, "Measuring ATR", April 2018.

   [Huston]   Huston, G., "IPv6, Large UDP Packets and the DNS
              (http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html)",
              August 2017.

   [I-D.ietf-intarea-tunnels]
              Touch, J. and M. Townsley, "IP Tunnels in the Internet
              Architecture", draft-ietf-intarea-tunnels-09 (work in
              progress), July 2018.

   [I-D.ietf-tsvwg-datagram-plpmtud]
              Fairhurst, G., Jones, T., Tuexen, M., and I. Ruengeler,
              "Packetization Layer Path MTU Discovery for Datagram
              Transports", draft-ietf-tsvwg-datagram-plpmtud-05 (work in
              progress), October 2018.

   [I-D.ietf-tsvwg-udp-options]
              Touch, J., "Transport Options for UDP", draft-ietf-tsvwg-
              udp-options-05 (work in progress), July 2018.

   [Kent]     Kent, C. and J.
              Mogul, "Fragmentation Considered Harmful", In Proc.
              SIGCOMM '87 Workshop on Frontiers in Computer
              Communications Technology, DOI 10.1145/55483.55524,
              August 1987.

   [Ptacek1998]
              Ptacek, T. and T. Newsham, "Insertion, Evasion and Denial
              of Service: Eluding Network Intrusion Detection", 1998.

   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122,
              DOI 10.17487/RFC1122, October 1989.

   [RFC1858]  Ziemba, G., Reed, D., and P. Traina, "Security
              Considerations for IP Fragment Filtering", RFC 1858,
              DOI 10.17487/RFC1858, October 1995.

   [RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
              DOI 10.17487/RFC2003, October 1996.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998.

   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
              December 1998.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              DOI 10.17487/RFC2784, March 2000.

   [RFC3128]  Miller, I., "Protection Against a Variant of the Tiny
              Fragment Attack (RFC 1858)", RFC 3128,
              DOI 10.17487/RFC3128, June 2001.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340,
              DOI 10.17487/RFC4340, March 2006.

   [RFC4459]  Savola, P., "MTU and Fragmentation Issues with In-the-
              Network Tunneling", RFC 4459, DOI 10.17487/RFC4459, April
              2006.

   [RFC4890]  Davies, E. and J. Mohacsi, "Recommendations for Filtering
              ICMPv6 Messages in Firewalls", RFC 4890,
              DOI 10.17487/RFC4890, May 2007.

   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
              RFC 4960, DOI 10.17487/RFC4960, September 2007.

   [RFC5340]  Coltun, R., Ferguson, D., Moy, J., and A.
              Lindem, "OSPF for IPv6", RFC 5340, DOI 10.17487/RFC5340,
              July 2008.

   [RFC5722]  Krishnan, S., "Handling of Overlapping IPv6 Fragments",
              RFC 5722, DOI 10.17487/RFC5722, December 2009.

   [RFC5927]  Gont, F., "ICMP Attacks against TCP", RFC 5927,
              DOI 10.17487/RFC5927, July 2010.

   [RFC6346]  Bush, R., Ed., "The Address plus Port (A+P) Approach to
              the IPv4 Address Shortage", RFC 6346,
              DOI 10.17487/RFC6346, August 2011.

   [RFC6888]  Perreault, S., Ed., Yamagata, I., Miyakawa, S., Nakagawa,
              A., and H. Ashida, "Common Requirements for Carrier-Grade
              NATs (CGNs)", BCP 127, RFC 6888, DOI 10.17487/RFC6888,
              April 2013.

   [RFC7588]  Bonica, R., Pignataro, C., and J. Touch, "A Widely
              Deployed Solution to the Generic Routing Encapsulation
              (GRE) Fragmentation Problem", RFC 7588,
              DOI 10.17487/RFC7588, July 2015.

   [RFC7676]  Pignataro, C., Bonica, R., and S. Krishnan, "IPv6 Support
              for Generic Routing Encapsulation (GRE)", RFC 7676,
              DOI 10.17487/RFC7676, October 2015.

   [RFC7739]  Gont, F., "Security Implications of Predictable Fragment
              Identification Values", RFC 7739, DOI 10.17487/RFC7739,
              February 2016.

   [RFC7872]  Gont, F., Linkova, J., Chown, T., and W. Liu,
              "Observations on the Dropping of Packets with IPv6
              Extension Headers in the Real World", RFC 7872,
              DOI 10.17487/RFC7872, June 2016.

   [RFC8086]  Yong, L., Ed., Crabbe, E., Xu, X., and T. Herbert, "GRE-
              in-UDP Encapsulation", RFC 8086, DOI 10.17487/RFC8086,
              March 2017.

Appendix A.  Contributors' Address

Authors' Addresses

   Ron Bonica
   Juniper Networks
   2251 Corporate Park Drive
   Herndon, Virginia  20171
   USA

   Email: rbonica@juniper.net


   Fred Baker
   Unaffiliated
   Santa Barbara, California  93117
   USA

   Email: FredBaker.IETF@gmail.com


   Geoff Huston
   APNIC
   6 Cordelia St
   Brisbane, 4101 QLD
   Australia

   Email: gih@apnic.net


   Robert M. Hinden
   Check Point Software
   959 Skyway Road
   San Carlos, California  94070
   USA

   Email: bob.hinden@gmail.com


   Ole Troan
   Cisco
   Philip Pedersens vei 1
   N-1366 Lysaker
   Norway

   Email: ot@cisco.com


   Fernando Gont
   SI6 Networks
   Evaristo Carriego 2644
   Haedo, Provincia de Buenos Aires
   Argentina

   Email: fgont@si6networks.com