Internet Area WG
Internet-Draft
Intended status: Best Current Practice
Expires: July 15, 2019

R. Bonica, Juniper Networks
F. Baker, Unaffiliated
G. Huston, APNIC
R.
Hinden, Check Point Software
O. Troan, Cisco
F. Gont, SI6 Networks

January 11, 2019

IP Fragmentation Considered Fragile
draft-ietf-intarea-frag-fragile-05

Abstract

   This document describes IP fragmentation and explains how it reduces
   the reliability of Internet communication.

   This document also proposes alternatives to IP fragmentation and
   provides recommendations for developers and network operators.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 15, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  IP Fragmentation
     2.1.  Links, Paths, MTU and PMTU
     2.2.  Fragmentation Procedures
     2.3.  Upper-Layer Reliance on IP Fragmentation
   3.  Requirements Language
   4.  Reduced Reliability
     4.1.  Policy-Based Routing
     4.2.  Network Address Translation (NAT)
     4.3.  Stateless Firewalls
     4.4.  Stateless Load Balancers
     4.5.  IPv4 Reassembly Errors at High Data Rates
     4.6.  Security Vulnerabilities
     4.7.  Blackholing Due to ICMP Loss
       4.7.1.  Transient Loss
       4.7.2.  Incorrect Implementation of Security Policy
       4.7.3.  Persistent Loss Caused By Anycast
     4.8.  Blackholing Due To Filtering
   5.  Alternatives to IP Fragmentation
     5.1.  Transport Layer Solutions
     5.2.  Application Layer Solutions
   6.  Applications That Rely on IPv6 Fragmentation
     6.1.  DNS
     6.2.  OSPF
     6.3.  Packet-in-Packet Encapsulations
     6.4.  Licklider Transmission Protocol (LTP)
   7.  Recommendations
     7.1.  For Application and Protocol Developers
     7.2.  For System Developers
     7.3.  For Middle Box Developers
     7.4.  For Network Operators
   8.
       IANA Considerations
   9.  Security Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Contributors' Address
   Authors' Addresses

1.  Introduction

   Operational experience [Kent] [Huston] [RFC7872] reveals that IP
   fragmentation reduces the reliability of Internet communication.
   This document describes IP fragmentation and explains how it reduces
   the reliability of Internet communication. This document also
   proposes alternatives to IP fragmentation and provides
   recommendations for developers and network operators.

   While this document identifies issues associated with IP
   fragmentation, it does not recommend deprecation. Some applications
   (see Section 6) require IP fragmentation. Furthermore, fragmentation
   is expected to work in limited domains where security and
   interoperability issues can be addressed.

   Rather than deprecating IP Fragmentation, this document recommends
   that upper-layer protocols address the problem of fragmentation at
   their layer, reducing their reliance on IP fragmentation to the
   greatest degree possible.

2.  IP Fragmentation

2.1.  Links, Paths, MTU and PMTU

   An Internet path connects a source node to a destination node. A
   path can contain links and routers. If a path contains more than one
   link, the links are connected in series and a router connects each
   link to the next.

   Internet paths are dynamic. Assume that the path from one node to
   another contains a set of links and routers.
   If the network topology changes, that path can also change so that
   it includes a different set of links and routers.

   Each link is constrained by the number of bytes that it can convey in
   a single IP packet. This constraint is called the link Maximum
   Transmission Unit (MTU). IPv4 [RFC0791] requires every link to
   support a specified MTU (see footnote). IPv6 [RFC8200] requires
   every link to support an MTU of 1280 bytes or greater. These are
   called the IPv4 and IPv6 minimum link MTUs.

   Likewise, each Internet path is constrained by the number of bytes
   that it can convey in a single IP packet. This constraint is called
   the Path MTU (PMTU). For any given path, the PMTU is equal to the
   smallest of its link MTUs. Because Internet paths are dynamic, PMTU
   is also dynamic.

   For reasons described below, source nodes estimate the PMTU between
   themselves and destination nodes. A source node can produce
   extremely conservative PMTU estimates in which:

   o  The estimate for each IPv4 path is equal to the IPv4 minimum link
      MTU.

   o  The estimate for each IPv6 path is equal to the IPv6 minimum link
      MTU.

   While these conservative estimates are guaranteed to be less than or
   equal to the actual PMTU, they are likely to be much less than the
   actual PMTU. This may adversely affect upper-layer protocol
   performance.

   By executing Path MTU Discovery (PMTUD) [RFC1191] [RFC8201]
   procedures, a source node can maintain a less conservative estimate
   of the PMTU between itself and a destination node. In PMTUD, the
   source node produces an initial PMTU estimate. This initial estimate
   is equal to the MTU of the first link along the path to the
   destination node. It can be greater than the actual PMTU.

   Having produced an initial PMTU estimate, the source node sends
   non-fragmentable IP packets to the destination node.
   If one of these packets is larger than the actual PMTU, a downstream
   router will not be able to forward the packet through the next link
   along the path. Therefore, the downstream router drops the packet
   and sends an Internet Control Message Protocol (ICMP) [RFC0792]
   [RFC4443] Packet Too Big (PTB) message to the source node. The ICMP
   PTB message indicates the MTU of the link through which the packet
   could not be forwarded. The source node uses this information to
   refine its PMTU estimate.

   PMTUD produces a running estimate of the PMTU between a source node
   and a destination node. Because PMTU is dynamic, at any given time,
   the PMTU estimate can differ from the actual PMTU. In order to
   detect PMTU increases, PMTUD occasionally resets the PMTU estimate to
   its initial value and repeats the procedure described above.

   PMTUD has the following characteristics:

   o  It relies on the network's ability to deliver ICMP PTB messages to
      the source node.

   o  It is susceptible to attack because ICMP messages are easily
      forged [RFC5927].

   FOOTNOTE: In IPv4, every host must be capable of receiving a packet
   whose length is equal to 576 bytes. However, the IPv4 minimum link
   MTU is not 576. Section 3.2 of RFC 791 explicitly states that the
   IPv4 minimum link MTU is 68 bytes. But for practical purposes, many
   network operators consider the IPv4 minimum link MTU to be 576 bytes.
   So, for the purposes of this document, we assume that the IPv4
   minimum link MTU is 576 bytes.

   FOOTNOTE: In the paragraphs above, the term "non-fragmentable packet"
   is introduced. A non-fragmentable packet can be fragmented at its
   source. However, it cannot be fragmented by a downstream node. An
   IPv4 packet whose DF-bit is set to zero is fragmentable. An IPv4
   packet whose DF-bit is set to one is non-fragmentable. All IPv6
   packets are also non-fragmentable.
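
   The PMTU definition and the PMTUD procedure described above can be
   illustrated with a small simulation. This is only a sketch under
   simplifying assumptions: the path is a fixed list of link MTUs, every
   ICMP PTB message is delivered, and the function names are
   hypothetical rather than taken from any implementation.

```python
from typing import List, Optional


def path_mtu(link_mtus: List[int]) -> int:
    """The PMTU equals the smallest link MTU along the path."""
    return min(link_mtus)


def ptb_for(link_mtus: List[int], packet_len: int) -> Optional[int]:
    """Return the MTU that a router would report in an ICMP PTB message
    for a non-fragmentable packet, or None if the packet is delivered."""
    for mtu in link_mtus:
        if packet_len > mtu:
            return mtu  # this link cannot carry the packet
    return None


def discover_pmtu(link_mtus: List[int]) -> int:
    """Simulate PMTUD, assuming that no ICMP PTB message is ever lost."""
    estimate = link_mtus[0]  # initial estimate: MTU of the first link
    while True:
        reported = ptb_for(link_mtus, estimate)
        if reported is None:
            return estimate  # a packet of this size reached the destination
        estimate = reported  # refine the estimate from the PTB message
```

   For a path with link MTUs of 1500, 1400, 1280 and 1500 bytes, the
   estimate steps from 1500 to 1400 to 1280. If the PTB messages were
   lost, the loop above would never learn the smaller values, which is
   the failure mode discussed in Section 4.7.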

   FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is
   introduced. The ICMP PTB message has two instantiations. In ICMPv4
   [RFC0792], the ICMP PTB message is a Destination Unreachable message
   with Code equal to (4) fragmentation needed and DF set. This message
   was augmented by [RFC1191] to indicate the MTU of the link through
   which the packet could not be forwarded. In ICMPv6 [RFC4443], the
   ICMP PTB message is a Packet Too Big Message with Code equal to (0).
   This message also indicates the MTU of the link through which the
   packet could not be forwarded.

2.2.  Fragmentation Procedures

   When an upper-layer protocol submits data to the underlying IP
   module, and the resulting IP packet's length is greater than the
   PMTU, the packet can be divided into fragments. Each fragment
   includes an IP header and a portion of the original packet.

   [RFC0791] describes IPv4 fragmentation procedures. An IPv4 packet
   whose DF-bit is set to one cannot be fragmented. An IPv4 packet
   whose DF-bit is set to zero can be fragmented by the source node or
   by any downstream router. When an IPv4 packet is fragmented, all IP
   options appear in the first fragment, but only options whose "copy"
   bit is set to one appear in subsequent fragments.

   [RFC8200] describes IPv6 fragmentation procedures. An IPv6 packet
   can be fragmented at the source node only. When an IPv6 packet is
   fragmented, all extension headers appear in the first fragment, but
   only per-fragment headers appear in subsequent fragments.
   Per-fragment headers include the following:

   o  The IPv6 header.

   o  The Hop-by-hop Options header (if present)

   o  The Destination Options header (if present and if it precedes a
      Routing header)

   o  The Routing Header (if present)

   o  The Fragment Header

   In both IPv4 and IPv6, the upper-layer header appears in the first
   fragment only. It does not appear in subsequent fragments.
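
   The following sketch shows how a packet's upper-layer data might be
   divided into fragments. It is illustrative only and assumes an IPv4
   header of 20 bytes with no options. The Fragment type and the
   fragment_payload function are hypothetical names. Fragment offsets
   are counted in 8-byte units, as in RFC 791.

```python
from dataclasses import dataclass
from typing import List

IPV4_HEADER_LEN = 20  # assumption: IPv4 header without options


@dataclass
class Fragment:
    offset_units: int     # fragment offset, in units of 8 bytes
    more_fragments: bool  # True on every fragment except the last
    data: bytes           # portion of the original packet's payload


def fragment_payload(payload: bytes, mtu: int) -> List[Fragment]:
    """Split the payload of an IP packet into fragments that fit the MTU."""
    # Every fragment except the last must carry a multiple of 8 bytes.
    max_data = (mtu - IPV4_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        more = start + max_data < len(payload)
        fragments.append(Fragment(start // 8, more, chunk))
    return fragments
```

   If the payload begins with an 8-byte UDP header, that header lands in
   the first fragment only. Later fragments carry no port information,
   which is the root of several problems described in Section 4.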

2.3.  Upper-Layer Reliance on IP Fragmentation

   Upper-layer protocols can operate in the following modes:

   o  Do not rely on IP fragmentation.

   o  Rely on IP fragmentation by the source node only.

   o  Rely on IP fragmentation by any node.

   Upper-layer protocols running over IPv4 can operate in all of the
   above-mentioned modes. Upper-layer protocols running over IPv6 can
   operate in the first and second modes only.

   Upper-layer protocols that operate in the first two modes (above)
   require access to the PMTU estimate. In order to fulfil this
   requirement, they can:

   o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
      MTU.

   o  Access the estimate that PMTUD produced.

   o  Execute PMTUD procedures themselves.

   o  Execute Packetization Layer PMTUD (PLPMTUD) [RFC4821]
      [I-D.ietf-tsvwg-datagram-plpmtud] procedures.

   According to PLPMTUD procedures, the upper-layer protocol maintains a
   running PMTU estimate. It does so by sending probe packets of
   various sizes to its upper-layer peer and receiving acknowledgements.
   This strategy differs from PMTUD in that it relies on the
   acknowledgement of received messages, as opposed to ICMP PTB messages
   concerning dropped messages. Therefore, PLPMTUD does not rely on the
   network's ability to deliver ICMP PTB messages to the source.

3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

4.  Reduced Reliability

   This section explains how IP fragmentation reduces the reliability of
   Internet communication.

4.1.  Policy-Based Routing

   IP Fragmentation causes problems for routers that implement
   policy-based routing.

   When a router receives a packet, it identifies the next-hop on route
   to the packet's destination and forwards the packet to that next-hop.
   In order to identify the next-hop, the router interrogates a local
   data structure called the Forwarding Information Base (FIB).

   Normally, the FIB contains destination-based entries that map a
   destination prefix to a next-hop. Policy-based routing allows
   destination-based and policy-based entries to coexist in the same
   FIB. A policy-based FIB entry maps multiple fields, drawn from
   either the IP or transport-layer header, to a next-hop.

   +-------+--------------+-----------------+------------+-------------+
   | Entry | Type         | Dest. Prefix    | Next Hdr / | Next-Hop    |
   |       |              |                 | Dest. Port |             |
   +-------+--------------+-----------------+------------+-------------+
   |       |              |                 |            |             |
   | 1     | Destination- | 2001:db8::1/128 | Any / Any  | 2001:db8::2 |
   |       | based        |                 |            |             |
   |       |              |                 |            |             |
   | 2     | Policy-      | 2001:db8::1/128 | TCP / 80   | 2001:db8::3 |
   |       | based        |                 |            |             |
   +-------+--------------+-----------------+------------+-------------+

                     Table 1: Policy-Based Routing FIB

   Assume that a router maintains the FIB in Table 1. The first FIB
   entry is destination-based. It maps a destination prefix
   (2001:db8::1/128) to a next-hop (2001:db8::2). The second FIB entry
   is policy-based. It maps the same destination prefix
   (2001:db8::1/128) and a destination port (TCP / 80) to a different
   next-hop (2001:db8::3). The second entry is more specific than the
   first.

   When the router receives the first fragment of a packet that is
   destined for TCP port 80 on 2001:db8::1, it interrogates the FIB.
   Both FIB entries satisfy the query. The router selects the second
   FIB entry because it is more specific and forwards the packet to
   2001:db8::3.

   When the router receives the second fragment of the packet, it
   interrogates the FIB again.
   This time, only the first FIB entry satisfies the query, because the
   second fragment contains no indication that the packet is destined
   for TCP port 80. Therefore, the router selects the first FIB entry
   and forwards the packet to 2001:db8::2. As a result, fragments of
   the same packet follow different paths.

   Policy-based routing is also known as filter-based forwarding.

4.2.  Network Address Translation (NAT)

   IP fragmentation causes problems for Network Address Translation
   (NAT) devices. When a NAT device detects a new, outbound flow, it
   maps that flow's source port and IP address to another source port
   and IP address. Having created that mapping, the NAT device
   translates:

   o  The Source IP Address and Source Port on each outbound packet.

   o  The Destination IP Address and Destination Port on each inbound
      packet.

   A+P [RFC6346] and Carrier Grade NAT (CGN) [RFC6888] are two common
   NAT strategies. In both approaches, the NAT device must virtually
   reassemble fragmented packets in order to translate and forward each
   fragment.

   Virtual reassembly in the network is problematic, because it is
   computationally expensive and because it is prone to attacks
   (Section 4.6).

4.3.  Stateless Firewalls

   IP fragmentation causes problems for stateless firewalls whose rules
   include TCP and UDP ports. Because port information is not available
   in the trailing fragments, the firewall is limited to the following
   options:

   o  Accept all trailing fragments, possibly admitting certain classes
      of attack.

   o  Block all trailing fragments, possibly blocking legitimate
      traffic.

   Neither option is attractive.

   This problem does not occur in stateful firewalls.

4.4.  Stateless Load Balancers

   IP fragmentation causes problems for stateless load balancers. In
   order to assign a packet or packet fragment to a link, the load
   balancer executes an algorithm.
   If the packet or packet fragment contains a transport-layer header,
   the load balancing algorithm accepts the following 5-tuple as input:

   o  IP Source Address.

   o  IP Destination Address.

   o  IPv4 Protocol or IPv6 Next Header.

   o  Transport-layer source port.

   o  Transport-layer destination port.

   If the packet or packet fragment does not contain a transport-layer
   header, the load balancing algorithm accepts only the following
   3-tuple as input:

   o  IP Source Address.

   o  IP Destination Address.

   o  IPv4 Protocol or IPv6 Next Header.

   Therefore, non-fragmented packets belonging to a flow can be assigned
   to one link while fragmented packets belonging to the same flow can
   be divided between that link and another. This can cause suboptimal
   load balancing.

4.5.  IPv4 Reassembly Errors at High Data Rates

   IPv4 fragmentation is not sufficiently robust for use under some
   conditions in today's Internet. At high data rates, the 16-bit IP
   identification field is not large enough to prevent frequent
   incorrectly assembled IP fragments, and the TCP and UDP checksums are
   insufficient to prevent the resulting corrupted datagrams from being
   delivered to higher protocol layers. [RFC4963] describes some easily
   reproduced experiments demonstrating the problem, and discusses some
   of the operational implications of these observations.

   These reassembly issues are not easily reproducible in IPv6 because
   the IPv6 identification field is 32 bits long.

4.6.  Security Vulnerabilities

   Security researchers have documented several attacks that exploit IP
   fragmentation.
   The following are examples:

   o  Overlapping fragment attacks [RFC1858][RFC3128][RFC5722]

   o  Resource exhaustion attacks (such as the Rose Attack)

   o  Attacks based on predictable fragment identification values
      [RFC7739]

   o  Evasion of Network Intrusion Detection Systems (NIDS) [Ptacek1998]

   In the overlapping fragment attack, an attacker constructs a series
   of packet fragments. The first fragment contains an IP header, a
   transport-layer header, and some transport-layer payload. This
   fragment complies with local security policy and is allowed to pass
   through a stateless firewall. A second fragment, having a non-zero
   offset, overlaps with the first fragment. The second fragment also
   passes through the stateless firewall. When the packet is
   reassembled, the transport layer header from the first fragment is
   overwritten by data from the second fragment. The reassembled packet
   does not comply with local security policy. Had it traversed the
   firewall in one piece, the firewall would have rejected it.

   A stateless firewall cannot protect against the overlapping fragment
   attack. However, destination nodes can protect against the
   overlapping fragment attack by implementing the procedures described
   in RFC 1858, RFC 3128 and RFC 8200. These reassembly procedures
   detect the overlap and discard the packet.

   The fragment reassembly algorithm is a stateful procedure for an
   otherwise stateless protocol. Therefore, it can be exploited by
   resource exhaustion attacks. An attacker can construct a series of
   fragmented packets, with one fragment missing from each packet so
   that the reassembly is impossible. Thus, this attack causes resource
   exhaustion on the destination node, possibly denying reassembly
   services to other flows. This type of attack can be mitigated by
   flushing fragment reassembly buffers when necessary, at the expense
   of possibly dropping legitimate fragments.
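
   The two mitigations described above, discarding packets whose
   fragments overlap and flushing reassembly buffers under pressure, can
   be sketched as a bounded reassembly table. This is an illustrative
   sketch, not a description of any particular stack. The class name,
   the eviction policy (oldest packet first) and the treatment of exact
   duplicates as overlaps are all assumptions made for brevity.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple

# Hypothetical reassembly key: (source, destination, protocol, identification)
Key = Tuple[str, str, int, int]


class ReassemblyTable:
    """Reassembly state with a hard limit on packets in progress."""

    def __init__(self, max_packets: int) -> None:
        self.max_packets = max_packets
        self._partial: "OrderedDict[Key, Dict[int, bytes]]" = OrderedDict()
        self._total_len: Dict[Key, int] = {}

    def __len__(self) -> int:
        return len(self._partial)

    def _drop(self, key: Key) -> None:
        self._partial.pop(key, None)
        self._total_len.pop(key, None)

    def add(self, key: Key, offset: int, data: bytes,
            last: bool) -> Optional[bytes]:
        """Add one fragment (offset in bytes). Return the reassembled
        payload once it is complete, otherwise None."""
        frags = self._partial.get(key)
        if frags is None:
            if len(self._partial) >= self.max_packets:
                # Flush the oldest incomplete packet to bound memory use.
                old_key, _ = self._partial.popitem(last=False)
                self._total_len.pop(old_key, None)
            frags = self._partial[key] = {}
        end = offset + len(data)
        for other_offset, other in frags.items():
            if offset < other_offset + len(other) and other_offset < end:
                self._drop(key)  # overlap detected: discard the packet
                return None
        frags[offset] = data
        if last:
            self._total_len[key] = end
        total = self._total_len.get(key)
        if total is None:
            return None
        pos, parts = 0, []
        for frag_offset in sorted(frags):
            if frag_offset != pos:
                return None  # a fragment is still missing
            parts.append(frags[frag_offset])
            pos += len(frags[frag_offset])
        if pos != total:
            return None
        self._drop(key)
        return b"".join(parts)
```

   The limit keeps an attacker from exhausting memory, but the evicted
   packet may have been legitimate, which is the trade-off noted above.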

   Each IP fragment contains an "Identification" field that destination
   nodes use to reassemble fragmented packets. Many implementations set
   the Identification field to a predictable value, thus making it easy
   for an attacker to forge malicious IP fragments that would cause the
   reassembly procedure for legitimate packets to fail.

   NIDS aims at identifying malicious activity by analyzing network
   traffic. Ambiguity in the possible result of the fragment reassembly
   process may allow an attacker to evade these systems. Many of these
   systems try to mitigate some of these evasion techniques (e.g., by
   computing all possible outcomes of the fragment reassembly process,
   at the expense of increased processing requirements).

4.7.  Blackholing Due to ICMP Loss

   As mentioned in Section 2.3, upper-layer protocols can be configured
   to rely on PMTUD. Because PMTUD relies upon the network to deliver
   ICMP PTB messages, those protocols also rely on the network to
   deliver ICMP PTB messages.

   According to [RFC4890], ICMP PTB messages must not be filtered.
   However, ICMP PTB delivery is not reliable. It is subject to both
   transient and persistent loss.

   Transient loss of ICMP PTB messages can cause transient black holes.
   When the conditions contributing to transient loss abate, the network
   regains its ability to deliver ICMP PTB messages and connectivity
   between the source and destination nodes is restored. Section 4.7.1
   of this document describes conditions that lead to transient loss of
   ICMP PTB messages.

   Persistent loss of ICMP PTB messages can cause persistent black
   holes. Section 4.7.2 and Section 4.7.3 of this document describe
   conditions that lead to persistent loss of ICMP PTB messages.

   The problem described in this section is specific to PMTUD. It does
   not occur when the upper-layer protocol obtains its PMTU estimate
   from PLPMTUD or from any other source.

4.7.1.  Transient Loss

   The following factors can contribute to transient loss of ICMP PTB
   messages:

   o  Network congestion.

   o  Packet corruption.

   o  Transient routing loops.

   o  ICMP rate limiting.

   The effect of rate limiting may be severe, as RFC 4443 recommends
   strict rate limiting of ICMPv6 traffic.

4.7.2.  Incorrect Implementation of Security Policy

   Incorrect implementation of security policy can cause persistent loss
   of ICMP PTB messages.

   Assume that a Customer Premises Equipment (CPE) router implements the
   following zone-based security policy:

   o  Allow any traffic to flow from the inside zone to the outside
      zone.

   o  Do not allow any traffic to flow from the outside zone to the
      inside zone unless it is part of an existing flow (i.e., it was
      elicited by an outbound packet).

   When a correct implementation of the above-mentioned security policy
   receives an ICMP PTB message, it examines the ICMP PTB payload in
   order to determine whether the original packet (i.e., the packet that
   elicited the ICMP PTB message) belonged to an existing flow. If the
   original packet belonged to an existing flow, the implementation
   allows the ICMP PTB to flow from the outside zone to the inside zone.
   If not, the implementation discards the ICMP PTB message.

   When an incorrect implementation of the above-mentioned security
   policy receives an ICMP PTB message, it discards the packet because
   its source address is not associated with an existing flow.

   The security policy described above is implemented incorrectly on
   many consumer CPE routers.

4.7.3.  Persistent Loss Caused By Anycast

   Anycast can cause persistent loss of ICMP PTB messages. Consider the
   example below:

   A DNS client sends a request to an anycast address. The network
   routes that DNS request to the nearest instance of that anycast
   address (i.e., a DNS Server).
   The DNS server generates a response and sends it back to the DNS
   client. While the response does not exceed the DNS server's PMTU
   estimate, it does exceed the actual PMTU.

   A downstream router drops the packet and sends an ICMP PTB message to
   the packet's source (i.e., the anycast address). The network routes
   the ICMP PTB message to the anycast instance closest to the
   downstream router. That anycast instance may not be the DNS server
   that originated the DNS response. It may be another DNS server with
   the same anycast address. The DNS server that originated the
   response may never receive the ICMP PTB message and may never update
   its PMTU estimate.

4.8.  Blackholing Due To Filtering

   In RFC 7872, researchers sampled Internet paths to determine whether
   they would convey packets that contain IPv6 extension headers.
   Sampled paths terminated at popular Internet sites (e.g., popular
   web, mail and DNS servers).

   The study revealed that at least 28% of the sampled paths did not
   convey packets containing the IPv6 Fragment extension header. In
   most cases, fragments were dropped in the destination autonomous
   system. In other cases, the fragments were dropped in transit
   autonomous systems.

   Another recent study [Huston] confirmed this finding. It reported
   that 37% of sampled endpoints used IPv6-capable DNS resolvers that
   were incapable of receiving a fragmented IPv6 response.

   It is difficult to determine why network operators drop fragments.
   Possible causes follow:

   o  Hardware inability to process fragmented packets.

   o  Failure to change vendor defaults.

   o  Unintentional misconfiguration.

   o  Intentional configuration (e.g., network operators consciously
      choose to drop IPv6 fragments in order to address the issues
      raised in Section 4.1 through Section 4.7, above).

5.  Alternatives to IP Fragmentation

5.1.
      Transport Layer Solutions

   The Transmission Control Protocol (TCP) [RFC0793] can be operated in
   a mode that does not require IP fragmentation.

   Applications submit a stream of data to TCP. TCP divides that stream
   of data into segments, with no segment exceeding the TCP Maximum
   Segment Size (MSS). Each segment is encapsulated in a TCP header and
   submitted to the underlying IP module. The underlying IP module
   prepends an IP header and forwards the resulting packet.

   If the TCP MSS is sufficiently small, the underlying IP module never
   produces a packet whose length is greater than the actual PMTU.
   Therefore, IP fragmentation is not required.

   TCP offers the following mechanisms for MSS management:

   o  Manual configuration

   o  PMTUD

   o  PLPMTUD

   Manual configuration is always applicable. If the MSS is configured
   to a sufficiently low value, the IP layer will never produce a packet
   whose length is greater than the protocol minimum link MTU. However,
   manual configuration prevents TCP from taking advantage of larger
   link MTUs.

   Upper-layer protocols can implement PMTUD in order to discover and
   take advantage of larger path MTUs. However, as mentioned in
   Section 2.1, PMTUD relies upon the network to deliver ICMP PTB
   messages. Therefore, PMTUD is applicable only in environments where
   the risk of ICMP PTB loss is acceptable.

   By contrast, PLPMTUD does not rely upon the network's ability to
   deliver ICMP PTB messages. However, in many loss-based TCP
   congestion control algorithms, the dropping of a packet may cause the
   TCP congestion control algorithm to reduce the congestion window, or
   even to restart the entire slow start process. For high capacity,
   long round-trip time, large volume TCP streams, the deliberate
   probing with large packets and the consequent packet drop may impose
   too harsh a penalty on total TCP throughput for it to be a viable
   approach.
   [RFC4821] defines PLPMTUD procedures for TCP.

   While TCP will never cause the underlying IP module to emit a packet
   that is larger than the PMTU estimate, it can cause the underlying IP
   module to emit a packet that is larger than the actual PMTU. If this
   occurs, the packet is dropped, the PMTU estimate is updated, the
   segment is divided into smaller segments and each smaller segment is
   submitted to the underlying IP module.

   The Datagram Congestion Control Protocol (DCCP) [RFC4340] and the
   Stream Control Transmission Protocol (SCTP) [RFC4960] also can be
   operated in a mode that does not require IP fragmentation. They both
   accept data from an application and divide that data into segments,
   with no segment exceeding a maximum size. Both DCCP and SCTP offer
   manual configuration, PMTUD and PLPMTUD as mechanisms for managing
   that maximum size. [I-D.ietf-tsvwg-datagram-plpmtud] proposes
   PLPMTUD procedures for DCCP and SCTP.

   Currently, User Datagram Protocol (UDP) [RFC0768] lacks a
   fragmentation mechanism of its own and relies on IP fragmentation.
   However, [I-D.ietf-tsvwg-udp-options] proposes a fragmentation
   mechanism for UDP.

5.2.  Application Layer Solutions

   [RFC8085] recognizes that IP fragmentation reduces the reliability of
   Internet communication. It also recognizes that UDP lacks a
   fragmentation mechanism of its own and relies on IP fragmentation.
   Therefore, [RFC8085] offers the following advice regarding
   applications that run over UDP.

   "An application SHOULD NOT send UDP datagrams that result in IP
   packets that exceed the Maximum Transmission Unit (MTU) along the
   path to the destination. Consequently, an application SHOULD either
   use the path MTU information provided by the IP layer or implement
   Path MTU Discovery (PMTUD) itself to determine whether the path to a
   destination will support its desired message size without
   fragmentation."
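
   As an illustration of the advice quoted above, an application can
   derive the largest UDP payload that fits in one unfragmented IP
   packet from its PMTU estimate, and split larger messages itself.
   The sketch below assumes an IPv4 header without options (20 bytes),
   an IPv6 header without extension headers (40 bytes) and an 8-byte UDP
   header. The function names are hypothetical, and any sequencing
   needed to reorder chunks at the receiver is left out.

```python
from typing import List

IPV4_HEADER_LEN = 20  # assumption: no IP options
IPV6_HEADER_LEN = 40  # assumption: no extension headers
UDP_HEADER_LEN = 8


def max_udp_payload(pmtu_estimate: int, ipv6: bool) -> int:
    """Largest UDP payload that fits in one packet of pmtu_estimate bytes."""
    ip_header = IPV6_HEADER_LEN if ipv6 else IPV4_HEADER_LEN
    payload = pmtu_estimate - ip_header - UDP_HEADER_LEN
    if payload <= 0:
        raise ValueError("PMTU estimate too small for a UDP datagram")
    return payload


def split_message(message: bytes, pmtu_estimate: int,
                  ipv6: bool) -> List[bytes]:
    """Divide an application message into chunks, each of which fits in
    a single unfragmented UDP datagram."""
    size = max_udp_payload(pmtu_estimate, ipv6)
    return [message[i:i + size] for i in range(0, len(message), size)]
```

   With the conservative estimates used in this document, the limits are
   548 bytes of payload over IPv4 (576-byte estimate) and 1232 bytes
   over IPv6 (1280-byte estimate).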

   RFC 8085 continues:

   "Applications that do not follow the recommendation to do PMTU/PLPMTUD
   discovery SHOULD still avoid sending UDP datagrams that would
   result in IP packets that exceed the path MTU. Because the actual
   path MTU is unknown, such applications SHOULD fall back to sending
   messages that are shorter than the default effective MTU for sending
   (EMTU_S in [RFC1122]). For IPv4, EMTU_S is the smaller of 576 bytes
   and the first-hop MTU. For IPv6, EMTU_S is 1280 bytes. The
   effective PMTU for a directly connected destination (with no routers
   on the path) is the configured interface MTU, which could be less
   than the maximum link payload size. Transmission of minimum-sized
   UDP datagrams is inefficient over paths that support a larger PMTU,
   which is a second reason to implement PMTU discovery."

   RFC 8085 assumes that for IPv4, an EMTU_S of 576 is sufficiently
   small, even though the IPv4 minimum link MTU is 68 bytes.

   This advice applies equally to applications that run directly over
   IP.

6. Applications That Rely on IPv6 Fragmentation

   The following applications rely on IPv6 fragmentation:

   o  DNS [RFC1035]

   o  OSPFv3 [RFC2328][RFC5340]

   o  Packet-in-packet encapsulations

   Each of these applications relies on IPv6 fragmentation to a varying
   degree. In some cases, that reliance is essential, and cannot be
   broken without fundamentally changing the protocol. In other cases,
   that reliance is incidental, and most implementations already take
   appropriate steps to avoid fragmentation.

   This list is not comprehensive, and other protocols that rely on IP
   fragmentation may exist. They are not specifically considered in the
   context of this document.

6.1. DNS

   DNS relies on UDP for efficiency, and the consequence is the use of
   IP fragmentation for large responses, as permitted by the DNS EDNS(0)
   options in the query.
   It is possible to mitigate the issue of fragmentation-based packet
   loss by having queries use smaller EDNS(0) UDP buffer sizes, or by
   having the DNS server limit the size of its UDP responses to some
   self-imposed maximum packet size that may be less than the preferred
   EDNS(0) UDP Buffer Size. In both cases, large responses are
   truncated in the DNS, signalling to the client to re-query using TCP
   to obtain the complete response. However, support for DNS over TCP
   is only partial, particularly where IPv6 transport is being used,
   and this limits the efficacy of the approach [Damas].

   Larger DNS responses can normally be avoided by aggressively pruning
   the Additional section of DNS responses. One scenario where such
   pruning is ineffective is in the use of DNSSEC, where large key sizes
   act to increase the response size to certain DNS queries. There is
   no effective response to this situation within the DNS other than
   using smaller cryptographic keys and adopting DNSSEC administrative
   practices that attempt to keep DNS responses as short as possible.

6.2. OSPF

   OSPF implementations can emit messages large enough to cause
   fragmentation. However, in order to optimize performance, most OSPF
   implementations restrict their maximum message size to a value that
   will not cause fragmentation.

6.3. Packet-in-Packet Encapsulations

   In this document, packet-in-packet encapsulations include IP-in-IP
   [RFC2003], Generic Routing Encapsulation (GRE) [RFC2784], GRE-in-UDP
   [RFC8086] and Generic Packet Tunneling in IPv6 [RFC2473]. [RFC4459]
   describes fragmentation issues associated with all of the
   above-mentioned encapsulations.

   The fragmentation strategy described for GRE in [RFC7588] has been
   deployed for all of the above-mentioned encapsulations.
   This strategy does not rely on IP fragmentation except in one corner
   case (see Section 3.3.2.2 of RFC 7588 and Section 7.1 of RFC 2473).
   Section 3.3 of [RFC7676] further describes this corner case.

   See [I-D.ietf-intarea-tunnels] for further discussion.

6.4. Licklider Transmission Protocol (LTP)

   Some UDP applications rely on IP fragmentation to achieve acceptable
   levels of performance. These applications use UDP datagram sizes
   that are larger than the path MTU so that more data can be conveyed
   between the application and the kernel in a single system call.

   For example, the Licklider Transmission Protocol (LTP) [RFC5326],
   which is in current use on the International Space Station (ISS),
   uses UDP datagram sizes larger than the path MTU to achieve
   acceptable levels of performance even though this invokes IP
   fragmentation.

7. Recommendations

7.1. For Application and Protocol Developers

   Developers SHOULD NOT develop new protocols or applications that rely
   on IP fragmentation. When a new protocol or application is deployed
   in an environment that does not fully support IP fragmentation, it
   SHOULD operate correctly, either in its default configuration or in a
   specified alternative configuration.

   Developers MAY develop new protocols or applications that rely on IP
   fragmentation if the protocol or application is to be run only in
   environments where IP fragmentation is known to be supported.

   Legacy protocols that depend upon IP fragmentation SHOULD be updated
   to break that dependency. However, in some cases, there may be no
   viable alternative to IP fragmentation (e.g., IPsec tunnel mode,
   IP-in-IP encapsulation). In these cases, the protocol will continue
   to rely on IP fragmentation but should only be used in environments
   where IP fragmentation is known to be supported.

   Protocols may be able to avoid IP fragmentation by using a
   sufficiently small MTU (e.g.,
   the protocol minimum link MTU), disabling IP fragmentation, and
   ensuring that the transport protocol in use adapts its segment size
   to the MTU. Other protocols may deploy a sufficiently reliable PMTU
   discovery mechanism (e.g., PLPMTUD).

7.2. For System Developers

   Software libraries SHOULD include provision for PLPMTUD for each
   supported transport protocol.

7.3. For Middle Box Developers

   Middle boxes SHOULD process IP fragments in a manner that is
   compliant with RFC 791 and RFC 8200. In many cases, middle boxes
   must maintain state in order to achieve this goal.

   Price and performance considerations frequently motivate network
   operators to deploy stateless middle boxes. These stateless middle
   boxes may perform sub-optimally, process IP fragments in a manner
   that is not compliant with RFC 791 or RFC 8200, or even discard IP
   fragments completely. Such behaviors are NOT RECOMMENDED. If a
   middle box implements non-standard behavior with respect to IP
   fragmentation, then that behavior MUST be clearly documented.

7.4. For Network Operators

   As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
   messages unless they are known to be forged or otherwise
   illegitimate. As stated in Section 4.7, filtering ICMPv6 PTB packets
   causes PMTUD to fail. Operators MUST ensure proper PMTUD operation
   in their network, including making sure the network generates PTB
   packets when dropping packets that are too large for the outgoing
   interface MTU. Many upper-layer protocols rely on PMTUD.

   As per RFC 8200, network operators MUST NOT deploy IPv6 links whose
   MTU is less than 1280 bytes.

   Network operators SHOULD NOT filter IP fragments if they originated
   at a domain name server or are destined for a domain name server.

8. IANA Considerations

   This document makes no request of IANA.

9. Security Considerations

   This document mitigates some of the security considerations
   associated with IP fragmentation by discouraging its use. It does
   not introduce any new security vulnerabilities, because it does not
   introduce any new alternatives to IP fragmentation. Instead, it
   recommends well-understood alternatives.

10. Acknowledgements

   Thanks to Mikael Abrahamsson, Brian Carpenter, Silambu Chelvan,
   Lorenzo Colitti, Mike Heard, Tom Herbert, Tatuya Jinmei, Jen Linkova,
   Paolo Lucente, Manoj Nayak, Eric Nygren, and Joe Touch for their
   comments.

11. References

11.1. Normative References

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
              DOI 10.17487/RFC0791, September 1981.

   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
              RFC 792, DOI 10.17487/RFC0792, September 1981.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              DOI 10.17487/RFC1191, November 1990.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet
              Control Message Protocol (ICMPv6) for the Internet
              Protocol Version 6 (IPv6) Specification", STD 89,
              RFC 4443, DOI 10.17487/RFC4443, March 2006.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007.

   [RFC8085]  Eggert, L., Fairhurst, G., and G.
              Shepherd, "UDP Usage Guidelines", BCP 145, RFC 8085,
              DOI 10.17487/RFC8085, March 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", STD 86, RFC 8200,
              DOI 10.17487/RFC8200, July 2017.

   [RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
              "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
              DOI 10.17487/RFC8201, July 2017.

11.2. Informative References

   [Damas]    Damas, J. and G. Huston, "Measuring ATR", April 2018.

   [Huston]   Huston, G., "IPv6, Large UDP Packets and the DNS
              (http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html)",
              August 2017.

   [I-D.ietf-intarea-tunnels]
              Touch, J. and M. Townsley, "IP Tunnels in the Internet
              Architecture", draft-ietf-intarea-tunnels-09 (work in
              progress), July 2018.

   [I-D.ietf-tsvwg-datagram-plpmtud]
              Fairhurst, G., Jones, T., Tuexen, M., and I. Ruengeler,
              "Packetization Layer Path MTU Discovery for Datagram
              Transports", draft-ietf-tsvwg-datagram-plpmtud-06 (work in
              progress), November 2018.

   [I-D.ietf-tsvwg-udp-options]
              Touch, J., "Transport Options for UDP",
              draft-ietf-tsvwg-udp-options-05 (work in progress),
              July 2018.

   [Kent]     Kent, C. and J. Mogul, "Fragmentation Considered
              Harmful", In Proc. SIGCOMM '87 Workshop on Frontiers in
              Computer Communications Technology,
              DOI 10.1145/55483.55524, August 1987.

   [Ptacek1998]
              Ptacek, T. and T. Newsham, "Insertion, Evasion and Denial
              of Service: Eluding Network Intrusion Detection", 1998.

   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122,
              DOI 10.17487/RFC1122, October 1989.

   [RFC1858]  Ziemba, G., Reed, D., and P.
              Traina, "Security Considerations for IP Fragment
              Filtering", RFC 1858, DOI 10.17487/RFC1858, October 1995.

   [RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
              DOI 10.17487/RFC2003, October 1996.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998.

   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
              December 1998.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              DOI 10.17487/RFC2784, March 2000.

   [RFC3128]  Miller, I., "Protection Against a Variant of the Tiny
              Fragment Attack (RFC 1858)", RFC 3128,
              DOI 10.17487/RFC3128, June 2001.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340,
              DOI 10.17487/RFC4340, March 2006.

   [RFC4459]  Savola, P., "MTU and Fragmentation Issues with
              In-the-Network Tunneling", RFC 4459,
              DOI 10.17487/RFC4459, April 2006.

   [RFC4890]  Davies, E. and J. Mohacsi, "Recommendations for Filtering
              ICMPv6 Messages in Firewalls", RFC 4890,
              DOI 10.17487/RFC4890, May 2007.

   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
              RFC 4960, DOI 10.17487/RFC4960, September 2007.

   [RFC4963]  Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
              Errors at High Data Rates", RFC 4963,
              DOI 10.17487/RFC4963, July 2007.

   [RFC5326]  Ramadas, M., Burleigh, S., and S. Farrell, "Licklider
              Transmission Protocol - Specification", RFC 5326,
              DOI 10.17487/RFC5326, September 2008.

   [RFC5340]  Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF
              for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008.

   [RFC5722]  Krishnan, S., "Handling of Overlapping IPv6 Fragments",
              RFC 5722, DOI 10.17487/RFC5722, December 2009.

   [RFC5927]  Gont, F., "ICMP Attacks against TCP", RFC 5927,
              DOI 10.17487/RFC5927, July 2010.

   [RFC6346]  Bush, R., Ed., "The Address plus Port (A+P) Approach to
              the IPv4 Address Shortage", RFC 6346,
              DOI 10.17487/RFC6346, August 2011.

   [RFC6888]  Perreault, S., Ed., Yamagata, I., Miyakawa, S., Nakagawa,
              A., and H. Ashida, "Common Requirements for Carrier-Grade
              NATs (CGNs)", BCP 127, RFC 6888, DOI 10.17487/RFC6888,
              April 2013.

   [RFC7588]  Bonica, R., Pignataro, C., and J. Touch, "A Widely
              Deployed Solution to the Generic Routing Encapsulation
              (GRE) Fragmentation Problem", RFC 7588,
              DOI 10.17487/RFC7588, July 2015.

   [RFC7676]  Pignataro, C., Bonica, R., and S. Krishnan, "IPv6 Support
              for Generic Routing Encapsulation (GRE)", RFC 7676,
              DOI 10.17487/RFC7676, October 2015.

   [RFC7739]  Gont, F., "Security Implications of Predictable Fragment
              Identification Values", RFC 7739, DOI 10.17487/RFC7739,
              February 2016.

   [RFC7872]  Gont, F., Linkova, J., Chown, T., and W. Liu,
              "Observations on the Dropping of Packets with IPv6
              Extension Headers in the Real World", RFC 7872,
              DOI 10.17487/RFC7872, June 2016.

   [RFC8086]  Yong, L., Ed., Crabbe, E., Xu, X., and T. Herbert,
              "GRE-in-UDP Encapsulation", RFC 8086,
              DOI 10.17487/RFC8086, March 2017.

Appendix A. Contributors' Address

Authors' Addresses

   Ron Bonica
   Juniper Networks
   2251 Corporate Park Drive
   Herndon, Virginia 20171
   USA

   Email: rbonica@juniper.net

   Fred Baker
   Unaffiliated
   Santa Barbara, California 93117
   USA

   Email: FredBaker.IETF@gmail.com

   Geoff Huston
   APNIC
   6 Cordelia St
   Brisbane, 4101 QLD
   Australia

   Email: gih@apnic.net

   Robert M. Hinden
   Check Point Software
   959 Skyway Road
   San Carlos, California 94070
   USA

   Email: bob.hinden@gmail.com

   Ole Troan
   Cisco
   Philip Pedersens vei 1
   N-1366 Lysaker
   Norway

   Email: ot@cisco.com

   Fernando Gont
   SI6 Networks
   Evaristo Carriego 2644
   Haedo, Provincia de Buenos Aires
   Argentina

   Email: fgont@si6networks.com