Internet Area WG                                               R. Bonica
Internet-Draft                                          Juniper Networks
Intended status: Best Current Practice                          F. Baker
Expires: December 12, 2018                                  Unaffiliated
                                                               G. Huston
                                                                   APNIC
                                                               R. Hinden
                                                    Check Point Software
                                                                O. Troan
                                                                   Cisco
                                                                 F. Gont
                                                            SI6 Networks
                                                           June 10, 2018


                  IP Fragmentation Considered Fragile
                  draft-bonica-intarea-frag-fragile-02

Abstract

   This document provides an overview of IP fragmentation. It explains
   how IP fragmentation works and why it is required. As part of that
   explanation, this document also explains how IP fragmentation reduces
   the reliability of Internet communication.

   This document also proposes alternatives to IP fragmentation.
   Finally, it provides recommendations for application developers and
   network operators.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 12, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  IP Fragmentation
     2.1.  Links, Paths, MTU and PMTU
     2.2.  Upper-layer Protocols
   3.  Requirements Language
   4.  IP Fragmentation Reduces Reliability
     4.1.  Middle Box Failures
     4.2.  Partial Filtering
     4.3.  Suboptimal Load Balancing
     4.4.  Security Vulnerabilities
     4.5.  Blackholing Due to ICMP Loss
       4.5.1.  Transient Loss
       4.5.2.  Incorrect Implementation of Security Policy
       4.5.3.  Persistent Loss Caused By Anycast
     4.6.  Blackholing Due To Filtering
   5.  Alternatives to IP Fragmentation
     5.1.  Transport Layer Solutions
     5.2.  Application Layer Solutions
   6.  Applications That Rely on IPv6 Fragmentation
     6.1.  DNS
     6.2.  OSPFv3
     6.3.  Packet-in-Packet Encapsulations
   7.  Recommendations
     7.1.  For Application Developers
     7.2.  For Network Operators
   8.  IANA Considerations
   9.  Security Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Contributors' Address
   Authors' Addresses

1.  Introduction

   Operational experience [RFC7872] [Huston] reveals that IP
   fragmentation reduces the reliability of Internet communication.
   This document provides an overview of IP fragmentation. It explains
   how IP fragmentation works and why it is required. As part of that
   explanation, this document also explains how IP fragmentation reduces
   the reliability of Internet communication.

   This document also proposes alternatives to IP fragmentation.
   Finally, it provides recommendations for application developers and
   network operators.

2.  IP Fragmentation

2.1.  Links, Paths, MTU and PMTU

   An Internet path connects a source node to a destination node. A
   path can contain links and intermediate systems. If a path contains
   more than one link, the links are connected in series and an
   intermediate system connects each link to the next. An intermediate
   system can be a router or a middle box.

   Internet paths are dynamic. Assume that the path from one node to
   another contains a set of links and intermediate systems. If the
   network topology changes, that path can also change so that it
   includes a different set of links and intermediate systems.

   Each link is constrained by the number of bytes that it can convey in
   a single IP packet. This constraint is called the link Maximum
   Transmission Unit (MTU). IPv4 [RFC0791] requires every link to have
   an MTU of 68 bytes or greater. IPv6 [RFC8200] requires every link to
   have an MTU of 1280 bytes or greater. These are called the IPv4 and
   IPv6 minimum link MTUs.

   Each Internet path is constrained by the number of bytes that it can
   convey in a single IP packet.
   This constraint is called the Path MTU (PMTU). For any given path,
   the PMTU is equal to the smallest of its link MTUs. Because Internet
   paths are dynamic, PMTU is also dynamic.

   For reasons described below, source nodes estimate the PMTU between
   themselves and destination nodes. A source node can produce
   extremely conservative PMTU estimates in which:

   o  The estimate for each IPv4 path is equal to the IPv4 minimum link
      MTU.

   o  The estimate for each IPv6 path is equal to the IPv6 minimum link
      MTU.

   While these conservative estimates are guaranteed to be less than or
   equal to the actual PMTU, they are likely to be much less than the
   actual PMTU. This may adversely affect upper-layer protocol
   performance.

   By executing Path MTU Discovery (PMTUD) [RFC1191] [RFC8201]
   procedures, a source node can maintain a less conservative, running
   estimate of the PMTU between itself and a destination node.
   According to these procedures, the source node produces an initial
   PMTU estimate. This initial estimate is equal to the MTU of the
   first link along the path to the destination node. It can be greater
   than the actual PMTU.

   Having produced an initial PMTU estimate, the source node sends
   non-fragmentable IP packets to the destination node. If one of these
   packets is larger than the actual PMTU, a downstream router will not
   be able to forward the packet through the next link along the path.
   Therefore, the downstream router drops the packet and sends an
   Internet Control Message Protocol (ICMP) [RFC0792] [RFC4443] Packet
   Too Big (PTB) message to the source node. The ICMP PTB message
   indicates the MTU of the link through which the packet could not be
   forwarded. The source node uses this information to refine its PMTU
   estimate.

   PMTUD produces a running estimate of the PMTU between a source node
   and a destination node.
   Because PMTU is dynamic, at any given time, the PMTU estimate can
   differ from the actual PMTU. In order to detect PMTU increases,
   PMTUD occasionally resets the PMTU estimate to the MTU of the first
   link along the path to the destination node. It then repeats the
   procedure described above.

   PMTUD has the following characteristics:

   o  It relies on the network's ability to deliver ICMP PTB messages to
      the source node.

   o  It is susceptible to attack because ICMP messages are easily
      forged [RFC5927].

   FOOTNOTE: According to RFC 0791, every IPv4 host must be capable of
   receiving a packet whose length is equal to 576 bytes. However, the
   IPv4 minimum link MTU is not 576. Section 3.2 of RFC 0791 explicitly
   states that the IPv4 minimum link MTU is 68 bytes.

   FOOTNOTE: In the paragraphs above, the term "non-fragmentable packet"
   is introduced. A non-fragmentable packet can be fragmented at its
   source. However, it cannot be fragmented by a downstream node. An
   IPv4 packet whose DF-bit is set to zero is fragmentable. An IPv4
   packet whose DF-bit is set to one is non-fragmentable. All IPv6
   packets are also non-fragmentable.

   FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is
   introduced. The ICMP PTB message has two instantiations. In ICMPv4
   [RFC0792], the ICMP PTB message is a Destination Unreachable message
   with Code equal to (4) "fragmentation needed and DF set". This
   message was augmented by [RFC1191] to indicate the MTU of the link
   through which the packet could not be forwarded. In ICMPv6
   [RFC4443], the ICMP PTB message is a Packet Too Big Message with Code
   equal to (0). This message also indicates the MTU of the link
   through which the packet could not be forwarded.

2.2.  Upper-layer Protocols

   When an upper-layer protocol submits data to the underlying IP
   module, and the resulting IP packet's length is greater than the
   PMTU, IP fragmentation may be required. IP fragmentation divides a
   packet into fragments. Each fragment includes an IP header and a
   portion of the original packet.

   [RFC0791] describes IPv4 fragmentation procedures. IPv4 packets
   whose DF-bit is set to one cannot be fragmented. IPv4 packets whose
   DF-bit is set to zero can be fragmented at the source node or by any
   downstream router. [RFC8200] describes IPv6 fragmentation
   procedures. IPv6 packets can be fragmented at the source node only.

   IPv4 fragmentation differs slightly from IPv6 fragmentation.
   However, in both IP versions, the upper-layer header appears in the
   first fragment only. It does not appear in subsequent fragments.

   Upper-layer protocols can operate in the following modes:

   o  Do not rely on IP fragmentation.

   o  Rely on IP source fragmentation only (i.e., fragmentation at the
      source node).

   o  Rely on IP source fragmentation and downstream fragmentation
      (i.e., fragmentation at any node along the path).

   Upper-layer protocols running over IPv4 can operate in all of the
   above-mentioned modes. Upper-layer protocols running over IPv6 can
   operate in the first and second modes only.

   Upper-layer protocols that operate in the first two modes (above)
   require access to the PMTU estimate. In order to fulfil this
   requirement, they can:

   o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
      MTU.

   o  Access the estimate that PMTUD produced.

   o  Execute PMTUD procedures themselves.

   o  Execute Packetization Layer PMTUD (PLPMTUD) [RFC4821]
      [I-D.fairhurst-tsvwg-datagram-plpmtud] procedures.

   According to PLPMTUD procedures, the upper-layer protocol maintains a
   running PMTU estimate.
   It does so by sending probe packets of various sizes to its peer and
   receiving acknowledgements. This strategy differs from PMTUD in that
   it relies on acknowledgement of received messages, as opposed to ICMP
   PTB messages concerning dropped messages. Therefore, PLPMTUD does
   not rely on the network's ability to deliver ICMP PTB messages to the
   source.

   An upper-layer protocol that does not rely on IP fragmentation never
   causes the underlying IP module to emit:

   o  A fragmentable IP packet (i.e., an IPv4 packet with the DF-bit set
      to zero).

   o  An IP fragment.

   o  A packet whose length is greater than the PMTU estimate.

   However, when the PMTU estimate is greater than the actual PMTU, the
   upper-layer protocol can cause the underlying IP module to emit a
   packet whose length is greater than the actual PMTU. When this
   occurs, a downstream router drops the packet and the source node
   refines its PMTU estimate, employing either PMTUD or PLPMTUD
   procedures.

   When an upper-layer protocol that relies only on IP source
   fragmentation submits data to the underlying IP module, and the
   resulting packet is larger than the PMTU estimate, the underlying IP
   module fragments the packet and emits the fragments. However, the
   upper-layer protocol never causes the underlying IP module to emit:

   o  A fragmentable IP packet.

   o  A packet whose length is greater than the PMTU estimate.

   When the PMTU estimate is greater than the actual PMTU, the
   upper-layer protocol can cause the underlying IP module to emit a
   packet whose length is greater than the actual PMTU. When this
   occurs, a downstream router drops the packet and the source node
   refines its PMTU estimate, employing either PMTUD or PLPMTUD
   procedures.

   An upper-layer protocol that relies on IP source fragmentation and
   downstream fragmentation can cause the underlying IP module to emit:

   o  A fragmentable IP packet.

   o  An IP fragment.

   o  A packet whose length is greater than the PMTU estimate.

   A protocol that relies on IP source fragmentation and downstream
   fragmentation does not require access to the PMTU estimate. For
   these protocols, the underlying IP module:

   o  Fragments all packets whose length exceeds the MTU of the first
      link along the path to the destination.

   o  Sets the DF-bit to zero, so that downstream nodes can fragment the
      packet.

3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

4.  IP Fragmentation Reduces Reliability

   This section explains how IP fragmentation reduces the reliability of
   Internet communication.

4.1.  Middle Box Failures

   Many middle boxes require access to the transport-layer header.
   However, when a packet is divided into fragments, the transport-layer
   header appears in the first fragment only. It does not appear in
   subsequent fragments. This omission can prevent middle boxes from
   delivering their intended services.

   For example, assume that a router diverts selected packets from their
   normal path towards network appliances that support deep packet
   inspection and lawful intercept. The router selects packets for
   diversion based upon the following 5-tuple:

   o  IP Source Address.

   o  IP Destination Address.

   o  IPv4 Protocol or IPv6 Next Header.

   o  Transport-layer source port.

   o  Transport-layer destination port.

   IP fragmentation causes this selection algorithm to behave
   suboptimally, because the transport-layer header appears only in the
   first fragment of each packet.
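
   The effect on such a selection algorithm can be illustrated with a
   minimal Python sketch. It assumes a raw IPv4 packet and performs no
   checksum or length validation. The names selection_key and
   build_ipv4 are hypothetical helpers introduced here for illustration
   only; they do not come from this document or from any router
   implementation. For a non-first fragment, only the addresses and the
   protocol can be recovered, so the port fields are unavailable.

```python
import socket
import struct
from typing import Optional, Tuple

TCP, UDP = 6, 17  # IANA protocol numbers

Key = Tuple[str, str, int, Optional[int], Optional[int]]


def selection_key(packet: bytes) -> Key:
    """Return (src, dst, protocol, sport, dport) for a raw IPv4 packet.

    The ports are None when this sketch cannot read a transport-layer
    header: a non-first fragment, or a protocol other than TCP or UDP.
    """
    header_len = (packet[0] & 0x0F) * 4              # IHL field, in bytes
    (flags_and_offset,) = struct.unpack_from("!H", packet, 6)
    fragment_offset = flags_and_offset & 0x1FFF      # in 8-byte units
    protocol = packet[9]
    src = socket.inet_ntoa(packet[12:16])
    dst = socket.inet_ntoa(packet[16:20])

    if fragment_offset != 0 or protocol not in (TCP, UDP):
        return (src, dst, protocol, None, None)

    # First fragment or unfragmented packet: the ports lead the header.
    sport, dport = struct.unpack_from("!HH", packet, header_len)
    return (src, dst, protocol, sport, dport)


def build_ipv4(src, dst, protocol, fragment_offset, more_fragments, payload):
    """Hypothetical helper: a 20-byte IPv4 header followed by payload.

    The header checksum is left at zero because selection_key ignores it.
    """
    flags_and_offset = (0x2000 if more_fragments else 0) | fragment_offset
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload), 1, flags_and_offset, 64, protocol, 0,
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    return header + payload


# One UDP datagram carried in two fragments (documentation addresses).
udp_header = struct.pack("!HHHH", 5000, 53, 24, 0)
first = build_ipv4("192.0.2.1", "198.51.100.7", UDP, 0, True,
                   udp_header + b"A" * 8)
second = build_ipv4("192.0.2.1", "198.51.100.7", UDP, 2, False, b"B" * 8)
```

   In this sketch, selection_key(first) yields the full 5-tuple, while
   selection_key(second) yields the same addresses and protocol with
   both ports set to None, so a rule that matches on ports cannot be
   applied to the second fragment.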

   In another example, a middle box re-marks a packet's Differentiated
   Services Code Point [RFC2474] based upon the above-mentioned 5-tuple.
   IP fragmentation causes this process to behave suboptimally, because
   the transport-layer header appears only in the first fragment of each
   packet.

   In both of the above-mentioned examples, the middle box cannot
   deliver its intended service without reassembling fragmented packets.

4.2.  Partial Filtering

   IP fragments cause problems for firewalls whose filter rules include
   decision making based on TCP and UDP ports. Because the port
   information is not present in trailing fragments, the firewall has
   three options, each with a cost. It may accept all trailing
   fragments, which may admit certain classes of attack. It may block
   all trailing fragments, which may block otherwise legitimate traffic.
   Or it may reassemble all fragmented packets, which may be inefficient
   and negatively affect performance.

4.3.  Suboptimal Load Balancing

   Many stateless load-balancers require access to the transport-layer
   header. Assume that a load-balancer distributes flows among parallel
   links. In order to optimize load balancing, the load-balancer sends
   every packet or packet fragment belonging to a flow through the same
   link.

   In order to assign a packet or packet fragment to a link, the
   load-balancer executes an algorithm. If the packet or packet
   fragment contains a transport-layer header, the load balancing
   algorithm accepts the following 5-tuple as input:

   o  IP Source Address.

   o  IP Destination Address.

   o  IPv4 Protocol or IPv6 Next Header.

   o  Transport-layer source port.

   o  Transport-layer destination port.

   However, if the packet or packet fragment does not contain a
   transport-layer header, the load balancing algorithm accepts only the
   following 3-tuple as input:

   o  IP Source Address.

   o  IP Destination Address.

   o  IPv4 Protocol or IPv6 Next Header.
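
   The two input modes described above can be sketched as follows. This
   is an illustration under stated assumptions, not a description of any
   real load-balancer: the function name choose_link is hypothetical,
   and the use of SHA-256 as the hash is an arbitrary choice made only
   to keep the example deterministic.

```python
import hashlib
from typing import Optional


def choose_link(src: str, dst: str, protocol: int,
                sport: Optional[int], dport: Optional[int],
                link_count: int) -> int:
    """Map a packet or fragment to one of link_count parallel links.

    When the ports are known, the 5-tuple is hashed. When they are not
    (a fragment without a transport-layer header), only the 3-tuple of
    source, destination and protocol is hashed.
    """
    if sport is not None and dport is not None:
        fields = (src, dst, protocol, sport, dport)
    else:
        fields = (src, dst, protocol)
    key = "|".join(str(field) for field in fields).encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first four bytes of the digest to a link index.
    return int.from_bytes(digest[:4], "big") % link_count


# Unfragmented packets of one flow always hash on the same 5-tuple.
whole = choose_link("2001:db8::1", "2001:db8::2", 17, 5000, 53, 4)
# A fragment without ports hashes on the 3-tuple only.
fragment = choose_link("2001:db8::1", "2001:db8::2", 17, None, None, 4)
```

   Because the two calls hash different inputs, nothing guarantees that
   whole and fragment name the same link. Every port-less fragment
   exchanged between the same two hosts with the same protocol also
   collapses onto a single link, regardless of the flow it belongs to.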

   Therefore, non-fragmented packets belonging to a flow can be assigned
   to one link while fragmented packets belonging to the same flow can
   be divided between that link and another. This can cause suboptimal
   load balancing.

4.4.  Security Vulnerabilities

   Security researchers have documented several attacks that rely on IP
   fragmentation. The following are examples:

   o  Overlapping fragment attack [RFC1858] [RFC3128] [RFC5722]

   o  Resource exhaustion attacks (such as the Rose Attack)

   o  Attacks based on predictable fragment identification values
      [RFC7739]

   o  Attacks based on bugs in the implementation of the fragment
      reassembly algorithm

   o  Evasion of Network Intrusion Detection Systems (NIDS) [Ptacek1998]

   In the overlapping fragment attack, an attacker constructs a series
   of packet fragments. The first fragment contains an IP header, a
   transport-layer header, and some transport-layer payload. This
   fragment complies with local security policy and is allowed to pass
   through a stateless firewall. A second fragment, having a non-zero
   offset, overlaps with the first fragment. The second fragment also
   passes through the stateless firewall. When the packet is
   reassembled, the transport-layer header from the first fragment is
   overwritten by data from the second fragment. The reassembled packet
   does not comply with local security policy. Had it traversed the
   firewall in one piece, the firewall would have rejected it.

   A stateless firewall cannot protect against the overlapping fragment
   attack. However, destination nodes can protect against the
   overlapping fragment attack by implementing the reassembly procedures
   described in RFC 1858, RFC 3128 and RFC 8200. These reassembly
   procedures detect the overlap and discard the packet.

   The fragment reassembly algorithm is a stateful procedure for an
   otherwise stateless protocol.
   As such, it can be exploited for resource exhaustion attacks. An
   attacker can construct a series of fragmented packets, with one
   fragment missing from each packet so that the reassembly process
   cannot complete. Thus, this attack causes resource exhaustion on the
   destination node, possibly denying reassembly services to other
   flows. This type of attack can be mitigated by flushing fragment
   reassembly buffers when necessary, at the expense of possibly
   dropping legitimate fragments.

   An IP fragment contains an "Identification" field that, together with
   the IP Source Address and Destination Address of a packet, identifies
   fragments that correspond to the same original datagram, so that they
   can be reassembled together by the receiving host. Many
   implementations have employed predictable values for the
   Identification field, thus making it easy for an attacker to forge
   malicious IP fragments that would cause the reassembly procedure for
   legitimate packets to fail.

   Over the years, multiple IPv4 and IPv6 implementations have been
   found to have flaws in their implementation of the IP fragment
   reassembly algorithm, typically resulting in buffer overflows. These
   buffer overflows have been exploitable for denial of service and
   remote code execution attacks.

   A NIDS aims to identify malicious activity by analyzing network
   traffic. Ambiguity in the possible result of the fragment reassembly
   process may allow an attacker to evade these systems. Many of these
   systems try to mitigate some of these evasion techniques by, for
   example, computing all possible outcomes of the fragment reassembly
   process, at the expense of increased processing requirements.

4.5.  Blackholing Due to ICMP Loss

   As stated above, an upper-layer protocol requires access to the PMTU
   estimate if it:

   o  Does not rely on IP fragmentation.

   o  Relies on IP source fragmentation only (i.e., fragmentation at the
      source node).

   In order to satisfy this requirement, the upper-layer protocol can:

   o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link
      MTU.

   o  Access the estimate that PMTUD produced.

   o  Execute PMTUD procedures itself.

   o  Execute PLPMTUD procedures.

   PMTUD relies upon the network's ability to deliver ICMP PTB messages
   to the source node. Therefore, if an upper-layer protocol relies on
   PMTUD, it also relies on the network's ability to deliver ICMP PTB
   messages to the source node.

   According to [RFC4890], ICMP PTB messages must not be filtered.
   However, ICMP PTB delivery is not reliable. It is subject to both
   transient and persistent loss.

   Transient loss of ICMP PTB messages causes PMTUD to perform less
   efficiently, but does not cause it to fail completely. When the
   conditions contributing to transient loss abate, the network regains
   its ability to deliver ICMP PTB messages and PMTUD regains its
   ability to function. Section 4.5.1 of this document describes
   conditions that lead to transient loss of ICMP PTB messages.

   However, persistent loss of ICMP PTB messages causes PMTUD to fail
   completely. Section 4.5.2 and Section 4.5.3 of this document
   describe conditions that lead to persistent loss of ICMP PTB
   messages.

   The problem described in this section is specific to PMTUD. It does
   not occur when the upper-layer protocol obtains its PMTU estimate
   from PLPMTUD or any other source.

4.5.1.  Transient Loss

   The following factors can contribute to transient loss of ICMP PTB
   messages:

   o  Network congestion.

   o  Packet corruption.

   o  Transient routing loops.

   o  ICMP rate limiting.

   The effect of rate limiting may be severe, as RFC 4443 recommends
   strict rate limiting of ICMPv6 traffic.

4.5.2.  Incorrect Implementation of Security Policy

   Incorrect implementation of security policy can cause persistent loss
   of ICMP PTB messages.

   Assume that a Customer Premises Equipment (CPE) router implements the
   following zone-based security policy:

   o  Allow any traffic to flow from the inside zone to the outside
      zone.

   o  Do not allow any traffic to flow from the outside zone to the
      inside zone unless it is part of an existing flow (i.e., it was
      elicited by an outbound packet).

   When a correct implementation of the above-mentioned security policy
   receives an ICMP PTB message, it examines the ICMP PTB payload in
   order to determine whether the original packet (i.e., the packet that
   elicited the ICMP PTB message) belonged to an existing flow. If the
   original packet belonged to an existing flow, the implementation
   allows the ICMP PTB to flow from the outside zone to the inside zone.
   If not, the implementation discards the ICMP PTB message.

   When an incorrect implementation of the above-mentioned security
   policy receives an ICMP PTB message, it discards the packet because
   its source address is not associated with an existing flow.

   The security policy described above is implemented incorrectly on
   many consumer CPE routers.

4.5.3.  Persistent Loss Caused By Anycast

   Anycast can cause persistent loss of ICMP PTB messages. Consider the
   example below:

   A DNS client sends a request to an anycast address. The network
   routes that DNS request to the nearest instance of that anycast
   address (i.e., a DNS server). The DNS server generates a response
   and sends it back to the DNS client. While the response does not
   exceed the DNS server's PMTU estimate, it does exceed the actual
   PMTU.

   A downstream router drops the packet and sends an ICMP PTB message to
   the packet's source (i.e., the anycast address).
   The network routes the ICMP PTB message to the anycast instance
   closest to the downstream router. Sadly, that anycast instance may
   not be the DNS server that originated the DNS response. It may be
   another DNS server with the same anycast address. The DNS server
   that originated the response may never receive the ICMP PTB message
   and may never update its PMTU estimate.

4.6.  Blackholing Due To Filtering

   In RFC 7872, researchers sampled Internet paths to determine whether
   they would convey packets that contain IPv6 extension headers.
   Sampled paths terminated at popular Internet sites (e.g., popular
   web, mail and DNS servers).

   The study revealed that at least 28% of the sampled paths did not
   convey packets containing the IPv6 Fragment extension header. In
   most cases, fragments were dropped in the destination autonomous
   system. In other cases, the fragments were dropped in transit
   autonomous systems.

   Another recent study [Huston] confirmed this finding. It reported
   that 37% of sampled endpoints used IPv6-capable DNS resolvers that
   were incapable of receiving a fragmented IPv6 response.

   It is difficult to determine why network operators drop fragments.
   Possible causes follow:

   o  Hardware inability to process fragmented packets.

   o  Failure to change vendor defaults.

   o  Unintentional misconfiguration.

   o  Intentional configuration (e.g., network operators consciously
      choose to drop IPv6 fragments in order to address the issues
      raised in Section 4.1 through Section 4.5, above).

5.  Alternatives to IP Fragmentation

5.1.  Transport Layer Solutions

   The Transmission Control Protocol (TCP) [RFC0793] can be operated in
   a mode that does not require IP fragmentation.

   Applications submit a stream of data to TCP. TCP divides that stream
   of data into segments, with no segment exceeding the TCP Maximum
   Segment Size (MSS).
   Each segment is encapsulated in a TCP header and submitted to the
   underlying IP module. The underlying IP module prepends an IP header
   and forwards the resulting packet.

   If the TCP MSS is sufficiently small, the underlying IP module never
   produces a packet whose length is greater than the actual PMTU.
   Therefore, IP fragmentation is not required.

   TCP offers the following mechanisms for MSS management:

   o  Manual configuration

   o  PMTUD

   o  PLPMTUD

   For IPv6 nodes, manual configuration is always applicable. If the
   MSS is manually configured to 1220 bytes and the packet does not
   contain extension headers, the IP layer will never produce a packet
   whose length is greater than the IPv6 minimum link MTU (1280 bytes).
   However, manual configuration prevents TCP from taking advantage of
   larger link MTUs.

   RFC 8200 strongly recommends that IPv6 nodes implement PMTUD, in
   order to discover and take advantage of path MTUs greater than 1280
   bytes. However, as mentioned in Section 2.1, PMTUD relies upon the
   network's ability to deliver ICMP PTB messages. Therefore, PMTUD is
   applicable only in environments where the risk of ICMP PTB loss is
   acceptable.

   By contrast, PLPMTUD does not rely upon the network's ability to
   deliver ICMP PTB messages. However, in many loss-based TCP
   congestion control algorithms, the dropping of a packet may cause the
   algorithm to reduce the congestion window, or even to restart the
   entire slow-start process. For high capacity, long round-trip time,
   large volume TCP streams, the deliberate probing with large packets
   and the consequent packet drop may impose too harsh a penalty on
   total TCP throughput for it to be a viable approach. [RFC4821]
   defines PLPMTUD procedures for TCP.
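
   The arithmetic behind the 1220-byte MSS, and the probing idea behind
   PLPMTUD, can be sketched in Python. This is a simplified
   illustration, not the procedure defined in [RFC4821]: the binary
   search strategy, the function names search_pmtu and tcp_mss, and the
   simulated path are all assumptions made for this example. A lost
   probe is treated as "too large", which ignores loss from other
   causes.

```python
from typing import Callable

IPV6_MIN_LINK_MTU = 1280  # bytes, per RFC 8200 as cited in the text
IPV6_HEADER_LEN = 40      # fixed IPv6 header, no extension headers
TCP_HEADER_LEN = 20       # TCP header without options


def tcp_mss(pmtu: int) -> int:
    """Largest TCP segment that fits in one IPv6 packet of pmtu bytes."""
    return pmtu - IPV6_HEADER_LEN - TCP_HEADER_LEN


def search_pmtu(probe_acked: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest packet size in [low, high] that was acknowledged.

    low must be a size already known to work (for example, the minimum
    link MTU) and high is the MTU of the first link. probe_acked(size)
    sends one probe of that size and reports whether the peer
    acknowledged it.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the range always shrinks
        if probe_acked(mid):
            low = mid                # the path carried mid bytes
        else:
            high = mid - 1           # assume the probe was too large
    return low


# Simulated path whose actual PMTU is 1400 bytes (an assumed value).
estimate = search_pmtu(lambda size: size <= 1400, IPV6_MIN_LINK_MTU, 1500)
```

   With the simulated path above, the search converges on 1400 bytes,
   and tcp_mss(1280) evaluates to 1220, matching the manually configured
   value discussed in this section. Each failed probe in the search is
   a deliberately dropped packet, which is the cost that the preceding
   paragraph describes.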

   While TCP will never cause the underlying IP module to emit a packet
   that is larger than the PMTU estimate, it can cause the underlying IP
   module to emit a packet that is larger than the actual PMTU. If this
   occurs, the packet is dropped, the PMTU estimate is updated, the
   segment is divided into smaller segments and each smaller segment is
   submitted to the underlying IP module.

   The Datagram Congestion Control Protocol (DCCP) [RFC4340] and the
   Stream Control Transmission Protocol (SCTP) [RFC4960] also can be
   operated in a mode that does not require IP fragmentation. They both
   accept data from an application and divide that data into segments,
   with no segment exceeding a maximum size. Both DCCP and SCTP offer
   manual configuration, PMTUD and PLPMTUD as mechanisms for managing
   that maximum size. [I-D.fairhurst-tsvwg-datagram-plpmtud] proposes
   PLPMTUD procedures for DCCP and SCTP.

   Currently, the User Datagram Protocol (UDP) [RFC0768] lacks a
   fragmentation mechanism of its own and relies on IP fragmentation.
   However, [I-D.ietf-tsvwg-udp-options] proposes a fragmentation
   mechanism for UDP.

5.2.  Application Layer Solutions

   [RFC8085] recognizes that IP fragmentation reduces the reliability of
   Internet communication. It also recognizes that UDP lacks a
   fragmentation mechanism of its own and relies on IP fragmentation.
   Therefore, [RFC8085] offers the following advice regarding
   applications that run over UDP:

   "An application SHOULD NOT send UDP datagrams that result in IP
   packets that exceed the Maximum Transmission Unit (MTU) along the
   path to the destination. Consequently, an application SHOULD either
   use the path MTU information provided by the IP layer or implement
   Path MTU Discovery (PMTUD) itself to determine whether the path to a
   destination will support its desired message size without
   fragmentation."

RFC 8085 continues:

"Applications that do not follow the recommendation to do
PMTU/PLPMTUD discovery SHOULD still avoid sending UDP datagrams that
would result in IP packets that exceed the path MTU.  Because the
actual path MTU is unknown, such applications SHOULD fall back to
sending messages that are shorter than the default effective MTU for
sending (EMTU_S in [RFC1122]).  For IPv4, EMTU_S is the smaller of
576 bytes and the first-hop MTU.  For IPv6, EMTU_S is 1280 bytes.
The effective PMTU for a directly connected destination (with no
routers on the path) is the configured interface MTU, which could be
less than the maximum link payload size.  Transmission of
minimum-sized UDP datagrams is inefficient over paths that support a
larger PMTU, which is a second reason to implement PMTU discovery."

RFC 8085 assumes that for IPv4, an EMTU_S of 576 is sufficiently
small, even though the IPv4 minimum link MTU is 68 bytes.

This advice applies equally to applications that run directly over
IP.

6.  Applications That Rely on IPv6 Fragmentation

The following applications rely on IPv6 fragmentation:

o  DNS [RFC1035]

o  OSPFv3 [RFC5340]

o  Packet-in-packet encapsulations

Each of these applications relies on IPv6 fragmentation to a varying
degree.  In some cases, that reliance is essential, and cannot be
broken without fundamentally changing the protocol.  In other cases,
that reliance is incidental, and most implementations already take
appropriate steps to avoid fragmentation.

This list is not comprehensive, and other protocols that rely on IPv6
fragmentation may exist.  They are not specifically considered in the
context of this document.

6.1.  DNS

DNS relies on UDP for efficiency.  Consequently, large responses
depend on IP fragmentation, as permitted by the EDNS(0) options in
the query.
Fragmentation-based packet loss can be mitigated by having queries
use smaller EDNS(0) UDP buffer sizes.  However, responses that do not
fit in the smaller buffer must then be retrieved over TCP, and the
partial level of support for DNS over TCP over IPv6 limits the
efficacy of this approach in an IPv6 context [Damas].

Larger DNS responses can normally be avoided by aggressively pruning
the Additional section of DNS responses.  One scenario where such
pruning is ineffective is the use of DNSSEC, where large key sizes
increase the size of responses to certain DNS queries.  There is no
effective response to this situation within the DNS other than using
smaller cryptographic keys and adopting DNSSEC administrative
practices that keep DNS responses as short as possible.

6.2.  OSPFv3

OSPFv3 implementations can emit messages large enough to cause IPv6
fragmentation.  However, in keeping with the recommendations of RFC
8200, and in order to optimize performance, most OSPFv3
implementations restrict their maximum message size to the IPv6
minimum link MTU.

6.3.  Packet-in-Packet Encapsulations

In this document, packet-in-packet encapsulations include IP-in-IP
[RFC2003], Generic Routing Encapsulation (GRE) [RFC2784], GRE-in-UDP
[RFC8086] and Generic Packet Tunneling in IPv6 [RFC2473].  [RFC4459]
describes fragmentation issues associated with all of the
above-mentioned encapsulations.

The fragmentation strategy described for GRE in [RFC7588] has been
deployed for all of the above-mentioned encapsulations.  This
strategy does not rely on IPv6 fragmentation except in one corner
case (see Section 3.3.2.2 of RFC 7588 and Section 7.1 of RFC 2473).
Section 3.3 of [RFC7676] further describes this corner case.

7.  Recommendations

7.1.  For Application Developers

Application developers SHOULD NOT develop applications that rely on
IPv6 fragmentation.
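
One way for a UDP-based application to follow this recommendation
without performing PMTU discovery is to cap its payload so that the
resulting IP packet fits within the default EMTU_S values quoted from
RFC 8085 in Section 5.2.  The following Python sketch illustrates
that calculation.  The function names are hypothetical, and the
sketch assumes an IPv4 header without options and an IPv6 header
without extension headers.

```python
# Illustrative sketch; function names are hypothetical, not from RFC 8085.
IPV4_HEADER_LEN = 20  # bytes, IPv4 header without options
IPV6_HEADER_LEN = 40  # bytes, fixed IPv6 header, no extension headers
UDP_HEADER_LEN = 8    # bytes


def default_emtu_s(family: str, first_hop_mtu: int) -> int:
    """Default EMTU_S values as quoted from RFC 8085 in Section 5.2."""
    if family == "ipv4":
        return min(576, first_hop_mtu)
    if family == "ipv6":
        return 1280
    raise ValueError(f"unknown address family: {family!r}")


def max_safe_udp_payload(family: str, first_hop_mtu: int) -> int:
    """Largest UDP payload whose IP packet fits within the default EMTU_S."""
    ip_header = IPV4_HEADER_LEN if family == "ipv4" else IPV6_HEADER_LEN
    payload = default_emtu_s(family, first_hop_mtu) - ip_header - UDP_HEADER_LEN
    if payload <= 0:
        raise ValueError("first-hop MTU too small for IP and UDP headers")
    return payload


print(max_safe_udp_payload("ipv6", 1500))  # 1232
print(max_safe_udp_payload("ipv4", 1500))  # 548
```

An application that needs to send larger messages would segment them
at the application layer, or implement PMTUD or PLPMTUD, rather than
rely on IPv6 fragmentation.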

Application-layer protocols that depend upon IPv6 fragmentation
SHOULD be updated to break that dependency.

7.2.  For Network Operators

As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
messages unless they are known to be forged or otherwise
illegitimate.  As stated in Section 4.5, filtering ICMPv6 PTB packets
causes PMTUD to fail.  Operators MUST ensure proper PMTUD operation
in their network, including making sure that the network generates
PTB packets when it drops packets that exceed the outgoing interface
MTU.

Many upper-layer protocols rely on PMTUD.

8.  IANA Considerations

This document makes no request of IANA.

9.  Security Considerations

This document mitigates some of the security considerations
associated with IP fragmentation by discouraging the use of IP
fragmentation.  It does not introduce any new security
vulnerabilities, because it does not introduce any new alternatives
to IP fragmentation.  Instead, it recommends well-understood
alternatives.

10.  Acknowledgements

Thanks to Mikael Abrahamsson, Mike Heard, Tom Herbert, Tatuya Jinmei,
Eric Nygren, and Joe Touch for their comments.

11.  References

11.1.  Normative References

[RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
           DOI 10.17487/RFC0768, August 1980.

[RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
           DOI 10.17487/RFC0791, September 1981.

[RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
           RFC 792, DOI 10.17487/RFC0792, September 1981.

[RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
           RFC 793, DOI 10.17487/RFC0793, September 1981.

[RFC1035]  Mockapetris, P., "Domain names - implementation and
           specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
           November 1987.

[RFC1191]  Mogul, J. and S.
           Deering, "Path MTU discovery", RFC 1191,
           DOI 10.17487/RFC1191, November 1990.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997.

[RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet
           Control Message Protocol (ICMPv6) for the Internet
           Protocol Version 6 (IPv6) Specification", STD 89,
           RFC 4443, DOI 10.17487/RFC4443, March 2006.

[RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
           Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007.

[RFC8085]  Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
           Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
           March 2017.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017.

[RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
           (IPv6) Specification", STD 86, RFC 8200,
           DOI 10.17487/RFC8200, July 2017.

[RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
           "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
           DOI 10.17487/RFC8201, July 2017.

11.2.  Informative References

[Damas]    Damas, J. and G. Huston, "Measuring ATR", April 2018.

[Huston]   Huston, G., "IPv6, Large UDP Packets and the DNS
           (http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html)",
           August 2017.

[I-D.fairhurst-tsvwg-datagram-plpmtud]
           Fairhurst, G., Jones, T., Tuexen, M., and I. Ruengeler,
           "Packetization Layer Path MTU Discovery for Datagram
           Transports", draft-fairhurst-tsvwg-datagram-plpmtud-02
           (work in progress), December 2017.

[I-D.ietf-tsvwg-udp-options]
           Touch, J., "Transport Options for UDP",
           draft-ietf-tsvwg-udp-options-02 (work in progress),
           January 2018.

[Ptacek1998]
           Ptacek, T. and T.
           Newsham, "Insertion, Evasion and Denial of Service:
           Eluding Network Intrusion Detection", 1998.

[RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
           Communication Layers", STD 3, RFC 1122,
           DOI 10.17487/RFC1122, October 1989.

[RFC1858]  Ziemba, G., Reed, D., and P. Traina, "Security
           Considerations for IP Fragment Filtering", RFC 1858,
           DOI 10.17487/RFC1858, October 1995.

[RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
           DOI 10.17487/RFC2003, October 1996.

[RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
           IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
           December 1998.

[RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black,
           "Definition of the Differentiated Services Field (DS
           Field) in the IPv4 and IPv6 Headers", RFC 2474,
           DOI 10.17487/RFC2474, December 1998.

[RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
           Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
           DOI 10.17487/RFC2784, March 2000.

[RFC3128]  Miller, I., "Protection Against a Variant of the Tiny
           Fragment Attack (RFC 1858)", RFC 3128,
           DOI 10.17487/RFC3128, June 2001.

[RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
           Congestion Control Protocol (DCCP)", RFC 4340,
           DOI 10.17487/RFC4340, March 2006.

[RFC4459]  Savola, P., "MTU and Fragmentation Issues with
           In-the-Network Tunneling", RFC 4459,
           DOI 10.17487/RFC4459, April 2006.

[RFC4890]  Davies, E. and J. Mohacsi, "Recommendations for Filtering
           ICMPv6 Messages in Firewalls", RFC 4890,
           DOI 10.17487/RFC4890, May 2007.

[RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
           RFC 4960, DOI 10.17487/RFC4960, September 2007.

[RFC5340]  Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF
           for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008.

[RFC5722]  Krishnan, S., "Handling of Overlapping IPv6 Fragments",
           RFC 5722, DOI 10.17487/RFC5722, December 2009.

[RFC5927]  Gont, F., "ICMP Attacks against TCP", RFC 5927,
           DOI 10.17487/RFC5927, July 2010.

[RFC7588]  Bonica, R., Pignataro, C., and J. Touch, "A Widely
           Deployed Solution to the Generic Routing Encapsulation
           (GRE) Fragmentation Problem", RFC 7588,
           DOI 10.17487/RFC7588, July 2015.

[RFC7676]  Pignataro, C., Bonica, R., and S. Krishnan, "IPv6 Support
           for Generic Routing Encapsulation (GRE)", RFC 7676,
           DOI 10.17487/RFC7676, October 2015.

[RFC7739]  Gont, F., "Security Implications of Predictable Fragment
           Identification Values", RFC 7739, DOI 10.17487/RFC7739,
           February 2016.

[RFC7872]  Gont, F., Linkova, J., Chown, T., and W. Liu,
           "Observations on the Dropping of Packets with IPv6
           Extension Headers in the Real World", RFC 7872,
           DOI 10.17487/RFC7872, June 2016.

[RFC8086]  Yong, L., Ed., Crabbe, E., Xu, X., and T. Herbert,
           "GRE-in-UDP Encapsulation", RFC 8086,
           DOI 10.17487/RFC8086, March 2017.

Appendix A.  Contributors' Address

Authors' Addresses

Ron Bonica
Juniper Networks
2251 Corporate Park Drive
Herndon, Virginia  20171
USA

Email: rbonica@juniper.net

Fred Baker
Unaffiliated
Santa Barbara, California  93117
USA

Email: FredBaker.IETF@gmail.com

Geoff Huston
APNIC
6 Cordelia St
Brisbane, 4101 QLD
Australia

Email: gih@apnic.net

Robert M. Hinden
Check Point Software
959 Skyway Road
San Carlos, California  94070
USA

Email: bob.hinden@gmail.com

Ole Troan
Cisco
Philip Pedersens vei 1
N-1366 Lysaker
Norway

Email: ot@cisco.com

Fernando Gont
SI6 Networks
Evaristo Carriego 2644
Haedo, Provincia de Buenos Aires
Argentina

Email: fgont@si6networks.com