Internet Area WG                                               R. Bonica
Internet-Draft                                          Juniper Networks
Intended status: Best Current Practice                          F. Baker
Expires: January 24, 2019                                   Unaffiliated
                                                               G. Huston
                                                                   APNIC
                                                               R. Hinden
                                                    Check Point Software
                                                                O. Troan
                                                                   Cisco
                                                                 F. Gont
                                                            SI6 Networks
                                                           July 23, 2018

                  IP Fragmentation Considered Fragile
                  draft-bonica-intarea-frag-fragile-03

Abstract

This document provides an overview of IP fragmentation. It also explains how IP fragmentation reduces the reliability of Internet communication.

Finally, this document proposes alternatives to IP fragmentation and provides recommendations for application developers and network operators.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 24, 2019.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
2. IP Fragmentation
  2.1. Links, Paths, MTU and PMTU
  2.2. Upper-layer Protocols
3. Requirements Language
4. IP Fragmentation Reduces Reliability
  4.1. Middle Box Failures
  4.2. Partial Filtering
  4.3. Telemetry and Monitoring Failures
  4.4. Suboptimal Load Balancing
  4.5. Security Vulnerabilities
  4.6. Blackholing Due to ICMP Loss
    4.6.1. Transient Loss
    4.6.2. Incorrect Implementation of Security Policy
    4.6.3. Persistent Loss Caused By Anycast
  4.7. Blackholing Due To Filtering
5. Alternatives to IP Fragmentation
  5.1. Transport Layer Solutions
  5.2. Application Layer Solutions
6. Applications That Rely on IPv6 Fragmentation
  6.1. DNS
  6.2. OSPFv3
  6.3. Packet-in-Packet Encapsulations
7. Recommendations
  7.1. For Application Developers
  7.2. For Network Operators
8. IANA Considerations
9. Security Considerations
10. Acknowledgements
11. References
  11.1. Normative References
  11.2. Informative References
Appendix A. Contributors' Address
Authors' Addresses

1. Introduction

Operational experience [RFC7872] [Huston] reveals that IP fragmentation reduces the reliability of Internet communication. This document provides an overview of IP fragmentation. It also explains how IP fragmentation reduces the reliability of Internet communication.

Finally, this document proposes alternatives to IP fragmentation and provides recommendations for application developers and network operators.

2. IP Fragmentation

2.1. Links, Paths, MTU and PMTU

An Internet path connects a source node to a destination node. A path can contain links and intermediate systems. If a path contains more than one link, the links are connected in series and an intermediate system connects each link to the next. An intermediate system can be a router or a middle box.

Internet paths are dynamic. Assume that the path from one node to another contains a set of links and intermediate systems. If the network topology changes, that path can also change so that it includes a different set of links and intermediate systems.

Each link is constrained by the number of bytes that it can convey in a single IP packet. This constraint is called the link Maximum Transmission Unit (MTU). IPv4 [RFC0791] requires every link to have an MTU of 68 bytes or greater. IPv6 [RFC8200] requires every link to have an MTU of 1280 bytes or greater. These are called the IPv4 and IPv6 minimum link MTUs.

Each Internet path is constrained by the number of bytes that it can convey in a single IP packet. This constraint is called the Path MTU (PMTU). For any given path, the PMTU is equal to the smallest of its link MTUs.
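
As a minimal illustration of the definition above, the following sketch computes a PMTU as the smallest link MTU along a path. The function name `path_mtu` and the example link MTU values are hypothetical and chosen only for illustration; they do not come from this document.

```python
IPV4_MIN_LINK_MTU = 68     # bytes, as required by RFC 791
IPV6_MIN_LINK_MTU = 1280   # bytes, as required by RFC 8200


def path_mtu(link_mtus):
    """Return the PMTU of a path, given the MTU of each link in series."""
    if not link_mtus:
        raise ValueError("a path must contain at least one link")
    return min(link_mtus)


# Hypothetical IPv6 path: Ethernet, a reduced-MTU tunnel, Ethernet.
example_path = [1500, 1400, 1500]
example_pmtu = path_mtu(example_path)
```

In this hypothetical path the middle link is the bottleneck, so it alone determines the PMTU.
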
Because Internet paths are dynamic, PMTU is also dynamic.

For reasons described below, source nodes estimate the PMTU between themselves and destination nodes. A source node can produce extremely conservative PMTU estimates in which:

o  The estimate for each IPv4 path is equal to the IPv4 minimum link MTU.

o  The estimate for each IPv6 path is equal to the IPv6 minimum link MTU.

While these conservative estimates are guaranteed to be less than or equal to the actual PMTU, they are likely to be much less than the actual PMTU. This may adversely affect upper-layer protocol performance.

By executing Path MTU Discovery (PMTUD) [RFC1191] [RFC8201] procedures, a source node can maintain a less conservative, running estimate of the PMTU between itself and a destination node. According to these procedures, the source node produces an initial PMTU estimate. This initial estimate is equal to the MTU of the first link along the path to the destination node. It can be greater than the actual PMTU.

Having produced an initial PMTU estimate, the source node sends non-fragmentable IP packets to the destination node. If one of these packets is larger than the actual PMTU, a downstream router will not be able to forward the packet through the next link along the path. Therefore, the downstream router drops the packet and sends an Internet Control Message Protocol (ICMP) [RFC0792] [RFC4443] Packet Too Big (PTB) message to the source node. The ICMP PTB message indicates the MTU of the link through which the packet could not be forwarded. The source node uses this information to refine its PMTU estimate.

PMTUD produces a running estimate of the PMTU between a source node and a destination node. Because PMTU is dynamic, at any given time, the PMTU estimate can differ from the actual PMTU.
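
The PMTUD procedure described above can be sketched as a small simulation. This is a simplified model under stated assumptions, not an implementation of [RFC1191] or [RFC8201]: the path is a fixed list of link MTUs, every ICMP PTB message is delivered, and the function name `pmtud` and its parameters are hypothetical.

```python
def pmtud(link_mtus, max_rounds=16):
    """Simulate PMTUD over a path given as a list of link MTUs.

    Returns the successive PMTU estimates held by the source node.
    Assumes (for illustration only) that every ICMP PTB is delivered.
    """
    estimate = link_mtus[0]        # initial estimate: MTU of the first link
    history = [estimate]
    for _ in range(max_rounds):
        # Send a non-fragmentable packet as large as the current estimate.
        ptb_mtu = None
        for mtu in link_mtus:
            if estimate > mtu:     # router cannot forward through this link
                ptb_mtu = mtu      # the PTB reports the MTU of that link
                break
        if ptb_mtu is None:        # the packet reached the destination
            break
        estimate = ptb_mtu         # refine the estimate from the PTB message
        history.append(estimate)
    return history


# Hypothetical path with two successively smaller bottlenecks.
estimates = pmtud([1500, 1400, 1500, 1280])
```

Each bottleneck costs one dropped packet and one ICMP PTB round trip, which is why the loss of those messages (discussed in Section 4.6) stalls the procedure.
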
In order to detect PMTU increases, PMTUD occasionally resets the PMTU estimate to the MTU of the first link along the path to the destination node. It then repeats the procedure described above.

PMTUD has the following characteristics:

o  It relies on the network's ability to deliver ICMP PTB messages to the source node.

o  It is susceptible to attack because ICMP messages are easily forged [RFC5927].

FOOTNOTE: According to RFC 0791, every IPv4 host must be capable of receiving a packet whose length is equal to 576 bytes. However, the IPv4 minimum link MTU is not 576. Section 3.2 of RFC 0791 explicitly states that the IPv4 minimum link MTU is 68 bytes.

FOOTNOTE: In the paragraphs above, the term "non-fragmentable packet" is introduced. A non-fragmentable packet can be fragmented at its source. However, it cannot be fragmented by a downstream node. An IPv4 packet whose DF-bit is set to zero is fragmentable. An IPv4 packet whose DF-bit is set to one is non-fragmentable. All IPv6 packets are also non-fragmentable.

FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is introduced. The ICMP PTB message has two instantiations. In ICMPv4 [RFC0792], the ICMP PTB message is a Destination Unreachable message with Code equal to 4 (fragmentation needed and DF set). This message was augmented by [RFC1191] to indicate the MTU of the link through which the packet could not be forwarded. In ICMPv6 [RFC4443], the ICMP PTB message is a Packet Too Big Message with Code equal to 0. This message also indicates the MTU of the link through which the packet could not be forwarded.

2.2. Upper-layer Protocols

When an upper-layer protocol submits data to the underlying IP module, and the resulting IP packet's length is greater than the PMTU, IP fragmentation may be required. IP fragmentation divides a packet into fragments.
Each fragment includes an IP header and a portion of the original packet.

[RFC0791] describes IPv4 fragmentation procedures. IPv4 packets whose DF-bit is set to one cannot be fragmented. IPv4 packets whose DF-bit is set to zero can be fragmented at the source node or by any downstream router. [RFC8200] describes IPv6 fragmentation procedures. IPv6 packets can be fragmented at the source node only.

IPv4 fragmentation differs slightly from IPv6 fragmentation. However, in both IP versions, the upper-layer header appears in the first fragment only. It does not appear in subsequent fragments.

Upper-layer protocols can operate in the following modes:

o  Do not rely on IP fragmentation.

o  Rely on IP source fragmentation only (i.e., fragmentation at the source node).

o  Rely on IP source fragmentation and downstream fragmentation (i.e., fragmentation at any node along the path).

Upper-layer protocols running over IPv4 can operate in all of the above-mentioned modes. Upper-layer protocols running over IPv6 can operate in the first and second modes only.

Upper-layer protocols that operate in the first two modes (above) require access to the PMTU estimate. In order to fulfil this requirement, they can:

o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link MTU.

o  Access the estimate that PMTUD produced.

o  Execute PMTUD procedures themselves.

o  Execute Packetization Layer PMTUD (PLPMTUD) [RFC4821] [I-D.fairhurst-tsvwg-datagram-plpmtud] procedures.

According to PLPMTUD procedures, the upper-layer protocol maintains a running PMTU estimate. It does so by sending probe packets of various sizes to its peer and receiving acknowledgements. This strategy differs from PMTUD in that it relies on acknowledgement of received messages, as opposed to ICMP PTB messages concerning dropped messages.
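
The probing strategy described above can be sketched as a search over probe sizes. This is a simplified model, not the procedure specified in [RFC4821]: the binary search, the function name `plpmtud_search`, and the `probe_acked` callback are assumptions made for illustration, and the model treats every unacknowledged probe as too large, although a real probe can also be lost to congestion.

```python
def plpmtud_search(probe_acked, low, high):
    """Find the largest probe size in [low, high] that is acknowledged.

    probe_acked(size) models sending a probe packet of `size` bytes and
    returns True if the peer acknowledges it. `low` is assumed to be a
    size already known to work (e.g., the minimum link MTU).
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe_acked(mid):
            low = mid          # acknowledged: the PMTU is at least mid
        else:
            high = mid - 1     # no acknowledgement: treat mid as too large
    return low


# Hypothetical path whose actual PMTU (unknown to the sender) is 1400.
actual_pmtu = 1400
estimate = plpmtud_search(lambda size: size <= actual_pmtu, 1280, 1500)
```

No ICMP message appears anywhere in this loop; the sender learns only from the presence or absence of acknowledgements.
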
Therefore, PLPMTUD does not rely on the network's ability to deliver ICMP PTB messages to the source.

An upper-layer protocol that does not rely on IP fragmentation never causes the underlying IP module to emit:

o  A fragmentable IP packet (i.e., an IPv4 packet with the DF-bit set to zero).

o  An IP fragment.

o  A packet whose length is greater than the PMTU estimate.

However, when the PMTU estimate is greater than the actual PMTU, the upper-layer protocol can cause the underlying IP module to emit a packet whose length is greater than the actual PMTU. When this occurs, a downstream router drops the packet and the source node refines its PMTU estimate, employing either PMTUD or PLPMTUD procedures.

When an upper-layer protocol that relies only on IP source fragmentation submits data to the underlying IP module, and the resulting packet is larger than the PMTU estimate, the underlying IP module fragments the packet and emits the fragments. However, the upper-layer protocol never causes the underlying IP module to emit:

o  A fragmentable IP packet.

o  A packet whose length is greater than the PMTU estimate.

When the PMTU estimate is greater than the actual PMTU, the upper-layer protocol can cause the underlying IP module to emit a packet whose length is greater than the actual PMTU. When this occurs, a downstream router drops the packet and the source node refines its PMTU estimate, employing either PMTUD or PLPMTUD procedures.

An upper-layer protocol that relies on IP source fragmentation and downstream fragmentation can cause the underlying IP module to emit:

o  A fragmentable IP packet.

o  An IP fragment.

o  A packet whose length is greater than the PMTU estimate.

A protocol that relies on IP source fragmentation and downstream fragmentation does not require access to the PMTU estimate.
For these protocols, the underlying IP module:

o  Fragments all packets whose length exceeds the MTU of the first link along the path to the destination.

o  Sets the DF-bit to zero, so that downstream nodes can fragment the packet.

3. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

4. IP Fragmentation Reduces Reliability

This section explains how IP fragmentation reduces the reliability of Internet communication.

4.1. Middle Box Failures

Many middle boxes require access to the transport-layer header. However, when a packet is divided into fragments, the transport-layer header appears in the first fragment only. It does not appear in subsequent fragments. This omission can prevent middle boxes from delivering their intended services.

For example, assume that a router diverts selected packets from their normal path towards network appliances that support deep packet inspection and lawful intercept. The router selects packets for diversion based upon the following 5-tuple:

o  IP Source Address.

o  IP Destination Address.

o  IPv4 Protocol or IPv6 Next Header.

o  transport-layer source port.

o  transport-layer destination port.

IP fragmentation causes this selection algorithm to behave suboptimally, because the transport-layer header appears only in the first fragment of each packet.

In another example, a middle box remarks a packet's Differentiated Services Code Point [RFC2474] based upon the above-mentioned 5-tuple. IP fragmentation causes this process to behave suboptimally, because the transport-layer header appears only in the first fragment of each packet.
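
The effect on a stateless 5-tuple classifier, as in the two examples above, can be sketched as follows. The `Packet` class, its field names, the function names, and the example addresses and ports are all hypothetical; real middle boxes parse packet headers rather than Python objects.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Packet:
    src: str
    dst: str
    protocol: int                    # IPv4 Protocol or IPv6 Next Header
    fragment_offset: int = 0         # non-zero for non-initial fragments
    src_port: Optional[int] = None   # absent when no transport header
    dst_port: Optional[int] = None


def five_tuple(pkt: Packet) -> Optional[Tuple[str, str, int, int, int]]:
    """Return the 5-tuple, or None if the transport header is absent."""
    if pkt.fragment_offset != 0 or pkt.src_port is None or pkt.dst_port is None:
        return None
    return (pkt.src, pkt.dst, pkt.protocol, pkt.src_port, pkt.dst_port)


def should_divert(pkt: Packet, rule: Tuple[str, str, int, int, int]) -> bool:
    """Stateless match of a single packet against a 5-tuple rule."""
    return five_tuple(pkt) == rule


# Two fragments of one hypothetical UDP datagram (documentation addresses).
rule = ("192.0.2.1", "198.51.100.7", 17, 5000, 53)
first_fragment = Packet("192.0.2.1", "198.51.100.7", 17, 0, 5000, 53)
later_fragment = Packet("192.0.2.1", "198.51.100.7", 17, 185)
```

Here `should_divert` selects the first fragment but not the later one, even though both belong to the same datagram. A stateless classifier has no way to recover the missing ports.
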

In each of the above-mentioned examples, the middle box cannot deliver its intended service without reassembling fragmented packets.

4.2. Partial Filtering

IP fragments cause problems for firewalls whose filter rules include decision making based on TCP and UDP ports. Because the port information is not present in the trailing fragments, the firewall has three options. It may accept all trailing fragments, which may admit certain classes of attack. It may block all trailing fragments, which may block otherwise legitimate traffic. Or it may reassemble all fragmented packets, which may be inefficient and negatively affect performance.

4.3. Telemetry and Monitoring Failures

Stateless telemetry and monitoring strategies may require the transport-layer header to appear in every packet. However, when a packet is divided into fragments, the transport-layer header appears in the first fragment only. It does not appear in subsequent fragments. This omission can prevent some stateless telemetry strategies from functioning correctly.

4.4. Suboptimal Load Balancing

Many stateless load-balancers require access to the transport-layer header. Assume that a load-balancer distributes flows among parallel links. In order to optimize load balancing, the load-balancer sends every packet or packet fragment belonging to a flow through the same link.

In order to assign a packet or packet fragment to a link, the load-balancer executes an algorithm. If the packet or packet fragment contains a transport-layer header, the load balancing algorithm accepts the following 5-tuple as input:

o  IP Source Address.

o  IP Destination Address.

o  IPv4 Protocol or IPv6 Next Header.

o  transport-layer source port.

o  transport-layer destination port.

However, if the packet or packet fragment does not contain a transport-layer header, the load balancing algorithm accepts only the following 3-tuple as input:

o  IP Source Address.

o  IP Destination Address.

o  IPv4 Protocol or IPv6 Next Header.

Therefore, non-fragmented packets belonging to a flow can be assigned to one link while fragmented packets belonging to the same flow can be divided between that link and another. This can cause suboptimal load balancing.

4.5. Security Vulnerabilities

Security researchers have documented several attacks that rely on IP fragmentation. The following are examples:

o  Overlapping fragment attack [RFC1858] [RFC3128] [RFC5722]

o  Resource exhaustion attacks (such as the Rose Attack)

o  Attacks based on predictable fragment identification values [RFC7739]

o  Attacks based on bugs in the implementation of the fragment reassembly algorithm

o  Evasion of Network Intrusion Detection Systems (NIDS) [Ptacek1998]

In the overlapping fragment attack, an attacker constructs a series of packet fragments. The first fragment contains an IP header, a transport-layer header, and some transport-layer payload. This fragment complies with local security policy and is allowed to pass through a stateless firewall. A second fragment, having a non-zero offset, overlaps with the first fragment. The second fragment also passes through the stateless firewall. When the packet is reassembled, the transport-layer header from the first fragment is overwritten by data from the second fragment. The reassembled packet does not comply with local security policy. Had it traversed the firewall in one piece, the firewall would have rejected it.

A stateless firewall cannot protect against the overlapping fragment attack.
However, destination nodes can protect against the overlapping fragment attack by implementing the reassembly procedures described in RFC 1858, RFC 3128 and RFC 8200. These reassembly procedures detect the overlap and discard the packet.

The fragment reassembly algorithm is a stateful procedure for an otherwise stateless protocol. As such, it can be exploited for resource exhaustion attacks. An attacker can construct a series of fragmented packets, with one fragment missing from each packet so that the reassembly process cannot complete. Thus, this attack causes resource exhaustion on the destination node, possibly denying reassembly services to other flows. This type of attack can be mitigated by flushing fragment reassembly buffers when necessary, at the expense of possibly dropping legitimate fragments.

An IP fragment contains an "Identification" field that, together with the IP Source Address and Destination Address of a packet, identifies fragments that correspond to the same original datagram, so that they can be reassembled together by the receiving host. Many implementations have employed predictable values for the Identification field, thus making it easy for an attacker to forge malicious IP fragments that would cause the reassembly procedure for legitimate packets to fail.

Over the years, multiple IPv4 and IPv6 implementations have been found to have flaws in their implementation of the IP fragment reassembly algorithm, typically resulting in buffer overflows. These buffer overflows have been exploitable for denial of service and remote code execution attacks.

A NIDS aims to identify malicious activity by analyzing network traffic. Ambiguity in the possible result of the fragment reassembly process may allow an attacker to evade these systems. Many of these systems try to mitigate some of these evasion techniques by, e.g.,
computing all possible outcomes of the fragment reassembly process, at the expense of increased processing requirements.

4.6. Blackholing Due to ICMP Loss

As stated above, an upper-layer protocol requires access to the PMTU estimate if it:

o  Does not rely on IP fragmentation.

o  Relies on IP source fragmentation only (i.e., fragmentation at the source node).

In order to satisfy this requirement, the upper-layer protocol can:

o  Estimate the PMTU to be equal to the IPv4 or IPv6 minimum link MTU.

o  Access the estimate that PMTUD produced.

o  Execute PMTUD procedures itself.

o  Execute PLPMTUD procedures.

PMTUD relies upon the network's ability to deliver ICMP PTB messages to the source node. Therefore, if an upper-layer protocol relies on PMTUD, it also relies on the network's ability to deliver ICMP PTB messages to the source node.

According to [RFC4890], ICMP PTB messages must not be filtered. However, ICMP PTB delivery is not reliable. It is subject to both transient and persistent loss.

Transient loss of ICMP PTB messages causes PMTUD to perform less efficiently, but does not cause it to fail completely. When the conditions contributing to transient loss abate, the network regains its ability to deliver ICMP PTB messages and PMTUD regains its ability to function. Section 4.6.1 of this document describes conditions that lead to transient loss of ICMP PTB messages.

However, persistent loss of ICMP PTB messages causes PMTUD to fail completely. Section 4.6.2 and Section 4.6.3 of this document describe conditions that lead to persistent loss of ICMP PTB messages.

The problem described in this section is specific to PMTUD. It does not occur when the upper-layer protocol obtains its PMTU estimate from PLPMTUD or any other source.

4.6.1. Transient Loss

The following factors can contribute to transient loss of ICMP PTB messages:

o  Network congestion.

o  Packet corruption.

o  Transient routing loops.

o  ICMP rate limiting.

The effect of rate limiting may be severe, as RFC 4443 recommends strict rate limiting of ICMPv6 traffic.

4.6.2. Incorrect Implementation of Security Policy

Incorrect implementation of security policy can cause persistent loss of ICMP PTB messages.

Assume that a Customer Premises Equipment (CPE) router implements the following zone-based security policy:

o  Allow any traffic to flow from the inside zone to the outside zone.

o  Do not allow any traffic to flow from the outside zone to the inside zone unless it is part of an existing flow (i.e., it was elicited by an outbound packet).

When a correct implementation of the above-mentioned security policy receives an ICMP PTB message, it examines the ICMP PTB payload in order to determine whether the original packet (i.e., the packet that elicited the ICMP PTB message) belonged to an existing flow. If the original packet belonged to an existing flow, the implementation allows the ICMP PTB to flow from the outside zone to the inside zone. If not, the implementation discards the ICMP PTB message.

When an incorrect implementation of the above-mentioned security policy receives an ICMP PTB message, it discards the packet because its source address is not associated with an existing flow.

The security policy described above is implemented incorrectly on many consumer CPE routers.

4.6.3. Persistent Loss Caused By Anycast

Anycast can cause persistent loss of ICMP PTB messages. Consider the example below:

A DNS client sends a request to an anycast address. The network routes that DNS request to the nearest instance of that anycast address (i.e., a DNS server).
The DNS server generates a response and sends it back to the DNS client. While the response does not exceed the DNS server's PMTU estimate, it does exceed the actual PMTU.

A downstream router drops the packet and sends an ICMP PTB message to the packet's source (i.e., the anycast address). The network routes the ICMP PTB message to the anycast instance closest to the downstream router. Sadly, that anycast instance may not be the DNS server that originated the DNS response. It may be another DNS server with the same anycast address. The DNS server that originated the response may never receive the ICMP PTB message and may never update its PMTU estimate.

4.7. Blackholing Due To Filtering

In RFC 7872, researchers sampled Internet paths to determine whether they would convey packets that contain IPv6 extension headers. Sampled paths terminated at popular Internet sites (e.g., popular web, mail and DNS servers).

The study revealed that at least 28% of the sampled paths did not convey packets containing the IPv6 Fragment extension header. In most cases, fragments were dropped in the destination autonomous system. In other cases, the fragments were dropped in transit autonomous systems.

Another recent study [Huston] confirmed this finding. It reported that 37% of sampled endpoints used IPv6-capable DNS resolvers that were incapable of receiving a fragmented IPv6 response.

It is difficult to determine why network operators drop fragments. Possible causes follow:

o  Hardware inability to process fragmented packets.

o  Failure to change vendor defaults.

o  Unintentional misconfiguration.

o  Intentional configuration (e.g., network operators consciously choose to drop IPv6 fragments in order to address the issues raised in Section 4.1 through Section 4.6, above).

5. Alternatives to IP Fragmentation

5.1. Transport Layer Solutions

The Transmission Control Protocol (TCP) [RFC0793] can be operated in a mode that does not require IP fragmentation.

Applications submit a stream of data to TCP. TCP divides that stream of data into segments, with no segment exceeding the TCP Maximum Segment Size (MSS). Each segment is encapsulated in a TCP header and submitted to the underlying IP module. The underlying IP module prepends an IP header and forwards the resulting packet.

If the TCP MSS is sufficiently small, the underlying IP module never produces a packet whose length is greater than the actual PMTU. Therefore, IP fragmentation is not required.

TCP offers the following mechanisms for MSS management:

o  Manual configuration

o  PMTUD

o  PLPMTUD

For IPv6 nodes, manual configuration is always applicable. If the MSS is manually configured to 1220 bytes (1280 bytes minus the 40-byte IPv6 header and the 20-byte minimum TCP header) and the packet does not contain extension headers, the IP layer will never produce a packet whose length is greater than the IPv6 minimum link MTU (1280 bytes). However, manual configuration prevents TCP from taking advantage of larger link MTUs.

RFC 8200 strongly recommends that IPv6 nodes implement PMTUD, in order to discover and take advantage of path MTUs greater than 1280 bytes. However, as mentioned in Section 2.1, PMTUD relies upon the network's ability to deliver ICMP PTB messages. Therefore, PMTUD is applicable only in environments where the risk of ICMP PTB loss is acceptable.

By contrast, PLPMTUD does not rely upon the network's ability to deliver ICMP PTB messages. However, in many loss-based TCP congestion control algorithms, the loss of a packet may cause the congestion control algorithm to reduce the congestion window, or even to restart the entire slow-start process.
For high capacity, long round-trip time, large volume TCP streams, the deliberate probing with large packets and the consequent packet drop may impose too harsh a penalty on total TCP throughput for it to be a viable approach. [RFC4821] defines PLPMTUD procedures for TCP.

While TCP will never cause the underlying IP module to emit a packet that is larger than the PMTU estimate, it can cause the underlying IP module to emit a packet that is larger than the actual PMTU. If this occurs, the packet is dropped, the PMTU estimate is updated, the segment is divided into smaller segments and each smaller segment is submitted to the underlying IP module.

The Datagram Congestion Control Protocol (DCCP) [RFC4340] and the Stream Control Transmission Protocol (SCTP) [RFC4960] also can be operated in a mode that does not require IP fragmentation. They both accept data from an application and divide that data into segments, with no segment exceeding a maximum size. Both DCCP and SCTP offer manual configuration, PMTUD and PLPMTUD as mechanisms for managing that maximum size. [I-D.fairhurst-tsvwg-datagram-plpmtud] proposes PLPMTUD procedures for DCCP and SCTP.

Currently, the User Datagram Protocol (UDP) [RFC0768] lacks a fragmentation mechanism of its own and relies on IP fragmentation. However, [I-D.ietf-tsvwg-udp-options] proposes a fragmentation mechanism for UDP.

5.2. Application Layer Solutions

[RFC8085] recognizes that IP fragmentation reduces the reliability of Internet communication. It also recognizes that UDP lacks a fragmentation mechanism of its own and relies on IP fragmentation. Therefore, [RFC8085] offers the following advice regarding applications that run over UDP:

"An application SHOULD NOT send UDP datagrams that result in IP packets that exceed the Maximum Transmission Unit (MTU) along the path to the destination.
   Consequently, an application SHOULD either
   use the path MTU information provided by the IP layer or implement
   Path MTU Discovery (PMTUD) itself to determine whether the path to a
   destination will support its desired message size without
   fragmentation."

   RFC 8085 continues:

   "Applications that do not follow the recommendation to do PMTU/
   PLPMTUD discovery SHOULD still avoid sending UDP datagrams that would
   result in IP packets that exceed the path MTU. Because the actual
   path MTU is unknown, such applications SHOULD fall back to sending
   messages that are shorter than the default effective MTU for sending
   (EMTU_S in [RFC1122]). For IPv4, EMTU_S is the smaller of 576 bytes
   and the first-hop MTU. For IPv6, EMTU_S is 1280 bytes. The
   effective PMTU for a directly connected destination (with no routers
   on the path) is the configured interface MTU, which could be less
   than the maximum link payload size. Transmission of minimum-sized
   UDP datagrams is inefficient over paths that support a larger PMTU,
   which is a second reason to implement PMTU discovery."

   RFC 8085 assumes that for IPv4, an EMTU_S of 576 bytes is
   sufficiently small, even though the IPv4 minimum link MTU is 68
   bytes.

   This advice applies equally to applications that run directly over
   IP.

6.  Applications That Rely on IPv6 Fragmentation

   The following applications rely on IPv6 fragmentation:

   o  DNS [RFC1035]

   o  OSPFv3 [RFC5340]

   o  Packet-in-packet encapsulations

   Each of these applications relies on IPv6 fragmentation to a varying
   degree. In some cases, that reliance is essential, and cannot be
   broken without fundamentally changing the protocol. In other cases,
   that reliance is incidental, and most implementations already take
   appropriate steps to avoid fragmentation.

   This list is not comprehensive, and other protocols that rely on IPv6
   fragmentation may exist.
   They are not specifically considered in the context of this
   document.

6.1.  DNS

   DNS relies on UDP for efficiency. As a consequence, large responses
   are subject to IP fragmentation when the EDNS(0) options in the query
   permit them. Fragmentation-based packet loss can be mitigated by
   having queries advertise smaller EDNS(0) UDP buffer sizes, because
   responses that exceed the advertised size are truncated and retried
   over TCP. However, support for DNS over TCP over IPv6 is only
   partial, and this limits the efficacy of that approach in an IPv6
   context [Damas].

   Larger DNS responses can normally be avoided by aggressively pruning
   the Additional section of DNS responses. Such pruning is ineffective
   with DNSSEC, where large key sizes increase the size of the responses
   to certain DNS queries. There is no effective remedy for this
   situation within the DNS other than using smaller cryptographic keys
   and adopting DNSSEC administrative practices that attempt to keep DNS
   responses as short as possible.

6.2.  OSPFv3

   OSPFv3 implementations can emit messages large enough to cause IPv6
   fragmentation. However, in keeping with the recommendations of
   RFC 8200, and in order to optimize performance, most OSPFv3
   implementations restrict their maximum message size to the IPv6
   minimum link MTU.

6.3.  Packet-in-Packet Encapsulations

   In this document, packet-in-packet encapsulations include IP-in-IP
   [RFC2003], Generic Routing Encapsulation (GRE) [RFC2784], GRE-in-UDP
   [RFC8086], and Generic Packet Tunneling in IPv6 [RFC2473]. [RFC4459]
   describes fragmentation issues associated with all of the
   above-mentioned encapsulations.

   The fragmentation strategy described for GRE in [RFC7588] has been
   deployed for all of the above-mentioned encapsulations. This
   strategy does not rely on IPv6 fragmentation except in one corner
   case.
   (See Section 3.3.2.2 of RFC 7588 and Section 7.1 of RFC 2473.)
   Section 3.3 of [RFC7676] further describes this corner case.

7.  Recommendations

7.1.  For Application Developers

   Application developers SHOULD NOT develop applications that rely on
   IPv6 fragmentation.

   Application-layer protocols that depend upon IPv6 fragmentation
   SHOULD be updated to break that dependency.

7.2.  For Network Operators

   As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB
   messages unless they are known to be forged or otherwise
   illegitimate. As stated in Section 4.6, filtering ICMPv6 PTB packets
   causes PMTUD to fail. Operators MUST ensure proper PMTUD operation
   in their network. This includes making sure that the network
   generates a PTB packet whenever it drops a packet that exceeds the
   MTU of the outgoing interface.

   Many upper-layer protocols rely on PMTUD.

8.  IANA Considerations

   This document makes no request of IANA.

9.  Security Considerations

   This document mitigates some of the security considerations
   associated with IP fragmentation by discouraging the use of IP
   fragmentation. It does not introduce any new security
   vulnerabilities, because it does not introduce any new alternatives
   to IP fragmentation. Instead, it recommends well-understood
   alternatives.

10.  Acknowledgements

   Thanks to Mikael Abrahamsson, Lorenzo Colitti, Mike Heard, Tom
   Herbert, Tatuya Jinmei, Paolo Lucente, Eric Nygren, and Joe Touch for
   their comments.

11.  References

11.1.  Normative References

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791,
              DOI 10.17487/RFC0791, September 1981.

   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
              RFC 792, DOI 10.17487/RFC0792, September 1981.
   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              DOI 10.17487/RFC1191, November 1990.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet
              Control Message Protocol (ICMPv6) for the Internet
              Protocol Version 6 (IPv6) Specification", STD 89,
              RFC 4443, DOI 10.17487/RFC4443, March 2006.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007.

   [RFC8085]  Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
              Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
              March 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", STD 86, RFC 8200,
              DOI 10.17487/RFC8200, July 2017.

   [RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
              "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
              DOI 10.17487/RFC8201, July 2017.

11.2.  Informative References

   [Damas]    Damas, J. and G. Huston, "Measuring ATR", April 2018.

   [Huston]   Huston, G., "IPv6, Large UDP Packets and the DNS
              (http://www.potaroo.net/ispcol/2017-08/xtn-hdrs.html)",
              August 2017.

   [I-D.fairhurst-tsvwg-datagram-plpmtud]
              Fairhurst, G., Jones, T., Tuexen, M., and I.
              Ruengeler,
              "Packetization Layer Path MTU Discovery for Datagram
              Transports", draft-fairhurst-tsvwg-datagram-plpmtud-02
              (work in progress), December 2017.

   [I-D.ietf-tsvwg-udp-options]
              Touch, J., "Transport Options for UDP",
              draft-ietf-tsvwg-udp-options-05 (work in progress),
              July 2018.

   [Ptacek1998]
              Ptacek, T. and T. Newsham, "Insertion, Evasion and Denial
              of Service: Eluding Network Intrusion Detection", 1998.

   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122,
              DOI 10.17487/RFC1122, October 1989.

   [RFC1858]  Ziemba, G., Reed, D., and P. Traina, "Security
              Considerations for IP Fragment Filtering", RFC 1858,
              DOI 10.17487/RFC1858, October 1995.

   [RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
              DOI 10.17487/RFC2003, October 1996.

   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
              December 1998.

   [RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black,
              "Definition of the Differentiated Services Field (DS
              Field) in the IPv4 and IPv6 Headers", RFC 2474,
              DOI 10.17487/RFC2474, December 1998.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              DOI 10.17487/RFC2784, March 2000.

   [RFC3128]  Miller, I., "Protection Against a Variant of the Tiny
              Fragment Attack (RFC 1858)", RFC 3128,
              DOI 10.17487/RFC3128, June 2001.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340,
              DOI 10.17487/RFC4340, March 2006.

   [RFC4459]  Savola, P., "MTU and Fragmentation Issues with
              In-the-Network Tunneling", RFC 4459,
              DOI 10.17487/RFC4459, April 2006.

   [RFC4890]  Davies, E. and J.
              Mohacsi, "Recommendations for Filtering
              ICMPv6 Messages in Firewalls", RFC 4890,
              DOI 10.17487/RFC4890, May 2007.

   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
              RFC 4960, DOI 10.17487/RFC4960, September 2007.

   [RFC5340]  Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF
              for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008.

   [RFC5722]  Krishnan, S., "Handling of Overlapping IPv6 Fragments",
              RFC 5722, DOI 10.17487/RFC5722, December 2009.

   [RFC5927]  Gont, F., "ICMP Attacks against TCP", RFC 5927,
              DOI 10.17487/RFC5927, July 2010.

   [RFC7588]  Bonica, R., Pignataro, C., and J. Touch, "A Widely
              Deployed Solution to the Generic Routing Encapsulation
              (GRE) Fragmentation Problem", RFC 7588,
              DOI 10.17487/RFC7588, July 2015.

   [RFC7676]  Pignataro, C., Bonica, R., and S. Krishnan, "IPv6 Support
              for Generic Routing Encapsulation (GRE)", RFC 7676,
              DOI 10.17487/RFC7676, October 2015.

   [RFC7739]  Gont, F., "Security Implications of Predictable Fragment
              Identification Values", RFC 7739, DOI 10.17487/RFC7739,
              February 2016.

   [RFC7872]  Gont, F., Linkova, J., Chown, T., and W. Liu,
              "Observations on the Dropping of Packets with IPv6
              Extension Headers in the Real World", RFC 7872,
              DOI 10.17487/RFC7872, June 2016.

   [RFC8086]  Yong, L., Ed., Crabbe, E., Xu, X., and T. Herbert,
              "GRE-in-UDP Encapsulation", RFC 8086,
              DOI 10.17487/RFC8086, March 2017.

Appendix A.  Contributors' Address

Authors' Addresses

   Ron Bonica
   Juniper Networks
   2251 Corporate Park Drive
   Herndon, Virginia 20171
   USA

   Email: rbonica@juniper.net

   Fred Baker
   Unaffiliated
   Santa Barbara, California 93117
   USA

   Email: FredBaker.IETF@gmail.com

   Geoff Huston
   APNIC
   6 Cordelia St
   Brisbane, 4101 QLD
   Australia

   Email: gih@apnic.net

   Robert M.
   Hinden
   Check Point Software
   959 Skyway Road
   San Carlos, California 94070
   USA

   Email: bob.hinden@gmail.com

   Ole Troan
   Cisco
   Philip Pedersens vei 1
   N-1366 Lysaker
   Norway

   Email: ot@cisco.com

   Fernando Gont
   SI6 Networks
   Evaristo Carriego 2644
   Haedo, Provincia de Buenos Aires
   Argentina

   Email: fgont@si6networks.com