IPv6 Operations (v6ops) Working Group                       E. Vasilenko
Internet Draft                                                   X. Xiao
Intended status: Informational                       Huawei Technologies
Expires: March 2022                                          D. Khaustov
                                                              Rostelecom
                                                      September 17, 2021

                     IPv6 Oversized Packets Analysis
            draft-vasilenko-v6ops-ipv6-oversized-analysis-01

Abstract

Several IETF initiatives rely on IPv6 Extension Headers added in transit (SRv6, iOAM). Additionally, some recent developments are overlays (SRv6, VxLAN, NVO3, L2TPv3, and LISP). Both can create oversized packets that need to be dealt with. This document analyzes the available standards for the resolution of oversized packet drops.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire in March 2022.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Terminology and prerequisites
   2. Problem statement
   3. Solutions
      3.1. Provision links with big enough MTU
      3.2. Frugal usage of Extension Headers
      3.3. Fragmentation and reassembly at the tunnel ends
      3.4. PMTUD by original packet source
      3.5. Packetization Layer MTU Discovery
   4. Conclusion
   5. Security Considerations
   6. IANA Considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments

1. Terminology and prerequisites

We assume good knowledge of, or frequent reference to, [PMTUD] and [IPv6 Tunneling]. Terminology is inherited from [PMTUD].

   Link MTU - the maximum transmission unit, i.e., the maximum packet
      size in octets that can be conveyed over a link.

   Path MTU (PMTU) - the minimum link MTU of all links in a path
      between a source node and a destination node.

   Path MTU Discovery (PMTUD) - the process by which a node learns the
      PMTU of a path.

   EMTU_S - Effective MTU for sending; used by upper-layer protocols to
      limit the size of IP packets they queue for sending.

   EMTU_R - Effective MTU for receiving; the largest packet that can be
      reassembled at the receiver.

   Packetization Layer - the layer of the network stack that segments
      data into packets.

   PLPMTUD - Packetization Layer Path MTU Discovery, the method of
      detecting the path MTU at the packetization layer; an extension
      of classical PMTU Discovery.

   PTB (Packet Too Big) message - an ICMPv6 message reporting that an
      IPv6 packet is too large to forward through some link.

   MSS - the TCP Maximum Segment Size, the maximum payload size
      available to the TCP layer. This is typically the Path MTU minus
      the size of the IP and TCP headers (e.g., 1500 - 40 - 20 = 1440
      octets for TCP over IPv6 on a 1500-octet path).

2. Problem statement

IPv6 is strict regarding fragmentation: it must NOT be done in transit (Section 4.5 of [IPv6]).

IPv6 has seen rapid development in recent years. A lot of additional functionality has been added, primarily by adding options to Extension Headers and/or by using overlay encapsulation. All of the above expands the packet size, which could lead to oversized packets being dropped on some links.
Massive parallelism in traffic delivery is an additional challenge that has developed over the last 10 years: ECMP fan-out on a single hop can reach 16 (or even more), which can create on the order of 64k end-to-end paths over just 5 hops (an example from a big production network). Different paths could carry a different set of Extension Headers and, as a result, have a different PMTU. The PMTU is effectively becoming dynamic: we can never know how many additional headers will be added at a particular time to a particular packet on a particular path.

The old, classical PMTUD problems are still with us: filtered ICMPv6 messages, and drops related to Extension Headers that happen before the next-hop MTU has been evaluated (so no Packet Too Big message is sent).

Standards give two important numbers that we need for this discussion:

o  [IPv6] Section 5 requires that every link have an MTU of 1280
   octets or greater (2^10 + 2^8, which probably explains the choice
   of this size).

o  [IPv6] requires a minimum EMTU_R (reassembly buffer) of 1500
   octets. An upper-layer protocol or application that depends on IPv6
   fragmentation to send packets larger than the MTU of a path should
   not send packets larger than 1500 octets unless it has assurance
   that the destination is capable of reassembling packets of that
   larger size.

The reassembly buffer is much larger than 1500B for the majority of desktop and server OSes. Windows 10 has a "Reassemblylimit" of almost 64MB (visible via "netsh interface ipv6 show global"). Different flavors of Linux have "ipfrag_high_thresh" between 256KB and 4MB (visible via "more /proc/sys/net/ipv4/ipfrag_high_thresh"). iOS has a "maxReceiveIPv6BufferSize" of 64KB.
Reassembly is not as good for embedded OSes. Of the four primary OSes for IoT (Contiki, FreeRTOS, Mbed OS, MicroPython), only Mbed OS has the capability (for 5 fragments) by default, and it is possible to activate reassembly on Contiki. In all cases, the buffer is just a few packets of 1280B or 1500B. IoT devices may not be capable of reassembling the packets that a server in the cloud would send to them. Hence, ICMPv6 PTB is still very important for some OSes.

The [IPv6] architecture has only one solution for the PMTU problem: decrease the packet size at the original source. This is workable down to the minimum IPv6 packet size (1280B). For a long time the typical transit link had an MTU not much bigger than 1500B; only space for a few additional MPLS labels was reserved. The remaining 220B (1500B minus 1280B) could be considered guaranteed headroom for additional functionality in Extension and encapsulation headers. It could be enough for the next decade if we take some precautions - see the discussion below.

[Huston-2016] and [Huston-2021] investigated a different topic (fragmentation), but they contain good statistics on MTU-related drops up to 1500B, showing a 5% drop rate for an MTU as small as 1455B. Additionally, [Huston-2016] found a big drop spike (69% of all drops!) at 1480B, 20B below 1500B, presumably due to IPv6 encapsulation into IPv4. [Huston-2021] has shown a fragmentation drop rate twice as big for bigger packets, with a peak at 1408 octets, especially in Asia. As can be seen, 1500B is not always available now, and the reason is not well understood. Hence, we do not have 220B for additional headers in all situations. We can be reasonably optimistic that this type of tunneling will disappear in the long term. Below, we stick to the optimistic assumption that 220B is available in most situations. A more pessimistic estimate (200B? 175B?) is still possible, looking at Huston's data.
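For illustration only, the sketch below (Python) shows how quickly encapsulation consumes this margin; the header sizes are the fixed sizes of the respective encapsulations, and the particular overhead stack is just an example, not a recommendation:

   # Fixed header sizes, in octets.
   IPV6_HDR  = 40   # outer IPv6 header
   UDP_HDR   = 8
   VXLAN_HDR = 8
   ETH_HDR   = 14   # inner Ethernet header carried by VXLAN

   def remaining_margin(link_mtu, added_headers, ipv6_min_mtu=1280):
       """Octets still available for further headers when a packet of
       the IPv6 minimum size (1280B) must cross a link of link_mtu
       after added_headers octets of encapsulation are attached."""
       return link_mtu - ipv6_min_mtu - sum(added_headers)

   # VXLAN encapsulation of a minimum-size packet over a 1500B link:
   vxlan_stack = [IPV6_HDR, UDP_HDR, VXLAN_HDR, ETH_HDR]   # 70 octets
   print(remaining_margin(1500, vxlan_stack))               # -> 150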
The hungriest protocol known is SRv6: it could add a 40B IPv6 underlay tunnel header (called the "outer IP header" in [SRH]), 16B for the SRH header itself, and additionally up to 10 IPv6 addresses in the SID stack (potentially even more). That is already 216B (40 + 16 + 10*16), very close to the 220B optimistic limit. It makes the introduction of any additional functionality quite challenging without a rigorous expansion of all links to a bigger MTU.

Initial SRv6 implementations that exceeded the safe 220B limit are the reason for recent activities in MTU problem research. We see many recent efforts to improve Path MTU Discovery (mentioned later in this document); let us find the rationale behind them.

3. Solutions

Let's consider the reassembly buffer problem first, as the simpler one.

The minimal buffer for packet reassembly (1500B) could potentially be increased in new standard updates, but then there would be a transition problem, because this limitation is already programmed into billions of IoT hosts; it would take a long time to be sure that no old implementations remain.

There is no good solution for the problem of headers bloating above the 220B limit for hosts. We need to keep headers below 220B for embedded OSes. Fortunately, we are still far from this problem: very limited additional functionality is implemented directly on hosts (like [PMTU by HbH] or APN6). This problem should be looked at again in a few years; it may be that in the future we would have to increase the default EMTU_R on all hosts to make room for new functionality.

Let's now return to our primary problem: insufficient PMTU.

There is a low probability that the Internet community would agree to decrease the minimal IPv6 packet size (1280B). Hence, the oversized-packet problem cannot be resolved in that direction.

It is possible to partially alleviate the MTU problem in network zones where all transit nodes have a big enough MTU. Transit nodes should delete extension headers before packets leave such a "high MTU network zone". The leakage of a big header to a host could overflow the EMTU_R buffer. The majority of RFCs recommend that carriers delete additional headers before forwarding traffic to the client; this practice should be strictly followed.

The SPRING working group is actively developing a compressed version of SRv6 that should leave space for other functionality, even on current transit routers that sometimes do not support much above 1500B.

All solutions for avoiding packet drops caused by oversized packets can be classified into a few classes. They are examined one by one below.

3.1. Provision links with big enough MTU

The MTU supported by a host's links is typically 1500 bytes. A backbone link's MTU could be up to 9000 bytes on modern hardware. In an ideal world, PMTUD would not be needed.

Reality is not that good:

o  Some old devices still support just a few additional MPLS labels
   above 1500B on Ethernet.
   It was historically a problem to cross 1536B, because the IEEE
   802.3 specification assumes that a bigger number in the Length/Type
   field means the Type of the payload.
o  Middleboxes that do not support an MTU much bigger than 1500B may
   stay in networks for a long time.
o  Ethernet is now very mature with respect to big MTU support, but a
   big MTU could be a challenge for other link-layer technologies (for
   example WiFi, satellite links, radio links, etc.).
o  Links could be rented from a third party, with no possibility to
   change the MTU.
o  A big MTU may negatively influence buffer efficiency - see below.
o  The majority of vendors set the default MTU to 1500B (with
   variations in what is counted inside the MTU). It is time-consuming
   to change the MTU on a production network.
o  Some hosts (especially for storage traffic in Data Centers) could
   use a 2500B or 9000B MTU, which challenges the possibility of
   always having a bigger MTU in the backbone.

Cost-optimized equipment architectures (especially common in switches, but applicable to many routers as well) may not split packets across buffer memory cells, so a small packet occupies the bigger buffer space reserved for a packet of maximum MTU size. This limitation effectively decreases the number of packets that can be buffered. Most host packets are still limited to 1500B, so a 9000B MTU would waste buffer memory, with an efficiency of 1/6. The average packet size is about half of that again, hence in the worst case buffer efficiency could drop to 1/12. Buffer memory is about 30% of the router cost, and increasing the buffer memory cost 12 times is not acceptable. Hence, in many cases it does not make sense to increase the MTU to the maximum supported by the switch or router. One should always check with the vendor the impact of a big MTU on buffering for the particular product. The MTU should be increased to a number bigger than the maximum MTU expected from hosts, plus the size of all possible network overhead, plus the underlay IPv6 header (if present); a short sketch later in this subsection illustrates this arithmetic.
A 9000B MTU makes sense in a DC or cross-DC environment, or on platforms that split packets into smaller cells in buffer memory.

[MTU issues in Tunneling] Section 3.3 discusses the opposite solution: decrease the MTU on links to hosts to be sure that a host always generates packets small enough for the backbone. This solution was possible for a small tunnel overhead, but now we are talking about situations where a 220B margin is not enough.

[L3VPN] and [EVPN] attach an additional label and could create oversized packets. Moreover, the MPLS header cannot point back to the original MPLS router that attached the service label. Additionally, a VPN IP packet could use private address space or no IP address at all (for EVPN). This blocks the possibility to properly organize the PMTUD process. Hence, [L3VPN] and [EVPN] have been developed under the assumption that all MTUs on the path are expanded by at least the 8 bytes needed for services over the MPLS data plane.
The recent [Generic Delivery Functions] may permit fragmentation for MPLS services, but it is still a personal draft.

[Pseudowire Fragmentation] is the rare case where fragmentation is available over MPLS, for one type of service.
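Returning to the MTU-sizing and buffer-efficiency discussion above, a minimal sketch (Python; the function names, the 40-octet underlay header, and the example figures are illustrative assumptions, not recommendations):

   def required_backbone_mtu(host_mtu, overlay_overhead, outer_ipv6=40):
       """A backbone link must carry the largest host packet plus every
       header the network itself may add (overlay/service headers and,
       if present, an underlay IPv6 header)."""
       return host_mtu + overlay_overhead + outer_ipv6

   def buffer_efficiency(avg_pkt_size, configured_mtu):
       """If the platform reserves one maximum-size cell per packet,
       only avg_pkt_size/configured_mtu of the buffer memory is
       actually used."""
       return avg_pkt_size / configured_mtu

   # Hosts limited to 1500B and 180B of expected overlay overhead:
   print(required_backbone_mtu(1500, 180))    # -> 1720
   # Average packet of ~750B against a 9000B per-packet reservation:
   print(buffer_efficiency(750, 9000))        # -> ~0.083 (about 1/12)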
[VxLAN] Section 4.3 uses the same approach: "it is RECOMMENDED that the MTUs across the physical network infrastructure be set to a value that accommodates the larger frame size due to the encapsulation".

Packet drop statistics and the considerable activity in the IETF prove that the PMTUD problem persists.

"Raise the MTU on transit links" is the best solution, when it is available.

3.2. Frugal usage of Extension Headers

Some new functionality (especially source routing with a big SID stack) could decrease header size without a big loss of functionality (for example, by using loose node appointment in the SID stack). Some functionality (like iFIT or iOAM) could be completely omitted in situations that would otherwise lead to packet drops. It is effectively a tradeoff of functionality against PMTU headroom.

The important point here is that a transit node attaching an additional header should be aware of all MTUs along the assumed packet path, to predict how big an addition is still acceptable (see the sketch at the end of this subsection).

[PMTUD] is readily available for tunneling interfaces - the tunnel source should be aware of the PMTU of the tunnel (through PTB feedback messages). But there are cases when this is not enough:

o  An SDN controller (or a management system in general) could assist
   in provisioning extension headers (including SFC, iOAM, BIER) and
   encapsulation headers (SRv6, VxLAN) - there should be a way to
   report MTUs to the controller.
o  ICMPv6 PTB would be directed to the transit control plane only in
   the case of problems inside the tunnel. PTB messages from outside
   of the tunnel would be directed to the source node. It is difficult
   to snoop PTB messages on transit nodes.

Hence, we see many initiatives to collect and manage MTU information in popular protocols for routing and traffic engineering: [PMTU by ISIS], [PMTU by BGP-LS], [PMTU by PCEP], [PMTU by SR-Policy].

Moreover, these protocol extensions would become even more useful in the future, when it will no longer be possible to squeeze all extension headers into 220B. Frugal attachment of new headers on transit nodes would increase the need for PMTU awareness; it should stimulate MTU collection by all other popular protocols (OSPF, normal BGP on peering borders).

This approach has a fundamental problem: full knowledge of all MTUs in the domain does not help to estimate the real path of a packet, because of the massive ECMP used by many networks (at least by all carriers). Non-routing protocols do not have a proper engine to estimate traffic paths and predict the PMTU either. Even more, if L2 ECMP is used or some links are rented from another carrier, it is again impossible to predict the exact path and the PMTU.

The second problem of this approach could be classified as "chicken and egg". We already have a much better solution for MTU drops: increase the MTU (see the previous section). We are looking for other solutions only because upgrading equipment (to a better MTU) is not possible for some reason. But introducing new protocols would also demand an equipment upgrade, making frugal headers less valuable. However, an upgrade of the control plane should be cheaper than an upgrade of the data plane, if the vendor supports such an approach.

Hence, the solution discussed in this section has only limited applicability.
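The kind of decision a PMTU-aware attaching node would have to make (mentioned at the start of this subsection) can be sketched as follows; this is a purely hypothetical policy in Python, with made-up header names and sizes, only to illustrate the tradeoff of functionality against the remaining PMTU budget:

   def select_optional_headers(pkt_len, predicted_pmtu, candidates):
       """Attach optional headers in priority order while the packet
       still fits the predicted PMTU; frugally drop whatever does not.
       candidates is a list of (name, size_in_octets) pairs."""
       attached, size = [], pkt_len
       for name, hdr_len in candidates:
           if size + hdr_len <= predicted_pmtu:
               attached.append(name)
               size += hdr_len
       return attached, size

   # Hypothetical example: a 1400B packet, a predicted PMTU of 1500B,
   # and three optional additions in decreasing order of importance.
   print(select_optional_headers(
       1400, 1500,
       [("outer IPv6 + SRH with 2 SIDs", 80),
        ("iOAM trace option", 32),
        ("extra telemetry", 24)]))
   # -> (['outer IPv6 + SRH with 2 SIDs'], 1480)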
3.3. Fragmentation and reassembly at the tunnel ends

The tunnel source behaves like a host with respect to the tunnel header. It is possible to properly adjust the PMTU for the tunnel by [PMTUD], so it is potentially possible to fragment all packets bigger than the PMTU.

[IP Encapsulation] is the earliest standard for IP-in-IP encapsulation. Its Section 5.1 discusses that it is possible to fragment IP packets before tunnel encapsulation, so there is no need to reassemble packets at the other tunnel end; reassembly can happen on the destination host. This has no additional cost implications for the tunnel ends. This approach worked for IPv4 when the "don't fragment" bit was cleared. It fully contradicts the IPv6 architecture, which does not permit fragmenting packets in transit; no standard has risked proposing such a solution for IPv6.

Some standards do propose IPv6 fragmentation (primarily for packets of 1280B and below), but fragmentation is recommended after encapsulation. This leads to packet reassembly at the other tunnel end, to hide the fact of transit fragmentation from the destination host. It minimizes the disruption to the IPv6 architecture.

Many standards discussed below ([MPLS Encapsulation], [L2TPv3], [VxLAN], [NVO3]) forgot to mention that packets of 1280B and below should be fragmented. This inaccuracy has not created any problem in production networks, because we typically have 220B for all headers, which is big enough for many tunnels nested into each other. The situation could change in the coming years because of the expansion of Extension Headers by different functions. It could create pressure to come back to many mature standards and clarify the situation: what to do when a 1280B packet cannot go through the tunnel.

Fragmentation has a few issues that make it unpopular:

o  Fragmentation could double buffer requirements (assuming a split
   into only 2 fragments). We can ignore the small additional buffer
   requirement for packets that may be lost and need to wait some time
   before reassembly; the Internet is not productive anyway after a
   few percent of packet drops. Buffer memory is about 30% of the
   router cost, and a 30% cost increase would not be accepted by the
   majority of owners. Albeit, some middleboxes already have enough
   buffer memory that could be reused for packet reassembly.
o  In general, the IPv6 architecture does not approve of fragmentation
   in transit in any standard (except the recent draft [IP Tunnels] -
   see below). [PMTUD] Section 5.1: "packetization layers are
   encouraged to avoid sending messages that will require
   fragmentation". This section discusses some situations when tunnel
   fragmentation is inevitable.
o  [Fragile Fragmentation] has a good collection of all the problems
   related to fragmentation (in addition to the above: it breaks ECMP,
   stateful processing, and policy routing, and has many security
   attack vectors). [Fragile Fragmentation] strongly recommends
   avoiding fragmentation, but does not deprecate it yet.

The primary RFC for tunneling is [IPv6 Tunneling]; it is the oldest standard and was later reused by many other standards (including the latest SRH). It permits fragmentation only when the original packet is already minimal (1280B or less) - see its Section 7.1. It mandates dropping the packet and signaling ICMPv6 PTB to the source (a request to decrease the packet size at the source) for all other cases.
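For that single permitted case, the fragment sizes follow directly from the IPv6 fragmentation rules of [IPv6]. A minimal sketch (Python; the 40-octet outer header and the function name are illustrative assumptions):

   def ipv6_fragment_sizes(fragmentable_len, pmtu, unfrag_len=40):
       """On-wire sizes of the IPv6 fragments carrying fragmentable_len
       octets of payload, given the path MTU and the length of the
       per-fragment unfragmentable part (here just the outer IPv6
       header).  Every fragment except the last carries a multiple of
       8 octets, plus an 8-octet Fragment header."""
       FRAG_HDR = 8
       max_chunk = (pmtu - unfrag_len - FRAG_HDR) // 8 * 8
       sizes, left = [], fragmentable_len
       while left > 0:
           chunk = min(max_chunk, left)
           sizes.append(unfrag_len + FRAG_HDR + chunk)
           left -= chunk
       return sizes

   # A 1280B original packet behind a 40B outer IPv6 header, crossing a
   # tunnel whose path MTU is itself only 1280B - the one case where
   # [IPv6 Tunneling] permits fragmentation:
   print(ipv6_fragment_sizes(1280, 1280))   # -> [1280, 96]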
[MPLS Encapsulation] Section 5.1 is titled "Preventing Fragmentation and Reassembly". It stresses again: "IPv6 intermediate nodes do not perform fragmentation in any event".

[L2TPv3] Section 4.1.4 has a similar comment: "Note that IPv6 does not support "in-flight" fragmentation of data packets".

[VxLAN] Section 4.3 is strict: "VTEPs MUST NOT fragment VXLAN packets."

[NVO3] Section 4.4.4 is strict too: "It is strongly RECOMMENDED that Path MTU Discovery ([PMTUD]) be used to prevent or minimize fragmentation."

[IPv6 GRE] Section 3.3 recommends fragmentation only for packets that are smaller than 1280B.

The most recent draft covering all types of tunnels is [IP Tunnels]. It is already referenced by many IETF documents. It is complicated to cover all use cases (any IP over any IP in any situation), but the net result is that a much bigger share of the traffic is proposed to be fragmented into the tunnel. Section 3.3: "The path between ingress and egress interfaces has a path MTU, but the endpoints can exchange messages as large as can be reassembled at the destination (egress interface), i.e., the EMTU_R of the egress interface".
A short explanation of the proposed functionality: the original host would try to transmit the biggest flows (by volume) at the maximum PMTU, which the tunnel source would not try to correct by PTB messages for sizes up to 1500B. Hence, the tunnel source would have no option except to fragment. The principal problem here is the absence of PTB messages for packet sizes between the PMTU and the statically appointed EMTU_R. Let's see how this has been formulated in more detail.
[IP Tunnels] introduces a new variable, "Tunnel MTU", that should not change as a result of PMTUD. The procedure to change the "Tunnel MTU" is out of the draft's scope; it is pushed to the specifications of particular tunnels in the last paragraph of Section 4.2.2. Moreover, it is even assumed that PLPMTUD could be used on the router for "Tunnel MTU" discovery, because this parameter is considered to be above the network layer (like the transport layer on a host). A separate Section 4.2.3 is dedicated to explaining that the newly introduced "Tunnel MTU" cannot be adjusted dynamically. There is a recommendation for the default "Tunnel MTU": the typical host EMTU_R (1500B) minus the tunnel outer header overhead. A good question could be: if it is so difficult to manage the "Tunnel MTU" dynamically, then why was this variable introduced?
The MTU of the tunnel is renamed the MAP (maximum atomic packet); the MAP should be corrected by PMTUD feedback from inside the tunnel. Section 4.2.2 states that everything up to the "Tunnel MTU" should be accepted into the tunnel, and one long packet (with inner and outer headers) should be created. It should then be split into fragments below the MAP size.
Initially, the "Tunnel MTU" and the MAP could be manually synchronized by the administrator (with the difference being the tunnel overhead). But any additional overhead on the tunnel path (a nested tunnel, the smallest Extension Header) would result in PMTUD decreasing the MAP while the "Tunnel MTU" stays unchanged. That would turn on fragmentation for all bulk traffic.
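A small numeric sketch (Python; the 40-octet overhead, the function name, and the example figures are assumptions for illustration) shows how this interaction turns fragmentation on for every full-size packet:

   import math

   def outer_fragments(inner_len, tunnel_mtu, map_size, overhead=40):
       """Fragments emitted by an [IP Tunnels] ingress: packets up to
       the static "Tunnel MTU" are accepted and encapsulated, and the
       resulting outer packet is split into fragments no larger than
       the MAP learned by PMTUD inside the tunnel."""
       if inner_len > tunnel_mtu:
           return 0                   # dropped, PTB sent to the source
       if inner_len + overhead <= map_size:
           return 1                   # fits, no fragmentation
       chunk = (map_size - overhead - 8) // 8 * 8   # 8B Fragment header
       return math.ceil(inner_len / chunk)

   # "Tunnel MTU" of 1460B (1500B host packets minus 40B overhead) and
   # a MAP that starts at 1500B: full-size packets pass untouched.
   print(outer_fragments(1460, 1460, 1500))   # -> 1
   # A nested tunnel shrinks the MAP to 1460B; the static "Tunnel MTU"
   # does not follow, so every full-size packet is now fragmented:
   print(outer_fragments(1460, 1460, 1460))   # -> 2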
This situation is quite probable now (see [Huston-2020] on the MTUs available on the Internet) and it will be even more probable in the future, when many additional extension headers are used. Hence, the requirement in Section 5.3.1, "do NOT try to deprecate fragmentation", is indeed important.
Section 3.6 takes the same approach as all other standards on the question of when fragmentation should happen: "this document assumes that only outer fragmentation is viable because it is the only approach that works for both IPv4 datagrams with DF=1 and IPv6".
A considerable increase in fragmentation is thus proposed for reasons of academic purity: the router part of the router should behave as a router, and the host part of the router should behave as a host, without any deviations.
The additional fragmentation would create all of the problems discussed in [Fragile Fragmentation] and substantially increase the cost of tunnel endpoints. There is a high probability that the draft [IP Tunnels] would be rejected by the market for cost reasons.

Additionally, we should point out that the drop statistics for fragmented packets on the Internet are still high (7%) - see [Huston-2021].

Fragmentation is the least probable solution for oversized packet drops.

3.4. PMTUD by original packet source

[PMTUD] is mandatory in the IPv6 architecture, because IPv6 does not have fragmentation in transit. Many RFCs recommend not to block ICMPv6 PTB completely (it may be rate-limited - see [ICMPv6] Section 2.4). [DPLPMTUD] Section 1.1 has a very good collection of reasons why a PTB message may not be delivered to the source; it is used as justification to augment PMTUD by [DPLPMTUD].

We should not see this problem for non-tunneling protocols in the majority of environments. ICMPv6 PTB should be delivered to the packet source, and the packet source would dynamically decrease the PMTU to adapt to the new reality. The PMTU could change dynamically because some transit node could introduce an additional extension header ad hoc, or ECMP could switch the flow to a different path.

[IPv6 Tunneling] mandates that tunnel ends relay ICMPv6 PTB messages received from inside the tunnel. [IPv6 Tunneling] does not use the "relay" terminology, but its Section 8 explains in detail how to reconstruct and retransmit ICMP messages to the original packet source (deleting all tunnel-related information).
[MTU issues in Tunneling] Section 3.2 discusses the same approach.
[L2TPv3] Section 4.1.4 refers to [IPv6 Tunneling]; we can treat it as a request for PTB message relay too.
[SRH] Section 5.4 confirms full adherence to the ICMPv6 PTB relay approach: "For IP packets encapsulated in an outer IPv6 header, ICMP error handling is as defined in [IPv6 Tunneling]".

[VxLAN] Section 4.3 proposes to use PMTUD: "Path MTU discovery MAY be used to address this requirement as well".
[NVO3] Section 4.4.4 assumes PMTUD too: "It is strongly RECOMMENDED that Path MTU Discovery ([PMTUD]) be used to prevent or minimize fragmentation".
[IPSec] Section 8.2.1 requests that the PMTU be maintained for the tunnel and signaled to the original packet source as soon as any new packet arrives.
[IPv6 GRE] Section 3.3 clearly instructs developers to drop oversized packets and send PTB for packets bigger than the tunnel MTU. The method of PMTU detection is fully IPv6 compliant: "the GMTU is equal to the PMTU associated with the path between the GRE ingress and the GRE egress, minus the GRE overhead".
[MPLS Encapsulation] Section 5.1 specifies the same approach: the tunnel head-end should use [PMTUD] to learn the tunnel MTU, and then "the packet will have to be discarded, but the tunnel head should send the IP source of the discarded packet the proper ICMP error message".

[VxLAN], [NVO3], [IPSec], [IPv6 GRE], and [MPLS Encapsulation] do not request that the tunnel endpoint relay PTB messages. PMTUD should be used to set the proper MTU for the tunnel; subsequent packets would then trigger PTB messages to the packet source. Compared with the original [IPv6 Tunneling] relay approach, this adds one round-trip delay for the first PTB message. This small deficiency could be partially explained by the desire of many standards to be universal for IPv6 as well as IPv4. As a reminder, IPv4 may not have enough information in the ICMP message to properly reconstruct a relay message (only 64 bits of the source packet, per RFC 792).

[IP Tunnels] is the only draft that contradicts [IPv6 Tunneling] (and every protocol built on top of it): it clearly prohibits relaying PTB messages. It states in Section 3.3: "When such messages (PTB) arrive at the ingress interface ("ingress interface" is the tunnel interface in this draft), they may affect the properties of that interface (e.g., its MTU), but they should never directly cause new ICMPs in the outer network". This idea is generalized in Section 5.1 as "ICMP messages MUST NOT be generated by the tunnel (as a link)". The motivation assumed in the draft is to fully mimic host behavior on the router's virtual (tunnel) interface, because a host would not retranslate PTB messages.

The "Flow Label" is gaining popularity. [IPv6 Tunneling] and [ICMPv6] do not have strong recommendations for the "Flow Label"; it was not an important topic at that time. The only small improvement that makes sense for [IPv6 Tunneling] is to recommend copying the "Flow Label" from the source packet to the tunnel packet and from the source packet to the ICMPv6 PTB message. It would permit load-balancing PTB messages onto the same path as the original traffic - see the problem described in [ICMPv6 PTB in ECMP] about hash-based load balancing between many hosts. Copying the "Flow Label" into the PTB message would contradict neither the IPv6 architecture nor any RFC; it is not mandatory to develop a special standard update for it.

[MTU issues in Tunneling] Section 3.2 raises the concern that, in the case of Lawful Intercept, additional encapsulation could produce PTB messages that would reveal this fact to the monitored host. It is not a very realistic concern, because the PMTU could change for many other reasons (especially with the proliferation of new protocols). If it is still a concern, then it makes sense to use another solution for this case: a bigger MTU (better) or even fragmentation.

[MTU issues in Tunneling] Section 3.2 also raises the question of the applicability of "MSS clamping". A transit node could snoop the transport layer and change the MSS exchanged between nodes. This "hack" is not recommended because it breaks the layered model of the IETF and OSI.
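For illustration of the relay behavior discussed above (including the Flow Label copy), a rough sketch using the Python scapy library is shown below. The 40-octet tunnel overhead and all names are assumptions; a real implementation would live in the router's data or control plane rather than in Python:

   from scapy.all import IPv6, ICMPv6PacketTooBig, Raw  # pip install scapy

   IPV6_MIN_MTU = 1280   # RFC 8200 minimum link MTU
   OUTER_HDR    = 40     # assumed tunnel overhead: one outer IPv6 header

   def relay_ptb(ptb_from_tunnel, tunnel_ingress_addr):
       """Rebuild a Packet Too Big message for the original source, in
       the spirit of [IPv6 Tunneling] section 8: strip the tunnel
       header from the quoted packet, reduce the reported MTU by the
       tunnel overhead, and copy the Flow Label of the original packet
       so the relayed PTB follows the same ECMP path as the flow it
       reports on (see [ICMPv6 PTB in ECMP])."""
       quoted   = bytes(ptb_from_tunnel[ICMPv6PacketTooBig].payload)
       original = quoted[OUTER_HDR:]            # original (inner) packet
       inner    = IPv6(original)
       new_mtu  = ptb_from_tunnel[ICMPv6PacketTooBig].mtu - OUTER_HDR
       return (IPv6(src=tunnel_ingress_addr, dst=inner.src, fl=inner.fl)
               / ICMPv6PacketTooBig(mtu=new_mtu)
               / Raw(original[:IPV6_MIN_MTU - 40 - 8]))  # stay below 1280B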
[PMTUD] is the only mechanism that is universal for all cases and fully compliant with the IPv6 architecture. Vendors just need to use it, despite some challenges in relaying PTB messages at tunnel ends. Moreover, it makes sense to standardize the relay of PTB messages at tunnel ends; it would improve the PMTUD time at the original traffic source by one round-trip time.
[IPv6] states: "It is strongly recommended that IPv6 nodes implement Path MTU Discovery [PMTUD]".

3.5. Packetization Layer MTU Discovery

[PLPMTUD] and [DPLPMTUD] have been developed considerably in recent years. The Packetization Layer (UDP/TCP) (1) has much more visibility (it can see the size of transport-layer buffers); (2) can operate in the absence of ICMPv6 PTB (too much filtering); and (3) can be very granular (per-flow). It does have its use cases.

Albeit, PLPMTUD/DPLPMTUD have their restrictions: they (1) are not universal for all transport protocols; (2) need more resources from the host; (3) make it challenging to share PMTU information between applications; (4) need many more round-trip times to find a suitable PMTU; and (5) do not work well on congested paths (it is difficult to understand the reason for packet loss).

Hence, PLPMTUD is not a replacement for PMTUD - both are needed. As a reminder, from [PLPMTUD]: "Packetization Layer Path MTU Discovery (PLPMTUD) is most efficient when used in conjunction with the ICMP-based Path MTU Discovery".

PLPMTUD could act as a replacement for PMTUD in the worst-case scenario (ICMP is filtered). It would lead to a PMTU decrease on the original host too. PLPMTUD could be considered a redundancy mechanism for PMTUD.

Strictly speaking, [PMTU by HbH] is a network-layer mechanism, not a packetization-layer one. It is mentioned in this section because its usage is very similar to PLPMTUD; [PMTU by HbH] could be considered to some degree an extension of PLPMTUD. It is not expected to fundamentally change the conclusions of this document.

4. Conclusion

It is better not to have a problem with oversized packets in the first place. One should upgrade all links to a bigger MTU, if possible.

A host could have an MTU as big as a transit node's. It will never be possible to deprecate PMTUD. It is important to follow the recommendations of [PMTUD] and [IPv6 Tunneling] for ICMPv6 PTB message delivery to the original traffic source. Tunnel sources should perform the relay function to make sure that the original traffic source gets the PTB message faster.

The temporary 220B limit for all headers pushes us toward a frugal implementation of new extension headers. This limit would be alleviated after all backbone links are upgraded to an MTU much bigger than 1500B. Additional protocols to collect MTU information could help, during the transition period, to attach additional headers frugally. This is true for all new protocols: SRv6, SFC, BIER, iOAM, APN6.

[PLPMTUD] and [DPLPMTUD] are not a replacement for [PMTUD], but they could help in some scenarios.

Fragmentation is not at all a solution for oversized packet drops.

5. Security Considerations

[PMTUD], [PLPMTUD], [DPLPMTUD], and [Fragile Fragmentation] discuss a number of attack vectors. This document does not introduce additional security vulnerabilities.

6. IANA Considerations

This document has no request to IANA.

7. References
7.1. Normative References

   [IPv6] S. Deering, R. Hinden, "Internet Protocol, Version 6 (IPv6)
      Specification", RFC 8200, DOI 10.17487/RFC8200, July 2017.

   [ICMPv6] A. Conta, S. Deering, M. Gupta, "Internet Control Message
      Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6)
      Specification", RFC 4443, DOI 10.17487/RFC4443, March 2006.

   [PMTUD] J. McCann, S. Deering, J. Mogul, R. Hinden, "Path MTU
      Discovery for IP version 6", RFC 8201, DOI 10.17487/RFC8201,
      July 2017.

   [IPv6 Tunneling] A. Conta, S. Deering, "Generic Packet Tunneling in
      IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473, December
      1998.

   [ICMPv6 PTB in ECMP] M. Byerly, M. Hite, J. Jaeggli, "Close
      Encounters of the ICMP Type 2 Kind", RFC 7690, DOI
      10.17487/RFC7690, January 2016.

   [MTU issues in Tunneling] P. Savola, "MTU and Fragmentation Issues
      with In-the-Network Tunneling", RFC 4459, DOI 10.17487/RFC4459,
      April 2006.

   [IP Tunnels] J. Touch, M. Townsley, "IP Tunnels in the Internet
      Architecture", draft-ietf-intarea-tunnels-10 (work in progress),
      September 2019.

   [IP Encapsulation] C. Perkins, "IP Encapsulation within IP", RFC
      2003, DOI 10.17487/RFC2003, October 1996.

   [IPSec] S. Kent, K. Seo, "Security Architecture for the Internet
      Protocol", RFC 4301, DOI 10.17487/RFC4301, December 2005.

   [IPv6 GRE] C. Pignataro, R. Bonica, S. Krishnan, "IPv6 Support for
      Generic Routing Encapsulation (GRE)", RFC 7676, DOI
      10.17487/RFC7676, October 2015.

   [MPLS Encapsulation] T. Worster, Y. Rekhter, E. Rosen,
      "Encapsulating MPLS in IP or Generic Routing Encapsulation
      (GRE)", RFC 4023, DOI 10.17487/RFC4023, March 2005.

   [L2TPv3] J. Lau, M. Townsley, I. Goyret, "Layer Two Tunneling
      Protocol - Version 3 (L2TPv3)", RFC 3931, DOI 10.17487/RFC3931,
      March 2005.

   [VxLAN] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T.
      Sridhar, M. Bursell, C. Wright, "Virtual eXtensible Local Area
      Network (VXLAN): A Framework for Overlaying Virtualized Layer 2
      Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348,
      August 2014.

   [NVO3] J. Gross, I. Ganga, T. Sridhar, "Geneve: Generic Network
      Virtualization Encapsulation", RFC 8926, DOI 10.17487/RFC8926,
      November 2020.

   [L3VPN] E. Rosen, Y. Rekhter, "BGP/MPLS IP Virtual Private Networks
      (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 2006.

   [EVPN] A. Sajassi, R. Aggarwal, N. Bitar, A. Isaac, J. Uttaro, J.
      Drake, W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432,
      DOI 10.17487/RFC7432, February 2015.

   [Huston-2016] Huston, G., "Fragmenting IPv6", Blog Post, 2016.

   [Huston-2021] Huston, G., "IPv6 Fragmentation Loss", Article, 2021.

   [Fragile Fragmentation] R. Bonica, F. Baker, G. Huston, R. Hinden,
      O. Troan, F. Gont, "IP Fragmentation Considered Fragile", RFC
      8900, DOI 10.17487/RFC8900, September 2020.

7.2. Informative References

   [PLPMTUD] M. Mathis, J. Heffner, "Packetization Layer Path MTU
      Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007.

   [DPLPMTUD] G. Fairhurst, T. Jones, M. Tuexen, I. Ruengeler, T.
      Voelker, "Packetization Layer Path MTU Discovery for Datagram
      Transports", RFC 8899, DOI 10.17487/RFC8899, March 2020.

   [SRH] C. Filsfils, D. Dukes, S. Previdi, J. Leddy, S. Matsushima,
      D. Voyer, "IPv6 Segment Routing Header (SRH)", RFC 8754, DOI
      10.17487/RFC8754, March 2020.
   [PMTU by HbH] R. Hinden, G. Fairhurst, "IPv6 Minimum Path MTU
      Hop-by-Hop Option", draft-hinden-6man-mtu-option-02 (work in
      progress), July 2019.

   [PMTU by ISIS] Z. Hu, Y. Zhu, Z. Li, L. Dai, "IS-IS Extensions for
      Path MTU", draft-hu-lsr-isis-path-mtu-00 (work in progress),
      June 2018.

   [PMTU by PCEP] S. Peng, C. Li, L. Han, "Support for Path MTU (PMTU)
      in the Path Computation Element (PCE) Communication Protocol
      (PCEP)", draft-li-pce-pcep-pmtu-03 (work in progress), October
      2020.

   [PMTU by BGP-LS] Y. Zhu, Z. Hu, G. Yan, J. Yao, "BGP-LS Extensions
      for Advertising Path MTU", draft-zhu-idr-bgp-ls-path-mtu-05
      (work in progress), November 2020.

   [PMTU by SR-Policy] C. Li, Y. Zhu, A. Sawaf, Z. Li, "Segment
      Routing Path MTU in BGP", draft-li-idr-sr-policy-path-mtu-03
      (work in progress), November 2019.

   [Generic Delivery Functions] Z. Zhang, R. Bonica, K. Kompella,
      "Generic Delivery Functions", draft-zzhang-intarea-generic-
      delivery-functions-01 (work in progress), April 2021.

   [Pseudowire Fragmentation] A. Malis, M. Townsley, "Pseudowire
      Emulation Edge-to-Edge (PWE3) Fragmentation and Reassembly",
      RFC 4623, DOI 10.17487/RFC4623, August 2006.

8. Acknowledgments

Thanks to the v6ops working group for the problem discussion.

Authors' Addresses

   Eduard Vasilenko
   Huawei Technologies
   17/4 Krylatskaya st, Moscow, Russia 121614
   Email: Vasilenko.Eduard@huawei.com

   Xiao Xipeng
   Huawei Technologies
   205 Hansaallee, 40549 Dusseldorf, Germany
   Email: Xipengxiao@huawei.com

   Dmitriy Khaustov
   Rostelecom
   13/2 Nikoloyamskaya st, Moscow, Russia 109240
   Email: Dmitriy.Khaustov@rt.ru