idnits 2.17.1 draft-ietf-intarea-tunnels-10.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. -- The draft header indicates that this document updates RFC4459, but the abstract doesn't seem to directly say this. It does mention RFC4459 though, so this could be OK. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year (Using the creation date from RFC4459, updated by this document, for RFC5378 checks: 2004-06-14) -- The document seems to contain a disclaimer for pre-RFC5378 work, and may have content which was first submitted before 10 November 2008. The disclaimer is necessary when there are original authors that you have been unable to contact, or if some do not wish to grant the BCP78 rights to the IETF Trust. If you are able to get all authors (current and original) to grant those rights, you can and should remove the disclaimer; otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (September 12, 2019) is 1659 days in the past. Is this intentional? 
Checking references for intended status: Best Current Practice ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Duplicate reference: RFC2119, mentioned in 'RFC8174', was also mentioned in 'RFC2119'. == Outdated reference: A later version (-16) exists of draft-ietf-nvo3-geneve-14 == Outdated reference: A later version (-09) exists of draft-ietf-intarea-gue-07 -- Obsolete informational reference (is this intentional?): RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 4960 (Obsoleted by RFC 9260) -- Obsolete informational reference (is this intentional?): RFC 6434 (Obsoleted by RFC 8504) -- Obsolete informational reference (is this intentional?): RFC 6830 (Obsoleted by RFC 9300, RFC 9301) -- Obsolete informational reference (is this intentional?): RFC 6833 (Obsoleted by RFC 9301) Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 9 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Area WG J. Touch 2 Internet Draft Independent consultant 3 Intended status: Best Current Practice M. Townsley 4 Updates: 4459 Cisco 5 Expires: March 2020 September 12, 2019 7 IP Tunnels in the Internet Architecture 8 draft-ietf-intarea-tunnels-10.txt 10 Status of this Memo 12 This Internet-Draft is submitted in full conformance with the 13 provisions of BCP 78 and BCP 79. 15 This document may contain material from IETF Documents or IETF 16 Contributions published or made publicly available before November 17 10, 2008. The person(s) controlling the copyright in some of this 18 material may not have granted the IETF Trust the right to allow 19 modifications of such material outside the IETF Standards Process. 
20 Without obtaining an adequate license from the person(s) controlling 21 the copyright in such materials, this document may not be modified 22 outside the IETF Standards Process, and derivative works of it may 23 not be created outside the IETF Standards Process, except to format 24 it for publication as an RFC or to translate it into languages other 25 than English. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF), its areas, and its working groups. Note that 29 other groups may also distribute working documents as Internet- 30 Drafts. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 The list of current Internet-Drafts can be accessed at 38 http://www.ietf.org/ietf/1id-abstracts.txt 40 The list of Internet-Draft Shadow Directories can be accessed at 41 http://www.ietf.org/shadow.html 43 This Internet-Draft will expire on March 12, 2020. 45 Copyright Notice 47 Copyright (c) 2019 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents 52 (http://trustee.ietf.org/license-info) in effect on the date of 53 publication of this document. Please review these documents 54 carefully, as they describe your rights and restrictions with respect 55 to this document. Code Components extracted from this document must 56 include Simplified BSD License text as described in Section 4.e of 57 the Trust Legal Provisions and are provided without warranty as 58 described in the Simplified BSD License. 60 Abstract 62 This document discusses the role of IP tunnels in the Internet 63 architecture. An IP tunnel transits IP datagrams as payloads in non- 64 link layer protocols. 
This document explains the relationship of IP 65 tunnels to existing protocol layers and the challenges in supporting 66 IP tunneling, based on the equivalence of tunnels to links. The 67 implications of this document are used to derive recommendations that 68 update MTU and fragment issues in RFC 4459. 70 Table of Contents 72 1. Introduction...................................................3 73 2. Conventions used in this document..............................6 74 2.1. Key Words.................................................6 75 2.2. Terminology...............................................6 76 3. The Tunnel Model..............................................10 77 3.1. What is a Tunnel?........................................11 78 3.2. View from the Outside....................................13 79 3.3. View from the Inside.....................................14 80 3.4. Location of the Ingress and Egress.......................15 81 3.5. Implications of This Model...............................15 82 3.6. Fragmentation............................................16 83 3.6.1. Outer Fragmentation.................................16 84 3.6.2. Inner Fragmentation.................................18 85 3.6.3. The Necessity of Outer Fragmentation................19 86 4. IP Tunnel Requirements........................................20 87 4.1. Encapsulation Header Issues..............................20 88 4.1.1. General Principles of Header Fields Relationships...20 89 4.1.2. Addressing Fields...................................21 90 4.1.3. Hop Count Fields....................................21 91 4.1.4. IP Fragment Identification Fields...................22 92 4.1.5. Checksums...........................................23 93 4.2. MTU Issues...............................................24 94 4.2.1. Minimum MTU Considerations..........................24 95 4.2.2. Fragmentation.......................................27 96 4.2.3. 
Path MTU Discovery..................................30 97 4.3. Coordination Issues......................................32 98 4.3.1. Signaling...........................................32 99 4.3.2. Congestion..........................................34 100 4.3.3. Multipoint Tunnels and Multicast....................34 101 4.3.4. Load Balancing......................................35 102 4.3.5. Recursive Tunnels...................................36 103 5. Observations..................................................37 104 5.1. Summary of Recommendations...............................37 105 5.2. Impact on Existing Encapsulation Protocols...............37 106 5.3. Tunnel Protocol Designers................................40 107 5.3.1. For Future Standards................................40 108 5.3.2. Diagnostics.........................................40 109 5.4. Tunnel Implementers......................................41 110 5.5. Tunnel Operators.........................................41 111 6. Security Considerations.......................................42 112 7. IANA Considerations...........................................43 113 8. References....................................................43 114 8.1. Normative References.....................................43 115 8.2. Informative References...................................43 116 9. Acknowledgments...............................................49 117 APPENDIX A: Fragmentation efficiency.............................50 118 A.1. Selecting fragment sizes.................................50 119 A.2. Packing..................................................51 121 1. Introduction 123 The Internet layering architecture is loosely based on the ISO seven 124 layer stack, in which data units traverse the stack by being wrapped 125 inside data units of the next layer down [Cl88][Zi80]. 
A tunnel is a 126 mechanism for transmitting data units between endpoints by wrapping 127 them as data units of the same or higher layers, e.g., IP in IP 128 (Figure 1) or IP in UDP (Figure 2). 130 +----+----+--------------+ 131 | IP'| IP | Data | 132 +----+----+--------------+ 134 Figure 1 IP inside IP 136 +----+-----+----+--------------+ 137 | IP'| UDP | IP | Data | 138 +----+-----+----+--------------+ 140 Figure 2 IP in UDP in IP in Ethernet 142 This document focuses on tunnels that transit IP packets, i.e., in 143 which an IP packet is the payload of another protocol, other than a 144 typical link layer. A tunnel is a virtual link that can help decouple 145 the network topology seen by transiting packets from the underlying 146 physical network [To98][RFC2473]. Tunnels were critical in the 147 development of multicast because not all routers were capable of 148 processing multicast packets [Er94]. Tunnels allowed multicast 149 packets to transit efficiently between multicast-capable routers over 150 paths that did not support native link-layer multicast. Similar 151 techniques have been used to support incremental deployment of other 152 protocols over legacy substrates, such as IPv6 [RFC2546]. 154 Use of tunnels is common in the Internet. 
The word "tunnel" occurs in 155 nearly 1,500 RFCs (of nearly 8,000 current RFCs, close to 20%), and 156 tunneling is supported within numerous protocols, including: 158 o IP in IP / mobile IP - IPv4 in IPv4 tunnels using protocol 4 159 [RFC2003][RFC2473][RFC5944] and its precursor called "IPIP" using 160 protocol 94 [RFC1853] 162 o IP in IPv6 - IPv6 or IPv4 in IPv6 [RFC2473] 164 o IPsec - includes a tunnel mode to enable encryption or 165 authentication of an entire IP datagram inside another IP 166 datagram [RFC4301] 168 o Generic Routing Encapsulation (GRE) - a shim layer for tunneling 169 any network layer in any other network layer, as in IP in GRE in 170 IP [RFC2784][RFC7588][RFC7676], or inside UDP in IP [RFC8086] 172 o MPLS - a shim layer for tunneling IP over a circuit-like path over 173 a link layer [RFC3031] or inside UDP in IP [RFC7510], in which 174 identifiers are rewritten on each hop, often used for traffic 175 provisioning 177 o LISP - a mechanism that uses multipoint IP tunnels to reduce 178 routing table load within an enclave of routers at the expense of 179 more complex tunnel ingress encapsulation tables [RFC6830] 181 o TRILL - a mechanism that uses multipoint L2 tunnels to enable use 182 of L3 routing (typically IS-IS) in an enclave of Ethernet bridges 183 [RFC5556][RFC6325] 185 o Generic UDP Encapsulation (GUE) - IP in UDP in IP [He19] 187 o Automatic Multicast Tunneling (AMT) - IP in UDP in IP for 188 multicast [RFC7450] 190 o L2TP - PPP over IP, to extend a subscriber's DSL/FTTH connection 191 from an access line provider to an ISP [RFC3931] 193 o L2VPNs - provide a link topology different from that provided by 194 physical links [RFC4664]; many of these are not classical tunnels, 195 using only tags (Ethernet VLAN tags) rather than encapsulation 197 o L3VPNs - provide a network topology different from that provided 198 by ISPs [RFC4176] 200 o NVO3 - data center network sharing (to be determined, which may 201 include use of GUE or other
tunnels) [RFC7364] 203 o PWE3 - emulates wire-like services over packet-switched services 204 [RFC3985] 206 o SEAL/AERO - IP in IP tunneling with an additional shim header 207 designed to overcome the limitations of RFC2003 [RFC5320][Te18] 209 o A number of legacy variants, including swIPe (an IPsec precursor), 210 a GRE precursor, and the Internet Encapsulation Protocol, all of 211 which included a shim layer [RFC1853] 213 The variety of tunnel mechanisms raises the question of the role of 214 tunnels in the Internet architecture and the potential need for these 215 mechanisms to have similar and predictable behavior. In particular, 216 the ways in which packet size (i.e., Maximum Transmission Unit or 217 MTU) mismatches and error signals (e.g., ICMP) are handled may 218 benefit from a coordinated approach. 220 Regardless of the layer in which encapsulation occurs, tunnels 221 emulate a link. The only difference is that a link operates over a 222 physical communication channel, whereas a tunnel operates over other 223 software protocol layers. Because tunnels are links, they are subject 224 to the same issues as any link, e.g., MTU discovery, signaling, and 225 the potential utility of native support for broadcast and multicast 226 [RFC3819]. Tunnels have some advantages over native links, being 227 potentially easier to reconfigure and control because they can 228 generally rely on existing out-of-band communication between their 229 endpoints. 231 The first attempt to use large-scale tunnels was to transit multicast 232 traffic across the Internet in 1988, and this resulted in 'tunnel 233 collapse'. At the time, tunnels were not implemented as 234 encapsulation-based virtual links, but rather as loose source routes 235 on un-encapsulated IP datagrams [RFC1075]. 
Then, as now, routers did 236 not support use of the loose source route IP option at line rate, and 237 the multicast traffic caused overload of the so-called "slow path" 238 processing of IP datagrams in software. Using encapsulation tunnels 239 avoided that collapse by allowing the forwarding of encapsulated 240 packets to use the "fast path" hardware processing [Er94]. 242 The remainder of this document describes the general principles of IP 243 tunneling and discusses the key considerations in the design of any 244 protocol that tunnels IP datagrams. It derives its conclusions from 245 the equivalence of tunnels and links and from requirements of 246 existing standards for supporting IPv4 and IPv6 as payloads. 248 2. Conventions used in this document 250 2.1. Key Words 252 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 253 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 254 "OPTIONAL" in this document are to be interpreted as described in BCP 255 14 [RFC2119] [RFC8174] when, and only when, they appear in all 256 capitals, as shown here. 258 2.2. Terminology 260 This document uses the following terminology. Optional words in the 261 term are indicated in parentheses, e.g., "(link or network) 262 interface" or "egress (interface)". 264 Terms from existing RFCs: 266 o Messages: variable length data labeled with globally-unique 267 endpoint IDs, also known as a datagram for IP messages [RFC791]. 269 o Node: a physical or logical network device that participates as 270 either a host [RFC1122][RFC6434] or router [RFC1812]. This term 271 originally referred to gateways since some very early RFCs [RFC5], 272 but is currently the common way to describe a point in a network 273 at which messages are processed. 275 o Host or endpoint: a node that sources or sinks messages labeled 276 from/to its IDs, typically known as a host for both IP and higher- 277 layer protocol messages [RFC1122]. 
279 o Source or sender: the node that generates a message [RFC1122]. 281 o Destination or receiver: the node that consumes a message 282 [RFC1122]. 284 o Router or gateway: a node that relays IP messages using 285 destination IDs and local context [RFC1812]. Routers also act as 286 hosts when they source or sink messages. Also known as a forwarder 287 for IP messages. Note that the notion of router is relative to the 288 layer at which message processing is considered [To16]. 290 o Link: a communications medium (or emulation thereof) that 291 transfers IP messages between nodes without traversing a router 292 (as would require decrementing the hop count) [RFC1122][RFC1812]. 294 o Link packet: a link layer message, which can carry an IP datagram 295 as a payload 297 o (Link or network) Interface: a location on a link co-located with 298 a node where messages depart onto that link or arrive from that 299 link. On physical links, this interface formats the message for 300 transmission and interprets the received signals. 302 o Path: a sequence of one or more links over which an IP message 303 traverses between source and destination nodes (hosts or routers). 305 o (Link) MTU: the largest message that can transit a link [RFC791], 306 also often referred to simply as "MTU". It does not include the 307 size of link-layer information, e.g., link layer headers or 308 trailers, i.e., it refers to the message that the link can carry 309 as a payload rather than the message as it appears on the link. 310 This is thus the largest network layer packet (including network 311 layer headers, e.g., IP datagram) that can transit a link. Note 312 that this need not be the native size of messages on the link, 313 i.e., the link may internally fragment and reassemble messages. 314 For IPv4, the smallest MTU must be at least 68 bytes [RFC791], and 315 for IPv6 the smallest MTU must be at least 1280 bytes [RFC8200]. 
317 o EMTU_S (effective MTU for sending): the largest message that can 318 transit a link, possibly also accounting for fragmentation that 319 happens before the fragments are emitted onto the link [RFC1122]. 320 When source fragmentation is possible, EMTU_S = EMTU_R. When 321 source fragmentation is not possible, EMTU_S = (link) MTU. For 322 IPv4, this MUST be at least 68 bytes [RFC791] and for IPv6 this 323 MUST be at least 1280 bytes [RFC8200]. 325 o EMTU_R (effective MTU to receive): the largest payload message 326 that a receiver must be able to accept. This thus also represents 327 the largest message that can traverse a link, taking into account 328 reassembly at the receiver that happens after the fragments are 329 received [RFC1122]. For IPv4, this MUST be at least 576 bytes 330 [RFC791] and for IPv6 this MUST be at least 1500 bytes [RFC8200]. 332 o Path MTU (PMTU): the largest message that can transit a path of 333 links [RFC1191][RFC8201]. Typically, this is the minimum of the 334 link MTUs of the links of the path, and represents the largest 335 network layer message (including network layer headers) that can 336 transit a path without requiring fragmentation while in transit. 337 Note that this is not the largest network packet that can be sent 338 between a source and destination, because that network packet 339 might have been fragmented at the network layer of the source and 340 reassembled at the network layer of the destination. 342 o Tunnel: a protocol mechanism that transits messages between an 343 ingress interface and egress interface using encapsulation to 344 allow an existing network path to appear as a single link 345 [RFC1853]. Note that a protocol can be used to tunnel itself (IP 346 over IP). There is essentially no difference between a tunnel and 347 the conventional layering of the ISO stack (i.e., by this 348 definition, Ethernet can be considered a tunnel for IP). A tunnel 349 is also known as a virtual link. 
351 o Ingress (interface): the virtual link interface of a tunnel that 352 receives messages within a node, encapsulates them according to 353 the tunnel protocol, and transmits them into the tunnel [RFC2983]. 354 An ingress is the tunnel equivalent of the outgoing (departing) 355 network interface of a link, and its encapsulation processing is 356 the tunnel equivalent of encoding a message for transmission over 357 a physical link. The ingress virtual link interface can be co- 358 located with the traffic source. 360 The term 'ingress' in other RFCs also refers to 'network ingress', 361 which is the entry point of traffic to a transit network. Because 362 this document focuses on tunnels, the term "ingress" used in the 363 remainder of this document implies "tunnel ingress". 365 o Egress (interface): a virtual link interface of a tunnel that 366 receives messages that have finished transiting a tunnel and 367 presents them to a node [RFC2983]. For reasons similar to ingress, 368 the term 'egress' will refer to 'tunnel egress' throughout the 369 remainder of this document. An egress is the tunnel equivalent of 370 the incoming (arriving) network interface of a link and its 371 decapsulation processing is the tunnel equivalent of interpreting 372 a signal received from a physical link. The egress decapsulates 373 messages for further transit to the destination. The egress 374 virtual link interface can be co-located with the traffic 375 destination. 377 o Ingress node: network device on which an ingress is attached as a 378 virtual link interface [RFC2983]. Note that a node can act as both 379 an ingress node and an egress node at the same time, but typically 380 only for different tunnels. 382 o Egress node: device where an egress is attached as a virtual link 383 interface [RFC2983]. Note that a device can act as both an ingress 384 node and an egress node at the same time, but typically only for 385 different tunnels. 
387 o Inner header: the header of the message as it arrives at the 388 ingress [RFC2003]. 390 o Outer header(s): one or more headers added to the message by the 391 ingress, as part of the encapsulation for tunnel transit 392 [RFC2003]. 394 o Mid-tunnel fragmentation: Fragmentation of the message during the 395 tunnel transit, as could occur for IPv4 datagrams with DF=0 396 [RFC2983]. 398 o Atomic packet, datagram, or fragment: an IP packet that has not 399 been fragmented and which cannot be fragmented further [RFC6864] 400 [RFC6946]. 402 The following terms are introduced by this document: 404 o (Tunnel) transit packet: the packet arriving at a node connected 405 to a tunnel that enters the ingress interface and exits the egress 406 interface, i.e., the packet carried over the tunnel. This is 407 sometimes known as the 'tunneled packet'. It is the tunnel 408 equivalent of a network layer 409 packet as it would traverse a link. This document focuses on IPv4 410 and IPv6 transit packets. 412 o (Tunnel) link packet (TLP): a packet that traverses between two 413 interfaces, e.g., from ingress interface to egress interface, and 414 that carries all or part of a transit packet. A tunnel link 415 packet is the tunnel equivalent of a link (layer) packet as it 416 would traverse a link, which is why we use the same terminology. 418 o Tunnel MTU: the largest transit packet that can traverse a tunnel, 419 i.e., the tunnel equivalent of a link MTU, which is why we use the 420 same terminology. This is the largest transit packet which can be 421 reassembled at the egress interface. 423 o Tunnel maximum atomic packet (MAP): the largest transit packet 424 that can traverse a tunnel as an atomic packet, i.e., without 425 requiring tunnel link packet fragmentation either at the ingress 426 or on-path between the ingress and egress. 
428 o Inner fragmentation: fragmentation of the transit packet that 429 arrives at the ingress interface before any additional headers are 430 added. This can only correctly occur for IPv4 DF=0 datagrams. 432 o Outer fragmentation: source fragmentation of the tunnel link 433 packet after encapsulation; this can involve fragmenting the 434 outermost header or any of the other (if any) protocol layers 435 involved in encapsulation. 437 o Maximum frame size (MFS): the link-layer equivalent of the MTU, 438 using the OSI term 'frame'. For Ethernet, the MTU (network packet 439 size) is 1500 bytes but the MFS (link frame size) is 1518 bytes 440 originally, and 1522 bytes assuming VLAN (802.1Q) tagging support. 442 o EMFS_S: the link layer equivalent of EMTU_S. 444 o EMFS_R: the link layer equivalent of EMTU_R. 446 o Path MFS: the link layer equivalent of PMTU. 448 3. The Tunnel Model 450 A network architecture is an abstract description of a distributed 451 communications system, its components and their relationships, the 452 requisite properties of those components and the emergent properties 453 of the system that result [To03]. Such descriptions can help explain 454 behavior, as when the OSI seven-layer model is used as a teaching 455 example [Zi80]. Architectures describe capabilities - and, just as 456 importantly, constraints. 458 A network can be defined as a system of endpoints and relays 459 interconnected by communication paths, abstracting away issues of 460 naming in order to focus on message forwarding. 
To the extent that 461 the Internet has a single, coherent interpretation, its architecture 462 is defined by its core protocols (IP [RFC791], TCP [RFC793], UDP 463 [RFC768]) whose messages are handled by hosts, routers, and links 464 [Cl88][To03], as shown in Figure 3: 466 +------+ ------ ------ +------+ 467 | | / \ / \ | | 468 | HOST |--+ ROUTER +--+ ROUTER +--| HOST | 469 | | \ / \ / | | 470 +------+ ------ ------ +------+ 472 Figure 3 Basic Internet architecture 474 As a network architecture, the Internet is a system of hosts 475 (endpoints) and routers (relays) interconnected by links that 476 exchange messages when possible. "When possible" defines the 477 Internet's "best effort" principle. The limited role of routers and 478 links represents the End-to-End Principle [Sa84] and longest-prefix 479 match enables hierarchical forwarding using compact tables. 481 Although the definitions of host, router, and link seem absolute, 482 they are often relative as viewed within the context of one protocol 483 layer, each of which can be considered a distinct network 484 architecture. An Internet gateway is an OSI Layer 3 router when it 485 transits IP datagrams but it acts as an OSI Layer 2 host as it 486 sources or sinks Layer 2 messages on attached links to accomplish 487 this transit capability. In this way, one device (Internet gateway) 488 behaves as different components (router, host) at different layers. 490 Even though a single device may have multiple roles - even 491 concurrently - at a given layer, each role is typically static and 492 determined by context. An Internet gateway always acts as a Layer 2 493 host and that behavior does not depend on where the gateway is viewed 494 from within Layer 2. 
In the context of a single layer, a device's 495 behavior is typically modeled as a single component from all 496 viewpoints in that layer (with some notable exceptions, e.g., Network 497 Address Translators, which appear as hosts and routers, depending on 498 the direction of the viewpoint [To16]). 500 3.1. What is a Tunnel? 502 A tunnel can be modeled as a link in another network 503 [To98][To01][To03]. In Figure 4, a source host (Hsrc) and destination 504 host (Hdst) communicate over a network M in which two routers (Ra 505 and Rd) are connected by a tunnel. Keep in mind that 506 network N and network M can both be components of the 507 Internet, i.e., there may be regular traffic as well as tunneled 508 traffic over any of the routers shown. 510 --_ -- 511 +------+ / \ / \ +------+ 512 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 513 +------+ \ //\ / \ / \ /\\ / +------+ 514 --/I \--+ Rb +--+ Rc +--/E \-- 515 \ / \ / \ / \ / 516 \/ -- -- \/ 517 <------ Network N -------> 518 <-------------------- Network M ---------------------> 520 Figure 4 The big picture 522 The tunnel consists of two interfaces - an ingress (I) and an egress 523 (E) that lie along a path connected by network N. Regardless of how 524 the ingress and egress interfaces are connected, the tunnel serves as 525 a link between the nodes it connects (here, Ra and Rd). 527 IP packets arriving at the ingress interface are encapsulated to 528 traverse network N. We call these packets 'tunnel transit packets' 529 (or just 'transit packets') because they will transit the tunnel 530 inside one or more of what we call 'tunnel link packets'. Transit 531 packets correspond to network (IP) packets traversing a conventional 532 link and tunnel link packets correspond to the packets of a 533 conventional link layer (which can be called just 'link packets'). 
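The relationship between a transit packet and the tunnel link packets that carry it can be illustrated with a minimal Python sketch. It is not taken from any tunnel protocol: the TunnelLinkPacket fields, the encapsulate and decapsulate functions, and the max_payload parameter are hypothetical names chosen for illustration, and the labels "I" and "E" simply reuse the ingress and egress names of Figure 4. The sketch assumes the ingress may spread one transit packet over several tunnel link packets and that the egress restores it before the packet continues on network M.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TunnelLinkPacket:
    """Hypothetical tunnel link packet: outer header fields plus payload."""
    ingress: str    # label of the ingress interface (I in Figure 4)
    egress: str     # label of the egress interface (E in Figure 4)
    index: int      # position of this piece within the transit packet
    last: bool      # True only for the final piece
    payload: bytes  # all or part of the transit packet


def encapsulate(transit: bytes, max_payload: int,
                ingress: str = "I", egress: str = "E") -> List[TunnelLinkPacket]:
    """Ingress side: carry one transit packet in one or more link packets."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    # Split the transit packet; an empty packet still needs one link packet.
    pieces = [transit[i:i + max_payload]
              for i in range(0, len(transit), max_payload)] or [b""]
    return [TunnelLinkPacket(ingress, egress, n, n == len(pieces) - 1, piece)
            for n, piece in enumerate(pieces)]


def decapsulate(packets: List[TunnelLinkPacket]) -> bytes:
    """Egress side: restore the transit packet from its link packets."""
    ordered = sorted(packets, key=lambda p: p.index)
    complete = (bool(ordered)
                and ordered[-1].last
                and [p.index for p in ordered] == list(range(len(ordered))))
    if not complete:
        raise ValueError("incomplete set of tunnel link packets")
    return b"".join(p.payload for p in ordered)
```

In this sketch, a transit packet that fits within max_payload travels in a single tunnel link packet, while a larger one is split at the ingress and restored at the egress, so the nodes attached to the tunnel see only the original transit packet.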
535 Link packets use the source address of the ingress interface and the 536 destination address of the egress interface - using whatever address 537 is appropriate to the Layer at which the ingress and egress 538 interfaces operate (Layer 2, Layer 3, Layer 4, etc.). The egress 539 interface decapsulates those messages, which then continue on network 540 M as if emerging from a link. To transit packets and to the routers 541 the tunnel connects (Ra and Rd), the tunnel acts as a link and the 542 ingress and egress interfaces act as network interfaces to that link. 544 The model of each component (ingress and egress interfaces) and the 545 entire system (tunnel) depends on the layer from which they are 546 viewed. From the perspective of the outermost hosts (Hsrc and Hdst), 547 the tunnel appears as a link between two routers (Ra and Rd). For 548 routers along the tunnel (e.g., Rb and Rc), the ingress and egress 549 interfaces appear as the endpoint hosts on network N. 551 When the tunnel network (N) is implemented using the same protocol as 552 the endpoint network (M), the picture looks flatter (Figure 5), as if 553 it were running over a single network. However, this appearance is 554 incorrect - nothing has changed from the previous case. From the 555 perspective of the endpoints, Rb and Rc and network N don't exist and 556 aren't visible, and from the perspective of the tunnel, network M 557 doesn't exist. The fact that network N and M use the same protocol, 558 and may traverse the same links is irrelevant. 560 --_ -- -- -- 561 +------+ / \ /\ / \ / \ /\ / \ +------+ 562 | Hsrc |--+ Ra +/I \--+ Rb +--+ Rc +--/E \+ Rd +--| Hdst | 563 +------+ \ / \ / \ / \ / \ / \ / +------+ 564 -- \/ -- -- \/ -- 565 <---- Network N -----> 566 <------------------ Network M -------------------> 568 Figure 5 IP in IP network picture 570 3.2. View from the Outside 572 As already observed, from outside the tunnel, to network M, the 573 entire tunnel acts as a link (Figure 6). 
Consequently, all 574 requirements for links supporting IP also apply to tunnels [RFC3819]. 576 --_ -- 577 +------+ / \ / \ +------+ 578 | Hsrc |--+ Ra +--------------------------+ Rd +--| Hdst | 579 +------+ \ / \ / +------+ 580 -- -- 581 <------------------ Network M -------------------> 583 Figure 6 Tunnels as viewed from the outside 585 For example, the IP datagram hop counts (IPv4 Time-to-Live [RFC791] 586 and IPv6 Hop Limit [RFC8200]) are decremented when traversing a 587 router, but not when traversing a link - or thus a tunnel. Similarly, 588 because the ingress and egress are interfaces on this outer network, 589 they should never issue ICMP messages. A router or host would issue 590 the appropriate ICMP, e.g., "packet too big" (IPv4 fragmentation 591 needed and DF set [RFC792] or IPv6 packet too big [RFC4443]), when 592 trying to send a packet to the egress, as it would for any interface. 594 Tunnels have a tunnel MTU - the largest message that can transit that 595 tunnel, just as links have a link MTU. A link MTU may not reflect the 596 native message size of hops within a multihop link, and 597 the same is true for a tunnel MTU. In both cases, the MTU is defined by 598 the link's (or tunnel's) effective MTU to receive (EMTU_R). 600 3.3. View from the Inside 602 Within network N, i.e., from inside the tunnel itself, the ingress 603 interface is a source of tunnel link packets and the egress interface 604 is a sink - so both are viewed as hosts on network N (Figure 7). 605 Consequently, Internet host requirements [RFC1122] apply to ingress 606 and egress interfaces when Network N uses IP (and thus the 607 ingress/egress interfaces use IP encapsulation). 609 _ -- -- 610 /\ / \ / \ /\ 611 /I \--+ Rb +--+ Rc +--/E \ 612 \ / \ / \ / \ / 613 \/ -- -- \/ 614 <---- Network N -----> 616 Figure 7 Tunnels, as viewed from within the tunnel 618 Viewed from within the tunnel, the outer network (M) doesn't exist. 
619 Tunnel link packets can be fragmented by the source (ingress 620 interface) and reassembled at the destination (egress interface), 621 just as at conventional hosts. The path between ingress and egress 622 interfaces has a path MTU, but the endpoints can exchange messages as 623 large as can be reassembled at the destination (egress interface), 624 i.e., the EMTU_R of the egress interface. However, in both cases, 625 these MTUs refer to the size of the message that can transit the 626 links and between the hosts of network N, which represents a link 627 layer to network M. I.e., the MTUs of network N represent the maximum 628 frame sizes (MFSs) of the tunnel as a link in network M. 630 Information about the network - i.e., regarding network N MTU sizes, 631 network reachability, etc. - are relayed from the destination (egress 632 interface) and intermediate routers back to the source (ingress 633 interface), without regard for the external network (M). When such 634 messages arrive at the ingress interface, they may affect the 635 properties of that interface (e.g., its reported MTU to network M), 636 but they should never directly cause new ICMPs in the outer network 637 M. Again, events at interfaces don't generate ICMP messages; it would 638 be the host or router at which that interface is attached that would 639 generate ICMPs, e.g., upon attempting to use that interface. 641 3.4. Location of the Ingress and Egress 643 The ingress and egress interfaces are endpoints of the tunnel. Tunnel 644 interfaces may be physical or virtual. The interface may be 645 implemented inside the node where the tunnel attaches, e.g., inside a 646 host or router. The interface may also be implemented as a "bump in 647 the wire" (BITW), somewhere along a link between the two nodes the 648 link interconnects. IP in IP tunnels are often implemented as 649 interfaces on nodes, whereas IPsec tunnels are sometimes implemented 650 as BITW. 
These implementation variations determine only whether 651 information available at the link endpoints (ingress/egress 652 interfaces) can be easily shared with the connected network nodes. 654 An ingress or egress can be implemented as an integrated component, 655 appearing equivalent to any other network interface, or can be more 656 complex. In the simple variant, each is tightly coupled to another 657 network interface, e.g., where the ingress emits encapsulated packets 658 directly into another network interface, or where the egress receives 659 packets to decapsulate directly from another network interface. 661 The other implementation variant is more modular, but more complex to 662 explain. The ingress acts like a network interface by receiving IP 663 packets to transmit from an upper layer protocol (or relay mechanism 664 of a router), but then acts like an upper layer protocol (or relay 665 mechanism of a router) when it emits encapsulated packets back into 666 the same node. The egress acts like an upper layer interface (or 667 relay mechanism of a router) by receiving packets from a network 668 interface, but then acts like a network interface when it emits 669 decapsulated packets back into the same node. To the existing 670 network interfaces, the ingress/egress act like upper layer 671 interfaces (i.e., sending or receiving application stacks), while to 672 the interior of the node, the ingress/egress act like network 673 interfaces. This dual nature inside the node reflects the duality of 674 the tunnel as transit link and host-host channel. 676 3.5.
Implications of This Model 678 This approach highlights a few key features of a tunnel as a network 679 architecture construct: 681 o To the transit packets, tunnels turn a network (Layer 3) path into 682 a (Layer 2) link 684 o To nodes the tunnel traverses, the tunnel ingress and egress 685 interfaces act as hosts that source and sink tunnel link packets 687 The consequences of these features are as follows: 689 o Like a link MTU, a tunnel MTU is defined by the effective MTU of 690 the receiver (i.e., EMTU_R of the egress). 692 o The messages inside the tunnel are treated like any other link 693 layer, i.e., the MTU is determined by the largest (transit) 694 payload that traverses the link. 696 o The tunnel path MFS is not relevant to the transited traffic. 697 There is no mechanism or protocol by which it can be determined. 699 o Because routers, not links, alter hop counts [RFC1812], hop counts 700 are not decremented solely by the transit of a tunnel. A packet 701 with a hop count of zero should successfully transit a link (and 702 thus a tunnel) that connects two hosts. 704 o The addresses of a tunnel ingress and egress interface correspond 705 to link layer addresses to the transit packet. Like links, some 706 tunnels may not have their own addresses. Like network interfaces, 707 ingress and egress interfaces typically require network layer 708 addresses. 710 o Like network interfaces, the ingress and egress interfaces are 711 never a direct source of ICMP messages but may provide information 712 to their attached host or router to generate those ICMP messages 713 during the processing of transit packets. 715 o Like network interfaces and links, two nodes may be connected by 716 any combination of tunnels and links, including multiple tunnels. 717 As with multiple links, existing network layer forwarding 718 determines which IP traffic uses each link or tunnel.
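The hop count consequence listed above can be illustrated with a minimal sketch. It assumes a path is described as a list of element kinds as seen from network M; the function name and the "router" / "link" / "tunnel" labels are hypothetical and chosen only for illustration.

```python
def hop_count_after(initial: int, path_elements: list[str]) -> int:
    """Return the transit packet hop count after crossing the given elements.

    Only routers decrement the hop count; links and tunnels (which act as
    links in network M) leave it unchanged. Drop-on-expiry is not modeled.
    """
    hops = initial
    for kind in path_elements:
        if kind == "router":
            hops -= 1
    return hops


# Hsrc -- Ra ==tunnel== Rd -- Hdst, as seen from network M.
# Routers inside the tunnel (Rb, Rc) do not appear at this layer.
print(hop_count_after(64, ["link", "router", "tunnel", "router", "link"]))

# A tunnel that directly connects two hosts does not alter the hop count.
print(hop_count_after(0, ["tunnel"]))
```

In this model the routers inside the tunnel affect only the hop count of the tunnel link packet, which is a separate header field in network N.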
720 These observations make it much easier to determine what a tunnel 721 must do to transit IP packets, notably it must satisfy all 722 requirements expected of a link [RFC1122][RFC3819]. The remainder of 723 this document explores these implications in greater detail. 725 3.6. Fragmentation 727 There are two places where fragmentation can occur in a tunnel, 728 called 'outer fragmentation' and 'inner fragmentation'. This document 729 assumes that only outer fragmentation is viable because it is the 730 only approach that works for both IPv4 datagrams with DF=1 and IPv6. 732 3.6.1. Outer Fragmentation 734 Outer fragmentation is shown in Figure 8. The bottom of the figure 735 shows the network topology, where transit packets originate at the 736 source, enter the tunnel at the ingress interface for encapsulation, 737 exit the tunnel at the egress interface where they are decapsulated, 738 and arrive at the destination. The packet traffic is shown above the 739 topology, where the transit packets are shown at the top. In this 740 diagram, the ingress interface is located on router 'Ra' and the 741 egress interface is located on router 'Rd'. 743 When the link packet - which is the encapsulated transit packet - 744 would exceed the tunnel MTU, the packet needs to be fragmented. In 745 this case the packet is fragmented at the outer (link) header, with 746 the fragments shown as (b1) and (b2). The outer header indicates 747 fragmentation (as ' and "), the inner (transit) header occurs only in 748 the first fragment, and the inner (transit) data is broken across the 749 two packets. These fragments are reassembled at the egress interface 750 during decapsulation in step (c), where the resulting link packet is 751 reassembled and decapsulated so that the transit packet can continue 752 on its way to the destination. 
754 Transit packet 755 +----+----+ +----+----+ 756 | iH | iD |------+ - - - - - - - - - - +------>| iH | iD | 757 +----+----+ | | +----+----+ 758 v Link packet | 759 +----+----+----+ +----+----+----+ 760 (a) | oH | iH | iD | | oH | iH | iD | (d) 761 +----+----+----+ +----+----+----+ 762 | ^ 763 | Link packet fragment #1 | 764 | +----+----+-----+ | 765 (b1) +----- >| oH'| iH | iD1 |-------+ (c) 766 | +----+----+-----+ | 767 | | 768 | Link packet fragment #2 | 769 | +----+-----+ | 770 (b2) +----- >| oH"| iD2 |------------+ 771 +----+-----+ 772 +-----+ +--+ +---+ +---+ +--+ +-----+ 773 | | | |/ \ / \| | | | 774 | Src |----|Ra|Ingress|=======================|Egress |Rd|----| Dst | 775 | | | |\ / \ /| | | | 776 +-----+ +--+ +---+ +---+ +--+ +-----+ 778 Figure 8 Fragmentation of the (outer) link packet 780 Outer fragmentation isolates the tunnel encapsulation duties to the 781 ingress and egress interfaces. This can be considered a benefit in 782 clean, layered network design, but also may require complex egress 783 interface decapsulation, especially where tunnels aggregate large 784 amounts of traffic, such as may result in IP ID overload (see Sec. 785 4.1.4). Outer fragmentation is valid for any tunnel link protocol 786 that supports fragmentation (e.g., IPv4 or IPv6), in which the tunnel 787 endpoints act as the host endpoints of that protocol. 789 Along the tunnel, the inner (transit) header is contained only in the 790 first fragment, which can interfere with mechanisms that 'peek' into 791 lower layer headers, e.g., as for relayed ICMP (see Sec. 4.3). 793 3.6.2. Inner Fragmentation 795 Inner fragmentation distributes the impact of tunnel fragmentation 796 across both egress interface decapsulation and transit packet 797 destination, as shown in Figure 9; this can be especially important 798 when the tunnel would otherwise need to source (outer) fragment large 799 amounts of traffic. 
However, this mechanism is valid only when the 800 transit packets can be fragmented on-path, e.g., as when the transit 801 packets are IPv4 datagrams with DF=0. 803 Again, the network topology is shown at the bottom of the figure, and 804 the original packets are shown at the top. Packets arrive at the ingress 805 node (router Ra) and are fragmented there into transit packet 806 fragments #1 (a1) and #2 (a2). These fragments are encapsulated at 807 the ingress interface in steps (b1) and (b2) and each resulting link 808 packet traverses the tunnel. When these link packets arrive at the 809 egress interface they are decapsulated in steps (c1) and (c2) and the 810 egress node (router) forwards the transit packet fragments to their 811 destination. This destination is then responsible for reassembling 812 the transit packet fragments into the original transit packet (d). 814 Along the tunnel, the inner headers are copied into each fragment, 815 and so can be 'peeked at' inside the tunnel (see Sec. 4.3). 816 Fragmentation shifts from the ingress interface to the ingress router 817 and reassembly shifts from the egress interface to the destination.
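The difference in where the inner header ends up can be seen in a small sketch that contrasts the two approaches. This is an illustrative model only: packets are plain byte strings, the outer header is a text label, the inner header is copied verbatim (real fragments carry modified headers, shown as iH' and iH" in the figures), and fragment alignment rules are ignored. The function names are hypothetical.

```python
def outer_fragment(inner_hdr: bytes, inner_data: bytes, chunk: int):
    """Encapsulate first, then fragment the link packet payload."""
    link_payload = inner_hdr + inner_data
    pieces = [link_payload[i:i + chunk]
              for i in range(0, len(link_payload), chunk)]
    # Each piece gets its own outer header; the label marks the fragment.
    return [(f"oH#{k}", piece) for k, piece in enumerate(pieces, 1)]


def inner_fragment(inner_hdr: bytes, inner_data: bytes, chunk: int):
    """Fragment the transit packet first, then encapsulate each fragment."""
    room = chunk - len(inner_hdr)
    if room <= 0:
        raise ValueError("chunk too small to carry the inner header")
    pieces = [inner_data[i:i + room] for i in range(0, len(inner_data), room)]
    # Every link packet carries a copy of the inner header.
    return [("oH", inner_hdr + piece) for piece in pieces]


outer = outer_fragment(b"IH", b"abcdefgh", 5)
inner = inner_fragment(b"IH", b"abcdefgh", 5)

# Count link packets in which an on-path observer could find the inner header.
print(sum(p.startswith(b"IH") for _, p in outer))
print(sum(p.startswith(b"IH") for _, p in inner))
```

With outer fragmentation only the first link packet fragment begins with the inner header, whereas with inner fragmentation every link packet does.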
819 Transit packet 820 +----+----+ +----+----+ 821 | iH | iD |-+ - - - - - - - - - - - - - - - - >| iH | iD | 822 +----+----+ | +----+----+ 823 v Transit packet fragment #1 ^ 824 +----+-----+ +----+-----+ | 825 (a1) | iH'| iD1 | | iH'| iD1 |-----+(d) 826 +----+-----+ +----+-----+ ^ 827 | | Link packet #1 ^ | 828 | | +----+----+----- | | 829 | (b1)+----- >| oH | iH'| iD1 |-------+(c1) | 830 | +----+----+-----+ | 831 | | 832 v Transit packet fragment #2 | 833 +----+-----+ +----+-----+ | 834 (a2) | iH"| iD2 | | iH"| iD2 |-----+ 835 +----+-----+ +----+-----+ 836 | Link packet #2 | 837 | +----+----+-----+ | 838 (b2)+----- >| oH | iH"| iD2 |-------+(c2) 839 +----+----+-----+ 840 +-----+ +--+ +---+ +---+ +--+ +-----+ 841 | | | |/ \ / \| | | | 842 | Src |----|Ra|Ingress|=======================|Egress |Rd|----| Dst | 843 | | | |\ / \ /| | | | 844 +-----+ +--+ +---+ +---+ +--+ +-----+ 846 Figure 9 Fragmentation of the inner (transit) packet 848 3.6.3. The Necessity of Outer Fragmentation 850 Fragmentation is critical for tunnels that support transit packets 851 for protocols with minimum MTU requirements, while operating over 852 tunnel paths using protocols that have their own MTU requirements. 853 Depending on the amount of space used by encapsulation, these two 854 minimums will ultimately interfere (especially when a protocol 855 transits itself either directly, as with IP-in-IP, or indirectly, as 856 in IP-in-GRE-in-IP), and the transit packet will need to be 857 fragmented to both support a tunnel MTU while traversing tunnels with 858 their own tunnel path MTUs. 860 Outer fragmentation is the only solution that supports all IPv4 and 861 IPv6 traffic, because inner fragmentation is allowed only for IPv4 862 datagrams with DF=0. 864 4. IP Tunnel Requirements 866 The requirements of an IP tunnel are defined by the requirements of 867 an IP link because both transit IP packets. 
A tunnel thus must 868 transit the IP minimum MTU, i.e., 68 bytes for IPv4 [RFC791] and 1280 869 bytes for IPv6 [RFC8200], and a tunnel must support address resolution 870 when there is more than one egress interface for that tunnel. 872 The requirements of the tunnel ingress and egress interfaces are 873 defined by the network over which they exchange messages (link 874 packets). For IP-over-IP, this means that the ingress interface MUST 875 NOT exceed the IP fragment identification field uniqueness 876 requirements [RFC6864]. Uniqueness is more difficult to maintain at 877 high packet rates for IPv4, whose fragment ID field is only 16 bits. 879 These requirements remain even though tunnels have some unique 880 issues, including the need for additional space for encapsulation 881 headers and the potential for tunnel MTU variation. 883 4.1. Encapsulation Header Issues 885 Tunnel encapsulation uses a non-link protocol as a link layer. The 886 encapsulation layer thus has the same requirements and expectations 887 as any other IP link layer when used to transit IP packets. These 888 relationships are addressed in the following subsections. 890 4.1.1. General Principles of Header Fields Relationships 892 Some tunnel specifications attempt to relate the header fields of the 893 transit packet and tunnel link packet. In some cases, this 894 relationship is warranted, whereas in other cases the two protocol 895 layers need to be isolated from each other. For example, the tunnel 896 link header source and destination addresses are network endpoints in 897 the tunnel network N, but have no meaning in the outer network M. The 898 two sets of addresses are effectively independent, just as are other 899 network and link addresses. 901 Because the tunneled packet uses source and destination addresses 902 with a separate meaning, it is inappropriate to copy or reuse the 903 IPv4 Identification (ID) or IPv6 Fragment ID fields of the tunnel 904 transit packet (see Section 4.1.4).
Similarly, the DF field of the 905 transit packet is not related to that field in the tunnel link packet 906 header (presuming both are IPv4) (see Section 4.2). Most other fields 907 are similarly independent between the transit packet and tunnel link 908 packet. When a field value is generated in the encapsulation header, 909 its meaning should be derived from what is desired in the context of 910 the tunnel as a link. When feedback is received from these fields, 911 they should be presented to the tunnel ingress and egress as if they 912 were network interfaces. The behavior of the node where these 913 interfaces attach should be identical to that of a conventional link. 915 There are exceptions to this rule that are explicitly intended to 916 relay signals from inside the tunnel to the network outside the 917 tunnel, typically relevant only when the tunnel network N and the 918 outer network M use the same network. These apply only when that 919 coordination is defined, as with explicit congestion notification 920 (ECN) [RFC6040] (see Section 4.3.2), and differentiated services code 921 points (DSCPs) [RFC2983]. Equal-cost multipath routing may also 922 affect how some encapsulation fields are set, including IPv6 flow 923 labels [RFC6438] and source ports for transport protocols when used 924 for tunnel encapsulation [RFC8085] (see Section 4.3.4). 926 4.1.2. Addressing Fields 928 Tunnel ingresses and egresses have addresses associated with the 929 encapsulation protocol. These addresses are the source and 930 destination (respectively) of the encapsulated packet while 931 traversing the tunnel network. 933 Tunnels may or may not have addresses in the network whose traffic 934 they transit (e.g., network M in Figure 4). In some cases, the tunnel 935 is an unnumbered interface to a point-to-point virtual link. When the 936 tunnel has multiple egresses, tunnel interfaces require separate 937 addresses in network M. 
939 To see the effect of tunnel interface addresses, consider traffic 940 sourced at router Ra in Figure 4. Even before being encapsulated by 941 the ingress, traffic needs a source IP network address that belongs 942 to the router. One option is to use an address associated with one of 943 the other interfaces of the router [RFC1122]. Another option is to 944 assign a number to the tunnel interface itself. Regardless of which 945 address is used, the resulting IP packet is then encapsulated by the 946 tunnel ingress using the ingress address as a separate operation. 948 4.1.3. Hop Count Fields 950 The Internet hop count field is used to detect and avoid forwarding 951 loops that cannot be corrected without a synchronized reboot. The 952 IPv4 Time-to-Live (TTL) and IPv6 Hop Limit field each serve this 953 purpose [RFC791][RFC8200]. The IPv4 TTL field was originally intended 954 to indicate packet expiration time, measured in seconds. A router is 955 required to decrement the TTL by at least one or the number of 956 seconds the packet is delayed, whichever is larger [RFC1812]. Packets 957 are rarely held that long, and so the field has come to represent the 958 count of the number of routers traversed. IPv6 makes this meaning 959 more explicit. 961 These hop count fields represent the number of network forwarding 962 elements (routers) traversed by an IP datagram. An IP datagram with a 963 hop count of zero can traverse a link between two hosts because it 964 never visits a router (where it would need to be decremented and 965 would have been dropped). 967 An IP datagram traversing a tunnel thus need not have its hop count 968 modified, i.e., the tunnel transit header need not be affected. A 969 zero hop count datagram should be able to traverse a tunnel as easily 970 as it traverses a link. 
A router MAY be configured to decrement 971 packets traversing a particular link (and thus a tunnel), which may 972 be useful in emulating a tunnel path as if it were a network path 973 that traversed one or more routers, but this is strictly optional. 974 The ability of the outer network M and tunnel network N to avoid 975 indefinitely looping packets does not rely on the hop counts of the 976 transit packet and tunnel link packet being related. 978 The hop count field is also used by several protocols to determine 979 whether endpoints are 'local', i.e., connected to the same subnet 980 (link-local discovery and related protocols [RFC4861]). A tunnel is a 981 way to make a remote network address appear directly-connected, so it 982 makes sense that the other ends of the tunnel appear local and that 983 such link-local protocols operate over tunnels unless configured 984 explicitly otherwise. When the interfaces of a tunnel are numbered, 985 these can be interpreted the same way as if they were on the same 986 link subnet. 988 4.1.4. IP Fragment Identification Fields 990 Both IPv4 and IPv6 include an IP Identification (ID) field to support 991 IP datagram fragmentation and reassembly [RFC791][RFC1122][RFC8200]. 992 When used, the ID field is intended to be unique for every packet for 993 a given source address, destination address, and protocol, such that 994 it does not repeat within the Maximum Segment Lifetime (MSL). 996 For IPv4, this field is in the default header and is meaningful only 997 when either source fragmented or DF=0 ("non-atomic packets") 998 [RFC6864]. For IPv6, this field is contained in the optional Fragment 999 Header [RFC8200]. Although IPv6 supports only source fragmentation, 1000 the field may occur in atomic fragments [RFC6946]. 1002 Although the ID field was originally intended for fragmentation and 1003 reassembly, it can also be used to detect and discard duplicate 1004 packets, e.g., at congested routers (see Sec. 
3.2.1.5 of [RFC1122]). 1006 For this reason, and because IPv4 packets can be fragmented anywhere 1007 along a path, all non-atomic IPv4 packets and all IPv6 packets 1008 between a source and destination of a given protocol must have unique 1009 ID values over the potential fragment reordering period 1010 [RFC6864][RFC8200]. 1012 The uniqueness of the IP ID is a known problem for high speed nodes, 1013 because it limits the speed of a single protocol between two 1014 endpoints [RFC4963]. Although this RFC suggests that the uniqueness 1015 of the IP ID is moot, tunnels exacerbate this condition. A tunnel 1016 often aggregates traffic from a number of different source and 1017 destination addresses, of different protocols, and encapsulates them 1018 in a header with the same ingress and egress addresses, all using a 1019 single encapsulation protocol. If the ingress enforces IP ID 1020 uniqueness, this can either severely limit tunnel throughput or can 1021 require substantial resources; the alternative is to ignore IP ID 1022 uniqueness and risk reassembly errors. Although fragmentation is 1023 somewhat rare in the current Internet at large, it can be common 1024 along a tunnel. Reassembly errors are not always detected by other 1025 protocol layers (see Sec. 4.3.3) , and even when detected they can 1026 result in excessive overall packet loss and can waste bandwidth 1027 between the egress and ultimate packet destination. 1029 The 32-bit IPv6 ID field in the Fragment Header is typically used 1030 only during source fragmentation. The size of the ID field is 1031 typically sufficient that a single counter can be used at the tunnel 1032 ingress, regardless of the endpoint addresses or next-header 1033 protocol, allowing efficient support for very high throughput 1034 tunnels. 1036 The smaller 16-bit IPv4 ID is more difficult to correctly support. A 1037 recent update to IPv4 allows the ID to be repeated for atomic packets 1038 [RFC6864]. 
When either source fragmentation or on-path fragmentation 1039 is supported, the tunnel ingress may need to keep independent ID 1040 counters for each tunnel source/destination/protocol tuple. 1042 4.1.5. Checksums 1044 IP traffic transiting a tunnel needs to expect a similar level of 1045 error detection and correction as it would expect from any other 1046 link. In the case of IPv4, there are no such expectations, which is 1047 partly why it includes a header checksum [RFC791]. 1049 IPv6 omitted the header checksum because it already expects most link 1050 errors to be detected and dropped by the link layer and because it 1051 also assumes transport protection [RFC8200]. When transiting IPv6 1052 over IPv6, the tunnel fails to provide the expected error detection. 1054 This is why IPv6 is often tunneled over layers that include separate 1055 protection, such as GRE [RFC2784]. 1057 The fragmentation created by the tunnel ingress can increase the need 1058 for stronger error detection and correction, especially at the tunnel 1059 egress to avoid reassembly errors. The Internet checksum is known to 1060 be susceptible to reassembly errors that could be common [RFC4963], 1061 and should not be relied upon for this purpose. This is why some 1062 tunnel protocols, e.g., SEAL and AERO [RFC5320][Te18] and GRE 1063 [RFC2784] as well as legacy protocols swIPe and the Internet 1064 Encapsulation Protocol [RFC1853], include a separate checksum. This 1065 requirement can be undermined when using UDP as a tunnel with no UDP 1066 checksum (as per [RFC6935][RFC6936]) when fragmentation occurs 1067 because the egress has no checksum with which to validate reassembly. 
1068 For this reason, it is safe to use UDP with a zero checksum for 1069 atomic tunnel link packets only; when used on fragments, whether 1070 generated at the ingress or en-route inside the tunnel, omission of 1071 such a checksum can result in reassembly errors that can cause 1072 additional work (capacity, forwarding processing, receiver 1073 processing) downstream of the egress. 1075 4.2. MTU Issues 1077 Link MTUs, IP datagram limits, and transport protocol segment sizes 1078 are already related by several requirements 1079 [RFC768][RFC791][RFC1122][RFC1812][RFC8200] and by a variety of 1080 protocol mechanisms that attempt to establish relationships between 1081 them, including path MTU discovery (PMTUD) [RFC1191][RFC8201], 1082 packetization layer path MTU discovery (PLPMTUD) [RFC4821], as well as 1083 mechanisms inside transport protocols [RFC793][RFC4340][RFC4960]. The 1084 following subsections summarize the interactions between tunnels and 1085 MTU issues, including minimum tunnel MTUs, tunnel fragmentation and 1086 reassembly, and MTU discovery. 1088 4.2.1. Minimum MTU Considerations 1090 There are a variety of minimum MTU values to consider, both 1091 in a conventional network and in a tunnel as a link in that network. 1092 These are indicated in Figure 10, an annotated variant of Figure 4. 1093 Note that a (link) MTU (a) corresponds to a tunnel MTU (b) and that a 1094 path MTU (c) corresponds to a tunnel path MTU (f). The tunnel MTU is 1095 the EMTU_R of the egress interface, because that defines the largest 1096 transit packet message that can traverse the tunnel as a link in 1097 network M. The ability to traverse the hops of the tunnel - in 1098 network N - is not related, and only the ingress need be concerned 1099 with that value.
1101 --_ -- 1102 +------+ / \ / \ +------+ 1103 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 1104 +------+ \ //\ / \ / \ /\\ / +------+ 1105 --/I \---+ Rb +---+ Rc +---/E \-- 1106 \ / \ / \ / \ / 1107 \/ -- -- \/ 1108 <----- Network N -------> 1109 <-------------------- Network M ---------------------> 1111 Communication in network M viewed at that layer: 1112 (a) <-> Link MTU 1113 (b) <---- Tunnel MTU ---------> 1114 (c) <----------- Path MTU -----------------> 1115 (d) <------------------- EMTU_R ---------------------------> 1117 Communication in network N viewed at that layer: 1118 (e) <--> Link MTU 1119 (f) <--- Path MTU ------> 1120 (g) <----- EMTU_R ---------> 1122 Communication in network N viewed from network M: 1123 (h) <--> MFS 1124 (i) <--- Path MFS ------> 1125 (j) <----- EMFS_R ---------> 1127 Figure 10 The variety of MTU values 1129 Consider the following example values. For IPv6 transit packets, the 1130 minimum (link) MTU (a) is 1280 bytes, which similarly applies to 1131 tunnels as the tunnel MTU (b). The path MTU (c) is the minimum of the 1132 links (including tunnels as links) along a path, and indicates the 1133 smallest IP message (packet or fragment) that can traverse a path 1134 between a source and destination without on-path fragmentation (e.g., 1135 supported in IPv4 with DF=0). Path MTU discovery, either at the 1136 network layer (PMTUD [RFC1191][RFC8201]) or packetization layer 1137 (PLPMTUD [RFC4821]) attempts to tune the source IP packets and 1138 fragments (i.e., EMTU_S) to fit within this path MTU size to avoid 1139 fragmentation and reassembly [Ke95]. The minimum EMTU_R (d) is 1500 1140 bytes, i.e., the minimum MTU for endpoint-to-endpoint communication. 1142 The tunnel is a source-destination communication in network N. 
1143 Messages between the tunnel source (the ingress interface) and tunnel 1144 destination (egress interface) similarly experience a variety of 1145 network N MTU values, including a link MTU (e), a path MTU (f), and 1146 an EMTU_R (g). The network N message maximum is limited by the path 1147 MTU, and the source-destination message maximum (EMTU_S) is limited 1148 by the path MTU when source fragmentation is disabled and by EMTU_R 1149 otherwise, just as it was for those types of MTUs in network M. 1150 For an IPv6 network N, its link and path MTUs must be at least 1280 1151 and its EMTU_R must be at least 1500. 1153 However, viewed from the context of network M, these network N MTUs 1154 are link layer properties, i.e., maximum frame sizes (MFS (h)). The 1155 network N EMTU_R determines the largest message that can transit 1156 between the source (ingress) and destination (egress), but viewed 1157 from network M this is a link layer property, i.e., EMFS_R (j). The tunnel 1158 EMTU_R is EMFS_R minus the link (encapsulation) headers, because EMFS_R includes 1159 the encapsulation headers of the link layer. Just as the path MTU has 1160 no bearing on EMTU_R, the path MFS (i) in network N has no bearing on 1161 the MTU of the tunnel.
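The relationship between the network N EMTU_R and the tunnel MTU can be written out as a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, and the 40-byte overhead in the example assumes a bare IPv6 encapsulation header with no extension headers.

```python
def tunnel_mtu(network_n_emtu_r: int, encaps: int) -> int:
    """Tunnel MTU as seen by network M.

    The network N EMTU_R, viewed from network M, is the receiver's
    effective maximum frame size (EMFS_R). It includes the encapsulation
    headers, so those are subtracted to get the transit packet size.
    The network N path MTU (path MFS) is deliberately not an input.
    """
    emfs_r = network_n_emtu_r
    return emfs_r - encaps


# IPv6 network N with the minimum EMTU_R of 1500 bytes and an assumed
# 40-byte encapsulation overhead.
print(tunnel_mtu(1500, 40))
```

The function takes no path MFS argument, which mirrors the observation that the path MFS in network N has no bearing on the tunnel MTU.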
1163 For IPv6 networks M and N, these relationships are summarized as 1164 follows: 1166 o Network M MTU = 1280, the largest transit packet (i.e., payload) 1167 over a single IPv6 link in the base network without source 1168 fragmentation 1170 o Network M path MTU = 1280, the transit packet (i.e., payload) that 1171 can traverse a path of links in the base network without source 1172 fragmentation 1174 o Network M EMTU_R = 1500, the largest transit packet (i.e., 1175 payload) that can traverse a path in the base network with source 1176 fragmentation 1178 o Network N MTU = 1280 (for the same reasons as for network M) 1180 o Network N path MTU = 1280 (for the same reasons as for network M) 1182 o Network N EMTU_R = 1500 (for the same reasons as for network M) 1184 o Tunnel MTU = 1500-encapsulation (typically 1460), the network N 1185 EMTU_R payload 1187 o Tunnel MAP (maximum atomic packet) = largest network M message 1188 that transits a tunnel as an atomic packet using network N as a 1189 link layer: 1280-encapsulation, i.e., the network N path MTU 1190 payload (which is itself limited by the tunnel path MFS) 1192 The difference between the network N MTU and its treatment as a link 1193 layer in network M is the reason why the tunnel ingress interfaces 1194 need to support fragmentation and tunnel egress interfaces need to 1195 support reassembly in the encapsulation layer(s). The high cost of 1196 fragmentation and reassembly is why it is useful for applications to 1197 avoid sending messages too close to the size of the tunnel path MTU 1198 [Ke95], although there is no signaling mechanism that can achieve 1199 this (see Section 4.2.3). 1201 4.2.2. Fragmentation 1203 A tunnel interacts with fragmentation in two different ways. 
As a 1204 link in network M, transit packets might be fragmented before they 1205 reach the tunnel - i.e., in network M either during source 1206 fragmentation (if generated at the same node as the ingress 1207 interface) or forwarding fragmentation (for IPv4 DF=0 datagrams). In 1208 addition, link packets traversing inside the tunnel may require 1209 fragmentation by the ingress interface - i.e., source fragmentation 1210 by the ingress as a host in network N. These two fragmentation 1211 operations are no more related than are conventional IP fragmentation 1212 and ATM segmentation and reassembly; one occurs at the (transit) 1213 network layer, the other at the (virtual) link layer. 1215 Although many of these issues with tunnel fragmentation and MTU 1216 handling were discussed in [RFC4459], that document described a 1217 variety of alternatives as if they were independent. This document 1218 explains the combined approach that is necessary. 1220 Like any other link, an IPv4 tunnel must transit 68 byte packets 1221 without requiring source fragmentation [RFC791][RFC1122] and an IPv6 1222 tunnel must transit 1280 byte packets without requiring source 1223 fragmentation [RFC8200]. The tunnel MTU interacts with routers or 1224 hosts it connects the same way as would any other link MTU. The 1225 pseudocode examples in this section use the following values: 1227 o TP: transit packet 1229 o TLP: tunnel link packet 1231 o TPsize: size of the transit packet (including its headers) 1233 o encaps: ingress encapsulation overhead (tunnel link headers) 1235 o tunMTU: tunnel MTU, i.e., network N egress EMTU_R - encaps 1237 o tunMAP: tunnel maximum atomic packet as limited by the tunnel path 1238 MFS 1240 These rules apply at the host/router where the tunnel is attached, 1241 i.e., at the network layer of the transit packet (we assume that all 1242 tunnels, including multipoint tunnels, have a single, uniform MTU). 
1243 These are basic source fragmentation rules (or transit 1244 refragmentation for IPv4 DF=0 datagrams), and have no relation to the 1245 tunnel itself other than to consider the tunnel MTU as the effective 1246 link MTU of the next hop. 1248 Inside the source during transit packet generation or a router during 1249 transit packet forwarding, the tunnel is treated as if it were any 1250 other link (i.e., this is not tunnel processing, but rather typical 1251 source or router processing), as indicated in the pseudocode in 1252 Figure 11. 1254 if (TPsize > tunMTU) then 1255 if (TP can be on-path fragmented, e.g., IPv4 DF=0) then 1256 split TP into TP fragments of tunMTU size 1257 and send each TP fragment to the tunnel ingress interface 1258 else 1259 drop the TP and send ICMP "too big" to the TP source 1260 endif 1261 else 1262 send TP to the tunnel ingress (i.e., as an outbound interface) 1263 endif 1265 Figure 11 Router / host packet size processing algorithm 1267 The tunnel ingress acts as host on the tunnel path, i.e., as source 1268 fragmentation of tunnel link packets (we assume that all tunnels, 1269 even multipoint tunnels, have a single, uniform tunnel MTU), using 1270 the pseudocode shown in Figure 12. Note that ingress source 1271 fragmentation occurs in the encapsulation process, which may involve 1272 more than one protocol layer. In those cases, fragmentation can occur 1273 at any of the layers of encapsulation in which it is supported, based 1274 on the configuration of the ingress. 
1276       if (TPsize <= tunMAP) then
1277           encapsulate the TP and emit
1278       else
1279           if (tunMAP < TPsize) then
1280               encapsulate the TP, creating the TLP
1281               fragment the TLP into tunMAP chunks
1282               emit the TLP fragments
1283           endif
1284       endif

1286              Figure 12 Ingress processing algorithm

1288    Note that Figure 11 and Figure 12 indicate that a node might
1289    both "fragment then encapsulate" and "encapsulate then fragment",
1290    i.e., the effect is "on-path fragment, then encapsulate, then source
1291    fragment". The first (on-path) fragmentation occurs only for IPv4
1292    DF=0 packets, based on the tunnel MTU. The second (source)
1293    fragmentation occurs for all packets, based on the tunnel maximum
1294    atomic packet (MAP) size. The first fragmentation is a convenience
1295    for a subset of IPv4 packets; it is the second (source) fragmentation
1296    that ensures that messages traverse the tunnel.

1298    Just as a network interface should never receive a message larger
1299    than its MTU, a tunnel should never receive a message larger than its
1300    tunnel MTU limit (see the host/router processing above). A router
1301    attempting to process such a message would already have generated an
1302    ICMP "packet too big" and the transit packet would have been dropped
1303    before entering into this algorithm. Similarly, a host would have
1304    generated an error internally and aborted the attempted transmission.

1306    As an example, consider IPv4 over IPv6 or IPv6 over IPv6 tunneling,
1307    where IPv6 encapsulation adds a 40 byte fixed header plus IPv6
1308    options (i.e., IPv6 header extensions) of total size 'EHsize'. The
1309    tunnel MTU will be at least 1500 - (40 + EHsize) bytes. The tunnel
1310    path MTU will be at least 1280 - (40 + EHsize) bytes, which then also
1311    represents the tunnel maximum atomic packet size (MAP).
1312    Transit packets larger than the tunnel MTU will be dropped by a node
1313    before ingress processing, and so do not need to be addressed as part
1314    of ingress processing. Considering these minimum values, the previous
1315    algorithm uses actual values shown in the pseudocode in Figure 13.

1317       if (TPsize <= (1240 - EHsize)) then
1318           encapsulate TP and emit
1319       else
1320           if ((1240 - EHsize) < TPsize) then
1321               encapsulate the TP, creating the TLP
1322               fragment the TLP into (1240 - EHsize) chunks
1323               emit the TLP fragments
1324           endif
1325       endif

1327          Figure 13 Ingress processing for a tunnel over IPv6

1329    IPv6 cannot necessarily support all tunnel encapsulations. When the
1330    egress EMTU_R is the default of 1500 bytes, an IPv6 tunnel supports
1331    IPv6 transit only if EHsize is 180 bytes or less (so that
        1500 - 40 - EHsize >= 1280); otherwise the
1332    incoming transit packet would have been dropped as being too large by
1333    the host/router. Under the same EMTU_R assumption, an IPv6 tunnel
1334    supports IPv4 transit only if EHsize is 884 bytes or less (so that
        1500 - 40 - EHsize >= 576). In this
1335    example, transit packets of up to (1240 - EHsize) can traverse the
1336    tunnel without ingress source fragmentation and egress reassembly.

1338    When using IP directly over IP, the minimum transit packet EMTU_R for
1339    IPv4 is 576 bytes and for IPv6 is 1500 bytes. This means that tunnels
1340    of IPv4-over-IPv4, IPv4-over-IPv6, and IPv6-over-IPv6 are possible
1341    without additional requirements, but this may involve ingress
1342    fragmentation and egress reassembly. IPv6 cannot be tunneled directly
1343    over IPv4 without additional requirements, notably that the egress
1344    EMTU_R is at least 1280 bytes.

1346    When ongoing ingress fragmentation and egress reassembly would be
1347    prohibitive or costly, larger MTUs can be supported by design and
1348    confirmed either out-of-band (by design) or in-band (e.g., using
1349    PLPMTUD [RFC4821], as done in SEAL [RFC5320] and AERO [Te18]).
1350    In particular, many tunnel specifications are able to avoid
1351    persistent fragmentation because they operationally assume larger
1352    EMTU_R and tunnel MAP sizes than are guaranteed for IPv4 [RFC1122] or
1353    IPv6 [RFC8200].

1355 4.2.3. Path MTU Discovery

1357    Path MTU discovery (PMTUD) enables a network path to support a larger
1358    PMTU than it can assume from the minimum requirements of the protocol
1359    over which it operates. Note, however, that PMTUD never discovers an
1360    EMTU_R that is larger than the required minimum; that information is
1361    available to some upper layer protocols, such as TCP [RFC1122], but
1362    cannot be determined at the IP layer.

1364    There is a temptation to optimize tunnel traversal so that packets are
1365    not fragmented between ingress and egress, i.e., to attempt to tune the
1366    network M PMTU to the tunnel MAP size rather than to the tunnel MTU,
1367    to avoid ingress fragmentation. This is often impossible because the
1368    ICMP "packet too big" message (IPv4 fragmentation needed [RFC792] or
1369    IPv6 packet too big [RFC4443]) indicates the complete failure of a
1370    link to transit a packet, not a preference for a size that matches
1371    the internal mechanism of the link. ICMP messages are intended
1372    to indicate whether a tunnel MTU is insufficient; there is no ICMP
1373    message that can indicate when a transit packet is "too big for the
1374    tunnel path MTU, but not larger than the tunnel MTU". If there were,
1375    endpoints might receive that message for IP packets larger than 40
1376    bytes (the payload of a single ATM cell, allowing for the 8-byte AAL5
1377    trailer), but smaller than 9K (the ATM EMTU_R payload).
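The three size regions implied by this discussion can be sketched as follows. This is only an illustration under stated assumptions, not part of any tunnel specification: the helper name classify, the Outcome labels, and the choice of an IPv6 encapsulation with EHsize = 0 (so tunMTU = 1460 and tunMAP = 1240) are all hypothetical. The point it shows is that only the largest region can be reported to the transit source with ICMP; the middle region has no ICMP message and is handled by ingress fragmentation.

```python
from enum import Enum

# Assumed example values: IPv6 encapsulation with no extension headers.
IPV6_HEADER = 40
TUN_MTU = 1500 - IPV6_HEADER   # network N EMTU_R minus encapsulation = 1460
TUN_MAP = 1280 - IPV6_HEADER   # network N path MTU minus encapsulation = 1240


class Outcome(Enum):
    ATOMIC = "encapsulate and emit as a single tunnel link packet"
    INGRESS_FRAGMENT = "encapsulate, then source fragment at the ingress"
    EXCEEDS_TUNNEL_MTU = "handled by the attached node before the ingress"


def classify(tp_size: int, tun_map: int = TUN_MAP,
             tun_mtu: int = TUN_MTU) -> Outcome:
    """Hypothetical helper: which region a transit packet size falls in."""
    if tp_size > tun_mtu:
        # The only region for which an ICMP "too big" message exists
        # (or on-path fragmentation for IPv4 DF=0, as in Figure 11).
        return Outcome.EXCEEDS_TUNNEL_MTU
    if tp_size > tun_map:
        # No ICMP message describes this region; the ingress fragments.
        return Outcome.INGRESS_FRAGMENT
    return Outcome.ATOMIC


if __name__ == "__main__":
    for size in (1240, 1241, 1460, 1461):
        print(size, classify(size).name)
```

A transit source that lowered its packet size to tunMAP would avoid the middle region, but nothing in ICMP can tell it to do so.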
1379    In addition, attempting to tune the network transit size to
1380    natively match that of the link internal transit can be hazardous for
1381    many reasons:

1383    o  The tunnel is capable of transiting packets as large as the
1384       network N EMTU_R - encapsulation, which is always at least as
1385       large as the tunnel MTU and typically is larger.

1387    o  ICMP has only one type of error message regarding large packets -
1388       "too big", i.e., too large to transit. There is no optimization
1389       message of "bigger than I'd like, but I can deal with it if needed".

1391    o  IP tunnels often involve some level of recursion, i.e.,
1392       encapsulation over itself [RFC4459].

1394    Tunnels that use IPv4 as the encapsulation layer SHOULD set DF=0, but
1395    this requires generating unique fragmentation ID values, which may
1396    limit throughput [RFC6864]. These tunnels might have difficulty
1397    assuming ingress EMTU_S values over 64 bytes, so it may not be
1398    feasible to assume that larger packets with DF=1 are safe.

1400    Recursive tunneling occurs whenever a protocol ends up encapsulated
1401    in itself. This happens directly, as when IPv4 is encapsulated in
1402    IPv4, or indirectly, as when IP is encapsulated in UDP which then is
1403    a payload inside IP. It can involve many layers of encapsulation
1404    because a tunnel provider isn't always aware of whether the packets
1405    it transits are already tunneled.

1407    Recursion is impossible when the tunnel transit packets are limited
1408    to the native size of the ingress payload. Arriving tunnel
1409    transit packets have a minimum supported size (1280 for IPv6) and the
1410    tunnel PMFS has the same requirement; there would be no room for the
1411    tunnel's "link layer" headers, i.e., the encapsulation layer. The
1412    result would be an IPv6 tunnel that cannot satisfy IPv6 transit
1413    requirements.
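The arithmetic behind this conclusion can be worked through in a short sketch. It assumes, purely for illustration, that each tunnel layer adds one 40-byte IPv6 header with no extension headers and that the underlying path guarantees only the IPv6 minimum of 1280 bytes; the function names are hypothetical. It compares a tunnel that never fragments (transit size limited to what fits atomically) with one whose MTU is the egress EMTU_R less encapsulation, as defined earlier in this section.

```python
IPV6_MIN_LINK_MTU = 1280  # size every IPv6 link must transit [RFC8200]
IPV6_EMTU_R = 1500        # default egress reassembly size used in this section
ENCAPS = 40               # assumed: one IPv6 header per layer, EHsize = 0


def atomic_only_limit(depth: int, encaps: int = ENCAPS) -> int:
    """Largest transit packet if no layer may fragment (hypothetical helper).

    Each nested tunnel layer must fit its headers inside the 1280-byte
    minimum, so the usable size shrinks with every level of recursion.
    """
    return IPV6_MIN_LINK_MTU - depth * encaps


def reassembling_tunnel_mtu(encaps: int = ENCAPS) -> int:
    """Tunnel MTU when the egress reassembles: EMTU_R less encapsulation."""
    return IPV6_EMTU_R - encaps


def meets_ipv6_minimum(size: int) -> bool:
    """True if a link offering this size satisfies the IPv6 link minimum."""
    return size >= IPV6_MIN_LINK_MTU


if __name__ == "__main__":
    for depth in (1, 2, 3):
        limit = atomic_only_limit(depth)
        print("no fragmentation, depth", depth, limit, meets_ipv6_minimum(limit))
    mtu = reassembling_tunnel_mtu()
    print("with egress reassembly", mtu, meets_ipv6_minimum(mtu))
```

Even a single layer without fragmentation falls to 1240 bytes, below the 1280-byte requirement, whereas a tunnel that relies on ingress fragmentation and egress reassembly offers 1460 bytes under the same assumptions.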
1415    It is more appropriate to require the tunnel to satisfy IP transit
1416    requirements and enforce that requirement at design time or during
1417    operation (the latter using PLPMTUD [RFC4821]). Conventional path MTU
1418    discovery (PMTUD) relies on existing endpoint ICMP processing of
1419    explicit negative feedback from routers along the path via "packet too
1420    big" ICMP packets in the reverse direction of the tunnel
1421    [RFC1191][RFC8201]. This technique is susceptible to the "black hole"
1422    phenomenon, in which the ICMP messages never return to the source due
1423    to policy-based filtering [RFC2923]. PLPMTUD requires a separate,
1424    direct control channel from the egress to the ingress that provides
1425    positive feedback; the direct channel is not blocked by policy
1426    filters and the positive feedback ensures fail-safe operation if
1427    feedback messages are lost [RFC4821].

1429    PLPMTUD might require that the ingress consider the potential impact
1430    of multipath forwarding (see Section 4.3.4). In such cases, probes
1431    generated by the ingress might need to track different flows, e.g.,
1432    that might traverse different tunnel paths. Additionally,
1433    encapsulation might need to consider mechanisms to ensure that probes
1434    traverse the same path as their corresponding traffic, even when
1435    labeled as the same flow (e.g., using the IPv6 flow ID). In such
1436    cases, the transit packet and probe may need to be encrypted or
1437    encapsulated in an additional flow-based transport header, to avoid
1438    differential path traversal based on deep-packet inspection within
1439    the tunnel.

1441 4.3. Coordination Issues

1443    IP tunnels interact with link layer signals and capabilities in a
1444    variety of ways. The following subsections address some key issues of
1445    these interactions.
In general, they are again informed by treating a 1446 tunnel as any other link layer and considering the interactions 1447 between the IP layer and link layers [RFC3819]. 1449 4.3.1. Signaling 1451 In the current Internet architecture, signaling goes upstream, either 1452 from routers along a path or from the destination, back toward the 1453 source. Such signals are typically contained in ICMP messages, but 1454 can involve other protocols such as RSVP, transport protocol signals 1455 (e.g., TCP RSTs), or multicast control or transport protocols. 1457 A tunnel behaves like a link and acts like a link interface at the 1458 nodes where it is attached. As such, it can provide information that 1459 enhances IP signaling (e.g., ICMP), but itself does not directly 1460 generate ICMP messages. 1462 For tunnels, this means that there are two separate signaling paths. 1463 The outer network M nodes can each signal the source of the tunnel 1464 transit packets, Hsrc (Figure 14). Inside the tunnel, the inner 1465 network N nodes can signal the source of the tunnel link packets, the 1466 ingress I (Figure 15). 
1468 +--------+---------------------------+--------+ 1469 | | | | 1470 v --_ -- v 1471 +------+ / \ / \ +------+ 1472 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 1473 +------+ \ //\ / \ / \ /\\ / +------+ 1474 --/I \--+ Rb +--+ Rc +--/E \-- 1475 \ / \ / \ / \ / 1476 \/ -- -- \/ 1477 <---- Network N -----> 1478 <-------------------- Network M ---------------------> 1480 Figure 14 Signals outside the tunnel 1482 +-----+-------+------+ 1483 --_ | | | | -- 1484 +------+ / \ v | | | / \ +------+ 1485 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 1486 +------+ \ //\ / \ / \ /\\ / +------+ 1487 --/I \--+ Rb +--+ Rc +--/E \-- 1488 \ / \ / \ / \ / 1489 \/ -- -- \/ 1490 <----- Network N ----> 1491 <--------------------- Network M --------------------> 1493 Figure 15 Signals inside the tunnel 1495 These two signal paths are inherently distinct except where 1496 information is exchanged between the network interface of the tunnel 1497 (the ingress) and its attached node (Ra, in both figures). 1499 It is always possible for a network interface to provide hints to its 1500 attached node (host or router), which can be used for optimization. 1501 In this case, when signals inside the tunnel indicate a change to the 1502 tunnel, the ingress (i.e., the tunnel network interface) can provide 1503 information to the router (Ra, in both figures), so that Ra can 1504 generate the appropriate signal in return to Hsrc. This relaying may 1505 be difficult, because signals inside the tunnel may not return enough 1506 information to the ingress to support direct relaying to Hsrc. 1508 In all cases, the tunnel ingress needs to determine how to relay the 1509 signals from inside the tunnel into signals back to the source. For 1510 some protocols this is either simple or impossible (such as for 1511 ICMP), for others, it can even be undefined (e.g., multicast). 
In 1512 some cases, the individual signals relayed from inside the tunnel may 1513 result in corresponding signals in the outside network, and in other 1514 cases they may just change state of the tunnel interface. In the 1515 latter case, the result may cause the router Ra to generate new ICMP 1516 errors when later messages arrive from Hsrc or other sources in the 1517 outer network. 1519 The meaning of the relayed information must be carefully translated. 1520 An ICMP error within a tunnel indicates a failure of the path inside 1521 the tunnel to support an egress atomic packet or packet fragment 1522 size. It can be very difficult to convert that ICMP error into a 1523 corresponding ICMP message from the ingress node back to the transit 1524 packet source. The ICMP message may not contain enough of a packet 1525 prefix to extract the transit packet header sufficient to generate 1526 the appropriate ICMP message. The relationship between the egress 1527 EMTU_R and the transit packet may be indirect, e.g., the ingress node 1528 may be performing source fragmentation that should be adjusted 1529 instead of propagating the ICMP upstream. 1531 Some messages have detailed specifications for relaying between the 1532 tunnel link packet and transit packet, including Explicit Congestion 1533 Notification (ECN [RFC6040]) and multicast (IGMP, e.g.). 1535 4.3.2. Congestion 1537 Tunnels carrying IP traffic (i.e., the focus of this document) need 1538 not react directly to congestion any more than would any other link 1539 layer [RFC8085]. IP transit packet traffic is already expected to be 1540 congestion controlled. 1542 It is useful to relay network congestion notification between the 1543 tunnel link and the tunnel transit packets. 
Explicit congestion 1544 notification requires that ECN bits are copied from the tunnel 1545 transit packet to the tunnel link packet on encapsulation, as well as 1546 copied back at the egress based on a combination of the bits of the 1547 two headers [RFC6040]. This allows congestion notification within the 1548 tunnel to be interpreted as if it were on the direct path. 1550 4.3.3. Multipoint Tunnels and Multicast 1552 Multipoint tunnels are tunnels with more than two ingress/egress 1553 endpoints [RFC2529][RFC5214][Te18]. Just as tunnels emulate links, 1554 multipoint tunnels emulate multipoint links, and can support 1555 multicast as a tunnel capability. Multipoint tunnels can be useful on 1556 their own, or may be used as part of more complex systems, e.g., LISP 1557 and TRILL configurations [RFC6830][RFC6325]. 1559 Multipoint tunnels require a support for egress determination, just 1560 as multipoint links do. This function is typically supported by ARP 1561 [RFC826] or ARP emulation (e.g., LAN Emulation, known as LANE 1563 [RFC2225]) for multipoint links. For multipoint tunnels, a similar 1564 mechanism is required for the same purpose - to determine the egress 1565 address for proper ingress encapsulation (e.g., LISP Map-Service 1566 [RFC6833]). 1568 All multipoint systems - tunnels and links - might support different 1569 MTUs between each ingress/egress (or link entrance/exit) pair. In 1570 most cases, it is simpler to assume a uniform MTU throughout the 1571 multipoint system, e.g., the minimum MTU supported across all 1572 ingress/egress pairs. This applies to both the ingress EMTU_S and 1573 egress EMTU_R (the latter determining the tunnel MTU). 
1574    Values valid across all receivers need to be confirmed in advance
1575    (e.g., via IPv6 ND announcements or out-of-band configuration
1576    information) before a multipoint tunnel or link can use values other
1577    than the default; otherwise packets may reach some receivers but be
1578    "black-holed" to others (e.g., if PMTUD fails [RFC2923]).

1580    A multipoint tunnel MUST have support for broadcast and multicast (or
1581    their equivalent), in exactly the same way as this is already
1582    required for multipoint links [RFC3819]. Both modes can be supported
1583    either by a native mechanism inside the tunnel or by emulation using
1584    serial replication at the tunnel ingress (e.g., AMT [RFC7450]), in
1585    the same way that links may provide the same support either natively
1586    (e.g., via promiscuous or automatic replication in the link itself)
1587    or by network interface emulation (e.g., as for non-broadcast
1588    multiaccess networks, i.e., NBMAs).

1590    IGMP snooping enables IP multicast to be coupled with native link
1591    layer multicast support [RFC4541]. A similar technique may be
1592    relevant to couple transit packet multicast to tunnel link packet
1593    multicast, but the coupling of the protocols may be more complex
1594    because many tunnel link protocols rely on their own network N
1595    multicast control protocol, e.g., via PIM-SM [RFC6807][RFC7761].

1597 4.3.4. Load Balancing

1599    Load balancing can impact the way in which a tunnel operates. In
1600    particular, multipath routing inside the tunnel can cause some of
1601    the tunnel parameters to vary, both over time and for different
1602    transit packets. The use of multiple paths can be the result of MPLS
1603    link aggregation groups (LAGs), equal-cost multipath routing (ECMP
1604    [RFC2991]), or other load balancing mechanisms. In some cases, the
1605    tunnel exists as the mechanism to support ECMP, as for GRE in UDP
1606    [RFC8086].
1608 A tunnel may have multiple paths between the ingress and egress with 1609 different tunnel path MTU or tunnel MAP values, causing the ingress 1610 EMTU_S to vary [RFC7690]. When individual values cannot be correlated 1611 to transit traffic, the EMTU_S can be set to the minimum of these 1612 different path MTU and MAP values. 1614 In some cases, these values can be correlated to paths, e.g., IPv6 1615 packets include a flow label to enable multipath routing to keep 1616 packets of a single flow following the same path, as well as to help 1617 differentiate path properties (e.g., for path MTU discovery 1618 [RFC4821]). It is important to preserve the semantics of that flow 1619 label as an aggregate identifier of the encapsulated link packets of 1620 a tunnel. This is achieved by hashing the transit IP addresses and 1621 flow label to generate a new flow label for use between the ingress 1622 and egress addresses [RFC6438]. It is not appropriate to simply copy 1623 the flow label from the transit packet into the link packet because 1624 of collisions that might arise if a label is used for flows between 1625 different transit packet addresses that traverse the same tunnel. 1627 When the transit packet is visible to forwarding nodes inside the 1628 tunnel (e.g., when it is not encrypted), those nodes use deep packet 1629 inspection (DPI) context to send a single flow over different paths. 1630 This sort of "DPI override" of the IP flow information can interfere 1631 with both PMTUD and PLPMTUD mechanisms. The only way to ensure that 1632 intermediate nodes do not interfere with PLPMTUD is to encrypt the 1633 transit packet when it is encapsulated for tunnel traversal, or to 1634 provide some other signals (e.g., an additional layer of 1635 encapsulation header including transport ports) that preserves the 1636 flow semantics. 1638 4.3.5. 
Recursive Tunnels

1640    The rules described in this document already support tunnels over
1641    tunnels, sometimes known as "recursive" tunnels, in which IP is
1642    transited over IP either directly or via intermediate encapsulation
1643    (IP-UDP-IP, as in GUE [He19]).

1645    There are known hazards to recursive tunneling, notably that the
1646    independence of the tunnel transit header and tunnel link header hop
1647    counts can result in a tunneling loop. Such looping can be avoided
1648    when using direct encapsulation (IP in IP) by use of a header option
1649    to track the encapsulation count and to limit that count [RFC2473].
1650    This looping cannot be avoided when other protocols are used for
1651    tunneling, e.g., IP in UDP in IP, because the encapsulation count may
1652    not be visible where the recursion occurs.

1654 5. Observations

1656    The following subsections summarize the observations of this document
1657    and the issues with existing tunnel protocol specifications.
1658    They also include advice for tunnel protocol designers, implementers,
1659    and operators.

1661 5.1.
Summary of Recommendations

1663    o  Tunnel endpoints are network interfaces, tunnels are virtual links

1665       o  ICMP messages MUST NOT be generated by the tunnel (as a link)

1667       o  ICMP messages received by the ingress inside the link change the
1668          link properties (they do not generate transit-layer ICMP
1669          messages)

1671       o  Link headers (hop, ID, options) are largely independent of
1672          arriving ID (with few exceptions based on translation, not
1673          direct copying, e.g., ECN and IPv6 flow IDs)

1675    o  MTU values should treat the tunnel as any other link

1677       o  Require ingress source fragmentation and egress
1678          reassembly at the tunnel link packet layer

1680       o  The tunnel MTU is the tunnel egress EMTU_R less headers, and
1681          not related at all to the ingress-egress MFS

1683    o  Tunnels must obey core IP requirements

1685       o  Obey IPv4 DF=1 on arrival at a node (nodes MUST NOT fragment
1686          IPv4 packets where DF=1 and routers MUST NOT clear the DF bit)

1688       o  Shut down an IP tunnel if the tunnel MTU falls below the
1689          required minimum

1691 5.2. Impact on Existing Encapsulation Protocols

1693    Many existing and proposed encapsulation protocols are inconsistent
1694    with the guidelines of this document. The following list summarizes
1695    only those inconsistencies, but omits places where a protocol is
1696    inconsistent solely by reference to another protocol.

1698    [should this be inverted as a table of issues and a list of which
1699    RFCs have problems?]
1700 o IP in IP / mobile IP [RFC2003][RFC4459] - IPv4 in IPv4 1702 o Sets link DF when transit DF=1 (fails without PLPMTUD) 1704 o Drops at egress if hopcount = 0 (host-host tunnels fail) 1706 o Drops based on transit source (same as router IP, matches 1707 egress), i.e., performs routing functions it should not 1709 o Ingress generates ICMP messages (based on relayed context), 1710 rather than using inner ICMP messages to set interface 1711 properties only 1713 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1715 o IPv6 tunnels [RFC2473] -- IPv6 or IPv4 in IPv6 1717 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1719 o Decrements transiting packet hopcount (by 1) 1721 o Copies traffic class from tunnel link to tunnel transit header 1723 o Ignores IPv4 DF=0 and fragments at that layer upon arrival 1725 o Fails to retain soft ingress state based on inner ICMP messages 1726 affecting tunnel MTU 1728 o Tunnel ingress issues ICMPs 1730 o Fragments IPv4 over IPv6 fragments only if IPv4 DF=0 1731 (misinterpreting the "can fragment the IPv4 packet" as 1732 permission to fragment at the IPv6 link header) 1734 o IPsec tunnel mode (IP in IPsec in IP) [RFC4301] -- IP in IPsec 1736 o Uses security policy to set, clear, or copy DF (rather than 1737 generating it independently, which would also be more secure) 1739 o Intertwines tunnel selection with security selection, rather 1740 than presenting tunnel as an interface and using existing 1741 forwarding (as with transport mode over IP-in-IP [RFC3884]) 1743 o GRE (IP in GRE in IP or IP in GRE in UDP in IP) 1744 [RFC2784][RFC7588][RFC7676][RFC8086] 1746 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1747 o Requires ingress to generate ICMP errors 1749 o Copies IPv4 DF to outer IPv4 DF 1751 o Violates IPv6 MTU requirements when using IPv6 encapsulation 1753 o LISP [RFC6830] 1755 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1757 o Requires ingress to generate ICMP errors 
1759 o Copies inner hop limit to outer 1761 o L2TP [RFC3931] 1763 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1765 o Requires ingress to generate ICMP errors 1767 o PWE [RFC3985] 1769 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1771 o Requires ingress to generate ICMP errors 1773 o GUE (Generic UDP encapsulation) [He19] - IP (et. al) in UDP in IP 1775 o Allows inner encapsulation fragmentation 1777 o Geneve [RFC7364][Gr19] - IP (et al.) in Geneve in UDP in IP 1779 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1781 o SEAL/AERO [RFC5320][Te18] - IP in SEAL/AERO in IP 1783 o Some issues with SEAL (MTU, ICMP), corrected in AERO 1785 o RTG DT encapsulations [No16] 1787 o Assumes fragmentation can be avoided completely 1789 o Allows encapsulation protocols that lack fragmentation 1791 o Relies on ICMP PTB to correct for tunnel path MTU 1793 o No known issues 1794 o L2VPN (framework for L2 virtualization) [RFC4664] 1796 o L3VPN (framework for L3 virtualization) [RFC4176] 1798 o MPLS (IP in MPLS) [RFC3031] 1800 o TRILL (Ethernet in Ethernet) [RFC5556][RFC6325] 1802 5.3. Tunnel Protocol Designers 1804 [To be completed] 1806 Recursive tunneling + minimum MTU = frag/reassembly is inevitable, at 1807 least to be able to split/join two fragments 1809 Account for egress MTU/path MTU differences. 1811 Include a stronger checksum. 1813 Ensure the egress MTU is always larger than the path MTU. 1815 Ensure that the egress reassembly can keep up with line rate OR 1816 design PLPMTUD into the tunneling protocol. 1818 5.3.1. For Future Standards 1820 [To be completed] 1822 Larger IPv4 MTU (2K? or just 2x path MTU?) for reassembly 1824 Always include frag support for at least two frags; do NOT try to 1825 deprecate fragmentation. 1827 Limit encapsulation option use/space. 
1829 Augment ICMP to have two separate messages: PTB vs P-bigger-than- 1830 optimal 1832 Include MTU as part of BGP as a hint - SB 1834 Hazards of multi-MTU draft-van-beijnum-multi-mtu-04 1836 5.3.2. Diagnostics 1838 [To be completed] 1839 Some current implementations include diagnostics to support 1840 monitoring the impact of tunneling, especially the impact on 1841 fragmentation and reassembly resources, the status of path MTU 1842 discovery, etc. 1844 >> Because a tunnel ingress/egress is a network interface, it SHOULD 1845 have similar resources as any other network interface. This includes 1846 resources for packet processing as well as monitoring. 1848 5.4. Tunnel Implementers 1850 [To be completed] 1852 Detect when the egress MTU is exceeded. 1854 Detect when the egress MTU drops below the required minimum and shut 1855 down the tunnel if that happens - configuring the tunnel down and 1856 issuing a hard error may be the only way to detect this anomaly, and 1857 it's sufficiently important that the tunnel SHOULD be disabled. This 1858 is always better than blindly assuming the tunnel has been deployed 1859 correctly, i.e., that the solution has been engineered. 1861 Do NOT decrement the TTL as part of being a tunnel. It's always 1862 already OK for a router to decrement the TTL based on different next- 1863 hop routers, but TTL is a property of a router not a link. 1865 5.5. Tunnel Operators 1867 [To be completed] 1869 Keep the difference between "enforced by operators" vs. "enforced by 1870 active protocol mechanism" in mind. It's fine to assume something the 1871 tunnel cannot or does not test, as long as you KNOW you can assume 1872 it. When the assumption is wrong, it will NOT be signaled by the 1873 tunnel. Do NOT decrement the TTL as part of being a tunnel. It's 1874 always already OK for a router to decrement the TTL based on 1875 different next-hop routers, but TTL is a property of a router not a 1876 link. 
1878    Consider the circuit breakers doc to provide diagnostics and last-
1879    resort control to avoid overload for non-reactive traffic (see
1880    Gorry's RFC-to-be)

1886    >>>> PLPMTUD can give multiple conflicting PMTU values during ECMP or
1887    LAG if PMTU is cached per endpoint pair rather than per flow -- but
1888    so can PMTUD! This is another reason why ICMP should never drive up
1889    the effective MTU (if aggregate, treat as the minimum of received
1890    messages over an interval).

1892 6. Security Considerations

1894    Tunnels may introduce vulnerabilities or add to the potential for
1895    receiver overload and thus DoS attacks. These issues are primarily
1896    related to the fact that a tunnel is a link that traverses a network
1897    path and to fragmentation and reassembly. ICMP signal translation
1898    introduces a new security issue and must be done with care. ICMP
1899    generation at the router or host attached to a tunnel is already
1900    covered by existing requirements (e.g., should be throttled).

1902    Tunnels traverse multiple hops of a network path from ingress to
1903    egress. Traffic along such tunnels may be susceptible to on-path and
1904    off-path attacks, including fragment injection, reassembly buffer
1905    overload, and ICMP attacks. Some of these attacks may not be as
1906    visible to the endpoints of the architecture into which tunnels are
1907    deployed and these attacks may thus be more difficult to detect.

1909    Fragmentation at routers or hosts attached to tunnels may place an
1910    undue burden on receivers where traffic is not sufficiently diffuse,
1911    because tunnels may induce source fragmentation at hosts and path
1912    fragmentation (for IPv4 DF=0) more often than other links do.
1913    Care should be taken to avoid this situation, notably by ensuring
1914    that tunnel MTUs are not significantly different from other link
1915    MTUs.

1917    Tunnel ingresses emitting IP datagrams MUST obey all existing IP
1918    requirements, such as the uniqueness of the IP ID field. Failure to
1919    either limit encapsulation traffic, or use additional ingress/egress
1920    IP addresses, can result in high speed traffic fragments being
1921    incorrectly reassembled.

1923    Tunnels are susceptible to attacks at both the inner and outer
1924    network layers. The tunnel ingress/egress endpoints appear as network
1925    interfaces in the outer network, and are as susceptible as any other
1926    network interface. This includes vulnerability to fragmentation
1927    reassembly overload, traffic overload, and spoofed ICMP messages that
1928    misreport the state of those interfaces. Similarly, the
1929    ingress/egress appear as hosts to the path traversed by the tunnel,
1930    and thus are as susceptible to attacks as any other host.

1932    [management?]

1934    [Access control?]

1936    describe relationship to [RFC6169] - JT (as per INTAREA meeting
1937    notes, don't cover Teredo-specific issues in RFC6169, but include
1938    generic issues here)

1940 7. IANA Considerations

1942    This document has no IANA considerations.

1944    The RFC Editor should remove this section prior to publication.

1946 8. References

1948 8.1. Normative References

1950    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1951              Requirement Levels", BCP 14, RFC 2119, March 1997.

1953    [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119
1954              Key Words," BCP 14, RFC 8174, May 2017.

1956    [are there others? 3819? ECN? Flow label issues?]

1958 8.2. Informative References

1960    [Cl88]    Clark, D., "The design philosophy of the DARPA internet
1961              protocols," Proc. Sigcomm 1988, p.106-114, 1988.

1963    [Er94]    Eriksson, H., "MBone: The Multicast Backbone,"
1964              Communications of the ACM, Aug. 1994, pp.54-60.
1966 [Gr19] Gross, J. (Ed.), I. Ganga (Ed.), T. Sridhar (Ed.), "Geneve: 1967 Generic Network Virtualization Encapsulation," draft-ietf- 1968 nvo3-geneve-14, Sep. 2019. 1970 [He19] Herbert, T., L. Yong, O. Zia, "Generic UDP Encapsulation," 1971 draft-ietf-intarea-gue-07, Mar. 2019. 1973 [Ke95] Kent, S., J. Mogul, "Fragmentation considered harmful," ACM 1974 Sigcomm Computer Communication Review (CCR), V25 N1, Jan. 1975 1995, pp. 75-87. 1977 [No16] Nordmark, E. (Ed.), A. Tian, J. Gross, J. Hudson, L. 1978 Kreeger, P. Garg, P. Thaler, T. Herbert, "Encapsulation 1979 Considerations," draft-ietf-rtgwg-dt-encap-02, Oct. 2016. 1981 [RFC5] Rulifson, J., "Decode Encode Language (DEL)," RFC 5, June 1982 1969. 1984 [RFC768] Postel, J., "User Datagram Protocol," RFC 768, Aug. 1980. 1986 [RFC791] Postel, J., "Internet Protocol," RFC 791 / STD 5, September 1987 1981. 1989 [RFC792] Postel, J., "Internet Control Message Protocol," RFC 792, 1990 Sep. 1981. 1992 [RFC793] Postel, J., "Transmission Control Protocol," RFC 793, Sept. 1993 1981. 1995 [RFC826] Plummer, D., "An Ethernet Address Resolution Protocol -- or 1996 -- Converting Network Protocol Addresses to 48.bit Ethernet 1997 Address for Transmission on Ethernet Hardware," RFC 826, 1998 Nov. 1982. 2000 [RFC1075] Waitzman, D., C. Partridge, S. Deering, "Distance Vector 2001 Multicast Routing Protocol," RFC 1075, Nov. 1988. 2003 [RFC1122] Braden, R., Ed., "Requirements for Internet Hosts - 2004 Communication Layers," RFC 1122 / STD 3, October 1989. 2006 [RFC1191] Mogul, J., S. Deering, "Path MTU discovery," RFC 1191, 2007 November 1990. 2009 [RFC1812] Baker, F., "Requirements for IP Version 4 Routers," RFC 2010 1812, June 1995. 2012 [RFC1853] Simpson, W., "IP in IP Tunneling," RFC 1853, Oct. 1995. 2014 [RFC2003] Perkins, C., "IP Encapsulation within IP," RFC 2003, Oct. 2015 1996. 2017 [RFC2225] Laubach, M., J. Halpern, "Classical IP and ARP over ATM," 2018 RFC 2225, Apr. 1998.
2020 [RFC2473] Conta, A., S. Deering, "Generic Packet Tunneling in IPv6 2021 Specification," RFC 2473, Dec. 1998. 2023 [RFC2529] Carpenter, B., C. Jung, "Transmission of IPv6 over IPv4 2024 Domains without Explicit Tunnels," RFC 2529, Mar. 1999. 2026 [RFC2784] Farinacci, D., T. Li, S. Hanks, D. Meyer, P. Traina, 2027 "Generic Routing Encapsulation (GRE)," RFC 2784, March 2028 2000. 2030 [RFC2923] Lahey, K., "TCP Problems with Path MTU Discovery," RFC 2031 2923, September 2000. 2033 [RFC2983] Black, D., "Differentiated Services and Tunnels," RFC 2983, 2034 Oct. 2000. 2036 [RFC2991] Thaler, D., C. Hopps, "Multipath Issues in Unicast and 2037 Multicast Next-Hop Selection," RFC 2991, Nov. 2000. 2042 [RFC2546] Durand, A., B. Buclin, "6bone Routing Practice," RFC 2546, 2043 Mar. 1999. 2045 [RFC3031] Rosen, E., A. Viswanathan, R. Callon, "Multiprotocol Label 2046 Switching Architecture," RFC 3031, January 2001. 2048 [RFC3819] Karn, P., Ed., C. Bormann, G. Fairhurst, D. Grossman, R. 2049 Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood, 2050 "Advice for Internet Subnetwork Designers," RFC 3819 / BCP 2051 89, July 2004. 2053 [RFC3884] Touch, J., L. Eggert, Y. Wang, "Use of IPsec Transport Mode 2054 for Dynamic Routing," RFC 3884, September 2004. 2056 [RFC3931] Lau, J., Ed., M. Townsley, Ed., I. Goyret, Ed., "Layer Two 2057 Tunneling Protocol - Version 3 (L2TPv3)," RFC 3931, March 2058 2005. 2060 [RFC3985] Bryant, S., P. Pate (Eds.), "Pseudo Wire Emulation Edge-to- 2061 Edge (PWE3) Architecture," RFC 3985, March 2005. 2063 [RFC4176] El Mghazli, Y., Ed., T. Nadeau, M. Boucadair, K. Chan, A. 2064 Gonguet, "Framework for Layer 3 Virtual Private Networks 2065 (L3VPN) Operations and Management," RFC 4176, October 2005. 2067 [RFC4301] Kent, S., K. Seo, "Security Architecture for the 2068 Internet Protocol," RFC 4301, December 2005. 2070 [RFC4340] Kohler, E., M. Handley, S.
Floyd, "Datagram Congestion 2071 Control Protocol (DCCP)," RFC 4340, Mar. 2006. 2073 [RFC4443] Conta, A., S. Deering, M. Gupta (Ed.), "Internet Control 2074 Message Protocol (ICMPv6) for the Internet Protocol Version 2075 6 (IPv6) Specification," RFC 4443, Mar. 2006. 2077 [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the- 2078 Network Tunneling," RFC 4459, April 2006. 2080 [RFC4541] Christensen, M., K. Kimball, F. Solensky, "Considerations 2081 for Internet Group Management Protocol (IGMP) and Multicast 2082 Listener Discovery (MLD) Snooping Switches," RFC 4541, May 2083 2006. 2085 [RFC4664] Andersson, L., Ed., E. Rosen, Ed., "Framework for Layer 2 2086 Virtual Private Networks (L2VPNs)," RFC 4664, September 2087 2006. 2089 [RFC4821] Mathis, M., J. Heffner, "Packetization Layer Path MTU 2090 Discovery," RFC 4821, March 2007. 2092 [RFC4861] Narten, T., E. Nordmark, W. Simpson, H. Soliman, "Neighbor 2093 Discovery for IP version 6 (IPv6)," RFC 4861, Sept. 2007. 2095 [RFC4960] Stewart, R. (Ed.), "Stream Control Transmission Protocol," 2096 RFC 4960, Sep. 2007. 2098 [RFC4963] Heffner, J., M. Mathis, B. Chandler, "IPv4 Reassembly 2099 Errors at High Data Rates," RFC 4963, July 2007. 2101 [RFC5214] Templin, F., T. Gleeson, D. Thaler, "Intra-Site Automatic 2102 Tunnel Addressing Protocol (ISATAP)," RFC 5214, Mar. 2008. 2104 [RFC5320] Templin, F., Ed., "The Subnetwork Encapsulation and 2105 Adaptation Layer (SEAL)," RFC 5320, Feb. 2010. 2107 [RFC5556] Touch, J., R. Perlman, "Transparently Interconnecting Lots 2108 of Links (TRILL): Problem and Applicability Statement," RFC 2109 5556, May 2009. 2111 [RFC5944] Perkins, C., Ed., "IP Mobility Support for IPv4, Revised," 2112 RFC 5944, Nov. 2010. 2114 [RFC6040] Briscoe, B., "Tunneling of Explicit Congestion 2115 Notification," RFC 6040, Nov. 2010. 2117 [RFC6169] Krishnan, S., D. Thaler, J. Hoagland, "Security Concerns 2118 With IP Tunneling," RFC 6169, Apr. 2011. 2120 [RFC6325] Perlman, R., D. Eastlake, D. Dutt, S.
Gai, A. Ghanwani, 2121 "Routing Bridges (RBridges): Base Protocol Specification," 2122 RFC 6325, July 2011. 2124 [RFC6434] Jankiewicz, E., J. Loughney, T. Narten, "IPv6 Node 2125 Requirements," RFC 6434, Dec. 2011. 2127 [RFC6438] Carpenter, B., S. Amante, "Using the IPv6 Flow Label for 2128 Equal Cost Multipath Routing and Link Aggregation in 2129 Tunnels," RFC 6438, Nov. 2011. 2131 [RFC6807] Farinacci, D., G. Shepherd, S. Venaas, Y. Cai, "Population 2132 Count Extensions to Protocol Independent Multicast (PIM)," 2133 RFC 6807, Dec. 2012. 2135 [RFC6830] Farinacci, D., V. Fuller, D. Meyer, D. Lewis, "The 2136 Locator/ID Separation Protocol (LISP)," RFC 6830, Jan. 2013. 2138 [RFC6833] Fuller, V., D. Farinacci, "Locator/ID Separation Protocol 2139 (LISP) Map-Server Interface," RFC 6833, Jan. 2013. 2141 [RFC6864] Touch, J., "Updated Specification of the IPv4 ID Field," 2142 RFC 6864, Feb. 2013. 2144 [RFC6935] Eubanks, M., P. Chimento, M. Westerlund, "IPv6 and UDP 2145 Checksums for Tunneled Packets," RFC 6935, Apr. 2013. 2147 [RFC6936] Fairhurst, G., M. Westerlund, "Applicability Statement for 2148 the Use of IPv6 UDP Datagrams with Zero Checksums," RFC 2149 6936, Apr. 2013. 2151 [RFC6946] Gont, F., "Processing of IPv6 "Atomic" Fragments," RFC 2152 6946, May 2013. 2154 [RFC7364] Narten, T., E. Gray, D. Black, L. Fang, L. Kreeger, M. 2155 Napierala, "Problem Statement: Overlays for Network 2156 Virtualization," RFC 7364, Oct. 2014. 2158 [RFC7450] Bumgardner, G., "Automatic Multicast Tunneling," RFC 7450, 2159 Feb. 2015. 2161 [RFC7510] Xu, X., N. Sheth, L. Yong, R. Callon, D. Black, 2162 "Encapsulating MPLS in UDP," RFC 7510, April 2015. 2164 [RFC7588] Bonica, R., C. Pignataro, J. Touch, "A Widely-Deployed 2165 Solution to the Generic Routing Encapsulation Fragmentation 2166 Problem," RFC 7588, July 2015. 2168 [RFC7676] Pignataro, C., R. Bonica, S. Krishnan, "IPv6 Support for 2169 Generic Routing Encapsulation (GRE)," RFC 7676, Oct. 2015.
2171 [RFC7690] Byerly, M., M. Hite, J. Jaeggli, "Close Encounters of the 2172 ICMP Type 2 Kind (Near Misses with ICMPv6 Packet Too Big 2173 (PTB))," RFC 7690, Jan. 2016. 2175 [RFC7761] Fenner, B., M. Handley, H. Holbrook, I. Kouvelas, R. 2176 Parekh, Z. Zhang, L. Zheng, "Protocol Independent Multicast 2177 - Sparse Mode (PIM-SM): Protocol Specification (Revised)," 2178 RFC 7761, Mar. 2016. 2180 [RFC8085] Eggert, L., G. Fairhurst, G. Shepherd, "UDP Usage 2181 Guidelines," BCP 145, RFC 8085, Mar. 2017. 2183 [RFC8086] Yong, L. (Ed.), E. Crabbe, X. Xu, T. Herbert, "GRE-in-UDP 2184 Encapsulation," RFC 8086, Feb. 2017. 2186 [RFC8200] Deering, S., R. Hinden, "Internet Protocol, Version 6 2187 (IPv6) Specification," RFC 8200, Jul. 2017. 2189 [RFC8201] McCann, J., S. Deering, J. Mogul, R. Hinden (Ed.), "Path 2190 MTU Discovery for IP version 6," RFC 8201, Jul. 2017. 2192 [Sa84] Saltzer, J., D. Reed, D. Clark, "End-to-end arguments in 2193 system design," ACM Trans. on Computer Systems, Nov. 1984. 2195 [Te18] Templin, F., "Asymmetric Extended Route Optimization," 2196 draft-templin-aerolink-82, May 2018. 2198 [To01] Touch, J., "Dynamic Internet Overlay Deployment and 2199 Management Using the X-Bone," Computer Networks, July 2001, 2200 pp. 117-135. 2202 [To03] Touch, J., Y. Wang, L. Eggert, G. Finn, "Virtual Internet 2203 Architecture," USC/ISI Tech. Report ISI-TR-570, Aug. 2003. 2205 [To16] Touch, J., "Middlebox Models Compatible with the 2206 Internet," USC/ISI Tech. Report ISI-TR-711, Oct. 2016. 2208 [To98] Touch, J., S. Hotz, "The X-Bone," Proc. Globecom Third 2209 Global Internet Mini-Conference, Nov. 1998. 2211 [Zi80] Zimmermann, H., "OSI Reference Model - The ISO Model of 2212 Architecture for Open Systems Interconnection," IEEE Trans. 2213 on Comm., Apr. 1980. 2215 9.
Acknowledgments 2217 This document originated as the result of numerous discussions among 2218 the authors, Jari Arkko, Stuart Bryant, Lars Eggert, Ted Faber, Gorry 2219 Fairhurst, Dino Farinacci, Matt Mathis, and Fred Templin. It 2220 benefitted substantially from detailed feedback from Toerless Eckert, 2221 Vincent Roca, and Lucy Yong, as well as other members of the Internet 2222 Area Working Group. 2224 This work is partly supported by USC/ISI's Postel Center. 2226 This document was prepared using 2-Word-v2.0.template.dot. 2228 Authors' Addresses 2230 Joe Touch 2231 Manhattan Beach, CA 90266 2232 U.S.A. 2234 Phone: +1 (310) 560-0334 2235 Email: touch@strayalpha.com 2237 W. Mark Townsley 2238 Cisco 2239 L'Atlantis, 11, Rue Camille Desmoulins 2240 Issy Les Moulineaux, ILE DE FRANCE 92782 2242 Email: townsley@cisco.com 2244 APPENDIX A: Fragmentation efficiency 2246 A.1. Selecting fragment sizes 2248 There are different ways to fragment a packet. Consider a network 2249 with a PMTU as shown in Figure 16, where packets are encapsulated 2250 over the same network layer as they arrive on (e.g., IP in IP). If a 2251 packet as large as the PMTU arrives, it must be fragmented to 2252 accommodate the additional header. 
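The two strategies compared in the remainder of this section, "maximum fit" (fill each fragment up to the tunnel MTU) and "even split" (divide into roughly equal fragments), can be explored numerically. The Python sketch below is illustrative only and is not part of this document's recommendations. The function names, the 20-byte encapsulation headers, and the 1500-byte MTUs are assumptions chosen for the example, and the 8-byte alignment that IP imposes on fragment offsets is ignored. It models outer fragmentation, where each fragment carries its own copy of the encapsulation header.

```python
def encapsulate(payload, hdr, mtu, strategy):
    """Return the on-wire sizes of the tunnel packets carrying `payload` bytes.

    hdr is the (assumed) encapsulation header size and mtu the tunnel MTU.
    """
    room = mtu - hdr                      # payload bytes that fit in one packet
    if room <= 0:
        raise ValueError("MTU must exceed the header size")
    if payload <= room:
        return [payload + hdr]            # fits without fragmentation
    count = -(-payload // room)           # ceiling division: minimum fragments
    if strategy == "max_fit":
        # Fill every fragment but the last to the MTU.
        chunks = [room] * (count - 1) + [payload - room * (count - 1)]
    elif strategy == "even_split":
        # Same fragment count, but roughly equal sizes.
        base, extra = divmod(payload, count)
        chunks = [base + 1] * extra + [base] * (count - extra)
    else:
        raise ValueError("unknown strategy: " + strategy)
    return [c + hdr for c in chunks]


def through_tunnels(packet, tunnels, strategy):
    """Pass one packet through nested tunnels given as (hdr, mtu) pairs."""
    sizes = [packet]
    for hdr, mtu in tunnels:
        # Each tunnel treats every arriving packet as opaque payload.
        sizes = [out for s in sizes for out in encapsulate(s, hdr, mtu, strategy)]
    return sizes


tunnels = [(20, 1500), (20, 1500)]        # two nested tunnels, assumed values
print(through_tunnels(1500, tunnels, "max_fit"))
print(through_tunnels(1500, tunnels, "even_split"))
```

With these assumed values, maximum fit ends with three packets after the second tunnel, because the full-sized first fragment must be fragmented again, while even split ends with two.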
2254         X===========================X (transit PMTU)
2255         +----+----------------------+
2256         | iH | DDDDDDDDDDDDDDDDDDDD |
2257         +----+----------------------+
2258           |
2259           |  X===========================X (tunnel 1 MTU)
2260           |  +---+----+------------------+
2261       (a) +->| H'| iH | DDDDDDDDDDDDDDDD |
2262           |  +---+----+------------------+
2263           |      |
2264           |      |  X===========================X (tunnel 2 MTU)
2265           |      |  +----+---+----+-------------+
2266           | (a1) +->| nH'| H | iH | DDDDDDDDDDD |
2267           |      |  +----+---+----+-------------+
2268           |      |
2269           |      |  +----+-------+
2270           | (a2) +->| nH"| DDDDD |
2271           |         +----+-------+
2272           |
2273           |  +---+------+
2274       (b) +->| H"| DDDD |
2275              +---+------+
2276                  |
2277                  |  +----+---+------+
2278             (b1) +->| nH'| H"| DDDD |
2279                     +----+---+------+

2281 Figure 16 Fragmenting via maximum fit

2283 Figure 16 shows this process using "maximum fit", assuming outer 2284 fragmentation as an example (the situation is the same for inner 2285 fragmentation, but the headers that are affected differ). In maximum 2286 fit, the arriving packet is split into (a) and (b), where (a) is the 2287 size of the first tunnel, i.e., the tunnel 1 MTU (the maximum that 2288 fits over the first tunnel). However, this tunnel then traverses over 2289 another tunnel (number 2), whose impact the first tunnel ingress has 2290 not accommodated. The packet (a) arrives at the second tunnel 2291 ingress, and needs to be encapsulated again, but it needs to be 2292 fragmented as well to fit into the tunnel 2 MTU, into (a1) and (a2). 2293 In this case, packet (b) arrives at the second tunnel ingress and is 2294 encapsulated into (b1) without fragmentation, because it is already 2295 below the tunnel 2 MTU size.

2297 In Figure 17, the fragmentation is done using "even split", i.e., by 2298 splitting the original packet into two roughly equal-sized 2299 components, (c) and (d).
Note that (d) contains more packet data, 2300 because (c) includes the original packet header, as this is an 2301 example of outer fragmentation. The packets (c) and (d) arrive at the 2302 second tunnel encapsulator and are encapsulated again; this time, 2303 neither packet exceeds the tunnel 2 MTU, and neither requires further 2304 fragmentation.

2306         X===========================X (transit PMTU)
2307         +----+----------------------+
2308         | iH | DDDDDDDDDDDDDDDDDDDD |
2309         +----+----------------------+
2310           |
2311           |  X===========================X (tunnel 1 MTU)
2312           |  +---+----+----------+
2313       (c) +->| H'| iH | DDDDDDDD |
2314           |  +---+----+----------+
2315           |      |
2316           |      |  X===========================X (tunnel 2 MTU)
2317           |      |  +----+---+----+----------+
2318           | (c1) +->| nH | H'| iH | DDDDDDDD |
2319           |         +----+---+----+----------+
2320           |
2321           |  +---+--------------+
2322       (d) +->| H"| DDDDDDDDDDDD |
2323              +---+--------------+
2324                  |
2325                  |  +----+---+--------------+
2326             (d1) +->| nH | H"| DDDDDDDDDDDD |
2327                     +----+---+--------------+

2329 Figure 17 Fragmenting via "even split"

2331 A.2. Packing

2333 Encapsulating individual packets to traverse a tunnel can be 2334 inefficient, especially where headers are large relative to the 2335 packets being carried. In that case, it can be more efficient to 2336 encapsulate many small packets in a single, larger tunnel payload.

2338 This technique, similar to the effect of packet bursting in Gigabit 2339 Ethernet (regardless of whether the packets are encoded using L2 symbols as 2340 delineators), reduces the overhead of the encapsulation headers 2341 (Figure 18). It reduces the work of header addition and removal at 2342 the tunnel endpoints, but increases other work involving the packing 2343 and unpacking of the component packets carried.
2345              +-----+-----+
2346              | iHa | iDa |
2347              +-----+-----+
2348                    |
2349                    |     +-----+-----+
2350                    |     | iHb | iDb |
2351                    |     +-----+-----+
2352                    |           |
2353                    |           |     +-----+-----+
2354                    |           |     | iHc | iDc |
2355                    |           |     +-----+-----+
2356                    |           |           |
2357                    v           v           v
2358         +----+-----+-----+-----+-----+-----+-----+
2359         | oH | iHa | iDa | iHb | iDb | iHc | iDc |
2360         +----+-----+-----+-----+-----+-----+-----+

2362 Figure 18 Packing packets into a tunnel
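
The header saving from packing can be estimated with a short sketch. The Python below is a hypothetical illustration, not a mechanism defined by this document: the `pack` function, the greedy in-order grouping, the 40-byte outer header, and the 240-byte tunnel MTU are all assumptions. It also assumes the inner packets are self-delimiting (for example, through their own length fields), so no per-packet framing is added inside the tunnel payload.

```python
def pack(inner_sizes, outer_hdr, mtu):
    """Greedily group inner packets, in arrival order, into tunnel payloads.

    Returns a list of groups; each group shares one outer header and its
    total size (plus outer_hdr) does not exceed the tunnel MTU.
    """
    room = mtu - outer_hdr
    groups, current, used = [], [], 0
    for size in inner_sizes:
        if size > room:
            raise ValueError("inner packet too large to pack; needs fragmentation")
        if used + size > room:            # current tunnel packet is full
            groups.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        groups.append(current)
    return groups


inner = [60] * 5                          # five small inner packets (assumed)
outer_hdr, mtu = 40, 240                  # assumed outer header size and MTU

groups = pack(inner, outer_hdr, mtu)
unpacked_hdr_bytes = outer_hdr * len(inner)    # one outer header per packet
packed_hdr_bytes = outer_hdr * len(groups)     # one outer header per group
print(groups, unpacked_hdr_bytes, packed_hdr_bytes)
```

In this example five inner packets travel in two tunnel packets instead of five, so the outer header bytes drop from 200 to 80. Greedy grouping is only one possible policy; an ingress that waits to fill a payload trades this saving against added delay.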