idnits 2.17.1 draft-ietf-intarea-tunnels-07.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. -- The draft header indicates that this document updates RFC4459, but the abstract doesn't seem to directly say this. It does mention RFC4459 though, so this could be OK. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year (Using the creation date from RFC4459, updated by this document, for RFC5378 checks: 2004-06-14) -- The document seems to contain a disclaimer for pre-RFC5378 work, and may have content which was first submitted before 10 November 2008. The disclaimer is necessary when there are original authors that you have been unable to contact, or if some do not wish to grant the BCP78 rights to the IETF Trust. If you are able to get all authors (current and original) to grant those rights, you can and should remove the disclaimer; otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 8, 2017) is 2513 days in the past. Is this intentional? 
Checking references for intended status: Best Current Practice ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-16) exists of draft-ietf-nvo3-geneve-04 -- Obsolete informational reference (is this intentional?): RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 1981 (Obsoleted by RFC 8201) -- Obsolete informational reference (is this intentional?): RFC 2460 (Obsoleted by RFC 8200) -- Obsolete informational reference (is this intentional?): RFC 4960 (Obsoleted by RFC 9260) -- Obsolete informational reference (is this intentional?): RFC 6434 (Obsoleted by RFC 8504) -- Obsolete informational reference (is this intentional?): RFC 6830 (Obsoleted by RFC 9300, RFC 9301) -- Obsolete informational reference (is this intentional?): RFC 6833 (Obsoleted by RFC 9301) == Outdated reference: A later version (-82) exists of draft-templin-aerolink-75 Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Area WG J. Touch 2 Internet Draft USC/ISI 3 Intended status: Best Current Practice M. Townsley 4 Updates: 4459 Cisco 5 Expires: December 2017 June 8, 2017 7 IP Tunnels in the Internet Architecture 8 draft-ietf-intarea-tunnels-07.txt 10 Status of this Memo 12 This Internet-Draft is submitted in full conformance with the 13 provisions of BCP 78 and BCP 79. 15 This document may contain material from IETF Documents or IETF 16 Contributions published or made publicly available before November 17 10, 2008. 
The person(s) controlling the copyright in some of this 18 material may not have granted the IETF Trust the right to allow 19 modifications of such material outside the IETF Standards Process. 20 Without obtaining an adequate license from the person(s) controlling 21 the copyright in such materials, this document may not be modified 22 outside the IETF Standards Process, and derivative works of it may 23 not be created outside the IETF Standards Process, except to format 24 it for publication as an RFC or to translate it into languages other 25 than English. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF), its areas, and its working groups. Note that 29 other groups may also distribute working documents as Internet- 30 Drafts. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 The list of current Internet-Drafts can be accessed at 38 http://www.ietf.org/ietf/1id-abstracts.txt 40 The list of Internet-Draft Shadow Directories can be accessed at 41 http://www.ietf.org/shadow.html 43 This Internet-Draft will expire on November 8, 2017. 45 Copyright Notice 47 Copyright (c) 2017 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents 52 (http://trustee.ietf.org/license-info) in effect on the date of 53 publication of this document. Please review these documents 54 carefully, as they describe your rights and restrictions with respect 55 to this document. 
Code Components extracted from this document must 56 include Simplified BSD License text as described in Section 4.e of 57 the Trust Legal Provisions and are provided without warranty as 58 described in the Simplified BSD License. 60 Abstract 62 This document discusses the role of IP tunnels in the Internet 63 architecture. An IP tunnel transits IP datagrams as payloads in non- 64 link layer protocols. This document explains the relationship of IP 65 tunnels to existing protocol layers and the challenges in supporting 66 IP tunneling, based on the equivalence of tunnels to links. The 67 implications of this document are used to derive recommendations that 68 update MTU and fragment issues in RFC 4459. 70 Table of Contents 72 1. Introduction...................................................3 73 2. Conventions used in this document..............................6 74 2.1. Key Words.................................................6 75 2.2. Terminology...............................................6 76 3. The Tunnel Model..............................................10 77 3.1. What is a Tunnel?........................................11 78 3.2. View from the Outside....................................13 79 3.3. View from the Inside.....................................14 80 3.4. Location of the Ingress and Egress.......................15 81 3.5. Implications of This Model...............................15 82 3.6. Fragmentation............................................16 83 3.6.1. Outer Fragmentation.................................16 84 3.6.2. Inner Fragmentation.................................18 85 3.6.3. The Necessity of Outer Fragmentation................19 86 4. IP Tunnel Requirements........................................20 87 4.1. Encapsulation Header Issues..............................20 88 4.1.1. General Principles of Header Fields Relationships...20 89 4.1.2. Addressing Fields...................................21 90 4.1.3. 
Hop Count Fields....................................21 91 4.1.4. IP Fragment Identification Fields...................22 92 4.1.5. Checksums...........................................23 93 4.2. MTU Issues...............................................24 94 4.2.1. Minimum MTU Considerations..........................24 95 4.2.2. Fragmentation.......................................27 96 4.2.3. Path MTU Discovery..................................30 97 4.3. Coordination Issues......................................32 98 4.3.1. Signaling...........................................32 99 4.3.2. Congestion..........................................34 100 4.3.3. Multipoint Tunnels and Multicast....................34 101 4.3.4. Load Balancing......................................35 102 4.3.5. Recursive Tunnels...................................36 103 5. Observations..................................................37 104 5.1. Summary of Recommendations...............................37 105 5.2. Impact on Existing Encapsulation Protocols...............37 106 5.3. Tunnel Protocol Designers................................40 107 5.3.1. For Future Standards................................40 108 5.3.2. Diagnostics.........................................40 109 5.4. Tunnel Implementers......................................41 110 5.5. Tunnel Operators.........................................41 111 6. Security Considerations.......................................42 112 7. IANA Considerations...........................................43 113 8. References....................................................43 114 8.1. Normative References.....................................43 115 8.2. Informative References...................................43 116 9. Acknowledgments...............................................48 117 APPENDIX A: Fragmentation efficiency.............................50 118 A.1. Selecting fragment sizes.................................50 119 A.2. 
Packing..................................................51 121 1. Introduction 123 The Internet layering architecture is loosely based on the ISO seven 124 layer stack, in which data units traverse the stack by being wrapped 125 inside data units of the next layer down [Cl88][Zi80]. A tunnel is a 126 mechanism for transmitting data units between endpoints by wrapping 127 them as data units of the same or higher layers, e.g., IP in IP 128 (Figure 1) or IP in UDP (Figure 2). 130 +----+----+--------------+ 131 | IP'| IP | Data | 132 +----+----+--------------+ 134 Figure 1 IP inside IP 136 +----+-----+----+--------------+ 137 | IP'| UDP | IP | Data | 138 +----+-----+----+--------------+ 140 Figure 2 IP in UDP in IP in Ethernet 142 This document focuses on tunnels that transit IP packets, i.e., in 143 which an IP packet is the payload of another protocol, other than a 144 typical link layer. A tunnel is a virtual link that can help decouple 145 the network topology seen by transiting packets from the underlying 146 physical network [To98][RFC2473]. Tunnels were critical in the 147 development of multicast because not all routers were capable of 148 processing multicast packets [Er94]. Tunnels allowed multicast 149 packets to transit efficiently between multicast-capable routers over 150 paths that did not support native link-layer multicast. Similar 151 techniques have been used to support incremental deployment of other 152 protocols over legacy substrates, such as IPv6 [RFC2546]. 154 Use of tunnels is common in the Internet. 
The word "tunnel" occurs in 155 nearly 1,500 RFCs (of nearly 8,000 current RFCs, close to 20%), and 156 tunneling is supported within numerous protocols, including: 158 o IP in IP / mobile IP - IPv4 in IPv4 tunnels using protocol 4 159 [RFC2003][RFC2473][RFC5944] and its precursor called "IPIP" using 160 protocol 94 [RFC1853] 162 o IP in IPv6 - IPv6 or IPv4 in IPv6 [RFC2473] 164 o IPsec - includes a tunnel mode to enable encryption or 165 authentication of an entire IP datagram inside another IP 166 datagram [RFC4301] 168 o Generic Routing Encapsulation (GRE) - a shim layer for tunneling 169 any network layer in any other network layer, as in IP in GRE in 170 IP [RFC2784][RFC7588][RFC7676], or inside UDP in IP [RFC8086] 172 o MPLS - a shim layer for tunneling IP over a circuit-like path over 173 a link layer [RFC3031] or inside UDP in IP [RFC7510], in which 174 identifiers are rewritten on each hop, often used for traffic 175 provisioning 177 o LISP - a mechanism that uses multipoint IP tunnels to reduce 178 routing table load within an enclave of routers at the expense of 179 more complex tunnel ingress encapsulation tables [RFC6830] 181 o TRILL - a mechanism that uses multipoint L2 tunnels to enable use 182 of L3 routing (typically IS-IS) in an enclave of Ethernet bridges 183 [RFC5556][RFC6325] 185 o Generic UDP Encapsulation (GUE) - IP in UDP in IP [He16] 187 o Automatic Multicast Tunneling (AMT) - IP in UDP in IP for 188 multicast [RFC7450] 190 o L2TP - PPP over IP, to extend a subscriber's DSL/FTTH connection 191 from an access line provider to an ISP [RFC3931] 193 o L2VPNs - provide a link topology different from that provided by 194 physical links [RFC4664]; many of these are not classical tunnels, 195 using only tags (Ethernet VLAN tags) rather than encapsulation 197 o L3VPNs - provide a network topology different from that provided 198 by ISPs [RFC4176] 200 o NVO3 - data center network sharing (to be determined, which may 201 include use of GUE or other 
tunnels) [RFC7364] 203 o PWE3 - emulates wire-like services over packet-switched services 204 [RFC3985] 206 o SEAL/AERO - IP in IP tunneling with an additional shim header 207 designed to overcome the limitations of RFC2003 [RFC5320][Te17] 209 o A number of legacy variants, including swIPe (an IPsec precursor), 210 a GRE precursor, and the Internet Encapsulation Protocol, all of 211 which included a shim layer [RFC1853] 213 The variety of tunnel mechanisms raises the question of the role of 214 tunnels in the Internet architecture and the potential need for these 215 mechanisms to have similar and predictable behavior. In particular, 216 the ways in which packet size (i.e., Maximum Transmission Unit or 217 MTU) mismatches and error signals (e.g., ICMP) are handled may 218 benefit from a coordinated approach. 220 Regardless of the layer in which encapsulation occurs, tunnels 221 emulate a link. The only difference is that a link operates over a 222 physical communication channel, whereas a tunnel operates over other 223 software protocol layers. Because tunnels are links, they are subject 224 to the same issues as any link, e.g., MTU discovery, signaling, and 225 the potential utility of native support for broadcast and multicast 226 [RFC3819]. Tunnels have some advantages over native links, being 227 potentially easier to reconfigure and control because they can 228 generally rely on existing out-of-band communication between their 229 endpoints. 231 The first attempt to use large-scale tunnels was to transit multicast 232 traffic across the Internet in 1988, and this resulted in 'tunnel 233 collapse'. At the time, tunnels were not implemented as 234 encapsulation-based virtual links, but rather as loose source routes 235 on un-encapsulated IP datagrams [RFC1075]. 
Then, as now, routers did 236 not support use of the loose source route IP option at line rate, and 237 the multicast traffic caused overload of the so-called "slow path" 238 processing of IP datagrams in software. Using encapsulation tunnels 239 avoided that collapse by allowing the forwarding of encapsulated 240 packets to use the "fast path" hardware processing [Er94]. 242 The remainder of this document describes the general principles of IP 243 tunneling and discusses the key considerations in the design of any 244 protocol that tunnels IP datagrams. It derives its conclusions from 245 the equivalence of tunnels and links and from requirements of 246 existing standards for supporting IPv4 and IPv6 as payloads. 248 2. Conventions used in this document 250 2.1. Key Words 252 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 253 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 254 document are to be interpreted as described in RFC-2119 [RFC2119]. 256 In this document, these key words will appear with that 257 interpretation only when in ALL CAPS. Lower case uses of these words 258 are not to be interpreted as carrying RFC-2119 significance. 260 2.2. Terminology 262 This document uses the following terminology. Optional words in the 263 term are indicated in parentheses, e.g., "(link or network) 264 interface" or "egress (interface)". 266 Terms from existing RFCs: 268 o Messages: variable length data labeled with globally-unique 269 endpoint IDs, also known as a datagram for IP messages [RFC791]. 271 o Node: a physical or logical network device that participates as 272 either a host [RFC1122][RFC6434] or router [RFC1812]. This term 273 originally referred to gateways since some very early RFCs [RFC5], 274 but is currently the common way to describe a point in a network 275 at which messages are processed. 
277 o Host or endpoint: a node that sources or sinks messages labeled 278 from/to its IDs, typically known as a host for both IP and higher- 279 layer protocol messages [RFC1122]. 281 o Source or sender: the node that generates a message [RFC1122]. 283 o Destination or receiver: the node that consumes a message 284 [RFC1122]. 286 o Router or gateway: a node that relays IP messages using 287 destination IDs and local context [RFC1812]. Routers also act as 288 hosts when they source or sink messages. Also known as a forwarder 289 for IP messages. Note that the notion of router is relative to the 290 layer at which message processing is considered [To16]. 292 o Link: a communications medium (or emulation thereof) that 293 transfers IP messages between nodes without traversing a router 294 (as would require decrementing the hop count) [RFC1122][RFC1812]. 296 o Link packet: a link layer message, which can carry an IP datagram 297 as a payload 299 o (Link or network) Interface: a location on a link co-located with 300 a node where messages depart onto that link or arrive from that 301 link. On physical links, this interface formats the message for 302 transmission and interprets the received signals. 304 o Path: a sequence of one or more links over which an IP message 305 traverses between source and destination nodes (hosts or routers). 307 o (Link) MTU: the largest message that can transit a link [RFC791], 308 also often referred to simply as "MTU". It does not include the 309 size of link-layer information, e.g., link layer headers or 310 trailers, i.e., it refers to the message that the link can carry 311 as a payload rather than the message as it appears on the link. 312 This is thus the largest network layer packet (including network 313 layer headers, e.g., IP datagram) that can transit a link. Note 314 that this need not be the native size of messages on the link, 315 i.e., the link may internally fragment and reassemble messages. 
316 For IPv4, the smallest MTU must be at least 68 bytes [RFC791], and 317 for IPv6 the smallest MTU must be at least 1280 bytes [RFC2460]. 319 o EMTU_S (effective MTU for sending): the largest message that can 320 transit a link, possibly also accounting for fragmentation that 321 happens before the fragments are emitted onto the link [RFC1122]. 322 When source fragmentation is possible, EMTU_S = EMTU_R. When 323 source fragmentation is not possible, EMTU_S = (link) MTU. For 324 IPv4, this MUST be at least 68 bytes [RFC791] and for IPv6 this 325 MUST be at least 1280 bytes [RFC2460]. 327 o EMTU_R (effective MTU to receive): the largest payload message 328 that a receiver must be able to accept. This thus also represents 329 the largest message that can traverse a link, taking into account 330 reassembly at the receiver that happens after the fragments are 331 received [RFC1122]. For IPv4, this MUST be at least 576 bytes 332 [RFC791] and for IPv6 this MUST be at least 1500 bytes [RFC2460]. 334 o Path MTU (PMTU): the largest message that can transit a path of 335 links [RFC1191][RFC1981]. Typically, this is the minimum of the 336 link MTUs of the links of the path, and represents the largest 337 network layer message (including network layer headers) that can 338 transit a path without requiring fragmentation while in transit. 339 Note that this is not the largest network packet that can be sent 340 between a source and destination, because that network packet 341 might have been fragmented at the network layer of the source and 342 reassembled at the network layer of the destination. 344 o Tunnel: a protocol mechanism that transits messages between an 345 ingress interface and egress interface using encapsulation to 346 allow an existing network path to appear as a single link 347 [RFC1853]. Note that a protocol can be used to tunnel itself (IP 348 over IP). 
There is essentially no difference between a tunnel and 349 the conventional layering of the ISO stack (i.e., by this 350 definition, Ethernet can be considered a tunnel for IP). A tunnel 351 is also known as a virtual link. 353 o Ingress (interface): the virtual link interface of a tunnel that 354 receives messages within a node, encapsulates them according to 355 the tunnel protocol, and transmits them into the tunnel [RFC2983]. 356 An ingress is the tunnel equivalent of the outgoing (departing) 357 network interface of a link, and its encapsulation processing is 358 the tunnel equivalent of encoding a message for transmission over 359 a physical link. The ingress virtual link interface can be co- 360 located with the traffic source. 362 The term 'ingress' in other RFCs also refers to 'network ingress', 363 which is the entry point of traffic to a transit network. Because 364 this document focuses on tunnels, the term "ingress" used in the 365 remainder of this document implies "tunnel ingress". 367 o Egress (interface): a virtual link interface of a tunnel that 368 receives messages that have finished transiting a tunnel and 369 presents them to a node [RFC2983]. For reasons similar to ingress, 370 the term 'egress' will refer to 'tunnel egress' throughout the 371 remainder of this document. An egress is the tunnel equivalent of 372 the incoming (arriving) network interface of a link and its 373 decapsulation processing is the tunnel equivalent of interpreting 374 a signal received from a physical link. The egress decapsulates 375 messages for further transit to the destination. The egress 376 virtual link interface can be co-located with the traffic 377 destination. 379 o Ingress node: network device on which an ingress is attached as a 380 virtual link interface [RFC2983]. Note that a node can act as both 381 an ingress node and an egress node at the same time, but typically 382 only for different tunnels. 
384 o Egress node: device where an egress is attached as a virtual link 385 interface [RFC2983]. Note that a device can act as both an ingress 386 node and an egress node at the same time, but typically only for 387 different tunnels. 389 o Inner header: the header of the message as it arrives at the 390 ingress [RFC2003]. 392 o Outer header(s): one or more headers added to the message by the 393 ingress, as part of the encapsulation for tunnel transit 394 [RFC2003]. 396 o Mid-tunnel fragmentation: Fragmentation of the message during the 397 tunnel transit, as could occur for IPv4 datagrams with DF=0 398 [RFC2983]. 400 o Atomic packet, datagram, or fragment: an IP packet that has not 401 been fragmented and which cannot be fragmented further [RFC6864] 402 [RFC6946]. 404 The following terms are introduced by this document: 406 o (Tunnel) transit packet: the packet arriving at a node connected 407 to a tunnel that enters the ingress interface and exits the egress 408 interface, i.e., the packet carried over the tunnel. This is 409 sometimes known as the 'tunneled packet'. 410 This is the tunnel equivalent of a network layer 411 packet as it would traverse a link. This document focuses on IPv4 412 and IPv6 transit packets. 414 o (Tunnel) link packet (TLP): packets that traverse between two 415 interfaces, e.g., from ingress interface to egress interface, in 416 which resides all or part of a transit packet. A tunnel link 417 packet is the tunnel equivalent of a link (layer) packet as it 418 would traverse a link, which is why we use the same terminology. 420 o Tunnel MTU: the largest transit packet that can traverse a tunnel, 421 i.e., the tunnel equivalent of a link MTU, which is why we use the 422 same terminology. This is the largest transit packet which can be 423 reassembled at the egress interface. 
425 o Tunnel maximum atomic packet (MAP): the largest transit packet 426 that can traverse a tunnel as an atomic packet, i.e., without 427 requiring tunnel link packet fragmentation either at the ingress 428 or on-path between the ingress and egress. 430 o Inner fragmentation: fragmentation of the transit packet that 431 arrives at the ingress interface before any additional headers are 432 added. This can only correctly occur for IPv4 DF=0 datagrams. 434 o Outer fragmentation: source fragmentation of the tunnel link 435 packet after encapsulation; this can involve fragmenting the 436 outermost header or any of the other (if any) protocol layers 437 involved in encapsulation. 439 o Maximum frame size (MFS): the link-layer equivalent of the MTU, 440 using the OSI term 'frame'. For Ethernet, the MTU (network packet 441 size) is 1500 bytes but the MFS (link frame size) is 1518 bytes 442 originally, and 1522 bytes assuming VLAN (802.1Q) tagging support. 444 o EMFS_S: the link layer equivalent of EMTU_S. 446 o EMFS_R: the link layer equivalent of EMTU_R. 448 o Path MFS: the link layer equivalent of PMTU. 450 3. The Tunnel Model 452 A network architecture is an abstract description of a distributed 453 communications system, its components and their relationships, the 454 requisite properties of those components and the emergent properties 455 of the system that result [To03]. Such descriptions can help explain 456 behavior, as when the OSI seven-layer model is used as a teaching 457 example [Zi80]. Architectures describe capabilities - and, just as 458 importantly, constraints. 460 A network can be defined as a system of endpoints and relays 461 interconnected by communication paths, abstracting away issues of 462 naming in order to focus on message forwarding. 
To the extent that 463 the Internet has a single, coherent interpretation, its architecture 464 is defined by its core protocols (IP [RFC791], TCP [RFC793], UDP 465 [RFC768]) whose messages are handled by hosts, routers, and links 466 [Cl88][To03], as shown in Figure 3: 468 +------+ ------ ------ +------+ 469 | | / \ / \ | | 470 | HOST |--+ ROUTER +--+ ROUTER +--| HOST | 471 | | \ / \ / | | 472 +------+ ------ ------ +------+ 474 Figure 3 Basic Internet architecture 476 As a network architecture, the Internet is a system of hosts 477 (endpoints) and routers (relays) interconnected by links that 478 exchange messages when possible. "When possible" defines the 479 Internet's "best effort" principle. The limited role of routers and 480 links represents the End-to-End Principle [Sa84] and longest-prefix 481 match enables hierarchical forwarding using compact tables. 483 Although the definitions of host, router, and link seem absolute, 484 they are often relative as viewed within the context of one protocol 485 layer, each of which can be considered a distinct network 486 architecture. An Internet gateway is an OSI Layer 3 router when it 487 transits IP datagrams but it acts as an OSI Layer 2 host as it 488 sources or sinks Layer 2 messages on attached links to accomplish 489 this transit capability. In this way, one device (Internet gateway) 490 behaves as different components (router, host) at different layers. 492 Even though a single device may have multiple roles - even 493 concurrently - at a given layer, each role is typically static and 494 determined by context. An Internet gateway always acts as a Layer 2 495 host and that behavior does not depend on where the gateway is viewed 496 from within Layer 2. 
In the context of a single layer, a device's 497 behavior is typically modeled as a single component from all 498 viewpoints in that layer (with some notable exceptions, e.g., Network 499 Address Translators, which appear as hosts and routers, depending on 500 the direction of the viewpoint [To16]). 502 3.1. What is a Tunnel? 504 A tunnel can be modeled as a link in another network 505 [To98][To01][To03]. In Figure 4, a source host (Hsrc) and destination 506 host (Hdst) communicate over a network M in which two routers (Ra 507 and Rd) are connected by a tunnel. Keep in mind that 508 network N and network M can both be components of the 509 Internet, i.e., there may be regular traffic as well as tunneled 510 traffic over any of the routers shown. 512 --_ -- 513 +------+ / \ / \ +------+ 514 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 515 +------+ \ //\ / \ / \ /\\ / +------+ 516 --/I \--+ Rb +--+ Rc +--/E \-- 517 \ / \ / \ / \ / 518 \/ -- -- \/ 519 <------ Network N -------> 520 <-------------------- Network M ---------------------> 522 Figure 4 The big picture 524 The tunnel consists of two interfaces - an ingress (I) and an egress 525 (E) that lie along a path connected by network N. Regardless of how 526 the ingress and egress interfaces are connected, the tunnel serves as 527 a link between the nodes it connects (here, Ra and Rd). 529 IP packets arriving at the ingress interface are encapsulated to 530 traverse network N. We call these packets 'tunnel transit packets' 531 (or just 'transit packets') because they will transit the tunnel 532 inside one or more of what we call 'tunnel link packets'. Transit 533 packets correspond to network (IP) packets traversing a conventional 534 link and tunnel link packets correspond to the packets of a 535 conventional link layer (which can be called just 'link packets'). 
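The relationship between transit packets and tunnel link packets can be illustrated with a minimal sketch. It is not taken from any tunnel protocol: the class TunnelLinkPacket, the functions ingress() and egress(), and the parameter max_payload are hypothetical names chosen for illustration, and the byte-offset fragmentation is a simplification (real IP fragmentation uses 8-byte units and carries real headers). The sketch assumes only what the text states: the ingress places all or part of a transit packet in each tunnel link packet, and the egress recovers the original transit packet.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TunnelLinkPacket:
    """Hypothetical tunnel link packet: outer header fields plus payload."""
    src: str        # address of the ingress interface on network N
    dst: str        # address of the egress interface on network N
    ident: int      # identifies which transit packet the payload belongs to
    offset: int     # byte offset of this payload within the transit packet
    more: bool      # True if further tunnel link packets follow
    payload: bytes  # all or part of one transit packet


def ingress(transit: bytes, src: str, dst: str, ident: int,
            max_payload: int) -> List[TunnelLinkPacket]:
    """Encapsulate one transit packet in one or more tunnel link packets.

    max_payload is an assumed limit on the payload of a single tunnel
    link packet; larger transit packets are split across several.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    chunks = [transit[i:i + max_payload]
              for i in range(0, len(transit), max_payload)] or [b""]
    last = len(chunks) - 1
    return [TunnelLinkPacket(src, dst, ident, i * max_payload, i < last, c)
            for i, c in enumerate(chunks)]


def egress(packets: Iterable[TunnelLinkPacket]) -> bytes:
    """Reassemble and decapsulate, returning the original transit packet."""
    ordered = sorted(packets, key=lambda p: p.offset)
    if not ordered or ordered[-1].more:
        raise ValueError("incomplete set of tunnel link packets")
    expected = 0
    for p in ordered:
        if p.offset != expected:
            raise ValueError("gap or overlap in tunnel link packets")
        expected += len(p.payload)
    return b"".join(p.payload for p in ordered)


if __name__ == "__main__":
    transit_packet = bytes(range(256)) * 10          # 2560-byte transit packet
    tlps = ingress(transit_packet, "I", "E", ident=7, max_payload=1000)
    print(len(tlps), egress(reversed(tlps)) == transit_packet)
```

In this sketch the routers at either end of the tunnel see only the transit packet going in and coming out, while everything between ingress() and egress() deals only in tunnel link packets, which mirrors the two viewpoints the model distinguishes.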
537 Link packets use the source address of the ingress interface and the 538 destination address of the egress interface - using whatever address 539 is appropriate to the Layer at which the ingress and egress 540 interfaces operate (Layer 2, Layer 3, Layer 4, etc.). The egress 541 interface decapsulates those messages, which then continue on network 542 M as if emerging from a link. To transit packets and to the routers 543 the tunnel connects (Ra and Rd), the tunnel acts as a link and the 544 ingress and egress interfaces act as network interfaces to that link. 546 The model of each component (ingress and egress interfaces) and the 547 entire system (tunnel) depends on the layer from which they are 548 viewed. From the perspective of the outermost hosts (Hsrc and Hdst), 549 the tunnel appears as a link between two routers (Ra and Rd). For 550 routers along the tunnel (e.g., Rb and Rc), the ingress and egress 551 interfaces appear as the endpoint hosts on network N. 553 When the tunnel network (N) is implemented using the same protocol as 554 the endpoint network (M), the picture looks flatter (Figure 5), as if 555 it were running over a single network. However, this appearance is 556 incorrect - nothing has changed from the previous case. From the 557 perspective of the endpoints, Rb and Rc and network N don't exist and 558 aren't visible, and from the perspective of the tunnel, network M 559 doesn't exist. The fact that network N and M use the same protocol, 560 and may traverse the same links is irrelevant. 562 --_ -- -- -- 563 +------+ / \ /\ / \ / \ /\ / \ +------+ 564 | Hsrc |--+ Ra +/I \--+ Rb +--+ Rc +--/E \+ Rd +--| Hdst | 565 +------+ \ / \ / \ / \ / \ / \ / +------+ 566 -- \/ -- -- \/ -- 567 <---- Network N -----> 568 <------------------ Network M -------------------> 570 Figure 5 IP in IP network picture 572 3.2. View from the Outside 574 As already observed, from outside the tunnel, to network M, the 575 entire tunnel acts as a link (Figure 6). 
Consequently, all 576 requirements for links supporting IP also apply to tunnels [RFC3819]. 578 --_ -- 579 +------+ / \ / \ +------+ 580 | Hsrc |--+ Ra +--------------------------+ Rd +--| Hdst | 581 +------+ \ / \ / +------+ 582 -- -- 583 <------------------ Network M -------------------> 585 Figure 6 Tunnels as viewed from the outside 587 For example, the IP datagram hop counts (IPv4 Time-to-Live [RFC791] 588 and IPv6 Hop Limit [RFC2460]) are decremented when traversing a 589 router, but not when traversing a link - or thus a tunnel. Similarly, 590 because the ingress and egress are interfaces on this outer network, 591 they should never issue ICMP messages. A router or host would issue 592 the appropriate ICMP, e.g., "packet too big" (IPv4 fragmentation 593 needed and DF set [RFC792] or IPv6 packet too big [RFC4443]), when 594 trying to send a packet to the egress, as it would for any interface. 596 Tunnels have a tunnel MTU - the largest message that can transit that 597 tunnel, just as links have a link MTU. As with a multihop link, this MTU may not reflect the 598 native message size of the hops within the tunnel. 599 In both cases, the MTU is defined by 600 the link's (or tunnel's) effective MTU to receive (EMTU_R). 602 3.3. View from the Inside 604 Within network N, i.e., from inside the tunnel itself, the ingress 605 interface is a source of tunnel link packets and the egress interface 606 is a sink - so both are viewed as hosts on network N (Figure 7). 607 Consequently, the Internet host requirements [RFC1122] apply to ingress 608 and egress interfaces when Network N uses IP (and thus the 609 ingress/egress interfaces use IP encapsulation). 611 _ -- -- 612 /\ / \ / \ /\ 613 /I \--+ Rb +--+ Rc +--/E \ 614 \ / \ / \ / \ / 615 \/ -- -- \/ 616 <---- Network N -----> 618 Figure 7 Tunnels, as viewed from within the tunnel 620 Viewed from within the tunnel, the outer network (M) doesn't exist. 
621 Tunnel link packets can be fragmented by the source (ingress 622 interface) and reassembled at the destination (egress interface), 623 just as at conventional hosts. The path between ingress and egress 624 interfaces has a path MTU, but the endpoints can exchange messages as 625 large as can be reassembled at the destination (egress interface), 626 i.e., the EMTU_R of the egress interface. However, in both cases, 627 these MTUs refer to the size of the message that can transit the 628 links and between the hosts of network N, which represents a link 629 layer to network M. I.e., the MTUs of network N represent the maximum 630 frame sizes (MFSs) of the tunnel as a link in network M. 632 Information about the network - i.e., regarding network N MTU sizes, 633 network reachability, etc. - is relayed from the destination (egress 634 interface) and intermediate routers back to the source (ingress 635 interface), without regard for the external network (M). When such 636 messages arrive at the ingress interface, they may affect the 637 properties of that interface (e.g., its reported MTU to network M), 638 but they should never directly cause new ICMPs in the outer network 639 M. Again, events at interfaces don't generate ICMP messages; it would 640 be the host or router to which that interface is attached that would 641 generate ICMPs, e.g., upon attempting to use that interface. 643 3.4. Location of the Ingress and Egress 645 The ingress and egress interfaces are endpoints of the tunnel. Tunnel 646 interfaces may be physical or virtual. The interface may be 647 implemented inside the node where the tunnel attaches, e.g., inside a 648 host or router. The interface may also be implemented as a "bump in 649 the wire" (BITW), somewhere along a link between the two nodes the 650 link interconnects. IP in IP tunnels are often implemented as 651 interfaces on nodes, whereas IPsec tunnels are sometimes implemented 652 as BITW.
These implementation variations determine only whether 653 information available at the link endpoints (ingress/egress 654 interfaces) can be easily shared with the connected network nodes. 656 An ingress or egress can be implemented as an integrated component, 657 appearing equivalent to any other network interface, or can be more 658 complex. In the simple variant, each is tightly coupled to another 659 network interface, e.g., where the ingress emits encapsulated packets 660 directly into another network interface, or where the egress receives 661 packets to decapsulate directly from another network interface. 663 The other implementation variant is more modular, but more complex to 664 explain. The ingress acts like a network interface by receiving IP 665 packets to transmit from an upper layer protocol (or relay mechanism 666 of a router), but then acts like an upper layer protocol (or relay 667 mechanism of a router) when it emits encapsulated packets back into 668 the same node. The egress acts like an upper layer interface (or 669 relay mechanism of a router) by receiving packets from a network 670 interface, but then acts like a network interface when it emits 671 decapsulated packets back into the same node. To the existing 672 network interfaces, the ingress/egress act like upper layer 673 interfaces (i.e., sending or receiving application stacks), while to 674 the interior of the node, the ingress/egress act like network 675 interfaces. This dual nature inside the node reflects the duality of 676 the tunnel as transit link and host-host channel. 678 3.5.
Implications of This Model 680 This approach highlights a few key features of a tunnel as a network 681 architecture construct: 683 o To the transit packets, tunnels turn a network (Layer 3) path into 684 a (Layer 2) link 686 o To nodes the tunnel traverses, the tunnel ingress and egress 687 interfaces act as hosts that source and sink tunnel link packets 689 The consequences of these features are as follows: 691 o Like a link MTU, a tunnel MTU is defined by the effective MTU of 692 the receiver (i.e., EMTU_R of the egress). 694 o The messages inside the tunnel are treated like any other link 695 layer, i.e., the MTU is determined by the largest (transit) 696 payload that traverses the link. 698 o The tunnel path MFS is not relevant to the transited traffic. 699 There is no mechanism or protocol by which it can be determined. 701 o Because routers, not links, alter hop counts [RFC1812], hop counts 702 are not decremented solely by the transit of a tunnel. A packet 703 with a hop count of zero should successfully transit a link (and 704 thus a tunnel) that connects two hosts. 706 o The addresses of a tunnel ingress and egress interface correspond 707 to link layer addresses to the transit packet. Like links, some 708 tunnels may not have their own addresses. Like network interfaces, 709 ingress and egress interfaces typically require network layer 710 addresses. 712 o Like network interfaces, the ingress and egress interfaces are 713 never a direct source of ICMP messages but may provide information 714 to their attached host or router to generate those ICMP messages 715 during the processing of transit packets. 717 o Like network interfaces and links, two nodes may be connected by 718 any combination of tunnels and links, including multiple tunnels. 719 As with multiple links, existing network layer forwarding 720 determines which IP traffic uses each link or tunnel.
722 These observations make it much easier to determine what a tunnel 723 must do to transit IP packets, notably it must satisfy all 724 requirements expected of a link [RFC1122][RFC3819]. The remainder of 725 this document explores these implications in greater detail. 727 3.6. Fragmentation 729 There are two places where fragmentation can occur in a tunnel, 730 called 'outer fragmentation' and 'inner fragmentation'. This document 731 assumes that only outer fragmentation is viable because it is the 732 only approach that works for both IPv4 datagrams with DF=1 and IPv6. 734 3.6.1. Outer Fragmentation 736 Outer fragmentation is shown in Figure 8. The bottom of the figure 737 shows the network topology, where transit packets originate at the 738 source, enter the tunnel at the ingress interface for encapsulation, 739 exit the tunnel at the egress interface where they are decapsulated, 740 and arrive at the destination. The packet traffic is shown above the 741 topology, where the transit packets are shown at the top. In this 742 diagram, the ingress interface is located on router 'Ra' and the 743 egress interface is located on router 'Rd'. 745 When the link packet - which is the encapsulated transit packet - 746 would exceed the tunnel MTU, the packet needs to be fragmented. In 747 this case the packet is fragmented at the outer (link) header, with 748 the fragments shown as (b1) and (b2). The outer header indicates 749 fragmentation (as ' and "), the inner (transit) header occurs only in 750 the first fragment, and the inner (transit) data is broken across the 751 two packets. These fragments are reassembled at the egress interface 752 during decapsulation in step (c), where the resulting link packet is 753 reassembled and decapsulated so that the transit packet can continue 754 on its way to the destination. 
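The outer fragmentation steps described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the LinkFragment class, the function names, and the byte-granular offsets are hypothetical (real IPv4 and IPv6 fragments use 8-octet offset units and carry a full outer header), and the two-byte string b"iH" merely stands in for the inner (transit) header.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LinkFragment:
    """One fragment of a tunnel link packet (hypothetical model, not a wire format)."""
    ident: int      # outer-header ID shared by all fragments of one link packet
    offset: int     # byte offset of this slice within the outer payload
    more: bool      # True when further fragments follow
    payload: bytes  # slice of the outer payload (inner header + inner data)


def outer_fragment(transit_packet: bytes, ident: int, max_payload: int) -> List[LinkFragment]:
    """Steps (a), (b1), (b2): the whole transit packet is outer payload; slice it."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    total = len(transit_packet)
    return [
        LinkFragment(ident, off, off + max_payload < total,
                     transit_packet[off:off + max_payload])
        for off in range(0, total, max_payload)
    ]


def egress_reassemble(fragments: Iterable[LinkFragment]) -> bytes:
    """Step (c): reorder by offset, check for gaps, and recover the transit packet."""
    ordered = sorted(fragments, key=lambda f: f.offset)
    if not ordered or ordered[-1].more:
        raise ValueError("final fragment missing")
    expected = 0
    for frag in ordered:
        if frag.offset != expected:
            raise ValueError("gap or overlap in fragments")
        expected += len(frag.payload)
    return b"".join(f.payload for f in ordered)


# Stand-in transit packet: 2-byte inner header plus 10 bytes of inner data.
tp = b"iH" + b"D" * 10
frags = outer_fragment(tp, ident=7, max_payload=8)
restored = egress_reassemble(reversed(frags))  # arrival order does not matter
```

In this model the inner header appears only in the first fragment, which matches the observation that on-path devices cannot rely on seeing it in every fragment.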
756 Transit packet 757 +----+----+ +----+----+ 758 | iH | iD |------+ - - - - - - - - - - +------>| iH | iD | 759 +----+----+ | | +----+----+ 760 v Link packet | 761 +----+----+----+ +----+----+----+ 762 (a) | oH | iH | iD | | oH | iH | iD | (d) 763 +----+----+----+ +----+----+----+ 764 | ^ 765 | Link packet fragment #1 | 766 | +----+----+-----+ | 767 (b1) +----- >| oH'| iH | iD1 |-------+ (c) 768 | +----+----+-----+ | 769 | | 770 | Link packet fragment #2 | 771 | +----+-----+ | 772 (b2) +----- >| oH"| iD2 |------------+ 773 +----+-----+ 774 +-----+ +--+ +---+ +---+ +--+ +-----+ 775 | | | |/ \ / \| | | | 776 | Src |----|Ra|Ingress|=======================|Egress |Rd|----| Dst | 777 | | | |\ / \ /| | | | 778 +-----+ +--+ +---+ +---+ +--+ +-----+ 780 Figure 8 Fragmentation of the (outer) link packet 782 Outer fragmentation isolates the tunnel encapsulation duties to the 783 ingress and egress interfaces. This can be considered a benefit in 784 clean, layered network design, but also may require complex egress 785 interface decapsulation, especially where tunnels aggregate large 786 amounts of traffic, such as may result in IP ID overload (see Sec. 787 4.1.4). Outer fragmentation is valid for any tunnel link protocol 788 that supports fragmentation (e.g., IPv4 or IPv6), in which the tunnel 789 endpoints act as the host endpoints of that protocol. 791 Along the tunnel, the inner (transit) header is contained only in the 792 first fragment, which can interfere with mechanisms that 'peek' into 793 lower layer headers, e.g., as for relayed ICMP (see Sec. 4.3). 795 3.6.2. Inner Fragmentation 797 Inner fragmentation distributes the impact of tunnel fragmentation 798 across both egress interface decapsulation and transit packet 799 destination, as shown in Figure 9; this can be especially important 800 when the tunnel would otherwise need to source (outer) fragment large 801 amounts of traffic. 
However, this mechanism is valid only when the 802 transit packets can be fragmented on-path, e.g., as when the transit 803 packets are IPv4 datagrams with DF=0. 805 Again, the network topology is shown at the bottom of the figure, and 806 the original packets are shown at the top. Packets arrive at the ingress 807 node (router Ra) and are fragmented there into transit packet 808 fragments #1 (a1) and #2 (a2). These fragments are encapsulated at 809 the ingress interface in steps (b1) and (b2) and each resulting link 810 packet traverses the tunnel. When these link packets arrive at the 811 egress interface they are decapsulated in steps (c1) and (c2) and the 812 egress node (router) forwards the transit packet fragments to their 813 destination. This destination is then responsible for reassembling 814 the transit packet fragments into the original transit packet (d). 816 Along the tunnel, the inner headers are copied into each fragment, 817 and so can be 'peeked at' inside the tunnel (see Sec. 4.3). 818 Fragmentation shifts from the ingress interface to the ingress router 819 and reassembly shifts from the egress interface to the destination.
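For comparison, the inner fragmentation steps described above can be sketched as follows. This is a simplified illustration, not a definitive implementation: the function names are hypothetical, headers are opaque byte strings, and the inner header is copied unchanged into every fragment (a real router would also update the offset and more-fragments fields, shown as iH' and iH" in the figure).

```python
from typing import List


def inner_fragment(inner_header: bytes, inner_data: bytes,
                   max_data: int, df: bool) -> List[bytes]:
    """Steps (a1), (a2): the ingress router fragments the transit packet itself."""
    if df:
        # IPv4 DF=1 and IPv6 transit packets cannot be fragmented on-path.
        raise ValueError("transit packet cannot be fragmented on-path")
    if max_data <= 0:
        raise ValueError("max_data must be positive")
    return [inner_header + inner_data[i:i + max_data]
            for i in range(0, len(inner_data), max_data)]


def encapsulate(outer_header: bytes, transit_fragment: bytes) -> bytes:
    """Steps (b1), (b2): each transit fragment becomes its own link packet."""
    return outer_header + transit_fragment


transit_frags = inner_fragment(b"iH", b"abcdefgh", max_data=5, df=False)
link_packets = [encapsulate(b"oH", frag) for frag in transit_frags]
```

Every link packet here begins with both the outer and the inner header, which is why inner fragmentation keeps the transit header visible along the tunnel.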
821 Transit packet 822 +----+----+ +----+----+ 823 | iH | iD |-+ - - - - - - - - - - - - - - - - >| iH | iD | 824 +----+----+ | +----+----+ 825 v Transit packet fragment #1 ^ 826 +----+-----+ +----+-----+ | 827 (a1) | iH'| iD1 | | iH'| iD1 |-----+(d) 828 +----+-----+ +----+-----+ ^ 829 | | Link packet #1 ^ | 830 | | +----+----+----- | | 831 | (b1)+----- >| oH | iH'| iD1 |-------+(c1) | 832 | +----+----+-----+ | 833 | | 834 v Transit packet fragment #2 | 835 +----+-----+ +----+-----+ | 836 (a2) | iH"| iD2 | | iH"| iD2 |-----+ 837 +----+-----+ +----+-----+ 838 | Link packet #2 | 839 | +----+----+-----+ | 840 (b2)+----- >| oH | iH"| iD2 |-------+(c2) 841 +----+----+-----+ 842 +-----+ +--+ +---+ +---+ +--+ +-----+ 843 | | | |/ \ / \| | | | 844 | Src |----|Ra|Ingress|=======================|Egress |Rd|----| Dst | 845 | | | |\ / \ /| | | | 846 +-----+ +--+ +---+ +---+ +--+ +-----+ 848 Figure 9 Fragmentation of the inner (transit) packet 850 3.6.3. The Necessity of Outer Fragmentation 852 Fragmentation is critical for tunnels that support transit packets 853 for protocols with minimum MTU requirements, while operating over 854 tunnel paths using protocols that have their own MTU requirements. 855 Depending on the amount of space used by encapsulation, these two 856 minimums will ultimately interfere (especially when a protocol 857 transits itself either directly, as with IP-in-IP, or indirectly, as 858 in IP-in-GRE-in-IP), and the transit packet will need to be 859 fragmented to both support a tunnel MTU while traversing tunnels with 860 their own tunnel path MTUs. 862 Outer fragmentation is the only solution that supports all IPv4 and 863 IPv6 traffic, because inner fragmentation is allowed only for IPv4 864 datagrams with DF=0. 866 4. IP Tunnel Requirements 868 The requirements of an IP tunnel are defined by the requirements of 869 an IP link because both transit IP packets. 
A tunnel thus must 870 transit the IP minimum MTU, i.e., 68 bytes for IPv4 [RFC791] and 1280 871 bytes for IPv6 [RFC2460], and a tunnel must support address resolution 872 when there is more than one egress interface for that tunnel. 874 The requirements of the tunnel ingress and egress interfaces are 875 defined by the network over which they exchange messages (link 876 packets). For IP-over-IP, this means that the ingress interface MUST 877 NOT exceed the IP fragment identification field uniqueness 878 requirements [RFC6864]. Uniqueness is more difficult to maintain at 879 high packet rates for IPv4, whose fragment ID field is only 16 bits. 881 These requirements remain even though tunnels have some unique 882 issues, including the need for additional space for encapsulation 883 headers and the potential for tunnel MTU variation. 885 4.1. Encapsulation Header Issues 887 Tunnel encapsulation uses a non-link protocol as a link layer. The 888 encapsulation layer thus has the same requirements and expectations 889 as any other IP link layer when used to transit IP packets. These 890 relationships are addressed in the following subsections. 892 4.1.1. General Principles of Header Fields Relationships 894 Some tunnel specifications attempt to relate the header fields of the 895 transit packet and tunnel link packet. In some cases, this 896 relationship is warranted, whereas in other cases the two protocol 897 layers need to be isolated from each other. For example, the tunnel 898 link header source and destination addresses are network endpoints in 899 the tunnel network N, but have no meaning in the outer network M. The 900 two sets of addresses are effectively independent, just as are other 901 network and link addresses. 903 Because the tunneled packet uses source and destination addresses 904 with a separate meaning, it is inappropriate to copy or reuse the 905 IPv4 Identification (ID) or IPv6 Fragment ID fields of the tunnel 906 transit packet (see Section 4.1.4).
Similarly, the DF field of the 907 transit packet is not related to that field in the tunnel link packet 908 header (presuming both are IPv4) (see Section 4.2). Most other fields 909 are similarly independent between the transit packet and tunnel link 910 packet. When a field value is generated in the encapsulation header, 911 its meaning should be derived from what is desired in the context of 912 the tunnel as a link. When feedback is received from these fields, 913 they should be presented to the tunnel ingress and egress as if they 914 were network interfaces. The behavior of the node where these 915 interfaces attach should be identical to that of a conventional link. 917 There are exceptions to this rule that are explicitly intended to 918 relay signals from inside the tunnel to the network outside the 919 tunnel, typically relevant only when the tunnel network N and the 920 outer network M use the same network. These apply only when that 921 coordination is defined, as with explicit congestion notification 922 (ECN) [RFC6040] (see Section 4.3.2), and differentiated services code 923 points (DSCPs) [RFC2983]. Equal-cost multipath routing may also 924 affect how some encapsulation fields are set, including IPv6 flow 925 labels [RFC6438] and source ports for transport protocols when used 926 for tunnel encapsulation [RFC8085] (see Section 4.3.4). 928 4.1.2. Addressing Fields 930 Tunnel ingresses and egresses have addresses associated with the 931 encapsulation protocol. These addresses are the source and 932 destination (respectively) of the encapsulated packet while 933 traversing the tunnel network. 935 Tunnels may or may not have addresses in the network whose traffic 936 they transit (e.g., network M in Figure 4). In some cases, the tunnel 937 is an unnumbered interface to a point-to-point virtual link. When the 938 tunnel has multiple egresses, tunnel interfaces require separate 939 addresses in network M. 
941 To see the effect of tunnel interface addresses, consider traffic 942 sourced at router Ra in Figure 4. Even before being encapsulated by 943 the ingress, traffic needs a source IP network address that belongs 944 to the router. One option is to use an address associated with one of 945 the other interfaces of the router [RFC1122]. Another option is to 946 assign a number to the tunnel interface itself. Regardless of which 947 address is used, the resulting IP packet is then encapsulated by the 948 tunnel ingress using the ingress address as a separate operation. 950 4.1.3. Hop Count Fields 952 The Internet hop count field is used to detect and avoid forwarding 953 loops that cannot be corrected without a synchronized reboot. The 954 IPv4 Time-to-Live (TTL) and IPv6 Hop Limit field each serve this 955 purpose [RFC791][RFC2460]. The IPv4 TTL field was originally intended 956 to indicate packet expiration time, measured in seconds. A router is 957 required to decrement the TTL by at least one or the number of 958 seconds the packet is delayed, whichever is larger [RFC1812]. Packets 959 are rarely held that long, and so the field has come to represent the 960 count of the number of routers traversed. IPv6 makes this meaning 961 more explicit. 963 These hop count fields represent the number of network forwarding 964 elements (routers) traversed by an IP datagram. An IP datagram with a 965 hop count of zero can traverse a link between two hosts because it 966 never visits a router (where it would need to be decremented and 967 would have been dropped). 969 An IP datagram traversing a tunnel thus need not have its hop count 970 modified, i.e., the tunnel transit header need not be affected. A 971 zero hop count datagram should be able to traverse a tunnel as easily 972 as it traverses a link. 
A router MAY be configured to decrement 973 packets traversing a particular link (and thus a tunnel), which may 974 be useful in emulating a tunnel path as if it were a network path 975 that traversed one or more routers, but this is strictly optional. 976 The ability of the outer network M and tunnel network N to avoid 977 indefinitely looping packets does not rely on the hop counts of the 978 transit packet and tunnel link packet being related. 980 The hop count field is also used by several protocols to determine 981 whether endpoints are 'local', i.e., connected to the same subnet 982 (link-local discovery and related protocols [RFC4861]). A tunnel is a 983 way to make a remote network address appear directly-connected, so it 984 makes sense that the other ends of the tunnel appear local and that 985 such link-local protocols operate over tunnels unless configured 986 explicitly otherwise. When the interfaces of a tunnel are numbered, 987 these can be interpreted the same way as if they were on the same 988 link subnet. 990 4.1.4. IP Fragment Identification Fields 992 Both IPv4 and IPv6 include an IP Identification (ID) field to support 993 IP datagram fragmentation and reassembly [RFC791][RFC1122][RFC2460]. 994 When used, the ID field is intended to be unique for every packet for 995 a given source address, destination address, and protocol, such that 996 it does not repeat within the Maximum Segment Lifetime (MSL). 998 For IPv4, this field is in the default header and is meaningful only 999 when either source fragmented or DF=0 ("non-atomic packets") 1000 [RFC6864]. For IPv6, this field is contained in the optional Fragment 1001 Header [RFC2460]. Although IPv6 supports only source fragmentation, 1002 the field may occur in atomic fragments [RFC6946]. 1004 Although the ID field was originally intended for fragmentation and 1005 reassembly, it can also be used to detect and discard duplicate 1006 packets, e.g., at congested routers (see Sec. 
3.2.1.5 of [RFC1122]). 1008 For this reason, and because IPv4 packets can be fragmented anywhere 1009 along a path, all non-atomic IPv4 packets and all IPv6 packets 1010 between a source and destination of a given protocol must have unique 1011 ID values over the potential fragment reordering period 1012 [RFC2460][RFC6864]. 1014 The uniqueness of the IP ID is a known problem for high speed nodes, 1015 because it limits the speed of a single protocol between two 1016 endpoints [RFC4963]. Although this RFC suggests that the uniqueness 1017 of the IP ID is moot, tunnels exacerbate this condition. A tunnel 1018 often aggregates traffic from a number of different source and 1019 destination addresses, of different protocols, and encapsulates them 1020 in a header with the same ingress and egress addresses, all using a 1021 single encapsulation protocol. If the ingress enforces IP ID 1022 uniqueness, this can either severely limit tunnel throughput or can 1023 require substantial resources; the alternative is to ignore IP ID 1024 uniqueness and risk reassembly errors. Although fragmentation is 1025 somewhat rare in the current Internet at large, it can be common 1026 along a tunnel. Reassembly errors are not always detected by other 1027 protocol layers (see Sec. 4.3.3) , and even when detected they can 1028 result in excessive overall packet loss and can waste bandwidth 1029 between the egress and ultimate packet destination. 1031 The 32-bit IPv6 ID field in the Fragment Header is typically used 1032 only during source fragmentation. The size of the ID field is 1033 typically sufficient that a single counter can be used at the tunnel 1034 ingress, regardless of the endpoint addresses or next-header 1035 protocol, allowing efficient support for very high throughput 1036 tunnels. 1038 The smaller 16-bit IPv4 ID is more difficult to correctly support. A 1039 recent update to IPv4 allows the ID to be repeated for atomic packets 1040 [RFC6864]. 
When either source fragmentation or on-path fragmentation 1041 is supported, the tunnel ingress may need to keep independent ID 1042 counters for each tunnel source/destination/protocol tuple. 1044 4.1.5. Checksums 1046 IP traffic transiting a tunnel needs to expect a similar level of 1047 error detection and correction as it would expect from any other 1048 link. In the case of IPv4, there are no such expectations, which is 1049 partly why it includes a header checksum [RFC791]. 1051 IPv6 omitted the header checksum because it already expects most link 1052 errors to be detected and dropped by the link layer and because it 1053 also assumes transport protection [RFC2460]. When transiting IPv6 1054 over IPv6, the tunnel fails to provide the expected error detection. 1056 This is why IPv6 is often tunneled over layers that include separate 1057 protection, such as GRE [RFC2784]. 1059 The fragmentation created by the tunnel ingress can increase the need 1060 for stronger error detection and correction, especially at the tunnel 1061 egress to avoid reassembly errors. The Internet checksum is known to 1062 be susceptible to reassembly errors that could be common [RFC4963], 1063 and should not be relied upon for this purpose. This is why some 1064 tunnel protocols, e.g., SEAL and AERO [RFC5320][Te17] and GRE 1065 [RFC2784] as well as legacy protocols swIPe and the Internet 1066 Encapsulation Protocol [RFC1853], include a separate checksum. This 1067 requirement can be undermined when using UDP as a tunnel with no UDP 1068 checksum (as per [RFC6935][RFC6936]) when fragmentation occurs 1069 because the egress has no checksum with which to validate reassembly. 
1070 For this reason, it is safe to use UDP with a zero checksum for 1071 atomic tunnel link packets only; when used on fragments, whether 1072 generated at the ingress or en-route inside the tunnel, omission of 1073 such a checksum can result in reassembly errors that can cause 1074 additional work (capacity, forwarding processing, receiver 1075 processing) downstream of the egress. 1077 4.2. MTU Issues 1079 Link MTUs, IP datagram limits, and transport protocol segment sizes 1080 are already related by several requirements 1081 [RFC768][RFC791][RFC1122][RFC1812][RFC2460] and by a variety of 1082 protocol mechanisms that attempt to establish relationships between 1083 them, including path MTU discovery (PMTUD) [RFC1191][RFC1981], 1084 packetization layer path MTU discovery (PLPMTUD) [RFC4821], as well as 1085 mechanisms inside transport protocols [RFC793][RFC4340][RFC4960]. The 1086 following subsections summarize the interactions between tunnels and 1087 MTU issues, including minimum tunnel MTUs, tunnel fragmentation and 1088 reassembly, and MTU discovery. 1090 4.2.1. Minimum MTU Considerations 1092 There are a variety of minimum MTU values to consider, both 1093 in a conventional network and in a tunnel as a link in that network. 1094 These are indicated in Figure 10, an annotated variant of Figure 4. 1095 Note that a (link) MTU (a) corresponds to a tunnel MTU (b) and that a 1096 path MTU (c) corresponds to a tunnel path MTU (f). The tunnel MTU is 1097 the EMTU_R of the egress interface, because that defines the largest 1098 transit packet message that can traverse the tunnel as a link in 1099 network M. The ability to traverse the hops of the tunnel - in 1100 network N - is not related, and only the ingress need be concerned 1101 with that value.
1103 --_ -- 1104 +------+ / \ / \ +------+ 1105 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 1106 +------+ \ //\ / \ / \ /\\ / +------+ 1107 --/I \---+ Rb +---+ Rc +---/E \-- 1108 \ / \ / \ / \ / 1109 \/ -- -- \/ 1110 <----- Network N -------> 1111 <-------------------- Network M ---------------------> 1113 Communication in network M viewed at that layer: 1114 (a) <-> Link MTU 1115 (b) <---- Tunnel MTU ---------> 1116 (c) <----------- Path MTU -----------------> 1117 (d) <------------------- EMTU_R ---------------------------> 1119 Communication in network N viewed at that layer: 1120 (e) <--> Link MTU 1121 (f) <--- Path MTU ------> 1122 (g) <----- EMTU_R ---------> 1124 Communication in network N viewed from network M: 1125 (h) <--> MFS 1126 (i) <--- Path MFS ------> 1127 (j) <----- EMFS_R ---------> 1129 Figure 10 The variety of MTU values 1131 Consider the following example values. For IPv6 transit packets, the 1132 minimum (link) MTU (a) is 1280 bytes, which similarly applies to 1133 tunnels as the tunnel MTU (b). The path MTU (c) is the minimum of the 1134 links (including tunnels as links) along a path, and indicates the 1135 largest IP message (packet or fragment) that can traverse a path 1136 between a source and destination without on-path fragmentation (e.g., 1137 supported in IPv4 with DF=0). Path MTU discovery, either at the 1138 network layer (PMTUD [RFC1191][RFC1981]) or packetization layer 1139 (PLPMTUD [RFC4821]) attempts to tune the source IP packets and 1140 fragments (i.e., EMTU_S) to fit within this path MTU size to avoid 1141 fragmentation and reassembly [Ke95]. The minimum EMTU_R (d) is 1500 1142 bytes, i.e., the minimum MTU for endpoint-to-endpoint communication. 1144 The tunnel is a source-destination communication in network N.
1145 Messages between the tunnel source (the ingress interface) and tunnel 1146 destination (egress interface) similarly experience a variety of 1147 network N MTU values, including a link MTU (e), a path MTU (f), and 1148 an EMTU_R (g). The network N message maximum is limited by the path 1149 MTU, and the source-destination message maximum (EMTU_S) is limited 1150 by the path MTU when source fragmentation is disabled and by EMTU_R 1151 otherwise, just as it was for those types of MTUs in network M. 1152 For an IPv6 network N, its link and path MTUs must be at least 1280 1153 and its EMTU_R must be at least 1500. 1155 However, viewed from the context of network M, these network N MTUs 1156 are link layer properties, i.e., maximum frame sizes (MFS (h)). The 1157 network N EMTU_R determines the largest message that can transit 1158 between the source (ingress) and destination (egress), but viewed 1159 from network M this is a link layer, i.e., EMFS_R (j). The tunnel 1160 EMTU_R is EMFS_R minus the link (encapsulation) headers, because EMFS_R includes 1161 the encapsulation headers of the link layer. Just as the path MTU has 1162 no bearing on EMTU_R, the path MFS (i) in network N has no bearing on 1163 the MTU of the tunnel.
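The relationship between EMFS_R and the tunnel MTU stated above is simple subtraction, sketched here as a small Python helper. The function name is hypothetical, and the sample values assume an IPv6 tunnel network N at its minimum EMTU_R of 1500 bytes with only the 40-byte fixed IPv6 header as encapsulation overhead.

```python
def tunnel_mtu(egress_emfs_r: int, encaps: int) -> int:
    """Largest transit packet the tunnel can carry as a link in network M.

    egress_emfs_r: network N EMTU_R of the egress, seen from M as EMFS_R
    encaps:        size of the tunnel link (encapsulation) headers
    """
    if encaps < 0 or encaps >= egress_emfs_r:
        raise ValueError("encapsulation overhead must fit inside EMFS_R")
    return egress_emfs_r - encaps


# Assumed example: minimum IPv6 EMTU_R, fixed IPv6 header only.
example_tunnel_mtu = tunnel_mtu(1500, 40)
```

Note that the path MFS of network N is deliberately not a parameter, reflecting the point that it has no bearing on the tunnel MTU.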
1165 For IPv6 networks M and N, these relationships are summarized as 1166 follows: 1168 o Network M MTU = 1280, the largest transit packet (i.e., payload) 1169 over a single IPv6 link in the base network without source 1170 fragmentation 1172 o Network M path MTU = 1280, the transit packet (i.e., payload) that 1173 can traverse a path of links in the base network without source 1174 fragmentation 1176 o Network M EMTU_R = 1500, the largest transit packet (i.e., 1177 payload) that can traverse a path in the base network with source 1178 fragmentation 1180 o Network N MTU = 1280 (for the same reasons as for network M) 1182 o Network N path MTU = 1280 (for the same reasons as for network M) 1184 o Network N EMTU_R = 1500 (for the same reasons as for network M) 1186 o Tunnel MTU = 1500-encapsulation (typically 1460), the network N 1187 EMTU_R payload 1189 o Tunnel MAP (maximum atomic packet) = largest network M message 1190 that transits a tunnel as an atomic packet using network N as a 1191 link layer: 1280-encapsulation, i.e., the network N path MTU 1192 payload (which is itself limited by the tunnel path MFS) 1194 The difference between the network N MTU and its treatment as a link 1195 layer in network M is the reason why the tunnel ingress interfaces 1196 need to support fragmentation and tunnel egress interfaces need to 1197 support reassembly in the encapsulation layer(s). The high cost of 1198 fragmentation and reassembly is why it is useful for applications to 1199 avoid sending messages too close to the size of the tunnel path MTU 1200 [Ke95], although there is no signaling mechanism that can achieve 1201 this (see Section 4.2.3). 1203 4.2.2. Fragmentation 1205 A tunnel interacts with fragmentation in two different ways. 
As a 1206 link in network M, transit packets might be fragmented before they 1207 reach the tunnel - i.e., in network M either during source 1208 fragmentation (if generated at the same node as the ingress 1209 interface) or forwarding fragmentation (for IPv4 DF=0 datagrams). In 1210 addition, link packets traversing inside the tunnel may require 1211 fragmentation by the ingress interface - i.e., source fragmentation 1212 by the ingress as a host in network N. These two fragmentation 1213 operations are no more related than are conventional IP fragmentation 1214 and ATM segmentation and reassembly; one occurs at the (transit) 1215 network layer, the other at the (virtual) link layer. 1217 Although many of these issues with tunnel fragmentation and MTU 1218 handling were discussed in [RFC4459], that document described a 1219 variety of alternatives as if they were independent. This document 1220 explains the combined approach that is necessary. 1222 Like any other link, an IPv4 tunnel must transit 68 byte packets 1223 without requiring source fragmentation [RFC791][RFC1122] and an IPv6 1224 tunnel must transit 1280 byte packets without requiring source 1225 fragmentation [RFC2460]. The tunnel MTU interacts with routers or 1226 hosts it connects the same way as would any other link MTU. The 1227 pseudocode examples in this section use the following values: 1229 o TP: transit packet 1231 o TLP: tunnel link packet 1233 o TPsize: size of the transit packet (including its headers) 1235 o encaps: ingress encapsulation overhead (tunnel link headers) 1237 o tunMTU: tunnel MTU, i.e., network N egress EMTU_R - encaps 1239 o tunMAP: tunnel maximum atomic packet as limited by the tunnel path 1240 MFS 1242 These rules apply at the host/router where the tunnel is attached, 1243 i.e., at the network layer of the transit packet (we assume that all 1244 tunnels, including multipoint tunnels, have a single, uniform MTU). 
1245 These are basic source fragmentation rules (or transit 1246 refragmentation for IPv4 DF=0 datagrams), and have no relation to the 1247 tunnel itself other than to consider the tunnel MTU as the effective 1248 link MTU of the next hop. 1250 Inside the source during transit packet generation or a router during 1251 transit packet forwarding, the tunnel is treated as if it were any 1252 other link (i.e., this is not tunnel processing, but rather typical 1253 source or router processing), as indicated in the pseudocode in 1254 Figure 11. 1256 if (TPsize > tunMTU) then 1257 if (TP can be on-path fragmented, e.g., IPv4 DF=0) then 1258 split TP into TP fragments of tunMTU size 1259 and send each TP fragment to the tunnel ingress interface 1260 else 1261 drop the TP and send ICMP "too big" to the TP source 1262 endif 1263 else 1264 send TP to the tunnel ingress (i.e., as an outbound interface) 1265 endif 1267 Figure 11 Router / host packet size processing algorithm 1269 The tunnel ingress acts as a host on the tunnel path, i.e., it performs source 1270 fragmentation of tunnel link packets (we assume that all tunnels, 1271 even multipoint tunnels, have a single, uniform tunnel MTU), using 1272 the pseudocode shown in Figure 12. Note that ingress source 1273 fragmentation occurs in the encapsulation process, which may involve 1274 more than one protocol layer. In those cases, fragmentation can occur 1275 at any of the layers of encapsulation in which it is supported, based 1276 on the configuration of the ingress.
1278 if (TPsize <= tunMAP) then 1279 encapsulate the TP and emit 1280 else 1281 if (tunMAP < TPsize) then 1282 encapsulate the TP, creating the TLP 1283 fragment the TLP into tunMAP chunks 1284 emit the TLP fragments 1285 endif 1286 endif 1288 Figure 12 Ingress processing algorithm 1290 Note that Figure 11 and Figure 12 indicate that a node might 1291 both "fragment then encapsulate" and "encapsulate then fragment", 1292 i.e., the effect is "on-path fragment, then encapsulate, then source 1293 fragment". The first (on-path) fragmentation occurs only for IPv4 1294 DF=0 packets, based on the tunnel MTU. The second (source) 1295 fragmentation occurs for all packets, based on the tunnel maximum 1296 atomic packet (MAP) size. The first fragmentation is a convenience 1297 for a subset of IPv4 packets; it is the second (source) fragmentation 1298 that ensures that messages traverse the tunnel. 1300 Just as a network interface should never receive a message larger 1301 than its MTU, a tunnel should never receive a message larger than its 1302 tunnel MTU limit (see the host/router processing above). A router 1303 attempting to process such a message would already have generated an 1304 ICMP "packet too big" and the transit packet would have been dropped 1305 before entering into this algorithm. Similarly, a host would have 1306 generated an error internally and aborted the attempted transmission. 1308 As an example, consider IPv4 over IPv6 or IPv6 over IPv6 tunneling, 1309 where IPv6 encapsulation adds a 40 byte fixed header plus IPv6 1310 options (i.e., IPv6 header extensions) of total size 'EHsize'. The 1311 tunnel MTU will be at least 1500 - (40 + EHsize) bytes. The tunnel 1312 path MTU will be at least 1280 - (40 + EHsize) bytes, which then also 1313 represents the tunnel maximum atomic packet size (MAP).
Transit 1314 packets larger than the tunnel MTU will be dropped by a node before 1315 ingress processing, and so do not need to be addressed as part of 1316 ingress processing. Considering these minimum values, the previous 1317 algorithm uses the actual values shown in the pseudocode in Figure 13. 1319 if (TPsize <= (1240 - EHsize)) then 1320 encapsulate TP and emit 1321 else 1322 if ((1240 - EHsize) < TPsize) then 1323 encapsulate the TP, creating the TLP 1324 fragment the TLP into (1240 - EHsize) chunks 1325 emit the TLP fragments 1326 endif 1327 endif 1329 Figure 13 Ingress processing for a tunnel over IPv6 1331 IPv6 cannot necessarily support all tunnel encapsulations. When the 1332 egress EMTU_R is the default of 1500 bytes, an IPv6 tunnel supports 1333 IPv6 transit only if EHsize is 180 bytes or less; otherwise the 1334 incoming transit packet would have been dropped as being too large by 1335 the host/router. Under the same EMTU_R assumption, an IPv6 tunnel 1336 supports IPv4 transit only if EHsize is 884 bytes or less. In this 1337 example, transit packets of up to (1240 - EHsize) can traverse the 1338 tunnel without ingress source fragmentation and egress reassembly. 1340 When using IP directly over IP, the minimum transit packet EMTU_R for 1341 IPv4 is 576 bytes and for IPv6 is 1500 bytes. This means that tunnels 1342 of IPv4-over-IPv4, IPv4-over-IPv6, and IPv6-over-IPv6 are possible 1343 without additional requirements, but this may involve ingress 1344 fragmentation and egress reassembly. IPv6 cannot be tunneled directly 1345 over IPv4 without additional requirements, notably that the egress 1346 EMTU_R is at least 1280 bytes. 1348 When ongoing ingress fragmentation and egress reassembly would be 1349 prohibitive or costly, larger MTUs can be supported by design and 1350 confirmed either out-of-band (by design) or in-band (e.g., using 1351 PLPMTUD [RFC4821], as done in SEAL [RFC5320] and AERO [Te17]).
In 1352 particular, many tunnel specifications are often able to avoid 1353 persistent fragmentation because they operationally assume larger 1354 EMTU_R and tunnel MAP sizes than are guaranteed for IPv4 [RFC1122] or 1355 IPv6 [RFC2460]. 1357 4.2.3. Path MTU Discovery 1359 Path MTU discovery (PMTUD) enables a network path to support a larger 1360 PMTU than it can assume from the minimum requirements of the protocol 1361 over which it operates. Note, however, that PMTUD never discovers 1362 an EMTU_R that is larger than the required minimum; that information is 1363 available to some upper layer protocols, such as TCP [RFC1122], but 1364 cannot be determined at the IP layer. 1366 There is a temptation to optimize tunnel traversal so that packets are 1367 not fragmented between ingress and egress, i.e., to attempt to tune the 1368 network M PMTU to the tunnel MAP size rather than to the tunnel MTU, 1369 to avoid ingress fragmentation. This is often impossible because the 1370 ICMP "packet too big" message (IPv4 fragmentation needed [RFC792] or 1371 IPv6 packet too big [RFC4443]) indicates the complete failure of a 1372 link to transit a packet, not a preference for a size that matches 1373 the internal mechanism of the link. ICMP messages are intended 1374 to indicate whether a tunnel MTU is insufficient; there is no ICMP 1375 message that can indicate when a transit packet is "too big for the 1376 tunnel path MTU, but not larger than the tunnel MTU". If there were, 1377 endpoints might receive that message for IP packets larger than 40 1378 bytes (the payload of a single ATM cell, allowing for the 8-byte AAL5 1379 trailer), but smaller than 9K (the ATM EMTU_R payload).
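The gap between the tunnel MAP and the tunnel MTU can be illustrated with a short Python sketch. It is not part of the original pseudocode; the names classify, Outcome, tun_map, and tun_mtu are assumptions introduced for illustration, and the example values are the IPv6-over-IPv6 minimums from Section 4.2.2 with EHsize = 0. Only the last outcome has a corresponding ICMP signal.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for how a transit packet relates to a tunnel."""
    ATOMIC = "carried as a single tunnel link packet"
    INGRESS_FRAGMENTED = "fragmented at the ingress, reassembled at the egress"
    OVER_TUNNEL_MTU = "exceeds the tunnel MTU (ICMP 'too big' or IPv4 DF=0 on-path fragmentation)"


def classify(tp_size: int, tun_map: int, tun_mtu: int) -> Outcome:
    """Classify a transit packet of tp_size bytes.

    tun_map is the tunnel maximum atomic packet size and tun_mtu is the
    tunnel MTU (egress EMTU_R minus encapsulation), with tun_map <= tun_mtu.
    """
    if tp_size > tun_mtu:
        # Handled by the attached host/router before ingress processing.
        return Outcome.OVER_TUNNEL_MTU
    if tp_size > tun_map:
        # No ICMP message exists for this middle region.
        return Outcome.INGRESS_FRAGMENTED
    return Outcome.ATOMIC


# Assumed example values: IPv6 encapsulation, 40-byte header, EHsize = 0.
TUN_MAP = 1280 - 40   # 1240
TUN_MTU = 1500 - 40   # 1460

for size in (1240, 1241, 1460, 1461):
    print(size, classify(size, TUN_MAP, TUN_MTU).name)
```

In this sketch, packets between 1241 and 1460 bytes still traverse the tunnel, which is why a "too big" message would be the wrong signal for them.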
1381 In addition, attempting to tune the network transit size to 1382 natively match that of the link internal transit can be hazardous for 1383 many reasons: 1385 o The tunnel is capable of transiting packets as large as the 1386 network N EMTU_R - encapsulation, which is always at least as 1387 large as the tunnel MTU and typically is larger. 1389 o ICMP has only one type of error message regarding large packets - 1390 "too big", i.e., too large to transit. There is no optimization 1391 message of "bigger than I'd like, but I can deal with it if needed". 1393 o IP tunnels often involve some level of recursion, i.e., 1394 encapsulation over itself [RFC4459]. 1396 Tunnels that use IPv4 as the encapsulation layer SHOULD set DF=0, but 1397 this requires generating unique fragmentation ID values, which may 1398 limit throughput [RFC6864]. These tunnels might have difficulty 1399 assuming ingress EMTU_S values over 64 bytes, so it may not be 1400 feasible to assume that larger packets with DF=1 are safe. 1402 Recursive tunneling occurs whenever a protocol ends up encapsulated 1403 in itself. This happens directly, as when IPv4 is encapsulated in 1404 IPv4, or indirectly, as when IP is encapsulated in UDP which then is 1405 a payload inside IP. It can involve many layers of encapsulation 1406 because a tunnel provider isn't always aware of whether the packets 1407 it transits are already tunneled. 1409 Recursion is impossible when the tunnel transit packets are limited 1410 to the native size of the ingress payload. Arriving tunnel 1411 transit packets have a minimum supported size (1280 for IPv6) and the 1412 tunnel PMFS has the same requirement; there would be no room for the 1413 tunnel's "link layer" headers, i.e., the encapsulation layer. The 1414 result would be an IPv6 tunnel that cannot satisfy IPv6 transit 1415 requirements.
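The arithmetic behind this argument can be sketched in Python as follows. This is an illustration only, assuming that fragmentation is avoided at every level of encapsulation; the function names atomic_payload and meets_ipv6_minimum are hypothetical, and the 40-byte value is the fixed IPv6 header with no extension headers.

```python
IPV6_MIN_MTU = 1280   # minimum size an IPv6 link must transit
IPV6_HEADER = 40      # fixed IPv6 header, no extension headers assumed


def atomic_payload(path_mfs: int, encaps_sizes: list[int]) -> int:
    """Largest innermost transit packet that fits without fragmentation.

    path_mfs is the maximum fragment size of the outermost path;
    encaps_sizes lists the header overhead added by each nested tunnel.
    """
    return path_mfs - sum(encaps_sizes)


def meets_ipv6_minimum(path_mfs: int, encaps_sizes: list[int]) -> bool:
    """True if the nested tunnels can carry a 1280-byte packet atomically."""
    return atomic_payload(path_mfs, encaps_sizes) >= IPV6_MIN_MTU


# One IPv6-in-IPv6 tunnel over a path that only guarantees 1280 bytes.
print(atomic_payload(1280, [IPV6_HEADER]))           # 1240
print(meets_ipv6_minimum(1280, [IPV6_HEADER]))       # False

# Two nested tunnels over a path that happens to support 1500 bytes.
print(meets_ipv6_minimum(1500, [IPV6_HEADER] * 2))   # True
```

When the path guarantees only the IPv6 minimum, even a single level of encapsulation leaves less than 1280 bytes, so the tunnel needs ingress fragmentation and egress reassembly to meet the transit requirement.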
1417 It is more appropriate to require the tunnel to satisfy IP transit 1418 requirements and enforce that requirement at design time or during 1419 operation (the latter using PLPMTUD [RFC4821]). Conventional path MTU 1420 discovery (PMTUD) relies on existing endpoint ICMP processing of 1421 explicit negative feedback from routers along the path via "packet too 1422 big" ICMP packets in the reverse direction of the tunnel 1423 [RFC1191][RFC1981]. This technique is susceptible to the "black hole" 1424 phenomenon, in which the ICMP messages never return to the source due 1425 to policy-based filtering [RFC2923]. PLPMTUD requires a separate, 1426 direct control channel from the egress to the ingress that provides 1427 positive feedback; the direct channel is not blocked by policy 1428 filters and the positive feedback ensures fail-safe operation if 1429 feedback messages are lost [RFC4821]. 1431 PLPMTUD might require that the ingress consider the potential impact 1432 of multipath forwarding (see Section 4.3.4). In such cases, probes 1433 generated by the ingress might need to track different flows, e.g., 1434 that might traverse different tunnel paths. Additionally, 1435 encapsulation might need to consider mechanisms to ensure that probes 1436 traverse the same path as their corresponding traffic, even when 1437 labeled as the same flow (e.g., using the IPv6 flow ID). In such 1438 cases, the transit packet and probe may need to be encrypted or 1439 encapsulated in an additional flow-based transport header, to avoid 1440 differential path traversal based on deep-packet inspection within 1441 the tunnel. 1443 4.3. Coordination Issues 1445 IP tunnels interact with link layer signals and capabilities in a 1446 variety of ways. The following subsections address some key issues of 1447 these interactions.
In general, they are again informed by treating a 1448 tunnel as any other link layer and considering the interactions 1449 between the IP layer and link layers [RFC3819]. 1451 4.3.1. Signaling 1453 In the current Internet architecture, signaling goes upstream, either 1454 from routers along a path or from the destination, back toward the 1455 source. Such signals are typically contained in ICMP messages, but 1456 can involve other protocols such as RSVP, transport protocol signals 1457 (e.g., TCP RSTs), or multicast control or transport protocols. 1459 A tunnel behaves like a link and acts like a link interface at the 1460 nodes where it is attached. As such, it can provide information that 1461 enhances IP signaling (e.g., ICMP), but itself does not directly 1462 generate ICMP messages. 1464 For tunnels, this means that there are two separate signaling paths. 1465 The outer network M nodes can each signal the source of the tunnel 1466 transit packets, Hsrc (Figure 14). Inside the tunnel, the inner 1467 network N nodes can signal the source of the tunnel link packets, the 1468 ingress I (Figure 15). 
1470 +--------+---------------------------+--------+ 1471 | | | | 1472 v --_ -- v 1473 +------+ / \ / \ +------+ 1474 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 1475 +------+ \ //\ / \ / \ /\\ / +------+ 1476 --/I \--+ Rb +--+ Rc +--/E \-- 1477 \ / \ / \ / \ / 1478 \/ -- -- \/ 1479 <---- Network N -----> 1480 <-------------------- Network M ---------------------> 1482 Figure 14 Signals outside the tunnel 1484 +-----+-------+------+ 1485 --_ | | | | -- 1486 +------+ / \ v | | | / \ +------+ 1487 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 1488 +------+ \ //\ / \ / \ /\\ / +------+ 1489 --/I \--+ Rb +--+ Rc +--/E \-- 1490 \ / \ / \ / \ / 1491 \/ -- -- \/ 1492 <----- Network N ----> 1493 <--------------------- Network M --------------------> 1495 Figure 15 Signals inside the tunnel 1497 These two signal paths are inherently distinct except where 1498 information is exchanged between the network interface of the tunnel 1499 (the ingress) and its attached node (Ra, in both figures). 1501 It is always possible for a network interface to provide hints to its 1502 attached node (host or router), which can be used for optimization. 1503 In this case, when signals inside the tunnel indicate a change to the 1504 tunnel, the ingress (i.e., the tunnel network interface) can provide 1505 information to the router (Ra, in both figures), so that Ra can 1506 generate the appropriate signal in return to Hsrc. This relaying may 1507 be difficult, because signals inside the tunnel may not return enough 1508 information to the ingress to support direct relaying to Hsrc. 1510 In all cases, the tunnel ingress needs to determine how to relay the 1511 signals from inside the tunnel into signals back to the source. For 1512 some protocols this is either simple or impossible (such as for 1513 ICMP), for others, it can even be undefined (e.g., multicast). 
In 1514 some cases, the individual signals relayed from inside the tunnel may 1515 result in corresponding signals in the outside network, and in other 1516 cases they may just change the state of the tunnel interface. In the 1517 latter case, the result may cause the router Ra to generate new ICMP 1518 errors when later messages arrive from Hsrc or other sources in the 1519 outer network. 1521 The meaning of the relayed information must be carefully translated. 1522 An ICMP error within a tunnel indicates a failure of the path inside 1523 the tunnel to support an egress atomic packet or packet fragment 1524 size. It can be very difficult to convert that ICMP error into a 1525 corresponding ICMP message from the ingress node back to the transit 1526 packet source. The ICMP message may not contain enough of a packet 1527 prefix to extract the transit packet header sufficient to generate 1528 the appropriate ICMP message. The relationship between the egress 1529 EMTU_R and the transit packet may be indirect, e.g., the ingress node 1530 may be performing source fragmentation that should be adjusted 1531 instead of propagating the ICMP upstream. 1533 Some messages have detailed specifications for relaying between the 1534 tunnel link packet and transit packet, including Explicit Congestion 1535 Notification (ECN [RFC6040]) and multicast (e.g., IGMP). 1537 4.3.2. Congestion 1539 Tunnels carrying IP traffic (i.e., the focus of this document) need 1540 not react directly to congestion any more than would any other link 1541 layer [RFC8085]. IP transit packet traffic is already expected to be 1542 congestion controlled. 1544 It is useful to relay network congestion notification between the 1545 tunnel link and the tunnel transit packets.
Explicit congestion 1546 notification requires that ECN bits are copied from the tunnel 1547 transit packet to the tunnel link packet on encapsulation, as well as 1548 copied back at the egress based on a combination of the bits of the 1549 two headers [RFC6040]. This allows congestion notification within the 1550 tunnel to be interpreted as if it were on the direct path. 1552 4.3.3. Multipoint Tunnels and Multicast 1554 Multipoint tunnels are tunnels with more than two ingress/egress 1555 endpoints [RFC2529][RFC5214][Te17]. Just as tunnels emulate links, 1556 multipoint tunnels emulate multipoint links, and can support 1557 multicast as a tunnel capability. Multipoint tunnels can be useful on 1558 their own, or may be used as part of more complex systems, e.g., LISP 1559 and TRILL configurations [RFC6830][RFC6325]. 1561 Multipoint tunnels require support for egress determination, just 1562 as multipoint links do. This function is typically supported by ARP 1563 [RFC826] or ARP emulation (e.g., LAN Emulation, known as LANE 1565 [RFC2225]) for multipoint links. For multipoint tunnels, a similar 1566 mechanism is required for the same purpose - to determine the egress 1567 address for proper ingress encapsulation (e.g., LISP Map-Service 1568 [RFC6833]). 1570 All multipoint systems - tunnels and links - might support different 1571 MTUs between each ingress/egress (or link entrance/exit) pair. In 1572 most cases, it is simpler to assume a uniform MTU throughout the 1573 multipoint system, e.g., the minimum MTU supported across all 1574 ingress/egress pairs. This applies to both the ingress EMTU_S and 1575 egress EMTU_R (the latter determining the tunnel MTU).
Values valid 1576 across all receivers need to be confirmed in advance (e.g., via IPv6 1577 ND announcements or out-of-band configuration information) before a 1578 multipoint tunnel or link can use values other than the default, 1579 otherwise packets may reach some receivers but be "black-holed" to 1580 others (e.g., if PMTUD fails [RFC2923]). 1582 A multipoint tunnel MUST have support for broadcast and multicast (or 1583 their equivalent), in exactly the same way as this is already 1584 required for multipoint links [RFC3819]. Both modes can be supported 1585 either by a native mechanism inside the tunnel or by emulation using 1586 serial replication at the tunnel ingress (e.g., AMT [RFC7450]), in 1587 the same way that links may provide the same support either natively 1588 (e.g., via promiscuous or automatic replication in the link itself) 1589 or network interface emulation (e.g., as for non-broadcast 1590 multiaccess networks, i.e., NBMAs). 1592 IGMP snooping enables IP multicast to be coupled with native link 1593 layer multicast support [RFC4541]. A similar technique may be 1594 relevant to couple transit packet multicast to tunnel link packet 1595 multicast, but the coupling of the protocols may be more complex 1596 because many tunnel link protocols rely on their own network N 1597 multicast control protocol, e.g., via PIM-SM [RFC6807][RFC7761]. 1599 4.3.4. Load Balancing 1601 Load balancing can impact the way in which a tunnel operates. In 1602 particular, multipath routing inside the tunnel can cause some of 1603 the tunnel parameters to vary, both over time and for different 1604 transit packets. The use of multiple paths can be the result of MPLS 1605 link aggregation groups (LAGs), equal-cost multipath routing (ECMP 1606 [RFC2991]), or other load balancing mechanisms. In some cases, the 1607 tunnel exists as the mechanism to support ECMP, as for GRE in UDP 1608 [RFC8086].
1610 A tunnel may have multiple paths between the ingress and egress with 1611 different tunnel path MTU or tunnel MAP values, causing the ingress 1612 EMTU_S to vary [RFC7690]. When individual values cannot be correlated 1613 to transit traffic, the EMTU_S can be set to the minimum of these 1614 different path MTU and MAP values. 1616 In some cases, these values can be correlated to paths, e.g., IPv6 1617 packets include a flow label to enable multipath routing to keep 1618 packets of a single flow following the same path, as well as to help 1619 differentiate path properties (e.g., for path MTU discovery 1620 [RFC4821]). It is important to preserve the semantics of that flow 1621 label as an aggregate identifier of the encapsulated link packets of 1622 a tunnel. This is achieved by hashing the transit IP addresses and 1623 flow label to generate a new flow label for use between the ingress 1624 and egress addresses [RFC6438]. It is not appropriate to simply copy 1625 the flow label from the transit packet into the link packet because 1626 of collisions that might arise if a label is used for flows between 1627 different transit packet addresses that traverse the same tunnel. 1629 When the transit packet is visible to forwarding nodes inside the 1630 tunnel (e.g., when it is not encrypted), those nodes use deep packet 1631 inspection (DPI) context to send a single flow over different paths. 1632 This sort of "DPI override" of the IP flow information can interfere 1633 with both PMTUD and PLPMTUD mechanisms. The only way to ensure that 1634 intermediate nodes do not interfere with PLPMTUD is to encrypt the 1635 transit packet when it is encapsulated for tunnel traversal, or to 1636 provide some other signals (e.g., an additional layer of 1637 encapsulation header including transport ports) that preserves the 1638 flow semantics. 1640 4.3.5. 
Recursive Tunnels 1642 The rules described in this document already support tunnels over 1643 tunnels, sometimes known as "recursive" tunnels, in which IP is 1644 transited over IP either directly or via intermediate encapsulation 1645 (IP-UDP-IP, as in GUE [He16]). 1647 There are known hazards to recursive tunneling, notably that the 1648 independence of the tunnel transit header and tunnel link header hop 1649 counts can result in a tunneling loop. Such looping can be avoided 1650 when using direct encapsulation (IP in IP) by use of a header option 1651 to track the encapsulation count and to limit that count [RFC2473]. 1652 This looping cannot be avoided when other protocols are used for 1653 tunneling, e.g., IP in UDP in IP, because the encapsulation count may 1654 not be visible where the recursion occurs. 1656 5. Observations 1658 The following subsections summarize the observations of this document 1659 and the issues with existing tunnel protocol specifications. 1660 They also include advice for tunnel protocol designers, implementers, 1661 and operators. 1663 5.1.
Summary of Recommendations 1665 o Tunnel endpoints are network interfaces, tunnels are virtual links 1667 o ICMP messages MUST NOT be generated by the tunnel (as a link) 1669 o ICMP messages received by the ingress inside link change the 1670 link properties (they do not generate transit-layer ICMP 1671 messages) 1673 o Link headers (hop, ID, options) are largely independent of 1674 arriving ID (with few exceptions based on translation, not 1675 direct copying, e.g., ECN and IPv6 flow IDs) 1677 o MTU values should treat the tunnel as any other link 1679 o Require ingress source fragmentation and egress 1680 reassembly at the tunnel link packet layer 1682 o The tunnel MTU is the tunnel egress EMTU_R less headers, and 1683 not related at all to the ingress-egress MFS 1685 o Tunnels must obey core IP requirements 1687 o Obey IPv4 DF=1 on arrival at a node (nodes MUST NOT fragment 1688 IPv4 packets where DF=1 and routers MUST NOT clear the DF bit) 1690 o Shut down an IP tunnel if the tunnel MTU falls below the 1691 required minimum 1693 5.2. Impact on Existing Encapsulation Protocols 1695 Many existing and proposed encapsulation protocols are inconsistent 1696 with the guidelines of this document. The following list summarizes 1697 only those inconsistencies, but omits places where a protocol is 1698 inconsistent solely by reference to another protocol. 1700 [should this be inverted as a table of issues and a list of which 1701 RFCs have problems?]
1702 o IP in IP / mobile IP [RFC2003][RFC4459] - IPv4 in IPv4 1704 o Sets link DF when transit DF=1 (fails without PLPMTUD) 1706 o Drops at egress if hopcount = 0 (host-host tunnels fail) 1708 o Drops based on transit source (same as router IP, matches 1709 egress), i.e., performs routing functions it should not 1711 o Ingress generates ICMP messages (based on relayed context), 1712 rather than using inner ICMP messages to set interface 1713 properties only 1715 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1717 o IPv6 tunnels [RFC2473] -- IPv6 or IPv4 in IPv6 1719 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1721 o Decrements transiting packet hopcount (by 1) 1723 o Copies traffic class from tunnel link to tunnel transit header 1725 o Ignores IPv4 DF=0 and fragments at that layer upon arrival 1727 o Fails to retain soft ingress state based on inner ICMP messages 1728 affecting tunnel MTU 1730 o Tunnel ingress issues ICMPs 1732 o Fragments IPv4 over IPv6 fragments only if IPv4 DF=0 1733 (misinterpreting the "can fragment the IPv4 packet" as 1734 permission to fragment at the IPv6 link header) 1736 o IPsec tunnel mode (IP in IPsec in IP) [RFC4301] -- IP in IPsec 1738 o Uses security policy to set, clear, or copy DF (rather than 1739 generating it independently, which would also be more secure) 1741 o Intertwines tunnel selection with security selection, rather 1742 than presenting tunnel as an interface and using existing 1743 forwarding (as with transport mode over IP-in-IP [RFC3884]) 1745 o GRE (IP in GRE in IP or IP in GRE in UDP in IP) 1746 [RFC2784][RFC7588][RFC7676][RFC8086] 1748 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1749 o Requires ingress to generate ICMP errors 1751 o Copies IPv4 DF to outer IPv4 DF 1753 o Violates IPv6 MTU requirements when using IPv6 encapsulation 1755 o LISP [RFC6830] 1757 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1759 o Requires ingress to generate ICMP errors 
1761 o Copies inner hop limit to outer 1763 o L2TP [RFC3931] 1765 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1767 o Requires ingress to generate ICMP errors 1769 o PWE [RFC3985] 1771 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1773 o Requires ingress to generate ICMP errors 1775 o GUE (Generic UDP encapsulation) [He16] - IP (et al.) in UDP in IP 1777 o Allows inner encapsulation fragmentation 1779 o Geneve [RFC7364][Gr17] - IP (et al.) in Geneve in UDP in IP 1781 o Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU 1783 o SEAL/AERO [RFC5320][Te17] - IP in SEAL/AERO in IP 1785 o Some issues with SEAL (MTU, ICMP), corrected in AERO 1787 o RTG DT encapsulations [No16] 1789 o Assumes fragmentation can be avoided completely 1791 o Allows encapsulation protocols that lack fragmentation 1793 o Relies on ICMP PTB to correct for tunnel path MTU 1795 o No known issues 1796 o L2VPN (framework for L2 virtualization) [RFC4664] 1798 o L3VPN (framework for L3 virtualization) [RFC4176] 1800 o MPLS (IP in MPLS) [RFC3031] 1802 o TRILL (Ethernet in Ethernet) [RFC5556][RFC6325] 1804 5.3. Tunnel Protocol Designers 1806 [To be completed] 1808 Recursive tunneling + minimum MTU = frag/reassembly is inevitable, at 1809 least to be able to split/join two fragments 1811 Account for egress MTU/path MTU differences. 1813 Include a stronger checksum. 1815 Ensure the egress MTU is always larger than the path MTU. 1817 Ensure that the egress reassembly can keep up with line rate OR 1818 design PLPMTUD into the tunneling protocol. 1820 5.3.1. For Future Standards 1822 [To be completed] 1824 Larger IPv4 MTU (2K? or just 2x path MTU?) for reassembly 1826 Always include frag support for at least two frags; do NOT try to 1827 deprecate fragmentation. 1829 Limit encapsulation option use/space.
1831 Augment ICMP to have two separate messages: PTB vs P-bigger-than- 1832 optimal 1834 Include MTU as part of BGP as a hint - SB 1836 Hazards of multi-MTU draft-van-beijnum-multi-mtu-04 1838 5.3.2. Diagnostics 1840 [To be completed] 1841 Some current implementations include diagnostics to support 1842 monitoring the impact of tunneling, especially the impact on 1843 fragmentation and reassembly resources, the status of path MTU 1844 discovery, etc. 1846 >> Because a tunnel ingress/egress is a network interface, it SHOULD 1847 have similar resources as any other network interface. This includes 1848 resources for packet processing as well as monitoring. 1850 5.4. Tunnel Implementers 1852 [To be completed] 1854 Detect when the egress MTU is exceeded. 1856 Detect when the egress MTU drops below the required minimum and shut 1857 down the tunnel if that happens - configuring the tunnel down and 1858 issuing a hard error may be the only way to detect this anomaly, and 1859 it's sufficiently important that the tunnel SHOULD be disabled. This 1860 is always better than blindly assuming the tunnel has been deployed 1861 correctly, i.e., that the solution has been engineered. 1863 Do NOT decrement the TTL as part of being a tunnel. It's always 1864 already OK for a router to decrement the TTL based on different next- 1865 hop routers, but TTL is a property of a router not a link. 1867 5.5. Tunnel Operators 1869 [To be completed] 1871 Keep the difference between "enforced by operators" vs. "enforced by 1872 active protocol mechanism" in mind. It's fine to assume something the 1873 tunnel cannot or does not test, as long as you KNOW you can assume 1874 it. When the assumption is wrong, it will NOT be signaled by the 1875 tunnel. Do NOT decrement the TTL as part of being a tunnel. It's 1876 always already OK for a router to decrement the TTL based on 1877 different next-hop routers, but TTL is a property of a router not a 1878 link. 
1880 Consider the circuit breakers doc to provide diagnostics and last- 1881 resort control to avoid overload for non-reactive traffic (see 1882 Gorry's RFC-to-be) 1888 >>>> PLPMTUD can give multiple conflicting PMTU values during ECMP or 1889 LAG if PMTU is cached per endpoint pair rather than per flow -- but 1890 so can PMTUD! This is another reason why ICMP should never drive up 1891 the effective MTU (if aggregate, treat as the minimum of received 1892 messages over an interval). 1894 6. Security Considerations 1896 Tunnels may introduce vulnerabilities or add to the potential for 1897 receiver overload and thus DoS attacks. These issues are primarily 1898 related to the fact that a tunnel is a link that traverses a network 1899 path and to fragmentation and reassembly. ICMP signal translation 1900 introduces a new security issue and must be done with care. ICMP 1901 generation at the router or host attached to a tunnel is already 1902 covered by existing requirements (e.g., should be throttled). 1904 Tunnels traverse multiple hops of a network path from ingress to 1905 egress. Traffic along such tunnels may be susceptible to on-path and 1906 off-path attacks, including fragment injection, reassembly buffer 1907 overload, and ICMP attacks. Some of these attacks may not be as 1908 visible to the endpoints of the architecture into which tunnels are 1909 deployed and these attacks may thus be more difficult to detect. 1911 Fragmentation at routers or hosts attached to tunnels may place an 1912 undue burden on receivers where traffic is not sufficiently diffuse, 1913 because tunnels may induce source fragmentation at hosts and path 1914 fragmentation (for IPv4 DF=0) more for tunnels than for other links.
1915 Care should be taken to avoid this situation, notably by ensuring 1916 that tunnel MTUs are not significantly different from other link 1917 MTUs. 1919 Tunnel ingresses emitting IP datagrams MUST obey all existing IP 1920 requirements, such as the uniqueness of the IP ID field. Failure to 1921 either limit encapsulation traffic, or use additional ingress/egress 1922 IP addresses, can result in high speed traffic fragments being 1923 incorrectly reassembled. 1925 Tunnels are susceptible to attacks at both the inner and outer 1926 network layers. The tunnel ingress/egress endpoints appear as network 1927 interfaces in the outer network, and are as susceptible as any other 1928 network interface. This includes vulnerability to fragmentation 1929 reassembly overload, traffic overload, and spoofed ICMP messages that 1930 misreport the state of those interfaces. Similarly, the 1931 ingress/egress appear as hosts to the path traversed by the tunnel, 1932 and thus are as susceptible as any other host to attacks as well. 1934 [management?] 1936 [Access control?] 1938 describe relationship to [RFC6169] - JT (as per INTAREA meeting 1939 notes, don't cover Teredo-specific issues in RFC6169, but include 1940 generic issues here) 1942 7. IANA Considerations 1944 This document has no IANA considerations. 1946 The RFC Editor should remove this section prior to publication. 1948 8. References 1950 8.1. Normative References 1952 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1953 Requirement Levels", BCP 14, RFC 2119, March 1997. 1955 [are there others? 3819? ECN? Flow label issues?] 1957 8.2. Informative References 1959 [Cl88] Clark, D., "The design philosophy of the DARPA internet 1960 protocols," Proc. Sigcomm 1988, p.106-114, 1988. 1962 [Er94] Eriksson, H., "MBone: The Multicast Backbone," 1963 Communications of the ACM, Aug. 1994, pp.54-60. 1965 [Gr17] Gross, J. (Ed.), I. Ganga (Ed.), T.
Sridhar (Ed.), "Geneve: 1966 Generic Network Virtualization Encapsulation," 1967 draft-ietf-nvo3-geneve-04, Mar. 2017. 1969 [He16] Herbert, T., L. Yong, O. Zia, "Generic UDP Encapsulation," 1970 draft-ietf-nvo3-gue-05, Oct. 2016. 1972 [Ke95] Kent, S., J. Mogul, "Fragmentation considered harmful," ACM 1973 Sigcomm Computer Communication Review (CCR), V25 N1, Jan. 1974 1995, pp. 75-87. 1976 [No16] Nordmark, E. (Ed.), A. Tian, J. Gross, J. Hudson, L. 1977 Kreeger, P. Garg, P. Thaler, T. Herbert, "Encapsulation 1978 Considerations," draft-ietf-rtgwg-dt-encap-02, Oct. 2016. 1980 [RFC5] Rulifson, J., "Decode Encode Language (DEL)," RFC 5, June 1981 1969. 1983 [RFC768] Postel, J., "User Datagram Protocol," RFC 768, Aug. 1980. 1985 [RFC791] Postel, J., "Internet Protocol," RFC 791 / STD 5, September 1986 1981. 1988 [RFC792] Postel, J., "Internet Control Message Protocol," RFC 792, 1989 Sep. 1981. 1991 [RFC793] Postel, J., "Transmission Control Protocol," RFC 793, Sept. 1992 1981. 1994 [RFC826] Plummer, D., "An Ethernet Address Resolution Protocol -- or 1995 -- Converting Network Protocol Addresses to 48.bit Ethernet 1996 Address for Transmission on Ethernet Hardware," RFC 826, 1997 Nov. 1982. 1999 [RFC1075] Waitzman, D., C. Partridge, S. Deering, "Distance Vector 2000 Multicast Routing Protocol," RFC 1075, Nov. 1988. 2002 [RFC1122] Braden, R., Ed., "Requirements for Internet Hosts - 2003 Communication Layers," RFC 1122 / STD 3, October 1989. 2005 [RFC1191] Mogul, J., S. Deering, "Path MTU discovery," RFC 1191, 2006 November 1990. 2008 [RFC1812] Baker, F., "Requirements for IP Version 4 Routers," RFC 2009 1812, June 1995. 2011 [RFC1853] Simpson, W., "IP in IP Tunneling," RFC 1853, Oct. 1995. 2013 [RFC1981] McCann, J., S. Deering, J. Mogul, "Path MTU Discovery for 2014 IP version 6," RFC 1981, Aug. 1996. 2016 [RFC2003] Perkins, C., "IP Encapsulation within IP," RFC 2003, Oct. 2017 1996. 2019 [RFC2225] Laubach, M., J. Halpern, "Classical IP and ARP over ATM," 2020 RFC 2225, Apr.
1998. 2022 [RFC2460] Deering, S., R. Hinden, "Internet Protocol, Version 6 2023 (IPv6) Specification," RFC 2460, Dec. 1998. 2025 [RFC2473] Conta, A., S. Deering, "Generic Packet Tunneling in IPv6 2026 Specification," RFC 2473, Dec. 1998. 2028 [RFC2529] Carpenter, B., C. Jung, "Transmission of IPv6 over IPv4 2029 Domains without Explicit Tunnels," RFC 2529, Mar. 1999. 2031 [RFC2784] Farinacci, D., T. Li, S. Hanks, D. Meyer, P. Traina, 2032 "Generic Routing Encapsulation (GRE)", RFC 2784, March 2033 2000. 2035 [RFC2923] Lahey, K., "TCP Problems with Path MTU Discovery," RFC 2036 2923, September 2000. 2038 [RFC2983] Black, D., "Differentiated Services and Tunnels," RFC 2983, 2039 Oct. 2000. 2041 [RFC2991] Thaler, D., C. Hopps, "Multipath Issues in Unicast and 2042 Multicast Next-Hop Selection," RFC 2991, Nov. 2000. 2047 [RFC2546] Durand, A., B. Buclin, "6bone Routing Practice," RFC 2546, 2048 Mar. 1999. 2050 [RFC3031] Rosen, E., A. Viswanathan, R. Callon, "Multiprotocol Label 2051 Switching Architecture", RFC 3031, January 2001. 2053 [RFC3819] Karn, P., Ed., C. Bormann, G. Fairhurst, D. Grossman, R. 2054 Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood, 2055 "Advice for Internet Subnetwork Designers," RFC 3819 / BCP 2056 89, July 2004. 2058 [RFC3884] Touch, J., L. Eggert, Y. Wang, "Use of IPsec Transport Mode 2059 for Dynamic Routing," RFC 3884, September 2004. 2061 [RFC3931] Lau, J., Ed., M. Townsley, Ed., I. Goyret, Ed., "Layer Two 2062 Tunneling Protocol - Version 3 (L2TPv3)," RFC 3931, March 2063 2005. 2065 [RFC3985] Bryant, S., P. Pate (Eds.), "Pseudo Wire Emulation Edge-to- 2066 Edge (PWE3) Architecture", RFC 3985, March 2005. 2068 [RFC4176] El Mghazli, Y., Ed., T. Nadeau, M. Boucadair, K. Chan, A. 2069 Gonguet, "Framework for Layer 3 Virtual Private Networks 2070 (L3VPN) Operations and Management," RFC 4176, October 2005. 2072 [RFC4301] Kent, S., and K.
Seo, "Security Architecture for the 2073 Internet Protocol," RFC 4301, December 2005. 2075 [RFC4340] Kohler, E., M. Handley, S. Floyd, "Datagram Congestion 2076 Control Protocol (DCCP)," RFC 4340, Mar. 2006. 2078 [RFC4443] Conta, A., S. Deering, M. Gupta (Ed.), "Internet Control 2079 Message Protocol (ICMPv6) for the Internet Protocol Version 2080 6 (IPv6) Specification," RFC 4443, Mar. 2006. 2082 [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the- 2083 Network Tunneling," RFC 4459, April 2006. 2085 [RFC4541] Christensen, M., K. Kimball, F. Solensky, "Considerations 2086 for Internet Group Management Protocol (IGMP) and Multicast 2087 Listener Discovery (MLD) Snooping Switches," RFC 4541, May 2088 2006. 2090 [RFC4664] Andersson, L., Ed., E. Rosen, Ed., "Framework for Layer 2 2091 Virtual Private Networks (L2VPNs)," RFC 4664, September 2092 2006. 2094 [RFC4821] Mathis, M., J. Heffner, "Packetization Layer Path MTU 2095 Discovery," RFC 4821, March 2007. 2097 [RFC4861] Narten, T., E. Nordmark, W. Simpson, H. Soliman, "Neighbor 2098 Discovery for IP version 6 (IPv6)," RFC 4861, Sept. 2007. 2100 [RFC4960] Stewart, R. (Ed.), "Stream Control Transmission Protocol," 2101 RFC 4960, Sep. 2007. 2103 [RFC4963] Heffner, J., M. Mathis, B. Chandler, "IPv4 Reassembly 2104 Errors at High Data Rates," RFC 4963, July 2007. 2106 [RFC5214] Templin, F., T. Gleeson, D. Thaler, "Intra-Site Automatic 2107 Tunnel Addressing Protocol (ISATAP)," RFC 5214, Mar. 2008. 2109 [RFC5320] Templin, F., Ed., "The Subnetwork Encapsulation and 2110 Adaptation Layer (SEAL)," RFC 5320, Feb. 2010. 2112 [RFC5556] Touch, J., R. Perlman, "Transparently Interconnecting Lots 2113 of Links (TRILL): Problem and Applicability Statement," RFC 2114 5556, May 2009. 2116 [RFC5944] Perkins, C., Ed., "IP Mobility Support for IPv4, Revised" 2117 RFC 5944, Nov. 2010. 2119 [RFC6040] Briscoe, B., "Tunneling of Explicit Congestion 2120 Notification," RFC 6040, Nov. 2010. 2122 [RFC6169] Krishnan, S., D. Thaler, J. 
Hoagland, "Security Concerns 2123 With IP Tunneling," RFC 6169, Apr. 2011. 2125 [RFC6325] Perlman, R., D. Eastlake, D. Dutt, S. Gai, A. Ghanwani, 2126 "Routing Bridges (RBridges): Base Protocol Specification," 2127 RFC 6325, July 2011. 2129 [RFC6434] Jankiewicz, E., J. Loughney, T. Narten, "IPv6 Node 2130 Requirements," RFC 6434, Dec. 2011. 2132 [RFC6438] Carpenter, B., S. Amante, "Using the IPv6 Flow Label for 2133 Equal Cost Multipath Routing and Link Aggregation in 2134 Tunnels," RFC 6438, Nov. 2011. 2136 [RFC6807] Farinacci, D., G. Shepherd, S. Venaas, Y. Cai, "Population 2137 Count Extensions to Protocol Independent Multicast (PIM)," 2138 RFC 6807, Dec. 2012. 2140 [RFC6830] Farinacci, D., V. Fuller, D. Meyer, D. Lewis, "The 2141 Locator/ID Separation Protocol," RFC 6830, Jan. 2013. 2143 [RFC6833] Fuller, V., D. Farinacci, "Locator/ID Separation Protocol 2144 (LISP) Map-Server Interface," RFC 6833, Jan. 2013. 2146 [RFC6864] Touch, J., "Updated Specification of the IPv4 ID Field," 2147 Proposed Standard, RFC 6864, Feb. 2013. 2149 [RFC6935] Eubanks, M., P. Chimento, M. Westerlund, "IPv6 and UDP 2150 Checksums for Tunneled Packets," RFC 6935, Apr. 2013. 2152 [RFC6936] Fairhurst, G., M. Westerlund, "Applicability Statement for 2153 the Use of IPv6 UDP Datagrams with Zero Checksums," RFC 2154 6936, Apr. 2013. 2156 [RFC6946] Gont, F., "Processing of IPv6 "Atomic" Fragments," RFC 2157 6946, May 2013. 2159 [RFC7364] Narten, T., Gray, E., Black, D., Fang, L., Kreeger, L., M. 2160 Napierala, "Problem Statement: Overlays for Network 2161 Virtualization", RFC 7364, Oct. 2014. 2163 [RFC7450] Bumgardner, G., "Automatic Multicast Tunneling," RFC 7450, 2164 Feb. 2015. 2166 [RFC7510] Xu, X., N. Sheth, L. Yong, R. Callon, D. Black, 2167 "Encapsulating MPLS in UDP," RFC 7510, April 2015. 2169 [RFC7588] Bonica, R., C. Pignataro, J. Touch, "A Widely-Deployed 2170 Solution to the Generic Routing Encapsulation Fragmentation 2171 Problem," RFC 7588, July 2015. 
2173 [RFC7676] Pignataro, C., R. Bonica, S. Krishnan, "IPv6 Support for 2174 Generic Routing Encapsulation (GRE)," RFC 7676, Oct. 2015. 2176 [RFC7690] Byerly, M., M. Hite, J. Jaeggli, "Close Encounters of the 2177 ICMP Type 2 Kind (Near Misses with ICMPv6 Packet Too Big 2178 (PTB))," RFC 7690, Jan. 2016. 2180 [RFC7761] Fenner, B., M. Handley, H. Holbrook, I. Kouvelas, R. 2181 Parekh, Z. Zhang, L. Zheng, "Protocol Independent Multicast 2182 - Sparse Mode (PIM-SM): Protocol Specification (Revised)," 2183 RFC 7761, Mar. 2016. 2185 [RFC8085] Eggert, L., G. Fairhurst, G. Shepherd, "UDP Usage 2186 Guidelines," RFC 8085, Mar. 2017. 2188 [RFC8086] Yong, L. (Ed.), E. Crabbe, X. Xu, T. Herbert, "GRE-in-UDP 2189 Encapsulation," RFC 8086, Feb. 2017. 2191 [Sa84] Saltzer, J., D. Reed, D. Clark, "End-to-end arguments in 2192 system design," ACM Trans. on Computing Systems, Nov. 1984. 2194 [Te17] Templin, F., "Asymmetric Extended Route Optimization," 2195 draft-templin-aerolink-75, May 2017. 2197 [To01] Touch, J., "Dynamic Internet Overlay Deployment and 2198 Management Using the X-Bone," Computer Networks, July 2001, 2199 pp. 117-135. 2201 [To03] Touch, J., Y. Wang, L. Eggert, G. Finn, "Virtual Internet 2202 Architecture," USC/ISI Tech. Report ISI-TR-570, Aug. 2003. 2204 [To16] Touch, J., "Middleboxes Models Compatible with the 2205 Internet," USC/ISI Tech. Report ISI-TR-711, Oct. 2016. 2207 [To98] Touch, J., S. Hotz, "The X-Bone," Proc. Globecom Third 2208 Global Internet Mini-Conference, Nov. 1998. 2210 [Zi80] Zimmermann, H., "OSI Reference Model - The ISO Model of 2211 Architecture for Open Systems Interconnection," IEEE Trans. 2212 on Comm., Apr. 1980. 2214 9. Acknowledgments 2216 This document originated as the result of numerous discussions among 2217 the authors, Jari Arkko, Stuart Bryant, Lars Eggert, Ted Faber, Gorry 2218 Fairhurst, Dino Farinacci, Matt Mathis, and Fred Templin.
It 2219 benefitted substantially from detailed feedback from Toerless Eckert, 2220 Vincent Roca, and Lucy Yong, as well as other members of the Internet 2221 Area Working Group. 2223 This work is partly supported by USC/ISI's Postel Center. 2225 This document was prepared using 2-Word-v2.0.template.dot. 2227 Authors' Addresses 2229 Joe Touch 2230 USC/ISI 2231 4676 Admiralty Way 2232 Marina del Rey, CA 90292-6695 2233 U.S.A. 2235 Phone: +1 (310) 448-9151 2236 Email: touch@isi.edu 2238 W. Mark Townsley 2239 Cisco 2240 L'Atlantis, 11, Rue Camille Desmoulins 2241 Issy Les Moulineaux, ILE DE FRANCE 92782 2243 Email: townsley@cisco.com 2245 APPENDIX A: Fragmentation efficiency 2247 A.1. Selecting fragment sizes 2249 There are different ways to fragment a packet. Consider a network 2250 with a PMTU as shown in Figure 16, where packets are encapsulated 2251 over the same network layer as they arrive on (e.g., IP in IP). If a 2252 packet as large as the PMTU arrives, it must be fragmented to 2253 accommodate the additional header. 
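The arithmetic behind this situation, and behind the two fragment sizing strategies compared in the rest of this section ("maximum fit" and "even split"), can be sketched as follows. This is only an illustration under stated assumptions: a 1500-byte MTU on every hop, a 20-byte outer header per encapsulation, outer fragmentation, and IPv4-style 8-byte fragment alignment. The function names are hypothetical and are not taken from any specification.

```python
FRAG_UNIT = 8  # assumption: IPv4-style fragment offsets, in 8-byte units


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def max_fit(payload, mtu, hdr):
    """Fill each fragment as full as the MTU allows; the last gets the rest."""
    room = mtu - hdr                      # bytes available after the outer header
    cap = room // FRAG_UNIT * FRAG_UNIT   # non-final fragments must be aligned
    chunks = []
    while payload > room:
        chunks.append(cap)
        payload -= cap
    chunks.append(payload)
    return chunks


def even_split(payload, mtu, hdr):
    """Split into roughly equal fragments instead of one full and one small."""
    room = mtu - hdr
    if payload <= room:
        return [payload]                  # fits as is, no fragmentation
    cap = room // FRAG_UNIT * FRAG_UNIT
    count = ceil_div(payload, cap)
    share = ceil_div(payload, count * FRAG_UNIT) * FRAG_UNIT
    return [share] * (count - 1) + [payload - share * (count - 1)]


def encapsulate(packets, mtu, hdr, split):
    """Tunnel ingress with outer fragmentation: every emitted fragment
    carries its own outer header of `hdr` bytes."""
    return [hdr + chunk for size in packets for chunk in split(size, mtu, hdr)]


if __name__ == "__main__":
    MTU, HDR = 1500, 20  # illustrative values only
    for split in (max_fit, even_split):
        tunnel1 = encapsulate([MTU], MTU, HDR, split)    # PMTU-sized arrival
        tunnel2 = encapsulate(tunnel1, MTU, HDR, split)  # nested second tunnel
        print(split.__name__, tunnel1, tunnel2)
```

Under these assumptions, maximum fit emits three packets from the second tunnel, while even split emits two, because neither of its fragments needs to be fragmented again.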
2255       X===========================X (transit PMTU)
2256       +----+----------------------+
2257       | iH | DDDDDDDDDDDDDDDDDDDD |
2258       +----+----------------------+
2259         |
2260         |  X===========================X (tunnel 1 MTU)
2261         |  +---+----+------------------+
2262     (a) +->| H'| iH | DDDDDDDDDDDDDDDD |
2263         |  +---+----+------------------+
2264         |      |
2265         |      |  X===========================X (tunnel 2 MTU)
2266         |      |  +----+---+----+-------------+
2267         | (a1) +->| nH'| H | iH | DDDDDDDDDDD |
2268         |      |  +----+---+----+-------------+
2269         |      |
2270         |      |  +----+-------+
2271         | (a2) +->| nH"| DDDDD |
2272         |         +----+-------+
2273         |
2274         |  +---+------+
2275     (b) +->| H"| DDDD |
2276            +---+------+
2277                |
2278                |  +----+---+------+
2279           (b1) +->| nH'| H"| DDDD |
2280                   +----+---+------+
2282 Figure 16 Fragmenting via maximum fit
2284 Figure 16 shows this process using "maximum fit", assuming outer 2285 fragmentation as an example (the situation is the same for inner 2286 fragmentation, but the headers that are affected differ). In maximum 2287 fit, the arriving packet is split into (a) and (b), where (a) is 2288 sized to the tunnel 1 MTU (the maximum that 2289 fits over the first tunnel). However, this tunnel then traverses 2290 another tunnel (number 2), whose impact the first tunnel ingress has 2291 not accommodated. The packet (a) arrives at the second tunnel 2292 ingress and needs to be encapsulated again, but it also needs to be 2293 fragmented, into (a1) and (a2), to fit into the tunnel 2 MTU. 2294 In this case, packet (b) arrives at the second tunnel ingress and is 2295 encapsulated into (b1) without fragmentation, because it is already 2296 below the tunnel 2 MTU size.
2298 In Figure 17, the fragmentation is done using "even split", i.e., by 2299 splitting the original packet into two roughly equal-sized 2300 components, (c) and (d).
Note that (d) contains more packet data than (c), 2301 because (c) also carries the original packet header (iH); this is an 2302 example of outer fragmentation. The packets (c) and (d) arrive at the 2303 second tunnel ingress and are encapsulated again; this time, 2304 neither packet exceeds the tunnel 2 MTU, and neither requires further 2305 fragmentation.
2307       X===========================X (transit PMTU)
2308       +----+----------------------+
2309       | iH | DDDDDDDDDDDDDDDDDDDD |
2310       +----+----------------------+
2311         |
2312         |  X===========================X (tunnel 1 MTU)
2313         |  +---+----+----------+
2314     (c) +->| H'| iH | DDDDDDDD |
2315         |  +---+----+----------+
2316         |      |
2317         |      |  X===========================X (tunnel 2 MTU)
2318         |      |  +----+---+----+----------+
2319         | (c1) +->| nH | H'| iH | DDDDDDDD |
2320         |         +----+---+----+----------+
2321         |
2322         |  +---+--------------+
2323     (d) +->| H"| DDDDDDDDDDDD |
2324            +---+--------------+
2325                |
2326                |  +----+---+--------------+
2327           (d1) +->| nH | H"| DDDDDDDDDDDD |
2328                   +----+---+--------------+
2330 Figure 17 Fragmenting via "even split"
2332 A.2. Packing
2334 Encapsulating individual packets to traverse a tunnel can be 2335 inefficient, especially where headers are large relative to the 2336 packets being carried. In that case, it can be more efficient to 2337 encapsulate many small packets in a single, larger tunnel payload.
2339 This technique, similar to the effect of packet bursting in Gigabit 2340 Ethernet (regardless of whether the packets are delimited using L2 symbols), 2341 reduces the overhead of the encapsulation headers 2342 (Figure 18). It reduces the work of header addition and removal at 2343 the tunnel endpoints, but adds the work of packing 2344 and unpacking the component packets carried.
2346        +-----+-----+
2347        | iHa | iDa |
2348        +-----+-----+
2349              |
2350              |     +-----+-----+
2351              |     | iHb | iDb |
2352              |     +-----+-----+
2353              |           |
2354              |           |     +-----+-----+
2355              |           |     | iHc | iDc |
2356              |           |     +-----+-----+
2357              |           |           |
2358              v           v           v
2359   +----+-----+-----+-----+-----+-----+-----+
2360   | oH | iHa | iDa | iHb | iDb | iHc | iDc |
2361   +----+-----+-----+-----+-----+-----+-----+
2363 Figure 18 Packing packets into a tunnel
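As a rough sketch of the header savings that packing provides, the following groups arriving packets, in order, into tunnel payloads and counts the outer header bytes spent. The greedy grouping rule, the 1500-byte tunnel MTU, and the 20-byte outer header are illustrative assumptions; this document does not specify a packing algorithm, and the function names are hypothetical.

```python
def pack(sizes, mtu, outer_hdr):
    """Greedily group packet sizes, in arrival order, so that each group
    plus one outer header fits within the tunnel MTU."""
    room = mtu - outer_hdr
    groups, current, used = [], [], 0
    for size in sizes:
        if size > room:
            # Too large for any tunnel packet: this would need
            # fragmentation (Appendix A.1) rather than packing.
            raise ValueError("packet does not fit in one tunnel payload")
        if used + size > room:
            groups.append(current)        # close the current tunnel packet
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        groups.append(current)
    return groups


def header_bytes(groups, outer_hdr):
    """Total outer header bytes: one header per emitted tunnel packet."""
    return len(groups) * outer_hdr


if __name__ == "__main__":
    MTU, OUTER = 1500, 20                 # illustrative values only
    small = [60] * 10                     # ten small inner packets
    packed = pack(small, MTU, OUTER)
    unpacked = [[s] for s in small]       # one tunnel packet per inner packet
    print(header_bytes(packed, OUTER), header_bytes(unpacked, OUTER))
```

With these values the ten packets share a single outer header (20 bytes) instead of carrying one each (200 bytes). The sketch ignores how the egress finds packet boundaries; in Figure 18 the inner headers are carried intact, so an egress could presumably rely on their length fields.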