Internet Area WG                                              J. Touch
Internet Draft                                                 USC/ISI
Intended status: Informational                             M. Townsley
Updates: 4459                                                    Cisco
Expires: September 2017                                 March 13, 2017

               IP Tunnels in the Internet Architecture
                  draft-ietf-intarea-tunnels-04.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on September 13, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   This document discusses the role of IP tunnels in the Internet
   architecture. An IP tunnel transits IP datagrams as payloads in
   non-link layer protocols.
   This document explains the relationship of IP tunnels to existing
   protocol layers and the challenges in supporting IP tunneling, based
   on the equivalence of tunnels to links. The implications of this
   document are used to derive recommendations that update MTU and
   fragment issues in RFC 4459.

Table of Contents

   1. Introduction
   2. Conventions used in this document
      2.1. Key Words
      2.2. Terminology
   3. The Tunnel Model
      3.1. What is a Tunnel?
      3.2. View from the Outside
      3.3. View from the Inside
      3.4. Location of the Ingress and Egress
      3.5. Implications of This Model
      3.6. Fragmentation
         3.6.1. Outer Fragmentation
         3.6.2. Inner Fragmentation
         3.6.3. The Necessity of Outer Fragmentation
   4. IP Tunnel Requirements
      4.1. Encapsulation Header Issues
         4.1.1. General Principles of Header Fields Relationships
         4.1.2. Addressing Fields
         4.1.3. Hop Count Fields
         4.1.4. IP Fragment Identification Fields
         4.1.5. Checksums
      4.2. MTU Issues
         4.2.1. Minimum MTU Considerations
         4.2.2. Fragmentation
         4.2.3. Path MTU Discovery
      4.3. Coordination Issues
         4.3.1. Signaling
         4.3.2. Congestion
         4.3.3. Multipoint Tunnels and Multicast
         4.3.4. Load Balancing
         4.3.5. Recursive Tunnels
   5. Observations
      5.1. Summary of Recommendations
      5.2. Impact on Existing Encapsulation Protocols
      5.3. Tunnel Protocol Designers
         5.3.1. For Future Standards
         5.3.2. Diagnostics
      5.4. Tunnel Implementers
      5.5. Tunnel Operators
   6. Security Considerations
   7. IANA Considerations
   8. References
      8.1. Normative References
      8.2. Informative References
   9. Acknowledgments
   APPENDIX A: Fragmentation efficiency
      A.1. Selecting fragment sizes
      A.2. Packing

1. Introduction

   The Internet layering architecture is loosely based on the ISO seven
   layer stack, in which data units traverse the stack by being wrapped
   inside data units of the next layer down [Cl88][Zi80].
   A tunnel is a mechanism for transmitting data units between endpoints
   by wrapping them as data units of the same or higher layers, e.g., IP
   in IP (Figure 1) or IP in UDP (Figure 2).

                     +----+----+--------------+
                     | IP'| IP |     Data     |
                     +----+----+--------------+

                       Figure 1 IP inside IP

                  +----+-----+----+--------------+
                  | IP'| UDP | IP |     Data     |
                  +----+-----+----+--------------+

                Figure 2 IP in UDP in IP in Ethernet

   This document focuses on tunnels that transit IP packets, i.e., in
   which an IP packet is the payload of another protocol, other than a
   typical link layer. A tunnel is a virtual link that can help decouple
   the network topology seen by transiting packets from the underlying
   physical network [To98][RFC2473]. Tunnels were critical in the
   development of multicast because not all routers were capable of
   processing multicast packets [Er94]. Tunnels allowed multicast
   packets to transit efficiently between multicast-capable routers over
   paths that did not support native link-layer multicast. Similar
   techniques have been used to support incremental deployment of other
   protocols over legacy substrates, such as IPv6 [RFC2546].

   Use of tunnels is common in the Internet.
   The word "tunnel" occurs in nearly 1,500 RFCs (of nearly 8,000
   current RFCs, close to 20%), and tunnels are supported within
   numerous protocols, including:

   o  IP in IP / mobile IP - IPv4 in IPv4 tunnels
      [RFC2003][RFC2473][RFC5944]

   o  IP in IPv6 - IPv6 or IPv4 in IPv6 [RFC2473]

   o  IPsec - includes a tunnel mode to enable encryption or
      authentication of an entire IP datagram inside another IP
      datagram [RFC4301]

   o  Generic Routing Encapsulation (GRE) - a shim layer for tunneling
      any network layer in any other network layer, as in IP in GRE in
      IP [RFC2784][RFC7588][RFC7676], or inside UDP in IP [RFC8086]

   o  MPLS - a shim layer for tunneling IP over a circuit-like path over
      a link layer [RFC3031] or inside UDP in IP [RFC7510], in which
      identifiers are rewritten on each hop, often used for traffic
      provisioning

   o  LISP - a mechanism that uses multipoint IP tunnels to reduce
      routing table load within an enclave of routers at the expense of
      more complex tunnel ingress encapsulation tables [RFC6830]

   o  TRILL - a mechanism that uses multipoint L2 tunnels to enable use
      of L3 routing (typically IS-IS) in an enclave of Ethernet bridges
      [RFC5556][RFC6325]

   o  Generic UDP Encapsulation (GUE) - IP in UDP in IP [He16]

   o  Automatic Multicast Tunneling (AMT) - IP in UDP in IP for
      multicast [RFC7450]

   o  L2TP - PPP over IP, to extend a subscriber's DSL/FTTH connection
      from an access line provider to an ISP [RFC3931]

   o  L2VPNs - provide a link topology different from that provided by
      physical links [RFC4664]; many of these are not classical tunnels,
      using only tags (Ethernet VLAN tags) rather than encapsulation

   o  L3VPNs - provide a network topology different from that provided
      by ISPs [RFC4176]

   o  NVO3 - data center network sharing (to be determined, which may
      include use of GUE or other tunnels) [RFC7364]

   o  PWE3 - emulates wire-like services over packet-switched services
      [RFC3985]

   o  SEAL/AERO - IP in IP tunneling with an additional shim header
      designed to overcome the limitations of RFC2003 [RFC5320][Te16]

   The variety of tunnel mechanisms raises the question of the role of
   tunnels in the Internet architecture and the potential need for these
   mechanisms to have similar and predictable behavior. In particular,
   the ways in which packet size (i.e., Maximum Transmission Unit or
   MTU) mismatches and error signals (e.g., ICMP) are handled may
   benefit from a coordinated approach.

   Regardless of the layer in which encapsulation occurs, tunnels
   emulate a link. The only difference is that a link operates over a
   physical communication channel, whereas a tunnel operates over other
   software protocol layers. Because tunnels are links, they are subject
   to the same issues as any link, e.g., MTU discovery, signaling, and
   the potential utility of native support for broadcast and multicast
   [RFC3819]. Tunnels have some advantages over native links, being
   potentially easier to reconfigure and control because they can
   generally rely on existing out-of-band communication between their
   endpoints.

   The first attempt to use large-scale tunnels was to transit multicast
   traffic across the Internet in 1988, and this resulted in 'tunnel
   collapse'. At the time, tunnels were not implemented as
   encapsulation-based virtual links, but rather as loose source routes
   on un-encapsulated IP datagrams [RFC1075]. Then, as now, routers did
   not support use of the loose source route IP option at line rate, and
   the multicast traffic caused overload of the so-called "slow path"
   processing of IP datagrams in software. Using encapsulation tunnels
   avoided that collapse by allowing the forwarding of encapsulated
   packets to use the "fast path" hardware processing [Er94].
   The remainder of this document describes the general principles of IP
   tunneling and discusses the key considerations in the design of any
   protocol that tunnels IP datagrams. It derives its conclusions from
   the equivalence of tunnels and links and from requirements of
   existing standards for supporting IPv4 and IPv6 as payloads.

2. Conventions used in this document

2.1. Key Words

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [RFC2119].

   In this document, these key words will appear with that
   interpretation only when in ALL CAPS. Lower case uses of these words
   are not to be interpreted as carrying RFC-2119 significance.

2.2. Terminology

   This document uses the following terminology. Optional words in the
   term are indicated in parentheses, e.g., "(link or network)
   interface" or "egress (interface)".

   Terms from existing RFCs:

   o  Messages: variable-length data labeled with globally-unique
      endpoint IDs, also known as a datagram for IP messages [RFC791].

   o  Node: a physical or logical network device that participates as
      either a host [RFC1122][RFC6434] or router [RFC1812]. Very early
      RFCs used this term to refer to gateways [RFC5], but it is
      currently the common way to describe a point in a network at
      which messages are processed.

   o  Host or endpoint: a node that sources or sinks messages labeled
      from/to its IDs, typically known as a host for both IP and
      higher-layer protocol messages [RFC1122].

   o  Source or sender: the node that generates a message [RFC1122].

   o  Destination or receiver: the node that consumes a message
      [RFC1122].

   o  Router or gateway: a node that relays IP messages using
      destination IDs and local context [RFC1812]. Routers also act as
      hosts when they source or sink messages. Also known as a forwarder
      for IP messages. Note that the notion of router is relative to the
      layer at which message processing is considered [To16].

   o  Link: a communications medium (or emulation thereof) that
      transfers IP messages between nodes without traversing a router
      (as would require decrementing the hop count) [RFC1122][RFC1812].

   o  (Link or network) Interface: a location on a link co-located with
      a node where messages depart onto that link or arrive from that
      link. On physical links, this interface formats the message for
      transmission and interprets the received signals.

   o  Path: a sequence of one or more links over which an IP message
      traverses between source and destination nodes (hosts or routers).

   o  (Link) MTU: the largest message that can transit a link [RFC791],
      also often referred to simply as "MTU". It does not include the
      size of link-layer information, e.g., link layer headers or
      trailers, i.e., it refers to the message that the link can carry
      as a payload rather than the message as it appears on the link.
      This is thus the largest network layer packet (including network
      layer headers, e.g., IP datagram) that can transit a link. Note
      that this need not be the native size of messages on the link,
      i.e., the link may internally fragment and reassemble messages.
      For IPv4, the MTU must be at least 68 bytes [RFC791], and for IPv6
      the MTU must be at least 1280 bytes [RFC2460].

   o  EMTU_S (effective MTU for sending): the largest message that can
      transit a link, possibly also accounting for fragmentation that
      happens before the fragments are emitted onto the link [RFC1122].
      When source fragmentation is not possible, EMTU_S = (link) MTU.
      For IPv4, this MUST be at least 68 bytes [RFC791] and for IPv6
      this MUST be at least 1280 bytes [RFC2460].

   o  EMTU_R (effective MTU to receive): the largest payload message
      that a receiver must be able to accept. This thus also represents
      the largest message that can traverse a link, taking into account
      reassembly at the receiver that happens after the fragments are
      received [RFC1122]. For IPv4, this MUST be at least 576 bytes
      [RFC791] and for IPv6 this MUST be at least 1500 bytes [RFC2460].

   o  Path MTU (PMTU): the largest message that can transit a path of
      links [RFC1191][RFC1981]. Typically, this is the minimum of the
      link MTUs of the links of the path, and represents the largest
      network layer message (including network layer headers) that can
      transit a path without requiring fragmentation while in transit.
      Note that this is not the largest network packet that can be sent
      between a source and destination, because that network packet
      might have been fragmented at the network layer of the source and
      reassembled at the network layer of the destination (if
      supported).

   o  Tunnel: a protocol mechanism that transits messages between an
      ingress interface and egress interface using encapsulation to
      allow an existing network path to appear as a single link
      [RFC1853]. Note that a protocol can be used to tunnel itself (IP
      over IP). There is essentially no difference between a tunnel and
      the conventional layering of the ISO stack (i.e., by this
      definition, Ethernet can be considered a tunnel for IP). A tunnel
      is also known as a virtual link.

   o  Ingress (interface): the virtual link interface of a tunnel that
      receives messages within a node, encapsulates them according to
      the tunnel protocol, and transmits them into the tunnel [RFC2983].
      An ingress is the tunnel equivalent of the outgoing (departing)
      network interface of a link, and its encapsulation processing is
      the tunnel equivalent of encoding a message for transmission over
      a physical link. The ingress virtual link interface can be
      co-located with the traffic source.

      The term 'ingress' in other RFCs also refers to 'network ingress',
      which is the entry point of traffic to a transit network. Because
      this document focuses on tunnels, the term "ingress" used in the
      remainder of this document implies "tunnel ingress".

   o  Egress (interface): a virtual link interface of a tunnel that
      receives messages that have finished transiting a tunnel and
      presents them to a node [RFC2983]. For reasons similar to ingress,
      the term 'egress' will refer to 'tunnel egress' throughout the
      remainder of this document. An egress is the tunnel equivalent of
      the incoming (arriving) network interface of a link and its
      decapsulation processing is the tunnel equivalent of interpreting
      a signal received from a physical link. The egress decapsulates
      messages for further transit to the destination. The egress
      virtual link interface can be co-located with the traffic
      destination.

   o  Ingress node: network device on which an ingress is attached as a
      virtual link interface [RFC2983]. Note that a node can act as both
      an ingress node and an egress node at the same time, but typically
      only for different tunnels.

   o  Egress node: device where an egress is attached as a virtual link
      interface [RFC2983]. Note that a device can act as both an ingress
      node and an egress node at the same time, but typically only for
      different tunnels.

   o  Inner header: the header of the message as it arrives to the
      ingress [RFC2003].

   o  Outer header(s): the headers added to the message by the ingress,
      as part of the encapsulation for tunnel transit [RFC2003].

   o  Mid-tunnel fragmentation: Fragmentation of the message during the
      tunnel transit, as could occur for IPv4 datagrams with DF=0
      [RFC2983].

   o  Atomic packet or datagram: an IP packet that has not been
      fragmented and which cannot be fragmented further [RFC6864].

   The following terms are introduced by this document:

   o  (Tunnel) transit packet: the packet arriving at a node connected
      to a tunnel that enters the ingress interface and exits the egress
      interface, i.e., the packet carried over the tunnel. This is
      sometimes known as the 'tunneled packet'. It is the tunnel
      equivalent of a network layer packet as it would traverse a link.
      This document focuses on IPv4 and IPv6 transit packets.

   o  (Tunnel) link packet: packets that traverse from ingress interface
      to egress interface, in which resides all or part of a transit
      packet. This is the tunnel equivalent of a link layer packet as it
      would traverse a link, which is why we use the same terminology.

   o  Tunnel MTU: the largest transit packet that can traverse a tunnel,
      i.e., the tunnel equivalent of a link MTU, which is why we use the
      same terminology. This is the largest transit packet which can be
      reassembled at the egress interface.

   o  Tunnel atom: the largest transit packet that can traverse a tunnel
      as an atomic packet, i.e., without requiring tunnel link packet
      fragmentation either at the ingress or on-path between the ingress
      and egress.

   o  Inner fragmentation: fragmentation of the transit packet that
      arrives at the ingress interface before any additional headers are
      added. This can only correctly occur for IPv4 DF=0 datagrams.

   o  Outer fragmentation: source fragmentation of the tunnel link
      packet after encapsulation; this can involve fragmenting the
      outermost header or any of the other (if any) protocol layers
      involved in encapsulation.

   o  Maximum frame size (MFS): the link-layer equivalent of the MTU,
      using the OSI term 'frame'. For Ethernet, the MTU (network packet
      size) is 1500 bytes but the MFS (link frame size) is 1518 bytes
      originally, and 1522 bytes assuming VLAN (802.1Q) tagging support.

   o  EMFS_S: the link layer equivalent of EMTU_S.

   o  EMFS_R: the link layer equivalent of EMTU_R.

   o  Path MFS: the link layer equivalent of PMTU.

3. The Tunnel Model

   A network architecture is an abstract description of a distributed
   communications system, its components and their relationships, the
   requisite properties of those components and the emergent properties
   of the system that result [To03]. Such descriptions can help explain
   behavior, as when the OSI seven-layer model is used as a teaching
   example [Zi80]. Architectures describe capabilities - and, just as
   importantly, constraints.

   A network can be defined as a system of endpoints and relays
   interconnected by communication paths, abstracting away issues of
   naming in order to focus on message forwarding. To the extent that
   the Internet has a single, coherent interpretation, its architecture
   is defined by its core protocols (IP [RFC791], TCP [RFC793], UDP
   [RFC768]) whose messages are handled by hosts, routers, and links
   [Cl88][To03], as shown in Figure 3:

            +------+      ------      ------      +------+
            |      |     /      \    /      \     |      |
            | HOST |--+  ROUTER  +--+  ROUTER  +--| HOST |
            |      |     \      /    \      /     |      |
            +------+      ------      ------      +------+

                  Figure 3 Basic Internet architecture

   As a network architecture, the Internet is a system of hosts
   (endpoints) and routers (relays) interconnected by links that
   exchange messages when possible. "When possible" defines the
   Internet's "best effort" principle. The limited role of routers and
   links represents the End-to-End Principle [Sa84] and longest-prefix
   match enables hierarchical forwarding using compact tables.
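
   The longest-prefix match mentioned above can be illustrated with a
   minimal sketch. It is not taken from this document: the function
   name, the table contents, and the next-hop labels are hypothetical,
   and the prefixes are documentation addresses chosen only for
   illustration. A real router would use a trie or hardware lookup
   rather than a linear scan.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return the next hop of the most specific prefix covering destination.

    `table` maps prefix strings to next-hop labels (both hypothetical).
    Returns None when no prefix covers the destination.
    """
    addr = ipaddress.ip_address(destination)
    best_net = None
    best_hop = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        # Skip prefixes of the other IP version, then test containment.
        if net.version != addr.version or addr not in net:
            continue
        # Keep the covering prefix with the longest mask seen so far.
        if best_net is None or net.prefixlen > best_net.prefixlen:
            best_net, best_hop = net, next_hop
    return best_hop


# Hypothetical three-entry table: one default route and two nested prefixes.
TABLE = {
    "0.0.0.0/0": "default-gw",
    "192.0.2.0/24": "router-a",
    "192.0.2.128/25": "router-b",
}

if __name__ == "__main__":
    for dst in ("192.0.2.200", "192.0.2.10", "198.51.100.1"):
        print(dst, "->", longest_prefix_match(TABLE, dst))
```

   The nested /24 and /25 entries show why the tables stay compact: one
   short prefix covers a whole region, and a longer prefix is added only
   where forwarding has to differ.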

   Although the definitions of host, router, and link seem absolute,
   they are often relative as viewed within the context of one protocol
   layer, each of which can be considered a distinct network
   architecture. An Internet gateway is an OSI Layer 3 router when it
   transits IP datagrams but it acts as an OSI Layer 2 host as it
   sources or sinks Layer 2 messages on attached links to accomplish
   this transit capability. In this way, one device (Internet gateway)
   behaves as different components (router, host) at different layers.

   Even though a single device may have multiple roles - even
   concurrently - at a given layer, each role is typically static and
   determined by context. An Internet gateway always acts as a Layer 2
   host and that behavior does not depend on where the gateway is viewed
   from within Layer 2. In the context of a single layer, a device's
   behavior is typically modeled as a single component from all
   viewpoints in that layer (with some notable exceptions, e.g., Network
   Address Translators, which appear as hosts and routers, depending on
   the direction of the viewpoint [To16]).

3.1. What is a Tunnel?

   A tunnel can be modeled as a link in another network
   [To98][To01][To03]. In Figure 4, a source host (Hsrc) and destination
   host (Hdst) communicate over a network M in which two routers (Ra
   and Rd) are connected by a tunnel. Keep in mind that network N and
   network M can both be components of the Internet, i.e., there may be
   regular traffic as well as tunneled traffic over any of the routers
   shown.

               --_                              --
   +------+   /    \                          /    \   +------+
   | Hsrc |--+  Ra  +      --        --      +  Rd  +--| Hdst |
   +------+   \   //\    /    \    /    \    /\\   /   +------+
                --/I \--+  Rb  +--+  Rc  +--/E \--
                  \  /   \    /    \    /   \  /
                   \/      --        --      \/
                    <------ Network N ------->
      <-------------------- Network M --------------------->

                        Figure 4 The big picture

   The tunnel consists of two interfaces - an ingress (I) and an egress
   (E) that lie along a path connected by network N. Regardless of how
   the ingress and egress interfaces are connected, the tunnel serves as
   a link between the nodes it connects (here, Ra and Rd).

   IP packets arriving at the ingress interface are encapsulated to
   traverse network N. We call these packets 'tunnel transit packets'
   (or just 'transit packets') because they will transit the tunnel
   inside one or more of what we call 'tunnel link packets'. Transit
   packets correspond to network (IP) packets traversing a conventional
   link and tunnel link packets correspond to the packets of a
   conventional link layer (which can be called just 'link packets').

   Link packets use the source address of the ingress interface and the
   destination address of the egress interface - using whatever address
   is appropriate to the Layer at which the ingress and egress
   interfaces operate (Layer 2, Layer 3, Layer 4, etc.). The egress
   interface decapsulates those messages, which then continue on network
   M as if emerging from a link. To transit packets and to the routers
   the tunnel connects (Ra and Rd), the tunnel acts as a link and the
   ingress and egress interfaces act as network interfaces to that link.

   The model of each component (ingress and egress interfaces) and the
   entire system (tunnel) depends on the layer from which they are
   viewed. From the perspective of the outermost hosts (Hsrc and Hdst),
   the tunnel appears as a link between two routers (Ra and Rd).
   For routers along the tunnel (e.g., Rb and Rc), the ingress and
   egress interfaces appear as the endpoint hosts on network N.

   When the tunnel network (N) is implemented using the same protocol as
   the endpoint network (M), the picture looks flatter (Figure 5), as if
   it were running over a single network. However, this appearance is
   incorrect - nothing has changed from the previous case. From the
   perspective of the endpoints, Rb and Rc and network N don't exist and
   aren't visible, and from the perspective of the tunnel, network M
   doesn't exist. The fact that networks N and M use the same protocol,
   and may traverse the same links, is irrelevant.

               --_            --        --            --
   +------+   /    \  /\    /    \    /    \    /\  /    \   +------+
   | Hsrc |--+  Ra  +/I \--+  Rb  +--+  Rc  +--/E \+  Rd  +--| Hdst |
   +------+   \    / \  /   \    /    \    /   \  / \    /   +------+
                --    \/      --        --      \/    --
                         <---- Network N ----->
           <------------------ Network M ------------------->

                   Figure 5 IP in IP network picture

3.2. View from the Outside

   As already observed, from outside the tunnel, to network M, the
   entire tunnel acts as a link (Figure 6). Consequently, all
   requirements for links supporting IP also apply to tunnels [RFC3819].

               --_                                --
   +------+   /    \                            /    \   +------+
   | Hsrc |--+  Ra  +--------------------------+  Rd  +--| Hdst |
   +------+   \    /                            \    /   +------+
                --                                --
         <------------------ Network M ------------------->

              Figure 6 Tunnels as viewed from the outside

   For example, the IP datagram hop counts (IPv4 Time-to-Live [RFC791]
   and IPv6 Hop Limit [RFC2460]) are decremented when traversing a
   router, but not when traversing a link - or thus a tunnel. Similarly,
   because the ingress and egress are interfaces on this outer network,
   they should never issue ICMP messages.
   A router or host would issue the appropriate ICMP, e.g., "packet too
   big" (IPv4 fragmentation needed and DF set [RFC792] or IPv6 packet
   too big [RFC4443]), when trying to send a packet to the egress, as it
   would for any interface.

   Tunnels have a tunnel MTU - the largest message that can transit that
   tunnel, just as links have a link MTU. This MTU may not reflect the
   native message size of hops within a multihop link, and the same is
   true for a tunnel. In both cases, the MTU is defined by the link's
   (or tunnel's) effective MTU to receive (EMTU_R).

3.3. View from the Inside

   Within network N, i.e., from inside the tunnel itself, the ingress
   interface is a source of tunnel link packets and the egress interface
   is a sink - so both are viewed as hosts on network N (Figure 7).
   Consequently, Internet host requirements [RFC1122] apply to ingress
   and egress interfaces when Network N uses IP (and thus the
   ingress/egress interfaces use IP encapsulation).

                              --        --
                     /\     /    \    /    \     /\
                    /I \--+  Rb  +--+  Rc  +--/E \
                    \  /   \    /    \    /   \  /
                     \/      --        --      \/
                        <---- Network N ----->

           Figure 7 Tunnels, as viewed from within the tunnel

   Viewed from within the tunnel, the outer network (M) doesn't exist.
   Tunnel link packets can be fragmented by the source (ingress
   interface) and reassembled at the destination (egress interface),
   just as at conventional hosts. The path between ingress and egress
   interfaces has a path MTU, but the endpoints can exchange messages as
   large as can be reassembled at the destination (egress interface),
   i.e., the EMTU_R of the egress interface. However, in both cases,
   these MTUs refer to the size of the message that can transit the
   links and between the hosts of network N, which represents a link
   layer to network M. I.e., the MTUs of network N represent the maximum
   frame sizes (MFSs) of the tunnel as a link in network M.
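
   The size relationships described above can be made concrete with a
   small sketch. It rests on assumptions that are not stated in this
   section: an IPv4-in-IPv4 tunnel whose outer header is 20 bytes with
   no options, hypothetical link MTUs for network N, a hypothetical
   EMTU_R at the egress, and the premise that the largest transit
   packet is the egress EMTU_R minus the encapsulation overhead. All
   function names are illustrative only.

```python
OUTER_HEADER = 20  # assumed: IPv4 outer header without options, in bytes


def path_mtu(link_mtus):
    """Path MTU inside network N: the smallest link MTU between I and E."""
    return min(link_mtus)


def tunnel_atom(pmtu, overhead=OUTER_HEADER):
    """Largest transit packet that crosses the tunnel without fragmentation."""
    return pmtu - overhead


def tunnel_mtu(egress_emtu_r, overhead=OUTER_HEADER):
    """Largest transit packet the egress can reassemble (assumed relation)."""
    return egress_emtu_r - overhead


def outer_fragment_count(transit_size, pmtu, overhead=OUTER_HEADER):
    """Tunnel link packets needed when the ingress fragments after encapsulating."""
    room = pmtu - overhead  # payload bytes available in one link packet
    if transit_size <= room:
        return 1
    # Every IPv4 fragment except the last carries a multiple of 8 bytes.
    aligned = room // 8 * 8
    remaining = transit_size - room  # bytes left after a full final fragment
    return 1 + (remaining + aligned - 1) // aligned


if __name__ == "__main__":
    pmtu = path_mtu([1500, 1400, 9000])  # hypothetical links in network N
    print("path MTU of network N:", pmtu)
    print("tunnel atom:", tunnel_atom(pmtu))
    print("tunnel MTU for an egress EMTU_R of 4096:", tunnel_mtu(4096))
    print("link packets for a 1500-byte transit packet:",
          outer_fragment_count(1500, pmtu))
```

   In this sketch the value reported to network M is the tunnel MTU,
   while the path MTU of network N only decides how many link packets
   the ingress emits, which mirrors the MTU versus MFS distinction drawn
   in the text.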
621 Information about the network - i.e., regarding network N MTU sizes, 622 network reachability, etc. - are relayed from the destination (egress 623 interface) and intermediate routers back to the source (ingress 624 interface), without regard for the external network (M). When such 625 messages arrive at the ingress interface, they may affect the 626 properties of that interface (e.g., its reported MTU to network M), 627 but they should never directly cause new ICMPs in the outer network 628 M. Again, events at interfaces don't generate ICMP messages; it would 629 be the host or router at which that interface is attached that would 630 generate ICMPs, e.g., upon attempting to use that interface. 632 3.4. Location of the Ingress and Egress 634 The ingress and egress interfaces are endpoints of the tunnel. Tunnel 635 interfaces may be physical or virtual. The interface may be 636 implemented inside the node where the tunnel attaches, e.g., inside a 637 host or router. The interface may also be implemented as a "bump in 638 the wire" (BITW), somewhere along a link between the two nodes the 639 link interconnects. IP in IP tunnels are often implemented as 640 interfaces on nodes, whereas IPsec tunnels are sometimes implemented 641 as BITW. These implementation variations determine only whether 642 information available at the link endpoints (ingress/egress 643 interfaces) can be easily shared with the connected network nodes. 645 3.5. Implications of This Model 647 This approach highlights a few key features of a tunnel as a network 648 architecture construct: 650 o To the transit packets, tunnels turn a network (Layer 3) path into 651 a (Layer 2) link 653 o To nodes the tunnel traverses, the tunnel ingress and egress 654 interfaces act as hosts that source and sink tunnel link packets 656 The consequences of these features are as follow: 658 o Like a link MTU, a tunnel MTU is defined by the effective MTU of 659 the receiver (i.e., EMTU_R of the egress). 
661 o The messages inside the tunnel are treated like any other link 662 layer, i.e., the MTU is determined by the largest (transit) 663 payload that traverses the link. 665 o The tunnel path MFS is not relevant to the transited traffic. 666 There is no mechanism or protocol by which it can be determined. 668 o Because routers, not links, alter hop counts [RFC1812], hopcounts 669 are not decremented solely by the transit of a tunnel. A packet 670 with a hop count of zero should successfully transit a link (and 671 thus a tunnel) that connects two hosts. 673 o The addresses of a tunnel ingress and egress interface correspond 674 to link layer addresses to the transit packet. Like links, some 675 tunnels may not have their own addresses. Like network interfaces, 676 ingress and egress interfaces typically require network layer 677 addresses. 679 o Like network interfaces, the ingress and egress interfaces are 680 never a direct source of ICMP messages but may provide information 681 to their attached host or router to generate those ICMP messages 682 during the processing of transit packets. 684 o Like network interfaces and links, two nodes may be connected by 685 any combination of tunnels and links, including multiple tunnels. 686 As with multiple links, existing network layer forwarding 687 determines which IP traffic uses each link or tunnel. 689 These observations make it much easier to determine what a tunnel 690 must do to transit IP packets, notably it must satisfy all 691 requirements expected of a link [RFC1122][RFC3819]. The remainder of 692 this document explores these implications in greater detail. 694 3.6. Fragmentation 696 There are two places where fragmentation can occur in a tunnel, 697 called 'outer fragmentation' and 'inner fragmentation'. This document 698 assumes that only outer fragmentation is viable because it is the 699 only approach that works for both IPv4 datagrams with DF=1 and IPv6. 701 3.6.1. 
Outer Fragmentation 703 Outer fragmentation is shown in Figure 8. The bottom of the figure 704 shows the network topology, where transit packets originate at the 705 source, enter the tunnel at the ingress interface for encapsulation, 706 exit the tunnel at the egress interface where they are decapsulated, 707 and arrive at the destination. The packet traffic is shown above the 708 topology, where the transit packets are shown at the top. In this 709 diagram, the ingress interface is located on router 'Ra' and the 710 egress interface is located on router 'Rd'. 712 When the link packet - which is the encapsulated transit packet - 713 would exceed the tunnel MTU, the packet needs to be fragmented. In 714 this case the packet is fragmented at the outer (link) header, with 715 the fragments shown as (b1) and (b2). The outer header indicates 716 fragmentation (as ' and "), the inner (transit) header occurs only in 717 the first fragment, and the inner (transit) data is broken across the 718 two packets. These fragments are reassembled at the egress interface 719 during decapsulation in step (c), where the resulting link packet is 720 reassembled and decapsulated so that the transit packet can continue 721 on its way to the destination. 
723 Transit packet 724 +----+----+ +----+----+ 725 | iH | iD |------+ - - - - - - - - - - +------>| iH | iD | 726 +----+----+ | | +----+----+ 727 v Link packet | 728 +----+----+----+ +----+----+----+ 729 (a) | oH | iH | iD | | oH | iH | iD | (d) 730 +----+----+----+ +----+----+----+ 731 | ^ 732 | Link packet fragment #1 | 733 | +----+----+-----+ | 734 (b1) +----- >| oH'| iH | iD1 |-------+ (c) 735 | +----+----+-----+ | 736 | | 737 | Link packet fragment #2 | 738 | +----+-----+ | 739 (b2) +----- >| oH"| iD2 |------------+ 740 +----+-----+ 741 +-----+ +--+ +---+ +---+ +--+ +-----+ 742 | | | |/ \ / \| | | | 743 | Src |----|Ra|Ingress|=======================|Egress |Rd|----| Dst | 744 | | | |\ / \ /| | | | 745 +-----+ +--+ +---+ +---+ +--+ +-----+ 747 Figure 8 Fragmentation of the (outer) link packet 749 Outer fragmentation isolates the tunnel encapsulation duties to the 750 ingress and egress interfaces. This can be considered a benefit in 751 clean, layered network design, but also may require complex egress 752 interface decapsulation, especially where tunnels aggregate large 753 amounts of traffic, such as may result in IP ID overload (see Sec. 754 4.1.4). Outer fragmentation is valid for any tunnel link protocol 755 that supports fragmentation (e.g., IPv4 or IPv6), in which the tunnel 756 endpoints act as the host endpoints of that protocol. 758 Along the tunnel, the inner (transit) header is contained only in the 759 first fragment, which can interfere with mechanisms that 'peek' into 760 lower layer headers, e.g., as for relayed ICMP (see Sec. 4.3). 762 3.6.2. Inner Fragmentation 764 Inner fragmentation distributes the impact of tunnel fragmentation 765 across both egress interface decapsulation and transit packet 766 destination, as shown in Figure 9; this can be especially important 767 when the tunnel would otherwise need to source (outer) fragment large 768 amounts of traffic. 
However, this mechanism is valid only when the 769 transit packets can be fragmented on-path, e.g., as when the transit 770 packets are IPv4 datagrams with DF=0. 772 Again, the network topology is shown at the bottom of the figure, and 773 the original packets are shown at the top. Packets arrive at the ingress 774 node (router Ra) and are fragmented there into transit packet 775 fragments #1 (a1) and #2 (a2). These fragments are encapsulated at 776 the ingress interface in steps (b1) and (b2) and each resulting link 777 packet traverses the tunnel. When these link packets arrive at the 778 egress interface they are decapsulated in steps (c1) and (c2) and the 779 egress node (router) forwards the transit packet fragments to their 780 destination. This destination is then responsible for reassembling 781 the transit packet fragments into the original transit packet (d). 783 Along the tunnel, the inner headers are copied into each fragment, 784 and so can be 'peeked at' inside the tunnel (see Sec. 4.3). 785 Fragmentation shifts from the ingress interface to the ingress router 786 and reassembly shifts from the egress interface to the destination.
788 Transit packet 789 +----+----+ +----+----+ 790 | iH | iD |-+ - - - - - - - - - - - - - - - - >| iH | iD | 791 +----+----+ | +----+----+ 792 v Transit packet fragment #1 ^ 793 +----+-----+ +----+-----+ | 794 (a1) | iH'| iD1 | | iH'| iD1 |-----+(d) 795 +----+-----+ +----+-----+ ^ 796 | | Link packet #1 ^ | 797 | | +----+----+----- | | 798 | (b1)+----- >| oH | iH'| iD1 |-------+(c1) | 799 | +----+----+-----+ | 800 | | 801 v Transit packet fragment #2 | 802 +----+-----+ +----+-----+ | 803 (a2) | iH"| iD2 | | iH"| iD2 |-----+ 804 +----+-----+ +----+-----+ 805 | Link packet #2 | 806 | +----+----+-----+ | 807 (b2)+----- >| oH | iH"| iD2 |-------+(c2) 808 +----+----+-----+ 809 +-----+ +--+ +---+ +---+ +--+ +-----+ 810 | | | |/ \ / \| | | | 811 | Src |----|Ra|Ingress|=======================|Egress |Rd|----| Dst | 812 | | | |\ / \ /| | | | 813 +-----+ +--+ +---+ +---+ +--+ +-----+ 815 Figure 9 Fragmentation of the inner (transit) packet 817 3.6.3. The Necessity of Outer Fragmentation 819 Fragmentation is critical for tunnels that support transit packets 820 for protocols with minimum MTU requirements, while operating over 821 tunnel paths using protocols that have their own MTU requirements. 822 Depending on the amount of space used by encapsulation, these two 823 minimums will ultimately interfere (especially when a protocol 824 transits itself either directly, as with IP-in-IP, or indirectly, as 825 in IP-in-GRE-in-IP), and the transit packet will need to be 826 fragmented to support a tunnel MTU while traversing tunnels with 827 their own tunnel path MTUs. 829 Outer fragmentation is the only solution that supports all IPv4 and 830 IPv6 traffic, because inner fragmentation is allowed only for IPv4 831 datagrams with DF=0. 833 4. IP Tunnel Requirements 835 The requirements of an IP tunnel are defined by the requirements of 836 an IP link because both transit IP packets.
A tunnel thus must 837 transit the IP minimum MTU, i.e., 68 bytes for IPv4 [RFC791] and 1280 838 bytes for IPv6 [RFC2460] and a tunnel must support address resolution 839 when there is more than one egress interface for that tunnel. 841 The requirements of the tunnel ingress and egress interfaces are 842 defined by the network over which they exchange messages (link 843 packets). For IP-over-IP, this means that the ingress interface MUST 844 NOT exceed the IP fragment identification field uniqueness 845 requirements [RFC6864]. Uniqueness is more difficult to maintain at 846 high packet rates for IPv4, whose fragment ID field is only 16 bits. 848 These requirements remain even though tunnels have some unique 849 issues, including the need for additional space for encapsulation 850 headers and the potential for tunnel MTU variation. 852 4.1. Encapsulation Header Issues 854 Tunneling uses encapsulation to use a non-link protocol as a link 855 layer. The encapsulation layer thus has the same requirements and 856 expectations as any other IP link layer when used to transit IP 857 packets. These relationships are addressed in the following 858 subsections. 860 4.1.1. General Principles of Header Fields Relationships 862 Some tunnel specifications attempt to relate the header fields of the 863 transit packet and tunnel link packet. In some cases, this 864 relationship is warranted, whereas in other cases the two protocol 865 layers need to be isolated from each other. For example, the tunnel 866 link header source and destination addresses are network endpoints in 867 the tunnel network N, but have no meaning in the outer network M. The 868 two sets of addresses are effectively independent, just as are other 869 network and link addresses.
871 Because the tunneled packet uses source and destination addresses 872 with a separate meaning, it is inappropriate to copy or reuse the 873 IPv4 Identification (ID) or IPv6 Fragment ID fields of the tunnel 874 transit packet (see Section 4.1.4). Similarly, the DF field of the 875 transit packet is not related to that field in the tunnel link packet 876 header (presuming both are IPv4) (see Section 4.2). Most other fields 877 are similarly independent between the transit packet and tunnel link 878 packet. When a field value is generated in the encapsulation header, 879 its meaning should be derived from what is desired in the context of 880 the tunnel as a link. When feedback is received from these fields, 881 they should be presented to the tunnel ingress and egress as if they 882 were network interfaces. The behavior of the node where these 883 interfaces attach should be identical to that of a conventional link. 885 There are exceptions to this rule that are explicitly intended to 886 relay signals from inside the tunnel to the network outside the 887 tunnel, typically relevant only when the tunnel network N and the 888 outer network M use the same network. These apply only when that 889 coordination is defined, as with explicit congestion notification 890 (ECN) [RFC6040] (see Section 4.3.2), and differentiated services code 891 points (DSCPs) [RFC2983]. Equal-cost multipath routing may also 892 affect how some encapsulation fields are set, including IPv6 flow 893 labels [RFC6438] and source ports for transport protocols when used 894 for tunnel encapsulation [RFC8085] (see Section 4.3.4). 896 4.1.2. Addressing Fields 898 Tunnel ingresses and egresses have addresses associated with the 899 encapsulation protocol. These addresses are the source and 900 destination (respectively) of the encapsulated packet while 901 traversing the tunnel network. 
903 Tunnels may or may not have addresses in the network whose traffic 904 they transit (e.g., network M in Figure 4). In some cases, the tunnel 905 is an unnumbered interface to a point-to-point virtual link. When the 906 tunnel has multiple egresses, tunnel interfaces require separate 907 addresses in network M. 909 To see the effect of tunnel interface addresses, consider traffic 910 sourced at router Ra in Figure 4. Even before being encapsulated by 911 the ingress, traffic needs a source IP network address that belongs 912 to the router. One option is to use an address associated with one of 913 the other interfaces of the router [RFC1122]. Another option is to 914 assign a number to the tunnel interface itself. Regardless of which 915 address is used, the resulting IP packet is then encapsulated by the 916 tunnel ingress using the ingress address as a separate operation. 918 4.1.3. Hop Count Fields 920 The Internet hop count field is used to detect and avoid forwarding 921 loops that cannot be corrected without a synchronized reboot. The 922 IPv4 Time-to-Live (TTL) and IPv6 Hop Limit field each serve this 923 purpose [RFC791][RFC2460]. The IPv4 TTL field was originally intended 924 to indicate packet expiration time, measured in seconds. A router is 925 required to decrement the TTL by at least one or the number of 926 seconds the packet is delayed, whichever is larger [RFC1812]. Packets 927 are rarely held that long, and so the field has come to represent the 928 count of the number of routers traversed. IPv6 makes this meaning 929 more explicit. 931 These hop count fields represent the number of network forwarding 932 elements (routers) traversed by an IP datagram. An IP datagram with a 933 hop count of zero can traverse a link between two hosts because it 934 never visits a router (where it would need to be decremented and 935 would have been dropped). 
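The distinction drawn above between routers and links can be illustrated with a small sketch. This is a simplified model, not taken from any specification: the names Datagram, traverse_link, and forward_at_router are hypothetical, and the router rule is reduced to "decrement by one and drop if nothing remains".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Datagram:
    # IPv4 TTL or IPv6 Hop Limit (hypothetical simplified model)
    hop_count: int


def traverse_link(d: Datagram) -> Datagram:
    """A link, and therefore a tunnel acting as a link, leaves the hop count untouched."""
    return d


def forward_at_router(d: Datagram) -> Optional[Datagram]:
    """A router decrements the hop count; a datagram whose count would reach zero is dropped."""
    if d.hop_count <= 1:
        return None  # dropped at the router
    return Datagram(d.hop_count - 1)
```

Under this model, a datagram with a hop count of zero still crosses a host-to-host link (or tunnel) unchanged, while the same datagram arriving at a router is dropped.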
937 An IP datagram traversing a tunnel thus need not have its hop count 938 modified, i.e., the tunnel transit header need not be affected. A 939 zero hop count datagram should be able to traverse a tunnel as easily 940 as it traverses a link. A router MAY be configured to decrement 941 packets traversing a particular link (and thus a tunnel), which may 942 be useful in emulating a tunnel path as if it were a network path 943 that traversed one or more routers, but this is strictly optional. 944 The ability of the outer network M and tunnel network N to avoid 945 indefinitely looping packets does not rely on the hop counts of the 946 transit packet and tunnel link packet being related. 948 The hop count field is also used by several protocols to determine 949 whether endpoints are 'local', i.e., connected to the same subnet 950 (link-local discovery and related protocols [RFC4861]). A tunnel is a 951 way to make a remote network address appear directly-connected, so it 952 makes sense that the other ends of the tunnel appear local and that 953 such link-local protocols operate over tunnels unless configured 954 explicitly otherwise. When the interfaces of a tunnel are numbered, 955 these can be interpreted the same way as if they were on the same 956 link subnet. 958 4.1.4. IP Fragment Identification Fields 960 Both IPv4 and IPv6 include an IP Identification (ID) field to support 961 IP datagram fragmentation and reassembly [RFC791][RFC1122][RFC2460]. 962 When used, the ID field is intended to be unique for every packet for 963 a given source address, destination address, and protocol, such that 964 it does not repeat within the Maximum Segment Lifetime (MSL). 966 For IPv4, this field is in the default header and is meaningful only 967 when either source fragmented or DF=0 ("non-atomic packets") 968 [RFC6864]. For IPv6, this field is contained in the optional Fragment 969 Header [RFC2460]. 
Although IPv6 supports only source fragmentation, 970 the field may occur in atomic fragments [RFC6946]. 972 Although the ID field was originally intended for fragmentation and 973 reassembly, it can also be used to detect and discard duplicate 974 packets, e.g., at congested routers (see Sec. 3.2.1.5 of [RFC1122]). 975 For this reason, and because IPv4 packets can be fragmented anywhere 976 along a path, all non-atomic IPv4 packets and all IPv6 packets 977 between a source and destination of a given protocol must have unique 978 ID values over the potential fragment reordering period 979 [RFC2460][RFC6864]. 981 The uniqueness of the IP ID is a known problem for high speed nodes, 982 because it limits the speed of a single protocol between two 983 endpoints [RFC4963]. Although this RFC suggests that the uniqueness 984 of the IP ID is moot, tunnels exacerbate this condition. A tunnel 985 often aggregates traffic from a number of different source and 986 destination addresses, of different protocols, and encapsulates them 987 in a header with the same ingress and egress addresses, all using a 988 single encapsulation protocol. If the ingress enforces IP ID 989 uniqueness, this can either severely limit tunnel throughput or can 990 require substantial resources; the alternative is to ignore IP ID 991 uniqueness and risk reassembly errors. Although fragmentation is 992 somewhat rare in the current Internet at large, it can be common 993 along a tunnel. Reassembly errors are not always detected by other 994 protocol layers (see Sec. 4.3.3), and even when detected they can 995 result in excessive overall packet loss and can waste bandwidth 996 between the egress and ultimate packet destination. 998 The 32-bit IPv6 ID field in the Fragment Header is typically used 999 only during source fragmentation.
The size of the ID field is 1000 typically sufficient that a single counter can be used at the tunnel 1001 ingress, regardless of the endpoint addresses or next-header 1002 protocol, allowing efficient support for very high throughput 1003 tunnels. 1005 The smaller 16-bit IPv4 ID is more difficult to correctly support. A 1006 recent update to IPv4 allows the ID to be repeated for atomic 1007 packets. When either source fragmentation or on-path fragmentation is 1008 supported, the tunnel ingress may need to keep independent ID 1009 counters for each tunnel source/destination/protocol tuple. 1011 4.1.5. Checksums 1013 IP traffic transiting a tunnel needs to expect a similar level of 1014 error detection and correction as it would expect from any other 1015 link. In the case of IPv4, there are no such expectations, which is 1016 partly why it includes a header checksum [RFC791]. 1018 IPv6 omitted the header checksum because it already expects most link 1019 errors to be detected and dropped by the link layer and because it 1020 also assumes transport protection [RFC2460]. When transiting IPv6 1021 over IPv6, the tunnel fails to provide the expected error detection. 1022 This is why IPv6 is often tunneled over layers that include separate 1023 protection, such as GRE [RFC2784]. 1025 The fragmentation created by the tunnel ingress can increase the need 1026 for stronger error detection and correction, especially at the tunnel 1027 egress to avoid reassembly errors. The Internet checksum is known to 1028 be susceptible to reassembly errors that could be common [RFC4963], 1029 and should not be relied upon for this purpose. This is why some 1030 tunnel protocols, e.g., SEAL and AERO [RFC5320][Te16], include a 1031 separate checksum. This requirement can be undermined when using UDP 1032 as a tunnel with no UDP checksum (as per [RFC6935][RFC6936]) when 1033 fragmentation occurs because the egress has no checksum with which to 1034 validate reassembly. 
For this reason, it is safe to use UDP with a 1035 zero checksum for atomic tunnel link packets only; when used on 1036 fragments, whether generated at the ingress or en-route inside the 1037 tunnel, omission of such a checksum can result in reassembly errors 1038 that can cause additional work (capacity, forwarding processing, 1039 receiver processing) downstream of the egress. 1041 4.2. MTU Issues 1043 Link MTUs, IP datagram limits, and transport protocol segment sizes 1044 are already related by several requirements 1045 [RFC768][RFC791][RFC1122][RFC1812][RFC2460] and by a variety of 1046 protocol mechanisms that attempt to establish relationships between 1047 them, including path MTU discovery (PMTUD) [RFC1191][RFC1981], 1048 packetization layer path MTU discovery (PLPMTUD) [RFC4821], as well as 1049 mechanisms inside transport protocols [RFC793][RFC4340][RFC4960]. The 1050 following subsections summarize the interactions between tunnels and 1051 MTU issues, including minimum tunnel MTUs, tunnel fragmentation and 1052 reassembly, and MTU discovery. 1054 4.2.1. Minimum MTU Considerations 1056 There are a variety of minimum MTU values to consider, both 1057 in a conventional network and in a tunnel as a link in that network. 1058 These are indicated in Figure 10, an annotated variant of Figure 4. 1059 Note that a (link) MTU (a) corresponds to a tunnel MTU (b) and that a 1060 path MTU (c) corresponds to a tunnel path MTU (f). The tunnel MTU is 1061 the EMTU_R of the egress interface, because that defines the largest 1062 transit packet message that can traverse the tunnel as a link in 1063 network M. The ability to traverse the hops of the tunnel - in 1064 network N - is not related, and only the ingress need be concerned 1065 with that value.
1067 --_ -- 1068 +------+ / \ / \ +------+ 1069 | Hsrc |--+ Ra + -- -- + Rd +--| Hdst | 1070 +------+ \ //\ / \ / \ /\\ / +------+ 1071 --/I \---+ Rb +---+ Rc +---/E \-- 1072 \ / \ / \ / \ / 1073 \/ -- -- \/ 1074 <----- Network N -------> 1075 <-------------------- Network M ---------------------> 1077 Communication in network M viewed at that layer: 1078 (a) <-> Link MTU 1079 (b) <---- Tunnel MTU ---------> 1080 (c) <----------- Path MTU -----------------> 1081 (d) <------------------- EMTU_R ---------------------------> 1083 Communication in network N viewed at that layer: 1084 (e) <--> Link MTU 1085 (f) <--- Path MTU ------> 1086 (g) <----- EMTU_R ---------> 1088 Communication in network N viewed from network M: 1089 (h) <--> MFS 1090 (i) <--- Path MFS ------> 1091 (j) <----- EMFS_R ---------> 1093 Figure 10 The variety of MTU values 1095 Consider the following example values. For IPv6 transit packets, the 1096 minimum (link) MTU (a) is 1280 bytes, which similarly applies to 1097 tunnels as the tunnel MTU (b). The path MTU (c) is the minimum of the 1098 link MTUs (including tunnels as links) along a path, and indicates the 1099 largest IP message (packet or fragment) that can traverse a path 1100 between a source and destination without on-path fragmentation (e.g., 1101 supported in IPv4 with DF=0). Path MTU discovery, either at the 1102 network layer (PMTUD [RFC1191][RFC1981]) or packetization layer 1103 (PLPMTUD [RFC4821]) attempts to tune the source IP packets and 1104 fragments (i.e., EMTU_S) to fit within this path MTU size to avoid 1105 fragmentation and reassembly [Ke95]. The minimum EMTU_R (d) is 1500 1106 bytes, i.e., the minimum MTU for endpoint-to-endpoint communication. 1108 The tunnel is a source-destination communication in network N.
Messages between the tunnel source (the ingress interface) and tunnel 1110 destination (egress interface) similarly experience a variety of 1111 network N MTU values, including a link MTU (e), a path MTU (f), and 1112 an EMTU_R (g). The network N EMTU_S is limited by the path MTU, and 1113 the source-destination message maximum is limited by EMTU_R, just as 1114 it was for those types of MTUs in network M. For an IPv6 network 1115 N, its link and path MTUs must be at least 1280 and its EMTU_R must 1116 be at least 1500. 1118 However, viewed from the context of network M, these network N MTUs 1119 are link layer properties, i.e., maximum frame sizes (MFS). The 1120 network N EMTU_R determines the largest message that can transit 1121 between the source (ingress) and destination (egress), but viewed 1122 from network M this is a link layer, i.e., EMFS_R. The tunnel EMTU_R 1123 is EMFS_R minus the link (encapsulation) headers, because EMFS_R includes the 1124 encapsulation headers of the link layer. Just as the path MTU has no 1125 bearing on EMTU_R, the path MFS in network N has no bearing on the 1126 MTU of the tunnel.
1128 For IPv6 networks M and N, these relationships are summarized as 1129 follows: 1131 o Network M MTU = 1280, the largest transit packet (i.e., payload) 1132 over a single IPv6 link in the base network without source 1133 fragmentation 1135 o Network M path MTU = 1280, the transit packet (i.e., payload) that 1136 can traverse a path of links in the base network without source 1137 fragmentation 1139 o Network M EMTU_R = 1500, the largest transit packet (i.e., 1140 payload) that can traverse a path in the base network with source 1141 fragmentation 1143 o Network N MTU = 1280 (for the same reasons as for network M) 1145 o Network N path MTU = 1280 (for the same reasons as for network M) 1147 o Network N EMTU_R = 1500 (for the same reasons as for network M) 1149 o Tunnel MTU = 1500-encapsulation (typically 1460), the network N 1150 EMTU_R payload 1152 o Tunnel atom = largest network M message that transits a tunnel 1153 using network N as a link layer without fragmentation: 1280- 1154 encapsulation, i.e., the network N EMTU_S payload, treating EMTU_S 1155 as a network M EMFS_S. 1157 The difference between the network N MTU and its treatment as a link 1158 layer in network M is the reason why the tunnel ingress interfaces 1159 need to support fragmentation and tunnel egress interfaces need to 1160 support reassembly in the encapsulation layer(s). The high cost of 1161 fragmentation and reassembly is why it is useful for applications to 1162 avoid sending messages too close to the size of the tunnel path MTU 1163 [Ke95], although there is no signaling mechanism that can achieve 1164 this (see Section 4.2.3). 1166 4.2.2. Fragmentation 1168 A tunnel interacts with fragmentation in two different ways. 
As a 1169 link in network M, transit packets might be fragmented before they 1170 reach the tunnel - i.e., in network M either during source 1171 fragmentation (if generated at the same node as the ingress 1172 interface) or forwarding fragmentation (for IPv4 DF=0 datagrams). In 1173 addition, link packets traversing inside the tunnel may require 1174 fragmentation by the ingress interface - i.e., source fragmentation 1175 by the ingress as a host in network N. These two fragmentation 1176 operations are no more related than are conventional IP fragmentation 1177 and ATM segmentation and reassembly; one occurs at the (transit) 1178 network layer, the other at the (virtual) link layer. 1180 Although many of these issues with tunnel fragmentation and MTU 1181 handling were discussed in [RFC4459], that document described a 1182 variety of alternatives as if they were independent. This document 1183 explains the combined approach that is necessary. 1185 Like any other link, an IPv4 tunnel must transit 68 byte packets 1186 without requiring source fragmentation [RFC791][RFC1122] and an IPv6 1187 tunnel must transit 1280 byte packets without requiring source 1188 fragmentation [RFC2460]. The tunnel MTU interacts with routers or 1189 hosts it connects the same way as would any other link MTU. The 1190 pseudocode examples in this section use the following values: 1192 o TP: transit packet 1194 o TPsize: size of the transit packet (including its headers) 1196 o encaps: ingress encapsulation overhead (tunnel link headers) 1198 o tunMTU: tunnel MTU, i.e., network N egress EMTU_R - encaps. 1200 o tunAtom: tunnel atom size, equal to the egress host-level EMTU_S - 1201 encaps. 1203 These rules apply at the host/router where the tunnel is attached, 1204 i.e., at the network layer of the transit packet (we assume that all 1205 tunnels, including multipoint tunnels, have a single, uniform MTU). 
1206 These are basic source fragmentation rules (or transit 1207 refragmentation for IPv4 DF=0 datagrams), and have no relation to the 1208 tunnel itself other than to consider the tunnel MTU as the effective 1209 link MTU of the next hop. 1211 Inside the source during transit packet generation or a router during 1212 transit packet forwarding, the tunnel is treated as if it were any 1213 other link (i.e., this is not tunnel processing, but rather typical 1214 source or router processing), as indicated in the pseudocode in 1215 Figure 11. 1217 if (TPsize > tunMTU) then 1218 if (TP can be on-path fragmented, e.g., IPv4 DF=0) then 1219 split TP into fragments of tunMTU size 1220 and send each fragment to the tunnel ingress interface 1221 else 1222 drop the TP and send ICMP "too big" to TP source 1223 endif 1224 else 1225 send TP to the tunnel ingress 1226 endif 1228 Figure 11 Router / host packet size processing algorithm 1230 The tunnel ingress acts as a host on the tunnel path, i.e., it performs source 1231 fragmentation of tunnel link packets (we assume that all tunnels, 1232 even multipoint tunnels, have a single, uniform tunnel MTU), using 1233 the pseudocode shown in Figure 12. Note that ingress source 1234 fragmentation occurs in the encapsulation process, which may involve 1235 more than one protocol layer. In those cases, fragmentation can occur 1236 at any of the layers of encapsulation in which it is supported, based 1237 on the configuration of the ingress. 1239 if (TPsize <= tunAtom) then 1240 encapsulate the TP and emit 1241 else 1242 if (tunAtom < TPsize) then 1243 fragment TP into tunAtom chunks 1244 encapsulate each chunk and emit 1245 endif 1246 endif 1248 Figure 12 Ingress processing algorithm 1250 Just as a network interface should never receive a message larger 1251 than its MTU, a tunnel should never receive a message larger than its 1252 tunnel MTU limit (see the host/router processing above).
A router 1253 attempting to process such a message would already have generated an 1254 ICMP "packet too big" and the transit packet would have been dropped 1255 before entering into this algorithm. Similarly, a host would have 1256 generated an error internally and aborted the attempted transmission. 1258 As an example, consider IPv4 over IPv6 or IPv6 over IPv6 tunneling, 1259 where IPv6 encapsulation adds a 40 byte fixed header plus IPv6 1260 options (i.e., IPv6 header extensions) of total size 'EHsize'. The 1261 tunnel MTU will be at least 1500 - (40 + EHsize) bytes. The tunnel 1262 path MTU will be at least 1280 - (40 + EHsize) bytes. Transit packets 1263 larger than 1460-EHsize will be dropped by a node before ingress 1264 processing. Considering these minimum values, the previous algorithm 1265 uses actual values shown in the pseudocode in Figure 13. 1267 if (TPsize <= (1240 - EHsize)) then 1268 encapsulate TP and emit 1269 else 1270 if ((1240 - EHsize) < TPsize) then 1271 fragment TP into (1240 - EHsize) chunks 1272 encapsulate each chunk and emit 1273 endif 1274 endif 1276 Figure 13 Ingress processing for a tunnel over IPv6 1278 An IPv6 tunnel supports IPv6 transit only if EHsize is 180 bytes or 1279 less; otherwise the incoming transit packet would have been dropped 1280 as being too large by the host/router. Similarly, an IPv6 tunnel 1281 supports IPv4 transit only if EHsize is 884 bytes or less. In this 1282 example, transit packets of up to (1240 - EHsize) can traverse the 1283 tunnel without ingress source fragmentation and egress reassembly. 1285 When using IP directly over IP, the minimum transit packet EMTU_R for 1286 IPv4 is 576 bytes and for IPv6 is 1500 bytes. This means that tunnels 1287 of IPv4-over-IPv4, IPv4-over-IPv6, and IPv6-over-IPv6 are possible 1288 without additional requirements, but this may involve ingress 1289 fragmentation and egress reassembly.
   IPv6 cannot be tunneled directly over IPv4 without additional
   requirements, notably that the egress EMTU_R is at least 1280 bytes.

   When ongoing ingress fragmentation and egress reassembly would be
   prohibitive or costly, larger MTUs can be supported by design and
   confirmed either out-of-band (by design) or in-band (e.g., using
   PLPMTUD [RFC4821], as done in SEAL [RFC5320] and AERO [Te16]).

4.2.3. Path MTU Discovery

   Path MTU discovery (PMTUD) enables a network path to support a larger
   PMTU than it can assume from the minimum requirements of the protocol
   over which it operates. Note, however, that PMTUD never discovers an
   EMTU_R that is larger than the required minimum; that information is
   available to some upper layer protocols, such as TCP [RFC1122], but
   cannot be determined at the IP layer.

   There is a temptation to optimize tunnel traversal so that packets
   are not fragmented between ingress and egress, i.e., to attempt to
   tune the network M PMTU to the tunnel atom size (i.e., the ingress
   EMTU_S minus encapsulation overhead) rather than the tunnel MTU, to
   avoid ingress fragmentation.

   This is often impossible because the ICMP "packet too big" message
   (IPv4 fragmentation needed [RFC792] or IPv6 packet too big [RFC4443])
   indicates the complete failure of a link to transit a packet, not a
   preference for a size that matches the internal mechanism of the
   link. ICMP messages are intended to indicate whether a tunnel MTU is
   insufficient; there is no ICMP message that can indicate when a
   transit packet is "too big for the tunnel path MTU, but not larger
   than the tunnel MTU". If there were, endpoints might receive that
   message for IP packets larger than 40 bytes (the payload of a single
   ATM cell, allowing for the 8-byte AAL5 trailer), but smaller than 9K
   (the ATM EMTU_R payload).
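
   As an illustration only, the three size regimes discussed above can
   be sketched in Python. The function name, the return strings, and the
   example values (a 1240-byte tunnel atom and a 1460-byte tunnel MTU,
   i.e., the IPv6 encapsulation example with EHsize = 0) are assumptions
   made for this sketch and are not part of any specification.

```python
def classify_transit_packet(tp_size, tun_atom, tun_mtu):
    """Return how a transit packet of tp_size bytes would be handled.

    tun_atom: largest transit packet the ingress can emit unfragmented
    tun_mtu:  largest transit packet the tunnel accepts at all
    Both parameter names are hypothetical labels for this sketch.
    """
    if tun_atom > tun_mtu:
        raise ValueError("tunnel atom cannot exceed the tunnel MTU")
    if tp_size > tun_mtu:
        # The only case that ICMP "packet too big" can report (the
        # attached node drops, or fragments on-path if IPv4 DF=0).
        return "exceeds tunnel MTU"
    if tp_size > tun_atom:
        # No ICMP message describes this case; the ingress fragments
        # and the egress reassembles.
        return "ingress fragmentation"
    return "unfragmented"


for size in (1000, 1300, 1500):
    print(size, classify_transit_packet(size, tun_atom=1240, tun_mtu=1460))
```

   Only the first branch corresponds to an existing ICMP signal; the
   middle branch is the case for which the text above notes that no
   message exists.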
   In addition, attempting to tune the network transit size to natively
   match that of the link internal transit can be hazardous for many
   reasons:

   o  The tunnel is capable of transiting packets as large as the
      network N EMTU_R - encapsulation, which is always at least as
      large as the tunnel MTU and typically is larger.

   o  ICMP has only one type of error message regarding large packets -
      "too big", i.e., too large to transit. There is no optimization
      message of "bigger than I'd like, but I can deal with it if
      needed".

   o  IP tunnels often involve some level of recursion, i.e.,
      encapsulation over itself [RFC4459].

   Tunnels that use IPv4 as the encapsulation layer SHOULD set DF=0, but
   this requires generating unique fragmentation ID values, which may
   limit throughput [RFC6864]. These tunnels might have difficulty
   assuming ingress EMTU_S values over 64 bytes, so it may not be
   feasible to assume that larger packets with DF=1 are safe.

   Recursive tunneling occurs whenever a protocol ends up encapsulated
   in itself. This happens directly, as when IPv4 is encapsulated in
   IPv4, or indirectly, as when IP is encapsulated in UDP which is then
   carried as a payload inside IP. It can involve many layers of
   encapsulation because a tunnel provider isn't always aware of whether
   the packets it transits are already tunneled.

   Recursion is impossible when the tunnel transit packets are limited
   to that of the native size of the ingress payload. Arriving tunnel
   transit packets have a minimum supported size (1280 for IPv6) and the
   tunnel PMFS has the same requirement; there would be no room for the
   tunnel's "link layer" headers, i.e., the encapsulation layer. The
   result would be an IPv6 tunnel that cannot satisfy IPv6 transit
   requirements.
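
   The arithmetic behind this argument can be sketched as follows. This
   is a minimal illustration, assuming a 40-byte IPv6 encapsulation
   header with no extension headers and the IPv6 minimum values quoted
   in this document (1280-byte link MTU, 1500-byte EMTU_R); the constant
   and function names are hypothetical.

```python
IPV6_MIN_MTU = 1280     # minimum MTU every IPv6 link must support
IPV6_MIN_EMTU_R = 1500  # minimum reassembly size of an IPv6 receiver
IPV6_HEADER = 40        # fixed IPv6 header, no extension headers (assumed)


def mtu_without_fragmentation(path_mtu, layers, overhead=IPV6_HEADER):
    """Largest transit packet if none of the nested tunnels fragments."""
    return path_mtu - layers * overhead


def mtu_with_reassembly(egress_emtu_r, overhead=IPV6_HEADER):
    """Tunnel MTU when the ingress fragments and the egress reassembles."""
    return egress_emtu_r - overhead


# Each unfragmented nesting level loses another header's worth of room.
for layers in (1, 2, 3):
    mtu = mtu_without_fragmentation(IPV6_MIN_MTU, layers)
    print(layers, mtu, mtu >= IPV6_MIN_MTU)

# With ingress fragmentation and egress reassembly there is headroom.
print(mtu_with_reassembly(IPV6_MIN_EMTU_R) >= IPV6_MIN_MTU)
```

   With these minimum values, even a single unfragmented layer leaves
   only 1240 bytes, below the 1280 bytes IPv6 requires, whereas a tunnel
   that fragments and reassembles can offer 1460 bytes.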
   It is more appropriate to require the tunnel to satisfy IP transit
   requirements and enforce that requirement at design time or during
   operation (the latter using PLPMTUD [RFC4821]). Conventional path MTU
   discovery (PMTUD) relies on existing endpoint ICMP processing of
   explicit negative feedback from routers along the path via "packet
   too big" ICMP packets in the reverse direction of the tunnel
   [RFC1191][RFC1981]. This technique is susceptible to the "black hole"
   phenomenon, in which the ICMP messages never return to the source due
   to policy-based filtering [RFC2923]. PLPMTUD requires a separate,
   direct control channel from the egress to the ingress that provides
   positive feedback; the direct channel is not blocked by policy
   filters and the positive feedback ensures fail-safe operation if
   feedback messages are lost [RFC4821].

4.3. Coordination Issues

   IP tunnels interact with link layer signals and capabilities in a
   variety of ways. The following subsections address some key issues of
   these interactions. In general, they are again informed by treating a
   tunnel as any other link layer and considering the interactions
   between the IP layer and link layers [RFC3819].

4.3.1. Signaling

   In the current Internet architecture, signaling goes upstream, either
   from routers along a path or from the destination, back toward the
   source. Such signals are typically contained in ICMP messages, but
   can involve other protocols such as RSVP, transport protocol signals
   (e.g., TCP RSTs), or multicast control or transport protocols.

   A tunnel behaves like a link and acts like a link interface at the
   nodes where it is attached. As such, it can provide information that
   enhances IP signaling (e.g., ICMP), but itself does not directly
   generate ICMP messages.

   For tunnels, this means that there are two separate signaling paths.
   The outer network M nodes can each signal the source of the tunnel
   transit packets, Hsrc (Figure 14). Inside the tunnel, the inner
   network N nodes can signal the source of the tunnel link packets, the
   ingress I (Figure 15).

       +--------+---------------------------+--------+
       |        |                           |        |
       v       --_                         --        v
   +------+   /  \                        /  \   +------+
   | Hsrc |--+ Ra +      --      --      + Rd +--| Hdst |
   +------+   \  //\    /  \    /  \    /\\  /   +------+
               --/I \--+ Rb +--+ Rc +--/E \--
                 \  /   \  /    \  /   \  /
                  \/     --      --     \/
                   <---- Network N ----->
   <-------------------- Network M --------------------->

                  Figure 14 Signals outside the tunnel

                   +-----+-------+------+
               --_ |     |       |      |  --
   +------+   /  \ v     |       |      | /  \   +------+
   | Hsrc |--+ Ra +      --      --      + Rd +--| Hdst |
   +------+   \  //\    /  \    /  \    /\\  /   +------+
               --/I \--+ Rb +--+ Rc +--/E \--
                 \  /   \  /    \  /   \  /
                  \/     --      --     \/
                   <----- Network N ---->
   <--------------------- Network M -------------------->

                   Figure 15 Signals inside the tunnel

   These two signal paths are inherently distinct except where
   information is exchanged between the network interface of the tunnel
   (the ingress) and its attached node (Ra, in both figures).

   It is always possible for a network interface to provide hints to its
   attached node (host or router), which can be used for optimization.
   In this case, when signals inside the tunnel indicate a change to the
   tunnel, the ingress (i.e., the tunnel network interface) can provide
   information to the router (Ra, in both figures), so that Ra can
   generate the appropriate signal in return to Hsrc. This relaying may
   be difficult, because signals inside the tunnel may not return enough
   information to the ingress to support direct relaying to Hsrc.

   In all cases, the tunnel ingress needs to determine how to relay the
   signals from inside the tunnel into signals back to the source.
   For some protocols this is either simple or impossible (such as for
   ICMP); for others, it can even be undefined (e.g., multicast). In
   some cases, the individual signals relayed from inside the tunnel may
   result in corresponding signals in the outside network, and in other
   cases they may just change the state of the tunnel interface. In the
   latter case, the result may cause the router Ra to generate new ICMP
   errors when later messages arrive from Hsrc or other sources in the
   outer network.

   The meaning of the relayed information must be carefully translated.
   An ICMP error within a tunnel indicates a failure of the path inside
   the tunnel to support an egress EMTU_S. It can be very difficult to
   convert that ICMP error into a corresponding ICMP message from the
   ingress node back to the transit packet source. The ICMP message may
   not contain enough of a packet prefix to extract the transit packet
   header sufficient to generate the appropriate ICMP message. The
   relationship between the egress EMTU_S and the transit packet may be
   indirect, e.g., the ingress node may be performing source
   fragmentation that should be adjusted instead of propagating the ICMP
   upstream.

   Some messages have detailed specifications for relaying between the
   tunnel link packet and transit packet, including Explicit Congestion
   Notification (ECN [RFC6040]) and multicast (e.g., IGMP).

4.3.2. Congestion

   Tunnels carrying IP traffic (i.e., the focus of this document) need
   not react directly to congestion any more than would any other link
   layer [RFC8085]. IP transit packet traffic is already expected to be
   congestion controlled.

   It is useful to relay network congestion notification between the
   tunnel link and the tunnel transit packets.
   Explicit congestion notification requires that ECN bits are copied
   from the tunnel transit packet to the tunnel link packet on
   encapsulation, as well as copied back at the egress based on a
   combination of the bits of the two headers [RFC6040]. This allows
   congestion notification within the tunnel to be interpreted as if it
   were on the direct path.

4.3.3. Multipoint Tunnels and Multicast

   Multipoint tunnels are tunnels with more than two ingress/egress
   endpoints. Just as tunnels emulate links, multipoint tunnels emulate
   multipoint links, and can support multicast as a tunnel capability.
   Multipoint tunnels can be useful on their own, or may be used as part
   of more complex systems, e.g., LISP and TRILL configurations
   [RFC6830][RFC6325].

   Multipoint tunnels require support for egress determination, just as
   multipoint links do. This function is typically supported by ARP
   [RFC826] or ARP emulation (e.g., LAN Emulation, known as LANE
   [RFC2225]) for multipoint links. For multipoint tunnels, a similar
   mechanism is required for the same purpose - to determine the egress
   address for proper ingress encapsulation (e.g., LISP Map-Service
   [RFC6833]).

   All multipoint systems - tunnels and links - might support different
   MTUs between each ingress/egress (or link entrance/exit) pair. In
   most cases, it is simpler to assume a uniform MTU throughout the
   multipoint system, e.g., the minimum MTU supported across all
   ingress/egress pairs. This applies to both the ingress EMTU_S and
   egress EMTU_R (the latter determining the tunnel MTU).

   A multipoint tunnel MUST have support for broadcast and multicast, in
   exactly the same way as this is already required for multipoint links
   [RFC3819].
   Both modes can be supported either by a native mechanism inside the
   tunnel or by emulation using serial replication at the tunnel ingress
   (e.g., AMT [RFC7450]), in the same way that links may provide the
   same support either natively (e.g., via promiscuous or automatic
   replication in the link itself) or by network interface emulation
   (e.g., as for non-broadcast multiaccess networks, i.e., NBMAs).

   IGMP snooping enables IP multicast to be coupled with native link
   layer multicast support [RFC4541]. A similar technique may be
   relevant to couple transit packet multicast to tunnel link packet
   multicast, but the coupling of the protocols may be more complex
   because many tunnel link protocols rely on their own network N
   multicast control protocol, e.g., via PIM-SM [RFC6807][RFC7761].

4.3.4. Load Balancing

   Load balancing can impact the way in which a tunnel operates. In
   particular, multipath routing inside the tunnel can cause some of the
   tunnel parameters to vary, both over time and for different transit
   packets. The use of multiple paths can be the result of MPLS link
   aggregation groups (LAGs), equal-cost multipath routing (ECMP
   [RFC2991]), or other load balancing mechanisms. In some cases, the
   tunnel exists as the mechanism to support ECMP, as for GRE in UDP
   [RFC8086].

   A tunnel may have multiple paths between the ingress and egress with
   different path MTU values, causing the ingress EMTU_S to vary
   [RFC7690]. Rather than track individual values, the EMTU_S can be set
   to the minimum of these different path MTU values.

   IPv6 packets include a flow label to enable multipath routing to keep
   packets of a single flow following the same path. It is helpful to
   preserve the semantics of that flow label as an aggregate identifier
   inside the encapsulated link packets of a tunnel.
   This is achieved by hashing the transit IP addresses and flow label
   to generate a new flow label for use between the ingress and egress
   addresses [RFC6438]. It is not useful to simply copy the flow label
   from the transit packet into the link packet because of collisions
   that might arise if a label is used for flows between different
   transit packet addresses that traverse the same tunnel.

4.3.5. Recursive Tunnels

   The rules described in this document already support tunnels over
   tunnels, sometimes known as "recursive" tunnels, in which IP is
   transited over IP either directly or via intermediate encapsulation
   (IP-UDP-IP, as in GUE [He16]).

   There are known hazards to recursive tunneling, notably that the
   independence of the tunnel transit header and tunnel link header hop
   counts can result in a tunneling loop. Such looping can be avoided
   when using direct encapsulation (IP in IP) by use of a header option
   to track the encapsulation count and to limit that count [RFC2473].
   This looping cannot be avoided when other protocols are used for
   tunneling, e.g., IP in UDP in IP, because the encapsulation count may
   not be visible where the recursion occurs.

5. Observations

   The following subsections summarize the observations of this document
   and the issues with existing tunnel protocol specifications. They
   also include advice for tunnel protocol designers, implementers, and
   operators.

5.1. Summary of Recommendations

   o  Tunnel endpoints are network interfaces, tunnels are virtual links

      o  ICMP messages MUST NOT be generated by the tunnel (as a link)

      o  ICMP messages received by the ingress inside link change the
         link properties (they do not generate transit-layer ICMP
         messages)

      o  Link headers (hop, ID, options) are largely independent of
         arriving ID (with few exceptions based on translation, not
         direct copying, e.g., ECN and IPv6 flow IDs)

   o  MTU values should treat the tunnel as any other link

      o  Require ingress source fragmentation and egress reassembly at
         the tunnel link packet layer

      o  The tunnel MTU is the tunnel egress EMTU_R less headers, and
         not related at all to the ingress-egress MFS

   o  Tunnels must obey core IP requirements

      o  Obey IPv4 DF=1 on arrival at a node (nodes MUST NOT fragment
         IPv4 packets where DF=1)

      o  Shut down an IP tunnel if the tunnel MTU falls below the
         required minimum

5.2. Impact on Existing Encapsulation Protocols

   Many existing and proposed encapsulation protocols are inconsistent
   with the guidelines of this document. The following list summarizes
   only those inconsistencies, but omits places where a protocol is
   inconsistent solely by reference to another protocol.

   [should this be inverted as a table of issues and a list of which
   RFCs have problems?]
   o  IP in IP / mobile IP [RFC2003][RFC4459] - IPv4 in IPv4

      o  Sets link DF when transit DF=1 (fails without PLPMTUD)

      o  Drops at egress if hopcount = 0 (host-host tunnels fail)

      o  Drops based on transit source (same as router IP, matches
         egress), i.e., performs routing functions it should not

      o  Ingress generates ICMP messages (based on relayed context),
         rather than using inner ICMP messages to set interface
         properties only

      o  Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU

   o  IPv6 tunnels [RFC2473] -- IPv6 or IPv4 in IPv6

      o  Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU

      o  Decrements transiting packet hopcount (by 1)

      o  Copies traffic class from tunnel link to tunnel transit header

      o  Ignores IPv4 DF=0 and fragments at that layer upon arrival

      o  Fails to retain soft ingress state based on inner ICMP messages
         affecting tunnel MTU

      o  Tunnel ingress issues ICMPs

      o  Fragments IPv4 over IPv6 only if IPv4 DF=0 (misinterpreting the
         "can fragment the IPv4 packet" as permission to fragment at the
         IPv6 link header)

   o  IPsec tunnel mode (IP in IPsec in IP) [RFC4301] -- IP in IPsec

      o  Uses security policy to set, clear, or copy DF (rather than
         generating it independently, which would also be more secure)

      o  Intertwines tunnel selection with security selection, rather
         than presenting tunnel as an interface and using existing
         forwarding (as with transport mode over IP-in-IP [RFC3884])

   o  GRE (IP in GRE in IP or IP in GRE in UDP in IP)
      [RFC2784][RFC7588][RFC7676][RFC8086]

      o  Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU

      o  Requires ingress to generate ICMP errors

      o  Copies IPv4 DF to outer IPv4 DF

      o  Violates IPv6 MTU requirements when using IPv6 encapsulation

   o  LISP [RFC6830]

      o  Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU

      o  Requires ingress to generate ICMP errors

      o  Copies inner hop limit to outer

   o  L2TP [RFC3931]

      o  Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU

      o  Requires ingress to generate ICMP errors

   o  PWE [RFC3985]

      o  Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU

      o  Requires ingress to generate ICMP errors

   o  GUE (Generic UDP encapsulation) [He16] - IP (et al.) in UDP in IP

      o  Allows inner encapsulation fragmentation

   o  Geneve [RFC7364][Gr16] - IP (et al.) in Geneve in UDP in IP

      o  Treats tunnel MTU as tunnel path MTU, not tunnel egress MTU

   o  SEAL/AERO [RFC5320][Te16] - IP in SEAL/AERO in IP

      o  Some issues with SEAL (MTU, ICMP), corrected in AERO

   o  RTG DT encapsulations [No16]

      o  Assumes fragmentation can be avoided completely

      o  Allows encapsulation protocols that lack fragmentation

      o  Relies on ICMP PTB to correct for tunnel path MTU

   o  No known issues

      o  L2VPN (framework for L2 virtualization) [RFC4664]

      o  L3VPN (framework for L3 virtualization) [RFC4176]

      o  MPLS (IP in MPLS) [RFC3031]

      o  TRILL (Ethernet in Ethernet) [RFC5556][RFC6325]

5.3. Tunnel Protocol Designers

   [To be completed]

   Recursive tunneling + minimum MTU = frag/reassembly is inevitable, at
   least to be able to split/join two fragments

   Account for egress MTU/path MTU differences.

   Include a stronger checksum.

   Ensure the egress MTU is always larger than the path MTU.

   Ensure that the egress reassembly can keep up with line rate OR
   design PLPMTUD into the tunneling protocol.

5.3.1. For Future Standards

   [To be completed]

   Larger IPv4 MTU (2K? or just 2x path MTU?) for reassembly

   Always include frag support for at least two frags; do NOT try to
   deprecate fragmentation.

   Limit encapsulation option use/space.

   Augment ICMP to have two separate messages: PTB vs
   P-bigger-than-optimal

   Include MTU as part of BGP as a hint - SB

   Hazards of multi-MTU draft-van-beijnum-multi-mtu-04

5.3.2. Diagnostics

   [To be completed]

   Some current implementations include diagnostics to support
   monitoring the impact of tunneling, especially the impact on
   fragmentation and reassembly resources, the status of path MTU
   discovery, etc.

   >> Because a tunnel ingress/egress is a network interface, it SHOULD
   have similar resources as any other network interface. This includes
   resources for packet processing as well as monitoring.

5.4. Tunnel Implementers

   [To be completed]

   Detect when the egress MTU is exceeded.

   Detect when the egress MTU drops below the required minimum and shut
   down the tunnel if that happens - configuring the tunnel down and
   issuing a hard error may be the only way to detect this anomaly, and
   it's sufficiently important that the tunnel SHOULD be disabled. This
   is always better than blindly assuming the tunnel has been deployed
   correctly, i.e., that the solution has been engineered.

   Do NOT decrement the TTL as part of being a tunnel. It's always
   already OK for a router to decrement the TTL based on different
   next-hop routers, but TTL is a property of a router not a link.

5.5. Tunnel Operators

   [To be completed]

   Keep the difference between "enforced by operators" vs. "enforced by
   active protocol mechanism" in mind. It's fine to assume something the
   tunnel cannot or does not test, as long as you KNOW you can assume
   it. When the assumption is wrong, it will NOT be signaled by the
   tunnel.
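
   One assumption that an operator might check explicitly, rather than
   rely on the tunnel to signal, is that the tunnel MTU meets the
   minimum needed by the transit protocol. The following Python sketch
   is illustrative only: the function and table names are hypothetical,
   and the minimum values (576 bytes for IPv4 transit, 1280 bytes for
   IPv6 transit) are taken as an assumption from the EHsize bounds given
   in the IPv6 encapsulation example earlier in this document.

```python
# Minimum tunnel MTU per transit protocol (assumed for this sketch from
# the EHsize bounds in the IPv6 encapsulation example).
REQUIRED_MIN_TUNNEL_MTU = {"ipv4": 576, "ipv6": 1280}


def check_tunnel(egress_emtu_r, encaps_overhead, transit):
    """Return (tunnel_mtu, keep_up) for a hypothetical tunnel config.

    tunnel_mtu is the egress EMTU_R less the encapsulation headers;
    keep_up is False when the tunnel should be configured down.
    """
    tunnel_mtu = egress_emtu_r - encaps_overhead
    keep_up = tunnel_mtu >= REQUIRED_MIN_TUNNEL_MTU[transit]
    return tunnel_mtu, keep_up


# IPv6 encapsulation (40-byte header) with 200 bytes of extension
# headers leaves too little room for IPv6 transit.
print(check_tunnel(1500, 40 + 200, "ipv6"))
```

   A check of this kind makes the assumption explicit at configuration
   time instead of leaving it to be discovered through packet loss.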

   Consider the circuit breakers doc to provide diagnostics and
   last-resort control to avoid overload for non-reactive traffic (see
   Gorry's RFC-to-be)

   Do NOT decrement the TTL as part of being a tunnel. It's always
   already OK for a router to decrement the TTL based on different
   next-hop routers, but TTL is a property of a router not a link.

   >>>> PLPMTUD can give multiple conflicting PMTU values during ECMP or
   LAG if PMTU is cached per endpoint pair rather than per flow -- but
   so can PMTUD! This is another reason why ICMP should never drive up
   the effective MTU (if aggregate, treat as the minimum of received
   messages over an interval).

6. Security Considerations

   Tunnels may introduce vulnerabilities or add to the potential for
   receiver overload and thus DoS attacks. These issues are primarily
   related to the fact that a tunnel is a link that traverses a network
   path and to fragmentation and reassembly. ICMP signal translation
   introduces a new security issue and must be done with care. ICMP
   generation at the router or host attached to a tunnel is already
   covered by existing requirements (e.g., should be throttled).

   Tunnels traverse multiple hops of a network path from ingress to
   egress. Traffic along such tunnels may be susceptible to on-path and
   off-path attacks, including fragment injection, reassembly buffer
   overload, and ICMP attacks. Some of these attacks may not be as
   visible to the endpoints of the architecture into which tunnels are
   deployed and these attacks may thus be more difficult to detect.

   Fragmentation at routers or hosts attached to tunnels may place an
   undue burden on receivers where traffic is not sufficiently diffuse,
   because tunnels may induce source fragmentation at hosts and path
   fragmentation (for IPv4 DF=0) more for tunnels than for other links.
   Care should be taken to avoid this situation, notably by ensuring
   that tunnel MTUs are not significantly different from other link
   MTUs.

   Tunnel ingresses emitting IP datagrams MUST obey all existing IP
   requirements, such as the uniqueness of the IP ID field. Failure to
   either limit encapsulation traffic, or use additional ingress/egress
   IP addresses, can result in high speed traffic fragments being
   incorrectly reassembled.

   Tunnels are susceptible to attacks at both the inner and outer
   network layers. The tunnel ingress/egress endpoints appear as network
   interfaces in the outer network, and are as susceptible as any other
   network interface. This includes vulnerability to fragmentation
   reassembly overload, traffic overload, and spoofed ICMP messages that
   misreport the state of those interfaces. Similarly, the
   ingress/egress appear as hosts to the path traversed by the tunnel,
   and thus are as susceptible as any other host to attacks as well.

   [management?]

   [Access control?]

   describe relationship to [RFC6169] - JT (as per INTAREA meeting
   notes, don't cover Teredo-specific issues in RFC6169, but include
   generic issues here)

7. IANA Considerations

   This document has no IANA considerations.

   The RFC Editor should remove this section prior to publication.

8. References

8.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels," BCP 14, RFC 2119, March 1997.

   [are there others? 3819? ECN? Flow label issues?]

8.2. Informative References

   [Cl88]    Clark, D., "The design philosophy of the DARPA internet
             protocols," Proc. Sigcomm 1988, p.106-114, 1988.

   [Er94]    Eriksson, H., "MBone: The Multicast Backbone,"
             Communications of the ACM, Aug. 1994, pp.54-60.

   [Gr16]    Gross, J. (Ed.), I. Ganga (Ed.), T.
             Sridhar (Ed.), "Geneve: Generic Network Virtualization
             Encapsulation," draft-ietf-nvo3-geneve-03, Sep. 2016.

   [He16]    Herbert, T., L. Yong, O. Zia, "Generic UDP Encapsulation,"
             draft-ietf-nvo3-gue-05, Oct. 2016.

   [Ke95]    Kent, S., J. Mogul, "Fragmentation considered harmful," ACM
             Sigcomm Computer Communication Review (CCR), V25 N1, Jan.
             1995, pp. 75-87.

   [No16]    Nordmark, E. (Ed.), A. Tian, J. Gross, J. Hudson, L.
             Kreeger, P. Garg, P. Thaler, T. Herbert, "Encapsulation
             Considerations," draft-ietf-rtgwg-dt-encap-02, Oct. 2016.

   [RFC5]    Rulifson, J., "Decode Encode Language (DEL)," RFC 5, June
             1969.

   [RFC768]  Postel, J., "User Datagram Protocol," RFC 768, Aug. 1980.

   [RFC791]  Postel, J., "Internet Protocol," RFC 791 / STD 5, September
             1981.

   [RFC792]  Postel, J., "Internet Control Message Protocol," RFC 792,
             Sep. 1981.

   [RFC793]  Postel, J., "Transmission Control Protocol," RFC 793, Sept.
             1981.

   [RFC826]  Plummer, D., "An Ethernet Address Resolution Protocol -- or
             -- Converting Network Protocol Addresses to 48.bit Ethernet
             Address for Transmission on Ethernet Hardware," RFC 826,
             Nov. 1982.

   [RFC1075] Waitzman, D., C. Partridge, S. Deering, "Distance Vector
             Multicast Routing Protocol," RFC 1075, Nov. 1988.

   [RFC1122] Braden, R., Ed., "Requirements for Internet Hosts -
             Communication Layers," RFC 1122 / STD 3, October 1989.

   [RFC1191] Mogul, J., S. Deering, "Path MTU discovery," RFC 1191,
             November 1990.

   [RFC1812] Baker, F., "Requirements for IP Version 4 Routers," RFC
             1812, June 1995.

   [RFC1853] Simpson, W., "IP in IP Tunneling," RFC 1853, Oct. 1995.

   [RFC1981] McCann, J., S. Deering, J. Mogul, "Path MTU Discovery for
             IP version 6," RFC 1981, Aug. 1996.

   [RFC2003] Perkins, C., "IP Encapsulation within IP," RFC 2003, Oct.
             1996.

   [RFC2225] Laubach, M., J. Halpern, "Classical IP and ARP over ATM,"
             RFC 2225, Apr.
             1998.

   [RFC2460] Deering, S., R. Hinden, "Internet Protocol, Version 6
             (IPv6) Specification," RFC 2460, Dec. 1998.

   [RFC2473] Conta, A., S. Deering, "Generic Packet Tunneling in IPv6
             Specification," RFC 2473, Dec. 1998.

   [RFC2546] Durand, A., B. Buclin, "6bone Routing Practice," RFC 2546,
             Mar. 1999.

   [RFC2784] Farinacci, D., T. Li, S. Hanks, D. Meyer, P. Traina,
             "Generic Routing Encapsulation (GRE)," RFC 2784, March
             2000.

   [RFC2923] Lahey, K., "TCP Problems with Path MTU Discovery," RFC
             2923, September 2000.

   [RFC2983] Black, D., "Differentiated Services and Tunnels," RFC 2983,
             Oct. 2000.

   [RFC2991] Thaler, D., C. Hopps, "Multipath Issues in Unicast and
             Multicast Next-Hop Selection," RFC 2991, Nov. 2000.

   [RFC3031] Rosen, E., A. Viswanathan, R. Callon, "Multiprotocol Label
             Switching Architecture," RFC 3031, January 2001.

   [RFC3819] Karn, P., Ed., C. Bormann, G. Fairhurst, D. Grossman, R.
             Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood,
             "Advice for Internet Subnetwork Designers," RFC 3819 / BCP
             89, July 2004.

   [RFC3884] Touch, J., L. Eggert, Y. Wang, "Use of IPsec Transport Mode
             for Dynamic Routing," RFC 3884, September 2004.

   [RFC3931] Lau, J., Ed., M. Townsley, Ed., I. Goyret, Ed., "Layer Two
             Tunneling Protocol - Version 3 (L2TPv3)," RFC 3931, March
             2005.

   [RFC3985] Bryant, S., P. Pate (Eds.), "Pseudo Wire Emulation Edge-to-
             Edge (PWE3) Architecture," RFC 3985, March 2005.

   [RFC4176] El Mghazli, Y., Ed., T. Nadeau, M. Boucadair, K. Chan, A.
             Gonguet, "Framework for Layer 3 Virtual Private Networks
             (L3VPN) Operations and Management," RFC 4176, October 2005.

   [RFC4301] Kent, S., and K. Seo, "Security Architecture for the
             Internet Protocol," RFC 4301, December 2005.

   [RFC4340] Kohler, E., M. Handley, S.
             Floyd, "Datagram Congestion Control Protocol (DCCP)," RFC
             4340, Mar. 2006.

   [RFC4443] Conta, A., S. Deering, M. Gupta (Ed.), "Internet Control
             Message Protocol (ICMPv6) for the Internet Protocol Version
             6 (IPv6) Specification," RFC 4443, Mar. 2006.

   [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the-
             Network Tunneling," RFC 4459, April 2006.

   [RFC4541] Christensen, M., K. Kimball, F. Solensky, "Considerations
             for Internet Group Management Protocol (IGMP) and Multicast
             Listener Discovery (MLD) Snooping Switches," RFC 4541, May
             2006.

   [RFC4664] Andersson, L., Ed., E. Rosen, Ed., "Framework for Layer 2
             Virtual Private Networks (L2VPNs)," RFC 4664, September
             2006.

   [RFC4821] Mathis, M., J. Heffner, "Packetization Layer Path MTU
             Discovery," RFC 4821, March 2007.

   [RFC4861] Narten, T., E. Nordmark, W. Simpson, H. Soliman, "Neighbor
             Discovery for IP version 6 (IPv6)," RFC 4861, Sept. 2007.

   [RFC4960] Stewart, R. (Ed.), "Stream Control Transmission Protocol,"
             RFC 4960, Sep. 2007.

   [RFC4963] Heffner, J., M. Mathis, B. Chandler, "IPv4 Reassembly
             Errors at High Data Rates," RFC 4963, July 2007.

   [RFC5320] Templin, F., Ed., "The Subnetwork Encapsulation and
             Adaptation Layer (SEAL)," RFC 5320, Feb. 2010.

   [RFC5556] Touch, J., R. Perlman, "Transparently Interconnecting Lots
             of Links (TRILL): Problem and Applicability Statement," RFC
             5556, May 2009.

   [RFC5944] Perkins, C., Ed., "IP Mobility Support for IPv4, Revised,"
             RFC 5944, Nov. 2010.

   [RFC6040] Briscoe, B., "Tunneling of Explicit Congestion
             Notification," RFC 6040, Nov. 2010.

   [RFC6169] Krishnan, S., D. Thaler, J. Hoagland, "Security Concerns
             With IP Tunneling," RFC 6169, Apr. 2011.

   [RFC6325] Perlman, R., D. Eastlake, D. Dutt, S. Gai, A. Ghanwani,
             "Routing Bridges (RBridges): Base Protocol Specification,"
             RFC 6325, July 2011.

   [RFC6434] Jankiewicz, E., J. Loughney, T. Narten, "IPv6 Node
             Requirements," RFC 6434, Dec. 2011.

   [RFC6438] Carpenter, B., S. Amante, "Using the IPv6 Flow Label for
             Equal Cost Multipath Routing and Link Aggregation in
             Tunnels," RFC 6438, Nov. 2011.

   [RFC6807] Farinacci, D., G. Shepherd, S. Venaas, Y. Cai, "Population
             Count Extensions to Protocol Independent Multicast (PIM),"
             RFC 6807, Dec. 2012.

   [RFC6830] Farinacci, D., V. Fuller, D. Meyer, D. Lewis, "The
             Locator/ID Separation Protocol," RFC 6830, Jan. 2013.

   [RFC6833] Fuller, V., D. Farinacci, "Locator/ID Separation Protocol
             (LISP) Map-Server Interface," RFC 6833, Jan. 2013.

   [RFC6864] Touch, J., "Updated Specification of the IPv4 ID Field,"
             RFC 6864, Feb. 2013.

   [RFC6935] Eubanks, M., P. Chimento, M. Westerlund, "IPv6 and UDP
             Checksums for Tunneled Packets," RFC 6935, Apr. 2013.

   [RFC6936] Fairhurst, G., M. Westerlund, "Applicability Statement for
             the Use of IPv6 UDP Datagrams with Zero Checksums," RFC
             6936, Apr. 2013.

   [RFC6946] Gont, F., "Processing of IPv6 "Atomic" Fragments," RFC
             6946, May 2013.

   [RFC7364] Narten, T., E. Gray, D. Black, L. Fang, L. Kreeger, M.
             Napierala, "Problem Statement: Overlays for Network
             Virtualization," RFC 7364, Oct. 2014.

   [RFC7450] Bumgardner, G., "Automatic Multicast Tunneling," RFC 7450,
             Feb. 2015.

   [RFC7510] Xu, X., N. Sheth, L. Yong, R. Callon, D. Black,
             "Encapsulating MPLS in UDP," RFC 7510, April 2015.

   [RFC7588] Bonica, R., C. Pignataro, J. Touch, "A Widely-Deployed
             Solution to the Generic Routing Encapsulation Fragmentation
             Problem," RFC 7588, July 2015.

   [RFC7676] Pignataro, C., R. Bonica, S. Krishnan, "IPv6 Support for
             Generic Routing Encapsulation (GRE)," RFC 7676, Oct. 2015.

   [RFC7690] Byerly, M., M. Hite, J.
             Jaeggli, "Close Encounters of the
             ICMP Type 2 Kind (Near Misses with ICMPv6 Packet Too Big
             (PTB))," RFC 7690, Jan. 2016.

   [RFC7761] Fenner, B., M. Handley, H. Holbrook, I. Kouvelas, R.
             Parekh, Z. Zhang, L. Zheng, "Protocol Independent Multicast
             - Sparse Mode (PIM-SM): Protocol Specification (Revised),"
             RFC 7761, Mar. 2016.

   [RFC8085] Eggert, L., G. Fairhurst, G. Shepherd, "UDP Usage
             Guidelines," RFC 8085, Mar. 2017.

   [RFC8086] Yong, L. (Ed.), E. Crabbe, X. Xu, T. Herbert, "GRE-in-UDP
             Encapsulation," RFC 8086, Feb. 2017.

   [Sa84]    Saltzer, J., D. Reed, D. Clark, "End-to-end arguments in
             system design," ACM Trans. on Computer Systems, Nov. 1984.

   [Te16]    Templin, F., "Asymmetric Extended Route Optimization,"
             draft-templin-aerolink-74, Nov. 2016.

   [To01]    Touch, J., "Dynamic Internet Overlay Deployment and
             Management Using the X-Bone," Computer Networks, July 2001,
             pp. 117-135.

   [To03]    Touch, J., Y. Wang, L. Eggert, G. Finn, "Virtual Internet
             Architecture," USC/ISI Tech. Report ISI-TR-570, Aug. 2003.

   [To16]    Touch, J., "Middlebox Models Compatible with the
             Internet," USC/ISI Tech. Report ISI-TR-711, Oct. 2016.

   [To98]    Touch, J., S. Hotz, "The X-Bone," Proc. Globecom Third
             Global Internet Mini-Conference, Nov. 1998.

   [Zi80]    Zimmermann, H., "OSI Reference Model - The ISO Model of
             Architecture for Open Systems Interconnection," IEEE Trans.
             on Comm., Apr. 1980.

9. Acknowledgments

   This document originated as the result of numerous discussions among
   the authors, Jari Arkko, Stuart Bryant, Lars Eggert, Ted Faber, Gorry
   Fairhurst, Dino Farinacci, Matt Mathis, and Fred Templin. It
   benefitted substantially from detailed feedback from Toerless Eckert,
   Vincent Roca, and Lucy Yong, as well as other members of the Internet
   Area Working Group.

   This work is partly supported by USC/ISI's Postel Center.

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Joe Touch
   USC/ISI
   4676 Admiralty Way
   Marina del Rey, CA 90292-6695
   U.S.A.

   Phone: +1 (310) 448-9151
   Email: touch@isi.edu

   W. Mark Townsley
   Cisco
   L'Atlantis, 11, Rue Camille Desmoulins
   Issy Les Moulineaux, ILE DE FRANCE 92782

   Email: townsley@cisco.com

APPENDIX A: Fragmentation efficiency

A.1. Selecting fragment sizes

   There are different ways to fragment a packet. Consider a network
   with a PMTU as shown in Figure 16, where packets are encapsulated
   over the same network layer as they arrive on (e.g., IP in IP). If a
   packet as large as the PMTU arrives, it must be fragmented to
   accommodate the additional header.

        X===========================X (transit PMTU)
        +----+----------------------+
        | iH | DDDDDDDDDDDDDDDDDDDD |
        +----+----------------------+
          |
          |  X===========================X (tunnel 1 MTU)
          |  +---+----+------------------+
      (a) +->| H'| iH | DDDDDDDDDDDDDDDD |
          |  +---+----+------------------+
          |      |
          |      |  X===========================X (tunnel 2 MTU)
          |      |  +----+---+----+-------------+
          | (a1) +->| nH'| H | iH | DDDDDDDDDDD |
          |      |  +----+---+----+-------------+
          |      |
          |      |  +----+-------+
          | (a2) +->| nH"| DDDDD |
          |         +----+-------+
          |
          |  +---+------+
      (b) +->| H"| DDDD |
             +---+------+
                 |
                 |  +----+---+------+
            (b1) +->| nH'| H"| DDDD |
                    +----+---+------+

               Figure 16 Fragmenting via maximum fit

   Figure 16 shows this process using "maximum fit", assuming outer
   fragmentation as an example (the situation is the same for inner
   fragmentation, but the headers that are affected differ). In maximum
   fit, the arriving packet is split into (a) and (b), where (a) is the
   size of the first tunnel, i.e., the tunnel 1 MTU (the maximum that
   fits over the first tunnel).
   However, this tunnel then traverses another tunnel (number 2),
   whose impact the first tunnel ingress has not accommodated. The
   packet (a) arrives at the second tunnel ingress and needs to be
   encapsulated again, but it must also be fragmented, into (a1) and
   (a2), to fit into the tunnel 2 MTU. Packet (b), in contrast, arrives
   at the second tunnel ingress and is encapsulated into (b1) without
   fragmentation, because it is already below the tunnel 2 MTU size.

   In Figure 17, the fragmentation is done using "even split", i.e., by
   splitting the original packet into two roughly equal-sized
   components, (c) and (d). Note that (d) contains more packet data
   than (c), because (c) also carries the original packet header (iH);
   this is an example of outer fragmentation. The packets (c) and (d)
   arrive at the second tunnel encapsulator and are encapsulated again;
   this time, neither packet exceeds the tunnel 2 MTU, and neither
   requires further fragmentation.

        X===========================X (transit PMTU)
        +----+----------------------+
        | iH | DDDDDDDDDDDDDDDDDDDD |
        +----+----------------------+
          |
          |  X===========================X (tunnel 1 MTU)
          |  +---+----+----------+
      (c) +->| H'| iH | DDDDDDDD |
          |  +---+----+----------+
          |      |
          |      |  X===========================X (tunnel 2 MTU)
          |      |  +----+---+----+----------+
          | (c1) +->| nH | H'| iH | DDDDDDDD |
          |         +----+---+----+----------+
          |
          |  +---+--------------+
      (d) +->| H"| DDDDDDDDDDDD |
             +---+--------------+
                 |
                 |  +----+---+--------------+
            (d1) +->| nH | H"| DDDDDDDDDDDD |
                    +----+---+--------------+

               Figure 17 Fragmenting via "even split"

A.2. Packing

   Encapsulating individual packets to traverse a tunnel can be
   inefficient, especially where headers are large relative to the
   packets being carried.
   In that case, it can be more efficient to encapsulate many small
   packets in a single, larger tunnel payload.

   This technique, similar to the effect of packet bursting in Gigabit
   Ethernet (regardless of whether the packets are delineated using L2
   symbols), reduces the overhead of the encapsulation headers
   (Figure 18). It reduces the work of header addition and removal at
   the tunnel endpoints, but increases other work involving the packing
   and unpacking of the component packets carried.

        +-----+-----+
        | iHa | iDa |
        +-----+-----+
              |
              |     +-----+-----+
              |     | iHb | iDb |
              |     +-----+-----+
              |           |
              |           |     +-----+-----+
              |           |     | iHc | iDc |
              |           |     +-----+-----+
              |           |           |
              v           v           v
   +----+-----+-----+-----+-----+-----+-----+
   | oH | iHa | iDa | iHb | iDb | iHc | iDc |
   +----+-----+-----+-----+-----+-----+-----+

               Figure 18 Packing packets into a tunnel
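
   The header savings from packing can be illustrated with a minimal
   sketch. It is not part of any tunnel specification: the function
   names (pack, header_bytes_saved), the greedy grouping policy, and
   the parameters tunnel_mtu and outer_hdr_len are assumptions made for
   illustration only. The sketch also assumes that each inner packet
   carries its own length (e.g., in its inner header), so no extra
   delineator is modeled.

```python
from typing import List


def pack(packets: List[bytes], tunnel_mtu: int,
         outer_hdr_len: int) -> List[List[bytes]]:
    """Greedily group inner packets so that each group, plus one outer
    header (oH), fits within the tunnel MTU. Hypothetical helper."""
    capacity = tunnel_mtu - outer_hdr_len  # room left for inner packets
    groups: List[List[bytes]] = []
    current: List[bytes] = []
    used = 0
    for pkt in packets:
        if len(pkt) > capacity:
            # Such a packet would need fragmentation, not packing.
            raise ValueError("inner packet exceeds tunnel payload size")
        if used + len(pkt) > capacity:
            # Current tunnel payload is full; start a new one.
            groups.append(current)
            current, used = [], 0
        current.append(pkt)
        used += len(pkt)
    if current:
        groups.append(current)
    return groups


def header_bytes_saved(packets: List[bytes], tunnel_mtu: int,
                       outer_hdr_len: int) -> int:
    """Outer-header bytes avoided relative to encapsulating each inner
    packet separately (one oH per inner packet)."""
    packed = pack(packets, tunnel_mtu, outer_hdr_len)
    return (len(packets) - len(packed)) * outer_hdr_len
```

   For example, three 100-byte inner packets sent over a tunnel with a
   1500-byte MTU and a 20-byte outer header fit in one tunnel payload,
   so two of the three outer headers (40 bytes) are avoided. The cost,
   as noted above, is the added work of packing and unpacking at the
   tunnel endpoints.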