Network Working Group                                      J. Gross, Ed.
Internet-Draft
Intended status: Standards Track                           I. Ganga, Ed.
Expires: January 3, 2019                                           Intel
                                                         T. Sridhar, Ed.
                                                                  VMware
                                                           July 02, 2018

          Geneve: Generic Network Virtualization Encapsulation
                       draft-ietf-nvo3-geneve-07

Abstract

   Network virtualization involves the cooperation of devices with a
   wide variety of capabilities such as software and hardware tunnel
   endpoints, transit fabrics, and centralized control clusters.  As a
   result of their role in tying together different elements in the
   system, the requirements on tunnels are influenced by all of these
   components.  Flexibility is therefore the most important aspect of a
   tunnel protocol if it is to keep pace with the evolution of the
   system.  This draft describes Geneve, a protocol designed to
   recognize and accommodate these changing capabilities and needs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  Terminology
   2.  Design Requirements
     2.1.  Control Plane Independence
     2.2.  Data Plane Extensibility
       2.2.1.  Efficient Implementation
     2.3.  Use of Standard IP Fabrics
   3.  Geneve Encapsulation Details
     3.1.  Geneve Packet Format Over IPv4
     3.2.  Geneve Packet Format Over IPv6
     3.3.  UDP Header
     3.4.  Tunnel Header Fields
     3.5.  Tunnel Options
       3.5.1.  Options Processing
   4.  Implementation and Deployment Considerations
     4.1.  Encapsulation of Geneve in IP
       4.1.1.  IP Fragmentation
       4.1.2.  DSCP and ECN
       4.1.3.  Broadcast and Multicast
       4.1.4.  Unidirectional Tunnels
     4.2.  Constraints on Protocol Features
       4.2.1.  Constraints on Options
     4.3.  NIC Offloads
     4.4.  Inner VLAN Handling
   5.  Interoperability Issues
   6.  Security Considerations
     6.1.  Data Confidentiality
       6.1.1.  Inter-data center traffic
     6.2.  Data Integrity
     6.3.  Authentication of NVE peers
     6.4.  Multicast/Broadcast
     6.5.  Control plane communications
   7.  IANA Considerations
   8.  Contributors
   9.  Acknowledgements
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Networking has long featured a variety of tunneling, tagging, and
   other encapsulation mechanisms.
   However, the advent of network virtualization has caused a surge of
   renewed interest and a corresponding increase in the introduction of
   new protocols.  The large number of protocols in this space, ranging
   all the way from VLANs [IEEE.802.1Q_2014] and MPLS [RFC3031] through
   the more recent VXLAN [RFC7348] and NVGRE [RFC7637], often leads to
   questions about the need for new encapsulation formats and what it is
   about network virtualization in particular that leads to their
   proliferation.

   While many encapsulation protocols seek to simply partition the
   underlay network or bridge between two domains, network
   virtualization views the transit network as providing connectivity
   between multiple components of a distributed system.  In many ways
   this system is similar to a chassis switch with the IP underlay
   network playing the role of the backplane and tunnel endpoints on the
   edge as line cards.  When viewed in this light, the requirements
   placed on the tunnel protocol are significantly different in terms of
   the quantity of metadata necessary and the role of transit nodes.

   Current work such as VL2 [VL2] and the NVO3 working group
   [I-D.ietf-nvo3-dataplane-requirements] have described some of the
   properties that the data plane must have to support network
   virtualization.  However, one additional defining requirement is the
   need to carry system state along with the packet data.  The use of
   some metadata is certainly not a foreign concept - nearly all
   protocols used for virtualization have at least 24 bits of identifier
   space as a way to partition between tenants.  This is often described
   as overcoming the limits of 12-bit VLANs, and when seen in that
   context, or any context where it is a true tenant identifier, 16
   million possible entries is a large number.  However, the reality is
   that the metadata is not exclusively used to identify tenants and
   encoding other information quickly starts to crowd the space.  In
   fact, when compared to the tags used to exchange metadata between
   line cards on a chassis switch, 24-bit identifiers start to look
   quite small.  There are nearly endless uses for this metadata,
   ranging from storing input ports for simple security policies to
   service based context for interposing advanced middleboxes.

   Existing tunnel protocols have each attempted to solve different
   aspects of these new requirements, only to be quickly rendered out of
   date by changing control plane implementations and advancements.
   Furthermore, software and hardware components and controllers all
   have different advantages and rates of evolution - a fact that should
   be viewed as a benefit, not a liability or limitation.  This draft
   describes Geneve, a protocol which seeks to avoid these problems by
   providing a framework for tunneling for network virtualization rather
   than being prescriptive about the entire system.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.

1.2.  Terminology

   The NVO3 framework [RFC7365] defines many of the concepts commonly
   used in network virtualization.  In addition, the following terms are
   specifically meaningful in this document:

   Checksum offload.  An optimization implemented by many NICs which
   enables computation and verification of upper layer protocol
   checksums in hardware on transmit and receive, respectively.
   This typically includes IP and TCP/UDP checksums which would
   otherwise be computed by the protocol stack in software.

   Clos network.  A technique for composing network fabrics larger than
   a single switch while maintaining non-blocking bandwidth across
   connection points.  ECMP is used to divide traffic across the
   multiple links and switches that constitute the fabric.  Sometimes
   termed "leaf and spine" or "fat tree" topologies.

   ECMP.  Equal Cost Multipath.  A routing mechanism for selecting from
   among multiple best next hop paths by hashing packet headers in order
   to better utilize network bandwidth while avoiding reordering a
   single stream.

   Geneve.  Generic Network Virtualization Encapsulation.  The tunnel
   protocol described in this draft.

   LRO.  Large Receive Offload.  The receive-side equivalent function of
   LSO, in which multiple protocol segments (primarily TCP) are
   coalesced into larger data units.

   NIC.  Network Interface Card.  A NIC could be part of a tunnel
   endpoint or transit device and can either process Geneve packets or
   aid in the processing of Geneve packets.

   OAM.  Operations, Administration, and Management.  A suite of tools
   used to monitor and troubleshoot network problems.

   Transit device.  A forwarding element along the path of the tunnel
   making up part of the Underlay Network.  A transit device MAY be
   capable of understanding the Geneve packet format but does not
   originate or terminate Geneve packets.

   LSO.  Large Segmentation Offload.  A function provided by many
   commercial NICs that allows data units larger than the MTU to be
   passed to the NIC to improve performance, the NIC being responsible
   for creating smaller segments of size less than or equal to the MTU
   with correct protocol headers.  When referring specifically to TCP/
   IP, this feature is often known as TSO (TCP Segmentation Offload).

   Tunnel endpoint.  A component performing encapsulation and
   decapsulation of packets, such as Ethernet frames or IP datagrams, in
   Geneve headers.  As the ultimate consumer of any tunnel metadata,
   endpoints have the highest level of requirements for parsing and
   interpreting tunnel headers.  Tunnel endpoints may consist of either
   software or hardware implementations or a combination of the two.
   Endpoints are frequently a component of an NVE but may also be found
   in middleboxes or other elements making up an NVO3 Network.

   VM.  Virtual Machine.

2.  Design Requirements

   Geneve is designed to support network virtualization use cases, where
   tunnels are typically established to act as a backplane between the
   virtual switches residing in hypervisors, physical switches, or
   middleboxes or other appliances.  An arbitrary IP network can be used
   as an underlay although Clos networks composed using ECMP links are a
   common choice to provide consistent bisectional bandwidth across all
   connection points.  Figure 1 shows an example of a hypervisor, top of
   rack switch for connectivity to physical servers, and a WAN uplink
   connected using Geneve tunnels over a simplified Clos network.  These
   tunnels are used to encapsulate and forward frames from the attached
   components such as VMs or physical links.

     +---------------------+           +-------+  +------+
     | +--+  +-------+---+ |           |Transit|--|Top of|==Physical
     | |VM|--|       |   | | +------+ /|Router |  | Rack |==Servers
     | +--+  |Virtual|NIC|---|Top of|/ +-------+\/+------+
     | +--+  |Switch |   | | | Rack |\ +-------+/\+------+
     | |VM|--|       |   | | +------+ \|Transit|  |Uplink|   WAN
     | +--+  +-------+---+ |           |Router |--|      |=========>
     +---------------------+           +-------+  +------+
            Hypervisor

                 ()===================================()
                         Switch-Switch Geneve Tunnels

                   Figure 1: Sample Geneve Deployment

   To support the needs of network virtualization, the tunnel protocol
   should be able to take advantage of the differing (and evolving)
   capabilities of each type of device in both the underlay and overlay
   networks.  This results in the following requirements being placed on
   the data plane tunneling protocol:

   o  The data plane is generic and extensible enough to support current
      and future control planes.

   o  Tunnel components are efficiently implementable in both hardware
      and software without restricting capabilities to the lowest common
      denominator.

   o  High performance over existing IP fabrics.

   These requirements are described further in the following
   subsections.

2.1.  Control Plane Independence

   Although some protocols for network virtualization have included a
   control plane as part of the tunnel format specification (most
   notably, the original VXLAN spec prescribed a multicast learning-
   based control plane), these specifications have largely been treated
   as describing only the data format.  The VXLAN packet format has
   actually seen a wide variety of control planes built on top of it.

   There is a clear advantage in settling on a data format: most of the
   protocols are only superficially different and there is little
   advantage in duplicating effort.  However, the same cannot be said of
   control planes, which are diverse in very fundamental ways.  The case
   for standardization is also less clear given the wide variety in
   requirements, goals, and deployment scenarios.

   As a result of this reality, Geneve aims to be a pure tunnel format
   specification that is capable of fulfilling the needs of many control
   planes by explicitly not selecting any one of them.  This
   simultaneously promotes a shared data format and increases the
   chances that it will not be obsoleted by future control plane
   enhancements.

2.2.  Data Plane Extensibility

   Achieving the level of flexibility needed to support current and
   future control planes effectively requires an options infrastructure
   to allow new metadata types to be defined, deployed, and either
   finalized or retired.  Options also allow for differentiation of
   products by encouraging independent development in each vendor's core
   specialty, leading to an overall faster pace of advancement.  By far
   the most common mechanism for implementing options is Type-Length-
   Value (TLV) format.

   It should be noted that while options can be used to support non-
   wirespeed control packets, they are equally important on data packets
   as well to segregate and direct forwarding (for instance, the
   examples given before of input port based security policies and
   service interposition both require tags to be placed on data
   packets).  Therefore, while it would be desirable to limit the
   extensibility to only control packets for the purposes of simplifying
   the datapath, that would not satisfy the design requirements.

2.2.1.  Efficient Implementation

   There is often a conflict between software flexibility and hardware
   performance that is difficult to resolve.  For a given set of
   functionality, it is obviously desirable to maximize performance.
   However, that does not mean new features that cannot be run at that
   speed today should be disallowed.  Therefore, for a protocol to be
   efficiently implementable means that a set of common capabilities can
   be reasonably handled across platforms along with a graceful
   mechanism to handle more advanced features in the appropriate
   situations.

   The use of a variable length header and options in a protocol often
   raises questions about whether it is truly efficiently implementable
   in hardware.  To answer this question in the context of Geneve, it is
   important to first divide "hardware" into two categories: tunnel
   endpoints and transit devices.

   Endpoints must be able to parse the variable header, including any
   options, and take action.  Since these devices are actively
   participating in the protocol, they are the most affected by Geneve.

   However, as endpoints are the ultimate consumers of the data,
   transmitters can tailor their output to the capabilities of the
   recipient.  As new functionality becomes sufficiently well defined to
   add to endpoints, supporting options can be designed using ordering
   restrictions and other techniques to ease parsing.

   Transit devices MAY be able to interpret the options; however, as
   non-terminating devices, transit devices do not originate or
   terminate the Geneve packet and hence MUST NOT insert or delete
   options, which is the responsibility of Geneve endpoints.  The
   participation of transit devices in interpreting options is OPTIONAL.

   Further, either tunnel endpoints or transit devices MAY use offload
   capabilities of NICs such as checksum offload to improve the
   performance of Geneve packet processing.  The presence of a Geneve
   variable length header SHOULD NOT prevent the tunnel endpoints and
   transit devices from using such offload capabilities.

2.3.  Use of Standard IP Fabrics

   IP has clearly cemented its place as the dominant transport mechanism
   and many techniques have evolved over time to make it robust,
   efficient, and inexpensive.  As a result, it is natural to use IP
   fabrics as a transit network for Geneve.  Fortunately, the use of IP
   encapsulation and addressing is enough to achieve the primary goal of
   delivering packets to the correct point in the network through
   standard switching and routing.

   In addition, nearly all underlay fabrics are designed to exploit
   parallelism in traffic to spread load across multiple links without
   introducing reordering in individual flows.  These equal cost
   multipathing (ECMP) techniques typically involve parsing and hashing
   the addresses and port numbers from the packet to select an outgoing
   link.  However, the use of tunnels often results in poor ECMP
   performance without additional knowledge of the protocol as the
   encapsulated traffic is hidden from the fabric by design and only
   endpoint addresses are available for hashing.

   Since it is desirable for Geneve to perform well on these existing
   fabrics, it is necessary for entropy from encapsulated packets to be
   exposed in the tunnel header.  The most common technique for this is
   to use the UDP source port, which is discussed further in
   Section 3.3.

3.  Geneve Encapsulation Details

   The Geneve packet format consists of a compact tunnel header
   encapsulated in UDP over either IPv4 or IPv6.  A small fixed tunnel
   header provides control information plus a base level of
   functionality and interoperability with a focus on simplicity.  This
   header is then followed by a set of variable options to allow for
   future innovation.  Finally, the payload consists of a protocol data
   unit of the indicated type, such as an Ethernet frame.
   Section 3.1 and Section 3.2 illustrate the Geneve packet format
   transported (for example) over Ethernet along with an Ethernet
   payload.

3.1.  Geneve Packet Format Over IPv4

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Outer Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Outer Destination MAC Address |   Outer Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Outer Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype=0x0800        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |Protocol=17 UDP|         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Outer Source IPv4 Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Outer Destination IPv4 Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer UDP Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Source Port = xxxx      |       Dest Port = 6081        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           UDP Length          |        UDP Checksum           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Geneve Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Virtual Network Identifier (VNI)       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Variable Length Options                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner Ethernet Header (example payload):
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Inner Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Inner Destination MAC Address |   Inner Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Inner Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Inner VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Payload:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Ethertype of Original Payload |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                   Original Ethernet Payload                   |
   |                                                               |
   | (Note that the original Ethernet Frame's FCS is not included) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Frame Check Sequence:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    New FCS (Frame Check Sequence) for Outer Ethernet Frame    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.2.  Geneve Packet Format Over IPv6

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Outer Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Outer Destination MAC Address |   Outer Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Outer Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype=0x86DD        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer IPv6 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version| Traffic Class |           Flow Label                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Payload Length        | NxtHdr=17 UDP |   Hop Limit   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                   Outer Source IPv6 Address                   +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                                                               |
   +                Outer Destination IPv6 Address                 +
   |                                                               |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer UDP Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Source Port = xxxx      |       Dest Port = 6081        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           UDP Length          |        UDP Checksum           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Geneve Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Virtual Network Identifier (VNI)       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Variable Length Options                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner Ethernet Header (example payload):
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 Inner Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Inner Destination MAC Address |   Inner Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Inner Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Inner VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Payload:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Ethertype of Original Payload |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                   Original Ethernet Payload                   |
   |                                                               |
   | (Note that the original Ethernet Frame's FCS is not included) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Frame Check Sequence:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    New FCS (Frame Check Sequence) for Outer Ethernet Frame    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.3.  UDP Header

   The use of an encapsulating UDP [RFC0768] header follows the
   connectionless semantics of Ethernet and IP in addition to providing
   entropy to routers performing ECMP.  The header fields are therefore
   interpreted as follows:

   Source port:  A source port selected by the originating tunnel
      endpoint.  This source port SHOULD be the same for all packets
      belonging to a single encapsulated flow to prevent reordering due
      to the use of different paths.
      To encourage an even distribution of flows across multiple links,
      the source port SHOULD be calculated using a hash of the
      encapsulated packet headers using, for example, a traditional
      5-tuple.  Since the port represents a flow identifier rather than
      a true UDP connection, the entire 16-bit range MAY be used to
      maximize entropy.

   Dest port:  IANA has assigned port 6081 as the fixed well-known
      destination port for Geneve.  Although the well-known value should
      be used by default, it is RECOMMENDED that implementations make
      this configurable.  The chosen port is used for identification of
      Geneve packets and MUST NOT be reversed for different ends of a
      connection as is done with TCP.

   UDP length:  The length of the UDP packet including the UDP header.

   UDP checksum:  The checksum MAY be set to zero on transmit for
      packets encapsulated in both IPv4 and IPv6 [RFC6935].  When a
      packet is received with a UDP checksum of zero it MUST be accepted
      and decapsulated.  If the originating tunnel endpoint optionally
      encapsulates a packet with a non-zero checksum, it MUST be a
      correctly computed UDP checksum.  Upon receiving such a packet,
      the egress endpoint MUST validate the checksum.  If the checksum
      is not correct, the packet MUST be dropped, otherwise the packet
      MUST be accepted for decapsulation.  It is RECOMMENDED that the
      UDP checksum be computed to protect the Geneve header and options
      in situations where the network reliability is not high and the
      packet is not protected by another checksum or CRC.

3.4.  Tunnel Header Fields

   Ver (2 bits):  The current version number is 0.  Packets received by
      an endpoint with an unknown version MUST be dropped.  Non-
      terminating devices processing Geneve packets with an unknown
      version number MUST treat them as UDP packets with an unknown
      payload.
588 Opt Len (6 bits): The length of the options fields, expressed in 589 four byte multiples, not including the eight byte fixed tunnel 590 header. This results in a minimum total Geneve header size of 8 591 bytes and a maximum of 260 bytes. The start of the payload 592 headers can be found using this offset from the end of the base 593 Geneve header. 595 O (1 bit): OAM packet. This packet contains a control message 596 instead of a data payload. Control messages are sent between 597 Geneve endpoints. Endpoints MUST NOT forward the payload and 598 transit devices MUST NOT attempt to interpret or process it. 599 Since these are infrequent control messages, it is RECOMMENDED 600 that endpoints direct these packets to a high priority control 601 queue (for example, to direct the packet to a general purpose CPU 602 from a forwarding ASIC or to separate out control traffic on a 603 NIC). Transit devices MUST NOT alter forwarding behavior on the 604 basis of this bit, such as ECMP link selection. 606 C (1 bit): Critical options present. One or more options has the 607 critical bit set (see Section 3.5). If this bit is set then 608 tunnel endpoints MUST parse the options list to interpret any 609 critical options. On endpoints where option parsing is not 610 supported the packet MUST be dropped on the basis of the 'C' bit 611 in the base header. If the bit is not set tunnel endpoints MAY 612 strip all options using 'Opt Len' and forward the decapsulated 613 packet. Transit devices MUST NOT drop packets on the basis of 614 this bit. 616 The critical bit allows hardware implementations the flexibility 617 to handle options processing in the hardware fastpath or in the 618 exception (slow) path without the need to process all the options. 619 For example, a critical option such as secure hash to provide 620 Geneve header integrity check must be processed by tunnel 621 endpoints and typically processed in the hardware fastpath. 623 Rsvd. 
      (6 bits): Reserved field which MUST be zero on transmission
      and ignored on receipt.

   Protocol Type (16 bits): The type of the protocol data unit
      appearing after the Geneve header. This follows the EtherType
      [ETYPES] convention with Ethernet itself being represented by the
      value 0x6558.

   Virtual Network Identifier (VNI) (24 bits): An identifier for a
      unique element of a virtual network. In many situations this may
      represent an L2 segment, however, the control plane defines the
      forwarding semantics of decapsulated packets. The VNI MAY be used
      as part of ECMP forwarding decisions or MAY be used as a mechanism
      to distinguish between overlapping address spaces contained in the
      encapsulated packet when load balancing across CPUs.

   Reserved (8 bits): Reserved field which MUST be zero on transmission
      and ignored on receipt.

   Transit devices MUST maintain consistent forwarding behavior
   irrespective of the value of 'Opt Len', including ECMP link
   selection. These devices SHOULD be able to forward packets
   containing options without resorting to a slow path.

3.5. Tunnel Options

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Option Class         |      Type     |R|R|R| Length  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Variable Option Data                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                             Geneve Option

   The base Geneve header is followed by zero or more options in
   Type-Length-Value format. Each option consists of a four byte option
   header and a variable amount of option data interpreted according to
   the type.

   Option Class (16 bits): Namespace for the 'Type' field.
      IANA will be requested to create a "Geneve Option Class" registry
      to allocate identifiers for organizations, technologies, and
      vendors that have an interest in creating types for options. Each
      organization may allocate types independently to allow
      experimentation and rapid innovation. It is expected that over
      time certain options will become well known and a given
      implementation may use option types from a variety of sources. In
      addition, IANA will be requested to reserve specific ranges for
      standardized and experimental options.

   Type (8 bits): Type indicating the format of the data contained in
      this option. Options are primarily designed to encourage future
      extensibility and innovation and so standardized forms of these
      options will be defined in a separate document.

      The high order bit of the option type indicates that this is a
      critical option. If the receiving endpoint does not recognize
      this option and this bit is set then the packet MUST be dropped.
      If the critical bit is set in any option then the 'C' bit in the
      Geneve base header MUST also be set. Transit devices MUST NOT
      drop packets on the basis of this bit. The following figure shows
      the location of the 'C' bit in the 'Type' field:

       0 1 2 3 4 5 6 7
      +-+-+-+-+-+-+-+-+
      |C|    Type     |
      +-+-+-+-+-+-+-+-+

      The requirement to drop a packet with an unknown critical option
      applies to the entire tunnel endpoint system and not a particular
      component of the implementation. For example, in a system
      comprised of a forwarding ASIC and a general purpose CPU, this
      does not mean that the packet must be dropped in the ASIC. An
      implementation may send the packet to the CPU using a rate-limited
      control channel for slow-path exception handling.

   R (3 bits): Option control flags reserved for future use. MUST be
      zero on transmission and ignored on receipt.

   Length (5 bits): Length of the option, expressed in four byte
      multiples excluding the option header. The total length of each
      option may be between 4 and 128 bytes. A value of 0 in the Length
      field implies an option with only the option header without the
      variable option data. Packets in which the total length of all
      options is not equal to the 'Opt Len' in the base header are
      invalid and MUST be silently dropped if received by an endpoint.

   Variable Option Data: Option data interpreted according to 'Type'.

3.5.1. Options Processing

   Geneve options are intended to be originated and processed by tunnel
   endpoints. However, options MAY be interpreted by transit devices
   along the tunnel path. Transit devices not processing Geneve headers
   SHOULD process Geneve packets as any other UDP packet and maintain
   consistent forwarding behavior.

   In tunnel endpoints, the generation and interpretation of options is
   determined by the control plane, which is out of the scope of this
   document. However, to ensure interoperability between heterogeneous
   devices some requirements are imposed on options and the devices that
   process them:

   o  Receiving endpoints MUST drop packets containing unknown options
      with the 'C' bit set in the option type. Conversely, transit
      devices MUST NOT drop packets as a result of encountering unknown
      options, including those with the 'C' bit set.

   o  Some options may be defined in such a way that the position in the
      option list is significant. Options, and their ordering, MUST NOT
      be changed by transit devices.

   o  An option MUST NOT affect the parsing or interpretation of any
      other option.

   When designing a Geneve option, it is important to consider how the
   option will evolve in the future. Once an option is defined it is
   reasonable to expect that implementations may come to depend on a
   specific behavior.
   As a result, the scope of any future changes must
   be carefully described upfront.

   Unexpectedly significant interoperability issues may result from
   changing the length of an option that was defined to be a certain
   size. A particular option is specified to have either a fixed
   length, which is constant, or a variable length, which may change
   over time or for different use cases. This property is part of the
   definition of the option and conveyed by the 'Type'. For fixed
   length options, some implementations may choose to ignore the length
   field in the option header and instead parse based on the well known
   length associated with the type. In this case, redefining the length
   will impact not only parsing of the option in question but also any
   options that follow. Therefore, options that are defined to be fixed
   length in size MUST NOT be redefined to a different length. Instead,
   a new 'Type' should be allocated.

4. Implementation and Deployment Considerations

4.1. Encapsulation of Geneve in IP

   As an IP-based tunnel protocol, Geneve shares many properties and
   techniques with existing protocols. The application of some of these
   is described in further detail, although in general most concepts
   applicable to the IP layer or to IP tunnels generally also function
   in the context of Geneve.

4.1.1. IP Fragmentation

   To prevent fragmentation and maximize performance, the best practice
   when using Geneve is to ensure that the MTU of the physical network
   is greater than or equal to the MTU of the encapsulated network plus
   tunnel headers. Manual or upper layer (such as TCP MSS clamping)
   configuration can be used to ensure that fragmentation never takes
   place; however, in some situations this may not be feasible.
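
   As an illustration of the sizing rule above, the following Python
   sketch computes the smallest underlay MTU that avoids fragmentation.
   The function name and its parameters are hypothetical and are not
   defined by this document. The sketch assumes an Ethernet payload
   (a 14 byte inner Ethernet header, optionally with a 4 byte VLAN
   tag), outer IPv4 or IPv6 headers without IP options or extension
   headers, and the 8 byte base Geneve header followed by 'Opt Len'
   four byte multiples of options.

```python
# Illustrative sketch only; names are hypothetical, not defined by Geneve.
OUTER_IPV4 = 20      # outer IPv4 header without IP options (assumption)
OUTER_IPV6 = 40      # outer IPv6 header without extension headers (assumption)
UDP_HEADER = 8
GENEVE_BASE = 8      # fixed Geneve tunnel header
INNER_ETHERNET = 14  # assumption: payload is an untagged Ethernet frame
VLAN_TAG = 4


def required_underlay_mtu(inner_mtu, opt_len=0, ipv6=False, inner_vlan=False):
    """Return the minimum underlay IP MTU that avoids fragmentation.

    inner_mtu: MTU of the encapsulated network (inner IP packet size).
    opt_len:   value of the 6-bit 'Opt Len' field (four byte multiples).
    """
    if not 0 <= opt_len <= 63:
        raise ValueError("Opt Len is a 6-bit field (0..63)")
    geneve = GENEVE_BASE + 4 * opt_len  # 8..260 bytes
    inner_l2 = INNER_ETHERNET + (VLAN_TAG if inner_vlan else 0)
    outer_ip = OUTER_IPV6 if ipv6 else OUTER_IPV4
    return inner_mtu + inner_l2 + geneve + UDP_HEADER + outer_ip
```

   Under these assumptions, a 1500 byte inner MTU carried over IPv4
   with no options needs an underlay MTU of at least 1550 bytes.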

   It is strongly RECOMMENDED that Path MTU Discovery ([RFC1191],
   [RFC1981]) be used by setting the DF bit in the IP header when Geneve
   packets are transmitted over IPv4 (this is the default with IPv6).
   The use of Path MTU Discovery on the transit network provides the
   encapsulating endpoint with soft-state about the link that it may use
   to prevent or minimize fragmentation depending on its role in the
   virtualized network. For example, recommendations/guidance for
   handling fragmentation in similar overlay encapsulation services like
   PWE3 are provided in Section 5.3 of [RFC3985].

   Note that some implementations may not be capable of supporting
   fragmentation or other less common features of the IP header, such as
   options and extension headers.

4.1.2. DSCP and ECN

   When encapsulating IP (including over Ethernet) packets in Geneve,
   there are several considerations for propagating DSCP and ECN bits
   from the inner header to the tunnel on transmission and the reverse
   on reception.

   [RFC2983] provides guidance for mapping DSCP between inner and outer
   IP headers. Network virtualization is typically more closely aligned
   with the Pipe model described, where the DSCP value on the tunnel
   header is set based on a policy (which may be a fixed value, one
   based on the inner traffic class, or some other mechanism for
   grouping traffic). Aspects of the Uniform model (which treats the
   inner and outer DSCP value as a single field by copying on ingress
   and egress) may also apply, such as the ability to remark the inner
   header on tunnel egress based on transit marking. However, the
   Uniform model is not conceptually consistent with network
   virtualization, which seeks to provide strong isolation between
   encapsulated traffic and the physical network.
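
   The difference between the two models on tunnel ingress can be
   sketched in Python as follows. This is only an illustration: the
   function, the model names used as strings, and the 'class_map'
   policy table are hypothetical and are not defined by this document
   or by [RFC2983].

```python
# Illustrative sketch only; the policy names and function are hypothetical.
def outer_dscp(inner_dscp, model="pipe", fixed_dscp=0, class_map=None):
    """Choose the DSCP for the outer (tunnel) IP header on encapsulation.

    model="pipe":    the outer DSCP is set by policy; it is a fixed value
                     unless a class_map keyed by inner DSCP is supplied.
    model="uniform": the outer DSCP is a copy of the inner DSCP.
    """
    if not 0 <= inner_dscp <= 63:
        raise ValueError("DSCP is a 6-bit value (0..63)")
    if model == "uniform":
        return inner_dscp
    if model == "pipe":
        if class_map is not None:
            # Policy based on the inner traffic class, with a default.
            return class_map.get(inner_dscp, fixed_dscp)
        return fixed_dscp
    raise ValueError("unknown model: %r" % (model,))
```

   With the Pipe model the tenant's marking does not leak into the
   underlay unless the operator's policy table maps it explicitly,
   which matches the isolation goal described above.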

   [RFC6040] describes the mechanism for exposing ECN capabilities on IP
   tunnels and propagating congestion markers to the inner packets.
   This behavior MUST be followed for IP packets encapsulated in Geneve.

4.1.3. Broadcast and Multicast

   Geneve tunnels may either be point-to-point unicast between two
   endpoints or may utilize broadcast or multicast addressing. It is
   not required that inner and outer addressing match in this respect.
   For example, in physical networks that do not support multicast,
   encapsulated multicast traffic may be replicated into multiple
   unicast tunnels or forwarded by policy to a unicast location
   (possibly to be replicated there).

   With physical networks that do support multicast it may be desirable
   to use this capability to take advantage of hardware replication for
   encapsulated packets. In this case, multicast addresses may be
   allocated in the physical network corresponding to tenants,
   encapsulated multicast groups, or some other factor. The allocation
   of these groups is a component of the control plane and therefore
   outside of the scope of this document. When physical multicast is in
   use, the 'C' bit in the Geneve header may be used with groups of
   devices with heterogeneous capabilities as each device can interpret
   only the options that are significant to it if they are not critical.

4.1.4. Unidirectional Tunnels

   Generally speaking, a Geneve tunnel is a unidirectional concept. IP
   is not a connection oriented protocol and it is possible for two
   endpoints to communicate with each other using different paths or to
   have one side not transmit anything at all. As Geneve is an IP-based
   protocol, the tunnel layer inherits these same characteristics.

   It is possible for a tunnel to encapsulate a protocol, such as TCP,
   which is connection oriented and maintains session state at that
   layer.
   In addition, implementations MAY model Geneve tunnels as
   connected, bidirectional links, such as to provide the abstraction of
   a virtual port. In both of these cases, bidirectionality of the
   tunnel is handled at a higher layer and does not affect the operation
   of Geneve itself.

4.2. Constraints on Protocol Features

   Geneve is intended to be flexible to a wide range of current and
   future applications. As a result, certain constraints may be placed
   on the use of metadata or other aspects of the protocol in order to
   optimize for a particular use case. For example, some applications
   may limit the types of options which are supported or enforce a
   maximum number or length of options. Other applications may only
   handle certain encapsulated payload types, such as Ethernet or IP.
   This could be either globally throughout the system or, for example,
   restricted to certain classes of devices or network paths.

   These constraints may be communicated to tunnel endpoints either
   explicitly through a control plane or implicitly by the nature of the
   application. As Geneve is defined as a data plane protocol that is
   control plane agnostic, the exact mechanism is not defined in this
   document.

4.2.1. Constraints on Options

   While Geneve options are flexible, a control plane may restrict the
   number of option TLVs, as well as the order and size of the TLVs,
   between tunnel endpoints to make it simpler for a data plane
   implementation in software or hardware to handle
   [I-D.ietf-nvo3-encap]. For example, there may be some critical
   information such as a secure hash that must be processed in a certain
   order to provide lowest latency.
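
   One way an endpoint might enforce such restrictions is sketched
   below in Python. The sketch walks the option TLVs using the four
   byte option header from Section 3.5 and checks them against a
   locally configured policy. 'OptionPolicy', 'walk_options', and
   'check_options' are hypothetical names, the default limits are
   arbitrary, and how the policy reaches the endpoint is left to the
   control plane as described above.

```python
# Illustrative sketch only; OptionPolicy and check_options are hypothetical.
import struct
from dataclasses import dataclass


@dataclass
class OptionPolicy:
    max_options: int = 4        # maximum number of option TLVs (arbitrary)
    max_option_bytes: int = 32  # maximum total size of one TLV (arbitrary)
    allowed_order: tuple = ()   # permitted (class, type) pairs, in order


def walk_options(options):
    """Yield (option_class, option_type, data) for each TLV in 'options'."""
    offset = 0
    while offset < len(options):
        if len(options) - offset < 4:
            raise ValueError("truncated option header")
        opt_class, opt_type, flags_len = struct.unpack_from(
            "!HBB", options, offset)
        data_len = (flags_len & 0x1F) * 4  # low 5 bits, four byte multiples
        end = offset + 4 + data_len
        if end > len(options):
            raise ValueError("option overruns the options field")
        yield opt_class, opt_type, options[offset + 4:end]
        offset = end


def check_options(options, policy):
    """Return True if the option list satisfies the hypothetical policy."""
    tlvs = list(walk_options(options))
    if len(tlvs) > policy.max_options:
        return False
    if any(4 + len(data) > policy.max_option_bytes for _, _, data in tlvs):
        return False
    if policy.allowed_order:
        # Options must appear as a subsequence of the allowed order.
        remaining = iter(policy.allowed_order)
        return all((c, t) in remaining for c, t, _ in tlvs)
    return True
```

   Here 'options' is the byte string that follows the base header and
   whose length is given by 'Opt Len'. What an endpoint does with a
   packet that violates the policy is not specified by this sketch.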

   A control plane may negotiate a subset of option TLVs and certain TLV
   ordering, as well as limit the total number of option TLVs present
   in the packet, for example, to accommodate hardware capable of
   processing fewer options [I-D.ietf-nvo3-encap]. Hence, a control
   plane needs to have the ability to describe the supported TLVs subset
   and their order to the tunnel endpoints. In the absence of a
   control plane, alternative configuration mechanisms may be used for
   this purpose. The exact mechanism is not defined in this document.

4.3. NIC Offloads

   Modern NICs currently provide a variety of offloads to enable the
   efficient processing of packets. The implementation of many of these
   offloads requires only that the encapsulated packet be easily parsed
   (for example, checksum offload). However, optimizations such as LSO
   and LRO involve some processing of the options themselves since they
   must be replicated/merged across multiple packets. In these
   situations, it is desirable to not require changes to the offload
   logic to handle the introduction of new options. To enable this,
   some constraints are placed on the definitions of options to allow
   for simple processing rules:

   o  When performing LSO, a NIC MUST replicate the entire Geneve header
      and all options, including those unknown to the device, onto each
      resulting segment. However, a given option definition may
      override this rule and specify different behavior in supporting
      devices. Conversely, when performing LRO, a NIC MAY assume that a
      binary comparison of the options (including unknown options) is
      sufficient to ensure equality and MAY merge packets with equal
      Geneve headers.

   o  Options MUST NOT be reordered during the course of offload
      processing, including when merging packets for the purpose of LRO.

   o  NICs performing offloads MUST NOT drop packets with unknown
      options, including those marked as critical.

   There is no requirement that a given implementation of Geneve employ
   the offloads listed as examples above. However, as these offloads
   are currently widely deployed in commercially available NICs, the
   rules described here are intended to enable efficient handling of
   current and future options across a variety of devices.

4.4. Inner VLAN Handling

   Geneve is capable of encapsulating a wide range of protocols and
   therefore a given implementation is likely to support only a small
   subset of the possibilities. However, as Ethernet is expected to be
   widely deployed, it is useful to describe the behavior of VLANs
   inside encapsulated Ethernet frames.

   As with any protocol, support for inner VLAN headers is OPTIONAL. In
   many cases, the use of encapsulated VLANs may be disallowed due to
   security or implementation considerations. However, in other cases
   trunking of VLAN frames across a Geneve tunnel can prove useful. As
   a result, the processing of inner VLAN tags upon ingress or egress
   from a tunnel endpoint is based upon the configuration of the
   endpoint and/or control plane and not explicitly defined as part of
   the data format.

5. Interoperability Issues

   Viewed exclusively from the data plane, Geneve does not introduce any
   interoperability issues as it appears to most devices as UDP packets.
   However, as there are already a number of tunnel protocols deployed
   in network virtualization environments, there is a practical question
   of transition and coexistence.

   Since Geneve is a superset of the functionality of the most common
   protocols used for network virtualization (VXLAN, NVGRE), it should
   be straightforward to port an existing control plane to run on top of
   it with minimal effort.
   With both the old and new packet formats
   supporting the same set of capabilities, there is no need for a hard
   transition - endpoints directly communicating with each other use any
   common protocol, which may be different even within a single overall
   system. As transit devices are primarily forwarding packets on the
   basis of the IP header, all protocols appear similar and these
   devices do not introduce additional interoperability concerns.

   To assist with this transition, it is strongly suggested that
   implementations support simultaneous operation of both Geneve and
   existing tunnel protocols as it is expected to be common for a single
   node to communicate with a mixture of other nodes. Eventually, older
   protocols may be phased out as they are no longer in use.

6. Security Considerations

   As it is encapsulated within a UDP/IP packet, Geneve does not have
   any inherent security mechanisms. As a result, an attacker with
   access to the underlay network transporting the IP packets has the
   ability to snoop or inject packets. Legitimate but malicious tunnel
   endpoints may also spoof identifiers in the tunnel header to gain
   access to networks owned by other tenants.

   Within a particular security domain, such as a data center operated
   by a single service provider, the most common and highest performing
   security mechanism is isolation of trusted components. Tunnel
   traffic can be carried over a separate VLAN and filtered at any
   untrusted boundaries. In addition, tunnel endpoints should only be
   operated in environments controlled by the service provider, such as
   the hypervisor itself rather than within a customer VM.

   When crossing an untrusted link, such as the public Internet, IPsec
   [RFC4301] may be used to provide authentication and/or encryption of
   the IP packets formed as part of Geneve encapsulation.

   Geneve does not otherwise affect the security of the encapsulated
   packets. As per the guidelines of BCP 72 [RFC3552], the following
   sections describe potential security risks that may be applicable to
   Geneve deployments and approaches to mitigate such risks.

6.1. Data Confidentiality

   Geneve is a network virtualization overlay encapsulation protocol
   designed to establish tunnels between network virtualization end
   points (NVEs) over an existing IP network. It can be used to deploy
   multi-tenant overlay networks over an existing IP underlay network in
   a public or private data center. The overlay service is typically
   provided by a service provider, for example a cloud services provider
   or a private data center operator. Due to the nature of
   multi-tenancy in such environments, a tenant system may expect data
   confidentiality to ensure its packet data is not tampered with
   (active attack) in transit or a target of unauthorized monitoring
   (passive attack). A tenant may expect the overlay service provider
   to provide data confidentiality as part of the service or a tenant
   may bring its own data confidentiality mechanisms like IPsec or TLS
   to protect the data end to end between its tenant systems.

   An NVE, used in multi-tenant environments, MUST have the capability
   to encrypt the tenant data end to end between the NVEs. The NVEs may
   use existing well established encryption mechanisms such as IPsec,
   DTLS, etc. The NVEs SHOULD have a configurable option to disable the
   encryption if, for example, the packet data is already encrypted by
   the tenant system.

6.1.1. Inter-data center traffic

   A tenant system in a customer premises (private data center) may want
   to connect to tenant systems on their tenant overlay network in a
   public cloud data center or a tenant may want to have its tenant
   systems located in multiple geographically separated data centers for
   high availability. Geneve data traffic between tenant systems across
   such separated networks should be protected from threats when
   traversing public networks. Any Geneve overlay data leaving the data
   center network, beyond the operator's security domain, for example
   over the public Internet, SHOULD be secured by encryption mechanisms
   such as IPsec or other VPN mechanisms to protect the communications
   between the NVEs when they are geographically separated over
   untrusted network links. Implementation of specific data protection
   mechanisms employed between data centers is beyond the scope of this
   document.

6.2. Data Integrity

   Geneve encapsulation is used between NVEs to establish overlay
   tunnels over an existing IP underlay network. In a multi-tenant data
   center, a rogue or compromised tenant system may try to launch a
   passive attack such as monitoring the traffic of other tenants, or an
   active attack such as spoofing or trying to inject unauthorized
   Geneve encapsulated traffic into the network. To prevent such
   attacks, an NVE MUST NOT propagate Geneve packets beyond the NVE to
   tenant systems and SHOULD employ packet filtering mechanisms so as
   not to forward unauthorized traffic between TSs in different tenant
   networks.

   A compromised network node or a transit device within a data center
   may launch an active attack trying to tamper with the Geneve packet
   data between NVEs. Malicious tampering of Geneve header fields may
   cause the packet from one tenant to be forwarded to a different
   tenant network.
   If an operator determines the possibility of such a
   threat in their environment, the operator may choose to employ data
   integrity mechanisms between NVEs. In order to prevent such risks, a
   Geneve NVE MUST have the capability to protect the integrity of
   Geneve packets including packet headers, options and payload on
   communications between NVE pairs. A cryptographic data protection
   mechanism such as IPsec may be used to provide data integrity
   protection. The NVE SHOULD have a configuration option to enable or
   disable the data integrity protection, based on the presence of
   threats in their environment. A data center operator may choose to
   deploy any other data integrity mechanisms as applicable and
   supported in their underlay networks.

   Geneve supports Geneve Options, so an operator may choose to use a
   Geneve option TLV to provide a cryptographic data protection
   mechanism, to verify the data integrity of the Geneve header, Geneve
   options or the entire Geneve packet including the payload.
   Implementation of such a mechanism is beyond the scope of this
   document.

6.3. Authentication of NVE peers

   A rogue network device or a compromised NVE in a data center
   environment might be able to spoof Geneve packets as if they came
   from a legitimate NVE. In order to mitigate such a risk, a Geneve
   NVE MUST support an authentication mechanism, such as IPsec AH, to
   ensure that the Geneve packet originated from the intended NVE peer,
   in environments where spoofing or rogue devices are a potential
   threat. Other simpler source checks such as ingress filtering for
   VLAN/MAC/IP address, reverse path forwarding checks, etc., may be
   used in certain trusted environments to ensure Geneve packets
   originated from the intended NVE peer.

6.4. Multicast/Broadcast

   In typical data center networks where IP multicasting is not
   supported in the underlay network, multicasting can be supported
   using multiple unicast tunnels. The same security requirements as
   described in the above sections can be used to protect Geneve
   communications between NVE peers. If IP multicasting is supported in
   the underlay network and the operator chooses to use it for multicast
   traffic among Geneve endpoints, then Geneve NVEs used in such
   environments SHOULD support data protection mechanisms such as IPsec
   with Multicast extensions [RFC5374] to protect multicast traffic
   among Geneve NVE groups.

6.5. Control plane communications

   A Network Virtualization Authority (NVA) as outlined in [RFC8014] may
   be used as a control plane for configuring and managing the Geneve
   NVEs. The data center operator is expected to use security
   mechanisms to protect the communications between the NVA and the
   NVEs and use authentication mechanisms to detect any rogue or
   compromised NVEs within their administrative domain. Data protection
   mechanisms for control plane communication and authentication
   mechanisms between the NVA and the NVEs are beyond the scope of this
   document.

7. IANA Considerations

   IANA has allocated UDP port 6081 as the well-known destination port
   for Geneve. Upon publication, the registry should be updated to cite
   this document. The original request was:

      Service Name: geneve
      Transport Protocol(s): UDP
      Assignee: Jesse Gross
      Contact: Jesse Gross
      Description: Generic Network Virtualization Encapsulation (Geneve)
      Reference: This document
      Port Number: 6081

   In addition, IANA is requested to create a "Geneve Option Class"
   registry to allocate Option Classes. This shall be a registry of
   16-bit hexadecimal values along with descriptive strings.
   The identifiers 0x0000-0x00FF are to be reserved for standardized
   options for allocation by IETF Review [RFC5226] and 0xFFF0-0xFFFF
   for Experimental Use. Otherwise, identifiers are to be assigned to
   any organization with an interest in creating Geneve options on a
   First Come First Served basis. The registry is to be populated with
   the following initial values:

   +----------------+--------------------------------------+
   | Option Class   | Description                          |
   +----------------+--------------------------------------+
   | 0x0000..0x00FF | Unassigned - IETF Review             |
   | 0x0100         | Linux                                |
   | 0x0101         | Open vSwitch                         |
   | 0x0102         | Open Virtual Networking (OVN)        |
   | 0x0103         | In-band Network Telemetry (INT)      |
   | 0x0104         | VMware                               |
   | 0x0105         | Amazon                               |
   | 0x0106         | Cisco                                |
   | 0x0107..0xFFEF | Unassigned - First Come First Served |
   | 0xFFF0..0xFFFF | Experimental                         |
   +----------------+--------------------------------------+

8. Contributors

   The following individuals were authors of an earlier version of this
   document and made significant contributions:

   Pankaj Garg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   USA

   Email: pankajg@microsoft.com

   Chris Wright
   Red Hat Inc.
   1801 Varsity Drive
   Raleigh, NC 27606
   USA

   Email: chrisw@redhat.com

   Puneet Agarwal
   Innovium, Inc.
   6001 America Center Drive
   San Jose, CA 95002
   USA

   Email: puneet@innovium.com

   Kenneth Duda
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA 95054
   USA

   Email: kduda@arista.com

   Dinesh G. Dutt
   Cumulus Networks
   140C S. Whisman Road
   Mountain View, CA 94041
   USA

   Email: ddutt@cumulusnetworks.com

   Jon Hudson
   Independent

   Email: jon.hudson@gmail.com

   Ariel Hendel
   Facebook, Inc.
   1 Hacker Way
   Menlo Park, CA 94025
   USA

   Email: ahendel@fb.com

9. Acknowledgements

   The authors wish to thank Martin Casado, Bruce Davie and Dave Thaler
   for their input, feedback, and helpful suggestions.

10. References

10.1. Normative References

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", RFC 5226,
              DOI 10.17487/RFC5226, May 2008.

10.2. Informative References

   [ETYPES]   The IEEE Registration Authority, "IEEE 802 Numbers", 2013.

   [I-D.ietf-nvo3-dataplane-requirements]
              Bitar, N., Lasserre, M., Balus, F., Morin, T., Jin, L.,
              and B. Khasnabish, "NVO3 Data Plane Requirements",
              draft-ietf-nvo3-dataplane-requirements-03 (work in
              progress), April 2014.

   [I-D.ietf-nvo3-encap]
              Boutros, S., Ganga, I., Garg, P., Manur, R., Mizrahi, T.,
              Mozes, D., Nordmark, E., Smith, M., Aldrin, S., and I.
              Bagdonas, "NVO3 Encapsulation Considerations",
              draft-ietf-nvo3-encap-01 (work in progress), October 2017.

   [IEEE.802.1Q_2014]
              IEEE, "IEEE Standard for Local and metropolitan area
              networks--Bridges and Bridged Networks", IEEE 802.1Q-2014,
              DOI 10.1109/ieeestd.2014.6991462, December 2014.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              DOI 10.17487/RFC1191, November 1990.

   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
              for IP version 6", RFC 1981, DOI 10.17487/RFC1981, August
              1996.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
              RFC 2983, DOI 10.17487/RFC2983, October 2000.

   [RFC3031]  Rosen, E., Viswanathan, A., and R.
              Callon, "Multiprotocol Label Switching Architecture",
              RFC 3031, DOI 10.17487/RFC3031, January 2001.

   [RFC3552]  Rescorla, E. and B. Korver, "Guidelines for Writing RFC
              Text on Security Considerations", BCP 72, RFC 3552,
              DOI 10.17487/RFC3552, July 2003.

   [RFC3985]  Bryant, S., Ed. and P. Pate, Ed., "Pseudo Wire Emulation
              Edge-to-Edge (PWE3) Architecture", RFC 3985,
              DOI 10.17487/RFC3985, March 2005.

   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
              Internet Protocol", RFC 4301, DOI 10.17487/RFC4301,
              December 2005.

   [RFC5374]  Weis, B., Gross, G., and D. Ignjatic, "Multicast
              Extensions to the Security Architecture for the Internet
              Protocol", RFC 5374, DOI 10.17487/RFC5374, November 2008.

   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
              Notification", RFC 6040, DOI 10.17487/RFC6040, November
              2010.

   [RFC6935]  Eubanks, M., Chimento, P., and M. Westerlund, "IPv6 and
              UDP Checksums for Tunneled Packets", RFC 6935,
              DOI 10.17487/RFC6935, April 2013.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

   [RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y.
              Rekhter, "Framework for Data Center (DC) Network
              Virtualization", RFC 7365, DOI 10.17487/RFC7365, October
              2014.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network
              Virtualization Using Generic Routing Encapsulation",
              RFC 7637, DOI 10.17487/RFC7637, September 2015.

   [RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T.
              Narten, "An Architecture for Data-Center Network
              Virtualization over Layer 3 (NVO3)", RFC 8014,
              DOI 10.17487/RFC8014, December 2016.

   [VL2]      Greenberg, A., et al., "VL2: A Scalable and Flexible Data
              Center Network", ACM SIGCOMM Computer Communication
              Review, DOI 10.1145/1594977.1592576, 2009.

Authors' Addresses

   Jesse Gross (editor)

   Email: jesse@kernel.org

   Ilango Ganga (editor)
   Intel Corporation
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   USA

   Email: ilango.s.ganga@intel.com

   T. Sridhar (editor)
   VMware, Inc.
   3401 Hillview Ave.
   Palo Alto, CA 94304
   USA

   Email: tsridhar@vmware.com