Network Working Group                                       P. Garg, Ed.
Internet Draft                                              Y. Wang, Ed.
Intended Category: Informational                               Microsoft
Expires: May 10, 2015                                  November 11, 2014

   NVGRE: Network Virtualization using Generic Routing Encapsulation
              draft-sridharan-virtualization-nvgre-07.txt

Status of this Memo

   This memo provides information for the Internet Community. It does
   not specify an Internet standard of any kind; instead it relies on a
   proposed standard. Distribution of this memo is unlimited.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

   This Internet-Draft will expire on May 10, 2015.

Abstract

   This document describes the usage of the Generic Routing
   Encapsulation (GRE) header for Network Virtualization (NVGRE) in
   multi-tenant datacenters.
   Network Virtualization decouples virtual networks and addresses from
   physical network infrastructure, providing isolation and concurrency
   between multiple virtual networks on the same physical network
   infrastructure. This document also introduces a Network
   Virtualization framework to illustrate the use cases, but the focus
   is on specifying the data plane aspect of NVGRE.

Table of Contents

   1. Introduction
      1.1. Terminology
   2. Conventions used in this document
   3. NVGRE: Network Virtualization using GRE
      3.1. NVGRE Endpoint
      3.2. NVGRE frame format
      3.3. Inner 802.1Q Tag
      3.4. Reserved VSID
   4. NVGRE Deployment Considerations
      4.1. ECMP Support
      4.2. Broadcast and Multicast Traffic
      4.3. Unicast Traffic
      4.4. IP Fragmentation
      4.5. Address/Policy Management & Routing
      4.6. Cross-subnet, Cross-premise Communication
      4.7. Internet Connectivity
      4.8. Management and Control Planes
      4.9. NVGRE-Aware Devices
      4.10. Network Scalability with NVGRE
   5. Security Considerations
   6. IANA Considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Authors and Contributors
   9. Acknowledgments

1. Introduction

   Conventional data center network designs cater to largely static
   workloads and cause fragmentation of network and server capacity
   [5][6]. Several issues limit dynamic allocation and consolidation of
   capacity. Layer-2 networks use the Rapid Spanning Tree Protocol
   (RSTP), which is designed to eliminate loops by blocking redundant
   paths. These eliminated paths translate to wasted capacity and a
   highly oversubscribed network. There are alternative approaches,
   such as TRILL, that address this problem [12].

   The network utilization inefficiencies are exacerbated by network
   fragmentation due to the use of VLANs for broadcast isolation. VLANs
   are used for traffic management and also as the mechanism for
   providing security and performance isolation among services
   belonging to different tenants. The Layer-2 network is carved into
   smaller subnets, typically one subnet per VLAN, with VLAN tags
   configured on all the Layer-2 switches connected to the server racks
   that host a given tenant's services. The current VLAN limits
   theoretically allow for 4K such subnets to be created. The size of
   the broadcast domain is typically restricted due to the overhead of
   broadcast traffic (e.g., ARP).
   The 4K VLAN limit is no longer sufficient in a shared infrastructure
   servicing multiple tenants.

   Data center operators must be able to achieve high utilization of
   server and network capacity. In order to achieve efficiency, it
   should be possible to assign workloads that operate in a single
   Layer-2 network to any server in any rack in the network. It should
   also be possible to migrate workloads to any server anywhere in the
   network while retaining the workloads' addresses. This can be
   achieved today by stretching VLANs; however, when workloads migrate,
   the network needs to be reconfigured, which is typically error
   prone. By decoupling the workload's location on the LAN from its
   network address, the network administrator configures the network
   once, not every time a service migrates. This decoupling enables any
   server to become part of any server resource pool.

   The following are key design objectives for next generation data
   centers:

   a) location-independent addressing
   b) the ability to scale the number of logical Layer-2/Layer-3
      networks irrespective of the underlying physical topology or the
      number of VLANs
   c) preserving Layer-2 semantics for services and allowing them to
      retain their addresses as they move within and across data
      centers
   d) providing broadcast isolation as workloads move around without
      burdening the network control plane

   This document describes the use of the Generic Routing Encapsulation
   (GRE, [3][4]) header for network virtualization. Network
   virtualization decouples a virtual network from the underlying
   physical network infrastructure by virtualizing network addresses.
   Combined with a management and control plane for the virtual-to-
   physical mapping, network virtualization can enable flexible VM
   placement and movement, and provide network isolation for a multi-
   tenant datacenter.

   Network virtualization enables customers to bring their own address
   spaces into a multi-tenant datacenter, while the datacenter
   administrators can place the customer VMs anywhere in the datacenter
   without reconfiguring their network switches or routers,
   irrespective of the customer address spaces.

1.1. Terminology

   Please refer to [8][10] for a more formal definition of the
   terminology. The following terms are used in this document.

   Customer Address (CA): The virtual IP addresses assigned and
   configured on the virtual NIC within each VM. These are the only
   addresses visible to VMs and applications running within VMs.

   NVE: Network Virtualization Edge, the entity that performs the
   network virtualization encapsulation and decapsulation.

   Provider Address (PA): The IP addresses used in the physical
   network. PAs are associated with VM CAs through the network
   virtualization mapping policy.

   VM: Virtual Machine. Virtual machines are typically instances of
   operating systems running on top of a hypervisor on a physical
   machine or server. Multiple VMs can share the same physical server
   via the hypervisor, yet are completely isolated from each other in
   terms of compute, storage, and other OS resources.

   VSID: Virtual Subnet Identifier, a 24-bit ID that uniquely
   identifies a virtual subnet or virtual Layer-2 broadcast domain.
2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS. Lower case uses of these words are not to be
   interpreted as carrying RFC 2119 significance.

3. NVGRE: Network Virtualization using GRE

   This section describes Network Virtualization using GRE, NVGRE.
   Network virtualization involves creating virtual Layer-2 topologies
   on top of a physical Layer-3 network. Connectivity in the virtual
   topology is provided by tunneling Ethernet frames in GRE over the
   physical network.

   In NVGRE, every virtual Layer-2 network is associated with a 24-bit
   identifier called the Virtual Subnet Identifier (VSID). The VSID is
   carried in an outer header as defined in Section 3.2, allowing
   unique identification of a tenant's virtual subnet to various
   devices in the network. A 24-bit VSID supports up to 16 million
   virtual subnets in the same management domain, in contrast to only
   4K achievable with VLANs. Each VSID represents a virtual Layer-2
   broadcast domain, which can be used to identify a virtual subnet of
   a given tenant. To support a multi-subnet virtual topology,
   datacenter administrators can configure routes to facilitate
   communication between virtual subnets of the same tenant.

   GRE is a proposed IETF standard [3][4] and provides a way to
   encapsulate an arbitrary protocol over IP. NVGRE leverages the GRE
   header to carry VSID information in each packet. The VSID
   information in each packet can be used to build multi-tenant-aware
   tools for traffic analysis, traffic inspection, and monitoring.

   The following sections detail the packet format for NVGRE, describe
   the functions of an NVGRE endpoint, illustrate typical traffic flows
   both within and across data centers, and discuss address and policy
   management as well as deployment considerations.

3.1. NVGRE Endpoint

   NVGRE endpoints are the ingress/egress points between the virtual
   and the physical networks. The NVGRE endpoints are the NVEs as
   defined in the NVO3 Framework document [8]. Any physical server or
   network device can be an NVGRE endpoint. One common deployment is
   for the endpoint to be part of a hypervisor. The primary function of
   this endpoint is to encapsulate/decapsulate Ethernet data frames to
   and from the GRE tunnel, ensure Layer-2 semantics, and apply
   isolation policy scoped on VSID. The endpoint can optionally
   participate in routing and function as a gateway in the virtual
   topology. To encapsulate an Ethernet frame, the endpoint needs to
   know the location information for the destination address in the
   frame. This information can be provisioned via a management plane,
   or obtained via a combination of control plane distribution and data
   plane learning approaches. This document assumes that the location
   information, including VSID, is available to the NVGRE endpoint.
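   As an illustration only (this document mandates no particular data
   structure for the mapping), the following Python sketch shows one
   plausible shape for the location information an endpoint might
   consult; all names and values here are hypothetical.

      from dataclasses import dataclass
      from typing import Optional

      @dataclass(frozen=True)
      class Location:
          provider_address: str  # PA of the destination NVGRE endpoint
          nexthop_mac: str       # outer destination MAC toward that PA

      # Keyed on (VSID, inner destination MAC), provisioned by a
      # management plane or distributed via a control plane.
      MAPPING_TABLE = {
          (0x001000, "00:15:5d:01:02:03"):
              Location("192.0.2.20", "aa:bb:cc:dd:ee:ff"),
      }

      def lookup(vsid: int, inner_dst_mac: str) -> Optional[Location]:
          """Return the location of a destination MAC, if provisioned."""
          return MAPPING_TABLE.get((vsid, inner_dst_mac.lower()))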
3.2. NVGRE frame format

   The GRE header format as specified in RFC 2784 and RFC 2890 [3][4]
   is used for communication between NVGRE endpoints. NVGRE leverages
   the Key extension specified in RFC 2890 to carry the VSID. The
   packet format for Layer-2 encapsulation in GRE is shown in Figure 1.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Outer) Destination MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Outer)Destination MAC Address |  (Outer)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             (Outer) Source MAC Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |  Protocol 0x2F|        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    (Outer) Source Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 (Outer) Destination Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   GRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|0|    Reserved0    | Ver |     Protocol Type 0x6558      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Virtual Subnet ID (VSID)              |    FlowID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Inner) Destination MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Inner)Destination MAC Address |  (Inner)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             (Inner) Source MAC Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Source Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Destination Address                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options            |            Padding            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Original IP Payload                      |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                Figure 1: GRE Encapsulation Frame Format

   The outer/delivery headers include the outer Ethernet header and the
   outer IP header:

   o The outer Ethernet header: The source Ethernet address in the
     outer frame is set to the MAC address associated with the NVGRE
     endpoint. The destination endpoint may or may not be on the same
     physical subnet. The destination Ethernet address is set to the
     MAC address of the next-hop IP address for the destination NVE.
     The outer VLAN tag information is optional and can be used for
     traffic management and broadcast scalability on the physical
     network.

   o The outer IP header: Both IPv4 and IPv6 can be used as the
     delivery protocol for GRE. The IPv4 header is shown for
     illustrative purposes. Henceforth, the IP address in the outer
     frame is referred to as the Provider Address (PA). There can be
     one or more PA addresses associated with an NVGRE endpoint, with
     policy controlling the choice of PA to use for a given Customer
     Address (CA) of a customer VM.

   The GRE header:

   o The C (Checksum Present) and S (Sequence Number Present) bits in
     the GRE header MUST be zero.

   o The K (Key Present) bit in the GRE header MUST be set to one. The
     32-bit Key field in the GRE header is used to carry the Virtual
     Subnet ID (VSID) and the FlowID:

     - Virtual Subnet ID (VSID): a 24-bit value that is used to
       identify the NVGRE-based virtual Layer-2 network.

     - FlowID: an 8-bit value that is used to provide per-flow entropy
       for flows in the same VSID. The FlowID MUST NOT be modified by
       transit devices. The encapsulating NVE SHOULD provide as much
       entropy as possible in the FlowID. If a FlowID is not generated,
       it MUST be set to all zeros.

   o The Protocol Type field in the GRE header is set to 0x6558
     (transparent Ethernet bridging) [2].

   The inner headers (headers of the GRE payload):

   o The inner Ethernet frame consists of an inner Ethernet header,
     followed by an optional inner IP header, followed by the IP
     payload. The inner frame could be any Ethernet data frame, not
     just IP. Note that the inner Ethernet frame's FCS is not
     encapsulated.

   o For illustrative purposes, IPv4 headers are shown as the inner IP
     headers, but IPv6 headers may be used. Henceforth, the IP address
     contained in the inner frame is referred to as the Customer
     Address (CA).
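   As a non-normative illustration of the encoding above, the following
   Python sketch packs and parses the 8-byte GRE header NVGRE uses
   (C=0, K=1, S=0, Protocol Type 0x6558, and the Key carrying the VSID
   and FlowID); the strict flags check is a simplification of this
   sketch, since a tolerant receiver might ignore the Reserved0 bits:

      import struct

      GRE_FLAGS_KEY_ONLY = 0x2000  # only the K bit set; C and S zero
      ETHERTYPE_TEB = 0x6558       # transparent Ethernet bridging [2]

      def build_nvgre_header(vsid: int, flow_id: int = 0) -> bytes:
          assert 0 <= vsid <= 0xFFFFFF and 0 <= flow_id <= 0xFF
          key = (vsid << 8) | flow_id   # VSID in the upper 24 bits
          return struct.pack("!HHI", GRE_FLAGS_KEY_ONLY,
                             ETHERTYPE_TEB, key)

      def parse_nvgre_header(header: bytes):
          flags, proto, key = struct.unpack("!HHI", header[:8])
          if flags != GRE_FLAGS_KEY_ONLY or proto != ETHERTYPE_TEB:
              raise ValueError("not an NVGRE-encapsulated frame")
          return key >> 8, key & 0xFF   # (vsid, flow_id)

      hdr = build_nvgre_header(vsid=0x001000, flow_id=0x5A)
      assert parse_nvgre_header(hdr) == (0x001000, 0x5A)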
3.3. Inner 802.1Q Tag

   The inner Ethernet header of NVGRE MUST NOT contain an 802.1Q tag.
   The encapsulating NVE MUST remove any existing 802.1Q tag before
   encapsulating the frame in NVGRE. A decapsulating NVE MUST drop the
   frame if the inner Ethernet frame contains an 802.1Q tag.

3.4. Reserved VSID

   The VSID range 0-0xFFF is reserved for future use.

   The VSID 0xFFFFFF is reserved for vendor-specific NVE-to-NVE
   communication. The sender NVE SHOULD verify the receiver NVE's
   vendor before sending a packet using this VSID; however, such a
   verification mechanism is out of scope of this document.
   Implementations SHOULD choose a mechanism that meets their
   requirements.

4. NVGRE Deployment Considerations

4.1. ECMP Support

   Switches and routers SHOULD provide ECMP on NVGRE packets using the
   outer frame fields and the entire 32-bit Key field.
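   One possible way (an assumption of this sketch, not something this
   document specifies) for an encapsulating NVE to supply the FlowID
   entropy that such ECMP hashing benefits from is to fold a hash of
   stable inner-flow fields down to 8 bits:

      import zlib

      def derive_flow_id(inner_src_mac: bytes, inner_dst_mac: bytes,
                         inner_l3_l4_fields: bytes = b"") -> int:
          # All packets of one inner flow get the same FlowID, while
          # distinct flows tend to spread across equal-cost paths that
          # hash on the outer fields and the 32-bit Key field.
          return zlib.crc32(inner_src_mac + inner_dst_mac
                            + inner_l3_l4_fields) & 0xFF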
4.2. Broadcast and Multicast Traffic

   To support broadcast and multicast traffic inside a virtual subnet,
   one or more administratively scoped multicast addresses [7][9] can
   be assigned to the VSID. All multicast or broadcast traffic
   originating from within a VSID is encapsulated and sent to the
   assigned multicast address. From an administrative standpoint, it is
   possible for network operators to configure a PA multicast address
   for each multicast address that is used inside a VSID, to facilitate
   optimal multicast handling. Depending on the hardware capabilities
   of the physical network devices and the physical network
   architecture, multiple virtual subnets may reuse the same physical
   IP multicast address.

   Alternatively, based upon the configuration at the NVE, broadcast
   and multicast traffic in the virtual subnet can be supported using
   N-way unicast. In N-way unicast, the sender NVE sends one
   encapsulated packet to every NVE in the virtual subnet. The sender
   NVE can encapsulate and send the packet as described in Section 4.3
   (Unicast Traffic). This alleviates the need for multicast support in
   the physical network.

4.3. Unicast Traffic

   The NVGRE endpoint encapsulates a Layer-2 packet in GRE using the
   source PA associated with the endpoint, with the destination PA
   corresponding to the location of the destination endpoint. As
   outlined earlier, there can be one or more PAs associated with an
   endpoint, and policy controls which ones get used for communication.
   The encapsulated GRE packet is bridged and routed normally by the
   physical network to the destination PA. Bridging uses the outer
   Ethernet encapsulation for scope on the LAN. The only requirement is
   bi-directional IP connectivity from the underlying physical network.
   At the destination, the NVGRE endpoint decapsulates the GRE packet
   to recover the original Layer-2 frame. Traffic flows similarly on
   the reverse path.
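   Putting the pieces together, the following non-normative Python
   sketch shows the encapsulation and decapsulation path just
   described. It reuses the hypothetical lookup(), build_nvgre_header(),
   and derive_flow_id() helpers from the earlier sketches; LOCAL_MAC
   and the zeroed outer IPv4 checksum (assumed to be filled in by the
   sending stack or NIC) are likewise assumptions of the sketch.

      import socket
      import struct

      LOCAL_MAC = bytes.fromhex("001122334455")  # hypothetical NVE MAC

      def outer_ipv4_header(src_pa, dst_pa, payload_len):
          # Minimal outer IPv4 header with protocol 47 (GRE). DF-bit
          # policy is deferred by this document (Section 4.4), so no
          # flag is asserted here.
          return struct.pack("!BBHHHBBH4s4s",
                             0x45, 0, 20 + payload_len,  # ver/IHL, len
                             0, 0,                       # ID, frag
                             64, 47, 0,                  # TTL,GRE,cksum
                             socket.inet_aton(src_pa),
                             socket.inet_aton(dst_pa))

      def encapsulate(inner_frame, vsid, local_pa):
          dst_mac = ":".join(f"{b:02x}" for b in inner_frame[:6])
          loc = lookup(vsid, dst_mac)  # policy chooses destination PA
          if loc is None:
              raise LookupError("no location mapping for destination")
          gre = build_nvgre_header(
              vsid, derive_flow_id(inner_frame[6:12], inner_frame[:6]))
          eth = (bytes.fromhex(loc.nexthop_mac.replace(":", ""))
                 + LOCAL_MAC + struct.pack("!H", 0x0800))
          ip = outer_ipv4_header(local_pa, loc.provider_address,
                                 len(gre) + len(inner_frame))
          return eth + ip + gre + inner_frame

      def decapsulate(outer_frame):
          # Assumes an untagged outer Ethernet header (14 bytes) and a
          # 20-byte outer IPv4 header; returns (vsid, flow_id, frame).
          vsid, fid = parse_nvgre_header(outer_frame[34:42])
          return vsid, fid, outer_frame[42:]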
4.4. IP Fragmentation

   Section 5.1 of RFC 2003 [11] specifies mechanisms for handling
   fragmentation when encapsulating IP within IP. The subset of
   mechanisms NVGRE selects is intended to ensure that NVGRE-
   encapsulated frames are not fragmented after encapsulation en route
   to the destination NVGRE endpoint, and that traffic sources can
   leverage Path MTU discovery. A future version of this draft will
   clarify the details around setting the DF bit on the outer IP
   header, as well as maintaining per-destination NVGRE endpoint MTU
   soft state so that ICMP Datagram Too Big messages can be exploited.
   Fragmentation behavior when tunneling non-IP Ethernet frames in GRE
   will also be specified in a future version.

4.5. Address/Policy Management & Routing

   Address acquisition is beyond the scope of this document; addresses
   can be obtained statically, dynamically, or using stateless address
   autoconfiguration. CA and PA space can be either IPv4 or IPv6. In
   fact, the address families don't have to match; for example, a CA
   can be IPv4 while the PA is IPv6, and vice versa.

4.6. Cross-subnet, Cross-premise Communication

   One application of this framework is that it provides a seamless
   path for enterprises looking to expand their virtual machine hosting
   capabilities into public clouds. Enterprises can bring their entire
   IP subnet(s) and isolation policies, thus making the transition to
   or from the cloud simpler. It is possible to move portions of an IP
   subnet to the cloud; however, that requires additional configuration
   on the enterprise network and is not discussed in this document.
   Enterprises can continue to use existing communication models like
   site-to-site VPN to secure their traffic.

   A VPN gateway is used to establish a secure site-to-site tunnel over
   the Internet, and all the enterprise services running in virtual
   machines in the cloud use the VPN gateway to communicate back to the
   enterprise. For simplicity, we use a VPN gateway configured as a VM,
   shown in Figure 2, to illustrate cross-subnet, cross-premise
   communication.

   +-----------------------+        +-----------------------+
   |        Server 1       |        |        Server 2       |
   | +--------+ +--------+ |        | +-------------------+ |
   | |  VM1   | |  VM2   | |        | |    VPN Gateway    | |
   | | IP=CA1 | | IP=CA2 | |        | | Internal External | |
   | |        | |        | |        | | IP=CAg   IP=GAdc  | |
   | +--------+ +--------+ |        | +-------------------+ |
   |       Hypervisor      |        |     Hypervisor|   ^   |
   +-----------------------+        +---------------|---:---+
          | IP=PA1                          | IP=PA4    :
          |                                 |           :
          |    +-------------------------+  |           :  VPN
          +----|     Layer 3 Network     |--+           :  Tunnel
               +-------------------------+              :
                            |                           :
   +----------------------------------------------------:--+
   |                                                    :  |
   |                      Internet                      :  |
   |                                                    :  |
   +----------------------------------------------------:--+
          |                                             v
          |                               +-------------------+
          |                               |    VPN Gateway    |
          +-------------------------------|                   |
                                 IP=GAcorp| External IP=GAcorp|
                                          +-------------------+
                                                    |
                                       +-----------------------+
                                       | Corp Layer 3 Network  |
                                       |     (In CA Space)     |
                                       +-----------------------+
                                                    |
                                    +---------------------------+
                                    |          Server X         |
                                    | +----------+ +----------+ |
                                    | | Corp VMe1| | Corp VMe2| |
                                    | | IP=CAe1  | | IP=CAe2  | |
                                    | +----------+ +----------+ |
                                    |         Hypervisor        |
                                    +---------------------------+

           Figure 2: Cross-Subnet, Cross-Premise Communication

   The packet flow is similar to the unicast traffic flow between VMs;
   the key difference in this case is that the packet needs to be sent
   to a VPN gateway before it gets forwarded to the destination. As
   part of the routing configuration in the CA space, a per-tenant VPN
   gateway is provisioned for communication back to the enterprise. The
   example illustrates an outbound connection between VM1 inside the
   datacenter and VMe1 inside the enterprise network. When the outbound
   packet from CA1 to CAe1 reaches the hypervisor on Server 1, the NVE
   in Server 1 can perform the equivalent of a route lookup on the
   packet. The cross-premise packet will match the default gateway
   rule, as CAe1 is not part of the tenant virtual network in the
   datacenter. The virtualization policy indicates that the packet is
   to be encapsulated and sent to the PA of the tenant VPN gateway
   (PA4), running as a VM on Server 2. The packet is decapsulated on
   Server 2 and delivered to the VPN gateway VM. The gateway in turn
   validates the packet and sends it on the site-to-site VPN tunnel
   back to the enterprise network. As the communication here is
   external to the datacenter, the PA address for the VPN tunnel is
   globally routable. The outer header of this packet is sourced from
   GAdc and destined to GAcorp. This packet is routed through the
   Internet to the enterprise VPN gateway, which is the other end of
   the site-to-site tunnel; at that point, the VPN gateway decapsulates
   the packet and sends it inside the enterprise, where CAe1 is
   routable on the network. The reverse path is similar once the packet
   reaches the enterprise VPN gateway.
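   Rendered as code, the route-lookup step above might look like the
   following sketch; the concrete prefixes and PA are invented for
   illustration and simply mirror the roles of PA4 and CAe1 in
   Figure 2.

      import ipaddress

      # Tenant subnets hosted in the datacenter (CA space).
      TENANT_SUBNETS = [ipaddress.ip_network("10.0.1.0/24")]
      VPN_GATEWAY_PA = "192.0.2.4"   # plays the role of PA4

      def next_hop_pa(dst_ca: str, ca_to_pa: dict) -> str:
          dst = ipaddress.ip_address(dst_ca)
          if any(dst in net for net in TENANT_SUBNETS):
              return ca_to_pa[dst_ca]  # intra-DC mapping (Section 4.3)
          # Default gateway rule: CAs outside the tenant's datacenter
          # subnets are steered to the tenant VPN gateway.
          return VPN_GATEWAY_PA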
4.7. Internet Connectivity

   To enable connectivity to the Internet, an Internet gateway is
   needed that bridges the virtualized CA space to the public Internet
   address space. The gateway needs to perform translation between the
   virtualized world and the Internet. For example, the NVGRE endpoint
   can be part of a load balancer or a NAT that replaces the VPN
   gateway on Server 2 shown in Figure 2.

4.8. Management and Control Planes

   There are several protocols that can manage and distribute policy;
   specifying them is out of scope of this document. Implementations
   SHOULD choose a mechanism that meets their scale requirements.

4.9. NVGRE-Aware Devices

   One example of a typical deployment consists of virtualized servers
   deployed across multiple racks connected by one or more layers of
   Layer-2 switches, which in turn may be connected to a Layer-3
   routing domain. Even though routing in the physical infrastructure
   will work without any modification with NVGRE, devices that perform
   specialized processing in the network need to be able to parse GRE
   to get access to tenant-specific information. Devices that
   understand and parse the VSID can provide rich multi-tenancy-aware
   services inside the data center. As outlined earlier, it is
   imperative to exploit multiple paths inside the network through
   techniques such as Equal-Cost Multipath (ECMP). The Key field (a
   32-bit field, including both the VSID and the optional FlowID) can
   provide additional entropy to the switches to exploit path diversity
   inside the network. A diverse ecosystem is expected to emerge as
   more and more devices become multi-tenant aware. In the interim,
   without requiring any hardware upgrades, there are alternatives for
   exploiting path diversity with GRE, such as associating multiple PAs
   with NVGRE endpoints, with policy controlling the choice of PA to be
   used.

   It is expected that communication can span multiple data centers and
   also cross the virtual-to-physical boundary. Typical scenarios that
   require virtual-to-physical communication include access to storage
   and databases. Scenarios demanding lossless Ethernet functionality
   may not be amenable to NVGRE, as traffic is carried over an IP
   network. NVGRE endpoints mediate between the network-virtualized and
   non-network-virtualized environments. This functionality can be
   incorporated into Top-of-Rack switches, storage appliances, load
   balancers, routers, etc., or built as a stand-alone appliance.

   It is imperative to consider the impact of any solution on host
   performance. Today's server operating systems employ sophisticated
   acceleration techniques such as checksum offload, Large Send Offload
   (LSO), Receive Segment Coalescing (RSC), Receive Side Scaling (RSS),
   Virtual Machine Queue (VMQ), etc. These technologies should become
   NVGRE-aware. IPsec Security Associations (SAs) can be offloaded to
   the NIC so that computationally expensive cryptographic operations
   are performed at line rate in the NIC hardware. These SAs are based
   on the IP addresses of the endpoints. As each packet on the wire
   gets translated, the NVGRE endpoint SHOULD intercept the offload
   requests and do the appropriate address translation. This will
   ensure that IPsec continues to be usable with network virtualization
   while taking advantage of hardware offload capabilities for improved
   performance.

4.10. Network Scalability with NVGRE

   One of the key benefits of using NVGRE is the IP address
   scalability, and in turn the MAC address table scalability, that can
   be achieved. An NVGRE endpoint can use one PA to represent multiple
   CAs. This lowers the burden on the MAC address table sizes at the
   Top-of-Rack switches. One obvious benefit is in the context of
   server virtualization, which has increased the demands on the
   network infrastructure. By embedding an NVGRE endpoint in a
   hypervisor, it is possible to scale significantly. This framework
   allows location information to be preconfigured inside an NVGRE
   endpoint, allowing broadcast ARP traffic to be proxied locally. This
   approach can scale to large virtual subnets. These virtual subnets
   can be spread across multiple Layer-3 physical subnets. It allows
   workloads to be moved around without imposing a huge burden on the
   network control plane. By eliminating most broadcast traffic and
   converting others to multicast, the routers and switches can
   function more efficiently by building efficient multicast trees. By
   using server and network capacity efficiently, it is possible to
   drive down the cost of building and managing data centers.
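   As a sketch of the local ARP proxying mentioned above (again an
   illustration, not behavior this document prescribes; the cache
   contents are hypothetical), an endpoint with preconfigured CA-to-MAC
   mappings can answer a VM's ARP request without any broadcast leaving
   the host:

      # (target CA, VSID) -> MAC of that CA, preconfigured per policy
      ARP_CACHE = {("10.0.1.20", 0x001000): "00:15:5d:04:05:06"}

      def proxy_arp(vsid: int, target_ca: str):
          """Return the MAC to answer with locally, or None if the
          request must be handled via Section 4.2 mechanisms."""
          return ARP_CACHE.get((target_ca, vsid))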
5. Security Considerations

   This proposal extends the Layer-2 subnet across the data center and
   increases the scope for spoofing attacks. Mitigation of such attacks
   is possible with authentication/encryption using IPsec or any other
   IP-based mechanism. The control plane for policy distribution is
   expected to be secured by using any of the existing security
   protocols. Furthermore, management traffic can be isolated in a
   separate subnet/VLAN.

   The checksum in the GRE header is not supported. The mitigation for
   this is to deploy an NVGRE-based solution in a network that provides
   error detection along the NVGRE packet path, for example using the
   Ethernet CRC, IPsec, or any other error detection mechanism.

6. IANA Considerations

   This document has no IANA actions.

7. References

7.1. Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [2]  "Ethertypes",
        ftp://ftp.isi.edu/in-notes/iana/assignments/ethernet-numbers

   [3]  Farinacci, D., et al., "Generic Routing Encapsulation (GRE)",
        RFC 2784, March 2000.

   [4]  Dommety, G., "Key and Sequence Number Extensions to GRE", RFC
        2890, September 2000.

7.2. Informative References

   [5]  Greenberg, A., et al., "VL2: A Scalable and Flexible Data
        Center Network", Proc. ACM SIGCOMM 2009.

   [6]  Greenberg, A., et al., "The Cost of a Cloud: Research Problems
        in the Data Center", ACM SIGCOMM Computer Communication Review.

   [7]  Hinden, R. and S. Deering, "IP Version 6 Addressing
        Architecture", RFC 4291, February 2006.

   [8]  Lasserre, M., et al., "Framework for DC Network
        Virtualization", draft-ietf-nvo3-framework (work in progress),
        July 2014.

   [9]  Meyer, D., "Administratively Scoped IP Multicast", BCP 23, RFC
        2365, July 1998.

   [10] Narten, T., et al., "Problem Statement: Overlays for Network
        Virtualization", draft-ietf-nvo3-overlay-problem-statement
        (work in progress), July 2013.

   [11] Perkins, C., "IP Encapsulation within IP", RFC 2003, October
        1996.

   [12] Touch, J. and R. Perlman, "Transparent Interconnection of Lots
        of Links (TRILL): Problem and Applicability Statement", RFC
        5556, May 2009.

8. Authors and Contributors

   M. Sridharan
   A. Greenberg
   Y. Wang
   P. Garg
   N. Venkataramiah
   Microsoft

   K. Duda
   Arista Networks

   I. Ganga
   Intel

   G. Lin
   Google

   M. Pearson
   Hewlett-Packard

   P. Thaler
   Broadcom

   C. Tumuluri
   Emulex

9. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.
Authors' Addresses

   Murari Sridharan
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: muraris@microsoft.com

   Yu-Shun Wang
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: yushwang@microsoft.com

   Albert Greenberg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: albert@microsoft.com

   Pankaj Garg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: pankajg@microsoft.com

   Narasimhan Venkataramiah
   Facebook Inc
   1730 Minor Ave.
   Seattle, WA 98101
   Email: navenkat@microsoft.com

   Kenneth Duda
   Arista Networks, Inc.
   5470 Great America Pkwy
   Santa Clara, CA 95054
   Email: kduda@aristanetworks.com

   Ilango Ganga
   Intel Corporation
   2200 Mission College Blvd.
   M/S: SC12-325
   Santa Clara, CA 95054
   Email: ilango.s.ganga@intel.com

   Geng Lin
   Google
   1600 Amphitheatre Parkway
   Mountain View, California 94043
   Email: genglin@google.com

   Mark Pearson
   Hewlett-Packard Co.
   8000 Foothills Blvd.
   Roseville, CA 95747
   Email: mark.pearson@hp.com

   Patricia Thaler
   Broadcom Corporation
   3151 Zanker Road
   San Jose, CA 95134
   Email: pthaler@broadcom.com

   Chait Tumuluri
   Emulex Corporation
   3333 Susan Street
   Costa Mesa, CA 92626
   Email: chait@emulex.com