Network Working Group                                       M. Sridharan
Internet Draft                                               A. Greenberg
Intended Category: Informational                                  Y. Wang
Expires: February 2014                                            P. Garg
                                                                Microsoft
                                                         N. Venkataramiah
                                                                 Facebook
                                                                  K. Duda
                                                          Arista Networks
                                                                 I. Ganga
                                                                    Intel
                                                                   G. Lin
                                                                   Google
                                                               M. Pearson
                                                          Hewlett-Packard
                                                                P. Thaler
                                                                 Broadcom
                                                              C. Tumuluri
                                                                   Emulex
                                                              August 2013

    NVGRE: Network Virtualization using Generic Routing Encapsulation
                 draft-sridharan-virtualization-nvgre-03.txt

Status of this Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind; instead it relies on a
   proposed standard.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   This Internet-Draft will expire on February 12, 2014.

Abstract

   This document describes the use of the Generic Routing Encapsulation
   (GRE) header for Network Virtualization (NVGRE) in multi-tenant
   datacenters.  Network Virtualization decouples virtual networks and
   addresses from the physical network infrastructure, providing
   isolation and concurrency between multiple virtual networks on the
   same physical network infrastructure.  This document also introduces
   a Network Virtualization framework to illustrate the use cases, but
   the focus is on specifying the data plane aspects of NVGRE.

Table of Contents

   1. Introduction
      1.1. Terminology
   2. Conventions used in this document
   3. NVGRE: Network Virtualization using GRE
      3.1. NVGRE Endpoint
      3.2. NVGRE frame format
      3.3. Reserved VSID
   4. NVGRE Deployment Considerations
      4.1. ECMP Support
      4.2. Broadcast and Multicast Traffic
      4.3. Unicast Traffic
      4.4. IP Fragmentation
      4.5. Address/Policy Management & Routing
      4.6. Cross-subnet, Cross-premise Communication
      4.7. Internet Connectivity
      4.8. Management and Control Planes
      4.9. NVGRE-Aware Devices
      4.10. Network Scalability with NVGRE
   5. Security Considerations
   6. IANA Considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments

1. Introduction

   Conventional data center network designs cater to largely static
   workloads and cause fragmentation of network and server capacity
   [5][6].  There are several issues that limit dynamic allocation and
   consolidation of capacity.  Layer-2 networks use the Rapid Spanning
   Tree Protocol (RSTP), which is designed to eliminate loops by
   blocking redundant paths.  These eliminated paths translate to
   wasted capacity and a highly oversubscribed network.  There are
   alternative approaches, such as TRILL, that address this problem
   [12].

   These network utilization inefficiencies are exacerbated by network
   fragmentation due to the use of VLANs for broadcast isolation.
   VLANs are used for traffic management and also as the mechanism for
   providing security and performance isolation among services
   belonging to different tenants.  The Layer-2 network is carved into
   smaller subnets, typically one subnet per VLAN, with VLAN tags
   configured on all the Layer-2 switches connected to the server
   racks that host a given tenant's services.  The current VLAN limits
   theoretically allow for 4K such subnets to be created.
   The size of the broadcast domain is typically restricted due to the
   overhead of broadcast traffic (e.g., ARP).  The 4K VLAN limit is no
   longer sufficient in a shared infrastructure servicing multiple
   tenants.

   Data center operators must be able to achieve high utilization of
   server and network capacity.  In order to achieve efficiency, it
   should be possible to assign workloads that operate in a single
   Layer-2 network to any server in any rack in the network.  It
   should also be possible to migrate workloads to any server anywhere
   in the network while retaining the workloads' addresses.  This can
   be achieved today by stretching VLANs; however, when workloads
   migrate, the network needs to be reconfigured, which is typically
   error prone.  By decoupling the workload's location on the LAN from
   its network address, the network administrator configures the
   network once and not every time a service migrates.  This
   decoupling enables any server to become part of any server resource
   pool.

   The following are key design objectives for next-generation data
   centers:

   a) location-independent addressing

   b) the ability to scale the number of logical Layer-2/Layer-3
      networks, irrespective of the underlying physical topology or
      the number of VLANs

   c) preserving Layer-2 semantics for services and allowing them to
      retain their addresses as they move within and across data
      centers

   d) providing broadcast isolation as workloads move around, without
      burdening the network control plane

   This document describes the use of the Generic Routing
   Encapsulation (GRE) header [3][4] for network virtualization.
   Network virtualization decouples a virtual network from the
   underlying physical network infrastructure by virtualizing network
   addresses.  Combined with a management and control plane for the
   virtual-to-physical mapping, network virtualization can enable
   flexible VM placement and movement and provide network isolation
   for a multi-tenant datacenter.

   Network virtualization enables customers to bring their own address
   spaces into a multi-tenant datacenter, while the datacenter
   administrators can place the customer VMs anywhere in the
   datacenter without reconfiguring their network switches or routers,
   irrespective of the customer address spaces.

1.1. Terminology

   Please refer to [8][10] for a more formal definition of the
   terminology.  The following terms are used in this document.

   Customer Address (CA): The virtual IP addresses assigned and
   configured on the virtual NIC within each VM.  These are the only
   addresses visible to VMs and to applications running within VMs.

   NVE: Network Virtualization Edge, the entity that performs the
   network virtualization encapsulation and decapsulation.

   Provider Address (PA): The IP addresses used in the physical
   network.  PAs are associated with VM CAs through the network
   virtualization mapping policy.

   VM: Virtual Machine.  Virtual machines are typically instances of
   operating systems running on top of a hypervisor on a physical
   machine or server.  Multiple VMs can share the same physical server
   via the hypervisor, yet are completely isolated from each other in
   terms of compute, storage, and other OS resources.

   VSID: Virtual Subnet Identifier, a 24-bit ID that uniquely
   identifies a virtual subnet or virtual Layer-2 broadcast domain.
2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [1].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lowercase uses of these words are not to be
   interpreted as carrying RFC 2119 significance.

3. NVGRE: Network Virtualization using GRE

   This section describes Network Virtualization using GRE (NVGRE).
   Network virtualization involves creating virtual Layer-2 topologies
   on top of a physical Layer-3 network.  Connectivity in the virtual
   topology is provided by tunneling Ethernet frames in GRE over the
   physical network.

   In NVGRE, every virtual Layer-2 network is associated with a 24-bit
   identifier called the Virtual Subnet Identifier (VSID).  The VSID
   is carried in the outer header, as defined in Section 3.2, allowing
   unique identification of a tenant's virtual subnet by the various
   devices in the network.  A 24-bit VSID supports up to 16 million
   virtual subnets in the same management domain, in contrast to the
   4K achievable with VLANs.  Each VSID represents a virtual Layer-2
   broadcast domain, which can be used to identify a virtual subnet of
   a given tenant.  To support a multi-subnet virtual topology,
   datacenter administrators can configure routes to facilitate
   communication between virtual subnets of the same tenant.

   GRE is a proposed IETF standard [3][4] that provides a way to
   encapsulate an arbitrary protocol over IP.  NVGRE leverages the GRE
   header to carry VSID information in each packet.  The VSID
   information in each packet can be used to build multi-tenant-aware
   tools for traffic analysis, traffic inspection, and monitoring.

   The following sections detail the packet format for NVGRE, describe
   the functions of an NVGRE endpoint, illustrate typical traffic
   flows both within and across data centers, and discuss address and
   policy management as well as deployment considerations.

3.1. NVGRE Endpoint

   NVGRE endpoints are the ingress/egress points between the virtual
   and the physical networks.  The NVGRE endpoints are the NVEs as
   defined in the NVO3 Framework document [8].  Any physical server or
   network device can be an NVGRE endpoint.  One common deployment is
   for the endpoint to be part of a hypervisor.  The primary function
   of this endpoint is to encapsulate/decapsulate Ethernet data frames
   to and from the GRE tunnel, ensure Layer-2 semantics, and apply
   isolation policy scoped on the VSID.  The endpoint can optionally
   participate in routing and function as a gateway in the virtual
   topology.  To encapsulate an Ethernet frame, the endpoint needs to
   know the location information for the destination address in the
   frame.  This information can be provisioned via a management plane
   or obtained via a combination of control-plane distribution and
   data-plane learning approaches.  This document assumes that the
   location information, including the VSID, is available to the NVGRE
   endpoint.
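   As a non-normative illustration, the following Python sketch shows
   one possible shape for the per-VSID location state described above.
   All names here are hypothetical, and how the table is populated
   (management plane, control plane, or data-plane learning) remains
   out of scope of this document.

      from dataclasses import dataclass

      @dataclass(frozen=True)
      class Location:
          provider_ip: str    # PA of the destination NVGRE endpoint
          next_hop_mac: str   # outer destination MAC toward that PA

      # (VSID, inner destination MAC) -> Location.  Entries are
      # assumed to be provisioned by out-of-scope mechanisms.
      location_table = {}

      def lookup_location(vsid, inner_dst_mac):
          """Return the PA/next hop for a CA MAC in a VSID, if known."""
          return location_table.get((vsid, inner_dst_mac))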
3.2. NVGRE frame format

   The GRE header format as specified in RFC 2784 and RFC 2890 [3][4]
   is used for communication between NVGRE endpoints.  NVGRE leverages
   the Key extension specified in RFC 2890 to carry the VSID.  The
   packet format for Layer-2 encapsulation in GRE is shown in
   Figure 1.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Outer) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Outer)Destination MAC Address |  (Outer)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Outer) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype 0x0800        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live | Protocol 0x2F |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   (Outer) Source Address                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 (Outer) Destination Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   GRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|0|    Reserved0    | Ver |   Protocol Type 0x6558        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               Virtual Subnet ID (VSID)        |    FlowID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Inner) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Inner)Destination MAC Address |  (Inner)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Inner) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q| PCP |0|     VID set to 0      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype 0x0800        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Original IP Payload                         |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             Figure 1  GRE Encapsulation Frame Format
   The outer/delivery headers include the outer Ethernet header and
   the outer IP header:

   o The outer Ethernet header: The source Ethernet address in the
   outer frame is set to the MAC address associated with the NVGRE
   endpoint.  The destination endpoint may or may not be on the same
   physical subnet.  The destination Ethernet address is set to the
   MAC address of the next-hop IP address for the destination NVE.
   The outer VLAN tag information is optional and can be used for
   traffic management and broadcast scalability on the physical
   network.

   o The outer IP header: Both IPv4 and IPv6 can be used as the
   delivery protocol for GRE.  The IPv4 header is shown for
   illustrative purposes.  Henceforth, the IP address in the outer
   frame is referred to as the Provider Address (PA).  There can be
   one or more PAs associated with an NVGRE endpoint, with policy
   controlling the choice of PA to use for a given Customer Address
   (CA) of a customer VM.

   The GRE header:

   o The C (Checksum Present) and S (Sequence Number Present) bits in
   the GRE header MUST be zero.

   o The K (Key Present) bit in the GRE header MUST be set to one.
   The 32-bit Key field in the GRE header is used to carry the Virtual
   Subnet ID (VSID) and the FlowID:

      - Virtual Subnet ID (VSID): a 24-bit value that is used to
        identify the NVGRE-based virtual Layer-2 network.

      - FlowID: an 8-bit value that is used to provide per-flow
        entropy for flows in the same VSID.  The FlowID MUST NOT be
        modified by transit devices.  The encapsulating NVE SHOULD
        provide as much entropy as possible in the FlowID.  If a
        FlowID is not generated, it MUST be set to all zeros.

   o The Protocol Type field in the GRE header is set to 0x6558
   (Transparent Ethernet Bridging) [2].

   The inner headers (headers of the GRE payload):

   o The inner Ethernet frame comprises an inner Ethernet header
   followed by an optional inner IP header, followed by the IP
   payload.  The inner frame could be any Ethernet data frame, not
   just IP.  Note that the inner Ethernet frame's FCS is not
   encapsulated.

   o Inner 802.1Q tag: The inner Ethernet header of NVGRE MUST NOT
   contain an 802.1Q tag.  The encapsulating NVE MUST remove any
   existing 802.1Q tag before encapsulating the frame in NVGRE.  A
   decapsulating NVE MUST drop the frame if the inner Ethernet frame
   contains an 802.1Q tag.

   o For illustrative purposes, IPv4 headers are shown as the inner IP
   headers, but IPv6 headers may be used.  Henceforth, the IP address
   contained in the inner frame is referred to as the Customer Address
   (CA).
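   The GRE header encoding above can be made concrete with a short,
   non-normative Python sketch.  It assumes the fixed bit layout of
   Figure 1 (C=0, K=1, S=0, Ver=0, so the first 16 bits are 0x2000)
   and packs the 24-bit VSID and 8-bit FlowID into the 32-bit Key
   field; the function names are illustrative only.

      import struct

      GRE_FLAGS_NVGRE = 0x2000    # C=0, K=1, S=0, Reserved0=0, Ver=0
      PROTOCOL_TYPE_TEB = 0x6558  # Transparent Ethernet Bridging [2]

      def build_nvgre_gre_header(vsid, flow_id=0):
          """Build the 8-byte GRE header carrying VSID and FlowID."""
          if not 0 <= vsid <= 0xFFFFFF:
              raise ValueError("VSID is a 24-bit value")
          if not 0 <= flow_id <= 0xFF:
              raise ValueError("FlowID is an 8-bit value")
          key = (vsid << 8) | flow_id   # Key = VSID followed by FlowID
          return struct.pack("!HHI", GRE_FLAGS_NVGRE,
                             PROTOCOL_TYPE_TEB, key)

      def parse_nvgre_gre_header(header):
          """Return (vsid, flow_id); reject non-NVGRE GRE headers."""
          flags, proto, key = struct.unpack("!HHI", header[:8])
          if flags != GRE_FLAGS_NVGRE or proto != PROTOCOL_TYPE_TEB:
              raise ValueError("not an NVGRE GRE header")
          return key >> 8, key & 0xFF

   For example, build_nvgre_gre_header(0x123456, 0x7F) yields the
   bytes 20 00 65 58 12 34 56 7F on the wire.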
   From an administrative standpoint, it is possible for network
   operators to configure a PA multicast address for each multicast
   address that is used inside a VSID, to facilitate optimal multicast
   handling.  Depending on the hardware capabilities of the physical
   network devices and the physical network architecture, multiple
   virtual subnets may reuse the same physical IP multicast address.

   Alternatively, based upon the configuration at the NVE, broadcast
   and multicast traffic in the virtual subnet can be supported using
   N-way unicast, as sketched below.  In N-way unicast, the sender NVE
   sends one encapsulated packet to every NVE in the virtual subnet.
   The sender NVE can encapsulate and send the packet as described in
   Section 4.3 (Unicast Traffic).  This alleviates the need for
   multicast support in the physical network.
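   The following non-normative Python sketch illustrates the N-way
   unicast replication just described.  The encapsulate and send
   helpers, and the per-VSID membership list, are hypothetical;
   obtaining the membership is a control/management plane task outside
   the scope of this document.

      def nway_unicast(vsid, inner_frame, member_pas, local_pa,
                       encapsulate, send):
          """Replicate a broadcast/multicast frame to every other NVE
          in the virtual subnet as N-1 NVGRE unicast packets."""
          for dst_pa in member_pas:
              if dst_pa == local_pa:
                  continue          # do not send a copy to ourselves
              send(encapsulate(vsid, inner_frame, dst_pa))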
4.3. Unicast Traffic

   The NVGRE endpoint encapsulates a Layer-2 packet in GRE using a
   source PA associated with the endpoint and a destination PA
   corresponding to the location of the destination endpoint.  As
   outlined earlier, there can be one or more PAs associated with an
   endpoint, and policy controls which ones are used for
   communication.  The encapsulated GRE packet is bridged and routed
   normally by the physical network to the destination PA.  Bridging
   uses the outer Ethernet encapsulation for scope on the LAN.  The
   only requirement is bidirectional IP connectivity from the
   underlying physical network.  At the destination, the NVGRE
   endpoint decapsulates the GRE packet to recover the original
   Layer-2 frame.  Traffic flows similarly on the reverse path.

4.4. IP Fragmentation

   RFC 2003 [11], Section 5.1, specifies mechanisms for handling
   fragmentation when encapsulating IP within IP.  The subset of
   mechanisms NVGRE selects is intended to ensure that NVGRE-
   encapsulated frames are not fragmented after encapsulation en route
   to the destination NVGRE endpoint, and that traffic sources can
   leverage Path MTU Discovery.  A future version of this draft will
   clarify the details around setting the DF bit on the outer IP
   header, as well as maintaining per-destination NVGRE endpoint MTU
   soft state so that ICMP Datagram Too Big messages can be exploited.
   Fragmentation behavior when tunneling non-IP Ethernet frames in GRE
   will also be specified in a future version.

4.5. Address/Policy Management & Routing

   Addresses can be assigned statically, dynamically, or using
   stateless address autoconfiguration; address acquisition is
   otherwise beyond the scope of this document.  The CA and PA spaces
   can be either IPv4 or IPv6.  In fact, the address families don't
   have to match; for example, a CA can be IPv4 while the PA is IPv6,
   and vice versa.

4.6. Cross-subnet, Cross-premise Communication

   One application of this framework is that it provides a seamless
   path for enterprises looking to expand their virtual machine
   hosting capabilities into public clouds.  Enterprises can bring
   their entire IP subnet(s) and isolation policies, thus making the
   transition to or from the cloud simpler.  It is possible to move
   portions of an IP subnet to the cloud; however, that requires
   additional configuration on the enterprise network and is not
   discussed in this document.  Enterprises can continue to use
   existing communication models, like site-to-site VPN, to secure
   their traffic.

   A VPN gateway is used to establish a secure site-to-site tunnel
   over the Internet, and all the enterprise services running in
   virtual machines in the cloud use the VPN gateway to communicate
   back to the enterprise.  For simplicity, we use a VPN gateway
   configured as a VM, shown in Figure 2, to illustrate cross-subnet,
   cross-premise communication.

   +-----------------------+         +-----------------------+
   |        Server 1       |         |        Server 2       |
   | +--------+ +--------+ |         | +-------------------+ |
   | |  VM1   | |  VM2   | |         | |    VPN Gateway    | |
   | | IP=CA1 | | IP=CA2 | |         | | Internal External | |
   | |        | |        | |         | | IP=CAg  IP=GAdc   | |
   | +--------+ +--------+ |         | +-------------------+ |
   |       Hypervisor      |         |    Hypervisor |   ^   |
   +-----------------------+         +---------------|---:---+
           | IP=PA1                          | IP=PA4    :
           |                                 |           :
           |   +-------------------------+   |           :  VPN
           +---|     Layer 3 Network     |---+           :  Tunnel
               +-------------------------+               :
                            |                            :
   +----------------------------------------------------:---+
   |                                                    :   |
   |                      Internet                      :   |
   |                                                    :   |
   +----------------------------------------------------:---+
                            |                           v
                            |          +-------------------+
                            |          |    VPN Gateway    |
                            +----------|                   |
                                       |External IP=GAcorp |
                                       +-------------------+
                                                 |
                                  +-----------------------+
                                  | Corp Layer 3 Network  |
                                  |     (In CA Space)     |
                                  +-----------------------+
                                                 |
                                 +---------------------------+
                                 |         Server X          |
                                 | +----------+ +----------+ |
                                 | | Corp VMe1| | Corp VMe2| |
                                 | | IP=CAe1  | | IP=CAe2  | |
                                 | +----------+ +----------+ |
                                 |         Hypervisor        |
                                 +---------------------------+

          Figure 2  Cross-Subnet, Cross-Premise Communication

   The packet flow is similar to the unicast traffic flow between VMs;
   the key difference is that in this case the packet needs to be sent
   to a VPN gateway before it gets forwarded to the destination.  As
   part of routing configuration in the CA space, a per-tenant VPN
   gateway is provisioned for communication back to the enterprise.
   The example illustrates an outbound connection between VM1 inside
   the datacenter and VMe1 inside the enterprise network.  When the
   outbound packet from CA1 to CAe1 reaches the hypervisor on
   Server 1, the NVE in Server 1 can perform the equivalent of a route
   lookup on the packet, as sketched below.  The cross-premise packet
   will match the default gateway rule, as CAe1 is not part of the
   tenant virtual network in the datacenter.  The virtualization
   policy will indicate that the packet is to be encapsulated and sent
   to the PA of the tenant VPN gateway (PA4) running as a VM on
   Server 2.  The packet is decapsulated on Server 2 and delivered to
   the gateway VM.  The gateway in turn validates the packet and sends
   it over the site-to-site VPN tunnel back to the enterprise network.
   As the communication here is external to the datacenter, the PA
   address for the VPN tunnel is globally routable.  The outer header
   of this packet is sourced from GAdc and destined to GAcorp.  This
   packet is routed through the Internet to the enterprise VPN
   gateway, which is the other end of the site-to-site tunnel; there,
   the VPN gateway decapsulates the packet and sends it inside the
   enterprise, where CAe1 is routable on the network.  The reverse
   path is similar once the packet reaches the enterprise VPN gateway.
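   The "equivalent of a route lookup" performed by the NVE can be
   illustrated with a non-normative longest-prefix-match sketch in
   Python.  The route/policy structure and the addresses are
   hypothetical; in the example of Figure 2, a packet to CAe1 would
   miss the tenant's subnet routes and fall through to the default
   route, whose policy points at the VPN gateway's PA (PA4).

      import ipaddress

      def select_destination_pa(policy_routes, inner_dst_ca):
          """policy_routes: iterable of (CA prefix, destination PA),
          e.g. [("10.1.0.0/16", "192.0.2.10"),   # tenant subnet
                ("0.0.0.0/0",   "192.0.2.4")]    # default -> VPN GW
          Returns the PA for the longest matching prefix."""
          dst = ipaddress.ip_address(inner_dst_ca)
          best_len, best_pa = -1, None
          for prefix, pa in policy_routes:
              net = ipaddress.ip_network(prefix)
              if dst in net and net.prefixlen > best_len:
                  best_len, best_pa = net.prefixlen, pa
          return best_pa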
4.7. Internet Connectivity

   To enable connectivity to the Internet, an Internet gateway is
   needed that bridges the virtualized CA space to the public Internet
   address space.  The gateway needs to perform translation between
   the virtualized world and the Internet.  For example, the NVGRE
   endpoint can be part of a load balancer or a NAT that replaces the
   VPN gateway on Server 2 shown in Figure 2.

4.8. Management and Control Planes

   There are several protocols that can manage and distribute policy;
   however, they are out of scope of this document.  Implementations
   SHOULD choose a mechanism that meets their scale requirements.

4.9. NVGRE-Aware Devices

   One example of a typical deployment consists of virtualized servers
   deployed across multiple racks connected by one or more layers of
   Layer-2 switches, which in turn may be connected to a Layer-3
   routing domain.  Even though routing in the physical infrastructure
   will work without any modification with NVGRE, devices that perform
   specialized processing in the network need to be able to parse GRE
   to get access to tenant-specific information.  Devices that
   understand and parse the VSID can provide rich multi-tenancy-aware
   services inside the data center.  As outlined earlier, it is
   imperative to exploit multiple paths inside the network through
   techniques such as Equal-Cost Multipath (ECMP).  The Key field (a
   32-bit field including both the VSID and the optional FlowID) can
   provide additional entropy to the switches to exploit path
   diversity inside the network; a non-normative sketch appears at the
   end of this section.  A diverse ecosystem is expected to emerge as
   more and more devices become multi-tenant aware.  In the interim,
   without requiring any hardware upgrades, there are alternatives to
   exploit path diversity with GRE, such as associating multiple PAs
   with NVGRE endpoints, with policy controlling the choice of PA to
   be used.

   It is expected that communication can span multiple data centers
   and also cross the virtual-to-physical boundary.  Typical scenarios
   that require virtual-to-physical communication include access to
   storage and databases.  Scenarios demanding lossless Ethernet
   functionality may not be amenable to NVGRE, as traffic is carried
   over an IP network.  NVGRE endpoints mediate between the network-
   virtualized and non-network-virtualized environments.  This
   functionality can be incorporated into Top-of-Rack switches,
   storage appliances, load balancers, routers, etc., or built as a
   stand-alone appliance.

   It is imperative to consider the impact of any solution on host
   performance.  Today's server operating systems employ sophisticated
   acceleration techniques such as checksum offload, Large Send
   Offload (LSO), Receive Segment Coalescing (RSC), Receive Side
   Scaling (RSS), Virtual Machine Queue (VMQ), etc.  These
   technologies should become NVGRE-aware.  IPsec Security
   Associations (SAs) can be offloaded to the NIC so that
   computationally expensive cryptographic operations are performed at
   line rate in the NIC hardware.  These SAs are based on the IP
   addresses of the endpoints.  As each packet on the wire gets
   translated, the NVGRE endpoint SHOULD intercept the offload
   requests and do the appropriate address translation.  This will
   ensure that IPsec continues to be usable with network
   virtualization while taking advantage of hardware offload
   capabilities for improved performance.
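   As a non-normative illustration of the entropy argument above, the
   sketch below shows ECMP-style path selection that hashes the outer
   IP addresses together with the entire 32-bit Key field (VSID plus
   FlowID), as recommended in Section 4.1.  Real devices use hardware
   hash functions; CRC-32 merely stands in for one here.

      import zlib

      def select_ecmp_path(outer_src_ip, outer_dst_ip, gre_key,
                           num_paths):
          """Pick one of num_paths equal-cost paths for an NVGRE
          packet.  outer_src_ip/outer_dst_ip are raw address bytes;
          gre_key is the 32-bit Key field: (VSID << 8) | FlowID."""
          material = (outer_src_ip + outer_dst_ip +
                      gre_key.to_bytes(4, "big"))
          return zlib.crc32(material) % num_paths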
4.10. Network Scalability with NVGRE

   One of the key benefits of using NVGRE is the IP address
   scalability, and in turn the MAC address table scalability, that
   can be achieved.  An NVGRE endpoint can use one PA to represent
   multiple CAs.  This lowers the burden on the MAC address table
   sizes at the Top-of-Rack switches.  One obvious benefit is in the
   context of server virtualization, which has increased the demands
   on the network infrastructure.  By embedding an NVGRE endpoint in a
   hypervisor, it is possible to scale significantly.  This framework
   allows location information to be preconfigured inside an NVGRE
   endpoint, allowing broadcast ARP traffic to be proxied locally.
   This approach can scale to large virtual subnets.  These virtual
   subnets can be spread across multiple Layer-3 physical subnets.  It
   allows workloads to be moved around without imposing a huge burden
   on the network control plane.  By eliminating most broadcast
   traffic and converting the rest to multicast, routers and switches
   can function more efficiently by building efficient multicast
   trees.  By using server and network capacity efficiently, it is
   possible to drive down the cost of building and managing data
   centers.

5. Security Considerations

   This proposal extends the Layer-2 subnet across the data center and
   increases the scope for spoofing attacks.  Mitigation of such
   attacks is possible with authentication/encryption using IPsec or
   any other IP-based mechanism.  The control plane for policy
   distribution is expected to be secured using any of the existing
   security protocols.  Further, management traffic can be isolated in
   a separate subnet/VLAN.

6. IANA Considerations

   This document has no IANA actions.

7. References

7.1. Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.

   [2]  "Ethertypes",
        ftp://ftp.isi.edu/in-notes/iana/assignments/ethernet-numbers

   [3]  Farinacci, D., et al., "Generic Routing Encapsulation (GRE)",
        RFC 2784, March 2000.

   [4]  Dommety, G., "Key and Sequence Number Extensions to GRE",
        RFC 2890, September 2000.

7.2. Informative References

   [5]  Greenberg, A., et al., "VL2: A Scalable and Flexible Data
        Center Network", Proc. ACM SIGCOMM 2009.

   [6]  Greenberg, A., et al., "The Cost of a Cloud: Research
        Problems in the Data Center", ACM SIGCOMM Computer
        Communication Review.

   [7]  Hinden, R. and S. Deering, "IP Version 6 Addressing
        Architecture", RFC 4291, February 2006.

   [8]  Lasserre, M., et al., "Framework for DC Network
        Virtualization", draft-ietf-nvo3-framework (work in
        progress), February 2013.

   [9]  Meyer, D., "Administratively Scoped IP Multicast", BCP 23,
        RFC 2365, July 1998.

   [10] Narten, T., et al., "Problem Statement: Overlays for Network
        Virtualization", draft-narten-nvo3-overlay-problem-statement
        (work in progress), February 2013.

   [11] Perkins, C., "IP Encapsulation within IP", RFC 2003, October
        1996.

   [12] Touch, J. and R. Perlman, "Transparent Interconnection of
        Lots of Links (TRILL): Problem and Applicability Statement",
        RFC 5556, May 2009.

8. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.
Authors' Addresses

   Murari Sridharan
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: muraris@microsoft.com

   Yu-Shun Wang
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: yushwang@microsoft.com

   Albert Greenberg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: albert@microsoft.com

   Pankaj Garg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: pankajg@microsoft.com

   Kenneth Duda
   Arista Networks, Inc.
   5470 Great America Pkwy
   Santa Clara, CA 95054
   Email: kduda@aristanetworks.com

   Ilango Ganga
   Intel Corporation
   2200 Mission College Blvd.
   M/S: SC12-325
   Santa Clara, CA 95054
   Email: ilango.s.ganga@intel.com

   Geng Lin
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   Email: genglin@google.com

   Mark Pearson
   Hewlett-Packard Co.
   8000 Foothills Blvd.
   Roseville, CA 95747
   Email: mark.pearson@hp.com

   Patricia Thaler
   Broadcom Corporation
   3151 Zanker Road
   San Jose, CA 95134
   Email: pthaler@broadcom.com

   Chait Tumuluri
   Emulex Corporation
   3333 Susan Street
   Costa Mesa, CA 92626
   Email: chait@emulex.com

   Narasimhan Venkataramiah
   Facebook Inc
   1730 Minor Ave.
   Seattle, WA 98101
   Email: narasimhan.av@gmail.com