Network Working Group                                       M. Sridharan
Internet Draft                                               A. Greenberg
Intended Category: Informational                                  Y. Wang
Expires: January 30, 2015                                         P. Garg
                                                         N. Venkataramiah
                                                                Microsoft
                                                                  K. Duda
                                                          Arista Networks
                                                                 I. Ganga
                                                                    Intel
                                                                   G. Lin
                                                                   Google
                                                               M. Pearson
                                                          Hewlett-Packard
                                                                P. Thaler
                                                                 Broadcom
                                                              C. Tumuluri
                                                                   Emulex
                                                            July 31, 2014

    NVGRE: Network Virtualization using Generic Routing Encapsulation
               draft-sridharan-virtualization-nvgre-05.txt

Status of this Memo

This memo provides information for the Internet community. It does
not specify an Internet standard of any kind; instead it relies on a
proposed standard. Distribution of this memo is unlimited.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document.

This Internet-Draft will expire on January 30, 2015.

Abstract

This document describes the use of the Generic Routing Encapsulation
(GRE) header for Network Virtualization (NVGRE) in multi-tenant
datacenters. Network Virtualization decouples virtual networks and
addresses from physical network infrastructure, providing isolation
and concurrency between multiple virtual networks on the same
physical network infrastructure. This document also introduces a
Network Virtualization framework to illustrate the use cases, but
the focus is on specifying the data plane aspect of NVGRE.

Table of Contents

1. Introduction
   1.1. Terminology
2. Conventions used in this document
3. NVGRE: Network Virtualization using GRE
   3.1. NVGRE Endpoint
   3.2. NVGRE frame format
   3.3. Reserved VSID
4. NVGRE Deployment Considerations
   4.1. ECMP Support
   4.2. Broadcast and Multicast Traffic
   4.3. Unicast Traffic
   4.4. IP Fragmentation
   4.5. Address/Policy Management & Routing
   4.6. Cross-subnet, Cross-premise Communication
   4.7. Internet Connectivity
   4.8. Management and Control Planes
   4.9. NVGRE-Aware Devices
   4.10. Network Scalability with NVGRE
5. Security Considerations
6. IANA Considerations
7. References
   7.1. Normative References
   7.2. Informative References
8. Acknowledgments

1. Introduction

Conventional data center network designs cater to largely static
workloads and cause fragmentation of network and server capacity
[5][6]. Several issues limit dynamic allocation and consolidation of
capacity. Layer-2 networks use the Rapid Spanning Tree Protocol
(RSTP), which is designed to eliminate loops by blocking redundant
paths. These eliminated paths translate to wasted capacity and a
highly oversubscribed network. There are alternative approaches,
such as TRILL, that address this problem [12].

The network utilization inefficiencies are exacerbated by network
fragmentation due to the use of VLANs for broadcast isolation. VLANs
are used for traffic management and also as the mechanism for
providing security and performance isolation among services
belonging to different tenants. The Layer-2 network is carved into
smaller subnets, typically one subnet per VLAN, with VLAN tags
configured on all the Layer-2 switches connected to server racks
that host a given tenant's services. The current VLAN limits
theoretically allow for 4K such subnets to be created.
The size of the broadcast domain is typically restricted due to the
overhead of broadcast traffic (e.g., ARP). The 4K VLAN limit is no
longer sufficient in a shared infrastructure servicing multiple
tenants.

Data center operators must be able to achieve high utilization of
server and network capacity. To achieve efficiency, it should be
possible to assign workloads that operate in a single Layer-2
network to any server in any rack in the network. It should also be
possible to migrate workloads to any server anywhere in the network
while retaining the workloads' addresses. This can be achieved today
by stretching VLANs; however, when workloads migrate, the network
needs to be reconfigured, which is typically error prone. By
decoupling the workload's location on the LAN from its network
address, the network administrator configures the network once, not
every time a service migrates. This decoupling enables any server to
become part of any server resource pool.

The following are key design objectives for next generation data
centers:

a) location-independent addressing
b) the ability to scale the number of logical Layer-2/Layer-3
   networks, irrespective of the underlying physical topology or
   the number of VLANs
c) preserving Layer-2 semantics for services and allowing them to
   retain their addresses as they move within and across data
   centers
d) providing broadcast isolation as workloads move around, without
   burdening the network control plane

This document describes the use of the Generic Routing Encapsulation
(GRE) [3][4] header for network virtualization. Network
virtualization decouples a virtual network from the underlying
physical network infrastructure by virtualizing network addresses.
Combined with a management and control plane for the virtual-to-
physical mapping, network virtualization can enable flexible VM
placement and movement, and provide network isolation for a multi-
tenant datacenter.

Network virtualization enables customers to bring their own address
spaces into a multi-tenant datacenter, while the datacenter
administrators can place the customer VMs anywhere in the datacenter
without reconfiguring their network switches or routers,
irrespective of the customer address spaces.

1.1. Terminology

Please refer to [8][10] for a more formal definition of terminology.
The following terms are used in this document.

Customer Address (CA): The virtual IP addresses assigned and
configured on the virtual NIC within each VM. These are the only
addresses visible to VMs and applications running within VMs.

NVE: Network Virtualization Edge, the entity that performs the
network virtualization encapsulation and decapsulation.

Provider Address (PA): The IP addresses used in the physical
network. PAs are associated with VM CAs through the network
virtualization mapping policy.

VM: Virtual Machine. Virtual machines are typically instances of
operating systems running on top of a hypervisor on a physical
machine or server. Multiple VMs can share the same physical server
via the hypervisor, yet are completely isolated from each other in
terms of compute, storage, and other OS resources.

VSID: Virtual Subnet Identifier, a 24-bit ID that uniquely
identifies a virtual subnet or virtual Layer-2 broadcast domain.
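
To make the relationships among these terms concrete, the following
minimal Python sketch shows the kind of per-VM mapping record an NVE
might hold. It is illustrative only; the structure and field names
are our assumptions, not part of this specification.

   # Illustrative sketch only: a hypothetical per-CA mapping record
   # that an NVE could hold. Field names are assumptions made for
   # illustration; this document does not define such a structure.
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class MappingRecord:
       vsid: int       # 24-bit Virtual Subnet Identifier
       ca: str         # Customer Address (IPv4 or IPv6) of the VM
       ca_mac: str     # VM's MAC address in the virtual subnet
       pa: str         # Provider Address of the NVE hosting the VM

   # Example: a VM with CA 10.0.0.5 in virtual subnet 0x123456,
   # reachable through the NVE at PA 192.0.2.10.
   record = MappingRecord(vsid=0x123456, ca="10.0.0.5",
                          ca_mac="00:11:22:33:44:55", pa="192.0.2.10")

Sections 3 and 4 assume such CA-to-PA mapping state is available at
the NVGRE endpoint, however it is provisioned.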

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].

In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

3. NVGRE: Network Virtualization using GRE

This section describes Network Virtualization using GRE (NVGRE).
Network virtualization involves creating virtual Layer-2 topologies
on top of a physical Layer-3 network. Connectivity in the virtual
topology is provided by tunneling Ethernet frames in GRE over the
physical network.

In NVGRE, every virtual Layer-2 network is associated with a 24-bit
identifier called the Virtual Subnet Identifier (VSID). The VSID is
carried in an outer header, as defined in Section 3.2, allowing
unique identification of a tenant's virtual subnet to various
devices in the network. A 24-bit VSID supports up to 16 million
virtual subnets in the same management domain, in contrast to only
4K achievable with VLANs. Each VSID represents a virtual Layer-2
broadcast domain, which can be used to identify a virtual subnet of
a given tenant. To support a multi-subnet virtual topology,
datacenter administrators can configure routes to facilitate
communication between virtual subnets of the same tenant.

GRE is a proposed IETF standard [3][4] and provides a way to
encapsulate an arbitrary protocol over IP. NVGRE leverages the GRE
header to carry VSID information in each packet. The VSID
information in each packet can be used to build multi-tenant-aware
tools for traffic analysis, traffic inspection, and monitoring.

The following sections detail the packet format for NVGRE, describe
the functions of an NVGRE endpoint, illustrate typical traffic flow
both within and across data centers, and discuss address and policy
management as well as deployment considerations.

3.1. NVGRE Endpoint

NVGRE endpoints are the ingress/egress points between the virtual
and the physical networks. The NVGRE endpoints are the NVEs as
defined in the NVO3 Framework document [8]. Any physical server or
network device can be an NVGRE endpoint. One common deployment is
for the endpoint to be part of a hypervisor. The primary function of
this endpoint is to encapsulate/decapsulate Ethernet data frames to
and from the GRE tunnel, ensure Layer-2 semantics, and apply
isolation policy scoped on the VSID. The endpoint can optionally
participate in routing and function as a gateway in the virtual
topology. To encapsulate an Ethernet frame, the endpoint needs to
know the location information for the destination address in the
frame. This information can be provisioned via a management plane,
or obtained via a combination of control-plane distribution and
data-plane learning approaches. This document assumes that the
location information, including the VSID, is available to the NVGRE
endpoint.

3.2. NVGRE frame format

The GRE header format as specified in RFC 2784 and RFC 2890 [3][4]
is used for communication between NVGRE endpoints. NVGRE leverages
the Key extension specified in RFC 2890 to carry the VSID. The
packet format for Layer-2 encapsulation in GRE is shown in Figure 1.

    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Outer) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Outer)Destination MAC Address | (Outer)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               (Outer) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live | Protocol 0x2F |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  (Outer) Source Address                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Outer) Destination Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   GRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|0|      Reserved0      | Ver |    Protocol Type 0x6558   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Virtual Subnet ID (VSID)           |    FlowID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner Ethernet Header
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Inner) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Inner)Destination MAC Address | (Inner)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               (Inner) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q| PCP |0|     VID set to 0     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options                            |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Original IP Payload                         |
   |                                                              |
   |                                                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Figure 1 GRE Encapsulation Frame Format

The outer/delivery headers include the outer Ethernet header and the
outer IP header:

o The outer Ethernet header: The source Ethernet address in the
outer frame is set to the MAC address associated with the NVGRE
endpoint. The destination endpoint may or may not be on the same
physical subnet. The destination Ethernet address is set to the MAC
address of the next-hop IP address for the destination NVE. The
outer VLAN tag information is optional and can be used for traffic
management and broadcast scalability on the physical network.

o The outer IP header: Both IPv4 and IPv6 can be used as the
delivery protocol for GRE. The IPv4 header is shown for illustrative
purposes. Henceforth, the IP address in the outer frame is referred
to as the Provider Address (PA). There can be one or more PA
addresses associated with an NVGRE endpoint, with policy controlling
the choice of PA to use for a given Customer Address (CA) of a
customer VM.

The GRE header:

o The C (Checksum Present) and S (Sequence Number Present) bits in
the GRE header MUST be zero.

o The K bit (Key Present) in the GRE header MUST be set to one. The
32-bit Key field in the GRE header is used to carry the Virtual
Subnet ID (VSID) and the FlowID:

   - Virtual Subnet ID (VSID): a 24-bit value that is used to
     identify the NVGRE-based virtual Layer-2 network.
   - FlowID: an 8-bit value that is used to provide per-flow entropy
     for flows in the same VSID. The FlowID MUST NOT be modified by
     transit devices. The encapsulating NVE SHOULD provide as much
     entropy as possible in the FlowID. If a FlowID is not
     generated, it MUST be set to all zeros.

o The protocol type field in the GRE header is set to 0x6558
(transparent Ethernet bridging) [2].

The inner headers (headers of the GRE payload):

o The inner Ethernet frame comprises an inner Ethernet header
followed by an optional inner IP header, followed by the IP payload.
The inner frame could be any Ethernet data frame, not just IP. Note
that the inner Ethernet frame's FCS is not encapsulated.

o Inner 802.1Q tag: The inner Ethernet header of NVGRE MUST NOT
contain an 802.1Q tag. The encapsulating NVE MUST remove any
existing 802.1Q tag before encapsulating the frame in NVGRE. A
decapsulating NVE MUST drop the frame if the inner Ethernet frame
contains an 802.1Q tag.

o For illustrative purposes, IPv4 headers are shown as the inner IP
headers, but IPv6 headers may be used. Henceforth, the IP address
contained in the inner frame is referred to as the Customer Address
(CA).
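
As a worked example of the field layout above, the following Python
sketch builds and parses the 8-byte GRE header used by NVGRE: C and
S bits zero, K bit one, protocol type 0x6558, and the Key field
carrying the 24-bit VSID and 8-bit FlowID. It is an illustrative
sketch, not a reference implementation; the function names are ours.

   import struct

   GRE_FLAG_C = 0x8000      # Checksum Present (MUST be zero)
   GRE_FLAG_K = 0x2000      # Key Present (MUST be one)
   GRE_FLAG_S = 0x1000      # Sequence Number Present (MUST be zero)
   PTYPE_TEB  = 0x6558      # transparent Ethernet bridging

   def build_nvgre_header(vsid: int, flow_id: int = 0) -> bytes:
       """Build the 8-byte GRE header used by NVGRE.

       The 32-bit Key carries the 24-bit VSID in the upper bits and
       the 8-bit FlowID in the lower bits.
       """
       if not 0 <= vsid <= 0xFFFFFF:
           raise ValueError("VSID is a 24-bit value")
       if not 0 <= flow_id <= 0xFF:
           raise ValueError("FlowID is an 8-bit value")
       key = (vsid << 8) | flow_id
       return struct.pack("!HHI", GRE_FLAG_K, PTYPE_TEB, key)

   def parse_nvgre_header(hdr: bytes) -> tuple[int, int]:
       """Return (vsid, flow_id), rejecting headers that violate
       the MUST rules of Section 3.2."""
       flags, ptype, key = struct.unpack("!HHI", hdr[:8])
       if ptype != PTYPE_TEB:
           raise ValueError("protocol type is not 0x6558")
       if flags & (GRE_FLAG_C | GRE_FLAG_S):
           raise ValueError("C and S bits MUST be zero")
       if not flags & GRE_FLAG_K:
           raise ValueError("K bit MUST be one")
       return key >> 8, key & 0xFF

A complete frame prepends the outer Ethernet and IP headers (IP
protocol number 0x2F for GRE) in front of this header and appends
the inner Ethernet frame after it.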

3.3. Reserved VSID

The VSID range 0-0xFFF is reserved for future use.

The VSID 0xFFFFFF is reserved for vendor-specific NVE-NVE
communication. The sender NVE SHOULD verify the receiver NVE's
vendor before sending a packet using this VSID; however, such a
verification mechanism is outside the scope of this document.
Implementations SHOULD choose a mechanism that meets their
requirements.
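
The following is a hedged sketch of how a receiver might classify
VSIDs against the ranges above; the three-way split is one plausible
policy, not normative text.

   VSID_RESERVED_MAX = 0xFFF      # 0x0-0xFFF reserved for future use
   VSID_VENDOR       = 0xFFFFFF   # vendor-specific NVE-NVE traffic

   def classify_vsid(vsid: int) -> str:
       """Classify a received VSID per Section 3.3 (illustrative)."""
       if vsid == VSID_VENDOR:
           # Vendor-specific NVE-NVE traffic; the sender SHOULD have
           # verified the receiver's vendor beforehand.
           return "vendor"
       if vsid <= VSID_RESERVED_MAX:
           return "reserved"      # reserved for future use
       return "tenant"            # normal tenant virtual subnet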

4. NVGRE Deployment Considerations

4.1. ECMP Support

Switches and routers SHOULD provide ECMP for NVGRE packets using the
outer frame fields and the entire Key field (32 bits).

4.2. Broadcast and Multicast Traffic

To support broadcast and multicast traffic inside a virtual subnet,
one or more administratively scoped multicast addresses [7][9] can
be assigned for the VSID. All multicast or broadcast traffic
originating from within a VSID is encapsulated and sent to the
assigned multicast address. From an administrative standpoint,
network operators can configure a PA multicast address for each
multicast address that is used inside a VSID, to facilitate optimal
multicast handling. Depending on the hardware capabilities of the
physical network devices and the physical network architecture,
multiple virtual subnets may reuse the same physical IP multicast
address.

Alternatively, based upon the configuration at the NVE, broadcast
and multicast traffic in the virtual subnet can be supported using
N-way unicast. In N-way unicast, the sender NVE sends one
encapsulated packet to every NVE in the virtual subnet. The sender
NVE can encapsulate and send the packet as described in Section 4.3
(Unicast Traffic). This removes the need for multicast support in
the physical network.

4.3. Unicast Traffic

The NVGRE endpoint encapsulates a Layer-2 packet in GRE using the
source PA associated with the endpoint, with the destination PA
corresponding to the location of the destination endpoint. As
outlined earlier, there can be one or more PAs associated with an
endpoint, and policy controls which ones are used for communication.
The encapsulated GRE packet is bridged and routed normally by the
physical network to the destination PA. Bridging uses the outer
Ethernet encapsulation for scope on the LAN. The only requirement is
bi-directional IP connectivity from the underlying physical network.
At the destination, the NVGRE endpoint decapsulates the GRE packet
to recover the original Layer-2 frame. Traffic flows similarly on
the reverse path.
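
The sender-side steps above can be summarized in a short, self-
contained sketch. The lookup table below is an assumption standing
in for the mapping policy provisioned by a management or control
plane (Section 3.1); all names and addresses are illustrative.

   # Illustrative sender-side logic; the table stands in for policy
   # distributed by a management/control plane (Section 3.1).
   from struct import pack

   # (inner destination MAC, VSID) -> destination PA of the NVE
   LOCATION = {("00:aa:bb:cc:dd:01", 0x123456): "192.0.2.20"}

   def encapsulate(inner_frame: bytes, dst_mac: str, vsid: int,
                   flow_id: int = 0) -> tuple[str, bytes]:
       """Return (destination PA, GRE header + inner frame).

       The outer Ethernet/IP headers are left to the host stack in
       this sketch; a real NVE would emit them itself, with IP
       protocol 0x2F (GRE) and the chosen source PA.
       """
       dst_pa = LOCATION[(dst_mac, vsid)]   # CA -> PA mapping lookup
       gre = pack("!HHI", 0x2000, 0x6558, (vsid << 8) | flow_id)
       return dst_pa, gre + inner_frame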

4.4. IP Fragmentation

RFC 2003 [11], Section 5.1, specifies mechanisms for handling
fragmentation when encapsulating IP within IP. The subset of
mechanisms NVGRE selects is intended to ensure that NVGRE-
encapsulated frames are not fragmented after encapsulation en route
to the destination NVGRE endpoint, and that traffic sources can
leverage Path MTU Discovery. A future version of this draft will
clarify the details around setting the DF bit on the outer IP
header, as well as maintaining per-destination NVGRE endpoint MTU
soft state so that ICMP Datagram Too Big messages can be exploited.
Fragmentation behavior when tunneling non-IP Ethernet frames in GRE
will also be specified in a future version.

4.5. Address/Policy Management & Routing

Address acquisition is beyond the scope of this document; addresses
can be obtained statically, dynamically, or using stateless address
autoconfiguration. CA and PA space can be either IPv4 or IPv6. In
fact, the address families don't have to match; for example, a CA
can be IPv4 while the PA is IPv6, and vice versa.

4.6. Cross-subnet, Cross-premise Communication

One application of this framework is that it provides a seamless
path for enterprises looking to expand their virtual machine hosting
capabilities into public clouds. Enterprises can bring their entire
IP subnet(s) and isolation policies, thus making the transition to
or from the cloud simpler. It is possible to move portions of an IP
subnet to the cloud; however, that requires additional configuration
on the enterprise network and is not discussed in this document.
Enterprises can continue to use existing communication models like
site-to-site VPN to secure their traffic.

A VPN gateway is used to establish a secure site-to-site tunnel over
the Internet, and all the enterprise services running in virtual
machines in the cloud use the VPN gateway to communicate back to the
enterprise. For simplicity, we use a VPN gateway configured as a VM,
shown in Figure 2, to illustrate cross-subnet, cross-premise
communication.

   +-----------------------+         +-----------------------+
   |        Server 1       |         |        Server 2       |
   | +--------+ +--------+ |         | +-------------------+ |
   | |  VM1   | |  VM2   | |         | |    VPN Gateway    | |
   | | IP=CA1 | | IP=CA2 | |         | | Internal External | |
   | |        | |        | |         | | IP=CAg   IP=GAdc  | |
   | +--------+ +--------+ |         | +-------------------+ |
   |       Hypervisor      |         |     Hypervisor|   ^   |
   +-----------------------+         +---------------|---:---+
             | IP=PA1                        | IP=PA4    :
             |                               |           :
             |   +-------------------------+ |           :  VPN
             +---|     Layer 3 Network     |-+           :  Tunnel
                 +-------------------------+             :
                              |                          :
   +------------------------------------------------------:--+
   |                                                      :   |
   |                       Internet                       :   |
   |                                                      :   |
   +------------------------------------------------------:--+
                              |                           v
                              |               +-------------------+
                              |               |    VPN Gateway    |
                              +---------------|                   |
                                     IP=GAcorp| External IP=GAcorp|
                                              +-------------------+
                                                        |
                                            +-----------------------+
                                            | Corp Layer 3 Network  |
                                            |     (In CA Space)     |
                                            +-----------------------+
                                                        |
                                          +---------------------------+
                                          |          Server X         |
                                          | +----------+ +----------+ |
                                          | | Corp VMe1| | Corp VMe2| |
                                          | | IP=CAe1  | | IP=CAe2  | |
                                          | +----------+ +----------+ |
                                          |         Hypervisor        |
                                          +---------------------------+

          Figure 2 Cross-Subnet, Cross-Premise Communication

The packet flow is similar to the unicast traffic flow between VMs;
the key difference is that the packet needs to be sent to a VPN
gateway before it gets forwarded to the destination. As part of the
routing configuration in the CA space, a per-tenant VPN gateway is
provisioned for communication back to the enterprise. The example
illustrates an outbound connection between VM1 inside the datacenter
and VMe1 inside the enterprise network. When the outbound packet
from CA1 to CAe1 reaches the hypervisor on Server 1, the NVE in
Server 1 can perform the equivalent of a route lookup on the packet.
The cross-premise packet will match the default gateway rule, as
CAe1 is not part of the tenant virtual network in the datacenter.
The virtualization policy will indicate that the packet is to be
encapsulated and sent to the PA of the tenant VPN gateway (PA4)
running as a VM on Server 2. The packet is decapsulated on Server 2
and delivered to the VM gateway. The gateway in turn validates and
sends the packet on the site-to-site VPN tunnel back to the
enterprise network. As the communication here is external to the
datacenter, the PA address for the VPN tunnel is globally routable.
The outer header of this packet is sourced from GAdc and destined to
GAcorp. This packet is routed through the Internet to the enterprise
VPN gateway, which is the other end of the site-to-site tunnel; at
that point, the VPN gateway decapsulates the packet and sends it
inside the enterprise, where CAe1 is routable on the network. The
reverse path is similar once the packet reaches the enterprise VPN
gateway.
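
The route-lookup step in this flow can be pictured as a longest-
prefix match in the tenant's CA space, with a default route pointing
at the VPN gateway's PA. The sketch below is an illustrative reading
of Figure 2; the routes and addresses are assumptions.

   # Illustrative route lookup for the flow in Figure 2; the routes
   # below are assumptions standing in for provisioned tenant policy.
   from ipaddress import ip_address, ip_network

   TENANT_ROUTES = [
       (ip_network("10.1.0.0/16"), "192.0.2.20"),  # in-DC subnet -> NVE PA
       (ip_network("0.0.0.0/0"),   "192.0.2.40"),  # default -> VPN GW (PA4)
   ]

   def next_hop_pa(dst_ca: str) -> str:
       """Longest-prefix match in CA space; returns the PA to tunnel to."""
       dst = ip_address(dst_ca)
       best = max((net for net, _ in TENANT_ROUTES if dst in net),
                  key=lambda net: net.prefixlen)
       return dict(TENANT_ROUTES)[best]

   # A destination like CAe1 (say, 172.16.0.5) misses the in-DC subnet
   # and matches the default route, so it is encapsulated toward the
   # VPN gateway's PA.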

4.7. Internet Connectivity

To enable connectivity to the Internet, an Internet gateway is
needed that bridges the virtualized CA space to the public Internet
address space. The gateway needs to perform translation between the
virtualized world and the Internet. For example, the NVGRE endpoint
can be part of a load balancer or a NAT, which replaces the VPN
gateway on Server 2 shown in Figure 2.

4.8. Management and Control Planes

There are several protocols that can manage and distribute policy;
however, they are out of scope for this document. Implementations
SHOULD choose a mechanism that meets their scale requirements.

4.9. NVGRE-Aware Devices

One example of a typical deployment consists of virtualized servers
deployed across multiple racks connected by one or more layers of
Layer-2 switches, which in turn may be connected to a Layer-3
routing domain. Even though routing in the physical infrastructure
will work without any modification with NVGRE, devices that perform
specialized processing in the network need to be able to parse GRE
to get access to tenant-specific information. Devices that
understand and parse the VSID can provide rich multi-tenancy-aware
services inside the data center. As outlined earlier, it is
imperative to exploit multiple paths inside the network through
techniques such as Equal-Cost Multipath (ECMP). The Key field (a
32-bit field including both the VSID and the optional FlowID) can
provide additional entropy to the switches to exploit path diversity
inside the network. A diverse ecosystem is expected to emerge as
more and more devices become multi-tenant aware. In the interim,
without requiring any hardware upgrades, there are alternatives to
exploit path diversity with GRE by associating multiple PAs with
NVGRE endpoints, with policy controlling the choice of PA to be
used.

It is expected that communication can span multiple data centers and
also cross the virtual-to-physical boundary. Typical scenarios that
require virtual-to-physical communication include access to storage
and databases. Scenarios demanding lossless Ethernet functionality
may not be amenable to NVGRE, as traffic is carried over an IP
network. NVGRE endpoints mediate between the network-virtualized and
non-network-virtualized environments. This functionality can be
incorporated into Top-of-Rack switches, storage appliances, load
balancers, routers, etc., or built as a stand-alone appliance.

It is imperative to consider the impact of any solution on host
performance. Today's server operating systems employ sophisticated
acceleration techniques such as checksum offload, Large Send Offload
(LSO), Receive Segment Coalescing (RSC), Receive Side Scaling (RSS),
Virtual Machine Queue (VMQ), etc. These technologies should become
NVGRE aware. IPsec Security Associations (SAs) can be offloaded to
the NIC so that computationally expensive cryptographic operations
are performed at line rate in the NIC hardware. These SAs are based
on the IP addresses of the endpoints. As each packet on the wire
gets translated, the NVGRE endpoint SHOULD intercept the offload
requests and perform the appropriate address translation. This
ensures that IPsec continues to be usable with network
virtualization while taking advantage of hardware offload
capabilities for improved performance.
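
As an illustration of how a switch might use this entropy, the
sketch below mixes the outer IP addresses with the 32-bit Key to
select one of several equal-cost next hops. The hash function and
exact field set are our assumptions; this document only recommends
that ECMP use the outer frame fields and the entire Key field.

   # Illustrative ECMP path selection using outer fields plus the
   # 32-bit GRE Key (VSID + FlowID) for entropy; the hash choice is
   # an assumption, not mandated by this document.
   import zlib

   def ecmp_next_hop(src_pa: str, dst_pa: str, gre_key: int,
                     next_hops: list[str]) -> str:
       seed = f"{src_pa}|{dst_pa}|{gre_key:08x}".encode()
       return next_hops[zlib.crc32(seed) % len(next_hops)]

   # Two flows in the same virtual subnet (same VSID) but with
   # different FlowIDs yield different Keys and can take different
   # equal-cost paths.
   paths = ["leaf1", "leaf2", "leaf3", "leaf4"]
   a = ecmp_next_hop("192.0.2.10", "192.0.2.20",
                     (0x123456 << 8) | 0x01, paths)
   b = ecmp_next_hop("192.0.2.10", "192.0.2.20",
                     (0x123456 << 8) | 0x02, paths)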

4.10. Network Scalability with NVGRE

One of the key benefits of using NVGRE is the IP address
scalability, and in turn the MAC address table scalability, that can
be achieved. An NVGRE endpoint can use one PA to represent multiple
CAs. This lowers the burden on the MAC address table sizes at the
Top-of-Rack switches. One obvious benefit is in the context of
server virtualization, which has increased the demands on the
network infrastructure. By embedding an NVGRE endpoint in a
hypervisor, it is possible to scale significantly. This framework
allows location information to be preconfigured inside an NVGRE
endpoint, allowing broadcast ARP traffic to be proxied locally. This
approach can scale to large virtual subnets. These virtual subnets
can be spread across multiple Layer-3 physical subnets. It allows
workloads to be moved around without imposing a huge burden on the
network control plane. By eliminating most broadcast traffic and
converting the rest to multicast, the routers and switches can
function more efficiently by building efficient multicast trees. By
using server and network capacity efficiently, it is possible to
drive down the cost of building and managing data centers.
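
The following is a minimal sketch of the local ARP-proxy idea
mentioned above, assuming the NVE has been preconfigured with the
CA-to-MAC bindings for the virtual subnet; the table and the
function name are illustrative, not part of this specification.

   # Illustrative local ARP proxy at an NVE: answer ARP requests for
   # known CAs from preconfigured state instead of flooding them.
   # The binding table is an assumption standing in for policy that
   # a management plane would provision (Section 3.1).
   ARP_TABLE = {(0x123456, "10.0.0.6"): "00:aa:bb:cc:dd:06"}

   def proxy_arp(vsid: int, target_ca: str):
       """Return the MAC to answer with, or None to drop/flood."""
       return ARP_TABLE.get((vsid, target_ca))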

5. Security Considerations

This proposal extends the Layer-2 subnet across the data center and
increases the scope for spoofing attacks. Mitigation of such attacks
is possible with authentication/encryption using IPsec or any other
IP-based mechanism. The control plane for policy distribution is
expected to be secured by using any of the existing security
protocols. Further, management traffic can be isolated in a separate
subnet/VLAN.

6. IANA Considerations

This document has no IANA actions.

7. References

7.1. Normative References

[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
    Levels", BCP 14, RFC 2119, March 1997.

[2] "Ethertypes",
    ftp://ftp.isi.edu/in-notes/iana/assignments/ethernet-numbers

[3] Farinacci, D., et al., "Generic Routing Encapsulation (GRE)",
    RFC 2784, March 2000.

[4] Dommety, G., "Key and Sequence Number Extensions to GRE",
    RFC 2890, September 2000.

7.2. Informative References

[5] Greenberg, A., et al., "VL2: A Scalable and Flexible Data Center
    Network", Proc. SIGCOMM 2009.

[6] Greenberg, A., et al., "The Cost of a Cloud: Research Problems
    in the Data Center", ACM SIGCOMM Computer Communication Review.

[7] Hinden, R. and S. Deering, "IP Version 6 Addressing
    Architecture", RFC 4291, February 2006.

[8] Lasserre, M., et al., "Framework for DC Network Virtualization",
    draft-ietf-nvo3-framework (work in progress), July 2013.

[9] Meyer, D., "Administratively Scoped IP Multicast", BCP 23,
    RFC 2365, July 1998.

[10] Narten, T., et al., "Problem Statement: Overlays for Network
     Virtualization", draft-narten-nvo3-overlay-problem-statement
     (work in progress), July 2013.

[11] Perkins, C., "IP Encapsulation within IP", RFC 2003,
     October 1996.

[12] Touch, J. and R. Perlman, "Transparent Interconnection of Lots
     of Links (TRILL): Problem and Applicability Statement",
     RFC 5556, May 2009.

8. Acknowledgments

This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

Murari Sridharan
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: muraris@microsoft.com

Yu-Shun Wang
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: yushwang@microsoft.com

Albert Greenberg
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: albert@microsoft.com

Pankaj Garg
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: pankajg@microsoft.com

Narasimhan Venkataramiah
Facebook Inc
1730 Minor Ave.
Seattle, WA 98101
Email: navenkat@microsoft.com

Kenneth Duda
Arista Networks, Inc.
5470 Great America Pkwy
Santa Clara, CA 95054
Email: kduda@aristanetworks.com

Ilango Ganga
Intel Corporation
2200 Mission College Blvd.
M/S: SC12-325
Santa Clara, CA 95054
Email: ilango.s.ganga@intel.com

Geng Lin
Google
1600 Amphitheatre Parkway
Mountain View, California 94043
Email: genglin@google.com

Mark Pearson
Hewlett-Packard Co.
8000 Foothills Blvd.
Roseville, CA 95747
Email: mark.pearson@hp.com

Patricia Thaler
Broadcom Corporation
3151 Zanker Road
San Jose, CA 95134
Email: pthaler@broadcom.com

Chait Tumuluri
Emulex Corporation
3333 Susan Street
Costa Mesa, CA 92626
Email: chait@emulex.com