Network Working Group                                      M. Sridharan
Internet Draft                                              A. Greenberg
Intended status: Informational                          N. Venkataramiah
Expires: January 2013                                            Y. Wang
                                                               Microsoft
                                                                 K. Duda
                                                         Arista Networks
                                                                I. Ganga
                                                                   Intel
                                                                  G. Lin
                                                                    Dell
                                                              M. Pearson
                                                         Hewlett-Packard
                                                               P. Thaler
                                                                Broadcom
                                                             C. Tumuluri
                                                                  Emulex
                                                            July 9, 2012

    NVGRE: Network Virtualization using Generic Routing Encapsulation
              draft-sridharan-virtualization-nvgre-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 9, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

   This document describes the usage of Generic Routing Encapsulation (GRE) header for Network Virtualization, called NVGRE, in multi-tenant datacenters.
Network Virtualization decouples virtual networks and addresses from physical network infrastructure, providing isolation and concurrency between multiple virtual networks on the same physical network infrastructure. This document also introduces a Network Virtualization framework to illustrate the use cases, but the focus is on specifying the data plane aspect of NVGRE.

Table of Contents

   1. Introduction...................................................3
      1.1. Terminology...............................................4
   2. Conventions used in this document..............................4
   3. Network Virtualization using GRE...............................4
      3.1. NVGRE End Points..........................................5
      3.2. NVGRE Frame Format........................................5
   4. NVGRE Deployment Considerations................................8
      4.1. Broadcast and Multicast Traffic...........................8
      4.2. Unicast Traffic...........................................9
      4.3. IP Fragmentation..........................................9
      4.4. Address/Policy Management & Routing.......................9
      4.5. Cross-subnet, Cross-premise Communication................10
      4.6. Internet Connectivity....................................12
      4.7. Management and Control Planes............................12
      4.8. NVGRE-Aware Device.......................................12
      4.9. Network Scalability with NVGRE...........................13
   5. Security Considerations.......................................14
   6. IANA Considerations...........................................14
   7. References....................................................14
      7.1. Normative References.....................................14
      7.2. Informative References...................................14
   8. Acknowledgments...............................................15

1. Introduction

   Conventional data center network designs cater to largely static workloads and cause fragmentation of network and server capacity [5][6]. There are several issues that limit dynamic allocation and consolidation of capacity. Layer-2 networks use the Rapid Spanning Tree Protocol (RSTP), which is designed to eliminate loops by blocking redundant paths. These eliminated paths translate to wasted capacity and a highly oversubscribed network. There are alternative approaches such as TRILL that address this problem [13].

   The network utilization inefficiencies are exacerbated by network fragmentation due to the use of VLANs for broadcast isolation. VLANs are used for traffic management and also as the mechanism for providing security and performance isolation among services belonging to different tenants. The Layer-2 network is carved into smaller subnets, typically one subnet per VLAN, with VLAN tags configured on all the Layer-2 switches connected to server racks that run a given tenant's services. The current VLAN limits theoretically allow for 4K such subnets to be created. The size of the broadcast domain is typically restricted due to the overhead of broadcast traffic (e.g., ARP). The 4K VLAN limit is no longer sufficient in a shared infrastructure servicing multiple tenants.

   Data center operators must be able to achieve high utilization of server and network capacity.
In order to achieve efficiency, it should be possible to assign workloads that operate in a single Layer-2 network to any server in any rack in the network. It should also be possible to migrate workloads to any server anywhere in the network while retaining the workload's addresses. This can be achieved today by stretching VLANs; however, when workloads migrate, the network needs to be reconfigured, which is typically error prone. By decoupling the workload's location on the LAN from its network address, the network administrator configures the network once and not every time a service migrates. This decoupling enables any server to become part of any server resource pool.

   The following are key design objectives for next generation data centers: a) location independent addressing, b) the ability to scale the number of logical Layer-2/Layer-3 networks irrespective of the underlying physical topology or the number of concurrent VLANs, c) preserving Layer-2 semantics for services and allowing them to retain their addresses as they move within and across data centers, and d) providing broadcast isolation as workloads move around without burdening the network control plane.

1.1. Terminology

   For common NVO3 terminology, refer to [8] and [10].

   o NVE: Network Virtualization Endpoint

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

   In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying RFC 2119 significance.

3. Network Virtualization using GRE

   This section describes Network Virtualization using GRE [4], called NVGRE. Network virtualization involves creating virtual Layer 2 and/or Layer 3 topologies on top of an arbitrary physical Layer 2/Layer 3 network. Connectivity in the virtual topology is provided by tunneling Ethernet frames in IP over the physical network. Virtual broadcast domains are realized as multicast distribution trees, which are analogous to VLAN broadcast domains. A virtual Layer 2 network can span multiple physical subnets. Support for bi-directional IP unicast and multicast connectivity is the only requirement placed on the underlying physical network to support unicast communication within a virtual network. If the operator chooses to support broadcast and multicast traffic in the virtual topology, the physical topology must support IP multicast. The physical network, for example, can be a conventional hierarchical 3-tier network, a full bisection bandwidth Clos network, or a large Layer 2 network with or without TRILL support.

   Every virtual Layer-2 network is associated with a 24-bit identifier called the Virtual Subnet Identifier (VSID). A 24-bit VSID allows up to 16 million virtual subnets in the same management domain, in contrast to only 4K achievable with VLANs. Each VSID represents a virtual Layer-2 broadcast domain, and routes can be configured for communication between virtual subnets. The VSID can be crafted in such a way that it uniquely identifies a specific tenant's subnet.
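   As a purely illustrative example (this draft does not mandate any internal structure for the VSID, and the split below is an assumed operator convention, not part of the specification), the 24-bit value could be partitioned into a tenant number and a per-tenant subnet number:

      # Hypothetical VSID layout: upper 16 bits identify the tenant, lower
      # 8 bits identify one of that tenant's virtual subnets.  Any such
      # split is an operator choice; NVGRE only requires a 24-bit value.
      def make_vsid(tenant_id: int, subnet_id: int) -> int:
          if not (0 <= tenant_id < 2**16 and 0 <= subnet_id < 2**8):
              raise ValueError("tenant_id is 16 bits, subnet_id is 8 bits")
          return (tenant_id << 8) | subnet_id

      # Example: tenant 0x00AB, subnet 0x01 -> VSID 0x00AB01
      assert make_vsid(0x00AB, 0x01) == 0x00AB01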
   The VSID is carried in an outer header, allowing unique identification of the tenant's virtual subnet to various devices in the network.

   GRE is a proposed IETF standard [4][3] and provides a way for encapsulating an arbitrary protocol over IP. NVGRE leverages the GRE header to carry VSID information in each packet. The VSID information in each packet can be used to build multi-tenant-aware tools for traffic analysis, traffic inspection, and monitoring.

   The following sections detail the packet format for NVGRE, describe the functions of an NVGRE endpoint, illustrate typical traffic flows both within and across data centers, and discuss address/policy management and deployment considerations.

3.1. NVGRE End Points

   NVGRE endpoints are the ingress/egress points between the virtual and the physical networks. Any physical server or network device can be an NVGRE endpoint. One common deployment is for the NVGRE endpoint to be part of a hypervisor. The primary function of this endpoint is to encapsulate/decapsulate Ethernet data frames to and from the GRE tunnel, ensure Layer-2 semantics, and apply isolation policy scoped on VSID. The endpoint can optionally participate in routing and function as a gateway in the virtual topology. To encapsulate an Ethernet frame, the endpoint needs to know the location information for the destination address in the frame. This information can be provisioned via a management plane, or obtained via a combination of control plane distribution and data plane learning approaches. This document assumes that the location information, including VSID, is available to the NVGRE endpoint.

3.2. NVGRE Frame Format

   The GRE header format as specified in RFC 2784 and RFC 2890 is used for communication between NVGRE endpoints. NVGRE leverages the Key extension specified in RFC 2890 to carry the VSID. The packet format for Layer-2 encapsulation in GRE is shown in Figure 1.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              (Outer) Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Outer)Destination MAC Address |  (Outer)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Outer) Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype 0x0800        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|     Fragment Offset     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live | Protocol 0x2F |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    (Outer) Source Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 (Outer) Destination Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   GRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|0|   Reserved0   | Ver |      Protocol Type 0x6558       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Virtual Subnet ID (VSID)            |    FlowID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner Ethernet Header
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              (Inner) Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Inner)Destination MAC Address |  (Inner)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Inner) Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype 0x0800        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|     Fragment Offset     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Source Address                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Destination Address                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options            |            Padding            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Original IP Payload                       |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 1 NVGRE Encapsulation Frame Format

   The outer/delivery headers include the outer Ethernet header and the outer IP header:

   o The outer Ethernet header: The source Ethernet address in the outer frame is set to the MAC address associated with the NVGRE endpoint. The destination Ethernet address is set to the MAC address of the nexthop IP address for the destination NVE.
The destination endpoint may or may not be on the same physical subnet. The outer VLAN tag information is optional and can be used for traffic management and broadcast scalability.

   o The outer IP header: Both IPv4 and IPv6 can be used as the delivery protocol for GRE. The IPv4 header is shown for illustrative purposes. Henceforth the IP address in the outer frame is referred to as the Provider Address (PA).

   The GRE header:

   o The C (Checksum Present) and S (Sequence Number Present) bits in the GRE header MUST be zero.

   o The K bit (Key Present) in the GRE header MUST be one. The 32-bit Key field in the GRE header is used to carry the Virtual Subnet ID (VSID) and the optional FlowID.

   o Virtual Subnet ID (VSID): The first 24 bits of the Key field are used for the VSID, as shown in Figure 1.

   o FlowID: The last 8 bits of the Key field are the optional FlowID, which can be used to add per-flow entropy within the same VSID. The entire 32-bit Key field MAY be used by switches or routers in the physical network infrastructure for Equal-Cost Multi-Path (ECMP) purposes [12]. If a FlowID is not generated, the FlowID field MUST be set to all zeros.

   o The protocol type field in the GRE header is set to 0x6558 (transparent Ethernet bridging) [2].

   The inner headers (headers of the GRE payload):

   o The inner Ethernet frame comprises an inner Ethernet header followed by the inner Ethernet payload. The inner frame could be any Ethernet data frame; an inner IP payload is shown in Figure 1 for illustrative purposes. Note that the inner Ethernet frame's FCS is not encapsulated.

   o Inner VLAN tag: The inner Ethernet header of NVGRE SHOULD NOT contain an inner VLAN tag. When an NVE performs NVGRE encapsulation, it SHOULD remove any existing VLAN tag before encapsulating the NVGRE headers. If a VLAN-tagged frame arrives encapsulated in NVGRE, then the decapsulating NVE SHOULD drop the frame.

   o An inner IPv4 header is shown as an example, but IPv6 headers may be used. Henceforth the IP address contained in the inner frame is referred to as the Customer Address (CA).

4. NVGRE Deployment Considerations

4.1. Broadcast and Multicast Traffic

   The following discussion applies if the network operator chooses to support broadcast and multicast traffic. Each virtual subnet is assigned an administratively scoped multicast address to carry broadcast and multicast traffic. All traffic originating from within a VSID is encapsulated and sent to the assigned multicast address. As an example, the addresses can be derived from an administratively scoped multicast address as specified in RFC 2365 for IPv4 (Organization Local Scope 239.192.0.0/14) [9], or an Organization-Local scope multicast address for IPv6 as specified in RFC 4291 [7]. This provides a wide range of address choices. Purely from an efficiency standpoint, for every multicast address that a tenant uses, the network operator may configure a corresponding multicast address in the PA space. To support broadcast and multicast traffic in the virtual topology, the physical topology must support IP multicast. Depending on the hardware capabilities of the physical network devices, multiple virtual broadcast domains may be assigned the same physical IP multicast address.
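   As a purely illustrative, non-normative sketch of one such assignment (the function and folding scheme below are assumptions, not part of this specification), a VSID could be folded into the 18 host bits of the RFC 2365 Organization Local Scope block, which also shows why several VSIDs may end up sharing one physical group:

      import ipaddress

      # Illustrative only: derive a PA-space IPv4 multicast group for a
      # VSID by folding the 24-bit VSID into the 18 host bits of the
      # RFC 2365 Organization Local Scope block 239.192.0.0/14.  Because
      # the VSID space is larger than the block, multiple VSIDs can map
      # to the same group, as permitted above.
      SCOPE = ipaddress.ip_network("239.192.0.0/14")

      def vsid_to_group(vsid: int) -> ipaddress.IPv4Address:
          if not 0 <= vsid < 2**24:
              raise ValueError("VSID is a 24-bit value")
          host_bits = SCOPE.max_prefixlen - SCOPE.prefixlen   # 18 bits
          return SCOPE.network_address + (vsid % 2**host_bits)

      print(vsid_to_group(0x00AB01))   # a group within 239.192.0.0/14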
   For interoperability reasons, a future version of this draft will specify a standard way to map a VSID to an IP multicast address.

4.2. Unicast Traffic

   The NVGRE endpoint encapsulates a Layer-2 packet in GRE using the source PA associated with the endpoint and the destination PA corresponding to the location of the destination endpoint. As outlined earlier, there can be one or more PAs associated with an endpoint, and policy controls which ones get used for communication. The encapsulated GRE packet is bridged and routed normally by the physical network to the destination. Bridging uses the outer Ethernet encapsulation for scope on the LAN. The only assumption is bi-directional IP connectivity from the underlying physical network. At the destination, the NVGRE endpoint decapsulates the GRE packet to recover the original Layer-2 frame. Traffic flows similarly on the reverse path.

4.3. IP Fragmentation

   RFC 2003 section 5.1 specifies mechanisms for handling fragmentation when encapsulating IP within IP [11]. The subset of mechanisms NVGRE selects is intended to ensure that NVGRE encapsulated frames are not fragmented after encapsulation en route to the destination NVGRE endpoint, and that traffic sources can leverage Path MTU discovery. A future version of this draft will clarify the details around setting the DF bit on the outer IP header, as well as maintaining per-destination NVGRE endpoint MTU soft state so that ICMP Datagram Too Big messages can be exploited. Fragmentation behavior when tunneling non-IP Ethernet frames in GRE will also be specified in a future version.

4.4. Address/Policy Management & Routing

   Address acquisition is beyond the scope of this document; addresses can be obtained statically, dynamically, or using stateless address autoconfiguration. CA and PA space can be either IPv4 or IPv6. In fact, the address families don't have to match; for example, the CA can be IPv4 while the PA is IPv6, and vice versa. The isolation policies MUST be explicitly configured in the NVGRE endpoint. A typical policy table entry consists of the CA, MAC address, VSID and, optionally, the specific PA if more than one PA is associated with the NVGRE endpoint. If there are multiple virtual subnets, explicit routing information MUST be configured along with a default gateway for cross-subnet communication. Routing between virtual subnets can optionally be handled by the NVGRE endpoint acting as a gateway. If broadcast/multicast support is required, the NVGRE endpoints MUST participate in IGMP/MLD for all subscribed multicast groups.

4.5. Cross-subnet, Cross-premise Communication

   One application of this framework is that it provides a seamless path for enterprises looking to expand their virtual machine hosting capabilities into public clouds. Enterprises can bring their entire IP subnet(s) and isolation policies, thus making the transition to or from the cloud simpler. It is possible to move portions of an IP subnet to the cloud; however, that requires additional configuration on the enterprise network and is not discussed in this document. Enterprises can continue to use existing communication models such as site-to-site VPN to secure their traffic.
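   The cross-premise traffic flow described below is driven by the per-tenant policy table of Section 4.4: a known destination CA maps to the PA of the NVGRE endpoint hosting it, while any other CA falls through to the tenant's default gateway, which in this scenario is the VPN gateway. The following is a minimal, purely illustrative sketch of such a lookup; the class layout, MAC addresses (from the documentation range), and VSID value are hypothetical and not part of this specification.

      from dataclasses import dataclass
      from typing import Dict, Optional

      @dataclass
      class PolicyEntry:
          # Section 4.4: a typical entry holds the CA's MAC address, the
          # VSID, and the PA of the NVGRE endpoint hosting that CA.
          mac: str
          vsid: int
          pa: str

      class TenantPolicy:
          """Hypothetical per-tenant lookup done before NVGRE encapsulation."""

          def __init__(self, default_gateway_pa: str) -> None:
              self.entries: Dict[str, PolicyEntry] = {}
              self.default_gateway_pa = default_gateway_pa

          def add(self, ca: str, mac: str, vsid: int, pa: str) -> None:
              self.entries[ca] = PolicyEntry(mac, vsid, pa)

          def destination_pa(self, ca: str) -> str:
              # Unknown CAs (e.g. addresses back in the enterprise) go to
              # the default gateway, which in Figure 2 is the VPN gateway VM.
              entry: Optional[PolicyEntry] = self.entries.get(ca)
              return entry.pa if entry is not None else self.default_gateway_pa

      # Mirroring Figure 2: CA1/CA2 live behind the endpoint at PA1; anything
      # else, such as CAe in the enterprise, is tunneled to the gateway at PA4.
      policy = TenantPolicy(default_gateway_pa="PA4")
      policy.add("CA1", "00:00:5e:00:53:01", vsid=0x00AB01, pa="PA1")
      policy.add("CA2", "00:00:5e:00:53:02", vsid=0x00AB01, pa="PA1")
      print(policy.destination_pa("CAe"))   # -> "PA4"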
   A VPN gateway is used to establish a secure site-to-site tunnel over the Internet, and all the enterprise services running in virtual machines in the cloud use the VPN gateway to communicate back to the enterprise. For simplicity, we use a VPN gateway configured as a VM, shown in Figure 2, to illustrate cross-subnet, cross-premise communication.

   +-----------------------+          +-----------------------+
   |        Server 1       |          |        Server 2       |
   | +--------+ +--------+ |          | +-------------------+ |
   | |  VM1   | |  VM2   | |          | |    VPN Gateway    | |
   | | IP=CA1 | | IP=CA2 | |          | | Internal External | |
   | |        | |        | |          | | IP=CAg   IP=GAdc  | |
   | +--------+ +--------+ |          | +-------------------+ |
   |       Hypervisor      |          |      Hypervisor|  ^   |
   +-----------------------+          +----------------:------+
             | IP=PA1                          | IP=PA4 :
             |                                 |        :
             |    +-------------------------+  |        :  VPN
             +----|     Layer 3 Network     |--+        :  Tunnel
                  +-------------------------+           :
                               |                        :
   +-----------------------------------------------------:--+
   |                                                     :  |
   |                       Internet                      :  |
   |                                                     :  |
   +-----------------------------------------------------:--+
                               |                         v
                               |           +-------------------+
                               |           |    VPN Gateway    |
                               |-----------|                   |
                                  IP=GAcorp| External IP=GAcorp|
                                           +-------------------+
                                                     |
                                         +-----------------------+
                                         | Corp Layer 3 Network  |
                                         |    (In CA Space)      |
                                         +-----------------------+
                                                     |
                                       +---------------------------+
                                       |         Server X          |
                                       | +----------+ +----------+ |
                                       | | Corp VMe | | Corp VM2 | |
                                       | | IP=CAe   | | IP=CAE2  | |
                                       | +----------+ +----------+ |
                                       |         Hypervisor        |
                                       +---------------------------+

            Figure 2 Cross-Subnet, Cross-Premise Communication

   The flow here is similar to the unicast traffic flow between VMs; the key difference is that the packet needs to be sent to a VPN gateway before it gets forwarded to the destination. As part of routing configuration in the CA space, a VPN gateway is provisioned per tenant for communication back to the enterprise. The example illustrates an outbound connection between VM1 inside the datacenter and VMe inside the enterprise network. When the outbound packet from CA1 to CAe hits the hypervisor on Server 1, it matches the default gateway rule, as CAe is not part of the tenant virtual network in the datacenter. The packet is encapsulated and sent to the PA of the tenant VPN gateway (PA4) running as a VM on Server 2. The packet is decapsulated on Server 2 and delivered to the VPN gateway VM. The gateway in turn validates and sends the packet on the site-to-site tunnel back to the enterprise network. As the communication here is external to the datacenter, the PA address for the VPN tunnel is globally routable. The outer header of this packet is sourced from GAdc and destined to GAcorp. This packet is routed through the Internet to the enterprise VPN gateway, which is the other end of the site-to-site tunnel. The enterprise VPN gateway decapsulates the packet and sends it inside the enterprise network, where CAe is routable. The reverse path is similar once the packet hits the enterprise VPN gateway.

4.6. Internet Connectivity

   To enable connectivity to the Internet, an Internet gateway is needed that bridges the virtualized CA space to the public Internet address space. The gateway performs translation between the virtualized world and the Internet; for example, the NVGRE endpoint can be part of a load balancer or a NAT.
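   As a highly simplified, hypothetical sketch of that translation step (the table layout, names, and the documentation-range public address below are illustrative assumptions, not part of this specification), such a gateway only needs to hold the mapping between a tenant's (VSID, CA) pair and a public address it owns:

      from typing import Dict, Tuple

      # Hypothetical static SNAT table at an NVGRE-aware Internet gateway:
      # (VSID, CA) pairs are rewritten to public VIPs owned by the gateway
      # or load balancer.  203.0.113.0/24 is an IPv4 documentation range.
      snat_table: Dict[Tuple[int, str], str] = {
          (0x00AB01, "CA1"): "203.0.113.10",
      }

      def public_source_for(vsid: int, ca_src: str) -> str:
          """Return the public source address used for traffic leaving the
          virtualized CA space; CAs without an Internet policy get none."""
          key = (vsid, ca_src)
          if key not in snat_table:
              raise PermissionError("no Internet access policy for this CA")
          return snat_table[key]

      print(public_source_for(0x00AB01, "CA1"))   # -> 203.0.113.10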
   Section 4 discusses building GRE gateways in more detail.

4.7. Management and Control Planes

   There are several protocols that can manage and distribute policy; however, this document does not recommend any one mechanism. Implementations SHOULD choose a mechanism that meets their scale requirements.

4.8. NVGRE-Aware Device

   One example of a typical deployment consists of virtualized servers deployed across multiple racks connected by one or more layers of Layer-2 switches, which in turn may be connected to a Layer-3 routing domain. Even though routing in the physical infrastructure will work without any modification with GRE, devices that perform specialized processing in the network need to be able to parse GRE to get access to tenant-specific information. Devices that understand and parse the VSID can provide rich multi-tenancy-aware services inside the data center. As outlined earlier, it is imperative to exploit multiple paths inside the network through techniques such as Equal Cost Multipath (ECMP) [12]. The Key field could provide additional entropy to the switches to exploit path diversity inside the network. Switches or routers could use the Key field, with the VSID and optional FlowID, to add flow-based entropy and tag all the packets from a flow with an entropy label. A diverse ecosystem is expected to emerge as more and more devices become multi-tenant aware. In the interim, without requiring any hardware upgrades, there are alternatives to exploit path diversity with GRE by associating multiple PAs with NVGRE endpoints, with policy controlling the choice of PA to be used.

   It is expected that communication can span multiple data centers and also cross the virtual-to-physical boundary. Typical scenarios that require virtual-to-physical communication include access to storage and databases. Scenarios demanding lossless Ethernet functionality may not be amenable to NVGRE, as traffic is carried over an IP network. NVGRE endpoints mediate between the network virtualized and non-network virtualized environments. This functionality can be incorporated into Top of Rack switches, storage appliances, load balancers, routers, etc., or built as a stand-alone appliance.

   It is imperative to consider the impact of any solution on host performance. Today's server operating systems employ sophisticated acceleration techniques such as checksum offload, Large Send Offload (LSO), Receive Segment Coalescing (RSC), Receive Side Scaling (RSS), Virtual Machine Queue (VMQ), etc. These technologies should become GRE aware. IPsec Security Associations (SA) can be offloaded to the NIC so that computationally expensive cryptographic operations are performed at line rate in the NIC hardware. These SAs are based on the IP addresses of the endpoints. As each packet on the wire gets translated, the NVGRE endpoint SHOULD intercept the offload requests and do the appropriate address translation. This will ensure that IPsec continues to be usable with network virtualization while taking advantage of hardware offload capabilities for improved performance.

4.9. Network Scalability with NVGRE

   One of the key benefits of using GRE is the IP address scalability, and in turn the MAC address table scalability, that can be achieved. An NVGRE endpoint can use one PA to represent multiple CAs.
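   As a back-of-the-envelope illustration (the VM and rack counts below are hypothetical), the underlay only has to learn one PA and MAC per NVGRE endpoint rather than one entry per VM:

      # Hypothetical sizing: with an NVGRE endpoint in each hypervisor, the
      # physical switches see one (PA, MAC) pair per hypervisor instead of
      # one entry per VM, independent of how many CAs sit behind each PA.
      vms_per_hypervisor = 50
      hypervisors_per_rack = 40

      entries_without_nvgre = vms_per_hypervisor * hypervisors_per_rack  # 2000
      entries_with_nvgre = hypervisors_per_rack                          # 40
      print(entries_without_nvgre, entries_with_nvgre)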
   Representing many CAs with a single PA lowers the burden on the MAC address table sizes at the Top of Rack switches. One obvious benefit is in the context of server virtualization, which has increased the demands on the network infrastructure. By embedding an NVGRE endpoint in a hypervisor, it is possible to scale significantly. This framework allows for location information to be preconfigured inside an NVGRE endpoint, allowing broadcast ARP traffic to be proxied locally. This approach can scale to large virtual subnets. These virtual subnets can be spread across multiple Layer-3 physical subnets. It allows workloads to be moved around without imposing a huge burden on the network control plane. By eliminating most broadcast traffic and converting the rest to multicast, the routers and switches can function more efficiently by building efficient multicast trees. By using server and network capacity efficiently, it is possible to drive down the cost of building and managing data centers.

5. Security Considerations

   This proposal extends the Layer-2 subnet across the data center and increases the scope for spoofing attacks. Mitigations of such attacks are possible with authentication/encryption using IPsec or any other IP-based mechanism. The control plane for policy distribution is expected to be secured by using any of the existing security protocols. Further, management traffic can be isolated in a separate subnet/VLAN.

6. IANA Considerations

   None

7. References

7.1. Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

   [2]  ETHTYPES, ftp://ftp.isi.edu/in-notes/iana/assignments/ethernet-numbers

7.2. Informative References

   [3]  Dommety, G., "Key and Sequence Number Extensions to GRE", RFC 2890, September 2000.

   [4]  Farinacci, D. et al, "Generic Routing Encapsulation (GRE)", RFC 2784, March 2000.

   [5]  Greenberg, A. et al, "VL2: A Scalable and Flexible Data Center Network", Proc. SIGCOMM 2009.

   [6]  Greenberg, A. et al, "The Cost of a Cloud: Research Problems in the Data Center", ACM SIGCOMM Computer Communication Review, V. 39, No. 1, January 2009.

   [7]  Hinden, R., Deering, S., "IP Version 6 Addressing Architecture", RFC 4291, February 2006.

   [8]  Lasserre, M. et al, "Framework for DC Network Virtualization", draft-lasserre-nvo3-framework (work in progress).

   [9]  Meyer, D., "Administratively Scoped IP Multicast", RFC 2365, July 1998.

   [10] Narten, T. et al, "Problem Statement: Overlays for Network Virtualization", draft-narten-nvo3-overlay-problem-statement (work in progress).

   [11] Perkins, C., "IP Encapsulation within IP", RFC 2003, October 1996.

   [12] Thaler, D. & Hopps, C., "Multipath Issues in Unicast and Multicast Next-Hop Selection", RFC 2991, November 2000.

   [13] Touch, J. & Perlman, R., "Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement", RFC 5556, May 2009.

8. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Murari Sridharan
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: muraris@microsoft.com

   Kenneth Duda
   Arista Networks, Inc.
   5470 Great America Pkwy
   Santa Clara, CA 95054
   kduda@aristanetworks.com

   Ilango Ganga
   Intel Corporation
   2200 Mission College Blvd.
   M/S: SC12-325
   Santa Clara, CA 95054
   Email: ilango.s.ganga@intel.com

   Albert Greenberg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: albert@microsoft.com

   Geng Lin
   Dell
   One Dell Way
   Round Rock, TX 78682
   Email: geng_lin@dell.com

   Mark Pearson
   Hewlett-Packard Co.
   8000 Foothills Blvd.
   Roseville, CA 95747
   Email: mark.pearson@hp.com

   Patricia Thaler
   Broadcom Corporation
   3151 Zanker Road
   San Jose, CA 95134
   Email: pthaler@broadcom.com

   Chait Tumuluri
   Emulex Corporation
   3333 Susan Street
   Costa Mesa, CA 92626
   Email: chait@emulex.com

   Narasimhan Venkataramiah
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: narave@microsoft.com

   Yu-Shun Wang
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: yushwang@microsoft.com