Network Working Group                                       M. Sridharan
Internet Draft                                                  Microsoft
Intended Category: Informational                                  K. Duda
Expires: March 2012                                       Arista Networks
                                                                 I. Ganga
                                                                    Intel
                                                             A. Greenberg
                                                                Microsoft
                                                                   G. Lin
                                                                     Dell
                                                               M. Pearson
                                                          Hewlett-Packard
                                                                P. Thaler
                                                                 Broadcom
                                                              C. Tumuluri
                                                                   Emulex
                                                         N. Venkataramiah
                                                                Microsoft
                                                                  Y. Wang
                                                                Microsoft

                                                           September 2011

     NVGRE: Network Virtualization using Generic Routing Encapsulation
                 draft-sridharan-virtualization-nvgre-00.txt

Status of this Memo

   This memo provides information for the Internet community. It does
   not specify an Internet standard of any kind; instead it relies on a
   proposed standard. Distribution of this memo is unlimited.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

   This Internet-Draft will expire on March 14, 2012.

Abstract

   We describe a framework for policy-based, software-controlled
   network virtualization to support multitenancy in public and private
   clouds using Generic Routing Encapsulation (GRE). The framework
   outlined in this document can be used by cloud hosting providers and
   enterprise data centers, and enables seamless migration of workloads
   between public and private clouds. This document focuses on the data
   plane aspects of the NVGRE framework.

Table of Contents

   1. Introduction
   2. Conventions used in this document
   3. Network Virtualization using GRE
      3.1. NVGRE Endpoint
      3.2. Network virtualization frame format
      3.3. Broadcast and Multicast Traffic
      3.4. Unicast Traffic
      3.5. IP Fragmentation
      3.6. Address/Policy Management & Routing
      3.7. Cross-subnet, Cross-premise Communication
      3.8. Internet Connectivity
      3.9. Manageability
   4. Deployment Considerations
      4.1. Network Scalability with GRE
   5. Security Considerations
   6. IANA Considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments

1. Introduction

   Conventional data center network designs cater to largely static
   workloads and cause fragmentation of network and server capacity
   [VL2, COST-CCR]. The key concepts described in this document are
   motivated by earlier work [VL2], although the specific approach
   described here is significantly different from the one outlined in
   that paper. Several issues limit dynamic allocation and
   consolidation of capacity. Layer-2 networks use the Rapid Spanning
   Tree Protocol (RSTP), which is designed to eliminate loops by
   blocking redundant paths. These eliminated paths translate to wasted
   capacity and a highly oversubscribed network. There are alternative
   approaches, such as TRILL, that address this problem [TRILL].

   These network utilization inefficiencies are exacerbated by network
   fragmentation due to the use of VLANs for broadcast isolation. VLANs
   are used for traffic management and also as the mechanism for
   providing security and performance isolation among services
   belonging to different tenants.
   The Layer-2 network is carved into smaller subnets, typically one
   subnet per VLAN, with VLAN tags configured on all the Layer-2
   switches connected to the server racks that run a given tenant's
   services. The current VLAN limits theoretically allow for 4K such
   subnets to be created. The size of the broadcast domain is typically
   restricted due to the overhead of broadcast traffic (e.g., ARP). The
   4K VLAN limit is no longer sufficient in a shared infrastructure
   servicing multiple tenants.

   Data center operators must be able to achieve high utilization of
   server and network capacity. In order to achieve efficiency, it
   should be possible to assign workloads that operate in a single
   Layer-2 network to any server in any rack in the network. It should
   also be possible to migrate workloads to any server anywhere in the
   network while retaining the workload's addresses. This can be
   achieved today by stretching VLANs; however, when workloads migrate,
   the network needs to be reconfigured, which is typically error
   prone.

   By decoupling the workload's location on the LAN from its network
   address, the network administrator configures the network once and
   not every time a service migrates. This decoupling enables any
   server to become part of any server resource pool.

   The following are key design objectives for next generation data
   centers: a) location-independent addressing, b) the ability to scale
   the number of logical Layer-2/Layer-3 networks irrespective of the
   underlying physical topology or the number of concurrent VLANs,
   c) preserving Layer-2 semantics for services and allowing them to
   retain their addresses as they move within and across data centers,
   and d) providing broadcast isolation as workloads move around
   without burdening the network control plane.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS. Lower case uses of these words are not to be
   interpreted as carrying RFC 2119 significance.

3. Network Virtualization using GRE

   Network virtualization involves creating virtual Layer-2 and/or
   Layer-3 topologies on top of an arbitrary physical Layer-2/Layer-3
   network. Connectivity in the virtual topology is provided by
   tunneling Ethernet frames in IP over the physical network. Virtual
   broadcast domains are realized as multicast distribution trees,
   which are analogous to VLAN broadcast domains. A virtual Layer-2
   network can span multiple physical subnets. Support for
   bi-directional IP unicast and multicast connectivity is the only
   expectation from the underlying physical network. If the operator
   chooses to support broadcast and multicast traffic in the virtual
   topology, the physical topology must support IP multicast. The
   physical network, for example, can be a conventional hierarchical
   3-tier network, a full bisection bandwidth Clos network, or a large
   Layer-2 network with or without TRILL support.

   Every virtual Layer-2 network is associated with a 24-bit Tenant
   Network Identifier (TNI). A 24-bit TNI allows up to 16 million
   logical networks in the same management domain, in contrast to only
   4K achievable with VLANs. Each TNI represents a virtual Layer-2
   broadcast domain, and routes can be configured for communication
   between virtual subnets. The TNI can be crafted in such a way that
   it uniquely identifies a specific tenant's subnet. The TNI is
   carried in an outer header, allowing unique identification of the
   tenant's virtual subnet to various devices in the network.
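   As a non-normative illustration of how a TNI might be crafted to
   identify a specific tenant's subnet, the Python sketch below packs a
   tenant number and a per-tenant subnet number into the 24-bit space.
   The 16/8 split shown is a hypothetical choice made for this example
   only; this document does not define any internal structure for the
   TNI.

      # Illustrative only: one possible way to carve the 24-bit TNI
      # space.  The 16/8 tenant/subnet split is a hypothetical choice.
      TNI_BITS = 24

      def make_tni(tenant_id, subnet_id):
          assert 0 <= tenant_id < (1 << 16)  # hypothetical tenant field
          assert 0 <= subnet_id < (1 << 8)   # hypothetical subnet field
          return (tenant_id << 8) | subnet_id

      def split_tni(tni):
          assert 0 <= tni < (1 << TNI_BITS)
          return tni >> 8, tni & 0xFF

      # Example: tenant 0x0123, subnet 5 -> TNI 0x012305
      assert make_tni(0x0123, 5) == 0x012305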
   GRE is a proposed IETF standard [RFC2784] [RFC2890] and provides a
   way to encapsulate an arbitrary protocol over IP. The tunneling
   mechanism itself is designed to be stateless, although for this
   specific implementation there may be some soft state to handle
   issues such as IP fragmentation, as explained in later sections. The
   GRE header provides space to carry the TNI in each packet. The TNI
   carried in each packet can be used to build multi-tenancy-aware
   tools for traffic analysis, traffic inspection, and monitoring.

   The following sections detail the packet format for network
   virtualization, describe the functions of an NVGRE endpoint,
   illustrate typical traffic flow both within and across data centers,
   and discuss address and policy management as well as deployment
   considerations.

3.1. NVGRE Endpoint

   NVGRE endpoints are gateways between the virtual and the physical
   networks. Any physical server or network device can be an NVGRE
   endpoint. One common deployment is for the endpoint to be part of a
   hypervisor. The primary function of this endpoint is to
   encapsulate/decapsulate Ethernet data frames to and from the GRE
   tunnel, ensure Layer-2 semantics, and apply isolation policy scoped
   on the TNI. The endpoint can optionally participate in routing and
   function as a gateway in the virtual subnet space. To encapsulate an
   Ethernet frame, the endpoint needs to know the location information
   for the destination address in the frame. How this information is
   obtained is not covered in this document and will be addressed in a
   separate draft. Any number of techniques can be used in the control
   plane to configure, discover, and distribute the policy information.
   For the rest of this document we assume that the location
   information, including the TNI, is readily available to the NVGRE
   endpoint.
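   As a minimal, non-normative sketch of the per-endpoint state assumed
   above, the Python fragment below models a location table keyed by
   (TNI, inner destination MAC) that yields the PA of the NVGRE
   endpoint currently hosting that address. The class and method names
   are illustrative; how the table is populated is a control-plane
   question outside the scope of this document.

      # Sketch of the location state an NVGRE endpoint is assumed to
      # hold.  Entries are configured by the control plane, not
      # learned from the data plane.
      class NvgreEndpoint:
          def __init__(self):
              # (tni, inner destination MAC) -> provider address (PA)
              self.location = {}

          def configure(self, tni, inner_dst_mac, provider_address):
              self.location[(tni, inner_dst_mac)] = provider_address

          def lookup_pa(self, tni, inner_dst_mac):
              # Returns None if no entry exists for this address.
              return self.location.get((tni, inner_dst_mac))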
3.2. Network virtualization frame format

   GRE encapsulation as specified in [RFC2784] and [RFC2890] is used
   for communication between NVGRE endpoints. The Key extension to GRE
   specified in [RFC2890] is used to carry the TNI. The packet format
   for Layer-2 encapsulation in GRE is shown in Figure 1.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Outer) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Outer)Destination MAC Address |  (Outer)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               (Outer) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|     Fragment Offset     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live | Protocol 0x2F |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   (Outer) Source Address                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 (Outer) Destination Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   GRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|0| Reserved0 | Ver |       Protocol Type 0x6558          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Tenant Network ID (TNI)              |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Inner) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Inner)Destination MAC Address |  (Inner)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               (Inner) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q| PCP |0|     VID set to 0      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Inner IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|     Fragment Offset     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Source Address                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Destination Address                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options            |            Padding            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Original IP Payload                      |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 1 GRE Encapsulation Frame Format

   o  The inner Ethernet frame comprises an inner Ethernet header
      followed by the inner IP header, followed by the IP payload. The
      inner frame could be any Ethernet data frame, not just IP. Note
      that the inner Ethernet frame's FCS is not encapsulated.

   o  Traffic may go through multiple NVGRE gateways, and no
      assumptions can be made about the VLAN ID space. An NVGRE
      endpoint MUST set the VID in 802.1Q VLAN tags, if present, to
      zero before encapsulating the frame in a GRE header. If a
      VLAN-tagged frame arrives encapsulated in NVGRE with the VID not
      set to zero, then the decapsulating device SHOULD drop the frame.

   o  For illustrative purposes, IPv4 headers are shown as the inner IP
      headers, but IPv6 headers may be used. Henceforth the IP address
      contained in the inner frame is referred to as the Customer
      Address (CA).

   o  The Key field in the GRE header is used to carry the Tenant
      Network Identifier. The Key field is 32 bits long, of which the
      lower 24 bits are used for the TNI. The Key Present bit (bit 2 in
      the GRE header) is always set to 1.

   o  The upper 8 bits of the Key field are reserved for use by NVGRE
      endpoints and are not part of the TNI space. NVGRE endpoints MUST
      set this value to zero.

   o  An NVGRE endpoint MUST set the C and S bits in the GRE header to
      zero.

   o  The Protocol Type field in the GRE header is set to 0x6558
      (transparent Ethernet bridging) [ETHTYPES].

   o  Outer IP header: Both IPv4 and IPv6 can be used as the delivery
      protocol for GRE. The IPv4 header is shown for illustrative
      purposes. Henceforth the IP address in the outer frame is
      referred to as the Provider Address (PA). There can be one or
      more PAs associated with the NVGRE endpoint, with policy
      controlling the choice of PA to use for a given CA.

   o  The source Ethernet address in the outer frame is set to the MAC
      address associated with the NVGRE endpoint. The destination
      Ethernet address is set to the MAC address of the next-hop IP
      address for the destination PA. The destination endpoint may or
      may not be on the same physical subnet. The outer VLAN tag
      information is optional and can be used for traffic management
      and broadcast scalability.
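   The following non-normative Python sketch shows how the GRE header
   described above could be constructed for an NVGRE packet: the C and
   S bits clear, the Key Present bit set, the Protocol Type set to
   0x6558, and the Key field carrying the TNI in its lower 24 bits with
   the endpoint-reserved upper 8 bits set to zero, as described in the
   field descriptions above. The helper names are illustrative only.

      import struct

      GRE_PROTO_TEB = 0x6558   # transparent Ethernet bridging

      def nvgre_gre_header(tni):
          # Illustrative sketch: C=0, S=0, K=1, Ver=0.  The TNI is
          # placed in the lower 24 bits of the Key field and the
          # endpoint-reserved upper 8 bits are set to zero, per the
          # field descriptions above.
          assert 0 <= tni < (1 << 24)
          flags_ver = 0x2000            # only the Key Present (K) bit
          key = tni                     # upper 8 bits remain zero
          return struct.pack("!HHI", flags_ver, GRE_PROTO_TEB, key)

      # The packet on the wire is then:
      #   outer Ethernet | outer IP (protocol 0x2F) | GRE | inner frame
      def encapsulate(tni, inner_frame_without_fcs):
          return nvgre_gre_header(tni) + inner_frame_without_fcs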
3.3. Broadcast and Multicast Traffic

   The following discussion applies if the network operator chooses to
   support broadcast and multicast traffic. Each virtual subnet is
   assigned an administratively scoped multicast address to carry
   broadcast and multicast traffic. All traffic originating from within
   a TNI is encapsulated and sent to the assigned multicast address. As
   an example, the addresses can be derived from the administratively
   scoped multicast address space specified in [RFC2365] for IPv4
   (Organization-Local Scope, 239.192.0.0/14), or from the
   Organization-Local scope multicast address space for IPv6 as
   specified in [RFC4291]. This provides a wide range of address
   choices. Purely from an efficiency standpoint, for every multicast
   address that a tenant uses, the network operator may configure a
   corresponding multicast address in the PA space. To support
   broadcast and multicast traffic in the virtual topology, the
   physical topology must support IP multicast. Depending on the
   hardware capabilities of the physical network devices, multiple
   virtual broadcast domains may be assigned the same physical IP
   multicast address. For interoperability reasons, a future version of
   this draft will specify a standard way to map a TNI to an IP
   multicast address.
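   Since the standard TNI-to-group mapping is deferred to a future
   version of this draft, the Python sketch below shows only one
   illustrative way an operator might derive an Organization-Local
   Scope IPv4 group from a TNI; the folding scheme is an assumption of
   this example, not a recommendation.

      import ipaddress

      # Illustrative mapping from a 24-bit TNI into the RFC 2365
      # Organization Local Scope range 239.192.0.0/14.
      ORG_LOCAL_BASE = int(ipaddress.IPv4Address("239.192.0.0"))
      ORG_LOCAL_SIZE = 1 << 18        # a /14 holds 2^18 addresses

      def tni_to_multicast_group(tni):
          assert 0 <= tni < (1 << 24)
          # Fold the TNI into the /14.  Distinct TNIs may share a
          # group, mirroring the note above that multiple virtual
          # broadcast domains may map to one physical multicast
          # address.
          offset = tni % ORG_LOCAL_SIZE
          return ipaddress.IPv4Address(ORG_LOCAL_BASE + offset)

      # Example: tni_to_multicast_group(0x000001) -> 239.192.0.1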
3.4. Unicast Traffic

   The NVGRE endpoint encapsulates a Layer-2 packet in GRE using the
   source PA associated with the endpoint, with the destination PA
   corresponding to the location of the destination endpoint. As
   outlined earlier, there can be one or more PAs associated with an
   endpoint, and policy controls which ones get used for communication.
   The encapsulated GRE packet is bridged and routed normally by the
   physical network to the destination. Bridging uses the outer
   Ethernet encapsulation for scope on the LAN. The only assumption is
   bi-directional IP connectivity from the underlying physical network.
   At the destination, the NVGRE endpoint decapsulates the GRE packet
   to recover the original Layer-2 frame. Traffic flows similarly on
   the reverse path.

3.5. IP Fragmentation

   Section 5.1 of [RFC2003] specifies mechanisms for handling
   fragmentation when encapsulating IP within IP. The subset of
   mechanisms NVGRE selects is intended to ensure that NVGRE-
   encapsulated frames are not fragmented after encapsulation en route
   to the destination NVGRE endpoint, and that traffic sources can
   leverage Path MTU discovery. A future version of this draft will
   clarify the details around setting the DF bit on the outer IP header
   as well as maintaining per-destination NVGRE endpoint MTU soft state
   so that ICMP Datagram Too Big messages can be exploited.
   Fragmentation behavior when tunneling non-IP Ethernet frames in GRE
   will also be specified in a future version.

3.6. Address/Policy Management & Routing

   Address acquisition is beyond the scope of this document; addresses
   can be obtained statically, dynamically, or using stateless address
   autoconfiguration. The CA and PA spaces can be either IPv4 or IPv6.
   In fact, the address families do not have to match; for example, the
   CA can be IPv4 while the PA is IPv6, and vice versa. The isolation
   policies MUST be explicitly configured in the NVGRE endpoint. A
   typical policy table entry consists of the CA, MAC address, TNI,
   and, optionally, the specific PA to use if more than one PA is
   associated with the NVGRE endpoint. If there are multiple virtual
   subnets, explicit routing information MUST be configured along with
   a default gateway for cross-subnet communication. Routing between
   virtual subnets can optionally be handled by the NVGRE endpoint
   acting as a gateway. If broadcast/multicast support is required, the
   NVGRE endpoints MUST participate in IGMP/MLD for all subscribed
   multicast groups.
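   The Python sketch below models the policy table entry described
   above. The field names and the example values (documentation
   addresses and an arbitrary TNI) are illustrative; entries are
   explicitly configured, not learned from the data plane.

      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class PolicyEntry:
          customer_address: str                   # CA (IPv4 or IPv6)
          mac_address: str                        # inner MAC address
          tni: int                                # 24-bit TNI
          provider_address: Optional[str] = None  # specific PA, if the
                                                  # endpoint has several

      # policy[(tni, CA)] -> PolicyEntry, consulted before
      # encapsulation.
      policy = {}
      policy[(0x012305, "10.0.0.5")] = PolicyEntry(
          customer_address="10.0.0.5",
          mac_address="00:11:22:33:44:55",
          tni=0x012305,
          provider_address="192.0.2.10")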
3.7. Cross-subnet, Cross-premise Communication

   One application of this framework is that it provides a seamless
   path for enterprises looking to expand their virtual machine hosting
   capabilities into public clouds. Enterprises can bring their entire
   IP subnet(s) and isolation policies, thus making the transition to
   or from the cloud simpler. It is possible to move portions of an IP
   subnet to the cloud; however, that requires additional configuration
   on the enterprise network and is not discussed in this document.
   Enterprises can continue to use existing communication models, such
   as site-to-site VPN, to secure their traffic.

   A VPN gateway is used to establish a secure site-to-site tunnel over
   the Internet, and all the enterprise services running in virtual
   machines in the cloud use the VPN gateway to communicate back to the
   enterprise. For simplicity, we use a VPN gateway configured as a VM,
   shown in Figure 2, to illustrate cross-subnet, cross-premise
   communication.

   +-----------------------+       +-----------------------+
   |       Server 1        |       |       Server 2        |
   | +--------+ +--------+ |       | +-------------------+ |
   | |  VM1   | |  VM2   | |       | |    VPN Gateway    | |
   | | IP=CA1 | | IP=CA2 | |       | | Internal External | |
   | |        | |        | |       | | IP=CAg  IP=GAdc   | |
   | +--------+ +--------+ |       | +-------------------+ |
   |      Hypervisor       |       |      Hypervisor   ^   |
   +-----------------------+       +-------------------:---+
           | IP=PA1                        | IP=PA4     :
           |                               |            :
           |   +-------------------------+ |            : VPN
           +---|     Layer 3 Network     |-+            : Tunnel
               +-------------------------+              :
                            |                           :
   +-----------------------------------------------------:--+
   |                                                     :   |
   |                       Internet                      :   |
   |                                                     :   |
   +-----------------------------------------------------:--+
                            |                            v
                            |                +-------------------+
                            |                |    VPN Gateway    |
                            +----------------|                   |
                                             | External IP=GAcorp|
                                             +-------------------+
                                                       |
                                         +-----------------------+
                                         | Corp Layer 3 Network  |
                                         |     (In CA Space)     |
                                         +-----------------------+
                                                       |
                                     +---------------------------+
                                     |         Server X          |
                                     | +----------+ +----------+ |
                                     | | Corp VMe | | Corp VM2 | |
                                     | |          | |          | |
                                     | | IP=CAe   | | IP=CAe2  | |
                                     | +----------+ +----------+ |
                                     |        Hypervisor         |
                                     +---------------------------+

          Figure 2 Cross-Subnet, Cross-Premise Communication

   The flow here is similar to the unicast traffic flow between VMs;
   the key difference is that the packet needs to be sent to a VPN
   gateway before it gets forwarded to the destination. As part of the
   routing configuration in the CA space, a VPN gateway is provisioned
   per tenant for communication back to the enterprise. The example
   illustrates an outbound connection between VM1 inside the data
   center and VMe inside the enterprise network. When the outbound
   packet from CA1 to CAe hits the hypervisor on Server 1, it matches
   the default gateway rule, as CAe is not part of the tenant's virtual
   network in the data center. The packet is encapsulated and sent to
   the PA of the tenant VPN gateway (PA4) running as a VM on Server 2.
   The packet is decapsulated on Server 2 and delivered to the VPN
   gateway VM. The gateway in turn validates the packet and sends it
   over the site-to-site tunnel back to the enterprise network. As the
   communication here is external to the data center, the PA address
   for the VPN tunnel is globally routable. The outer header of this
   packet is sourced from GAdc and destined to GAcorp. This packet is
   routed through the Internet to the enterprise VPN gateway, which is
   the other end of the site-to-site tunnel, at which point the VPN
   gateway decapsulates the packet and sends it inside the enterprise,
   where CAe is routable on the network. The reverse path is similar
   once the packet hits the enterprise VPN gateway.
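   The forwarding decision made on Server 1 in this example can be
   summarized by the non-normative Python sketch below, which reuses
   the illustrative policy table from the Section 3.6 example; the
   helper name and parameters are hypothetical.

      # Choose the PA to tunnel to for an outbound inner packet.
      def choose_destination_pa(tni, dst_ca, policy,
                                default_gateway_pa):
          entry = policy.get((tni, dst_ca))
          if entry is not None:
              # Destination CA belongs to the tenant's virtual network
              # in the data center: tunnel directly to its endpoint.
              return entry.provider_address
          # Otherwise the packet matches the default gateway rule and
          # is tunneled to the per-tenant VPN gateway (PA4 in
          # Figure 2).
          return default_gateway_pa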
3.8. Internet Connectivity

   To enable connectivity to the Internet, an Internet gateway is
   needed that bridges the virtualized CA space to the public Internet
   address space. The gateway performs translation between the
   virtualized world and the Internet; for example, the NVGRE endpoint
   can be part of a load balancer or a NAT device. Section 4 has more
   discussion on building GRE gateways.

3.9. Manageability

   There are several protocols that can manage and distribute policy;
   however, this document does not recommend any one mechanism.
   Implementations SHOULD choose a mechanism that meets their scale
   requirements.

4. Deployment Considerations

   One example of a typical deployment consists of virtualized servers
   deployed across multiple racks connected by one or more layers of
   Layer-2 switches, which in turn may be connected to a Layer-3
   routing domain. Even though routing in the physical infrastructure
   will work without any modification with GRE, devices that perform
   specialized processing in the network need to be able to parse GRE
   to get access to tenant-specific information. Devices that
   understand and parse the TNI can provide rich multi-tenancy-aware
   services inside the data center. As outlined earlier, it is
   imperative to exploit multiple paths inside the network through
   techniques such as Equal Cost Multipath (ECMP). The Key field may
   provide additional entropy to the switches to exploit path diversity
   inside the network. One such example could be to use the upper
   8 bits of the Key field to add flow-based entropy and tag all the
   packets from a flow with an entropy label. A diverse ecosystem is
   expected to emerge as more and more devices become multitenancy
   aware. In the interim, without requiring any hardware upgrades, path
   diversity can be exploited with GRE by associating multiple PAs with
   NVGRE endpoints, with policy controlling the choice of PA to be
   used.
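   One possible realization of the flow-entropy idea above is sketched
   in Python below. It is purely illustrative: the flow key, the hash
   function, and the helper name are assumptions of this example, and
   Section 3.2 currently requires the endpoint-reserved bits to be
   zero, so this applies only where that option is adopted.

      import zlib

      def key_with_entropy(tni, src_ca, dst_ca, src_port, dst_port,
                           proto):
          # Tag every packet of a flow with an 8-bit entropy label
          # carried in the upper (endpoint-reserved) bits of the Key
          # field, keeping the TNI in the lower 24 bits.
          assert 0 <= tni < (1 << 24)
          flow = "%s|%s|%s|%s|%s" % (src_ca, dst_ca, src_port,
                                     dst_port, proto)
          entropy = zlib.crc32(flow.encode()) & 0xFF
          return (entropy << 24) | tni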
   It is expected that communication can span multiple data centers and
   also cross the virtual-to-physical boundary. Typical scenarios that
   require virtual-to-physical communication include access to storage
   and databases. Scenarios demanding lossless Ethernet functionality
   may not be amenable to NVGRE, as traffic is carried over an IP
   network. NVGRE endpoints mediate between the network-virtualized and
   non-network-virtualized environments. This functionality can be
   incorporated into Top-of-Rack switches, storage appliances, load
   balancers, routers, etc., or built as a stand-alone appliance.

   It is imperative to consider the impact of any solution on host
   performance. Today's server operating systems employ sophisticated
   acceleration techniques such as checksum offload, Large Send Offload
   (LSO), Receive Segment Coalescing (RSC), Receive Side Scaling (RSS),
   and Virtual Machine Queue (VMQ). These technologies should become
   GRE aware. IPsec Security Associations (SAs) can be offloaded to the
   NIC so that computationally expensive cryptographic operations are
   performed at line rate in the NIC hardware. These SAs are based on
   the IP addresses of the endpoints. As each packet on the wire gets
   translated, the NVGRE endpoint SHOULD intercept the offload requests
   and do the appropriate address translation. This will ensure that
   IPsec continues to be usable with network virtualization while
   taking advantage of hardware offload capabilities for improved
   performance.

4.1. Network Scalability with GRE

   One of the key benefits of using GRE is the IP address scalability,
   and in turn the MAC address table scalability, that can be achieved.
   An NVGRE endpoint can use one PA to represent multiple CAs. This
   lowers the burden on MAC address table sizes at Top-of-Rack
   switches. One obvious benefit is in the context of server
   virtualization, which has increased the demands on the network
   infrastructure. By embedding an NVGRE endpoint in a hypervisor, it
   is possible to scale significantly. This framework allows location
   information to be preconfigured inside an NVGRE endpoint, allowing
   broadcast ARP traffic to be proxied locally. This approach can scale
   to large virtual subnets. These virtual subnets can be spread across
   multiple Layer-3 physical subnets. It allows workloads to be moved
   around without imposing a huge burden on the network control plane.
   By eliminating most broadcast traffic and converting the rest to
   multicast, the routers and switches can function more efficiently by
   building efficient multicast trees. By using server and network
   capacity efficiently, it is possible to drive down the cost of
   building and managing data centers.

5. Security Considerations

   This proposal extends the Layer-2 subnet across the data center and
   increases the scope for spoofing attacks. Mitigations of such
   attacks are possible with authentication/encryption using IPsec or
   any other IP-based mechanism. The control plane for policy
   distribution is expected to be secured by using any of the existing
   security protocols. Further, management traffic can be isolated in a
   separate subnet/VLAN.

6. IANA Considerations

   None.

7. References

7.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003,
              October 1996.

   [RFC2365]  Meyer, D., "Administratively Scoped IP Multicast",
              BCP 23, RFC 2365, July 1998.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              March 2000.

   [RFC2890]  Dommety, G., "Key and Sequence Number Extensions to GRE",
              RFC 2890, September 2000.

   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
              Architecture", RFC 4291, February 2006.

   [ETHTYPES] IANA, "Ether Types",
              ftp://ftp.isi.edu/in-notes/iana/assignments/ethernet-
              numbers

7.2. Informative References

   [VL2]      Greenberg, A., et al., "VL2: A Scalable and Flexible Data
              Center Network", Proc. ACM SIGCOMM 2009.

   [COST-CCR] Greenberg, A., et al., "The Cost of a Cloud: Research
              Problems in the Data Center", ACM SIGCOMM Computer
              Communication Review.

   [TRILL]    Perlman, R., Eastlake, D., Dutt, D., Gai, S., and A.
              Ghanwani, "Routing Bridges (RBridges): Base Protocol
              Specification", RFC 6325, July 2011.

8. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Murari Sridharan
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: muraris@microsoft.com

   Kenneth Duda
   Arista Networks, Inc.
   5470 Great America Pkwy
   Santa Clara, CA 95054
   Email: kduda@aristanetworks.com

   Ilango Ganga
   Intel Corporation
   2200 Mission College Blvd.
   M/S: SC12-325
   Santa Clara, CA 95054
   Email: ilango.s.ganga@intel.com

   Albert Greenberg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: albert@microsoft.com

   Geng Lin
   Dell
   One Dell Way
   Round Rock, TX 78682
   Email: geng_lin@dell.com

   Mark Pearson
   Hewlett-Packard Co.
   8000 Foothills Blvd.
   Roseville, CA 95747
   Email: mark.pearson@hp.com

   Patricia Thaler
   Broadcom Corporation
   3151 Zanker Road
   San Jose, CA 95134
   Email: pthaler@broadcom.com

   Chait Tumuluri
   Emulex Corporation
   3333 Susan Street
   Costa Mesa, CA 92626
   Email: chait@emulex.com

   Narasimhan Venkataramiah
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: narave@microsoft.com

   Yu-Shun Wang
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: yushwang@microsoft.com