Network Working Group                                      M. Sridharan
Internet Draft                                              A. Greenberg
Intended status: Informational                          N. Venkataramiah
Expires: January 2013                                            Y. Wang
                                                               Microsoft
                                                                 K. Duda
                                                         Arista Networks
                                                                I. Ganga
                                                                   Intel
                                                                  G. Lin
                                                                    Dell
                                                              M. Pearson
                                                         Hewlett-Packard
                                                               P. Thaler
                                                                Broadcom
                                                             C. Tumuluri
                                                                  Emulex
                                                            July 9, 2012

    NVGRE: Network Virtualization using Generic Routing Encapsulation
              draft-sridharan-virtualization-nvgre-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 9, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

   This document describes the usage of Generic Routing Encapsulation (GRE) header for Network Virtualization, called NVGRE, in multi-tenant datacenters.
Network Virtualization decouples virtual networks and addresses from physical network infrastructure, providing isolation and concurrency between multiple virtual networks on the same physical network infrastructure. This document also introduces a Network Virtualization framework to illustrate the use cases, but the focus is on specifying the data plane aspect of NVGRE.

Table of Contents

   1. Introduction...................................................3
      1.1. Terminology...............................................4
   2. Conventions used in this document..............................4
   3. Network Virtualization using GRE...............................4
      3.1. NVGRE End Points..........................................5
      3.2. NVGRE Frame Format........................................5
   4. NVGRE Deployment Considerations................................8
      4.1. Broadcast and Multicast Traffic...........................8
      4.2. Unicast Traffic...........................................9
      4.3. IP Fragmentation..........................................9
      4.4. Address/Policy Management & Routing.......................9
      4.5. Cross-subnet, Cross-premise Communication................10
      4.6. Internet Connectivity....................................12
      4.7. Management and Control Planes............................12
      4.8. NVGRE-Aware Device.......................................12
      4.9. Network Scalability with NVGRE...........................13
   5. Security Considerations.......................................14
   6. IANA Considerations...........................................14
   7. References....................................................14
      7.1. Normative References.....................................14
      7.2. Informative References...................................14
   8. Acknowledgments...............................................15

1. Introduction

   Conventional data center network designs cater to largely static workloads and cause fragmentation of network and server capacity [5][6]. There are several issues that limit dynamic allocation and consolidation of capacity. Layer-2 networks use the Rapid Spanning Tree Protocol (RSTP), which is designed to eliminate loops by blocking redundant paths. These eliminated paths translate to wasted capacity and a highly oversubscribed network. There are alternative approaches such as TRILL that address this problem [13].

   The network utilization inefficiencies are exacerbated by network fragmentation due to the use of VLANs for broadcast isolation. VLANs are used for traffic management and also as the mechanism for providing security and performance isolation among services belonging to different tenants. The Layer-2 network is carved into smaller subnets, typically one subnet per VLAN, with VLAN tags configured on all the Layer-2 switches connected to server racks that run a given tenant's services. The current VLAN limits theoretically allow for 4K such subnets to be created. The size of the broadcast domain is typically restricted due to the overhead of broadcast traffic (e.g., ARP). The 4K VLAN limit is no longer sufficient in a shared infrastructure servicing multiple tenants.

   Data center operators must be able to achieve high utilization of server and network capacity.
In order to achieve efficiency, it should be possible to assign workloads that operate in a single Layer-2 network to any server in any rack in the network. It should also be possible to migrate workloads to any server anywhere in the network while retaining the workload's addresses. This can be achieved today by stretching VLANs; however, when workloads migrate, the network needs to be reconfigured, which is typically error prone. By decoupling the workload's location on the LAN from its network address, the network administrator configures the network once and not every time a service migrates. This decoupling enables any server to become part of any server resource pool.

   The following are key design objectives for next generation data centers: a) location independent addressing, b) the ability to scale the number of logical Layer-2/Layer-3 networks irrespective of the underlying physical topology or the number of concurrent VLANs, c) preserving Layer-2 semantics for services and allowing them to retain their addresses as they move within and across data centers, and d) providing broadcast isolation as workloads move around without burdening the network control plane.

1.1. Terminology

   For common NVO3 terminology, refer to [8] and [10].

   o NVE: Network Virtualization Endpoint

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

   In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying RFC 2119 significance.

3. Network Virtualization using GRE

   This section describes Network Virtualization using GRE [4], called NVGRE. Network virtualization involves creating virtual Layer 2 and/or Layer 3 topologies on top of an arbitrary physical Layer 2/Layer 3 network. Connectivity in the virtual topology is provided by tunneling Ethernet frames in IP over the physical network. Virtual broadcast domains are realized as multicast distribution trees, which are analogous to VLAN broadcast domains. A virtual Layer 2 network can span multiple physical subnets. Support for bi-directional IP unicast and multicast connectivity is the only requirement placed on the underlying physical network to support unicast communication within a virtual network. If the operator chooses to support broadcast and multicast traffic in the virtual topology, the physical topology must support IP multicast. The physical network, for example, can be a conventional hierarchical 3-tier network, a full bisection bandwidth Clos network, or a large Layer 2 network with or without TRILL support.

   Every virtual Layer-2 network is associated with a 24-bit identifier called the Virtual Subnet Identifier (VSID). A 24-bit VSID allows up to 16 million virtual subnets in the same management domain, in contrast to only 4K achievable with VLANs. Each VSID represents a virtual Layer-2 broadcast domain, and routes can be configured for communication between virtual subnets. The VSID can be crafted in such a way that it uniquely identifies a specific tenant's subnet.
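   As a purely illustrative example (this draft does not mandate any internal structure for the VSID, and the split below is an assumed operator convention, not part of the specification), the 24-bit value could be partitioned into a tenant number and a per-tenant subnet number:

      # Hypothetical VSID layout: upper 16 bits identify the tenant, lower
      # 8 bits identify one of that tenant's virtual subnets.  Any such
      # split is an operator choice; NVGRE only requires a 24-bit value.
      def make_vsid(tenant_id: int, subnet_id: int) -> int:
          if not (0 <= tenant_id < 2**16 and 0 <= subnet_id < 2**8):
              raise ValueError("tenant_id is 16 bits, subnet_id is 8 bits")
          return (tenant_id << 8) | subnet_id

      # Example: tenant 0x00AB, subnet 0x01 -> VSID 0x00AB01
      assert make_vsid(0x00AB, 0x01) == 0x00AB01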
   The VSID is carried in an outer header, allowing unique identification of the tenant's virtual subnet to various devices in the network.

   GRE is a proposed IETF standard [4][3] and provides a way for encapsulating an arbitrary protocol over IP. NVGRE leverages the GRE header to carry VSID information in each packet. The VSID information in each packet can be used to build multi-tenant-aware tools for traffic analysis, traffic inspection, and monitoring.

   The following sections detail the packet format for NVGRE, describe the functions of an NVGRE endpoint, illustrate typical traffic flows both within and across data centers, and discuss address/policy management and deployment considerations.

3.1. NVGRE End Points

   NVGRE endpoints are the ingress/egress points between the virtual and the physical networks. Any physical server or network device can be an NVGRE endpoint. One common deployment is for the NVGRE endpoint to be part of a hypervisor. The primary function of this endpoint is to encapsulate/decapsulate Ethernet data frames to and from the GRE tunnel, ensure Layer-2 semantics, and apply isolation policy scoped on VSID. The endpoint can optionally participate in routing and function as a gateway in the virtual topology. To encapsulate an Ethernet frame, the endpoint needs to know the location information for the destination address in the frame. This information can be provisioned via a management plane, or obtained via a combination of control plane distribution and data plane learning approaches. This document assumes that the location information, including VSID, is available to the NVGRE endpoint.

3.2. NVGRE Frame Format

   The GRE header format as specified in RFC 2784 and RFC 2890 is used for communication between NVGRE endpoints. NVGRE leverages the Key extension specified in RFC 2890 to carry the VSID. The packet format for Layer-2 encapsulation in GRE is shown in Figure 1.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              (Outer) Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Outer)Destination MAC Address |  (Outer)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Outer) Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype 0x0800        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|     Fragment Offset     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live | Protocol 0x2F |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    (Outer) Source Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 (Outer) Destination Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   GRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|0|   Reserved0   | Ver |      Protocol Type 0x6558       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Virtual Subnet ID (VSID)            |    FlowID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner Ethernet Header
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              (Inner) Destination MAC Address                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Inner)Destination MAC Address |  (Inner)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Inner) Source MAC Address                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Ethertype 0x0800        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|     Fragment Offset     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Source Address                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Destination Address                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options            |            Padding            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Original IP Payload                       |
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 1 NVGRE Encapsulation Frame Format

   The outer/delivery headers include the outer Ethernet header and the outer IP header:

   o The outer Ethernet header: The source Ethernet address in the outer frame is set to the MAC address associated with the NVGRE endpoint. The destination Ethernet address is set to the MAC address of the nexthop IP address for the destination NVE.
The destination endpoint may or may not be on the same physical subnet. The outer VLAN tag information is optional and can be used for traffic management and broadcast scalability.

   o The outer IP header: Both IPv4 and IPv6 can be used as the delivery protocol for GRE. The IPv4 header is shown for illustrative purposes. Henceforth the IP address in the outer frame is referred to as the Provider Address (PA).

   The GRE header:

   o The C (Checksum Present) and S (Sequence Number Present) bits in the GRE header MUST be zero.

   o The K bit (Key Present) in the GRE header MUST be one. The 32-bit Key field in the GRE header is used to carry the Virtual Subnet ID (VSID) and the optional FlowID.

   o Virtual Subnet ID (VSID): The first 24 bits of the Key field are used for the VSID, as shown in Figure 1.

   o FlowID: The last 8 bits of the Key field are the optional FlowID, which can be used to add per-flow entropy within the same VSID. The entire 32-bit Key field MAY be used by switches or routers in the physical network infrastructure for Equal-Cost Multi-Path (ECMP) purposes [12]. If a FlowID is not generated, the FlowID field MUST be set to all zeros.

   o The protocol type field in the GRE header is set to 0x6558 (transparent Ethernet bridging) [2].

   The inner headers (headers of the GRE payload):

   o The inner Ethernet frame comprises an inner Ethernet header followed by the inner Ethernet payload. The inner frame could be any Ethernet data frame; an inner IP payload is shown in Figure 1 for illustrative purposes. Note that the inner Ethernet frame's FCS is not encapsulated.

   o Inner VLAN tag: The inner Ethernet header of NVGRE SHOULD NOT contain an inner VLAN tag. When an NVE performs NVGRE encapsulation, it SHOULD remove any existing VLAN tag before encapsulating the NVGRE headers. If a VLAN-tagged frame arrives encapsulated in NVGRE, then the decapsulating NVE SHOULD drop the frame.

   o An inner IPv4 header is shown as an example, but IPv6 headers may be used. Henceforth the IP address contained in the inner frame is referred to as the Customer Address (CA).

4. NVGRE Deployment Considerations

4.1. Broadcast and Multicast Traffic

   The following discussion applies if the network operator chooses to support broadcast and multicast traffic. Each virtual subnet is assigned an administratively scoped multicast address to carry broadcast and multicast traffic. All traffic originating from within a VSID is encapsulated and sent to the assigned multicast address. As an example, the addresses can be derived from an administratively scoped multicast address as specified in RFC 2365 for IPv4 (Organization Local Scope 239.192.0.0/14) [9], or an Organization-Local scope multicast address for IPv6 as specified in RFC 4291 [7]. This provides a wide range of address choices. Purely from an efficiency standpoint, for every multicast address that a tenant uses, the network operator may configure a corresponding multicast address in the PA space. To support broadcast and multicast traffic in the virtual topology, the physical topology must support IP multicast. Depending on the hardware capabilities of the physical network devices, multiple virtual broadcast domains may be assigned the same physical IP multicast address.
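   As a purely illustrative, non-normative sketch of one such assignment (the function and folding scheme below are assumptions, not part of this specification), a VSID could be folded into the 18 host bits of the RFC 2365 Organization Local Scope block, which also shows why several VSIDs may end up sharing one physical group:

      import ipaddress

      # Illustrative only: derive a PA-space IPv4 multicast group for a
      # VSID by folding the 24-bit VSID into the 18 host bits of the
      # RFC 2365 Organization Local Scope block 239.192.0.0/14.  Because
      # the VSID space is larger than the block, multiple VSIDs can map
      # to the same group, as permitted above.
      SCOPE = ipaddress.ip_network("239.192.0.0/14")

      def vsid_to_group(vsid: int) -> ipaddress.IPv4Address:
          if not 0 <= vsid < 2**24:
              raise ValueError("VSID is a 24-bit value")
          host_bits = SCOPE.max_prefixlen - SCOPE.prefixlen   # 18 bits
          return SCOPE.network_address + (vsid % 2**host_bits)

      print(vsid_to_group(0x00AB01))   # a group within 239.192.0.0/14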
   For interoperability reasons, a future version of this draft will specify a standard way to map a VSID to an IP multicast address.

4.2. Unicast Traffic

   The NVGRE endpoint encapsulates a Layer-2 packet in GRE using the source PA associated with the endpoint and the destination PA corresponding to the location of the destination endpoint. As outlined earlier, there can be one or more PAs associated with an endpoint, and policy controls which ones get used for communication. The encapsulated GRE packet is bridged and routed normally by the physical network to the destination. Bridging uses the outer Ethernet encapsulation for scope on the LAN. The only assumption is bi-directional IP connectivity from the underlying physical network. At the destination, the NVGRE endpoint decapsulates the GRE packet to recover the original Layer-2 frame. Traffic flows similarly on the reverse path.

4.3. IP Fragmentation

   RFC 2003 section 5.1 specifies mechanisms for handling fragmentation when encapsulating IP within IP [11]. The subset of mechanisms NVGRE selects is intended to ensure that NVGRE encapsulated frames are not fragmented after encapsulation en route to the destination NVGRE endpoint, and that traffic sources can leverage Path MTU discovery. A future version of this draft will clarify the details around setting the DF bit on the outer IP header, as well as maintaining per-destination NVGRE endpoint MTU soft state so that ICMP Datagram Too Big messages can be exploited. Fragmentation behavior when tunneling non-IP Ethernet frames in GRE will also be specified in a future version.

4.4. Address/Policy Management & Routing

   Address acquisition is beyond the scope of this document; addresses can be obtained statically, dynamically, or using stateless address autoconfiguration. CA and PA space can be either IPv4 or IPv6. In fact, the address families don't have to match; for example, the CA can be IPv4 while the PA is IPv6, and vice versa. The isolation policies MUST be explicitly configured in the NVGRE endpoint. A typical policy table entry consists of the CA, MAC address, VSID and, optionally, the specific PA if more than one PA is associated with the NVGRE endpoint. If there are multiple virtual subnets, explicit routing information MUST be configured along with a default gateway for cross-subnet communication. Routing between virtual subnets can optionally be handled by the NVGRE endpoint acting as a gateway. If broadcast/multicast support is required, the NVGRE endpoints MUST participate in IGMP/MLD for all subscribed multicast groups.

4.5. Cross-subnet, Cross-premise Communication

   One application of this framework is that it provides a seamless path for enterprises looking to expand their virtual machine hosting capabilities into public clouds. Enterprises can bring their entire IP subnet(s) and isolation policies, thus making the transition to or from the cloud simpler. It is possible to move portions of an IP subnet to the cloud; however, that requires additional configuration on the enterprise network and is not discussed in this document. Enterprises can continue to use existing communication models such as site-to-site VPN to secure their traffic.
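   The cross-premise traffic flow described below is driven by the per-tenant policy table of Section 4.4: a known destination CA maps to the PA of the NVGRE endpoint hosting it, while any other CA falls through to the tenant's default gateway, which in this scenario is the VPN gateway. The following is a minimal, purely illustrative sketch of such a lookup; the class layout, MAC addresses (from the documentation range), and VSID value are hypothetical and not part of this specification.

      from dataclasses import dataclass
      from typing import Dict, Optional

      @dataclass
      class PolicyEntry:
          # Section 4.4: a typical entry holds the CA's MAC address, the
          # VSID, and the PA of the NVGRE endpoint hosting that CA.
          mac: str
          vsid: int
          pa: str

      class TenantPolicy:
          """Hypothetical per-tenant lookup done before NVGRE encapsulation."""

          def __init__(self, default_gateway_pa: str) -> None:
              self.entries: Dict[str, PolicyEntry] = {}
              self.default_gateway_pa = default_gateway_pa

          def add(self, ca: str, mac: str, vsid: int, pa: str) -> None:
              self.entries[ca] = PolicyEntry(mac, vsid, pa)

          def destination_pa(self, ca: str) -> str:
              # Unknown CAs (e.g. addresses back in the enterprise) go to
              # the default gateway, which in Figure 2 is the VPN gateway VM.
              entry: Optional[PolicyEntry] = self.entries.get(ca)
              return entry.pa if entry is not None else self.default_gateway_pa

      # Mirroring Figure 2: CA1/CA2 live behind the endpoint at PA1; anything
      # else, such as CAe in the enterprise, is tunneled to the gateway at PA4.
      policy = TenantPolicy(default_gateway_pa="PA4")
      policy.add("CA1", "00:00:5e:00:53:01", vsid=0x00AB01, pa="PA1")
      policy.add("CA2", "00:00:5e:00:53:02", vsid=0x00AB01, pa="PA1")
      print(policy.destination_pa("CAe"))   # -> "PA4"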
   A VPN gateway is used to establish a secure site-to-site tunnel over the Internet, and all the enterprise services running in virtual machines in the cloud use the VPN gateway to communicate back to the enterprise. For simplicity, we use a VPN gateway configured as a VM, shown in Figure 2, to illustrate cross-subnet, cross-premise communication.

   +-----------------------+          +-----------------------+
   |        Server 1       |          |        Server 2       |
   | +--------+ +--------+ |          | +-------------------+ |
   | |  VM1   | |  VM2   | |          | |    VPN Gateway    | |
   | | IP=CA1 | | IP=CA2 | |          | | Internal External | |
   | |        | |        | |          | | IP=CAg   IP=GAdc  | |
   | +--------+ +--------+ |          | +-------------------+ |
   |       Hypervisor      |          |      Hypervisor|  ^   |
   +-----------------------+          +----------------:------+
             | IP=PA1                          | IP=PA4 :
             |                                 |        :
             |    +-------------------------+  |        :  VPN
             +----|     Layer 3 Network     |--+        :  Tunnel
                  +-------------------------+           :
                               |                        :
   +-----------------------------------------------------:--+
   |                                                     :  |
   |                       Internet                      :  |
   |                                                     :  |
   +-----------------------------------------------------:--+
                               |                         v
                               |           +-------------------+
                               |           |    VPN Gateway    |
                               |-----------|                   |
                                  IP=GAcorp| External IP=GAcorp|
                                           +-------------------+
                                                     |
                                         +-----------------------+
                                         | Corp Layer 3 Network  |
                                         |    (In CA Space)      |
                                         +-----------------------+
                                                     |
                                       +---------------------------+
                                       |         Server X          |
                                       | +----------+ +----------+ |
                                       | | Corp VMe | | Corp VM2 | |
                                       | | IP=CAe   | | IP=CAE2  | |
                                       | +----------+ +----------+ |
                                       |         Hypervisor        |
                                       +---------------------------+

            Figure 2 Cross-Subnet, Cross-Premise Communication

   The flow here is similar to the unicast traffic flow between VMs; the key difference is that the packet needs to be sent to a VPN gateway before it gets forwarded to the destination. As part of routing configuration in the CA space, a VPN gateway is provisioned per tenant for communication back to the enterprise. The example illustrates an outbound connection between VM1 inside the datacenter and VMe inside the enterprise network. When the outbound packet from CA1 to CAe hits the hypervisor on Server 1, it matches the default gateway rule, as CAe is not part of the tenant virtual network in the datacenter. The packet is encapsulated and sent to the PA of the tenant VPN gateway (PA4) running as a VM on Server 2. The packet is decapsulated on Server 2 and delivered to the VPN gateway VM. The gateway in turn validates and sends the packet on the site-to-site tunnel back to the enterprise network. As the communication here is external to the datacenter, the PA address for the VPN tunnel is globally routable. The outer header of this packet is sourced from GAdc and destined to GAcorp. This packet is routed through the Internet to the enterprise VPN gateway, which is the other end of the site-to-site tunnel. The enterprise VPN gateway decapsulates the packet and sends it inside the enterprise network, where CAe is routable. The reverse path is similar once the packet hits the enterprise VPN gateway.

4.6. Internet Connectivity

   To enable connectivity to the Internet, an Internet gateway is needed that bridges the virtualized CA space to the public Internet address space. The gateway performs translation between the virtualized world and the Internet; for example, the NVGRE endpoint can be part of a load balancer or a NAT.
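   As a highly simplified, hypothetical sketch of that translation step (the table layout, names, and the documentation-range public address below are illustrative assumptions, not part of this specification), such a gateway only needs to hold the mapping between a tenant's (VSID, CA) pair and a public address it owns:

      from typing import Dict, Tuple

      # Hypothetical static SNAT table at an NVGRE-aware Internet gateway:
      # (VSID, CA) pairs are rewritten to public VIPs owned by the gateway
      # or load balancer.  203.0.113.0/24 is an IPv4 documentation range.
      snat_table: Dict[Tuple[int, str], str] = {
          (0x00AB01, "CA1"): "203.0.113.10",
      }

      def public_source_for(vsid: int, ca_src: str) -> str:
          """Return the public source address used for traffic leaving the
          virtualized CA space; CAs without an Internet policy get none."""
          key = (vsid, ca_src)
          if key not in snat_table:
              raise PermissionError("no Internet access policy for this CA")
          return snat_table[key]

      print(public_source_for(0x00AB01, "CA1"))   # -> 203.0.113.10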
   Section 4 discusses building GRE gateways in more detail.

4.7. Management and Control Planes

   There are several protocols that can manage and distribute policy; however, this document does not recommend any one mechanism. Implementations SHOULD choose a mechanism that meets their scale requirements.

4.8. NVGRE-Aware Device

   One example of a typical deployment consists of virtualized servers deployed across multiple racks connected by one or more layers of Layer-2 switches, which in turn may be connected to a Layer-3 routing domain. Even though routing in the physical infrastructure will work without any modification with GRE, devices that perform specialized processing in the network need to be able to parse GRE to get access to tenant-specific information. Devices that understand and parse the VSID can provide rich multi-tenancy-aware services inside the data center. As outlined earlier, it is imperative to exploit multiple paths inside the network through techniques such as Equal Cost Multipath (ECMP) [12]. The Key field could provide additional entropy to the switches to exploit path diversity inside the network. Switches or routers could use the Key field, with the VSID and optional FlowID, to add flow-based entropy and tag all the packets from a flow with an entropy label. A diverse ecosystem is expected to emerge as more and more devices become multi-tenant aware. In the interim, without requiring any hardware upgrades, there are alternatives to exploit path diversity with GRE by associating multiple PAs with NVGRE endpoints, with policy controlling the choice of PA to be used.

   It is expected that communication can span multiple data centers and also cross the virtual-to-physical boundary. Typical scenarios that require virtual-to-physical communication include access to storage and databases. Scenarios demanding lossless Ethernet functionality may not be amenable to NVGRE, as traffic is carried over an IP network. NVGRE endpoints mediate between the network virtualized and non-network virtualized environments. This functionality can be incorporated into Top of Rack switches, storage appliances, load balancers, routers, etc., or built as a stand-alone appliance.

   It is imperative to consider the impact of any solution on host performance. Today's server operating systems employ sophisticated acceleration techniques such as checksum offload, Large Send Offload (LSO), Receive Segment Coalescing (RSC), Receive Side Scaling (RSS), Virtual Machine Queue (VMQ), etc. These technologies should become GRE aware. IPsec Security Associations (SA) can be offloaded to the NIC so that computationally expensive cryptographic operations are performed at line rate in the NIC hardware. These SAs are based on the IP addresses of the endpoints. As each packet on the wire gets translated, the NVGRE endpoint SHOULD intercept the offload requests and do the appropriate address translation. This will ensure that IPsec continues to be usable with network virtualization while taking advantage of hardware offload capabilities for improved performance.

4.9. Network Scalability with NVGRE

   One of the key benefits of using GRE is the IP address scalability, and in turn the MAC address table scalability, that can be achieved. An NVGRE endpoint can use one PA to represent multiple CAs.
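   As a back-of-the-envelope illustration (the VM and rack counts below are hypothetical), the underlay only has to learn one PA and MAC per NVGRE endpoint rather than one entry per VM:

      # Hypothetical sizing: with an NVGRE endpoint in each hypervisor, the
      # physical switches see one (PA, MAC) pair per hypervisor instead of
      # one entry per VM, independent of how many CAs sit behind each PA.
      vms_per_hypervisor = 50
      hypervisors_per_rack = 40

      entries_without_nvgre = vms_per_hypervisor * hypervisors_per_rack  # 2000
      entries_with_nvgre = hypervisors_per_rack                          # 40
      print(entries_without_nvgre, entries_with_nvgre)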
   Representing many CAs with a single PA lowers the burden on the MAC address table sizes at the Top of Rack switches. One obvious benefit is in the context of server virtualization, which has increased the demands on the network infrastructure. By embedding an NVGRE endpoint in a hypervisor, it is possible to scale significantly. This framework allows for location information to be preconfigured inside an NVGRE endpoint, allowing broadcast ARP traffic to be proxied locally. This approach can scale to large virtual subnets. These virtual subnets can be spread across multiple Layer-3 physical subnets. It allows workloads to be moved around without imposing a huge burden on the network control plane. By eliminating most broadcast traffic and converting the rest to multicast, the routers and switches can function more efficiently by building efficient multicast trees. By using server and network capacity efficiently, it is possible to drive down the cost of building and managing data centers.

5. Security Considerations

   This proposal extends the Layer-2 subnet across the data center and increases the scope for spoofing attacks. Mitigations of such attacks are possible with authentication/encryption using IPsec or any other IP-based mechanism. The control plane for policy distribution is expected to be secured by using any of the existing security protocols. Further, management traffic can be isolated in a separate subnet/VLAN.

6. IANA Considerations

   None

7. References

7.1. Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

   [2]  ETHTYPES, ftp://ftp.isi.edu/in-notes/iana/assignments/ethernet-numbers

7.2. Informative References

   [3]  Dommety, G., "Key and Sequence Number Extensions to GRE", RFC 2890, September 2000.

   [4]  Farinacci, D. et al, "Generic Routing Encapsulation (GRE)", RFC 2784, March 2000.

   [5]  Greenberg, A. et al, "VL2: A Scalable and Flexible Data Center Network", Proc. SIGCOMM 2009.

   [6]  Greenberg, A. et al, "The Cost of a Cloud: Research Problems in the Data Center", ACM SIGCOMM Computer Communication Review, V. 39, No. 1, January 2009.

   [7]  Hinden, R., Deering, S., "IP Version 6 Addressing Architecture", RFC 4291, February 2006.

   [8]  Lasserre, M. et al, "Framework for DC Network Virtualization", draft-lasserre-nvo3-framework (work in progress).

   [9]  Meyer, D., "Administratively Scoped IP Multicast", RFC 2365, July 1998.

   [10] Narten, T. et al, "Problem Statement: Overlays for Network Virtualization", draft-narten-nvo3-overlay-problem-statement (work in progress).

   [11] Perkins, C., "IP Encapsulation within IP", RFC 2003, October 1996.

   [12] Thaler, D. & Hopps, C., "Multipath Issues in Unicast and Multicast Next-Hop Selection", RFC 2991, November 2000.

   [13] Touch, J. & Perlman, R., "Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement", RFC 5556, May 2009.

8. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Murari Sridharan
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: muraris@microsoft.com

   Kenneth Duda
   Arista Networks, Inc.
   5470 Great America Pkwy
   Santa Clara, CA 95054
   kduda@aristanetworks.com

   Ilango Ganga
   Intel Corporation
   2200 Mission College Blvd.
   M/S: SC12-325
   Santa Clara, CA 95054
   Email: ilango.s.ganga@intel.com

   Albert Greenberg
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: albert@microsoft.com

   Geng Lin
   Dell
   One Dell Way
   Round Rock, TX 78682
   Email: geng_lin@dell.com

   Mark Pearson
   Hewlett-Packard Co.
   8000 Foothills Blvd.
   Roseville, CA 95747
   Email: mark.pearson@hp.com

   Patricia Thaler
   Broadcom Corporation
   3151 Zanker Road
   San Jose, CA 95134
   Email: pthaler@broadcom.com

   Chait Tumuluri
   Emulex Corporation
   3333 Susan Street
   Costa Mesa, CA 92626
   Email: chait@emulex.com

   Narasimhan Venkataramiah
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: narave@microsoft.com

   Yu-Shun Wang
   Microsoft Corporation
   1 Microsoft Way
   Redmond, WA 98052
   Email: yushwang@microsoft.com