Network Working Group                                       M. Sridharan
Internet Draft                                               A. Greenberg
Intended Category: Informational                                  Y. Wang
Expires: January 30, 2015                                         P. Garg
                                                         N. Venkataramiah
                                                                Microsoft
                                                                  K. Duda
                                                          Arista Networks
                                                                 I. Ganga
                                                                    Intel
                                                                   G. Lin
                                                                   Google
                                                               M. Pearson
                                                          Hewlett-Packard
                                                                P. Thaler
                                                                 Broadcom
                                                              C. Tumuluri
                                                                   Emulex
                                                            July 31, 2014

    NVGRE: Network Virtualization using Generic Routing Encapsulation
               draft-sridharan-virtualization-nvgre-05.txt

Status of this Memo

This memo provides information for the Internet community. It does
not specify an Internet standard of any kind; instead it relies on a
proposed standard. Distribution of this memo is unlimited.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document.

This Internet-Draft will expire on January 30, 2015.

Abstract

This document describes the use of the Generic Routing Encapsulation
(GRE) header for Network Virtualization (NVGRE) in multi-tenant
datacenters. Network Virtualization decouples virtual networks and
addresses from physical network infrastructure, providing isolation
and concurrency between multiple virtual networks on the same
physical network infrastructure. This document also introduces a
Network Virtualization framework to illustrate the use cases, but
the focus is on specifying the data plane aspect of NVGRE.

Table of Contents

1. Introduction
   1.1. Terminology
2. Conventions used in this document
3. NVGRE: Network Virtualization using GRE
   3.1. NVGRE Endpoint
   3.2. NVGRE frame format
   3.3. Reserved VSID
4. NVGRE Deployment Considerations
   4.1. ECMP Support
   4.2. Broadcast and Multicast Traffic
   4.3. Unicast Traffic
   4.4. IP Fragmentation
   4.5. Address/Policy Management & Routing
   4.6. Cross-subnet, Cross-premise Communication
   4.7. Internet Connectivity
   4.8. Management and Control Planes
   4.9. NVGRE-Aware Devices
   4.10. Network Scalability with NVGRE
5. Security Considerations
6. IANA Considerations
7. References
   7.1. Normative References
   7.2. Informative References
8. Acknowledgments

1. Introduction

Conventional data center network designs cater to largely static
workloads and cause fragmentation of network and server capacity
[5][6]. Several issues limit dynamic allocation and consolidation of
capacity. Layer-2 networks use the Rapid Spanning Tree Protocol
(RSTP), which is designed to eliminate loops by blocking redundant
paths. These eliminated paths translate to wasted capacity and a
highly oversubscribed network. There are alternative approaches,
such as TRILL, that address this problem [12].

The network utilization inefficiencies are exacerbated by network
fragmentation due to the use of VLANs for broadcast isolation. VLANs
are used for traffic management and also as the mechanism for
providing security and performance isolation among services
belonging to different tenants. The Layer-2 network is carved into
smaller subnets, typically one subnet per VLAN, with VLAN tags
configured on all the Layer-2 switches connected to server racks
that host a given tenant's services. The current VLAN limits
theoretically allow for 4K such subnets to be created.
The size of the broadcast domain is typically restricted due to the
overhead of broadcast traffic (e.g., ARP). The 4K VLAN limit is no
longer sufficient in a shared infrastructure servicing multiple
tenants.

Data center operators must be able to achieve high utilization of
server and network capacity. To achieve efficiency, it should be
possible to assign workloads that operate in a single Layer-2
network to any server in any rack in the network. It should also be
possible to migrate workloads to any server anywhere in the network
while retaining the workloads' addresses. This can be achieved today
by stretching VLANs; however, when workloads migrate, the network
needs to be reconfigured, which is typically error prone. By
decoupling the workload's location on the LAN from its network
address, the network administrator configures the network once, not
every time a service migrates. This decoupling enables any server to
become part of any server resource pool.

The following are key design objectives for next generation data
centers:

a) location-independent addressing
b) the ability to scale the number of logical Layer-2/Layer-3
   networks, irrespective of the underlying physical topology or
   the number of VLANs
c) preserving Layer-2 semantics for services and allowing them to
   retain their addresses as they move within and across data
   centers
d) providing broadcast isolation as workloads move around, without
   burdening the network control plane

This document describes the use of the Generic Routing Encapsulation
(GRE) [3][4] header for network virtualization. Network
virtualization decouples a virtual network from the underlying
physical network infrastructure by virtualizing network addresses.
Combined with a management and control plane for the virtual-to-
physical mapping, network virtualization can enable flexible VM
placement and movement, and provide network isolation for a multi-
tenant datacenter.

Network virtualization enables customers to bring their own address
spaces into a multi-tenant datacenter, while the datacenter
administrators can place the customer VMs anywhere in the datacenter
without reconfiguring their network switches or routers,
irrespective of the customer address spaces.

1.1. Terminology

Please refer to [8][10] for a more formal definition of terminology.
The following terms are used in this document.

Customer Address (CA): The virtual IP addresses assigned and
configured on the virtual NIC within each VM. These are the only
addresses visible to VMs and applications running within VMs.

NVE: Network Virtualization Edge, the entity that performs the
network virtualization encapsulation and decapsulation.

Provider Address (PA): The IP addresses used in the physical
network. PAs are associated with VM CAs through the network
virtualization mapping policy.

VM: Virtual Machine. Virtual machines are typically instances of
operating systems running on top of a hypervisor on a physical
machine or server. Multiple VMs can share the same physical server
via the hypervisor, yet are completely isolated from each other in
terms of compute, storage, and other OS resources.

VSID: Virtual Subnet Identifier, a 24-bit ID that uniquely
identifies a virtual subnet or virtual Layer-2 broadcast domain.
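
To make the relationships among these terms concrete, the following
minimal Python sketch shows the kind of per-VM mapping record an NVE
might hold. It is illustrative only; the structure and field names
are our assumptions, not part of this specification.

   # Illustrative sketch only: a hypothetical per-CA mapping record
   # that an NVE could hold. Field names are assumptions made for
   # illustration; this document does not define such a structure.
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class MappingRecord:
       vsid: int       # 24-bit Virtual Subnet Identifier
       ca: str         # Customer Address (IPv4 or IPv6) of the VM
       ca_mac: str     # VM's MAC address in the virtual subnet
       pa: str         # Provider Address of the NVE hosting the VM

   # Example: a VM with CA 10.0.0.5 in virtual subnet 0x123456,
   # reachable through the NVE at PA 192.0.2.10.
   record = MappingRecord(vsid=0x123456, ca="10.0.0.5",
                          ca_mac="00:11:22:33:44:55", pa="192.0.2.10")

Sections 3 and 4 assume such CA-to-PA mapping state is available at
the NVGRE endpoint, however it is provisioned.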

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].

In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

3. NVGRE: Network Virtualization using GRE

This section describes Network Virtualization using GRE (NVGRE).
Network virtualization involves creating virtual Layer-2 topologies
on top of a physical Layer-3 network. Connectivity in the virtual
topology is provided by tunneling Ethernet frames in GRE over the
physical network.

In NVGRE, every virtual Layer-2 network is associated with a 24-bit
identifier called the Virtual Subnet Identifier (VSID). The VSID is
carried in an outer header, as defined in Section 3.2, allowing
unique identification of a tenant's virtual subnet to various
devices in the network. A 24-bit VSID supports up to 16 million
virtual subnets in the same management domain, in contrast to only
4K achievable with VLANs. Each VSID represents a virtual Layer-2
broadcast domain, which can be used to identify a virtual subnet of
a given tenant. To support a multi-subnet virtual topology,
datacenter administrators can configure routes to facilitate
communication between virtual subnets of the same tenant.

GRE is a proposed IETF standard [3][4] and provides a way to
encapsulate an arbitrary protocol over IP. NVGRE leverages the GRE
header to carry VSID information in each packet. The VSID
information in each packet can be used to build multi-tenant-aware
tools for traffic analysis, traffic inspection, and monitoring.

The following sections detail the packet format for NVGRE, describe
the functions of an NVGRE endpoint, illustrate typical traffic flow
both within and across data centers, and discuss address and policy
management as well as deployment considerations.

3.1. NVGRE Endpoint

NVGRE endpoints are the ingress/egress points between the virtual
and the physical networks. The NVGRE endpoints are the NVEs as
defined in the NVO3 Framework document [8]. Any physical server or
network device can be an NVGRE endpoint. One common deployment is
for the endpoint to be part of a hypervisor. The primary function of
this endpoint is to encapsulate/decapsulate Ethernet data frames to
and from the GRE tunnel, ensure Layer-2 semantics, and apply
isolation policy scoped on the VSID. The endpoint can optionally
participate in routing and function as a gateway in the virtual
topology. To encapsulate an Ethernet frame, the endpoint needs to
know the location information for the destination address in the
frame. This information can be provisioned via a management plane,
or obtained via a combination of control-plane distribution and
data-plane learning approaches. This document assumes that the
location information, including the VSID, is available to the NVGRE
endpoint.

3.2. NVGRE frame format

The GRE header format as specified in RFC 2784 and RFC 2890 [3][4]
is used for communication between NVGRE endpoints. NVGRE leverages
the Key extension specified in RFC 2890 to carry the VSID. The
packet format for Layer-2 encapsulation in GRE is shown in Figure 1.

    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   Outer Ethernet Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Outer) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Outer)Destination MAC Address | (Outer)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               (Outer) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Outer IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live | Protocol 0x2F |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  (Outer) Source Address                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                (Outer) Destination Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   GRE Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| |1|0|      Reserved0      | Ver |    Protocol Type 0x6558   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Virtual Subnet ID (VSID)           |    FlowID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner Ethernet Header
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (Inner) Destination MAC Address                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |(Inner)Destination MAC Address | (Inner)Source MAC Address    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               (Inner) Source MAC Address                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Optional Ethertype=C-Tag 802.1Q| PCP |0|     VID set to 0     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ethertype 0x0800       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Inner IPv4 Header:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options                            |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Original IP Payload                         |
   |                                                              |
   |                                                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Figure 1 GRE Encapsulation Frame Format

The outer/delivery headers include the outer Ethernet header and the
outer IP header:

o The outer Ethernet header: The source Ethernet address in the
outer frame is set to the MAC address associated with the NVGRE
endpoint. The destination endpoint may or may not be on the same
physical subnet. The destination Ethernet address is set to the MAC
address of the next-hop IP address for the destination NVE. The
outer VLAN tag information is optional and can be used for traffic
management and broadcast scalability on the physical network.

o The outer IP header: Both IPv4 and IPv6 can be used as the
delivery protocol for GRE. The IPv4 header is shown for illustrative
purposes. Henceforth, the IP address in the outer frame is referred
to as the Provider Address (PA). There can be one or more PA
addresses associated with an NVGRE endpoint, with policy controlling
the choice of PA to use for a given Customer Address (CA) of a
customer VM.

The GRE header:

o The C (Checksum Present) and S (Sequence Number Present) bits in
the GRE header MUST be zero.

o The K bit (Key Present) in the GRE header MUST be set to one. The
32-bit Key field in the GRE header is used to carry the Virtual
Subnet ID (VSID) and the FlowID:

   - Virtual Subnet ID (VSID): a 24-bit value that is used to
     identify the NVGRE-based virtual Layer-2 network.
   - FlowID: an 8-bit value that is used to provide per-flow entropy
     for flows in the same VSID. The FlowID MUST NOT be modified by
     transit devices. The encapsulating NVE SHOULD provide as much
     entropy as possible in the FlowID. If a FlowID is not
     generated, it MUST be set to all zeros.

o The protocol type field in the GRE header is set to 0x6558
(transparent Ethernet bridging) [2].

The inner headers (headers of the GRE payload):

o The inner Ethernet frame comprises an inner Ethernet header
followed by an optional inner IP header, followed by the IP payload.
The inner frame could be any Ethernet data frame, not just IP. Note
that the inner Ethernet frame's FCS is not encapsulated.

o Inner 802.1Q tag: The inner Ethernet header of NVGRE MUST NOT
contain an 802.1Q tag. The encapsulating NVE MUST remove any
existing 802.1Q tag before encapsulating the frame in NVGRE. A
decapsulating NVE MUST drop the frame if the inner Ethernet frame
contains an 802.1Q tag.

o For illustrative purposes, IPv4 headers are shown as the inner IP
headers, but IPv6 headers may be used. Henceforth, the IP address
contained in the inner frame is referred to as the Customer Address
(CA).
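
As a worked example of the field layout above, the following Python
sketch builds and parses the 8-byte GRE header used by NVGRE: C and
S bits zero, K bit one, protocol type 0x6558, and the Key field
carrying the 24-bit VSID and 8-bit FlowID. It is an illustrative
sketch, not a reference implementation; the function names are ours.

   import struct

   GRE_FLAG_C = 0x8000      # Checksum Present (MUST be zero)
   GRE_FLAG_K = 0x2000      # Key Present (MUST be one)
   GRE_FLAG_S = 0x1000      # Sequence Number Present (MUST be zero)
   PTYPE_TEB  = 0x6558      # transparent Ethernet bridging

   def build_nvgre_header(vsid: int, flow_id: int = 0) -> bytes:
       """Build the 8-byte GRE header used by NVGRE.

       The 32-bit Key carries the 24-bit VSID in the upper bits and
       the 8-bit FlowID in the lower bits.
       """
       if not 0 <= vsid <= 0xFFFFFF:
           raise ValueError("VSID is a 24-bit value")
       if not 0 <= flow_id <= 0xFF:
           raise ValueError("FlowID is an 8-bit value")
       key = (vsid << 8) | flow_id
       return struct.pack("!HHI", GRE_FLAG_K, PTYPE_TEB, key)

   def parse_nvgre_header(hdr: bytes) -> tuple[int, int]:
       """Return (vsid, flow_id), rejecting headers that violate
       the MUST rules of Section 3.2."""
       flags, ptype, key = struct.unpack("!HHI", hdr[:8])
       if ptype != PTYPE_TEB:
           raise ValueError("protocol type is not 0x6558")
       if flags & (GRE_FLAG_C | GRE_FLAG_S):
           raise ValueError("C and S bits MUST be zero")
       if not flags & GRE_FLAG_K:
           raise ValueError("K bit MUST be one")
       return key >> 8, key & 0xFF

A complete frame prepends the outer Ethernet and IP headers (IP
protocol number 0x2F for GRE) in front of this header and appends
the inner Ethernet frame after it.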

3.3. Reserved VSID

The VSID range 0-0xFFF is reserved for future use.

The VSID 0xFFFFFF is reserved for vendor-specific NVE-NVE
communication. The sender NVE SHOULD verify the receiver NVE's
vendor before sending a packet using this VSID; however, such a
verification mechanism is outside the scope of this document.
Implementations SHOULD choose a mechanism that meets their
requirements.
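
The following is a hedged sketch of how a receiver might classify
VSIDs against the ranges above; the three-way split is one plausible
policy, not normative text.

   VSID_RESERVED_MAX = 0xFFF      # 0x0-0xFFF reserved for future use
   VSID_VENDOR       = 0xFFFFFF   # vendor-specific NVE-NVE traffic

   def classify_vsid(vsid: int) -> str:
       """Classify a received VSID per Section 3.3 (illustrative)."""
       if vsid == VSID_VENDOR:
           # Vendor-specific NVE-NVE traffic; the sender SHOULD have
           # verified the receiver's vendor beforehand.
           return "vendor"
       if vsid <= VSID_RESERVED_MAX:
           return "reserved"      # reserved for future use
       return "tenant"            # normal tenant virtual subnet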

4. NVGRE Deployment Considerations

4.1. ECMP Support

Switches and routers SHOULD provide ECMP for NVGRE packets using the
outer frame fields and the entire Key field (32 bits).

4.2. Broadcast and Multicast Traffic

To support broadcast and multicast traffic inside a virtual subnet,
one or more administratively scoped multicast addresses [7][9] can
be assigned for the VSID. All multicast or broadcast traffic
originating from within a VSID is encapsulated and sent to the
assigned multicast address. From an administrative standpoint,
network operators can configure a PA multicast address for each
multicast address that is used inside a VSID, to facilitate optimal
multicast handling. Depending on the hardware capabilities of the
physical network devices and the physical network architecture,
multiple virtual subnets may reuse the same physical IP multicast
address.

Alternatively, based upon the configuration at the NVE, broadcast
and multicast traffic in the virtual subnet can be supported using
N-way unicast. In N-way unicast, the sender NVE sends one
encapsulated packet to every NVE in the virtual subnet. The sender
NVE can encapsulate and send the packet as described in Section 4.3
(Unicast Traffic). This removes the need for multicast support in
the physical network.

4.3. Unicast Traffic

The NVGRE endpoint encapsulates a Layer-2 packet in GRE using the
source PA associated with the endpoint, with the destination PA
corresponding to the location of the destination endpoint. As
outlined earlier, there can be one or more PAs associated with an
endpoint, and policy controls which ones are used for communication.
The encapsulated GRE packet is bridged and routed normally by the
physical network to the destination PA. Bridging uses the outer
Ethernet encapsulation for scope on the LAN. The only requirement is
bi-directional IP connectivity from the underlying physical network.
At the destination, the NVGRE endpoint decapsulates the GRE packet
to recover the original Layer-2 frame. Traffic flows similarly on
the reverse path.
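
The sender-side steps above can be summarized in a short, self-
contained sketch. The lookup table below is an assumption standing
in for the mapping policy provisioned by a management or control
plane (Section 3.1); all names and addresses are illustrative.

   # Illustrative sender-side logic; the table stands in for policy
   # distributed by a management/control plane (Section 3.1).
   from struct import pack

   # (inner destination MAC, VSID) -> destination PA of the NVE
   LOCATION = {("00:aa:bb:cc:dd:01", 0x123456): "192.0.2.20"}

   def encapsulate(inner_frame: bytes, dst_mac: str, vsid: int,
                   flow_id: int = 0) -> tuple[str, bytes]:
       """Return (destination PA, GRE header + inner frame).

       The outer Ethernet/IP headers are left to the host stack in
       this sketch; a real NVE would emit them itself, with IP
       protocol 0x2F (GRE) and the chosen source PA.
       """
       dst_pa = LOCATION[(dst_mac, vsid)]   # CA -> PA mapping lookup
       gre = pack("!HHI", 0x2000, 0x6558, (vsid << 8) | flow_id)
       return dst_pa, gre + inner_frame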

4.4. IP Fragmentation

RFC 2003 [11], Section 5.1, specifies mechanisms for handling
fragmentation when encapsulating IP within IP. The subset of
mechanisms NVGRE selects is intended to ensure that NVGRE-
encapsulated frames are not fragmented after encapsulation en route
to the destination NVGRE endpoint, and that traffic sources can
leverage Path MTU Discovery. A future version of this draft will
clarify the details around setting the DF bit on the outer IP
header, as well as maintaining per-destination NVGRE endpoint MTU
soft state so that ICMP Datagram Too Big messages can be exploited.
Fragmentation behavior when tunneling non-IP Ethernet frames in GRE
will also be specified in a future version.

4.5. Address/Policy Management & Routing

Address acquisition is beyond the scope of this document; addresses
can be obtained statically, dynamically, or using stateless address
autoconfiguration. CA and PA space can be either IPv4 or IPv6. In
fact, the address families don't have to match; for example, a CA
can be IPv4 while the PA is IPv6, and vice versa.

4.6. Cross-subnet, Cross-premise Communication

One application of this framework is that it provides a seamless
path for enterprises looking to expand their virtual machine hosting
capabilities into public clouds. Enterprises can bring their entire
IP subnet(s) and isolation policies, thus making the transition to
or from the cloud simpler. It is possible to move portions of an IP
subnet to the cloud; however, that requires additional configuration
on the enterprise network and is not discussed in this document.
Enterprises can continue to use existing communication models like
site-to-site VPN to secure their traffic.

A VPN gateway is used to establish a secure site-to-site tunnel over
the Internet, and all the enterprise services running in virtual
machines in the cloud use the VPN gateway to communicate back to the
enterprise. For simplicity, we use a VPN gateway configured as a VM,
shown in Figure 2, to illustrate cross-subnet, cross-premise
communication.

   +-----------------------+         +-----------------------+
   |        Server 1       |         |        Server 2       |
   | +--------+ +--------+ |         | +-------------------+ |
   | |  VM1   | |  VM2   | |         | |    VPN Gateway    | |
   | | IP=CA1 | | IP=CA2 | |         | | Internal External | |
   | |        | |        | |         | | IP=CAg   IP=GAdc  | |
   | +--------+ +--------+ |         | +-------------------+ |
   |       Hypervisor      |         |     Hypervisor|   ^   |
   +-----------------------+         +---------------|---:---+
             | IP=PA1                        | IP=PA4    :
             |                               |           :
             |   +-------------------------+ |           :  VPN
             +---|     Layer 3 Network     |-+           :  Tunnel
                 +-------------------------+             :
                              |                          :
   +------------------------------------------------------:--+
   |                                                      :   |
   |                       Internet                       :   |
   |                                                      :   |
   +------------------------------------------------------:--+
                              |                           v
                              |               +-------------------+
                              |               |    VPN Gateway    |
                              +---------------|                   |
                                     IP=GAcorp| External IP=GAcorp|
                                              +-------------------+
                                                        |
                                            +-----------------------+
                                            | Corp Layer 3 Network  |
                                            |     (In CA Space)     |
                                            +-----------------------+
                                                        |
                                          +---------------------------+
                                          |          Server X         |
                                          | +----------+ +----------+ |
                                          | | Corp VMe1| | Corp VMe2| |
                                          | | IP=CAe1  | | IP=CAe2  | |
                                          | +----------+ +----------+ |
                                          |         Hypervisor        |
                                          +---------------------------+

          Figure 2 Cross-Subnet, Cross-Premise Communication

The packet flow is similar to the unicast traffic flow between VMs;
the key difference is that the packet needs to be sent to a VPN
gateway before it gets forwarded to the destination. As part of the
routing configuration in the CA space, a per-tenant VPN gateway is
provisioned for communication back to the enterprise. The example
illustrates an outbound connection between VM1 inside the datacenter
and VMe1 inside the enterprise network. When the outbound packet
from CA1 to CAe1 reaches the hypervisor on Server 1, the NVE in
Server 1 can perform the equivalent of a route lookup on the packet.
The cross-premise packet will match the default gateway rule, as
CAe1 is not part of the tenant virtual network in the datacenter.
The virtualization policy will indicate that the packet is to be
encapsulated and sent to the PA of the tenant VPN gateway (PA4)
running as a VM on Server 2. The packet is decapsulated on Server 2
and delivered to the VM gateway. The gateway in turn validates and
sends the packet on the site-to-site VPN tunnel back to the
enterprise network. As the communication here is external to the
datacenter, the PA address for the VPN tunnel is globally routable.
The outer header of this packet is sourced from GAdc and destined to
GAcorp. This packet is routed through the Internet to the enterprise
VPN gateway, which is the other end of the site-to-site tunnel; at
that point, the VPN gateway decapsulates the packet and sends it
inside the enterprise, where CAe1 is routable on the network. The
reverse path is similar once the packet reaches the enterprise VPN
gateway.
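
The route-lookup step in this flow can be pictured as a longest-
prefix match in the tenant's CA space, with a default route pointing
at the VPN gateway's PA. The sketch below is an illustrative reading
of Figure 2; the routes and addresses are assumptions.

   # Illustrative route lookup for the flow in Figure 2; the routes
   # below are assumptions standing in for provisioned tenant policy.
   from ipaddress import ip_address, ip_network

   TENANT_ROUTES = [
       (ip_network("10.1.0.0/16"), "192.0.2.20"),  # in-DC subnet -> NVE PA
       (ip_network("0.0.0.0/0"),   "192.0.2.40"),  # default -> VPN GW (PA4)
   ]

   def next_hop_pa(dst_ca: str) -> str:
       """Longest-prefix match in CA space; returns the PA to tunnel to."""
       dst = ip_address(dst_ca)
       best = max((net for net, _ in TENANT_ROUTES if dst in net),
                  key=lambda net: net.prefixlen)
       return dict(TENANT_ROUTES)[best]

   # A destination like CAe1 (say, 172.16.0.5) misses the in-DC subnet
   # and matches the default route, so it is encapsulated toward the
   # VPN gateway's PA.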

4.7. Internet Connectivity

To enable connectivity to the Internet, an Internet gateway is
needed that bridges the virtualized CA space to the public Internet
address space. The gateway needs to perform translation between the
virtualized world and the Internet. For example, the NVGRE endpoint
can be part of a load balancer or a NAT, which replaces the VPN
gateway on Server 2 shown in Figure 2.

4.8. Management and Control Planes

There are several protocols that can manage and distribute policy;
however, they are out of scope for this document. Implementations
SHOULD choose a mechanism that meets their scale requirements.

4.9. NVGRE-Aware Devices

One example of a typical deployment consists of virtualized servers
deployed across multiple racks connected by one or more layers of
Layer-2 switches, which in turn may be connected to a Layer-3
routing domain. Even though routing in the physical infrastructure
will work without any modification with NVGRE, devices that perform
specialized processing in the network need to be able to parse GRE
to get access to tenant-specific information. Devices that
understand and parse the VSID can provide rich multi-tenancy-aware
services inside the data center. As outlined earlier, it is
imperative to exploit multiple paths inside the network through
techniques such as Equal-Cost Multipath (ECMP). The Key field (a
32-bit field including both the VSID and the optional FlowID) can
provide additional entropy to the switches to exploit path diversity
inside the network. A diverse ecosystem is expected to emerge as
more and more devices become multi-tenant aware. In the interim,
without requiring any hardware upgrades, there are alternatives to
exploit path diversity with GRE by associating multiple PAs with
NVGRE endpoints, with policy controlling the choice of PA to be
used.

It is expected that communication can span multiple data centers and
also cross the virtual-to-physical boundary. Typical scenarios that
require virtual-to-physical communication include access to storage
and databases. Scenarios demanding lossless Ethernet functionality
may not be amenable to NVGRE, as traffic is carried over an IP
network. NVGRE endpoints mediate between the network-virtualized and
non-network-virtualized environments. This functionality can be
incorporated into Top-of-Rack switches, storage appliances, load
balancers, routers, etc., or built as a stand-alone appliance.

It is imperative to consider the impact of any solution on host
performance. Today's server operating systems employ sophisticated
acceleration techniques such as checksum offload, Large Send Offload
(LSO), Receive Segment Coalescing (RSC), Receive Side Scaling (RSS),
Virtual Machine Queue (VMQ), etc. These technologies should become
NVGRE aware. IPsec Security Associations (SAs) can be offloaded to
the NIC so that computationally expensive cryptographic operations
are performed at line rate in the NIC hardware. These SAs are based
on the IP addresses of the endpoints. As each packet on the wire
gets translated, the NVGRE endpoint SHOULD intercept the offload
requests and perform the appropriate address translation. This
ensures that IPsec continues to be usable with network
virtualization while taking advantage of hardware offload
capabilities for improved performance.
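
As an illustration of how a switch might use this entropy, the
sketch below mixes the outer IP addresses with the 32-bit Key to
select one of several equal-cost next hops. The hash function and
exact field set are our assumptions; this document only recommends
that ECMP use the outer frame fields and the entire Key field.

   # Illustrative ECMP path selection using outer fields plus the
   # 32-bit GRE Key (VSID + FlowID) for entropy; the hash choice is
   # an assumption, not mandated by this document.
   import zlib

   def ecmp_next_hop(src_pa: str, dst_pa: str, gre_key: int,
                     next_hops: list[str]) -> str:
       seed = f"{src_pa}|{dst_pa}|{gre_key:08x}".encode()
       return next_hops[zlib.crc32(seed) % len(next_hops)]

   # Two flows in the same virtual subnet (same VSID) but with
   # different FlowIDs yield different Keys and can take different
   # equal-cost paths.
   paths = ["leaf1", "leaf2", "leaf3", "leaf4"]
   a = ecmp_next_hop("192.0.2.10", "192.0.2.20",
                     (0x123456 << 8) | 0x01, paths)
   b = ecmp_next_hop("192.0.2.10", "192.0.2.20",
                     (0x123456 << 8) | 0x02, paths)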

4.10. Network Scalability with NVGRE

One of the key benefits of using NVGRE is the IP address
scalability, and in turn the MAC address table scalability, that can
be achieved. An NVGRE endpoint can use one PA to represent multiple
CAs. This lowers the burden on the MAC address table sizes at the
Top-of-Rack switches. One obvious benefit is in the context of
server virtualization, which has increased the demands on the
network infrastructure. By embedding an NVGRE endpoint in a
hypervisor, it is possible to scale significantly. This framework
allows location information to be preconfigured inside an NVGRE
endpoint, allowing broadcast ARP traffic to be proxied locally. This
approach can scale to large virtual subnets. These virtual subnets
can be spread across multiple Layer-3 physical subnets. It allows
workloads to be moved around without imposing a huge burden on the
network control plane. By eliminating most broadcast traffic and
converting the rest to multicast, the routers and switches can
function more efficiently by building efficient multicast trees. By
using server and network capacity efficiently, it is possible to
drive down the cost of building and managing data centers.
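
The following is a minimal sketch of the local ARP-proxy idea
mentioned above, assuming the NVE has been preconfigured with the
CA-to-MAC bindings for the virtual subnet; the table and the
function name are illustrative, not part of this specification.

   # Illustrative local ARP proxy at an NVE: answer ARP requests for
   # known CAs from preconfigured state instead of flooding them.
   # The binding table is an assumption standing in for policy that
   # a management plane would provision (Section 3.1).
   ARP_TABLE = {(0x123456, "10.0.0.6"): "00:aa:bb:cc:dd:06"}

   def proxy_arp(vsid: int, target_ca: str):
       """Return the MAC to answer with, or None to drop/flood."""
       return ARP_TABLE.get((vsid, target_ca))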

5. Security Considerations

This proposal extends the Layer-2 subnet across the data center and
increases the scope for spoofing attacks. Mitigation of such attacks
is possible with authentication/encryption using IPsec or any other
IP-based mechanism. The control plane for policy distribution is
expected to be secured by using any of the existing security
protocols. Further, management traffic can be isolated in a separate
subnet/VLAN.

6. IANA Considerations

This document has no IANA actions.

7. References

7.1. Normative References

[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
    Levels", BCP 14, RFC 2119, March 1997.

[2] "Ethertypes",
    ftp://ftp.isi.edu/in-notes/iana/assignments/ethernet-numbers

[3] Farinacci, D., et al., "Generic Routing Encapsulation (GRE)",
    RFC 2784, March 2000.

[4] Dommety, G., "Key and Sequence Number Extensions to GRE",
    RFC 2890, September 2000.

7.2. Informative References

[5] Greenberg, A., et al., "VL2: A Scalable and Flexible Data Center
    Network", Proc. SIGCOMM 2009.

[6] Greenberg, A., et al., "The Cost of a Cloud: Research Problems
    in the Data Center", ACM SIGCOMM Computer Communication Review.

[7] Hinden, R. and S. Deering, "IP Version 6 Addressing
    Architecture", RFC 4291, February 2006.

[8] Lasserre, M., et al., "Framework for DC Network Virtualization",
    draft-ietf-nvo3-framework (work in progress), July 2013.

[9] Meyer, D., "Administratively Scoped IP Multicast", BCP 23,
    RFC 2365, July 1998.

[10] Narten, T., et al., "Problem Statement: Overlays for Network
     Virtualization", draft-narten-nvo3-overlay-problem-statement
     (work in progress), July 2013.

[11] Perkins, C., "IP Encapsulation within IP", RFC 2003,
     October 1996.

[12] Touch, J. and R. Perlman, "Transparent Interconnection of Lots
     of Links (TRILL): Problem and Applicability Statement",
     RFC 5556, May 2009.

8. Acknowledgments

This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

Murari Sridharan
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: muraris@microsoft.com

Yu-Shun Wang
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: yushwang@microsoft.com

Albert Greenberg
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: albert@microsoft.com

Pankaj Garg
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Email: pankajg@microsoft.com

Narasimhan Venkataramiah
Facebook Inc
1730 Minor Ave.
Seattle, WA 98101
Email: navenkat@microsoft.com

Kenneth Duda
Arista Networks, Inc.
5470 Great America Pkwy
Santa Clara, CA 95054
Email: kduda@aristanetworks.com

Ilango Ganga
Intel Corporation
2200 Mission College Blvd.
M/S: SC12-325
Santa Clara, CA 95054
Email: ilango.s.ganga@intel.com

Geng Lin
Google
1600 Amphitheatre Parkway
Mountain View, California 94043
Email: genglin@google.com

Mark Pearson
Hewlett-Packard Co.
8000 Foothills Blvd.
Roseville, CA 95747
Email: mark.pearson@hp.com

Patricia Thaler
Broadcom Corporation
3151 Zanker Road
San Jose, CA 95134
Email: pthaler@broadcom.com

Chait Tumuluri
Emulex Corporation
3333 Susan Street
Costa Mesa, CA 92626
Email: chait@emulex.com