idnits 2.17.1 

draft-ietf-nvo3-framework-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == The page length should not exceed 58 lines per page, but there was 25
     longer pages, the longest (page 19) being 72 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 25 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 314 instances of too long lines in the document, the longest
     one being 3 characters in excess of 72.

  == There are 4 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (November 12, 2013) is 3812 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'OF' is mentioned on line 390, but not defined

  == Unused Reference: 'RFC2119' is defined on line 1065, but no explicit
     reference was found in the text

  == Unused Reference: 'OVCPREQ' is defined on line 1074, but no explicit
     reference was found in the text

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)


     Summary: 1 error (**), 0 flaws (~~), 7 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	    Internet Engineering Task Force                          Marc Lasserre
3	    Internet Draft                                            Florin Balus
4	    Intended status: Informational                          Alcatel-Lucent
5	    Expires: May 2014
6	                                                              Thomas Morin
7	                                                     France Telecom Orange

9	                                                               Nabil Bitar
10	                                                                   Verizon

12	                                                             Yakov Rekhter
13	                                                                   Juniper

15	                                                         November 12, 2013

17	                      Framework for DC Network Virtualization
18	                         draft-ietf-nvo3-framework-04.txt

20	    Abstract

22	       This document provides a framework for Network Virtualization over
23	       L3 (NVO3) and it defines a reference model along with logical
24	       components required to design a solution.

26	    Status of this Memo

28	       This Internet-Draft is submitted in full conformance with the
29	       provisions of BCP 78 and BCP 79.

31	       Internet-Drafts are working documents of the Internet Engineering
32	       Task Force (IETF).  Note that other groups may also distribute
33	       working documents as Internet-Drafts. The list of current Internet-
34	       Drafts is at http://datatracker.ietf.org/drafts/current/.

36	       Internet-Drafts are draft documents valid for a maximum of six
37	       months and may be updated, replaced, or obsoleted by other documents
38	       at any time.  It is inappropriate to use Internet-Drafts as
39	       reference material or to cite them other than as "work in progress."

41	       This Internet-Draft will expire on May 12, 2014.

43	    Internet-Draft  Framework for DC Network Virtualization      November
44	    2013

46	    Copyright Notice

48	       Copyright (c) 2013 IETF Trust and the persons identified as the
49	       document authors. All rights reserved.

51	       This document is subject to BCP 78 and the IETF Trust's Legal
52	       Provisions Relating to IETF Documents
53	       (http://trustee.ietf.org/license-info) in effect on the date of
54	       publication of this document. Please review these documents
55	       carefully, as they describe your rights and restrictions with
56	       respect to this document. Code Components extracted from this
57	       document must include Simplified BSD License text as described in
58	       Section 4.e of the Trust Legal Provisions and are provided without
59	       warranty as described in the Simplified BSD License.

61	    Table of Contents

63	       1. Introduction..................................................3
64	          1.1. General terminology......................................3
65	          1.2. DC network architecture..................................6
66	       2. Reference Models..............................................8
67	          2.1. Generic Reference Model..................................8
68	          2.2. NVE Reference Model.....................................11
69	          2.3. NVE Service Types.......................................12
70	             2.3.1. L2 NVE providing Ethernet LAN-like service.........12
71	             2.3.2. L3 NVE providing IP/VRF-like service...............12
72	       3. Functional components........................................12
73	          3.1. Service Virtualization Components.......................12
74	             3.1.1. Virtual Access Points (VAPs).......................12
75	             3.1.2. Virtual Network Instance (VNI).....................13
76	             3.1.3. Overlay Modules and VN Context.....................13
77	             3.1.4. Tunnel Overlays and Encapsulation options..........14
78	             3.1.5. Control Plane Components...........................14
79	             3.1.5.1. Distributed vs Centralized Control Plane.........14
80	             3.1.5.2. Auto-provisioning/Service discovery..............15
81	             3.1.5.3. Address advertisement and tunnel mapping.........15
82	             3.1.5.4. Overlay Tunneling................................16
83	          3.2. Multi-homing............................................16
84	          3.3. VM Mobility.............................................17
85	       4. Key aspects of overlay networks..............................18
86	          4.1. Pros & Cons.............................................18
87	          4.2. Overlay issues to consider..............................20
88	             4.2.1. Data plane vs Control plane driven.................20
89	             4.2.2. Coordination between data plane and control plane..20
94	             4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM)
95	             traffic...................................................20
96	             4.2.4. Path MTU...........................................21
97	             4.2.5. NVE location trade-offs............................22
98	             4.2.6. Interaction between network overlays and underlays.23
99	       5. Security Considerations......................................23
100	       6. IANA Considerations..........................................24
101	       7. References...................................................24
102	          7.1. Normative References....................................24
103	          7.2. Informative References..................................24
104	       8. Acknowledgments..............................................24

106	    1. Introduction

108	       This document provides a framework for Data Center Network
109	       Virtualization over Layer3 (L3) tunnels. This framework is intended
110	       to aid in standardizing protocols and mechanisms to support large-
111	       scale network virtualization for data centers.

113	       [NVOPS] defines the rationale for using overlay networks in order to
114	       build large multi-tenant data center networks. Compute, storage and
115	       network virtualization are often used in these large data centers to
116	       support a large number of communication domains and end systems.

118	       This document provides reference models and functional components of
119	       data center overlay networks as well as a discussion of technical
120	       issues that have to be addressed.

122	    1.1. General terminology

124	       This document uses the following terminology:

126	       NVO3 Network: An overlay network that provides a Layer2 (L2) or
127	       Layer3 (L3) service to Tenant Systems over an L3 underlay network
128	       using the architecture and protocols as defined by the NVO3 Working
129	       Group.

131	       Network Virtualization Edge (NVE): An NVE is the network entity that
132	       sits at the edge of an underlay network and implements L2 and/or L3
133	       network virtualization functions. The network-facing side of the NVE
134	       uses the underlying L3 network to tunnel tenant frames to and from
135	       other NVEs. The tenant-facing side of the NVE sends and receives
136	       Ethernet frames to and from individual Tenant Systems.  An NVE could
137	       be implemented as part of a virtual switch within a hypervisor, a
142	       physical switch or router, a Network Service Appliance, or be split
143	       across multiple devices.

145	       Virtual Network (VN): A VN is a logical abstraction of a physical
146	       network that provides L2 or L3 network services to a set of Tenant
147	       Systems. A VN is also known as a Closed User Group (CUG).

149	       Virtual Network Instance (VNI): A specific instance of a VN from the
150	       perspective of an NVE.

152	       Virtual Network Context (VN Context) Identifier: Field in an
153	       overlay encapsulation header that identifies the specific VN the
154	       packet belongs to. The egress NVE uses the VN Context identifier to
155	       deliver the packet to the correct Tenant System. The VN Context
156	       identifier can be a locally significant identifier or a globally
157	       unique identifier.

159	       Underlay or Underlying Network: The network that provides the
160	       connectivity among NVEs and over which NVO3 packets are tunneled,
161	       where an NVO3 packet carries an NVO3 overlay header followed by a
162	       tenant packet. The Underlay Network does not need to be aware that
163	       it is carrying NVO3 packets. Addresses on the Underlay Network
164	       appear as "outer addresses" in encapsulated NVO3 packets. In
165	       general, the Underlay Network can use a completely different
166	       protocol (and address family) from that of the overlay. In the case
167	       of NVO3, the underlay network is IP.

169	       Data Center (DC): A physical complex housing physical servers,
170	       network switches and routers, network service appliances and
171	       networked storage. The purpose of a Data Center is to provide
172	       application, compute and/or storage services. One such service is
173	       virtualized infrastructure data center services, also known as
174	       Infrastructure as a Service.

176	       Virtual Data Center (Virtual DC): A container for virtualized
177	       compute, storage and network services. A Virtual DC is associated
178	       with a single tenant, and can contain multiple VNs and Tenant
179	       Systems connected to one or more of these VNs.

181	       Virtual machine (VM): A software implementation of a physical
182	       machine that runs programs as if they were executing on a physical,
183	       non-virtualized machine.  Applications (generally) do not know they
184	       are running on a VM as opposed to running on a "bare metal" host or
185	       server, though some systems provide a para-virtualization
190	       environment that allows an operating system or application to be
191	       aware of the presence of virtualization for optimization purposes.

193	       Hypervisor: Software running on a server that allows multiple VMs to
194	       run on the same physical server. The hypervisor manages and provides
195	       shared compute/memory/storage and network connectivity to the VMs
196	       that it hosts. Hypervisors often embed a Virtual Switch (see below).

198	       Server: A physical end host machine that runs user applications. A
199	       standalone (or "bare metal") server runs a conventional operating
200	       system hosting a single-tenant application. A virtualized server
201	       runs a hypervisor supporting one or more VMs.

203	       Virtual Switch (vSwitch): A function within a Hypervisor (typically
204	       implemented in software) that provides similar forwarding services
205	       to a physical Ethernet switch. A vSwitch forwards Ethernet frames
206	       between VMs running on the same server, or between a VM and a
207	       physical NIC connecting the server to a physical Ethernet
208	       switch or router. A vSwitch also enforces network isolation between
209	       VMs that by policy are not permitted to communicate with each other
210	       (e.g., by honoring VLANs). A vSwitch may be bypassed when an NVE is
211	       enabled on the host server.

213	       Tenant: The customer using a virtual network and any associated
214	       resources (e.g., compute, storage and network).  A tenant could be
215	       an enterprise, or a department/organization within an enterprise.

217	       Tenant System: A physical or virtual system that can play the role
218	       of a host, or a forwarding element such as a router, switch,
219	       firewall, etc. It belongs to a single tenant and connects to one or
220	       more VNs of that tenant.

222	       Tenant Separation: Tenant Separation refers to isolating traffic of
223	       different tenants such that traffic from one tenant is not visible
224	       to or delivered to another tenant, except when allowed by policy.
225	       Tenant Separation also refers to address space separation, whereby
226	       different tenants can use the same address space without conflict.

228	       Virtual Access Points (VAPs): A logical connection point on the NVE
229	       for connecting a Tenant System to a virtual network. Tenant Systems
230	       connect to VNIs at an NVE through VAPs. VAPs can be physical ports
231	       or virtual ports identified through logical interface identifiers
232	       (e.g., VLAN ID, internal vSwitch Interface ID connected to a VM).

237	       End Device: A physical device that connects directly to the DC
238	       Underlay Network. This is in contrast to a Tenant System, which
239	       connects to a corresponding tenant VN. An End Device is administered
240	       by the DC operator rather than a tenant, and is part of the DC
241	       infrastructure. An End Device may implement NVO3 technology in
242	       support of NVO3 functions. Examples of an End Device include hosts
243	       (e.g., server or server blade), storage systems (e.g., file servers,
244	       iSCSI storage systems), and network devices (e.g., firewall,
245	       load balancer, IPsec gateway).

247	       Network Virtualization Authority (NVA): Entity that provides
248	       reachability and forwarding information to NVEs.

250	    1.2. DC network architecture

252	       A generic architecture for Data Centers is depicted in Figure 1:

257	                                    ,---------.
258	                                  ,'           `.
259	                                 (  IP/MPLS WAN )
260	                                  `.           ,'
261	                                    `-+------+'
262	                                     \      /
263	                              +--------+   +--------+
264	                              |   DC   |+-+|   DC   |
265	                              |gateway |+-+|gateway |
266	                              +--------+   +--------+
267	                                    |       /
268	                                    .--. .--.
269	                                  (    '    '.--.
270	                                .-.' Intra-DC     '
271	                               (     network      )
272	                                (             .'-'
273	                                 '--'._.'.    )\ \
274	                                 / /     '--'  \ \
275	                                / /      | |    \ \
276	                       +--------+   +--------+   +--------+
277	                       | access |   | access |   | access |
278	                       | switch |   | switch |   | switch |
279	                       +--------+   +--------+   +--------+
280	                          /     \    /    \     /      \
281	                       __/_      \  /      \   /_      _\__
282	                 '--------'   '--------'   '--------'   '--------'
283	                 :  End   :   :  End   :   :  End   :   :  End   :
284	                 : Device :   : Device :   : Device :   : Device :
285	                 '--------'   '--------'   '--------'   '--------'

287	                 Figure 1 : A Generic Architecture for Data Centers

289	       An example of multi-tier DC network architecture is presented in
290	       Figure 1. It provides a view of physical components inside a DC.

292	       A DC network is usually composed of intra-DC networks and network
293	       services, and inter-DC network and network connectivity services.

295	       DC networking elements can act as strict L2 switches and/or provide
296	       IP routing capabilities, including network service virtualization.

298	       In some DC architectures, some tier layers could provide L2 and/or
299	       L3 services. In addition, some tier layers may be collapsed, and
300	       Internet connectivity, inter-DC connectivity and VPN support may be
301	       handled by a smaller number of nodes. Nevertheless, one can assume
306	       that the network functional blocks in a DC fit in the architecture
307	       depicted in Figure 1.

309	       The following components can be present in a DC:

311	          o Access switch: Hardware-based Ethernet switch aggregating all
312	            Ethernet links from the End Devices in a rack, representing the
313	            entry point in the physical DC network for the hosts. It may
314	            also provide routing functionality, virtual IP network
315	            connectivity, or Layer2 tunneling over IP, for instance. Access
316	            switches are usually multi-homed to aggregation switches in the
317	            Intra-DC network. A typical example of an access switch is a
318	            Top of Rack (ToR) switch. Other deployment scenarios may use an
319	            intermediate Blade Switch before the ToR, or an End of Row
320	            (EoR) switch, to provide a function similar to that of a ToR.

322	          o Intra-DC Network: Network composed of high capacity core nodes
323	            (Ethernet switches/routers). Core nodes may provide virtual
324	            Ethernet bridging and/or IP routing services.

326	          o DC Gateway (DC GW): Gateway to the outside world providing DC
327	            Interconnect and connectivity to Internet and VPN customers. In
328	            the current DC network model, this may be simply a router
329	            connected to the Internet and/or an IP Virtual Private Network
330	            (VPN)/L2VPN PE. Some network implementations may dedicate DC
331	            GWs for different connectivity types (e.g., a DC GW for
332	            Internet, and another for VPN).

334	       Note that End Devices may be single or multi-homed to access
335	       switches.

337	    2. Reference Models

339	    2.1. Generic Reference Model

341	       Figure 2 depicts a DC reference model for network virtualization
342	       using L3 (IP/MPLS) overlays where NVEs provide a logical
343	       interconnect between Tenant Systems that belong to a specific VN.

348	             +--------+                                    +--------+
349	             | Tenant +--+                            +----| Tenant |
350	             | System |  |                           (')   | System |
351	             +--------+  |    .................     (   )  +--------+
352	                         |  +---+           +---+    (_)
353	                         +--|NVE|---+   +---|NVE|-----+
354	                            +---+   |   |   +---+
355	                            / .    +-----+      .
356	                           /  . +--| NVA |      .
357	                          /   . |  +-----+      .
358	                         |    . |               .
359	                         |    . |  L3 Overlay +--+--++--------+
360	             +--------+  |    . |   Network   | NVE || Tenant |
361	             | Tenant +--+    . |             |     || System |
362	             | System |       .  \ +---+      +--+--++--------+
363	             +--------+       .....|NVE|.........
364	                                   +---+
365	                                     |
366	                                     |
367	                           =====================
368	                             |               |
369	                         +--------+      +--------+
370	                         | Tenant |      | Tenant |
371	                         | System |      | System |
372	                         +--------+      +--------+

374	          Figure 2 : Generic reference model for DC network virtualization
375	                         over a Layer3 (IP) infrastructure

377	       In order to get reachability information, NVEs may exchange
378	       information directly between themselves via a protocol. In this
383	       case, a control plane module resides in every NVE. This is how
384	       routing control plane modules are implemented in routers, for
385	       instance.

387	       It is also possible for NVEs to communicate with an external Network
388	       Virtualization Authority (NVA) to obtain reachability and forwarding
389	       information. In this case, a protocol is used between NVEs and
390	       NVA(s) to exchange information. OpenFlow [OF] is one example of such
391	       a protocol.
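
	       The NVE-to-NVA interaction described above can be sketched as
	       follows. This is an illustrative sketch only, not a protocol
	       defined by this document: the NVA and NVE classes, their method
	       names, and the in-memory mapping table are hypothetical, and the
	       addresses are documentation values.

```python
from typing import Dict, Optional, Tuple

# Key: (VN identifier, tenant address); value: underlay IP of the serving NVE.
Mapping = Dict[Tuple[int, str], str]


class NVA:
    """Hypothetical in-memory Network Virtualization Authority."""

    def __init__(self) -> None:
        self._map: Mapping = {}

    def register(self, vn_id: int, tenant_addr: str, nve_ip: str) -> None:
        # An NVE reports that a Tenant System is reachable through it.
        self._map[(vn_id, tenant_addr)] = nve_ip

    def lookup(self, vn_id: int, tenant_addr: str) -> Optional[str]:
        # Returns the underlay address of the egress NVE, or None if unknown.
        return self._map.get((vn_id, tenant_addr))


class NVE:
    """Hypothetical NVE that obtains reachability from an external NVA."""

    def __init__(self, underlay_ip: str, nva: NVA) -> None:
        self.underlay_ip = underlay_ip
        self.nva = nva
        self.cache: Mapping = {}

    def attach(self, vn_id: int, tenant_addr: str) -> None:
        self.nva.register(vn_id, tenant_addr, self.underlay_ip)

    def resolve(self, vn_id: int, tenant_addr: str) -> Optional[str]:
        key = (vn_id, tenant_addr)
        if key not in self.cache:
            nve_ip = self.nva.lookup(vn_id, tenant_addr)
            if nve_ip is None:
                return None  # no mapping known to the NVA
            self.cache[key] = nve_ip
        return self.cache[key]


nva = NVA()
nve1 = NVE("192.0.2.1", nva)
nve2 = NVE("192.0.2.2", nva)
nve2.attach(10, "00:00:5e:00:53:02")
remote = nve1.resolve(10, "00:00:5e:00:53:02")
```

	       In this sketch the mapping is scoped by VN, so the same tenant
	       address looked up in a different VN does not resolve.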

393	       It should be noted that NVAs may be organized in clusters for
394	       redundancy and scalability and can appear as one logically
395	       centralized controller. In this case, inter-NVA communication is
396	       necessary to synchronize state among nodes within a cluster or share
397	       information across clusters. The information exchanged between NVAs
398	       of the same cluster could be different from the information
399	       exchanged across clusters.

401	       A Tenant System can be attached to an NVE in several ways:

403	         - locally, by being co-located in the same End Device

405	         - remotely, via a point-to-point connection or a switched network

407	       When an NVE is co-located with a Tenant System, the state of the
408	       Tenant System can be provided without protocol assistance. For
409	       instance, the operational status of a VM can be communicated via a
410	       local API. When an NVE is remotely connected to a Tenant System, the
411	       state of the Tenant System or NVE needs to be exchanged directly or
412	       via a management entity, using a control plane protocol or API, or
413	       directly via a data plane protocol.

415	       The functional components in Figure 2 do not necessarily map
416	       directly to the physical components described in Figure 1. For
417	       example, an End Device can be a server blade with VMs and a virtual
418	       switch. A VM can be a Tenant System and the NVE functions may be
419	       performed by the host server. In this case, the Tenant System and
420	       NVE function are co-located. Another example is the case where the
421	       End Device is the Tenant System, and the NVE function can be
422	       implemented by the connected ToR. In this case, the Tenant System
423	       and NVE function are not co-located.

425	       Underlay nodes utilize L3 technologies to interconnect NVE nodes.
426	       These nodes perform forwarding based on outer L3 header information,
427	       and generally do not maintain per tenant-service state albeit some
432	       applications (e.g., multicast) may require control plane or
433	       forwarding plane information that pertains to a tenant, group of
434	       tenants, tenant service or a set of services that belong to one or
435	       more tenants. Mechanisms to control the amount of state maintained
436	       in the underlay may be needed.

438	    2.2. NVE Reference Model

440	       Figure 3 depicts the NVE reference model. One or more VNIs can be
441	       instantiated on an NVE. A Tenant System interfaces with a
442	       corresponding VNI via a VAP. An overlay module provides tunneling
443	       overlay functions (e.g., encapsulation and decapsulation of tenant
444	       traffic, tenant identification and mapping, etc.).

446	                         +-------- L3 Network -------+
447	                         |                           |
448	                         |        Tunnel Overlay     |
449	             +------------+---------+       +---------+------------+
450	             | +----------+-------+ |       | +---------+--------+ |
451	             | |  Overlay Module  | |       | |  Overlay Module  | |
452	             | +---------+--------+ |       | +---------+--------+ |
453	             |           |VN context|       | VN context|          |
454	             |           |          |       |           |          |
455	             |  +--------+-------+  |       |  +--------+-------+  |
456	             |  | |VNI|   .  |VNI|  |       |  | |VNI|   .  |VNI|  |
457	        NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2
458	             |    |   VAPs     |    |       |    |    VAPs   |     |
459	             +----+------------+----+       +----+-----------+-----+
460	                  |            |                 |           |
461	                  |            |                 |           |
462	                 Tenant Systems                 Tenant Systems

464	                      Figure 3 : Generic NVE reference model

466	       Note that some NVE functions (e.g., data plane and control plane
467	       functions) may reside in one device or may be implemented separately
472	       in different devices. In addition, NVE functions can be implemented
473	       in a hierarchical fashion. For instance, an End Device can act as an
474	       NVE spoke, while an access switch can act as an NVE hub.

476	    2.3. NVE Service Types

478	       An NVE provides different types of virtualized network services to
479	       multiple tenants, i.e., an L2 service or an L3 service. Note that an
480	       NVE may be capable of providing both L2 and L3 services for a
481	       tenant. This section defines the service types and associated
482	       attributes.

484	    2.3.1. L2 NVE providing Ethernet LAN-like service

486	       An L2 NVE implements Ethernet LAN emulation, an Ethernet-based
487	       multipoint service similar to an IETF VPLS or EVPN service, where
488	       the Tenant Systems appear to be interconnected by a LAN environment
489	       over an L3 overlay. As such, an L2 NVE provides a per-tenant virtual
490	       switching instance (L2 VNI) and L3 (IP/MPLS) tunneling
491	       encapsulation of tenant MAC frames across the underlay. Note that
492	       the control plane for an L2 NVE could be implemented locally on the
493	       NVE or in a separate control entity.
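
	       A per-tenant virtual switching instance can be pictured as one
	       MAC table per L2 VNI. The sketch below is illustrative only: the
	       L2VNI class, the target tuples, and the FLOOD sentinel are
	       hypothetical names, and how entries are populated (data plane
	       learning or a control plane) is deliberately left open.

```python
from typing import Dict, Tuple

# A forwarding target is a local VAP or a remote NVE underlay address.
Target = Tuple[str, str]  # ("vap", name) or ("nve", underlay IP)

# Hypothetical sentinel for destinations with no entry (unknown unicast).
FLOOD: Target = ("flood", "")


class L2VNI:
    """Hypothetical per-tenant virtual switching instance."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mac_table: Dict[str, Target] = {}

    def install(self, mac: str, target: Target) -> None:
        # Entries may come from local learning or from a control entity.
        self.mac_table[mac.lower()] = target

    def forward(self, dst_mac: str) -> Target:
        # Each VNI consults only its own table, so tenants stay isolated.
        return self.mac_table.get(dst_mac.lower(), FLOOD)


red = L2VNI("red")
blue = L2VNI("blue")
red.install("00:00:5E:00:53:01", ("vap", "vnet1"))
red.install("00:00:5e:00:53:02", ("nve", "192.0.2.2"))
blue.install("00:00:5e:00:53:01", ("vap", "vnet9"))
```

	       The same MAC address appears in both instances without
	       conflict, because each lookup is confined to a single VNI.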

495	    2.3.2. L3 NVE providing IP/VRF-like service

497	       An L3 NVE provides a virtualized IP forwarding service, similar to
498	       IETF IP VPN (e.g., BGP/MPLS IP VPN [RFC4364]) from a service
499	       definition perspective. That is, an L3 NVE provides a per-tenant
500	       forwarding and routing instance (L3 VNI) and L3 (IP/MPLS) tunneling
501	       encapsulation of tenant IP packets across the underlay. Note that
502	       routing could be performed locally on the NVE or in a separate
503	       control entity.
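
	       A per-tenant forwarding and routing instance can be sketched as
	       a separate route table per L3 VNI, searched by longest-prefix
	       match. This is a minimal illustration under stated assumptions:
	       the L3VNI class and its methods are hypothetical, only IPv4 is
	       handled, and the next hop is simply the underlay address of the
	       egress NVE.

```python
import ipaddress
from typing import List, Optional, Tuple


class L3VNI:
    """Hypothetical per-tenant forwarding and routing instance."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.routes: List[Tuple[ipaddress.IPv4Network, str]] = []

    def add_route(self, prefix: str, nve_ip: str) -> None:
        self.routes.append((ipaddress.IPv4Network(prefix), nve_ip))

    def lookup(self, dst: str) -> Optional[str]:
        # Longest-prefix match restricted to this tenant's own routes.
        addr = ipaddress.IPv4Address(dst)
        best = None
        for net, nve_ip in self.routes:
            if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, nve_ip)
        return best[1] if best else None


tenant_a = L3VNI("tenant-a")
tenant_b = L3VNI("tenant-b")
tenant_a.add_route("10.0.0.0/16", "192.0.2.1")
tenant_a.add_route("10.0.1.0/24", "192.0.2.3")
tenant_b.add_route("10.0.1.0/24", "192.0.2.2")  # overlaps tenant A's space
```

	       Both tenants use 10.0.1.0/24 here, yet each lookup resolves to
	       that tenant's own egress NVE.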

505	    3. Functional components

507	       This section decomposes the Network Virtualization architecture into
508	       functional components described in Figure 3 to make it easier to
509	       discuss solution options for these components.

511	    3.1. Service Virtualization Components

513	    3.1.1. Virtual Access Points (VAPs)

515	       Tenant Systems are connected to VNIs through Virtual Access Points
516	       (VAPs).

521	       VAPs can be physical ports or virtual ports identified through
522	       logical interface identifiers (e.g., VLAN ID, internal vSwitch
523	       Interface ID connected to a VM).
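
	       As an illustration, a VAP can be modeled as a (port, VLAN ID)
	       pair that an NVE maps to a VNI. The names below (VAP, vap_to_vni,
	       classify) and the port and VNI labels are hypothetical and are
	       not defined by this document.

```python
from typing import Dict, NamedTuple, Optional


class VAP(NamedTuple):
    port: str                # physical port or internal vSwitch interface
    vlan_id: Optional[int]   # None when the whole port is the VAP


# Hypothetical provisioning state on one NVE: which VAP attaches to which VNI.
vap_to_vni: Dict[VAP, str] = {
    VAP("eth1", 100): "vni-red",
    VAP("eth1", 200): "vni-blue",
    VAP("vnet7", None): "vni-red",
}


def classify(port: str, vlan_id: Optional[int]) -> Optional[str]:
    """Return the VNI for traffic arriving on a VAP, or None if unprovisioned."""
    return vap_to_vni.get(VAP(port, vlan_id))
```

	       Two VLANs on the same physical port map to different VNIs, while
	       an unprovisioned VLAN yields no VNI at all.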

525	    3.1.2. Virtual Network Instance (VNI)

527	       A VNI is a specific VN instance on an NVE. Each VNI defines a
528	       forwarding context that contains reachability information and
529	       policies.

531	    3.1.3. Overlay Modules and VN Context

533	       Mechanisms for identifying each tenant service are required to allow
534	       the simultaneous overlay of multiple tenant services over the same
535	       underlay L3 network topology. In the data plane, each NVE, upon
536	       sending a tenant packet, must be able to encode the VN Context for
537	       the destination NVE in addition to the L3 tunneling information
538	       (e.g., source IP address identifying the source NVE and the
539	       destination IP address identifying the destination NVE, or MPLS
540	       label). This allows the destination NVE to identify the tenant
541	       service instance and therefore appropriately process and forward the
542	       tenant packet.

544	       The Overlay module provides tunneling overlay functions: tunnel
545	       initiation/termination as in the case of stateful tunnels (see
546	       Section 3.1.4), and/or simply encapsulation/decapsulation of frames
547	       from VAPs/L3 underlay.

549	       In a multi-tenant context, tunneling aggregates frames from/to
550	       different VNIs. Tenant identification and traffic demultiplexing are
551	       based on the VN Context identifier.
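
	       The encoding and demultiplexing steps described above can be
	       sketched as follows. The 4-byte overlay header carrying only a
	       VN Context is a hypothetical format chosen for illustration, as
	       are the TunnelPacket type and function names; actual
	       encapsulations define their own header layouts.

```python
import struct
from typing import Dict, NamedTuple, Tuple


class TunnelPacket(NamedTuple):
    src_nve: str     # outer source address (ingress NVE)
    dst_nve: str     # outer destination address (egress NVE)
    payload: bytes   # overlay header followed by the tenant frame


# Hypothetical overlay header: one 32-bit VN Context in network byte order.
OVERLAY_HDR = struct.Struct("!I")


def encapsulate(src_nve: str, dst_nve: str, vn_context: int,
                tenant_frame: bytes) -> TunnelPacket:
    # The ingress NVE prepends the VN Context to the tenant frame.
    return TunnelPacket(src_nve, dst_nve,
                        OVERLAY_HDR.pack(vn_context) + tenant_frame)


def decapsulate(packet: TunnelPacket,
                vni_by_context: Dict[int, str]) -> Tuple[str, bytes]:
    # The egress NVE uses the VN Context to select the VNI.
    (vn_context,) = OVERLAY_HDR.unpack_from(packet.payload)
    vni = vni_by_context[vn_context]  # KeyError if the context is unknown
    return vni, packet.payload[OVERLAY_HDR.size:]


egress_vnis = {5001: "vni-red", 5002: "vni-blue"}
pkt = encapsulate("192.0.2.1", "192.0.2.2", 5002, b"tenant-frame")
vni, frame = decapsulate(pkt, egress_vnis)
```

	       The underlay sees only the outer addresses; the VN Context is
	       interpreted solely by the egress NVE.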

553	       The following approaches can be considered:

555	          o One VN Context identifier per Tenant: A globally unique (on a
556	            per-DC administrative domain) VN identifier is used to identify
557	            the corresponding VNI. Examples of such identifiers in existing
558	            technologies are IEEE VLAN IDs and ISID tags that identify
559	            virtual L2 domains when using IEEE 802.1aq and IEEE 802.1ah,
560	            respectively.

562	          o One VN Context identifier per VNI: A per-VNI local value is
563	            automatically generated by the egress NVE, or a control plane
564	            associated with that NVE, and usually distributed by a control
569	            plane protocol to all the related NVEs. An example of this
570	            approach is the use of per-VRF MPLS labels in IP VPN [RFC4364].

572	          o One VN Context identifier per VAP: A per-VAP local value is
573	            assigned and usually distributed by a control plane protocol.
574	            An example of this approach is the use of per CE-PE MPLS labels
575	            in IP VPN [RFC4364].

577	       Note that when using one VN Context per VNI or per VAP, an
578	       additional global identifier (e.g., a VN identifier or name) may be
579	       used by the control plane to identify the Tenant context.
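
As an illustration of the per-VNI approach, the following Python sketch
shows an egress NVE allocating a locally significant VN Context value for
each VNI and using that value to demultiplex received packets. The class
name, the method names, and the starting value 1000 are hypothetical;
this document does not define them.

```python
from itertools import count


class EgressNve:
    """Hypothetical egress NVE that hands out one local VN Context per VNI."""

    def __init__(self, first_context=1000):
        self._next_context = count(first_context)  # assumed local number space
        self._context_by_vni = {}   # VNI name -> locally assigned VN Context
        self._vni_by_context = {}   # VN Context -> VNI name (demux table)

    def allocate_context(self, vni):
        """Return the VN Context for a VNI, allocating one on first use."""
        if vni not in self._context_by_vni:
            context = next(self._next_context)
            self._context_by_vni[vni] = context
            self._vni_by_context[context] = vni
        return self._context_by_vni[vni]

    def demultiplex(self, context):
        """Map the VN Context carried in a packet to the local VNI, or None."""
        return self._vni_by_context.get(context)


nve = EgressNve()
red = nve.allocate_context("tenant-red")
blue = nve.allocate_context("tenant-blue")
```

In a real deployment the allocated values would be distributed to the
related NVEs by a control plane protocol; that step is not modeled here.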

581	    3.1.4. Tunnel Overlays and Encapsulation options

583	       Once the VN Context identifier is added to the frame, an L3 tunnel
584	       encapsulation is used to transport the frame to the destination NVE.

586	       Different IP tunneling options (e.g., GRE, L2TP, IPsec) and MPLS
587	       tunneling can be used. Tunneling could be stateless or stateful.
588	       Stateless tunneling simply entails the encapsulation of a tenant
589	       packet with another header necessary for forwarding the packet
590	       across the underlay (e.g., IP tunneling over an IP underlay).
591	       Stateful tunneling on the other hand entails maintaining tunneling
592	       state at the tunnel endpoints (i.e., NVEs). Tenant packets on an
593	       ingress NVE can then be transmitted over such tunnels to a
594	       destination (egress) NVE by encapsulating the packets with a
595	       corresponding tunneling header. The tunneling state at the endpoints
596	       may be configured or dynamically established. Solutions should
597	       specify the tunneling technology used and whether it is stateful
598	       or stateless. In this document, however, tunneling and tunneling
599	       encapsulation are used interchangeably to simply mean the
600	       encapsulation of a tenant packet with a tunneling header necessary
601	       to carry the packet between an ingress NVE and an egress NVE across
602	       the underlay. It should be noted that stateful tunneling, especially
603	       when configuration is involved, does impose management overhead and
604	       scale constraints. Thus, stateless tunneling is preferred when
605	       feasible.
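
A minimal sketch of stateless encapsulation, under stated assumptions,
is given below. The 8-byte overlay header layout (a 32-bit VN Context
followed by a 32-bit reserved field) is invented for illustration and
does not correspond to any specific encapsulation. The outer IP or MPLS
header is not modeled.

```python
import struct

# Assumed header layout for illustration only:
# 32-bit VN Context followed by a 32-bit reserved field, network byte order.
OVERLAY_HEADER = struct.Struct("!II")


def encapsulate(vn_context, tenant_frame):
    """Prepend the hypothetical overlay header to a tenant frame."""
    return OVERLAY_HEADER.pack(vn_context, 0) + tenant_frame


def decapsulate(packet):
    """Split a received packet into (vn_context, tenant_frame)."""
    vn_context, _reserved = OVERLAY_HEADER.unpack_from(packet)
    return vn_context, packet[OVERLAY_HEADER.size:]


frame = b"\x00\x11\x22\x33\x44\x55 tenant payload"
packet = encapsulate(5001, frame)
```

Neither function keeps any per-tunnel state between calls, which is the
property that distinguishes stateless from stateful tunneling in the
text above.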

607	    3.1.5. Control Plane Components

609	    3.1.5.1. Distributed vs Centralized Control Plane

611	       A control/management plane entity can be centralized or distributed.
612	       Both approaches have been used extensively in the past. The routing
613	       model of the Internet is a good example of a distributed approach.

618	       Transport networks have usually used a centralized approach to
619	       manage transport paths.

621	       It is also possible to combine the two approaches, i.e., using a
622	       hybrid model. A global view of network state can have many benefits
623	       but it does not preclude the use of distributed protocols within the
624	       network. Centralized models provide a facility to maintain global
625	       state, and distribute that state to the network. When used in
626	       combination with distributed protocols, greater network
627	       efficiencies, improved reliability and robustness can be achieved.
628	       Domain and/or deployment specific constraints define the balance
629	       between centralized and distributed approaches.

631	    3.1.5.2. Auto-provisioning/Service discovery

633	       NVEs must be able to identify the appropriate VNI for each Tenant
634	       System. This is based on state information that is often provided by
635	       external entities. For example, in an environment where a VM is a
636	       Tenant System, this information is provided by VM orchestration
637	       systems, since these are the only entities that have visibility of
638	       which VM belongs to which tenant.

640	       A mechanism for communicating this information to the NVE is
641	       required. VAPs have to be created and mapped to the appropriate VNI.
642	       Depending upon the implementation, this control interface can be
643	       implemented using an auto-discovery protocol between Tenant Systems
644	       and their local NVE or through management entities. In either case,
645	       appropriate security and authentication mechanisms to verify that
646	       Tenant System information is not spoofed or altered are required.
647	       This is one critical aspect for providing integrity and tenant
648	       isolation in the system.
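
One possible way to verify that Tenant System information has not been
altered is sketched below, assuming the orchestration system and the NVE
share a key that was provisioned out of band. The key, the record
format, and all class and function names are assumptions made for this
example; the document does not mandate any particular mechanism.

```python
import hashlib
import hmac

SHARED_KEY = b"example-key"   # assumption: key provisioned out of band


def sign_attachment(tenant_system, vni, key=SHARED_KEY):
    """Orchestration side: authenticate a Tenant System to VNI binding."""
    message = f"{tenant_system}|{vni}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class Nve:
    """Hypothetical NVE that creates VAPs only for authenticated bindings."""

    def __init__(self, key=SHARED_KEY):
        self._key = key
        self.vni_by_vap = {}   # VAP identifier -> VNI

    def attach(self, vap, tenant_system, vni, tag):
        """Map the VAP to the VNI if the tag verifies; return success."""
        expected = sign_attachment(tenant_system, vni, self._key)
        if not hmac.compare_digest(expected, tag):
            return False
        self.vni_by_vap[vap] = vni
        return True


nve = Nve()
tag = sign_attachment("vm-42", "vni-10")
accepted = nve.attach("vap-1", "vm-42", "vni-10", tag)
rejected = nve.attach("vap-2", "vm-42", "vni-99", tag)   # altered VNI
```

The second call fails because the tag was computed over a different
VNI, so no VAP is created for the altered binding.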

650	       NVEs may learn reachability information to VNIs on other NVEs via a
651	       control protocol exchanging such information among NVEs or via a
652	       management control entity.

654	    3.1.5.3. Address advertisement and tunnel mapping

656	       As traffic reaches an ingress NVE on a VAP, a lookup is performed to
657	       determine which NVE or local VAP the packet needs to be sent to. If
658	       the packet is to be sent to another NVE, the packet is encapsulated
659	       with a tunnel header containing the destination information
660	       (destination IP address or MPLS label) of the egress NVE.
661	       Intermediate nodes (between the ingress and egress NVEs) switch or
662	       route traffic based upon the tunnel destination information.

667	       A key step in the above process consists of identifying the
668	       destination NVE the packet is to be tunneled to. NVEs are
669	       responsible for maintaining a set of forwarding or mapping tables
670	       that hold the bindings between destination VM and egress NVE
671	       addresses. Several ways of populating these tables are possible:
672	       control plane driven, management plane driven, or data plane driven.

674	       When a control plane protocol is used to distribute address
675	       reachability and tunneling information, the auto-
676	       provisioning/Service discovery could be accomplished by the same
677	       protocol. In this scenario, the auto-provisioning/Service discovery
678	       could be combined with (be inferred from) the address advertisement
679	       and associated tunnel mapping. Furthermore, a control plane protocol
680	       that carries both MAC and IP addresses eliminates the need for ARP,
681	       and hence addresses one of the issues with explosive ARP handling.
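
The lookup described in this section can be sketched as follows. The
table layout, the class name, and the example addresses are
hypothetical. How the tables are populated (control plane, management
plane, or data plane) is outside the scope of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class IngressNve:
    """Hypothetical per-VNI mapping tables held by an ingress NVE."""
    local_vaps: dict = field(default_factory=dict)    # (vni, mac) -> local VAP
    remote_nves: dict = field(default_factory=dict)   # (vni, mac) -> egress NVE IP

    def lookup(self, vni, dst_mac):
        """Return ('local', vap), ('tunnel', nve_ip), or ('unknown', None)."""
        key = (vni, dst_mac)
        if key in self.local_vaps:
            return ("local", self.local_vaps[key])
        if key in self.remote_nves:
            return ("tunnel", self.remote_nves[key])
        return ("unknown", None)


nve = IngressNve()
nve.local_vaps[("vni-10", "00:00:5e:00:53:01")] = "vap-1"
nve.remote_nves[("vni-10", "00:00:5e:00:53:02")] = "192.0.2.20"
```

A "tunnel" result gives the destination address to place in the tunnel
header; an "unknown" result corresponds to the case handled by flooding
or by a control plane query.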

683	    3.1.5.4. Overlay Tunneling

685	       For overlay tunneling, and dependent upon the tunneling technology
686	       used for encapsulating the Tenant System packets, it may be
687	       sufficient to have one or more local NVE addresses assigned and used
688	       in the source and destination fields of a tunneling encapsulating
689	       header. Other information that is part of the
690	       tunneling encapsulation header may also need to be configured. In
691	       certain cases, local NVE configuration may be sufficient while in
692	       other cases, some tunneling related information may need to
693	       be shared among NVEs. The information that needs to be shared will
694	       be technology dependent. For instance, potential information could
695	       include tunnel identity, encapsulation type, and/or tunnel
696	       resources. In certain cases, such as when using IP multicast in the
697	       underlay, tunnels may need to be established, interconnecting
698	       NVEs. When tunneling information needs to be exchanged or shared
699	       among NVEs, a control plane protocol may be required. For instance,
700	       it may be necessary to provide active/standby status information
701	       between NVEs, up/down status information, pruning/grafting
702	       information for multicast tunnels, etc.

704	       In addition, a control plane may be required to set up the tunnel
705	       path for some tunneling technologies. This applies to both unicast
706	       and multicast tunneling.

708	    3.2. Multi-homing

710	       Multi-homing techniques can be used to increase the reliability of
711	       an NVO3 network. It is also important to ensure that physical
716	       diversity in an NVO3 network is taken into account to avoid single
717	       points of failure.

719	       Multi-homing can be enabled in various nodes, from Tenant Systems
720	       into TORs, TORs into core switches/routers, and core nodes into DC
721	       GWs.

723	       The NVO3 underlay nodes (i.e., from NVEs to DC GWs) rely on IP
724	       routing techniques or on MPLS re-routing capabilities as the means
725	       to re-route traffic upon failures.

727	       When a Tenant System is co-located with the NVE, the Tenant System
728	       is effectively single homed to the NVE via a virtual port. When the
729	       Tenant System and the NVE are separated, the Tenant System is
730	       connected to the NVE via a logical Layer 2 (L2) construct such as a
731	       VLAN and it can be multi-homed to various NVEs. An NVE may provide
732	       an L2 service to the end system or an L3 service. An NVE may be
733	       multi-homed to a next layer in the DC at Layer 2 (L2) or Layer 3
734	       (L3). When an NVE provides an L2 service and is not co-located with
735	       the end system, techniques such as Ethernet Link Aggregation Group
736	       (LAG) or Spanning Tree Protocol (STP) can be used to switch traffic
737	       between an end system and connected NVEs without creating
738	       loops. Similarly, when the NVE provides an L3 service, similar
739	       dual-homing techniques can be used. When the NVE provides an L3
740	       service to the end system, it is possible that no dynamic routing
741	       protocol is enabled between the end system and the NVE. The end
742	       system can be multi-homed to multiple physically-separated L3 NVEs
743	       over multiple interfaces. When one of the links connected to an NVE
744	       fails, the other interfaces can be used to reach the end system.

746	       External connectivity out of a DC can be handled by two or more DC
747	       gateways. Each gateway provides access to external networks such as
748	       VPNs or the Internet. A gateway may be connected to two or more edge
749	       nodes in the external network for redundancy. When a connection to
750	       an upstream node is lost, the alternative connection is used and the
751	       failed route withdrawn.

753	    3.3. VM Mobility

755	       In DC environments utilizing VM technologies, an important feature
756	       is that VMs can move from one server to another server in the same
757	       or different L2 physical domains (within or across DCs) in a
758	       seamless manner.

763	       A VM can be moved from one server to another in stopped or suspended
764	       state ("cold" VM mobility) or in running/active state ("hot" VM
765	       mobility). With "hot" mobility, VM L2 and L3 addresses need to be
766	       preserved. With "cold" mobility, it may be desired to preserve at
767	       least VM L3 addresses.

769	       Solutions to maintain connectivity while a VM is moved are necessary
770	       in the case of "hot" mobility. This implies that connectivity among
771	       VMs is preserved. For instance, for L2 VNs, ARP caches are updated
772	       accordingly.

774	       Upon VM mobility, NVE policies that define connectivity among VMs
775	       must be maintained.

777	       During VM mobility, it is expected that the path to the VM's default
778	       gateway assures adequate performance to VM applications.

780	    4. Key aspects of overlay networks

782	       The intent of this section is to highlight specific issues that
783	       proposed overlay solutions need to address.

785	    4.1. Pros & Cons

787	       An overlay network is a layer of virtual network topology on top of
788	       the physical network.

790	       Overlay networks offer the following key advantages:

792	          o Unicast tunneling state management and association of Tenant
793	            Systems reachability are handled at the edge of the network (at
794	            the NVE). Intermediate transport nodes are unaware of such
795	            state. Note that when multicast is enabled in the underlay
796	            network to build multicast trees for tenant VNs, there would be
797	            more state related to tenants in the underlay core network.

799	          o Tunneling is used to aggregate traffic and hide tenant
800	            addresses from the underlay network, and hence offer the
801	            advantage of minimizing the amount of forwarding state required
802	            within the underlay network.

804	          o Decoupling of the overlay addresses (MAC and IP) used by VMs
805	            from the underlay network for tenant separation and separation
806	            of the tenant address spaces from the underlay address space.

811	          o Support of a large number of virtual network identifiers.

813	       Overlay networks also create several challenges:

815	          o Overlay networks typically have no control of underlay networks
816	            and lack underlay network information (e.g. underlay
817	            utilization):

819	               o Overlay networks and/or their associated management
820	                 entities typically probe the network to measure link or
821	                 path properties, such as available bandwidth or packet
822	                 loss rate. It is difficult to accurately evaluate network
823	                 properties. It might be preferable for the underlay
824	                 network to expose usage and performance information.

826	               o Miscommunication or lack of coordination between overlay
828	                 and underlay networks can lead to an inefficient usage of
829	                 network resources.

831	               o When multiple overlays co-exist on top of a common underlay
833	                 network, the lack of coordination between overlays can
834	                 lead to performance issues and/or resource usage
835	                 inefficiencies.

837	          o Traffic carried over an overlay might fail to traverse
838	            firewalls and NAT devices.

840	          o Multicast service scalability: Multicast support may be
841	            required in the underlay network to address tenant flood
842	            containment or efficient multicast handling. The underlay may
843	            also be required to maintain multicast state on a per-tenant
844	            basis, or even on a per-individual multicast flow of a given
845	            tenant. Ingress replication at the NVE eliminates that
846	            additional multicast state in the underlay core, but depending
847	            on the multicast traffic volume, it may cause inefficient use
848	            of bandwidth.

850	          o Hash-based load balancing may not be optimal as the hash
851	            algorithm may not work well due to the limited number of
852	            combinations of tunnel source and destination addresses. Other
853	            NVO3 mechanisms may use entropy information beyond the
854	            source and destination addresses.
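
The load-balancing limitation in the last item can be illustrated with
a short sketch. The hash function, the number of equal-cost paths, and
the per-flow entropy value are assumptions chosen for the example; real
devices use their own hash functions and field selections.

```python
import hashlib


def pick_path(fields, num_paths):
    """Hash a tuple of header fields onto one of num_paths equal-cost paths."""
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


NUM_PATHS = 4
outer = ("192.0.2.1", "192.0.2.2")   # ingress and egress NVE addresses
inner_flows = range(1000)            # stand-in for 1000 distinct tenant flows

# Hashing only the outer addresses: every tenant flow lands on the same path.
paths_outer_only = {pick_path(outer, NUM_PATHS) for _ in inner_flows}

# Adding a per-flow entropy value (assumed to be derived from the inner
# headers) lets the same NVE pair spread flows across several paths.
paths_with_entropy = {pick_path(outer + (flow,), NUM_PATHS)
                      for flow in inner_flows}
```

With only the tunnel source and destination addresses as input, a
single NVE pair always maps to one path regardless of how many tenant
flows it carries.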

859	    4.2. Overlay issues to consider

861	    4.2.1. Data plane vs Control plane driven

863	       In the case of an L2 NVE, it is possible to dynamically learn MAC
864	       addresses against VAPs. It is also possible that such addresses be
865	       known and controlled via management or a control protocol for both
866	       L2 NVEs and L3 NVEs. Dynamic data plane learning implies that
867	       flooding of unknown destinations be supported and hence implies that
868	       broadcast and/or multicast be supported or that ingress replication
869	       be used as described in Section 4.2.3. Multicasting in the underlay
870	       network for dynamic learning may lead to significant scalability
871	       limitations. Specific forwarding rules must be enforced to prevent
872	       loops from happening. This can be achieved using a spanning tree, a
873	       shortest path tree, or a split-horizon mesh.
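
Data plane learning with flooding and a split-horizon rule can be
sketched as follows for a single VNI. The class, the port names, and
the use of a full mesh of tunnels are assumptions for illustration
only.

```python
class L2Nve:
    """Hypothetical L2 NVE doing data plane MAC learning for one VNI."""

    def __init__(self, vaps, tunnels):
        self.vaps = set(vaps)         # local attachment points
        self.tunnels = set(tunnels)   # overlay tunnels to remote NVEs
        self.mac_table = {}           # learned MAC -> port (VAP or tunnel)

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source MAC, then return the set of output ports."""
        self.mac_table[src_mac] = in_port
        if dst_mac in self.mac_table:
            out = self.mac_table[dst_mac]
            return set() if out == in_port else {out}
        # Unknown destination: flood, never back out of the input port.
        out_ports = (self.vaps | self.tunnels) - {in_port}
        if in_port in self.tunnels:
            # Split horizon: frames from the tunnel mesh go to VAPs only.
            out_ports -= self.tunnels
        return out_ports


nve = L2Nve(vaps={"vap-1", "vap-2"}, tunnels={"tun-a", "tun-b"})
flooded_from_vap = nve.receive("vap-1", "mac-A", "mac-B")
flooded_from_tunnel = nve.receive("tun-a", "mac-B", "mac-C")
known_unicast = nve.receive("vap-2", "mac-C", "mac-A")
```

The split-horizon check is what prevents a flooded frame from being
re-sent into the tunnel mesh and looping between NVEs.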

875	       It should be noted that the amount of state to be distributed is
876	       dependent upon network topology and the number of virtual machines.
877	       Different forms of caching can also be utilized to minimize state
878	       distribution between the various elements. The control plane should
879	       not require an NVE to maintain the locations of all the Tenant
880	       Systems whose VNs are not present on the NVE. The use of a control
881	       plane does not imply that the data plane on NVEs has to maintain all
882	       the forwarding state in the control plane.

884	    4.2.2. Coordination between data plane and control plane

886	       For an L2 NVE, the NVE needs to be able to determine MAC addresses
887	       of the Tenant Systems connected via a VAP. This can be achieved via
888	       data plane learning or a control plane. For an L3 NVE, the NVE needs
889	       to be able to determine IP addresses of the Tenant Systems connected
890	       via a VAP.

892	       In both cases, coordination with the NVE control protocol is needed
893	       such that when the NVE determines that the set of addresses behind a
894	       VAP has changed, it triggers the NVE control plane to distribute
895	       this information to its peers.
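
The trigger described above can be sketched as a small change detector
placed between address learning and the control plane. The callback
signature and all names are hypothetical; they only illustrate that an
advertisement is sent when, and only when, the address set changes.

```python
class VapAddressTracker:
    """Hypothetical glue between address learning and the NVE control plane."""

    def __init__(self, advertise):
        self._advertise = advertise   # callback into the control plane
        self._addresses = {}          # VAP -> set of tenant addresses

    def update(self, vap, addresses):
        """Record the addresses seen behind a VAP; advertise only on change."""
        new = set(addresses)
        old = self._addresses.get(vap, set())
        if new == old:
            return False
        self._addresses[vap] = new
        self._advertise(vap, added=new - old, withdrawn=old - new)
        return True


sent = []
tracker = VapAddressTracker(
    lambda vap, added, withdrawn: sent.append((vap, added, withdrawn)))
tracker.update("vap-1", {"192.0.2.1"})
tracker.update("vap-1", {"192.0.2.1"})   # no change, no message
tracker.update("vap-1", {"192.0.2.2"})   # one added, one withdrawn
```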

897	    4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

899	       There are several options to support packet replication needed for
900	       broadcast, unknown unicast and multicast.  Typical methods include:

905	          o Ingress replication

907	          o Use of underlay multicast trees

909	       There is a bandwidth vs state trade-off between the two approaches.
910	       Depending upon the degree of replication required (i.e. the number
911	       of hosts per group) and the amount of multicast state to maintain,
912	       trading bandwidth for state should be considered.

914	       When the number of hosts per group is large, the use of underlay
915	       multicast trees may be more appropriate. When the number of hosts is
916	       small (e.g. 2-3) and/or the amount of multicast traffic is small,
917	       ingress replication may not be an issue.

919	       Depending upon the size of the data center network and hence the
920	       number of (S,G) entries, but also the duration of multicast flows,
921	       the use of underlay multicast trees can be a challenge.

923	       When flows are well known, it is possible to pre-provision such
924	       multicast trees. However, it is often difficult to predict
925	       application flows ahead of time, and hence programming of (S,G)
926	       entries for short-lived flows could be impractical.

928	       A possible trade-off is to use shared multicast trees in the
929	       underlay as opposed to dedicated multicast trees.
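
The bandwidth versus state trade-off can be made concrete with a
deliberately simplified model. It counts only the copies sent by the
ingress NVE per BUM packet and the number of multicast groups held in
the underlay; per-source (S,G) state, tree depth, and link sharing are
ignored. The function names and the example figures are assumptions.

```python
def ingress_copies(remote_nves, use_underlay_tree):
    """Copies the ingress NVE sends for one BUM packet in one VN."""
    if remote_nves == 0:
        return 0
    return 1 if use_underlay_tree else remote_nves


def underlay_groups(num_vns, use_underlay_tree, shared_trees=None):
    """Multicast groups the underlay must maintain for all VNs."""
    if not use_underlay_tree:
        return 0                      # ingress replication: unicast only
    if shared_trees is None:
        return num_vns                # one dedicated tree per VN
    return min(num_vns, shared_trees)


# Assumed example: 1000 VNs, each spanning 50 remote NVEs.
replication = (ingress_copies(50, False), underlay_groups(1000, False))
dedicated = (ingress_copies(50, True), underlay_groups(1000, True))
shared = (ingress_copies(50, True), underlay_groups(1000, True, shared_trees=16))
```

Under these assumptions, ingress replication costs 50 copies per packet
and no underlay multicast state, dedicated trees cost 1 copy and 1000
groups, and 16 shared trees cost 1 copy and 16 groups at the price of
delivering some traffic to NVEs that do not need it.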

931	    4.2.4. Path MTU

933	       When using overlay tunneling, an outer header is added to the
934	       original frame. This can cause the MTU of the path to the egress
935	       tunnel endpoint to be exceeded.

937	       It is usually not desirable to rely on IP fragmentation for
938	       performance reasons. Ideally, the interface MTU as seen by a Tenant
939	       System is adjusted such that no fragmentation is needed. TCP will
940	       adjust its maximum segment size accordingly.

942	       It is possible for the MTU to be configured manually or to be
943	       discovered dynamically. Various Path MTU discovery techniques exist
944	       in order to determine the proper MTU size to use:

946	          o Classical ICMP-based Path MTU Discovery [RFC1191] [RFC1981]

948	               o Tenant Systems rely on ICMP messages to discover the MTU
950	                 of the end-to-end path to their destinations. This method is
955	                 not always possible, such as when traversing middle boxes
956	                 (e.g., firewalls) that disable ICMP for security reasons.

958	          o Extended Path MTU Discovery techniques such as those defined in
959	            [RFC4821].

961	       It is also possible to rely on the NVE to perform segmentation and
962	       reassembly (SAR) operations without relying on the Tenant Systems to
963	       know about the end-to-end MTU. The assumption is that some hardware
964	       assist is available on the NVE node to perform such SAR operations.
965	       However, fragmentation by the NVE can lead to performance and
966	       congestion issues due to TCP dynamics and might require new
967	       congestion avoidance mechanisms from the underlay network [FLOYD].

969	       Finally, the underlay network may be designed in such a way that the
970	       MTU can accommodate the extra tunneling and possibly additional NVO3
971	       header encapsulation overhead.
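
The MTU adjustment discussed in this section is simple arithmetic, as
the following sketch shows. The 1500-byte underlay MTU and the 8-byte
overlay header are assumed figures, not values taken from this
document. The 20-byte IPv4 and TCP header sizes apply when no options
are present.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20    # bytes, TCP header without options


def tenant_mtu(underlay_mtu, outer_ip_header, overlay_header, inner_ethernet=14):
    """Largest tenant IP packet that fits in one underlay packet.

    Assumes an L2 overlay: the tenant's Ethernet header (14 bytes, untagged)
    is carried inside the tunnel and therefore counts as overhead.
    """
    return underlay_mtu - outer_ip_header - overlay_header - inner_ethernet


def tcp_mss(mtu):
    """TCP maximum segment size for IPv4 with no IP or TCP options."""
    return mtu - IPV4_HEADER - TCP_HEADER


# Assumed figures: 1500-byte underlay MTU, IPv4 outer header, and a
# hypothetical 8-byte overlay header carrying the VN Context.
mtu = tenant_mtu(1500, outer_ip_header=IPV4_HEADER, overlay_header=8)
mss = tcp_mss(mtu)
```

Read the other way, the same arithmetic gives the underlay MTU needed
to preserve a 1500-byte tenant MTU under these assumptions: 1542 bytes.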

973	    4.2.5. NVE location trade-offs

975	       In the case of DC traffic, traffic originating from a VM is native
976	       Ethernet traffic. This traffic can be switched by a local virtual
977	       switch or ToR switch and then by a DC gateway. The NVE function can
978	       be embedded within any of these elements.

980	       There are several criteria to consider when deciding where the NVE
981	       function should happen:

983	          o Processing and memory requirements

985	              o Datapath (e.g. lookups, filtering,
986	                 encapsulation/decapsulation)

988	              o Control plane processing (e.g. routing, signaling, OAM) and
989	                 where specific control plane functions should be enabled

991	          o FIB/RIB size

993	          o Multicast support

995	              o Routing/signaling protocols

997	              o Packet replication capability

999	              o Multicast FIB

1004	          o Fragmentation support

1006	          o QoS support (e.g. marking, policing, queuing)

1008	          o Resiliency

1010	    4.2.6. Interaction between network overlays and underlays

1012	       When multiple overlays co-exist on top of a common underlay network,
1013	       resources (e.g., bandwidth) should be provisioned to ensure that
1014	       traffic from overlays can be accommodated and QoS objectives can be
1015	       met. Overlays can have partially overlapping paths (nodes and
1016	       links).

1018	       Each overlay is selfish by nature. It sends traffic so as to
1019	       optimize its own performance without considering the impact on other
1020	       overlays, unless the underlay paths are traffic engineered on a per
1021	       overlay basis to avoid congestion of underlay resources.

1023	       Better visibility between overlays and underlays, or generally
1024	       coordination in placing overlay demand on an underlay network, may
1025	       be achieved by providing mechanisms to exchange performance and
1026	       liveness information between the underlay and overlay(s) or the
1027	       use of such information by a coordination system. Such information
1028	       may include:

1030	          o Performance metrics (throughput, delay, loss, jitter)

1032	          o Cost metrics

1034	    5. Security Considerations

1036	       NVO3 solutions must at least consider and address the following:

1038	          o Secure and authenticated communication between an NVE and an
1039	            NVE management system and/or control system.

1041	          o Isolation between tenant overlay networks. The use of
1042	            per-tenant FIB tables (VNIs) on an NVE is essential.

1044	          o Security of any protocol used to carry overlay network
1045	            information.

1047	          o Preventing packets from reaching the wrong VNI, especially
1048	            during VM moves.

1053	          o It may be desirable to restrict the types of information that
1054	            can be exchanged between overlays and underlays (e.g., topology
1055	            information).

1057	    6. IANA Considerations

1059	       IANA does not need to take any action for this draft.

1061	    7. References

1063	    7.1. Normative References

1065	       [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1066	                 Requirement Levels", BCP 14, RFC 2119, March 1997.

1068	    7.2. Informative References

1070	       [NVOPS] Narten, T. et al, "Problem Statement: Overlays for Network
1071	                 Virtualization", draft-narten-nvo3-overlay-problem-
1072	                 statement (work in progress).

1074	       [OVCPREQ] Kreeger, L. et al, "Network Virtualization Overlay Control
1075	                 Protocol Requirements", draft-kreeger-nvo3-overlay-cp
1076	                 (work in progress).

1078	       [FLOYD] Floyd, S. and A. Romanow, "Dynamics of TCP Traffic over
1079	                 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995.

1081	       [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
1082	                 Networks (VPNs)", RFC 4364, February 2006.

1084	       [RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

1086	       [RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC 1981,
1087	                 August 1996.

1089	       [RFC4821] Mathis, M. et al, "Packetization Layer Path MTU
1090	                 Discovery", RFC 4821, March 2007.

1092	    8. Acknowledgments

1094	       In addition to the authors, the following people have contributed to
1095	       this document:

1100	       Dimitrios Stiliadis, Rotem Salomonovitch, Lucy Yong, Thomas Narten,
1101	       Larry Kreeger.

1103	       This document was prepared using 2-Word-v2.0.template.dot.

1105	    Authors' Addresses

1107	       Marc Lasserre
1108	       Alcatel-Lucent
1109	       Email: marc.lasserre@alcatel-lucent.com

1111	       Florin Balus
1112	       Alcatel-Lucent
1113	       777 E. Middlefield Road
1114	       Mountain View, CA, USA 94043
1115	       Email: florin.balus@alcatel-lucent.com

1117	       Thomas Morin
1118	       France Telecom Orange
1119	       Email: thomas.morin@orange.com

1121	       Nabil Bitar
1122	       Verizon
1123	       40 Sylvan Road
1124	       Waltham, MA 02145
1125	       Email: nabil.bitar@verizon.com

1127	       Yakov Rekhter
1128	       Juniper
1129	       Email: yakov@juniper.net