idnits 2.17.1 

draft-ietf-nvo3-framework-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (January 20, 2014) is 3741 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)


     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

1	Internet Engineering Task Force                          Marc Lasserre
2	Internet Draft                                            Florin Balus
3	Intended status: Informational                          Alcatel-Lucent
4	Expires: July 2014
5	                                                          Thomas Morin
6	                                                 France Telecom Orange

8	                                                           Nabil Bitar
9	                                                               Verizon

11	                                                         Yakov Rekhter
12	                                                               Juniper

14	                                                      January 20, 2014

16	                  Framework for DC Network Virtualization
17	                     draft-ietf-nvo3-framework-05.txt

19	Abstract

21	   This document provides a framework for Network Virtualization
22	   Overlays (NVO3) and it defines a reference model along with logical
23	   components required to design a solution.

25	Status of this Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts. The list of current Internet-
33	   Drafts is at http://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six
36	   months and may be updated, replaced, or obsoleted by other documents
37	   at any time.  It is inappropriate to use Internet-Drafts as
38	   reference material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on July 20, 2014.

42	Copyright Notice

44	   Copyright (c) 2014 IETF Trust and the persons identified as the
45	   document authors. All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (http://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document. Please review these documents
51	   carefully, as they describe your rights and restrictions with
52	   respect to this document. Code Components extracted from this
53	   document must include Simplified BSD License text as described in
54	   Section 4.e of the Trust Legal Provisions and are provided without
55	   warranty as described in the Simplified BSD License.

57	Table of Contents

59	   1. Introduction
60	      1.1. General terminology
61	      1.2. DC network architecture
62	   2. Reference Models
63	      2.1. Generic Reference Model
64	      2.2. NVE Reference Model
65	      2.3. NVE Service Types
66	         2.3.1. L2 NVE providing Ethernet LAN-like service
67	         2.3.2. L3 NVE providing IP/VRF-like service
68	   3. Functional components
69	      3.1. Service Virtualization Components
70	         3.1.1. Virtual Access Points (VAPs)
71	         3.1.2. Virtual Network Instance (VNI)
72	         3.1.3. Overlay Modules and VN Context
73	         3.1.4. Tunnel Overlays and Encapsulation options
74	         3.1.5. Control Plane Components
75	         3.1.5.1. Distributed vs Centralized Control Plane
76	         3.1.5.2. Auto-provisioning/Service discovery
77	         3.1.5.3. Address advertisement and tunnel mapping
78	         3.1.5.4. Overlay Tunneling
79	      3.2. Multi-homing
80	      3.3. VM Mobility
81	   4. Key aspects of overlay networks
82	      4.1. Pros & Cons
83	      4.2. Overlay issues to consider
84	         4.2.1. Data plane vs Control plane driven
85	         4.2.2. Coordination between data plane and control plane
86	         4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM)
87	         traffic
88	         4.2.4. Path MTU
89	         4.2.5. NVE location trade-offs
90	         4.2.6. Interaction between network overlays and underlays
91	   5. Security Considerations
92	   6. IANA Considerations
93	   7. References
94	      7.1. Informative References
95	   8. Acknowledgments

97	1. Introduction

99	   This document provides a framework for Data Center (DC) Network
100	   Virtualization over Layer3 (L3) tunnels. This framework is intended
101	   to aid in standardizing protocols and mechanisms to support large-
102	   scale network virtualization for data centers.

104	   [NVOPS] defines the rationale for using overlay networks in order to
105	   build large multi-tenant data center networks. Compute, storage and
106	   network virtualization are often used in these large data centers to
107	   support a large number of communication domains and end systems.

109	   This document provides reference models and functional components of
110	   data center overlay networks as well as a discussion of technical
111	   issues that have to be addressed.

113	1.1. General terminology

115	   This document uses the following terminology:

117	   NVO3 Network: An overlay network that provides a Layer2 (L2) or
118	   Layer3 (L3) service to Tenant Systems over an L3 underlay network
119	   using the architecture and protocols as defined by the NVO3 Working
120	   Group.

122	   Network Virtualization Edge (NVE): An NVE is the network entity that
123	   sits at the edge of an underlay network and implements L2 and/or L3
124	   network virtualization functions. The network-facing side of the NVE
125	   uses the underlying L3 network to tunnel tenant frames to and from
126	   other NVEs. The tenant-facing side of the NVE sends and receives
127	   Ethernet frames to and from individual Tenant Systems.  An NVE could
128	   be implemented as part of a virtual switch within a hypervisor, a
129	   physical switch or router, a Network Service Appliance, or be split
130	   across multiple devices.

132	   Virtual Network (VN): A VN is a logical abstraction of a physical
133	   network that provides L2 or L3 network services to a set of Tenant
134	   Systems. A VN is also known as a Closed User Group (CUG).

136	   Virtual Network Instance (VNI): A specific instance of a VN from the
137	   perspective of an NVE.

139	   Virtual Network Context (VN Context) Identifier: Field in overlay
140	   encapsulation header that identifies the specific VN the packet
141	   belongs to. The egress NVE uses the VN Context identifier to deliver
142	   the packet to the correct Tenant System. The VN Context identifier
143	   can be a locally significant identifier or a globally unique
144	   identifier.

146	   Underlay or Underlying Network: The network that provides the
147	   connectivity among NVEs and over which NVO3 packets are tunneled,
148	   where an NVO3 packet carries an NVO3 overlay header followed by a
149	   tenant packet. The Underlay Network does not need to be aware that
150	   it is carrying NVO3 packets. Addresses on the Underlay Network
151	   appear as "outer addresses" in encapsulated NVO3 packets. In
152	   general, the Underlay Network can use a completely different
153	   protocol (and address family) from that of the overlay. In the case
154	   of NVO3, the underlay network is IP.

156	   Data Center (DC): A physical complex housing physical servers,
157	   network switches and routers, network service appliances and
158	   networked storage. The purpose of a Data Center is to provide
159	   application, compute and/or storage services. One such service is
160	   virtualized infrastructure data center services, also known as
161	   Infrastructure as a Service.

163	   Virtual Data Center (Virtual DC): A container for virtualized
164	   compute, storage and network services. A Virtual DC is associated
165	   with a single tenant, and can contain multiple VNs and Tenant
166	   Systems connected to one or more of these VNs.

168	   Virtual machine (VM): A software implementation of a physical
169	   machine that runs programs as if they were executing on a physical,
170	   non-virtualized machine.  Applications (generally) do not know they
171	   are running on a VM as opposed to running on a "bare metal" host or
172	   server, though some systems provide a para-virtualization
173	   environment that allows an operating system or application to be
174	   aware of the presence of virtualization for optimization purposes.

176	   Hypervisor: Software running on a server that allows multiple VMs to
177	   run on the same physical server. The hypervisor manages and provides
178	   shared compute/memory/storage and network connectivity to the VMs
179	   that it hosts. Hypervisors often embed a Virtual Switch (see below).

181	   Server: A physical end host machine that runs user applications. A
182	   standalone (or "bare metal") server runs a conventional operating
183	   system hosting a single-tenant application. A virtualized server
184	   runs a hypervisor supporting one or more VMs.

186	   Virtual Switch (vSwitch): A function within a Hypervisor (typically
187	   implemented in software) that provides similar forwarding services
188	   to a physical Ethernet switch. A vSwitch forwards Ethernet frames
189	   between VMs running on the same server, or between a VM and a
190	   physical NIC connecting the server to a physical Ethernet
191	   switch or router. A vSwitch also enforces network isolation between
192	   VMs that by policy are not permitted to communicate with each other
193	   (e.g., by honoring VLANs). A vSwitch may be bypassed when an NVE is
194	   enabled on the host server.

196	   Tenant: The customer using a virtual network and any associated
197	   resources (e.g., compute, storage and network).  A tenant could be
198	   an enterprise, or a department/organization within an enterprise.

200	   Tenant System: A physical or virtual system that can play the role
201	   of a host, or a forwarding element such as a router, switch,
202	   firewall, etc. It belongs to a single tenant and connects to one or
203	   more VNs of that tenant.

205	   Tenant Separation: Tenant Separation refers to isolating traffic of
206	   different tenants such that traffic from one tenant is not visible
207	   to or delivered to another tenant, except when allowed by policy.
208	   Tenant Separation also refers to address space separation, whereby
209	   different tenants can use the same address space without conflict.

211	   Virtual Access Points (VAPs): A logical connection point on the NVE
212	   for connecting a Tenant System to a virtual network. Tenant Systems
213	   connect to VNIs at an NVE through VAPs. VAPs can be physical ports
214	   or virtual ports identified through logical interface identifiers
215	   (e.g., VLAN ID, internal vSwitch Interface ID connected to a VM).

217	   End Device: A physical device that connects directly to the DC
218	   Underlay Network. This is in contrast to a Tenant System, which
219	   connects to a corresponding tenant VN. An End Device is administered
220	   by the DC operator rather than a tenant, and is part of the DC
221	   infrastructure. An End Device may implement NVO3 technology in
222	   support of NVO3 functions. Examples of an End Device include hosts
223	   (e.g., server or server blade), storage systems (e.g., file servers,
224	   iSCSI storage systems), and network devices (e.g., firewall,
225	   load-balancer, IPsec gateway).

227	   Network Virtualization Authority (NVA): Entity that provides
228	   reachability and forwarding information to NVEs.

230	1.2. DC network architecture

232	   A generic architecture for Data Centers is depicted in Figure 1:

234	                               ,---------.
235	                              ,'           `.
236	                             (  IP/MPLS WAN )
237	                              `.           ,'
238	                               `-+------+'
239	                                 \      /
240	                          +--------+   +--------+
241	                          |   DC   |+-+|   DC   |
242	                          |gateway |+-+|gateway |
243	                          +--------+   +--------+
244	                                |       /
245	                                .--. .--.
246	                              (    '    '.--.
247	                            .-.' Intra-DC     '
248	                           (     network      )
249	                            (             .'-'
250	                             '--'._.'.    )\ \
251	                             / /     '--'  \ \
252	                            / /      | |    \ \
253	                   +--------+   +--------+   +--------+
254	                   | access |   | access |   | access |
255	                   | switch |   | switch |   | switch |
256	                   +--------+   +--------+   +--------+
257	                      /     \    /    \     /      \
258	                   __/_      \  /      \   /_      _\__
259	             '--------'   '--------'   '--------'   '--------'
260	             :  End   :   :  End   :   :  End   :   :  End   :
261	             : Device :   : Device :   : Device :   : Device :
262	             '--------'   '--------'   '--------'   '--------'

264	            Figure 1 : A Generic Architecture for Data Centers

266	   An example of a multi-tier DC network architecture is presented in
267	   Figure 1. It provides a view of physical components inside a DC.

269	   A DC network is usually composed of two parts: intra-DC networks and
270	   network services, and inter-DC networks and connectivity services.

272	   DC networking elements can act as strict L2 switches and/or provide
273	   IP routing capabilities, including network service virtualization.

275	   In some DC architectures, some tier layers could provide L2 and/or
276	   L3 services. In addition, some tier layers may be collapsed, and
277	   Internet connectivity, inter-DC connectivity and VPN support may be
278	   handled by a smaller number of nodes. Nevertheless, one can assume
279	   that the network functional blocks in a DC fit in the architecture
280	   depicted in Figure 1.

282	   The following components can be present in a DC:

284	   - Access switch: Hardware-based Ethernet switch that aggregates all
285	      Ethernet links from the End Devices in a rack and represents the
286	      entry point to the physical DC network for the hosts. It may also
287	      provide routing functionality, virtual IP network connectivity, or
288	      Layer2 tunneling over IP for instance. Access switches are usually
289	      multi-homed to aggregation switches in the Intra-DC network. A
290	      typical example of an access switch is a Top of Rack (ToR) switch.
291	      Other deployment scenarios may use an intermediate Blade Switch
292	      before the ToR, or an EoR (End of Row) switch, to provide similar
293	      functions to a ToR.

295	   - Intra-DC Network: Network composed of high capacity core nodes
296	      (Ethernet switches/routers). Core nodes may provide virtual
297	      Ethernet bridging and/or IP routing services.

299	   - DC Gateway (DC GW): Gateway to the outside world providing DC
300	      Interconnect and connectivity to Internet and VPN customers. In
301	      the current DC network model, this may be simply a router
302	      connected to the Internet and/or an IP Virtual Private Network
303	      (VPN)/L2VPN PE. Some network implementations may dedicate DC GWs
304	      for different connectivity types (e.g., a DC GW for Internet, and
305	      another for VPN).

307	   Note that End Devices may be single or multi-homed to access
308	   switches.

310	2. Reference Models

312	2.1. Generic Reference Model

314	   Figure 2 depicts a DC reference model for network virtualization
315	   overlay where NVEs provide a logical interconnect between Tenant
316	   Systems that belong to a specific VN.

318	         +--------+                                    +--------+
319	         | Tenant +--+                            +----| Tenant |
320	         | System |  |                           (')   | System |
321	         +--------+  |    .................     (   )  +--------+
322	                     |  +---+           +---+    (_)
323	                     +--|NVE|---+   +---|NVE|-----+
324	                        +---+   |   |   +---+
325	                        / .    +-----+      .
326	                       /  . +--| NVA |--+   .
327	                      /   . |  +-----+   \  .
328	                     |    . |             \ .
329	                     |    . |   Overlay   +--+--++--------+
330	         +--------+  |    . |   Network   | NVE || Tenant |
331	         | Tenant +--+    . |             |     || System |
332	         | System |       .  \ +---+      +--+--++--------+
333	         +--------+       .....|NVE|.........
334	                               +---+
335	                                 |
336	                                 |
337	                       =====================
338	                         |               |
339	                     +--------+      +--------+
340	                     | Tenant |      | Tenant |
341	                     | System |      | System |
342	                     +--------+      +--------+

344	      Figure 2 : Generic reference model for DC network virtualization
345	                                 overlay

347	   In order to obtain reachability information, NVEs may exchange
348	   information directly between themselves via a control plane
349	   protocol. In this case, a control plane module resides in every NVE.

351	   It is also possible for NVEs to communicate with an external Network
352	   Virtualization Authority (NVA) to obtain reachability and forwarding
353	   information. In this case, a protocol is used between NVEs and
354	   NVA(s) to exchange information. OpenFlow [OF] is one example of such
355	   a protocol.
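
   As a rough illustration of the NVA model described above, the
   following sketch shows an NVE resolving a tenant address through an
   external NVA and caching the answer. The class names (Nva, Nve), the
   method names and the (VN, tenant address) key are assumptions made
   for this example only; the sketch does not model OpenFlow or any
   other specific NVE-to-NVA protocol.

```python
class Nva:
    """Hypothetical mapping authority: (VN, tenant address) -> NVE address."""

    def __init__(self):
        self._mappings = {}

    def register(self, vn, tenant_addr, nve_addr):
        self._mappings[(vn, tenant_addr)] = nve_addr

    def resolve(self, vn, tenant_addr):
        # Returns None when the NVA has no mapping for this VN/address pair.
        return self._mappings.get((vn, tenant_addr))


class Nve:
    """Hypothetical NVE that pulls mappings from the NVA on demand."""

    def __init__(self, nva):
        self.nva = nva
        self.cache = {}

    def egress_for(self, vn, tenant_addr):
        key = (vn, tenant_addr)
        if key not in self.cache:
            nve_addr = self.nva.resolve(vn, tenant_addr)
            if nve_addr is None:
                return None          # unknown destination, nothing cached
            self.cache[key] = nve_addr
        return self.cache[key]
```

   Keying the mapping on the VN as well as the tenant address lets two
   tenants reuse the same address without conflict, which matches the
   address space separation described in Section 1.1.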

357	   It should be noted that NVAs may be organized in clusters for
358	   redundancy and scalability and can appear as one logically
359	   centralized controller. In this case, inter-NVA communication is
360	   necessary to synchronize state among nodes within a cluster or share
361	   information across clusters. The information exchanged between NVAs
362	   of the same cluster could be different from the information
363	   exchanged across clusters.

365	   A Tenant System can be attached to an NVE in several ways:

367	   - locally, by being co-located in the same End Device

369	   - remotely, via a point-to-point connection or a switched network

371	   When an NVE is co-located with a Tenant System, the state of the
372	   Tenant System can be determined without protocol assistance. For
373	   instance, the operational status of a VM can be communicated via a
374	   local API. When an NVE is remotely connected to a Tenant System, the
375	   state of the Tenant System or NVE needs to be exchanged directly or
376	   via a management entity, using a control plane protocol or API, or
377	   directly via a data plane protocol.

379	   The functional components in Figure 2 do not necessarily map
380	   directly to the physical components described in Figure 1. For
381	   example, an End Device can be a server blade with VMs and a virtual
382	   switch. A VM can be a Tenant System and the NVE functions may be
383	   performed by the host server. In this case, the Tenant System and
384	   NVE function are co-located. Another example is the case where the
385	   End Device is the Tenant System, and the NVE function can be
386	   implemented by the connected ToR. In this case, the Tenant System
387	   and NVE function are not co-located.

389	   Underlay nodes utilize L3 technologies to interconnect NVE nodes.
390	   These nodes perform forwarding based on outer L3 header information,
391	   and generally do not maintain per-tenant-service state, although some
392	   applications (e.g., multicast) may require control plane or
393	   forwarding plane information that pertains to a tenant, group of
394	   tenants, tenant service or a set of services that belong to one or
395	   more tenants. Mechanisms to control the amount of state maintained
396	   in the underlay may be needed.

398	2.2. NVE Reference Model

400	   Figure 3 depicts the NVE reference model. One or more VNIs can be
401	   instantiated on an NVE. A Tenant System interfaces with a
402	   corresponding VNI via a VAP. An overlay module provides tunneling
403	   overlay functions (e.g., encapsulation and decapsulation of tenant
404	   traffic, tenant identification and mapping, etc.).

406	                     +-------- L3 Network -------+
407	                     |                           |
408	                     |        Tunnel Overlay     |
409	         +------------+---------+       +---------+------------+
410	         | +----------+-------+ |       | +---------+--------+ |
411	         | |  Overlay Module  | |       | |  Overlay Module  | |
412	         | +---------+--------+ |       | +---------+--------+ |
413	         |           |VN context|       | VN context|          |
414	         |           |          |       |           |          |
415	         |  +--------+-------+  |       |  +--------+-------+  |
416	         |  | |VNI|   .  |VNI|  |       |  | |VNI|   .  |VNI|  |
417	    NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2
418	         |    |   VAPs     |    |       |    |    VAPs   |     |
419	         +----+------------+----+       +----+-----------+-----+
420	              |            |                 |           |
421	              |            |                 |           |
422	             Tenant Systems                 Tenant Systems

424	                  Figure 3 : Generic NVE reference model

426	   Note that some NVE functions (e.g., data plane and control plane
427	   functions) may reside in one device or may be implemented separately
428	   in different devices. In addition, NVE functions can be implemented
429	   in a hierarchical fashion. For instance, an End Device can act as an
430	   NVE spoke, while an access switch can act as an NVE hub.

432	2.3. NVE Service Types

434	   An NVE provides different types of virtualized network services to
435	   multiple tenants, i.e., an L2 service or an L3 service. Note that an
436	   NVE may be capable of providing both L2 and L3 services for a
437	   tenant. This section defines the service types and associated
438	   attributes.

440	2.3.1. L2 NVE providing Ethernet LAN-like service

442	   An L2 NVE implements Ethernet LAN emulation, an Ethernet based
443	   multipoint service similar to an IETF VPLS [RFC4761][RFC4762] or
444	   EVPN [EVPN] service, where the Tenant Systems appear to be
445	   interconnected by a LAN environment over an L3 overlay. As such, an
446	   L2 NVE provides a per-tenant virtual switching instance (L2 VNI), and
447	   L3 (IP/MPLS) tunneling encapsulation of tenant MAC frames across the
448	   underlay. Note that the control plane for an L2 NVE could be
449	   implemented locally on the NVE or in a separate control entity.
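
   The per-tenant virtual switching instance can be pictured with the
   toy model below. It assumes conventional Ethernet bridge behavior
   (source MAC learning, flooding of unknown unicast) and covers only
   local VAPs; remote NVEs, broadcast handling and aging are left out.
   The class name L2Vni and its methods are hypothetical.

```python
class L2Vni:
    """Toy per-tenant switching instance (illustrative, not normative)."""

    def __init__(self, vaps):
        self.vaps = set(vaps)
        self.mac_table = {}          # learned MAC address -> VAP

    def receive(self, in_vap, src_mac, dst_mac):
        """Return the list of VAPs the frame should be sent out of."""
        # Learn (or refresh) the location of the source MAC.
        self.mac_table[src_mac] = in_vap
        out_vap = self.mac_table.get(dst_mac)
        if out_vap is None:
            # Unknown unicast: flood to every other VAP of this VNI only.
            return sorted(self.vaps - {in_vap})
        # Never send a frame back out of the VAP it arrived on.
        return [] if out_vap == in_vap else [out_vap]
```

   Because each tenant gets its own L2Vni object, a MAC address learned
   for one tenant is never consulted when forwarding for another.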

451	2.3.2. L3 NVE providing IP/VRF-like service

453	   An L3 NVE provides a virtualized IP forwarding service, similar to
454	   IETF IP VPN (e.g., BGP/MPLS IP VPN [RFC4364]) from a service
455	   definition perspective. That is, an L3 NVE provides a per-tenant
456	   forwarding and routing instance (L3 VNI), and L3 (IP/MPLS) tunneling
457	   encapsulation of tenant IP packets across the underlay. Note that
458	   routing could be performed locally on the NVE or in a separate
459	   control entity.
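
   A per-tenant forwarding and routing instance can be sketched as a
   separate route table per L3 VNI. The example below assumes
   longest-prefix-match lookup, which is the usual IP forwarding rule
   but is not mandated by this document; the class name L3Vni and the
   next-hop strings are hypothetical.

```python
import ipaddress


class L3Vni:
    """Toy per-tenant routing instance (illustrative, not normative)."""

    def __init__(self):
        self._routes = []            # list of (network, next hop)

    def add_route(self, prefix, next_hop):
        self._routes.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [(net, nh) for net, nh in self._routes if addr in net]
        if not matches:
            return None              # no route in this tenant's table
        # Longest prefix wins among all matching routes.
        return max(matches, key=lambda m: m[0].prefixlen)[1]
```

   Two tenants can install the same prefix in their own L3Vni objects
   and still resolve it to different egress NVEs.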

461	3. Functional components

463	   This section decomposes the Network Virtualization architecture into
464	   functional components described in Figure 3 to make it easier to
465	   discuss solution options for these components.

467	3.1. Service Virtualization Components

469	3.1.1. Virtual Access Points (VAPs)

471	   Tenant Systems are connected to VNIs through Virtual Access Points
472	   (VAPs).

474	   VAPs can be physical ports or virtual ports identified through
475	   logical interface identifiers (e.g., VLAN ID, internal vSwitch
476	   Interface ID connected to a VM).
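
   The mapping from a VAP to its VNI might be held in a table such as
   the one sketched below. The VAP tuple (port name plus optional VLAN
   ID), the fallback from a (port, VLAN) match to a port-wide VAP, and
   all names are assumptions chosen for illustration.

```python
from typing import NamedTuple, Optional


class VAP(NamedTuple):
    port: str                  # physical or virtual port name (hypothetical)
    vlan_id: Optional[int]     # None means the whole port is the VAP


class VapTable:
    def __init__(self):
        self._vap_to_vni = {}

    def attach(self, vap, vni):
        self._vap_to_vni[vap] = vni

    def vni_for_frame(self, port, vlan_id=None):
        # Prefer an exact (port, VLAN) match, then a port-wide VAP.
        vni = self._vap_to_vni.get(VAP(port, vlan_id))
        if vni is None:
            vni = self._vap_to_vni.get(VAP(port, None))
        return vni
```

   A frame that matches no VAP yields None, so it is not associated
   with any VNI.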

478	3.1.2. Virtual Network Instance (VNI)

480	   A VNI is a specific VN instance on an NVE. Each VNI defines a
481	   forwarding context that contains reachability information and
482	   policies.

484	3.1.3. Overlay Modules and VN Context

486	   Mechanisms for identifying each tenant service are required to allow
487	   the simultaneous overlay of multiple tenant services over the same
488	   underlay L3 network topology. In the data plane, each NVE, upon
489	   sending a tenant packet, must be able to encode the VN Context for
490	   the destination NVE in addition to the L3 tunneling information
491	   (e.g., source IP address identifying the source NVE and the
492	   destination IP address identifying the destination NVE, or MPLS
493	   label). This allows the destination NVE to identify the tenant
494	   service instance and therefore appropriately process and forward the
495	   tenant packet.

497	   The Overlay module provides tunneling overlay functions: tunnel
498	   initiation/termination as in the case of stateful tunnels (see
499	   Section 3.1.4), and/or simply encapsulation/decapsulation of frames
500	   from VAPs/L3 underlay.

502	   In a multi-tenant context, tunneling aggregates frames from/to
503	   different VNIs. Tenant identification and traffic demultiplexing are
504	   based on the VN Context identifier.
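
   The demultiplexing step can be illustrated with a deliberately
   simple encoding. The 4-byte header below is a hypothetical layout
   invented for this sketch; it is not the format of any NVO3
   encapsulation, and the outer L3 header is omitted.

```python
import struct

# Hypothetical overlay header: one 32-bit VN Context in network byte order.
HEADER = struct.Struct("!I")


def encapsulate(vn_context, tenant_frame):
    """Prepend the VN Context to the tenant frame (ingress NVE side)."""
    return HEADER.pack(vn_context) + tenant_frame


def decapsulate(packet):
    """Split a received packet into (VN Context, tenant frame)."""
    (vn_context,) = HEADER.unpack_from(packet)
    return vn_context, packet[HEADER.size:]


def demultiplex(packet, vni_by_context):
    """Select the VNI on the egress NVE from the VN Context."""
    vn_context, frame = decapsulate(packet)
    vni = vni_by_context.get(vn_context)
    # Unknown contexts yield None rather than being handed to another VNI.
    return (vni, frame) if vni is not None else None
```

   Frames from several VNIs can share one tunnel because the egress
   NVE recovers the VNI from the header alone.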

506	   The following approaches can be considered:

508	   - One VN Context identifier per Tenant: A globally unique (on a per-
509	      DC administrative domain) VN identifier is used to identify the
510	      corresponding VNI. Examples of such identifiers in existing
511	      technologies are IEEE VLAN IDs and ISID tags that identify virtual
512	      L2 domains when using IEEE 802.1aq and IEEE 802.1ah, respectively.

514	   - One VN Context identifier per VNI: A per-VNI local value is
515	      automatically generated by the egress NVE, or a control plane
516	      associated with that NVE, and usually distributed by a control
517	      plane protocol to all the related NVEs. An example of this
518	      approach is the use of per VRF MPLS labels in IP VPN [RFC4364].

520	   - One VN Context identifier per VAP: A per-VAP local value is
521	      assigned and usually distributed by a control plane protocol. An
522	      example of this approach is the use of per CE-PE MPLS labels in IP
523	      VPN [RFC4364].

525	   Note that when using one VN Context per VNI or per VAP, an
526	   additional global identifier (e.g., a VN identifier or name) may be
527	   used by the control plane to identify the Tenant context.
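
   The per-VNI approach, combined with a global VN name in the control
   plane, might look like the sketch below. Each egress NVE allocates
   its own locally significant value per VN, and the resulting
   (egress NVE, VN name) to VN Context bindings are what an ingress NVE
   would need to learn. The allocator, the starting values and the
   advertise() helper are assumptions for illustration; no real
   control plane protocol is modeled.

```python
import itertools


class EgressNve:
    def __init__(self, name, first_value=100):
        self.name = name
        self._next = itertools.count(first_value)
        self._context_by_vn = {}

    def context_for(self, vn_name):
        # Allocate a locally significant value the first time a VN is seen.
        if vn_name not in self._context_by_vn:
            self._context_by_vn[vn_name] = next(self._next)
        return self._context_by_vn[vn_name]


def advertise(nves, vn_names):
    """Build the (egress NVE, VN name) -> VN Context map for ingress NVEs."""
    return {
        (nve.name, vn): nve.context_for(vn)
        for nve in nves
        for vn in vn_names
    }
```

   The same VN can therefore carry a different VN Context value toward
   each egress NVE, while the global VN name ties the values together.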

529	3.1.4. Tunnel Overlays and Encapsulation options

531	   Once the VN context identifier is added to the frame, an L3 Tunnel
532	   encapsulation is used to transport the frame to the destination NVE.

534	   Different IP tunneling options (e.g., GRE, L2TP, IPsec) and MPLS
535	   tunneling can be used. Tunneling could be stateless or stateful.
536	   Stateless tunneling simply entails the encapsulation of a tenant
537	   packet with another header necessary for forwarding the packet
538	   across the underlay (e.g., IP tunneling over an IP underlay).
539	   Stateful tunneling on the other hand entails maintaining tunneling
540	   state at the tunnel endpoints (i.e., NVEs). Tenant packets on an
541	   ingress NVE can then be transmitted over such tunnels to a
542	   destination (egress) NVE by encapsulating the packets with a
543	   corresponding tunneling header. The tunneling state at the endpoints
544	   may be configured or dynamically established. Solutions should
545	   specify the tunneling technology used and whether it is stateful or
546	   stateless. In this document, however, tunneling and tunneling
547	   encapsulation are used interchangeably to simply mean the
548	   encapsulation of a tenant packet with a tunneling header necessary
549	   to carry the packet between an ingress NVE and an egress NVE across
550	   the underlay. It should be noted that stateful tunneling, especially
551	   when configuration is involved, does impose management overhead and
552	   scale constraints. Thus, stateless tunneling is preferred when
553	   feasible.
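
   The difference between the two styles can be made concrete with a
   small model. In the sketch below, stateless encapsulation is a pure
   function of its arguments, whereas the stateful variant keeps one
   table entry per peer that has to be created before traffic can
   flow. The dictionary-based "outer header" and all names are
   hypothetical simplifications.

```python
def stateless_send(src_nve, dst_nve, payload):
    # No per-peer state: the outer header is derived from the arguments.
    return {"outer_src": src_nve, "outer_dst": dst_nve, "payload": payload}


class StatefulTunnels:
    def __init__(self, local_nve):
        self.local_nve = local_nve
        self._tunnels = {}           # per-peer state that must be managed

    def establish(self, peer):
        self._tunnels[peer] = {"packets_sent": 0}

    def send(self, peer, payload):
        state = self._tunnels.get(peer)
        if state is None:
            raise LookupError(f"no tunnel established to {peer}")
        state["packets_sent"] += 1
        return stateless_send(self.local_nve, peer, payload)

    def state_entries(self):
        return len(self._tunnels)
```

   The number of entries reported by state_entries() grows with the
   number of peers, which is one way to picture the scale constraint
   mentioned above.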

555	3.1.5. Control Plane Components

557	3.1.5.1. Distributed vs Centralized Control Plane

559	   A control/management plane entity can be centralized or distributed.
560	   Both approaches have been used extensively in the past. The routing
561	   model of the Internet is a good example of a distributed approach.
562	   Transport networks have usually used a centralized approach to
563	   manage transport paths.

565	   It is also possible to combine the two approaches, i.e., using a
566	   hybrid model. A global view of network state can have many benefits
567	   but it does not preclude the use of distributed protocols within the
568	   network. Centralized models provide a facility to maintain global
569	   state, and distribute that state to the network. When used in
570	   combination with distributed protocols, greater network
571	   efficiencies, improved reliability and robustness can be achieved.
572	   Domain and/or deployment specific constraints define the balance
573	   between centralized and distributed approaches.

575	3.1.5.2. Auto-provisioning/Service discovery

577	   NVEs must be able to identify the appropriate VNI for each Tenant
578	   System. This is based on state information that is often provided by
579	   external entities. For example, in an environment where a VM is a
580	   Tenant System, this information is provided by VM orchestration
581	   systems, since these are the only entities that have visibility of
582	   which VM belongs to which tenant.

584	   A mechanism for communicating this information to the NVE is
585	   required. VAPs have to be created and mapped to the appropriate VNI.
586	   Depending upon the implementation, this control interface can be
587	   implemented using an auto-discovery protocol between Tenant Systems
588	   and their local NVE or through management entities. In either case,
589	   appropriate security and authentication mechanisms to verify that
590	   Tenant System information is not spoofed or altered are required.
591	   This is one critical aspect for providing integrity and tenant
592	   isolation in the system.
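
   As an illustration of the verification requirement above, the
   following minimal Python sketch accepts a VAP-to-VNI binding from a
   management entity only if its integrity tag checks out. The message
   layout, the pre-shared key, and the function names are assumptions
   made for this example; this document does not define such an
   interface.

```python
import hashlib
import hmac
import json
from typing import Dict


def sign(shared_key: bytes, binding: Dict[str, object]) -> bytes:
    """Compute an HMAC-SHA256 tag over a canonical encoding of the binding."""
    message = json.dumps(binding, sort_keys=True).encode()
    return hmac.new(shared_key, message, hashlib.sha256).digest()


def apply_binding(shared_key: bytes,
                  binding: Dict[str, object],
                  tag: bytes,
                  vap_to_vni: Dict[str, int]) -> bool:
    """Install a VAP -> VNI mapping only if the tag verifies.

    Returns True when the binding was installed, False when it was
    rejected as possibly spoofed or altered.
    """
    expected = sign(shared_key, binding)
    if not hmac.compare_digest(expected, tag):
        return False
    vap_to_vni[str(binding["vap"])] = int(binding["vni"])
    return True


# Hypothetical usage: the orchestration system signs, the NVE verifies.
key = b"example-pre-shared-key"
binding = {"tenant_system": "vm-a", "vap": "vap1", "vni": 10}
table: Dict[str, int] = {}
installed = apply_binding(key, binding, sign(key, binding), table)
```

   A deployed solution would more likely rely on an authenticated
   transport or a per-entity credential than on a single shared key;
   the sketch only shows where the check sits relative to VAP creation.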

594	   NVEs may learn reachability information to VNIs on other NVEs via a
595	   control protocol that exchanges such information among NVEs, or via
596	   a management control entity.

598	3.1.5.3. Address advertisement and tunnel mapping

600	   As traffic reaches an ingress NVE on a VAP, a lookup is performed to
601	   determine which NVE or local VAP the packet needs to be sent to. If
602	   the packet is to be sent to another NVE, the packet is encapsulated
603	   with a tunnel header containing the destination information
604	   (destination IP address or MPLS label) of the egress NVE.
605	   Intermediate nodes (between the ingress and egress NVEs) switch or
606	   route traffic based upon the tunnel destination information.

608	   A key step in the above process consists of identifying the
609	   destination NVE the packet is to be tunneled to. NVEs are
610	   responsible for maintaining a set of forwarding or mapping tables
611	   that hold the bindings between destination VM and egress NVE
612	   addresses. Several ways of populating these tables are possible:
613	   control plane driven, management plane driven, or data plane driven.
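
   The lookup and encapsulation steps described above can be sketched
   as follows. This is a simplified model rather than a definitive
   implementation: the class, the table layouts, and the dictionary
   standing in for a tunnel header are all hypothetical, and the
   handling of unknown destinations is left to the solution.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class IngressNVE:
    """Toy ingress NVE; every name here is an assumption for illustration."""
    local_address: str
    # VAP identifier -> VNI the VAP is attached to
    vap_to_vni: Dict[str, int] = field(default_factory=dict)
    # (VNI, tenant destination address) -> local VAP
    local_table: Dict[Tuple[int, str], str] = field(default_factory=dict)
    # (VNI, tenant destination address) -> egress NVE underlay address
    remote_table: Dict[Tuple[int, str], str] = field(default_factory=dict)

    def forward(self, in_vap: str, dst: str, payload: bytes) -> Optional[dict]:
        """Decide what to do with a tenant packet received on in_vap."""
        vni = self.vap_to_vni[in_vap]
        key = (vni, dst)
        if key in self.local_table:
            # Destination is behind a local VAP: no tunneling needed.
            return {"action": "local", "vap": self.local_table[key],
                    "payload": payload}
        egress = self.remote_table.get(key)
        if egress is None:
            # Unknown destination: flooding, a query, or a drop are all
            # possible; the choice is solution specific.
            return None
        # Encapsulate with a tunnel header that carries the egress NVE
        # address; intermediate nodes forward on outer_dst only.
        return {"action": "tunnel", "outer_src": self.local_address,
                "outer_dst": egress, "vni": vni, "payload": payload}


# Hypothetical usage with documentation addresses.
nve = IngressNVE(local_address="192.0.2.1")
nve.vap_to_vni["vap1"] = 10
nve.local_table[(10, "00:00:5e:00:53:01")] = "vap2"
nve.remote_table[(10, "00:00:5e:00:53:02")] = "192.0.2.2"
result = nve.forward("vap1", "00:00:5e:00:53:02", b"tenant frame")
```

   The tables are keyed on the VNI as well as the destination address
   so that tenants with overlapping address spaces remain separated.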

615	   When a control plane protocol is used to distribute address
616	   reachability and tunneling information, the auto-
617	   provisioning/Service discovery could be accomplished by the same
618	   protocol. In this scenario, the auto-provisioning/Service discovery
619	   could be combined with (be inferred from) the address advertisement
620	   and associated tunnel mapping. Furthermore, a control plane protocol
621	   that carries both MAC and IP addresses eliminates the need for ARP,
622	   and hence addresses one of the issues with explosive ARP handling as
623	   discussed in [RFC6820].

625	3.1.5.4. Overlay Tunneling

627	   For overlay tunneling, and dependent upon the tunneling technology
628	   used for encapsulating the Tenant System packets, it may be
629	   sufficient to have one or more local NVE addresses assigned and used
630	   in the source and destination fields of a tunneling encapsulation
631	   header. Other information that is part of the
632	   tunneling encapsulation header may also need to be configured. In
633	   certain cases, local NVE configuration may be sufficient while in
634	   other cases, some tunneling related information may need to
635	   be shared among NVEs. The information that needs to be shared will
636	   be technology dependent. For instance, potential information could
637	   include tunnel identity, encapsulation type, and/or tunnel
638	   resources. In certain cases, such as when using IP multicast in the
639	   underlay, tunnels which interconnect NVEs may need to be
640	   established. When tunneling information needs to be exchanged or
641	   shared among NVEs, a control plane protocol may be required. For
642	   instance, it may be necessary to provide active/standby status
643	   information between NVEs, up/down status information,
644	   pruning/grafting information for multicast tunnels, etc.

646	   In addition, a control plane may be required to set up the tunnel
647	   path for some tunneling technologies. This applies to both unicast
648	   and multicast tunneling.

650	3.2. Multi-homing

652	   Multi-homing techniques can be used to increase the reliability of
653	   an NVO3 network. It is also important to ensure that physical
654	   diversity in an NVO3 network is taken into account to avoid single
655	   points of failure.

657	   Multi-homing can be enabled in various nodes, from Tenant Systems
658	   into ToRs, ToRs into core switches/routers, and core nodes into DC
659	   GWs.

661	   The NVO3 underlay nodes (i.e., from NVEs to DC GWs) rely on IP
662	   routing techniques or on MPLS re-routing capabilities as the means
663	   to re-route traffic upon failures.

665	   When a Tenant System is co-located with the NVE, the Tenant System
666	   is effectively single homed to the NVE via a virtual port. When the
667	   Tenant System and the NVE are separated, the Tenant System is
668	   connected to the NVE via a logical Layer2 (L2) construct such as a
669	   VLAN and it can be multi-homed to various NVEs. An NVE may provide
670	   an L2 service to the end system or an L3 service. An NVE may be
671	   multi-homed to a next layer in the DC at Layer2 (L2) or Layer3
672	   (L3). When an NVE provides an L2 service and is not co-located with
673	   the end system, techniques such as Ethernet Link Aggregation Group
674	   (LAG) or Spanning Tree Protocol (STP) can be used to switch traffic
675	   between an end system and connected NVEs without creating
676	   loops. Similarly, when the NVE provides an L3 service, similar dual-
677	   homing techniques can be used. When the NVE provides an L3 service to
678	   the end system, it is possible that no dynamic routing protocol is
679	   enabled between the end system and the NVE. The end system can be
680	   multi-homed to multiple physically-separated L3 NVEs over multiple
681	   interfaces. When one of the links connected to an NVE fails, the
682	   other interfaces can be used to reach the end system.

684	   External connectivity from a DC can be handled by two or more DC
685	   gateways. Each gateway provides access to external networks such as
686	   VPNs or the Internet. A gateway may be connected to two or more edge
687	   nodes in the external network for redundancy. When a connection to
688	   an upstream node is lost, the alternative connection is used and the
689	   failed route withdrawn.

691	3.3. VM Mobility

693	   In DC environments utilizing VM technologies, an important feature
694	   is that VMs can move from one server to another server in the same
695	   or different L2 physical domains (within or across DCs) in a
696	   seamless manner.

698	   A VM can be moved from one server to another in stopped or suspended
699	   state ("cold" VM mobility) or in running/active state ("hot" VM
700	   mobility). With "hot" mobility, VM L2 and L3 addresses need to be
701	   preserved. With "cold" mobility, it may be desirable to preserve at
702	   least VM L3 addresses.

704	   Solutions to maintain connectivity while a VM is moved are necessary
705	   in the case of "hot" mobility. This implies that connectivity among
706	   VMs is preserved. For instance, for L2 VNs, ARP caches are updated
707	   accordingly.

709	   Upon VM mobility, NVE policies that define connectivity among VMs
710	   must be maintained.

712	   During VM mobility, it is expected that the path to the VM's default
713	   gateway assures adequate performance to VM applications.

715	4. Key aspects of overlay networks

717	   The intent of this section is to highlight specific issues that
718	   proposed overlay solutions need to address.

720	4.1. Pros & Cons

722	   An overlay network is a layer of virtual network topology on top of
723	   the physical network.

725	   Overlay networks offer the following key advantages:

727	      - Unicast tunneling state management and association of Tenant
728	        Systems reachability are handled at the edge of the network (at
729	        the NVE). Intermediate transport nodes are unaware of such
730	        state. Note that when multicast is enabled in the underlay
731	        network to build multicast trees for tenant VNs, there would be
732	        more state related to tenants in the underlay core network.

734	      - Tunneling is used to aggregate traffic and hide tenant
735	        addresses from the underlay network, and hence offer the
736	        advantage of minimizing the amount of forwarding state required
737	        within the underlay network.

739	      - Decoupling of the overlay addresses (MAC and IP) used by VMs
740	        from the underlay network for tenant separation and separation
741	        of the tenant address spaces from the underlay address space.

743	      - Support of a large number of virtual network identifiers.

745	   Overlay networks also create several challenges:

747	      - Overlay networks typically have no control of underlay networks
748	        and lack underlay network information (e.g. underlay
749	        utilization):

751	        - Overlay networks and/or their associated management entities
752	           typically probe the network to measure link or path
753	           properties, such as available bandwidth or packet loss rate.
754	           It is difficult to accurately evaluate network properties. It
755	           might be preferable for the underlay network to expose usage
756	           and performance information.
757	        - Miscommunication or lack of coordination between overlay and
758	           underlay networks can lead to an inefficient usage of network
759	           resources.
760	        - When multiple overlays co-exist on top of a common underlay
761	           network, the lack of coordination between overlays can lead
762	           to performance issues and/or resource usage inefficiencies.

764	      - Traffic carried over an overlay may be unable to traverse
765	        firewalls and NAT devices.

767	      - Multicast service scalability: Multicast support may be
768	        required in the underlay network to address tenant flood
769	        containment or efficient multicast handling. The underlay may
770	        also be required to maintain multicast state on a per-tenant
771	        basis, or even on a per-individual multicast flow of a given
772	        tenant. Ingress replication at the NVE eliminates that
773	        additional multicast state in the underlay core, but depending
774	        on the multicast traffic volume, it may cause inefficient use
775	        of bandwidth.

777	      - Hash-based load balancing may not be optimal as the hash
778	        algorithm may not work well due to the limited number of
779	        combinations of tunnel source and destination addresses. Other
780	        NVO3 mechanisms may use entropy information in addition to the
781	        source and destination addresses.
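
   The load-balancing limitation in the last item can be illustrated
   with a short Python sketch. It assumes an underlay node that picks
   one of several equal-cost paths by hashing header fields. The hash
   function, the field layout, and the way entropy is derived from the
   tenant flow are assumptions for this example, not properties of any
   particular NVO3 encapsulation.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def pick_path(fields: List[str], n_paths: int) -> int:
    """Map a list of header fields to one of n_paths equal-cost paths."""
    digest = hashlib.sha256("|".join(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def paths_used(flows: Iterable[Tuple[str, str, str]],
               n_paths: int,
               use_entropy: bool) -> Set[int]:
    """Return the set of paths chosen for the given tunneled flows.

    Each flow is (ingress NVE address, egress NVE address, tenant flow id).
    """
    used = set()
    for src_nve, dst_nve, inner_flow in flows:
        fields = [src_nve, dst_nve]
        if use_entropy:
            # Hypothetical entropy value derived from the tenant flow and
            # exposed in the outer header by the ingress NVE.
            fields.append(hashlib.sha256(inner_flow.encode()).hexdigest()[:4])
        used.add(pick_path(fields, n_paths))
    return used


# 100 tenant flows, all carried between the same pair of NVEs.
flows = [("192.0.2.1", "192.0.2.2", f"tenant-flow-{i}") for i in range(100)]
without_entropy = paths_used(flows, n_paths=4, use_entropy=False)
with_entropy = paths_used(flows, n_paths=4, use_entropy=True)
```

   Without the extra field, every flow between the two NVEs hashes to
   the same path, so `without_entropy` holds a single element; with it,
   the flows spread over several paths.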

783	4.2. Overlay issues to consider

785	4.2.1. Data plane vs Control plane driven

787	   In the case of an L2 NVE, it is possible to dynamically learn MAC
788	   addresses against VAPs. It is also possible that such addresses be
789	   known and controlled via management or a control protocol for both
790	   L2 NVEs and L3 NVEs. Dynamic data plane learning implies that
791	   flooding of unknown destinations be supported and hence implies that
792	   broadcast and/or multicast be supported or that ingress replication
793	   be used as described in section 4.2.3. Multicasting in the underlay
794	   network for dynamic learning may lead to significant scalability
795	   limitations. Specific forwarding rules must be enforced to prevent
796	   loops from happening. This can be achieved using a spanning tree, a
797	   shortest path tree, or a split-horizon mesh.

799	   It should be noted that the amount of state to be distributed is
800	   dependent upon network topology and the number of virtual machines.
801	   Different forms of caching can also be utilized to minimize state
802	   distribution between the various elements. The control plane should
803	   not require an NVE to maintain the locations of all the Tenant
804	   Systems whose VNs are not present on the NVE. The use of a control
805	   plane does not imply that the data plane on NVEs has to maintain all
806	   the forwarding state in the control plane.

808	4.2.2. Coordination between data plane and control plane

810	   For an L2 NVE, the NVE needs to be able to determine MAC addresses
811	   of the Tenant Systems connected via a VAP. This can be achieved via
812	   dataplane learning or a control plane. For an L3 NVE, the NVE needs
813	   to be able to determine IP addresses of the Tenant Systems connected
814	   via a VAP.

816	   In both cases, coordination with the NVE control protocol is needed
817	   such that when the NVE determines that the set of addresses behind a
818	   VAP has changed, it triggers the NVE control plane to distribute
819	   this information to its peers.
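
   A minimal sketch of this coordination is given below, assuming a
   hypothetical callback through which the NVE hands address changes
   to its control plane. The class and callback signature are
   illustrative only; how addresses are determined (data plane
   learning or otherwise) is outside the sketch.

```python
from typing import Callable, Dict, Set

# Callback signature (an assumption): advertise(vap, added, removed)
Advertise = Callable[[str, Set[str], Set[str]], None]


class VapAddressTracker:
    """Tracks the addresses behind each VAP and reports changes."""

    def __init__(self, advertise: Advertise) -> None:
        self._advertise = advertise
        self._addresses: Dict[str, Set[str]] = {}

    def update(self, vap: str, addresses: Set[str]) -> bool:
        """Record the current address set for a VAP.

        Triggers the control plane only when the set has changed.
        Returns True if an advertisement was triggered.
        """
        old = self._addresses.get(vap, set())
        added = addresses - old
        removed = old - addresses
        if not added and not removed:
            return False
        self._addresses[vap] = set(addresses)
        self._advertise(vap, added, removed)
        return True


# Hypothetical usage: collect what would be sent to peer NVEs.
events = []
tracker = VapAddressTracker(lambda vap, add, rem: events.append((vap, add, rem)))
tracker.update("vap1", {"00:00:5e:00:53:01"})
tracker.update("vap1", {"00:00:5e:00:53:01"})   # unchanged, no trigger
tracker.update("vap1", {"00:00:5e:00:53:02"})   # one added, one removed
```

   Reporting only the difference keeps the amount of state exchanged
   with peers proportional to the change rather than to the table size.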

821	4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

823	   There are several options to support packet replication needed for
824	   broadcast, unknown unicast and multicast. Typical methods include:

826	   - Ingress replication

828	   - Use of underlay multicast trees

830	   There is a bandwidth vs state trade-off between the two approaches.
831	   Depending upon the degree of replication required (i.e. the number
832	   of hosts per group) and the amount of multicast state to maintain,
833	   trading bandwidth for state should be considered.

835	   When the number of hosts per group is large, the use of underlay
836	   multicast trees may be more appropriate. When the number of hosts is
837	   small (e.g. 2-3) and/or the amount of multicast traffic is small,
838	   ingress replication may not be an issue.

840	   Depending upon the size of the data center network and hence the
841	   number of (S,G) entries, and also the duration of multicast flows,
842	   the use of underlay multicast trees can be a challenge.

844	   When flows are well known, it is possible to pre-provision such
845	   multicast trees. However, it is often difficult to predict
846	   application flows ahead of time, and hence programming of (S,G)
847	   entries for short-lived flows could be impractical.

849	   A possible trade-off is to use in the underlay shared multicast
850	   trees as opposed to dedicated multicast trees.
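
   The bandwidth versus state trade-off can be made concrete with a
   deliberately simplified model, sketched below in Python. It counts
   only the copies sent by the ingress NVE per BUM packet and the
   number of multicast entries held in the underlay core; the names
   and the cost model are assumptions for this example and ignore
   topology, tree shape, and traffic volume.

```python
from dataclasses import dataclass


@dataclass
class ReplicationCost:
    ingress_copies: int   # copies the ingress NVE sends per BUM packet
    underlay_state: int   # multicast entries held in the underlay core


def ingress_replication(remote_nves: int) -> ReplicationCost:
    """One unicast copy per remote NVE, no multicast state in the core."""
    return ReplicationCost(ingress_copies=remote_nves, underlay_state=0)


def dedicated_trees(groups: int) -> ReplicationCost:
    """One copy from the ingress NVE, one core entry per tenant group."""
    return ReplicationCost(ingress_copies=1, underlay_state=groups)


def shared_tree() -> ReplicationCost:
    """One copy and a single core entry shared by all groups.

    The saving in state may come at the cost of delivering traffic to
    NVEs that have no receivers for a given group.
    """
    return ReplicationCost(ingress_copies=1, underlay_state=1)


# Hypothetical deployment: 3 remote NVEs versus 200, with 50 tenant groups.
small = ingress_replication(remote_nves=3)
large = ingress_replication(remote_nves=200)
trees = dedicated_trees(groups=50)
```

   With 3 remote NVEs, ingress replication costs 3 copies and no core
   state; with 200, the same packet is sent 200 times, which is where
   an underlay tree becomes more attractive despite its state.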

852	4.2.4. Path MTU

854	   When using overlay tunneling, an outer header is added to the
855	   original frame. This can cause the MTU of the path to the egress
856	   tunnel endpoint to be exceeded.

858	   It is usually not desirable to rely on IP fragmentation for
859	   performance reasons. Ideally, the interface MTU as seen by a Tenant
860	   System is adjusted such that no fragmentation is needed. TCP will
861	   adjust its maximum segment size accordingly.

863	   It is possible for the MTU to be configured manually or to be
864	   discovered dynamically. Various Path MTU discovery techniques exist
865	   in order to determine the proper MTU size to use:

867	   - Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981]

869	     - Tenant Systems rely on ICMP messages to discover the MTU of the
870	        end-to-end path to their destinations. This method is not always
871	        possible, such as when traversing middle boxes (e.g. firewalls)
872	        which disable ICMP for security reasons.

874	   - Extended MTU Path Discovery techniques such as defined in
875	      [RFC4821]

877	   It is also possible to rely on the NVE to perform segmentation and
878	   reassembly operations without relying on the Tenant Systems to know
879	   about the end-to-end MTU. The assumption is that some hardware
880	   assist is available on the NVE node to perform such SAR operations.
881	   However, fragmentation by the NVE can lead to performance and
882	   congestion issues due to TCP dynamics and might require new
883	   congestion avoidance mechanisms from the underlay network [FLOYD].

885	   Finally, the underlay network may be designed in such a way that the
886	   MTU can accommodate the extra tunneling and possibly additional NVO3
887	   header encapsulation overhead.
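
   The MTU arithmetic behind these options is sketched below. The
   overhead figures are illustrative values for a hypothetical
   UDP-based encapsulation over IPv4 carrying a tenant Ethernet frame;
   they are assumptions for this example, and the actual overhead
   depends on the encapsulation a solution selects.

```python
from typing import Dict

# Illustrative per-packet overhead in bytes (assumed, not normative).
ILLUSTRATIVE_OVERHEAD: Dict[str, int] = {
    "outer_ip": 20,          # outer IPv4 header without options
    "outer_transport": 8,    # UDP header
    "overlay_header": 8,     # hypothetical overlay header size
    "inner_ethernet": 14,    # tenant Ethernet header carried as payload
}


def tenant_ip_mtu(underlay_ip_mtu: int, overhead: Dict[str, int]) -> int:
    """Largest tenant IP packet that fits without fragmentation."""
    mtu = underlay_ip_mtu - sum(overhead.values())
    if mtu <= 0:
        raise ValueError("underlay MTU is smaller than the tunnel overhead")
    return mtu


def required_underlay_ip_mtu(tenant_mtu: int, overhead: Dict[str, int]) -> int:
    """Underlay IP MTU needed to carry a tenant IP packet of tenant_mtu."""
    return tenant_mtu + sum(overhead.values())


# Option 1: lower the MTU seen by Tenant Systems.
reduced = tenant_ip_mtu(1500, ILLUSTRATIVE_OVERHEAD)             # 1450
# Option 2: raise the underlay MTU so tenants can keep 1500.
raised = required_underlay_ip_mtu(1500, ILLUSTRATIVE_OVERHEAD)   # 1550
```

   The two functions correspond to adjusting the interface MTU seen by
   the Tenant System and to designing the underlay to absorb the
   encapsulation overhead, respectively.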

889	4.2.5. NVE location trade-offs

891	   In the case of DC traffic, traffic originated from a VM is native
892	   Ethernet traffic. This traffic can be switched by a local virtual
893	   switch or ToR switch and then by a DC gateway. The NVE function can
894	   be embedded within any of these elements.

896	   There are several criteria to consider when deciding where the NVE
897	   function should happen:

899	   - Processing and memory requirements

901	     - Datapath (e.g. lookups, filtering, encapsulation/decapsulation)

903	     - Control plane processing (e.g. routing, signaling, OAM) and
904	        where specific control plane functions should be enabled

906	   - FIB/RIB size

908	   - Multicast support

910	     - Routing/signaling protocols

912	     - Packet replication capability

914	     - Multicast FIB

916	   - Fragmentation support

918	   - QoS support (e.g. marking, policing, queuing)

920	   - Resiliency

922	4.2.6. Interaction between network overlays and underlays

924	   When multiple overlays co-exist on top of a common underlay network,
925	   resources (e.g., bandwidth) should be provisioned to ensure that
926	   traffic from overlays can be accommodated and QoS objectives can be
927	   met. Overlays can have partially overlapping paths (nodes and
928	   links).

930	   Each overlay is selfish by nature. It sends traffic so as to
931	   optimize its own performance without considering the impact on other
932	   overlays, unless the underlay paths are traffic engineered on a per
933	   overlay basis to avoid congestion of underlay resources.

935	   Better visibility between overlays and underlays, or generally
936	   coordination in placing overlay demand on an underlay network, may
937	   be achieved by providing mechanisms to exchange performance and
938	   liveness information between the underlay and overlay(s) or the
939	   use of such information by a coordination system. Such information
940	   may include:

942	   - Performance metrics (throughput, delay, loss, jitter)

944	   - Cost metrics

946	5. Security Considerations

948	   NVO3 solutions must at least consider and address the following:

950	   - Secure and authenticated communication between an NVE and an NVE
951	      management system and/or control system.

953	   - Isolation between tenant overlay networks. The use of per-tenant
954	      FIB tables (VNIs) on an NVE is essential.

956	   - Security of any protocol used to carry overlay network
957	      information.

959	   - Preventing packets from reaching the wrong VNI, especially during
960	      VM moves.

962	   - It may be desirable to restrict the types of information that can
963	      be exchanged between overlays and underlays (e.g. topology
964	      information).

966	6. IANA Considerations

968	   IANA does not need to take any action for this draft.

970	7. References

972	7.1. Informative References

974	   [NVOPS] Narten, T. et al, "Problem Statement: Overlays for Network
975	             Virtualization", draft-narten-nvo3-overlay-problem-
976	             statement (work in progress)

978	   [OF]    Open Networking Foundation, "OpenFlow Switch Specification
979	             v1.4.0"

981	   [FLOYD] Sally Floyd, Allyn Romanow, "Dynamics of TCP Traffic over
982	             ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995

984	   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
985	             Networks (VPNs)", RFC 4364, February 2006.

987	   [RFC4761] Kompella, K. et al, "Virtual Private LAN Service (VPLS)
988	             Using BGP for Auto-Discovery and Signaling", RFC4761,
989	             January 2007

991	   [RFC4762] Lasserre, M. et al, "Virtual Private LAN Service (VPLS)
992	             Using Label Distribution Protocol (LDP) Signaling",
993	             RFC4762, January 2007

995	   [EVPN]  Sajassi, A. et al, "BGP MPLS Based Ethernet VPN", draft-
996	             ietf-l2vpn-evpn (work in progress)

998	   [RFC1191] Mogul, J. "Path MTU Discovery", RFC1191, November 1990

1000	   [RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC1981,
1001	             August 1996

1003	   [RFC4821] Mathis, M. et al, "Packetization Layer Path MTU
1004	             Discovery", RFC4821, March 2007

1006	   [RFC6820] Narten, T. et al, "Address Resolution Problems in Large
1007	             Data Center Networks", RFC6820, January 2013

1009	8. Acknowledgments

1011	   In addition to the authors the following people have contributed to
1012	   this document:

1014	   Dimitrios Stiliadis, Rotem Salomonovitch, Lucy Yong, Thomas Narten,
1015	   Larry Kreeger.

1017	   This document was prepared using 2-Word-v2.0.template.dot.

1019	Authors' Addresses

1021	   Marc Lasserre
1022	   Alcatel-Lucent
1023	   Email: marc.lasserre@alcatel-lucent.com

1025	   Florin Balus
1026	   Alcatel-Lucent
1027	   777 E. Middlefield Road
1028	   Mountain View, CA, USA 94043
1029	   Email: florin.balus@alcatel-lucent.com

1031	   Thomas Morin
1032	   France Telecom Orange
1033	   Email: thomas.morin@orange.com

1035	   Nabil Bitar
1036	   Verizon
1037	   40 Sylvan Road
1038	   Waltham, MA 02145
1039	   Email: nabil.bitar@verizon.com

1041	   Yakov Rekhter
1042	   Juniper
1043	   Email: yakov@juniper.net