idnits 2.17.1 

draft-ietf-nvo3-framework-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == The page length should not exceed 58 lines per page, but there was 25
     longer pages, the longest (page 21) being 72 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 25 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 250 instances of too long lines in the document, the longest
     one being 3 characters in excess of 72.

  == There are 4 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 4, 2013) is 3942 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'OF' is mentioned on line 373, but not defined

  == Unused Reference: 'OVCPREQ' is defined on line 1009, but no explicit
     reference was found in the text

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)


     Summary: 1 error (**), 0 flaws (~~), 6 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	    Internet Engineering Task Force                          Marc Lasserre
3	    Internet Draft                                            Florin Balus
4	    Intended status: Informational                          Alcatel-Lucent
5	    Expires: January 2014
6	                                                              Thomas Morin
7	                                                     France Telecom Orange

9	                                                               Nabil Bitar
10	                                                                   Verizon

12	                                                             Yakov Rekhter
13	                                                                   Juniper

15	                                                              July 4, 2013

17	                      Framework for DC Network Virtualization
18	                         draft-ietf-nvo3-framework-03.txt

20	    Status of this Memo

22	       This Internet-Draft is submitted in full conformance with the
23	       provisions of BCP 78 and BCP 79.

25	       Internet-Drafts are working documents of the Internet Engineering
26	       Task Force (IETF).  Note that other groups may also distribute
27	       working documents as Internet-Drafts. The list of current Internet-
28	       Drafts is at http://datatracker.ietf.org/drafts/current/.

30	       Internet-Drafts are draft documents valid for a maximum of six
31	       months and may be updated, replaced, or obsoleted by other documents
32	       at any time.  It is inappropriate to use Internet-Drafts as
33	       reference material or to cite them other than as "work in progress."

35	       This Internet-Draft will expire on January 4, 2014.

37	    Copyright Notice

39	       Copyright (c) 2013 IETF Trust and the persons identified as the
40	       document authors. All rights reserved.

42	       This document is subject to BCP 78 and the IETF Trust's Legal
43	       Provisions Relating to IETF Documents
44	       (http://trustee.ietf.org/license-info) in effect on the date of
45	       publication of this document. Please review these documents
46	       carefully, as they describe your rights and restrictions with
47	       respect to this document. Code Components extracted from this
48	       document must include Simplified BSD License text as described in
49	       Section 4.e of the Trust Legal Provisions and are provided without
50	       warranty as described in the Simplified BSD License.

52	    Abstract

54	       Several IETF drafts relate to the use of overlay networks to support
55	       large scale virtual data centers. This draft provides a framework
56	       for Network Virtualization over L3 (NVO3) and is intended to help
57	       plan a set of work items in order to provide a complete solution
58	       set. It defines a logical view of the main components with the
59	       intention of streamlining the terminology and focusing the solution
60	       set.

62	    Table of Contents

64	       1. Introduction.................................................3
65	          1.1. Conventions used in this document.......................3
66	          1.2. General terminology.....................................4
67	          1.3. DC network architecture.................................6
68	       2. Reference Models.............................................8
69	          2.1. Generic Reference Model.................................8
70	          2.2. NVE Reference Model....................................11
71	          2.3. NVE Service Types......................................12
72	             2.3.1. L2 NVE providing Ethernet LAN-like service........12
73	             2.3.2. L3 NVE providing IP/VRF-like service..............12
74	       3. Functional components.......................................12
75	          3.1. Service Virtualization Components......................12
76	             3.1.1. Virtual Access Points (VAPs)......................12
77	             3.1.2. Virtual Network Instance (VNI)....................13
78	             3.1.3. Overlay Modules and VN Context....................13
79	             3.1.4. Tunnel Overlays and Encapsulation options.........14
80	             3.1.5. Control Plane Components..........................14
81	             3.1.5.1. Distributed vs Centralized Control Plane........14
82	             3.1.5.2. Auto-provisioning/Service discovery.............15
83	             3.1.5.3. Address advertisement and tunnel mapping........15
84	             3.1.5.4. Overlay Tunneling...............................16
85	          3.2. Multi-homing...........................................16
86	          3.3. VM Mobility............................................17
87	       4. Key aspects of overlay networks.............................18
88	          4.1. Pros & Cons............................................18
89	          4.2. Overlay issues to consider.............................19
90	             4.2.1. Data plane vs Control plane driven................19
91	             4.2.2. Coordination between data plane and control plane.20
92	             4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM)
93	             traffic..................................................20
94	             4.2.4. Path MTU..........................................21
95	             4.2.5. NVE location trade-offs...........................22
96	             4.2.6. Interaction between network overlays and underlays.22
97	       5. Security Considerations.....................................23
98	       6. IANA Considerations.........................................23
99	       7. References..................................................23
100	          7.1. Normative References...................................23
101	          7.2. Informative References.................................24
102	       8. Acknowledgments.............................................24

104	    1. Introduction

106	       This document provides a framework for Data Center Network
107	       Virtualization over Layer3 (L3) tunnels. This framework is intended
108	       to aid in standardizing protocols and mechanisms to support large-
109	       scale network virtualization for data centers.

111	       [NVOPS] defines the rationale for using overlay networks in order to
112	       build large multi-tenant data center networks. Compute, storage and
113	       network virtualization are often used in these large data centers to
114	       support a large number of communication domains and end systems.

116	       This document provides reference models and functional components of
117	       data center overlay networks as well as a discussion of technical
118	       issues that have to be addressed.

120	    1.1. Conventions used in this document

122	       The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
123	       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
124	       document are to be interpreted as described in RFC-2119 [RFC2119].

126	       In this document, these words will appear with that interpretation
127	       only when in ALL CAPS. Lower case uses of these words are not to be
128	       interpreted as carrying RFC-2119 significance.

130	    1.2. General terminology

132	       This document uses the following terminology:

134	       NVO3 Network: An overlay network that provides a Layer2 (L2) or
135	       Layer3 (L3) service to Tenant Systems over an L3 underlay network,
136	       using the architecture and protocols as defined by the NVO3 Working
137	       Group.

139	       Network Virtualization Edge (NVE): An NVE is the network entity that
140	       sits at the edge of an underlay network and implements L2 and/or L3
141	       network virtualization functions. The network-facing side of the NVE
142	       uses the underlying L3 network to tunnel frames to and from other
143	       NVEs. The tenant-facing side of the NVE sends and receives Ethernet
144	       frames to and from individual Tenant Systems.  An NVE could be
145	       implemented as part of a virtual switch within a hypervisor, a
146	       physical switch or router, a Network Service Appliance, or be split
147	       across multiple devices.

149	       Virtual Network (VN): A VN is a logical abstraction of a physical
150	       network that provides L2 or L3 network services to a set of Tenant
151	       Systems. A VN is also known as a Closed User Group (CUG).

153	       Virtual Network Instance (VNI): A specific instance of a VN.

155	       Virtual Network Context (VN Context) Identifier: Field in overlay
156	       encapsulation header that identifies the specific VN the packet
157	       belongs to. The egress NVE uses the VN Context identifier to deliver
158	       the packet to the correct Tenant System. The VN Context identifier
159	       can be a locally significant identifier or a globally unique
160	       identifier.

162	       Underlay or Underlying Network: The network that provides the
163	       connectivity among NVEs and over which NVO3 packets are tunneled,
164	       where an NVO3 packet carries an NVO3 overlay header followed by a
165	       tenant packet. The Underlay Network does not need to be aware that
166	       it is carrying NVO3 packets. Addresses on the Underlay Network
167	       appear as "outer addresses" in encapsulated NVO3 packets. In
168	       general, the Underlay Network can use a completely different
169	       protocol (and address family) from that of the overlay. In the case
170	       of NVO3, the underlay network is typically IP.

172	       Data Center (DC): A physical complex housing physical servers,
173	       network switches and routers, network service appliances and
174	       networked storage. The purpose of a Data Center is to provide
175	       application, compute and/or storage services. One such service is
176	       virtualized infrastructure data center services, also known as
177	       Infrastructure as a Service.

179	       Virtual Data Center (Virtual DC): A container for virtualized
180	       compute, storage and network services. A Virtual DC is associated
181	       with a single tenant, and can contain multiple VNs and Tenant
182	       Systems connected to one or more of these VNs.

184	       Virtual machine (VM): A software implementation of a physical
185	       machine that runs programs as if they were executing on a physical,
186	       non-virtualized machine.  Applications (generally) do not know they
187	       are running on a VM as opposed to running on a "bare metal" host or
188	       server, though some systems provide a para-virtualization
189	       environment that allows an operating system or application to be
190	       aware of the presence of virtualization for optimization purposes.

192	       Hypervisor: Software running on a server that allows multiple VMs to
193	       run on the same physical server. The hypervisor manages and provides
194	       shared compute/memory/storage and network connectivity to the VMs
195	       that it hosts. Hypervisors often embed a Virtual Switch (see below).

197	       Server: A physical end host machine that runs user applications. A
198	       standalone (or "bare metal") server runs a conventional operating
199	       system hosting a single-tenant application. A virtualized server
200	       runs a hypervisor supporting one or more VMs.

202	       Virtual Switch (vSwitch): A function within a Hypervisor (typically
203	       implemented in software) that provides similar forwarding services
204	       to a physical Ethernet switch. A vSwitch forwards Ethernet frames
205	       between VMs running on the same server, or between a VM and a
206	       physical NIC card connecting the server to a physical Ethernet
207	       switch or router. A vSwitch also enforces network isolation between
208	       VMs that by policy are not permitted to communicate with each other
209	       (e.g., by honoring VLANs). A vSwitch may be bypassed when an NVE is
210	       enabled on the host server.

212	       Tenant: The customer using a virtual network and any associated
213	       resources (e.g., compute, storage and network).  A tenant could be
214	       an enterprise, or a department/organization within an enterprise.

216	       Tenant System: A physical or virtual system that can play the role
217	       of a host, or a forwarding element such as a router, switch,
218	       firewall, etc. It belongs to a single tenant and connects to one or
219	       more VNs of that tenant.

221	       Tenant Separation: Tenant Separation refers to isolating traffic of
222	       different tenants such that traffic from one tenant is not visible
223	       to or delivered to another tenant, except when allowed by policy.
224	       Tenant Separation also refers to address space separation, whereby
225	       different tenants can use the same address space without conflict.
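
The address space separation described above can be illustrated with
a minimal sketch. It assumes forwarding state is kept in one table per
VN; the class name, VN names and attachment labels are hypothetical and
are not defined by this document.

```python
class SeparatedTables:
    """Hypothetical per-VN forwarding state (illustrative only)."""

    def __init__(self):
        # vn_id -> {tenant_address: attachment}
        self._tables = {}

    def add(self, vn_id, address, attachment):
        self._tables.setdefault(vn_id, {})[address] = attachment

    def lookup(self, vn_id, address):
        # A lookup only ever consults the table of the given VN, so
        # identical addresses in different VNs never collide.
        return self._tables.get(vn_id, {}).get(address)


tables = SeparatedTables()
# Two tenants reuse the same address without conflict.
tables.add("vn-red", "192.0.2.10", "vap-1")
tables.add("vn-blue", "192.0.2.10", "vap-7")
```

Because the VN is part of every lookup key, traffic from one tenant
cannot resolve to an attachment that belongs to another tenant unless
policy explicitly installs such an entry.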

227	       Virtual Access Points (VAPs): Tenant Systems are connected to VNIs
228	       through VAPs. VAPs can be physical ports or virtual ports identified
229	       through logical interface identifiers (e.g., VLAN ID, internal
230	       vSwitch Interface ID connected to a VM).

232	       End Device: A physical device that connects directly to the DC
233	       Underlay Network. This is in contrast to a Tenant System, which
234	       connects to a corresponding tenant VN. An End Device is administered
235	       by the DC operator rather than a tenant, and is part of the DC
236	       infrastructure. An End Device may implement NVO3 technology in
237	       support of NVO3 functions. Examples of an End Device include hosts
238	       (e.g., server or server blade), storage systems (e.g., file servers,
239	       iSCSI storage systems), and network devices (e.g., firewall, load-
240	       balancer, IPSec gateway).

242	       Network Virtualization Authority (NVA): Entity that provides
243	       reachability and forwarding information to NVEs. An NVA is also
244	       known as a controller.

246	    1.3. DC network architecture

248	       A generic architecture for Data Centers is depicted in Figure 1:

250	                                    ,---------.
251	                                  ,'           `.
252	                                 (  IP/MPLS WAN )
253	                                  `.           ,'
254	                                    `-+------+'
255	                                     \      /
256	                              +--------+   +--------+
257	                              |   DC   |+-+|   DC   |
258	                              |gateway |+-+|gateway |
259	                              +--------+   +--------+
260	                                    |       /
261	                                    .--. .--.
262	                                  (    '    '.--.
263	                                .-.' Intra-DC     '
264	                               (     network      )
265	                                (             .'-'
266	                                 '--'._.'.    )\ \
267	                                 / /     '--'  \ \
268	                                / /      | |    \ \
269	                       +--------+   +--------+   +--------+
270	                       | access |   | access |   | access |
271	                       | switch |   | switch |   | switch |
272	                       +--------+   +--------+   +--------+
273	                          /     \    /    \     /      \
274	                       __/_      \  /      \   /_      _\__
275	                 '--------'   '--------'   '--------'   '--------'
276	                 :  End   :   :  End   :   :  End   :   :  End   :
277	                 : Device :   : Device :   : Device :   : Device :
278	                 '--------'   '--------'   '--------'   '--------'

280	                 Figure 1 : A Generic Architecture for Data Centers

282	       An example of multi-tier DC network architecture is presented in
283	       Figure 1. It provides a view of physical components inside a DC.

285	       A DC network is usually composed of intra-DC networks and network
286	       services, and inter-DC network and network connectivity services.
287	       Depending upon the scale, DC distribution, operations model, Capital
288	       expenditure (Capex) and Operational expenditure (Opex) aspects, DC
289	       networking elements can act as strict L2 switches and/or provide IP
290	       routing capabilities, including network service virtualization.

292	       In some DC architectures, some tier layers could provide L2 and/or
293	       L3 services. In addition, some tier layers may be collapsed, and
294	       Internet connectivity, inter-DC connectivity and VPN support may be
295	       handled by a smaller number of nodes. Nevertheless, one can assume
296	       that the network functional blocks in a DC fit in the architecture
297	       depicted in Figure 1.

299	       The following components can be present in a DC:

301	          o Access switch: Hardware-based Ethernet switch aggregating all
302	            Ethernet links from the End Devices in a rack representing the
303	            entry point in the physical DC network for the hosts. It may
304	            also provide routing functionality, virtual IP network
305	            connectivity, or Layer2 tunneling over IP for instance. Access
306	            switches are usually multi-homed to aggregation switches in the
307	            Intra-DC network. A typical example of an access switch is a
308	            Top of Rack (ToR) switch. Other deployment scenarios may use an
309	            intermediate Blade Switch before the ToR, or an EoR (End of
310	            Row) switch, to provide similar function as a ToR.

312	          o Intra-DC Network: Network composed of high capacity core nodes
313	            (Ethernet switches/routers). Core nodes may provide virtual
314	            Ethernet bridging and/or IP routing services.

316	          o DC Gateway (DC GW): Gateway to the outside world providing DC
317	            Interconnect and connectivity to Internet and VPN customers. In
318	            the current DC network model, this may be simply a router
319	            connected to the Internet and/or an IP Virtual Private Network
320	            (VPN)/L2VPN PE. Some network implementations may dedicate DC
321	            GWs for different connectivity types (e.g., a DC GW for
322	            Internet, and another for VPN).

324	       Note that End Devices may be single or multi-homed to access
325	       switches.

327	    2. Reference Models

329	    2.1. Generic Reference Model

331	       Figure 2 depicts a DC reference model for network virtualization
332	       using L3 (IP/MPLS) overlays where NVEs provide a logical
333	       interconnect between Tenant Systems that belong to a specific VN.

335	             +--------+                                    +--------+
336	             | Tenant +--+                            +----| Tenant |
337	             | System |  |                           (')   | System |
338	             +--------+  |    .................     (   )  +--------+
339	                         |  +---+           +---+    (_)
340	                         +--|NVE|---+   +---|NVE|-----+
341	                            +---+   |   |   +---+
342	                            / .    +-----+      .
343	                           /  . +--| NVA |      .
344	                          /   . |  +-----+      .
345	                         |    . |               .
346	                         |    . |  L3 Overlay +--+--++--------+
347	             +--------+  |    . |   Network   | NVE || Tenant |
348	             | Tenant +--+    . |             |     || System |
349	             | System |       .  \ +---+      +--+--++--------+
350	             +--------+       .....|NVE|.........
351	                                   +---+
352	                                     |
353	                                     |
354	                           =====================
355	                             |               |
356	                         +--------+      +--------+
357	                         | Tenant |      | Tenant |
358	                         | System |      | System |
359	                         +--------+      +--------+

361	          Figure 2 : Generic reference model for DC network virtualization
362	                         over a Layer3 (IP) infrastructure

364	       In order to get reachability information, NVEs may exchange
365	       information directly between themselves via a protocol. In this
366	       case, a control plane module resides in every NVE. This is how
367	       routing control plane modules are implemented in routers for
368	       instance.

370	       It is also possible for NVEs to communicate with an external Network
371	       Virtualization Authority (NVA) to obtain reachability and forwarding
372	       information. In this case, a protocol is used between NVEs and
373	       NVA(s) to exchange information. OpenFlow [OF] is one example of such
374	       a protocol.

376	       It should be noted that NVAs may be organized in clusters for
377	       redundancy and scalability and can appear as one logically
378	       centralized controller. In this case, inter-NVA communication is
379	       necessary to synchronize state among nodes within a cluster or share
380	       information across clusters. The information exchanged between NVAs
381	       of the same cluster could be different from the information
382	       exchanged across clusters.
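
As an illustration of the NVA-based model, the sketch below shows an
NVE resolving a tenant address through an external authority and
caching the answer. It is a sketch under stated assumptions: the class
and method names are hypothetical, the NVA is a single in-process
object rather than a cluster, and no real NVE-to-NVA protocol is
modeled.

```python
class NVA:
    """Hypothetical authority: (vn_id, tenant_addr) -> NVE address."""

    def __init__(self):
        self._map = {}

    def register(self, vn_id, tenant_addr, nve_addr):
        self._map[(vn_id, tenant_addr)] = nve_addr

    def resolve(self, vn_id, tenant_addr):
        return self._map.get((vn_id, tenant_addr))


class NVE:
    """Hypothetical NVE that learns reachability from the NVA."""

    def __init__(self, underlay_addr, nva):
        self.underlay_addr = underlay_addr
        self._nva = nva
        self._cache = {}

    def attach(self, vn_id, tenant_addr):
        # Report a locally attached Tenant System to the NVA.
        self._nva.register(vn_id, tenant_addr, self.underlay_addr)

    def next_hop(self, vn_id, tenant_addr):
        # Ask the NVA on a cache miss; remember positive answers.
        key = (vn_id, tenant_addr)
        if key not in self._cache:
            remote = self._nva.resolve(vn_id, tenant_addr)
            if remote is None:
                return None
            self._cache[key] = remote
        return self._cache[key]


nva = NVA()
nve_a = NVE("192.0.2.1", nva)
nve_b = NVE("192.0.2.2", nva)
nve_b.attach("vn-1", "00:00:5e:00:53:01")
```

In the distributed model described earlier, the same mapping would
instead be exchanged directly between the NVEs, with no NVA object
involved.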

384	       A Tenant System can be attached to an NVE in several ways:

386	         - locally, by being co-located in the same End Device

388	         - remotely, via a point-to-point connection or a switched network

390	       When an NVE is co-located with a Tenant System, the state of the
391	       Tenant System can be provided without protocol assistance. For
392	       instance, the operational status of a VM can be communicated via a
393	       local API. When an NVE is remotely connected to a Tenant System, the
394	       state of the Tenant System or NVE needs to be exchanged directly or
395	       via a management entity, using a control plane protocol or API, or
396	       directly via a dataplane protocol.

398	       The functional components in Figure 2 do not necessarily map
399	       directly to the physical components described in Figure 1. For
400	       example, an End Device can be a server blade with VMs and a virtual
401	       switch. A VM can be a Tenant System and the NVE functions may be
402	       performed by the host server. In this case, the Tenant System and
403	       NVE function are co-located.

405	       Another example is the case where the End Device is the Tenant
406	       System, and the NVE function can be implemented by the connected
407	       ToR.

409	       The NVE implements network virtualization functions that allow for
410	       L2 and/or L3 tenant separation.

412	       Underlay nodes utilize L3 technologies to interconnect NVE nodes.
413	       These nodes perform forwarding based on outer L3 header information,
414	       and generally do not maintain per tenant-service state albeit some
415	       applications (e.g., multicast) may require control plane or
416	       forwarding plane information that pertain to a tenant, group of
417	       tenants, tenant service or a set of services that belong to one or
418	       more tenants. When such tenant or tenant-service related information
419	       is maintained in the underlay, mechanisms to control that
420	       information should be provided.

422	    2.2. NVE Reference Model

424	       Figure 3 depicts the NVE reference model. One or more VNIs can be
425	       instantiated on an NVE. A Tenant System interfaces with a
426	       corresponding VNI via a VAP. An overlay module provides tunneling
427	       overlay functions (e.g., encapsulation and decapsulation of tenant
428	       traffic, tenant identification and mapping, etc.).

430	                         +-------- L3 Network -------+
431	                         |                           |
432	                         |        Tunnel Overlay     |
433	             +------------+---------+       +---------+------------+
434	             | +----------+-------+ |       | +---------+--------+ |
435	             | |  Overlay Module  | |       | |  Overlay Module  | |
436	             | +---------+--------+ |       | +---------+--------+ |
437	             |           |VN context|       | VN context|          |
438	             |           |          |       |           |          |
439	             |  +--------+-------+  |       |  +--------+-------+  |
440	             |  | |VNI|   .  |VNI|  |       |  | |VNI|   .  |VNI|  |
441	        NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2
442	             |    |   VAPs     |    |       |    |    VAPs   |     |
443	             +----+------------+----+       +----+-----------+-----+
444	                  |            |                 |           |
445	                  |            |                 |           |
446	                 Tenant Systems                 Tenant Systems

448	                      Figure 3 : Generic NVE reference model

450	       Note that some NVE functions (e.g., data plane and control plane
451	       functions) may reside in one device or may be implemented separately
452	       in different devices. In addition, NVE functions can be implemented
453	       in a hierarchical fashion. For instance, an End Device can act as an
454	       NVE Spoke, while an access switch can act as an NVE hub.
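
The relationship between VAPs, VNIs and the overlay module in the NVE
reference model can be sketched as follows. This is only a logical
model: the class, field and method names are hypothetical, the VN
Context value is assumed to have the same meaning on both NVEs, and no
real encapsulation format is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TunnelPacket:
    outer_src: str      # underlay address of the ingress NVE
    outer_dst: str      # underlay address of the egress NVE
    vn_context: int     # identifies the VN the frame belongs to
    inner_frame: bytes  # unmodified tenant frame


class NVE:
    """Hypothetical NVE with VAPs, VNIs and an overlay module."""

    def __init__(self, underlay_addr):
        self.underlay_addr = underlay_addr
        self.vap_to_vni = {}      # VAP id -> VNI name
        self.vni_context = {}     # VNI name -> VN Context value
        self.context_to_vni = {}  # reverse map used on egress

    def add_vni(self, vni, vn_context):
        self.vni_context[vni] = vn_context
        self.context_to_vni[vn_context] = vni

    def bind_vap(self, vap, vni):
        self.vap_to_vni[vap] = vni

    def encapsulate(self, vap, frame, remote_nve):
        # Ingress: the VAP selects the VNI, the VNI selects the context.
        vni = self.vap_to_vni[vap]
        return TunnelPacket(self.underlay_addr, remote_nve,
                            self.vni_context[vni], frame)

    def decapsulate(self, packet):
        # Egress: the VN Context selects the VNI for the inner frame.
        return self.context_to_vni[packet.vn_context], packet.inner_frame


nve1 = NVE("192.0.2.1")
nve2 = NVE("192.0.2.2")
for nve in (nve1, nve2):
    nve.add_vni("vni-red", 100)
nve1.bind_vap("port-1", "vni-red")
packet = nve1.encapsulate("port-1", b"tenant-frame", nve2.underlay_addr)
```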

456	    2.3. NVE Service Types

458	       NVE components may be used to provide different types of virtualized
459	       network services. This section defines the service types and
460	       associated attributes. Note that an NVE may be capable of providing
461	       both L2 and L3 services.

463	    2.3.1. L2 NVE providing Ethernet LAN-like service

465	       L2 NVE implements Ethernet LAN emulation, an Ethernet based
466	       multipoint service similar to an IETF VPLS or EVPN service, where
467	       the Tenant Systems appear to be interconnected by a LAN environment
468	       over an L3 overlay. As such, an L2 NVE provides per-tenant virtual
469	       switching instance (L2 VNI), and L3 (IP/MPLS) tunneling
470	       encapsulation of tenant MAC frames across the underlay. Note that
471	       the control plane for an L2 NVE could be implemented locally on the
472	       NVE or in a separate control entity.
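
A per-tenant virtual switching instance can be pictured with the
sketch below. It assumes conventional Ethernet bridge behavior (source
MAC learning, flooding of unknown unicast), which this section does not
mandate; the mappings could equally be installed by a control plane.
All names are hypothetical.

```python
class L2Vni:
    """Hypothetical per-tenant virtual switching instance."""

    def __init__(self):
        self._mac_table = {}   # MAC -> location (local VAP or remote NVE)
        self._members = set()  # every location that belongs to this VNI

    def add_member(self, location):
        self._members.add(location)

    def learn(self, src_mac, location):
        # Remember where a source MAC address was last seen.
        self._mac_table[src_mac] = location

    def forward(self, dst_mac, arrived_from):
        known = self._mac_table.get(dst_mac)
        if known is not None:
            return [known]
        # Unknown unicast: flood to all members except the source.
        return sorted(m for m in self._members if m != arrived_from)


vni = L2Vni()
for location in ("vap-1", "vap-2", "nve-192.0.2.2"):
    vni.add_member(location)
vni.learn("00:00:5e:00:53:01", "vap-1")
```

Each tenant gets its own instance of this table, so MAC addresses
learned in one L2 VNI are never used to forward frames in another.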

474	    2.3.2. L3 NVE providing IP/VRF-like service

476	       L3 NVE provides Virtualized IP forwarding service, similar from a
477	       service definition perspective to IETF IP VPN (e.g., BGP/MPLS IPVPN
478	       [RFC4364]). That is, an L3 NVE provides per-tenant forwarding and
479	       routing instance (L3 VNI), and L3 (IP/MPLS) tunneling encapsulation
480	       of tenant IP packets across the underlay. Note that routing could be
481	       performed locally on the NVE or in a separate control entity.
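
A per-tenant forwarding and routing instance can be sketched as a
longest-prefix-match table that returns the egress NVE for a tenant
destination. The class name, route entries and NVE labels below are
hypothetical; how routes are populated (locally or by a separate
control entity) is deliberately left out.

```python
import ipaddress


class L3Vni:
    """Hypothetical per-tenant IP forwarding instance."""

    def __init__(self):
        self._routes = []  # list of (network, next_hop_nve)

    def add_route(self, prefix, next_hop_nve):
        self._routes.append((ipaddress.ip_network(prefix), next_hop_nve))

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [(net, nh) for net, nh in self._routes if addr in net]
        if not matches:
            return None
        # Longest prefix wins, as in conventional IP forwarding.
        return max(matches, key=lambda m: m[0].prefixlen)[1]


vni = L3Vni()
vni.add_route("198.51.100.0/24", "nve-a")
vni.add_route("198.51.100.128/25", "nve-b")
```

A linear scan keeps the sketch short; an actual forwarding plane would
normally use a trie or hardware lookup structure instead.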

483	    3. Functional components

485	       This section decomposes the Network Virtualization architecture into
486	       functional components described in Figure 3 to make it easier to
487	       discuss solution options for these components.

489	    3.1. Service Virtualization Components

491	    3.1.1. Virtual Access Points (VAPs)

493	       Tenant Systems are connected to VNIs through Virtual Access Points
494	       (VAPs).

496	       VAPs can be physical ports or virtual ports identified through
497	       logical interface identifiers (e.g., VLAN ID, internal vSwitch
498	       Interface ID connected to a VM).

500	    3.1.2. Virtual Network Instance (VNI)

502	       A VNI is a specific VN instance on an NVE. Each VNI defines a
503	       forwarding context that contains reachability information and
504	       policies.

506	    3.1.3. Overlay Modules and VN Context

508	       Mechanisms for identifying each tenant service are required to allow
509	       the simultaneous overlay of multiple tenant services over the same
510	       underlay L3 network topology. In the data plane, each NVE, upon
511	       sending a tenant packet, must be able to encode the VN Context for
512	       the destination NVE in addition to the L3 tunneling information
513	       (e.g., source IP address identifying the source NVE and the
514	       destination IP address identifying the destination NVE, or MPLS
515	       label). This allows the destination NVE to identify the tenant
516	       service instance and therefore appropriately process and forward the
517	       tenant packet.

519	       The Overlay module provides tunneling overlay functions: tunnel
520	       initiation/termination as in the case of stateful tunnels (see
521	       Section 3.1.4), and/or simply encapsulation/decapsulation of frames
522	       from VAPs/L3 underlay.

524	       In a multi-tenant context, tunneling aggregates frames from/to
525	       different VNIs. Tenant identification and traffic demultiplexing are
526	       based on the VN Context identifier.

528	       The following approaches can be considered:

530	          o One VN Context identifier per Tenant: A globally unique (on a
531	            per-DC administrative domain) VN identifier is used to identify
532	            the corresponding VNI. Examples of such identifiers in existing
533	            technologies are IEEE VLAN IDs and ISID tags that identify
534	            virtual L2 domains when using IEEE 802.1aq and IEEE 802.1ah,
535	            respectively.

537	          o One VN Context identifier per VNI: A per-VNI local value is
538	            automatically generated by the egress NVE, or a control plane
539	            associated with that NVE, and usually distributed by a control
540	            plane protocol to all the related NVEs. An example of this
541	            approach is the use of per VRF MPLS labels in IP VPN [RFC4364].

543	          o One VN Context identifier per VAP: A per-VAP local value is
544	            assigned and usually distributed by a control plane protocol.
546	            An example of this approach is the use of per CE-PE MPLS labels
547	            in IP VPN [RFC4364].

549	       Note that when using one VN Context per VNI or per VAP, an
550	       additional global identifier (e.g., a VN identifier or name) may be
551	       used by the control plane to identify the Tenant context.
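
	       The demultiplexing role of the VN Context can be illustrated with a
	       minimal Python sketch. It is not taken from any NVO3 solution: the
	       OverlayPacket and EgressNVE names, the integer identifier, and the
	       example addresses are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlayPacket:
    """Hypothetical overlay packet: L3 tunnel info plus a VN Context."""
    outer_src: str    # underlay address of the ingress NVE
    outer_dst: str    # underlay address of the egress NVE
    vn_context: int   # VN Context identifier (format is an assumption)
    payload: bytes    # original tenant frame


class EgressNVE:
    """Hypothetical egress NVE that maps VN Context values to VNIs."""

    def __init__(self, address):
        self.address = address
        self.vni_by_context = {}  # VN Context -> VNI name

    def bind(self, vn_context, vni_name):
        self.vni_by_context[vn_context] = vni_name

    def demultiplex(self, packet):
        # The outer header only gets the packet to this NVE; the VN
        # Context selects the tenant forwarding context (VNI).
        if packet.outer_dst != self.address:
            raise ValueError("packet not addressed to this NVE")
        return self.vni_by_context.get(packet.vn_context)  # None: unknown


egress = EgressNVE("192.0.2.2")
egress.bind(5001, "vni-tenant-a")
egress.bind(5002, "vni-tenant-b")
pkt = OverlayPacket("192.0.2.1", "192.0.2.2", 5002, b"tenant frame")
selected = egress.demultiplex(pkt)
```

	       Two tenants share the same pair of NVE addresses in this sketch;
	       only the VN Context value distinguishes their traffic.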

553	    3.1.4. Tunnel Overlays and Encapsulation options

555	       Once the VN context identifier is added to the frame, an L3 tunnel
556	       encapsulation is used to transport the frame to the destination NVE.
557	       The underlay devices do not usually keep any per service state,
558	       simply forwarding the frames based on the outer tunnel header.

560	       Different IP tunneling options (e.g., GRE, L2TP, IPSec) and MPLS
561	       tunneling can be used. Tunneling could be stateless or stateful.
562	       Stateless tunneling simply entails the encapsulation of a tenant
563	       packet with another header necessary for forwarding the packet
564	       across the underlay (e.g., IP tunneling over an IP underlay).
565	       Stateful tunneling on the other hand entails maintaining tunneling
566	       state at the tunnel endpoints (i.e., NVEs). Tenant packets on an
567	       ingress NVE can then be transmitted over such tunnels to a
568	       destination (egress) NVE by encapsulating the packets with a
569	       corresponding tunneling header. The tunneling state at the endpoints
570	       may be configured or dynamically established. Solutions SHOULD
571	       specify the tunneling technology used and whether it is stateful or
572	       stateless. In this document, however, tunneling and tunneling
573	       encapsulation are used interchangeably to simply mean the
574	       encapsulation of a tenant packet with a tunneling header necessary
575	       to deliver the packet between an ingress NVE and an egress NVE
576	       across the underlay. It should be noted that stateful tunneling,
577	       especially when configuration is involved, does impose management
578	       overhead and scale constraints. Thus, stateless tunneling is
579	       preferred when feasible.

581	    3.1.5. Control Plane Components

583	    3.1.5.1. Distributed vs Centralized Control Plane

585	       A control/management plane entity can be centralized or distributed.
586	       Both approaches have been used extensively in the past. The routing
587	       model of the Internet is a good example of a distributed approach.
588	       Transport networks have usually used a centralized approach to
589	       manage transport paths.

591	       It is also possible to combine the two approaches, i.e., using a
592	       hybrid model. A global view of network state can have many benefits
593	       but it does not preclude the use of distributed protocols within the
594	       network. Centralized models provide a facility to maintain global
595	       state, and distribute that state to the network. When used in
596	       combination with distributed protocols, greater network
597	       efficiencies, improved reliability and robustness can be achieved.
598	       Domain and/or deployment specific constraints define the balance
599	       between centralized and distributed approaches.

601	    3.1.5.2. Auto-provisioning/Service discovery

603	       NVEs must be able to identify the appropriate VNI for each Tenant
604	       System. This is based on state information that is often provided by
605	       external entities. For example, in an environment where a VM is a
606	       Tenant System, this information is provided by VM orchestration
607	       systems, since these are the only entities that have visibility of
608	       which VM belongs to which tenant.

610	       A mechanism for communicating this information to the NVE is
611	       required. VAPs have to be created and mapped to the appropriate VNI.
612	       Depending upon the implementation, this control interface can be
613	       implemented using an auto-discovery protocol between Tenant Systems
614	       and their local NVE or through management entities. In either case,
615	       appropriate security and authentication mechanisms to verify that
616	       Tenant System information is not spoofed or altered are required.
617	       This is one critical aspect for providing integrity and tenant
618	       isolation in the system.

620	       NVEs may learn reachability information to VNIs on other NVEs via a
621	       control protocol exchanging such information among NVEs or via a
622	       management control entity.

624	    3.1.5.3. Address advertisement and tunnel mapping

626	       As traffic reaches an ingress NVE on a VAP, a lookup is performed to
627	       determine which NVE or local VAP the packet needs to be sent to. If
628	       the packet is to be sent to another NVE, the packet is encapsulated
629	       with a tunnel header containing the destination information
630	       (destination IP address or MPLS label) of the egress NVE.
631	       Intermediate nodes (between the ingress and egress NVEs) switch or
632	       route traffic based upon the tunnel destination information.

634	       A key step in the above process consists of identifying the
635	       destination NVE the packet is to be tunneled to. NVEs are
636	       responsible for maintaining a set of forwarding or mapping tables
637	       that hold the bindings between destination VM and egress NVE
638	       addresses. Several ways of populating these tables are possible:
639	       control plane driven, management plane driven, or data plane driven.

641	       When a control plane protocol is used to distribute address
642	       reachability and tunneling information, the auto-
643	       provisioning/Service discovery could be accomplished by the same
644	       protocol. In this scenario, the auto-provisioning/Service discovery
645	       could be combined with (be inferred from) the address advertisement
646	       and associated tunnel mapping. Furthermore, a control plane protocol
647	       that carries both MAC and IP addresses eliminates the need for ARP,
648	       and hence addresses one of the issues with explosive ARP handling.
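
	       The lookup step described above can be sketched as follows. This is
	       a simplified illustration only: the mapping table layout, the
	       forward() helper, and the example addresses are assumptions, and
	       the way the table is populated (control, management, or data plane)
	       is deliberately left out.

```python
LOCAL_NVE = "192.0.2.1"  # hypothetical underlay address of this NVE


def forward(mapping, dest_addr, local_nve=LOCAL_NVE):
    """Decide what an ingress NVE does with a packet for dest_addr.

    mapping: tenant destination address -> (NVE address, VAP name or None)
    Returns ("local", vap), ("tunnel", egress_nve) or ("unknown", None).
    """
    entry = mapping.get(dest_addr)
    if entry is None:
        return ("unknown", None)   # e.g., flood or query a control plane
    nve, vap = entry
    if nve == local_nve:
        return ("local", vap)      # deliver on a local VAP, no tunnel
    return ("tunnel", nve)         # encapsulate toward the egress NVE


# One mapping table per VNI; entries here are made up for illustration.
vni_table = {
    "00:00:5e:00:53:01": ("192.0.2.1", "vap-1"),
    "00:00:5e:00:53:02": ("192.0.2.2", None),
}
local_result = forward(vni_table, "00:00:5e:00:53:01")
remote_result = forward(vni_table, "00:00:5e:00:53:02")
miss_result = forward(vni_table, "00:00:5e:00:53:03")
```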

650	    3.1.5.4. Overlay Tunneling

652	       For overlay tunneling, and dependent upon the tunneling technology
653	       used for encapsulating the tenant system packets, it may be
654	       sufficient to have one or more local NVE addresses assigned and used
655	       in the source and destination fields of a tunneling encapsulating
656	       header. Other information that is part of the
657	       tunneling encapsulation header may also need to be configured. In
658	       certain cases, local NVE configuration may be sufficient while in
659	       other cases, some tunneling related information may need to
660	       be shared among NVEs. The information that needs to be shared will
661	       be technology dependent. This includes the discovery and
662	       announcement of the tunneling technology used. In certain cases,
663	       such as when using IP multicast in the underlay, tunnels may need to
664	       be established, interconnecting NVEs. When tunneling information
665	       needs to be exchanged or shared among NVEs, a control plane protocol
666	       may be required. For instance, it may be necessary to provide
667	       active/standby status information between NVEs, up/down status
668	       information, pruning/grafting information for multicast tunnels,
669	       etc.

671	       In addition, a control plane may be required to set up the tunnel
672	       path for some tunneling technologies. This applies to both unicast
673	       and multicast tunneling.

675	    3.2. Multi-homing

677	       Multi-homing techniques can be used to increase the reliability of
678	       an nvo3 network. It is also important to ensure that physical
679	       diversity in an nvo3 network is taken into account to avoid single
680	       points of failure.

682	       Multi-homing can be enabled in various nodes, from tenant systems
683	       into TORs, TORs into core switches/routers, and core nodes into DC
684	       GWs.

686	       The nvo3 underlay nodes (i.e., from NVEs to DC GWs) rely on IP
687	       routing techniques or on MPLS re-routing capabilities as the means
688	       to re-route traffic upon failures.

690	       When a tenant system is co-located with the NVE, the Tenant System
691	       is single homed to the NVE via a virtual port. When the Tenant
692	       System and the NVE are separated, the Tenant System is connected to
693	       the NVE via a logical Layer2 (L2) construct such as a VLAN and it
694	       can be multi-homed to various NVEs. An NVE may provide an L2 service
695	       to the end system or an L3 service. An NVE may be multi-homed to a
696	       next layer in the DC at Layer2 (L2) or Layer3 (L3). When an NVE
697	       provides an L2 service and is not co-located with the end
698	       system, techniques such as Ethernet Link Aggregation Group (LAG) or
699	       Spanning Tree Protocol (STP) can be used to switch traffic
700	       between an end system and connected NVEs without creating
701	       loops. Similarly, when the NVE provides an L3 service, similar dual-
702	       homing techniques can be used. When the NVE provides an L3 service to
703	       the end system, it is possible that no dynamic routing protocol is
704	       enabled between the end system and the NVE. The end system can be
705	       multi-homed to multiple physically-separated L3 NVEs over multiple
706	       interfaces. When one of the links connected to an NVE fails, the
707	       other interfaces can be used to reach the end system.

709	       External connectivity out of a DC can be handled by two or more DC
710	       gateways. Each gateway provides access to external networks such as
711	       VPNs or the Internet. A gateway may be connected to two or more edge
712	       nodes in the external network for redundancy. When a connection to
713	       an upstream node is lost, the alternative connection is used and the
714	       failed route withdrawn.

716	    3.3. VM Mobility

718	       In DC environments utilizing VM technologies, an important feature
719	       is that VMs can move from one server to another server in the same
720	       or different L2 physical domains (within or across DCs) in a
721	       seamless manner.

723	       A VM can be moved from one server to another in stopped or suspended
724	       state ("cold" VM mobility) or in running/active state ("hot" VM
725	       mobility). With "hot" mobility, VM L2 and L3 addresses need to be
726	       preserved. With "cold" mobility, it may be desired to preserve at
727	       least VM L3 addresses.

729	       Solutions to maintain connectivity while a VM is moved are necessary
730	       in the case of "hot" mobility. This implies that transport
731	       connections among VMs are preserved. For instance, for L2 VNs, ARP
732	       caches are updated accordingly.

734	       Upon VM mobility, NVE policies that define connectivity among VMs
735	       must be maintained.

737	       Optimal routing during VM mobility is also an important aspect to
738	       address. It is expected that the VM's default gateway be as close as
739	       possible to the server hosting the VM.

741	    4. Key aspects of overlay networks

743	       The intent of this section is to highlight specific issues that
744	       proposed overlay solutions need to address.

746	    4.1. Pros & Cons

748	       An overlay network is a layer of virtual network topology on top of
749	       the physical network.

751	       Overlay networks offer the following key advantages:

753	          o Unicast tunneling state management and association of Tenant
754	            Systems reachability are handled at the edge of the network (at
755	            the NVE). Intermediate transport nodes are unaware of such
756	            state. Note that when multicast is enabled in the underlay
757	            network to build multicast trees for tenant VNs, there would be
758	            more state related to tenants in the underlay core network.

760	          o Tunneling is used to aggregate traffic and hide tenant
761	            addresses from the underlay network, and hence offer the
762	            advantage of minimizing the amount of forwarding state required
763	            within the underlay network

765	          o Decoupling of the overlay addresses (MAC and IP) used by VMs
766	            from the underlay network for tenant separation and separation
767	            of the tenant address spaces and the underlay address space.

769	          o Support of a large number of virtual network identifiers

771	       Overlay networks also create several challenges:

773	          o Overlay networks have no control over underlay networks and lack
774	            critical underlay network information
775	               o Overlay networks and/or their associated management
776	                 entities typically probe the network to measure link or
777	                 path properties, such as available bandwidth or packet
778	                 loss rate. It is difficult to accurately evaluate network
779	                 properties. It might be preferable for the underlay
780	                 network to expose usage and performance information.

782	          o Miscommunication or lack of coordination between overlay and
783	            underlay networks can lead to an inefficient usage of network
784	            resources.

786	          o When multiple overlays co-exist on top of a common underlay
787	            network, the lack of coordination between overlays can lead to
788	            performance issues and/or resource usage inefficiencies.

790	          o Traffic carried over an overlay may not traverse firewalls and
791	            NAT devices.

793	          o Multicast service scalability: Multicast support may be
794	            required in the underlay network to address tenant flood
795	            containment or efficient multicast handling. The underlay may
796	            also be required to maintain multicast state on a per-tenant
797	            basis, or even on a per-individual multicast flow of a given
798	            tenant. Ingress replication at the NVE eliminates that
799	            additional multicast state in the underlay core, but depending
800	            on the multicast traffic volume, it may cause inefficient use
801	            of bandwidth.

803	          o Hash-based load balancing may not be optimal as the hash
804	            algorithm may not work well due to the limited number of
805	            combinations of tunnel source and destination addresses. Other
806	            NVO3 mechanisms may use entropy information in addition to
807	            source and destination addresses.
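
	       The load-balancing limitation in the last item can be illustrated
	       with a small Python sketch. The hash function, the four-path
	       underlay, and the per-flow entropy value are assumptions chosen for
	       illustration; real devices use their own hash inputs and
	       algorithms.

```python
import hashlib


def pick_path(fields, n_paths):
    """Pick one of n_paths equal-cost paths from a tuple of header fields."""
    key = "|".join(str(f) for f in fields).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


INGRESS, EGRESS = "192.0.2.1", "192.0.2.2"  # hypothetical NVE addresses
N_PATHS = 4
flows = range(64)  # 64 tenant flows between the same pair of NVEs

# Hashing on the outer addresses only: every flow sees the same input.
outer_only = {pick_path((INGRESS, EGRESS), N_PATHS) for _ in flows}

# Adding a per-flow entropy value (assumed here to be derived from the
# inner headers) gives the hash something to distinguish flows by.
with_entropy = {pick_path((INGRESS, EGRESS, flow_id), N_PATHS)
                for flow_id in flows}
```

	       With outer addresses alone, all 64 flows land on a single path,
	       which is the behavior the item above warns about.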

809	    4.2. Overlay issues to consider

811	    4.2.1. Data plane vs Control plane driven

813	       In the case of an L2 NVE, it is possible to dynamically learn MAC
814	       addresses against VAPs. It is also possible that such addresses be
815	       known and controlled via management or a control protocol for both
816	       L2 NVEs and L3 NVEs. Dynamic data plane learning implies that
817	       flooding of unknown destinations be supported and hence implies that
818	       broadcast and/or multicast be supported or that ingress replication
819	       be used as described in section 4.2.3. Multicasting in the underlay
820	       network for dynamic learning may lead to significant scalability
821	       limitations. Specific forwarding rules must be enforced to prevent
822	       loops from happening. This can be achieved using a spanning tree, a
823	       shortest path tree, or a split-horizon mesh.

825	       It should be noted that the amount of state to be distributed is
826	       dependent upon network topology and the number of virtual machines.
827	       Different forms of caching can also be utilized to minimize state
828	       distribution between the various elements. The control plane should
829	       not require an NVE to maintain the locations of all the tenant
830	       systems whose VNs are not present on the NVE. The use of a control
831	       plane does not imply that the data plane on NVEs has to maintain all
832	       the forwarding state in the control plane.

834	    4.2.2. Coordination between data plane and control plane

836	       For an L2 NVE, the NVE needs to be able to determine MAC addresses
837	       of the Tenant Systems connected via a VAP. This can be achieved via
838	       dataplane learning or a control plane. For an L3 NVE, the NVE needs
839	       to be able to determine IP addresses of the Tenant Systems connected
840	       via a VAP.

842	       In both cases, coordination with the NVE control protocol is needed
843	       such that when the NVE determines that the set of addresses behind a
844	       VAP has changed, it triggers the NVE control plane to distribute
845	       this information to its peers.
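
	       A minimal sketch of this trigger is shown below. The
	       VapAddressTracker class and the advertise callback are hypothetical
	       names; how addresses are learned and how the NVE control protocol
	       encodes the update are outside the scope of the sketch.

```python
class VapAddressTracker:
    """Tracks the addresses behind each VAP and reports changes."""

    def __init__(self, advertise):
        self._advertise = advertise  # callback into the NVE control plane
        self._addresses = {}         # VAP name -> frozenset of addresses

    def update(self, vap, addresses):
        """Record the current address set; advertise only on change."""
        new = frozenset(addresses)
        old = self._addresses.get(vap, frozenset())
        if new == old:
            return False             # nothing changed, nothing to send
        self._addresses[vap] = new
        self._advertise(vap, added=new - old, withdrawn=old - new)
        return True


events = []
tracker = VapAddressTracker(
    lambda vap, added, withdrawn: events.append(
        (vap, sorted(added), sorted(withdrawn))))

tracker.update("vap-1", {"10.0.0.1"})              # first address learned
tracker.update("vap-1", {"10.0.0.1"})              # no change, no event
tracker.update("vap-1", {"10.0.0.2"})              # address replaced
```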

847	    4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

849	       There are several options to support packet replication needed for
850	       broadcast, unknown unicast and multicast.  Typical methods include:

852	          o Ingress replication

854	          o Use of underlay multicast trees

856	       There is a bandwidth vs state trade-off between the two approaches.
857	       Depending upon the degree of replication required (i.e. the number
858	       of hosts per group) and the amount of multicast state to maintain,
859	       trading bandwidth for state should be considered.

861	       When the number of hosts per group is large, the use of underlay
862	       multicast trees may be more appropriate. When the number of hosts is
863	       small (e.g. 2-3) and/or the amount of multicast traffic is small,
864	       ingress replication may not be an issue.

866	       Depending upon the size of the data center network and hence the
867	       number of (S,G) entries, but also the duration of multicast flows,
868	       the use of underlay multicast trees can be a challenge.

870	       When flows are well known, it is possible to pre-provision such
871	       multicast trees. However, it is often difficult to predict
872	       application flows ahead of time, and hence programming of (S,G)
873	       entries for short-lived flows could be impractical.

875	       A possible trade-off is to use in the underlay shared multicast
876	       trees as opposed to dedicated multicast trees.
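
	       The bandwidth side of this trade-off can be made concrete with a
	       short calculation. The sketch below counts only the traffic sent by
	       the ingress NVE on its underlay uplink and ignores the multicast
	       state held in the underlay; the function name and the example
	       figures are assumptions.

```python
def ingress_uplink_bps(stream_bps, remote_nves, method):
    """Traffic the ingress NVE sends into the underlay for one BUM stream.

    method: "ingress" for ingress replication (one unicast copy per
            remote NVE) or "tree" for an underlay multicast tree (one
            copy, replicated by the underlay).
    """
    if remote_nves < 0:
        raise ValueError("remote_nves must be non-negative")
    if method == "ingress":
        copies = remote_nves
    elif method == "tree":
        copies = 1 if remote_nves > 0 else 0
    else:
        raise ValueError("unknown method: " + method)
    return stream_bps * copies


# Example figures (made up): a 10 Mb/s stream.
small_group = ingress_uplink_bps(10_000_000, 2, "ingress")
large_group = ingress_uplink_bps(10_000_000, 50, "ingress")
large_group_tree = ingress_uplink_bps(10_000_000, 50, "tree")
```

	       With 2 remote NVEs, ingress replication costs 20 Mb/s on the
	       uplink. With 50 remote NVEs it costs 500 Mb/s, against 10 Mb/s for
	       a multicast tree that instead requires state in the underlay.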

878	    4.2.4. Path MTU

880	       When using overlay tunneling, an outer header is added to the
881	       original frame. This can cause the MTU of the path to the egress
882	       tunnel endpoint to be exceeded.

884	       In this section, we will only consider the case of an IP overlay.

886	       It is usually not desirable to rely on IP fragmentation for
887	       performance reasons. Ideally, the interface MTU as seen by a Tenant
888	       System is adjusted such that no fragmentation is needed. TCP will
889	       adjust its maximum segment size accordingly.

891	       It is possible for the MTU to be configured manually or to be
892	       discovered dynamically. Various Path MTU discovery techniques exist
893	       in order to determine the proper MTU size to use:

895	          o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981]

897	               o Tenant Systems rely on ICMP messages to discover the MTU of
898	                 the end-to-end path to their destinations. This method is
899	                 not always possible, such as when traversing middle boxes
900	                 (e.g., firewalls) that disable ICMP for security reasons.
903	          o Extended MTU Path Discovery techniques such as defined in
904	            [RFC4821]

906	       It is also possible to rely on the NVE to perform segmentation and
907	       reassembly operations without relying on the Tenant Systems to know
908	       about the end-to-end MTU. The assumption is that some hardware
909	       assist is available on the NVE node to perform such SAR operations.
910	       However, fragmentation by the NVE can lead to performance and
911	       congestion issues due to TCP dynamics and might require new
912	       congestion avoidance mechanisms from the underlay network [FLOYD].

914	       Finally, the underlay network may be designed in such a way that the
915	       MTU can accommodate the extra tunneling and possibly additional nvo3
916	       header encapsulation overhead.
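
	       The MTU arithmetic behind these options can be sketched as follows.
	       The 20-byte IPv4 header (without options) and the 14-byte untagged
	       Ethernet header are standard sizes; the 8-byte overlay header is a
	       hypothetical value, since the size depends on the encapsulation
	       chosen by a given solution.

```python
IPV4_HEADER = 20      # outer IPv4 header without options
ETHERNET_HEADER = 14  # inner Ethernet header (untagged), for L2 overlays
OVERLAY_HEADER = 8    # hypothetical overlay header size (assumption)


def tenant_mtu(underlay_mtu, outer_ip=IPV4_HEADER,
               overlay=OVERLAY_HEADER, inner_l2=ETHERNET_HEADER):
    """Largest tenant IP packet that fits without fragmentation."""
    return underlay_mtu - outer_ip - overlay - inner_l2


def required_underlay_mtu(tenant_ip_mtu, outer_ip=IPV4_HEADER,
                          overlay=OVERLAY_HEADER, inner_l2=ETHERNET_HEADER):
    """Underlay MTU needed to carry tenant_ip_mtu unfragmented."""
    return tenant_ip_mtu + inner_l2 + overlay + outer_ip


adjusted = tenant_mtu(1500)            # option 1: lower the tenant MTU
needed = required_underlay_mtu(1500)   # option 2: raise the underlay MTU
```

	       Under these assumptions, a 1500-byte underlay MTU leaves 1458 bytes
	       for tenant IP packets, while keeping a 1500-byte tenant MTU
	       requires an underlay MTU of at least 1542 bytes.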

918	    4.2.5. NVE location trade-offs

920	       In the case of DC traffic, traffic originated from a VM is native
921	       Ethernet traffic. This traffic can be switched by a local virtual
922	       switch or ToR switch and then by a DC gateway. The NVE function can
923	       be embedded within any of these elements.

925	       There are several criteria to consider when deciding where the NVE
926	       function should happen:

928	          o Processing and memory requirements

930	              o Datapath (e.g. lookups, filtering,
931	                 encapsulation/decapsulation)

933	              o Control plane processing (e.g. routing, signaling, OAM) and
934	                 where specific control plane functions should be enabled

936	          o FIB/RIB size

938	          o Multicast support

940	              o Routing/signaling protocols

942	              o Packet replication capability

944	              o Multicast FIB

946	          o Fragmentation support

948	          o QoS support (e.g. marking, policing, queuing)

950	          o Resiliency

952	    4.2.6. Interaction between network overlays and underlays

954	       When multiple overlays co-exist on top of a common underlay network,
955	       resources (e.g., bandwidth) should be provisioned to ensure that
956	       traffic from overlays can be accommodated and QoS objectives can be
957	       met. Overlays can have partially overlapping paths (nodes and
958	       links).

960	       Each overlay is selfish by nature. It sends traffic so as to
961	       optimize its own performance without considering the impact on other
962	       overlays, unless the underlay paths are traffic engineered on a per
963	       overlay basis to avoid congestion of underlay resources.

965	       Better visibility between overlays and underlays, or generally
966	       coordination in placing overlay demand on an underlay network, can
967	       be achieved by providing mechanisms to exchange performance and
968	       liveness information between the underlay and overlay(s) or the
969	       use of such information by a coordination system. Such information
970	       may include:

972	          o Performance metrics (throughput, delay, loss, jitter)

974	          o Cost metrics

976	    5. Security Considerations

978	       Nvo3 solutions must at least consider and address the following:

980	          o Secure and authenticated communication between an NVE and an
981	            NVE management system and/or control system.

983	          o Isolation between tenant overlay networks. The use of per-
984	            tenant FIB tables (VNIs) on an NVE is essential.

986	          o Security of any protocol used to carry overlay network
987	            information.

989	          o Preventing packets from reaching the wrong VNI, especially
990	            during VM moves.

992	    6. IANA Considerations

994	       IANA does not need to take any action for this draft.

996	    7. References

998	    7.1. Normative References

1000	       [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1001	                 Requirement Levels", BCP 14, RFC 2119, March 1997.

1003	    7.2. Informative References

1005	       [NVOPS] Narten, T. et al, "Problem Statement : Overlays for Network
1006	                 Virtualization", draft-narten-nvo3-overlay-problem-
1007	                 statement (work in progress)

1009	       [OVCPREQ] Kreeger, L. et al, "Network Virtualization Overlay Control
1010	                 Protocol Requirements", draft-kreeger-nvo3-overlay-cp
1011	                 (work in progress)

1013	       [FLOYD] Sally Floyd, Allyn Romanow, "Dynamics of TCP Traffic over
1014	                 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995

1016	       [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
1017	                 Networks (VPNs)", RFC 4364, February 2006.

1019	       [RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

1021	       [RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC 1981,
1022	                 August 1996.

1024	       [RFC4821] Mathis, M. et al, "Packetization Layer Path MTU
1025	                 Discovery", RFC 4821, March 2007.

1027	    8. Acknowledgments

1029	       In addition to the authors the following people have contributed to
1030	       this document:

1032	       Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent

1034	       Lucy Yong, Huawei

1036	       This document was prepared using 2-Word-v2.0.template.dot.

1038	    Authors' Addresses

1040	       Marc Lasserre
1041	       Alcatel-Lucent
1042	       Email: marc.lasserre@alcatel-lucent.com

1044	       Florin Balus
1045	       Alcatel-Lucent
1046	       777 E. Middlefield Road
1047	       Mountain View, CA, USA 94043
1048	       Email: florin.balus@alcatel-lucent.com

1050	       Thomas Morin
1051	       France Telecom Orange
1052	       Email: thomas.morin@orange.com

1054	       Nabil Bitar
1055	       Verizon
1056	       40 Sylvan Road
1057	       Waltham, MA 02145
1058	       Email: nabil.bitar@verizon.com

1060	       Yakov Rekhter
1061	       Juniper
1062	       Email: yakov@juniper.net