idnits 2.17.1 

draft-ietf-nvo3-framework-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == The page length should not exceed 58 lines per page, but there was 25
     longer pages, the longest (page 3) being 71 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 25 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 263 instances of too long lines in the document, the longest
     one being 4 characters in excess of 72.

  == There are 4 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 5, 2014) is 3606 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)


     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	    Internet Engineering Task Force                          Marc Lasserre
3	    Internet Draft                                            Florin Balus
4	    Intended status: Informational                          Alcatel-Lucent
5	    Expires: Dec 2014
6	                                                              Thomas Morin
7	                                                     France Telecom Orange

9	                                                               Nabil Bitar
10	                                                                   Verizon

12	                                                             Yakov Rekhter
13	                                                                   Juniper

15	                                                              June 5, 2014

17	                      Framework for DC Network Virtualization
18	                         draft-ietf-nvo3-framework-07.txt

20	    Abstract

22	       This document provides a framework for Data Center (DC) Network
23	       Virtualization Overlays (NVO3) and it defines a reference model
24	       along with logical components required to design a solution.

26	    Status of this Memo

28	       This Internet-Draft is submitted in full conformance with the
29	       provisions of BCP 78 and BCP 79.

31	       Internet-Drafts are working documents of the Internet Engineering
32	       Task Force (IETF).  Note that other groups may also distribute
33	       working documents as Internet-Drafts. The list of current Internet-
34	       Drafts is at http://datatracker.ietf.org/drafts/current/.

36	       Internet-Drafts are draft documents valid for a maximum of six
37	       months and may be updated, replaced, or obsoleted by other documents
38	       at any time.  It is inappropriate to use Internet-Drafts as
39	       reference material or to cite them other than as "work in progress."

41	       This Internet-Draft will expire on Dec 5, 2014.

43	    Copyright Notice

45	       Copyright (c) 2014 IETF Trust and the persons identified as the
46	       document authors. All rights reserved.

48	       This document is subject to BCP 78 and the IETF Trust's Legal
49	       Provisions Relating to IETF Documents
50	       (http://trustee.ietf.org/license-info) in effect on the date of
51	       publication of this document. Please review these documents
52	       carefully, as they describe your rights and restrictions with
53	       respect to this document. Code Components extracted from this
54	       document must include Simplified BSD License text as described in
55	       Section 4.e of the Trust Legal Provisions and are provided without
56	       warranty as described in the Simplified BSD License.

58	    Table of Contents

60	       1. Introduction
61	          1.1. General terminology
62	          1.2. DC network architecture
63	       2. Reference Models
64	          2.1. Generic Reference Model
65	          2.2. NVE Reference Model
66	          2.3. NVE Service Types
67	             2.3.1. L2 NVE providing Ethernet LAN-like service
68	             2.3.2. L3 NVE providing IP/VRF-like service
69	          2.4. Operational Management Considerations
70	       3. Functional components
71	          3.1. Service Virtualization Components
72	             3.1.1. Virtual Access Points (VAPs)
73	             3.1.2. Virtual Network Instance (VNI)
74	             3.1.3. Overlay Modules and VN Context
75	             3.1.4. Tunnel Overlays and Encapsulation options
76	             3.1.5. Control Plane Components
77	             3.1.5.1. Distributed vs Centralized Control Plane
78	             3.1.5.2. Auto-provisioning/Service discovery
79	             3.1.5.3. Address advertisement and tunnel mapping
80	             3.1.5.4. Overlay Tunneling
81	          3.2. Multi-homing
82	          3.3. VM Mobility
83	       4. Key aspects of overlay networks
84	          4.1. Pros & Cons
85	          4.2. Overlay issues to consider
86	             4.2.1. Data plane vs Control plane driven
87	             4.2.2. Coordination between data plane and control plane
88	             4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM)
89	             traffic
90	             4.2.4. Path MTU
91	             4.2.5. NVE location trade-offs
92	             4.2.6. Interaction between network overlays and underlays
93	       5. Security Considerations
94	       6. IANA Considerations
95	       7. References
96	          7.1. Informative References
97	       8. Acknowledgments

99	    1. Introduction

101	       This document provides a framework for Data Center (DC) Network
102	       Virtualization over Layer3 (L3) tunnels. This framework is intended
103	       to aid in standardizing protocols and mechanisms to support large-
104	       scale network virtualization for data centers.

106	       [NVOPS] defines the rationale for using overlay networks in order to
107	       build large multi-tenant data center networks. Compute, storage and
108	       network virtualization are often used in these large data centers to
109	       support a large number of communication domains and end systems.

111	       This document provides reference models and functional components of
112	       data center overlay networks as well as a discussion of technical
113	       issues that have to be addressed.

115	    1.1. General terminology

117	       This document uses the following terminology:

119	       NVO3 Network: An overlay network that provides a Layer2 (L2) or
120	       Layer3 (L3) service to Tenant Systems over an L3 underlay network
121	       using the architecture and protocols as defined by the NVO3 Working
122	       Group.

124	       Network Virtualization Edge (NVE): An NVE is the network entity that
125	       sits at the edge of an underlay network and implements L2 and/or L3
126	       network virtualization functions. The network-facing side of the NVE
127	       uses the underlying L3 network to tunnel tenant frames to and from
128	       other NVEs. The tenant-facing side of the NVE sends and receives
129	       Ethernet frames to and from individual Tenant Systems.  An NVE could
130	       be implemented as part of a virtual switch within a hypervisor, a
131	       physical switch or router, a Network Service Appliance, or be split
132	       across multiple devices.

134	       Virtual Network (VN): A VN is a logical abstraction of a physical
135	       network that provides L2 or L3 network services to a set of Tenant
136	       Systems. A VN is also known as a Closed User Group (CUG).

138	       Virtual Network Instance (VNI): A specific instance of a VN from the
139	       perspective of an NVE.

141	       Virtual Network Context (VN Context) Identifier: Field in overlay
142	       encapsulation header that identifies the specific VN the packet
143	       belongs to. The egress NVE uses the VN Context identifier to deliver
144	       the packet to the correct Tenant System. The VN Context identifier
145	       can be a locally significant identifier or a globally unique
146	       identifier.

148	       Underlay or Underlying Network: The network that provides the
149	       connectivity among NVEs and over which NVO3 packets are tunneled,
150	       where an NVO3 packet carries an NVO3 overlay header followed by a
151	       tenant packet. The Underlay Network does not need to be aware that
152	       it is carrying NVO3 packets. Addresses on the Underlay Network
153	       appear as "outer addresses" in encapsulated NVO3 packets. In
154	       general, the Underlay Network can use a completely different
155	       protocol (and address family) from that of the overlay. In the case
156	       of NVO3, the underlay network is IP.

158	       Data Center (DC): A physical complex housing physical servers,
159	       network switches and routers, network service appliances and
160	       networked storage. The purpose of a Data Center is to provide
161	       application, compute and/or storage services. One such service is
162	       virtualized infrastructure data center services, also known as
163	       Infrastructure as a Service.

165	       Virtual Data Center (Virtual DC): A container for virtualized
166	       compute, storage and network services. A Virtual DC is associated
167	       with a single tenant, and can contain multiple VNs and Tenant
168	       Systems connected to one or more of these VNs.

170	       Virtual machine (VM): A software implementation of a physical
171	       machine that runs programs as if they were executing on a physical,
172	       non-virtualized machine.  Applications (generally) do not know they
173	       are running on a VM as opposed to running on a "bare metal" host or
174	       server, though some systems provide a para-virtualization
175	       environment that allows an operating system or application to be
176	       aware of the presence of virtualization for optimization purposes.

178	       Hypervisor: Software running on a server that allows multiple VMs to
179	       run on the same physical server. The hypervisor manages and provides
180	       shared compute/memory/storage and network connectivity to the VMs
181	       that it hosts. Hypervisors often embed a Virtual Switch (see below).

183	       Server: A physical end host machine that runs user applications. A
184	       standalone (or "bare metal") server runs a conventional operating
185	       system hosting a single-tenant application. A virtualized server
186	       runs a hypervisor supporting one or more VMs.

188	       Virtual Switch (vSwitch): A function within a Hypervisor (typically
189	       implemented in software) that provides similar forwarding services
190	       to a physical Ethernet switch. A vSwitch forwards Ethernet frames
191	       between VMs running on the same server, or between a VM and a
192	       physical NIC connecting the server to a physical Ethernet
193	       switch or router. A vSwitch also enforces network isolation between
194	       VMs that by policy are not permitted to communicate with each other
195	       (e.g., by honoring VLANs). A vSwitch may be bypassed when an NVE is
196	       enabled on the host server.

198	       Tenant: The customer using a virtual network and any associated
199	       resources (e.g., compute, storage and network).  A tenant could be
200	       an enterprise, or a department/organization within an enterprise.

202	       Tenant System: A physical or virtual system that can play the role
203	       of a host, or a forwarding element such as a router, switch,
204	       firewall, etc. It belongs to a single tenant and connects to one or
205	       more VNs of that tenant.

207	       Tenant Separation: Tenant Separation refers to isolating traffic of
208	       different tenants such that traffic from one tenant is not visible
209	       to or delivered to another tenant, except when allowed by policy.
210	       Tenant Separation also refers to address space separation, whereby
211	       different tenants can use the same address space without conflict.

213	       Virtual Access Points (VAPs): A logical connection point on the NVE
214	       for connecting a Tenant System to a virtual network. Tenant Systems
215	       connect to VNIs at an NVE through VAPs. VAPs can be physical ports
216	       or virtual ports identified through logical interface identifiers
217	       (e.g., VLAN ID, internal vSwitch Interface ID connected to a VM).

219	       End Device: A physical device that connects directly to the DC
220	       Underlay Network. This is in contrast to a Tenant System, which
221	       connects to a corresponding tenant VN. An End Device is administered
222	       by the DC operator rather than a tenant, and is part of the DC
223	       infrastructure. An End Device may implement NVO3 technology in
224	       support of NVO3 functions. Examples of an End Device include hosts
225	       (e.g., server or server blade), storage systems (e.g., file servers,
226	       iSCSI storage systems), and network devices (e.g., firewall, load-
227	       balancer, IPsec gateway).

229	       Network Virtualization Authority (NVA): Entity that provides
230	       reachability and forwarding information to NVEs.

232	    1.2. DC network architecture

234	       A generic architecture for Data Centers is depicted in Figure 1:

236	                                    ,---------.
237	                                  ,'           `.
238	                                 (  IP/MPLS WAN )
239	                                  `.           ,'
240	                                    `-+------+'
241	                                     \      /
242	                              +--------+   +--------+
243	                              |   DC   |+-+|   DC   |
244	                              |gateway |+-+|gateway |
245	                              +--------+   +--------+
246	                                    |       /
247	                                    .--. .--.
248	                                  (    '    '.--.
249	                                .-.' Intra-DC     '
250	                               (     network      )
251	                                (             .'-'
252	                                 '--'._.'.    )\ \
253	                                 / /     '--'  \ \
254	                                / /      | |    \ \
255	                       +--------+   +--------+   +--------+
256	                       | access |   | access |   | access |
257	                       | switch |   | switch |   | switch |
258	                       +--------+   +--------+   +--------+
259	                          /     \    /    \     /      \
260	                       __/_      \  /      \   /_      _\__
261	                 '--------'   '--------'   '--------'   '--------'
262	                 :  End   :   :  End   :   :  End   :   :  End   :
263	                 : Device :   : Device :   : Device :   : Device :
264	                 '--------'   '--------'   '--------'   '--------'

266	                 Figure 1 : A Generic Architecture for Data Centers

268	       An example of a multi-tier DC network architecture is presented in
269	       Figure 1. It provides a view of physical components inside a DC.

271	       A DC network is usually composed of intra-DC networks and network
272	       services, and inter-DC network and network connectivity services.

274	       DC networking elements can act as strict L2 switches and/or provide
275	       IP routing capabilities, including network service virtualization.

277	       In some DC architectures, some tier layers could provide L2 and/or
278	       L3 services. In addition, some tier layers may be collapsed, and
279	       Internet connectivity, inter-DC connectivity and VPN support may be
280	       handled by a smaller number of nodes. Nevertheless, one can assume
281	       that the network functional blocks in a DC fit in the architecture
282	       depicted in Figure 1.

284	       The following components can be present in a DC:

286	       - Access switch: Hardware-based Ethernet switch that aggregates all
287	          Ethernet links from the End Devices in a rack and serves as the
288	          entry point into the physical DC network for the hosts. It may also
289	          provide routing functionality, virtual IP network connectivity, or
290	          Layer2 tunneling over IP, for instance. Access switches are usually
291	          multi-homed to aggregation switches in the Intra-DC network. A
292	          typical example of an access switch is a Top of Rack (ToR) switch.
293	          Other deployment scenarios may use an intermediate Blade Switch
294	          before the ToR, or an EoR (End of Row) switch, to provide similar
295	          functions to a ToR.

297	       - Intra-DC Network: Network composed of high capacity core nodes
298	          (Ethernet switches/routers). Core nodes may provide virtual
299	          Ethernet bridging and/or IP routing services.

301	       - DC Gateway (DC GW): Gateway to the outside world providing DC
302	          Interconnect and connectivity to Internet and VPN customers. In
303	          the current DC network model, this may be simply a router
304	          connected to the Internet and/or an IP Virtual Private Network
305	          (VPN)/L2VPN PE. Some network implementations may dedicate DC GWs
306	          for different connectivity types (e.g., a DC GW for Internet, and
307	          another for VPN).

309	       Note that End Devices may be single or multi-homed to access
310	       switches.

312	    2. Reference Models

314	    2.1. Generic Reference Model

316	       Figure 2 depicts a DC reference model for network virtualization
317	       overlay where NVEs provide a logical interconnect between Tenant
318	       Systems that belong to a specific VN.

320	             +--------+                                    +--------+
321	             | Tenant +--+                            +----| Tenant |
322	             | System |  |                           (')   | System |
323	             +--------+  |    .................     (   )  +--------+
324	                         |  +---+           +---+    (_)
325	                         +--|NVE|---+   +---|NVE|-----+
326	                            +---+   |   |   +---+
327	                            / .    +-----+      .
328	                           /  . +--| NVA |--+   .
329	                          /   . |  +-----+   \  .
330	                         |    . |             \ .
331	                         |    . |   Overlay   +--+--++--------+
332	             +--------+  |    . |   Network   | NVE || Tenant |
333	             | Tenant +--+    . |             |     || System |
334	             | System |       .  \ +---+      +--+--++--------+
335	             +--------+       .....|NVE|.........
336	                                   +---+
337	                                     |
338	                                     |
339	                           =====================
340	                             |               |
341	                         +--------+      +--------+
342	                         | Tenant |      | Tenant |
343	                         | System |      | System |
344	                         +--------+      +--------+

346	          Figure 2 : Generic reference model for DC network virtualization
347	                                     overlay

349	       In order to obtain reachability information, NVEs may exchange
350	       information directly between themselves via a control plane
351	       protocol. In this case, a control plane module resides in every NVE.

353	       It is also possible for NVEs to communicate with an external Network
354	       Virtualization Authority (NVA) to obtain reachability and forwarding
355	       information. In this case, a protocol is used between NVEs and
356	       NVA(s) to exchange information.
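
       As an illustration of the NVE-to-NVA interaction described above,
       the following minimal sketch models an NVA as a lookup table from
       (VN, tenant address) to the underlay address of the serving NVE.
       The class name NVA, its methods, the VN names, and all addresses
       are hypothetical and chosen for this example only; this document
       does not define an NVE-to-NVA protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class NVA:
    """Toy NVA: maps (VN name, tenant address) to the serving NVE."""

    # Key: (vn_name, tenant_address); value: underlay IP address of the NVE.
    mappings: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def register(self, vn_name: str, tenant_addr: str, nve_ip: str) -> None:
        """Record which NVE currently serves a tenant address in a VN."""
        self.mappings[(vn_name, tenant_addr)] = nve_ip

    def resolve(self, vn_name: str, tenant_addr: str) -> Optional[str]:
        """Return the NVE underlay address, or None if unknown."""
        return self.mappings.get((vn_name, tenant_addr))


nva = NVA()
# The same tenant address may appear in two VNs without conflict,
# because every lookup is scoped by the VN.
nva.register("vn-red", "10.0.0.5", "192.0.2.1")
nva.register("vn-blue", "10.0.0.5", "192.0.2.2")
```

       Keying the table on the VN as well as the tenant address is what
       lets different tenants reuse the same address space.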

358	       It should be noted that NVAs may be organized in clusters for
359	       redundancy and scalability and can appear as one logically
360	       centralized controller. In this case, inter-NVA communication is
361	       necessary to synchronize state among nodes within a cluster or share
362	       information across clusters. The information exchanged between NVAs
363	       of the same cluster could be different from the information
364	       exchanged across clusters.

366	       A Tenant System can be attached to an NVE in several ways:

368	       - locally, by being co-located in the same End Device

370	       - remotely, via a point-to-point connection or a switched network

372	       When an NVE is co-located with a Tenant System, the state of the
373	       Tenant System can be determined without protocol assistance. For
374	       instance, the operational status of a VM can be communicated via a
375	       local API. When an NVE is remotely connected to a Tenant System, the
376	       state of the Tenant System or NVE needs to be exchanged directly or
377	       via a management entity, using a control plane protocol or API, or
378	       directly via a dataplane protocol.

380	       The functional components in Figure 2 do not necessarily map
381	       directly to the physical components described in Figure 1. For
382	       example, an End Device can be a server blade with VMs and a virtual
383	       switch. A VM can be a Tenant System and the NVE functions may be
384	       performed by the host server. In this case, the Tenant System and
385	       NVE function are co-located. Another example is the case where the
386	       End Device is the Tenant System, and the NVE function can be
387	       implemented by the connected ToR. In this case, the Tenant System
388	       and NVE function are not co-located.

390	       Underlay nodes utilize L3 technologies to interconnect NVE nodes.
391	       These nodes perform forwarding based on outer L3 header information,
392	       and generally do not maintain state for each tenant service, although
393	       some applications (e.g., multicast) may require control plane or
394	       forwarding plane information that pertains to a tenant, group of
395	       tenants, tenant service or a set of services that belong to one or
396	       more tenants. Mechanisms to control the amount of state maintained
397	       in the underlay may be needed.

399	    2.2. NVE Reference Model

401	       Figure 3 depicts the NVE reference model. One or more VNIs can be
402	       instantiated on an NVE. A Tenant System interfaces with a
403	       corresponding VNI via a VAP. An overlay module provides tunneling
404	       overlay functions (e.g., encapsulation and decapsulation of tenant
405	       traffic, tenant identification and mapping, etc.).

407	                         +-------- L3 Network -------+
408	                         |                           |
409	                         |        Tunnel Overlay     |
410	             +------------+---------+       +---------+------------+
411	             | +----------+-------+ |       | +---------+--------+ |
412	             | |  Overlay Module  | |       | |  Overlay Module  | |
413	             | +---------+--------+ |       | +---------+--------+ |
414	             |           |VN context|       | VN context|          |
415	             |           |          |       |           |          |
416	             |  +--------+-------+  |       |  +--------+-------+  |
417	             |  | |VNI|   .  |VNI|  |       |  | |VNI|   .  |VNI|  |
418	        NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2
419	             |    |   VAPs     |    |       |    |    VAPs   |     |
420	             +----+------------+----+       +----+-----------+-----+
421	                  |            |                 |           |
422	                  |            |                 |           |
423	                 Tenant Systems                 Tenant Systems

425	                      Figure 3 : Generic NVE reference model

427	       Note that some NVE functions (e.g., data plane and control plane
428	       functions) may reside in one device or may be implemented separately
429	       in different devices.
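
       The relationships in the NVE reference model can be sketched as
       simple data structures: an NVE holds one or more VNIs, and each
       VAP binds a Tenant System attachment to exactly one VNI. All
       class, field, and identifier names below (for example "vlan-100"
       and the VN Context value 5001) are assumptions made for
       illustration, not part of the framework.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VNI:
    """A specific instance of a VN on one NVE."""

    vn_name: str
    vn_context: int
    # Forwarding context: tenant address -> underlay IP of the remote NVE.
    reachability: Dict[str, str] = field(default_factory=dict)


@dataclass
class NVE:
    """Toy NVE holding its VNIs and the VAPs attached to them."""

    underlay_ip: str
    vnis: Dict[str, VNI] = field(default_factory=dict)  # vn_name -> VNI
    vaps: Dict[str, str] = field(default_factory=dict)  # vap_id -> vn_name

    def add_vni(self, vn_name: str, vn_context: int) -> VNI:
        vni = VNI(vn_name, vn_context)
        self.vnis[vn_name] = vni
        return vni

    def attach(self, vap_id: str, vn_name: str) -> None:
        """Bind a VAP (e.g., a VLAN ID or vSwitch port) to a VNI."""
        if vn_name not in self.vnis:
            raise KeyError(f"no VNI for {vn_name!r} on this NVE")
        self.vaps[vap_id] = vn_name

    def vni_for_vap(self, vap_id: str) -> VNI:
        """Find the VNI that traffic arriving on a VAP belongs to."""
        return self.vnis[self.vaps[vap_id]]


nve1 = NVE("192.0.2.1")
nve1.add_vni("vn-red", 5001)
nve1.add_vni("vn-blue", 5002)
nve1.attach("vlan-100", "vn-red")
nve1.attach("vlan-200", "vn-blue")
```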

431	    2.3. NVE Service Types

433	       An NVE provides different types of virtualized network services to
434	       multiple tenants, i.e., an L2 service or an L3 service. Note that an
435	       NVE may be capable of providing both L2 and L3 services for a
436	       tenant. This section defines the service types and associated
437	       attributes.

439	    2.3.1. L2 NVE providing Ethernet LAN-like service

441	       An L2 NVE implements Ethernet LAN emulation, an Ethernet based
442	       multipoint service similar to an IETF VPLS [RFC4761][RFC4762] or
443	       EVPN [EVPN] service, where the Tenant Systems appear to be
444	       interconnected by a LAN environment over an L3 overlay. As such, an
445	       L2 NVE provides a per-tenant virtual switching instance (L2 VNI) and
446	       L3 (IP/MPLS) tunneling encapsulation of tenant MAC frames across the
447	       underlay. Note that the control plane for an L2 NVE could be
448	       implemented locally on the NVE or in a separate control entity.

450	    2.3.2. L3 NVE providing IP/VRF-like service

452	       An L3 NVE provides a virtualized IP forwarding service, similar to
453	       IETF IP VPN (e.g., BGP/MPLS IP VPN [RFC4364]) from a service
454	       definition perspective. That is, an L3 NVE provides a per-tenant
455	       forwarding and routing instance (L3 VNI) and L3 (IP/MPLS) tunneling
456	       encapsulation of tenant IP packets across the underlay. Note that
457	       routing could be performed locally on the NVE or in a separate
458	       control entity.
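
       The difference between the two service types can be sketched by
       the lookup each VNI performs. In this hedged example, an L2 VNI
       maps a destination MAC address to a remote NVE, while an L3 VNI
       performs a longest-prefix match on the destination IP address.
       The class names L2VNI and L3VNI and all addresses are
       illustrative assumptions; how the tables are populated (data
       plane learning or control plane) is left out.

```python
import ipaddress
from typing import Dict, Optional


class L2VNI:
    """Per-tenant virtual switching instance: MAC -> remote NVE."""

    def __init__(self) -> None:
        self.mac_table: Dict[str, str] = {}

    def learn(self, mac: str, nve_ip: str) -> None:
        self.mac_table[mac.lower()] = nve_ip

    def lookup(self, dst_mac: str) -> Optional[str]:
        # None means "unknown unicast"; a real NVE would treat it as BUM.
        return self.mac_table.get(dst_mac.lower())


class L3VNI:
    """Per-tenant routing instance: IP prefix -> remote NVE."""

    def __init__(self) -> None:
        self.routes = {}  # ip_network object -> NVE underlay IP

    def add_route(self, prefix: str, nve_ip: str) -> None:
        self.routes[ipaddress.ip_network(prefix)] = nve_ip

    def lookup(self, dst_ip: str) -> Optional[str]:
        addr = ipaddress.ip_address(dst_ip)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # Longest-prefix match: prefer the most specific covering route.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


l2 = L2VNI()
l2.learn("00:00:5E:00:53:01", "192.0.2.3")

l3 = L3VNI()
l3.add_route("10.1.0.0/16", "192.0.2.1")
l3.add_route("10.1.2.0/24", "192.0.2.2")
```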

460	    2.4. Operational Management Considerations

462	       NVO3 services are overlay services over an IP underlay.

464	       As far as the IP underlay is concerned, existing IP OAM facilities
465	       are used.

467	       With regard to the NVO3 overlay, both L2 and L3 services can be
468	       offered. It is expected that existing fault and performance OAM
469	       facilities will be used. Sections 4.1 and 4.2.6 below provide
470	       further discussion of additional fault and performance management
471	       issues to consider.

473	       As far as configuration is concerned, the DC environment is driven
474	       by the need to bring new services up rapidly and is typically very
475	       dynamic, specifically in the context of virtualized services. It is
476	       therefore critical to automate the configuration of NVO3 services.

478	    3. Functional components

480	       This section decomposes the Network Virtualization architecture into
481	       functional components described in Figure 3 to make it easier to
482	       discuss solution options for these components.

484	    3.1. Service Virtualization Components

486	    3.1.1. Virtual Access Points (VAPs)

488	       Tenant Systems are connected to VNIs through Virtual Access Points
489	       (VAPs).

491	       VAPs can be physical ports or virtual ports identified through
492	       logical interface identifiers (e.g., VLAN ID, internal vSwitch
493	       Interface ID connected to a VM).

495	    3.1.2. Virtual Network Instance (VNI)

497	       A VNI is a specific VN instance on an NVE. Each VNI defines a
498	       forwarding context that contains reachability information and
499	       policies.

501	    3.1.3. Overlay Modules and VN Context

503	       Mechanisms for identifying each tenant service are required to allow
504	       the simultaneous overlay of multiple tenant services over the same
505	       underlay L3 network topology. In the data plane, each NVE, upon
506	       sending a tenant packet, must be able to encode the VN Context for
507	       the destination NVE in addition to the L3 tunneling information
508	       (e.g., source IP address identifying the source NVE and the
509	       destination IP address identifying the destination NVE, or MPLS
510	       label). This allows the destination NVE to identify the tenant
511	       service instance and therefore appropriately process and forward the
512	       tenant packet.
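
       The encoding step described above can be sketched as follows.
       The example assumes a hypothetical 4-byte overlay header whose
       upper 24 bits carry the VN Context and whose lower 8 bits are
       reserved; this layout, the OuterPacket class, and the function
       names are invented for illustration and do not correspond to any
       particular encapsulation.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

# Hypothetical overlay header: one 32-bit word in network byte order.
HEADER = struct.Struct("!I")


@dataclass
class OuterPacket:
    """Tenant frame plus overlay header, addressed NVE to NVE."""

    src_nve: str   # outer source address (ingress NVE)
    dst_nve: str   # outer destination address (egress NVE)
    payload: bytes  # overlay header followed by the tenant frame


def encapsulate(src_nve: str, dst_nve: str, vn_context: int,
                tenant_frame: bytes) -> OuterPacket:
    """Ingress NVE: prepend the VN Context and add outer addresses."""
    if not 0 <= vn_context < 2 ** 24:
        raise ValueError("VN Context must fit in 24 bits in this sketch")
    header = HEADER.pack(vn_context << 8)  # low 8 bits stay reserved (0)
    return OuterPacket(src_nve, dst_nve, header + tenant_frame)


def decapsulate(packet: OuterPacket) -> Tuple[int, bytes]:
    """Egress NVE: recover the VN Context and the tenant frame."""
    (word,) = HEADER.unpack_from(packet.payload)
    return word >> 8, packet.payload[HEADER.size:]


pkt = encapsulate("192.0.2.1", "192.0.2.2", 5001, b"tenant-frame")
vn_context, frame = decapsulate(pkt)
```

       The egress NVE would then use the recovered VN Context to select
       the VNI in which the tenant frame is processed and forwarded.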

514	       The Overlay module provides tunneling overlay functions: tunnel
515	       initiation/termination as in the case of stateful tunnels (see
516	       Section 3.1.4), and/or simply encapsulation/decapsulation of frames
517	       from VAPs/L3 underlay.

519	       In a multi-tenant context, tunneling aggregates frames from/to
520	       different VNIs. Tenant identification and traffic demultiplexing are
521	       based on the VN Context identifier.

523	       The following approaches can be considered:

525	       - VN Context identifier per Tenant: Globally unique (on a per-DC
526	          administrative domain) VN identifier used to identify the
527	          corresponding VNI. Examples of such identifiers in existing
528	          technologies are IEEE VLAN IDs and ISID tags that identify virtual
529	          L2 domains when using IEEE 802.1Q and IEEE 802.1ah, respectively.
530	          Note that multiple VN identifiers can belong to a tenant.

532	       - One VN Context identifier per VNI: Each VNI value is automatically
533	          generated by the egress NVE, or a control plane associated with
534	          that NVE, and usually distributed by a control plane protocol to
535	          all the related NVEs. An example of this approach is the use of
536	          per VRF MPLS labels in IP VPN [RFC4364]. The VNI value is
537	          therefore locally significant to the egress NVE.

539	       - One VN Context identifier per VAP: A value locally significant to
540	          an NVE is assigned and usually distributed by a control plane
541	          protocol to identify a VAP. An example of this approach is the use
542	          of per CE-PE MPLS labels in IP VPN [RFC4364].

544	       Note that when using one VN Context per VNI or per VAP, an
545	       additional global identifier (e.g., a VN identifier or name) may be
546	       used by the control plane to identify the Tenant context.
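
	       As a non-normative illustration, the locally significant scopes
	       described above (one value per VNI, one value per VAP) can be
	       sketched as follows. The names EgressNve, allocate and demux, and
	       the starting value 1000, are assumptions made for this example
	       only; they are not defined by this framework.

```python
class EgressNve:
    """Hypothetical egress NVE that hands out locally significant values."""

    def __init__(self, first_value=1000):
        self._next = first_value      # arbitrary starting point (assumption)
        self.by_context = {}          # VN Context value -> (VNI, VAP or None)

    def allocate(self, vni, vap=None):
        """Allocate one value per VNI (vap=None) or one value per VAP."""
        value = self._next
        self._next += 1
        self.by_context[value] = (vni, vap)
        return value

    def demux(self, vn_context):
        """Resolve a received VN Context to the local (VNI, VAP) it names."""
        return self.by_context.get(vn_context)


nve_a, nve_b = EgressNve(), EgressNve()
red = nve_a.allocate("vni-red")                  # per-VNI value on NVE A
blue = nve_b.allocate("vni-blue")                # per-VNI value on NVE B
port = nve_a.allocate("vni-red", vap="vap-7")    # per-VAP value on NVE A
```

	       Because each egress NVE allocates independently in this sketch,
	       the same value can name different VNIs on different NVEs. This
	       is why an ingress NVE has to encode the value that is meaningful
	       to the destination NVE.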

548	    3.1.4. Tunnel Overlays and Encapsulation options

550	       Once the VN Context identifier is added to the frame, an L3 Tunnel
551	       encapsulation is used to transport the frame to the destination NVE.

553	       Different IP tunneling options (e.g., GRE, L2TP, IPsec) and MPLS
554	       tunneling can be used. Tunneling could be stateless or stateful.
555	       Stateless tunneling simply entails the encapsulation of a tenant
556	       packet with another header necessary for forwarding the packet
557	       across the underlay (e.g., IP tunneling over an IP underlay).
558	       Stateful tunneling on the other hand entails maintaining tunneling
559	       state at the tunnel endpoints (i.e., NVEs). Tenant packets on an
560	       ingress NVE can then be transmitted over such tunnels to a
561	       destination (egress) NVE by encapsulating the packets with a
562	       corresponding tunneling header. The tunneling state at the endpoints
563	       may be configured or dynamically established. Solutions should
564	       specify the tunneling technology used and whether it is stateful or
565	       stateless. In this document, however, tunneling and tunneling
566	       encapsulation are used interchangeably to simply mean the
567	       encapsulation of a tenant packet with a tunneling header necessary
568	       to carry the packet between an ingress NVE and an egress NVE across
569	       the underlay. It should be noted that stateful tunneling, especially
570	       when configuration is involved, does impose management overhead and
571	       scale constraints. When confidentiality is required,
572	       opportunistic encryption can be used as a stateless tunneling
573	       solution.
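
	       A minimal sketch of stateless encapsulation and decapsulation is
	       given below. The 12-octet outer header (source NVE address,
	       destination NVE address, VN Context) is a hypothetical layout
	       chosen for illustration; it is not the format of GRE, L2TP,
	       IPsec, MPLS, or any other real encapsulation. The addresses are
	       taken from the 192.0.2.0/24 documentation range.

```python
import ipaddress
import struct

# Hypothetical outer header: source NVE, destination NVE, VN Context.
HEADER = struct.Struct("!4s4sI")


def encapsulate(src_nve, dst_nve, vn_context, tenant_frame):
    """Prepend the outer header; no per-tunnel state is kept."""
    outer = HEADER.pack(
        ipaddress.IPv4Address(src_nve).packed,
        ipaddress.IPv4Address(dst_nve).packed,
        vn_context,
    )
    return outer + tenant_frame


def decapsulate(packet):
    """Strip the outer header and return its fields with the tenant frame."""
    src, dst, vn_context = HEADER.unpack_from(packet)
    return (
        str(ipaddress.IPv4Address(src)),
        str(ipaddress.IPv4Address(dst)),
        vn_context,
        packet[HEADER.size:],
    )


packet = encapsulate("192.0.2.1", "192.0.2.2", 5001, b"tenant-frame")
fields = decapsulate(packet)
```

	       Each packet carries everything the egress NVE needs, so neither
	       endpoint has to maintain tunnel state in this model.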

575	    3.1.5. Control Plane Components

577	    3.1.5.1. Distributed vs Centralized Control Plane

579	       A control/management plane entity can be centralized or distributed.
580	       Both approaches have been used extensively in the past. The routing
581	       model of the Internet is a good example of a distributed approach.
582	       Transport networks have usually used a centralized approach to
583	       manage transport paths.

585	       It is also possible to combine the two approaches, i.e., using a
586	       hybrid model. A global view of network state can have many benefits
587	       but it does not preclude the use of distributed protocols within the
588	       network. Centralized models provide a facility to maintain global
589	       state, and distribute that state to the network. When used in
590	       combination with distributed protocols, greater network
591	       efficiencies, improved reliability and robustness can be achieved.
592	       Domain and/or deployment specific constraints define the balance
593	       between centralized and distributed approaches.

595	    3.1.5.2. Auto-provisioning/Service discovery

597	       NVEs must be able to identify the appropriate VNI for each Tenant
598	       System. This is based on state information that is often provided by
599	       external entities. For example, in an environment where a VM is a
600	       Tenant System, this information is provided by VM orchestration
601	       systems, since these are the only entities that have visibility of
602	       which VM belongs to which tenant.

604	       A mechanism for communicating this information to the NVE is
605	       required. VAPs have to be created and mapped to the appropriate VNI.
606	       Depending upon the implementation, this control interface can be
607	       implemented using an auto-discovery protocol between Tenant Systems
608	       and their local NVE or through management entities. In either case,
609	       appropriate security and authentication mechanisms to verify that
610	       Tenant System information is not spoofed or altered are required.
611	       This is one critical aspect for providing integrity and tenant
612	       isolation in the system.
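
	       One possible shape for such a control interface is sketched
	       below. It assumes, purely for illustration, that the
	       orchestration system and the NVE share a secret key and that
	       each attachment request carries an HMAC-SHA256 tag. The names
	       sign, Nve and attach, and the VAP naming scheme, are
	       hypothetical; this framework does not mandate any particular
	       mechanism.

```python
import hashlib
import hmac


def sign(key, tenant_system, vni):
    """Tag computed by the (hypothetical) orchestration system."""
    message = f"{tenant_system}|{vni}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class Nve:
    def __init__(self, key):
        self.key = key
        self.vap_to_vni = {}          # VAP name -> VNI name

    def attach(self, tenant_system, vni, tag):
        """Create a VAP and map it to the VNI only if the tag verifies."""
        expected = sign(self.key, tenant_system, vni)
        if not hmac.compare_digest(tag, expected):
            return None               # spoofed or altered request
        vap = f"vap-{len(self.vap_to_vni) + 1}"
        self.vap_to_vni[vap] = vni
        return vap


key = b"shared-secret"                # assumption: pre-shared key
nve = Nve(key)
tag = sign(key, "vm-42", "vni-red")
accepted = nve.attach("vm-42", "vni-red", tag)
rejected = nve.attach("vm-42", "vni-blue", tag)   # tag covers another VNI
```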

614	       NVEs may learn reachability information to VNIs on other NVEs via a
615	       control protocol that exchanges such information among NVEs, or via
616	       a management control entity.

618	    3.1.5.3. Address advertisement and tunnel mapping

620	       As traffic reaches an ingress NVE on a VAP, a lookup is performed to
621	       determine which NVE or local VAP the packet needs to be sent to. If
622	       the packet is to be sent to another NVE, the packet is encapsulated
623	       with a tunnel header containing the destination information
624	       (destination IP address or MPLS label) of the egress NVE.
625	       Intermediate nodes (between the ingress and egress NVEs) switch or
626	       route traffic based upon the tunnel destination information.

628	       A key step in the above process consists of identifying the
629	       destination NVE the packet is to be tunneled to. NVEs are
630	       responsible for maintaining a set of forwarding or mapping tables
631	       that hold the bindings between destination VM and egress NVE
632	       addresses. Several ways of populating these tables are possible:
633	       control plane driven, management plane driven, or data plane driven.
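
	       The lookup described above can be sketched as a table keyed by
	       VNI and tenant address. The class and method names are
	       hypothetical, and the MAC and IP addresses are documentation
	       values; how entries are installed (control plane, management
	       plane, or data plane) is deliberately left outside the sketch.

```python
class MappingTable:
    """Hypothetical forwarding/mapping table held by an ingress NVE."""

    def __init__(self):
        # (VNI, tenant address) -> ("vap", name) or ("nve", address)
        self.entries = {}

    def install(self, vni, tenant_addr, kind, target):
        # Called by whichever source populates the table.
        self.entries[(vni, tenant_addr)] = (kind, target)

    def lookup(self, vni, tenant_addr):
        """Return the local VAP or egress NVE, or ("unknown", None)."""
        return self.entries.get((vni, tenant_addr), ("unknown", None))


table = MappingTable()
table.install("vni-red", "00:00:5e:00:53:01", "vap", "vap-1")
table.install("vni-red", "00:00:5e:00:53:02", "nve", "192.0.2.20")
# The same tenant address in another VNI maps independently.
table.install("vni-blue", "00:00:5e:00:53:02", "nve", "192.0.2.30")

local = table.lookup("vni-red", "00:00:5e:00:53:01")
remote = table.lookup("vni-red", "00:00:5e:00:53:02")
other = table.lookup("vni-blue", "00:00:5e:00:53:02")
miss = table.lookup("vni-red", "00:00:5e:00:53:99")
```

	       Keying the table on the VNI as well as the tenant address is
	       what allows tenant address spaces to overlap.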

635	       When a control plane protocol is used to distribute address
636	       reachability and tunneling information, the auto-
637	       provisioning/Service discovery could be accomplished by the same
638	       protocol. In this scenario, the auto-provisioning/Service discovery
639	       could be combined with (be inferred from) the address advertisement
640	       and associated tunnel mapping. Furthermore, a control plane protocol
641	       that carries both MAC and IP addresses eliminates the need for ARP,
642	       and hence addresses one of the issues with explosive ARP handling as
643	       discussed in [RFC6820].

645	    3.1.5.4. Overlay Tunneling

647	       For overlay tunneling, and dependent upon the tunneling technology
648	       used for encapsulating the Tenant System packets, it may be
649	       sufficient to have one or more local NVE addresses assigned and used
650	       in the source and destination fields of a tunneling encapsulation
651	       header. Other information that is part of the
652	       tunneling encapsulation header may also need to be configured. In
653	       certain cases, local NVE configuration may be sufficient while in
654	       other cases, some tunneling related information may need to
655	       be shared among NVEs. The information that needs to be shared will
656	       be technology dependent. For instance, potential information could
657	       include tunnel identity, encapsulation type, and/or tunnel
658	       resources. In certain cases, such as when using IP multicast in the
659	       underlay, tunnels which interconnect NVEs may need to be
660	       established. When tunneling information needs to be exchanged or
661	       shared among NVEs, a control plane protocol may be required. For
662	       instance, it may be necessary to provide active/standby status
663	       information between NVEs, up/down status information,
664	       pruning/grafting information for multicast tunnels, etc.

666	       In addition, a control plane may be required to set up the tunnel
667	       path for some tunneling technologies. This applies to both unicast
668	       and multicast tunneling.

670	    3.2. Multi-homing

672	       Multi-homing techniques can be used to increase the reliability of
673	       an NVO3 network. It is also important to ensure that physical
674	       diversity in an NVO3 network is taken into account to avoid single
675	       points of failure.

677	       Multi-homing can be enabled in various nodes, from Tenant Systems
678	       into ToRs, ToRs into core switches/routers, and core nodes into DC
679	       GWs.

681	       The NVO3 underlay nodes (i.e., from NVEs to DC GWs) rely on IP
682	       routing techniques or on MPLS re-routing capabilities as the means
683	       to re-route traffic upon failures.

685	       When a Tenant System is co-located with the NVE, the Tenant System
686	       is effectively single homed to the NVE via a virtual port. When the
687	       Tenant System and the NVE are separated, the Tenant System is
688	       connected to the NVE via a logical Layer 2 (L2) construct such as a
689	       VLAN and it can be multi-homed to various NVEs. An NVE may provide
690	       an L2 service to the end system or an L3 service. An NVE may be
691	       multi-homed to a next layer in the DC at Layer 2 (L2) or Layer 3
692	       (L3). When an NVE provides an L2 service and is not co-located with
693	       the end system, loop avoidance techniques must be used. Similarly,
694	       when the NVE provides an L3 service, dual-homing techniques can
695	       be used. When the NVE provides an L3 service to the end system, it is
696	       possible that no dynamic routing protocol is enabled between the end
697	       system and the NVE. The end system can be multi-homed to
698	       multiple physically-separated L3 NVEs over multiple interfaces. When
699	       one of the links connected to an NVE fails, the other interfaces can
700	       be used to reach the end system.

702	       External connectivity from a DC can be handled by two or more DC
703	       gateways. Each gateway provides access to external networks such as
704	       VPNs or the Internet. A gateway may be connected to two or more edge
705	       nodes in the external network for redundancy. When a connection to
706	       an upstream node is lost, the alternative connection is used and the
707	       failed route withdrawn.

709	    3.3. VM Mobility

711	       In DC environments utilizing VM technologies, an important feature
712	       is that VMs can move from one server to another server in the same
713	       or different L2 physical domains (within or across DCs) in a
714	       seamless manner.

716	       A VM can be moved from one server to another in stopped or suspended
717	       state ("cold" VM mobility) or in running/active state ("hot" VM
718	       mobility). With "hot" mobility, VM L2 and L3 addresses need to be
719	       preserved. With "cold" mobility, it may be desired to preserve at
720	       least VM L3 addresses.

722	       Solutions to maintain connectivity while a VM is moved are necessary
723	       in the case of "hot" mobility. This implies that connectivity among
724	       VMs is preserved. For instance, for L2 VNs, ARP caches are updated
725	       accordingly.

727	       Upon VM mobility, NVE policies that define connectivity among VMs
728	       must be maintained.

730	       During VM mobility, it is expected that the path to the VM's default
731	       gateway assures adequate QoS to VM applications, i.e. QoS that
732	       matches the expected service level agreement for these applications.

734	    4. Key aspects of overlay networks

736	       The intent of this section is to highlight specific issues that
737	       proposed overlay solutions need to address.

739	    4.1. Pros & Cons

741	       An overlay network is a layer of virtual network topology on top of
742	       the physical network.

744	       Overlay networks offer the following key advantages:

746	          - Unicast tunneling state management and association of Tenant
747	            Systems reachability are handled at the edge of the network (at
748	            the NVE). Intermediate transport nodes are unaware of such
749	            state. Note that when multicast is enabled in the underlay
750	            network to build multicast trees for tenant VNs, there would be
751	            more state related to tenants in the underlay core network.

753	          - Tunneling is used to aggregate traffic and hide tenant
754	            addresses from the underlay network, and hence offer the
755	            advantage of minimizing the amount of forwarding state required
756	            within the underlay network.

758	          - Decoupling of the overlay addresses (MAC and IP) used by VMs
759	            from the underlay network for tenant separation and separation
760	            of the tenant address spaces from the underlay address space.

762	          - Support of a large number of virtual network identifiers.

764	       Overlay networks also create several challenges:

766	          - Overlay networks typically have no control of underlay networks
767	            and lack underlay network information (e.g. underlay
768	            utilization):

770	            - Overlay networks and/or their associated management entities
771	               typically probe the network to measure link or path
772	               properties, such as available bandwidth or packet loss rate.
773	               It is difficult to accurately evaluate network properties. It
774	               might be preferable for the underlay network to expose usage
775	               and performance information.
776	            - Miscommunication or lack of coordination between overlay and
777	               underlay networks can lead to an inefficient usage of network
778	               resources.
779	            - When multiple overlays co-exist on top of a common underlay
780	               network, the lack of coordination between overlays can lead
781	               to performance issues and/or resource usage inefficiencies.

783	          - Traffic carried over an overlay might fail to traverse
784	            firewalls and NAT devices.

786	          - Multicast service scalability: Multicast support may be
787	            required in the underlay network to address tenant flood
788	            containment or efficient multicast handling. The underlay may
789	            also be required to maintain multicast state on a per-tenant
790	            basis, or even on a per-individual multicast flow of a given
791	            tenant. Ingress replication at the NVE eliminates that
792	            additional multicast state in the underlay core, but depending
793	            on the multicast traffic volume, it may cause inefficient use
794	            of bandwidth.

796	    4.2. Overlay issues to consider

798	    4.2.1. Data plane vs Control plane driven

800	       In the case of an L2 NVE, it is possible to dynamically learn MAC
801	       addresses against VAPs. It is also possible that such addresses be
802	       known and controlled via management or a control protocol for both
803	       L2 NVEs and L3 NVEs. Dynamic data plane learning implies that
804	       flooding of unknown destinations be supported and hence implies that
805	       broadcast and/or multicast be supported or that ingress replication
806	       be used as described in section 4.2.3. Multicasting in the underlay
807	       network for dynamic learning may lead to significant scalability
808	       limitations. Specific forwarding rules must be enforced to prevent
809	       loops from happening. This can be achieved using a spanning tree, a
810	       shortest path tree, or a split-horizon mesh.
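
	       The following sketch illustrates dynamic data plane learning on
	       an L2 NVE with ingress replication and a split-horizon rule. It
	       is a simplified model under stated assumptions: a single VNI, a
	       full mesh of tunnels between NVEs, and hypothetical names
	       (L2Nve, receive) that are not defined by this framework.

```python
class L2Nve:
    def __init__(self, vaps, remote_nves):
        self.vaps = list(vaps)
        self.remote_nves = list(remote_nves)
        self.mac_table = {}     # MAC -> ("vap", name) or ("tunnel", NVE)

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source, then return the list of output ports."""
        self.mac_table[src_mac] = in_port
        known = self.mac_table.get(dst_mac)
        if known is not None:
            # Split horizon: a frame received from a tunnel is never sent
            # into another tunnel, and no frame returns to its input port.
            tunnel_to_tunnel = known[0] == "tunnel" and in_port[0] == "tunnel"
            if known == in_port or tunnel_to_tunnel:
                return []
            return [known]
        # Unknown destination: flood to the other local VAPs ...
        out = [("vap", v) for v in self.vaps if ("vap", v) != in_port]
        # ... and replicate to remote NVEs only for locally received frames.
        if in_port[0] == "vap":
            out += [("tunnel", n) for n in self.remote_nves]
        return out


nve = L2Nve(["vap1", "vap2"], ["nve-b", "nve-c"])
first = nve.receive("mac-a", "mac-b", ("vap", "vap1"))      # flooded
second = nve.receive("mac-b", "mac-a", ("tunnel", "nve-b"))  # known unicast
third = nve.receive("mac-c", "mac-d", ("tunnel", "nve-c"))   # local flood only
```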

812	       It should be noted that the amount of state to be distributed is
813	       dependent upon network topology and the number of virtual machines.
814	       Different forms of caching can also be utilized to minimize state
815	       distribution between the various elements. The control plane should
816	       not require an NVE to maintain the locations of all the Tenant
817	       Systems whose VNs are not present on the NVE. The use of a control
818	       plane does not imply that the data plane on NVEs has to maintain all
819	       the forwarding state in the control plane.

821	    4.2.2. Coordination between data plane and control plane

823	       For an L2 NVE, the NVE needs to be able to determine MAC addresses
824	       of the Tenant Systems connected via a VAP. This can be achieved via
825	       dataplane learning or a control plane. For an L3 NVE, the NVE needs
826	       to be able to determine IP addresses of the Tenant Systems connected
827	       via a VAP.

829	       In both cases, coordination with the NVE control protocol is needed
830	       such that when the NVE determines that the set of addresses behind a
831	       VAP has changed, it triggers the NVE control plane to distribute
832	       this information to its peers.
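
	       This trigger can be sketched as a comparison between the
	       previously known and the current address set of a VAP. The
	       class VapAddressTracker and the advertise callback are
	       hypothetical stand-ins for the NVE control plane; the sketch
	       does not describe any specific protocol.

```python
class VapAddressTracker:
    def __init__(self, advertise):
        self.advertise = advertise    # callback(vap, added, withdrawn)
        self.addresses = {}           # VAP -> set of tenant addresses

    def update(self, vap, current):
        """Record the address set of a VAP; notify peers only on change."""
        current = set(current)
        previous = self.addresses.get(vap, set())
        added = current - previous
        withdrawn = previous - current
        changed = bool(added or withdrawn)
        if changed:
            self.addresses[vap] = current
            self.advertise(vap, added, withdrawn)
        return changed


events = []
tracker = VapAddressTracker(
    lambda vap, added, withdrawn: events.append(
        (vap, sorted(added), sorted(withdrawn))
    )
)
r1 = tracker.update("vap-1", {"mac-a"})   # new address: advertised
r2 = tracker.update("vap-1", {"mac-a"})   # no change: nothing sent
r3 = tracker.update("vap-1", {"mac-b"})   # one added, one withdrawn
```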

834	    4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

836	       There are several options to support packet replication needed for
837	       broadcast, unknown unicast and multicast. Typical methods include:

839	       - Ingress replication
840	       - Use of underlay multicast trees

842	       There is a bandwidth vs state trade-off between the two approaches.
843	       Depending upon the degree of replication required (i.e. the number
844	       of hosts per group) and the amount of multicast state to maintain,
845	       trading bandwidth for state should be considered.

847	       When the number of hosts per group is large, the use of underlay
848	       multicast trees may be more appropriate. When the number of hosts is
849	       small (e.g. 2-3) and/or the amount of multicast traffic is small,
850	       ingress replication may not be an issue.
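
	       The bandwidth side of this trade-off can be approximated with
	       simple arithmetic. The model below is an assumption made for
	       illustration: with ingress replication the ingress NVE sends
	       one copy per remote NVE that has receivers, whereas with an
	       underlay multicast tree it sends a single copy and the underlay
	       replicates. The function names are hypothetical.

```python
def ingress_copies(remote_nves, underlay_multicast):
    """Copies of each BUM frame sent by the ingress NVE."""
    if remote_nves == 0:
        return 0
    return 1 if underlay_multicast else remote_nves


def ingress_bytes(frame_bytes, frames, remote_nves, underlay_multicast):
    """Total bytes the ingress NVE puts on its uplink for a BUM flow."""
    copies = ingress_copies(remote_nves, underlay_multicast)
    return frame_bytes * frames * copies


# 10 frames of 1000 bytes towards 3 remote NVEs with receivers.
replicated = ingress_bytes(1000, 10, 3, underlay_multicast=False)
multicast = ingress_bytes(1000, 10, 3, underlay_multicast=True)
```

	       In this model ingress replication costs three times the uplink
	       bandwidth but requires no multicast state in the underlay,
	       while the multicast tree saves bandwidth at the cost of that
	       state.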

852	       Depending upon the size of the data center network and hence the
853	       number of (S,G) entries, and also the duration of multicast flows,
854	       the use of underlay multicast trees can be a challenge.

856	       When flows are well known, it is possible to pre-provision such
857	       multicast trees. However, it is often difficult to predict
858	       application flows ahead of time, and hence programming of (S,G)
859	       entries for short-lived flows could be impractical.

861	       A possible trade-off is to use shared multicast trees in the
862	       underlay, as opposed to dedicated multicast trees.

864	    4.2.4. Path MTU

866	       When using overlay tunneling, an outer header is added to the
867	       original frame. This can cause the MTU of the path to the egress
868	       tunnel endpoint to be exceeded.

870	       It is usually not desirable to rely on IP fragmentation for
871	       performance reasons. Ideally, the interface MTU as seen by a Tenant
872	       System is adjusted such that no fragmentation is needed.

874	       It is possible for the MTU to be configured manually or to be
875	       discovered dynamically. Various Path MTU discovery techniques exist
876	       in order to determine the proper MTU size to use:

878	       - Classical ICMP-based Path MTU Discovery [RFC1191] [RFC1981]

880	         - Tenant Systems rely on ICMP messages to discover the MTU of the
881	            end-to-end path to their destinations. This method is not always
882	            possible, such as when traversing middle boxes (e.g. firewalls)
883	            which disable ICMP for security reasons.

885	       - Extended Path MTU Discovery techniques such as defined in
886	          [RFC4821]

888	         - Tenant Systems send probe packets of different sizes, and rely
889	            on confirmation of receipt or lack thereof from receivers to
890	            allow a sender to discover the MTU of the end-to-end paths.

892	       While it could also be possible to rely on the NVE to perform
893	       segmentation and reassembly operations without relying on the Tenant
894	       Systems to know about the end-to-end MTU, this would lead to
895	       undesired performance and congestion issues and would significantly
896	       increase the complexity of hardware NVEs, which would need buffering
897	       and reassembly logic.

899	       Preferably, the underlay network should be designed in such a way
900	       that the MTU can accommodate the extra tunneling and possibly
901	       additional NVO3 header encapsulation overhead.
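
	       As a worked example, the MTU relationship can be computed as
	       shown below. The overhead values are illustrative assumptions:
	       a 20-octet outer IPv4 header without options, an 8-octet UDP
	       header, a hypothetical 8-octet overlay header, and a 14-octet
	       untagged inner Ethernet header carried for an L2 service. Real
	       encapsulations differ, so the list should be replaced with the
	       values of the encapsulation actually deployed.

```python
def tenant_mtu(underlay_mtu, overhead_bytes):
    """Largest tenant IP packet that fits without fragmentation."""
    mtu = underlay_mtu - sum(overhead_bytes)
    if mtu <= 0:
        raise ValueError("encapsulation overhead exceeds the underlay MTU")
    return mtu


def required_underlay_mtu(tenant_mtu_bytes, overhead_bytes):
    """Underlay MTU needed so tenants can keep a given MTU."""
    return tenant_mtu_bytes + sum(overhead_bytes)


# Illustrative overhead: outer IPv4, UDP, hypothetical overlay header,
# inner Ethernet header (all values are assumptions for this example).
OVERHEAD = [20, 8, 8, 14]

reduced = tenant_mtu(1500, OVERHEAD)              # 1500 - 50
needed = required_underlay_mtu(1500, OVERHEAD)    # 1500 + 50
```

	       The first result corresponds to adjusting the interface MTU
	       seen by the Tenant System; the second corresponds to designing
	       the underlay to absorb the overhead.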

903	    4.2.5. NVE location trade-offs

905	       In the case of DC traffic, traffic originated from a VM is native
906	       Ethernet traffic. This traffic can be switched by a local virtual
907	       switch or ToR switch and then by a DC gateway. The NVE function can
908	       be embedded within any of these elements.

910	       There are several criteria to consider when deciding where the NVE
911	       function should happen:

913	       - Processing and memory requirements

915	         - Datapath (e.g. lookups, filtering, encapsulation/decapsulation)

917	         - Control plane processing (e.g. routing, signaling, OAM) and
918	            where specific control plane functions should be enabled

920	       - FIB/RIB size

922	       - Multicast support

924	         - Routing/signaling protocols

926	         - Packet replication capability

928	         - Multicast FIB

930	       - Fragmentation support
931	       - QoS support (e.g. marking, policing, queuing)

933	       - Resiliency

935	    4.2.6. Interaction between network overlays and underlays

937	       When multiple overlays co-exist on top of a common underlay network,
938	       resources (e.g., bandwidth) should be provisioned to ensure that
939	       traffic from overlays can be accommodated and QoS objectives can be
940	       met. Overlays can have partially overlapping paths (nodes and
941	       links).

943	       Each overlay is selfish by nature. It sends traffic so as to
944	       optimize its own performance without considering the impact on other
945	       overlays, unless the underlay paths are traffic engineered on a per
946	       overlay basis to avoid congestion of underlay resources.

948	       Better visibility between overlays and underlays, or generally
949	       coordination in placing overlay demand on an underlay network, may
950	       be achieved by providing mechanisms to exchange performance and
951	       liveness information between the underlay and overlay(s) or the
952	       use of such information by a coordination system. Such information
953	       may include:

955	       - Performance metrics (throughput, delay, loss, jitter)

957	       - Cost metrics

959	       Metrics of this kind are discussed in [RFC2330].

961	    5. Security Considerations

963	       Since NVEs and NVAs play a central role in NVO3, it is critical that
964	       secure access to NVEs and NVAs be ensured such that no
965	       unauthorized access is possible.

967	       As discussed in Section 3.1.5.2, Tenant System identification is
968	       based upon state that is often provided by management systems (e.g.
969	       a VM orchestration system in a virtualized environment). Secure
970	       access to such management systems must also be ensured.

972	       When an NVE receives data from a Tenant System, the tenant identity
973	       needs to be verified in order to guarantee that it is authorized to
974	       access the corresponding VN. This can be achieved by identifying
975	       incoming packets against specific VAPs in some cases. In other
976	       circumstances, authentication may be necessary.

978	       Data integrity can be assured if authorized access to NVEs, NVAs,
979	       and intermediate underlay nodes is ensured. Otherwise, encryption
980	       must be used.

982	       NVO3 provides data confidentiality through data separation. The use
983	       of both VNIs and tunneling of tenant traffic by NVEs ensures that
984	       NVO3 data is kept in a separate context and thus separated from
985	       other tenant traffic. When NVO3 data traverses untrusted networks,
986	       data encryption may be needed.

988	       Not only tenant data but also NVO3 control data must be secured
989	       (e.g. control traffic between NVAs and NVEs, between NVAs and
990	       between NVEs).

992	       It may also be desirable to restrict the types of information that
993	       can be exchanged between overlays and underlays (e.g. topology
994	       information).

996	    6. IANA Considerations

998	       IANA does not need to take any action for this draft.

1000	    7. References

1002	    7.1. Informative References

1004	       [NVOPS]  Narten, T. et al, "Problem Statement: Overlays for
1005	                 Network Virtualization", draft-ietf-nvo3-overlay-problem-
1006	                 statement (work in progress).

1008	       [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
1009	                 Networks (VPNs)", RFC 4364, February 2006.

1011	       [RFC4761] Kompella, K. et al, "Virtual Private LAN Service (VPLS)
1012	                 Using BGP for Auto-Discovery and Signaling", RFC 4761,
1013	                 January 2007.

1015	       [RFC4762] Lasserre, M. et al, "Virtual Private LAN Service (VPLS)
1016	                 Using Label Distribution Protocol (LDP) Signaling",
1017	                 RFC 4762, January 2007.

1019	       [EVPN]  Sajassi, A. et al, "BGP MPLS Based Ethernet VPN", draft-
1020	                 ietf-l2vpn-evpn (work in progress).

1022	       [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
1023	                 November 1990.
1024	       [RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC 1981,
1025	                 August 1996.

1027	       [RFC4821] Mathis, M. et al, "Packetization Layer Path MTU
1028	                 Discovery", RFC 4821, March 2007.

1030	       [RFC6820] Narten, T. et al, "Address Resolution Problems in Large
1031	                 Data Center Networks", RFC 6820, January 2013.

1033	       [RFC2330] Paxson, V. et al, "Framework for IP Performance Metrics",
1034	                 RFC 2330, May 1998.

1036	    8. Acknowledgments

1038	       In addition to the authors, the following people have contributed to
1039	       this document:

1041	       Dimitrios Stiliadis, Rotem Salomonovitch, Lucy Yong, Thomas Narten,
1042	       Larry Kreeger, David Black.

1044	       This document was prepared using 2-Word-v2.0.template.dot.

1046	    Authors' Addresses

1048	       Marc Lasserre
1049	       Alcatel-Lucent
1050	       Email: marc.lasserre@alcatel-lucent.com

1052	       Florin Balus
1053	       Alcatel-Lucent
1054	       777 E. Middlefield Road
1055	       Mountain View, CA, USA 94043
1056	       Email: florin.balus@alcatel-lucent.com

1058	       Thomas Morin
1059	       France Telecom Orange
1060	       Email: thomas.morin@orange.com

1062	       Nabil Bitar
1063	       Verizon
1064	       40 Sylvan Road
1065	       Waltham, MA 02145
1066	       Email: nabil.bitar@verizon.com

1068	       Yakov Rekhter
1069	       Juniper
1070	       Email: yakov@juniper.net