2 Internet Engineering Task Force Marc Lasserre 3 Internet Draft Florin Balus 4 Intended status: Informational Alcatel-Lucent 5 Expires: March 2013 6 Thomas Morin 7 France Telecom Orange 9 Nabil Bitar 10 Verizon 12 Yakov Rekhter 13 Juniper 15 October 19, 2012

17 Framework for DC Network Virtualization 18 draft-ietf-nvo3-framework-01.txt

20 Status of this Memo

22 This Internet-Draft is submitted in full conformance with the 23 provisions of BCP 78 and BCP 79.

25 Internet-Drafts are working documents of the Internet Engineering 26 Task Force (IETF). Note that other groups may also distribute 27 working documents as Internet-Drafts. The list of current Internet- 28 Drafts is at http://datatracker.ietf.org/drafts/current/.

30 Internet-Drafts are draft documents valid for a maximum of six 31 months and may be updated, replaced, or obsoleted by other documents 32 at any time. It is inappropriate to use Internet-Drafts as 33 reference material or to cite them other than as "work in progress."

35 This Internet-Draft will expire on April 19, 2013.

37 Copyright Notice

39 Copyright (c) 2012 IETF Trust and the persons identified as the 40 document authors. All rights reserved.

42 This document is subject to BCP 78 and the IETF Trust's Legal 43 Provisions Relating to IETF Documents 44 (http://trustee.ietf.org/license-info) in effect on the date of 45 publication of this document. Please review these documents 46 carefully, as they describe your rights and restrictions with 47 respect to this document.
Code Components extracted from this 48 document must include Simplified BSD License text as described in 49 Section 4.e of the Trust Legal Provisions and are provided without 50 warranty as described in the Simplified BSD License. 52 Abstract 54 Several IETF drafts relate to the use of overlay networks to support 55 large scale virtual data centers. This draft provides a framework 56 for Network Virtualization over L3 (NVO3) and is intended to help 57 plan a set of work items in order to provide a complete solution 58 set. It defines a logical view of the main components with the 59 intention of streamlining the terminology and focusing the solution 60 set. 62 Table of Contents 64 1. Introduction................................................3 65 1.1. Conventions used in this document.......................4 66 1.2. General terminology.....................................4 67 1.3. DC network architecture.................................6 68 1.4. Tenant networking view..................................7 69 2. Reference Models............................................8 70 2.1. Generic Reference Model.................................8 71 2.2. NVE Reference Model....................................10 72 2.3. NVE Service Types......................................12 73 2.3.1. L2 NVE providing Ethernet LAN-like service.........12 74 2.3.2. L3 NVE providing IP/VRF-like service..............12 75 3. Functional components.......................................12 76 3.1. Generic service virtualization components..............12 77 3.1.1. Virtual Access Points (VAPs)......................13 78 3.1.2. Virtual Network Instance (VNI)....................13 79 3.1.3. Overlay Modules and VN Context....................13 80 3.1.4. Tunnel Overlays and Encapsulation options..........14 81 3.1.5. Control Plane Components..........................14 82 3.1.5.1. Distributed vs Centralized Control Plane.........15 83 3.1.5.2. Auto-provisioning/Service discovery.............15 84 3.1.5.3. Address advertisement and tunnel mapping.........16 85 3.1.5.4. Tunnel management...............................17 86 3.2. Multi-homing..........................................17 87 3.3. Service Overlay Topologies.............................18 88 4. Key aspects of overlay networks.............................18 89 4.1. Pros & Cons...........................................18 90 4.2. Overlay issues to consider.............................19 91 4.2.1. Data plane vs Control plane driven................19 92 4.2.2. Coordination between data plane and control plane..20 93 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) 94 traffic.................................................20 95 4.2.4. Path MTU.........................................21 96 4.2.5. NVE location trade-offs...........................21 97 4.2.6. Interaction between network overlays and underlays.22 98 5. Security Considerations.....................................23 99 6. IANA Considerations........................................23 100 7. References.................................................23 101 7.1. Normative References...................................23 102 7.2. Informative References.................................23 103 8. Acknowledgments............................................24 105 1. Introduction 107 This document provides a framework for Data Center Network 108 Virtualization over L3 tunnels. 
This framework is intended to aid in 109 standardizing protocols and mechanisms to support large scale 110 network virtualization for data centers.

112 Several IETF drafts relate to the use of overlay networks for data 113 centers.

115 [NVOPS] defines the rationale for using overlay networks in order to 116 build large data center networks. The use of virtualization leads to 117 a very large number of communication domains and end systems to cope 118 with.

120 [OVCPREQ] describes the requirements for a control plane protocol 121 required by overlay border nodes to exchange overlay mappings.

123 This document provides reference models and functional components of 124 data center overlay networks as well as a discussion of technical 125 issues that have to be addressed in the design of standards and 126 mechanisms for large scale data centers.

128 1.1. Conventions used in this document

130 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 131 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 132 document are to be interpreted as described in RFC-2119 [RFC2119].

134 In this document, these words will appear with that interpretation 135 only when in ALL CAPS. Lower case uses of these words are not to be 136 interpreted as carrying RFC-2119 significance.

138 1.2. General terminology

140 This document uses the following terminology:

142 NVE: Network Virtualization Edge. It is a network entity that sits 143 on the edge of the NVO3 network. It implements network 144 virtualization functions that allow for L2 and/or L3 tenant 145 separation and for hiding tenant addressing information (MAC and IP 146 addresses). An NVE could be implemented as part of a virtual switch 147 within a hypervisor, a physical switch or router, or a Network Service 148 Appliance.

150 VN: Virtual Network. This is a virtual L2 or L3 domain that belongs 151 to a tenant.

153 VNI: Virtual Network Instance. This is one instance of a virtual 154 overlay network. Two Virtual Networks are isolated from one another 155 and may use overlapping addresses.

157 Virtual Network Context or VN Context: Field that is part of the 158 overlay encapsulation header which allows the encapsulated frame to 159 be delivered to the appropriate virtual network endpoint by the 160 egress NVE. The egress NVE uses this field to determine the 161 appropriate virtual network context in which to process the packet. 162 This field MAY be an explicit, unique (to the administrative domain) 163 virtual network identifier (VNID) or MAY express the necessary 164 context information in other ways (e.g. a locally significant 165 identifier).

167 VNID: Virtual Network Identifier. In the case where the VN context 168 has global significance, this is the ID value that is carried in 169 each data packet in the overlay encapsulation that identifies the 170 Virtual Network the packet belongs to.

172 Underlay or Underlying Network: This is the network that provides 173 the connectivity between NVEs. The Underlying Network can be 174 completely unaware of the overlay packets. Addresses within the 175 Underlying Network are also referred to as "outer addresses" because 176 they exist in the outer encapsulation. The Underlying Network can 177 use a completely different protocol (and address family) from that 178 of the overlay.

180 Data Center (DC): A physical complex housing physical servers, 181 network switches and routers, Network Service Appliances and 182 networked storage.
The purpose of a Data Center is to provide 183 application and/or compute and/or storage services. One such service 184 is virtualized data center services, also known as Infrastructure as 185 a Service.

187 Virtual Data Center or Virtual DC: A container for virtualized 188 compute, storage and network services. Managed by a single tenant, a 189 Virtual DC can contain multiple VNs and multiple Tenant Systems that 190 are connected to one or more of these VNs.

192 VM: Virtual Machine. Several Virtual Machines can share the 193 resources of a single physical computer server using the services of 194 a Hypervisor (see below definition).

196 Hypervisor: Server virtualization software running on a physical 197 compute server that hosts Virtual Machines. The hypervisor provides 198 shared compute/memory/storage and network connectivity to the VMs 199 that it hosts. Hypervisors often embed a Virtual Switch (see below).

201 Virtual Switch: A function within a Hypervisor (typically 202 implemented in software) that provides similar services to a 203 physical Ethernet switch. It switches Ethernet frames between VMs' 204 virtual NICs within the same physical server, or between a VM and a 205 physical NIC card connecting the server to a physical Ethernet 206 switch. It also enforces network isolation between VMs that should 207 not communicate with each other.

209 Tenant: In a DC, a tenant refers to a customer that could be an 210 organization within an enterprise, or an enterprise with a set of DC 211 compute, storage and network resources associated with it.

213 Tenant System: A physical or virtual system that can play the role 214 of a host, or a forwarding element such as a router, switch, 215 firewall, etc. It belongs to a single tenant and connects to one or 216 more VNs of that tenant.

218 End device: A physical system to which networking service is 219 provided. Examples include hosts (e.g. server or server blade), 220 storage systems (e.g. file servers, iSCSI storage systems) and 221 network devices (e.g. firewall, load-balancer, IPSec gateway). An 222 end device may include internal networking functionality that 223 interconnects the device's components (e.g. virtual switches that 224 interconnect VMs running on the same server). NVE functionality may 225 be implemented as part of that internal networking.

227 ELAN: MEF ELAN, multipoint to multipoint Ethernet service

229 EVPN: Ethernet VPN as defined in [EVPN]

231 1.3. DC network architecture

233 A generic architecture for Data Centers is depicted in Figure 1:

235 ,---------. 236 ,' `. 237 ( IP/MPLS WAN ) 238 `. ,' 239 `-+------+' 240 +--+--+ +-+---+ 241 |DC GW|+-+|DC GW| 242 +-+---+ +-----+ 243 | / 244 .--. .--. 245 ( ' '.--. 246 .-.' Intra-DC ' 247 ( network ) 248 ( .'-' 249 '--'._.'. )\ \ 250 / / '--' \ \ 251 / / | | \ \ 252 +---+--+ +-`.+--+ +--+----+ 253 | ToR | | ToR | | ToR | 254 +-+--`.+ +-+-`.-+ +-+--+--+ 255 / \ / \ / \ 256 __/_ \ / \ /_ _\__ 257 '--------' '--------' '--------' '--------' 258 : End : : End : : End : : End : 259 : Device : : Device : : Device : : Device : 260 '--------' '--------' '--------' '--------'

262 Figure 1 : A Generic Architecture for Data Centers

264 An example of a multi-tier DC network architecture is presented in 265 this figure. It provides a view of physical components inside a DC.

267 A cloud network is composed of intra-Data Center (DC) networks and 268 network services, and inter-DC network and network connectivity 269 services.
Depending upon the scale, DC distribution, operations 270 model, Capex and Opex aspects, DC networking elements can act as 271 strict L2 switches and/or provide IP routing capabilities, including 272 service virtualization.

274 In some DC architectures, it is possible that some tier layers 275 providing L2 and/or L3 services are collapsed, and that Internet 276 connectivity, inter-DC connectivity and VPN support are handled by a 277 smaller number of nodes. Nevertheless, one can assume that the 278 functional blocks fit with the architecture above.

280 The following components can be present in a DC:

282 o Top of Rack (ToR): Hardware-based Ethernet switch aggregating 283 all Ethernet links from the End Devices in a rack, representing 284 the entry point in the physical DC network for the hosts. ToRs 285 may also provide routing functionality, virtual IP network 286 connectivity, or Layer2 tunneling over IP for instance. ToRs 287 are usually multi-homed to switches in the Intra-DC network. 288 Other deployment scenarios may use an intermediate Blade Switch 289 before the ToR or an EoR (End of Row) switch to provide a 290 function similar to that of a ToR.

292 o Intra-DC Network: High capacity network composed of core 293 switches aggregating multiple ToRs. Core switches are usually 294 Ethernet switches but can also support routing capabilities.

296 o DC GW: Gateway to the outside world providing DC Interconnect 297 and connectivity to Internet and VPN customers. In the current 298 DC network model, this may simply be a Router connected to the 299 Internet and/or an IPVPN/L2VPN PE. Some network implementations 300 may dedicate DC GWs for different connectivity types (e.g., a 301 DC GW for Internet, and another for VPN).

303 Note that End Devices may be single or multi-homed to ToRs.

305 1.4. Tenant networking view

307 The DC network architecture is used to provide L2 and/or L3 service 308 connectivity to each tenant. An example is depicted in Figure 2:

310 +----- L3 Infrastructure ----+ 311 | | 312 ,--+--. ,--+--. 313 .....( Rtr1 )...... ( Rtr2 ) 314 | `-----' | `-----' 315 | Tenant1 |LAN12 Tenant1| 316 |LAN11 ....|........ |LAN13 317 .............. | | .............. 318 | | | | | | 319 ,-. ,-. ,-. ,-. ,-. ,-. 320 (VM )....(VM ) (VM )... (VM ) (VM )....(VM ) 321 `-' `-' `-' `-' `-' `-'

323 Figure 2 : Logical Service connectivity for a single tenant

325 In this example, one or more L3 contexts and one or more LANs (e.g., 326 one per application type) running on DC switches are assigned for DC 327 tenant 1.

329 For a multi-tenant DC, a virtualized version of this type of service 330 connectivity needs to be provided for each tenant by the Network 331 Virtualization solution.

333 2. Reference Models

335 2.1. Generic Reference Model

337 The following diagram shows a DC reference model for network 338 virtualization using Layer3 overlays where NVEs provide a logical 339 interconnect between Tenant Systems that belong to a specific tenant 340 network.

342 +--------+ +--------+ 343 | Tenant +--+ +----| Tenant | 344 | System | | (') | System | 345 +--------+ | ................... ( ) +--------+ 346 | +-+--+ +--+-+ (_) 347 | | NV | | NV | | 348 +--|Edge| |Edge|---+ 349 +-+--+ +--+-+ 350 / . . 351 / . L3 Overlay +--+-++--------+ 352 +--------+ / . Network | NV || Tenant | 353 | Tenant +--+ . |Edge|| System | 354 | System | . +----+ +--+-++--------+ 355 +--------+ .....| NV |........
356 |Edge| 357 +----+ 358 | 359 | 360 ===================== 361 | | 362 +--------+ +--------+ 363 | Tenant | | Tenant | 364 | System | | System | 365 +--------+ +--------+

367 Figure 3 : Generic reference model for DC network virtualization 368 over a Layer3 infrastructure

370 A Tenant System can be attached to a Network Virtualization Edge 371 (NVE) node in several ways:

373 - locally, by being co-located i.e. resident in the same device

375 - remotely, via a point-to-point connection or a switched network 376 (e.g. Ethernet)

378 When an NVE is local, the state of Tenant Systems can be provided 379 without protocol assistance. For instance, the operational status of 380 a VM can be communicated via a local API. When an NVE is remote, the 381 state of Tenant Systems needs to be exchanged via a data or control 382 plane protocol, or via a management entity.

384 The functional components in this picture do not necessarily map 385 directly to the physical components described in Figure 1.

387 For example, an End Device can be a server blade with VMs and a 388 virtual switch, i.e. the VM is the Tenant System and the NVE 389 functions may be performed by the virtual switch and/or the 390 hypervisor. In this case, the Tenant System and NVE function are co- 391 located.

393 Another example is the case where an End Device can be a traditional 394 physical server (no VMs, no virtual switch), i.e. the server is the 395 Tenant System and the NVE function may be performed by the ToR. 396 Other End Devices in this category are Physical Network Appliances 397 or Storage Systems.

399 The NVE implements network virtualization functions that allow for 400 L2 and/or L3 tenant separation and for hiding tenant addressing 401 information (MAC and IP addresses), tenant-related control plane 402 activity and service contexts from the Routed Backbone nodes.

404 Core nodes utilize L3 techniques to interconnect NVE nodes in 405 support of the overlay network. These devices perform forwarding 406 based on the outer L3 tunnel header, and generally do not maintain per 407 tenant-service state, although some applications (e.g., multicast) may 408 require control plane or forwarding plane information that pertains 409 to a tenant, group of tenants, tenant service or a set of services 410 that belong to one or more tunnels. When such tenant or tenant- 411 service related information is maintained in the core, overlay 412 virtualization provides knobs to control that information.

414 2.2. NVE Reference Model

416 The NVE is composed of a Virtual Network instance that Tenant 417 Systems interface with and an overlay module that provides tunneling 418 overlay functions (e.g. encapsulation/decapsulation of tenant 419 traffic from/to the tenant forwarding instance, tenant 420 identification and mapping, etc), as described in Figure 4:

422 +------- L3 Network ------+ 423 | | 424 | Tunnel Overlay | 425 +------------+---------+ +---------+------------+ 426 | +----------+-------+ | | +---------+--------+ | 427 | | Overlay Module | | | | Overlay Module | | 428 | +---------+--------+ | | +---------+--------+ | 429 | |VN context| | VN context| | 430 | | | | | | 431 | +--------+-------+ | | +--------+-------+ | 432 | | |VNI| . |VNI| | | | |VNI| . |VNI| |
433 NVE1 | +-+------------+-+ | | +-+-----------+--+ | NVE2 434 | | VAPs | | | | VAPs | | 435 +----+------------+----+ +----+-----------+-----+ 436 | | | | 437 -------+------------+-----------------+-----------+------- 438 | | Tenant | | 439 | | Service IF | | 440 Tenant Systems Tenant Systems

442 Figure 4 : Generic reference model for NV Edge

444 Note that some NVE functions (e.g. data plane and control plane 445 functions) may reside in one device or may be implemented separately 446 in different devices.

448 For example, the NVE functionality could reside solely on the End 449 Devices, on the ToRs or on both the End Devices and the ToRs. In the 450 latter case, we say that the End Device NVE component acts as the NVE 451 Spoke, and ToRs act as NVE hubs. Tenant Systems will interface with 452 VNIs maintained on the NVE spokes, and VNIs maintained on the NVE 453 spokes will interface with VNIs maintained on the NVE hubs.

455 2.3. NVE Service Types

457 NVE components may be used to provide different types of virtualized 458 service connectivity. This section defines the service types and 459 associated attributes.

461 2.3.1. L2 NVE providing Ethernet LAN-like service

463 An L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet-based 464 multipoint service where the Tenant Systems appear to be 465 interconnected by a LAN environment over a set of L3 tunnels. It 466 provides a per-tenant virtual switching instance with MAC addressing 467 isolation and L3 tunnel encapsulation across the core.

469 2.3.2. L3 NVE providing IP/VRF-like service

471 Virtualized IP routing and forwarding is similar, from a service 472 definition perspective, to IETF IP VPNs (e.g., BGP/MPLS IPVPN and 473 IPsec VPNs). It provides a per-tenant routing instance with addressing 474 isolation and L3 tunnel encapsulation across the core.

476 3. Functional components

478 This section breaks down the Network Virtualization architecture 479 into functional components to make it easier to discuss solution 480 options for different modules.

482 This version of the document gives an overview of generic functional 483 components that are shared between L2 and L3 service types. Details 484 specific to each service type will be added in future revisions.

486 3.1. Generic service virtualization components

488 A Network Virtualization solution is built around a number of 489 functional components as depicted in Figure 5:

491 +------- L3 Network ------+ 492 | | 493 | Tunnel Overlay | 494 +------------+--------+ +--------+------------+ 495 | +----------+------+ | | +------+----------+ | 496 | | Overlay Module | | | | Overlay Module | | 497 | +--------+--------+ | | +--------+--------+ | 498 | |VN Context| | |VN Context| 499 | | | | | | 500 | +-------+-------+ | | +-------+-------+ | 501 | ||VNI| ... |VNI|| | | ||VNI| ... |VNI|| | 502 NVE1 | +-+-----------+-+ | | +-+-----------+-+ | NVE2 503 | | VAPs | | | | VAPs | | 504 +----+-----------+----+ +----+-----------+----+ 505 | | | | 506 -----+-----------+-----------------+-----------+----- 507 | | Tenant | | 508 | | Service IF | | 509 Tenant Systems Tenant Systems

511 Figure 5 : Generic reference model for NV Edge

513 3.1.1. Virtual Access Points (VAPs)

515 Tenant Systems are connected to the VNI through Virtual 516 Access Points (VAPs).

518 The VAPs can be physical ports or virtual ports identified through 519 logical interface identifiers (VLANs, internal VSwitch Interface ID 520 leading to a VM).
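As a non-normative illustration of the VAP concept, the following sketch shows one way an NVE implementation could model the binding between VAPs and VNIs; it assumes, purely for the example, that a VAP is identified by a local port plus an optional VLAN tag and is bound to a VNI when the Tenant System is provisioned (see Section 3.1.5.2). All class and field names are illustrative, not part of this framework.

      # Hypothetical sketch; names and fields are illustrative only.
      from dataclasses import dataclass
      from typing import Dict, Optional

      @dataclass(frozen=True)
      class VapId:
          port: str                   # physical port or vSwitch interface
          vlan: Optional[int] = None  # logical interface identifier, if any

      class Nve:
          def __init__(self) -> None:
              self.vap_to_vni: Dict[VapId, int] = {}

          def provision_vap(self, vap: VapId, vni_id: int) -> None:
              # VAP created and mapped to the appropriate VNI.
              self.vap_to_vni[vap] = vni_id

          def classify(self, port: str, vlan: Optional[int]) -> int:
              # Ingress classification: map the receiving VAP to its VNI.
              return self.vap_to_vni[VapId(port, vlan)]

      nve = Nve()
      nve.provision_vap(VapId("vswitch-if-7", vlan=100), vni_id=10)
      assert nve.classify("vswitch-if-7", 100) == 10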
522 3.1.2. Virtual Network Instance (VNI)

524 The VNI represents a set of configuration attributes defining access 525 and tunnel policies and (L2 and/or L3) forwarding functions.

527 Per tenant FIB tables and control plane protocol instances are used 528 to maintain separate private contexts between tenants. Hence tenants 529 are free to use their own addressing schemes without concerns about 530 address overlapping with other tenants.

532 3.1.3. Overlay Modules and VN Context

534 Mechanisms for identifying each tenant service are required to allow 535 the simultaneous overlay of multiple tenant services over the same 536 underlay L3 network topology. In the data plane, each NVE, upon 537 sending a tenant packet, must be able to encode the VN Context for 538 the destination NVE in addition to the L3 tunnel source address 539 identifying the source NVE and the tunnel destination L3 address 540 identifying the destination NVE. This allows the destination NVE to 541 identify the tenant service instance and therefore appropriately 542 process and forward the tenant packet.

544 The Overlay module provides tunneling overlay functions: tunnel 545 initiation/termination, encapsulation/decapsulation of frames from 546 VAPs/L3 Backbone and may provide for transit forwarding of IP 547 traffic (e.g., transparent tunnel forwarding).

549 In a multi-tenant context, the tunnel aggregates frames from/to 550 different VNIs. Tenant identification and traffic demultiplexing are 551 based on the VN Context (e.g. VNID).

553 The following approaches can be considered:

555 o One VN Context per Tenant: A globally unique (on a per-DC 556 administrative domain) VNID is used to identify the related 557 Tenant instances. An example of this approach is the use of 558 IEEE VLAN or ISID tags to provide virtual L2 domains.

560 o One VN Context per VNI: A per-tenant local value is 561 automatically generated by the egress NVE and usually 562 distributed by a control plane protocol to all the related 563 NVEs. An example of this approach is the use of per VRF MPLS 564 labels in IP VPN [RFC4364].

566 o One VN Context per VAP: A per-VAP local value is assigned and 567 usually distributed by a control plane protocol. An example of 568 this approach is the use of per CE-PE MPLS labels in IP VPN 569 [RFC4364].

571 Note that when using one VN Context per VNI or per VAP, an 572 additional global identifier may be used by the control plane to 573 identify the Tenant context.

575 3.1.4. Tunnel Overlays and Encapsulation options

577 Once the VN context is added to the frame, an L3 tunnel encapsulation 578 is used to transport the frame to the destination NVE. The backbone 579 devices do not usually keep any per service state, simply forwarding 580 the frames based on the outer tunnel header.

582 Different IP tunneling options (GRE/L2TP/IPSec) and MPLS-based tunneling 583 options (BGP VPN, PW, VPLS) are available for both Ethernet and IP 584 formats.
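As a purely illustrative, non-normative sketch of the data plane behavior described in Sections 3.1.3 and 3.1.4, the fragment below models an ingress NVE adding a VN Context and an outer L3 tunnel header, and an egress NVE using the VN Context to select the local VNI. The field layout (an explicit VNID) and all names are assumptions of the example, not of this framework.

      # Hypothetical sketch; the encapsulation layout is illustrative only.
      from dataclasses import dataclass

      @dataclass
      class OverlayPacket:
          outer_src: str    # L3 address identifying the source (ingress) NVE
          outer_dst: str    # L3 address identifying the destination (egress) NVE
          vn_context: int   # VN Context (here an explicit VNID)
          payload: bytes    # original tenant frame

      def encapsulate(ingress_ip: str, egress_ip: str, vnid: int,
                      tenant_frame: bytes) -> OverlayPacket:
          # Ingress NVE: add the VN Context and the outer tunnel header.
          return OverlayPacket(ingress_ip, egress_ip, vnid, tenant_frame)

      def decapsulate(pkt: OverlayPacket, vni_table: dict):
          # Egress NVE: the VN Context identifies the tenant service
          # instance (VNI) that processes and forwards the frame.
          return vni_table[pkt.vn_context], pkt.payload

      vni_table = {10: "tenant-1 L2 VNI"}      # VNIs local to the egress NVE
      pkt = encapsulate("192.0.2.1", "192.0.2.2", vnid=10,
                        tenant_frame=b"original tenant frame")
      vni, frame = decapsulate(pkt, vni_table)

In line with the text above, backbone devices would forward such packets on the outer header only, without inspecting the VN Context or the tenant payload.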
586 3.1.5. Control Plane Components

588 Control plane components may be used to provide the following 589 capabilities:

591 . Auto-provisioning/Service discovery

593 . Address advertisement and tunnel mapping

595 . Tunnel management

597 A control plane component can be an on-net control protocol or a 598 management control entity.

600 3.1.5.1. Distributed vs Centralized Control Plane

602 A control/management plane entity can be centralized or distributed. 603 Both approaches have been used extensively in the past. The routing 604 model of the Internet is a good example of a distributed approach. 605 Transport networks have usually used a centralized approach to 606 manage transport paths.

608 It is also possible to combine the two approaches, i.e. to use a 609 hybrid model. A global view of network state can have many benefits, 610 but it does not preclude the use of distributed protocols within the 611 network. Centralized controllers provide a facility to maintain 612 global state and distribute that state to the network, which, in combination 613 with distributed protocols, can aid in achieving greater network 614 efficiency and in improving reliability and robustness. Domain and/or 615 deployment specific constraints define the balance between 616 centralized and distributed approaches.

618 On one hand, a control plane module can reside in every NVE. This is 619 how routing control plane modules are implemented in routers. On the 620 other hand, an external controller can manage a group of NVEs via an 621 agent sitting in each NVE. This is how an SDN controller could 622 communicate with the nodes it controls, via OpenFlow for instance.

624 In the case where a centralized control plane is preferred, the 625 controller will need to be distributed to more than one node for 626 redundancy. Depending upon the size of the DC domain, hence the 627 number of NVEs to manage, it should be possible to use several 628 external controllers. Inter-controller communication will thus be 629 necessary for scalability and redundancy.

631 3.1.5.2. Auto-provisioning/Service discovery

633 NVEs must be able to select the appropriate VNI for each Tenant 634 System. This is based on state information that is often provided by 635 external entities. For example, in a VM environment, this 636 information is provided by compute management systems, since these 637 are the only entities that have visibility into which VM belongs to 638 which tenant.

640 A mechanism for communicating this information between Tenant 641 Systems and the local NVE is required. As a result, the VAPs are 642 created and mapped to the appropriate VNI.

644 Depending upon the implementation, this control interface can be 645 implemented using an auto-discovery protocol between Tenant Systems 646 and their local NVE or through management entities.

648 When a protocol is used, appropriate security and authentication 649 mechanisms to verify that Tenant System information is not spoofed 650 or altered are required. This is one critical aspect for providing 651 integrity and tenant isolation in the system.

653 Another control plane protocol can also be used to advertise 654 supported VNs to other NVEs. Alternatively, management control 655 entities can also be used to perform these functions.

657 3.1.5.3. Address advertisement and tunnel mapping

659 As traffic reaches an ingress NVE, a lookup is performed to 660 determine which tunnel the packet needs to be sent to. It is then 661 encapsulated with a tunnel header containing the destination address 662 of the egress overlay node. Intermediate nodes (between the ingress 663 and egress NVEs) switch or route traffic based upon the outer 664 destination address.

666 One key step in this process consists of mapping a final destination 667 address to the proper tunnel. NVEs are responsible for maintaining 668 such mappings in their lookup tables. Several ways of populating 669 these lookup tables are possible: control plane driven, management 670 plane driven, or data plane driven.
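As a non-normative illustration of the mapping step described above, the following sketch shows an ingress NVE resolving a tenant destination address to an egress NVE tunnel endpoint and VN Context through such a lookup table. How the table is populated (control plane, management plane or data plane driven) is deliberately left abstract, and all names are assumptions of the example.

      # Hypothetical sketch; table layout and function names are illustrative.
      from typing import Dict, NamedTuple, Optional

      class TunnelMapping(NamedTuple):
          egress_nve: str   # outer (underlay) address of the egress NVE
          vn_context: int   # VN Context to encode toward that NVE

      # Per-VNI lookup table: inner destination (MAC or IP) -> mapping.
      lookup_table: Dict[str, TunnelMapping] = {}

      def on_advertisement(inner_addr: str, egress_nve: str, vn_context: int):
          # Invoked when an address advertisement is received, or when a
          # management entity pushes the mapping.
          lookup_table[inner_addr] = TunnelMapping(egress_nve, vn_context)

      def ingress_lookup(inner_dst: str) -> Optional[TunnelMapping]:
          # Returns the tunnel to use, or None for an unknown destination
          # (which would be flooded or dropped; see Section 4.2.3).
          return lookup_table.get(inner_dst)

      on_advertisement("00:00:5e:00:53:01", egress_nve="192.0.2.2",
                       vn_context=10)
      print(ingress_lookup("00:00:5e:00:53:01"))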
672 When a control plane protocol is used to distribute address 673 advertisement and tunneling information, the auto- 674 provisioning/Service discovery could be accomplished by the same 675 protocol. In this scenario, the auto-provisioning/Service discovery 676 could be combined with (be inferred from) the address advertisement 677 and tunnel mapping. Furthermore, a control plane protocol that 678 carries both MAC and IP addresses eliminates the need for ARP, and 679 hence addresses one of the issues with explosive ARP handling.

681 3.1.5.4. Tunnel management

683 A control plane protocol may be required to exchange tunnel state 684 information. This may include setting up tunnels and/or providing 685 tunnel state information.

687 This applies to both unicast and multicast tunnels.

689 For instance, it may be necessary to provide active/standby status 690 information between NVEs, up/down status information, 691 pruning/grafting information for multicast tunnels, etc.

693 3.2. Multi-homing

695 Multi-homing techniques can be used to increase the reliability of 696 an nvo3 network. It is also important to ensure that physical 697 diversity in an nvo3 network is taken into account to avoid single 698 points of failure.

700 Multi-homing can be enabled at various levels: from tenant systems 701 into ToRs, from ToRs into core switches/routers, and from core nodes into DC 702 GWs.

704 The nvo3 underlay nodes (i.e. from NVEs to DC GWs) rely on IP 705 routing and/or ECMP techniques as the means to re-route traffic upon 706 failures.

708 Tenant systems can either be L2 or L3 nodes. In the former case 709 (L2), techniques such as LAG or STP can be used. In the 710 latter case (L3), it is possible that no dynamic routing protocol is 711 enabled. Tenant systems can be multi-homed into a remote NVE using 712 several interfaces (physical NICs or vNICs) with an IP address per 713 interface either to the same nvo3 network or into different nvo3 714 networks. When one of the links fails, the corresponding IP is not 715 reachable but the other interfaces can still be used. When a tenant 716 system is co-located with an NVE, IP routing can be relied upon to 717 handle routing over diverse links to ToRs.

719 External connectivity is handled by two or more nvo3 gateways. Each 720 gateway is connected to a different domain (e.g. ISP) and runs BGP 721 multi-homing. They serve as access points to external networks 722 such as VPNs or the Internet. When a connection to an upstream 723 router is lost, the alternative connection is used and the failed 724 route is withdrawn.

726 3.3. Service Overlay Topologies

728 A number of service topologies may be used to optimize the service 729 connectivity and to address NVE performance limitations.

731 The topology described in Figure 3 suggests the use of a tunnel mesh 732 between the NVEs where each tenant instance is one hop away from a 733 service processing perspective. Partial mesh topologies and an NVE 734 hierarchy may be used where certain NVEs may act as service transit 735 points.
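Purely as an illustrative aid, and not as a provisioning model defined by this framework, the sketch below shows how the set of tunnels an NVE establishes could differ between the full mesh suggested by Figure 3 and a hub-and-spoke hierarchy in which some NVEs act as service transit points; the topology names and the function itself are assumptions of the example.

      # Hypothetical sketch; topology policy names are illustrative only.
      def tunnel_peers(nve, all_nves, topology, hubs=()):
          """Return the set of peer NVEs this NVE establishes tunnels to."""
          peers = set(all_nves) - {nve}
          if topology == "full-mesh":
              # Every tenant instance is one service hop away.
              return peers
          if topology == "hub-and-spoke":
              # Spokes tunnel only to hubs; hubs (service transit points)
              # tunnel to all peers.
              return peers if nve in hubs else peers & set(hubs)
          raise ValueError("unknown topology")

      nves = ["nve-1", "nve-2", "nve-3", "hub-nve"]
      print(tunnel_peers("nve-1", nves, "full-mesh"))
      print(tunnel_peers("nve-1", nves, "hub-and-spoke", hubs=["hub-nve"]))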
737 4. Key aspects of overlay networks

739 The intent of this section is to highlight specific issues that 740 proposed overlay solutions need to address.

742 4.1. Pros & Cons

744 An overlay network is a layer of virtual network topology on top of 745 the physical network.

747 Overlay networks offer the following key advantages:

749 o Unicast tunneling state management is handled at the edge of 750 the network. Intermediate transport nodes are unaware of such 751 state. Note that this is not the case when multicast is enabled 752 in the core network.

754 o Tunnels are used to aggregate traffic and hence offer the 755 advantage of minimizing the amount of forwarding state required 756 within the underlay network

758 o Decoupling of the overlay addresses (MAC and IP) used by VMs 759 from the underlay network. This offers a clear separation 760 between addresses used within the overlay and the underlay 761 networks and it enables the use of overlapping address spaces 762 by Tenant Systems

764 o Support of a large number of virtual network identifiers

766 Overlay networks also create several challenges:

768 o Overlay networks have no control over underlay networks and lack 769 critical network information 770 o Overlays typically probe the network to measure link 771 properties, such as available bandwidth or packet loss 772 rate. It is difficult to accurately evaluate network 773 properties. It might be preferable for the underlay 774 network to expose usage and performance information.

776 o Miscommunication between overlay and underlay networks can lead 777 to an inefficient usage of network resources.

779 o Fairness of resource sharing and collaboration among end-nodes 780 in overlay networks are two critical issues

782 o When multiple overlays co-exist on top of a common underlay 783 network, the lack of coordination between overlays can lead to 784 performance issues.

786 o Overlaid traffic may not traverse firewalls and NAT devices.

788 o Multicast service scalability. Multicast support may be 789 required in the overlay network to address, for each tenant, 790 flood containment or efficient multicast handling.

792 o Hash-based load balancing may not be optimal as the hash 793 algorithm may not work well due to the limited number of 794 combinations of tunnel source and destination addresses

796 4.2. Overlay issues to consider

798 4.2.1. Data plane vs Control plane driven

800 In the case of an L2 NVE, it is possible to dynamically learn MAC 801 addresses against VAPs. It is also possible that such addresses be 802 known and controlled via management or a control protocol for both 803 L2 NVEs and L3 NVEs.

805 Dynamic data plane learning implies that flooding of unknown 806 destinations be supported and hence implies that broadcast and/or 807 multicast be supported. Multicasting in the core network for dynamic 808 learning may lead to significant scalability limitations. Specific 809 forwarding rules must be enforced to prevent loops from happening. 810 This can be achieved using a spanning tree, a shortest path tree, or 811 a split-horizon mesh.

813 It should be noted that the amount of state to be distributed is 814 dependent upon network topology and the number of virtual machines. 815 Different forms of caching can also be utilized to minimize state 816 distribution between the various elements. The control plane should 817 not require an NVE to maintain the locations of all the tenant 818 systems whose VNs are not present on the NVE.

820 4.2.2. Coordination between data plane and control plane

822 For an L2 NVE, the NVE needs to be able to determine MAC addresses 823 of the end systems present on a VAP. This can be achieved via 824 data plane learning or a control plane. For an L3 NVE, the NVE needs 825 to be able to determine IP addresses of the end systems present on a 826 VAP.

828 In both cases, coordination with the NVE control protocol is needed 829 such that when the NVE determines that the set of addresses behind a 830 VAP has changed, it triggers the local NVE control plane to 831 distribute this information to its peers.
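To make this coordination concrete, here is a small, non-normative sketch of an L2 NVE learning MAC addresses against a VAP in the data plane and notifying its local control plane whenever the set of addresses behind that VAP changes. The advertise callback and all names are illustrative assumptions; they stand in for whatever control plane or management mechanism a solution actually uses.

      # Hypothetical sketch; the advertise callback is illustrative only.
      from collections import defaultdict

      class L2VniLearning:
          def __init__(self, advertise):
              self.vap_macs = defaultdict(set)  # VAP -> MACs learned behind it
              self.advertise = advertise        # local control plane hook

          def on_frame(self, vap: str, src_mac: str) -> None:
              # Data plane learning: MAC addresses are learned against VAPs.
              if src_mac not in self.vap_macs[vap]:
                  self.vap_macs[vap].add(src_mac)
                  # The set of addresses behind the VAP changed: trigger the
                  # control plane to distribute this information to peers.
                  self.advertise(vap, src_mac)

      vni = L2VniLearning(advertise=lambda vap, mac: print("advertise", vap, mac))
      vni.on_frame("vap-1", "00:00:5e:00:53:01")  # new address: advertised
      vni.on_frame("vap-1", "00:00:5e:00:53:01")  # already known: no action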
833 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

835 There are two techniques to support packet replication needed for 836 broadcast, unknown unicast and multicast:

838 o Ingress replication

840 o Use of core multicast trees

842 There is a bandwidth vs state trade-off between the two approaches. 843 Depending upon the degree of replication required (i.e. the number 844 of hosts per group) and the amount of multicast state to maintain, 845 trading bandwidth for state is a consideration.

847 When the number of hosts per group is large, the use of core 848 multicast trees may be more appropriate. When the number of hosts is 849 small (e.g. 2-3), ingress replication may not be an issue.

851 Depending upon the size of the data center network and hence the 852 number of (S,G) entries, but also the duration of multicast flows, 853 the use of core multicast trees can be a challenge.

855 When flows are well known, it is possible to pre-provision such 856 multicast trees. However, it is often difficult to predict 857 application flows ahead of time, and hence programming of (S,G) 858 entries for short-lived flows could be impractical.

860 A possible trade-off is to use shared multicast trees in the core as 861 opposed to dedicated multicast trees.

863 4.2.4. Path MTU

865 When using overlay tunneling, an outer header is added to the 866 original frame. This can cause the MTU of the path to the egress 867 tunnel endpoint to be exceeded.

869 In this section, we will only consider the case of an IP overlay.

871 It is usually not desirable to rely on IP fragmentation for 872 performance reasons. Ideally, the interface MTU as seen by a Tenant 873 System is adjusted such that no fragmentation is needed. TCP will 874 adjust its maximum segment size accordingly.

876 It is possible for the MTU to be configured manually or to be 877 discovered dynamically. Various Path MTU discovery techniques exist 878 in order to determine the proper MTU size to use:

880 o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981]

882 o Tenant Systems rely on ICMP messages to discover the MTU of 884 the end-to-end path to their destinations. This method is not 885 always possible, such as when traversing middle boxes 886 (e.g. firewalls) which disable ICMP for security reasons

888 o Extended MTU Path Discovery techniques such as defined in 889 [RFC4821]

891 It is also possible to rely on the overlay layer to perform 892 segmentation and reassembly operations without relying on the Tenant 893 Systems to know about the end-to-end MTU. The assumption is that 894 some hardware assist is available on the NVE node to perform such 895 SAR operations. However, fragmentation by the overlay layer can lead 896 to performance and congestion issues due to TCP dynamics and might 897 require new congestion avoidance mechanisms from the underlay 898 network [FLOYD].

900 Finally, the underlay network may be designed in such a way that the 901 MTU can accommodate the extra tunnel overhead.
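As a simple worked example of the MTU adjustment discussed above, the sketch below computes the largest tenant packet that fits without fragmentation when an L2 frame is carried over an IPv4/UDP-based overlay. The per-packet overhead figures are assumptions of the example, since this framework does not mandate a particular encapsulation.

      # Hypothetical sketch; overhead values are illustrative assumptions:
      # outer IPv4 (20 B) + outer UDP (8 B) + 8-B overlay header + inner
      # Ethernet header (14 B) = 50 B of tunnel overhead.
      def tenant_mtu(underlay_mtu: int, outer_ip: int = 20, outer_udp: int = 8,
                     overlay_hdr: int = 8, inner_eth: int = 14) -> int:
          """Largest tenant IP packet that avoids fragmentation."""
          return underlay_mtu - (outer_ip + outer_udp + overlay_hdr + inner_eth)

      print(tenant_mtu(1500))   # 1450: tenant interface MTU must be lowered
      print(tenant_mtu(1600))   # 1550: a 1500-byte tenant MTU fits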
903 4.2.5. NVE location trade-offs

905 In the case of DC traffic, traffic originated from a VM is native 906 Ethernet traffic. This traffic can be switched by a local VM switch 907 or ToR switch and then by a DC gateway. The NVE function can be 908 embedded within any of these elements.

910 There are several criteria to consider when deciding where the NVE 911 processing boundary lies:

913 o Processing and memory requirements

915 o Datapath (e.g. lookups, filtering, 916 encapsulation/decapsulation)

918 o Control plane processing (e.g. routing, signaling, OAM)

920 o FIB/RIB size

922 o Multicast support

924 o Routing protocols

926 o Packet replication capability

928 o Fragmentation support

930 o QoS transparency

932 o Resiliency

934 4.2.6. Interaction between network overlays and underlays

936 When multiple overlays co-exist on top of a common underlay network, 937 this can cause some performance issues. These overlays have 938 partially overlapping paths and nodes.

940 Each overlay is selfish by nature in that it sends traffic so as to 941 optimize its own performance without considering the impact on other 942 overlays, unless the underlay tunnels are traffic engineered on a 943 per overlay basis so as to avoid sharing underlay resources.

945 Better visibility between overlays and underlays can be achieved by 946 providing mechanisms to exchange information about:

948 o Performance metrics (throughput, delay, loss, jitter)

950 o Cost metrics

952 5. Security Considerations

954 As a framework document, no protocols are being defined and hence no 955 specific security considerations are raised.

957 The following security aspects shall be discussed in the respective 958 solutions documents:

960 Traffic isolation between NVO3 domains is guaranteed by the use of 961 per tenant FIB tables (VNIs).

963 The creation of overlay networks and the tenant to overlay mapping 964 function can introduce significant security risks. When dynamic 965 protocols are used, authentication should be supported. When a 966 centralized controller is used, access to that controller should be 967 restricted to authorized personnel. This can be achieved via login 968 authentication.

970 6. IANA Considerations

972 IANA does not need to take any action for this draft.

974 7. References

976 7.1. Normative References

978 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 979 Requirement Levels", BCP 14, RFC 2119, March 1997.

981 7.2. Informative References

983 [NVOPS] Narten, T. et al., "Problem Statement: Overlays for Network 984 Virtualization", draft-narten-nvo3-overlay-problem- 985 statement (work in progress).

987 [OVCPREQ] Kreeger, L. et al., "Network Virtualization Overlay Control 988 Protocol Requirements", draft-kreeger-nvo3-overlay-cp 989 (work in progress).

991 [FLOYD] Floyd, S. and Romanow, A., "Dynamics of TCP Traffic over 992 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995.

994 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 995 Networks (VPNs)", RFC 4364, February 2006.

997 [RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

999 [RFC1981] McCann, J. et al., "Path MTU Discovery for IPv6", RFC 1981, 1000 August 1996.

1002 [RFC4821] Mathis, M. et al., "Packetization Layer Path MTU 1003 Discovery", RFC 4821, March 2007.

1005 8. Acknowledgments

1007 In addition to the authors, the following people have contributed to 1008 this document:

1010 Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent

1012 Lucy Yong, Huawei

1014 This document was prepared using 2-Word-v2.0.template.dot.

1016 Authors' Addresses

1018 Marc Lasserre 1019 Alcatel-Lucent 1020 Email: marc.lasserre@alcatel-lucent.com

1022 Florin Balus 1023 Alcatel-Lucent 1024 777 E.
Middlefield Road 1025 Mountain View, CA, USA 94043 1026 Email: florin.balus@alcatel-lucent.com 1028 Thomas Morin 1029 France Telecom Orange 1030 Email: thomas.morin@orange.com 1032 Nabil Bitar 1033 Verizon 1034 40 Sylvan Road 1035 Waltham, MA 02145 1036 Email: nabil.bitar@verizon.com 1038 Yakov Rekhter 1039 Juniper 1040 Email: yakov@juniper.net