Internet Engineering Task Force                           Marc Lasserre
Internet Draft                                              Florin Balus
Intended status: Informational                            Alcatel-Lucent
Expires: September 2012
                                                            Thomas Morin
                                                    France Telecom Orange

                                                              Nabil Bitar
                                                                  Verizon

                                                            Yakov Rekhter
                                                                  Juniper

                                                           Yuichi Ikejiri
                                                       NTT Communications

                                                           March 12, 2012

                 Framework for DC Network Virtualization
                   draft-lasserre-nvo3-framework-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on September 12, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   Several IETF drafts relate to the use of overlay networks to support
   large scale virtual data centers.  This draft provides a framework
   for Network Virtualization over L3 (NVO3) and is intended to help
   plan a set of work items in order to provide a complete solution
   set.  It defines a logical view of the main components with the
   intention of streamlining the terminology and focusing the solution
   set.
Table of Contents

   1. Introduction...................................................3
      1.1. Conventions used in this document.........................4
      1.2. General terminology.......................................4
      1.3. DC network architecture...................................5
      1.4. Tenant networking view....................................6
   2. Reference Models...............................................7
      2.1. Generic Reference Model...................................7
      2.2. NVE Reference Model.......................................9
      2.3. NVE Service Types........................................10
         2.3.1. L2 NVE providing Ethernet LAN-like service..........11
         2.3.2. L3 NVE providing IP/VRF-like service................11
   3. Functional components.........................................11
      3.1. Generic service virtualization components................11
         3.1.1. Virtual Attachment Points (VAPs)....................12
         3.1.2. Tenant Instance.....................................12
         3.1.3. Overlay Modules and Tenant ID.......................13
         3.1.4. Tunnel Overlays and Encapsulation options...........14
         3.1.5. Control Plane Components............................14
            3.1.5.1. Auto-provisioning/Service discovery............14
            3.1.5.2. Address advertisement and tunnel mapping.......15
            3.1.5.3. Tunnel management..............................15
      3.2. Service Overlay Topologies...............................16
   4. Key aspects of overlay networks...............................16
      4.1. Pros & Cons..............................................16
      4.2. Overlay issues to consider...............................17
         4.2.1. Data plane vs Control plane driven..................17
         4.2.2. Coordination between data plane and control plane...18
         4.2.3. Handling Broadcast, Unknown Unicast and Multicast
                (BUM) traffic.......................................18
         4.2.4. Path MTU............................................19
         4.2.5. NVE location trade-offs.............................19
         4.2.6. Interaction between network overlays and underlays..20
   5. Security Considerations.......................................21
   6. IANA Considerations...........................................21
   7. References....................................................21
      7.1. Normative References.....................................21
      7.2. Informative References...................................21
   8. Acknowledgments...............................................22

1. Introduction

   This document provides a framework for Data Center Network
   Virtualization over L3 tunnels.  This framework is intended to aid
   in standardizing protocols and mechanisms to support large scale
   network virtualization for data centers.

   Several IETF drafts relate to the use of overlay networks for data
   centers.

   [NVOPS] defines the rationale for using overlay networks in order
   to build large data center networks.  The use of virtualization
   leads to a very large number of communication domains and end
   systems to cope with.  Existing virtual network models used for
   data center networks have known limitations, specifically in the
   context of multiple tenants.
   These issues can be summarized as:

   o Limited VLAN space

   o FIB explosion due to handling of a large number of MAC/IP
     addresses

   o Spanning Tree limitations

   o Excessive ARP handling

   o Broadcast storms

   o Inefficient Broadcast/Multicast handling

   o Limited mobility/portability support

   o Lack of service auto-discovery

   Overlay techniques have been used in the past to address some of
   these issues.

   [OVCPREQ] describes the requirements for a control plane protocol
   required by overlay border nodes to exchange overlay mappings.

   This document provides reference models that describe functional
   components of data center overlay networks.  It also describes
   technical issues that have to be addressed in the design of
   protocols and mechanisms for large-scale data center networks.

1.1. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119
   [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to
   be interpreted as carrying RFC-2119 significance.

1.2. General terminology

   Some general terminology is defined here.  Terminology specific to
   this memo is introduced as needed in later sections.

   DC: Data Center

   ELAN: MEF ELAN, multipoint-to-multipoint Ethernet service

1.3. DC network architecture

                                ,---------.
                              ,'           `.
                             (  IP/MPLS WAN  )
                              `.           ,'
                                `-+------+'
                             +--+--+   +-+---+
                             |DC GW|+-+|DC GW|
                             +-+---+   +-----+
                                  |      /
                                 .--. .--.
                               (    '    '.--.
                            .-.'  Intra-DC    '
                           (      network      )
                            (              .'-'
                             '--'._.'.    )\ \
                            / /      '--'  \ \
                           / /       | |    \ \
                    +---+--+    +-`.+--+   +--+----+
                    |  ToR  |   |  ToR  |  |  ToR  |
                    +-+--`.-+   +-+-`.-+   +-+--+--+
                     .'    \     .'   \     .'    `.
                 __/_     _i./        i./_        _\__
                '--------'   '--------'   '--------'   '--------'
                :  End   :   :  End   :   :  End   :   :  End   :
                : Device :   : Device :   : Device :   : Device :
                '--------'   '--------'   '--------'   '--------'

              Figure 1 : A Generic Architecture for Data Centers

   Figure 1 depicts a common and generic multi-tier DC network
   architecture.  It provides a view of the physical components inside
   a DC.

   A cloud network is composed of intra-Data Center (DC) networks and
   network services, and inter-DC network and network connectivity
   services.  Depending upon the scale, DC distribution, operations
   model, and Capex and Opex aspects, DC networking elements can act
   as strict L2 switches and/or provide IP routing capabilities,
   including service virtualization.

   In some DC architectures, it is possible that some tier layers are
   collapsed and/or provide L2 and/or L3 services, and that Internet
   connectivity, inter-DC connectivity and VPN support are handled by
   a smaller number of nodes.  Nevertheless, one can assume that the
   functional blocks fit with the architecture depicted in Figure 1.

   The following components can be present in a DC:

   o End Device: a DC resource to which the networking service is
     provided.  An End Device may be a compute resource (server or
     server blade), a storage component or a network appliance
     (firewall, load-balancer, IPsec gateway).  Alternatively, the End
     Device may include software-based networking functions used to
     interconnect multiple hosts.  An example of such soft networking
     is the virtual switch in a server blade, used to interconnect
     multiple virtual machines (VMs).  An End Device may be single- or
     multi-homed to the Top of Rack switches (ToRs).
   o Top of Rack (ToR): Hardware-based Ethernet switch aggregating all
     Ethernet links from the End Devices in a rack, representing the
     entry point in the physical DC network for the hosts.  ToRs may
     also provide routing functionality, virtual IP network
     connectivity, or Layer 2 tunneling over IP, for instance.  ToRs
     are usually multi-homed to switches/routers in the Intra-DC
     network.  Other deployment scenarios may use an intermediate
     Blade Switch before the ToR, or an EoR (End of Row) switch, to
     provide a similar function to a ToR.

   o Intra-DC Network: High capacity network composed of core
     switches/routers aggregating multiple ToRs.  Core network
     elements are usually Ethernet switches but can also support
     routing capabilities.

   o DC GW: Gateway to the outside world providing DC Interconnect
     and connectivity to Internet and VPN customers.  In the current
     DC network model, this may simply be a router connected to the
     Internet and/or an IP VPN/L2VPN PE.  Some network implementations
     may dedicate DC GWs for different connectivity types (e.g., a DC
     GW for Internet, and another for VPN).

   Throughout this document, we also use the term "Tenant End System"
   to refer to an end system of a particular tenant, which can be for
   instance a virtual machine (VM), a non-virtualized server, or a
   physical appliance.  One or more Tenant End Systems can be part of
   an End Device.

1.4. Tenant networking view

   The DC network architecture is used to provide L2 and/or L3 service
   connectivity to each tenant.  An example is depicted in Figure 2:

                   +----- L3 Infrastructure ----+
                   |                            |
                ,--+-'.                      ;--+--.
          ..... Rtr1  )......               . Rtr2  )
          |     '-----'     |                '-----'
          | Tenant1         |LAN12        Tenant1|
          |LAN11        ....|........            |LAN13
        '':'''''''':'       |       |          '':'''''''':'
        ,'.        ,'.     ,+.     ,+.         ,'.        ,'.
       (VM )....(VM )     (VM )...(VM )       (VM )....(VM )
        `-'        `-'     `-'     `-'         `-'        `-'

        Figure 2 : Logical Service connectivity for a single tenant

   In this example, one or more L3 contexts and one or more LANs
   (e.g., one per application) running on DC switches are assigned to
   DC tenant 1.

   For a multi-tenant DC, a virtualized version of this type of
   service connectivity needs to be provided for each tenant by the
   Network Virtualization solution.
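   To make this tenant view concrete, the following short Python
   sketch (illustrative only; the names TenantTopology, RoutingContext
   and Lan are assumptions of this example and are not terms defined
   by this framework) models one tenant's logical connectivity as a
   set of L3 contexts interconnecting per-application LANs, as in
   Figure 2:

      # Illustrative sketch only: a tenant's logical view as routing
      # contexts (Rtr1, Rtr2) interconnecting per-application LANs.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class Lan:                       # e.g. "LAN11", one per application
          name: str
          end_systems: List[str] = field(default_factory=list)  # VM names

      @dataclass
      class RoutingContext:            # e.g. "Rtr1", a per-tenant L3 context
          name: str
          lans: List[Lan] = field(default_factory=list)

      @dataclass
      class TenantTopology:
          tenant: str
          contexts: Dict[str, RoutingContext] = field(default_factory=dict)

      # Tenant 1 as in Figure 2: two L3 contexts; LAN12 attaches to both.
      t1 = TenantTopology("tenant1")
      t1.contexts["Rtr1"] = RoutingContext("Rtr1",
          [Lan("LAN11", ["VM1", "VM2"]), Lan("LAN12", ["VM3"])])
      t1.contexts["Rtr2"] = RoutingContext("Rtr2",
          [Lan("LAN12", ["VM4"]), Lan("LAN13", ["VM5", "VM6"])])

   The Network Virtualization solution would then maintain one such
   logical topology per tenant.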
2. Reference Models

2.1. Generic Reference Model

   The following diagram shows a DC reference model for network
   virtualization using Layer 3 overlays, where edge devices provide a
   logical interconnect between Tenant End Systems that belong to a
   specific tenant network.

         +--------+                                    +--------+
         | Tenant |                                    | Tenant |
         | End    +--+                          +------| End    |
         | System |  |                          |      | System |
         +--------+  |    ...................   |      +--------+
                     |    +-+--+       +--+-+   |
                     |    | NV |       | NV |   |
                     +----|Edge|       |Edge|---+
                          +-+--+       +--+-+
                          / .   L3 Overlay .  \
         +--------+      /  .    Network   .   \      +--------+
         | Tenant +-----+   .              .    +-----| Tenant |
         | End    |         .              .          | End    |
         | System |         .   +----+     .          | System |
         +--------+          ...| NV |......           +--------+
                                |Edge|
                                +----+
                                   |
                                   |
                               +--------+
                               | Tenant |
                               | End    |
                               | System |
                               +--------+

         Figure 3 : Generic reference model for DC network
              virtualization over a Layer 3 infrastructure

   The functional components in Figure 3 do not necessarily map
   directly to the physical components described in Figure 1.

   For example, an End Device in Figure 1 can be a server blade with
   VMs and a virtual switch, i.e., the VM is the Tenant End System and
   the NVE functions may be performed by the virtual switch and/or the
   hypervisor.

   Another example is the case where an End Device in Figure 1 is a
   traditional physical server (no VMs, no virtual switch), i.e., the
   server is the Tenant End System and the NVE functions may be
   performed by the ToR.  Other End Devices in this category are
   Physical Network Appliances or Storage Systems.

   A Tenant End System attaches to a Network Virtualization Edge (NVE)
   node, either directly or via a switched network (typically
   Ethernet).

   The NVE implements network virtualization functions that allow for
   L2 and/or L3 tenant separation and for hiding tenant addressing
   information (MAC and IP addresses), tenant-related control plane
   activity and service contexts from the Routed Core nodes.

   Core nodes utilize L3 techniques to interconnect NVE nodes in
   support of the overlay network.  Specifically, they perform
   forwarding based on the outer L3 tunnel header, and generally do
   not maintain per-tenant-service state, although some applications
   (e.g., multicast) may require control plane or forwarding plane
   information that pertains to a tenant, a group of tenants, a tenant
   service or a set of services that belong to one or more tenants.
   When such tenant or tenant-service related information is
   maintained in the core, overlay virtualization provides knobs to
   control the magnitude of that information.

2.2. NVE Reference Model

   Figure 4 depicts the NVE reference model.  An NVE contains one or
   more tenant service instances, whereby a Tenant End System
   interfaces with its associated tenant service instance.  The NVE
   also contains an overlay module that provides tunneling overlay
   functions (e.g., encapsulation/decapsulation of tenant traffic
   from/to the tenant forwarding instance, tenant identification and
   mapping, etc.), as described in Figure 4.

                      +-------- L3 Network -------+
                      |                           |
                      |                           |
         +------------+--------+       +----------+----------+
         | +----------+------+ |       | +--------+--------+ |
         | | Overlay Module  | |       | | Overlay Module  | |
         | +--------+--------+ |       | +--------+--------+ |
         |          |          |       |          |          |
         |  NVE1    |          |       |          |    NVE2  |
         |  +-------+-------+  |       |  +-------+-------+  |
         |  |Tenant Instance|  |       |  |Tenant Instance|  |
         |  +-+-----------+-+  |       |  +-+-----------+-+  |
         |    |           |    |       |    |           |    |
         +----+-----------+----+       +----+-----------+----+
              |           |                 |           |
       -------+-----------+-----------------+-----------+-------
              |           |     Tenant      |           |
              |           |   Service IF    |           |
             Tenant End Systems            Tenant End Systems

            Figure 4 : Generic reference model for NV Edge

   Note that some NVE functions may reside in one device or may be
   implemented separately in different devices: for example, the data
   plane may reside in one device while the control plane components
   may be distributed across multiple devices.

   The NVE functionality could reside solely on the End Devices, on
   the ToRs, or on both the End Devices and the ToRs.  In the latter
   case we say that the End Device NVE component acts as the NVE
   spoke, and the ToRs act as NVE hubs.  Tenant End Systems will
   interface with the tenant service instances maintained on the NVE
   spokes, and tenant service instances maintained on the NVE spokes
   will interface with the tenant service instances maintained on the
   NVE hubs.
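   As a rough illustration of this reference model (not a normative
   data model; the class and attribute names below are assumptions of
   this sketch), an NVE can be thought of as an overlay module plus a
   set of tenant service instances exposing tenant service interfaces
   toward the attached Tenant End Systems:

      # Illustrative sketch of the NVE reference model in Figure 4.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class TenantInstance:
          tenant_id: str
          service_type: str          # "L2" (ELAN-like) or "L3" (VRF-like)
          vaps: List[str] = field(default_factory=list)  # service IFs

      @dataclass
      class OverlayModule:
          local_tunnel_address: str  # outer L3 address of this NVE

      @dataclass
      class NVE:
          name: str
          overlay: OverlayModule
          tenants: Dict[str, TenantInstance] = field(default_factory=dict)

      nve1 = NVE("NVE1", OverlayModule("192.0.2.1"))
      nve1.tenants["tenant1"] = TenantInstance("tenant1", "L2",
                                               vaps=["vap-vm1"])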
2.3. NVE Service Types

   NVE components may be used to provide different types of
   virtualized service connectivity.  This section defines the service
   types and associated attributes.

2.3.1. L2 NVE providing Ethernet LAN-like service

   An L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet
   based multipoint service where the Tenant End Systems appear to be
   interconnected by a LAN environment over a set of L3 tunnels.  It
   provides a per-tenant virtual switching instance with an associated
   MAC FIB, MAC address isolation across tenants, and L3 tunnel
   encapsulation across the core.

2.3.2. L3 NVE providing IP/VRF-like service

   Virtualized IP routing and forwarding is similar, from a service
   definition perspective, to IETF IP VPNs (e.g., BGP/MPLS IP VPN and
   IPsec VPNs).  It provides a per-tenant routing instance with an
   associated IP FIB, IP address isolation across tenants, and L3
   tunnel encapsulation across the core.

3. Functional components

   This section breaks down the Network Virtualization architecture
   into functional components to make it easier to discuss solution
   options for different modules.

   This version of the document gives an overview of generic
   functional components that are shared between L2 and L3 service
   types.  Details specific to each service type will be added in
   future revisions.

3.1. Generic service virtualization components

   A Network Virtualization solution is built around a number of
   functional components, as depicted in Figure 5:

                      +-------- L3 Network -------+
                      |                           |
                      |      Tunnel Overlay       |
         +------------+--------+       +----------+----------+
         | +----------+------+ |       | +--------+--------+ |
         | | Overlay Module  | |       | | Overlay Module  | |
         | +--------+--------+ |       | +--------+--------+ |
         |          |Tenant ID |       |          |Tenant ID |
         |          | (TNI)    |       |          | (TNI)    |
         |  +-------+-------+  |       |  +-------+-------+  |
         |  |Tenant Instance|  |       |  |Tenant Instance|  |
    NVE1 |  +-+-----------+-+  |       |  +-+-----------+-+  | NVE2
         |    |   VAPs    |    |       |    |   VAPs    |    |
         +----+-----------+----+       +----+-----------+----+
              |           |                 |           |
       -------+-----------+-----------------+-----------+-------
              |           |     Tenant      |           |
              |           |   Service IF    |           |
             Tenant End Systems            Tenant End Systems

         Figure 5 : Generic functional components of the NV Edge

3.1.1. Virtual Attachment Points (VAPs)

   Tenant End Systems are connected to the Tenant Instance through
   Virtual Attachment Points (VAPs).  In practice, a VAP can be a
   physical port on a ToR or a virtual port identified through a
   logical interface identifier (e.g., a VLAN, or an internal vSwitch
   interface ID leading to a VM).

3.1.2. Tenant Instance

   The Tenant Instance represents a set of configuration attributes
   defining access and tunnel policies and (L2 and/or L3) forwarding
   functions, and possibly control plane functions.

   Per-tenant FIB tables and control plane protocol instances are used
   to maintain separate private contexts across tenants.  Hence,
   tenants are free to use their own addressing schemes without
   concerns about address overlap with other tenants.
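   The following sketch (illustrative only; keying a VAP on a
   (physical port, VLAN) pair and the helper names are assumptions of
   this example) shows how a VAP identifier can select a Tenant
   Instance, and how per-tenant FIBs keep overlapping tenant addresses
   private to each tenant:

      # Illustrative only.  A VAP is identified here by (port, VLAN);
      # it could equally be a vSwitch interface ID leading to a VM.
      vap_to_tenant = {
          ("port1", 100): "tenant1",
          ("port1", 200): "tenant2",
      }

      # Per-tenant FIBs: the same tenant address may appear in both
      # tables without conflict, since lookups are scoped by tenant.
      fib = {
          "tenant1": {"00:aa:bb:cc:dd:01": "NVE2"},
          "tenant2": {"00:aa:bb:cc:dd:01": "NVE3"},  # overlap, no clash
      }

      def forward(port, vlan, dst_mac):
          tenant = vap_to_tenant[(port, vlan)]     # VAP -> Tenant Instance
          return tenant, fib[tenant].get(dst_mac)  # scoped FIB lookup

      assert forward("port1", 100, "00:aa:bb:cc:dd:01") == ("tenant1", "NVE2")
      assert forward("port1", 200, "00:aa:bb:cc:dd:01") == ("tenant2", "NVE3")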
3.1.3. Overlay Modules and Tenant ID

   Mechanisms for identifying each tenant service are required to
   allow the simultaneous overlay of multiple tenant services over the
   same underlay L3 network topology.  In the data plane, each NVE,
   upon sending a tenant packet, must be able to encode the TNI for
   the destination NVE, in addition to the tunnel source L3 address
   identifying the source NVE and the tunnel destination L3 address
   identifying the destination NVE.  This allows the destination NVE
   to identify the tenant service instance and therefore appropriately
   process and forward the tenant packet.

   The Overlay module provides tunneling overlay functions: tunnel
   initiation/termination, encapsulation/decapsulation of frames
   from/to the VAPs and the L3 backbone, and may provide for transit
   forwarding of IP traffic (e.g., transparent forwarding of tunnel
   packets).

   In a multi-tenant context, the tunnel aggregates frames from/to
   different Tenant Instances.  Tenant identification and traffic
   demultiplexing are based on the Tenant Identifier (TNI).

   Historically, the following approaches have been considered:

   o One ID per Tenant: A globally unique (on a per-DC administrative
     domain basis) Tenant ID is used to identify the related Tenant
     Instances.  An example of this approach is the use of IEEE VLAN
     or I-SID tags to provide virtual L2 domains.

   o One ID per Tenant Instance: A per-tenant local ID is
     automatically generated by the egress NVE and usually distributed
     by a control plane protocol to all the related NVEs.  An example
     of this approach is the use of per-VRF MPLS labels in IP VPN
     [RFC4364].

   o One ID per VAP: A per-VAP local ID is assigned and usually
     distributed by a control plane protocol.  An example of this
     approach is the use of per-CE-PE MPLS labels in IP VPN [RFC4364].

   Note that when using one ID per Tenant Instance or per VAP, an
   additional global identifier may be used by the control plane to
   identify the Tenant context (e.g., historically equivalent to the
   route target community attribute in [RFC4364]).

3.1.4. Tunnel Overlays and Encapsulation options

   Once the TNI is added to the tenant data frame, an L3 tunnel
   encapsulation is used to transport the resulting frame to the
   destination NVE.  The backbone devices do not usually keep any
   per-service state, simply forwarding the frames based on the outer
   tunnel header.

   Different IP tunneling options (e.g., GRE, L2TPv3, IPsec) and
   MPLS-based tunneling options (e.g., BGP VPN, PW, VPLS) can be used
   for tunneling Ethernet and IP packets.
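   As a purely illustrative sketch of this data path (the layout below
   is a generic IP/UDP encapsulation carrying a 24-bit TNI, loosely
   similar in spirit to VXLAN-style schemes; the field sizes, the UDP
   port value and the function names are assumptions of this example
   and are not defined by this framework):

      # Illustrative only: a generic "outer IP/UDP + TNI" encapsulation.
      import socket
      import struct

      OVERLAY_UDP_PORT = 4789      # assumed value for this example

      def encapsulate(tenant_frame: bytes, tni: int,
                      src_nve_ip: str, dst_nve_ip: str) -> bytes:
          # 8-byte overlay header carrying a 24-bit Tenant Identifier.
          overlay_hdr = struct.pack("!II", 0x08000000, tni << 8)
          payload = overlay_hdr + tenant_frame

          udp_hdr = struct.pack("!HHHH", 49152, OVERLAY_UDP_PORT,
                                8 + len(payload), 0)  # csum 0 (sketch)

          ip_hdr = struct.pack("!BBHHHBBH4s4s",
                               0x45, 0, 28 + len(payload),  # ver/IHL, TOS, len
                               0, 0, 64, 17, 0,             # id, frag, TTL, UDP, csum
                               socket.inet_aton(src_nve_ip),
                               socket.inet_aton(dst_nve_ip))
          return ip_hdr + udp_hdr + payload

      # The egress NVE reverses the process: strip the outer headers,
      # read the TNI and hand the inner frame to that Tenant Instance.
      def decapsulate(packet: bytes):
          overlay_hdr = packet[28:36]
          tni = struct.unpack("!I", overlay_hdr[4:8])[0] >> 8
          return tni, packet[36:]

   The egress NVE selects the Tenant Instance from the TNI rather than
   from the inner addresses, which is what keeps tenant address spaces
   independent of the underlay.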
3.1.5. Control Plane Components

   Control plane components may be used to provide the following
   capabilities:

   . Service auto-provisioning/auto-discovery

   . Address advertisement and tunnel mapping

   . Tunnel establishment/tear-down and routing

   A control plane component can be an on-net control protocol or a
   management control entity.

3.1.5.1. Auto-provisioning/Service discovery

   NVEs must be able to select the appropriate Tenant Instance for
   each Tenant End System.  This is based on state information that is
   often provided by external entities.  For example, in a VM
   environment, this information is provided by compute management
   systems, since these are the only entities that have visibility of
   which VM belongs to which tenant.

   A mechanism for communicating this information between the Tenant
   End Systems and the local NVE is required.  As a result, the VAPs
   are created and mapped to the appropriate Tenant Instance.

   Depending upon the implementation, this control interface can be
   implemented using an auto-discovery protocol between Tenant End
   Systems and their local NVE, or through management entities.

   When a protocol is used, appropriate security and authentication
   mechanisms to verify that Tenant End System information is not
   spoofed or altered are required.  This is one critical aspect for
   providing integrity and tenant isolation in the system.

   Another control plane protocol can also be used to advertise the
   NVE tenant service instances (tenant and service type provided to
   the tenant) to other NVEs.  Alternatively, management control
   entities can also be used to perform these functions.

3.1.5.2. Address advertisement and tunnel mapping

   As traffic reaches an ingress NVE, a lookup is performed to
   determine which tunnel the packet needs to be sent to.  It is then
   encapsulated with a tunnel header containing the destination
   address of the egress NVE.  Intermediate nodes (between the ingress
   and egress NVEs) switch or route traffic based upon the outer
   destination address.  It should be noted that an NVE may be
   implemented on a gateway, to provide traffic forwarding between two
   different types of overlay networks, and thus may not be directly
   connected to a Tenant End System.

   One key step in this process consists of mapping a final
   destination address to the proper tunnel.  NVEs are responsible for
   maintaining such mappings in their lookup tables.  Several ways of
   populating these lookup tables are possible: control plane driven,
   management plane driven, or data plane driven.

   When a control plane protocol is used to distribute address
   advertisement and tunneling information, the service
   auto-provisioning/auto-discovery could be accomplished by the same
   protocol.  In this scenario, the auto-provisioning/service
   discovery could be combined with (i.e., be inferred from) the
   address advertisement and tunnel mapping.  Furthermore, a control
   plane protocol that carries both IP addresses and associated MACs
   eliminates the need for ARP, and hence addresses one of the issues
   with excessive ARP handling.
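   As a minimal illustration of this lookup (the table layout and
   function names are assumptions of this sketch, and the control
   plane is reduced to a function call), an ingress NVE can keep a
   per-tenant table mapping tenant destination addresses to the egress
   NVE tunnel endpoint behind which they are reachable:

      # Illustrative only: per-tenant mapping of tenant addresses to
      # the tunnel endpoint (egress NVE) behind which they sit.
      from collections import defaultdict

      tunnel_table = defaultdict(dict)  # tenant -> {addr: egress NVE IP}

      def on_advertisement(tenant, tenant_addr, egress_nve_ip):
          # Called when the control or management plane advertises
          # that tenant_addr is reachable via egress_nve_ip.
          tunnel_table[tenant][tenant_addr] = egress_nve_ip

      def on_withdrawal(tenant, tenant_addr):
          tunnel_table[tenant].pop(tenant_addr, None)

      def ingress_lookup(tenant, tenant_dst_addr):
          # Returns the outer destination address to encapsulate to,
          # or None if the destination is unknown (see Section 4.2.3
          # for how unknown destinations may be handled).
          return tunnel_table[tenant].get(tenant_dst_addr)

      on_advertisement("tenant1", "10.1.1.5", "192.0.2.2")
      assert ingress_lookup("tenant1", "10.1.1.5") == "192.0.2.2"
      assert ingress_lookup("tenant1", "10.1.1.9") is None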
3.1.5.3. Tunnel management

   A control plane protocol may be required to set up and tear down
   tunnels, exchange tunnel state information, and/or provide for
   tunnel endpoint routing.  This applies to both unicast and
   multicast tunnels.

   For instance, it may be necessary to provide active/standby tunnel
   status information between NVEs, up/down status information,
   pruning/grafting information for multicast tunnels, etc.

3.2. Service Overlay Topologies

   A number of service topologies may be used to optimize the service
   connectivity and to address NVE performance limitations.

   The topology described in Figure 3 suggests the use of a tunnel
   mesh between the NVEs, where each tenant instance is one hop away
   from a service processing perspective.  This should not be
   construed to imply that a tunnel mesh must be configured, as
   tunneling can simply be encapsulation/decapsulation with a tunnel
   header.  Partial mesh topologies and an NVE hierarchy may be used,
   where certain NVEs may act as service transit points.

4. Key aspects of overlay networks

   The intent of this section is to highlight specific issues that
   proposed overlay solutions need to address.

4.1. Pros & Cons

   An overlay network is a layer of virtual network topology on top of
   the physical network.

   Overlay networks offer the following key advantages:

   o Unicast tunneling state management is handled at the edge of the
     network.  Intermediate transport nodes are unaware of such state.
     Note that this is often not the case when multicast is enabled in
     the core network.

   o Tunnels are used to aggregate traffic and hence offer the
     advantage of minimizing the amount of forwarding state required
     within the underlay network.

   o Decoupling of the overlay addresses (MAC and IP) used by VMs, or
     Tenant End Systems in general, from the underlay network.  This
     offers a clear separation between the addresses used within the
     overlay and those used in the underlay network, and it enables
     the use of overlapping address spaces by Tenant End Systems.

   o Support of a large number of virtual network identifiers.

   Overlay networks also create several challenges:

   o Overlay networks have no control over the underlay networks and
     lack critical network information.

      o Overlays typically probe the network to measure link
        properties, such as available bandwidth or packet loss rate,
        but it is difficult to accurately evaluate network properties
        this way.  It might be preferable for the underlay network to
        expose its usage and performance information to the overlay
        networks.

   o Miscommunication between overlay and underlay networks can lead
     to inefficient usage of network resources.

   o Fairness of resource sharing and coordination among edge nodes in
     overlay networks are two critical issues.  When multiple overlays
     co-exist on top of a common underlay network, the lack of
     coordination between overlays can lead to performance issues.

   o Overlaid traffic may not traverse firewalls and NAT devices.

   o Multicast service scalability: multicast support may be required
     in the overlay network to provide each tenant with flood
     containment or efficient multicast handling.

   o Load balancing may not be optimal, as the hash algorithm may not
     work well due to the limited number of combinations of tunnel
     source and destination addresses.

4.2. Overlay issues to consider

4.2.1. Data plane vs Control plane driven

   Dynamic (data plane) learning implies that flooding of unknown
   destinations be supported, and hence implies that broadcast and/or
   multicast be supported.  Multicasting in the core network for
   dynamic learning can lead to significant scalability limitations.
   Specific forwarding rules must be enforced to prevent loops from
   happening.  This can be achieved using a spanning tree protocol, a
   shortest path tree, or split-horizon forwarding rules.

   It should be noted that the amount of state to be distributed is a
   function of the number of virtual machines.  Different forms of
   caching can also be utilized to minimize state distribution among
   the various elements.

4.2.2. Coordination between data plane and control plane

   Often, a combination of dynamic data plane learning and control
   plane based learning is necessary.  MAC data plane learning or IP
   data plane learning can be applied on tenant VAPs at the NVE,
   whereas control plane based MAC and IP reachability distribution
   can be performed across the overlay network among the NVEs,
   possibly with the help of a control plane mediation device (e.g., a
   BGP route reflector if BGP is used to distribute such information).
   Coordination between the data plane learning process and the
   control plane reachability distribution process is needed such
   that, when a new address gets learned or an old address is removed,
   the local control plane is triggered to advertise this information
   to its peers.
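   The following sketch (illustrative only; the callback names and the
   aging model are assumptions of this example) shows the kind of
   coordination described above: addresses learned in the data plane
   on a local VAP trigger a control plane advertisement to peer NVEs,
   and are withdrawn when they age out:

      # Illustrative only: data plane learning on local VAPs
      # coordinated with control plane advertisement to peer NVEs.
      import time

      local_fib = {}      # tenant MAC -> (local VAP, last-seen time)
      AGE_LIMIT = 300     # seconds, assumed value

      def advertise(mac):  # placeholder for the control plane action,
          pass             # e.g. an update sent via a route reflector

      def withdraw(mac):
          pass

      def on_frame_received(src_mac, vap):
          # Data plane learning: remember which VAP the source is on.
          is_new = src_mac not in local_fib
          local_fib[src_mac] = (vap, time.time())
          if is_new:
              advertise(src_mac)     # trigger control plane distribution

      def age_out():
          now = time.time()
          for mac, (_, seen) in list(local_fib.items()):
              if now - seen > AGE_LIMIT:
                  del local_fib[mac]
                  withdraw(mac)      # keep peers consistent with us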
4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

   There are two techniques to support the packet replication needed
   for broadcast, unknown unicast and multicast:

   o Ingress replication

   o Use of core multicast trees

   There is a bandwidth vs. state trade-off between the two
   approaches.  Which way to trade bandwidth for state depends upon
   the degree of replication required (i.e., the number of hosts per
   group) and the amount of multicast state to maintain.

   When the number of hosts per group is large, the use of core
   multicast trees may be more appropriate.  When the number of hosts
   is small (e.g., 2-3), ingress replication may not be an issue,
   depending on the multicast stream bandwidth.

   Depending upon the size of the data center network, and hence the
   number of (S,G) entries, but also the duration of multicast flows,
   the use of core multicast trees can be a challenge.

   When flows are well known, it is possible to pre-provision such
   multicast trees.  However, it is often difficult to predict
   application flows ahead of time, and hence programming of (S,G)
   entries for short-lived flows could be impractical.

   A possible trade-off is to use shared multicast trees in the core
   as opposed to dedicated multicast trees.
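   As an illustration of the ingress replication option (names are
   assumptions of this sketch; with a core multicast tree the loop
   below would be replaced by a single send onto a multicast group),
   the ingress NVE simply unicasts one copy of a BUM frame to every
   other NVE participating in the tenant's overlay:

      # Illustrative only: ingress replication of BUM traffic.
      # flood_list holds the other NVEs participating in the tenant
      # instance; encapsulate is an encapsulation function such as
      # the one sketched at the end of Section 3.1.4.
      flood_list = {
          "tenant1": ["192.0.2.2", "192.0.2.3", "192.0.2.4"],
      }

      def send(packet, dst_ip):   # placeholder for the underlay send
          pass

      def flood_bum(tenant, tni, frame, local_nve_ip, encapsulate):
          # Bandwidth/state trade-off: N-1 unicast copies here, versus
          # one copy plus per-(S,G) state in the core with multicast.
          for remote_nve in flood_list[tenant]:
              send(encapsulate(frame, tni, local_nve_ip, remote_nve),
                   remote_nve)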
4.2.4. Path MTU

   When using overlay tunneling, an outer header is added to the
   original tenant frame.  This can cause the MTU of the path to the
   egress tunnel endpoint to be exceeded.

   In this section, we only consider the case of an IP overlay.

   It is usually not desirable to rely on IP fragmentation, for
   performance reasons.  Ideally, the interface MTU as seen by a
   Tenant End System is adjusted such that no fragmentation is needed.
   TCP will adjust its maximum segment size accordingly.

   It is possible for the MTU to be configured manually or to be
   discovered dynamically.  Various Path MTU discovery techniques
   exist in order to determine the proper MTU size to use:

   o Classical ICMP-based Path MTU Discovery [RFC1191] [RFC1981]

      o Tenant End Systems rely on ICMP messages to discover the MTU
        of the end-to-end path to their destination.  This method is
        not always possible, such as when traversing middleboxes
        (e.g., firewalls) which disable ICMP for security reasons.

   o Extended MTU Path Discovery techniques such as defined in
     [RFC4821]

   It is also possible to rely on the overlay layer to perform
   segmentation and reassembly operations, without relying on the
   Tenant End Systems to know about the end-to-end MTU.  The
   assumption is that some hardware assist is available on the NVE
   node to perform such fragmentation and reassembly operations.
   However, fragmentation by the overlay layer can lead to performance
   and congestion issues due to TCP dynamics, and might require new
   congestion avoidance mechanisms in the underlay network [FLOYD].

   Finally, the underlay network may be designed in such a way that
   the MTU can accommodate the extra tunnel overhead.
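   As a worked example of the overhead involved (the numbers are
   assumptions for illustration: a 1500-byte underlay MTU and the
   generic outer IPv4/UDP/8-byte overlay header used in the sketch at
   the end of Section 3.1.4), the tenant-facing MTU can be derived
   from the underlay MTU minus the tunnel overhead:

      # Illustrative arithmetic only; header sizes depend on the
      # actual encapsulation and on IPv4 vs IPv6 in the underlay.
      UNDERLAY_MTU   = 1500    # assumed path MTU between NVEs
      OUTER_IPV4     = 20
      OUTER_UDP      = 8
      OVERLAY_HDR    = 8       # carries the TNI in this sketch
      INNER_ETHERNET = 14      # inner tenant Ethernet header, untagged

      tunnel_overhead = OUTER_IPV4 + OUTER_UDP + OVERLAY_HDR
      tenant_if_mtu   = UNDERLAY_MTU - tunnel_overhead - INNER_ETHERNET

      print(tunnel_overhead)   # 36 bytes of encapsulation overhead
      print(tenant_if_mtu)     # 1450: IP MTU seen by Tenant End Systems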
4.2.5. NVE location trade-offs

   In the case of DC traffic, traffic originating from a VM is native
   Ethernet traffic.  This traffic may receive an ELAN service or an
   IP service.  In the case of an ELAN service, it can be switched by
   a local VM switch or a ToR switch, and then by a DC gateway.  The
   NVE function can be embedded within any of these elements.

   There are several criteria to consider when deciding where the NVE
   processing boundary lies:

   o Processing and memory requirements

   o Datapath (e.g., FIB size, lookups, filtering,
     encapsulation/decapsulation)

   o Control plane (e.g., RIB size, routing, signaling, OAM)

   o Multicast support

   o Routing protocols

   o Packet replication capability

   o Fragmentation support

   o QoS transparency

   o Resiliency

4.2.6. Interaction between network overlays and underlays

   When multiple overlays co-exist on top of a common underlay
   network, this can cause performance issues.  These overlays have
   partially overlapping paths and nodes.

   Each overlay is selfish by nature, in that it sends traffic so as
   to optimize its own performance without considering the impact on
   other overlays, unless the underlay tunnels are traffic engineered
   on a per-overlay basis so as to avoid oversubscribing underlay
   resources.

   Better visibility between overlays and underlays, or their
   controllers, can be achieved by providing mechanisms to exchange
   information about:

   o Performance metrics (throughput, delay, loss, jitter)

   o Cost metrics

   This information may then be used to traffic engineer the underlay
   network and/or the overlay networks in a coordinated fashion.

5. Security Considerations

   The tenant-to-overlay mapping function can introduce significant
   security risks if the protocols/mechanisms used to establish that
   mapping are not trusted, do not support mutual authentication, or
   cannot be established over trusted interfaces and/or mutually
   authenticated connections.

   No other new security issues are introduced beyond those already
   described in the related L2VPN and L3VPN RFCs.

6. IANA Considerations

   IANA does not need to take any action for this draft.

7. References

7.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

7.2. Informative References

   [NVOPS]   Narten, T., et al., "Problem Statement: Overlays for
             Network Virtualization", draft-narten-nvo3-overlay-
             problem-statement (work in progress).

   [OVCPREQ] Kreeger, L., et al., "Network Virtualization Overlay
             Control Protocol Requirements", draft-kreeger-nvo3-
             overlay-cp (work in progress).

   [FLOYD]   Floyd, S. and A. Romanow, "Dynamics of TCP Traffic over
             ATM Networks", IEEE JSAC, Vol. 13, No. 4, May 1995.

   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
             Networks (VPNs)", RFC 4364, February 2006.

   [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
             November 1990.

   [RFC1981] McCann, J., et al., "Path MTU Discovery for IPv6",
             RFC 1981, August 1996.

   [RFC4821] Mathis, M., et al., "Packetization Layer Path MTU
             Discovery", RFC 4821, March 2007.

8. Acknowledgments

   In addition to the authors, the following people have contributed
   to this document:

   Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent

   Javier Benitez, Colt

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Marc Lasserre
   Alcatel-Lucent
   Email: marc.lasserre@alcatel-lucent.com

   Florin Balus
   Alcatel-Lucent
   777 E. Middlefield Road
   Mountain View, CA, USA 94043
   Email: florin.balus@alcatel-lucent.com

   Thomas Morin
   France Telecom Orange
   Email: thomas.morin@orange.com

   Nabil Bitar
   Verizon
   60 Sylvan Road
   Waltham, MA 02145
   Email: nabil.n.bitar@verizon.com

   Yakov Rekhter
   Juniper
   Email: yakov@juniper.net

   Yuichi Ikejiri
   NTT Communications
   1-1-6, Uchisaiwai-cho, Chiyoda-ku
   Tokyo, 100-8019 Japan
   Email: y.ikejiri@ntt.com