2 Internet Engineering Task Force Marc Lasserre 3 Internet Draft Florin Balus 4 Intended status: Informational Alcatel-Lucent 5 Expires: January 2013 6 Thomas Morin 7 France Telecom Orange 9 Nabil Bitar 10 Verizon 12 Yakov Rekhter 13 Juniper 15 July 9, 2012 17 Framework for DC Network Virtualization 18 draft-lasserre-nvo3-framework-03.txt 20 Status of this Memo 22 This Internet-Draft is submitted in full conformance with the 23 provisions of BCP 78 and BCP 79. 25 Internet-Drafts are working documents of the Internet Engineering 26 Task Force (IETF). Note that other groups may also distribute 27 working documents as Internet-Drafts. The list of current Internet- 28 Drafts is at http://datatracker.ietf.org/drafts/current/. 30 Internet-Drafts are draft documents valid for a maximum of six 31 months and may be updated, replaced, or obsoleted by other documents 32 at any time. It is inappropriate to use Internet-Drafts as 33 reference material or to cite them other than as "work in progress." 35 This Internet-Draft will expire on January 9, 2013. 37 Copyright Notice 39 Copyright (c) 2012 IETF Trust and the persons identified as the 40 document authors. All rights reserved. 42 This document is subject to BCP 78 and the IETF Trust's Legal 43 Provisions Relating to IETF Documents 44 (http://trustee.ietf.org/license-info) in effect on the date of 45 publication of this document. Please review these documents 46 carefully, as they describe your rights and restrictions with 47 respect to this document.
Code Components extracted from this 48 document must include Simplified BSD License text as described in 49 Section 4.e of the Trust Legal Provisions and are provided without 50 warranty as described in the Simplified BSD License. 52 Abstract 54 Several IETF drafts relate to the use of overlay networks to support 55 large scale virtual data centers. This draft provides a framework 56 for Network Virtualization over L3 (NVO3) and is intended to help 57 plan a set of work items in order to provide a complete solution 58 set. It defines a logical view of the main components with the 59 intention of streamlining the terminology and focusing the solution 60 set. 62 Table of Contents 64 1. Introduction...................................................3 65 1.1. Conventions used in this document.........................4 66 1.2. General terminology.......................................4 67 1.3. DC network architecture...................................6 68 1.4. Tenant networking view....................................7 69 2. Reference Models...............................................8 70 2.1. Generic Reference Model...................................8 71 2.2. NVE Reference Model......................................10 72 2.3. NVE Service Types........................................11 73 2.3.1. L2 NVE providing Ethernet LAN-like service..........11 74 2.3.2. L3 NVE providing IP/VRF-like service................11 75 3. Functional components.........................................11 76 3.1. Generic service virtualization components................12 77 3.1.1. Virtual Access Points (VAPs)........................12 78 3.1.2. Virtual Network Instance (VNI)......................12 79 3.1.3. Overlay Modules and VN Context......................13 80 3.1.4. Tunnel Overlays and Encapsulation options...........14 81 3.1.5. Control Plane Components............................14 82 3.1.5.1. Auto-provisioning/Service discovery...............14 83 3.1.5.2. Address advertisement and tunnel mapping..........15 84 3.1.5.3. Tunnel management.................................15 85 3.2. Service Overlay Topologies...............................16 86 4. Key aspects of overlay networks...............................16 87 4.1. Pros & Cons..............................................16 88 4.2. Overlay issues to consider...............................17 89 4.2.1. Data plane vs Control plane driven..................17 90 4.2.2. Coordination between data plane and control plane...18 91 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) 92 traffic....................................................18 93 4.2.4. Path MTU............................................19 94 4.2.5. NVE location trade-offs.............................19 95 4.2.6. Interaction between network overlays and underlays..20 96 5. Security Considerations.......................................21 97 6. IANA Considerations...........................................21 98 7. References....................................................21 99 7.1. Normative References.....................................21 100 7.2. Informative References...................................21 101 8. Acknowledgments...............................................22 103 1. Introduction 105 This document provides a framework for Data Center Network 106 Virtualization over L3 tunnels. This framework is intended to aid in 107 standardizing protocols and mechanisms to support large scale 108 network virtualization for data centers. 
110 Several IETF drafts relate to the use of overlay networks for data 111 centers. 113 [NVOPS] defines the rationale for using overlay networks in order to 114 build large data center networks. The use of virtualization leads to 115 a very large number of communication domains and end systems that 116 the network must cope with. 118 [OVCPREQ] describes the requirements for a control plane protocol 119 needed by overlay border nodes to exchange overlay mappings. 121 This document provides reference models and functional components of 122 data center overlay networks as well as a discussion of technical 123 issues that have to be addressed in the design of standards and 124 mechanisms for large scale data centers. 126 1.1. Conventions used in this document 128 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 129 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 130 document are to be interpreted as described in RFC-2119 [RFC2119]. 132 In this document, these words will appear with that interpretation 133 only when in ALL CAPS. Lower case uses of these words are not to be 134 interpreted as carrying RFC-2119 significance. 136 1.2. General terminology 138 This document uses the following terminology: 140 NVE: Network Virtualization Edge. It is a network entity that sits 141 on the edge of the NVO3 network. It implements network 142 virtualization functions that allow for L2 and/or L3 tenant 143 separation and for hiding tenant addressing information (MAC and IP 144 addresses). An NVE could be implemented as part of a virtual switch 145 within a hypervisor, a physical switch or router, or a Network Service 146 Appliance, or it could even be embedded within an End Station. 148 VN: Virtual Network. This is a virtual L2 or L3 domain that belongs 149 to a tenant. 151 VNI: Virtual Network Instance. This is one instance of a virtual 152 overlay network. Two Virtual Networks are isolated from one another 153 and may use overlapping addresses. 155 Virtual Network Context or VN Context: Field that is part of the 156 overlay encapsulation header which allows the encapsulated frame to 157 be delivered to the appropriate virtual network endpoint by the 158 egress NVE. The egress NVE uses this field to determine the 159 appropriate virtual network context in which to process the packet. 160 This field MAY be an explicit, unique (to the administrative domain) 161 virtual network identifier (VNID) or MAY express the necessary 162 context information in other ways (e.g., a locally significant 163 identifier). 165 VNID: Virtual Network Identifier. In the case where the VN context 166 has global significance, this is the ID value that is carried in 167 each data packet in the overlay encapsulation that identifies the 168 Virtual Network the packet belongs to. 170 Underlay or Underlying Network: This is the network that provides 171 the connectivity between NVEs. The Underlying Network can be 172 completely unaware of the overlay packets. Addresses within the 173 Underlying Network are also referred to as "outer addresses" because 174 they exist in the outer encapsulation. The Underlying Network can 175 use a completely different protocol (and address family) from that 176 of the overlay. 178 Data Center (DC): A physical complex housing physical servers, 179 network switches and routers, Network Service Appliances and 180 networked storage. The purpose of a Data Center is to provide 181 application and/or compute and/or storage services.
One such service 182 is virtualized data center services, also known as Infrastructure as 183 a Service. 185 Virtual Data Center or Virtual DC: A container for virtualized 186 compute, storage and network services. Managed by a single tenant, a 187 Virtual DC can contain multiple VNs and multiple Tenant End Systems 188 that are connected to one or more of these VNs. 190 VM: Virtual Machine. Several Virtual Machines can share the 191 resources of a single physical computer server using the services of 192 a Hypervisor (see definition below). 194 Hypervisor: Server virtualization software running on a physical 195 compute server that hosts Virtual Machines. The hypervisor provides 196 shared compute/memory/storage and network connectivity to the VMs 197 that it hosts. Hypervisors often embed a Virtual Switch (see below). 199 Virtual Switch: A function within a Hypervisor (typically 200 implemented in software) that provides similar services to a 201 physical Ethernet switch. It switches Ethernet frames between VMs' 202 virtual NICs within the same physical server, or between a VM and a 203 physical NIC card connecting the server to a physical Ethernet 204 switch. It also enforces network isolation between VMs that should 205 not communicate with each other. 207 Tenant: A customer who consumes virtualized data center services 208 offered by a cloud service provider. A single tenant may consume one 209 or more Virtual Data Centers hosted by the same cloud service 210 provider. 212 Tenant End System: An end system of a particular tenant, 213 which can be, for instance, a virtual machine (VM), a non-virtualized 214 server, or a physical appliance. 216 ELAN: MEF ELAN, a multipoint-to-multipoint Ethernet service 217 EVPN: Ethernet VPN as defined in [EVPN] 219 1.3. DC network architecture 221 A generic architecture for Data Centers is depicted in Figure 1: 223 ,---------. 224 ,' `. 225 ( IP/MPLS WAN ) 226 `. ,' 227 `-+------+' 228 +--+--+ +-+---+ 229 |DC GW|+-+|DC GW| 230 +-+---+ +-----+ 231 | / 232 .--. .--. 233 ( ' '.--. 234 .-.' Intra-DC ' 235 ( network ) 236 ( .'-' 237 '--'._.'. )\ \ 238 / / '--' \ \ 239 / / | | \ \ 240 +---+--+ +-`.+--+ +--+----+ 241 | ToR | | ToR | | ToR | 242 +-+--`.+ +-+-`.-+ +-+--+--+ 243 .' \ .' \ .' `. 244 __/_ _i./ i./_ _\__ 245 '--------' '--------' '--------' '--------' 246 : End : : End : : End : : End : 247 : Device : : Device : : Device : : Device : 248 '--------' '--------' '--------' '--------' 250 Figure 1 : A Generic Architecture for Data Centers 252 An example of a multi-tier DC network architecture is presented in 253 this figure. It provides a view of the physical components inside a DC. 255 A cloud network is composed of intra-Data Center (DC) networks and 256 network services, and inter-DC networks and network connectivity 257 services. Depending upon the scale, DC distribution, operations 258 model, Capex and Opex aspects, DC networking elements can act as 259 strict L2 switches and/or provide IP routing capabilities, including 260 service virtualization. 262 In some DC architectures, some tier layers that 263 provide L2 and/or L3 services may be collapsed, and Internet 264 connectivity, inter-DC connectivity and VPN support may be handled by 265 a smaller number of nodes. Nevertheless, one can assume that the 266 functional blocks fit with the architecture above. 268 The following components can be present in a DC: 270 o End Device: a DC resource to which the networking service is 271 provided.
An End Device may be a compute resource (server or 272 server blade), a storage component or a network appliance 273 (firewall, load-balancer, IPsec gateway). Alternatively, the 274 End Device may include software-based networking functions used 275 to interconnect multiple hosts. An example of such soft networking 276 is the virtual switch in the server blades, used to 277 interconnect multiple virtual machines (VMs). An End Device may be 278 single- or multi-homed to the Top of Rack switches (ToRs). 280 o Top of Rack (ToR): Hardware-based Ethernet switch aggregating 281 all Ethernet links from the End Devices in a rack, representing 282 the entry point into the physical DC network for the hosts. ToRs 283 may also provide routing functionality, virtual IP network 284 connectivity, or Layer 2 tunneling over IP, for instance. ToRs 285 are usually multi-homed to switches in the Intra-DC network. 286 Other deployment scenarios may use an intermediate Blade Switch 287 before the ToR or an EoR (End of Row) switch to provide a 288 function similar to that of a ToR. 290 o Intra-DC Network: High capacity network composed of core 291 switches aggregating multiple ToRs. Core switches are usually 292 Ethernet switches but can also support routing capabilities. 294 o DC GW: Gateway to the outside world providing DC Interconnect 295 and connectivity to the Internet and to VPN customers. In the current 296 DC network model, this may simply be a router connected to the 297 Internet and/or an IPVPN/L2VPN PE. Some network implementations 298 may dedicate DC GWs for different connectivity types (e.g., a 299 DC GW for the Internet, and another for VPNs). 301 1.4. Tenant networking view 303 The DC network architecture is used to provide L2 and/or L3 service 304 connectivity to each tenant. An example is depicted in Figure 2: 306 +----- L3 Infrastructure ----+ 307 | | 308 ,--+-'. ;--+--. 309 ..... Rtr1 )...... . Rtr2 ) 310 | '-----' | '-----' 311 | Tenant1 |LAN12 Tenant1| 312 |LAN11 ....|........ |LAN13 313 '':'''''''':' | | '':'''''''':' 314 ,'. ,'. ,+. ,+. ,'. ,'. 315 (VM )....(VM ) (VM )... (VM ) (VM )....(VM ) 316 `-' `-' `-' `-' `-' `-' 318 Figure 2 : Logical Service connectivity for a single tenant 320 In this example, one or more L3 contexts and one or more LANs (e.g., 321 one per application type) running on DC switches are assigned to DC 322 tenant 1. 324 For a multi-tenant DC, a virtualized version of this type of service 325 connectivity needs to be provided for each tenant by the Network 326 Virtualization solution. 328 2. Reference Models 330 2.1. Generic Reference Model 332 The following diagram shows a DC reference model for network 333 virtualization using Layer 3 overlays where edge devices provide a 334 logical interconnect between Tenant End Systems that belong to a 335 specific tenant network. 337 +--------+ +--------+ 338 | Tenant | | Tenant | 339 | End +--+ +---| End | 340 | System | | | | System | 341 +--------+ | ................... | +--------+ 342 | +-+--+ +--+-+ | 343 | | NV | | NV | | 344 +--|Edge| |Edge|--+ 345 +-+--+ +--+-+ 346 / . L3 Overlay . \ 347 +--------+ / . Network . \ +--------+ 348 | Tenant +--+ . . +----| Tenant | 349 | End | . . | End | 350 | System | . +----+ . | System | 351 +--------+ .....| NV |........
+--------+ 352 |Edge| 353 +----+ 354 | 355 | 356 +--------+ 357 | Tenant | 358 | End | 359 | System | 360 +--------+ 362 Figure 3 : Generic reference model for DC network virtualization 363 over a Layer3 infrastructure 365 The functional components in this picture do not necessarily map 366 directly to the physical components described in Figure 1. 368 For example, an End Device can be a server blade with VMs and a 369 virtual switch, i.e., the VM is the Tenant End System and the NVE 370 functions may be performed by the virtual switch and/or the 371 hypervisor. 373 Another example is the case where an End Device can be a traditional 374 physical server (no VMs, no virtual switch), i.e., the server is the 375 Tenant End System and the NVE functions may be performed by the ToR. 376 Other End Devices in this category are Physical Network Appliances 377 or Storage Systems. 379 A Tenant End System attaches to a Network Virtualization Edge (NVE) 380 node, either directly or via a switched network (typically 381 Ethernet). 383 The NVE implements network virtualization functions that allow for 384 L2 and/or L3 tenant separation and for hiding tenant addressing 385 information (MAC and IP addresses), tenant-related control plane 386 activity and service contexts from the Routed Backbone nodes. 388 Core nodes utilize L3 techniques to interconnect NVE nodes in 389 support of the overlay network. These devices perform forwarding 390 based on the outer L3 tunnel header and generally do not maintain per 391 tenant-service state, although some applications (e.g., multicast) may 392 require control plane or forwarding plane information that pertains 393 to a tenant, a group of tenants, a tenant service or a set of services 394 that belong to one or more tunnels. When such tenant or tenant- 395 service related information is maintained in the core, overlay 396 virtualization provides knobs to control that information. 398 2.2. NVE Reference Model 400 The NVE is composed of a tenant service instance that Tenant End 401 Systems interface with and an overlay module that provides tunneling 402 overlay functions (e.g., encapsulation/decapsulation of tenant 403 traffic from/to the tenant forwarding instance, tenant 404 identification and mapping, etc.), as described in Figure 4: 406 +------- L3 Network ------+ 407 | | 408 | Tunnel Overlay | 409 +------------+---------+ +---------+------------+ 410 | +----------+-------+ | | +---------+--------+ | 411 | | Overlay Module | | | | Overlay Module | | 412 | +---------+--------+ | | +---------+--------+ | 413 | |VN context| | VN context| | 414 | | | | | | 415 | +--------+-------+ | | +--------+-------+ | 416 | | |VNI| . |VNI| | | | |VNI| . |VNI| | 417 NVE1 | +-+------------+-+ | | +-+-----------+--+ | NVE2 418 | | VAPs | | | | VAPs | | 419 +----+------------+----+ +----+------------+----+ 420 | | | | 421 -------+------------+-----------------+------------+------- 422 | | Tenant | | 423 | | Service IF | | 424 Tenant End Systems Tenant End Systems 426 Figure 4 : Generic reference model for NV Edge 428 Note that some NVE functions (e.g., data plane and control plane 429 functions) may reside in one device or may be implemented separately 430 in different devices. 432 For example, the NVE functionality could reside solely on the End 433 Devices, on the ToRs or on both the End Devices and the ToRs. In the 434 latter case we say that the End Device NVE component acts as the 435 NVE Spoke, and ToRs act as NVE hubs. Tenant End Systems will 436 interface with the tenant service instances maintained on the NVE 437 spokes, and tenant service instances maintained on the NVE spokes 438 will interface with the tenant service instances maintained on the 439 NVE hubs.
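The following Python sketch is not part of this framework; names such as OverlayPacket and Nve are purely illustrative, and a generic numeric VN Context is assumed. It only illustrates the egress-side behavior implied by Figure 4: the overlay module terminates the tunnel, the VN Context carried in the encapsulation selects the VNI, and the tenant frame is delivered through the appropriate VAP.

      # Illustrative sketch only; not a prescribed NVE implementation.
      from dataclasses import dataclass

      @dataclass
      class OverlayPacket:
          outer_src: str       # L3 address of the ingress NVE
          outer_dst: str       # L3 address of this (egress) NVE
          vn_context: int      # VN Context from the overlay header
          payload: bytes       # original tenant frame

      class Nve:
          def __init__(self):
              # vn_context -> {tenant destination address -> VAP id}
              self.vnis = {}

          def receive_from_tunnel(self, pkt, tenant_dst):
              vni = self.vnis.get(pkt.vn_context)
              if vni is None:
                  return None              # unknown VN Context: drop
              vap = vni.get(tenant_dst)
              # hand the decapsulated frame to the Tenant Service IF
              return (vap, pkt.payload)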
441 2.3. NVE Service Types 443 NVE components may be used to provide different types of virtualized 444 service connectivity. This section defines the service types and 445 their associated attributes. 447 2.3.1. L2 NVE providing Ethernet LAN-like service 449 L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet-based 450 multipoint service where the Tenant End Systems appear to be 451 interconnected by a LAN environment over a set of L3 tunnels. It 452 provides a per-tenant virtual switching instance with MAC addressing 453 isolation and L3 tunnel encapsulation across the core. 455 2.3.2. L3 NVE providing IP/VRF-like service 457 Virtualized IP routing and forwarding is similar, from a service 458 definition perspective, to IETF IP VPNs (e.g., BGP/MPLS IP VPN and 459 IPsec VPNs). It provides a per-tenant routing instance with addressing 460 isolation and L3 tunnel encapsulation across the core. 462 3. Functional components 464 This section breaks down the Network Virtualization architecture 465 into functional components to make it easier to discuss solution 466 options for different modules. 468 This version of the document gives an overview of generic functional 469 components that are shared between L2 and L3 service types. Details 470 specific to each service type will be added in future revisions. 472 3.1. Generic service virtualization components 474 A Network Virtualization solution is built around a number of 475 functional components as depicted in Figure 5: 477 +------- L3 Network ------+ 478 | | 479 | Tunnel Overlay | 480 +------------+--------+ +--------+------------+ 481 | +----------+------+ | | +------+----------+ | 482 | | Overlay Module | | | | Overlay Module | | 483 | +--------+--------+ | | +--------+--------+ | 484 | |VN Context| | |VN Context| 485 | | | | | | 486 | +-------+-------+ | | +-------+-------+ | 487 | ||VNI| ... |VNI|| | | ||VNI| ... |VNI|| | 488 NVE1 | +-+-----------+-+ | | +-+-----------+-+ | NVE2 489 | | VAPs | | | | VAPs | | 490 +----+-----------+----+ +----+-----------+----+ 491 | | | | 492 -----+-----------+-----------------+-----------+----- 493 | | Tenant | | 494 | | Service IF | | 495 Tenant End Systems Tenant End Systems 497 Figure 5 : Generic reference model for NV Edge 499 3.1.1. Virtual Access Points (VAPs) 501 Tenant End Systems are connected to the VNI through Virtual 502 Access Points (VAPs). In practice, VAPs can be physical ports on a 503 ToR or virtual ports identified through logical interface 504 identifiers (e.g., VLANs, or an internal VSwitch Interface ID leading to a VM). 506 3.1.2. Virtual Network Instance (VNI) 508 The VNI represents a set of configuration attributes defining access 509 and tunnel policies and (L2 and/or L3) forwarding functions. 511 Per-tenant FIB tables and control plane protocol instances are used 512 to maintain separate private contexts between tenants. Hence tenants 513 are free to use their own addressing schemes without concerns about 514 address overlapping with other tenants.
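As an illustration of the per-tenant isolation described above (a hypothetical sketch, not a prescribed data structure), the per-VNI FIBs below let two tenants use the same address without any conflict; addresses are taken from documentation ranges.

      # Each VNI has its own FIB; identical tenant addresses do not clash.
      # Entries map a tenant address to (egress NVE address, VN Context).
      fib = {
          "VNI-A": {"192.0.2.1": ("198.51.100.20", 0x1001)},
          "VNI-B": {"192.0.2.1": ("198.51.100.30", 0x2002)},
      }

      def lookup(vni, tenant_dst):
          # Lookup is always scoped to one VNI, so tenants never see
          # each other's entries.
          return fib[vni].get(tenant_dst)

      assert lookup("VNI-A", "192.0.2.1") != lookup("VNI-B", "192.0.2.1")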
516 3.1.3. Overlay Modules and VN Context 518 Mechanisms for identifying each tenant service are required to allow 519 the simultaneous overlay of multiple tenant services over the same 520 underlay L3 network topology. In the data plane, each NVE, upon 521 sending a tenant packet, must be able to encode the VN Context for 522 the destination NVE in addition to the L3 tunnel source address 523 identifying the source NVE and the tunnel destination L3 address 524 identifying the destination NVE. This allows the destination NVE to 525 identify the tenant service instance and therefore appropriately 526 process and forward the tenant packet. 528 The Overlay module provides tunneling overlay functions: tunnel 529 initiation/termination, encapsulation/decapsulation of frames from/to the 530 VAPs and the L3 backbone, and possibly transit forwarding of IP 531 traffic (e.g., transparent tunnel forwarding). 533 In a multi-tenant context, the tunnel aggregates frames from/to 534 different VNIs. Tenant identification and traffic demultiplexing are 535 based on the VN Context (e.g., VNID). 537 The following approaches can be considered: 539 o One VN Context per Tenant: A globally unique (on a per-DC 540 administrative domain) VNID is used to identify the related 541 Tenant instances. An example of this approach is the use of 542 IEEE VLAN or ISID tags to provide virtual L2 domains. 544 o One VN Context per VNI: A per-tenant local value is 545 automatically generated by the egress NVE and usually 546 distributed by a control plane protocol to all the related 547 NVEs. An example of this approach is the use of per VRF MPLS 548 labels in IP VPN [RFC4364]. 550 o One VN Context per VAP: A per-VAP local value is assigned and 551 usually distributed by a control plane protocol. An example of 552 this approach is the use of per CE-PE MPLS labels in IP VPN 553 [RFC4364]. 555 Note that when using one VN Context per VNI or per VAP, an 556 additional global identifier may be used by the control plane to 557 identify the Tenant context. 559 3.1.4. Tunnel Overlays and Encapsulation options 561 Once the VN context is added to the frame, an L3 tunnel encapsulation 562 is used to transport the frame to the destination NVE. The backbone 563 devices do not usually keep any per-service state, simply forwarding 564 the frames based on the outer tunnel header. 566 Different IP tunneling options (GRE/L2TP/IPsec) and MPLS tunneling 567 options (BGP VPN, PW, VPLS) are available for both Ethernet and IP 568 formats.
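A minimal sketch of the encapsulation step described in Sections 3.1.3 and 3.1.4 follows. The 8-byte header layout (a flags byte, reserved bytes and a 24-bit VNID) is only an assumption made for illustration; this framework does not mandate any particular encapsulation, and the result would still need to be carried in an outer L3 (e.g., IP/GRE or IP/UDP) tunnel header addressed to the egress NVE.

      import struct

      def add_vn_context(tenant_frame: bytes, vnid: int) -> bytes:
          # Illustrative 8-byte overlay header: flags, 3 reserved bytes,
          # 24-bit VNID, 1 reserved byte. Not a standardized format.
          flags = 0x08                      # "VNID present" (example flag)
          header = struct.pack("!B3xI", flags, vnid << 8)
          return header + tenant_frame

      frame = add_vn_context(b"\xaa" * 64, vnid=0x1234)
      assert len(frame) == 8 + 64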
570 3.1.5. Control Plane Components 572 Control plane components may be used to provide the following 573 capabilities: 575 . Auto-provisioning/Service discovery 577 . Address advertisement and tunnel mapping 579 . Tunnel management 581 A control plane component can be an on-net control protocol or a 582 management control entity. 584 3.1.5.1. Auto-provisioning/Service discovery 586 NVEs must be able to select the appropriate VNI for each Tenant End 587 System. This is based on state information that is often provided by 588 external entities. For example, in a VM environment, this 589 information is provided by compute management systems, since these 590 are the only entities that have visibility into which VM belongs to 591 which tenant. 593 A mechanism for communicating this information between Tenant End 594 Systems and the local NVE is required. As a result, the VAPs are 595 created and mapped to the appropriate Tenant Instance. 597 Depending upon the implementation, this control interface can be 598 implemented using an auto-discovery protocol between Tenant End 599 Systems and their local NVE or through management entities. 601 When a protocol is used, appropriate security and authentication 602 mechanisms are required to verify that Tenant End System information is not 603 spoofed or altered. This is one critical aspect for 604 providing integrity and tenant isolation in the system. 606 Another control plane protocol can also be used to advertise NVE 607 tenant service instances (tenant and service type provided to the 608 tenant) to other NVEs. Alternatively, management control entities 609 can also be used to perform these functions. 611 3.1.5.2. Address advertisement and tunnel mapping 613 As traffic reaches an ingress NVE, a lookup is performed to 614 determine which tunnel the packet needs to be sent on. It is then 615 encapsulated with a tunnel header containing the destination address 616 of the egress overlay node. Intermediate nodes (between the ingress 617 and egress NVEs) switch or route traffic based upon the outer 618 destination address. 620 One key step in this process consists of mapping a final destination 621 address to the proper tunnel. NVEs are responsible for maintaining 622 such mappings in their lookup tables. Several ways of populating 623 these lookup tables are possible: control plane driven, management 624 plane driven, or data plane driven. 626 When a control plane protocol is used to distribute address 627 advertisement and tunneling information, the 628 auto-provisioning/Service discovery could be accomplished by the same 629 protocol. In this scenario, the auto-provisioning/Service discovery 630 could be combined with (be inferred from) the address advertisement 631 and tunnel mapping. Furthermore, a control plane protocol that 632 carries both MAC and IP addresses eliminates the need for ARP, and 633 hence addresses one of the issues with explosive ARP handling.
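To make the mapping step concrete, the fragment below sketches a possible ingress lookup table keyed by (VNI, tenant destination), as it could be populated by a control plane, a management plane or data plane learning. The table layout and the forward() helper are hypothetical; MAC and IP values are taken from documentation ranges.

      # (VNI, tenant destination MAC) -> (egress NVE address, VN Context)
      tunnel_map = {
          ("VNI-A", "00:00:5e:00:53:01"): ("198.51.100.20", 0x1001),
          ("VNI-A", "00:00:5e:00:53:02"): ("198.51.100.30", 0x1001),
      }

      def forward(vni, tenant_dst_mac, frame):
          entry = tunnel_map.get((vni, tenant_dst_mac))
          if entry is None:
              # unknown destination: BUM handling, see Section 4.2.3
              return ("flood-or-drop", frame)
          egress_nve, vn_context = entry
          # add the VN Context and the outer tunnel header (Section 3.1.4),
          # then send towards egress_nve across the underlay
          return (egress_nve, vn_context, frame)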
635 3.1.5.3. Tunnel management 637 A control plane protocol may be required to exchange tunnel state 638 information. This may include setting up and tearing down tunnels and/or providing 639 tunnel status information. 641 This applies to both unicast and multicast tunnels. 643 For instance, it may be necessary to provide active/standby status 644 information between NVEs, up/down status information, 645 pruning/grafting information for multicast tunnels, etc. 647 3.2. Service Overlay Topologies 649 A number of service topologies may be used to optimize the service 650 connectivity and to address NVE performance limitations. 652 The topology described in Figure 3 suggests the use of a tunnel mesh 653 between the NVEs where each tenant instance is one hop away from a 654 service processing perspective. Partial mesh topologies and an NVE 655 hierarchy may be used where certain NVEs may act as service transit 656 points. 658 4. Key aspects of overlay networks 660 The intent of this section is to highlight specific issues that 661 proposed overlay solutions need to address. 663 4.1. Pros & Cons 665 An overlay network is a layer of virtual network topology on top of 666 the physical network. 668 Overlay networks offer the following key advantages: 670 o Unicast tunneling state management is handled at the edge of 671 the network. Intermediate transport nodes are unaware of such 672 state. Note that this is not the case when multicast is enabled 673 in the core network. 675 o Tunnels are used to aggregate traffic and hence offer the 676 advantage of minimizing the amount of forwarding state required 677 within the underlay network. 679 o Decoupling of the overlay addresses (MAC and IP) used by VMs 680 from the underlay network. This offers a clear separation 681 between addresses used within the overlay and the underlay 682 networks and it enables the use of overlapping address spaces 683 by Tenant End Systems. 685 o Support of a large number of virtual network identifiers. 687 Overlay networks also create several challenges: 689 o Overlay networks have no control over underlay networks and lack 690 critical network information. 691 o Overlays typically probe the network to measure link 692 properties, such as available bandwidth or packet loss 693 rate, but it is difficult to accurately evaluate such 694 properties this way. It might be preferable for the underlay 695 network to expose usage and performance information. 697 o Miscommunication between overlay and underlay networks can lead 698 to an inefficient usage of network resources. 700 o Fairness of resource sharing and collaboration among end-nodes 701 in overlay networks are two critical issues. 703 o When multiple overlays co-exist on top of a common underlay 704 network, the lack of coordination between overlays can lead to 705 performance issues. 707 o Overlaid traffic may not traverse firewalls and NAT devices. 709 o Multicast service scalability: multicast support may be 710 required in the overlay network to provide per-tenant 711 flood containment or efficient multicast handling. 713 o Hash-based load balancing may not be optimal as the hash 714 algorithm may not work well due to the limited number of 715 combinations of tunnel source and destination addresses. 717 4.2. Overlay issues to consider 719 4.2.1. Data plane vs Control plane driven 721 In the case of an L2 NVE, it is possible to dynamically learn MAC 722 addresses against VAPs. It is also possible for such addresses to be 723 known and controlled via management or a control protocol, for both 724 L2 NVEs and L3 NVEs. 726 Dynamic data plane learning implies that flooding of unknown 727 destinations be supported and hence implies that broadcast and/or 728 multicast be supported. Multicasting in the core network for dynamic 729 learning may lead to significant scalability limitations. Specific 730 forwarding rules must be enforced to prevent loops from happening. 731 This can be achieved using a spanning tree, a shortest path tree, or 732 a split-horizon mesh. 734 It should be noted that the amount of state to be distributed is 735 dependent upon network topology and the number of virtual machines. 737 Different forms of caching can also be utilized to minimize state 738 distribution between the various elements. 740 4.2.2. Coordination between data plane and control plane 742 For an L2 NVE, the NVE needs to be able to determine MAC addresses 743 of the end systems present on a VAP (for instance, data plane 744 learning may be relied upon for this purpose). For an L3 NVE, the 745 NVE needs to be able to determine IP addresses of the end systems 746 present on a VAP. 748 In both cases, coordination with the NVE control protocol is needed 749 such that when the NVE determines that the set of addresses behind a 750 VAP has changed, it triggers the local NVE control plane to 751 distribute this information to its peers.
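The coordination described above can be summarized by the following sketch: whenever the set of addresses learned behind a VAP changes, the NVE invokes its control plane to advertise or withdraw the corresponding mapping. The VapTracker class and the advertise callback are hypothetical placeholders for whatever control protocol or management interface is actually used.

      class VapTracker:
          def __init__(self, advertise):
              self.addresses = {}          # VAP id -> set of tenant addresses
              self.advertise = advertise   # callback into the NVE control plane

          def learn(self, vap, address):
              known = self.addresses.setdefault(vap, set())
              if address not in known:     # address set behind the VAP changed
                  known.add(address)
                  self.advertise(vap, address, withdraw=False)

          def age_out(self, vap, address):
              if address in self.addresses.get(vap, set()):
                  self.addresses[vap].discard(address)
                  self.advertise(vap, address, withdraw=True)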
753 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic 755 There are two techniques to support packet replication needed for 756 broadcast, unknown unicast and multicast: 758 o Ingress replication 760 o Use of core multicast trees 762 There is a bandwidth vs. state trade-off between the two approaches. 763 Depending upon the degree of replication required (i.e., the number 764 of hosts per group) and the amount of multicast state to maintain, 765 one approach or the other may be preferable. 767 When the number of hosts per group is large, the use of core 768 multicast trees may be more appropriate. When the number of hosts is 769 small (e.g., 2-3), ingress replication may not be an issue. 771 Depending upon the size of the data center network and hence the 772 number of (S,G) entries, but also the duration of multicast flows, 773 the use of core multicast trees can be a challenge. 775 When flows are well known, it is possible to pre-provision such 776 multicast trees. However, it is often difficult to predict 777 application flows ahead of time, and hence programming of (S,G) 778 entries for short-lived flows could be impractical. 780 A possible trade-off is to use shared multicast trees in the core, as 781 opposed to dedicated multicast trees. 783 4.2.4. Path MTU 785 When using overlay tunneling, an outer header is added to the 786 original frame. This can cause the MTU of the path to the egress 787 tunnel endpoint to be exceeded. 789 In this section, we will only consider the case of an IP overlay. 791 It is usually not desirable to rely on IP fragmentation for 792 performance reasons. Ideally, the interface MTU as seen by a Tenant 793 End System is adjusted such that no fragmentation is needed. TCP 794 will adjust its maximum segment size accordingly. 796 It is possible for the MTU to be configured manually or to be 797 discovered dynamically. Various Path MTU discovery techniques exist 798 in order to determine the proper MTU size to use: 800 o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981] 802 o Tenant End Systems rely on ICMP messages to discover the 803 MTU of the end-to-end path to their destination. This method 804 is not always possible, such as when traversing middle 805 boxes (e.g., firewalls) that disable ICMP for security 806 reasons 808 o Extended MTU Path Discovery techniques such as defined in 809 [RFC4821] 811 It is also possible to rely on the overlay layer to perform 812 segmentation and reassembly operations without relying on the Tenant 813 End Systems to know about the end-to-end MTU. The assumption is that 814 some hardware assist is available on the NVE node to perform such 815 SAR operations. However, fragmentation by the overlay layer can lead 816 to performance and congestion issues due to TCP dynamics and might 817 require new congestion avoidance mechanisms from the underlay 818 network [FLOYD]. 820 Finally, the underlay network may be designed in such a way that the 821 MTU can accommodate the extra tunnel overhead.
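The arithmetic involved is straightforward, as the hypothetical figures below illustrate; the overhead values are examples only, since this framework does not mandate a particular encapsulation.

      # Room must be left for the outer and overlay headers so that the
      # tenant-visible MTU never triggers fragmentation in the underlay.
      UNDERLAY_MTU   = 1500   # discovered or configured underlay path MTU
      OUTER_IPV4     = 20     # outer IPv4 header
      OUTER_UDP      = 8      # e.g., a UDP-based tunnel
      OVERLAY_HEADER = 8      # generic header carrying the VN Context

      tenant_mtu = UNDERLAY_MTU - (OUTER_IPV4 + OUTER_UDP + OVERLAY_HEADER)
      assert tenant_mtu == 1464  # MTU to advertise to the Tenant End System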
823 4.2.5. NVE location trade-offs 825 In the case of DC traffic, traffic originating from a VM is native 826 Ethernet traffic. This traffic can be switched by a local VM switch 827 or ToR switch and then by a DC gateway. The NVE function can be 828 embedded within any of these elements. 830 There are several criteria to consider when deciding where the NVE 831 processing boundary is located: 833 o Processing and memory requirements 835 o Datapath (e.g., lookups, filtering, 836 encapsulation/decapsulation) 838 o Control plane processing (e.g., routing, signaling, OAM) 840 o FIB/RIB size 842 o Multicast support 844 o Routing protocols 846 o Packet replication capability 848 o Fragmentation support 850 o QoS transparency 852 o Resiliency 854 4.2.6. Interaction between network overlays and underlays 856 When multiple overlays co-exist on top of a common underlay network, 857 performance issues can arise since these overlays typically have 858 partially overlapping paths and nodes. 860 Each overlay is selfish by nature in that it sends traffic so as to 861 optimize its own performance without considering the impact on other 862 overlays, unless the underlay tunnels are traffic engineered on a 863 per-overlay basis so as to avoid sharing underlay resources. 865 Better visibility between overlays and underlays can be achieved by 866 providing mechanisms to exchange information about: 868 o Performance metrics (throughput, delay, loss, jitter) 870 o Cost metrics 872 5. Security Considerations 874 The tenant-to-overlay mapping function can introduce significant 875 security risks if protocols that can support mutual authentication 876 are not used. 878 No other new security issues are introduced beyond those described 879 already in the related L2VPN and L3VPN RFCs. 881 6. IANA Considerations 883 IANA does not need to take any action for this draft. 885 7. References 887 7.1. Normative References 889 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 890 Requirement Levels", BCP 14, RFC 2119, March 1997. 892 7.2. Informative References 894 [NVOPS] Narten, T. et al., "Problem Statement: Overlays for Network 895 Virtualization", draft-narten-nvo3-overlay-problem-statement 896 (work in progress) 898 [OVCPREQ] Kreeger, L. et al., "Network Virtualization Overlay Control 899 Protocol Requirements", draft-kreeger-nvo3-overlay-cp 900 (work in progress) 902 [FLOYD] Floyd, S. and Romanow, A., "Dynamics of TCP Traffic over 903 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995 905 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 906 Networks (VPNs)", RFC 4364, February 2006. 908 [RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990 910 [RFC1981] McCann, J. et al., "Path MTU Discovery for IPv6", RFC 1981, 911 August 1996 913 [RFC4821] Mathis, M. et al., "Packetization Layer Path MTU 914 Discovery", RFC 4821, March 2007 916 8. Acknowledgments 918 In addition to the authors the following people have contributed to 919 this document: 921 Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent 923 This document was prepared using 2-Word-v2.0.template.dot. 925 Authors' Addresses 927 Marc Lasserre 928 Alcatel-Lucent 929 Email: marc.lasserre@alcatel-lucent.com 931 Florin Balus 932 Alcatel-Lucent 933 777 E. Middlefield Road 934 Mountain View, CA, USA 94043 935 Email: florin.balus@alcatel-lucent.com 937 Thomas Morin 938 France Telecom Orange 939 Email: thomas.morin@orange.com 941 Nabil Bitar 942 Verizon 943 40 Sylvan Road 944 Waltham, MA 02145 945 Email: nabil.bitar@verizon.com 947 Yakov Rekhter 948 Juniper 949 Email: yakov@juniper.net