2 Internet Engineering Task Force D. Black 3 Internet-Draft EMC 4 Intended status: Informational J. Hudson 5 Expires: October 23, 2016 Independent 6 L. Kreeger 7 Cisco 8 M. Lasserre 9 Independent 10 T. Narten 11 IBM 12 April 21, 2016 14 An Architecture for Data Center Network Virtualization Overlays (NVO3) 15 draft-ietf-nvo3-arch-06 17 Abstract 19 This document presents a high-level overview architecture for 20 building data center network virtualization overlay (NVO3) networks. 21 The architecture is given at a high-level, showing the major 22 components of an overall system. An important goal is to divide the 23 space into individual smaller components that can be implemented 24 independently and with clear interfaces and interactions with other 25 components. It should be possible to build and implement individual 26 components in isolation and have them work with other components with 27 no changes to other components. That way implementers have 28 flexibility in implementing individual components and can optimize 29 and innovate within their respective components without requiring 30 changes to other components. 32 Status of This Memo 34 This Internet-Draft is submitted in full conformance with the 35 provisions of BCP 78 and BCP 79. 37 Internet-Drafts are working documents of the Internet Engineering 38 Task Force (IETF). Note that other groups may also distribute 39 working documents as Internet-Drafts. The list of current Internet- 40 Drafts is at http://datatracker.ietf.org/drafts/current/. 42 Internet-Drafts are draft documents valid for a maximum of six months 43 and may be updated, replaced, or obsoleted by other documents at any 44 time. It is inappropriate to use Internet-Drafts as reference 45 material or to cite them other than as "work in progress." 47 This Internet-Draft will expire on October 23, 2016. 49 Copyright Notice 51 Copyright (c) 2016 IETF Trust and the persons identified as the 52 document authors. All rights reserved.
54 This document is subject to BCP 78 and the IETF Trust's Legal 55 Provisions Relating to IETF Documents 56 (http://trustee.ietf.org/license-info) in effect on the date of 57 publication of this document. Please review these documents 58 carefully, as they describe your rights and restrictions with respect 59 to this document. Code Components extracted from this document must 60 include Simplified BSD License text as described in Section 4.e of 61 the Trust Legal Provisions and are provided without warranty as 62 described in the Simplified BSD License. 64 Table of Contents 66 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 67 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 68 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 4 69 3.1. VN Service (L2 and L3) . . . . . . . . . . . . . . . . . 5 70 3.1.1. VLAN Tags in L2 Service . . . . . . . . . . . . . . . 7 71 3.1.2. TTL Considerations . . . . . . . . . . . . . . . . . 7 72 3.2. Network Virtualization Edge (NVE) . . . . . . . . . . . . 7 73 3.3. Network Virtualization Authority (NVA) . . . . . . . . . 9 74 3.4. VM Orchestration Systems . . . . . . . . . . . . . . . . 9 75 4. Network Virtualization Edge (NVE) . . . . . . . . . . . . . . 11 76 4.1. NVE Co-located With Server Hypervisor . . . . . . . . . . 11 77 4.2. Split-NVE . . . . . . . . . . . . . . . . . . . . . . . . 12 78 4.2.1. Tenant VLAN handling in Split-NVE Case . . . . . . . 12 79 4.3. NVE State . . . . . . . . . . . . . . . . . . . . . . . . 13 80 4.4. Multi-Homing of NVEs . . . . . . . . . . . . . . . . . . 14 81 4.5. VAP . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 82 5. Tenant System Types . . . . . . . . . . . . . . . . . . . . . 15 83 5.1. Overlay-Aware Network Service Appliances . . . . . . . . 15 84 5.2. Bare Metal Servers . . . . . . . . . . . . . . . . . . . 15 85 5.3. Gateways . . . . . . . . . . . . . . . . . . . . . . . . 16 86 5.3.1. Gateway Taxonomy . . . . . . . . . . . . . . . . . . 16 87 5.3.1.1. L2 Gateways (Bridging) . . . . . . . . . . . . . 16 88 5.3.1.2. L3 Gateways (Only IP Packets) . . . . . . . . . . 17 89 5.4. Distributed Inter-VN Gateways . . . . . . . . . . . . . . 17 90 5.5. ARP and Neighbor Discovery . . . . . . . . . . . . . . . 18 91 6. NVE-NVE Interaction . . . . . . . . . . . . . . . . . . . . . 18 92 7. Network Virtualization Authority . . . . . . . . . . . . . . 20 93 7.1. How an NVA Obtains Information . . . . . . . . . . . . . 20 94 7.2. Internal NVA Architecture . . . . . . . . . . . . . . . . 21 95 7.3. NVA External Interface . . . . . . . . . . . . . . . . . 21 96 8. NVE-to-NVA Protocol . . . . . . . . . . . . . . . . . . . . . 22 97 8.1. NVE-NVA Interaction Models . . . . . . . . . . . . . . . 23 98 8.2. Direct NVE-NVA Protocol . . . . . . . . . . . . . . . . . 23 99 8.3. Propagating Information Between NVEs and NVAs . . . . . . 24 100 9. Federated NVAs . . . . . . . . . . . . . . . . . . . . . . . 25 101 9.1. Inter-NVA Peering . . . . . . . . . . . . . . . . . . . . 27 102 10. Control Protocol Work Areas . . . . . . . . . . . . . . . . . 28 103 11. NVO3 Data Plane Encapsulation . . . . . . . . . . . . . . . . 28 104 12. Operations and Management . . . . . . . . . . . . . . . . . . 28 105 13. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 106 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 29 107 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 108 16. Security Considerations . . . . . . . . . . . . . . . . . . . 
29 109 17. Informative References . . . . . . . . . . . . . . . . . 29 110 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31 112 1. Introduction 114 This document presents a high-level architecture for building data 115 center network virtualization overlay (NVO3) networks. The 116 architecture is given at a high-level, showing the major components 117 of an overall system. An important goal is to divide the space into 118 smaller individual components that can be implemented independently 119 and with clear interfaces and interactions with other components. It 120 should be possible to build and implement individual components in 121 isolation and have them work with other components with no changes to 122 other components. That way implementers have flexibility in 123 implementing individual components and can optimize and innovate 124 within their respective components without necessarily requiring 125 changes to other components. 127 The motivation for overlay networks is given in "Problem Statement: 128 Overlays for Network Virtualization" [RFC7364]. "Framework for DC 129 Network Virtualization" [RFC7365] provides a framework for discussing 130 overlay networks generally and the various components that must work 131 together in building such systems. This document differs from the 132 framework document in that it doesn't attempt to cover all possible 133 approaches within the general design space. Rather, it describes one 134 particular approach. 136 2. Terminology 138 This document uses the same terminology as [RFC7365]. In addition, 139 the following terms are used: 141 NV Domain A Network Virtualization Domain is an administrative 142 construct that defines a Network Virtualization Authority (NVA), 143 the set of Network Virtualization Edges (NVEs) associated with 144 that NVA, and the set of virtual networks the NVA manages and 145 supports. NVEs are associated with a (logically centralized) NVA, 146 and an NVE supports communication for any of the virtual networks 147 in the domain. 149 NV Region A region over which information about a set of virtual 150 networks is shared. The degenerate case of a single NV Domain 151 corresponds to an NV Region consisting of just that domain. The 152 more interesting case occurs when two or more NV Domains share 153 information about part or all of a set of virtual networks that 154 they manage. Two NVAs share information about particular virtual 155 networks for the purpose of supporting connectivity between 156 tenants located in different NV Domains. NVAs can share 157 information about an entire NV domain, or just individual virtual 158 networks. 160 Tenant System Interface (TSI) The interface to a Virtual Network as 161 presented to a Tenant System. The TSI logically connects to the 162 NVE via a Virtual Access Point (VAP). To the Tenant System, the 163 TSI is like a Network Interface Card (NIC); the TSI presents 164 itself to a Tenant System as a normal network interface. 166 VLAN Unless stated otherwise, the terms VLAN and VLAN Tag are used 167 in this document to denote a C-VLAN [IEEE-802.1Q], and the terms are 168 used interchangeably to improve readability. 170 3. Background 172 Overlay networks are an approach for providing network virtualization 173 services to a set of Tenant Systems (TSs) [RFC7365]. With overlays, 174 data traffic between tenants is tunneled across the underlying data 175 center's IP network.
The use of tunnels provides a number of 176 benefits by decoupling the network as viewed by tenants from the 177 underlying physical network across which they communicate. 179 Tenant Systems connect to Virtual Networks (VNs), with each VN having 180 associated attributes defining properties of the network, such as the 181 set of members that connect to it. Tenant Systems connected to a 182 virtual network typically communicate freely with other Tenant 183 Systems on the same VN, but communication between Tenant Systems on 184 one VN and those external to the VN (whether on another VN or 185 connected to the Internet) is carefully controlled and governed by 186 policy. The NVO3 architecture does not impose any restrictions to 187 the application of policy controls even within a VN. 189 A Network Virtualization Edge (NVE) [RFC7365] is the entity that 190 implements the overlay functionality. An NVE resides at the boundary 191 between a Tenant System and the overlay network as shown in Figure 1. 193 An NVE creates and maintains local state about each Virtual Network 194 for which it is providing service on behalf of a Tenant System. 196 +--------+ +--------+ 197 | Tenant +--+ +----| Tenant | 198 | System | | (') | System | 199 +--------+ | ................ ( ) +--------+ 200 | +-+--+ . . +--+-+ (_) 201 | | NVE|--. .--| NVE| | 202 +--| | . . | |---+ 203 +-+--+ . . +--+-+ 204 / . . 205 / . L3 Overlay . +--+-++--------+ 206 +--------+ / . Network . | NVE|| Tenant | 207 | Tenant +--+ . .- -| || System | 208 | System | . . +--+-++--------+ 209 +--------+ ................ 210 | 211 +----+ 212 | NVE| 213 | | 214 +----+ 215 | 216 | 217 ===================== 218 | | 219 +--------+ +--------+ 220 | Tenant | | Tenant | 221 | System | | System | 222 +--------+ +--------+ 224 Figure 1: NVO3 Generic Reference Model 226 The following subsections describe key aspects of an overlay system 227 in more detail. Section 3.1 describes the service model (Ethernet 228 vs. IP) provided to Tenant Systems. Section 3.2 describes NVEs in 229 more detail. Section 3.3 introduces the Network Virtualization 230 Authority, from which NVEs obtain information about virtual networks. 231 Section 3.4 provides background on Virtual Machine (VM) orchestration 232 systems and their use of virtual networks. 234 3.1. VN Service (L2 and L3) 236 A Virtual Network provides either L2 or L3 service to connected 237 tenants. For L2 service, VNs transport Ethernet frames, and a Tenant 238 System is provided with a service that is analogous to being 239 connected to a specific L2 C-VLAN. L2 broadcast frames are generally 240 delivered to all (and multicast frames delivered to a subset of) the 241 other Tenant Systems on the VN. To a Tenant System, it appears as if 242 they are connected to a regular L2 Ethernet link. Within the NVO3 243 architecture, tenant frames are tunneled to remote NVEs based on the 244 MAC addresses of the frame headers as originated by the Tenant 245 System. On the underlay, NVO3 packets are forwarded between NVEs 246 based on the outer addresses of tunneled packets. 248 For L3 service, VNs transport IP datagrams, and a Tenant System is 249 provided with a service that only supports IP traffic. Within the 250 NVO3 architecture, tenant frames are tunneled to remote NVEs based on 251 the IP addresses of the packet originated by the Tenant System; any 252 L2 destination addresses provided by Tenant Systems are effectively 253 ignored by the NVEs and overlay network. 
For L3 service, the Tenant 254 System will be configured with an IP subnet that is effectively a 255 point-to-point link, i.e., having only the Tenant System and a next- 256 hop router address on it. 258 L2 service is intended for systems that need native L2 Ethernet 259 service and the ability to run protocols directly over Ethernet 260 (i.e., not based on IP). L3 service is intended for systems in which 261 all the traffic can safely be assumed to be IP. It is important to 262 note that whether an NVO3 network provides L2 or L3 service to a 263 Tenant System, the Tenant System does not generally need to be aware 264 of the distinction. In both cases, the virtual network presents 265 itself to the Tenant System as an L2 Ethernet interface. An Ethernet 266 interface is used in both cases simply as a widely supported 267 interface type that essentially all Tenant Systems already support. 268 Consequently, no special software is needed on Tenant Systems to use 269 an L3 vs. an L2 overlay service. 271 NVO3 can also provide a combined L2 and L3 service to tenants. A 272 combined service provides L2 service for intra-VN communication, but 273 also provides L3 service for L3 traffic entering or leaving the VN. 274 Architecturally, the handling of a combined L2/L3 service within the 275 NVO3 architecture is intended to match what is commonly done today in 276 non-overlay environments by devices providing a combined bridge/ 277 router service. With combined service, the virtual network itself 278 retains the semantics of L2 service and all traffic is processed 279 according to its L2 semantics. In addition, however, traffic 280 requiring IP processing is also processed at the IP level. 282 The IP processing for a combined service can be implemented on a 283 standalone device attached to the virtual network (e.g., an IP 284 router) or implemented locally on the NVE (see Section 5.4 on 285 Distributed Gateways). For unicast traffic, NVE implementation of a 286 combined service may result in a packet being delivered to another TS 287 attached to the same NVE (on either the same or a different VN) or 288 tunneled to a remote NVE, or even forwarded outside the NVO3 domain. 289 For multicast or broadcast packets, the combination of NVE L2 and L3 290 processing may result in copies of the packet receiving both L2 and 291 L3 treatments to realize delivery to all of the destinations 292 involved. This distributed NVE implementation of IP routing results 293 in the same network delivery behavior as if the L2 processing of the 294 packet included delivery of the packet to an IP router attached to 295 the L2 VN as a TS, with the router having additional network 296 attachments to other networks, either virtual or not. 298 3.1.1. VLAN Tags in L2 Service 300 An NVO3 L2 virtual network service may include encapsulated L2 VLAN 301 tags provided by a Tenant System, but does not use encapsulated tags 302 in deciding where and how to forward traffic. Such VLAN tags can be 303 passed through, so that Tenant Systems that send or expect to receive 304 them can be supported as appropriate. 306 The processing of VLAN tags that an NVE receives from a TS is 307 controlled by settings associated with the VAP. Just as in the case 308 with ports on Ethernet switches, a number of settings could be 309 imagined. For example, C-TAGs can be passed through transparently, 310 they could always be stripped upon receipt from a Tenant System, they 311 could be compared against a list of explicitly configured tags, etc. 
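As a non-normative illustration of the kinds of per-VAP settings discussed above, the following sketch (in Python) models three possible C-TAG handling modes: transparent pass-through, stripping on receipt from the Tenant System, and filtering against an explicitly configured list. The class, field, and mode names are assumptions of this sketch only and are not defined by the NVO3 architecture.

      # Non-normative sketch of per-VAP C-TAG handling settings.
      from dataclasses import dataclass, field
      from enum import Enum
      from typing import Optional, Set

      class CTagMode(Enum):
          PASS_THROUGH = "pass-through"  # carry tenant C-TAGs across the VN
          STRIP = "strip"                # always remove C-TAGs from the TS
          FILTER = "filter"              # accept only explicitly listed tags

      @dataclass
      class VapTagPolicy:
          mode: CTagMode = CTagMode.PASS_THROUGH
          allowed_ctags: Set[int] = field(default_factory=set)  # for FILTER

          def process(self, ctag: Optional[int]) -> Optional[int]:
              """Return the C-TAG to carry in the encapsulated frame, if any."""
              if self.mode is CTagMode.STRIP:
                  return None
              if self.mode is CTagMode.FILTER:
                  return ctag if ctag in self.allowed_ctags else None
              return ctag  # PASS_THROUGH: forward the tag unchanged

A VAP configured with the "strip" mode, for example, corresponds to the case described above in which tags are always removed upon receipt from a Tenant System.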
313 Note that the handling of C-VIDs has additional complications, as 314 described in Section 4.2.1 below. 316 3.1.2. TTL Considerations 318 For L3 service, Tenant Systems should expect the TTL of the packets 319 they send to be decremented by at least 1. For L2 service, the TTL 320 on packets (when the packet is IP) is not modified. The underlay 321 network manages the TTLs in the outer IP encapsulation (which could 322 be independent from or related to the TTL in the tenant IP packets). 324 3.2. Network Virtualization Edge (NVE) 326 Tenant Systems connect to NVEs via a Tenant System Interface (TSI). 327 The TSI logically connects to the NVE via a Virtual Access Point 328 (VAP) and each VAP is associated with one Virtual Network as shown in 329 Figure 2. To the Tenant System, the TSI is like a NIC; the TSI 330 presents itself to a Tenant System as a normal network interface. On 331 the NVE side, a VAP is a logical network port (virtual or physical) 332 into a specific virtual network. Note that two different Tenant 333 Systems (and TSIs) attached to a common NVE can share a VAP (e.g., 334 TS1 and TS2 in Figure 2) so long as they connect to the same Virtual 335 Network. 337 | Data Center Network (IP) | 338 | | 339 +-----------------------------------------+ 340 | | 341 | Tunnel Overlay | 342 +------------+---------+ +---------+------------+ 343 | +----------+-------+ | | +-------+----------+ | 344 | | Overlay Module | | | | Overlay Module | | 345 | +---------+--------+ | | +---------+--------+ | 346 | | | | | | 347 NVE1 | | | | | | NVE2 348 | +--------+-------+ | | +--------+-------+ | 349 | | VNI1 VNI2 | | | | VNI1 VNI2 | | 350 | +-+----------+---+ | | +-+-----------+--+ | 351 | | VAP1 | VAP2 | | | VAP1 | VAP2| 352 +----+----------+------+ +----+-----------+-----+ 353 | | | | 354 |\ | | | 355 | \ | | /| 356 -------+--\-------+-------------------+---------/-+------- 357 | \ | Tenant | / | 358 TSI1 |TSI2\ | TSI3 TSI1 TSI2/ TSI3 359 +---+ +---+ +---+ +---+ +---+ +---+ 360 |TS1| |TS2| |TS3| |TS4| |TS5| |TS6| 361 +---+ +---+ +---+ +---+ +---+ +---+ 363 Figure 2: NVE Reference Model 365 The Overlay Module performs the actual encapsulation and 366 decapsulation of tunneled packets. The NVE maintains state about the 367 virtual networks it is a part of so that it can provide the Overlay 368 Module with such information as the destination address of the NVE to 369 tunnel a packet to, or the Context ID that should be placed in the 370 encapsulation header to identify the virtual network that a tunneled 371 packet belongs to. 373 On the data center network side, the NVE sends and receives native IP 374 traffic. When ingressing traffic from a Tenant System, the NVE 375 identifies the egress NVE to which the packet should be sent, adds an 376 overlay encapsulation header, and sends the packet on the underlay 377 network. When receiving traffic from a remote NVE, an NVE strips off 378 the encapsulation header, and delivers the (original) packet to the 379 appropriate Tenant System. When the source and destination Tenant 380 System are on the same NVE, no encapsulation is needed and the NVE 381 forwards traffic directly. 383 Conceptually, the NVE is a single entity implementing the NVO3 384 functionality. In practice, there are a number of different 385 implementation scenarios, as described in detail in Section 4. 387 3.3. 
Network Virtualization Authority (NVA) 389 Address dissemination refers to the process of learning, building and 390 distributing the mapping/forwarding information that NVEs need in 391 order to tunnel traffic to each other on behalf of communicating 392 Tenant Systems. For example, in order to send traffic to a remote 393 Tenant System, the sending NVE must know the destination NVE for that 394 Tenant System. 396 One way to build and maintain mapping tables is to use learning, as 397 802.1 bridges do [IEEE-802.1Q]. When forwarding traffic to multicast 398 or unknown unicast destinations, an NVE could simply flood traffic. 399 While flooding works, it can lead to traffic hot spots and can lead 400 to problems in larger networks (e.g., excessive amounts of flooded 401 traffic). 403 Alternatively, to reduce the scope of where flooding must take place, 404 or to eliminate it altogether, NVEs can make use of a Network 405 Virtualization Authority (NVA). An NVA is the entity that provides 406 address mapping and other information to NVEs. NVEs interact with an 407 NVA to obtain any required address mapping information they need in 408 order to properly forward traffic on behalf of tenants. The term NVA 409 refers to the overall system, without regard to its scope or how it 410 is implemented. NVAs provide a service, and NVEs access that service 411 via an NVE-to-NVA protocol as discussed in Section 8. 413 Even when an NVA is present, Ethernet bridge MAC address learning 414 could be used as a fallback mechanism, should the NVA be unable to 415 provide an answer or for other reasons. This document does not 416 consider flooding approaches in detail, as there are a number of 417 benefits in using an approach that depends on the presence of an NVA. 419 For the rest of this document, it is assumed that an NVA exists and 420 will be used. NVAs are discussed in more detail in Section 7. 422 3.4. VM Orchestration Systems 424 VM orchestration systems manage server virtualization across a set of 425 servers. Although VM management is a separate topic from network 426 virtualization, the two areas are closely related. Managing the 427 creation, placement, and movement of VMs also involves creating, 428 attaching to and detaching from virtual networks. A number of 429 existing VM orchestration systems have incorporated aspects of 430 virtual network management into their systems. 432 Note also that although this section uses the terms "VM" and 433 "hypervisor" throughout, the same issues apply to other 434 virtualization approaches, including Linux Containers (LXC), BSD 435 Jails, Network Service Appliances as discussed in Section 5.1, etc. 436 From an NVO3 perspective, it should be assumed that where the 437 document uses the terms "VM" and "hypervisor", the intention is that 438 the discussion also applies to other systems, where, e.g., the host 439 operating system plays the role of the hypervisor in supporting 440 virtualization, and a container plays the equivalent role as a VM. 442 When a new VM image is started, the VM orchestration system 443 determines where the VM should be placed, interacts with the 444 hypervisor on the target server to load and start the VM, and controls 445 when a VM should be shut down or migrated elsewhere. VM orchestration 446 systems also have knowledge about how a VM should connect to a 447 network, possibly including the name of the virtual network to which 448 a VM is to connect.
The VM orchestration system can pass such 449 information to the hypervisor when a VM is instantiated. VM 450 orchestration systems have significant (and sometimes global) 451 knowledge over the domain they manage. They typically know on what 452 servers a VM is running, and metadata associated with VM images can 453 be useful from a network virtualization perspective. For example, 454 the metadata may include the addresses (MAC and IP) the VMs will use 455 and the name(s) of the virtual network(s) they connect to. 457 VM orchestration systems run a protocol with an agent running on the 458 hypervisor of the servers they manage. That protocol can also carry 459 information about what virtual network a VM is associated with. When 460 the orchestrator instantiates a VM on a hypervisor, the hypervisor 461 interacts with the NVE in order to attach the VM to the virtual 462 networks it has access to. In general, the hypervisor will need to 463 communicate significant VM state changes to the NVE. In the reverse 464 direction, the NVE may need to communicate network connectivity 465 information back to the hypervisor. Example VM orchestration systems 466 in use today include VMware's vCenter Server, Microsoft's System 467 Center Virtual Machine Manager, and systems based on OpenStack and 468 its associated plugins (e.g., Nova and Neutron). Each can pass 469 information about what virtual networks a VM connects to down to the 470 hypervisor. The protocol used between the VM orchestration system 471 and hypervisors is generally proprietary. 473 It should be noted that VM orchestration systems may not have direct 474 access to all networking-related information a VM uses. For example, 475 a VM may make use of additional IP or MAC addresses that the VM 476 management system is not aware of. 478 4. Network Virtualization Edge (NVE) 480 As introduced in Section 3.2, an NVE is the entity that implements the 481 overlay functionality. This section describes NVEs in more detail. 482 An NVE will have two external interfaces: 484 Tenant System Facing: On the Tenant System facing side, an NVE 485 interacts with the hypervisor (or equivalent entity) to provide 486 the NVO3 service. An NVE will need to be notified when a Tenant 487 System "attaches" to a virtual network (so it can validate the 488 request and set up any state needed to send and receive traffic on 489 behalf of the Tenant System on that VN). Likewise, an NVE will 490 need to be informed when the Tenant System "detaches" from the 491 virtual network so that it can reclaim state and resources 492 appropriately. 494 Data Center Network Facing: On the data center network facing side, 495 an NVE interfaces with the data center underlay network, sending 496 and receiving tunneled TS packets to and from the underlay. The 497 NVE may also run a control protocol with other entities on the 498 network, such as the Network Virtualization Authority. 500 4.1. NVE Co-located With Server Hypervisor 502 When server virtualization is used, the entire NVE functionality will 503 typically be implemented as part of the hypervisor and/or virtual 504 switch on the server. In such cases, the Tenant System interacts 505 with the hypervisor and the hypervisor interacts with the NVE. 506 Because the interaction between the hypervisor and NVE is implemented 507 entirely in software on the server, there is no "on-the-wire" 508 protocol between Tenant Systems (or the hypervisor) and the NVE that 509 needs to be standardized.
While there may be APIs between the NVE 510 and hypervisor to support necessary interaction, the details of such 511 an API are not in scope for the IETF to work on. 513 Implementing NVE functionality entirely on a server has the 514 disadvantage that server CPU resources must be spent implementing the 515 NVO3 functionality. Experimentation with overlay approaches and 516 previous experience with TCP and checksum adapter offloads suggest 517 that offloading certain NVE operations (e.g., encapsulation and 518 decapsulation operations) onto the physical network adapter can 519 produce performance advantages. As has been done with checksum and/ 520 or TCP server offload and other optimization approaches, there may be 521 benefits to offloading common operations onto adapters where 522 possible. Just as important, the addition of an overlay header can 523 disable existing adapter offload capabilities that are generally not 524 prepared to handle the addition of a new header or other operations 525 associated with an NVE. 527 While the exact details of how to split the implementation of 528 specific NVE functionality between a server and its network adapters 529 are an implementation matter and outside the scope of IETF 530 standardization, the NVO3 architecture should be cognizant of and 531 support such separation. Ideally, it may even be possible to bypass 532 the hypervisor completely on critical data path operations so that 533 packets between a TS and its VN can be sent and received without 534 having the hypervisor involved in each individual packet operation. 536 4.2. Split-NVE 538 Another possible scenario leads to the need for a split NVE 539 implementation. An NVE running on a server (e.g., within a 540 hypervisor) could support NVO3 service towards the tenant, but not 541 perform all NVE functions (e.g., encapsulation) directly on the 542 server; some of the actual NVO3 functionality could be implemented on 543 (i.e., offloaded to) an adjacent switch to which the server is 544 attached. While one could imagine a number of link types between a 545 server and the NVE, one simple deployment scenario would involve a 546 server and NVE separated by a simple L2 Ethernet link. A more 547 complicated scenario would have the server and NVE separated by a 548 bridged access network, such as when the NVE resides on a ToR, with 549 an embedded switch residing between servers and the ToR. 551 For the split NVE case, protocols will be needed that allow the 552 hypervisor and NVE to negotiate and set up the necessary state so that 553 traffic sent across the access link between a server and the NVE can 554 be associated with the correct virtual network instance. 555 Specifically, on the access link, traffic belonging to a specific 556 Tenant System would be tagged with a specific VLAN C-TAG that 557 identifies which specific NVO3 virtual network instance it connects 558 to. The hypervisor-NVE protocol would negotiate which VLAN C-TAG to 559 use for a particular virtual network instance. More details of the 560 protocol requirements for functionality between hypervisors and NVEs 561 can be found in [I-D.ietf-nvo3-nve-nva-cp-req].
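As a non-normative illustration of the access-link state that such a negotiation could produce, the following Python sketch models a per-port table that maps locally significant VLAN C-TAGs on the server-facing link to NVO3 virtual network instances. The names and the simple tag-allocation strategy are assumptions of this sketch; they are not mandated by this architecture or by [I-D.ietf-nvo3-nve-nva-cp-req].

      # Non-normative sketch of split-NVE access-port state: which C-TAG on
      # the access link corresponds to which virtual network instance.
      from dataclasses import dataclass, field
      from typing import Dict, Optional

      @dataclass
      class SplitNveAccessPort:
          ctag_to_vn: Dict[int, int] = field(default_factory=dict)
          vn_to_ctag: Dict[int, int] = field(default_factory=dict)

          def bind(self, vn_id: int) -> int:
              """Allocate a free C-TAG for a VN when a TS attaches to it."""
              if vn_id in self.vn_to_ctag:
                  return self.vn_to_ctag[vn_id]
              ctag = next(t for t in range(2, 4095)
                          if t not in self.ctag_to_vn)  # naive allocation
              self.ctag_to_vn[ctag] = vn_id
              self.vn_to_ctag[vn_id] = ctag
              return ctag

          def vn_for_frame(self, ctag: Optional[int]) -> Optional[int]:
              """Map a frame received on the access link to its VN, if known."""
              return self.ctag_to_vn.get(ctag) if ctag is not None else None

In a real deployment, the C-TAG value used for each virtual network instance would be agreed between the hypervisor and the NVE by the hypervisor-NVE protocol rather than chosen unilaterally as in this sketch.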
563 4.2.1. Tenant VLAN handling in Split-NVE Case 565 Preserving tenant VLAN tags across an NVO3 VN as described in 566 Section 3.1.1 poses additional complications in the split-NVE case. 567 The portion of the NVE that performs the encapsulation function needs 568 access to the specific VLAN tags that the Tenant System is using in 569 order to include them in the encapsulated packet. When an NVE is 570 implemented entirely within the hypervisor, the NVE has access to the 571 complete original packet (including any VLAN tags) sent by the 572 tenant. In the split-NVE case, however, the VLAN tag used between 573 the hypervisor and offloaded portions of the NVE normally only 574 identifies the specific VN that traffic belongs to. In order to allow 575 a tenant to preserve VLAN information from end to end between TSs 576 in the split-NVE case, additional mechanisms would be needed (e.g., 577 carrying an additional VLAN tag by using both a C-TAG and an S-TAG as 578 specified in [IEEE-802.1Q]). 580 4.3. NVE State 582 NVEs maintain internal data structures and state to support the 583 sending and receiving of tenant traffic. An NVE may need some or all 584 of the following information: 586 1. An NVE keeps track of which attached Tenant Systems are connected 587 to which virtual networks. When a Tenant System attaches to a 588 virtual network, the NVE will need to create or update local 589 state for that virtual network. When the last Tenant System 590 detaches from a given VN, the NVE can reclaim state associated 591 with that VN. 593 2. For tenant unicast traffic, an NVE maintains a per-VN table of 594 mappings from Tenant System (inner) addresses to remote NVE 595 (outer) addresses. 597 3. For tenant multicast (or broadcast) traffic, an NVE maintains a 598 per-VN table of mappings and other information on how to deliver 599 tenant multicast (or broadcast) traffic. If the underlying 600 network supports IP multicast, the NVE could use IP multicast to 601 deliver tenant traffic. In such a case, the NVE would need to 602 know what IP underlay multicast address to use for a given VN. 603 Alternatively, if the underlying network does not support 604 multicast, a source NVE could use unicast replication to deliver 605 traffic. In such a case, an NVE would need to know which remote 606 NVEs are participating in the VN. An NVE could use both 607 approaches, switching from one mode to the other depending on 608 such factors as bandwidth efficiency and group membership 609 sparseness. [I-D.ietf-nvo3-mcast-framework] discusses the 610 subject of multicast handling in NVO3 in further detail. 612 4. An NVE maintains necessary information to encapsulate outgoing 613 traffic, including what type of encapsulation and what value to 614 use for a Context ID within the encapsulation header. 616 5. In order to deliver incoming encapsulated packets to the correct 617 Tenant Systems, an NVE maintains the necessary information to map 618 incoming traffic to the appropriate VAP (i.e., Tenant System 619 Interface). 621 6. An NVE may find it convenient to maintain additional per-VN 622 information such as QoS settings, Path MTU information, ACLs, 623 etc. 625 4.4. Multi-Homing of NVEs 627 NVEs may be multi-homed. That is, an NVE may have more than one IP 628 address associated with it on the underlay network. Multihoming 629 happens in two different scenarios. First, an NVE may have multiple 630 interfaces connecting it to the underlay. Each of those interfaces 631 will typically have a different IP address, resulting in a specific 632 Tenant Address (on a specific VN) being reachable through the same 633 NVE but through more than one underlay IP address.
Second, a 634 specific Tenant System may be reachable through more than one NVE, 635 each having one or more underlay addresses. In both cases, NVE 636 address mapping functionality needs to support one-to-many mappings 637 and enable a sending NVE to (at a minimum) be able to fail over from 638 one IP address to another, e.g., should a specific NVE underlay 639 address become unreachable. 641 Finally, multi-homed NVEs introduce complexities when source unicast 642 replication is used to implement tenant multicast as described in 643 Section 4.3. Specifically, an NVE should only receive one copy of a 644 replicated packet. 646 Multi-homing is needed to support important use cases. First, a bare 647 metal server may have multiple uplink connections to either the same 648 or different NVEs. Having only a single physical path to an upstream 649 NVE, or indeed, having all traffic flow through a single NVE, would be 650 considered unacceptable in highly-resilient deployment scenarios that 651 seek to avoid single points of failure. Moreover, in today's 652 networks, the availability of multiple paths would require that they 653 be usable in an active-active fashion (e.g., for load balancing). 655 4.5. VAP 657 The VAP is the NVE-side of the interface between the NVE and the TS. 658 Traffic to and from the tenant flows through the VAP. If an NVE runs 659 into difficulties sending traffic received on the VAP, it may need to 660 signal such errors back to the VAP. Because the VAP is an emulation 661 of a physical port, its ability to signal NVE errors is limited and 662 lacks sufficient granularity to reflect all possible errors an NVE 663 may encounter (e.g., inability to reach a particular destination). Some 664 errors, such as an NVE losing all of its connections to the underlay, 665 could be reflected back to the VAP by effectively disabling it. This 666 state change would reflect itself on the TS as an interface going 667 down, allowing the TS to implement interface error handling, e.g., 668 failover, in the same manner as when a physical interface becomes 669 disabled. 671 5. Tenant System Types 673 This section describes a number of special Tenant System types and 674 how they fit into an NVO3 system. 676 5.1. Overlay-Aware Network Service Appliances 678 Some Network Service Appliances [I-D.ietf-nvo3-nve-nva-cp-req] 679 (virtual or physical) provide tenant-aware services. That is, the 680 specific service they provide depends on the identity of the tenant 681 making use of the service. For example, firewalls are now becoming 682 available that support multi-tenancy where a single firewall provides 683 virtual firewall service on a per-tenant basis, using per-tenant 684 configuration rules and maintaining per-tenant state. Such 685 appliances will be aware of the VN an activity corresponds to while 686 processing requests. Unlike server virtualization, which shields VMs 687 from needing to know about multi-tenancy, a Network Service Appliance 688 may explicitly support multi-tenancy. In such cases, the Network 689 Service Appliance itself will be aware of network virtualization and 690 either embed an NVE directly, or implement a split NVE as described 691 in Section 4.2. Unlike server virtualization, however, the Network 692 Service Appliance may not be running a hypervisor and the VM 693 orchestration system may not interact with the Network Service 694 Appliance.
The NVE on such appliances will need to support a control 695 plane to obtain the necessary information needed to fully participate 696 in an NVO3 Domain. 698 5.2. Bare Metal Servers 700 Many data centers will continue to have at least some servers 701 operating as non-virtualized (or "bare metal") machines running a 702 traditional operating system and workload. In such systems, there 703 will be no NVE functionality on the server, and the server will have 704 no knowledge of NVO3 (including whether overlays are even in use). 705 In such environments, the NVE functionality can reside on the first- 706 hop physical switch. In such a case, the network administrator would 707 (manually) configure the switch to enable the appropriate NVO3 708 functionality on the switch port connecting the server and associate 709 that port with a specific virtual network. Such configuration would 710 typically be static, since the server is not virtualized, and once 711 configured, is unlikely to change frequently. Consequently, this 712 scenario does not require any protocol or standards work. 714 5.3. Gateways 716 Gateways on VNs relay traffic onto and off of a virtual network. 717 Tenant Systems use gateways to reach destinations outside of the 718 local VN. Gateways receive encapsulated traffic from one VN, remove 719 the encapsulation header, and send the native packet out onto the 720 data center network for delivery. Outside traffic enters a VN in a 721 reverse manner. 723 Gateways can be either virtual (i.e., implemented as a VM) or 724 physical (i.e., as a standalone physical device). For performance 725 reasons, standalone hardware gateways may be desirable in some cases. 726 Such gateways could consist of a simple switch forwarding traffic 727 from a VN onto the local data center network, or could embed router 728 functionality. On such gateways, network interfaces connecting to 729 virtual networks will (at least conceptually) embed NVE (or split- 730 NVE) functionality within them. As in the case with Network Service 731 Appliances, gateways may not support a hypervisor and will need an 732 appropriate control plane protocol to obtain the information needed 733 to provide NVO3 service. 735 Gateways handle several different use cases. For example, one use 736 case consists of systems supporting overlays together with systems 737 that do not (e.g., bare metal servers). Gateways could be used to 738 connect legacy systems supporting, e.g., L2 VLANs, to specific 739 virtual networks, effectively making them part of the same virtual 740 network. Gateways could also forward traffic between a virtual 741 network and other hosts on the data center network or relay traffic 742 between different VNs. Finally, gateways can provide external 743 connectivity such as Internet or VPN access. 745 5.3.1. Gateway Taxonomy 747 As can be seen from the discussion above, there are several types of 748 gateways that can exist in an NVO3 environment. This section breaks 749 them down into the various types that could be supported. Note that 750 each of the types below could be implemented in either a centralized 751 manner or distributed to co-exist with the NVEs. 753 5.3.1.1. L2 Gateways (Bridging) 755 L2 Gateways act as layer 2 bridges to forward Ethernet frames based 756 on the MAC addresses present in them. 758 L2 VN to Legacy L2: This type of gateway bridges traffic between L2 759 VNs and other legacy L2 networks such as VLANs or L2 VPNs. 
761 L2 VN to L2 VN: The main motivation for this type of gateway is to 762 create separate groups of Tenant Systems using L2 VNs such that 763 the gateway can enforce network policies between each L2 VN. 765 5.3.1.2. L3 Gateways (Only IP Packets) 767 L3 Gateways forward IP packets based on the IP addresses present in 768 the packets. 770 L3 VN to Legacy L2: This type of gateway forwards packets between 771 L3 VNs and legacy L2 networks such as VLANs or L2 VPNs. The 772 MAC address in any frames forwarded onto the legacy L2 773 network would be the MAC address of the gateway. 775 L3 VN to Legacy L3: This type of gateway forwards packets between L3 776 VNs and legacy L3 networks. These legacy L3 networks could be 777 local to the data center, in the WAN, or an L3 VPN. 779 L3 VN to L2 VN: This type of gateway forwards packets between L3 780 VNs and L2 VNs. The MAC address in any frames forwarded 781 onto the L2 VN would be the MAC address of the gateway. 783 L2 VN to L2 VN: This type of gateway acts similarly to a traditional 784 router that forwards between L2 interfaces. The MAC address in 785 any frames forwarded between the L2 VNs would be the MAC 786 address of the gateway. 788 L3 VN to L3 VN: The main motivation for this type of gateway is to 789 create separate groups of Tenant Systems using L3 VNs such that 790 the gateway can enforce network policies between each L3 VN. 792 5.4. Distributed Inter-VN Gateways 794 The relaying of traffic from one VN to another deserves special 795 consideration. Whether traffic is permitted to flow from one VN to 796 another is a matter of policy, and would not (by default) be allowed 797 unless explicitly enabled. In addition, NVAs are the logical place 798 to maintain policy information about allowed inter-VN communication. 799 Policy enforcement for inter-VN communication can be handled in (at 800 least) two different ways. Explicit gateways could be the central 801 point for such enforcement, with all inter-VN traffic forwarded to 802 such gateways for processing. Alternatively, the NVA can provide 803 such information directly to NVEs, by either providing a mapping for 804 a target TS on another VN, or indicating that such communication is 805 disallowed by policy. 807 When inter-VN gateways are centralized, traffic between TSs on 808 different VNs can take suboptimal paths, i.e., triangular routing 809 results in paths that always traverse the gateway. In the worst 810 case, traffic between two TSs connected to the same NVE can be hair- 811 pinned through an external gateway. As an optimization, individual 812 NVEs can be part of a distributed gateway that performs such 813 relaying, reducing or completely eliminating triangular routing. In 814 a distributed gateway, each ingress NVE can perform such relaying 815 activity directly, so long as it has access to the policy information 816 needed to determine whether cross-VN communication is allowed. 817 Having individual NVEs be part of a distributed gateway allows them 818 to tunnel traffic directly to the destination NVE without the need to 819 take suboptimal paths. 821 The NVO3 architecture supports distributed gateways for the case of 822 inter-VN communication. Such support requires that NVO3 control 823 protocols include mechanisms for the maintenance and distribution of 824 policy information about what type of cross-VN communication is 825 allowed so that NVEs acting as distributed gateways can tunnel 826 traffic from one VN to another as appropriate.
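As a non-normative illustration of the decision an ingress NVE acting as part of a distributed gateway might make, the following Python sketch checks NVA-supplied inter-VN policy before resolving the egress NVE for a destination on another VN. The data structures and names are assumptions of this sketch; the actual policy representation and the NVE-to-NVA exchange that populates it are not specified here.

      # Non-normative sketch of an inter-VN relay check at an ingress NVE
      # that is part of a distributed gateway.
      from dataclasses import dataclass, field
      from typing import Dict, Optional, Set, Tuple

      @dataclass
      class DistributedGatewayState:
          # (source VN, destination VN) pairs permitted by policy
          allowed_vn_pairs: Set[Tuple[int, int]] = field(default_factory=set)
          # per-VN mapping of tenant IP address -> egress NVE underlay address
          mappings: Dict[int, Dict[str, str]] = field(default_factory=dict)

          def egress_nve(self, src_vn: int, dst_vn: int,
                         dst_ip: str) -> Optional[str]:
              """Return the egress NVE for dst_ip, or None if policy forbids
              the cross-VN relay or no mapping is known (query the NVA)."""
              if src_vn != dst_vn and (src_vn, dst_vn) not in self.allowed_vn_pairs:
                  return None  # disallowed by inter-VN policy
              return self.mappings.get(dst_vn, {}).get(dst_ip)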
828 Distributed gateways could also be used to distribute other 829 traditional router services to individual NVEs. The NVO3 830 architecture does not preclude such implementations, but does not 831 define or require them as they are outside the scope of the NVO3 832 architecture. 834 5.5. ARP and Neighbor Discovery 836 For an L2 service, strictly speaking, special processing of Address 837 Resolution Protocol (ARP) [RFC0826] (and IPv6 Neighbor Discovery (ND) 838 [RFC4861]) is not required. ARP requests are broadcast, and an NVO3 839 can deliver ARP requests to all members of a given L2 virtual 840 network, just as it does for any packet sent to an L2 broadcast 841 address. Similarly, ND requests are sent via IP multicast, which 842 NVO3 can support by delivering via L2 multicast. However, as a 843 performance optimization, an NVE can intercept ARP (or ND) requests 844 from its attached TSs and respond to them directly using information 845 in its mapping tables. Since an NVE will have mechanisms for 846 determining the NVE address associated with a given TS, the NVE can 847 leverage the same mechanisms to suppress sending ARP and ND requests 848 for a given TS to other members of the VN. The NVO3 architecture 849 supports such a capability. 851 6. NVE-NVE Interaction 853 Individual NVEs will interact with each other for the purposes of 854 tunneling and delivering traffic to remote TSs. At a minimum, a 855 control protocol may be needed for tunnel setup and maintenance. For 856 example, tunneled traffic may need to be encrypted or integrity 857 protected, in which case it will be necessary to set up appropriate 858 security associations between NVE peers. It may also be desirable to 859 perform tunnel maintenance (e.g., continuity checks) on a tunnel in 860 order to detect when a remote NVE becomes unreachable. Such generic 861 tunnel setup and maintenance functions are not generally 862 NVO3-specific. Hence, the NVO3 architecture expects to leverage 863 existing tunnel maintenance protocols rather than defining new ones. 865 Some NVE-NVE interactions may be specific to NVO3 (and in particular 866 be related to information kept in mapping tables) and agnostic to the 867 specific tunnel type being used. For example, when tunneling traffic 868 for TS-X to a remote NVE, it is possible that TS-X is not presently 869 associated with the remote NVE. Normally, this should not happen, 870 but there could be race conditions where the information an NVE has 871 learned from the NVA is out-of-date relative to actual conditions. 872 In such cases, the remote NVE could return an error or warning 873 indication, allowing the sending NVE to attempt a recovery or 874 otherwise attempt to mitigate the situation. 876 The NVE-NVE interaction could signal a range of indications, for 877 example: 879 o "No such TS here", upon a receipt of a tunneled packet for an 880 unknown TS. 882 o "TS-X not here, try the following NVE instead" (i.e., a redirect). 884 o Delivered to correct NVE, but could not deliver packet to TS-X 885 (soft error). 887 o Delivered to correct NVE, but could not deliver packet to TS-X 888 (hard error). 890 When an NVE receives information from a remote NVE that conflicts 891 with the information it has in its own mapping tables, it should 892 consult with the NVA to resolve those conflicts. In particular, it 893 should confirm that the information it has is up-to-date, and it 894 might indicate the error to the NVA, so as to nudge the NVA into 895 following up (as appropriate). 
While it might make sense for an NVE 896 to update its mapping table temporarily in response to an error from 897 a remote NVE, any changes must be handled carefully, as doing so can 898 raise security considerations if the received information cannot be 899 authenticated. That said, a sending NVE might still take steps to 900 mitigate a problem, such as applying rate limiting to data traffic 901 towards a particular NVE or TS. 903 7. Network Virtualization Authority 905 Before sending traffic to and receiving traffic from a virtual network, an 906 NVE must obtain the information needed to build its internal 907 forwarding tables and state as listed in Section 4.3. An NVE can 908 obtain such information from a Network Virtualization Authority. 910 The Network Virtualization Authority (NVA) is the entity that is 911 expected to provide address mapping and other information to NVEs. 912 NVEs can interact with an NVA to obtain any required information they 913 need in order to properly forward traffic on behalf of tenants. The 914 term NVA refers to the overall system, without regard to its scope 915 or how it is implemented. 917 7.1. How an NVA Obtains Information 919 There are two primary ways in which an NVA can obtain the address 920 dissemination information it manages. The NVA can obtain information 921 from the VM orchestration system and/or directly from the 922 NVEs themselves. 924 On virtualized systems, the NVA may be able to obtain the address 925 mapping information associated with VMs from the VM orchestration 926 system itself. If the VM orchestration system contains a master 927 database for all the virtualization information, having the NVA 928 obtain information directly from the orchestration system would be a 929 natural approach. Indeed, the NVA could effectively be co-located 930 with the VM orchestration system itself. In such systems, the VM 931 orchestration system communicates with the NVE indirectly through the 932 hypervisor. 934 However, as described in Section 4, not all NVEs are associated with 935 hypervisors. In such cases, NVAs cannot leverage VM orchestration 936 protocols to interact with an NVE and will instead need to peer 937 directly with them. By peering directly with an NVE, NVAs can obtain 938 information about the TSs connected to that NVE and can distribute 939 information to the NVE about the VNs those TSs are associated with. 940 For example, whenever a Tenant System attaches to an NVE, that NVE 941 would notify the NVA that the TS is now associated with that NVE. 942 Likewise, when a TS detaches from an NVE, that NVE would inform the 943 NVA. By communicating directly with NVEs, both the NVA and the NVE 944 are able to maintain up-to-date information about all active tenants 945 and the NVEs to which they are attached. 947 7.2. Internal NVA Architecture 949 For reliability and fault tolerance reasons, an NVA would be 950 implemented in a distributed or replicated manner without single 951 points of failure. How the NVA is implemented, however, is not 952 important to an NVE so long as the NVA provides a consistent and 953 well-defined interface to the NVE. For example, an NVA could be 954 implemented via database techniques whereby a server stores address 955 mapping information in a traditional (possibly replicated) database. 956 Alternatively, an NVA could be implemented in a distributed fashion 957 using an existing (or modified) routing protocol to maintain and 958 distribute mappings.
So long as there is a clear interface between 959 the NVE and NVA, how an NVA is architected and implemented is not 960 important to an NVE. 962 A number of architectural approaches could be used to implement NVAs 963 themselves. NVAs manage address bindings and distribute them to 964 where they need to go. One approach would be to use Border Gateway 965 Protocol (BGP) [RFC4364] (possibly with extensions) and route 966 reflectors. Another approach could use a transaction-based database 967 model with replicated servers. Because the implementation details 968 are local to an NVA, there is no need to pick exactly one solution 969 technology, so long as the external interfaces to the NVEs (and 970 remote NVAs) are sufficiently well defined to achieve 971 interoperability. 973 7.3. NVA External Interface 975 Conceptually, from the perspective of an NVE, an NVA is a single 976 entity. An NVE interacts with the NVA, and it is the NVA's 977 responsibility to ensure that interactions between the NVE and NVA 978 result in consistent behavior across the NVA and all other NVEs using 979 the same NVA. Because an NVA is built from multiple internal 980 components, an NVA will have to ensure that information flows to all 981 internal NVA components appropriately. 983 One architectural question is how the NVA presents itself to the NVE. 984 For example, an NVA could be required to provide access via a single 985 IP address. If NVEs only have one IP address to interact with, it 986 would be the responsibility of the NVA to handle NVA component 987 failures, e.g., by using a "floating IP address" that migrates among 988 NVA components to ensure that the NVA can always be reached via the 989 one address. Having all NVA accesses through a single IP address, 990 however, adds constraints to implementing robust failover, load 991 balancing, etc. 993 In the NVO3 architecture, an NVA is accessed through one or more IP 994 addresses (or IP address/port combinations). If multiple IP addresses 995 are used, each IP address provides equivalent functionality, meaning 996 that an NVE can use any of the provided addresses to interact with 997 the NVA. Should one address stop working, an NVE is expected to 998 fail over to another. While the different addresses result in 999 equivalent functionality, one address may respond more quickly than 1000 another, e.g., due to network conditions, load on the server, etc. 1002 To provide some control over load balancing, NVA addresses may have 1003 an associated priority. Addresses are used in order of priority, 1004 with no explicit preference among NVA addresses having the same 1005 priority. To provide basic load balancing among NVAs of equal 1006 priorities, NVEs could use some randomization input to select among 1007 equal-priority NVAs. Such a priority scheme facilitates failover and 1008 load balancing, for example, allowing a network operator to specify a 1009 set of primary and backup NVAs. 1011 It may be desirable to have individual NVA addresses responsible for 1012 a subset of information about an NV Domain. In such a case, NVEs 1013 would use different NVA addresses for obtaining or updating 1014 information about particular VNs or TS bindings. A key question with 1015 such an approach is how information would be partitioned, and how an 1016 NVE could determine which address to use to get the information it 1017 needs.
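As a non-normative illustration of the priority-based NVA address handling described above, the following Python sketch tries NVA addresses in priority order, choosing randomly among addresses of equal priority and falling back to lower-priority addresses on failure. The function and parameter names are assumptions of this sketch only.

      # Non-normative sketch of NVA address selection at an NVE.
      import random
      from itertools import groupby
      from typing import Callable, List, Optional, Tuple

      NvaAddress = Tuple[int, str]  # (priority, "ip:port"); lower is preferred

      def select_nva(addresses: List[NvaAddress],
                     is_reachable: Callable[[str], bool]) -> Optional[str]:
          """Return a usable NVA address, preferring higher-priority groups."""
          ordered = sorted(addresses, key=lambda a: a[0])
          for _, group in groupby(ordered, key=lambda a: a[0]):
              candidates = [addr for _, addr in group]
              random.shuffle(candidates)  # spread load within a priority level
              for addr in candidates:
                  if is_reachable(addr):
                      return addr  # otherwise fail over to the next priority
          return None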
1019 Another possibility is to treat the information on which NVA 1020 addresses to use as cached (soft-state) information at the NVEs, so 1021 that any NVA address can be used to obtain any information, but NVEs 1022 are informed of preferences for which addresses to use for particular 1023 information on VNs or TS bindings. That preference information would 1024 be cached for future use to improve behavior - e.g., if all requests 1025 for a specific subset of VNs are forwarded to a specific NVA 1026 component, the NVE can optimize future requests within that subset by 1027 sending them directly to that NVA component via its address. 1029 8. NVE-to-NVA Protocol 1031 As outlined in Section 4.3, an NVE needs certain information in order 1032 to perform its functions. To obtain such information from an NVA, an 1033 NVE-to-NVA protocol is needed. The NVE-to-NVA protocol provides two 1034 functions. First, it allows an NVE to obtain information about the 1035 location and status of other TSs with which it needs to communicate. 1036 Second, the NVE-to-NVA protocol provides a way for NVEs to provide 1037 updates to the NVA about the TSs attached to that NVE (e.g., when a 1038 TS attaches or detaches from the NVE), or about communication errors 1039 encountered when sending traffic to remote NVEs. For example, an NVE 1040 could indicate that a destination it is trying to reach at a 1041 destination NVE is unreachable for some reason. 1043 While having a direct NVE-to-NVA protocol might seem straightforward, 1044 the presence of existing VM orchestration systems complicates the 1045 choices an NVE has for interacting with the NVA. 1047 8.1. NVE-NVA Interaction Models 1049 An NVE interacts with an NVA in at least two (quite different) ways: 1051 o NVEs embedded within the same server as the hypervisor can obtain 1052 necessary information entirely through the hypervisor-facing side 1053 of the NVE. Such an approach is a natural extension to existing 1054 VM orchestration systems supporting server virtualization because 1055 a protocol between the hypervisor and VM orchestration 1056 system already exists and can be leveraged to obtain any needed 1057 information. Specifically, VM orchestration systems used to 1058 create, terminate and migrate VMs already use well-defined (though 1059 typically proprietary) protocols to handle the interactions 1060 between the hypervisor and VM orchestration system. For such 1061 systems, it is a natural extension to leverage the existing 1062 orchestration protocol as a sort of proxy protocol for handling 1063 the interactions between an NVE and the NVA. Indeed, existing 1064 implementations can already do this. 1066 o Alternatively, an NVE can obtain needed information by interacting 1067 directly with an NVA via a protocol operating over the data center 1068 underlay network. Such an approach is needed to support NVEs that 1069 are not associated with systems performing server virtualization 1070 (e.g., as in the case of a standalone gateway) or where the NVE 1071 needs to communicate directly with the NVA for other reasons. 1073 The NVO3 architecture will focus on support for the second model 1074 above. Existing virtualization environments are already using the 1075 first model. But they are not sufficient to cover the case of 1076 standalone gateways -- such gateways may not support virtualization 1077 and do not interface with existing VM orchestration systems. 1079 8.2.
8.2. Direct NVE-NVA Protocol

An NVE can interact directly with an NVA via an NVE-to-NVA protocol. Such a protocol can be either independent of the NVA internal protocol or an extension of it. Using a purpose-specific protocol would provide architectural separation and independence between the NVE and NVA. The NVE and NVA interact in a well-defined way, and changes in the NVA (or NVE) do not need to impact each other. Using a dedicated protocol also ensures that both NVE and NVA implementations can evolve independently and without dependencies on each other. Such independence is important because the upgrade path for NVEs and NVAs is quite different. Upgrading all the NVEs at a site will likely be more difficult in practice than upgrading NVAs because of the large number of NVEs, one on each end device. In practice, it would be prudent to assume that once an NVE has been implemented and deployed, it may be challenging to get subsequent NVE extensions and changes implemented and deployed, whereas an NVA (and its associated internal protocols) is more likely to evolve over time as experience is gained from usage, and upgrades will involve fewer nodes.

Requirements for a direct NVE-NVA protocol can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

8.3. Propagating Information Between NVEs and NVAs

Information flows between NVEs and NVAs in both directions. The NVA maintains information about all VNs in the NV Domain so that NVEs do not need to do so themselves. NVEs obtain from the NVA information about where a given remote TS destination resides. NVAs, in turn, obtain information from NVEs about the individual TSs attached to those NVEs.

While the NVA could push information relevant to every virtual network to every NVE, such an approach scales poorly and is unnecessary. In practice, a given NVE will only need and want to know about VNs to which it is attached. Thus, an NVE should be able to subscribe to updates only for the virtual networks it is interested in receiving updates for. The NVO3 architecture supports a model where an NVE is not required to have full mapping tables for all virtual networks in an NV Domain.

Before sending unicast traffic to a remote TS (or TSes for broadcast or multicast traffic), an NVE must know where the remote TS(es) currently reside. When a TS attaches to a virtual network, the NVE obtains information about that VN from the NVA. The NVA can provide that information to the NVE at the time the TS attaches to the VN, either because the NVE requests the information when the attach operation occurs or because the VM orchestration system has initiated the attach operation and provides the associated mapping information to the NVE at the same time.

There are scenarios where an NVE may wish to query the NVA about individual mappings within a VN. For example, when sending traffic to a remote TS on a remote NVE, that TS may become unavailable (e.g., because it has migrated elsewhere or has been shut down, in which case the remote NVE may return an error indication). In such situations, the NVE may need to query the NVA to obtain updated mapping information for a specific TS or to verify that the information is still correct despite the error condition. Note that such a query could also be used by the NVA as an indication that there may be an inconsistency in the network and that it should take steps to verify that the information it has about the current state and location of a specific TS is still correct.

For very large virtual networks, the amount of state an NVE needs to maintain for a given virtual network could be significant. Moreover, an NVE may only be communicating with a small subset of the TSs on such a virtual network. In such cases, the NVE may find it desirable to maintain state only for those destinations it is actively communicating with. In such scenarios, an NVE may not want to maintain full mapping information about all destinations on a VN. Should it then need to communicate with a destination for which it does not have mapping information, however, it will need to be able to query the NVA on demand for the missing information on a per-destination basis.

The NVO3 architecture will need to support a range of operations between the NVE and NVA. Requirements for those operations can be found in [I-D.ietf-nvo3-nve-nva-cp-req].
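The on-demand, per-destination behavior described above can be illustrated with the following non-normative sketch of an NVE-side mapping cache (in Python; the names are illustrative, and query_nva() stands in for whatever NVE-to-NVA protocol is actually used).

   # Non-normative sketch: an NVE-side mapping cache for Section 8.3.
   # Mappings are fetched from the NVA on demand (per destination),
   # cached, and refreshed when a remote NVE reports a delivery error.

   import time

   class VnMappingCache:
       def __init__(self, vn_id, query_nva):
           self.vn_id = vn_id
           self.query_nva = query_nva   # callable: (vn_id, ts_addr) ->
                                        #   (nve_underlay_addr, ttl)
           self.cache = {}              # ts_addr -> (underlay, expiry)

       def lookup(self, ts_addr):
           # Return the underlay address of the NVE hosting ts_addr,
           # querying the NVA only on a cache miss or expired entry.
           entry = self.cache.get(ts_addr)
           if entry and entry[1] > time.monotonic():
               return entry[0]
           underlay, ttl = self.query_nva(self.vn_id, ts_addr)
           self.cache[ts_addr] = (underlay, time.monotonic() + ttl)
           return underlay

       def handle_delivery_error(self, ts_addr):
           # A remote NVE signalled that ts_addr is unreachable (e.g.,
           # the TS migrated or was shut down): drop the stale entry
           # and re-query, which also gives the NVA a hint that its own
           # state for this TS may need verification.
           self.cache.pop(ts_addr, None)
           return self.lookup(ts_addr)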
9. Federated NVAs

An NVA provides service to the set of NVEs in its NV Domain. Each NVA manages network virtualization information for the virtual networks within its NV Domain. An NV Domain is administered by a single entity.

In some cases, it will be necessary to expand the scope of a specific VN or even an entire NV Domain beyond a single NVA. For example, an administrator managing multiple data centers may wish to operate all of those data centers as a single NV region. Such cases are handled by having different NVAs peer with each other to exchange mapping information about specific VNs. NVAs operate in a federated manner, with a set of NVAs operating as a loosely coupled federation of individual NVAs. If a virtual network spans multiple NVAs (e.g., located at different data centers) and an NVE needs to deliver tenant traffic to an NVE that is part of a different NV Domain, it still interacts only with its NVA, even when obtaining mappings for NVEs associated with a different NV Domain.

Figure 3 shows a scenario where two separate NV Domains (A and B) share information about Virtual Network "1217". VM1 and VM2 both connect to the same Virtual Network 1217, even though the two VMs are in separate NV Domains. There are two cases to consider. In the first case, NV Domain B does not allow NVE-A to tunnel traffic directly to NVE-B. There could be a number of reasons for this. For example, NV Domains A and B may not share a common address space (i.e., traversal through a NAT device is required), or for policy reasons, a domain might require that all traffic between separate NV Domains be funneled through a particular device (e.g., a firewall). In such cases, NVA-2 will advertise to NVA-1 that VM2 on Virtual Network 1217 is available and will direct that traffic between the two nodes go through IP-G (an IP gateway). IP-G would then decapsulate received traffic from one NV Domain, translate it appropriately for the other domain, and re-encapsulate the packet for delivery.
   [Figure 3 (ASCII art not reproduced here): VM1 attaches via NVE-A to
   NV Domain A, which is served by NVA-1; VM2 attaches via NVE-B to
   NV Domain B, which is served by NVA-2; gateway IP-G sits between the
   two NV Domains.]

            Figure 3: VM1 and VM2 are in different NV Domains.

In the second case, NV Domain B does allow NVE-A to tunnel traffic directly to NVE-B. In that case, NVA-2 advertises to NVA-1 the mapping for VM2 (i.e., NVE-B's underlay address), and traffic flows directly between the two NVEs without passing through IP-G.

NVAs at one site share information and interact with NVAs at other sites, but only in a controlled manner. It is expected that policy and access control will be applied at the boundaries between different sites (and NVAs) so as to minimize dependencies on external NVAs that could negatively impact the operation within a site. It is an architectural principle that operations involving NVAs at one site not be immediately impacted by failures or errors at another site. (Of course, communication between NVEs in different NV Domains may be impacted by such failures or errors.) It is a strong requirement that an NVA continue to operate properly for local NVEs even if external communication is interrupted (e.g., should communication between a local and remote NVA fail).

At a high level, a federation of interconnected NVAs has some analogies to BGP and Autonomous Systems. Like an Autonomous System, NVAs at one site are managed by a single administrative entity and do not interact with external NVAs except as allowed by policy. Likewise, the interface between NVAs at different sites is well defined, so that the internal details of operations at one site are largely hidden from other sites. Finally, an NVA only peers with other NVAs that it has a trust relationship with, i.e., where a VN is intended to span multiple NVAs.

Reasons for using a federated model include:

o  Provide isolation among NVAs operating at different sites at different geographic locations.

o  Control the quantity and rate of information updates that flow (and must be processed) between different NVAs in different data centers.

o  Control the set of external NVAs (and external sites) a site peers with. A site will only peer with other sites that are cooperating in providing an overlay service.

o  Allow policy to be applied between sites. A site will want to carefully control what information it exports (and to whom) as well as what information it is willing to import (and from whom). A sketch of such export and import filtering follows this list.

o  Allow different protocols and architectures to be used for intra- vs. inter-NVA communication. For example, within a single data center, a replicated transaction server using database techniques might be an attractive implementation option for an NVA, and protocols optimized for intra-NVA communication would likely be different from protocols involving inter-NVA communication between different sites.

o  Allow for optimized protocols rather than a one-size-fits-all approach. Within a data center, networks tend to have lower latency, higher speed, and higher redundancy when compared with WAN links interconnecting data centers. The design constraints and tradeoffs for a protocol operating within a data center network are different from those operating over WAN links. While a single protocol could be used for both cases, there could be advantages to using different and more specialized protocols for the intra- and inter-NVA cases.
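As a non-normative illustration of such export and import control, the following sketch shows a per-peer policy check that a federated NVA might apply before advertising or accepting VN mappings (in Python; the peer names, VN numbers, and policy structure are illustrative only).

   # Non-normative sketch of per-peer export/import policy at a
   # federated NVA.  Before advertising a VN mapping to a peer NVA at
   # another site, the local NVA checks whether that VN is intended to
   # span that peer; mappings received from a peer are filtered in the
   # same way.  Peer names and VN numbers are illustrative only.

   EXPORT_POLICY = {
       # peer NVA identifier -> set of VN Contexts it may learn about
       "nva.site-b.example": {1217, 2001},
       "nva.site-c.example": {2001},
   }

   IMPORT_POLICY = {
       # peer NVA identifier -> set of VN Contexts accepted from it
       "nva.site-b.example": {1217},
   }

   def exportable(peer, vn_id):
       # Advertise a mapping only if policy explicitly allows it.
       return vn_id in EXPORT_POLICY.get(peer, set())

   def acceptable(peer, vn_id):
       # Discard mappings for VNs never agreed to be imported from peer.
       return vn_id in IMPORT_POLICY.get(peer, set())

   def filter_advertisements(peer, mappings):
       # mappings: iterable of (vn_id, tenant_addr, underlay_addr)
       return [m for m in mappings if exportable(peer, m[0])]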
9.1. Inter-NVA Peering

To support peering between different NVAs, an inter-NVA protocol is needed. The inter-NVA protocol defines what information is exchanged between NVAs. It is assumed that the protocol will be used to share addressing information between data centers and must scale well over WAN links.

10. Control Protocol Work Areas

The NVO3 architecture consists of two major distinct entities: NVEs and NVAs. In order to provide isolation and independence between these two entities, the NVO3 architecture calls for well-defined protocols for interfacing between them. For an individual NVA, the architecture calls for a logically centralized entity that could be implemented in a distributed or replicated fashion. While the IETF may choose to define one or more specific architectural approaches to building individual NVAs, there is little need for it to pick exactly one approach to the exclusion of others. An NVA for a single domain will likely be deployed as a single vendor product; thus, there is little benefit in standardizing the internal structure of an NVA.

Individual NVAs peer with each other in a federated manner. The NVO3 architecture calls for a well-defined interface between NVAs.

Finally, a hypervisor-to-NVE protocol is needed to cover the split-NVE scenario described in Section 4.2.

11. NVO3 Data Plane Encapsulation

When tunneling tenant traffic, NVEs add an encapsulation header to the original tenant packet. The exact encapsulation to use for NVO3 does not seem to be critical. The main requirement is that the encapsulation support a Context ID of sufficient size [I-D.ietf-nvo3-dataplane-requirements]. A number of encapsulations already exist that provide a VN Context of sufficient size for NVO3. For example, VXLAN [RFC7348] has a 24-bit VXLAN Network Identifier (VNI), NVGRE [RFC7637] has a 24-bit Tenant Network ID (TNI), and MPLS-over-GRE provides a 20-bit label field. While there is widespread recognition that a 12-bit VN Context would be too small (only 4096 distinct values), it is generally agreed that 20 bits (about 1 million distinct values) and 24 bits (about 16.8 million distinct values) are sufficient for a wide variety of deployment scenarios.
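As a non-normative illustration of these Context ID sizes, the following sketch computes the number of distinct values each size provides and packs a 24-bit VNI into the 8-octet VXLAN header layout defined in [RFC7348] (in Python; the helper function is illustrative only).

   # Non-normative illustration of VN Context sizes from Section 11.

   for bits in (12, 20, 24):
       print(f"{bits}-bit VN Context -> {2**bits:,} distinct values")
   # 12-bit VN Context -> 4,096 distinct values
   # 20-bit VN Context -> 1,048,576 distinct values
   # 24-bit VN Context -> 16,777,216 distinct values

   # Packing a 24-bit VNI into a VXLAN-style 8-octet header
   # (8 flag bits, 24 reserved bits, 24-bit VNI, 8 reserved bits),
   # following the header format described in RFC 7348.
   import struct

   def vxlan_header(vni):
       assert 0 <= vni < 2**24, "VNI must fit in 24 bits"
       return struct.pack("!II", 0x08 << 24, vni << 8)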
12. Operations and Management

The simplicity of operating and debugging overlay networks will be critical for successful deployment. Some architectural choices can facilitate or hinder OAM. Related OAM drafts include [I-D.ashwood-nvo3-operational-requirement].

13. Summary

This document presents the overall architecture for Network Virtualization Overlays (NVO3). The architecture calls for three main areas of protocol work:

1.  A hypervisor-to-NVE protocol to support split-NVEs, as discussed in Section 4.2.

2.  An NVE-to-NVA protocol for disseminating VN information (e.g., inner-to-outer address mappings).

3.  An NVA-to-NVA protocol for exchange of information about specific virtual networks between federated NVAs.

It should be noted that existing protocols or extensions of existing protocols are applicable.

14. Acknowledgments

Helpful comments and improvements to this document have come from Lizhong Jin, Anton Ivanov, Dennis (Xiaohong) Qin, Erik Smith, Ziye Yang, and Lucy Yong.

15. IANA Considerations

This memo includes no request to IANA.

16. Security Considerations

The data plane and control plane described in this architecture will need to address potential security threats.

For the data plane, tunneled application traffic may need protection against being misdelivered, being modified, or having its content exposed to an inappropriate third party. In all cases, encryption between authenticated tunnel endpoints can be used to mitigate risks.

For the control plane, between NVAs, between the NVA and NVE, as well as between different components of the split-NVE approach, a combination of authentication and encryption can be used. All entities will need to properly authenticate with each other and enable encryption for their interactions as appropriate to protect sensitive information.

Leakage of sensitive information about users or other entities associated with VMs whose traffic is virtualized can also be covered by using encryption for the control plane protocols.
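As a non-normative example of one way such control plane protection might be realized, the following sketch shows an NVE opening a mutually authenticated, encrypted connection to an NVA using TLS with client certificates (in Python; the choice of TLS, the port parameter, and the file names are illustrative assumptions, not mechanisms mandated by this architecture).

   # Non-normative sketch: an NVE opening a mutually authenticated,
   # encrypted (TLS) connection to an NVA for control plane traffic.

   import socket
   import ssl

   def connect_to_nva(nva_addr, nva_port):
       ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
       ctx.load_verify_locations("nvo3-ca.pem")            # NVA trust anchor
       ctx.load_cert_chain("nve-cert.pem", "nve-key.pem")  # NVE identity
       ctx.check_hostname = True
       raw = socket.create_connection((nva_addr, nva_port))
       # The NVA is expected to require and verify the NVE's certificate,
       # giving mutual authentication; the TLS record layer provides
       # confidentiality and integrity for NVE-to-NVA messages.
       return ctx.wrap_socket(raw, server_hostname=nva_addr)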
17. Informative References

[I-D.ashwood-nvo3-operational-requirement]
           Ashwood-Smith, P., Iyengar, R., Tsou, T., Sajassi, A., Boucadair, M., Jacquenet, C., and M. Daikoku, "NVO3 Operational Requirements", draft-ashwood-nvo3-operational-requirement-03 (work in progress), July 2013.

[I-D.ietf-nvo3-dataplane-requirements]
           Bitar, N., Lasserre, M., Balus, F., Morin, T., Jin, L., and B. Khasnabish, "NVO3 Data Plane Requirements", draft-ietf-nvo3-dataplane-requirements-03 (work in progress), April 2014.

[I-D.ietf-nvo3-mcast-framework]
           Ghanwani, A., Dunbar, L., McBride, M., Bannai, V., and R. Krishnan, "A Framework for Multicast in NVO3", draft-ietf-nvo3-mcast-framework-04 (work in progress), February 2016.

[I-D.ietf-nvo3-nve-nva-cp-req]
           Kreeger, L., Dutt, D., Narten, T., and D. Black, "Network Virtualization NVE to NVA Control Protocol Requirements", draft-ietf-nvo3-nve-nva-cp-req-05 (work in progress), March 2016.

[IEEE-802.1Q]
           IEEE, "IEEE Standard for Local and metropolitan area networks: Bridges and Bridged Networks", IEEE Std 802.1Q-2014, November 2014.

[RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware", STD 37, RFC 826, DOI 10.17487/RFC0826, November 1982.

[RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 2006.

[RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, DOI 10.17487/RFC4861, September 2007.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

[RFC7364]  Narten, T., Ed., Gray, E., Ed., Black, D., Fang, L., Kreeger, L., and M. Napierala, "Problem Statement: Overlays for Network Virtualization", RFC 7364, DOI 10.17487/RFC7364, October 2014.

[RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. Rekhter, "Framework for Data Center (DC) Network Virtualization", RFC 7365, DOI 10.17487/RFC7365, October 2014.

[RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015.

Authors' Addresses

   David Black
   EMC

   Email: david.black@emc.com

   Jon Hudson
   Independent

   Email: jon.hudson@gmail.com

   Lawrence Kreeger
   Cisco

   Email: kreeger@cisco.com

   Marc Lasserre
   Independent

   Email: mmlasserre@gmail.com

   Thomas Narten
   IBM

   Email: narten@us.ibm.com