Internet Engineering Task Force                                 D. Black
Internet-Draft                                                       EMC
Intended status: Informational                                 J. Hudson
Expires: September 10, 2015                                      Brocade
                                                              L. Kreeger
                                                                   Cisco
                                                             M. Lasserre
                                                          Alcatel-Lucent
                                                               T. Narten
                                                                     IBM
                                                           March 9, 2015

             An Architecture for Overlay Networks (NVO3)
                        draft-ietf-nvo3-arch-03

Abstract

   This document presents a high-level overview architecture for
   building overlay networks in NVO3.  The architecture is given at a
   high-level, showing the major components of an overall system.  An
   important goal is to divide the space into individual smaller
   components that can be implemented independently and with clear
   interfaces and interactions with other components.  It should be
   possible to build and implement individual components in isolation
   and have them work with other components with no changes to other
   components.  That way implementers have flexibility in implementing
   individual components and can optimize and innovate within their
   respective components without requiring changes to other components.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 10, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (http://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document. Please review these documents 57 carefully, as they describe your rights and restrictions with respect 58 to this document. Code Components extracted from this document must 59 include Simplified BSD License text as described in Section 4.e of 60 the Trust Legal Provisions and are provided without warranty as 61 described in the Simplified BSD License. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 67 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 4 68 3.1. VN Service (L2 and L3) . . . . . . . . . . . . . . . . . 6 69 3.1.1. VLAN Tags in L2 Service . . . . . . . . . . . . . . . 7 70 3.1.2. TTL Considerations . . . . . . . . . . . . . . . . . 7 71 3.2. Network Virtualization Edge (NVE) . . . . . . . . . . . . 7 72 3.3. Network Virtualization Authority (NVA) . . . . . . . . . 9 73 3.4. VM Orchestration Systems . . . . . . . . . . . . . . . . 9 74 4. Network Virtualization Edge (NVE) . . . . . . . . . . . . . . 11 75 4.1. NVE Co-located With Server Hypervisor . . . . . . . . . . 11 76 4.2. Split-NVE . . . . . . . . . . . . . . . . . . . . . . . . 12 77 4.2.1. Tenant VLAN handling in Split-NVE Case . . . . . . . 12 78 4.3. NVE State . . . . . . . . . . . . . . . . . . . . . . . . 13 79 4.4. Multi-Homing of NVEs . . . . . . . . . . . . . . . . . . 14 80 4.5. VAP . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 81 5. Tenant System Types . . . . . . . . . . . . . . . . . . . . . 15 82 5.1. Overlay-Aware Network Service Appliances . . . . . . . . 15 83 5.2. Bare Metal Servers . . . . . . . . . . . . . . . . . . . 15 84 5.3. Gateways . . . . . . . . . . . . . . . . . . . . . . . . 16 85 5.3.1. Gateway Taxonomy . . . . . . . . . . . . . . . . . . 16 86 5.3.1.1. L2 Gateways (Bridging) . . . . . . . . . . . . . 16 87 5.3.1.2. L3 Gateways (Only IP Packets) . . . . . . . . . . 17 88 5.4. Distributed Inter-VN Gateways . . . . . . . . . . . . . . 17 89 5.5. ARP and Neighbor Discovery . . . . . . . . . . . . . . . 18 90 6. NVE-NVE Interaction . . . . . . . . . . . . . . . . . . . . . 18 91 7. Network Virtualization Authority . . . . . . . . . . . . . . 20 92 7.1. How an NVA Obtains Information . . . . . . . . . . . . . 20 93 7.2. Internal NVA Architecture . . . . . . . . . . . . . . . . 21 94 7.3. NVA External Interface . . . . . . . . . . . . . . . . . 21 95 8. NVE-to-NVA Protocol . . . . . . . . . . . . . . . . . . . . . 22 96 8.1. NVE-NVA Interaction Models . . . . . . . . . . . . . . . 23 97 8.2. Direct NVE-NVA Protocol . . . . . . . . . . . . . . . . . 23 98 8.3. Propagating Information Between NVEs and NVAs . . . . . . 24 99 9. Federated NVAs . . . . . . . . . . . . . . . . . . . . . . . 25 100 9.1. Inter-NVA Peering . . . . . . . . . . . . . . . . . . . . 27 101 10. Control Protocol Work Areas . . . . . . . . . . . . . . . . . 27 102 11. NVO3 Data Plane Encapsulation . . . . . . . . . . . . . . . . 28 103 12. Operations and Management . . . . . . . . . . . . . . . . . . 28 104 13. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 105 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 29 106 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 107 16. Security Considerations . . . . . . . . . . . . . . . . . . . 
                                                                      29
   17. Informative References  . . . . . . . . . . . . . . . . . . .  29
   Appendix A.  Change Log . . . . . . . . . . . . . . . . . . . . .  30
     A.1.  Changes From draft-ietf-nvo3-arch-02 to -03 . . . . . . .  30
     A.2.  Changes From draft-ietf-nvo3-arch-01 to -02 . . . . . . .  31
     A.3.  Changes From draft-ietf-nvo3-arch-00 to -01 . . . . . . .  31
     A.4.  Changes From draft-narten-nvo3 to draft-ietf-nvo3 . . . .  31
     A.5.  Changes From -00 to -01 (of draft-narten-nvo3-arch) . . .  31
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  32

1.  Introduction

   This document presents a high-level architecture for building overlay
   networks in NVO3.  The architecture is given at a high-level, showing
   the major components of an overall system.  An important goal is to
   divide the space into smaller individual components that can be
   implemented independently and with clear interfaces and interactions
   with other components.  It should be possible to build and implement
   individual components in isolation and have them work with other
   components with no changes to other components.  That way
   implementers have flexibility in implementing individual components
   and can optimize and innovate within their respective components
   without necessarily requiring changes to other components.

   The motivation for overlay networks is given in [RFC7364].
   "Framework for DC Network Virtualization" [RFC7365] provides a
   framework for discussing overlay networks generally and the various
   components that must work together in building such systems.  This
   document differs from the framework document in that it doesn't
   attempt to cover all possible approaches within the general design
   space.  Rather, it describes one particular approach.

2.  Terminology

   This document uses the same terminology as [RFC7365].  In addition,
   the following terms are used:

   NV Domain  A Network Virtualization Domain is an administrative
      construct that defines a Network Virtualization Authority (NVA),
      the set of Network Virtualization Edges (NVEs) associated with
      that NVA, and the set of virtual networks the NVA manages and
      supports.  NVEs are associated with a (logically centralized) NVA,
      and an NVE supports communication for any of the virtual networks
      in the domain.

   NV Region  A region over which information about a set of virtual
      networks is shared.  In the degenerate case, an NV Region
      corresponds to a single NV Domain.  The more interesting case
      occurs when two or more NV Domains share information about part
      or all of a set of virtual networks that they manage.  Two NVAs
      share information about particular virtual networks for the
      purpose of supporting connectivity between tenants located in
      different NV Domains.  NVAs can share information about an entire
      NV Domain or just individual virtual networks.

   Tenant System Interface (TSI)  Interface to a Virtual Network as
      presented to a Tenant System.  The TSI logically connects to the
      NVE via a Virtual Access Point (VAP).  To the Tenant System, the
      TSI is like a NIC; the TSI presents itself to a Tenant System as a
      normal network interface.

   VLAN  Unless stated otherwise, the terms VLAN and VLAN Tag are used
      in this document to denote a C-VLAN [IEEE-802.1Q], and the terms
      are used interchangeably to improve readability.

3.
Background 175 Overlay networks are an approach for providing network virtualization 176 services to a set of Tenant Systems (TSs) [RFC7365]. With overlays, 177 data traffic between tenants is tunneled across the underlying data 178 center's IP network. The use of tunnels provides a number of 179 benefits by decoupling the network as viewed by tenants from the 180 underlying physical network across which they communicate. 182 Tenant Systems connect to Virtual Networks (VNs), with each VN having 183 associated attributes defining properties of the network, such as the 184 set of members that connect to it. Tenant Systems connected to a 185 virtual network typically communicate freely with other Tenant 186 Systems on the same VN, but communication between Tenant Systems on 187 one VN and those external to the VN (whether on another VN or 188 connected to the Internet) is carefully controlled and governed by 189 policy. 191 A Network Virtualization Edge (NVE) [RFC7365] is the entity that 192 implements the overlay functionality. An NVE resides at the boundary 193 between a Tenant System and the overlay network as shown in Figure 1. 194 An NVE creates and maintains local state about each Virtual Network 195 for which it is providing service on behalf of a Tenant System. 197 +--------+ +--------+ 198 | Tenant +--+ +----| Tenant | 199 | System | | (') | System | 200 +--------+ | ................ ( ) +--------+ 201 | +-+--+ . . +--+-+ (_) 202 | | NVE|--. .--| NVE| | 203 +--| | . . | |---+ 204 +-+--+ . . +--+-+ 205 / . . 206 / . L3 Overlay . +--+-++--------+ 207 +--------+ / . Network . | NVE|| Tenant | 208 | Tenant +--+ . .- -| || System | 209 | System | . . +--+-++--------+ 210 +--------+ ................ 211 | 212 +----+ 213 | NVE| 214 | | 215 +----+ 216 | 217 | 218 ===================== 219 | | 220 +--------+ +--------+ 221 | Tenant | | Tenant | 222 | System | | System | 223 +--------+ +--------+ 225 Figure 1: NVO3 Generic Reference Model 227 The following subsections describe key aspects of an overlay system 228 in more detail. Section 3.1 describes the service model (Ethernet 229 vs. IP) provided to Tenant Systems. Section 3.2 describes NVEs in 230 more detail. Section 3.3 introduces the Network Virtualization 231 Authority, from which NVEs obtain information about virtual networks. 232 Section 3.4 provides background on VM orchestration systems and their 233 use of virtual networks. 235 3.1. VN Service (L2 and L3) 237 A Virtual Network provides either L2 or L3 service to connected 238 tenants. For L2 service, VNs transport Ethernet frames, and a Tenant 239 System is provided with a service that is analogous to being 240 connected to a specific L2 C-VLAN. L2 broadcast frames are generally 241 delivered to all (and multicast frames delivered to a subset of) the 242 other Tenant Systems on the VN. To a Tenant System, it appears as if 243 they are connected to a regular L2 Ethernet link. Within NVO3, 244 tenant frames are tunneled to remote NVEs based on the MAC addresses 245 of the frame headers as originated by the Tenant System. On the 246 underlay, NVO3 packets are forwarded between NVEs based on the outer 247 addresses of tunneled packets. 249 For L3 service, VNs transport IP datagrams, and a Tenant System is 250 provided with a service that only supports IP traffic. Within NVO3, 251 tenant frames are tunneled to remote NVEs based on the IP addresses 252 of the packet originated by the Tenant System; any L2 destination 253 addresses provided by Tenant Systems are effectively ignored. 
For L3 254 service, the Tenant System will be configured with an IP subnet that 255 is effectively a point-to-point link, i.e., having only the Tenant 256 System and a next-hop router address on it. 258 L2 service is intended for systems that need native L2 Ethernet 259 service and the ability to run protocols directly over Ethernet 260 (i.e., not based on IP). L3 service is intended for systems in which 261 all the traffic can safely be assumed to be IP. It is important to 262 note that whether NVO3 provides L2 or L3 service to a Tenant System, 263 the Tenant System does not generally need to be aware of the 264 distinction. In both cases, the virtual network presents itself to 265 the Tenant System as an L2 Ethernet interface. An Ethernet interface 266 is used in both cases simply as a widely supported interface type 267 that essentially all Tenant Systems already support. Consequently, 268 no special software is needed on Tenant Systems to use an L3 vs. an 269 L2 overlay service. 271 NVO3 can also provide a combined L2 and L3 service to tenants. A 272 combined service provides L2 service for intra-VN communication, but 273 also provides L3 service for L3 traffic entering or leaving the VN. 274 Architecturally, the handling of a combined L2/L3 service in NVO3 is 275 intended to match what is commonly done today in non-overlay 276 environments by devices providing a combined bridge/router service. 277 With combined service, the virtual network itself retains the 278 semantics of L2 service and all traffic is processed according to its 279 L2 semantics. In addition, however, traffic requiring IP processing 280 is also processed at the IP level. 282 The IP processing for a combined service can be implemented on a 283 standalone device attached to the virtual network (e.g., an IP 284 router) or implemented locally on the NVE (see Section 5.4 on 285 Distributed Gateways). For unicast traffic, NVE implementation of a 286 combined service may result in a packet being delivered to another TS 287 attached to the same NVE (on either the same or a different VN) or 288 tunneled to a remote NVE, or even forwarded outside the NVO3 domain. 289 For multicast or broadcast packets, the combination of NVE L2 and L3 290 processing may result in copies of the packet receiving both L2 and 291 L3 treatments to realize delivery to all of the destinations 292 involved. This distributed NVE implementation of IP routing results 293 in the same network delivery behavior as if the L2 processing of the 294 packet included delivery of the packet to an IP router attached to 295 the L2 VN as a TS, with the router having additional network 296 attachments to other networks, either virtual or not. 298 3.1.1. VLAN Tags in L2 Service 300 An NVO3 L2 virtual network service may include encapsulated L2 VLAN 301 tags provided by a Tenant System, but does not use encapsulated tags 302 in deciding where and how to forward traffic. Such VLAN tags can be 303 passed through, so that Tenant Systems that send or expect to receive 304 them can be supported as appropriate. 306 The processing of VLAN tags that an NVE receives from a TS is 307 controlled by settings associated with the VAP. Just as in the case 308 with ports on Ethernet switches, a number of settings could be 309 imagined. For example, C-TAGs can be passed through transparently, 310 they could always be stripped upon receipt from a Tenant System, they 311 could be compared against a list of explicitly configured tags, etc. 
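   As a rough sketch of the kind of per-VAP settings described above
   (Python; the setting names and the three modes shown are assumptions
   of this sketch, not anything defined by NVO3):

      from dataclasses import dataclass, field
      from typing import Optional, Set

      @dataclass
      class VapTagPolicy:
          # Per-VAP handling of C-TAGs received from the attached
          # Tenant System.
          mode: str = "passthrough"    # "passthrough" | "strip" | "filter"
          allowed_vids: Set[int] = field(default_factory=set)

      def ingress_c_vid(policy: VapTagPolicy,
                        c_vid: Optional[int]) -> Optional[int]:
          """Return the C-VID to carry inside the encapsulated frame, or
          None if the tag is stripped; raise if the VAP rejects it."""
          if policy.mode == "strip":
              return None              # tag removed on receipt from the TS
          if policy.mode == "filter" and c_vid not in policy.allowed_vids:
              raise ValueError("C-VID not permitted on this VAP: %r" % (c_vid,))
          return c_vid                 # passed through transparently

   Which of these behaviors a given VAP actually offers, and how they
   are configured, remains an implementation and deployment matter.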
313 Note that the handling of C-VIDs has additional complications, as 314 described in Section 4.2.1 below. 316 3.1.2. TTL Considerations 318 For L3 service, Tenant Systems should expect the TTL of the packets 319 they send to be decremented by at least 1. For L2 service, the TTL 320 on packets (when the packet is IP) is not modified. 322 3.2. Network Virtualization Edge (NVE) 324 Tenant Systems connect to NVEs via a Tenant System Interface (TSI). 325 The TSI logically connects to the NVE via a Virtual Access Point 326 (VAP) and each VAP is associated with one Virtual Network as shown in 327 Figure 2. To the Tenant System, the TSI is like a NIC; the TSI 328 presents itself to a Tenant System as a normal network interface. On 329 the NVE side, a VAP is a logical network port (virtual or physical) 330 into a specific virtual network. Note that two different Tenant 331 Systems (and TSIs) attached to a common NVE can share a VAP (e.g., 332 TS1 and TS2 in Figure 2) so long as they connect to the same Virtual 333 Network. 335 | Data Center Network (IP) | 336 | | 337 +-----------------------------------------+ 338 | | 339 | Tunnel Overlay | 340 +------------+---------+ +---------+------------+ 341 | +----------+-------+ | | +-------+----------+ | 342 | | Overlay Module | | | | Overlay Module | | 343 | +---------+--------+ | | +---------+--------+ | 344 | | | | | | 345 NVE1 | | | | | | NVE2 346 | +--------+-------+ | | +--------+-------+ | 347 | | VNI1 VNI2 | | | | VNI1 VNI2 | | 348 | +-+----------+---+ | | +-+-----------+--+ | 349 | | VAP1 | VAP2 | | | VAP1 | VAP2| 350 +----+----------+------+ +----+-----------+-----+ 351 | | | | 352 |\ | | | 353 | \ | | /| 354 -------+--\-------+-------------------+---------/-+------- 355 | \ | Tenant | / | 356 TSI1 |TSI2\ | TSI3 TSI1 TSI2/ TSI3 357 +---+ +---+ +---+ +---+ +---+ +---+ 358 |TS1| |TS2| |TS3| |TS4| |TS5| |TS6| 359 +---+ +---+ +---+ +---+ +---+ +---+ 361 Figure 2: NVE Reference Model 363 The Overlay Module performs the actual encapsulation and 364 decapsulation of tunneled packets. The NVE maintains state about the 365 virtual networks it is a part of so that it can provide the Overlay 366 Module with such information as the destination address of the NVE to 367 tunnel a packet to, or the Context ID that should be placed in the 368 encapsulation header to identify the virtual network that a tunneled 369 packet belongs to. 371 On the data center network side, the NVE sends and receives native IP 372 traffic. When ingressing traffic from a Tenant System, the NVE 373 identifies the egress NVE to which the packet should be sent, adds an 374 overlay encapsulation header, and sends the packet on the underlay 375 network. When receiving traffic from a remote NVE, an NVE strips off 376 the encapsulation header, and delivers the (original) packet to the 377 appropriate Tenant System. When the source and destination Tenant 378 System are on the same NVE, no encapsulation is needed and the NVE 379 forwards traffic directly. 381 Conceptually, the NVE is a single entity implementing the NVO3 382 functionality. In practice, there are a number of different 383 implementation scenarios, as described in detail in Section 4. 385 3.3. Network Virtualization Authority (NVA) 387 Address dissemination refers to the process of learning, building and 388 distributing the mapping/forwarding information that NVEs need in 389 order to tunnel traffic to each other on behalf of communicating 390 Tenant Systems. 
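   The sketch below (Python; the structure and field names are
   illustrative assumptions, not anything specified by NVO3) shows the
   kind of per-VN mapping state that this information amounts to:

      from dataclasses import dataclass
      from typing import Dict

      @dataclass
      class MappingEntry:
          remote_nve: str    # outer (underlay) IP address of the egress NVE
          context_id: int    # VN Context carried in the encapsulation header

      # One table per VN.  For an L2 VN the key is the inner destination
      # MAC address; for an L3 VN it is the inner destination IP address.
      VnMappingTable = Dict[str, MappingEntry]

      def egress_for(table: VnMappingTable, inner_dst: str) -> MappingEntry:
          """Look up where to tunnel a tenant packet; a miss is what an
          NVE would resolve by asking an NVA rather than by flooding."""
          return table[inner_dst]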
   In particular, in order to send traffic to a remote Tenant System,
   the sending NVE must know the destination NVE for that Tenant System.

   One way to build and maintain mapping tables is to use learning, as
   802.1 bridges do [IEEE-802.1Q].  When forwarding traffic to multicast
   or unknown unicast destinations, an NVE could simply flood traffic.
   While flooding works, it can lead to traffic hot spots and to
   problems in larger networks.

   Alternatively, to reduce the scope of where flooding must take place,
   or to eliminate it altogether, NVEs can make use of a Network
   Virtualization Authority (NVA).  An NVA is the entity that provides
   address mapping and other information to NVEs.  NVEs interact with an
   NVA to obtain any required address mapping information they need in
   order to properly forward traffic on behalf of tenants.  The term NVA
   refers to the overall system, without regard to its scope or how it
   is implemented.  NVAs provide a service, and NVEs access that service
   via an NVE-to-NVA protocol as discussed in Section 8.

   Even when an NVA is present, Ethernet bridge MAC address learning
   could be used as a fallback mechanism, should the NVA be unable to
   provide an answer or for other reasons.  This document does not
   consider flooding approaches in detail, as there are a number of
   benefits in using an approach that depends on the presence of an NVA.

   For the rest of this document, it is assumed that an NVA exists and
   will be used.  NVAs are discussed in more detail in Section 7.

3.4.  VM Orchestration Systems

   VM orchestration systems manage server virtualization across a set of
   servers.  Although VM management is a separate topic from network
   virtualization, the two areas are closely related.  Managing the
   creation, placement, and movement of VMs also involves creating,
   attaching to, and detaching from virtual networks.  A number of
   existing VM orchestration systems have incorporated aspects of
   virtual network management into their systems.

   Note also that although this section uses the terms "VM" and
   "hypervisor" throughout, the same issues apply to other
   virtualization approaches, including Linux Containers (LXC), BSD
   Jails, Network Service Appliances as discussed in Section 5.1, etc.
   From an NVO3 perspective, it should be assumed that where the
   document uses the terms "VM" and "hypervisor", the intention is that
   the discussion also applies to other systems, where, e.g., the host
   operating system plays the role of the hypervisor in supporting
   virtualization, and a container plays a role equivalent to that of a
   VM.

   When a new VM image is started, the VM orchestration system
   determines where the VM should be placed, interacts with the
   hypervisor on the target server to load and start the VM, and
   controls when a VM should be shut down or migrated elsewhere.  VM
   orchestration systems also have knowledge about how a VM should
   connect to a network, possibly including the name of the virtual
   network to which a VM is to connect.  The VM orchestration system can
   pass such information to the hypervisor when a VM is instantiated.
   VM orchestration systems have significant (and sometimes global)
   knowledge of the domain they manage.  They typically know on what
   servers a VM is running, and metadata associated with VM images can
   be useful from a network virtualization perspective.
   For example, the metadata may include the addresses (MAC and IP) the
   VMs will use and the name(s) of the virtual network(s) they connect
   to.

   VM orchestration systems run a protocol with an agent running on the
   hypervisor of the servers they manage.  That protocol can also carry
   information about what virtual network a VM is associated with.  When
   the orchestrator instantiates a VM on a hypervisor, the hypervisor
   interacts with the NVE in order to attach the VM to the virtual
   networks it has access to.  In general, the hypervisor will need to
   communicate significant VM state changes to the NVE.  In the reverse
   direction, the NVE may need to communicate network connectivity
   information back to the hypervisor.  Example VM orchestration systems
   in use today include VMware's vCenter Server, Microsoft's System
   Center Virtual Machine Manager, and systems based on OpenStack and
   its associated plugins (e.g., Nova and Neutron).  These systems can
   pass information about what virtual networks a VM connects to down to
   the hypervisor.  The protocol used between the VM orchestration
   system and hypervisors is generally proprietary.

   It should be noted that VM orchestration systems may not have direct
   access to all networking-related information a VM uses.  For example,
   a VM may make use of additional IP or MAC addresses that the VM
   management system is not aware of.

4.  Network Virtualization Edge (NVE)

   As introduced in Section 3.2, an NVE is the entity that implements
   the overlay functionality.  This section describes NVEs in more
   detail.  An NVE will have two external interfaces:

   Tenant System Facing:  On the Tenant System facing side, an NVE
      interacts with the hypervisor (or equivalent entity) to provide
      the NVO3 service.  An NVE will need to be notified when a Tenant
      System "attaches" to a virtual network (so it can validate the
      request and set up any state needed to send and receive traffic on
      behalf of the Tenant System on that VN).  Likewise, an NVE will
      need to be informed when the Tenant System "detaches" from the
      virtual network so that it can reclaim state and resources
      appropriately.

   Data Center Network Facing:  On the data center network facing side,
      an NVE interfaces with the data center underlay network, sending
      and receiving tunneled TS packets to and from the underlay.  The
      NVE may also run a control protocol with other entities on the
      network, such as the Network Virtualization Authority.

4.1.  NVE Co-located With Server Hypervisor

   When server virtualization is used, the entire NVE functionality will
   typically be implemented as part of the hypervisor and/or virtual
   switch on the server.  In such cases, the Tenant System interacts
   with the hypervisor and the hypervisor interacts with the NVE.
   Because the interaction between the hypervisor and NVE is implemented
   entirely in software on the server, there is no "on-the-wire"
   protocol between Tenant Systems (or the hypervisor) and the NVE that
   needs to be standardized.  While there may be APIs between the NVE
   and hypervisor to support necessary interaction, the details of such
   an API are not in scope for the IETF to work on.

   Implementing NVE functionality entirely on a server has the
   disadvantage that server CPU resources must be spent implementing the
   NVO3 functionality.
Experimentation with overlay approaches and 513 previous experience with TCP and checksum adapter offloads suggests 514 that offloading certain NVE operations (e.g., encapsulation and 515 decapsulation operations) onto the physical network adapter can 516 produce performance improvements. As has been done with checksum 517 and/or TCP server offload and other optimization approaches, there 518 may be benefits to offloading common operations onto adapters where 519 possible. Just as important, the addition of an overlay header can 520 disable existing adapter offload capabilities that are generally not 521 prepared to handle the addition of a new header or other operations 522 associated with an NVE. 524 While the exact details of how to split the implementation of 525 specific NVE functionality between a server and its network adapters 526 is an implementation matter and outside the scope of IETF 527 standardization, the NVO3 architecture should be cognizant of and 528 support such separation. Ideally, it may even be possible to bypass 529 the hypervisor completely on critical data path operations so that 530 packets between a TS and its VN can be sent and received without 531 having the hypervisor involved in each individual packet operation. 533 4.2. Split-NVE 535 Another possible scenario leads to the need for a split NVE 536 implementation. An NVE running on a server (e.g. within a 537 hypervisor) could support NVO3 towards the tenant, but not perform 538 all NVE functions (e.g., encapsulation) directly on the server; some 539 of the actual NVO3 functionality could be implemented on (i.e., 540 offloaded to) an adjacent switch to which the server is attached. 541 While one could imagine a number of link types between a server and 542 the NVE, one simple deployment scenario would involve a server and 543 NVE separated by a simple L2 Ethernet link. A more complicated 544 scenario would have the server and NVE separated by a bridged access 545 network, such as when the NVE resides on a ToR, with an embedded 546 switch residing between servers and the ToR. 548 For the split NVE case, protocols will be needed that allow the 549 hypervisor and NVE to negotiate and setup the necessary state so that 550 traffic sent across the access link between a server and the NVE can 551 be associated with the correct virtual network instance. 552 Specifically, on the access link, traffic belonging to a specific 553 Tenant System would be tagged with a specific VLAN C-TAG that 554 identifies which specific NVO3 virtual network instance it connects 555 to. The hypervisor-NVE protocol would negotiate which VLAN C-TAG to 556 use for a particular virtual network instance. More details of the 557 protocol requirements for functionality between hypervisors and NVEs 558 can be found in [I-D.ietf-nvo3-nve-nva-cp-req]. 560 4.2.1. Tenant VLAN handling in Split-NVE Case 562 Preserving tenant VLAN tags across NVO3 as described in Section 3.1.1 563 poses additional complications in the split-NVE case. The portion of 564 the NVE that performs the encapsulation function needs access to the 565 specific VLAN tags that the Tenant System is using in order to 566 include them in the encapsulated packet. When an NVE is implemented 567 entirely within the hypervisor, the NVE has access to the complete 568 original packet (including any VLAN tags) sent by the tenant. 
In the 569 split-NVE case, however, the VLAN tag used between the hypervisor and 570 offloaded portions of the NVE normally only identify the specific VN 571 that traffic belongs to. In order to allow a tenant to preserve VLAN 572 information in the split-NVE case, additional mechanisms would be 573 needed. 575 4.3. NVE State 577 NVEs maintain internal data structures and state to support the 578 sending and receiving of tenant traffic. An NVE may need some or all 579 of the following information: 581 1. An NVE keeps track of which attached Tenant Systems are connected 582 to which virtual networks. When a Tenant System attaches to a 583 virtual network, the NVE will need to create or update local 584 state for that virtual network. When the last Tenant System 585 detaches from a given VN, the NVE can reclaim state associated 586 with that VN. 588 2. For tenant unicast traffic, an NVE maintains a per-VN table of 589 mappings from Tenant System (inner) addresses to remote NVE 590 (outer) addresses. 592 3. For tenant multicast (or broadcast) traffic, an NVE maintains a 593 per-VN table of mappings and other information on how to deliver 594 tenant multicast (or broadcast) traffic. If the underlying 595 network supports IP multicast, the NVE could use IP multicast to 596 deliver tenant traffic. In such a case, the NVE would need to 597 know what IP underlay multicast address to use for a given VN. 598 Alternatively, if the underlying network does not support 599 multicast, an NVE could use serial unicast to deliver traffic. 600 In such a case, an NVE would need to know which remote NVEs are 601 participating in the VN. An NVE could use both approaches, 602 switching from one mode to the other depending on such factors as 603 bandwidth efficiency and group membership sparseness. 605 4. An NVE maintains necessary information to encapsulate outgoing 606 traffic, including what type of encapsulation and what value to 607 use for a Context ID within the encapsulation header. 609 5. In order to deliver incoming encapsulated packets to the correct 610 Tenant Systems, an NVE maintains the necessary information to map 611 incoming traffic to the appropriate VAP (i.e., Tenant System 612 Interface). 614 6. An NVE may find it convenient to maintain additional per-VN 615 information such as QoS settings, Path MTU information, ACLs, 616 etc. 618 4.4. Multi-Homing of NVEs 620 NVEs may be multi-homed. That is, an NVE may have more than one IP 621 address associated with it on the underlay network. Multihoming 622 happens in two different scenarios. First, an NVE may have multiple 623 interfaces connecting it to the underlay. Each of those interfaces 624 will typically have a different IP address, resulting in a specific 625 Tenant Address (on a specific VN) being reachable through the same 626 NVE but through more than one underlay IP address. Second, a 627 specific tenant system may be reachable through more than one NVE, 628 each having one or more underlay addresses. In both cases, the NVE 629 address mapping tables need to support one-to-many mappings and 630 enable a sending NVE to (at a minimum) be able to fail over from one 631 IP address to another, e.g., should a specific NVE underlay address 632 become unreachable. 634 Finally, multi-homed NVEs introduce complexities when serial unicast 635 is used to implement tenant multicast as described in Section 4.3. 636 Specifically, an NVE should only receive one copy of a replicated 637 packet. 639 Multi-homing is needed to support important use cases. 
   First, a bare metal server may have multiple uplink connections to
   either the same or different NVEs.  Having only a single physical
   path to an upstream NVE, or indeed having all traffic flow through a
   single NVE, would be considered unacceptable in highly resilient
   deployment scenarios that seek to avoid single points of failure.
   Moreover, in today's networks, the availability of multiple paths
   would require that they be usable in an active-active fashion (e.g.,
   for load balancing).

4.5.  VAP

   The VAP is the NVE side of the interface between the NVE and the TS.
   Traffic to and from the tenant flows through the VAP.  If an NVE runs
   into difficulties sending traffic received on the VAP, it may need to
   signal such errors back to the VAP.  Because the VAP is an emulation
   of a physical port, its ability to signal NVE errors is limited and
   lacks sufficient granularity to reflect all possible errors an NVE
   may encounter (e.g., inability to reach a particular destination).
   Some errors, such as an NVE losing all of its connections to the
   underlay, could be reflected back to the VAP by effectively disabling
   it.  This state change would reflect itself on the TS as an interface
   going down, allowing the TS to implement interface error handling,
   e.g., failover, in the same manner as when a physical interface
   becomes disabled.

5.  Tenant System Types

   This section describes a number of special Tenant System types and
   how they fit into an NVO3 system.

5.1.  Overlay-Aware Network Service Appliances

   Some Network Service Appliances [I-D.ietf-nvo3-nve-nva-cp-req]
   (virtual or physical) provide tenant-aware services.  That is, the
   specific service they provide depends on the identity of the tenant
   making use of the service.  For example, firewalls are now becoming
   available that support multi-tenancy, where a single firewall
   provides virtual firewall service on a per-tenant basis, using
   per-tenant configuration rules and maintaining per-tenant state.
   Such appliances will be aware of the VN an activity corresponds to
   while processing requests.  Unlike server virtualization, which
   shields VMs from needing to know about multi-tenancy, a Network
   Service Appliance may explicitly support multi-tenancy.  In such
   cases, the Network Service Appliance itself will be aware of network
   virtualization and either embed an NVE directly or implement a split
   NVE as described in Section 4.2.  Unlike server virtualization,
   however, the Network Service Appliance may not be running a
   hypervisor, and the VM orchestration system may not interact with the
   Network Service Appliance.  The NVE on such appliances will need to
   support a control plane to obtain the information needed to fully
   participate in an NVO3 Domain.

5.2.  Bare Metal Servers

   Many data centers will continue to have at least some servers
   operating as non-virtualized (or "bare metal") machines running a
   traditional operating system and workload.  In such systems, there
   will be no NVE functionality on the server, and the server will have
   no knowledge of NVO3 (including whether overlays are even in use).
   In such environments, the NVE functionality can reside on the
   first-hop physical switch.
   In such a case, the network administrator would (manually) configure
   the switch to enable the appropriate NVO3 functionality on the switch
   port connecting the server and associate that port with a specific
   virtual network.  Such configuration would typically be static, since
   the server is not virtualized and, once configured, is unlikely to
   change frequently.  Consequently, this scenario does not require any
   protocol or standards work.

5.3.  Gateways

   Gateways on VNs relay traffic onto and off of a virtual network.
   Tenant Systems use gateways to reach destinations outside of the
   local VN.  Gateways receive encapsulated traffic from one VN, remove
   the encapsulation header, and send the native packet out onto the
   data center network for delivery.  Outside traffic enters a VN in the
   reverse manner.

   Gateways can be either virtual (i.e., implemented as a VM) or
   physical (i.e., a standalone physical device).  For performance
   reasons, standalone hardware gateways may be desirable in some cases.
   Such gateways could consist of a simple switch forwarding traffic
   from a VN onto the local data center network, or could embed router
   functionality.  On such gateways, network interfaces connecting to
   virtual networks will (at least conceptually) embed NVE (or
   split-NVE) functionality within them.  As with Network Service
   Appliances, gateways may not support a hypervisor and will need an
   appropriate control plane protocol to obtain the information needed
   to provide NVO3 service.

   Gateways handle several different use cases.  For example, one use
   case consists of systems supporting overlays together with systems
   that do not (e.g., bare metal servers).  Gateways could be used to
   connect legacy systems supporting, e.g., L2 VLANs, to specific
   virtual networks, effectively making them part of the same virtual
   network.  Gateways could also forward traffic between a virtual
   network and other hosts on the data center network or relay traffic
   between different VNs.  Finally, gateways can provide external
   connectivity such as Internet or VPN access.

5.3.1.  Gateway Taxonomy

   As can be seen from the discussion above, there are several types of
   gateways that can exist in an NVO3 environment.  This section breaks
   them down into the various types that could be supported.  Note that
   each of the types below could be implemented either in a centralized
   manner or distributed to co-exist with the NVEs.

5.3.1.1.  L2 Gateways (Bridging)

   L2 Gateways act as layer 2 bridges to forward Ethernet frames based
   on the MAC addresses present in them.

   L2 VN to Legacy L2:  This type of gateway bridges traffic between L2
      VNs and other legacy L2 networks such as VLANs or L2 VPNs.

   L2 VN to L2 VN:  The main motivation for this type of gateway is to
      create separate groups of Tenant Systems using L2 VNs, such that
      the gateway can enforce network policies between each L2 VN.

5.3.1.2.  L3 Gateways (Only IP Packets)

   L3 Gateways forward IP packets based on the IP addresses present in
   the packets.

   L3 VN to Legacy L2:  This type of gateway forwards packets between L3
      VNs and legacy L2 networks such as VLANs or L2 VPNs.  The MAC
      address in any frames forwarded between the L3 VN and the legacy
      L2 network would be the MAC address of the gateway.

   L3 VN to Legacy L3:  This type of gateway forwards packets between L3
      VNs and legacy L3 networks.
   These legacy L3 networks could be local to the data center, in the
   WAN, or an L3 VPN.

   L3 VN to L2 VN:  This type of gateway forwards packets between L3 VNs
      and L2 VNs.  The MAC address in any frames forwarded between the
      L3 VN and the L2 VN would be the MAC address of the gateway.

   L2 VN to L2 VN:  This type of gateway acts similarly to a traditional
      router that forwards between L2 interfaces.  The MAC address in
      any frames forwarded between the L2 VNs would be the MAC address
      of the gateway.

   L3 VN to L3 VN:  The main motivation for this type of gateway is to
      create separate groups of Tenant Systems using L3 VNs, such that
      the gateway can enforce network policies between each L3 VN.

5.4.  Distributed Inter-VN Gateways

   The relaying of traffic from one VN to another deserves special
   consideration.  Whether traffic is permitted to flow from one VN to
   another is a matter of policy and would not (by default) be allowed
   unless explicitly enabled.  In addition, NVAs are the logical place
   to maintain policy information about allowed inter-VN communication.
   Policy enforcement for inter-VN communication can be handled in (at
   least) two different ways.  Explicit gateways could be the central
   point for such enforcement, with all inter-VN traffic forwarded to
   such gateways for processing.  Alternatively, the NVA can provide
   such information directly to NVEs, by either providing a mapping for
   a target TS on another VN or indicating that such communication is
   disallowed by policy.

   When inter-VN gateways are centralized, traffic between TSs on
   different VNs can take suboptimal paths, i.e., triangular routing
   results in paths that always traverse the gateway.  In the worst
   case, traffic between two TSs connected to the same NVE can be
   hairpinned through an external gateway.  As an optimization,
   individual NVEs can be part of a distributed gateway that performs
   such relaying, reducing or completely eliminating triangular routing.
   In a distributed gateway, each ingress NVE can perform such relaying
   activity directly, so long as it has access to the policy information
   needed to determine whether cross-VN communication is allowed.
   Having individual NVEs be part of a distributed gateway allows them
   to tunnel traffic directly to the destination NVE without the need to
   take suboptimal paths.

   The NVO3 architecture must support distributed gateways for the case
   of inter-VN communication.  Such support requires that NVO3 control
   protocols include mechanisms for the maintenance and distribution of
   policy information about what type of cross-VN communication is
   allowed, so that NVEs acting as distributed gateways can tunnel
   traffic from one VN to another as appropriate.

   Distributed gateways could also be used to distribute other
   traditional router services to individual NVEs.  The NVO3
   architecture does not preclude such implementations but does not
   define or require them, as they are outside the scope of NVO3.

5.5.  ARP and Neighbor Discovery

   For an L2 service, strictly speaking, special processing of ARP
   [RFC0826] (and IPv6 Neighbor Discovery (ND) [RFC4861]) is not
   required.  ARP requests are broadcast, and NVO3 can deliver ARP
   requests to all members of a given L2 virtual network, just as it
   does for any packet sent to an L2 broadcast address.
Similarly, ND 833 requests are sent via IP multicast, which NVO3 can support by 834 delivering via L2 multicast. However, as a performance optimization, 835 an NVE can intercept ARP (or ND) requests from its attached TSs and 836 respond to them directly using information in its mapping tables. 837 Since an NVE will have mechanisms for determining the NVE address 838 associated with a given TS, the NVE can leverage the same mechanisms 839 to suppress sending ARP and ND requests for a given TS to other 840 members of the VN. The NVO3 architecture must support such a 841 capability. 843 6. NVE-NVE Interaction 845 Individual NVEs will interact with each other for the purposes of 846 tunneling and delivering traffic to remote TSs. At a minimum, a 847 control protocol may be needed for tunnel setup and maintenance. For 848 example, tunneled traffic may need to be encrypted or integrity 849 protected, in which case it will be necessary to set up appropriate 850 security associations between NVE peers. It may also be desirable to 851 perform tunnel maintenance (e.g., continuity checks) on a tunnel in 852 order to detect when a remote NVE becomes unreachable. Such generic 853 tunnel setup and maintenance functions are not generally 854 NVO3-specific. Hence, NVO3 expects to leverage existing tunnel 855 maintenance protocols rather than defining new ones. 857 Some NVE-NVE interactions may be specific to NVO3 (and in particular 858 be related to information kept in mapping tables) and agnostic to the 859 specific tunnel type being used. For example, when tunneling traffic 860 for TS-X to a remote NVE, it is possible that TS-X is not presently 861 associated with the remote NVE. Normally, this should not happen, 862 but there could be race conditions where the information an NVE has 863 learned from the NVA is out-of-date relative to actual conditions. 864 In such cases, the remote NVE could return an error or warning 865 indication, allowing the sending NVE to attempt a recovery or 866 otherwise attempt to mitigate the situation. 868 The NVE-NVE interaction could signal a range of indications, for 869 example: 871 o "No such TS here", upon a receipt of a tunneled packet for an 872 unknown TS. 874 o "TS-X not here, try the following NVE instead" (i.e., a redirect). 876 o Delivered to correct NVE, but could not deliver packet to TS-X 877 (soft error). 879 o Delivered to correct NVE, but could not deliver packet to TS-X 880 (hard error). 882 When an NVE receives information from a remote NVE that conflicts 883 with the information it has in its own mapping tables, it should 884 consult with the NVA to resolve those conflicts. In particular, it 885 should confirm that the information it has is up-to-date, and it 886 might indicate the error to the NVA, so as to nudge the NVA into 887 following up (as appropriate). While it might make sense for an NVE 888 to update its mapping table temporarily in response to an error from 889 a remote NVE, any changes must be handled carefully as doing so can 890 raise security considerations if the received information cannot be 891 authenticated. That said, a sending NVE might still take steps to 892 mitigate a problem, such as applying rate limiting to data traffic 893 towards a particular NVE or TS. 895 7. Network Virtualization Authority 897 Before sending to and receiving traffic from a virtual network, an 898 NVE must obtain the information needed to build its internal 899 forwarding tables and state as listed in Section 4.3. 
   An NVE can obtain such information from a Network Virtualization
   Authority.

   The Network Virtualization Authority (NVA) is the entity that is
   expected to provide address mapping and other information to NVEs.
   NVEs can interact with an NVA to obtain any required information they
   need in order to properly forward traffic on behalf of tenants.  The
   term NVA refers to the overall system, without regard to its scope or
   how it is implemented.

7.1.  How an NVA Obtains Information

   There are two primary ways in which an NVA can obtain the address
   dissemination information it manages.  The NVA can obtain information
   from the VM orchestration system and/or directly from the NVEs
   themselves.

   On virtualized systems, the NVA may be able to obtain the address
   mapping information associated with VMs from the VM orchestration
   system itself.  If the VM orchestration system contains a master
   database for all the virtualization information, having the NVA
   obtain information directly from the orchestration system would be a
   natural approach.  Indeed, the NVA could effectively be co-located
   with the VM orchestration system itself.  In such systems, the VM
   orchestration system communicates with the NVE indirectly through the
   hypervisor.

   However, as described in Section 4, not all NVEs are associated with
   hypervisors.  In such cases, NVAs cannot leverage VM orchestration
   protocols to interact with those NVEs and will instead need to peer
   directly with them.  By peering directly with an NVE, NVAs can obtain
   information about the TSs connected to that NVE and can distribute
   information to the NVE about the VNs those TSs are associated with.
   For example, whenever a Tenant System attaches to an NVE, that NVE
   would notify the NVA that the TS is now associated with that NVE.
   Likewise, when a TS detaches from an NVE, that NVE would inform the
   NVA.  By communicating directly with NVEs, both the NVA and the NVE
   are able to maintain up-to-date information about all active tenants
   and the NVEs to which they are attached.

7.2.  Internal NVA Architecture

   For reliability and fault tolerance reasons, an NVA would be
   implemented in a distributed or replicated manner without single
   points of failure.  How the NVA is implemented, however, is not
   important to an NVE so long as the NVA provides a consistent and
   well-defined interface to the NVE.  For example, an NVA could be
   implemented via database techniques whereby a server stores address
   mapping information in a traditional (possibly replicated) database.
   Alternatively, an NVA could be implemented in a distributed fashion
   using an existing (or modified) routing protocol to maintain and
   distribute mappings.  So long as there is a clear interface between
   the NVE and NVA, how an NVA is architected and implemented is not
   important to an NVE.

   A number of architectural approaches could be used to implement NVAs
   themselves.  NVAs manage address bindings and distribute them to
   where they need to go.  One approach would be to use the Border
   Gateway Protocol (BGP) [RFC4364] (possibly with extensions) and route
   reflectors.  Another approach could use a transaction-based database
   model with replicated servers.
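   Whatever internal approach is chosen, what matters to an NVE is the
   external interface the NVA exposes.  A minimal sketch of such an
   interface (Python; the operation names are hypothetical and intended
   only to illustrate the kinds of interactions involved):

      import abc

      class NvaClient(abc.ABC):
          """NVE-side view of the NVA service.  Whether the NVA behind
          this interface is a replicated database, a BGP/route-reflector
          system, or something else entirely is invisible to the NVE."""

          @abc.abstractmethod
          def register(self, vn: str, inner_addr: str, nve_addr: str) -> None:
              """Report that a TS address is now reachable via this NVE."""

          @abc.abstractmethod
          def withdraw(self, vn: str, inner_addr: str) -> None:
              """Report that the TS has detached from this NVE."""

          @abc.abstractmethod
          def lookup(self, vn: str, inner_addr: str) -> str:
              """Return the underlay address of the NVE serving inner_addr."""

          @abc.abstractmethod
          def subscribe(self, vn: str) -> None:
              """Request ongoing updates for a VN this NVE is attached to."""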
   Because the implementation details are local to an NVA, there is no
   need to pick exactly one solution technology, so long as the external
   interfaces to the NVEs (and remote NVAs) are sufficiently well
   defined to achieve interoperability.

7.3.  NVA External Interface

   Conceptually, from the perspective of an NVE, an NVA is a single
   entity.  An NVE interacts with the NVA, and it is the NVA's
   responsibility to ensure that interactions between the NVE and NVA
   result in consistent behavior across the NVA and all other NVEs using
   the same NVA.  Because an NVA is built from multiple internal
   components, an NVA will have to ensure that information flows to all
   internal NVA components appropriately.

   One architectural question is how the NVA presents itself to the NVE.
   For example, an NVA could be required to provide access via a single
   IP address.  If NVEs only have one IP address to interact with, it
   would be the responsibility of the NVA to handle NVA component
   failures, e.g., by using a "floating IP address" that migrates among
   NVA components to ensure that the NVA can always be reached via the
   one address.  Having all NVA accesses through a single IP address,
   however, adds constraints to implementing robust failover, load
   balancing, etc.

   In the NVO3 architecture, an NVA is accessed through one or more IP
   addresses (or IP address/port combinations).  If multiple IP
   addresses are used, each IP address provides equivalent
   functionality, meaning that an NVE can use any of the provided
   addresses to interact with the NVA.  Should one address stop working,
   an NVE is expected to fail over to another.  While the different
   addresses result in equivalent functionality, one address may respond
   more quickly than another, e.g., due to network conditions, load on
   the server, etc.

   To provide some control over load balancing, NVA addresses may have
   an associated priority.  Addresses are used in order of priority,
   with no explicit preference among NVA addresses having the same
   priority.  To provide basic load balancing among NVA addresses of
   equal priority, NVEs use some randomization input to select among
   equal-priority addresses.  Such a priority scheme facilitates
   failover and load balancing, for example, allowing a network operator
   to specify a set of primary and backup NVAs.

   It may be desirable to have individual NVA addresses responsible for
   a subset of information about an NV Domain.  In such a case, NVEs
   would use different NVA addresses for obtaining or updating
   information about particular VNs or TS bindings.  A key question with
   such an approach is how information would be partitioned, and how an
   NVE could determine which address to use to get the information it
   needs.

   Another possibility is to treat the information on which NVA
   addresses to use as cached (soft-state) information at the NVEs, so
   that any NVA address can be used to obtain any information, but NVEs
   are informed of preferences for which addresses to use for particular
   information on VNs or TS bindings.  That preference information would
   be cached for future use to improve behavior, e.g., if all requests
   for a specific subset of VNs are forwarded to a specific NVA
   component, the NVE can optimize future requests within that subset by
   sending them directly to that NVA component via its address.

8.
NVE-to-NVA Protocol 1024 As outlined in Section 4.3, an NVE needs certain information in order 1025 to perform its functions. To obtain such information from an NVA, an 1026 NVE-to-NVA protocol is needed. The NVE-to-NVA protocol provides two 1027 functions. First it allows an NVE to obtain information about the 1028 location and status of other TSs with which it needs to communicate. 1029 Second, the NVE-to-NVA protocol provides a way for NVEs to provide 1030 updates to the NVA about the TSs attached to that NVE (e.g., when a 1031 TS attaches or detaches from the NVE), or about communication errors 1032 encountered when sending traffic to remote NVEs. For example, an NVE 1033 could indicate that a destination it is trying to reach at a 1034 destination NVE is unreachable for some reason. 1036 While having a direct NVE-to-NVA protocol might seem straightforward, 1037 the existence of existing VM orchestration systems complicates the 1038 choices an NVE has for interacting with the NVA. 1040 8.1. NVE-NVA Interaction Models 1042 An NVE interacts with an NVA in at least two (quite different) ways: 1044 o NVEs embedded within the same server as the hypervisor can obtain 1045 necessary information entirely through the hypervisor-facing side 1046 of the NVE. Such an approach is a natural extension to existing 1047 VM orchestration systems supporting server virtualization because 1048 an existing protocol between the hypervisor and VM Orchestration 1049 system already exists and can be leveraged to obtain any needed 1050 information. Specifically, VM orchestration systems used to 1051 create, terminate and migrate VMs already use well-defined (though 1052 typically proprietary) protocols to handle the interactions 1053 between the hypervisor and VM orchestration system. For such 1054 systems, it is a natural extension to leverage the existing 1055 orchestration protocol as a sort of proxy protocol for handling 1056 the interactions between an NVE and the NVA. Indeed, existing 1057 implementations can already do this. 1059 o Alternatively, an NVE can obtain needed information by interacting 1060 directly with an NVA via a protocol operating over the data center 1061 underlay network. Such an approach is needed to support NVEs that 1062 are not associated with systems performing server virtualization 1063 (e.g., as in the case of a standalone gateway) or where the NVE 1064 needs to communicate directly with the NVA for other reasons. 1066 The NVO3 architecture will focus on support for the second model 1067 above. Existing virtualization environments are already using the 1068 first model. But they are not sufficient to cover the case of 1069 standalone gateways -- such gateways may not support virtualization 1070 and do not interface with existing VM orchestration systems. 1072 8.2. Direct NVE-NVA Protocol 1074 An NVE can interact directly with an NVA via an NVE-to-NVA protocol. 1075 Such a protocol can be either independent of the NVA internal 1076 protocol, or an extension of it. Using a dedicated protocol provides 1077 architectural separation and independence between the NVE and NVA. 1078 The NVE and NVA interact in a well-defined way, and changes in the 1079 NVA (or NVE) do not need to impact each other. Using a dedicated 1080 protocol also ensures that both NVE and NVA implementations can 1081 evolve independently and without dependencies on each other. Such 1082 independence is important because the upgrade path for NVEs and NVAs 1083 is quite different. 
Requirements for a direct NVE-NVA protocol can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

8.3. Propagating Information Between NVEs and NVAs

Information flows between NVEs and NVAs in both directions. The NVA maintains information about all VNs in the NV Domain, so that NVEs do not need to do so themselves. NVEs obtain from the NVA information about where a given remote TS destination resides. NVAs in turn obtain information from NVEs about the individual TSs attached to those NVEs.

While the NVA could push information about every virtual network to every NVE, such an approach scales poorly and is unnecessary. In practice, a given NVE will only need and want to know about VNs to which it is attached. Thus, an NVE should be able to subscribe to updates only for the virtual networks it is interested in. The NVO3 architecture supports a model where an NVE is not required to have full mapping tables for all virtual networks in an NV Domain.

Before sending unicast traffic to a remote TS (or TSes, for broadcast or multicast traffic), an NVE must know where the remote TS (or TSes) currently resides. When a TS attaches to a virtual network, the NVE obtains information about that VN from the NVA. The NVA can provide that information to the NVE at the time the TS attaches to the VN, either because the NVE requests the information when the attach operation occurs or because the VM orchestration system has initiated the attach operation and provides the associated mapping information to the NVE at the same time.

There are scenarios where an NVE may wish to query the NVA about individual mappings within a VN. For example, when sending traffic to a remote TS on a remote NVE, that TS may become unavailable (e.g., because it has migrated elsewhere or has been shut down), in which case the remote NVE may return an error indication. In such situations, the NVE may need to query the NVA to obtain updated mapping information for a specific TS or to verify that the information is still correct despite the error condition. Note that such a query could also be used by the NVA as an indication that there may be an inconsistency in the network and that it should take steps to verify that the information it has about the current state and location of a specific TS is still correct.

For very large virtual networks, the amount of state an NVE needs to maintain for a given virtual network could be significant. Moreover, an NVE may only be communicating with a small subset of the TSs on such a virtual network. In such cases, the NVE may find it desirable to maintain state only for those destinations it is actively communicating with; that is, an NVE may not want to maintain full mapping information about all destinations on a VN. Should it then need to communicate with a destination for which it does not have mapping information, however, it will need to be able to query the NVA on demand for the missing information on a per-destination basis.
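The on-demand model described above can be pictured with a small Python sketch of an NVE-side mapping cache that queries the NVA only on a miss; the class, method, and parameter names are illustrative assumptions, not part of the architecture.

   class MappingCache:
       """Per-NVE cache of inner-to-outer mappings, filled on demand."""

       def __init__(self, nva_lookup):
           # nva_lookup(vn_context, tenant_address) -> NVE underlay address,
           # i.e., a query to the NVA over the NVE-to-NVA protocol.
           self.nva_lookup = nva_lookup
           self.cache = {}

       def resolve(self, vn_context, tenant_address):
           key = (vn_context, tenant_address)
           if key not in self.cache:                    # cache miss:
               self.cache[key] = self.nva_lookup(*key)  # ask the NVA on demand
           return self.cache[key]

       def invalidate(self, vn_context, tenant_address):
           # Called, e.g., after a remote NVE returns an error indication, so
           # that the next send triggers a fresh NVA query (which also lets
           # the NVA re-verify its own state for that TS).
           self.cache.pop((vn_context, tenant_address), None)

   # Usage with a stand-in lookup function:
   cache = MappingCache(nva_lookup=lambda vn, ts: "192.0.2.20")
   print(cache.resolve(1217, "VM2"))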
The NVO3 architecture will need to support a range of operations between the NVE and NVA. Requirements for those operations can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

9. Federated NVAs

An NVA provides service to the set of NVEs in its NV Domain. Each NVA manages network virtualization information for the virtual networks within its NV Domain. An NV Domain is administered by a single entity.

In some cases, it will be necessary to expand the scope of a specific VN or even an entire NV Domain beyond a single NVA. For example, an administrator managing multiple data centers may wish to operate all of those data centers as a single NV Region. Such cases are handled by having different NVAs peer with each other to exchange mapping information about specific VNs. NVAs operate in a federated manner: a set of NVAs forms a loosely coupled federation of individual NVAs. If a virtual network spans multiple NVAs (e.g., located at different data centers), and an NVE needs to deliver tenant traffic to an NVE that is part of a different NV Domain, it still interacts only with its own NVA, even when obtaining mappings for NVEs associated with a different NV Domain.

Figure 3 shows a scenario where two separate NV Domains (A and B) share information about Virtual Network "1217". VM1 and VM2 both connect to the same Virtual Network 1217, even though the two VMs are in separate NV Domains. There are two cases to consider. In the first case, NV Domain B does not allow NVE-A to tunnel traffic directly to NVE-B. There could be a number of reasons for this. For example, NV Domains A and B may not share a common address space (i.e., traffic would require traversal through a NAT device), or, for policy reasons, a domain might require that all traffic between separate NV Domains be funneled through a particular device (e.g., a firewall). In such cases, NVA-2 will advertise to NVA-1 that VM2 on Virtual Network 1217 is available and direct that traffic between the two nodes go through IP-G. IP-G would then decapsulate received traffic from one NV Domain, translate it appropriately for the other domain, and re-encapsulate the packet for delivery. In the second case, NV Domain B does allow NVE-A to tunnel traffic directly to NVE-B; in that case, NVA-2 advertises to NVA-1 the mapping for VM2 (i.e., the underlay address of NVE-B), and NVE-A tunnels tenant traffic directly to NVE-B.

            xxxxxxxxxxxxx                          xxxxxxxxxxxxx
  +-----+  x             x                        x             x  +-----+
  | VM1 |  x             x                        x             x  | VM2 |
  |-----|  x             x      +------+          x             x  |-----|
  |NVE-A|--x NV Domain A x------| IP-G |----------x NV Domain B x--|NVE-B|
  +-----+  x             x      +------+          x             x  +-----+
            x           x                          x           x
             xxxxx+xxxxx                            xxxxx+xxxxx
                  |                                      |
               +--+--+                                +--+--+
               |NVA-1|                                |NVA-2|
               +-----+                                +-----+

          Figure 3: VM1 and VM2 are in different NV Domains

NVAs at one site share information and interact with NVAs at other sites, but only in a controlled manner. It is expected that policy and access control will be applied at the boundaries between different sites (and NVAs) so as to minimize dependencies on external NVAs that could negatively impact the operation within a site. It is an architectural principle that operations involving NVAs at one site not be immediately impacted by failures or errors at another site. (Of course, communication between NVEs in different NV Domains may be impacted by such failures or errors.) It is a strong requirement that an NVA continue to operate properly for local NVEs even if external communication is interrupted (e.g., should communication between a local and remote NVA fail).
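As a non-normative illustration of the kind of policy applied at such a boundary, the following Python sketch shows how an NVA in the position of NVA-2 in Figure 3 might export mappings for a shared VN while pointing remote domains at a gateway instead of at its local NVEs; the function, parameters, and export format are illustrative assumptions only.

   def export_mappings(local_mappings, shared_vns, gateway_address):
       """One possible export policy for an NVA peering with another NV
       Domain: advertise only VNs it has agreed to share, and direct
       remote domains to a gateway rather than to local NVEs."""
       exported = []
       for (vn, tenant_addr), _local_nve in local_mappings.items():
           if vn not in shared_vns:        # policy: do not leak other VNs
               continue
           exported.append((vn, tenant_addr, gateway_address))
       return exported

   # NVA-2's local state: VM2 on VN 1217 sits behind NVE-B.
   local_mappings = {(1217, "VM2"): "NVE-B"}
   # What NVA-1 learns: reach VM2 on VN 1217 via IP-G, not NVE-B directly.
   print(export_mappings(local_mappings, {1217}, "IP-G"))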
At a high level, a federation of interconnected NVAs has some analogies to BGP and Autonomous Systems. Like an Autonomous System, NVAs at one site are managed by a single administrative entity and do not interact with external NVAs except as allowed by policy. Likewise, the interface between NVAs at different sites is well defined, so that the internal details of operations at one site are largely hidden from other sites. Finally, an NVA only peers with other NVAs that it has a trust relationship with, i.e., where a VN is intended to span multiple NVAs.

Reasons for using a federated model include:

o  Provide isolation among NVAs operating at different sites in different geographic locations.

o  Control the quantity and rate of information updates that flow (and must be processed) between different NVAs in different data centers.

o  Control the set of external NVAs (and external sites) a site peers with. A site will only peer with other sites that are cooperating in providing an overlay service.

o  Allow policy to be applied between sites. A site will want to carefully control what information it exports (and to whom) as well as what information it is willing to import (and from whom).

o  Allow different protocols and architectures to be used for intra- vs. inter-NVA communication. For example, within a single data center, a replicated transaction server using database techniques might be an attractive implementation option for an NVA, and protocols optimized for intra-NVA communication would likely be different from protocols involving inter-NVA communication between different sites.

o  Allow for optimized protocols rather than a one-size-fits-all approach. Within a data center, networks tend to have lower latency, higher speed, and higher redundancy when compared with the WAN links interconnecting data centers. The design constraints and tradeoffs for a protocol operating within a data center network are different from those for a protocol operating over WAN links. While a single protocol could be used for both cases, there could be advantages to using different and more specialized protocols for the intra- and inter-NVA cases.

9.1. Inter-NVA Peering

To support peering between different NVAs, an inter-NVA protocol is needed. The inter-NVA protocol defines what information is exchanged between NVAs. It is assumed that the protocol will be used to share addressing information between data centers and must scale well over WAN links.

10. Control Protocol Work Areas

The NVO3 architecture consists of two major distinct entities: NVEs and NVAs. In order to provide isolation and independence between these two entities, the NVO3 architecture calls for well-defined protocols for interfacing between them. For an individual NVA, the architecture calls for a logically centralized entity that could be implemented in a distributed or replicated fashion. While the IETF may choose to define one or more specific architectural approaches to building individual NVAs, there is little need for it to pick exactly one approach to the exclusion of others. An NVA for a single domain will likely be deployed as a single vendor product, and thus there is little benefit in standardizing the internal structure of an NVA.

Individual NVAs peer with each other in a federated manner. The NVO3 architecture calls for a well-defined interface between NVAs.

Finally, a hypervisor-to-NVE protocol is needed to cover the split-NVE scenario described in Section 4.2.

11. NVO3 Data Plane Encapsulation

When tunneling tenant traffic, NVEs add an encapsulation header to the original tenant packet. The exact encapsulation to use for NVO3 does not seem to be critical. The main requirement is that the encapsulation support a Context ID of sufficient size [I-D.ietf-nvo3-dataplane-requirements]. A number of encapsulations already exist that provide a VN Context of sufficient size for NVO3. For example, VXLAN [RFC7348] has a 24-bit VXLAN Network Identifier (VNI). NVGRE [I-D.sridharan-virtualization-nvgre] has a 24-bit Tenant Network ID (TNI). MPLS-over-GRE provides a 20-bit label field. While there is widespread recognition that a 12-bit VN Context would be too small (only 4096 distinct values), it is generally agreed that 20 bits (about 1 million distinct values) and 24 bits (about 16.8 million distinct values) are sufficient for a wide variety of deployment scenarios.
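To make the VN Context sizes above concrete, the following Python sketch builds the 8-octet VXLAN header defined in [RFC7348], whose 24-bit VNI carries the VN Context. This is only an illustration of the field layout, not a statement that NVO3 uses VXLAN.

   import struct

   def vxlan_header(vni):
       """Build the 8-octet VXLAN header from RFC 7348: 8 flag bits
       (I bit set), 24 reserved bits, a 24-bit VNI, and 8 reserved bits."""
       if not 0 <= vni < 2**24:
           raise ValueError("VNI must fit in 24 bits")
       flags_word = 0x08 << 24   # first octet = 0x08: the I (valid VNI) flag set
       vni_word = vni << 8       # VNI sits above the final reserved octet
       return struct.pack("!II", flags_word, vni_word)

   # A 24-bit VN Context allows roughly 16.8 million distinct virtual networks.
   print(vxlan_header(1217).hex())    # -> '080000000004c100'

NVGRE's 24-bit TNI and the 20-bit MPLS label play the same VN Context role in their respective encapsulations.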
12. Operations and Management

The simplicity of operating and debugging overlay networks will be critical for successful deployment. Some architectural choices can facilitate or hinder OAM. Related OAM drafts include [I-D.ashwood-nvo3-operational-requirement].

13. Summary

This document presents the overall architecture for overlays in NVO3. The architecture calls for three main areas of protocol work:

1. A hypervisor-to-NVE protocol to support Split-NVEs as discussed in Section 4.2.

2. An NVE-to-NVA protocol for disseminating VN information (e.g., inner-to-outer address mappings).

3. An NVA-to-NVA protocol for exchange of information about specific virtual networks between federated NVAs.

It should be noted that existing protocols, or extensions of existing protocols, are applicable to some or all of these areas.

14. Acknowledgments

Helpful comments and improvements to this document have come from Lizhong Jin, Anton Ivanov, Dennis (Xiaohong) Qin, Erik Smith, Ziye Yang and Lucy Yong.

15. IANA Considerations

This memo includes no request to IANA.

16. Security Considerations

Security considerations for this architecture have not yet been developed; they will be addressed in a future revision of this document.

17. Informative References

[I-D.ashwood-nvo3-operational-requirement]
           Ashwood-Smith, P., Iyengar, R., Tsou, T., Sajassi, A., Boucadair, M., Jacquenet, C., and M. Daikoku, "NVO3 Operational Requirements", draft-ashwood-nvo3-operational-requirement-03 (work in progress), July 2013.
[I-D.ietf-nvo3-dataplane-requirements]
           Bitar, N., Lasserre, M., Balus, F., Morin, T., Jin, L., and B. Khasnabish, "NVO3 Data Plane Requirements", draft-ietf-nvo3-dataplane-requirements-03 (work in progress), April 2014.

[I-D.ietf-nvo3-nve-nva-cp-req]
           Kreeger, L., Dutt, D., Narten, T., and D. Black, "Network Virtualization NVE to NVA Control Protocol Requirements", draft-ietf-nvo3-nve-nva-cp-req-03 (work in progress), October 2014.

[I-D.sridharan-virtualization-nvgre]
           Garg, P. and Y. Wang, "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan-virtualization-nvgre-07 (work in progress), November 2014.

[IEEE-802.1Q]
           IEEE, "IEEE standard for local and metropolitan area networks: Media access control (MAC) bridges and virtual bridged local area networks", IEEE 802.1Q-2011, August 2011.

[RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or converting network protocol addresses to 48.bit Ethernet address for transmission on Ethernet hardware", STD 37, RFC 826, November 1982.

[RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

[RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014.

[RFC7364]  Narten, T., Gray, E., Black, D., Fang, L., Kreeger, L., and M. Napierala, "Problem Statement: Overlays for Network Virtualization", RFC 7364, October 2014.

[RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. Rekhter, "Framework for Data Center (DC) Network Virtualization", RFC 7365, October 2014.

Appendix A. Change Log

A.1. Changes From draft-ietf-nvo3-arch-02 to -03

1. Removed "[Note:" comments from Sections 7.3 and 8.

2. Removed the discussion stimulating the "[Note" comment from Section 8.1 and changed the text to note that the NVO3 architecture will focus on a model where all NVEs interact with the NVA.

3. Added a subsection on NVO3 Gateway taxonomy.

A.2. Changes From draft-ietf-nvo3-arch-01 to -02

1. Minor editorial improvements after a close re-reading; references to the problem statement and framework updated to point to recently published RFCs.

2. Added text making it more clear that other virtualization approaches, including Linux Containers, are intended to be fully supported in NVO3.

A.3. Changes From draft-ietf-nvo3-arch-00 to -01

1. Miscellaneous text/section additions, including:

   * New section on VLAN tag handling (Section 3.1.1).

   * New section on tenant VLAN handling in the Split-NVE case (Section 4.2.1).

   * New section on TTL handling (Section 3.1.2).

   * New section on multi-homing of NVEs (Section 4.4).

   * Two paragraphs of new text describing the L2/L3 Combined service (Section 3.1).

   * New section on VAPs (and error handling) (Section 4.5).

   * New section on ARP and ND handling (Section 5.5).

   * New section on NVE-to-NVE interactions (Section 6).

2. Editorial cleanups from careful review by Erik Smith and Ziye Yang.

3. Expanded text on Distributed Inter-VN Gateways.

A.4. Changes From draft-narten-nvo3 to draft-ietf-nvo3
1. No changes between draft-narten-nvo3-arch-01 and draft-ietf-nvo3-arch-00.

A.5. Changes From -00 to -01 (of draft-narten-nvo3-arch)

1. Editorial and clarity improvements.

2. Replaced the "push vs. pull" section with a section more focused on triggers, where an event implies or triggers some action.

3. Clarified the text on the co-located NVE to show how offloading NVE functionality onto adapters is desirable.

4. Added a new section on distributed gateways.

5. Expanded the section on the NVA external interface, adding a requirement for NVEs to support multiple NVA IP addresses.

Authors' Addresses

David Black
EMC

Email: david.black@emc.com

Jon Hudson
Brocade
120 Holger Way
San Jose, CA 95134
USA

Email: jon.hudson@gmail.com

Lawrence Kreeger
Cisco

Email: kreeger@cisco.com

Marc Lasserre
Alcatel-Lucent

Email: marc.lasserre@alcatel-lucent.com

Thomas Narten
IBM

Email: narten@us.ibm.com