Internet Engineering Task Force                             Nabil Bitar
Internet Draft                                                  Verizon
Intended status: Informational
Expires: May 2014                                         Marc Lasserre
                                                           Florin Balus
                                                         Alcatel-Lucent

                                                           Thomas Morin
                                                  France Telecom Orange

                                                            Lizhong Jin

                                                      Bhumip Khasnabish
                                                                    ZTE

                                                      November 12, 2013

                      NVO3 Data Plane Requirements
             draft-ietf-nvo3-dataplane-requirements-02.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on May 12, 2014.
Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.

Abstract

Several IETF drafts relate to the use of overlay networks to support
large scale virtual data centers. This draft provides a list of data
plane requirements for Network Virtualization over L3 (NVO3) that
have to be addressed in solutions documents.

Table of Contents

1. Introduction
   1.1. Conventions used in this document
   1.2. General terminology
2. Data Path Overview
3. Data Plane Requirements
   3.1. Virtual Access Points (VAPs)
   3.2. Virtual Network Instance (VNI)
      3.2.1. L2 VNI
      3.2.2. L3 VNI
   3.3. Overlay Module
      3.3.1. NVO3 overlay header
         3.3.1.1. Virtual Network Context Identification
         3.3.1.2. Service QoS identifier
      3.3.2. Tunneling function
         3.3.2.1. LAG and ECMP
         3.3.2.2. DiffServ and ECN marking
         3.3.2.3. Handling of BUM traffic
   3.4. External NVO3 connectivity
      3.4.1. GW Types
         3.4.1.1. VPN and Internet GWs
         3.4.1.2. Inter-DC GW
         3.4.1.3. Intra-DC gateways
      3.4.2. Path optimality between NVEs and Gateways
         3.4.2.1. Load-balancing
         3.4.2.2. Triangular Routing Issues (a.k.a. Traffic Tromboning)
   3.5. Path MTU
   3.6. Hierarchical NVE
   3.7. NVE Multi-Homing Requirements
   3.8. Other considerations
      3.8.1. Data Plane Optimizations
      3.8.2. NVE location trade-offs
4. Security Considerations
5. IANA Considerations
6. References
   6.1. Normative References
   6.2. Informative References
7. Acknowledgments

1. Introduction
1.1. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

1.2. General terminology

The terminology defined in [NVO3-framework] is used throughout this
document. Terminology specific to this memo is defined here and is
introduced as needed in later sections.

BUM: Broadcast, Unknown Unicast, Multicast traffic

TS: Tenant System

2. Data Path Overview

The NVO3 framework [NVO3-framework] defines the generic NVE model
depicted in Figure 1:

                    +-------- L3 Network -------+
                    |                           |
                    |       Tunnel Overlay      |
       +------------+---------+       +---------+------------+
       | +----------+-------+ |       | +---------+--------+ |
       | |  Overlay Module  | |       | |  Overlay Module  | |
       | +---------+--------+ |       | +---------+--------+ |
       |           |VN context|       | VN context|          |
       |           |          |       |           |          |
       |  +--------+-------+  |       |  +--------+-------+  |
       |  | |VNI| .. |VNI| |  |       |  | |VNI| .. |VNI| |  |
  NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2
       |    |    VAPs    |    |       |    |   VAPs    |     |
       +----+------------+----+       +----+-----------+-----+
            |            |                 |           |
            |            |                 |           |
   ---------+------------+-----------------+-----------+-------
            |            |     Tenant      |           |
            |            |   Service IF    |           |
           Tenant Systems                 Tenant Systems

         Figure 1 : Generic reference model for NV Edge

When a frame is received by an ingress NVE from a Tenant System over
a local VAP, it needs to be parsed in order to identify which
virtual network instance it belongs to. The parsing function can
examine various fields in the data frame (e.g., VLAN ID) and/or the
associated interface/port the frame came from.

Once a corresponding VNI is identified, a lookup is performed to
determine where the frame needs to be sent. This lookup can be based
on any combination of various fields in the data frame (e.g.,
destination MAC address and/or destination IP address). Note that
additional criteria such as 802.1p and/or DSCP markings might be
used to select an appropriate tunnel or local VAP destination.

Lookup tables can be populated using different techniques: data
plane learning, management plane configuration, or a distributed
control plane. Management and control planes are not in the scope of
this document. The data plane based solution is described in this
document because it has implications on the data plane processing
function.

The result of this lookup yields the corresponding information
needed to build the overlay header, as described in section 3.3.
This information includes the destination L3 address of the egress
NVE. Note that this lookup might yield a list of tunnels, such as
when ingress replication is used for BUM traffic.

The overlay header MUST include a context identifier which the
egress NVE will use to identify which VNI the frame belongs to.

The egress NVE checks the context identifier, removes the
encapsulation header, and then forwards the original frame towards
the appropriate recipient, usually a local VAP.
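As an informal, non-normative illustration of the steps above, the
following Python sketch shows a VAP lookup, a per-VNI destination
lookup, and the resulting encapsulation decision. All table
contents, field names, and the encapsulation layout are hypothetical
examples, not part of this document's requirements.

   # Sketch of the ingress NVE data path described above
   # (identify VAP, per-VNI lookup, encapsulate).  Names and
   # values are illustrative only.

   from collections import namedtuple

   Tunnel = namedtuple("Tunnel", ["egress_nve_ip", "vn_context"])

   # VAP identification: (local port, VLAN ID) -> VNI
   vap_table = {("vport7", 100): "vni-blue"}

   # Per-VNI forwarding table: destination MAC -> tunnel to egress NVE
   fib = {"vni-blue": {"00:11:22:33:44:55": Tunnel("192.0.2.10", 5001)}}

   def encapsulate(frame, vn_context, egress_nve_ip):
       """Prepend a (schematic) NVO3 overlay header and outer header."""
       return {"outer_dst": egress_nve_ip, "vn_context": vn_context,
               "payload": frame}

   def ingress_process(port, vlan, dst_mac, frame):
       vni = vap_table.get((port, vlan))   # identify the VNI from the VAP
       if vni is None:
           return None                     # no VAP match: drop
       tunnel = fib[vni].get(dst_mac)      # per-VNI destination lookup
       if tunnel is None:
           return "flood"                  # unknown unicast: BUM handling
       return encapsulate(frame, tunnel.vn_context, tunnel.egress_nve_ip)

   print(ingress_process("vport7", 100, "00:11:22:33:44:55",
                         b"tenant frame"))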
3. Data Plane Requirements

3.1. Virtual Access Points (VAPs)

The NVE forwarding plane MUST support VAP identification through the
following mechanisms:

- Using the local interface on which the frames are received, where
  the local interface may be an internal, virtual port in a VSwitch
  or a physical port on the ToR

- Using the local interface and some fields in the frame header,
  e.g., one or more VLAN tags or the source MAC address

3.2. Virtual Network Instance (VNI)

VAPs are associated with a specific VNI at service instantiation
time.

A VNI identifies a per-tenant private context, i.e. per-tenant
policies and a FIB table that allow overlapping address space
between tenants.

There are different VNI types, differentiated by the virtual network
service they provide to Tenant Systems. Network virtualization can
be provided by L2 and/or L3 VNIs.

3.2.1. L2 VNI

An L2 VNI MUST provide an emulated Ethernet multipoint service, as
if Tenant Systems were interconnected by a bridge, but using a set
of NVO3 tunnels instead. The emulated bridge could be 802.1Q enabled
(allowing the use of VLAN tags as VAPs). An L2 VNI provides a per-
tenant virtual switching instance with MAC address isolation and L3
tunneling. Loop avoidance capability MUST be provided.

Forwarding table entries provide mapping information between Tenant
System MAC addresses and VAPs on directly connected VNIs, and L3
tunnel destination addresses over the overlay. Such entries could be
populated by a control or management plane, or via the data plane.

By default, data plane learning MUST be used to populate forwarding
tables. As frames arrive from VAPs or from overlay tunnels, standard
MAC learning procedures are used: the Tenant System source MAC
address is learned against the VAP or the NVO3 tunnel encapsulation
source address on which the frame arrived. This implies that unknown
unicast traffic will be flooded.

When flooding is required, either to deliver unknown unicast,
broadcast or multicast traffic, the NVE MUST support either ingress
replication or multicast.

When using multicast, the NVE MUST have one or more multicast trees
that can be used by local VNIs for flooding to NVEs belonging to the
same VN. For each VNI, there is at least one flooding tree used for
Broadcast, Unknown Unicast and Multicast forwarding. This tree MAY
be shared across VNIs. The flooding tree is equivalent to a
multicast (*,G) construct where all the NVEs on which the
corresponding VNI is instantiated are members.

When tenant multicast is supported, it SHOULD also be possible to
select whether the NVE provides optimized multicast trees inside the
VNI for individual tenant multicast groups or whether the default
VNI flooding tree is used. If the former option is selected, the VNI
SHOULD be able to snoop IGMP/MLD messages in order to efficiently
join/prune Tenant Systems to/from multicast trees.
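As a rough, non-normative sketch of the data plane learning and
flooding behavior described in this section, the following Python
fragment learns source MAC addresses against the VAP or tunnel
source on which a frame arrived and floods unknown unicast. Data
structures, endpoint names, and the split-horizon check are purely
illustrative assumptions.

   # Sketch of per-VNI data plane MAC learning and flooding for an
   # L2 VNI.  Names and values are illustrative only.

   from collections import defaultdict

   # Per-VNI MAC table: MAC -> local VAP or remote NVE tunnel source
   mac_table = defaultdict(dict)

   # Per-VNI flooding list: all VAPs and remote NVEs in the VN
   flood_list = {"vni-blue": ["vap1", "vap2", "nve:192.0.2.10"]}

   def l2_vni_forward(vni, src_mac, dst_mac, arrival_point):
       # Standard MAC learning: bind the source MAC to the VAP or the
       # NVO3 tunnel source address the frame arrived on.
       mac_table[vni][src_mac] = arrival_point

       destination = mac_table[vni].get(dst_mac)
       if destination is not None:
           return [destination]           # known unicast
       # Unknown unicast (or broadcast/multicast): flood to every
       # member of the VN except the point the frame came from.
       return [p for p in flood_list[vni] if p != arrival_point]

   print(l2_vni_forward("vni-blue", "aa:bb:cc:00:00:01",
                        "ff:ff:ff:ff:ff:ff", "vap1"))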
3.2.2. L3 VNI

L3 VNIs MUST provide virtualized IP routing and forwarding. L3 VNIs
MUST support a per-tenant forwarding instance with IP addressing
isolation and L3 tunneling for interconnecting instances of the same
VNI on NVEs.

In the case of an L3 VNI, the inner TTL field MUST be decremented by
(at least) 1, as if the NVO3 egress NVE were one (or more) hop(s)
away. The TTL field in the outer IP header MUST be set to a value
appropriate for delivery of the encapsulated frame to the tunnel
exit point. Thus, the default behavior MUST be the TTL "pipe" model,
where the overlay network looks like one hop to the sending NVE.
Configuration of a "uniform" TTL model, where the outer tunnel TTL
is set equal to the inner TTL at the ingress NVE and the inner TTL
is set to the outer TTL value at the egress NVE, MAY be supported.

L2 and L3 VNIs can be deployed in isolation or in combination to
optimize traffic flows per tenant across the overlay network. For
example, an L2 VNI may be configured across a number of NVEs to
offer L2 multi-point service connectivity while an L3 VNI can be
co-located to offer local routing capabilities and gateway
functionality. In addition, integrated routing and bridging per
tenant MAY be supported on an NVE. An instantiation of such a
service may be realized by interconnecting an L2 VNI as access to an
L3 VNI on the NVE.

When multicast is supported, it MAY be possible to select whether
the NVE provides optimized multicast trees inside the VNI for
individual tenant multicast groups or whether a default VNI
multicast tree, where all the NVEs of the corresponding VNI are
members, is used.

3.3. Overlay Module

The overlay module performs a number of functions related to NVO3
header and tunnel processing.

The following figure shows a generic NVO3 encapsulated frame:

      +--------------------------+
      |        Tenant Frame      |
      +--------------------------+
      |    NVO3 Overlay Header   |
      +--------------------------+
      |   Outer Underlay header  |
      +--------------------------+
      |  Outer Link layer header |
      +--------------------------+

      Figure 2 : NVO3 encapsulated frame

where:

. Tenant frame: Ethernet or IP, based upon the VNI type

. NVO3 overlay header: Header containing VNI context information
  and other optional fields that can be used for processing this
  packet

. Outer underlay header: Can be either IP or MPLS

. Outer link layer header: Header specific to the physical
  transmission link used

3.3.1. NVO3 overlay header

An NVO3 overlay header MUST be included after the underlay tunnel
header when forwarding tenant traffic.

Note that this information can be carried within existing protocol
headers (when overloading of specific fields is possible) or within
a separate header.

3.3.1.1. Virtual Network Context Identification

The overlay encapsulation header MUST contain a field which allows
the encapsulated frame to be delivered to the appropriate virtual
network endpoint by the egress NVE.

The egress NVE uses this field to determine the appropriate virtual
network context in which to process the packet. This field MAY be an
explicit, unique (to the administrative domain) virtual network
identifier (VNID) or MAY express the necessary context information
in other ways (e.g. a locally significant identifier).

In the case of a global identifier, this field MUST be large enough
to scale to hundreds of thousands of virtual networks. Note that
there is typically no such constraint when using a local identifier.
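As a non-normative illustration of egress-side context handling, the
sketch below assumes a 24-bit global VNID purely as an example of a
field that satisfies the scaling requirement above (2^24 is roughly
16.7 million values); the actual field size and layout are left to
solution documents.

   # Sketch of egress NVE VN context handling.  The 24-bit VNID and
   # the table contents are assumptions for illustration only.

   VNID_BITS = 24
   MAX_VNID = (1 << VNID_BITS) - 1

   # Egress NVE mapping from received VN context to local VNI state
   vn_context_table = {5001: "vni-blue", 5002: "vni-red"}

   def egress_lookup(vn_context):
       if not 0 <= vn_context <= MAX_VNID:
           raise ValueError("VN context out of range for a 24-bit VNID")
       # An unknown context cannot be mapped to a VNI; the frame is
       # dropped (None).
       return vn_context_table.get(vn_context)

   print(egress_lookup(5001))   # -> 'vni-blue'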
3.3.1.2. Service QoS identifier

Traffic flows originating from different applications could rely on
differentiated forwarding treatment to meet end-to-end availability
and performance objectives. Such applications may span one or more
overlay networks. To enable such treatment, support for multiple
Classes of Service (CoS) across or between overlay networks MAY be
required.

To effectively enforce CoS across or between overlay networks, NVEs
MAY be able to map CoS markings between networking layers, e.g.,
Tenant Systems, Overlays, and/or Underlay, enabling each networking
layer to independently enforce its own CoS policies. For example:

- TS (e.g. VM) CoS

  o Tenant CoS policies MAY be defined by Tenant administrators

  o QoS fields (e.g. IP DSCP and/or Ethernet 802.1p) in the tenant
    frame are used to indicate application level CoS requirements

- NVE CoS

  o The NVE MAY classify packets based on Tenant CoS markings or
    other mechanisms (e.g. DPI) to identify the proper service CoS
    to be applied across the overlay network

  o NVE service CoS levels are normalized to a common set (for
    example 8 levels) across multiple tenants; the NVE uses per-
    tenant policies to map Tenant CoS to the normalized service CoS
    fields in the NVO3 header

- Underlay CoS

  o The underlay/core network MAY use a different CoS set (for
    example 4 levels) than the NVE CoS, as the core devices MAY have
    different QoS capabilities compared with NVEs.

  o The Underlay CoS MAY also change as the NVO3 tunnels pass
    between different domains.

Support for NVE service CoS MAY be provided through a QoS field
inside the NVO3 overlay header. Examples of service CoS carried as
part of a service tag are the 802.1p and DE bits in VLAN and PBB
I-SID tags, and the TC bits in MPLS VPN labels.

3.3.2. Tunneling function

This section describes the underlay tunneling requirements. From an
encapsulation perspective, IPv4 or IPv6 MUST be supported, both IPv4
and IPv6 SHOULD be supported, and MPLS tunneling MAY be supported.

3.3.2.1. LAG and ECMP

For performance reasons, multipath over LAG and ECMP paths MAY be
supported.

LAG (Link Aggregation Group) [IEEE 802.1AX-2008] and ECMP (Equal
Cost Multi Path) are commonly used techniques to perform load-
balancing of microflows over a set of parallel links, either at
Layer 2 (LAG) or Layer 3 (ECMP). Existing deployed hardware
implementations of LAG and ECMP use a hash of various fields in the
encapsulation (outermost) header(s) (e.g. source and destination MAC
addresses for non-IP traffic; source and destination IP addresses,
L4 protocol, and L4 source and destination port numbers for IP
traffic). Furthermore, hardware deployed for the underlay network(s)
will most often be unaware of the carried, innermost L2 frames or L3
packets transmitted by the TS.

Thus, in order to perform fine-grained load-balancing over LAG and
ECMP paths in the underlying network, the encapsulation MUST result
in sufficient entropy to exercise all paths through several LAG/ECMP
hops.

The entropy information can be inferred from the NVO3 overlay header
or the underlay header. If the overlay protocol does not support the
necessary entropy information, or the switches/routers in the
underlay do not support parsing of the additional entropy
information in the overlay header, underlay switches and routers
should be programmable, i.e. able to select the appropriate fields
in the underlay header for hash calculation based on the type of
overlay header.
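One common (but not mandated) way to provide such entropy is a UDP-
based encapsulation in which the ingress NVE derives the outer UDP
source port from a hash of the inner flow, so that underlay LAG/ECMP
hashes on the outer header spread flows across paths. The following
sketch illustrates that assumption; the port range and hash choice
are examples only.

   # Sketch: derive an outer UDP source port from the inner 5-tuple
   # so the underlay's outer-header hash sees per-flow entropy.
   # The UDP-based encapsulation is an assumption for illustration.

   import zlib

   EPHEMERAL_MIN, EPHEMERAL_MAX = 49152, 65535   # dynamic port range

   def outer_source_port(inner_flow):
       """Map an inner 5-tuple to a stable outer UDP source port."""
       key = "|".join(str(f) for f in inner_flow).encode()
       h = zlib.crc32(key)                 # any stable hash will do
       span = EPHEMERAL_MAX - EPHEMERAL_MIN + 1
       return EPHEMERAL_MIN + (h % span)

   # Same flow -> same port (preserves ordering); different flows ->
   # different ports with high probability (spreads load over LAG/ECMP).
   flow = ("10.1.1.1", "10.2.2.2", 6, 33812, 443)
   print(outer_source_port(flow))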
All packets that belong to a specific flow MUST follow the same path
in order to prevent packet re-ordering. This is typically achieved
by ensuring that the fields used for hashing are identical for a
given flow.

The goal is for all paths available to the overlay network to be
used efficiently. Different flows should be distributed as evenly as
possible across multiple underlay network paths. For instance, this
can be achieved by ensuring that some fields used for hashing are
randomly generated.

3.3.2.2. DiffServ and ECN marking

When traffic is encapsulated in a tunnel header, there are numerous
options as to how the Diffserv Code-Point (DSCP) and Explicit
Congestion Notification (ECN) markings are set in the outer header
and propagated to the inner header on decapsulation.

[RFC2983] defines two modes for mapping the DSCP markings from inner
to outer headers and vice versa. The Uniform model copies the inner
DSCP marking to the outer header on tunnel ingress, and copies that
outer header value back to the inner header at tunnel egress. The
Pipe model sets the DSCP value to some value based on local policy
at ingress and does not modify the inner header on egress. Both
models SHOULD be supported.

[RFC6040] defines ECN marking and processing for IP tunnels.

3.3.2.3. Handling of BUM traffic

NVO3 data plane support for either ingress replication or point-to-
multipoint tunnels is required to send traffic destined to multiple
locations on a per-VNI basis (e.g. L2/L3 multicast traffic, L2
broadcast and unknown unicast traffic). It is possible for both
methods to be used simultaneously.

There is a bandwidth versus state trade-off between the two
approaches. User-configurable knobs MUST be provided to select which
method(s) gets used, based upon the amount of replication required
(i.e. the number of hosts per group), the amount of multicast state
to maintain, the duration of multicast flows, and the scalability of
multicast protocols.

When ingress replication is used, NVEs MUST maintain, for each VNI,
the set of related tunnel endpoints to which they need to replicate
the frame.

For point-to-multipoint tunnels, the bandwidth efficiency is
increased at the cost of more state in the core nodes. The ability
to auto-discover or pre-provision the mapping between VNI multicast
trees and related tunnel endpoints at the NVE and/or throughout the
core SHOULD be supported.
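An informal sketch of the ingress replication option follows. The
per-VNI endpoint list and addresses are hypothetical; how the list
is populated (control or management plane) is out of scope of this
document.

   # Sketch of ingress replication for BUM traffic: the ingress NVE
   # keeps, per VNI, the list of remote tunnel endpoints and sends
   # one encapsulated copy of the frame to each of them.

   bum_endpoints = {
       "vni-blue": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
   }

   def replicate_bum(vni, frame, encap):
       """Return one encapsulated copy per remote NVE of the VNI."""
       return [encap(frame, remote_nve) for remote_nve in bum_endpoints[vni]]

   copies = replicate_bum("vni-blue", b"broadcast frame",
                          lambda f, nve: {"outer_dst": nve, "payload": f})
   print(len(copies))   # 3 copies, one per remote NVE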
3.4. External NVO3 connectivity

NVO3 services MUST interoperate with current VPN and Internet
services. This may happen inside one DC during a migration phase or
as NVO3 services are delivered to the outside world via Internet or
VPN gateways.

Moreover, the compute and storage services delivered by an NVO3
domain may span multiple DCs, requiring inter-DC connectivity. From
a DC perspective, a set of gateway devices is required in all of
these cases, albeit with different functionalities influenced by the
overlay type used across the WAN, the service type, and the DC
network technologies used at each DC site.

A GW handling the connectivity between NVO3 and external domains
represents a single point of failure that may affect multiple tenant
services. Redundancy between NVO3 and external domains MUST be
supported.

3.4.1. GW Types

3.4.1.1. VPN and Internet GWs

Tenant sites may already be interconnected using one of the existing
VPN services and technologies (VPLS or IP VPN). If a new NVO3
encapsulation is used, a VPN GW is required to forward traffic
between the NVO3 and VPN domains. Translation of encapsulations MAY
be required. Internet-connected tenants require translation from the
NVO3 encapsulation to IP in the NVO3 gateway. The translation
function SHOULD minimize provisioning touches.

3.4.1.2. Inter-DC GW

Inter-DC connectivity MAY be required to provide support for
features like disaster prevention or compute load re-distribution.
This MAY be provided via a set of gateways interconnected through a
WAN. This type of connectivity MAY be provided either through
extension of the NVO3 tunneling domain or via VPN GWs.

3.4.1.3. Intra-DC gateways

Even within one DC there may be End Devices that do not support NVO3
encapsulation, for example bare metal servers, hardware appliances
and storage. A gateway device, e.g. a ToR, is required to translate
between the NVO3 and Ethernet VLAN encapsulations.

3.4.2. Path optimality between NVEs and Gateways

Within an NVO3 overlay, a default assumption is that NVO3 traffic
will be equally load-balanced across the underlying network,
consisting of LAG and/or ECMP paths. This assumption is valid only
as long as: a) all traffic is load-balanced equally among each of
the component links and paths; and b) each of the component
links/paths is of identical capacity. During the course of normal
operation of the underlying network, it is possible that one or more
of the component links/paths of a LAG may be taken out of service in
order to be repaired, e.g. due to hardware failure of cabling,
optics, etc. In such cases, the administrator should configure the
underlying network such that an entire LAG bundle will be reported
as operationally down if there is a failure of any single component
link of the LAG bundle (e.g. an N = M configuration of the LAG
bundle), so that traffic is known to be carried sufficiently by
alternate, available (potentially ECMP) paths in the underlying
network. This is likely an adequate assumption for intra-DC traffic,
where presumably the cost of additional protection capacity along
alternate paths is not prohibitive. Thus, there are likely no
additional requirements on NVO3 solutions to accommodate this type
of underlying network configuration and administration.

There is a similar case with ECMP, used intra-DC, where the failure
of a single component path of an ECMP group would result in traffic
shifting onto the surviving members of the ECMP group.
Unfortunately, there are no automatic recovery methods in IP routing
protocols to detect a simultaneous failure of more than one
component path in an ECMP group, operationally disable the entire
ECMP group, and allow traffic to shift onto alternative paths. This
problem is attributable to the underlying network and is thus out of
scope of any NVO3 solutions.
On the other hand, for inter-DC and DC-to-external-network cases
that use a WAN, the costs of the underlying network and/or service
(e.g. an IP VPN service) are more expensive; therefore, there is a
requirement on administrators to both: a) ensure high availability
(active-backup failover or active-active load-balancing); and b)
maintain substantial utilization of the WAN transport capacity at
nearly all times, particularly in the case of active-active load-
balancing. With respect to the data plane requirements of NVO3
solutions, in the case of active-backup failover, all of the ingress
NVEs need to dynamically adapt to the failure of an active NVE GW:
when the backup NVE GW announces itself into the NVO3 overlay
immediately following a failure of the previously active NVE GW, the
ingress NVEs update their forwarding tables accordingly (e.g.
perhaps through data plane learning and/or translation of a
gratuitous ARP or IPv6 Router Advertisement). Note that active-
backup failover could be used to accomplish a crude form of load-
balancing by, for example, manually configuring each tenant to use a
different NVE GW, in a round-robin fashion.

3.4.2.1. Load-balancing

When using active-active load-balancing across physically separate
NVE GWs (e.g. two separate chassis), an NVO3 solution SHOULD support
forwarding tables that can simultaneously map a single egress NVE to
more than one NVO3 tunnel. The granularity of such mappings, in both
active-backup and active-active, MUST be specific to each tenant.

3.4.2.2. Triangular Routing Issues (a.k.a. Traffic Tromboning)

An L2/ELAN over NVO3 service may span multiple racks distributed
across different DC regions. Multiple ELANs belonging to one tenant
may be interconnected or connected to the outside world through
multiple Router/VRF gateways distributed throughout the DC regions.
In this scenario, without aid from an NVO3 or other type of
solution, traffic from an ingress NVE destined to external gateways
will take a non-optimal path that results in higher latency and
costs (since it uses more expensive resources of a WAN). In the case
of traffic from an IP/MPLS network destined toward the entrance to
an NVO3 overlay, well-known IP routing techniques MAY be used to
optimize traffic into the NVO3 overlay (at the expense of additional
routes in the IP/MPLS network). In summary, these issues are well
known as triangular routing.

Procedures for gateway selection to avoid triangular routing issues
SHOULD be provided.

The details of such procedures are, most likely, part of the NVO3
Management and/or Control Plane requirements and, thus, out of scope
of this document. However, a key requirement on the data plane of
any NVO3 solution to avoid triangular routing is stated above, in
Section 3.4.2, with respect to active-active load-balancing. More
specifically, an NVO3 solution SHOULD support forwarding tables that
can simultaneously map a single egress NVE to more than one NVO3
tunnel.

The expectation is that, through the Control and/or Management
Planes, this mapping information may be dynamically manipulated to,
for example, provide the closest geographic and/or topological exit
point (egress NVE) for each ingress NVE.
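A minimal, non-normative sketch of such a forwarding table follows:
one logical egress NVE GW is reachable over more than one NVO3
tunnel, and a per-flow hash keeps packet order while spreading load.
Tunnel endpoints and tenant names are illustrative assumptions; how
the entries are populated is out of scope of this document.

   # Sketch of a per-tenant forwarding table in which one egress
   # NVE GW maps to several NVO3 tunnels (active-active).

   import zlib

   # tenant -> egress GW -> list of candidate tunnel endpoints
   gw_tunnels = {
       "tenant-a": {"gw-east": ["198.51.100.1", "198.51.100.2"]},
   }

   def select_tunnel(tenant, egress_gw, inner_flow):
       """Pick one tunnel per flow so packet order is preserved."""
       tunnels = gw_tunnels[tenant][egress_gw]
       h = zlib.crc32("|".join(map(str, inner_flow)).encode())
       return tunnels[h % len(tunnels)]

   flow = ("10.1.1.1", "192.0.2.99", 6, 40000, 80)
   print(select_tunnel("tenant-a", "gw-east", flow))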
3.5. Path MTU

The tunnel overlay header can cause the MTU of the path to the
egress tunnel endpoint to be exceeded.

IP fragmentation SHOULD be avoided for performance reasons.

The interface MTU as seen by a Tenant System SHOULD be adjusted such
that no fragmentation is needed. This can be achieved by
configuration or be discovered dynamically.

At least one of the following options MUST be supported:

o Classical ICMP-based Path MTU Discovery [RFC1191] [RFC1981] or
  Extended Path MTU Discovery techniques such as defined in
  [RFC4821]

o Segmentation and reassembly support from the overlay layer
  operations without relying on the Tenant Systems to know about
  the end-to-end MTU

o The underlay network MAY be designed in such a way that the MTU
  can accommodate the extra tunnel overhead.

3.6. Hierarchical NVE

It might be desirable to support the concept of hierarchical NVEs,
such as spoke NVEs and hub NVEs, in order to address possible NVE
performance limitations and service connectivity optimizations.

For instance, spoke NVE functionality may be used when processing
capabilities are limited. A hub NVE would provide additional data
processing capabilities such as packet replication.

NVEs can be connected in either an any-to-any or a hub-and-spoke
topology on a per-VNI basis.

3.7. NVE Multi-Homing Requirements

Multi-homing techniques SHOULD be used to increase the reliability
of an NVO3 network. It is also important to ensure that physical
diversity in an NVO3 network is taken into account to avoid single
points of failure.

Multi-homing can be enabled in various nodes, from Tenant Systems
into ToRs, ToRs into core switches/routers, and core nodes into DC
GWs.

Tenant Systems can be either L2 or L3 nodes. In the former case
(L2), techniques such as LAG or STP, for instance, MAY be used. In
the latter case (L3), it is possible that no dynamic routing
protocol is enabled. Tenant Systems can be multi-homed into remote
NVEs using several interfaces (physical NICs or vNICs), with an IP
address per interface, either to the same NVO3 network or into
different NVO3 networks. When one of the links fails, the
corresponding IP address is no longer reachable, but the other
interfaces can still be used. When a Tenant System is co-located
with an NVE, IP routing can be relied upon to handle routing over
diverse links to ToRs.

External connectivity MAY be handled by two or more NVO3 gateways.
Each gateway is connected to a different domain (e.g. ISP) and runs
BGP multi-homing. They serve as an access point to external networks
such as VPNs or the Internet. When a connection to an upstream
router is lost, the alternative connection is used and the failed
route is withdrawn.

3.8. Other considerations

3.8.1. Data Plane Optimizations

Data plane forwarding and encapsulation choices SHOULD consider the
limitations of possible NVE implementations, specifically in
software-based implementations (e.g. servers running VSwitches).

NVEs SHOULD provide efficient processing of traffic. For instance,
packet alignment, the use of offsets to minimize header parsing, and
padding techniques SHOULD be considered when designing NVO3
encapsulation types.

The NVO3 encapsulation/decapsulation processing in software-based
NVEs SHOULD make use of hardware assist provided by NICs in order to
speed up packet processing.
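To illustrate why fixed offsets and aligned fields help a software
NVE, the sketch below extracts the VN context with a single unpack
at a known offset instead of iterative header parsing. The 8-byte
header layout used here is purely hypothetical and is not a proposed
NVO3 encapsulation.

   # Sketch: parsing a fixed-offset, 8-byte (word-aligned) overlay
   # header in a software NVE.  The layout is hypothetical.

   import struct

   HDR_FMT = "!B3xI"          # flags (1 byte), 3 pad bytes, 32-bit word
   HDR_LEN = struct.calcsize(HDR_FMT)   # 8 bytes, payload stays aligned

   def parse_overlay_header(packet, offset=0):
       """Return (flags, vn_context, payload) from a fixed-offset header."""
       flags, word = struct.unpack_from(HDR_FMT, packet, offset)
       vn_context = word >> 8            # upper 24 bits carry the VNID
       payload = packet[offset + HDR_LEN:]
       return flags, vn_context, payload

   pkt = struct.pack(HDR_FMT, 0x08, 5001 << 8) + b"tenant frame"
   print(parse_overlay_header(pkt))      # (8, 5001, b'tenant frame')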
3.8.2. NVE location trade-offs

In the case of DC traffic, traffic originated from a VM is native
Ethernet traffic. This traffic can be switched by a local VM switch
or ToR switch and then by a DC gateway. The NVE function can be
embedded within any of these elements.

The NVE function can be supported in various DC network elements,
such as a VM, VM switch, ToR switch or DC GW.

The following criteria SHOULD be considered when deciding where the
NVE processing boundary happens:

o Processing and memory requirements

o Datapath (e.g. lookups, filtering, encapsulation/decapsulation)

o Control plane processing (e.g. routing, signaling, OAM)

o FIB/RIB size

o Multicast support

o Routing protocols

o Packet replication capability

o Fragmentation support

o QoS transparency

o Resiliency

4. Security Considerations

This requirements document does not, in itself, raise any specific
security issues.

5. IANA Considerations

IANA does not need to take any action for this draft.

6. References

6.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2. Informative References

[NVOPS]   Narten, T. et al, "Problem Statement: Overlays for Network
          Virtualization", draft-narten-nvo3-overlay-problem-
          statement (work in progress).

[NVO3-framework] Lasserre, M. et al, "Framework for DC Network
          Virtualization", draft-lasserre-nvo3-framework (work in
          progress).

[OVCPREQ] Kreeger, L. et al, "Network Virtualization Overlay Control
          Protocol Requirements", draft-kreeger-nvo3-overlay-cp
          (work in progress).

[FLOYD]   Floyd, S. and A. Romanow, "Dynamics of TCP Traffic over
          ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995.

[RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
          Networks (VPNs)", RFC 4364, February 2006.

[RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

[RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC 1981,
          August 1996.

[RFC4821] Mathis, M. et al, "Packetization Layer Path MTU
          Discovery", RFC 4821, March 2007.

[RFC2983] Black, D., "Differentiated Services and Tunnels",
          RFC 2983, October 2000.

[RFC6040] Briscoe, B., "Tunnelling of Explicit Congestion
          Notification", RFC 6040, November 2010.

[RFC6438] Carpenter, B. et al, "Using the IPv6 Flow Label for Equal
          Cost Multipath Routing and Link Aggregation in Tunnels",
          RFC 6438, November 2011.

[RFC6391] Bryant, S. et al, "Flow-Aware Transport of Pseudowires
          over an MPLS Packet Switched Network", RFC 6391, November
          2011.

7. Acknowledgments

In addition to the authors, the following people have contributed to
this document:

Shane Amante, Dimitrios Stiliadis, Rotem Salomonovitch, Larry
Kreeger, and Eric Gray.

This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

Nabil Bitar
Verizon
40 Sylvan Road
Waltham, MA 02145
Email: nabil.bitar@verizon.com

Marc Lasserre
Alcatel-Lucent
Email: marc.lasserre@alcatel-lucent.com

Florin Balus
Alcatel-Lucent
777 E. Middlefield Road
Mountain View, CA, USA 94043
Email: florin.balus@alcatel-lucent.com

Thomas Morin
France Telecom Orange
Email: thomas.morin@orange.com

Lizhong Jin
Email: lizho.jin@gmail.com

Bhumip Khasnabish
ZTE
Email: Bhumip.khasnabish@zteusa.com