Internet Engineering Task Force                              Nabil Bitar
Internet Draft                                                   Verizon
Intended status: Informational
Expires: January 2014                                      Marc Lasserre
                                                            Florin Balus
                                                          Alcatel-Lucent

                                                            Thomas Morin
                                                   France Telecom Orange

                                                             Lizhong Jin

                                                       Bhumip Khasnabish
                                                                     ZTE

                                                            July 1, 2013

                      NVO3 Data Plane Requirements
              draft-ietf-nvo3-dataplane-requirements-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 1, 2014.
39 Copyright Notice 41 Copyright (c) 2013 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with 49 respect to this document. Code Components extracted from this 50 document must include Simplified BSD License text as described in 51 Section 4.e of the Trust Legal Provisions and are provided without 52 warranty as described in the Simplified BSD License. 54 Abstract 56 Several IETF drafts relate to the use of overlay networks to support 57 large scale virtual data centers. This draft provides a list of data 58 plane requirements for Network Virtualization over L3 (NVO3) that 59 have to be addressed in solutions documents. 61 Table of Contents 63 1. Introduction................................................3 64 1.1. Conventions used in this document.......................3 65 1.2. General terminology.....................................3 66 2. Data Path Overview..........................................4 67 3. Data Plane Requirements......................................5 68 3.1. Virtual Access Points (VAPs)............................5 69 3.2. Virtual Network Instance (VNI)..........................5 70 3.2.1. L2 VNI...............................................5 71 3.2.2. L3 VNI...............................................6 72 3.3. Overlay Module.........................................7 73 3.3.1. NVO3 overlay header...................................8 74 3.3.1.1. Virtual Network Context Identification..............8 75 3.3.1.2. Service QoS identifier..............................8 76 3.3.2. Tunneling function....................................9 77 3.3.2.1. LAG and ECMP.......................................10 78 3.3.2.2. DiffServ and ECN marking...........................10 79 3.3.2.3. Handling of BUM traffic............................11 80 3.4. External NVO3 connectivity.............................11 81 3.4.1. GW Types............................................12 82 3.4.1.1. VPN and Internet GWs...............................12 83 3.4.1.2. Inter-DC GW........................................12 84 3.4.1.3. Intra-DC gateways..................................12 85 3.4.2. Path optimality between NVEs and Gateways............12 86 3.4.2.1. Triangular Routing Issues (Traffic Tromboning)......13 87 3.5. Path MTU..............................................14 88 3.6. Hierarchical NVE.......................................15 89 3.7. NVE Multi-Homing Requirements..........................15 90 3.8. OAM...................................................16 91 3.9. Other considerations...................................16 92 3.9.1. Data Plane Optimizations.............................16 93 3.9.2. NVE location trade-offs..............................17 94 4. Security Considerations.....................................17 95 5. IANA Considerations........................................17 96 6. References.................................................18 97 6.1. Normative References...................................18 98 6.2. Informative References.................................18 99 7. Acknowledgments............................................19 101 1. Introduction 103 1.1. 
Conventions used in this document 105 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 106 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 107 document are to be interpreted as described in RFC-2119 [RFC2119]. 109 In this document, these words will appear with that interpretation 110 only when in ALL CAPS. Lower case uses of these words are not to be 111 interpreted as carrying RFC-2119 significance. 113 1.2. General terminology 115 The terminology defined in [NVO3-framework] is used throughout this 116 document. Terminology specific to this memo is defined here and is 117 introduced as needed in later sections. 119 BUM: Broadcast, Unknown Unicast, Multicast traffic 121 TS: Tenant System 123 2. Data Path Overview 125 The NVO3 framework [NVO3-framework] defines the generic NVE model 126 depicted in Figure 1: 128 +------- L3 Network ------+ 129 | | 130 | Tunnel Overlay | 131 +------------+---------+ +---------+------------+ 132 | +----------+-------+ | | +---------+--------+ | 133 | | Overlay Module | | | | Overlay Module | | 134 | +---------+--------+ | | +---------+--------+ | 135 | |VN context| | VN context| | 136 | | | | | | 137 | +-------+--------+ | | +--------+-------+ | 138 | | |VNI| ... |VNI| | | | |VNI| ... |VNI| | 139 NVE1 | +-+------------+-+ | | +-+-----------+--+ | NVE2 140 | | VAPs | | | | VAPs | | 141 +----+------------+----+ +----+------------+----+ 142 | | | | 143 -------+------------+-----------------+------------+------- 144 | | Tenant | | 145 | | Service IF | | 146 Tenant Systems Tenant Systems 148 Figure 1 : Generic reference model for NV Edge 150 When a frame is received by an ingress NVE from a Tenant System over 151 a local VAP, it needs to be parsed in order to identify which 152 virtual network instance it belongs to. The parsing function can 153 examine various fields in the data frame (e.g., VLANID) and/or 154 associated interface/port the frame came from. 156 Once a corresponding VNI is identified, a lookup is performed to 157 determine where the frame needs to be sent. This lookup can be based 158 on any combinations of various fields in the data frame (e.g., 159 destination MAC addresses and/or destination IP addresses). Note 160 that additional criteria such as 802.1p and/or DSCP markings might 161 be used to select an appropriate tunnel or local VAP destination. 163 Lookup tables can be populated using different techniques: data 164 plane learning, management plane configuration, or a distributed 165 control plane. Management and control planes are not in the scope of 166 this document. The data plane based solution is described in this 167 document as it has implications on the data plane processing 168 function. 170 The result of this lookup yields the corresponding information 171 needed to build the overlay header, as described in section 3.3. 172 This information includes the destination L3 address of the egress 173 NVE. Note that this lookup might yield a list of tunnels such as 174 when ingress replication is used for BUM traffic. 176 The overlay header MUST include a context identifier which the 177 egress NVE will use to identify which VNI this frame belongs to. 179 The egress NVE checks the context identifier and removes the 180 encapsulation header and then forwards the original frame towards 181 the appropriate recipient, usually a local VAP. 183 3. Data Plane Requirements 185 3.1. 
Virtual Access Points (VAPs)

   The NVE forwarding plane MUST support VAP identification through the
   following mechanisms:

   - Using the local interface on which the frames are received, where
     the local interface may be an internal, virtual port in a VSwitch
     or a physical port on the ToR

   - Using the local interface and some fields in the frame header,
     e.g. one or multiple VLANs or the source MAC address

3.2. Virtual Network Instance (VNI)

   VAPs are associated with a specific VNI at service instantiation
   time.

   A VNI identifies a per-tenant private context, i.e. per-tenant
   policies and a FIB table to allow overlapping address space between
   tenants.

   There are different VNI types, differentiated by the virtual network
   service they provide to Tenant Systems. Network virtualization can
   be provided by L2 and/or L3 VNIs.

3.2.1. L2 VNI

   An L2 VNI MUST provide an emulated Ethernet multipoint service, as
   if Tenant Systems were interconnected by a bridge (but instead using
   a set of NVO3 tunnels). The emulated bridge MAY be 802.1Q enabled
   (allowing the use of VLAN tags as a VAP). An L2 VNI provides a
   per-tenant virtual switching instance with MAC addressing isolation
   and L3 tunneling. Loop avoidance capability MUST be provided.

   Forwarding table entries provide mapping information between tenant
   system MAC addresses and VAPs on directly connected VNIs and L3
   tunnel destination addresses over the overlay. Such entries MAY be
   populated by a control or management plane, or via data plane
   learning.

   In the absence of a management or control plane, data plane learning
   MUST be used to populate forwarding tables. As frames arrive from
   VAPs or from overlay tunnels, standard MAC learning procedures are
   used: the tenant system source MAC address is learned against the
   VAP or the NVO3 tunneling encapsulation source address on which the
   frame arrived. This implies that unknown unicast traffic is flooded,
   i.e. broadcast.

   When flooding is required, either to deliver unknown unicast,
   broadcast or multicast traffic, the NVE MUST support either ingress
   replication or multicast. In the latter case, the NVE MUST have one
   or more multicast trees that can be used by local VNIs for flooding
   to NVEs belonging to the same VN. For each VNI, there is one
   flooding tree; the corresponding multicast tree may be dedicated to
   that VNI or shared across VNIs, in which case multiple VNIs MAY
   share the same default flooding tree. The flooding tree is
   equivalent to a multicast (*,G) construct where all the NVEs for
   which the corresponding VNI is instantiated are members. The
   multicast tree MAY be established automatically via routing and
   signaling or pre-provisioned.

   When tenant multicast is supported, it SHOULD also be possible to
   select whether the NVE provides optimized multicast trees inside the
   VNI for individual tenant multicast groups or whether the default
   VNI flooding tree is used. If the former option is selected, the VNI
   SHOULD be able to snoop IGMP/MLD messages in order to efficiently
   join/prune Tenant Systems from multicast trees.

3.2.2. L3 VNI

   L3 VNIs MUST provide virtualized IP routing and forwarding. L3 VNIs
   MUST support a per-tenant forwarding instance with IP addressing
   isolation and L3 tunneling for interconnecting instances of the same
   VNI on NVEs.
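   As a non-normative illustration of the per-tenant forwarding
   instance described above, the following Python sketch shows how an
   NVE could keep one route table per L3 VNI so that tenants may use
   overlapping IP address space. The class and function names (L3VNI,
   add_route, lookup) and the returned (egress NVE, VNID) tuple are
   assumptions made for this example only and are not mandated by any
   NVO3 solution.

      import ipaddress

      class L3VNI:
          """One virtualized IP forwarding instance per tenant VNI."""

          def __init__(self, vnid):
              self.vnid = vnid
              self.routes = {}  # ip_network -> egress NVE underlay address

          def add_route(self, prefix, egress_nve):
              self.routes[ipaddress.ip_network(prefix)] = egress_nve

          def lookup(self, dest_ip):
              # Longest-prefix match scoped to this tenant's context.
              dest = ipaddress.ip_address(dest_ip)
              matches = [p for p in self.routes if dest in p]
              if not matches:
                  return None
              best = max(matches, key=lambda p: p.prefixlen)
              return self.routes[best], self.vnid

      # Two tenants can use the same 10.0.0.0/24 without conflict
      # because each lookup is scoped to its own VNI.
      tenant_a = L3VNI(vnid=1001)
      tenant_b = L3VNI(vnid=1002)
      tenant_a.add_route("10.0.0.0/24", "192.0.2.1")   # egress NVE1
      tenant_b.add_route("10.0.0.0/24", "192.0.2.2")   # egress NVE2
      assert tenant_a.lookup("10.0.0.5") == ("192.0.2.1", 1001)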
   In the case of an L3 VNI, the inner TTL field MUST be decremented by
   (at least) 1, as if the NVO3 egress NVE were one (or more) hop(s)
   away. The TTL field in the outer IP header MUST be set to a value
   appropriate for delivery of the encapsulated frame to the tunnel
   exit point. Thus, the default behavior MUST be the TTL pipe model,
   where the overlay network looks like one hop to the sending NVE.
   Configuration of a "uniform" TTL model, where the outer tunnel TTL
   is set equal to the inner TTL on the ingress NVE and the inner TTL
   is set to the outer TTL value on egress, MAY be supported.

   L2 and L3 VNIs can be deployed in isolation or in combination to
   optimize traffic flows per tenant across the overlay network. For
   example, an L2 VNI may be configured across a number of NVEs to
   offer L2 multi-point service connectivity while an L3 VNI can be
   co-located to offer local routing capabilities and gateway
   functionality. In addition, integrated routing and bridging per
   tenant MAY be supported on an NVE. An instantiation of such a
   service may be realized by interconnecting an L2 VNI as access to an
   L3 VNI on the NVE.

   The L3 VNI does not require support for Broadcast and Unknown
   Unicast traffic. The L3 VNI MAY provide support for customer
   multicast groups. When multicast is supported, it SHOULD be possible
   to select whether the NVE provides optimized multicast trees inside
   the VNI for individual tenant multicast groups or whether a default
   VNI multicast tree, where all the NVEs of the corresponding VNI are
   members, is used.

3.3. Overlay Module

   The overlay module performs a number of functions related to NVO3
   header and tunnel processing.

   The following figure shows a generic NVO3 encapsulated frame:

              +--------------------------+
              |       Tenant Frame       |
              +--------------------------+
              |   NVO3 Overlay Header    |
              +--------------------------+
              |  Outer Underlay header   |
              +--------------------------+
              | Outer Link layer header  |
              +--------------------------+

              Figure 2 : NVO3 encapsulated frame

   where

   . Tenant frame: Ethernet or IP, based upon the VNI type

   . NVO3 overlay header: Header containing VNI context information
     and other optional fields that can be used for processing this
     packet.

   . Outer underlay header: Can be either IP or MPLS

   . Outer link layer header: Header specific to the physical
     transmission link used

3.3.1. NVO3 overlay header

   An NVO3 overlay header MUST be included after the underlay tunnel
   header when forwarding tenant traffic. Note that this information
   can be carried within existing protocol headers (when overloading of
   specific fields is possible) or within a separate header.

3.3.1.1. Virtual Network Context Identification

   The overlay encapsulation header MUST contain a field which allows
   the encapsulated frame to be delivered to the appropriate virtual
   network endpoint by the egress NVE. The egress NVE uses this field
   to determine the appropriate virtual network context in which to
   process the packet. This field MAY be an explicit, unique (to the
   administrative domain) virtual network identifier (VNID) or MAY
   express the necessary context information in other ways (e.g. a
   locally significant identifier).

   It SHOULD be aligned on a 32-bit boundary so as to make it
   efficiently processable by the data path.
   It MUST be distributable by a control plane or configured via a
   management plane.

   In the case of a global identifier, this field MUST be large enough
   to scale to hundreds of thousands of virtual networks. Note that
   there is no such constraint when using a local identifier.

3.3.1.2. Service QoS identifier

   Traffic flows originating from different applications could rely on
   differentiated forwarding treatment to meet end-to-end availability
   and performance objectives. Such applications may span across one or
   more overlay networks. To enable such treatment, support for
   multiple Classes of Service across or between overlay networks MAY
   be required.

   To effectively enforce CoS across or between overlay networks, NVEs
   MAY be able to map CoS markings between networking layers, e.g.,
   Tenant Systems, Overlays, and/or Underlay, enabling each networking
   layer to independently enforce its own CoS policies. For example:

   - TS (e.g. VM) CoS

     o Tenant CoS policies MAY be defined by Tenant administrators

     o QoS fields (e.g. IP DSCP and/or Ethernet 802.1p) in the tenant
       frame are used to indicate application level CoS requirements

   - NVE CoS

     o The NVE MAY classify packets based on Tenant CoS markings or
       other mechanisms (e.g. DPI) to identify the proper service CoS
       to be applied across the overlay network

     o NVE service CoS levels are normalized to a common set (for
       example 8 levels) across multiple tenants; the NVE uses
       per-tenant policies to map Tenant CoS to the normalized service
       CoS fields in the NVO3 header

   - Underlay CoS

     o The underlay/core network MAY use a different CoS set (for
       example 4 levels) than the NVE CoS, as the core devices MAY have
       different QoS capabilities compared with NVEs.

     o The Underlay CoS MAY also change as the NVO3 tunnels pass
       between different domains.

   Support for NVE Service CoS MAY be provided through a QoS field
   inside the NVO3 overlay header. Examples of service CoS carried as
   part of a service tag are the 802.1p and DE bits in VLAN and PBB
   I-SID tags, and the MPLS TC bits in VPN labels.

3.3.2. Tunneling function

   This section describes the underlay tunneling requirements. From an
   encapsulation perspective, IPv4 or IPv6 MUST be supported, both IPv4
   and IPv6 SHOULD be supported, and MPLS tunneling MAY be supported.

3.3.2.1. LAG and ECMP

   For performance reasons, multipath over LAG and ECMP paths SHOULD be
   supported.

   LAG (Link Aggregation Group) [IEEE 802.1AX-2008] and ECMP (Equal
   Cost Multi Path) are commonly used techniques to perform
   load-balancing of microflows over a set of parallel links, either at
   Layer-2 (LAG) or Layer-3 (ECMP). Existing deployed hardware
   implementations of LAG and ECMP use a hash of various fields in the
   encapsulation (outermost) header(s) (e.g. source and destination MAC
   addresses for non-IP traffic, source and destination IP addresses,
   L4 protocol, L4 source and destination port numbers, etc.).
   Furthermore, hardware deployed for the underlay network(s) will most
   often be unaware of the carried, innermost L2 frames or L3 packets
   transmitted by the TS. Thus, in order to perform fine-grained
   load-balancing over LAG and ECMP paths in the underlying network,
   the encapsulation MUST result in sufficient entropy to exercise all
   paths through several LAG/ECMP hops. The entropy information MAY be
   inferred from the NVO3 overlay header or underlay header. If the
   overlay protocol does not support the necessary entropy information,
   or the switches/routers in the underlay do not support parsing of
   the additional entropy information in the overlay header, underlay
   switches and routers should be programmable, i.e. able to select the
   appropriate fields in the underlay header for hash calculation based
   on the type of overlay header.
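   As a hedged, non-normative sketch of how an encapsulation can expose
   such entropy, the Python fragment below derives an outer-header
   value (here assumed to be carried as a UDP source port) from a hash
   of the inner flow. The port range and the function name
   entropy_source_port are assumptions of this example, not part of a
   defined NVO3 encapsulation.

      import hashlib
      import struct

      def entropy_source_port(src_ip, dst_ip, proto, sport, dport):
          """Map an inner flow to a stable outer value so that underlay
          LAG/ECMP hashing spreads different tenant flows across paths
          while keeping each flow on a single path (no re-ordering)."""
          key = "%s|%s|%d|%d|%d" % (src_ip, dst_ip, proto, sport, dport)
          digest = hashlib.sha256(key.encode()).digest()
          hash16 = struct.unpack("!H", digest[:2])[0]
          # Stay within the dynamic port range so the value cannot be
          # mistaken for a well-known service port.
          return 49152 + (hash16 % 16384)

      # The same inner flow always yields the same outer value.
      assert entropy_source_port("10.1.1.1", "10.2.2.2", 6, 1234, 80) == \
             entropy_source_port("10.1.1.1", "10.2.2.2", 6, 1234, 80)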
   All packets that belong to a specific flow MUST follow the same path
   in order to prevent packet re-ordering. This is typically achieved
   by ensuring that the fields used for hashing are identical for a
   given flow.

   All paths available to the overlay network SHOULD be used
   efficiently. Different flows SHOULD be distributed as evenly as
   possible across multiple underlay network paths. For instance, this
   can be achieved by ensuring that some fields used for hashing are
   randomly generated.

3.3.2.2. DiffServ and ECN marking

   When traffic is encapsulated in a tunnel header, there are numerous
   options as to how the Diffserv Code-Point (DSCP) and Explicit
   Congestion Notification (ECN) markings are set in the outer header
   and propagated to the inner header on decapsulation.

   [RFC2983] defines two modes for mapping the DSCP markings from inner
   to outer headers and vice versa. The Uniform model copies the inner
   DSCP marking to the outer header on tunnel ingress, and copies that
   outer header value back to the inner header at tunnel egress. The
   Pipe model sets the DSCP value to some value based on local policy
   at ingress and does not modify the inner header on egress. Both
   models SHOULD be supported.

   ECN marking MUST be performed according to [RFC6040], which
   describes the correct ECN behavior for IP tunnels.

3.3.2.3. Handling of BUM traffic

   NVO3 data plane support for either ingress replication or
   point-to-multipoint tunnels is required to send traffic destined to
   multiple locations on a per-VNI basis (e.g. L2/L3 multicast traffic,
   L2 broadcast and unknown unicast traffic). It is possible for both
   methods to be used simultaneously.

   There is a bandwidth vs. state trade-off between the two approaches.
   User-definable knobs MUST be provided to select which method(s) to
   use based upon the amount of replication required (i.e. the number
   of hosts per group), the amount of multicast state to maintain, the
   duration of multicast flows and the scalability of multicast
   protocols.

   When ingress replication is used, NVEs MUST track, for each VNI, the
   related tunnel endpoints to which they need to replicate the frame.

   For point-to-multipoint tunnels, the bandwidth efficiency is
   increased at the cost of more state in the Core nodes. The ability
   to auto-discover or pre-provision the mapping between VNI multicast
   trees and related tunnel endpoints at the NVE and/or throughout the
   core SHOULD be supported.

3.4. External NVO3 connectivity

   NVO3 services MUST interoperate with current VPN and Internet
   services. This may happen inside one DC during a migration phase or
   as NVO3 services are delivered to the outside world via Internet or
   VPN gateways.

   Moreover, the compute and storage services delivered by an NVO3
   domain may span multiple DCs, requiring Inter-DC connectivity.
   From a DC perspective, a set of gateway devices is required in all
   of these cases, albeit with different functionalities influenced by
   the overlay type across the WAN, the service type and the DC network
   technologies used at each DC site.

   A GW handling the connectivity between NVO3 and external domains
   represents a single point of failure that may affect multiple tenant
   services. Redundancy between NVO3 and external domains MUST be
   supported.

3.4.1. GW Types

3.4.1.1. VPN and Internet GWs

   Tenant sites may already be interconnected using one of the existing
   VPN services and technologies (VPLS or IP VPN). If a new NVO3
   encapsulation is used, a VPN GW is required to forward traffic
   between NVO3 and VPN domains. Translation of encapsulations MAY be
   required. Internet-connected Tenants require translation from NVO3
   encapsulation to IP in the NVO3 gateway. The translation function
   SHOULD minimize provisioning touches.

3.4.1.2. Inter-DC GW

   Inter-DC connectivity MAY be required to provide support for
   features like disaster prevention or compute load re-distribution.
   This MAY be provided via a set of gateways interconnected through a
   WAN. This type of connectivity MAY be provided either through
   extension of the NVO3 tunneling domain or via VPN GWs.

3.4.1.3. Intra-DC gateways

   Even within one DC there may be End Devices that do not support NVO3
   encapsulation, for example bare metal servers, hardware appliances
   and storage. A gateway device, e.g. a ToR, is required to translate
   the NVO3 encapsulation to Ethernet VLAN encapsulation.

3.4.2. Path optimality between NVEs and Gateways

   Within the NVO3 overlay, a default assumption is that NVO3 traffic
   will be equally load-balanced across the underlying network
   consisting of LAG and/or ECMP paths. This assumption is valid only
   as long as: a) all traffic is load-balanced equally among each of
   the component-links and paths; and, b) each of the
   component-links/paths is of identical capacity. During the course of
   normal operation of the underlying network, it is possible that one,
   or more, of the component-links/paths of a LAG may be taken
   out-of-service in order to be repaired, e.g. due to hardware failure
   of cabling, optics, etc. In such cases, the administrator should
   configure the underlying network such that an entire LAG bundle in
   the underlying network will be reported as operationally down if
   there is a failure of any single component-link member of the LAG
   bundle (e.g. an N = M configuration of the LAG bundle), and thus the
   administrator knows that traffic will be carried sufficiently by
   alternate, available (potentially ECMP) paths in the underlying
   network. This is likely an adequate assumption for Intra-DC traffic
   where presumably the cost of additional protection capacity along
   alternate paths is not prohibitive. Thus, there are likely no
   additional requirements on NVO3 solutions to accommodate this type
   of underlying network configuration and administration.

   There is a similar case with ECMP, used Intra-DC, where failure of a
   single component-path of an ECMP group would result in traffic
   shifting onto the surviving members of the ECMP group.
   Unfortunately, there are no automatic recovery methods in IP routing
   protocols to detect a simultaneous failure of more than one
   component-path in an ECMP group, operationally disable the entire
   ECMP group and allow traffic to shift onto alternative paths. This
   problem is attributable to the underlying network and, thus, out of
   scope of NVO3 solutions.

   On the other hand, for Inter-DC and DC to External Network cases
   that use a WAN, the costs of the underlying network and/or service
   (e.g. an IP VPN service) are higher; therefore, there is a
   requirement on administrators to both: a) ensure high availability
   (active-backup failover or active-active load-balancing); and, b)
   maintain substantial utilization of the WAN transport capacity at
   nearly all times, particularly in the case of active-active
   load-balancing. With respect to the dataplane requirements of NVO3
   solutions, in the case of active-backup fail-over, all of the
   ingress NVEs MUST dynamically adapt to the failure of an active NVE
   GW: when the backup NVE GW announces itself into the NVO3 overlay
   immediately following a failure of the previously active NVE GW, the
   ingress NVEs MUST update their forwarding tables accordingly (e.g.
   through dataplane learning and/or translation of a gratuitous ARP,
   IPv6 Router Advertisement, etc.). Note that active-backup fail-over
   could be used to accomplish a crude form of load-balancing by, for
   example, manually configuring each tenant to use a different NVE GW
   in a round-robin fashion. On the other hand, with respect to
   active-active load-balancing across physically separate NVE GWs
   (e.g. two separate chassis), an NVO3 solution SHOULD support
   forwarding tables that can simultaneously map a single egress NVE to
   more than one NVO3 tunnel. The granularity of such mappings, in both
   active-backup and active-active, MUST be unique to each tenant.

3.4.2.1. Triangular Routing Issues (Traffic Tromboning)

   An L2/ELAN over NVO3 service may span multiple racks distributed
   across different DC regions. Multiple ELANs belonging to one tenant
   may be interconnected or connected to the outside world through
   multiple Router/VRF gateways distributed throughout the DC regions.
   In this scenario, without aid from an NVO3 or other type of
   solution, traffic from an ingress NVE destined to External gateways
   will take a non-optimal path that will result in higher latency and
   costs (since it is using more expensive resources of a WAN). In the
   case of traffic from an IP/MPLS network destined toward the entrance
   to an NVO3 overlay, well-known IP routing techniques MAY be used to
   optimize traffic into the NVO3 overlay (at the expense of additional
   routes in the IP/MPLS network). In summary, these issues are well
   known as triangular routing.

   Procedures for gateway selection to avoid triangular routing issues
   SHOULD be provided. The details of such procedures are, most likely,
   part of the NVO3 Management and/or Control Plane requirements and,
   thus, out of scope of this document. However, a key requirement on
   the dataplane of any NVO3 solution to avoid triangular routing is
   stated above, in Section 3.4.2, with respect to active-active
   load-balancing. More specifically, an NVO3 solution SHOULD support
   forwarding tables that can simultaneously map a single egress NVE to
   more than one NVO3 tunnel. The expectation is that, through the
   Control and/or Management Planes, this mapping information MAY be
   dynamically manipulated to, for example, provide the closest
   geographic and/or topological exit point (egress NVE) for each
   ingress NVE.
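   The following Python fragment is a hedged, non-normative sketch of
   the kind of forwarding state discussed in this section and in
   Section 3.4.2: a per-tenant entry that maps a single egress NVE GW
   to more than one NVO3 tunnel, usable either as an active-backup pair
   or as an active-active group. The names (GwMapping, select_tunnel,
   withdraw) are illustrative assumptions only.

      import hashlib

      class GwMapping:
          """Per-tenant mapping of one egress NVE GW to several tunnels."""

          def __init__(self, tunnels, mode="active-active"):
              self.tunnels = list(tunnels)  # underlay tunnel endpoints
              self.mode = mode

          def withdraw(self, tunnel):
              # Called when a GW failure is learned, e.g. via the control
              # plane or dataplane learning of a gratuitous ARP.
              if tunnel in self.tunnels:
                  self.tunnels.remove(tunnel)

          def select_tunnel(self, flow_key):
              if self.mode == "active-backup":
                  return self.tunnels[0]  # first remaining (live) tunnel
              # Active-active: hash the flow so that one flow always
              # follows the same tunnel while flows are spread out.
              digest = hashlib.sha256(flow_key.encode()).digest()
              return self.tunnels[digest[0] % len(self.tunnels)]

      entry = GwMapping(["203.0.113.1", "203.0.113.2"])
      tunnel = entry.select_tunnel("tenantA|10.0.0.5->198.51.100.7")
      entry.withdraw("203.0.113.1")   # failover after a GW failure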
3.5. Path MTU

   The tunnel overlay header can cause the MTU of the path to the
   egress tunnel endpoint to be exceeded.

   IP fragmentation SHOULD be avoided for performance reasons.

   The interface MTU as seen by a Tenant System SHOULD be adjusted such
   that no fragmentation is needed. This can be achieved by
   configuration or be discovered dynamically.

   At least one of the following options MUST be supported:

   o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981] or
     Extended MTU Path Discovery techniques such as those defined in
     [RFC4821]

   o Segmentation and reassembly support from the overlay layer
     operations without relying on the Tenant Systems to know about
     the end-to-end MTU

   o The underlay network MAY be designed in such a way that the MTU
     can accommodate the extra tunnel overhead.

3.6. Hierarchical NVE

   It might be desirable to support the concept of hierarchical NVEs,
   such as spoke NVEs and hub NVEs, in order to address possible NVE
   performance limitations and service connectivity optimizations.

   For instance, spoke NVE functionality MAY be used when processing
   capabilities are limited. A hub NVE would provide additional data
   processing capabilities such as packet replication.

   NVEs can be connected in either an any-to-any or a hub-and-spoke
   topology on a per-VNI basis.

3.7. NVE Multi-Homing Requirements

   Multi-homing techniques SHOULD be used to increase the reliability
   of an NVO3 network. It is also important to ensure that physical
   diversity in an NVO3 network is taken into account to avoid single
   points of failure.

   Multi-homing can be enabled in various nodes, from tenant systems
   into ToRs, ToRs into core switches/routers, and core nodes into DC
   GWs.

   Tenant systems can either be L2 or L3 nodes. In the former case
   (L2), techniques such as LAG or STP MAY be used. In the latter case
   (L3), it is possible that no dynamic routing protocol is enabled.
   Tenant systems can be multi-homed into a remote NVE using several
   interfaces (physical NICs or vNICs) with an IP address per
   interface, either to the same NVO3 network or to different NVO3
   networks. When one of the links fails, the corresponding IP address
   is not reachable, but the other interfaces can still be used. When a
   tenant system is co-located with an NVE, IP routing can be relied
   upon to handle routing over diverse links to ToRs.

   External connectivity MAY be handled by two or more NVO3 gateways.
   Each gateway is connected to a different domain (e.g. ISP) and runs
   BGP multi-homing. They serve as access points to external networks
   such as VPNs or the Internet. When a connection to an upstream
   router is lost, the alternative connection is used and the failed
   route is withdrawn.

3.8. OAM

   An NVE MAY be able to originate/terminate OAM messages for
   connectivity verification, performance monitoring, statistics
   gathering and fault isolation. Depending on configuration, NVEs
   SHOULD be able to process or transparently tunnel OAM messages, as
   well as support alarm propagation capabilities.
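   As a purely illustrative sketch of the connectivity verification
   mentioned above, the Python fragment below shows an NVE-level "ping"
   that encodes the target VNI in an OAM echo request and declares the
   remote NVE (or the VNI on it) unreachable if no reply arrives. The
   message format, UDP port number and function name are assumptions of
   this example and do not correspond to any defined NVO3 OAM protocol.

      import socket
      import struct
      import time

      OAM_PORT = 49153  # assumed example port, not an assigned value

      def nve_ping(remote_nve_ip, vnid, timeout=1.0):
          """Send one OAM echo request for a VNI and report liveness."""
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          sock.settimeout(timeout)
          seq = int(time.time()) & 0xFFFF
          # Echo request: type=1, 32-bit VNID field, 16-bit sequence.
          request = struct.pack("!BIH", 1, vnid, seq)
          sock.sendto(request, (remote_nve_ip, OAM_PORT))
          try:
              reply, _ = sock.recvfrom(1500)
              rtype, rvnid, rseq = struct.unpack("!BIH", reply[:7])
              # Type 2 is assumed to be the echo reply.
              return rtype == 2 and rvnid == vnid and rseq == seq
          except socket.timeout:
              return False  # remote NVE or VNI considered unreachable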
   Given the critical requirement to load-balance NVO3 encapsulated
   packets over LAG and ECMP paths, it will be equally critical to
   ensure existing and/or new OAM tools allow NVE administrators to
   proactively and/or reactively monitor the health of various
   component-links that comprise both LAG and ECMP paths carrying NVO3
   encapsulated packets. For example, it will be important that such
   OAM tools allow NVE administrators to reveal the set of underlying
   network hops (topology) in order that the underlying network
   administrators can use this information to quickly perform fault
   isolation and restore the underlying network.

   The NVE MUST provide the ability to reveal the set of ECMP and/or
   LAG paths used by NVO3 encapsulated packets in the underlying
   network from an ingress NVE to an egress NVE. The NVE MUST provide a
   "ping"-like functionality that can be used to determine the health
   (liveness) of remote NVEs or their VNIs. The NVE SHOULD provide a
   "ping"-like functionality to more expeditiously aid in
   troubleshooting performance problems, e.g. blackholing or other
   types of congestion occurring in the underlying network, for NVO3
   encapsulated packets carried over LAG and/or ECMP paths.

3.9. Other considerations

3.9.1. Data Plane Optimizations

   Data plane forwarding and encapsulation choices SHOULD consider the
   limitations of possible NVE implementations, specifically of
   software-based implementations (e.g. servers running VSwitches).

   NVEs SHOULD provide efficient processing of traffic. For instance,
   packet alignment, the use of offsets to minimize header parsing, and
   padding techniques SHOULD be considered when designing NVO3
   encapsulation types.

   The NVO3 encapsulation/decapsulation processing in software-based
   NVEs SHOULD make use of hardware assist provided by NICs in order to
   speed up packet processing.

3.9.2. NVE location trade-offs

   In the case of DC traffic, traffic originated from a VM is native
   Ethernet traffic. This traffic can be switched by a local VM switch
   or ToR switch and then by a DC gateway. The NVE function can be
   embedded within any of these elements.

   The NVE function can be supported in various DC network elements
   such as a VM, VM switch, ToR switch or DC GW.

   The following criteria SHOULD be considered when deciding where the
   NVE processing boundary lies:

   o Processing and memory requirements

   o Datapath (e.g. lookups, filtering,
     encapsulation/decapsulation)

   o Control plane processing (e.g. routing, signaling, OAM)

   o FIB/RIB size

   o Multicast support

   o Routing protocols

   o Packet replication capability

   o Fragmentation support

   o QoS transparency

   o Resiliency

4. Security Considerations

   This requirements document does not in itself raise any specific
   security issues.

5. IANA Considerations

   IANA does not need to take any action for this draft.

6. References

6.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2. Informative References

   [NVOPS]   Narten, T. et al, "Problem Statement: Overlays for Network
             Virtualization", draft-narten-nvo3-overlay-problem-
             statement (work in progress).
et al, "Framework for DC Network 779 Virtualization", draft-lasserre-nvo3-framework (work in 780 progress) 782 [OVCPREQ] Kreeger, L. et al, "Network Virtualization Overlay Control 783 Protocol Requirements", draft-kreeger-nvo3-overlay-cp 784 (work in progress) 786 [FLOYD] Sally Floyd, Allyn Romanow, "Dynamics of TCP Traffic over 787 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995 789 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 790 Networks (VPNs)", RFC 4364, February 2006. 792 [RFC1191] Mogul, J. "Path MTU Discovery", RFC1191, November 1990 794 [RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC1981, 795 August 1996 797 [RFC4821] Mathis, M. et al, "Packetization Layer Path MTU 798 Discovery", RFC4821, March 2007 800 [RFC2983] Black, D. "Diffserv and tunnels", RFC2983, Cotober 2000 802 [RFC6040] Briscoe, B. "Tunnelling of Explicit Congestion 803 Notification", RFC6040, November 2010 805 [RFC6438] Carpenter, B. et al, "Using the IPv6 Flow Label for Equal 806 Cost Multipath Routing and Link Aggregation in Tunnels", 807 RFC6438, November 2011 809 [RFC6391] Bryant, S. et al, "Flow-Aware Transport of Pseudowires 810 over an MPLS Packet Switched Network", RFC6391, November 811 2011 813 7. Acknowledgments 815 In addition to the authors the following people have contributed to 816 this document: 818 Shane Amante, Level3 820 Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent 822 Larry Kreeger, Cisco 824 This document was prepared using 2-Word-v2.0.template.dot. 826 Authors' Addresses 828 Nabil Bitar 829 Verizon 830 40 Sylvan Road 831 Waltham, MA 02145 832 Email: nabil.bitar@verizon.com 834 Marc Lasserre 835 Alcatel-Lucent 836 Email: marc.lasserre@alcatel-lucent.com 838 Florin Balus 839 Alcatel-Lucent 840 777 E. Middlefield Road 841 Mountain View, CA, USA 94043 842 Email: florin.balus@alcatel-lucent.com 844 Thomas Morin 845 France Telecom Orange 846 Email: thomas.morin@orange.com 848 Lizhong Jin 849 Email : lizho.jin@gmail.com 851 Bhumip Khasnabish 852 ZTE 853 Email : Bhumip.khasnabish@zteusa.com