Internet Engineering Task Force                              Nabil Bitar
Internet Draft                                                   Verizon
Intended status: Informational
Expires: January 2014                                      Marc Lasserre
                                                            Florin Balus
                                                          Alcatel-Lucent

                                                            Thomas Morin
                                                   France Telecom Orange

                                                             Lizhong Jin

                                                       Bhumip Khasnabish
                                                                     ZTE

                                                            July 1, 2013

                      NVO3 Data Plane Requirements
              draft-ietf-nvo3-dataplane-requirements-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 1, 2014.
39 Copyright Notice 41 Copyright (c) 2013 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with 49 respect to this document. Code Components extracted from this 50 document must include Simplified BSD License text as described in 51 Section 4.e of the Trust Legal Provisions and are provided without 52 warranty as described in the Simplified BSD License. 54 Abstract 56 Several IETF drafts relate to the use of overlay networks to support 57 large scale virtual data centers. This draft provides a list of data 58 plane requirements for Network Virtualization over L3 (NVO3) that 59 have to be addressed in solutions documents. 61 Table of Contents 63 1. Introduction................................................3 64 1.1. Conventions used in this document.......................3 65 1.2. General terminology.....................................3 66 2. Data Path Overview..........................................4 67 3. Data Plane Requirements......................................5 68 3.1. Virtual Access Points (VAPs)............................5 69 3.2. Virtual Network Instance (VNI)..........................5 70 3.2.1. L2 VNI...............................................5 71 3.2.2. L3 VNI...............................................6 72 3.3. Overlay Module.........................................7 73 3.3.1. NVO3 overlay header...................................8 74 3.3.1.1. Virtual Network Context Identification..............8 75 3.3.1.2. Service QoS identifier..............................8 76 3.3.2. Tunneling function....................................9 77 3.3.2.1. LAG and ECMP.......................................10 78 3.3.2.2. DiffServ and ECN marking...........................10 79 3.3.2.3. Handling of BUM traffic............................11 80 3.4. External NVO3 connectivity.............................11 81 3.4.1. GW Types............................................12 82 3.4.1.1. VPN and Internet GWs...............................12 83 3.4.1.2. Inter-DC GW........................................12 84 3.4.1.3. Intra-DC gateways..................................12 85 3.4.2. Path optimality between NVEs and Gateways............12 86 3.4.2.1. Triangular Routing Issues (Traffic Tromboning)......13 87 3.5. Path MTU..............................................14 88 3.6. Hierarchical NVE.......................................15 89 3.7. NVE Multi-Homing Requirements..........................15 90 3.8. OAM...................................................16 91 3.9. Other considerations...................................16 92 3.9.1. Data Plane Optimizations.............................16 93 3.9.2. NVE location trade-offs..............................17 94 4. Security Considerations.....................................17 95 5. IANA Considerations........................................17 96 6. References.................................................18 97 6.1. Normative References...................................18 98 6.2. Informative References.................................18 99 7. Acknowledgments............................................19 101 1. Introduction 103 1.1. 
Conventions used in this document 105 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 106 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 107 document are to be interpreted as described in RFC-2119 [RFC2119]. 109 In this document, these words will appear with that interpretation 110 only when in ALL CAPS. Lower case uses of these words are not to be 111 interpreted as carrying RFC-2119 significance. 113 1.2. General terminology 115 The terminology defined in [NVO3-framework] is used throughout this 116 document. Terminology specific to this memo is defined here and is 117 introduced as needed in later sections. 119 BUM: Broadcast, Unknown Unicast, Multicast traffic 121 TS: Tenant System 123 2. Data Path Overview 125 The NVO3 framework [NVO3-framework] defines the generic NVE model 126 depicted in Figure 1: 128 +------- L3 Network ------+ 129 | | 130 | Tunnel Overlay | 131 +------------+---------+ +---------+------------+ 132 | +----------+-------+ | | +---------+--------+ | 133 | | Overlay Module | | | | Overlay Module | | 134 | +---------+--------+ | | +---------+--------+ | 135 | |VN context| | VN context| | 136 | | | | | | 137 | +-------+--------+ | | +--------+-------+ | 138 | | |VNI| ... |VNI| | | | |VNI| ... |VNI| | 139 NVE1 | +-+------------+-+ | | +-+-----------+--+ | NVE2 140 | | VAPs | | | | VAPs | | 141 +----+------------+----+ +----+------------+----+ 142 | | | | 143 -------+------------+-----------------+------------+------- 144 | | Tenant | | 145 | | Service IF | | 146 Tenant Systems Tenant Systems 148 Figure 1 : Generic reference model for NV Edge 150 When a frame is received by an ingress NVE from a Tenant System over 151 a local VAP, it needs to be parsed in order to identify which 152 virtual network instance it belongs to. The parsing function can 153 examine various fields in the data frame (e.g., VLANID) and/or 154 associated interface/port the frame came from. 156 Once a corresponding VNI is identified, a lookup is performed to 157 determine where the frame needs to be sent. This lookup can be based 158 on any combinations of various fields in the data frame (e.g., 159 destination MAC addresses and/or destination IP addresses). Note 160 that additional criteria such as 802.1p and/or DSCP markings might 161 be used to select an appropriate tunnel or local VAP destination. 163 Lookup tables can be populated using different techniques: data 164 plane learning, management plane configuration, or a distributed 165 control plane. Management and control planes are not in the scope of 166 this document. The data plane based solution is described in this 167 document as it has implications on the data plane processing 168 function. 170 The result of this lookup yields the corresponding information 171 needed to build the overlay header, as described in section 3.3. 172 This information includes the destination L3 address of the egress 173 NVE. Note that this lookup might yield a list of tunnels such as 174 when ingress replication is used for BUM traffic. 176 The overlay header MUST include a context identifier which the 177 egress NVE will use to identify which VNI this frame belongs to. 179 The egress NVE checks the context identifier and removes the 180 encapsulation header and then forwards the original frame towards 181 the appropriate recipient, usually a local VAP. 183 3. Data Plane Requirements 185 3.1. 
Virtual Access Points (VAPs)

   The NVE forwarding plane MUST support VAP identification through the
   following mechanisms:

   - Using the local interface on which the frames are received, where
     the local interface may be an internal, virtual port in a VSwitch
     or a physical port on the ToR

   - Using the local interface and some fields in the frame header,
     e.g. one or multiple VLANs or the source MAC address

3.2. Virtual Network Instance (VNI)

   VAPs are associated with a specific VNI at service instantiation
   time.

   A VNI identifies a per-tenant private context, i.e. per-tenant
   policies and a FIB table to allow overlapping address space between
   tenants.

   There are different VNI types, differentiated by the virtual network
   service they provide to Tenant Systems. Network virtualization can
   be provided by L2 and/or L3 VNIs.

3.2.1. L2 VNI

   An L2 VNI MUST provide an emulated Ethernet multipoint service, as
   if Tenant Systems were interconnected by a bridge (but instead using
   a set of NVO3 tunnels). The emulated bridge MAY be 802.1Q enabled
   (allowing the use of VLAN tags as a VAP). An L2 VNI provides a
   per-tenant virtual switching instance with MAC addressing isolation
   and L3 tunneling. Loop avoidance capability MUST be provided.

   Forwarding table entries provide mapping information between tenant
   system MAC addresses and VAPs on directly connected VNIs and L3
   tunnel destination addresses over the overlay. Such entries MAY be
   populated by a control or management plane, or via data plane
   learning.

   In the absence of a management or control plane, data plane learning
   MUST be used to populate forwarding tables. As frames arrive from
   VAPs or from overlay tunnels, standard MAC learning procedures are
   used: the tenant system source MAC address is learned against the
   VAP or the NVO3 tunneling encapsulation source address on which the
   frame arrived. This implies that unknown unicast traffic is flooded,
   i.e. broadcast.

   When flooding is required, either to deliver unknown unicast,
   broadcast or multicast traffic, the NVE MUST support either ingress
   replication or multicast. In the latter case, the NVE MUST have one
   or more multicast trees that can be used by local VNIs for flooding
   to NVEs belonging to the same VN. For each VNI, there is one
   flooding tree; the corresponding multicast tree may be dedicated to
   that VNI or shared across VNIs, in which case multiple VNIs MAY
   share the same default flooding tree. The flooding tree is
   equivalent to a multicast (*,G) construct where all the NVEs for
   which the corresponding VNI is instantiated are members. The
   multicast tree MAY be established automatically via routing and
   signaling or pre-provisioned.

   When tenant multicast is supported, it SHOULD also be possible to
   select whether the NVE provides optimized multicast trees inside the
   VNI for individual tenant multicast groups or whether the default
   VNI flooding tree is used. If the former option is selected, the VNI
   SHOULD be able to snoop IGMP/MLD messages in order to efficiently
   join/prune Tenant Systems from multicast trees.

3.2.2. L3 VNI

   L3 VNIs MUST provide virtualized IP routing and forwarding. L3 VNIs
   MUST support a per-tenant forwarding instance with IP addressing
   isolation and L3 tunneling for interconnecting instances of the same
   VNI on NVEs.
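   As a non-normative illustration of the per-tenant forwarding
   instance described above, the following Python sketch shows how an
   NVE could keep one route table per L3 VNI so that tenants may use
   overlapping IP address space. The class and function names (L3VNI,
   add_route, lookup) and the returned (egress NVE, VNID) tuple are
   assumptions made for this example only and are not mandated by any
   NVO3 solution.

      import ipaddress

      class L3VNI:
          """One virtualized IP forwarding instance per tenant VNI."""

          def __init__(self, vnid):
              self.vnid = vnid
              self.routes = {}  # ip_network -> egress NVE underlay address

          def add_route(self, prefix, egress_nve):
              self.routes[ipaddress.ip_network(prefix)] = egress_nve

          def lookup(self, dest_ip):
              # Longest-prefix match scoped to this tenant's context.
              dest = ipaddress.ip_address(dest_ip)
              matches = [p for p in self.routes if dest in p]
              if not matches:
                  return None
              best = max(matches, key=lambda p: p.prefixlen)
              return self.routes[best], self.vnid

      # Two tenants can use the same 10.0.0.0/24 without conflict
      # because each lookup is scoped to its own VNI.
      tenant_a = L3VNI(vnid=1001)
      tenant_b = L3VNI(vnid=1002)
      tenant_a.add_route("10.0.0.0/24", "192.0.2.1")   # egress NVE1
      tenant_b.add_route("10.0.0.0/24", "192.0.2.2")   # egress NVE2
      assert tenant_a.lookup("10.0.0.5") == ("192.0.2.1", 1001)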
   In the case of an L3 VNI, the inner TTL field MUST be decremented by
   (at least) 1, as if the NVO3 egress NVE were one (or more) hop(s)
   away. The TTL field in the outer IP header MUST be set to a value
   appropriate for delivery of the encapsulated frame to the tunnel
   exit point. Thus, the default behavior MUST be the TTL pipe model,
   where the overlay network looks like one hop to the sending NVE.
   Configuration of a "uniform" TTL model, where the outer tunnel TTL
   is set equal to the inner TTL on the ingress NVE and the inner TTL
   is set to the outer TTL value on egress, MAY be supported.

   L2 and L3 VNIs can be deployed in isolation or in combination to
   optimize traffic flows per tenant across the overlay network. For
   example, an L2 VNI may be configured across a number of NVEs to
   offer L2 multi-point service connectivity while an L3 VNI can be
   co-located to offer local routing capabilities and gateway
   functionality. In addition, integrated routing and bridging per
   tenant MAY be supported on an NVE. An instantiation of such a
   service may be realized by interconnecting an L2 VNI as access to an
   L3 VNI on the NVE.

   The L3 VNI does not require support for Broadcast and Unknown
   Unicast traffic. The L3 VNI MAY provide support for customer
   multicast groups. When multicast is supported, it SHOULD be possible
   to select whether the NVE provides optimized multicast trees inside
   the VNI for individual tenant multicast groups or whether a default
   VNI multicast tree, where all the NVEs of the corresponding VNI are
   members, is used.

3.3. Overlay Module

   The overlay module performs a number of functions related to NVO3
   header and tunnel processing.

   The following figure shows a generic NVO3 encapsulated frame:

              +--------------------------+
              |       Tenant Frame       |
              +--------------------------+
              |   NVO3 Overlay Header    |
              +--------------------------+
              |  Outer Underlay header   |
              +--------------------------+
              | Outer Link layer header  |
              +--------------------------+

              Figure 2 : NVO3 encapsulated frame

   where

   . Tenant frame: Ethernet or IP, based upon the VNI type

   . NVO3 overlay header: Header containing VNI context information
     and other optional fields that can be used for processing this
     packet.

   . Outer underlay header: Can be either IP or MPLS

   . Outer link layer header: Header specific to the physical
     transmission link used

3.3.1. NVO3 overlay header

   An NVO3 overlay header MUST be included after the underlay tunnel
   header when forwarding tenant traffic. Note that this information
   can be carried within existing protocol headers (when overloading of
   specific fields is possible) or within a separate header.

3.3.1.1. Virtual Network Context Identification

   The overlay encapsulation header MUST contain a field which allows
   the encapsulated frame to be delivered to the appropriate virtual
   network endpoint by the egress NVE. The egress NVE uses this field
   to determine the appropriate virtual network context in which to
   process the packet. This field MAY be an explicit, unique (to the
   administrative domain) virtual network identifier (VNID) or MAY
   express the necessary context information in other ways (e.g. a
   locally significant identifier).

   It SHOULD be aligned on a 32-bit boundary so as to make it
   efficiently processable by the data path.
   It MUST be distributable by a control plane or configured via a
   management plane.

   In the case of a global identifier, this field MUST be large enough
   to scale to hundreds of thousands of virtual networks. Note that
   there is no such constraint when using a local identifier.

3.3.1.2. Service QoS identifier

   Traffic flows originating from different applications could rely on
   differentiated forwarding treatment to meet end-to-end availability
   and performance objectives. Such applications may span across one or
   more overlay networks. To enable such treatment, support for
   multiple Classes of Service across or between overlay networks MAY
   be required.

   To effectively enforce CoS across or between overlay networks, NVEs
   MAY be able to map CoS markings between networking layers, e.g.,
   Tenant Systems, Overlays, and/or Underlay, enabling each networking
   layer to independently enforce its own CoS policies. For example:

   - TS (e.g. VM) CoS

     o Tenant CoS policies MAY be defined by Tenant administrators

     o QoS fields (e.g. IP DSCP and/or Ethernet 802.1p) in the tenant
       frame are used to indicate application level CoS requirements

   - NVE CoS

     o The NVE MAY classify packets based on Tenant CoS markings or
       other mechanisms (e.g. DPI) to identify the proper service CoS
       to be applied across the overlay network

     o NVE service CoS levels are normalized to a common set (for
       example 8 levels) across multiple tenants; the NVE uses
       per-tenant policies to map Tenant CoS to the normalized service
       CoS fields in the NVO3 header

   - Underlay CoS

     o The underlay/core network MAY use a different CoS set (for
       example 4 levels) than the NVE CoS, as the core devices MAY have
       different QoS capabilities compared with NVEs.

     o The Underlay CoS MAY also change as the NVO3 tunnels pass
       between different domains.

   Support for NVE Service CoS MAY be provided through a QoS field
   inside the NVO3 overlay header. Examples of service CoS carried as
   part of a service tag are the 802.1p and DE bits in VLAN and PBB
   I-SID tags, and the MPLS TC bits in VPN labels.

3.3.2. Tunneling function

   This section describes the underlay tunneling requirements. From an
   encapsulation perspective, IPv4 or IPv6 MUST be supported, both IPv4
   and IPv6 SHOULD be supported, and MPLS tunneling MAY be supported.

3.3.2.1. LAG and ECMP

   For performance reasons, multipath over LAG and ECMP paths SHOULD be
   supported.

   LAG (Link Aggregation Group) [IEEE 802.1AX-2008] and ECMP (Equal
   Cost Multi Path) are commonly used techniques to perform
   load-balancing of microflows over a set of parallel links, either at
   Layer-2 (LAG) or Layer-3 (ECMP). Existing deployed hardware
   implementations of LAG and ECMP use a hash of various fields in the
   encapsulation (outermost) header(s) (e.g. source and destination MAC
   addresses for non-IP traffic, source and destination IP addresses,
   L4 protocol, L4 source and destination port numbers, etc.).
   Furthermore, hardware deployed for the underlay network(s) will most
   often be unaware of the carried, innermost L2 frames or L3 packets
   transmitted by the TS. Thus, in order to perform fine-grained
   load-balancing over LAG and ECMP paths in the underlying network,
   the encapsulation MUST result in sufficient entropy to exercise all
   paths through several LAG/ECMP hops. The entropy information MAY be
   inferred from the NVO3 overlay header or underlay header. If the
   overlay protocol does not support the necessary entropy information,
   or the switches/routers in the underlay do not support parsing of
   the additional entropy information in the overlay header, underlay
   switches and routers should be programmable, i.e. able to select the
   appropriate fields in the underlay header for hash calculation based
   on the type of overlay header.
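   As a hedged, non-normative sketch of how an encapsulation can expose
   such entropy, the Python fragment below derives an outer-header
   value (here assumed to be carried as a UDP source port) from a hash
   of the inner flow. The port range and the function name
   entropy_source_port are assumptions of this example, not part of a
   defined NVO3 encapsulation.

      import hashlib
      import struct

      def entropy_source_port(src_ip, dst_ip, proto, sport, dport):
          """Map an inner flow to a stable outer value so that underlay
          LAG/ECMP hashing spreads different tenant flows across paths
          while keeping each flow on a single path (no re-ordering)."""
          key = "%s|%s|%d|%d|%d" % (src_ip, dst_ip, proto, sport, dport)
          digest = hashlib.sha256(key.encode()).digest()
          hash16 = struct.unpack("!H", digest[:2])[0]
          # Stay within the dynamic port range so the value cannot be
          # mistaken for a well-known service port.
          return 49152 + (hash16 % 16384)

      # The same inner flow always yields the same outer value.
      assert entropy_source_port("10.1.1.1", "10.2.2.2", 6, 1234, 80) == \
             entropy_source_port("10.1.1.1", "10.2.2.2", 6, 1234, 80)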
   All packets that belong to a specific flow MUST follow the same path
   in order to prevent packet re-ordering. This is typically achieved
   by ensuring that the fields used for hashing are identical for a
   given flow.

   All paths available to the overlay network SHOULD be used
   efficiently. Different flows SHOULD be distributed as evenly as
   possible across multiple underlay network paths. For instance, this
   can be achieved by ensuring that some fields used for hashing are
   randomly generated.

3.3.2.2. DiffServ and ECN marking

   When traffic is encapsulated in a tunnel header, there are numerous
   options as to how the Diffserv Code-Point (DSCP) and Explicit
   Congestion Notification (ECN) markings are set in the outer header
   and propagated to the inner header on decapsulation.

   [RFC2983] defines two modes for mapping the DSCP markings from inner
   to outer headers and vice versa. The Uniform model copies the inner
   DSCP marking to the outer header on tunnel ingress, and copies that
   outer header value back to the inner header at tunnel egress. The
   Pipe model sets the DSCP value to some value based on local policy
   at ingress and does not modify the inner header on egress. Both
   models SHOULD be supported.

   ECN marking MUST be performed according to [RFC6040], which
   describes the correct ECN behavior for IP tunnels.

3.3.2.3. Handling of BUM traffic

   NVO3 data plane support for either ingress replication or
   point-to-multipoint tunnels is required to send traffic destined to
   multiple locations on a per-VNI basis (e.g. L2/L3 multicast traffic,
   L2 broadcast and unknown unicast traffic). It is possible for both
   methods to be used simultaneously.

   There is a bandwidth vs. state trade-off between the two approaches.
   User-definable knobs MUST be provided to select which method(s) to
   use based upon the amount of replication required (i.e. the number
   of hosts per group), the amount of multicast state to maintain, the
   duration of multicast flows and the scalability of multicast
   protocols.

   When ingress replication is used, NVEs MUST track, for each VNI, the
   related tunnel endpoints to which they need to replicate the frame.

   For point-to-multipoint tunnels, the bandwidth efficiency is
   increased at the cost of more state in the Core nodes. The ability
   to auto-discover or pre-provision the mapping between VNI multicast
   trees and related tunnel endpoints at the NVE and/or throughout the
   core SHOULD be supported.

3.4. External NVO3 connectivity

   NVO3 services MUST interoperate with current VPN and Internet
   services. This may happen inside one DC during a migration phase or
   as NVO3 services are delivered to the outside world via Internet or
   VPN gateways.

   Moreover, the compute and storage services delivered by an NVO3
   domain may span multiple DCs, requiring Inter-DC connectivity.
   From a DC perspective, a set of gateway devices is required in all
   of these cases, albeit with different functionalities influenced by
   the overlay type across the WAN, the service type and the DC network
   technologies used at each DC site.

   A GW handling the connectivity between NVO3 and external domains
   represents a single point of failure that may affect multiple tenant
   services. Redundancy between NVO3 and external domains MUST be
   supported.

3.4.1. GW Types

3.4.1.1. VPN and Internet GWs

   Tenant sites may already be interconnected using one of the existing
   VPN services and technologies (VPLS or IP VPN). If a new NVO3
   encapsulation is used, a VPN GW is required to forward traffic
   between NVO3 and VPN domains. Translation of encapsulations MAY be
   required. Internet-connected Tenants require translation from NVO3
   encapsulation to IP in the NVO3 gateway. The translation function
   SHOULD minimize provisioning touches.

3.4.1.2. Inter-DC GW

   Inter-DC connectivity MAY be required to provide support for
   features like disaster prevention or compute load re-distribution.
   This MAY be provided via a set of gateways interconnected through a
   WAN. This type of connectivity MAY be provided either through
   extension of the NVO3 tunneling domain or via VPN GWs.

3.4.1.3. Intra-DC gateways

   Even within one DC there may be End Devices that do not support NVO3
   encapsulation, for example bare metal servers, hardware appliances
   and storage. A gateway device, e.g. a ToR, is required to translate
   the NVO3 encapsulation to Ethernet VLAN encapsulation.

3.4.2. Path optimality between NVEs and Gateways

   Within the NVO3 overlay, a default assumption is that NVO3 traffic
   will be equally load-balanced across the underlying network
   consisting of LAG and/or ECMP paths. This assumption is valid only
   as long as: a) all traffic is load-balanced equally among each of
   the component-links and paths; and, b) each of the
   component-links/paths is of identical capacity. During the course of
   normal operation of the underlying network, it is possible that one,
   or more, of the component-links/paths of a LAG may be taken
   out-of-service in order to be repaired, e.g. due to hardware failure
   of cabling, optics, etc. In such cases, the administrator should
   configure the underlying network such that an entire LAG bundle in
   the underlying network will be reported as operationally down if
   there is a failure of any single component-link member of the LAG
   bundle (e.g. an N = M configuration of the LAG bundle), and thus the
   administrator knows that traffic will be carried sufficiently by
   alternate, available (potentially ECMP) paths in the underlying
   network. This is likely an adequate assumption for Intra-DC traffic
   where presumably the cost of additional protection capacity along
   alternate paths is not prohibitive. Thus, there are likely no
   additional requirements on NVO3 solutions to accommodate this type
   of underlying network configuration and administration.

   There is a similar case with ECMP, used Intra-DC, where failure of a
   single component-path of an ECMP group would result in traffic
   shifting onto the surviving members of the ECMP group.
   Unfortunately, there are no automatic recovery methods in IP routing
   protocols to detect a simultaneous failure of more than one
   component-path in an ECMP group, operationally disable the entire
   ECMP group and allow traffic to shift onto alternative paths. This
   problem is attributable to the underlying network and, thus, out of
   scope of NVO3 solutions.

   On the other hand, for Inter-DC and DC to External Network cases
   that use a WAN, the costs of the underlying network and/or service
   (e.g. an IP VPN service) are higher; therefore, there is a
   requirement on administrators to both: a) ensure high availability
   (active-backup failover or active-active load-balancing); and, b)
   maintain substantial utilization of the WAN transport capacity at
   nearly all times, particularly in the case of active-active
   load-balancing. With respect to the dataplane requirements of NVO3
   solutions, in the case of active-backup fail-over, all of the
   ingress NVEs MUST dynamically adapt to the failure of an active NVE
   GW: when the backup NVE GW announces itself into the NVO3 overlay
   immediately following a failure of the previously active NVE GW, the
   ingress NVEs MUST update their forwarding tables accordingly (e.g.
   through dataplane learning and/or translation of a gratuitous ARP,
   IPv6 Router Advertisement, etc.). Note that active-backup fail-over
   could be used to accomplish a crude form of load-balancing by, for
   example, manually configuring each tenant to use a different NVE GW
   in a round-robin fashion. On the other hand, with respect to
   active-active load-balancing across physically separate NVE GWs
   (e.g. two separate chassis), an NVO3 solution SHOULD support
   forwarding tables that can simultaneously map a single egress NVE to
   more than one NVO3 tunnel. The granularity of such mappings, in both
   active-backup and active-active, MUST be unique to each tenant.

3.4.2.1. Triangular Routing Issues (Traffic Tromboning)

   An L2/ELAN over NVO3 service may span multiple racks distributed
   across different DC regions. Multiple ELANs belonging to one tenant
   may be interconnected or connected to the outside world through
   multiple Router/VRF gateways distributed throughout the DC regions.
   In this scenario, without aid from an NVO3 or other type of
   solution, traffic from an ingress NVE destined to External gateways
   will take a non-optimal path that will result in higher latency and
   costs (since it is using more expensive resources of a WAN). In the
   case of traffic from an IP/MPLS network destined toward the entrance
   to an NVO3 overlay, well-known IP routing techniques MAY be used to
   optimize traffic into the NVO3 overlay (at the expense of additional
   routes in the IP/MPLS network). In summary, these issues are well
   known as triangular routing.

   Procedures for gateway selection to avoid triangular routing issues
   SHOULD be provided. The details of such procedures are, most likely,
   part of the NVO3 Management and/or Control Plane requirements and,
   thus, out of scope of this document. However, a key requirement on
   the dataplane of any NVO3 solution to avoid triangular routing is
   stated above, in Section 3.4.2, with respect to active-active
   load-balancing. More specifically, an NVO3 solution SHOULD support
   forwarding tables that can simultaneously map a single egress NVE to
   more than one NVO3 tunnel. The expectation is that, through the
   Control and/or Management Planes, this mapping information MAY be
   dynamically manipulated to, for example, provide the closest
   geographic and/or topological exit point (egress NVE) for each
   ingress NVE.
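   The following Python fragment is a hedged, non-normative sketch of
   the kind of forwarding state discussed in this section and in
   Section 3.4.2: a per-tenant entry that maps a single egress NVE GW
   to more than one NVO3 tunnel, usable either as an active-backup pair
   or as an active-active group. The names (GwMapping, select_tunnel,
   withdraw) are illustrative assumptions only.

      import hashlib

      class GwMapping:
          """Per-tenant mapping of one egress NVE GW to several tunnels."""

          def __init__(self, tunnels, mode="active-active"):
              self.tunnels = list(tunnels)  # underlay tunnel endpoints
              self.mode = mode

          def withdraw(self, tunnel):
              # Called when a GW failure is learned, e.g. via the control
              # plane or dataplane learning of a gratuitous ARP.
              if tunnel in self.tunnels:
                  self.tunnels.remove(tunnel)

          def select_tunnel(self, flow_key):
              if self.mode == "active-backup":
                  return self.tunnels[0]  # first remaining (live) tunnel
              # Active-active: hash the flow so that one flow always
              # follows the same tunnel while flows are spread out.
              digest = hashlib.sha256(flow_key.encode()).digest()
              return self.tunnels[digest[0] % len(self.tunnels)]

      entry = GwMapping(["203.0.113.1", "203.0.113.2"])
      tunnel = entry.select_tunnel("tenantA|10.0.0.5->198.51.100.7")
      entry.withdraw("203.0.113.1")   # failover after a GW failure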
3.5. Path MTU

   The tunnel overlay header can cause the MTU of the path to the
   egress tunnel endpoint to be exceeded.

   IP fragmentation SHOULD be avoided for performance reasons.

   The interface MTU as seen by a Tenant System SHOULD be adjusted such
   that no fragmentation is needed. This can be achieved by
   configuration or be discovered dynamically.

   At least one of the following options MUST be supported:

   o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981] or
     Extended MTU Path Discovery techniques such as those defined in
     [RFC4821]

   o Segmentation and reassembly support from the overlay layer
     operations without relying on the Tenant Systems to know about
     the end-to-end MTU

   o The underlay network MAY be designed in such a way that the MTU
     can accommodate the extra tunnel overhead.

3.6. Hierarchical NVE

   It might be desirable to support the concept of hierarchical NVEs,
   such as spoke NVEs and hub NVEs, in order to address possible NVE
   performance limitations and service connectivity optimizations.

   For instance, spoke NVE functionality MAY be used when processing
   capabilities are limited. A hub NVE would provide additional data
   processing capabilities such as packet replication.

   NVEs can be connected in either an any-to-any or a hub-and-spoke
   topology on a per-VNI basis.

3.7. NVE Multi-Homing Requirements

   Multi-homing techniques SHOULD be used to increase the reliability
   of an NVO3 network. It is also important to ensure that physical
   diversity in an NVO3 network is taken into account to avoid single
   points of failure.

   Multi-homing can be enabled in various nodes, from tenant systems
   into ToRs, ToRs into core switches/routers, and core nodes into DC
   GWs.

   Tenant systems can either be L2 or L3 nodes. In the former case
   (L2), techniques such as LAG or STP MAY be used. In the latter case
   (L3), it is possible that no dynamic routing protocol is enabled.
   Tenant systems can be multi-homed into a remote NVE using several
   interfaces (physical NICs or vNICs) with an IP address per
   interface, either to the same NVO3 network or to different NVO3
   networks. When one of the links fails, the corresponding IP address
   is not reachable, but the other interfaces can still be used. When a
   tenant system is co-located with an NVE, IP routing can be relied
   upon to handle routing over diverse links to ToRs.

   External connectivity MAY be handled by two or more NVO3 gateways.
   Each gateway is connected to a different domain (e.g. ISP) and runs
   BGP multi-homing. They serve as access points to external networks
   such as VPNs or the Internet. When a connection to an upstream
   router is lost, the alternative connection is used and the failed
   route is withdrawn.

3.8. OAM

   An NVE MAY be able to originate/terminate OAM messages for
   connectivity verification, performance monitoring, statistics
   gathering and fault isolation. Depending on configuration, NVEs
   SHOULD be able to process or transparently tunnel OAM messages, as
   well as support alarm propagation capabilities.
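   As a purely illustrative sketch of the connectivity verification
   mentioned above, the Python fragment below shows an NVE-level "ping"
   that encodes the target VNI in an OAM echo request and declares the
   remote NVE (or the VNI on it) unreachable if no reply arrives. The
   message format, UDP port number and function name are assumptions of
   this example and do not correspond to any defined NVO3 OAM protocol.

      import socket
      import struct
      import time

      OAM_PORT = 49153  # assumed example port, not an assigned value

      def nve_ping(remote_nve_ip, vnid, timeout=1.0):
          """Send one OAM echo request for a VNI and report liveness."""
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          sock.settimeout(timeout)
          seq = int(time.time()) & 0xFFFF
          # Echo request: type=1, 32-bit VNID field, 16-bit sequence.
          request = struct.pack("!BIH", 1, vnid, seq)
          sock.sendto(request, (remote_nve_ip, OAM_PORT))
          try:
              reply, _ = sock.recvfrom(1500)
              rtype, rvnid, rseq = struct.unpack("!BIH", reply[:7])
              # Type 2 is assumed to be the echo reply.
              return rtype == 2 and rvnid == vnid and rseq == seq
          except socket.timeout:
              return False  # remote NVE or VNI considered unreachable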
   Given the critical requirement to load-balance NVO3 encapsulated
   packets over LAG and ECMP paths, it will be equally critical to
   ensure existing and/or new OAM tools allow NVE administrators to
   proactively and/or reactively monitor the health of various
   component-links that comprise both LAG and ECMP paths carrying NVO3
   encapsulated packets. For example, it will be important that such
   OAM tools allow NVE administrators to reveal the set of underlying
   network hops (topology) in order that the underlying network
   administrators can use this information to quickly perform fault
   isolation and restore the underlying network.

   The NVE MUST provide the ability to reveal the set of ECMP and/or
   LAG paths used by NVO3 encapsulated packets in the underlying
   network from an ingress NVE to an egress NVE. The NVE MUST provide a
   "ping"-like functionality that can be used to determine the health
   (liveness) of remote NVEs or their VNIs. The NVE SHOULD provide a
   "ping"-like functionality to more expeditiously aid in
   troubleshooting performance problems, e.g. blackholing or other
   types of congestion occurring in the underlying network, for NVO3
   encapsulated packets carried over LAG and/or ECMP paths.

3.9. Other considerations

3.9.1. Data Plane Optimizations

   Data plane forwarding and encapsulation choices SHOULD consider the
   limitations of possible NVE implementations, specifically of
   software-based implementations (e.g. servers running VSwitches).

   NVEs SHOULD provide efficient processing of traffic. For instance,
   packet alignment, the use of offsets to minimize header parsing, and
   padding techniques SHOULD be considered when designing NVO3
   encapsulation types.

   The NVO3 encapsulation/decapsulation processing in software-based
   NVEs SHOULD make use of hardware assist provided by NICs in order to
   speed up packet processing.

3.9.2. NVE location trade-offs

   In the case of DC traffic, traffic originated from a VM is native
   Ethernet traffic. This traffic can be switched by a local VM switch
   or ToR switch and then by a DC gateway. The NVE function can be
   embedded within any of these elements.

   The NVE function can be supported in various DC network elements
   such as a VM, VM switch, ToR switch or DC GW.

   The following criteria SHOULD be considered when deciding where the
   NVE processing boundary lies:

   o Processing and memory requirements

   o Datapath (e.g. lookups, filtering,
     encapsulation/decapsulation)

   o Control plane processing (e.g. routing, signaling, OAM)

   o FIB/RIB size

   o Multicast support

   o Routing protocols

   o Packet replication capability

   o Fragmentation support

   o QoS transparency

   o Resiliency

4. Security Considerations

   This requirements document does not in itself raise any specific
   security issues.

5. IANA Considerations

   IANA does not need to take any action for this draft.

6. References

6.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2. Informative References

   [NVOPS]   Narten, T. et al, "Problem Statement: Overlays for Network
             Virtualization", draft-narten-nvo3-overlay-problem-
             statement (work in progress).
et al, "Framework for DC Network 779 Virtualization", draft-lasserre-nvo3-framework (work in 780 progress) 782 [OVCPREQ] Kreeger, L. et al, "Network Virtualization Overlay Control 783 Protocol Requirements", draft-kreeger-nvo3-overlay-cp 784 (work in progress) 786 [FLOYD] Sally Floyd, Allyn Romanow, "Dynamics of TCP Traffic over 787 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995 789 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 790 Networks (VPNs)", RFC 4364, February 2006. 792 [RFC1191] Mogul, J. "Path MTU Discovery", RFC1191, November 1990 794 [RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC1981, 795 August 1996 797 [RFC4821] Mathis, M. et al, "Packetization Layer Path MTU 798 Discovery", RFC4821, March 2007 800 [RFC2983] Black, D. "Diffserv and tunnels", RFC2983, Cotober 2000 802 [RFC6040] Briscoe, B. "Tunnelling of Explicit Congestion 803 Notification", RFC6040, November 2010 805 [RFC6438] Carpenter, B. et al, "Using the IPv6 Flow Label for Equal 806 Cost Multipath Routing and Link Aggregation in Tunnels", 807 RFC6438, November 2011 809 [RFC6391] Bryant, S. et al, "Flow-Aware Transport of Pseudowires 810 over an MPLS Packet Switched Network", RFC6391, November 811 2011 813 7. Acknowledgments 815 In addition to the authors the following people have contributed to 816 this document: 818 Shane Amante, Level3 820 Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent 822 Larry Kreeger, Cisco 824 This document was prepared using 2-Word-v2.0.template.dot. 826 Authors' Addresses 828 Nabil Bitar 829 Verizon 830 40 Sylvan Road 831 Waltham, MA 02145 832 Email: nabil.bitar@verizon.com 834 Marc Lasserre 835 Alcatel-Lucent 836 Email: marc.lasserre@alcatel-lucent.com 838 Florin Balus 839 Alcatel-Lucent 840 777 E. Middlefield Road 841 Mountain View, CA, USA 94043 842 Email: florin.balus@alcatel-lucent.com 844 Thomas Morin 845 France Telecom Orange 846 Email: thomas.morin@orange.com 848 Lizhong Jin 849 Email : lizho.jin@gmail.com 851 Bhumip Khasnabish 852 ZTE 853 Email : Bhumip.khasnabish@zteusa.com