Internet Engineering Task Force                             Nabil Bitar
Internet Draft                                                  Verizon
Intended status: Informational
Expires: May 2013                                         Marc Lasserre
                                                           Florin Balus
                                                         Alcatel-Lucent

                                                           Thomas Morin
                                                  France Telecom Orange

                                                            Lizhong Jin
                                                      Bhumip Khasnabish
                                                                    ZTE

                                                      November 28, 2012

                      NVO3 Data Plane Requirements
               draft-bl-nvo3-dataplane-requirements-03.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This Internet-Draft will expire on May 28, 2013.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document.

Abstract

Several IETF drafts relate to the use of overlay networks to support
large scale virtual data centers. This draft provides a list of data
plane requirements for Network Virtualization over L3 (NVO3) that
have to be addressed in solutions documents.

Table of Contents

1. Introduction.................................................3
   1.1. Conventions used in this document.......................3
   1.2. General terminology.....................................3
2. Data Path Overview...........................................4
3. Data Plane Requirements......................................5
   3.1. Virtual Access Points (VAPs)............................5
   3.2. Virtual Network Instance (VNI)..........................5
      3.2.1. L2 VNI............................................5
      3.2.2. L3 VNI............................................6
   3.3. Overlay Module..........................................7
      3.3.1. NVO3 overlay header...............................8
         3.3.1.1. Virtual Network Context Identification.......8
         3.3.1.2. Service QoS identifier.......................8
      3.3.2. Tunneling function................................9
         3.3.2.1. LAG and ECMP................................10
         3.3.2.2. DiffServ and ECN marking....................10
         3.3.2.3. Handling of BUM traffic.....................11
   3.4. External NVO3 connectivity.............................11
      3.4.1. GW Types.........................................12
         3.4.1.1. VPN and Internet GWs........................12
         3.4.1.2. Inter-DC GW.................................12
         3.4.1.3. Intra-DC gateways...........................12
      3.4.2. Path optimality between NVEs and Gateways........12
         3.4.2.1. Triangular Routing Issues, a.k.a. Traffic
                  Tromboning..................................13
   3.5. Path MTU...............................................14
   3.6. Hierarchical NVE.......................................15
   3.7. NVE Multi-Homing Requirements..........................15
   3.8. OAM....................................................16
   3.9. Other considerations...................................16
      3.9.1. Data Plane Optimizations.........................16
      3.9.2. NVE location trade-offs..........................17
4. Security Considerations.....................................17
5. IANA Considerations.........................................17
6. References..................................................18
   6.1. Normative References...................................18
   6.2. Informative References.................................18
7. Acknowledgments.............................................19

1. Introduction

1.1. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

1.2. General terminology

The terminology defined in [NVO3-framework] is used throughout this
document. Terminology specific to this memo is defined here and is
introduced as needed in later sections.

DC: Data Center

BUM: Broadcast, Unknown Unicast, Multicast traffic

TS: Tenant System

VAP: Virtual Access Point

VNI: Virtual Network Instance

VNID: VNI ID

2. Data Path Overview

The NVO3 framework [NVO3-framework] defines the generic NVE model
depicted in Figure 1:

                   +------- L3 Network ------+
                   |                         |
                   |      Tunnel Overlay     |
      +------------+---------+     +---------+------------+
      | +----------+-------+ |     | +---------+--------+ |
      | |  Overlay Module  | |     | |  Overlay Module  | |
      | +---------+--------+ |     | +---------+--------+ |
      |           |VN context|     | VN context|          |
      |           |          |     |           |          |
      |  +-------+--------+  |     |  +--------+-------+  |
      |  | |VNI| ... |VNI| | |     |  | |VNI| ... |VNI| | |
 NVE1 |  +-+------------+-+  |     |  +-+-----------+--+  | NVE2
      |    |    VAPs    |    |     |    |   VAPs    |     |
      +----+------------+----+     +----+-----------+-----+
           |            |               |           |
    -------+------------+---------------+-----------+-------
           |            |   Tenant      |           |
           |            |   Service IF  |           |
             Tenant Systems               Tenant Systems

        Figure 1 : Generic reference model for NV Edge

When a frame is received by an ingress NVE from a Tenant System over
a local VAP, it needs to be parsed in order to identify which
virtual network instance it belongs to. The parsing function can
examine various fields in the data frame (e.g., VLAN ID) and/or the
interface/port on which the frame was received.

Once a corresponding VNI is identified, a lookup is performed to
determine where the frame needs to be sent. This lookup can be based
on any combination of fields in the data frame (e.g., destination
MAC address and/or destination IP address). Note that additional
criteria such as 802.1p and/or DSCP markings might be used to select
an appropriate tunnel or local VAP destination.

Lookup tables can be populated using different techniques: data
plane learning, management plane configuration, or a distributed
control plane. Management and control planes are not in the scope of
this document. Data plane learning is described in this document
because it has implications for the data plane processing function.

The result of this lookup yields the corresponding information
needed to build the overlay header, as described in section 3.3.
This information includes the destination L3 address of the egress
NVE. Note that this lookup might yield a list of tunnels, such as
when ingress replication is used for BUM traffic.

The overlay header MUST include a context identifier which the
egress NVE will use to identify which VNI this frame belongs to.

The egress NVE checks the context identifier, removes the
encapsulation header, and then forwards the original frame towards
the appropriate recipient, usually a local VAP.

3. Data Plane Requirements

3.1. Virtual Access Points (VAPs)
The NVE forwarding plane MUST support VAP identification through the
following mechanisms:

- Using the local interface on which the frames are received, where
  the local interface may be an internal, virtual port in a vSwitch
  or a physical port on the ToR

- Using the local interface and some fields in the frame header,
  e.g. one or multiple VLANs or the source MAC address

3.2. Virtual Network Instance (VNI)

VAPs are associated with a specific VNI at service instantiation
time.

A VNI identifies a per-tenant private context, i.e. per-tenant
policies and a FIB table that allow overlapping address spaces
between tenants.

There are different VNI types, differentiated by the virtual network
service they provide to Tenant Systems. Network virtualization can
be provided by L2 and/or L3 VNIs.

3.2.1. L2 VNI

An L2 VNI MUST provide an emulated Ethernet multipoint service, as
if Tenant Systems were interconnected by a bridge (but using a set
of NVO3 tunnels instead). The emulated bridge MAY be 802.1Q enabled
(allowing the use of VLAN tags as a VAP). An L2 VNI provides a per-
tenant virtual switching instance with MAC addressing isolation and
L3 tunneling. Loop avoidance capability MUST be provided.

Forwarding table entries provide mapping information between MAC
addresses and L3 tunnel destination addresses. Such entries MAY be
populated by a control or management plane, or via data plane
learning.

In the absence of a management or control plane, data plane learning
MUST be used to populate forwarding tables. As frames arrive from
VAPs or from overlay tunnels, standard MAC learning procedures are
used: the source MAC address is learned against the VAP or the NVO3
tunnel on which the frame arrived. This implies that unknown unicast
traffic is flooded, i.e. broadcast.

When flooding is required, either to deliver unknown unicast,
broadcast or multicast traffic, the NVE MUST support either ingress
replication or multicast. In the latter case, the NVE MUST be able
to build at least a default flooding tree per VNI. In such cases,
multiple VNIs MAY share the same default flooding tree. The flooding
tree is equivalent to a multicast (*,G) construct where all the NVEs
on which the corresponding VNI is instantiated are members. The
multicast tree MAY be established automatically via routing and
signaling or be pre-provisioned.

When tenant multicast is supported, it SHOULD also be possible to
select whether the NVE provides optimized multicast trees inside the
VNI for individual tenant multicast groups or whether the default
VNI flooding tree is used. If the former option is selected, the VNI
SHOULD be able to snoop IGMP/MLD messages in order to efficiently
join/prune Tenant Systems to/from multicast trees.

3.2.2. L3 VNI

L3 VNIs MUST provide virtualized IP routing and forwarding. L3 VNIs
MUST support a per-tenant forwarding instance with IP addressing
isolation and L3 tunneling for interconnecting instances of the same
VNI on NVEs.

In the case of an L3 VNI, the inner TTL field MUST be decremented by
(at least) 1, as if the NVO3 egress NVE were one (or more) hop(s)
away. The TTL field in the outer IP header MUST be set to a value
appropriate for delivery of the encapsulated frame to the tunnel
exit point.
Thus, the default behavior MUST be the TTL pipe model,
where the overlay network looks like one hop to the sending NVE.
Configuration of a "uniform" TTL model, where the outer tunnel TTL
is set equal to the inner TTL on the ingress NVE and the inner TTL
is set to the outer TTL value on egress, MAY be supported.

L2 and L3 VNIs can be deployed in isolation or in combination to
optimize traffic flows per tenant across the overlay network. For
example, an L2 VNI may be configured across a number of NVEs to
offer L2 multi-point service connectivity while an L3 VNI can be co-
located to offer local routing capabilities and gateway
functionality. In addition, integrated routing and bridging per
tenant MAY be supported on an NVE. An instantiation of such a
service may be realized by interconnecting an L2 VNI as access to an
L3 VNI on the NVE.

The L3 VNI does not require support for Broadcast and Unknown
Unicast traffic. The L3 VNI MAY provide support for customer
multicast groups. When multicast is supported, it SHOULD be possible
to select whether the NVE provides optimized multicast trees inside
the VNI for individual tenant multicast groups or whether a default
VNI multicast tree, where all the NVEs of the corresponding VNI are
members, is used.

3.3. Overlay Module

The overlay module performs a number of functions related to NVO3
header and tunnel processing.

The following figure shows a generic NVO3 encapsulated frame:

      +--------------------------+
      |     Customer Payload     |
      +--------------------------+
      |    NVO3 Overlay Header   |
      +--------------------------+
      |   Outer Underlay header  |
      +--------------------------+
      |  Outer Link layer header |
      +--------------------------+

      Figure 2 : NVO3 encapsulated frame

where

. Customer payload: Ethernet or IP, based upon the VNI type

. NVO3 overlay header: header containing VNI context information
  and other optional fields that can be used for processing this
  packet

. Outer underlay header: can be either IP or MPLS

. Outer link layer header: header specific to the physical
  transmission link used

3.3.1. NVO3 overlay header

An NVO3 overlay header MUST be included after the underlay tunnel
header when forwarding tenant traffic. Note that this information
can be carried within existing protocol headers (when overloading of
specific fields is possible) or within a separate header.

3.3.1.1. Virtual Network Context Identification

The overlay encapsulation header MUST contain a field which allows
the encapsulated frame to be delivered to the appropriate virtual
network endpoint by the egress NVE. The egress NVE uses this field
to determine the appropriate virtual network context in which to
process the packet. This field MAY be an explicit, unique (to the
administrative domain) virtual network identifier (VNID) or MAY
express the necessary context information in other ways (e.g. a
locally significant identifier).

This field SHOULD be aligned on a 32-bit boundary so that it can be
processed efficiently by the data path. It MUST be distributable by
a control plane or configured via a management plane.

In the case of a global identifier, this field MUST be large enough
to scale to hundreds of thousands of virtual networks. Note that
there is no such constraint when using a local identifier.

3.3.1.2. Service QoS identifier
Traffic flows originating from different applications could rely on
differentiated forwarding treatment to meet end-to-end availability
and performance objectives. Such applications may span one or more
overlay networks. To enable such treatment, support for multiple
Classes of Service (CoS) across or between overlay networks MAY be
required.

To effectively enforce CoS across or between overlay networks, NVEs
MAY be able to map CoS markings between networking layers, e.g.,
Tenant Systems, overlays, and/or underlay, enabling each networking
layer to independently enforce its own CoS policies. For example:

- TS (e.g. VM) CoS

  o Tenant CoS policies MAY be defined by tenant administrators

  o QoS fields (e.g. IP DSCP and/or Ethernet 802.1p) in the tenant
    frame are used to indicate application level CoS requirements

- NVE CoS

  o The NVE MAY classify packets based on tenant CoS markings or
    other mechanisms (e.g. DPI) to identify the proper service CoS
    to be applied across the overlay network

  o NVE service CoS levels are normalized to a common set (for
    example 8 levels) across multiple tenants; the NVE uses per-
    tenant policies to map tenant CoS to the normalized service CoS
    fields in the NVO3 header

- Underlay CoS

  o The underlay/core network MAY use a different CoS set (for
    example 4 levels) than the NVE CoS, as the core devices MAY have
    different QoS capabilities compared with NVEs.

  o The underlay CoS MAY also change as the NVO3 tunnels pass
    between different domains.

Support for NVE service CoS MAY be provided through a QoS field
inside the NVO3 overlay header. Examples of service CoS carried as
part of a service tag are the 802.1p and DE bits in VLAN and PBB
I-SID tags, and the MPLS TC bits in VPN labels.

3.3.2. Tunneling function

This section describes the underlay tunneling requirements. From an
encapsulation perspective, IPv4 or IPv6 MUST be supported, both IPv4
and IPv6 SHOULD be supported, and MPLS tunneling MAY be supported.

3.3.2.1. LAG and ECMP

For performance reasons, multipath over LAG and ECMP paths SHOULD be
supported.

LAG (Link Aggregation Group) [IEEE 802.1AX-2008] and ECMP (Equal
Cost Multi Path) are commonly used techniques to perform load-
balancing of microflows over a set of parallel links, either at
Layer 2 (LAG) or Layer 3 (ECMP). Existing deployed hardware
implementations of LAG and ECMP use a hash of various fields in the
encapsulation (outermost) header(s) (e.g. source and destination MAC
addresses for non-IP traffic, or source and destination IP
addresses, L4 protocol, and L4 source and destination port numbers).
Furthermore, hardware deployed for the underlay network(s) will most
often be unaware of the carried, innermost L2 frames or L3 packets
transmitted by the TS. Thus, in order to perform fine-grained load-
balancing over LAG and ECMP paths in the underlying network, the
encapsulation MUST result in sufficient entropy to exercise all
paths through several LAG/ECMP hops. The entropy information MAY be
inferred from the NVO3 overlay header or the underlay header.

All packets that belong to a specific flow MUST follow the same path
in order to prevent packet re-ordering. This is typically achieved
by ensuring that the fields used for hashing are identical for a
given flow.
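As an informative illustration of the two preceding requirements
(sufficient entropy and per-flow path consistency), the short Python
sketch below derives an outer-header entropy value from the inner
flow. The use of the outer UDP source port and of the ephemeral port
range are assumptions made for this example only; they do not mandate
any particular NVO3 encapsulation.

   import hashlib

   def flow_entropy_port(src_ip, dst_ip, proto, src_port, dst_port):
       """Map an inner 5-tuple to an outer UDP source port.

       Deterministic per flow, so all packets of a flow hash onto the
       same underlay path, while distinct flows spread across the
       ephemeral port range and give LAG/ECMP hashes entropy.
       """
       key = "%s|%s|%d|%d|%d" % (src_ip, dst_ip, proto,
                                 src_port, dst_port)
       digest = hashlib.sha256(key.encode()).digest()
       value = int.from_bytes(digest[:2], "big")
       # Confine the result to the ephemeral range 49152-65535.
       return 49152 + (value % 16384)

   # Two packets of the same inner flow get the same outer port
   # (documentation addresses per RFC 5737 used as examples).
   assert flow_entropy_port("192.0.2.1", "192.0.2.2", 6, 32768, 80) == \
          flow_entropy_port("192.0.2.1", "192.0.2.2", 6, 32768, 80)

Because the mapping is deterministic, all packets of a given inner
flow carry the same entropy value and therefore follow the same
underlay path, while different flows are very likely to be spread
across the available LAG/ECMP paths.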
All paths available to the overlay network SHOULD be used
efficiently. Different flows SHOULD be distributed as evenly as
possible across multiple underlay network paths. For instance, this
can be achieved by ensuring that some fields used for hashing are
randomly generated.

3.3.2.2. DiffServ and ECN marking

When traffic is encapsulated in a tunnel header, there are numerous
options as to how the Diffserv Code-Point (DSCP) and Explicit
Congestion Notification (ECN) markings are set in the outer header
and propagated to the inner header on decapsulation.

[RFC2983] defines two modes for mapping the DSCP markings from inner
to outer headers and vice versa. The Uniform model copies the inner
DSCP marking to the outer header on tunnel ingress, and copies that
outer header value back to the inner header at tunnel egress. The
Pipe model sets the DSCP value to some value based on local policy
at ingress and does not modify the inner header on egress. Both
models SHOULD be supported.

ECN marking MUST be performed according to [RFC6040], which
describes the correct ECN behavior for IP tunnels.

3.3.2.3. Handling of BUM traffic

NVO3 data plane support for either ingress replication or point-to-
multipoint tunnels is required to send traffic destined to multiple
locations on a per-VNI basis (e.g. L2/L3 multicast traffic, L2
broadcast and unknown unicast traffic). It is possible that both
methods are used simultaneously.

There is a bandwidth vs. state trade-off between the two approaches.
User-definable knobs MUST be provided to select which method(s) gets
used, based upon the amount of replication required (i.e. the number
of hosts per group), the amount of multicast state to maintain, the
duration of multicast flows and the scalability of multicast
protocols.

When ingress replication is used, NVEs MUST track, for each VNI, the
related tunnel endpoints to which they need to replicate the frame.

For point-to-multipoint tunnels, the bandwidth efficiency is
increased at the cost of more state in the core nodes. The ability
to auto-discover or pre-provision the mapping between VNI multicast
trees and related tunnel endpoints at the NVE and/or throughout the
core SHOULD be supported.

3.4. External NVO3 connectivity

NVO3 services MUST interoperate with current VPN and Internet
services. This may happen inside one DC during a migration phase or
as NVO3 services are delivered to the outside world via Internet or
VPN gateways.

Moreover, the compute and storage services delivered by an NVO3
domain may span multiple DCs, requiring inter-DC connectivity. From
a DC perspective, a set of gateway devices is required in all of
these cases, albeit with different functionality influenced by the
overlay type used across the WAN, the service type and the DC
network technologies used at each DC site.

A GW handling the connectivity between NVO3 and external domains
represents a single point of failure that may affect multiple tenant
services. Redundancy between NVO3 and external domains MUST be
supported.

3.4.1. GW Types

3.4.1.1. VPN and Internet GWs

Tenant sites may already be interconnected using one of the existing
VPN services and technologies (VPLS or IP VPN). If a new NVO3
encapsulation is used, a VPN GW is required to forward traffic
between NVO3 and VPN domains.
Translation of encapsulations MAY be
required. Internet-connected tenants require translation from the
NVO3 encapsulation to IP in the NVO3 gateway. The translation
function SHOULD NOT require provisioning touches and SHOULD NOT use
intermediate hand-offs, for example VLANs.

3.4.1.2. Inter-DC GW

Inter-DC connectivity MAY be required to provide support for
features like disaster prevention or compute load re-distribution.
This MAY be provided via a set of gateways interconnected through a
WAN. This type of connectivity MAY be provided either through
extension of the NVO3 tunneling domain or via VPN GWs.

3.4.1.3. Intra-DC gateways

Even within one DC there may be End Devices that do not support NVO3
encapsulation, for example bare metal servers, hardware appliances
and storage. A gateway device, e.g. a ToR, is required to translate
between the NVO3 and Ethernet VLAN encapsulations.

3.4.2. Path optimality between NVEs and Gateways

Within the NVO3 overlay, a default assumption is that NVO3 traffic
will be equally load-balanced across the underlying network
consisting of LAG and/or ECMP paths. This assumption is valid only
as long as: a) all traffic is load-balanced equally among each of
the component-links and paths; and, b) each of the component-
links/paths is of identical capacity. During the course of normal
operation of the underlying network, it is possible that one, or
more, of the component-links/paths of a LAG may be taken out of
service in order to be repaired, e.g. due to hardware failure of
cabling, optics, etc. In such cases, the administrator should
configure the underlying network such that an entire LAG bundle in
the underlying network will be reported as operationally down if
there is a failure of any single component-link member of the LAG
bundle (e.g. an N = M configuration of the LAG bundle), and, thus,
they know that traffic will be carried sufficiently by alternate,
available (potentially ECMP) paths in the underlying network. This
is likely an adequate assumption for intra-DC traffic, where
presumably the cost of additional protection capacity along
alternate paths is not prohibitive. Thus, there are likely no
additional requirements on NVO3 solutions to accommodate this type
of underlying network configuration and administration.

There is a similar case with ECMP, used intra-DC, where failure of a
single component-path of an ECMP group would result in traffic
shifting onto the surviving members of the ECMP group.
Unfortunately, there are no automatic recovery methods in IP routing
protocols to detect a simultaneous failure of more than one
component-path in an ECMP group, operationally disable the entire
ECMP group and allow traffic to shift onto alternative paths. This
problem is attributable to the underlying network and is thus out of
scope of any NVO3 solution.

On the other hand, for inter-DC and DC-to-external-network cases
that use a WAN, the costs of the underlying network and/or service
(e.g. an IP VPN service) are higher; therefore, there is a
requirement on administrators to both: a) ensure high availability
(active-backup fail-over or active-active load-balancing); and, b)
maintain substantial utilization of the WAN transport capacity at
nearly all times, particularly in the case of active-active load-
balancing.
With respect to the data plane requirements of NVO3
solutions, in the case of active-backup fail-over, all of the
ingress NVEs MUST dynamically adapt to the failure of an active NVE
GW when the backup NVE GW announces itself into the NVO3 overlay
immediately following a failure of the previously active NVE GW, and
update their forwarding tables accordingly (e.g. through data plane
learning and/or translation of a gratuitous ARP or IPv6 Router
Advertisement). Note that active-backup fail-over could be used to
accomplish a crude form of load-balancing by, for example, manually
configuring each tenant to use a different NVE GW, in a round-robin
fashion. On the other hand, with respect to active-active load-
balancing across physically separate NVE GWs (e.g. two separate
chassis), an NVO3 solution SHOULD support forwarding tables that can
simultaneously map a single egress NVE to more than one NVO3 tunnel.
The granularity of such mappings, in both active-backup and active-
active, MUST be unique to each tenant.

3.4.2.1. Triangular Routing Issues, a.k.a. Traffic Tromboning

An L2/ELAN over NVO3 service may span multiple racks distributed
across different DC regions. Multiple ELANs belonging to one tenant
may be interconnected or connected to the outside world through
multiple Router/VRF gateways distributed throughout the DC regions.
In this scenario, without aid from an NVO3 or other type of
solution, traffic from an ingress NVE destined to external gateways
will take a non-optimal path that will result in higher latency and
costs (since it is using more expensive resources of a WAN). In the
case of traffic from an IP/MPLS network destined toward the entrance
to an NVO3 overlay, well-known IP routing techniques MAY be used to
optimize traffic into the NVO3 overlay (at the expense of additional
routes in the IP/MPLS network). In summary, these issues are well
known as triangular routing.

Procedures for gateway selection to avoid triangular routing issues
SHOULD be provided. The details of such procedures are most likely
part of the NVO3 Management and/or Control Plane requirements and
are thus out of scope of this document. However, a key data plane
requirement on any NVO3 solution to avoid triangular routing is
stated above, in Section 3.4.2, with respect to active-active load-
balancing. More specifically, an NVO3 solution SHOULD support
forwarding tables that can simultaneously map a single egress NVE to
more than one NVO3 tunnel. The expectation is that, through the
Control and/or Management Planes, this mapping information MAY be
dynamically manipulated to, for example, provide the closest
geographic and/or topological exit point (egress NVE) for each
ingress NVE.

3.5. Path MTU

The tunnel overlay header can cause the MTU of the path to the
egress tunnel endpoint to be exceeded.

IP fragmentation SHOULD be avoided for performance reasons.

The interface MTU as seen by a Tenant System SHOULD be adjusted such
that no fragmentation is needed. This can be achieved by
configuration or be discovered dynamically.
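As an informative illustration of the adjustment above, the sketch
below computes a tenant-facing interface MTU from the underlay MTU
and the encapsulation overhead. The header sizes used (20-byte outer
IPv4, 8-byte outer UDP, an assumed 8-byte NVO3 overlay header, and a
14-byte inner Ethernet header for an L2 VNI) are examples only and
will differ per encapsulation and VNI type.

   def tenant_mtu(underlay_mtu, outer_ip=20, outer_udp=8,
                  overlay_hdr=8, inner_eth=14):
       """Largest tenant IP packet that fits without fragmentation.

       All header sizes are illustrative assumptions, not values
       mandated by any NVO3 encapsulation.
       """
       return underlay_mtu - (outer_ip + outer_udp +
                              overlay_hdr + inner_eth)

   print(tenant_mtu(1600))   # 1550: jumbo-capable underlay
   print(tenant_mtu(1500))   # 1450: standard 1500-byte underlay

With a 1600-byte underlay MTU the tenant interface MTU could thus be
set to 1550 bytes, whereas a standard 1500-byte underlay would force
it down to 1450 bytes in this example.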
At least one of the following options MUST be supported:

o Classical ICMP-based Path MTU Discovery [RFC1191] [RFC1981] or
  extended Path MTU Discovery techniques such as defined in
  [RFC4821]

o Segmentation and reassembly support by the overlay layer, without
  relying on the Tenant Systems to know about the end-to-end MTU

o The underlay network MAY be designed in such a way that the MTU
  can accommodate the extra tunnel overhead.

3.6. Hierarchical NVE

It might be desirable to support the concept of hierarchical NVEs,
such as spoke NVEs and hub NVEs, in order to address possible NVE
performance limitations and service connectivity optimizations.

For instance, spoke NVE functionality MAY be used when processing
capabilities are limited. A hub NVE would provide additional data
processing capabilities such as packet replication.

NVEs can be connected in either an any-to-any or a hub-and-spoke
topology on a per-VNI basis.

3.7. NVE Multi-Homing Requirements

Multi-homing techniques SHOULD be used to increase the reliability
of an NVO3 network. It is also important to ensure that physical
diversity in an NVO3 network is taken into account to avoid single
points of failure.

Multi-homing can be enabled in various nodes, from Tenant Systems
into ToRs, ToRs into core switches/routers, and core nodes into DC
GWs.

Tenant Systems can be either L2 or L3 nodes. In the former case
(L2), techniques such as LAG or STP MAY be used. In the latter case
(L3), it is possible that no dynamic routing protocol is enabled.
Tenant Systems can be multi-homed into a remote NVE using several
interfaces (physical NICs or vNICs) with an IP address per
interface, either to the same NVO3 network or into different NVO3
networks. When one of the links fails, the corresponding IP address
is not reachable, but the other interfaces can still be used. When a
Tenant System is co-located with an NVE, IP routing can be relied
upon to handle routing over diverse links to ToRs.

External connectivity MAY be handled by two or more NVO3 gateways.
Each gateway is connected to a different domain (e.g. ISP) and runs
BGP multi-homing. They serve as an access point to external networks
such as VPNs or the Internet. When a connection to an upstream
router is lost, the alternative connection is used and the failed
route is withdrawn.

3.8. OAM

NVEs MAY be able to originate/terminate OAM messages for
connectivity verification, performance monitoring, statistics
gathering and fault isolation. Depending on configuration, NVEs
SHOULD be able to process or transparently tunnel OAM messages, as
well as support alarm propagation capabilities.

Given the critical requirement to load-balance NVO3 encapsulated
packets over LAG and ECMP paths, it will be equally critical to
ensure that existing and/or new OAM tools allow NVE administrators
to proactively and/or reactively monitor the health of the various
component-links that comprise both LAG and ECMP paths carrying NVO3
encapsulated packets. For example, it will be important that such
OAM tools allow NVE administrators to reveal the set of underlying
network hops (topology) so that the underlying network
administrators can use this information to quickly perform fault
isolation and restore the underlying network.
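As an informative illustration only, the following sketch shows how
an OAM probe sender could sweep the outer UDP source port so that
successive probes exercise different LAG/ECMP component-links in the
underlay. It assumes a hypothetical echo responder co-located with
the remote NVE and a hypothetical probe port; no such mechanism is
defined by this document.

   import socket

   def sweep_probes(dst_ip, dst_port, first_port=49152, count=16):
       # One probe per outer UDP source port; underlay LAG/ECMP
       # hashes that include the source port spread these probes
       # over different component links.
       for src_port in range(first_port, first_port + count):
           with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
               s.bind(("", src_port))       # pin the outer source port
               s.settimeout(1.0)
               s.sendto(b"nvo3-oam-probe", (dst_ip, dst_port))
               try:
                   s.recvfrom(64)           # hypothetical responder replied
                   print("src port %d: reply received" % src_port)
               except socket.timeout:
                   print("src port %d: no reply" % src_port)

   # Example (documentation address, hypothetical responder port):
   # sweep_probes("192.0.2.10", 6000)

A probe sweep of this kind only demonstrates the entropy principle;
an actual NVO3 OAM mechanism would need to carry the VNI context and
follow whatever encapsulation the solution defines.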
The NVE MUST provide the ability to reveal the set of ECMP and/or
LAG paths used by NVO3 encapsulated packets in the underlying
network from an ingress NVE to an egress NVE. The NVE MUST provide a
"ping"-like functionality that can be used to determine the health
(liveness) of remote NVEs or their VNIs. The NVE SHOULD provide a
"ping"-like functionality to more expeditiously aid in
troubleshooting performance problems, e.g. blackholing or other
types of congestion occurring in the underlying network, for NVO3
encapsulated packets carried over LAG and/or ECMP paths.

3.9. Other considerations

3.9.1. Data Plane Optimizations

Data plane forwarding and encapsulation choices SHOULD consider the
limitations of possible NVE implementations, specifically software-
based implementations (e.g. servers running vSwitches).

NVEs SHOULD provide efficient processing of traffic. For instance,
packet alignment, the use of offsets to minimize header parsing, and
padding techniques SHOULD be considered when designing NVO3
encapsulation types.

The NVO3 encapsulation/decapsulation processing in software-based
NVEs SHOULD make use of hardware assist provided by NICs in order to
speed up packet processing.

3.9.2. NVE location trade-offs

In the case of DC traffic, traffic originating from a VM is native
Ethernet traffic. This traffic can be switched by a local VM switch
or ToR switch and then by a DC gateway. The NVE function can be
embedded within any of these elements.

The NVE function can be supported in various DC network elements,
such as a VM, VM switch, ToR switch or DC GW.

The following criteria SHOULD be considered when deciding where the
NVE processing boundary lies:

o Processing and memory requirements

o Datapath (e.g. lookups, filtering,
  encapsulation/decapsulation)

o Control plane processing (e.g. routing, signaling, OAM)

o FIB/RIB size

o Multicast support

o Routing protocols

o Packet replication capability

o Fragmentation support

o QoS transparency

o Resiliency

4. Security Considerations

This requirements document does not in itself raise any specific
security issues.

5. IANA Considerations

IANA does not need to take any action for this draft.

6. References

6.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2. Informative References

[NVOPS]    Narten, T. et al, "Problem Statement: Overlays for
           Network Virtualization", draft-narten-nvo3-overlay-
           problem-statement (work in progress)

[NVO3-framework]  Lasserre, M. et al, "Framework for DC Network
           Virtualization", draft-lasserre-nvo3-framework (work in
           progress)

[OVCPREQ]  Kreeger, L. et al, "Network Virtualization Overlay
           Control Protocol Requirements", draft-kreeger-nvo3-
           overlay-cp (work in progress)

[FLOYD]    Floyd, S. and A. Romanow, "Dynamics of TCP Traffic over
           ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995.

[RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
           Networks (VPNs)", RFC 4364, February 2006.

[RFC1191]  Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

[RFC1981]  McCann, J. et al, "Path MTU Discovery for IPv6",
           RFC 1981, August 1996.

[RFC4821]  Mathis, M. et al, "Packetization Layer Path MTU
           Discovery", RFC 4821, March 2007.

[RFC2983]  Black, D.,
"Diffserv and tunnels", RFC2983, Cotober 2000 801 [RFC6040] Briscoe, B. "Tunnelling of Explicit Congestion 802 Notification", RFC6040, November 2010 804 [RFC6438] Carpenter, B. et al, "Using the IPv6 Flow Label for Equal 805 Cost Multipath Routing and Link Aggregation in Tunnels", 806 RFC6438, November 2011 808 [RFC6391] Bryant, S. et al, "Flow-Aware Transport of Pseudowires 809 over an MPLS Packet Switched Network", RFC6391, November 810 2011 812 7. Acknowledgments 814 In addition to the authors the following people have contributed to 815 this document: 817 Shane Amante, Level3 819 Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent 821 Larry Kreeger, Cisco 823 This document was prepared using 2-Word-v2.0.template.dot. 825 Authors' Addresses 827 Nabil Bitar 828 Verizon 829 40 Sylvan Road 830 Waltham, MA 02145 831 Email: nabil.bitar@verizon.com 833 Marc Lasserre 834 Alcatel-Lucent 835 Email: marc.lasserre@alcatel-lucent.com 837 Florin Balus 838 Alcatel-Lucent 839 777 E. Middlefield Road 840 Mountain View, CA, USA 94043 841 Email: florin.balus@alcatel-lucent.com 843 Thomas Morin 844 France Telecom Orange 845 Email: thomas.morin@orange.com 847 Lizhong Jin 848 ZTE 849 Email : lizhong.jin@zte.com.cn 851 Bhumip Khasnabish 852 ZTE 853 Email : Bhumip.khasnabish@zteusa.com