Internet Engineering Task Force                             Nabil Bitar
Internet Draft                                                  Verizon
Intended status: Informational
Expires: May 2014                                         Marc Lasserre
                                                           Florin Balus
                                                         Alcatel-Lucent

                                                           Thomas Morin
                                                  France Telecom Orange

                                                            Lizhong Jin

                                                      Bhumip Khasnabish
                                                                    ZTE

                                                      November 12, 2013

                      NVO3 Data Plane Requirements
             draft-ietf-nvo3-dataplane-requirements-02.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on May 12, 2014.
Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.

Abstract

Several IETF drafts relate to the use of overlay networks to support
large scale virtual data centers. This draft provides a list of data
plane requirements for Network Virtualization over L3 (NVO3) that
have to be addressed in solutions documents.

Table of Contents

1. Introduction
   1.1. Conventions used in this document
   1.2. General terminology
2. Data Path Overview
3. Data Plane Requirements
   3.1. Virtual Access Points (VAPs)
   3.2. Virtual Network Instance (VNI)
      3.2.1. L2 VNI
      3.2.2. L3 VNI
   3.3. Overlay Module
      3.3.1. NVO3 overlay header
         3.3.1.1. Virtual Network Context Identification
         3.3.1.2. Service QoS identifier
      3.3.2. Tunneling function
         3.3.2.1. LAG and ECMP
         3.3.2.2. DiffServ and ECN marking
         3.3.2.3. Handling of BUM traffic
   3.4. External NVO3 connectivity
      3.4.1. GW Types
         3.4.1.1. VPN and Internet GWs
         3.4.1.2. Inter-DC GW
         3.4.1.3. Intra-DC gateways
      3.4.2. Path optimality between NVEs and Gateways
         3.4.2.1. Load-balancing
         3.4.2.2. Triangular Routing Issues (a.k.a. Traffic Tromboning)
   3.5. Path MTU
   3.6. Hierarchical NVE
   3.7. NVE Multi-Homing Requirements
   3.8. Other considerations
      3.8.1. Data Plane Optimizations
      3.8.2. NVE location trade-offs
4. Security Considerations
5. IANA Considerations
6. References
   6.1. Normative References
   6.2. Informative References
7. Acknowledgments

1. Introduction
1.1. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

1.2. General terminology

The terminology defined in [NVO3-framework] is used throughout this
document. Terminology specific to this memo is defined here and is
introduced as needed in later sections.

BUM: Broadcast, Unknown Unicast, Multicast traffic

TS: Tenant System

2. Data Path Overview

The NVO3 framework [NVO3-framework] defines the generic NVE model
depicted in Figure 1:

                    +-------- L3 Network -------+
                    |                           |
                    |       Tunnel Overlay      |
       +------------+---------+       +---------+------------+
       | +----------+-------+ |       | +---------+--------+ |
       | |  Overlay Module  | |       | |  Overlay Module  | |
       | +---------+--------+ |       | +---------+--------+ |
       |           |VN context|       | VN context|          |
       |           |          |       |           |          |
       |  +--------+-------+  |       |  +--------+-------+  |
       |  | |VNI| .. |VNI| |  |       |  | |VNI| .. |VNI| |  |
  NVE1 |  +-+------------+-+  |       |  +-+-----------+--+  | NVE2
       |    |    VAPs    |    |       |    |   VAPs    |     |
       +----+------------+----+       +----+-----------+-----+
            |            |                 |           |
            |            |                 |           |
   ---------+------------+-----------------+-----------+-------
            |            |     Tenant      |           |
            |            |   Service IF    |           |
           Tenant Systems                 Tenant Systems

         Figure 1 : Generic reference model for NV Edge

When a frame is received by an ingress NVE from a Tenant System over
a local VAP, it needs to be parsed in order to identify which
virtual network instance it belongs to. The parsing function can
examine various fields in the data frame (e.g., VLAN ID) and/or the
associated interface/port the frame came from.

Once a corresponding VNI is identified, a lookup is performed to
determine where the frame needs to be sent. This lookup can be based
on any combination of various fields in the data frame (e.g.,
destination MAC address and/or destination IP address). Note that
additional criteria such as 802.1p and/or DSCP markings might be
used to select an appropriate tunnel or local VAP destination.

Lookup tables can be populated using different techniques: data
plane learning, management plane configuration, or a distributed
control plane. Management and control planes are not in the scope of
this document. The data plane based solution is described in this
document because it has implications on the data plane processing
function.

The result of this lookup yields the corresponding information
needed to build the overlay header, as described in section 3.3.
This information includes the destination L3 address of the egress
NVE. Note that this lookup might yield a list of tunnels, such as
when ingress replication is used for BUM traffic.

The overlay header MUST include a context identifier which the
egress NVE will use to identify which VNI the frame belongs to.

The egress NVE checks the context identifier, removes the
encapsulation header, and then forwards the original frame towards
the appropriate recipient, usually a local VAP.
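As an informal, non-normative illustration of the steps above, the
following Python sketch shows a VAP lookup, a per-VNI destination
lookup, and the resulting encapsulation decision. All table
contents, field names, and the encapsulation layout are hypothetical
examples, not part of this document's requirements.

   # Sketch of the ingress NVE data path described above
   # (identify VAP, per-VNI lookup, encapsulate).  Names and
   # values are illustrative only.

   from collections import namedtuple

   Tunnel = namedtuple("Tunnel", ["egress_nve_ip", "vn_context"])

   # VAP identification: (local port, VLAN ID) -> VNI
   vap_table = {("vport7", 100): "vni-blue"}

   # Per-VNI forwarding table: destination MAC -> tunnel to egress NVE
   fib = {"vni-blue": {"00:11:22:33:44:55": Tunnel("192.0.2.10", 5001)}}

   def encapsulate(frame, vn_context, egress_nve_ip):
       """Prepend a (schematic) NVO3 overlay header and outer header."""
       return {"outer_dst": egress_nve_ip, "vn_context": vn_context,
               "payload": frame}

   def ingress_process(port, vlan, dst_mac, frame):
       vni = vap_table.get((port, vlan))   # identify the VNI from the VAP
       if vni is None:
           return None                     # no VAP match: drop
       tunnel = fib[vni].get(dst_mac)      # per-VNI destination lookup
       if tunnel is None:
           return "flood"                  # unknown unicast: BUM handling
       return encapsulate(frame, tunnel.vn_context, tunnel.egress_nve_ip)

   print(ingress_process("vport7", 100, "00:11:22:33:44:55",
                         b"tenant frame"))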
3. Data Plane Requirements

3.1. Virtual Access Points (VAPs)

The NVE forwarding plane MUST support VAP identification through the
following mechanisms:

- Using the local interface on which the frames are received, where
  the local interface may be an internal, virtual port in a VSwitch
  or a physical port on the ToR

- Using the local interface and some fields in the frame header,
  e.g., one or more VLAN tags or the source MAC address

3.2. Virtual Network Instance (VNI)

VAPs are associated with a specific VNI at service instantiation
time.

A VNI identifies a per-tenant private context, i.e. per-tenant
policies and a FIB table that allow overlapping address space
between tenants.

There are different VNI types, differentiated by the virtual network
service they provide to Tenant Systems. Network virtualization can
be provided by L2 and/or L3 VNIs.

3.2.1. L2 VNI

An L2 VNI MUST provide an emulated Ethernet multipoint service, as
if Tenant Systems were interconnected by a bridge, but using a set
of NVO3 tunnels instead. The emulated bridge could be 802.1Q enabled
(allowing the use of VLAN tags as VAPs). An L2 VNI provides a per-
tenant virtual switching instance with MAC address isolation and L3
tunneling. Loop avoidance capability MUST be provided.

Forwarding table entries provide mapping information between Tenant
System MAC addresses and VAPs on directly connected VNIs, and L3
tunnel destination addresses over the overlay. Such entries could be
populated by a control or management plane, or via the data plane.

By default, data plane learning MUST be used to populate forwarding
tables. As frames arrive from VAPs or from overlay tunnels, standard
MAC learning procedures are used: the Tenant System source MAC
address is learned against the VAP or the NVO3 tunnel encapsulation
source address on which the frame arrived. This implies that unknown
unicast traffic will be flooded.

When flooding is required, either to deliver unknown unicast,
broadcast or multicast traffic, the NVE MUST support either ingress
replication or multicast.

When using multicast, the NVE MUST have one or more multicast trees
that can be used by local VNIs for flooding to NVEs belonging to the
same VN. For each VNI, there is at least one flooding tree used for
Broadcast, Unknown Unicast and Multicast forwarding. This tree MAY
be shared across VNIs. The flooding tree is equivalent to a
multicast (*,G) construct where all the NVEs on which the
corresponding VNI is instantiated are members.

When tenant multicast is supported, it SHOULD also be possible to
select whether the NVE provides optimized multicast trees inside the
VNI for individual tenant multicast groups or whether the default
VNI flooding tree is used. If the former option is selected, the VNI
SHOULD be able to snoop IGMP/MLD messages in order to efficiently
join/prune Tenant Systems to/from multicast trees.
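As a rough, non-normative sketch of the data plane learning and
flooding behavior described in this section, the following Python
fragment learns source MAC addresses against the VAP or tunnel
source on which a frame arrived and floods unknown unicast. Data
structures, endpoint names, and the split-horizon check are purely
illustrative assumptions.

   # Sketch of per-VNI data plane MAC learning and flooding for an
   # L2 VNI.  Names and values are illustrative only.

   from collections import defaultdict

   # Per-VNI MAC table: MAC -> local VAP or remote NVE tunnel source
   mac_table = defaultdict(dict)

   # Per-VNI flooding list: all VAPs and remote NVEs in the VN
   flood_list = {"vni-blue": ["vap1", "vap2", "nve:192.0.2.10"]}

   def l2_vni_forward(vni, src_mac, dst_mac, arrival_point):
       # Standard MAC learning: bind the source MAC to the VAP or the
       # NVO3 tunnel source address the frame arrived on.
       mac_table[vni][src_mac] = arrival_point

       destination = mac_table[vni].get(dst_mac)
       if destination is not None:
           return [destination]           # known unicast
       # Unknown unicast (or broadcast/multicast): flood to every
       # member of the VN except the point the frame came from.
       return [p for p in flood_list[vni] if p != arrival_point]

   print(l2_vni_forward("vni-blue", "aa:bb:cc:00:00:01",
                        "ff:ff:ff:ff:ff:ff", "vap1"))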
3.2.2. L3 VNI

L3 VNIs MUST provide virtualized IP routing and forwarding. L3 VNIs
MUST support a per-tenant forwarding instance with IP addressing
isolation and L3 tunneling for interconnecting instances of the same
VNI on NVEs.

In the case of an L3 VNI, the inner TTL field MUST be decremented by
(at least) 1, as if the NVO3 egress NVE were one (or more) hop(s)
away. The TTL field in the outer IP header MUST be set to a value
appropriate for delivery of the encapsulated frame to the tunnel
exit point. Thus, the default behavior MUST be the TTL "pipe" model,
where the overlay network looks like one hop to the sending NVE.
Configuration of a "uniform" TTL model, where the outer tunnel TTL
is set equal to the inner TTL at the ingress NVE and the inner TTL
is set to the outer TTL value at the egress NVE, MAY be supported.

L2 and L3 VNIs can be deployed in isolation or in combination to
optimize traffic flows per tenant across the overlay network. For
example, an L2 VNI may be configured across a number of NVEs to
offer L2 multi-point service connectivity while an L3 VNI can be
co-located to offer local routing capabilities and gateway
functionality. In addition, integrated routing and bridging per
tenant MAY be supported on an NVE. An instantiation of such a
service may be realized by interconnecting an L2 VNI as access to an
L3 VNI on the NVE.

When multicast is supported, it MAY be possible to select whether
the NVE provides optimized multicast trees inside the VNI for
individual tenant multicast groups or whether a default VNI
multicast tree, where all the NVEs of the corresponding VNI are
members, is used.

3.3. Overlay Module

The overlay module performs a number of functions related to NVO3
header and tunnel processing.

The following figure shows a generic NVO3 encapsulated frame:

      +--------------------------+
      |        Tenant Frame      |
      +--------------------------+
      |    NVO3 Overlay Header   |
      +--------------------------+
      |   Outer Underlay header  |
      +--------------------------+
      |  Outer Link layer header |
      +--------------------------+

      Figure 2 : NVO3 encapsulated frame

where:

. Tenant frame: Ethernet or IP, based upon the VNI type

. NVO3 overlay header: Header containing VNI context information
  and other optional fields that can be used for processing this
  packet

. Outer underlay header: Can be either IP or MPLS

. Outer link layer header: Header specific to the physical
  transmission link used

3.3.1. NVO3 overlay header

An NVO3 overlay header MUST be included after the underlay tunnel
header when forwarding tenant traffic.

Note that this information can be carried within existing protocol
headers (when overloading of specific fields is possible) or within
a separate header.

3.3.1.1. Virtual Network Context Identification

The overlay encapsulation header MUST contain a field which allows
the encapsulated frame to be delivered to the appropriate virtual
network endpoint by the egress NVE.

The egress NVE uses this field to determine the appropriate virtual
network context in which to process the packet. This field MAY be an
explicit, unique (to the administrative domain) virtual network
identifier (VNID) or MAY express the necessary context information
in other ways (e.g. a locally significant identifier).

In the case of a global identifier, this field MUST be large enough
to scale to hundreds of thousands of virtual networks. Note that
there is typically no such constraint when using a local identifier.
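As a non-normative illustration of egress-side context handling, the
sketch below assumes a 24-bit global VNID purely as an example of a
field that satisfies the scaling requirement above (2^24 is roughly
16.7 million values); the actual field size and layout are left to
solution documents.

   # Sketch of egress NVE VN context handling.  The 24-bit VNID and
   # the table contents are assumptions for illustration only.

   VNID_BITS = 24
   MAX_VNID = (1 << VNID_BITS) - 1

   # Egress NVE mapping from received VN context to local VNI state
   vn_context_table = {5001: "vni-blue", 5002: "vni-red"}

   def egress_lookup(vn_context):
       if not 0 <= vn_context <= MAX_VNID:
           raise ValueError("VN context out of range for a 24-bit VNID")
       # An unknown context cannot be mapped to a VNI; the frame is
       # dropped (None).
       return vn_context_table.get(vn_context)

   print(egress_lookup(5001))   # -> 'vni-blue'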
3.3.1.2. Service QoS identifier

Traffic flows originating from different applications could rely on
differentiated forwarding treatment to meet end-to-end availability
and performance objectives. Such applications may span one or more
overlay networks. To enable such treatment, support for multiple
Classes of Service (CoS) across or between overlay networks MAY be
required.

To effectively enforce CoS across or between overlay networks, NVEs
MAY be able to map CoS markings between networking layers, e.g.,
Tenant Systems, Overlays, and/or Underlay, enabling each networking
layer to independently enforce its own CoS policies. For example:

- TS (e.g. VM) CoS

  o Tenant CoS policies MAY be defined by Tenant administrators

  o QoS fields (e.g. IP DSCP and/or Ethernet 802.1p) in the tenant
    frame are used to indicate application level CoS requirements

- NVE CoS

  o The NVE MAY classify packets based on Tenant CoS markings or
    other mechanisms (e.g. DPI) to identify the proper service CoS
    to be applied across the overlay network

  o NVE service CoS levels are normalized to a common set (for
    example 8 levels) across multiple tenants; the NVE uses per-
    tenant policies to map Tenant CoS to the normalized service CoS
    fields in the NVO3 header

- Underlay CoS

  o The underlay/core network MAY use a different CoS set (for
    example 4 levels) than the NVE CoS, as the core devices MAY have
    different QoS capabilities compared with NVEs.

  o The Underlay CoS MAY also change as the NVO3 tunnels pass
    between different domains.

Support for NVE service CoS MAY be provided through a QoS field
inside the NVO3 overlay header. Examples of service CoS carried as
part of a service tag are the 802.1p and DE bits in VLAN and PBB
I-SID tags, and the TC bits in MPLS VPN labels.

3.3.2. Tunneling function

This section describes the underlay tunneling requirements. From an
encapsulation perspective, IPv4 or IPv6 MUST be supported, both IPv4
and IPv6 SHOULD be supported, and MPLS tunneling MAY be supported.

3.3.2.1. LAG and ECMP

For performance reasons, multipath over LAG and ECMP paths MAY be
supported.

LAG (Link Aggregation Group) [IEEE 802.1AX-2008] and ECMP (Equal
Cost Multi Path) are commonly used techniques to perform load-
balancing of microflows over a set of parallel links, either at
Layer 2 (LAG) or Layer 3 (ECMP). Existing deployed hardware
implementations of LAG and ECMP use a hash of various fields in the
encapsulation (outermost) header(s) (e.g. source and destination MAC
addresses for non-IP traffic; source and destination IP addresses,
L4 protocol, and L4 source and destination port numbers for IP
traffic). Furthermore, hardware deployed for the underlay network(s)
will most often be unaware of the carried, innermost L2 frames or L3
packets transmitted by the TS.

Thus, in order to perform fine-grained load-balancing over LAG and
ECMP paths in the underlying network, the encapsulation MUST result
in sufficient entropy to exercise all paths through several LAG/ECMP
hops.

The entropy information can be inferred from the NVO3 overlay header
or the underlay header. If the overlay protocol does not support the
necessary entropy information, or the switches/routers in the
underlay do not support parsing of the additional entropy
information in the overlay header, underlay switches and routers
should be programmable, i.e. able to select the appropriate fields
in the underlay header for hash calculation based on the type of
overlay header.
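One common (but not mandated) way to provide such entropy is a UDP-
based encapsulation in which the ingress NVE derives the outer UDP
source port from a hash of the inner flow, so that underlay LAG/ECMP
hashes on the outer header spread flows across paths. The following
sketch illustrates that assumption; the port range and hash choice
are examples only.

   # Sketch: derive an outer UDP source port from the inner 5-tuple
   # so the underlay's outer-header hash sees per-flow entropy.
   # The UDP-based encapsulation is an assumption for illustration.

   import zlib

   EPHEMERAL_MIN, EPHEMERAL_MAX = 49152, 65535   # dynamic port range

   def outer_source_port(inner_flow):
       """Map an inner 5-tuple to a stable outer UDP source port."""
       key = "|".join(str(f) for f in inner_flow).encode()
       h = zlib.crc32(key)                 # any stable hash will do
       span = EPHEMERAL_MAX - EPHEMERAL_MIN + 1
       return EPHEMERAL_MIN + (h % span)

   # Same flow -> same port (preserves ordering); different flows ->
   # different ports with high probability (spreads load over LAG/ECMP).
   flow = ("10.1.1.1", "10.2.2.2", 6, 33812, 443)
   print(outer_source_port(flow))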
All packets that belong to a specific flow MUST follow the same path
in order to prevent packet re-ordering. This is typically achieved
by ensuring that the fields used for hashing are identical for a
given flow.

The goal is for all paths available to the overlay network to be
used efficiently. Different flows should be distributed as evenly as
possible across multiple underlay network paths. For instance, this
can be achieved by ensuring that some fields used for hashing are
randomly generated.

3.3.2.2. DiffServ and ECN marking

When traffic is encapsulated in a tunnel header, there are numerous
options as to how the Diffserv Code-Point (DSCP) and Explicit
Congestion Notification (ECN) markings are set in the outer header
and propagated to the inner header on decapsulation.

[RFC2983] defines two modes for mapping the DSCP markings from inner
to outer headers and vice versa. The Uniform model copies the inner
DSCP marking to the outer header on tunnel ingress, and copies that
outer header value back to the inner header at tunnel egress. The
Pipe model sets the DSCP value to some value based on local policy
at ingress and does not modify the inner header on egress. Both
models SHOULD be supported.

[RFC6040] defines ECN marking and processing for IP tunnels.

3.3.2.3. Handling of BUM traffic

NVO3 data plane support for either ingress replication or point-to-
multipoint tunnels is required to send traffic destined to multiple
locations on a per-VNI basis (e.g. L2/L3 multicast traffic, L2
broadcast and unknown unicast traffic). It is possible for both
methods to be used simultaneously.

There is a bandwidth versus state trade-off between the two
approaches. User-configurable knobs MUST be provided to select which
method(s) gets used, based upon the amount of replication required
(i.e. the number of hosts per group), the amount of multicast state
to maintain, the duration of multicast flows, and the scalability of
multicast protocols.

When ingress replication is used, NVEs MUST maintain, for each VNI,
the set of related tunnel endpoints to which they need to replicate
the frame.

For point-to-multipoint tunnels, the bandwidth efficiency is
increased at the cost of more state in the core nodes. The ability
to auto-discover or pre-provision the mapping between VNI multicast
trees and related tunnel endpoints at the NVE and/or throughout the
core SHOULD be supported.
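An informal sketch of the ingress replication option follows. The
per-VNI endpoint list and addresses are hypothetical; how the list
is populated (control or management plane) is out of scope of this
document.

   # Sketch of ingress replication for BUM traffic: the ingress NVE
   # keeps, per VNI, the list of remote tunnel endpoints and sends
   # one encapsulated copy of the frame to each of them.

   bum_endpoints = {
       "vni-blue": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
   }

   def replicate_bum(vni, frame, encap):
       """Return one encapsulated copy per remote NVE of the VNI."""
       return [encap(frame, remote_nve) for remote_nve in bum_endpoints[vni]]

   copies = replicate_bum("vni-blue", b"broadcast frame",
                          lambda f, nve: {"outer_dst": nve, "payload": f})
   print(len(copies))   # 3 copies, one per remote NVE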
3.4. External NVO3 connectivity

NVO3 services MUST interoperate with current VPN and Internet
services. This may happen inside one DC during a migration phase or
as NVO3 services are delivered to the outside world via Internet or
VPN gateways.

Moreover, the compute and storage services delivered by an NVO3
domain may span multiple DCs, requiring inter-DC connectivity. From
a DC perspective, a set of gateway devices is required in all of
these cases, albeit with different functionalities influenced by the
overlay type used across the WAN, the service type, and the DC
network technologies used at each DC site.

A GW handling the connectivity between NVO3 and external domains
represents a single point of failure that may affect multiple tenant
services. Redundancy between NVO3 and external domains MUST be
supported.

3.4.1. GW Types

3.4.1.1. VPN and Internet GWs

Tenant sites may already be interconnected using one of the existing
VPN services and technologies (VPLS or IP VPN). If a new NVO3
encapsulation is used, a VPN GW is required to forward traffic
between the NVO3 and VPN domains. Translation of encapsulations MAY
be required. Internet-connected tenants require translation from the
NVO3 encapsulation to IP in the NVO3 gateway. The translation
function SHOULD minimize provisioning touches.

3.4.1.2. Inter-DC GW

Inter-DC connectivity MAY be required to provide support for
features like disaster prevention or compute load re-distribution.
This MAY be provided via a set of gateways interconnected through a
WAN. This type of connectivity MAY be provided either through
extension of the NVO3 tunneling domain or via VPN GWs.

3.4.1.3. Intra-DC gateways

Even within one DC there may be End Devices that do not support NVO3
encapsulation, for example bare metal servers, hardware appliances
and storage. A gateway device, e.g. a ToR, is required to translate
between the NVO3 and Ethernet VLAN encapsulations.

3.4.2. Path optimality between NVEs and Gateways

Within an NVO3 overlay, a default assumption is that NVO3 traffic
will be equally load-balanced across the underlying network,
consisting of LAG and/or ECMP paths. This assumption is valid only
as long as: a) all traffic is load-balanced equally among each of
the component links and paths; and b) each of the component
links/paths is of identical capacity. During the course of normal
operation of the underlying network, it is possible that one or more
of the component links/paths of a LAG may be taken out of service in
order to be repaired, e.g. due to hardware failure of cabling,
optics, etc. In such cases, the administrator should configure the
underlying network such that an entire LAG bundle will be reported
as operationally down if there is a failure of any single component
link of the LAG bundle (e.g. an N = M configuration of the LAG
bundle), so that traffic is known to be carried sufficiently by
alternate, available (potentially ECMP) paths in the underlying
network. This is likely an adequate assumption for intra-DC traffic,
where presumably the cost of additional protection capacity along
alternate paths is not prohibitive. Thus, there are likely no
additional requirements on NVO3 solutions to accommodate this type
of underlying network configuration and administration.

There is a similar case with ECMP, used intra-DC, where the failure
of a single component path of an ECMP group would result in traffic
shifting onto the surviving members of the ECMP group.
Unfortunately, there are no automatic recovery methods in IP routing
protocols to detect a simultaneous failure of more than one
component path in an ECMP group, operationally disable the entire
ECMP group, and allow traffic to shift onto alternative paths. This
problem is attributable to the underlying network and is thus out of
scope of any NVO3 solutions.
On the other hand, for inter-DC and DC-to-external-network cases
that use a WAN, the costs of the underlying network and/or service
(e.g. an IP VPN service) are more expensive; therefore, there is a
requirement on administrators to both: a) ensure high availability
(active-backup failover or active-active load-balancing); and b)
maintain substantial utilization of the WAN transport capacity at
nearly all times, particularly in the case of active-active load-
balancing. With respect to the data plane requirements of NVO3
solutions, in the case of active-backup failover, all of the ingress
NVEs need to dynamically adapt to the failure of an active NVE GW:
when the backup NVE GW announces itself into the NVO3 overlay
immediately following a failure of the previously active NVE GW, the
ingress NVEs update their forwarding tables accordingly (e.g.
perhaps through data plane learning and/or translation of a
gratuitous ARP or IPv6 Router Advertisement). Note that active-
backup failover could be used to accomplish a crude form of load-
balancing by, for example, manually configuring each tenant to use a
different NVE GW, in a round-robin fashion.

3.4.2.1. Load-balancing

When using active-active load-balancing across physically separate
NVE GWs (e.g. two separate chassis), an NVO3 solution SHOULD support
forwarding tables that can simultaneously map a single egress NVE to
more than one NVO3 tunnel. The granularity of such mappings, in both
active-backup and active-active, MUST be specific to each tenant.

3.4.2.2. Triangular Routing Issues (a.k.a. Traffic Tromboning)

An L2/ELAN over NVO3 service may span multiple racks distributed
across different DC regions. Multiple ELANs belonging to one tenant
may be interconnected or connected to the outside world through
multiple Router/VRF gateways distributed throughout the DC regions.
In this scenario, without aid from an NVO3 or other type of
solution, traffic from an ingress NVE destined to external gateways
will take a non-optimal path that results in higher latency and
costs (since it uses more expensive resources of a WAN). In the case
of traffic from an IP/MPLS network destined toward the entrance to
an NVO3 overlay, well-known IP routing techniques MAY be used to
optimize traffic into the NVO3 overlay (at the expense of additional
routes in the IP/MPLS network). In summary, these issues are well
known as triangular routing.

Procedures for gateway selection to avoid triangular routing issues
SHOULD be provided.

The details of such procedures are, most likely, part of the NVO3
Management and/or Control Plane requirements and, thus, out of scope
of this document. However, a key requirement on the data plane of
any NVO3 solution to avoid triangular routing is stated above, in
Section 3.4.2, with respect to active-active load-balancing. More
specifically, an NVO3 solution SHOULD support forwarding tables that
can simultaneously map a single egress NVE to more than one NVO3
tunnel.

The expectation is that, through the Control and/or Management
Planes, this mapping information may be dynamically manipulated to,
for example, provide the closest geographic and/or topological exit
point (egress NVE) for each ingress NVE.
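A minimal, non-normative sketch of such a forwarding table follows:
one logical egress NVE GW is reachable over more than one NVO3
tunnel, and a per-flow hash keeps packet order while spreading load.
Tunnel endpoints and tenant names are illustrative assumptions; how
the entries are populated is out of scope of this document.

   # Sketch of a per-tenant forwarding table in which one egress
   # NVE GW maps to several NVO3 tunnels (active-active).

   import zlib

   # tenant -> egress GW -> list of candidate tunnel endpoints
   gw_tunnels = {
       "tenant-a": {"gw-east": ["198.51.100.1", "198.51.100.2"]},
   }

   def select_tunnel(tenant, egress_gw, inner_flow):
       """Pick one tunnel per flow so packet order is preserved."""
       tunnels = gw_tunnels[tenant][egress_gw]
       h = zlib.crc32("|".join(map(str, inner_flow)).encode())
       return tunnels[h % len(tunnels)]

   flow = ("10.1.1.1", "192.0.2.99", 6, 40000, 80)
   print(select_tunnel("tenant-a", "gw-east", flow))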
3.5. Path MTU

The tunnel overlay header can cause the MTU of the path to the
egress tunnel endpoint to be exceeded.

IP fragmentation SHOULD be avoided for performance reasons.

The interface MTU as seen by a Tenant System SHOULD be adjusted such
that no fragmentation is needed. This can be achieved by
configuration or be discovered dynamically.

At least one of the following options MUST be supported:

o Classical ICMP-based Path MTU Discovery [RFC1191] [RFC1981] or
  Extended Path MTU Discovery techniques such as defined in
  [RFC4821]

o Segmentation and reassembly support from the overlay layer
  operations without relying on the Tenant Systems to know about
  the end-to-end MTU

o The underlay network MAY be designed in such a way that the MTU
  can accommodate the extra tunnel overhead.

3.6. Hierarchical NVE

It might be desirable to support the concept of hierarchical NVEs,
such as spoke NVEs and hub NVEs, in order to address possible NVE
performance limitations and service connectivity optimizations.

For instance, spoke NVE functionality may be used when processing
capabilities are limited. A hub NVE would provide additional data
processing capabilities such as packet replication.

NVEs can be connected in either an any-to-any or a hub-and-spoke
topology on a per-VNI basis.

3.7. NVE Multi-Homing Requirements

Multi-homing techniques SHOULD be used to increase the reliability
of an NVO3 network. It is also important to ensure that physical
diversity in an NVO3 network is taken into account to avoid single
points of failure.

Multi-homing can be enabled in various nodes, from Tenant Systems
into ToRs, ToRs into core switches/routers, and core nodes into DC
GWs.

Tenant Systems can be either L2 or L3 nodes. In the former case
(L2), techniques such as LAG or STP, for instance, MAY be used. In
the latter case (L3), it is possible that no dynamic routing
protocol is enabled. Tenant Systems can be multi-homed into remote
NVEs using several interfaces (physical NICs or vNICs), with an IP
address per interface, either to the same NVO3 network or into
different NVO3 networks. When one of the links fails, the
corresponding IP address is no longer reachable, but the other
interfaces can still be used. When a Tenant System is co-located
with an NVE, IP routing can be relied upon to handle routing over
diverse links to ToRs.

External connectivity MAY be handled by two or more NVO3 gateways.
Each gateway is connected to a different domain (e.g. ISP) and runs
BGP multi-homing. They serve as an access point to external networks
such as VPNs or the Internet. When a connection to an upstream
router is lost, the alternative connection is used and the failed
route is withdrawn.

3.8. Other considerations

3.8.1. Data Plane Optimizations

Data plane forwarding and encapsulation choices SHOULD consider the
limitations of possible NVE implementations, specifically in
software-based implementations (e.g. servers running VSwitches).

NVEs SHOULD provide efficient processing of traffic. For instance,
packet alignment, the use of offsets to minimize header parsing, and
padding techniques SHOULD be considered when designing NVO3
encapsulation types.

The NVO3 encapsulation/decapsulation processing in software-based
NVEs SHOULD make use of hardware assist provided by NICs in order to
speed up packet processing.
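To illustrate why fixed offsets and aligned fields help a software
NVE, the sketch below extracts the VN context with a single unpack
at a known offset instead of iterative header parsing. The 8-byte
header layout used here is purely hypothetical and is not a proposed
NVO3 encapsulation.

   # Sketch: parsing a fixed-offset, 8-byte (word-aligned) overlay
   # header in a software NVE.  The layout is hypothetical.

   import struct

   HDR_FMT = "!B3xI"          # flags (1 byte), 3 pad bytes, 32-bit word
   HDR_LEN = struct.calcsize(HDR_FMT)   # 8 bytes, payload stays aligned

   def parse_overlay_header(packet, offset=0):
       """Return (flags, vn_context, payload) from a fixed-offset header."""
       flags, word = struct.unpack_from(HDR_FMT, packet, offset)
       vn_context = word >> 8            # upper 24 bits carry the VNID
       payload = packet[offset + HDR_LEN:]
       return flags, vn_context, payload

   pkt = struct.pack(HDR_FMT, 0x08, 5001 << 8) + b"tenant frame"
   print(parse_overlay_header(pkt))      # (8, 5001, b'tenant frame')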
3.8.2. NVE location trade-offs

In the case of DC traffic, traffic originated from a VM is native
Ethernet traffic. This traffic can be switched by a local VM switch
or ToR switch and then by a DC gateway. The NVE function can be
embedded within any of these elements.

The NVE function can be supported in various DC network elements,
such as a VM, VM switch, ToR switch or DC GW.

The following criteria SHOULD be considered when deciding where the
NVE processing boundary happens:

o Processing and memory requirements

o Datapath (e.g. lookups, filtering, encapsulation/decapsulation)

o Control plane processing (e.g. routing, signaling, OAM)

o FIB/RIB size

o Multicast support

o Routing protocols

o Packet replication capability

o Fragmentation support

o QoS transparency

o Resiliency

4. Security Considerations

This requirements document does not, in itself, raise any specific
security issues.

5. IANA Considerations

IANA does not need to take any action for this draft.

6. References

6.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2. Informative References

[NVOPS]   Narten, T. et al, "Problem Statement: Overlays for Network
          Virtualization", draft-narten-nvo3-overlay-problem-
          statement (work in progress).

[NVO3-framework] Lasserre, M. et al, "Framework for DC Network
          Virtualization", draft-lasserre-nvo3-framework (work in
          progress).

[OVCPREQ] Kreeger, L. et al, "Network Virtualization Overlay Control
          Protocol Requirements", draft-kreeger-nvo3-overlay-cp
          (work in progress).

[FLOYD]   Floyd, S. and A. Romanow, "Dynamics of TCP Traffic over
          ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995.

[RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
          Networks (VPNs)", RFC 4364, February 2006.

[RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

[RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC 1981,
          August 1996.

[RFC4821] Mathis, M. et al, "Packetization Layer Path MTU
          Discovery", RFC 4821, March 2007.

[RFC2983] Black, D., "Differentiated Services and Tunnels",
          RFC 2983, October 2000.

[RFC6040] Briscoe, B., "Tunnelling of Explicit Congestion
          Notification", RFC 6040, November 2010.

[RFC6438] Carpenter, B. et al, "Using the IPv6 Flow Label for Equal
          Cost Multipath Routing and Link Aggregation in Tunnels",
          RFC 6438, November 2011.

[RFC6391] Bryant, S. et al, "Flow-Aware Transport of Pseudowires
          over an MPLS Packet Switched Network", RFC 6391, November
          2011.

7. Acknowledgments

In addition to the authors, the following people have contributed to
this document:

Shane Amante, Dimitrios Stiliadis, Rotem Salomonovitch, Larry
Kreeger, and Eric Gray.

This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

Nabil Bitar
Verizon
40 Sylvan Road
Waltham, MA 02145
Email: nabil.bitar@verizon.com

Marc Lasserre
Alcatel-Lucent
Email: marc.lasserre@alcatel-lucent.com

Florin Balus
Alcatel-Lucent
777 E. Middlefield Road
Mountain View, CA, USA 94043
Email: florin.balus@alcatel-lucent.com

Thomas Morin
France Telecom Orange
Email: thomas.morin@orange.com

Lizhong Jin
Email: lizho.jin@gmail.com

Bhumip Khasnabish
ZTE
Email: Bhumip.khasnabish@zteusa.com