2 Internet Engineering Task Force D. Black 3 Internet-Draft EMC 4 Intended status: Informational J. Hudson 5 Expires: October 23, 2016 Independent 6 L. Kreeger 7 Cisco 8 M. Lasserre 9 Independent 10 T. Narten 11 IBM 12 April 21, 2016 14 An Architecture for Data Center Network Virtualization Overlays (NVO3) 15 draft-ietf-nvo3-arch-06 17 Abstract 19 This document presents a high-level overview architecture for 20 building data center network virtualization overlay (NVO3) networks. 21 The architecture is given at a high-level, showing the major 22 components of an overall system. An important goal is to divide the 23 space into individual smaller components that can be implemented 24 independently and with clear interfaces and interactions with other 25 components. It should be possible to build and implement individual 26 components in isolation and have them work with other components with 27 no changes to other components. That way implementers have 28 flexibility in implementing individual components and can optimize 29 and innovate within their respective components without requiring 30 changes to other components. 32 Status of This Memo 34 This Internet-Draft is submitted in full conformance with the 35 provisions of BCP 78 and BCP 79. 37 Internet-Drafts are working documents of the Internet Engineering 38 Task Force (IETF). Note that other groups may also distribute 39 working documents as Internet-Drafts. The list of current Internet- 40 Drafts is at http://datatracker.ietf.org/drafts/current/. 42 Internet-Drafts are draft documents valid for a maximum of six months 43 and may be updated, replaced, or obsoleted by other documents at any 44 time. It is inappropriate to use Internet-Drafts as reference 45 material or to cite them other than as "work in progress." 47 This Internet-Draft will expire on October 23, 2016. 49 Copyright Notice 51 Copyright (c) 2016 IETF Trust and the persons identified as the 52 document authors. All rights reserved.
54 This document is subject to BCP 78 and the IETF Trust's Legal 55 Provisions Relating to IETF Documents 56 (http://trustee.ietf.org/license-info) in effect on the date of 57 publication of this document. Please review these documents 58 carefully, as they describe your rights and restrictions with respect 59 to this document. Code Components extracted from this document must 60 include Simplified BSD License text as described in Section 4.e of 61 the Trust Legal Provisions and are provided without warranty as 62 described in the Simplified BSD License. 64 Table of Contents 66 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 67 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 68 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 4 69 3.1. VN Service (L2 and L3) . . . . . . . . . . . . . . . . . 5 70 3.1.1. VLAN Tags in L2 Service . . . . . . . . . . . . . . . 7 71 3.1.2. TTL Considerations . . . . . . . . . . . . . . . . . 7 72 3.2. Network Virtualization Edge (NVE) . . . . . . . . . . . . 7 73 3.3. Network Virtualization Authority (NVA) . . . . . . . . . 9 74 3.4. VM Orchestration Systems . . . . . . . . . . . . . . . . 9 75 4. Network Virtualization Edge (NVE) . . . . . . . . . . . . . . 11 76 4.1. NVE Co-located With Server Hypervisor . . . . . . . . . . 11 77 4.2. Split-NVE . . . . . . . . . . . . . . . . . . . . . . . . 12 78 4.2.1. Tenant VLAN handling in Split-NVE Case . . . . . . . 12 79 4.3. NVE State . . . . . . . . . . . . . . . . . . . . . . . . 13 80 4.4. Multi-Homing of NVEs . . . . . . . . . . . . . . . . . . 14 81 4.5. VAP . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 82 5. Tenant System Types . . . . . . . . . . . . . . . . . . . . . 15 83 5.1. Overlay-Aware Network Service Appliances . . . . . . . . 15 84 5.2. Bare Metal Servers . . . . . . . . . . . . . . . . . . . 15 85 5.3. Gateways . . . . . . . . . . . . . . . . . . . . . . . . 16 86 5.3.1. Gateway Taxonomy . . . . . . . . . . . . . . . . . . 16 87 5.3.1.1. L2 Gateways (Bridging) . . . . . . . . . . . . . 16 88 5.3.1.2. L3 Gateways (Only IP Packets) . . . . . . . . . . 17 89 5.4. Distributed Inter-VN Gateways . . . . . . . . . . . . . . 17 90 5.5. ARP and Neighbor Discovery . . . . . . . . . . . . . . . 18 91 6. NVE-NVE Interaction . . . . . . . . . . . . . . . . . . . . . 18 92 7. Network Virtualization Authority . . . . . . . . . . . . . . 20 93 7.1. How an NVA Obtains Information . . . . . . . . . . . . . 20 94 7.2. Internal NVA Architecture . . . . . . . . . . . . . . . . 21 95 7.3. NVA External Interface . . . . . . . . . . . . . . . . . 21 96 8. NVE-to-NVA Protocol . . . . . . . . . . . . . . . . . . . . . 22 97 8.1. NVE-NVA Interaction Models . . . . . . . . . . . . . . . 23 98 8.2. Direct NVE-NVA Protocol . . . . . . . . . . . . . . . . . 23 99 8.3. Propagating Information Between NVEs and NVAs . . . . . . 24 100 9. Federated NVAs . . . . . . . . . . . . . . . . . . . . . . . 25 101 9.1. Inter-NVA Peering . . . . . . . . . . . . . . . . . . . . 27 102 10. Control Protocol Work Areas . . . . . . . . . . . . . . . . . 28 103 11. NVO3 Data Plane Encapsulation . . . . . . . . . . . . . . . . 28 104 12. Operations and Management . . . . . . . . . . . . . . . . . . 28 105 13. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 106 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 29 107 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 108 16. Security Considerations . . . . . . . . . . . . . . . . . . . 
29 109 17. Informative References . . . . . . . . . . . . . . . . . 29 110 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31 112 1. Introduction 114 This document presents a high-level architecture for building data 115 center network virtualization overlay (NVO3) networks. The 116 architecture is given at a high-level, showing the major components 117 of an overall system. An important goal is to divide the space into 118 smaller individual components that can be implemented independently 119 and with clear interfaces and interactions with other components. It 120 should be possible to build and implement individual components in 121 isolation and have them work with other components with no changes to 122 other components. That way implementers have flexibility in 123 implementing individual components and can optimize and innovate 124 within their respective components without necessarily requiring 125 changes to other components. 127 The motivation for overlay networks is given in "Problem Statement: 128 Overlays for Network Virtualization" [RFC7364]. "Framework for DC 129 Network Virtualization" [RFC7365] provides a framework for discussing 130 overlay networks generally and the various components that must work 131 together in building such systems. This document differs from the 132 framework document in that it doesn't attempt to cover all possible 133 approaches within the general design space. Rather, it describes one 134 particular approach. 136 2. Terminology 138 This document uses the same terminology as [RFC7365]. In addition, 139 the following terms are used: 141 NV Domain A Network Virtualization Domain is an administrative 142 construct that defines a Network Virtualization Authority (NVA), 143 the set of Network Virtualization Edges (NVEs) associated with 144 that NVA, and the set of virtual networks the NVA manages and 145 supports. NVEs are associated with a (logically centralized) NVA, 146 and an NVE supports communication for any of the virtual networks 147 in the domain. 149 NV Region A region over which information about a set of virtual 150 networks is shared. The degenerate case of a single NV Domain 151 corresponds to an NV Region consisting of just that domain. The 152 more interesting case occurs when two or more NV Domains share 153 information about part or all of a set of virtual networks that 154 they manage. Two NVAs share information about particular virtual 155 networks for the purpose of supporting connectivity between 156 tenants located in different NV Domains. NVAs can share 157 information about an entire NV domain, or just individual virtual 158 networks. 160 Tenant System Interface (TSI) The interface to a Virtual Network as 161 presented to a Tenant System. The TSI logically connects to the 162 NVE via a Virtual Access Point (VAP). To the Tenant System, the 163 TSI is like a Network Interface Card (NIC); the TSI presents 164 itself to a Tenant System as a normal network interface. 166 VLAN Unless stated otherwise, the terms VLAN and VLAN Tag are used 167 in this document to denote a C-VLAN [IEEE-802.1Q], and the terms are 168 used interchangeably to improve readability. 170 3. Background 172 Overlay networks are an approach for providing network virtualization 173 services to a set of Tenant Systems (TSs) [RFC7365]. With overlays, 174 data traffic between tenants is tunneled across the underlying data 175 center's IP network.
The use of tunnels provides a number of 176 benefits by decoupling the network as viewed by tenants from the 177 underlying physical network across which they communicate. 179 Tenant Systems connect to Virtual Networks (VNs), with each VN having 180 associated attributes defining properties of the network, such as the 181 set of members that connect to it. Tenant Systems connected to a 182 virtual network typically communicate freely with other Tenant 183 Systems on the same VN, but communication between Tenant Systems on 184 one VN and those external to the VN (whether on another VN or 185 connected to the Internet) is carefully controlled and governed by 186 policy. The NVO3 architecture does not impose any restrictions to 187 the application of policy controls even within a VN. 189 A Network Virtualization Edge (NVE) [RFC7365] is the entity that 190 implements the overlay functionality. An NVE resides at the boundary 191 between a Tenant System and the overlay network as shown in Figure 1. 193 An NVE creates and maintains local state about each Virtual Network 194 for which it is providing service on behalf of a Tenant System. 196 +--------+ +--------+ 197 | Tenant +--+ +----| Tenant | 198 | System | | (') | System | 199 +--------+ | ................ ( ) +--------+ 200 | +-+--+ . . +--+-+ (_) 201 | | NVE|--. .--| NVE| | 202 +--| | . . | |---+ 203 +-+--+ . . +--+-+ 204 / . . 205 / . L3 Overlay . +--+-++--------+ 206 +--------+ / . Network . | NVE|| Tenant | 207 | Tenant +--+ . .- -| || System | 208 | System | . . +--+-++--------+ 209 +--------+ ................ 210 | 211 +----+ 212 | NVE| 213 | | 214 +----+ 215 | 216 | 217 ===================== 218 | | 219 +--------+ +--------+ 220 | Tenant | | Tenant | 221 | System | | System | 222 +--------+ +--------+ 224 Figure 1: NVO3 Generic Reference Model 226 The following subsections describe key aspects of an overlay system 227 in more detail. Section 3.1 describes the service model (Ethernet 228 vs. IP) provided to Tenant Systems. Section 3.2 describes NVEs in 229 more detail. Section 3.3 introduces the Network Virtualization 230 Authority, from which NVEs obtain information about virtual networks. 231 Section 3.4 provides background on Virtual Machine (VM) orchestration 232 systems and their use of virtual networks. 234 3.1. VN Service (L2 and L3) 236 A Virtual Network provides either L2 or L3 service to connected 237 tenants. For L2 service, VNs transport Ethernet frames, and a Tenant 238 System is provided with a service that is analogous to being 239 connected to a specific L2 C-VLAN. L2 broadcast frames are generally 240 delivered to all (and multicast frames delivered to a subset of) the 241 other Tenant Systems on the VN. To a Tenant System, it appears as if 242 they are connected to a regular L2 Ethernet link. Within the NVO3 243 architecture, tenant frames are tunneled to remote NVEs based on the 244 MAC addresses of the frame headers as originated by the Tenant 245 System. On the underlay, NVO3 packets are forwarded between NVEs 246 based on the outer addresses of tunneled packets. 248 For L3 service, VNs transport IP datagrams, and a Tenant System is 249 provided with a service that only supports IP traffic. Within the 250 NVO3 architecture, tenant frames are tunneled to remote NVEs based on 251 the IP addresses of the packet originated by the Tenant System; any 252 L2 destination addresses provided by Tenant Systems are effectively 253 ignored by the NVEs and overlay network. 
For L3 service, the Tenant 254 System will be configured with an IP subnet that is effectively a 255 point-to-point link, i.e., having only the Tenant System and a next- 256 hop router address on it. 258 L2 service is intended for systems that need native L2 Ethernet 259 service and the ability to run protocols directly over Ethernet 260 (i.e., not based on IP). L3 service is intended for systems in which 261 all the traffic can safely be assumed to be IP. It is important to 262 note that whether an NVO3 network provides L2 or L3 service to a 263 Tenant System, the Tenant System does not generally need to be aware 264 of the distinction. In both cases, the virtual network presents 265 itself to the Tenant System as an L2 Ethernet interface. An Ethernet 266 interface is used in both cases simply as a widely supported 267 interface type that essentially all Tenant Systems already support. 268 Consequently, no special software is needed on Tenant Systems to use 269 an L3 vs. an L2 overlay service. 271 NVO3 can also provide a combined L2 and L3 service to tenants. A 272 combined service provides L2 service for intra-VN communication, but 273 also provides L3 service for L3 traffic entering or leaving the VN. 274 Architecturally, the handling of a combined L2/L3 service within the 275 NVO3 architecture is intended to match what is commonly done today in 276 non-overlay environments by devices providing a combined bridge/ 277 router service. With combined service, the virtual network itself 278 retains the semantics of L2 service and all traffic is processed 279 according to its L2 semantics. In addition, however, traffic 280 requiring IP processing is also processed at the IP level. 282 The IP processing for a combined service can be implemented on a 283 standalone device attached to the virtual network (e.g., an IP 284 router) or implemented locally on the NVE (see Section 5.4 on 285 Distributed Gateways). For unicast traffic, NVE implementation of a 286 combined service may result in a packet being delivered to another TS 287 attached to the same NVE (on either the same or a different VN) or 288 tunneled to a remote NVE, or even forwarded outside the NVO3 domain. 289 For multicast or broadcast packets, the combination of NVE L2 and L3 290 processing may result in copies of the packet receiving both L2 and 291 L3 treatments to realize delivery to all of the destinations 292 involved. This distributed NVE implementation of IP routing results 293 in the same network delivery behavior as if the L2 processing of the 294 packet included delivery of the packet to an IP router attached to 295 the L2 VN as a TS, with the router having additional network 296 attachments to other networks, either virtual or not. 298 3.1.1. VLAN Tags in L2 Service 300 An NVO3 L2 virtual network service may include encapsulated L2 VLAN 301 tags provided by a Tenant System, but does not use encapsulated tags 302 in deciding where and how to forward traffic. Such VLAN tags can be 303 passed through, so that Tenant Systems that send or expect to receive 304 them can be supported as appropriate. 306 The processing of VLAN tags that an NVE receives from a TS is 307 controlled by settings associated with the VAP. Just as in the case 308 with ports on Ethernet switches, a number of settings could be 309 imagined. For example, C-TAGs can be passed through transparently, 310 they could always be stripped upon receipt from a Tenant System, they 311 could be compared against a list of explicitly configured tags, etc. 
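As a non-normative illustration of the kinds of per-VAP settings discussed above, the following sketch (in Python) models three possible C-TAG handling modes: transparent pass-through, stripping on receipt from the Tenant System, and filtering against an explicitly configured list. The class, field, and mode names are assumptions of this sketch only and are not defined by the NVO3 architecture.

      # Non-normative sketch of per-VAP C-TAG handling settings.
      from dataclasses import dataclass, field
      from enum import Enum
      from typing import Optional, Set

      class CTagMode(Enum):
          PASS_THROUGH = "pass-through"  # carry tenant C-TAGs across the VN
          STRIP = "strip"                # always remove C-TAGs from the TS
          FILTER = "filter"              # accept only explicitly listed tags

      @dataclass
      class VapTagPolicy:
          mode: CTagMode = CTagMode.PASS_THROUGH
          allowed_ctags: Set[int] = field(default_factory=set)  # for FILTER

          def process(self, ctag: Optional[int]) -> Optional[int]:
              """Return the C-TAG to carry in the encapsulated frame, if any."""
              if self.mode is CTagMode.STRIP:
                  return None
              if self.mode is CTagMode.FILTER:
                  return ctag if ctag in self.allowed_ctags else None
              return ctag  # PASS_THROUGH: forward the tag unchanged

A VAP configured with the "strip" mode, for example, corresponds to the case described above in which tags are always removed upon receipt from a Tenant System.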
313 Note that the handling of C-VIDs has additional complications, as 314 described in Section 4.2.1 below. 316 3.1.2. TTL Considerations 318 For L3 service, Tenant Systems should expect the TTL of the packets 319 they send to be decremented by at least 1. For L2 service, the TTL 320 on packets (when the packet is IP) is not modified. The underlay 321 network manages the TTLs in the outer IP encapsulation (which could 322 be independent from or related to the TTL in the tenant IP packets). 324 3.2. Network Virtualization Edge (NVE) 326 Tenant Systems connect to NVEs via a Tenant System Interface (TSI). 327 The TSI logically connects to the NVE via a Virtual Access Point 328 (VAP) and each VAP is associated with one Virtual Network as shown in 329 Figure 2. To the Tenant System, the TSI is like a NIC; the TSI 330 presents itself to a Tenant System as a normal network interface. On 331 the NVE side, a VAP is a logical network port (virtual or physical) 332 into a specific virtual network. Note that two different Tenant 333 Systems (and TSIs) attached to a common NVE can share a VAP (e.g., 334 TS1 and TS2 in Figure 2) so long as they connect to the same Virtual 335 Network. 337 | Data Center Network (IP) | 338 | | 339 +-----------------------------------------+ 340 | | 341 | Tunnel Overlay | 342 +------------+---------+ +---------+------------+ 343 | +----------+-------+ | | +-------+----------+ | 344 | | Overlay Module | | | | Overlay Module | | 345 | +---------+--------+ | | +---------+--------+ | 346 | | | | | | 347 NVE1 | | | | | | NVE2 348 | +--------+-------+ | | +--------+-------+ | 349 | | VNI1 VNI2 | | | | VNI1 VNI2 | | 350 | +-+----------+---+ | | +-+-----------+--+ | 351 | | VAP1 | VAP2 | | | VAP1 | VAP2| 352 +----+----------+------+ +----+-----------+-----+ 353 | | | | 354 |\ | | | 355 | \ | | /| 356 -------+--\-------+-------------------+---------/-+------- 357 | \ | Tenant | / | 358 TSI1 |TSI2\ | TSI3 TSI1 TSI2/ TSI3 359 +---+ +---+ +---+ +---+ +---+ +---+ 360 |TS1| |TS2| |TS3| |TS4| |TS5| |TS6| 361 +---+ +---+ +---+ +---+ +---+ +---+ 363 Figure 2: NVE Reference Model 365 The Overlay Module performs the actual encapsulation and 366 decapsulation of tunneled packets. The NVE maintains state about the 367 virtual networks it is a part of so that it can provide the Overlay 368 Module with such information as the destination address of the NVE to 369 tunnel a packet to, or the Context ID that should be placed in the 370 encapsulation header to identify the virtual network that a tunneled 371 packet belongs to. 373 On the data center network side, the NVE sends and receives native IP 374 traffic. When ingressing traffic from a Tenant System, the NVE 375 identifies the egress NVE to which the packet should be sent, adds an 376 overlay encapsulation header, and sends the packet on the underlay 377 network. When receiving traffic from a remote NVE, an NVE strips off 378 the encapsulation header, and delivers the (original) packet to the 379 appropriate Tenant System. When the source and destination Tenant 380 System are on the same NVE, no encapsulation is needed and the NVE 381 forwards traffic directly. 383 Conceptually, the NVE is a single entity implementing the NVO3 384 functionality. In practice, there are a number of different 385 implementation scenarios, as described in detail in Section 4. 387 3.3. 
Network Virtualization Authority (NVA) 389 Address dissemination refers to the process of learning, building and 390 distributing the mapping/forwarding information that NVEs need in 391 order to tunnel traffic to each other on behalf of communicating 392 Tenant Systems. For example, in order to send traffic to a remote 393 Tenant System, the sending NVE must know the destination NVE for that 394 Tenant System. 396 One way to build and maintain mapping tables is to use learning, as 397 802.1 bridges do [IEEE-802.1Q]. When forwarding traffic to multicast 398 or unknown unicast destinations, an NVE could simply flood traffic. 399 While flooding works, it can lead to traffic hot spots and can lead 400 to problems in larger networks (e.g., excessive amounts of flooded 401 traffic). 403 Alternatively, to reduce the scope of where flooding must take place, 404 or to eliminate it altogether, NVEs can make use of a Network 405 Virtualization Authority (NVA). An NVA is the entity that provides 406 address mapping and other information to NVEs. NVEs interact with an 407 NVA to obtain any required address mapping information they need in 408 order to properly forward traffic on behalf of tenants. The term NVA 409 refers to the overall system, without regard to its scope or how it 410 is implemented. NVAs provide a service, and NVEs access that service 411 via an NVE-to-NVA protocol as discussed in Section 8. 413 Even when an NVA is present, Ethernet bridge MAC address learning 414 could be used as a fallback mechanism, should the NVA be unable to 415 provide an answer or for other reasons. This document does not 416 consider flooding approaches in detail, as there are a number of 417 benefits in using an approach that depends on the presence of an NVA. 419 For the rest of this document, it is assumed that an NVA exists and 420 will be used. NVAs are discussed in more detail in Section 7. 422 3.4. VM Orchestration Systems 424 VM orchestration systems manage server virtualization across a set of 425 servers. Although VM management is a separate topic from network 426 virtualization, the two areas are closely related. Managing the 427 creation, placement, and movement of VMs also involves creating, 428 attaching to and detaching from virtual networks. A number of 429 existing VM orchestration systems have incorporated aspects of 430 virtual network management into their systems. 432 Note also that although this section uses the terms "VM" and 433 "hypervisor" throughout, the same issues apply to other 434 virtualization approaches, including Linux Containers (LXC), BSD 435 Jails, Network Service Appliances as discussed in Section 5.1, etc. 436 From an NVO3 perspective, it should be assumed that where the 437 document uses the terms "VM" and "hypervisor", the intention is that 438 the discussion also applies to other systems, where, e.g., the host 439 operating system plays the role of the hypervisor in supporting 440 virtualization, and a container plays the equivalent role as a VM. 442 When a new VM image is started, the VM orchestration system 443 determines where the VM should be placed, interacts with the 444 hypervisor on the target server to load and start the VM, and controls 445 when a VM should be shut down or migrated elsewhere. VM orchestration 446 systems also have knowledge about how a VM should connect to a 447 network, possibly including the name of the virtual network to which 448 a VM is to connect.
The VM orchestration system can pass such 449 information to the hypervisor when a VM is instantiated. VM 450 orchestration systems have significant (and sometimes global) 451 knowledge over the domain they manage. They typically know on what 452 servers a VM is running, and metadata associated with VM images can 453 be useful from a network virtualization perspective. For example, 454 the metadata may include the addresses (MAC and IP) the VMs will use 455 and the name(s) of the virtual network(s) they connect to. 457 VM orchestration systems run a protocol with an agent running on the 458 hypervisor of the servers they manage. That protocol can also carry 459 information about what virtual network a VM is associated with. When 460 the orchestrator instantiates a VM on a hypervisor, the hypervisor 461 interacts with the NVE in order to attach the VM to the virtual 462 networks it has access to. In general, the hypervisor will need to 463 communicate significant VM state changes to the NVE. In the reverse 464 direction, the NVE may need to communicate network connectivity 465 information back to the hypervisor. Example VM orchestration systems 466 in use today include VMware's vCenter Server, Microsoft's System 467 Center Virtual Machine Manager, and systems based on OpenStack and 468 its associated plugins (e.g., Nova and Neutron). Each can pass 469 information about what virtual networks a VM connects to down to the 470 hypervisor. The protocol used between the VM orchestration system 471 and hypervisors is generally proprietary. 473 It should be noted that VM orchestration systems may not have direct 474 access to all networking-related information a VM uses. For example, 475 a VM may make use of additional IP or MAC addresses that the VM 476 management system is not aware of. 478 4. Network Virtualization Edge (NVE) 480 As introduced in Section 3.2, an NVE is the entity that implements the 481 overlay functionality. This section describes NVEs in more detail. 482 An NVE will have two external interfaces: 484 Tenant System Facing: On the Tenant System facing side, an NVE 485 interacts with the hypervisor (or equivalent entity) to provide 486 the NVO3 service. An NVE will need to be notified when a Tenant 487 System "attaches" to a virtual network (so it can validate the 488 request and set up any state needed to send and receive traffic on 489 behalf of the Tenant System on that VN). Likewise, an NVE will 490 need to be informed when the Tenant System "detaches" from the 491 virtual network so that it can reclaim state and resources 492 appropriately. 494 Data Center Network Facing: On the data center network facing side, 495 an NVE interfaces with the data center underlay network, sending 496 and receiving tunneled TS packets to and from the underlay. The 497 NVE may also run a control protocol with other entities on the 498 network, such as the Network Virtualization Authority. 500 4.1. NVE Co-located With Server Hypervisor 502 When server virtualization is used, the entire NVE functionality will 503 typically be implemented as part of the hypervisor and/or virtual 504 switch on the server. In such cases, the Tenant System interacts 505 with the hypervisor and the hypervisor interacts with the NVE. 506 Because the interaction between the hypervisor and NVE is implemented 507 entirely in software on the server, there is no "on-the-wire" 508 protocol between Tenant Systems (or the hypervisor) and the NVE that 509 needs to be standardized.
While there may be APIs between the NVE 510 and hypervisor to support necessary interaction, the details of such 511 an API are not in scope for the IETF to work on. 513 Implementing NVE functionality entirely on a server has the 514 disadvantage that server CPU resources must be spent implementing the 515 NVO3 functionality. Experimentation with overlay approaches and 516 previous experience with TCP and checksum adapter offloads suggest 517 that offloading certain NVE operations (e.g., encapsulation and 518 decapsulation operations) onto the physical network adapter can 519 produce performance advantages. As has been done with checksum and/ 520 or TCP server offload and other optimization approaches, there may be 521 benefits to offloading common operations onto adapters where 522 possible. Just as important, the addition of an overlay header can 523 disable existing adapter offload capabilities that are generally not 524 prepared to handle the addition of a new header or other operations 525 associated with an NVE. 527 While the exact details of how to split the implementation of 528 specific NVE functionality between a server and its network adapters 529 are an implementation matter and outside the scope of IETF 530 standardization, the NVO3 architecture should be cognizant of and 531 support such separation. Ideally, it may even be possible to bypass 532 the hypervisor completely on critical data path operations so that 533 packets between a TS and its VN can be sent and received without 534 having the hypervisor involved in each individual packet operation. 536 4.2. Split-NVE 538 Another possible scenario leads to the need for a split NVE 539 implementation. An NVE running on a server (e.g., within a 540 hypervisor) could support NVO3 service towards the tenant, but not 541 perform all NVE functions (e.g., encapsulation) directly on the 542 server; some of the actual NVO3 functionality could be implemented on 543 (i.e., offloaded to) an adjacent switch to which the server is 544 attached. While one could imagine a number of link types between a 545 server and the NVE, one simple deployment scenario would involve a 546 server and NVE separated by a simple L2 Ethernet link. A more 547 complicated scenario would have the server and NVE separated by a 548 bridged access network, such as when the NVE resides on a ToR, with 549 an embedded switch residing between servers and the ToR. 551 For the split NVE case, protocols will be needed that allow the 552 hypervisor and NVE to negotiate and set up the necessary state so that 553 traffic sent across the access link between a server and the NVE can 554 be associated with the correct virtual network instance. 555 Specifically, on the access link, traffic belonging to a specific 556 Tenant System would be tagged with a specific VLAN C-TAG that 557 identifies which specific NVO3 virtual network instance it connects 558 to. The hypervisor-NVE protocol would negotiate which VLAN C-TAG to 559 use for a particular virtual network instance. More details of the 560 protocol requirements for functionality between hypervisors and NVEs 561 can be found in [I-D.ietf-nvo3-nve-nva-cp-req].
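As a non-normative illustration of the access-link state that such a negotiation could produce, the following Python sketch models a per-port table that maps locally significant VLAN C-TAGs on the server-facing link to NVO3 virtual network instances. The names and the simple tag-allocation strategy are assumptions of this sketch; they are not mandated by this architecture or by [I-D.ietf-nvo3-nve-nva-cp-req].

      # Non-normative sketch of split-NVE access-port state: which C-TAG on
      # the access link corresponds to which virtual network instance.
      from dataclasses import dataclass, field
      from typing import Dict, Optional

      @dataclass
      class SplitNveAccessPort:
          ctag_to_vn: Dict[int, int] = field(default_factory=dict)
          vn_to_ctag: Dict[int, int] = field(default_factory=dict)

          def bind(self, vn_id: int) -> int:
              """Allocate a free C-TAG for a VN when a TS attaches to it."""
              if vn_id in self.vn_to_ctag:
                  return self.vn_to_ctag[vn_id]
              ctag = next(t for t in range(2, 4095)
                          if t not in self.ctag_to_vn)  # naive allocation
              self.ctag_to_vn[ctag] = vn_id
              self.vn_to_ctag[vn_id] = ctag
              return ctag

          def vn_for_frame(self, ctag: Optional[int]) -> Optional[int]:
              """Map a frame received on the access link to its VN, if known."""
              return self.ctag_to_vn.get(ctag) if ctag is not None else None

In a real deployment, the C-TAG value used for each virtual network instance would be agreed between the hypervisor and the NVE by the hypervisor-NVE protocol rather than chosen unilaterally as in this sketch.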
563 4.2.1. Tenant VLAN handling in Split-NVE Case 565 Preserving tenant VLAN tags across an NVO3 VN as described in 566 Section 3.1.1 poses additional complications in the split-NVE case. 567 The portion of the NVE that performs the encapsulation function needs 568 access to the specific VLAN tags that the Tenant System is using in 569 order to include them in the encapsulated packet. When an NVE is 570 implemented entirely within the hypervisor, the NVE has access to the 571 complete original packet (including any VLAN tags) sent by the 572 tenant. In the split-NVE case, however, the VLAN tag used between 573 the hypervisor and offloaded portions of the NVE normally only 574 identifies the specific VN that traffic belongs to. In order to allow 575 a tenant to preserve VLAN information from end to end between TSs 576 in the split-NVE case, additional mechanisms would be needed (e.g., 577 carrying an additional VLAN tag by using both a C-TAG and an S-TAG as 578 specified in [IEEE-802.1Q]). 580 4.3. NVE State 582 NVEs maintain internal data structures and state to support the 583 sending and receiving of tenant traffic. An NVE may need some or all 584 of the following information: 586 1. An NVE keeps track of which attached Tenant Systems are connected 587 to which virtual networks. When a Tenant System attaches to a 588 virtual network, the NVE will need to create or update local 589 state for that virtual network. When the last Tenant System 590 detaches from a given VN, the NVE can reclaim state associated 591 with that VN. 593 2. For tenant unicast traffic, an NVE maintains a per-VN table of 594 mappings from Tenant System (inner) addresses to remote NVE 595 (outer) addresses. 597 3. For tenant multicast (or broadcast) traffic, an NVE maintains a 598 per-VN table of mappings and other information on how to deliver 599 tenant multicast (or broadcast) traffic. If the underlying 600 network supports IP multicast, the NVE could use IP multicast to 601 deliver tenant traffic. In such a case, the NVE would need to 602 know what IP underlay multicast address to use for a given VN. 603 Alternatively, if the underlying network does not support 604 multicast, a source NVE could use unicast replication to deliver 605 traffic. In such a case, an NVE would need to know which remote 606 NVEs are participating in the VN. An NVE could use both 607 approaches, switching from one mode to the other depending on 608 such factors as bandwidth efficiency and group membership 609 sparseness. [I-D.ietf-nvo3-mcast-framework] discusses the 610 subject of multicast handling in NVO3 in further detail. 612 4. An NVE maintains necessary information to encapsulate outgoing 613 traffic, including what type of encapsulation and what value to 614 use for a Context ID within the encapsulation header. 616 5. In order to deliver incoming encapsulated packets to the correct 617 Tenant Systems, an NVE maintains the necessary information to map 618 incoming traffic to the appropriate VAP (i.e., Tenant System 619 Interface). 621 6. An NVE may find it convenient to maintain additional per-VN 622 information such as QoS settings, Path MTU information, ACLs, 623 etc. 625 4.4. Multi-Homing of NVEs 627 NVEs may be multi-homed. That is, an NVE may have more than one IP 628 address associated with it on the underlay network. Multihoming 629 happens in two different scenarios. First, an NVE may have multiple 630 interfaces connecting it to the underlay. Each of those interfaces 631 will typically have a different IP address, resulting in a specific 632 Tenant Address (on a specific VN) being reachable through the same 633 NVE but through more than one underlay IP address.
Second, a 634 specific Tenant System may be reachable through more than one NVE, 635 each having one or more underlay addresses. In both cases, NVE 636 address mapping functionality needs to support one-to-many mappings 637 and enable a sending NVE to (at a minimum) be able to fail over from 638 one IP address to another, e.g., should a specific NVE underlay 639 address become unreachable. 641 Finally, multi-homed NVEs introduce complexities when source unicast 642 replication is used to implement tenant multicast as described in 643 Section 4.3. Specifically, an NVE should only receive one copy of a 644 replicated packet. 646 Multi-homing is needed to support important use cases. First, a bare 647 metal server may have multiple uplink connections to either the same 648 or different NVEs. Having only a single physical path to an upstream 649 NVE, or indeed, having all traffic flow through a single NVE, would be 650 considered unacceptable in highly-resilient deployment scenarios that 651 seek to avoid single points of failure. Moreover, in today's 652 networks, the availability of multiple paths would require that they 653 be usable in an active-active fashion (e.g., for load balancing). 655 4.5. VAP 657 The VAP is the NVE-side of the interface between the NVE and the TS. 658 Traffic to and from the tenant flows through the VAP. If an NVE runs 659 into difficulties sending traffic received on the VAP, it may need to 660 signal such errors back to the VAP. Because the VAP is an emulation 661 of a physical port, its ability to signal NVE errors is limited and 662 lacks sufficient granularity to reflect all possible errors an NVE 663 may encounter (e.g., inability to reach a particular destination). Some 664 errors, such as an NVE losing all of its connections to the underlay, 665 could be reflected back to the VAP by effectively disabling it. This 666 state change would reflect itself on the TS as an interface going 667 down, allowing the TS to implement interface error handling, e.g., 668 failover, in the same manner as when a physical interface becomes 669 disabled. 671 5. Tenant System Types 673 This section describes a number of special Tenant System types and 674 how they fit into an NVO3 system. 676 5.1. Overlay-Aware Network Service Appliances 678 Some Network Service Appliances [I-D.ietf-nvo3-nve-nva-cp-req] 679 (virtual or physical) provide tenant-aware services. That is, the 680 specific service they provide depends on the identity of the tenant 681 making use of the service. For example, firewalls are now becoming 682 available that support multi-tenancy where a single firewall provides 683 virtual firewall service on a per-tenant basis, using per-tenant 684 configuration rules and maintaining per-tenant state. Such 685 appliances will be aware of the VN an activity corresponds to while 686 processing requests. Unlike server virtualization, which shields VMs 687 from needing to know about multi-tenancy, a Network Service Appliance 688 may explicitly support multi-tenancy. In such cases, the Network 689 Service Appliance itself will be aware of network virtualization and 690 either embed an NVE directly, or implement a split NVE as described 691 in Section 4.2. Unlike server virtualization, however, the Network 692 Service Appliance may not be running a hypervisor and the VM 693 orchestration system may not interact with the Network Service 694 Appliance.
The NVE on such appliances will need to support a control 695 plane to obtain the necessary information needed to fully participate 696 in an NVO3 Domain. 698 5.2. Bare Metal Servers 700 Many data centers will continue to have at least some servers 701 operating as non-virtualized (or "bare metal") machines running a 702 traditional operating system and workload. In such systems, there 703 will be no NVE functionality on the server, and the server will have 704 no knowledge of NVO3 (including whether overlays are even in use). 705 In such environments, the NVE functionality can reside on the first- 706 hop physical switch. In such a case, the network administrator would 707 (manually) configure the switch to enable the appropriate NVO3 708 functionality on the switch port connecting the server and associate 709 that port with a specific virtual network. Such configuration would 710 typically be static, since the server is not virtualized, and once 711 configured, is unlikely to change frequently. Consequently, this 712 scenario does not require any protocol or standards work. 714 5.3. Gateways 716 Gateways on VNs relay traffic onto and off of a virtual network. 717 Tenant Systems use gateways to reach destinations outside of the 718 local VN. Gateways receive encapsulated traffic from one VN, remove 719 the encapsulation header, and send the native packet out onto the 720 data center network for delivery. Outside traffic enters a VN in a 721 reverse manner. 723 Gateways can be either virtual (i.e., implemented as a VM) or 724 physical (i.e., as a standalone physical device). For performance 725 reasons, standalone hardware gateways may be desirable in some cases. 726 Such gateways could consist of a simple switch forwarding traffic 727 from a VN onto the local data center network, or could embed router 728 functionality. On such gateways, network interfaces connecting to 729 virtual networks will (at least conceptually) embed NVE (or split- 730 NVE) functionality within them. As in the case with Network Service 731 Appliances, gateways may not support a hypervisor and will need an 732 appropriate control plane protocol to obtain the information needed 733 to provide NVO3 service. 735 Gateways handle several different use cases. For example, one use 736 case consists of systems supporting overlays together with systems 737 that do not (e.g., bare metal servers). Gateways could be used to 738 connect legacy systems supporting, e.g., L2 VLANs, to specific 739 virtual networks, effectively making them part of the same virtual 740 network. Gateways could also forward traffic between a virtual 741 network and other hosts on the data center network or relay traffic 742 between different VNs. Finally, gateways can provide external 743 connectivity such as Internet or VPN access. 745 5.3.1. Gateway Taxonomy 747 As can be seen from the discussion above, there are several types of 748 gateways that can exist in an NVO3 environment. This section breaks 749 them down into the various types that could be supported. Note that 750 each of the types below could be implemented in either a centralized 751 manner or distributed to co-exist with the NVEs. 753 5.3.1.1. L2 Gateways (Bridging) 755 L2 Gateways act as layer 2 bridges to forward Ethernet frames based 756 on the MAC addresses present in them. 758 L2 VN to Legacy L2: This type of gateway bridges traffic between L2 759 VNs and other legacy L2 networks such as VLANs or L2 VPNs. 
761 L2 VN to L2 VN: The main motivation for this type of gateway is to 762 create separate groups of Tenant Systems using L2 VNs such that 763 the gateway can enforce network policies between each L2 VN. 765 5.3.1.2. L3 Gateways (Only IP Packets) 767 L3 Gateways forward IP packets based on the IP addresses present in 768 the packets. 770 L3 VN to Legacy L2: This type of gateway forwards packets between 771 L3 VNs and legacy L2 networks such as VLANs or L2 VPNs. The 772 MAC address in any frames forwarded onto the legacy L2 773 network would be the MAC address of the gateway. 775 L3 VN to Legacy L3: This type of gateway forwards packets between L3 776 VNs and legacy L3 networks. These legacy L3 networks could be 777 local to the data center, in the WAN, or an L3 VPN. 779 L3 VN to L2 VN: This type of gateway forwards packets between L3 780 VNs and L2 VNs. The MAC address in any frames forwarded 781 onto the L2 VN would be the MAC address of the gateway. 783 L2 VN to L2 VN: This type of gateway acts similarly to a traditional 784 router that forwards between L2 interfaces. The MAC address in 785 any frames forwarded between the L2 VNs would be the MAC 786 address of the gateway. 788 L3 VN to L3 VN: The main motivation for this type of gateway is to 789 create separate groups of Tenant Systems using L3 VNs such that 790 the gateway can enforce network policies between each L3 VN. 792 5.4. Distributed Inter-VN Gateways 794 The relaying of traffic from one VN to another deserves special 795 consideration. Whether traffic is permitted to flow from one VN to 796 another is a matter of policy, and would not (by default) be allowed 797 unless explicitly enabled. In addition, NVAs are the logical place 798 to maintain policy information about allowed inter-VN communication. 799 Policy enforcement for inter-VN communication can be handled in (at 800 least) two different ways. Explicit gateways could be the central 801 point for such enforcement, with all inter-VN traffic forwarded to 802 such gateways for processing. Alternatively, the NVA can provide 803 such information directly to NVEs, by either providing a mapping for 804 a target TS on another VN, or indicating that such communication is 805 disallowed by policy. 807 When inter-VN gateways are centralized, traffic between TSs on 808 different VNs can take suboptimal paths, i.e., triangular routing 809 results in paths that always traverse the gateway. In the worst 810 case, traffic between two TSs connected to the same NVE can be hair- 811 pinned through an external gateway. As an optimization, individual 812 NVEs can be part of a distributed gateway that performs such 813 relaying, reducing or completely eliminating triangular routing. In 814 a distributed gateway, each ingress NVE can perform such relaying 815 activity directly, so long as it has access to the policy information 816 needed to determine whether cross-VN communication is allowed. 817 Having individual NVEs be part of a distributed gateway allows them 818 to tunnel traffic directly to the destination NVE without the need to 819 take suboptimal paths. 821 The NVO3 architecture supports distributed gateways for the case of 822 inter-VN communication. Such support requires that NVO3 control 823 protocols include mechanisms for the maintenance and distribution of 824 policy information about what type of cross-VN communication is 825 allowed so that NVEs acting as distributed gateways can tunnel 826 traffic from one VN to another as appropriate.
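As a non-normative illustration of the decision an ingress NVE acting as part of a distributed gateway might make, the following Python sketch checks NVA-supplied inter-VN policy before resolving the egress NVE for a destination on another VN. The data structures and names are assumptions of this sketch; the actual policy representation and the NVE-to-NVA exchange that populates it are not specified here.

      # Non-normative sketch of an inter-VN relay check at an ingress NVE
      # that is part of a distributed gateway.
      from dataclasses import dataclass, field
      from typing import Dict, Optional, Set, Tuple

      @dataclass
      class DistributedGatewayState:
          # (source VN, destination VN) pairs permitted by policy
          allowed_vn_pairs: Set[Tuple[int, int]] = field(default_factory=set)
          # per-VN mapping of tenant IP address -> egress NVE underlay address
          mappings: Dict[int, Dict[str, str]] = field(default_factory=dict)

          def egress_nve(self, src_vn: int, dst_vn: int,
                         dst_ip: str) -> Optional[str]:
              """Return the egress NVE for dst_ip, or None if policy forbids
              the cross-VN relay or no mapping is known (query the NVA)."""
              if src_vn != dst_vn and (src_vn, dst_vn) not in self.allowed_vn_pairs:
                  return None  # disallowed by inter-VN policy
              return self.mappings.get(dst_vn, {}).get(dst_ip)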
828 Distributed gateways could also be used to distribute other 829 traditional router services to individual NVEs. The NVO3 830 architecture does not preclude such implementations, but does not 831 define or require them as they are outside the scope of the NVO3 832 architecture. 834 5.5. ARP and Neighbor Discovery 836 For an L2 service, strictly speaking, special processing of Address 837 Resolution Protocol (ARP) [RFC0826] (and IPv6 Neighbor Discovery (ND) 838 [RFC4861]) is not required. ARP requests are broadcast, and an NVO3 839 can deliver ARP requests to all members of a given L2 virtual 840 network, just as it does for any packet sent to an L2 broadcast 841 address. Similarly, ND requests are sent via IP multicast, which 842 NVO3 can support by delivering via L2 multicast. However, as a 843 performance optimization, an NVE can intercept ARP (or ND) requests 844 from its attached TSs and respond to them directly using information 845 in its mapping tables. Since an NVE will have mechanisms for 846 determining the NVE address associated with a given TS, the NVE can 847 leverage the same mechanisms to suppress sending ARP and ND requests 848 for a given TS to other members of the VN. The NVO3 architecture 849 supports such a capability. 851 6. NVE-NVE Interaction 853 Individual NVEs will interact with each other for the purposes of 854 tunneling and delivering traffic to remote TSs. At a minimum, a 855 control protocol may be needed for tunnel setup and maintenance. For 856 example, tunneled traffic may need to be encrypted or integrity 857 protected, in which case it will be necessary to set up appropriate 858 security associations between NVE peers. It may also be desirable to 859 perform tunnel maintenance (e.g., continuity checks) on a tunnel in 860 order to detect when a remote NVE becomes unreachable. Such generic 861 tunnel setup and maintenance functions are not generally 862 NVO3-specific. Hence, the NVO3 architecture expects to leverage 863 existing tunnel maintenance protocols rather than defining new ones. 865 Some NVE-NVE interactions may be specific to NVO3 (and in particular 866 be related to information kept in mapping tables) and agnostic to the 867 specific tunnel type being used. For example, when tunneling traffic 868 for TS-X to a remote NVE, it is possible that TS-X is not presently 869 associated with the remote NVE. Normally, this should not happen, 870 but there could be race conditions where the information an NVE has 871 learned from the NVA is out-of-date relative to actual conditions. 872 In such cases, the remote NVE could return an error or warning 873 indication, allowing the sending NVE to attempt a recovery or 874 otherwise attempt to mitigate the situation. 876 The NVE-NVE interaction could signal a range of indications, for 877 example: 879 o "No such TS here", upon a receipt of a tunneled packet for an 880 unknown TS. 882 o "TS-X not here, try the following NVE instead" (i.e., a redirect). 884 o Delivered to correct NVE, but could not deliver packet to TS-X 885 (soft error). 887 o Delivered to correct NVE, but could not deliver packet to TS-X 888 (hard error). 890 When an NVE receives information from a remote NVE that conflicts 891 with the information it has in its own mapping tables, it should 892 consult with the NVA to resolve those conflicts. In particular, it 893 should confirm that the information it has is up-to-date, and it 894 might indicate the error to the NVA, so as to nudge the NVA into 895 following up (as appropriate). 
While it might make sense for an NVE 896 to update its mapping table temporarily in response to an error from 897 a remote NVE, any changes must be handled carefully, as doing so can 898 raise security considerations if the received information cannot be 899 authenticated. That said, a sending NVE might still take steps to 900 mitigate a problem, such as applying rate limiting to data traffic 901 towards a particular NVE or TS. 903 7. Network Virtualization Authority 905 Before sending traffic to and receiving traffic from a virtual network, an 906 NVE must obtain the information needed to build its internal 907 forwarding tables and state as listed in Section 4.3. An NVE can 908 obtain such information from a Network Virtualization Authority. 910 The Network Virtualization Authority (NVA) is the entity that is 911 expected to provide address mapping and other information to NVEs. 912 NVEs can interact with an NVA to obtain any required information they 913 need in order to properly forward traffic on behalf of tenants. The 914 term NVA refers to the overall system, without regard to its scope 915 or how it is implemented. 917 7.1. How an NVA Obtains Information 919 There are two primary ways in which an NVA can obtain the address 920 dissemination information it manages. The NVA can obtain information 921 from the VM orchestration system and/or directly from the 922 NVEs themselves. 924 On virtualized systems, the NVA may be able to obtain the address 925 mapping information associated with VMs from the VM orchestration 926 system itself. If the VM orchestration system contains a master 927 database for all the virtualization information, having the NVA 928 obtain information directly from the orchestration system would be a 929 natural approach. Indeed, the NVA could effectively be co-located 930 with the VM orchestration system itself. In such systems, the VM 931 orchestration system communicates with the NVE indirectly through the 932 hypervisor. 934 However, as described in Section 4, not all NVEs are associated with 935 hypervisors. In such cases, NVAs cannot leverage VM orchestration 936 protocols to interact with an NVE and will instead need to peer 937 directly with them. By peering directly with an NVE, NVAs can obtain 938 information about the TSs connected to that NVE and can distribute 939 information to the NVE about the VNs those TSs are associated with. 940 For example, whenever a Tenant System attaches to an NVE, that NVE 941 would notify the NVA that the TS is now associated with that NVE. 942 Likewise, when a TS detaches from an NVE, that NVE would inform the 943 NVA. By communicating directly with NVEs, both the NVA and the NVE 944 are able to maintain up-to-date information about all active tenants 945 and the NVEs to which they are attached. 947 7.2. Internal NVA Architecture 949 For reliability and fault tolerance reasons, an NVA would be 950 implemented in a distributed or replicated manner without single 951 points of failure. How the NVA is implemented, however, is not 952 important to an NVE so long as the NVA provides a consistent and 953 well-defined interface to the NVE. For example, an NVA could be 954 implemented via database techniques whereby a server stores address 955 mapping information in a traditional (possibly replicated) database. 956 Alternatively, an NVA could be implemented in a distributed fashion 957 using an existing (or modified) routing protocol to maintain and 958 distribute mappings.
So long as there is a clear interface between 959 the NVE and NVA, how an NVA is architected and implemented is not 960 important to an NVE. 962 A number of architectural approaches could be used to implement NVAs 963 themselves. NVAs manage address bindings and distribute them to 964 where they need to go. One approach would be to use Border Gateway 965 Protocol (BGP) [RFC4364] (possibly with extensions) and route 966 reflectors. Another approach could use a transaction-based database 967 model with replicated servers. Because the implementation details 968 are local to an NVA, there is no need to pick exactly one solution 969 technology, so long as the external interfaces to the NVEs (and 970 remote NVAs) are sufficiently well defined to achieve 971 interoperability. 973 7.3. NVA External Interface 975 Conceptually, from the perspective of an NVE, an NVA is a single 976 entity. An NVE interacts with the NVA, and it is the NVA's 977 responsibility to ensure that interactions between the NVE and NVA 978 result in consistent behavior across the NVA and all other NVEs using 979 the same NVA. Because an NVA is built from multiple internal 980 components, an NVA will have to ensure that information flows to all 981 internal NVA components appropriately. 983 One architectural question is how the NVA presents itself to the NVE. 984 For example, an NVA could be required to provide access via a single 985 IP address. If NVEs only have one IP address to interact with, it 986 would be the responsibility of the NVA to handle NVA component 987 failures, e.g., by using a "floating IP address" that migrates among 988 NVA components to ensure that the NVA can always be reached via the 989 one address. Having all NVA accesses through a single IP address, 990 however, adds constraints to implementing robust failover, load 991 balancing, etc. 993 In the NVO3 architecture, an NVA is accessed through one or more IP 994 addresses (or IP address/port combinations). If multiple IP addresses 995 are used, each IP address provides equivalent functionality, meaning 996 that an NVE can use any of the provided addresses to interact with 997 the NVA. Should one address stop working, an NVE is expected to 998 fail over to another. While the different addresses result in 999 equivalent functionality, one address may respond more quickly than 1000 another, e.g., due to network conditions, load on the server, etc. 1002 To provide some control over load balancing, NVA addresses may have 1003 an associated priority. Addresses are used in order of priority, 1004 with no explicit preference among NVA addresses having the same 1005 priority. To provide basic load balancing among NVAs of equal 1006 priorities, NVEs could use some randomization input to select among 1007 equal-priority NVAs. Such a priority scheme facilitates failover and 1008 load balancing, for example, allowing a network operator to specify a 1009 set of primary and backup NVAs. 1011 It may be desirable to have individual NVA addresses responsible for 1012 a subset of information about an NV Domain. In such a case, NVEs 1013 would use different NVA addresses for obtaining or updating 1014 information about particular VNs or TS bindings. A key question with 1015 such an approach is how information would be partitioned, and how an 1016 NVE could determine which address to use to get the information it 1017 needs.
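As a non-normative illustration of the priority-based NVA address handling described above, the following Python sketch tries NVA addresses in priority order, choosing randomly among addresses of equal priority and falling back to lower-priority addresses on failure. The function and parameter names are assumptions of this sketch only.

      # Non-normative sketch of NVA address selection at an NVE.
      import random
      from itertools import groupby
      from typing import Callable, List, Optional, Tuple

      NvaAddress = Tuple[int, str]  # (priority, "ip:port"); lower is preferred

      def select_nva(addresses: List[NvaAddress],
                     is_reachable: Callable[[str], bool]) -> Optional[str]:
          """Return a usable NVA address, preferring higher-priority groups."""
          ordered = sorted(addresses, key=lambda a: a[0])
          for _, group in groupby(ordered, key=lambda a: a[0]):
              candidates = [addr for _, addr in group]
              random.shuffle(candidates)  # spread load within a priority level
              for addr in candidates:
                  if is_reachable(addr):
                      return addr  # otherwise fail over to the next priority
          return None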
1019 Another possibility is to treat the information on which NVA 1020 addresses to use as cached (soft-state) information at the NVEs, so 1021 that any NVA address can be used to obtain any information, but NVEs 1022 are informed of preferences for which addresses to use for particular 1023 information on VNs or TS bindings. That preference information would 1024 be cached for future use to improve behavior - e.g., if all requests 1025 for a specific subset of VNs are forwarded to a specific NVA 1026 component, the NVE can optimize future requests within that subset by 1027 sending them directly to that NVA component via its address. 1029 8. NVE-to-NVA Protocol 1031 As outlined in Section 4.3, an NVE needs certain information in order 1032 to perform its functions. To obtain such information from an NVA, an 1033 NVE-to-NVA protocol is needed. The NVE-to-NVA protocol provides two 1034 functions. First, it allows an NVE to obtain information about the 1035 location and status of other TSs with which it needs to communicate. 1036 Second, the NVE-to-NVA protocol provides a way for NVEs to provide 1037 updates to the NVA about the TSs attached to that NVE (e.g., when a 1038 TS attaches or detaches from the NVE), or about communication errors 1039 encountered when sending traffic to remote NVEs. For example, an NVE 1040 could indicate that a destination it is trying to reach at a 1041 destination NVE is unreachable for some reason. 1043 While having a direct NVE-to-NVA protocol might seem straightforward, 1044 the presence of existing VM orchestration systems complicates the 1045 choices an NVE has for interacting with the NVA. 1047 8.1. NVE-NVA Interaction Models 1049 An NVE interacts with an NVA in at least two (quite different) ways: 1051 o NVEs embedded within the same server as the hypervisor can obtain 1052 necessary information entirely through the hypervisor-facing side 1053 of the NVE. Such an approach is a natural extension to existing 1054 VM orchestration systems supporting server virtualization because 1055 a protocol between the hypervisor and VM orchestration 1056 system already exists and can be leveraged to obtain any needed 1057 information. Specifically, VM orchestration systems used to 1058 create, terminate and migrate VMs already use well-defined (though 1059 typically proprietary) protocols to handle the interactions 1060 between the hypervisor and VM orchestration system. For such 1061 systems, it is a natural extension to leverage the existing 1062 orchestration protocol as a sort of proxy protocol for handling 1063 the interactions between an NVE and the NVA. Indeed, existing 1064 implementations can already do this. 1066 o Alternatively, an NVE can obtain needed information by interacting 1067 directly with an NVA via a protocol operating over the data center 1068 underlay network. Such an approach is needed to support NVEs that 1069 are not associated with systems performing server virtualization 1070 (e.g., as in the case of a standalone gateway) or where the NVE 1071 needs to communicate directly with the NVA for other reasons. 1073 The NVO3 architecture will focus on support for the second model 1074 above. Existing virtualization environments are already using the 1075 first model. But they are not sufficient to cover the case of 1076 standalone gateways -- such gateways may not support virtualization 1077 and do not interface with existing VM orchestration systems. 1079 8.2.
8.2. Direct NVE-NVA Protocol

An NVE can interact directly with an NVA via an NVE-to-NVA protocol. Such a protocol can be either independent of the NVA internal protocol or an extension of it. Using a purpose-specific protocol would provide architectural separation and independence between the NVE and NVA. The NVE and NVA interact in a well-defined way, and changes in the NVA (or NVE) do not need to impact each other. Using a dedicated protocol also ensures that both NVE and NVA implementations can evolve independently and without dependencies on each other. Such independence is important because the upgrade path for NVEs and NVAs is quite different. Upgrading all the NVEs at a site will likely be more difficult in practice than upgrading NVAs because of the large number of NVEs, one on each end device. In practice, it would be prudent to assume that once an NVE has been implemented and deployed, it may be challenging to get subsequent NVE extensions and changes implemented and deployed, whereas an NVA (and its associated internal protocols) is more likely to evolve over time as experience is gained from usage, and upgrades will involve fewer nodes.

Requirements for a direct NVE-NVA protocol can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

8.3. Propagating Information Between NVEs and NVAs

Information flows between NVEs and NVAs in both directions. The NVA maintains information about all VNs in the NV Domain so that NVEs do not need to do so themselves. NVEs obtain from the NVA information about where a given remote TS destination resides. NVAs, in turn, obtain information from NVEs about the individual TSs attached to those NVEs.

While the NVA could push information relevant to every virtual network to every NVE, such an approach scales poorly and is unnecessary. In practice, a given NVE will only need and want to know about VNs to which it is attached. Thus, an NVE should be able to subscribe to updates only for the virtual networks it is interested in receiving updates for. The NVO3 architecture supports a model where an NVE is not required to have full mapping tables for all virtual networks in an NV Domain.

Before sending unicast traffic to a remote TS (or TSes for broadcast or multicast traffic), an NVE must know where the remote TS(es) currently reside. When a TS attaches to a virtual network, the NVE obtains information about that VN from the NVA. The NVA can provide that information to the NVE at the time the TS attaches to the VN, either because the NVE requests the information when the attach operation occurs or because the VM orchestration system has initiated the attach operation and provides the associated mapping information to the NVE at the same time.

There are scenarios where an NVE may wish to query the NVA about individual mappings within a VN. For example, when sending traffic to a remote TS on a remote NVE, that TS may become unavailable (e.g., because it has migrated elsewhere or has been shut down, in which case the remote NVE may return an error indication). In such situations, the NVE may need to query the NVA to obtain updated mapping information for a specific TS or to verify that the information is still correct despite the error condition. Note that such a query could also be used by the NVA as an indication that there may be an inconsistency in the network and that it should take steps to verify that the information it has about the current state and location of a specific TS is still correct.

For very large virtual networks, the amount of state an NVE needs to maintain for a given virtual network could be significant. Moreover, an NVE may only be communicating with a small subset of the TSs on such a virtual network. In such cases, the NVE may find it desirable to maintain state only for those destinations it is actively communicating with. In such scenarios, an NVE may not want to maintain full mapping information about all destinations on a VN. Should it then need to communicate with a destination for which it does not have mapping information, however, it will need to be able to query the NVA on demand for the missing information on a per-destination basis.

The NVO3 architecture will need to support a range of operations between the NVE and NVA. Requirements for those operations can be found in [I-D.ietf-nvo3-nve-nva-cp-req].
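The on-demand, per-destination behavior described above can be illustrated with the following non-normative sketch of an NVE-side mapping cache (in Python; the names are illustrative, and query_nva() stands in for whatever NVE-to-NVA protocol is actually used).

   # Non-normative sketch: an NVE-side mapping cache for Section 8.3.
   # Mappings are fetched from the NVA on demand (per destination),
   # cached, and refreshed when a remote NVE reports a delivery error.

   import time

   class VnMappingCache:
       def __init__(self, vn_id, query_nva):
           self.vn_id = vn_id
           self.query_nva = query_nva   # callable: (vn_id, ts_addr) ->
                                        #   (nve_underlay_addr, ttl)
           self.cache = {}              # ts_addr -> (underlay, expiry)

       def lookup(self, ts_addr):
           # Return the underlay address of the NVE hosting ts_addr,
           # querying the NVA only on a cache miss or expired entry.
           entry = self.cache.get(ts_addr)
           if entry and entry[1] > time.monotonic():
               return entry[0]
           underlay, ttl = self.query_nva(self.vn_id, ts_addr)
           self.cache[ts_addr] = (underlay, time.monotonic() + ttl)
           return underlay

       def handle_delivery_error(self, ts_addr):
           # A remote NVE signalled that ts_addr is unreachable (e.g.,
           # the TS migrated or was shut down): drop the stale entry
           # and re-query, which also gives the NVA a hint that its own
           # state for this TS may need verification.
           self.cache.pop(ts_addr, None)
           return self.lookup(ts_addr)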
9. Federated NVAs

An NVA provides service to the set of NVEs in its NV Domain. Each NVA manages network virtualization information for the virtual networks within its NV Domain. An NV Domain is administered by a single entity.

In some cases, it will be necessary to expand the scope of a specific VN or even an entire NV Domain beyond a single NVA. For example, an administrator managing multiple data centers may wish to operate all of those data centers as a single NV region. Such cases are handled by having different NVAs peer with each other to exchange mapping information about specific VNs. NVAs operate in a federated manner, with a set of NVAs operating as a loosely coupled federation of individual NVAs. If a virtual network spans multiple NVAs (e.g., located at different data centers) and an NVE needs to deliver tenant traffic to an NVE that is part of a different NV Domain, it still interacts only with its NVA, even when obtaining mappings for NVEs associated with a different NV Domain.

Figure 3 shows a scenario where two separate NV Domains (A and B) share information about Virtual Network "1217". VM1 and VM2 both connect to the same Virtual Network 1217, even though the two VMs are in separate NV Domains. There are two cases to consider. In the first case, NV Domain B does not allow NVE-A to tunnel traffic directly to NVE-B. There could be a number of reasons for this. For example, NV Domains A and B may not share a common address space (i.e., traversal through a NAT device is required), or for policy reasons, a domain might require that all traffic between separate NV Domains be funneled through a particular device (e.g., a firewall). In such cases, NVA-2 will advertise to NVA-1 that VM2 on Virtual Network 1217 is available and will direct that traffic between the two nodes go through IP-G (an IP gateway). IP-G would then decapsulate received traffic from one NV Domain, translate it appropriately for the other domain, and re-encapsulate the packet for delivery.
   [Figure 3 (ASCII art not reproduced here): VM1 attaches via NVE-A to
   NV Domain A, which is served by NVA-1; VM2 attaches via NVE-B to
   NV Domain B, which is served by NVA-2; gateway IP-G sits between the
   two NV Domains.]

            Figure 3: VM1 and VM2 are in different NV Domains.

In the second case, NV Domain B does allow NVE-A to tunnel traffic directly to NVE-B. In that case, NVA-2 advertises to NVA-1 the mapping for VM2 (i.e., NVE-B's underlay address), and traffic flows directly between the two NVEs without passing through IP-G.

NVAs at one site share information and interact with NVAs at other sites, but only in a controlled manner. It is expected that policy and access control will be applied at the boundaries between different sites (and NVAs) so as to minimize dependencies on external NVAs that could negatively impact the operation within a site. It is an architectural principle that operations involving NVAs at one site not be immediately impacted by failures or errors at another site. (Of course, communication between NVEs in different NV Domains may be impacted by such failures or errors.) It is a strong requirement that an NVA continue to operate properly for local NVEs even if external communication is interrupted (e.g., should communication between a local and remote NVA fail).

At a high level, a federation of interconnected NVAs has some analogies to BGP and Autonomous Systems. Like an Autonomous System, NVAs at one site are managed by a single administrative entity and do not interact with external NVAs except as allowed by policy. Likewise, the interface between NVAs at different sites is well defined, so that the internal details of operations at one site are largely hidden from other sites. Finally, an NVA only peers with other NVAs that it has a trust relationship with, i.e., where a VN is intended to span multiple NVAs.

Reasons for using a federated model include:

o  Provide isolation among NVAs operating at different sites at different geographic locations.

o  Control the quantity and rate of information updates that flow (and must be processed) between different NVAs in different data centers.

o  Control the set of external NVAs (and external sites) a site peers with. A site will only peer with other sites that are cooperating in providing an overlay service.

o  Allow policy to be applied between sites. A site will want to carefully control what information it exports (and to whom) as well as what information it is willing to import (and from whom). A sketch of such export and import filtering follows this list.

o  Allow different protocols and architectures to be used for intra- vs. inter-NVA communication. For example, within a single data center, a replicated transaction server using database techniques might be an attractive implementation option for an NVA, and protocols optimized for intra-NVA communication would likely be different from protocols involving inter-NVA communication between different sites.

o  Allow for optimized protocols rather than a one-size-fits-all approach. Within a data center, networks tend to have lower latency, higher speed, and higher redundancy when compared with WAN links interconnecting data centers. The design constraints and tradeoffs for a protocol operating within a data center network are different from those operating over WAN links. While a single protocol could be used for both cases, there could be advantages to using different and more specialized protocols for the intra- and inter-NVA cases.
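As a non-normative illustration of such export and import control, the following sketch shows a per-peer policy check that a federated NVA might apply before advertising or accepting VN mappings (in Python; the peer names, VN numbers, and policy structure are illustrative only).

   # Non-normative sketch of per-peer export/import policy at a
   # federated NVA.  Before advertising a VN mapping to a peer NVA at
   # another site, the local NVA checks whether that VN is intended to
   # span that peer; mappings received from a peer are filtered in the
   # same way.  Peer names and VN numbers are illustrative only.

   EXPORT_POLICY = {
       # peer NVA identifier -> set of VN Contexts it may learn about
       "nva.site-b.example": {1217, 2001},
       "nva.site-c.example": {2001},
   }

   IMPORT_POLICY = {
       # peer NVA identifier -> set of VN Contexts accepted from it
       "nva.site-b.example": {1217},
   }

   def exportable(peer, vn_id):
       # Advertise a mapping only if policy explicitly allows it.
       return vn_id in EXPORT_POLICY.get(peer, set())

   def acceptable(peer, vn_id):
       # Discard mappings for VNs never agreed to be imported from peer.
       return vn_id in IMPORT_POLICY.get(peer, set())

   def filter_advertisements(peer, mappings):
       # mappings: iterable of (vn_id, tenant_addr, underlay_addr)
       return [m for m in mappings if exportable(peer, m[0])]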
9.1. Inter-NVA Peering

To support peering between different NVAs, an inter-NVA protocol is needed. The inter-NVA protocol defines what information is exchanged between NVAs. It is assumed that the protocol will be used to share addressing information between data centers and must scale well over WAN links.

10. Control Protocol Work Areas

The NVO3 architecture consists of two major distinct entities: NVEs and NVAs. In order to provide isolation and independence between these two entities, the NVO3 architecture calls for well-defined protocols for interfacing between them. For an individual NVA, the architecture calls for a logically centralized entity that could be implemented in a distributed or replicated fashion. While the IETF may choose to define one or more specific architectural approaches to building individual NVAs, there is little need for it to pick exactly one approach to the exclusion of others. An NVA for a single domain will likely be deployed as a single vendor product; thus, there is little benefit in standardizing the internal structure of an NVA.

Individual NVAs peer with each other in a federated manner. The NVO3 architecture calls for a well-defined interface between NVAs.

Finally, a hypervisor-to-NVE protocol is needed to cover the split-NVE scenario described in Section 4.2.

11. NVO3 Data Plane Encapsulation

When tunneling tenant traffic, NVEs add an encapsulation header to the original tenant packet. The exact encapsulation to use for NVO3 does not seem to be critical. The main requirement is that the encapsulation support a Context ID of sufficient size [I-D.ietf-nvo3-dataplane-requirements]. A number of encapsulations already exist that provide a VN Context of sufficient size for NVO3. For example, VXLAN [RFC7348] has a 24-bit VXLAN Network Identifier (VNI), NVGRE [RFC7637] has a 24-bit Tenant Network ID (TNI), and MPLS-over-GRE provides a 20-bit label field. While there is widespread recognition that a 12-bit VN Context would be too small (only 4096 distinct values), it is generally agreed that 20 bits (about 1 million distinct values) and 24 bits (about 16.8 million distinct values) are sufficient for a wide variety of deployment scenarios.
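As a non-normative illustration of these Context ID sizes, the following sketch computes the number of distinct values each size provides and packs a 24-bit VNI into the 8-octet VXLAN header layout defined in [RFC7348] (in Python; the helper function is illustrative only).

   # Non-normative illustration of VN Context sizes from Section 11.

   for bits in (12, 20, 24):
       print(f"{bits}-bit VN Context -> {2**bits:,} distinct values")
   # 12-bit VN Context -> 4,096 distinct values
   # 20-bit VN Context -> 1,048,576 distinct values
   # 24-bit VN Context -> 16,777,216 distinct values

   # Packing a 24-bit VNI into a VXLAN-style 8-octet header
   # (8 flag bits, 24 reserved bits, 24-bit VNI, 8 reserved bits),
   # following the header format described in RFC 7348.
   import struct

   def vxlan_header(vni):
       assert 0 <= vni < 2**24, "VNI must fit in 24 bits"
       return struct.pack("!II", 0x08 << 24, vni << 8)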
12. Operations and Management

The simplicity of operating and debugging overlay networks will be critical for successful deployment. Some architectural choices can facilitate or hinder OAM. Related OAM drafts include [I-D.ashwood-nvo3-operational-requirement].

13. Summary

This document presents the overall architecture for Network Virtualization Overlays (NVO3). The architecture calls for three main areas of protocol work:

1.  A hypervisor-to-NVE protocol to support split-NVEs, as discussed in Section 4.2.

2.  An NVE-to-NVA protocol for disseminating VN information (e.g., inner-to-outer address mappings).

3.  An NVA-to-NVA protocol for exchange of information about specific virtual networks between federated NVAs.

It should be noted that existing protocols or extensions of existing protocols are applicable.

14. Acknowledgments

Helpful comments and improvements to this document have come from Lizhong Jin, Anton Ivanov, Dennis (Xiaohong) Qin, Erik Smith, Ziye Yang, and Lucy Yong.

15. IANA Considerations

This memo includes no request to IANA.

16. Security Considerations

The data plane and control plane described in this architecture will need to address potential security threats.

For the data plane, tunneled application traffic may need protection against being misdelivered, being modified, or having its content exposed to an inappropriate third party. In all cases, encryption between authenticated tunnel endpoints can be used to mitigate risks.

For the control plane, between NVAs, between the NVA and NVE, as well as between different components of the split-NVE approach, a combination of authentication and encryption can be used. All entities will need to properly authenticate with each other and enable encryption for their interactions as appropriate to protect sensitive information.

Leakage of sensitive information about users or other entities associated with VMs whose traffic is virtualized can also be covered by using encryption for the control plane protocols.
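As a non-normative example of one way such control plane protection might be realized, the following sketch shows an NVE opening a mutually authenticated, encrypted connection to an NVA using TLS with client certificates (in Python; the choice of TLS, the port parameter, and the file names are illustrative assumptions, not mechanisms mandated by this architecture).

   # Non-normative sketch: an NVE opening a mutually authenticated,
   # encrypted (TLS) connection to an NVA for control plane traffic.

   import socket
   import ssl

   def connect_to_nva(nva_addr, nva_port):
       ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
       ctx.load_verify_locations("nvo3-ca.pem")            # NVA trust anchor
       ctx.load_cert_chain("nve-cert.pem", "nve-key.pem")  # NVE identity
       ctx.check_hostname = True
       raw = socket.create_connection((nva_addr, nva_port))
       # The NVA is expected to require and verify the NVE's certificate,
       # giving mutual authentication; the TLS record layer provides
       # confidentiality and integrity for NVE-to-NVA messages.
       return ctx.wrap_socket(raw, server_hostname=nva_addr)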
17. Informative References

[I-D.ashwood-nvo3-operational-requirement]
           Ashwood-Smith, P., Iyengar, R., Tsou, T., Sajassi, A., Boucadair, M., Jacquenet, C., and M. Daikoku, "NVO3 Operational Requirements", draft-ashwood-nvo3-operational-requirement-03 (work in progress), July 2013.

[I-D.ietf-nvo3-dataplane-requirements]
           Bitar, N., Lasserre, M., Balus, F., Morin, T., Jin, L., and B. Khasnabish, "NVO3 Data Plane Requirements", draft-ietf-nvo3-dataplane-requirements-03 (work in progress), April 2014.

[I-D.ietf-nvo3-mcast-framework]
           Ghanwani, A., Dunbar, L., McBride, M., Bannai, V., and R. Krishnan, "A Framework for Multicast in NVO3", draft-ietf-nvo3-mcast-framework-04 (work in progress), February 2016.

[I-D.ietf-nvo3-nve-nva-cp-req]
           Kreeger, L., Dutt, D., Narten, T., and D. Black, "Network Virtualization NVE to NVA Control Protocol Requirements", draft-ietf-nvo3-nve-nva-cp-req-05 (work in progress), March 2016.

[IEEE-802.1Q]
           IEEE, "IEEE Standard for Local and metropolitan area networks: Bridges and Bridged Networks", IEEE Std 802.1Q-2014, November 2014.

[RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware", STD 37, RFC 826, DOI 10.17487/RFC0826, November 1982.

[RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 2006.

[RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, DOI 10.17487/RFC4861, September 2007.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

[RFC7364]  Narten, T., Ed., Gray, E., Ed., Black, D., Fang, L., Kreeger, L., and M. Napierala, "Problem Statement: Overlays for Network Virtualization", RFC 7364, DOI 10.17487/RFC7364, October 2014.

[RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. Rekhter, "Framework for Data Center (DC) Network Virtualization", RFC 7365, DOI 10.17487/RFC7365, October 2014.

[RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015.

Authors' Addresses

   David Black
   EMC

   Email: david.black@emc.com

   Jon Hudson
   Independent

   Email: jon.hudson@gmail.com

   Lawrence Kreeger
   Cisco

   Email: kreeger@cisco.com

   Marc Lasserre
   Independent

   Email: mmlasserre@gmail.com

   Thomas Narten
   IBM

   Email: narten@us.ibm.com