2 Internet Engineering Task Force D. Black 3 Internet-Draft EMC 4 Intended status: Informational J. Hudson 5 Expires: April 21, 2016 Brocade 6 L. Kreeger 7 Cisco 8 M. Lasserre 9 Alcatel-Lucent 10 T. Narten 11 IBM 12 October 19, 2015 14 An Architecture for Overlay Networks (NVO3) 15 draft-ietf-nvo3-arch-04 17 Abstract 19 This document presents a high-level overview architecture for 20 building overlay networks in NVO3. The architecture is given at a 21 high-level, showing the major components of an overall system. An 22 important goal is to divide the space into individual smaller 23 components that can be implemented independently and with clear 24 interfaces and interactions with other components. It should be 25 possible to build and implement individual components in isolation 26 and have them work with other components with no changes to other 27 components. That way implementers have flexibility in implementing 28 individual components and can optimize and innovate within their 29 respective components without requiring changes to other components. 31 Status of This Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at http://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on April 21, 2016. 48 Copyright Notice 50 Copyright (c) 2015 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (http://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document.
Please review these documents 57 carefully, as they describe your rights and restrictions with respect 58 to this document. Code Components extracted from this document must 59 include Simplified BSD License text as described in Section 4.e of 60 the Trust Legal Provisions and are provided without warranty as 61 described in the Simplified BSD License. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 67 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 4 68 3.1. VN Service (L2 and L3) . . . . . . . . . . . . . . . . . 6 69 3.1.1. VLAN Tags in L2 Service . . . . . . . . . . . . . . . 7 70 3.1.2. TTL Considerations . . . . . . . . . . . . . . . . . 7 71 3.2. Network Virtualization Edge (NVE) . . . . . . . . . . . . 7 72 3.3. Network Virtualization Authority (NVA) . . . . . . . . . 9 73 3.4. VM Orchestration Systems . . . . . . . . . . . . . . . . 10 74 4. Network Virtualization Edge (NVE) . . . . . . . . . . . . . . 11 75 4.1. NVE Co-located With Server Hypervisor . . . . . . . . . . 11 76 4.2. Split-NVE . . . . . . . . . . . . . . . . . . . . . . . . 12 77 4.2.1. Tenant VLAN handling in Split-NVE Case . . . . . . . 13 78 4.3. NVE State . . . . . . . . . . . . . . . . . . . . . . . . 13 79 4.4. Multi-Homing of NVEs . . . . . . . . . . . . . . . . . . 14 80 4.5. VAP . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 81 5. Tenant System Types . . . . . . . . . . . . . . . . . . . . . 15 82 5.1. Overlay-Aware Network Service Appliances . . . . . . . . 15 83 5.2. Bare Metal Servers . . . . . . . . . . . . . . . . . . . 15 84 5.3. Gateways . . . . . . . . . . . . . . . . . . . . . . . . 16 85 5.3.1. Gateway Taxonomy . . . . . . . . . . . . . . . . . . 16 86 5.3.1.1. L2 Gateways (Bridging) . . . . . . . . . . . . . 16 87 5.3.1.2. L3 Gateways (Only IP Packets) . . . . . . . . . . 17 88 5.4. Distributed Inter-VN Gateways . . . . . . . . . . . . . . 17 89 5.5. ARP and Neighbor Discovery . . . . . . . . . . . . . . . 18 90 6. NVE-NVE Interaction . . . . . . . . . . . . . . . . . . . . . 18 91 7. Network Virtualization Authority . . . . . . . . . . . . . . 20 92 7.1. How an NVA Obtains Information . . . . . . . . . . . . . 20 93 7.2. Internal NVA Architecture . . . . . . . . . . . . . . . . 21 94 7.3. NVA External Interface . . . . . . . . . . . . . . . . . 21 95 8. NVE-to-NVA Protocol . . . . . . . . . . . . . . . . . . . . . 22 96 8.1. NVE-NVA Interaction Models . . . . . . . . . . . . . . . 23 97 8.2. Direct NVE-NVA Protocol . . . . . . . . . . . . . . . . . 23 98 8.3. Propagating Information Between NVEs and NVAs . . . . . . 24 99 9. Federated NVAs . . . . . . . . . . . . . . . . . . . . . . . 25 100 9.1. Inter-NVA Peering . . . . . . . . . . . . . . . . . . . . 27 101 10. Control Protocol Work Areas . . . . . . . . . . . . . . . . . 28 102 11. NVO3 Data Plane Encapsulation . . . . . . . . . . . . . . . . 28 103 12. Operations and Management . . . . . . . . . . . . . . . . . . 28 104 13. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 105 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 29 106 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 107 16. Security Considerations . . . . . . . . . . . . . . . . . . . 29 108 17. Informative References . . . . . . . . . . . . . . . . . . . 30 109 Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 31 110 A.1. Changes From draft-ietf-nvo3-arch-03 to -04 . . . . . 
. . 31 111 A.2. Changes From draft-ietf-nvo3-arch-02 to -03 . . . . . . . 31 112 A.3. Changes From draft-ietf-nvo3-arch-01 to -02 . . . . . . . 31 113 A.4. Changes From draft-ietf-nvo3-arch-00 to -01 . . . . . . . 32 114 A.5. Changes From draft-narten-nvo3 to draft-ietf-nvo3 . . . . 32 115 A.6. Changes From -00 to -01 (of draft-narten-nvo3-arch) . . . 32 116 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 33 118 1. Introduction 120 This document presents a high-level architecture for building overlay 121 networks in NVO3. The architecture is given at a high level, showing 122 the major components of an overall system. An important goal is to 123 divide the space into smaller individual components that can be 124 implemented independently and with clear interfaces and interactions 125 with other components. It should be possible to build and implement 126 individual components in isolation and have them work with other 127 components with no changes to other components. That way, 128 implementers have flexibility in implementing individual components 129 and can optimize and innovate within their respective components 130 without necessarily requiring changes to other components. 132 The motivation for overlay networks is given in [RFC7364]. 133 "Framework for DC Network Virtualization" [RFC7365] provides a 134 framework for discussing overlay networks generally and the various 135 components that must work together in building such systems. This 136 document differs from the framework document in that it doesn't 137 attempt to cover all possible approaches within the general design 138 space. Rather, it describes one particular approach. 140 2. Terminology 142 This document uses the same terminology as [RFC7365]. In addition, 143 the following terms are used: 145 NV Domain A Network Virtualization Domain is an administrative 146 construct that defines a Network Virtualization Authority (NVA), 147 the set of Network Virtualization Edges (NVEs) associated with 148 that NVA, and the set of virtual networks the NVA manages and 149 supports. NVEs are associated with a (logically centralized) NVA, 150 and an NVE supports communication for any of the virtual networks 151 in the domain. 153 NV Region A region over which information about a set of virtual 154 networks is shared. In the degenerate case, an NV Region 155 corresponds to a single NV Domain. The 156 more interesting case occurs when two or more NV Domains share 157 information about part or all of a set of virtual networks that 158 they manage. Two NVAs share information about particular virtual 159 networks for the purpose of supporting connectivity between 160 tenants located in different NV Domains. NVAs can share 161 information about an entire NV domain, or just individual virtual 162 networks. 164 Tenant System Interface (TSI) The interface to a Virtual Network as 165 presented to a Tenant System. The TSI logically connects to the 166 NVE via a Virtual Access Point (VAP). To the Tenant System, the 167 TSI is like a NIC; the TSI presents itself to a Tenant System as a 168 normal network interface. 170 VLAN Unless stated otherwise, the terms VLAN and VLAN Tag are used 171 in this document to denote a C-VLAN [IEEE-802.1Q], and the terms are 172 used interchangeably to improve readability. 174 3. Background 176 Overlay networks are an approach for providing network virtualization 177 services to a set of Tenant Systems (TSs) [RFC7365].
With overlays, 178 data traffic between tenants is tunneled across the underlying data 179 center's IP network. The use of tunnels provides a number of 180 benefits by decoupling the network as viewed by tenants from the 181 underlying physical network across which they communicate. 183 Tenant Systems connect to Virtual Networks (VNs), with each VN having 184 associated attributes defining properties of the network, such as the 185 set of members that connect to it. Tenant Systems connected to a 186 virtual network typically communicate freely with other Tenant 187 Systems on the same VN, but communication between Tenant Systems on 188 one VN and those external to the VN (whether on another VN or 189 connected to the Internet) is carefully controlled and governed by 190 policy. 192 A Network Virtualization Edge (NVE) [RFC7365] is the entity that 193 implements the overlay functionality. An NVE resides at the boundary 194 between a Tenant System and the overlay network as shown in Figure 1. 195 An NVE creates and maintains local state about each Virtual Network 196 for which it is providing service on behalf of a Tenant System. 198 +--------+ +--------+ 199 | Tenant +--+ +----| Tenant | 200 | System | | (') | System | 201 +--------+ | ................ ( ) +--------+ 202 | +-+--+ . . +--+-+ (_) 203 | | NVE|--. .--| NVE| | 204 +--| | . . | |---+ 205 +-+--+ . . +--+-+ 206 / . . 207 / . L3 Overlay . +--+-++--------+ 208 +--------+ / . Network . | NVE|| Tenant | 209 | Tenant +--+ . .- -| || System | 210 | System | . . +--+-++--------+ 211 +--------+ ................ 212 | 213 +----+ 214 | NVE| 215 | | 216 +----+ 217 | 218 | 219 ===================== 220 | | 221 +--------+ +--------+ 222 | Tenant | | Tenant | 223 | System | | System | 224 +--------+ +--------+ 226 Figure 1: NVO3 Generic Reference Model 228 The following subsections describe key aspects of an overlay system 229 in more detail. Section 3.1 describes the service model (Ethernet 230 vs. IP) provided to Tenant Systems. Section 3.2 describes NVEs in 231 more detail. Section 3.3 introduces the Network Virtualization 232 Authority, from which NVEs obtain information about virtual networks. 234 Section 3.4 provides background on VM orchestration systems and their 235 use of virtual networks. 237 3.1. VN Service (L2 and L3) 239 A Virtual Network provides either L2 or L3 service to connected 240 tenants. For L2 service, VNs transport Ethernet frames, and a Tenant 241 System is provided with a service that is analogous to being 242 connected to a specific L2 C-VLAN. L2 broadcast frames are generally 243 delivered to all (and multicast frames delivered to a subset of) the 244 other Tenant Systems on the VN. To a Tenant System, it appears as if 245 they are connected to a regular L2 Ethernet link. Within NVO3, 246 tenant frames are tunneled to remote NVEs based on the MAC addresses 247 of the frame headers as originated by the Tenant System. On the 248 underlay, NVO3 packets are forwarded between NVEs based on the outer 249 addresses of tunneled packets. 251 For L3 service, VNs transport IP datagrams, and a Tenant System is 252 provided with a service that only supports IP traffic. Within NVO3, 253 tenant frames are tunneled to remote NVEs based on the IP addresses 254 of the packet originated by the Tenant System; any L2 destination 255 addresses provided by Tenant Systems are effectively ignored. 
For L3 256 service, the Tenant System will be configured with an IP subnet that 257 is effectively a point-to-point link, i.e., having only the Tenant 258 System and a next-hop router address on it. 260 L2 service is intended for systems that need native L2 Ethernet 261 service and the ability to run protocols directly over Ethernet 262 (i.e., not based on IP). L3 service is intended for systems in which 263 all the traffic can safely be assumed to be IP. It is important to 264 note that whether NVO3 provides L2 or L3 service to a Tenant System, 265 the Tenant System does not generally need to be aware of the 266 distinction. In both cases, the virtual network presents itself to 267 the Tenant System as an L2 Ethernet interface. An Ethernet interface 268 is used in both cases simply as a widely supported interface type 269 that essentially all Tenant Systems already support. Consequently, 270 no special software is needed on Tenant Systems to use an L3 vs. an 271 L2 overlay service. 273 NVO3 can also provide a combined L2 and L3 service to tenants. A 274 combined service provides L2 service for intra-VN communication, but 275 also provides L3 service for L3 traffic entering or leaving the VN. 276 Architecturally, the handling of a combined L2/L3 service in NVO3 is 277 intended to match what is commonly done today in non-overlay 278 environments by devices providing a combined bridge/router service. 279 With combined service, the virtual network itself retains the 280 semantics of L2 service and all traffic is processed according to its 281 L2 semantics. In addition, however, traffic requiring IP processing 282 is also processed at the IP level. 284 The IP processing for a combined service can be implemented on a 285 standalone device attached to the virtual network (e.g., an IP 286 router) or implemented locally on the NVE (see Section 5.4 on 287 Distributed Gateways). For unicast traffic, NVE implementation of a 288 combined service may result in a packet being delivered to another TS 289 attached to the same NVE (on either the same or a different VN) or 290 tunneled to a remote NVE, or even forwarded outside the NVO3 domain. 291 For multicast or broadcast packets, the combination of NVE L2 and L3 292 processing may result in copies of the packet receiving both L2 and 293 L3 treatments to realize delivery to all of the destinations 294 involved. This distributed NVE implementation of IP routing results 295 in the same network delivery behavior as if the L2 processing of the 296 packet included delivery of the packet to an IP router attached to 297 the L2 VN as a TS, with the router having additional network 298 attachments to other networks, either virtual or not. 300 3.1.1. VLAN Tags in L2 Service 302 An NVO3 L2 virtual network service may include encapsulated L2 VLAN 303 tags provided by a Tenant System, but does not use encapsulated tags 304 in deciding where and how to forward traffic. Such VLAN tags can be 305 passed through, so that Tenant Systems that send or expect to receive 306 them can be supported as appropriate. 308 The processing of VLAN tags that an NVE receives from a TS is 309 controlled by settings associated with the VAP. Just as in the case 310 with ports on Ethernet switches, a number of settings could be 311 imagined. For example, C-TAGs can be passed through transparently, 312 they could always be stripped upon receipt from a Tenant System, they 313 could be compared against a list of explicitly configured tags, etc. 
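As a concrete illustration of such per-VAP settings, the following sketch models pass-through, strip, and allowed-list handling of C-TAGs received from a Tenant System. It is a minimal, hypothetical example for exposition only; the names used (CTagMode, VapTagPolicy, and so on) are not defined by NVO3.

   from dataclasses import dataclass, field
   from enum import Enum
   from typing import Optional, Set

   class CTagMode(Enum):
       """Hypothetical per-VAP choices for handling C-TAGs from a TS."""
       PASS_THROUGH = "pass-through"   # keep the tag in the encapsulated frame
       STRIP = "strip"                 # always remove the tag on receipt
       ALLOWED_LIST = "allowed-list"   # accept only explicitly configured tags

   @dataclass
   class VapTagPolicy:
       mode: CTagMode = CTagMode.PASS_THROUGH
       allowed_ctags: Set[int] = field(default_factory=set)

       def apply(self, ctag: Optional[int]) -> Optional[int]:
           """Return the C-TAG to carry across the overlay, or raise if disallowed."""
           if self.mode is CTagMode.PASS_THROUGH:
               return ctag
           if self.mode is CTagMode.STRIP:
               return None
           if ctag is not None and ctag in self.allowed_ctags:
               return ctag
           raise ValueError(f"C-TAG {ctag} not permitted on this VAP")

   # Example: a VAP that only accepts frames tagged with C-VID 100 or 200.
   policy = VapTagPolicy(mode=CTagMode.ALLOWED_LIST, allowed_ctags={100, 200})
   assert policy.apply(100) == 100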
315 Note that the handling of C-VIDs has additional complications, as 316 described in Section 4.2.1 below. 318 3.1.2. TTL Considerations 320 For L3 service, Tenant Systems should expect the TTL of the packets 321 they send to be decremented by at least 1. For L2 service, the TTL 322 on packets (when the packet is IP) is not modified. 324 3.2. Network Virtualization Edge (NVE) 326 Tenant Systems connect to NVEs via a Tenant System Interface (TSI). 327 The TSI logically connects to the NVE via a Virtual Access Point 328 (VAP) and each VAP is associated with one Virtual Network as shown in 329 Figure 2. To the Tenant System, the TSI is like a NIC; the TSI 330 presents itself to a Tenant System as a normal network interface. On 331 the NVE side, a VAP is a logical network port (virtual or physical) 332 into a specific virtual network. Note that two different Tenant 333 Systems (and TSIs) attached to a common NVE can share a VAP (e.g., 334 TS1 and TS2 in Figure 2) so long as they connect to the same Virtual 335 Network. 337 | Data Center Network (IP) | 338 | | 339 +-----------------------------------------+ 340 | | 341 | Tunnel Overlay | 342 +------------+---------+ +---------+------------+ 343 | +----------+-------+ | | +-------+----------+ | 344 | | Overlay Module | | | | Overlay Module | | 345 | +---------+--------+ | | +---------+--------+ | 346 | | | | | | 347 NVE1 | | | | | | NVE2 348 | +--------+-------+ | | +--------+-------+ | 349 | | VNI1 VNI2 | | | | VNI1 VNI2 | | 350 | +-+----------+---+ | | +-+-----------+--+ | 351 | | VAP1 | VAP2 | | | VAP1 | VAP2| 352 +----+----------+------+ +----+-----------+-----+ 353 | | | | 354 |\ | | | 355 | \ | | /| 356 -------+--\-------+-------------------+---------/-+------- 357 | \ | Tenant | / | 358 TSI1 |TSI2\ | TSI3 TSI1 TSI2/ TSI3 359 +---+ +---+ +---+ +---+ +---+ +---+ 360 |TS1| |TS2| |TS3| |TS4| |TS5| |TS6| 361 +---+ +---+ +---+ +---+ +---+ +---+ 363 Figure 2: NVE Reference Model 365 The Overlay Module performs the actual encapsulation and 366 decapsulation of tunneled packets. The NVE maintains state about the 367 virtual networks it is a part of so that it can provide the Overlay 368 Module with such information as the destination address of the NVE to 369 tunnel a packet to, or the Context ID that should be placed in the 370 encapsulation header to identify the virtual network that a tunneled 371 packet belongs to. 373 On the data center network side, the NVE sends and receives native IP 374 traffic. When ingressing traffic from a Tenant System, the NVE 375 identifies the egress NVE to which the packet should be sent, adds an 376 overlay encapsulation header, and sends the packet on the underlay 377 network. When receiving traffic from a remote NVE, an NVE strips off 378 the encapsulation header, and delivers the (original) packet to the 379 appropriate Tenant System. When the source and destination Tenant 380 System are on the same NVE, no encapsulation is needed and the NVE 381 forwards traffic directly. 383 Conceptually, the NVE is a single entity implementing the NVO3 384 functionality. In practice, there are a number of different 385 implementation scenarios, as described in detail in Section 4. 387 3.3. Network Virtualization Authority (NVA) 389 Address dissemination refers to the process of learning, building and 390 distributing the mapping/forwarding information that NVEs need in 391 order to tunnel traffic to each other on behalf of communicating 392 Tenant Systems. 
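To make the role of this mapping/forwarding information concrete, the sketch below models a per-VN table of inner (tenant) to outer (NVE underlay) address mappings and the resulting forwarding decision at an ingress NVE: deliver locally, tunnel to the recorded egress NVE, or fall back to asking the NVA. It is purely illustrative; the names and values used (NveMapper, the 192.0.2.x addresses, VNID 1217) are hypothetical and not protocol elements.

   from typing import Dict, Optional, Tuple

   # Per-VN tables: VN context ID -> {inner (tenant) address -> outer (NVE) IP}
   MappingTables = Dict[int, Dict[str, str]]

   class NveMapper:
       """Minimal sketch of the state an NVE consults when forwarding."""

       def __init__(self, local_nve_ip: str):
           self.local_nve_ip = local_nve_ip
           self.tables: MappingTables = {}

       def learn(self, vnid: int, inner_addr: str, egress_nve_ip: str) -> None:
           """Install a mapping, e.g., as received from the NVA."""
           self.tables.setdefault(vnid, {})[inner_addr] = egress_nve_ip

       def forward_decision(self, vnid: int, inner_dst: str) -> Tuple[str, Optional[str]]:
           """Return ("local", None), ("tunnel", egress_ip), or ("query-nva", None)."""
           egress = self.tables.get(vnid, {}).get(inner_dst)
           if egress is None:
               return ("query-nva", None)     # unknown destination: ask the NVA
           if egress == self.local_nve_ip:
               return ("local", None)         # destination TS attached to this NVE
           return ("tunnel", egress)          # encapsulate with the VN context and send

   nve = NveMapper(local_nve_ip="192.0.2.1")
   nve.learn(vnid=1217, inner_addr="00:11:22:33:44:55", egress_nve_ip="192.0.2.7")
   print(nve.forward_decision(1217, "00:11:22:33:44:55"))   # ('tunnel', '192.0.2.7')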
For example, in order to send traffic to a remote 393 Tenant System, the sending NVE must know the destination NVE for that 394 Tenant System. 396 One way to build and maintain mapping tables is to use learning, as 397 802.1 bridges do [IEEE-802.1Q]. When forwarding traffic to multicast 398 or unknown unicast destinations, an NVE could simply flood traffic. 399 While flooding works, it can lead to traffic hot spots and can lead 400 to problems in larger networks. 402 Alternatively, to reduce the scope of where flooding must take place, 403 or to eliminate it altogether, NVEs can make use of a Network 404 Virtualization Authority (NVA). An NVA is the entity that provides 405 address mapping and other information to NVEs. NVEs interact with an 406 NVA to obtain any required address mapping information they need in 407 order to properly forward traffic on behalf of tenants. The term NVA 408 refers to the overall system, without regard to its scope or how it 409 is implemented. NVAs provide a service, and NVEs access that service 410 via an NVE-to-NVA protocol as discussed in Section 4.3. 412 Even when an NVA is present, Ethernet bridge MAC address learning 413 could be used as a fallback mechanism, should the NVA be unable to 414 provide an answer or for other reasons. This document does not 415 consider flooding approaches in detail, as there are a number of 416 benefits in using an approach that depends on the presence of an NVA. 418 For the rest of this document, it is assumed that an NVA exists and 419 will be used. NVAs are discussed in more detail in Section 7. 421 3.4. VM Orchestration Systems 423 VM orchestration systems manage server virtualization across a set of 424 servers. Although VM management is a separate topic from network 425 virtualization, the two areas are closely related. Managing the 426 creation, placement, and movement of VMs also involves creating, 427 attaching to, and detaching from virtual networks. A number of 428 existing VM orchestration systems have incorporated aspects of 429 virtual network management into their systems. 431 Note also that although this section uses the terms "VM" and 432 "hypervisor" throughout, the same issues apply to other 433 virtualization approaches, including Linux Containers (LXC), BSD 434 Jails, Network Service Appliances as discussed in Section 5.1, etc. 435 From an NVO3 perspective, it should be assumed that where the 436 document uses the terms "VM" and "hypervisor", the intention is that 437 the discussion also applies to other systems, where, e.g., the host 438 operating system plays the role of the hypervisor in supporting 439 virtualization, and a container plays the equivalent role as a VM. 441 When a new VM image is started, the VM orchestration system 442 determines where the VM should be placed, interacts with the 443 hypervisor on the target server to load and start the VM, and controls 444 when a VM should be shut down or migrated elsewhere. VM orchestration 445 systems also have knowledge about how a VM should connect to a 446 network, possibly including the name of the virtual network to which 447 a VM is to connect. The VM orchestration system can pass such 448 information to the hypervisor when a VM is instantiated. VM 449 orchestration systems have significant (and sometimes global) 450 knowledge over the domain they manage. They typically know on what 451 servers a VM is running, and meta data associated with VM images can 452 be useful from a network virtualization perspective.
For example, 453 the meta data may include the addresses (MAC and IP) the VMs will use 454 and the name(s) of the virtual network(s) they connect to. 456 VM orchestration systems run a protocol with an agent running on the 457 hypervisor of the servers they manage. That protocol can also carry 458 information about what virtual network a VM is associated with. When 459 the orchestrator instantiates a VM on a hypervisor, the hypervisor 460 interacts with the NVE in order to attach the VM to the virtual 461 networks it has access to. In general, the hypervisor will need to 462 communicate significant VM state changes to the NVE. In the reverse 463 direction, the NVE may need to communicate network connectivity 464 information back to the hypervisor. Example VM orchestration systems 465 in use today include VMware's vCenter Server, Microsoft's System 466 Center Virtual Machine Manager, and systems based on OpenStack and 467 its associated plugins (e.g., Nova and Neutron). These systems can pass 468 information about what virtual networks a VM connects to down to the 469 hypervisor. The protocol used between the VM orchestration system 470 and hypervisors is generally proprietary. 472 It should be noted that VM orchestration systems may not have direct 473 access to all networking-related information a VM uses. For example, 474 a VM may make use of additional IP or MAC addresses that the VM 475 management system is not aware of. 477 4. Network Virtualization Edge (NVE) 479 As introduced in Section 3.2, an NVE is the entity that implements the 480 overlay functionality. This section describes NVEs in more detail. 481 An NVE will have two external interfaces: 483 Tenant System Facing: On the Tenant System facing side, an NVE 484 interacts with the hypervisor (or equivalent entity) to provide 485 the NVO3 service. An NVE will need to be notified when a Tenant 486 System "attaches" to a virtual network (so it can validate the 487 request and set up any state needed to send and receive traffic on 488 behalf of the Tenant System on that VN). Likewise, an NVE will 489 need to be informed when the Tenant System "detaches" from the 490 virtual network so that it can reclaim state and resources 491 appropriately. 493 Data Center Network Facing: On the data center network facing side, 494 an NVE interfaces with the data center underlay network, sending 495 and receiving tunneled TS packets to and from the underlay. The 496 NVE may also run a control protocol with other entities on the 497 network, such as the Network Virtualization Authority. 499 4.1. NVE Co-located With Server Hypervisor 501 When server virtualization is used, the entire NVE functionality will 502 typically be implemented as part of the hypervisor and/or virtual 503 switch on the server. In such cases, the Tenant System interacts 504 with the hypervisor and the hypervisor interacts with the NVE. 505 Because the interaction between the hypervisor and NVE is implemented 506 entirely in software on the server, there is no "on-the-wire" 507 protocol between Tenant Systems (or the hypervisor) and the NVE that 508 needs to be standardized. While there may be APIs between the NVE 509 and hypervisor to support necessary interaction, the details of such 510 an API are not in scope for the IETF to work on. 512 Implementing NVE functionality entirely on a server has the 513 disadvantage that server CPU resources must be spent implementing the 514 NVO3 functionality.
Experimentation with overlay approaches and 515 previous experience with TCP and checksum adapter offloads suggests 516 that offloading certain NVE operations (e.g., encapsulation and 517 decapsulation operations) onto the physical network adapter can 518 produce performance advantages. As has been done with checksum and/ 519 or TCP server offload and other optimization approaches, there may be 520 benefits to offloading common operations onto adapters where 521 possible. Just as important, the addition of an overlay header can 522 disable existing adapter offload capabilities that are generally not 523 prepared to handle the addition of a new header or other operations 524 associated with an NVE. 526 While the exact details of how to split the implementation of 527 specific NVE functionality between a server and its network adapters 528 is an implementation matter and outside the scope of IETF 529 standardization, the NVO3 architecture should be cognizant of and 530 support such separation. Ideally, it may even be possible to bypass 531 the hypervisor completely on critical data path operations so that 532 packets between a TS and its VN can be sent and received without 533 having the hypervisor involved in each individual packet operation. 535 4.2. Split-NVE 537 Another possible scenario leads to the need for a split NVE 538 implementation. An NVE running on a server (e.g. within a 539 hypervisor) could support NVO3 towards the tenant, but not perform 540 all NVE functions (e.g., encapsulation) directly on the server; some 541 of the actual NVO3 functionality could be implemented on (i.e., 542 offloaded to) an adjacent switch to which the server is attached. 543 While one could imagine a number of link types between a server and 544 the NVE, one simple deployment scenario would involve a server and 545 NVE separated by a simple L2 Ethernet link. A more complicated 546 scenario would have the server and NVE separated by a bridged access 547 network, such as when the NVE resides on a ToR, with an embedded 548 switch residing between servers and the ToR. 550 For the split NVE case, protocols will be needed that allow the 551 hypervisor and NVE to negotiate and setup the necessary state so that 552 traffic sent across the access link between a server and the NVE can 553 be associated with the correct virtual network instance. 554 Specifically, on the access link, traffic belonging to a specific 555 Tenant System would be tagged with a specific VLAN C-TAG that 556 identifies which specific NVO3 virtual network instance it connects 557 to. The hypervisor-NVE protocol would negotiate which VLAN C-TAG to 558 use for a particular virtual network instance. More details of the 559 protocol requirements for functionality between hypervisors and NVEs 560 can be found in [I-D.ietf-nvo3-nve-nva-cp-req]. 562 4.2.1. Tenant VLAN handling in Split-NVE Case 564 Preserving tenant VLAN tags across NVO3 as described in Section 3.1.1 565 poses additional complications in the split-NVE case. The portion of 566 the NVE that performs the encapsulation function needs access to the 567 specific VLAN tags that the Tenant System is using in order to 568 include them in the encapsulated packet. When an NVE is implemented 569 entirely within the hypervisor, the NVE has access to the complete 570 original packet (including any VLAN tags) sent by the tenant. 
In the 571 split-NVE case, however, the VLAN tag used between the hypervisor and 572 offloaded portions of the NVE normally only identify the specific VN 573 that traffic belongs to. In order to allow a tenant to preserve VLAN 574 information in the split-NVE case, additional mechanisms would be 575 needed. 577 4.3. NVE State 579 NVEs maintain internal data structures and state to support the 580 sending and receiving of tenant traffic. An NVE may need some or all 581 of the following information: 583 1. An NVE keeps track of which attached Tenant Systems are connected 584 to which virtual networks. When a Tenant System attaches to a 585 virtual network, the NVE will need to create or update local 586 state for that virtual network. When the last Tenant System 587 detaches from a given VN, the NVE can reclaim state associated 588 with that VN. 590 2. For tenant unicast traffic, an NVE maintains a per-VN table of 591 mappings from Tenant System (inner) addresses to remote NVE 592 (outer) addresses. 594 3. For tenant multicast (or broadcast) traffic, an NVE maintains a 595 per-VN table of mappings and other information on how to deliver 596 tenant multicast (or broadcast) traffic. If the underlying 597 network supports IP multicast, the NVE could use IP multicast to 598 deliver tenant traffic. In such a case, the NVE would need to 599 know what IP underlay multicast address to use for a given VN. 600 Alternatively, if the underlying network does not support 601 multicast, an NVE could use serial unicast to deliver traffic. 602 In such a case, an NVE would need to know which remote NVEs are 603 participating in the VN. An NVE could use both approaches, 604 switching from one mode to the other depending on such factors as 605 bandwidth efficiency and group membership sparseness. 607 4. An NVE maintains necessary information to encapsulate outgoing 608 traffic, including what type of encapsulation and what value to 609 use for a Context ID within the encapsulation header. 611 5. In order to deliver incoming encapsulated packets to the correct 612 Tenant Systems, an NVE maintains the necessary information to map 613 incoming traffic to the appropriate VAP (i.e., Tenant System 614 Interface). 616 6. An NVE may find it convenient to maintain additional per-VN 617 information such as QoS settings, Path MTU information, ACLs, 618 etc. 620 4.4. Multi-Homing of NVEs 622 NVEs may be multi-homed. That is, an NVE may have more than one IP 623 address associated with it on the underlay network. Multihoming 624 happens in two different scenarios. First, an NVE may have multiple 625 interfaces connecting it to the underlay. Each of those interfaces 626 will typically have a different IP address, resulting in a specific 627 Tenant Address (on a specific VN) being reachable through the same 628 NVE but through more than one underlay IP address. Second, a 629 specific tenant system may be reachable through more than one NVE, 630 each having one or more underlay addresses. In both cases, NVE 631 address mapping functionality needs to support one-to-many mappings 632 and enable a sending NVE to (at a minimum) be able to fail over from 633 one IP address to another, e.g., should a specific NVE underlay 634 address become unreachable. 636 Finally, multi-homed NVEs introduce complexities when serial unicast 637 is used to implement tenant multicast as described in Section 4.3. 638 Specifically, an NVE should only receive one copy of a replicated 639 packet. 641 Multi-homing is needed to support important use cases. 
First, a bare 642 metal server may have multiple uplink connections to either the same 643 or different NVEs. Having only a single physical path to an upstream 644 NVE, or indeed having all traffic flow through a single NVE, would be 645 considered unacceptable in highly resilient deployment scenarios that 646 seek to avoid single points of failure. Moreover, in today's 647 networks, the availability of multiple paths would require that they 648 be usable in an active-active fashion (e.g., for load balancing). 650 4.5. VAP 652 The VAP is the NVE side of the interface between the NVE and the TS. 653 Traffic to and from the tenant flows through the VAP. If an NVE runs 654 into difficulties sending traffic received on the VAP, it may need to 655 signal such errors back to the VAP. Because the VAP is an emulation 656 of a physical port, its ability to signal NVE errors is limited and 657 lacks sufficient granularity to reflect all possible errors an NVE 658 may encounter (e.g., inability to reach a particular destination). Some 659 errors, such as an NVE losing all of its connections to the underlay, 660 could be reflected back to the VAP by effectively disabling it. This 661 state change would reflect itself on the TS as an interface going 662 down, allowing the TS to implement interface error handling, e.g., 663 failover, in the same manner as when a physical interface becomes 664 disabled. 666 5. Tenant System Types 668 This section describes a number of special Tenant System types and 669 how they fit into an NVO3 system. 671 5.1. Overlay-Aware Network Service Appliances 673 Some Network Service Appliances [I-D.ietf-nvo3-nve-nva-cp-req] 674 (virtual or physical) provide tenant-aware services. That is, the 675 specific service they provide depends on the identity of the tenant 676 making use of the service. For example, firewalls are now becoming 677 available that support multi-tenancy where a single firewall provides 678 virtual firewall service on a per-tenant basis, using per-tenant 679 configuration rules and maintaining per-tenant state. Such 680 appliances will be aware of the VN an activity corresponds to while 681 processing requests. Unlike server virtualization, which shields VMs 682 from needing to know about multi-tenancy, a Network Service Appliance 683 may explicitly support multi-tenancy. In such cases, the Network 684 Service Appliance itself will be aware of network virtualization and 685 either embed an NVE directly, or implement a split NVE as described 686 in Section 4.2. Unlike server virtualization, however, the Network 687 Service Appliance may not be running a hypervisor and the VM 688 orchestration system may not interact with the Network Service 689 Appliance. The NVE on such appliances will need to support a control 690 plane to obtain the information needed to fully participate 691 in an NVO3 Domain. 693 5.2. Bare Metal Servers 695 Many data centers will continue to have at least some servers 696 operating as non-virtualized (or "bare metal") machines running a 697 traditional operating system and workload. In such systems, there 698 will be no NVE functionality on the server, and the server will have 699 no knowledge of NVO3 (including whether overlays are even in use). 700 In such environments, the NVE functionality can reside on the first- 701 hop physical switch.
In such a case, the network administrator would 702 (manually) configure the switch to enable the appropriate NVO3 703 functionality on the switch port connecting the server and associate 704 that port with a specific virtual network. Such configuration would 705 typically be static, since the server is not virtualized, and once 706 configured, is unlikely to change frequently. Consequently, this 707 scenario does not require any protocol or standards work. 709 5.3. Gateways 711 Gateways on VNs relay traffic onto and off of a virtual network. 712 Tenant Systems use gateways to reach destinations outside of the 713 local VN. Gateways receive encapsulated traffic from one VN, remove 714 the encapsulation header, and send the native packet out onto the 715 data center network for delivery. Outside traffic enters a VN in a 716 reverse manner. 718 Gateways can be either virtual (i.e., implemented as a VM) or 719 physical (i.e., as a standalone physical device). For performance 720 reasons, standalone hardware gateways may be desirable in some cases. 721 Such gateways could consist of a simple switch forwarding traffic 722 from a VN onto the local data center network, or could embed router 723 functionality. On such gateways, network interfaces connecting to 724 virtual networks will (at least conceptually) embed NVE (or split- 725 NVE) functionality within them. As in the case with Network Service 726 Appliances, gateways may not support a hypervisor and will need an 727 appropriate control plane protocol to obtain the information needed 728 to provide NVO3 service. 730 Gateways handle several different use cases. For example, one use 731 case consists of systems supporting overlays together with systems 732 that do not (e.g., bare metal servers). Gateways could be used to 733 connect legacy systems supporting, e.g., L2 VLANs, to specific 734 virtual networks, effectively making them part of the same virtual 735 network. Gateways could also forward traffic between a virtual 736 network and other hosts on the data center network or relay traffic 737 between different VNs. Finally, gateways can provide external 738 connectivity such as Internet or VPN access. 740 5.3.1. Gateway Taxonomy 742 As can be seen from the discussion above, there are several types of 743 gateways that can exist in an NVO3 environment. This section breaks 744 them down into the various types that could be supported. Note that 745 each of the types below could be implemented in either a centralized 746 manner or distributed to co-exist with the NVEs. 748 5.3.1.1. L2 Gateways (Bridging) 750 L2 Gateways act as layer 2 bridges to forward Ethernet frames based 751 on the MAC addresses present in them. 753 L2 VN to Legacy L2: This type of gateway bridges traffic between L2 754 VNs and other legacy L2 networks such as VLANs or L2 VPNs. 756 L2 VN to L2 VN: The main motivation for this type of gateway is to 757 create separate groups of Tenant Systems using L2 VNs such that 758 the gateway can enforce network policies between each L2 VN. 760 5.3.1.2. L3 Gateways (Only IP Packets) 762 L3 Gateways forward IP packets based on the IP addresses present in 763 the packets. 765 L3 VN to Legacy L2: This type of gateway forwards packets between 766 L3 VNs and legacy L2 networks such as VLANs or L2 VPNs. The 767 MAC address in any frames forwarded onto the legacy L2 768 network would be the MAC address of the gateway. 770 L3 VN to Legacy L3: This type of gateway forwards packets between L3 771 VNs and legacy L3 networks.
These legacy L3 networks could be 772 local to the data center, in the WAN, or an L3 VPN. 774 L3 VN to L2 VN: This type of gateway forwards packets between L3 775 VNs and L2 VNs. The MAC address in any frames forwarded 776 onto the L2 VN would be the MAC address of the gateway. 778 L2 VN to L2 VN: This type of gateway acts similarly to a traditional 779 router that forwards between L2 interfaces. The MAC address in 780 any frames forwarded between the L2 VNs would be the MAC 781 address of the gateway. 783 L3 VN to L3 VN: The main motivation for this type of gateway is to 784 create separate groups of Tenant Systems using L3 VNs such that 785 the gateway can enforce network policies between each L3 VN. 787 5.4. Distributed Inter-VN Gateways 789 The relaying of traffic from one VN to another deserves special 790 consideration. Whether traffic is permitted to flow from one VN to 791 another is a matter of policy, and would not (by default) be allowed 792 unless explicitly enabled. In addition, NVAs are the logical place 793 to maintain policy information about allowed inter-VN communication. 794 Policy enforcement for inter-VN communication can be handled in (at 795 least) two different ways. Explicit gateways could be the central 796 point for such enforcement, with all inter-VN traffic forwarded to 797 such gateways for processing. Alternatively, the NVA can provide 798 such information directly to NVEs, by either providing a mapping for 799 a target TS on another VN, or indicating that such communication is 800 disallowed by policy. 802 When inter-VN gateways are centralized, traffic between TSs on 803 different VNs can take suboptimal paths, i.e., triangular routing 804 results in paths that always traverse the gateway. In the worst 805 case, traffic between two TSs connected to the same NVE can be hair- 806 pinned through an external gateway. As an optimization, individual 807 NVEs can be part of a distributed gateway that performs such 808 relaying, reducing or completely eliminating triangular routing. In 809 a distributed gateway, each ingress NVE can perform such relaying 810 activity directly, so long as it has access to the policy information 811 needed to determine whether cross-VN communication is allowed. 812 Having individual NVEs be part of a distributed gateway allows them 813 to tunnel traffic directly to the destination NVE without the need to 814 take suboptimal paths. 816 The NVO3 architecture must support distributed gateways for the case 817 of inter-VN communication. Such support requires that NVO3 control 818 protocols include mechanisms for the maintenance and distribution of 819 policy information about what type of cross-VN communication is 820 allowed so that NVEs acting as distributed gateways can tunnel 821 traffic from one VN to another as appropriate. 823 Distributed gateways could also be used to distribute other 824 traditional router services to individual NVEs. The NVO3 825 architecture does not preclude such implementations, but does not 826 define or require them as they are outside the scope of NVO3. 828 5.5. ARP and Neighbor Discovery 830 For an L2 service, strictly speaking, special processing of ARP 831 [RFC0826] (and IPv6 Neighbor Discovery (ND) [RFC4861]) is not 832 required. ARP requests are broadcast, and NVO3 can deliver ARP 833 requests to all members of a given L2 virtual network, just as it 834 does for any packet sent to an L2 broadcast address.
Similarly, ND 835 requests are sent via IP multicast, which NVO3 can support by 836 delivering via L2 multicast. However, as a performance optimization, 837 an NVE can intercept ARP (or ND) requests from its attached TSs and 838 respond to them directly using information in its mapping tables. 839 Since an NVE will have mechanisms for determining the NVE address 840 associated with a given TS, the NVE can leverage the same mechanisms 841 to suppress sending ARP and ND requests for a given TS to other 842 members of the VN. The NVO3 architecture must support such a 843 capability. 845 6. NVE-NVE Interaction 847 Individual NVEs will interact with each other for the purposes of 848 tunneling and delivering traffic to remote TSs. At a minimum, a 849 control protocol may be needed for tunnel setup and maintenance. For 850 example, tunneled traffic may need to be encrypted or integrity 851 protected, in which case it will be necessary to set up appropriate 852 security associations between NVE peers. It may also be desirable to 853 perform tunnel maintenance (e.g., continuity checks) on a tunnel in 854 order to detect when a remote NVE becomes unreachable. Such generic 855 tunnel setup and maintenance functions are not generally 856 NVO3-specific. Hence, NVO3 expects to leverage existing tunnel 857 maintenance protocols rather than defining new ones. 859 Some NVE-NVE interactions may be specific to NVO3 (and in particular 860 be related to information kept in mapping tables) and agnostic to the 861 specific tunnel type being used. For example, when tunneling traffic 862 for TS-X to a remote NVE, it is possible that TS-X is not presently 863 associated with the remote NVE. Normally, this should not happen, 864 but there could be race conditions where the information an NVE has 865 learned from the NVA is out-of-date relative to actual conditions. 866 In such cases, the remote NVE could return an error or warning 867 indication, allowing the sending NVE to attempt a recovery or 868 otherwise attempt to mitigate the situation. 870 The NVE-NVE interaction could signal a range of indications, for 871 example: 873 o "No such TS here", upon a receipt of a tunneled packet for an 874 unknown TS. 876 o "TS-X not here, try the following NVE instead" (i.e., a redirect). 878 o Delivered to correct NVE, but could not deliver packet to TS-X 879 (soft error). 881 o Delivered to correct NVE, but could not deliver packet to TS-X 882 (hard error). 884 When an NVE receives information from a remote NVE that conflicts 885 with the information it has in its own mapping tables, it should 886 consult with the NVA to resolve those conflicts. In particular, it 887 should confirm that the information it has is up-to-date, and it 888 might indicate the error to the NVA, so as to nudge the NVA into 889 following up (as appropriate). While it might make sense for an NVE 890 to update its mapping table temporarily in response to an error from 891 a remote NVE, any changes must be handled carefully as doing so can 892 raise security considerations if the received information cannot be 893 authenticated. That said, a sending NVE might still take steps to 894 mitigate a problem, such as applying rate limiting to data traffic 895 towards a particular NVE or TS. 897 7. Network Virtualization Authority 899 Before sending to and receiving traffic from a virtual network, an 900 NVE must obtain the information needed to build its internal 901 forwarding tables and state as listed in Section 4.3. 
An NVE can 902 obtain such information from a Network Virtualization Authority. 904 The Network Virtualization Authority (NVA) is the entity that is 905 expected to provide address mapping and other information to NVEs. 906 NVEs can interact with an NVA to obtain any required information they 907 need in order to properly forward traffic on behalf of tenants. The 908 term NVA refers to the overall system, without regard to its scope 909 or how it is implemented. 911 7.1. How an NVA Obtains Information 913 There are two primary ways in which an NVA can obtain the address 914 dissemination information it manages. The NVA can obtain information 915 from the VM orchestration system and/or directly from the 916 NVEs themselves. 918 On virtualized systems, the NVA may be able to obtain the address 919 mapping information associated with VMs from the VM orchestration 920 system itself. If the VM orchestration system contains a master 921 database for all the virtualization information, having the NVA 922 obtain information directly from the orchestration system would be a 923 natural approach. Indeed, the NVA could effectively be co-located 924 with the VM orchestration system itself. In such systems, the VM 925 orchestration system communicates with the NVE indirectly through the 926 hypervisor. 928 However, as described in Section 4, not all NVEs are associated with 929 hypervisors. In such cases, NVAs cannot leverage VM orchestration 930 protocols to interact with an NVE and will instead need to peer 931 directly with them. By peering directly with an NVE, NVAs can obtain 932 information about the TSs connected to that NVE and can distribute 933 information to the NVE about the VNs those TSs are associated with. 934 For example, whenever a Tenant System attaches to an NVE, that NVE 935 would notify the NVA that the TS is now associated with that NVE. 936 Likewise, when a TS detaches from an NVE, that NVE would inform the 937 NVA. By communicating directly with NVEs, both the NVA and the NVE 938 are able to maintain up-to-date information about all active tenants 939 and the NVEs to which they are attached. 941 7.2. Internal NVA Architecture 943 For reliability and fault tolerance reasons, an NVA would be 944 implemented in a distributed or replicated manner without single 945 points of failure. How the NVA is implemented, however, is not 946 important to an NVE so long as the NVA provides a consistent and 947 well-defined interface to the NVE. For example, an NVA could be 948 implemented via database techniques whereby a server stores address 949 mapping information in a traditional (possibly replicated) database. 950 Alternatively, an NVA could be implemented in a distributed fashion 951 using an existing (or modified) routing protocol to maintain and 952 distribute mappings. So long as there is a clear interface between 953 the NVE and NVA, how an NVA is architected and implemented is not 954 important to an NVE. 956 A number of architectural approaches could be used to implement NVAs 957 themselves. NVAs manage address bindings and distribute them to 958 where they need to go. One approach would be to use Border Gateway 959 Protocol (BGP) [RFC4364] (possibly with extensions) and route 960 reflectors. Another approach could use a transaction-based database 961 model with replicated servers.
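Whichever internal technology is chosen, the behavior visible to an NVE reduces to a small registration/lookup service. The sketch below is a deliberately implementation-neutral illustration (the class and method names are hypothetical and not specified by NVO3); a BGP-based or database-based NVA would simply provide a different realization behind the same kind of interface.

   from abc import ABC, abstractmethod
   from typing import Dict, Optional

   class NvaMappingService(ABC):
       """Hypothetical NVE-facing interface an NVA could expose, regardless
       of whether it is built on a replicated database, BGP with route
       reflectors, or something else entirely."""

       @abstractmethod
       def register(self, vnid: int, inner_addr: str, nve_ip: str) -> None:
           """An NVE reports that a TS with inner_addr has attached to it on VN vnid."""

       @abstractmethod
       def lookup(self, vnid: int, inner_addr: str) -> Optional[str]:
           """Return the underlay IP of the NVE behind which inner_addr resides."""

   class InMemoryNva(NvaMappingService):
       """Trivial single-process realization, for illustration only."""

       def __init__(self) -> None:
           self._db: Dict[int, Dict[str, str]] = {}

       def register(self, vnid, inner_addr, nve_ip):
           self._db.setdefault(vnid, {})[inner_addr] = nve_ip

       def lookup(self, vnid, inner_addr):
           return self._db.get(vnid, {}).get(inner_addr)

   nva = InMemoryNva()
   nva.register(vnid=1217, inner_addr="00:11:22:33:44:55", nve_ip="192.0.2.7")
   assert nva.lookup(1217, "00:11:22:33:44:55") == "192.0.2.7"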
Because the implementation details 962 are local to an NVA, there is no need to pick exactly one solution 963 technology, so long as the external interfaces to the NVEs (and 964 remote NVAs) are sufficiently well defined to achieve 965 interoperability. 967 7.3. NVA External Interface 969 Conceptually, from the perspective of an NVE, an NVA is a single 970 entity. An NVE interacts with the NVA, and it is the NVA's 971 responsibility for ensuring that interactions between the NVE and NVA 972 result in consistent behavior across the NVA and all other NVEs using 973 the same NVA. Because an NVA is built from multiple internal 974 components, an NVA will have to ensure that information flows to all 975 internal NVA components appropriately. 977 One architectural question is how the NVA presents itself to the NVE. 978 For example, an NVA could be required to provide access via a single 979 IP address. If NVEs only have one IP address to interact with, it 980 would be the responsibility of the NVA to handle NVA component 981 failures, e.g., by using a "floating IP address" that migrates among 982 NVA components to ensure that the NVA can always be reached via the 983 one address. Having all NVA accesses through a single IP address, 984 however, adds constraints to implementing robust failover, load 985 balancing, etc. 987 In the NVO3 architecture, an NVA is accessed through one or more IP 988 addresses (or IP address/port combination). If multiple IP addresses 989 are used, each IP address provides equivalent functionality, meaning 990 that an NVE can use any of the provided addresses to interact with 991 the NVA. Should one address stop working, an NVE is expected to 992 failover to another. While the different addresses result in 993 equivalent functionality, one address may respond more quickly than 994 another, e.g., due to network conditions, load on the server, etc. 996 To provide some control over load balancing, NVA addresses may have 997 an associated priority. Addresses are used in order of priority, 998 with no explicit preference among NVA addresses having the same 999 priority. To provide basic load-balancing among NVAs of equal 1000 priorities, NVEs could use some randomization input to select among 1001 equal-priority NVAs. Such a priority scheme facilitates failover and 1002 load balancing, for example, allowing a network operator to specify a 1003 set of primary and backup NVAs. 1005 It may be desirable to have individual NVA addresses responsible for 1006 a subset of information about an NV Domain. In such a case, NVEs 1007 would use different NVA addresses for obtaining or updating 1008 information about particular VNs or TS bindings. A key question with 1009 such an approach is how information would be partitioned, and how an 1010 NVE could determine which address to use to get the information it 1011 needs. 1013 Another possibility is to treat the information on which NVA 1014 addresses to use as cached (soft-state) information at the NVEs, so 1015 that any NVA address can be used to obtain any information, but NVEs 1016 are informed of preferences for which addresses to use for particular 1017 information on VNs or TS bindings. That preference information would 1018 be cached for future use to improve behavior - e.g., if all requests 1019 for a specific subset of VNs are forwarded to a specific NVA 1020 component, the NVE can optimize future requests within that subset by 1021 sending them directly to that NVA component via its address. 1023 8. 
NVE-to-NVA Protocol 1025 As outlined in Section 4.3, an NVE needs certain information in order 1026 to perform its functions. To obtain such information from an NVA, an 1027 NVE-to-NVA protocol is needed. The NVE-to-NVA protocol provides two 1028 functions. First it allows an NVE to obtain information about the 1029 location and status of other TSs with which it needs to communicate. 1030 Second, the NVE-to-NVA protocol provides a way for NVEs to provide 1031 updates to the NVA about the TSs attached to that NVE (e.g., when a 1032 TS attaches or detaches from the NVE), or about communication errors 1033 encountered when sending traffic to remote NVEs. For example, an NVE 1034 could indicate that a destination it is trying to reach at a 1035 destination NVE is unreachable for some reason. 1037 While having a direct NVE-to-NVA protocol might seem straightforward, 1038 the existence of existing VM orchestration systems complicates the 1039 choices an NVE has for interacting with the NVA. 1041 8.1. NVE-NVA Interaction Models 1043 An NVE interacts with an NVA in at least two (quite different) ways: 1045 o NVEs embedded within the same server as the hypervisor can obtain 1046 necessary information entirely through the hypervisor-facing side 1047 of the NVE. Such an approach is a natural extension to existing 1048 VM orchestration systems supporting server virtualization because 1049 an existing protocol between the hypervisor and VM orchestration 1050 system already exists and can be leveraged to obtain any needed 1051 information. Specifically, VM orchestration systems used to 1052 create, terminate and migrate VMs already use well-defined (though 1053 typically proprietary) protocols to handle the interactions 1054 between the hypervisor and VM orchestration system. For such 1055 systems, it is a natural extension to leverage the existing 1056 orchestration protocol as a sort of proxy protocol for handling 1057 the interactions between an NVE and the NVA. Indeed, existing 1058 implementations can already do this. 1060 o Alternatively, an NVE can obtain needed information by interacting 1061 directly with an NVA via a protocol operating over the data center 1062 underlay network. Such an approach is needed to support NVEs that 1063 are not associated with systems performing server virtualization 1064 (e.g., as in the case of a standalone gateway) or where the NVE 1065 needs to communicate directly with the NVA for other reasons. 1067 The NVO3 architecture will focus on support for the second model 1068 above. Existing virtualization environments are already using the 1069 first model. But they are not sufficient to cover the case of 1070 standalone gateways -- such gateways may not support virtualization 1071 and do not interface with existing VM orchestration systems. 1073 8.2. Direct NVE-NVA Protocol 1075 An NVE can interact directly with an NVA via an NVE-to-NVA protocol. 1076 Such a protocol can be either independent of the NVA internal 1077 protocol, or an extension of it. Using a purpose-specific protocol 1078 would provide architectural separation and independence between the 1079 NVE and NVA. The NVE and NVA interact in a well-defined way, and 1080 changes in the NVA (or NVE) do not need to impact each other. Using 1081 a dedicated protocol also ensures that both NVE and NVA 1082 implementations can evolve independently and without dependencies on 1083 each other. Such independence is important because the upgrade path 1084 for NVEs and NVAs is quite different. 
Requirements for a direct NVE-NVA protocol can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

8.3. Propagating Information Between NVEs and NVAs

Information flows between NVEs and NVAs in both directions.  The NVA maintains information about all VNs in the NV Domain so that NVEs do not need to do so themselves.  NVEs obtain from the NVA information about where a given remote TS destination resides.  NVAs, in turn, obtain information from NVEs about the individual TSs attached to those NVEs.

While the NVA could push information about every virtual network to every NVE, such an approach scales poorly and is unnecessary.  In practice, a given NVE will only need and want to know about VNs to which it is attached.  Thus, an NVE should be able to subscribe to updates only for the virtual networks it is interested in receiving updates for.  The NVO3 architecture supports a model where an NVE is not required to have full mapping tables for all virtual networks in an NV Domain.

Before sending unicast traffic to a remote TS (or TSes for broadcast or multicast traffic), an NVE must know where the remote TS(es) currently reside.  When a TS attaches to a virtual network, the NVE obtains information about that VN from the NVA.  The NVA can provide that information to the NVE at the time the TS attaches to the VN, either because the NVE requests the information when the attach operation occurs or because the VM orchestration system has initiated the attach operation and provides the associated mapping information to the NVE at the same time.

There are scenarios where an NVE may wish to query the NVA about individual mappings within a VN.  For example, when sending traffic to a remote TS on a remote NVE, that TS may become unavailable (e.g., because it has migrated elsewhere or has been shut down, in which case the remote NVE may return an error indication).  In such situations, the NVE may need to query the NVA to obtain updated mapping information for a specific TS or to verify that the information is still correct despite the error condition.  Note that such a query could also be used by the NVA as an indication that there may be an inconsistency in the network and that it should take steps to verify that the information it has about the current state and location of a specific TS is still correct.

For very large virtual networks, the amount of state an NVE needs to maintain for a given virtual network could be significant.  Moreover, an NVE may only be communicating with a small subset of the TSs on such a virtual network.  In such cases, the NVE may find it desirable to maintain state only for those destinations it is actively communicating with, rather than full mapping information about all destinations on the VN.  Should it then need to communicate with a destination for which it does not have mapping information, however, it will need to be able to query the NVA on demand for the missing information on a per-destination basis.
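The following sketch is illustrative only; the MappingCache class, the nva_lookup() callback, and the idle-timeout behavior are assumptions of this example rather than part of the architecture.  It shows one way an NVE might keep mappings only for destinations it is actively communicating with, querying the NVA on a cache miss and discarding entries that have not been used recently.

   import time

   class MappingCache:
       """Per-destination soft-state cache of inner-to-outer mappings.
       Entries are fetched from the NVA on demand and expire if unused."""

       def __init__(self, nva_lookup, max_idle_seconds=300):
           self.nva_lookup = nva_lookup   # callback that queries the NVA
           self.max_idle = max_idle_seconds
           self.entries = {}              # (vn, tenant_addr) -> (outer, last_used)

       def lookup(self, vn_context, tenant_address):
           key = (vn_context, tenant_address)
           entry = self.entries.get(key)
           if entry is None:
               # Cache miss: ask the NVA for this one destination only.
               outer = self.nva_lookup(vn_context, tenant_address)
               self.entries[key] = (outer, time.monotonic())
           else:
               # Cache hit: refresh the last-used timestamp.
               self.entries[key] = (entry[0], time.monotonic())
           return self.entries[key][0]

       def expire_idle(self):
           """Drop mappings not used recently to bound per-VN state."""
           now = time.monotonic()
           self.entries = {k: v for k, v in self.entries.items()
                           if now - v[1] <= self.max_idle}

       def invalidate(self, vn_context, tenant_address):
           """Called, e.g., after a delivery error, to force a refresh."""
           self.entries.pop((vn_context, tenant_address), None)

In this sketch, a delivery error reported by a remote NVE would cause the local NVE to call invalidate() and re-query the NVA, which (as noted above) also gives the NVA a hint that its own state for that TS may need verification.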
The NVO3 architecture will need to support a range of operations between the NVE and NVA.  Requirements for those operations can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

9. Federated NVAs

An NVA provides service to the set of NVEs in its NV Domain.  Each NVA manages network virtualization information for the virtual networks within its NV Domain.  An NV Domain is administered by a single entity.

In some cases, it will be necessary to expand the scope of a specific VN or even an entire NV Domain beyond a single NVA.  For example, an administrator managing multiple data centers may wish to operate all of those data centers as a single NV Region.  Such cases are handled by having different NVAs peer with each other to exchange mapping information about specific VNs.  NVAs operate in a federated manner, with a set of NVAs operating as a loosely coupled federation of individual NVAs.  If a virtual network spans multiple NVAs (e.g., located at different data centers), and an NVE needs to deliver tenant traffic to an NVE that is part of a different NV Domain, it still interacts only with its NVA, even when obtaining mappings for NVEs associated with a different NV Domain.

Figure 3 shows a scenario where two separate NV Domains (A and B) share information about Virtual Network 1217.  VM1 and VM2 both connect to the same Virtual Network 1217, even though the two VMs are in separate NV Domains.  There are two cases to consider.  In the first case, NV Domain B does not allow NVE-A to tunnel traffic directly to NVE-B.  There could be a number of reasons for this.  For example, NV Domains A and B may not share a common address space (i.e., traversal through a NAT device is required), or for policy reasons, a domain might require that all traffic between separate NV Domains be funneled through a particular device (e.g., a firewall).  In such cases, NVA-2 will advertise to NVA-1 that VM2 on Virtual Network 1217 is available and direct that traffic between the two nodes go through IP-G.  IP-G would then decapsulate received traffic from one NV Domain, translate it appropriately for the other domain, and re-encapsulate the packet for delivery.  In the second case, where NV Domain B does allow NVE-A to tunnel traffic directly to NVE-B, NVA-2 can instead advertise the mapping information needed for NVE-A to reach NVE-B directly, and no intermediate gateway is required.

[Figure 3 shows NV Domain A, containing VM1 attached via NVE-A and managed by NVA-1, and NV Domain B, containing VM2 attached via NVE-B and managed by NVA-2, interconnected through the gateway device IP-G.]

Figure 3: VM1 and VM2 are in different NV Domains.

NVAs at one site share information and interact with NVAs at other sites, but only in a controlled manner.  It is expected that policy and access control will be applied at the boundaries between different sites (and NVAs) so as to minimize dependencies on external NVAs that could negatively impact the operation within a site.  It is an architectural principle that operations involving NVAs at one site not be immediately impacted by failures or errors at another site.  (Of course, communication between NVEs in different NV Domains may be impacted by such failures or errors.)  It is a strong requirement that an NVA continue to operate properly for local NVEs even if external communication is interrupted (e.g., should communication between a local and remote NVA fail).
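As a purely illustrative sketch of applying policy at an NVA boundary (the export_mappings() function, the mapping schema, and the per-peer policy structure are assumptions of this example, not part of the architecture), an NVA might filter which VN mappings it advertises to each peer NVA and, as in the IP-G example above, rewrite the next hop so that inter-domain traffic is directed through a gateway:

   def export_mappings(local_mappings, peer_name, export_policy):
       """Return the subset of local VN mappings this NVA is willing to
       advertise to a given peer NVA, applying per-peer policy.

       local_mappings: list of dicts with 'vn', 'tenant_address', and
                       'nve_underlay_address' keys (illustrative schema).
       export_policy:  dict mapping peer name -> {'allowed_vns': set,
                       'gateway': underlay address or None}.
       """
       policy = export_policy.get(peer_name)
       if policy is None:
           return []                  # no policy for this peer: export nothing
       exported = []
       for m in local_mappings:
           if m["vn"] not in policy["allowed_vns"]:
               continue               # this VN is not shared with the peer
           advert = dict(m)
           if policy["gateway"] is not None:
               # Direct the peer's traffic through a gateway (e.g., IP-G)
               # rather than exposing the local NVE's underlay address.
               advert["nve_underlay_address"] = policy["gateway"]
           exported.append(advert)
       return exported

A corresponding import filter on the receiving NVA would decide which advertisements to accept and from whom, reflecting the export/import controls listed below.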
At a high level, a federation of interconnected NVAs has some analogies to BGP and Autonomous Systems.  Like an Autonomous System, NVAs at one site are managed by a single administrative entity and do not interact with external NVAs except as allowed by policy.  Likewise, the interface between NVAs at different sites is well defined, so that the internal details of operations at one site are largely hidden from other sites.  Finally, an NVA only peers with other NVAs that it has a trusted relationship with, i.e., where a VN is intended to span multiple NVAs.

Reasons for using a federated model include:

o  Provide isolation among NVAs operating at different sites at different geographic locations.

o  Control the quantity and rate of information updates that flow (and must be processed) between different NVAs in different data centers.

o  Control the set of external NVAs (and external sites) a site peers with.  A site will only peer with other sites that are cooperating in providing an overlay service.

o  Allow policy to be applied between sites.  A site will want to carefully control what information it exports (and to whom) as well as what information it is willing to import (and from whom).

o  Allow different protocols and architectures to be used for intra- vs. inter-NVA communication.  For example, within a single data center, a replicated transaction server using database techniques might be an attractive implementation option for an NVA, and protocols optimized for intra-NVA communication would likely be different from protocols involving inter-NVA communication between different sites.

o  Allow for optimized protocols rather than a one-size-fits-all approach.  Within a data center, networks tend to have lower latency, higher speed, and higher redundancy when compared with the WAN links interconnecting data centers.  The design constraints and tradeoffs for a protocol operating within a data center network are different from those for a protocol operating over WAN links.  While a single protocol could be used for both cases, there could be advantages to using different and more specialized protocols for the intra- and inter-NVA cases.

9.1. Inter-NVA Peering

To support peering between different NVAs, an inter-NVA protocol is needed.  The inter-NVA protocol defines what information is exchanged between NVAs.  It is assumed that the protocol will be used to share addressing information between data centers and must scale well over WAN links.

10. Control Protocol Work Areas

The NVO3 architecture consists of two major distinct entities: NVEs and NVAs.  In order to provide isolation and independence between these two entities, the NVO3 architecture calls for well-defined protocols for interfacing between them.
For an individual NVA, the architecture calls for a logically centralized entity that could be implemented in a distributed or replicated fashion.  While the IETF may choose to define one or more specific architectural approaches to building individual NVAs, there is little need for it to pick exactly one approach to the exclusion of others.  An NVA for a single domain will likely be deployed as a single vendor product; thus, there is little benefit in standardizing the internal structure of an NVA.

Individual NVAs peer with each other in a federated manner.  The NVO3 architecture calls for a well-defined interface between NVAs.

Finally, a hypervisor-to-NVE protocol is needed to cover the split-NVE scenario described in Section 4.2.

11. NVO3 Data Plane Encapsulation

When tunneling tenant traffic, NVEs add an encapsulation header to the original tenant packet.  The exact encapsulation to use for NVO3 does not seem to be critical.  The main requirement is that the encapsulation support a Context ID of sufficient size [I-D.ietf-nvo3-dataplane-requirements].  A number of encapsulations already exist that provide a VN Context of sufficient size for NVO3.  For example, VXLAN [RFC7348] has a 24-bit VXLAN Network Identifier (VNI).  NVGRE [I-D.sridharan-virtualization-nvgre] has a 24-bit Tenant Network ID (TNI).  MPLS-over-GRE provides a 20-bit label field.  While there is widespread recognition that a 12-bit VN Context would be too small (only 4096 distinct values), it is generally agreed that 20 bits (1 million distinct values) and 24 bits (16.8 million distinct values) are sufficient for a wide variety of deployment scenarios.

12. Operations and Management

The simplicity of operating and debugging overlay networks will be critical for successful deployment.  Some architectural choices can facilitate or hinder OAM.  Related OAM drafts include [I-D.ashwood-nvo3-operational-requirement].

13. Summary

This document presents the overall architecture for overlays in NVO3.  The architecture calls for three main areas of protocol work:

1.  A hypervisor-to-NVE protocol to support Split-NVEs as discussed in Section 4.2.

2.  An NVE-to-NVA protocol for disseminating VN information (e.g., inner-to-outer address mappings).

3.  An NVA-to-NVA protocol for exchanging information about specific virtual networks between federated NVAs.

It should be noted that existing protocols, or extensions of existing protocols, are applicable to this work.

14. Acknowledgments

Helpful comments and improvements to this document have come from Lizhong Jin, Anton Ivanov, Dennis (Xiaohong) Qin, Erik Smith, Ziye Yang, and Lucy Yong.

15. IANA Considerations

This memo includes no request to IANA.

16. Security Considerations

The NVO3 architecture will need to defend against a number of potential security threats involving both the data plane and the control plane.

For the data plane, tunneled application traffic may need protection against being misdelivered, modified, or having its content exposed to an inappropriate third party.  In all cases, encryption between authenticated tunnel endpoints can be used to mitigate risks.
For the control plane, a combination of authentication and encryption can be used between NVAs, between the NVA and NVE, as well as between the different components of the split-NVE approach.  All entities will need to properly authenticate with each other and enable encryption for their interactions as appropriate to protect sensitive information.

Leakage of sensitive information about users or other entities associated with VMs whose traffic is virtualized can also be covered by using encryption for the control plane protocols.

17. Informative References

[I-D.ashwood-nvo3-operational-requirement]
   Ashwood-Smith, P., Iyengar, R., Tsou, T., Sajassi, A., Boucadair, M., Jacquenet, C., and M. Daikoku, "NVO3 Operational Requirements", draft-ashwood-nvo3-operational-requirement-03 (work in progress), July 2013.

[I-D.ietf-nvo3-dataplane-requirements]
   Bitar, N., Lasserre, M., Balus, F., Morin, T., Jin, L., and B. Khasnabish, "NVO3 Data Plane Requirements", draft-ietf-nvo3-dataplane-requirements-03 (work in progress), April 2014.

[I-D.ietf-nvo3-nve-nva-cp-req]
   Kreeger, L., Dutt, D., Narten, T., and D. Black, "Network Virtualization NVE to NVA Control Protocol Requirements", draft-ietf-nvo3-nve-nva-cp-req-04 (work in progress), July 2015.

[I-D.sridharan-virtualization-nvgre]
   Garg, P. and Y. Wang, "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan-virtualization-nvgre-08 (work in progress), April 2015.

[IEEE-802.1Q]
   IEEE, "IEEE standard for local and metropolitan area networks: Media access control (MAC) bridges and virtual bridged local area networks", IEEE 802.1Q-2011, August 2011.

[RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware", STD 37, RFC 826, DOI 10.17487/RFC0826, November 1982.

[RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 2006.

[RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, DOI 10.17487/RFC4861, September 2007.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

[RFC7364]  Narten, T., Ed., Gray, E., Ed., Black, D., Fang, L., Kreeger, L., and M. Napierala, "Problem Statement: Overlays for Network Virtualization", RFC 7364, DOI 10.17487/RFC7364, October 2014.

[RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. Rekhter, "Framework for Data Center (DC) Network Virtualization", RFC 7365, DOI 10.17487/RFC7365, October 2014.

Appendix A. Change Log

A.1. Changes From draft-ietf-nvo3-arch-03 to -04

1.  First cut at a proper Security Considerations section.

2.  Fixed some obvious typos.

A.2. Changes From draft-ietf-nvo3-arch-02 to -03

1.  Removed "[Note:" comments from Sections 7.3 and 8.
2.  Removed the discussion stimulating the "[Note" comment from Section 8.1 and changed the text to note that the NVO3 architecture will focus on a model where all NVEs interact with the NVA.

3.  Added a subsection on NVO3 gateway taxonomy.

A.3. Changes From draft-ietf-nvo3-arch-01 to -02

1.  Minor editorial improvements after a close re-reading; references to the problem statement and framework updated to point to recently published RFCs.

2.  Added text making it more clear that other virtualization approaches, including Linux Containers, are intended to be fully supported in NVO3.

A.4. Changes From draft-ietf-nvo3-arch-00 to -01

1.  Miscellaneous text/section additions, including:

    *  New section on VLAN tag handling (Section 3.1.1).

    *  New section on tenant VLAN handling in the Split-NVE case (Section 4.2.1).

    *  New section on TTL handling (Section 3.1.2).

    *  New section on multi-homing of NVEs (Section 4.4).

    *  Two paragraphs of new text describing the L2/L3 combined service (Section 3.1).

    *  New section on VAPs (and error handling) (Section 4.5).

    *  New section on ARP and ND handling (Section 5.5).

    *  New section on NVE-to-NVE interactions (Section 6).

2.  Editorial cleanups from careful review by Erik Smith and Ziye Yang.

3.  Expanded text on Distributed Inter-VN Gateways.

A.5. Changes From draft-narten-nvo3 to draft-ietf-nvo3

1.  No changes between draft-narten-nvo3-arch-01 and draft-ietf-nvo3-arch-00.

A.6. Changes From -00 to -01 (of draft-narten-nvo3-arch)

1.  Editorial and clarity improvements.

2.  Replaced the "push vs. pull" section with a section more focused on triggers, where an event implies or triggers some action.

3.  Clarified text on the co-located NVE to show how offloading NVE functionality onto adapters is desirable.

4.  Added a new section on distributed gateways.

5.  Expanded the section on the NVA external interface, adding a requirement for the NVE to support multiple NVA IP addresses.

Authors' Addresses

David Black
EMC

Email: david.black@emc.com

Jon Hudson
Brocade
120 Holger Way
San Jose, CA 95134
USA

Email: jon.hudson@gmail.com

Lawrence Kreeger
Cisco

Email: kreeger@cisco.com

Marc Lasserre
Alcatel-Lucent

Email: marc.lasserre@alcatel-lucent.com

Thomas Narten
IBM

Email: narten@us.ibm.com