Internet Engineering Task Force                                 D. Black
Internet-Draft                                                       EMC
Intended status: Informational                                 J. Hudson
Expires: September 10, 2015                                      Brocade
                                                              L. Kreeger
                                                                   Cisco
                                                             M. Lasserre
                                                          Alcatel-Lucent
                                                               T. Narten
                                                                     IBM
                                                           March 9, 2015

             An Architecture for Overlay Networks (NVO3)
                        draft-ietf-nvo3-arch-03

Abstract

   This document presents a high-level overview architecture for
   building overlay networks in NVO3.  The architecture is given at a
   high-level, showing the major components of an overall system.  An
   important goal is to divide the space into individual smaller
   components that can be implemented independently and with clear
   interfaces and interactions with other components.  It should be
   possible to build and implement individual components in isolation
   and have them work with other components with no changes to other
   components.  That way implementers have flexibility in implementing
   individual components and can optimize and innovate within their
   respective components without requiring changes to other components.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 10, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (http://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document. Please review these documents 57 carefully, as they describe your rights and restrictions with respect 58 to this document. Code Components extracted from this document must 59 include Simplified BSD License text as described in Section 4.e of 60 the Trust Legal Provisions and are provided without warranty as 61 described in the Simplified BSD License. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 67 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 4 68 3.1. VN Service (L2 and L3) . . . . . . . . . . . . . . . . . 6 69 3.1.1. VLAN Tags in L2 Service . . . . . . . . . . . . . . . 7 70 3.1.2. TTL Considerations . . . . . . . . . . . . . . . . . 7 71 3.2. Network Virtualization Edge (NVE) . . . . . . . . . . . . 7 72 3.3. Network Virtualization Authority (NVA) . . . . . . . . . 9 73 3.4. VM Orchestration Systems . . . . . . . . . . . . . . . . 9 74 4. Network Virtualization Edge (NVE) . . . . . . . . . . . . . . 11 75 4.1. NVE Co-located With Server Hypervisor . . . . . . . . . . 11 76 4.2. Split-NVE . . . . . . . . . . . . . . . . . . . . . . . . 12 77 4.2.1. Tenant VLAN handling in Split-NVE Case . . . . . . . 12 78 4.3. NVE State . . . . . . . . . . . . . . . . . . . . . . . . 13 79 4.4. Multi-Homing of NVEs . . . . . . . . . . . . . . . . . . 14 80 4.5. VAP . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 81 5. Tenant System Types . . . . . . . . . . . . . . . . . . . . . 15 82 5.1. Overlay-Aware Network Service Appliances . . . . . . . . 15 83 5.2. Bare Metal Servers . . . . . . . . . . . . . . . . . . . 15 84 5.3. Gateways . . . . . . . . . . . . . . . . . . . . . . . . 16 85 5.3.1. Gateway Taxonomy . . . . . . . . . . . . . . . . . . 16 86 5.3.1.1. L2 Gateways (Bridging) . . . . . . . . . . . . . 16 87 5.3.1.2. L3 Gateways (Only IP Packets) . . . . . . . . . . 17 88 5.4. Distributed Inter-VN Gateways . . . . . . . . . . . . . . 17 89 5.5. ARP and Neighbor Discovery . . . . . . . . . . . . . . . 18 90 6. NVE-NVE Interaction . . . . . . . . . . . . . . . . . . . . . 18 91 7. Network Virtualization Authority . . . . . . . . . . . . . . 20 92 7.1. How an NVA Obtains Information . . . . . . . . . . . . . 20 93 7.2. Internal NVA Architecture . . . . . . . . . . . . . . . . 21 94 7.3. NVA External Interface . . . . . . . . . . . . . . . . . 21 95 8. NVE-to-NVA Protocol . . . . . . . . . . . . . . . . . . . . . 22 96 8.1. NVE-NVA Interaction Models . . . . . . . . . . . . . . . 23 97 8.2. Direct NVE-NVA Protocol . . . . . . . . . . . . . . . . . 23 98 8.3. Propagating Information Between NVEs and NVAs . . . . . . 24 99 9. Federated NVAs . . . . . . . . . . . . . . . . . . . . . . . 25 100 9.1. Inter-NVA Peering . . . . . . . . . . . . . . . . . . . . 27 101 10. Control Protocol Work Areas . . . . . . . . . . . . . . . . . 27 102 11. NVO3 Data Plane Encapsulation . . . . . . . . . . . . . . . . 28 103 12. Operations and Management . . . . . . . . . . . . . . . . . . 28 104 13. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 105 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 29 106 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 107 16. Security Considerations . . . . . . . . . . . . . . . . . . . 
                                                                      29
   17. Informative References  . . . . . . . . . . . . . . . . . . .  29
   Appendix A.  Change Log . . . . . . . . . . . . . . . . . . . . .  30
     A.1.  Changes From draft-ietf-nvo3-arch-02 to -03 . . . . . . .  30
     A.2.  Changes From draft-ietf-nvo3-arch-01 to -02 . . . . . . .  31
     A.3.  Changes From draft-ietf-nvo3-arch-00 to -01 . . . . . . .  31
     A.4.  Changes From draft-narten-nvo3 to draft-ietf-nvo3 . . . .  31
     A.5.  Changes From -00 to -01 (of draft-narten-nvo3-arch) . . .  31
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  32

1.  Introduction

   This document presents a high-level architecture for building overlay
   networks in NVO3.  The architecture is given at a high-level, showing
   the major components of an overall system.  An important goal is to
   divide the space into smaller individual components that can be
   implemented independently and with clear interfaces and interactions
   with other components.  It should be possible to build and implement
   individual components in isolation and have them work with other
   components with no changes to other components.  That way
   implementers have flexibility in implementing individual components
   and can optimize and innovate within their respective components
   without necessarily requiring changes to other components.

   The motivation for overlay networks is given in [RFC7364].
   "Framework for DC Network Virtualization" [RFC7365] provides a
   framework for discussing overlay networks generally and the various
   components that must work together in building such systems.  This
   document differs from the framework document in that it doesn't
   attempt to cover all possible approaches within the general design
   space.  Rather, it describes one particular approach.

2.  Terminology

   This document uses the same terminology as [RFC7365].  In addition,
   the following terms are used:

   NV Domain  A Network Virtualization Domain is an administrative
      construct that defines a Network Virtualization Authority (NVA),
      the set of Network Virtualization Edges (NVEs) associated with
      that NVA, and the set of virtual networks the NVA manages and
      supports.  NVEs are associated with a (logically centralized) NVA,
      and an NVE supports communication for any of the virtual networks
      in the domain.

   NV Region  A region over which information about a set of virtual
      networks is shared.  In the degenerate case, an NV Region
      corresponds to a single NV Domain.  The more interesting case
      occurs when two or more NV Domains share information about part
      or all of a set of virtual networks that they manage.  Two NVAs
      share information about particular virtual networks for the
      purpose of supporting connectivity between tenants located in
      different NV Domains.  NVAs can share information about an entire
      NV Domain or just individual virtual networks.

   Tenant System Interface (TSI)  Interface to a Virtual Network as
      presented to a Tenant System.  The TSI logically connects to the
      NVE via a Virtual Access Point (VAP).  To the Tenant System, the
      TSI is like a NIC; the TSI presents itself to a Tenant System as a
      normal network interface.

   VLAN  Unless stated otherwise, the terms VLAN and VLAN Tag are used
      in this document to denote a C-VLAN [IEEE-802.1Q], and the terms
      are used interchangeably to improve readability.

3.
Background 175 Overlay networks are an approach for providing network virtualization 176 services to a set of Tenant Systems (TSs) [RFC7365]. With overlays, 177 data traffic between tenants is tunneled across the underlying data 178 center's IP network. The use of tunnels provides a number of 179 benefits by decoupling the network as viewed by tenants from the 180 underlying physical network across which they communicate. 182 Tenant Systems connect to Virtual Networks (VNs), with each VN having 183 associated attributes defining properties of the network, such as the 184 set of members that connect to it. Tenant Systems connected to a 185 virtual network typically communicate freely with other Tenant 186 Systems on the same VN, but communication between Tenant Systems on 187 one VN and those external to the VN (whether on another VN or 188 connected to the Internet) is carefully controlled and governed by 189 policy. 191 A Network Virtualization Edge (NVE) [RFC7365] is the entity that 192 implements the overlay functionality. An NVE resides at the boundary 193 between a Tenant System and the overlay network as shown in Figure 1. 194 An NVE creates and maintains local state about each Virtual Network 195 for which it is providing service on behalf of a Tenant System. 197 +--------+ +--------+ 198 | Tenant +--+ +----| Tenant | 199 | System | | (') | System | 200 +--------+ | ................ ( ) +--------+ 201 | +-+--+ . . +--+-+ (_) 202 | | NVE|--. .--| NVE| | 203 +--| | . . | |---+ 204 +-+--+ . . +--+-+ 205 / . . 206 / . L3 Overlay . +--+-++--------+ 207 +--------+ / . Network . | NVE|| Tenant | 208 | Tenant +--+ . .- -| || System | 209 | System | . . +--+-++--------+ 210 +--------+ ................ 211 | 212 +----+ 213 | NVE| 214 | | 215 +----+ 216 | 217 | 218 ===================== 219 | | 220 +--------+ +--------+ 221 | Tenant | | Tenant | 222 | System | | System | 223 +--------+ +--------+ 225 Figure 1: NVO3 Generic Reference Model 227 The following subsections describe key aspects of an overlay system 228 in more detail. Section 3.1 describes the service model (Ethernet 229 vs. IP) provided to Tenant Systems. Section 3.2 describes NVEs in 230 more detail. Section 3.3 introduces the Network Virtualization 231 Authority, from which NVEs obtain information about virtual networks. 232 Section 3.4 provides background on VM orchestration systems and their 233 use of virtual networks. 235 3.1. VN Service (L2 and L3) 237 A Virtual Network provides either L2 or L3 service to connected 238 tenants. For L2 service, VNs transport Ethernet frames, and a Tenant 239 System is provided with a service that is analogous to being 240 connected to a specific L2 C-VLAN. L2 broadcast frames are generally 241 delivered to all (and multicast frames delivered to a subset of) the 242 other Tenant Systems on the VN. To a Tenant System, it appears as if 243 they are connected to a regular L2 Ethernet link. Within NVO3, 244 tenant frames are tunneled to remote NVEs based on the MAC addresses 245 of the frame headers as originated by the Tenant System. On the 246 underlay, NVO3 packets are forwarded between NVEs based on the outer 247 addresses of tunneled packets. 249 For L3 service, VNs transport IP datagrams, and a Tenant System is 250 provided with a service that only supports IP traffic. Within NVO3, 251 tenant frames are tunneled to remote NVEs based on the IP addresses 252 of the packet originated by the Tenant System; any L2 destination 253 addresses provided by Tenant Systems are effectively ignored. 
For L3 254 service, the Tenant System will be configured with an IP subnet that 255 is effectively a point-to-point link, i.e., having only the Tenant 256 System and a next-hop router address on it. 258 L2 service is intended for systems that need native L2 Ethernet 259 service and the ability to run protocols directly over Ethernet 260 (i.e., not based on IP). L3 service is intended for systems in which 261 all the traffic can safely be assumed to be IP. It is important to 262 note that whether NVO3 provides L2 or L3 service to a Tenant System, 263 the Tenant System does not generally need to be aware of the 264 distinction. In both cases, the virtual network presents itself to 265 the Tenant System as an L2 Ethernet interface. An Ethernet interface 266 is used in both cases simply as a widely supported interface type 267 that essentially all Tenant Systems already support. Consequently, 268 no special software is needed on Tenant Systems to use an L3 vs. an 269 L2 overlay service. 271 NVO3 can also provide a combined L2 and L3 service to tenants. A 272 combined service provides L2 service for intra-VN communication, but 273 also provides L3 service for L3 traffic entering or leaving the VN. 274 Architecturally, the handling of a combined L2/L3 service in NVO3 is 275 intended to match what is commonly done today in non-overlay 276 environments by devices providing a combined bridge/router service. 277 With combined service, the virtual network itself retains the 278 semantics of L2 service and all traffic is processed according to its 279 L2 semantics. In addition, however, traffic requiring IP processing 280 is also processed at the IP level. 282 The IP processing for a combined service can be implemented on a 283 standalone device attached to the virtual network (e.g., an IP 284 router) or implemented locally on the NVE (see Section 5.4 on 285 Distributed Gateways). For unicast traffic, NVE implementation of a 286 combined service may result in a packet being delivered to another TS 287 attached to the same NVE (on either the same or a different VN) or 288 tunneled to a remote NVE, or even forwarded outside the NVO3 domain. 289 For multicast or broadcast packets, the combination of NVE L2 and L3 290 processing may result in copies of the packet receiving both L2 and 291 L3 treatments to realize delivery to all of the destinations 292 involved. This distributed NVE implementation of IP routing results 293 in the same network delivery behavior as if the L2 processing of the 294 packet included delivery of the packet to an IP router attached to 295 the L2 VN as a TS, with the router having additional network 296 attachments to other networks, either virtual or not. 298 3.1.1. VLAN Tags in L2 Service 300 An NVO3 L2 virtual network service may include encapsulated L2 VLAN 301 tags provided by a Tenant System, but does not use encapsulated tags 302 in deciding where and how to forward traffic. Such VLAN tags can be 303 passed through, so that Tenant Systems that send or expect to receive 304 them can be supported as appropriate. 306 The processing of VLAN tags that an NVE receives from a TS is 307 controlled by settings associated with the VAP. Just as in the case 308 with ports on Ethernet switches, a number of settings could be 309 imagined. For example, C-TAGs can be passed through transparently, 310 they could always be stripped upon receipt from a Tenant System, they 311 could be compared against a list of explicitly configured tags, etc. 
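   As a rough sketch of the kind of per-VAP settings described above
   (Python; the setting names and the three modes shown are assumptions
   of this sketch, not anything defined by NVO3):

      from dataclasses import dataclass, field
      from typing import Optional, Set

      @dataclass
      class VapTagPolicy:
          # Per-VAP handling of C-TAGs received from the attached
          # Tenant System.
          mode: str = "passthrough"    # "passthrough" | "strip" | "filter"
          allowed_vids: Set[int] = field(default_factory=set)

      def ingress_c_vid(policy: VapTagPolicy,
                        c_vid: Optional[int]) -> Optional[int]:
          """Return the C-VID to carry inside the encapsulated frame, or
          None if the tag is stripped; raise if the VAP rejects it."""
          if policy.mode == "strip":
              return None              # tag removed on receipt from the TS
          if policy.mode == "filter" and c_vid not in policy.allowed_vids:
              raise ValueError("C-VID not permitted on this VAP: %r" % (c_vid,))
          return c_vid                 # passed through transparently

   Which of these behaviors a given VAP actually offers, and how they
   are configured, remains an implementation and deployment matter.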
313 Note that the handling of C-VIDs has additional complications, as 314 described in Section 4.2.1 below. 316 3.1.2. TTL Considerations 318 For L3 service, Tenant Systems should expect the TTL of the packets 319 they send to be decremented by at least 1. For L2 service, the TTL 320 on packets (when the packet is IP) is not modified. 322 3.2. Network Virtualization Edge (NVE) 324 Tenant Systems connect to NVEs via a Tenant System Interface (TSI). 325 The TSI logically connects to the NVE via a Virtual Access Point 326 (VAP) and each VAP is associated with one Virtual Network as shown in 327 Figure 2. To the Tenant System, the TSI is like a NIC; the TSI 328 presents itself to a Tenant System as a normal network interface. On 329 the NVE side, a VAP is a logical network port (virtual or physical) 330 into a specific virtual network. Note that two different Tenant 331 Systems (and TSIs) attached to a common NVE can share a VAP (e.g., 332 TS1 and TS2 in Figure 2) so long as they connect to the same Virtual 333 Network. 335 | Data Center Network (IP) | 336 | | 337 +-----------------------------------------+ 338 | | 339 | Tunnel Overlay | 340 +------------+---------+ +---------+------------+ 341 | +----------+-------+ | | +-------+----------+ | 342 | | Overlay Module | | | | Overlay Module | | 343 | +---------+--------+ | | +---------+--------+ | 344 | | | | | | 345 NVE1 | | | | | | NVE2 346 | +--------+-------+ | | +--------+-------+ | 347 | | VNI1 VNI2 | | | | VNI1 VNI2 | | 348 | +-+----------+---+ | | +-+-----------+--+ | 349 | | VAP1 | VAP2 | | | VAP1 | VAP2| 350 +----+----------+------+ +----+-----------+-----+ 351 | | | | 352 |\ | | | 353 | \ | | /| 354 -------+--\-------+-------------------+---------/-+------- 355 | \ | Tenant | / | 356 TSI1 |TSI2\ | TSI3 TSI1 TSI2/ TSI3 357 +---+ +---+ +---+ +---+ +---+ +---+ 358 |TS1| |TS2| |TS3| |TS4| |TS5| |TS6| 359 +---+ +---+ +---+ +---+ +---+ +---+ 361 Figure 2: NVE Reference Model 363 The Overlay Module performs the actual encapsulation and 364 decapsulation of tunneled packets. The NVE maintains state about the 365 virtual networks it is a part of so that it can provide the Overlay 366 Module with such information as the destination address of the NVE to 367 tunnel a packet to, or the Context ID that should be placed in the 368 encapsulation header to identify the virtual network that a tunneled 369 packet belongs to. 371 On the data center network side, the NVE sends and receives native IP 372 traffic. When ingressing traffic from a Tenant System, the NVE 373 identifies the egress NVE to which the packet should be sent, adds an 374 overlay encapsulation header, and sends the packet on the underlay 375 network. When receiving traffic from a remote NVE, an NVE strips off 376 the encapsulation header, and delivers the (original) packet to the 377 appropriate Tenant System. When the source and destination Tenant 378 System are on the same NVE, no encapsulation is needed and the NVE 379 forwards traffic directly. 381 Conceptually, the NVE is a single entity implementing the NVO3 382 functionality. In practice, there are a number of different 383 implementation scenarios, as described in detail in Section 4. 385 3.3. Network Virtualization Authority (NVA) 387 Address dissemination refers to the process of learning, building and 388 distributing the mapping/forwarding information that NVEs need in 389 order to tunnel traffic to each other on behalf of communicating 390 Tenant Systems. 
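   The sketch below (Python; the structure and field names are
   illustrative assumptions, not anything specified by NVO3) shows the
   kind of per-VN mapping state that this information amounts to:

      from dataclasses import dataclass
      from typing import Dict

      @dataclass
      class MappingEntry:
          remote_nve: str    # outer (underlay) IP address of the egress NVE
          context_id: int    # VN Context carried in the encapsulation header

      # One table per VN.  For an L2 VN the key is the inner destination
      # MAC address; for an L3 VN it is the inner destination IP address.
      VnMappingTable = Dict[str, MappingEntry]

      def egress_for(table: VnMappingTable, inner_dst: str) -> MappingEntry:
          """Look up where to tunnel a tenant packet; a miss is what an
          NVE would resolve by asking an NVA rather than by flooding."""
          return table[inner_dst]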
   In particular, in order to send traffic to a remote Tenant System,
   the sending NVE must know the destination NVE for that Tenant System.

   One way to build and maintain mapping tables is to use learning, as
   802.1 bridges do [IEEE-802.1Q].  When forwarding traffic to multicast
   or unknown unicast destinations, an NVE could simply flood traffic.
   While flooding works, it can lead to traffic hot spots and to
   problems in larger networks.

   Alternatively, to reduce the scope of where flooding must take place,
   or to eliminate it altogether, NVEs can make use of a Network
   Virtualization Authority (NVA).  An NVA is the entity that provides
   address mapping and other information to NVEs.  NVEs interact with an
   NVA to obtain any required address mapping information they need in
   order to properly forward traffic on behalf of tenants.  The term NVA
   refers to the overall system, without regard to its scope or how it
   is implemented.  NVAs provide a service, and NVEs access that service
   via an NVE-to-NVA protocol as discussed in Section 8.

   Even when an NVA is present, Ethernet bridge MAC address learning
   could be used as a fallback mechanism, should the NVA be unable to
   provide an answer or for other reasons.  This document does not
   consider flooding approaches in detail, as there are a number of
   benefits in using an approach that depends on the presence of an NVA.

   For the rest of this document, it is assumed that an NVA exists and
   will be used.  NVAs are discussed in more detail in Section 7.

3.4.  VM Orchestration Systems

   VM orchestration systems manage server virtualization across a set of
   servers.  Although VM management is a separate topic from network
   virtualization, the two areas are closely related.  Managing the
   creation, placement, and movement of VMs also involves creating,
   attaching to, and detaching from virtual networks.  A number of
   existing VM orchestration systems have incorporated aspects of
   virtual network management into their systems.

   Note also that although this section uses the terms "VM" and
   "hypervisor" throughout, the same issues apply to other
   virtualization approaches, including Linux Containers (LXC), BSD
   Jails, Network Service Appliances as discussed in Section 5.1, etc.
   From an NVO3 perspective, it should be assumed that where the
   document uses the terms "VM" and "hypervisor", the intention is that
   the discussion also applies to other systems, where, e.g., the host
   operating system plays the role of the hypervisor in supporting
   virtualization, and a container plays a role equivalent to that of a
   VM.

   When a new VM image is started, the VM orchestration system
   determines where the VM should be placed, interacts with the
   hypervisor on the target server to load and start the VM, and
   controls when a VM should be shut down or migrated elsewhere.  VM
   orchestration systems also have knowledge about how a VM should
   connect to a network, possibly including the name of the virtual
   network to which a VM is to connect.  The VM orchestration system can
   pass such information to the hypervisor when a VM is instantiated.
   VM orchestration systems have significant (and sometimes global)
   knowledge of the domain they manage.  They typically know on what
   servers a VM is running, and metadata associated with VM images can
   be useful from a network virtualization perspective.
   For example, the metadata may include the addresses (MAC and IP) the
   VMs will use and the name(s) of the virtual network(s) they connect
   to.

   VM orchestration systems run a protocol with an agent running on the
   hypervisor of the servers they manage.  That protocol can also carry
   information about what virtual network a VM is associated with.  When
   the orchestrator instantiates a VM on a hypervisor, the hypervisor
   interacts with the NVE in order to attach the VM to the virtual
   networks it has access to.  In general, the hypervisor will need to
   communicate significant VM state changes to the NVE.  In the reverse
   direction, the NVE may need to communicate network connectivity
   information back to the hypervisor.  Example VM orchestration systems
   in use today include VMware's vCenter Server, Microsoft's System
   Center Virtual Machine Manager, and systems based on OpenStack and
   its associated plugins (e.g., Nova and Neutron).  These systems can
   pass information about what virtual networks a VM connects to down to
   the hypervisor.  The protocol used between the VM orchestration
   system and hypervisors is generally proprietary.

   It should be noted that VM orchestration systems may not have direct
   access to all networking-related information a VM uses.  For example,
   a VM may make use of additional IP or MAC addresses that the VM
   management system is not aware of.

4.  Network Virtualization Edge (NVE)

   As introduced in Section 3.2, an NVE is the entity that implements
   the overlay functionality.  This section describes NVEs in more
   detail.  An NVE will have two external interfaces:

   Tenant System Facing:  On the Tenant System facing side, an NVE
      interacts with the hypervisor (or equivalent entity) to provide
      the NVO3 service.  An NVE will need to be notified when a Tenant
      System "attaches" to a virtual network (so it can validate the
      request and set up any state needed to send and receive traffic on
      behalf of the Tenant System on that VN).  Likewise, an NVE will
      need to be informed when the Tenant System "detaches" from the
      virtual network so that it can reclaim state and resources
      appropriately.

   Data Center Network Facing:  On the data center network facing side,
      an NVE interfaces with the data center underlay network, sending
      and receiving tunneled TS packets to and from the underlay.  The
      NVE may also run a control protocol with other entities on the
      network, such as the Network Virtualization Authority.

4.1.  NVE Co-located With Server Hypervisor

   When server virtualization is used, the entire NVE functionality will
   typically be implemented as part of the hypervisor and/or virtual
   switch on the server.  In such cases, the Tenant System interacts
   with the hypervisor and the hypervisor interacts with the NVE.
   Because the interaction between the hypervisor and NVE is implemented
   entirely in software on the server, there is no "on-the-wire"
   protocol between Tenant Systems (or the hypervisor) and the NVE that
   needs to be standardized.  While there may be APIs between the NVE
   and hypervisor to support necessary interaction, the details of such
   an API are not in scope for the IETF to work on.

   Implementing NVE functionality entirely on a server has the
   disadvantage that server CPU resources must be spent implementing the
   NVO3 functionality.
Experimentation with overlay approaches and 513 previous experience with TCP and checksum adapter offloads suggests 514 that offloading certain NVE operations (e.g., encapsulation and 515 decapsulation operations) onto the physical network adapter can 516 produce performance improvements. As has been done with checksum 517 and/or TCP server offload and other optimization approaches, there 518 may be benefits to offloading common operations onto adapters where 519 possible. Just as important, the addition of an overlay header can 520 disable existing adapter offload capabilities that are generally not 521 prepared to handle the addition of a new header or other operations 522 associated with an NVE. 524 While the exact details of how to split the implementation of 525 specific NVE functionality between a server and its network adapters 526 is an implementation matter and outside the scope of IETF 527 standardization, the NVO3 architecture should be cognizant of and 528 support such separation. Ideally, it may even be possible to bypass 529 the hypervisor completely on critical data path operations so that 530 packets between a TS and its VN can be sent and received without 531 having the hypervisor involved in each individual packet operation. 533 4.2. Split-NVE 535 Another possible scenario leads to the need for a split NVE 536 implementation. An NVE running on a server (e.g. within a 537 hypervisor) could support NVO3 towards the tenant, but not perform 538 all NVE functions (e.g., encapsulation) directly on the server; some 539 of the actual NVO3 functionality could be implemented on (i.e., 540 offloaded to) an adjacent switch to which the server is attached. 541 While one could imagine a number of link types between a server and 542 the NVE, one simple deployment scenario would involve a server and 543 NVE separated by a simple L2 Ethernet link. A more complicated 544 scenario would have the server and NVE separated by a bridged access 545 network, such as when the NVE resides on a ToR, with an embedded 546 switch residing between servers and the ToR. 548 For the split NVE case, protocols will be needed that allow the 549 hypervisor and NVE to negotiate and setup the necessary state so that 550 traffic sent across the access link between a server and the NVE can 551 be associated with the correct virtual network instance. 552 Specifically, on the access link, traffic belonging to a specific 553 Tenant System would be tagged with a specific VLAN C-TAG that 554 identifies which specific NVO3 virtual network instance it connects 555 to. The hypervisor-NVE protocol would negotiate which VLAN C-TAG to 556 use for a particular virtual network instance. More details of the 557 protocol requirements for functionality between hypervisors and NVEs 558 can be found in [I-D.ietf-nvo3-nve-nva-cp-req]. 560 4.2.1. Tenant VLAN handling in Split-NVE Case 562 Preserving tenant VLAN tags across NVO3 as described in Section 3.1.1 563 poses additional complications in the split-NVE case. The portion of 564 the NVE that performs the encapsulation function needs access to the 565 specific VLAN tags that the Tenant System is using in order to 566 include them in the encapsulated packet. When an NVE is implemented 567 entirely within the hypervisor, the NVE has access to the complete 568 original packet (including any VLAN tags) sent by the tenant. 
In the 569 split-NVE case, however, the VLAN tag used between the hypervisor and 570 offloaded portions of the NVE normally only identify the specific VN 571 that traffic belongs to. In order to allow a tenant to preserve VLAN 572 information in the split-NVE case, additional mechanisms would be 573 needed. 575 4.3. NVE State 577 NVEs maintain internal data structures and state to support the 578 sending and receiving of tenant traffic. An NVE may need some or all 579 of the following information: 581 1. An NVE keeps track of which attached Tenant Systems are connected 582 to which virtual networks. When a Tenant System attaches to a 583 virtual network, the NVE will need to create or update local 584 state for that virtual network. When the last Tenant System 585 detaches from a given VN, the NVE can reclaim state associated 586 with that VN. 588 2. For tenant unicast traffic, an NVE maintains a per-VN table of 589 mappings from Tenant System (inner) addresses to remote NVE 590 (outer) addresses. 592 3. For tenant multicast (or broadcast) traffic, an NVE maintains a 593 per-VN table of mappings and other information on how to deliver 594 tenant multicast (or broadcast) traffic. If the underlying 595 network supports IP multicast, the NVE could use IP multicast to 596 deliver tenant traffic. In such a case, the NVE would need to 597 know what IP underlay multicast address to use for a given VN. 598 Alternatively, if the underlying network does not support 599 multicast, an NVE could use serial unicast to deliver traffic. 600 In such a case, an NVE would need to know which remote NVEs are 601 participating in the VN. An NVE could use both approaches, 602 switching from one mode to the other depending on such factors as 603 bandwidth efficiency and group membership sparseness. 605 4. An NVE maintains necessary information to encapsulate outgoing 606 traffic, including what type of encapsulation and what value to 607 use for a Context ID within the encapsulation header. 609 5. In order to deliver incoming encapsulated packets to the correct 610 Tenant Systems, an NVE maintains the necessary information to map 611 incoming traffic to the appropriate VAP (i.e., Tenant System 612 Interface). 614 6. An NVE may find it convenient to maintain additional per-VN 615 information such as QoS settings, Path MTU information, ACLs, 616 etc. 618 4.4. Multi-Homing of NVEs 620 NVEs may be multi-homed. That is, an NVE may have more than one IP 621 address associated with it on the underlay network. Multihoming 622 happens in two different scenarios. First, an NVE may have multiple 623 interfaces connecting it to the underlay. Each of those interfaces 624 will typically have a different IP address, resulting in a specific 625 Tenant Address (on a specific VN) being reachable through the same 626 NVE but through more than one underlay IP address. Second, a 627 specific tenant system may be reachable through more than one NVE, 628 each having one or more underlay addresses. In both cases, the NVE 629 address mapping tables need to support one-to-many mappings and 630 enable a sending NVE to (at a minimum) be able to fail over from one 631 IP address to another, e.g., should a specific NVE underlay address 632 become unreachable. 634 Finally, multi-homed NVEs introduce complexities when serial unicast 635 is used to implement tenant multicast as described in Section 4.3. 636 Specifically, an NVE should only receive one copy of a replicated 637 packet. 639 Multi-homing is needed to support important use cases. 
   First, a bare metal server may have multiple uplink connections to
   either the same or different NVEs.  Having only a single physical
   path to an upstream NVE, or indeed having all traffic flow through a
   single NVE, would be considered unacceptable in highly resilient
   deployment scenarios that seek to avoid single points of failure.
   Moreover, in today's networks, the availability of multiple paths
   would require that they be usable in an active-active fashion (e.g.,
   for load balancing).

4.5.  VAP

   The VAP is the NVE side of the interface between the NVE and the TS.
   Traffic to and from the tenant flows through the VAP.  If an NVE runs
   into difficulties sending traffic received on the VAP, it may need to
   signal such errors back to the VAP.  Because the VAP is an emulation
   of a physical port, its ability to signal NVE errors is limited and
   lacks sufficient granularity to reflect all possible errors an NVE
   may encounter (e.g., inability to reach a particular destination).
   Some errors, such as an NVE losing all of its connections to the
   underlay, could be reflected back to the VAP by effectively disabling
   it.  This state change would reflect itself on the TS as an interface
   going down, allowing the TS to implement interface error handling,
   e.g., failover, in the same manner as when a physical interface
   becomes disabled.

5.  Tenant System Types

   This section describes a number of special Tenant System types and
   how they fit into an NVO3 system.

5.1.  Overlay-Aware Network Service Appliances

   Some Network Service Appliances [I-D.ietf-nvo3-nve-nva-cp-req]
   (virtual or physical) provide tenant-aware services.  That is, the
   specific service they provide depends on the identity of the tenant
   making use of the service.  For example, firewalls are now becoming
   available that support multi-tenancy, where a single firewall
   provides virtual firewall service on a per-tenant basis, using
   per-tenant configuration rules and maintaining per-tenant state.
   Such appliances will be aware of the VN an activity corresponds to
   while processing requests.  Unlike server virtualization, which
   shields VMs from needing to know about multi-tenancy, a Network
   Service Appliance may explicitly support multi-tenancy.  In such
   cases, the Network Service Appliance itself will be aware of network
   virtualization and either embed an NVE directly or implement a split
   NVE as described in Section 4.2.  Unlike server virtualization,
   however, the Network Service Appliance may not be running a
   hypervisor, and the VM orchestration system may not interact with the
   Network Service Appliance.  The NVE on such appliances will need to
   support a control plane to obtain the information needed to fully
   participate in an NVO3 Domain.

5.2.  Bare Metal Servers

   Many data centers will continue to have at least some servers
   operating as non-virtualized (or "bare metal") machines running a
   traditional operating system and workload.  In such systems, there
   will be no NVE functionality on the server, and the server will have
   no knowledge of NVO3 (including whether overlays are even in use).
   In such environments, the NVE functionality can reside on the
   first-hop physical switch.
   In such a case, the network administrator would (manually) configure
   the switch to enable the appropriate NVO3 functionality on the switch
   port connecting the server and associate that port with a specific
   virtual network.  Such configuration would typically be static, since
   the server is not virtualized and, once configured, is unlikely to
   change frequently.  Consequently, this scenario does not require any
   protocol or standards work.

5.3.  Gateways

   Gateways on VNs relay traffic onto and off of a virtual network.
   Tenant Systems use gateways to reach destinations outside of the
   local VN.  Gateways receive encapsulated traffic from one VN, remove
   the encapsulation header, and send the native packet out onto the
   data center network for delivery.  Outside traffic enters a VN in the
   reverse manner.

   Gateways can be either virtual (i.e., implemented as a VM) or
   physical (i.e., a standalone physical device).  For performance
   reasons, standalone hardware gateways may be desirable in some cases.
   Such gateways could consist of a simple switch forwarding traffic
   from a VN onto the local data center network, or could embed router
   functionality.  On such gateways, network interfaces connecting to
   virtual networks will (at least conceptually) embed NVE (or
   split-NVE) functionality within them.  As with Network Service
   Appliances, gateways may not support a hypervisor and will need an
   appropriate control plane protocol to obtain the information needed
   to provide NVO3 service.

   Gateways handle several different use cases.  For example, one use
   case consists of systems supporting overlays together with systems
   that do not (e.g., bare metal servers).  Gateways could be used to
   connect legacy systems supporting, e.g., L2 VLANs, to specific
   virtual networks, effectively making them part of the same virtual
   network.  Gateways could also forward traffic between a virtual
   network and other hosts on the data center network or relay traffic
   between different VNs.  Finally, gateways can provide external
   connectivity such as Internet or VPN access.

5.3.1.  Gateway Taxonomy

   As can be seen from the discussion above, there are several types of
   gateways that can exist in an NVO3 environment.  This section breaks
   them down into the various types that could be supported.  Note that
   each of the types below could be implemented either in a centralized
   manner or distributed to co-exist with the NVEs.

5.3.1.1.  L2 Gateways (Bridging)

   L2 Gateways act as layer 2 bridges to forward Ethernet frames based
   on the MAC addresses present in them.

   L2 VN to Legacy L2:  This type of gateway bridges traffic between L2
      VNs and other legacy L2 networks such as VLANs or L2 VPNs.

   L2 VN to L2 VN:  The main motivation for this type of gateway is to
      create separate groups of Tenant Systems using L2 VNs, such that
      the gateway can enforce network policies between each L2 VN.

5.3.1.2.  L3 Gateways (Only IP Packets)

   L3 Gateways forward IP packets based on the IP addresses present in
   the packets.

   L3 VN to Legacy L2:  This type of gateway forwards packets between L3
      VNs and legacy L2 networks such as VLANs or L2 VPNs.  The MAC
      address in any frames forwarded between the L3 VN and the legacy
      L2 network would be the MAC address of the gateway.

   L3 VN to Legacy L3:  This type of gateway forwards packets between L3
      VNs and legacy L3 networks.
   These legacy L3 networks could be local to the data center, in the
   WAN, or an L3 VPN.

   L3 VN to L2 VN:  This type of gateway forwards packets between L3 VNs
      and L2 VNs.  The MAC address in any frames forwarded between the
      L3 VN and the L2 VN would be the MAC address of the gateway.

   L2 VN to L2 VN:  This type of gateway acts similarly to a traditional
      router that forwards between L2 interfaces.  The MAC address in
      any frames forwarded between the L2 VNs would be the MAC address
      of the gateway.

   L3 VN to L3 VN:  The main motivation for this type of gateway is to
      create separate groups of Tenant Systems using L3 VNs, such that
      the gateway can enforce network policies between each L3 VN.

5.4.  Distributed Inter-VN Gateways

   The relaying of traffic from one VN to another deserves special
   consideration.  Whether traffic is permitted to flow from one VN to
   another is a matter of policy and would not (by default) be allowed
   unless explicitly enabled.  In addition, NVAs are the logical place
   to maintain policy information about allowed inter-VN communication.
   Policy enforcement for inter-VN communication can be handled in (at
   least) two different ways.  Explicit gateways could be the central
   point for such enforcement, with all inter-VN traffic forwarded to
   such gateways for processing.  Alternatively, the NVA can provide
   such information directly to NVEs, by either providing a mapping for
   a target TS on another VN or indicating that such communication is
   disallowed by policy.

   When inter-VN gateways are centralized, traffic between TSs on
   different VNs can take suboptimal paths, i.e., triangular routing
   results in paths that always traverse the gateway.  In the worst
   case, traffic between two TSs connected to the same NVE can be
   hairpinned through an external gateway.  As an optimization,
   individual NVEs can be part of a distributed gateway that performs
   such relaying, reducing or completely eliminating triangular routing.
   In a distributed gateway, each ingress NVE can perform such relaying
   activity directly, so long as it has access to the policy information
   needed to determine whether cross-VN communication is allowed.
   Having individual NVEs be part of a distributed gateway allows them
   to tunnel traffic directly to the destination NVE without the need to
   take suboptimal paths.

   The NVO3 architecture must support distributed gateways for the case
   of inter-VN communication.  Such support requires that NVO3 control
   protocols include mechanisms for the maintenance and distribution of
   policy information about what type of cross-VN communication is
   allowed, so that NVEs acting as distributed gateways can tunnel
   traffic from one VN to another as appropriate.

   Distributed gateways could also be used to distribute other
   traditional router services to individual NVEs.  The NVO3
   architecture does not preclude such implementations but does not
   define or require them, as they are outside the scope of NVO3.

5.5.  ARP and Neighbor Discovery

   For an L2 service, strictly speaking, special processing of ARP
   [RFC0826] (and IPv6 Neighbor Discovery (ND) [RFC4861]) is not
   required.  ARP requests are broadcast, and NVO3 can deliver ARP
   requests to all members of a given L2 virtual network, just as it
   does for any packet sent to an L2 broadcast address.
Similarly, ND 833 requests are sent via IP multicast, which NVO3 can support by 834 delivering via L2 multicast. However, as a performance optimization, 835 an NVE can intercept ARP (or ND) requests from its attached TSs and 836 respond to them directly using information in its mapping tables. 837 Since an NVE will have mechanisms for determining the NVE address 838 associated with a given TS, the NVE can leverage the same mechanisms 839 to suppress sending ARP and ND requests for a given TS to other 840 members of the VN. The NVO3 architecture must support such a 841 capability. 843 6. NVE-NVE Interaction 845 Individual NVEs will interact with each other for the purposes of 846 tunneling and delivering traffic to remote TSs. At a minimum, a 847 control protocol may be needed for tunnel setup and maintenance. For 848 example, tunneled traffic may need to be encrypted or integrity 849 protected, in which case it will be necessary to set up appropriate 850 security associations between NVE peers. It may also be desirable to 851 perform tunnel maintenance (e.g., continuity checks) on a tunnel in 852 order to detect when a remote NVE becomes unreachable. Such generic 853 tunnel setup and maintenance functions are not generally 854 NVO3-specific. Hence, NVO3 expects to leverage existing tunnel 855 maintenance protocols rather than defining new ones. 857 Some NVE-NVE interactions may be specific to NVO3 (and in particular 858 be related to information kept in mapping tables) and agnostic to the 859 specific tunnel type being used. For example, when tunneling traffic 860 for TS-X to a remote NVE, it is possible that TS-X is not presently 861 associated with the remote NVE. Normally, this should not happen, 862 but there could be race conditions where the information an NVE has 863 learned from the NVA is out-of-date relative to actual conditions. 864 In such cases, the remote NVE could return an error or warning 865 indication, allowing the sending NVE to attempt a recovery or 866 otherwise attempt to mitigate the situation. 868 The NVE-NVE interaction could signal a range of indications, for 869 example: 871 o "No such TS here", upon a receipt of a tunneled packet for an 872 unknown TS. 874 o "TS-X not here, try the following NVE instead" (i.e., a redirect). 876 o Delivered to correct NVE, but could not deliver packet to TS-X 877 (soft error). 879 o Delivered to correct NVE, but could not deliver packet to TS-X 880 (hard error). 882 When an NVE receives information from a remote NVE that conflicts 883 with the information it has in its own mapping tables, it should 884 consult with the NVA to resolve those conflicts. In particular, it 885 should confirm that the information it has is up-to-date, and it 886 might indicate the error to the NVA, so as to nudge the NVA into 887 following up (as appropriate). While it might make sense for an NVE 888 to update its mapping table temporarily in response to an error from 889 a remote NVE, any changes must be handled carefully as doing so can 890 raise security considerations if the received information cannot be 891 authenticated. That said, a sending NVE might still take steps to 892 mitigate a problem, such as applying rate limiting to data traffic 893 towards a particular NVE or TS. 895 7. Network Virtualization Authority 897 Before sending to and receiving traffic from a virtual network, an 898 NVE must obtain the information needed to build its internal 899 forwarding tables and state as listed in Section 4.3. 
   An NVE can obtain such information from a Network Virtualization
   Authority.

   The Network Virtualization Authority (NVA) is the entity that is
   expected to provide address mapping and other information to NVEs.
   NVEs can interact with an NVA to obtain any required information they
   need in order to properly forward traffic on behalf of tenants.  The
   term NVA refers to the overall system, without regard to its scope or
   how it is implemented.

7.1.  How an NVA Obtains Information

   There are two primary ways in which an NVA can obtain the address
   dissemination information it manages.  The NVA can obtain information
   from the VM orchestration system and/or directly from the NVEs
   themselves.

   On virtualized systems, the NVA may be able to obtain the address
   mapping information associated with VMs from the VM orchestration
   system itself.  If the VM orchestration system contains a master
   database for all the virtualization information, having the NVA
   obtain information directly from the orchestration system would be a
   natural approach.  Indeed, the NVA could effectively be co-located
   with the VM orchestration system itself.  In such systems, the VM
   orchestration system communicates with the NVE indirectly through the
   hypervisor.

   However, as described in Section 4, not all NVEs are associated with
   hypervisors.  In such cases, NVAs cannot leverage VM orchestration
   protocols to interact with those NVEs and will instead need to peer
   directly with them.  By peering directly with an NVE, NVAs can obtain
   information about the TSs connected to that NVE and can distribute
   information to the NVE about the VNs those TSs are associated with.
   For example, whenever a Tenant System attaches to an NVE, that NVE
   would notify the NVA that the TS is now associated with that NVE.
   Likewise, when a TS detaches from an NVE, that NVE would inform the
   NVA.  By communicating directly with NVEs, both the NVA and the NVE
   are able to maintain up-to-date information about all active tenants
   and the NVEs to which they are attached.

7.2.  Internal NVA Architecture

   For reliability and fault tolerance reasons, an NVA would be
   implemented in a distributed or replicated manner without single
   points of failure.  How the NVA is implemented, however, is not
   important to an NVE so long as the NVA provides a consistent and
   well-defined interface to the NVE.  For example, an NVA could be
   implemented via database techniques whereby a server stores address
   mapping information in a traditional (possibly replicated) database.
   Alternatively, an NVA could be implemented in a distributed fashion
   using an existing (or modified) routing protocol to maintain and
   distribute mappings.  So long as there is a clear interface between
   the NVE and NVA, how an NVA is architected and implemented is not
   important to an NVE.

   A number of architectural approaches could be used to implement NVAs
   themselves.  NVAs manage address bindings and distribute them to
   where they need to go.  One approach would be to use the Border
   Gateway Protocol (BGP) [RFC4364] (possibly with extensions) and route
   reflectors.  Another approach could use a transaction-based database
   model with replicated servers.
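   Whatever internal approach is chosen, what matters to an NVE is the
   external interface the NVA exposes.  A minimal sketch of such an
   interface (Python; the operation names are hypothetical and intended
   only to illustrate the kinds of interactions involved):

      import abc

      class NvaClient(abc.ABC):
          """NVE-side view of the NVA service.  Whether the NVA behind
          this interface is a replicated database, a BGP/route-reflector
          system, or something else entirely is invisible to the NVE."""

          @abc.abstractmethod
          def register(self, vn: str, inner_addr: str, nve_addr: str) -> None:
              """Report that a TS address is now reachable via this NVE."""

          @abc.abstractmethod
          def withdraw(self, vn: str, inner_addr: str) -> None:
              """Report that the TS has detached from this NVE."""

          @abc.abstractmethod
          def lookup(self, vn: str, inner_addr: str) -> str:
              """Return the underlay address of the NVE serving inner_addr."""

          @abc.abstractmethod
          def subscribe(self, vn: str) -> None:
              """Request ongoing updates for a VN this NVE is attached to."""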
   Because the implementation details are local to an NVA, there is no
   need to pick exactly one solution technology, so long as the external
   interfaces to the NVEs (and remote NVAs) are sufficiently well
   defined to achieve interoperability.

7.3.  NVA External Interface

   Conceptually, from the perspective of an NVE, an NVA is a single
   entity.  An NVE interacts with the NVA, and it is the NVA's
   responsibility to ensure that interactions between the NVE and NVA
   result in consistent behavior across the NVA and all other NVEs using
   the same NVA.  Because an NVA is built from multiple internal
   components, an NVA will have to ensure that information flows to all
   internal NVA components appropriately.

   One architectural question is how the NVA presents itself to the NVE.
   For example, an NVA could be required to provide access via a single
   IP address.  If NVEs only have one IP address to interact with, it
   would be the responsibility of the NVA to handle NVA component
   failures, e.g., by using a "floating IP address" that migrates among
   NVA components to ensure that the NVA can always be reached via the
   one address.  Having all NVA accesses through a single IP address,
   however, adds constraints to implementing robust failover, load
   balancing, etc.

   In the NVO3 architecture, an NVA is accessed through one or more IP
   addresses (or IP address/port combinations).  If multiple IP
   addresses are used, each IP address provides equivalent
   functionality, meaning that an NVE can use any of the provided
   addresses to interact with the NVA.  Should one address stop working,
   an NVE is expected to fail over to another.  While the different
   addresses result in equivalent functionality, one address may respond
   more quickly than another, e.g., due to network conditions, load on
   the server, etc.

   To provide some control over load balancing, NVA addresses may have
   an associated priority.  Addresses are used in order of priority,
   with no explicit preference among NVA addresses having the same
   priority.  To provide basic load balancing among NVA addresses of
   equal priority, NVEs use some randomization input to select among
   equal-priority addresses.  Such a priority scheme facilitates
   failover and load balancing, for example, allowing a network operator
   to specify a set of primary and backup NVAs.

   It may be desirable to have individual NVA addresses responsible for
   a subset of information about an NV Domain.  In such a case, NVEs
   would use different NVA addresses for obtaining or updating
   information about particular VNs or TS bindings.  A key question with
   such an approach is how information would be partitioned, and how an
   NVE could determine which address to use to get the information it
   needs.

   Another possibility is to treat the information on which NVA
   addresses to use as cached (soft-state) information at the NVEs, so
   that any NVA address can be used to obtain any information, but NVEs
   are informed of preferences for which addresses to use for particular
   information on VNs or TS bindings.  That preference information would
   be cached for future use to improve behavior, e.g., if all requests
   for a specific subset of VNs are forwarded to a specific NVA
   component, the NVE can optimize future requests within that subset by
   sending them directly to that NVA component via its address.

8.
NVE-to-NVA Protocol 1024 As outlined in Section 4.3, an NVE needs certain information in order 1025 to perform its functions. To obtain such information from an NVA, an 1026 NVE-to-NVA protocol is needed. The NVE-to-NVA protocol provides two 1027 functions. First it allows an NVE to obtain information about the 1028 location and status of other TSs with which it needs to communicate. 1029 Second, the NVE-to-NVA protocol provides a way for NVEs to provide 1030 updates to the NVA about the TSs attached to that NVE (e.g., when a 1031 TS attaches or detaches from the NVE), or about communication errors 1032 encountered when sending traffic to remote NVEs. For example, an NVE 1033 could indicate that a destination it is trying to reach at a 1034 destination NVE is unreachable for some reason. 1036 While having a direct NVE-to-NVA protocol might seem straightforward, 1037 the existence of existing VM orchestration systems complicates the 1038 choices an NVE has for interacting with the NVA. 1040 8.1. NVE-NVA Interaction Models 1042 An NVE interacts with an NVA in at least two (quite different) ways: 1044 o NVEs embedded within the same server as the hypervisor can obtain 1045 necessary information entirely through the hypervisor-facing side 1046 of the NVE. Such an approach is a natural extension to existing 1047 VM orchestration systems supporting server virtualization because 1048 an existing protocol between the hypervisor and VM Orchestration 1049 system already exists and can be leveraged to obtain any needed 1050 information. Specifically, VM orchestration systems used to 1051 create, terminate and migrate VMs already use well-defined (though 1052 typically proprietary) protocols to handle the interactions 1053 between the hypervisor and VM orchestration system. For such 1054 systems, it is a natural extension to leverage the existing 1055 orchestration protocol as a sort of proxy protocol for handling 1056 the interactions between an NVE and the NVA. Indeed, existing 1057 implementations can already do this. 1059 o Alternatively, an NVE can obtain needed information by interacting 1060 directly with an NVA via a protocol operating over the data center 1061 underlay network. Such an approach is needed to support NVEs that 1062 are not associated with systems performing server virtualization 1063 (e.g., as in the case of a standalone gateway) or where the NVE 1064 needs to communicate directly with the NVA for other reasons. 1066 The NVO3 architecture will focus on support for the second model 1067 above. Existing virtualization environments are already using the 1068 first model. But they are not sufficient to cover the case of 1069 standalone gateways -- such gateways may not support virtualization 1070 and do not interface with existing VM orchestration systems. 1072 8.2. Direct NVE-NVA Protocol 1074 An NVE can interact directly with an NVA via an NVE-to-NVA protocol. 1075 Such a protocol can be either independent of the NVA internal 1076 protocol, or an extension of it. Using a dedicated protocol provides 1077 architectural separation and independence between the NVE and NVA. 1078 The NVE and NVA interact in a well-defined way, and changes in the 1079 NVA (or NVE) do not need to impact each other. Using a dedicated 1080 protocol also ensures that both NVE and NVA implementations can 1081 evolve independently and without dependencies on each other. Such 1082 independence is important because the upgrade path for NVEs and NVAs 1083 is quite different. 
Requirements for a direct NVE-NVA protocol can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

8.3. Propagating Information Between NVEs and NVAs

Information flows between NVEs and NVAs in both directions. The NVA maintains information about all VNs in the NV Domain, so that NVEs do not need to do so themselves. NVEs obtain from the NVA information about where a given remote TS destination resides. NVAs in turn obtain information from NVEs about the individual TSs attached to those NVEs.

While the NVA could push information about every virtual network to every NVE, such an approach scales poorly and is unnecessary. In practice, a given NVE will only need and want to know about VNs to which it is attached. Thus, an NVE should be able to subscribe to updates only for the virtual networks it is interested in. The NVO3 architecture supports a model where an NVE is not required to have full mapping tables for all virtual networks in an NV Domain.

Before sending unicast traffic to a remote TS (or TSes, for broadcast or multicast traffic), an NVE must know where the remote TS (or TSes) currently resides. When a TS attaches to a virtual network, the NVE obtains information about that VN from the NVA. The NVA can provide that information to the NVE at the time the TS attaches to the VN, either because the NVE requests the information when the attach operation occurs or because the VM orchestration system has initiated the attach operation and provides the associated mapping information to the NVE at the same time.

There are scenarios where an NVE may wish to query the NVA about individual mappings within a VN. For example, when sending traffic to a remote TS on a remote NVE, that TS may become unavailable (e.g., because it has migrated elsewhere or has been shut down), in which case the remote NVE may return an error indication. In such situations, the NVE may need to query the NVA to obtain updated mapping information for a specific TS or to verify that the information is still correct despite the error condition. Note that such a query could also be used by the NVA as an indication that there may be an inconsistency in the network and that it should take steps to verify that the information it has about the current state and location of a specific TS is still correct.

For very large virtual networks, the amount of state an NVE needs to maintain for a given virtual network could be significant. Moreover, an NVE may only be communicating with a small subset of the TSs on such a virtual network. In such cases, the NVE may find it desirable to maintain state only for those destinations it is actively communicating with; that is, an NVE may not want to maintain full mapping information about all destinations on a VN. Should it then need to communicate with a destination for which it does not have mapping information, however, it will need to be able to query the NVA on demand for the missing information on a per-destination basis.
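The on-demand model described above can be pictured with a small Python sketch of an NVE-side mapping cache that queries the NVA only on a miss; the class, method, and parameter names are illustrative assumptions, not part of the architecture.

   class MappingCache:
       """Per-NVE cache of inner-to-outer mappings, filled on demand."""

       def __init__(self, nva_lookup):
           # nva_lookup(vn_context, tenant_address) -> NVE underlay address,
           # i.e., a query to the NVA over the NVE-to-NVA protocol.
           self.nva_lookup = nva_lookup
           self.cache = {}

       def resolve(self, vn_context, tenant_address):
           key = (vn_context, tenant_address)
           if key not in self.cache:                    # cache miss:
               self.cache[key] = self.nva_lookup(*key)  # ask the NVA on demand
           return self.cache[key]

       def invalidate(self, vn_context, tenant_address):
           # Called, e.g., after a remote NVE returns an error indication, so
           # that the next send triggers a fresh NVA query (which also lets
           # the NVA re-verify its own state for that TS).
           self.cache.pop((vn_context, tenant_address), None)

   # Usage with a stand-in lookup function:
   cache = MappingCache(nva_lookup=lambda vn, ts: "192.0.2.20")
   print(cache.resolve(1217, "VM2"))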
The NVO3 architecture will need to support a range of operations between the NVE and NVA. Requirements for those operations can be found in [I-D.ietf-nvo3-nve-nva-cp-req].

9. Federated NVAs

An NVA provides service to the set of NVEs in its NV Domain. Each NVA manages network virtualization information for the virtual networks within its NV Domain. An NV Domain is administered by a single entity.

In some cases, it will be necessary to expand the scope of a specific VN or even an entire NV Domain beyond a single NVA. For example, an administrator managing multiple data centers may wish to operate all of those data centers as a single NV Region. Such cases are handled by having different NVAs peer with each other to exchange mapping information about specific VNs. NVAs operate in a federated manner: a set of NVAs forms a loosely coupled federation of individual NVAs. If a virtual network spans multiple NVAs (e.g., located at different data centers), and an NVE needs to deliver tenant traffic to an NVE that is part of a different NV Domain, it still interacts only with its own NVA, even when obtaining mappings for NVEs associated with a different NV Domain.

Figure 3 shows a scenario where two separate NV Domains (A and B) share information about Virtual Network "1217". VM1 and VM2 both connect to the same Virtual Network 1217, even though the two VMs are in separate NV Domains. There are two cases to consider. In the first case, NV Domain B does not allow NVE-A to tunnel traffic directly to NVE-B. There could be a number of reasons for this. For example, NV Domains A and B may not share a common address space (i.e., traffic would require traversal through a NAT device), or, for policy reasons, a domain might require that all traffic between separate NV Domains be funneled through a particular device (e.g., a firewall). In such cases, NVA-2 will advertise to NVA-1 that VM2 on Virtual Network 1217 is available and direct that traffic between the two nodes go through IP-G. IP-G would then decapsulate received traffic from one NV Domain, translate it appropriately for the other domain, and re-encapsulate the packet for delivery. In the second case, NV Domain B does allow NVE-A to tunnel traffic directly to NVE-B; in that case, NVA-2 advertises to NVA-1 the mapping for VM2 (i.e., the underlay address of NVE-B), and NVE-A tunnels tenant traffic directly to NVE-B.

            xxxxxxxxxxxxx                          xxxxxxxxxxxxx
  +-----+  x             x                        x             x  +-----+
  | VM1 |  x             x                        x             x  | VM2 |
  |-----|  x             x      +------+          x             x  |-----|
  |NVE-A|--x NV Domain A x------| IP-G |----------x NV Domain B x--|NVE-B|
  +-----+  x             x      +------+          x             x  +-----+
            x           x                          x           x
             xxxxx+xxxxx                            xxxxx+xxxxx
                  |                                      |
               +--+--+                                +--+--+
               |NVA-1|                                |NVA-2|
               +-----+                                +-----+

          Figure 3: VM1 and VM2 are in different NV Domains

NVAs at one site share information and interact with NVAs at other sites, but only in a controlled manner. It is expected that policy and access control will be applied at the boundaries between different sites (and NVAs) so as to minimize dependencies on external NVAs that could negatively impact the operation within a site. It is an architectural principle that operations involving NVAs at one site not be immediately impacted by failures or errors at another site. (Of course, communication between NVEs in different NV Domains may be impacted by such failures or errors.) It is a strong requirement that an NVA continue to operate properly for local NVEs even if external communication is interrupted (e.g., should communication between a local and remote NVA fail).
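As a non-normative illustration of the kind of policy applied at such a boundary, the following Python sketch shows how an NVA in the position of NVA-2 in Figure 3 might export mappings for a shared VN while pointing remote domains at a gateway instead of at its local NVEs; the function, parameters, and export format are illustrative assumptions only.

   def export_mappings(local_mappings, shared_vns, gateway_address):
       """One possible export policy for an NVA peering with another NV
       Domain: advertise only VNs it has agreed to share, and direct
       remote domains to a gateway rather than to local NVEs."""
       exported = []
       for (vn, tenant_addr), _local_nve in local_mappings.items():
           if vn not in shared_vns:        # policy: do not leak other VNs
               continue
           exported.append((vn, tenant_addr, gateway_address))
       return exported

   # NVA-2's local state: VM2 on VN 1217 sits behind NVE-B.
   local_mappings = {(1217, "VM2"): "NVE-B"}
   # What NVA-1 learns: reach VM2 on VN 1217 via IP-G, not NVE-B directly.
   print(export_mappings(local_mappings, {1217}, "IP-G"))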
At a high level, a federation of interconnected NVAs has some analogies to BGP and Autonomous Systems. Like an Autonomous System, NVAs at one site are managed by a single administrative entity and do not interact with external NVAs except as allowed by policy. Likewise, the interface between NVAs at different sites is well defined, so that the internal details of operations at one site are largely hidden from other sites. Finally, an NVA only peers with other NVAs that it has a trust relationship with, i.e., where a VN is intended to span multiple NVAs.

Reasons for using a federated model include:

o  Provide isolation among NVAs operating at different sites in different geographic locations.

o  Control the quantity and rate of information updates that flow (and must be processed) between different NVAs in different data centers.

o  Control the set of external NVAs (and external sites) a site peers with. A site will only peer with other sites that are cooperating in providing an overlay service.

o  Allow policy to be applied between sites. A site will want to carefully control what information it exports (and to whom) as well as what information it is willing to import (and from whom).

o  Allow different protocols and architectures to be used for intra- vs. inter-NVA communication. For example, within a single data center, a replicated transaction server using database techniques might be an attractive implementation option for an NVA, and protocols optimized for intra-NVA communication would likely be different from protocols involving inter-NVA communication between different sites.

o  Allow for optimized protocols rather than a one-size-fits-all approach. Within a data center, networks tend to have lower latency, higher speed, and higher redundancy when compared with the WAN links interconnecting data centers. The design constraints and tradeoffs for a protocol operating within a data center network are different from those for a protocol operating over WAN links. While a single protocol could be used for both cases, there could be advantages to using different and more specialized protocols for the intra- and inter-NVA cases.

9.1. Inter-NVA Peering

To support peering between different NVAs, an inter-NVA protocol is needed. The inter-NVA protocol defines what information is exchanged between NVAs. It is assumed that the protocol will be used to share addressing information between data centers and must scale well over WAN links.

10. Control Protocol Work Areas

The NVO3 architecture consists of two major distinct entities: NVEs and NVAs. In order to provide isolation and independence between these two entities, the NVO3 architecture calls for well-defined protocols for interfacing between them. For an individual NVA, the architecture calls for a logically centralized entity that could be implemented in a distributed or replicated fashion. While the IETF may choose to define one or more specific architectural approaches to building individual NVAs, there is little need for it to pick exactly one approach to the exclusion of others. An NVA for a single domain will likely be deployed as a single vendor product, and thus there is little benefit in standardizing the internal structure of an NVA.

Individual NVAs peer with each other in a federated manner. The NVO3 architecture calls for a well-defined interface between NVAs.

Finally, a hypervisor-to-NVE protocol is needed to cover the split-NVE scenario described in Section 4.2.

11. NVO3 Data Plane Encapsulation

When tunneling tenant traffic, NVEs add an encapsulation header to the original tenant packet. The exact encapsulation to use for NVO3 does not seem to be critical. The main requirement is that the encapsulation support a Context ID of sufficient size [I-D.ietf-nvo3-dataplane-requirements]. A number of encapsulations already exist that provide a VN Context of sufficient size for NVO3. For example, VXLAN [RFC7348] has a 24-bit VXLAN Network Identifier (VNI). NVGRE [I-D.sridharan-virtualization-nvgre] has a 24-bit Tenant Network ID (TNI). MPLS-over-GRE provides a 20-bit label field. While there is widespread recognition that a 12-bit VN Context would be too small (only 4096 distinct values), it is generally agreed that 20 bits (about 1 million distinct values) and 24 bits (about 16.8 million distinct values) are sufficient for a wide variety of deployment scenarios.
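To make the VN Context sizes above concrete, the following Python sketch builds the 8-octet VXLAN header defined in [RFC7348], whose 24-bit VNI carries the VN Context. This is only an illustration of the field layout, not a statement that NVO3 uses VXLAN.

   import struct

   def vxlan_header(vni):
       """Build the 8-octet VXLAN header from RFC 7348: 8 flag bits
       (I bit set), 24 reserved bits, a 24-bit VNI, and 8 reserved bits."""
       if not 0 <= vni < 2**24:
           raise ValueError("VNI must fit in 24 bits")
       flags_word = 0x08 << 24   # first octet = 0x08: the I (valid VNI) flag set
       vni_word = vni << 8       # VNI sits above the final reserved octet
       return struct.pack("!II", flags_word, vni_word)

   # A 24-bit VN Context allows roughly 16.8 million distinct virtual networks.
   print(vxlan_header(1217).hex())    # -> '080000000004c100'

NVGRE's 24-bit TNI and the 20-bit MPLS label play the same VN Context role in their respective encapsulations.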
12. Operations and Management

The simplicity of operating and debugging overlay networks will be critical for successful deployment. Some architectural choices can facilitate or hinder OAM. Related OAM drafts include [I-D.ashwood-nvo3-operational-requirement].

13. Summary

This document presents the overall architecture for overlays in NVO3. The architecture calls for three main areas of protocol work:

1. A hypervisor-to-NVE protocol to support Split-NVEs as discussed in Section 4.2.

2. An NVE-to-NVA protocol for disseminating VN information (e.g., inner-to-outer address mappings).

3. An NVA-to-NVA protocol for exchange of information about specific virtual networks between federated NVAs.

It should be noted that existing protocols, or extensions of existing protocols, are applicable to some or all of these areas.

14. Acknowledgments

Helpful comments and improvements to this document have come from Lizhong Jin, Anton Ivanov, Dennis (Xiaohong) Qin, Erik Smith, Ziye Yang and Lucy Yong.

15. IANA Considerations

This memo includes no request to IANA.

16. Security Considerations

Security considerations for this architecture have not yet been developed; they will be addressed in a future revision of this document.

17. Informative References

[I-D.ashwood-nvo3-operational-requirement]
           Ashwood-Smith, P., Iyengar, R., Tsou, T., Sajassi, A., Boucadair, M., Jacquenet, C., and M. Daikoku, "NVO3 Operational Requirements", draft-ashwood-nvo3-operational-requirement-03 (work in progress), July 2013.
[I-D.ietf-nvo3-dataplane-requirements]
           Bitar, N., Lasserre, M., Balus, F., Morin, T., Jin, L., and B. Khasnabish, "NVO3 Data Plane Requirements", draft-ietf-nvo3-dataplane-requirements-03 (work in progress), April 2014.

[I-D.ietf-nvo3-nve-nva-cp-req]
           Kreeger, L., Dutt, D., Narten, T., and D. Black, "Network Virtualization NVE to NVA Control Protocol Requirements", draft-ietf-nvo3-nve-nva-cp-req-03 (work in progress), October 2014.

[I-D.sridharan-virtualization-nvgre]
           Garg, P. and Y. Wang, "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan-virtualization-nvgre-07 (work in progress), November 2014.

[IEEE-802.1Q]
           IEEE, "IEEE standard for local and metropolitan area networks: Media access control (MAC) bridges and virtual bridged local area networks", IEEE 802.1Q-2011, August 2011.

[RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or converting network protocol addresses to 48.bit Ethernet address for transmission on Ethernet hardware", STD 37, RFC 826, November 1982.

[RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

[RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014.

[RFC7364]  Narten, T., Gray, E., Black, D., Fang, L., Kreeger, L., and M. Napierala, "Problem Statement: Overlays for Network Virtualization", RFC 7364, October 2014.

[RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. Rekhter, "Framework for Data Center (DC) Network Virtualization", RFC 7365, October 2014.

Appendix A. Change Log

A.1. Changes From draft-ietf-nvo3-arch-02 to -03

1. Removed "[Note:" comments from Sections 7.3 and 8.

2. Removed the discussion stimulating the "[Note" comment from Section 8.1 and changed the text to note that the NVO3 architecture will focus on a model where all NVEs interact with the NVA.

3. Added a subsection on NVO3 Gateway taxonomy.

A.2. Changes From draft-ietf-nvo3-arch-01 to -02

1. Minor editorial improvements after a close re-reading; references to the problem statement and framework updated to point to recently published RFCs.

2. Added text making it more clear that other virtualization approaches, including Linux Containers, are intended to be fully supported in NVO3.

A.3. Changes From draft-ietf-nvo3-arch-00 to -01

1. Miscellaneous text/section additions, including:

   * New section on VLAN tag handling (Section 3.1.1).

   * New section on tenant VLAN handling in the Split-NVE case (Section 4.2.1).

   * New section on TTL handling (Section 3.1.2).

   * New section on multi-homing of NVEs (Section 4.4).

   * Two paragraphs of new text describing the L2/L3 Combined service (Section 3.1).

   * New section on VAPs (and error handling) (Section 4.5).

   * New section on ARP and ND handling (Section 5.5).

   * New section on NVE-to-NVE interactions (Section 6).

2. Editorial cleanups from careful review by Erik Smith and Ziye Yang.

3. Expanded text on Distributed Inter-VN Gateways.

A.4. Changes From draft-narten-nvo3 to draft-ietf-nvo3
1. No changes between draft-narten-nvo3-arch-01 and draft-ietf-nvo3-arch-00.

A.5. Changes From -00 to -01 (of draft-narten-nvo3-arch)

1. Editorial and clarity improvements.

2. Replaced the "push vs. pull" section with a section more focused on triggers, where an event implies or triggers some action.

3. Clarified the text on the co-located NVE to show how offloading NVE functionality onto adapters is desirable.

4. Added a new section on distributed gateways.

5. Expanded the section on the NVA external interface, adding a requirement for NVEs to support multiple NVA IP addresses.

Authors' Addresses

David Black
EMC

Email: david.black@emc.com

Jon Hudson
Brocade
120 Holger Way
San Jose, CA 95134
USA

Email: jon.hudson@gmail.com

Lawrence Kreeger
Cisco

Email: kreeger@cisco.com

Marc Lasserre
Alcatel-Lucent

Email: marc.lasserre@alcatel-lucent.com

Thomas Narten
IBM

Email: narten@us.ibm.com