Network Working Group K. Kompella Internet-Draft Y. Rekhter Intended status: Standards Track Juniper Networks Expires: January 10, 2013 T. Morin France Telecom - Orange Labs July 9, 2012 Using Signaling to Simplify Network Virtualization Provisioning draft-kompella-nvo3-server2nve-00 Abstract This document proposes a "single-touch" approach for provisioning the networking parameters related to Virtual Machine creation, migration and termination on servers. The idea is to provision the server, then have the server signal the requisite parameters to the relevant network device(s). Such an approach reduces the workload on the provisioning system and simplifies the data model that the provisioning system needs to maintain. Furthermore, it is more resilient to topology changes in server-network connectivity, for example, reconnecting a server to a different network port or switch. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 10, 2013. Copyright Notice Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of Kompella, et al. Expires January 10, 2013 [Page 1] Internet-Draft Simplifying NVE Provisioning July 2012 publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Conventions and Acronyms Used . . . . . . . . . . . . . . 4 1.2. Virtual Networks . . . . . . . . . . . . . . . . . . . . . 4 1.2.1. Current Mode of Operation . . . . . . . . . . . . . . 5 1.2.2. Future Mode of Operation . . . . . . . . . . . . . . . 6 1.3. Provisioning DCVPNs . . . . . . . . . . . . . . . . . . . 6 2. Signaling . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . 7 2.2. VM Operations . . . . . . . . . . . . . . . . . . . . . . 7 2.2.1. Network Parameters . . . . . . . . . . . . . . . . . . 7 2.2.2. Creating a VM . . . . . . . . . . . . . . . . . . . . 8 2.2.3. Terminating a VM . . . . . . . . . . . . . . . . . . . 9 2.2.4. Migrating a VM . . . . . . . . . . . . . . . . . . . . 10 2.3. Signaling Protocols . . . . . . . . . . . . . . . . . . . 11 3. Interfacing with BGP-based Realizations of DCVPNs . . . . . . 11 4. Security Considerations . . . . . . . . . . . . . . . . . . . 12 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 12 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 7.1. Normative References . . . . . . . . . . . . . . . . . . . 12 7.2. Informative References . . . . . . . . . . . . . . . . . . 12 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13 Kompella, et al. Expires January 10, 2013 [Page 2] Internet-Draft Simplifying NVE Provisioning July 2012 1. Introduction Creating a Virtual Machine (VM) on a server in a data center and making it operational involves several steps, among them: 1. gathering the CPU, network, storage, and appliance parameters required for the VM; 2. deciding which server, network, storage and appliance devices best match the VM requirements in the current state of the data center; 3. provisioning the server with the VM parameters; 4. provisioning the network element(s) to which the VM is connected with the network-related parameters of the VM; 5. informing the network element(s) to which a VM is connected about the VM's peer VMs, storage devices and other appliances with which the VM needs to communicate; 6. informing the network element(s) to which a VM's peer VMs are connected about the new VM and its addresses; 7. provisioning storage with the storage-related parameters; and 8. provisioning necessary appliances (firewalls, load balancers and "middle boxes"). Steps 1 and 2 are primarily information gathering. For Steps 3 to 8, the provisioning system talks actively to servers, network switches, storage and appliances, and must know the details of the physical server, network, storage and appliance connectivity topologies. Step 4 is typically done using just provisioning, whereas Steps 5 and 6 may be a combination of provisioning and other techniques. Steps 4 to 6 accomplish the task of provisioning the network for a VM, the result of which is a Data Center Virtual Private Network (DCVPN) overlaid on the physical network. This document focuses on the case where the network elements in Step 4 are not co-resident with the server, and shows how the provisioning in Step 4 can be replaced by signaling between server and network, using information from Step 3. This document also shows how Step 4 can interact seamlessly with some of the realizations of Steps 5 and 6. Kompella, et al. Expires January 10, 2013 [Page 3] Internet-Draft Simplifying NVE Provisioning July 2012 1.1. Conventions and Acronyms Used The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. The following acronyms are used: DCVPN: Data Center Virtual Private Network -- a virtual connectivity topology overlaid on physical devices to provide virtual devices with the connectivity they need and isolation from other DCVPNs NVE: Network Virtualization Edge -- the entities that realize private communication among VMs in a DCVPN lNVE: local NVE: wrt a VM, NVE elements to which it is directly connected rNVE: remote NVE: wrt a VM, NVE elements to which the VM's peer VMs are connected peer VM: wrt a VM, other VMs in the VM's DCVPN ToR: Top of Rack (or access) switch VDP: VSI Discovery and Configuration Protocol VID: 12-bit VLAN tag or identifier used locally between a server and its lNVE VLAN: Virtual Local Area Network VM: Virtual Machine (same as Virtual Station) VNID: DCVPN Identifier (sometimes called a Group Identifier) VSI: Virtual Station Interface 1.2. Virtual Networks The goal of provisioning networks for VMs is to create an "isolation domain" wherein a group of VMs can talk freely to each other, but communication to and from VMs outside that group is restricted (either prohibited, or mediated through a router or a firewall). Such an isolation domain, sometimes called a Closed User Group, here will be called a Data Center Virtual Private Network (DCVPN). The network elements that enforce the VN are called the Network Kompella, et al. Expires January 10, 2013 [Page 4] Internet-Draft Simplifying NVE Provisioning July 2012 Virtualization Edge (NVE). A DCVPN is assigned a global "name" that identifies it in the management plane; this name is unique in the scope of the data center, but may be unique across several cooperating data centers. A DCVPN is also assigned an identifier unique in the scope of the data center, the Virtual Network Group ID (VNID). The VNID is a control plane entity. A data plane tag is also needed to distinguish different DCVPNs' traffic; more on this later. For a given VM, the NVE can be classified into two parts: the network elements to which the VM's server is directly connected (the local NVE or lNVE), and those to which peer VMs are connected (the remote NVE or rNVE). In some cases, the lNVE is co-resident with the server hosting the VM; in other cases, the lNVE is separate (distributed lNVE). The latter case is the one of interest in this document. A VM is added to a DCVPN through Steps 4 to 6, which can be recast as follows. In Step 4, the lNVE are informed about the VM's VNID, network addresses and policies, and the lNVE and server agree on how to distinguish traffic for different DCVPNs from and to the server. In Step 5 the lNVE discover relevant rNVE elements and the addresses of their VMs. In Step 6, the rNVE discover the presence and addresses of the new VM. Once a DCVPN is created, the next steps for network provisioning are to create and apply policies such as for QoS or access control. These occur in three flavors: policies for all VMs in the group, policies for individual VMs, and policies for communication across DCVPN boundaries. 1.2.1. Current Mode of Operation DCVPNs are often realized as Ethernet VLAN segments. A VLAN segment satisfies the communication properties of a DCVPN. A VLAN also has data plane mechanisms for discovering rNVEs and VM addresses. When a DCVPN is realized as a VLAN, Step 4 requires provisioning both the server and lNVE with the VLAN tag that identifies the DCVPN. Step 6 requires provisioning all rNVEs with the same VLAN tag. Address learning is done by flooding, and the announcement of a new VM is typically by a "gratuitous ARP". While VLANs are familiar and well-understood, they fail to scale on several dimensions. Underlying VLANs is a Layer 2 infrastructure that traditionally has had a very weak control plane; only recently have there been efforts to shore this up. The number of independent VLANs in a Layer 2 domain is limited by the size of the VLAN tag. Data plane techniques (flooding and broadcast) are another source of Kompella, et al. Expires January 10, 2013 [Page 5] Internet-Draft Simplifying NVE Provisioning July 2012 serious concern as the overall size of the network grows. 1.2.2. Future Mode of Operation There are several scalable realizations of DCVPNs that address the isolation requirements of DCVPNs as well as the need for a scalable substrate for DCVPNs and the need for scalable mechanisms for rNVE and VM address discovery. While these are not the goal of this document, a secondary goal of this document is to show how the signaling that replaces Step 4 can seamlessly interact with several of these realizations of DCVPNs. VLAN tags (VIDs) will be used as the data plane tag to distinguish traffic for different DCVPNs' between a server and its lNVE. Note that, as used here, VIDs are only have local significance between server and NVE, not to be confused with the notion of VLANs, which are a data center-wide concept. Data plane tags between lNVE and rNVE depends on the encapsulation mechanism among the NVE; the lNVE is expected to map between VIDs and intra-NVE tags in both directions. 1.3. Provisioning DCVPNs Step 3 provisions the server; Steps 4 and 5 provision the lNVE elements; Step 6 provisions the rNVE elements. In some cases, the lNVE elements live within the server; in this case, Steps 3 and 4 are "single-touch" in that the provisioning system only needs to talk to the server, and both CPU and network parameters can be applied by the server. However, in other cases, the lNVE is separate from the server, requiring that the provisioning system talk independently to both the server and lNVE. This scenario, which we call "distributed local NVE", is the one considered in this document. This document resurrects "single-touch" provisioning in the distributed lNVE case. The approach here is to provision the server, then have the server signal the requisite parameters to the lNVE. Such an approach reduces the workload on the provisioning system, allowing it to scale both in the number of elements it can manage, as well as the rate at which it can process changes. It also simplifies the data model that the provisioning system needs to have; in particular, the provisioning system does not have to maintain a full, up-to-date map of server to network connectivity. Furthermore, it is more resilient to topology changes in server-network connectivity that have not yet been transmitted to the provisioning system. For example, if a server is reconnected to a different port or a different lNVE to recover from a malfunctioning port, the server can contact the new Kompella, et al. Expires January 10, 2013 [Page 6] Internet-Draft Simplifying NVE Provisioning July 2012 lNVE over the new port without the provisioning system being aware of the change. While the current document focuses on provisioning networking parameters via signaling, future extensions may address the provisioning of storage and middle-box parameters in a similar fashion. Companion documents will describe how NVEs to which peer VMs are connected can get the required networking information via signaling rather than by provisioning and/or other means. 2. Signaling 2.1. Preliminaries There are three common operations in a virtualized data center: creating a VM; migrating a VM from one physical server to another; and terminating a VM. Creating a VM requires "associating" it with its DCVPN; decommissioning a VM requires "dissociating" it from its DCVPN. Moving a VM consists of associating it with its DCVPN in its new location, then dissociating it from its old location. To facilitate a smooth migration of a VM, there is one additional operation, "pre-associate". 2.2. VM Operations 2.2.1. Network Parameters For each VM operation, a subset of the following information is needed from server to lNVE: operation: one of pre-associate, associate, pre-dissociate, dissociate. authentication: proof that this operation was authorized by the provisioning system VNID: identifier of DCVPN to which VM belongs VID: tag to use between server and lNVE to distinguish DCVPN traffic table type: realization of DCVPN on NVE: what type of forwarding the DCVPN requires address entries: addresses for VM on server Kompella, et al. Expires January 10, 2013 [Page 7] Internet-Draft Simplifying NVE Provisioning July 2012 policy VM-specific network policies, such as access control lists and/or QoS policies hold time: time (in milliseconds) to keep a VM's addresses after it migrates away from this lNVE Typically, for the pre-associate and associate messages, all the information except hold time would be needed. For the pre-dissociate message, all the above information except VID, table type and hold time would be needed. For the dissociate message, the pre-dissociate information plus the hold time would be needed. Operations are stateful, that is, they remain in place until superceded by another operation. For example, on receiving an associate message, a lNVE is expected to create and maintain the DCVPN table for a VM until the lNVE receives a dissociate message to remove the table. A separate liveness protocol may be run between server and lNVE to let each side know that the other is still operational; if the liveness protocol fails, each side may remove all state installed in response to messages from the other. In the descriptions below, we assume that the NVE layer provides a mechanism for control plane distribution of VM addresses, as opposed to doing this in the data plane. If this is not the case, NVE elements can skip the parts of the procedures below that involve address distribution. As VIDs are local to a server-lNVE device, in fact to a specific port connecting the two, the following map will prove useful to the NVE: -> VID. Note that valid values of VID are from 1 to 4094, inclusive. A value of 0 is used to mean "unassigned". 2.2.2. Creating a VM When a VM is instantiated on a server, it is assigned a VNID, VM addresses and a table type for the DCVPN. The VM addresses may be any of IPv4, IPv6 and MAC addresses. There may also be network policies specific to the VM. To connect the VM to its DCVPN, the server signals these parameters to the lNVE via an "associate" operation. The server may first optionally signal a "pre-associate" operation with the same parameters. (Note that the lNVE may consist of more than one device.) On receiving an associate message on port P from server S, each lNVE device L does the following: Kompella, et al. Expires January 10, 2013 [Page 8] Internet-Draft Simplifying NVE Provisioning July 2012 A.1: Validate the authentication (if present). If not, inform the provisioning system, log the error, and stop processing the associate message. A.2: Check if the VID in the associate message is zero; if so, look up the VID for ; if there is none, allocate a new VID and associates it with . A.3: If the VID in the associate message is non-zero, look up . If the result is zero, or equal to VID, all's well. Otherwise, respond to S with an error, and stop processing the associate message. A.4: If a table of appropriate type (as signaled) for VNID does not already exist, create it, and add the VM's addresses to it. A.5: Commmunicate with each rNVE device to advertise the VM's addresses, and also to get the addresses of other VMs in the DCVPN. Populate the table with the VM's addresses and addresses learned from each rNVE. A.6: Enable forwarding to and from the VM. A.7: Finally, respond to S with the VID for , and also saying that the operation was successful. 2.2.3. Terminating a VM On receiving a request from the provisioning system to terminate a VM, the server sends a dissociate message to the lNVE. The dissociate message contains the operation, authentication, VNID, table type, and VM addresses. On receiving the dissociate message on port P from server S, each lNVE device L does the following: D.1: Validate the authentication (if present). If not, inform the provisioning system, log the error, and stop processing the associate message. D.2: If the hold time is non-zero, point the VM's addresses in the VNID table to the new location of the VM, if known, or to "discard", and start a timer for the period of the hold time. If hold time is zero, immediately perform step D.4, then go to D.3. D.3: Set the VID for as unassigned. Respond to S saying that the operation was successful. Kompella, et al. Expires January 10, 2013 [Page 9] Internet-Draft Simplifying NVE Provisioning July 2012 D.4: When the hold timer expires, delete the VM's addresses from the VNID table. Delete any VM-specific network policies associated with any of the VM addresses. If the VNID table is empty after deleting the VM's addresses, optionally delete the table and any network policies for the VNID. 2.2.4. Migrating a VM Let's say that a VM is to be migrated from server S (connected to lNVE device L) to server S' (connected to lNVE device L'). The sequence of steps for migration is: M.1: S' gets a request to prepare to receive a copy of the VM from S. M.2: S gets a request to copy the VM to S'. M.3: S then gets a request to terminate the VM on S. M.4: Finally, S' gets a request to start up the VM on S'. At Step M.1, S' initiates the move, and also sends a pre-associate message to L', including the pre-associate information. The processing of a pre-associate message (PA.1 to PA.7) for L' is the same as that of an associate message (A.1 to A.7), with the following change to step 5. PA.5: Commmunicate with each rNVE device to advertise the VM's addresses but as non-preferred destinations(*). Also get the addresses of other VMs in the DCVPN. Populate the table with the VM's addresses and addresses learned from each rNVE. (*) See Section 3 for some mechanisms for doing this. This is necessary so that L' does not attract traffic to the VM's new location before the migration is complete, yet L knows ahead of time how to send traffic to L' (Step D.2), minimizing traffic loss to the VM when migration is complete. At step M.2, S initiates the VM copy. If at any time L hears advertisements from L' about how to communicate with the VM in its new location (as unpreferred destinations), L stores that information for use in step D.2. At step M.3, S terminates the running of the VM on itself, and sends a dissociate message to L with a non-zero hold time (either what the provisioning system sends, or a default value). L processes the dissociate message as above. Kompella, et al. Expires January 10, 2013 [Page 10] Internet-Draft Simplifying NVE Provisioning July 2012 2.3. Signaling Protocols There are several options for protocols to use to signal the above messages. One could invent a new protocol for this purpose. One could reuse existing protocols, among them LLDP, XMPP, HTTP REST, and VDP [VDP], a new protocol standardized for the purposes of signaling a VM's network parameters from server to lNVE. Several factors influence the choice of protocol(s); at this time, the focus is on what needs to be signaled, leaving for later the choice of how the information is signaled, and specific encodings. 3. Interfacing with BGP-based Realizations of DCVPNs DCVPNs may be realized using BGP-based VPNs, for example, BGP/MPLS VPNs ([RFC4364]), E-VPNs ([I-D.ietf-l2vpn-evpn]), and VPLS ([RFC4761]). BGP meets all the requirements stated in Section 4 of [I-D.kreeger-nvo3-overlay-cp]. BGP fits in well in the context of the signaling described here. 1. VM addresses signaled from server to lNVE through the associate message (and removed through the dissociate message) can be advertised (withdrawn) by the lNVE to rVNEs via BGP. 2. The control plane entity VNID used in server to NVE signaling can be mapped to the Route Target (RT) community used by BGP ([RFC4360]) as follows: use the two octet AS specific RT; set the AS number to the DC's AS number and the Assigned Number to the VNID. Furthermore, the "Route Target Constraint" mechanism ([RFC4684]) allows BGP to meet requirement 1 of [I-D.kreeger-nvo3-overlay-cp]. 3. The use of RTs solves the problem of a lNVE discovering rNVEs connected to the same DCVPN and vice versa (Steps P.5 and P.6). 4. Techniques such as those mentioned in [I-D.uttaro-idr-bgp-persistence] can be used to advertise routes to the new location of a migrating VM (Steps PA.5 and D.2) to minimize traffic loss during a VM migration. 5. BGP can minimize or eliminate "triangular routing" in data centers [I-D.raggarwa-data-center-mobility]. 6. BGP "flow spec" ([RFC5575]) advertisements annotated with the RT for a given DCVPN, can be used to populate the appropriate network policies for a DCVPN at the relevant NVE devices. Kompella, et al. Expires January 10, 2013 [Page 11] Internet-Draft Simplifying NVE Provisioning July 2012 7. Finally, BGP can address the issue of interconnecting data centers belonging to multiple providers. To enable item 6 above, the provisioning system can pre-populate network policies for DCVPNs on the Route Reflectors, which are basically control plane devices. Network policies for a given DCVPN should be tagged with that DCVPN's RT. Then, when a NVE device needs addresses and policies relevant to a DCVPN, it can request them by the RT, and the RR will send them to the NVE, along with routes for the DCVPN table. 4. Security Considerations 5. IANA Considerations 6. Acknowledgments Many thanks to Amit Shukla for his help with the details of EVB and his insight into data center issues. 7. References 7.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [VDP] IEEE 802.1 Working Group, "Edge Virtual Bridging (802.1Qbg) (work in progress)", 2012. 7.2. Informative References [I-D.ietf-l2vpn-evpn] Aggarwal, R., Sajassi, A., Henderickx, W., Bitar, N., Shekhar, R., and J. Drake, "BGP MPLS Based Ethernet VPN", draft-ietf-l2vpn-evpn-00 (work in progress), February 2012. [I-D.kreeger-nvo3-overlay-cp] Black, D., Dutt, D., Kreeger, L., Sridhavan, M., and T. Narten, "Network Virtualization Overlay Control Protocol Requirements", draft-kreeger-nvo3-overlay-cp-00 (work in progress), January 2012. Kompella, et al. Expires January 10, 2013 [Page 12] Internet-Draft Simplifying NVE Provisioning July 2012 [I-D.raggarwa-data-center-mobility] Aggarwal, R., Rekhter, Y., Henderickx, W., Shekhar, R., and L. Fang, "Data Center Mobility based on BGP/MPLS, IP Routing and NHRP", draft-raggarwa-data-center-mobility-03 (work in progress), June 2012. [I-D.uttaro-idr-bgp-persistence] Uttaro, J., Simpson, A., Shakir, R., Filsfils, C., Mohapatra, P., Decraene, B., Scudder, J., and Y. Rekhter, "BGP Persistence", draft-uttaro-idr-bgp-persistence-01 (work in progress), March 2012. [RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended Communities Attribute", RFC 4360, February 2006. [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006. [RFC4684] Marques, P., Bonica, R., Fang, L., Martini, L., Raszuk, R., Patel, K., and J. Guichard, "Constrained Route Distribution for Border Gateway Protocol/MultiProtocol Label Switching (BGP/MPLS) Internet Protocol (IP) Virtual Private Networks (VPNs)", RFC 4684, November 2006. [RFC4761] Kompella, K. and Y. Rekhter, "Virtual Private LAN Service (VPLS) Using BGP for Auto-Discovery and Signaling", RFC 4761, January 2007. [RFC5575] Marques, P., Sheth, N., Raszuk, R., Greene, B., Mauch, J., and D. McPherson, "Dissemination of Flow Specification Rules", RFC 5575, August 2009. Authors' Addresses Kireeti Kompella Juniper Networks 1194 N. Mathilda Ave. Sunnyvale, CA 94089 US Email: kireeti@juniper.net Kompella, et al. Expires January 10, 2013 [Page 13] Internet-Draft Simplifying NVE Provisioning July 2012 Yakov Rekhter Juniper Networks 1194 N. Mathilda Ave. Sunnyvale, CA 94089 US Email: yakov@juniper.net Thomas Morin France Telecom - Orange Labs 2, avenue Pierre Marzin Lannion 22307 France Email: thomas.morin@orange.com Kompella, et al. Expires January 10, 2013 [Page 14]