idnits 2.17.1

draft-kompella-nvo3-server2nve-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------

     No issues found here.

  Checking nits according to
  https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist:
  ----------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line
     does not match the current year

  -- The document date (April 29, 2013) is 4014 days in the past.  Is
     this intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------

  == Unused Reference: 'I-D.ietf-nvo3-overlay-problem-statement' is
     defined on line 777, but no explicit reference was found in the
     text

  == Unused Reference: 'I-D.kreeger-nvo3-overlay-cp' is defined on line
     784, but no explicit reference was found in the text

  == Outdated reference: A later version (-11) exists of
     draft-ietf-l2vpn-evpn-03

  == Outdated reference: A later version (-09) exists of
     draft-ietf-nvo3-framework-02

  == Outdated reference: A later version (-04) exists of
     draft-ietf-nvo3-overlay-problem-statement-02

  == Outdated reference: A later version (-04) exists of
     draft-kreeger-nvo3-overlay-cp-02

  == Outdated reference: A later version (-09) exists of
     draft-mahalingam-dutt-dcops-vxlan-03

  == Outdated reference: A later version (-08) exists of
     draft-sridharan-virtualization-nvgre-02

  Summary: 0 errors (**), 0 flaws (~~), 9 warnings (==), 1 comment (--).

  Run idnits with the --verbose option for more detailed information
  about the items above.

--------------------------------------------------------------------------------

Network Working Group                                        K. Kompella
Internet-Draft                                                Y. Rekhter
Intended status: Informational                          Juniper Networks
Expires: October 31, 2013                                       T. Morin
                                            France Telecom - Orange Labs
                                                                D. Black
                                                         EMC Corporation
                                                          April 29, 2013

 Signaling Virtual Machine Activity to the Network Virtualization Edge
                   draft-kompella-nvo3-server2nve-02

Abstract

   This document proposes a simplified approach for provisioning
   network parameters related to Virtual Machine creation, migration
   and termination on servers.  The idea is to provision the server,
   then have the server signal the requisite parameters to the relevant
   network device(s).  Such an approach reduces the workload on the
   provisioning system and simplifies the data model that the
   provisioning system needs to maintain.  It is also more resilient to
   topology changes in server-network connectivity, for example,
   reconnecting a server to a different network port or switch.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 31, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  VM Creation
     1.2.  VM Live Migration
     1.3.  VM Termination
   2.  Acronyms Used
   3.  Virtual Networks
     3.1.  Current Mode of Operation
     3.2.  Future Mode of Operation
   4.  Provisioning DCVPNs
   5.  Signaling
     5.1.  Preliminaries
     5.2.  VM Operations
       5.2.1.  Network Parameters
       5.2.2.  Creating a VM
       5.2.3.  Terminating a VM
       5.2.4.  Migrating a VM
     5.3.  Signaling Protocols
   6.  Interfacing with DCVPN Control Planes
   7.  Security Considerations
   8.  IANA Considerations
   9.  Acknowledgments
   10. Informative References
   Authors' Addresses

1.  Introduction

   To create a Virtual Machine (VM) on a server in a data center, one
   must specify parameters for the compute, storage, network and
   appliance aspects of the VM.  At a minimum, this requires
   provisioning the server that will host the VM, and the Network
   Virtualization Edge (NVE) that will implement the virtual network
   for the VM, in addition to the VM's storage.  Similar considerations
   apply to live migration and to terminating VMs.  This document
   proposes mechanisms whereby a server can be provisioned with all of
   the parameters for the VM, and the server in turn signals the
   networking aspects to the NVE.  The NVE may be located on the server
   or in an external network switch that may be directly connected to
   the server or accessed via an L2 (Ethernet) LAN or VLAN.  The
   following sections capture the abstract sequence of steps for VM
   creation, live migration and termination.
   While much of the material in this draft may apply to virtual
   entities other than virtual machines that exist on physical entities
   other than servers, this draft is written in terms of virtual
   machines and servers for clarity.

1.1.  VM Creation

   This section describes an abstract sequence of steps involved in
   creating a VM and making it operational (the latter is also known as
   "powering on" the VM).  The following steps are intended as an
   illustrative example, not as prescriptive text; the goal is to
   capture sufficient detail to set a context for the signaling
   described in Section 5.

   Creating a VM requires:

   1.  gathering the compute, network, storage, and appliance
       parameters required for the VM;

   2.  deciding which server, network, storage and network appliance
       devices best match the VM requirements in the current state of
       the data center;

   3.  provisioning the server with the VM parameters;

   4.  provisioning the network element(s) to which the server is
       connected with the network-related parameters of the VM;

   5.  informing the network element(s) to which the server is
       connected about the VM's peer VMs, storage devices and other
       network appliances with which the VM needs to communicate;

   6.  informing the network element(s) to which a VM's peer VMs are
       connected about the new VM and its addresses;

   7.  provisioning storage with the storage-related parameters; and

   8.  provisioning necessary network appliances (firewalls, load
       balancers and "middle boxes").

   Steps 1 and 2 are primarily information gathering.  For Steps 3 to
   8, the provisioning system talks actively to servers, network
   switches, storage and appliances, and must know the details of the
   physical server, network, storage and appliance connectivity
   topologies.  Step 4 is typically done using just provisioning,
   whereas Steps 5 and 6 may be a combination of provisioning and other
   techniques that may defer discovery of the relevant information.
   Steps 4 to 6 accomplish the task of provisioning the network for a
   VM, the result of which is a Data Center Virtual Private Network
   (DCVPN) overlaid on the physical network.

   While shown as a numbered sequence above, some of these steps may be
   concurrent (e.g., server, storage and network provisioning for the
   new VM may be done concurrently), and the two "informing" steps for
   the network (5 and 6) may be partially or fully lazily evaluated,
   based on network traffic that the VM sends or receives after it
   becomes operational.

   This document focuses on the case where the network elements in Step
   4 are not co-resident with the server, and describes how the
   provisioning in Step 4 can be replaced by signaling between server
   and network, using information from Step 3.
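   As a non-normative illustration of this division of labor, the
   Python sketch below shows a provisioning workflow in which Step 4 is
   replaced by server-to-NVE signaling.  All object and function names
   here are hypothetical, not part of any defined API.

      # Hypothetical orchestration of Steps 1-8, with Step 4 replaced
      # by server-to-NVE signaling as proposed in this document.

      def create_vm(request, provisioning):
          # Steps 1-2: gather parameters and choose devices.
          params = provisioning.gather_parameters(request)
          server, storage, appliances = provisioning.place(params)

          # Step 3: provision the server with compute AND network
          # parameters; the server then signals the network parameters
          # to its l-NVE itself (Section 5), replacing Step 4.
          server.provision(params)

          # Steps 5-6 may be deferred: the DCVPN control plane or data
          # plane discovery can propagate peer information lazily.

          # Steps 7-8: storage and appliances, possibly concurrently.
          storage.provision(params.storage)
          for appliance in appliances:
              appliance.provision(params.network_policies)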
1.2.  VM Live Migration

   This subsection describes an abstract sequence of steps involved in
   live migration of a VM.  Live migration is sometimes referred to as
   "hot" migration, in that, from an external viewpoint, the VM appears
   to continue to run while being migrated to another server (e.g., TCP
   connections generally survive this class of migration).  In
   contrast, suspend/resume (or "cold") migration consists of
   suspending VM execution on one server and resuming it on another.
   The following live migration steps are intended as an illustrative
   example, not as prescriptive text; the goal is to capture sufficient
   detail to provide context for the signaling described in Section 5.

   For simplicity, this set of abstract steps assumes shared storage,
   so that the VM's storage is accessible to the source and destination
   servers.  Live migration of a VM requires:

   1.   deciding which server should be the destination of the
        migration, based on the VM's requirements, the data center
        state and the reason for the migration;

   2.   provisioning the destination server with the VM parameters and
        creating a VM to receive the live migration;

   3.   provisioning the network element(s) to which the destination
        server is connected with the network-related parameters of the
        VM;

   4.   transferring the VM's memory image between the source and
        destination servers;

   5.   actually moving the VM: pausing the VM's execution on the
        source server, transferring the VM's execution state and any
        remaining memory state to the destination server, and
        continuing the VM's execution on the destination server;

   6.   informing the network element(s) to which the destination
        server is connected about the VM's peer VMs, storage devices
        and other network appliances with which the VM needs to
        communicate;

   7.   informing the network element(s) to which a VM's peer VMs are
        connected about the VM's new location;

   8.   activating the VM's network parameters at the destination
        server;

   9.   deprovisioning the VM from the network element(s) to which the
        source server is connected; and

   10.  deleting the VM from the source server.

   Step 1 is primarily information gathering.  For Steps 2, 3, 9 and
   10, the provisioning system talks actively to servers, network
   switches and appliances, and must know the details of the physical
   server, network and appliance connectivity topologies.  Steps 4 and
   5 are usually handled directly by the servers involved.  Steps 6 to
   9 may be handled by the servers (e.g., one or more "gratuitous" ARPs
   or RARPs from the destination server may accomplish all four steps)
   or by other techniques.  For Steps 6 and 7, the other techniques may
   involve discovery of the relevant information after the VM has been
   migrated.

   While shown as a numbered sequence above, some of these steps may be
   concurrent (e.g., moving the VM and the associated network changes),
   and the two "informing" steps (6 and 7) may be partially or fully
   lazily evaluated, based on network traffic that the VM sends and/or
   receives after it is migrated to the destination server.

   This document focuses on the case where the network elements are not
   co-resident with the server, and shows how the provisioning in Step
   3 and the deprovisioning in Step 9 can be replaced by signaling
   between server and network, using information from Step 2.
1.3.  VM Termination

   This subsection describes an abstract sequence of steps involved in
   termination of a VM, also referred to as "powering off" a VM.  The
   following termination steps are intended as an illustrative example,
   not as prescriptive text; the goal is to capture sufficient detail
   to set a context for the signaling described in Section 5.

   Termination of a VM requires:

   1.  ensuring that the VM is no longer executing;

   2.  deprovisioning the VM from the network element(s) to which the
       server is connected; and

   3.  deleting the VM from the server (the VM's image may remain on
       storage for reuse).

   Steps 1 and 3 are handled by the server, based on instructions from
   the provisioning system.  For Step 2, the provisioning system talks
   actively to servers, network switches, storage and appliances, and
   must know the details of the physical server, network, storage and
   appliance connectivity topologies.

   While shown as a numbered sequence above, some of these steps may be
   concurrent (e.g., network deprovisioning and VM deletion).

   This document focuses on the case where the network elements in Step
   2 are not co-resident with the server, and shows how the
   deprovisioning in Step 2 can be replaced by signaling between server
   and network.

2.  Acronyms Used

   The following acronyms are used:

   DCVPN:  Data Center Virtual Private Network -- a virtual
      connectivity topology overlaid on physical devices to provide
      virtual devices with the connectivity they need and isolation
      from other DCVPNs.  This corresponds to the concept of a Virtual
      Network Instance (VNI) in [I-D.ietf-nvo3-framework].

   NVE:  Network Virtualization Edge -- the entities that realize
      private communication among VMs in a DCVPN.

   l-NVE:  local NVE -- with respect to a VM, the NVE elements to which
      it is directly connected.

   r-NVE:  remote NVE -- with respect to a VM, the NVE elements to
      which the VM's peer VMs are connected.

   NVGRE:  Network Virtualization using Generic Routing Encapsulation.

   VDP:  VSI Discovery and Configuration Protocol.

   VID:  12-bit VLAN tag or identifier, used locally between a server
      and its l-NVE.

   VLAN:  Virtual Local Area Network.

   VM:  Virtual Machine (same as Virtual Station).

   Peer VM:  with respect to a VM, the other VMs in the VM's DCVPN.

   VNID:  DCVPN Identifier.

   VSI:  Virtual Station Interface.

   VXLAN:  Virtual eXtensible Local Area Network.
3.  Virtual Networks

   The goal of provisioning a network for VMs is to create an
   "isolation domain" wherein a group of VMs can talk freely to each
   other, but communication to and from VMs outside that group is
   restricted (either prohibited, or mediated via a router, firewall or
   other network gateway).  Such an isolation domain, sometimes called
   a Closed User Group, will here be called a Data Center Virtual
   Private Network (DCVPN).  The network elements on the outer border
   or edge of the overlay portion of a Virtual Network are called
   Network Virtualization Edges (NVEs).

   A DCVPN is assigned a global "name" that identifies it in the
   management plane; this name is unique in the scope of the data
   center, but may also be unique across several cooperating data
   centers.  A DCVPN is also assigned an identifier unique in the scope
   of the data center, the Virtual Network Group ID (VNID).  The VNID
   is a control plane entity.  A data plane tag is also needed to
   distinguish different DCVPNs' traffic; more on this later.

   For a given VM, the NVE can be classified into two parts: the
   network elements to which the VM's server is directly connected (the
   local NVE or l-NVE), and those to which peer VMs are connected (the
   remote NVE or r-NVE).  In some cases, the l-NVE is co-resident with
   the server hosting the VM; in other cases, the l-NVE is separate
   (distributed l-NVE).  The latter case is the one of primary interest
   in this document.

   A created VM is added to a DCVPN through Steps 4 to 6 in Section
   1.1, which can be recast as follows.  In Step 4, the l-NVE(s) are
   informed about the VM's VNID, network addresses and policies, and
   the l-NVE and server agree on how to distinguish traffic for
   different DCVPNs from and to the server.  In Step 5, the relevant
   r-NVE elements and the addresses of their VMs are discovered, and in
   Step 6, the r-NVE(s) are informed of the presence of the new VM and
   obtain or discover its addresses.  For both Steps 5 and 6, the
   discovery may be lazily evaluated, so that it occurs after the VM
   begins sending and receiving DCVPN traffic.

   Once a DCVPN is created, the next steps for network provisioning are
   to create and apply policies, such as those for QoS or access
   control.  These occur in three flavors: policies for all VMs in the
   group, policies for individual VMs, and policies for communication
   across DCVPN boundaries.

3.1.  Current Mode of Operation

   DCVPNs are often realized as Ethernet VLAN segments.  A VLAN segment
   satisfies the communication properties of a DCVPN.  A VLAN also has
   data plane mechanisms for discovering network elements (Layer 2
   switches, a.k.a. bridges) and VM addresses.  When a DCVPN is
   realized as a VLAN, Step 4 in Section 1.1 requires provisioning both
   the server and the l-NVE with the VLAN tag that identifies the
   DCVPN.  Step 6 requires provisioning all involved network elements
   with the same VLAN tag.  Address learning is done by flooding, and
   the announcement of a new VM or the new location of a migrated VM is
   often via a "gratuitous" ARP or RARP.

   While VLANs are familiar and well understood, they have scaling
   challenges because they are Layer 2 infrastructure.  The number of
   independent VLANs in a Layer 2 domain is limited by the 12-bit size
   of the VLAN tag.  In addition, data plane techniques (flooding and
   broadcast) are another source of scaling concerns as the overall
   size of the network grows.

3.2.  Future Mode of Operation

   There are multiple scalable realizations of DCVPNs that address the
   isolation requirements of DCVPNs, the need for a scalable substrate
   for DCVPNs, and the need for scalable mechanisms for NVE and VM
   address discovery.  While describing these approaches is beyond the
   scope of this document, a secondary goal of this document is to show
   how the signaling that replaces Step 4 in Section 1.1 can seamlessly
   interact with such realizations of DCVPNs.

   VLAN tags (VIDs) will be used as the data plane tag to distinguish
   different DCVPNs' traffic between a server and its l-NVE.  Note
   that, as used here, VIDs have only local significance between server
   and NVE, and should not be confused with data-center-wide usage of
   VLANs.  If VLAN tags are used for traffic between NVEs, that tag
   usage depends on the encapsulation mechanism among the NVEs and is
   orthogonal to VLAN tag usage between servers and l-NVEs.

4.  Provisioning DCVPNs

   For VM creation as described in Section 1.1, Step 3 provisions the
   server; Steps 4 and 5 provision the l-NVE elements; Step 6
   provisions the r-NVE elements.

   In some cases, the l-NVE is located within the server (e.g., a
   software-implemented switch within a hypervisor); in this case,
   Steps 3 and 4 are "single-touch" in that the provisioning system
   need only talk to the server, as both compute and network parameters
   are applied by the server.  However, in other cases, the l-NVE is
   separate from the server, requiring that the provisioning system
   talk to both the server and the l-NVE.  This scenario, which we call
   "distributed local NVE", is the one considered in this document.
   This draft's goal is to describe how "single-touch" provisioning can
   be achieved in the distributed l-NVE case.

   The overall approach is to provision the server, and have the server
   signal the requisite parameters to the l-NVE.  This approach reduces
   the workload on the provisioning system, allowing it to scale both
   in the number of elements it can manage and in the rate at which it
   can process changes.  It also simplifies the data model of the
   network that is used by the provisioning system, because a complete,
   up-to-date map of server-to-network connectivity is not required.
   This approach is also more resilient to server-network connectivity
   or topology changes that have not yet been transmitted to the
   provisioning system.  For example, if a server is reconnected to a
   different port or a different l-NVE to recover from a malfunctioning
   port, the server can contact the new l-NVE over the new port without
   the provisioning system needing to be immediately aware of the
   change.

   While this draft focuses on provisioning networking parameters via
   signaling, extensions may address the provisioning of storage and
   network appliance parameters in a similar fashion.

5.  Signaling

5.1.  Preliminaries

   This draft considers three common VM operations in a virtualized
   data center: creating a VM, migrating a VM from one physical server
   to another, and terminating a VM.  Creating a VM requires
   "associating" it with its DCVPN and "activating" that association;
   decommissioning a VM requires "dissociating" the VM from its DCVPN.
   Moving a VM consists of associating it with its DCVPN in its new
   location, activating that association, and dissociating the VM from
   its old location.
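   These three operations induce a simple per-attachment lifecycle at
   the l-NVE.  The following sketch is a non-normative illustration;
   the enum and the transition table are not defined by this document.

      from enum import Enum, auto

      class AttachmentState(Enum):
          # State of one VM/DCVPN attachment, as seen by an l-NVE.
          UNKNOWN = auto()     # no state installed
          ASSOCIATED = auto()  # associate done; forwarding not enabled
          ACTIVE = auto()      # activate done; forwarding enabled

      # Transitions implied by Section 5.1.  Migration runs associate
      # and activate at the new l-NVE, then dissociate at the old one.
      TRANSITIONS = {
          ("associate",  AttachmentState.UNKNOWN):
              AttachmentState.ASSOCIATED,
          ("activate",   AttachmentState.ASSOCIATED):
              AttachmentState.ACTIVE,
          ("dissociate", AttachmentState.ASSOCIATED):
              AttachmentState.UNKNOWN,
          ("dissociate", AttachmentState.ACTIVE):
              AttachmentState.UNKNOWN,
      }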
5.2.  VM Operations

5.2.1.  Network Parameters

   For each VM association or dissociation operation, a subset of the
   following information is needed from server to l-NVE:

   operation:  one of associate or dissociate.

   authentication:  proof that this operation was authorized by the
      provisioning system.

   VNID:  identifier of the DCVPN to which the VM belongs.

   VID:  tag to use between server and l-NVE to distinguish DCVPN
      traffic; the value zero in an associate operation is a request
      that the l-NVE assign an unused VID.  This approach provides
      extensibility by allowing the VID to be a VLAN ID, although other
      local means of multiplexing traffic between the server and the
      NVE could be used instead of VIDs.

   encapsulation type:  type of encapsulation used by the DCVPN for
      traffic exchanged between NVEs (see below).

   network addresses:  network addresses for the VM on the server
      (e.g., MACs).

   policy:  VM-specific and/or network-address-specific network
      policies, such as access control lists and/or QoS policies.

   hold time:  time (in milliseconds) to keep a VM's addresses after it
      migrates away from this l-NVE.  This is usually set to zero when
      a VM is terminated.

   per-address-VID-allocation:  boolean flag that can optionally be set
      to "yes", resulting in the VID allocated to this address being
      distinct from the VIDs allocated to other addresses (for the same
      VM or other VMs) connected to the same DCVPN on the same NVE
      port; this behavior will result in traffic always transiting
      through the NVE, even to/from other addresses for the same DCVPN
      on the same server.
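   As a non-normative illustration, these parameters can be collected
   into a message structure along the following lines (Python
   dataclasses; the field names mirror the list above, while the types
   and the Operation enum are assumptions, not a defined encoding).

      from dataclasses import dataclass, field
      from enum import Enum
      from typing import List, Optional

      class Operation(Enum):
          ASSOCIATE = "associate"
          DISSOCIATE = "dissociate"

      @dataclass
      class VMNetworkMessage:
          operation: Operation
          vnid: int                  # DCVPN identifier
          vid: int                   # 0 in an associate asks the l-NVE
                                     # to allocate an unused VID
          addresses: List[str]       # MAC/IPv4/IPv6 addresses
          authentication: Optional[bytes] = None
          encapsulation_type: Optional[str] = None  # e.g., VXLAN, NVGRE
          policies: List[str] = field(default_factory=list)
          hold_time_ms: int = 0      # dissociate: 0 on termination
          per_address_vid_allocation: bool = False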
   The "activate" operation is a dataplane operation that references a
   previously established association via the address and VID; all
   other parameters are obtained at the NVE by mapping the source
   address, VID and port involved to the information established by a
   prior associate operation.

   Realizations of DCVPNs include E-VPNs ([I-D.ietf-l2vpn-evpn]), IP
   VPNs ([RFC4364]), NVGRE ([I-D.sridharan-virtualization-nvgre]), VPLS
   ([RFC4761], [RFC4762]), and VXLAN
   ([I-D.mahalingam-dutt-dcops-vxlan]).  The encapsulation type
   determines whether forwarding at the NVE for the DCVPN is based on
   Layer 2 or Layer 3 service.

   Typically, for the associate message, all of the above information
   except the hold time would be needed.  Similarly, for the dissociate
   message, all of the above information except the VID and
   encapsulation type would typically be needed.

   These operations are stateful, in that their results remain in place
   until superseded by another operation.  For example, on receiving an
   associate message, an NVE is expected to create and maintain the
   DCVPN information for the addresses until the NVE receives a
   dissociate message to remove that information.  A separate liveness
   protocol may be run between server and NVE to let each side know
   that the other is still operational; if the liveness protocol fails,
   each side may remove state installed in response to messages from
   the other.

   The descriptions below generally assume that the NVEs participate in
   a mechanism for control plane distribution of VM addresses, as
   opposed to doing this in the data plane.  If this is not the case,
   NVE elements can lazily evaluate (via data plane discovery) the
   parts of the procedures below that involve address distribution.

   As VIDs are local to server-NVE communication (in fact, to a
   specific port connecting these two elements), a mapping table
   containing 4-tuples of the following form will prove useful to the
   NVE:

      <port, VID, VNID, list of network addresses>

   The valid VID values are from 1 to 4094, inclusive.  The value 0 is
   used to mean "unassigned".  When a VID can be shared by more than
   one VM, it is necessary to reference-count entries in this table;
   the list of addresses in an entry serves this purpose.  Entries in
   this table have multiple uses:

   o  Finding the VNID for a VID and port for association, activation
      and traffic forwarding;

   o  Determining whether a VID exists (has already been assigned) for
      a VNID and port;

   o  Determining which <port, VID> pairs to use for forwarding traffic
      that requires flooding on the DCVPN.

   For simplicity and clarity, this draft assumes that the network
   interfaces in VMs (vNICs) do not use VLAN tags.
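   A minimal sketch of such a table, keyed by <port, VID> and
   supporting the three uses above, follows; the class and method names
   are illustrative only.

      from dataclasses import dataclass, field
      from typing import Dict, List, Tuple

      @dataclass
      class MappingEntry:
          vnid: int
          addresses: List[str] = field(default_factory=list)
          # len(addresses) doubles as the reference count.

      class MappingTable:
          """Holds <port, VID, VNID, list of addresses> 4-tuples."""

          def __init__(self) -> None:
              self.entries: Dict[Tuple[str, int], MappingEntry] = {}

          def add(self, port: str, vid: int, vnid: int,
                  address: str) -> None:
              # Install or extend the entry for <port, VID>.
              entry = self.entries.setdefault((port, vid),
                                              MappingEntry(vnid=vnid))
              if address not in entry.addresses:
                  entry.addresses.append(address)

          def vnid_for(self, port: str, vid: int) -> int:
              # Use 1: VNID for a <VID, port> (association, activation,
              # forwarding).  Raises KeyError if no association exists.
              return self.entries[(port, vid)].vnid

          def vid_for(self, port: str, vnid: int) -> int:
              # Use 2: has a VID been assigned for <VNID, port>?
              # Valid VIDs are 1..4094; 0 means "unassigned".
              for (p, vid), entry in self.entries.items():
                  if p == port and entry.vnid == vnid:
                      return vid
              return 0

          def flood_targets(self, vnid: int) -> List[Tuple[str, int]]:
              # Use 3: all <port, VID> pairs for flooded DCVPN traffic.
              return [key for key, e in self.entries.items()
                      if e.vnid == vnid]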
   There may also be network policies specific to the VM or its
   interfaces.  To connect the VM to its DCVPN, the server signals
   these parameters to the l-NVE via an "associate" operation, followed
   by an "activate" operation to put the parameters into use.  (Note
   that the l-NVE may consist of more than one device.)

   On receiving an associate message on port P from server S, an NVE
   device does the following for each network address in that message:

   A.1:  Validate the authentication (if present).  If validation
      fails, inform the provisioning system, log the error, and stop
      processing the associate message.  This validation may include
      authorization checks.

   A.2:  Check the per-address-VID-allocation flag in the associate
      message:

      *  If this flag is not set:

         +  Check if the VID in the associate message is zero (i.e.,
            the associate message requests VID allocation); if so, look
            up the VID for <P, VNID>; if there is no current VID for
            that tuple, allocate a new VID.

         +  If the VID in the associate message is non-zero, look up
            the VID for <P, VNID>.  If that lookup results in the same
            VID as the one in the associate message, associate that VID
            with <P, VNID, network address>.  If the lookup indicates
            that there is no current VID for that tuple, associate the
            VID in the associate message with <P, VNID, network
            address>.  Otherwise, the VID in the associate message does
            not match the VID that is currently in use for <P, VNID>,
            so respond to S with an error, and stop processing the
            associate message.

      *  If this flag is set, check if the VID in the associate message
         is zero:

         +  If so, this is an allocation request, so allocate a new
            VID, distinct from other VIDs allocated on this port;

         +  If the VID is non-zero, check that the provided VID is
            distinct from other VIDs allocated on this port; if so,
            associate the VID with <P, VNID, network address>.  If not,
            the provided VID is already in use and hence cannot be
            dedicated to this network address, so respond to S with an
            error, and stop processing the associate message.

   A.3:  Add the entry <P, VID, VNID, network address> to the NVE's
      mapping table.  This table entry includes information about the
      DCVPN encapsulation type for the VNID.

   A.4:  Communicate with the control plane to advertise the network
      address, and (if the VNID is new to the NVE) also to get the
      other network addresses in the DCVPN.  Populate the NVE's mapping
      table with all of these network addresses (some control planes
      may not provide all, or even any, of the other addresses in the
      DCVPN at this point).

   A.5:  Finally, respond to S with the VID for <VNID, network
      address>, and indicate that the operation was successful.
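   The following non-normative sketch shows one way Steps A.1 to A.5
   could be implemented at an NVE device, reusing the VMNetworkMessage
   and MappingTable sketches above.  The nve object and its helpers
   (validate_auth, log_and_notify, allocate_vid, control_plane) are
   stand-ins, not defined by this document.

      def handle_associate(nve, port: str, msg: VMNetworkMessage):
          """Sketch of Steps A.1-A.5 for one associate message."""
          # A.1: validate authentication, if present (this may include
          # authorization checks).
          if msg.authentication and not nve.validate_auth(
                  msg.authentication):
              nve.log_and_notify("associate: authentication failed")
              return {"status": "error"}

          vids = {}
          for address in msg.addresses:
              if not msg.per_address_vid_allocation:
                  # A.2, flag unset: one shared VID per <port, VNID>.
                  current = nve.table.vid_for(port, msg.vnid)  # 0=none
                  if msg.vid == 0:               # allocation request
                      vid = current or nve.allocate_vid(port)
                  elif current in (0, msg.vid):  # accept offered VID
                      vid = msg.vid
                  else:                          # VID conflict
                      return {"status": "error",
                              "reason": "VID in use for <P, VNID>"}
              else:
                  # A.2, flag set: the VID must be distinct from all
                  # other VIDs on this port (allocate_vid is assumed
                  # to return such a VID).
                  vid = msg.vid or nve.allocate_vid(port)
                  if msg.vid != 0 and (port, vid) in nve.table.entries:
                      return {"status": "error",
                              "reason": "VID not distinct on port"}
              # A.3: install <P, VID, VNID, address>; the real entry
              # also records the DCVPN encapsulation type.
              nve.table.add(port, vid, msg.vnid, address)
              # A.4: advertise the address; for a VNID new to this NVE,
              # also fetch the DCVPN's other addresses.
              nve.control_plane.advertise(msg.vnid, address)
              vids[address] = vid
          # A.5: report the VID(s) and success back to server S.
          return {"status": "ok", "vids": vids}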
   After a successful associate, the network has been provisioned (at
   least in the local NVE) for traffic, but forwarding has not been
   enabled.  On receiving an activate message on port P from server S,
   an NVE device does the following (activate is a one-way message that
   does not have a response):

   B.1:  Validate the authentication (if present).  If validation
      fails, inform the provisioning system, log the error, and stop
      processing the activate message.  This validation may include
      authorization checks.  The authentication and authorization may
      be implicit when the activate message is a dataplane frame (e.g.,
      a "gratuitous" ARP or RARP).

   B.2:  Check if the VID in the activate message is zero.  If so, log
      the error, and stop processing the activate message.

   B.3:  Use the VID and port P to look up the VNID from a previous
      associate message.  If there is no mapping table state for that
      VID and port, log the error and stop processing the activate
      message.

   B.4:  If forwarding is not enabled for <P, VID, VNID>, activate it,
      mapping VID -> VNID on this port (P) for traffic sent to and
      received from r-NVEs.

   B.5:  If the activate message is a dataplane frame that requires
      forwarding beyond the NVE (e.g., a "gratuitous" ARP or RARP), use
      the activated forwarding to send the frame onward via the virtual
      network identified by the VNID.
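   A matching non-normative sketch of Steps B.1 to B.5, under the same
   assumptions as the associate sketch above; dissociate processing
   (Steps D.1 to D.4 in the next section) would follow the same
   pattern.

      from typing import Optional

      def handle_activate(nve, port: str, vid: int,
                          frame: Optional[bytes] = None) -> None:
          """Sketch of Steps B.1-B.5; activate has no response, so
          failures are only logged."""
          # B.1: validation elided here; it may be implicit when the
          # activate message is a dataplane frame such as a
          # "gratuitous" ARP or RARP.
          # B.2: a zero VID can never refer to an association.
          if vid == 0:
              nve.log_and_notify("activate: VID must be non-zero")
              return
          # B.3: map <VID, port> back to the VNID of a prior associate.
          try:
              vnid = nve.table.vnid_for(port, vid)
          except KeyError:
              nve.log_and_notify("activate: no association for this "
                                 "<VID, port>")
              return
          # B.4: enable forwarding for <P, VID, VNID>, mapping
          # VID -> VNID on this port for traffic to and from r-NVEs.
          nve.enable_forwarding(port, vid, vnid)
          # B.5: forward the triggering frame itself, if any, onto the
          # virtual network identified by the VNID.
          if frame is not None:
              nve.forward(vnid, frame)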
5.2.3.  Terminating a VM

   On receiving a request from the provisioning system to terminate
   execution of a VM (powering off the VM, whether or not the VM's
   image is retained on storage), the server sends a dissociate message
   to the l-NVE with the hold time set to zero.  The dissociate message
   contains the operation, authentication, VNID, hold time, and VM
   addresses.  On receiving the dissociate message on port P from
   server S, each NVE device L does the following:

   D.1:  Validate the authentication (if present).  If validation
      fails, inform the provisioning system, log the error, and stop
      processing the dissociate message.

   D.2:  Communicate with the control plane to withdraw the VM's
      addresses.  If the hold time is non-zero, wait until the hold
      time expires before proceeding to the next step.

   D.3:  Delete the VM's addresses from the mapping table, and delete
      any VM-specific network policies associated with any of the VM's
      addresses.  If a mapping tuple contains no VM addresses as a
      result, delete that tuple.  If the mapping table contains no
      entries for the VNID involved after deleting the tuple,
      optionally delete any network policies for the VNID.

   D.4:  Respond to S saying that the operation was successful.

   At Step D.2, the control plane is responsible for not disrupting
   network operation if the addresses are in use at another l-NVE.
   Also, l-NVEs cannot rely on receiving dissociate messages for all
   terminated VMs, as a server crash may implicitly terminate a VM
   before a dissociate message can be sent.

5.2.4.  Migrating a VM

   Consider a VM that is being migrated from server S (connected to
   l-NVE device L) to server S' (connected to l-NVE device L').  This
   section assumes shared storage, so that both S and S' have access to
   the VM's storage.  The sequence of steps for a successful VM
   migration is:

   M.1:  S' gets a request to prepare to receive a copy of the VM from
      S.

   M.2:  S gets a request to copy the VM to S'.

   M.3:  The copy of the VM (memory, configuration state, etc.) occurs
      while the VM continues to execute.

   M.4:  When that copy has made sufficient progress, S pauses the VM
      and completes the copy, including the VM's execution state.

   M.5:  S' gets a request to resume the paused VM.

   M.6:  After that resume has succeeded, S then proceeds to terminate
      the paused VM on S (see Section 5.2.3), but this operation may
      specify a non-zero hold time during which traffic received may be
      forwarded to the VM's new location.

   Steps M.1 and M.2 initiate the copy of the VM.  During Step M.3, S'
   sends an "associate" message to L' for each of the VM's network
   addresses (S' receives information about these addresses as part of
   the VM copy).  Step M.4 occurs when the VM copy has made sufficient
   progress that the pause required to transfer the VM's execution from
   S to S' is sufficiently short.  At Step M.4, or at Step M.5 at the
   latest, S' sends an "activate" message to L' for each of the VM's
   interfaces.  At Step M.6, S sends a "dissociate" message to L for
   each of the VM's network addresses, optionally with a non-zero hold
   time.

   From the DCVPN's view, there are two important overlaps in the
   apparent network location of the VM's addresses:

   o  The VM's addresses are associated with both L and L' between
      Steps M.3 and M.6.

   o  The VM's addresses are activated at L' during Step M.4, or Step
      M.5 at the latest (e.g., if activate is a dataplane operation
      based on traffic sent at that step); both of these typically
      occur before these addresses are dissociated at L during Step
      M.6.

   The DCVPN control plane must work correctly in the presence of these
   overlaps, and in particular must not:

   o  Fail to activate the VM's network addresses at L' because they
      have not yet been withdrawn at L, or

   o  Disruptively withdraw the VM's network addresses from use at Step
      M.6 of a migration when the VM continues to execute on a
      different server.

   An additional scenario that is important for migration is that the
   source and destination servers, S and S', may share a common l-NVE,
   i.e., L and L' are the same.  In this scenario, there is no need for
   remote interaction of that l-NVE with other NVEs, but that NVE must
   be aware of the possibility of a new association of the VM's
   addresses with a different port, and of the need to promptly
   activate them on that port even though they have not (yet) been
   dissociated from their original port.
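   To summarize the sequencing of the signaling during migration, the
   following non-normative sketch shows Steps M.1 to M.6 from the
   servers' point of view; the source, destination and l_nve objects
   and their methods are hypothetical.

      def live_migrate(vm, source, destination, hold_time_ms: int):
          """Sketch of Steps M.1-M.6 with the associated signaling."""
          destination.prepare_for(vm)                      # M.1
          source.start_copy(vm, destination)               # M.2, M.3

          # During M.3: associate at L' while the VM still runs at S,
          # so the addresses are briefly known at both L and L'.
          for address in vm.addresses:
              destination.l_nve.associate(vm.vnid, address)

          source.pause_and_complete_copy(vm, destination)  # M.4

          # At M.4, or at M.5 at the latest: activate at L'.
          for address in vm.addresses:
              destination.l_nve.activate(vm.vnid, address)
          destination.resume(vm)                           # M.5

          # M.6: dissociate at L, with a hold time so that traffic
          # still arriving at L can reach the VM's new location.
          for address in vm.addresses:
              source.l_nve.dissociate(vm.vnid, address, hold_time_ms)
          source.delete(vm)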
5.3.  Signaling Protocols

   There are multiple protocols that can be used to signal the above
   messages.  One could invent a new protocol for this purpose, or
   reuse existing protocols, among them LLDP, XMPP, HTTP REST, and VDP
   [VDP], a new protocol standardized for the purpose of signaling a
   VM's network parameters from server to l-NVE.  Multiple factors
   influence the choice of protocol(s); this draft's focus is on what
   needs to be signaled, leaving the choice of how the information is
   signaled, and the specific encodings, for other drafts to consider.

6.  Interfacing with DCVPN Control Planes

   The control plane for a DCVPN manages the creation/deletion,
   membership and span of the DCVPN
   ([I-D.ietf-nvo3-overlay-problem-statement],
   [I-D.kreeger-nvo3-overlay-cp]).  Such a control plane needs to work
   with the server-to-NVE signaling in a coordinated manner, to ensure
   that address changes at a local NVE are reflected appropriately in
   remote NVEs.  The details of such coordination are specified in
   separate documents.

7.  Security Considerations

8.  IANA Considerations

9.  Acknowledgments

   Many thanks to Amit Shukla for his help with the details of EVB and
   his insight into data center issues.  Many thanks to members of the
   nvo3 WG for their comments, including Yingjie Gu.

10.  Informative References

   [I-D.ietf-l2vpn-evpn]
              Sajassi, A., Aggarwal, R., Henderickx, W., Balus, F.,
              Isaac, A., and J. Uttaro, "BGP MPLS Based Ethernet VPN",
              draft-ietf-l2vpn-evpn-03 (work in progress),
              February 2013.

   [I-D.ietf-nvo3-framework]
              Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y.
              Rekhter, "Framework for DC Network Virtualization",
              draft-ietf-nvo3-framework-02 (work in progress),
              February 2013.

   [I-D.ietf-nvo3-overlay-problem-statement]
              Narten, T., Gray, E., Black, D., Dutt, D., Fang, L.,
              Kreeger, L., Napierala, M., and M. Sridharan, "Problem
              Statement: Overlays for Network Virtualization",
              draft-ietf-nvo3-overlay-problem-statement-02 (work in
              progress), February 2013.

   [I-D.kreeger-nvo3-overlay-cp]
              Kreeger, L., Dutt, D., Narten, T., and M. Sridharan,
              "Network Virtualization Overlay Control Protocol
              Requirements", draft-kreeger-nvo3-overlay-cp-02 (work in
              progress), October 2012.

   [I-D.mahalingam-dutt-dcops-vxlan]
              Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A
              Framework for Overlaying Virtualized Layer 2 Networks
              over Layer 3 Networks",
              draft-mahalingam-dutt-dcops-vxlan-03 (work in progress),
              February 2013.

   [I-D.sridharan-virtualization-nvgre]
              Sridharan, M., Greenberg, A., Venkataramaiah, N., Wang,
              Y., Duda, K., Ganga, I., Lin, G., Pearson, M., Thaler,
              P., and C. Tumuluri, "NVGRE: Network Virtualization using
              Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-02 (work in
              progress), February 2013.

   [RFC4364]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
              Networks (VPNs)", RFC 4364, February 2006.

   [RFC4761]  Kompella, K. and Y. Rekhter, "Virtual Private LAN Service
              (VPLS) Using BGP for Auto-Discovery and Signaling",
              RFC 4761, January 2007.

   [RFC4762]  Lasserre, M. and V. Kompella, "Virtual Private LAN
              Service (VPLS) Using Label Distribution Protocol (LDP)
              Signaling", RFC 4762, January 2007.

   [VDP]      IEEE, "Edge Virtual Bridging (IEEE Std 802.1Qbg-2012)",
              July 2012.

Authors' Addresses

   Kireeti Kompella
   Juniper Networks
   1194 N. Mathilda Ave.
   Sunnyvale, CA  94089
   US

   Email: kireeti@juniper.net

   Yakov Rekhter
   Juniper Networks
   1194 N. Mathilda Ave.
   Sunnyvale, CA  94089
   US

   Email: yakov@juniper.net

   Thomas Morin
   France Telecom - Orange Labs
   2, avenue Pierre Marzin
   Lannion  22307
   France

   Email: thomas.morin@orange.com

   David L. Black
   EMC Corporation
   176 South St.
   Hopkinton, MA  01748

   Email: david.black@emc.com