Internet Engineering Task Force                           Marc Lasserre
Internet Draft                                              Florin Balus
Intended status: Informational                            Alcatel-Lucent
Expires: August 2013
                                                            Thomas Morin
                                                    France Telecom Orange

                                                             Nabil Bitar
                                                                  Verizon

                                                            Yakov Rekhter
                                                                  Juniper

                                                         February 4, 2013

                 Framework for DC Network Virtualization
                    draft-ietf-nvo3-framework-02.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 4, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   Several IETF drafts relate to the use of overlay networks to support
   large scale virtual data centers. This draft provides a framework
   for Network Virtualization over L3 (NVO3) and is intended to help
   plan a set of work items in order to provide a complete solution
   set. It defines a logical view of the main components with the
   intention of streamlining the terminology and focusing the solution
   set.

Table of Contents

   1. Introduction
      1.1. Conventions used in this document
      1.2. General terminology
      1.3. DC network architecture
      1.4. Tenant networking view
   2. Reference Models
      2.1. Generic Reference Model
      2.2. NVE Reference Model
      2.3. NVE Service Types
         2.3.1. L2 NVE providing Ethernet LAN-like service
         2.3.2. L3 NVE providing IP/VRF-like service
   3. Functional components
      3.1. Service Virtualization Components
         3.1.1. Virtual Access Points (VAPs)
         3.1.2. Virtual Network Instance (VNI)
         3.1.3. Overlay Modules and VN Context
         3.1.4. Tunnel Overlays and Encapsulation options
         3.1.5. Control Plane Components
            3.1.5.1. Distributed vs Centralized Control Plane
            3.1.5.2. Auto-provisioning/Service discovery
            3.1.5.3. Address advertisement and tunnel mapping
            3.1.5.4. Overlay Tunneling
      3.2. Multi-homing
      3.3. VM Mobility
      3.4. Service Overlay Topologies
   4. Key aspects of overlay networks
      4.1. Pros & Cons
      4.2. Overlay issues to consider
         4.2.1. Data plane vs Control plane driven
         4.2.2. Coordination between data plane and control plane
         4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM)
                traffic
         4.2.4. Path MTU
         4.2.5. NVE location trade-offs
         4.2.6. Interaction between network overlays and underlays
   5. Security Considerations
   6. IANA Considerations
   7. References
      7.1. Normative References
      7.2. Informative References
   8. Acknowledgments

1. Introduction

   This document provides a framework for Data Center Network
   Virtualization over L3 tunnels.
   This framework is intended to aid in standardizing protocols and
   mechanisms to support large scale network virtualization for data
   centers.

   Several IETF drafts relate to the use of overlay networks for data
   centers. [NVOPS] defines the rationale for using overlay networks in
   order to build large multi-tenant data center networks. Compute,
   storage and network virtualization are often used in these large
   data centers to support a large number of communication domains and
   end systems. [OVCPREQ] describes the requirements for a control
   plane protocol required by overlay border nodes to exchange overlay
   mappings.

   This document provides reference models and functional components of
   data center overlay networks as well as a discussion of technical
   issues that have to be addressed in the design of standards and
   mechanisms for large-scale data centers.

1.1. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS. Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.

1.2. General terminology

   This document uses the following terminology:

   NVE: Network Virtualization Edge. It is a network entity that sits
   on the edge of the NVO3 network. It implements network
   virtualization functions that allow for L2 and/or L3 tenant
   separation and for hiding tenant addressing information (MAC and IP
   addresses). An NVE could be implemented as part of a virtual switch
   within a hypervisor, a physical switch or router, or a network
   service appliance.

   VN: Virtual Network. This is a virtual L2 or L3 domain that belongs
   to a tenant.

   VNI: Virtual Network Instance. This is one instance of a virtual
   overlay network. It refers to the state maintained for a given VN on
   a given NVE. Two Virtual Networks are isolated from one another and
   may use overlapping addresses.

   Virtual Network Context or VN Context: Field that is part of the
   overlay encapsulation header which allows the encapsulated frame to
   be delivered to the appropriate virtual network endpoint by the
   egress NVE. The egress NVE uses this field to determine the
   appropriate virtual network context in which to process the packet.
   This field MAY be an explicit, unique (to the administrative domain)
   virtual network identifier (VNID) or MAY express the necessary
   context information in other ways (e.g., a locally significant
   identifier).

   VNID: Virtual Network Identifier. In the case where the VN context
   identifier has global significance, this is the ID value that is
   carried in each data packet in the overlay encapsulation that
   identifies the Virtual Network the packet belongs to.

   Underlay or Underlying Network: This is the network that provides
   the connectivity between NVEs. The Underlying Network can be
   completely unaware of the overlay packets. Addresses within the
   Underlying Network are also referred to as "outer addresses" because
   they exist in the outer encapsulation. The Underlying Network can
   use a completely different protocol (and address family) from that
   of the overlay.

   Data Center (DC): A physical complex housing physical servers,
   network switches and routers, network service appliances and
   networked storage. The purpose of a Data Center is to provide
   application, compute and/or storage services. One such service is
   virtualized infrastructure data center services, also known as
   Infrastructure as a Service.

   Virtual Data Center or Virtual DC: A container for virtualized
   compute, storage and network services. Managed by a single tenant, a
   Virtual DC can contain multiple VNs and multiple Tenant Systems that
   are connected to one or more of these VNs.

   VM: Virtual Machine. Several Virtual Machines can share the
   resources of a single physical computer server using the services of
   a Hypervisor (see definition below).

   Hypervisor: Server virtualization software running on a physical
   compute server that hosts Virtual Machines. The hypervisor provides
   shared compute/memory/storage and network connectivity to the VMs
   that it hosts. Hypervisors often embed a Virtual Switch (see below).

   Virtual Switch: A function within a Hypervisor (typically
   implemented in software) that provides similar services to a
   physical Ethernet switch. It switches Ethernet frames between VMs'
   virtual NICs within the same physical server, or between a VM and a
   physical NIC card connecting the server to a physical Ethernet
   switch or router. It also enforces network isolation between VMs
   that should not communicate with each other.

   Tenant: In a DC, a tenant refers to a customer that could be an
   organization within an enterprise, or an enterprise with a set of DC
   compute, storage and network resources associated with it.

   Tenant System: A physical or virtual system that can play the role
   of a host, or a forwarding element such as a router, switch,
   firewall, etc. It belongs to a single tenant and connects to one or
   more VNs of that tenant.

   End device: A physical system to which networking service is
   provided. Examples include hosts (e.g., server or server blade),
   storage systems (e.g., file servers, iSCSI storage systems), and
   network devices (e.g., firewall, load-balancer, IPSec gateway). An
   end device may include internal networking functionality that
   interconnects the device's components (e.g., virtual switches that
   interconnect VMs running on the same server). NVE functionality may
   be implemented as part of that internal networking.

   ELAN: MEF ELAN, multipoint-to-multipoint Ethernet service.

   EVPN: Ethernet VPN as defined in [EVPN].

1.3. DC network architecture

   A generic architecture for Data Centers is depicted in Figure 1:

                               ,---------.
                             ,'           `.
                            (  IP/MPLS WAN )
                             `.           ,'
                               `-+------+'
                            +--+--+   +-+---+
                            |DC GW|+-+|DC GW|
                            +-+---+   +-----+
                                |      /
                               .--. .--.
                             (    '    '.--.
                          .-.' Intra-DC     '
                         (     network      )
                          (             .'-'
                           '--'._.'.    )\ \
                          / /     '--'  \ \
                         / /      | |    \ \
                  +---+--+   +-`.+--+  +--+----+
                  | ToR  |   | ToR  |  |  ToR  |
                  +-+--`.+   +-+-`.-+  +-+--+--+
                   /     \    /    \    /     \
                __/_      \  /      \  /_      _\__
               '--------'  '--------'  '--------'  '--------'
               :  End   :  :  End   :  :  End   :  :  End   :
               : Device :  : Device :  : Device :  : Device :
               '--------'  '--------'  '--------'  '--------'

            Figure 1 : A Generic Architecture for Data Centers

   An example of a multi-tier DC network architecture is presented in
   this figure. It provides a view of the physical components inside a
   DC.

   A cloud network is composed of intra-Data Center (DC) networks and
   network services, and inter-DC network and network connectivity
   services. Depending upon the scale, DC distribution, operations
   model, Capex and Opex aspects, DC networking elements can act as
   strict L2 switches and/or provide IP routing capabilities, including
   service virtualization.

   In some DC architectures, some tier layers providing L2 and/or L3
   services may be collapsed, and Internet connectivity, inter-DC
   connectivity and VPN support may be handled by a smaller number of
   nodes. Nevertheless, one can assume that the functional blocks fit
   in the architecture above.

   The following components can be present in a DC:

   o  Top of Rack (ToR): Hardware-based Ethernet switch aggregating
      all Ethernet links from the End Devices in a rack, representing
      the entry point into the physical DC network for the hosts. ToRs
      may also provide routing functionality, virtual IP network
      connectivity, or Layer 2 tunneling over IP, for instance. ToRs
      are usually multi-homed to switches in the Intra-DC network.
      Other deployment scenarios may use an intermediate Blade Switch
      before the ToR or an EoR (End of Row) switch to provide a
      similar function as a ToR.

   o  Intra-DC Network: High capacity network composed of core
      switches aggregating multiple ToRs. Core switches are usually
      Ethernet switches but can also support routing capabilities.

   o  DC GW: Gateway to the outside world providing DC Interconnect
      and connectivity to Internet and VPN customers. In the current
      DC network model, this may be simply a Router connected to the
      Internet and/or an IPVPN/L2VPN PE. Some network implementations
      may dedicate DC GWs for different connectivity types (e.g., a
      DC GW for Internet, and another for VPN).

   Note that End Devices may be single or multi-homed to ToRs.

1.4. Tenant networking view

   The DC network architecture is used to provide L2 and/or L3 service
   connectivity to each tenant. An example is depicted in Figure 2:

                  +----- L3 Infrastructure ----+
                  |                            |
               ,--+--.                      ,--+--.
         .....( Rtr1  )......              ( Rtr2  )
         |     `-----'      |               `-----'
         | Tenant1          |LAN12    Tenant1|
         |LAN11        .....|........        |LAN13
   ..............      |            |  ..............
   |            |      |            |  |            |
  ,-.          ,-.    ,-.          ,-. ,-.          ,-.
 (VM )....(VM )      (VM )...    (VM ) (VM )....(VM )
  `-'          `-'    `-'          `-'  `-'          `-'

      Figure 2 : Logical Service connectivity for a single tenant

   In this example, one or more L3 contexts and one or more LANs (e.g.,
   one per application type) are assigned for DC tenant1.

   For a multi-tenant DC, a virtualized version of this type of service
   connectivity needs to be provided for each tenant by the Network
   Virtualization solution.

2. Reference Models

2.1. Generic Reference Model

   The following diagram shows a DC reference model for network
   virtualization using L3 (IP/MPLS) overlays where NVEs provide a
   logical interconnect between Tenant Systems that belong to a
   specific tenant network.

        +--------+                                    +--------+
        | Tenant +--+                            +----| Tenant |
        | System |  |                           (')   | System |
        +--------+  |     ...................  ( )    +--------+
                    |  +-+--+           +--+-+  (_)
                    |  | NV |           | NV |   |
                    +--|Edge|           |Edge|---+
                       +-+--+           +--+-+
                         /  .               .
                        /   .  L3 Overlay   .  +--+-++--------+
        +--------+     /    .    Network    .  | NV || Tenant |
        | Tenant +----+     .               .  |Edge|| System |
        | System |          .    +----+     .  +--+-++--------+
        +--------+          .....| NV |......
                                 |Edge|
                                 +----+
                                    |
                                    |
                         =====================
                           |               |
                       +--------+      +--------+
                       | Tenant |      | Tenant |
                       | System |      | System |
                       +--------+      +--------+

     Figure 3 : Generic reference model for DC network virtualization
                       over a Layer3 infrastructure

   A Tenant System can be attached to a Network Virtualization Edge
   (NVE) node in several ways:

   - locally, by being co-located in the same device

   - remotely, via a point-to-point connection or a switched network
     (e.g., Ethernet)

   When an NVE is local, the state of Tenant Systems can be provided
   without protocol assistance. For instance, the operational status of
   a VM can be communicated via a local API. When an NVE is remote, the
   state of Tenant Systems needs to be exchanged via a data or control
   plane protocol, or via a management entity.

   The functional components in Figure 3 do not necessarily map
   directly to the physical components described in Figure 1.

   For example, an End Device can be a server blade with VMs and a
   virtual switch, i.e. the VM is the Tenant System and the NVE
   functions may be performed by the virtual switch and/or the
   hypervisor. In this case, the Tenant System and NVE function are co-
   located.

   Another example is the case where an End Device can be a traditional
   physical server (no VMs, no virtual switch), i.e. the server is the
   Tenant System and the NVE function may be performed by the ToR.
   Other End Devices in this category are physical network appliances
   or storage systems.

   The NVE implements network virtualization functions that allow for
   L2 and/or L3 tenant separation and for hiding tenant addressing
   information (MAC and IP addresses), tenant-related control plane
   activity and service contexts from the underlay nodes.

   Underlay nodes utilize L3 techniques to interconnect NVE nodes in
   support of the overlay network. These devices perform forwarding
   based on the outer L3 tunnel header, and generally do not maintain
   per-tenant service state, although some applications (e.g.,
   multicast) may require control plane or forwarding plane information
   that pertains to a tenant, group of tenants, tenant service or a set
   of services that belong to one or more tenants. When such tenant or
   tenant-service related information is maintained in the underlay,
   overlay virtualization provides knobs to control the magnitude of
   that information.

2.2. NVE Reference Model

   One or more VNIs can be instantiated on an NVE. Tenant Systems
   interface with a corresponding VNI via a Virtual Access Point (VAP).
   An overlay module provides tunneling overlay functions (e.g.,
   encapsulation and decapsulation of tenant traffic from/to the tenant
   forwarding instance, tenant identification and mapping, etc.), as
   described in Figure 4:

                     +------- L3 Network ------+
                     |                         |
                     |      Tunnel Overlay     |
        +------------+---------+     +---------+------------+
        | +----------+-------+ |     | +---------+--------+ |
        | |  Overlay Module  | |     | |  Overlay Module  | |
        | +---------+--------+ |     | +---------+--------+ |
        |           |VN context|     | VN context|          |
        |           |          |     |           |          |
        |  +--------+-------+  |     |  +--------+-------+  |
        |  | |VNI|  .  |VNI| |  |     |  | |VNI|  .  |VNI| |  |
   NVE1 |  +-+------------+-+  |     |  +-+-----------+--+  | NVE2
        |    |    VAPs    |    |     |    |   VAPs    |     |
        +----+------------+----+     +----+-----------+-----+
             |            |               |           |
      -------+------------+---------------+-----------+-------
             |            |    Tenant     |           |
             |            |  Service IF   |           |
            Tenant Systems               Tenant Systems

           Figure 4 : Generic reference model for NV Edge
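
   The decomposition shown in Figure 4 can be summarized by a minimal,
   non-normative sketch of the state an NVE could hold: a set of VNIs,
   each identified in the overlay by a VN Context and attached to its
   Tenant Systems through VAPs. The sketch below is written in Python,
   and all class, field and function names are invented for the
   illustration; they do not prescribe any implementation.

      # Non-normative illustration of the NVE/VNI/VAP decomposition
      # of Figure 4. All names are invented for the example.

      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class VAP:
          """Virtual Access Point: attachment of a Tenant System."""
          vap_id: str          # e.g., a port name or (port, VLAN ID)
          tenant_system: str   # attached Tenant System identifier

      @dataclass
      class VNI:
          """Per-NVE instance of a Virtual Network (L2 or L3)."""
          vn_context: int    # VN Context used in the overlay header
          service_type: str  # "L2" (ELAN-like) or "L3" (VRF-like)
          vaps: List[VAP] = field(default_factory=list)

      @dataclass
      class NVE:
          """NVE holding one VNI per attached Virtual Network."""
          underlay_address: str   # outer (underlay) IP address
          vnis: Dict[int, VNI] = field(default_factory=dict)

          def add_vap(self, vn_context, vap, service_type="L2"):
              vni = self.vnis.setdefault(vn_context,
                                         VNI(vn_context, service_type))
              vni.vaps.append(vap)

      nve1 = NVE(underlay_address="192.0.2.1")
      nve1.add_vap(5001, VAP("vswitch-port-7", "tenant1-vm1"))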

   Note that some NVE functions (e.g., data plane and control plane
   functions) may reside in one device or may be implemented separately
   in different devices. For example, the NVE functionality could
   reside solely on the End Devices, or be distributed between the End
   Devices and the ToRs. In the latter case, we say that the End Device
   NVE component acts as the NVE spoke, and the ToRs act as NVE hubs.
   Tenant Systems will interface with VNIs maintained on the NVE
   spokes, and VNIs maintained on the NVE spokes will interface with
   VNIs maintained on the NVE hubs.

2.3. NVE Service Types

   NVE components may be used to provide different types of virtualized
   network services. This section defines the service types and
   associated attributes. Note that an NVE may be capable of providing
   both L2 and L3 services.

2.3.1. L2 NVE providing Ethernet LAN-like service

   An L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet
   based multipoint service where the Tenant Systems appear to be
   interconnected by a LAN environment over a set of L3 tunnels. It
   provides a per-tenant virtual switching instance with MAC addressing
   isolation and L3 (IP/MPLS) tunnel encapsulation across the underlay.

2.3.2. L3 NVE providing IP/VRF-like service

   Virtualized IP routing and forwarding is similar, from a service
   definition perspective, to IETF IP VPNs (e.g., BGP/MPLS IP VPN
   [RFC4364] and IPsec VPNs). It provides a per-tenant routing instance
   with addressing isolation and L3 (IP/MPLS) tunnel encapsulation
   across the underlay.

3. Functional components

   This section decomposes the Network Virtualization architecture into
   the functional components described in Figure 4 to make it easier to
   discuss solution options for these components.

3.1. Service Virtualization Components

3.1.1. Virtual Access Points (VAPs)

   Tenant Systems are connected to the VNI through Virtual Access
   Points (VAPs).

   The VAPs can be physical ports or virtual ports identified through
   logical interface identifiers (e.g., VLAN ID, internal vSwitch
   Interface ID connected to a VM).

3.1.2. Virtual Network Instance (VNI)

   The VNI represents a set of configuration attributes defining access
   and tunnel policies and (L2 and/or L3) forwarding functions.

   Per-tenant FIB tables and control plane protocol instances are used
   to maintain separate private contexts between tenants. Hence tenants
   are free to use their own addressing schemes without concerns about
   address overlapping with other tenants.

3.1.3. Overlay Modules and VN Context

   Mechanisms for identifying each tenant service are required to allow
   the simultaneous overlay of multiple tenant services over the same
   underlay L3 network topology. In the data plane, each NVE, upon
   sending a tenant packet, must be able to encode the VN Context for
   the destination NVE in addition to the L3 tunnel information (e.g.,
   the source IP address identifying the source NVE and the destination
   IP address identifying the destination NVE, or an MPLS label). This
   allows the destination NVE to identify the tenant service instance
   and therefore appropriately process and forward the tenant packet.
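
   As a non-normative illustration of this data plane behavior, the
   following Python sketch shows an ingress NVE adding a VN Context and
   outer tunnel information before sending a tenant frame towards the
   egress NVE, and the egress NVE using the VN Context to select the
   proper VNI. The "header" is a plain Python dictionary with invented
   field names; it does not correspond to any particular encapsulation
   format.

      # Non-normative sketch of VN Context handling at the ingress
      # and egress NVEs. Field names are invented; no wire format
      # is implied.

      def encapsulate(src_nve_ip, dst_nve_ip, vn_context, frame):
          """Ingress NVE: add VN Context and outer tunnel info."""
          return {
              "outer_src": src_nve_ip,   # identifies the source NVE
              "outer_dst": dst_nve_ip,   # identifies the egress NVE
              "vn_context": vn_context,  # e.g., a VNID or local label
              "payload": frame,          # tenant frame, untouched
          }

      def decapsulate(packet, vni_table):
          """Egress NVE: map the VN Context back to the local VNI."""
          vni = vni_table.get(packet["vn_context"])
          if vni is None:
              return None                # unknown context: drop
          return vni, packet["payload"]  # deliver on the proper VAP(s)

      pkt = encapsulate("192.0.2.1", "192.0.2.2", 5001, b"tenant frame")
      print(decapsulate(pkt, {5001: "VNI of tenant1"}))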

   The Overlay module provides tunneling overlay functions: tunnel
   initiation/termination, encapsulation/decapsulation of frames
   from/to the VAPs and the L3 backbone, and may provide for transit
   forwarding of IP traffic (e.g., transparent tunnel forwarding).

   In a multi-tenant context, the tunnel aggregates frames from/to
   different VNIs. Tenant identification and traffic demultiplexing are
   based on the VN Context identifier (e.g., VNID).

   The following approaches can be considered:

   o  One VN Context per Tenant: A globally unique (within the DC
      administrative domain) VNID is used to identify the related
      Tenant instances. An example of this approach is the use of
      IEEE VLAN or ISID tags to provide virtual L2 domains.

   o  One VN Context per VNI: A per-tenant local value is
      automatically generated by the egress NVE and usually
      distributed by a control plane protocol to all the related
      NVEs. An example of this approach is the use of per-VRF MPLS
      labels in IP VPN [RFC4364].

   o  One VN Context per VAP: A per-VAP local value is assigned and
      usually distributed by a control plane protocol. An example of
      this approach is the use of per CE-PE MPLS labels in IP VPN
      [RFC4364].

   Note that when using one VN Context per VNI or per VAP, an
   additional global identifier may be used by the control plane to
   identify the Tenant context.

3.1.4. Tunnel Overlays and Encapsulation options

   Once the VN Context identifier is added to the frame, an L3 tunnel
   encapsulation is used to transport the frame to the destination NVE.
   The backbone devices do not usually keep any per-service state,
   simply forwarding the frames based on the outer tunnel header.

   Different IP tunneling options (e.g., GRE, L2TP, IPSec) and MPLS
   tunneling options (e.g., BGP VPN, VPLS) can be used.

3.1.5. Control Plane Components

   Control plane components may be used to provide the following
   capabilities:

   o  Auto-provisioning/Service discovery

   o  Address advertisement and tunnel mapping

   o  Tunnel management

   A control plane component can be an on-net control protocol
   implemented on the NVE or a management control entity.

3.1.5.1. Distributed vs Centralized Control Plane

   A control/management plane entity can be centralized or distributed.
   Both approaches have been used extensively in the past. The routing
   model of the Internet is a good example of a distributed approach.
   Transport networks have usually used a centralized approach to
   manage transport paths.

   It is also possible to combine the two approaches, i.e., to use a
   hybrid model. A global view of network state can have many benefits,
   but it does not preclude the use of distributed protocols within the
   network. Centralized controllers provide a facility to maintain
   global state and distribute that state to the network, which, in
   combination with distributed protocols, can aid in achieving greater
   network efficiency and improve reliability and robustness. Domain
   and/or deployment specific constraints define the balance between
   centralized and distributed approaches.

   On one hand, a control plane module can reside in every NVE. This is
   how routing control plane modules are implemented in routers. On the
   other hand, an external controller can manage a group of NVEs via an
   agent in each NVE. This is how an SDN controller could communicate
   with the nodes it controls, via OpenFlow [OF] for instance.
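
   As a non-normative illustration of the centralized option, the
   sketch below shows a controller pushing (VN Context, tenant address,
   egress NVE) mappings to an agent embedded in each NVE. The message
   content and API names are invented for the example; an actual
   solution could rely on OpenFlow [OF], on another protocol, or on a
   management interface.

      # Non-normative sketch of a centralized control plane
      # programming NVE agents with overlay mappings.

      class NveAgent:
          """Agent in an NVE, programmed by an external controller."""
          def __init__(self, name):
              self.name = name
              self.mappings = {}  # (vn_context, addr) -> egress NVE

          def install_mapping(self, vn_context, addr, egress_nve):
              self.mappings[(vn_context, addr)] = egress_nve

      class Controller:
          """Logically centralized controller managing NVEs."""
          def __init__(self):
              self.agents = []

          def register(self, agent):
              self.agents.append(agent)

          def advertise(self, vn_context, addr, egress_nve):
              # Push the mapping to every NVE participating in the VN.
              for agent in self.agents:
                  agent.install_mapping(vn_context, addr, egress_nve)

      ctrl = Controller()
      ctrl.register(NveAgent("nve1"))
      ctrl.register(NveAgent("nve2"))
      ctrl.advertise(5001, "00:11:22:33:44:55", "192.0.2.2")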

   In the case where a logically centralized control plane is
   preferred, the controller will need to be distributed to more than
   one node for redundancy and scalability in order to manage a large
   number of NVEs. Hence, inter-controller communication is necessary
   to synchronize state among controllers. It should be noted that
   controllers may be organized in clusters. The information exchanged
   between controllers of the same cluster could be different from the
   information exchanged across clusters.

3.1.5.2. Auto-provisioning/Service discovery

   NVEs must be able to identify the appropriate VNI for each Tenant
   System. This is based on state information that is often provided by
   external entities. For example, in an environment where a VM is a
   Tenant System, this information is provided by compute management
   systems, since these are the only entities that have visibility of
   which VM belongs to which tenant.

   A mechanism for communicating this information between Tenant
   Systems and the corresponding NVE is required. As a result, the VAPs
   are created and mapped to the appropriate VNI. Depending upon the
   implementation, this control interface can be implemented using an
   auto-discovery protocol between Tenant Systems and their local NVE
   or through management entities. In either case, appropriate security
   and authentication mechanisms are required to verify that Tenant
   System information is not spoofed or altered. This is one critical
   aspect for providing integrity and tenant isolation in the system.

   A control plane protocol can also be used to advertise supported VNs
   to other NVEs. Alternatively, management control entities can also
   be used to perform these functions.

3.1.5.3. Address advertisement and tunnel mapping

   As traffic reaches an ingress NVE, a lookup is performed to
   determine which tunnel the packet needs to be sent to. It is then
   encapsulated with a tunnel header containing the destination
   information (destination IP address or MPLS label) of the egress
   overlay node. Intermediate nodes (between the ingress and egress
   NVEs) switch or route traffic based upon the outer destination
   information.

   One key step in this process consists of mapping the final
   destination information to the proper tunnel. NVEs are responsible
   for maintaining such mappings in their forwarding tables. Several
   ways of populating these tables are possible: control plane driven,
   management plane driven, or data plane driven.

   When a control plane protocol is used to distribute address
   advertisement and tunneling information, the auto-
   provisioning/Service discovery could be accomplished by the same
   protocol. In this scenario, the auto-provisioning/Service discovery
   could be combined with (be inferred from) the address advertisement
   and associated tunnel mapping. Furthermore, a control plane protocol
   that carries both MAC and IP addresses eliminates the need for ARP,
   and hence addresses one of the issues with explosive ARP handling.
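
   As a non-normative example of this last point, an NVE that learns
   both the MAC and IP addresses of Tenant Systems from the control
   plane can answer ARP requests locally instead of flooding them
   across the overlay. The Python sketch below illustrates the
   principle; names and data structures are invented for the example.

      # Non-normative sketch of local ARP handling at an NVE when
      # the control plane distributes both IP and MAC addresses.

      class ArpProxy:
          def __init__(self):
              # Populated from control plane advertisements:
              # (VN Context, tenant IP) -> tenant MAC.
              self.ip_to_mac = {}

          def learn(self, vn_context, ip, mac):
              self.ip_to_mac[(vn_context, ip)] = mac

          def handle_arp_request(self, vn_context, target_ip):
              mac = self.ip_to_mac.get((vn_context, target_ip))
              if mac is not None:
                  return ("reply", mac)  # answered locally, no flood
              return ("flood", None)     # fall back to BUM handling

      proxy = ArpProxy()
      proxy.learn(5001, "198.51.100.10", "00:11:22:33:44:55")
      print(proxy.handle_arp_request(5001, "198.51.100.10"))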

3.1.5.4. Overlay Tunneling

   For overlay tunneling, and dependent upon the tunneling technology
   used for encapsulating the Tenant System packets, it may be
   sufficient to have one or more local NVE addresses assigned and used
   in the source and destination fields of the tunnel encapsulation
   header. Other information that is part of the tunnel encapsulation
   header may also need to be configured. In certain cases, local NVE
   configuration may be sufficient, while in other cases, some
   tunneling related information may need to be shared among NVEs. The
   information that needs to be shared will be technology dependent.
   This includes the discovery and announcement of the tunneling
   technology used. In certain cases, such as when using IP multicast
   in the underlay, tunnels may need to be established, interconnecting
   NVEs. When tunneling information needs to be exchanged or shared
   among NVEs, a control plane protocol may be required. For instance,
   it may be necessary to provide active/standby status information
   between NVEs, up/down status information, pruning/grafting
   information for multicast tunnels, etc.

   In addition, a control plane may be required to set up the tunnel
   path for some tunneling technologies. This applies to both unicast
   and multicast tunneling.

3.2. Multi-homing

   Multi-homing techniques can be used to increase the reliability of
   an NVO3 network. It is also important to ensure that physical
   diversity in an NVO3 network is taken into account to avoid single
   points of failure.

   Multi-homing can be enabled in various nodes, from Tenant Systems
   into ToRs, ToRs into core switches/routers, and core nodes into DC
   GWs.

   The NVO3 underlay nodes (i.e., from NVEs to DC GWs) rely on IP
   routing and/or ECMP techniques as the means to re-route traffic upon
   failures, or on MPLS re-routing capabilities.

   When a Tenant System is co-located with the NVE on the same end
   system, the Tenant System is single-homed to the NVE via a virtual
   port (vport), i.e. a virtual NIC (vNIC). When the end system and the
   NVEs are separated, the end system is connected to the NVE via a
   logical Layer 2 (L2) construct such as a VLAN. In this latter case,
   an end device or vSwitch on that device could be multi-homed to
   various NVEs. An NVE may provide an L2 service or an L3 service to
   the end system. An NVE may be multi-homed to a next layer in the DC
   at Layer 2 (L2) or Layer 3 (L3). When an NVE provides an L2 service
   and is not co-located with the end system, techniques such as
   Ethernet Link Aggregation Group (LAG) or Spanning Tree Protocol
   (STP) can be used to switch traffic between an end system and the
   connected NVEs without creating loops. When the NVE provides an L3
   service, similar dual-homing techniques can be used. When the NVE
   provides an L3 service to the end system, it is possible that no
   dynamic routing protocol is enabled between the end system and the
   NVE. The end system can be multi-homed to multiple physically-
   separated L3 NVEs over multiple interfaces. When one of the links
   connected to an NVE fails, the other interfaces can be used to
   reach the end system.

   External connectivity out of an NVO3 domain can be handled by two or
   more NVO3 gateways. Each gateway is connected to a different domain
   (e.g., ISP), providing access to external networks such as VPNs or
   the Internet. A gateway may be connected to two nodes.
   When a connection to an upstream node is lost, the alternative
   connection is used and the failed route is withdrawn.

3.3. VM Mobility

   In DC environments utilizing VM technologies, an important feature
   is that VMs can move from one server to another server in the same
   or different L2 physical domains (within or across DCs) in a
   seamless manner.

   A VM can be moved from one server to another in stopped or suspended
   state ("cold" VM mobility) or in running/active state ("hot" VM
   mobility). With "hot" mobility, VM L2 and L3 addresses need to be
   preserved. With "cold" mobility, it may be desired to preserve VM L3
   addresses.

   Solutions to maintain connectivity while a VM is moved are necessary
   in the case of "hot" mobility. This implies that transport
   connections among VMs are preserved and that ARP caches are updated
   accordingly.

   Upon VM mobility, NVE policies that define connectivity among VMs
   must be maintained.

   Optimal routing during VM mobility is also an important aspect to
   address. It is expected that the VM's default gateway be as close as
   possible to the server hosting the VM and that triangular routing be
   avoided.

3.4. Service Overlay Topologies

   A number of service topologies may be used to optimize the service
   connectivity and to address NVE performance limitations.

   The topology described in Figure 3 suggests the use of a tunnel mesh
   between the NVEs where each tenant instance is one hop away from a
   service processing perspective. Partial mesh topologies and an NVE
   hierarchy may be used where certain NVEs may act as service transit
   points.

4. Key aspects of overlay networks

   The intent of this section is to highlight specific issues that
   proposed overlay solutions need to address.

4.1. Pros & Cons

   An overlay network is a layer of virtual network topology on top of
   the physical network.

   Overlay networks offer the following key advantages:

   o  Unicast tunneling state management and association with Tenant
      Systems reachability are handled at the edge of the network.
      Intermediate transport nodes are unaware of such state. Note
      that this is not the case when multicast is enabled in the core
      network.

   o  Tunneling is used to aggregate traffic and hide tenant
      addresses from the underlay network, and hence offers the
      advantage of minimizing the amount of forwarding state required
      within the underlay network.

   o  Decoupling of the overlay addresses (MAC and IP) used by VMs
      from the underlay network. This offers a clear separation
      between addresses used within the overlay and the underlay
      networks and it enables the use of overlapping address spaces
      by Tenant Systems.

   o  Support of a large number of virtual network identifiers.

   Overlay networks also create several challenges:

   o  Overlay networks have no control over underlay networks and
      lack critical network information.

      o  Overlays typically probe the network to measure link or
         path properties, such as available bandwidth or packet
         loss rate. It is difficult to accurately evaluate network
         properties. It might be preferable for the underlay
         network to expose usage and performance information.

   o  Miscommunication or lack of coordination between overlay and
      underlay networks can lead to an inefficient usage of network
      resources.

   o  When multiple overlays co-exist on top of a common underlay
      network, the lack of coordination between overlays can lead to
      performance issues.

   o  Overlaid traffic may not traverse firewalls and NAT devices.

   o  Multicast service scalability. Multicast support may be
      required in the underlay network to address per-tenant flood
      containment or efficient multicast handling. The underlay may
      also be required to maintain multicast state on a per-tenant
      basis, or even per individual multicast flow of a given tenant.

   o  Hash-based load balancing may not be optimal as the hash
      algorithm may not work well due to the limited number of
      combinations of tunnel source and destination addresses. Other
      NVO3 mechanisms may use additional entropy information beyond
      the source and destination addresses, as illustrated in the
      sketch after this list.
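
   As a non-normative illustration of the last challenge above, an
   ingress NVE can derive additional entropy from the inner tenant flow
   and carry it in a field of the outer header, for example an outer
   UDP source port, so that underlay ECMP hashing can spread tenant
   flows over multiple paths even though the outer IP addresses only
   identify the two NVEs. The field choice, port range and hash used
   below are only an example.

      # Non-normative sketch: deriving outer-header entropy from the
      # inner tenant flow for underlay ECMP load balancing.

      import hashlib

      def outer_source_port(inner_five_tuple, low=49152, high=65535):
          """Map an inner flow to a value in a configurable range.

          inner_five_tuple: (src_ip, dst_ip, proto, src_port,
          dst_port) of the tenant packet. Using a UDP source port is
          only one possible way to expose this entropy.
          """
          key = repr(inner_five_tuple).encode()
          digest = hashlib.sha256(key).digest()
          span = high - low + 1
          return low + int.from_bytes(digest[:2], "big") % span

      flow = ("198.51.100.10", "198.51.100.20", 6, 1234, 80)
      print(outer_source_port(flow))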

4.2. Overlay issues to consider

4.2.1. Data plane vs Control plane driven

   In the case of an L2 NVE, it is possible to dynamically learn MAC
   addresses against VAPs. It is also possible that such addresses be
   known and controlled via management or a control protocol for both
   L2 NVEs and L3 NVEs.

   Dynamic data plane learning implies that flooding of unknown
   destinations be supported and hence implies that broadcast and/or
   multicast be supported or that ingress replication be used as
   described in section 4.2.3. Multicasting in the underlay network for
   dynamic learning may lead to significant scalability limitations.
   Specific forwarding rules must be enforced to prevent loops from
   happening. This can be achieved using a spanning tree, a shortest
   path tree, or a split-horizon mesh.

   It should be noted that the amount of state to be distributed is
   dependent upon network topology and the number of virtual machines.
   Different forms of caching can also be utilized to minimize state
   distribution between the various elements. The control plane should
   not require an NVE to maintain the locations of all the Tenant
   Systems whose VNs are not present on the NVE. The use of a control
   plane does not imply that the data plane on NVEs has to maintain all
   the forwarding state in the control plane.

4.2.2. Coordination between data plane and control plane

   For an L2 NVE, the NVE needs to be able to determine the MAC
   addresses of the end systems connected via a VAP. This can be
   achieved via data plane learning or a control plane. For an L3 NVE,
   the NVE needs to be able to determine the IP addresses of the end
   systems connected via a VAP.

   In both cases, coordination with the NVE control protocol is needed
   such that when the NVE determines that the set of addresses behind a
   VAP has changed, it triggers the local NVE control plane to
   distribute this information to its peers.

4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

   There are two techniques to support the packet replication needed
   for broadcast, unknown unicast and multicast:

   o  Ingress replication

   o  Use of underlay multicast trees

   There is a bandwidth vs state trade-off between the two approaches.
   Depending upon the degree of replication required (i.e. the number
   of hosts per group) and the amount of multicast state to maintain,
   trading bandwidth for state should be considered.

   When the number of hosts per group is large, the use of underlay
   multicast trees may be more appropriate. When the number of hosts is
   small (e.g., 2-3), ingress replication may not be an issue.
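
   The following non-normative sketch illustrates the ingress
   replication technique: for a given VN, the ingress NVE keeps the
   list of remote NVEs participating in that VN and unicasts one
   encapsulated copy of each BUM frame to each of them. Function and
   variable names are invented for the example.

      # Non-normative sketch of ingress replication of Broadcast,
      # Unknown unicast and Multicast (BUM) traffic.

      def replicate_bum(vn_context, frame, remote_nves, send):
          """remote_nves: underlay addresses of the VN's other NVEs.
          send: function sending one encapsulated copy (invented).
          """
          for egress_ip in remote_nves:
              send(egress_ip, vn_context, frame)

      sent = []
      replicate_bum(5001, b"broadcast frame",
                    ["192.0.2.2", "192.0.2.3"],
                    lambda ip, ctx, f: sent.append((ip, ctx)))
      print(sent)   # one copy per remote NVE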

   Depending upon the size of the data center network, and hence the
   number of (S,G) entries, and also upon the duration of multicast
   flows, the use of underlay multicast trees can be a challenge.

   When flows are well known, it is possible to pre-provision such
   multicast trees. However, it is often difficult to predict
   application flows ahead of time, and hence programming of (S,G)
   entries for short-lived flows could be impractical.

   A possible trade-off is to use shared multicast trees in the
   underlay as opposed to dedicated multicast trees.

4.2.4. Path MTU

   When using overlay tunneling, an outer header is added to the
   original frame. This can cause the MTU of the path to the egress
   tunnel endpoint to be exceeded.

   In this section, we will only consider the case of an IP overlay.

   It is usually not desirable to rely on IP fragmentation for
   performance reasons. Ideally, the interface MTU as seen by a Tenant
   System is adjusted such that no fragmentation is needed. TCP will
   adjust its maximum segment size accordingly.

   It is possible for the MTU to be configured manually or to be
   discovered dynamically. Various Path MTU discovery techniques exist
   in order to determine the proper MTU size to use:

   o  Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981]

      Tenant Systems rely on ICMP messages to discover the MTU of
      the end-to-end path to their destination. This method is not
      always possible, such as when traversing middle boxes
      (e.g., firewalls) which disable ICMP for security reasons.

   o  Extended MTU Path Discovery techniques such as defined in
      [RFC4821]

   It is also possible to rely on the overlay layer to perform
   segmentation and reassembly operations without relying on the Tenant
   Systems to know about the end-to-end MTU. The assumption is that
   some hardware assist is available on the NVE node to perform such
   SAR operations. However, fragmentation by the overlay layer can lead
   to performance and congestion issues due to TCP dynamics and might
   require new congestion avoidance mechanisms from the underlay
   network [FLOYD].

   Finally, the underlay network may be designed in such a way that the
   MTU can accommodate the extra tunneling and possibly additional NVO3
   header encapsulation overhead.

4.2.5. NVE location trade-offs

   In the case of DC traffic, traffic originating from a VM is native
   Ethernet traffic. This traffic can be switched by a local virtual
   switch or ToR switch and then by a DC gateway. The NVE function can
   be embedded within any of these elements.

   There are several criteria to consider when deciding where the NVE
   function should happen:

   o  Processing and memory requirements

      o  Datapath (e.g., lookups, filtering,
         encapsulation/decapsulation)

      o  Control plane processing (e.g., routing, signaling, OAM) and
         where specific control plane functions should be enabled

   o  FIB/RIB size

   o  Multicast support

      o  Routing/signaling protocols

      o  Packet replication capability

      o  Multicast FIB

   o  Fragmentation support

   o  QoS support (e.g., marking, policing, queuing)

   o  Resiliency

4.2.6. Interaction between network overlays and underlays

   When multiple overlays co-exist on top of a common underlay network,
   resources (e.g., bandwidth) should be provisioned to ensure that
   traffic from overlays can be accommodated and QoS objectives can be
   met. Overlays can have partially overlapping paths (nodes and
   links).

   Each overlay is selfish by nature. It sends traffic so as to
   optimize its own performance without considering the impact on other
   overlays, unless the underlay paths are traffic engineered on a per-
   overlay basis to avoid congestion of underlay resources.

   Better visibility between overlays and underlays, or generally
   coordination in placing overlay demand on an underlay network, can
   be achieved by providing mechanisms to exchange performance and
   liveness information between the underlay and overlay(s) or by the
   use of such information by a coordination system. Such information
   may include:

   o  Performance metrics (throughput, delay, loss, jitter)

   o  Cost metrics

5. Security Considerations

   NVO3 solutions must at least consider and address the following:

   o  Secure and authenticated communication between an NVE and an
      NVE management system.

   o  Isolation between tenant overlay networks. The use of per-
      tenant FIB tables (VNIs) on an NVE is essential.

   o  Security of any protocol used to carry overlay network
      information.

   o  Preventing packets from reaching the wrong VNI, especially
      during VM moves.

6. IANA Considerations

   IANA does not need to take any action for this draft.

7. References

7.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

7.2. Informative References

   [NVOPS]   Narten, T. et al, "Problem Statement: Overlays for Network
             Virtualization", draft-narten-nvo3-overlay-problem-
             statement (work in progress).

   [OVCPREQ] Kreeger, L. et al, "Network Virtualization Overlay Control
             Protocol Requirements", draft-kreeger-nvo3-overlay-cp
             (work in progress).

   [FLOYD]   Floyd, S. and A. Romanow, "Dynamics of TCP Traffic over
             ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995.

   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
             Networks (VPNs)", RFC 4364, February 2006.

   [RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

   [RFC1981] McCann, J. et al, "Path MTU Discovery for IPv6", RFC 1981,
             August 1996.

   [RFC4821] Mathis, M. et al, "Packetization Layer Path MTU
             Discovery", RFC 4821, March 2007.

8. Acknowledgments

   In addition to the authors, the following people have contributed to
   this document:

   Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent

   Lucy Yong, Huawei

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Marc Lasserre
   Alcatel-Lucent
   Email: marc.lasserre@alcatel-lucent.com

   Florin Balus
   Alcatel-Lucent
   777 E. Middlefield Road
   Mountain View, CA, USA 94043
   Email: florin.balus@alcatel-lucent.com

   Thomas Morin
   France Telecom Orange
   Email: thomas.morin@orange.com

   Nabil Bitar
   Verizon
   40 Sylvan Road
   Waltham, MA 02145
   Email: nabil.bitar@verizon.com

   Yakov Rekhter
   Juniper
   Email: yakov@juniper.net