2 Internet Engineering Task Force Marc Lasserre 3 Internet Draft Florin Balus 4 Intended status: Informational Alcatel-Lucent 5 Expires: January 2013 6 Thomas Morin 7 France Telecom Orange 9 Nabil Bitar 10 Verizon 12 Yakov Rekhter 13 Juniper 15 July 9, 2012 17 Framework for DC Network Virtualization 18 draft-lasserre-nvo3-framework-03.txt 20 Status of this Memo 22 This Internet-Draft is submitted in full conformance with the 23 provisions of BCP 78 and BCP 79. 25 Internet-Drafts are working documents of the Internet Engineering 26 Task Force (IETF). Note that other groups may also distribute 27 working documents as Internet-Drafts. The list of current Internet- 28 Drafts is at http://datatracker.ietf.org/drafts/current/. 30 Internet-Drafts are draft documents valid for a maximum of six 31 months and may be updated, replaced, or obsoleted by other documents 32 at any time. It is inappropriate to use Internet-Drafts as 33 reference material or to cite them other than as "work in progress." 35 This Internet-Draft will expire on January 9, 2013. 37 Copyright Notice 39 Copyright (c) 2012 IETF Trust and the persons identified as the 40 document authors. All rights reserved. 42 This document is subject to BCP 78 and the IETF Trust's Legal 43 Provisions Relating to IETF Documents 44 (http://trustee.ietf.org/license-info) in effect on the date of 45 publication of this document. Please review these documents 46 carefully, as they describe your rights and restrictions with 47 respect to this document.
Code Components extracted from this 48 document must include Simplified BSD License text as described in 49 Section 4.e of the Trust Legal Provisions and are provided without 50 warranty as described in the Simplified BSD License. 52 Abstract 54 Several IETF drafts relate to the use of overlay networks to support 55 large scale virtual data centers. This draft provides a framework 56 for Network Virtualization over L3 (NVO3) and is intended to help 57 plan a set of work items in order to provide a complete solution 58 set. It defines a logical view of the main components with the 59 intention of streamlining the terminology and focusing the solution 60 set. 62 Table of Contents 64 1. Introduction...................................................3 65 1.1. Conventions used in this document.........................4 66 1.2. General terminology.......................................4 67 1.3. DC network architecture...................................6 68 1.4. Tenant networking view....................................7 69 2. Reference Models...............................................8 70 2.1. Generic Reference Model...................................8 71 2.2. NVE Reference Model......................................10 72 2.3. NVE Service Types........................................11 73 2.3.1. L2 NVE providing Ethernet LAN-like service..........11 74 2.3.2. L3 NVE providing IP/VRF-like service................11 75 3. Functional components.........................................11 76 3.1. Generic service virtualization components................12 77 3.1.1. Virtual Access Points (VAPs)........................12 78 3.1.2. Virtual Network Instance (VNI)......................12 79 3.1.3. Overlay Modules and VN Context......................13 80 3.1.4. Tunnel Overlays and Encapsulation options...........14 81 3.1.5. Control Plane Components............................14 82 3.1.5.1. Auto-provisioning/Service discovery...............14 83 3.1.5.2. Address advertisement and tunnel mapping..........15 84 3.1.5.3. Tunnel management.................................15 85 3.2. Service Overlay Topologies...............................16 86 4. Key aspects of overlay networks...............................16 87 4.1. Pros & Cons..............................................16 88 4.2. Overlay issues to consider...............................17 89 4.2.1. Data plane vs Control plane driven..................17 90 4.2.2. Coordination between data plane and control plane...18 91 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) 92 traffic....................................................18 93 4.2.4. Path MTU............................................19 94 4.2.5. NVE location trade-offs.............................19 95 4.2.6. Interaction between network overlays and underlays..20 96 5. Security Considerations.......................................21 97 6. IANA Considerations...........................................21 98 7. References....................................................21 99 7.1. Normative References.....................................21 100 7.2. Informative References...................................21 101 8. Acknowledgments...............................................22 103 1. Introduction 105 This document provides a framework for Data Center Network 106 Virtualization over L3 tunnels. This framework is intended to aid in 107 standardizing protocols and mechanisms to support large scale 108 network virtualization for data centers. 
110 Several IETF drafts relate to the use of overlay networks for data 111 centers. 113 [NVOPS] defines the rationale for using overlay networks in order to 114 build large data center networks. The use of virtualization leads to 115 a very large number of communication domains and end systems that 116 the network must cope with. 118 [OVCPREQ] describes the requirements for a control plane protocol 119 needed by overlay border nodes to exchange overlay mappings. 121 This document provides reference models and functional components of 122 data center overlay networks as well as a discussion of technical 123 issues that have to be addressed in the design of standards and 124 mechanisms for large scale data centers. 126 1.1. Conventions used in this document 128 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 129 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 130 document are to be interpreted as described in RFC-2119 [RFC2119]. 132 In this document, these words will appear with that interpretation 133 only when in ALL CAPS. Lower case uses of these words are not to be 134 interpreted as carrying RFC-2119 significance. 136 1.2. General terminology 138 This document uses the following terminology: 140 NVE: Network Virtualization Edge. It is a network entity that sits 141 on the edge of the NVO3 network. It implements network 142 virtualization functions that allow for L2 and/or L3 tenant 143 separation and for hiding tenant addressing information (MAC and IP 144 addresses). An NVE could be implemented as part of a virtual switch 145 within a hypervisor, a physical switch or router, or a Network Service 146 Appliance, or it could even be embedded within an End Station. 148 VN: Virtual Network. This is a virtual L2 or L3 domain that belongs 149 to a tenant. 151 VNI: Virtual Network Instance. This is one instance of a virtual 152 overlay network. Two Virtual Networks are isolated from one another 153 and may use overlapping addresses. 155 Virtual Network Context or VN Context: Field that is part of the 156 overlay encapsulation header which allows the encapsulated frame to 157 be delivered to the appropriate virtual network endpoint by the 158 egress NVE. The egress NVE uses this field to determine the 159 appropriate virtual network context in which to process the packet. 160 This field MAY be an explicit, unique (to the administrative domain) 161 virtual network identifier (VNID) or MAY express the necessary 162 context information in other ways (e.g., a locally significant 163 identifier). 165 VNID: Virtual Network Identifier. In the case where the VN context 166 has global significance, this is the ID value that is carried in 167 each data packet in the overlay encapsulation that identifies the 168 Virtual Network the packet belongs to. 170 Underlay or Underlying Network: This is the network that provides 171 the connectivity between NVEs. The Underlying Network can be 172 completely unaware of the overlay packets. Addresses within the 173 Underlying Network are also referred to as "outer addresses" because 174 they exist in the outer encapsulation. The Underlying Network can 175 use a completely different protocol (and address family) from that 176 of the overlay. 178 Data Center (DC): A physical complex housing physical servers, 179 network switches and routers, Network Service Appliances and 180 networked storage. The purpose of a Data Center is to provide 181 application and/or compute and/or storage services.
One such service 182 is virtualized data center services, also known as Infrastructure as 183 a Service. 185 Virtual Data Center or Virtual DC: A container for virtualized 186 compute, storage and network services. Managed by a single tenant, a 187 Virtual DC can contain multiple VNs and multiple Tenant End Systems 188 that are connected to one or more of these VNs. 190 VM: Virtual Machine. Several Virtual Machines can share the 191 resources of a single physical computer server using the services of 192 a Hypervisor (see definition below). 194 Hypervisor: Server virtualization software running on a physical 195 compute server that hosts Virtual Machines. The hypervisor provides 196 shared compute/memory/storage and network connectivity to the VMs 197 that it hosts. Hypervisors often embed a Virtual Switch (see below). 199 Virtual Switch: A function within a Hypervisor (typically 200 implemented in software) that provides similar services to a 201 physical Ethernet switch. It switches Ethernet frames between VMs' 202 virtual NICs within the same physical server, or between a VM and a 203 physical NIC card connecting the server to a physical Ethernet 204 switch. It also enforces network isolation between VMs that should 205 not communicate with each other. 207 Tenant: A customer who consumes virtualized data center services 208 offered by a cloud service provider. A single tenant may consume one 209 or more Virtual Data Centers hosted by the same cloud service 210 provider. 212 Tenant End System: An end system of a particular tenant, 213 which can be, for instance, a virtual machine (VM), a non-virtualized 214 server, or a physical appliance. 216 ELAN: MEF ELAN, a multipoint-to-multipoint Ethernet service 217 EVPN: Ethernet VPN as defined in [EVPN] 219 1.3. DC network architecture 221 A generic architecture for Data Centers is depicted in Figure 1: 223 ,---------. 224 ,' `. 225 ( IP/MPLS WAN ) 226 `. ,' 227 `-+------+' 228 +--+--+ +-+---+ 229 |DC GW|+-+|DC GW| 230 +-+---+ +-----+ 231 | / 232 .--. .--. 233 ( ' '.--. 234 .-.' Intra-DC ' 235 ( network ) 236 ( .'-' 237 '--'._.'. )\ \ 238 / / '--' \ \ 239 / / | | \ \ 240 +---+--+ +-`.+--+ +--+----+ 241 | ToR | | ToR | | ToR | 242 +-+--`.+ +-+-`.-+ +-+--+--+ 243 .' \ .' \ .' `. 244 __/_ _i./ i./_ _\__ 245 '--------' '--------' '--------' '--------' 246 : End : : End : : End : : End : 247 : Device : : Device : : Device : : Device : 248 '--------' '--------' '--------' '--------' 250 Figure 1 : A Generic Architecture for Data Centers 252 An example of a multi-tier DC network architecture is presented in 253 this figure. It provides a view of the physical components inside a DC. 255 A cloud network is composed of intra-Data Center (DC) networks and 256 network services, and inter-DC networks and network connectivity 257 services. Depending upon the scale, DC distribution, operations 258 model, Capex and Opex aspects, DC networking elements can act as 259 strict L2 switches and/or provide IP routing capabilities, including 260 service virtualization. 262 In some DC architectures, some tier layers that 263 provide L2 and/or L3 services may be collapsed, and Internet 264 connectivity, inter-DC connectivity and VPN support may be handled by 265 a smaller number of nodes. Nevertheless, one can assume that the 266 functional blocks fit with the architecture above. 268 The following components can be present in a DC: 270 o End Device: a DC resource to which the networking service is 271 provided.
An End Device may be a compute resource (server or 272 server blade), a storage component or a network appliance 273 (firewall, load-balancer, IPsec gateway). Alternatively, the 274 End Device may include software-based networking functions used 275 to interconnect multiple hosts. An example of such soft networking 276 is the virtual switch in the server blades, used to 277 interconnect multiple virtual machines (VMs). An End Device may be 278 single- or multi-homed to the Top of Rack switches (ToRs). 280 o Top of Rack (ToR): Hardware-based Ethernet switch aggregating 281 all Ethernet links from the End Devices in a rack, representing 282 the entry point into the physical DC network for the hosts. ToRs 283 may also provide routing functionality, virtual IP network 284 connectivity, or Layer 2 tunneling over IP, for instance. ToRs 285 are usually multi-homed to switches in the Intra-DC network. 286 Other deployment scenarios may use an intermediate Blade Switch 287 before the ToR or an EoR (End of Row) switch to provide a 288 function similar to that of a ToR. 290 o Intra-DC Network: High capacity network composed of core 291 switches aggregating multiple ToRs. Core switches are usually 292 Ethernet switches but can also support routing capabilities. 294 o DC GW: Gateway to the outside world providing DC Interconnect 295 and connectivity to the Internet and to VPN customers. In the current 296 DC network model, this may simply be a router connected to the 297 Internet and/or an IPVPN/L2VPN PE. Some network implementations 298 may dedicate DC GWs for different connectivity types (e.g., a 299 DC GW for the Internet, and another for VPNs). 301 1.4. Tenant networking view 303 The DC network architecture is used to provide L2 and/or L3 service 304 connectivity to each tenant. An example is depicted in Figure 2: 306 +----- L3 Infrastructure ----+ 307 | | 308 ,--+-'. ;--+--. 309 ..... Rtr1 )...... . Rtr2 ) 310 | '-----' | '-----' 311 | Tenant1 |LAN12 Tenant1| 312 |LAN11 ....|........ |LAN13 313 '':'''''''':' | | '':'''''''':' 314 ,'. ,'. ,+. ,+. ,'. ,'. 315 (VM )....(VM ) (VM )... (VM ) (VM )....(VM ) 316 `-' `-' `-' `-' `-' `-' 318 Figure 2 : Logical Service connectivity for a single tenant 320 In this example, one or more L3 contexts and one or more LANs (e.g., 321 one per application type) running on DC switches are assigned to DC 322 tenant 1. 324 For a multi-tenant DC, a virtualized version of this type of service 325 connectivity needs to be provided for each tenant by the Network 326 Virtualization solution. 328 2. Reference Models 330 2.1. Generic Reference Model 332 The following diagram shows a DC reference model for network 333 virtualization using Layer 3 overlays where edge devices provide a 334 logical interconnect between Tenant End Systems that belong to a 335 specific tenant network. 337 +--------+ +--------+ 338 | Tenant | | Tenant | 339 | End +--+ +---| End | 340 | System | | | | System | 341 +--------+ | ................... | +--------+ 342 | +-+--+ +--+-+ | 343 | | NV | | NV | | 344 +--|Edge| |Edge|--+ 345 +-+--+ +--+-+ 346 / . L3 Overlay . \ 347 +--------+ / . Network . \ +--------+ 348 | Tenant +--+ . . +----| Tenant | 349 | End | . . | End | 350 | System | . +----+ . | System | 351 +--------+ .....| NV |........
+--------+ 352 |Edge| 353 +----+ 354 | 355 | 356 +--------+ 357 | Tenant | 358 | End | 359 | System | 360 +--------+ 362 Figure 3 : Generic reference model for DC network virtualization 363 over a Layer3 infrastructure 365 The functional components in this picture do not necessarily map 366 directly to the physical components described in Figure 1. 368 For example, an End Device can be a server blade with VMs and a 369 virtual switch, i.e., the VM is the Tenant End System and the NVE 370 functions may be performed by the virtual switch and/or the 371 hypervisor. 373 Another example is the case where an End Device can be a traditional 374 physical server (no VMs, no virtual switch), i.e., the server is the 375 Tenant End System and the NVE functions may be performed by the ToR. 376 Other End Devices in this category are Physical Network Appliances 377 or Storage Systems. 379 A Tenant End System attaches to a Network Virtualization Edge (NVE) 380 node, either directly or via a switched network (typically 381 Ethernet). 383 The NVE implements network virtualization functions that allow for 384 L2 and/or L3 tenant separation and for hiding tenant addressing 385 information (MAC and IP addresses), tenant-related control plane 386 activity and service contexts from the Routed Backbone nodes. 388 Core nodes utilize L3 techniques to interconnect NVE nodes in 389 support of the overlay network. These devices perform forwarding 390 based on the outer L3 tunnel header and generally do not maintain per 391 tenant-service state, although some applications (e.g., multicast) may 392 require control plane or forwarding plane information that pertains 393 to a tenant, a group of tenants, a tenant service or a set of services 394 that belong to one or more tunnels. When such tenant or tenant- 395 service related information is maintained in the core, overlay 396 virtualization provides knobs to control that information. 398 2.2. NVE Reference Model 400 The NVE is composed of a tenant service instance that Tenant End 401 Systems interface with and an overlay module that provides tunneling 402 overlay functions (e.g., encapsulation/decapsulation of tenant 403 traffic from/to the tenant forwarding instance, tenant 404 identification and mapping, etc.), as described in Figure 4: 406 +------- L3 Network ------+ 407 | | 408 | Tunnel Overlay | 409 +------------+---------+ +---------+------------+ 410 | +----------+-------+ | | +---------+--------+ | 411 | | Overlay Module | | | | Overlay Module | | 412 | +---------+--------+ | | +---------+--------+ | 413 | |VN context| | VN context| | 414 | | | | | | 415 | +--------+-------+ | | +--------+-------+ | 416 | | |VNI| . |VNI| | | | |VNI| . |VNI| | 417 NVE1 | +-+------------+-+ | | +-+-----------+--+ | NVE2 418 | | VAPs | | | | VAPs | | 419 +----+------------+----+ +----+------------+----+ 420 | | | | 421 -------+------------+-----------------+------------+------- 422 | | Tenant | | 423 | | Service IF | | 424 Tenant End Systems Tenant End Systems 426 Figure 4 : Generic reference model for NV Edge 428 Note that some NVE functions (e.g., data plane and control plane 429 functions) may reside in one device or may be implemented separately 430 in different devices. 432 For example, the NVE functionality could reside solely on the End 433 Devices, on the ToRs or on both the End Devices and the ToRs. In the 434 latter case we say that the End Device NVE component acts as the 435 NVE Spoke, and ToRs act as NVE hubs. Tenant End Systems will 436 interface with the tenant service instances maintained on the NVE 437 spokes, and tenant service instances maintained on the NVE spokes 438 will interface with the tenant service instances maintained on the 439 NVE hubs.
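The following Python sketch is not part of this framework; names such as OverlayPacket and Nve are purely illustrative, and a generic numeric VN Context is assumed. It only illustrates the egress-side behavior implied by Figure 4: the overlay module terminates the tunnel, the VN Context carried in the encapsulation selects the VNI, and the tenant frame is delivered through the appropriate VAP.

      # Illustrative sketch only; not a prescribed NVE implementation.
      from dataclasses import dataclass

      @dataclass
      class OverlayPacket:
          outer_src: str       # L3 address of the ingress NVE
          outer_dst: str       # L3 address of this (egress) NVE
          vn_context: int      # VN Context from the overlay header
          payload: bytes       # original tenant frame

      class Nve:
          def __init__(self):
              # vn_context -> {tenant destination address -> VAP id}
              self.vnis = {}

          def receive_from_tunnel(self, pkt, tenant_dst):
              vni = self.vnis.get(pkt.vn_context)
              if vni is None:
                  return None              # unknown VN Context: drop
              vap = vni.get(tenant_dst)
              # hand the decapsulated frame to the Tenant Service IF
              return (vap, pkt.payload)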
441 2.3. NVE Service Types 443 NVE components may be used to provide different types of virtualized 444 service connectivity. This section defines the service types and 445 their associated attributes. 447 2.3.1. L2 NVE providing Ethernet LAN-like service 449 L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet-based 450 multipoint service where the Tenant End Systems appear to be 451 interconnected by a LAN environment over a set of L3 tunnels. It 452 provides a per-tenant virtual switching instance with MAC addressing 453 isolation and L3 tunnel encapsulation across the core. 455 2.3.2. L3 NVE providing IP/VRF-like service 457 Virtualized IP routing and forwarding is similar, from a service 458 definition perspective, to IETF IP VPNs (e.g., BGP/MPLS IP VPN and 459 IPsec VPNs). It provides a per-tenant routing instance with addressing 460 isolation and L3 tunnel encapsulation across the core. 462 3. Functional components 464 This section breaks down the Network Virtualization architecture 465 into functional components to make it easier to discuss solution 466 options for different modules. 468 This version of the document gives an overview of generic functional 469 components that are shared between L2 and L3 service types. Details 470 specific to each service type will be added in future revisions. 472 3.1. Generic service virtualization components 474 A Network Virtualization solution is built around a number of 475 functional components as depicted in Figure 5: 477 +------- L3 Network ------+ 478 | | 479 | Tunnel Overlay | 480 +------------+--------+ +--------+------------+ 481 | +----------+------+ | | +------+----------+ | 482 | | Overlay Module | | | | Overlay Module | | 483 | +--------+--------+ | | +--------+--------+ | 484 | |VN Context| | |VN Context| 485 | | | | | | 486 | +-------+-------+ | | +-------+-------+ | 487 | ||VNI| ... |VNI|| | | ||VNI| ... |VNI|| | 488 NVE1 | +-+-----------+-+ | | +-+-----------+-+ | NVE2 489 | | VAPs | | | | VAPs | | 490 +----+-----------+----+ +----+-----------+----+ 491 | | | | 492 -----+-----------+-----------------+-----------+----- 493 | | Tenant | | 494 | | Service IF | | 495 Tenant End Systems Tenant End Systems 497 Figure 5 : Generic reference model for NV Edge 499 3.1.1. Virtual Access Points (VAPs) 501 Tenant End Systems are connected to the VNI through Virtual 502 Access Points (VAPs). In practice, VAPs can be physical ports on a 503 ToR or virtual ports identified through logical interface 504 identifiers (e.g., VLANs, or an internal VSwitch Interface ID leading to a VM). 506 3.1.2. Virtual Network Instance (VNI) 508 The VNI represents a set of configuration attributes defining access 509 and tunnel policies and (L2 and/or L3) forwarding functions. 511 Per-tenant FIB tables and control plane protocol instances are used 512 to maintain separate private contexts between tenants. Hence tenants 513 are free to use their own addressing schemes without concerns about 514 address overlapping with other tenants.
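As an illustration of the per-tenant isolation described above (a hypothetical sketch, not a prescribed data structure), the per-VNI FIBs below let two tenants use the same address without any conflict; addresses are taken from documentation ranges.

      # Each VNI has its own FIB; identical tenant addresses do not clash.
      # Entries map a tenant address to (egress NVE address, VN Context).
      fib = {
          "VNI-A": {"192.0.2.1": ("198.51.100.20", 0x1001)},
          "VNI-B": {"192.0.2.1": ("198.51.100.30", 0x2002)},
      }

      def lookup(vni, tenant_dst):
          # Lookup is always scoped to one VNI, so tenants never see
          # each other's entries.
          return fib[vni].get(tenant_dst)

      assert lookup("VNI-A", "192.0.2.1") != lookup("VNI-B", "192.0.2.1")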
516 3.1.3. Overlay Modules and VN Context 518 Mechanisms for identifying each tenant service are required to allow 519 the simultaneous overlay of multiple tenant services over the same 520 underlay L3 network topology. In the data plane, each NVE, upon 521 sending a tenant packet, must be able to encode the VN Context for 522 the destination NVE in addition to the L3 tunnel source address 523 identifying the source NVE and the tunnel destination L3 address 524 identifying the destination NVE. This allows the destination NVE to 525 identify the tenant service instance and therefore appropriately 526 process and forward the tenant packet. 528 The Overlay module provides tunneling overlay functions: tunnel 529 initiation/termination, encapsulation/decapsulation of frames from/to the 530 VAPs and the L3 backbone, and possibly transit forwarding of IP 531 traffic (e.g., transparent tunnel forwarding). 533 In a multi-tenant context, the tunnel aggregates frames from/to 534 different VNIs. Tenant identification and traffic demultiplexing are 535 based on the VN Context (e.g., VNID). 537 The following approaches can be considered: 539 o One VN Context per Tenant: A globally unique (on a per-DC 540 administrative domain) VNID is used to identify the related 541 Tenant instances. An example of this approach is the use of 542 IEEE VLAN or ISID tags to provide virtual L2 domains. 544 o One VN Context per VNI: A per-tenant local value is 545 automatically generated by the egress NVE and usually 546 distributed by a control plane protocol to all the related 547 NVEs. An example of this approach is the use of per VRF MPLS 548 labels in IP VPN [RFC4364]. 550 o One VN Context per VAP: A per-VAP local value is assigned and 551 usually distributed by a control plane protocol. An example of 552 this approach is the use of per CE-PE MPLS labels in IP VPN 553 [RFC4364]. 555 Note that when using one VN Context per VNI or per VAP, an 556 additional global identifier may be used by the control plane to 557 identify the Tenant context. 559 3.1.4. Tunnel Overlays and Encapsulation options 561 Once the VN context is added to the frame, an L3 tunnel encapsulation 562 is used to transport the frame to the destination NVE. The backbone 563 devices do not usually keep any per-service state, simply forwarding 564 the frames based on the outer tunnel header. 566 Different IP tunneling options (GRE/L2TP/IPsec) and MPLS tunneling 567 options (BGP VPN, PW, VPLS) are available for both Ethernet and IP 568 formats.
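A minimal sketch of the encapsulation step described in Sections 3.1.3 and 3.1.4 follows. The 8-byte header layout (a flags byte, reserved bytes and a 24-bit VNID) is only an assumption made for illustration; this framework does not mandate any particular encapsulation, and the result would still need to be carried in an outer L3 (e.g., IP/GRE or IP/UDP) tunnel header addressed to the egress NVE.

      import struct

      def add_vn_context(tenant_frame: bytes, vnid: int) -> bytes:
          # Illustrative 8-byte overlay header: flags, 3 reserved bytes,
          # 24-bit VNID, 1 reserved byte. Not a standardized format.
          flags = 0x08                      # "VNID present" (example flag)
          header = struct.pack("!B3xI", flags, vnid << 8)
          return header + tenant_frame

      frame = add_vn_context(b"\xaa" * 64, vnid=0x1234)
      assert len(frame) == 8 + 64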
570 3.1.5. Control Plane Components 572 Control plane components may be used to provide the following 573 capabilities: 575 . Auto-provisioning/Service discovery 577 . Address advertisement and tunnel mapping 579 . Tunnel management 581 A control plane component can be an on-net control protocol or a 582 management control entity. 584 3.1.5.1. Auto-provisioning/Service discovery 586 NVEs must be able to select the appropriate VNI for each Tenant End 587 System. This is based on state information that is often provided by 588 external entities. For example, in a VM environment, this 589 information is provided by compute management systems, since these 590 are the only entities that have visibility into which VM belongs to 591 which tenant. 593 A mechanism for communicating this information between Tenant End 594 Systems and the local NVE is required. As a result, the VAPs are 595 created and mapped to the appropriate Tenant Instance. 597 Depending upon the implementation, this control interface can be 598 implemented using an auto-discovery protocol between Tenant End 599 Systems and their local NVE or through management entities. 601 When a protocol is used, appropriate security and authentication 602 mechanisms are required to verify that Tenant End System information is not 603 spoofed or altered. This is one critical aspect for 604 providing integrity and tenant isolation in the system. 606 Another control plane protocol can also be used to advertise NVE 607 tenant service instances (tenant and service type provided to the 608 tenant) to other NVEs. Alternatively, management control entities 609 can also be used to perform these functions. 611 3.1.5.2. Address advertisement and tunnel mapping 613 As traffic reaches an ingress NVE, a lookup is performed to 614 determine which tunnel the packet needs to be sent on. It is then 615 encapsulated with a tunnel header containing the destination address 616 of the egress overlay node. Intermediate nodes (between the ingress 617 and egress NVEs) switch or route traffic based upon the outer 618 destination address. 620 One key step in this process consists of mapping a final destination 621 address to the proper tunnel. NVEs are responsible for maintaining 622 such mappings in their lookup tables. Several ways of populating 623 these lookup tables are possible: control plane driven, management 624 plane driven, or data plane driven. 626 When a control plane protocol is used to distribute address 627 advertisement and tunneling information, the 628 auto-provisioning/Service discovery could be accomplished by the same 629 protocol. In this scenario, the auto-provisioning/Service discovery 630 could be combined with (be inferred from) the address advertisement 631 and tunnel mapping. Furthermore, a control plane protocol that 632 carries both MAC and IP addresses eliminates the need for ARP, and 633 hence addresses one of the issues with explosive ARP handling.
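To make the mapping step concrete, the fragment below sketches a possible ingress lookup table keyed by (VNI, tenant destination), as it could be populated by a control plane, a management plane or data plane learning. The table layout and the forward() helper are hypothetical; MAC and IP values are taken from documentation ranges.

      # (VNI, tenant destination MAC) -> (egress NVE address, VN Context)
      tunnel_map = {
          ("VNI-A", "00:00:5e:00:53:01"): ("198.51.100.20", 0x1001),
          ("VNI-A", "00:00:5e:00:53:02"): ("198.51.100.30", 0x1001),
      }

      def forward(vni, tenant_dst_mac, frame):
          entry = tunnel_map.get((vni, tenant_dst_mac))
          if entry is None:
              # unknown destination: BUM handling, see Section 4.2.3
              return ("flood-or-drop", frame)
          egress_nve, vn_context = entry
          # add the VN Context and the outer tunnel header (Section 3.1.4),
          # then send towards egress_nve across the underlay
          return (egress_nve, vn_context, frame)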
635 3.1.5.3. Tunnel management 637 A control plane protocol may be required to exchange tunnel state 638 information. This may include setting up and tearing down tunnels and/or providing 639 tunnel status information. 641 This applies to both unicast and multicast tunnels. 643 For instance, it may be necessary to provide active/standby status 644 information between NVEs, up/down status information, 645 pruning/grafting information for multicast tunnels, etc. 647 3.2. Service Overlay Topologies 649 A number of service topologies may be used to optimize the service 650 connectivity and to address NVE performance limitations. 652 The topology described in Figure 3 suggests the use of a tunnel mesh 653 between the NVEs where each tenant instance is one hop away from a 654 service processing perspective. Partial mesh topologies and an NVE 655 hierarchy may be used where certain NVEs may act as service transit 656 points. 658 4. Key aspects of overlay networks 660 The intent of this section is to highlight specific issues that 661 proposed overlay solutions need to address. 663 4.1. Pros & Cons 665 An overlay network is a layer of virtual network topology on top of 666 the physical network. 668 Overlay networks offer the following key advantages: 670 o Unicast tunneling state management is handled at the edge of 671 the network. Intermediate transport nodes are unaware of such 672 state. Note that this is not the case when multicast is enabled 673 in the core network. 675 o Tunnels are used to aggregate traffic and hence offer the 676 advantage of minimizing the amount of forwarding state required 677 within the underlay network. 679 o Decoupling of the overlay addresses (MAC and IP) used by VMs 680 from the underlay network. This offers a clear separation 681 between addresses used within the overlay and the underlay 682 networks and it enables the use of overlapping address spaces 683 by Tenant End Systems. 685 o Support of a large number of virtual network identifiers. 687 Overlay networks also create several challenges: 689 o Overlay networks have no control over underlay networks and lack 690 critical network information. 691 o Overlays typically probe the network to measure link 692 properties, such as available bandwidth or packet loss 693 rate, but it is difficult to accurately evaluate such 694 properties this way. It might be preferable for the underlay 695 network to expose usage and performance information. 697 o Miscommunication between overlay and underlay networks can lead 698 to an inefficient usage of network resources. 700 o Fairness of resource sharing and collaboration among end-nodes 701 in overlay networks are two critical issues. 703 o When multiple overlays co-exist on top of a common underlay 704 network, the lack of coordination between overlays can lead to 705 performance issues. 707 o Overlaid traffic may not traverse firewalls and NAT devices. 709 o Multicast service scalability: multicast support may be 710 required in the overlay network to provide per-tenant 711 flood containment or efficient multicast handling. 713 o Hash-based load balancing may not be optimal as the hash 714 algorithm may not work well due to the limited number of 715 combinations of tunnel source and destination addresses. 717 4.2. Overlay issues to consider 719 4.2.1. Data plane vs Control plane driven 721 In the case of an L2 NVE, it is possible to dynamically learn MAC 722 addresses against VAPs. It is also possible for such addresses to be 723 known and controlled via management or a control protocol, for both 724 L2 NVEs and L3 NVEs. 726 Dynamic data plane learning implies that flooding of unknown 727 destinations be supported and hence implies that broadcast and/or 728 multicast be supported. Multicasting in the core network for dynamic 729 learning may lead to significant scalability limitations. Specific 730 forwarding rules must be enforced to prevent loops from happening. 731 This can be achieved using a spanning tree, a shortest path tree, or 732 a split-horizon mesh. 734 It should be noted that the amount of state to be distributed is 735 dependent upon network topology and the number of virtual machines. 737 Different forms of caching can also be utilized to minimize state 738 distribution between the various elements. 740 4.2.2. Coordination between data plane and control plane 742 For an L2 NVE, the NVE needs to be able to determine MAC addresses 743 of the end systems present on a VAP (for instance, data plane 744 learning may be relied upon for this purpose). For an L3 NVE, the 745 NVE needs to be able to determine IP addresses of the end systems 746 present on a VAP. 748 In both cases, coordination with the NVE control protocol is needed 749 such that when the NVE determines that the set of addresses behind a 750 VAP has changed, it triggers the local NVE control plane to 751 distribute this information to its peers.
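The coordination described above can be summarized by the following sketch: whenever the set of addresses learned behind a VAP changes, the NVE invokes its control plane to advertise or withdraw the corresponding mapping. The VapTracker class and the advertise callback are hypothetical placeholders for whatever control protocol or management interface is actually used.

      class VapTracker:
          def __init__(self, advertise):
              self.addresses = {}          # VAP id -> set of tenant addresses
              self.advertise = advertise   # callback into the NVE control plane

          def learn(self, vap, address):
              known = self.addresses.setdefault(vap, set())
              if address not in known:     # address set behind the VAP changed
                  known.add(address)
                  self.advertise(vap, address, withdraw=False)

          def age_out(self, vap, address):
              if address in self.addresses.get(vap, set()):
                  self.addresses[vap].discard(address)
                  self.advertise(vap, address, withdraw=True)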
753 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic 755 There are two techniques to support packet replication needed for 756 broadcast, unknown unicast and multicast: 758 o Ingress replication 760 o Use of core multicast trees 762 There is a bandwidth vs. state trade-off between the two approaches. 763 Depending upon the degree of replication required (i.e., the number 764 of hosts per group) and the amount of multicast state to maintain, 765 one approach or the other may be preferable. 767 When the number of hosts per group is large, the use of core 768 multicast trees may be more appropriate. When the number of hosts is 769 small (e.g., 2-3), ingress replication may not be an issue. 771 Depending upon the size of the data center network and hence the 772 number of (S,G) entries, but also the duration of multicast flows, 773 the use of core multicast trees can be a challenge. 775 When flows are well known, it is possible to pre-provision such 776 multicast trees. However, it is often difficult to predict 777 application flows ahead of time, and hence programming of (S,G) 778 entries for short-lived flows could be impractical. 780 A possible trade-off is to use shared multicast trees in the core, as 781 opposed to dedicated multicast trees. 783 4.2.4. Path MTU 785 When using overlay tunneling, an outer header is added to the 786 original frame. This can cause the MTU of the path to the egress 787 tunnel endpoint to be exceeded. 789 In this section, we will only consider the case of an IP overlay. 791 It is usually not desirable to rely on IP fragmentation for 792 performance reasons. Ideally, the interface MTU as seen by a Tenant 793 End System is adjusted such that no fragmentation is needed. TCP 794 will adjust its maximum segment size accordingly. 796 It is possible for the MTU to be configured manually or to be 797 discovered dynamically. Various Path MTU discovery techniques exist 798 in order to determine the proper MTU size to use: 800 o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981] 802 o Tenant End Systems rely on ICMP messages to discover the 803 MTU of the end-to-end path to their destination. This method 804 is not always possible, such as when traversing middle 805 boxes (e.g., firewalls) that disable ICMP for security 806 reasons 808 o Extended MTU Path Discovery techniques such as defined in 809 [RFC4821] 811 It is also possible to rely on the overlay layer to perform 812 segmentation and reassembly operations without relying on the Tenant 813 End Systems to know about the end-to-end MTU. The assumption is that 814 some hardware assist is available on the NVE node to perform such 815 SAR operations. However, fragmentation by the overlay layer can lead 816 to performance and congestion issues due to TCP dynamics and might 817 require new congestion avoidance mechanisms from the underlay 818 network [FLOYD]. 820 Finally, the underlay network may be designed in such a way that the 821 MTU can accommodate the extra tunnel overhead.
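The arithmetic involved is straightforward, as the hypothetical figures below illustrate; the overhead values are examples only, since this framework does not mandate a particular encapsulation.

      # Room must be left for the outer and overlay headers so that the
      # tenant-visible MTU never triggers fragmentation in the underlay.
      UNDERLAY_MTU   = 1500   # discovered or configured underlay path MTU
      OUTER_IPV4     = 20     # outer IPv4 header
      OUTER_UDP      = 8      # e.g., a UDP-based tunnel
      OVERLAY_HEADER = 8      # generic header carrying the VN Context

      tenant_mtu = UNDERLAY_MTU - (OUTER_IPV4 + OUTER_UDP + OVERLAY_HEADER)
      assert tenant_mtu == 1464  # MTU to advertise to the Tenant End System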
823 4.2.5. NVE location trade-offs 825 In the case of DC traffic, traffic originating from a VM is native 826 Ethernet traffic. This traffic can be switched by a local VM switch 827 or ToR switch and then by a DC gateway. The NVE function can be 828 embedded within any of these elements. 830 There are several criteria to consider when deciding where the NVE 831 processing boundary is located: 833 o Processing and memory requirements 835 o Datapath (e.g., lookups, filtering, 836 encapsulation/decapsulation) 838 o Control plane processing (e.g., routing, signaling, OAM) 840 o FIB/RIB size 842 o Multicast support 844 o Routing protocols 846 o Packet replication capability 848 o Fragmentation support 850 o QoS transparency 852 o Resiliency 854 4.2.6. Interaction between network overlays and underlays 856 When multiple overlays co-exist on top of a common underlay network, 857 performance issues can arise since these overlays typically have 858 partially overlapping paths and nodes. 860 Each overlay is selfish by nature in that it sends traffic so as to 861 optimize its own performance without considering the impact on other 862 overlays, unless the underlay tunnels are traffic engineered on a 863 per-overlay basis so as to avoid sharing underlay resources. 865 Better visibility between overlays and underlays can be achieved by 866 providing mechanisms to exchange information about: 868 o Performance metrics (throughput, delay, loss, jitter) 870 o Cost metrics 872 5. Security Considerations 874 The tenant-to-overlay mapping function can introduce significant 875 security risks if protocols that can support mutual authentication 876 are not used. 878 No other new security issues are introduced beyond those described 879 already in the related L2VPN and L3VPN RFCs. 881 6. IANA Considerations 883 IANA does not need to take any action for this draft. 885 7. References 887 7.1. Normative References 889 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 890 Requirement Levels", BCP 14, RFC 2119, March 1997. 892 7.2. Informative References 894 [NVOPS] Narten, T. et al., "Problem Statement: Overlays for Network 895 Virtualization", draft-narten-nvo3-overlay-problem-statement 896 (work in progress) 898 [OVCPREQ] Kreeger, L. et al., "Network Virtualization Overlay Control 899 Protocol Requirements", draft-kreeger-nvo3-overlay-cp 900 (work in progress) 902 [FLOYD] Floyd, S. and Romanow, A., "Dynamics of TCP Traffic over 903 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995 905 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 906 Networks (VPNs)", RFC 4364, February 2006. 908 [RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990 910 [RFC1981] McCann, J. et al., "Path MTU Discovery for IPv6", RFC 1981, 911 August 1996 913 [RFC4821] Mathis, M. et al., "Packetization Layer Path MTU 914 Discovery", RFC 4821, March 2007 916 8. Acknowledgments 918 In addition to the authors the following people have contributed to 919 this document: 921 Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent 923 This document was prepared using 2-Word-v2.0.template.dot. 925 Authors' Addresses 927 Marc Lasserre 928 Alcatel-Lucent 929 Email: marc.lasserre@alcatel-lucent.com 931 Florin Balus 932 Alcatel-Lucent 933 777 E. Middlefield Road 934 Mountain View, CA, USA 94043 935 Email: florin.balus@alcatel-lucent.com 937 Thomas Morin 938 France Telecom Orange 939 Email: thomas.morin@orange.com 941 Nabil Bitar 942 Verizon 943 40 Sylvan Road 944 Waltham, MA 02145 945 Email: nabil.bitar@verizon.com 947 Yakov Rekhter 948 Juniper 949 Email: yakov@juniper.net