Internet Engineering Task Force                           Marc Lasserre
Internet Draft                                              Florin Balus
Intended status: Informational                            Alcatel-Lucent
Expires: September 2012
                                                            Thomas Morin
                                                    France Telecom Orange

                                                              Nabil Bitar
                                                                  Verizon

                                                            Yakov Rekhter
                                                                  Juniper

                                                           Yuichi Ikejiri
                                                       NTT Communications

                                                           March 12, 2012

                 Framework for DC Network Virtualization
                   draft-lasserre-nvo3-framework-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on September 12, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   Several IETF drafts relate to the use of overlay networks to support
   large scale virtual data centers.  This draft provides a framework
   for Network Virtualization over L3 (NVO3) and is intended to help
   plan a set of work items in order to provide a complete solution
   set.  It defines a logical view of the main components with the
   intention of streamlining the terminology and focusing the solution
   set.
Table of Contents

   1. Introduction...................................................3
      1.1. Conventions used in this document.........................4
      1.2. General terminology.......................................4
      1.3. DC network architecture...................................5
      1.4. Tenant networking view....................................6
   2. Reference Models...............................................7
      2.1. Generic Reference Model...................................7
      2.2. NVE Reference Model.......................................9
      2.3. NVE Service Types........................................10
         2.3.1. L2 NVE providing Ethernet LAN-like service..........11
         2.3.2. L3 NVE providing IP/VRF-like service................11
   3. Functional components.........................................11
      3.1. Generic service virtualization components................11
         3.1.1. Virtual Attachment Points (VAPs)....................12
         3.1.2. Tenant Instance.....................................12
         3.1.3. Overlay Modules and Tenant ID.......................13
         3.1.4. Tunnel Overlays and Encapsulation options...........14
         3.1.5. Control Plane Components............................14
            3.1.5.1. Auto-provisioning/Service discovery............14
            3.1.5.2. Address advertisement and tunnel mapping.......15
            3.1.5.3. Tunnel management..............................15
      3.2. Service Overlay Topologies...............................16
   4. Key aspects of overlay networks...............................16
      4.1. Pros & Cons..............................................16
      4.2. Overlay issues to consider...............................17
         4.2.1. Data plane vs Control plane driven..................17
         4.2.2. Coordination between data plane and control plane...18
         4.2.3. Handling Broadcast, Unknown Unicast and Multicast
                (BUM) traffic.......................................18
         4.2.4. Path MTU............................................19
         4.2.5. NVE location trade-offs.............................19
         4.2.6. Interaction between network overlays and underlays..20
   5. Security Considerations.......................................21
   6. IANA Considerations...........................................21
   7. References....................................................21
      7.1. Normative References.....................................21
      7.2. Informative References...................................21
   8. Acknowledgments...............................................22

1. Introduction

   This document provides a framework for Data Center Network
   Virtualization over L3 tunnels.  This framework is intended to aid
   in standardizing protocols and mechanisms to support large scale
   network virtualization for data centers.

   Several IETF drafts relate to the use of overlay networks for data
   centers.

   [NVOPS] defines the rationale for using overlay networks in order
   to build large data center networks.  The use of virtualization
   leads to a very large number of communication domains and end
   systems to cope with.  Existing virtual network models used for
   data center networks have known limitations, specifically in the
   context of multiple tenants.
   These issues can be summarized as:

   o Limited VLAN space

   o FIB explosion due to handling of a large number of MAC/IP
     addresses

   o Spanning Tree limitations

   o Excessive ARP handling

   o Broadcast storms

   o Inefficient Broadcast/Multicast handling

   o Limited mobility/portability support

   o Lack of service auto-discovery

   Overlay techniques have been used in the past to address some of
   these issues.

   [OVCPREQ] describes the requirements for a control plane protocol
   required by overlay border nodes to exchange overlay mappings.

   This document provides reference models that describe functional
   components of data center overlay networks.  It also describes
   technical issues that have to be addressed in the design of
   protocols and mechanisms for large-scale data center networks.

1.1. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119
   [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to
   be interpreted as carrying RFC-2119 significance.

1.2. General terminology

   Some general terminology is defined here.  Terminology specific to
   this memo is introduced as needed in later sections.

   DC: Data Center

   ELAN: MEF ELAN, multipoint-to-multipoint Ethernet service

1.3. DC network architecture

                                ,---------.
                              ,'           `.
                             (  IP/MPLS WAN  )
                              `.           ,'
                                `-+------+'
                             +--+--+   +-+---+
                             |DC GW|+-+|DC GW|
                             +-+---+   +-----+
                                  |      /
                                 .--. .--.
                               (    '    '.--.
                            .-.'  Intra-DC    '
                           (      network      )
                            (              .'-'
                             '--'._.'.    )\ \
                            / /      '--'  \ \
                           / /       | |    \ \
                    +---+--+    +-`.+--+   +--+----+
                    |  ToR  |   |  ToR  |  |  ToR  |
                    +-+--`.-+   +-+-`.-+   +-+--+--+
                     .'    \     .'   \     .'    `.
                 __/_     _i./        i./_        _\__
                '--------'   '--------'   '--------'   '--------'
                :  End   :   :  End   :   :  End   :   :  End   :
                : Device :   : Device :   : Device :   : Device :
                '--------'   '--------'   '--------'   '--------'

              Figure 1 : A Generic Architecture for Data Centers

   Figure 1 depicts a common and generic multi-tier DC network
   architecture.  It provides a view of the physical components inside
   a DC.

   A cloud network is composed of intra-Data Center (DC) networks and
   network services, and inter-DC network and network connectivity
   services.  Depending upon the scale, DC distribution, operations
   model, and Capex and Opex aspects, DC networking elements can act
   as strict L2 switches and/or provide IP routing capabilities,
   including service virtualization.

   In some DC architectures, it is possible that some tier layers are
   collapsed and/or provide L2 and/or L3 services, and that Internet
   connectivity, inter-DC connectivity and VPN support are handled by
   a smaller number of nodes.  Nevertheless, one can assume that the
   functional blocks fit with the architecture depicted in Figure 1.

   The following components can be present in a DC:

   o End Device: a DC resource to which the networking service is
     provided.  An End Device may be a compute resource (server or
     server blade), a storage component or a network appliance
     (firewall, load-balancer, IPsec gateway).  Alternatively, the End
     Device may include software-based networking functions used to
     interconnect multiple hosts.  An example of such soft networking
     is the virtual switch in a server blade, used to interconnect
     multiple virtual machines (VMs).  An End Device may be single- or
     multi-homed to the Top of Rack switches (ToRs).
   o Top of Rack (ToR): Hardware-based Ethernet switch aggregating all
     Ethernet links from the End Devices in a rack, representing the
     entry point in the physical DC network for the hosts.  ToRs may
     also provide routing functionality, virtual IP network
     connectivity, or Layer 2 tunneling over IP, for instance.  ToRs
     are usually multi-homed to switches/routers in the Intra-DC
     network.  Other deployment scenarios may use an intermediate
     Blade Switch before the ToR, or an EoR (End of Row) switch, to
     provide a similar function to a ToR.

   o Intra-DC Network: High capacity network composed of core
     switches/routers aggregating multiple ToRs.  Core network
     elements are usually Ethernet switches but can also support
     routing capabilities.

   o DC GW: Gateway to the outside world providing DC Interconnect
     and connectivity to Internet and VPN customers.  In the current
     DC network model, this may simply be a router connected to the
     Internet and/or an IP VPN/L2VPN PE.  Some network implementations
     may dedicate DC GWs for different connectivity types (e.g., a DC
     GW for Internet, and another for VPN).

   Throughout this document, we also use the term "Tenant End System"
   to refer to an end system of a particular tenant, which can be for
   instance a virtual machine (VM), a non-virtualized server, or a
   physical appliance.  One or more Tenant End Systems can be part of
   an End Device.

1.4. Tenant networking view

   The DC network architecture is used to provide L2 and/or L3 service
   connectivity to each tenant.  An example is depicted in Figure 2:

                   +----- L3 Infrastructure ----+
                   |                            |
                ,--+-'.                      ;--+--.
          ..... Rtr1  )......               . Rtr2  )
          |     '-----'     |                '-----'
          | Tenant1         |LAN12        Tenant1|
          |LAN11        ....|........            |LAN13
        '':'''''''':'       |       |          '':'''''''':'
        ,'.        ,'.     ,+.     ,+.         ,'.        ,'.
       (VM )....(VM )     (VM )...(VM )       (VM )....(VM )
        `-'        `-'     `-'     `-'         `-'        `-'

        Figure 2 : Logical Service connectivity for a single tenant

   In this example, one or more L3 contexts and one or more LANs
   (e.g., one per application) running on DC switches are assigned to
   DC tenant 1.

   For a multi-tenant DC, a virtualized version of this type of
   service connectivity needs to be provided for each tenant by the
   Network Virtualization solution.
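   To make this tenant view concrete, the following short Python
   sketch (illustrative only; the names TenantTopology, RoutingContext
   and Lan are assumptions of this example and are not terms defined
   by this framework) models one tenant's logical connectivity as a
   set of L3 contexts interconnecting per-application LANs, as in
   Figure 2:

      # Illustrative sketch only: a tenant's logical view as routing
      # contexts (Rtr1, Rtr2) interconnecting per-application LANs.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class Lan:                       # e.g. "LAN11", one per application
          name: str
          end_systems: List[str] = field(default_factory=list)  # VM names

      @dataclass
      class RoutingContext:            # e.g. "Rtr1", a per-tenant L3 context
          name: str
          lans: List[Lan] = field(default_factory=list)

      @dataclass
      class TenantTopology:
          tenant: str
          contexts: Dict[str, RoutingContext] = field(default_factory=dict)

      # Tenant 1 as in Figure 2: two L3 contexts; LAN12 attaches to both.
      t1 = TenantTopology("tenant1")
      t1.contexts["Rtr1"] = RoutingContext("Rtr1",
          [Lan("LAN11", ["VM1", "VM2"]), Lan("LAN12", ["VM3"])])
      t1.contexts["Rtr2"] = RoutingContext("Rtr2",
          [Lan("LAN12", ["VM4"]), Lan("LAN13", ["VM5", "VM6"])])

   The Network Virtualization solution would then maintain one such
   logical topology per tenant.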
2. Reference Models

2.1. Generic Reference Model

   The following diagram shows a DC reference model for network
   virtualization using Layer 3 overlays, where edge devices provide a
   logical interconnect between Tenant End Systems that belong to a
   specific tenant network.

         +--------+                                    +--------+
         | Tenant |                                    | Tenant |
         | End    +--+                          +------| End    |
         | System |  |                          |      | System |
         +--------+  |    ...................   |      +--------+
                     |    +-+--+       +--+-+   |
                     |    | NV |       | NV |   |
                     +----|Edge|       |Edge|---+
                          +-+--+       +--+-+
                          / .   L3 Overlay .  \
         +--------+      /  .    Network   .   \      +--------+
         | Tenant +-----+   .              .    +-----| Tenant |
         | End    |         .              .          | End    |
         | System |         .   +----+     .          | System |
         +--------+          ...| NV |......           +--------+
                                |Edge|
                                +----+
                                   |
                                   |
                               +--------+
                               | Tenant |
                               | End    |
                               | System |
                               +--------+

         Figure 3 : Generic reference model for DC network
              virtualization over a Layer 3 infrastructure

   The functional components in Figure 3 do not necessarily map
   directly to the physical components described in Figure 1.

   For example, an End Device in Figure 1 can be a server blade with
   VMs and a virtual switch, i.e., the VM is the Tenant End System and
   the NVE functions may be performed by the virtual switch and/or the
   hypervisor.

   Another example is the case where an End Device in Figure 1 is a
   traditional physical server (no VMs, no virtual switch), i.e., the
   server is the Tenant End System and the NVE functions may be
   performed by the ToR.  Other End Devices in this category are
   Physical Network Appliances or Storage Systems.

   A Tenant End System attaches to a Network Virtualization Edge (NVE)
   node, either directly or via a switched network (typically
   Ethernet).

   The NVE implements network virtualization functions that allow for
   L2 and/or L3 tenant separation and for hiding tenant addressing
   information (MAC and IP addresses), tenant-related control plane
   activity and service contexts from the Routed Core nodes.

   Core nodes utilize L3 techniques to interconnect NVE nodes in
   support of the overlay network.  Specifically, they perform
   forwarding based on the outer L3 tunnel header, and generally do
   not maintain per-tenant-service state, although some applications
   (e.g., multicast) may require control plane or forwarding plane
   information that pertains to a tenant, a group of tenants, a tenant
   service or a set of services that belong to one or more tenants.
   When such tenant or tenant-service related information is
   maintained in the core, overlay virtualization provides knobs to
   control the magnitude of that information.

2.2. NVE Reference Model

   Figure 4 depicts the NVE reference model.  An NVE contains one or
   more tenant service instances, whereby a Tenant End System
   interfaces with its associated tenant service instance.  The NVE
   also contains an overlay module that provides tunneling overlay
   functions (e.g., encapsulation/decapsulation of tenant traffic
   from/to the tenant forwarding instance, tenant identification and
   mapping, etc.), as described in Figure 4.

                      +-------- L3 Network -------+
                      |                           |
                      |                           |
         +------------+--------+       +----------+----------+
         | +----------+------+ |       | +--------+--------+ |
         | | Overlay Module  | |       | | Overlay Module  | |
         | +--------+--------+ |       | +--------+--------+ |
         |          |          |       |          |          |
         |  NVE1    |          |       |          |    NVE2  |
         |  +-------+-------+  |       |  +-------+-------+  |
         |  |Tenant Instance|  |       |  |Tenant Instance|  |
         |  +-+-----------+-+  |       |  +-+-----------+-+  |
         |    |           |    |       |    |           |    |
         +----+-----------+----+       +----+-----------+----+
              |           |                 |           |
       -------+-----------+-----------------+-----------+-------
              |           |     Tenant      |           |
              |           |   Service IF    |           |
             Tenant End Systems            Tenant End Systems

            Figure 4 : Generic reference model for NV Edge

   Note that some NVE functions may reside in one device or may be
   implemented separately in different devices: for example, the data
   plane may reside in one device while the control plane components
   may be distributed across multiple devices.

   The NVE functionality could reside solely on the End Devices, on
   the ToRs, or on both the End Devices and the ToRs.  In the latter
   case we say that the End Device NVE component acts as the NVE
   spoke, and the ToRs act as NVE hubs.  Tenant End Systems will
   interface with the tenant service instances maintained on the NVE
   spokes, and tenant service instances maintained on the NVE spokes
   will interface with the tenant service instances maintained on the
   NVE hubs.
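   As a rough illustration of this reference model (not a normative
   data model; the class and attribute names below are assumptions of
   this sketch), an NVE can be thought of as an overlay module plus a
   set of tenant service instances exposing tenant service interfaces
   toward the attached Tenant End Systems:

      # Illustrative sketch of the NVE reference model in Figure 4.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class TenantInstance:
          tenant_id: str
          service_type: str          # "L2" (ELAN-like) or "L3" (VRF-like)
          vaps: List[str] = field(default_factory=list)  # service IFs

      @dataclass
      class OverlayModule:
          local_tunnel_address: str  # outer L3 address of this NVE

      @dataclass
      class NVE:
          name: str
          overlay: OverlayModule
          tenants: Dict[str, TenantInstance] = field(default_factory=dict)

      nve1 = NVE("NVE1", OverlayModule("192.0.2.1"))
      nve1.tenants["tenant1"] = TenantInstance("tenant1", "L2",
                                               vaps=["vap-vm1"])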
2.3. NVE Service Types

   NVE components may be used to provide different types of
   virtualized service connectivity.  This section defines the service
   types and associated attributes.

2.3.1. L2 NVE providing Ethernet LAN-like service

   An L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet
   based multipoint service where the Tenant End Systems appear to be
   interconnected by a LAN environment over a set of L3 tunnels.  It
   provides a per-tenant virtual switching instance with an associated
   MAC FIB, MAC address isolation across tenants, and L3 tunnel
   encapsulation across the core.

2.3.2. L3 NVE providing IP/VRF-like service

   Virtualized IP routing and forwarding is similar, from a service
   definition perspective, to IETF IP VPNs (e.g., BGP/MPLS IP VPN and
   IPsec VPNs).  It provides a per-tenant routing instance with an
   associated IP FIB, IP address isolation across tenants, and L3
   tunnel encapsulation across the core.

3. Functional components

   This section breaks down the Network Virtualization architecture
   into functional components to make it easier to discuss solution
   options for different modules.

   This version of the document gives an overview of generic
   functional components that are shared between L2 and L3 service
   types.  Details specific to each service type will be added in
   future revisions.

3.1. Generic service virtualization components

   A Network Virtualization solution is built around a number of
   functional components, as depicted in Figure 5:

                      +-------- L3 Network -------+
                      |                           |
                      |      Tunnel Overlay       |
         +------------+--------+       +----------+----------+
         | +----------+------+ |       | +--------+--------+ |
         | | Overlay Module  | |       | | Overlay Module  | |
         | +--------+--------+ |       | +--------+--------+ |
         |          |Tenant ID |       |          |Tenant ID |
         |          | (TNI)    |       |          | (TNI)    |
         |  +-------+-------+  |       |  +-------+-------+  |
         |  |Tenant Instance|  |       |  |Tenant Instance|  |
    NVE1 |  +-+-----------+-+  |       |  +-+-----------+-+  | NVE2
         |    |   VAPs    |    |       |    |   VAPs    |    |
         +----+-----------+----+       +----+-----------+----+
              |           |                 |           |
       -------+-----------+-----------------+-----------+-------
              |           |     Tenant      |           |
              |           |   Service IF    |           |
             Tenant End Systems            Tenant End Systems

         Figure 5 : Generic functional components of the NV Edge

3.1.1. Virtual Attachment Points (VAPs)

   Tenant End Systems are connected to the Tenant Instance through
   Virtual Attachment Points (VAPs).  In practice, a VAP can be a
   physical port on a ToR or a virtual port identified through a
   logical interface identifier (e.g., a VLAN, or an internal vSwitch
   interface ID leading to a VM).

3.1.2. Tenant Instance

   The Tenant Instance represents a set of configuration attributes
   defining access and tunnel policies and (L2 and/or L3) forwarding
   functions, and possibly control plane functions.

   Per-tenant FIB tables and control plane protocol instances are used
   to maintain separate private contexts across tenants.  Hence,
   tenants are free to use their own addressing schemes without
   concerns about address overlap with other tenants.
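   The following sketch (illustrative only; keying a VAP on a
   (physical port, VLAN) pair and the helper names are assumptions of
   this example) shows how a VAP identifier can select a Tenant
   Instance, and how per-tenant FIBs keep overlapping tenant addresses
   private to each tenant:

      # Illustrative only.  A VAP is identified here by (port, VLAN);
      # it could equally be a vSwitch interface ID leading to a VM.
      vap_to_tenant = {
          ("port1", 100): "tenant1",
          ("port1", 200): "tenant2",
      }

      # Per-tenant FIBs: the same tenant address may appear in both
      # tables without conflict, since lookups are scoped by tenant.
      fib = {
          "tenant1": {"00:aa:bb:cc:dd:01": "NVE2"},
          "tenant2": {"00:aa:bb:cc:dd:01": "NVE3"},  # overlap, no clash
      }

      def forward(port, vlan, dst_mac):
          tenant = vap_to_tenant[(port, vlan)]     # VAP -> Tenant Instance
          return tenant, fib[tenant].get(dst_mac)  # scoped FIB lookup

      assert forward("port1", 100, "00:aa:bb:cc:dd:01") == ("tenant1", "NVE2")
      assert forward("port1", 200, "00:aa:bb:cc:dd:01") == ("tenant2", "NVE3")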
3.1.3. Overlay Modules and Tenant ID

   Mechanisms for identifying each tenant service are required to
   allow the simultaneous overlay of multiple tenant services over the
   same underlay L3 network topology.  In the data plane, each NVE,
   upon sending a tenant packet, must be able to encode the TNI for
   the destination NVE, in addition to the tunnel source L3 address
   identifying the source NVE and the tunnel destination L3 address
   identifying the destination NVE.  This allows the destination NVE
   to identify the tenant service instance and therefore appropriately
   process and forward the tenant packet.

   The Overlay module provides tunneling overlay functions: tunnel
   initiation/termination, encapsulation/decapsulation of frames
   from/to the VAPs and the L3 backbone, and may provide for transit
   forwarding of IP traffic (e.g., transparent forwarding of tunnel
   packets).

   In a multi-tenant context, the tunnel aggregates frames from/to
   different Tenant Instances.  Tenant identification and traffic
   demultiplexing are based on the Tenant Identifier (TNI).

   Historically, the following approaches have been considered:

   o One ID per Tenant: A globally unique (on a per-DC administrative
     domain basis) Tenant ID is used to identify the related Tenant
     Instances.  An example of this approach is the use of IEEE VLAN
     or I-SID tags to provide virtual L2 domains.

   o One ID per Tenant Instance: A per-tenant local ID is
     automatically generated by the egress NVE and usually distributed
     by a control plane protocol to all the related NVEs.  An example
     of this approach is the use of per-VRF MPLS labels in IP VPN
     [RFC4364].

   o One ID per VAP: A per-VAP local ID is assigned and usually
     distributed by a control plane protocol.  An example of this
     approach is the use of per-CE-PE MPLS labels in IP VPN [RFC4364].

   Note that when using one ID per Tenant Instance or per VAP, an
   additional global identifier may be used by the control plane to
   identify the Tenant context (e.g., historically equivalent to the
   route target community attribute in [RFC4364]).

3.1.4. Tunnel Overlays and Encapsulation options

   Once the TNI is added to the tenant data frame, an L3 tunnel
   encapsulation is used to transport the resulting frame to the
   destination NVE.  The backbone devices do not usually keep any
   per-service state, simply forwarding the frames based on the outer
   tunnel header.

   Different IP tunneling options (e.g., GRE, L2TPv3, IPsec) and
   MPLS-based tunneling options (e.g., BGP VPN, PW, VPLS) can be used
   for tunneling Ethernet and IP packets.
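   As a purely illustrative sketch of this data path (the layout below
   is a generic IP/UDP encapsulation carrying a 24-bit TNI, loosely
   similar in spirit to VXLAN-style schemes; the field sizes, the UDP
   port value and the function names are assumptions of this example
   and are not defined by this framework):

      # Illustrative only: a generic "outer IP/UDP + TNI" encapsulation.
      import socket
      import struct

      OVERLAY_UDP_PORT = 4789      # assumed value for this example

      def encapsulate(tenant_frame: bytes, tni: int,
                      src_nve_ip: str, dst_nve_ip: str) -> bytes:
          # 8-byte overlay header carrying a 24-bit Tenant Identifier.
          overlay_hdr = struct.pack("!II", 0x08000000, tni << 8)
          payload = overlay_hdr + tenant_frame

          udp_hdr = struct.pack("!HHHH", 49152, OVERLAY_UDP_PORT,
                                8 + len(payload), 0)  # csum 0 (sketch)

          ip_hdr = struct.pack("!BBHHHBBH4s4s",
                               0x45, 0, 28 + len(payload),  # ver/IHL, TOS, len
                               0, 0, 64, 17, 0,             # id, frag, TTL, UDP, csum
                               socket.inet_aton(src_nve_ip),
                               socket.inet_aton(dst_nve_ip))
          return ip_hdr + udp_hdr + payload

      # The egress NVE reverses the process: strip the outer headers,
      # read the TNI and hand the inner frame to that Tenant Instance.
      def decapsulate(packet: bytes):
          overlay_hdr = packet[28:36]
          tni = struct.unpack("!I", overlay_hdr[4:8])[0] >> 8
          return tni, packet[36:]

   The egress NVE selects the Tenant Instance from the TNI rather than
   from the inner addresses, which is what keeps tenant address spaces
   independent of the underlay.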
3.1.5. Control Plane Components

   Control plane components may be used to provide the following
   capabilities:

   . Service auto-provisioning/auto-discovery

   . Address advertisement and tunnel mapping

   . Tunnel establishment/tear-down and routing

   A control plane component can be an on-net control protocol or a
   management control entity.

3.1.5.1. Auto-provisioning/Service discovery

   NVEs must be able to select the appropriate Tenant Instance for
   each Tenant End System.  This is based on state information that is
   often provided by external entities.  For example, in a VM
   environment, this information is provided by compute management
   systems, since these are the only entities that have visibility of
   which VM belongs to which tenant.

   A mechanism for communicating this information between the Tenant
   End Systems and the local NVE is required.  As a result, the VAPs
   are created and mapped to the appropriate Tenant Instance.

   Depending upon the implementation, this control interface can be
   implemented using an auto-discovery protocol between Tenant End
   Systems and their local NVE, or through management entities.

   When a protocol is used, appropriate security and authentication
   mechanisms to verify that Tenant End System information is not
   spoofed or altered are required.  This is one critical aspect for
   providing integrity and tenant isolation in the system.

   Another control plane protocol can also be used to advertise the
   NVE tenant service instances (tenant and service type provided to
   the tenant) to other NVEs.  Alternatively, management control
   entities can also be used to perform these functions.

3.1.5.2. Address advertisement and tunnel mapping

   As traffic reaches an ingress NVE, a lookup is performed to
   determine which tunnel the packet needs to be sent to.  It is then
   encapsulated with a tunnel header containing the destination
   address of the egress NVE.  Intermediate nodes (between the ingress
   and egress NVEs) switch or route traffic based upon the outer
   destination address.  It should be noted that an NVE may be
   implemented on a gateway, to provide traffic forwarding between two
   different types of overlay networks, and thus may not be directly
   connected to a Tenant End System.

   One key step in this process consists of mapping a final
   destination address to the proper tunnel.  NVEs are responsible for
   maintaining such mappings in their lookup tables.  Several ways of
   populating these lookup tables are possible: control plane driven,
   management plane driven, or data plane driven.

   When a control plane protocol is used to distribute address
   advertisement and tunneling information, the service
   auto-provisioning/auto-discovery could be accomplished by the same
   protocol.  In this scenario, the auto-provisioning/service
   discovery could be combined with (i.e., be inferred from) the
   address advertisement and tunnel mapping.  Furthermore, a control
   plane protocol that carries both IP addresses and associated MACs
   eliminates the need for ARP, and hence addresses one of the issues
   with excessive ARP handling.
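   As a minimal illustration of this lookup (the table layout and
   function names are assumptions of this sketch, and the control
   plane is reduced to a function call), an ingress NVE can keep a
   per-tenant table mapping tenant destination addresses to the egress
   NVE tunnel endpoint behind which they are reachable:

      # Illustrative only: per-tenant mapping of tenant addresses to
      # the tunnel endpoint (egress NVE) behind which they sit.
      from collections import defaultdict

      tunnel_table = defaultdict(dict)  # tenant -> {addr: egress NVE IP}

      def on_advertisement(tenant, tenant_addr, egress_nve_ip):
          # Called when the control or management plane advertises
          # that tenant_addr is reachable via egress_nve_ip.
          tunnel_table[tenant][tenant_addr] = egress_nve_ip

      def on_withdrawal(tenant, tenant_addr):
          tunnel_table[tenant].pop(tenant_addr, None)

      def ingress_lookup(tenant, tenant_dst_addr):
          # Returns the outer destination address to encapsulate to,
          # or None if the destination is unknown (see Section 4.2.3
          # for how unknown destinations may be handled).
          return tunnel_table[tenant].get(tenant_dst_addr)

      on_advertisement("tenant1", "10.1.1.5", "192.0.2.2")
      assert ingress_lookup("tenant1", "10.1.1.5") == "192.0.2.2"
      assert ingress_lookup("tenant1", "10.1.1.9") is None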
3.1.5.3. Tunnel management

   A control plane protocol may be required to set up and tear down
   tunnels, exchange tunnel state information, and/or provide for
   tunnel endpoint routing.  This applies to both unicast and
   multicast tunnels.

   For instance, it may be necessary to provide active/standby tunnel
   status information between NVEs, up/down status information,
   pruning/grafting information for multicast tunnels, etc.

3.2. Service Overlay Topologies

   A number of service topologies may be used to optimize the service
   connectivity and to address NVE performance limitations.

   The topology described in Figure 3 suggests the use of a tunnel
   mesh between the NVEs, where each tenant instance is one hop away
   from a service processing perspective.  This should not be
   construed to imply that a tunnel mesh must be configured, as
   tunneling can simply be encapsulation/decapsulation with a tunnel
   header.  Partial mesh topologies and an NVE hierarchy may be used,
   where certain NVEs may act as service transit points.

4. Key aspects of overlay networks

   The intent of this section is to highlight specific issues that
   proposed overlay solutions need to address.

4.1. Pros & Cons

   An overlay network is a layer of virtual network topology on top of
   the physical network.

   Overlay networks offer the following key advantages:

   o Unicast tunneling state management is handled at the edge of the
     network.  Intermediate transport nodes are unaware of such state.
     Note that this is often not the case when multicast is enabled in
     the core network.

   o Tunnels are used to aggregate traffic and hence offer the
     advantage of minimizing the amount of forwarding state required
     within the underlay network.

   o Decoupling of the overlay addresses (MAC and IP) used by VMs, or
     Tenant End Systems in general, from the underlay network.  This
     offers a clear separation between the addresses used within the
     overlay and those used in the underlay network, and it enables
     the use of overlapping address spaces by Tenant End Systems.

   o Support of a large number of virtual network identifiers.

   Overlay networks also create several challenges:

   o Overlay networks have no control over the underlay networks and
     lack critical network information.

      o Overlays typically probe the network to measure link
        properties, such as available bandwidth or packet loss rate,
        but it is difficult to accurately evaluate network properties
        this way.  It might be preferable for the underlay network to
        expose its usage and performance information to the overlay
        networks.

   o Miscommunication between overlay and underlay networks can lead
     to inefficient usage of network resources.

   o Fairness of resource sharing and coordination among edge nodes in
     overlay networks are two critical issues.  When multiple overlays
     co-exist on top of a common underlay network, the lack of
     coordination between overlays can lead to performance issues.

   o Overlaid traffic may not traverse firewalls and NAT devices.

   o Multicast service scalability: multicast support may be required
     in the overlay network to provide each tenant with flood
     containment or efficient multicast handling.

   o Load balancing may not be optimal, as the hash algorithm may not
     work well due to the limited number of combinations of tunnel
     source and destination addresses.

4.2. Overlay issues to consider

4.2.1. Data plane vs Control plane driven

   Dynamic (data plane) learning implies that flooding of unknown
   destinations be supported, and hence implies that broadcast and/or
   multicast be supported.  Multicasting in the core network for
   dynamic learning can lead to significant scalability limitations.
   Specific forwarding rules must be enforced to prevent loops from
   happening.  This can be achieved using a spanning tree protocol, a
   shortest path tree, or split-horizon forwarding rules.

   It should be noted that the amount of state to be distributed is a
   function of the number of virtual machines.  Different forms of
   caching can also be utilized to minimize state distribution among
   the various elements.

4.2.2. Coordination between data plane and control plane

   Often, a combination of dynamic data plane learning and control
   plane based learning is necessary.  MAC data plane learning or IP
   data plane learning can be applied on tenant VAPs at the NVE,
   whereas control plane based MAC and IP reachability distribution
   can be performed across the overlay network among the NVEs,
   possibly with the help of a control plane mediation device (e.g., a
   BGP route reflector if BGP is used to distribute such information).
   Coordination between the data plane learning process and the
   control plane reachability distribution process is needed such
   that, when a new address gets learned or an old address is removed,
   the local control plane is triggered to advertise this information
   to its peers.
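   The following sketch (illustrative only; the callback names and the
   aging model are assumptions of this example) shows the kind of
   coordination described above: addresses learned in the data plane
   on a local VAP trigger a control plane advertisement to peer NVEs,
   and are withdrawn when they age out:

      # Illustrative only: data plane learning on local VAPs
      # coordinated with control plane advertisement to peer NVEs.
      import time

      local_fib = {}      # tenant MAC -> (local VAP, last-seen time)
      AGE_LIMIT = 300     # seconds, assumed value

      def advertise(mac):  # placeholder for the control plane action,
          pass             # e.g. an update sent via a route reflector

      def withdraw(mac):
          pass

      def on_frame_received(src_mac, vap):
          # Data plane learning: remember which VAP the source is on.
          is_new = src_mac not in local_fib
          local_fib[src_mac] = (vap, time.time())
          if is_new:
              advertise(src_mac)     # trigger control plane distribution

      def age_out():
          now = time.time()
          for mac, (_, seen) in list(local_fib.items()):
              if now - seen > AGE_LIMIT:
                  del local_fib[mac]
                  withdraw(mac)      # keep peers consistent with us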
4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

   There are two techniques to support the packet replication needed
   for broadcast, unknown unicast and multicast:

   o Ingress replication

   o Use of core multicast trees

   There is a bandwidth vs. state trade-off between the two
   approaches.  Which way to trade bandwidth for state depends upon
   the degree of replication required (i.e., the number of hosts per
   group) and the amount of multicast state to maintain.

   When the number of hosts per group is large, the use of core
   multicast trees may be more appropriate.  When the number of hosts
   is small (e.g., 2-3), ingress replication may not be an issue,
   depending on the multicast stream bandwidth.

   Depending upon the size of the data center network, and hence the
   number of (S,G) entries, but also the duration of multicast flows,
   the use of core multicast trees can be a challenge.

   When flows are well known, it is possible to pre-provision such
   multicast trees.  However, it is often difficult to predict
   application flows ahead of time, and hence programming of (S,G)
   entries for short-lived flows could be impractical.

   A possible trade-off is to use shared multicast trees in the core
   as opposed to dedicated multicast trees.
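   As an illustration of the ingress replication option (names are
   assumptions of this sketch; with a core multicast tree the loop
   below would be replaced by a single send onto a multicast group),
   the ingress NVE simply unicasts one copy of a BUM frame to every
   other NVE participating in the tenant's overlay:

      # Illustrative only: ingress replication of BUM traffic.
      # flood_list holds the other NVEs participating in the tenant
      # instance; encapsulate is an encapsulation function such as
      # the one sketched at the end of Section 3.1.4.
      flood_list = {
          "tenant1": ["192.0.2.2", "192.0.2.3", "192.0.2.4"],
      }

      def send(packet, dst_ip):   # placeholder for the underlay send
          pass

      def flood_bum(tenant, tni, frame, local_nve_ip, encapsulate):
          # Bandwidth/state trade-off: N-1 unicast copies here, versus
          # one copy plus per-(S,G) state in the core with multicast.
          for remote_nve in flood_list[tenant]:
              send(encapsulate(frame, tni, local_nve_ip, remote_nve),
                   remote_nve)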
4.2.4. Path MTU

   When using overlay tunneling, an outer header is added to the
   original tenant frame.  This can cause the MTU of the path to the
   egress tunnel endpoint to be exceeded.

   In this section, we only consider the case of an IP overlay.

   It is usually not desirable to rely on IP fragmentation, for
   performance reasons.  Ideally, the interface MTU as seen by a
   Tenant End System is adjusted such that no fragmentation is needed.
   TCP will adjust its maximum segment size accordingly.

   It is possible for the MTU to be configured manually or to be
   discovered dynamically.  Various Path MTU discovery techniques
   exist in order to determine the proper MTU size to use:

   o Classical ICMP-based Path MTU Discovery [RFC1191] [RFC1981]

      o Tenant End Systems rely on ICMP messages to discover the MTU
        of the end-to-end path to their destination.  This method is
        not always possible, such as when traversing middleboxes
        (e.g., firewalls) which disable ICMP for security reasons.

   o Extended MTU Path Discovery techniques such as defined in
     [RFC4821]

   It is also possible to rely on the overlay layer to perform
   segmentation and reassembly operations, without relying on the
   Tenant End Systems to know about the end-to-end MTU.  The
   assumption is that some hardware assist is available on the NVE
   node to perform such fragmentation and reassembly operations.
   However, fragmentation by the overlay layer can lead to performance
   and congestion issues due to TCP dynamics, and might require new
   congestion avoidance mechanisms in the underlay network [FLOYD].

   Finally, the underlay network may be designed in such a way that
   the MTU can accommodate the extra tunnel overhead.
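   As a worked example of the overhead involved (the numbers are
   assumptions for illustration: a 1500-byte underlay MTU and the
   generic outer IPv4/UDP/8-byte overlay header used in the sketch at
   the end of Section 3.1.4), the tenant-facing MTU can be derived
   from the underlay MTU minus the tunnel overhead:

      # Illustrative arithmetic only; header sizes depend on the
      # actual encapsulation and on IPv4 vs IPv6 in the underlay.
      UNDERLAY_MTU   = 1500    # assumed path MTU between NVEs
      OUTER_IPV4     = 20
      OUTER_UDP      = 8
      OVERLAY_HDR    = 8       # carries the TNI in this sketch
      INNER_ETHERNET = 14      # inner tenant Ethernet header, untagged

      tunnel_overhead = OUTER_IPV4 + OUTER_UDP + OVERLAY_HDR
      tenant_if_mtu   = UNDERLAY_MTU - tunnel_overhead - INNER_ETHERNET

      print(tunnel_overhead)   # 36 bytes of encapsulation overhead
      print(tenant_if_mtu)     # 1450: IP MTU seen by Tenant End Systems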
4.2.5. NVE location trade-offs

   In the case of DC traffic, traffic originating from a VM is native
   Ethernet traffic.  This traffic may receive an ELAN service or an
   IP service.  In the case of an ELAN service, it can be switched by
   a local VM switch or a ToR switch, and then by a DC gateway.  The
   NVE function can be embedded within any of these elements.

   There are several criteria to consider when deciding where the NVE
   processing boundary lies:

   o Processing and memory requirements

   o Datapath (e.g., FIB size, lookups, filtering,
     encapsulation/decapsulation)

   o Control plane (e.g., RIB size, routing, signaling, OAM)

   o Multicast support

   o Routing protocols

   o Packet replication capability

   o Fragmentation support

   o QoS transparency

   o Resiliency

4.2.6. Interaction between network overlays and underlays

   When multiple overlays co-exist on top of a common underlay
   network, this can cause performance issues.  These overlays have
   partially overlapping paths and nodes.

   Each overlay is selfish by nature, in that it sends traffic so as
   to optimize its own performance without considering the impact on
   other overlays, unless the underlay tunnels are traffic engineered
   on a per-overlay basis so as to avoid oversubscribing underlay
   resources.

   Better visibility between overlays and underlays, or their
   controllers, can be achieved by providing mechanisms to exchange
   information about:

   o Performance metrics (throughput, delay, loss, jitter)

   o Cost metrics

   This information may then be used to traffic engineer the underlay
   network and/or the overlay networks in a coordinated fashion.

5. Security Considerations

   The tenant-to-overlay mapping function can introduce significant
   security risks if the protocols/mechanisms used to establish that
   mapping are not trusted, do not support mutual authentication, or
   cannot be established over trusted interfaces and/or mutually
   authenticated connections.

   No other new security issues are introduced beyond those already
   described in the related L2VPN and L3VPN RFCs.

6. IANA Considerations

   IANA does not need to take any action for this draft.

7. References

7.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

7.2. Informative References

   [NVOPS]   Narten, T., et al., "Problem Statement: Overlays for
             Network Virtualization", draft-narten-nvo3-overlay-
             problem-statement (work in progress).

   [OVCPREQ] Kreeger, L., et al., "Network Virtualization Overlay
             Control Protocol Requirements", draft-kreeger-nvo3-
             overlay-cp (work in progress).

   [FLOYD]   Floyd, S. and A. Romanow, "Dynamics of TCP Traffic over
             ATM Networks", IEEE JSAC, Vol. 13, No. 4, May 1995.

   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
             Networks (VPNs)", RFC 4364, February 2006.

   [RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
             November 1990.

   [RFC1981] McCann, J., et al., "Path MTU Discovery for IPv6",
             RFC 1981, August 1996.

   [RFC4821] Mathis, M., et al., "Packetization Layer Path MTU
             Discovery", RFC 4821, March 2007.

8. Acknowledgments

   In addition to the authors, the following people have contributed
   to this document:

   Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent

   Javier Benitez, Colt

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Marc Lasserre
   Alcatel-Lucent
   Email: marc.lasserre@alcatel-lucent.com

   Florin Balus
   Alcatel-Lucent
   777 E. Middlefield Road
   Mountain View, CA, USA 94043
   Email: florin.balus@alcatel-lucent.com

   Thomas Morin
   France Telecom Orange
   Email: thomas.morin@orange.com

   Nabil Bitar
   Verizon
   60 Sylvan Road
   Waltham, MA 02145
   Email: nabil.n.bitar@verizon.com

   Yakov Rekhter
   Juniper
   Email: yakov@juniper.net

   Yuichi Ikejiri
   NTT Communications
   1-1-6, Uchisaiwai-cho, Chiyoda-ku
   Tokyo, 100-8019 Japan
   Email: y.ikejiri@ntt.com