2 Internet Engineering Task Force Marc Lasserre 3 Internet Draft Florin Balus 4 Intended status: Informational Alcatel-Lucent 5 Expires: March 2013 6 Thomas Morin 7 France Telecom Orange 9 Nabil Bitar 10 Verizon 12 Yakov Rekhter 13 Juniper 15 October 19, 2012

17 Framework for DC Network Virtualization 18 draft-ietf-nvo3-framework-01.txt

20 Status of this Memo

22 This Internet-Draft is submitted in full conformance with the 23 provisions of BCP 78 and BCP 79.

25 Internet-Drafts are working documents of the Internet Engineering 26 Task Force (IETF). Note that other groups may also distribute 27 working documents as Internet-Drafts. The list of current Internet- 28 Drafts is at http://datatracker.ietf.org/drafts/current/.

30 Internet-Drafts are draft documents valid for a maximum of six 31 months and may be updated, replaced, or obsoleted by other documents 32 at any time. It is inappropriate to use Internet-Drafts as 33 reference material or to cite them other than as "work in progress."

35 This Internet-Draft will expire on April 19, 2013.

37 Copyright Notice

39 Copyright (c) 2012 IETF Trust and the persons identified as the 40 document authors. All rights reserved.

42 This document is subject to BCP 78 and the IETF Trust's Legal 43 Provisions Relating to IETF Documents 44 (http://trustee.ietf.org/license-info) in effect on the date of 45 publication of this document. Please review these documents 46 carefully, as they describe your rights and restrictions with 47 respect to this document.
Code Components extracted from this 48 document must include Simplified BSD License text as described in 49 Section 4.e of the Trust Legal Provisions and are provided without 50 warranty as described in the Simplified BSD License. 52 Abstract 54 Several IETF drafts relate to the use of overlay networks to support 55 large scale virtual data centers. This draft provides a framework 56 for Network Virtualization over L3 (NVO3) and is intended to help 57 plan a set of work items in order to provide a complete solution 58 set. It defines a logical view of the main components with the 59 intention of streamlining the terminology and focusing the solution 60 set. 62 Table of Contents 64 1. Introduction................................................3 65 1.1. Conventions used in this document.......................4 66 1.2. General terminology.....................................4 67 1.3. DC network architecture.................................6 68 1.4. Tenant networking view..................................7 69 2. Reference Models............................................8 70 2.1. Generic Reference Model.................................8 71 2.2. NVE Reference Model....................................10 72 2.3. NVE Service Types......................................12 73 2.3.1. L2 NVE providing Ethernet LAN-like service.........12 74 2.3.2. L3 NVE providing IP/VRF-like service..............12 75 3. Functional components.......................................12 76 3.1. Generic service virtualization components..............12 77 3.1.1. Virtual Access Points (VAPs)......................13 78 3.1.2. Virtual Network Instance (VNI)....................13 79 3.1.3. Overlay Modules and VN Context....................13 80 3.1.4. Tunnel Overlays and Encapsulation options..........14 81 3.1.5. Control Plane Components..........................14 82 3.1.5.1. Distributed vs Centralized Control Plane.........15 83 3.1.5.2. Auto-provisioning/Service discovery.............15 84 3.1.5.3. Address advertisement and tunnel mapping.........16 85 3.1.5.4. Tunnel management...............................17 86 3.2. Multi-homing..........................................17 87 3.3. Service Overlay Topologies.............................18 88 4. Key aspects of overlay networks.............................18 89 4.1. Pros & Cons...........................................18 90 4.2. Overlay issues to consider.............................19 91 4.2.1. Data plane vs Control plane driven................19 92 4.2.2. Coordination between data plane and control plane..20 93 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) 94 traffic.................................................20 95 4.2.4. Path MTU.........................................21 96 4.2.5. NVE location trade-offs...........................21 97 4.2.6. Interaction between network overlays and underlays.22 98 5. Security Considerations.....................................23 99 6. IANA Considerations........................................23 100 7. References.................................................23 101 7.1. Normative References...................................23 102 7.2. Informative References.................................23 103 8. Acknowledgments............................................24 105 1. Introduction 107 This document provides a framework for Data Center Network 108 Virtualization over L3 tunnels. 
This framework is intended to aid in 109 standardizing protocols and mechanisms to support large scale 110 network virtualization for data centers.

112 Several IETF drafts relate to the use of overlay networks for data 113 centers.

115 [NVOPS] defines the rationale for using overlay networks in order to 116 build large data center networks. The use of virtualization leads to 117 a very large number of communication domains and end systems to cope 118 with.

120 [OVCPREQ] describes the requirements for a control plane protocol 121 required by overlay border nodes to exchange overlay mappings.

123 This document provides reference models and functional components of 124 data center overlay networks as well as a discussion of technical 125 issues that have to be addressed in the design of standards and 126 mechanisms for large scale data centers.

128 1.1. Conventions used in this document

130 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 131 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 132 document are to be interpreted as described in RFC-2119 [RFC2119].

134 In this document, these words will appear with that interpretation 135 only when in ALL CAPS. Lower case uses of these words are not to be 136 interpreted as carrying RFC-2119 significance.

138 1.2. General terminology

140 This document uses the following terminology:

142 NVE: Network Virtualization Edge. It is a network entity that sits 143 on the edge of the NVO3 network. It implements network 144 virtualization functions that allow for L2 and/or L3 tenant 145 separation and for hiding tenant addressing information (MAC and IP 146 addresses). An NVE could be implemented as part of a virtual switch 147 within a hypervisor, a physical switch or router, or a Network Service 148 Appliance.

150 VN: Virtual Network. This is a virtual L2 or L3 domain that belongs 151 to a tenant.

153 VNI: Virtual Network Instance. This is one instance of a virtual 154 overlay network. Two Virtual Networks are isolated from one another 155 and may use overlapping addresses.

157 Virtual Network Context or VN Context: Field that is part of the 158 overlay encapsulation header which allows the encapsulated frame to 159 be delivered to the appropriate virtual network endpoint by the 160 egress NVE. The egress NVE uses this field to determine the 161 appropriate virtual network context in which to process the packet. 162 This field MAY be an explicit, unique (to the administrative domain) 163 virtual network identifier (VNID) or MAY express the necessary 164 context information in other ways (e.g. a locally significant 165 identifier).

167 VNID: Virtual Network Identifier. In the case where the VN context 168 has global significance, this is the ID value that is carried in 169 each data packet in the overlay encapsulation that identifies the 170 Virtual Network the packet belongs to.

172 Underlay or Underlying Network: This is the network that provides 173 the connectivity between NVEs. The Underlying Network can be 174 completely unaware of the overlay packets. Addresses within the 175 Underlying Network are also referred to as "outer addresses" because 176 they exist in the outer encapsulation. The Underlying Network can 177 use a completely different protocol (and address family) from that 178 of the overlay.

180 Data Center (DC): A physical complex housing physical servers, 181 network switches and routers, Network Service Appliances and 182 networked storage.
The purpose of a Data Center is to provide 183 application and/or compute and/or storage services. One such service 184 is virtualized data center services, also known as Infrastructure as 185 a Service.

187 Virtual Data Center or Virtual DC: A container for virtualized 188 compute, storage and network services. Managed by a single tenant, a 189 Virtual DC can contain multiple VNs and multiple Tenant Systems that 190 are connected to one or more of these VNs.

192 VM: Virtual Machine. Several Virtual Machines can share the 193 resources of a single physical computer server using the services of 194 a Hypervisor (see below definition).

196 Hypervisor: Server virtualization software running on a physical 197 compute server that hosts Virtual Machines. The hypervisor provides 198 shared compute/memory/storage and network connectivity to the VMs 199 that it hosts. Hypervisors often embed a Virtual Switch (see below).

201 Virtual Switch: A function within a Hypervisor (typically 202 implemented in software) that provides similar services to a 203 physical Ethernet switch. It switches Ethernet frames between VMs' 204 virtual NICs within the same physical server, or between a VM and a 205 physical NIC card connecting the server to a physical Ethernet 206 switch. It also enforces network isolation between VMs that should 207 not communicate with each other.

209 Tenant: In a DC, a tenant refers to a customer that could be an 210 organization within an enterprise, or an enterprise with a set of DC 211 compute, storage and network resources associated with it.

213 Tenant System: A physical or virtual system that can play the role 214 of a host, or a forwarding element such as a router, switch, 215 firewall, etc. It belongs to a single tenant and connects to one or 216 more VNs of that tenant.

218 End device: A physical system to which networking service is 219 provided. Examples include hosts (e.g. server or server blade), 220 storage systems (e.g. file servers, iSCSI storage systems) and 221 network devices (e.g. firewall, load-balancer, IPSec gateway). An 222 end device may include internal networking functionality that 223 interconnects the device's components (e.g. virtual switches that 224 interconnect VMs running on the same server). NVE functionality may 225 be implemented as part of that internal networking.

227 ELAN: MEF ELAN, multipoint to multipoint Ethernet service

229 EVPN: Ethernet VPN as defined in [EVPN]

231 1.3. DC network architecture

233 A generic architecture for Data Centers is depicted in Figure 1:

235 ,---------. 236 ,' `. 237 ( IP/MPLS WAN ) 238 `. ,' 239 `-+------+' 240 +--+--+ +-+---+ 241 |DC GW|+-+|DC GW| 242 +-+---+ +-----+ 243 | / 244 .--. .--. 245 ( ' '.--. 246 .-.' Intra-DC ' 247 ( network ) 248 ( .'-' 249 '--'._.'. )\ \ 250 / / '--' \ \ 251 / / | | \ \ 252 +---+--+ +-`.+--+ +--+----+ 253 | ToR | | ToR | | ToR | 254 +-+--`.+ +-+-`.-+ +-+--+--+ 255 / \ / \ / \ 256 __/_ \ / \ /_ _\__ 257 '--------' '--------' '--------' '--------' 258 : End : : End : : End : : End : 259 : Device : : Device : : Device : : Device : 260 '--------' '--------' '--------' '--------'

262 Figure 1 : A Generic Architecture for Data Centers

264 An example of a multi-tier DC network architecture is presented in 265 this figure. It provides a view of physical components inside a DC.

267 A cloud network is composed of intra-Data Center (DC) networks and 268 network services, and inter-DC network and network connectivity 269 services.
Depending upon the scale, DC distribution, operations 270 model, Capex and Opex aspects, DC networking elements can act as 271 strict L2 switches and/or provide IP routing capabilities, including 272 service virtualization.

274 In some DC architectures, it is possible that some tier layers 275 providing L2 and/or L3 services are collapsed, and that Internet 276 connectivity, inter-DC connectivity and VPN support are handled by a 277 smaller number of nodes. Nevertheless, one can assume that the 278 functional blocks fit with the architecture above.

280 The following components can be present in a DC:

282 o Top of Rack (ToR): Hardware-based Ethernet switch aggregating 283 all Ethernet links from the End Devices in a rack, representing 284 the entry point in the physical DC network for the hosts. ToRs 285 may also provide routing functionality, virtual IP network 286 connectivity, or Layer2 tunneling over IP for instance. ToRs 287 are usually multi-homed to switches in the Intra-DC network. 288 Other deployment scenarios may use an intermediate Blade Switch 289 before the ToR or an EoR (End of Row) switch to provide a 290 function similar to that of a ToR.

292 o Intra-DC Network: High capacity network composed of core 293 switches aggregating multiple ToRs. Core switches are usually 294 Ethernet switches but can also support routing capabilities.

296 o DC GW: Gateway to the outside world providing DC Interconnect 297 and connectivity to Internet and VPN customers. In the current 298 DC network model, this may simply be a Router connected to the 299 Internet and/or an IPVPN/L2VPN PE. Some network implementations 300 may dedicate DC GWs for different connectivity types (e.g., a 301 DC GW for Internet, and another for VPN).

303 Note that End Devices may be single or multi-homed to ToRs.

305 1.4. Tenant networking view

307 The DC network architecture is used to provide L2 and/or L3 service 308 connectivity to each tenant. An example is depicted in Figure 2:

310 +----- L3 Infrastructure ----+ 311 | | 312 ,--+--. ,--+--. 313 .....( Rtr1 )...... ( Rtr2 ) 314 | `-----' | `-----' 315 | Tenant1 |LAN12 Tenant1| 316 |LAN11 ....|........ |LAN13 317 .............. | | .............. 318 | | | | | | 319 ,-. ,-. ,-. ,-. ,-. ,-. 320 (VM )....(VM ) (VM )... (VM ) (VM )....(VM ) 321 `-' `-' `-' `-' `-' `-'

323 Figure 2 : Logical Service connectivity for a single tenant

325 In this example, one or more L3 contexts and one or more LANs (e.g., 326 one per application type) running on DC switches are assigned for DC 327 tenant 1.

329 For a multi-tenant DC, a virtualized version of this type of service 330 connectivity needs to be provided for each tenant by the Network 331 Virtualization solution.

333 2. Reference Models

335 2.1. Generic Reference Model

337 The following diagram shows a DC reference model for network 338 virtualization using Layer3 overlays where NVEs provide a logical 339 interconnect between Tenant Systems that belong to a specific tenant 340 network.

342 +--------+ +--------+ 343 | Tenant +--+ +----| Tenant | 344 | System | | (') | System | 345 +--------+ | ................... ( ) +--------+ 346 | +-+--+ +--+-+ (_) 347 | | NV | | NV | | 348 +--|Edge| |Edge|---+ 349 +-+--+ +--+-+ 350 / . . 351 / . L3 Overlay +--+-++--------+ 352 +--------+ / . Network | NV || Tenant | 353 | Tenant +--+ . |Edge|| System | 354 | System | . +----+ +--+-++--------+ 355 +--------+ .....| NV |........
356 |Edge| 357 +----+ 358 | 359 | 360 ===================== 361 | | 362 +--------+ +--------+ 363 | Tenant | | Tenant | 364 | System | | System | 365 +--------+ +--------+

367 Figure 3 : Generic reference model for DC network virtualization 368 over a Layer3 infrastructure

370 A Tenant System can be attached to a Network Virtualization Edge 371 (NVE) node in several ways:

373 - locally, by being co-located i.e. resident in the same device

375 - remotely, via a point-to-point connection or a switched network 376 (e.g. Ethernet)

378 When an NVE is local, the state of Tenant Systems can be provided 379 without protocol assistance. For instance, the operational status of 380 a VM can be communicated via a local API. When an NVE is remote, the 381 state of Tenant Systems needs to be exchanged via a data or control 382 plane protocol, or via a management entity.

384 The functional components in this picture do not necessarily map 385 directly to the physical components described in Figure 1.

387 For example, an End Device can be a server blade with VMs and a 388 virtual switch, i.e. the VM is the Tenant System and the NVE 389 functions may be performed by the virtual switch and/or the 390 hypervisor. In this case, the Tenant System and NVE function are co- 391 located.

393 Another example is the case where an End Device can be a traditional 394 physical server (no VMs, no virtual switch), i.e. the server is the 395 Tenant System and the NVE function may be performed by the ToR. 396 Other End Devices in this category are Physical Network Appliances 397 or Storage Systems.

399 The NVE implements network virtualization functions that allow for 400 L2 and/or L3 tenant separation and for hiding tenant addressing 401 information (MAC and IP addresses), tenant-related control plane 402 activity and service contexts from the Routed Backbone nodes.

404 Core nodes utilize L3 techniques to interconnect NVE nodes in 405 support of the overlay network. These devices perform forwarding 406 based on the outer L3 tunnel header, and generally do not maintain per 407 tenant-service state, although some applications (e.g., multicast) may 408 require control plane or forwarding plane information that pertains 409 to a tenant, group of tenants, tenant service or a set of services 410 that belong to one or more tunnels. When such tenant or tenant- 411 service related information is maintained in the core, overlay 412 virtualization provides knobs to control that information.

414 2.2. NVE Reference Model

416 The NVE is composed of a Virtual Network instance that Tenant 417 Systems interface with and an overlay module that provides tunneling 418 overlay functions (e.g. encapsulation/decapsulation of tenant 419 traffic from/to the tenant forwarding instance, tenant 420 identification and mapping, etc), as described in Figure 4:

422 +------- L3 Network ------+ 423 | | 424 | Tunnel Overlay | 425 +------------+---------+ +---------+------------+ 426 | +----------+-------+ | | +---------+--------+ | 427 | | Overlay Module | | | | Overlay Module | | 428 | +---------+--------+ | | +---------+--------+ | 429 | |VN context| | VN context| | 430 | | | | | | 431 | +--------+-------+ | | +--------+-------+ | 432 | | |VNI| . |VNI| | | | |VNI| . |VNI| |
433 NVE1 | +-+------------+-+ | | +-+-----------+--+ | NVE2 434 | | VAPs | | | | VAPs | | 435 +----+------------+----+ +----+-----------+-----+ 436 | | | | 437 -------+------------+-----------------+-----------+------- 438 | | Tenant | | 439 | | Service IF | | 440 Tenant Systems Tenant Systems

442 Figure 4 : Generic reference model for NV Edge

444 Note that some NVE functions (e.g. data plane and control plane 445 functions) may reside in one device or may be implemented separately 446 in different devices.

448 For example, the NVE functionality could reside solely on the End 449 Devices, on the ToRs or on both the End Devices and the ToRs. In the 450 latter case, we say that the End Device NVE component acts as the NVE 451 Spoke, and ToRs act as NVE hubs. Tenant Systems will interface with 452 VNIs maintained on the NVE spokes, and VNIs maintained on the NVE 453 spokes will interface with VNIs maintained on the NVE hubs.

455 2.3. NVE Service Types

457 NVE components may be used to provide different types of virtualized 458 service connectivity. This section defines the service types and 459 associated attributes.

461 2.3.1. L2 NVE providing Ethernet LAN-like service

463 An L2 NVE implements Ethernet LAN emulation (ELAN), an Ethernet-based 464 multipoint service where the Tenant Systems appear to be 465 interconnected by a LAN environment over a set of L3 tunnels. It 466 provides a per-tenant virtual switching instance with MAC addressing 467 isolation and L3 tunnel encapsulation across the core.

469 2.3.2. L3 NVE providing IP/VRF-like service

471 Virtualized IP routing and forwarding is similar, from a service 472 definition perspective, to IETF IP VPNs (e.g., BGP/MPLS IPVPN and 473 IPsec VPNs). It provides a per-tenant routing instance with addressing 474 isolation and L3 tunnel encapsulation across the core.

476 3. Functional components

478 This section breaks down the Network Virtualization architecture 479 into functional components to make it easier to discuss solution 480 options for different modules.

482 This version of the document gives an overview of generic functional 483 components that are shared between L2 and L3 service types. Details 484 specific to each service type will be added in future revisions.

486 3.1. Generic service virtualization components

488 A Network Virtualization solution is built around a number of 489 functional components as depicted in Figure 5:

491 +------- L3 Network ------+ 492 | | 493 | Tunnel Overlay | 494 +------------+--------+ +--------+------------+ 495 | +----------+------+ | | +------+----------+ | 496 | | Overlay Module | | | | Overlay Module | | 497 | +--------+--------+ | | +--------+--------+ | 498 | |VN Context| | |VN Context| 499 | | | | | | 500 | +-------+-------+ | | +-------+-------+ | 501 | ||VNI| ... |VNI|| | | ||VNI| ... |VNI|| | 502 NVE1 | +-+-----------+-+ | | +-+-----------+-+ | NVE2 503 | | VAPs | | | | VAPs | | 504 +----+-----------+----+ +----+-----------+----+ 505 | | | | 506 -----+-----------+-----------------+-----------+----- 507 | | Tenant | | 508 | | Service IF | | 509 Tenant Systems Tenant Systems

511 Figure 5 : Generic reference model for NV Edge

513 3.1.1. Virtual Access Points (VAPs)

515 Tenant Systems are connected to the VNI through Virtual 516 Access Points (VAPs).

518 The VAPs can be physical ports or virtual ports identified through 519 logical interface identifiers (VLANs, internal VSwitch Interface ID 520 leading to a VM).
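As a non-normative illustration of the VAP concept, the following sketch shows one way an NVE implementation could model the binding between VAPs and VNIs; it assumes, purely for the example, that a VAP is identified by a local port plus an optional VLAN tag and is bound to a VNI when the Tenant System is provisioned (see Section 3.1.5.2). All class and field names are illustrative, not part of this framework.

      # Hypothetical sketch; names and fields are illustrative only.
      from dataclasses import dataclass
      from typing import Dict, Optional

      @dataclass(frozen=True)
      class VapId:
          port: str                   # physical port or vSwitch interface
          vlan: Optional[int] = None  # logical interface identifier, if any

      class Nve:
          def __init__(self) -> None:
              self.vap_to_vni: Dict[VapId, int] = {}

          def provision_vap(self, vap: VapId, vni_id: int) -> None:
              # VAP created and mapped to the appropriate VNI.
              self.vap_to_vni[vap] = vni_id

          def classify(self, port: str, vlan: Optional[int]) -> int:
              # Ingress classification: map the receiving VAP to its VNI.
              return self.vap_to_vni[VapId(port, vlan)]

      nve = Nve()
      nve.provision_vap(VapId("vswitch-if-7", vlan=100), vni_id=10)
      assert nve.classify("vswitch-if-7", 100) == 10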
522 3.1.2. Virtual Network Instance (VNI)

524 The VNI represents a set of configuration attributes defining access 525 and tunnel policies and (L2 and/or L3) forwarding functions.

527 Per tenant FIB tables and control plane protocol instances are used 528 to maintain separate private contexts between tenants. Hence tenants 529 are free to use their own addressing schemes without concerns about 530 address overlapping with other tenants.

532 3.1.3. Overlay Modules and VN Context

534 Mechanisms for identifying each tenant service are required to allow 535 the simultaneous overlay of multiple tenant services over the same 536 underlay L3 network topology. In the data plane, each NVE, upon 537 sending a tenant packet, must be able to encode the VN Context for 538 the destination NVE in addition to the L3 tunnel source address 539 identifying the source NVE and the tunnel destination L3 address 540 identifying the destination NVE. This allows the destination NVE to 541 identify the tenant service instance and therefore appropriately 542 process and forward the tenant packet.

544 The Overlay module provides tunneling overlay functions: tunnel 545 initiation/termination, encapsulation/decapsulation of frames from 546 VAPs/L3 Backbone and may provide for transit forwarding of IP 547 traffic (e.g., transparent tunnel forwarding).

549 In a multi-tenant context, the tunnel aggregates frames from/to 550 different VNIs. Tenant identification and traffic demultiplexing are 551 based on the VN Context (e.g. VNID).

553 The following approaches can be considered:

555 o One VN Context per Tenant: A globally unique (on a per-DC 556 administrative domain) VNID is used to identify the related 557 Tenant instances. An example of this approach is the use of 558 IEEE VLAN or ISID tags to provide virtual L2 domains.

560 o One VN Context per VNI: A per-tenant local value is 561 automatically generated by the egress NVE and usually 562 distributed by a control plane protocol to all the related 563 NVEs. An example of this approach is the use of per VRF MPLS 564 labels in IP VPN [RFC4364].

566 o One VN Context per VAP: A per-VAP local value is assigned and 567 usually distributed by a control plane protocol. An example of 568 this approach is the use of per CE-PE MPLS labels in IP VPN 569 [RFC4364].

571 Note that when using one VN Context per VNI or per VAP, an 572 additional global identifier may be used by the control plane to 573 identify the Tenant context.

575 3.1.4. Tunnel Overlays and Encapsulation options

577 Once the VN context is added to the frame, an L3 tunnel encapsulation 578 is used to transport the frame to the destination NVE. The backbone 579 devices do not usually keep any per service state, simply forwarding 580 the frames based on the outer tunnel header.

582 Different IP tunneling options (GRE/L2TP/IPSec) and MPLS-based tunneling 583 options (BGP VPN, PW, VPLS) are available for both Ethernet and IP 584 formats.
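As a purely illustrative, non-normative sketch of the data plane behavior described in Sections 3.1.3 and 3.1.4, the fragment below models an ingress NVE adding a VN Context and an outer L3 tunnel header, and an egress NVE using the VN Context to select the local VNI. The field layout (an explicit VNID) and all names are assumptions of the example, not of this framework.

      # Hypothetical sketch; the encapsulation layout is illustrative only.
      from dataclasses import dataclass

      @dataclass
      class OverlayPacket:
          outer_src: str    # L3 address identifying the source (ingress) NVE
          outer_dst: str    # L3 address identifying the destination (egress) NVE
          vn_context: int   # VN Context (here an explicit VNID)
          payload: bytes    # original tenant frame

      def encapsulate(ingress_ip: str, egress_ip: str, vnid: int,
                      tenant_frame: bytes) -> OverlayPacket:
          # Ingress NVE: add the VN Context and the outer tunnel header.
          return OverlayPacket(ingress_ip, egress_ip, vnid, tenant_frame)

      def decapsulate(pkt: OverlayPacket, vni_table: dict):
          # Egress NVE: the VN Context identifies the tenant service
          # instance (VNI) that processes and forwards the frame.
          return vni_table[pkt.vn_context], pkt.payload

      vni_table = {10: "tenant-1 L2 VNI"}      # VNIs local to the egress NVE
      pkt = encapsulate("192.0.2.1", "192.0.2.2", vnid=10,
                        tenant_frame=b"original tenant frame")
      vni, frame = decapsulate(pkt, vni_table)

In line with the text above, backbone devices would forward such packets on the outer header only, without inspecting the VN Context or the tenant payload.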
586 3.1.5. Control Plane Components

588 Control plane components may be used to provide the following 589 capabilities:

591 . Auto-provisioning/Service discovery

593 . Address advertisement and tunnel mapping

595 . Tunnel management

597 A control plane component can be an on-net control protocol or a 598 management control entity.

600 3.1.5.1. Distributed vs Centralized Control Plane

602 A control/management plane entity can be centralized or distributed. 603 Both approaches have been used extensively in the past. The routing 604 model of the Internet is a good example of a distributed approach. 605 Transport networks have usually used a centralized approach to 606 manage transport paths.

608 It is also possible to combine the two approaches, i.e. to use a 609 hybrid model. A global view of network state can have many benefits, 610 but it does not preclude the use of distributed protocols within the 611 network. Centralized controllers provide a facility to maintain 612 global state and distribute that state to the network, which, in combination 613 with distributed protocols, can aid in achieving greater network 614 efficiency and in improving reliability and robustness. Domain and/or 615 deployment specific constraints define the balance between 616 centralized and distributed approaches.

618 On one hand, a control plane module can reside in every NVE. This is 619 how routing control plane modules are implemented in routers. On the 620 other hand, an external controller can manage a group of NVEs via an 621 agent sitting in each NVE. This is how an SDN controller could 622 communicate with the nodes it controls, via OpenFlow for instance.

624 In the case where a centralized control plane is preferred, the 625 controller will need to be distributed to more than one node for 626 redundancy. Depending upon the size of the DC domain, hence the 627 number of NVEs to manage, it should be possible to use several 628 external controllers. Inter-controller communication will thus be 629 necessary for scalability and redundancy.

631 3.1.5.2. Auto-provisioning/Service discovery

633 NVEs must be able to select the appropriate VNI for each Tenant 634 System. This is based on state information that is often provided by 635 external entities. For example, in a VM environment, this 636 information is provided by compute management systems, since these 637 are the only entities that have visibility into which VM belongs to 638 which tenant.

640 A mechanism for communicating this information between Tenant 641 Systems and the local NVE is required. As a result, the VAPs are 642 created and mapped to the appropriate VNI.

644 Depending upon the implementation, this control interface can be 645 implemented using an auto-discovery protocol between Tenant Systems 646 and their local NVE or through management entities.

648 When a protocol is used, appropriate security and authentication 649 mechanisms to verify that Tenant System information is not spoofed 650 or altered are required. This is one critical aspect for providing 651 integrity and tenant isolation in the system.

653 Another control plane protocol can also be used to advertise 654 supported VNs to other NVEs. Alternatively, management control 655 entities can also be used to perform these functions.

657 3.1.5.3. Address advertisement and tunnel mapping

659 As traffic reaches an ingress NVE, a lookup is performed to 660 determine which tunnel the packet needs to be sent to. It is then 661 encapsulated with a tunnel header containing the destination address 662 of the egress overlay node. Intermediate nodes (between the ingress 663 and egress NVEs) switch or route traffic based upon the outer 664 destination address.

666 One key step in this process consists of mapping a final destination 667 address to the proper tunnel. NVEs are responsible for maintaining 668 such mappings in their lookup tables. Several ways of populating 669 these lookup tables are possible: control plane driven, management 670 plane driven, or data plane driven.
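As a non-normative illustration of the mapping step described above, the following sketch shows an ingress NVE resolving a tenant destination address to an egress NVE tunnel endpoint and VN Context through such a lookup table. How the table is populated (control plane, management plane or data plane driven) is deliberately left abstract, and all names are assumptions of the example.

      # Hypothetical sketch; table layout and function names are illustrative.
      from typing import Dict, NamedTuple, Optional

      class TunnelMapping(NamedTuple):
          egress_nve: str   # outer (underlay) address of the egress NVE
          vn_context: int   # VN Context to encode toward that NVE

      # Per-VNI lookup table: inner destination (MAC or IP) -> mapping.
      lookup_table: Dict[str, TunnelMapping] = {}

      def on_advertisement(inner_addr: str, egress_nve: str, vn_context: int):
          # Invoked when an address advertisement is received, or when a
          # management entity pushes the mapping.
          lookup_table[inner_addr] = TunnelMapping(egress_nve, vn_context)

      def ingress_lookup(inner_dst: str) -> Optional[TunnelMapping]:
          # Returns the tunnel to use, or None for an unknown destination
          # (which would be flooded or dropped; see Section 4.2.3).
          return lookup_table.get(inner_dst)

      on_advertisement("00:00:5e:00:53:01", egress_nve="192.0.2.2",
                       vn_context=10)
      print(ingress_lookup("00:00:5e:00:53:01"))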
672 When a control plane protocol is used to distribute address 673 advertisement and tunneling information, the auto- 674 provisioning/Service discovery could be accomplished by the same 675 protocol. In this scenario, the auto-provisioning/Service discovery 676 could be combined with (be inferred from) the address advertisement 677 and tunnel mapping. Furthermore, a control plane protocol that 678 carries both MAC and IP addresses eliminates the need for ARP, and 679 hence addresses one of the issues with explosive ARP handling.

681 3.1.5.4. Tunnel management

683 A control plane protocol may be required to exchange tunnel state 684 information. This may include setting up tunnels and/or providing 685 tunnel state information.

687 This applies to both unicast and multicast tunnels.

689 For instance, it may be necessary to provide active/standby status 690 information between NVEs, up/down status information, 691 pruning/grafting information for multicast tunnels, etc.

693 3.2. Multi-homing

695 Multi-homing techniques can be used to increase the reliability of 696 an nvo3 network. It is also important to ensure that physical 697 diversity in an nvo3 network is taken into account to avoid single 698 points of failure.

700 Multi-homing can be enabled at various levels: from tenant systems 701 into ToRs, from ToRs into core switches/routers, and from core nodes into DC 702 GWs.

704 The nvo3 underlay nodes (i.e. from NVEs to DC GWs) rely on IP 705 routing and/or ECMP techniques as the means to re-route traffic upon 706 failures.

708 Tenant systems can either be L2 or L3 nodes. In the former case 709 (L2), techniques such as LAG or STP can be used. In the 710 latter case (L3), it is possible that no dynamic routing protocol is 711 enabled. Tenant systems can be multi-homed into a remote NVE using 712 several interfaces (physical NICs or vNICs) with an IP address per 713 interface either to the same nvo3 network or into different nvo3 714 networks. When one of the links fails, the corresponding IP is not 715 reachable but the other interfaces can still be used. When a tenant 716 system is co-located with an NVE, IP routing can be relied upon to 717 handle routing over diverse links to ToRs.

719 External connectivity is handled by two or more nvo3 gateways. Each 720 gateway is connected to a different domain (e.g. ISP) and runs BGP 721 multi-homing. They serve as access points to external networks 722 such as VPNs or the Internet. When a connection to an upstream 723 router is lost, the alternative connection is used and the failed 724 route is withdrawn.

726 3.3. Service Overlay Topologies

728 A number of service topologies may be used to optimize the service 729 connectivity and to address NVE performance limitations.

731 The topology described in Figure 3 suggests the use of a tunnel mesh 732 between the NVEs where each tenant instance is one hop away from a 733 service processing perspective. Partial mesh topologies and an NVE 734 hierarchy may be used where certain NVEs may act as service transit 735 points.
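Purely as an illustrative aid, and not as a provisioning model defined by this framework, the sketch below shows how the set of tunnels an NVE establishes could differ between the full mesh suggested by Figure 3 and a hub-and-spoke hierarchy in which some NVEs act as service transit points; the topology names and the function itself are assumptions of the example.

      # Hypothetical sketch; topology policy names are illustrative only.
      def tunnel_peers(nve, all_nves, topology, hubs=()):
          """Return the set of peer NVEs this NVE establishes tunnels to."""
          peers = set(all_nves) - {nve}
          if topology == "full-mesh":
              # Every tenant instance is one service hop away.
              return peers
          if topology == "hub-and-spoke":
              # Spokes tunnel only to hubs; hubs (service transit points)
              # tunnel to all peers.
              return peers if nve in hubs else peers & set(hubs)
          raise ValueError("unknown topology")

      nves = ["nve-1", "nve-2", "nve-3", "hub-nve"]
      print(tunnel_peers("nve-1", nves, "full-mesh"))
      print(tunnel_peers("nve-1", nves, "hub-and-spoke", hubs=["hub-nve"]))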
737 4. Key aspects of overlay networks

739 The intent of this section is to highlight specific issues that 740 proposed overlay solutions need to address.

742 4.1. Pros & Cons

744 An overlay network is a layer of virtual network topology on top of 745 the physical network.

747 Overlay networks offer the following key advantages:

749 o Unicast tunneling state management is handled at the edge of 750 the network. Intermediate transport nodes are unaware of such 751 state. Note that this is not the case when multicast is enabled 752 in the core network.

754 o Tunnels are used to aggregate traffic and hence offer the 755 advantage of minimizing the amount of forwarding state required 756 within the underlay network

758 o Decoupling of the overlay addresses (MAC and IP) used by VMs 759 from the underlay network. This offers a clear separation 760 between addresses used within the overlay and the underlay 761 networks and it enables the use of overlapping address spaces 762 by Tenant Systems

764 o Support of a large number of virtual network identifiers

766 Overlay networks also create several challenges:

768 o Overlay networks have no control over underlay networks and lack 769 critical network information 770 o Overlays typically probe the network to measure link 771 properties, such as available bandwidth or packet loss 772 rate. It is difficult to accurately evaluate network 773 properties. It might be preferable for the underlay 774 network to expose usage and performance information.

776 o Miscommunication between overlay and underlay networks can lead 777 to an inefficient usage of network resources.

779 o Fairness of resource sharing and collaboration among end-nodes 780 in overlay networks are two critical issues

782 o When multiple overlays co-exist on top of a common underlay 783 network, the lack of coordination between overlays can lead to 784 performance issues.

786 o Overlaid traffic may not traverse firewalls and NAT devices.

788 o Multicast service scalability. Multicast support may be 789 required in the overlay network to address, for each tenant, 790 flood containment or efficient multicast handling.

792 o Hash-based load balancing may not be optimal as the hash 793 algorithm may not work well due to the limited number of 794 combinations of tunnel source and destination addresses

796 4.2. Overlay issues to consider

798 4.2.1. Data plane vs Control plane driven

800 In the case of an L2 NVE, it is possible to dynamically learn MAC 801 addresses against VAPs. It is also possible that such addresses be 802 known and controlled via management or a control protocol for both 803 L2 NVEs and L3 NVEs.

805 Dynamic data plane learning implies that flooding of unknown 806 destinations be supported and hence implies that broadcast and/or 807 multicast be supported. Multicasting in the core network for dynamic 808 learning may lead to significant scalability limitations. Specific 809 forwarding rules must be enforced to prevent loops from happening. 810 This can be achieved using a spanning tree, a shortest path tree, or 811 a split-horizon mesh.

813 It should be noted that the amount of state to be distributed is 814 dependent upon network topology and the number of virtual machines. 815 Different forms of caching can also be utilized to minimize state 816 distribution between the various elements. The control plane should 817 not require an NVE to maintain the locations of all the tenant 818 systems whose VNs are not present on the NVE.

820 4.2.2. Coordination between data plane and control plane

822 For an L2 NVE, the NVE needs to be able to determine MAC addresses 823 of the end systems present on a VAP. This can be achieved via 824 data plane learning or a control plane. For an L3 NVE, the NVE needs 825 to be able to determine IP addresses of the end systems present on a 826 VAP.

828 In both cases, coordination with the NVE control protocol is needed 829 such that when the NVE determines that the set of addresses behind a 830 VAP has changed, it triggers the local NVE control plane to 831 distribute this information to its peers.
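To make this coordination concrete, here is a small, non-normative sketch of an L2 NVE learning MAC addresses against a VAP in the data plane and notifying its local control plane whenever the set of addresses behind that VAP changes. The advertise callback and all names are illustrative assumptions; they stand in for whatever control plane or management mechanism a solution actually uses.

      # Hypothetical sketch; the advertise callback is illustrative only.
      from collections import defaultdict

      class L2VniLearning:
          def __init__(self, advertise):
              self.vap_macs = defaultdict(set)  # VAP -> MACs learned behind it
              self.advertise = advertise        # local control plane hook

          def on_frame(self, vap: str, src_mac: str) -> None:
              # Data plane learning: MAC addresses are learned against VAPs.
              if src_mac not in self.vap_macs[vap]:
                  self.vap_macs[vap].add(src_mac)
                  # The set of addresses behind the VAP changed: trigger the
                  # control plane to distribute this information to peers.
                  self.advertise(vap, src_mac)

      vni = L2VniLearning(advertise=lambda vap, mac: print("advertise", vap, mac))
      vni.on_frame("vap-1", "00:00:5e:00:53:01")  # new address: advertised
      vni.on_frame("vap-1", "00:00:5e:00:53:01")  # already known: no action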
833 4.2.3. Handling Broadcast, Unknown Unicast and Multicast (BUM) traffic

835 There are two techniques to support packet replication needed for 836 broadcast, unknown unicast and multicast:

838 o Ingress replication

840 o Use of core multicast trees

842 There is a bandwidth vs state trade-off between the two approaches. 843 Depending upon the degree of replication required (i.e. the number 844 of hosts per group) and the amount of multicast state to maintain, 845 trading bandwidth for state is a consideration.

847 When the number of hosts per group is large, the use of core 848 multicast trees may be more appropriate. When the number of hosts is 849 small (e.g. 2-3), ingress replication may not be an issue.

851 Depending upon the size of the data center network and hence the 852 number of (S,G) entries, but also the duration of multicast flows, 853 the use of core multicast trees can be a challenge.

855 When flows are well known, it is possible to pre-provision such 856 multicast trees. However, it is often difficult to predict 857 application flows ahead of time, and hence programming of (S,G) 858 entries for short-lived flows could be impractical.

860 A possible trade-off is to use shared multicast trees in the core as 861 opposed to dedicated multicast trees.

863 4.2.4. Path MTU

865 When using overlay tunneling, an outer header is added to the 866 original frame. This can cause the MTU of the path to the egress 867 tunnel endpoint to be exceeded.

869 In this section, we will only consider the case of an IP overlay.

871 It is usually not desirable to rely on IP fragmentation for 872 performance reasons. Ideally, the interface MTU as seen by a Tenant 873 System is adjusted such that no fragmentation is needed. TCP will 874 adjust its maximum segment size accordingly.

876 It is possible for the MTU to be configured manually or to be 877 discovered dynamically. Various Path MTU discovery techniques exist 878 in order to determine the proper MTU size to use:

880 o Classical ICMP-based MTU Path Discovery [RFC1191] [RFC1981]

882 o Tenant Systems rely on ICMP messages to discover the MTU of 884 the end-to-end path to their destinations. This method is not 885 always possible, such as when traversing middle boxes 886 (e.g. firewalls) which disable ICMP for security reasons

888 o Extended MTU Path Discovery techniques such as defined in 889 [RFC4821]

891 It is also possible to rely on the overlay layer to perform 892 segmentation and reassembly operations without relying on the Tenant 893 Systems to know about the end-to-end MTU. The assumption is that 894 some hardware assist is available on the NVE node to perform such 895 SAR operations. However, fragmentation by the overlay layer can lead 896 to performance and congestion issues due to TCP dynamics and might 897 require new congestion avoidance mechanisms from the underlay 898 network [FLOYD].

900 Finally, the underlay network may be designed in such a way that the 901 MTU can accommodate the extra tunnel overhead.
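As a simple worked example of the MTU adjustment discussed above, the sketch below computes the largest tenant packet that fits without fragmentation when an L2 frame is carried over an IPv4/UDP-based overlay. The per-packet overhead figures are assumptions of the example, since this framework does not mandate a particular encapsulation.

      # Hypothetical sketch; overhead values are illustrative assumptions:
      # outer IPv4 (20 B) + outer UDP (8 B) + 8-B overlay header + inner
      # Ethernet header (14 B) = 50 B of tunnel overhead.
      def tenant_mtu(underlay_mtu: int, outer_ip: int = 20, outer_udp: int = 8,
                     overlay_hdr: int = 8, inner_eth: int = 14) -> int:
          """Largest tenant IP packet that avoids fragmentation."""
          return underlay_mtu - (outer_ip + outer_udp + overlay_hdr + inner_eth)

      print(tenant_mtu(1500))   # 1450: tenant interface MTU must be lowered
      print(tenant_mtu(1600))   # 1550: a 1500-byte tenant MTU fits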
903 4.2.5. NVE location trade-offs

905 In the case of DC traffic, traffic originated from a VM is native 906 Ethernet traffic. This traffic can be switched by a local VM switch 907 or ToR switch and then by a DC gateway. The NVE function can be 908 embedded within any of these elements.

910 There are several criteria to consider when deciding where the NVE 911 processing boundary lies:

913 o Processing and memory requirements

915 o Datapath (e.g. lookups, filtering, 916 encapsulation/decapsulation)

918 o Control plane processing (e.g. routing, signaling, OAM)

920 o FIB/RIB size

922 o Multicast support

924 o Routing protocols

926 o Packet replication capability

928 o Fragmentation support

930 o QoS transparency

932 o Resiliency

934 4.2.6. Interaction between network overlays and underlays

936 When multiple overlays co-exist on top of a common underlay network, 937 this can cause some performance issues. These overlays have 938 partially overlapping paths and nodes.

940 Each overlay is selfish by nature in that it sends traffic so as to 941 optimize its own performance without considering the impact on other 942 overlays, unless the underlay tunnels are traffic engineered on a 943 per overlay basis so as to avoid sharing underlay resources.

945 Better visibility between overlays and underlays can be achieved by 946 providing mechanisms to exchange information about:

948 o Performance metrics (throughput, delay, loss, jitter)

950 o Cost metrics

952 5. Security Considerations

954 As a framework document, no protocols are being defined and hence no 955 specific security considerations are raised.

957 The following security aspects shall be discussed in the respective 958 solutions documents:

960 Traffic isolation between NVO3 domains is guaranteed by the use of 961 per tenant FIB tables (VNIs).

963 The creation of overlay networks and the tenant to overlay mapping 964 function can introduce significant security risks. When dynamic 965 protocols are used, authentication should be supported. When a 966 centralized controller is used, access to that controller should be 967 restricted to authorized personnel. This can be achieved via login 968 authentication.

970 6. IANA Considerations

972 IANA does not need to take any action for this draft.

974 7. References

976 7.1. Normative References

978 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 979 Requirement Levels", BCP 14, RFC 2119, March 1997.

981 7.2. Informative References

983 [NVOPS] Narten, T. et al., "Problem Statement: Overlays for Network 984 Virtualization", draft-narten-nvo3-overlay-problem- 985 statement (work in progress).

987 [OVCPREQ] Kreeger, L. et al., "Network Virtualization Overlay Control 988 Protocol Requirements", draft-kreeger-nvo3-overlay-cp 989 (work in progress).

991 [FLOYD] Floyd, S. and Romanow, A., "Dynamics of TCP Traffic over 992 ATM Networks", IEEE JSAC, V. 13 N. 4, May 1995.

994 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 995 Networks (VPNs)", RFC 4364, February 2006.

997 [RFC1191] Mogul, J., "Path MTU Discovery", RFC 1191, November 1990.

999 [RFC1981] McCann, J. et al., "Path MTU Discovery for IPv6", RFC 1981, 1000 August 1996.

1002 [RFC4821] Mathis, M. et al., "Packetization Layer Path MTU 1003 Discovery", RFC 4821, March 2007.

1005 8. Acknowledgments

1007 In addition to the authors, the following people have contributed to 1008 this document:

1010 Dimitrios Stiliadis, Rotem Salomonovitch, Alcatel-Lucent

1012 Lucy Yong, Huawei

1014 This document was prepared using 2-Word-v2.0.template.dot.

1016 Authors' Addresses

1018 Marc Lasserre 1019 Alcatel-Lucent 1020 Email: marc.lasserre@alcatel-lucent.com

1022 Florin Balus 1023 Alcatel-Lucent 1024 777 E.
Middlefield Road 1025 Mountain View, CA, USA 94043 1026 Email: florin.balus@alcatel-lucent.com 1028 Thomas Morin 1029 France Telecom Orange 1030 Email: thomas.morin@orange.com 1032 Nabil Bitar 1033 Verizon 1034 40 Sylvan Road 1035 Waltham, MA 02145 1036 Email: nabil.bitar@verizon.com 1038 Yakov Rekhter 1039 Juniper 1040 Email: yakov@juniper.net