BESS Workgroup                                    A. Sajassi (Editor)
INTERNET-DRAFT                                    Cisco
Intended Status: Standards Track                  J. Drake (Editor)
                                                  Juniper
                                                  N. Bitar
                                                  Nokia
                                                  R. Shekhar
                                                  Juniper
                                                  J. Uttaro
                                                  AT&T
                                                  W. Henderickx
                                                  Nokia

Expires: August 9, 2018                           February 9, 2018


          A Network Virtualization Overlay Solution using EVPN
                    draft-ietf-bess-evpn-overlay-12

Abstract

   This document specifies how Ethernet VPN (EVPN) can be used as a
   Network Virtualization Overlay (NVO) solution and explores the
   various tunnel encapsulation options over IP and their impact on the
   EVPN control-plane and procedures. In particular, the following
   encapsulation options are analyzed: Virtual Extensible LAN (VXLAN),
   Network Virtualization using Generic Routing Encapsulation (NVGRE),
   and MPLS over Generic Routing Encapsulation (GRE). This specification
   is also applicable to Generic Network Virtualization Encapsulation
   (GENEVE) encapsulation; however, some incremental work is required
   which will be covered in a separate document. This document also
   specifies new multi-homing procedures for split-horizon filtering and
   mass-withdraw. It also specifies EVPN route constructions for
   VXLAN/NVGRE encapsulations and Autonomous System Boundary Router
   (ASBR) procedures for multi-homing of Network Virtualization (NV)
   Edge devices.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1  Introduction
   2  Requirements Notation and Conventions
   3  Terminology
   4  EVPN Features
   5  Encapsulation Options for EVPN Overlays
     5.1  VXLAN/NVGRE Encapsulation
       5.1.1  Virtual Identifiers Scope
         5.1.1.1  Data Center Interconnect with Gateway
         5.1.1.2  Data Center Interconnect without Gateway
       5.1.2  Virtual Identifiers to EVI Mapping
         5.1.2.1  Auto Derivation of RT
       5.1.3  Constructing EVPN BGP Routes
     5.2  MPLS over GRE
   6  EVPN with Multiple Data Plane Encapsulations
   7  Single-Homing NVEs - NVE Residing in Hypervisor
     7.1  Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE
          Encapsulation
     7.2  Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation
   8  Multi-Homing NVEs - NVE Residing in ToR Switch
     8.1  EVPN Multi-Homing Features
       8.1.1  Multi-homed Ethernet Segment Auto-Discovery
       8.1.2  Fast Convergence and Mass Withdraw
       8.1.3  Split-Horizon
       8.1.4  Aliasing and Backup-Path
       8.1.5  DF Election
     8.2  Impact on EVPN BGP Routes & Attributes
     8.3  Impact on EVPN Procedures
       8.3.1  Split Horizon
       8.3.2  Aliasing and Backup-Path
       8.3.3  Unknown Unicast Traffic Designation
   9  Support for Multicast
   10  Data Center Interconnections - DCI
     10.1  DCI using GWs
     10.2  DCI using ASBRs
       10.2.1  ASBR Functionality with Single-Homing NVEs
       10.2.2  ASBR Functionality with Multi-Homing NVEs
   11  Acknowledgement
   12  Security Considerations
   13  IANA Considerations
   14  References
     14.1  Normative References
     14.2  Informative References
   Contributors
   Authors' Addresses

1  Introduction

   This document specifies how Ethernet VPN (EVPN) [RFC7432] can be used
   as a Network Virtualization Overlay (NVO) solution and explores the
   various tunnel encapsulation options over IP and their impact on the
   EVPN control-plane and procedures. In particular, the following
   encapsulation options are analyzed: Virtual Extensible LAN (VXLAN)
   [RFC7348], Network Virtualization using Generic Routing Encapsulation
   (NVGRE) [RFC7637], and MPLS over Generic Routing Encapsulation (GRE)
   [RFC4023]. This specification is also applicable to Generic Network
   Virtualization Encapsulation (GENEVE) encapsulation [GENEVE];
   however, some incremental work is required which will be covered in a
   separate document [EVPN-GENEVE]. This document also specifies new
   multi-homing procedures for split-horizon filtering and
   mass-withdraw. It also specifies EVPN route constructions for
   VXLAN/NVGRE encapsulations and Autonomous System Boundary Router
   (ASBR) procedures for multi-homing of Network Virtualization (NV)
   Edge devices.

   In the context of this document, a Network Virtualization Overlay
   (NVO) is a solution to address the requirements of a multi-tenant
   data center, especially one with virtualized hosts, e.g., Virtual
   Machines (VMs) or virtual workloads.
   The key requirements of such a
   solution, as described in [RFC7364], are:

   - Isolation of network traffic per tenant

   - Support for a large number of tenants (tens or hundreds of
     thousands)

   - Extending L2 connectivity among different VMs belonging to a given
     tenant segment (subnet) across different Points of Delivery (PODs)
     within a data center or between different data centers

   - Allowing a given VM to move between different physical points of
     attachment within a given L2 segment

   The underlay network for NVO solutions is assumed to provide IP
   connectivity between NVO endpoints (NVEs).

   This document describes how Ethernet VPN (EVPN) can be used as an NVO
   solution and explores applicability of EVPN functions and procedures.
   In particular, it describes the various tunnel encapsulation options
   for EVPN over IP, and their impact on the EVPN control-plane and
   procedures for two main scenarios:

   a) single-homing NVEs - when an NVE resides in the hypervisor, and
   b) multi-homing NVEs - when an NVE resides in a Top of Rack (ToR)
      device

   The possible encapsulation options for EVPN overlays that are
   analyzed in this document are:

   - VXLAN and NVGRE
   - MPLS over GRE

   Before getting into the description of the different encapsulation
   options for EVPN over IP, it is important to highlight the EVPN
   solution's main features, how those features are currently supported,
   and any impact that the encapsulation has on those features.

2  Requirements Notation and Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3  Terminology

   Most of the terminology used in this document comes from [RFC7432]
   and [RFC7365].

   VXLAN: Virtual Extensible LAN

   GRE: Generic Routing Encapsulation

   NVGRE: Network Virtualization using Generic Routing Encapsulation

   GENEVE: Generic Network Virtualization Encapsulation

   POD: Point of Delivery

   NV: Network Virtualization

   NVO: Network Virtualization Overlay

   NVE: Network Virtualization Endpoint

   VNI: Virtual Network Identifier (for VXLAN)

   VSID: Virtual Subnet Identifier (for NVGRE)

   EVPN: Ethernet VPN

   EVI: An EVPN instance spanning the Provider Edge (PE) devices
   participating in that EVPN

   MAC-VRF: A Virtual Routing and Forwarding table for Media Access
   Control (MAC) addresses on a PE

   IP-VRF: A Virtual Routing and Forwarding table for Internet Protocol
   (IP) addresses on a PE

   Ethernet Segment (ES): When a customer site (device or network) is
   connected to one or more PEs via a set of Ethernet links, then that
   set of links is referred to as an 'Ethernet segment'.

   Ethernet Segment Identifier (ESI): A unique non-zero identifier that
   identifies an Ethernet segment is called an 'Ethernet Segment
   Identifier'.

   Ethernet Tag: An Ethernet tag identifies a particular broadcast
   domain, e.g., a VLAN. An EVPN instance consists of one or more
   broadcast domains.

   PE: Provider Edge device.

   Single-Active Redundancy Mode: When only a single PE, among all the
   PEs attached to an Ethernet segment, is allowed to forward traffic
   to/from that Ethernet segment for a given VLAN, then the Ethernet
   segment is defined to be operating in Single-Active redundancy mode.

   All-Active Redundancy Mode: When all PEs attached to an Ethernet
   segment are allowed to forward known unicast traffic to/from that
   Ethernet segment for a given VLAN, then the Ethernet segment is
   defined to be operating in All-Active redundancy mode.

   PIM-SM: Protocol Independent Multicast - Sparse-Mode

   PIM-SSM: Protocol Independent Multicast - Source Specific Multicast

   Bidir PIM: Bidirectional PIM

4  EVPN Features

   EVPN [RFC7432] was originally designed to support the requirements
   detailed in [RFC7209] and therefore has the following attributes
   which directly address control plane scaling and ease of deployment
   issues.

   1) Control plane information is distributed with BGP, and broadcast
   and multicast traffic is sent using a shared multicast tree or with
   ingress replication.

   2) Control plane learning is used for MAC (and IP) addresses instead
   of data plane learning. The latter requires the flooding of unknown
   unicast and Address Resolution Protocol (ARP) frames, whereas the
   former does not require any flooding.

   3) Route Reflector (RR) is used to reduce a full mesh of BGP sessions
   among PE devices to a single BGP session between a PE and the RR.
   Furthermore, RR hierarchy can be leveraged to scale the number of BGP
   routes on the RR.

   4) Auto-discovery via BGP is used to discover PE devices
   participating in a given VPN, PE devices participating in a given
   redundancy group, tunnel encapsulation types, multicast tunnel type,
   multicast members, etc.

   5) All-Active multihoming is used. This allows a given customer
   device (CE) to have multiple links to multiple PEs, and traffic
   to/from that CE fully utilizes all of these links.

   6) When a link between a CE and a PE fails, the PEs for that EVI are
   notified of the failure via the withdrawal of a single EVPN route.
   This allows those PEs to remove the withdrawing PE as a next hop for
   every MAC address associated with the failed link. This is termed
   'mass withdrawal'.

   7) BGP route filtering and constrained route distribution are
   leveraged to ensure that the control plane traffic for a given EVI is
   only distributed to the PEs in that EVI.

   8) When an 802.1Q interface is used between a CE and a PE, each of
   the VLAN IDs (VIDs) on that interface can be mapped onto a bridge
   table (for up to 4094 such bridge tables). All these bridge tables
   may be mapped onto a single MAC-VRF (in case of VLAN-aware bundle
   service).

   9) VM Mobility mechanisms ensure that all PEs in a given EVI know
   the ES with which a given VM, as identified by its MAC and IP
   addresses, is currently associated.

   10) Route Targets are used to allow the operator (or customer) to
   define a spectrum of logical network topologies including mesh, hub &
   spoke, and extranets (e.g., a VPN whose sites are owned by different
   enterprises), without the need for proprietary software or the aid of
   other virtual or physical devices.

   Because the design goal for NVO is millions of instances per common
   physical infrastructure, the scaling properties of the control plane
   for NVO are extremely important. EVPN and the extensions described
   herein are designed with this level of scalability in mind.

5  Encapsulation Options for EVPN Overlays

5.1  VXLAN/NVGRE Encapsulation

   Both VXLAN and NVGRE are examples of technologies that provide a data
   plane encapsulation which is used to transport a packet over the
   common physical IP infrastructure between Network Virtualization
   Edges (NVEs) - e.g., VXLAN Tunnel End Points (VTEPs) in a VXLAN
   network. Both of these technologies include the identifier of the
   specific NVO instance, Virtual Network Identifier (VNI) in VXLAN and
   Virtual Subnet Identifier (VSID) in NVGRE, in each packet. In the
   remainder of this document we use VNI as the representation for NVO
   instance with the understanding that VSID can equally be used if the
   encapsulation is NVGRE unless it is stated otherwise.

   Note that a Provider Edge (PE) is equivalent to an NVE/VTEP.

   VXLAN encapsulation is based on UDP, with an 8-byte header following
   the UDP header. VXLAN provides a 24-bit VNI, which typically provides
   a one-to-one mapping to the tenant VLAN ID, as described in
   [RFC7348]. In this scenario, the ingress VTEP does not include an
   inner VLAN tag on the encapsulated frame, and the egress VTEP
   discards the frames with an inner VLAN tag. This mode of operation in
   [RFC7348] maps to VLAN Based Service in [RFC7432], where a tenant
   VLAN ID gets mapped to an EVPN instance (EVI).

   VXLAN also provides an option of including an inner VLAN tag in the
   encapsulated frame, if explicitly configured at the VTEP. This mode
   of operation can map to VLAN Bundle Service in [RFC7432] because all
   the tenant's tagged frames map to a single bridge table / MAC-VRF,
   and the inner VLAN tag is not used for lookup by the disposition PE
   when performing VXLAN decapsulation as described in section 6 of
   [RFC7348].

   [RFC7637] encapsulation is based on GRE encapsulation and it mandates
   the inclusion of the optional GRE Key field which carries the VSID.
   There is a one-to-one mapping between the VSID and the tenant VLAN
   ID, as described in [RFC7637], and the inclusion of an inner VLAN tag
   is prohibited. This mode of operation in [RFC7637] maps to VLAN Based
   Service in [RFC7432].

   As described in the next section, there is no change to the encoding
   of EVPN routes to support VXLAN or NVGRE encapsulation except for the
   use of the BGP Encapsulation extended community to indicate the
   encapsulation type (e.g., VXLAN or NVGRE). However, there is
   potential impact to the EVPN procedures depending on where the NVE is
   located (i.e., in hypervisor or ToR) and whether multi-homing
   capabilities are required.
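
   To make the 8-byte VXLAN header and its 24-bit VNI concrete, the
   following Python sketch packs and parses that header. It is only an
   illustration: the field layout (8 flag bits with the "I" flag, 24
   reserved bits, the 24-bit VNI, 8 reserved bits) is taken from
   [RFC7348] rather than from this document, and the function names
   are hypothetical.

```python
import struct

# "I" flag of the VXLAN header: set when the VNI field is valid.
# Layout assumed from RFC 7348, not defined by this document.
VXLAN_FLAG_VNI_VALID = 0x08


def pack_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header carrying the given 24-bit VNI."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes following the UDP header."""
    if len(header) < 8:
        raise ValueError("VXLAN header is 8 bytes long")
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word2 >> 8
```

   An NVGRE sender would instead carry the 24-bit VSID in the GRE Key
   field; that variant is not shown here.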

5.1.1  Virtual Identifiers Scope

   Although VNIs are defined as 24-bit globally unique values, there are
   scenarios in which it is desirable to use a locally significant value
   for VNI, especially in the context of data center interconnect:

5.1.1.1  Data Center Interconnect with Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use VNIs as globally unique
   identifiers within a data center, then a Gateway needs to be employed
   at the edge of the data center network. This is because the Gateway
   will provide the functionality of translating the VNI when crossing
   network boundaries, which may align with operator span of control
   boundaries. As an example, consider the network of Figure 1 below.
   Assume there are three network operators: one for each of the DC1,
   DC2 and WAN networks. The Gateways at the edge of the data centers
   are responsible for translating the VNIs between the values used in
   each of the data center networks and the values used in the WAN.

                          +--------------+
                          |              |
           +---------+    |     WAN      |    +---------+
   +----+  |        +---+  +----+  +----+  +---+        |  +----+
   |NVE1|--|        |   |  |WAN |  |WAN |  |   |        |--|NVE3|
   +----+  |IP      |GW |--|Edge|  |Edge|--|GW | IP     |  +----+
   +----+  |Fabric  +---+  +----+  +----+  +---+ Fabric |  +----+
   |NVE2|--|         |    |              |    |         |--|NVE4|
   +----+  +---------+    +--------------+    +---------+  +----+

   |<------ DC 1 ------>                     <------ DC2 ------>|

             Figure 1: Data Center Interconnect with Gateway

5.1.1.2  Data Center Interconnect without Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use locally assigned VNIs (e.g.,
   similar to MPLS labels), then there may be no need to employ Gateways
   at the edge of the data center network.
   More specifically, the VNI
   value that is used by the transmitting NVE is allocated by the NVE
   that is receiving the traffic (in other words, this is similar to a
   "downstream assigned" MPLS label). This allows the VNI space to be
   decoupled between different data center networks without the need for
   a dedicated Gateway at the edge of the data centers. This topic is
   covered in section 10.2.

                        +--------------+
                        |              |
           +---------+  |     WAN      | +---------+
   +----+  |         |   +----+  +----+  |         |  +----+
   |NVE1|--|         |   |ASBR|  |ASBR|  |         |--|NVE3|
   +----+  |IP Fabric|---|    |  |    |--|IP Fabric|  +----+
   +----+  |         |   +----+  +----+  |         |  +----+
   |NVE2|--|         |  |              | |         |--|NVE4|
   +----+  +---------+  +--------------+ +---------+  +----+

   |<------ DC 1 ----->                   <---- DC2 ------>|

             Figure 2: Data Center Interconnect with ASBR

5.1.2  Virtual Identifiers to EVI Mapping

   When the EVPN control plane is used in conjunction with VXLAN (or
   NVGRE) encapsulation, just as in [RFC7432] where two options existed
   for mapping broadcast domains (represented by VLAN IDs) to an EVI,
   there are also two options for mapping broadcast domains represented
   by VXLAN VNIs (or NVGRE VSIDs) to an EVI:

   1. Option 1: Single Broadcast Domain per EVI

   In this option, a single Ethernet broadcast domain (e.g., subnet)
   represented by a VNI is mapped to a unique EVI. This corresponds to
   the VLAN Based service in [RFC7432], where a tenant-facing interface,
   logical interface (e.g., represented by a VLAN ID) or physical, gets
   mapped to an EVPN instance (EVI). As such, a BGP RD and RT are needed
   per VNI on every NVE. The advantage of this model is that it allows
   the BGP RT constraint mechanisms to be used in order to limit the
   propagation and import of routes to only the NVEs that are interested
   in a given VNI. The disadvantage of this model may be the
   provisioning overhead if RD and RT are not derived automatically from
   VNI.

   In this option, the MAC-VRF table is identified by the RT in the
   control plane and by the VNI in the data-plane. In this option, the
   specific MAC-VRF table corresponds to only a single bridge table.

   2. Option 2: Multiple Broadcast Domains per EVI

   In this option, multiple subnets each represented by a unique VNI are
   mapped to a single EVI. For example, if a tenant has multiple
   segments/subnets each represented by a VNI, then all the VNIs for
   that tenant are mapped to a single EVI - e.g., the EVI in this case
   represents the tenant and not a subnet. This corresponds to the
   VLAN-aware bundle service in [RFC7432]. The advantage of this model
   is that it doesn't require the provisioning of RD/RT per VNI.
   However, this is a moot point when compared to option 1 where
   auto-derivation is used. The disadvantage of this model is that
   routes would be imported by NVEs that may not be interested in a
   given VNI.

   In this option the MAC-VRF table is identified by the RT in the
   control plane and a specific bridge table for that MAC-VRF is
   identified by the Ethernet Tag ID in the control plane. In this
   option, the VNI in the data-plane is sufficient to identify a
   specific bridge table.

5.1.2.1  Auto Derivation of RT

   When the option of a single VNI per EVI is used, in order to simplify
   configuration, the RT used for EVPN can be auto-derived. RD can be
   auto generated as described in [RFC7432] and RT can be auto-derived
   as described next.

   Since a gateway PE as depicted in Figure 1 participates in both the
   DCN and WAN BGP sessions, it is important that when RT values are
   auto-derived from VNIs, there is no conflict in RT spaces between DCN
   and WAN networks assuming that both are operating within the same AS.
   Also, there can be scenarios where both VXLAN and NVGRE
   encapsulations may be needed within the same DCN and their
   corresponding VNIs are administered independently, which means VNI
   spaces can overlap. In order to avoid conflicts in RT spaces, the
   6-byte RT values with 2-octet AS number for DCNs can be auto-derived
   as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Global Administrator      |    Local Administrator        |
   +-----------------------------------------------+---------------+
   | Local Administrator (Cont.)   |
   +-------------------------------+

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Global Administrator      |A| TYPE| D-ID  | Service ID    |
   +-----------------------------------------------+---------------+
   |       Service ID (Cont.)      |
   +-------------------------------+

   The 6-octet RT field consists of two sub-fields:

   - Global Administrator sub-field: 2 octets. This sub-field contains
     an Autonomous System number assigned by IANA.

   - Local Administrator sub-field: 4 octets

      * A: A single-bit field indicating if this RT is auto-derived

           0: auto-derived
           1: manually-derived

      * Type: A 3-bit field that identifies the space in which
        the other 3 bytes are defined. The following spaces are
        defined:

           0 : VID (802.1Q VLAN ID)
           1 : VXLAN
           2 : NVGRE
           3 : I-SID
           4 : EVI
           5 : dual-VID (QinQ VLAN ID)

      * D-ID: A 4-bit field that identifies domain-id. The default
        value of domain-id is zero, indicating that only a single
        numbering space exists for a given technology.
        However, if more than one number space exists for a given
        technology (e.g., overlapping VXLAN spaces), then each of
        the number spaces needs to be identified by its
        corresponding domain-id starting from 1.

      * Service ID: This 3-octet field is set to VNI, VSID, I-SID,
        or VID.

   It should be noted that RT auto-derivation is applicable for 2-octet
   AS numbers. For 4-octet AS numbers, the RT needs to be manually
   configured since a 3-octet VNI field cannot fit within a 2-octet
   local administrator field.

5.1.3  Constructing EVPN BGP Routes

   In EVPN, an MPLS label (for instance, one identifying a forwarding
   table) is distributed by the egress PE via the EVPN control plane and
   is placed in the MPLS header of a given packet by the ingress PE.
   This label is used upon receipt of that packet by the egress PE for
   disposition of that packet. This is very similar to the use of the
   VNI by the egress NVE, with the difference being that an MPLS label
   has local significance while a VNI typically has global significance.
   Accordingly, and specifically to support the option of
   locally-assigned VNIs, the MPLS Label1 field in the MAC/IP
   Advertisement route, the MPLS label field in the Ethernet AD per EVI
   route, and the MPLS label field in the PMSI Tunnel Attribute of the
   Inclusive Multicast Ethernet Tag (IMET) route are used to carry the
   VNI. For the balance of this memo, the above MPLS label fields will
   be referred to as the VNI field. The VNI field is used for both local
   and global VNIs, and for either case the entire 24-bit field is used
   to encode the VNI value.

   For the VLAN-based service (a single VNI per MAC-VRF), the Ethernet
   Tag field in the MAC/IP Advertisement, Ethernet AD per EVI, and IMET
   route MUST be set to zero just as in the VLAN Based service in
   [RFC7432].
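
   As a rough illustration of the two encodings described so far, the
   following Python sketch derives the 6-octet RT value of section
   5.1.2.1 from a VNI and packs a VNI into the 3-octet field that would
   otherwise hold an MPLS label. The function and table names are
   hypothetical and not part of this specification. The Type and
   Sub-Type octets of the BGP extended community are omitted, as they
   are in the diagram of section 5.1.2.1.

```python
import struct

# Values of the 3-bit Type field listed in section 5.1.2.1.
RT_TYPE_CODES = {
    "VID": 0, "VXLAN": 1, "NVGRE": 2, "I-SID": 3, "EVI": 4, "dual-VID": 5,
}


def auto_derive_rt(asn: int, space: str, service_id: int,
                   domain_id: int = 0) -> bytes:
    """Build the 6-octet RT value: 2-octet AS + A | Type | D-ID | Service ID."""
    if not 0 <= asn <= 0xFFFF:
        raise ValueError("auto-derivation requires a 2-octet AS number")
    if not 0 <= service_id < (1 << 24):
        raise ValueError("Service ID must fit in 3 octets")
    if not 0 <= domain_id <= 0xF:
        raise ValueError("D-ID is a 4-bit field")
    a_bit = 0  # 0 means auto-derived, 1 means manually derived
    first_octet = (a_bit << 7) | (RT_TYPE_CODES[space] << 4) | domain_id
    local_admin = (first_octet << 24) | service_id
    return struct.pack("!HI", asn, local_admin)


def encode_vni_field(vni: int) -> bytes:
    """Encode a VNI using the entire 24-bit label field of an EVPN route."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    return vni.to_bytes(3, "big")
```

   For example, auto_derive_rt(65000, "VXLAN", 10010) yields the octets
   fd e8 10 00 27 1a: the AS number, then the octet 0x10 (A=0, Type=1,
   D-ID=0), then the VNI as the Service ID.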

   For the VLAN-aware bundle service (multiple VNIs per MAC-VRF with
   each VNI associated with its own bridge table), the Ethernet Tag
   field in the MAC/IP Advertisement, Ethernet AD per EVI, and IMET
   route MUST identify a bridge table within a MAC-VRF and the set of
   Ethernet Tags for that EVI needs to be configured consistently on all
   PEs within that EVI. For locally-assigned VNIs, the value advertised
   in the Ethernet Tag field MUST be set to a VID just as in the
   VLAN-aware bundle service in [RFC7432]. Such a setting must be done
   consistently on all PE devices participating in that EVI within a
   given domain. For global VNIs, the value advertised in the Ethernet
   Tag field SHOULD be set to a VNI as long as it matches the existing
   semantics of the Ethernet Tag, i.e., it identifies a bridge table
   within a MAC-VRF and the set of VNIs are configured consistently on
   each PE in that EVI.

   In order to indicate which type of data plane encapsulation (i.e.,
   VXLAN, NVGRE, MPLS, or MPLS in GRE) is to be used, the BGP
   Encapsulation extended community defined in [RFC5512] is included
   with all EVPN routes (i.e., MAC/IP Advertisement, Ethernet AD per
   EVI, Ethernet AD per ESI, Inclusive Multicast Ethernet Tag, and
   Ethernet Segment) advertised by an egress PE. Five new values have
   been assigned by IANA to extend the list of encapsulation types
   defined in [RFC5512] and they are listed in section 13.

   The MPLS encapsulation tunnel type, listed in section 13, is needed
   in order to distinguish between an advertising node that only
   supports non-MPLS encapsulations and one that supports MPLS and
   non-MPLS encapsulations. An advertising node that only supports MPLS
   encapsulation does not need to advertise any encapsulation tunnel
   types; i.e., if the BGP Encapsulation extended community is not
   present, then either MPLS encapsulation or a statically configured
   encapsulation is assumed.
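
   The handling of the BGP Encapsulation extended community described
   above can be sketched in Python as follows. This is an illustrative
   sketch under stated assumptions, not a normative encoding: the
   8-octet layout (type 0x03, sub-type 0x0c, four reserved octets,
   2-octet tunnel type) is taken from [RFC5512], the tunnel type
   codepoints are the author of this sketch's understanding of the IANA
   assignments (section 13 is authoritative), and the function names
   are hypothetical.

```python
import struct
from typing import Iterable, Set

# Assumed IANA tunnel type codepoints; verify against section 13.
TUNNEL_TYPES = {
    "VXLAN": 8, "NVGRE": 9, "MPLS": 10, "MPLS-in-GRE": 11, "VXLAN-GPE": 12,
}
_TUNNEL_NAMES = {code: name for name, code in TUNNEL_TYPES.items()}


def encode_encap_community(tunnel: str) -> bytes:
    """Encode one Encapsulation extended community (layout per RFC 5512)."""
    # Type 0x03, sub-type 0x0c, 4 reserved octets, 2-octet tunnel type.
    return struct.pack("!BB4xH", 0x03, 0x0C, TUNNEL_TYPES[tunnel])


def advertised_encaps(communities: Iterable[bytes],
                      default: str = "MPLS") -> Set[str]:
    """Return the encapsulations signaled on a route.

    If no Encapsulation extended community is present, the default
    (MPLS, or a statically configured encapsulation) is assumed.
    """
    encaps = [c for c in communities
              if len(c) == 8 and c[0] == 0x03 and c[1] == 0x0C]
    if not encaps:
        return {default}
    result = set()
    for c in encaps:
        code = int.from_bytes(c[6:8], "big")
        result.add(_TUNNEL_NAMES.get(code, "unknown-%d" % code))
    return result
```

   Passing the statically configured encapsulation as the default
   argument models the second alternative in the rule above.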

   The Next Hop field of the MP_REACH_NLRI attribute of the route MUST
   be set to the IPv4 or IPv6 address of the NVE. The remaining fields
   in each route are set as per [RFC7432].

   Note that the procedure defined here to use the MPLS Label field to
   carry the VNI in the presence of a Tunnel Encapsulation Extended
   Community specifying the use of a VNI is aligned with the procedures
   described in section 8.2.2.2 of [TUNNEL-ENCAP] ("When a Valid VNI has
   not been Signaled").

5.2  MPLS over GRE

   The EVPN data-plane is modeled as an EVPN MPLS client layer sitting
   over an MPLS PSN-tunnel server layer. Some of the EVPN functions
   (split-horizon, aliasing, and backup-path) are tied to the MPLS
   client layer. If MPLS over GRE encapsulation is used, then the EVPN
   MPLS client layer can be carried over an IP PSN tunnel transparently.
   Therefore, there is no impact to the EVPN procedures and associated
   data-plane operation.

   The existing standards for MPLS over GRE encapsulation as defined by
   [RFC4023] can be used for this purpose; however, when it is used in
   conjunction with EVPN, it is recommended that the GRE key field be
   present and be used to provide a 32-bit entropy value only if the P
   nodes can perform Equal-Cost Multipath (ECMP) hashing based on the
   GRE key; otherwise, the GRE header SHOULD NOT include the GRE key.
   The Checksum and Sequence Number fields MUST NOT be included and the
   corresponding C and S bits in the GRE Packet Header MUST be set to
   zero. A PE capable of supporting this encapsulation SHOULD advertise
   its EVPN routes along with the Tunnel Encapsulation extended
   community indicating MPLS over GRE encapsulation as described in
   section 5.1.3.
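
   The GRE header rules above can be sketched in Python as follows. The
   bit positions of the K flag and the MPLS unicast protocol type
   (0x8847) are assumptions taken from the GRE and MPLS-in-GRE
   specifications rather than from this document, and the function name
   is hypothetical.

```python
import struct
from typing import Optional

GRE_FLAG_KEY_PRESENT = 0x2000    # K bit; C (0x8000) and S (0x1000) stay zero
ETHERTYPE_MPLS_UNICAST = 0x8847  # assumed GRE protocol type for MPLS payloads


def build_gre_header(entropy_key: Optional[int] = None) -> bytes:
    """Build a GRE header for MPLS over GRE with C and S bits cleared.

    Pass a 32-bit entropy value only when the P nodes can hash on the
    GRE key for ECMP; otherwise omit it so that no key is included.
    """
    if entropy_key is None:
        # No Checksum, Key, or Sequence Number: 4-byte header.
        return struct.pack("!HH", 0, ETHERTYPE_MPLS_UNICAST)
    if not 0 <= entropy_key <= 0xFFFFFFFF:
        raise ValueError("GRE key is a 32-bit field")
    # Key present: 8-byte header, key follows the protocol type.
    return struct.pack("!HHI", GRE_FLAG_KEY_PRESENT,
                       ETHERTYPE_MPLS_UNICAST, entropy_key)
```

   How the entropy value is computed (for example, from a hash of the
   inner frame headers) is left to the implementation and is not shown.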
634 6 EVPN with Multiple Data Plane Encapsulations 636 The use of the BGP Encapsulation extended community per [RFC5512] 637 allows each NVE in a given EVI to know each of the encapsulations 638 supported by each of the other NVEs in that EVI. i.e., each of the 639 NVEs in a given EVI may support multiple data plane encapsulations. 640 An ingress NVE can send a frame to an egress NVE only if the set of 641 encapsulations advertised by the egress NVE forms a non-empty 642 intersection with the set of encapsulations supported by the ingress 643 NVE, and it is at the discretion of the ingress NVE which 644 encapsulation to choose from this intersection. (As noted in 645 section 5.1.3, if the BGP Encapsulation extended community is not 646 present, then the default MPLS encapsulation or a locally configured 647 encapsulation is assumed.) 649 When a PE advertises multiple supported encapsulations, it MUST 650 advertise encapsulations that use the same EVPN procedures including 651 procedures associated with split-horizon filtering described in 652 section 8.3.1. For example, VXLAN and NVGRE (or MPLS and MPLS over 653 GRE) encapsulations use the same EVPN procedures and thus a PE can 654 advertise both of them and can support either of them or both of them 655 simultaneously. However, a PE MUST NOT advertise VXLAN and MPLS 656 encapsulations together because (a) the MPLS field of EVPN routes is 657 set to either an MPLS label or a VNI but not both and (b) some EVPN 658 procedures (such as split-horizon filtering) are different for 659 VXLAN/NVGRE and MPLS encapsulations. 661 An ingress node that uses shared multicast trees for sending 662 broadcast or multicast frames MAY maintain distinct trees for each 663 different encapsulation type. 665 It is the responsibility of the operator of a given EVI to ensure 666 that all of the NVEs in that EVI support at least one common 667 encapsulation. 
If this condition is violated, it could result in 668 service disruption or failure. The use of the BGP Encapsulation 669 extended community provides a method to detect when this condition is 670 violated but the actions to be taken are at the discretion of the 671 operator and are outside the scope of this document. 673 7 Single-Homing NVEs - NVE Residing in Hypervisor 675 When an NVE and its hosts/VMs are co-located in the same physical 676 device, e.g., when they reside in a server, the links between them 677 are virtual and they typically share fate; i.e., the subject 678 hosts/VMs are typically not multi-homed or, if they are multi-homed, 679 the multi-homing is a purely local matter to the server hosting the 680 VM and the NVEs, and need not be "visible" to any other NVEs residing 681 on other servers, and thus does not require any specific protocol 682 mechanisms. The most common case of this is when the NVE resides on 683 the hypervisor. 685 In the sub-sections that follow, we will discuss the impact on EVPN 686 procedures for the case when the NVE resides on the hypervisor and 687 the VXLAN (or NVGRE) encapsulation is used. 689 7.1 Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE Encapsulation 691 In scenarios where different groups of data centers are under 692 different administrative domains, and these data centers are 693 connected via one or more backbone core providers as described in 694 [RFC7365], the RD must be a unique value per EVI or per NVE as 695 described in [RFC7432]. In other words, whenever there is more than 696 one administrative domain for global VNI, then a unique RD must be 697 used, or whenever the VNI value has local significance, then a unique 698 RD must be used. Therefore, it is recommended to use a unique RD as 699 described in [RFC7432] at all times. 701 When the NVEs reside on the hypervisor, the EVPN BGP routes and 702 attributes associated with multi-homing are no longer required.
This 703 reduces the required routes and attributes to the following subset of 704 four out of the total of eight listed in section 7 of [RFC7432]: 706 - MAC/IP Advertisement Route 707 - Inclusive Multicast Ethernet Tag Route 708 - MAC Mobility Extended Community 709 - Default Gateway Extended Community 711 However, as noted in section 8.6 of [RFC7432] in order to enable a 712 single-homing ingress NVE to take advantage of fast convergence, 713 aliasing, and backup-path when interacting with multi-homed egress 714 NVEs attached to a given Ethernet segment, the single-homing ingress 715 NVE should be able to receive and process Ethernet AD per ES and 716 Ethernet AD per EVI routes. 718 7.2 Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation 720 When the NVEs reside on the hypervisors, the EVPN procedures 721 associated with multi-homing are no longer required. This limits the 722 procedures on the NVE to the following subset of the EVPN procedures: 724 1. Local learning of MAC addresses received from the VMs per section 725 10.1 of [RFC7432]. 727 2. Advertising locally learned MAC addresses in BGP using the MAC/IP 728 Advertisement routes. 730 3. Performing remote learning using BGP per Section 10.2 of 731 [RFC7432]. 733 4. Discovering other NVEs and constructing the multicast tunnels 734 using the Inclusive Multicast Ethernet Tag routes. 736 5. Handling MAC address mobility events per the procedures of Section 737 16 in [RFC7432]. 739 However, as noted in section 8.6 of [RFC7432] in order to enable a 740 single-homing ingress NVE to take advantage of fast convergence, 741 aliasing, and back-up path when interacting with multi-homed egress 742 NVEs attached to a given Ethernet segment, a single-homing ingress 743 NVE should implement the ingress node processing of Ethernet AD per 744 ES and Ethernet AD per EVI routes as defined in sections 8.2 Fast 745 Convergence and 8.4 Aliasing and Backup-Path of [RFC7432]. 
747 8 Multi-Homing NVEs - NVE Residing in ToR Switch 749 In this section, we discuss the scenario where the NVEs reside in the 750 Top of Rack (ToR) switches and the servers (where VMs are residing) 751 are multi-homed to these ToR switches. The multi-homing NVEs operate 752 in All-Active or Single-Active redundancy mode. If the servers are 753 single-homed to the ToR switches, then the scenario becomes similar 754 to that where the NVE resides on the hypervisor, as discussed in 755 Section 7, as far as the required EVPN functionality is concerned. 757 [RFC7432] defines a set of BGP routes, attributes and procedures to 758 support multi-homing. We first describe these functions and 759 procedures, then discuss which of these are impacted by the VXLAN 760 (or NVGRE) encapsulation and what modifications are required. As 761 will be seen later in this section, the only EVPN procedure that is 762 impacted by non-MPLS overlay encapsulation (e.g., VXLAN or NVGRE), 763 which provides space for one ID rather than a stack of labels, is 764 that of split-horizon filtering for multi-homed Ethernet Segments 765 described in section 8.3.1. 767 8.1 EVPN Multi-Homing Features 769 In this section, we will recap the multi-homing features of EVPN to 770 highlight the encapsulation dependencies. The section only describes 771 the features and functions at a high level. For more details, the 772 reader should refer to [RFC7432]. 774 8.1.1 Multi-homed Ethernet Segment Auto-Discovery 776 EVPN NVEs (or PEs) connected to the same Ethernet Segment (e.g., the 777 same server via LAG) can automatically discover each other with 778 minimal to no configuration through the exchange of BGP routes. 780 8.1.2 Fast Convergence and Mass Withdraw 782 EVPN defines a mechanism to efficiently and quickly signal, to remote 783 NVEs, the need to update their forwarding tables upon the occurrence 784 of a failure in connectivity to an Ethernet segment (e.g., a link or 785 a port failure).
This is done by having each NVE advertise an 786 Ethernet A-D Route per Ethernet segment for each locally attached 787 segment. Upon a failure in connectivity to the attached segment, the 788 NVE withdraws the corresponding Ethernet A-D route. This triggers all 789 NVEs that receive the withdrawal to update their next-hop adjacencies 790 for all MAC addresses associated with the Ethernet segment in 791 question. If no other NVE had advertised an Ethernet A-D route for 792 the same segment, then the NVE that received the withdrawal simply 793 invalidates the MAC entries for that segment. Otherwise, the NVE 794 updates the next-hop adjacency list accordingly. 796 8.1.3 Split-Horizon 798 If a server that is multi-homed to two or more NVEs (represented by an 799 Ethernet segment ES1) and operating in an all-active redundancy mode 800 sends a BUM packet (i.e., Broadcast, Unknown unicast, or Multicast) to 801 one of these NVEs, then it is important to ensure the packet is not 802 looped back to the server via another NVE connected to this server. 803 The filtering mechanism on the NVE to prevent such a loop and packet 804 duplication is called "split-horizon filtering". 806 8.1.4 Aliasing and Backup-Path 808 In the case where a station is multi-homed to multiple NVEs, it is 809 possible that only a single NVE learns a set of the MAC addresses 810 associated with traffic transmitted by the station. This leads to a 811 situation where remote NVEs receive MAC advertisement routes, for 812 these addresses, from a single NVE even though multiple NVEs are 813 connected to the multi-homed station. As a result, the remote NVEs 814 are not able to effectively load-balance traffic among the NVEs 815 connected to the multi-homed Ethernet segment. This could be the 816 case, for example, when the NVEs perform data-path learning on the 817 access, and the load-balancing function on the station hashes traffic 818 from a given source MAC address to a single NVE.
Another scenario 819 where this occurs is when the NVEs rely on control plane learning on 820 the access (e.g. using ARP), since ARP traffic will be hashed to a 821 single link in the LAG. 823 To alleviate this issue, EVPN introduces the concept of Aliasing. 824 This refers to the ability of an NVE to signal that it has 825 reachability to a given locally attached Ethernet segment, even when 826 it has learnt no MAC addresses from that segment. The Ethernet A-D 827 route per EVI is used to that end. Remote NVEs which receive MAC 828 advertisement routes with non-zero ESI should consider the MAC 829 address as reachable via all NVEs that advertise reachability to the 830 relevant Segment using Ethernet A-D routes with the same ESI and with 831 the Single-Active flag reset. 833 Backup-Path is a closely related function, albeit it applies to the 834 case where the redundancy mode is Single-Active. In this case, the 835 NVE signals that it has reachability to a given locally attached 836 Ethernet Segment using the Ethernet A-D route as well. Remote NVEs 837 which receive the MAC advertisement routes, with non-zero ESI, should 838 consider the MAC address as reachable via the advertising NVE. 839 Furthermore, the remote NVEs should install a Backup-Path, for said 840 MAC, to the NVE which had advertised reachability to the relevant 841 Segment using an Ethernet A-D route with the same ESI and with the 842 Single-Active flag set. 844 8.1.5 DF Election 846 If a host is multi-homed to two or more NVEs on an Ethernet segment 847 operating in all-active redundancy mode, then for a given EVI only 848 one of these NVEs, termed the Designated Forwarder (DF) is 849 responsible for sending it broadcast, multicast, and, if configured 850 for that EVI, unknown unicast frames. 852 This is required in order to prevent duplicate delivery of multi- 853 destination frames to a multi-homed host or VM, in case of all-active 854 redundancy. 
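As a point of reference for the DF election recapped above, the following Python sketch shows the default election of section 8.5 of [RFC7432]: the PEs attached to the Ethernet segment are ordered by IP address in increasing numeric value, and the PE at ordinal (V mod N) is the DF, where V is the election ID and N is the number of PEs. The function and parameter names are hypothetical, and the sketch assumes all addresses belong to one address family.

```python
import ipaddress


def elect_df(pe_addresses, election_id):
    """Return the DF for one election ID (hypothetical helper).

    pe_addresses: IP addresses of all PEs attached to the Ethernet segment.
    election_id:  the VID, or a consistently configured ID such as a VNI.
    """
    # Order PEs by numeric IP address; ordinals start at 0.
    ordered = sorted(pe_addresses, key=lambda a: int(ipaddress.ip_address(a)))
    return ordered[election_id % len(ordered)]
```

Because every PE on the segment sorts the same list and applies the same modulus, each arrives at the same DF without further signaling, provided the IDs are configured consistently.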
856 In NVEs where .1Q tagged frames are received from hosts, the DF 857 election should be performed based on host VLAN IDs (VIDs) per 858 section 8.5 of [RFC7432]. Furthermore, multi-homing PEs of a given 859 Ethernet Segment MAY perform DF election using configured IDs such as 860 VNI, EVI, normalized VIDs, etc., as long as the IDs are configured 861 consistently across the multi-homing PEs. 863 In GWs where VXLAN encapsulated frames are received, the DF election 864 is performed on VNIs. Again, it is assumed that for a given Ethernet 865 Segment, VNIs are unique and consistent (e.g., no duplicate VNIs 866 exist). 868 8.2 Impact on EVPN BGP Routes & Attributes 870 Since multi-homing is supported in this scenario, the entire set 871 of BGP routes and attributes defined in [RFC7432] is used. The 872 setting of the Ethernet Tag field in the MAC Advertisement, Ethernet 873 AD per EVI, and Inclusive Multicast routes follows that of section 874 5.1.3. Furthermore, the setting of the VNI field in the MAC 875 Advertisement and Ethernet AD per EVI routes follows that of section 876 5.1.3. 878 8.3 Impact on EVPN Procedures 880 Two cases need to be examined here, depending on whether the NVEs are 881 operating in Single-Active or in All-Active redundancy mode. 883 First, let's consider the case of Single-Active redundancy mode, where 884 the hosts are multi-homed to a set of NVEs; however, only a single 885 NVE is active at a given point in time for a given VNI. In this case, 886 the aliasing is not required and the split-horizon filtering may not 887 be required, but other functions such as multi-homed Ethernet segment 888 auto-discovery, fast convergence and mass withdraw, backup path, and 889 DF election are required. 891 Second, let's consider the case of All-Active redundancy mode.
In 892 this case, out of all the EVPN multi-homing features listed in 893 section 8.1, the use of the VXLAN or NVGRE encapsulation impacts the 894 split-horizon and aliasing features, since those two rely on the MPLS 895 client layer. Given that this MPLS client layer is absent with these 896 types of encapsulations, alternative procedures and mechanisms are 897 needed to provide the required functions. Those are discussed in 898 detail next. 900 8.3.1 Split Horizon 902 In EVPN, an MPLS label is used for split-horizon filtering to support 903 All-Active multi-homing where an ingress NVE adds a label 904 corresponding to the site of origin (aka ESI Label) when 905 encapsulating the packet. The egress NVE checks the ESI label when 906 attempting to forward a multi-destination frame out of an interface, and 907 if the label corresponds to the same site identifier (ESI) associated 908 with that interface, the packet gets dropped. This prevents the 909 occurrence of forwarding loops. 911 Since VXLAN and NVGRE encapsulations do not include the ESI label, 912 other means of performing the split-horizon filtering function must 913 be devised for these encapsulations. The following approach is 914 recommended for split-horizon filtering when VXLAN (or NVGRE) 915 encapsulation is used. 917 Every NVE tracks the IP address(es) associated with the other NVE(s) 918 with which it has shared multi-homed Ethernet Segments. When the NVE 919 receives a multi-destination frame from the overlay network, it 920 examines the source IP address in the tunnel header (which 921 corresponds to the ingress NVE) and filters out the frame on all 922 local interfaces connected to Ethernet Segments that are shared with 923 the ingress NVE. With this approach, it is required that the ingress 924 NVE perform replication locally to all directly attached Ethernet 925 Segments (regardless of the DF Election state) for all flooded 926 traffic arriving from the access interfaces (i.e., from the hosts).
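The filtering and local replication rules just described can be sketched in Python as follows. The class, field, and function names are hypothetical, and the sketch assumes that frames received from the overlay are additionally subject to the usual DF check, which this paragraph does not restate.

```python
from dataclasses import dataclass


@dataclass
class LocalInterface:
    """One access interface on this NVE (hypothetical model)."""
    name: str
    esi: str
    is_df: bool  # DF election result for this NVE on this segment
    peer_nve_ips: frozenset = frozenset()  # other NVEs sharing this segment


def egress_for_overlay_bum(interfaces, tunnel_src_ip):
    """Interfaces that may transmit a BUM frame received from the overlay.

    The frame is filtered on every segment shared with the ingress NVE,
    identified by the source IP address of the tunnel header.
    """
    return [
        i.name
        for i in interfaces
        if i.is_df and tunnel_src_ip not in i.peer_nve_ips
    ]


def egress_for_access_bum(interfaces, ingress_name):
    """Interfaces that receive a BUM frame arriving from a local host.

    The ingress NVE replicates to all other directly attached segments,
    regardless of the DF election state.
    """
    return [i.name for i in interfaces if i.name != ingress_name]
```

The two functions are complementary: because the ingress NVE always delivers locally, the egress NVE can safely suppress delivery on every segment it shares with that ingress NVE.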
927 This approach is referred to as "Local Bias", and has the advantage 928 that only a single IP address needs to be used per NVE for split- 929 horizon filtering, as opposed to requiring an IP address per Ethernet 930 Segment per NVE. 932 In order to allow proper operation of split-horizon filtering among 933 the same group of multi-homing PE devices, a mix of PE devices with 934 MPLS over GRE encapsulations running [RFC7432] procedures for split- 935 horizon filtering on the one hand and VXLAN/NVGRE encapsulations 936 running local-bias procedures on the other on a given Ethernet 937 Segment MUST NOT be configured. 939 8.3.2 Aliasing and Backup-Path 941 The Aliasing and the Backup-Path procedures for VXLAN/NVGRE 942 encapsulation are very similar to the ones for MPLS. In case of MPLS, 943 Ethernet A-D route per EVI is used for Aliasing when the 944 corresponding Ethernet Segment operates in All-Active multi-homing, 945 and the same route is used for Backup-Path when the corresponding 946 Ethernet Segment operates in Single-Active multi-homing. In case of 947 VXLAN/NVGRE, the same route is used for the Aliasing and the Backup- 948 Path with the difference that the Ethernet Tag and VNI fields in 949 Ethernet A-D per EVI route are set as described in section 5.1.3. 951 8.3.3 Unknown Unicast Traffic Designation 953 In EVPN, when an ingress PE uses ingress replication to flood unknown 954 unicast traffic to egress PEs, the ingress PE uses a different EVPN 955 MPLS label (from the one used for known unicast traffic) to identify 956 such BUM traffic. The egress PEs use this label to identify such BUM 957 traffic and thus apply DF filtering for All-Active multi-homed sites. 
958 In the absence of unknown unicast traffic designation and with 959 unknown unicast flooding enabled, there can be transient duplicate 960 traffic to All-Active multi-homed sites under the following 961 condition: the host MAC address is learned by the egress PE(s) and 962 advertised to the ingress PE; however, the MAC advertisement has not 963 been received or processed by the ingress PE, so the host 964 MAC address is unknown on the ingress PE but known on the 965 egress PE(s). Therefore, when a packet destined to that host MAC 966 address arrives on the ingress PE, the ingress PE floods it via ingress 967 replication to all the egress PE(s) and, since the address is known to the 968 egress PE(s), multiple copies are sent to the All-Active multi-homed 969 site. It should be noted that such transient packet duplication only 970 happens when a) the destination host is multi-homed via All-Active 971 redundancy mode, b) flooding of unknown unicast is enabled in the 972 network, c) ingress replication is used, and d) traffic for the 973 destination host arrives on the ingress PE before it learns the 974 host MAC address via BGP EVPN advertisement. If it is desired to 975 avoid the occurrence of such transient packet duplication (however low 976 probability that may be), then VXLAN-GPE encapsulation needs to be 977 used between these PEs and the ingress PE needs to set the BUM 978 Traffic Bit (B bit) [VXLAN-GPE] to indicate that this is ingress- 979 replicated BUM traffic. 981 9 Support for Multicast 983 The EVPN Inclusive Multicast Ethernet Tag (IMET) route is used to 984 discover the multicast tunnels among the endpoints associated with a 985 given EVI (e.g., given VNI) for VLAN-based service and a given 986 <EVI, VLAN> for VLAN-aware bundle service. All fields of this route are 987 set as described in section 5.1.3. The Originating router's IP 988 address field is set to the NVE's IP address.
This route is tagged 989 with the PMSI Tunnel attribute, which is used to encode the type of 990 multicast tunnel to be used as well as the multicast tunnel 991 identifier. The tunnel encapsulation is encoded by adding the BGP 992 Encapsulation extended community as per section 5.1.1. For example, 993 the PMSI Tunnel attribute may indicate the multicast tunnel is of 994 type Protocol Independent Multicast - Sparse-Mode (PIM-SM); whereas, 995 the BGP Encapsulation extended community may indicate the 996 encapsulation for that tunnel is of type VXLAN. The following tunnel 997 types as defined in [RFC6514] can be used in the PMSI tunnel 998 attribute for VXLAN/NVGRE:
1000 + 3 - PIM-SSM Tree
1001 + 4 - PIM-SM Tree
1002 + 5 - Bidir-PIM Tree
1003 + 6 - Ingress Replication
1005 In case of VXLAN and NVGRE encapsulation with locally-assigned VNIs, 1006 just as in [RFC7432], each PE MUST advertise an IMET route to other 1007 PEs in an EVPN instance for the multicast tunnel type that it uses 1008 (i.e., ingress replication, PIM-SM, PIM-SSM, or Bidir-PIM tunnel). 1009 However, for globally-assigned VNIs, each PE MUST advertise an IMET 1010 route to other PEs in an EVPN instance for ingress replication or 1011 PIM-SSM tunnel, and MAY advertise an IMET route for PIM-SM or Bidir-PIM 1012 tunnel. In case of PIM-SM or Bidir-PIM tunnel, no information in the 1013 IMET route is needed by the PE to set up these tunnels. 1015 In the scenario where the multicast tunnel is a tree, both the 1016 Inclusive as well as the Aggregate Inclusive variants may be used. In 1017 the former case, a multicast tree is dedicated to a VNI, whereas in 1018 the latter, a multicast tree is shared among multiple VNIs. For VNI- 1019 based service, the Aggregate Inclusive mode is accomplished by having 1020 the NVEs advertise multiple IMET routes with different Route Targets 1021 (one per VNI) but with the same tunnel identifier encoded in the PMSI 1022 tunnel attribute.
For VNI-aware bundle service, the Aggregate 1023 Inclusive mode is accomplished by having the NVEs advertise multiple 1024 IMET routes with different VNI encoded in the Ethernet Tag field, but 1025 with the same tunnel identifier encoded in the PMSI Tunnel attribute. 1027 10 Data Center Interconnections - DCI 1029 For DCI, the following two main scenarios are considered when 1030 connecting data centers running evpn-overlay (as described here) over 1031 MPLS/IP core network: 1033 - Scenario 1: DCI using GWs 1034 - Scenario 2: DCI using ASBRs 1036 The following two subsections describe the operations for each of 1037 these scenarios. 1039 10.1 DCI using GWs 1041 This is the typical scenario for interconnecting data centers over 1042 WAN. In this scenario, EVPN routes are terminated and processed in 1043 each GW and MAC/IP routes are always re-advertised from DC to WAN but 1044 from WAN to DC, they are not re-advertised if unknown MAC address 1045 (and default IP address) are utilized in NVEs. In this scenario, each 1046 GW maintains a MAC-VRF (and/or IP-VRF) for each EVI. The main 1047 advantage of this approach is that NVEs do not need to maintain MAC 1048 and IP addresses from any remote data centers when default IP route 1049 and unknown MAC routes are used - i.e., they only need to maintain 1050 routes that are local to their own DC. When default IP route and 1051 unknown MAC route are used, any unknown IP and MAC packets from NVEs 1052 are forwarded to the GWs where all the VPN MAC and IP routes are 1053 maintained. This approach reduces the size of MAC-VRF and IP-VRF 1054 significantly at NVEs. Furthermore, it results in a faster 1055 convergence time upon a link or NVE failure in a multi-homed network 1056 or device redundancy scenario, because the failure related BGP routes 1057 (such as mass withdraw message) do not need to get propagated all the 1058 way to the remote NVEs in the remote DCs. 
This approach is described 1059 in detail in section 3.4 of [DCI-EVPN-OVERLAY]. 1061 10.2 DCI using ASBRs 1063 This approach can be considered as the opposite of the first approach 1064 and it favors simplification at DCI devices over NVEs such that 1065 larger MAC-VRF (and IP-VRF) tables need to be maintained on NVEs; 1066 whereas, DCI devices don't need to maintain any MAC (and IP) 1067 forwarding tables. Furthermore, DCI devices do not need to terminate 1068 and process routes related to multi-homing but rather to relay these 1069 messages for the establishment of an end-to-end Label Switched Path 1070 (LSP). In other words, DCI devices in this approach operate 1071 similarly to ASBRs for inter-AS option B - section 10 of [RFC4364]. 1072 This requires locally assigned VNIs to be used just like downstream- 1073 assigned MPLS VPN labels, where for all practical purposes the VNIs 1074 function like 24-bit VPN labels. This approach is equally applicable 1075 to data centers (or Carrier Ethernet networks) with MPLS 1076 encapsulation. 1078 In inter-AS option B, when an ASBR receives an EVPN route from its DC 1079 over internal BGP (iBGP) and re-advertises it to other ASBRs, it re- 1080 advertises the EVPN route by re-writing the BGP next-hops to itself, 1081 thus losing the identity of the PE that originated the advertisement. 1082 This re-write of BGP next-hop impacts the EVPN Mass Withdraw route 1083 (Ethernet A-D per ES) and its procedure adversely. However, it does 1084 not impact the EVPN Aliasing mechanism/procedure because when the 1085 Aliasing routes (Ether A-D per EVI) are advertised, the receiving PE 1086 first resolves a MAC address for a given EVI into its corresponding 1087 <ES, EVI> and subsequently, it resolves the <ES, EVI> into multiple 1088 paths (and their associated next hops) via which the <ES, EVI> is 1089 reachable.
Since Aliasing and MAC routes are both advertised on a per EVI 1090 basis and they use the same RD and RT (per EVI), the receiving PE can 1091 associate them together on a per BGP path basis (e.g., per 1092 originating PE) and thus perform recursive route resolution - e.g., a 1093 MAC is reachable via an <ES, EVI> which, in turn, is reachable via a 1094 set of BGP paths, thus the MAC is reachable via the set of BGP paths. 1095 Since on a per EVI basis, the association of MAC routes and the 1096 corresponding Aliasing route is fixed and determined by the same RD 1097 and RT, there is no ambiguity when the BGP next hop for these routes 1098 is re-written as these routes pass through ASBRs - i.e., the 1099 receiving PE may receive multiple Aliasing routes for the same EVI 1100 from a single next hop (a single ASBR), and it can still create 1101 multiple paths toward that <ES, EVI>. 1103 However, when the BGP next hop address corresponding to the 1104 originating PE is re-written, the association between the Mass 1105 Withdraw route (Ether A-D per ES) and its corresponding MAC routes 1106 cannot be made based on their RDs and RTs because the RD for the Mass 1107 Withdraw route is different from the one for the MAC routes. 1108 Therefore, the functionality needed at the ASBRs and the receiving 1109 PEs depends on whether the Mass Withdraw route is originated and 1110 whether there is a need to handle route resolution ambiguity for this 1111 route. The following two subsections describe the functionality 1112 needed by the ASBRs and the receiving PEs depending on whether the 1113 NVEs reside in hypervisors or in ToRs. 1115 10.2.1 ASBR Functionality with Single-Homing NVEs 1117 When NVEs reside in hypervisors as described in section 7.1, there is 1118 no multi-homing and thus there is no need for the originating NVE to 1119 send Ethernet A-D per ES or Ethernet A-D per EVI routes.
However, as 1120 noted in section 7, in order to enable a single-homing ingress NVE to 1121 take advantage of fast convergence, aliasing, and backup-path when 1122 interacting with multi-homing egress NVEs attached to a given 1123 Ethernet segment, the single-homing NVE should be able to receive and 1124 process Ethernet AD per ES and Ethernet AD per EVI routes. The 1125 handling of these routes is described in the next section. 1127 10.2.2 ASBR Functionality with Multi-Homing NVEs 1129 When NVEs reside in ToRs and operate in multi-homing redundancy mode, 1130 then, as described in section 8, there is a need for the originating 1131 multi-homing NVE to send Ethernet A-D per ES route(s) (used for mass 1132 withdraw) and Ethernet A-D per EVI routes (used for aliasing). As 1133 described above, the re-write of BGP next-hop by ASBRs creates 1134 ambiguities when Ethernet A-D per ES routes are received by the 1135 remote NVE in a different AS because the receiving NVE cannot 1136 associate that route with the MAC/IP routes of that Ethernet Segment 1137 advertised by the same originating NVE. This ambiguity inhibits the 1138 function of mass-withdraw per ES by the receiving NVE in a different 1139 AS. 1141 As an example, consider a scenario where a CE is multi-homed to PE1 and 1142 PE2 where these PEs are connected via ASBR1 and then ASBR2 to the 1143 remote PE3. Furthermore, consider that PE1 receives M1 from the CE but 1144 PE2 does not. Therefore, PE1 advertises Eth A-D per ES1, Eth A-D per EVI1, 1145 and M1; whereas, PE2 only advertises Eth A-D per ES1 and Eth A-D per 1146 EVI1. ASBR1 receives all these five advertisements and passes them to 1147 ASBR2 (with itself as the BGP next hop). ASBR2, in turn, passes them 1148 to the remote PE3 with itself as the BGP next hop. PE3 receives these 1149 five routes where all of them have the same BGP next-hop (i.e., 1150 ASBR2).
Furthermore, the two Ether A-D per ES routes received by PE3 1151 have the same info - i.e., same ESI and the same BGP next hop. 1152 Although both of these routes are maintained by the BGP process in 1153 PE3 (because they have different RDs and are thus treated as different 1154 BGP routes), information from only one of them is used in the L2 1155 routing table (L2 RIB).
1157              PE1
1158             /   \
1159           CE     ASBR1---ASBR2---PE3
1160             \   /
1161              PE2
1163          Figure 1: Inter-AS Option B
1165 Now, when the AC between PE2 and the CE fails and PE2 sends NLRI 1166 withdrawal for Ether A-D per ES route and this withdrawal gets 1167 propagated and received by PE3, the BGP process in PE3 removes 1168 the corresponding BGP route; however, it doesn't remove the 1169 associated info (namely ESI and BGP next hop) from the L2 routing 1170 table (L2 RIB) because it still has the other Ether A-D per ES route 1171 (originated from PE1) with the same info. That is why the mass- 1172 withdraw mechanism does not work when doing DCI with inter-AS option 1173 B. However, as described previously, the aliasing function works and 1174 so does "mass-withdraw per EVI" (which is associated with withdrawing 1175 the EVPN route associated with Aliasing - i.e., Ether A-D per EVI 1176 route). 1178 In the above example, PE3 receives two Aliasing routes with the 1179 same BGP next hop (ASBR2) but different RDs. One of the Aliasing routes 1180 has the same RD as the advertised MAC route (M1). PE3 follows the 1181 route resolution procedure specified in [RFC7432] upon receiving the 1182 two Aliasing routes - i.e., it resolves M1 to <ES, EVI> and 1183 subsequently it resolves <ES, EVI> to a BGP path list with two paths 1184 along with the corresponding VNIs/MPLS labels (one associated with 1185 PE1 and the other associated with PE2). It should be noted that even 1186 though both paths are advertised by the same BGP next hop (ASBR2), 1187 the receiving PE3 can handle them properly. Therefore, M1 is 1188 reachable via two paths.
This creates two end-to-end LSPs, from PE3 1189 to PE1 and from PE3 to PE2, for M1 such that when PE3 wants to 1190 forward traffic destined to M1, it can load balance between the two 1191 LSPs. Although route resolution for Aliasing routes with the same BGP 1192 next hop is not explicitly mentioned in [RFC7432], this is the 1193 expected operation and thus it is elaborated here. 1195 When the AC between PE2 and the CE fails and PE2 sends NLRI 1196 withdrawal for Ether A-D per EVI routes and these withdrawals get 1197 propagated and received by PE3, PE3 removes the Aliasing 1198 route and updates the path list - i.e., it removes the path 1199 corresponding to PE2. Therefore, all the corresponding MAC routes 1200 for that <ES, EVI> that point to that path list will now have the 1201 updated path list with a single path associated with PE1. This action 1202 can be considered as the mass-withdraw at the per-EVI level. The 1203 mass-withdraw at the per-EVI level has a longer convergence time than the 1204 mass-withdraw at the per-ES level; however, it is much faster than the 1205 convergence time when the withdraw is done on a per-MAC basis. 1207 If a PE becomes detached from a given ES, then in addition to 1208 withdrawing its previously advertised Ethernet AD Per ES routes, it 1209 MUST also withdraw its previously advertised Ethernet AD Per EVI 1210 routes for that ES. For a remote PE that is separated from the 1211 withdrawing PE by one or more EVPN inter-AS option B ASBRs, the 1212 withdrawal of the Ethernet AD Per ES routes is not actionable. 1213 However, a remote PE is able to correlate a previously advertised 1214 Ethernet AD Per EVI route with any MAC/IP Advertisement routes also 1215 advertised by the withdrawing PE for that <ES, EVI>. Hence, when 1216 it receives the withdrawal of an Ethernet AD Per EVI route, it SHOULD 1217 remove the withdrawing PE as a next-hop for all MAC addresses 1218 associated with that <ES, EVI>.
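The recursive resolution and the per-EVI withdrawal described above can be sketched in Python as follows. The class name, the method names, the RD strings, and the label values are all hypothetical. The sketch keys each path by the RD of the Aliasing route, which is what keeps the paths distinct even though every route arrives with the same rewritten BGP next hop.

```python
from collections import defaultdict


class L2Rib:
    """Toy L2 RIB on a remote PE such as PE3 (hypothetical model)."""

    def __init__(self):
        self.mac_to_es_evi = {}              # MAC -> (ESI, EVI)
        self.path_lists = defaultdict(dict)  # (ESI, EVI) -> {RD: path}

    def add_mac(self, mac, esi, evi):
        self.mac_to_es_evi[mac] = (esi, evi)

    def add_ad_per_evi(self, rd, esi, evi, next_hop, label):
        # One path per originating RD, even when next hops are identical.
        self.path_lists[(esi, evi)][rd] = (next_hop, label)

    def withdraw_ad_per_evi(self, rd, esi, evi):
        # Removing one path updates every MAC that resolves through it.
        self.path_lists[(esi, evi)].pop(rd, None)

    def resolve(self, mac):
        """Resolve MAC -> (ESI, EVI) -> list of (next hop, label) paths."""
        key = self.mac_to_es_evi[mac]
        return sorted(self.path_lists.get(key, {}).values())
```

A usage example following the PE1/PE2/PE3 scenario, with illustrative RDs and labels:

```python
rib = L2Rib()
rib.add_mac("M1", "ES1", "EVI1")
rib.add_ad_per_evi("PE1:1", "ES1", "EVI1", "ASBR2", 1001)
rib.add_ad_per_evi("PE2:1", "ES1", "EVI1", "ASBR2", 1002)
print(rib.resolve("M1"))   # two paths, both via ASBR2
rib.withdraw_ad_per_evi("PE2:1", "ES1", "EVI1")
print(rib.resolve("M1"))   # one path remains, the one toward PE1
```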

   In the previous example, when the AC between PE2 and the CE fails,
   PE2 will withdraw its Ethernet AD Per ES and Per EVI routes. When
   PE3 receives the withdrawal of an Ethernet AD Per EVI route, it
   removes PE2 as a valid next-hop for all MAC addresses associated with
   the corresponding . Therefore, all the MAC next-hops for that will
   now have a single next-hop, viz the LSP to PE1.

   In summary, it can be seen that aliasing (and backup path)
   functionality should work as is for inter-AS option B without
   requiring any additional functionality in ASBRs or PEs. However, the
   mass-withdraw functionality falls back from per-ES mode to per-EVI
   mode for inter-AS option B - i.e., PEs receiving a mass-withdraw
   route from the same AS take action on the Ether A-D per ES route,
   whereas PEs receiving a mass-withdraw route from a different AS take
   action on the Ether A-D per EVI route.

11 Acknowledgement

   The authors would like to thank Aldrin Isaac, David Smith, John
   Mullooly, Thomas Nadeau, Samir Thoria, and Jorge Rabadan for their
   valuable comments and feedback. The authors would also like to thank
   Jakob Heitz for his contribution to Section 10.2.

12 Security Considerations

   This document uses IP-based tunnel technologies to support data
   plane transport. Consequently, the security considerations of those
   tunnel technologies apply. This document defines support for VXLAN
   [RFC7348] and NVGRE [RFC7637] encapsulations. The security
   considerations from those RFCs apply to the data plane aspects of
   this document.

   As with [RFC5512], any modification of the information that is used
   to form encapsulation headers, to choose a tunnel type, or to choose
   a particular tunnel for a particular payload type may lead to user
   data packets getting misrouted, misdelivered, and/or dropped.

   More broadly, the security considerations for the transport of IP
   reachability information using BGP are discussed in [RFC4271] and
   [RFC4272], and are equally applicable for the extensions described
   in this document.

13 IANA Considerations

   This document requests the following BGP Tunnel Encapsulation
   Attribute Tunnel Types from IANA; they have already been allocated.
   The IANA registry needs to point to this document.

      8    VXLAN Encapsulation
      9    NVGRE Encapsulation
      10   MPLS Encapsulation
      11   MPLS in GRE Encapsulation
      12   VXLAN GPE Encapsulation

14 References

14.1 Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
   Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
   2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

   [RFC7432] Sajassi et al., "BGP MPLS Based Ethernet VPN", RFC 7432,
   February 2015.

   [RFC7348] Mahalingam, M., et al., "VXLAN: A Framework for Overlaying
   Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348,
   August 2014.

   [RFC7637] Garg, P., et al., "NVGRE: Network Virtualization using
   Generic Routing Encapsulation", RFC 7637, September 2015.

   [RFC5512] Mohapatra, P. and E. Rosen, "The BGP Encapsulation
   Subsequent Address Family Identifier (SAFI) and the BGP Tunnel
   Encapsulation Attribute", RFC 5512, April 2009.

   [RFC4023] T. Worster et al., "Encapsulating MPLS in IP or Generic
   Routing Encapsulation (GRE)", RFC 4023, March 2005.

14.2 Informative References

   [RFC7209] Sajassi et al., "Requirements for Ethernet VPN (EVPN)",
   RFC 7209, May 2014.

   [RFC4272] S. Murphy, "BGP Security Vulnerabilities Analysis",
   RFC 4272, January 2006.

   [RFC7364] Narten et al., "Problem Statement: Overlays for Network
   Virtualization", RFC 7364, October 2014.

   [RFC7365] Lasserre et al., "Framework for DC Network Virtualization",
   RFC 7365, October 2014.

   [DCI-EVPN-OVERLAY] Rabadan et al., "Interconnect Solution for EVPN
   Overlay networks", draft-ietf-bess-dci-evpn-overlay-08, work in
   progress, February 8, 2018.

   [RFC4271] Y. Rekhter, Ed., T. Li, Ed., S. Hares, Ed., "A Border
   Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RFC4364] Rosen, E., et al., "BGP/MPLS IP Virtual Private Networks
   (VPNs)", RFC 4364, February 2006.

   [TUNNEL-ENCAP] Rosen et al., "The BGP Tunnel Encapsulation
   Attribute", draft-ietf-idr-tunnel-encaps-08, work in progress,
   January 11, 2018.

   [RFC6514] R. Aggarwal et al., "BGP Encodings and Procedures for
   Multicast in MPLS/BGP IP VPNs", RFC 6514, February 2012.

   [VXLAN-GPE] Maino et al., "Generic Protocol Extension for VXLAN",
   draft-ietf-nvo3-vxlan-gpe-05, work in progress, October 30, 2017.

   [GENEVE] J. Gross et al., "Geneve: Generic Network Virtualization
   Encapsulation", draft-ietf-nvo3-geneve-05, September 2017.

   [EVPN-GENEVE] S. Boutros et al., "EVPN control plane for Geneve",
   draft-boutros-bess-evpn-geneve-00.txt, June 2017.

Contributors

   S. Salam
   K. Patel
   D. Rao
   S. Thoria
   D. Cai
   Cisco

   Y. Rekhter
   A. Issac
   Wen Lin
   Nischal Sheth
   Juniper

   L. Yong
   Huawei

Authors' Addresses

   Ali Sajassi
   Cisco
   USA
   Email: sajassi@cisco.com

   John Drake
   Juniper Networks
   USA
   Email: jdrake@juniper.net

   Nabil Bitar
   Nokia
   USA
   Email: nabil.bitar@nokia.com

   R. Shekhar
   Juniper
   USA
   Email: rshekhar@juniper.net

   James Uttaro
   AT&T
   USA
   Email: uttaro@att.com

   Wim Henderickx
   Nokia
   USA
   Email: wim.henderickx@nokia.com