BESS Workgroup                                       A. Sajassi (Editor)
INTERNET-DRAFT                                                     Cisco
Intended Status: Standards Track                       J. Drake (Editor)
                                                                 Juniper
                                                                N. Bitar
                                                                   Nokia
                                                              R. Shekhar
                                                                 Juniper
                                                               J. Uttaro
                                                                    AT&T
                                                           W. Henderickx
                                                                   Nokia

Expires: May 8, 2018                                    December 8, 2017

          A Network Virtualization Overlay Solution using EVPN
                    draft-ietf-bess-evpn-overlay-10

Abstract

   This document specifies how Ethernet VPN (EVPN) can be used as a
   Network Virtualization Overlay (NVO) solution and explores the
   various tunnel encapsulation options over IP and their impact on the
   EVPN control plane and procedures. In particular, the following
   encapsulation options are analyzed: VXLAN, NVGRE, and MPLS over GRE.
   This specification is also applicable to GENEVE encapsulation;
   however, some incremental work is required, which will be covered in
   a separate document. This document also specifies new multi-homing
   procedures for split-horizon filtering and mass-withdraw. It also
   specifies EVPN route constructions for VXLAN/NVGRE encapsulations and
   ASBR procedures for multi-homing NV Edge devices.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1  Introduction
   2  Requirements Notation and Conventions
   3  Terminology
   4  EVPN Features
   5  Encapsulation Options for EVPN Overlays
      5.1  VXLAN/NVGRE Encapsulation
         5.1.1  Virtual Identifiers Scope
            5.1.1.1  Data Center Interconnect with Gateway
            5.1.1.2  Data Center Interconnect without Gateway
         5.1.2  Virtual Identifiers to EVI Mapping
            5.1.2.1  Auto Derivation of RT
         5.1.3  Constructing EVPN BGP Routes
      5.2  MPLS over GRE
   6  EVPN with Multiple Data Plane Encapsulations
   7  Single-Homing NVEs - NVE Residing in Hypervisor
      7.1  Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE
           Encapsulation
      7.2  Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation
   8  Multi-Homing NVEs - NVE Residing in ToR Switch
      8.1  EVPN Multi-Homing Features
         8.1.1  Multi-homed Ethernet Segment Auto-Discovery
         8.1.2  Fast Convergence and Mass Withdraw
         8.1.3  Split-Horizon
         8.1.4  Aliasing and Backup-Path
         8.1.5  DF Election
      8.2  Impact on EVPN BGP Routes & Attributes
      8.3  Impact on EVPN Procedures
         8.3.1  Split Horizon
         8.3.2  Aliasing and Backup-Path
         8.3.3  Unknown Unicast Traffic Designation
   9  Support for Multicast
   10 Data Center Interconnections - DCI
      10.1  DCI using GWs
      10.2  DCI using ASBRs
         10.2.1  ASBR Functionality with Single-Homing NVEs
         10.2.2  ASBR Functionality with Multi-Homing NVEs
   11 Acknowledgement
   12 Security Considerations
   13 IANA Considerations
   14 References
      14.1  Normative References
      14.2  Informative References
   Contributors
   Authors' Addresses

1 Introduction

   This document specifies how Ethernet VPN (EVPN) can be used as a
   Network Virtualization Overlay (NVO) solution and explores the
   various tunnel encapsulation options over IP and their impact on the
   EVPN control plane and procedures. In particular, the following
   encapsulation options are analyzed: VXLAN [RFC7348], NVGRE [RFC7637],
   and MPLS over GRE [RFC4023]. This specification is also applicable to
   [GENEVE] encapsulation; however, some incremental work is required,
   which will be covered in a separate document [EVPN-GENEVE]. This
   document also specifies new multi-homing procedures for split-horizon
   filtering and mass-withdraw. It also specifies EVPN route
   constructions for VXLAN/NVGRE encapsulations and ASBR procedures for
   multi-homing NV Edge devices.

   In the context of this document, a Network Virtualization Overlay
   (NVO) is a solution to address the requirements of a multi-tenant
   data center, especially one with virtualized hosts, e.g., Virtual
   Machines (VMs) or virtual workloads. The key requirements of such a
   solution, as described in [RFC7364], are:

   - Isolation of network traffic per tenant

   - Support for a large number of tenants (tens or hundreds of
     thousands)

   - Extending L2 connectivity among different VMs belonging to a given
     tenant segment (subnet) across different PODs within a data center
     or between different data centers

   - Allowing a given VM to move between different physical points of
     attachment within a given L2 segment

   The underlay network for NVO solutions is assumed to provide IP
   connectivity between NVO endpoints (NVEs).

   This document describes how Ethernet VPN (EVPN) can be used as an NVO
   solution and explores the applicability of EVPN functions and
   procedures.
   In particular, it describes the various tunnel encapsulation options
   for EVPN over IP and their impact on the EVPN control plane and
   procedures for two main scenarios:

   a) single-homing NVEs - when an NVE resides in the hypervisor, and

   b) multi-homing NVEs - when an NVE resides in a Top of Rack (ToR)
      device.

   The possible encapsulation options for EVPN overlays that are
   analyzed in this document are:

   - VXLAN and NVGRE
   - MPLS over GRE

   Before getting into the description of the different encapsulation
   options for EVPN over IP, it is important to highlight the EVPN
   solution's main features, how those features are currently supported,
   and any impact that the encapsulation has on those features.

2 Requirements Notation and Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3 Terminology

   Most of the terminology used in this document comes from [RFC7432]
   and [RFC7365].

   NVO: Network Virtualization Overlay

   NVE: Network Virtualization Endpoint

   VNI: Virtual Network Identifier (for VXLAN)

   VSID: Virtual Subnet Identifier (for NVGRE)

   EVPN: Ethernet VPN

   EVI: An EVPN instance spanning the Provider Edge (PE) devices
   participating in that EVPN.

   MAC-VRF: A Virtual Routing and Forwarding table for Media Access
   Control (MAC) addresses on a PE.

   Ethernet Segment (ES): When a customer site (device or network) is
   connected to one or more PEs via a set of Ethernet links, then that
   set of links is referred to as an 'Ethernet segment'.

   Ethernet Segment Identifier (ESI): A unique non-zero identifier that
   identifies an Ethernet segment is called an 'Ethernet Segment
   Identifier'.
   Ethernet Tag: An Ethernet tag identifies a particular broadcast
   domain, e.g., a VLAN. An EVPN instance consists of one or more
   broadcast domains.

   PE: Provider Edge device.

   Single-Active Redundancy Mode: When only a single PE, among all the
   PEs attached to an Ethernet segment, is allowed to forward traffic
   to/from that Ethernet segment for a given VLAN, then the Ethernet
   segment is defined to be operating in Single-Active redundancy mode.

   All-Active Redundancy Mode: When all PEs attached to an Ethernet
   segment are allowed to forward known unicast traffic to/from that
   Ethernet segment for a given VLAN, then the Ethernet segment is
   defined to be operating in All-Active redundancy mode.

4 EVPN Features

   EVPN was originally designed to support the requirements detailed in
   [RFC7209] and therefore has the following attributes, which directly
   address control plane scaling and ease of deployment issues.

   1) Control plane information is distributed with BGP, and broadcast
      and multicast traffic is sent using a shared multicast tree or
      with ingress replication.

   2) Control plane learning is used for MAC (and IP) addresses instead
      of data plane learning. The latter requires the flooding of
      unknown unicast and ARP frames, whereas the former does not
      require any flooding.

   3) Route Reflectors (RRs) are used to reduce a full mesh of BGP
      sessions among PE devices to a single BGP session between a PE and
      the RR. Furthermore, RR hierarchy can be leveraged to scale the
      number of BGP routes on the RR.

   4) Auto-discovery via BGP is used to discover PE devices
      participating in a given VPN, PE devices participating in a given
      redundancy group, tunnel encapsulation types, multicast tunnel
      type, multicast members, etc.

   5) All-Active multihoming is used.
      This allows a given customer device (CE) to have multiple links to
      multiple PEs, and traffic to/from that CE fully utilizes all of
      these links.

   6) When a link between a CE and a PE fails, the PEs for that EVI are
      notified of the failure via the withdrawal of a single EVPN route.
      This allows those PEs to remove the withdrawing PE as a next hop
      for every MAC address associated with the failed link. This is
      termed 'mass withdrawal'.

   7) BGP route filtering and constrained route distribution are
      leveraged to ensure that the control plane traffic for a given EVI
      is only distributed to the PEs in that EVI.

   8) When an 802.1Q interface is used between a CE and a PE, each VLAN
      ID (VID) on that interface can be mapped onto a bridge table (for
      up to 4094 such bridge tables). All these bridge tables may be
      mapped onto a single MAC-VRF (in the case of VLAN-aware bundle
      service).

   9) VM Mobility mechanisms ensure that all PEs in a given EVI know
      the ES with which a given VM, as identified by its MAC and IP
      addresses, is currently associated.

   10) Route Targets are used to allow the operator (or customer) to
       define a spectrum of logical network topologies including mesh,
       hub & spoke, and extranets (e.g., a VPN whose sites are owned by
       different enterprises), without the need for proprietary software
       or the aid of other virtual or physical devices.

   Because the design goal for NVO is millions of instances per common
   physical infrastructure, the scaling properties of the control plane
   for NVO are extremely important. EVPN and the extensions described
   herein are designed with this level of scalability in mind.
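   The mass-withdrawal behavior in item 6 can be illustrated with a
   minimal Python sketch. It uses a deliberately simplified model, and
   the class and method names are hypothetical, not taken from any EVPN
   implementation. Each remote MAC address is associated with an ESI,
   and next hops are resolved per ESI. Removing one PE from an ESI's
   next-hop set therefore updates every MAC address on that Ethernet
   segment at once, without a withdrawal per MAC address.

```python
from collections import defaultdict


class MacVrf:
    """Toy MAC-VRF (hypothetical model): remote MACs resolve to an ESI,
    and each ESI resolves to the set of PEs (next hops) attached to it."""

    def __init__(self) -> None:
        self.mac_to_esi: dict[str, str] = {}
        self.esi_next_hops: defaultdict[str, set[str]] = defaultdict(set)

    def learn(self, mac: str, esi: str, pe: str) -> None:
        # Control-plane learning, e.g., from a MAC/IP Advertisement route.
        self.mac_to_esi[mac] = esi
        self.esi_next_hops[esi].add(pe)

    def next_hops(self, mac: str) -> set[str]:
        esi = self.mac_to_esi.get(mac)
        return set(self.esi_next_hops.get(esi, set())) if esi else set()

    def mass_withdraw(self, esi: str, pe: str) -> None:
        # A single withdrawn route removes `pe` for every MAC on `esi`,
        # because next hops are stored per ESI rather than per MAC.
        self.esi_next_hops[esi].discard(pe)


# Usage: PE1 and PE2 share ESI-1, and PE1's link to the CE fails.
vrf = MacVrf()
vrf.learn("00:00:5e:00:53:01", "ESI-1", "PE1")
vrf.learn("00:00:5e:00:53:02", "ESI-1", "PE2")
vrf.mass_withdraw("ESI-1", "PE1")
print(vrf.next_hops("00:00:5e:00:53:01"))  # {'PE2'}
print(vrf.next_hops("00:00:5e:00:53:02"))  # {'PE2'}
```

   Keying next hops by ESI is what turns one withdrawal into an update
   for every affected MAC address. In this sketch the surviving PE also
   covers MACs originally learned from PE1, which resembles aliasing in
   All-Active mode.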
5 Encapsulation Options for EVPN Overlays

5.1 VXLAN/NVGRE Encapsulation

   Both VXLAN and NVGRE are examples of technologies that provide a data
   plane encapsulation used to transport a packet over the common
   physical IP infrastructure between Network Virtualization Edges
   (NVEs), e.g., VXLAN Tunnel End Points (VTEPs) in a VXLAN network.
   Both technologies carry the identifier of the specific NVO instance
   in each packet: the Virtual Network Identifier (VNI) in VXLAN and the
   Virtual Subnet Identifier (VSID) in NVGRE. Unless stated otherwise,
   the remainder of this document uses VNI to represent the NVO
   instance, with the understanding that VSID can equally be used if the
   encapsulation is NVGRE.

   Note that a Provider Edge (PE) is equivalent to an NVE/VTEP.

   VXLAN encapsulation is based on UDP, with an 8-byte header following
   the UDP header. VXLAN provides a 24-bit VNI, which typically provides
   a one-to-one mapping to the tenant VLAN ID, as described in
   [RFC7348]. In this scenario, the ingress VTEP does not include an
   inner VLAN tag on the encapsulated frame, and the egress VTEP
   discards frames with an inner VLAN tag. This mode of operation in
   [RFC7348] maps to the VLAN Based service in [RFC7432], where a tenant
   VLAN ID gets mapped to an EVPN instance (EVI).

   VXLAN also provides an option of including an inner VLAN tag in the
   encapsulated frame, if explicitly configured at the VTEP. This mode
   of operation can map to the VLAN Bundle service in [RFC7432] because
   all the tenant's tagged frames map to a single bridge table / MAC-VRF,
   and the inner VLAN tag is not used for lookup by the disposition PE
   when performing VXLAN decapsulation, as described in section 6 of
   [RFC7348].

   NVGRE encapsulation [RFC7637] is based on GRE encapsulation, and it
   mandates the inclusion of the optional GRE Key field, which carries
   the VSID.
   There is a one-to-one mapping between the VSID and the tenant VLAN
   ID, as described in [RFC7637], and the inclusion of an inner VLAN tag
   is prohibited. This mode of operation in [RFC7637] maps to the VLAN
   Based service in [RFC7432].

   As described in the next section, there is no change to the encoding
   of EVPN routes to support VXLAN or NVGRE encapsulation except for the
   use of the BGP Encapsulation extended community to indicate the
   encapsulation type (e.g., VXLAN or NVGRE). However, there is
   potential impact to the EVPN procedures depending on where the NVE is
   located (i.e., in the hypervisor or the ToR) and whether multi-homing
   capabilities are required.

5.1.1 Virtual Identifiers Scope

   Although VNIs are defined as 24-bit globally unique values, there are
   scenarios in which it is desirable to use a locally significant value
   for the VNI, especially in the context of data center interconnect:

5.1.1.1 Data Center Interconnect with Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use VNIs as globally unique
   identifiers within a data center, a Gateway needs to be employed at
   the edge of the data center network. This is because the Gateway
   provides the functionality of translating the VNI when crossing
   network boundaries, which may align with operator span-of-control
   boundaries. As an example, consider the network of Figure 1 below.
   Assume there are three network operators: one for each of the DC1,
   DC2, and WAN networks. The Gateways at the edge of the data centers
   are responsible for translating the VNIs between the values used in
   each of the data center networks and the values used in the WAN.
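   The Gateway's translation role can be sketched roughly in Python.
   The sketch builds and parses the 8-byte VXLAN header of [RFC7348]:
   an 8-bit flags field with the I flag set, 24 reserved bits, the
   24-bit VNI, and 8 more reserved bits. It then rewrites the VNI using
   a per-boundary mapping table. It assumes VXLAN is used on both sides
   of each Gateway. The function names and the VNI values in the mapping
   tables are hypothetical.

```python
import struct

VXLAN_FLAG_I = 0x08          # "I" flag: the VNI field is valid [RFC7348]
VNI_MAX = (1 << 24) - 1      # VNIs are 24-bit values


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header that follows the UDP header."""
    if not 0 <= vni <= VNI_MAX:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24); word 2: VNI (24) + reserved (8).
    return struct.pack("!II", VXLAN_FLAG_I << 24, vni << 8)


def vxlan_vni(frame: bytes) -> int:
    """Return the VNI from a frame that starts with a VXLAN header."""
    flags_word, vni_word = struct.unpack("!II", frame[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_I:
        raise ValueError("I flag not set; VNI is not valid")
    return vni_word >> 8


def gateway_translate(frame: bytes, vni_map: dict[int, int]) -> bytes:
    """Rewrite the VNI as the frame crosses a DC/WAN boundary.

    `vni_map` is a hypothetical per-boundary table (e.g., DC1 VNI -> WAN
    VNI); a KeyError means no mapping was provisioned for the VNI.
    """
    return vxlan_header(vni_map[vxlan_vni(frame)]) + frame[8:]


# Usage, following Figure 1: NVE1 (DC1) -> GW -> WAN -> GW -> NVE3 (DC2).
dc1_to_wan = {5001: 70001}   # hypothetical values
wan_to_dc2 = {70001: 9001}   # hypothetical values
frame = vxlan_header(5001) + b"inner Ethernet frame"
frame = gateway_translate(frame, dc1_to_wan)   # at the DC1 Gateway
frame = gateway_translate(frame, wan_to_dc2)   # at the DC2 Gateway
print(vxlan_vni(frame))  # 9001
```

   Only the 24-bit VNI changes at each boundary. The inner frame is
   passed through untouched, so each operator can number its VNI space
   independently.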
                             +--------------+
                             |              |
              +---------+    |     WAN      |    +---------+
      +----+  |       +---+  +----+    +----+  +---+       |  +----+
      |NVE1|--|       |   |  |WAN |    |WAN |  |   |       |--|NVE3|
      +----+  |IP     |GW |--|Edge|    |Edge|--|GW | IP    |  +----+
      +----+  |Fabric +---+  +----+    +----+  +---+ Fabric|  +----+
      |NVE2|--|         |    |              |    |         |--|NVE4|
      +----+  +---------+    +--------------+    +---------+  +----+

              |<------ DC 1 ------>     <------ DC2 ------>|

            Figure 1: Data Center Interconnect with Gateway

5.1.1.2 Data Center Interconnect without Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use locally assigned VNIs (e.g.,
   similar to MPLS labels), there may be no need to employ Gateways at
   the edge of the data center network. More specifically, the VNI
   value that is used by the transmitting NVE is allocated by the NVE
   that is receiving the traffic (in other words, this is similar to a
   "downstream-assigned" MPLS label). This allows the VNI space to be
   decoupled between different data center networks without the need
   for a dedicated Gateway at the edge of the data centers. This topic
   is covered in section 10.2.

                             +--------------+
                             |              |
              +---------+    |     WAN      |    +---------+
      +----+  |         |    +----+    +----+    |         |  +----+
      |NVE1|--|         |    |ASBR|    |ASBR|    |         |--|NVE3|
      +----+  |IP Fabric|----|    |    |    |----|IP Fabric|  +----+
      +----+  |         |    +----+    +----+    |         |  +----+
      |NVE2|--|         |    |              |    |         |--|NVE4|
      +----+  +---------+    +--------------+    +---------+  +----+

              |<------ DC 1 ------>     <------ DC2 ------>|

             Figure 2: Data Center Interconnect with ASBR

5.1.2 Virtual Identifiers to EVI Mapping

   When the EVPN control plane is used in conjunction with VXLAN (or
   NVGRE) encapsulation, two options for mapping the VXLAN VNI (or NVGRE
   VSID) to an EVI are possible:

   1. Option 1: Single Broadcast Domain per EVI

   In this option, a single Ethernet broadcast domain (e.g., subnet)
   represented by a VNI is mapped to a unique EVI. This corresponds to
   the VLAN Based service in [RFC7432], where a tenant-facing interface,
   either logical (e.g., represented by a VLAN ID) or physical, gets
   mapped to an EVPN instance (EVI). As such, a BGP RD and RT are needed
   per VNI on every NVE. The advantage of this model is that it allows
   the BGP RT constraint mechanisms to be used in order to limit the
   propagation and import of routes to only the NVEs that are interested
   in a given VNI. The disadvantage of this model may be the
   provisioning overhead if the RD and RT are not derived automatically
   from the VNI.

   In this option, the MAC-VRF table is identified by the RT in the
   control plane and by the VNI in the data plane. In this option, the
   specific MAC-VRF table corresponds to only a single bridge table.

   2. Option 2: Multiple Broadcast Domains per EVI

   In this option, multiple subnets, each represented by a unique VNI,
   are mapped to a single EVI. For example, if a tenant has multiple
   segments/subnets each represented by a VNI, then all the VNIs for
   that tenant are mapped to a single EVI; i.e., the EVI in this case
   represents the tenant and not a subnet. This corresponds to the
   VLAN-aware bundle service in [RFC7432]. The advantage of this model
   is that it doesn't require the provisioning of an RD/RT per VNI.
   However, this is a moot point when compared to Option 1 where
   auto-derivation is used. The disadvantage of this model is that
   routes would be imported by NVEs that may not be interested in a
   given VNI.

   In this option, the MAC-VRF table is identified by the RT in the
   control plane, and a specific bridge table for that MAC-VRF is
   identified by the RT and Ethernet Tag ID in the control plane. In this
In this 433 option, the VNI in the data-plane is sufficient to identify a 434 specific bridge table. 436 5.1.2.1 Auto Derivation of RT 438 When the option of a single VNI per EVI is used, in order to simplify 439 configuration, the RT used for EVPN can be auto-derived. RD can be 440 auto generated as described in [RFC7432] and RT can be auto-derived 441 as described next. 443 Since a gateway PE as depicted in figure-1 participates in both the 444 DCN and WAN BGP sessions, it is important that when RT values are 445 auto-derived from VNIs, there is no conflict in RT spaces between DCN 446 and WAN networks assuming that both are operating within the same AS. 447 Also, there can be scenarios where both VXLAN and NVGRE 448 encapsulations may be needed within the same DCN and their 449 corresponding VNIs are administered independently which means VNI 450 spaces can overlap. In order to avoid conflict in RT spaces arises, 451 the 6-byte RT values with 2-octet AS number for DCNs can be auto- 452 derived as follow: 454 0 1 2 3 455 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 456 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 457 | Global Administrator | Local Administrator | 458 +-----------------------------------------------+---------------+ 459 | Local Administrator (Cont.) | 460 +-------------------------------+ 462 0 1 2 3 463 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 464 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 465 | Global Administrator |A| TYPE| D-ID | Service ID | 466 +-----------------------------------------------+---------------+ 467 | Service ID (Cont.) | 468 +-------------------------------+ 470 The 6-octet RT field consists of two sub-field: 472 - Global Administrator sub-field: 2 octets. This sub-field contains 473 an Autonomous System number assigned by IANA. 
   - Local Administrator sub-field: 4 octets

     * A: A single-bit field indicating whether this RT is
       auto-derived

          0: auto-derived
          1: manually derived

     * Type: A 3-bit field that identifies the space in which the
       other 3 octets are defined. The following spaces are defined:

          0 : VID (802.1Q VLAN ID)
          1 : VXLAN
          2 : NVGRE
          3 : I-SID
          4 : EVI
          5 : dual-VID (QinQ VLAN ID)

     * D-ID: A 4-bit field that identifies the domain-id. The default
       value of domain-id is zero, indicating that only a single
       numbering space exists for a given technology. However, if more
       than one number space exists for a given technology (e.g.,
       overlapping VXLAN spaces), then each of the number spaces needs
       to be identified by its corresponding domain-id, starting from 1.

     * Service ID: This 3-octet field is set to the VNI, VSID, I-SID,
       or VID.

   It should be noted that RT auto-derivation is applicable to 2-octet
   AS numbers only. For 4-octet AS numbers, the RT needs to be manually
   configured, since a 3-octet VNI field cannot fit within the 2-octet
   Local Administrator field.

5.1.3 Constructing EVPN BGP Routes

   In EVPN, an MPLS label identifying the forwarding table of an
   instance is distributed by the egress PE via the EVPN control plane
   and is placed in the MPLS header of a given packet by the ingress PE.
   This label is used upon receipt of that packet by the egress PE for
   disposition of that packet. This is very similar to the use of the
   VNI by the egress NVE, with the difference being that an MPLS label
   has local significance while a VNI typically has global significance.
   Accordingly, and specifically to support the option of
   locally-assigned VNIs, the MPLS Label1 field in the MAC/IP
   Advertisement route, the MPLS label field in the Ethernet AD per EVI
   route, and the MPLS label field in the PMSI Tunnel Attribute of the
   Inclusive Multicast Ethernet Tag (IMET) route are used to carry the
   VNI. For the balance of this memo, the above MPLS label fields will
   be referred to as the VNI field. The VNI field is used for both local
   and global VNIs, and in either case the entire 24-bit field is used
   to encode the VNI value.

   For the VLAN Based service (a single VNI per MAC-VRF), the Ethernet
   Tag field in the MAC/IP Advertisement, Ethernet AD per EVI, and IMET
   routes MUST be set to zero, just as in the VLAN Based service in
   [RFC7432].

   For the VLAN-aware bundle service (multiple VNIs per MAC-VRF, with
   each VNI associated with its own bridge table), the Ethernet Tag
   field in the MAC/IP Advertisement, Ethernet AD per EVI, and IMET
   routes MUST identify a bridge table within a MAC-VRF, and the set of
   Ethernet Tags for that EVI needs to be configured consistently on all
   PEs within that EVI. For locally-assigned VNIs, the value advertised
   in the Ethernet Tag field MUST be set to a VID, just as in the
   VLAN-aware bundle service in [RFC7432]. This setting must be done
   consistently on all PE devices participating in that EVI within a
   given domain. For global VNIs, the value advertised in the Ethernet
   Tag field SHOULD be set to a VNI as long as it matches the existing
   semantics of the Ethernet Tag, i.e., it identifies a bridge table
   within a MAC-VRF and the set of VNIs is configured consistently on
   each PE in that EVI.

   In order to indicate which type of data plane encapsulation (i.e.,
   VXLAN, NVGRE, MPLS, or MPLS over GRE) is to be used, the BGP
   Encapsulation extended community defined in [TUNNEL-ENCAP] is
   included with all EVPN routes (i.e.,
   MAC/IP Advertisement, Ethernet AD per EVI, Ethernet AD per ES,
   Inclusive Multicast Ethernet Tag, and Ethernet Segment routes)
   advertised by an egress PE. Five new values have been assigned by
   IANA to extend the list of encapsulation types defined in
   [TUNNEL-ENCAP]; they are listed in section 13.

   The MPLS encapsulation tunnel type, listed in section 13, is needed
   in order to distinguish between an advertising node that only
   supports non-MPLS encapsulations and one that supports both MPLS and
   non-MPLS encapsulations. An advertising node that only supports MPLS
   encapsulation does not need to advertise any encapsulation tunnel
   types; i.e., if the BGP Encapsulation extended community is not
   present, then either MPLS encapsulation or a statically configured
   encapsulation is assumed.

   The Next Hop field of the MP_REACH_NLRI attribute of the route MUST
   be set to the IPv4 or IPv6 address of the NVE. The remaining fields
   in each route are set as per [RFC7432].

   Note that the procedure defined here, which uses the MPLS Label field
   to carry the VNI in the presence of a Tunnel Encapsulation extended
   community specifying the use of a VNI, is aligned with the procedures
   described in section 8.2.2.2 of [TUNNEL-ENCAP] ("When a Valid VNI has
   not been Signaled").

5.2 MPLS over GRE

   The EVPN data plane is modeled as an EVPN MPLS client layer sitting
   over an MPLS PSN-tunnel server layer. Some of the EVPN functions
   (split-horizon, aliasing, and backup-path) are tied to the MPLS
   client layer. If MPLS over GRE encapsulation is used, then the EVPN
   MPLS client layer can be carried over an IP PSN tunnel transparently.

   Therefore, there is no impact to the EVPN procedures and associated
   data plane operation.
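   The layering can be made concrete with a short Python sketch that
   prepends a GRE header to a single EVPN MPLS label stack entry. It
   rests on three assumptions. The GRE protocol type is 0x8847 (MPLS
   unicast), as used by [RFC4023]. The Checksum (C) and Sequence Number
   (S) bits are left clear. The optional 32-bit GRE Key is included only
   when the caller supplies an entropy value. The helper names and the
   label value are hypothetical.

```python
import struct
from typing import Optional

GRE_PROTO_MPLS_UNICAST = 0x8847  # GRE protocol type for MPLS unicast
GRE_FLAG_K = 0x2000              # Key Present; C (0x8000) and S (0x1000) stay 0


def gre_header(entropy_key: Optional[int] = None) -> bytes:
    """GRE header for MPLS over GRE with no Checksum or Sequence Number.

    The 32-bit key is added only when an entropy value is supplied, i.e.,
    when the underlay is assumed to be able to hash on the GRE key.
    """
    if entropy_key is None:
        return struct.pack("!HH", 0, GRE_PROTO_MPLS_UNICAST)
    return struct.pack("!HHI", GRE_FLAG_K, GRE_PROTO_MPLS_UNICAST,
                       entropy_key & 0xFFFFFFFF)


def mpls_entry(label: int, ttl: int = 255, tc: int = 0,
               bottom: bool = True) -> bytes:
    """One 4-octet MPLS label stack entry: label(20) TC(3) S(1) TTL(8)."""
    word = ((label & 0xFFFFF) << 12) | ((tc & 0x7) << 9)
    word |= (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


# Usage: the EVPN MPLS client layer (one label) carried over GRE/IP.
evpn_label = 3001                 # hypothetical label from the egress PE
payload = b"customer Ethernet frame"
packet = gre_header(0x1234ABCD) + mpls_entry(evpn_label) + payload
print(packet[:12].hex())  # 200088471234abcd00bb91ff
```

   The MPLS label stack entry is the same one the EVPN MPLS client layer
   would use over any other PSN tunnel. Only the outer GRE/IP wrapping
   differs, which is why the EVPN procedures are unaffected.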
   The existing standards for MPLS over GRE encapsulation as defined by
   [RFC4023] can be used for this purpose; however, when used in
   conjunction with EVPN, it is recommended that the GRE Key field be
   present and be used to provide a 32-bit entropy value only if the P
   nodes can perform ECMP hashing based on the GRE key; otherwise, the
   GRE header should not include the GRE key. The Checksum and Sequence
   Number fields MUST NOT be included, and the corresponding C and S
   bits in the GRE Packet Header MUST be set to zero. A PE capable of
   supporting this encapsulation should advertise its EVPN routes along
   with the Tunnel Encapsulation extended community indicating MPLS over
   GRE encapsulation, as described in the previous section.

6 EVPN with Multiple Data Plane Encapsulations

   The use of the BGP Encapsulation extended community per
   [TUNNEL-ENCAP] allows each NVE in a given EVI to know each of the
   encapsulations supported by each of the other NVEs in that EVI. That
   is, each of the NVEs in a given EVI may support multiple data plane
   encapsulations. An ingress NVE can send a frame to an egress NVE
   only if the set of encapsulations advertised by the egress NVE forms
   a non-empty intersection with the set of encapsulations supported by
   the ingress NVE, and it is at the discretion of the ingress NVE which
   encapsulation to choose from this intersection. (As noted in
   section 5.1.3, if the BGP Encapsulation extended community is not
   present, then the default MPLS encapsulation or a locally configured
   encapsulation is assumed.)

   When a PE advertises multiple supported encapsulations, it MUST
   advertise encapsulations that use the same EVPN procedures, including
   procedures associated with split-horizon filtering described in
   section 8.3.1.
   For example, VXLAN and NVGRE (or MPLS and MPLS over GRE)
   encapsulations use the same EVPN procedures; thus, a PE can advertise
   both of them and can support either or both of them simultaneously.
   However, a PE MUST NOT advertise VXLAN and MPLS encapsulations
   together because (a) the MPLS field of EVPN routes is set to either
   an MPLS label or a VNI, but not both, and (b) some EVPN procedures
   (such as split-horizon filtering) are different for VXLAN/NVGRE and
   MPLS encapsulations.

   An ingress node that uses shared multicast trees for sending
   broadcast or multicast frames MAY maintain distinct trees for each
   different encapsulation type.

   It is the responsibility of the operator of a given EVI to ensure
   that all of the NVEs in that EVI support at least one common
   encapsulation. If this condition is violated, it could result in
   service disruption or failure. The use of the BGP Encapsulation
   extended community provides a method to detect when this condition is
   violated, but the actions to be taken are at the discretion of the
   operator and are outside the scope of this document.

7 Single-Homing NVEs - NVE Residing in Hypervisor

   When an NVE and its hosts/VMs are co-located in the same physical
   device, e.g., when they reside in a server, the links between them
   are virtual and they typically share fate. That is, the subject
   hosts/VMs are typically not multi-homed, or if they are multi-homed,
   the multi-homing is a purely local matter to the server hosting the
   VM and the NVEs; it need not be "visible" to any other NVEs residing
   on other servers and thus does not require any specific protocol
   mechanisms. The most common case of this is when the NVE resides on
   the hypervisor.

   In the sub-sections that follow, we discuss the impact on EVPN
   procedures for the case when the NVE resides on the hypervisor and
   the VXLAN (or NVGRE) encapsulation is used.
7.1 Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE Encapsulation

In scenarios where different groups of data centers are under different administrative domains, and these data centers are connected via one or more backbone core providers as described in [RFC7365], the RD must be a unique value per EVI or per NVE as described in [RFC7432]. In other words, a unique RD must be used whenever there is more than one administrative domain for a global VNI, or whenever the VNI value has local significance. Therefore, it is recommended to always use a unique RD as described in [RFC7432].

When the NVEs reside on the hypervisor, the EVPN BGP routes and attributes associated with multi-homing are no longer required. This reduces the required routes and attributes to the following subset of four out of the total of eight listed in section 7 of [RFC7432]:

   - MAC/IP Advertisement Route
   - Inclusive Multicast Ethernet Tag Route
   - MAC Mobility Extended Community
   - Default Gateway Extended Community

However, as noted in section 8.6 of [RFC7432], in order to enable a single-homing ingress NVE to take advantage of fast convergence, aliasing, and backup-path when interacting with multi-homed egress NVEs attached to a given Ethernet segment, the single-homing ingress NVE should be able to receive and process Ethernet AD per ES and Ethernet AD per EVI routes.

7.2 Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation

When the NVEs reside on the hypervisors, the EVPN procedures associated with multi-homing are no longer required. This limits the procedures on the NVE to the following subset of the EVPN procedures:

   1. Local learning of MAC addresses received from the VMs per section 10.1 of [RFC7432].

   2. Advertising locally learned MAC addresses in BGP using the MAC/IP Advertisement routes.

   3. Performing remote learning using BGP per section 10.2 of [RFC7432].

   4. Discovering other NVEs and constructing the multicast tunnels using the Inclusive Multicast Ethernet Tag routes.

   5. Handling MAC address mobility events per the procedures of section 16 in [RFC7432].

However, as noted in section 8.6 of [RFC7432], in order to enable a single-homing ingress NVE to take advantage of fast convergence, aliasing, and backup-path when interacting with multi-homed egress NVEs attached to a given Ethernet segment, a single-homing ingress NVE should implement the ingress node processing of Ethernet AD per ES and Ethernet AD per EVI routes as defined in sections 8.2 ("Fast Convergence") and 8.4 ("Aliasing and Backup-Path") of [RFC7432].

8 Multi-Homing NVEs - NVE Residing in ToR Switch

In this section, we discuss the scenario where the NVEs reside in the Top-of-Rack (ToR) switches AND the servers (where the VMs reside) are multi-homed to these ToR switches. The multi-homing NVEs operate in All-Active or Single-Active redundancy mode. If the servers are single-homed to the ToR switches, then the scenario becomes similar to that where the NVE resides on the hypervisor, as discussed in Section 7, as far as the required EVPN functionality is concerned.

[RFC7432] defines a set of BGP routes, attributes, and procedures to support multi-homing. We first describe these functions and procedures, then discuss which of these are impacted by the VXLAN (or NVGRE) encapsulation and what modifications are required. As will be seen later in this section, the only EVPN procedure that is impacted by a non-MPLS overlay encapsulation (e.g., VXLAN or NVGRE), which provides space for one ID rather than a stack of labels, is split-horizon filtering for multi-homed Ethernet Segments, described in section 8.3.1.
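The distinction drawn above between one identifier in a VXLAN/NVGRE header and a stack of MPLS labels can be illustrated with a minimal Python sketch. The sketch packs the 8-octet VXLAN header as laid out in [RFC7348] (the I flag, a 24-bit VNI, and reserved bits set to zero), alongside an MPLS label stack. For illustration only, the stack is assumed to carry an EVPN label with an ESI label beneath it. The helper names and all label and VNI values are hypothetical. The code is not taken from any implementation.

```python
import struct
from typing import List

VXLAN_FLAG_I = 0x08  # "I" flag: the VNI field is valid (RFC 7348)


def vxlan_header(vni: int) -> bytes:
    """Pack the 8-octet VXLAN header; it has room for exactly one 24-bit VNI."""
    if not 0 <= vni < 1 << 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags octet followed by 24 reserved bits (zero).
    # Word 2: 24-bit VNI followed by 8 reserved bits (zero).
    return struct.pack("!II", VXLAN_FLAG_I << 24, vni << 8)


def mpls_label_stack(labels: List[int], ttl: int = 64) -> bytes:
    """Pack 20-bit labels as MPLS label stack entries (TC = 0)."""
    out = bytearray()
    for index, label in enumerate(labels):
        if not 0 <= label < 1 << 20:
            raise ValueError("MPLS label must fit in 20 bits")
        bottom_of_stack = 1 if index == len(labels) - 1 else 0
        # Label (20 bits) | TC (3 bits) | S (1 bit) | TTL (8 bits)
        out += struct.pack("!I", (label << 12) | (bottom_of_stack << 8) | ttl)
    return bytes(out)


# Hypothetical values: VNI 5000; EVPN label 100 with ESI label 200 beneath it.
print(vxlan_header(5000).hex())            # 0800000000138800
print(mpls_label_stack([100, 200]).hex())  # 00064040000c8140
```

The VXLAN header has no slot for a second identifier. The MPLS stack, by contrast, can simply grow by one entry. This is why the ESI-label-based split-horizon procedure does not carry over to VXLAN or NVGRE.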
8.1 EVPN Multi-Homing Features

In this section, we recap the multi-homing features of EVPN to highlight the encapsulation dependencies. The section only describes the features and functions at a high level. For more details, the reader is referred to [RFC7432].

8.1.1 Multi-homed Ethernet Segment Auto-Discovery

EVPN NVEs (or PEs) connected to the same Ethernet Segment (e.g., the same server via a LAG) can automatically discover each other with minimal to no configuration through the exchange of BGP routes.

8.1.2 Fast Convergence and Mass Withdraw

EVPN defines a mechanism to efficiently and quickly signal to remote NVEs the need to update their forwarding tables upon the occurrence of a failure in connectivity to an Ethernet segment (e.g., a link or a port failure). This is done by having each NVE advertise an Ethernet A-D route per Ethernet segment for each locally attached segment. Upon a failure in connectivity to the attached segment, the NVE withdraws the corresponding Ethernet A-D route. This triggers all NVEs that receive the withdrawal to update their next-hop adjacencies for all MAC addresses associated with the Ethernet segment in question. If no other NVE had advertised an Ethernet A-D route for the same segment, then the NVE that received the withdrawal simply invalidates the MAC entries for that segment. Otherwise, the NVE updates the next-hop adjacency list accordingly.

8.1.3 Split-Horizon

Consider a server that is multi-homed to two or more NVEs (represented by an Ethernet segment ES1) and operating in All-Active redundancy mode. If the server sends a BUM packet (i.e., Broadcast, Unknown unicast, or Multicast) to one of these NVEs, then it is important to ensure the packet is not looped back to the server via another NVE connected to this server.
The filtering mechanism on the NVE that prevents such loops and packet duplication is called "split-horizon filtering".

8.1.4 Aliasing and Backup-Path

When a station is multi-homed to multiple NVEs, it is possible that only a single NVE learns a set of the MAC addresses associated with traffic transmitted by the station. In this situation, remote NVEs receive MAC advertisement routes for these addresses from a single NVE, even though multiple NVEs are connected to the multi-homed station. As a result, the remote NVEs are not able to effectively load-balance traffic among the NVEs connected to the multi-homed Ethernet segment. This could be the case, for example, when the NVEs perform data-path learning on the access, and the load-balancing function on the station hashes traffic from a given source MAC address to a single NVE. This also occurs when the NVEs rely on control-plane learning on the access (e.g., using ARP), since ARP traffic will be hashed to a single link in the LAG.

To alleviate this issue, EVPN introduces the concept of Aliasing. Aliasing refers to the ability of an NVE to signal that it has reachability to a given locally attached Ethernet segment, even when it has learned no MAC addresses from that segment. The Ethernet A-D route per EVI is used to that end. Remote NVEs that receive MAC advertisement routes with a non-zero ESI should consider the MAC address as reachable via all NVEs that advertise reachability to the relevant segment using Ethernet A-D routes with the same ESI and with the Single-Active flag reset.

Backup-Path is a closely related function, albeit one that applies to the case where the redundancy mode is Single-Active. In this case, the NVE also signals that it has reachability to a given locally attached Ethernet Segment using the Ethernet A-D route.
Remote NVEs that receive the MAC advertisement routes with a non-zero ESI should consider the MAC address as reachable via the advertising NVE. Furthermore, the remote NVEs should install a Backup-Path for that MAC to the NVE that had advertised reachability to the relevant segment using an Ethernet A-D route with the same ESI and with the Single-Active flag set.

8.1.5 DF Election

If a host is multi-homed to two or more NVEs on an Ethernet segment operating in All-Active redundancy mode, then for a given EVI only one of these NVEs, termed the Designated Forwarder (DF), is responsible for sending it broadcast, multicast, and, if configured for that EVI, unknown unicast frames.

This is required in order to prevent duplicate delivery of multi-destination frames to a multi-homed host or VM in the case of All-Active redundancy.

In NVEs where .1Q tagged frames are received from hosts, the DF election should be performed based on host VLAN IDs (VIDs) per section 8.5 of [RFC7432]. Furthermore, multi-homing PEs of a given Ethernet Segment MAY perform DF election using configured IDs such as VNI, EVI, normalized VIDs, etc., as long as the IDs are configured consistently across the multi-homing PEs.

In GWs where VXLAN-encapsulated frames are received, the DF election is performed on VNIs. Again, it is assumed that for a given Ethernet Segment, VNIs are unique and consistent (e.g., no duplicate VNIs exist).

8.2 Impact on EVPN BGP Routes & Attributes

Since multi-homing is supported in this scenario, the entire set of BGP routes and attributes defined in [RFC7432] is used. The setting of the Ethernet Tag field in the MAC Advertisement, Ethernet AD per EVI, and Inclusive Multicast routes follows that of section 5.1.3.
Furthermore, the setting of the VNI field in the MAC Advertisement and Ethernet AD per EVI routes follows that of section 5.1.3.

8.3 Impact on EVPN Procedures

Two cases need to be examined here, depending on whether the NVEs are operating in Single-Active or in All-Active redundancy mode.

First, let's consider the case of Single-Active redundancy mode. Here, the hosts are multi-homed to a set of NVEs, but only a single NVE is active at a given point in time for a given VNI. In this case, aliasing is not required and split-horizon filtering may not be required. Other functions, such as multi-homed Ethernet segment auto-discovery, fast convergence and mass withdraw, backup path, and DF election, are required.

Second, let's consider the case of All-Active redundancy mode. In this case, out of all the EVPN multi-homing features listed in section 8.1, the use of the VXLAN or NVGRE encapsulation impacts the split-horizon and aliasing features, since those two rely on the MPLS client layer. Given that this MPLS client layer is absent with these types of encapsulations, alternative procedures and mechanisms are needed to provide the required functions. Those are discussed in detail next.

8.3.1 Split Horizon

In EVPN, an MPLS label is used for split-horizon filtering to support All-Active multi-homing. An ingress NVE adds a label corresponding to the site of origin (aka the ESI label) when encapsulating the packet. The egress NVE checks the ESI label when attempting to forward a multi-destination frame out an interface. If the label corresponds to the same site identifier (ESI) associated with that interface, the packet is dropped. This prevents the occurrence of forwarding loops.
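The egress check described above can be sketched in Python. The sketch rests on two assumptions. First, the egress NVE keeps a table that maps the ESI labels it has advertised to the corresponding ESIs. Second, each local access interface is tagged with the ESI it belongs to, or None when it is single-homed. The names AccessInterface and split_horizon_egress, and all label and port values, are hypothetical. DF filtering is deliberately left out, so that only the split-horizon step is shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AccessInterface:
    name: str
    esi: Optional[str]  # None for an interface not on a multi-homed segment


def split_horizon_egress(
    interfaces: List[AccessInterface],
    esi_label_table: Dict[int, str],
    esi_label: Optional[int],
) -> List[str]:
    """Return the local interfaces a multi-destination frame may be sent out of.

    esi_label_table maps ESI labels advertised by this egress NVE to ESIs;
    esi_label is the ESI label carried in the received frame (None if absent).
    DF filtering is not modeled here.
    """
    origin_esi = esi_label_table.get(esi_label) if esi_label is not None else None
    # Drop the frame on any interface attached to the segment it came from.
    return [i.name for i in interfaces if origin_esi is None or i.esi != origin_esi]


# Hypothetical example: two multi-homed segments and one single-homed port.
ports = [
    AccessInterface("port-1", "ESI-1"),
    AccessInterface("port-2", "ESI-2"),
    AccessInterface("port-3", None),
]
labels = {3001: "ESI-1", 3002: "ESI-2"}
print(split_horizon_egress(ports, labels, 3001))  # ['port-2', 'port-3']
```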
Since VXLAN and NVGRE encapsulations do not include the ESI label, other means of performing the split-horizon filtering function must be devised for these encapsulations. The following approach is recommended for split-horizon filtering when VXLAN (or NVGRE) encapsulation is used.

Every NVE tracks the IP address(es) associated with the other NVE(s) with which it shares multi-homed Ethernet Segments. When the NVE receives a multi-destination frame from the overlay network, it examines the source IP address in the tunnel header, which corresponds to the ingress NVE. It then filters out the frame on all local interfaces connected to Ethernet Segments that are shared with the ingress NVE. With this approach, the ingress NVE is required to perform replication locally to all directly attached Ethernet Segments (regardless of the DF election state) for all flooded traffic arriving from the access interfaces (i.e., from the hosts). This approach is referred to as "Local Bias". Its advantage is that only a single IP address needs to be used per NVE for split-horizon filtering, as opposed to requiring an IP address per Ethernet Segment per NVE.

In order to allow proper operation of split-horizon filtering among the same group of multi-homing PE devices, a given Ethernet Segment MUST NOT be configured with a mix of PE devices where some use MPLS over GRE encapsulation running the [RFC7432] split-horizon filtering procedures and others use VXLAN/NVGRE encapsulation running the Local Bias procedures.

8.3.2 Aliasing and Backup-Path

The Aliasing and Backup-Path procedures for VXLAN/NVGRE encapsulation are very similar to the ones for MPLS.
In the case of MPLS, the Ethernet A-D route per EVI is used for Aliasing when the corresponding Ethernet Segment operates in All-Active multi-homing. The same route is used for Backup-Path when the corresponding Ethernet Segment operates in Single-Active multi-homing. In the case of VXLAN/NVGRE, the same route is used for Aliasing and Backup-Path. The only difference is that the Ethernet Tag and VNI fields in the Ethernet A-D per EVI route are set as described in section 5.1.3.

8.3.3 Unknown Unicast Traffic Designation

In EVPN, when an ingress PE uses ingress replication to flood unknown unicast traffic to egress PEs, the ingress PE uses a different EVPN MPLS label (from the one used for known unicast traffic) to identify such BUM traffic. The egress PEs use this label to identify such BUM traffic and thus apply DF filtering for All-Active multi-homed sites. Without unknown unicast traffic designation, and with unknown unicast flooding enabled, there can be transient duplicate traffic to All-Active multi-homed sites under the following condition. The host MAC address has been learned by the egress PE(s) and advertised to the ingress PE, but the MAC advertisement has not yet been received or processed by the ingress PE. The host MAC address is therefore unknown on the ingress PE but known on the egress PE(s). When a packet destined to that host MAC address arrives on the ingress PE, the ingress PE floods it via ingress replication to all the egress PE(s). Since the MAC address is known to the egress PE(s), multiple copies are sent to the All-Active multi-homed site.
It should be noted that such transient packet duplication only happens when all of the following hold:

   a) the destination host is multi-homed via All-Active redundancy mode,
   b) flooding of unknown unicast is enabled in the network,
   c) ingress replication is used, and
   d) traffic for the destination host arrives on the ingress PE before it learns the host MAC address via BGP EVPN advertisement.

If it is desired to avoid such transient packet duplication (however low its probability may be), then VXLAN-GPE encapsulation needs to be used between these PEs. In addition, the ingress PE needs to set the BUM Traffic bit (B bit) [VXLAN-GPE] to indicate that this is ingress-replicated BUM traffic.

9 Support for Multicast

The EVPN Inclusive Multicast Ethernet Tag (IMET) route is used to discover the multicast tunnels among the endpoints associated with a given EVI (e.g., a given VNI) for VLAN-based service and a given for VLAN-aware bundle service. All fields of this route are set as described in section 5.1.3. The Originating Router's IP Address field is set to the NVE's IP address. This route is tagged with the PMSI Tunnel attribute, which is used to encode the type of multicast tunnel to be used as well as the multicast tunnel identifier. The tunnel encapsulation is encoded by adding the BGP Encapsulation extended community as per section 5.1.1. For example, the PMSI Tunnel attribute may indicate that the multicast tunnel is of type PIM-SM, whereas the BGP Encapsulation extended community may indicate that the encapsulation for that tunnel is of type VXLAN.
The following tunnel types, as defined in [RFC6514], can be used in the PMSI Tunnel attribute for VXLAN/NVGRE:

   + 3 - PIM-SSM Tree
   + 4 - PIM-SM Tree
   + 5 - BIDIR-PIM Tree
   + 6 - Ingress Replication

Except for Ingress Replication, this multicast tunnel serves two purposes. The PE originating the route uses it to send multicast traffic to other PEs. PEs that receive this route use it to receive the traffic originated by hosts connected to the PE that originated the route.

In the scenario where the multicast tunnel is a tree, both the Inclusive and the Aggregate Inclusive variants may be used. In the former case, a multicast tree is dedicated to a VNI, whereas in the latter, a multicast tree is shared among multiple VNIs. For VNI-based service, the Aggregate Inclusive mode is accomplished by having the NVEs advertise multiple IMET routes with different Route Targets (one per VNI) but with the same tunnel identifier encoded in the PMSI Tunnel attribute. For VNI-aware bundle service, the Aggregate Inclusive mode is accomplished by having the NVEs advertise multiple IMET routes with a different VNI encoded in the Ethernet Tag field, but with the same tunnel identifier encoded in the PMSI Tunnel attribute.

10 Data Center Interconnections - DCI

For DCI, the following two main scenarios are considered when connecting data centers that run the EVPN overlay described here over an MPLS/IP core network:

   - Scenario 1: DCI using GWs
   - Scenario 2: DCI using ASBRs

The following two subsections describe the operations for each of these scenarios.

10.1 DCI using GWs

This is the typical scenario for interconnecting data centers over a WAN.
In this scenario, EVPN routes are terminated and processed in each GW. MAC/IP routes are always re-advertised from the DC to the WAN. In the WAN-to-DC direction, however, they are not re-advertised if the unknown MAC route (and default IP route) are used in the NVEs. In this scenario, each GW maintains a MAC-VRF (and/or IP-VRF) for each EVI. The main advantage of this approach is that NVEs do not need to maintain MAC and IP addresses from any remote data centers when the default IP route and unknown MAC route are used. That is, they only need to maintain routes that are local to their own DC. When the default IP route and unknown MAC route are used, any unknown IP and MAC packets from the NVEs are forwarded to the GWs, where all the VPN MAC and IP routes are maintained. This approach significantly reduces the size of the MAC-VRF and IP-VRF at the NVEs. Furthermore, it results in faster convergence upon a link or NVE failure in a multi-homed network or device redundancy scenario. This is because the failure-related BGP routes (such as mass withdraw messages) do not need to be propagated all the way to the remote NVEs in the remote DCs. This approach is described in detail in section 3.4 of [DCI-EVPN-OVERLAY].

10.2 DCI using ASBRs

This approach can be considered the opposite of the first approach. It favors simplification at the DCI devices over the NVEs: larger MAC-VRF (and IP-VRF) tables need to be maintained on the NVEs, whereas DCI devices do not need to maintain any MAC (and IP) forwarding tables. Furthermore, DCI devices do not need to terminate and process routes related to multi-homing. Instead, they relay these messages for the establishment of an end-to-end LSP. In other words, DCI devices in this approach operate similarly to ASBRs for inter-AS option B (section 10 of [RFC4364]).
This requires locally assigned VNIs to be used just like downstream-assigned MPLS VPN labels; for all practical purposes, the VNIs function like 24-bit VPN labels. This approach is equally applicable to data centers (or Carrier Ethernet networks) with MPLS encapsulation.

In inter-AS option B, an ASBR may receive an EVPN route from its DC over iBGP and re-advertise it to other ASBRs. When it does so, it rewrites the BGP next hop to itself, thus losing the identity of the PE that originated the advertisement. This rewrite of the BGP next hop adversely impacts the EVPN Mass Withdraw route (Ethernet A-D per ES) and its procedure. However, it does not impact the EVPN Aliasing mechanism/procedure. When the Aliasing routes (Ethernet A-D per EVI) are advertised, the receiving PE first resolves a MAC address for a given EVI into its corresponding and subsequently resolves the into multiple paths (and their associated next hops) via which the is reachable. Aliasing and MAC routes are both advertised on a per-EVI basis, and they use the same RD and RT (per EVI). The receiving PE can therefore associate them together on a per-BGP-path basis (e.g., per originating PE) and thus perform recursive route resolution. For example, a MAC is reachable via an which, in turn, is reachable via a set of BGP paths; thus, the MAC is reachable via that set of BGP paths. On a per-EVI basis, the association of MAC routes and the corresponding Aliasing route is fixed and determined by the same RD and RT. There is thus no ambiguity when the BGP next hop for these routes is rewritten as these routes pass through ASBRs. The receiving PE may receive multiple Aliasing routes for the same EVI from a single next hop (a single ASBR), and it can still create multiple paths toward that .
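The two-step resolution described in the preceding paragraph can be sketched in Python. The sketch is scoped to a single EVI. Following the aliasing description in section 8.1.4, it keys the intermediate step by the ESI carried in the MAC advertisement. Paths are distinguished by the RD of the originating PE's route, so two Aliasing routes that arrive via the same rewritten next hop still yield two paths. All class, field, and value names are hypothetical. This is a sketch under those assumptions, not the procedure of any particular implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Path:
    rd: str        # RD of the originating PE's route (shared with its MAC routes)
    next_hop: str  # BGP next hop, possibly rewritten by an ASBR
    vni: int       # VNI (or MPLS label) carried in the route


class EviAliasingTable:
    """Per-EVI sketch of recursive resolution: MAC -> ESI -> set of paths."""

    def __init__(self) -> None:
        self.mac_to_esi: Dict[str, str] = {}
        self.esi_to_paths: Dict[str, Set[Path]] = defaultdict(set)

    def add_mac_route(self, mac: str, esi: str) -> None:
        self.mac_to_esi[mac] = esi

    def add_aliasing_route(self, esi: str, path: Path) -> None:
        # One Ethernet A-D per EVI route contributes one path. Routes with the
        # same next hop but different RDs stay distinct because Path includes the RD.
        self.esi_to_paths[esi].add(path)

    def resolve(self, mac: str) -> List[Path]:
        esi = self.mac_to_esi.get(mac)
        if esi is None:
            return []
        return sorted(self.esi_to_paths.get(esi, set()), key=lambda p: p.rd)


# Hypothetical example: both Aliasing routes arrive with next hop ASBR2.
table = EviAliasingTable()
table.add_mac_route("M1", "ESI-1")
table.add_aliasing_route("ESI-1", Path("RD-PE1", "ASBR2", 10))
table.add_aliasing_route("ESI-1", Path("RD-PE2", "ASBR2", 20))
for path in table.resolve("M1"):
    print(path.rd, path.next_hop, path.vni)
# RD-PE1 ASBR2 10
# RD-PE2 ASBR2 20
```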
However, the situation differs for the Mass Withdraw route (Ethernet A-D per ES). When the BGP next-hop address corresponding to the originating PE is rewritten, the association between this route and its corresponding MAC routes cannot be made based on their RDs and RTs. This is because the RD for the Mass Withdraw route is different from the one for the MAC routes. Therefore, the functionality needed at the ASBRs and the receiving PEs depends on whether the Mass Withdraw route is originated and whether there is a need to handle route resolution ambiguity for this route. The following two subsections describe the functionality needed by the ASBRs and the receiving PEs, depending on whether the NVEs reside in hypervisors or in ToRs.

10.2.1 ASBR Functionality with Single-Homing NVEs

When NVEs reside in hypervisors as described in section 7.1, there is no multi-homing. Thus, there is no need for the originating NVE to send Ethernet A-D per ES or Ethernet A-D per EVI routes. However, as noted in section 7, a single-homing ingress NVE may interact with multi-homing egress NVEs attached to a given Ethernet segment. To take advantage of fast convergence, aliasing, and backup-path in that case, the single-homing NVE should be able to receive and process Ethernet AD per ES and Ethernet AD per EVI routes. The handling of these routes is described in the next section.

10.2.2 ASBR Functionality with Multi-Homing NVEs

When NVEs reside in ToRs and operate in multi-homing redundancy mode, then, as described in section 8, there is a need for the originating multi-homing NVE to send Ethernet A-D per ES route(s) (used for mass withdraw) and Ethernet A-D per EVI routes (used for aliasing).
As described above, the rewrite of the BGP next hop by ASBRs creates ambiguities when Ethernet A-D per ES routes are received by a remote NVE in a different AS. The receiving NVE cannot associate that route with the MAC/IP routes of that Ethernet Segment advertised by the same originating NVE. This ambiguity inhibits the function of mass-withdraw per ES by the receiving NVE in a different AS.

As an example, consider a scenario where a CE is multi-homed to PE1 and PE2, and these PEs are connected via ASBR1 and then ASBR2 to the remote PE3 (Figure 1). Furthermore, suppose that PE1 receives M1 from the CE but PE2 does not. Therefore, PE1 advertises Eth A-D per ES1, Eth A-D per EVI1, and M1, whereas PE2 only advertises Eth A-D per ES1 and Eth A-D per EVI1. ASBR1 receives all five of these advertisements and passes them to ASBR2 with itself as the BGP next hop. ASBR2, in turn, passes them to the remote PE3 with itself as the BGP next hop. PE3 receives these five routes, all of which have the same BGP next hop (i.e., ASBR2). Furthermore, the two Ethernet A-D per ES routes received by PE3 have the same information, i.e., the same ESI and the same BGP next hop. Both of these routes are maintained by the BGP process in PE3, because they have different RDs and are thus treated as different BGP routes. However, information from only one of them is used in the L2 routing table (L2 RIB).
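The collapse described in the preceding paragraph can be modeled with a small Python sketch. For illustration only, it assumes that the BGP table holds Ethernet A-D per ES routes keyed by (RD, ESI). It further assumes that the L2 RIB keeps only the (ESI, BGP next hop) pairs derived from those routes. The function and variable names are hypothetical. With next hops rewritten to ASBR2, the two routes map to one L2 RIB entry, so removing either BGP route leaves the L2 RIB unchanged. With the original PE next hops preserved, as within a single AS, the same withdrawal does change it.

```python
from typing import Dict, Set, Tuple

# (RD, ESI) -> BGP next hop for received Ethernet A-D per ES routes.
BgpTable = Dict[Tuple[str, str], str]


def l2_rib(bgp_table: BgpTable) -> Set[Tuple[str, str]]:
    """Derive the (ESI, next hop) entries kept in the L2 RIB."""
    return {(esi, next_hop) for (_rd, esi), next_hop in bgp_table.items()}


def withdrawal_changes_l2_rib(bgp_table: BgpTable, key: Tuple[str, str]) -> bool:
    """Report whether withdrawing one BGP route alters the derived L2 RIB."""
    remaining = {k: v for k, v in bgp_table.items() if k != key}
    return l2_rib(remaining) != l2_rib(bgp_table)


# Next hops rewritten by ASBR2 (inter-AS option B): two BGP routes, one L2 RIB entry.
via_asbr: BgpTable = {("RD-PE1", "ESI-1"): "ASBR2", ("RD-PE2", "ESI-1"): "ASBR2"}
# Original next hops preserved (same AS), for comparison.
same_as: BgpTable = {("RD-PE1", "ESI-1"): "PE1", ("RD-PE2", "ESI-1"): "PE2"}

print(len(l2_rib(via_asbr)))                                     # 1
print(withdrawal_changes_l2_rib(via_asbr, ("RD-PE2", "ESI-1")))  # False
print(withdrawal_changes_l2_rib(same_as, ("RD-PE2", "ESI-1")))   # True
```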
                 PE1
                /   \
              CE     ASBR1---ASBR2---PE3
                \   /
                 PE2

                 Figure 1: Inter-AS Option B

Now suppose the AC between PE2 and the CE fails. PE2 sends an NLRI withdrawal for its Ethernet A-D per ES route, and this withdrawal is propagated to and received by PE3. The BGP process in PE3 removes the corresponding BGP route. However, it does not remove the associated information (namely, the ESI and BGP next hop) from the L2 routing table (L2 RIB), because it still has the other Ethernet A-D per ES route (originated by PE1) with the same information. That is why the mass-withdraw mechanism does not work when doing DCI with inter-AS option B. However, as described previously, the aliasing function works. So does "mass-withdraw per EVI", which is associated with withdrawing the EVPN route used for Aliasing, i.e., the Ethernet A-D per EVI route.

In the above example, PE3 receives two Aliasing routes with the same BGP next hop (ASBR2) but different RDs. One of the Aliasing routes has the same RD as the advertised MAC route (M1). Upon receiving the two Aliasing routes, PE3 follows the route resolution procedure specified in [RFC7432]. That is, it resolves M1 to and subsequently resolves to a BGP path list with two paths, along with the corresponding VNIs/MPLS labels (one associated with PE1 and the other associated with PE2). Even though both paths are advertised by the same BGP next hop (ASBR2), the receiving PE3 can handle them properly. Therefore, M1 is reachable via two paths. This creates two end-to-end LSPs for M1, one from PE3 to PE1 and one from PE3 to PE2. When PE3 forwards traffic destined to M1, it can load-balance between the two LSPs.
Route resolution for Aliasing routes with the same BGP next hop is not explicitly mentioned in [RFC7432]. It is nonetheless the expected operation, and thus it is elaborated here.

Now suppose the AC between PE2 and the CE fails, and PE2 sends NLRI withdrawals for its Ethernet A-D per EVI routes. These withdrawals are propagated to and received by PE3. PE3 then removes the corresponding Aliasing route and updates the path list, i.e., it removes the path corresponding to PE2. Therefore, all the corresponding MAC routes for that that point to that path list will now have the updated path list with a single path associated with PE1. This action can be considered mass-withdraw at the per-EVI level. Mass-withdraw at the per-EVI level has a longer convergence time than mass-withdraw at the per-ES level. However, it converges much faster than withdrawal done on a per-MAC basis.

If a PE becomes detached from a given ES, then in addition to withdrawing its previously advertised Ethernet AD per ES routes, it MUST also withdraw its previously advertised Ethernet AD per EVI routes for that ES. Consider a remote PE that is separated from the withdrawing PE by one or more EVPN inter-AS option B ASBRs. For such a PE, the withdrawal of the Ethernet AD per ES routes is not actionable. However, a remote PE is able to correlate a previously advertised Ethernet AD per EVI route with any MAC/IP Advertisement routes also advertised by the withdrawing PE for that . Hence, when it receives the withdrawal of an Ethernet AD per EVI route, it SHOULD remove the withdrawing PE as a next hop for all MAC addresses associated with that .

In the previous example, when the AC between PE2 and the CE fails, PE2 will withdraw its Ethernet AD per ES and per EVI routes.
When PE3 receives the withdrawal of an Ethernet AD per EVI route, it removes PE2 as a valid next hop for all MAC addresses associated with the corresponding . Therefore, all the MAC next hops for that will now have a single next hop, namely the LSP to PE1.

In summary, aliasing (and backup path) functionality should work as is for inter-AS option B, without requiring any additional functionality in ASBRs or PEs. However, for inter-AS option B, the mass-withdraw functionality falls back from per-ES mode to per-EVI mode. PEs receiving a mass-withdraw route from the same AS take action on the Ethernet A-D per ES route, whereas PEs receiving a mass-withdraw route from a different AS take action on the Ethernet A-D per EVI route.

11 Acknowledgement

The authors would like to thank Aldrin Isaac, David Smith, John Mullooly, Thomas Nadeau, Samir Thoria, and Jorge Rabadan for their valuable comments and feedback. The authors would also like to thank Jakob Heitz for his contribution on section 10.2.

12 Security Considerations

This document uses IP-based tunnel technologies to support data plane transport. Consequently, the security considerations of those tunnel technologies apply. This document defines support for VXLAN [RFC7348] and NVGRE [RFC7637] encapsulations. The security considerations from those RFCs apply to the data plane aspects of this document.

As with [TUNNEL-ENCAP], any modification of the information that is used to form encapsulation headers, to choose a tunnel type, or to choose a particular tunnel for a particular payload type may lead to user data packets getting misrouted, misdelivered, and/or dropped.
More broadly, the security considerations for the transport of IP reachability information using BGP are discussed in [RFC4271] and [RFC4272], and are equally applicable to the extensions described in this document.

13 IANA Considerations

IANA has allocated the following BGP Tunnel Encapsulation Attribute Tunnel Types:

   8    VXLAN Encapsulation
   9    NVGRE Encapsulation
   10   MPLS Encapsulation
   11   MPLS in GRE Encapsulation
   12   VXLAN GPE Encapsulation

14 References

14.1 Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, .

[RFC7432] Sajassi, A., et al., "BGP MPLS-Based Ethernet VPN", RFC 7432, February 2015.

[RFC7348] Mahalingam, M., et al., "VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014.

[RFC7637] Garg, P., et al., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, September 2015.

[TUNNEL-ENCAP] Rosen, E., et al., "The BGP Tunnel Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-03, work in progress, May 31, 2016.

[VXLAN-GPE] Maino, F., et al., "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-03, work in progress, October 25, 2016.

[RFC4023] Worster, T., et al., "Encapsulating MPLS in IP or Generic Routing Encapsulation (GRE)", RFC 4023, March 2005.

14.2 Informative References

[RFC7209] Sajassi, A., et al., "Requirements for Ethernet VPN (EVPN)", RFC 7209, May 2014.

[RFC4272] Murphy, S., "BGP Security Vulnerabilities Analysis", RFC 4272, January 2006.

[RFC7364] Narten, T., et al., "Problem Statement: Overlays for Network Virtualization", RFC 7364, October 2014.
[RFC7365] Lasserre, M., et al., "Framework for Data Center (DC) Network Virtualization", RFC 7365, October 2014.

[DCI-EVPN-OVERLAY] Rabadan, J., et al., "Interconnect Solution for EVPN Overlay networks", draft-ietf-bess-dci-evpn-overlay-04, work in progress, February 29, 2016.

[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006.

[RFC4364] Rosen, E., et al., "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006.

[RFC6514] Aggarwal, R., et al., "BGP Encodings and Procedures for Multicast in MPLS/BGP IP VPNs", RFC 6514, February 2012.

[GENEVE] Gross, J., et al., "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-05, work in progress, September 2017.

[EVPN-GENEVE] Boutros, S., et al., "EVPN control plane for Geneve", draft-boutros-bess-evpn-geneve-00, work in progress, June 2017.

Contributors

   S. Salam
   K. Patel
   D. Rao
   S. Thoria
   D. Cai
   Cisco

   Y. Rekhter
   A. Issac
   Wen Lin
   Nischal Sheth
   Juniper

   L. Yong
   Huawei

Authors' Addresses

   Ali Sajassi
   Cisco
   USA
   Email: sajassi@cisco.com

   John Drake
   Juniper Networks
   USA
   Email: jdrake@juniper.net

   Nabil Bitar
   Nokia
   USA
   Email: nabil.bitar@nokia.com

   R. Shekhar
   Juniper
   USA
   Email: rshekhar@juniper.net

   James Uttaro
   AT&T
   USA
   Email: uttaro@att.com

   Wim Henderickx
   Nokia
   USA
   Email: wim.henderickx@nokia.com