BESS Workgroup                                       A. Sajassi (Editor)
INTERNET-DRAFT                                                     Cisco
Intended Status: Standards Track                       J. Drake (Editor)
                                                                 Juniper
                                                                N. Bitar
                                                                   Nokia
                                                              R. Shekhar
                                                                 Juniper
                                                               J. Uttaro
                                                                    AT&T
                                                                      W.
                                                              Henderickx
                                                                   Nokia

Expires: September 27, 2017                               March 27, 2017


          A Network Virtualization Overlay Solution using EVPN
                    draft-ietf-bess-evpn-overlay-08

Abstract

   This document describes how Ethernet VPN (EVPN) can be used as a
   Network Virtualization Overlay (NVO) solution and explores the
   various tunnel encapsulation options over IP and their impact on the
   EVPN control-plane and procedures. In particular, the following
   encapsulation options are analyzed: VXLAN, NVGRE, and MPLS over GRE.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1 Introduction
   2 Specification of Requirements
   3 Terminology
   4 EVPN Features
   5 Encapsulation Options for EVPN Overlays
     5.1 VXLAN/NVGRE Encapsulation
       5.1.1 Virtual Identifiers Scope
         5.1.1.1 Data Center Interconnect with Gateway
         5.1.1.2 Data Center Interconnect without Gateway
       5.1.2 Virtual Identifiers to EVI Mapping
         5.1.2.1 Auto Derivation of RT
       5.1.3 Constructing EVPN BGP Routes
     5.2 MPLS over GRE
   6 EVPN with Multiple Data Plane Encapsulations
   7 Single-Homing NVEs - NVE Residing in Hypervisor
     7.1 Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE
         Encapsulation
     7.2 Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation
   8 Multi-Homing NVEs - NVE Residing in ToR Switch
     8.1 EVPN Multi-Homing Features
       8.1.1 Multi-homed Ethernet Segment Auto-Discovery
       8.1.2 Fast Convergence and Mass Withdraw
       8.1.3 Split-Horizon
       8.1.4 Aliasing and Backup-Path
       8.1.5 DF Election
     8.2 Impact on EVPN BGP Routes & Attributes
     8.3 Impact on EVPN Procedures
       8.3.1 Split Horizon
       8.3.2 Aliasing and Backup-Path
       8.3.3 Unknown Unicast Traffic Designation
   9 Support for Multicast
   10 Data Center Interconnections - DCI
     10.1 DCI using GWs
     10.2 DCI using ASBRs
       10.2.1 ASBR Functionality with Single-Homing NVEs
       10.2.2 ASBR Functionality with Multi-Homing NVEs
   11 Acknowledgement
   12 Security Considerations
   13 IANA Considerations
   14 References
     14.1 Normative References
     14.2 Informative References
   Contributors
   Authors' Addresses

1 Introduction

   In the context of this document, a Network Virtualization Overlay
   (NVO) is a solution to address the requirements of a multi-tenant
   data center, especially one with virtualized hosts, e.g., Virtual
   Machines (VMs) or virtual workloads.
   The key requirements of such a solution, as described in
   [Problem-Statement], are:

   - Isolation of network traffic per tenant

   - Support for a large number of tenants (tens or hundreds of
     thousands)

   - Extending L2 connectivity among different VMs belonging to a given
     tenant segment (subnet) across different PODs within a data center
     or between different data centers

   - Allowing a given VM to move between different physical points of
     attachment within a given L2 segment

   The underlay network for NVO solutions is assumed to provide IP
   connectivity between NVO endpoints (NVEs).

   This document describes how Ethernet VPN (EVPN) can be used as an NVO
   solution and explores the applicability of EVPN functions and
   procedures. In particular, it describes the various tunnel
   encapsulation options for EVPN over IP, and their impact on the EVPN
   control-plane and procedures for two main scenarios:

   a) single-homing NVEs - when an NVE resides in the hypervisor, and
   b) multi-homing NVEs - when an NVE resides in a Top of Rack (ToR)
      device

   The possible encapsulation options for EVPN overlays that are
   analyzed in this document are:

   - VXLAN and NVGRE
   - MPLS over GRE

   Before getting into the description of the different encapsulation
   options for EVPN over IP, it is important to highlight the EVPN
   solution's main features, how those features are currently supported,
   and any impact that the encapsulation has on those features.

2 Specification of Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [KEYWORDS].

3 Terminology

   Most of the terminology used in this document comes from [RFC7432]
   and [NVO3-FRWK].

   NVO: Network Virtualization Overlay

   NVE: Network Virtualization Endpoint

   VNI: Virtual Network Identifier (for VXLAN)

   VSID: Virtual Subnet Identifier (for NVGRE)

   EVPN: Ethernet VPN

   EVI: An EVPN instance spanning the Provider Edge (PE) devices
   participating in that EVPN.

   MAC-VRF: A Virtual Routing and Forwarding table for Media Access
   Control (MAC) addresses on a PE.

   Ethernet Segment (ES): When a customer site (device or network) is
   connected to one or more PEs via a set of Ethernet links, then that
   set of links is referred to as an 'Ethernet segment'.

   Ethernet Segment Identifier (ESI): A unique non-zero identifier that
   identifies an Ethernet segment is called an 'Ethernet Segment
   Identifier'.

   Ethernet Tag: An Ethernet tag identifies a particular broadcast
   domain, e.g., a VLAN. An EVPN instance consists of one or more
   broadcast domains.

   PE: Provider Edge device.

   Single-Active Redundancy Mode: When only a single PE, among all the
   PEs attached to an Ethernet segment, is allowed to forward traffic
   to/from that Ethernet segment for a given VLAN, then the Ethernet
   segment is defined to be operating in Single-Active redundancy mode.

   All-Active Redundancy Mode: When all PEs attached to an Ethernet
   segment are allowed to forward known unicast traffic to/from that
   Ethernet segment for a given VLAN, then the Ethernet segment is
   defined to be operating in All-Active redundancy mode.

4 EVPN Features

   EVPN was originally designed to support the requirements detailed in
   [RFC7209] and therefore has the following attributes which directly
   address control plane scaling and ease of deployment issues.

   1) Control plane information is distributed with BGP and Broadcast
   and Multicast traffic is sent using a shared multicast tree or with
   ingress replication.

   2) Control plane learning is used for MAC (and IP) addresses instead
   of data plane learning.
   The latter requires the flooding of unknown unicast and ARP frames,
   whereas the former does not require any flooding.

   3) Route Reflectors are used to reduce a full mesh of BGP sessions
   among PE devices to a single BGP session between a PE and the RR.
   Furthermore, RR hierarchy can be leveraged to scale the number of BGP
   routes on the RR.

   4) Auto-discovery via BGP is used to discover PE devices
   participating in a given VPN, PE devices participating in a given
   redundancy group, tunnel encapsulation types, multicast tunnel type,
   multicast members, etc.

   5) All-Active multihoming is used. This allows a given customer
   device (CE) to have multiple links to multiple PEs, and traffic
   to/from that CE fully utilizes all of these links.

   6) When a link between a CE and a PE fails, the PEs for that EVI are
   notified of the failure via the withdrawal of a single EVPN route.
   This allows those PEs to remove the withdrawing PE as a next hop for
   every MAC address associated with the failed link. This is termed
   'mass withdrawal'.

   7) BGP route filtering and constrained route distribution are
   leveraged to ensure that the control plane traffic for a given EVI is
   only distributed to the PEs in that EVI.

   8) When an 802.1Q interface is used between a CE and a PE, each VLAN
   ID (VID) on that interface can be mapped onto a bridge table (for up
   to 4094 such bridge tables). All these bridge tables may be mapped
   onto a single MAC-VRF (in the case of VLAN-aware bundle service).

   9) VM Mobility mechanisms ensure that all PEs in a given EVI know
   the ES with which a given VM, as identified by its MAC and IP
   addresses, is currently associated.

   10) Route Targets are used to allow the operator (or customer) to
   define a spectrum of logical network topologies including mesh, hub &
   spoke, and extranets (e.g., a VPN whose sites are owned by different
   enterprises), without the need for proprietary software or the aid of
   other virtual or physical devices.

   Because the design goal for NVO is millions of instances per common
   physical infrastructure, the scaling properties of the control plane
   for NVO are extremely important. EVPN and the extensions described
   herein are designed with this level of scalability in mind.

5 Encapsulation Options for EVPN Overlays

5.1 VXLAN/NVGRE Encapsulation

   Both VXLAN and NVGRE are examples of technologies that provide a data
   plane encapsulation which is used to transport a packet over the
   common physical IP infrastructure between Network Virtualization
   Edges (NVEs) - e.g., VXLAN Tunnel End Points (VTEPs) in a VXLAN
   network. Both of these technologies include the identifier of the
   specific NVO instance, Virtual Network Identifier (VNI) in VXLAN and
   Virtual Subnet Identifier (VSID) in NVGRE, in each packet. In the
   remainder of this document we use VNI as the representation for the
   NVO instance, with the understanding that VSID can equally be used if
   the encapsulation is NVGRE, unless it is stated otherwise.

   Note that a Provider Edge (PE) is equivalent to an NVE/VTEP.

   VXLAN encapsulation is based on UDP, with an 8-byte header following
   the UDP header. VXLAN provides a 24-bit VNI, which typically provides
   a one-to-one mapping to the tenant VLAN ID, as described in
   [RFC7348]. In this scenario, the ingress VTEP does not include an
   inner VLAN tag on the encapsulated frame, and the egress VTEP
   discards the frames with an inner VLAN tag. This mode of operation in
   [RFC7348] maps to VLAN Based Service in [RFC7432], where a tenant
   VLAN ID gets mapped to an EVPN instance (EVI).
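
   As a non-normative illustration of the 8-byte VXLAN header and its
   24-bit VNI described above, the following Python sketch packs and
   parses such a header. The field layout (8 flag bits including the
   "I" flag, 24 reserved bits, the 24-bit VNI, and 8 reserved bits) is
   taken from [RFC7348]; the function names are hypothetical and exist
   only for this example.

```python
import struct

# "I" flag from RFC 7348: set when the VNI field carries a valid value.
VXLAN_FLAG_VNI_VALID = 0x08


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given 24-bit VNI (hypothetical helper)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, followed by 24 reserved (zero) bits.
    # Word 2: VNI in the top 24 bits, followed by 8 reserved (zero) bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN payload (hypothetical helper)."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set; VNI is not valid")
    return vni_word >> 8


header = pack_vxlan_header(5000)
print(header.hex())        # 0800000000138800
print(unpack_vni(header))  # 5000
```

   The sketch covers only the VXLAN header itself; the outer
   Ethernet/IP/UDP headers and the inner frame are out of its scope.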

   VXLAN also provides an option of including an inner VLAN tag in the
   encapsulated frame, if explicitly configured at the VTEP. This mode
   of operation can map to VLAN Bundle Service in [RFC7432] because all
   the tenant's tagged frames map to a single bridge table / MAC-VRF,
   and the inner VLAN tag is not used for lookup by the disposition PE
   when performing VXLAN decapsulation as described in section 6 of
   [RFC7348].

   [NVGRE] encapsulation is based on GRE encapsulation and it mandates
   the inclusion of the optional GRE Key field which carries the VSID.
   There is a one-to-one mapping between the VSID and the tenant VLAN
   ID, as described in [NVGRE], and the inclusion of an inner VLAN tag
   is prohibited. This mode of operation in [NVGRE] maps to VLAN Based
   Service in [RFC7432].

   As described in the next section, there is no change to the encoding
   of EVPN routes to support VXLAN or NVGRE encapsulation except for the
   use of the BGP Encapsulation extended community to indicate the
   encapsulation type (e.g., VXLAN or NVGRE). However, there is
   potential impact to the EVPN procedures depending on where the NVE is
   located (i.e., in hypervisor or ToR) and whether multi-homing
   capabilities are required.

5.1.1 Virtual Identifiers Scope

   Although VNIs are defined as 24-bit globally unique values, there are
   scenarios in which it is desirable to use a locally significant value
   for VNI, especially in the context of data center interconnect:

5.1.1.1 Data Center Interconnect with Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use VNIs as globally unique
   identifiers within a data center, then a Gateway needs to be employed
   at the edge of the data center network.
   This is because the Gateway will provide the functionality of
   translating the VNI when crossing network boundaries, which may align
   with operator span of control boundaries. As an example, consider the
   network of Figure 1 below. Assume there are three network operators:
   one for each of the DC1, DC2 and WAN networks. The Gateways at the
   edge of the data centers are responsible for translating the VNIs
   between the values used in each of the data center networks and the
   values used in the WAN.

                             +--------------+
                             |              |
          +---------+        |     WAN      |        +---------+
  +----+  |         +---+  +----+        +----+  +---+         |  +----+
  |NVE1|--|         |   |  |WAN |        |WAN |  |   |         |--|NVE3|
  +----+  |IP       |GW |--|Edge|        |Edge|--|GW | IP      |  +----+
  +----+  |Fabric   +---+  +----+        +----+  +---+ Fabric  |  +----+
  |NVE2|--|         |        |              |        |         |--|NVE4|
  +----+  +---------+        +--------------+        +---------+  +----+

  |<------ DC 1 ------>                             <------ DC2 ------>|

           Figure 1: Data Center Interconnect with Gateway

5.1.1.2 Data Center Interconnect without Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use locally assigned VNIs (e.g.,
   similar to MPLS labels), then there may be no need to employ Gateways
   at the edge of the data center network. More specifically, the VNI
   value that is used by the transmitting NVE is allocated by the NVE
   that is receiving the traffic (in other words, this is similar to a
   "downstream assigned" MPLS label). This allows the VNI space to be
   decoupled between different data center networks without the need for
   a dedicated Gateway at the edge of the data centers. This topic is
   covered in section 10.2.

                          +--------------+
                          |              |
          +---------+     |     WAN      |    +---------+
  +----+  |         |   +----+        +----+  |         |  +----+
  |NVE1|--|         |   |ASBR|        |ASBR|  |         |--|NVE3|
  +----+  |IP Fabric|---|    |        |    |--|IP Fabric|  +----+
  +----+  |         |   +----+        +----+  |         |  +----+
  |NVE2|--|         |     |              |    |         |--|NVE4|
  +----+  +---------+     +--------------+    +---------+  +----+

  |<------ DC 1 ----->                         <---- DC2 ------>|

           Figure 2: Data Center Interconnect with ASBR

5.1.2 Virtual Identifiers to EVI Mapping

   When the EVPN control plane is used in conjunction with VXLAN (or
   NVGRE encapsulation), two options for mapping the VXLAN VNI (or NVGRE
   VSID) to an EVI are possible:

   1. Option 1: Single Broadcast Domain per EVI

   In this option, a single Ethernet broadcast domain (e.g., subnet)
   represented by a VNI is mapped to a unique EVI. This corresponds to
   the VLAN Based service in [RFC7432], where a tenant-facing interface,
   logical interface (e.g., represented by a VLAN ID) or physical, gets
   mapped to an EVPN instance (EVI). As such, a BGP RD and RT are needed
   per VNI on every NVE. The advantage of this model is that it allows
   the BGP RT constraint mechanisms to be used in order to limit the
   propagation and import of routes to only the NVEs that are interested
   in a given VNI. The disadvantage of this model may be the
   provisioning overhead if RD and RT are not derived automatically from
   VNI.

   In this option, the MAC-VRF table is identified by the RT in the
   control plane and by the VNI in the data-plane. In this option, the
   specific MAC-VRF table corresponds to only a single bridge table.

   2. Option 2: Multiple Broadcast Domains per EVI

   In this option, multiple subnets each represented by a unique VNI are
   mapped to a single EVI.
   For example, if a tenant has multiple segments/subnets each
   represented by a VNI, then all the VNIs for that tenant are mapped to
   a single EVI - e.g., the EVI in this case represents the tenant and
   not a subnet. This corresponds to the VLAN-aware bundle service in
   [RFC7432]. The advantage of this model is that it doesn't require the
   provisioning of RD/RT per VNI. However, this is a moot point when
   compared to option 1 where auto-derivation is used. The disadvantage
   of this model is that routes would be imported by NVEs that may not
   be interested in a given VNI.

   In this option the MAC-VRF table is identified by the RT in the
   control plane and a specific bridge table for that MAC-VRF is
   identified by the Ethernet Tag ID in the control plane. In this
   option, the VNI in the data-plane is sufficient to identify a
   specific bridge table.

5.1.2.1 Auto Derivation of RT

   When the option of a single VNI per EVI is used, in order to simplify
   configuration, the RT used for EVPN can be auto-derived. RD can be
   auto-generated as described in [RFC7432] and RT can be auto-derived
   as described next.

   Since a gateway PE as depicted in Figure 1 participates in both the
   DCN and WAN BGP sessions, it is important that when RT values are
   auto-derived from VNIs, there is no conflict in RT spaces between DCN
   and WAN networks, assuming that both are operating within the same
   AS. Also, there can be scenarios where both VXLAN and NVGRE
   encapsulations may be needed within the same DCN and their
   corresponding VNIs are administered independently, which means VNI
   spaces can overlap.
   In order to ensure that no such conflict in RT spaces arises, RT
   values for DCNs are auto-derived as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              AS #             |A| TYPE| D-ID  |  Service ID   |
   +-----------------------------------------------+---------------+
   |       Service ID (Cont.)      |
   +-------------------------------+

   - The 2 bytes of the global admin field of the RT are set to the AS
     number.

   - The three least significant bytes of the local admin field of the
     RT are set to the VNI, VSID, I-SID, or VID.

   - The most significant bit of the local admin field of the RT is set
     as follows:
        0: auto-derived
        1: manually-derived

   - The next 3 bits of the most significant byte of the local admin
     field of the RT identify the space in which the other 3 bytes are
     defined. The following spaces are defined:
        0 : VID (802.1Q VLAN ID)
        1 : VXLAN
        2 : NVGRE
        3 : I-SID
        4 : EVI
        5 : dual-VID (QinQ VLAN ID)

   - The remaining 4 bits of the most significant byte of the local
     admin field of the RT identify the domain-id. The default value of
     domain-id is zero, indicating that only a single numbering space
     exists for a given technology. However, if more than one number
     space exists for a given technology (e.g., overlapping VXLAN
     spaces), then each number space needs to be identified by its
     corresponding domain-id, starting from 1.

5.1.3 Constructing EVPN BGP Routes

   In EVPN, an MPLS label identifying, for instance, a forwarding table
   is distributed by the egress PE via the EVPN control plane and is
   placed in the MPLS header of a given packet by the ingress PE. This
   label is used upon receipt of that packet by the egress PE for
   disposition of that packet.
   This is very similar to the use of the VNI by the egress NVE, with
   the difference being that an MPLS label has local significance while
   a VNI typically has global significance. Accordingly, and
   specifically to support the option of locally-assigned VNIs, the MPLS
   Label1 field in the MAC/IP Advertisement route, the MPLS label field
   in the Ethernet AD per EVI route, and the MPLS label field in the
   PMSI Tunnel Attribute of the Inclusive Multicast Ethernet Tag (IMET)
   route are used to carry the VNI. For the balance of this memo, the
   above MPLS label fields will be referred to as the VNI field. The VNI
   field is used for both local and global VNIs, and for either case the
   entire 24-bit field is used to encode the VNI value.

   For the VLAN-based service (a single VNI per MAC-VRF), the Ethernet
   Tag field in the MAC/IP Advertisement, Ethernet AD per EVI, and IMET
   route MUST be set to zero just as in the VLAN Based service in
   [RFC7432].

   For the VLAN-aware bundle service (multiple VNIs per MAC-VRF with
   each VNI associated with its own bridge table), the Ethernet Tag
   field in the MAC/IP Advertisement, Ethernet AD per EVI, and IMET
   route MUST identify a bridge table within a MAC-VRF and the set of
   Ethernet Tags for that EVI needs to be configured consistently on all
   PEs within that EVI. For locally-assigned VNIs, the value advertised
   in the Ethernet Tag field MUST be set to a VID just as in the
   VLAN-aware bundle service in [RFC7432]. Such a setting must be
   applied consistently on all PE devices participating in that EVI
   within a given domain. For global VNIs, the value advertised in the
   Ethernet Tag field SHOULD be set to a VNI as long as it matches the
   existing semantics of the Ethernet Tag, i.e., it identifies a bridge
   table within a MAC-VRF and the set of VNIs are configured
   consistently on each PE in that EVI.
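
   The VNI field and Ethernet Tag rules above can be sketched as
   follows. This is a non-normative Python illustration: it assumes the
   VNI fills the whole 3-octet field that would otherwise carry an MPLS
   label, and it selects the Ethernet Tag from the service type and the
   VNI scope. The names Service, encode_vni_field, and ethernet_tag are
   hypothetical and do not come from this document or from [RFC7432].

```python
from enum import Enum
from typing import Optional


class Service(Enum):
    """EVPN service types discussed in this section (names are illustrative)."""
    VLAN_BASED = "vlan-based"
    VLAN_AWARE_BUNDLE = "vlan-aware-bundle"


def encode_vni_field(vni: int) -> bytes:
    """Encode a VNI using the entire 24-bit field of an EVPN route."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return vni.to_bytes(3, "big")


def ethernet_tag(service: Service, vni: int,
                 vid: Optional[int] = None,
                 vni_is_global: bool = True) -> int:
    """Pick the Ethernet Tag value to advertise for a bridge table."""
    if service is Service.VLAN_BASED:
        return 0                      # MUST be zero for VLAN-based service
    if vni_is_global:
        return vni                    # SHOULD be the VNI for global VNIs
    if vid is None:
        raise ValueError("a VID is required with locally-assigned VNIs")
    return vid                        # MUST be a VID for locally-assigned VNIs


print(encode_vni_field(10010).hex())                         # 00271a
print(ethernet_tag(Service.VLAN_BASED, 10010))               # 0
print(ethernet_tag(Service.VLAN_AWARE_BUNDLE, 10010))        # 10010
print(ethernet_tag(Service.VLAN_AWARE_BUNDLE, 70000,
                   vid=100, vni_is_global=False))            # 100
```

   In the last call the VNI (70000) is locally assigned by the egress
   NVE, so the bridge table is identified in the control plane by the
   consistently configured VID (100) rather than by the VNI.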

   In order to indicate which type of data plane encapsulation (i.e.,
   VXLAN, NVGRE, MPLS, or MPLS in GRE) is to be used, the BGP
   Encapsulation extended community defined in [TUNNEL-ENCAP] and
   [RFC5512] is included with all EVPN routes (i.e., MAC/IP
   Advertisement, Ethernet AD per EVI, Ethernet AD per ESI, Inclusive
   Multicast Ethernet Tag, and Ethernet Segment) advertised by an egress
   PE. Five new values have been assigned by IANA to extend the list of
   encapsulation types defined in [TUNNEL-ENCAP] and they are listed in
   section 13.

   The MPLS encapsulation tunnel type, listed in section 13, is needed
   in order to distinguish between an advertising node that only
   supports non-MPLS encapsulations and one that supports MPLS and
   non-MPLS encapsulations. An advertising node that only supports MPLS
   encapsulation does not need to advertise any encapsulation tunnel
   types; i.e., if the BGP Encapsulation extended community is not
   present, then either MPLS encapsulation or a statically configured
   encapsulation is assumed.

   The Ethernet Segment and Ethernet AD per ESI routes MAY be advertised
   with multiple encapsulation types as long as they use the same EVPN
   multi-homing procedures (section 8.3.1, Split Horizon) - e.g., the
   mix of VXLAN and NVGRE encapsulation types is a valid one but not the
   mix of VXLAN and MPLS encapsulation types.

   The Next Hop field of the MP_REACH_NLRI attribute of the route MUST
   be set to the IPv4 or IPv6 address of the NVE. The remaining fields
   in each route are set as per [RFC7432].

   Note that the procedure defined here to use the MPLS Label field to
   carry the VNI in the presence of a Tunnel Encapsulation Extended
   Community specifying the use of a VNI is aligned with the procedures
   described in section 8.2.2.2 of [TUNNEL-ENCAP] ("When a Valid VNI has
   not been Signaled").
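
   Before moving on to MPLS over GRE, the RT auto-derivation rules of
   section 5.1.2.1 can be made concrete with a short, non-normative
   Python sketch. It builds the 6 bytes shown in that section's layout
   (2-byte AS number followed by the 4-byte local admin field). The
   function name derive_rt and the SPACE table are hypothetical, and the
   sketch assumes a 2-byte AS number, as the layout implies.

```python
import struct

# Space identifiers from the list in section 5.1.2.1.
SPACE = {"VID": 0, "VXLAN": 1, "NVGRE": 2, "I-SID": 3, "EVI": 4, "dual-VID": 5}


def derive_rt(asn: int, space: str, service_id: int,
              domain_id: int = 0, manual: bool = False) -> bytes:
    """Return the global admin (AS) and local admin fields of the RT."""
    if not 0 <= asn < 2**16:
        raise ValueError("this sketch assumes a 2-byte AS number")
    if not 0 <= service_id < 2**24:
        raise ValueError("service ID (VNI, VSID, I-SID, VID) must fit in 24 bits")
    if not 0 <= domain_id < 2**4:
        raise ValueError("domain-id must fit in 4 bits")
    local_admin = (
        (int(manual) << 31)      # A bit: 0 = auto-derived, 1 = manually-derived
        | (SPACE[space] << 28)   # TYPE: 3-bit space identifier
        | (domain_id << 24)      # D-ID: 4-bit domain-id
        | service_id             # three least significant bytes
    )
    return struct.pack("!HI", asn, local_admin)


# The same numeric ID in the VXLAN and NVGRE spaces yields distinct RTs.
print(derive_rt(65000, "VXLAN", 10010).hex())               # fde81000271a
print(derive_rt(65000, "NVGRE", 10010).hex())               # fde82000271a
# Overlapping VXLAN spaces are separated by the domain-id.
print(derive_rt(65000, "VXLAN", 10010, domain_id=1).hex())  # fde81100271a
```

   The three outputs differ only in the most significant byte of the
   local admin field, which is what keeps overlapping numbering spaces
   from colliding within one AS.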

5.2 MPLS over GRE

   The EVPN data-plane is modeled as an EVPN MPLS client layer sitting
   over an MPLS PSN-tunnel server layer. Some of the EVPN functions
   (split-horizon, aliasing, and backup-path) are tied to the MPLS
   client layer. If MPLS over GRE encapsulation is used, then the EVPN
   MPLS client layer can be carried over an IP PSN tunnel transparently.
   Therefore, there is no impact to the EVPN procedures and associated
   data-plane operation.

   The existing standards for MPLS over GRE encapsulation as defined by
   [RFC4023] can be used for this purpose; however, when it is used in
   conjunction with EVPN, the GRE key field SHOULD be present, and
   SHOULD be used to provide a 32-bit entropy field. The Checksum and
   Sequence Number fields are not needed and their corresponding C and S
   bits MUST be set to zero. A PE capable of supporting this
   encapsulation should advertise its EVPN routes along with the Tunnel
   Encapsulation extended community indicating MPLS over GRE
   encapsulation, as described in the previous section.

6 EVPN with Multiple Data Plane Encapsulations

   The use of the BGP Encapsulation extended community per
   [TUNNEL-ENCAP] and [RFC5512] allows each NVE in a given EVI to know
   each of the encapsulations supported by each of the other NVEs in
   that EVI; i.e., each of the NVEs in a given EVI may support multiple
   data plane encapsulations. An ingress NVE can send a frame to an
   egress NVE only if the set of encapsulations advertised by the egress
   NVE forms a non-empty intersection with the set of encapsulations
   supported by the ingress NVE, and it is at the discretion of the
   ingress NVE which encapsulation to choose from this intersection. (As
   noted in section 5.1.3, if the BGP Encapsulation extended community
   is not present, then the default MPLS encapsulation or a locally
   configured encapsulation is assumed.)
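
   The intersection rule above can be sketched as follows. This is a
   non-normative Python illustration: the ingress NVE's own preference
   order stands in for its "discretion", encapsulations are represented
   by plain strings rather than IANA tunnel type values, and the
   function name choose_encapsulation is hypothetical.

```python
from typing import Iterable, Optional, Sequence

# Assumed when the egress NVE advertises no BGP Encapsulation extended
# community (a locally configured encapsulation could be used instead).
DEFAULT_ENCAP = "MPLS"


def choose_encapsulation(ingress_preference: Sequence[str],
                         egress_advertised: Iterable[str],
                         default: str = DEFAULT_ENCAP) -> Optional[str]:
    """Pick an encapsulation from the intersection, or None if it is empty."""
    advertised = set(egress_advertised) or {default}
    for encap in ingress_preference:   # ingress NVE's own preference order
        if encap in advertised:
            return encap
    return None                        # empty intersection: cannot send


print(choose_encapsulation(["VXLAN", "NVGRE"], ["NVGRE", "MPLS"]))  # NVGRE
print(choose_encapsulation(["VXLAN"], ["MPLS"]))                    # None
print(choose_encapsulation(["MPLS", "VXLAN"], []))                  # MPLS
```

   A None result corresponds to the misconfiguration case in which two
   NVEs of the same EVI share no common encapsulation.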

   An ingress node that uses shared multicast trees for sending
   broadcast or multicast frames MAY maintain distinct trees for each
   different encapsulation type.

   It is the responsibility of the operator of a given EVI to ensure
   that all of the NVEs in that EVI support at least one common
   encapsulation. If this condition is violated, it could result in
   service disruption or failure. The use of the BGP Encapsulation
   extended community provides a method to detect when this condition is
   violated, but the actions to be taken are at the discretion of the
   operator and are outside the scope of this document.

7 Single-Homing NVEs - NVE Residing in Hypervisor

   When an NVE and its hosts/VMs are co-located in the same physical
   device, e.g., when they reside in a server, the links between them
   are virtual and they typically share fate; i.e., the subject
   hosts/VMs are typically not multi-homed or, if they are multi-homed,
   the multi-homing is a purely local matter to the server hosting the
   VM and the NVEs, and need not be "visible" to any other NVEs residing
   on other servers, and thus does not require any specific protocol
   mechanisms. The most common case of this is when the NVE resides on
   the hypervisor.

   In the sub-sections that follow, we will discuss the impact on EVPN
   procedures for the case when the NVE resides on the hypervisor and
   the VXLAN (or NVGRE) encapsulation is used.

7.1 Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE Encapsulation

   In scenarios where different groups of data centers are under
   different administrative domains, and these data centers are
   connected via one or more backbone core providers as described in
   [NVO3-FRWK], the RD must be a unique value per EVI or per NVE as
   described in [RFC7432].
   In other words, whenever there is more than one administrative domain
   for global VNIs, then a unique RD MUST be used, or whenever the VNI
   values have local significance, then a unique RD MUST be used.
   Therefore, it is recommended to use a unique RD as described in
   [RFC7432] at all times.

   When the NVEs reside on the hypervisor, the EVPN BGP routes and
   attributes associated with multi-homing are no longer required. This
   reduces the required routes and attributes to the following subset of
   four out of eight:

   - MAC/IP Advertisement Route
   - Inclusive Multicast Ethernet Tag Route
   - MAC Mobility Extended Community
   - Default Gateway Extended Community

   However, as noted in section 8.6 of [RFC7432], in order to enable a
   single-homing ingress NVE to take advantage of fast convergence,
   aliasing, and backup-path when interacting with multi-homed egress
   NVEs attached to a given Ethernet segment, the single-homing ingress
   NVE SHOULD be able to receive and process Ethernet AD per ES and
   Ethernet AD per EVI routes.

7.2 Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation

   When the NVEs reside on the hypervisors, the EVPN procedures
   associated with multi-homing are no longer required. This limits the
   procedures on the NVE to the following subset of the EVPN procedures:

   1. Local learning of MAC addresses received from the VMs per section
   10.1 of [RFC7432].

   2. Advertising locally learned MAC addresses in BGP using the MAC/IP
   Advertisement routes.

   3. Performing remote learning using BGP per Section 10.2 of
   [RFC7432].

   4. Discovering other NVEs and constructing the multicast tunnels
   using the Inclusive Multicast Ethernet Tag routes.

   5. Handling MAC address mobility events per the procedures of Section
   16 in [RFC7432].

However, as noted in section 8.6 of [RFC7432], in order to enable a
single-homing ingress NVE to take advantage of fast convergence,
aliasing, and backup-path when interacting with multi-homed egress
NVEs attached to a given Ethernet segment, a single-homing ingress
NVE SHOULD implement the ingress node processing of Ethernet AD per
ES and Ethernet AD per EVI routes as defined in sections 8.2 ("Fast
Convergence") and 8.4 ("Aliasing and Backup-Path") of [RFC7432].

8 Multi-Homing NVEs - NVE Residing in ToR Switch

In this section, we discuss the scenario where the NVEs reside in the
Top of Rack (ToR) switches AND the servers (where VMs are residing)
are multi-homed to these ToR switches. The multi-homing NVEs operate
in All-Active or Single-Active redundancy mode. If the servers are
single-homed to the ToR switches, then the scenario becomes similar
to that where the NVE resides on the hypervisor, as discussed in
section 7, as far as the required EVPN functionality is concerned.

[RFC7432] defines a set of BGP routes, attributes, and procedures to
support multi-homing. We first describe these functions and
procedures, then discuss which of these are impacted by the VXLAN
(or NVGRE) encapsulation and what modifications are required. As
will be seen later in this section, the only EVPN procedure that is
impacted by a non-MPLS overlay encapsulation (e.g., VXLAN or NVGRE),
which provides space for one ID rather than a stack of labels, is
that of split-horizon filtering for multi-homed Ethernet Segments
described in section 8.3.1.

8.1 EVPN Multi-Homing Features

In this section, we recap the multi-homing features of EVPN to
highlight the encapsulation dependencies. The section only describes
the features and functions at a high level. For more details, the
reader should refer to [RFC7432].

8.1.1 Multi-homed Ethernet Segment Auto-Discovery

EVPN NVEs (or PEs) connected to the same Ethernet Segment (e.g., the
same server via LAG) can automatically discover each other with
minimal to no configuration through the exchange of BGP routes.

8.1.2 Fast Convergence and Mass Withdraw

EVPN defines a mechanism to efficiently and quickly signal, to remote
NVEs, the need to update their forwarding tables upon the occurrence
of a failure in connectivity to an Ethernet segment (e.g., a link or
a port failure). This is done by having each NVE advertise an
Ethernet A-D Route per Ethernet segment for each locally attached
segment. Upon a failure in connectivity to the attached segment, the
NVE withdraws the corresponding Ethernet A-D route. This triggers all
NVEs that receive the withdrawal to update their next-hop adjacencies
for all MAC addresses associated with the Ethernet segment in
question. If no other NVE had advertised an Ethernet A-D route for
the same segment, then the NVE that received the withdrawal simply
invalidates the MAC entries for that segment. Otherwise, the NVE
updates the next-hop adjacency list accordingly.

8.1.3 Split-Horizon

If a server that is multi-homed to two or more NVEs (represented by
an Ethernet segment ES1) and operating in an all-active redundancy
mode sends a BUM packet (i.e., Broadcast, Unknown unicast, or
Multicast) to one of these NVEs, then it is important to ensure the
packet is not looped back to the server via another NVE connected to
this server. The filtering mechanism on the NVE to prevent such loops
and packet duplication is called "split horizon filtering".

8.1.4 Aliasing and Backup-Path

In the case where a station is multi-homed to multiple NVEs, it is
possible that only a single NVE learns a set of the MAC addresses
associated with traffic transmitted by the station.
This leads to a situation where remote NVEs receive MAC advertisement
routes, for these addresses, from a single NVE even though multiple
NVEs are connected to the multi-homed station. As a result, the
remote NVEs are not able to effectively load-balance traffic among
the NVEs connected to the multi-homed Ethernet segment. This could be
the case, e.g., when the NVEs perform data-path learning on the
access, and the load-balancing function on the station hashes traffic
from a given source MAC address to a single NVE. Another scenario
where this occurs is when the NVEs rely on control plane learning on
the access (e.g., using ARP), since ARP traffic will be hashed to a
single link in the LAG.

To alleviate this issue, EVPN introduces the concept of Aliasing.
This refers to the ability of an NVE to signal that it has
reachability to a given locally attached Ethernet segment, even when
it has learned no MAC addresses from that segment. The Ethernet A-D
route per EVI is used to that end. Remote NVEs which receive MAC
advertisement routes with non-zero ESI SHOULD consider the MAC
address as reachable via all NVEs that advertise reachability to the
relevant Segment using Ethernet A-D routes with the same ESI and with
the Single-Active flag reset.

Backup-Path is a closely related function, although it applies to the
case where the redundancy mode is Single-Active. In this case, the
NVE signals that it has reachability to a given locally attached
Ethernet Segment using the Ethernet A-D route as well. Remote NVEs
which receive the MAC advertisement routes, with non-zero ESI, SHOULD
consider the MAC address as reachable via the advertising NVE.
Furthermore, the remote NVEs SHOULD install a Backup-Path, for said
MAC, to the NVE which had advertised reachability to the relevant
Segment using an Ethernet A-D route with the same ESI and with the
Single-Active flag set.

8.1.5 DF Election

If a host is multi-homed to two or more NVEs on an Ethernet segment
operating in all-active redundancy mode, then for a given EVI only
one of these NVEs, termed the Designated Forwarder (DF), is
responsible for sending it broadcast, multicast, and, if configured
for that EVI, unknown unicast frames.

This is required in order to prevent duplicate delivery of
multi-destination frames to a multi-homed host or VM, in case of
all-active redundancy.

In NVEs where .1Q tagged frames are received from hosts, the DF
election SHOULD be performed based on host VLAN IDs (VIDs) per
section 8.5 of [RFC7432]. Furthermore, multi-homing PEs of a given
Ethernet Segment MAY perform DF election using configured IDs such as
VNI, EVI, normalized VIDs, etc., as long as the IDs are configured
consistently across the multi-homing PEs.

In GWs where VXLAN encapsulated frames are received, the DF election
is performed on VNIs. Again, it is assumed that for a given Ethernet
Segment, VNIs are unique and consistent (e.g., no duplicate VNIs
exist).

8.2 Impact on EVPN BGP Routes & Attributes

Since multi-homing is supported in this scenario, the entire set of
BGP routes and attributes defined in [RFC7432] is used. The setting
of the Ethernet Tag field in the MAC Advertisement, Ethernet AD per
EVI, and Inclusive Multicast routes follows that of section 5.1.3.
Furthermore, the setting of the VNI field in the MAC Advertisement
and Ethernet AD per EVI routes follows that of section 5.1.3.
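
As an illustration of the DF election recapped in section 8.1.5, the
following is a minimal sketch, assuming the default service-carving
procedure of section 8.5 of [RFC7432]: the candidate NVEs of a segment
are ordered by numeric IP address, and the NVE at ordinal (V mod N) is
the DF for ID V, where V may be a VID or a VNI as long as it is
configured consistently. The function name elect_df and the addresses
are hypothetical and are not defined by this document.

```python
import ipaddress


def elect_df(nve_ips, tag_id):
    """Return the IP of the DF among nve_ips for the given ID (VID or VNI).

    Assumption: default service carving, i.e. ordinal = tag_id mod N
    over the candidate list sorted by numeric IP address.
    """
    if not nve_ips:
        raise ValueError("at least one candidate NVE is required")
    # Sort numerically, not lexically, so 192.0.2.9 precedes 192.0.2.10.
    ordered = sorted(set(nve_ips), key=lambda ip: int(ipaddress.ip_address(ip)))
    return ordered[tag_id % len(ordered)]


if __name__ == "__main__":
    candidates = ["192.0.2.2", "192.0.2.1"]
    for vni in (10, 11):
        print(vni, elect_df(candidates, vni))
```

Because every NVE on the segment runs the same computation over the
same candidate list, all of them arrive at the same DF without any
further signaling.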

8.3 Impact on EVPN Procedures

Two cases need to be examined here, depending on whether the NVEs are
operating in Single-Active or in All-Active redundancy mode.

First, let's consider the case of Single-Active redundancy mode,
where the hosts are multi-homed to a set of NVEs; however, only a
single NVE is active at a given point in time for a given VNI. In
this case, aliasing is not required and split-horizon filtering may
not be required, but other functions such as multi-homed Ethernet
segment auto-discovery, fast convergence and mass withdraw, backup
path, and DF election are required.

Second, let's consider the case of All-Active redundancy mode. In
this case, out of all the EVPN multi-homing features listed in
section 8.1, the use of the VXLAN or NVGRE encapsulation impacts the
split-horizon and aliasing features, since those two rely on the MPLS
client layer. Given that this MPLS client layer is absent with these
types of encapsulations, alternative procedures and mechanisms are
needed to provide the required functions. Those are discussed in
detail next.

8.3.1 Split Horizon

In EVPN, an MPLS label is used for split-horizon filtering to support
All-Active multi-homing, where an ingress NVE adds a label
corresponding to the site of origin (aka ESI Label) when
encapsulating the packet. The egress NVE checks the ESI label when
attempting to forward a multi-destination frame out an interface, and
if the label corresponds to the same site identifier (ESI) associated
with that interface, the packet gets dropped. This prevents the
occurrence of forwarding loops.

Since the VXLAN or NVGRE encapsulation does not include this ESI
label, other means of performing the split-horizon filtering function
MUST be devised. The following approach is recommended for
split-horizon filtering when VXLAN (or NVGRE) encapsulation is used.

Every NVE tracks the IP address(es) associated with the other NVE(s)
with which it has shared multi-homed Ethernet Segments. When the NVE
receives a multi-destination frame from the overlay network, it
examines the source IP address in the tunnel header (which
corresponds to the ingress NVE) and filters out the frame on all
local interfaces connected to Ethernet Segments that are shared with
the ingress NVE. With this approach, it is required that the ingress
NVE performs replication locally to all directly attached Ethernet
Segments (regardless of the DF Election state) for all flooded
traffic ingressing from the access interfaces (i.e., from the hosts).
This approach is referred to as "Local Bias", and has the advantage
that only a single IP address needs to be used per NVE for
split-horizon filtering, as opposed to requiring an IP address per
Ethernet Segment per NVE.

In order to prevent unhealthy interactions between the split horizon
procedures defined in [RFC7432] and the local bias procedures
described in this document, a mix of MPLS over GRE encapsulations on
the one hand and VXLAN/NVGRE encapsulations on the other on a given
Ethernet Segment is prohibited.

8.3.2 Aliasing and Backup-Path

The Aliasing and the Backup-Path procedures for VXLAN/NVGRE
encapsulation are very similar to the ones for MPLS. In the case of
MPLS, the Ethernet A-D route per EVI is used for Aliasing when the
corresponding Ethernet Segment operates in All-Active multi-homing,
and the same route is used for Backup-Path when the corresponding
Ethernet Segment operates in Single-Active multi-homing. In the case
of VXLAN/NVGRE, the same route is used for the Aliasing and the
Backup-Path, with the difference that the Ethernet Tag and VNI fields
in the Ethernet A-D per EVI route are set as described in section
5.1.3.
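
The "Local Bias" split-horizon procedure of section 8.3.1 can be
sketched as follows. This is an illustrative model only, not an
implementation defined by this document: the class LocalBiasNve, its
method names, and the df_for parameter are hypothetical. It assumes
that, for frames arriving from the overlay, the usual DF filtering is
still applied to segments that are not shared with the ingress NVE.

```python
from dataclasses import dataclass, field


@dataclass
class LocalBiasNve:
    ip: str
    # Local ESI -> set of IP addresses of the other NVEs attached to it.
    segment_peers: dict = field(default_factory=dict)

    def flood_from_access(self, ingress_esi):
        """BUM frame received from a host: replicate to every other
        locally attached segment, regardless of DF state."""
        return sorted(esi for esi in self.segment_peers if esi != ingress_esi)

    def flood_from_overlay(self, tunnel_src_ip, df_for):
        """BUM frame received from the overlay: skip every segment shared
        with the ingress NVE (identified by the tunnel source IP), then
        apply DF filtering to the remaining segments (an assumption)."""
        return sorted(
            esi
            for esi, peers in self.segment_peers.items()
            if tunnel_src_ip not in peers and esi in df_for
        )


if __name__ == "__main__":
    nve2 = LocalBiasNve(
        "192.0.2.2",
        {"ES1": {"192.0.2.1"}, "ES2": {"192.0.2.3"}, "ES3": set()},
    )
    print(nve2.flood_from_access("ES1"))
    print(nve2.flood_from_overlay("192.0.2.1", df_for={"ES2"}))
```

Note that the filter keys only on the tunnel source IP address, which
is why a single IP address per NVE suffices.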

8.3.3 Unknown Unicast Traffic Designation

In EVPN, when an ingress PE uses ingress replication to flood unknown
unicast traffic to egress PEs, the ingress PE uses a different EVPN
MPLS label (from the one used for known unicast traffic) to identify
such BUM traffic. The egress PEs use this label to identify such BUM
traffic and thus apply DF filtering for All-Active multi-homed sites.
In the absence of unknown unicast traffic designation, and with
unknown unicast flooding enabled, there can be transient duplicate
traffic to All-Active multi-homed sites under the following
condition: the host MAC address is learned by the egress PE(s) and
advertised to the ingress PE; however, the MAC advertisement has not
been received or processed by the ingress PE, resulting in the host
MAC address being unknown on the ingress PE but known on the egress
PE(s). Therefore, when a packet destined to that host MAC address
arrives on the ingress PE, the ingress PE floods it via ingress
replication to all the egress PE(s), and since the address is known
to the egress PE(s), multiple copies are sent to the All-Active
multi-homed site. It should be noted that such transient packet
duplication only happens when a) the destination host is multi-homed
via All-Active redundancy mode, b) flooding of unknown unicast is
enabled in the network, c) ingress replication is used, and d)
traffic for the destination host arrives on the ingress PE before it
learns the host MAC address via BGP EVPN advertisement. In order to
prevent such occurrence of packet duplication (however low the
probability may be), the ingress PE MAY use a flag bit in the VXLAN
header to indicate BUM traffic type. Bit 6 of the flag field in the
VXLAN header is used for this purpose per section 3.1 of [VXLAN-GPE].
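
The flag-bit marking described above can be illustrated with a short
sketch. It assumes the 8-byte VXLAN header layout of [RFC7348] (8 flag
bits, 24 reserved bits, 24-bit VNI, 8 reserved bits) and assumes that
"bit 6" is counted from the most significant bit as bit 0, which gives
the mask 0x02. The constant and function names are hypothetical.

```python
import struct

VXLAN_FLAG_I = 0x08  # VNI-valid flag, per RFC 7348
VXLAN_FLAG_B = 0x02  # assumed: bit 6 with the MSB numbered as bit 0


def build_vxlan_header(vni, bum=False):
    """Pack an 8-byte VXLAN header, optionally marking the frame as BUM."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags = VXLAN_FLAG_I | (VXLAN_FLAG_B if bum else 0)
    # Word 1: flags in the top byte; word 2: VNI in the top three bytes.
    return struct.pack("!II", flags << 24, vni << 8)


def is_bum(header):
    """Return True if the assumed BUM flag bit is set in the header."""
    return bool(header[0] & VXLAN_FLAG_B)


if __name__ == "__main__":
    hdr = build_vxlan_header(5000, bum=True)
    print(hdr.hex(), is_bum(hdr))
```

An egress PE that sees the bit set can apply DF filtering to the frame
even though the destination MAC address is locally known.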

9 Support for Multicast

The EVPN Inclusive Multicast Ethernet Tag (IMET) route is used to
discover the multicast tunnels among the endpoints associated with a
given EVI (e.g., given VNI) for VLAN-based service and a given
<EVI, VLAN> for VLAN-aware bundle service. All fields of this route
are set as described in section 5.1.3. The Originating router's IP
address field is set to the NVE's IP address. This route is tagged
with the PMSI Tunnel attribute, which is used to encode the type of
multicast tunnel to be used as well as the multicast tunnel
identifier. The tunnel encapsulation is encoded by adding the BGP
Encapsulation extended community as per section 5.1.1. For example,
the PMSI Tunnel attribute may indicate the multicast tunnel is of
type PIM-SM, whereas the BGP Encapsulation extended community may
indicate the encapsulation for that tunnel is of type VXLAN. The
following tunnel types as defined in [RFC6514] can be used in the
PMSI Tunnel attribute for VXLAN/NVGRE:

+ 3 - PIM-SSM Tree
+ 4 - PIM-SM Tree
+ 5 - BIDIR-PIM Tree
+ 6 - Ingress Replication

Except for Ingress Replication, this multicast tunnel is used by the
PE originating the route for sending multicast traffic to other PEs,
and is used by PEs that receive this route for receiving the traffic
originated by hosts connected to the PE that originated the route.

In the scenario where the multicast tunnel is a tree, both the
Inclusive as well as the Aggregate Inclusive variants may be used. In
the former case, a multicast tree is dedicated to a VNI, whereas in
the latter, a multicast tree is shared among multiple VNIs. For
VNI-based service, the Aggregate Inclusive mode is accomplished by
having the NVEs advertise multiple IMET routes with different Route
Targets (one per VNI) but with the same tunnel identifier encoded in
the PMSI Tunnel attribute.
For VNI-aware bundle service, the Aggregate Inclusive mode is
accomplished by having the NVEs advertise multiple IMET routes with
different VNIs encoded in the Ethernet Tag field, but with the same
tunnel identifier encoded in the PMSI Tunnel attribute.

10 Data Center Interconnections - DCI

For DCI, the following two main scenarios are considered when
connecting data centers running evpn-overlay (as described here) over
an MPLS/IP core network:

- Scenario 1: DCI using GWs
- Scenario 2: DCI using ASBRs

The following two subsections describe the operations for each of
these scenarios.

10.1 DCI using GWs

This is the typical scenario for interconnecting data centers over a
WAN. In this scenario, EVPN routes are terminated and processed in
each GW, and MAC/IP routes are always re-advertised from DC to WAN;
from WAN to DC, however, they are not re-advertised if the unknown
MAC address (and default IP address) are utilized in NVEs. In this
scenario, each GW maintains a MAC-VRF (and/or IP-VRF) for each EVI.
The main advantage of this approach is that NVEs do not need to
maintain MAC and IP addresses from any remote data centers when the
default IP route and unknown MAC route are used - i.e., they only
need to maintain routes that are local to their own DC. When the
default IP route and unknown MAC route are used, any unknown IP and
MAC packets from NVEs are forwarded to the GWs, where all the VPN MAC
and IP routes are maintained. This approach reduces the size of the
MAC-VRF and IP-VRF significantly at NVEs. Furthermore, it results in
a faster convergence time upon a link or NVE failure in a multi-homed
network or device redundancy scenario, because the failure-related
BGP routes (such as the mass withdraw message) do not need to be
propagated all the way to the remote NVEs in the remote DCs. This
approach is described in detail in section 3.4 of [DCI-EVPN-OVERLAY].

10.2 DCI using ASBRs

This approach can be considered as the opposite of the first approach
and it favors simplification at DCI devices over NVEs, such that
larger MAC-VRF (and IP-VRF) tables need to be maintained on NVEs,
whereas DCI devices don't need to maintain any MAC (and IP)
forwarding tables. Furthermore, DCI devices do not need to terminate
and process routes related to multi-homing but rather relay these
messages for the establishment of an end-to-end LSP path. In other
words, DCI devices in this approach operate similarly to ASBRs for
inter-AS option B - section 10 of [RFC4364]. This requires locally
assigned VNIs to be used just like downstream-assigned MPLS VPN
labels, where for all practical purposes the VNIs function like
24-bit VPN labels. This approach is equally applicable to data
centers (or Carrier Ethernet networks) with MPLS encapsulation.

In inter-AS option B, when an ASBR receives an EVPN route from its DC
over iBGP and re-advertises it to other ASBRs, it re-advertises the
EVPN route by re-writing the BGP next hop to itself, thus losing the
identity of the PE that originated the advertisement. This re-write
of the BGP next hop impacts the EVPN Mass Withdraw route (Ethernet
A-D per ES) and its procedure adversely. However, it does not impact
the EVPN Aliasing mechanism/procedure because when the Aliasing
routes (Ether A-D per EVI) are advertised, the receiving PE first
resolves a MAC address for a given EVI into its corresponding
<ES, EVI> and subsequently, it resolves the <ES, EVI> into multiple
paths (and their associated next hops) via which the <ES, EVI> is
reachable.
Since Aliasing and MAC routes are both advertised on a per-EVI basis
and they use the same RD and RT (per EVI), the receiving PE can
associate them together on a per-BGP-path basis (e.g., per
originating PE) and thus perform recursive route resolution - e.g., a
MAC is reachable via an <ES, EVI> which, in turn, is reachable via a
set of BGP paths; thus the MAC is reachable via the set of BGP paths.
Since, on a per-EVI basis, the association of MAC routes and the
corresponding Aliasing route is fixed and determined by the same RD
and RT, there is no ambiguity when the BGP next hop for these routes
is re-written as these routes pass through ASBRs - i.e., the
receiving PE may receive multiple Aliasing routes for the same EVI
from a single next hop (a single ASBR), and it can still create
multiple paths toward that <ES, EVI>.

However, when the BGP next hop address corresponding to the
originating PE is re-written, the association between the Mass
Withdraw route (Ether A-D per ES) and its corresponding MAC routes
cannot be made based on their RDs and RTs because the RD for the Mass
Withdraw route is different from the one for the MAC routes.
Therefore, the functionality needed at the ASBRs and the receiving
PEs depends on whether the Mass Withdraw route is originated and
whether there is a need to handle route resolution ambiguity for this
route. The following two subsections describe the functionality
needed by the ASBRs and the receiving PEs depending on whether the
NVEs reside in hypervisors or in ToRs.

10.2.1 ASBR Functionality with Single-Homing NVEs

When NVEs reside in hypervisors as described in section 7.1, there is
no multi-homing and thus there is no need for the originating NVE to
send Ethernet A-D per ES or Ethernet A-D per EVI routes.
However, as noted in section 7, in order to enable a single-homing
ingress NVE to take advantage of fast convergence, aliasing, and
backup-path when interacting with multi-homing egress NVEs attached
to a given Ethernet segment, the single-homing NVE SHOULD be able to
receive and process Ethernet AD per ES and Ethernet AD per EVI
routes. The handling of these routes is described in the next
section.

10.2.2 ASBR Functionality with Multi-Homing NVEs

When NVEs reside in ToRs and operate in multi-homing redundancy mode,
then as described in section 8, there is a need for the originating
multi-homing NVE to send Ethernet A-D per ES route(s) (used for mass
withdraw) and Ethernet A-D per EVI routes (used for aliasing). As
described above, the re-write of the BGP next hop by ASBRs creates
ambiguities when Ethernet A-D per ES routes are received by the
remote NVE in a different ASBR because the receiving NVE cannot
associate that route with the MAC/IP routes of that Ethernet Segment
advertised by the same originating NVE. This ambiguity inhibits the
function of mass-withdraw per ES by the receiving NVE in a different
AS.

As an example, consider a scenario where a CE is multi-homed to PE1
and PE2, where these PEs are connected via ASBR1 and then ASBR2 to
the remote PE3. Furthermore, consider that PE1, but not PE2, receives
M1 from the CE. Therefore, PE1 advertises Eth A-D per ES1, Eth A-D
per EVI1, and M1, whereas PE2 only advertises Eth A-D per ES1 and Eth
A-D per EVI1. ASBR1 receives all five of these advertisements and
passes them to ASBR2 (with itself as the BGP next hop). ASBR2, in
turn, passes them to the remote PE3 with itself as the BGP next hop.
PE3 receives these five routes, all of which have the same BGP next
hop (i.e., ASBR2).
Furthermore, the two Ether A-D per ES routes received by PE3 have the
same info - i.e., the same ESI and the same BGP next hop. Although
both of these routes are maintained by the BGP process in PE3
(because they have different RDs and are thus treated as different
BGP routes), information from only one of them is used in the L2
routing table (L2 RIB).

              PE1
             /   \
           CE     ASBR1---ASBR2---PE3
             \   /
              PE2

          Figure 1: Inter-AS Option B

Now, when the AC between PE2 and the CE fails and PE2 sends an NLRI
withdrawal for its Ether A-D per ES route, and this withdrawal gets
propagated and received by PE3, the BGP process in PE3 removes the
corresponding BGP route; however, it doesn't remove the associated
info (namely ESI and BGP next hop) from the L2 routing table (L2 RIB)
because it still has the other Ether A-D per ES route (originated
from PE1) with the same info. That is why the mass-withdraw mechanism
does not work when doing DCI with inter-AS option B. However, as
described previously, the aliasing function works and so does
"mass-withdraw per EVI" (which is associated with withdrawing the
EVPN route associated with Aliasing - i.e., the Ether A-D per EVI
route).

In the above example, PE3 receives two Aliasing routes with the same
BGP next hop (ASBR2) but different RDs. One of the Aliasing routes
has the same RD as the advertised MAC route (M1). PE3 follows the
route resolution procedure specified in [RFC7432] upon receiving the
two Aliasing routes - i.e., it resolves M1 to <ES, EVI1> and
subsequently it resolves <ES, EVI1> to a BGP path list with two paths
along with the corresponding VNIs/MPLS labels (one associated with
PE1 and the other associated with PE2). It should be noted that even
though both paths are advertised by the same BGP next hop (ASBR2),
the receiving PE3 can handle them properly. Therefore, M1 is
reachable via two paths.
This creates two end-to-end LSPs, from PE3 to PE1 and from PE3 to
PE2, for M1 such that when PE3 wants to forward traffic destined to
M1, it can load-balance between the two LSPs. Although route
resolution for Aliasing routes with the same BGP next hop is not
explicitly mentioned in [RFC7432], this is the expected operation and
thus it is elaborated here.

When the AC between PE2 and the CE fails and PE2 sends NLRI
withdrawals for its Ether A-D per EVI routes, and these withdrawals
get propagated and received by PE3, PE3 removes the Aliasing route
and updates the path list - i.e., it removes the path corresponding
to PE2. Therefore, all the corresponding MAC routes for that
<ES, EVI> that point to that path list will now have the updated path
list with a single path associated with PE1. This action can be
considered as the mass-withdraw at the per-EVI level. The
mass-withdraw at the per-EVI level has a longer convergence time than
the mass-withdraw at the per-ES level; however, it is much faster
than the convergence time when the withdraw is done on a per-MAC
basis.

If a PE becomes detached from a given ES, then in addition to
withdrawing its previously advertised Ethernet AD Per ES routes, it
MUST also withdraw its previously advertised Ethernet AD Per EVI
routes for that ES. For a remote PE that is separated from the
withdrawing PE by one or more EVPN inter-AS option B ASBRs, the
withdrawal of the Ethernet AD Per ES routes is not actionable.
However, a remote PE is able to correlate a previously advertised
Ethernet AD Per EVI route with any MAC/IP Advertisement routes also
advertised by the withdrawing PE for that <ES, EVI>. Hence, when it
receives the withdrawal of an Ethernet AD Per EVI route, it SHOULD
remove the withdrawing PE as a next-hop for all MAC addresses
associated with that <ES, EVI>.
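
The recursive resolution and the per-EVI mass withdraw described above
can be sketched as follows. This is a simplified model under stated
assumptions, not an implementation defined by this document: the class
RemotePeTable and its methods are hypothetical, each Ethernet A-D per
EVI route is identified by its RD, and a path is modeled as a
(next hop, label-or-VNI) pair. Keying the path list by RD is what
keeps the two paths distinct even though both carry ASBR2 as the BGP
next hop.

```python
from collections import defaultdict


class RemotePeTable:
    """Toy L2 RIB of a remote PE (such as PE3) behind option B ASBRs."""

    def __init__(self):
        # (esi, evi) -> {rd: (next_hop, label)}; one entry per A-D per EVI route.
        self.paths = defaultdict(dict)
        # (evi, mac) -> esi, learned from MAC/IP Advertisement routes.
        self.macs = {}

    def add_ad_per_evi(self, rd, esi, evi, next_hop, label):
        self.paths[(esi, evi)][rd] = (next_hop, label)

    def withdraw_ad_per_evi(self, rd, esi, evi):
        # Per-EVI mass withdraw: one update changes the path list of
        # every MAC that resolves through this (esi, evi) pair.
        self.paths[(esi, evi)].pop(rd, None)

    def add_mac(self, evi, mac, esi):
        self.macs[(evi, mac)] = esi

    def resolve(self, evi, mac):
        """MAC -> (esi, evi) -> list of (next_hop, label) paths."""
        esi = self.macs[(evi, mac)]
        return sorted(self.paths[(esi, evi)].values())


if __name__ == "__main__":
    pe3 = RemotePeTable()
    pe3.add_ad_per_evi("192.0.2.1:1", "ES1", "EVI1", "ASBR2", 1001)  # from PE1
    pe3.add_ad_per_evi("192.0.2.2:1", "ES1", "EVI1", "ASBR2", 1002)  # from PE2
    pe3.add_mac("EVI1", "M1", "ES1")
    print(pe3.resolve("EVI1", "M1"))
    pe3.withdraw_ad_per_evi("192.0.2.2:1", "ES1", "EVI1")
    print(pe3.resolve("EVI1", "M1"))
```

The MAC entry itself is never touched by the withdrawal, which is why
this converges faster than withdrawing each MAC route individually.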

In the previous example, when the AC between PE2 and the CE fails,
PE2 will withdraw its Ethernet AD Per ES and Per EVI routes. When
PE3 receives the withdrawal of an Ethernet AD Per EVI route, it
removes PE2 as a valid next-hop for all MAC addresses associated with
the corresponding <ES, EVI>. Therefore, all the MAC next-hops for
that <ES, EVI> will now have a single next-hop, viz. the LSP to PE1.

In summary, it can be seen that the aliasing (and backup path)
functionality should work as is for inter-AS option B without
requiring any additional functionality in ASBRs or PEs. However, the
mass-withdraw functionality falls back from per-ES mode to per-EVI
mode for inter-AS option B - i.e., PEs receiving a mass-withdraw
route from the same AS take action on the Ether A-D per ES route,
whereas PEs receiving a mass-withdraw route from a different AS take
action on the Ether A-D per EVI route.

11 Acknowledgement

The authors would like to thank Aldrin Isaac, David Smith, John
Mullooly, and Thomas Nadeau for their valuable comments and feedback.
The authors would also like to thank Jakob Heitz for his contribution
on section 10.2.

12 Security Considerations

This document uses IP-based tunnel technologies to support data
plane transport. Consequently, the security considerations of those
tunnel technologies apply. This document defines support for VXLAN
and NVGRE encapsulations. The security considerations from those
documents as well as [RFC4301] apply to the data plane aspects of
this document.

As with [RFC5512], any modification of the information that is used
to form encapsulation headers, to choose a tunnel type, or to choose
a particular tunnel for a particular payload type may lead to user
data packets getting misrouted, misdelivered, and/or dropped.

More broadly, the security considerations for the transport of IP
reachability information using BGP are discussed in [RFC4271] and
[RFC4272], and are equally applicable for the extensions described
in this document.

If the integrity of the BGP session is not itself protected, then an
imposter could mount a denial-of-service attack by establishing
numerous BGP sessions and forcing an IPsec SA to be created for each
one. However, as such an imposter could wreak havoc on the entire
routing system, this particular sort of attack is probably not of
any special importance.

It should be noted that a BGP session may itself be transported over
an IPsec tunnel. Such IPsec tunnels can provide additional security
to a BGP session. The management of such IPsec tunnels is outside
the scope of this document.

13 IANA Considerations

IANA has allocated the following BGP Tunnel Encapsulation Attribute
Tunnel Types:

   8    VXLAN Encapsulation
   9    NVGRE Encapsulation
   10   MPLS Encapsulation
   11   MPLS in GRE Encapsulation
   12   VXLAN GPE Encapsulation

14 References

14.1 Normative References

[KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4271]  Y. Rekhter, Ed., T. Li, Ed., S. Hares, Ed., "A Border
           Gateway Protocol 4 (BGP-4)", January 2006.

[RFC4301]  S. Kent, K. Seo., "Security Architecture for the
           Internet Protocol.", December 2005.

[RFC5512]  Mohapatra, P. and E. Rosen, "The BGP Encapsulation
           Subsequent Address Family Identifier (SAFI) and the BGP
           Tunnel Encapsulation Attribute", RFC 5512, April 2009.

[RFC7432]  Sajassi et al., "BGP MPLS Based Ethernet VPN", RFC 7432,
           February 2014

14.2 Informative References

[RFC7209]  Sajassi et al., "Requirements for Ethernet VPN (EVPN)", RFC
           7209, May 2014

[RFC7348]  Mahalingam, M., et al, "VXLAN: A Framework for Overlaying
           Virtualized Layer 2 Networks over Layer 3 Networks", RFC
           7348, August 2014

[RFC4272]  S. Murphy, "BGP Security Vulnerabilities Analysis.",
           January 2006.

[NVGRE]    Garg, P., et al., "NVGRE: Network Virtualization using
           Generic Routing Encapsulation", RFC 7637, September, 2015

[Problem-Statement] Narten et al., "Problem Statement: Overlays for
           Network Virtualization", RFC 7364, October 2014.

[NVO3-FRWK] Lasserre et al., "Framework for DC Network
           Virtualization", RFC 7365, October 2014.

[DCI-EVPN-OVERLAY] Rabadan et al., "Interconnect Solution for EVPN
           Overlay networks", draft-ietf-bess-dci-evpn-overlay-04,
           work in progress, February 29, 2016.

[TUNNEL-ENCAP] Rosen et al., "The BGP Tunnel Encapsulation
           Attribute", draft-ietf-idr-tunnel-encaps-03, work in
           progress, May 31, 2016.

[VXLAN-GPE] Maino et al., "Generic Protocol Extension for VXLAN",
           draft-ietf-nvo3-vxlan-gpe-03, work in progress, October 25,
           2016.

[RFC4364]  Rosen, E., et al, "BGP/MPLS IP Virtual Private Networks
           (VPNs)", RFC 4364, February 2006.

[RFC4023]  T. Worster et al., "Encapsulating MPLS in IP or Generic
           Routing Encapsulation (GRE)", RFC 4023, March 2005

[RFC6514]  R. Aggarwal et al., "BGP Encodings and Procedures for
           Multicast in MPLS/BGP IP VPNs", RFC 6514, February 2012

Contributors

S. Salam
K. Patel
D. Rao
S. Thoria
D. Cai
Cisco

Y. Rekhter
A. Issac
Wen Lin
Nischal Sheth
Juniper

L. Yong
Huawei

Authors' Addresses

Ali Sajassi
Cisco
Email: sajassi@cisco.com

John Drake
Juniper Networks
Email: jdrake@juniper.net

Nabil Bitar
Nokia
Email: nabil.bitar@nokia.com

R. Shekhar
Juniper
Email: rshekhar@juniper.net

James Uttaro
AT&T
Email: uttaro@att.com

Wim Henderickx
Alcatel-Lucent
Email: wim.henderickx@nokia.com