idnits 2.17.1 draft-ietf-bess-evpn-overlay-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 13 instances of lines with control characters in the document. ** The abstract seems to contain references ([RFC7432]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (October 19, 2015) is 3111 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC2119' is mentioned on line 165, but not defined == Missing Reference: 'GRE' is mentioned on line 300, but not defined == Missing Reference: 'RFC4023' is mentioned on line 548, but not defined == Missing Reference: 'NOV3-Framework' is mentioned on line 603, but not defined == Missing Reference: 'RFC6514' is mentioned on line 858, but not defined == Missing Reference: 'DCI-EVPN-OVERLAY' is mentioned on line 912, but not defined == Unused Reference: 'KEYWORDS' is defined on line 1104, but no explicit reference was found in the text == Unused Reference: 'NOV3-FRWK' is defined on line 1143, but no explicit reference was found in the text ** Downref: Normative reference to an Informational RFC: RFC 4272 ** Obsolete normative reference: RFC 5512 (Obsoleted by RFC 9012) == Outdated reference: A later version (-08) exists of draft-sridharan-virtualization-nvgre-07 == Outdated reference: A later version (-04) exists of draft-ietf-nvo3-overlay-problem-statement-01 == Outdated reference: A later version (-09) exists of draft-ietf-nvo3-framework-01 Summary: 4 errors (**), 0 flaws (~~), 12 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 L2VPN Workgroup A. Sajassi (Editor) 3 INTERNET-DRAFT Cisco 4 Intended Status: Standards Track J. Drake (Editor) 5 Juniper 6 Nabil Bitar 7 Verizon 8 Aldrin Isaac 9 Juniper 10 James Uttaro 11 AT&T 12 W. 
Henderickx
13                                                      Alcatel-Lucent

15  Expires: April 19, 2016                           October 19, 2015

17        A Network Virtualization Overlay Solution using EVPN
18                  draft-ietf-bess-evpn-overlay-02

20  Abstract

22     This document describes how Ethernet VPN (EVPN) [RFC7432] can be used
23     as a Network Virtualization Overlay (NVO) solution and explores the
24     various tunnel encapsulation options over IP and their impact on the
25     EVPN control-plane and procedures. In particular, the following
26     encapsulation options are analyzed: VXLAN, NVGRE, and MPLS over GRE.

28  Status of this Memo

30     This Internet-Draft is submitted to IETF in full conformance with the
31     provisions of BCP 78 and BCP 79.

33     Internet-Drafts are working documents of the Internet Engineering
34     Task Force (IETF), its areas, and its working groups. Note that
35     other groups may also distribute working documents as
36     Internet-Drafts.

38     Internet-Drafts are draft documents valid for a maximum of six months
39     and may be updated, replaced, or obsoleted by other documents at any
40     time. It is inappropriate to use Internet-Drafts as reference
41     material or to cite them other than as "work in progress."

43     The list of current Internet-Drafts can be accessed at
44     http://www.ietf.org/1id-abstracts.html

46     The list of Internet-Draft Shadow Directories can be accessed at
47     http://www.ietf.org/shadow.html

49  Copyright and License Notice

51     Copyright (c) 2012 IETF Trust and the persons identified as the
52     document authors. All rights reserved.

54     This document is subject to BCP 78 and the IETF Trust's Legal
55     Provisions Relating to IETF Documents
56     (http://trustee.ietf.org/license-info) in effect on the date of
57     publication of this document. Please review these documents
58     carefully, as they describe your rights and restrictions with respect
59     to this document.
Code Components extracted from this document must 60 include Simplified BSD License text as described in Section 4.e of 61 the Trust Legal Provisions and are provided without warranty as 62 described in the Simplified BSD License. 64 Table of Contents 66 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 67 2 Specification of Requirements . . . . . . . . . . . . . . . . . 5 68 3 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 5 69 4 EVPN Features . . . . . . . . . . . . . . . . . . . . . . . . . 6 70 5 Encapsulation Options for EVPN Overlays . . . . . . . . . . . . 7 71 5.1 VXLAN/NVGRE Encapsulation . . . . . . . . . . . . . . . . . 7 72 5.1.1 Virtual Identifiers Scope . . . . . . . . . . . . . . . 8 73 5.1.1.1 Data Center Interconnect with Gateway . . . . . . . 8 74 5.1.1.2 Data Center Interconnect without Gateway . . . . . . 9 75 5.1.2 Virtual Identifiers to EVI Mapping . . . . . . . . . . . 9 76 5.1.2.1 Auto Derivation of RT . . . . . . . . . . . . . . . 10 77 5.1.3 Constructing EVPN BGP Routes . . . . . . . . . . . . . 11 78 5.2 MPLS over GRE . . . . . . . . . . . . . . . . . . . . . . . 13 79 6 EVPN with Multiple Data Plane Encapsulations . . . . . . . . . 13 80 7 NVE Residing in Hypervisor . . . . . . . . . . . . . . . . . . 14 81 7.1 Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE 82 Encapsulation . . . . . . . . . . . . . . . . . . . . . . . 14 83 7.2 Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation . . 15 84 8 NVE Residing in ToR Switch . . . . . . . . . . . . . . . . . . 15 85 8.1 EVPN Multi-Homing Features . . . . . . . . . . . . . . . . 16 86 8.1.1 Multi-homed Ethernet Segment Auto-Discovery . . . . . . 16 87 8.1.2 Fast Convergence and Mass Withdraw . . . . . . . . . . . 16 88 8.1.3 Split-Horizon . . . . . . . . . . . . . . . . . . . . . 16 89 8.1.4 Aliasing and Backup-Path . . . . . . . . . . . . . . . . 17 90 8.1.5 DF Election . . . . . . . . . . . . . . . . . . . . . . 
17 91 8.2 Impact on EVPN BGP Routes & Attributes . . . . . . . . . . . 18 92 8.3 Impact on EVPN Procedures . . . . . . . . . . . . . . . . . 18 93 8.3.1 Split Horizon . . . . . . . . . . . . . . . . . . . . . 19 94 8.3.2 Aliasing and Backup-Path . . . . . . . . . . . . . . . . 19 96 9 Support for Multicast . . . . . . . . . . . . . . . . . . . . . 19 97 10 Data Center Interconnections - DCI . . . . . . . . . . . . . . 20 98 10.1 DCI using GWs . . . . . . . . . . . . . . . . . . . . . . . 20 99 10.2 DCI using ASBRs . . . . . . . . . . . . . . . . . . . . . . 21 100 10.2.1 ASBR Functionality with NVEs in Hypervisors . . . . . . 22 101 10.2.2 ASBR Functionality with NVEs in TORs . . . . . . . . . 22 102 11 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . 24 103 12 Security Considerations . . . . . . . . . . . . . . . . . . . 24 104 13 IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25 105 14 References . . . . . . . . . . . . . . . . . . . . . . . . . . 25 106 14.1 Normative References . . . . . . . . . . . . . . . . . . . 25 107 14.2 Informative References . . . . . . . . . . . . . . . . . . 25 108 Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 109 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 26 111 1 Introduction 113 In the context of this document, a Network Virtualization Overlay 114 (NVO) is a solution to address the requirements of a multi-tenant 115 data center, especially one with virtualized hosts, e.g., Virtual 116 Machines (VMs). 
The key requirements of such a solution, as described 117 in [Problem-Statement], are: 119 - Isolation of network traffic per tenant 121 - Support for a large number of tenants (tens or hundreds of 122 thousands) 124 - Extending L2 connectivity among different VMs belonging to a given 125 tenant segment (subnet) across different PODs within a data center or 126 between different data centers 128 - Allowing a given VM to move between different physical points of 129 attachment within a given L2 segment 131 The underlay network for NVO solutions is assumed to provide IP 132 connectivity between NVO endpoints (NVEs). 134 This document describes how Ethernet VPN (EVPN) can be used as an NVO 135 solution and explores applicability of EVPN functions and procedures. 136 In particular, it describes the various tunnel encapsulation options 137 for EVPN over IP, and their impact on the EVPN control-plane and 138 procedures for two main scenarios: 140 a) when the NVE resides in the hypervisor, and 141 b) when the NVE resides in a Top of Rack (ToR) device 143 Note that the use of EVPN as an NVO solution does not necessarily 144 mandate that the BGP control-plane be running on the NVE. For such 145 scenarios, it is still possible to leverage the EVPN solution by 146 using XMPP, or alternative mechanisms, to extend the control-plane to 147 the NVE as discussed in [L3VPN-ENDSYSTEMS]. 149 The possible encapsulation options for EVPN overlays that are 150 analyzed in this document are: 152 - VXLAN and NVGRE 153 - MPLS over GRE 155 Before getting into the description of the different encapsulation 156 options for EVPN over IP, it is important to highlight the EVPN 157 solution's main features, how those features are currently supported, 158 and any impact that the encapsulation has on those features. 
160 2 Specification of Requirements 162 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 163 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 164 document are to be interpreted as described in [RFC2119]. 166 3 Terminology 168 NVO: Network Virtualization Overlay 170 NVE: Network Virtualization Endpoint 172 VNI: Virtual Network Identifier (for VXLAN) 174 VSID: Virtual Subnet Identifier (for NVGRE) 176 EVPN: Ethernet VPN 178 EVI: An EVPN instance spanning the Provider Edge (PE) devices 179 participating in that EVPN. 181 MAC-VRF: A Virtual Routing and Forwarding table for Media Access 182 Control (MAC) addresses on a PE. 184 Ethernet Segment (ES): When a customer site (device or network) is 185 connected to one or more PEs via a set of Ethernet links, then that 186 set of links is referred to as an 'Ethernet segment'. 188 Ethernet Segment Identifier (ESI): A unique non-zero identifier that 189 identifies an Ethernet segment is called an 'Ethernet Segment 190 Identifier'. 192 Ethernet Tag: An Ethernet tag identifies a particular broadcast 193 domain, e.g., a VLAN. An EVPN instance consists of one or more 194 broadcast domains. 196 PE: Provider Edge device. 198 Single-Active Redundancy Mode: When only a single PE, among all the 199 PEs attached to an Ethernet segment, is allowed to forward traffic 200 to/from that Ethernet segment for a given VLAN, then the Ethernet 201 segment is defined to be operating in Single-Active redundancy mode. 203 All-Active Redundancy Mode: When all PEs attached to an Ethernet 204 segment are allowed to forward known unicast traffic to/from that 205 Ethernet segment for a given VLAN, then the Ethernet segment is 206 defined to be operating in All-Active redundancy mode. 208 4 EVPN Features 210 EVPN was originally designed to support the requirements detailed in 211 [RFC7209] and therefore has the following attributes which directly 212 address control plane scaling and ease of deployment issues. 
214     1) Control plane traffic is distributed with BGP and Broadcast and
215     Multicast traffic is sent using a shared multicast tree or with
216     ingress replication.

218     2) Control plane learning is used for MAC (and IP) addresses instead
219     of data plane learning. The latter requires the flooding of unknown
220     unicast and ARP frames; whereas, the former does not require any
221     flooding.

223     3) Route Reflector is used to reduce a full mesh of BGP sessions
224     among PE devices to a single BGP session between a PE and the RR.
225     Furthermore, RR hierarchy can be leveraged to scale the number of BGP
226     routes on the RR.

228     4) Auto-discovery via BGP is used to discover PE devices
229     participating in a given VPN, PE devices participating in a given
230     redundancy group, tunnel encapsulation types, multicast tunnel type,
231     multicast members, etc.

233     5) All-Active multihoming is used. This allows a given customer
234     device (CE) to have multiple links to multiple PEs, and traffic
235     to/from that CE fully utilizes all of these links. This set of links
236     is termed an Ethernet Segment (ES).

238     6) When a link between a CE and a PE fails, the PEs for that EVI are
239     notified of the failure via the withdrawal of a single EVPN route.
240     This allows those PEs to remove the withdrawing PE as a next hop for
241     every MAC address associated with the failed link. This is termed
242     'mass withdrawal'.

244     7) BGP route filtering and constrained route distribution are
245     leveraged to ensure that the control plane traffic for a given EVI is
246     only distributed to the PEs in that EVI.

248     8) When an 802.1Q interface is used between a CE and a PE, each
249     VLAN ID (VID) on that interface can be mapped onto a bridge table
250     (for up to 4094 such bridge tables). All these bridge tables may be
251     mapped onto a single MAC-VRF (in case of VLAN-aware bundle service).
253 9) VM Mobility mechanisms ensure that all PEs in a given EVI know 254 the ES with which a given VM, as identified by its MAC and IP 255 addresses, is currently associated. 257 10) Route Targets are used to allow the operator (or customer) to 258 define a spectrum of logical network topologies including mesh, hub & 259 spoke, and extranets (e.g., a VPN whose sites are owned by different 260 enterprises), without the need for proprietary software or the aid of 261 other virtual or physical devices. 263 11) Because the design goal for NVO is millions of instances per 264 common physical infrastructure, the scaling properties of the control 265 plane for NVO are extremely important. EVPN and the extensions 266 described herein, are designed with this level of scalability in 267 mind. 269 5 Encapsulation Options for EVPN Overlays 271 5.1 VXLAN/NVGRE Encapsulation 273 Both VXLAN and NVGRE are examples of technologies that provide a data 274 plane encapsulation which is used to transport a packet over the 275 common physical IP infrastructure between VXLAN Tunnel End Points 276 (VTEPs) in VXLAN network and Network Virtualization Endpoints (NVEs) 277 in NVGRE network. Both of these technologies include the identifier 278 of the specific NVO instance, Virtual Network Identifier (VNI) in 279 VXLAN and Virtual Subnet Identifier (VSID) in NVGRE, in each packet. 281 Note that a Provider Edge (PE) is equivalent to a VTEP/NVE. 283 VXLAN encapsulation is based on UDP, with an 8-byte header following 284 the UDP header. VXLAN provides a 24-bit VNI, which typically provides 285 a one-to-one mapping to the tenant VLAN ID, as described in 286 [RFC7348]. In this scenario, the ingress VTEP does not include an 287 inner VLAN tag on the encapsulated frame, and the egress VTEP 288 discards the frames with an inner VLAN tag. This mode of operation in 289 [RFC7348] maps to VLAN Based Service in [RFC7432], where a tenant 290 VLAN ID gets mapped to an EVPN instance (EVI). 
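As an illustrative aside (not part of the draft), the 8-byte VXLAN header mentioned above can be sketched in Python. The layout assumed here is the one given in [RFC7348]: an 8-bit flags field with the I bit (0x08) set, 24 reserved bits, the 24-bit VNI, and 8 reserved bits. The function and constant names are hypothetical and chosen only for this sketch.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag of RFC 7348; the name is ours


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header that follows the UDP header."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) followed by 24 reserved bits.
    # Word 1: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI, rejecting headers whose I flag is clear."""
    word0, word1 = struct.unpack("!II", header[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set; VNI is not valid")
    return word1 >> 8
```

In the VLAN Based Service mapping described above, the VNI recovered this way is what selects the EVI on the egress VTEP; no inner VLAN tag is consulted.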
292     VXLAN also provides an option of including an inner VLAN tag in the
293     encapsulated frame, if explicitly configured at the VTEP. This mode
294     of operation can map to VLAN Bundle Service in [RFC7432] because all
295     the tenant's tagged frames map to a single bridge table / MAC-VRF,
296     and the inner VLAN tag is not used for lookup by the disposition PE
297     when performing VXLAN decapsulation as described in section 6 of
298     [RFC7348].

300     [NVGRE] encapsulation is based on [GRE] and it mandates the inclusion
301     of the optional GRE Key field which carries the VSID. There is a one-
302     to-one mapping between the VSID and the tenant VLAN ID, as described
303     in [NVGRE] and the inclusion of an inner VLAN tag is prohibited. This
304     mode of operation in [NVGRE] maps to VLAN Based Service in
305     [RFC7432].

307     As described in the next section there is no change to the encoding
308     of EVPN routes to support VXLAN or NVGRE encapsulation except for the
309     use of BGP Encapsulation extended community. However, there is
310     potential impact to the EVPN procedures depending on where the NVE is
311     located (i.e., in hypervisor or TOR) and whether multi-homing
312     capabilities are required.

314  5.1.1 Virtual Identifiers Scope

316     Although VNI or VSID are defined as 24-bit globally unique values,
317     there are scenarios in which it is desirable to use a locally
318     significant value for VNI or VSID, especially in the context of data
319     center interconnect:

321  5.1.1.1 Data Center Interconnect with Gateway

323     In the case where NVEs in different data centers need to be
324     interconnected, and the NVEs need to use VNIs or VSIDs as globally
325     unique identifiers within a data center, then a Gateway needs to be
326     employed at the edge of the data center network. This is because the
327     Gateway will provide the functionality of translating the VNI or VSID
328     when crossing network boundaries, which may align with operator span
329     of control boundaries.
As an example, consider the network of Figure
330     1 below. Assume there are three network operators: one for each of
331     the DC1, DC2 and WAN networks. The Gateways at the edge of the data
332     centers are responsible for translating the VNIs / VSIDs between the
333     values used in each of the data center networks and the values used
334     in the WAN.

336                               +--------------+
337                               |              |
338             +---------+       |     WAN      |       +---------+
339     +----+  |        +---+  +----+        +----+  +---+        |  +----+
340     |NVE1|--|        |   |  |WAN |        |WAN |  |   |        |--|NVE3|
341     +----+  |IP      |GW |--|Edge|        |Edge|--|GW |     IP |  +----+
342     +----+  |Fabric  +---+  +----+        +----+  +---+ Fabric |  +----+
343     |NVE2|--|         |       |              |       |         |--|NVE4|
344     +----+  +---------+       +--------------+       +---------+  +----+

346     |<------ DC 1 ------>                           <------ DC2 ------>|

348               Figure 1: Data Center Interconnect with Gateway

350  5.1.1.2 Data Center Interconnect without Gateway

352     In the case where NVEs in different data centers need to be
353     interconnected, and the NVEs need to use locally assigned VNIs or
354     VSIDs (e.g., as MPLS labels), then there may be no need to employ
355     Gateways at the edge of the data center network. More specifically,
356     the VNI or VSID value that is used by the transmitting NVE is
357     allocated by the NVE that is receiving the traffic (in other words,
358     this is a "downstream assigned" MPLS label). This allows the VNI or
359     VSID space to be decoupled between different data center networks
360     without the need for a dedicated Gateway at the edge of the data
361     centers.
363                             +--------------+
364                             |              |
365             +---------+     |     WAN      |    +---------+
366     +----+  |         |   +----+        +----+  |         |  +----+
367     |NVE1|--|         |   |WAN |        |WAN |  |         |--|NVE3|
368     +----+  |IP Fabric|---|Edge|        |Edge|--|IP Fabric|  +----+
369     +----+  |         |   +----+        +----+  |         |  +----+
370     |NVE2|--|         |     |              |    |         |--|NVE4|
371     +----+  +---------+     +--------------+    +---------+  +----+

373     |<------ DC 1 ----->                         <---- DC2 ------>|

375             Figure 2: Data Center Interconnect without Gateway

377  5.1.2 Virtual Identifiers to EVI Mapping

379     When the EVPN control plane is used in conjunction with VXLAN or
380     NVGRE, two options for mapping the VXLAN VNI or NVGRE VSID to an EVI
381     are possible:

383     1. Option 1: Single Subnet per EVI

385     In this option, a single subnet represented by a VNI or VSID is
386     mapped to a unique EVI. This corresponds to the VLAN Based service in
387     [RFC7432], where a tenant VLAN ID gets mapped to an EVPN instance
388     (EVI). As such, a BGP RD and RT are needed per VNI / VSID on every
389     VTEP. The advantage of this model is that it allows the BGP RT
390     constraint mechanisms to be used in order to limit the propagation
391     and import of routes to only the VTEPs that are interested in a given
392     VNI or VSID. The disadvantage of this model may be the provisioning
393     overhead if RD and RT are not derived automatically from VNI or
394     VSID.

396     In this option, the MAC-VRF table is identified by the RT in the
397     control plane and by the VNI or VSID in the data-plane. In this
398     option, the specific MAC-VRF table corresponds to only a single
399     bridge table.

401     2. Option 2: Multiple Subnets per EVI

403     In this option, multiple subnets each represented by a unique VNI or
404     VSID are mapped to a single EVI. For example, if a tenant has
405     multiple segments/subnets each represented by a VNI or VSID, then all
406     the VNIs (or VSIDs) for that tenant are mapped to a single EVI -
407     i.e., the EVI in this case represents the tenant and not a subnet.
408     This corresponds to the VLAN-Aware Bundle service in [RFC7432]. The
409     advantage of this model is that it doesn't require the provisioning
410     of RD/RT per VNI or VSID. However, this is a moot point if option 1
411     with auto-derivation is used. The disadvantage of this model is that
412     routes would be imported by VTEPs that may not be interested in a
413     given VNI or VSID.

415     In this option the MAC-VRF table is identified by the RT in the
416     control plane and a specific bridge table for that MAC-VRF is
417     identified by the <RT, Ethernet Tag ID> in the control plane. In this
418     option, the VNI/VSID in the data-plane is sufficient to identify a
419     specific bridge table - i.e., there is no need to do a lookup based
420     on VNI/VSID and Ethernet Tag ID fields to identify a bridge table.

422  5.1.2.1 Auto Derivation of RT

424     When the option of a single VNI or VSID per EVI is used, it is
425     important to auto-derive RT for EVPN BGP routes in order to simplify
426     configuration for data center operations. RD can be derived easily as
427     described in [RFC7432] and RT can be auto-derived as described next.

429     Since a gateway PE as depicted in Figure 1 participates in both the
430     DCN and WAN BGP sessions, it is important that when RT values are
431     auto-derived for VNIs (or VSIDs), there is no conflict in RT spaces
432     between DCN and WAN networks assuming that both are operating within
433     the same AS. Also, there can be scenarios where both VXLAN and NVGRE
434     encapsulations may be needed within the same DCN and their
435     corresponding VNIs and VSIDs are administered independently, which
436     means VNI and VSID spaces can overlap.
In order to ensure that no
437     such conflict in RT spaces arises, RT values for DCNs are auto-
438     derived as follows:

440      0                   1                   2                   3                   4
441      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
442     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
443     |             AS #              |A| TYPE| D-ID  |              Service Instance ID              |
444     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

446     - The 2-byte global admin field of the RT is set to the AS number.

448     - The three least significant bytes of the local admin field of the
449     RT are set to the VNI or VSID, I-SID, or VID. The most significant
450     bit of the local admin field of the RT is set as follows:
451          0: auto-derived
452          1: manually-derived

454     - The next 3 bits of the most significant byte of the local admin
455     field of the RT identify the space in which the other 3 bytes are
456     defined. The following spaces are defined:
457          0 : VID
458          1 : VXLAN
459          2 : NVGRE
460          3 : I-SID
461          4 : EVI
462          5 : dual-VID

464     - The remaining 4 bits of the most significant byte of the local
465     admin field of the RT identify the domain-id. The default value of
466     domain-id is zero, indicating that only a single numbering space
467     exists for a given technology. However, if more than one numbering
468     space exists for a given technology (e.g., overlapping VXLAN spaces),
469     then each numbering space needs to be identified by its
470     corresponding domain-id, starting from 1.

472  5.1.3 Constructing EVPN BGP Routes
473     In EVPN, an MPLS label is distributed by the egress PE via the EVPN
474     control plane and is placed in the MPLS header of a given packet by
475     the ingress PE. This label is used upon receipt of that packet by the
476     egress PE for disposition of that packet. This is very similar to the
477     use of the VNI or VSID by the egress VTEP or NVE, respectively, with
478     the difference being that an MPLS label has local significance while
479     a VNI or VSID typically has global significance.
Accordingly, and
480     specifically to support the option of locally assigned VNIs, the MPLS
481     label field in the MAC Advertisement, Ethernet AD per EVI, and
482     Inclusive Multicast Ethernet Tag routes is used to carry the VNI or
483     VSID. For the balance of this memo, the MPLS label field will be
484     referred to as the VNI/VSID field. The VNI/VSID field is used for
485     both local and global VNIs/VSIDs, and for either case the entire 24-
486     bit field is used to encode the VNI/VSID value.

488     For the VLAN-based service (a single VNI per MAC-VRF), the Ethernet
489     Tag field in the MAC/IP Advertisement, Ethernet AD per EVI, and
490     Inclusive Multicast route MUST be set to zero just as in the VLAN
491     Based service in [RFC7432].

493     For the VLAN-aware bundle service (multiple VNIs per MAC-VRF with
494     each VNI associated with its own bridge table), the Ethernet Tag
495     field in the MAC Advertisement, Ethernet AD per EVI, and Inclusive
496     Multicast route MUST identify a bridge table within a MAC-VRF and the
497     set of Ethernet Tags for that EVI needs to be configured consistently
498     on all PEs within that EVI. For local VNIs, the value advertised in
499     the Ethernet Tag field MUST be set to a VID just as in the VLAN-aware
500     bundle service in [RFC7432]. Such setting must be done consistently
501     on all PE devices participating in that EVI within a given domain.
502     For global VNIs, the value advertised in the Ethernet Tag field
503     SHOULD be set to a VNI as long as it matches the existing semantics
504     of the Ethernet Tag, i.e., it identifies a bridge table within a MAC-
505     VRF and the set of VNIs are configured consistently on each PE in
506     that EVI.

508     In order to indicate which type of data plane encapsulation
509     (i.e., VXLAN, NVGRE, MPLS, or MPLS in GRE) is to be used, the BGP
510     Encapsulation extended community defined in [RFC5512] is included
511     with all EVPN routes (i.e.
MAC Advertisement, Ethernet AD per EVI,
512     Ethernet AD per ESI, Inclusive Multicast Ethernet Tag, and Ethernet
513     Segment) advertised by an egress PE. Five new values have been
514     assigned by IANA to extend the list of encapsulation types defined in
515     [RFC5512]:

517     + 8  - VXLAN Encapsulation
518     + 9  - NVGRE Encapsulation
519     + 10 - MPLS Encapsulation
520     + 11 - MPLS in GRE Encapsulation
521     + 12 - VXLAN GPE Encapsulation

523     If the BGP Encapsulation extended community is not present, then the
524     default MPLS encapsulation or a statically configured encapsulation
525     is assumed.

527     The Ethernet Segment and Ethernet AD per ESI routes MAY be advertised
528     with multiple encapsulation types as long as they use the same EVPN
529     multi-homing procedures - e.g., the mix of VXLAN and NVGRE
530     encapsulation types is a valid one but not the mix of VXLAN and MPLS
531     encapsulation types.

533     The Next Hop field of the MP_REACH_NLRI attribute of the route MUST
534     be set to the IPv4 or IPv6 address of the NVE. The remaining fields
535     in each route are set as per [RFC7432].

537  5.2 MPLS over GRE

539     The EVPN data-plane is modeled as an EVPN MPLS client layer sitting
540     over an MPLS PSN tunnel. Some of the EVPN functions (split-horizon,
541     aliasing, and backup-path) are tied to the MPLS client layer. If MPLS
542     over GRE encapsulation is used, then the EVPN MPLS client layer can
543     be carried over an IP PSN tunnel transparently. Therefore, there is
544     no impact to the EVPN procedures and associated data-plane
545     operation.

547     The existing standards for MPLS over GRE encapsulation as defined by
548     [RFC4023] can be used for this purpose; however, when it is used in
549     conjunction with EVPN the key field SHOULD be present, and SHOULD be
550     used to provide a 32-bit entropy field. The Checksum and Sequence
551     Number fields are not needed and their corresponding C and S bits
552     MUST be set to zero.
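Looking back at the RT auto-derivation of Section 5.1.2.1, the bit layout given there can be sketched in Python as follows. This is only an illustration under stated assumptions: the AS number is assumed to fit in 2 bytes, the function returns only the 6-byte value portion (2-byte global admin field plus 4-byte local admin field) and not the extended community type and sub-type octets, and the names `RtSpace` and `derive_rt_value` are hypothetical.

```python
from enum import IntEnum


class RtSpace(IntEnum):
    """Numbering spaces listed in Section 5.1.2.1 (3-bit TYPE field)."""
    VID = 0
    VXLAN = 1
    NVGRE = 2
    I_SID = 3
    EVI = 4
    DUAL_VID = 5


def derive_rt_value(as_number: int, space: RtSpace, service_id: int,
                    domain_id: int = 0, manual: bool = False) -> bytes:
    """Return the 6-byte RT value: AS # | A | TYPE | D-ID | service ID."""
    if not 0 <= as_number < 2**16:
        raise ValueError("this sketch assumes a 2-byte AS number")
    if not 0 <= service_id < 2**24:
        raise ValueError("service instance ID must fit in 24 bits")
    if not 0 <= domain_id < 2**4:
        raise ValueError("domain-id must fit in 4 bits")
    local_admin = (
        (int(manual) << 31)   # A bit: 0 = auto-derived, 1 = manually-derived
        | (int(space) << 28)  # 3-bit TYPE: which numbering space
        | (domain_id << 24)   # 4-bit D-ID: 0 when only one space exists
        | service_id          # 3 least significant bytes: VNI/VSID/I-SID/VID
    )
    return as_number.to_bytes(2, "big") + local_admin.to_bytes(4, "big")
```

For instance, a VXLAN VNI and an NVGRE VSID with the same numeric value yield different RT values because the TYPE bits differ, which is the conflict-avoidance property the derivation is meant to provide.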
554 6 EVPN with Multiple Data Plane Encapsulations 556 The use of the BGP Encapsulation extended community allows each PE in 557 a given EVI to know each of the encapsulations supported by each of 558 the other PEs in that EVI. I.e., each of the PEs in a given EVI may 559 support multiple data plane encapsulations. An ingress PE can send a 560 frame to an egress PE only if the set of encapsulations advertised by 561 the egress PE in the subject MAC/IP Advertisement or per EVI Ethernet 562 AD route, forms a non-empty intersection with the set of 563 encapsulations supported by the ingress PE, and it is at the 564 discretion of the ingress PE which encapsulation to choose from this 565 intersection. (As noted in section 5.1.3, if the BGP Encapsulation 566 extended community is not present, then the default MPLS 567 encapsulation or a statically configured encapsulation is assumed.) 569 An ingress node that uses shared multicast trees for sending 570 broadcast or multicast frames MUST maintain distinct trees for each 571 different encapsulation type. 573 It is the responsibility of the operator of a given EVI to ensure 574 that all of the PEs in that EVI support at least one common 575 encapsulation. If this condition is violated, it could result in 576 service disruption or failure. The use of the BGP Encapsulation 577 extended community provides a method to detect when this condition is 578 violated but the actions to be taken are at the discretion of the 579 operator and are outside the scope of this document. 
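The selection rule of this section can be sketched as a small Python function. It is a non-normative illustration: the tunnel type numbers are the values listed in Section 5.1.3, while the idea of an ordered local preference list is an assumption of this sketch (the draft leaves the choice to the ingress PE), and the names `choose_encapsulation` and `DEFAULT_ENCAP` are hypothetical.

```python
from typing import Optional, Sequence, Set

# Tunnel type values listed in Section 5.1.3.
TUNNEL_TYPES = {
    8: "VXLAN",
    9: "NVGRE",
    10: "MPLS",
    11: "MPLS in GRE",
    12: "VXLAN GPE",
}
DEFAULT_ENCAP = 10  # MPLS, assumed when no Encapsulation community is present


def choose_encapsulation(local_preference: Sequence[int],
                         advertised: Set[int],
                         default: int = DEFAULT_ENCAP) -> Optional[int]:
    """Pick the first locally supported encapsulation the egress PE advertised.

    local_preference: encapsulations the ingress PE supports, best first.
    advertised: tunnel types carried in the egress PE's route; an empty set
        models a route without the BGP Encapsulation extended community.
    Returns None when the intersection is empty (no frame can be sent).
    """
    candidates = advertised if advertised else {default}
    for encap in local_preference:
        if encap in candidates:
            return encap
    return None
```

A `None` result corresponds to the misconfiguration case described above, where the PEs of an EVI share no common encapsulation; what to do then is left to the operator.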
581  7 NVE Residing in Hypervisor

583     When a PE and its CEs are co-located in the same physical device,
584     e.g., when the PE resides in a server and the CEs are its VMs, the
585     links between them are virtual and they typically share fate; i.e.,
586     the subject CEs are typically not multi-homed or if they are multi-
587     homed, the multi-homing is a purely local matter to the server
588     hosting the VM, and need not be "visible" to any other PEs, and thus
589     does not require any specific protocol mechanisms. The most common
590     case of this is when the NVE resides in the hypervisor.

592     In the sub-sections that follow, we will discuss the impact on EVPN
593     procedures for the case when the NVE resides on the hypervisor and
594     the VXLAN or NVGRE encapsulation is used.

596  7.1 Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE Encapsulation

598     In the scenario where all data centers are under a single
599     administrative domain, and there is a single global VNI/VSID space,
600     the RD MAY be set to zero in the EVPN routes. However, in the
601     scenario where different groups of data centers are under different
602     administrative domains, and these data centers are connected via one
603     or more backbone core providers as described in [NOV3-Framework], the
604     RD must be a unique value per EVI or per NVE as described in
605     [RFC7432]. In other words, whenever there is more than one
606     administrative domain for global VNI or VSID, then a non-zero RD MUST
607     be used, or whenever the VNI or VSID value has local significance,
608     then a non-zero RD MUST be used. It is recommended to use a non-zero
609     RD at all times.

611     When the NVEs reside on the hypervisor, the EVPN BGP routes and
612     attributes associated with multi-homing are no longer required.
This 613 reduces the required routes and attributes to the following subset of 614 four out of the set of eight: 616 - MAC Advertisement Route 617 - Inclusive Multicast Ethernet Tag Route 618 - MAC Mobility Extended Community 619 - Default Gateway Extended Community 621 However, as noted in section 8.6 of [RFC7432], in order to enable a 622 single-homing ingress PE to take advantage of fast convergence, 623 aliasing, and backup-path when interacting with multi-homed egress 624 PEs attached to a given Ethernet segment, a single-homing ingress PE 625 SHOULD be able to receive and process Ethernet AD per ES and Ethernet 626 AD per EVI routes. 628 7.2 Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation 630 When the NVEs reside on the hypervisors, the EVPN procedures 631 associated with multi-homing are no longer required. This limits the 632 procedures on the NVE to the following subset of the EVPN procedures: 634 1. Local learning of MAC addresses received from the VMs per section 635 10.1 of [RFC7432]. 637 2. Advertising locally learned MAC addresses in BGP using the MAC 638 Advertisement routes. 640 3. Performing remote learning using BGP per Section 10.2 of 641 [RFC7432]. 643 4. Discovering other NVEs and constructing the multicast tunnels 644 using the Inclusive Multicast Ethernet Tag routes. 646 5. Handling MAC address mobility events per the procedures of Section 647 16 in [RFC7432]. 649 However, as noted in section 8.6 of [RFC7432], in order to enable a 650 single-homing ingress PE to take advantage of fast convergence, 651 aliasing, and backup-path when interacting with multi-homed egress 652 PEs attached to a given Ethernet segment, a single-homing ingress PE 653 SHOULD implement the ingress node processing of Ethernet AD per ES 654 and Ethernet AD per EVI routes as defined in sections 8.2 Fast 655 Convergence and 8.4 Aliasing and Backup-Path of [RFC7432].
657 8 NVE Residing in ToR Switch 659 In this section, we discuss the scenario where the NVEs reside in the 660 Top of Rack (ToR) switches AND the servers (where VMs are residing) 661 are multi-homed to these ToR switches. The multi-homing may operate 662 in All-Active or Single-Active redundancy mode. If the servers are 663 single-homed to the ToR switches, then the scenario becomes similar 664 to that where the NVE resides in the hypervisor, as discussed in 665 Section 7, as far as the required EVPN functionality is concerned. 667 [RFC7432] defines a set of BGP routes, attributes and procedures to 668 support multi-homing. We first describe these functions and 669 procedures, then discuss which of these are impacted by the 670 encapsulation (such as VXLAN or NVGRE) and what modifications are 671 required. 673 8.1 EVPN Multi-Homing Features 675 In this section, we will recap the multi-homing features of EVPN to 676 highlight the encapsulation dependencies. The section only describes 677 the features and functions at a high level. For more details, the 678 reader should refer to [RFC7432]. 680 8.1.1 Multi-homed Ethernet Segment Auto-Discovery 682 EVPN NVEs (or PEs) connected to the same Ethernet Segment (e.g., the 683 same server via LAG) can automatically discover each other with 684 minimal to no configuration through the exchange of BGP routes. 686 8.1.2 Fast Convergence and Mass Withdraw 688 EVPN defines a mechanism to efficiently and quickly signal, to remote 689 NVEs, the need to update their forwarding tables upon the occurrence 690 of a failure in connectivity to an Ethernet segment (e.g., a link or 691 a port failure). This is done by having each NVE advertise an 692 Ethernet A-D Route per Ethernet segment for each locally attached 693 segment. Upon a failure in connectivity to the attached segment, the 694 NVE withdraws the corresponding Ethernet A-D route.
This triggers all 695 NVEs that receive the withdrawal to update their next-hop adjacencies 696 for all MAC addresses associated with the Ethernet segment in 697 question. If no other NVE had advertised an Ethernet A-D route for 698 the same segment, then the NVE that received the withdrawal simply 699 invalidates the MAC entries for that segment. Otherwise, the NVE 700 updates the next-hop adjacencies to point to the backup NVE(s). 702 8.1.3 Split-Horizon 704 If a server that is multi-homed to two or more NVEs on an Ethernet segment 705 ES1 operating in all-active redundancy mode sends a multicast, 706 broadcast or unknown unicast packet to one of these NVEs, then it 707 is important to ensure the packet is not looped back to the server 708 via another NVE connected to this server. The filtering mechanism on 709 the NVE to prevent such loops and packet duplication is called "split 710 horizon filtering". 712 8.1.4 Aliasing and Backup-Path 714 In the case where a station is multi-homed to multiple NVEs, it is 715 possible that only a single NVE learns a set of the MAC addresses 716 associated with traffic transmitted by the station. This leads to a 717 situation where remote NVEs receive MAC advertisement routes, for 718 these addresses, from a single NVE even though multiple NVEs are 719 connected to the multi-homed station. As a result, the remote NVEs 720 are not able to effectively load-balance traffic among the NVEs 721 connected to the multi-homed Ethernet segment. This could be the 722 case, e.g., when the NVEs perform data-path learning on the 723 access, and the load-balancing function on the station hashes traffic 724 from a given source MAC address to a single NVE. Another scenario 725 where this occurs is when the NVEs rely on control plane learning on 726 the access (e.g., using ARP), since ARP traffic will be hashed to a 727 single link in the LAG. 729 To alleviate this issue, EVPN introduces the concept of Aliasing.
730 This refers to the ability of an NVE to signal that it has 731 reachability to a given locally attached Ethernet segment, even when 732 it has learnt no MAC addresses from that segment. The Ethernet A-D 733 route per EVI is used to that end. Remote NVEs which receive MAC 734 advertisement routes with non-zero ESI SHOULD consider the MAC 735 address as reachable via all NVEs that advertise reachability to the 736 relevant Segment using Ethernet A-D routes with the same ESI and with 737 the Single-Active flag reset. 739 Backup-Path is a closely related function, albeit it applies to the 740 case where the redundancy mode is Single-Active. In this case, the 741 NVE signals that it has reachability to a given locally attached 742 Ethernet Segment using the Ethernet A-D route as well. Remote NVEs 743 which receive the MAC advertisement routes, with non-zero ESI, SHOULD 744 consider the MAC address as reachable via the advertising NVE. 745 Furthermore, the remote NVEs SHOULD install a Backup-Path, for said 746 MAC, to the NVE which had advertised reachability to the relevant 747 Segment using an Ethernet A-D route with the same ESI and with the 748 Single-Active flag set. 750 8.1.5 DF Election 752 If a CE is multi-homed to two or more NVEs on an Ethernet segment 753 operating in all-active redundancy mode, then for a given EVI only 754 one of these NVEs, termed the Designated Forwarder (DF) is 755 responsible for sending it broadcast, multicast, and, if configured 756 for that EVI, unknown unicast frames. 758 This is required in order to prevent duplicate delivery of multi- 759 destination frames to a multi-homed host or VM, in case of all-active 760 redundancy. 762 In NVEs where .1Q tagged frames are received from hosts, the DF 763 election is performed on host VLAN IDs (VIDs). It is assumed that for 764 a given Ethernet Segment, VIDs are unique and consistent (e.g., no 765 duplicate VIDs exist). 
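As an illustration of how a DF can be chosen per VID, the sketch below follows the default procedure of [RFC7432] (section 8.5): the NVEs attached to the segment are ordered by IP address, and the NVE whose position equals the tag modulo the number of NVEs is the DF. This text does not itself restate that algorithm, so treat the sketch as background; the function name and the plain-string address format are assumptions.

```python
import ipaddress
from typing import Iterable


def elect_df(nve_addresses: Iterable[str], tag: int) -> str:
    """Return the IP address of the DF for a given tag on one segment.

    nve_addresses: IP addresses of all NVEs attached to the Ethernet
    segment (including the local one), as discovered via BGP.
    tag: the value the election is keyed on (a host VID here).
    """
    # Order the NVEs by increasing numeric IP address value.
    ordered = sorted(set(nve_addresses),
                     key=lambda a: int(ipaddress.ip_address(a)))
    if not ordered:
        raise ValueError("no NVEs attached to the segment")
    # The NVE with ordinal (tag mod N) is the DF for this tag.
    return ordered[tag % len(ordered)]
```

Because every NVE on the segment runs the same computation over the same ordered list, they agree on the DF without further signaling, provided the tag values are unique and consistent on the segment as assumed above.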
767 In GWs where VXLAN encapsulated frames are received, the DF election 768 is performed on VNIs. Again, it is assumed that for a given Ethernet 769 Segment, VNIs are unique and consistent (e.g., no duplicate VNIs 770 exist). 772 8.2 Impact on EVPN BGP Routes & Attributes 774 Since multi-homing is supported in this scenario, the entire set 775 of BGP routes and attributes defined in [RFC7432] is used. As 776 discussed in Section 3.1.3, the VSID or VNI is carried in the 777 VNI/VSID field in the MAC Advertisement, Ethernet AD per EVI, and 778 Inclusive Multicast Ethernet Tag routes. 780 8.3 Impact on EVPN Procedures 782 Two cases need to be examined here, depending on whether the NVEs are 783 operating in Active/Standby or in All-Active redundancy. 785 First, let's consider the case of Active/Standby redundancy, where the 786 hosts are multi-homed to a set of NVEs, but only a single NVE is 787 active at a given point in time for a given VNI or VSID. In this 788 case, the split-horizon and the aliasing functions are not required 789 but other functions such as multi-homed Ethernet segment auto- 790 discovery, fast convergence and mass withdraw, backup path, and DF 791 election are required. 793 Second, let's consider the case of All-Active redundancy. In this 794 case, out of the EVPN multi-homing features listed in section 8.1, 795 the use of the VXLAN or NVGRE encapsulation impacts the split-horizon 796 and aliasing features, since those two rely on the MPLS client layer. 797 Given that this MPLS client layer is absent with these types of 798 encapsulations, alternative procedures and mechanisms are needed to 799 provide the required functions. Those are discussed in detail next. 801 8.3.1 Split Horizon 803 In EVPN, an MPLS label is used for split-horizon filtering to support 804 active/active multi-homing where an ingress NVE adds a label 805 corresponding to the site of origin (aka ESI Label) when 806 encapsulating the packet.
The egress NVE checks the ESI label when 807 attempting to forward a multi-destination frame out an interface, and 808 if the label corresponds to the same site identifier (ESI) associated 809 with that interface, the packet gets dropped. This prevents the 810 occurrence of forwarding loops. 812 Since the VXLAN or NVGRE encapsulation does not include this ESI 813 label, other means of performing the split-horizon filtering function 814 MUST be devised. The following approach is recommended for split- 815 horizon filtering when VXLAN or NVGRE encapsulation is used. 817 Every NVE tracks the IP address(es) associated with the other NVE(s) 818 with which it has shared multi-homed Ethernet Segments. When the NVE 819 receives a multi-destination frame from the overlay network, it 820 examines the source IP address in the tunnel header (which 821 corresponds to the ingress NVE) and filters out the frame on all 822 local interfaces connected to Ethernet Segments that are shared with 823 the ingress NVE. With this approach, it is required that the ingress 824 NVE performs replication locally to all directly attached Ethernet 825 Segments (regardless of the DF Election state) for all flooded 826 traffic ingressing from the access interfaces (i.e., from the hosts). 827 This approach is referred to as "Local Bias", and has the advantage 828 that only a single IP address needs to be used per NVE for split- 829 horizon filtering, as opposed to requiring an IP address per Ethernet 830 Segment per NVE. 832 In order to prevent unhealthy interactions between the split horizon 833 procedures defined in [RFC7432] and the local bias procedures 834 described in this document, a mix of MPLS over GRE encapsulations on 835 the one hand and VXLAN/NVGRE encapsulations on the other on a given 836 Ethernet Segment is prohibited. 838 8.3.2 Aliasing and Backup-Path 840 The Aliasing and the Backup-Path procedures for VXLAN/NVGRE 841 encapsulation are very similar to the ones for MPLS.
In the case of MPLS, 842 two different Ethernet AD routes are used for this purpose. The one 843 used for Aliasing has a VPN scope and carries a VPN label but the one 844 used for Backup-Path has Ethernet segment scope and doesn't carry any 845 VPN specific info (e.g., Ethernet Tag and MPLS label are set to 846 zero). 848 9 Support for Multicast 849 The EVPN Inclusive Multicast BGP route is used to discover the 850 multicast tunnels among the endpoints associated with a given VXLAN 851 VNI or NVGRE VSID. The Ethernet Tag field of this route is used to 852 encode the VNI for VXLAN or VSID for NVGRE. The Originating router's 853 IP address field is set to the NVE's IP address. This route is tagged 854 with the PMSI Tunnel attribute, which is used to encode the type of 855 multicast tunnel to be used as well as the multicast tunnel 856 identifier. The tunnel encapsulation is encoded by adding the BGP 857 Encapsulation extended community as per section 3.1.1. The following 858 tunnel types as defined in [RFC6514] can be used in the PMSI tunnel 859 attribute for VXLAN/NVGRE: 861 + 3 - PIM-SSM Tree 862 + 4 - PIM-SM Tree 863 + 5 - BIDIR-PIM Tree 864 + 6 - Ingress Replication 866 Except for Ingress Replication, this multicast tunnel is used by the 867 PE originating the route for sending multicast traffic to other PEs, 868 and is used by PEs that receive this route for receiving the traffic 869 originated by CEs connected to the PE that originated the route. 871 In the scenario where the multicast tunnel is a tree, both the 872 Inclusive as well as the Aggregate Inclusive variants may be used. In 873 the former case, a multicast tree is dedicated to a VNI or VSID. 874 In the latter case, a multicast tree is shared among multiple 875 VNIs or VSIDs. This is done by having the NVEs advertise multiple 876 Inclusive Multicast routes with different VNI or VSID encoded in the 877 Ethernet Tag field, but with the same tunnel identifier encoded in 878 the PMSI Tunnel attribute.
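The difference between the Inclusive and Aggregate Inclusive variants can be sketched as a grouping of received Inclusive Multicast routes by tunnel identifier. The record layout, field names, and function names below are hypothetical and do not reflect the BGP encoding; only the tunnel type code points listed above are taken from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

# PMSI tunnel types usable with VXLAN/NVGRE, per the list above.
PMSI_TUNNEL_TYPES = {
    3: "PIM-SSM Tree",
    4: "PIM-SM Tree",
    5: "BIDIR-PIM Tree",
    6: "Ingress Replication",
}
INGRESS_REPLICATION = 6


@dataclass(frozen=True)
class InclusiveMulticastRoute:
    ethernet_tag: int   # carries the VNI or VSID
    originator: str     # NVE IP address
    tunnel_type: int    # PMSI tunnel type
    tunnel_id: str      # PMSI tunnel identifier (simplified to a string)


def trees_by_tunnel(
    routes: Iterable[InclusiveMulticastRoute],
) -> Dict[Tuple[int, str], Set[int]]:
    """Map each multicast tree to the set of VNIs/VSIDs carried on it."""
    trees: Dict[Tuple[int, str], Set[int]] = defaultdict(set)
    for route in routes:
        if route.tunnel_type not in PMSI_TUNNEL_TYPES:
            raise ValueError(f"unsupported tunnel type {route.tunnel_type}")
        if route.tunnel_type == INGRESS_REPLICATION:
            continue  # no shared tree is built for ingress replication
        trees[(route.tunnel_type, route.tunnel_id)].add(route.ethernet_tag)
    return dict(trees)


def is_aggregate(tags: Set[int]) -> bool:
    """A tree shared by more than one VNI/VSID is Aggregate Inclusive."""
    return len(tags) > 1
```

A tree that maps to a single tag is a plain Inclusive tree; one that maps to several tags is shared in the Aggregate Inclusive sense.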
880 10 Data Center Interconnections - DCI 882 For DCI, the following two main scenarios are considered when 883 connecting data centers running evpn-overlay (as described here) over 884 an MPLS/IP core network: 886 - Scenario 1: DCI using GWs 887 - Scenario 2: DCI using ASBRs 889 The following two subsections describe the operations for each of 890 these scenarios. 892 10.1 DCI using GWs 894 This is the typical scenario for interconnecting data centers over 895 a WAN. In this scenario, EVPN routes are terminated and processed in 896 each GW. MAC/IP routes are always re-advertised from DC to WAN, but 897 from WAN to DC they are not re-advertised if unknown MAC address 898 (and default IP address) are utilized in NVEs. In this scenario, each 899 GW maintains a MAC-VRF (and/or IP-VRF) for each EVI. The main 900 advantage of this approach is that NVEs do not need to maintain MAC 901 and IP addresses from any remote data centers when default IP route 902 and unknown MAC routes are used - i.e., they only need to maintain 903 routes that are local to their own DC. When default IP route and 904 unknown MAC route are used, any unknown IP and MAC packets from NVEs 905 are forwarded to the GWs where all the VPN MAC and IP routes are 906 maintained. This approach reduces the size of MAC-VRF and IP-VRF 907 significantly at NVEs. Furthermore, it results in a faster 908 convergence time upon a link or NVE failure in a multi-homed network 909 or device redundancy scenario, because the failure related BGP routes 910 (such as mass withdraw message) do not need to get propagated all the 911 way to the remote NVEs in the remote DCs. This approach is described 912 in detail in section 3.4 of [DCI-EVPN-OVERLAY].
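The effect of the unknown MAC route on an NVE's forwarding table can be sketched as follows. The class and attribute names are hypothetical, and the sketch covers only the MAC lookup; the default IP route for an IP-VRF would behave analogously.

```python
from typing import Dict, Optional


class NveMacVrf:
    """Toy MAC-VRF on an NVE that holds only routes local to its DC."""

    def __init__(self, unknown_mac_next_hop: Optional[str] = None):
        # MAC address -> next hop (another NVE in the same DC)
        self.routes: Dict[str, str] = {}
        # Next hop of the unknown MAC route, i.e., the GW, if one is used
        self.unknown_mac_next_hop = unknown_mac_next_hop

    def learn(self, mac: str, next_hop: str) -> None:
        self.routes[mac.lower()] = next_hop

    def lookup(self, mac: str) -> Optional[str]:
        """Return the next hop for a MAC, falling back to the GW.

        A MAC that is not known locally (for instance one that lives
        in a remote DC) is sent to the GW, which holds the full set
        of MAC routes for the EVI.
        """
        return self.routes.get(mac.lower(), self.unknown_mac_next_hop)
```

With this arrangement the NVE table grows with the number of MAC addresses in its own DC only, which is the size reduction described above.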
914 10.2 DCI using ASBRs 916 This approach can be considered as the opposite of the first approach 917 and it favors simplification at DCI devices over NVEs such that 918 larger MAC-VRF (and IP-VRF) tables need to be maintained on NVEs, 919 whereas DCI devices don't need to maintain any MAC (and IP) 920 forwarding tables. Furthermore, DCI devices do not need to terminate 921 and process routes related to multi-homing but rather relay 922 these messages for the establishment of an end-to-end LSP path. In 923 other words, DCI devices in this approach operate similarly to ASBRs 924 for inter-AS option B. This requires locally assigned VNIs to be 925 used just like downstream assigned MPLS VPN labels where, for all 926 practical purposes, the VNIs function like 24-bit VPN labels. This 927 approach is equally applicable to data centers (or access networks) 928 with MPLS encapsulation. 930 In inter-AS option B, when an ASBR receives an EVPN route from its DC 931 over iBGP and re-advertises it to other ASBRs, it re-advertises the 932 EVPN route by re-writing the BGP next hop to itself, thus losing the 933 identity of the PE that originated the advertisement. This re-write 934 of the BGP next hop impacts the EVPN Mass Withdraw route (Ethernet A-D 935 per ES) and its procedure adversely. In EVPN, the route used for 936 aliasing (Ethernet A-D per EVI route) has the same RD as the MAC/IP 937 routes associated with that EVI. Therefore, the receiving PE can 938 associate the received MAC/IP routes with their corresponding aliasing 939 route using their RDs even if their next hop is re-written to the same 940 ASBR router's address. However, in EVPN, the mass-withdraw route uses 941 a different RD than that of its associated MAC/IP routes. Thus, the 942 way to associate them together is via their next-hop router's 943 address.
Now, when the BGP next-hop address representing the originating 944 PE gets re-written by the re-advertising ASBR, it creates an ambiguity 945 in the receiving PE that cannot be resolved. Therefore, the 946 functionality needed at the ASBRs depends on whether the EVPN 947 Ethernet A-D routes (per ES and/or per EVI) are originated and 948 whether there is a need to handle route resolution ambiguity for 949 the Ethernet A-D per ES route. 951 The following two subsections describe the functionality needed by 952 the ASBRs depending on whether the NVEs reside in hypervisors or in 953 TORs. 955 10.2.1 ASBR Functionality with NVEs in Hypervisors 957 When NVEs reside in hypervisors as described in section 7.1, there is 958 no multi-homing and thus there is no need for the originating NVE to 959 send Ethernet A-D per ES or Ethernet A-D per EVI routes. Furthermore, 960 the processing of these routes by the receiving NVE in the hypervisor 961 is optional per [RFC7432] and as described in section 7. Therefore, 962 the ambiguity issue discussed above doesn't exist for this scenario 963 and the functionality of ASBRs is that of existing L2VPN (or L3VPN) 964 where the ASBRs assist in setting up end-to-end LSPs among the NVEs' 965 MAC-VRFs. As noted previously, for all practical purposes, the 24-bit 966 locally assigned VNIs used in this scenario function as 24-bit 967 labels in setting up the end-to-end LSPs. 969 10.2.2 ASBR Functionality with NVEs in TORs 971 When NVEs reside in TORs and operate in multi-homing redundancy mode, 972 then as described in section 8, there is a need for the originating 973 NVE to send Ethernet A-D per ES route(s) (used for mass withdraw) and 974 Ethernet A-D per EVI routes (used for aliasing).
As described above, 975 the re-write of the BGP next hop by ASBRs creates ambiguities when 976 Ethernet A-D per ES routes are received by the remote PE behind a 977 different ASBR because the receiving PE cannot associate that route 978 with the MAC/IP routes from the same Ethernet Segment advertised by 979 the same originating PE. This ambiguity inhibits the function of 980 mass-withdraw per ES by the receiving PE behind a different ASBR. 982 As an example, consider a scenario where a CE is multi-homed to PE1 and 983 PE2, where these PEs are connected via ASBR1 and then ASBR2 to the 984 remote PE3. Furthermore, consider that PE1, but not PE2, receives M1 from 985 the CE. Therefore, PE1 advertises Ethernet A-D per ES1, Ethernet A-D per EVI1, 986 and M1, whereas PE2 only advertises Ethernet A-D per ES1 and Ethernet A-D per 987 EVI1. ASBR1 receives all five of these advertisements and passes them to 988 ASBR2 (with itself as the BGP next hop). ASBR2, in turn, passes them 989 to the remote PE3 with itself as the BGP next hop. PE3 receives these 990 five routes, all of which have the same BGP next hop (i.e., 991 ASBR2). Furthermore, the two Ethernet A-D per ES routes received by PE3 992 have the same info - i.e., the same ESI and the same BGP next hop. 993 Although both of these routes are maintained by the BGP process in 994 PE3, information from only one of them is used in the L2 routing 995 table (L2 RIB).
997         PE1
998        /   \
999      CE     ASBR1---ASBR2---PE3
1000       \   /
1001        PE2
1003 Figure 1: Inter-AS Option B
1005 Now, when the AC between PE2 and the CE fails, PE2 sends an NLRI 1006 withdrawal for its Ethernet A-D per ES route. When this withdrawal is 1007 propagated to and received by PE3, the BGP process in PE3 removes 1008 the corresponding BGP route; however, it doesn't remove the 1009 associated info (namely ESI and BGP next hop) from the L2 routing 1010 table (L2 RIB) because it still has the other Ethernet A-D per ES route 1011 (originated from PE1) with the same info.
That is why the mass- 1012 withdraw mechanism does not work when doing DCI with inter-AS option 1013 B. However, as described next, the Aliasing function works and so 1014 does mass-withdraw per EVI (which is associated with withdrawing the 1015 EVPN route associated with Aliasing - i.e., the Ethernet A-D per EVI route). 1017 In the above example, PE3 receives two Aliasing routes with the 1018 same BGP next hop (ASBR2) but different RDs. One of the Aliasing routes 1019 has the same RD as the advertised MAC route (M1). PE3 follows the 1020 route resolution procedure specified in [RFC7432] upon receiving the 1021 two Aliasing routes. PE3 should also resolve the alias path properly: 1022 even though both the primary and backup paths have the same BGP next 1023 hop, they have different RDs, and the Aliasing route whose RD differs 1024 from that of the MAC route is considered the backup path. 1025 Therefore, PE3 installs both primary and backup paths (and their 1026 associated ESI/EVI MPLS labels or local VNIs) for the MAC route M1. 1027 This creates two end-to-end LSPs from PE3 to PE1 for M1 such that 1028 when PE3 wants to forward traffic destined to M1, it can load- 1029 balance between the two paths. Although route resolution for 1030 Aliasing routes with the same BGP next hop is not described at this 1031 level of detail in [RFC7432], it is expected to operate as such and 1032 thus it is clarified here. 1034 When the AC between PE2 and the CE fails, PE2 sends NLRI 1035 withdrawals for its Ethernet A-D per EVI routes. When these withdrawals are 1036 propagated to and received by PE3, PE3 removes the Aliasing 1037 route and updates all the corresponding MAC routes for that EVI to 1038 remove the backup path. This action causes the mass-withdraw 1039 functionality to operate at the per-EVI level (instead of per-ES).
1040 The mass-withdraw at per-EVI level requires more messages than that 1041 of per-ES level and thus its convergence time is not as good as per 1042 ES level. However, its convergence time is much better than 1043 individual MAC withdraw. 1045 In summary, it can be seen that aliasing and backup path 1046 functionality should work as is for inter-AS option B. Furthermore, 1047 in case of inter-AS option B, mass-withdraw functionality falls back 1048 from per-ES to per-EVI. If per-ES mass-withdraw functionality is 1049 needed along with backward compatibility, then it is recommended to 1050 use GWs (per section 10.1) instead of ASBRs for DCI. 1052 11 Acknowledgement 1054 The authors would like to thank David Smith, John Mullooly, Thomas 1055 Nadeau for their valuable comments and feedback. The authors would 1056 also like to thank Jakob Heitz for his contribution on section 10. 1058 12 Security Considerations 1060 This document uses IP-based tunnel technologies to support data 1061 plane transport. Consequently, the security considerations of those 1062 tunnel technologies apply. This document defines support for VXLAN 1063 and NVGRE encapsulations. The security considerations from those 1064 documents as well as [RFC4301] apply to the data plane aspects of 1065 this document. 1067 As with [RFC5512], any modification of the information that is used 1068 to form encapsulation headers, to choose a tunnel type, or to choose 1069 a particular tunnel for a particular payload type may lead to user 1070 data packets getting misrouted, misdelivered, and/or dropped. 1072 More broadly, the security considerations for the transport of IP 1073 reachability information using BGP are discussed in [RFC4271] and 1074 [RFC4272], and are equally applicable for the extensions described 1075 in this document. 
1077 If the integrity of the BGP session is not itself protected, then an 1078 imposter could mount a denial-of-service attack by establishing 1079 numerous BGP sessions and forcing an IPsec SA to be created for each 1080 one. However, as such an imposter could wreak havoc on the entire 1081 routing system, this particular sort of attack is probably not of 1082 any special importance. 1084 It should be noted that a BGP session may itself be transported over 1085 an IPsec tunnel. Such IPsec tunnels can provide additional security 1086 to a BGP session. The management of such IPsec tunnels is outside 1087 the scope of this document. 1089 13 IANA Considerations 1091 IANA has allocated the following BGP Tunnel Encapsulation Attribute 1092 Tunnel Types: 1094 8 VXLAN Encapsulation 1095 9 NVGRE Encapsulation 1096 10 MPLS Encapsulation 1097 11 MPLS in GRE Encapsulation 1098 12 VXLAN GPE Encapsulation 1100 14 References 1102 14.1 Normative References 1104 [KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate 1105 Requirement Levels", BCP 14, RFC 2119, March 1997. 1107 [RFC4271] Y. Rekhter, Ed., T. Li, Ed., S. Hares, Ed., "A Border 1108 Gateway Protocol 4 (BGP-4)", January 2006. 1110 [RFC4272] S. Murphy, "BGP Security Vulnerabilities Analysis.", 1111 January 2006. 1113 [RFC4301] S. Kent, K. Seo., "Security Architecture for the 1114 Internet Protocol.", December 2005. 1116 [RFC5512] Mohapatra, P. and E. Rosen, "The BGP Encapsulation 1117 Subsequent Address Family Identifier (SAFI) and the BGP 1118 Tunnel Encapsulation Attribute", RFC 5512, April 2009. 
1120 [RFC7432] Sajassi et al., "BGP MPLS Based Ethernet VPN", RFC 7432, 1121 February 2014 1123 14.2 Informative References 1125 [RFC7209] Sajassi et al., "Requirements for Ethernet VPN (EVPN)", RFC 1126 7209, May 2014 1128 [RFC7348] Mahalingam, M., et al, "VXLAN: A Framework for Overlaying 1129 Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 1130 2014 1132 [NVGRE] Garg, P., et al., "NVGRE: Network Virtualization using 1133 Generic Routing Encapsulation", draft-sridharan-virtualization-nvgre- 1134 07.txt, November 11, 2014 1136 [Problem-Statement] Narten et al., "Problem Statement: Overlays for 1137 Network Virtualization", draft-ietf-nvo3-overlay-problem-statement- 1138 01, September 2012. 1140 [L3VPN-ENDSYSTEMS] Marques et al., "BGP-signaled End-system IP/VPNs", 1141 draft-ietf-l3vpn-end-system, work in progress, October 2012. 1143 [NOV3-FRWK] Lasserre et al., "Framework for DC Network 1144 Virtualization", draft-ietf-nvo3-framework-01.txt, work in progress, 1145 October 2012. 1147 Contributors 1149 S. Salam K. Patel D. Rao S. Thoria D. Cai Cisco 1151 Y. Rekhter R. Shekhar Wen Lin Nischal Sheth Juniper 1153 L. Yong Huawei 1155 Authors' Addresses 1157 Ali Sajassi 1158 Cisco 1159 Email: sajassi@cisco.com 1161 John Drake 1162 Juniper Networks 1163 Email: jdrake@juniper.net 1165 Nabil Bitar 1166 Verizon Communications 1167 Email : nabil.n.bitar@verizon.com 1169 Aldrin Isaac 1170 Juniper 1171 Email: aisaac@juniper.net 1172 James Uttaro 1173 AT&T 1174 Email: uttaro@att.com 1176 Wim Henderickx 1177 Alcatel-Lucent 1178 e-mail: wim.henderickx@alcatel-lucent.com